postgresql_transactions_acid.md

postgresql_transactions_acid.md (19008B)
      1 # PostgreSQL Transactions, Stored Routines, and ACID
      2 
      3 PostgreSQL is built around **transactions**: bounded units of work that either
      4 commit as a whole or leave no durable effect (after rollback). Understanding
      5 how those transactions interact with **multi-version concurrency control
      6 (MVCC)**, **locking**, and the **write-ahead log (WAL)** explains both everyday
      7 behaviour (why two writers sometimes wait or fail) and what **ACID** means in
      8 practice.
      9 
     10 ---
     11 
     12 ## Transactions in PostgreSQL
     13 
     14 A transaction begins explicitly with `BEGIN` (or `START TRANSACTION`) and ends
     15 with `COMMIT` or `ROLLBACK`. If you do neither, a single SQL statement still runs
     16 inside an **implicit** transaction: PostgreSQL wraps each standalone statement
     17 in its own mini-transaction that commits on success.
     18 
     19 Inside one transaction, all statements see a coherent database state according to
     20 the chosen **isolation level** (see below). Until `COMMIT`, other sessions may or
     21 may not see your changes, depending on isolation and whether they started before
     22 or after your writes.
     23 
     24 **Savepoints** (`SAVEPOINT`, `ROLLBACK TO SAVEPOINT`) let you partially undo
     25 work without aborting the whole outer transaction—useful for complex workflows
     26 with selective retry logic.
     27 
     28 ---
     29 
     30 ## “Stored procedures” vs functions in PostgreSQL
     31 
     32 PostgreSQL’s primary server-side programming mechanism is the **`FUNCTION`**
     33 (PL/pgSQL, SQL, or other languages). A function **always runs in the caller’s
     34 transaction**; it cannot `COMMIT` or `ROLLBACK` the surrounding transaction from
     35 inside the function body.
     36 
     37 Since **PostgreSQL 11**, **`PROCEDURE`** exists for routines that **may** issue
     38 transaction control: a procedure can end its own transaction with `COMMIT` or
     39 `ROLLBACK` and continue (multi-transaction batching inside one call). That is
     40 the feature people often mean by “stored procedure” when comparing to other
     41 databases. For most application code, plain SQL inside a client-managed
     42 transaction plus optional **functions** is still the common pattern.
     43 
     44 Whether logic lives in the app or in the server, the same **MVCC and locking
     45 rules** apply: routines do not bypass concurrency controls.
     46 
     47 ---
     48 
     49 ## How PostgreSQL handles many writers (conflicts)
     50 
     51 PostgreSQL uses **MVCC** instead of readers blocking writers with a single
     52 global lock on rows. Each row version carries metadata about which transactions
     53 created and invalidated it (`xmin` / `xmax` system columns). A query reads **row
     54 versions visible to its snapshot**—typically versions committed before the
     55 snapshot’s cutoff—so normal **SELECT** does not take row-level locks that block
     56 concurrent **UPDATE** on other rows.
     57 
     58 When **two transactions try to change the same row**, MVCC does not allow both
     59 to succeed blindly:
     60 
     61 1. **Row-level locks** — `UPDATE`, `DELETE`, and `SELECT … FOR UPDATE` acquire
     62    locks on the target rows. A second transaction that needs the same row may
     63    **block** until the first commits or rolls back, then either proceeds or
     64    detects the row was changed (depending on operation and isolation).
     65 
     66 2. **Serialization failures** — Under **`REPEATABLE READ`** or **`SERIALIZABLE`**,
     67    PostgreSQL can abort a transaction with **`SQLSTATE 40001`**
     68    (`serialization_failure`) when the guarantee of that isolation level would be
     69    violated—e.g. a concurrent write pattern that could produce anomalies if both
     70    committed. The application is expected to **retry** the whole transaction.
     71 
     72 3. **Deadlocks** — When transactions wait on each other in a cycle, PostgreSQL
     73    **detects** the deadlock and **aborts one transaction** (`deadlock_detected`)
     74    so the others can proceed.
     75 
     76 So “multiple users writing the same data at the same time” is handled by a mix
     77 of **waiting**, **abort with retry**, and **ordering**—not by silently losing
     78 updates. Which outcome you get depends on SQL patterns (locking reads vs blind
     79 updates) and isolation level.
     80 
     81 The subsections below quote **actual C code** from the upstream PostgreSQL tree
     82 so you can see how these behaviours are implemented. Excerpts are taken from
     83 commit [`43fc1dc7527`](https://github.com/postgres/postgres/tree/43fc1dc7527c4213c4d48a96bf98738df0acb71a)
     84 (PostgreSQL is licensed under [the PostgreSQL Licence](https://www.postgresql.org/about/licence/)).
     85 
     86 ### Row-level blocking: `heap_lock_tuple`
     87 
     88 When a tuple is already being modified or deleted, `heap_lock_tuple` inspects
     89 visibility with **`HeapTupleSatisfiesUpdate`**, unlocks the buffer briefly, and
     90 may **sleep** until competing lockers finish—first acquiring a **heavyweight
     91 tuple lock** (`LockTuple` via `heap_acquire_tuplock`), then waiting on the
     92 transaction or MultiXact holding `xmax`:
     93 
     94 ```c
     95 l3:
     96 	result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer);
     97 
     98 	if (result == TM_Invisible)
     99 	{
    100 		/*
    101 		 * This is possible, but only when locking a tuple for ON CONFLICT DO
    102 		 * SELECT/UPDATE.  We return this value here rather than throwing an
    103 		 * error in order to give that case the opportunity to throw a more
    104 		 * specific error.
    105 		 */
    106 		result = TM_Invisible;
    107 		goto out_locked;
    108 	}
    109 	else if (result == TM_BeingModified ||
    110 			 result == TM_Updated ||
    111 			 result == TM_Deleted)
    112 	{
    113 		TransactionId xwait;
    114 		uint16		infomask;
    115 		uint16		infomask2;
    116 		bool		require_sleep;
    117 		ItemPointerData t_ctid;
    118 
    119 		/* must copy state data before unlocking buffer */
    120 		xwait = HeapTupleHeaderGetRawXmax(tuple->t_data);
    121 		infomask = tuple->t_data->t_infomask;
    122 		infomask2 = tuple->t_data->t_infomask2;
    123 		ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid);
    124 
    125 		LockBuffer(*buffer, BUFFER_LOCK_UNLOCK);
    126 ```
    127 
    128 (The branch continues with multixact checks, tuple-lock acquisition, and sleep
    129 logic; only the opening is shown here.)
    130 
    131 The fragment above is the start of the branch taken when the tuple is already
    132 being modified, updated, or deleted; the remainder of `heap_lock_tuple` decides
    133 whether sleeping is required and how to wait.
    134 
    135 Later, when `require_sleep` is true, the same function acquires the tuple lock
    136 and waits on the rival transaction id:
    137 
    138 ```c
    139 		if (!skip_tuple_lock &&
    140 			!heap_acquire_tuplock(relation, tid, mode, wait_policy,
    141 								  &have_tuple_lock))
    142 		{
    143 			/*
    144 			 * This can only happen if wait_policy is Skip and the lock
    145 			 * couldn't be obtained.
    146 			 */
    147 			result = TM_WouldBlock;
    148 			/* recovery code expects to have buffer lock held */
    149 			LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
    150 			goto failed;
    151 		}
    152 ```
    153 
    154 ```c
    155 				switch (wait_policy)
    156 				{
    157 					case LockWaitBlock:
    158 						XactLockTableWait(xwait, relation, &tuple->t_self,
    159 										  XLTW_Lock);
    160 						break;
    161 					case LockWaitSkip:
    162 						if (!ConditionalXactLockTableWait(xwait, false))
    163 						{
    164 							result = TM_WouldBlock;
    165 							/* recovery code expects to have buffer lock held */
    166 							LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE);
    167 							goto failed;
    168 						}
    169 						break;
    170 					case LockWaitError:
    171 						if (!ConditionalXactLockTableWait(xwait, log_lock_failures))
    172 							ereport(ERROR,
    173 									(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
    174 									 errmsg("could not obtain lock on row in relation \"%s\"",
    175 											RelationGetRelationName(relation))));
    176 						break;
    177 				}
    178 ```
    179 
    180 (Source: [`src/backend/access/heap/heapam.c`](https://github.com/postgres/postgres/blob/43fc1dc7527c4213c4d48a96bf98738df0acb71a/src/backend/access/heap/heapam.c),
    181 function `heap_lock_tuple`; this `switch` sits in the branch that waits on a
    182 single transaction id after copying `xwait` from `xmax`.)
    183 
    184 The **heavyweight** tuple lock is wired to `LockTuple` through a small macro
    185 layer that maps `LockTupleMode` to the lock manager’s `LOCKMODE`:
    186 
    187 ```c
    188 #define LockTupleTuplock(rel, tup, mode) \
    189 	LockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
    190 #define UnlockTupleTuplock(rel, tup, mode) \
    191 	UnlockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock)
    192 #define ConditionalLockTupleTuplock(rel, tup, mode, log) \
    193 	ConditionalLockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock, (log))
    194 ```
    195 
    196 `heap_acquire_tuplock` is the helper that invokes it when blocking is allowed:
    197 
    198 ```c
    199 static bool
    200 heap_acquire_tuplock(Relation relation, const ItemPointerData *tid, LockTupleMode mode,
    201 					 LockWaitPolicy wait_policy, bool *have_tuple_lock)
    202 {
    203 	if (*have_tuple_lock)
    204 		return true;
    205 
    206 	switch (wait_policy)
    207 	{
    208 		case LockWaitBlock:
    209 			LockTupleTuplock(relation, tid, mode);
    210 			break;
    211 
    212 		case LockWaitSkip:
    213 			if (!ConditionalLockTupleTuplock(relation, tid, mode, false))
    214 				return false;
    215 			break;
    216 
    217 		case LockWaitError:
    218 			if (!ConditionalLockTupleTuplock(relation, tid, mode, log_lock_failures))
    219 				ereport(ERROR,
    220 						(errcode(ERRCODE_LOCK_NOT_AVAILABLE),
    221 						 errmsg("could not obtain lock on row in relation \"%s\"",
    222 								RelationGetRelationName(relation))));
    223 			break;
    224 	}
    225 	*have_tuple_lock = true;
    226 
    227 	return true;
    228 }
    229 ```
    230 
    231 (Source: same file, macros near the top of `heapam.c` and `heap_acquire_tuplock`
    232 below `heap_lock_tuple`.)
    233 
    234 ### Deadlock detection: `CheckDeadLock` and `DeadLockCheck`
    235 
    236 While a backend sleeps on a lock, a **deadlock timeout** can fire; the waiter
    237 then locks all lock-hash partitions and runs **`DeadLockCheck`**. If the graph
    238 contains an unresolvable cycle (`DS_HARD_DEADLOCK`), the waiter is removed from
    239 the queue so **`ProcSleep`** surfaces an error and the transaction aborts. The
    240 full handler is **`CheckDeadLock`** in `proc.c`:
    241 
    242 ```c
    243 /*
    244  * CheckDeadLock
    245  *
    246  * We only get to this routine, if DEADLOCK_TIMEOUT fired while waiting for a
    247  * lock to be released by some other process.  Check if there's a deadlock; if
    248  * not, just return.  If we have a real deadlock, remove ourselves from the
    249  * lock's wait queue.
    250  */
    251 static DeadLockState
    252 CheckDeadLock(void)
    253 {
    254 	int			i;
    255 	DeadLockState result;
    256 
    257 	/*
    258 	 * Acquire exclusive lock on the entire shared lock data structures. Must
    259 	 * grab LWLocks in partition-number order to avoid LWLock deadlock.
    260 	 *
    261 	 * Note that the deadlock check interrupt had better not be enabled
    262 	 * anywhere that this process itself holds lock partition locks, else this
    263 	 * will wait forever.  Also note that LWLockAcquire creates a critical
    264 	 * section, so that this routine cannot be interrupted by cancel/die
    265 	 * interrupts.
    266 	 */
    267 	for (i = 0; i < NUM_LOCK_PARTITIONS; i++)
    268 		LWLockAcquire(LockHashPartitionLockByIndex(i), LW_EXCLUSIVE);
    269 
    270 	/*
    271 	 * Check to see if we've been awoken by anyone in the interim.
    272 	 *
    273 	 * If we have, we can return and resume our transaction -- happy day.
    274 	 * Before we are awoken the process releasing the lock grants it to us so
    275 	 * we know that we don't have to wait anymore.
    276 	 *
    277 	 * We check by looking to see if we've been unlinked from the wait queue.
    278 	 * This is safe because we hold the lock partition lock.
    279 	 */
    280 	if (dlist_node_is_detached(&MyProc->waitLink))
    281 	{
    282 		result = DS_NO_DEADLOCK;
    283 		goto check_done;
    284 	}
    285 
    286 #ifdef LOCK_DEBUG
    287 	if (Debug_deadlocks)
    288 		DumpAllLocks();
    289 #endif
    290 
    291 	/* Run the deadlock check */
    292 	result = DeadLockCheck(MyProc);
    293 
    294 	if (result == DS_HARD_DEADLOCK)
    295 	{
    296 		/*
    297 		 * Oops.  We have a deadlock.
    298 		 *
    299 		 * Get this process out of wait state. (Note: we could do this more
    300 		 * efficiently by relying on lockAwaited, but use this coding to
    301 		 * preserve the flexibility to kill some other transaction than the
    302 		 * one detecting the deadlock.)
    303 		 *
    304 		 * RemoveFromWaitQueue sets MyProc->waitStatus to
    305 		 * PROC_WAIT_STATUS_ERROR, so ProcSleep will report an error after we
    306 		 * return.
    307 		 */
    308 		Assert(MyProc->waitLock != NULL);
    309 		RemoveFromWaitQueue(MyProc, LockTagHashCode(&(MyProc->waitLock->tag)));
    310 
    311 		/*
    312 		 * We're done here.  Transaction abort caused by the error that
    313 		 * ProcSleep will raise will cause any other locks we hold to be
    314 		 * released, thus allowing other processes to wake up; we don't need
    315 		 * to do that here.  NOTE: an exception is that releasing locks we
    316 		 * hold doesn't consider the possibility of waiters that were blocked
    317 		 * behind us on the lock we just failed to get, and might now be
    318 		 * wakable because we're not in front of them anymore.  However,
    319 		 * RemoveFromWaitQueue took care of waking up any such processes.
    320 		 */
    321 	}
    322 
    323 	/*
    324 	 * And release locks.  We do this in reverse order for two reasons: (1)
    325 	 * Anyone else who needs more than one of the locks will be trying to lock
    326 	 * them in increasing order; we don't want to release the other process
    327 	 * until it can get all the locks it needs. (2) This avoids O(N^2)
    328 	 * behavior inside LWLockRelease.
    329 	 */
    330 check_done:
    331 	for (i = NUM_LOCK_PARTITIONS; --i >= 0;)
    332 		LWLockRelease(LockHashPartitionLockByIndex(i));
    333 
    334 	return result;
    335 }
    336 ```
    337 
    338 `DeadLockCheck` in `deadlock.c` implements the search for cycles and possible
    339 wait-queue reordering; its header comment states the `DS_HARD_DEADLOCK`
    340 contract, and the body begins by resetting workspace used by the detector:
    341 
    342 ```c
    343 /*
    344  * DeadLockCheck -- Checks for deadlocks for a given process
    345  *
    346  * This code looks for deadlocks involving the given process.  If any
    347  * are found, it tries to rearrange lock wait queues to resolve the
    348  * deadlock.  If resolution is impossible, return DS_HARD_DEADLOCK ---
    349  * the caller is then expected to abort the given proc's transaction.
    350  *
    351  * Caller must already have locked all partitions of the lock tables.
    352  *
    353  * On failure, deadlock details are recorded in deadlockDetails[] for
    354  * subsequent printing by DeadLockReport().  That activity is separate
    355  * because we don't want to do it while holding all those LWLocks.
    356  */
    357 DeadLockState
    358 DeadLockCheck(PGPROC *proc)
    359 {
    360 	/* Initialize to "no constraints" */
    361 	nCurConstraints = 0;
    362 	nPossibleConstraints = 0;
    363 	nWaitOrders = 0;
    364 
    365 	/* Initialize to not blocked by an autovacuum worker */
    366 	blocking_autovacuum_proc = NULL;
    367 
    368 	/* Search for deadlocks and possible fixes */
    369 	if (DeadLockCheckRecurse(proc))
    370 ```
    371 
    372 (Sources: [`src/backend/storage/lmgr/proc.c`](https://github.com/postgres/postgres/blob/43fc1dc7527c4213c4d48a96bf98738df0acb71a/src/backend/storage/lmgr/proc.c)
    373 and [`src/backend/storage/lmgr/deadlock.c`](https://github.com/postgres/postgres/blob/43fc1dc7527c4213c4d48a96bf98738df0acb71a/src/backend/storage/lmgr/deadlock.c). The `EDGE` / waits-for graph structs at the top of `deadlock.c` are also worth reading alongside [`src/backend/storage/lmgr/README`](https://github.com/postgres/postgres/blob/43fc1dc7527c4213c4d48a96bf98738df0acb71a/src/backend/storage/lmgr/README).)
    374 
    375 ### Serializable isolation: `PreCommit_CheckForSerializationFailure`
    376 
    377 Under **SSI**, commit-time validation walks **read/write conflict** lists; if a
    378 **dangerous structure** appears (pivot still uncommitted), PostgreSQL raises
    379 **`ERRCODE_T_R_SERIALIZATION_FAILURE`** (`SQLSTATE 40001`), matching the
    380 user-visible `serialization_failure`:
    381 
    382 ```c
    383 void
    384 PreCommit_CheckForSerializationFailure(void)
    385 {
    386 	dlist_iter	near_iter;
    387 
    388 	if (MySerializableXact == InvalidSerializableXact)
    389 		return;
    390 
    391 	Assert(IsolationIsSerializable());
    392 
    393 	LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE);
    394 
    395 	/*
    396 	 * Check if someone else has already decided that we need to die.  Since
    397 	 * we set our own DOOMED flag when partially releasing, ignore in that
    398 	 * case.
    399 	 */
    400 	if (SxactIsDoomed(MySerializableXact) &&
    401 		!SxactIsPartiallyReleased(MySerializableXact))
    402 	{
    403 		LWLockRelease(SerializableXactHashLock);
    404 		ereport(ERROR,
    405 				(errcode(ERRCODE_T_R_SERIALIZATION_FAILURE),
    406 				 errmsg("could not serialize access due to read/write dependencies among transactions"),
    407 				 errdetail_internal("Reason code: Canceled on identification as a pivot, during commit attempt."),
    408 				 errhint("The transaction might succeed if retried.")));
    409 	}
    410 ```
    411 
    412 (Source: [`src/backend/storage/lmgr/predicate.c`](https://github.com/postgres/postgres/blob/43fc1dc7527c4213c4d48a96bf98738df0acb71a/src/backend/storage/lmgr/predicate.c).)
    413 
    414 ---
    415 
    416 ## Isolation levels (what “consistent view” means)
    417 
    418 Default isolation is **`READ COMMITTED`**: each statement sees rows committed
    419 before that **statement** began. That prevents dirty reads but allows **non-repeatable
    420 reads** and **phantoms** across statements in the same transaction.
    421 
    422 **`REPEATABLE READ`** uses a **snapshot** taken at the **first query** in the
    423 transaction; subsequent reads see the same world. Concurrent commits that would
    424 change what that snapshot means for writes can trigger **`serialization_failure`**.
    425 
    426 **`SERIALIZABLE`** uses **Serializable Snapshot Isolation (SSI)**—tracking
    427 dependencies between concurrent transactions so that any outcome that could not
    428 have occurred if transactions ran one-after-another causes an abort. It is the
    429 strongest guarantee PostgreSQL offers but may force **retries** under contention.
    430 
    431 ---
    432 
    433 ## ACID “under the hood”
    434 
    435 The acronym maps to mechanisms roughly as follows:
    436 
    437 ### Atomicity
    438 
    439 All changes in a transaction are recorded as **WAL records** before commit is
    440 allowed (see durability). If the session crashes mid-transaction, recovery
    441 replays only **committed** work; incomplete transactions are rolled back during
    442 recovery. Within the server, rollback uses transaction status and MVCC rules so
    443 uncommitted row versions are invisible after abort.
    444 
    445 ### Consistency
    446 
    447 PostgreSQL enforces **declared rules**: primary keys, foreign keys, `CHECK`,
    448 `NOT NULL`, triggers, etc. **Consistency** in the broader business sense (e.g.
    449 “account totals match ledger”) remains the application’s responsibility;
    450 transactions bundle changes so those rules can be checked together before commit.
    451 
    452 ### Isolation
    453 
    454 Implemented through **MVCC snapshots**, **row locks**, **predicate locks /
    455 SSI bookkeeping** (for serializable), and occasional **blocking**. Isolation is
    456 not “free”: stronger levels trade convenience for **abort/retry** under
    457 concurrency.
    458 
    459 ### Durability
    460 
    461 Committed transactions are made **durable** via the **write-ahead log**: redo
    462 information is written to WAL and, depending on **`synchronous_commit`** and
    463 replication settings, flushed to disk (and replicas) before the client is told
    464 `COMMIT` succeeded. That way a crash after commit can **replay** the log and
    465 restore committed data.
    466 
    467 ---
    468 
    469 ## Practical takeaways
    470 
    471 - Use explicit transactions when several statements must succeed or fail
    472   together.
    473 - Expect **`serialization_failure`** under **`SERIALIZABLE`** or **`REPEATABLE
    474   READ`** when writes overlap; implement **retry** with backoff.
    475 - Use **`SELECT … FOR UPDATE`** (or equivalent) when you read a row and then
    476   update it based on that read, to serialize competing writers on that row.
    477 - Distinguish **`FUNCTION`** (single outer transaction) from **`PROCEDURE`**
    478   (optional internal commits) when porting “stored procedure” designs from other
    479   DBMSs.
    480 
    481 ---
    482 
    483 ## References
    484 
    485 - [PostgreSQL server sources (GitHub)](https://github.com/postgres/postgres)
    486 - [PostgreSQL Manual — Transaction Isolation](https://www.postgresql.org/docs/current/transaction-iso.html)
    487 - [PostgreSQL Manual — Routine vacuuming / MVCC](https://www.postgresql.org/docs/current/routine-vacuuming.html)
    488 - [PostgreSQL Manual — Write-Ahead Logging (WAL)](https://www.postgresql.org/docs/current/wal.html)
    489 - [PostgreSQL Manual — `CREATE PROCEDURE`](https://www.postgresql.org/docs/current/sql-createprocedure.html)
	notes
	Log \| Files \| Refs \| README
notes