postgresql_transactions_acid.md (19008B)
1 # PostgreSQL Transactions, Stored Routines, and ACID 2 3 PostgreSQL is built around **transactions**: bounded units of work that either 4 commit as a whole or leave no durable effect (after rollback). Understanding 5 how those transactions interact with **multi-version concurrency control 6 (MVCC)**, **locking**, and the **write-ahead log (WAL)** explains both everyday 7 behaviour (why two writers sometimes wait or fail) and what **ACID** means in 8 practice. 9 10 --- 11 12 ## Transactions in PostgreSQL 13 14 A transaction begins explicitly with `BEGIN` (or `START TRANSACTION`) and ends 15 with `COMMIT` or `ROLLBACK`. If you do neither, a single SQL statement still runs 16 inside an **implicit** transaction: PostgreSQL wraps each standalone statement 17 in its own mini-transaction that commits on success. 18 19 Inside one transaction, all statements see a coherent database state according to 20 the chosen **isolation level** (see below). Until `COMMIT`, other sessions may or 21 may not see your changes, depending on isolation and whether they started before 22 or after your writes. 23 24 **Savepoints** (`SAVEPOINT`, `ROLLBACK TO SAVEPOINT`) let you partially undo 25 work without aborting the whole outer transaction—useful for complex workflows 26 with selective retry logic. 27 28 --- 29 30 ## “Stored procedures” vs functions in PostgreSQL 31 32 PostgreSQL’s primary server-side programming mechanism is the **`FUNCTION`** 33 (PL/pgSQL, SQL, or other languages). A function **always runs in the caller’s 34 transaction**; it cannot `COMMIT` or `ROLLBACK` the surrounding transaction from 35 inside the function body. 36 37 Since **PostgreSQL 11**, **`PROCEDURE`** exists for routines that **may** issue 38 transaction control: a procedure can end its own transaction with `COMMIT` or 39 `ROLLBACK` and continue (multi-transaction batching inside one call). That is 40 the feature people often mean by “stored procedure” when comparing to other 41 databases. For most application code, plain SQL inside a client-managed 42 transaction plus optional **functions** is still the common pattern. 43 44 Whether logic lives in the app or in the server, the same **MVCC and locking 45 rules** apply: routines do not bypass concurrency controls. 46 47 --- 48 49 ## How PostgreSQL handles many writers (conflicts) 50 51 PostgreSQL uses **MVCC** instead of readers blocking writers with a single 52 global lock on rows. Each row version carries metadata about which transactions 53 created and invalidated it (`xmin` / `xmax` system columns). A query reads **row 54 versions visible to its snapshot**—typically versions committed before the 55 snapshot’s cutoff—so normal **SELECT** does not take row-level locks that block 56 concurrent **UPDATE** on other rows. 57 58 When **two transactions try to change the same row**, MVCC does not allow both 59 to succeed blindly: 60 61 1. **Row-level locks** — `UPDATE`, `DELETE`, and `SELECT … FOR UPDATE` acquire 62 locks on the target rows. A second transaction that needs the same row may 63 **block** until the first commits or rolls back, then either proceeds or 64 detects the row was changed (depending on operation and isolation). 65 66 2. **Serialization failures** — Under **`REPEATABLE READ`** or **`SERIALIZABLE`**, 67 PostgreSQL can abort a transaction with **`SQLSTATE 40001`** 68 (`serialization_failure`) when the guarantee of that isolation level would be 69 violated—e.g. a concurrent write pattern that could produce anomalies if both 70 committed. The application is expected to **retry** the whole transaction. 71 72 3. **Deadlocks** — When transactions wait on each other in a cycle, PostgreSQL 73 **detects** the deadlock and **aborts one transaction** (`deadlock_detected`) 74 so the others can proceed. 75 76 So “multiple users writing the same data at the same time” is handled by a mix 77 of **waiting**, **abort with retry**, and **ordering**—not by silently losing 78 updates. Which outcome you get depends on SQL patterns (locking reads vs blind 79 updates) and isolation level. 80 81 The subsections below quote **actual C code** from the upstream PostgreSQL tree 82 so you can see how these behaviours are implemented. Excerpts are taken from 83 commit [`43fc1dc7527`](https://github.com/postgres/postgres/tree/43fc1dc7527c4213c4d48a96bf98738df0acb71a) 84 (PostgreSQL is licensed under [the PostgreSQL Licence](https://www.postgresql.org/about/licence/)). 85 86 ### Row-level blocking: `heap_lock_tuple` 87 88 When a tuple is already being modified or deleted, `heap_lock_tuple` inspects 89 visibility with **`HeapTupleSatisfiesUpdate`**, unlocks the buffer briefly, and 90 may **sleep** until competing lockers finish—first acquiring a **heavyweight 91 tuple lock** (`LockTuple` via `heap_acquire_tuplock`), then waiting on the 92 transaction or MultiXact holding `xmax`: 93 94 ```c 95 l3: 96 result = HeapTupleSatisfiesUpdate(tuple, cid, *buffer); 97 98 if (result == TM_Invisible) 99 { 100 /* 101 * This is possible, but only when locking a tuple for ON CONFLICT DO 102 * SELECT/UPDATE. We return this value here rather than throwing an 103 * error in order to give that case the opportunity to throw a more 104 * specific error. 105 */ 106 result = TM_Invisible; 107 goto out_locked; 108 } 109 else if (result == TM_BeingModified || 110 result == TM_Updated || 111 result == TM_Deleted) 112 { 113 TransactionId xwait; 114 uint16 infomask; 115 uint16 infomask2; 116 bool require_sleep; 117 ItemPointerData t_ctid; 118 119 /* must copy state data before unlocking buffer */ 120 xwait = HeapTupleHeaderGetRawXmax(tuple->t_data); 121 infomask = tuple->t_data->t_infomask; 122 infomask2 = tuple->t_data->t_infomask2; 123 ItemPointerCopy(&tuple->t_data->t_ctid, &t_ctid); 124 125 LockBuffer(*buffer, BUFFER_LOCK_UNLOCK); 126 ``` 127 128 (The branch continues with multixact checks, tuple-lock acquisition, and sleep 129 logic; only the opening is shown here.) 130 131 The fragment above is the start of the branch taken when the tuple is already 132 being modified, updated, or deleted; the remainder of `heap_lock_tuple` decides 133 whether sleeping is required and how to wait. 134 135 Later, when `require_sleep` is true, the same function acquires the tuple lock 136 and waits on the rival transaction id: 137 138 ```c 139 if (!skip_tuple_lock && 140 !heap_acquire_tuplock(relation, tid, mode, wait_policy, 141 &have_tuple_lock)) 142 { 143 /* 144 * This can only happen if wait_policy is Skip and the lock 145 * couldn't be obtained. 146 */ 147 result = TM_WouldBlock; 148 /* recovery code expects to have buffer lock held */ 149 LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); 150 goto failed; 151 } 152 ``` 153 154 ```c 155 switch (wait_policy) 156 { 157 case LockWaitBlock: 158 XactLockTableWait(xwait, relation, &tuple->t_self, 159 XLTW_Lock); 160 break; 161 case LockWaitSkip: 162 if (!ConditionalXactLockTableWait(xwait, false)) 163 { 164 result = TM_WouldBlock; 165 /* recovery code expects to have buffer lock held */ 166 LockBuffer(*buffer, BUFFER_LOCK_EXCLUSIVE); 167 goto failed; 168 } 169 break; 170 case LockWaitError: 171 if (!ConditionalXactLockTableWait(xwait, log_lock_failures)) 172 ereport(ERROR, 173 (errcode(ERRCODE_LOCK_NOT_AVAILABLE), 174 errmsg("could not obtain lock on row in relation \"%s\"", 175 RelationGetRelationName(relation)))); 176 break; 177 } 178 ``` 179 180 (Source: [`src/backend/access/heap/heapam.c`](https://github.com/postgres/postgres/blob/43fc1dc7527c4213c4d48a96bf98738df0acb71a/src/backend/access/heap/heapam.c), 181 function `heap_lock_tuple`; this `switch` sits in the branch that waits on a 182 single transaction id after copying `xwait` from `xmax`.) 183 184 The **heavyweight** tuple lock is wired to `LockTuple` through a small macro 185 layer that maps `LockTupleMode` to the lock manager’s `LOCKMODE`: 186 187 ```c 188 #define LockTupleTuplock(rel, tup, mode) \ 189 LockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock) 190 #define UnlockTupleTuplock(rel, tup, mode) \ 191 UnlockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock) 192 #define ConditionalLockTupleTuplock(rel, tup, mode, log) \ 193 ConditionalLockTuple((rel), (tup), tupleLockExtraInfo[mode].hwlock, (log)) 194 ``` 195 196 `heap_acquire_tuplock` is the helper that invokes it when blocking is allowed: 197 198 ```c 199 static bool 200 heap_acquire_tuplock(Relation relation, const ItemPointerData *tid, LockTupleMode mode, 201 LockWaitPolicy wait_policy, bool *have_tuple_lock) 202 { 203 if (*have_tuple_lock) 204 return true; 205 206 switch (wait_policy) 207 { 208 case LockWaitBlock: 209 LockTupleTuplock(relation, tid, mode); 210 break; 211 212 case LockWaitSkip: 213 if (!ConditionalLockTupleTuplock(relation, tid, mode, false)) 214 return false; 215 break; 216 217 case LockWaitError: 218 if (!ConditionalLockTupleTuplock(relation, tid, mode, log_lock_failures)) 219 ereport(ERROR, 220 (errcode(ERRCODE_LOCK_NOT_AVAILABLE), 221 errmsg("could not obtain lock on row in relation \"%s\"", 222 RelationGetRelationName(relation)))); 223 break; 224 } 225 *have_tuple_lock = true; 226 227 return true; 228 } 229 ``` 230 231 (Source: same file, macros near the top of `heapam.c` and `heap_acquire_tuplock` 232 below `heap_lock_tuple`.) 233 234 ### Deadlock detection: `CheckDeadLock` and `DeadLockCheck` 235 236 While a backend sleeps on a lock, a **deadlock timeout** can fire; the waiter 237 then locks all lock-hash partitions and runs **`DeadLockCheck`**. If the graph 238 contains an unresolvable cycle (`DS_HARD_DEADLOCK`), the waiter is removed from 239 the queue so **`ProcSleep`** surfaces an error and the transaction aborts. The 240 full handler is **`CheckDeadLock`** in `proc.c`: 241 242 ```c 243 /* 244 * CheckDeadLock 245 * 246 * We only get to this routine, if DEADLOCK_TIMEOUT fired while waiting for a 247 * lock to be released by some other process. Check if there's a deadlock; if 248 * not, just return. If we have a real deadlock, remove ourselves from the 249 * lock's wait queue. 250 */ 251 static DeadLockState 252 CheckDeadLock(void) 253 { 254 int i; 255 DeadLockState result; 256 257 /* 258 * Acquire exclusive lock on the entire shared lock data structures. Must 259 * grab LWLocks in partition-number order to avoid LWLock deadlock. 260 * 261 * Note that the deadlock check interrupt had better not be enabled 262 * anywhere that this process itself holds lock partition locks, else this 263 * will wait forever. Also note that LWLockAcquire creates a critical 264 * section, so that this routine cannot be interrupted by cancel/die 265 * interrupts. 266 */ 267 for (i = 0; i < NUM_LOCK_PARTITIONS; i++) 268 LWLockAcquire(LockHashPartitionLockByIndex(i), LW_EXCLUSIVE); 269 270 /* 271 * Check to see if we've been awoken by anyone in the interim. 272 * 273 * If we have, we can return and resume our transaction -- happy day. 274 * Before we are awoken the process releasing the lock grants it to us so 275 * we know that we don't have to wait anymore. 276 * 277 * We check by looking to see if we've been unlinked from the wait queue. 278 * This is safe because we hold the lock partition lock. 279 */ 280 if (dlist_node_is_detached(&MyProc->waitLink)) 281 { 282 result = DS_NO_DEADLOCK; 283 goto check_done; 284 } 285 286 #ifdef LOCK_DEBUG 287 if (Debug_deadlocks) 288 DumpAllLocks(); 289 #endif 290 291 /* Run the deadlock check */ 292 result = DeadLockCheck(MyProc); 293 294 if (result == DS_HARD_DEADLOCK) 295 { 296 /* 297 * Oops. We have a deadlock. 298 * 299 * Get this process out of wait state. (Note: we could do this more 300 * efficiently by relying on lockAwaited, but use this coding to 301 * preserve the flexibility to kill some other transaction than the 302 * one detecting the deadlock.) 303 * 304 * RemoveFromWaitQueue sets MyProc->waitStatus to 305 * PROC_WAIT_STATUS_ERROR, so ProcSleep will report an error after we 306 * return. 307 */ 308 Assert(MyProc->waitLock != NULL); 309 RemoveFromWaitQueue(MyProc, LockTagHashCode(&(MyProc->waitLock->tag))); 310 311 /* 312 * We're done here. Transaction abort caused by the error that 313 * ProcSleep will raise will cause any other locks we hold to be 314 * released, thus allowing other processes to wake up; we don't need 315 * to do that here. NOTE: an exception is that releasing locks we 316 * hold doesn't consider the possibility of waiters that were blocked 317 * behind us on the lock we just failed to get, and might now be 318 * wakable because we're not in front of them anymore. However, 319 * RemoveFromWaitQueue took care of waking up any such processes. 320 */ 321 } 322 323 /* 324 * And release locks. We do this in reverse order for two reasons: (1) 325 * Anyone else who needs more than one of the locks will be trying to lock 326 * them in increasing order; we don't want to release the other process 327 * until it can get all the locks it needs. (2) This avoids O(N^2) 328 * behavior inside LWLockRelease. 329 */ 330 check_done: 331 for (i = NUM_LOCK_PARTITIONS; --i >= 0;) 332 LWLockRelease(LockHashPartitionLockByIndex(i)); 333 334 return result; 335 } 336 ``` 337 338 `DeadLockCheck` in `deadlock.c` implements the search for cycles and possible 339 wait-queue reordering; its header comment states the `DS_HARD_DEADLOCK` 340 contract, and the body begins by resetting workspace used by the detector: 341 342 ```c 343 /* 344 * DeadLockCheck -- Checks for deadlocks for a given process 345 * 346 * This code looks for deadlocks involving the given process. If any 347 * are found, it tries to rearrange lock wait queues to resolve the 348 * deadlock. If resolution is impossible, return DS_HARD_DEADLOCK --- 349 * the caller is then expected to abort the given proc's transaction. 350 * 351 * Caller must already have locked all partitions of the lock tables. 352 * 353 * On failure, deadlock details are recorded in deadlockDetails[] for 354 * subsequent printing by DeadLockReport(). That activity is separate 355 * because we don't want to do it while holding all those LWLocks. 356 */ 357 DeadLockState 358 DeadLockCheck(PGPROC *proc) 359 { 360 /* Initialize to "no constraints" */ 361 nCurConstraints = 0; 362 nPossibleConstraints = 0; 363 nWaitOrders = 0; 364 365 /* Initialize to not blocked by an autovacuum worker */ 366 blocking_autovacuum_proc = NULL; 367 368 /* Search for deadlocks and possible fixes */ 369 if (DeadLockCheckRecurse(proc)) 370 ``` 371 372 (Sources: [`src/backend/storage/lmgr/proc.c`](https://github.com/postgres/postgres/blob/43fc1dc7527c4213c4d48a96bf98738df0acb71a/src/backend/storage/lmgr/proc.c) 373 and [`src/backend/storage/lmgr/deadlock.c`](https://github.com/postgres/postgres/blob/43fc1dc7527c4213c4d48a96bf98738df0acb71a/src/backend/storage/lmgr/deadlock.c). The `EDGE` / waits-for graph structs at the top of `deadlock.c` are also worth reading alongside [`src/backend/storage/lmgr/README`](https://github.com/postgres/postgres/blob/43fc1dc7527c4213c4d48a96bf98738df0acb71a/src/backend/storage/lmgr/README).) 374 375 ### Serializable isolation: `PreCommit_CheckForSerializationFailure` 376 377 Under **SSI**, commit-time validation walks **read/write conflict** lists; if a 378 **dangerous structure** appears (pivot still uncommitted), PostgreSQL raises 379 **`ERRCODE_T_R_SERIALIZATION_FAILURE`** (`SQLSTATE 40001`), matching the 380 user-visible `serialization_failure`: 381 382 ```c 383 void 384 PreCommit_CheckForSerializationFailure(void) 385 { 386 dlist_iter near_iter; 387 388 if (MySerializableXact == InvalidSerializableXact) 389 return; 390 391 Assert(IsolationIsSerializable()); 392 393 LWLockAcquire(SerializableXactHashLock, LW_EXCLUSIVE); 394 395 /* 396 * Check if someone else has already decided that we need to die. Since 397 * we set our own DOOMED flag when partially releasing, ignore in that 398 * case. 399 */ 400 if (SxactIsDoomed(MySerializableXact) && 401 !SxactIsPartiallyReleased(MySerializableXact)) 402 { 403 LWLockRelease(SerializableXactHashLock); 404 ereport(ERROR, 405 (errcode(ERRCODE_T_R_SERIALIZATION_FAILURE), 406 errmsg("could not serialize access due to read/write dependencies among transactions"), 407 errdetail_internal("Reason code: Canceled on identification as a pivot, during commit attempt."), 408 errhint("The transaction might succeed if retried."))); 409 } 410 ``` 411 412 (Source: [`src/backend/storage/lmgr/predicate.c`](https://github.com/postgres/postgres/blob/43fc1dc7527c4213c4d48a96bf98738df0acb71a/src/backend/storage/lmgr/predicate.c).) 413 414 --- 415 416 ## Isolation levels (what “consistent view” means) 417 418 Default isolation is **`READ COMMITTED`**: each statement sees rows committed 419 before that **statement** began. That prevents dirty reads but allows **non-repeatable 420 reads** and **phantoms** across statements in the same transaction. 421 422 **`REPEATABLE READ`** uses a **snapshot** taken at the **first query** in the 423 transaction; subsequent reads see the same world. Concurrent commits that would 424 change what that snapshot means for writes can trigger **`serialization_failure`**. 425 426 **`SERIALIZABLE`** uses **Serializable Snapshot Isolation (SSI)**—tracking 427 dependencies between concurrent transactions so that any outcome that could not 428 have occurred if transactions ran one-after-another causes an abort. It is the 429 strongest guarantee PostgreSQL offers but may force **retries** under contention. 430 431 --- 432 433 ## ACID “under the hood” 434 435 The acronym maps to mechanisms roughly as follows: 436 437 ### Atomicity 438 439 All changes in a transaction are recorded as **WAL records** before commit is 440 allowed (see durability). If the session crashes mid-transaction, recovery 441 replays only **committed** work; incomplete transactions are rolled back during 442 recovery. Within the server, rollback uses transaction status and MVCC rules so 443 uncommitted row versions are invisible after abort. 444 445 ### Consistency 446 447 PostgreSQL enforces **declared rules**: primary keys, foreign keys, `CHECK`, 448 `NOT NULL`, triggers, etc. **Consistency** in the broader business sense (e.g. 449 “account totals match ledger”) remains the application’s responsibility; 450 transactions bundle changes so those rules can be checked together before commit. 451 452 ### Isolation 453 454 Implemented through **MVCC snapshots**, **row locks**, **predicate locks / 455 SSI bookkeeping** (for serializable), and occasional **blocking**. Isolation is 456 not “free”: stronger levels trade convenience for **abort/retry** under 457 concurrency. 458 459 ### Durability 460 461 Committed transactions are made **durable** via the **write-ahead log**: redo 462 information is written to WAL and, depending on **`synchronous_commit`** and 463 replication settings, flushed to disk (and replicas) before the client is told 464 `COMMIT` succeeded. That way a crash after commit can **replay** the log and 465 restore committed data. 466 467 --- 468 469 ## Practical takeaways 470 471 - Use explicit transactions when several statements must succeed or fail 472 together. 473 - Expect **`serialization_failure`** under **`SERIALIZABLE`** or **`REPEATABLE 474 READ`** when writes overlap; implement **retry** with backoff. 475 - Use **`SELECT … FOR UPDATE`** (or equivalent) when you read a row and then 476 update it based on that read, to serialize competing writers on that row. 477 - Distinguish **`FUNCTION`** (single outer transaction) from **`PROCEDURE`** 478 (optional internal commits) when porting “stored procedure” designs from other 479 DBMSs. 480 481 --- 482 483 ## References 484 485 - [PostgreSQL server sources (GitHub)](https://github.com/postgres/postgres) 486 - [PostgreSQL Manual — Transaction Isolation](https://www.postgresql.org/docs/current/transaction-iso.html) 487 - [PostgreSQL Manual — Routine vacuuming / MVCC](https://www.postgresql.org/docs/current/routine-vacuuming.html) 488 - [PostgreSQL Manual — Write-Ahead Logging (WAL)](https://www.postgresql.org/docs/current/wal.html) 489 - [PostgreSQL Manual — `CREATE PROCEDURE`](https://www.postgresql.org/docs/current/sql-createprocedure.html)