ddia/content/en/ch8.md

---
title: "8. Transactions"
weight: 208
breadcrumbs: false
---

> *Some authors have claimed that general two-phase commit is too expensive to support, because of the
> performance or availability problems that it brings. We believe it is better to have application
> programmers deal with performance problems due to overuse of transactions as bottlenecks arise,
> rather than always coding around the lack of transactions.*
>
> James Corbett et al., *Spanner: Google’s Globally-Distributed Database* (2012)

In the harsh reality of data systems, many things can go wrong:

* The database software or hardware may fail at any time (including in the middle of a write
  operation).
* The application may crash at any time (including halfway through a series of operations).
* Interruptions in the network can unexpectedly cut off the application from the database, or one
  database node from another.
* Several clients may write to the database at the same time, overwriting each other’s changes.
* A client may read data that doesn’t make sense because it has only partially been updated.
* Race conditions between clients can cause surprising bugs.

In order to be reliable, a system has to deal with these faults and ensure that they don’t cause
catastrophic failure of the entire system. However, implementing fault-tolerance mechanisms is a lot
of work. It requires a lot of careful thinking about all the things that can go wrong, and a lot of
testing to ensure that the solution actually works.

For decades, *transactions* have been the mechanism of choice for simplifying these issues. A
transaction is a way for an application to group several reads and writes together into a logical
unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either
the entire transaction succeeds (*commit*) or it fails (*abort*, *rollback*). If it fails, the
application can safely retry. With transactions, error handling becomes much simpler for an
application, because it doesn’t need to worry about partial failure—i.e., the case where some
operations succeed and some fail (for whatever reason).

If you have spent years working with transactions, they may seem obvious, but we shouldn’t take them
for granted. Transactions are not a law of nature; they were created with a purpose, namely to
*simplify the programming model* for applications accessing a database. By using transactions, the
application is free to ignore certain potential error scenarios and concurrency issues, because the
database takes care of them instead (we call these *safety guarantees*).

Not every application needs transactions, and sometimes there are advantages to weakening
transactional guarantees or abandoning them entirely (for example, to achieve higher performance or
higher availability). Some safety properties can be achieved without transactions. On the other
hand, transactions can prevent a lot of grief: for example, the technical cause behind the Post
Office Horizon scandal (see [“How Important Is Reliability?”](/en/ch2#sidebar_reliability_importance)) was probably a lack of ACID
transactions in the underlying accounting system
[[1](/en/ch8#Murdoch2021)].

How do you figure out whether you need transactions? In order to answer that question, we first need
to understand exactly what safety guarantees transactions can provide, and what costs are associated
with them. Although transactions seem straightforward at first glance, there are actually many
subtle but important details that come into play.

In this chapter, we will examine many examples of things that can go wrong, and explore the
algorithms that databases use to guard against those issues. We will go especially deep in the area
of concurrency control, discussing various kinds of race conditions that can occur and how
databases implement isolation levels such as *read committed*, *snapshot isolation*, and
*serializability*.

Concurrency control is relevant for both single-node and distributed databases. Later in this
chapter, in [“Distributed Transactions”](/en/ch8#sec_transactions_distributed), we will examine the *two-phase commit* protocol and
the challenge of achieving atomicity in a distributed transaction.

# What Exactly Is a Transaction?

Almost all relational databases today, and some nonrelational databases, support transactions. Most
of them follow the style that was introduced in 1975 by IBM System R, the first SQL database
[[2](/en/ch8#Chamberlin1981),
[3](/en/ch8#Gray1976),
[4](/en/ch8#Eswaran1976)].
Although some implementation details have changed, the general idea has remained virtually the same
for 50 years: the transaction support in MySQL, PostgreSQL, Oracle, SQL Server, etc., is uncannily
similar to that of System R.

In the late 2000s, nonrelational (NoSQL) databases started gaining popularity. They aimed to
improve upon the relational status quo by offering a choice of new data models (see
[Chapter 3](/en/ch3#ch_datamodels)), and by including replication ([Chapter 6](/en/ch6#ch_replication)) and sharding
([Chapter 7](/en/ch7#ch_sharding)) by default. Transactions were the main casualty of this movement: many of this
generation of databases abandoned transactions entirely, or redefined the word to describe a
much weaker set of guarantees than had previously been understood.

The hype around NoSQL distributed databases led to a popular belief that transactions were
fundamentally unscalable, and that any large-scale system would have to abandon transactions in
order to maintain good performance and high availability. More recently, that belief has turned out
to be wrong. So-called “NewSQL” databases such as CockroachDB
[[5](/en/ch8#Taft2020_ch8)],
TiDB [[6](/en/ch8#Huang2020)],
Spanner [[7](/en/ch8#Corbett2012_ch8)],
FoundationDB [[8](/en/ch8#Zhou2021_ch8)],
and Yugabyte have shown that transactional systems can scale to large data volumes and high
throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide
strong ACID guarantees at scale.

However, that doesn’t mean that every system must be transactional either: like every other
technical design choice, transactions have advantages and limitations. In order to understand those
trade-offs, let’s go into the details of the guarantees that transactions can provide—both in normal
operation and in various extreme (but realistic) circumstances.

## The Meaning of ACID

The safety guarantees provided by transactions are often described by the well-known acronym *ACID*,
which stands for *Atomicity*, *Consistency*, *Isolation*, and *Durability*. It was coined in 1983 by
Theo Härder and Andreas Reuter
[[9](/en/ch8#Harder1983)]
in an effort to establish precise terminology for fault-tolerance mechanisms in databases.

However, in practice, one database’s implementation of ACID does not equal another’s implementation.
For example, as we shall see, there is a lot of ambiguity around the meaning of *isolation*
[[10](/en/ch8#Bailis2013HAT)].
The high-level idea is sound, but the devil is in the details. Today, when a system claims to be
“ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately
become mostly a marketing term.

(Systems that do not meet the ACID criteria are sometimes called *BASE*, which stands for
*Basically Available*, *Soft state*, and *Eventual consistency*
[[11](/en/ch8#Fox1997)].
This is even more vague than the definition of ACID. It seems that the only sensible definition of
BASE is “not ACID”; i.e., it can mean almost anything you want.)

Let’s dig into the definitions of atomicity, consistency, isolation, and durability, as this will let
us refine our idea of transactions.

### Atomicity

In general, *atomic* refers to something that cannot be broken down into smaller parts. The word
means similar but subtly different things in different branches of computing. For example, in
multi-threaded programming, if one thread executes an atomic operation, that means there is no way
that another thread could see the half-finished result of the operation. The system can only be in
the state it was before the operation or after the operation, not something in between.

By contrast, in the context of ACID, atomicity is *not* about concurrency. It does not describe
what happens if several processes try to access the same data at the same time, because that is
covered under the letter *I*, for *isolation* (see [“Isolation”](/en/ch8#sec_transactions_acid_isolation)).

Rather, ACID atomicity describes what happens if a client wants to make several writes, but a fault
occurs after some of the writes have been processed—for example, a process crashes, a network
connection is interrupted, a disk becomes full, or some integrity constraint is violated.
If the writes are grouped together into an atomic transaction, and the transaction cannot be
completed (*committed*) due to a fault, then the transaction is *aborted* and the database must
discard or undo any writes it has made so far in that transaction.

Without atomicity, if an error occurs partway through making multiple changes, it’s difficult to
know which changes have taken effect and which haven’t. The application could try again, but that
risks making the same change twice, leading to duplicate or incorrect data. Atomicity simplifies
this problem: if a transaction was aborted, the application can be sure that it didn’t change
anything, so it can safely be retried.

The ability to abort a transaction on error and have all writes from that transaction discarded is
the defining feature of ACID atomicity. Perhaps *abortability* would have been a better term than
*atomicity*, but we will stick with *atomicity* since that’s the usual word.

### Consistency

The word *consistency* is terribly overloaded:

* In [Chapter 6](/en/ch6#ch_replication) we discussed *replica consistency* and the issue of *eventual consistency*
  that arises in asynchronously replicated systems (see [“Problems with Replication Lag”](/en/ch6#sec_replication_lag)).
* A *consistent snapshot* of a database, e.g. for a backup, is a snapshot of the entire database as
  it existed at one moment in time. More precisely, it is consistent with the happens-before
  relation (see [“The “happens-before” relation and concurrency”](/en/ch6#sec_replication_happens_before)): that is, if the snapshot contains a value that
  was written at a particular time, then it also reflects all the writes that happened before that
  value was written.
* *Consistent hashing* is an approach to sharding that some systems use for rebalancing (see
  [“Consistent hashing”](/en/ch7#sec_sharding_consistent_hashing)).
* In the CAP theorem (see [Chapter 10](/en/ch10#ch_consistency)), the word *consistency* is used to mean
  *linearizability* (see [“Linearizability”](/en/ch10#sec_consistency_linearizability)).
* In the context of ACID, *consistency* refers to an application-specific notion of the database
  being in a “good state.”

It’s unfortunate that the same word is used with at least five different meanings.

The idea of ACID consistency is that you have certain statements about your data (*invariants*) that
must always be true—for example, in an accounting system, credits and debits across all accounts
must always be balanced. If a transaction starts with a database that is valid according to these
invariants, and any writes during the transaction preserve the validity, then you can be sure that
the invariants are always satisfied. (An invariant may be temporarily violated during transaction
execution, but it should be satisfied again at transaction commit.)

If you want the database to enforce your invariants, you need to declare them as *constraints* as
part of the schema. For example, foreign key constraints, uniqueness constraints, or check
constraints (which restrict the values that can appear in an individual row) are often used to
model specific types of invariants. More complex consistency requirements can sometimes be modeled
using triggers or materialized views [[12](/en/ch8#Andrews2004)].

However, complex invariants can be difficult or impossible to model using the constraints that
databases usually provide. In that case, it’s the application’s responsibility to define its
transactions correctly so that they preserve consistency. If you write bad data that violates your
invariants, but you haven’t declared those invariants, the database can’t stop you. As such, the C
in ACID often depends on how the application uses the database, and it’s not a property of the
database alone.

### Isolation

Most databases are accessed by several clients at the same time. That is no problem if they are
reading and writing different parts of the database, but if they are accessing the same database
records, you can run into concurrency problems (race conditions).

[Figure 8-1](/en/ch8#fig_transactions_increment) is a simple example of this kind of problem. Say you have two clients
simultaneously incrementing a counter that is stored in a database. Each client needs to read the
current value, add 1, and write the new value back (assuming there is no increment operation built
into the database). In [Figure 8-1](/en/ch8#fig_transactions_increment) the counter should have increased from 42 to
44, because two increments happened, but it actually only went to 43 because of the race condition.

![ddia 0801](/fig/ddia_0801.png)

###### Figure 8-1. A race condition between two clients concurrently incrementing a counter.

*Isolation* in the sense of ACID means that concurrently executing transactions are isolated from
each other: they cannot step on each other’s toes. The classic database textbooks formalize
isolation as *serializability*, which means that each transaction can pretend that it is the only
transaction running on the entire database. The database ensures that when the transactions have
committed, the result is the same as if they had run *serially* (one after another), even though in
reality they may have run concurrently
[[13](/en/ch8#Bernstein1987_ch8)].

However, serializability has a performance cost. In practice, many databases use forms of isolation
that are weaker than serializability: that is, they allow concurrent transactions to interfere with
each other in limited ways. Some popular databases, such as Oracle, don’t even implement it (Oracle
has an isolation level called “serializable,” but it actually implements *snapshot isolation*, which
is a weaker guarantee than serializability [[10](/en/ch8#Bailis2013HAT),
[14](/en/ch8#Fekete2005)]).
This means that some kinds of race conditions can still occur. We will explore snapshot isolation
and other forms of isolation in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels).

### Durability

The purpose of a database system is to provide a safe place where data can be stored without fear of
losing it. *Durability* is the promise that once a transaction has committed successfully, any data it
has written will not be forgotten, even if there is a hardware fault or the database crashes.

In a single-node database, durability typically means that the data has been written to nonvolatile
storage such as a hard drive or SSD. Regular file writes are usually buffered in memory before being
sent to the disk sometime later, which means they would be lost if there is a sudden power failure;
many databases therefore use the `fsync()` system call to ensure the data really has been written to
disk. Databases usually also have a write-ahead log or similar (see [“Making B-trees reliable”](/en/ch4#sec_storage_btree_wal)),
which allows them to recover in the event that a crash occurs part way through a write.

In a replicated database, durability may mean that the data has been successfully copied to some
number of nodes. In order to provide a durability guarantee, a database must wait until these writes
or replications are complete before reporting a transaction as successfully committed. However,
as discussed in [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability), perfect durability does not exist: if all your
hard disks and all your backups are destroyed at the same time, there’s obviously nothing your
database can do to save you.

# Replication and Durability

Historically, durability meant writing to an archive tape. Then it was understood as writing to a disk
or SSD. More recently, it has been adapted to mean replication. Which implementation is better?

The truth is, nothing is perfect:

* If you write to disk and the machine dies, even though your data isn’t lost, it is inaccessible
  until you either fix the machine or transfer the disk to another machine. Replicated systems can
  remain available.
* A correlated fault—a power outage or a bug that crashes every node on a particular input—can
  knock out all replicas at once (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability)), losing any data that is
  only in memory. Writing to disk is therefore still relevant for replicated databases.
* In an asynchronously replicated system, recent writes may be lost when the leader becomes
  unavailable (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)).
* When the power is suddenly cut, SSDs in particular have been shown to sometimes violate the
  guarantees they are supposed to provide: even `fsync` isn’t guaranteed to work correctly
  [[15](/en/ch8#Zheng2013)].
  Disk firmware can have bugs, just like any other kind of software
  [[16](/en/ch8#Denness2015),
  [17](/en/ch8#Surak2015)],
  e.g. causing drives to fail after exactly 32,768 hours of operation
  [[18](/en/ch8#HPE2019_ch8)].
  And `fsync` is hard to use; even PostgreSQL used it incorrectly for over 20 years
  [[19](/en/ch8#Ringer2018),
  [20](/en/ch8#Rebello2020),
  [21](/en/ch8#Pillai2015)].
* Subtle interactions between the storage engine and the filesystem implementation can lead to bugs
  that are hard to track down, and may cause files on disk to be corrupted after a crash
  [[22](/en/ch8#Pillai2014),
  [23](/en/ch8#Siebenmann2016)].
  Filesystem errors on one replica can sometimes spread to other replicas as well
  [[24](/en/ch8#Ganesan2017)].
* Data on disk can gradually become corrupted without this being detected
  [[25](/en/ch8#Bairavasundaram2008)].
  If data has been corrupted for some time, replicas and recent backups may also be corrupted. In
  this case, you will need to try to restore the data from a historical backup.
* One study of SSDs found that between 30% and 80% of drives develop at least one bad block during
  the first four years of operation, and only some of these can be corrected by the firmware
  [[26](/en/ch8#Schroeder2016_ch8)].
  Magnetic hard drives have a lower rate of bad sectors, but a higher rate of complete failure than
  SSDs.
* When a worn-out SSD (that has gone through many write/erase cycles) is disconnected from power,
  it can start losing data within a timescale of weeks to months, depending on the temperature
  [[27](/en/ch8#Allison2015)].
  This is less of a problem for drives with lower wear levels
  [[28](/en/ch8#MahUng2015)].

In practice, there is no one technique that can provide absolute guarantees. There are only various
risk-reduction techniques, including writing to disk, replicating to remote machines, and
backups—and they can and should be used together. As always, it’s wise to take any theoretical
“guarantees” with a healthy grain of salt.

## Single-Object and Multi-Object Operations

To recap, in ACID, atomicity and isolation describe what the database should do if a client makes
several writes within the same transaction:

Atomicity
:   If an error occurs halfway through a sequence of writes, the transaction should be aborted, and
    the writes made up to that point should be discarded. In other words, the database saves you from
    having to worry about partial failure, by giving an all-or-nothing guarantee.

Isolation
:   Concurrently running transactions shouldn’t interfere with each other. For example, if one
    transaction makes several writes, then another transaction should see either all or none of those
    writes, but not some subset.

These definitions assume that you want to modify several objects (rows, documents, records) at once.
Such *multi-object transactions* are often needed if several pieces of data need to be kept in sync.
[Figure 8-2](/en/ch8#fig_transactions_read_uncommitted) shows an example from an email application. To display the
number of unread messages for a user, you could query something like:

```
SELECT COUNT(*) FROM emails WHERE recipient_id = 2 AND unread_flag = true
```

![ddia 0802](/fig/ddia_0802.png)

###### Figure 8-2. Violating isolation: one transaction reads another transaction’s uncommitted writes (a “dirty read”).

However, you might find this query to be too slow if there are many emails, and decide to store the
number of unread messages in a separate field (a kind of denormalization, which we discuss in
[“Normalization, Denormalization, and Joins”](/en/ch3#sec_datamodels_normalization)). Now, whenever a new message comes in, you have to increment the
unread counter as well, and whenever a message is marked as read, you also have to decrement the
unread counter.

In [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), user 2 experiences an anomaly: the mailbox listing shows
an unread message, but the counter shows zero unread messages because the counter increment has not
yet happened. (If an incorrect counter in an email application seems too insignificant, think of a
customer account balance instead of an unread counter, and a payment transaction instead of an
email.) Isolation would have prevented this issue by ensuring that user 2 sees either both the
inserted email and the updated counter, or neither, but not an inconsistent halfway point.

[Figure 8-3](/en/ch8#fig_transactions_atomicity) illustrates the need for atomicity: if an error occurs somewhere
over the course of the transaction, the contents of the mailbox and the unread counter might become out
of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted
and the inserted email is rolled back.

![ddia 0803](/fig/ddia_0803.png)

###### Figure 8-3. Atomicity ensures that if an error occurs any prior writes from that transaction are undone, to avoid an inconsistent state.

Multi-object transactions require some way of determining which read and write operations belong to
the same transaction. In relational databases, that is typically done based on the client’s TCP
connection to the database server: on any particular connection, everything between a `BEGIN
TRANSACTION` and a `COMMIT` statement is considered to be part of the same transaction. If the TCP
connection is interrupted, the transaction must be aborted.

On the other hand, many nonrelational databases don’t have such a way of grouping operations
together. Even if there is a multi-object API (for example, a key-value store may have a *multi-put*
operation that updates several keys in one operation), that doesn’t necessarily mean it has
transaction semantics: the command may succeed for some keys and fail for others, leaving the
database in a partially updated state.

### Single-object writes

Atomicity and isolation also apply when a single object is being changed. For example, imagine you
are writing a 20 KB JSON document to a database:

* If the network connection is interrupted after the first 10 KB have been sent, does the
  database store that unparseable 10 KB fragment of JSON?
* If the power fails while the database is in the middle of overwriting the previous value on disk,
  do you end up with the old and new values spliced together?
* If another client reads that document while the write is in progress, will it see a partially
  updated value?

Those issues would be incredibly confusing, so storage engines almost universally aim to provide
atomicity and isolation on the level of a single object (such as a key-value pair) on one node.
Atomicity can be implemented using a log for crash recovery (see [“Making B-trees reliable”](/en/ch4#sec_storage_btree_wal)), and
isolation can be implemented using a lock on each object (allowing only one thread to access an
object at any one time).

Some databases also provide more complex atomic operations, such as an increment operation, which
removes the need for a read-modify-write cycle like that in [Figure 8-1](/en/ch8#fig_transactions_increment).
Similarly popular is a *conditional write* operation, which allows a write to happen only if the value
has not been concurrently changed by someone else (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)),
similarly to a compare-and-set or compare-and-swap (CAS) operation in shared-memory concurrency.

###### Note

Strictly speaking, the term *atomic increment* uses the word *atomic* in the sense of multi-threaded
programming. In the context of ACID, it should actually be called an *isolated* or *serializable*
increment, but that’s not the usual term.

These single-object operations are useful, as they can prevent lost updates when several clients try
to write to the same object concurrently (see [“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update)). However, they are
not transactions in the usual sense of the word. For example, the “lightweight transactions” feature
of Cassandra and ScyllaDB, and Aerospike’s “strong consistency” mode offer linearizable (see
[“Linearizability”](/en/ch10#sec_consistency_linearizability)) reads and conditional writes on a single object, but no
guarantees across multiple objects.

### The need for multi-object transactions

Do we need multi-object transactions at all? Would it be possible to implement any application with
only a key-value data model and single-object operations?

There are some use cases in which single-object inserts, updates, and deletes are sufficient.
However, in many other cases writes to several different objects need to be coordinated:

* In a relational data model, a row in one table often has a foreign key reference to a row in
  another table. Similarly, in a graph-like data model, a vertex has edges to other vertices.
  Multi-object transactions allow you to ensure that these references remain valid: when inserting
  several records that refer to one another, the foreign keys have to be correct and up to date,
  or the data becomes nonsensical.
* In a document data model, the fields that need to be updated together are often within the same
  document, which is treated as a single object—no multi-object transactions are needed when
  updating a single document. However, document databases lacking join functionality also encourage
  denormalization (see [“When to Use Which Model”](/en/ch3#sec_datamodels_document_summary)). When denormalized information needs to
  be updated, like in the example of [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), you need to update
  several documents in one go. Transactions are very useful in this situation to prevent
  denormalized data from going out of sync.
* In databases with secondary indexes (almost everything except pure key-value stores), the indexes
  also need to be updated every time you change a value. These indexes are different database
  objects from a transaction point of view: for example, without transaction isolation, it’s
  possible for a record to appear in one index but not another, because the update to the second
  index hasn’t happened yet (see [“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes)).

Such applications can still be implemented without transactions. However, error handling becomes
much more complicated without atomicity, and the lack of isolation can cause concurrency problems.
We will discuss those in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels), and explore alternative approaches
in [Link to Come].

### Handling errors and aborts

A key feature of a transaction is that it can be aborted and safely retried if an error occurred.
ACID databases are based on this philosophy: if the database is in danger of violating its guarantee
of atomicity, isolation, or durability, it would rather abandon the transaction entirely than allow
it to remain half-finished.

Not all systems follow that philosophy, though. In particular, datastores with leaderless
replication (see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)) work much more on a “best effort” basis, which
could be summarized as “the database will do as much as it can, and if it runs into an error, it
won’t undo something it has already done”—so it’s the application’s responsibility to recover from
errors.

Errors will inevitably happen, but many software developers prefer to think only about the happy
path rather than the intricacies of error handling. For example, popular object-relational mapping
(ORM) frameworks such as Rails’s ActiveRecord and Django don’t retry aborted transactions—the
error usually results in an exception bubbling up the stack, so any user input is thrown away and
the user gets an error message. This is a shame, because the whole point of aborts is to enable safe
retries.

Although retrying an aborted transaction is a simple and effective error handling mechanism, it
isn’t perfect:

* If the transaction actually succeeded, but the network was interrupted while the server tried to
  acknowledge the successful commit to the client (so it timed out from the client’s point of view),
  then retrying the transaction causes it to be performed twice—unless you have an additional
  application-level deduplication mechanism in place.
* If the error is due to overload or high contention between concurrent transactions, retrying the
  transaction will make the problem worse, not better. To avoid such feedback cycles, you can limit
  the number of retries, use exponential backoff, and handle overload-related errors differently
  from other errors (see [“When an overloaded system won’t recover”](/en/ch2#sidebar_metastable)).
* It is only worth retrying after transient errors (for example due to deadlock, isolation
  violation, temporary network interruptions, and failover); after a permanent error (e.g.,
  constraint violation) a retry would be pointless.
* If the transaction also has side effects outside of the database, those side effects may happen
  even if the transaction is aborted. For example, if you’re sending an email, you wouldn’t want to
  send the email again every time you retry the transaction. If you want to make sure that several
  different systems either commit or abort together, two-phase commit can help (we will discuss this
  in [“Two-Phase Commit (2PC)”](/en/ch8#sec_transactions_2pc)).
* If the client process crashes while retrying, any data it was trying to write to the database is lost.

# Weak Isolation Levels

If two transactions don’t access the same data, or if both are read-only, they can safely be run in
parallel, because neither depends on the other. Concurrency issues (race conditions) only come into
play when one transaction reads data that is concurrently modified by another transaction, or when
the two transactions try to modify the same data.

Concurrency bugs are hard to find by testing, because such bugs are only triggered when you get
unlucky with the timing. Such timing issues might occur very rarely, and are usually difficult to
reproduce. Concurrency is also very difficult to reason about, especially in a large application
where you don’t necessarily know which other pieces of code are accessing the database. Application
development is difficult enough if you just have one user at a time; having many concurrent users
makes it much harder still, because any piece of data could unexpectedly change at any time.

For that reason, databases have long tried to hide concurrency issues from application developers by
providing *transaction isolation*. In theory, isolation should make your life easier by letting you
pretend that no concurrency is happening: *serializable* isolation means that the database
guarantees that transactions have the same effect as if they ran *serially* (i.e., one at a time,
without any concurrency).

In practice, isolation is unfortunately not that simple. Serializable isolation has a performance
cost, and many databases don’t want to pay that price
[[10](/en/ch8#Bailis2013HAT)]. It’s therefore common for systems to use
weaker levels of isolation, which protect against *some* concurrency issues, but not all. Those
levels of isolation are much harder to understand, and they can lead to subtle bugs, but they are
nevertheless used in practice
[[29](/en/ch8#Kleppmann2014)].

Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have
caused substantial loss of money
[[30](/en/ch8#Warszawski2017),
[31](/en/ch8#DAgosta2014),
[32](/en/ch8#bitcointhief2014)],
led to investigation by financial auditors
[[33](/en/ch8#Jorwekar2007_ch8)],
and caused customer data to be corrupted [[34](/en/ch8#Melanson2014)].
A popular comment on revelations of such problems is “Use an ACID database if you’re handling
financial data!”—but that misses the point. Even many popular relational database systems (which
are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these
bugs from occurring.

###### Note

Incidentally, much of the banking system relies on text files that are exchanged via secure FTP
[[35](/en/ch8#Kim2014ACH)].
In this context, having an audit trail and some human-level fraud prevention measures is actually
more important than ACID properties.

Those examples also highlight an important point: even if concurrency issues are rare in normal
operation, you have to consider the possibility that an attacker deliberately sends a burst of
highly concurrent requests to your API in an attempt to deliberately exploit concurrency bugs
[[30](/en/ch8#Warszawski2017)]. Therefore, in order to build
applications that are reliable and secure, you have to ensure that such bugs are systematically
prevented.

In this section we will look at several weak (nonserializable) isolation levels that are used in
practice, and discuss in detail what kinds of race conditions can and cannot occur, so that you can
decide what level is appropriate to your application. Once we’ve done that, we will discuss
serializability in detail (see [“Serializability”](/en/ch8#sec_transactions_serializability)). Our discussion of isolation
levels will be informal, using examples. If you want rigorous definitions and analyses of their
properties, you can find them in the academic literature
[[36](/en/ch8#Berenson1995),
[37](/en/ch8#Adya1999),
[38](/en/ch8#Bailis2014virtues_ch8),
[39](/en/ch8#Crooks2017)].

## Read Committed

The most basic level of transaction isolation is *read committed*. It makes two guarantees:

1. When reading from the database, you will only see data that has been committed (no *dirty
   reads*).
2. When writing to the database, you will only overwrite data that has been committed (no *dirty
   writes*).

Some databases support an even weaker isolation level called *read uncommitted*. It prevents dirty
writes, but does not prevent dirty reads. Let’s discuss these two guarantees in more detail.

### No dirty reads

Imagine a transaction has written some data to the database, but the transaction has not yet committed or aborted.
Can another transaction see that uncommitted data? If yes, that is called a
*dirty read* [[3](/en/ch8#Gray1976)].

Transactions running at the read committed isolation level must prevent dirty reads. This means that
any writes by a transaction only become visible to others when that transaction commits (and then
all of its writes become visible at once). This is illustrated in
[Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still
returns the old value, 2, while user 1 has not yet committed.

![ddia 0804](/fig/ddia_0804.png)

###### Figure 8-4. No dirty reads: user 2 sees the new value for *x* only after user 1’s transaction has committed.

There are a few reasons why it’s useful to prevent dirty reads:

* If a transaction needs to update several rows, a dirty read means that another transaction may
  see some of the updates but not others. For example, in [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), the
  user sees the new unread email but not the updated counter. This is a dirty read of the email.
  Seeing the database in a partially updated state is confusing to users and may cause other
  transactions to take incorrect decisions.
* If a transaction aborts, any writes it has made need to be rolled back (like in
  [Figure 8-3](/en/ch8#fig_transactions_atomicity)). If the database allows dirty reads, that means a transaction may
  see data that is later rolled back—i.e., which is never actually committed to the database. Any
  transaction that read uncommitted data would also need to be aborted, leading to a problem called
  *cascading aborts*.

### No dirty writes

What happens if two transactions concurrently try to update the same row in a database? We don’t
know in which order the writes will happen, but we normally assume that the later write overwrites
the earlier write.

However, what happens if the earlier write is part of a transaction that has not yet committed, so
the later write overwrites an uncommitted value? This is called a *dirty write*
[[36](/en/ch8#Berenson1995)]. Transactions running at the read
committed isolation level must prevent dirty writes, usually by delaying the second write until the
first write’s transaction has committed or aborted.

By preventing dirty writes, this isolation level avoids some kinds of concurrency problems:

* If transactions update multiple rows, dirty writes can lead to a bad outcome. For example,
  consider [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), which illustrates a used car sales website on which
  two people, Aaliyah and Bryce, are simultaneously trying to buy the same car. Buying a car requires
  two database writes: the listing on the website needs to be updated to reflect the buyer, and the
  sales invoice needs to be sent to the buyer. In the case of [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), the
  sale is awarded to Bryce (because he performs the winning update to the `listings` table), but the
  invoice is sent to Aaliyah (because she performs the winning update to the `invoices` table). Read
  committed prevents such mishaps.
* However, read committed does *not* prevent the race condition between two counter increments in
  [Figure 8-1](/en/ch8#fig_transactions_increment). In this case, the second write happens after the first transaction
  has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason—in
  [“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update) we will discuss how to make such counter increments safe.

![ddia 0805](/fig/ddia_0805.png)

###### Figure 8-5. With dirty writes, conflicting writes from different transactions can be mixed up.

### Implementing read committed

Read committed is a very popular isolation level. It is the default setting in Oracle Database,
PostgreSQL, SQL Server, and many other databases
[[10](/en/ch8#Bailis2013HAT)].

Most commonly, databases prevent dirty writes by using row-level locks: when a transaction wants to
modify a particular row (or document or some other object), it must first acquire a lock on that
row. It must then hold that lock until the transaction is committed or aborted. Only one transaction
can hold the lock for any given row; if another transaction wants to write to the same row, it must
wait until the first transaction is committed or aborted before it can acquire the lock and
continue. This locking is done automatically by databases in read committed mode (or stronger
isolation levels).

How do we prevent dirty reads? One option would be to use the same lock, and to require any
transaction that wants to read a row to briefly acquire the lock and then release it again
immediately after reading. This would ensure that a read couldn’t happen while a row has a
dirty, uncommitted value (because during that time the lock would be held by the transaction that
has made the write).

However, the approach of requiring read locks does not work well in practice, because one
long-running write transaction can force many other transactions to wait until the long-running
transaction has completed, even if the other transactions only read and do not write anything to the
database. This harms the response time of read-only transactions and is bad for
operability: a slowdown in one part of an application can have a knock-on effect in a completely
different part of the application, due to waiting for locks.

Nevertheless, locks are used to prevent dirty reads in some databases, such as IBM
Db2 and Microsoft SQL Server in the `read_committed_snapshot=off` setting
[[29](/en/ch8#Kleppmann2014)].

A more commonly used approach to preventing dirty reads is the one illustrated in
[Figure 8-4](/en/ch8#fig_transactions_read_committed): for every
row that is written, the database remembers both the old committed value and the new value
set by the transaction that currently holds the write lock. While the transaction is ongoing, any
other transactions that read the row are simply given the old value. Only when the new value is
committed do transactions switch over to reading the new value (see
[“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl) for more detail).

## Snapshot Isolation and Repeatable Read

If you look superficially at read committed isolation, you could be forgiven for thinking that it
does everything that a transaction needs to do: it allows aborts (required for atomicity), it
prevents reading the incomplete results of transactions, and it prevents concurrent writes from
getting intermingled. Indeed, those are useful features, and much stronger guarantees than you can
get from a system that has no transactions.

However, there are still plenty of ways in which you can have concurrency bugs when using this
isolation level. For example, [Figure 8-6](/en/ch8#fig_transactions_item_many_preceders) illustrates a problem that
can occur with read committed.

![ddia 0806](/fig/ddia_0806.png)

###### Figure 8-6. Read skew: Aaliyah observes the database in an inconsistent state.

Say Aaliyah has $1,000 of savings at a bank, split across two accounts with $500 each. Now a
transaction transfers $100 from one of her accounts to the other. If she is unlucky enough to look at her
list of account balances in the same moment as that transaction is being processed, she may see one
account balance at a time before the incoming payment has arrived (with a balance of $500), and the
other account after the outgoing transfer has been made (the new balance being $400). To Aaliyah it
now appears as though she only has a total of $900 in her accounts—it seems that $100 has
vanished into thin air.

This anomaly is called *read skew*, and it is an example of a *nonrepeatable read*:
if Aaliyah were to read the balance of
account 1 again at the end of the transaction, she would see a different value ($600) than she saw
in her previous query. Read skew is considered acceptable under read committed isolation: the
account balances that Aaliyah saw were indeed committed at the time when she read them.

###### Note

The term *skew* is unfortunately overloaded: we previously used it in the sense of an *unbalanced
workload with hot spots* (see [“Skewed Workloads and Relieving Hot Spots”](/en/ch7#sec_sharding_skew)), whereas here it means *timing anomaly*.

In Aaliyah’s case, this is not a lasting problem, because she will most likely see consistent account
balances if she reloads the online banking website a few seconds later. However, some situations
cannot tolerate such temporary inconsistency:

Backups
:   Taking a backup requires making a copy of the entire database, which may take hours on a large
    database. During the time that the backup process is running, writes will continue to be made to
    the database. Thus, you could end up with some parts of the backup containing an older version of
    the data, and other parts containing a newer version. If you need to restore from such a backup,
    the inconsistencies (such as disappearing money) become permanent.

Analytic queries and integrity checks
:   Sometimes, you may want to run a query that scans over large parts of the database. Such queries
    are common in analytics (see [“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics)), or may be part of a periodic integrity
    check that everything is in order (monitoring for data corruption). These queries are likely to
    return nonsensical results if they observe parts of the database at different points in time.

*Snapshot isolation* [[36](/en/ch8#Berenson1995)] is the most common
solution to this problem. The idea is that each transaction reads from a *consistent snapshot* of
the database—that is, the transaction sees all the data that was committed in the database at the
start of the transaction. Even if the data is subsequently changed by another transaction, each
transaction sees only the old data from that particular point in time.

Snapshot isolation is a boon for long-running, read-only queries such as backups and analytics. It
is very hard to reason about the meaning of a query if the data on which it operates is changing at
the same time as the query is executing. When a transaction can see a consistent snapshot of the
database, frozen at a particular point in time, it is much easier to understand.

Snapshot isolation is a popular feature: variants of it are supported by PostgreSQL, MySQL with the
InnoDB storage engine, Oracle, SQL Server, and others, although the detailed behavior varies from
one system to the next [[29](/en/ch8#Kleppmann2014),
[40](/en/ch8#Momjian2014),
[41](/en/ch8#Alvaro2023)].
Some databases, such as Oracle, TiDB, and Aurora DSQL, even choose snapshot isolation as their
highest isolation level.

### Multi-version concurrency control (MVCC)

Like read committed isolation, implementations of snapshot isolation typically use write locks to
prevent dirty writes (see [“Implementing read committed”](/en/ch8#sec_transactions_read_committed_impl)), which means that a transaction
that makes a write can block the progress of another transaction that writes to the same row.
However, reads do not require any locks. From a performance point of view, a key principle of
snapshot isolation is *readers never block writers, and writers never block readers*. This allows a
database to handle long-running read queries on a consistent snapshot at the same time as processing
writes normally, without any lock contention between the two.

To implement snapshot isolation, databases use a generalization of the mechanism we saw for
preventing dirty reads in [Figure 8-4](/en/ch8#fig_transactions_read_committed). Instead of two versions of each row
(the committed version and the overwritten-but-not-yet-committed version), the database must
potentially keep several different committed versions of a row, because various in-progress
transactions may need to see the state of the database at different points in time. Because it
maintains several versions of a row side by side, this technique is known as *multi-version
concurrency control* (MVCC).

[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL
[[40](/en/ch8#Momjian2014),
[42](/en/ch8#Rogov2023),
[43](/en/ch8#Suzuki2017_ch8)] (other implementations are similar).
When a transaction is started, it is given a unique, always-increasing transaction ID (`txid`).
Whenever a transaction writes anything to the database, the data it writes is tagged with the
transaction ID of the writer. (To be precise, transaction IDs in PostgreSQL are 32-bit integers, so
they overflow after approximately 4 billion transactions. The vacuum process performs cleanup to
ensure that overflow does not affect the data.)

![ddia 0807](/fig/ddia_0807.png)

###### Figure 8-7. Implementing snapshot isolation using multi-version concurrency control.

Each row in a table has a `inserted_by` field, containing the ID of the transaction that inserted
this row into the table. Moreover, each row has a `deleted_by` field, which is initially empty. If a
transaction deletes a row, the row isn’t actually removed from the database, but it is marked for
deletion by setting the `deleted_by` field to the ID of the transaction that requested the deletion.
At some later time, when it is certain that no transaction can any longer access the deleted data, a
garbage collection process in the database removes any rows marked for deletion and frees their
space.

An update is internally translated into a delete and a insert
[[44](/en/ch8#Alleti2025)].
For example, in [Figure 8-7](/en/ch8#fig_transactions_mvcc), transaction 13 deducts $100 from account 2, changing the
balance from $500 to $400. The `accounts` table now actually contains two rows for account 2: a row
with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of
$400 which was inserted by transaction 13.

All of the versions of a row are stored within the same database heap (see
[“Storing values within the index”](/en/ch4#sec_storage_index_heap)), regardless of whether the transactions that wrote them have committed
or not. The versions of the same row form a linked list, going either from newest version to oldest
version or the other way round, so that queries can internally iterate over all versions of a row
[[45](/en/ch8#Pavlo2023),
[46](/en/ch8#Wu2017)].

### Visibility rules for observing a consistent snapshot

When a transaction reads from the database, transaction IDs are used to decide which row versions it
can see and which are invisible. By carefully defining visibility rules, the database can present a
consistent snapshot of the database to the application. This works roughly as follows
[[43](/en/ch8#Suzuki2017_ch8)]:

1. At the start of each transaction, the database makes a list of all the other transactions that
   are in progress (not yet committed or aborted) at that time. Any writes that those
   transactions have made are ignored, even if the transactions subsequently commit. This ensures
   that we see a consistent snapshot that is not affected by another transaction committing.
2. Any writes made by transactions with a later transaction ID (i.e., which started after the current
   transaction started, and which are therefore not included in the list of in-progress
   transactions) are ignored, regardless of whether those transactions have committed.
3. Any writes made by aborted transactions are ignored, regardless of when that abort happened.
   This has the advantage that when a transaction aborts, we don’t need to immediately remove the
   rows it wrote from storage, since the visibility rule filters them out. The garbage collection
   process can remove them later.
4. All other writes are visible to the application’s queries.

These rules apply to both insertion and deletion of rows. In [Figure 8-7](/en/ch8#fig_transactions_mvcc), when
transaction 12 reads from account 2, it sees a balance of $500 because the deletion of the $500
balance was made by transaction 13 (according to rule 2, transaction 12 cannot see a deletion made
by transaction 13), and the insertion of the $400 balance is not yet visible (by the same rule).

Put another way, a row is visible if both of the following conditions are true:

* At the time when the reader’s transaction started, the transaction that inserted the row had
  already committed.
* The row is not marked for deletion, or if it is, the transaction that requested deletion had
  not yet committed at the time when the reader’s transaction started.

A long-running transaction may continue using a snapshot for a long time, continuing to read values
that (from other transactions’ point of view) have long been overwritten or deleted. By never
updating values in place but instead inserting a new version every time a value is changed, the
database can provide a consistent snapshot while incurring only a small overhead.

### Indexes and snapshot isolation

How do indexes work in a multi-version database? The most common approach is that each index entry
points at one of the versions of a row that matches the entry (either the oldest or the newest
version). Each row version may contain a reference to the next-oldest or next-newest version. A
query that uses the index must then iterate over the rows to find one that is visible, and where the
value matches what the query is looking for. When garbage collection removes old row versions that
are no longer visible to any transaction, the corresponding index entries can also be removed.

Many implementation details affect the performance of multi-version concurrency control
[[45](/en/ch8#Pavlo2023), [46](/en/ch8#Wu2017)].
For example, PostgreSQL has optimizations for avoiding index updates if different versions of the
same row can fit on the same page [[40](/en/ch8#Momjian2014)].
Some other databases avoid storing full copies of modified rows, and only store differences between
versions to save space.

Another approach is used in CouchDB, Datomic, and LMDB. Although they also use B-trees (see
[“B-Trees”](/en/ch4#sec_storage_b_trees)), they use an *immutable* (copy-on-write) variant that does not overwrite
pages of the tree when they are updated, but instead creates a new copy of each modified page.
Parent pages, up to the root of the tree, are copied and updated to point to the new versions of
their child pages. Any pages that are not affected by a write do not need to be copied, and can be
shared with the new tree [[47](/en/ch8#Prokopov2014)].

With immutable B-trees, every write transaction (or batch of transactions) creates a new B-tree
root, and a particular root is a consistent snapshot of the database at the point in time when it
was created. There is no need to filter out rows based on transaction IDs because subsequent
writes cannot modify an existing B-tree; they can only create new tree roots. This approach also
requires a background process for compaction and garbage collection.

### Snapshot isolation, repeatable read, and naming confusion

MVCC is a commonly used implementation technique for databases, and often it is used to implement
snapshot isolation. However, different databases sometimes use different terms to refer to the same
thing: for example, snapshot isolation is called “repeatable read” in PostgreSQL, and “serializable”
in Oracle [[29](/en/ch8#Kleppmann2014)]. Sometimes different systems
use the same term to mean different things: for example, while in PostgreSQL “repeatable read” means
snapshot isolation, in MySQL it means an implementation of MVCC with weaker consistency than
snapshot isolation [[41](/en/ch8#Alvaro2023)].

The reason for this naming confusion is that the SQL standard doesn’t have the concept of snapshot
isolation, because the standard is based on System R’s 1975 definition of isolation levels
[[3](/en/ch8#Gray1976)] and snapshot isolation hadn’t yet been
invented then. Instead, it defines repeatable read, which looks superficially similar to snapshot
isolation. PostgreSQL calls its snapshot isolation level “repeatable read” because it meets the
requirements of the standard, and so they can claim standards compliance.

Unfortunately, the SQL standard’s definition of isolation levels is flawed—it is ambiguous,
imprecise, and not as implementation-independent as a standard should be
[[36](/en/ch8#Berenson1995)]. Even though several databases
implement repeatable read, there are big differences in the guarantees they actually provide,
despite being ostensibly standardized
[[29](/en/ch8#Kleppmann2014)]. There has been a formal definition of
repeatable read in the research literature [[37](/en/ch8#Adya1999),
[38](/en/ch8#Bailis2014virtues_ch8)], but most implementations don’t satisfy that
formal definition. And to top it off, IBM Db2 uses “repeatable read” to refer to serializability
[[10](/en/ch8#Bailis2013HAT)].

As a result, nobody really knows what repeatable read means.

## Preventing Lost Updates

The read committed and snapshot isolation levels we’ve discussed so far have been primarily about the guarantees
of what a read-only transaction can see in the presence of concurrent writes. We have mostly ignored
the issue of two transactions writing concurrently—we have only discussed dirty writes (see
[“No dirty writes”](/en/ch8#sec_transactions_dirty_write)), one particular type of write-write conflict that can occur.

There are several other interesting kinds of conflicts that can occur between concurrently writing
transactions. The best known of these is the *lost update* problem, illustrated in
[Figure 8-1](/en/ch8#fig_transactions_increment) with the example of two concurrent counter increments.

The lost update problem can occur if an application reads some value from the database, modifies it,
and writes back the modified value (a *read-modify-write cycle*). If two transactions do this
concurrently, one of the modifications can be lost, because the second write does not include the
first modification. (We sometimes say that the later write *clobbers* the earlier write.) This
pattern occurs in various different scenarios:

* Incrementing a counter or updating an account balance (requires reading the current value,
  calculating the new value, and writing back the updated value)
* Making a local change to a complex value, e.g., adding an element to a list within a JSON document
  (requires parsing the document, making the change, and writing back the modified document)
* Two users editing a wiki page at the same time, where each user saves their changes by sending the
  entire page contents to the server, overwriting whatever is currently in the database

Because this is such a common problem, a variety of solutions have been developed
[[48](/en/ch8#Svetlov2025)].

### Atomic write operations

Many databases provide atomic update operations, which remove the need to implement
read-modify-write cycles in application code. They are usually the best solution if your code can be
expressed in terms of those operations. For example, the following instruction is concurrency-safe
in most relational databases:

```
UPDATE counters SET value = value + 1 WHERE key = 'foo';
```

Similarly, document databases such as MongoDB provide atomic operations for making local
modifications to a part of a JSON document, and Redis provides atomic operations for modifying data
structures such as priority queues. Not all writes can easily be expressed in terms of atomic
operations—for example, updates to a wiki page involve arbitrary text editing, which can be handled
using algorithms discussed in [“CRDTs and Operational Transformation”](/en/ch6#sec_replication_crdts)—but in situations where atomic operations
can be used, they are usually the best choice.

Atomic operations are usually implemented by taking an exclusive lock on the object when it is read
so that no other transaction can read it until the update has been applied.
Another option is to simply force all atomic operations to be executed on a single thread.

Unfortunately, object-relational mapping (ORM) frameworks make it easy to accidentally write code
that performs unsafe read-modify-write cycles instead of using atomic operations provided by the
database [[49](/en/ch8#Wiger2010),
[50](/en/ch8#Coglan2020),
[51](/en/ch8#Bailis2015_ch8)].
This can be a source of subtle bugs that are difficult to find by testing.

### Explicit locking

Another option for preventing lost updates, if the database’s built-in atomic operations don’t
provide the necessary functionality, is for the application to explicitly lock objects that are
going to be updated. Then the application can perform a read-modify-write cycle, and if any other
transaction tries to concurrently update or lock the same object, it is forced to wait until the
first read-modify-write cycle has completed.

For example, consider a multiplayer game in which several players can move the same figure
concurrently. In this case, an atomic operation may not be sufficient, because the application also
needs to ensure that a player’s move abides by the rules of the game, which involves some logic that
you cannot sensibly implement as a database query. Instead, you may use a lock to prevent two
players from concurrently moving the same piece, as illustrated in [Example 8-1](/en/ch8#fig_transactions_select_for_update).

##### Example 8-1. Explicitly locking rows to prevent lost updates

```
BEGIN TRANSACTION;

SELECT * FROM figures
  WHERE name = 'robot' AND game_id = 222
  FOR UPDATE; ![1](/fig/1.png)

-- Check whether move is valid, then update the position
-- of the piece that was returned by the previous SELECT.
UPDATE figures SET position = 'c4' WHERE id = 1234;

COMMIT;
```

[![1](/fig/1.png)](/en/ch8#co_transactions_CO1-1)
:   The `FOR UPDATE` clause indicates that the database should take a lock on all rows returned by
    this query.

This works, but to get it right, you need to carefully think about your application logic. It’s easy
to forget to add a necessary lock somewhere in the code, and thus introduce a race condition.

Moreover, if you lock multiple objects there is a risk of deadlock, where two or more transactions
are waiting for each other to release their locks. Many databases automatically detect deadlocks,
and abort one of the involved transactions so that the system can make progress. You can handle this
situation at the application level by retrying the aborted transaction.

### Automatically detecting lost updates

Atomic operations and locks are ways of preventing lost updates by forcing the read-modify-write
cycles to happen sequentially. An alternative is to allow them to execute in parallel and, if the
transaction manager detects a lost update, abort the transaction and force it to retry
its read-modify-write cycle.

An advantage of this approach is that databases can perform this check efficiently in conjunction
with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL
Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort
the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates
[[29](/en/ch8#Kleppmann2014),
[41](/en/ch8#Alvaro2023)].
Some authors [[36](/en/ch8#Berenson1995),
[38](/en/ch8#Bailis2014virtues_ch8)] argue that a database must prevent lost
updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot
isolation under this definition.

Lost update detection is a great feature, because it doesn’t require application code to use any
special database features—you may forget to use a lock or an atomic operation and thus introduce
a bug, but lost update detection happens automatically and is thus less error-prone. However, you
also have to retry aborted transactions at the application level.

### Conditional writes (compare-and-set)

In databases that don’t provide transactions, you sometimes find a *conditional write* operation
that can prevent lost updates by allowing an update to happen only if the value has not changed
since you last read it (previously mentioned in [“Single-object writes”](/en/ch8#sec_transactions_single_object)). If the current
value does not match what you previously read, the update has no effect, and the read-modify-write
cycle must be retried. It is the database equivalent of an atomic *compare-and-set* or
*compare-and-swap* (CAS) instruction that is supported by many CPUs.

For example, to prevent two users concurrently updating the same wiki page, you might try something
like this, expecting the update to occur only if the content of the page hasn’t changed since the
user started editing it:

```
-- This may or may not be safe, depending on the database implementation
UPDATE wiki_pages SET content = 'new content'
  WHERE id = 1234 AND content = 'old content';
```

If the content has changed and no longer matches `'old content'`, this update will have no effect,
so you need to check whether the update took effect and retry if necessary. Instead of comparing the
full content, you could also use a version number column that you increment on every update, and
apply the update only if the current version number hasn’t changed. This approach is sometimes
called *optimistic locking* [[52](/en/ch8#Dogan2020)].

Note that if another transaction has concurrently modified `content`, the new content may not be
visible under the MVCC visibility rules (see [“Visibility rules for observing a consistent snapshot”](/en/ch8#sec_transactions_mvcc_visibility)). Many
implementations of MVCC have an exception to the visibility rules for this scenario, where values
written by other transactions are visible to the evaluation of the `WHERE` clause of `UPDATE` and
`DELETE` queries, even though those writes are not otherwise visible in the snapshot.

### Conflict resolution and replication

In replicated databases (see [Chapter 6](/en/ch6#ch_replication)), preventing lost updates takes on another
dimension: since they have copies of the data on multiple nodes, and the data can potentially be
modified concurrently on different nodes, some additional steps need to be taken to prevent lost
updates.

Locks and conditional write operations assume that there is a single up-to-date copy of the data.
However, databases with multi-leader or leaderless replication usually allow several writes to
happen concurrently and replicate them asynchronously, so they cannot guarantee that there is a
single up-to-date copy of the data. Thus, techniques based on locks or conditional writes do not apply
in this context. (We will revisit this issue in more detail in [“Linearizability”](/en/ch10#sec_consistency_linearizability).)

Instead, as discussed in [“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts), a common approach in such replicated
databases is to allow concurrent writes to create several conflicting versions of a value (also
known as *siblings*), and to use application code or special data structures to resolve and merge
these versions after the fact.

Merging conflicting values can prevent lost updates if the updates are commutative (i.e., you can
apply them in a different order on different replicas, and still get the same result). For example,
incrementing a counter or adding an element to a set are commutative operations. That is the idea
behind CRDTs, which we encountered in [“CRDTs and Operational Transformation”](/en/ch6#sec_replication_crdts). However, some operations such as
conditional writes cannot be made commutative.

On the other hand, the *last write wins* (LWW) conflict resolution method is prone to lost updates,
as discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww). Unfortunately, LWW is the default in many replicated
databases.

## Write Skew and Phantoms

In the previous sections we saw *dirty writes* and *lost updates*, two kinds of race conditions that
can occur when different transactions concurrently try to write to the same objects. In order to
avoid data corruption, those race conditions need to be prevented—either automatically by the
database, or by manual safeguards such as using locks or atomic write operations.

However, that is not the end of the list of potential race conditions that can occur between
concurrent writes. In this section we will see some subtler examples of conflicts.

To begin, imagine this example: you are writing an application for doctors to manage their on-call
shifts at a hospital. The hospital usually tries to have several doctors on call at any one time,
but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if
they are sick themselves), provided that at least one colleague remains on call in that shift
[[53](/en/ch8#Cahill2008),
[54](/en/ch8#Ports2012)].

Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are
feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button
to go off call at approximately the same time. What happens next is illustrated in
[Figure 8-8](/en/ch8#fig_transactions_write_skew).

![ddia 0808](/fig/ddia_0808.png)

###### Figure 8-8. Example of write skew causing an application bug.

In each transaction, your application first checks that two or more doctors are currently on call;
if yes, it assumes it’s safe for one doctor to go off call. Since the database is using snapshot
isolation, both checks return `2`, so both transactions proceed to the next stage. Aaliyah updates her
own record to take herself off call, and Bryce updates his own record likewise. Both transactions
commit, and now no doctor is on call. Your requirement of having at least one doctor on call has
been violated.

### Characterizing write skew

This anomaly is called *write skew* [[36](/en/ch8#Berenson1995)]. It
is neither a dirty write nor a lost update, because the two transactions are updating two different
objects (Aaliyah’s and Bryce’s on-call records, respectively). It is less obvious that a conflict occurred
here, but it’s definitely a race condition: if the two transactions had run one after another, the
second doctor would have been prevented from going off call. The anomalous behavior was only
possible because the transactions ran concurrently.

You can think of write skew as a generalization of the lost update problem. Write skew can occur if two
transactions read the same objects, and then update some of those objects (different transactions
may update different objects). In the special case where different transactions update the same
object, you get a dirty write or lost update anomaly (depending on the timing).

We saw that there are various different ways of preventing lost updates. With write skew, our
options are more restricted:

* Atomic single-object operations don’t help, as multiple objects are involved.
* The automatic detection of lost updates that you find in some implementations of snapshot
  isolation unfortunately doesn’t help either: write skew is not automatically detected in
  PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL
  Server’s snapshot isolation level [[29](/en/ch8#Kleppmann2014)].
  Automatically preventing write skew requires true serializable isolation (see
  [“Serializability”](/en/ch8#sec_transactions_serializability)).
* Some databases allow you to configure constraints, which are then enforced by the database (e.g.,
  uniqueness, foreign key constraints, or restrictions on a particular value). However, in order to
  specify that at least one doctor must be on call, you would need a constraint that involves
  multiple objects. Most databases do not have built-in support for such constraints, but you may be
  able to implement them with triggers or materialized views, as discussed in
  [“Consistency”](/en/ch8#sec_transactions_acid_consistency) [[12](/en/ch8#Andrews2004)].
* If you can’t use a serializable isolation level, the second-best option in this case is probably
  to explicitly lock the rows that the transaction depends on. In the doctors example, you could
  write something like the following:

  ```
  BEGIN TRANSACTION;

  SELECT * FROM doctors
    WHERE on_call = true
    AND shift_id = 1234 FOR UPDATE; ![1](/fig/1.png)

  UPDATE doctors
    SET on_call = false
    WHERE name = 'Aaliyah'
    AND shift_id = 1234;

  COMMIT;
  ```

  [![1](/fig/1.png)](/en/ch8#co_transactions_CO2-1)
  :   As before, `FOR UPDATE` tells the database to lock all rows returned by this query.

### More examples of write skew

Write skew may seem like an esoteric issue at first, but once you’re aware of it, you may notice
more situations in which it can occur. Here are some more examples:

Meeting room booking system
:   Say you want to enforce that there cannot be two bookings for the same meeting room at the same
    time [[55](/en/ch8#Terry1995_ch8)].
    When someone wants to make a booking, you first check for any conflicting bookings (i.e.,
    bookings for the same room with an overlapping time range), and if none are found, you create the
    meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)).

    ##### Example 8-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)

    ```
    BEGIN TRANSACTION;

    -- Check for any existing bookings that overlap with the period of noon-1pm
    SELECT COUNT(*) FROM bookings
      WHERE room_id = 123 AND
        end_time > '2025-01-01 12:00' AND start_time < '2025-01-01 13:00';

    -- If the previous query returned zero:
    INSERT INTO bookings
      (room_id, start_time, end_time, user_id)
      VALUES (123, '2025-01-01 12:00', '2025-01-01 13:00', 666);

    COMMIT;
    ```

    Unfortunately, snapshot isolation does not prevent another user from concurrently inserting a conflicting
    meeting. In order to guarantee you won’t get scheduling conflicts, you once again need serializable
    isolation.

Multiplayer game
:   In [Example 8-1](/en/ch8#fig_transactions_select_for_update), we used a lock to prevent lost updates (that is, making
    sure that two players can’t move the same figure at the same time). However, the lock doesn’t
    prevent players from moving two different figures to the same position on the board or potentially
    making some other move that violates the rules of the game. Depending on the kind of rule you are
    enforcing, you might be able to use a unique constraint, but otherwise you’re vulnerable to write
    skew.

Claiming a username
:   On a website where each user has a unique username, two users may try to create accounts with the
    same username at the same time. You may use a transaction to check whether a name is taken and, if
    not, create an account with that name. However, like in the previous examples, that is not safe
    under snapshot isolation. Fortunately, a unique constraint is a simple solution here (the second
    transaction that tries to register the username will be aborted due to violating the constraint).

Preventing double-spending
:   A service that allows users to spend money or points needs to check that a user doesn’t spend more
    than they have. You might implement this by inserting a tentative spending item into a user’s
    account, listing all the items in the account, and checking that the sum is positive.
    With write skew, it could happen that two spending items are inserted concurrently that together
    cause the balance to go negative, but that neither transaction notices the other.

### Phantoms causing write skew

All of these examples follow a similar pattern:

1. A `SELECT` query checks whether some requirement is satisfied by searching for rows that
   match some search condition (there are at least two doctors on call, there are no existing
   bookings for that room at that time, the position on the board doesn’t already have another
   figure on it, the username isn’t already taken, there is still money in the account).
2. Depending on the result of the first query, the application code decides how to continue (perhaps
   to go ahead with the operation, or perhaps to report an error to the user and abort).
3. If the application decides to go ahead, it makes a write (`INSERT`, `UPDATE`, or `DELETE`) to the
   database and commits the transaction.

   The effect of this write changes the precondition of the decision of step 2. In other words, if you
   were to repeat the `SELECT` query from step 1 after committing the write, you would get a different
   result, because the write changed the set of rows matching the search condition (there is now one
   fewer doctor on call, the meeting room is now booked for that time, the position on the board is now
   taken by the figure that was moved, the username is now taken, there is now less money in the
   account).

The steps may occur in a different order. For example, you could first make the write, then the
`SELECT` query, and finally decide whether to abort or commit based on the result of the query.

In the case of the doctor on call example, the row being modified in step 3 was one of the rows
returned in step 1, so we could make the transaction safe and avoid write skew by locking the rows
in step 1 (`SELECT FOR UPDATE`). However, the other four examples are different: they check for the
*absence* of rows matching some search condition, and the write *adds* a row matching the same
condition. If the query in step 1 doesn’t return any rows, `SELECT FOR UPDATE` can’t attach locks to
anything [[56](/en/ch8#Schoenig2021)].

This effect, where a write in one transaction changes the result of a search query in another
transaction, is called a *phantom* [[4](/en/ch8#Eswaran1976)].
Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the
examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL
generated by ORMs is also prone to write skew
[[50](/en/ch8#Coglan2020),
[51](/en/ch8#Bailis2015_ch8)].

### Materializing conflicts

If the problem of phantoms is that there is no object to which we can attach the locks, perhaps we
can artificially introduce a lock object into the database?

For example, in the meeting room booking case you could imagine creating a table of time slots and
rooms. Each row in this table corresponds to a particular room for a particular time period (say, 15
minutes). You create rows for all possible combinations of rooms and time periods ahead of time,
e.g. for the next six months.

Now a transaction that wants to create a booking can lock (`SELECT FOR UPDATE`) the rows in the
table that correspond to the desired room and time period. After it has acquired the locks, it can
check for overlapping bookings and insert a new booking as before. Note that the additional table
isn’t used to store information about the booking—it’s purely a collection of locks which is used
to prevent bookings on the same room and time range from being modified concurrently.

This approach is called *materializing conflicts*, because it takes a phantom and turns it into a
lock conflict on a concrete set of rows that exist in the database
[[14](/en/ch8#Fekete2005)]. Unfortunately, it can be hard and
error-prone to figure out how to materialize conflicts, and it’s ugly to let a concurrency control
mechanism leak into the application data model. For those reasons, materializing conflicts should be
considered a last resort if no alternative is possible. A serializable isolation level is much
preferable in most cases.

# Serializability

In this chapter we have seen several examples of transactions that are prone to race conditions.
Some race conditions are prevented by the read committed and snapshot isolation levels, but
others are not. We encountered some particularly tricky examples with write skew and phantoms. It’s
a sad situation:

* Isolation levels are hard to understand, and inconsistently implemented in different databases
  (e.g., the meaning of “repeatable read” varies significantly).
* If you look at your application code, it’s difficult to tell whether it is safe to run at a
  particular isolation level—especially in a large application, where you might not be aware of
  all the things that may be happening concurrently.
* There are no good tools to help us detect race conditions. In principle, static analysis may
  help [[33](/en/ch8#Jorwekar2007_ch8)], but research techniques have not
  yet found their way into practical use. Testing for concurrency issues is hard, because they are
  usually nondeterministic—problems only occur if you get unlucky with the timing.

This is not a new problem—it has been like this since the 1970s, when weak isolation levels were
first introduced [[3](/en/ch8#Gray1976)]. All along, the answer
from researchers has been simple: use *serializable* isolation!

Serializable isolation is the strongest isolation level. It guarantees that even
though transactions may execute in parallel, the end result is the same as if they had executed one
at a time, *serially*, without any concurrency. Thus, the database guarantees that if the
transactions behave correctly when run individually, they continue to be correct when run
concurrently—in other words, the database prevents *all* possible race conditions.

But if serializable isolation is so much better than the mess of weak isolation levels, then why
isn’t everyone using it? To answer this question, we need to look at the options for implementing
serializability, and how they perform. Most databases that provide serializability today use one of
three techniques, which we will explore in the rest of this chapter:

* Literally executing transactions in a serial order (see [“Actual Serial Execution”](/en/ch8#sec_transactions_serial))
* Two-phase locking (see [“Two-Phase Locking (2PL)”](/en/ch8#sec_transactions_2pl)), which for several decades was the only viable
  option
* Optimistic concurrency control techniques such as serializable snapshot isolation (see
  [“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi))

## Actual Serial Execution

The simplest way of avoiding concurrency problems is to remove the concurrency entirely: to
execute only one transaction at a time, in serial order, on a single thread. By doing so, we completely
sidestep the problem of detecting and preventing conflicts between transactions: the resulting
isolation is by definition serializable.

Even though this seems like an obvious idea, it was only in the 2000s that database designers
decided that a single-threaded loop for executing transactions was feasible
[[57](/en/ch8#Stonebraker2007_ch8)].
If multi-threaded concurrency was considered essential for getting good performance during the
previous 30 years, what changed to make single-threaded execution possible?

Two developments caused this rethink:

* RAM became cheap enough that for many use cases it is now feasible to keep the entire
  active dataset in memory (see [“Keeping everything in memory”](/en/ch4#sec_storage_inmemory)). When all data that a transaction needs to
  access is in memory, transactions can execute much faster than if they have to wait for data to be
  loaded from disk.
* Database designers realized that OLTP transactions are usually short and only make a small number
  of reads and writes (see [“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics)). By contrast, long-running analytic queries
  are typically read-only, so they can be run on a consistent snapshot (using snapshot isolation)
  outside of the serial execution loop.

The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic,
for example [[58](/en/ch8#Hugg2014streaming),
[59](/en/ch8#Kallman2008),
[60](/en/ch8#Hickey2012)].
A system designed for single-threaded execution can sometimes perform better than a system that
supports concurrency, because it can avoid the coordination overhead of locking. However, its
throughput is limited to that of a single CPU core. In order to make the most of that single thread,
transactions need to be structured differently from their traditional form.

### Encapsulating transactions in stored procedures

In the early days of databases, the intention was that a database transaction could encompass an
entire flow of user activity. For example, booking an airline ticket is a multi-stage process
(searching for routes, fares, and available seats; deciding on an itinerary; booking seats on
each of the flights of the itinerary; entering passenger details; making payment). Database
designers thought that it would be neat if that entire process was one transaction so that it could
be committed atomically.

Unfortunately, humans are very slow to make up their minds and respond. If a database transaction
needs to wait for input from a user, the database needs to support a potentially huge number of
concurrent transactions, most of them idle. Most databases cannot do that efficiently, and so almost
all OLTP applications keep transactions short by avoiding interactively waiting for a user within a
transaction. On the web, this means that a transaction is committed within the same HTTP request—a
transaction does not span multiple requests. A new HTTP request starts a new transaction.

Even though the human has been taken out of the critical path, transactions have continued to be
executed in an interactive client/server style, one statement at a time. An application makes a
query, reads the result, perhaps makes another query depending on the result of the first query, and
so on. The queries and results are sent back and forth between the application code (running on one
machine) and the database server (on another machine).

In this interactive style of transaction, a lot of time is spent in network communication between
the application and the database. If you were to disallow concurrency in the database and only
process one transaction at a time, the throughput would be dreadful because the database would
spend most of its time waiting for the application to issue the next query for the current
transaction. In this kind of database, it’s necessary to process multiple transactions concurrently
in order to get reasonable performance.

For this reason, systems with single-threaded serial transaction processing don’t allow interactive
multi-statement transactions. Instead, the application must either limit itself to transactions
containing a single statement, or submit the entire transaction code to the database ahead of time,
as a *stored procedure* [[61](/en/ch8#Hugg2014debunking)].

The differences between interactive transactions and stored procedures is illustrated in
[Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the
stored procedure can execute very quickly, without waiting for any network or disk I/O.

![ddia 0809](/fig/ddia_0809.png)

###### Figure 8-9. The difference between an interactive transaction and a stored procedure (using the example transaction of [Figure 8-8](/en/ch8#fig_transactions_write_skew)).

### Pros and cons of stored procedures

Stored procedures have existed for some time in relational databases, and they have been part of the
SQL standard (SQL/PSM) since 1999. They have gained a somewhat bad reputation, for various reasons:

* Traditionally, each database vendor had its own language for stored procedures (Oracle has PL/SQL, SQL Server
  has T-SQL, PostgreSQL has PL/pgSQL, etc.). These languages haven’t kept up with developments in
  general-purpose programming languages, so they look quite ugly and archaic from today’s point of
  view, and they lack the ecosystem of libraries that you find with most programming languages.
* Code running in a database is difficult to manage: compared to an application server, it’s harder
  to debug, more awkward to keep in version control and deploy, trickier to test, and difficult to
  integrate with a metrics collection system for monitoring.
* A database is often much more performance-sensitive than an application server, because a single
  database instance is often shared by many application servers. A badly written stored procedure
  (e.g., using a lot of memory or CPU time) in a database can cause much more trouble than equivalent
  badly written code in an application server.
* In a multitenant system that allows tenants to write their own stored procedures, it’s a security
  risk to execute untrusted code in the same process as the database kernel
  [[62](/en/ch8#Zhou2025)].

However, those issues can be overcome. Modern implementations of stored procedures have abandoned
PL/SQL and use existing general-purpose programming languages instead: VoltDB uses Java or Groovy,
Datomic uses Java or Clojure, Redis uses Lua, and MongoDB uses Javascript.

Stored procedures are also useful in cases where application logic can’t easily be embedded
elsewhere. Applications that use GraphQL, for example, might directly expose their database through
a GraphQL proxy. If the proxy doesn’t support complex validation logic, you can embed such logic
directly in the database using a stored procedure. If the database doesn’t support stored
procedures, you would have to deploy a validation service between the proxy and the database to do
validation.

With stored procedures and in-memory data, executing all transactions on a single thread becomes
feasible. When stored procedures don’t need to wait for I/O and avoid the overhead of other
concurrency control mechanisms, they can achieve quite good throughput on a single thread.

VoltDB also uses stored procedures for replication: instead of copying a transaction’s writes from
one node to another, it executes the same stored procedure on each replica. VoltDB therefore
requires that stored procedures are *deterministic* (when run on different nodes, they must produce
the same result). If a transaction needs to use the current date and time, for example, it must do
so through special deterministic APIs (see [“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows) for more details on
deterministic operations). This approach is called *state machine replication*, and we will return
to it in [Chapter 10](/en/ch10#ch_consistency).

### Sharding

Executing all transactions serially makes concurrency control much simpler, but limits the
transaction throughput of the database to the speed of a single CPU core on a single machine.
Read-only transactions may execute elsewhere, using snapshot isolation, but for applications with
high write throughput, the single-threaded transaction processor can become a serious bottleneck.

In order to scale to multiple CPU cores, and multiple nodes, you can shard your data
(see [Chapter 7](/en/ch7#ch_sharding)), which is supported in VoltDB. If you can find a way of sharding your dataset
so that each transaction only needs to read and write data within a single shard, then each shard
can have its own transaction processing thread running independently from the others. In this case,
you can give each CPU core its own shard, which allows your transaction throughput to scale linearly
with the number of CPU cores [[59](/en/ch8#Kallman2008)].

However, for any transaction that needs to access multiple shards, the database must coordinate the
transaction across all the shards that it touches. The stored procedure needs to be performed in
lock-step across all shards to ensure serializability across the whole system.

Since cross-shard transactions have additional coordination overhead, they are vastly slower than
single-shard transactions. VoltDB reports a throughput of about 1,000 cross-shard writes per second,
which is orders of magnitude below its single-shard throughput and cannot be increased by adding
more machines [[61](/en/ch8#Hugg2014debunking)]. More recent research
has explored ways of making multi-shard transactions more scalable
[[63](/en/ch8#Zhou2022)].

Whether transactions can be single-shard depends very much on the structure of the data used by the
application. Simple key-value data can often be sharded very easily, but data with multiple
secondary indexes is likely to require a lot of cross-shard coordination (see
[“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes)).

### Summary of serial execution

Serial execution of transactions has become a viable way of achieving serializable isolation within
certain constraints:

* Every transaction must be small and fast, because it takes only one slow transaction to stall all
  transaction processing.
* It is most appropriate in situations where the active dataset can fit in memory. Rarely accessed
  data could potentially be moved to disk, but if it needed to be accessed in a single-threaded
  transaction, the system would get very slow.
* Write throughput must be low enough to be handled on a single CPU core, or else transactions need
  to be sharded without requiring cross-shard coordination.
* Cross-shard transactions are possible, but their throughput is hard to scale.

## Two-Phase Locking (2PL)

For around 30 years, there was only one widely used algorithm for serializability in databases:
*two-phase locking* (2PL), sometimes called *strong strict two-phase locking* (SS2PL) to distinguish
it from other variants of 2PL.

# 2PL is not 2PC

Two-phase *locking* (2PL) and two-phase *commit* (2PC) are two very different things. 2PL provides
serializable isolation, whereas 2PC provides atomic commit in a distributed database (see
[“Two-Phase Commit (2PC)”](/en/ch8#sec_transactions_2pc)). To avoid confusion, it’s best to think of them as entirely separate
concepts and to ignore the unfortunate similarity in the names.

We saw previously that locks are often used to prevent dirty writes (see
[“No dirty writes”](/en/ch8#sec_transactions_dirty_write)): if two transactions concurrently try to write to the same object,
the lock ensures that the second writer must wait until the first one has finished its transaction
(aborted or committed) before it may continue.

Two-phase locking is similar, but makes the lock requirements much stronger. Several transactions
are allowed to concurrently read the same object as long as nobody is writing to it. But as soon as
anyone wants to write (modify or delete) an object, exclusive access is required:

* If transaction A has read an object and transaction B wants to write to that object, B must wait
  until A commits or aborts before it can continue. (This ensures that B can’t change the object
  unexpectedly behind A’s back.)
* If transaction A has written an object and transaction B wants to read that object, B must wait
  until A commits or aborts before it can continue. (Reading an old version of the object, like in
  [Figure 8-4](/en/ch8#fig_transactions_read_committed), is not acceptable under 2PL.)

In 2PL, writers don’t just block other writers; they also block readers and vice
versa. Snapshot isolation has the mantra *readers never block writers, and writers never block
readers* (see [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl)), which captures this key difference between
snapshot isolation and two-phase locking. On the other hand, because 2PL provides serializability,
it protects against all the race conditions discussed earlier, including lost updates and write skew.

### Implementation of two-phase locking

2PL is used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and the
repeatable read isolation level in Db2
[[29](/en/ch8#Kleppmann2014)].

The blocking of readers and writers is implemented by having a lock on each object in the
database. The lock can either be in *shared mode* or in *exclusive mode* (also known as a
*multi-reader single-writer* lock). The lock is used as follows:

* If a transaction wants to read an object, it must first acquire the lock in shared mode. Several
  transactions are allowed to hold the lock in shared mode simultaneously, but if another
  transaction already has an exclusive lock on the object, these transactions must wait.
* If a transaction wants to write to an object, it must first acquire the lock in exclusive mode. No
  other transaction may hold the lock at the same time (either in shared or in exclusive mode), so
  if there is any existing lock on the object, the transaction must wait.
* If a transaction first reads and then writes an object, it may upgrade its shared lock to an
  exclusive lock. The upgrade works the same as getting an exclusive lock directly.
* After a transaction has acquired the lock, it must continue to hold the lock until the end of the
  transaction (commit or abort). This is where the name “two-phase” comes from: the first phase
  (while the transaction is executing) is when the locks are acquired, and the second phase (at the
  end of the transaction) is when all the locks are released.

Since so many locks are in use, it can happen quite easily that transaction A is stuck waiting for
transaction B to release its lock, and vice versa. This situation is called *deadlock*. The database
automatically detects deadlocks between transactions and aborts one of them so that the others can
make progress. The aborted transaction needs to be retried by the application.

### Performance of two-phase locking

The big downside of two-phase locking, and the reason why it hasn’t been used by everybody since the
1970s, is performance: transaction throughput and response times of queries are significantly worse
under two-phase locking than under weak isolation.

This is partly due to the overhead of acquiring and releasing all those locks, but more importantly
due to reduced concurrency. By design, if two concurrent transactions try to do anything that may
in any way result in a race condition, one has to wait for the other to complete.

For example, if you have a transaction that needs to read an entire table (e.g. a backup, analytics
query, or integrity check, as discussed in [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation)), that
transaction has to take a shared lock on the entire table. Therefore, the reading transaction first
has to wait until all in-progress transactions writing to that table have completed; then, while the
whole table is being read (which may take a long time on a large table), all other transactions that
want to write to that table are blocked until the big read-only transaction commits. In effect, the
database becomes unavailable for writes for an extended time.

For this reason, databases running 2PL can have quite unstable latencies, and they can be very slow at
high percentiles (see [“Describing Performance”](/en/ch2#sec_introduction_percentiles)) if there is contention in the workload. It
may take just one slow transaction, or one transaction that accesses a lot of data and acquires many
locks, to cause the rest of the system to grind to a halt.

Although deadlocks can happen with the lock-based read committed isolation level, they occur much
more frequently under 2PL serializable isolation (depending on the access patterns of your
transaction). This can be an additional performance problem: when a transaction is aborted due to
deadlock and is retried, it needs to do its work all over again. If deadlocks are frequent, this can
mean significant wasted effort.

### Predicate locks

In the preceding description of locks, we glossed over a subtle but important detail. In
[“Phantoms causing write skew”](/en/ch8#sec_transactions_phantom) we discussed the problem of *phantoms*—that is, one transaction
changing the results of another transaction’s search query. A database with serializable isolation
must prevent phantoms.

In the meeting room booking example this means that if one transaction has searched for existing
bookings for a room within a certain time window (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)), another
transaction is not allowed to concurrently insert or update another booking for the same room and
time range. (It’s okay to concurrently insert bookings for other rooms, or for the same room at a
different time that doesn’t affect the proposed booking.)

How do we implement this? Conceptually, we need a *predicate lock*
[[4](/en/ch8#Eswaran1976)]. It works similarly to the
shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one
row in a table), it belongs to all objects that match some search condition, such as:

```
SELECT * FROM bookings
  WHERE room_id = 123 AND
    end_time   > '2025-01-01 12:00' AND
    start_time < '2025-01-01 13:00';
```

A predicate lock restricts access as follows:

* If transaction A wants to read objects matching some condition, like in that `SELECT` query, it
  must acquire a shared-mode predicate lock on the conditions of the query. If another transaction B
  currently has an exclusive lock on any object matching those conditions, A must wait until B
  releases its lock before it is allowed to make its query.
* If transaction A wants to insert, update, or delete any object, it must first check whether either the old
  or the new value matches any existing predicate lock. If there is a matching predicate lock held by
  transaction B, then A must wait until B has committed or aborted before it can continue.

The key idea here is that a predicate lock applies even to objects that do not yet exist in the
database, but which might be added in the future (phantoms). If two-phase locking includes predicate locks,
the database prevents all forms of write skew and other race conditions, and so its isolation
becomes serializable.

### Index-range locks

Unfortunately, predicate locks do not perform well: if there are many locks by active transactions,
checking for matching locks becomes time-consuming. For that reason, most databases with 2PL
actually implement *index-range locking* (also known as *next-key locking*), which is a simplified
approximation of predicate locking [[54](/en/ch8#Ports2012),
[64](/en/ch8#Hellerstein2007_ch8)].

It’s safe to simplify a predicate by making it match a greater set of objects. For example, if you
have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by
locking bookings for room 123 at any time, or you can approximate it by locking all rooms (not just
room 123) between noon and 1 p.m. This is safe because any write that matches the original predicate
will definitely also match the approximations.

In the room bookings database you would probably have an index on the `room_id` column, and/or
indexes on `start_time` and `end_time` (otherwise the preceding query would be very slow on a large
database):

* Say your index is on `room_id`, and the database uses this index to find existing bookings for
  room 123. Now the database can simply attach a shared lock to this index entry, indicating that a
  transaction has searched for bookings of room 123.
* Alternatively, if the database uses a time-based index to find existing bookings, it can attach a
  shared lock to a range of values in that index, indicating that a transaction has searched for
  bookings that overlap with the time period of noon to 1 p.m. on January 1, 2025.

Either way, an approximation of the search condition is attached to one of the indexes. Now, if
another transaction wants to insert, update, or delete a booking for the same room and/or an
overlapping time period, it will have to update the same part of the index. In the process of doing
so, it will encounter the shared lock, and it will be forced to wait until the lock is released.

This provides effective protection against phantoms and write skew. Index-range locks are not as
precise as predicate locks would be (they may lock a bigger range of objects than is strictly
necessary to maintain serializability), but since they have much lower overheads, they are a good
compromise.

If there is no suitable index where a range lock can be attached, the database can fall back to a
shared lock on the entire table. This will not be good for performance, since it will stop all
other transactions writing to the table, but it’s a safe fallback position.

## Serializable Snapshot Isolation (SSI)

This chapter has painted a bleak picture of concurrency control in databases. On the one hand, we
have implementations of serializability that don’t perform well (two-phase locking) or don’t scale
well (serial execution). On the other hand, we have weak isolation levels that have good
performance, but are prone to various race conditions (lost updates, write skew, phantoms, etc.). Are
serializable isolation and good performance fundamentally at odds with each other?

It seems not: an algorithm called *serializable snapshot isolation* (SSI) provides full
serializability with only a small performance penalty compared to snapshot isolation. SSI is
comparatively new: it was first described in 2008
[[53](/en/ch8#Cahill2008),
[65](/en/ch8#Cahill2009)].

Today SSI and similar algorithms are used in single-node databases (the serializable isolation level
in PostgreSQL [[54](/en/ch8#Ports2012)], SQL Server’s In-Memory
OLTP/Hekaton [[66](/en/ch8#Diaconu2013)], and HyPer
[[67](/en/ch8#Neumann2015)]),
distributed databases (CockroachDB [[5](/en/ch8#Taft2020_ch8)] and
FoundationDB [[8](/en/ch8#Zhou2021_ch8)]), and embedded storage
engines such as BadgerDB.

### Pessimistic versus optimistic concurrency control

Two-phase locking is a so-called *pessimistic* concurrency control mechanism: it is based on the
principle that if anything might possibly go wrong (as indicated by a lock held by another
transaction), it’s better to wait until the situation is safe again before doing anything. It is
like *mutual exclusion*, which is used to protect data structures in multi-threaded programming.

Serial execution is, in a sense, pessimistic to the extreme: it is essentially equivalent to each
transaction having an exclusive lock on the entire database (or one shard of the database) for the
duration of the transaction. We compensate for the pessimism by making each transaction very fast to
execute, so it only needs to hold the “lock” for a short time.

By contrast, serializable snapshot isolation is an *optimistic* concurrency control technique.
Optimistic in this context means that instead of blocking if something potentially dangerous
happens, transactions continue anyway, in the hope that everything will turn out all right. When a
transaction wants to commit, the database checks whether anything bad happened (i.e., whether
isolation was violated); if so, the transaction is aborted and has to be retried. Only transactions
that executed serializably are allowed to commit.

Optimistic concurrency control is an old idea
[[68](/en/ch8#Badal1979)],
and its advantages and disadvantages have been debated for a long time
[[69](/en/ch8#Agrawal1987)].
It performs badly if there is high contention (many transactions trying to access the same objects),
as this leads to a high proportion of transactions needing to abort. If the system is already close
to its maximum throughput, the additional transaction load from retried transactions can make
performance worse.

However, if there is enough spare capacity, and if contention between transactions is not too high,
optimistic concurrency control techniques tend to perform better than pessimistic ones. Contention
can be reduced with commutative atomic operations: for example, if several transactions concurrently
want to increment a counter, it doesn’t matter in which order the increments are applied (as long as
the counter isn’t read in the same transaction), so the concurrent increments can all be applied
without conflicting.

As the name suggests, SSI is based on snapshot isolation—that is, all reads within a transaction
are made from a consistent snapshot of the database (see [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation)).
On top of snapshot isolation, SSI adds an algorithm for detecting serialization conflicts among
reads and writes, and determining which transactions to abort.

### Decisions based on an outdated premise

When we previously discussed write skew in snapshot isolation (see [“Write Skew and Phantoms”](/en/ch8#sec_transactions_write_skew)),
we observed a recurring pattern: a transaction reads some data from the database, examines the
result of the query, and decides to take some action (write to the database) based on the result
that it saw. However, under snapshot isolation, the result from the original query may no longer be
up-to-date by the time the transaction commits, because the data may have been modified in the
meantime.

Put another way, the transaction is taking an action based on a *premise* (a fact that was true at
the beginning of the transaction, e.g., “There are currently two doctors on call”). Later, when the
transaction wants to commit, the original data may have changed—the premise may no longer be
true.

When the application makes a query (e.g., “How many doctors are currently on call?”), the database
doesn’t know how the application logic uses the result of that query. To be safe, the database needs
to assume that any change in the query result (the premise) means that writes in that transaction
may be invalid. In other words, there may be a causal dependency between the queries and the writes
in the transaction. In order to provide serializable isolation, the database must detect situations
in which a transaction may have acted on an outdated premise and abort the transaction in that case.

How does the database know if a query result might have changed? There are two cases to consider:

* Detecting reads of a stale MVCC object version (uncommitted write occurred before the read)
* Detecting writes that affect prior reads (the write occurs after the read)

### Detecting stale MVCC reads

Recall that snapshot isolation is usually implemented by multi-version concurrency control (MVCC;
see [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl)). When a transaction reads from a consistent snapshot in an
MVCC database, it ignores writes that were made by any other transactions that hadn’t yet committed
at the time when the snapshot was taken.

In [Figure 8-10](/en/ch8#fig_transactions_detect_mvcc), transaction 43 sees
Aaliyah as having `on_call = true`, because transaction 42 (which modified Aaliyah’s on-call status) is
uncommitted. However, by the time transaction 43 wants to commit, transaction 42 has already
committed. This means that the write that was ignored when reading from the consistent snapshot has
now taken effect, and transaction 43’s premise is no longer true. Things get even more complicated
when a writer inserts data that didn’t exist before (see [“Phantoms causing write skew”](/en/ch8#sec_transactions_phantom)). We’ll
discuss detecting phantom writes for SSI in [“Detecting writes that affect prior reads”](/en/ch8#sec_detecting_writes_affect_reads).

![ddia 0810](/fig/ddia_0810.png)

###### Figure 8-10. Detecting when a transaction reads outdated values from an MVCC snapshot.

In order to prevent this anomaly, the database needs to track when a transaction ignores another
transaction’s writes due to MVCC visibility rules. When the transaction wants to commit, the
database checks whether any of the ignored writes have now been committed. If so, the transaction
must be aborted.

Why wait until committing? Why not abort transaction 43 immediately when the stale read is detected?
Well, if transaction 43 was a read-only transaction, it wouldn’t need to be aborted, because there
is no risk of write skew. At the time when transaction 43 makes its read, the database doesn’t yet
know whether that transaction is going to later perform a write. Moreover, transaction 42 may yet
abort or may still be uncommitted at the time when transaction 43 is committed, and so the read may
turn out not to have been stale after all. By avoiding unnecessary aborts, SSI preserves snapshot
isolation’s support for long-running reads from a consistent snapshot.

### Detecting writes that affect prior reads

The second case to consider is when another transaction modifies data after it has been read. This
case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range).

![ddia 0811](/fig/ddia_0811.png)

###### Figure 8-11. In serializable snapshot isolation, detecting when one transaction modifies another transaction’s reads.

In the context of two-phase locking we discussed index-range locks (see
[“Index-range locks”](/en/ch8#sec_transactions_2pl_range)), which allow the database to lock access to all rows matching some
search query, such as `WHERE shift_id = 1234`. We can use a similar technique here, except that SSI
locks don’t block other transactions.

In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transactions 42 and 43 both search for on-call doctors
during shift `1234`. If there is an index on `shift_id`, the database can use the index entry 1234 to
record the fact that transactions 42 and 43 read this data. (If there is no index, this information
can be tracked at the table level.) This information only needs to be kept for a while: after a
transaction has finished (committed or aborted), and all concurrent transactions have finished, the
database can forget what data it read.

When a transaction writes to the database, it must look in the indexes for any other transactions
that have recently read the affected data. This process is similar to acquiring a write lock on the affected
key range, but rather than blocking until the readers have committed, the lock acts as a tripwire:
it simply notifies the transactions that the data they read may no longer be up to date.

In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transaction 43 notifies transaction 42 that its prior
read is outdated, and vice versa. Transaction 42 is first to commit, and it is successful: although
transaction 43’s write affected 42, 43 hasn’t yet committed, so the write has not yet taken effect.
However, when transaction 43 wants to commit, the conflicting write from 42 has already been
committed, so 43 must abort.

### Performance of serializable snapshot isolation

As always, many engineering details affect how well an algorithm works in practice. For example, one
trade-off is the granularity at which transactions’ reads and writes are tracked. If the database
keeps track of each transaction’s activity in great detail, it can be precise about which
transactions need to abort, but the bookkeeping overhead can become significant. Less detailed
tracking is faster, but may lead to more transactions being aborted than strictly necessary.

In some cases, it’s okay for a transaction to read information that was overwritten by another
transaction: depending on what else happened, it’s sometimes possible to prove that the result of
the execution is nevertheless serializable. PostgreSQL uses this theory to reduce the number of
unnecessary aborts [[14](/en/ch8#Fekete2005),
[54](/en/ch8#Ports2012)].

Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one
transaction doesn’t need to block waiting for locks held by another transaction. Like under snapshot
isolation, writers don’t block readers, and vice versa. This design principle makes query latency
much more predictable and less variable. In particular, read-only queries can run on a consistent
snapshot without requiring any locks, which is very appealing for read-heavy workloads.

Compared to serial execution, serializable snapshot isolation is not limited to the throughput of a
single CPU core: for example, FoundationDB distributes the detection of serialization conflicts across multiple
machines, allowing it to scale to very high throughput. Even though data may be sharded across
multiple machines, transactions can read and write data in multiple shards while ensuring
serializable isolation.

Compared to non-serializable snapshot isolation, the need to check for serializability violations
introduces some performance overheads. How significant these overheads are is a matter of debate:
some believe that serializability checking is not worth it
[[70](/en/ch8#Brooker2024snapshot)],
while others believe that the performance of serializability is now so good that there is no need to
use the weaker snapshot isolation any more [[67](/en/ch8#Neumann2015)].

The rate of aborts significantly affects the overall performance of SSI. For example, a transaction
that reads and writes data over a long period of time is likely to run into conflicts and abort, so
SSI requires that read-write transactions be fairly short (long-running read-only transactions are
okay). However, SSI is less sensitive to slow transactions than two-phase locking or serial
execution.

# Distributed Transactions

The last few sections have focused on concurrency control for isolation, the I in ACID. The
algorithms we have seen apply to both single-node and distributed databases: although there are
challenges in making concurrency control algorithms scalable (for example, performing distributed
serializability checking for SSI), the high-level ideas for distributed concurrency control are
similar to single-node concurrency control
[[8](/en/ch8#Zhou2021_ch8)].

Consistency and durability also don’t change much when we move to distributed transactions. However,
atomicity requires more care.

For transactions that execute at a single database node, atomicity is commonly implemented by the
storage engine. When the client asks the database node to commit the transaction, the database makes
the transaction’s writes durable (typically in a write-ahead log; see [“Making B-trees reliable”](/en/ch4#sec_storage_btree_wal)) and
then appends a commit record to the log on disk. If the database crashes in the middle of this
process, the transaction is recovered from the log when the node restarts: if the commit record was
successfully written to disk before the crash, the transaction is considered committed; if not, any
writes from that transaction are rolled back.

Thus, on a single node, transaction commitment crucially depends on the *order* in which data is
durably written to disk: first the data, then the commit record
[[22](/en/ch8#Pillai2014)].
The key deciding moment for whether the transaction commits or aborts is the moment at which the
disk finishes writing the commit record: before that moment, it is still possible to abort (due to a
crash), but after that moment, the transaction is committed (even if the database crashes). Thus, it
is a single device (the controller of one particular disk drive, attached to one particular node)
that makes the commit atomic.

However, what if multiple nodes are involved in a transaction? For example, perhaps you have a
multi-object transaction in a sharded database, or a global secondary index (in which the
index entry may be on a different node from the primary data; see
[“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes)). Most “NoSQL” distributed datastores do not support such
distributed transactions, but various distributed relational databases do.

In these cases, it is not sufficient to simply send a commit request to all of the nodes and
independently commit the transaction on each one. It could easily happen that the commit succeeds on
some nodes and fails on other nodes, as shown in [Figure 8-12](/en/ch8#fig_transactions_non_atomic):

* Some nodes may detect a constraint violation or conflict, making an abort necessary, while other
  nodes are successfully able to commit.
* Some of the commit requests might be lost in the network, eventually aborting due to a timeout,
  while other commit requests get through.
* Some nodes may crash before the commit record is fully written and roll back on recovery, while
  others successfully commit.

![ddia 0812](/fig/ddia_0812.png)

###### Figure 8-12. When a transaction involves multiple database nodes, it may commit on some and fail on others.

If some nodes commit the transaction but others abort it, the nodes become inconsistent with each
other. And once a transaction has been committed on one node, it cannot be retracted again if it
later turns out that it was aborted on another node. This is because once data has been committed,
it becomes visible to other transactions under *read committed* or stronger isolation. For example,
in [Figure 8-12](/en/ch8#fig_transactions_non_atomic), by the time user 1 notices that its commit failed on database 1,
user 2 has already read the data from the same transaction on database 2. If user 1’s transaction
was later aborted, user 2’s transaction would have to be reverted as well, since it was based on
data that was retroactively declared not to have existed.

A better approach is to ensure that the nodes involved in a transaction either all commit or all
abort, and to prevent a mixture of the two. Ensuring this is known as the *atomic commitment*
problem.

## Two-Phase Commit (2PC)

Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It
is a classic algorithm in distributed databases
[[13](/en/ch8#Bernstein1987_ch8),
[71](/en/ch8#Lindsay1979_ch8),
[72](/en/ch8#Mohan1986)]. 2PC is used
internally in some databases and also made available to applications in the form of *XA transactions*
[[73](/en/ch8#XASpec1991)]
(which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP
web services
[[74](/en/ch8#Neto2008),
[75](/en/ch8#Johnson2004)].

The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two
phases (hence the name).

![ddia 0813](/fig/ddia_0813.png)

###### Figure 8-13. A successful execution of two-phase commit (2PC).

2PC uses a new component that does not normally appear in single-node transactions: a
*coordinator* (also known as *transaction manager*). The coordinator is often implemented as a
library within the same application process that is requesting the transaction (e.g., embedded in a
Java EE container), but it can also be a separate process or service. Examples of such coordinators
include Narayana, JOTM, BTM, or MSDTC.

When 2PC is used, a distributed
transaction begins with the application reading and writing data on multiple database nodes,
as normal. We call these database nodes *participants* in the transaction. When the application is
ready to commit, the coordinator begins phase 1: it sends a *prepare* request to each of the nodes,
asking them whether they are able to commit. The coordinator then tracks the responses from the
participants:

* If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends
  out a *commit* request in phase 2, and the commit actually takes place.
* If any of the participants replies “no,” the coordinator sends an *abort* request to all nodes in
  phase 2.

This process is somewhat like the traditional marriage ceremony in Western cultures: the minister
asks the bride and groom individually whether each wants to marry the other, and typically receives
the answer “I do” from both. After receiving both acknowledgments, the minister pronounces the
couple husband and wife: the transaction is committed, and the happy fact is broadcast to all
attendees. If either bride or groom does not say “yes,” the ceremony is aborted
[[76](/en/ch8#Gray1981_ch8)].

### A system of promises

From this short description it might not be clear why two-phase commit ensures atomicity, while
one-phase commit across several nodes does not. Surely the prepare and commit requests can just
as easily be lost in the two-phase case. What makes 2PC different?

To understand why it works, we have to break down the process in a bit more detail:

1. When the application wants to begin a distributed transaction, it requests a transaction ID from
   the coordinator. This transaction ID is globally unique.
2. The application begins a single-node transaction on each of the participants, and attaches the
   globally unique transaction ID to the single-node transaction. All reads and writes are done in
   one of these single-node transactions. If anything goes wrong at this stage (for example, a node
   crashes or a request times out), the coordinator or any of the participants can abort.
3. When the application is ready to commit, the coordinator sends a prepare request to all
   participants, tagged with the global transaction ID. If any of these requests fails or times out,
   the coordinator sends an abort request for that transaction ID to all participants.
4. When a participant receives the prepare request, it makes sure that it can definitely commit
   the transaction under all circumstances.

   This includes writing all transaction data to disk (a crash, a power failure, or running out of
   disk space is not an acceptable excuse for refusing to commit later), and checking for any
   conflicts or constraint violations. By replying “yes” to the coordinator, the node promises to
   commit the transaction without error if requested. In other words, the participant surrenders the
   right to abort the transaction, but without actually committing it.
5. When the coordinator has received responses to all prepare requests, it makes a definitive
   decision on whether to commit or abort the transaction (committing only if all participants voted
   “yes”). The coordinator must write that decision to its transaction log on disk so that it knows
   which way it decided in case it subsequently crashes. This is called the *commit point*.
6. Once the coordinator’s decision has been written to disk, the commit or abort request is sent
   to all participants. If this request fails or times out, the coordinator must retry forever until
   it succeeds. There is no more going back: if the decision was to commit, that decision must be
   enforced, no matter how many retries it takes. If a participant has crashed in the meantime, the
   transaction will be committed when it recovers—since the participant voted “yes,” it cannot
   refuse to commit when it recovers.

Thus, the protocol contains two crucial “points of no return”: when a participant votes “yes,” it
promises that it will definitely be able to commit later (although the coordinator may still choose to
abort); and once the coordinator decides, that decision is irrevocable. Those promises ensure the
atomicity of 2PC. (Single-node atomic commit lumps these two events into one: writing the commit
record to the transaction log.)

Returning to the marriage analogy, before saying “I do,” you and your bride/groom have the freedom
to abort the transaction by saying “No way!” (or something to that effect). However, after saying “I
do,” you cannot retract that statement. If you faint after saying “I do” and you don’t hear the
minister speak the words “You are now husband and wife,” that doesn’t change the fact that the
transaction was committed. When you recover consciousness later, you can find out whether you are
married or not by querying the minister for the status of your global transaction ID, or you can
wait for the minister’s next retry of the commit request (since the retries will have continued
throughout your period of unconsciousness).

### Coordinator failure

We have discussed what happens if one of the participants or the network fails during 2PC: if any of
the prepare requests fails or times out, the coordinator aborts the transaction; if any of the
commit or abort requests fails, the coordinator retries them indefinitely. However, it is less
clear what happens if the coordinator crashes.

If the coordinator fails before sending the prepare requests, a participant can safely abort the
transaction. But once the participant has received a prepare request and voted “yes,” it can no
longer abort unilaterally—it must wait to hear back from the coordinator whether the transaction
was committed or aborted. If the coordinator crashes or the network fails at this point, the
participant can do nothing but wait. A participant’s transaction in this state is called *in doubt*
or *uncertain*.

The situation is illustrated in [Figure 8-14](/en/ch8#fig_transactions_2pc_crash). In this particular example, the
coordinator actually decided to commit, and database 2 received the commit request. However, the
coordinator crashed before it could send the commit request to database 1, and so database 1 does
not know whether to commit or abort. Even a timeout does not help here: if database 1 unilaterally
aborts after a timeout, it will end up inconsistent with database 2, which has committed. Similarly,
it is not safe to unilaterally commit, because another participant may have aborted.

![ddia 0814](/fig/ddia_0814.png)

###### Figure 8-14. The coordinator crashes after participants vote “yes.” Database 1 does not know whether to commit or abort.

Without hearing from the coordinator, the participant has no way of knowing whether to commit or
abort. In principle, the participants could communicate among themselves to find out how each
participant voted and come to some agreement, but that is not part of the 2PC protocol.

The only way 2PC can complete is by waiting for the coordinator to recover. This is why the
coordinator must write its commit or abort decision to a transaction log on disk before sending
commit or abort requests to participants: when the coordinator recovers, it determines the status of
all in-doubt transactions by reading its transaction log. Any transactions that don’t have a commit
record in the coordinator’s log are aborted. Thus, the commit point of 2PC comes down to a regular
single-node atomic commit on the coordinator.

### Three-phase commit

Two-phase commit is called a *blocking* atomic commit protocol due to the fact that 2PC can become
stuck waiting for the coordinator to recover. It is possible to make an atomic commit protocol
*nonblocking*, so that it does not get stuck if a node fails. However, making this work in practice
is not so straightforward.

As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed
[[13](/en/ch8#Bernstein1987_ch8),
[77](/en/ch8#Skeen1981)].
However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most
practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
cannot guarantee atomicity.

A better solution in practice is to replace the single-node coordinator with a fault-tolerant
consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_consistency).

## Distributed Transactions Across Different Systems

Distributed transactions and two-phase commit have a mixed reputation. On the one hand, they are
seen as providing an important safety guarantee that would be hard to achieve otherwise; on the
other hand, they are criticized for causing operational problems, killing performance, and promising
more than they can deliver [[78](/en/ch8#Hohpe2005),
[79](/en/ch8#Helland2007_ch8),
[80](/en/ch8#Oliver2011),
[81](/en/ch8#Rahien2014)].
Many cloud services choose not to implement distributed transactions due to the operational
problems they engender [[82](/en/ch8#Vasters2012)].

Some implementations of distributed transactions carry a heavy performance penalty. Much of the
performance cost inherent in two-phase commit is due to the additional disk forcing (`fsync`) that
is required for crash recovery, and the additional network round-trips.

However, rather than dismissing distributed transactions outright, we should examine them in some
more detail, because there are important lessons to be learned from them. To begin, we should be
precise about what we mean by “distributed transactions.” Two quite different types of distributed
transactions are often conflated:

Database-internal distributed transactions
:   Some distributed databases (i.e., databases that use replication and sharding in their standard
    configuration) support internal transactions among the nodes of that database. For example,
    YugabyteDB, TiDB, FoundationDB, Spanner, VoltDB, and MySQL Cluster’s NDB storage engine have such
    internal transaction support. In this case, all the nodes participating in the transaction are
    running the same database software.

Heterogeneous distributed transactions
:   In a *heterogeneous* transaction, the participants are two or more different technologies: for
    example, two databases from different vendors, or even non-database systems such as message
    brokers. A distributed transaction across these systems must ensure atomic commit, even though
    the systems may be entirely different under the hood.

Database-internal transactions do not have to be compatible with any other system, so they can
use any protocol and apply optimizations specific to that particular technology. For that reason,
database-internal distributed transactions can often work quite well. On the other hand,
transactions spanning heterogeneous technologies are a lot more challenging.

### Exactly-once message processing

Heterogeneous distributed transactions allow diverse systems to be integrated in powerful ways. For
example, a message from a message queue can be acknowledged as processed if and only if the database
transaction for processing the message was successfully committed. This is implemented by atomically
committing the message acknowledgment and the database writes in a single transaction. With
distributed transaction support, this is possible, even if the message broker and the database are
two unrelated technologies running on different machines.

If either the message delivery or the database transaction fails, both are aborted, and so the
message broker may safely redeliver the message later. Thus, by atomically committing the message
and the side effects of its processing, we can ensure that the message is *effectively* processed
exactly once, even if it required a few retries before it succeeded. The abort discards any side
effects of the partially completed transaction. This is known as *exactly-once semantics*.

Such a distributed transaction is only possible if all systems affected by the transaction are able
to use the same atomic commit protocol, however. For example, say a side effect of processing a
message is to send an email, and the email server does not support two-phase commit: it could happen
that the email is sent two or more times if message processing fails and is retried. But if all side
effects of processing a message are rolled back on transaction abort, then the processing step can
safely be retried as if nothing had happened.

We will return to the topic of exactly-once semantics later in this chapter. Let’s look first at the
atomic commit protocol that allows such heterogeneous distributed transactions.

### XA transactions

*X/Open XA* (short for *eXtended Architecture*) is a standard for implementing two-phase commit
across heterogeneous technologies [[73](/en/ch8#XASpec1991)].
It was introduced in 1991 and has been widely
implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL,
Db2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ).

XA is not a network protocol—it is merely a C API for interfacing with a transaction coordinator.
Bindings for this API exist in other languages; for example, in the world of Java EE applications,
XA transactions are implemented using the Java Transaction API (JTA), which in turn is supported by
many drivers for databases using Java Database Connectivity (JDBC) and drivers for message brokers
using the Java Message Service (JMS) APIs.

XA assumes that your application uses a network driver or client library to communicate with the
participant databases or messaging services. If the driver supports XA, that means it calls the XA
API to find out whether an operation should be part of a distributed transaction—and if so, it
sends the necessary information to the database server. The driver also exposes callbacks through
which the coordinator can ask the participant to prepare, commit, or abort.

The transaction coordinator implements the XA API. The standard does not specify how it should be
implemented, but in practice the coordinator is often simply a library that is loaded into the same
process as the application issuing the transaction (not a separate service). It keeps track of the
participants in a transaction, collects partipants’ responses after asking them to prepare (via a
callback into the driver), and uses a log on the local disk to keep track of the commit/abort
decision for each transaction.

If the application process crashes, or the machine on which the application is running dies, the
coordinator goes with it. Any participants with prepared but uncommitted transactions are then stuck
in doubt. Since the coordinator’s log is on the application server’s local disk, that server must be
restarted, and the coordinator library must read the log to recover the commit/abort outcome of each
transaction. Only then can the coordinator use the database driver’s XA callbacks to ask
participants to commit or abort, as appropriate. The database server cannot contact the coordinator
directly, since all communication must go via its client library.

### Holding locks while in doubt

Why do we care so much about a transaction being stuck in doubt? Can’t the rest of the system just
get on with its work, and ignore the in-doubt transaction that will be cleaned up eventually?

The problem is with *locking*. As discussed in [“Read Committed”](/en/ch8#sec_transactions_read_committed), database
transactions usually take a row-level exclusive lock on any rows they modify, to prevent dirty
writes. In addition, if you want serializable isolation, a database using two-phase locking would
also have to take a shared lock on any rows *read* by the transaction.

The database cannot release those locks until the transaction commits or aborts (illustrated as a
shaded area in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit)). Therefore, when using two-phase commit, a
transaction must hold onto the locks throughout the time it is in doubt. If the coordinator has
crashed and takes 20 minutes to start up again, those locks will be held for 20 minutes. If the
coordinator’s log is entirely lost for some reason, those locks will be held forever—or at least
until the situation is manually resolved by an administrator.

While those locks are held, no other transaction can modify those rows. Depending on the isolation
level, other transactions may even be blocked from reading those rows. Thus, other transactions
cannot simply continue with their business—if they want to access that same data, they will be
blocked. This can cause large parts of your application to become unavailable until the in-doubt
transaction is resolved.

### Recovering from coordinator failure

In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the
log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do
occur [[83](/en/ch8#Dhariwal2008),
[84](/en/ch8#Randal2013)]—that is,
transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because
the transaction log has been lost or corrupted due to a software bug). These transactions cannot be
resolved automatically, so they sit forever in the database, holding locks and blocking other
transactions.

Even rebooting your database servers will not fix this problem, since a correct implementation of
2PC must preserve the locks of an in-doubt transaction even across restarts (otherwise it would risk
violating the atomicity guarantee). It’s a sticky situation.

The only way out is for an administrator to manually decide whether to commit or roll back the
transactions. The administrator must examine the participants of each in-doubt transaction,
determine whether any participant has committed or aborted already, and then apply the same outcome
to the other participants. Resolving the problem potentially requires a lot of manual effort, and
most likely needs to be done under high stress and time pressure during a serious production outage
(otherwise, why would the coordinator be in such a bad state?).

Many XA implementations have an emergency escape hatch called *heuristic decisions*: allowing a
participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive
decision from the coordinator [[73](/en/ch8#XASpec1991)]. To be clear,
*heuristic* here is a euphemism for *probably breaking atomicity*, since the heuristic decision
violates the system of promises in two-phase commit. Thus, heuristic decisions are intended only for
getting out of catastrophic situations, and not for regular use.

### Problems with XA transactions

A single-node coordinator is a single point of failure for the entire system, and making it part of
the application server is also problematic because the coordinator’s logs on its local disk become a
crucial part of the durable system state—as important as the databases themselves.

In principle, the coordinator of an XA transaction could be highly available and replicated, just
like we would expect of any other important database. Unfortunately, this still doesn’t solve a
fundamental problem with XA, which is that it provides no way for the coordinator and the
participants of a transaction to communicate with each other directly. They can only communicate via
the application code that invoked the transaction, and the database drivers through which it calls
the participants.

Even if the coordinator were replicated, the application code would therefore be a single point of
failure. Solving this problem would require totally redesigning how application code is run to make
it replicated or restartable, which could perhaps look similar to durable execution (see
[“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows)). However, there don’t seem to be any tools that actually take
this approach in practice.

Another problem is that since XA needs to be compatible with a wide range of data systems, it is
necessarily a lowest common denominator. For example, it cannot detect deadlocks across different
systems (since that would require a standardized protocol for systems to exchange information on the
locks that each transaction is waiting for), and it does not work with SSI (see
[“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)), since that would require a protocol for identifying conflicts across
different systems.

These problems are somewhat inherent in performing transactions across heterogeneous technologies.
However, keeping several heterogeneous data systems consistent with each other is still a real and
important problem, so we need to find a different solution to it. This can be done, as we will see
in the next section and in [Link to Come].

## Database-internal Distributed Transactions

As explained previously, there is a big difference between distributed transactions that span
multiple heterogeneous storage technologies, and those that are internal to a system—i.e., where all
the participating nodes are shards of the same database running the same software. Such internal
distributed transactions are a defining feature of “NewSQL” databases such as
CockroachDB [[5](/en/ch8#Taft2020_ch8)],
TiDB [[6](/en/ch8#Huang2020)],
Spanner [[7](/en/ch8#Corbett2012_ch8)],
FoundationDB [[8](/en/ch8#Zhou2021_ch8)], and YugabyteDB, for
example. Some message brokers such as Kafka also support internal distributed transactions
[[85](/en/ch8#Wang2021)].

Many of these systems use 2-phase commit to ensure atomicity of transactions that write to multiple
shards, and yet they don’t suffer the same problems as XA transactions. The reason is that because
their distributed transactions don’t need to interface with any other technologies, they avoid the
lowest-common-denominator trap—the designers of these systems are free to use better protocols that
are more reliable and faster.

The biggest problems with XA can be fixed by:

* Replicating the coordinator, with automatic failover to another coordinator node if the primary
  one crashes;
* Allowing the coordinator and data shards to communicate directly without going via application
  code;
* Replicating the participating shards, so that the risk of having to abort a transaction because of
  a fault in one of the shards is reduced; and
* Coupling the atomic commitment protocol with a distributed concurrency control protocol that
  supports deadlock detection and consistent reads across shards.

Consensus algorithms are commonly used to replicate the coordinator and the database shards. We will
see in [Chapter 10](/en/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented
using a consensus algorithm. These algorithms tolerate faults by automatically failing over from one
node to another without any human intervention, and while continuing to guarantee strong consistency
properties.

The isolation levels offered for distributed transactions depend on the system, but snapshot
isolation and serializable snapshot isolation are both possible across shards. The details of how
this works can be found in the papers referenced at the end of this chapter.

### Exactly-once message processing revisited

We saw in [“Exactly-once message processing”](/en/ch8#sec_transactions_exactly_once) that an important use case for distributed transactions
is to ensure that some operation takes effect exactly once, even if a crash occurs while it is being
processed and the processing needs to be retried. If you can atomically commit a transaction across
a message broker and a database, you can acknowledge the message to the broker if and only if it was
successfully processed and the database writes resulting from the process were committed.

However, you don’t actually need such distributed transactions to achieve exactly-once semantics. An
alternative approach is as follows, which only requires transactions within the database:

1. Assume every message has a unique ID, and in the database you have a table of message IDs that
   have been processed. When you start processing a message from the broker, you begin a new
   transaction on the database, and check the message ID. If the same message ID is already present
   in the database, you know that it has already been processed, so you can acknowledge the message
   to the broker and drop it.
2. If the message ID is not already in the database, you add it to the table. You then process the
   message, which may result in additional writes to the database within the same transaction. When
   you finish processing the message, you commit the transaction on the database.
3. Once the database transaction is successfully committed, you can acknowledge the message to the
   broker.
4. Once the message has successfully been acknowledged to the broker, you know that it won’t try
   processing the same message again, so you can delete the message ID from the database (in a
   separate transaction).

If the message processor crashes before committing the database transaction, the transaction is
aborted and the message broker will retry processing. If it crashes after committing but before
acknowledging the message to the broker, it will also retry processing, but the retry will see the
message ID in the database and drop it. If it crashes after acknowledging the message but before
deleting the message ID from the database, you will have an old message ID lying around, which
doesn’t do any harm besides taking a little bit of storage space. If a retry happens before the
database transaction is aborted (which could happen if communication between the message processor
and the database is interrupted), a uniqueness constraint on the table of message IDs should prevent
the same message ID from being inserted by two concurrent transactions.

Thus, achieving exactly-once processing only requires transactions within the database—atomicity
across database and message broker is not necessary for this use case. Recording the message ID in
the database makes the message processing *idempotent*, so that message processing can be safely
retried without duplicating its side-effects. A similar approach is used in stream processing
frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in
[Link to Come].

However, internal distributed transactions within the database are still useful for the scalability
of patterns such as these: for example, they would allow the message IDs to be stored on one shard
and the main data updated by the message processing to be stored on other shards, and to ensure
atomicity of the transaction commit across those shards.

# Summary

Transactions are an abstraction layer that allows an application to pretend that certain concurrency
problems and certain kinds of hardware and software faults don’t exist. A large class of errors is
reduced down to a simple *transaction abort*, and the application just needs to try again.

In this chapter we saw many examples of problems that transactions help prevent. Not all
applications are susceptible to all those problems: an application with very simple access patterns,
such as reading and writing only a single record, can probably manage without transactions. However,
for more complex access patterns, transactions can hugely reduce the number of potential error cases
you need to think about.

Without transactions, various error scenarios (processes crashing, network interruptions, power
outages, disk full, unexpected concurrency, etc.) mean that data can become inconsistent in various
ways. For example, denormalized data can easily go out of sync with the source data. Without
transactions, it becomes very difficult to reason about the effects that complex interacting accesses
can have on the database.

In this chapter, we went particularly deep into the topic of concurrency control. We discussed
several widely used isolation levels, in particular *read committed*, *snapshot isolation*
(sometimes called *repeatable read*), and *serializable*. We characterized those isolation levels by
discussing various examples of race conditions, summarized in [Table 8-1](/en/ch8#ch_transactions_isolation_levels):

Table 8-1. Summary of anomalies that can occur at various isolation levels

| Isolation level | Dirty reads | Read skew | Phantom reads | Lost updates | Write skew |
| --- | --- | --- | --- | --- | --- |
| Read uncommitted | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible |
| Read committed | ✓ Prevented | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible |
| Snapshot isolation | ✓ Prevented | ✓ Prevented | ✓ Prevented | ? Depends | ✗ Possible |
| Serializable | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented |

Dirty reads
:   One client reads another client’s writes before they have been committed. The read committed
    isolation level and stronger levels prevent dirty reads.

Dirty writes
:   One client overwrites data that another client has written, but not yet committed. Almost all
    transaction implementations prevent dirty writes.

Read skew
:   A client sees different parts of the database at different points in time. Some cases of read
    skew are also known as *nonrepeatable reads*. This issue is most commonly prevented with snapshot
    isolation, which allows a transaction to read from a consistent snapshot corresponding to one
    particular point in time. It is usually implemented with *multi-version concurrency control*
    (MVCC).

Lost updates
:   Two clients concurrently perform a read-modify-write cycle. One overwrites the other’s write
    without incorporating its changes, so data is lost. Some implementations of snapshot isolation
    prevent this anomaly automatically, while others require a manual lock (`SELECT FOR UPDATE`).

Write skew
:   A transaction reads something, makes a decision based on the value it saw, and writes the decision
    to the database. However, by the time the write is made, the premise of the decision is no longer
    true. Only serializable isolation prevents this anomaly.

Phantom reads
:   A transaction reads objects that match some search condition. Another client makes a write that
    affects the results of that search. Snapshot isolation prevents straightforward phantom reads, but
    phantoms in the context of write skew require special treatment, such as index-range locks.

Weak isolation levels protect against some of those anomalies but leave you, the application
developer, to handle others manually (e.g., using explicit locking). Only serializable isolation
protects against all of these issues. We discussed three different approaches to implementing
serializable transactions:

Literally executing transactions in a serial order
:   If you can make each transaction very fast to execute (typically by using stored procedures), and
    the transaction throughput is low enough to process on a single CPU core or can be sharded, this
    is a simple and effective option.

Two-phase locking
:   For decades this has been the standard way of implementing serializability, but many applications
    avoid using it because of its poor performance.

Serializable snapshot isolation (SSI)
:   A comparatively new algorithm that avoids most of the downsides of the previous approaches. It
    uses an optimistic approach, allowing transactions to proceed without blocking. When a transaction
    wants to commit, it is checked, and it is aborted if the execution was not serializable.

Finally, we examined how to achieve atomicity when a transaction is distributed across multiple
nodes, using two-phase commit. If those nodes are all running the same database software,
distributed transactions can work quite well, but across different storage technologies (using XA
transactions), 2PC is problematic: it is very sensitive to faults in the coordinator and the
application code driving the transaction, and it interacts poorly with concurrency control
mechanisms. Fortunately, idempotence can ensure exactly-once semantics without requiring atomic
commit across different storage technologies, and we will see more on this in later chapters.

The examples in this chapter used a relational data model. However, as discussed in
[“The need for multi-object transactions”](/en/ch8#sec_transactions_need), transactions are a valuable database feature, no matter which data model
is used.

##### Footnotes

##### References

[[1](/en/ch8#Murdoch2021-marker)] Steven J. Murdoch.
[What
went wrong with Horizon: learning from the Post Office Trial](https://www.benthamsgaze.org/2021/07/15/what-went-wrong-with-horizon-learning-from-the-post-office-trial/). *benthamsgaze.org*, July 2021.
Archived at [perma.cc/CNM4-553F](https://perma.cc/CNM4-553F)

[[2](/en/ch8#Chamberlin1981-marker)] Donald D. Chamberlin, Morton M. Astrahan,
Michael W. Blasgen, James N. Gray, W. Frank King, Bruce G. Lindsay, Raymond Lorie, James W. Mehl,
Thomas G. Price, Franco Putzolu, Patricia Griffiths Selinger, Mario Schkolnick, Donald R. Slutz,
Irving L. Traiger, Bradford W. Wade, and Robert A. Yost.
[A History and Evaluation of System
R](https://dsf.berkeley.edu/cs262/2005/SystemR.pdf). *Communications of the ACM*, volume 24, issue 10, pages 632–646, October 1981.
[doi:10.1145/358769.358784](https://doi.org/10.1145/358769.358784)

[[3](/en/ch8#Gray1976-marker)] Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger.
[Granularity of
Locks and Degrees of Consistency in a Shared Data Base](https://citeseerx.ist.psu.edu/pdf/e127f0a6a912bb9150ecfe03c0ebf7fbc289a023). in *Modelling in Data Base Management
Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management
Systems*, edited by G. M. Nijssen, pages 364–394, Elsevier/North Holland Publishing, 1976. Also
in *Readings in Database Systems*, 4th edition, edited by Joseph M. Hellerstein and Michael
Stonebraker, MIT Press, 2005. ISBN: 978-0-262-69314-1

[[4](/en/ch8#Eswaran1976-marker)] Kapali P. Eswaran, Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger.
[The
Notions of Consistency and Predicate Locks in a Database System](https://jimgray.azurewebsites.net/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf?from=https://research.microsoft.com/en-us/um/people/gray/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf). *Communications of the
ACM*, volume 19, issue 11, pages 624–633, November 1976.
[doi:10.1145/360363.360369](https://doi.org/10.1145/360363.360369)

[[5](/en/ch8#Taft2020_ch8-marker)] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan
VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul
Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis.
[CockroachDB: The Resilient
Geo-Distributed SQL Database](https://dl.acm.org/doi/pdf/10.1145/3318464.3386134). At *ACM SIGMOD International Conference on Management of
Data* (SIGMOD), pages 1493–1509, June 2020.
[doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134)

[[6](/en/ch8#Huang2020-marker)] Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang,
Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang,
Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan
Pei, and Xin Tang.
[TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf).
*Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084.
[doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535)

[[7](/en/ch8#Corbett2012_ch8-marker)] James C. Corbett, Jeffrey Dean,
Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev,
Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li,
Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig,
Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang.
[Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/).
At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI),
October 2012.

[[8](/en/ch8#Zhou2021_ch8-marker)] Jingyu Zhou, Meng Xu, Alexander
Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty
Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser,
Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav.
[FoundationDB: A Distributed Unbundled
Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data*
(SIGMOD), June 2021.
[doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559)

[[9](/en/ch8#Harder1983-marker)] Theo Härder and Andreas Reuter.
[Principles of
Transaction-Oriented Database Recovery](https://citeseerx.ist.psu.edu/pdf/11ef7c142295aeb1a28a0e714c91fc8d610c3047). *ACM Computing Surveys*, volume 15, issue 4,
pages 287–317, December 1983. [doi:10.1145/289.291](https://doi.org/10.1145/289.291)

[[10](/en/ch8#Bailis2013HAT-marker)] Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph
M. Hellerstein, and Ion Stoica.
[HAT, not CAP:
Towards Highly Available Transactions](https://www.usenix.org/system/files/conference/hotos13/hotos13-final80.pdf). At *14th USENIX Workshop on Hot Topics in Operating
Systems* (HotOS), May 2013.

[[11](/en/ch8#Fox1997-marker)] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric
A. Brewer, and Paul Gauthier.
[Cluster-Based Scalable Network
Services](https://people.eecs.berkeley.edu/~brewer/cs262b/TACC.pdf). At *16th ACM Symposium on Operating Systems Principles* (SOSP), October 1997.
[doi:10.1145/268998.266662](https://doi.org/10.1145/268998.266662)

[[12](/en/ch8#Andrews2004-marker)] Tony Andrews.
[Enforcing
Complex Constraints in Oracle](https://tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html). *tonyandrews.blogspot.co.uk*, October 2004. Archived at
[archive.org](https://web.archive.org/web/20220201190625/https%3A//tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html)

[[13](/en/ch8#Bernstein1987_ch8-marker)] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman.
[*Concurrency Control and
Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available
online at [*microsoft.com*](https://www.microsoft.com/en-us/research/people/philbe/book/).

[[14](/en/ch8#Fekete2005-marker)] Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil,
Patrick O’Neil, and Dennis Shasha.
[Making
Snapshot Isolation Serializable](https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2009/Papers/p492-fekete.pdf). *ACM Transactions on Database Systems*,
volume 30, issue 2, pages 492–528, June 2005.
[doi:10.1145/1071610.1071615](https://doi.org/10.1145/1071610.1071615)

[[15](/en/ch8#Zheng2013-marker)] Mai Zheng, Joseph Tucek, Feng Qin, and Mark Lillibridge.
[Understanding
the Robustness of SSDs Under Power Fault](https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf). At *11th USENIX Conference on File and Storage
Technologies* (FAST), February 2013.

[[16](/en/ch8#Denness2015-marker)] Laurie Denness.
[SSDs: A Gift and a Curse](https://laur.ie/blog/2015/06/ssds-a-gift-and-a-curse/).
*laur.ie*, June 2015. Archived at [perma.cc/6GLP-BX3T](https://perma.cc/6GLP-BX3T)

[[17](/en/ch8#Surak2015-marker)] Adam Surak.
[When
Solid State Drives Are Not That Solid](https://www.algolia.com/blog/engineering/when-solid-state-drives-are-not-that-solid). *blog.algolia.com*, June 2015.
Archived at [perma.cc/CBR9-QZEE](https://perma.cc/CBR9-QZEE)

[[18](/en/ch8#HPE2019_ch8-marker)] Hewlett Packard Enterprise.
[Bulletin:
(Revision) HPE SAS Solid State Drives - Critical Firmware Upgrade Required for Certain HPE SAS
Solid State Drive Models to Prevent Drive Failure at 32,768 Hours of Operation](https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us).
*support.hpe.com*, November 2019.
Archived at [perma.cc/CZR4-AQBS](https://perma.cc/CZR4-AQBS)

[[19](/en/ch8#Ringer2018-marker)] Craig Ringer et al.
[PostgreSQL’s
handling of fsync() errors is unsafe and risks data loss at least on XFS](https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com). Email thread on
pgsql-hackers mailing list, *postgresql.org*, March 2018.
Archived at [perma.cc/5RKU-57FL](https://perma.cc/5RKU-57FL)

[[20](/en/ch8#Rebello2020-marker)] Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan,
Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.
[Can Applications Recover
from fsync Failures?](https://www.usenix.org/conference/atc20/presentation/rebello) At *USENIX Annual Technical Conference* (ATC), July 2020.

[[21](/en/ch8#Pillai2015-marker)] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram,
Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.
[Crash Consistency: Rethinking the
Fundamental Abstractions of the File System](https://dl.acm.org/doi/pdf/10.1145/2800695.2801719). *ACM Queue*, volume 13, issue 7, pages 20–28, July 2015.
[doi:10.1145/2800695.2801719](https://doi.org/10.1145/2800695.2801719)

[[22](/en/ch8#Pillai2014-marker)] Thanumalayan Sankaranarayana Pillai, Vijay
Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.
[All File
Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf).
At *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014.

[[23](/en/ch8#Siebenmann2016-marker)] Chris Siebenmann.
[Unix’s File Durability
Problem](https://utcc.utoronto.ca/~cks/space/blog/unix/FileSyncProblem). *utcc.utoronto.ca*, April 2016.
Archived at [perma.cc/VSS8-5MC4](https://perma.cc/VSS8-5MC4)

[[24](/en/ch8#Ganesan2017-marker)] Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C.
Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.
[Redundancy
Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and
Corruptions](https://www.usenix.org/conference/fast17/technical-sessions/presentation/ganesan). At *15th USENIX Conference on File and Storage Technologies* (FAST),
February 2017.

[[25](/en/ch8#Bairavasundaram2008-marker)] Lakshmi N. Bairavasundaram, Garth R.
Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.
[An
Analysis of Data Corruption in the Storage Stack](https://www.usenix.org/legacy/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf). At *6th USENIX Conference on File and
Storage Technologies* (FAST), February 2008.

[[26](/en/ch8#Schroeder2016_ch8-marker)] Bianca Schroeder, Raghav Lagisetty, and Arif Merchant.
[Flash
Reliability in Production: The Expected and the Unexpected](https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder). At *14th USENIX Conference on
File and Storage Technologies* (FAST), February 2016.

[[27](/en/ch8#Allison2015-marker)] Don Allison.
[SSD Storage – Ignorance of Technology Is No
Excuse](https://blog.korelogic.com/blog/2015/03/24). *blog.korelogic.com*, March 2015.
Archived at [perma.cc/9QN4-9SNJ](https://perma.cc/9QN4-9SNJ)

[[28](/en/ch8#MahUng2015-marker)] Gordon Mah Ung.
[Debunked:
Your SSD won’t lose data if left unplugged after all](https://www.pcworld.com/article/427602/debunked-your-ssd-wont-lose-data-if-left-unplugged-after-all.html). *pcworld.com*, May 2015.
Archived at [perma.cc/S46H-JUDU](https://perma.cc/S46H-JUDU)

[[29](/en/ch8#Kleppmann2014-marker)] Martin Kleppmann.
[Hermitage:
Testing the ‘I’ in ACID](https://martin.kleppmann.com/2014/11/25/hermitage-testing-the-i-in-acid.html). *martin.kleppmann.com*, November 2014.
Archived at [perma.cc/KP2Y-AQGK](https://perma.cc/KP2Y-AQGK)

[[30](/en/ch8#Warszawski2017-marker)] Todd Warszawski and Peter Bailis.
[ACIDRain: Concurrency-Related Attacks
on Database-Backed Web Applications](http://www.bailis.org/papers/acidrain-sigmod2017.pdf). At *ACM International Conference on Management of
Data* (SIGMOD), May 2017.
[doi:10.1145/3035918.3064037](https://doi.org/10.1145/3035918.3064037)

[[31](/en/ch8#DAgosta2014-marker)] Tristan D’Agosta.
[BTC Stolen from Poloniex](https://bitcointalk.org/index.php?topic=499580).
*bitcointalk.org*, March 2014.
Archived at [perma.cc/YHA6-4C5D](https://perma.cc/YHA6-4C5D)

[[32](/en/ch8#bitcointhief2014-marker)] bitcointhief2.
[How
I Stole Roughly 100 BTC from an Exchange and How I Could Have Stolen More!](https://www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and/) *reddit.com*,
February 2014. Archived at
[archive.org](https://web.archive.org/web/20250118042610/https%3A//www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and/)

[[33](/en/ch8#Jorwekar2007_ch8-marker)] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan.
[Automating the
Detection of Snapshot Isolation Anomalies](https://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf). At *33rd International Conference on Very Large
Data Bases* (VLDB), September 2007.

[[34](/en/ch8#Melanson2014-marker)] Michael Melanson.
[Transactions:
The Limits of Isolation](https://www.michaelmelanson.net/posts/transactions-the-limits-of-isolation/). *michaelmelanson.net*, November 2014.
Archived at [perma.cc/RG5R-KMYZ](https://perma.cc/RG5R-KMYZ)

[[35](/en/ch8#Kim2014ACH-marker)] Edward Kim.
[How
ACH works: A developer perspective — Part 1](https://engineering.gusto.com/how-ach-works-a-developer-perspective-part-1-339d3e7bea1). *engineering.gusto.com*, April 2014.
Archived at [perma.cc/7B2H-PU94](https://perma.cc/7B2H-PU94)

[[36](/en/ch8#Berenson1995-marker)] Hal Berenson, Philip A. Bernstein, Jim N. Gray,
Jim Melton, Elizabeth O’Neil, and Patrick O’Neil.
[A Critique of
ANSI SQL Isolation Levels](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf). At *ACM International Conference on Management of Data* (SIGMOD),
May 1995. [doi:10.1145/568271.223785](https://doi.org/10.1145/568271.223785)

[[37](/en/ch8#Adya1999-marker)] Atul Adya. [Weak
Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions](https://pmg.csail.mit.edu/papers/adya-phd.pdf).
PhD Thesis, Massachusetts Institute of Technology, March 1999.
Archived at [perma.cc/E97M-HW5Q](https://perma.cc/E97M-HW5Q)

[[38](/en/ch8#Bailis2014virtues_ch8-marker)] Peter Bailis, Aaron Davidson, Alan Fekete, Ali
Ghodsi, Joseph M. Hellerstein, and Ion Stoica.
[Highly Available Transactions: Virtues and
Limitations](https://www.vldb.org/pvldb/vol7/p181-bailis.pdf). At *40th International Conference on Very Large Data Bases* (VLDB),
September 2014.

[[39](/en/ch8#Crooks2017-marker)] Natacha Crooks, Youer Pu, Lorenzo Alvisi, and Allen Clement.
[Seeing is Believing: A
Client-Centric Specification of Database Isolation](https://www.cs.cornell.edu/lorenzo/papers/Crooks17Seeing.pdf). At *ACM Symposium on Principles of
Distributed Computing* (PODC), pages 73–82, July 2017.
[doi:10.1145/3087801.3087802](https://doi.org/10.1145/3087801.3087802)

[[40](/en/ch8#Momjian2014-marker)] Bruce Momjian.
[MVCC Unmasked](https://momjian.us/main/writings/pgsql/mvcc.pdf). *momjian.us*,
July 2014. Archived at [perma.cc/KQ47-9GYB](https://perma.cc/KQ47-9GYB)

[[41](/en/ch8#Alvaro2023-marker)] Peter Alvaro and Kyle Kingsbury.
[MySQL 8.0.34](https://jepsen.io/analyses/mysql-8.0.34). *jepsen.io*, December 2023.
Archived at [perma.cc/HGE2-Z878](https://perma.cc/HGE2-Z878)

[[42](/en/ch8#Rogov2023-marker)] Egor Rogov.
[PostgreSQL 14 Internals](https://postgrespro.com/community/books/internals).
*postgrespro.com*, April 2023.
Archived at [perma.cc/FRK2-D7WB](https://perma.cc/FRK2-D7WB)

[[43](/en/ch8#Suzuki2017_ch8-marker)] Hironobu Suzuki.
[The Internals of PostgreSQL](https://www.interdb.jp/pg/).
*interdb.jp*, 2017.

[[44](/en/ch8#Alleti2025-marker)] Rohan Reddy Alleti.
[Internals
of MVCC in Postgres: Hidden costs of Updates vs Inserts](https://medium.com/%40rohanjnr44/internals-of-mvcc-in-postgres-hidden-costs-of-updates-vs-inserts-381eadd35844). *medium.com*, March 2025.
Archived at [perma.cc/3ACX-DFXT](https://perma.cc/3ACX-DFXT)

[[45](/en/ch8#Pavlo2023-marker)] Andy Pavlo and Bohan Zhang.
[The
Part of PostgreSQL We Hate the Most](https://www.cs.cmu.edu/~pavlo/blog/2023/04/the-part-of-postgresql-we-hate-the-most.html). *cs.cmu.edu*, April 2023.
Archived at [perma.cc/XSP6-3JBN](https://perma.cc/XSP6-3JBN)

[[46](/en/ch8#Wu2017-marker)] Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo.
[An empirical evaluation of in-memory
multi-version concurrency control](https://vldb.org/pvldb/vol10/p781-Wu.pdf). *Proceedings of the VLDB Endowment*, volume 10, issue
7, pages 781–792, March 2017.
[doi:10.14778/3067421.3067427](https://doi.org/10.14778/3067421.3067427)

[[47](/en/ch8#Prokopov2014-marker)] Nikita Prokopov.
[Unofficial Guide to Datomic
Internals](https://tonsky.me/blog/unofficial-guide-to-datomic-internals/). *tonsky.me*, May 2014.

[[48](/en/ch8#Svetlov2025-marker)] Daniil Svetlov.
[A Practical Guide to Taming Postgres Isolation
Anomalies](https://dansvetlov.me/postgres-anomalies/). *dansvetlov.me*, March 2025.
Archived at [perma.cc/L7LE-TDLS](https://perma.cc/L7LE-TDLS)

[[49](/en/ch8#Wiger2010-marker)] Nate Wiger.
[An Atomic Rant](https://nateware.com/2010/02/18/an-atomic-rant/). *nateware.com*,
February 2010. Archived at [perma.cc/5ZYB-PE44](https://perma.cc/5ZYB-PE44)

[[50](/en/ch8#Coglan2020-marker)] James Coglan.
[Reading and writing,
part 3: web applications](https://blog.jcoglan.com/2020/10/12/reading-and-writing-part-3/). *blog.jcoglan.com*, October 2020.
Archived at [perma.cc/A7EK-PJVS](https://perma.cc/A7EK-PJVS)

[[51](/en/ch8#Bailis2015_ch8-marker)] Peter Bailis, Alan Fekete, Michael J. Franklin,
Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica.
[Feral Concurrency Control: An
Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf). At *ACM International Conference on
Management of Data* (SIGMOD), June 2015.
[doi:10.1145/2723372.2737784](https://doi.org/10.1145/2723372.2737784)

[[52](/en/ch8#Dogan2020-marker)] Jaana Dogan.
[Things
I Wished More Developers Knew About Databases](https://rakyll.medium.com/things-i-wished-more-developers-knew-about-databases-2d0178464f78). *rakyll.medium.com*, April 2020.
Archived at [perma.cc/6EFK-P2TD](https://perma.cc/6EFK-P2TD)

[[53](/en/ch8#Cahill2008-marker)] Michael J. Cahill, Uwe Röhm, and Alan Fekete.
[Serializable
Isolation for Snapshot Databases](https://www.cs.cornell.edu/~sowell/dbpapers/serializable_isolation.pdf). At *ACM International Conference on Management of Data*
(SIGMOD), June 2008.
[doi:10.1145/1376616.1376690](https://doi.org/10.1145/1376616.1376690)

[[54](/en/ch8#Ports2012-marker)] Dan R. K. Ports and Kevin Grittner.
[Serializable Snapshot Isolation in PostgreSQL](https://drkp.net/papers/ssi-vldb12.pdf).
At *38th International Conference on Very Large Databases* (VLDB), August 2012.

[[55](/en/ch8#Terry1995_ch8-marker)] Douglas B. Terry, Marvin M. Theimer,
Karin Petersen, Alan J. Demers, Mike J. Spreitzer and Carl H. Hauser.
[Managing
Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](https://pdos.csail.mit.edu/6.824/papers/bayou-conflicts.pdf). At
*15th ACM Symposium on Operating Systems Principles* (SOSP), December 1995.
[doi:10.1145/224056.224070](https://doi.org/10.1145/224056.224070)

[[56](/en/ch8#Schoenig2021-marker)] Hans-Jürgen Schönig.
[Constraints
over multiple rows in PostgreSQL](https://www.cybertec-postgresql.com/en/postgresql-constraints-over-multiple-rows/). *cybertec-postgresql.com*, June 2021.
Archived at [perma.cc/2TGH-XUPZ](https://perma.cc/2TGH-XUPZ)

[[57](/en/ch8#Stonebraker2007_ch8-marker)] Michael Stonebraker, Samuel Madden,
Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland.
[The End of an
Architectural Era (It’s Time for a Complete Rewrite)](https://vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf). At *33rd International Conference on
Very Large Data Bases* (VLDB), September 2007.

[[58](/en/ch8#Hugg2014streaming-marker)] John Hugg.
[H-Store/VoltDB Architecture vs. CEP Systems
and Newer Streaming Architectures](https://www.youtube.com/watch?v=hD5M4a1UVz8). At *Data @Scale Boston*, November 2014.

[[59](/en/ch8#Kallman2008-marker)] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew
Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang
Zhang, John Hugg, and Daniel J. Abadi.
[H-Store: A High-Performance, Distributed Main
Memory Transaction Processing System](https://www.vldb.org/pvldb/vol1/1454211.pdf). *Proceedings of the VLDB Endowment*, volume 1,
issue 2, pages 1496–1499, August 2008.

[[60](/en/ch8#Hickey2012-marker)] Rich Hickey.
[The Architecture of Datomic](https://www.infoq.com/articles/Architecture-Datomic/).
*infoq.com*, November 2012.
Archived at [perma.cc/5YWU-8XJK](https://perma.cc/5YWU-8XJK)

[[61](/en/ch8#Hugg2014debunking-marker)] John Hugg.
[Debunking Myths
About the VoltDB In-Memory Database](https://dzone.com/articles/debunking-myths-about-voltdb). *dzone.com*, May 2014.
Archived at [perma.cc/2Z9N-HPKF](https://perma.cc/2Z9N-HPKF)

[[62](/en/ch8#Zhou2025-marker)] Xinjing Zhou, Viktor Leis, Xiangyao Yu, and Michael Stonebraker.
[OLTP Through the Looking Glass 16
Years Later: Communication is the New Bottleneck](https://www.vldb.org/cidrdb/papers/2025/p17-zhou.pdf). At *15th Annual Conference on Innovative
Data Systems Research* (CIDR), January 2025.

[[63](/en/ch8#Zhou2022-marker)] Xinjing Zhou, Xiangyao Yu, Goetz Graefe, and Michael Stonebraker.
[Lotus: scalable multi-partition
transactions on single-threaded partitioned databases](https://www.vldb.org/pvldb/vol15/p2939-zhou.pdf). *Proceedings of the VLDB
Endowment* (PVLDB), volume 15, issue 11, pages 2939–2952, July 2022.
[doi:10.14778/3551793.3551843](https://doi.org/10.14778/3551793.3551843)

[[64](/en/ch8#Hellerstein2007_ch8-marker)] Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton.
[Architecture of a Database System](https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf).
*Foundations and Trends in Databases*, volume 1, issue 2, pages 141–259, November 2007.
[doi:10.1561/1900000002](https://doi.org/10.1561/1900000002)

[[65](/en/ch8#Cahill2009-marker)] Michael J. Cahill.
[Serializable
Isolation for Snapshot Databases](https://ses.library.usyd.edu.au/bitstream/handle/2123/5353/michael-cahill-2009-thesis.pdf). PhD Thesis, University of Sydney, July 2009.
Archived at [perma.cc/727J-NTMP](https://perma.cc/727J-NTMP)

[[66](/en/ch8#Diaconu2013-marker)] Cristian Diaconu, Craig Freedman,
Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling.
[Hekaton:
SQL Server’s Memory-Optimized OLTP Engine](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/06/Hekaton-Sigmod2013-final.pdf). At *ACM SIGMOD International Conference on
Management of Data* (SIGMOD), pages 1243–1254, June 2013.
[doi:10.1145/2463676.2463710](https://doi.org/10.1145/2463676.2463710)

[[67](/en/ch8#Neumann2015-marker)] Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper.
[Fast Serializable Multi-Version Concurrency
Control for Main-Memory Database Systems](https://db.in.tum.de/~muehlbau/papers/mvcc.pdf). At *ACM SIGMOD International Conference on
Management of Data* (SIGMOD), pages 677–689, May 2015.
[doi:10.1145/2723372.2749436](https://doi.org/10.1145/2723372.2749436)

[[68](/en/ch8#Badal1979-marker)] D. Z. Badal.
[Correctness of Concurrency Control and
Implications in Distributed Databases](https://ieeexplore.ieee.org/abstract/document/762563). At *3rd International IEEE Computer Software and
Applications Conference* (COMPSAC), November 1979.
[doi:10.1109/CMPSAC.1979.762563](https://doi.org/10.1109/CMPSAC.1979.762563)

[[69](/en/ch8#Agrawal1987-marker)] Rakesh Agrawal, Michael J. Carey, and Miron Livny.
[Concurrency Control
Performance Modeling: Alternatives and Implications](https://people.eecs.berkeley.edu/~brewer/cs262/ConcControl.pdf). *ACM Transactions on Database
Systems* (TODS), volume 12, issue 4, pages 609–654, December 1987.
[doi:10.1145/32204.32220](https://doi.org/10.1145/32204.32220)

[[70](/en/ch8#Brooker2024snapshot-marker)] Marc Brooker.
[Snapshot Isolation vs
Serializability](https://brooker.co.za/blog/2024/12/17/occ-and-isolation.html). *brooker.co.za*, December 2024.
Archived at [perma.cc/5TRC-CR5G](https://perma.cc/5TRC-CR5G)

[[71](/en/ch8#Lindsay1979_ch8-marker)] B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N.
Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade.
[Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf).
IBM Research, Research Report RJ2571(33471), July 1979.
Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)

[[72](/en/ch8#Mohan1986-marker)] C. Mohan, Bruce G. Lindsay, and Ron Obermarck.
[Transaction
Management in the R\* Distributed Database Management System](https://cs.brown.edu/courses/csci2270/archives/2012/papers/dtxn/p378-mohan.pdf).
*ACM Transactions on Database Systems*, volume 11, issue 4, pages 378–396, December 1986.
[doi:10.1145/7239.7266](https://doi.org/10.1145/7239.7266)

[[73](/en/ch8#XASpec1991-marker)] X/Open Company Ltd.
[Distributed Transaction Processing:
The XA Specification](https://pubs.opengroup.org/onlinepubs/009680699/toc.pdf). Technical Standard XO/CAE/91/300, December 1991. ISBN: 978-1-872-63024-3,
archived at [perma.cc/Z96H-29JB](https://perma.cc/Z96H-29JB)

[[74](/en/ch8#Neto2008-marker)] Ivan Silva Neto and Francisco Reverbel.
[Lessons Learned from Implementing
WS-Coordination and WS-AtomicTransaction](https://www.ime.usp.br/~reverbel/papers/icis2008.pdf). At *7th IEEE/ACIS International Conference on
Computer and Information Science* (ICIS), May 2008.
[doi:10.1109/ICIS.2008.75](https://doi.org/10.1109/ICIS.2008.75)

[[75](/en/ch8#Johnson2004-marker)] James E. Johnson, David E. Langworthy, Leslie Lamport,
and Friedrich H. Vogt.
[Formal
Specification of a Web Services Protocol](https://www.microsoft.com/en-us/research/publication/formal-specification-of-a-web-services-protocol/). At *1st International Workshop on Web Services and
Formal Methods* (WS-FM), February 2004.
[doi:10.1016/j.entcs.2004.02.022](https://doi.org/10.1016/j.entcs.2004.02.022)

[[76](/en/ch8#Gray1981_ch8-marker)] Jim Gray.
[The Transaction
Concept: Virtues and Limitations](https://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf). At *7th International Conference on Very Large Data
Bases* (VLDB), September 1981.

[[77](/en/ch8#Skeen1981-marker)] Dale Skeen.
[Nonblocking Commit
Protocols](https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/Ske81.pdf). At *ACM International Conference on Management of Data* (SIGMOD), April 1981.
[doi:10.1145/582318.582339](https://doi.org/10.1145/582318.582339)

[[78](/en/ch8#Hohpe2005-marker)] Gregor Hohpe.
[Your Coffee Shop Doesn’t Use
Two-Phase Commit](https://www.martinfowler.com/ieeeSoftware/coffeeShop.pdf). *IEEE Software*, volume 22, issue 2, pages 64–66, March 2005.
[doi:10.1109/MS.2005.52](https://doi.org/10.1109/MS.2005.52)

[[79](/en/ch8#Helland2007_ch8-marker)] Pat Helland.
[Life Beyond Distributed Transactions:
An Apostate’s Opinion](https://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf). At *3rd Biennial Conference on Innovative Data Systems Research*
(CIDR), January 2007.

[[80](/en/ch8#Oliver2011-marker)] Jonathan Oliver.
[My Beef with
MSDTC and Two-Phase Commits](https://blog.jonathanoliver.com/my-beef-with-msdtc-and-two-phase-commits/). *blog.jonathanoliver.com*, April 2011.
Archived at [perma.cc/K8HF-Z4EN](https://perma.cc/K8HF-Z4EN)

[[81](/en/ch8#Rahien2014-marker)] Oren Eini (Ahende Rahien).
[The Fallacy of
Distributed Transactions](https://ayende.com/blog/167362/the-fallacy-of-distributed-transactions). *ayende.com*, July 2014.
Archived at [perma.cc/VB87-2JEF](https://perma.cc/VB87-2JEF)

[[82](/en/ch8#Vasters2012-marker)] Clemens Vasters.
[Transactions
in Windows Azure (with Service Bus) – An Email Discussion](https://learn.microsoft.com/en-gb/archive/blogs/clemensv/transactions-in-windows-azure-with-service-bus-an-email-discussion). *learn.microsoft.com*, July 2012.
Archived at [perma.cc/4EZ9-5SKW](https://perma.cc/4EZ9-5SKW)

[[83](/en/ch8#Dhariwal2008-marker)] Ajmer Dhariwal.
[Orphaned MSDTC
Transactions (-2 spids)](https://www.eraofdata.com/posts/2008/orphaned-msdtc-transactions-2-spids/). *eraofdata.com*, December 2008.
Archived at [perma.cc/YG6F-U34C](https://perma.cc/YG6F-U34C)

[[84](/en/ch8#Randal2013-marker)] Paul Randal.
[Real
World Story of DBCC PAGE Saving the Day](https://www.sqlskills.com/blogs/paul/real-world-story-of-dbcc-page-saving-the-day/). *sqlskills.com*, June 2013.
Archived at [perma.cc/2MJN-A5QH](https://perma.cc/2MJN-A5QH)

[[85](/en/ch8#Wang2021-marker)] Guozhang Wang, Lei Chen, Ayusman Dikshit, Jason
Gustafson, Boyang Chen, Matthias J. Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva
Mehta, Varun Madan, and Jun Rao.
[Consistency and Completeness:
Rethinking Distributed Stream Processing in Apache Kafka](https://dl.acm.org/doi/pdf/10.1145/3448016.3457556). At *ACM International Conference on
Management of Data* (SIGMOD), June 2021.
[doi:10.1145/3448016.3457556](https://doi.org/10.1145/3448016.3457556)