ddia/content/en/ch10.md

---
title: "10. Consistency and Consensus"
weight: 210
breadcrumbs: false
---

> *An ancient adage warns, “Never go to sea with two chronometers; take one or three.”*
>
> Frederick P. Brooks Jr., *The Mythical Man-Month: Essays on Software Engineering* (1995)

Lots of things can go wrong in distributed systems, as discussed in [Chapter 9](/en/ch9#ch_distributed). If we want a
service to continue working correctly despite those things going wrong, we need to find ways of
tolerating faults.

One of the best tools we have for fault tolerance is *replication*. However, as we saw in
[Chapter 6](/en/ch6#ch_replication), having multiple copies of the data on multiple replicas opens up the risk of
inconsistencies. Reads might be handled by a replica that is not up-to-date, yielding stale results.
If multiple replicas can accept writes, we have to deal with conflicts between values that were
concurrently written on different replicas. At a high level, there are two competing philosophies
for dealing with such issues:

Eventual consistency
: In this philosophy, the fact that a system is replicated is made visible to the application, and
 you as application developer are expected to deal with the inconsistencies and conflicts that may
 arise. This approach is often used in systems with multi-leader (see
 [“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)) and leaderless replication (see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)).

Strong consistency
: This philosophy says that applications should not have to worry about internal details of
 replication, and that the system should behave as if it was single-node. The advantage of this
 approach is that it’s simpler for you, the application developer. The disadvantage is that
 stronger consistency has a performance cost, and some kinds of fault that an eventually consistent
 system can tolerate cause outages in strongly consistent systems.

As always, which approach is better depends on your application. If you have an app where users can
make changes to data while offline, then eventual consistency is inevitable, as discussed in
[“Sync Engines and Local-First Software”](/en/ch6#sec_replication_offline_clients). However, eventual consistency can also be difficult for
applications to deal with. If your replicas are located in datacenters with fast, reliable
communication, then strong consistency is often appropriate because its cost is acceptable.

In this chapter we will dive deeper into the strongly consistent approach, looking at three areas:

1. One challenge is that “strong consistency” is quite vague, so we will develop a more precise
 definition of what we want to achieve: *linearizability*.
2. We will look at the problem of generating IDs and timestamps. This may sound unrelated to
 consistency but is actually closely connected.
3. We will explore how distributed systems can achieve linearizability while still remaining
 fault-tolerant; the answer is *consensus* algorithms.

Along the way, we will see that there are some fundamental limits on what is possible and what is
not in a distributed system.

The topics of this chapter are notorious for being hard to implement correctly; it’s very easy to
build systems that behave fine when there are no faults, but which completely fall apart when faced
with an unlucky combination of faults that the designer of the system hadn’t considered. A lot of
theory has been developed to help us think through those edge cases, which enables us to build
systems that can robustly tolerate faults.

This chapter will only scratch the surface: we will stick with informal intuitions, and avoid the
algorithmic nitty-gritty, formal models, and proofs. If you want to do serious work on consensus
systems and similar infrastructure, you will need to go much deeper into the theory if you want any
chance of your systems being robust. As usual, the literature references in this chapter provide
some initial pointers.


## Linearizability {#sec_consistency_linearizability}

If you want a replicated database to be as simple as possible to use, you should make it behave as
if it wasn’t replicated at all. Then users don’t have to worry about replication lag, conflicts, and
other inconsistencies. That would give us the advantage of fault tolerance, but without the
complexity arising from having to think about multiple replicas.

This is the idea behind *linearizability* [^1] (also known as *atomic consistency* [^2], *strong consistency*, *immediate consistency*, or *external consistency* [^3]).
The exact definition of linearizability is quite subtle, and we will explore it in the rest of this
section. But the basic idea is to make a system appear as if there were only one copy of the data,
and all operations on it are atomic. With this guarantee, even though there may be multiple replicas
in reality, the application does not need to worry about them.

In a linearizable system, as soon as one client successfully completes a write, all clients reading
from the database must be able to see the value just written. Maintaining the illusion of a single
copy of the data means guaranteeing that the value read is the most recent, up-to-date value, and
doesn’t come from a stale cache or replica. In other words, linearizability is a *recency
guarantee*. To clarify this idea, let’s look at an example of a system that is not linearizable.

{{< figure src="/fig/ddia_1001.png" id="fig_consistency_linearizability_0" caption="Figure 10-1. If this database were linearizable, then either Alice's read would return 1 instead of 0, or Bob's read would return 0 instead of 1." class="w-full my-4" >}}

[Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4].
Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a
game their favorite team is playing. Just after the final score is announced, Aaliyah refreshes the
page, sees the winner announced, and excitedly tells Bryce about it. Bryce incredulously hits
*reload* on his own phone, but his request goes to a database replica that is lagging, and so his
phone shows that the game is still ongoing.

If Aaliyah and Bryce had hit reload at the same time, it would have been less surprising if they had
gotten two different query results, because they wouldn’t know at exactly what time their respective
requests were processed by the server. However, Bryce knows that he hit the reload button (initiated
his query) *after* he heard Aaliyah exclaim the final score, and therefore he expects his query
result to be at least as recent as Aaliyah’s. The fact that his query returned a stale result is a
violation of linearizability.

### What Makes a System Linearizable? {#what-makes-a-system-linearizable}

In order to understand linearizability better, let’s look at some more examples.
[Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows three clients concurrently reading and writing the same
object *x* in a linearizable database. In distributed systems theory, *x* is called a *register*—in
practice, it could be one key in a key-value store, one row in a relational database, or one
document in a document database, for example.

{{< figure src="/fig/ddia_1002.png" id="fig_consistency_linearizability_1" caption="Figure 10-2. Alice observes that x = 0 and y = 1, while Bob observes that x = 1 and y = 0. It's as if Alice's and Bob's computers disagree on the order in which the writes happened." class="w-full my-4" >}}


For simplicity, [Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows only the requests from the clients’
point of view, not the internals of the database. Each bar is a request made by a client, where the
start of a bar is the time when the request was sent, and the end of a bar is when the response was
received by the client. Due to variable network delays, a client doesn’t know exactly when the
database processed its request—it only knows that it must have happened sometime between the
client sending the request and receiving the response.

In this example, the register has two types of operations:

* *read*(*x*) ⇒ *v* means the client requested to read the value of register
 *x*, and the database returned the value *v*.
* *write*(*x*, *v*) ⇒ *r* means the client requested to set the
 register *x* to value *v*, and the database returned response *r* (which could be *ok* or *error*).

In [Figure 10-2](/en/ch10#fig_consistency_linearizability_1), the value of *x* is initially 0, and client C performs a
write request to set it to 1. While this is happening, clients A and B are repeatedly polling the
database to read the latest value. What are the possible responses that A and B might get for their
read requests?

* The first read operation by client A completes before the write begins, so it must definitely
 return the old value 0.
* The last read by client A begins after the write has completed, so it must definitely return the
 new value 1 if the database is linearizable, because the read must have been processed after the
 write.
* Any read operations that overlap in time with the write operation might return either 0 or 1,
 because we don’t know whether or not the write has taken effect at the time when the read
 operation is processed. These operations are *concurrent* with the write.

However, that is not yet sufficient to fully describe linearizability: if reads that are concurrent
with a write can return either the old or the new value, then readers could see a value flip back
and forth between the old and the new value several times while a write is going on. That is not
what we expect of a system that emulates a “single copy of the data.”

To make the system linearizable, we need to add another constraint, illustrated in
[Figure 10-3](/en/ch10#fig_consistency_linearizability_2).

{{< figure src="/fig/ddia_1003.png" id="fig_consistency_linearizability_2" caption="Figure 10-3. If Alice and Bob had perfect clocks, linearizability would require that x = 1 is returned, since the read of x begins after the write x = 1 completes." class="w-full my-4" >}}


In a linearizable system we imagine that there must be some point in time (between the start and end
of the write operation) at which the value of *x* atomically flips from 0 to 1. Thus, if one
client’s read returns the new value 1, all subsequent reads must also return the new value, even if
the write operation has not yet completed.

This timing dependency is illustrated with an arrow in [Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
Client A is the first to read the new value, 1. Just after A’s read returns, B begins a new read.
Since B’s read occurs strictly after A’s read, it must also return 1, even though the write by C is
still ongoing. (It’s the same situation as with Aaliyah and Bryce in
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0): after Aaliyah has read the new value, Bryce also expects to
read the new value.)

We can further refine this timing diagram to visualize each operation taking effect atomically at
some point in time [^5],
like in the more complex example shown in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3). In this example we
add a third type of operation besides *read* and *write*:

* *cas*(*x*, *v*old, *v*new) ⇒ *r* means the client
 requested an atomic *compare-and-set* operation (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)). If the
 current value of the register *x* equals *v*old, it should be atomically set to *v*new. If
 the value of *x* is different from *v*old, then the operation should leave the register
 unchanged and return an error. *r* is the database’s response (*ok* or *error*).

Each operation in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3) is marked with a vertical line (inside the
bar for each operation) at the time when we think the operation was executed. Those markers are
joined up in a sequential order, and the result must be a valid sequence of reads and writes for a
register (every read must return the value set by the most recent write).

The requirement of linearizability is that the lines joining up the operation markers always move
forward in time (from left to right), never backward. This requirement ensures the recency guarantee we
discussed earlier: once a new value has been written or read, all subsequent reads see the value
that was written, until it is overwritten again.

{{< figure src="/fig/ddia_1004.png" id="fig_consistency_linearizability_3" caption="Figure 10-4. The read of x is concurrent with the write x = 1. Since we don't know the exact timing of the operations, the read is allowed to return either 0 or 1." class="w-full my-4" >}}


There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3):

* First client B sent a request to read *x*, then client D sent a request to set *x* to 0, and then
 client A sent a request to set *x* to 1. Nevertheless, the value returned to B’s read is 1 (the
 value written by A). This is okay: it means that the database first processed D’s write, then A’s
 write, and finally B’s read. Although this is not the order in which the requests were sent, it’s
 an acceptable order, because the three requests are concurrent. Perhaps B’s read request was
 slightly delayed in the network, so it only reached the database after the two writes.
* Client B’s read returned 1 before client A received its response from the database, saying that
 the write of the value 1 was successful. This is also okay: it just means the *ok* response from
 the database to client A was slightly delayed in the network.
* This model doesn’t assume any transaction isolation: another client may change a value at any
 time. For example, C first reads 1 and then reads 2, because the value was changed by B between
 the two reads. An atomic compare-and-set (*cas*) operation can be used to check the value hasn’t
 been concurrently changed by another client: B and C’s *cas* requests succeed, but D’s *cas*
 request fails (by the time the database processes it, the value of *x* is no longer 0).
* The final read by client B (in a shaded bar) is not linearizable. The operation is concurrent with
 C’s *cas* write, which updates *x* from 2 to 4. In the absence of other requests, it would be okay for
 B’s read to return 2. However, client A has already read the new value 4 before B’s read started,
 so B is not allowed to read an older value than A. Again, it’s the same situation as with Aaliyah
 and Bryce in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0).

That is the intuition behind linearizability; the formal definition [^1] describes it more precisely. It is
possible (though computationally expensive) to test whether a system’s behavior is linearizable by
recording the timings of all requests and responses, and checking whether they can be arranged into
a valid sequential order [^6] [^7].

Just as there are various weak isolation levels for transactions besides serializability (see
[“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels)), there are also various weaker consistency models for
replicated systems besides linearizability [^8].
In fact, the *read-after-write*, *monotonic reads*, and *consistent prefix reads* properties we saw
in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag) are examples of such weaker consistency models. Linearizability
guarantees all these weaker properties, and more. In this chapter we will focus on linearizability,
which is the strongest consistency model in common use.


--------

> [!TIP] LINEARIZABILITY VERSUS SERIALIZABILITY

Linearizability is easily confused with serializability (see [“Serializability”](/en/ch8#sec_transactions_serializability)),
as both words seem to mean something like “can be arranged in a sequential order.” However, they are
quite different guarantees, and it is important to distinguish between them:

Serializability
: Serializability is an isolation property of transactions, where every transaction may read and
 write *multiple objects* (rows, documents, records). It guarantees that transactions behave the
 same as if they had executed in *some* serial order: that is, as if you first performed all of one
 transaction’s operations, then all of another transaction’s operations, and so on, without
 interleaving them. It is okay for that serial order to be different from the order in which the
 transactions were actually run [^9].

Linearizability
: Linearizability is a guarantee on reads and writes of a register (an *individual object*). It
 doesn’t group operations together into transactions, so it does not prevent problems such as write
 skew that involve multiple objects (see [“Write Skew and Phantoms”](/en/ch8#sec_transactions_write_skew)). However, linearizability
 is a *recency* guarantee: it requires that if one operation finishes before another one starts,
 then the later operation must observe a state that is at least as new as the earlier operation.
 Serializability does not have that requirement: for example, stale reads are allowed by
 serializability [^10].

(*Sequential consistency* is something else again [^8], but we won’t discuss it here.)

A database may provide both serializability and linearizability, and this combination is known as
*strict serializability* or *strong one-copy serializability* (*strong-1SR*) [^11] [^12].
Single-node databases are typically linearizable. With distributed databases using optimistic
methods like serializable snapshot isolation (see [“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)) the situation is more
complicated: for example, CockroachDB provides serializability, and some recency guarantees on
reads, but not strict serializability [^13]
because this would require expensive coordination between transactions [^14].

It is also possible to combine a weaker isolation level with linearizability, or a weaker
consistency model with serializability; in fact, consistency model and isolation level can be chosen
largely independently from each other [^15] [^16].

--------

### Relying on Linearizability {#relying-on-linearizability}

In what circumstances is linearizability useful? Viewing the final score of a sporting match is
perhaps a frivolous example: a result that is outdated by a few seconds is unlikely to cause any
real harm in this situation. However, there a few areas in which linearizability is an important
requirement for making a system work correctly.

#### Locking and leader election {#locking-and-leader-election}

A system that uses single-leader replication needs to ensure that there is indeed only one leader,
not several (split brain). One way of electing a leader is to use a lease: every node that starts up
tries to acquire the lease, and the one that succeeds becomes the leader [^17].
No matter how this mechanism is implemented, it must be linearizable: it should not be possible for
two different nodes to acquire the lease at the same time.

Coordination services like Apache ZooKeeper [^18]
and etcd are often used to implement distributed leases and leader election. They use consensus
algorithms to implement linearizable operations in a fault-tolerant way (we discuss such algorithms
later in this chapter). There are still many subtle details to implementing leases and leader
election correctly (see for example the fencing issue in [“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing)), and
libraries like Apache Curator help by providing higher-level recipes on top of ZooKeeper. However, a
linearizable storage service is the basic foundation for these coordination tasks.

--------

> [!NOTE]
> Strictly speaking, ZooKeeper provides linearizable writes, but reads may be stale, since there is no
> guarantee that they are served from the current leader [^18]. etcd since version 3 provides linearizable reads by default.

--------


Distributed locking is also used at a much more granular level in some distributed databases, such as
Oracle Real Application Clusters (RAC) [^19].
RAC uses a lock per disk page, with multiple nodes sharing access
to the same disk storage system. Since these linearizable locks are on the critical path of
transaction execution, RAC deployments usually have a dedicated cluster interconnect network for
communication between database nodes.

#### Constraints and uniqueness guarantees {#sec_consistency_uniqueness}

Uniqueness constraints are common in databases: for example, a username or email address must
uniquely identify one user, and in a file storage service there cannot be two files with the same
path and filename. If you want to enforce this constraint as the data is written (such that if two people
try to concurrently create a user or a file with the same name, one of them will be returned an
error), you need linearizability.

This situation is actually similar to a lock: when a user registers for your service, you can think
of them acquiring a “lock” on their chosen username. The operation is also very similar to an atomic
compare-and-set, setting the username to the ID of the user who claimed it, provided that the
username is not already taken.

Similar issues arise if you want to ensure that a bank account balance never goes negative, or that
you don’t sell more items than you have in stock in the warehouse, or that two people don’t
concurrently book the same seat on a flight or in a theater. These constraints all require there to
be a single up-to-date value (the account balance, the stock level, the seat occupancy) that all
nodes agree on.

In real applications, it is sometimes acceptable to treat such constraints loosely (for example, if
a flight is overbooked, you can move customers to a different flight and offer them compensation for
the inconvenience). In such cases, linearizability may not be needed, and we will discuss such
loosely interpreted constraints in [Link to Come].

However, a hard uniqueness constraint, such as the one you typically find in relational databases,
requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints,
can be implemented without linearizability [^20].

#### Cross-channel timing dependencies {#cross-channel-timing-dependencies}

Notice a detail in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0): if Aaliyah hadn’t exclaimed the score,
Bryce wouldn’t have known that the result of his query was stale. He would have just refreshed the
page again a few seconds later, and eventually seen the final score. The linearizability violation
was only noticed because there was an additional communication channel in the system (Aaliyah’s
voice to Bryce’s ears).

Similar situations can arise in computer systems. For example, say you have a website where users
can upload a video, and a background process transcodes the video to a lower quality that can be
streamed on slow internet connections. The architecture and dataflow of this system is illustrated
in [Figure 10-5](/en/ch10#fig_consistency_transcoder).

The video transcoder needs to be explicitly instructed to perform a transcoding job, and this
instruction is sent from the web server to the transcoder via a message queue (see [Link to Come]).
The web server doesn’t place the entire video on the queue, since most message brokers are designed
for small messages, and a video may be many megabytes in size. Instead, the video is first written
to a file storage service, and once the write is complete, the instruction to the transcoder is
placed on the queue.

{{< figure src="/fig/ddia_1005.png" id="fig_consistency_transcoder" caption="Figure 10-5. A system that is not linearizable: Alice and Bob see the uploaded image at different times, and thus Bob's request is based on stale data." class="w-full my-4" >}}


If the file storage service is linearizable, then this system should work fine. If it is not
linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in
[Figure 10-5](/en/ch10#fig_consistency_transcoder)) might be faster than the internal replication inside the storage
service. In this case, when the transcoder fetches the original video (step 5), it might see an old
version of the file, or nothing at all. If it processes an old version of the video, the original
and transcoded videos in the file storage become permanently inconsistent with each other.

This problem arises because there are two different communication channels between the web server
and the transcoder: the file storage and the message queue. Without the recency guarantee of
linearizability, race conditions between these two channels are possible. This situation is
analogous to [Figure 10-1](/en/ch10#fig_consistency_linearizability_0), where there was also a race condition between
two communication channels: the database replication and the real-life audio channel between
Aaliyah’s mouth and Bryce’s ears.

A similar race condition occurs if you have a mobile app that can receive push notifications, and
the app fetches some data from a server when it receives a push notification. If the data fetch
might go to a lagging replica, it could happen that the push notification goes through quickly, but
the subsequent fetch doesn’t see the data that the push notification was about.

Linearizability is not the only way of avoiding this race condition, but it’s the simplest to
understand. If you control the additional communication channel (like in the case of the message
queue, but not in the case of Aaliyah and Bryce), you can use alternative approaches similar to what
we discussed in [“Reading Your Own Writes”](/en/ch6#sec_replication_ryw), at the cost of additional complexity.


### Implementing Linearizable Systems {#implementing-linearizable-systems}

Now that we’ve looked at a few examples in which linearizability is useful, let’s think about how we
might implement a system that offers linearizable semantics.

Since linearizability essentially means “behave as though there is only a single copy of the data,
and all operations on it are atomic,” the simplest answer would be to really only use a single copy
of the data. However, that approach would not be able to tolerate faults: if the node holding that
one copy failed, the data would be lost, or at least inaccessible until the node was brought up again.

Let’s revisit the replication methods from [Chapter 6](/en/ch6#ch_replication), and compare whether they can be made linearizable:

Single-leader replication (potentially linearizable)
: In a system with single-leader replication, the leader has the primary copy of the data that is
 used for writes, and the followers maintain backup copies of the data on other nodes. As long as
 you perform all reads and writes on the leader, they are likely to be linearizable. However, this
 assumes that you know for sure who the leader is. As discussed in
 [“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing), it is quite possible for a node to think that it is the leader,
 when in fact it is not—and if the delusional leader continues to serve requests, it is likely to
 violate linearizability [^21].
 With asynchronous replication, failover may even lose committed writes, which violates both
 durability and linearizability.

 Sharding a single-leader database, with a separate leader per shard, does not affect
 linearizability, since it is only a single-object guarantee. Cross-shard transactions are a
 different matter (see [“Distributed Transactions”](/en/ch8#sec_transactions_distributed)).

Consensus algorithms (likely linearizable)
: Some consensus algorithms are essentially single-leader replication with automatic leader election
 and failover. They are carefully designed to prevent split brain, allowing them to implement
 linearizable storage safely. ZooKeeper uses the Zab consensus algorithm [^22] and etcd uses Raft [^23], for example.
 However, just because a system uses consensus does not guarantee that all operations on it are
 linearizable: if it allows reads on a node without checking that it is still the leader, the
 results of the read may be stale if a new leader has just been elected.

Multi-leader replication (not linearizable)
: Systems with multi-leader replication are generally not linearizable, because they concurrently
 process writes on multiple nodes and asynchronously replicate them to other nodes. For this
 reason, they can produce conflicting writes that require resolution (see
 [“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts)).

Leaderless replication (probably not linearizable)
: For systems with leaderless replication (Dynamo-style; see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)), people
 sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes
 (*w* + *r* > *n*). Depending on the exact algorithm, and depending on how you define
 strong consistency, this is not quite true.

 “Last write wins” conflict resolution methods based on time-of-day clocks (e.g., in Cassandra and
 ScyllaDB) are almost certainly nonlinearizable, because clock timestamps cannot be guaranteed to be
 consistent with actual event ordering due to clock skew (see [“Relying on Synchronized Clocks”](/en/ch9#sec_distributed_clocks_relying)).
 Even with quorums, nonlinearizable behavior is possible, as demonstrated in the next section.

#### Linearizability and quorums {#sec_consistency_quorum_linearizable}

Intuitively, it seems as though quorum reads and writes should be linearizable in a
Dynamo-style model. However, when we have variable network delays, it is possible to have race
conditions, as demonstrated in [Figure 10-6](/en/ch10#fig_consistency_leaderless).

{{< figure src="/fig/ddia_1006.png" id="fig_consistency_leaderless" caption="Figure 10-6. Quorums are not sufficient to ensure linearizability if network delays are variable." class="w-full my-4" >}}


In [Figure 10-6](/en/ch10#fig_consistency_leaderless), the initial value of *x* is 0, and a writer client is updating
*x* to 1 by sending the write to all three replicas (*n* = 3, *w* = 3).
Concurrently, client A reads from a quorum of two nodes (*r* = 2) and sees the new value 1
on one of the nodes. Also concurrently with the write, client B reads from a different quorum of two
nodes, and gets back the old value 0 from both.

The quorum condition is met (*w* + *r* > *n*), but this execution is nevertheless not
linearizable: B’s request begins after A’s request completes, but B returns the old value while A
returns the new value. (It’s once again the Aaliyah and Bryce situation from
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0).)

It is possible to make Dynamo-style quorums linearizable at the cost of reduced
performance: a reader must perform read repair (see [“Catching up on missed writes”](/en/ch6#sec_replication_read_repair)) synchronously,
before returning results to the application [^24].
Moreover, before writing, a writer must read the latest state of a quorum of nodes to fetch the
latest timestamp of any prior write, and ensure that the new write has a greater timestamp [^25] [^26].
However, Riak does not perform synchronous read repair due to the performance penalty.
Cassandra does wait for read repair to complete on quorum reads [^27],
but it loses linearizability due to its use of time-of-day clocks for timestamps.

Moreover, only linearizable read and write operations can be implemented in this way; a
linearizable compare-and-set operation cannot, because it requires a consensus algorithm [^28].

In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not
provide linearizability, even with quorum reads and writes.

### The Cost of Linearizability {#the-cost-of-linearizability}

As some replication methods can provide linearizability and others cannot, it is interesting to
explore the pros and cons of linearizability in more depth.

We already discussed some use cases for different replication methods in [Chapter 6](/en/ch6#ch_replication); for
example, we saw that multi-leader replication is often a good choice for multi-region
replication (see [“Geographically Distributed Operation”](/en/ch6#sec_replication_multi_dc)). An example of such a deployment is illustrated in
[Figure 10-7](/en/ch10#fig_consistency_cap_availability).

{{< figure src="/fig/ddia_1007.png" id="fig_consistency_cap_availability" caption="Figure 10-7. If clients cannot contact enough replicas due to a network partition, they cannot process writes." class="w-full my-4" >}}


Consider what happens if there is a network interruption between the two regions. Let’s assume
that the network within each region is working, and clients can reach their local region, but the
regions cannot connect to each other. This is known as a *network partition*.

With a multi-leader database, each region can continue operating normally: since writes from one
region are asynchronously replicated to the other, the writes are simply queued up and exchanged
when network connectivity is restored.

On the other hand, if single-leader replication is used, then the leader must be in one of the
regions. Any writes and any linearizable reads must be sent to the leader—thus, for any
clients connected to a follower region, those read and write requests must be sent synchronously
over the network to the leader region.

If the network between regions is interrupted in a single-leader setup, clients connected to
follower regions cannot contact the leader, so they cannot make any writes to the database, nor
any linearizable reads. They can still make reads from the follower, but they might be stale
(nonlinearizable). If the application requires linearizable reads and writes, the network
interruption causes the application to become unavailable in the regions that cannot contact the leader.

If clients can connect directly to the leader region, this is not a problem, since the
application continues to work normally there. But clients that can only reach a follower region
will experience an outage until the network link is repaired.

#### The CAP theorem {#the-cap-theorem}

This issue is not just a consequence of single-leader and multi-leader replication: any linearizable
database has this problem, no matter how it is implemented. The issue also isn’t specific to
multi-region deployments, but can occur on any unreliable network, even within one region.
The trade-off is as follows:

* If your application *requires* linearizability, and some replicas are disconnected from the other
 replicas due to a network problem, then some replicas cannot process requests while they are
 disconnected: they must either wait until the network problem is fixed, or return an error (either
 way, they become *unavailable*). This choice is sometimes known as *CP* (consistent under network partitions).
* If your application *does not require* linearizability, then it can be written in a way that each
 replica can process requests independently, even if it is disconnected from other replicas (e.g.,
 multi-leader). In this case, the application can remain *available* in the face of a network
 problem, but its behavior is not linearizable. This choice is known as *AP* (available under network partitions).

Thus, applications that don’t require linearizability can be more tolerant of network problems. This
insight is popularly known as the *CAP theorem* [^29] [^30] [^31] [^32],
named by Eric Brewer in 2000, although the trade-off had been known to designers of
distributed databases since the 1970s [^33] [^34] [^35].

CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of
starting a discussion about trade-offs in databases. At the time, many distributed databases
focused on providing linearizable semantics on a cluster of machines with shared storage [^19], and CAP encouraged database engineers
to explore a wider design space of distributed shared-nothing systems, which were more suitable for
implementing large-scale web services [^36].
CAP deserves credit for this culture shift—it helped trigger the NoSQL movement, a burst of new
database technologies around the mid-2000s.

> [!TIP] THE UNHELPFUL CAP THEOREM

CAP is sometimes presented as *Consistency, Availability, Partition tolerance: pick 2 out of 3*.
Unfortunately, putting it this way is misleading [^32] because network partitions are a kind of
fault, so they aren’t something about which you have a choice: they will happen whether you like it or not.

At times when the network is working correctly, a system can provide both consistency
(linearizability) and total availability. When a network fault occurs, you have to choose between
either linearizability or total availability. Thus, a better way of phrasing CAP would be
*either Consistent or Available when Partitioned* [^37].
A more reliable network needs to make this choice less often, but at some point the choice is inevitable.

The CP/AP classification scheme has several further flaws [^4]. *Consistency* is formalized as
linearizability (the theorem doesn’t say anything about weaker consistency models), and the
formalization of *availability* [^30] does not
match the usual meaning of the term [^38]. Many highly available (fault-tolerant) systems actually do not meet CAP’s
idiosyncratic definition of availability. Moreover, some system designers choose (with good reason)
to provide neither linearizability nor the form of availability that the CAP theorem assumes, so
those systems are neither CP nor AP [^39] [^40].

All in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us
understand systems better, so CAP is best avoided.

The CAP theorem as formally defined [^30] is of
very narrow scope: it only considers one consistency model (namely linearizability) and one kind of
fault (network partitions, which according to data from Google are the cause of less than 8% of incidents [^41]).
It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP
has been historically influential, it has little practical value for designing systems [^4] [^38].

There have been efforts to generalize CAP. For example, the *PACELC principle* observes that system
designers might also choose to weaken consistency at times when the network is working fine in order
to reduce latency [^39] [^40] [^42].
Thus, during a network partition (P), we need to choose between availability (A) and consistency (C);
else (E), when there is no partition, we may choose between low latency (L) and consistency (C).
However, this definition inherits several problems with CAP, such as the counterintuitive definitions of consistency and availability.

There are many more interesting impossibility results in distributed systems [^43], and CAP has now been
superseded by more precise results [^44] [^45], so it is of mostly historical interest today.

#### Linearizability and network delays {#linearizability-and-network-delays}

Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable
in practice. For example, even RAM on a modern multi-core CPU is not linearizable [^46]:
if a thread running on one CPU core writes to a memory address, and a thread on another CPU core
reads the same address shortly afterward, it is not guaranteed to read the value written by the
first thread (unless a *memory barrier* or *fence* [^47] is used).

The reason for this behavior is that every CPU core has its own memory cache and store buffer.
Memory access first goes to the cache by default, and any changes are asynchronously written out to
main memory. Since accessing data in the cache is much faster than going to main memory [^48], this feature is essential for
good performance on modern CPUs. However, there are now several copies of the data (one in main
memory, and perhaps several more in various caches), and these copies are asynchronously updated, so
linearizability is lost.

Why make this trade-off? It makes no sense to use the CAP theorem to justify the multi-core memory
consistency model: within one computer we usually assume reliable communication, and we don’t expect
one CPU core to be able to continue operating normally if it is disconnected from the rest of the
computer. The reason for dropping linearizability is *performance*, not fault tolerance [^39].

The same is true of many distributed databases that choose not to provide linearizable guarantees:
they do so primarily to increase performance, not so much for fault tolerance [^42].
Linearizability is slow—and this is true all the time, not only during a network fault.

Can’t we maybe find a more efficient implementation of linearizable storage? It seems the answer is
no: Attiya and Welch [^49] prove that if you want linearizability, the response time of read and write requests is at least
proportional to the uncertainty of delays in the network. In a network with highly variable delays,
like most computer networks (see [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing)), the response time of linearizable
reads and writes is inevitably going to be high. A faster algorithm for linearizability does not
exist, but weaker consistency models can be much faster, so this trade-off is important for
latency-sensitive systems. In [Link to Come] we will discuss some approaches for avoiding
linearizability without sacrificing correctness.


## ID Generators and Logical Clocks {#sec_consistency_logical}

In many applications you need to assign some sort of unique ID to database records when they are
created, which gives you a primary key by which you can refer to those records. In single-node
databases it is common to use an auto-incrementing integer, which has the advantage that it can be
stored in only 64 bits (or even 32 bits if you are sure that you will never have more than 4 billion
records, but that is risky).

Another advantage of such auto-incrementing IDs is that the order of the IDs tells you the order in
which the records were created. For example, [Figure 10-8](/en/ch10#fig_consistency_id_generator) shows a chat
application that assigns auto-incrementing IDs to chat messages as they are posted. You can then
display the messages in order of increasing ID, and the resulting chat threads will make sense:
Aaliyah posts a question that is assigned ID 1, and Bryce’s answer to the question is assigned a
greater ID, namely 3.

{{< figure src="/fig/ddia_1008.png" id="fig_consistency_id_generator" caption="Figure 10-8. Two different nodes may generate conflicting IDs." class="w-full my-4" >}}


This single-node ID generator is another example of a linearizable system. Each request to fetch the
ID is an operation that atomically increments a counter and returns the old counter value (a
*fetch-and-add* operation); linearizability ensures that if the posting of Aaliyah’s message
completes before Bryce’s posting begins, then Bryce’s ID must be greater than Aaliyah’s. The
messages by Aaliyah and Caleb in [Figure 10-8](/en/ch10#fig_consistency_id_generator) are concurrent, so linearizability
doesn’t specify how their IDs must be ordered, as long as they are unique.

An in-memory single-node ID generator is easy to implement: you can use the atomic increment
instruction provided by your CPU, which allows multiple threads to safely increment the same
counter. It’s a bit more effort to make the counter persistent, so that the node can crash and
restart without resetting the counter value, which would result in duplicate IDs. But the real
problems are:

* A single-node ID generator is not fault-tolerant because that node is a single point of failure.
* It’s slow if you want to create a record in another region, as you potentially have to make a
 round-trip to the other side of the planet just to get an ID.
* That single node could become a bottleneck if you have high write throughput.

There are various alternative options for ID generators that you can consider:

Sharded ID assignment
: You could have multiple nodes that assign IDs—for example, one that generates only even numbers,
 and one that generates only odd numbers. In general, you can reserve some bits in the ID to
 contain a shard number. Those IDs are still compact, but you lose the ordering property: for
 example, if you have chat messages with IDs 16 and 17, you don’t know whether message 16 was
 actually sent first, because the IDs were assigned by different nodes, and one node might have
 been ahead of the other.

Preallocated blocks of IDs
: Instead of requesting individual IDs from the single-node ID generator, it could hand out blocks
 of IDs. For example, node A might claim the block of IDs from 1 to 1,000, and node B might claim
 the block from 1,001 to 2,000. Then each node can independently hand out IDs from its block, and
 request a new block from the single-node ID generator when its supply of sequence numbers begins
 to run low. However, this scheme doesn’t ensure correct ordering either: it could happen that one
 message is given an ID in the range from 1,001 to 2,000, and a later message is given an ID in the
 range from 1 to 1,000 if the ID was assigned by a different node.

Random UUIDs
: You can use *universally unique identifiers* (UUIDs), also known as *globally unique identifiers*
 (GUIDs). These have the big advantage that they can be generated locally on any node without
 requiring communication, but they require more space (128 bits). There are several different
 versions of UUIDs; the simplest is version 4, which is essentially a random number that is so long
 that is very unlikely that two nodes would ever pick the same one. Unfortunately, the order of
 such IDs is also random, so comparing two IDs tells you nothing about which one is newer.

Wall-clock timestamp made unique
: If your nodes’ time-of-day clock is kept approximately correct using NTP, you can generate IDs by
 putting a timestamp from that clock in the most significant bits, and filling the remaining bits
 with extra information that ensures the ID is unique even if the timestamp is not—for example, a
 shard number and a per-shard incrementing sequence number, or a long random value. This approach
 is used in Version 7 UUIDs [^50], Twitter’s Snowflake [^51], ULIDs [^52], Hazelcast’s Flake ID generator,
 MongoDB ObjectIDs, and many similar schemes [^50]. You can implement these ID generators in application code or within a database [^53].

All these schemes generate IDs that are unique (at least with high enough probability that
collisions are vanishingly rare), but they have much weaker ordering guarantees for IDs than the
single-node auto-incrementing scheme.

As discussed in [“Timestamps for ordering events”](/en/ch9#sec_distributed_lww), wall-clock timestamps can provide at best an approximate
ordering: if an earlier write gets a timestamp from a slightly fast clock, and a later write’s
timestamp is from a slightly slow clock, the timestamp order may be inconsistent with the order in
which the events actually happened. With clock jumps due to using a non-monotonic clock, even the
timestamps generated by a single node might be ordered incorrectly. ID generators based on
wall-clock time are therefore unlikely to be linearizable.

You can reduce such ordering inconsistencies by relying on high-precision clock synchronization,
using atomic clocks or GPS receivers. But it would also be nice to be able to generate IDs that are
unique and correctly ordered without relying on special hardware. That’s what *logical clocks* are
about.

### Logical Clocks {#logical-clocks}

In [“Unreliable Clocks”](/en/ch9#sec_distributed_clocks) we discussed time-of-day clocks and monotonic clocks. Both of these
are *physical clocks*: they measure the passing of seconds (or milliseconds, microseconds, etc.).

In distributed systems it is common to also use another kind of clock, called a *logical clock*.
While a physical clock is a hardware device that counts the seconds that have elapsed, a logical
clock is an algorithm that counts the events that have occurred. A timestamp from a logical clock
therefore doesn’t tell you what time it is, but you *can* compare two timestamps from a logical
clock to tell which one is earlier and which one is later.

The requirements for a logical clock are typically:

* that its timestamps are compact (a few bytes in size) and unique;
* that you can compare any two timestamps (i.e. they are *totally ordered*); and
* that the order of timestamps is *consistent with causality*: if operation A happened before B,
 then A’s timestamp is less than B’s timestamp. (We discussed causality previously in
 [“The “happens-before” relation and concurrency”](/en/ch6#sec_replication_happens_before).)

A single-node ID generator meets these requirements, but the distributed ID generators we just
discussed do not meet the causal ordering requirement.

#### Lamport timestamps {#lamport-timestamps}

Fortunately, there is a simple method for generating logical timestamps that *is* consistent with
causality, and which you can use as a distributed ID generator. It is called a *Lamport clock*,
proposed in 1978 by Leslie Lamport [^54],
in what is now one of the most-cited papers in the field of distributed systems.

[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) shows how a Lamport clock would work in the chat example of
[Figure 10-8](/en/ch10#fig_consistency_id_generator). Each node has a unique identifier, which in
[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) is the name “Aaliyah”, “Bryce”, or “Caleb”, but which in practice
could be a random UUID or something similar. Moreover, each node keeps a counter of the number of
operations it has processed. A Lamport timestamp is then simply a pair of (*counter*, *node ID*).
Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp,
each timestamp is made unique.

{{< figure src="/fig/ddia_1009.png" id="fig_consistency_lamport_ts" caption="Figure 10-9. Lamport timestamps provide a total ordering consistent with causality." class="w-full my-4" >}}


Every time a node generates a timestamp, it increments its counter value and uses the new value.
Moreover, every time a node sees a timestamp from another node, if the counter value in that
timestamp is greater than its local counter value, it increases its local counter to match the value in the timestamp.

In [Figure 10-9](/en/ch10#fig_consistency_lamport_ts), Aaliyah had not yet seen Caleb’s message when posting her own,
and vice versa. Assuming both users start with an initial counter value of 0, both therefore
increment their local counter and attach the new counter value of 1 to their message. When Bryce
receives those messages, he increases his local counter value to 1. Finally, Bryce sends a reply to
Aaliyah’s message, for which he increments his local counter and attaches the new value of 2 to the
message.

To compare two Lamport timestamps, we first compare their counter value: for example,
(2, “Bryce”) is greater than (1, “Aaliyah”) and also greater than (1, “Caleb”). If
two timestamps have the same counter, we compare their node IDs instead, using the usual
lexicographic string comparison. Thus, the timestamp order in this example is
(1, “Aaliyah”) < (1, “Caleb”) < (2, “Bryce”).

#### Hybrid logical clocks {#hybrid-logical-clocks}

Lamport timestamps are good at capturing the order in which things happened, but they have some
limitations:

* Since they have no direct relation to physical time, you can’t use them to find, say, all the
 messages that were posted on a particular date—you would need to store the physical time
 separately.
* If two nodes never communicate, one node’s counter increments will never be reflected in the other
 one’s counter. As a result, it could happen that events generated around the same time on
 different nodes have wildly different counter values.

A *hybrid logical clock* combines the advantages of physical time-of-day clocks with the ordering
guarantees of Lamport clocks [^55].
Like a physical clock, it counts seconds or microseconds. Like a Lamport clock, when one node sees a
timestamp from another node that is greater than its local clock value, it moves its own local value
forward to match the other node’s timestamp. As a result, if one node’s clock is running fast, the
other nodes will similarly move their clocks forward when they communicate.

Every time a timestamp from a hybrid logical clock is generated, it is also incremented, which
ensures that the clock monotonically moves forward, even if the underlying physical clock jumps
backwards, for example due to NTP adjustments. Thus, the hybrid logical clock might be slightly
ahead of the underlying physical clock. Details of the algorithm ensure that this discrepancy
remains as small as possible.

As a result, you can treat a timestamp from a hybrid logical clock almost like a timestamp from a
conventional time-of-day clock, with the added property that its ordering is consistent with the
happens-before relation. It doesn’t depend on any special hardware, and requires only roughly
synchronized clocks. Hybrid logical clocks are used by CockroachDB, for example.

#### Lamport/hybrid logical clocks vs. vector clocks {#lamporthybrid-logical-clocks-vs-vector-clocks}

In [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl) we discussed how snapshot isolation is often implemented:
essentially, by giving each transaction a transaction ID, and allowing each transaction to see
writes made by transactions with a lower ID, but to make writes by transactions with higher IDs
invisible. Lamport clocks and hybrid logical clocks are a good way of generating these transaction
IDs, because they ensure that the snapshot is consistent with causality [^56].

When multiple timestamps are generated concurrently, these algorithms order them arbitrarily. This
means that when you look at two timestamps, you generally can’t tell whether they were generated
concurrently or whether one happened before the other. (In the example of
[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) you actually can tell that Aaliyah and Caleb’s messages must have
been concurrent, because they have the same counter value, but when the counter values are different
you can’t tell whether they were concurrent.)

If you want to be able to determine when records were created concurrently, you need a different
algorithm, such as a *vector clock*. The downside is that the timestamps from a vector clock are
much bigger—potentially one integer for every node in the system. See [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)
for more details on detecting concurrency.

### Linearizable ID Generators {#sec_consistency_linearizable_id}

Although Lamport clocks and hybrid logical clocks provide useful ordering guarantees, that ordering
is still weaker than the linearizable single-node ID generator we talked about previously. Recall
that linearizability requires that if request A completed before request B began, then B must have
the higher ID, even if A and B never communicated with each other. On the other hand, Lamport clocks
can only ensure that a node generates timestamps that are greater than any other timestamp that node
has seen, but it can’t say anything about timestamps that it hasn’t seen.

[Figure 10-10](/en/ch10#fig_consistency_permissions) shows how a non-linearizable ID generator could cause problems.
Imagine a social media website where user A wants to share an embarrassing photo privately with
their friends. A’s account is initially public, but using their laptop, A first changes their
account settings to private. Then A uses their phone to upload the photo. Since A performed these
updates in sequence, they might reasonably expect the photo upload to be subject to the new,
restricted account permissions.

{{< figure src="/fig/ddia_1010.png" id="fig_consistency_permissions" caption="Figure 10-10. An example of a permission system using Lamport timestamps." class="w-full my-4" >}}


The account permission and the photo are stored in two separate databases (or separate shards of the
same database), and let’s assume they use a Lamport clock or hybrid logical clock to assign a
timestamp to every write. Since the photos database didn’t read from the accounts database, it’s
possible that the local counter in the photos database is slightly behind, and therefore the photo
upload is assigned a lower timestamp than the update of the account settings.

Next, let’s say that a viewer (who is not friends with A) is looking at A’s profile, and their read
uses an MVCC implementation of snapshot isolation. It could happen that the viewer’s read has a
timestamp that is greater than that of the photo upload, but less than that of the account settings
update. As a result, the system will determine that the account is still public at the time of the
read, and therefore show the viewer the embarrassing photo that they were not supposed to see.

You can imagine several possible ways of fixing this problem. Maybe the photos database should have
read the user’s account status before performing the write, but it’s easy to forget such a check.
If A’s actions had been performed on the same device, maybe the app on their device could have
tracked the latest timestamp of that user’s writes—but if the user uses a laptop and a phone, as in
the example, that’s not so easy.

The simplest solution in this case would be to use a linearizable ID generator, which would ensure
that the photo upload is assigned a greater ID than the account permissions change.

#### Implementing a linearizable ID generator {#implementing-a-linearizable-id-generator}

The simplest way of ensuring that ID assignment is linearizable is by actually using a single node
for this purpose. That node only needs to atomically increment a counter and return its value when
requested, persist the counter value (so that it doesn’t generate duplicate IDs if the node crashes
and restarts), and replicate it for fault tolerance (using single-leader replication). This approach
is used in practice: for example, TiDB/TiKV calls it a *timestamp oracle*, inspired by Google’s
Percolator [^57].

As an optimization, you can avoid performing a disk write and replication on every single request.
Instead, the ID generator can write a record describing a batch of IDs; once that record is
persisted and replicated, the node can start handing out those IDs to clients in sequence. Before it
runs out of IDs in that batch, it can persist and replicate the record for the next batch. That way,
some IDs will be skipped if the node crashes and restarts or if you fail over to a follower, but you
won’t issue any duplicate or out-of-order IDs.

You can’t easily shard the ID generator, since if you have multiple shards independently handing out
IDs, you can no longer guarantee that their order is linearizable. You also can’t easily distribute
the ID generator across multiple regions; thus, in a geographically distributed database, all
requests for IDs will have to go to a node in a single region. On the upside, the ID generator’s job
is very simple, so a single node can handle a large request throughput.

If you don’t want to use a single-node ID generator, an alternative is possible: you can do what
Google’s Spanner does, as discussed in [“Synchronized clocks for global snapshots”](/en/ch9#sec_distributed_spanner). It relies on a physical clock
that returns not just a single timestamp, but a range of timestamps indicating the uncertainty in
the clock reading. It then waits for the duration of that uncertainty interval to elapse before
returning.

Assuming that the uncertainty interval is correct (i.e., that the true current physical time always
lies within that interval), this process also ensures that if one request completes before another
begins, the later request will have a greater timestamp. This approach ensures this linearizable ID
assignment without any communication: even requests in different regions will be ordered correctly,
without waiting for cross-region requests. The downside is that you need hardware and software
support for clocks to be tightly synchronized and compute the necessary uncertainty interval.

#### Enforcing constraints using logical clocks {#enforcing-constraints-using-logical-clocks}

In [“Constraints and uniqueness guarantees”](/en/ch10#sec_consistency_uniqueness) we saw that a linearizable compare-and-set operation can be used
to implement locks, uniqueness constraints, and similar constructs in a distributed system. This
raises the question: is a logical clock or a linearizable ID generator also sufficient to implement
these things?

The answer is: not quite. When you have several nodes that are all trying to acquire the
same lock or register the same username, you could use a logical clock to assign timestamps to those
requests, and pick the one with the lowest timestamp as the winner. If the clock is linearizable,
you know that any future requests will always generate greater timestamps, and therefore you can be
sure that no future request will receive an even lower timestamp than the winner.

Unfortunately, part of the problem is still unsolved: how does a node know whether its own timestamp
is the lowest? To be sure, it needs to hear from *every* other node that might have generated a
timestamp [^54]. If one of the other nodes
has failed in the meantime, or cannot be reached due to a network problem, this system would grind
to a halt, because we can’t be sure whether that node might have the lowest timestamp. This is not
the kind of fault-tolerant system that we need.

To implement locks, leases, and similar constructs in a fault-tolerant way, we need something
stronger than logical clocks or ID generators: we need consensus.


## Consensus {#consensus}

In this chapter we have seen several examples of things that are easy when you have only a single
node, but which get a lot harder if you want fault tolerance:

* A database can be linearizable if you have only a single leader, and you make all reads and writes
 on that leader. But how do you fail over if that leader fails, while avoiding split brain? How do
 you ensure that a node that believes itself to be the leader hasn’t actually been voted out in the meantime?
* A linearizable ID generator on a single node is just a counter with an atomic fetch-and-add
 instruction, but what if it crashes?
* An atomic compare-and-set (CAS) operation is useful for many things, such as deciding who gets a
 lock or lease when several processes are racing to acquire it, or ensuring the uniqueness of a
 file or user with a given name. On a single node, CAS may be as simple as one CPU instruction, but
 how do you make it fault-tolerant?

It turns out that all of these are instances of the same fundamental distributed systems problem:
*consensus*. Consensus is one of the most important and fundamental problems in distributed
computing; it is also infamously difficult to get right [^58] [^59],
and many systems have got it wrong in the past. Now that we have discussed replication
([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
linearizability (this chapter), we are finally ready to tackle the consensus problem.

The best-known consensus algorithms are Viewstamped Replication [^60] [^61], Paxos [^58] [^62] [^63] [^64],
Raft [^23] [^65] [^66], and Zab [^18] [^22] [^67]. There are quite a few similarities between these algorithms, but they are not the same [^68] [^69].
These algorithms work in a non-Byzantine system model: that is, network communication may be
arbitrarily delayed or dropped, and nodes may crash, restart, and become disconnected, but the
algorithms assume that nodes otherwise follow the protocol correctly and do not behave maliciously.

There are also consensus algorithms that can tolerate some Byzantine nodes, i.e., nodes that don’t
correctly follow the protocol (for example, by sending contradictory messages to other nodes). A
common assumption is that fewer than one-third of the nodes are Byzantine-faulty [^26] [^70].
Such *Byzantine fault tolerant* (BFT) consensus algorithms are used in blockchains [^71].
However, as explained in [“Byzantine Faults”](/en/ch9#sec_distributed_byzantine), BFT algorithms are beyond the scope of this
book.

--------

> [!TIP] THE IMPOSSIBILITY OF CONSENSUS

You may have heard about the FLP result [^72]—named after the
authors Fischer, Lynch, and Paterson—which proves that there is no algorithm that is always able to
reach consensus if there is a risk that a node may crash. In a distributed system, we must assume
that nodes may crash, so reliable consensus is impossible. Yet, here we are, discussing algorithms
for achieving consensus. What is going on here?

Firstly, FLP doesn’t say that we can never reach consensus—it only says that we can’t guarantee that
a consensus algorithm will *always* terminate. Moreover, the FLP result is proved assuming a
deterministic algorithm in the asynchronous system model (see [“System Model and Reality”](/en/ch9#sec_distributed_system_model)),
which means the algorithm cannot use any clocks or timeouts. If it can use timeouts to suspect that
another node may have crashed (even if the suspicion is sometimes wrong), then consensus becomes solvable [^73].
Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility result [^74].

Thus, although the FLP result about the impossibility of consensus is of great theoretical
importance, distributed systems can usually achieve consensus in practice.

--------

### The Many Faces of Consensus {#the-many-faces-of-consensus}

Consensus can be expressed in several different ways:

* *Single-value consensus* is very similar to an atomic *compare-and-set* operation, and it can be
 used to implement locks, leases, and uniqueness constraints.
* Constructing an *append-only log* also requires consensus; it is usually formalized as *total
 order broadcast*. With a log you can build *state machine replication*, leader-based replication,
 event sourcing, and other useful things.
* *Atomic commitment* of a multi-database or multi-shard transaction requires that all participants
 agree on whether to commit or abort the transaction.

We will explore all of these shortly. In fact, these problems are all equivalent to each other: if
you have an algorithm that solves one of these problems, you can convert it into a solution for any
of the others. This is quite a profound and perhaps surprising insight! And that’s why we can lump
all of these things together under “consensus”, even though they look quite different on the surface.

#### Single-value consensus {#single-value-consensus}

The standard formulation of consensus involves getting multiple nodes to agree on a single value.
For example:

* When a database with single-leader replication first starts up, or when the existing leader fails,
 several nodes may concurrently try to become the leader. Similarly, multiple nodes may race to
 acquire a lock or lease. Consensus allows them to decide which one wins.
* If several people concurrently try to book the last seat on an airplane, or the same seat in a
 theater, or try to register an account with the same username, then a consensus algorithm could
 determine which one should succeed.

More generally, one or more nodes may *propose* values, and the consensus algorithm *decides* on one
of those values. In the examples above, each node could propose its own ID, and the algorithm
decides which node ID should become the new leader, the holder of the lease, or the buyer of the
airplane/theater seat. In this formalism, a consensus algorithm must satisfy the following
properties [^26]:

Uniform agreement
: No two nodes decide differently.

Integrity
: Once a node has decided one value, it cannot change its mind by deciding another value.

Validity
: If a node decides value *v*, then *v* was proposed by some node.

Termination
: Every node that does not crash eventually decides some value.

If you want to decide multiple values, you can run a separate instance of the consensus algorithm
for each. For example, you could have a separate consensus run for each bookable seat in the
theater, so that you get one decision (one buyer) for each seat.

The uniform agreement and integrity properties define the core idea of consensus: everyone decides
on the same outcome, and once you have decided, you cannot change your mind. The validity property
rules out trivial solutions: for example, you could have an algorithm that always decides `null`, no
matter what was proposed; this algorithm would satisfy the agreement and integrity properties, but
not the validity property.

If you don’t care about fault tolerance, then satisfying the first three properties is easy: you can
just hardcode one node to be the “dictator,” and let that node make all of the decisions. However,
if that one node fails, then the system can no longer make any decisions—just like single-leader
replication without failover. All the difficulty arises from the need for fault tolerance.

The termination property formalizes the idea of fault tolerance. It essentially says that a
consensus algorithm cannot simply sit around and do nothing forever—in other words, it must make
progress. Even if some nodes fail, the other nodes must still reach a decision. (Termination is a
liveness property, whereas the other three are safety properties—see
[“Safety and liveness”](/en/ch9#sec_distributed_safety_liveness).)

If a crashed node may recover, you could just wait for it to come back. However, consensus must
ensure that it makes a decision even if a crashed node suddenly disappears and never comes back.
(Instead of a software crash, imagine that there is an earthquake, and the datacenter containing
your node is destroyed by a landslide. You must assume that your node is buried under 30 feet of mud
and is never going to come back online.)

Of course, if *all* nodes crash and none of them are running, then it is not possible for any
algorithm to decide anything. There is a limit to the number of failures that an algorithm can
tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of
nodes to be functioning correctly in order to assure termination [^73]. That majority can safely form a quorum
(see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)).

Thus, the termination property is subject to the assumption that fewer than half of the nodes are
crashed or unreachable. However, most consensus algorithms ensure that the safety
properties—agreement, integrity, and validity—are always met, even if a majority of nodes fail or
there is a severe network problem [^75].
Thus, a large-scale outage can stop the system from being able to process requests, but it cannot
corrupt the consensus system by causing it to make inconsistent decisions.

#### Compare-and-set as consensus {#compare-and-set-as-consensus}

A compare-and-set (CAS) operation checks whether the current value of some object equals some
expected value; if yes, it atomically updates the object to some new value; if no, it leaves the
object unchanged and returns an error.

If you have a fault-tolerant, linearizable CAS operation, it is easy to solve the consensus problem:
initially set the object to a null value; each node that wants to propose a value invokes CAS with
the expected value being null, and the new value being the value it wants to propose (assuming it is
non-null). The decided value is then whatever value the object is set to.

Likewise, if you have a solution for consensus, you can implement CAS: whenever one or more nodes
want to perform CAS with the same expected value, you use the consensus protocol to propose the new
values in the CAS invocation, and then set the object to whatever value was decided by the
consensus. Any CAS invocations whose new value was not decided return an error. CAS invocations with
different expected values use separate runs of the consensus protocol.

This shows that CAS and consensus are equivalent to each other [^28] [^73].
Again, both are straightforward on a single node, but challenging to make fault-tolerant. As an
example of CAS in a distributed setting, we saw conditional write operations for object stores in
[“Databases backed by object storage”](/en/ch6#sec_replication_object_storage), which allow a write to happen only if an object with the same
name has not been created or modified by another client since the current client last read it.

However, a linearizable read-write register is not sufficient to solve consensus. The FLP result
tells us that consensus cannot be solved by a deterministic algorithm in the asynchronous crash-stop
model [^72], but we saw in [“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum
reads/writes in this model [^24] [^25] [^26]. From this it follows that a linearizable register cannot solve consensus.

#### Shared logs as consensus {#sec_consistency_shared_logs}

We have seen several examples of logs, such as replication logs, transaction logs, and write-ahead
logs. A log stores a sequence of *log entries*, and anyone who reads it sees the same entries in the
same order. Sometimes a log has a single writer that is allowed to append new entries, but a *shared
log* is one where multiple nodes can request entries to be appended. An example is single-leader
replication: any client can ask the leader to make a write, which the leader appends to the
replication log, and then all followers apply the writes in the same order as the leader.

More formally, a shared log supports two operations: you can request for a value to be added to the
log, and you can read the entries in the log. It must satisfy the following properties:

Eventual append
: If a node requests for some value to be added the log, and the node does not crash, then that node
 must eventually read that value in a log entry.

Reliable delivery
: No log entries are lost: if one node reads some log entry, then eventually every node that does
 not crash must also read that log entry.

Append-only
: Once a node has read some log entry, it is immutable, and new log entries can only be added after
 it, but not before. A node may re-read the log, in which case it sees the same log entries in the
 same order as it read them initially (even if the node crashes and restarts).

Agreement
: If two nodes both read some log entry *e*, then prior to *e* they must have read exactly the same
 sequence of log entries in the same order.

Validity
: If a node reads a log entry containing some value, then some node previously requested for that
 value to be added to the log.

--------

> [!NOTE]
> A shared log is formally known as a *total order broadcast*, *atomic broadcast*, or *total order multicast* protocol [^26] [^76] [^77]
> It’s the same thing described in different words: requesting a value to be added to the log is then called “broadcasting” it, and reading a log entry is called “delivering” it.

--------

If you have an implementation of a shared log, it is easy to solve the consensus problem: every node
that wants to propose a value requests for it to be added to the log, and whichever value is read
back in the first log entry is the value that is decided. Since all nodes read log entries in the
same order, they are guaranteed to agree on which value is delivered first [^28].

Conversely, if you have a solution for consensus, you can implement a shared log. The details are a
bit more complicated, but the basic idea is this [^73]:

1. You have a slot in the log for every future log entry, and you run a separate instance of the
 consensus algorithm for every such slot to decide what value should go in that entry.
2. When a node wants to add a value to the log, it proposes that value for one of the slots that has
 not yet been decided.
3. When the consensus algorithm decides for one of the slots, and all the previous slots have
 already been decided, then the decided value is appended as a new log entry, and any consecutive
 slots that have been decided also have their decided value appended to the log.
4. If a proposed value was not chosen for some slot, the node that wanted to add it retries by
 proposing it for a later slot.

This shows that consensus is equivalent to total order broadcast and shared logs. Single-leader
replication without failover does not meet the liveness requirements, since it stops delivering
messages if the leader crashes. As usual, the challenge is in performing failover safely and
automatically.

#### Fetch-and-add as consensus {#fetch-and-add-as-consensus}

The linearizable ID generator we saw in [“Linearizable ID Generators”](/en/ch10#sec_consistency_linearizable_id) comes close to solving
consensus, but it falls slightly short. We can implement such an ID generator using a fetch-and-add
operation, which atomically increments a counter and returns the old counter value.

If you have a CAS operation, it’s easy to implement fetch-and-add: first read the counter value,
then perform a CAS where the expected value is the value you read, and the new value is that value
plus one. If the CAS fails, you retry the whole process until the CAS succeeds. This is less
efficient than a native fetch-and-add operation when there is contention, but it is functionally
equivalent. Since you can implement CAS using consensus, you can also implement fetch-and-add using
consensus.

Conversely, if you have a fault-tolerant fetch-and-add operation, can you solve the consensus
problem? Let’s say you initialize the counter to zero, and every node that wants to propose a value
invokes the fetch-and-add operation to increment the counter. Since the fetch-and-add operation is
atomic, one of the nodes will read the initial value of zero, and the others will all read a value
that has been incremented at least once.

Now let’s say that the node that reads zero is the winner, and its value is decided. That works for
the node that read zero, but the other nodes have a problem: they know that they are not the winner,
but they don’t know which of the other nodes has won. The winner could send a message to the other
nodes to let them know it has won, but what if the winner crashes before it has a chance to send
this message? In that case the other nodes are left hanging, unable to decide any value, and thus
the consensus does not terminate. And the other nodes can’t fall back to another node, because the
node that read zero may yet come back and rightly decide the value it proposed.

An exception is if we know for sure that no more than two nodes will propose a value. In that case,
the nodes can send each other the values they want to propose, and then each perform the
fetch-and-add operation. The node that reads zero decides its own value, and the node that reads one
decides the other node’s value. This solves the consensus problem among two nodes, which is why we
can say that fetch-and-add has a *consensus number* of two [^28].
In contrast, CAS and shared logs solve consensus for any number of nodes that may propose values, so
they have a consensus number of ∞ (infinity).

#### Atomic commitment as consensus {#atomic-commitment-as-consensus}

In [“Distributed Transactions”](/en/ch8#sec_transactions_distributed) we saw the *atomic commitment* problem, which is to ensure that
the databases or shards involved in a distributed transaction all either commit or abort a
transaction. We also saw the *two-phase commit* algorithm, which relies on a coordinator that is a
single point of failure.

What is the relationship between consensus and atomic commitment? At first glance, they seem very
similar—both require nodes to come to some form of agreement. However, there is one important
difference: with consensus it’s okay to decide any value that proposed, whereas with atomic
commitment the algorithm *must* abort if *any* of the participants voted to abort. More precisely,
atomic commitment requires the following properties [^78]:

Uniform agreement
: No two nodes decide on different outcomes.

Integrity
: Once a node has decided one outcome, it cannot change its mind by deciding another outcome.

Validity
: If a node decides to commit, then all nodes must have previously voted to commit. If any node
 voted to abort, the nodes must abort.

Non-triviality
: If all nodes vote to commit, and no communication timeouts occur, then all nodes must decide to
 commit.

Termination
: Every node that does not crash eventually decides to either commit or abort.

The validity property ensures that a transaction can only commit if all nodes agree; and the
non-triviality property ensures the algorithm can’t simply always abort (but it allows an abort if
any of the communication among the nodes times out). The other three properties are basically the
same as for consensus.

If you have a solution for consensus, there are multiple ways you could solve atomic commitment [^78] [^79].
One works like this: when you want to commit the transaction, every node sends its vote to commit or
abort to every other node. Nodes that receive a vote to commit from itself and every other node
propose “commit” using the consensus algorithm; nodes that receive a vote to abort, or which
experience a timeout, propose “abort” using the consensus algorithm. When a node finds out what the
consensus algorithm decided, it commits or aborts accordingly.

In this algorithm, “commit” will only be proposed if all nodes voted to commit. If any node voted to
abort, all proposals in the consensus algorithm will be “abort”. It could happen that some nodes
propose “abort” while others propose “commit” if all nodes voted to commit but some communication
timed out; in this case it doesn’t matter whether the nodes commit or abort, as long as they all do the same.

If you have a fault-tolerant atomic commitment protocol, you can also solve consensus. Every node
that wants to propose a value starts a transaction on a quorum of nodes, and at each node it
performs a single-node CAS to set a register to the proposed value if its value has not already been
set by another transaction. If the CAS succeeds, the node votes to commit, otherwise it votes to
abort. If the atomic commit protocol decides to commit a transaction, its value is decided for
consensus; if atomic commit aborts, the proposing node retries with a new transaction.

This shows that atomic commit and consensus are also equivalent to each other.

### Consensus in Practice {#consensus-in-practice}

We have seen that single-value consensus, CAS, shared logs, and atomic commitment are all equivalent
to each other: you can convert a solution to one of them into a solution to any of the others. That
is a valuable theoretical insight, but it doesn’t answer the question: which of these many
formulations of consensus is the most useful in practice?

The answer is that most consensus systems provide shared logs, also known as total order broadcast.
Raft, Viewstamped Replication, and Zab provide shared logs right out of the box. Paxos provides
single-value consensus, but in practice most systems using Paxos actually use the extension called
Multi-Paxos, which also provides a shared log.

#### Using shared logs {#sec_consistency_smr}

A shared log is a good fit for database replication: if every log entry represents a write to the
database, and every replica processes the same writes in the same order using deterministic logic,
then the replicas will all end up in a consistent state. This idea is known as *state machine replication* [^80],
and it is the principle behind event sourcing, which we saw in [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events). Shared
logs are also useful for stream processing, as we shall see in [Link to Come].

Similarly, a shared log can be used to implement serializable transactions: as discussed in
[“Actual Serial Execution”](/en/ch8#sec_transactions_serial), if every log entry represents a deterministic transaction to be
executed as a stored procedure, and if every node executes those transactions in the same order,
then the transactions will be serializable [^81] [^82].

---------

> [!NOTE]
> Sharded databases with a strong consistency model often maintain a separate log per shard, which
> improves scalability, but limits the consistency guarantees (e.g., consistent snapshots, foreign key
> references) they can offer across shards. Serializable transactions across shards are possible, but
> require additional coordination [^83].

--------

A shared log is also powerful because it can easily be adapted to other forms of consensus:

* We saw previously how to use it to implement single-value consensus and CAS: simply decide the
 value that appears first in the log.
* If you want many instances of single-value consensus (e.g. one per seat in a theater that several
 people are trying to book), include the seat number in the log entries, and decide the first log
 entry that contains a given seat number.
* If you want an atomic fetch-and-add, put the number to add to the counter in a log entry, and the
 current counter value is the sum of all of the log entries so far. A simple counter on log entries
 can be used to generate fencing tokens (see [“Fencing off zombies and delayed requests”](/en/ch9#sec_distributed_fencing_tokens)); for example, in
 ZooKeeper, this sequence number is called `zxid` [^18].

#### From single-leader replication to consensus {#from-single-leader-replication-to-consensus}

We saw previously that single-value consensus is easy if you have a single “dictator” node that
makes the decision, and likewise a shared log is easy if a single leader is the only node that is
allowed to append entries to it. The question is how to provide fault tolerance if that node fails.

Traditionally, databases with single-leader replication didn’t solve this problem: they left leader
failover as an action that a human administrator had to perform manually. Unfortunately, this means
a significant amount of downtime, since there is a limit to how fast humans can react, and it
doesn’t satisfy the termination property of consensus. For consensus, we require that the algorithm
can automatically choose a new leader. (Not all consensus algorithms have a leader, but the commonly
used algorithms do [^84].)

However, there is a problem. We previously discussed the problem of split brain, and said that all
nodes need to agree who the leader is—otherwise two different nodes could each believe themselves to
be the leader, and consequently make inconsistent decisions. Thus, it seems like we need consensus
in order to elect a leader, and we need a leader in order to solve consensus. How do we break out of
this conundrum?

In fact, consensus algorithms don’t require that there is only one leader at any one time. Instead,
they make a weaker guarantee: they define an *epoch number* (called the *ballot number* in Paxos,
*view number* in Viewstamped Replication, and *term number* in Raft) and guarantee that within each
epoch, the leader is unique.

When a node believes that the current leader is dead because it hasn’t heard from the leader for
some timeout, it may start a vote to elect a new leader. This election is given a new epoch number
that is greater than any previous epoch. If there is a conflict between two different leaders in two
different epochs (perhaps because the previous leader actually wasn’t dead after all), then the
leader with the higher epoch number prevails.

Before a leader is allowed to append the next entry to the shared log, it must first check that
there isn’t some other leader with a higher epoch number which might append a different entry. It
can do this by collecting votes from a quorum of nodes—typically, but not always, a majority of
nodes [^85]. A node votes yes only if it is not aware of any other leader with a higher epoch.

Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s
proposal for the next entry to append to the log. The quorums for those two votes must overlap: if
a vote on a proposal succeeds, at least one of the nodes that voted for it must have also
participated in the most recent successful leader election [^85]. Thus, if the vote on a proposal
passes without revealing any higher-numbered epoch, the current leader can conclude that no leader
with a higher epoch number has been elected, and therefore it can safely append the proposed entry
to the log [^26] [^86].

These two rounds of voting look superficially similar to two-phase commit, but they are very
different protocols. In consensus algorithms, any node can start an election and it requires only a
quorum of nodes to respond; in 2PC, only the coordinator can request votes, and it requires a “yes”
vote from *every* participant before it can commit.

#### Subtleties of consensus {#subtleties-of-consensus}

This basic structure is common to all of Raft, Multi-Paxos, Zab, and Viewstamped Replication: a vote
by a quorum of nodes elects a leader, and then another quorum vote is required for every entry that
the leader wants to append to the log [^68] [^69]. Every new log entry is synchronously replicated
to a quorum of nodes before it is confirmed to the client that requested the write. This ensures
that the log entry won’t be lost if the current leader fails.

However, the devil is in the details, and that’s also where these algorithms take different
approaches. For example, when the old leader fails and a new one is elected, the algorithm needs to
ensure that the new leader honors any log entries that had already been appended by the old leader
before it failed. Raft does this by only allowing a node to become the new leader if its log is at
least as up-to-date as a majority of its followers [^69].
In contrast, Paxos allows any node to become the new leader, but requires it to bring its log
up-to-date with other nodes before it can start appending new entries of its own.


--------

> [!TIP] CONSISTENCY VS. AVAILABILITY IN LEADER ELECTION

If you want the consensus algorithm to strictly guarantee the properties laid out in
[“Shared logs as consensus”](/en/ch10#sec_consistency_shared_logs), it’s essential that the new leader is up-to-date with any confirmed
log entries before it can process any writes or linearizable reads. If a node with stale data were
to become the new leader, it may write a new value to log entries that were already written by the
old leader, violating the shared log’s append-only property.

In some cases, you might choose to weaken the consensus properties in order to recover more quickly
from a leader failure. For example, Kafka offers the option of enabling *unclean leader election*,
which allows any replica to become leader, even if it is not up-to-date. Also, in databases with
asynchronous replication, you cannot guarantee that any follower is up-to-date when the leader
fails.

If you drop the requirement for the new leader to be up-to-date, you may improve performance and
availability, but you are on thin ice, since the theory of consensus no longer applies. While things
will work fine as long as there are no faults, the problems discussed in [Chapter 9](/en/ch9#ch_distributed) can
easily cause a lot of data loss or corruption.

--------

Another subtlety is in how the algorithms deal with log entries that had been proposed by the old
leader before it failed, but for which the vote on appending to the log had not yet completed. You
can find discussions of these details in the references for this chapter [^23] [^69] [^86].

For databases that use a consensus algorithm for replication, not only do writes need to be turned
into log entries and replicated to a quorum. If you want to guarantee linearizable reads, they also
have to go through a quorum vote similarly to a write, to confirm that the node that believes to be
the leader really still is up-to-date. Linearizable reads in etcd work like this, for example.

In their standard form, most consensus algorithms assume a fixed set of nodes—that is, nodes may go
down and come back up again, but the set of nodes that is allowed to vote is fixed when the cluster
is created. In practice, it’s often necessary to add new nodes or remove old nodes in a system
configuration. Consensus algorithms have been extended with *reconfiguration* features that make
this possible. This is especially useful when adding new regions to a system, or when migrating from
one location to another (by first adding the new nodes, and then removing the old nodes).

#### Pros and cons of consensus {#pros-and-cons-of-consensus}

Although they are complex and subtle, consensus algorithms are a huge breakthrough for distributed
systems. Consensus is essentially “single-leader replication done right”, with automatic failover on
leader failure, ensuring that no committed data is lost and no split-brain is possible, even in the
face of all the problems we discussed in [Chapter 9](/en/ch9#ch_distributed).

Since single-leader replication with automatic failover is essentially one of the definitions of
consensus, any system that provides automatic failover but does not use a proven consensus algorithm
is likely to be unsafe [^87].
Using a proven consensus algorithm is not a guarantee of correctness of the whole system—there are
still plenty of other places where bugs can lurk—but it’s a good start.

Nevertheless, consensus is not used everywhere, because the benefits come at a cost. Consensus
systems always require a strict majority to operate—three nodes to tolerate one failure, or five
nodes to tolerate two failures. Every operation needs to communicate with a quorum, so you can’t
increase throughput by adding more nodes (in fact, every node you add makes the algorithm slower).
If a network partition cuts off some nodes from the rest, only the majority portion of the network
can make progress, and the rest are blocked.

Consensus systems generally rely on timeouts to detect failed nodes. In environments with highly
variable network delays, especially systems distributed across multiple geographic regions, it can
be difficult to tune these timeouts: if they are too large it takes a long time to recover from a
failure; if they are too small there can be lots of unnecessary leader elections, resulting in
terrible performance as the system can end up spending more time choosing leaders than doing useful
work.

Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft
has been shown to have unpleasant edge cases [^88] [^89]:
if the entire network is working correctly except for one particular network link that is
consistently unreliable, Raft can get into situations where leadership continually bounces between
two nodes, or the current leader is continually forced to resign, so the system effectively never
makes progress. Designing algorithms that are more robust to unreliable networks is still an open
research problem.

For systems that want to be highly available, but don’t want to accept the cost of consensus, the
only real alternative is to use a weaker consistency model instead, such as those offered by
leaderless or multi-leader replication as discussed in [Chapter 6](/en/ch6#ch_replication). These approaches
generally don’t offer linearizability, but for applications that don’t need it that is fine.


### Coordination Services {#coordination-services}

Consensus algorithms are useful in any distributed database that wants to offer linearizable
operations, and many modern distributed databases use consensus algorithms for replication. But one
family of systems is a particularly prominent user of consensus: *coordination services* such as
ZooKeeper, etcd, or Consul. Although these systems look superficially like any other key-value
store, they are not designed for general-purpose data storage like most databases.

Instead, they are designed to coordinate between nodes of another distributed system. For example,
Kubernetes relies on etcd, while Spark and Flink in high availability mode rely on ZooKeeper running
in the background. Coordination services are designed to hold small amounts of data that can fit
entirely in memory (although they still write to disk for durability), which is replicated across
multiple nodes using a fault-tolerant consensus algorithm.

Coordination services are modeled after Google’s Chubby lock service [^17] [^58].
They combine a consensus algorithm with several other features that turn out to be particularly
useful when building distributed systems:

Locks and leases
: We saw previously how consensus systems can implement an atomic, fault-tolerant compare-and-set
 (CAS) operation. Coordination services rely on this approach to implement locks and leases: if
 several nodes concurrently try to acquire the same lease, only one of them will succeed.

Support for fencing
: As discussed in [“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing), when a resource is protected by a lease, you
 need *fencing* to prevent clients from interfering with each other in the case of a process pause
 or large network delay. Consensus systems can generate fencing tokens by giving each log entry a
 monotonically increasing ID (`zxid` and `cversion` in ZooKeeper, revision number in etcd).

Failure detection
: Clients maintain a long-lived session on the coordination service, and periodically exchange
 heartbeats to check if the other node is still alive. Even if the connection is temporarily
 interrupted, or a server fails, any leases held by the client remain active. However, if there is
 no heartbeat for longer than the timeout of the lease, the coordination service assumes the client
 is dead and releases the lease (ZooKeeper calls these *ephemeral nodes*).

Change notifications
: A client can request that the coordination service sends it a notification whenever certain keys
 change. This allows a client to find out when another client joins the cluster (based on the value
 it writes to the coordination service), or if another client fails (because its session times out
 and its ephemeral nodes disappear), for example. These notifications save the client from having
 to frequently poll the service to find out about changes.

Failure detection and change notifications do not require consensus, but they are useful for
distributed coordination alongside the atomic operations and fencing support that do require
consensus.

--------

> [!TIP] MANAGING CONFIGURATION WITH COORDINATION SERVICES

Applications and infrastructure often have configuration parameters such as timeouts, thread pool
sizes, and so on. Coordination services are sometimes used to store such configuration data,
represented as key-value pairs. Processes load the latest settings upon startup, and subscribe to
receive notifications of any changes. When a configuration changes, the process can begin using the
new setting immediately or restart itself to load the latest changes.

Configuration management doesn’t need the consensus aspect of a coordination service, but it’s
convenient to use a coordination service and rely on its notification feature if you are already
running the coordination service anyway. Alternatively, a process could periodically poll for
configuration updates from a file or URL, which avoids the need for a specialized service.

--------

#### Allocating work to nodes {#allocating-work-to-nodes}

A coordination service is useful if you have several instances of a process or service, and one
of them needs to be chosen as leader or primary. If the leader fails, one of the other nodes should
take over. This is necessary for single-leader databases, but it’s also appropriate for job
schedulers and similar stateful systems.

Another use case is when you have some sharded resource (database, message streams, file storage,
distributed actor system, etc.) and need to decide which shard to assign to which node. As new nodes
join the cluster, some of the shards need to be moved from existing nodes to the new nodes in order
to rebalance the load. As nodes are removed or fail, other nodes need to take over the failed nodes’
work.

These kinds of tasks can be achieved by judicious use of atomic operations, ephemeral nodes, and
notifications in a coordination service. If done correctly, this approach allows the application to
automatically recover from faults without human intervention. It’s not easy, despite the appearance
of libraries such as Apache Curator that have sprung up to provide higher-level tools on top of the
ZooKeeper client API—but it is still much better than attempting to implement the necessary
consensus algorithms from scratch, which would be very prone to bugs.

A dedicated coordination service also has the advantage that it can run on a fixed set of nodes
(usually three or five), regardless of how many nodes there are in the distributed system that
relies on it for coordination. For example, in a storage system with thousands of shards, it would
be terribly inefficient to run a consensus algorithm over thousands of nodes; it’s much better to
“outsource” the consensus to a small number of nodes running a coordination service.

Normally, the kind of data managed by a coordination service is quite slow-changing: it represents
information like “the node running on IP address 10.1.1.23 is the leader for shard 7,” and such
assignments usually change on a timescale of minutes or hours. Coordination services are not
intended for storing data that may change thousands of times per second. For that, it is better to
use a conventional database; alternatively, tools like Apache BookKeeper [^90] [^91]
can be used to replicate fast-changing internal state of a service.

#### Service discovery {#service-discovery}

ZooKeeper, etcd, and Consul are also often used for *service discovery*—that is, to find out which
IP address you need to connect to in order to reach a particular service (see
[“Load balancers, service discovery, and service meshes”](/en/ch5#sec_encoding_service_discovery)). In cloud environments, where it is common for
virtual machines to continually come and go, you often don’t know the IP addresses of your services
ahead of time. Instead, you can configure your services such that when they start up they register
their network endpoints in a service registry, where they can then be found by other services.

Using a coordination service for service discovery can be convenient, as its failure detection and
change notification features make it easy for clients to keep track of service instances as they
come and go. And if you are already using a coordination service for leases, locking, or leader
election, it makes sense to also use it for service discovery, since it already knows which node
should receive requests for your service.

However, using consensus for service discovery is often overkill: this use case often doesn’t
require linearizability, and it’s more important that service discovery is highly available and
fast, since without it everything would grind to a halt. It’s therefore often preferable to cache
service discovery information and accept that it might be slightly stale. For example, DNS-based
service discovery uses multiple layers of caching to achieve good performance and availability.

To support this use case, ZooKeeper supports *observers*, which are replicas that receive the log
and maintain a copy of the data stored in ZooKeeper, but which do not participate in the consensus
algorithm’s voting process. Reads from an observer are not linearizable as they might be stale, but
they remain available even if the network is interrupted, and they increase the read throughput that
the system can support by caching.

## Summary {#summary}

In this chapter we examined the topic of strong consistency in fault-tolerant systems: what it is,
and how to achieve it. We looked in depth at linearizability, a popular formalization of strong
consistency: it means that replicated data appears as though there were only a single copy, and all
operations act on it atomically. We saw that linearizability is useful when you need some data to be
up-to-date when you read it, or if you need to resolve a race condition (e.g. if multiple nodes are
concurrently trying to do the same thing, such as creating files with the same name).

Although linearizability is appealing because it is easy to understand—it makes a database behave
like a variable in a single-threaded program—it has the downside of being slow, especially in
environments with large network delays. Many replication algorithms don’t guarantee linearizability,
even though it superficially might seem like they might provide strong consistency.

Next, we applied the concept of linearizability in the context of ID generators. A single-node
auto-incrementing counter is linearizable, but not fault-tolerant. Many distributed ID generation
schemes don’t guarantee that the IDs are ordered consistently with the order in which the events
actually happened. Logical clocks such as Lamport clocks and hybrid logical clocks provide ordering
that is consistent with causality, but no linearizability.

This led us to the concept of consensus. We saw that achieving consensus means deciding something in
such a way that all nodes agree on what was decided, and such that they can’t change their mind. A
wide range of problems are actually reducible to consensus and are equivalent to each other (i.e.,
if you have a solution for one of them, you can transform it into a solution for all of the others).
Such equivalent problems include:

Linearizable compare-and-set operation
: The register needs to atomically *decide* whether to set its value, based on whether its current
 value equals the parameter given in the operation.

Locks and leases
: When several clients are concurrently trying to grab a lock or lease, the lock *decides* which one
 successfully acquired it.

Uniqueness constraints
: When several transactions concurrently try to create conflicting records with the same key, the
 constraint must *decide* which one to allow and which should fail with a constraint violation.

Shared logs
: When several nodes concurrently want to append entries to a log, the log *decides* in which order
 they are appended. Total order broadcast is also equivalent.

Atomic transaction commit
: The database nodes involved in a distributed transaction must all *decide* the same way whether to
 commit or abort the transaction.

Linearizable fetch-and-add operation
: This operation can be used to implement an ID generator. Several nodes can concurrently invoke the
 operation, and it *decides* the order in which they increment the counter. This case actually
 solves consensus only between two nodes, while the others work for any number of nodes.

All of these are straightforward if you only have a single node, or if you are willing to assign the
decision-making capability to a single node. This is what happens in a single-leader database: all
the power to make decisions is vested in the leader, which is why such databases are able to provide
linearizable operations, uniqueness constraints, a replication log, and more.

However, if that single leader fails, or if a network interruption makes the leader unreachable,
such a system becomes unable to make any progress until a human performs a manual failover.
Widely-used consensus algorithms like Raft and Paxos are essentially single-leader replication with
built-in automatic leader election and failover if the current leader fails.

Consensus algorithms are carefully designed to ensure that no committed writes are lost during a
failover, and that the system cannot get into a split brain state in which multiple nodes are
accepting writes. This requires that every write, and every linearizable read, is confirmed by a
quorum (typically a majority) of nodes. This can be expensive, especially across geographic regions,
but it is unavoidable if you want the strong consistency and fault tolerance that consensus provides.

Coordination services like ZooKeeper and etcd are also built on top of consensus algorithms. They
provide locks, leases, failure detection, and change notification features that are useful for
managing the state of distributed applications. If you find yourself wanting to do one of those
things that is reducible to consensus, and you want it to be fault-tolerant, it is advisable to use
a coordination service. It won’t guarantee that you will get it right, but it will probably help.

Consensus algorithms are complicated and subtle, but they are supported by a rich body of theory
that has been developed since the 1980s. This theory makes it possible to build systems that can
tolerate all the faults that we discussed in [Chapter 9](/en/ch9#ch_distributed), and still ensure that your data is
not corrupted. This is an amazing achievement, and the references at the end of this chapter feature
some of the highlights of this work.

Nevertheless, consensus is not always the right tool: in some systems, the strong consistency
properties it provides are not needed, and it is better to have weaker consistency with higher
availability and better performance. In these cases, it is common to use leaderless or multi-leader
replication, which we previously discussed in [Chapter 6](/en/ch6#ch_replication). The logical clocks that we
discussed in this chapter are helpful in that context.

### References {#references}

[^1]: Maurice P. Herlihy and Jeannette M. Wing. [Linearizability: A Correctness Condition for Concurrent Objects](https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 12, issue 3, pages 463–492, July 1990. [doi:10.1145/78969.78972](https://doi.org/10.1145/78969.78972)
[^2]: Leslie Lamport. [On interprocess communication](https://www.microsoft.com/en-us/research/publication/interprocess-communication-part-basic-formalism-part-ii-algorithms/). *Distributed Computing*, volume 1, issue 2, pages 77–101, June 1986. [doi:10.1007/BF01786228](https://doi.org/10.1007/BF01786228)
[^3]: David K. Gifford. [Information Storage in a Decentralized Computer System](https://bitsavers.org/pdf/xerox/parc/techReports/CSL-81-8_Information_Storage_in_a_Decentralized_Computer_System.pdf). Xerox Palo Alto Research Centers, CSL-81-8, June 1981. Archived at [perma.cc/2XXP-3JPB](https://perma.cc/2XXP-3JPB)
[^4]: Martin Kleppmann. [Please Stop Calling Databases CP or AP](https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html). *martin.kleppmann.com*, May 2015. Archived at [perma.cc/MJ5G-75GL](https://perma.cc/MJ5G-75GL)
[^5]: Kyle Kingsbury. [Call Me Maybe: MongoDB Stale Reads](https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads). *aphyr.com*, April 2015. Archived at [perma.cc/DXB4-J4JC](https://perma.cc/DXB4-J4JC)
[^6]: Kyle Kingsbury. [Computational Techniques in Knossos](https://aphyr.com/posts/314-computational-techniques-in-knossos). *aphyr.com*, May 2014. Archived at [perma.cc/2X5M-EHTU](https://perma.cc/2X5M-EHTU)
[^7]: Kyle Kingsbury and Peter Alvaro. [Elle: Inferring Isolation Anomalies from Experimental Observations](https://www.vldb.org/pvldb/vol14/p268-alvaro.pdf). *Proceedings of the VLDB Endowment*, volume 14, issue 3, pages 268–280, November 2020. [doi:10.14778/3430915.3430918](https://doi.org/10.14778/3430915.3430918)
[^8]: Paolo Viotti and Marko Vukolić. [Consistency in Non-Transactional Distributed Storage Systems](https://arxiv.org/abs/1512.00168). *ACM Computing Surveys* (CSUR), volume 49, issue 1, article no. 19, June 2016. [doi:10.1145/2926965](https://doi.org/10.1145/2926965)
[^9]: Peter Bailis. [Linearizability Versus Serializability](http://www.bailis.org/blog/linearizability-versus-serializability/). *bailis.org*, September 2014. Archived at [perma.cc/386B-KAC3](https://perma.cc/386B-KAC3)
[^10]: Daniel Abadi. [Correctness Anomalies Under Serializable Isolation](https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html). *dbmsmusings.blogspot.com*, June 2019. Archived at [perma.cc/JGS7-BZFY](https://perma.cc/JGS7-BZFY)
[^11]: Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Highly Available Transactions: Virtues and Limitations](https://www.vldb.org/pvldb/vol7/p181-bailis.pdf). *Proceedings of the VLDB Endowment*, volume 7, issue 3, pages 181–192, November 2013. [doi:10.14778/2732232.2732237](https://doi.org/10.14778/2732232.2732237), extended version published as [arXiv:1302.0309](https://arxiv.org/abs/1302.0309)
[^12]: Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. [*Concurrency Control and Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at [*microsoft.com*](https://www.microsoft.com/en-us/research/people/philbe/book/).
[^13]: Andrei Matei. [CockroachDB’s consistency model](https://www.cockroachlabs.com/blog/consistency-model/). *cockroachlabs.com*, February 2021. Archived at [perma.cc/MR38-883B](https://perma.cc/MR38-883B)
[^14]: Murat Demirbas. [Strict-serializability, but at what cost, for what purpose?](https://muratbuffalo.blogspot.com/2022/08/strict-serializability-but-at-what-cost.html) *muratbuffalo.blogspot.com*, August 2022. Archived at [perma.cc/T8AY-N3U9](https://perma.cc/T8AY-N3U9)
[^15]: Ben Darnell. [How to talk about consistency and isolation in distributed DBs](https://www.cockroachlabs.com/blog/db-consistency-isolation-terminology/). *cockroachlabs.com*, February 2022. Archived at [perma.cc/53SV-JBGK](https://perma.cc/53SV-JBGK)
[^16]: Daniel Abadi. [An explanation of the difference between Isolation levels vs. Consistency levels](https://dbmsmusings.blogspot.com/2019/08/an-explanation-of-difference-between.html). *dbmsmusings.blogspot.com*, August 2019. Archived at [perma.cc/QSF2-CD4P](https://perma.cc/QSF2-CD4P)
[^17]: Mike Burrows. [The Chubby Lock Service for Loosely-Coupled Distributed Systems](https://research.google/pubs/pub27897/). At *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006.
[^18]: Flavio P. Junqueira and Benjamin Reed. [*ZooKeeper: Distributed Process Coordination*](https://www.oreilly.com/library/view/zookeeper/9781449361297/). O’Reilly Media, 2013. ISBN: 978-1-449-36130-3
[^19]: Murali Vallath. [*Oracle 10g RAC Grid, Services & Clustering*](https://www.oreilly.com/library/view/oracle-10g-rac/9781555583217/). Elsevier Digital Press, 2006. ISBN: 978-1-555-58321-7
[^20]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). *Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014. [doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509)
[^21]: Kyle Kingsbury. [Call Me Maybe: etcd and Consul](https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul). *aphyr.com*, June 2014. Archived at [perma.cc/XL7U-378K](https://perma.cc/XL7U-378K)
[^22]: Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. [Zab: High-Performance Broadcast for Primary-Backup Systems](https://marcoserafini.github.io/assets/pdf/zab.pdf). At *41st IEEE International Conference on Dependable Systems and Networks* (DSN), June 2011. [doi:10.1109/DSN.2011.5958223](https://doi.org/10.1109/DSN.2011.5958223)
[^23]: Diego Ongaro and John K. Ousterhout. [In Search of an Understandable Consensus Algorithm](https://www.usenix.org/system/files/conference/atc14/atc14-paper-ongaro.pdf). At *USENIX Annual Technical Conference* (ATC), June 2014.
[^24]: Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. [Sharing Memory Robustly in Message-Passing Systems](https://www.cs.huji.ac.il/course/2004/dist/p124-attiya.pdf). *Journal of the ACM*, volume 42, issue 1, pages 124–142, January 1995. [doi:10.1145/200836.200869](https://doi.org/10.1145/200836.200869)
[^25]: Nancy Lynch and Alex Shvartsman. [Robust Emulation of Shared Memory Using Dynamic Quorum-Acknowledged Broadcasts](https://groups.csail.mit.edu/tds/papers/Lynch/FTCS97.pdf). At *27th Annual International Symposium on Fault-Tolerant Computing* (FTCS), June 1997. [doi:10.1109/FTCS.1997.614100](https://doi.org/10.1109/FTCS.1997.614100)
[^26]: Christian Cachin, Rachid Guerraoui, and Luís Rodrigues. [*Introduction to Reliable and Secure Distributed Programming*](https://www.distributedprogramming.net/), 2nd edition. Springer, 2011. ISBN: 978-3-642-15259-7, [doi:10.1007/978-3-642-15260-3](https://doi.org/10.1007/978-3-642-15260-3)
[^27]: Niklas Ekström, Mikhail Panchenko, and Jonathan Ellis. [Possible Issue with Read Repair?](https://lists.apache.org/thread/wwsjnnc93mdlpw8nb0d5gn4q1bmpzbon) Email thread on *cassandra-dev* mailing list, October 2012.
[^28]: Maurice P. Herlihy. [Wait-Free Synchronization](https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 13, issue 1, pages 124–149, January 1991. [doi:10.1145/114005.102808](https://doi.org/10.1145/114005.102808)
[^29]: Armando Fox and Eric A. Brewer. [Harvest, Yield, and Scalable Tolerant Systems](https://radlab.cs.berkeley.edu/people/fox/static/pubs/pdf/c18.pdf). At *7th Workshop on Hot Topics in Operating Systems* (HotOS), March 1999. [doi:10.1109/HOTOS.1999.798396](https://doi.org/10.1109/HOTOS.1999.798396)
[^30]: Seth Gilbert and Nancy Lynch. [Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services](https://www.comp.nus.edu.sg/~gilbert/pubs/BrewersConjecture-SigAct.pdf). *ACM SIGACT News*, volume 33, issue 2, pages 51–59, June 2002. [doi:10.1145/564585.564601](https://doi.org/10.1145/564585.564601)
[^31]: Seth Gilbert and Nancy Lynch. [Perspectives on the CAP Theorem](https://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 30–36, February 2012. [doi:10.1109/MC.2011.389](https://doi.org/10.1109/MC.2011.389)
[^32]: Eric A. Brewer. [CAP Twelve Years Later: How the ‘Rules’ Have Changed](https://sites.cs.ucsb.edu/~rich/class/cs293-cloud/papers/brewer-cap.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 23–29, February 2012. [doi:10.1109/MC.2012.37](https://doi.org/10.1109/MC.2012.37)
[^33]: Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen. [Consistency in Partitioned Networks](https://www.cs.rice.edu/~alc/old/comp520/papers/DGS85.pdf). *ACM Computing Surveys*, volume 17, issue 3, pages 341–370, September 1985. [doi:10.1145/5505.5508](https://doi.org/10.1145/5505.5508)
[^34]: Paul R. Johnson and Robert H. Thomas. [RFC 677: The Maintenance of Duplicate Databases](https://tools.ietf.org/html/rfc677). Network Working Group, January 1975.
[^35]: Michael J. Fischer and Alan Michael. [Sacrificing Serializability to Attain High Availability of Data in an Unreliable Network](https://sites.cs.ucsb.edu/~agrawal/spring2011/ugrad/p70-fischer.pdf). At *1st ACM Symposium on Principles of Database Systems* (PODS), March 1982. [doi:10.1145/588111.588124](https://doi.org/10.1145/588111.588124)
[^36]: Eric A. Brewer. [NoSQL: Past, Present, Future](https://www.infoq.com/presentations/NoSQL-History/). At *QCon San Francisco*, November 2012.
[^37]: Adrian Cockcroft. [Migrating to Microservices](https://www.infoq.com/presentations/migration-cloud-native/). At *QCon London*, March 2014.
[^38]: Martin Kleppmann. [A Critique of the CAP Theorem](https://arxiv.org/abs/1509.05393). arXiv:1509.05393, September 2015.
[^39]: Daniel Abadi. [Problems with CAP, and Yahoo’s little known NoSQL system](https://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html). *dbmsmusings.blogspot.com*, April 2010. Archived at [perma.cc/4NTZ-CLM9](https://perma.cc/4NTZ-CLM9)
[^40]: Daniel Abadi. [Hazelcast and the Mythical PA/EC System](https://dbmsmusings.blogspot.com/2017/10/hazelcast-and-mythical-paec-system.html). *dbmsmusings.blogspot.com*, October 2017. Archived at [perma.cc/J5XM-U5C2](https://perma.cc/J5XM-U5C2)
[^41]: Eric Brewer. [Spanner, TrueTime & The CAP Theorem](https://research.google.com/pubs/archive/45855.pdf). *research.google.com*, February 2017. Archived at [perma.cc/59UW-RH7N](https://perma.cc/59UW-RH7N)
[^42]: Daniel J. Abadi. [Consistency Tradeoffs in Modern Distributed Database System Design](https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 37–42, February 2012. [doi:10.1109/MC.2012.33](https://doi.org/10.1109/MC.2012.33)
[^43]: Nancy A. Lynch. [A Hundred Impossibility Proofs for Distributed Computing](https://groups.csail.mit.edu/tds/papers/Lynch/podc89.pdf). At *8th ACM Symposium on Principles of Distributed Computing* (PODC), August 1989. [doi:10.1145/72981.72982](https://doi.org/10.1145/72981.72982)
[^44]: Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin. [Consistency, Availability, and Convergence](https://apps.cs.utexas.edu/tech_reports/reports/tr/TR-2036.pdf). University of Texas at Austin, Department of Computer Science, Tech Report UTCS TR-11-22, May 2011. Archived at [perma.cc/SAV8-9JAJ](https://perma.cc/SAV8-9JAJ)
[^45]: Hagit Attiya, Faith Ellen, and Adam Morrison. [Limitations of Highly-Available Eventually-Consistent Data Stores](https://www.cs.tau.ac.il/~mad/publications/podc2015-replds.pdf). At *ACM Symposium on Principles of Distributed Computing* (PODC), July 2015. [doi:10.1145/2767386.2767419](https://doi.org/10.1145/2767386.2767419)
[^46]: Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. [x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors](https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf). *Communications of the ACM*, volume 53, issue 7, pages 89–97, July 2010. [doi:10.1145/1785414.1785443](https://doi.org/10.1145/1785414.1785443)
[^47]: Martin Thompson. [Memory Barriers/Fences](https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html). *mechanical-sympathy.blogspot.co.uk*, July 2011. Archived at [perma.cc/7NXM-GC5U](https://perma.cc/7NXM-GC5U)
[^48]: Ulrich Drepper. [What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf). *akkadia.org*, November 2007. Archived at [perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ)
[^49]: Hagit Attiya and Jennifer L. Welch. [Sequential Consistency Versus Linearizability](https://courses.csail.mit.edu/6.852/01/papers/p91-attiya.pdf). *ACM Transactions on Computer Systems* (TOCS), volume 12, issue 2, pages 91–122, May 1994. [doi:10.1145/176575.176576](https://doi.org/10.1145/176575.176576)
[^50]: Kyzer R. Davis, Brad G. Peabody, and Paul J. Leach. [Universally Unique IDentifiers (UUIDs)](https://www.rfc-editor.org/rfc/rfc9562). RFC 9562, IETF, May 2024.
[^51]: Ryan King. [Announcing Snowflake](https://blog.x.com/engineering/en_us/a/2010/announcing-snowflake). *blog.x.com*, June 2010. Archived at [archive.org](https://web.archive.org/web/20241128214604/https%3A//blog.x.com/engineering/en_us/a/2010/announcing-snowflake)
[^52]: Alizain Feerasta. [Universally Unique Lexicographically Sortable Identifier](https://github.com/ulid/spec). *github.com*, 2016. Archived at [perma.cc/NV2Y-ZP8U](https://perma.cc/NV2Y-ZP8U)
[^53]: Rob Conery. [A Better ID Generator for PostgreSQL](https://bigmachine.io/2014/05/29/a-better-id-generator-for-postgresql/). *bigmachine.io*, May 2014. Archived at [perma.cc/K7QV-3KFC](https://perma.cc/K7QV-3KFC)
[^54]: Leslie Lamport. [Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, volume 21, issue 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563)
[^55]: Sandeep S. Kulkarni, Murat Demirbas, Deepak Madeppa, Bharadwaj Avva, and Marcelo Leone. [Logical Physical Clocks](https://cse.buffalo.edu/~demirbas/publications/hlc.pdf). *18th International Conference on Principles of Distributed Systems* (OPODIS), December 2014. [doi:10.1007/978-3-319-14472-6\_2](https://doi.org/10.1007/978-3-319-14472-6_2)
[^56]: Manuel Bravo, Nuno Diegues, Jingna Zeng, Paolo Romano, and Luís Rodrigues. [On the use of Clocks to Enforce Consistency in the Cloud](http://sites.computer.org/debull/A15mar/p18.pdf). *IEEE Data Engineering Bulletin*, volume 38, issue 1, pages 18–31, March 2015. Archived at [perma.cc/68ZU-45SH](https://perma.cc/68ZU-45SH)
[^57]: Daniel Peng and Frank Dabek. [Large-Scale Incremental Processing Using Distributed Transactions and Notifications](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf). At *9th USENIX Conference on Operating Systems Design and Implementation* (OSDI), October 2010.
[^58]: Tushar Deepak Chandra, Robert Griesemer, and Joshua Redstone. [Paxos Made Live – An Engineering Perspective](https://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf). At *26th ACM Symposium on Principles of Distributed Computing* (PODC), June 2007. [doi:10.1145/1281100.1281103](https://doi.org/10.1145/1281100.1281103)
[^59]: Will Portnoy. [Lessons Learned from Implementing Paxos](https://blog.willportnoy.com/2012/06/lessons-learned-from-paxos.html). *blog.willportnoy.com*, June 2012. Archived at [perma.cc/QHD9-FDD2](https://perma.cc/QHD9-FDD2)
[^60]: Brian M. Oki and Barbara H. Liskov. [Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems](https://pmg.csail.mit.edu/papers/vr.pdf). At *7th ACM Symposium on Principles of Distributed Computing* (PODC), August 1988. [doi:10.1145/62546.62549](https://doi.org/10.1145/62546.62549)
[^61]: Barbara H. Liskov and James Cowling. [Viewstamped Replication Revisited](https://pmg.csail.mit.edu/papers/vr-revisited.pdf). Massachusetts Institute of Technology, Tech Report MIT-CSAIL-TR-2012-021, July 2012. Archived at [perma.cc/56SJ-WENQ](https://perma.cc/56SJ-WENQ)
[^62]: Leslie Lamport. [The Part-Time Parliament](https://www.microsoft.com/en-us/research/publication/part-time-parliament/). *ACM Transactions on Computer Systems*, volume 16, issue 2, pages 133–169, May 1998. [doi:10.1145/279227.279229](https://doi.org/10.1145/279227.279229)
[^63]: Leslie Lamport. [Paxos Made Simple](https://www.microsoft.com/en-us/research/publication/paxos-made-simple/). *ACM SIGACT News*, volume 32, issue 4, pages 51–58, December 2001. Archived at [perma.cc/82HP-MNKE](https://perma.cc/82HP-MNKE)
[^64]: Robbert van Renesse and Deniz Altinbuken. [Paxos Made Moderately Complex](https://people.cs.umass.edu/~arun/590CC/papers/paxos-moderately-complex.pdf). *ACM Computing Surveys* (CSUR), volume 47, issue 3, article no. 42, February 2015. [doi:10.1145/2673577](https://doi.org/10.1145/2673577)
[^65]: Diego Ongaro. [Consensus: Bridging Theory and Practice](https://github.com/ongardie/dissertation). PhD Thesis, Stanford University, August 2014. Archived at [perma.cc/5VTZ-2ADH](https://perma.cc/5VTZ-2ADH)
[^66]: Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy, and Jon Crowcroft. [Raft Refloated: Do We Have Consensus?](https://www.cl.cam.ac.uk/research/srg/netos/papers/2015-raftrefloated-osr.pdf) *ACM SIGOPS Operating Systems Review*, volume 49, issue 1, pages 12–21, January 2015. [doi:10.1145/2723872.2723876](https://doi.org/10.1145/2723872.2723876)
[^67]: André Medeiros. [ZooKeeper’s Atomic Broadcast Protocol: Theory and Practice](http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf). Aalto University School of Science, March 2012. Archived at [perma.cc/FVL4-JMVA](https://perma.cc/FVL4-JMVA)
[^68]: Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider. [Vive La Différence: Paxos vs. Viewstamped Replication vs. Zab](https://arxiv.org/abs/1309.5671). *IEEE Transactions on Dependable and Secure Computing*, volume 12, issue 4, pages 472–484, September 2014. [doi:10.1109/TDSC.2014.2355848](https://doi.org/10.1109/TDSC.2014.2355848)
[^69]: Heidi Howard and Richard Mortier. [Paxos vs Raft: Have we reached consensus on distributed consensus?](https://arxiv.org/abs/2004.05074). At *7th Workshop on Principles and Practice of Consistency for Distributed Data* (PaPoC), April 2020. [doi:10.1145/3380787.3393681](https://doi.org/10.1145/3380787.3393681)
[^70]: Miguel Castro and Barbara H. Liskov. [Practical Byzantine Fault Tolerance and Proactive Recovery](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/p398-castro-bft-tocs.pdf). *ACM Transactions on Computer Systems*, volume 20, issue 4, pages 396–461, November 2002. [doi:10.1145/571637.571640](https://doi.org/10.1145/571637.571640)
[^71]: Shehar Bano, Alberto Sonnino, Mustafa Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis. [SoK: Consensus in the Age of Blockchains](https://smeiklej.com/files/aft19a.pdf). At *1st ACM Conference on Advances in Financial Technologies* (AFT), October 2019. [doi:10.1145/3318041.3355458](https://doi.org/10.1145/3318041.3355458)
[^72]: Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. [Impossibility of Distributed Consensus with One Faulty Process](https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf). *Journal of the ACM*, volume 32, issue 2, pages 374–382, April 1985. [doi:10.1145/3149.214121](https://doi.org/10.1145/3149.214121)
[^73]: Tushar Deepak Chandra and Sam Toueg. [Unreliable Failure Detectors for Reliable Distributed Systems](https://courses.csail.mit.edu/6.852/08/papers/CT96-JACM.pdf). *Journal of the ACM*, volume 43, issue 2, pages 225–267, March 1996. [doi:10.1145/226643.226647](https://doi.org/10.1145/226643.226647)
[^74]: Michael Ben-Or. [Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols](https://homepage.cs.uiowa.edu/~ghosh/BenOr.pdf). At *2nd ACM Symposium on Principles of Distributed Computing* (PODC), August 1983. [doi:10.1145/800221.806707](https://doi.org/10.1145/800221.806707)
[^75]: Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. [Consensus in the Presence of Partial Synchrony](https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf). *Journal of the ACM*, volume 35, issue 2, pages 288–323, April 1988. [doi:10.1145/42282.42283](https://doi.org/10.1145/42282.42283)
[^76]: Xavier Défago, André Schiper, and Péter Urbán. [Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey](https://dspace.jaist.ac.jp/dspace/bitstream/10119/4883/1/defago_et_al.pdf). *ACM Computing Surveys*, volume 36, issue 4, pages 372–421, December 2004. [doi:10.1145/1041680.1041682](https://doi.org/10.1145/1041680.1041682)
[^77]: Hagit Attiya and Jennifer Welch. *Distributed Computing: Fundamentals, Simulations and Advanced Topics*, 2nd edition. John Wiley & Sons, 2004. ISBN: 978-0-471-45324-6, [doi:10.1002/0471478210](https://doi.org/10.1002/0471478210)
[^78]: Rachid Guerraoui. [Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus](https://citeseerx.ist.psu.edu/pdf/5d06489503b6f791aa56d2d7942359c2592e44b0). At *9th International Workshop on Distributed Algorithms* (WDAG), September 1995. [doi:10.1007/BFb0022140](https://doi.org/10.1007/BFb0022140)
[^79]: Jim N. Gray and Leslie Lamport. [Consensus on Transaction Commit](https://dsf.berkeley.edu/cs286/papers/paxoscommit-tods2006.pdf). *ACM Transactions on Database Systems* (TODS), volume 31, issue 1, pages 133–160, March 2006. [doi:10.1145/1132863.1132867](https://doi.org/10.1145/1132863.1132867)
[^80]: Fred B. Schneider. [Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial](https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf). *ACM Computing Surveys*, volume 22, issue 4, pages 299–319, December 1990. [doi:10.1145/98163.98167](https://doi.org/10.1145/98163.98167)
[^81]: Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. [Calvin: Fast Distributed Transactions for Partitioned Database Systems](https://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2012. [doi:10.1145/2213836.2213838](https://doi.org/10.1145/2213836.2213838)
[^82]: Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and Aviad Zuck. [Tango: Distributed Data Structures over a Shared Log](https://www.microsoft.com/en-us/research/publication/tango-distributed-data-structures-over-a-shared-log/). At *24th ACM Symposium on Operating Systems Principles* (SOSP), November 2013. [doi:10.1145/2517349.2522732](https://doi.org/10.1145/2517349.2522732)
[^83]: Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, and John D. Davis. [CORFU: A Shared Log Design for Flash Clusters](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final30.pdf). At *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012.
[^84]: Vasilis Gavrielatos, Antonios Katsarakis, and Vijay Nagarajan. [Odyssey: the impact of modern hardware on strongly-consistent replication protocols](https://vasigavr1.github.io/files/Odyssey_Eurosys_2021.pdf). At *16th European Conference on Computer Systems* (EuroSys), April 2021. [doi:10.1145/3447786.3456240](https://doi.org/10.1145/3447786.3456240)
[^85]: Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. [Flexible Paxos: Quorum Intersection Revisited](https://drops.dagstuhl.de/opus/volltexte/2017/7094/pdf/LIPIcs-OPODIS-2016-25.pdf). At *20th International Conference on Principles of Distributed Systems* (OPODIS), December 2016. [doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25)
[^86]: Martin Kleppmann. [Distributed Systems lecture notes](https://www.cl.cam.ac.uk/teaching/2425/ConcDisSys/dist-sys-notes.pdf). *University of Cambridge*, October 2024. Archived at [perma.cc/SS3Q-FNS5](https://perma.cc/SS3Q-FNS5)
[^87]: Kyle Kingsbury. [Call Me Maybe: Elasticsearch 1.5.0](https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0). *aphyr.com*, April 2015. Archived at [perma.cc/37MZ-JT7H](https://perma.cc/37MZ-JT7H)
[^88]: Heidi Howard and Jon Crowcroft. [Coracle: Evaluating Consensus at the Internet Edge](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p85.pdf). At *Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. [doi:10.1145/2829988.2790010](https://doi.org/10.1145/2829988.2790010)
[^89]: Tom Lianza and Chris Snook. [A Byzantine failure in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/). *blog.cloudflare.com*, November 2020. Archived at [perma.cc/83EZ-ALCY](https://perma.cc/83EZ-ALCY)
[^90]: Ivan Kelly. [BookKeeper Tutorial](https://github.com/ivankelly/bookkeeper-tutorial). *github.com*, October 2014. Archived at [perma.cc/37Y6-VZWU](https://perma.cc/37Y6-VZWU)
[^91]: Jack Vanlightly. [Apache BookKeeper Insights Part 1 — External Consensus and Dynamic Membership](https://medium.com/splunk-maas/apache-bookkeeper-insights-part-1-external-consensus-and-dynamic-membership-c259f388da21). *medium.com*, November 2021. Archived at [perma.cc/3MDB-8GFB](https://perma.cc/3MDB-8GFB)