mirror of
https://github.com/Vonng/ddia.git
synced 2026-06-21 00:47:05 +08:00
adjust anchor for ch5 - ch10
This commit is contained in:
parent
860cc17b5d
commit
99b4b00502
5 changed files with 510 additions and 449 deletions
|
|
@ -62,7 +62,9 @@ systems and similar infrastructure, you will need to go much deeper into the the
|
|||
chance of your systems being robust. As usual, the literature references in this chapter provide
|
||||
some initial pointers.
|
||||
|
||||
# Linearizability
|
||||
|
||||
|
||||
## Linearizability
|
||||
|
||||
If you want a replicated database to be as simple as possible to use, you should make it behave as
|
||||
if it wasn’t replicated at all. Then users don’t have to worry about replication lag, conflicts, and
|
||||
|
|
@ -83,7 +85,6 @@ guarantee*. To clarify this idea, let’s look at an example of a system that is
|
|||
|
||||
{{< figure src="/fig/ddia_1001.png" id="fig_consistency_linearizability_0" caption="Figure 10-1. If this database were linearizable, then either Alice's read would return 1 instead of 0, or Bob's read would return 0 instead of 1." class="w-full my-4" >}}
|
||||
|
||||
|
||||
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4].
|
||||
Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a
|
||||
game their favorite team is playing. Just after the final score is announced, Aaliyah refreshes the
|
||||
|
|
@ -98,7 +99,7 @@ his query) *after* he heard Aaliyah exclaim the final score, and therefore he ex
|
|||
result to be at least as recent as Aaliyah’s. The fact that his query returned a stale result is a
|
||||
violation of linearizability.
|
||||
|
||||
## What Makes a System Linearizable?
|
||||
### What Makes a System Linearizable?
|
||||
|
||||
In order to understand linearizability better, let’s look at some more examples.
|
||||
[Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows three clients concurrently reading and writing the same
|
||||
|
|
@ -219,7 +220,10 @@ in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag) are exampl
|
|||
guarantees all these weaker properties, and more. In this chapter we will focus on linearizability,
|
||||
which is the strongest consistency model in common use.
|
||||
|
||||
# Linearizability Versus Serializability
|
||||
|
||||
--------
|
||||
|
||||
> [!TIP] LINEARIZABILITY VERSUS SERIALIZABILITY
|
||||
|
||||
Linearizability is easily confused with serializability (see [“Serializability”](/en/ch8#sec_transactions_serializability)),
|
||||
as both words seem to mean something like “can be arranged in a sequential order.” However, they are
|
||||
|
|
@ -245,8 +249,7 @@ Linearizability
|
|||
(*Sequential consistency* is something else again [^8], but we won’t discuss it here.)
|
||||
|
||||
A database may provide both serializability and linearizability, and this combination is known as
|
||||
*strict serializability* or *strong one-copy serializability* (*strong-1SR*)
|
||||
[^11] [^12].
|
||||
*strict serializability* or *strong one-copy serializability* (*strong-1SR*) [^11] [^12].
|
||||
Single-node databases are typically linearizable. With distributed databases using optimistic
|
||||
methods like serializable snapshot isolation (see [“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)) the situation is more
|
||||
complicated: for example, CockroachDB provides serializability, and some recency guarantees on
|
||||
|
|
@ -257,14 +260,16 @@ It is also possible to combine a weaker isolation level with linearizability, or
|
|||
consistency model with serializability; in fact, consistency model and isolation level can be chosen
|
||||
largely independently from each other [^15] [^16].
|
||||
|
||||
## Relying on Linearizability
|
||||
--------
|
||||
|
||||
### Relying on Linearizability
|
||||
|
||||
In what circumstances is linearizability useful? Viewing the final score of a sporting match is
|
||||
perhaps a frivolous example: a result that is outdated by a few seconds is unlikely to cause any
|
||||
real harm in this situation. However, there a few areas in which linearizability is an important
|
||||
requirement for making a system work correctly.
|
||||
|
||||
### Locking and leader election
|
||||
#### Locking and leader election
|
||||
|
||||
A system that uses single-leader replication needs to ensure that there is indeed only one leader,
|
||||
not several (split brain). One way of electing a leader is to use a lease: every node that starts up
|
||||
|
|
@ -280,10 +285,14 @@ election correctly (see for example the fencing issue in [“Distributed Locks a
|
|||
libraries like Apache Curator help by providing higher-level recipes on top of ZooKeeper. However, a
|
||||
linearizable storage service is the basic foundation for these coordination tasks.
|
||||
|
||||
> [!NOTE]> Strictly speaking, ZooKeeper provides linearizable writes, but reads may be stale, since there is no
|
||||
> guarantee that they are served from the current leader
|
||||
> [^18].
|
||||
> etcd since version 3 provides linearizable reads by default.
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> Strictly speaking, ZooKeeper provides linearizable writes, but reads may be stale, since there is no
|
||||
> guarantee that they are served from the current leader [^18]. etcd since version 3 provides linearizable reads by default.
|
||||
|
||||
--------
|
||||
|
||||
|
||||
Distributed locking is also used at a much more granular level in some distributed databases, such as
|
||||
Oracle Real Application Clusters (RAC) [^19].
|
||||
|
|
@ -292,7 +301,7 @@ to the same disk storage system. Since these linearizable locks are on the criti
|
|||
transaction execution, RAC deployments usually have a dedicated cluster interconnect network for
|
||||
communication between database nodes.
|
||||
|
||||
### Constraints and uniqueness guarantees
|
||||
#### Constraints and uniqueness guarantees
|
||||
|
||||
Uniqueness constraints are common in databases: for example, a username or email address must
|
||||
uniquely identify one user, and in a file storage service there cannot be two files with the same
|
||||
|
|
@ -320,7 +329,7 @@ However, a hard uniqueness constraint, such as the one you typically find in rel
|
|||
requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints,
|
||||
can be implemented without linearizability [^20].
|
||||
|
||||
### Cross-channel timing dependencies
|
||||
#### Cross-channel timing dependencies
|
||||
|
||||
Notice a detail in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0): if Aaliyah hadn’t exclaimed the score,
|
||||
Bryce wouldn’t have known that the result of his query was stale. He would have just refreshed the
|
||||
|
|
@ -367,7 +376,8 @@ understand. If you control the additional communication channel (like in the cas
|
|||
queue, but not in the case of Aaliyah and Bryce), you can use alternative approaches similar to what
|
||||
we discussed in [“Reading Your Own Writes”](/en/ch6#sec_replication_ryw), at the cost of additional complexity.
|
||||
|
||||
## Implementing Linearizable Systems
|
||||
|
||||
### Implementing Linearizable Systems
|
||||
|
||||
Now that we’ve looked at a few examples in which linearizability is useful, let’s think about how we
|
||||
might implement a system that offers linearizable semantics.
|
||||
|
|
@ -375,11 +385,9 @@ might implement a system that offers linearizable semantics.
|
|||
Since linearizability essentially means “behave as though there is only a single copy of the data,
|
||||
and all operations on it are atomic,” the simplest answer would be to really only use a single copy
|
||||
of the data. However, that approach would not be able to tolerate faults: if the node holding that
|
||||
one copy failed, the data would be lost, or at least inaccessible until the node was brought up
|
||||
again.
|
||||
one copy failed, the data would be lost, or at least inaccessible until the node was brought up again.
|
||||
|
||||
Let’s revisit the replication methods from [Chapter 6](/en/ch6#ch_replication), and compare whether they can be made
|
||||
linearizable:
|
||||
Let’s revisit the replication methods from [Chapter 6](/en/ch6#ch_replication), and compare whether they can be made linearizable:
|
||||
|
||||
Single-leader replication (potentially linearizable)
|
||||
: In a system with single-leader replication, the leader has the primary copy of the data that is
|
||||
|
|
@ -399,10 +407,7 @@ Single-leader replication (potentially linearizable)
|
|||
Consensus algorithms (likely linearizable)
|
||||
: Some consensus algorithms are essentially single-leader replication with automatic leader election
|
||||
and failover. They are carefully designed to prevent split brain, allowing them to implement
|
||||
linearizable storage safely. ZooKeeper uses the Zab consensus algorithm
|
||||
[^22]
|
||||
and etcd uses Raft
|
||||
[^23], for example.
|
||||
linearizable storage safely. ZooKeeper uses the Zab consensus algorithm [^22] and etcd uses Raft [^23], for example.
|
||||
However, just because a system uses consensus does not guarantee that all operations on it are
|
||||
linearizable: if it allows reads on a node without checking that it is still the leader, the
|
||||
results of the read may be stale if a new leader has just been elected.
|
||||
|
|
@ -424,7 +429,7 @@ Leaderless replication (probably not linearizable)
|
|||
consistent with actual event ordering due to clock skew (see [“Relying on Synchronized Clocks”](/en/ch9#sec_distributed_clocks_relying)).
|
||||
Even with quorums, nonlinearizable behavior is possible, as demonstrated in the next section.
|
||||
|
||||
### Linearizability and quorums
|
||||
#### Linearizability and quorums
|
||||
|
||||
Intuitively, it seems as though quorum reads and writes should be linearizable in a
|
||||
Dynamo-style model. However, when we have variable network delays, it is possible to have race
|
||||
|
|
@ -459,7 +464,7 @@ linearizable compare-and-set operation cannot, because it requires a consensus a
|
|||
In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not
|
||||
provide linearizability, even with quorum reads and writes.
|
||||
|
||||
## The Cost of Linearizability
|
||||
### The Cost of Linearizability
|
||||
|
||||
As some replication methods can provide linearizability and others cannot, it is interesting to
|
||||
explore the pros and cons of linearizability in more depth.
|
||||
|
|
@ -495,7 +500,7 @@ If clients can connect directly to the leader region, this is not a problem, sin
|
|||
application continues to work normally there. But clients that can only reach a follower region
|
||||
will experience an outage until the network link is repaired.
|
||||
|
||||
### The CAP theorem
|
||||
#### The CAP theorem
|
||||
|
||||
This issue is not just a consequence of single-leader and multi-leader replication: any linearizable
|
||||
database has this problem, no matter how it is implemented. The issue also isn’t specific to
|
||||
|
|
@ -524,19 +529,17 @@ implementing large-scale web services [^36].
|
|||
CAP deserves credit for this culture shift—it helped trigger the NoSQL movement, a burst of new
|
||||
database technologies around the mid-2000s.
|
||||
|
||||
# The Unhelpful CAP Theorem
|
||||
> [!TIP] THE UNHELPFUL CAP THEOREM
|
||||
|
||||
CAP is sometimes presented as *Consistency, Availability, Partition tolerance: pick 2 out of 3*.
|
||||
Unfortunately, putting it this way is misleading [^32] because network partitions are a kind of
|
||||
fault, so they aren’t something about which you have a choice: they will happen whether you like it
|
||||
or not.
|
||||
fault, so they aren’t something about which you have a choice: they will happen whether you like it or not.
|
||||
|
||||
At times when the network is working correctly, a system can provide both consistency
|
||||
(linearizability) and total availability. When a network fault occurs, you have to choose between
|
||||
either linearizability or total availability. Thus, a better way of phrasing CAP would be
|
||||
*either Consistent or Available when Partitioned* [^37].
|
||||
A more reliable network needs to make this choice less often, but at some point the choice is
|
||||
inevitable.
|
||||
A more reliable network needs to make this choice less often, but at some point the choice is inevitable.
|
||||
|
||||
The CP/AP classification scheme has several further flaws [^4]. *Consistency* is formalized as
|
||||
linearizability (the theorem doesn’t say anything about weaker consistency models), and the
|
||||
|
|
@ -565,7 +568,7 @@ However, this definition inherits several problems with CAP, such as the counter
|
|||
There are many more interesting impossibility results in distributed systems [^43], and CAP has now been
|
||||
superseded by more precise results [^44] [^45], so it is of mostly historical interest today.
|
||||
|
||||
### Linearizability and network delays
|
||||
#### Linearizability and network delays
|
||||
|
||||
Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable
|
||||
in practice. For example, even RAM on a modern multi-core CPU is not linearizable [^46]:
|
||||
|
|
@ -598,7 +601,8 @@ exist, but weaker consistency models can be much faster, so this trade-off is im
|
|||
latency-sensitive systems. In [Link to Come] we will discuss some approaches for avoiding
|
||||
linearizability without sacrificing correctness.
|
||||
|
||||
# ID Generators and Logical Clocks
|
||||
|
||||
## ID Generators and Logical Clocks
|
||||
|
||||
In many applications you need to assign some sort of unique ID to database records when they are
|
||||
created, which gives you a primary key by which you can refer to those records. In single-node
|
||||
|
|
@ -666,14 +670,8 @@ Wall-clock timestamp made unique
|
|||
putting a timestamp from that clock in the most significant bits, and filling the remaining bits
|
||||
with extra information that ensures the ID is unique even if the timestamp is not—for example, a
|
||||
shard number and a per-shard incrementing sequence number, or a long random value. This approach
|
||||
is used in Version 7 UUIDs
|
||||
[^50],
|
||||
Twitter’s Snowflake [^51],
|
||||
ULIDs [^52],
|
||||
Hazelcast’s Flake ID generator, MongoDB ObjectIDs, and many similar schemes
|
||||
[^50].
|
||||
You can implement these ID generators in application code or within a database
|
||||
[^53].
|
||||
is used in Version 7 UUIDs [^50], Twitter’s Snowflake [^51], ULIDs [^52], Hazelcast’s Flake ID generator,
|
||||
MongoDB ObjectIDs, and many similar schemes [^50]. You can implement these ID generators in application code or within a database [^53].
|
||||
|
||||
All these schemes generate IDs that are unique (at least with high enough probability that
|
||||
collisions are vanishingly rare), but they have much weaker ordering guarantees for IDs than the
|
||||
|
|
@ -691,7 +689,7 @@ using atomic clocks or GPS receivers. But it would also be nice to be able to ge
|
|||
unique and correctly ordered without relying on special hardware. That’s what *logical clocks* are
|
||||
about.
|
||||
|
||||
## Logical Clocks
|
||||
### Logical Clocks
|
||||
|
||||
In [“Unreliable Clocks”](/en/ch9#sec_distributed_clocks) we discussed time-of-day clocks and monotonic clocks. Both of these
|
||||
are *physical clocks*: they measure the passing of seconds (or milliseconds, microseconds, etc.).
|
||||
|
|
@ -713,7 +711,7 @@ The requirements for a logical clock are typically:
|
|||
A single-node ID generator meets these requirements, but the distributed ID generators we just
|
||||
discussed do not meet the causal ordering requirement.
|
||||
|
||||
### Lamport timestamps
|
||||
#### Lamport timestamps
|
||||
|
||||
Fortunately, there is a simple method for generating logical timestamps that *is* consistent with
|
||||
causality, and which you can use as a distributed ID generator. It is called a *Lamport clock*,
|
||||
|
|
@ -733,8 +731,7 @@ each timestamp is made unique.
|
|||
|
||||
Every time a node generates a timestamp, it increments its counter value and uses the new value.
|
||||
Moreover, every time a node sees a timestamp from another node, if the counter value in that
|
||||
timestamp is greater than its local counter value, it increases its local counter to match the value
|
||||
in the timestamp.
|
||||
timestamp is greater than its local counter value, it increases its local counter to match the value in the timestamp.
|
||||
|
||||
In [Figure 10-9](/en/ch10#fig_consistency_lamport_ts), Aaliyah had not yet seen Caleb’s message when posting her own,
|
||||
and vice versa. Assuming both users start with an initial counter value of 0, both therefore
|
||||
|
|
@ -749,7 +746,7 @@ two timestamps have the same counter, we compare their node IDs instead, using t
|
|||
lexicographic string comparison. Thus, the timestamp order in this example is
|
||||
(1, “Aaliyah”) < (1, “Caleb”) < (2, “Bryce”).
|
||||
|
||||
### Hybrid logical clocks
|
||||
#### Hybrid logical clocks
|
||||
|
||||
Lamport timestamps are good at capturing the order in which things happened, but they have some
|
||||
limitations:
|
||||
|
|
@ -779,7 +776,7 @@ conventional time-of-day clock, with the added property that its ordering is con
|
|||
happens-before relation. It doesn’t depend on any special hardware, and requires only roughly
|
||||
synchronized clocks. Hybrid logical clocks are used by CockroachDB, for example.
|
||||
|
||||
### Lamport/hybrid logical clocks vs. vector clocks
|
||||
#### Lamport/hybrid logical clocks vs. vector clocks
|
||||
|
||||
In [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl) we discussed how snapshot isolation is often implemented:
|
||||
essentially, by giving each transaction a transaction ID, and allowing each transaction to see
|
||||
|
|
@ -799,7 +796,7 @@ algorithm, such as a *vector clock*. The downside is that the timestamps from a
|
|||
much bigger—potentially one integer for every node in the system. See [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)
|
||||
for more details on detecting concurrency.
|
||||
|
||||
## Linearizable ID Generators
|
||||
### Linearizable ID Generators
|
||||
|
||||
Although Lamport clocks and hybrid logical clocks provide useful ordering guarantees, that ordering
|
||||
is still weaker than the linearizable single-node ID generator we talked about previously. Recall
|
||||
|
|
@ -839,7 +836,7 @@ the example, that’s not so easy.
|
|||
The simplest solution in this case would be to use a linearizable ID generator, which would ensure
|
||||
that the photo upload is assigned a greater ID than the account permissions change.
|
||||
|
||||
### Implementing a linearizable ID generator
|
||||
#### Implementing a linearizable ID generator
|
||||
|
||||
The simplest way of ensuring that ID assignment is linearizable is by actually using a single node
|
||||
for this purpose. That node only needs to atomically increment a counter and return its value when
|
||||
|
|
@ -874,7 +871,7 @@ assignment without any communication: even requests in different regions will be
|
|||
without waiting for cross-region requests. The downside is that you need hardware and software
|
||||
support for clocks to be tightly synchronized and compute the necessary uncertainty interval.
|
||||
|
||||
### Enforcing constraints using logical clocks
|
||||
#### Enforcing constraints using logical clocks
|
||||
|
||||
In [“Constraints and uniqueness guarantees”](/en/ch10#sec_consistency_uniqueness) we saw that a linearizable compare-and-set operation can be used
|
||||
to implement locks, uniqueness constraints, and similar constructs in a distributed system. This
|
||||
|
|
@ -897,15 +894,16 @@ the kind of fault-tolerant system that we need.
|
|||
To implement locks, leases, and similar constructs in a fault-tolerant way, we need something
|
||||
stronger than logical clocks or ID generators: we need consensus.
|
||||
|
||||
# Consensus
|
||||
|
||||
|
||||
## Consensus
|
||||
|
||||
In this chapter we have seen several examples of things that are easy when you have only a single
|
||||
node, but which get a lot harder if you want fault tolerance:
|
||||
|
||||
* A database can be linearizable if you have only a single leader, and you make all reads and writes
|
||||
on that leader. But how do you fail over if that leader fails, while avoiding split brain? How do
|
||||
you ensure that a node that believes itself to be the leader hasn’t actually been voted out in the
|
||||
meantime?
|
||||
you ensure that a node that believes itself to be the leader hasn’t actually been voted out in the meantime?
|
||||
* A linearizable ID generator on a single node is just a counter with an atomic fetch-and-add
|
||||
instruction, but what if it crashes?
|
||||
* An atomic compare-and-set (CAS) operation is useful for many things, such as deciding who gets a
|
||||
|
|
@ -933,7 +931,9 @@ Such *Byzantine fault tolerant* (BFT) consensus algorithms are used in blockchai
|
|||
However, as explained in [“Byzantine Faults”](/en/ch9#sec_distributed_byzantine), BFT algorithms are beyond the scope of this
|
||||
book.
|
||||
|
||||
# The Impossibility of Consensus
|
||||
--------
|
||||
|
||||
> [!TIP] THE IMPOSSIBILITY OF CONSENSUS
|
||||
|
||||
You may have heard about the FLP result [^72]—named after the
|
||||
authors Fischer, Lynch, and Paterson—which proves that there is no algorithm that is always able to
|
||||
|
|
@ -951,7 +951,9 @@ Even just allowing the algorithm to use random numbers is sufficient to get arou
|
|||
Thus, although the FLP result about the impossibility of consensus is of great theoretical
|
||||
importance, distributed systems can usually achieve consensus in practice.
|
||||
|
||||
## The Many Faces of Consensus
|
||||
--------
|
||||
|
||||
### The Many Faces of Consensus
|
||||
|
||||
Consensus can be expressed in several different ways:
|
||||
|
||||
|
|
@ -966,10 +968,9 @@ Consensus can be expressed in several different ways:
|
|||
We will explore all of these shortly. In fact, these problems are all equivalent to each other: if
|
||||
you have an algorithm that solves one of these problems, you can convert it into a solution for any
|
||||
of the others. This is quite a profound and perhaps surprising insight! And that’s why we can lump
|
||||
all of these things together under “consensus”, even though they look quite different on the
|
||||
surface.
|
||||
all of these things together under “consensus”, even though they look quite different on the surface.
|
||||
|
||||
### Single-value consensus
|
||||
#### Single-value consensus
|
||||
|
||||
The standard formulation of consensus involves getting multiple nodes to agree on a single value.
|
||||
For example:
|
||||
|
|
@ -1039,7 +1040,7 @@ there is a severe network problem [^75].
|
|||
Thus, a large-scale outage can stop the system from being able to process requests, but it cannot
|
||||
corrupt the consensus system by causing it to make inconsistent decisions.
|
||||
|
||||
### Compare-and-set as consensus
|
||||
#### Compare-and-set as consensus
|
||||
|
||||
A compare-and-set (CAS) operation checks whether the current value of some object equals some
|
||||
expected value; if yes, it atomically updates the object to some new value; if no, it leaves the
|
||||
|
|
@ -1056,8 +1057,7 @@ values in the CAS invocation, and then set the object to whatever value was deci
|
|||
consensus. Any CAS invocations whose new value was not decided return an error. CAS invocations with
|
||||
different expected values use separate runs of the consensus protocol.
|
||||
|
||||
This shows that CAS and consensus are equivalent to each other
|
||||
[^28] [^73].
|
||||
This shows that CAS and consensus are equivalent to each other [^28] [^73].
|
||||
Again, both are straightforward on a single node, but challenging to make fault-tolerant. As an
|
||||
example of CAS in a distributed setting, we saw conditional write operations for object stores in
|
||||
[“Databases backed by object storage”](/en/ch6#sec_replication_object_storage), which allow a write to happen only if an object with the same
|
||||
|
|
@ -1068,7 +1068,7 @@ tells us that consensus cannot be solved by a deterministic algorithm in the asy
|
|||
model [^72], but we saw in [“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum
|
||||
reads/writes in this model [^24] [^25] [^26]. From this it follows that a linearizable register cannot solve consensus.
|
||||
|
||||
### Shared logs as consensus
|
||||
#### Shared logs as consensus
|
||||
|
||||
We have seen several examples of logs, such as replication logs, transaction logs, and write-ahead
|
||||
logs. A log stores a sequence of *log entries*, and anyone who reads it sees the same entries in the
|
||||
|
|
@ -1101,10 +1101,14 @@ Validity
|
|||
: If a node reads a log entry containing some value, then some node previously requested for that
|
||||
value to be added to the log.
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> A shared log is formally known as a *total order broadcast*, *atomic broadcast*, or *total order multicast* protocol [^26] [^76] [^77]
|
||||
> It’s the same thing described in different words: requesting a value to be added to the log is then called “broadcasting” it, and reading a log entry is called “delivering” it.
|
||||
|
||||
--------
|
||||
|
||||
If you have an implementation of a shared log, it is easy to solve the consensus problem: every node
|
||||
that wants to propose a value requests for it to be added to the log, and whichever value is read
|
||||
back in the first log entry is the value that is decided. Since all nodes read log entries in the
|
||||
|
|
@ -1128,7 +1132,7 @@ replication without failover does not meet the liveness requirements, since it s
|
|||
messages if the leader crashes. As usual, the challenge is in performing failover safely and
|
||||
automatically.
|
||||
|
||||
### Fetch-and-add as consensus
|
||||
#### Fetch-and-add as consensus
|
||||
|
||||
The linearizable ID generator we saw in [“Linearizable ID Generators”](/en/ch10#sec_consistency_linearizable_id) comes close to solving
|
||||
consensus, but it falls slightly short. We can implement such an ID generator using a fetch-and-add
|
||||
|
|
@ -1163,7 +1167,7 @@ can say that fetch-and-add has a *consensus number* of two [^28].
|
|||
In contrast, CAS and shared logs solve consensus for any number of nodes that may propose values, so
|
||||
they have a consensus number of ∞ (infinity).
|
||||
|
||||
### Atomic commitment as consensus
|
||||
#### Atomic commitment as consensus
|
||||
|
||||
In [“Distributed Transactions”](/en/ch8#sec_transactions_distributed) we saw the *atomic commitment* problem, which is to ensure that
|
||||
the databases or shards involved in a distributed transaction all either commit or abort a
|
||||
|
|
@ -1198,8 +1202,7 @@ non-triviality property ensures the algorithm can’t simply always abort (but i
|
|||
any of the communication among the nodes times out). The other three properties are basically the
|
||||
same as for consensus.
|
||||
|
||||
If you have a solution for consensus, there are multiple ways you could solve atomic commitment
|
||||
[^78] [^79].
|
||||
If you have a solution for consensus, there are multiple ways you could solve atomic commitment [^78] [^79].
|
||||
One works like this: when you want to commit the transaction, every node sends its vote to commit or
|
||||
abort to every other node. Nodes that receive a vote to commit from itself and every other node
|
||||
propose “commit” using the consensus algorithm; nodes that receive a vote to abort, or which
|
||||
|
|
@ -1209,8 +1212,7 @@ consensus algorithm decided, it commits or aborts accordingly.
|
|||
In this algorithm, “commit” will only be proposed if all nodes voted to commit. If any node voted to
|
||||
abort, all proposals in the consensus algorithm will be “abort”. It could happen that some nodes
|
||||
propose “abort” while others propose “commit” if all nodes voted to commit but some communication
|
||||
timed out; in this case it doesn’t matter whether the nodes commit or abort, as long as they all do
|
||||
the same.
|
||||
timed out; in this case it doesn’t matter whether the nodes commit or abort, as long as they all do the same.
|
||||
|
||||
If you have a fault-tolerant atomic commitment protocol, you can also solve consensus. Every node
|
||||
that wants to propose a value starts a transaction on a quorum of nodes, and at each node it
|
||||
|
|
@ -1221,7 +1223,7 @@ consensus; if atomic commit aborts, the proposing node retries with a new transa
|
|||
|
||||
This shows that atomic commit and consensus are also equivalent to each other.
|
||||
|
||||
## Consensus in Practice
|
||||
### Consensus in Practice
|
||||
|
||||
We have seen that single-value consensus, CAS, shared logs, and atomic commitment are all equivalent
|
||||
to each other: you can convert a solution to one of them into a solution to any of the others. That
|
||||
|
|
@ -1233,20 +1235,20 @@ Raft, Viewstamped Replication, and Zab provide shared logs right out of the box.
|
|||
single-value consensus, but in practice most systems using Paxos actually use the extension called
|
||||
Multi-Paxos, which also provides a shared log.
|
||||
|
||||
### Using shared logs
|
||||
#### Using shared logs
|
||||
|
||||
A shared log is a good fit for database replication: if every log entry represents a write to the
|
||||
database, and every replica processes the same writes in the same order using deterministic logic,
|
||||
then the replicas will all end up in a consistent state. This idea is known as *state machine
|
||||
replication* [^80],
|
||||
then the replicas will all end up in a consistent state. This idea is known as *state machine replication* [^80],
|
||||
and it is the principle behind event sourcing, which we saw in [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events). Shared
|
||||
logs are also useful for stream processing, as we shall see in [Link to Come].
|
||||
|
||||
Similarly, a shared log can be used to implement serializable transactions: as discussed in
|
||||
[“Actual Serial Execution”](/en/ch8#sec_transactions_serial), if every log entry represents a deterministic transaction to be
|
||||
executed as a stored procedure, and if every node executes those transactions in the same order,
|
||||
then the transactions will be serializable
|
||||
[^81] [^82].
|
||||
then the transactions will be serializable [^81] [^82].
|
||||
|
||||
---------
|
||||
|
||||
> [!NOTE]
|
||||
> Sharded databases with a strong consistency model often maintain a separate log per shard, which
|
||||
|
|
@ -1254,6 +1256,8 @@ then the transactions will be serializable
|
|||
> references) they can offer across shards. Serializable transactions across shards are possible, but
|
||||
> require additional coordination [^83].
|
||||
|
||||
--------
|
||||
|
||||
A shared log is also powerful because it can easily be adapted to other forms of consensus:
|
||||
|
||||
* We saw previously how to use it to implement single-value consensus and CAS: simply decide the
|
||||
|
|
@ -1266,7 +1270,7 @@ A shared log is also powerful because it can easily be adapted to other forms of
|
|||
can be used to generate fencing tokens (see [“Fencing off zombies and delayed requests”](/en/ch9#sec_distributed_fencing_tokens)); for example, in
|
||||
ZooKeeper, this sequence number is called `zxid` [^18].
|
||||
|
||||
### From single-leader replication to consensus
|
||||
#### From single-leader replication to consensus
|
||||
|
||||
We saw previously that single-value consensus is easy if you have a single “dictator” node that
|
||||
makes the decision, and likewise a shared log is easy if a single leader is the only node that is
|
||||
|
|
@ -1314,7 +1318,7 @@ different protocols. In consensus algorithms, any node can start an election and
|
|||
quorum of nodes to respond; in 2PC, only the coordinator can request votes, and it requires a “yes”
|
||||
vote from *every* participant before it can commit.
|
||||
|
||||
### Subtleties of consensus
|
||||
#### Subtleties of consensus
|
||||
|
||||
This basic structure is common to all of Raft, Multi-Paxos, Zab, and Viewstamped Replication: a vote
|
||||
by a quorum of nodes elects a leader, and then another quorum vote is required for every entry that
|
||||
|
|
@ -1330,7 +1334,10 @@ least as up-to-date as a majority of its followers [^69].
|
|||
In contrast, Paxos allows any node to become the new leader, but requires it to bring its log
|
||||
up-to-date with other nodes before it can start appending new entries of its own.
|
||||
|
||||
# Consistency vs. Availability in Leader Election
|
||||
|
||||
--------
|
||||
|
||||
> [!TIP] CONSISTENCY VS. AVAILABILITY IN LEADER ELECTION
|
||||
|
||||
If you want the consensus algorithm to strictly guarantee the properties laid out in
|
||||
[“Shared logs as consensus”](/en/ch10#sec_consistency_shared_logs), it’s essential that the new leader is up-to-date with any confirmed
|
||||
|
|
@ -1349,10 +1356,11 @@ availability, but you are on thin ice, since the theory of consensus no longer a
|
|||
will work fine as long as there are no faults, the problems discussed in [Chapter 9](/en/ch9#ch_distributed) can
|
||||
easily cause a lot of data loss or corruption.
|
||||
|
||||
--------
|
||||
|
||||
Another subtlety is in how the algorithms deal with log entries that had been proposed by the old
|
||||
leader before it failed, but for which the vote on appending to the log had not yet completed. You
|
||||
can find discussions of these details in the references for this chapter
|
||||
[^23] [^69] [^86].
|
||||
can find discussions of these details in the references for this chapter [^23] [^69] [^86].
|
||||
|
||||
For databases that use a consensus algorithm for replication, not only do writes need to be turned
|
||||
into log entries and replicated to a quorum. If you want to guarantee linearizable reads, they also
|
||||
|
|
@ -1366,7 +1374,7 @@ configuration. Consensus algorithms have been extended with *reconfiguration* fe
|
|||
this possible. This is especially useful when adding new regions to a system, or when migrating from
|
||||
one location to another (by first adding the new nodes, and then removing the old nodes).
|
||||
|
||||
### Pros and cons of consensus
|
||||
#### Pros and cons of consensus
|
||||
|
||||
Although they are complex and subtle, consensus algorithms are a huge breakthrough for distributed
|
||||
systems. Consensus is essentially “single-leader replication done right”, with automatic failover on
|
||||
|
|
@ -1406,7 +1414,9 @@ only real alternative is to use a weaker consistency model instead, such as thos
|
|||
leaderless or multi-leader replication as discussed in [Chapter 6](/en/ch6#ch_replication). These approaches
|
||||
generally don’t offer linearizability, but for applications that don’t need it that is fine.
|
||||
|
||||
## Coordination Services
|
||||
|
||||
|
||||
### Coordination Services
|
||||
|
||||
Consensus algorithms are useful in any distributed database that wants to offer linearizable
|
||||
operations, and many modern distributed databases use consensus algorithms for replication. But one
|
||||
|
|
@ -1453,7 +1463,9 @@ Failure detection and change notifications do not require consensus, but they ar
|
|||
distributed coordination alongside the atomic operations and fencing support that do require
|
||||
consensus.
|
||||
|
||||
# Managing configuration with coordination services
|
||||
--------
|
||||
|
||||
> [!TIP] Managing configuration with coordination services
|
||||
|
||||
Applications and infrastructure often have configuration parameters such as timeouts, thread pool
|
||||
sizes, and so on. Coordination services are sometimes used to store such configuration data,
|
||||
|
|
@ -1466,7 +1478,9 @@ convenient to use a coordination service and rely on its notification feature if
|
|||
running the coordination service anyway. Alternatively, a process could periodically poll for
|
||||
configuration updates from a file or URL, which avoids the need for a specialized service.
|
||||
|
||||
### Allocating work to nodes
|
||||
--------
|
||||
|
||||
#### Allocating work to nodes
|
||||
|
||||
A coordination service is useful if you have several instances of a process or service, and one
|
||||
of them needs to be chosen as leader or primary. If the leader fails, one of the other nodes should
|
||||
|
|
@ -1499,7 +1513,7 @@ intended for storing data that may change thousands of times per second. For tha
|
|||
use a conventional database; alternatively, tools like Apache BookKeeper [^90] [^91]
|
||||
can be used to replicate fast-changing internal state of a service.
|
||||
|
||||
### Service discovery
|
||||
#### Service discovery
|
||||
|
||||
ZooKeeper, etcd, and Consul are also often used for *service discovery*—that is, to find out which
|
||||
IP address you need to connect to in order to reach a particular service (see
|
||||
|
|
|
|||
|
|
@ -41,7 +41,9 @@ concepts such as *eventual consistency* still cause confusion. In [“Problems w
|
|||
get more precise about eventual consistency and discuss things like the *read-your-writes* and
|
||||
*monotonic reads* guarantees.
|
||||
|
||||
# Backups and replication
|
||||
--------
|
||||
|
||||
> [!TIP] BACKUPS AND REPLICATION
|
||||
|
||||
You might be wondering whether you still need backups if you have replication. The answer is yes,
|
||||
because they have different purposes: replicas quickly reflect writes from one node on other nodes,
|
||||
|
|
@ -59,15 +61,16 @@ the current state. If you have a large amount of data, it can be cheaper to keep
|
|||
data in an object store that is optimized for infrequently-accessed data, and to store only the
|
||||
current state of the database in primary storage.
|
||||
|
||||
# Single-Leader Replication
|
||||
--------
|
||||
|
||||
## Single-Leader Replication
|
||||
|
||||
Each node that stores a copy of the database is called a *replica*. With multiple replicas, a
|
||||
question inevitably arises: how do we ensure that all the data ends up on all the replicas?
|
||||
|
||||
Every write to the database needs to be processed by every replica; otherwise, the replicas would no
|
||||
longer contain the same data. The most common solution is called *leader-based replication*,
|
||||
*primary-backup*, or *active/passive*. It works as follows (see
|
||||
[Figure 6-1](/en/ch6#fig_replication_leader_follower)):
|
||||
*primary-backup*, or *active/passive*. It works as follows (see [Figure 6-1](/en/ch6#fig_replication_leader_follower)):
|
||||
|
||||
1. One of the replicas is designated the *leader* (also known as *primary* or *source* [^2]).
|
||||
When clients want to write to the database, they must send their requests to the leader, which
|
||||
|
|
@ -96,11 +99,15 @@ Many consensus algorithms such as Raft, which is used for replication in Cockroa
|
|||
etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and automatically
|
||||
elect a new leader if the old one fails (we will discuss consensus in more detail in [Chapter 10](/en/ch10#ch_consistency)).
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> In older documents you may see the term *master–slave replication*. It means the same as
|
||||
> leader-based replication, but the term should be avoided as it is widely considered offensive [^8].
|
||||
|
||||
## Synchronous Versus Asynchronous Replication
|
||||
--------
|
||||
|
||||
### Synchronous Versus Asynchronous Replication
|
||||
|
||||
An important detail of a replicated system is whether the replication happens *synchronously* or
|
||||
*asynchronously*. (In relational databases, this is often a configurable option; other systems are
|
||||
|
|
@ -158,7 +165,7 @@ Weakening durability may sound like a bad trade-off, but asynchronous replicatio
|
|||
widely used, especially if there are many followers or if they are geographically distributed [^9].
|
||||
We will return to this issue in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag).
|
||||
|
||||
## Setting Up New Followers
|
||||
### Setting Up New Followers
|
||||
|
||||
From time to time, you need to set up new followers—perhaps to increase the number of replicas,
|
||||
or to replace failed nodes. How do you ensure that the new follower has an accurate copy of the
|
||||
|
|
@ -195,7 +202,9 @@ recovery. You can also perform steps 1 and 2 of setting up a new follower by dow
|
|||
from the object store. For example, WAL-G does this for PostgreSQL, MySQL, and SQL Server, and
|
||||
Litestream does the equivalent for SQLite.
|
||||
|
||||
# Databases backed by object storage
|
||||
--------
|
||||
|
||||
> [!TIP] DATABASES BACKED BY OBJECT STORAGE
|
||||
|
||||
Object storage can be used for more than archiving data. Many databases are beginning to use object
|
||||
stores such as Amazon Web Services S3, Google Cloud Storage, and Azure Blob Storage to serve data
|
||||
|
|
@ -228,9 +237,7 @@ Different systems deal with these trade-offs in various ways. Some introduce a *
|
|||
architecture that places less frequently accessed data on object storage while new or frequently
|
||||
accessed data is kept on faster storage devices such as SSDs, NVMe, or even in memory. Other systems
|
||||
use object storage as their primary storage tier, but use a separate low-latency storage system such
|
||||
as Amazon’s EBS or Neon’s Safekeepers
|
||||
[^12])
|
||||
to store their WAL. Recently, some systems have gone even farther by adopting a
|
||||
as Amazon’s EBS or Neon’s Safekeepers [^12]) to store their WAL. Recently, some systems have gone even farther by adopting a
|
||||
*zero-disk architecture* (ZDA). ZDA-based systems persist all data to object storage and use disks
|
||||
and memory strictly for caching. This allows nodes to have no persistent state, which dramatically
|
||||
simplifies operations. WarpStream, Confluent Freight, Buf’s Bufstream, and Redpanda Serverless are
|
||||
|
|
@ -238,7 +245,9 @@ all Kafka-compatible systems built using a zero-disk architecture. Nearly every
|
|||
warehouse also adopts such an architecture, as does Turbopuffer (a vector search engine), and
|
||||
SlateDB (a cloud-native LSM storage engine).
|
||||
|
||||
## Handling Node Outages
|
||||
--------
|
||||
|
||||
### Handling Node Outages
|
||||
|
||||
Any node in the system can go down, perhaps unexpectedly due to a fault, but just as likely due to
|
||||
planned maintenance (for example, rebooting a machine to install a kernel security patch). Being
|
||||
|
|
@ -248,7 +257,7 @@ the impact of a node outage as small as possible.
|
|||
|
||||
How do you achieve high availability with leader-based replication?
|
||||
|
||||
### Follower failure: Catch-up recovery
|
||||
#### Follower failure: Catch-up recovery
|
||||
|
||||
On its local disk, each follower keeps a log of the data changes it has received from the leader. If
|
||||
a follower crashes and is restarted, or if the network between the leader and the follower is
|
||||
|
|
@ -261,8 +270,7 @@ receiving a stream of data changes as before.
|
|||
Although follower recovery is conceptually simple, it can be challenging in terms of performance: if
|
||||
the database has a high write throughput or if the follower has been offline for a long time, there
|
||||
might be a lot of writes to catch up on. There will be high load on both the recovering follower and
|
||||
the leader (which needs to send the backlog of writes to the follower) while this catch-up is
|
||||
ongoing.
|
||||
the leader (which needs to send the backlog of writes to the follower) while this catch-up is ongoing.
|
||||
|
||||
The leader can delete its log of writes once all followers have confirmed that they have processed
|
||||
it, but if a follower is unavailable for a long time, the leader faces a choice: either it retains
|
||||
|
|
@ -271,7 +279,7 @@ leader), or it deletes the log that the unavailable follower has not yet acknowl
|
|||
the follower won’t be able to recover from the log, and will have to be restored from a backup when
|
||||
it comes back).
|
||||
|
||||
### Leader failure: Failover
|
||||
#### Leader failure: Failover
|
||||
|
||||
Handling a failure of the leader is trickier: one of the followers needs to be promoted to be the
|
||||
new leader, clients need to be reconfigured to send their writes to the new leader, and the other
|
||||
|
|
@ -331,11 +339,15 @@ Failover is fraught with things that can go wrong:
|
|||
is already struggling with high load or network problems, an unnecessary failover is likely to
|
||||
make the situation worse, not better.
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> Guarding against split brain by limiting or shutting down old leaders is known as *fencing* or, more
|
||||
> emphatically, *Shoot The Other Node In The Head* (STONITH). We will discuss fencing in more detail
|
||||
> in [“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing).
|
||||
|
||||
--------
|
||||
|
||||
There are no easy solutions to these problems. For this reason, some operations teams prefer to
|
||||
perform failovers manually, even if the software supports automatic failover.
|
||||
|
||||
|
|
@ -350,12 +362,12 @@ These issues—node failures; unreliable networks; and trade-offs around replica
|
|||
durability, availability, and latency—are in fact fundamental problems in distributed systems.
|
||||
In [Chapter 9](/en/ch9#ch_distributed) and [Chapter 10](/en/ch10#ch_consistency) we will discuss them in greater depth.
|
||||
|
||||
## Implementation of Replication Logs
|
||||
### Implementation of Replication Logs
|
||||
|
||||
How does leader-based replication work under the hood? Several different replication methods are
|
||||
used in practice, so let’s look at each one briefly.
|
||||
|
||||
### Statement-based replication
|
||||
#### Statement-based replication
|
||||
|
||||
In the simplest case, the leader logs every write request (*statement*) that it executes and sends
|
||||
that statement log to its followers. For a relational database, this means that every `INSERT`,
|
||||
|
|
@ -389,7 +401,7 @@ there is any nondeterminism in a statement. VoltDB uses statement-based replicat
|
|||
safe by requiring transactions to be deterministic [^16]. However, determinism can be hard to guarantee
|
||||
in practice, so many databases prefer other replication methods.
|
||||
|
||||
### Write-ahead log (WAL) shipping
|
||||
#### Write-ahead log (WAL) shipping
|
||||
|
||||
In [Chapter 4](/en/ch4#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
|
||||
every modification is first written to the WAL so that the tree can be restored to a consistent
|
||||
|
|
@ -412,7 +424,7 @@ performing a failover to make one of the upgraded nodes the new leader. If the r
|
|||
does not allow this version mismatch, as is often the case with WAL shipping, such upgrades require
|
||||
downtime.
|
||||
|
||||
### Logical (row-based) log replication
|
||||
#### Logical (row-based) log replication
|
||||
|
||||
An alternative is to use different log formats for replication and for the storage engine, which
|
||||
allows the replication log to be decoupled from the storage engine internals. This kind of
|
||||
|
|
@ -444,12 +456,12 @@ to send the contents of a database to an external system, such as a data warehou
|
|||
analysis, or for building custom indexes and caches [^21].
|
||||
This technique is called *change data capture*, and we will return to it in [Link to Come].
|
||||
|
||||
# Problems with Replication Lag
|
||||
|
||||
## Problems with Replication Lag
|
||||
|
||||
Being able to tolerate node failures is just one reason for wanting replication. As mentioned
|
||||
in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), other reasons are scalability (processing more
|
||||
requests than a single machine can handle) and latency (placing replicas geographically closer to
|
||||
users).
|
||||
requests than a single machine can handle) and latency (placing replicas geographically closer to users).
|
||||
|
||||
Leader-based replication requires all writes to go through a single node, but read-only queries can
|
||||
go to any replica. For workloads that consist of mostly reads and only a small percentage of writes
|
||||
|
|
@ -471,14 +483,14 @@ just a temporary state—if you stop writing to the database and wait a while, t
|
|||
eventually catch up and become consistent with the leader. For that reason, this effect is known
|
||||
as *eventual consistency* [^22].
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> The term *eventual consistency* was coined by Douglas Terry et al.
|
||||
> [^23],
|
||||
> popularized by Werner Vogels
|
||||
> [^24],
|
||||
> The term *eventual consistency* was coined by Douglas Terry et al. [^23], popularized by Werner Vogels [^24],
|
||||
> and became the battle cry of many NoSQL projects. However, not only NoSQL databases are eventually
|
||||
> consistent: followers in an asynchronously replicated relational database have the same
|
||||
> characteristics.
|
||||
> consistent: followers in an asynchronously replicated relational database have the same characteristics.
|
||||
|
||||
--------
|
||||
|
||||
The term “eventually” is deliberately vague: in general, there is no limit to how far a replica can
|
||||
fall behind. In normal operation, the delay between a write happening on the leader and being
|
||||
|
|
@ -490,7 +502,7 @@ When the lag is so large, the inconsistencies it introduces are not just a theor
|
|||
real problem for applications. In this section we will highlight three examples of problems that are
|
||||
likely to occur when there is replication lag. We’ll also outline some approaches to solving them.
|
||||
|
||||
## Reading Your Own Writes
|
||||
### Reading Your Own Writes
|
||||
|
||||
Many applications let the user submit some data and then view what they have submitted. This might
|
||||
be a record in a customer database, or a comment on a discussion thread, or something else of that sort.
|
||||
|
|
@ -505,8 +517,7 @@ submitted was lost, so they will be understandably unhappy.
|
|||
|
||||
{{< figure src="/fig/ddia_0603.png" id="fig_replication_read_your_writes" caption="Figure 6-3. A user makes a write, followed by a read from a stale replica. To prevent this anomaly, we need read-after-write consistency." class="w-full my-4" >}}
|
||||
|
||||
In this situation, we need *read-after-write consistency*, also known as *read-your-writes consistency*
|
||||
[^23].
|
||||
In this situation, we need *read-after-write consistency*, also known as *read-your-writes consistency* [^23].
|
||||
This is a guarantee that if the user reloads the page, they will always see any updates they
|
||||
submitted themselves. It makes no promises about other users: other users’ updates may not be
|
||||
visible until some later time. However, it reassures the user that their own input has been saved
|
||||
|
|
@ -526,15 +537,13 @@ are various possible techniques. To mention a few:
|
|||
effective, as most things would have to be read from the leader (negating the benefit of read
|
||||
scaling). In that case, other criteria may be used to decide whether to read from the leader. For
|
||||
example, you could track the time of the last update and, for one minute after the last update, make all
|
||||
reads from the leader
|
||||
[^25].
|
||||
reads from the leader [^25].
|
||||
You could also monitor the replication lag on followers and prevent queries on any follower that
|
||||
is more than one minute behind the leader.
|
||||
* The client can remember the timestamp of its most recent write—then the system can ensure that the
|
||||
replica serving any reads for that user reflects updates at least until that timestamp. If a
|
||||
replica is not sufficiently up to date, either the read can be handled by another replica or the
|
||||
query can wait until the replica has caught up
|
||||
[^26].
|
||||
query can wait until the replica has caught up [^26].
|
||||
The timestamp could be a *logical timestamp* (something that indicates ordering of writes, such as
|
||||
the log sequence number) or the actual system clock (in which case clock synchronization becomes
|
||||
critical; see [“Unreliable Clocks”](/en/ch9#sec_distributed_clocks)).
|
||||
|
|
@ -558,7 +567,9 @@ In this case, there are some additional issues to consider:
|
|||
the devices’ network routes may be completely different.) If your approach requires reading from the
|
||||
leader, you may first need to route requests from all of a user’s devices to the same region.
|
||||
|
||||
# Regions and Availability Zones
|
||||
--------
|
||||
|
||||
> ![TIP] Regions and Availability Zones
|
||||
|
||||
We use the term *region* to refer to one or more datacenters in a single geographic location. Cloud
|
||||
providers locate multiple datacenters in the same geographic region. Each datacenter is referred to
|
||||
|
|
@ -576,7 +587,9 @@ increased cloud networking bills. We will discuss these tradeoffs more in
|
|||
[“Multi-leader replication topologies”](/en/ch6#sec_replication_topologies). For now, just know that when we say region, we mean a collection of
|
||||
zones/datacenters in a single geographic location.
|
||||
|
||||
## Monotonic Reads
|
||||
--------
|
||||
|
||||
### Monotonic Reads
|
||||
|
||||
Our second example of an anomaly that can occur when reading from asynchronous followers is that it’s
|
||||
possible for a user to see things *moving backward in time*.
|
||||
|
|
@ -605,7 +618,7 @@ the same replica (different users can read from different replicas). For example
|
|||
chosen based on a hash of the user ID, rather than randomly. However, if that replica fails, the
|
||||
user’s queries will need to be rerouted to another replica.
|
||||
|
||||
## Consistent Prefix Reads
|
||||
### Consistent Prefix Reads
|
||||
|
||||
Our third example of replication lag anomalies concerns violation of causality. Imagine the
|
||||
following short dialog between Mr. Poons and Mrs. Cake:
|
||||
|
|
@ -630,15 +643,13 @@ Mr. Poons
|
|||
: How far into the future can you see, Mrs. Cake?
|
||||
|
||||
To the observer it looks as though Mrs. Cake is answering the question before Mr. Poons has even asked
|
||||
it. Such psychic powers are impressive, but very confusing
|
||||
[^27].
|
||||
it. Such psychic powers are impressive, but very confusing [^27].
|
||||
|
||||
{{< figure src="/fig/ddia_0605.png" id="fig_replication_consistent_prefix" caption="Figure 6-5. If some shards are replicated slower than others, an observer may see the answer before they see the question." class="w-full my-4" >}}
|
||||
|
||||
Preventing this kind of anomaly requires another type of guarantee: *consistent prefix reads*
|
||||
[^22]. This guarantee says that if a sequence of
|
||||
writes happens in a certain order, then anyone reading those writes will see them appear in the same
|
||||
order.
|
||||
Preventing this kind of anomaly requires another type of guarantee: *consistent prefix reads* [^22].
|
||||
This guarantee says that if a sequence of writes happens in a certain order,
|
||||
then anyone reading those writes will see them appear in the same order.
|
||||
|
||||
This is a particular problem in sharded (partitioned) databases, which we will discuss in
|
||||
[Chapter 7](/en/ch7#ch_sharding). If the database always applies writes in the same order, reads always see a
|
||||
|
|
@ -651,7 +662,7 @@ the same shard—but in some applications that cannot be done efficiently. There
|
|||
that explicitly keep track of causal dependencies, a topic that we will return to in
|
||||
[“The “happens-before” relation and concurrency”](/en/ch6#sec_replication_happens_before).
|
||||
|
||||
## Solutions for Replication Lag
|
||||
### Solutions for Replication Lag
|
||||
|
||||
When working with an eventually consistent system, it is worth thinking about how the application
|
||||
behaves if the replication lag increases to several minutes or even hours. If the answer is “no
|
||||
|
|
@ -683,7 +694,9 @@ consistency guarantees: they can offer stronger resilience in the face of networ
|
|||
have lower overheads compared to transactional systems. We will explore such approaches in the rest
|
||||
of this chapter.
|
||||
|
||||
# Multi-Leader Replication
|
||||
|
||||
|
||||
## Multi-Leader Replication
|
||||
|
||||
So far in this chapter we have only considered replication architectures using a single leader.
|
||||
Although that is a common approach, there are interesting alternatives.
|
||||
|
|
@ -710,7 +723,7 @@ as equivalent to single-leader replication. The rest of this section focusses on
|
|||
multi-leader replication, in which any leader can process writes even when its connection to the
|
||||
other leaders is interrupted.
|
||||
|
||||
## Geographically Distributed Operation
|
||||
### Geographically Distributed Operation
|
||||
|
||||
It rarely makes sense to use a multi-leader setup within a single region, because the benefits
|
||||
rarely outweigh the added complexity. However, there are some situations in which this configuration
|
||||
|
|
@ -730,8 +743,7 @@ other regions.
|
|||
|
||||
{{< figure src="/fig/ddia_0606.png" id="fig_replication_multi_dc" caption="Figure 6-6. Multi-leader replication across multiple regions." class="w-full my-4" >}}
|
||||
|
||||
Let’s compare how the single-leader and multi-leader configurations fare in a multi-region
|
||||
deployment:
|
||||
Let’s compare how the single-leader and multi-leader configurations fare in a multi-region deployment:
|
||||
|
||||
Performance
|
||||
: In a single-leader configuration, every write must go over the internet to the region with the
|
||||
|
|
@ -756,8 +768,7 @@ Tolerance of network problems
|
|||
over that link and wait for the response before it can complete.
|
||||
|
||||
A multi-leader configuration with asynchronous replication can tolerate network problems better:
|
||||
during a temporary network interruption, each region’s leader can continue independently processing
|
||||
writes.
|
||||
during a temporary network interruption, each region’s leader can continue independently processing writes.
|
||||
|
||||
Consistency
|
||||
: A single-leader system can provide strong consistency guarantees, such as serializable
|
||||
|
|
@ -768,25 +779,21 @@ Consistency
|
|||
account, registering a particular username), but which violate the constraint when taken together
|
||||
with another write on another leader.
|
||||
|
||||
This is simply a fundamental limitation of distributed systems
|
||||
[^28].
|
||||
This is simply a fundamental limitation of distributed systems [^28].
|
||||
If you need to enforce such constraints, you’re therefore better off with a single-leader system.
|
||||
However, as we will see in [“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts), multi-leader systems can still
|
||||
achieve consistency properties that are useful in a wide range of apps that don’t need such
|
||||
constraints.
|
||||
achieve consistency properties that are useful in a wide range of apps that don’t need such constraints.
|
||||
|
||||
Multi-leader replication is less common than single-leader replication, but it is still supported by
|
||||
many databases, including MySQL, Oracle, SQL Server, and YugabyteDB. In some cases it is an external
|
||||
add-on feature, for example in Redis Enterprise, EDB Postgres Distributed, and pglogical
|
||||
[^29].
|
||||
add-on feature, for example in Redis Enterprise, EDB Postgres Distributed, and pglogical [^29].
|
||||
|
||||
As multi-leader replication is a somewhat retrofitted feature in many databases, there are often
|
||||
subtle configuration pitfalls and surprising interactions with other database features. For example,
|
||||
autoincrementing keys, triggers, and integrity constraints can be problematic. For this reason,
|
||||
multi-leader replication is often considered dangerous territory that should be avoided if possible
|
||||
[^30].
|
||||
multi-leader replication is often considered dangerous territory that should be avoided if possible [^30].
|
||||
|
||||
### Multi-leader replication topologies
|
||||
#### Multi-leader replication topologies
|
||||
|
||||
A *replication topology* describes the communication paths along which writes are propagated from
|
||||
one node to another. If you have two leaders, like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), there is
|
||||
|
|
@ -796,27 +803,30 @@ more than two leaders, various different topologies are possible. Some examples
|
|||
|
||||
{{< figure src="/fig/ddia_0607.png" id="fig_replication_topologies" caption="Figure 6-7. Three example topologies in which multi-leader replication can be set up." class="w-full my-4" >}}
|
||||
|
||||
The most general topology is *all-to-all*, shown in
|
||||
[Figure 6-7](/en/ch6#fig_replication_topologies)(c),
|
||||
The most general topology is *all-to-all*, shown in [Figure 6-7](/en/ch6#fig_replication_topologies)(c),
|
||||
in which every leader sends its writes to every other leader. However, more restricted topologies
|
||||
are also used: for example a *circular topology* in which each node receives writes from one node
|
||||
and forwards those writes (plus any writes of its own) to one other node. Another popular topology
|
||||
has the shape of a *star*: one designated root node forwards writes to all of the other nodes. The
|
||||
star topology can be generalized to a tree.
|
||||
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> Don’t confuse a star-shaped network topology with a *star schema* (see
|
||||
> [“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics)), which describes the structure of a data model.
|
||||
|
||||
--------
|
||||
|
||||
In circular and star topologies, a write may need to pass through several nodes before it reaches
|
||||
all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To
|
||||
prevent infinite replication loops, each node is given a unique identifier, and in the replication
|
||||
log, each write is tagged with the identifiers of all the nodes it has passed through
|
||||
[^31].
|
||||
log, each write is tagged with the identifiers of all the nodes it has passed through [^31].
|
||||
When a node receives a data change that is tagged with its own identifier, that data change is
|
||||
ignored, because the node knows that it has already been processed.
|
||||
|
||||
### Problems with different topologies
|
||||
#### Problems with different topologies
|
||||
|
||||
A problem with circular and star topologies is that if just one node fails, it can interrupt the
|
||||
flow of replication messages between other nodes, leaving them unable to communicate until the
|
||||
|
|
@ -850,7 +860,7 @@ issues like the one in [Figure 6-8](/en/ch6#fig_replication_causality). If you
|
|||
is worth being aware of these issues, carefully reading the documentation, and thoroughly testing
|
||||
your database to ensure that it really does provide the guarantees you believe it to have.
|
||||
|
||||
## Sync Engines and Local-First Software
|
||||
### Sync Engines and Local-First Software
|
||||
|
||||
Another situation in which multi-leader replication is appropriate is if you have an application
|
||||
that needs to continue to work while it is disconnected from the internet.
|
||||
|
|
@ -870,7 +880,7 @@ From an architectural point of view, this setup is very similar to multi-leader
|
|||
regions, taken to the extreme: each device is a “region,” and the network connection between them is
|
||||
extremely unreliable.
|
||||
|
||||
### Real-time collaboration, offline-first, and local-first apps
|
||||
#### Real-time collaboration, offline-first, and local-first apps
|
||||
|
||||
Moreover, many modern web apps offer *real-time collaboration* features, such as Google Docs and
|
||||
Sheets for text documents and spreadsheets, Figma for graphics, and Linear for project management.
|
||||
|
|
@ -904,7 +914,7 @@ service providers are available [^40].
|
|||
For example, Git is a local-first collaboration system (albeit one that doesn’t support real-time
|
||||
collaboration) since you can sync via GitHub, GitLab, or any other repository hosting service.
|
||||
|
||||
### Pros and cons of sync engines
|
||||
#### Pros and cons of sync engines
|
||||
|
||||
The dominant way of building web apps today is to keep very little persistent state on the client,
|
||||
and to rely on making requests to a server whenever a new piece of data needs to be displayed or
|
||||
|
|
@ -947,7 +957,8 @@ development jargon the equivalent of a sync engine is called *netcode*. The tech
|
|||
netcode are quite specific to the requirements of games [^44], and don’t directly
|
||||
carry over to other types of software, so we won’t consider them further in this book.
|
||||
|
||||
## Dealing with Conflicting Writes
|
||||
|
||||
### Dealing with Conflicting Writes
|
||||
|
||||
The biggest problem with multi-leader replication—both in a geo-distributed server-side database and
|
||||
a local-first sync engine on end user devices—is that concurrent writes on different leaders can
|
||||
|
|
@ -972,7 +983,7 @@ In [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent) we wi
|
|||
whether two writes are concurrent. For now we will assume that we can detect conflicts, and we want
|
||||
to figure out the best way of resolving them.
|
||||
|
||||
### Conflict avoidance
|
||||
#### Conflict avoidance
|
||||
|
||||
One strategy for conflicts is to avoid them occurring in the first place. For example, if the
|
||||
application can ensure that all writes for a particular record go through the same leader, then
|
||||
|
|
@ -999,7 +1010,8 @@ so that one leader only generates odd numbers and the other only generates even
|
|||
you can be sure that the two leaders won’t concurrently assign the same ID to different records.
|
||||
We will discuss other ID assignment schemes in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical).
|
||||
|
||||
### Last write wins (discarding concurrent writes)
|
||||
|
||||
#### Last write wins (discarding concurrent writes)
|
||||
|
||||
If conflicts can’t be avoided, the simplest way of resolving them is to attach a timestamp to each
|
||||
write, and to always use the value with the greatest timestamp. For example, in
|
||||
|
|
@ -1031,7 +1043,7 @@ that is ahead of the others, and you try to overwrite a value written by that no
|
|||
be ignored as it may have a lower timestamp, even though it clearly occurred later. This problem can
|
||||
be solved by using a *logical clock*, which we will discuss in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical).
|
||||
|
||||
### Manual conflict resolution
|
||||
#### Manual conflict resolution
|
||||
|
||||
If randomly discarding some of your writes is not desirable, the next option is to resolve the
|
||||
conflict manually. You may be familiar with manual conflict resolution from Git and other version
|
||||
|
|
@ -1063,9 +1075,7 @@ suffers from a number of problems:
|
|||
keeping all the shopping cart items that appeared in any of the siblings (i.e., taking the set
|
||||
union of the carts). This meant that if the customer had removed an item from their cart in one
|
||||
sibling, but another sibling still contained that old item, the removed item would unexpectedly
|
||||
reappear in the customer’s cart
|
||||
[^45].
|
||||
[Figure 6-10](/en/ch6#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
|
||||
reappear in the customer’s cart [^45]. [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
|
||||
cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear.
|
||||
* If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution
|
||||
process can itself introduce a new conflict. Those resolutions could even be inconsistent: for
|
||||
|
|
@ -1076,7 +1086,7 @@ suffers from a number of problems:
|
|||
{{< figure src="/fig/ddia_0610.png" id="fig_replication_amazon_anomaly" caption="Figure 6-10. Example of Amazon's shopping cart anomaly: if conflicts on a shopping cart are merged by taking the union, deleted items may reappear." class="w-full my-4" >}}
|
||||
|
||||
|
||||
### Automatic conflict resolution
|
||||
#### Automatic conflict resolution
|
||||
|
||||
For many applications, the best way of handling conflicts is to use an algorithm that automatically
|
||||
merges concurrent writes into a consistent state. Automatic conflict resolution ensures that all
|
||||
|
|
@ -1110,23 +1120,19 @@ Nevertheless, automatic conflict resolution is sufficient to build many useful a
|
|||
start from the requirement of wanting to build a collaborative offline-first or local-first app,
|
||||
then conflict resolution is inevitable, and automating it is often the best approach.
|
||||
|
||||
## CRDTs and Operational Transformation
|
||||
### CRDTs and Operational Transformation
|
||||
|
||||
Two families of algorithms are commonly used to implement automatic conflict resolution:
|
||||
*Conflict-free replicated datatypes* (CRDTs)
|
||||
[^46] and *Operational Transformation* (OT)
|
||||
[^47].
|
||||
*Conflict-free replicated datatypes* (CRDTs) [^46] and *Operational Transformation* (OT) [^47].
|
||||
They have different design philosophies and performance characteristics, but both are able to
|
||||
perform automatic merges for all the aforementioned types of data.
|
||||
|
||||
[Figure 6-11](/en/ch6#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
|
||||
text. Assume you have two replicas that both start off with the text “ice”. One replica prepends the
|
||||
letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make
|
||||
“ice!”.
|
||||
letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make “ice!”.
|
||||
|
||||
{{< figure src="/fig/ddia_0611.png" id="fig_replication_ot_crdt" caption="Figure 6-11. How two concurrent insertions into a string are merged by OT and a CRDT respectively." class="w-full my-4" >}}
|
||||
|
||||
|
||||
The merged result “nice!” is achieved differently by both types of algorithms:
|
||||
|
||||
OT
|
||||
|
|
@ -1155,7 +1161,7 @@ OT is most often used for real-time collaborative editing of text, e.g. in Googl
|
|||
distributed databases such as Redis Enterprise, Riak, and Azure Cosmos DB [^49].
|
||||
Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge or Yjs) and with OT (e.g., ShareDB).
|
||||
|
||||
### What is a conflict?
|
||||
#### What is a conflict?
|
||||
|
||||
Some kinds of conflict are obvious. In the example in [Figure 6-9](/en/ch6#fig_replication_write_conflict), two writes
|
||||
concurrently modified the same field in the same record, setting it to two different values. There
|
||||
|
|
@ -1174,7 +1180,8 @@ good understanding of this problem. We will see some more examples of conflicts
|
|||
[Chapter 8](/en/ch8#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and
|
||||
resolving conflicts in a replicated system.
|
||||
|
||||
# Leaderless Replication
|
||||
|
||||
## Leaderless Replication
|
||||
|
||||
The replication approaches we have discussed so far in this chapter—single-leader and
|
||||
multi-leader replication—are based on the idea that a client sends a write request to one node
|
||||
|
|
@ -1189,17 +1196,21 @@ a fashionable architecture for databases after Amazon used it for its in-house *
|
|||
2007 [^45]. Riak, Cassandra, and ScyllaDB are open source datastores with leaderless replication models inspired
|
||||
by Dynamo, so this kind of database is also known as *Dynamo-style*.
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> The original *Dynamo* system was only described in a paper [^45], but never released outside of Amazon.
|
||||
> The similarly-named *DynamoDB* is a more recent cloud database from AWS, but it has a completely different architecture:
|
||||
> it uses single-leader replication based on the Multi-Paxos consensus algorithm [^5].
|
||||
|
||||
--------
|
||||
|
||||
In some leaderless implementations, the client directly sends its writes to several replicas, while
|
||||
in others, a coordinator node does this on behalf of the client. However, unlike a leader database,
|
||||
that coordinator does not enforce a particular ordering of writes. As we shall see, this difference in design has
|
||||
profound consequences for the way the database is used.
|
||||
|
||||
## Writing to the Database When a Node Is Down
|
||||
### Writing to the Database When a Node Is Down
|
||||
|
||||
Imagine you have a database with three replicas, and one of the replicas is currently
|
||||
unavailable—perhaps it is being rebooted to install a system update. In a single-leader
|
||||
|
|
@ -1231,7 +1242,7 @@ needs to be tagged with a version number or timestamp, similarly to what we saw
|
|||
one with the greatest timestamp (even if that value was only returned by one replica, and several
|
||||
other replicas returned older values). See [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent) for more details.
|
||||
|
||||
### Catching up on missed writes
|
||||
#### Catching up on missed writes
|
||||
|
||||
The replication system should ensure that eventually all the data is copied to every replica. After
|
||||
an unavailable node comes back online, how does it catch up on the writes that it missed? Several
|
||||
|
|
@ -1257,7 +1268,7 @@ Anti-entropy
|
|||
replication log in leader-based replication, this *anti-entropy process* does not copy writes in
|
||||
any particular order, and there may be a significant delay before data is copied.
|
||||
|
||||
### Quorums for reading and writing
|
||||
#### Quorums for reading and writing
|
||||
|
||||
In the example of [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), we considered the write to be successful
|
||||
even though it was only processed on two out of three replicas. What if only one out of three
|
||||
|
|
@ -1270,12 +1281,10 @@ respond, reads can nevertheless continue returning an up-to-date value.
|
|||
|
||||
More generally, if there are *n* replicas, every write must be confirmed by *w* nodes to be
|
||||
considered successful, and we must query at least *r* nodes for each read. (In our example,
|
||||
*n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* >
|
||||
*n*, we expect to get an up-to-date value when reading, because at least one of the *r* nodes we’re
|
||||
reading from must be up to date. Reads and writes that obey these *r* and *w* values are called
|
||||
*quorum* reads and writes [^50].
|
||||
You can think of *r* and *w* as the minimum number of votes required for the read or write to be
|
||||
valid.
|
||||
*n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* > *n*,
|
||||
we expect to get an up-to-date value when reading, because at least one of the *r* nodes we’re
|
||||
reading from must be up to date. Reads and writes that obey these *r* and *w* values are called *quorum* reads and writes [^50].
|
||||
You can think of *r* and *w* as the minimum number of votes required for the read or write to be valid.
|
||||
|
||||
In Dynamo-style databases, the parameters *n*, *w*, and *r* are typically configurable. A common
|
||||
choice is to make *n* an odd number (typically 3 or 5) and to set *w* = *r* =
|
||||
|
|
@ -1284,11 +1293,15 @@ For example, a workload with few writes and many reads may benefit from setting
|
|||
*r* = 1. This makes reads faster, but has the disadvantage that just one failed node causes all
|
||||
database writes to fail.
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> There may be more than *n* nodes in the cluster, but any given value is stored only on *n*
|
||||
> nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit
|
||||
> on one node. We will return to sharding in [Chapter 7](/en/ch7#ch_sharding).
|
||||
|
||||
--------
|
||||
|
||||
The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes
|
||||
as follows:
|
||||
|
||||
|
|
@ -1299,8 +1312,8 @@ as follows:
|
|||
* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes.
|
||||
This case is illustrated in [Figure 6-13](/en/ch6#fig_replication_quorum_overlap).
|
||||
|
||||
Normally, reads and writes are always sent to all *n* replicas in parallel. The parameters *w* and
|
||||
*r* determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success
|
||||
Normally, reads and writes are always sent to all *n* replicas in parallel. The parameters *w* and *r*
|
||||
determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success
|
||||
before we consider the read or write to be successful.
|
||||
|
||||
{{< figure src="/fig/ddia_0613.png" id="fig_replication_quorum_overlap" caption="Figure 6-13. If *w* + *r* > *n*, at least one of the *r* replicas you read from must have seen the most recent successful write." class="w-full my-4" >}}
|
||||
|
|
@ -1312,7 +1325,7 @@ error executing the operation (can’t write because the disk is full), due to a
|
|||
between the client and the node, or for any number of other reasons. We only care whether the node
|
||||
returned a successful response and don’t need to distinguish between different kinds of fault.
|
||||
|
||||
## Limitations of Quorum Consistency
|
||||
### Limitations of Quorum Consistency
|
||||
|
||||
If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*, you can
|
||||
generally expect every read to return the most recent value written for a key. This is the case because the
|
||||
|
|
@ -1324,8 +1337,7 @@ Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, becau
|
|||
*w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are
|
||||
not necessarily majorities—it only matters that the sets of nodes used by the read and write
|
||||
operations overlap in at least one node. Other quorum assignments are possible, which allows some
|
||||
flexibility in the design of distributed algorithms
|
||||
[^51].
|
||||
flexibility in the design of distributed algorithms [^51].
|
||||
|
||||
You may also set *w* and *r* to smaller numbers, so that *w* + *r* ≤ *n* (i.e.,
|
||||
the quorum condition is not satisfied). In this case, reads and writes will still be sent to *n*
|
||||
|
|
@ -1369,7 +1381,7 @@ it is not so simple. Dynamo-style databases are generally optimized for use case
|
|||
eventual consistency. The parameters *w* and *r* allow you to adjust the probability of stale values
|
||||
being read [^53], but it’s wise to not take them as absolute guarantees.
|
||||
|
||||
### Monitoring staleness
|
||||
#### Monitoring staleness
|
||||
|
||||
From an operational perspective, it’s important to monitor whether your databases are
|
||||
returning up-to-date results. Even if your application can tolerate stale reads, you need to be
|
||||
|
|
@ -1388,7 +1400,8 @@ handoff can be one measure of system health, but it’s difficult to interpret u
|
|||
Eventual consistency is a deliberately vague guarantee, but for operability it’s important to be
|
||||
able to quantify “eventual.”
|
||||
|
||||
## Single-Leader vs. Leaderless Replication Performance
|
||||
|
||||
### Single-Leader vs. Leaderless Replication Performance
|
||||
|
||||
A replication system based on a single leader can provide strong consistency guarantees that are
|
||||
difficult or impossible to achieve in a leaderless system. However, as we have seen in
|
||||
|
|
@ -1427,8 +1440,7 @@ That said, leaderless systems can have performance problems as well:
|
|||
* Even though the system doesn’t need to perform failover, one replica does need to detect when
|
||||
another replica is unavailable so that it can store hints about writes that the unavailable
|
||||
replica missed. When the unavailable replica comes back, the handoff process needs to send it
|
||||
those hints. This puts additional load on the replicas at a time when the system is already under
|
||||
strain [^54].
|
||||
those hints. This puts additional load on the replicas at a time when the system is already under strain [^54].
|
||||
* The more replicas you have, the bigger the size of your quorums, and the more responses you have
|
||||
to wait for before a request can complete. Even if you wait only for the fastest *r* or *w*
|
||||
replicas to respond, and even if you make the requests in parallel, a bigger *r* or *w* increases
|
||||
|
|
@ -1448,7 +1460,7 @@ be co-located with the client. However, since a write on one leader is propagate
|
|||
the others, reads can be arbitrarily out-of-date. Quorum reads and writes provide a compromise: good
|
||||
fault tolerance while also having a high likelihood of reading up-to-date data.
|
||||
|
||||
### Multi-region operation
|
||||
#### Multi-region operation
|
||||
|
||||
We previously discussed cross-region replication as a use case for multi-leader replication (see
|
||||
[“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)). Leaderless replication is also suitable for
|
||||
|
|
@ -1467,7 +1479,8 @@ describes the number of replicas within one region. Cross-region replication bet
|
|||
database clusters happens asynchronously in the background, in a style that is similar to
|
||||
multi-leader replication.
|
||||
|
||||
## Detecting Concurrent Writes
|
||||
|
||||
### Detecting Concurrent Writes
|
||||
|
||||
Like with multi-leader replication, leaderless databases allow concurrent writes to the same key,
|
||||
resulting in conflicts that need to be resolved. Such conflicts may occur as the writes happen, but
|
||||
|
|
@ -1477,8 +1490,7 @@ The problem is that events may arrive in a different order at different nodes, d
|
|||
network delays and partial failures. For example, [Figure 6-14](/en/ch6#fig_replication_concurrency) shows two clients,
|
||||
A and B, simultaneously writing to a key *X* in a three-node datastore:
|
||||
|
||||
* Node 1 receives the write from A, but never receives the write from B due to a transient
|
||||
outage.
|
||||
* Node 1 receives the write from A, but never receives the write from B due to a transient outage.
|
||||
* Node 2 first receives the write from A, then the write from B.
|
||||
* Node 3 first receives the write from B, then the write from A.
|
||||
|
||||
|
|
@ -1501,7 +1513,7 @@ you whether two values are actually conflicting (i.e., they were written concurr
|
|||
were written one after another). If you want to resolve conflicts explicitly, the system needs to
|
||||
take more care to detect concurrent writes.
|
||||
|
||||
### The “happens-before” relation and concurrency
|
||||
#### The “happens-before” relation and concurrency
|
||||
|
||||
How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look
|
||||
at some examples:
|
||||
|
|
@ -1517,8 +1529,7 @@ at some examples:
|
|||
An operation A *happens before* another operation B if B knows about A, or depends on A, or builds
|
||||
upon A in some way. Whether one operation happens before another operation is the key to defining
|
||||
what concurrency means. In fact, we can simply say that two operations are *concurrent* if neither
|
||||
happens before the other (i.e., neither knows about the other)
|
||||
[^57].
|
||||
happens before the other (i.e., neither knows about the other) [^57].
|
||||
|
||||
Thus, whenever you have two operations A and B, there are three possibilities: either A happened
|
||||
before B, or B happened before A, or A and B are concurrent. What we need is an algorithm to tell us
|
||||
|
|
@ -1526,7 +1537,9 @@ whether two operations are concurrent or not. If one operation happened before a
|
|||
operation should overwrite the earlier operation, but if the operations are concurrent, we have a
|
||||
conflict that needs to be resolved.
|
||||
|
||||
# Concurrency, Time, and Relativity
|
||||
--------
|
||||
|
||||
> ![TIP] Concurrency, Time, and Relativity
|
||||
|
||||
It may seem that two operations should be called concurrent if they occur “at the same time”—but
|
||||
in fact, it is not important whether they literally overlap in time. Because of problems with clocks
|
||||
|
|
@ -1546,7 +1559,9 @@ principle have allowed one operation to affect the other. For example, if the ne
|
|||
interrupted at the time, two operations can occur some time apart and still be concurrent, because
|
||||
the network problems prevented one operation from being able to know about the other.
|
||||
|
||||
### Capturing the happens-before relationship
|
||||
--------
|
||||
|
||||
#### Capturing the happens-before relationship
|
||||
|
||||
Let’s look at an algorithm that determines whether two operations are concurrent, or whether one
|
||||
happened before another. To keep things simple, let’s start with a database that has only one
|
||||
|
|
@ -1619,7 +1634,7 @@ write is based on. If you make a write without including a version number, it is
|
|||
other writes, so it will not overwrite anything—it will just be returned as one of the values
|
||||
on subsequent reads.
|
||||
|
||||
### Version vectors
|
||||
#### Version vectors
|
||||
|
||||
The example in [Figure 6-15](/en/ch6#fig_replication_causality_single) used only a single replica. How does the
|
||||
algorithm change when there are multiple replicas, but no leader?
|
||||
|
|
@ -1646,12 +1661,16 @@ The version vector also ensures that it is safe to read from one replica and sub
|
|||
to another replica. Doing so may result in siblings being created, but no data is lost as long as
|
||||
siblings are merged correctly.
|
||||
|
||||
# Version vectors and vector clocks
|
||||
--------
|
||||
|
||||
> [!TIP] VERSION VECTORS AND VECTOR CLOCKS
|
||||
|
||||
A *version vector* is sometimes also called a *vector clock*, even though they are not quite the
|
||||
same. The difference is subtle—please see the references for details [^60] [^63] [^64]. In brief, when
|
||||
comparing the state of replicas, version vectors are the right data structure to use.
|
||||
|
||||
--------
|
||||
|
||||
## Summary
|
||||
|
||||
In this chapter we looked at the issue of replication. Replication can serve several purposes:
|
||||
|
|
|
|||
|
|
@ -12,8 +12,7 @@ breadcrumbs: false
|
|||
|
||||
A distributed database typically distributes data across nodes in two ways:
|
||||
|
||||
1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in
|
||||
[Chapter 6](/en/ch6#ch_replication).
|
||||
1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in [Chapter 6](/en/ch6#ch_replication).
|
||||
2. If we don’t want every node to store all the data, we can split up a large amount of data into
|
||||
smaller *shards* or *partitions*, and store different shards on different nodes. We’ll discuss
|
||||
sharding in this chapter.
|
||||
|
|
@ -38,7 +37,9 @@ Everything we discussed in [Chapter 6](/en/ch6#ch_replication) about replicatio
|
|||
replication of shards. Since the choice of sharding scheme is mostly independent of the choice of
|
||||
replication scheme, we will ignore replication in this chapter for the sake of simplicity.
|
||||
|
||||
# Sharding and Partitioning
|
||||
--------
|
||||
|
||||
> [!TIP] SHARDING AND PARTITIONING
|
||||
|
||||
What we call a *shard* in this chapter has many different names depending on which software you’re
|
||||
using: it’s called a *partition* in Kafka, a *range* in CockroachDB, a *region* in HBase and TiDB, a
|
||||
|
|
@ -61,7 +62,9 @@ Available Replicated Data*—reportedly a 1980s database, details of which are l
|
|||
By the way, partitioning has nothing to do with *network partitions* (netsplits), a type of fault in
|
||||
the network between nodes. We will discuss such faults in [Chapter 9](/en/ch9#ch_distributed).
|
||||
|
||||
# Pros and Cons of Sharding
|
||||
--------
|
||||
|
||||
## Pros and Cons of Sharding
|
||||
|
||||
The primary reason for sharding a database is *scalability*: it’s a solution if the volume of data
|
||||
or the write throughput has become too great for a single node to handle, as it allows you to spread
|
||||
|
|
@ -105,7 +108,7 @@ access* (NUMA) architecture in which some banks of memory are closer to one CPU
|
|||
For example, Redis, VoltDB, and FoundationDB use one process per core, and rely on sharding to
|
||||
spread load across CPU cores in the same machine [^6].
|
||||
|
||||
## Sharding for Multitenancy
|
||||
### Sharding for Multitenancy
|
||||
|
||||
Software as a Service (SaaS) products and cloud services are often *multitenant*, where each tenant
|
||||
is a customer. Multiple users may have logins on the same tenant, but each tenant has a
|
||||
|
|
@ -166,7 +169,9 @@ The main challenges around using sharding for multitenancy are:
|
|||
* If you ever need to support features that connect data across multiple tenants, these become
|
||||
harder to implement if you need to join data across multiple shards.
|
||||
|
||||
# Sharding of Key-Value Data
|
||||
|
||||
|
||||
## Sharding of Key-Value Data
|
||||
|
||||
Say you have a large amount of data, and you want to shard it. How do you decide which records to
|
||||
store on which nodes?
|
||||
|
|
@ -181,8 +186,7 @@ If the sharding is unfair, so that some shards have more data or queries than ot
|
|||
*skewed*. The presence of skew makes sharding much less effective. In an extreme case, all the load
|
||||
could end up on one shard, so 9 out of 10 nodes are idle and your bottleneck is the single busy
|
||||
node. A shard with disproportionately high load is called a *hot shard* or *hot spot*. If there’s
|
||||
one key with a particularly high load (e.g., a celebrity in a social network), we call it a *hot
|
||||
key*.
|
||||
one key with a particularly high load (e.g., a celebrity in a social network), we call it a *hot key*.
|
||||
|
||||
Therefore we need an algorithm that takes as input the partition key of a record, and tells us which
|
||||
shard that record is in. In a key-value store the partition key is usually the key, or the first
|
||||
|
|
@ -190,7 +194,8 @@ part of the key. In a relational model the partition key might be some column of
|
|||
necessarily its primary key). That algorithm needs to be amenable to rebalancing in order to relieve
|
||||
hot spots.
|
||||
|
||||
## Sharding by Key Range
|
||||
|
||||
### Sharding by Key Range
|
||||
|
||||
One way of sharding is to assign a contiguous range of partition keys (from some minimum to some
|
||||
maximum) to each shard, like the volumes of a paper encyclopedia, as illustrated in
|
||||
|
|
@ -233,7 +238,7 @@ active at the same time, the write load will end up more evenly spread across th
|
|||
downside is that when you want to fetch the values of multiple sensors within a time range, you now
|
||||
need to perform a separate range query for each sensor.
|
||||
|
||||
### Rebalancing key-range sharded data
|
||||
#### Rebalancing key-range sharded data
|
||||
|
||||
When you first set up your database, there are no key ranges to split into shards. Some databases,
|
||||
such as HBase and MongoDB, allow you to configure an initial set of shards on an empty database,
|
||||
|
|
@ -251,20 +256,18 @@ With databases that manage shard boundaries automatically, a shard split is typi
|
|||
|
||||
* the shard reaching a configured size (for example, on HBase, the default is 10 GB), or
|
||||
* in some systems, the write throughput being persistently above some threshold. Thus, a hot shard
|
||||
may be split even if it is not storing a lot of data, so that its write load can be distributed
|
||||
more uniformly.
|
||||
may be split even if it is not storing a lot of data, so that its write load can be distributed more uniformly.
|
||||
|
||||
An advantage of key-range sharding is that the number of shards adapts to the data volume. If there
|
||||
is only a small amount of data, a small number of shards is sufficient, so overheads are small; if
|
||||
there is a huge amount of data, the size of each individual shard is limited to a configurable
|
||||
maximum [^15].
|
||||
there is a huge amount of data, the size of each individual shard is limited to a configurable maximum [^15].
|
||||
|
||||
A downside of this approach is that splitting a shard is an expensive operation, since it requires
|
||||
all of its data to be rewritten into new files, similarly to a compaction in a log-structured
|
||||
storage engine. A shard that needs splitting is often also one that is under high load, and the cost
|
||||
of splitting can exacerbate that load, risking it becoming overloaded.
|
||||
|
||||
## Sharding by Hash of Key
|
||||
### Sharding by Hash of Key
|
||||
|
||||
Key-range sharding is useful if you want records with nearby (but different) partition keys to be
|
||||
grouped into the same shard; for example, this might be the case with timestamps. If you don’t care
|
||||
|
|
@ -273,9 +276,8 @@ application), a common approach is to first hash the partition key before mappin
|
|||
|
||||
A good hash function takes skewed data and makes it uniformly distributed. Say you have a 32-bit
|
||||
hash function that takes a string. Whenever you give it a new string, it returns a seemingly random
|
||||
number between 0 and 232 − 1. Even if the input strings are very similar, their
|
||||
hashes are evenly distributed across that range of numbers (but the same input always produces the
|
||||
same output).
|
||||
number between 0 and 232 − 1. Even if the input strings are very similar, their hashes are evenly
|
||||
distributed across that range of numbers (but the same input always produces the same output).
|
||||
|
||||
For sharding purposes, the hash function need not be cryptographically strong: for example, MongoDB
|
||||
uses MD5, whereas Cassandra and ScyllaDB use Murmur3. Many programming languages have simple hash
|
||||
|
|
@ -283,7 +285,7 @@ functions built in (as they are used for hash tables), but they may not be suita
|
|||
for example, in Java’s `Object.hashCode()` and Ruby’s `Object#hash`, the same key may have a
|
||||
different hash value in different processes, making them unsuitable for sharding [^16].
|
||||
|
||||
### Hash modulo number of nodes
|
||||
#### Hash modulo number of nodes
|
||||
|
||||
Once you have hashed the key, how do you choose which shard to store it in? Maybe your first thought
|
||||
is to take the hash value *modulo* the number of nodes in the system (using the `%` operator in many
|
||||
|
|
@ -303,7 +305,7 @@ The *mod N* function is easy to compute, but it leads to very inefficient rebala
|
|||
is a lot of unnecessary movement of records from one node to another. We need an approach that
|
||||
doesn’t move data around more than necessary.
|
||||
|
||||
### Fixed number of shards
|
||||
#### Fixed number of shards
|
||||
|
||||
One simple but widely-used solution is to create many more shards than there are nodes, and to
|
||||
assign several shards to each node. For example, a database running on a cluster of 10 nodes may be
|
||||
|
|
@ -313,8 +315,7 @@ which shard is stored on which node.
|
|||
|
||||
Now, if a node is added to the cluster, the system can reassign some of the shards from existing
|
||||
nodes to the new node until they are fairly distributed once again. This process is illustrated in
|
||||
[Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in
|
||||
reverse.
|
||||
[Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in reverse.
|
||||
|
||||
{{< figure src="/fig/ddia_0704.png" id="fig_sharding_rebalance_fixed" caption="Figure 7-4. Adding a new node to a database cluster with multiple shards per node." class="w-full my-4" >}}
|
||||
|
||||
|
|
@ -349,7 +350,7 @@ expensive. But if shards are too small, they incur too much overhead. The best p
|
|||
achieved when the size of shards is “just right,” neither too big nor too small, which can be hard
|
||||
to achieve if the number of shards is fixed but the dataset size varies.
|
||||
|
||||
### Sharding by hash range
|
||||
#### Sharding by hash range
|
||||
|
||||
If the required number of shards can’t be predicted in advance, it’s better to use a scheme in which
|
||||
the number of shards can adapt easily to the workload. The aforementioned key-range sharding scheme
|
||||
|
|
@ -375,7 +376,9 @@ two or more columns, and the partition key is only the first of these columns, y
|
|||
efficient range queries over the second and later columns: as long as all records in the range query
|
||||
have the same partition key, they will be in the same shard.
|
||||
|
||||
# Partitioning and Range Queries in Data Warehouses
|
||||
--------
|
||||
|
||||
> [!TIPS] PARTITIONING AND RANGE QUERIES IN DATA WAREHOUSES
|
||||
|
||||
Data warehouses such as BigQuery, Snowflake, and Delta Lake support a similar indexing approach,
|
||||
though the terminology differs. In BigQuery, for example, the partition key determines which
|
||||
|
|
@ -385,6 +388,8 @@ cluster keys for a table. Delta Lake supports both manual and automatic partitio
|
|||
supports cluster keys. Clustering data not only improves range scan performance, but can
|
||||
improve compression and filtering performance as well.
|
||||
|
||||
--------
|
||||
|
||||
Hash-range sharding is used in YugabyteDB and DynamoDB [^17], and is an option in MongoDB.
|
||||
Cassandra and ScyllaDB use a variant of this approach that is illustrated in
|
||||
[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
|
||||
|
|
@ -402,7 +407,7 @@ transfers parts of two of its ranges to node 3, and node 2 transfers part of one
|
|||
node 3. This has the effect of giving the new node an approximately fair share of the dataset,
|
||||
without transferring more data than necessary from one node to another.
|
||||
|
||||
### Consistent hashing
|
||||
#### Consistent hashing
|
||||
|
||||
A *consistent hashing* algorithm is a hash function that maps keys to a specified number of shards
|
||||
in a way that satisfies two properties:
|
||||
|
|
@ -422,7 +427,7 @@ sub-ranges; on the other hand, with rendezvous and jump consistent hashes, the n
|
|||
individual keys that were previously scattered across all of the other nodes. Which one is
|
||||
preferable depends on the application.
|
||||
|
||||
## Skewed Workloads and Relieving Hot Spots
|
||||
### Skewed Workloads and Relieving Hot Spots
|
||||
|
||||
Consistent hashing ensures that keys are uniformly distributed across nodes, but that doesn’t mean
|
||||
that the actual load is uniformly distributed. If the workload is highly skewed—that is, the amount
|
||||
|
|
@ -461,7 +466,7 @@ Some systems (especially cloud services designed for large scale) have automated
|
|||
dealing with hot shards; for example, Amazon calls it *heat management* [^28] or *adaptive capacity* [^17].
|
||||
The details of how these systems work go beyond the scope of this book.
|
||||
|
||||
## Operations: Automatic or Manual Rebalancing
|
||||
### Operations: Automatic or Manual Rebalancing
|
||||
|
||||
There is one important question with regard to rebalancing that we have glossed over: does the
|
||||
splitting of shards and rebalancing happen automatically or manually?
|
||||
|
|
@ -469,8 +474,7 @@ splitting of shards and rebalancing happen automatically or manually?
|
|||
Some systems automatically decide when to split shards and when to move them from one node to
|
||||
another, without any human interaction, while others leave sharding to be explicitly configured by
|
||||
an administrator. There is also a middle ground: for example, Couchbase and Riak generate a
|
||||
suggested shard assignment automatically, but require an administrator to commit it before it takes
|
||||
effect.
|
||||
suggested shard assignment automatically, but require an administrator to commit it before it takes effect.
|
||||
|
||||
Fully automated rebalancing can be convenient, because there is less operational work to do for
|
||||
normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud
|
||||
|
|
@ -488,13 +492,14 @@ Such automation can be dangerous in combination with automatic failure detection
|
|||
one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that
|
||||
the overloaded node is dead, and automatically rebalance the cluster to move load away from it. This
|
||||
puts additional load on other nodes and the network, making the situation worse. There is a risk of
|
||||
causing a cascading failure where other nodes become overloaded and are also falsely suspected of
|
||||
being down.
|
||||
causing a cascading failure where other nodes become overloaded and are also falsely suspected of being down.
|
||||
|
||||
For that reason, it can be a good thing to have a human in the loop for rebalancing. It’s slower
|
||||
than a fully automatic process, but it can help prevent operational surprises.
|
||||
|
||||
# Request Routing
|
||||
|
||||
|
||||
## Request Routing
|
||||
|
||||
We have discussed how to shard a dataset across multiple nodes, and how to rebalance those shards as
|
||||
nodes are added or removed. Now let’s move on to the question: if you want to read or write a
|
||||
|
|
@ -508,8 +513,8 @@ balancer can send a request to any of the instances. With sharded databases, a r
|
|||
only be handled by a node that is a replica for the shard containing that key.
|
||||
|
||||
This means that request routing has to be aware of the assignment from keys to shards, and from
|
||||
shards to nodes. On a high level, there are a few different approaches to this problem (illustrated
|
||||
in [Figure 7-7](/en/ch7#fig_sharding_routing)):
|
||||
shards to nodes. On a high level, there are a few different approaches to this problem
|
||||
(illustrated in [Figure 7-7](/en/ch7#fig_sharding_routing)):
|
||||
|
||||
1. Allow clients to contact any node (e.g., via a round-robin load balancer). If that node
|
||||
coincidentally owns the shard to which the request applies, it can handle the request directly;
|
||||
|
|
@ -568,7 +573,7 @@ typically have a very different kind of query execution: rather than executing i
|
|||
query typically needs to aggregate and join data from many different shards in parallel. We will
|
||||
discuss techniques for such parallel query execution in [Link to Come].
|
||||
|
||||
# Sharding and Secondary Indexes
|
||||
## Sharding and Secondary Indexes
|
||||
|
||||
The sharding schemes we have discussed so far rely on the client knowing the partition key for any
|
||||
record it wants to access. This is most easily done in a key-value data model, where the partition
|
||||
|
|
@ -587,7 +592,7 @@ search engines such as Solr and Elasticsearch. The problem with secondary indexe
|
|||
map neatly to shards. There are two main approaches to sharding a database with secondary indexes:
|
||||
local and global indexes.
|
||||
|
||||
## Local Secondary Indexes
|
||||
### Local Secondary Indexes
|
||||
|
||||
For example, imagine you are operating a website for selling used cars (illustrated in
|
||||
[Figure 7-9](/en/ch7#fig_sharding_local_secondary)). Each listing has a unique ID, and you use that ID as partition
|
||||
|
|
@ -602,7 +607,7 @@ automatically adds its ID to the list of IDs for the index entry `color:red`. As
|
|||
|
||||
{{< figure src="/fig/ddia_0709.png" id="fig_sharding_local_secondary" caption="Figure 7-9. Local secondary indexes: each shard indexes only the records within its own shard." class="w-full my-4" >}}
|
||||
|
||||
###### Warning
|
||||
> [!WARN] WARNING
|
||||
|
||||
If your database only supports a key-value model, you might be tempted to implement a secondary
|
||||
index yourself by creating a mapping from values to IDs in application code. If you go down this
|
||||
|
|
@ -610,6 +615,8 @@ route, you need to take great care to ensure your indexes remain consistent with
|
|||
data. Race conditions and intermittent write failures (where some changes were saved but others
|
||||
weren’t) can very easily cause the data to go out of sync—see [“The need for multi-object transactions”](/en/ch8#sec_transactions_need).
|
||||
|
||||
--------
|
||||
|
||||
In this indexing approach, each shard is completely separate: each shard maintains its own secondary
|
||||
indexes, covering only the records in that shard. It doesn’t care what data is stored in other
|
||||
shards. Whenever you write to the database—to add, remove, or update a records—you only need to
|
||||
|
|
@ -635,7 +642,7 @@ process every query anyway.
|
|||
Nevertheless, local secondary indexes are widely used [^31]: for example, MongoDB, Riak, Cassandra [^32], Elasticsearch [^33],
|
||||
SolrCloud, and VoltDB [^34] all use local secondary indexes.
|
||||
|
||||
## Global Secondary Indexes
|
||||
### Global Secondary Indexes
|
||||
|
||||
Rather than each shard having its own, local secondary index, we can construct a *global index* that
|
||||
covers data in all shards. However, we can’t just store that index on one node, since it would
|
||||
|
|
@ -645,16 +652,13 @@ but it can be sharded differently from the primary key index.
|
|||
[Figure 7-10](/en/ch7#fig_sharding_global_secondary) illustrates what this could look like: the IDs of red cars from
|
||||
all shards appear under `color:red` in the index, but the index is sharded so that colors starting
|
||||
with the letters *a* to *r* appear in shard 0 and colors starting with *s* to *z* appear in shard 1.
|
||||
The index on the make of car is partitioned similarly (with the shard boundary being between *f* and
|
||||
*h*).
|
||||
The index on the make of car is partitioned similarly (with the shard boundary being between *f* and *h*).
|
||||
|
||||
{{< figure src="/fig/ddia_0710.png" id="fig_sharding_global_secondary" caption="Figure 7-10. A global secondary index reflects data from all shards, and is itself sharded by the indexed value." class="w-full my-4" >}}
|
||||
|
||||
This kind of index is also called *term-partitioned*
|
||||
[^30]:
|
||||
This kind of index is also called *term-partitioned* [^30]:
|
||||
recall from [“Full-Text Search”](/en/ch4#sec_storage_full_text) that in full-text search, a *term* is a keyword in a text that
|
||||
you can search for. Here we generalise it to mean any value that you can search for in the secondary
|
||||
index.
|
||||
you can search for. Here we generalise it to mean any value that you can search for in the secondary index.
|
||||
|
||||
The global index uses the term as partition key, so that when you’re looking for a particular term
|
||||
or value, you can figure out which shard you need to query. As before, a shard can contain a
|
||||
|
|
@ -684,6 +688,7 @@ indexes, so reads from a global index may be stale (similarly to replication lag
|
|||
Nevertheless, global indexes are useful if read throughput is higher than write throughput, and if
|
||||
the postings lists are not too long.
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
In this chapter we explored different ways of sharding a large dataset into smaller subsets.
|
||||
|
|
@ -692,8 +697,7 @@ is no longer feasible.
|
|||
|
||||
The goal of sharding is to spread the data and query load evenly across multiple machines, avoiding
|
||||
hot spots (nodes with disproportionately high load). This requires choosing a sharding scheme that
|
||||
is appropriate to your data, and rebalancing the shards when nodes are added to or removed from the
|
||||
cluster.
|
||||
is appropriate to your data, and rebalancing the shards when nodes are added to or removed from the cluster.
|
||||
|
||||
We discussed two main approaches to sharding:
|
||||
|
||||
|
|
|
|||
|
|
@ -63,11 +63,10 @@ Concurrency control is relevant for both single-node and distributed databases.
|
|||
chapter, in [“Distributed Transactions”](/en/ch8#sec_transactions_distributed), we will examine the *two-phase commit* protocol and
|
||||
the challenge of achieving atomicity in a distributed transaction.
|
||||
|
||||
# What Exactly Is a Transaction?
|
||||
## What Exactly Is a Transaction?
|
||||
|
||||
Almost all relational databases today, and some nonrelational databases, support transactions. Most
|
||||
of them follow the style that was introduced in 1975 by IBM System R, the first SQL database
|
||||
[^2] [^3] [^4].
|
||||
of them follow the style that was introduced in 1975 by IBM System R, the first SQL database [^2] [^3] [^4].
|
||||
Although some implementation details have changed, the general idea has remained virtually the same
|
||||
for 50 years: the transaction support in MySQL, PostgreSQL, Oracle, SQL Server, etc., is uncannily
|
||||
similar to that of System R.
|
||||
|
|
@ -92,7 +91,7 @@ technical design choice, transactions have advantages and limitations. In order
|
|||
trade-offs, let’s go into the details of the guarantees that transactions can provide—both in normal
|
||||
operation and in various extreme (but realistic) circumstances.
|
||||
|
||||
## The Meaning of ACID
|
||||
## #The Meaning of ACID
|
||||
|
||||
The safety guarantees provided by transactions are often described by the well-known acronym *ACID*,
|
||||
which stands for *Atomicity*, *Consistency*, *Isolation*, and *Durability*. It was coined in 1983 by
|
||||
|
|
@ -112,7 +111,7 @@ BASE is “not ACID”; i.e., it can mean almost anything you want.)
|
|||
Let’s dig into the definitions of atomicity, consistency, isolation, and durability, as this will let
|
||||
us refine our idea of transactions.
|
||||
|
||||
### Atomicity
|
||||
#### Atomicity
|
||||
|
||||
In general, *atomic* refers to something that cannot be broken down into smaller parts. The word
|
||||
means similar but subtly different things in different branches of computing. For example, in
|
||||
|
|
@ -141,7 +140,7 @@ The ability to abort a transaction on error and have all writes from that transa
|
|||
the defining feature of ACID atomicity. Perhaps *abortability* would have been a better term than
|
||||
*atomicity*, but we will stick with *atomicity* since that’s the usual word.
|
||||
|
||||
### Consistency
|
||||
#### Consistency
|
||||
|
||||
The word *consistency* is terribly overloaded:
|
||||
|
||||
|
|
@ -181,7 +180,7 @@ invariants, but you haven’t declared those invariants, the database can’t st
|
|||
in ACID often depends on how the application uses the database, and it’s not a property of the
|
||||
database alone.
|
||||
|
||||
### Isolation
|
||||
### #Isolation
|
||||
|
||||
Most databases are accessed by several clients at the same time. That is no problem if they are
|
||||
reading and writing different parts of the database, but if they are accessing the same database
|
||||
|
|
@ -211,7 +210,7 @@ is a weaker guarantee than serializability [^10] [^14]).
|
|||
This means that some kinds of race conditions can still occur. We will explore snapshot isolation
|
||||
and other forms of isolation in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels).
|
||||
|
||||
### Durability
|
||||
#### Durability
|
||||
|
||||
The purpose of a database system is to provide a safe place where data can be stored without fear of
|
||||
losing it. *Durability* is the promise that once a transaction has committed successfully, any data it
|
||||
|
|
@ -231,7 +230,9 @@ as discussed in [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction
|
|||
hard disks and all your backups are destroyed at the same time, there’s obviously nothing your
|
||||
database can do to save you.
|
||||
|
||||
# Replication and Durability
|
||||
--------
|
||||
|
||||
> [!TIP] REPLICATION AND DURABILITY
|
||||
|
||||
Historically, durability meant writing to an archive tape. Then it was understood as writing to a disk
|
||||
or SSD. More recently, it has been adapted to mean replication. Which implementation is better?
|
||||
|
|
@ -269,7 +270,9 @@ risk-reduction techniques, including writing to disk, replicating to remote mach
|
|||
backups—and they can and should be used together. As always, it’s wise to take any theoretical
|
||||
“guarantees” with a healthy grain of salt.
|
||||
|
||||
## Single-Object and Multi-Object Operations
|
||||
--------
|
||||
|
||||
### Single-Object and Multi-Object Operations
|
||||
|
||||
To recap, in ACID, atomicity and isolation describe what the database should do if a client makes
|
||||
several writes within the same transaction:
|
||||
|
|
@ -329,7 +332,7 @@ operation that updates several keys in one operation), that doesn’t necessaril
|
|||
transaction semantics: the command may succeed for some keys and fail for others, leaving the
|
||||
database in a partially updated state.
|
||||
|
||||
### Single-object writes
|
||||
#### Single-object writes
|
||||
|
||||
Atomicity and isolation also apply when a single object is being changed. For example, imagine you
|
||||
are writing a 20 KB JSON document to a database:
|
||||
|
|
@ -353,11 +356,15 @@ Similarly popular is a *conditional write* operation, which allows a write to ha
|
|||
has not been concurrently changed by someone else (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)),
|
||||
similarly to a compare-and-set or compare-and-swap (CAS) operation in shared-memory concurrency.
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> Strictly speaking, the term *atomic increment* uses the word *atomic* in the sense of multi-threaded
|
||||
> programming. In the context of ACID, it should actually be called an *isolated* or *serializable*
|
||||
> increment, but that’s not the usual term.
|
||||
|
||||
--------
|
||||
|
||||
These single-object operations are useful, as they can prevent lost updates when several clients try
|
||||
to write to the same object concurrently (see [“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update)). However, they are
|
||||
not transactions in the usual sense of the word. For example, the “lightweight transactions” feature
|
||||
|
|
@ -365,7 +372,7 @@ of Cassandra and ScyllaDB, and Aerospike’s “strong consistency” mode offer
|
|||
[“Linearizability”](/en/ch10#sec_consistency_linearizability)) reads and conditional writes on a single object, but no
|
||||
guarantees across multiple objects.
|
||||
|
||||
### The need for multi-object transactions
|
||||
#### The need for multi-object transactions
|
||||
|
||||
Do we need multi-object transactions at all? Would it be possible to implement any application with
|
||||
only a key-value data model and single-object operations?
|
||||
|
|
@ -396,7 +403,7 @@ much more complicated without atomicity, and the lack of isolation can cause con
|
|||
We will discuss those in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels), and explore alternative approaches
|
||||
in [Link to Come].
|
||||
|
||||
### Handling errors and aborts
|
||||
#### Handling errors and aborts
|
||||
|
||||
A key feature of a transaction is that it can be aborted and safely retried if an error occurred.
|
||||
ACID databases are based on this philosophy: if the database is in danger of violating its guarantee
|
||||
|
|
@ -437,7 +444,9 @@ isn’t perfect:
|
|||
in [“Two-Phase Commit (2PC)”](/en/ch8#sec_transactions_2pc)).
|
||||
* If the client process crashes while retrying, any data it was trying to write to the database is lost.
|
||||
|
||||
# Weak Isolation Levels
|
||||
|
||||
|
||||
## Weak Isolation Levels
|
||||
|
||||
If two transactions don’t access the same data, or if both are read-only, they can safely be run in
|
||||
parallel, because neither depends on the other. Concurrency issues (race conditions) only come into
|
||||
|
|
@ -470,11 +479,15 @@ financial data!”—but that misses the point. Even many popular relational dat
|
|||
are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these
|
||||
bugs from occurring.
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> Incidentally, much of the banking system relies on text files that are exchanged via secure FTP [^35].
|
||||
> In this context, having an audit trail and some human-level fraud prevention measures is actually
|
||||
> more important than ACID properties.
|
||||
|
||||
--------
|
||||
|
||||
Those examples also highlight an important point: even if concurrency issues are rare in normal
|
||||
operation, you have to consider the possibility that an attacker deliberately sends a burst of
|
||||
highly concurrent requests to your API in an attempt to deliberately exploit concurrency bugs [^30]. Therefore, in order to build
|
||||
|
|
@ -488,7 +501,7 @@ serializability in detail (see [“Serializability”](/en/ch8#sec_transactions_
|
|||
levels will be informal, using examples. If you want rigorous definitions and analyses of their
|
||||
properties, you can find them in the academic literature [^36] [^37] [^38] [^39].
|
||||
|
||||
## Read Committed
|
||||
### Read Committed
|
||||
|
||||
The most basic level of transaction isolation is *read committed*. It makes two guarantees:
|
||||
|
||||
|
|
@ -498,7 +511,7 @@ The most basic level of transaction isolation is *read committed*. It makes two
|
|||
Some databases support an even weaker isolation level called *read uncommitted*. It prevents dirty
|
||||
writes, but does not prevent dirty reads. Let’s discuss these two guarantees in more detail.
|
||||
|
||||
### No dirty reads
|
||||
#### No dirty reads
|
||||
|
||||
Imagine a transaction has written some data to the database, but the transaction has not yet committed or aborted.
|
||||
Can another transaction see that uncommitted data? If yes, that is called a
|
||||
|
|
@ -506,13 +519,11 @@ Can another transaction see that uncommitted data? If yes, that is called a
|
|||
|
||||
Transactions running at the read committed isolation level must prevent dirty reads. This means that
|
||||
any writes by a transaction only become visible to others when that transaction commits (and then
|
||||
all of its writes become visible at once). This is illustrated in
|
||||
[Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still
|
||||
all of its writes become visible at once). This is illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still
|
||||
returns the old value, 2, while user 1 has not yet committed.
|
||||
|
||||
{{< figure src="/fig/ddia_0804.png" id="fig_transactions_read_committed" caption="Figure 8-4. No dirty reads: user 2 sees the new value for x only after user 1's transaction has committed." class="w-full my-4" >}}
|
||||
|
||||
|
||||
There are a few reasons why it’s useful to prevent dirty reads:
|
||||
|
||||
* If a transaction needs to update several rows, a dirty read means that another transaction may
|
||||
|
|
@ -526,7 +537,7 @@ There are a few reasons why it’s useful to prevent dirty reads:
|
|||
transaction that read uncommitted data would also need to be aborted, leading to a problem called
|
||||
*cascading aborts*.
|
||||
|
||||
### No dirty writes
|
||||
#### No dirty writes
|
||||
|
||||
What happens if two transactions concurrently try to update the same row in a database? We don’t
|
||||
know in which order the writes will happen, but we normally assume that the later write overwrites
|
||||
|
|
@ -555,7 +566,7 @@ By preventing dirty writes, this isolation level avoids some kinds of concurrenc
|
|||
{{< figure src="/fig/ddia_0805.png" id="fig_transactions_dirty_writes" caption="Figure 8-5. With dirty writes, conflicting writes from different transactions can be mixed up." class="w-full my-4" >}}
|
||||
|
||||
|
||||
### Implementing read committed
|
||||
#### Implementing read committed
|
||||
|
||||
Read committed is a very popular isolation level. It is the default setting in Oracle Database,
|
||||
PostgreSQL, SQL Server, and many other databases [^10].
|
||||
|
|
@ -584,15 +595,14 @@ different part of the application, due to waiting for locks.
|
|||
Nevertheless, locks are used to prevent dirty reads in some databases, such as IBM
|
||||
Db2 and Microsoft SQL Server in the `read_committed_snapshot=off` setting [^29].
|
||||
|
||||
A more commonly used approach to preventing dirty reads is the one illustrated in
|
||||
[Figure 8-4](/en/ch8#fig_transactions_read_committed): for every
|
||||
A more commonly used approach to preventing dirty reads is the one illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed): for every
|
||||
row that is written, the database remembers both the old committed value and the new value
|
||||
set by the transaction that currently holds the write lock. While the transaction is ongoing, any
|
||||
other transactions that read the row are simply given the old value. Only when the new value is
|
||||
committed do transactions switch over to reading the new value (see
|
||||
[“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl) for more detail).
|
||||
|
||||
## Snapshot Isolation and Repeatable Read
|
||||
### Snapshot Isolation and Repeatable Read
|
||||
|
||||
If you look superficially at read committed isolation, you could be forgiven for thinking that it
|
||||
does everything that a transaction needs to do: it allows aborts (required for atomicity), it
|
||||
|
|
@ -616,15 +626,18 @@ now appears as though she only has a total of $900 in her accounts—it seems th
|
|||
vanished into thin air.
|
||||
|
||||
This anomaly is called *read skew*, and it is an example of a *nonrepeatable read*:
|
||||
if Aaliyah were to read the balance of
|
||||
account 1 again at the end of the transaction, she would see a different value ($600) than she saw
|
||||
if Aaliyah were to read the balance of account 1 again at the end of the transaction, she would see a different value ($600) than she saw
|
||||
in her previous query. Read skew is considered acceptable under read committed isolation: the
|
||||
account balances that Aaliyah saw were indeed committed at the time when she read them.
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> The term *skew* is unfortunately overloaded: we previously used it in the sense of an *unbalanced
|
||||
> workload with hot spots* (see [“Skewed Workloads and Relieving Hot Spots”](/en/ch7#sec_sharding_skew)), whereas here it means *timing anomaly*.
|
||||
|
||||
--------
|
||||
|
||||
In Aaliyah’s case, this is not a lasting problem, because she will most likely see consistent account
|
||||
balances if she reloads the online banking website a few seconds later. However, some situations
|
||||
cannot tolerate such temporary inconsistency:
|
||||
|
|
@ -659,7 +672,7 @@ one system to the next [^29] [^40] [^41].
|
|||
Some databases, such as Oracle, TiDB, and Aurora DSQL, even choose snapshot isolation as their
|
||||
highest isolation level.
|
||||
|
||||
### Multi-version concurrency control (MVCC)
|
||||
#### Multi-version concurrency control (MVCC)
|
||||
|
||||
Like read committed isolation, implementations of snapshot isolation typically use write locks to
|
||||
prevent dirty writes (see [“Implementing read committed”](/en/ch8#sec_transactions_read_committed_impl)), which means that a transaction
|
||||
|
|
@ -707,7 +720,7 @@ All of the versions of a row are stored within the same database heap (see
|
|||
or not. The versions of the same row form a linked list, going either from newest version to oldest
|
||||
version or the other way round, so that queries can internally iterate over all versions of a row [^45] [^46].
|
||||
|
||||
### Visibility rules for observing a consistent snapshot
|
||||
#### Visibility rules for observing a consistent snapshot
|
||||
|
||||
When a transaction reads from the database, transaction IDs are used to decide which row versions it
|
||||
can see and which are invisible. By carefully defining visibility rules, the database can present a
|
||||
|
|
@ -743,7 +756,7 @@ that (from other transactions’ point of view) have long been overwritten or de
|
|||
updating values in place but instead inserting a new version every time a value is changed, the
|
||||
database can provide a consistent snapshot while incurring only a small overhead.
|
||||
|
||||
### Indexes and snapshot isolation
|
||||
#### Indexes and snapshot isolation
|
||||
|
||||
How do indexes work in a multi-version database? The most common approach is that each index entry
|
||||
points at one of the versions of a row that matches the entry (either the oldest or the newest
|
||||
|
|
@ -754,9 +767,8 @@ are no longer visible to any transaction, the corresponding index entries can al
|
|||
|
||||
Many implementation details affect the performance of multi-version concurrency control [^45] [^46].
|
||||
For example, PostgreSQL has optimizations for avoiding index updates if different versions of the
|
||||
same row can fit on the same page [^40].
|
||||
Some other databases avoid storing full copies of modified rows, and only store differences between
|
||||
versions to save space.
|
||||
same row can fit on the same page [^40]. Some other databases avoid storing full copies of modified rows,
|
||||
and only store differences between versions to save space.
|
||||
|
||||
Another approach is used in CouchDB, Datomic, and LMDB. Although they also use B-trees (see
|
||||
[“B-Trees”](/en/ch4#sec_storage_b_trees)), they use an *immutable* (copy-on-write) variant that does not overwrite
|
||||
|
|
@ -771,7 +783,7 @@ was created. There is no need to filter out rows based on transaction IDs becaus
|
|||
writes cannot modify an existing B-tree; they can only create new tree roots. This approach also
|
||||
requires a background process for compaction and garbage collection.
|
||||
|
||||
### Snapshot isolation, repeatable read, and naming confusion
|
||||
#### Snapshot isolation, repeatable read, and naming confusion
|
||||
|
||||
MVCC is a commonly used implementation technique for databases, and often it is used to implement
|
||||
snapshot isolation. However, different databases sometimes use different terms to refer to the same
|
||||
|
|
@ -796,7 +808,7 @@ formal definition. And to top it off, IBM Db2 uses “repeatable read” to refe
|
|||
|
||||
As a result, nobody really knows what repeatable read means.
|
||||
|
||||
## Preventing Lost Updates
|
||||
### Preventing Lost Updates
|
||||
|
||||
The read committed and snapshot isolation levels we’ve discussed so far have been primarily about the guarantees
|
||||
of what a read-only transaction can see in the presence of concurrent writes. We have mostly ignored
|
||||
|
|
@ -822,7 +834,7 @@ pattern occurs in various different scenarios:
|
|||
|
||||
Because this is such a common problem, a variety of solutions have been developed [^48].
|
||||
|
||||
### Atomic write operations
|
||||
#### Atomic write operations
|
||||
|
||||
Many databases provide atomic update operations, which remove the need to implement
|
||||
read-modify-write cycles in application code. They are usually the best solution if your code can be
|
||||
|
|
@ -849,7 +861,7 @@ that performs unsafe read-modify-write cycles instead of using atomic operations
|
|||
database [^49] [^50] [^51].
|
||||
This can be a source of subtle bugs that are difficult to find by testing.
|
||||
|
||||
### Explicit locking
|
||||
#### Explicit locking
|
||||
|
||||
Another option for preventing lost updates, if the database’s built-in atomic operations don’t
|
||||
provide the necessary functionality, is for the application to explicitly lock objects that are
|
||||
|
|
@ -869,8 +881,8 @@ players from concurrently moving the same piece, as illustrated in [Example 8-1
|
|||
BEGIN TRANSACTION;
|
||||
|
||||
SELECT * FROM figures
|
||||
WHERE name = 'robot' AND game_id = 222
|
||||
FOR UPDATE; ❶
|
||||
WHERE name = 'robot' AND game_id = 222
|
||||
FOR UPDATE; ❶
|
||||
|
||||
-- Check whether move is valid, then update the position
|
||||
-- of the piece that was returned by the previous SELECT.
|
||||
|
|
@ -889,7 +901,7 @@ are waiting for each other to release their locks. Many databases automatically
|
|||
and abort one of the involved transactions so that the system can make progress. You can handle this
|
||||
situation at the application level by retrying the aborted transaction.
|
||||
|
||||
### Automatically detecting lost updates
|
||||
#### Automatically detecting lost updates
|
||||
|
||||
Atomic operations and locks are ways of preventing lost updates by forcing the read-modify-write
|
||||
cycles to happen sequentially. An alternative is to allow them to execute in parallel and, if the
|
||||
|
|
@ -909,7 +921,7 @@ special database features—you may forget to use a lock or an atomic operation
|
|||
a bug, but lost update detection happens automatically and is thus less error-prone. However, you
|
||||
also have to retry aborted transactions at the application level.
|
||||
|
||||
### Conditional writes (compare-and-set)
|
||||
#### Conditional writes (compare-and-set)
|
||||
|
||||
In databases that don’t provide transactions, you sometimes find a *conditional write* operation
|
||||
that can prevent lost updates by allowing an update to happen only if the value has not changed
|
||||
|
|
@ -925,7 +937,7 @@ user started editing it:
|
|||
```sql
|
||||
-- This may or may not be safe, depending on the database implementation
|
||||
UPDATE wiki_pages SET content = 'new content'
|
||||
WHERE id = 1234 AND content = 'old content';
|
||||
WHERE id = 1234 AND content = 'old content';
|
||||
```
|
||||
|
||||
If the content has changed and no longer matches `'old content'`, this update will have no effect,
|
||||
|
|
@ -940,7 +952,7 @@ implementations of MVCC have an exception to the visibility rules for this scena
|
|||
written by other transactions are visible to the evaluation of the `WHERE` clause of `UPDATE` and
|
||||
`DELETE` queries, even though those writes are not otherwise visible in the snapshot.
|
||||
|
||||
### Conflict resolution and replication
|
||||
#### Conflict resolution and replication
|
||||
|
||||
In replicated databases (see [Chapter 6](/en/ch6#ch_replication)), preventing lost updates takes on another
|
||||
dimension: since they have copies of the data on multiple nodes, and the data can potentially be
|
||||
|
|
@ -968,7 +980,7 @@ On the other hand, the *last write wins* (LWW) conflict resolution method is pro
|
|||
as discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww).
|
||||
Unfortunately, LWW is the default in many replicated databases.
|
||||
|
||||
## Write Skew and Phantoms
|
||||
### Write Skew and Phantoms
|
||||
|
||||
In the previous sections we saw *dirty writes* and *lost updates*, two kinds of race conditions that
|
||||
can occur when different transactions concurrently try to write to the same objects. In order to
|
||||
|
|
@ -995,10 +1007,9 @@ In each transaction, your application first checks that two or more doctors are
|
|||
if yes, it assumes it’s safe for one doctor to go off call. Since the database is using snapshot
|
||||
isolation, both checks return `2`, so both transactions proceed to the next stage. Aaliyah updates her
|
||||
own record to take herself off call, and Bryce updates his own record likewise. Both transactions
|
||||
commit, and now no doctor is on call. Your requirement of having at least one doctor on call has
|
||||
been violated.
|
||||
commit, and now no doctor is on call. Your requirement of having at least one doctor on call has been violated.
|
||||
|
||||
### Characterizing write skew
|
||||
#### Characterizing write skew
|
||||
|
||||
This anomaly is called *write skew* [^36]. It
|
||||
is neither a dirty write nor a lost update, because the two transactions are updating two different
|
||||
|
|
@ -1035,51 +1046,50 @@ options are more restricted:
|
|||
BEGIN TRANSACTION;
|
||||
|
||||
SELECT * FROM doctors
|
||||
WHERE on_call = true
|
||||
AND shift_id = 1234 FOR UPDATE; ❶
|
||||
WHERE on_call = true
|
||||
AND shift_id = 1234 FOR UPDATE; ❶
|
||||
|
||||
UPDATE doctors
|
||||
SET on_call = false
|
||||
WHERE name = 'Aaliyah'
|
||||
AND shift_id = 1234;
|
||||
SET on_call = false
|
||||
WHERE name = 'Aaliyah'
|
||||
AND shift_id = 1234;
|
||||
|
||||
COMMIT;
|
||||
```
|
||||
|
||||
❶: As before, `FOR UPDATE` tells the database to lock all rows returned by this query.
|
||||
|
||||
### More examples of write skew
|
||||
#### More examples of write skew
|
||||
|
||||
Write skew may seem like an esoteric issue at first, but once you’re aware of it, you may notice
|
||||
more situations in which it can occur. Here are some more examples:
|
||||
|
||||
Meeting room booking system
|
||||
: Say you want to enforce that there cannot be two bookings for the same meeting room at the same time [^55].
|
||||
When someone wants to make a booking, you first check for any conflicting bookings (i.e.,
|
||||
bookings for the same room with an overlapping time range), and if none are found, you create the
|
||||
meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)).
|
||||
When someone wants to make a booking, you first check for any conflicting bookings (i.e.,
|
||||
bookings for the same room with an overlapping time range), and if none are found, you create the
|
||||
meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)).
|
||||
|
||||
{{< figure id="fig_transactions_meeting_rooms" title="Example 8-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)" class="w-full my-4" >}}
|
||||
|
||||
```sql
|
||||
BEGIN TRANSACTION;
|
||||
|
||||
-- Check for any existing bookings that overlap with the period of noon-1pm
|
||||
SELECT COUNT(*) FROM bookings
|
||||
WHERE room_id = 123 AND
|
||||
end_time > '2025-01-01 12:00' AND start_time < '2025-01-01 13:00';
|
||||
|
||||
-- If the previous query returned zero:
|
||||
INSERT INTO bookings (room_id, start_time, end_time, user_id)
|
||||
VALUES (123, '2025-01-01 12:00', '2025-01-01 13:00', 666);
|
||||
|
||||
COMMIT;
|
||||
```
|
||||
|
||||
##### Example 8-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)
|
||||
|
||||
```sql
|
||||
BEGIN TRANSACTION;
|
||||
|
||||
-- Check for any existing bookings that overlap with the period of noon-1pm
|
||||
SELECT COUNT(*) FROM bookings
|
||||
WHERE room_id = 123 AND
|
||||
end_time > '2025-01-01 12:00' AND start_time < '2025-01-01 13:00';
|
||||
|
||||
-- If the previous query returned zero:
|
||||
INSERT INTO bookings
|
||||
(room_id, start_time, end_time, user_id)
|
||||
VALUES (123, '2025-01-01 12:00', '2025-01-01 13:00', 666);
|
||||
|
||||
COMMIT;
|
||||
```
|
||||
|
||||
Unfortunately, snapshot isolation does not prevent another user from concurrently inserting a conflicting
|
||||
meeting. In order to guarantee you won’t get scheduling conflicts, you once again need serializable
|
||||
isolation.
|
||||
Unfortunately, snapshot isolation does not prevent another user from concurrently inserting a conflicting
|
||||
meeting. In order to guarantee you won’t get scheduling conflicts, you once again need serializable
|
||||
isolation.
|
||||
|
||||
Multiplayer game
|
||||
: In [Example 8-1](/en/ch8#fig_transactions_select_for_update), we used a lock to prevent lost updates (that is, making
|
||||
|
|
@ -1103,7 +1113,7 @@ Preventing double-spending
|
|||
With write skew, it could happen that two spending items are inserted concurrently that together
|
||||
cause the balance to go negative, but that neither transaction notices the other.
|
||||
|
||||
### Phantoms causing write skew
|
||||
#### Phantoms causing write skew
|
||||
|
||||
All of these examples follow a similar pattern:
|
||||
|
||||
|
|
@ -1139,7 +1149,7 @@ Snapshot isolation avoids phantoms in read-only queries, but in read-write trans
|
|||
examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL
|
||||
generated by ORMs is also prone to write skew [^50] [^51].
|
||||
|
||||
### Materializing conflicts
|
||||
#### Materializing conflicts
|
||||
|
||||
If the problem of phantoms is that there is no object to which we can attach the locks, perhaps we
|
||||
can artificially introduce a lock object into the database?
|
||||
|
|
@ -1162,7 +1172,9 @@ mechanism leak into the application data model. For those reasons, materializing
|
|||
considered a last resort if no alternative is possible. A serializable isolation level is much
|
||||
preferable in most cases.
|
||||
|
||||
# Serializability
|
||||
|
||||
|
||||
## Serializability
|
||||
|
||||
In this chapter we have seen several examples of transactions that are prone to race conditions.
|
||||
Some race conditions are prevented by the read committed and snapshot isolation levels, but
|
||||
|
|
@ -1195,12 +1207,11 @@ serializability, and how they perform. Most databases that provide serializabili
|
|||
three techniques, which we will explore in the rest of this chapter:
|
||||
|
||||
* Literally executing transactions in a serial order (see [“Actual Serial Execution”](/en/ch8#sec_transactions_serial))
|
||||
* Two-phase locking (see [“Two-Phase Locking (2PL)”](/en/ch8#sec_transactions_2pl)), which for several decades was the only viable
|
||||
option
|
||||
* Two-phase locking (see [“Two-Phase Locking (2PL)”](/en/ch8#sec_transactions_2pl)), which for several decades was the only viable option
|
||||
* Optimistic concurrency control techniques such as serializable snapshot isolation (see
|
||||
[“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi))
|
||||
|
||||
## Actual Serial Execution
|
||||
### Actual Serial Execution
|
||||
|
||||
The simplest way of avoiding concurrency problems is to remove the concurrency entirely: to
|
||||
execute only one transaction at a time, in serial order, on a single thread. By doing so, we completely
|
||||
|
|
@ -1230,7 +1241,7 @@ supports concurrency, because it can avoid the coordination overhead of locking.
|
|||
throughput is limited to that of a single CPU core. In order to make the most of that single thread,
|
||||
transactions need to be structured differently from their traditional form.
|
||||
|
||||
### Encapsulating transactions in stored procedures
|
||||
#### Encapsulating transactions in stored procedures
|
||||
|
||||
In the early days of databases, the intention was that a database transaction could encompass an
|
||||
entire flow of user activity. For example, booking an airline ticket is a multi-stage process
|
||||
|
|
@ -1270,8 +1281,7 @@ stored procedure can execute very quickly, without waiting for any network or di
|
|||
|
||||
{{< figure src="/fig/ddia_0809.png" id="fig_transactions_stored_proc" caption="Figure 8-9. The difference between an interactive transaction and a stored procedure (using the example transaction of [Figure 8-8](/en/ch8#fig_transactions_write_skew))." class="w-full my-4" >}}
|
||||
|
||||
|
||||
### Pros and cons of stored procedures
|
||||
#### Pros and cons of stored procedures
|
||||
|
||||
Stored procedures have existed for some time in relational databases, and they have been part of the
|
||||
SQL standard (SQL/PSM) since 1999. They have gained a somewhat bad reputation, for various reasons:
|
||||
|
|
@ -1312,7 +1322,7 @@ so through special deterministic APIs (see [“Durable Execution and Workflows
|
|||
deterministic operations). This approach is called *state machine replication*, and we will return
|
||||
to it in [Chapter 10](/en/ch10#ch_consistency).
|
||||
|
||||
### Sharding
|
||||
#### Sharding
|
||||
|
||||
Executing all transactions serially makes concurrency control much simpler, but limits the
|
||||
transaction throughput of the database to the speed of a single CPU core on a single machine.
|
||||
|
|
@ -1341,13 +1351,12 @@ application. Simple key-value data can often be sharded very easily, but data wi
|
|||
secondary indexes is likely to require a lot of cross-shard coordination (see
|
||||
[“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes)).
|
||||
|
||||
### Summary of serial execution
|
||||
#### Summary of serial execution
|
||||
|
||||
Serial execution of transactions has become a viable way of achieving serializable isolation within
|
||||
certain constraints:
|
||||
|
||||
* Every transaction must be small and fast, because it takes only one slow transaction to stall all
|
||||
transaction processing.
|
||||
* Every transaction must be small and fast, because it takes only one slow transaction to stall all transaction processing.
|
||||
* It is most appropriate in situations where the active dataset can fit in memory. Rarely accessed
|
||||
data could potentially be moved to disk, but if it needed to be accessed in a single-threaded
|
||||
transaction, the system would get very slow.
|
||||
|
|
@ -1355,19 +1364,24 @@ certain constraints:
|
|||
to be sharded without requiring cross-shard coordination.
|
||||
* Cross-shard transactions are possible, but their throughput is hard to scale.
|
||||
|
||||
## Two-Phase Locking (2PL)
|
||||
### Two-Phase Locking (2PL)
|
||||
|
||||
For around 30 years, there was only one widely used algorithm for serializability in databases:
|
||||
*two-phase locking* (2PL), sometimes called *strong strict two-phase locking* (SS2PL) to distinguish
|
||||
it from other variants of 2PL.
|
||||
|
||||
# 2PL is not 2PC
|
||||
|
||||
--------
|
||||
|
||||
> [!TIP] 2PL IS NOT 2PC
|
||||
|
||||
Two-phase *locking* (2PL) and two-phase *commit* (2PC) are two very different things. 2PL provides
|
||||
serializable isolation, whereas 2PC provides atomic commit in a distributed database (see
|
||||
[“Two-Phase Commit (2PC)”](/en/ch8#sec_transactions_2pc)). To avoid confusion, it’s best to think of them as entirely separate
|
||||
concepts and to ignore the unfortunate similarity in the names.
|
||||
|
||||
--------
|
||||
|
||||
We saw previously that locks are often used to prevent dirty writes (see
|
||||
[“No dirty writes”](/en/ch8#sec_transactions_dirty_write)): if two transactions concurrently try to write to the same object,
|
||||
the lock ensures that the second writer must wait until the first one has finished its transaction
|
||||
|
|
@ -1390,7 +1404,7 @@ readers* (see [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_trans
|
|||
snapshot isolation and two-phase locking. On the other hand, because 2PL provides serializability,
|
||||
it protects against all the race conditions discussed earlier, including lost updates and write skew.
|
||||
|
||||
### Implementation of two-phase locking
|
||||
#### Implementation of two-phase locking
|
||||
|
||||
2PL is used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and the
|
||||
repeatable read isolation level in Db2 [^29].
|
||||
|
|
@ -1417,7 +1431,7 @@ transaction B to release its lock, and vice versa. This situation is called *dea
|
|||
automatically detects deadlocks between transactions and aborts one of them so that the others can
|
||||
make progress. The aborted transaction needs to be retried by the application.
|
||||
|
||||
### Performance of two-phase locking
|
||||
#### Performance of two-phase locking
|
||||
|
||||
The big downside of two-phase locking, and the reason why it hasn’t been used by everybody since the
|
||||
1970s, is performance: transaction throughput and response times of queries are significantly worse
|
||||
|
|
@ -1446,7 +1460,7 @@ transaction). This can be an additional performance problem: when a transaction
|
|||
deadlock and is retried, it needs to do its work all over again. If deadlocks are frequent, this can
|
||||
mean significant wasted effort.
|
||||
|
||||
### Predicate locks
|
||||
#### Predicate locks
|
||||
|
||||
In the preceding description of locks, we glossed over a subtle but important detail. In
|
||||
[“Phantoms causing write skew”](/en/ch8#sec_transactions_phantom) we discussed the problem of *phantoms*—that is, one transaction
|
||||
|
|
@ -1485,7 +1499,7 @@ database, but which might be added in the future (phantoms). If two-phase lockin
|
|||
the database prevents all forms of write skew and other race conditions, and so its isolation
|
||||
becomes serializable.
|
||||
|
||||
### Index-range locks
|
||||
#### Index-range locks
|
||||
|
||||
Unfortunately, predicate locks do not perform well: if there are many locks by active transactions,
|
||||
checking for matching locks becomes time-consuming. For that reason, most databases with 2PL
|
||||
|
|
@ -1499,8 +1513,7 @@ room 123) between noon and 1 p.m. This is safe because any write that matches th
|
|||
will definitely also match the approximations.
|
||||
|
||||
In the room bookings database you would probably have an index on the `room_id` column, and/or
|
||||
indexes on `start_time` and `end_time` (otherwise the preceding query would be very slow on a large
|
||||
database):
|
||||
indexes on `start_time` and `end_time` (otherwise the preceding query would be very slow on a large database):
|
||||
|
||||
* Say your index is on `room_id`, and the database uses this index to find existing bookings for
|
||||
room 123. Now the database can simply attach a shared lock to this index entry, indicating that a
|
||||
|
|
@ -1523,7 +1536,7 @@ If there is no suitable index where a range lock can be attached, the database c
|
|||
shared lock on the entire table. This will not be good for performance, since it will stop all
|
||||
other transactions writing to the table, but it’s a safe fallback position.
|
||||
|
||||
## Serializable Snapshot Isolation (SSI)
|
||||
### Serializable Snapshot Isolation (SSI)
|
||||
|
||||
This chapter has painted a bleak picture of concurrency control in databases. On the one hand, we
|
||||
have implementations of serializability that don’t perform well (two-phase locking) or don’t scale
|
||||
|
|
@ -1539,7 +1552,7 @@ Today SSI and similar algorithms are used in single-node databases (the serializ
|
|||
in PostgreSQL [^54], SQL Server’s In-Memory OLTP/Hekaton [^66], and HyPer [^67]), distributed databases (CockroachDB [^5] and
|
||||
FoundationDB [^8]), and embedded storage engines such as BadgerDB.
|
||||
|
||||
### Pessimistic versus optimistic concurrency control
|
||||
#### Pessimistic versus optimistic concurrency control
|
||||
|
||||
Two-phase locking is a so-called *pessimistic* concurrency control mechanism: it is based on the
|
||||
principle that if anything might possibly go wrong (as indicated by a lock held by another
|
||||
|
|
@ -1558,8 +1571,7 @@ transaction wants to commit, the database checks whether anything bad happened (
|
|||
isolation was violated); if so, the transaction is aborted and has to be retried. Only transactions
|
||||
that executed serializably are allowed to commit.
|
||||
|
||||
Optimistic concurrency control is an old idea [^68],
|
||||
and its advantages and disadvantages have been debated for a long time [^69].
|
||||
Optimistic concurrency control is an old idea [^68], and its advantages and disadvantages have been debated for a long time [^69].
|
||||
It performs badly if there is high contention (many transactions trying to access the same objects),
|
||||
as this leads to a high proportion of transactions needing to abort. If the system is already close
|
||||
to its maximum throughput, the additional transaction load from retried transactions can make
|
||||
|
|
@ -1577,14 +1589,13 @@ are made from a consistent snapshot of the database (see [“Snapshot Isolation
|
|||
On top of snapshot isolation, SSI adds an algorithm for detecting serialization conflicts among
|
||||
reads and writes, and determining which transactions to abort.
|
||||
|
||||
### Decisions based on an outdated premise
|
||||
#### Decisions based on an outdated premise
|
||||
|
||||
When we previously discussed write skew in snapshot isolation (see [“Write Skew and Phantoms”](/en/ch8#sec_transactions_write_skew)),
|
||||
we observed a recurring pattern: a transaction reads some data from the database, examines the
|
||||
result of the query, and decides to take some action (write to the database) based on the result
|
||||
that it saw. However, under snapshot isolation, the result from the original query may no longer be
|
||||
up-to-date by the time the transaction commits, because the data may have been modified in the
|
||||
meantime.
|
||||
up-to-date by the time the transaction commits, because the data may have been modified in the meantime.
|
||||
|
||||
Put another way, the transaction is taking an action based on a *premise* (a fact that was true at
|
||||
the beginning of the transaction, e.g., “There are currently two doctors on call”). Later, when the
|
||||
|
|
@ -1603,7 +1614,7 @@ How does the database know if a query result might have changed? There are two c
|
|||
* Detecting reads of a stale MVCC object version (uncommitted write occurred before the read)
|
||||
* Detecting writes that affect prior reads (the write occurs after the read)
|
||||
|
||||
### Detecting stale MVCC reads
|
||||
#### Detecting stale MVCC reads
|
||||
|
||||
Recall that snapshot isolation is usually implemented by multi-version concurrency control (MVCC;
|
||||
see [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl)). When a transaction reads from a consistent snapshot in an
|
||||
|
|
@ -1634,7 +1645,7 @@ abort or may still be uncommitted at the time when transaction 43 is committed,
|
|||
turn out not to have been stale after all. By avoiding unnecessary aborts, SSI preserves snapshot
|
||||
isolation’s support for long-running reads from a consistent snapshot.
|
||||
|
||||
### Detecting writes that affect prior reads
|
||||
#### Detecting writes that affect prior reads
|
||||
|
||||
The second case to consider is when another transaction modifies data after it has been read. This
|
||||
case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range).
|
||||
|
|
@ -1665,7 +1676,7 @@ transaction 43’s write affected 42, 43 hasn’t yet committed, so the write ha
|
|||
However, when transaction 43 wants to commit, the conflicting write from 42 has already been
|
||||
committed, so 43 must abort.
|
||||
|
||||
### Performance of serializable snapshot isolation
|
||||
#### Performance of serializable snapshot isolation
|
||||
|
||||
As always, many engineering details affect how well an algorithm works in practice. For example, one
|
||||
trade-off is the granularity at which transactions’ reads and writes are tracked. If the database
|
||||
|
|
@ -1702,7 +1713,7 @@ SSI requires that read-write transactions be fairly short (long-running read-onl
|
|||
okay). However, SSI is less sensitive to slow transactions than two-phase locking or serial
|
||||
execution.
|
||||
|
||||
# Distributed Transactions
|
||||
## Distributed Transactions
|
||||
|
||||
The last few sections have focused on concurrency control for isolation, the I in ACID. The
|
||||
algorithms we have seen apply to both single-node and distributed databases: although there are
|
||||
|
|
@ -1759,10 +1770,9 @@ was later aborted, user 2’s transaction would have to be reverted as well, sin
|
|||
data that was retroactively declared not to have existed.
|
||||
|
||||
A better approach is to ensure that the nodes involved in a transaction either all commit or all
|
||||
abort, and to prevent a mixture of the two. Ensuring this is known as the *atomic commitment*
|
||||
problem.
|
||||
abort, and to prevent a mixture of the two. Ensuring this is known as the *atomic commitment* problem.
|
||||
|
||||
## Two-Phase Commit (2PC)
|
||||
### Two-Phase Commit (2PC)
|
||||
|
||||
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It
|
||||
is a classic algorithm in distributed databases [^13] [^71] [^72]. 2PC is used
|
||||
|
|
@ -1800,7 +1810,7 @@ the answer “I do” from both. After receiving both acknowledgments, the minis
|
|||
couple husband and wife: the transaction is committed, and the happy fact is broadcast to all
|
||||
attendees. If either bride or groom does not say “yes,” the ceremony is aborted [^76].
|
||||
|
||||
### A system of promises
|
||||
#### A system of promises
|
||||
|
||||
From this short description it might not be clear why two-phase commit ensures atomicity, while
|
||||
one-phase commit across several nodes does not. Surely the prepare and commit requests can just
|
||||
|
|
@ -1851,7 +1861,7 @@ married or not by querying the minister for the status of your global transactio
|
|||
wait for the minister’s next retry of the commit request (since the retries will have continued
|
||||
throughout your period of unconsciousness).
|
||||
|
||||
### Coordinator failure
|
||||
#### Coordinator failure
|
||||
|
||||
We have discussed what happens if one of the participants or the network fails during 2PC: if any of
|
||||
the prepare requests fails or times out, the coordinator aborts the transaction; if any of the
|
||||
|
|
@ -1886,7 +1896,7 @@ all in-doubt transactions by reading its transaction log. Any transactions that
|
|||
record in the coordinator’s log are aborted. Thus, the commit point of 2PC comes down to a regular
|
||||
single-node atomic commit on the coordinator.
|
||||
|
||||
### Three-phase commit
|
||||
#### Three-phase commit
|
||||
|
||||
Two-phase commit is called a *blocking* atomic commit protocol due to the fact that 2PC can become
|
||||
stuck waiting for the coordinator to recover. It is possible to make an atomic commit protocol
|
||||
|
|
@ -1901,7 +1911,7 @@ cannot guarantee atomicity.
|
|||
A better solution in practice is to replace the single-node coordinator with a fault-tolerant
|
||||
consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_consistency).
|
||||
|
||||
## Distributed Transactions Across Different Systems
|
||||
### Distributed Transactions Across Different Systems
|
||||
|
||||
Distributed transactions and two-phase commit have a mixed reputation. On the one hand, they are
|
||||
seen as providing an important safety guarantee that would be hard to achieve otherwise; on the
|
||||
|
|
@ -1936,7 +1946,7 @@ use any protocol and apply optimizations specific to that particular technology.
|
|||
database-internal distributed transactions can often work quite well. On the other hand,
|
||||
transactions spanning heterogeneous technologies are a lot more challenging.
|
||||
|
||||
### Exactly-once message processing
|
||||
#### Exactly-once message processing
|
||||
|
||||
Heterogeneous distributed transactions allow diverse systems to be integrated in powerful ways. For
|
||||
example, a message from a message queue can be acknowledged as processed if and only if the database
|
||||
|
|
@ -1961,11 +1971,10 @@ safely be retried as if nothing had happened.
|
|||
We will return to the topic of exactly-once semantics later in this chapter. Let’s look first at the
|
||||
atomic commit protocol that allows such heterogeneous distributed transactions.
|
||||
|
||||
### XA transactions
|
||||
#### XA transactions
|
||||
|
||||
*X/Open XA* (short for *eXtended Architecture*) is a standard for implementing two-phase commit
|
||||
across heterogeneous technologies [^73].
|
||||
It was introduced in 1991 and has been widely
|
||||
across heterogeneous technologies [^73]. It was introduced in 1991 and has been widely
|
||||
implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL,
|
||||
Db2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ).
|
||||
|
||||
|
|
@ -1996,7 +2005,7 @@ transaction. Only then can the coordinator use the database driver’s XA callba
|
|||
participants to commit or abort, as appropriate. The database server cannot contact the coordinator
|
||||
directly, since all communication must go via its client library.
|
||||
|
||||
### Holding locks while in doubt
|
||||
#### Holding locks while in doubt
|
||||
|
||||
Why do we care so much about a transaction being stuck in doubt? Can’t the rest of the system just
|
||||
get on with its work, and ignore the in-doubt transaction that will be cleaned up eventually?
|
||||
|
|
@ -2019,7 +2028,7 @@ cannot simply continue with their business—if they want to access that same da
|
|||
blocked. This can cause large parts of your application to become unavailable until the in-doubt
|
||||
transaction is resolved.
|
||||
|
||||
### Recovering from coordinator failure
|
||||
#### Recovering from coordinator failure
|
||||
|
||||
In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the
|
||||
log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do occur [^83] [^84] — that is,
|
||||
|
|
@ -2046,7 +2055,7 @@ decision from the coordinator [^73]. To be clear,
|
|||
violates the system of promises in two-phase commit. Thus, heuristic decisions are intended only for
|
||||
getting out of catastrophic situations, and not for regular use.
|
||||
|
||||
### Problems with XA transactions
|
||||
#### Problems with XA transactions
|
||||
|
||||
A single-node coordinator is a single point of failure for the entire system, and making it part of
|
||||
the application server is also problematic because the coordinator’s logs on its local disk become a
|
||||
|
|
@ -2077,7 +2086,7 @@ However, keeping several heterogeneous data systems consistent with each other i
|
|||
important problem, so we need to find a different solution to it. This can be done, as we will see
|
||||
in the next section and in [Link to Come].
|
||||
|
||||
## Database-internal Distributed Transactions
|
||||
### Database-internal Distributed Transactions
|
||||
|
||||
As explained previously, there is a big difference between distributed transactions that span
|
||||
multiple heterogeneous storage technologies, and those that are internal to a system—i.e., where all
|
||||
|
|
@ -2109,7 +2118,7 @@ The isolation levels offered for distributed transactions depend on the system,
|
|||
isolation and serializable snapshot isolation are both possible across shards. The details of how
|
||||
this works can be found in the papers referenced at the end of this chapter.
|
||||
|
||||
### Exactly-once message processing revisited
|
||||
#### Exactly-once message processing revisited
|
||||
|
||||
We saw in [“Exactly-once message processing”](/en/ch8#sec_transactions_exactly_once) that an important use case for distributed transactions
|
||||
is to ensure that some operation takes effect exactly once, even if a crash occurs while it is being
|
||||
|
|
@ -2148,14 +2157,15 @@ Thus, achieving exactly-once processing only requires transactions within the da
|
|||
across database and message broker is not necessary for this use case. Recording the message ID in
|
||||
the database makes the message processing *idempotent*, so that message processing can be safely
|
||||
retried without duplicating its side-effects. A similar approach is used in stream processing
|
||||
frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in
|
||||
[Link to Come].
|
||||
frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in [Link to Come].
|
||||
|
||||
However, internal distributed transactions within the database are still useful for the scalability
|
||||
of patterns such as these: for example, they would allow the message IDs to be stored on one shard
|
||||
and the main data updated by the message processing to be stored on other shards, and to ensure
|
||||
atomicity of the transaction commit across those shards.
|
||||
|
||||
|
||||
|
||||
## Summary
|
||||
|
||||
Transactions are an abstraction layer that allows an application to pretend that certain concurrency
|
||||
|
|
|
|||
|
|
@ -34,7 +34,7 @@ explore how to think about the state of a distributed system and how to reason a
|
|||
have happened ([“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth)). Later, in [Chapter 10](/en/ch10#ch_consistency), we will look at some
|
||||
examples of how we can achieve fault tolerance in the face of those faults.
|
||||
|
||||
# Faults and Partial Failures
|
||||
## Faults and Partial Failures
|
||||
|
||||
When you are writing a program on a single computer, it normally behaves in a fairly predictable
|
||||
way: either it works or it doesn’t. Buggy software may give the appearance that the computer is
|
||||
|
|
@ -69,7 +69,7 @@ anecdote [^3]:
|
|||
> pickup truck into a DC’s HVAC [heating, ventilation, and air conditioning] system. And I’m not even
|
||||
> an ops guy.
|
||||
>
|
||||
> Coda Hale
|
||||
> —— Coda Hale
|
||||
|
||||
In a distributed system, there may well be some parts of the system that are broken in some
|
||||
unpredictable way, even though other parts of the system are working fine. This is known as a
|
||||
|
|
@ -77,8 +77,7 @@ unpredictable way, even though other parts of the system are working fine. This
|
|||
anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably
|
||||
fail. As we shall see, you may not even *know* whether something succeeded or not!
|
||||
|
||||
This nondeterminism and possibility of partial failures is what makes distributed systems hard to
|
||||
work with [^4].
|
||||
This nondeterminism and possibility of partial failures is what makes distributed systems hard to work with [^4].
|
||||
On the other hand, if a distributed system can tolerate partial failures, that opens up powerful
|
||||
possibilities: for example, it allows you to perform a rolling upgrade, rebooting one node at a time
|
||||
to install software updates while the system as a whole continues working uninterrupted all the
|
||||
|
|
@ -90,7 +89,7 @@ supposed to tolerate. It is important to consider a wide range of possible fault
|
|||
unlikely ones—and to artificially create such situations in your testing environment to see what
|
||||
happens. In distributed systems, suspicion, pessimism, and paranoia pay off.
|
||||
|
||||
# Unreliable Networks
|
||||
## Unreliable Networks
|
||||
|
||||
As discussed in [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](/en/ch2#sec_introduction_shared_nothing), the distributed systems we focus on
|
||||
in this book are mostly *shared-nothing systems*: i.e., a bunch of machines connected by a network.
|
||||
|
|
@ -130,18 +129,22 @@ the response is not going to arrive. However, when a timeout occurs, you still d
|
|||
the remote node got your request or not (and if the request is still queued somewhere, it may still
|
||||
be delivered to the recipient, even if the sender has given up on it).
|
||||
|
||||
## The Limitations of TCP
|
||||
### The Limitations of TCP
|
||||
|
||||
Network packets have a maximum size (generally a few kilobytes), but many applications need to send
|
||||
messages (requests, responses) that are too big to fit in one packet. These applications most often
|
||||
use TCP, the Transmission Control Protocol, to establish a *connection* that breaks up large data
|
||||
streams into individual packets, and puts them back together again on the receiving side.
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> Most of what we say about TCP applies also to its more recent alternative QUIC, as well as the
|
||||
> Stream Control Transmission Protocol (SCTP) used in WebRTC, the BitTorrent uTP protocol, and
|
||||
> other transport protocols. For a comparison to UDP, see [“TCP Versus UDP”](/en/ch9#sidebar_distributed_tcp_udp).
|
||||
|
||||
--------
|
||||
|
||||
TCP is often described as providing “reliable” delivery, in the sense that it detects and
|
||||
retransmits dropped packets, it detects reordered packets and puts them back in the correct order,
|
||||
and it detects packet corruption using a simple checksum. It also figures out how fast it can send
|
||||
|
|
@ -177,7 +180,7 @@ use it to send multiple requests and responses. This is usually done by first se
|
|||
indicates the length of the following message in bytes, followed by the actual message. HTTP and
|
||||
many RPC protocols (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) work like this.
|
||||
|
||||
## Network Faults in Practice
|
||||
### Network Faults in Practice
|
||||
|
||||
We have been building computer networks for decades—one might hope that by now we would have figured
|
||||
out how to make them reliable. Unfortunately, we have not yet succeeded. There are some systematic
|
||||
|
|
@ -197,25 +200,27 @@ even in controlled environments like a datacenter operated by one company [^8]:
|
|||
* Across different cloud regions, round-trip times of up to several *minutes* have been observed at
|
||||
high percentiles [^18].
|
||||
Even within a single datacenter, packet delay of more than a minute can occur during a network
|
||||
topology reconfiguration, triggered by a problem during a software upgrade for a switch
|
||||
[^19].
|
||||
topology reconfiguration, triggered by a problem during a software upgrade for a switch [^19].
|
||||
Thus, we have to assume that messages might be delayed arbitrarily.
|
||||
* Sometimes communications are partially interrupted, depending on who you’re talking to: for
|
||||
example, A and B can communicate, B and C can communicate, but A and C cannot [^20] [^21].
|
||||
Other surprising faults include a network interface that sometimes drops all inbound packets but
|
||||
sends outbound packets successfully [^22]:
|
||||
just because a network link works in one direction doesn’t guarantee it’s also working in the
|
||||
opposite direction.
|
||||
just because a network link works in one direction doesn’t guarantee it’s also working in the opposite direction.
|
||||
* Even a brief network interruption can have repercussions that last for much longer than the
|
||||
original issue [^8] [^20] [^23].
|
||||
|
||||
# Network partitions
|
||||
--------
|
||||
|
||||
> [!TIP] NETWORK PARTITIONS
|
||||
|
||||
When one part of the network is cut off from the rest due to a network fault, that is sometimes
|
||||
called a *network partition* or *netsplit*, but it is not fundamentally different from other kinds
|
||||
of network interruption. Network partitions are not related to sharding of a storage system, which
|
||||
is sometimes also called *partitioning* (see [Chapter 7](/en/ch7#ch_sharding)).
|
||||
|
||||
--------
|
||||
|
||||
Even if network faults are rare in your environment, the fact that faults *can* occur means that
|
||||
your software needs to be able to handle them. Whenever any communication happens over a network, it
|
||||
may fail—there is no way around it.
|
||||
|
|
@ -233,7 +238,7 @@ and ensure that the system can recover from them.
|
|||
It may make sense to deliberately trigger network problems and test the system’s response (this is
|
||||
known as *fault injection*; see [“Fault injection”](/en/ch9#sec_fault_injection)).
|
||||
|
||||
## Detecting Faults
|
||||
### Detecting Faults
|
||||
|
||||
Many systems need to automatically detect faulty nodes. For example:
|
||||
|
||||
|
|
@ -266,7 +271,7 @@ gone wrong, you may get an error response at some level of the stack, but in gen
|
|||
assume that you will get no response at all. You can retry a few times, wait for a timeout to
|
||||
elapse, and eventually declare the node dead if you don’t hear back within the timeout.
|
||||
|
||||
## Timeouts and Unbounded Delays
|
||||
### Timeouts and Unbounded Delays
|
||||
|
||||
If a timeout is the only sure way of detecting a fault, then how long should the timeout be? There
|
||||
is unfortunately no simple answer.
|
||||
|
|
@ -304,7 +309,7 @@ cannot guarantee that they can handle requests within some maximum time (see
|
|||
be fast most of the time: if your timeout is low, it only takes a transient spike in round-trip
|
||||
times to throw the system off-balance.
|
||||
|
||||
### Network congestion and queueing
|
||||
#### Network congestion and queueing
|
||||
|
||||
When driving a car, travel times on road networks often vary most due to traffic congestion.
|
||||
Similarly, the variability of packet delays on computer networks is most often due to queueing [^27]:
|
||||
|
|
@ -327,12 +332,13 @@ Similarly, the variability of packet delays on computer networks is most often d
|
|||
|
||||
{{< figure src="/fig/ddia_0902.png" id="fig_distributed_switch_queueing" caption="Figure 9-2. If several machines send network traffic to the same destination, its switch queue can fill up. Here, ports 1, 2, and 4 are all trying to send packets to port 3." class="w-full my-4" >}}
|
||||
|
||||
|
||||
Moreover, when TCP detects and automatically retransmits a lost packet, although the application
|
||||
does not see the packet loss directly, it does see the resulting delay (waiting for the timeout to
|
||||
expire, and then waiting for the retransmitted packet to be acknowledged).
|
||||
|
||||
# TCP Versus UDP
|
||||
--------
|
||||
|
||||
> [!TIP] TCP VERSUS UDP
|
||||
|
||||
Some latency-sensitive applications, such as videoconferencing and Voice over IP (VoIP), use UDP
|
||||
rather than TCP. It’s a trade-off between reliability and variability of delays: as UDP does not
|
||||
|
|
@ -346,6 +352,8 @@ application must instead fill the missing packet’s time slot with silence (cau
|
|||
interruption in the sound) and move on in the stream. The retry happens at the human layer instead.
|
||||
(“Could you repeat that please? The sound just cut out for a moment.”)
|
||||
|
||||
--------
|
||||
|
||||
All of these factors contribute to the variability of network delays. Queueing delays have an
|
||||
especially wide range when a system is close to its maximum capacity: a system with plenty of spare
|
||||
capacity can easily drain queues, whereas in a highly utilized system, long queues can build up very
|
||||
|
|
@ -369,7 +377,7 @@ observed response time distribution. The Phi Accrual failure detector [^32],
|
|||
which is used for example in Akka and Cassandra [^33]
|
||||
is one way of doing this. TCP retransmission timeouts also work similarly [^5].
|
||||
|
||||
## Synchronous Versus Asynchronous Networks
|
||||
### Synchronous Versus Asynchronous Networks
|
||||
|
||||
Distributed systems would be a lot simpler if we could rely on the network to deliver packets with
|
||||
some fixed maximum delay, and not to drop packets. Why can’t we solve this at the hardware level
|
||||
|
|
@ -394,7 +402,7 @@ suffer from queueing, because the 16 bits of space for the call have already bee
|
|||
next hop of the network. And because there is no queueing, the maximum end-to-end latency of the
|
||||
network is fixed. We call this a *bounded delay*.
|
||||
|
||||
### Can we not simply make network delays predictable?
|
||||
#### Can we not simply make network delays predictable?
|
||||
|
||||
Note that a circuit in a telephone network is very different from a TCP connection: a circuit is a
|
||||
fixed amount of reserved bandwidth which nobody else can use while the circuit is established,
|
||||
|
|
@ -433,7 +441,9 @@ Loss, and Scalable Throughput (L4S) attempt to mitigate some of the queuing and
|
|||
problems both at the client and router level. Linux’s traffic controller (TC) also allows
|
||||
applications to reprioritize packets for QoS purposes.
|
||||
|
||||
# Latency and Resource Utilization
|
||||
--------
|
||||
|
||||
> [!TIP] LATENCY AND RESOURCE UTILIZATION
|
||||
|
||||
More generally, you can think of variable delays as a consequence of dynamic resource partitioning.
|
||||
|
||||
|
|
@ -460,25 +470,25 @@ platforms run several virtual machines from different customers on the same phys
|
|||
Latency guarantees are achievable in certain environments, if resources are statically partitioned
|
||||
(e.g., dedicated hardware and exclusive bandwidth allocations). However, it comes at the cost of
|
||||
reduced utilization—in other words, it is more expensive. On the other hand, multitenancy with
|
||||
dynamic resource partitioning provides better utilization, so it is cheaper, but it has the downside
|
||||
of variable delays.
|
||||
dynamic resource partitioning provides better utilization, so it is cheaper, but it has the downside of variable delays.
|
||||
|
||||
Variable delays in networks are not a law of nature, but simply the result of a cost/benefit
|
||||
trade-off.
|
||||
Variable delays in networks are not a law of nature, but simply the result of a cost/benefit trade-off.
|
||||
|
||||
However, such quality of service is currently not enabled in multitenant datacenters and public
|
||||
clouds, or when communicating via the internet.
|
||||
--------
|
||||
|
||||
However, such quality of service is currently not enabled in multitenant datacenters and public clouds, or when communicating via the internet.
|
||||
Currently deployed technology does not allow us to make any guarantees about delays or reliability
|
||||
of the network: we have to assume that network congestion, queueing, and unbounded delays will
|
||||
happen. Consequently, there’s no “correct” value for timeouts—they need to be determined
|
||||
experimentally.
|
||||
happen. Consequently, there’s no “correct” value for timeouts—they need to be determined experimentally.
|
||||
|
||||
Peering agreements between internet service providers and the establishment of routes through the
|
||||
Border Gateway Protocol (BGP), bear closer resemblance to circuit switching than IP itself. At this
|
||||
level, it is possible to buy dedicated bandwidth. However, internet routing operates at the level of
|
||||
networks, not individual connections between hosts, and at a much longer timescale.
|
||||
|
||||
# Unreliable Clocks
|
||||
|
||||
|
||||
## Unreliable Clocks
|
||||
|
||||
Clocks and time are important. Applications depend on clocks in various ways to answer questions
|
||||
like the following:
|
||||
|
|
@ -509,13 +519,13 @@ synchronize clocks to some degree: the most commonly used mechanism is the Netwo
|
|||
allows the computer clock to be adjusted according to the time reported by a group of servers [^39].
|
||||
The servers in turn get their time from a more accurate time source, such as a GPS receiver.
|
||||
|
||||
## Monotonic Versus Time-of-Day Clocks
|
||||
### Monotonic Versus Time-of-Day Clocks
|
||||
|
||||
Modern computers have at least two different kinds of clocks: a *time-of-day clock* and a *monotonic
|
||||
clock*. Although they both measure time, it is important to distinguish the two, since they serve
|
||||
different purposes.
|
||||
|
||||
### Time-of-day clocks
|
||||
#### Time-of-day clocks
|
||||
|
||||
A time-of-day clock does what you intuitively expect of a clock: it returns the current date and
|
||||
time according to some calendar (also known as *wall-clock time*). For example,
|
||||
|
|
@ -539,11 +549,10 @@ Time-of-day clocks have also historically had quite a coarse-grained resolution,
|
|||
in steps of 10 ms on older Windows systems [^41].
|
||||
On recent systems, this is less of a problem.
|
||||
|
||||
### Monotonic clocks
|
||||
#### Monotonic clocks
|
||||
|
||||
A monotonic clock is suitable for measuring a duration (time interval), such as a timeout or a
|
||||
service’s response time: `clock_gettime(CLOCK_MONOTONIC)` or `clock_gettime(CLOCK_BOOTTIME)` on
|
||||
Linux [^42]
|
||||
service’s response time: `clock_gettime(CLOCK_MONOTONIC)` or `clock_gettime(CLOCK_BOOTTIME)` on Linux [^42]
|
||||
and `System.nanoTime()` in Java are monotonic clocks, for example. The name comes from the fact that
|
||||
they are guaranteed to always move forward (whereas a time-of-day clock may jump back in time).
|
||||
|
||||
|
|
@ -571,7 +580,7 @@ In a distributed system, using a monotonic clock for measuring elapsed time (e.g
|
|||
usually fine, because it doesn’t assume any synchronization between different nodes’ clocks and is
|
||||
not sensitive to slight inaccuracies of measurement.
|
||||
|
||||
## Clock Synchronization and Accuracy
|
||||
### Clock Synchronization and Accuracy
|
||||
|
||||
Monotonic clocks don’t need synchronization, but time-of-day clocks need to be set according to an
|
||||
NTP server or other external time source in order to be useful. Unfortunately, our methods for
|
||||
|
|
@ -631,7 +640,7 @@ Some cloud providers have begun offering high-accuracy clock synchronization for
|
|||
However, clock synchronization still requires a lot of care. If your NTP daemon is misconfigured, or
|
||||
a firewall is blocking NTP traffic, the clock error due to drift can quickly become large.
|
||||
|
||||
## Relying on Synchronized Clocks
|
||||
### Relying on Synchronized Clocks
|
||||
|
||||
The problem with clocks is that while they seem simple and easy to use, they have a surprising
|
||||
number of pitfalls: a day may not have exactly 86,400 seconds, time-of-day clocks may move backward
|
||||
|
|
@ -655,7 +664,7 @@ monitor the clock offsets between all the machines. Any node whose clock drifts
|
|||
others should be declared dead and removed from the cluster. Such monitoring ensures that you notice
|
||||
the broken clocks before they can cause too much damage.
|
||||
|
||||
### Timestamps for ordering events
|
||||
#### Timestamps for ordering events
|
||||
|
||||
Let’s consider one particular situation in which it is tempting, but dangerous, to rely on clocks:
|
||||
ordering of events across multiple nodes [^64].
|
||||
|
|
@ -718,8 +727,7 @@ arrived before it was sent, which is impossible.
|
|||
Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur?
|
||||
Probably not, because NTP’s synchronization accuracy is itself limited by the network round-trip
|
||||
time, in addition to other sources of error such as quartz drift. To guarantee a correct ordering,
|
||||
you would need the clock error to be significantly lower than the network delay, which is not
|
||||
possible.
|
||||
you would need the clock error to be significantly lower than the network delay, which is not possible.
|
||||
|
||||
So-called *logical clocks* [^66], which are based on incrementing counters rather than an oscillating quartz crystal, are a safer
|
||||
alternative for ordering events (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). Logical clocks do not measure
|
||||
|
|
@ -728,7 +736,7 @@ event happened before or after another). In contrast, time-of-day and monotonic
|
|||
measure actual elapsed time, are also known as *physical clocks*. We’ll look at logical clocks in
|
||||
more detail in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical).
|
||||
|
||||
### Clock readings with a confidence interval
|
||||
#### Clock readings with a confidence interval
|
||||
|
||||
You may be able to read a machine’s time-of-day clock with microsecond or even nanosecond
|
||||
resolution. But even if you can get such a fine-grained measurement, that doesn’t mean the value is
|
||||
|
|
@ -740,10 +748,8 @@ possible accuracy is probably to the tens of milliseconds, and the error may eas
|
|||
|
||||
Thus, it doesn’t make sense to think of a clock reading as a point in time—it is more like a
|
||||
range of times, within a confidence interval: for example, a system may be 95% confident that the
|
||||
time now is between 10.3 and 10.5 seconds past the minute, but it doesn’t know any more precisely
|
||||
than that [^67].
|
||||
If we only know the time +/– 100 ms, the microsecond digits in the timestamp are
|
||||
essentially meaningless.
|
||||
time now is between 10.3 and 10.5 seconds past the minute, but it doesn’t know any more precisely than that [^67].
|
||||
If we only know the time +/– 100 ms, the microsecond digits in the timestamp are essentially meaningless.
|
||||
|
||||
The uncertainty bound can be calculated based on your time source. If you have a GPS receiver or
|
||||
atomic clock directly attached to your computer, the expected error range is determined by
|
||||
|
|
@ -763,7 +769,7 @@ timestamp. Based on its uncertainty calculations, the clock knows that the actua
|
|||
somewhere within that interval. The width of the interval depends, among other things, on how long
|
||||
it has been since the local quartz clock was last synchronized with a more accurate clock source.
|
||||
|
||||
### Synchronized clocks for global snapshots
|
||||
#### Synchronized clocks for global snapshots
|
||||
|
||||
In [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation) we discussed *multi-version concurrency control* (MVCC),
|
||||
which is a very useful feature in databases that need to support both small, fast read-write
|
||||
|
|
@ -807,7 +813,7 @@ have a confidence interval, and the accurate clock sources only help keep that i
|
|||
systems are beginning to adopt similar approaches: for example, YugabyteDB can leverage ClockBound
|
||||
when running on AWS [^70], and several other systems now also rely on clock synchronization to various degrees [^71] [^72].
|
||||
|
||||
## Process Pauses
|
||||
### Process Pauses
|
||||
|
||||
Let’s consider another example of dangerous clock use in a distributed system. Say you have a
|
||||
database with a single leader per shard. Only the leader is allowed to accept writes. How does a
|
||||
|
|
@ -824,16 +830,16 @@ You can imagine the request-handling loop looking something like this:
|
|||
|
||||
```js
|
||||
while (true) {
|
||||
request = getIncomingRequest();
|
||||
request = getIncomingRequest();
|
||||
|
||||
// Ensure that the lease always has at least 10 seconds remaining
|
||||
if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
|
||||
lease = lease.renew();
|
||||
}
|
||||
// Ensure that the lease always has at least 10 seconds remaining
|
||||
if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
|
||||
lease = lease.renew();
|
||||
}
|
||||
|
||||
if (lease.isValid()) {
|
||||
process(request);
|
||||
}
|
||||
if (lease.isValid()) {
|
||||
process(request);
|
||||
}
|
||||
}
|
||||
```
|
||||
|
||||
|
|
@ -917,7 +923,7 @@ keeps moving and may even declare the paused node dead because it’s not respon
|
|||
the paused node may continue running, without even noticing that it was asleep until it checks its
|
||||
clock sometime later.
|
||||
|
||||
### Response time guarantees
|
||||
#### Response time guarantees
|
||||
|
||||
In many programming languages and operating systems, threads and processes may pause for an
|
||||
unbounded amount of time, as discussed. Those reasons for pausing *can* be eliminated if you try
|
||||
|
|
@ -929,12 +935,16 @@ must respond quickly and predictably to their sensor inputs. In these systems, t
|
|||
*deadline* by which the software must respond; if it doesn’t meet the deadline, that may cause a
|
||||
failure of the entire system. These are so-called *hard real-time* systems.
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> In embedded systems, *real-time* means that a system is carefully designed and tested to meet
|
||||
> specified timing guarantees in all circumstances. This meaning is in contrast to the more vague use of the
|
||||
> term *real-time* on the web, where it describes servers pushing data to clients and stream
|
||||
> processing without hard response time constraints (see [Link to Come]).
|
||||
|
||||
--------
|
||||
|
||||
For example, if your car’s onboard sensors detect that you are currently experiencing a crash, you
|
||||
wouldn’t want the release of the airbag to be delayed due to an inopportune GC pause in the airbag
|
||||
release system.
|
||||
|
|
@ -958,7 +968,7 @@ For most server-side data processing systems, real-time guarantees are simply no
|
|||
appropriate. Consequently, these systems must suffer the pauses and clock instability that come from
|
||||
operating in a non-real-time environment.
|
||||
|
||||
### Limiting the impact of garbage collection
|
||||
#### Limiting the impact of garbage collection
|
||||
|
||||
Garbage collection used to be one of the biggest reasons for process pauses [^79],
|
||||
but fortunately GC algorithms have improved a lot: a properly tuned collector will now usually pause
|
||||
|
|
@ -990,7 +1000,9 @@ planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding
|
|||
These measures cannot fully prevent garbage collection pauses, but they can usefully reduce their
|
||||
impact on the application.
|
||||
|
||||
# Knowledge, Truth, and Lies
|
||||
|
||||
|
||||
## Knowledge, Truth, and Lies
|
||||
|
||||
So far in this chapter we have explored the ways in which distributed systems are different from
|
||||
programs running on a single computer: there is no shared memory, only message passing via an
|
||||
|
|
@ -1005,8 +1017,7 @@ exchanging messages with it. If a remote node doesn’t respond, there is no way
|
|||
it is in, because problems in the network cannot reliably be distinguished from problems at a node.
|
||||
|
||||
Discussions of these systems border on the philosophical: What do we know to be true or false in our
|
||||
system? How sure can we be of that knowledge, if the mechanisms for perception and measurement are
|
||||
unreliable [^83]?
|
||||
system? How sure can we be of that knowledge, if the mechanisms for perception and measurement are unreliable [^83]?
|
||||
Should software systems obey the laws that we expect of the physical world, such as cause and effect?
|
||||
|
||||
Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed
|
||||
|
|
@ -1022,7 +1033,7 @@ we can make and the guarantees we may want to provide. In [Chapter 10](/en/ch10
|
|||
look at some examples of distributed algorithms that provide particular guarantees under particular
|
||||
assumptions.
|
||||
|
||||
## The Majority Rules
|
||||
### The Majority Rules
|
||||
|
||||
Imagine a network with an asymmetric fault: a node is able to receive all messages sent to it, but
|
||||
any outgoing messages from that node are dropped or delayed [^22]. Even though that node is working
|
||||
|
|
@ -1064,7 +1075,7 @@ tolerated). However, it is still safe, because there can only be only one majori
|
|||
system—there cannot be two majorities with conflicting decisions at the same time. We will discuss
|
||||
the use of quorums in more detail when we get to *consensus algorithms* in [Chapter 10](/en/ch10#ch_consistency).
|
||||
|
||||
## Distributed Locks and Leases
|
||||
### Distributed Locks and Leases
|
||||
|
||||
Locks and leases in distributed application are prone to be misused, and a common source of bugs [^84].
|
||||
Let’s look at one particular case of how they can go wrong.
|
||||
|
|
@ -1087,8 +1098,7 @@ wasted computational resources, which is not a big deal. But in the first two ca
|
|||
could be lost or corrupted data, which is much more serious.
|
||||
|
||||
For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect
|
||||
implementation of locking. (The bug is not theoretical: HBase used to have this problem
|
||||
[^85] [^86].)
|
||||
implementation of locking. (The bug is not theoretical: HBase used to have this problem [^85] [^86].)
|
||||
Say you want to ensure that a file in a storage service can only be
|
||||
accessed by one client at a time, because if multiple clients tried to write to it, the file would
|
||||
become corrupted. You try to implement this by requiring a client to obtain a lease from a lock
|
||||
|
|
@ -1115,7 +1125,7 @@ to [Figure 9-4](/en/ch9#fig_distributed_lease_pause).
|
|||
{{< figure src="/fig/ddia_0905.png" id="fig_distributed_lease_delay" caption="Figure 9-5. A message from a former leaseholder might be delayed for a long time, and arrive after another node has taken over the lease." class="w-full my-4" >}}
|
||||
|
||||
|
||||
### Fencing off zombies and delayed requests
|
||||
#### Fencing off zombies and delayed requests
|
||||
|
||||
The term *zombie* is sometimes used to describe a former leaseholder who has not yet found out that
|
||||
it lost the lease, and who is still acting as if it was the current leaseholder. Since we cannot
|
||||
|
|
@ -1141,12 +1151,16 @@ token*, which is a number that increases every time a lock is granted (e.g., inc
|
|||
service). We can then require that every time a client sends a write request to the storage service,
|
||||
it must include its current fencing token.
|
||||
|
||||
--------
|
||||
|
||||
> [!NOTE]
|
||||
> There are several alternative names for fencing tokens. In Chubby, Google’s lock service, they are
|
||||
> called *sequencers* [^88], and in Kafka they are called *epoch numbers*.
|
||||
> In consensus algorithms, which we will discuss in [Chapter 10](/en/ch10#ch_consistency), the *ballot number* (Paxos) or
|
||||
> *term number* (Raft) serves a similar purpose.
|
||||
|
||||
--------
|
||||
|
||||
In [Figure 9-6](/en/ch9#fig_distributed_fencing), client 1 acquires the lease with a token of 33, but then
|
||||
it goes into a long pause and the lease expires. Client 2 acquires the lease with a token of 34 (the
|
||||
number always increases) and then sends its write request to the storage service, including the
|
||||
|
|
@ -1168,12 +1182,10 @@ read it, similarly to an atomic compare-and-set (CAS) operation. For example, ob
|
|||
services support such a check: Amazon S3 calls it *conditional writes*, Azure Blob Storage calls it
|
||||
*conditional headers*, and Google Cloud Storage calls it *request preconditions*.
|
||||
|
||||
### Fencing with multiple replicas
|
||||
#### Fencing with multiple replicas
|
||||
|
||||
If your clients need to write only to one storage service that supports such conditional writes, the
|
||||
lock service is somewhat redundant
|
||||
[^91] [^92],
|
||||
since the lease assignment could have been implemented directly based on that storage service [^93].
|
||||
lock service is somewhat redundant [^91] [^92], since the lease assignment could have been implemented directly based on that storage service [^93].
|
||||
However, once you have a fencing token you can also use it with multiple services or replicas, and
|
||||
ensure that the old leaseholder is fenced off on all of those services.
|
||||
|
||||
|
|
@ -1202,7 +1214,7 @@ As you can see from these examples, it is not safe to assume that there is only
|
|||
lease at any one time. Fortunately, with a bit of care you can use fencing tokens to prevent zombies
|
||||
and delayed requests from doing any damage.
|
||||
|
||||
## Byzantine Faults
|
||||
### Byzantine Faults
|
||||
|
||||
Fencing tokens can detect and block a node that is *inadvertently* acting in error (e.g., because it
|
||||
hasn’t yet found out that its lease has expired). However, if the node deliberately wanted to
|
||||
|
|
@ -1219,7 +1231,7 @@ arbitrary faulty or corrupted responses)—for example, it might cast multiple c
|
|||
the same election. Such behavior is known as a *Byzantine fault*, and the problem of reaching
|
||||
consensus in this untrusting environment is known as the *Byzantine Generals Problem* [^94].
|
||||
|
||||
# The Byzantine Generals Problem
|
||||
> [!TIP] THE BYZANTINE GENERALS PROBLEM
|
||||
|
||||
The Byzantine Generals Problem is a generalization of the so-called *Two Generals Problem* [^95],
|
||||
which imagines a situation in which two army generals need to agree on a battle plan. As they
|
||||
|
|
@ -1240,6 +1252,8 @@ before computers [^96].
|
|||
Lamport wanted to choose a nationality that would not offend any readers, and he was advised that
|
||||
calling it *The Albanian Generals Problem* was not such a good idea [^97].
|
||||
|
||||
--------
|
||||
|
||||
A system is *Byzantine fault-tolerant* if it continues to operate correctly even if some of the
|
||||
nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering
|
||||
with the network. This concern is relevant in certain specific circumstances. For example:
|
||||
|
|
@ -1285,7 +1299,7 @@ an attacker can compromise one node, they can probably compromise all of them, b
|
|||
probably running the same software. Thus, traditional mechanisms (authentication, access control,
|
||||
encryption, firewalls, and so on) continue to be the main protection against attackers.
|
||||
|
||||
### Weak forms of lying
|
||||
#### Weak forms of lying
|
||||
|
||||
Although we assume that nodes are generally honest, it can be worth adding mechanisms to software
|
||||
that guard against weak forms of “lying”—for example, invalid messages due to hardware issues,
|
||||
|
|
@ -1308,7 +1322,7 @@ pragmatic steps toward better reliability. For example:
|
|||
incorrect time is detected as an outlier and is excluded from synchronization [^39]. The use of multiple servers makes NTP
|
||||
more robust than if it only uses a single server.
|
||||
|
||||
## System Model and Reality
|
||||
### System Model and Reality
|
||||
|
||||
Many algorithms have been designed to solve distributed systems problems—for example, we will
|
||||
examine solutions for the consensus problem in [Chapter 10](/en/ch10#ch_consistency). In order to be useful, these
|
||||
|
|
@ -1377,7 +1391,7 @@ For modeling real systems, the partially synchronous model with crash-recovery f
|
|||
the most useful model. It allows for unbounded network delay, process pauses, and slow nodes. But
|
||||
how do distributed algorithms cope with that model?
|
||||
|
||||
### Defining the correctness of an algorithm
|
||||
#### Defining the correctness of an algorithm
|
||||
|
||||
To define what it means for an algorithm to be *correct*, we can describe its *properties*. For
|
||||
example, the output of a sorting algorithm has the property that for any two distinct elements of
|
||||
|
|
@ -1403,7 +1417,7 @@ that we assume may occur in that system model. However, if all nodes crash, or a
|
|||
suddenly become infinitely long, then no algorithm will be able to get anything done. How can we
|
||||
still make useful guarantees even in a system model that allows complete failures?
|
||||
|
||||
### Safety and liveness
|
||||
#### Safety and liveness
|
||||
|
||||
To clarify the situation, it is worth distinguishing between two different kinds of properties:
|
||||
*safety* and *liveness* properties. In the example just given, *uniqueness* and *monotonic sequence* are
|
||||
|
|
@ -1438,7 +1452,7 @@ network eventually recovers from an outage. The definition of the partially sync
|
|||
requires that eventually the system returns to a synchronous state—that is, any period of network
|
||||
interruption lasts only for a finite duration and is then repaired.
|
||||
|
||||
### Mapping system models to the real world
|
||||
#### Mapping system models to the real world
|
||||
|
||||
Safety and liveness properties and system models are very useful for reasoning about the correctness
|
||||
of a distributed algorithm. However, when implementing an algorithm in practice, the messy facts of
|
||||
|
|
@ -1469,7 +1483,7 @@ They are incredibly helpful for distilling down the complexity of real systems t
|
|||
of faults that we can reason about, so that we can understand the problem and try to solve it
|
||||
systematically.
|
||||
|
||||
## Formal Methods and Randomized Testing
|
||||
### Formal Methods and Randomized Testing
|
||||
|
||||
How do we know that an algorithm satisfies the required properties? Due to concurrency, partial
|
||||
failures, and network delays there are a huge number of potential states. We need to guarantee
|
||||
|
|
@ -1490,7 +1504,7 @@ testing (DST) use randomization to test a system in a wide range of situations.
|
|||
Amazon Web Services have successfully used a combination of these techniques on many of their
|
||||
products [^120] [^121].
|
||||
|
||||
### Model checking and specification languages
|
||||
#### Model checking and specification languages
|
||||
|
||||
*Model checkers* are tools that help verify that an algorithm or system behaves as expected. An algorithm
|
||||
specification is written in a purpose-built language such as TLA+, Gallina, or FizzBee. These
|
||||
|
|
@ -1518,7 +1532,7 @@ state space, but it risks that your specification and your implementation go out
|
|||
It is possible to check whether the model and the real implementation have equivalent behavior, but
|
||||
this requires instrumentation in the real implementation [^127].
|
||||
|
||||
### Fault injection
|
||||
#### Fault injection
|
||||
|
||||
Many bugs are triggered when machine and network failures occur. Fault injection is an effective
|
||||
(and sometimes scary) technique that verifies whether a system’s implementation works as expected things
|
||||
|
|
@ -1546,7 +1560,7 @@ simplify the process. Such frameworks come with integrations for various operati
|
|||
pre-built fault injectors [^129].
|
||||
Jepsen has been remarkably effective at finding critical bugs in many widely-used systems [^130] [^131].
|
||||
|
||||
### Deterministic simulation testing
|
||||
#### Deterministic simulation testing
|
||||
|
||||
Deterministic simulation testing (DST) has also become a popular complement to model-checking and
|
||||
fault injection. It uses a similar state space exploration process as a model checker, but it tests
|
||||
|
|
|
|||
Loading…
Reference in a new issue