mirror of https://github.com/Vonng/ddia.git synced 2026-06-22 17:37:04 +08:00

convert figure to shortcode

Feng Ruohang 2025-08-09 17:43:23 +08:00
parent aa06c85074
commit cdd205fcdb
9 changed files with 303 additions and 662 deletions


@@ -69,12 +69,7 @@ if it wasn’t replicated at all. Then users don’t have to worry about replica
 other inconsistencies. That would give us the advantage of fault tolerance, but without the
 complexity arising from having to think about multiple replicas.
-This is the idea behind *linearizability*
-[^1]
-(also known as *atomic consistency*
-[^2],
-*strong consistency*, *immediate consistency*, or *external consistency*
-[^3]).
+This is the idea behind *linearizability* [^1] (also known as *atomic consistency* [^2], *strong consistency*, *immediate consistency*, or *external consistency* [^3]).
 The exact definition of linearizability is quite subtle, and we will explore it in the rest of this
 section. But the basic idea is to make a system appear as if there were only one copy of the data,
 and all operations on it are atomic. With this guarantee, even though there may be multiple replicas
@@ -86,9 +81,8 @@ copy of the data means guaranteeing that the value read is the most recent, up-t
 doesn’t come from a stale cache or replica. In other words, linearizability is a *recency
 guarantee*. To clarify this idea, let’s look at an example of a system that is not linearizable.
-![ddia 1001](/fig/ddia_1001.png)
-###### Figure 10-1. This system is not linearizable, causing sports fans to be confused.
+{{< figure src="/fig/ddia_1001.png" id="fig_consistency_linearizability_0" title="Figure 10-1. If this database were linearizable, then either Alice's read would return 1 instead of 0, or Bob's read would return 0 instead of 1." class="w-full my-4" >}}
 [Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4].
 Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a
@@ -112,9 +106,8 @@ object *x* in a linearizable database. In distributed systems theory, *x* is cal
 practice, it could be one key in a key-value store, one row in a relational database, or one
 document in a document database, for example.
-![ddia 1002](/fig/ddia_1002.png)
-###### Figure 10-2. If a read request is concurrent with a write request, it may return either the old or the new value.
+{{< figure src="/fig/ddia_1002.png" id="fig_consistency_linearizability_1" title="Figure 10-2. Alice observes that x = 0 and y = 1, while Bob observes that x = 1 and y = 0. It's as if Alice's and Bob's computers disagree on the order in which the writes happened." class="w-full my-4" >}}
 For simplicity, [Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows only the requests from the client’s
 point of view, not the internals of the database. Each bar is a request made by a client, where the
@@ -152,9 +145,8 @@ what we expect of a system that emulates a “single copy of the data.”
 To make the system linearizable, we need to add another constraint, illustrated in
 [Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
-![ddia 1003](/fig/ddia_1003.png)
-###### Figure 10-3. After any one read has returned the new value, all following reads (on the same or other clients) must also return the new value.
+{{< figure src="/fig/ddia_1003.png" id="fig_consistency_linearizability_2" title="Figure 10-3. If Alice and Bob had perfect clocks, linearizability would require that x = 1 is returned, since the read of x begins after the write x = 1 completes." class="w-full my-4" >}}
 In a linearizable system we imagine that there must be some point in time (between the start and end
 of the write operation) at which the value of *x* atomically flips from 0 to 1. Thus, if one
@@ -189,9 +181,8 @@ forward in time (from left to right), never backward. This requirement ensures t
 discussed earlier: once a new value has been written or read, all subsequent reads see the value
 that was written, until it is overwritten again.
-![ddia 1004](/fig/ddia_1004.png)
-###### Figure 10-4. Visualizing the points in time at which the reads and writes appear to have taken effect. The final read by B is not linearizable.
+{{< figure src="/fig/ddia_1004.png" id="fig_consistency_linearizability_3" title="Figure 10-4. The read of x is concurrent with the write x = 1. Since we don't know the exact timing of the operations, the read is allowed to return either 0 or 1." class="w-full my-4" >}}
 There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3):
@@ -349,9 +340,8 @@ for small messages, and a video may be many megabytes in size. Instead, the vide
 to a file storage service, and once the write is complete, the instruction to the transcoder is
 placed on the queue.
-![ddia 1005](/fig/ddia_1005.png)
-###### Figure 10-5. The web server and video transcoder communicate both through file storage and a message queue, opening the potential for race conditions.
+{{< figure src="/fig/ddia_1005.png" id="fig_consistency_transcoder" title="Figure 10-5. A system that is not linearizable: Alice and Bob see the uploaded image at different times, and thus Bob's request is based on stale data." class="w-full my-4" >}}
 If the file storage service is linearizable, then this system should work fine. If it is not
 linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in
@@ -440,9 +430,8 @@ Intuitively, it seems as though quorum reads and writes should be linearizable i
 Dynamo-style model. However, when we have variable network delays, it is possible to have race
 conditions, as demonstrated in [Figure 10-6](/en/ch10#fig_consistency_leaderless).
-![ddia 1006](/fig/ddia_1006.png)
-###### Figure 10-6. A nonlinearizable execution, despite using a quorum.
+{{< figure src="/fig/ddia_1006.png" id="fig_consistency_leaderless" title="Figure 10-6. Quorums are not sufficient to ensure linearizability if network delays are variable." class="w-full my-4" >}}
 In [Figure 10-6](/en/ch10#fig_consistency_leaderless), the initial value of *x* is 0, and a writer client is updating
 *x* to 1 by sending the write to all three replicas (*n* = 3, *w* = 3).
@@ -459,8 +448,7 @@ It is possible to make Dynamo-style quorums linearizable at the cost of reduced
 performance: a reader must perform read repair (see [“Catching up on missed writes”](/en/ch6#sec_replication_read_repair)) synchronously,
 before returning results to the application [^24].
 Moreover, before writing, a writer must read the latest state of a quorum of nodes to fetch the
-latest timestamp of any prior write, and ensure that the new write has a greater timestamp
-[^25] [^26].
+latest timestamp of any prior write, and ensure that the new write has a greater timestamp [^25] [^26].
 However, Riak does not perform synchronous read repair due to the performance penalty.
 Cassandra does wait for read repair to complete on quorum reads [^27],
 but it loses linearizability due to its use of time-of-day clocks for timestamps.
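
The two rules in the hunk above (synchronous read repair on reads, and reading the latest timestamp from a quorum before writing) can be sketched roughly as follows. This is a minimal in-memory illustration, not how Riak or Cassandra implement it: `Replica`, `linearizable_read`, `linearizable_write`, and the `reachable` index list are hypothetical names, timestamps are plain integers rather than (counter, writer ID) pairs, and concurrent writers are not modeled.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    """One storage node holding a single register as a (timestamp, value) pair."""
    timestamp: int = 0
    value: int = 0

    def read(self):
        return self.timestamp, self.value

    def write(self, timestamp, value):
        # Keep only the newest version; older or duplicate writes are ignored.
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value


def pick_quorum(replicas, reachable):
    """Return the reachable replicas, insisting on a majority (assumed quorum rule)."""
    if len(reachable) <= len(replicas) // 2:
        raise RuntimeError("no quorum reachable")
    return [replicas[i] for i in reachable]


def linearizable_read(replicas, reachable):
    """Read from a quorum, then repair that quorum before returning the value."""
    quorum = pick_quorum(replicas, reachable)
    timestamp, value = max((r.read() for r in quorum), key=lambda tv: tv[0])
    for r in quorum:  # synchronous read repair
        r.write(timestamp, value)
    return value


def linearizable_write(replicas, reachable, value):
    """Fetch the latest timestamp from a quorum, then write with a greater one."""
    quorum = pick_quorum(replicas, reachable)
    latest = max(r.read()[0] for r in quorum)
    for r in quorum:
        r.write(latest + 1, value)


replicas = [Replica() for _ in range(3)]
linearizable_write(replicas, [0, 1], 1)      # replica 2 misses the write
print(linearizable_read(replicas, [1, 2]))   # the read repairs replica 2 first
print(replicas[2].read())
```

Because the reader writes the newest version back to its whole quorum before returning, any later quorum read overlaps with at least one repaired replica and cannot observe the older value.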
@@ -481,9 +469,8 @@ example, we saw that multi-leader replication is often a good choice for multi-r
 replication (see [“Geographically Distributed Operation”](/en/ch6#sec_replication_multi_dc)). An example of such a deployment is illustrated in
 [Figure 10-7](/en/ch10#fig_consistency_cap_availability).
-![ddia 1007](/fig/ddia_1007.png)
-###### Figure 10-7. A network interruption forcing a choice between linearizability and availability.
+{{< figure src="/fig/ddia_1007.png" id="fig_consistency_cap_availability" title="Figure 10-7. If clients cannot contact enough replicas due to a network partition, they cannot process writes." class="w-full my-4" >}}
 Consider what happens if there is a network interruption between the two regions. Let’s assume
 that the network within each region is working, and clients can reach their local region, but the
@@ -502,8 +489,7 @@ If the network between regions is interrupted in a single-leader setup, clients
 follower regions cannot contact the leader, so they cannot make any writes to the database, nor
 any linearizable reads. They can still make reads from the follower, but they might be stale
 (nonlinearizable). If the application requires linearizable reads and writes, the network
-interruption causes the application to become unavailable in the regions that cannot contact the
-leader.
+interruption causes the application to become unavailable in the regions that cannot contact the leader.
 If clients can connect directly to the leader region, this is not a problem, since the
 application continues to work normally there. But clients that can only reach a follower region
@@ -519,20 +505,16 @@ The trade-off is as follows:
 * If your application *requires* linearizability, and some replicas are disconnected from the other
 replicas due to a network problem, then some replicas cannot process requests while they are
 disconnected: they must either wait until the network problem is fixed, or return an error (either
-way, they become *unavailable*). This choice is sometimes known as *CP* (consistent under network
-partitions).
+way, they become *unavailable*). This choice is sometimes known as *CP* (consistent under network partitions).
 * If your application *does not require* linearizability, then it can be written in a way that each
 replica can process requests independently, even if it is disconnected from other replicas (e.g.,
 multi-leader). In this case, the application can remain *available* in the face of a network
-problem, but its behavior is not linearizable. This choice is known as *AP* (available under
-network partitions).
+problem, but its behavior is not linearizable. This choice is known as *AP* (available under network partitions).
 Thus, applications that don’t require linearizability can be more tolerant of network problems. This
-insight is popularly known as the *CAP theorem*
-[^29] [^30] [^31] [^32],
+insight is popularly known as the *CAP theorem* [^29] [^30] [^31] [^32],
 named by Eric Brewer in 2000, although the trade-off had been known to designers of
-distributed databases since the 1970s
-[^33] [^34] [^35].
+distributed databases since the 1970s [^33] [^34] [^35].
 CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of
 starting a discussion about trade-offs in databases. At the time, many distributed databases
@@ -552,8 +534,7 @@ or not.
 At times when the network is working correctly, a system can provide both consistency
 (linearizability) and total availability. When a network fault occurs, you have to choose between
 either linearizability or total availability. Thus, a better way of phrasing CAP would be
-*either Consistent or Available when Partitioned*
-[^37].
+*either Consistent or Available when Partitioned* [^37].
 A more reliable network needs to make this choice less often, but at some point the choice is
 inevitable.
@@ -570,24 +551,19 @@ understand systems better, so CAP is best avoided.
 The CAP theorem as formally defined [^30] is of
 very narrow scope: it only considers one consistency model (namely linearizability) and one kind of
-fault (network partitions, which according to data from Google are the cause of less than 8% of
-incidents [^41]).
+fault (network partitions, which according to data from Google are the cause of less than 8% of incidents [^41]).
 It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP
-has been historically influential, it has little practical value for designing systems
-[^4] [^38].
+has been historically influential, it has little practical value for designing systems [^4] [^38].
 There have been efforts to generalize CAP. For example, the *PACELC principle* observes that system
 designers might also choose to weaken consistency at times when the network is working fine in order
 to reduce latency [^39] [^40] [^42].
-Thus, during a network partition (P), we need to choose between availability (A) and consistency
-(C); else (E), when there is no partition, we may choose between low latency (L) and
-consistency (C). However, this definition inherits several problems with CAP, such as the
-counterintuitive definitions of consistency and availability.
+Thus, during a network partition (P), we need to choose between availability (A) and consistency (C);
+else (E), when there is no partition, we may choose between low latency (L) and consistency (C).
+However, this definition inherits several problems with CAP, such as the counterintuitive definitions of consistency and availability.
-There are many more interesting impossibility results in distributed systems [^43],
-and CAP has now been superseded by more precise results
-[^44] [^45],
-so it is of mostly historical interest today.
+There are many more interesting impossibility results in distributed systems [^43], and CAP has now been
+superseded by more precise results [^44] [^45], so it is of mostly historical interest today.
 ### Linearizability and network delays
@@ -595,8 +571,7 @@ Although linearizability is a useful guarantee, surprisingly few systems are act
 in practice. For example, even RAM on a modern multi-core CPU is not linearizable [^46]:
 if a thread running on one CPU core writes to a memory address, and a thread on another CPU core
 reads the same address shortly afterward, it is not guaranteed to read the value written by the
-first thread (unless a *memory barrier* or *fence*
-[^47] is used).
+first thread (unless a *memory barrier* or *fence* [^47] is used).
 The reason for this behavior is that every CPU core has its own memory cache and store buffer.
 Memory access first goes to the cache by default, and any changes are asynchronously written out to
@@ -615,8 +590,7 @@ they do so primarily to increase performance, not so much for fault tolerance [^
 Linearizability is slow—and this is true all the time, not only during a network fault.
 Can’t we maybe find a more efficient implementation of linearizable storage? It seems the answer is
-no: Attiya and Welch [^49]
-prove that if you want linearizability, the response time of read and write requests is at least
+no: Attiya and Welch [^49] prove that if you want linearizability, the response time of read and write requests is at least
 proportional to the uncertainty of delays in the network. In a network with highly variable delays,
 like most computer networks (see [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing)), the response time of linearizable
 reads and writes is inevitably going to be high. A faster algorithm for linearizability does not
@@ -639,9 +613,8 @@ display the messages in order of increasing ID, and the resulting chat threads w
 Aaliyah posts a question that is assigned ID 1, and Bryce’s answer to the question is assigned a
 greater ID, namely 3.
-![ddia 1008](/fig/ddia_1008.png)
-###### Figure 10-8. An ID generator that assigns auto-incrementing integer IDs to messages in a chat application.
+{{< figure src="/fig/ddia_1008.png" id="fig_consistency_id_generator" title="Figure 10-8. Two different nodes may generate conflicting IDs." class="w-full my-4" >}}
 This single-node ID generator is another example of a linearizable system. Each request to fetch the
 ID is an operation that atomically increments a counter and returns the old counter value (a
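
The single-node ID generator described in the context lines above amounts to an atomic increment that returns the previous counter value. A minimal sketch, assuming a single process that guards the counter with a lock; the class name `IdGenerator` and the starting value of 1 are illustrative choices, not taken from the book:

```python
import threading


class IdGenerator:
    """Single-node ID generator: each call atomically takes the next integer."""

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def fetch_and_increment(self):
        # The lock makes read-then-increment one indivisible step, so no two
        # callers can ever be handed the same ID.
        with self._lock:
            old = self._next
            self._next += 1
            return old


generator = IdGenerator()
question_id = generator.fetch_and_increment()  # first caller
answer_id = generator.fetch_and_increment()    # a later caller gets a greater ID
```

Since every request passes through the same lock, a request that starts after another has finished always receives a greater ID, which is the property the chat example relies on.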
@@ -755,9 +728,8 @@ operations it has processed. A Lamport timestamp is then simply a pair of (*coun
 Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp,
 each timestamp is made unique.
-![ddia 1009](/fig/ddia_1009.png)
-###### Figure 10-9. Lamport timestamps provide a total ordering consistent with causality.
+{{< figure src="/fig/ddia_1009.png" id="fig_consistency_lamport_ts" title="Figure 10-9. Lamport timestamps provide a total ordering consistent with causality." class="w-full my-4" >}}
 Every time a node generates a timestamp, it increments its counter value and uses the new value.
 Moreover, every time a node sees a timestamp from another node, if the counter value in that
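
The Lamport timestamp rules in the hunk above can be illustrated with a short sketch. The hunk cuts off before the rule for observed timestamps is complete, so the `observe` method below assumes the standard Lamport rule (raise the local counter to the greater observed counter). The class and method names are hypothetical.

```python
class LamportClock:
    """Generates (counter, node_id) timestamps for one node."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        """Generate a timestamp: increment the counter and use the new value."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        """On seeing another node's timestamp, never stay behind its counter.

        Assumption: standard Lamport rule, since the text is cut off here.
        """
        counter, _node_id = timestamp
        self.counter = max(self.counter, counter)


a = LamportClock("A")
b = LamportClock("B")
t1 = a.tick()      # (1, "A")
b.observe(t1)      # B learns about A's timestamp
t2 = b.tick()      # (2, "B")
assert t1 < t2     # tuples compare by counter first, then by node ID
```

Comparing the pairs as tuples gives the total order: the counter decides first, and the node ID breaks ties between nodes that happen to share a counter value.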
@@ -843,9 +815,8 @@ account settings to private. Then A uses their phone to upload the photo. Since
 updates in sequence, they might reasonably expect the photo upload to be subject to the new,
 restricted account permissions.
-![ddia 1010](/fig/ddia_1010.png)
-###### Figure 10-10. User A first sets their account to private, then shares a photo. With a non-linearizable ID generator, an unauthorized viewer may see the photo.
+{{< figure src="/fig/ddia_1010.png" id="fig_consistency_permissions" title="Figure 10-10. An example of a permission system using Lamport timestamps." class="w-full my-4" >}}
 The account permission and the photo are stored in two separate databases (or separate shards of the
 same database), and let’s assume they use a Lamport clock or hybrid logical clock to assign a
@@ -944,27 +915,20 @@ node, but which get a lot harder if you want fault tolerance:
 It turns out that all of these are instances of the same fundamental distributed systems problem:
 *consensus*. Consensus is one of the most important and fundamental problems in distributed
-computing; it is also infamously difficult to get right
-[^58] [^59],
+computing; it is also infamously difficult to get right [^58] [^59],
 and many systems have got it wrong in the past. Now that we have discussed replication
 ([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
 linearizability (this chapter), we are finally ready to tackle the consensus problem.
-The best-known consensus algorithms are Viewstamped Replication
-[^60] [^61],
-Paxos [^58] [^62] [^63] [^64],
-Raft [^23] [^65] [^66],
-and Zab [^18] [^22] [^67].
-There are quite a few similarities between these algorithms, but they are not the same
-[^68] [^69].
+The best-known consensus algorithms are Viewstamped Replication [^60] [^61], Paxos [^58] [^62] [^63] [^64],
+Raft [^23] [^65] [^66], and Zab [^18] [^22] [^67]. There are quite a few similarities between these algorithms, but they are not the same [^68] [^69].
 These algorithms work in a non-Byzantine system model: that is, network communication may be
 arbitrarily delayed or dropped, and nodes may crash, restart, and become disconnected, but the
 algorithms assume that nodes otherwise follow the protocol correctly and do not behave maliciously.
 There are also consensus algorithms that can tolerate some Byzantine nodes, i.e., nodes that don’t
 correctly follow the protocol (for example, by sending contradictory messages to other nodes). A
-common assumption is that fewer than one-third of the nodes are Byzantine-faulty
-[^26] [^70].
+common assumption is that fewer than one-third of the nodes are Byzantine-faulty [^26] [^70].
 Such *Byzantine fault tolerant* (BFT) consensus algorithms are used in blockchains [^71].
 However, as explained in [“Byzantine Faults”](/en/ch9#sec_distributed_byzantine), BFT algorithms are beyond the scope of this
 book.
@@ -981,10 +945,8 @@ Firstly, FLP doesn’t say that we can never reach consensus—it only says that
 a consensus algorithm will *always* terminate. Moreover, the FLP result is proved assuming a
 deterministic algorithm in the asynchronous system model (see [“System Model and Reality”](/en/ch9#sec_distributed_system_model)),
 which means the algorithm cannot use any clocks or timeouts. If it can use timeouts to suspect that
-another node may have crashed (even if the suspicion is sometimes wrong), then consensus becomes
-solvable [^73].
-Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility
-result [^74].
+another node may have crashed (even if the suspicion is sometimes wrong), then consensus becomes solvable [^73].
+Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility result [^74].
 Thus, although the FLP result about the impossibility of consensus is of great theoretical
 importance, distributed systems can usually achieve consensus in practice.
@@ -1103,10 +1065,8 @@ name has not been created or modified by another client since the current client
 However, a linearizable read-write register is not sufficient to solve consensus. The FLP result
 tells us that consensus cannot be solved by a deterministic algorithm in the asynchronous crash-stop
-model [^72], but we saw in
-[“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum
-reads/writes in this model [^24] [^25] [^26].
-From this it follows that a linearizable register cannot solve consensus.
+model [^72], but we saw in [“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum
+reads/writes in this model [^24] [^25] [^26]. From this it follows that a linearizable register cannot solve consensus.
 ### Shared logs as consensus
@@ -1304,8 +1264,7 @@ A shared log is also powerful because it can easily be adapted to other forms of
 * If you want an atomic fetch-and-add, put the number to add to the counter in a log entry, and the
 current counter value is the sum of all of the log entries so far. A simple counter on log entries
 can be used to generate fencing tokens (see [“Fencing off zombies and delayed requests”](/en/ch9#sec_distributed_fencing_tokens)); for example, in
-ZooKeeper, this sequence number is called `zxid`
-[^18].
+ZooKeeper, this sequence number is called `zxid` [^18].
 ### From single-leader replication to consensus
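
The fetch-and-add bullet in the hunk above can be sketched with an in-memory stand-in for the shared log. `SharedLog`, `fetch_and_add`, and `current_value` are hypothetical names. A real shared log would be replicated through consensus, which this toy list is not.

```python
class SharedLog:
    """Toy stand-in for a shared log: entries are totally ordered by position."""

    def __init__(self):
        self.entries = []

    def append(self, entry):
        """Append an entry and return its position in the log."""
        self.entries.append(entry)
        return len(self.entries) - 1


def fetch_and_add(log, delta):
    """Append the number to add; return the counter value before this addition."""
    position = log.append(delta)
    return sum(log.entries[:position])


def current_value(log):
    """The counter value is the sum of all log entries so far."""
    return sum(log.entries)


log = SharedLog()
before_first = fetch_and_add(log, 5)   # counter was 0, is now 5
before_second = fetch_and_add(log, 3)  # counter was 5, is now 8
total = current_value(log)
```

The position returned by `append` only ever grows, so in this sketch it could also play the role of the simple counter on log entries that the bullet mentions for fencing tokens.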
@ -1340,8 +1299,7 @@ leader with the higher epoch number prevails.
Before a leader is allowed to append the next entry to the shared log, it must first check that Before a leader is allowed to append the next entry to the shared log, it must first check that
there isnt some other leader with a higher epoch number which might append a different entry. It there isnt some other leader with a higher epoch number which might append a different entry. It
can do this by collecting votes from a quorum of nodes—typically, but not always, a majority of can do this by collecting votes from a quorum of nodes—typically, but not always, a majority of
nodes [^85]. nodes [^85]. A node votes yes only if it is not aware of any other leader with a higher epoch.
A node votes yes only if it is not aware of any other leader with a higher epoch.
Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s
proposal for the next entry to append to the log. The quorums for those two votes must overlap: if proposal for the next entry to append to the log. The quorums for those two votes must overlap: if
@ -1436,8 +1394,7 @@ terrible performance as the system can end up spending more time choosing leader
work. work.
Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft
has been shown to have unpleasant edge cases has been shown to have unpleasant edge cases [^88] [^89]:
[^88] [^89]:
if the entire network is working correctly except for one particular network link that is if the entire network is working correctly except for one particular network link that is
consistently unreliable, Raft can get into situations where leadership continually bounces between consistently unreliable, Raft can get into situations where leadership continually bounces between
two nodes, or the current leader is continually forced to resign, so the system effectively never two nodes, or the current leader is continually forced to resign, so the system effectively never
@ -1463,8 +1420,7 @@ in the background. Coordination services are designed to hold small amounts of d
entirely in memory (although they still write to disk for durability), which is replicated across entirely in memory (although they still write to disk for durability), which is replicated across
multiple nodes using a fault-tolerant consensus algorithm. multiple nodes using a fault-tolerant consensus algorithm.
Coordination services are modeled after Google’s Chubby lock service Coordination services are modeled after Google’s Chubby lock service [^17] [^58].
[^17] [^58].
They combine a consensus algorithm with several other features that turn out to be particularly They combine a consensus algorithm with several other features that turn out to be particularly
useful when building distributed systems: useful when building distributed systems:
@ -1540,8 +1496,7 @@ Normally, the kind of data managed by a coordination service is quite slow-chang
information like “the node running on IP address 10.1.1.23 is the leader for shard 7,” and such information like “the node running on IP address 10.1.1.23 is the leader for shard 7,” and such
assignments usually change on a timescale of minutes or hours. Coordination services are not assignments usually change on a timescale of minutes or hours. Coordination services are not
intended for storing data that may change thousands of times per second. For that, it is better to intended for storing data that may change thousands of times per second. For that, it is better to
use a conventional database; alternatively, tools like Apache BookKeeper use a conventional database; alternatively, tools like Apache BookKeeper [^90] [^91]
[^90] [^91]
can be used to replicate fast-changing internal state of a service. can be used to replicate fast-changing internal state of a service.
### Service discovery ### Service discovery


@ -56,9 +56,7 @@ Barack Obama have over 100 million followers).
Imagine we keep all of the data in a relational database as shown in [Figure 2-1](/en/ch2#fig_twitter_relational). We Imagine we keep all of the data in a relational database as shown in [Figure 2-1](/en/ch2#fig_twitter_relational). We
have one table for users, one table for posts, and one table for follow relationships. have one table for users, one table for posts, and one table for follow relationships.
![ddia 0201](/fig/ddia_0201.png) {{< figure src="/fig/ddia_0201.png" id="fig_twitter_relational" title="Figure 2-1. Simple relational schema for a social network in which users can follow each other." class="w-full my-4" >}}
###### Figure 2-1. Simple relational schema for a social network in which users can follow each other.
Let’s say the main read operation that our social network must support is the *home timeline*, which Let’s say the main read operation that our social network must support is the *home timeline*, which
displays recent posts by people you are following (for simplicity we will ignore ads, suggested displays recent posts by people you are following (for simplicity we will ignore ads, suggested
@ -111,9 +109,7 @@ because the home timelines are derived data that needs to be updated. The proces
carried out, we use the term *fan-out* to describe the factor by which the number of requests carried out, we use the term *fan-out* to describe the factor by which the number of requests
increases. increases.
![ddia 0202](/fig/ddia_0202.png) {{< figure src="/fig/ddia_0202.png" id="fig_twitter_timelines" title="Figure 2-2. Fan-out: delivering new posts to every follower of the user who made the post." class="w-full my-4" >}}
###### Figure 2-2. Fan-out: delivering new posts to every follower of the user who made the post.
At a rate of 5,700 posts posted per second, if the average post reaches 200 followers (i.e., a At a rate of 5,700 posts posted per second, if the average post reaches 200 followers (i.e., a
fan-out factor of 200), we will need to do just over 1 million home timeline writes per second. This fan-out factor of 200), we will need to do just over 1 million home timeline writes per second. This
@ -171,9 +167,7 @@ the process of handling an earlier request, and therefore the incoming request n
the earlier request has been completed. As throughput approaches the maximum that the hardware can the earlier request has been completed. As throughput approaches the maximum that the hardware can
handle, queueing delays increase sharply. handle, queueing delays increase sharply.
![ddia 0203](/fig/ddia_0203.png) {{< figure src="/fig/ddia_0203.png" id="fig_throughput" title="Figure 2-3. As the throughput of a service approaches its capacity, the response time increases dramatically due to queueing." class="w-full my-4" >}}
###### Figure 2-3. As the throughput of a service approaches its capacity, the response time increases dramatically due to queueing.
# When an overloaded system won’t recover # When an overloaded system won’t recover
@ -217,9 +211,7 @@ terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)
i.e., during which it is *latent*. In particular, *network latency* or *network delay* refers to i.e., during which it is *latent*. In particular, *network latency* or *network delay* refers to
the time that request and response spend traveling through the network. the time that request and response spend traveling through the network.
![ddia 0204](/fig/ddia_0204.png) {{< figure src="/fig/ddia_0204.png" id="fig_response_time" title="Figure 2-4. Response time, service time, network latency, and queueing delay." class="w-full my-4" >}}
###### Figure 2-4. Response time, service time, network latency, and queueing delay.
In [Figure 2-4](/en/ch2#fig_response_time), time flows from left to right, each communicating node is shown as a In [Figure 2-4](/en/ch2#fig_response_time), time flows from left to right, each communicating node is shown as a
horizontal line, and a request or response message is shown as a thick diagonal arrow from one node horizontal line, and a request or response message is shown as a thick diagonal arrow from one node
@ -247,9 +239,7 @@ gray bar represents a request to a service, and its height shows how long that r
requests are reasonably fast, but there are occasional *outliers* that take much longer. requests are reasonably fast, but there are occasional *outliers* that take much longer.
Variation in network delay is also known as *jitter*. Variation in network delay is also known as *jitter*.
![ddia 0205](/fig/ddia_0205.png) {{< figure src="/fig/ddia_0205.png" id="fig_lognormal" title="Figure 2-5. Illustrating mean and percentiles: response times for a sample of 100 requests to a service." class="w-full my-4" >}}
###### Figure 2-5. Illustrating mean and percentiles: response times for a sample of 100 requests to a service.
It’s common to report the *average* response time of a service (technically, the *arithmetic mean*: It’s common to report the *average* response time of a service (technically, the *arithmetic mean*:
that is, sum all the response times, and divide by the number of requests). The mean response time that is, sum all the response times, and divide by the number of requests). The mean response time
@ -322,9 +312,7 @@ increases if an end-user request requires multiple backend calls, and so a highe
end-user requests end up being slow (an effect known as *tail latency amplification* end-user requests end up being slow (an effect known as *tail latency amplification*
[^26]). [^26]).
![ddia 0206](/fig/ddia_0206.png) {{< figure src="/fig/ddia_0206.png" id="fig_tail_amplification" title="Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request." class="w-full my-4" >}}
###### Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.
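As a rough worked example of tail latency amplification, the sketch below computes how often an end-user request is slow when it must wait for several backend calls. It assumes the backend calls are independent and that one slow call slows down the whole request; `fraction_slow` is a hypothetical helper, and the 1% figure is illustrative rather than taken from the text.

```python
def fraction_slow(p_slow_backend: float, n_calls: int) -> float:
    """Probability that at least one of n independent backend calls is slow."""
    return 1.0 - (1.0 - p_slow_backend) ** n_calls


if __name__ == "__main__":
    # Illustrative assumption: 1% of individual backend calls are slow.
    for n in (1, 10, 100):
        print(f"{n:>3} backend calls -> {fraction_slow(0.01, n):.1%} of requests slow")
```

Under these assumptions, a request that fans out to 100 backends is slow roughly 63% of the time, even though each individual call is slow only 1% of the time.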
Percentiles are often used in *service level objectives* (SLOs) and *service level agreements* Percentiles are often used in *service level objectives* (SLOs) and *service level agreements*
(SLAs) as ways of defining the expected performance and availability of a service [^27]. (SLAs) as ways of defining the expected performance and availability of a service [^27].
@ -423,16 +411,12 @@ cured, as described in the following sections.
When we think of causes of system failure, hardware faults quickly come to mind: When we think of causes of system failure, hardware faults quickly come to mind:
* Approximately 2–5% of magnetic hard drives fail per year [^40] [^41]; * Approximately 2–5% of magnetic hard drives fail per year [^40] [^41]; in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
Recent data suggests that disks are getting more reliable, but failure rates remain significant [^42]. Recent data suggests that disks are getting more reliable, but failure rates remain significant [^42].
* Approximately 0.5–1% of solid state drives (SSDs) fail per year [^43]. * Approximately 0.5–1% of solid state drives (SSDs) fail per year [^43]. Small numbers of bit errors are corrected automatically [^44], but uncorrectable errors occur approximately once per year per drive, even in drives that are
Small numbers of bit errors are corrected automatically [^44],
but uncorrectable errors occur approximately once per year per drive, even in drives that are
fairly new (i.e., that have experienced little wear); this error rate is higher than that of fairly new (i.e., that have experienced little wear); this error rate is higher than that of
magnetic hard drives [^45], [^46]. magnetic hard drives [^45], [^46].
* Other hardware components such as power supplies, RAID controllers, and memory modules also fail, * Other hardware components such as power supplies, RAID controllers, and memory modules also fail, although less frequently than hard drives [^47] [^48].
although less frequently than hard drives [^47] [^48].
* Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result, * Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result,
likely due to manufacturing defects [^49] [^50] [^51]. In some cases, an erroneous computation leads to a crash, but in other cases it leads to a program simply returning the wrong result. likely due to manufacturing defects [^49] [^50] [^51]. In some cases, an erroneous computation leads to a crash, but in other cases it leads to a program simply returning the wrong result.
* Data in RAM can also be corrupted, either due to random events such as cosmic rays, or due to * Data in RAM can also be corrupted, either due to random events such as cosmic rays, or due to


@ -165,9 +165,7 @@ representing such *one-to-many relationships* is to put positions, education, an
information in separate tables, with a foreign key reference to the `users` table, as in information in separate tables, with a foreign key reference to the `users` table, as in
[Figure 3-1](/en/ch3#fig_obama_relational). [Figure 3-1](/en/ch3#fig_obama_relational).
![ddia 0301](/fig/ddia_0301.png) {{< figure src="/fig/ddia_0301.png" id="fig_obama_relational" title="Figure 3-1. Representing a LinkedIn profile using a relational schema." class="w-full my-4" >}}
###### Figure 3-1. Representing a LinkedIn profile using a relational schema.
Another way of representing the same information, which is perhaps more natural and maps more Another way of representing the same information, which is perhaps more natural and maps more
closely to an object structure in application code, is as a JSON document as shown in closely to an object structure in application code, is as a JSON document as shown in
@ -214,9 +212,7 @@ The one-to-many relationships from the user profile to the user’s positions, e
contact information imply a tree structure in the data, and the JSON representation makes this tree contact information imply a tree structure in the data, and the JSON representation makes this tree
structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)). structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
![ddia 0302](/fig/ddia_0302.png) {{< figure src="/fig/ddia_0302.png" id="fig_json_tree" title="Figure 3-2. One-to-many relationships forming a tree structure." class="w-full my-4" >}}
###### Figure 3-2. One-to-many relationships forming a tree structure.
> [!NOTE] > [!NOTE]
> This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé typically has a small number of positions [^9] [^10]. > This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé typically has a small number of positions [^9] [^10].
@ -388,9 +384,7 @@ an organization has several past or present employees). In a relational model, s
is usually represented as an *associative table* or *join table*, as shown in is usually represented as an *associative table* or *join table*, as shown in
[Figure 3-3](/en/ch3#fig_datamodels_m2m_rel): each position associates one user ID with one organization ID. [Figure 3-3](/en/ch3#fig_datamodels_m2m_rel): each position associates one user ID with one organization ID.
![ddia 0303](/fig/ddia_0303.png) {{< figure src="/fig/ddia_0303.png" id="fig_datamodels_m2m_rel" title="Figure 3-3. Many-to-many relationships in the relational model." class="w-full my-4" >}}
###### Figure 3-3. Many-to-many relationships in the relational model.
Many-to-one and many-to-many relationships do not easily fit within one self-contained JSON Many-to-one and many-to-many relationships do not easily fit within one self-contained JSON
document; they lend themselves more to a normalized representation. In a document model, one document; they lend themselves more to a normalized representation. In a document model, one
@ -414,9 +408,7 @@ documents.
} }
``` ```
![ddia 0304](/fig/ddia_0304.png) {{< figure src="/fig/ddia_0304.png" id="fig_datamodels_many_to_many" title="Figure 3-4. Many-to-many relationships in the document model: the data within each dotted box can be grouped into one document." class="w-full my-4" >}}
###### Figure 3-4. Many-to-many relationships in the document model: the data within each dotted box can be grouped into one document.
Many-to-many relationships often need to be queried in “both directions”: for example, finding all Many-to-many relationships often need to be queried in “both directions”: for example, finding all
of the organizations that a particular person has worked for, and finding all of the people who have of the organizations that a particular person has worked for, and finding all of the people who have
@ -450,9 +442,7 @@ retailer. At the center of the schema is a so-called *fact table* (in this examp
(here, each row represents a customer’s purchase of a product). If we were analyzing website traffic (here, each row represents a customer’s purchase of a product). If we were analyzing website traffic
rather than retail sales, each row might represent a page view or a click by a user. rather than retail sales, each row might represent a page view or a click by a user.
![ddia 0305](/fig/ddia_0305.png) {{< figure src="/fig/ddia_0305.png" id="fig_dwh_schema" title="Figure 3-5. Example of a star schema for use in a data warehouse." class="w-full my-4" >}}
###### Figure 3-5. Example of a star schema for use in a data warehouse.
Usually, facts are captured as individual events, because this allows maximum flexibility of Usually, facts are captured as individual events, because this allows maximum flexibility of
analysis later. However, this means that the fact table can become extremely large. A big enterprise analysis later. However, this means that the fact table can become extremely large. A big enterprise
@ -775,9 +765,7 @@ are married and living in London. Each person and each location is represented a
relationships between them as edges. This example will help demonstrate some queries that are easy relationships between them as edges. This example will help demonstrate some queries that are easy
in graph databases, but difficult in other models. in graph databases, but difficult in other models.
![ddia 0306](/fig/ddia_0306.png) {{< figure src="/fig/ddia_0306.png" id="fig_datamodels_graph" title="Figure 3-6. Example of graph-structured data (boxes represent vertices, arrows represent edges)." class="w-full my-4" >}}
###### Figure 3-6. Example of graph-structured data (boxes represent vertices, arrows represent edges).
## Property Graphs ## Property Graphs
@ -1271,7 +1259,7 @@ before if you’ve studied computer science.
##### Example 3-12. The same query as [Example 3-5](/en/ch3#fig_cypher_query), expressed in Datalog ##### Example 3-12. The same query as [Example 3-5](/en/ch3#fig_cypher_query), expressed in Datalog
``` ```sql
within_recursive(LocID, PlaceName) :- location(LocID, PlaceName, _). /* Rule 1 */ within_recursive(LocID, PlaceName) :- location(LocID, PlaceName, _). /* Rule 1 */
within_recursive(LocID, PlaceName) :- within(LocID, ViaID), /* Rule 2 */ within_recursive(LocID, PlaceName) :- within(LocID, ViaID), /* Rule 2 */
@ -1320,9 +1308,9 @@ One possible way of applying the rules is thus (and as illustrated in [Figure 3
By repeated application of rules 1 and 2, the `within_recursive` virtual table can tell us all the By repeated application of rules 1 and 2, the `within_recursive` virtual table can tell us all the
locations in North America (or any other location) contained in our database. locations in North America (or any other location) contained in our database.
![ddia 0307](/fig/ddia_0307.png) {{< figure link="#fig_datalog_query" src="/fig/ddia_0307.png" id="fig_datalog_naive" title="Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from Example 3-12." class="w-full my-4" >}}
###### Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](/en/ch3#fig_datalog_query). > Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](/en/ch3#fig_datalog_query).
Now rule 3 can find people who were born in some location `BornIn` and live in some location Now rule 3 can find people who were born in some location `BornIn` and live in some location
`LivingIn`. Rule 4 invokes rule 3 with `BornIn = 'United States'` and `LivingIn`. Rule 4 invokes rule 3 with `BornIn = 'United States'` and
@ -1474,9 +1462,7 @@ and so on. Reservations may also be cancelled, and meanwhile, the conference org
the capacity of the event by moving it to a different room. With all of this going on, simply the capacity of the event by moving it to a different room. With all of this going on, simply
calculating the number of available seats becomes a challenging query. calculating the number of available seats becomes a challenging query.
![ddia 0308](/fig/ddia_0308.png) {{< figure src="/fig/ddia_0308.png" id="fig_event_sourcing" title="Figure 3-8. Using a log of immutable events as source of truth, and deriving materialized views from it." class="w-full my-4" >}}
###### Figure 3-8. Using a log of immutable events as source of truth, and deriving materialized views from it.
In [Figure 3-8](/en/ch3#fig_event_sourcing), every change to the state of the conference (such as the organizer In [Figure 3-8](/en/ch3#fig_event_sourcing), every change to the state of the conference (such as the organizer
opening registrations, or attendees making and cancelling registrations) is first stored as an opening registrations, or attendees making and cancelling registrations) is first stored as an
@ -1617,9 +1603,7 @@ is no data for many user-movie combinations, but this is fine. This matrix may h
of columns and would therefore not fit well in a relational database, but dataframes and libraries of columns and would therefore not fit well in a relational database, but dataframes and libraries
that offer sparse arrays (such as NumPy for Python) can handle such data easily. that offer sparse arrays (such as NumPy for Python) can handle such data easily.
![ddia 0309](/fig/ddia_0309.png) {{< figure src="/fig/ddia_0309.png" id="fig_dataframe_to_matrix" title="Figure 3-9. Transforming a relational database of movie ratings into a matrix representation." class="w-full my-4" >}}
###### Figure 3-9. Transforming a relational database of movie ratings into a matrix representation.
A matrix can only contain numbers, and various techniques are used to transform non-numerical data A matrix can only contain numbers, and various techniques are used to transform non-numerical data
into numbers in the matrix. For example: into numbers in the matrix. For example:


@ -41,7 +41,7 @@ queries, such as text retrieval.
Consider the world’s simplest database, implemented as two Bash functions: Consider the world’s simplest database, implemented as two Bash functions:
``` ```bash
#!/bin/bash #!/bin/bash
db_set () { db_set () {
@ -60,14 +60,13 @@ recent value associated with that particular key and returns it.
And it works: And it works:
``` ```bash
$ db_set 12 '{"name":"London","attractions":["Big Ben","London Eye"]}' $ db_set 12 '{"name":"London","attractions":["Big Ben","London Eye"]}'
$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}' $ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'
$ db_get 42 $ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]} {"name":"San Francisco","attractions":["Golden Gate Bridge"]}
``` ```
The storage format is very simple: a text file where each line contains a key-value pair, separated The storage format is very simple: a text file where each line contains a key-value pair, separated
@ -76,7 +75,7 @@ the end of the file. If you update a key several times, old versions of the valu
overwritten—you need to look at the last occurrence of a key in a file to find the latest value overwritten—you need to look at the last occurrence of a key in a file to find the latest value
(hence the `tail -n 1` in `db_get`): (hence the `tail -n 1` in `db_get`):
``` ```bash
$ db_set 42 '{"name":"San Francisco","attractions":["Exploratorium"]}' $ db_set 42 '{"name":"San Francisco","attractions":["Exploratorium"]}'
$ db_get 42 $ db_get 42
@ -136,9 +135,7 @@ To start, let’s assume that you want to continue storing data in the append-on
memory, in which every key is mapped to the byte offset in the file at which the most recent value memory, in which every key is mapped to the byte offset in the file at which the most recent value
for that key can be found, as illustrated in [Figure 4-1](/en/ch4#fig_storage_csv_hash_index). for that key can be found, as illustrated in [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
![ddia 0401](/fig/ddia_0401.png) {{< figure src="/fig/ddia_0401.png" id="fig_storage_csv_hash_index" title="Figure 4-1. Storing a log of key-value pairs in a CSV-like format, indexed with an in-memory hash map." class="w-full my-4" >}}
###### Figure 4-1. Storing a log of key-value pairs in a CSV-like format, indexed with an in-memory hash map.
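The hash-indexed log described above might look something like the following sketch. It assumes keys contain no commas and values contain no newlines, and that the index is simply a Python dict; the class name `HashIndexedLog` and its methods are hypothetical.

```python
import os
import tempfile


class HashIndexedLog:
    """Append-only key-value file plus an in-memory map from key to byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the most recent record for that key
        open(path, "ab").close()  # create the file if it does not exist yet

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # the new record starts at the current end
            f.write(record)
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # jump straight to the record instead of scanning the file
            line = f.readline().decode("utf-8").rstrip("\n")
        _, _, value = line.partition(",")
        return value
```

Old versions of a value stay in the file; only the offset in the dict changes. This sketch does not rebuild the index after a restart, which a real storage engine would need to do.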
Whenever you append a new key-value pair to the file, you also update the hash map to reflect the Whenever you append a new key-value pair to the file, you also update the hash map to reflect the
offset of the data you just wrote. When you want to look up a value, you use the hash map to find offset of the data you just wrote. When you want to look up a value, you use the hash map to find
@ -162,15 +159,12 @@ This approach is much faster, but it still suffers from several problems:
### The SSTable file format ### The SSTable file format
In practice, hash tables are not used very often for database indexes, and instead it is much more In practice, hash tables are not used very often for database indexes, and instead it is much more
common to keep data in a structure that is *sorted by key* common to keep data in a structure that is *sorted by key* [^3].
[^3].
One example of such a structure is a *Sorted String Table*, or *SSTable* for short, as shown in One example of such a structure is a *Sorted String Table*, or *SSTable* for short, as shown in
[Figure 4-2](/en/ch4#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that [Figure 4-2](/en/ch4#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that
they are sorted by key, and each key only appears once in the file. they are sorted by key, and each key only appears once in the file.
![ddia 0402](/fig/ddia_0402.png) {{< figure src="/fig/ddia_0402.png" id="fig_storage_sstable_index" title="Figure 4-2. An SSTable with a sparse index, allowing queries to jump to the right block." class="w-full my-4" >}}
###### Figure 4-2. An SSTable with a sparse index, allowing queries to jump to the right block.
Now you do not need to keep all the keys in memory: you can group the key-value pairs within an Now you do not need to keep all the keys in memory: you can group the key-value pairs within an
SSTable into *blocks* of a few kilobytes, and then store the first key of each block in the index. SSTable into *blocks* of a few kilobytes, and then store the first key of each block in the index.
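A sparse index lookup could be sketched as below. For simplicity the sketch measures block size in number of entries rather than kilobytes, and keeps blocks as in-memory lists instead of byte ranges in a file; `build_blocks` and `lookup` are hypothetical names.

```python
from bisect import bisect_right


def build_blocks(sorted_pairs, block_size):
    """Group sorted (key, value) pairs into blocks and index the first key of each block."""
    blocks = [
        sorted_pairs[i:i + block_size]
        for i in range(0, len(sorted_pairs), block_size)
    ]
    sparse_index = [block[0][0] for block in blocks]
    return blocks, sparse_index


def lookup(blocks, sparse_index, key):
    """Find the only block that could hold the key, then scan just that block."""
    pos = bisect_right(sparse_index, key) - 1
    if pos < 0:
        return None  # key sorts before the first key in the file
    for k, v in blocks[pos]:
        if k == key:
            return v
    return None
```

Only one key per block is held in the index, so memory use grows with the number of blocks rather than the number of keys.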
@ -224,9 +218,7 @@ the same key appears in more than one input file, keep only the more recent valu
new merged segment file, also sorted by key, with one value per key, and it uses minimal memory new merged segment file, also sorted by key, with one value per key, and it uses minimal memory
because we can iterate over the SSTables one key at a time. because we can iterate over the SSTables one key at a time.
![ddia 0403](/fig/ddia_0403.png) {{< figure src="/fig/ddia_0403.png" id="fig_storage_sstable_merging" title="Figure 4-3. Merging several SSTable segments, retaining only the most recent value for each key." class="w-full my-4" >}}
###### Figure 4-3. Merging several SSTable segments, retaining only the most recent value for each key.
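The merge step can be sketched with a streaming k-way merge. This assumes each segment is an iterable of `(key, value)` pairs sorted by key with one entry per key, and that segments are passed oldest first; `merge_segments` is a hypothetical name, and deletion markers are not handled.

```python
import heapq


def _tagged(segment, rank):
    # Attach a recency rank so that, for equal keys, the newest segment sorts first.
    for key, value in segment:
        yield key, rank, value


def merge_segments(segments):
    """Merge sorted segments (oldest first), keeping only the most recent value per key."""
    streams = [_tagged(segment, -i) for i, segment in enumerate(segments)]
    last_key = object()  # sentinel that never equals a real key
    for key, _rank, value in heapq.merge(*streams):
        if key != last_key:
            yield key, value  # first occurrence comes from the newest segment
            last_key = key
```

Because `heapq.merge` reads its inputs lazily, the merge only holds one pending entry per segment in memory at a time.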
To ensure that the data in the memtable is not lost if the database crashes, the storage engine To ensure that the data in the memtable is not lost if the database crashes, the storage engine
keeps a separate log on disk to which every write is immediately appended. This log is not sorted by keeps a separate log on disk to which every write is immediately appended. This log is not sorted by
@ -285,9 +277,7 @@ We set the bits corresponding to those indexes to 1, and leave the rest as 0. Fo
is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of
extra space, but the Bloom filter is generally small compared to the rest of the SSTable. extra space, but the Bloom filter is generally small compared to the rest of the SSTable.
![ddia 0404](/fig/ddia_0404.png) {{< figure src="/fig/ddia_0404.png" id="fig_storage_bloom" title="Figure 4-4. A Bloom filter provides a fast, probabilistic check whether a particular key exists in a particular SSTable." class="w-full my-4" >}}
###### Figure 4-4. A Bloom filter provides a fast, probabilistic check whether a particular key exists in a particular SSTable.
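A Bloom filter along these lines could be sketched as follows. The choice of SHA-256 with a seed prefix to derive several indexes is an assumption made for illustration, not necessarily what any particular storage engine does, and the `BloomFilter` class and its parameters are hypothetical.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)  # all bits start at 0

    def _indexes(self, key):
        # Derive num_hashes bit positions from the key (illustrative hashing scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for i in self._indexes(key):
            self.bits[i // 8] |= 1 << (i % 8)

    def might_contain(self, key):
        # False means the key is definitely absent; True means it may be present.
        return all(self.bits[i // 8] & (1 << (i % 8)) for i in self._indexes(key))
```

If any of the checked bits is 0, the key cannot have been added, so the SSTable does not need to be read; if all are 1, the key may or may not be present and the SSTable has to be consulted.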
When we want to know whether a key appears in the SSTable, we compute the same hash of that key as When we want to know whether a key appears in the SSTable, we compute the same hash of that key as
before, and check the bits at those indexes. For example, in [Figure 4-4](/en/ch4#fig_storage_bloom), we’re querying before, and check the bits at those indexes. For example, in [Figure 4-4](/en/ch4#fig_storage_bloom), we’re querying
@ -366,8 +356,7 @@ for scaling a database across multiple machines.
The log-structured approach is popular, but it is not the only form of key-value storage. The most The log-structured approach is popular, but it is not the only form of key-value storage. The most
widely used structure for reading and writing database records by key is the *B-tree*. widely used structure for reading and writing database records by key is the *B-tree*.
Introduced in 1970 [^21] Introduced in 1970 [^21] and called “ubiquitous” less than 10 years later [^22],
and called “ubiquitous” less than 10 years later [^22],
B-trees have stood the test of time very well. They remain the standard index implementation in B-trees have stood the test of time very well. They remain the standard index implementation in
almost all relational databases, and many nonrelational databases use them too. almost all relational databases, and many nonrelational databases use them too.
@ -387,9 +376,7 @@ multiplying the page number by the page size gives us the byte offset in the fil
located. We can use these page references to construct a tree of pages, as illustrated in located. We can use these page references to construct a tree of pages, as illustrated in
[Figure 4-5](/en/ch4#fig_storage_b_tree). [Figure 4-5](/en/ch4#fig_storage_b_tree).
![ddia 0405](/fig/ddia_0405.png) {{< figure src="/fig/ddia_0405.png" id="fig_storage_b_tree" title="Figure 4-5. Looking up the key 251 using a B-tree index. From the root page we first follow the reference to the page for keys 200–300, then the page for keys 250–270." class="w-full my-4" >}}
###### Figure 4-5. Looking up the key 251 using a B-tree index. From the root page we first follow the reference to the page for keys 200–300, then the page for keys 250–270.
One page is designated as the *root* of the B-tree; whenever you want to look up a key in the index, One page is designated as the *root* of the B-tree; whenever you want to look up a key in the index,
you start here. The page contains several keys and references to child pages. you start here. The page contains several keys and references to child pages.
@ -416,9 +403,7 @@ it to that page. If there isn’t enough free space in the page to accommodate t
is split into two half-full pages, and the parent page is updated to account for the new subdivision is split into two half-full pages, and the parent page is updated to account for the new subdivision
of key ranges. of key ranges.
![ddia 0406](/fig/ddia_0406.png) {{< figure src="/fig/ddia_0406.png" id="fig_storage_b_tree_split" title="Figure 4-6. Growing a B-tree by splitting a page on the boundary key 337. The parent page is updated to reference both children." class="w-full my-4" >}}
###### Figure 4-6. Growing a B-tree by splitting a page on the boundary key 337. The parent page is updated to reference both children.
In the example of [Figure 4-6](/en/ch4#fig_storage_b_tree_split), we want to insert the key 334, but the page for the In the example of [Figure 4-6](/en/ch4#fig_storage_b_tree_split), we want to insert the key 334, but the page for the
range 333–345 is already full. We therefore split it into a page for the range 333–337 (including range 333–345 is already full. We therefore split it into a page for the range 333–337 (including
@ -444,8 +429,7 @@ modify files in place.
Overwriting several pages at once, like in a page split, is a dangerous operation: if the database Overwriting several pages at once, like in a page split, is a dangerous operation: if the database
crashes after only some of the pages have been written, you end up with a corrupted tree (e.g., crashes after only some of the pages have been written, you end up with a corrupted tree (e.g.,
there may be an *orphan* page that is not a child of any parent). If the hardware can't atomically there may be an *orphan* page that is not a child of any parent). If the hardware can't atomically
write an entire page, you can also end up with a partially written page (this is known as a *torn write an entire page, you can also end up with a partially written page (this is known as a *torn page* [^23]).
page* [^23]).
In order to make the database resilient to crashes, it is common for B-tree implementations to In order to make the database resilient to crashes, it is common for B-tree implementations to
include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file
@ -509,8 +493,7 @@ High write throughput can cause latency spikes in a log-structured storage engin
memtable fills up. This happens if data can't be written out to disk fast enough, perhaps because memtable fills up. This happens if data can't be written out to disk fast enough, perhaps because
the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB, the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB,
perform *backpressure* in this situation: they suspend all reads and writes until the memtable has perform *backpressure* in this situation: they suspend all reads and writes until the memtable has
been written out to disk been written out to disk [^30] [^31].
[^30] [^31].
Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read
requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but
@ -552,8 +535,7 @@ A sequential write workload writes larger chunks of data at a time, so it is lik
512 KiB block belongs to a single file; when that file is later deleted again, the whole block 512 KiB block belongs to a single file; when that file is later deleted again, the whole block
can be erased without having to perform any GC. On the other hand, with a random write workload, it can be erased without having to perform any GC. On the other hand, with a random write workload, it
is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
to perform more work before a block can be erased to perform more work before a block can be erased [^34] [^35] [^36].
[^34] [^35] [^36].
The write bandwidth consumed by GC is then not available for the application. Moreover, the The write bandwidth consumed by GC is then not available for the application. Moreover, the
additional writes performed by GC contribute to wear on the flash memory; therefore, random writes additional writes performed by GC contribute to wear on the flash memory; therefore, random writes
@ -654,14 +636,12 @@ The key in an index is the thing that queries search by, but the value can be on
* If the actual data (row, document, vertex) is stored directly within the index structure, it is * If the actual data (row, document, vertex) is stored directly within the index structure, it is
called a *clustered index*. For example, in MySQL's InnoDB storage engine, the primary key of a called a *clustered index*. For example, in MySQL's InnoDB storage engine, the primary key of a
table is always a clustered index, and in SQL Server, you can specify one clustered index per table is always a clustered index, and in SQL Server, you can specify one clustered index per table [^43].
table [^43].
* Alternatively, the value can be a reference to the actual data: either the primary key of the row * Alternatively, the value can be a reference to the actual data: either the primary key of the row
in question (InnoDB does this for secondary indexes), or a direct reference to a location on disk. in question (InnoDB does this for secondary indexes), or a direct reference to a location on disk.
In the latter case, the place where rows are stored is known as a *heap file*, and it stores data In the latter case, the place where rows are stored is known as a *heap file*, and it stores data
in no particular order (it may be append-only, or it may keep track of deleted rows in order to in no particular order (it may be append-only, or it may keep track of deleted rows in order to
overwrite them with new data later). For example, Postgres uses the heap file approach overwrite them with new data later). For example, Postgres uses the heap file approach [^44].
[^44].
* A middle ground between the two is a *covering index* or *index with included columns*, which * A middle ground between the two is a *covering index* or *index with included columns*, which
stores *some* of a table's columns within the index, in addition to storing the full row on the stores *some* of a table's columns within the index, in addition to storing the full row on the
heap or in the primary key clustered index [^45]. heap or in the primary key clustered index [^45].
@ -707,8 +687,7 @@ easily be backed up, inspected, and analyzed by external utilities.
Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model, Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model,
and the vendors claim that they can offer big performance improvements by removing all the overheads and the vendors claim that they can offer big performance improvements by removing all the overheads
associated with managing on-disk data structures associated with managing on-disk data structures [^46] [^47].
[^46] [^47].
RAMCloud is an open source, in-memory key-value store with durability (using a log-structured RAMCloud is an open source, in-memory key-value store with durability (using a log-structured
approach for the data in memory as well as the data on disk) [^48]. approach for the data in memory as well as the data on disk) [^48].
@ -741,8 +720,7 @@ Some databases, such as Microsoft SQL Server, SAP HANA, and SingleStore, have su
transaction processing and data warehousing in the same product. However, these hybrid transactional transaction processing and data warehousing in the same product. However, these hybrid transactional
and analytical processing (HTAP) databases (introduced in [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are increasingly and analytical processing (HTAP) databases (introduced in [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are increasingly
becoming two separate storage and query engines, which happen to be accessible through a common SQL becoming two separate storage and query engines, which happen to be accessible through a common SQL
interface interface [^50] [^51] [^52] [^53].
[^50] [^51] [^52] [^53].
## Cloud Data Warehouses ## Cloud Data Warehouses
@ -774,8 +752,7 @@ Query engine
Storage format Storage format
: The storage format determines how the rows of a table are encoded as bytes in a file, which is : The storage format determines how the rows of a table are encoded as bytes in a file, which is
then typically stored in object storage or a distributed filesystem then typically stored in object storage or a distributed filesystem [^12].
[^12].
This data can then be accessed by the query engine, but also by other applications using the data This data can then be accessed by the query engine, but also by other applications using the data
lake. Examples of such storage formats are Parquet, ORC, Lance, or Nimble, and we will see more lake. Examples of such storage formats are Parquet, ORC, Lance, or Nimble, and we will see more
about them in the next section. about them in the next section.
@ -833,8 +810,7 @@ How can we execute this query efficiently?
In most OLTP databases, storage is laid out in a *row-oriented* fashion: all the values from one row In most OLTP databases, storage is laid out in a *row-oriented* fashion: all the values from one row
of a table are stored next to each other. Document databases are similar: an entire document is of a table are stored next to each other. Document databases are similar: an entire document is
typically stored as one contiguous sequence of bytes. You can see this in the CSV example of typically stored as one contiguous sequence of bytes. You can see this in the CSV example of [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
[Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
In order to process a query like [Example 4-1](/en/ch4#fig_storage_analytics_query), you may have indexes on In order to process a query like [Example 4-1](/en/ch4#fig_storage_analytics_query), you may have indexes on
`fact_sales.date_key` and/or `fact_sales.product_sk` that tell the storage engine where to find `fact_sales.date_key` and/or `fact_sales.product_sk` that tell the storage engine where to find
@ -851,16 +827,10 @@ an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema)
> [!NOTE] > [!NOTE]
> Column storage is easiest to understand in a relational data model, but it applies equally to > Column storage is easiest to understand in a relational data model, but it applies equally to
> nonrelational data. For example, Parquet > nonrelational data. For example, Parquet [^57] is a columnar storage format that supports a document data model, based on Google's Dremel [^58],
> [^57] > using a technique known as *shredding* or *striping* [^59].
> is a columnar storage format that supports a document data model, based on Google's Dremel
> [^58],
> using a technique known as *shredding* or *striping*
> [^59].
![ddia 0407](/fig/ddia_0407.png) {{< figure src="/fig/ddia_0407.png" id="fig_column_store" title="Figure 4-7. Storing relational data by column, rather than by row." class="w-full my-4" >}}
###### Figure 4-7. Storing relational data by column, rather than by row.
The column-oriented storage layout relies on each column storing the rows in the same order. The column-oriented storage layout relies on each column storing the rows in the same order.
Thus, if you need to reassemble an entire row, you can take the 23rd entry from each of the Thus, if you need to reassemble an entire row, you can take the 23rd entry from each of the
@ -873,20 +843,10 @@ Since many queries are restricted to a particular date range, it is common to ma
contain the rows for a particular timestamp range. A query then only needs to load the columns it contain the rows for a particular timestamp range. A query then only needs to load the columns it
needs in those blocks that overlap with the required date range. needs in those blocks that overlap with the required date range.
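
A rough sketch of this kind of pruning is shown below. The `Block` class and the `scan` function are hypothetical names used only for illustration; the sketch assumes each block records the minimum and maximum timestamp of the rows it contains and keeps its columns in a dictionary.

```python
from dataclasses import dataclass


@dataclass
class Block:
    min_ts: int      # smallest timestamp of any row in the block (inclusive)
    max_ts: int      # largest timestamp of any row in the block (inclusive)
    columns: dict    # column name -> list of values, same row order in every column


def scan(blocks, wanted_columns, start_ts, end_ts):
    """Yield only the wanted columns of blocks overlapping [start_ts, end_ts]."""
    for block in blocks:
        if block.max_ts < start_ts or block.min_ts > end_ts:
            continue  # no row in this block can match, so skip it entirely
        yield {name: block.columns[name] for name in wanted_columns}
```

A block that only partly overlaps the requested range still contains rows outside it, so the query must filter individual rows after loading the block.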
Columnar storage is used in almost all analytic databases nowadays [^60], Columnar storage is used in almost all analytic databases nowadays [^60], ranging from large-scale cloud data warehouses such as Snowflake [^61]
ranging from large-scale cloud data warehouses such as Snowflake [^61] to single-node embedded databases such as DuckDB [^62], and product analytics systems such as Pinot [^63] and Druid [^64].
to single-node embedded databases such as DuckDB [^62], It is used in storage formats such as Parquet, ORC [^65] [^66], Lance [^67], and Nimble [^68], and in-memory analytics formats like Apache Arrow
and product analytics systems such as Pinot [^63] [^65] [^69] and Pandas/NumPy [^70]. Some time-series databases, such as InfluxDB IOx [^71] and TimescaleDB [^72], are also based on column-oriented storage.
and Druid [^64].
It is used in storage formats such as Parquet, ORC
[^65] [^66],
Lance [^67],
and Nimble [^68],
and in-memory analytics formats like Apache Arrow
[^65] [^69]
and Pandas/NumPy [^70].
Some time-series databases, such as InfluxDB IOx [^71] and TimescaleDB [^72],
are also based on column-oriented storage.
### Column Compression ### Column Compression
@ -899,9 +859,7 @@ repetitive, which is a good sign for compression. Depending on the data in the c
compression techniques can be used. One technique that is particularly effective in data warehouses compression techniques can be used. One technique that is particularly effective in data warehouses
is *bitmap encoding*, illustrated in [Figure 4-8](/en/ch4#fig_bitmap_index). is *bitmap encoding*, illustrated in [Figure 4-8](/en/ch4#fig_bitmap_index).
![ddia 0408](/fig/ddia_0408.png) {{< figure src="/fig/ddia_0408.png" id="fig_bitmap_index" title="Figure 4-8. Compressed, bitmap-indexed storage of a single column." class="w-full my-4" >}}
###### Figure 4-8. Compressed, bitmap-indexed storage of a single column.
Often, the number of distinct values in a column is small compared to the number of rows (for Often, the number of distinct values in a column is small compared to the number of rows (for
example, a retailer may have billions of sales transactions, but only 100,000 distinct products). example, a retailer may have billions of sales transactions, but only 100,000 distinct products).
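
As a rough illustration of bitmap encoding, the sketch below builds one bitmap per distinct value in a column. The function name `bitmap_encode` is hypothetical, and the bitmaps are plain Python lists of 0 and 1 for readability; a real system would store packed bits and usually compress them further.

```python
def bitmap_encode(column):
    """Return a dict mapping each distinct value to a bitmap with one bit per row."""
    bitmaps = {}
    for row, value in enumerate(column):
        # The first time a value is seen, create an all-zero bitmap for it,
        # then set the bit for the current row.
        bitmaps.setdefault(value, [0] * len(column))[row] = 1
    return bitmaps
```

Every row sets exactly one bit across all the bitmaps, so when a column has many distinct values the bitmaps are mostly zeros.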
@ -1041,9 +999,7 @@ Vectorized processing
shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in
a particular store. a particular store.
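
The bitwise AND can be sketched with Python integers standing in for bitmaps. The helper names and the example bitmaps are made up for illustration. An analytic engine would apply the same operation to many machine words at a time (for example, with SIMD instructions) rather than to one arbitrary-precision integer.

```python
def to_bitset(bits):
    """Pack a list of 0/1 flags (row 0 first) into an int, with row i at bit i."""
    packed = 0
    for i, bit in enumerate(bits):
        if bit:
            packed |= 1 << i
    return packed


def matching_rows(bitset):
    """Return the row numbers whose bit is set."""
    rows = []
    i = 0
    while bitset:
        if bitset & 1:
            rows.append(i)
        bitset >>= 1
        i += 1
    return rows


# Hypothetical bitmaps over six rows of a fact table.
product_is_bananas = to_bitset([1, 0, 1, 1, 0, 1])
store_is_42 = to_bitset([1, 1, 0, 1, 0, 0])

# A single AND evaluates the combined condition for every row at once.
both = product_is_bananas & store_is_42
```

Here `matching_rows(both)` yields rows 0 and 3, the only rows where both bitmaps have a 1.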
![ddia 0409](/fig/ddia_0409.png) {{< figure src="/fig/ddia_0409.png" id="fig_bitmap_and" title="Figure 4-9. A bitwise AND between two bitmaps lends itself to vectorization." class="w-full my-4" >}}
###### Figure 4-9. A bitwise AND between two bitmaps lends itself to vectorization.
The two approaches are very different in terms of their implementation, but both are used in The two approaches are very different in terms of their implementation, but both are used in
practice [^77]. Both can achieve very good practice [^77]. Both can achieve very good
@ -1081,9 +1037,7 @@ queries use most often? A *data cube* or *OLAP cube* does this by creating a gri
grouped by different dimensions [^82]. grouped by different dimensions [^82].
[Figure 4-10](/en/ch4#fig_data_cube) shows an example. [Figure 4-10](/en/ch4#fig_data_cube) shows an example.
![ddia 0410](/fig/ddia_0410.png) {{< figure src="/fig/ddia_0410.png" id="fig_data_cube" title="Figure 4-10. Two dimensions of a data cube, aggregating data by summing." class="w-full my-4" >}}
###### Figure 4-10. Two dimensions of a data cube, aggregating data by summing.
Imagine for now that each fact has foreign keys to only two dimension tables—in [Figure 4-10](/en/ch4#fig_data_cube), Imagine for now that each fact has foreign keys to only two dimension tables—in [Figure 4-10](/en/ch4#fig_data_cube),
these are `date_key` and `product_sk`. You can now draw a two-dimensional table, with these are `date_key` and `product_sk`. You can now draw a two-dimensional table, with
@ -1282,9 +1236,7 @@ Hierarchical Navigable Small World (HNSW)
query vector. The process continues until the last layer is reached. As with IVF indexes, HNSW query vector. The process continues until the last layer is reached. As with IVF indexes, HNSW
indexes are approximate. indexes are approximate.
![ddia 0411](/fig/ddia_0411.png) {{< figure src="/fig/ddia_0411.png" id="fig_vector_hnsw" title="Figure 4-11. Searching for the database entry that is closest to a given query vector in a HNSW index." class="w-full my-4" >}}
###### Figure 4-11. Searching for the database entry that is closest to a given query vector in a HNSW index.
Many popular vector databases implement IVF and HNSW indexes. Facebook's Faiss library has many Many popular vector databases implement IVF and HNSW indexes. Facebook's Faiss library has many
variations of each [^101], variations of each [^101],

@ -243,7 +243,7 @@ will need to include the strings `userName`, `favoriteNumber`, and `interests` s
##### Example 5-2. Example record which we will encode in several binary formats in this chapter ##### Example 5-2. Example record which we will encode in several binary formats in this chapter
``` ```json
{ {
"userName": "Martin", "userName": "Martin",
"favoriteNumber": 1337, "favoriteNumber": 1337,
@ -273,9 +273,8 @@ is worth the loss of human-readability.
In the following sections we will see how we can do much better, and encode the same record in just In the following sections we will see how we can do much better, and encode the same record in just
32 bytes. 32 bytes.
![ddia 0502](/fig/ddia_0502.png) {{< figure src="/fig/ddia_0502.png" id="fig_encoding_messagepack" title="Figure 5-2. Example record ([Example 5-2](/en/ch5#fig_encoding_json)) encoded using MessagePack." class="w-full my-4" >}}
###### Figure 5-2. Example record ([Example 5-2](/en/ch5#fig_encoding_json)) encoded using MessagePack.
## Protocol Buffers ## Protocol Buffers
@ -306,9 +305,8 @@ types, but it does not support other restrictions on the possible values of fiel
Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in
[Figure 5-3](/en/ch5#fig_encoding_protobuf) [^14]. [Figure 5-3](/en/ch5#fig_encoding_protobuf) [^14].
![ddia 0503](/fig/ddia_0503.png) {{< figure src="/fig/ddia_0503.png" id="fig_encoding_protobuf" title="Figure 5-3. Example record encoded using Protocol Buffers." class="w-full my-4" >}}
###### Figure 5-3. Example record encoded using Protocol Buffers.
Similarly to [Figure 5-2](/en/ch5#fig_encoding_messagepack), each field has a type annotation (to indicate whether it Similarly to [Figure 5-2](/en/ch5#fig_encoding_messagepack), each field has a type annotation (to indicate whether it
is a string, integer, etc.) and, where required, a length indication (such as the length of a is a string, integer, etc.) and, where required, a length indication (such as the length of a
@ -416,9 +414,8 @@ prefix followed by UTF-8 bytes, but there's nothing in the encoded data that t
string. It could just as well be an integer, or something else entirely. An integer is encoded using string. It could just as well be an integer, or something else entirely. An integer is encoded using
a variable-length encoding. a variable-length encoding.
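
To give a feel for such a variable-length encoding, here is a sketch of zigzag coding followed by a base-128 varint, which is our understanding of how the Avro specification encodes `int` and `long` values. The function name `zigzag_varint` is hypothetical, and the sketch assumes the input fits in a signed 64-bit integer.

```python
def zigzag_varint(n: int) -> bytes:
    """Encode a signed 64-bit integer as a zigzag varint."""
    # Zigzag maps small negative and positive numbers to small unsigned
    # numbers: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while True:
        low7 = z & 0x7F
        z >>= 7
        if z:
            out.append(low7 | 0x80)   # high bit set: more bytes follow
        else:
            out.append(low7)          # high bit clear: this is the last byte
            return bytes(out)
```

With this scheme the number 1337 from the example record takes two bytes (`f2 14` in hexadecimal), while numbers between -64 and 63 take a single byte.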
![ddia 0504](/fig/ddia_0504.png) {{< figure src="/fig/ddia_0504.png" id="fig_encoding_avro" title="Figure 5-4. Example record encoded using Avro." class="w-full my-4" >}}
###### Figure 5-4. Example record encoded using Avro.
To parse the binary data, you go through the fields in the order that they appear in the schema and To parse the binary data, you go through the fields in the order that they appear in the schema and
use the schema to tell you the datatype of each field. This means that the binary data can only be use the schema to tell you the datatype of each field. This means that the binary data can only be
@ -440,9 +437,8 @@ encoding, and the *reader's schema*, which may be different. This is illustrat
[Figure 5-5](/en/ch5#fig_encoding_avro_schemas). The reader's schema defines the fields of each record that the [Figure 5-5](/en/ch5#fig_encoding_avro_schemas). The reader's schema defines the fields of each record that the
application code is expecting, and their types. application code is expecting, and their types.
![ddia 0505](/fig/ddia_0505.png) {{< figure src="/fig/ddia_0505.png" id="fig_encoding_avro_schemas" title="Figure 5-5. In Protocol Buffers, encoding and decoding can use different versions of a schema. In Avro, decoding uses two schemas: the writer's schema must be identical to the one used for encoding, but the reader's schema can be an older or newer version." class="w-full my-4" >}}
###### Figure 5-5. In Protocol Buffers, encoding and decoding can use different versions of a schema. In Avro, decoding uses two schemas: the writer's schema must be identical to the one used for encoding, but the reader's schema can be an older or newer version.
If the reader's and writer's schema are the same, decoding is easy. If they are different, Avro If the reader's and writer's schema are the same, decoding is easy. If they are different, Avro
resolves the differences by looking at the writer's schema and the reader's schema side by side and resolves the differences by looking at the writer's schema and the reader's schema side by side and
@ -458,9 +454,8 @@ schema, it is ignored. If the code reading the data expects some field, but the
not contain a field of that name, it is filled in with a default value declared in the reader's not contain a field of that name, it is filled in with a default value declared in the reader's
schema. schema.
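
The resolution rules for added and removed fields can be sketched as follows. This is a simplification, not the Avro library's API: it assumes the record has already been decoded with the writer's schema into a dictionary, it represents the reader's schema as a mapping from field name to default value, and it ignores type conversions. The names `resolve` and `NO_DEFAULT` are hypothetical.

```python
NO_DEFAULT = object()  # marker for a reader field that declares no default


def resolve(writer_record: dict, reader_fields: dict) -> dict:
    """Translate a record decoded with the writer's schema into the reader's view.

    reader_fields maps each field name the reader expects to its default
    value, or to NO_DEFAULT if the reader's schema declares none.
    """
    result = {}
    for name, default in reader_fields.items():
        if name in writer_record:
            result[name] = writer_record[name]   # fields are matched up by name
        elif default is not NO_DEFAULT:
            result[name] = default               # missing in writer: use the default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    # Fields that appear only in the writer's record are dropped.
    return result
```

Because fields are matched by name, the order in which the two schemas list their fields does not matter.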
![ddia 0506](/fig/ddia_0506.png) {{< figure src="/fig/ddia_0506.png" id="fig_encoding_avro_resolution" title="Figure 5-6. An Avro reader resolves differences between the writer's schema and the reader's schema." class="w-full my-4" >}}
###### Figure 5-6. An Avro reader resolves differences between the writer's schema and the reader's schema.
### Schema evolution rules ### Schema evolution rules
@ -515,11 +510,7 @@ Database with individually written records
and then fetch the writer's schema for that version number from the database. Using that writer's and then fetch the writer's schema for that version number from the database. Using that writer's
schema, it can decode the rest of the record. schema, it can decode the rest of the record.
Confluent's schema registry for Apache Kafka Confluent's schema registry for Apache Kafka [^19] and LinkedIn's Espresso [^20] work this way, for example.
[^19]
and LinkedIn's Espresso
[^20]
work this way, for example.
Sending records over a network connection Sending records over a network connection
: When two processes are communicating over a bidirectional network connection, they can negotiate : When two processes are communicating over a bidirectional network connection, they can negotiate
@ -528,8 +519,7 @@ Sending records over a network connection
A database of schema versions is a useful thing to have in any case, since it acts as documentation A database of schema versions is a useful thing to have in any case, since it acts as documentation
and gives you a chance to check schema compatibility [^21]. and gives you a chance to check schema compatibility [^21].
As the version number, you could use a simple incrementing integer, or you could use a hash of the As the version number, you could use a simple incrementing integer, or you could use a hash of the schema.
schema.
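
One way to derive a version identifier from a hash of the schema is sketched below. It is an illustration only: the function name `schema_fingerprint` is hypothetical, and serializing to JSON with sorted keys is a stand-in for a proper canonical form. Avro defines its own canonical form and fingerprint algorithms, which a real system would more likely use.

```python
import hashlib
import json


def schema_fingerprint(schema: dict) -> str:
    """Return a hex digest identifying a schema given as a JSON-like dict."""
    # Sorting object keys and fixing the separators ensures that two
    # equivalent schema documents serialize to the same bytes. The order
    # of items inside lists (such as the field list) is preserved.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Unlike an incrementing integer, a hash needs no central counter to assign it, but it says nothing about which of two versions is newer.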
### Dynamically generated schemas ### Dynamically generated schemas
@ -570,13 +560,11 @@ implement and simpler to use, they have grown to support a fairly wide range of
languages. languages.
The ideas on which these encodings are based are by no means new. For example, they have a lot in The ideas on which these encodings are based are by no means new. For example, they have a lot in
common with ASN.1, a schema definition language that was first standardized in 1984 common with ASN.1, a schema definition language that was first standardized in 1984 [^23] [^24].
[^23] [^24].
It was used to define various network protocols, and its binary encoding (DER) is still used to encode It was used to define various network protocols, and its binary encoding (DER) is still used to encode
SSL certificates (X.509), for example [^25]. SSL certificates (X.509), for example [^25].
ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers [^26]. ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers [^26].
However, it's also very complex and badly documented, so ASN.1 However, it's also very complex and badly documented, so ASN.1 is probably not a good choice for new applications.
is probably not a good choice for new applications.
Many data systems also implement some kind of proprietary binary encoding for their data. For Many data systems also implement some kind of proprietary binary encoding for their data. For
example, most relational databases have a network protocol over which you can send queries to the example, most relational databases have a network protocol over which you can send queries to the
@ -666,8 +654,7 @@ schema, even though the underlying storage may contain records encoded with vari
versions of the schema. versions of the schema.
More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or
moving some data into a separate table—still require data to be rewritten, often at the application moving some data into a separate table—still require data to be rewritten, often at the application level [^27].
level [^27].
Maintaining forward and backward compatibility across such migrations is still a research problem [^28]. Maintaining forward and backward compatibility across such migrations is still a research problem [^28].
### Archival storage ### Archival storage
@ -736,8 +723,7 @@ different contexts. For example:
category includes public APIs provided by online services, such as credit card processing category includes public APIs provided by online services, such as credit card processing
systems, or OAuth for shared access to user data. systems, or OAuth for shared access to user data.
The most popular service design philosophy is REST, which builds upon the principles of HTTP The most popular service design philosophy is REST, which builds upon the principles of HTTP [^30] [^31].
[^30] [^31].
It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for
cache control, authentication, and content type negotiation. An API designed according to the cache control, authentication, and content type negotiation. An API designed according to the
principles of REST is called *RESTful*. principles of REST is called *RESTful*.
@ -753,12 +739,11 @@ and receive Protocol Buffers.
Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](/en/ch5#fig_open_api_def). Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](/en/ch5#fig_open_api_def).
The service definition allows developers to define service endpoints, documentation, versions, data The service definition allows developers to define service endpoints, documentation, versions, data
models, and much more. gRPC definitions look similar, but are defined using Protocol Buffers service models, and much more. gRPC definitions look similar, but are defined using Protocol Buffers service definitions.
definitions.
##### Example 5-3. Example OpenAPI service definition in YAML ##### Example 5-3. Example OpenAPI service definition in YAML
``` ```yaml
openapi: 3.0.0 openapi: 3.0.0
info: info:
title: Ping, Pong title: Ping, Pong
@ -981,9 +966,8 @@ Different workflow engines use different names for tasks. Temporal, for example,
*activity*. Others refer to tasks as *durable functions*. Though the names differ, the concepts are *activity*. Others refer to tasks as *durable functions*. Though the names differ, the concepts are
the same. the same.
![ddia 0507](/fig/ddia_0507.png) {{< figure src="/fig/ddia_0507.png" id="fig_encoding_workflow" title="Figure 5-7. Example of a workflow expressed using Business Process Model and Notation (BPMN), a graphical notation." class="w-full my-4" >}}
###### Figure 5-7. Example of a workflow expressed using Business Process Model and Notation (BPMN), a graphical notation.
Workflows are run, or executed, by a *workflow engine*. Workflow engines determine when to run each Workflows are run, or executed, by a *workflow engine*. Workflow engines determine when to run each
task, on which machine a task must be run, what to do if a task fails (e.g., if the machine crashes task, on which machine a task must be run, what to do if a task fails (e.g., if the machine crashes

@ -69,8 +69,7 @@ longer contain the same data. The most common solution is called *leader-based r
*primary-backup*, or *active/passive*. It works as follows (see *primary-backup*, or *active/passive*. It works as follows (see
[Figure 6-1](/en/ch6#fig_replication_leader_follower)): [Figure 6-1](/en/ch6#fig_replication_leader_follower)):
1. One of the replicas is designated the *leader* (also known as *primary* or *source* 1. One of the replicas is designated the *leader* (also known as *primary* or *source* [^2]).
[^2]).
When clients want to write to the database, they must send their requests to the leader, which When clients want to write to the database, they must send their requests to the leader, which
first writes the new data to its local storage. first writes the new data to its local storage.
2. The other replicas are known as *followers* (*read replicas*, *secondaries*, or *hot standbys*). 2. The other replicas are known as *followers* (*read replicas*, *secondaries*, or *hot standbys*).
@ -82,9 +81,7 @@ longer contain the same data. The most common solution is called *leader-based r
followers. However, writes are only accepted on the leader (the followers are read-only from the followers. However, writes are only accepted on the leader (the followers are read-only from the
client's point of view). client's point of view).
![ddia 0601](/fig/ddia_0601.png) {{< figure src="/fig/ddia_0601.png" id="fig_replication_leader_follower" title="Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas." class="w-full my-4" >}}
###### Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas.
If the database is sharded (see [Chapter 7](/en/ch7#ch_sharding)), each shard has one leader. Different shards may If the database is sharded (see [Chapter 7](/en/ch7#ch_sharding)), each shard has one leader. Different shards may
have their leaders on different nodes, but each shard must nevertheless have one leader node. In have their leaders on different nodes, but each shard must nevertheless have one leader node. In
@ -92,24 +89,16 @@ have their leaders on different nodes, but each shard must nevertheless have one
multiple leaders for the same shard at the same time. multiple leaders for the same shard at the same time.
Single-leader replication is very widely used. It's a built-in feature of many relational databases, Single-leader replication is very widely used. It's a built-in feature of many relational databases,
such as PostgreSQL, MySQL, Oracle Data Guard such as PostgreSQL, MySQL, Oracle Data Guard [^3], and SQL Server's Always On Availability Groups [^4].
[^3], It is also used in some document databases such as MongoDB and DynamoDB [^5],
and SQL Server's Always On Availability Groups
[^4].
It is also used in some document databases such as MongoDB and DynamoDB
[^5],
message brokers such as Kafka, replicated block devices such as DRBD, and some network filesystems. message brokers such as Kafka, replicated block devices such as DRBD, and some network filesystems.
Many consensus algorithms such as Raft, which is used for replication in CockroachDB Many consensus algorithms such as Raft, which is used for replication in CockroachDB [^6], TiDB [^7],
[^6], etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and automatically
TiDB [^7], elect a new leader if the old one fails (we will discuss consensus in more detail in [Chapter 10](/en/ch10#ch_consistency)).
etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and
automatically elect a new leader if the old one fails (we will discuss consensus in more detail in
[Chapter 10](/en/ch10#ch_consistency)).
> [!NOTE] > [!NOTE]
> In older documents you may see the term *master-slave replication*. It means the same as > In older documents you may see the term *master-slave replication*. It means the same as
> leader-based replication, but the term should be avoided as it is widely considered offensive > leader-based replication, but the term should be avoided as it is widely considered offensive [^8].
> [^8].
## Synchronous Versus Asynchronous Replication ## Synchronous Versus Asynchronous Replication
@ -123,9 +112,7 @@ shortly afterward, it is received by the leader. At some point, the leader forwa
to the followers. Eventually, the leader notifies the client that the update was successful. to the followers. Eventually, the leader notifies the client that the update was successful.
[Figure 6-2](/en/ch6#fig_replication_sync_replication) shows one possible way the timings could work out. [Figure 6-2](/en/ch6#fig_replication_sync_replication) shows one possible way the timings could work out.
![ddia 0602](/fig/ddia_0602.png) {{< figure src="/fig/ddia_0602.png" id="fig_replication_sync_replication" title="Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower." class="w-full my-4" >}}
###### Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower.
In the example of [Figure 6-2](/en/ch6#fig_replication_sync_replication), the replication to follower 1 is In the example of [Figure 6-2](/en/ch6#fig_replication_sync_replication), the replication to follower 1 is
*synchronous*: the leader waits until follower 1 has confirmed that it received the write before *synchronous*: the leader waits until follower 1 has confirmed that it received the write before
@ -168,8 +155,7 @@ client. However, a fully asynchronous configuration has the advantage that the l
processing writes, even if all of its followers have fallen behind. processing writes, even if all of its followers have fallen behind.
Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless
widely used, especially if there are many followers or if they are geographically distributed widely used, especially if there are many followers or if they are geographically distributed [^9].
[^9].
We will return to this issue in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag). We will return to this issue in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag).
## Setting Up New Followers ## Setting Up New Followers
@ -304,8 +290,7 @@ consists of the following steps:
maintenance, this doesn’t apply.) maintenance, this doesn’t apply.)
2. *Choosing a new leader.* This could be done through an election process (where the leader is chosen by 2. *Choosing a new leader.* This could be done through an election process (where the leader is chosen by
a majority of the remaining replicas), or a new leader could be appointed by a previously a majority of the remaining replicas), or a new leader could be appointed by a previously
established *controller node* established *controller node* [^13].
[^13].
The best candidate for leadership is usually the replica with the most up-to-date data changes The best candidate for leadership is usually the replica with the most up-to-date data changes
from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader
is a consensus problem, discussed in detail in [Chapter 10](/en/ch10#ch_consistency). is a consensus problem, discussed in detail in [Chapter 10](/en/ch10#ch_consistency).
@ -324,9 +309,7 @@ Failover is fraught with things that can go wrong:
in the meantime. The most common solution is for the old leader’s unreplicated writes to simply be in the meantime. The most common solution is for the old leader’s unreplicated writes to simply be
discarded, which means that writes you believed to be committed actually weren’t durable after all. discarded, which means that writes you believed to be committed actually weren’t durable after all.
* Discarding writes is especially dangerous if other storage systems outside of the database need to * Discarding writes is especially dangerous if other storage systems outside of the database need to
be coordinated with the database contents. be coordinated with the database contents. For example, in one incident at GitHub [^14],
For example, in one incident at GitHub
[^14],
an out-of-date MySQL follower an out-of-date MySQL follower
was promoted to leader. The database used an autoincrementing counter to assign primary keys to was promoted to leader. The database used an autoincrementing counter to assign primary keys to
new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some
@ -338,8 +321,7 @@ Failover is fraught with things that can go wrong:
leaders accept writes, and there is no process for resolving conflicts (see leaders accept writes, and there is no process for resolving conflicts (see
[“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some [“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some
systems have a mechanism to shut down one node if two leaders are detected. However, if this systems have a mechanism to shut down one node if two leaders are detected. However, if this
mechanism is not carefully designed, you can end up with both nodes being shut down mechanism is not carefully designed, you can end up with both nodes being shut down [^15].
[^15].
Moreover, there is a risk that by the time the split brain is detected and the old node is shut Moreover, there is a risk that by the time the split brain is detected and the old node is shut
down, it is already too late and data has already been corrupted. down, it is already too late and data has already been corrupted.
* What is the right timeout before the leader is declared dead? A longer timeout means a longer * What is the right timeout before the leader is declared dead? A longer timeout means a longer
@ -404,10 +386,8 @@ also known as *state machine replication*, and we will discuss the theory behind
Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today, Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today,
as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if
there is any nondeterminism in a statement. VoltDB uses statement-based replication, and makes it there is any nondeterminism in a statement. VoltDB uses statement-based replication, and makes it
safe by requiring transactions to be deterministic safe by requiring transactions to be deterministic [^16]. However, determinism can be hard to guarantee
[^16]. in practice, so many databases prefer other replication methods.
However, determinism can be hard to guarantee in practice, so many databases prefer other
replication methods.
### Write-ahead log (WAL) shipping ### Write-ahead log (WAL) shipping
@ -453,18 +433,15 @@ A transaction that modifies several rows generates several such log records, fol
indicating that the transaction was committed. MySQL keeps a separate logical replication log, indicating that the transaction was committed. MySQL keeps a separate logical replication log,
called the *binlog*, in addition to the WAL (when configured to use row-based replication). called the *binlog*, in addition to the WAL (when configured to use row-based replication).
PostgreSQL implements logical replication by decoding the physical WAL into row PostgreSQL implements logical replication by decoding the physical WAL into row
insertion/update/delete events insertion/update/delete events [^19].
[^19].
Since a logical log is decoupled from the storage engine internals, it can more easily be kept Since a logical log is decoupled from the storage engine internals, it can more easily be kept
backward compatible, allowing the leader and the follower to run different versions of the database backward compatible, allowing the leader and the follower to run different versions of the database
software. This in turn enables upgrading to a new version with minimal downtime software. This in turn enables upgrading to a new version with minimal downtime [^20].
[^20].
A logical log format is also easier for external applications to parse. This aspect is useful if you want A logical log format is also easier for external applications to parse. This aspect is useful if you want
to send the contents of a database to an external system, such as a data warehouse for offline to send the contents of a database to an external system, such as a data warehouse for offline
analysis, or for building custom indexes and caches analysis, or for building custom indexes and caches [^21].
[^21].
This technique is called *change data capture*, and we will return to it in [Link to Come]. This technique is called *change data capture*, and we will return to it in [Link to Come].
# Problems with Replication Lag # Problems with Replication Lag
@ -526,9 +503,7 @@ With asynchronous replication, there is a problem, illustrated in
new data may not yet have reached the replica. To the user, it looks as though the data they new data may not yet have reached the replica. To the user, it looks as though the data they
submitted was lost, so they will be understandably unhappy. submitted was lost, so they will be understandably unhappy.
![ddia 0603](/fig/ddia_0603.png) {{< figure src="/fig/ddia_0603.png" id="fig_replication_read_your_writes" title="Figure 6-3. A user makes a write, followed by a read from a stale replica. To prevent this anomaly, we need read-after-write consistency." class="w-full my-4" >}}
###### Figure 6-3. A user makes a write, followed by a read from a stale replica. To prevent this anomaly, we need read-after-write consistency.
In this situation, we need *read-after-write consistency*, also known as *read-your-writes consistency* In this situation, we need *read-after-write consistency*, also known as *read-your-writes consistency*
[^23]. [^23].
@ -617,9 +592,7 @@ hadnt returned anything, because user 2345 probably wouldnt know that user
a comment. However, it’s very confusing for user 2345 if they first see user 1234’s comment appear, a comment. However, it’s very confusing for user 2345 if they first see user 1234’s comment appear,
and then see it disappear again. and then see it disappear again.
![ddia 0604](/fig/ddia_0604.png) {{< figure src="/fig/ddia_0604.png" id="fig_replication_monotonic_reads" title="Figure 6-4. A user first reads from a fresh replica, then from a stale replica. Time appears to go backward. To prevent this anomaly, we need monotonic reads." class="w-full my-4" >}}
###### Figure 6-4. A user first reads from a fresh replica, then from a stale replica. Time appears to go backward. To prevent this anomaly, we need monotonic reads.
*Monotonic reads* [^22] is a guarantee that this *Monotonic reads* [^22] is a guarantee that this
kind of anomaly does not happen. It’s a lesser guarantee than strong consistency, but a stronger kind of anomaly does not happen. It’s a lesser guarantee than strong consistency, but a stronger
@ -660,9 +633,7 @@ To the observer it looks as though Mrs. Cake is answering the question before Mr
it. Such psychic powers are impressive, but very confusing it. Such psychic powers are impressive, but very confusing
[^27]. [^27].
![ddia 0605](/fig/ddia_0605.png) {{< figure src="/fig/ddia_0605.png" id="fig_replication_consistent_prefix" title="Figure 6-5. If some shards are replicated slower than others, an observer may see the answer before they see the question." class="w-full my-4" >}}
###### Figure 6-5. If some shards are replicated slower than others, an observer may see the answer before they see the question.
Preventing this kind of anomaly requires another type of guarantee: *consistent prefix reads* Preventing this kind of anomaly requires another type of guarantee: *consistent prefix reads*
[^22]. This guarantee says that if a sequence of [^22]. This guarantee says that if a sequence of
@ -757,9 +728,7 @@ regular leaderfollower replication is used (with followers maybe in a differe
from the leader); between regions, each region’s leader replicates its changes to the leaders in from the leader); between regions, each region’s leader replicates its changes to the leaders in
other regions. other regions.
![ddia 0606](/fig/ddia_0606.png) {{< figure src="/fig/ddia_0606.png" id="fig_replication_multi_dc" title="Figure 6-6. Multi-leader replication across multiple regions." class="w-full my-4" >}}
###### Figure 6-6. Multi-leader replication across multiple regions.
Let’s compare how the single-leader and multi-leader configurations fare in a multi-region Let’s compare how the single-leader and multi-leader configurations fare in a multi-region
deployment: deployment:
@ -825,9 +794,7 @@ only one plausible topology: leader 1 must send all of its writes to leader 2, a
more than two leaders, various different topologies are possible. Some examples are illustrated in more than two leaders, various different topologies are possible. Some examples are illustrated in
[Figure 6-7](/en/ch6#fig_replication_topologies). [Figure 6-7](/en/ch6#fig_replication_topologies).
![ddia 0607](/fig/ddia_0607.png) {{< figure src="/fig/ddia_0607.png" id="fig_replication_topologies" title="Figure 6-7. Three example topologies in which multi-leader replication can be set up." class="w-full my-4" >}}
###### Figure 6-7. Three example topologies in which multi-leader replication can be set up.
The most general topology is *all-to-all*, shown in The most general topology is *all-to-all*, shown in
[Figure 6-7](/en/ch6#fig_replication_topologies)(c), [Figure 6-7](/en/ch6#fig_replication_topologies)(c),
@ -862,9 +829,7 @@ On the other hand, all-to-all topologies can have issues too. In particular, som
be faster than others (e.g., due to network congestion), with the result that some replication be faster than others (e.g., due to network congestion), with the result that some replication
messages may “overtake” others, as illustrated in [Figure 6-8](/en/ch6#fig_replication_causality). messages may “overtake” others, as illustrated in [Figure 6-8](/en/ch6#fig_replication_causality).
![ddia 0608](/fig/ddia_0608.png) {{< figure src="/fig/ddia_0608.png" id="fig_replication_causality" title="Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas." class="w-full my-4" >}}
###### Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas.
In [Figure 6-8](/en/ch6#fig_replication_causality), client A inserts a row into a table on leader 1, and client B In [Figure 6-8](/en/ch6#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may
@ -958,12 +923,10 @@ approach has a number of advantages:
service calls in application code. Every service call requires error handling, as discussed in service calls in application code. Every service call requires error handling, as discussed in
[“The problems with remote procedure calls (RPCs)”](/en/ch5#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user [“The problems with remote procedure calls (RPCs)”](/en/ch5#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user
interface needs to somehow reflect that error. A sync engine allows the app to perform reads and interface needs to somehow reflect that error. A sync engine allows the app to perform reads and
writes on local data, which almost never fails, leading to a more declarative programming style writes on local data, which almost never fails, leading to a more declarative programming style [^41].
[^41].
* In order to display edits from other users in real-time, you need to receive notifications of * In order to display edits from other users in real-time, you need to receive notifications of
those edits and efficiently update the user interface accordingly. A sync engine combined with a those edits and efficiently update the user interface accordingly. A sync engine combined with a
*reactive programming* model is a good way of implementing this *reactive programming* model is a good way of implementing this [^42].
[^42].
Sync engines work best when all the data that the user may need is downloaded in advance and stored Sync engines work best when all the data that the user may need is downloaded in advance and stored
persistently on the client. This means that the data is available for offline access when needed, persistently on the client. This means that the data is available for offline access when needed,
@ -972,8 +935,7 @@ of data. For example, downloading all the files that the user themselves created
(one user generally doesn’t generate that much data), but downloading the entire catalog of an (one user generally doesn’t generate that much data), but downloading the entire catalog of an
e-commerce website probably doesn’t make sense. e-commerce website probably doesn’t make sense.
The sync engine was pioneered by Lotus Notes in the 1980s The sync engine was pioneered by Lotus Notes in the 1980s [^43]
[^43]
(without using that term), and sync for specific apps such as calendars has also existed for a long (without using that term), and sync for specific apps such as calendars has also existed for a long
time. Today there are a number of general-purpose sync engines, some of which use a proprietary time. Today there are a number of general-purpose sync engines, some of which use a proprietary
backend service (e.g., Google Firestore, Realm, or Ditto), and some have an open source backend, backend service (e.g., Google Firestore, Realm, or Ditto), and some have an open source backend,
@ -982,8 +944,7 @@ making them suitable for creating local-first software (e.g., PouchDB/CouchDB, A
Multiplayer video games have a similar need to respond immediately to the user’s local actions, and Multiplayer video games have a similar need to respond immediately to the user’s local actions, and
reconcile them with other players’ actions received asynchronously over the network. In game reconcile them with other players’ actions received asynchronously over the network. In game
development jargon the equivalent of a sync engine is called *netcode*. The techniques used in development jargon the equivalent of a sync engine is called *netcode*. The techniques used in
netcode are quite specific to the requirements of games netcode are quite specific to the requirements of games [^44], and don’t directly
[^44], and don’t directly
carry over to other types of software, so we won’t consider them further in this book. carry over to other types of software, so we won’t consider them further in this book.
## Dealing with Conflicting Writes ## Dealing with Conflicting Writes
@ -998,9 +959,7 @@ independently changes the title from A to C. Each users change is successfull
local leader. However, when the changes are asynchronously replicated, a conflict is detected. local leader. However, when the changes are asynchronously replicated, a conflict is detected.
This problem does not occur in a single-leader database. This problem does not occur in a single-leader database.
![ddia 0609](/fig/ddia_0609.png) {{< figure src="/fig/ddia_0609.png" id="fig_replication_write_conflict" title="Figure 6-9. A write conflict caused by two leaders concurrently updating the same record." class="w-full my-4" >}}
###### Figure 6-9. A write conflict caused by two leaders concurrently updating the same record.
> [!NOTE] > [!NOTE]
> We say that the two writes in [Figure 6-9](/en/ch6#fig_replication_write_conflict) are *concurrent* because neither > We say that the two writes in [Figure 6-9](/en/ch6#fig_replication_write_conflict) are *concurrent* because neither
@ -1114,9 +1073,8 @@ suffers from a number of problems:
not careful to order them consistently. When the conflict between “B/C” and “C/B” is merged, it not careful to order them consistently. When the conflict between “B/C” and “C/B” is merged, it
may result in “B/C/C/B” or something similarly surprising. may result in “B/C/C/B” or something similarly surprising.
![ddia 0610](/fig/ddia_0610.png) {{< figure src="/fig/ddia_0610.png" id="fig_replication_amazon_anomaly" title="Figure 6-10. Example of Amazon's shopping cart anomaly: if conflicts on a shopping cart are merged by taking the union, deleted items may reappear." class="w-full my-4" >}}
###### Figure 6-10. Example of Amazon’s shopping cart anomaly: if conflicts on a shopping cart are merged by taking the union, deleted items may reappear.
### Automatic conflict resolution ### Automatic conflict resolution
@ -1166,9 +1124,8 @@ text. Assume you have two replicas that both start off with the text “ice”.
letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make
“ice!”. “ice!”.
![ddia 0611](/fig/ddia_0611.png) {{< figure src="/fig/ddia_0611.png" id="fig_replication_ot_crdt" title="Figure 6-11. How two concurrent insertions into a string are merged by OT and a CRDT respectively." class="w-full my-4" >}}
###### Figure 6-11. How two concurrent insertions into a string are merged by OT and a CRDT respectively.
The merged result “nice!” is achieved differently by both types of algorithms: The merged result “nice!” is achieved differently by both types of algorithms:
@ -1192,15 +1149,11 @@ CRDT
There are many algorithms based on variations of these ideas. Lists/arrays can be supported There are many algorithms based on variations of these ideas. Lists/arrays can be supported
similarly, using list elements instead of characters, and other datatypes such as key-value maps can similarly, using list elements instead of characters, and other datatypes such as key-value maps can
be added quite easily. There are some performance and functionality trade-offs between OT and CRDTs, be added quite easily. There are some performance and functionality trade-offs between OT and CRDTs,
but it’s possible to combine the advantages of CRDTs and OT in one algorithm but it’s possible to combine the advantages of CRDTs and OT in one algorithm [^48].
[^48].
OT is most often used for real-time collaborative editing of text, e.g. in Google Docs OT is most often used for real-time collaborative editing of text, e.g. in Google Docs [^32], whereas CRDTs can be found in
[^32], whereas CRDTs can be found in distributed databases such as Redis Enterprise, Riak, and Azure Cosmos DB [^49].
distributed databases such as Redis Enterprise, Riak, and Azure Cosmos DB Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge or Yjs) and with OT (e.g., ShareDB).
[^49].
Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge or Yjs) and with OT
(e.g., ShareDB).
### What is a conflict? ### What is a conflict?
@ -1233,8 +1186,7 @@ Some data storage systems take a different approach, abandoning the concept of a
allowing any replica to directly accept writes from clients. Some of the earliest replicated data allowing any replica to directly accept writes from clients. Some of the earliest replicated data
systems were leaderless [^1] [^50], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became systems were leaderless [^1] [^50], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became
a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in
2007 [^45]. 2007 [^45]. Riak, Cassandra, and ScyllaDB are open source datastores with leaderless replication models inspired
Riak, Cassandra, and ScyllaDB are open source datastores with leaderless replication models inspired
by Dynamo, so this kind of database is also known as *Dynamo-style*. by Dynamo, so this kind of database is also known as *Dynamo-style*.
> [!NOTE] > [!NOTE]
@ -1261,9 +1213,8 @@ replica misses it. Lets say that its sufficient for two out of three repli
acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be
successful. The client simply ignores the fact that one of the replicas missed the write. successful. The client simply ignores the fact that one of the replicas missed the write.
![ddia 0612](/fig/ddia_0612.png) {{< figure src="/fig/ddia_0612.png" id="fig_replication_quorum_node_outage" title="Figure 6-12. A quorum write, quorum read, and read repair after a node outage." class="w-full my-4" >}}
###### Figure 6-12. A quorum write, quorum read, and read repair after a node outage.
Now imagine that the unavailable node comes back online, and clients start reading from it. Any Now imagine that the unavailable node comes back online, and clients start reading from it. Any
writes that happened while the node was down are missing from that node. Thus, if you read from that writes that happened while the node was down are missing from that node. Thus, if you read from that
@ -1352,9 +1303,8 @@ Normally, reads and writes are always sent to all *n* replicas in parallel. The
*r* determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success *r* determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success
before we consider the read or write to be successful. before we consider the read or write to be successful.
![ddia 0613](/fig/ddia_0613.png) {{< figure src="/fig/ddia_0613.png" id="fig_replication_quorum_overlap" title="Figure 6-13. If *w* + *r* > *n*, at least one of the *r* replicas you read from must have seen the most recent successful write." class="w-full my-4" >}}
###### Figure 6-13. If *w* + *r* > *n*, at least one of the *r* replicas you read from must have seen the most recent successful write.
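
The quorum condition in the Figure 6-13 caption can be checked by brute force for small clusters. The following is a minimal sketch, not taken from the book: the function name `quorums_overlap` is hypothetical, and replicas are modeled simply as the integers `0` to `n - 1`. It enumerates every possible set of *w* replicas that acknowledged a write and every possible set of *r* replicas that answered a read, and confirms that they always share at least one replica exactly when *w* + *r* > *n*.

```python
from itertools import combinations


def quorums_overlap(n: int, w: int, r: int) -> bool:
    """Return True if every write set of size w shares at least one
    replica with every read set of size r, among n replicas."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# With n = 3, choosing w = 2 and r = 2 forces an overlap,
# whereas w = 1 and r = 2 leaves room for a read that misses the write.
overlap_with_quorum = quorums_overlap(3, 2, 2)     # True
overlap_without_quorum = quorums_overlap(3, 1, 2)  # False
```

The overlap only shows that a read contacts at least one replica that accepted the write; it says nothing about the edge cases that weaken this guarantee in practice.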
If fewer than the required *w* or *r* nodes are available, writes or reads return an error. A node If fewer than the required *w* or *r* nodes are available, writes or reads return an error. A node
could be unavailable for many reasons: because the node is down (crashed, powered down), due to an could be unavailable for many reasons: because the node is down (crashed, powered down), due to an
@ -1404,8 +1354,7 @@ properties can be confusing. Some scenarios include:
* If a write succeeded on some replicas but failed on others (for example because the disks on some * If a write succeeded on some replicas but failed on others (for example because the disks on some
nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the
replicas where it succeeded. This means that if a write was reported as failed, subsequent reads replicas where it succeeded. This means that if a write was reported as failed, subsequent reads
may or may not return the value from that write may or may not return the value from that write [^52].
[^52].
* If the database uses timestamps from a real-time clock to determine which write is newer (as * If the database uses timestamps from a real-time clock to determine which write is newer (as
Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a
faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww). faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww).
@ -1418,8 +1367,7 @@ properties can be confusing. Some scenarios include:
Thus, although quorums appear to guarantee that a read returns the latest written value, in practice Thus, although quorums appear to guarantee that a read returns the latest written value, in practice
it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate
eventual consistency. The parameters *w* and *r* allow you to adjust the probability of stale values eventual consistency. The parameters *w* and *r* allow you to adjust the probability of stale values
being read [^53], being read [^53], but it’s wise to not take them as absolute guarantees.
but it’s wise to not take them as absolute guarantees.
### Monitoring staleness ### Monitoring staleness
@ -1436,8 +1384,7 @@ current position, you can measure the amount of replication lag.
However, in systems with leaderless replication, there is no fixed order in which writes are However, in systems with leaderless replication, there is no fixed order in which writes are
applied, which makes monitoring more difficult. The number of hints that a replica stores for applied, which makes monitoring more difficult. The number of hints that a replica stores for
handoff can be one measure of system health, but it’s difficult to interpret usefully handoff can be one measure of system health, but it’s difficult to interpret usefully [^54].
[^54].
Eventual consistency is a deliberately vague guarantee, but for operability it’s important to be Eventual consistency is a deliberately vague guarantee, but for operability it’s important to be
able to quantify “eventual.” able to quantify “eventual.”
@ -1465,15 +1412,12 @@ A big advantage of a leaderless architecture is that it is more resilient agains
Because there is no failover, and requests go to multiple replicas in parallel anyway, one replica Because there is no failover, and requests go to multiple replicas in parallel anyway, one replica
becoming slow or unavailable has very little impact on response times: the client simply uses the becoming slow or unavailable has very little impact on response times: the client simply uses the
responses from the other replicas that are faster to respond. Using the fastest responses is called responses from the other replicas that are faster to respond. Using the fastest responses is called
*request hedging*, and it can significantly reduce tail latency *request hedging*, and it can significantly reduce tail latency [^55]).
[^55]).
At its core, the resilience of a leaderless system comes from the fact that it doesn’t distinguish At its core, the resilience of a leaderless system comes from the fact that it doesn’t distinguish
between the normal case and the failure case. This is especially helpful when handling so-called between the normal case and the failure case. This is especially helpful when handling so-called
*gray failures*, in which a node isn’t completely down, but running in a degraded state where it is *gray failures*, in which a node isn’t completely down, but running in a degraded state where it is
unusually slow to handle requests unusually slow to handle requests [^56], or when a node is simply overloaded (for example, if a node has been offline for a while, recovery
[^56],
or when a node is simply overloaded (for example, if a node has been offline for a while, recovery
via hinted handoff can cause a lot of additional load). A leader-based system has to decide whether via hinted handoff can cause a lot of additional load). A leader-based system has to decide whether
the situation is bad enough to warrant a failover (which can itself cause further disruption), the situation is bad enough to warrant a failover (which can itself cause further disruption),
whereas in a leaderless system that question doesn’t even arise. whereas in a leaderless system that question doesn’t even arise.
@ -1493,8 +1437,7 @@ That said, leaderless systems can have performance problems as well:
* A large-scale network interruption that disconnects a client from a large number of replicas can * A large-scale network interruption that disconnects a client from a large number of replicas can
make it impossible to form a quorum. Some leaderless databases offer a configuration option that make it impossible to form a quorum. Some leaderless databases offer a configuration option that
allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that
key (Riak and Dynamo call this a *sloppy quorum* key (Riak and Dynamo call this a *sloppy quorum* [^45];
[^45];
Cassandra and ScyllaDB call it *consistency level ANY*). There is no guarantee that subsequent Cassandra and ScyllaDB call it *consistency level ANY*). There is no guarantee that subsequent
reads will see the written value, but depending on the application it may still be better than reads will see the written value, but depending on the application it may still be better than
having the write fail. having the write fail.
@ -1539,9 +1482,8 @@ A and B, simultaneously writing to a key *X* in a three-node datastore:
* Node 2 first receives the write from A, then the write from B. * Node 2 first receives the write from A, then the write from B.
* Node 3 first receives the write from B, then the write from A. * Node 3 first receives the write from B, then the write from A.
![ddia 0614](/fig/ddia_0614.png) {{< figure src="/fig/ddia_0614.png" id="fig_replication_concurrency" title="Figure 6-14. Concurrent writes in a Dynamo-style datastore: there is no well-defined ordering." class="w-full my-4" >}}
###### Figure 6-14. Concurrent writes in a Dynamo-style datastore: there is no well-defined ordering.
If each node simply overwrote the value for a key whenever it received a write request from a If each node simply overwrote the value for a key whenever it received a write request from a
client, the nodes would become permanently inconsistent, as shown by the final *get* request in client, the nodes would become permanently inconsistent, as shown by the final *get* request in
@ -1642,9 +1584,8 @@ empty. Between them, the clients make five writes to the database:
`[milk, flour]` (note that `[eggs]` was already overwritten in the last step) but is concurrent `[milk, flour]` (note that `[eggs]` was already overwritten in the last step) but is concurrent
with `[eggs, milk, ham]`, so the server keeps those two concurrent values. with `[eggs, milk, ham]`, so the server keeps those two concurrent values.
![ddia 0615](/fig/ddia_0615.png) {{< figure src="/fig/ddia_0615.png" id="fig_replication_causality_single" title="Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart." class="w-full my-4" >}}
###### Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart.
The dataflow between the operations in [Figure 6-15](/en/ch6#fig_replication_causality_single) is illustrated The dataflow between the operations in [Figure 6-15](/en/ch6#fig_replication_causality_single) is illustrated
graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The arrows indicate which operation graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The arrows indicate which operation
@ -1653,9 +1594,8 @@ graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The
on the server, since there is always another operation going on concurrently. But old versions of on the server, since there is always another operation going on concurrently. But old versions of
the value do get overwritten eventually, and no writes are lost. the value do get overwritten eventually, and no writes are lost.
![ddia 0616](/fig/ddia_0616.png) {{< figure link="#fig_replication_causality_single" src="/fig/ddia_0616.png" id="fig_replication_causal_dependencies" title="Figure 6-16. Graph of causal dependencies in Figure 6-15." class="w-full my-4" >}}
###### Figure 6-16. Graph of causal dependencies in [Figure 6-15](/en/ch6#fig_replication_causality_single).
Note that the server can determine whether two operations are concurrent by looking at the version Note that the server can determine whether two operations are concurrent by looking at the version
numbers—it does not need to interpret the value itself (so the value could be any data numbers—it does not need to interpret the value itself (so the value could be any data


@ -32,9 +32,7 @@ of sharding and replication can look like [Figure 7-1](/en/ch7#fig_sharding_rep
leader is assigned to one node, and its followers are assigned to other nodes. Each node may be the leader is assigned to one node, and its followers are assigned to other nodes. Each node may be the
leader for some shards and a follower for other shards, but each shard still only has one leader. leader for some shards and a follower for other shards, but each shard still only has one leader.
![ddia 0701](/fig/ddia_0701.png) {{< figure src="/fig/ddia_0701.png" id="fig_sharding_replicas" title="Figure 7-1. Combining replication and sharding: each node acts as leader for some shards and follower for other shards." class="w-full my-4" >}}
###### Figure 7-1. Combining replication and sharding: each node acts as leader for some shards and follower for other shards.
Everything we discussed in [Chapter 6](/en/ch6#ch_replication) about replication of databases applies equally to Everything we discussed in [Chapter 6](/en/ch6#ch_replication) about replication of databases applies equally to
replication of shards. Since the choice of sharding scheme is mostly independent of the choice of replication of shards. Since the choice of sharding scheme is mostly independent of the choice of
@ -50,8 +48,7 @@ Couchbase, to name just a few.
Some databases treat partitions and shards as two distinct concepts. For example, in PostgreSQL, Some databases treat partitions and shards as two distinct concepts. For example, in PostgreSQL,
partitioning is a way of splitting a large table into several files that are stored on the same partitioning is a way of splitting a large table into several files that are stored on the same
machine (which has several advantages, such as making it very fast to delete an entire partition), machine (which has several advantages, such as making it very fast to delete an entire partition),
whereas sharding splits a dataset across multiple machines whereas sharding splits a dataset across multiple machines [^1] [^2].
[^1] [^2].
In many other systems, partitioning is just another word for sharding. In many other systems, partitioning is just another word for sharding.
While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to
@ -136,31 +133,26 @@ Cell-based architecture
application code. In a *cell-based architecture*, the services and storage for a particular set of application code. In a *cell-based architecture*, the services and storage for a particular set of
tenants are grouped into a self-contained *cell*, and different cells are set up such that they tenants are grouped into a self-contained *cell*, and different cells are set up such that they
can run largely independently from each other. This approach provides *fault isolation*: that is, can run largely independently from each other. This approach provides *fault isolation*: that is,
a fault in one cell remains limited to that cell, and tenants in other cells are not affected a fault in one cell remains limited to that cell, and tenants in other cells are not affected [^8].
[^8].
Per-tenant backup and restore Per-tenant backup and restore
: Backing up each tenant’s shard separately makes it possible to restore a tenant’s state from a : Backing up each tenant’s shard separately makes it possible to restore a tenant’s state from a
backup without affecting other tenants, which can be useful in case the tenant accidentally backup without affecting other tenants, which can be useful in case the tenant accidentally
deletes or overwrites important data deletes or overwrites important data [^9].
[^9].
Regulatory compliance Regulatory compliance
: Data privacy regulation such as the GDPR gives individuals the right to access and delete all data : Data privacy regulation such as the GDPR gives individuals the right to access and delete all data
stored about them. If each person’s data is stored in a separate shard, this translates into stored about them. If each person’s data is stored in a separate shard, this translates into
simple data export and deletion operations on their shard simple data export and deletion operations on their shard [^10].
[^10].
Data residence Data residence
: If a particular tenant’s data needs to be stored in a particular jurisdiction in order to comply : If a particular tenant’s data needs to be stored in a particular jurisdiction in order to comply
with data residency laws, a region-aware database can allow you to assign that tenant’s shard to a with data residency laws, a region-aware database can allow you to assign that tenant’s shard to a particular region.
particular region.
Gradual schema rollout Gradual schema rollout
: Schema migrations (previously discussed in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)) can be rolled : Schema migrations (previously discussed in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)) can be rolled
out gradually, one tenant at a time. This reduces risk, as you can detect problems before they out gradually, one tenant at a time. This reduces risk, as you can detect problems before they
affect all tenants, but it can be difficult to do transactionally affect all tenants, but it can be difficult to do transactionally [^11].
[^11].
The main challenges around using sharding for multitenancy are: The main challenges around using sharding for multitenancy are:
@ -207,9 +199,7 @@ to look up the entry for a particular title, you can easily determine which shar
entry by finding the volume whose key range contains the title you’re looking for, and thus pick the entry by finding the volume whose key range contains the title you’re looking for, and thus pick the
correct book off the shelf. correct book off the shelf.
![ddia 0702](/fig/ddia_0702.png) {{< figure src="/fig/ddia_0702.png" id="fig_sharding_encyclopedia" title="Figure 7-2. A print encyclopedia is sharded by key range." class="w-full my-4" >}}
###### Figure 7-2. A print encyclopedia is sharded by key range.
The ranges of keys are not necessarily evenly spaced, because your data may not be evenly The ranges of keys are not necessarily evenly spaced, because your data may not be evenly
distributed. For example, in [Figure 7-2](/en/ch7#fig_sharding_encyclopedia), volume 1 contains words starting with A distributed. For example, in [Figure 7-2](/en/ch7#fig_sharding_encyclopedia), volume 1 contains words starting with A
@ -307,9 +297,7 @@ have three nodes and add a fourth. Before the rebalancing, node 0 stored the key
0, 3, 6, 9, and so on. After adding the fourth node, the key with hash 3 has moved to node 3, the 0, 3, 6, 9, and so on. After adding the fourth node, the key with hash 3 has moved to node 3, the
key with hash 6 has moved to node 2, the key with hash 9 has moved to node 1, and so on. key with hash 6 has moved to node 2, the key with hash 9 has moved to node 1, and so on.
![ddia 0703](/fig/ddia_0703.png) {{< figure src="/fig/ddia_0703.png" id="fig_sharding_hash_mod_n" title="Figure 7-3. Assigning keys to nodes by hashing the key and taking it modulo the number of nodes. Changing the number of nodes results in many keys moving from one node to another." class="w-full my-4" >}}
###### Figure 7-3. Assigning keys to nodes by hashing the key and taking it modulo the number of nodes. Changing the number of nodes results in many keys moving from one node to another.
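
As a rough illustration of why *mod N* rebalances poorly, the sketch below assigns hash values to nodes with `hash % N` and counts how many of them change node when the cluster grows from three nodes to four. The helper names `node_for` and `moved_fraction` are hypothetical, and the hash values are plain small integers rather than the output of a real hash function.

```python
def node_for(hash_value: int, num_nodes: int) -> int:
    """Assign a hash value to a node by taking it modulo the node count."""
    return hash_value % num_nodes


def moved_fraction(hash_values, old_nodes: int, new_nodes: int) -> float:
    """Fraction of hash values whose node changes when the node count changes."""
    hash_values = list(hash_values)
    moved = sum(
        1 for h in hash_values
        if node_for(h, old_nodes) != node_for(h, new_nodes)
    )
    return moved / len(hash_values)


# Hashes 0, 3, 6, 9 all live on node 0 while there are three nodes.
before = {h: node_for(h, 3) for h in (0, 3, 6, 9)}
# After adding a fourth node, most of them are assigned elsewhere.
after = {h: node_for(h, 4) for h in (0, 3, 6, 9)}
```

For the hash values 0 to 11, `moved_fraction(range(12), 3, 4)` shows that nine of the twelve values change node, even though only about a quarter of the data would need to move to fill the new node.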
The *mod N* function is easy to compute, but it leads to very inefficient rebalancing because there The *mod N* function is easy to compute, but it leads to very inefficient rebalancing because there
is a lot of unnecessary movement of records from one node to another. We need an approach that is a lot of unnecessary movement of records from one node to another. We need an approach that
@ -328,9 +316,7 @@ nodes to the new node until they are fairly distributed once again. This process
[Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in [Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in
reverse. reverse.
![ddia 0704](/fig/ddia_0704.png) {{< figure src="/fig/ddia_0704.png" id="fig_sharding_rebalance_fixed" title="Figure 7-4. Adding a new node to a database cluster with multiple shards per node." class="w-full my-4" >}}
###### Figure 7-4. Adding a new node to a database cluster with multiple shards per node.
In this model, only entire shards are moved between nodes, which is cheaper than splitting shards. In this model, only entire shards are moved between nodes, which is cheaper than splitting shards.
The number of shards does not change, nor does the assignment of keys to shards. The only thing that The number of shards does not change, nor does the assignment of keys to shards. The only thing that
@ -377,9 +363,7 @@ Even if the input keys are very similar (e.g., consecutive timestamps), their ha
distributed across that range. We can then assign a range of hash values to each shard: for example, distributed across that range. We can then assign a range of hash values to each shard: for example,
values between 0 and 16,383 to shard 0, values between 16,384 and 32,767 to shard 1, and so on. values between 0 and 16,383 to shard 0, values between 16,384 and 32,767 to shard 1, and so on.
![ddia 0705](/fig/ddia_0705.png) {{< figure src="/fig/ddia_0705.png" id="fig_sharding_hash_range" title="Figure 7-5. Assigning a contiguous range of hash values to each shard." class="w-full my-4" >}}
###### Figure 7-5. Assigning a contiguous range of hash values to each shard.
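
A minimal sketch of the lookup this implies, assuming four shards that each own a contiguous range of 16,384 hash values as in the example above. The names `boundaries` and `shard_for` and the shard count are assumptions made for illustration, not taken from any particular database.

```python
import bisect

# Hypothetical shard boundaries: shard i owns hash values from
# boundaries[i] up to (but not including) boundaries[i + 1].
RANGE_WIDTH = 16_384
NUM_SHARDS = 4  # assumption for this sketch
boundaries = [i * RANGE_WIDTH for i in range(NUM_SHARDS)]


def shard_for(hash_value: int) -> int:
    """Return the index of the shard whose hash range contains hash_value."""
    if not 0 <= hash_value < NUM_SHARDS * RANGE_WIDTH:
        raise ValueError("hash value outside the sharded range")
    # bisect_right finds the first boundary greater than hash_value;
    # the owning shard is the one just before it.
    return bisect.bisect_right(boundaries, hash_value) - 1
```

Keeping the boundaries in a sorted list, rather than dividing by a fixed width, means the same lookup would still work if the ranges were of unequal size.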
Like with key-range sharding, a shard in hash-range sharding can be split when it becomes too big or Like with key-range sharding, a shard in hash-range sharding can be split when it becomes too big or
too heavily loaded. This is still an expensive operation, but it can happen as needed, so the number too heavily loaded. This is still an expensive operation, but it can happen as needed, so the number
@ -407,12 +391,9 @@ Cassandra and ScyllaDB use a variant of this approach that is illustrated in
to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8 to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between
those ranges. This means some ranges are bigger than others, but by having multiple ranges per node those ranges. This means some ranges are bigger than others, but by having multiple ranges per node
those imbalances tend to even out those imbalances tend to even out [^15] [^18].
[^15] [^18].
![ddia 0706](/fig/ddia_0706.png) {{< figure src="/fig/ddia_0706.png" id="fig_sharding_cassandra" title="Figure 7-6. Cassandra and ScyllaDB split the range of possible hash values (here 0–1023) into contiguous ranges with random boundaries, and assign several ranges to each node." class="w-full my-4" >}}
###### Figure 7-6. Cassandra and ScyllaDB split the range of possible hash values (here 0–1023) into contiguous ranges with random boundaries, and assign several ranges to each node.
When nodes are added or removed, range boundaries are added and removed, and shards are split or When nodes are added or removed, range boundaries are added and removed, and shards are split or
merged accordingly [^19]. merged accordingly [^19].
@ -433,13 +414,9 @@ Note that *consistent* here has nothing to do with replica consistency (see [Cha
ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describes the tendency of a key to stay in ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describes the tendency of a key to stay in
the same shard as much as possible. the same shard as much as possible.
The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of consistent hashing [^20],
consistent hashing [^20], but several other consistent hashing algorithms have also been proposed [^21], such as *highest random weight*, also known as *rendezvous hashing* [^22],
but several other consistent hashing algorithms have also been proposed [^21], and *jump consistent hash* [^23].
such as *highest random weight*, also known as *rendezvous hashing*
[^22],
and *jump consistent hash*
[^23].
With Cassandra’s algorithm, if one node is added, a small number of existing shards are split into With Cassandra’s algorithm, if one node is added, a small number of existing shards are split into
sub-ranges; on the other hand, with rendezvous and jump consistent hashes, the new node is assigned sub-ranges; on the other hand, with rendezvous and jump consistent hashes, the new node is assigned
individual keys that were previously scattered across all of the other nodes. Which one is individual keys that were previously scattered across all of the other nodes. Which one is
@ -458,8 +435,7 @@ of activity when they do something [^24].
This event can result in a large volume of reads and writes to the same key (where the partition key This event can result in a large volume of reads and writes to the same key (where the partition key
is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on). is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on).
In such situations, a more flexible sharding policy is required In such situations, a more flexible sharding policy is required [^25] [^26].
[^25] [^26].
A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put
an individual hot key in a shard by its own, and perhaps even assigning it a dedicated machine [^27]. an individual hot key in a shard by its own, and perhaps even assigning it a dedicated machine [^27].
@ -482,9 +458,7 @@ likely to calm down again. Moreover, some keys may be hot for writes while other
necessitating different strategies for handling them. necessitating different strategies for handling them.
Some systems (especially cloud services designed for large scale) have automated approaches for Some systems (especially cloud services designed for large scale) have automated approaches for
dealing with hot shards; for example, Amazon calls it *heat management* dealing with hot shards; for example, Amazon calls it *heat management* [^28] or *adaptive capacity* [^17].
[^28]
or *adaptive capacity* [^17].
The details of how these systems work go beyond the scope of this book. The details of how these systems work go beyond the scope of this book.
## Operations: Automatic or Manual Rebalancing ## Operations: Automatic or Manual Rebalancing
@ -501,8 +475,7 @@ effect.
Fully automated rebalancing can be convenient, because there is less operational work to do for Fully automated rebalancing can be convenient, because there is less operational work to do for
normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud
databases such as DynamoDB are promoted as being able to automatically add and remove shards to databases such as DynamoDB are promoted as being able to automatically add and remove shards to
adapt to big increases or decreases of load within a matter of minutes adapt to big increases or decreases of load within a matter of minutes [^17] [^29].
[^17] [^29].
However, automatic shard management can also be unpredictable. Rebalancing is an expensive However, automatic shard management can also be unpredictable. Rebalancing is an expensive
operation, because it requires rerouting requests and moving a large amount of data from one node to operation, because it requires rerouting requests and moving a large amount of data from one node to
@ -548,9 +521,7 @@ in [Figure 7-7](/en/ch7#fig_sharding_routing)):
3. Require that clients be aware of the sharding and the assignment of shards to nodes. In this 3. Require that clients be aware of the sharding and the assignment of shards to nodes. In this
case, a client can connect directly to the appropriate node, without any intermediary. case, a client can connect directly to the appropriate node, without any intermediary.
![ddia 0707](/fig/ddia_0707.png) {{< figure src="/fig/ddia_0707.png" id="fig_sharding_routing" title="Figure 7-7. Three different ways of routing a request to the right node." class="w-full my-4" >}}
###### Figure 7-7. Three different ways of routing a request to the right node.
In all cases, there are some key problems: In all cases, there are some key problems:
@ -573,9 +544,7 @@ to nodes. Other actors, such as the routing tier or the sharding-aware client, c
information in ZooKeeper. Whenever a shard changes ownership, or a node is added or removed, information in ZooKeeper. Whenever a shard changes ownership, or a node is added or removed,
ZooKeeper notifies the routing tier so that it can keep its routing information up to date. ZooKeeper notifies the routing tier so that it can keep its routing information up to date.
![ddia 0708](/fig/ddia_0708.png) {{< figure src="/fig/ddia_0708.png" id="fig_sharding_zookeeper" title="Figure 7-8. Using ZooKeeper to keep track of assignment of shards to nodes." class="w-full my-4" >}}
###### Figure 7-8. Using ZooKeeper to keep track of assignment of shards to nodes.
For example, HBase and SolrCloud use ZooKeeper to manage shard assignment, and Kubernetes uses etcd For example, HBase and SolrCloud use ZooKeeper to manage shard assignment, and Kubernetes uses etcd
to keep track of which service instance is running where. MongoDB has a similar architecture, but it to keep track of which service instance is running where. MongoDB has a similar architecture, but it
@ -631,9 +600,7 @@ indexing automatically. For example, whenever a red car is added to the database
automatically adds its ID to the list of IDs for the index entry `color:red`. As discussed in automatically adds its ID to the list of IDs for the index entry `color:red`. As discussed in
[Chapter 4](/en/ch4#ch_storage), that list of IDs is also called a *postings list*. [Chapter 4](/en/ch4#ch_storage), that list of IDs is also called a *postings list*.
![ddia 0709](/fig/ddia_0709.png) {{< figure src="/fig/ddia_0709.png" id="fig_sharding_local_secondary" title="Figure 7-9. Local secondary indexes: each shard indexes only the records within its own shard." class="w-full my-4" >}}
###### Figure 7-9. Local secondary indexes: each shard indexes only the records within its own shard.
###### Warning ###### Warning
@ -648,8 +615,7 @@ indexes, covering only the records in that shard. It doesn’t care what data is
shards. Whenever you write to the database—to add, remove, or update a records—you only need to shards. Whenever you write to the database—to add, remove, or update a records—you only need to
deal with the shard that contains the record that you are writing. For that reason, this type of deal with the shard that contains the record that you are writing. For that reason, this type of
secondary index is known as a *local index*. In an information retrieval context it is also known as secondary index is known as a *local index*. In an information retrieval context it is also known as
a *document-partitioned index* a *document-partitioned index* [^30].
[^30].
When reading from a local secondary index, if you already know the partition key of the record When reading from a local secondary index, if you already know the partition key of the record
you’re looking for, you can just perform the search on the appropriate shard. Moreover, if you only you’re looking for, you can just perform the search on the appropriate shard. Moreover, if you only
@ -666,11 +632,8 @@ expensive. Even if you query the shards in parallel, it is prone to tail latency
shards lets you store more data, but it doesn’t increase your query throughput if every shard has to shards lets you store more data, but it doesn’t increase your query throughput if every shard has to
process every query anyway. process every query anyway.
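
The cost of querying a local secondary index can be sketched as follows. This is a toy model under stated assumptions: the `Shard` class, the `color_index` field, and sharding by record ID modulo the shard count are all hypothetical choices made for the example. Each shard indexes only its own records, so a query by color has to visit every shard and merge the results.

```python
from collections import defaultdict


class Shard:
    """One shard holding records plus a local index over its own records only."""

    def __init__(self):
        self.records = {}                    # record ID -> record fields
        self.color_index = defaultdict(set)  # color -> postings list of record IDs

    def add(self, record_id, record):
        # A write touches only the shard that owns the record.
        self.records[record_id] = record
        self.color_index[record["color"]].add(record_id)

    def find_by_color(self, color):
        return set(self.color_index.get(color, ()))


def shard_for(record_id, num_shards):
    # Assumption: records are sharded by ID modulo the shard count.
    return record_id % num_shards


def query_all_shards(shards, color):
    """Scatter the query to every shard and gather the matching IDs."""
    result = set()
    for shard in shards:
        result |= shard.find_by_color(color)
    return result
```

Because `query_all_shards` loops over every shard regardless of how many matches each one holds, adding shards in this model increases storage capacity but not the number of such queries the system can serve.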
Nevertheless, local secondary indexes are widely used [^31]: Nevertheless, local secondary indexes are widely used [^31]: for example, MongoDB, Riak, Cassandra [^32], Elasticsearch [^33],
for example, MongoDB, Riak, Cassandra [^32], SolrCloud, and VoltDB [^34] all use local secondary indexes.
Elasticsearch [^33], SolrCloud,
and VoltDB [^34]
all use local secondary indexes.
## Global Secondary Indexes ## Global Secondary Indexes
@ -685,9 +648,7 @@ with the letters *a* to *r* appear in shard 0 and colors starting with *s* to *z
The index on the make of car is partitioned similarly (with the shard boundary being between *f* and The index on the make of car is partitioned similarly (with the shard boundary being between *f* and
*h*). *h*).
![ddia 0710](/fig/ddia_0710.png) {{< figure src="/fig/ddia_0710.png" id="fig_sharding_global_secondary" title="Figure 7-10. A global secondary index reflects data from all shards, and is itself sharded by the indexed value." class="w-full my-4" >}}
###### Figure 7-10. A global secondary index reflects data from all shards, and is itself sharded by the indexed value.
This kind of index is also called *term-partitioned* This kind of index is also called *term-partitioned*
[^30]: [^30]:


@ -82,10 +82,7 @@ much weaker set of guarantees than had previously been understood.
The hype around NoSQL distributed databases led to a popular belief that transactions were The hype around NoSQL distributed databases led to a popular belief that transactions were
fundamentally unscalable, and that any large-scale system would have to abandon transactions in fundamentally unscalable, and that any large-scale system would have to abandon transactions in
order to maintain good performance and high availability. More recently, that belief has turned out order to maintain good performance and high availability. More recently, that belief has turned out
to be wrong. So-called “NewSQL” databases such as CockroachDB [^5], to be wrong. So-called “NewSQL” databases such as CockroachDB [^5], TiDB [^6], Spanner [^7], FoundationDB [^8],
TiDB [^6],
Spanner [^7],
FoundationDB [^8],
and Yugabyte have shown that transactional systems can scale to large data volumes and high and Yugabyte have shown that transactional systems can scale to large data volumes and high
throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide
strong ACID guarantees at scale. strong ACID guarantees at scale.
@ -99,19 +96,16 @@ operation and in various extreme (but realistic) circumstances.
The safety guarantees provided by transactions are often described by the well-known acronym *ACID*, The safety guarantees provided by transactions are often described by the well-known acronym *ACID*,
which stands for *Atomicity*, *Consistency*, *Isolation*, and *Durability*. It was coined in 1983 by which stands for *Atomicity*, *Consistency*, *Isolation*, and *Durability*. It was coined in 1983 by
Theo Härder and Andreas Reuter [^9] Theo Härder and Andreas Reuter [^9] in an effort to establish precise terminology for fault-tolerance mechanisms in databases.
in an effort to establish precise terminology for fault-tolerance mechanisms in databases.
However, in practice, one database’s implementation of ACID does not equal another’s implementation. However, in practice, one database’s implementation of ACID does not equal another’s implementation.
For example, as we shall see, there is a lot of ambiguity around the meaning of *isolation* For example, as we shall see, there is a lot of ambiguity around the meaning of *isolation* [^10].
[^10].
The high-level idea is sound, but the devil is in the details. Today, when a system claims to be The high-level idea is sound, but the devil is in the details. Today, when a system claims to be
“ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately “ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately
become mostly a marketing term. become mostly a marketing term.
(Systems that do not meet the ACID criteria are sometimes called *BASE*, which stands for (Systems that do not meet the ACID criteria are sometimes called *BASE*, which stands for
*Basically Available*, *Soft state*, and *Eventual consistency* *Basically Available*, *Soft state*, and *Eventual consistency* [^11].
[^11].
This is even more vague than the definition of ACID. It seems that the only sensible definition of This is even more vague than the definition of ACID. It seems that the only sensible definition of
BASE is “not ACID”; i.e., it can mean almost anything you want.) BASE is “not ACID”; i.e., it can mean almost anything you want.)
@ -199,9 +193,8 @@ current value, add 1, and write the new value back (assuming there is no increme
into the database). In [Figure 8-1](/en/ch8#fig_transactions_increment) the counter should have increased from 42 to into the database). In [Figure 8-1](/en/ch8#fig_transactions_increment) the counter should have increased from 42 to
44, because two increments happened, but it actually only went to 43 because of the race condition. 44, because two increments happened, but it actually only went to 43 because of the race condition.
![ddia 0801](/fig/ddia_0801.png) {{< figure src="/fig/ddia_0801.png" id="fig_transactions_increment" title="Figure 8-1. A race condition between two clients concurrently incrementing a counter." class="w-full my-4" >}}
###### Figure 8-1. A race condition between two clients concurrently incrementing a counter.
*Isolation* in the sense of ACID means that concurrently executing transactions are isolated from *Isolation* in the sense of ACID means that concurrently executing transactions are isolated from
each other: they cannot step on each other’s toes. The classic database textbooks formalize each other: they cannot step on each other’s toes. The classic database textbooks formalize
@ -300,9 +293,8 @@ number of unread messages for a user, you could query something like:
SELECT COUNT(*) FROM emails WHERE recipient_id = 2 AND unread_flag = true SELECT COUNT(*) FROM emails WHERE recipient_id = 2 AND unread_flag = true
``` ```
![ddia 0802](/fig/ddia_0802.png) {{< figure src="/fig/ddia_0802.png" id="fig_transactions_read_uncommitted" title="Figure 8-2. Violating isolation: one transaction reads another transaction's uncommitted writes (a \"dirty read\")." class="w-full my-4" >}}
###### Figure 8-2. Violating isolation: one transaction reads another transaction’s uncommitted writes (a “dirty read”).
However, you might find this query to be too slow if there are many emails, and decide to store the However, you might find this query to be too slow if there are many emails, and decide to store the
number of unread messages in a separate field (a kind of denormalization, which we discuss in number of unread messages in a separate field (a kind of denormalization, which we discuss in
@ -322,9 +314,8 @@ over the course of the transaction, the contents of the mailbox and the unread c
of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted
and the inserted email is rolled back. and the inserted email is rolled back.
![ddia 0803](/fig/ddia_0803.png) {{< figure src="/fig/ddia_0803.png" id="fig_transactions_atomicity" title="Figure 8-3. Atomicity ensures that if an error occurs any prior writes from that transaction are undone, to avoid an inconsistent state." class="w-full my-4" >}}
###### Figure 8-3. Atomicity ensures that if an error occurs any prior writes from that transaction are undone, to avoid an inconsistent state.
Multi-object transactions require some way of determining which read and write operations belong to Multi-object transactions require some way of determining which read and write operations belong to
the same transaction. In relational databases, that is typically done based on the client’s TCP the same transaction. In relational databases, that is typically done based on the client’s TCP
@ -473,18 +464,14 @@ levels of isolation are much harder to understand, and they can lead to subtle b
nevertheless used in practice [^29]. nevertheless used in practice [^29].
Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have
caused substantial loss of money caused substantial loss of money [^30] [^31] [^32], led to investigation by financial auditors [^33],
[^30] [^31] [^32], and caused customer data to be corrupted [^34]. A popular comment on revelations of such problems is “Use an ACID database if you’re handling
led to investigation by financial auditors [^33],
and caused customer data to be corrupted [^34].
A popular comment on revelations of such problems is “Use an ACID database if you’re handling
financial data!”—but that misses the point. Even many popular relational database systems (which financial data!”—but that misses the point. Even many popular relational database systems (which
are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these
bugs from occurring. bugs from occurring.
> [!NOTE] > [!NOTE]
> Incidentally, much of the banking system relies on text files that are exchanged via secure FTP > Incidentally, much of the banking system relies on text files that are exchanged via secure FTP [^35].
> [^35].
> In this context, having an audit trail and some human-level fraud prevention measures is actually > In this context, having an audit trail and some human-level fraud prevention measures is actually
> more important than ACID properties. > more important than ACID properties.
@ -499,17 +486,14 @@ practice, and discuss in detail what kinds of race conditions can and cannot occ
decide what level is appropriate to your application. Once we’ve done that, we will discuss decide what level is appropriate to your application. Once we’ve done that, we will discuss
serializability in detail (see [“Serializability”](/en/ch8#sec_transactions_serializability)). Our discussion of isolation serializability in detail (see [“Serializability”](/en/ch8#sec_transactions_serializability)). Our discussion of isolation
levels will be informal, using examples. If you want rigorous definitions and analyses of their levels will be informal, using examples. If you want rigorous definitions and analyses of their
properties, you can find them in the academic literature properties, you can find them in the academic literature [^36] [^37] [^38] [^39].
[^36] [^37] [^38] [^39].
## Read Committed ## Read Committed
The most basic level of transaction isolation is *read committed*. It makes two guarantees: The most basic level of transaction isolation is *read committed*. It makes two guarantees:
1. When reading from the database, you will only see data that has been committed (no *dirty 1. When reading from the database, you will only see data that has been committed (no *dirty reads*).
reads*). 2. When writing to the database, you will only overwrite data that has been committed (no *dirty writes*).
2. When writing to the database, you will only overwrite data that has been committed (no *dirty
writes*).
Some databases support an even weaker isolation level called *read uncommitted*. It prevents dirty Some databases support an even weaker isolation level called *read uncommitted*. It prevents dirty
writes, but does not prevent dirty reads. Let’s discuss these two guarantees in more detail. writes, but does not prevent dirty reads. Let’s discuss these two guarantees in more detail.
@ -526,9 +510,8 @@ all of its writes become visible at once). This is illustrated in
[Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still [Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still
returns the old value, 2, while user 1 has not yet committed. returns the old value, 2, while user 1 has not yet committed.
![ddia 0804](/fig/ddia_0804.png) {{< figure src="/fig/ddia_0804.png" id="fig_transactions_read_committed" title="Figure 8-4. No dirty reads: user 2 sees the new value for x only after user 1's transaction has committed." class="w-full my-4" >}}
###### Figure 8-4. No dirty reads: user 2 sees the new value for *x* only after user 1’s transaction has committed.
There are a few reasons why it’s useful to prevent dirty reads: There are a few reasons why it’s useful to prevent dirty reads:
@ -550,8 +533,7 @@ know in which order the writes will happen, but we normally assume that the late
the earlier write. the earlier write.
However, what happens if the earlier write is part of a transaction that has not yet committed, so However, what happens if the earlier write is part of a transaction that has not yet committed, so
the later write overwrites an uncommitted value? This is called a *dirty write* the later write overwrites an uncommitted value? This is called a *dirty write* [^36]. Transactions running at the read
[^36]. Transactions running at the read
committed isolation level must prevent dirty writes, usually by delaying the second write until the committed isolation level must prevent dirty writes, usually by delaying the second write until the
first write’s transaction has committed or aborted. first write’s transaction has committed or aborted.
@ -570,9 +552,8 @@ By preventing dirty writes, this isolation level avoids some kinds of concurrenc
has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason: in has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason: in
[“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update) we will discuss how to make such counter increments safe. [“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update) we will discuss how to make such counter increments safe.
![ddia 0805](/fig/ddia_0805.png) {{< figure src="/fig/ddia_0805.png" id="fig_transactions_dirty_writes" title="Figure 8-5. With dirty writes, conflicting writes from different transactions can be mixed up." class="w-full my-4" >}}
###### Figure 8-5. With dirty writes, conflicting writes from different transactions can be mixed up.
### Implementing read committed ### Implementing read committed
@ -623,9 +604,8 @@ However, there are still plenty of ways in which you can have concurrency bugs w
isolation level. For example, [Figure 8-6](/en/ch8#fig_transactions_item_many_preceders) illustrates a problem that isolation level. For example, [Figure 8-6](/en/ch8#fig_transactions_item_many_preceders) illustrates a problem that
can occur with read committed. can occur with read committed.
![ddia 0806](/fig/ddia_0806.png) {{< figure src="/fig/ddia_0806.png" id="fig_transactions_item_many_preceders" title="Figure 8-6. Read skew: Aaliyah observes the database in an inconsistent state." class="w-full my-4" >}}
###### Figure 8-6. Read skew: Aaliyah observes the database in an inconsistent state.
Say Aaliyah has $1,000 of savings at a bank, split across two accounts with $500 each. Now a Say Aaliyah has $1,000 of savings at a bank, split across two accounts with $500 each. Now a
transaction transfers $100 from one of her accounts to the other. If she is unlucky enough to look at her transaction transfers $100 from one of her accounts to the other. If she is unlucky enough to look at her
@ -705,9 +685,8 @@ transaction ID of the writer. (To be precise, transaction IDs in PostgreSQL are
they overflow after approximately 4 billion transactions. The vacuum process performs cleanup to they overflow after approximately 4 billion transactions. The vacuum process performs cleanup to
ensure that overflow does not affect the data.) ensure that overflow does not affect the data.)
![ddia 0807](/fig/ddia_0807.png) {{< figure src="/fig/ddia_0807.png" id="fig_transactions_mvcc" title="Figure 8-7. Implementing snapshot isolation using multi-version concurrency control." class="w-full my-4" >}}
###### Figure 8-7. Implementing snapshot isolation using multi-version concurrency control.
Each row in a table has an `inserted_by` field, containing the ID of the transaction that inserted Each row in a table has an `inserted_by` field, containing the ID of the transaction that inserted
this row into the table. Moreover, each row has a `deleted_by` field, which is initially empty. If a this row into the table. Moreover, each row has a `deleted_by` field, which is initially empty. If a
@ -726,8 +705,7 @@ $400 which was inserted by transaction 13.
All of the versions of a row are stored within the same database heap (see All of the versions of a row are stored within the same database heap (see
[“Storing values within the index”](/en/ch4#sec_storage_index_heap)), regardless of whether the transactions that wrote them have committed [“Storing values within the index”](/en/ch4#sec_storage_index_heap)), regardless of whether the transactions that wrote them have committed
or not. The versions of the same row form a linked list, going either from newest version to oldest or not. The versions of the same row form a linked list, going either from newest version to oldest
version or the other way round, so that queries can internally iterate over all versions of a row version or the other way round, so that queries can internally iterate over all versions of a row [^45] [^46].
[^45] [^46].
### Visibility rules for observing a consistent snapshot ### Visibility rules for observing a consistent snapshot
@ -774,8 +752,7 @@ query that uses the index must then iterate over the rows to find one that is vi
value matches what the query is looking for. When garbage collection removes old row versions that value matches what the query is looking for. When garbage collection removes old row versions that
are no longer visible to any transaction, the corresponding index entries can also be removed. are no longer visible to any transaction, the corresponding index entries can also be removed.
Many implementation details affect the performance of multi-version concurrency control Many implementation details affect the performance of multi-version concurrency control [^45] [^46].
[^45] [^46].
For example, PostgreSQL has optimizations for avoiding index updates if different versions of the For example, PostgreSQL has optimizations for avoiding index updates if different versions of the
same row can fit on the same page [^40]. same row can fit on the same page [^40].
Some other databases avoid storing full copies of modified rows, and only store differences between Some other databases avoid storing full copies of modified rows, and only store differences between
@ -852,7 +829,7 @@ read-modify-write cycles in application code. They are usually the best solution
expressed in terms of those operations. For example, the following instruction is concurrency-safe expressed in terms of those operations. For example, the following instruction is concurrency-safe
in most relational databases: in most relational databases:
``` ```sql
UPDATE counters SET value = value + 1 WHERE key = 'foo'; UPDATE counters SET value = value + 1 WHERE key = 'foo';
``` ```
@ -888,12 +865,12 @@ players from concurrently moving the same piece, as illustrated in [Example 8-1
##### Example 8-1. Explicitly locking rows to prevent lost updates ##### Example 8-1. Explicitly locking rows to prevent lost updates
``` ```sql
BEGIN TRANSACTION; BEGIN TRANSACTION;
SELECT * FROM figures SELECT * FROM figures
WHERE name = 'robot' AND game_id = 222 WHERE name = 'robot' AND game_id = 222
FOR UPDATE; ![1](/fig/1.png) FOR UPDATE;
-- Check whether move is valid, then update the position -- Check whether move is valid, then update the position
-- of the piece that was returned by the previous SELECT. -- of the piece that was returned by the previous SELECT.
@ -902,9 +879,7 @@ UPDATE figures SET position = 'c4' WHERE id = 1234;
COMMIT; COMMIT;
``` ```
[![1](/fig/1.png)](/en/ch8#co_transactions_CO1-1) ❶: The `FOR UPDATE` clause indicates that the database should take a lock on all rows returned by this query.
: The `FOR UPDATE` clause indicates that the database should take a lock on all rows returned by
this query.
This works, but to get it right, you need to carefully think about your application logic. It’s easy This works, but to get it right, you need to carefully think about your application logic. It’s easy
to forget to add a necessary lock somewhere in the code, and thus introduce a race condition. to forget to add a necessary lock somewhere in the code, and thus introduce a race condition.
@ -924,8 +899,7 @@ its read-modify-write cycle.
An advantage of this approach is that databases can perform this check efficiently in conjunction An advantage of this approach is that databases can perform this check efficiently in conjunction
with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL
Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort
the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates [^29] [^41].
[^29] [^41].
Some authors [^36] [^38] argue that a database must prevent lost Some authors [^36] [^38] argue that a database must prevent lost
updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot
isolation under this definition. isolation under this definition.
@ -948,7 +922,7 @@ For example, to prevent two users concurrently updating the same wiki page, you
like this, expecting the update to occur only if the content of the page hasn’t changed since the like this, expecting the update to occur only if the content of the page hasn’t changed since the
user started editing it: user started editing it:
``` ```sql
-- This may or may not be safe, depending on the database implementation -- This may or may not be safe, depending on the database implementation
UPDATE wiki_pages SET content = 'new content' UPDATE wiki_pages SET content = 'new content'
WHERE id = 1234 AND content = 'old content'; WHERE id = 1234 AND content = 'old content';
@ -991,8 +965,8 @@ behind CRDTs, which we encountered in [“CRDTs and Operational Transformation
conditional writes cannot be made commutative. conditional writes cannot be made commutative.
On the other hand, the *last write wins* (LWW) conflict resolution method is prone to lost updates, On the other hand, the *last write wins* (LWW) conflict resolution method is prone to lost updates,
as discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww). Unfortunately, LWW is the default in many replicated as discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww).
databases. Unfortunately, LWW is the default in many replicated databases.
## Write Skew and Phantoms ## Write Skew and Phantoms
@ -1007,17 +981,15 @@ concurrent writes. In this section we will see some subtler examples of conflict
To begin, imagine this example: you are writing an application for doctors to manage their on-call To begin, imagine this example: you are writing an application for doctors to manage their on-call
shifts at a hospital. The hospital usually tries to have several doctors on call at any one time, shifts at a hospital. The hospital usually tries to have several doctors on call at any one time,
but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if
they are sick themselves), provided that at least one colleague remains on call in that shift they are sick themselves), provided that at least one colleague remains on call in that shift [^53] [^54].
[^53] [^54].
Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are
feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button
to go off call at approximately the same time. What happens next is illustrated in to go off call at approximately the same time. What happens next is illustrated in
[Figure 8-8](/en/ch8#fig_transactions_write_skew). [Figure 8-8](/en/ch8#fig_transactions_write_skew).
![ddia 0808](/fig/ddia_0808.png) {{< figure src="/fig/ddia_0808.png" id="fig_transactions_write_skew" title="Figure 8-8. Example of write skew causing an application bug." class="w-full my-4" >}}
###### Figure 8-8. Example of write skew causing an application bug.
In each transaction, your application first checks that two or more doctors are currently on call; In each transaction, your application first checks that two or more doctors are currently on call;
if yes, it assumes it’s safe for one doctor to go off call. Since the database is using snapshot if yes, it assumes it’s safe for one doctor to go off call. Since the database is using snapshot
@ -1047,9 +1019,8 @@ options are more restricted:
* The automatic detection of lost updates that you find in some implementations of snapshot * The automatic detection of lost updates that you find in some implementations of snapshot
isolation unfortunately doesn’t help either: write skew is not automatically detected in isolation unfortunately doesn’t help either: write skew is not automatically detected in
PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL
Server’s snapshot isolation level [^29]. Server’s snapshot isolation level [^29].
Automatically preventing write skew requires true serializable isolation (see Automatically preventing write skew requires true serializable isolation (see [“Serializability”](/en/ch8#sec_transactions_serializability)).
[“Serializability”](/en/ch8#sec_transactions_serializability)).
* Some databases allow you to configure constraints, which are then enforced by the database (e.g., * Some databases allow you to configure constraints, which are then enforced by the database (e.g.,
uniqueness, foreign key constraints, or restrictions on a particular value). However, in order to uniqueness, foreign key constraints, or restrictions on a particular value). However, in order to
specify that at least one doctor must be on call, you would need a constraint that involves specify that at least one doctor must be on call, you would need a constraint that involves
@ -1060,12 +1031,12 @@ options are more restricted:
to explicitly lock the rows that the transaction depends on. In the doctors example, you could to explicitly lock the rows that the transaction depends on. In the doctors example, you could
write something like the following: write something like the following:
``` ```sql
BEGIN TRANSACTION; BEGIN TRANSACTION;
SELECT * FROM doctors SELECT * FROM doctors
WHERE on_call = true WHERE on_call = true
AND shift_id = 1234 FOR UPDATE; ![1](/fig/1.png) AND shift_id = 1234 FOR UPDATE;
UPDATE doctors UPDATE doctors
SET on_call = false SET on_call = false
@ -1075,8 +1046,7 @@ options are more restricted:
COMMIT; COMMIT;
``` ```
[![1](/fig/1.png)](/en/ch8#co_transactions_CO2-1) ❶: As before, `FOR UPDATE` tells the database to lock all rows returned by this query.
: As before, `FOR UPDATE` tells the database to lock all rows returned by this query.
### More examples of write skew ### More examples of write skew
@ -1084,15 +1054,14 @@ Write skew may seem like an esoteric issue at first, but once youre aware of
more situations in which it can occur. Here are some more examples: more situations in which it can occur. Here are some more examples:
Meeting room booking system Meeting room booking system
: Say you want to enforce that there cannot be two bookings for the same meeting room at the same : Say you want to enforce that there cannot be two bookings for the same meeting room at the same time [^55].
time [^55].
When someone wants to make a booking, you first check for any conflicting bookings (i.e., When someone wants to make a booking, you first check for any conflicting bookings (i.e.,
bookings for the same room with an overlapping time range), and if none are found, you create the bookings for the same room with an overlapping time range), and if none are found, you create the
meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)). meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)).
##### Example 8-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation) ##### Example 8-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)
``` ```sql
BEGIN TRANSACTION; BEGIN TRANSACTION;
-- Check for any existing bookings that overlap with the period of noon-1pm -- Check for any existing bookings that overlap with the period of noon-1pm
@ -1168,8 +1137,7 @@ This effect, where a write in one transaction changes the result of a search que
transaction, is called a *phantom* [^4]. transaction, is called a *phantom* [^4].
Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the
examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL
generated by ORMs is also prone to write skew generated by ORMs is also prone to write skew [^50] [^51].
[^50] [^51].
### Materializing conflicts ### Materializing conflicts
@ -1300,9 +1268,8 @@ The differences between interactive transactions and stored procedures is illust
[Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the [Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the
stored procedure can execute very quickly, without waiting for any network or disk I/O. stored procedure can execute very quickly, without waiting for any network or disk I/O.
![ddia 0809](/fig/ddia_0809.png) {{< figure src="/fig/ddia_0809.png" id="fig_transactions_stored_proc" title="Figure 8-9. The difference between an interactive transaction and a stored procedure (using the example transaction of [Figure 8-8](/en/ch8#fig_transactions_write_skew))." class="w-full my-4" >}}
###### Figure 8-9. The difference between an interactive transaction and a stored procedure (using the example transaction of [Figure 8-8](/en/ch8#fig_transactions_write_skew)).
### Pros and cons of stored procedures ### Pros and cons of stored procedures
@ -1321,8 +1288,7 @@ SQL standard (SQL/PSM) since 1999. They have gained a somewhat bad reputation, f
(e.g., using a lot of memory or CPU time) in a database can cause much more trouble than equivalent (e.g., using a lot of memory or CPU time) in a database can cause much more trouble than equivalent
badly written code in an application server. badly written code in an application server.
* In a multitenant system that allows tenants to write their own stored procedures, it’s a security * In a multitenant system that allows tenants to write their own stored procedures, it’s a security
risk to execute untrusted code in the same process as the database kernel risk to execute untrusted code in the same process as the database kernel [^62].
[^62].
However, those issues can be overcome. Modern implementations of stored procedures have abandoned However, those issues can be overcome. Modern implementations of stored procedures have abandoned
PL/SQL and use existing general-purpose programming languages instead: VoltDB uses Java or Groovy, PL/SQL and use existing general-purpose programming languages instead: VoltDB uses Java or Groovy,
@ -1332,8 +1298,7 @@ Stored procedures are also useful in cases where application logic cant easil
elsewhere. Applications that use GraphQL, for example, might directly expose their database through elsewhere. Applications that use GraphQL, for example, might directly expose their database through
a GraphQL proxy. If the proxy doesn’t support complex validation logic, you can embed such logic a GraphQL proxy. If the proxy doesn’t support complex validation logic, you can embed such logic
directly in the database using a stored procedure. If the database doesn’t support stored directly in the database using a stored procedure. If the database doesn’t support stored
procedures, you would have to deploy a validation service between the proxy and the database to do procedures, you would have to deploy a validation service between the proxy and the database to do validation.
validation.
With stored procedures and in-memory data, executing all transactions on a single thread becomes With stored procedures and in-memory data, executing all transactions on a single thread becomes
feasible. When stored procedures don’t need to wait for I/O and avoid the overhead of other feasible. When stored procedures don’t need to wait for I/O and avoid the overhead of other
@ -1494,8 +1459,7 @@ transaction is not allowed to concurrently insert or update another booking for
time range. (It’s okay to concurrently insert bookings for other rooms, or for the same room at a time range. (It’s okay to concurrently insert bookings for other rooms, or for the same room at a
different time that doesn’t affect the proposed booking.) different time that doesn’t affect the proposed booking.)
How do we implement this? Conceptually, we need a *predicate lock* How do we implement this? Conceptually, we need a *predicate lock* [^4]. It works similarly to the
[^4]. It works similarly to the
shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one
row in a table), it belongs to all objects that match some search condition, such as: row in a table), it belongs to all objects that match some search condition, such as:
@ -1569,15 +1533,11 @@ serializable isolation and good performance fundamentally at odds with each othe
It seems not: an algorithm called *serializable snapshot isolation* (SSI) provides full It seems not: an algorithm called *serializable snapshot isolation* (SSI) provides full
serializability with only a small performance penalty compared to snapshot isolation. SSI is serializability with only a small performance penalty compared to snapshot isolation. SSI is
comparatively new: it was first described in 2008 comparatively new: it was first described in 2008 [^53] [^65].
[^53] [^65].
Today SSI and similar algorithms are used in single-node databases (the serializable isolation level Today SSI and similar algorithms are used in single-node databases (the serializable isolation level
in PostgreSQL [^54], SQL Server’s In-Memory in PostgreSQL [^54], SQL Server’s In-Memory OLTP/Hekaton [^66], and HyPer [^67]), distributed databases (CockroachDB [^5] and
OLTP/Hekaton [^66], and HyPer [^67]), FoundationDB [^8]), and embedded storage engines such as BadgerDB.
distributed databases (CockroachDB [^5] and
FoundationDB [^8]), and embedded storage
engines such as BadgerDB.
### Pessimistic versus optimistic concurrency control ### Pessimistic versus optimistic concurrency control
@ -1658,9 +1618,8 @@ now taken effect, and transaction 43s premise is no longer true. Things get e
when a writer inserts data that didn’t exist before (see [“Phantoms causing write skew”](/en/ch8#sec_transactions_phantom)). We’ll when a writer inserts data that didn’t exist before (see [“Phantoms causing write skew”](/en/ch8#sec_transactions_phantom)). We’ll
discuss detecting phantom writes for SSI in [“Detecting writes that affect prior reads”](/en/ch8#sec_detecting_writes_affect_reads). discuss detecting phantom writes for SSI in [“Detecting writes that affect prior reads”](/en/ch8#sec_detecting_writes_affect_reads).
![ddia 0810](/fig/ddia_0810.png) {{< figure src="/fig/ddia_0810.png" id="fig_transactions_detect_mvcc" title="Figure 8-10. Detecting when a transaction reads outdated values from an MVCC snapshot." class="w-full my-4" >}}
###### Figure 8-10. Detecting when a transaction reads outdated values from an MVCC snapshot.
In order to prevent this anomaly, the database needs to track when a transaction ignores another In order to prevent this anomaly, the database needs to track when a transaction ignores another
transaction’s writes due to MVCC visibility rules. When the transaction wants to commit, the transaction’s writes due to MVCC visibility rules. When the transaction wants to commit, the
@ -1680,9 +1639,8 @@ isolation’s support for long-running reads from a consistent snapshot.
The second case to consider is when another transaction modifies data after it has been read. This The second case to consider is when another transaction modifies data after it has been read. This
case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range). case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range).
![ddia 0811](/fig/ddia_0811.png) {{< figure src="/fig/ddia_0811.png" id="fig_transactions_detect_index_range" title="Figure 8-11. In serializable snapshot isolation, detecting when one transaction modifies another transaction's reads." class="w-full my-4" >}}
###### Figure 8-11. In serializable snapshot isolation, detecting when one transaction modifies another transaction’s reads.
In the context of two-phase locking we discussed index-range locks (see In the context of two-phase locking we discussed index-range locks (see
[“Index-range locks”](/en/ch8#sec_transactions_2pl_range)), which allow the database to lock access to all rows matching some [“Index-range locks”](/en/ch8#sec_transactions_2pl_range)), which allow the database to lock access to all rows matching some
@ -1788,9 +1746,8 @@ some nodes and fails on other nodes, as shown in [Figure 8-12](/en/ch8#fig_tran
* Some nodes may crash before the commit record is fully written and roll back on recovery, while * Some nodes may crash before the commit record is fully written and roll back on recovery, while
others successfully commit. others successfully commit.
![ddia 0812](/fig/ddia_0812.png) {{< figure src="/fig/ddia_0812.png" id="fig_transactions_non_atomic" title="Figure 8-12. When a transaction involves multiple database nodes, it may commit on some and fail on others." class="w-full my-4" >}}
###### Figure 8-12. When a transaction involves multiple database nodes, it may commit on some and fail on others.
If some nodes commit the transaction but others abort it, the nodes become inconsistent with each If some nodes commit the transaction but others abort it, the nodes become inconsistent with each
other. And once a transaction has been committed on one node, it cannot be retracted again if it other. And once a transaction has been committed on one node, it cannot be retracted again if it
@ -1808,21 +1765,17 @@ problem.
## Two-Phase Commit (2PC) ## Two-Phase Commit (2PC)
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It
is a classic algorithm in distributed databases is a classic algorithm in distributed databases [^13] [^71] [^72]. 2PC is used
[^13] [^71] [^72]. 2PC is used internally in some databases and also made available to applications in the form of *XA transactions* [^73]
internally in some databases and also made available to applications in the form of *XA transactions*
[^73]
(which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP (which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP
web services web services [^74] [^75].
[^74] [^75].
The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two
phases (hence the name). phases (hence the name).
![ddia 0813](/fig/ddia_0813.png) {{< figure src="/fig/ddia_0813.png" id="fig_transactions_two_phase_commit" title="Figure 8-13. A successful execution of two-phase commit (2PC)." class="w-full my-4" >}}
###### Figure 8-13. A successful execution of two-phase commit (2PC).
2PC uses a new component that does not normally appear in single-node transactions: a 2PC uses a new component that does not normally appear in single-node transactions: a
*coordinator* (also known as *transaction manager*). The coordinator is often implemented as a *coordinator* (also known as *transaction manager*). The coordinator is often implemented as a
@ -1839,8 +1792,7 @@ participants:
* If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends * If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends
out a *commit* request in phase 2, and the commit actually takes place. out a *commit* request in phase 2, and the commit actually takes place.
* If any of the participants replies “no,” the coordinator sends an *abort* request to all nodes in * If any of the participants replies “no,” the coordinator sends an *abort* request to all nodes in phase 2.
phase 2.
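The two outcomes described above can be sketched in a few lines of Python. This is only an illustrative, in-memory model: the `Participant` class, its `can_commit` flag, and the `two_phase_commit` function are hypothetical names introduced here, and the sketch leaves out everything that makes 2PC hard in practice (durable logging, timeouts, and crashes of the coordinator or participants).

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """Hypothetical in-memory participant; a real one would use durable storage."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # assumption: stands in for local checks
        self.state = "active"

    def prepare(self):
        # Phase 1: either promise to commit if asked, or refuse.
        if self.can_commit:
            self.state = "prepared"
            return Vote.YES
        self.state = "aborted"
        return Vote.NO

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: the coordinator collects a vote from every participant.
    votes = [p.prepare() for p in participants]
    # Phase 2: commit only if every vote is "yes"; otherwise abort on all nodes.
    if all(v is Vote.YES for v in votes):
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.abort()
    return "aborted"
```

Calling `two_phase_commit` with participants that all vote “yes” returns `"committed"`; a single “no” vote causes every participant to end up in the `"aborted"` state.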
This process is somewhat like the traditional marriage ceremony in Western cultures: the minister This process is somewhat like the traditional marriage ceremony in Western cultures: the minister
asks the bride and groom individually whether each wants to marry the other, and typically receives asks the bride and groom individually whether each wants to marry the other, and typically receives
@ -1920,9 +1872,8 @@ not know whether to commit or abort. Even a timeout does not help here: if datab
aborts after a timeout, it will end up inconsistent with database 2, which has committed. Similarly, aborts after a timeout, it will end up inconsistent with database 2, which has committed. Similarly,
it is not safe to unilaterally commit, because another participant may have aborted. it is not safe to unilaterally commit, because another participant may have aborted.
![ddia 0814](/fig/ddia_0814.png) {{< figure src="/fig/ddia_0814.png" id="fig_transactions_2pc_crash" title="Figure 8-14. The coordinator crashes after participants vote \"yes.\" Database 1 does not know whether to commit or abort." class="w-full my-4" >}}
###### Figure 8-14. The coordinator crashes after participants vote “yes.” Database 1 does not know whether to commit or abort.
Without hearing from the coordinator, the participant has no way of knowing whether to commit or Without hearing from the coordinator, the participant has no way of knowing whether to commit or
abort. In principle, the participants could communicate among themselves to find out how each abort. In principle, the participants could communicate among themselves to find out how each
@ -1942,8 +1893,7 @@ stuck waiting for the coordinator to recover. It is possible to make an atomic c
*nonblocking*, so that it does not get stuck if a node fails. However, making this work in practice *nonblocking*, so that it does not get stuck if a node fails. However, making this work in practice
is not so straightforward. is not so straightforward.
As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed [^13] [^77].
[^13] [^77].
However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most
practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
cannot guarantee atomicity. cannot guarantee atomicity.
@ -1957,8 +1907,7 @@ Distributed transactions and two-phase commit have a mixed reputation. On the on
seen as providing an important safety guarantee that would be hard to achieve otherwise; on the seen as providing an important safety guarantee that would be hard to achieve otherwise; on the
other hand, they are criticized for causing operational problems, killing performance, and promising other hand, they are criticized for causing operational problems, killing performance, and promising
more than they can deliver [^78] [^79] [^80] [^81]. more than they can deliver [^78] [^79] [^80] [^81].
Many cloud services choose not to implement distributed transactions due to the operational Many cloud services choose not to implement distributed transactions due to the operational problems they engender [^82].
problems they engender [^82].
Some implementations of distributed transactions carry a heavy performance penalty. Much of the Some implementations of distributed transactions carry a heavy performance penalty. Much of the
performance cost inherent in two-phase commit is due to the additional disk forcing (`fsync`) that performance cost inherent in two-phase commit is due to the additional disk forcing (`fsync`) that
@ -2073,8 +2022,7 @@ transaction is resolved.
### Recovering from coordinator failure ### Recovering from coordinator failure
In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the
log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do occur [^83] [^84] — that is,
occur [^83] [^84]—that is,
transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because
the transaction log has been lost or corrupted due to a software bug). These transactions cannot be the transaction log has been lost or corrupted due to a software bug). These transactions cannot be
resolved automatically, so they sit forever in the database, holding locks and blocking other resolved automatically, so they sit forever in the database, holding locks and blocking other
@ -2135,11 +2083,8 @@ As explained previously, there is a big difference between distributed transacti
multiple heterogeneous storage technologies, and those that are internal to a system—i.e., where all multiple heterogeneous storage technologies, and those that are internal to a system—i.e., where all
the participating nodes are shards of the same database running the same software. Such internal the participating nodes are shards of the same database running the same software. Such internal
distributed transactions are a defining feature of “NewSQL” databases such as distributed transactions are a defining feature of “NewSQL” databases such as
CockroachDB [^5], CockroachDB [^5], TiDB [^6], Spanner [^7], FoundationDB [^8], and YugabyteDB, for example.
TiDB [^6], Some message brokers such as Kafka also support internal distributed transactions [^85].
Spanner [^7],
FoundationDB [^8], and YugabyteDB, for
example. Some message brokers such as Kafka also support internal distributed transactions [^85].
Many of these systems use 2-phase commit to ensure atomicity of transactions that write to multiple Many of these systems use 2-phase commit to ensure atomicity of transactions that write to multiple
shards, and yet they don’t suffer the same problems as XA transactions. The reason is that because shards, and yet they don’t suffer the same problems as XA transactions. The reason is that because
@ -2149,14 +2094,10 @@ are more reliable and faster.
The biggest problems with XA can be fixed by: The biggest problems with XA can be fixed by:
* Replicating the coordinator, with automatic failover to another coordinator node if the primary * Replicating the coordinator, with automatic failover to another coordinator node if the primary one crashes;
one crashes; * Allowing the coordinator and data shards to communicate directly without going via application code;
* Allowing the coordinator and data shards to communicate directly without going via application * Replicating the participating shards, so that the risk of having to abort a transaction because of a fault in one of the shards is reduced; and
code; * Coupling the atomic commitment protocol with a distributed concurrency control protocol that supports deadlock detection and consistent reads across shards.
* Replicating the participating shards, so that the risk of having to abort a transaction because of
a fault in one of the shards is reduced; and
* Coupling the atomic commitment protocol with a distributed concurrency control protocol that
supports deadlock detection and consistent reads across shards.
Consensus algorithms are commonly used to replicate the coordinator and the database shards. We will Consensus algorithms are commonly used to replicate the coordinator and the database shards. We will
see in [Chapter 10](/en/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented see in [Chapter 10](/en/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented
@ -2240,12 +2181,12 @@ discussing various examples of race conditions, summarized in [Table 8-1](/en/c
Table 8-1. Summary of anomalies that can occur at various isolation levels Table 8-1. Summary of anomalies that can occur at various isolation levels
| Isolation level | Dirty reads | Read skew | Phantom reads | Lost updates | Write skew | | Isolation level | Dirty reads | Read skew | Phantom reads | Lost updates | Write skew |
|--------------------|-------------|-------------|---------------|--------------|-------------| |--------------------|-------------|-------------|---------------|--------------|-------------|
| Read uncommitted | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | | Read uncommitted | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible |
| Read committed | ✓ Prevented | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | | Read committed | ✓ Prevented | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible |
| Snapshot isolation | ✓ Prevented | ✓ Prevented | ✓ Prevented | ? Depends | ✗ Possible | | Snapshot isolation | ✓ Prevented | ✓ Prevented | ✓ Prevented | ? Depends | ✗ Possible |
| Serializable | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | | Serializable | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented |
Dirty reads Dirty reads
: One client reads another client’s writes before they have been committed. The read committed : One client reads another client’s writes before they have been committed. The read committed
@ -2305,9 +2246,7 @@ mechanisms. Fortunately, idempotence can ensure exactly-once semantics without r
commit across different storage technologies, and we will see more on this in later chapters. commit across different storage technologies, and we will see more on this in later chapters.
The examples in this chapter used a relational data model. However, as discussed in The examples in this chapter used a relational data model. However, as discussed in
[“The need for multi-object transactions”](/en/ch8#sec_transactions_need), transactions are a valuable database feature, no matter which data model [“The need for multi-object transactions”](/en/ch8#sec_transactions_need), transactions are a valuable database feature, no matter which data model is used.
is used.


@ -117,9 +117,8 @@ a request and expect a response, many things could go wrong (some of which are i
6. The remote node may have processed your request, but the response has been delayed and will be 6. The remote node may have processed your request, but the response has been delayed and will be
delivered later (perhaps the network or your own machine is overloaded). delivered later (perhaps the network or your own machine is overloaded).
![ddia 0901](/fig/ddia_0901.png) {{< figure src="/fig/ddia_0901.png" id="fig_distributed_network" title="Figure 9-1. If you send a request and don't get a response, it's not possible to distinguish whether (a) the request was lost, (b) the remote node is down, or (c) the response was lost." class="w-full my-4" >}}
###### Figure 9-1. If you send a request and don’t get a response, it’s not possible to distinguish whether (a) the request was lost, (b) the remote node is down, or (c) the response was lost.
The sender can’t even tell whether the packet was delivered: the only option is for the recipient to The sender can’t even tell whether the packet was delivered: the only option is for the recipient to
send a response message, which may in turn be lost or delayed. These issues are indistinguishable in send a response message, which may in turn be lost or delayed. These issues are indistinguishable in
@ -147,8 +146,7 @@ TCP is often described as providing “reliable” delivery, in the sense that i
retransmits dropped packets, it detects reordered packets and puts them back in the correct order, retransmits dropped packets, it detects reordered packets and puts them back in the correct order,
and it detects packet corruption using a simple checksum. It also figures out how fast it can send and it detects packet corruption using a simple checksum. It also figures out how fast it can send
data so that it is transferred as quickly as possible, but without overloading the network or the data so that it is transferred as quickly as possible, but without overloading the network or the
receiving node; this is known as *congestion control*, *flow control*, or *backpressure* receiving node; this is known as *congestion control*, *flow control*, or *backpressure* [^5].
[^5].
When you “send” some data by writing it to a socket, it actually doesn’t get sent immediately, When you “send” some data by writing it to a socket, it actually doesn’t get sent immediately,
but it’s only placed in a buffer managed by your operating system. When the congestion control but it’s only placed in a buffer managed by your operating system. When the congestion control
@ -252,8 +250,7 @@ that something is not working:
or refuse TCP connections by sending a `RST` or `FIN` packet in reply. or refuse TCP connections by sending a `RST` or `FIN` packet in reply.
* If a node process crashed (or was killed by an administrator) but the node’s operating system is * If a node process crashed (or was killed by an administrator) but the node’s operating system is
still running, a script can notify other nodes about the crash so that another node can take over still running, a script can notify other nodes about the crash so that another node can take over
quickly without having to wait for a timeout to expire. For example, HBase does this quickly without having to wait for a timeout to expire. For example, HBase does this [^26].
[^26].
* If you have access to the management interface of the network switches in your datacenter, you can * If you have access to the management interface of the network switches in your datacenter, you can
query them to detect link failures at a hardware level (e.g., if the remote machine is powered query them to detect link failures at a hardware level (e.g., if the remote machine is powered
down). This option is ruled out if you’re connecting via the internet, or if you’re in a shared down). This option is ruled out if you’re connecting via the internet, or if you’re in a shared
@ -282,9 +279,7 @@ to a load spike on the node or the network).
Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of
performing some action (for example, sending an email), and another node takes over, the action may performing some action (for example, sending an email), and another node takes over, the action may
end up being performed twice. We will discuss this issue in more detail in end up being performed twice. We will discuss this issue in more detail in
[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), and in [“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), and in Chapters [^10] and [Link to Come].
Chapters [^10]
and [Link to Come].
When a node is declared dead, its responsibilities need to be transferred to other nodes, which When a node is declared dead, its responsibilities need to be transferred to other nodes, which
places additional load on other nodes and the network. If the system is already struggling with high places additional load on other nodes and the network. If the system is already struggling with high
@ -322,19 +317,16 @@ Similarly, the variability of packet delays on computer networks is most often d
the network is functioning fine. the network is functioning fine.
* When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming * When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming
request from the network is queued by the operating system until the application is ready to request from the network is queued by the operating system until the application is ready to
handle it. Depending on the load on the machine, this may take an arbitrary length of time handle it. Depending on the load on the machine, this may take an arbitrary length of time [^28].
[^28].
* In virtualized environments, a running operating system is often paused for tens of milliseconds * In virtualized environments, a running operating system is often paused for tens of milliseconds
while another virtual machine uses a CPU core. During this time, the VM cannot consume any data while another virtual machine uses a CPU core. During this time, the VM cannot consume any data
from the network, so the incoming data is queued (buffered) by the virtual machine monitor from the network, so the incoming data is queued (buffered) by the virtual machine monitor [^29],
[^29],
further increasing the variability of network delays. further increasing the variability of network delays.
* As mentioned earlier, in order to avoid overloading the network, TCP limits the rate at which it * As mentioned earlier, in order to avoid overloading the network, TCP limits the rate at which it
sends data. This means additional queueing at the sender before the data even enters the network. sends data. This means additional queueing at the sender before the data even enters the network.
![ddia 0902](/fig/ddia_0902.png) {{< figure src="/fig/ddia_0902.png" id="fig_distributed_switch_queueing" title="Figure 9-2. If several machines send network traffic to the same destination, its switch queue can fill up. Here, ports 1, 2, and 4 are all trying to send packets to port 3." class="w-full my-4" >}}
###### Figure 9-2. If several machines send network traffic to the same destination, its switch queue can fill up. Here, ports 1, 2, and 4 are all trying to send packets to port 3.
Moreover, when TCP detects and automatically retransmits a lost packet, although the application Moreover, when TCP detects and automatically retransmits a lost packet, although the application
does not see the packet loss directly, it does see the resulting delay (waiting for the timeout to does not see the packet loss directly, it does see the resulting delay (waiting for the timeout to
@ -588,26 +580,21 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples
* The quartz clock in a computer is not very accurate: it *drifts* (runs faster or slower than it * The quartz clock in a computer is not very accurate: it *drifts* (runs faster or slower than it
should). Clock drift varies depending on the temperature of the machine. Google assumes a clock should). Clock drift varies depending on the temperature of the machine. Google assumes a clock
drift of up to 200 ppm (parts per million) for its servers drift of up to 200 ppm (parts per million) for its servers [^45],
[^45],
which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30 which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30
seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best
possible accuracy you can achieve, even if everything is working correctly. possible accuracy you can achieve, even if everything is working correctly.
* If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the * If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the
local clock will be forcibly reset [^39]. Any local clock will be forcibly reset [^39]. Any applications observing the time before and after this reset may see time go backward or suddenly jump forward.
applications observing the time before and after this reset may see time go backward or suddenly
jump forward.
* If a node is accidentally firewalled off from NTP servers, the misconfiguration may go * If a node is accidentally firewalled off from NTP servers, the misconfiguration may go
unnoticed for some time, during which the drift may add up to large discrepancies between unnoticed for some time, during which the drift may add up to large discrepancies between
different nodes’ clocks. Anecdotal evidence suggests that this does happen in practice. different nodes’ clocks. Anecdotal evidence suggests that this does happen in practice.
* NTP synchronization can only be as good as the network delay, so there is a limit to its * NTP synchronization can only be as good as the network delay, so there is a limit to its
accuracy when you’re on a congested network with variable packet delays. One experiment showed accuracy when you’re on a congested network with variable packet delays. One experiment showed
that a minimum error of 35 ms is achievable when synchronizing over the internet that a minimum error of 35 ms is achievable when synchronizing over the internet [^46],
[^46],
though occasional spikes in network delay lead to errors of around a second. Depending on the though occasional spikes in network delay lead to errors of around a second. Depending on the
configuration, large network delays can cause the NTP client to give up entirely. configuration, large network delays can cause the NTP client to give up entirely.
* Some NTP servers are wrong or misconfigured, reporting time that is off by hours * Some NTP servers are wrong or misconfigured, reporting time that is off by hours [^47] [^48].
[^47] [^48].
NTP clients mitigate such errors by querying several servers and ignoring outliers. NTP clients mitigate such errors by querying several servers and ignoring outliers.
Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you
were told by a stranger on the internet. were told by a stranger on the internet.
@ -619,9 +606,7 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples
adjustment gradually over the course of a day (this is known as *smearing*) [^51] [^52], adjustment gradually over the course of a day (this is known as *smearing*) [^51] [^52],
although actual NTP server behavior varies in practice [^53]. although actual NTP server behavior varies in practice [^53].
Leap seconds will no longer be used from 2035 onwards, so this problem will fortunately go away. Leap seconds will no longer be used from 2035 onwards, so this problem will fortunately go away.
* In virtual machines, the hardware clock is virtualized, which raises additional challenges for * In virtual machines, the hardware clock is virtualized, which raises additional challenges for applications that need accurate timekeeping [^54].
applications that need accurate timekeeping
[^54].
When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds
while another VM is running. From an application’s point of view, this pause manifests itself as while another VM is running. From an application’s point of view, this pause manifests itself as
the clock suddenly jumping forward [^29]. the clock suddenly jumping forward [^29].
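The drift figures quoted in the list above (6 ms over 30 seconds, about 17 seconds over a day) follow directly from the 200 ppm bound. As a quick check, here is a small Python calculation; the function name `max_drift_seconds` is made up for this illustration, and it assumes the clock drifts at the worst-case rate for the whole interval between synchronizations.

```python
def max_drift_seconds(drift_ppm, resync_interval_seconds):
    """Worst-case clock error accumulated between two resynchronizations.

    Assumes the clock runs fast (or slow) by the full drift_ppm for the
    entire interval, which is the pessimistic case.
    """
    return drift_ppm / 1_000_000 * resync_interval_seconds


# 200 ppm, resynchronized every 30 seconds: roughly 0.006 s (6 ms)
drift_30s = max_drift_seconds(200, 30)

# 200 ppm, resynchronized once a day (86,400 seconds): roughly 17.28 s
drift_1day = max_drift_seconds(200, 24 * 60 * 60)
```

The error bound grows linearly with the resynchronization interval, which is why a node that silently loses contact with its NTP servers can accumulate a large offset over days.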
@ -642,8 +627,7 @@ Such accuracy can be achieved with some special hardware (GPS receivers and/or a
Precision Time Protocol (PTP) and careful deployment and monitoring [^58] [^59]. Precision Time Protocol (PTP) and careful deployment and monitoring [^58] [^59].
Relying on GPS alone can be risky because GPS signals can easily be jammed. In some locations this Relying on GPS alone can be risky because GPS signals can easily be jammed. In some locations this
happens frequently, e.g. close to military facilities [^60]. happens frequently, e.g. close to military facilities [^60].
Some cloud providers have begun offering high-accuracy clock synchronization for their virtual Some cloud providers have begun offering high-accuracy clock synchronization for their virtual machines [^61].
machines [^61].
However, clock synchronization still requires a lot of care. If your NTP daemon is misconfigured, or However, clock synchronization still requires a lot of care. If your NTP daemon is misconfigured, or
a firewall is blocking NTP traffic, the clock error due to drift can quickly become large. a firewall is blocking NTP traffic, the clock error due to drift can quickly become large.
@ -664,8 +648,7 @@ its network is misconfigured, it most likely won’t work at all, so it will qui
fixed. On the other hand, if its quartz clock is defective or its NTP client is misconfigured, most fixed. On the other hand, if its quartz clock is defective or its NTP client is misconfigured, most
things will seem to work fine, even though its clock gradually drifts further and further away from things will seem to work fine, even though its clock gradually drifts further and further away from
reality. If some piece of software is relying on an accurately synchronized clock, the result is reality. If some piece of software is relying on an accurately synchronized clock, the result is
more likely to be silent and subtle data loss than a dramatic crash more likely to be silent and subtle data loss than a dramatic crash [^62] [^63].
[^62] [^63].
Thus, if you use software that requires synchronized clocks, it is essential that you also carefully Thus, if you use software that requires synchronized clocks, it is essential that you also carefully
monitor the clock offsets between all the machines. Any node whose clock drifts too far from the monitor the clock offsets between all the machines. Any node whose clock drifts too far from the
@ -684,9 +667,8 @@ multi-leader replication (the example is similar to [Figure 6-8](/en/ch6#fig_re
*x* = 1 on node 1; the write is replicated to node 3; client B increments *x* on node *x* = 1 on node 1; the write is replicated to node 3; client B increments *x* on node
3 (we now have *x* = 2); and finally, both writes are replicated to node 2. 3 (we now have *x* = 2); and finally, both writes are replicated to node 2.
![ddia 0903](/fig/ddia_0903.png) {{< figure src="/fig/ddia_0903.png" id="fig_distributed_timestamps" title="Figure 9-3. The write by client B is causally later than the write by client A, but B's write has an earlier timestamp." class="w-full my-4" >}}
###### Figure 9-3. The write by client B is causally later than the write by client A, but B’s write has an earlier timestamp.
In [Figure 9-3](/en/ch9#fig_distributed_timestamps), when a write is replicated to other nodes, it is tagged with a In [Figure 9-3](/en/ch9#fig_distributed_timestamps), when a write is replicated to other nodes, it is tagged with a
timestamp according to the time-of-day clock on the node where the write originated. The clock timestamp according to the time-of-day clock on the node where the write originated. The clock
@ -710,12 +692,10 @@ a higher timestamp than the overwritten value, even if that timestamp is ahead o
clock. However, that incurs the cost of an additional read to find the greatest existing timestamp. clock. However, that incurs the cost of an additional read to find the greatest existing timestamp.
Some systems, including Cassandra and ScyllaDB, want to write to all replicas in a single round Some systems, including Cassandra and ScyllaDB, want to write to all replicas in a single round
trip, and therefore they simply use the client clock’s timestamp along with a last write wins trip, and therefore they simply use the client clock’s timestamp along with a last write wins
policy [^62]. This approach has some policy [^62]. This approach has some serious problems:
serious problems:
* Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite * Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite
values previously written by a node with a fast clock until the clock skew between the nodes has values previously written by a node with a fast clock until the clock skew between the nodes has elapsed [^63] [^65].
elapsed [^63] [^65].
This scenario can cause arbitrary amounts of data to be silently dropped without any error being This scenario can cause arbitrary amounts of data to be silently dropped without any error being
reported to the application. reported to the application.
* LWW cannot distinguish between writes that occurred sequentially in quick succession (in * LWW cannot distinguish between writes that occurred sequentially in quick succession (in
@ -741,9 +721,7 @@ time, in addition to other sources of error such as quartz drift. To guarantee a
you would need the clock error to be significantly lower than the network delay, which is not you would need the clock error to be significantly lower than the network delay, which is not
possible. possible.
So-called *logical clocks* So-called *logical clocks* [^66], which are based on incrementing counters rather than an oscillating quartz crystal, are a safer
[^66],
which are based on incrementing counters rather than an oscillating quartz crystal, are a safer
alternative for ordering events (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). Logical clocks do not measure alternative for ordering events (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). Logical clocks do not measure
the time of day or the number of seconds elapsed, only the relative ordering of events (whether one the time of day or the number of seconds elapsed, only the relative ordering of events (whether one
event happened before or after another). In contrast, time-of-day and monotonic clocks, which event happened before or after another). In contrast, time-of-day and monotonic clocks, which
@ -810,13 +788,11 @@ Can we use the timestamps from synchronized time-of-day clocks as transaction ID
the synchronization good enough, they would have the right properties: later transactions have a the synchronization good enough, they would have the right properties: later transactions have a
higher timestamp. The problem, of course, is the uncertainty about clock accuracy. higher timestamp. The problem, of course, is the uncertainty about clock accuracy.
Spanner implements snapshot isolation across datacenters in this way Spanner implements snapshot isolation across datacenters in this way [^68] [^69].
[^68] [^69].
It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the
following observation: if you have two confidence intervals, each consisting of an earliest and following observation: if you have two confidence intervals, each consisting of an earliest and
latest possible timestamp (*A* = [*Aearliest*, *Alatest*] and latest possible timestamp (*A* = [*Aearliest*, *Alatest*] and *B* = [*Bearliest*, *Blatest*]), and those two intervals do not overlap
*B* = [*Bearliest*, *Blatest*]), and those two intervals do not overlap (i.e., (i.e., *Aearliest* < *Alatest* < *Bearliest* < *Blatest*), then B definitely happened after A—there
*Aearliest* < *Alatest* < *Bearliest* < *Blatest*), then B definitely happened after A—there
can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened. can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened.
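
The interval comparison described above can be sketched in a few lines. This is only an illustration of the observation, not Spanner's implementation or the TrueTime API: the `Interval` type and the `definitely_before` and `order` functions are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """Hypothetical confidence interval: the true time lies in [earliest, latest]."""
    earliest: float
    latest: float


def definitely_before(a: Interval, b: Interval) -> bool:
    # A certainly happened before B only if A's latest possible time
    # is still earlier than B's earliest possible time.
    return a.latest < b.earliest


def order(a: Interval, b: Interval) -> str:
    """Classify the relative order of two events from their intervals."""
    if definitely_before(a, b):
        return "a_before_b"
    if definitely_before(b, a):
        return "b_before_a"
    # The intervals overlap, so the clocks cannot tell us the order.
    return "uncertain"
```

With non-overlapping intervals such as `Interval(10, 12)` and `Interval(13, 15)` the order is known; with `Interval(10, 14)` and `Interval(13, 15)` it is not, which is the case a commit wait is meant to rule out.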
In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the
@ -824,8 +800,7 @@ length of the confidence interval before committing a read-write transaction. By
ensures that any transaction that may read the data is at a sufficiently later time, so their ensures that any transaction that may read the data is at a sufficiently later time, so their
confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner
needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS
receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [^45].
7 ms [^45].
The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to
have a confidence interval, and the accurate clock sources only help keep that interval small. Other have a confidence interval, and the accurate clock sources only help keep that interval small. Other
@ -839,8 +814,7 @@ database with a single leader per shard. Only the leader is allowed to accept wr
node know that it is still leader (that it hasn’t been declared dead by the others), and that it may node know that it is still leader (that it hasn’t been declared dead by the others), and that it may
safely accept writes? safely accept writes?
One option is for the leader to obtain a *lease* from the other nodes, which is similar to a lock One option is for the leader to obtain a *lease* from the other nodes, which is similar to a lock with a timeout [^73].
with a timeout [^73].
Only one node can hold the lease at any one time—thus, when a node obtains a lease, it knows that Only one node can hold the lease at any one time—thus, when a node obtains a lease, it knows that
it is the leader for some amount of time, until the lease expires. In order to remain leader, the it is the leader for some amount of time, until the lease expires. In order to remain leader, the
node must periodically renew the lease before it expires. If the node fails, it stops renewing the node must periodically renew the lease before it expires. If the node fails, it stops renewing the
@ -887,12 +861,10 @@ various reasons why this could happen:
* Contention among threads accessing a shared resource, such as a lock or queue, can cause threads * Contention among threads accessing a shared resource, such as a lock or queue, can cause threads
to spend a lot of their time waiting. Moving to a machine with more CPU cores can make such to spend a lot of their time waiting. Moving to a machine with more CPU cores can make such
problems worse, and contention problems can be difficult to diagnose problems worse, and contention problems can be difficult to diagnose [^74].
[^74].
* Many programming language runtimes (such as the Java Virtual Machine) have a *garbage collector* * Many programming language runtimes (such as the Java Virtual Machine) have a *garbage collector*
(GC) that occasionally needs to stop all running threads. In the past, such *“stop-the-world” GC (GC) that occasionally needs to stop all running threads. In the past, such *“stop-the-world” GC
pauses* would sometimes last for several minutes pauses* would sometimes last for several minutes [^75]!
[^75]!
With modern GC algorithms this is less of a problem, but GC pauses can still be noticeable (see With modern GC algorithms this is less of a problem, but GC pauses can still be noticeable (see
[“Limiting the impact of garbage collection”](/en/ch9#sec_distributed_gc_impact)). [“Limiting the impact of garbage collection”](/en/ch9#sec_distributed_gc_impact)).
* In virtualized environments, a virtual machine can be *suspended* (pausing the execution of all * In virtualized environments, a virtual machine can be *suspended* (pausing the execution of all
@ -900,8 +872,7 @@ various reasons why this could happen:
memory and continuing execution). This pause can occur at any time in a process’s execution and can memory and continuing execution). This pause can occur at any time in a process’s execution and can
last for an arbitrary length of time. This feature is sometimes used for *live migration* of last for an arbitrary length of time. This feature is sometimes used for *live migration* of
virtual machines from one host to another without a reboot, in which case the length of the pause virtual machines from one host to another without a reboot, in which case the length of the pause
depends on the rate at which processes are writing to memory depends on the rate at which processes are writing to memory [^76].
[^76].
* On end-user devices such as laptops and phones, execution may also be suspended and resumed * On end-user devices such as laptops and phones, execution may also be suspended and resumed
arbitrarily, e.g., when the user closes the lid of their laptop. arbitrarily, e.g., when the user closes the lid of their laptop.
* When the operating system context-switches to another thread, or when the hypervisor switches to a * When the operating system context-switches to another thread, or when the hypervisor switches to a
@ -914,11 +885,9 @@ various reasons why this could happen:
disk I/O operation to complete [^77]. In many languages, disk access can happen disk I/O operation to complete [^77]. In many languages, disk access can happen
surprisingly, even if the code doesn’t explicitly mention file access—for example, the Java surprisingly, even if the code doesn’t explicitly mention file access—for example, the Java
classloader lazily loads class files when they are first used, which could happen at any time in classloader lazily loads class files when they are first used, which could happen at any time in
the program execution. I/O pauses and GC pauses may even conspire to combine their delays the program execution. I/O pauses and GC pauses may even conspire to combine their delays [^78].
[^78].
If the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the If the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the
I/O latency is further subject to the variability of network delays I/O latency is further subject to the variability of network delays [^31].
[^31].
* If the operating system is configured to allow *swapping to disk* (*paging*), a simple memory * If the operating system is configured to allow *swapping to disk* (*paging*), a simple memory
access may result in a page fault that requires a page from disk to be loaded into memory. The access may result in a page fault that requires a page from disk to be loaded into memory. The
thread is paused while this slow I/O operation takes place. If memory pressure is high, this may thread is paused while this slow I/O operation takes place. If memory pressure is high, this may
@ -1126,9 +1095,8 @@ become corrupted. You try to implement this by requiring a client to obtain a le
service before accessing the file. Such a lock service is often implemented using a consensus service before accessing the file. Such a lock service is often implemented using a consensus
algorithm; we will discuss this further in [Chapter 10](/en/ch10#ch_consistency). algorithm; we will discuss this further in [Chapter 10](/en/ch10#ch_consistency).
![ddia 0904](/fig/ddia_0904.png) {{< figure src="/fig/ddia_0904.png" id="fig_distributed_lease_pause" title="Figure 9-4. Incorrect implementation of a distributed lock: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage." class="w-full my-4" >}}
###### Figure 9-4. Incorrect implementation of a distributed lock: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage.
The problem is an example of what we discussed in [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses): if the client The problem is an example of what we discussed in [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses): if the client
holding the lease is paused for too long, its lease expires. Another client can obtain a lease for holding the lease is paused for too long, its lease expires. Another client can obtain a lease for
@ -1144,9 +1112,8 @@ or more.) By the time the write request arrives at the storage service, the leas
out, allowing client 2 to acquire it and issue a write of its own. The result is corruption similar out, allowing client 2 to acquire it and issue a write of its own. The result is corruption similar
to [Figure 9-4](/en/ch9#fig_distributed_lease_pause). to [Figure 9-4](/en/ch9#fig_distributed_lease_pause).
![ddia 0905](/fig/ddia_0905.png) {{< figure src="/fig/ddia_0905.png" id="fig_distributed_lease_delay" title="Figure 9-5. A message from a former leaseholder might be delayed for a long time, and arrive after another node has taken over the lease." class="w-full my-4" >}}
###### Figure 9-5. A message from a former leaseholder might be delayed for a long time, and arrive after another node has taken over the lease.
### Fencing off zombies and delayed requests ### Fencing off zombies and delayed requests
@ -1166,9 +1133,8 @@ detected and shut down, it may already be too late and data may already have bee
A more robust fencing solution, which protects against both zombies and delayed requests, is A more robust fencing solution, which protects against both zombies and delayed requests, is
illustrated in [Figure 9-6](/en/ch9#fig_distributed_fencing). illustrated in [Figure 9-6](/en/ch9#fig_distributed_fencing).
![ddia 0906](/fig/ddia_0906.png) {{< figure src="/fig/ddia_0906.png" id="fig_distributed_fencing" title="Figure 9-6. Making access to storage safe by allowing writes only in the order of increasing fencing tokens." class="w-full my-4" >}}
###### Figure 9-6. Making access to storage safe by allowing writes only in the order of increasing fencing tokens.
Let’s assume that every time the lock service grants a lock or lease, it also returns a *fencing Let’s assume that every time the lock service grants a lock or lease, it also returns a *fencing
token*, which is a number that increases every time a lock is granted (e.g., incremented by the lock token*, which is a number that increases every time a lock is granted (e.g., incremented by the lock
@ -1221,9 +1187,8 @@ the most significant bits or digits of the timestamp. You can then be sure that
generated by the new leaseholder will be greater than any timestamp from the old leaseholder, even generated by the new leaseholder will be greater than any timestamp from the old leaseholder, even
if the old leaseholder’s writes happened later. if the old leaseholder’s writes happened later.
![ddia 0907](/fig/ddia_0907.png) {{< figure src="/fig/ddia_0907.png" id="fig_distributed_fencing_leaderless" title="Figure 9-7. Using fencing tokens to protect writes to a leaderless replicated database." class="w-full my-4" >}}
###### Figure 9-7. Using fencing tokens to protect writes to a leaderless replicated database.
In [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), Client 2 has a fencing token of 34, so all of its In [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), Client 2 has a fencing token of 34, so all of its
timestamps starting with 34… are greater than any timestamps starting with 33… that are timestamps starting with 34… are greater than any timestamps starting with 33… that are
@ -1252,13 +1217,11 @@ playing by the rules of the protocol.
Distributed systems problems become much harder if there is a risk that nodes may “lie” (send Distributed systems problems become much harder if there is a risk that nodes may “lie” (send
arbitrary faulty or corrupted responses)—for example, it might cast multiple contradictory votes in arbitrary faulty or corrupted responses)—for example, it might cast multiple contradictory votes in
the same election. Such behavior is known as a *Byzantine fault*, and the problem of reaching the same election. Such behavior is known as a *Byzantine fault*, and the problem of reaching
consensus in this untrusting environment is known as the *Byzantine Generals Problem* consensus in this untrusting environment is known as the *Byzantine Generals Problem* [^94].
[^94].
# The Byzantine Generals Problem # The Byzantine Generals Problem
The Byzantine Generals Problem is a generalization of the so-called *Two Generals Problem* The Byzantine Generals Problem is a generalization of the so-called *Two Generals Problem* [^95],
[^95],
which imagines a situation in which two army generals need to agree on a battle plan. As they which imagines a situation in which two army generals need to agree on a battle plan. As they
have set up camp on two different sites, they can only communicate by messenger, and the messengers have set up camp on two different sites, they can only communicate by messenger, and the messengers
sometimes get delayed or lost (like packets in a network). We will discuss this problem of sometimes get delayed or lost (like packets in a network). We will discuss this problem of
@ -1290,8 +1253,7 @@ with the network. This concern is relevant in certain specific circumstances. Fo
defraud others. In such circumstances, it is not safe for a node to simply trust another node’s defraud others. In such circumstances, it is not safe for a node to simply trust another node’s
messages, since they may be sent with malicious intent. For example, cryptocurrencies like messages, since they may be sent with malicious intent. For example, cryptocurrencies like
Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties
to agree whether a transaction happened or not, without relying on a central authority to agree whether a transaction happened or not, without relying on a central authority [^100].
[^100].
However, in the kinds of systems we discuss in this book, we can usually safely assume that there However, in the kinds of systems we discuss in this book, we can usually safely assume that there
are no Byzantine faults. In a datacenter, all the nodes are controlled by your organization (so are no Byzantine faults. In a datacenter, all the nodes are controlled by your organization (so
@ -1308,8 +1270,7 @@ end-user control, such as web browsers. This is why input validation, sanitizati
escaping are so important: to prevent SQL injection and cross-site scripting, for example. However, escaping are so important: to prevent SQL injection and cross-site scripting, for example. However,
we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the
authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where
there is no such central authority, Byzantine fault tolerance is more relevant there is no such central authority, Byzantine fault tolerance is more relevant [^103] [^104].
[^103] [^104].
A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to
all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant
@ -1336,18 +1297,15 @@ pragmatic steps toward better reliability. For example:
drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and
UDP, but sometimes they evade detection [^105] [^106] [^107]. UDP, but sometimes they evade detection [^105] [^106] [^107].
Simple measures are usually sufficient protection against such corruption, such as checksums in Simple measures are usually sufficient protection against such corruption, such as checksums in
the application-level protocol. TLS-encrypted connections also offer protection against the application-level protocol. TLS-encrypted connections also offer protection against corruption.
corruption.
* A publicly accessible application must carefully sanitize any inputs from users, for example * A publicly accessible application must carefully sanitize any inputs from users, for example
checking that a value is within a reasonable range and limiting the size of strings to prevent checking that a value is within a reasonable range and limiting the size of strings to prevent
denial of service through large memory allocations. An internal service behind a firewall may be denial of service through large memory allocations. An internal service behind a firewall may be
able to get away with less strict checks on inputs, but basic checks in protocol parsers are still able to get away with less strict checks on inputs, but basic checks in protocol parsers are still a good idea [^105].
a good idea [^105].
* NTP clients can be configured with multiple server addresses. When synchronizing, the client * NTP clients can be configured with multiple server addresses. When synchronizing, the client
contacts all of them, estimates their errors, and checks that a majority of servers agree on some contacts all of them, estimates their errors, and checks that a majority of servers agree on some
time range. As long as most of the servers are okay, a misconfigured NTP server that is reporting an time range. As long as most of the servers are okay, a misconfigured NTP server that is reporting an
incorrect time is detected as an outlier and is excluded from synchronization incorrect time is detected as an outlier and is excluded from synchronization [^39]. The use of multiple servers makes NTP
[^39]. The use of multiple servers makes NTP
more robust than if it only uses a single server. more robust than if it only uses a single server.
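
As a rough illustration of the majority check described in the last item, the sketch below finds a time range that most servers agree on and flags the rest as outliers. It is a simplified interval sweep, not the actual NTP clock selection algorithm; the `majority_time_range` function and the `(name, estimate, max_error)` reading format are assumptions made for this example.

```python
def majority_time_range(readings):
    """Find a time range that a majority of servers agree on.

    readings: list of (name, estimate, max_error) tuples (hypothetical format).
    Returns (range_start, range_end, outlier_names), or None if no range is
    covered by more than half of the servers.
    """
    intervals = [(name, est - err, est + err) for name, est, err in readings]

    # Sweep over interval endpoints. Starts (0) sort before ends (1) at the
    # same instant, so intervals that merely touch still count as agreeing.
    events = []
    for _, lo, hi in intervals:
        events.append((lo, 0))
        events.append((hi, 1))
    events.sort()

    best = count = 0
    best_lo = best_hi = None
    for i, (t, kind) in enumerate(events):
        if kind == 0:
            count += 1
            if count > best:
                # The region with this many overlaps lasts until the next event.
                best, best_lo, best_hi = count, t, events[i + 1][0]
        else:
            count -= 1

    if best <= len(readings) // 2:
        return None  # no majority agreement, so trust nothing

    # A server is an outlier if its interval misses the agreed range entirely.
    outliers = [name for name, lo, hi in intervals if hi < best_lo or lo > best_hi]
    return best_lo, best_hi, outliers
```

For example, with three servers reporting roughly the same time and a fourth that is minutes off, the fourth server's interval does not intersect the agreed range and it is returned in the outlier list.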
## System Model and Reality ## System Model and Reality
@ -1367,15 +1325,13 @@ With regard to timing assumptions, three system models are in common use:
Synchronous model Synchronous model
: The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock : The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock
error. This does not imply exactly synchronized clocks or zero network delay; it just means you error. This does not imply exactly synchronized clocks or zero network delay; it just means you
know that network delay, pauses, and clock drift will never exceed some fixed upper bound know that network delay, pauses, and clock drift will never exceed some fixed upper bound [^108].
[^108].
The synchronous model is not a realistic model of most practical The synchronous model is not a realistic model of most practical
systems, because (as discussed in this chapter) unbounded delays and pauses do occur. systems, because (as discussed in this chapter) unbounded delays and pauses do occur.
Partially synchronous model Partially synchronous model
: Partial synchrony means that a system behaves like a synchronous system *most of the time*, but it : Partial synchrony means that a system behaves like a synchronous system *most of the time*, but it
sometimes exceeds the bounds for network delay, process pauses, and clock drift sometimes exceeds the bounds for network delay, process pauses, and clock drift [^108]. This is a realistic model of many
[^108]. This is a realistic model of many
systems: most of the time, networks and processes are quite well behaved—otherwise we would never systems: most of the time, networks and processes are quite well behaved—otherwise we would never
be able to get anything done—but we have to reckon with the fact that any timing assumptions be able to get anything done—but we have to reckon with the fact that any timing assumptions
may be shattered occasionally. When this happens, network delay, pauses, and clock error may become may be shattered occasionally. When this happens, network delay, pauses, and clock error may become
@ -1391,8 +1347,7 @@ nodes are:
Crash-stop faults Crash-stop faults
: In the *crash-stop* (or *fail-stop*) model, an algorithm may assume that a node can fail in only : In the *crash-stop* (or *fail-stop*) model, an algorithm may assume that a node can fail in only
one way, namely by crashing one way, namely by crashing [^109].
[^109].
This means that the node may suddenly stop responding at any moment, and thereafter that node is This means that the node may suddenly stop responding at any moment, and thereafter that node is
gone forever—it never comes back. gone forever—it never comes back.
@ -1405,19 +1360,14 @@ Crash-recovery faults
Degraded performance and partial functionality Degraded performance and partial functionality
: In addition to crashing and restarting, nodes may go slow: they may still be able to respond to : In addition to crashing and restarting, nodes may go slow: they may still be able to respond to
health check requests, while being too slow to get any real work done. For example, a Gigabit health check requests, while being too slow to get any real work done. For example, a Gigabit
network interface could suddenly drop to 1 Kb/s throughput due to a driver bug network interface could suddenly drop to 1 Kb/s throughput due to a driver bug [^110];
[^110]; a process that is under memory pressure may spend most of its time performing garbage collection [^111];
a process that is under memory pressure may spend most of its time performing garbage collection
[^111];
worn-out SSDs can have erratic performance; and hardware can be affected by high temperature, worn-out SSDs can have erratic performance; and hardware can be affected by high temperature,
loose connectors, mechanical vibration, power supply problems, firmware bugs, and more loose connectors, mechanical vibration, power supply problems, firmware bugs, and more [^112].
[^112]. Such a situation is called a *limping node*, *gray failure*, or *fail-slow* [^113],
Such a situation is called a *limping node*, *gray failure*, or *fail-slow*
[^113],
and it can be even more difficult to deal with than a cleanly failed node. A related problem is and it can be even more difficult to deal with than a cleanly failed node. A related problem is
when a process stops doing some of the things it is supposed to do while other aspects continue when a process stops doing some of the things it is supposed to do while other aspects continue
working, for example because a background thread is crashed or deadlocked working, for example because a background thread is crashed or deadlocked [^114].
[^114].
Byzantine (arbitrary) faults Byzantine (arbitrary) faults
: Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described : Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described
@ -1558,15 +1508,13 @@ longer executions would then not be found.
Still, model checkers strike a nice balance between ease of use and the ability to find non-obvious Still, model checkers strike a nice balance between ease of use and the ability to find non-obvious
bugs. CockroachDB, TiDB, Kafka, and many other distributed systems use model specifications to find bugs. CockroachDB, TiDB, Kafka, and many other distributed systems use model specifications to find
and fix bugs and fix bugs [^122] [^123] [^124]. For example,
[^122] [^123] [^124]. For example,
using TLA+, researchers were able to demonstrate the potential for data loss in viewstamped using TLA+, researchers were able to demonstrate the potential for data loss in viewstamped
replication (VR) caused by ambiguity in the prose description of the algorithm [^125]. replication (VR) caused by ambiguity in the prose description of the algorithm [^125].
By design, model checkers don’t run your actual code, but rather a simplified model that specifies By design, model checkers don’t run your actual code, but rather a simplified model that specifies
only the core ideas of your protocol. This makes it more tractable to systematically explore the only the core ideas of your protocol. This makes it more tractable to systematically explore the
state space, but it risks that your specification and your implementation go out of sync with each state space, but it risks that your specification and your implementation go out of sync with each other [^126].
other [^126].
It is possible to check whether the model and the real implementation have equivalent behavior, but It is possible to check whether the model and the real implementation have equivalent behavior, but
this requires instrumentation in the real implementation [^127]. this requires instrumentation in the real implementation [^127].
@ -1596,8 +1544,7 @@ The myriad of tools required to trigger failures make fault injection tests cumb
It’s common to adopt a fault injection framework like Jepsen to run fault injection tests to It’s common to adopt a fault injection framework like Jepsen to run fault injection tests to
simplify the process. Such frameworks come with integrations for various operating systems and many simplify the process. Such frameworks come with integrations for various operating systems and many
pre-built fault injectors [^129]. pre-built fault injectors [^129].
Jepsen has been remarkably effective at finding critical bugs in many widely-used systems Jepsen has been remarkably effective at finding critical bugs in many widely-used systems [^130] [^131].
[^130] [^131].
### Deterministic simulation testing ### Deterministic simulation testing
@ -1620,19 +1567,16 @@ Application-level
: Some systems are built from the ground-up to make it easy to execute code deterministically. For : Some systems are built from the ground-up to make it easy to execute code deterministically. For
example, FoundationDB, one of the pioneers in the DST space, is built using an asynchronous example, FoundationDB, one of the pioneers in the DST space, is built using an asynchronous
communication library called Flow. Flow provides a point for developers to inject a deterministic communication library called Flow. Flow provides a point for developers to inject a deterministic
network simulation into the system network simulation into the system [^132].
[^132].
Similarly, TigerBeetle is an online transaction processing (OLTP) database with first-class DST Similarly, TigerBeetle is an online transaction processing (OLTP) database with first-class DST
support. The system’s state is modeled as a state machine, with all mutations occurring within a support. The system’s state is modeled as a state machine, with all mutations occurring within a
single event loop. When combined with mock deterministic primitives such as clocks, such an single event loop. When combined with mock deterministic primitives such as clocks, such an
architecture is able to run deterministically architecture is able to run deterministically [^133].
[^133].
Runtime-level Runtime-level
: Languages with asynchronous runtimes and commonly used libraries provide an insertion point : Languages with asynchronous runtimes and commonly used libraries provide an insertion point
to introduce determinism. A single-threaded runtime is used to force all asynchronous code to run to introduce determinism. A single-threaded runtime is used to force all asynchronous code to run
sequentially. FrostDB, for example, patches Go’s runtime to execute goroutines sequentially sequentially. FrostDB, for example, patches Go’s runtime to execute goroutines sequentially [^134].
[^134].
Rust’s madsim library works in a similar manner. Madsim provides deterministic implementations of Rust’s madsim library works in a similar manner. Madsim provides deterministic implementations of
Tokio’s asynchronous runtime API, AWS’s S3 library, Kafka’s Rust library, and many others. Tokio’s asynchronous runtime API, AWS’s S3 library, Kafka’s Rust library, and many others.
Applications can swap in deterministic libraries and runtimes to get deterministic test executions Applications can swap in deterministic libraries and runtimes to get deterministic test executions
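
The ingredients named above (a single event loop, a mock clock, and a simulated network) can be illustrated with a minimal sketch. It is not how FoundationDB, TigerBeetle, FrostDB, or madsim are implemented; every name here (`Simulator`, `SimClock`, `run_scenario`, the drop probability, the delay range) is hypothetical and chosen only for illustration. The point it tries to show is that when all randomness comes from one seeded generator and time advances only inside the event loop, the same seed reproduces the same execution.

```python
import heapq
import random


class SimClock:
    """Mock clock: time only moves when the event loop advances it."""

    def __init__(self):
        self.now = 0.0


class Simulator:
    """Single-threaded event loop with a seeded, simulated network (illustrative only)."""

    def __init__(self, seed):
        self.rng = random.Random(seed)  # the only source of randomness
        self.clock = SimClock()
        self._queue = []                # heap of (time, sequence number, callback)
        self._seq = 0                   # tie-breaker so equal times stay ordered
        self.trace = []                 # record of everything that happened

    def schedule(self, delay, callback):
        heapq.heappush(self._queue, (self.clock.now + delay, self._seq, callback))
        self._seq += 1

    def send(self, name, handler, drop_probability=0.2):
        """Simulated network: may drop the message or delay it by a random amount."""
        if self.rng.random() < drop_probability:
            self.trace.append((self.clock.now, "drop", name))
            return
        delay = self.rng.uniform(0.001, 0.050)
        self.schedule(delay, lambda: handler(name))

    def run(self):
        # Events run one at a time, in simulated-time order.
        while self._queue:
            when, _, callback = heapq.heappop(self._queue)
            self.clock.now = when
            callback()


def run_scenario(seed):
    """Send ten messages through the simulated network and return the trace."""
    sim = Simulator(seed)

    def on_message(name):
        sim.trace.append((sim.clock.now, "deliver", name))

    for i in range(10):
        sim.send(f"msg-{i}", on_message)
    sim.run()
    return sim.trace
```

Calling `run_scenario` twice with the same seed yields identical traces, so a failing run can be replayed exactly by reusing its seed; trying many different seeds explores different message orderings and drops.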
@ -1710,8 +1654,7 @@ node to be falsely suspected of crashing. Handling limping nodes, which are resp
slow to do anything useful, is even harder. slow to do anything useful, is even harder.
Once a fault is detected, making a system tolerate it is not easy either: there is no global Once a fault is detected, making a system tolerate it is not easy either: there is no global
variable, no shared memory, no common knowledge or any other kind of shared state between the variable, no shared memory, no common knowledge or any other kind of shared state between the machines [^83].
machines [^83].
Nodes can’t even agree on what time it is, let alone on anything more profound. The only way Nodes can’t even agree on what time it is, let alone on anything more profound. The only way
information can flow from one node to another is by sending it over the unreliable network. Major information can flow from one node to another is by sending it over the unreliable network. Major
decisions cannot be safely made by a single node, so we require protocols that enlist help from decisions cannot be safely made by a single node, so we require protocols that enlist help from
@ -1722,8 +1665,7 @@ where the same operation always deterministically returns the same result, then
physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems
engineers will often regard a problem as trivial if it can be solved on a single computer [^4], engineers will often regard a problem as trivial if it can be solved on a single computer [^4],
and indeed a single computer can do a lot nowadays. If you can avoid opening Pandora’s box and and indeed a single computer can do a lot nowadays. If you can avoid opening Pandora’s box and
simply keep things on a single machine, for example by using an embedded storage engine (see simply keep things on a single machine, for example by using an embedded storage engine (see [“Embedded storage engines”](/en/ch4#sidebar_embedded)), it is generally worth doing so.
[“Embedded storage engines”](/en/ch4#sidebar_embedded)), it is generally worth doing so.
However, as discussed in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), scalability is not the only reason for However, as discussed in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), scalability is not the only reason for
wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically