convert figure to shortcode

2026-06-22 17:37:04 +08:00 · 2025-08-09 17:43:23 +08:00 · cdd205fcdb
commit cdd205fcdb
parent aa06c85074
9 changed files with 303 additions and 662 deletions
--- a/content/en/ch10.md
+++ b/content/en/ch10.md
@@ -69,12 +69,7 @@ if it wasn’t replicated at all. Then users don’t have to worry about replica
 other inconsistencies. That would give us the advantage of fault tolerance, but without the
 complexity arising from having to think about multiple replicas.
-This is the idea behind *linearizability*
+This is the idea behind *linearizability* [^1] (also known as *atomic consistency* [^2], *strong consistency*, *immediate consistency*, or *external consistency* [^3]).
-[^1]
-(also known as *atomic consistency*
-[^2],
-*strong consistency*, *immediate consistency*, or *external consistency*
-[^3]).
 The exact definition of linearizability is quite subtle, and we will explore it in the rest of this
 section. But the basic idea is to make a system appear as if there were only one copy of the data,
 and all operations on it are atomic. With this guarantee, even though there may be multiple replicas
@@ -86,9 +81,8 @@ copy of the data means guaranteeing that the value read is the most recent, up-t
 doesn’t come from a stale cache or replica. In other words, linearizability is a *recency
 guarantee*. To clarify this idea, let’s look at an example of a system that is not linearizable.
-![ddia 1001](/fig/ddia_1001.png)
+{{< figure src="/fig/ddia_1001.png" id="fig_consistency_linearizability_0" title="Figure 10-1. If this database were linearizable, then either Alice's read would return 1 instead of 0, or Bob's read would return 0 instead of 1." class="w-full my-4" >}}
-###### Figure 10-1. This system is not linearizable, causing sports fans to be confused.
 [Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4].
 Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a
@@ -112,9 +106,8 @@ object *x* in a linearizable database. In distributed systems theory, *x* is cal
 practice, it could be one key in a key-value store, one row in a relational database, or one
 document in a document database, for example.
-![ddia 1002](/fig/ddia_1002.png)
+{{< figure src="/fig/ddia_1002.png" id="fig_consistency_linearizability_1" title="Figure 10-2. Alice observes that x = 0 and y = 1, while Bob observes that x = 1 and y = 0. It's as if Alice's and Bob's computers disagree on the order in which the writes happened." class="w-full my-4" >}}
-###### Figure 10-2. If a read request is concurrent with a write request, it may return either the old or the new value.
 For simplicity, [Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows only the requests from the clients’
 point of view, not the internals of the database. Each bar is a request made by a client, where the
@@ -152,9 +145,8 @@ what we expect of a system that emulates a “single copy of the data.”
 To make the system linearizable, we need to add another constraint, illustrated in
 [Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
-![ddia 1003](/fig/ddia_1003.png)
+{{< figure src="/fig/ddia_1003.png" id="fig_consistency_linearizability_2" title="Figure 10-3. If Alice and Bob had perfect clocks, linearizability would require that x = 1 is returned, since the read of x begins after the write x = 1 completes." class="w-full my-4" >}}
-###### Figure 10-3. After any one read has returned the new value, all following reads (on the same or other clients) must also return the new value.
 In a linearizable system we imagine that there must be some point in time (between the start and end
 of the write operation) at which the value of *x* atomically flips from 0 to 1. Thus, if one
@@ -189,9 +181,8 @@ forward in time (from left to right), never backward. This requirement ensures t
 discussed earlier: once a new value has been written or read, all subsequent reads see the value
 that was written, until it is overwritten again.
-![ddia 1004](/fig/ddia_1004.png)
+{{< figure src="/fig/ddia_1004.png" id="fig_consistency_linearizability_3" title="Figure 10-4. The read of x is concurrent with the write x = 1. Since we don't know the exact timing of the operations, the read is allowed to return either 0 or 1." class="w-full my-4" >}}
-###### Figure 10-4. Visualizing the points in time at which the reads and writes appear to have taken effect. The final read by B is not linearizable.
 There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3):
@@ -349,9 +340,8 @@ for small messages, and a video may be many megabytes in size. Instead, the vide
 to a file storage service, and once the write is complete, the instruction to the transcoder is
 placed on the queue.
-![ddia 1005](/fig/ddia_1005.png)
+{{< figure src="/fig/ddia_1005.png" id="fig_consistency_transcoder" title="Figure 10-5. A system that is not linearizable: Alice and Bob see the uploaded image at different times, and thus Bob's request is based on stale data." class="w-full my-4" >}}
-###### Figure 10-5. The web server and video transcoder communicate both through file storage and a message queue, opening the potential for race conditions.
 If the file storage service is linearizable, then this system should work fine. If it is not
 linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in
@@ -440,9 +430,8 @@ Intuitively, it seems as though quorum reads and writes should be linearizable i
 Dynamo-style model. However, when we have variable network delays, it is possible to have race
 conditions, as demonstrated in [Figure 10-6](/en/ch10#fig_consistency_leaderless).
-![ddia 1006](/fig/ddia_1006.png)
+{{< figure src="/fig/ddia_1006.png" id="fig_consistency_leaderless" title="Figure 10-6. Quorums are not sufficient to ensure linearizability if network delays are variable." class="w-full my-4" >}}
-###### Figure 10-6. A nonlinearizable execution, despite using a quorum.
 In [Figure 10-6](/en/ch10#fig_consistency_leaderless), the initial value of *x* is 0, and a writer client is updating
 *x* to 1 by sending the write to all three replicas (*n* = 3, *w* = 3).
@@ -459,8 +448,7 @@ It is possible to make Dynamo-style quorums linearizable at the cost of reduced
 performance: a reader must perform read repair (see [“Catching up on missed writes”](/en/ch6#sec_replication_read_repair)) synchronously,
 before returning results to the application [^24].
 Moreover, before writing, a writer must read the latest state of a quorum of nodes to fetch the
-latest timestamp of any prior write, and ensure that the new write has a greater timestamp
+latest timestamp of any prior write, and ensure that the new write has a greater timestamp [^25] [^26].
-[^25] [^26].
 However, Riak does not perform synchronous read repair due to the performance penalty.
 Cassandra does wait for read repair to complete on quorum reads [^27],
 but it loses linearizability due to its use of time-of-day clocks for timestamps.
@@ -481,9 +469,8 @@ example, we saw that multi-leader replication is often a good choice for multi-r
 replication (see [“Geographically Distributed Operation”](/en/ch6#sec_replication_multi_dc)). An example of such a deployment is illustrated in
 [Figure 10-7](/en/ch10#fig_consistency_cap_availability).
-![ddia 1007](/fig/ddia_1007.png)
+{{< figure src="/fig/ddia_1007.png" id="fig_consistency_cap_availability" title="Figure 10-7. If clients cannot contact enough replicas due to a network partition, they cannot process writes." class="w-full my-4" >}}
-###### Figure 10-7. A network interruption forcing a choice between linearizability and availability.
 Consider what happens if there is a network interruption between the two regions. Let’s assume
 that the network within each region is working, and clients can reach their local region, but the
@@ -502,8 +489,7 @@ If the network between regions is interrupted in a single-leader setup, clients
 follower regions cannot contact the leader, so they cannot make any writes to the database, nor
 any linearizable reads. They can still make reads from the follower, but they might be stale
 (nonlinearizable). If the application requires linearizable reads and writes, the network
-interruption causes the application to become unavailable in the regions that cannot contact the
+interruption causes the application to become unavailable in the regions that cannot contact the leader.
-leader.
 If clients can connect directly to the leader region, this is not a problem, since the
 application continues to work normally there. But clients that can only reach a follower region
@@ -519,20 +505,16 @@ The trade-off is as follows:
 * If your application *requires* linearizability, and some replicas are disconnected from the other
 replicas due to a network problem, then some replicas cannot process requests while they are
 disconnected: they must either wait until the network problem is fixed, or return an error (either
- way, they become *unavailable*). This choice is sometimes known as *CP* (consistent under network
+ way, they become *unavailable*). This choice is sometimes known as *CP* (consistent under network partitions).
- partitions).
 * If your application *does not require* linearizability, then it can be written in a way that each
 replica can process requests independently, even if it is disconnected from other replicas (e.g.,
 multi-leader). In this case, the application can remain *available* in the face of a network
- problem, but its behavior is not linearizable. This choice is known as *AP* (available under
+ problem, but its behavior is not linearizable. This choice is known as *AP* (available under network partitions).
- network partitions).
 Thus, applications that don’t require linearizability can be more tolerant of network problems. This
-insight is popularly known as the *CAP theorem*
+insight is popularly known as the *CAP theorem* [^29] [^30] [^31] [^32],
-[^29] [^30] [^31] [^32],
 named by Eric Brewer in 2000, although the trade-off had been known to designers of
-distributed databases since the 1970s
+distributed databases since the 1970s [^33] [^34] [^35].
-[^33] [^34] [^35].
 CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of
 starting a discussion about trade-offs in databases. At the time, many distributed databases
@@ -552,8 +534,7 @@ or not.
 At times when the network is working correctly, a system can provide both consistency
 (linearizability) and total availability. When a network fault occurs, you have to choose between
 either linearizability or total availability. Thus, a better way of phrasing CAP would be
-*either Consistent or Available when Partitioned*
+*either Consistent or Available when Partitioned* [^37].
-[^37].
 A more reliable network needs to make this choice less often, but at some point the choice is
 inevitable.
@@ -570,24 +551,19 @@ understand systems better, so CAP is best avoided.
 The CAP theorem as formally defined [^30] is of
 very narrow scope: it only considers one consistency model (namely linearizability) and one kind of
-fault (network partitions, which according to data from Google are the cause of less than 8% of
+fault (network partitions, which according to data from Google are the cause of less than 8% of incidents [^41]).
-incidents [^41]).
 It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP
-has been historically influential, it has little practical value for designing systems
+has been historically influential, it has little practical value for designing systems [^4] [^38].
-[^4] [^38].
 There have been efforts to generalize CAP. For example, the *PACELC principle* observes that system
 designers might also choose to weaken consistency at times when the network is working fine in order
 to reduce latency [^39] [^40] [^42].
-Thus, during a network partition (P), we need to choose between availability (A) and consistency
+Thus, during a network partition (P), we need to choose between availability (A) and consistency (C); 
-(C); else (E), when there is no partition, we may choose between low latency (L) and
+else (E), when there is no partition, we may choose between low latency (L) and consistency (C).
-consistency (C). However, this definition inherits several problems with CAP, such as the
+However, this definition inherits several problems with CAP, such as the counterintuitive definitions of consistency and availability.
-counterintuitive definitions of consistency and availability.
-There are many more interesting impossibility results in distributed systems [^43],
+There are many more interesting impossibility results in distributed systems [^43], and CAP has now been 
-and CAP has now been superseded by more precise results
+superseded by more precise results [^44] [^45], so it is of mostly historical interest today.
-[^44] [^45],
-so it is of mostly historical interest today.
 ### Linearizability and network delays
@@ -595,8 +571,7 @@ Although linearizability is a useful guarantee, surprisingly few systems are act
 in practice. For example, even RAM on a modern multi-core CPU is not linearizable [^46]:
 if a thread running on one CPU core writes to a memory address, and a thread on another CPU core
 reads the same address shortly afterward, it is not guaranteed to read the value written by the
-first thread (unless a *memory barrier* or *fence*
+first thread (unless a *memory barrier* or *fence* [^47] is used).
-[^47] is used).
 The reason for this behavior is that every CPU core has its own memory cache and store buffer.
 Memory access first goes to the cache by default, and any changes are asynchronously written out to
@@ -615,8 +590,7 @@ they do so primarily to increase performance, not so much for fault tolerance [^
 Linearizability is slow—and this is true all the time, not only during a network fault.
 Can’t we maybe find a more efficient implementation of linearizable storage? It seems the answer is
-no: Attiya and Welch [^49]
+no: Attiya and Welch [^49] prove that if you want linearizability, the response time of read and write requests is at least
-prove that if you want linearizability, the response time of read and write requests is at least
 proportional to the uncertainty of delays in the network. In a network with highly variable delays,
 like most computer networks (see [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing)), the response time of linearizable
 reads and writes is inevitably going to be high. A faster algorithm for linearizability does not
@@ -639,9 +613,8 @@ display the messages in order of increasing ID, and the resulting chat threads w
 Aaliyah posts a question that is assigned ID 1, and Bryce’s answer to the question is assigned a
 greater ID, namely 3.
-![ddia 1008](/fig/ddia_1008.png)
+{{< figure src="/fig/ddia_1008.png" id="fig_consistency_id_generator" title="Figure 10-8. Two different nodes may generate conflicting IDs." class="w-full my-4" >}}
-###### Figure 10-8. An ID generator that assigns auto-incrementing integer IDs to messages in a chat application.
 This single-node ID generator is another example of a linearizable system. Each request to fetch the
 ID is an operation that atomically increments a counter and returns the old counter value (a
@@ -755,9 +728,8 @@ operations it has processed. A Lamport timestamp is then simply a pair of (*coun
 Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp,
 each timestamp is made unique.
-![ddia 1009](/fig/ddia_1009.png)
+{{< figure src="/fig/ddia_1009.png" id="fig_consistency_lamport_ts" title="Figure 10-9. Lamport timestamps provide a total ordering consistent with causality." class="w-full my-4" >}}
-###### Figure 10-9. Lamport timestamps provide a total ordering consistent with causality.
 Every time a node generates a timestamp, it increments its counter value and uses the new value.
 Moreover, every time a node sees a timestamp from another node, if the counter value in that
@@ -843,9 +815,8 @@ account settings to private. Then A uses their phone to upload the photo. Since
 updates in sequence, they might reasonably expect the photo upload to be subject to the new,
 restricted account permissions.
-![ddia 1010](/fig/ddia_1010.png)
+{{< figure src="/fig/ddia_1010.png" id="fig_consistency_permissions" title="Figure 10-10. An example of a permission system using Lamport timestamps." class="w-full my-4" >}}
-###### Figure 10-10. User A first sets their account to private, then shares a photo. With a non-linearizable ID generator, an unauthorized viewer may see the photo.
 The account permission and the photo are stored in two separate databases (or separate shards of the
 same database), and let’s assume they use a Lamport clock or hybrid logical clock to assign a
@@ -944,27 +915,20 @@ node, but which get a lot harder if you want fault tolerance:
 It turns out that all of these are instances of the same fundamental distributed systems problem:
 *consensus*. Consensus is one of the most important and fundamental problems in distributed
-computing; it is also infamously difficult to get right
+computing; it is also infamously difficult to get right [^58] [^59],
-[^58] [^59],
 and many systems have got it wrong in the past. Now that we have discussed replication
 ([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
 linearizability (this chapter), we are finally ready to tackle the consensus problem.
-The best-known consensus algorithms are Viewstamped Replication
+The best-known consensus algorithms are Viewstamped Replication [^60] [^61], Paxos [^58] [^62] [^63] [^64],
-[^60] [^61],
+Raft [^23] [^65] [^66], and Zab [^18] [^22] [^67]. There are quite a few similarities between these algorithms, but they are not the same [^68] [^69].
-Paxos [^58] [^62] [^63] [^64],
-Raft [^23] [^65] [^66],
-and Zab [^18] [^22] [^67].
-There are quite a few similarities between these algorithms, but they are not the same
-[^68] [^69].
 These algorithms work in a non-Byzantine system model: that is, network communication may be
 arbitrarily delayed or dropped, and nodes may crash, restart, and become disconnected, but the
 algorithms assume that nodes otherwise follow the protocol correctly and do not behave maliciously.
 There are also consensus algorithms that can tolerate some Byzantine nodes, i.e., nodes that don’t
 correctly follow the protocol (for example, by sending contradictory messages to other nodes). A
-common assumption is that fewer than one-third of the nodes are Byzantine-faulty
+common assumption is that fewer than one-third of the nodes are Byzantine-faulty [^26] [^70].
-[^26] [^70].
 Such *Byzantine fault tolerant* (BFT) consensus algorithms are used in blockchains [^71].
 However, as explained in [“Byzantine Faults”](/en/ch9#sec_distributed_byzantine), BFT algorithms are beyond the scope of this
 book.
@@ -981,10 +945,8 @@ Firstly, FLP doesn’t say that we can never reach consensus—it only says that
 a consensus algorithm will *always* terminate. Moreover, the FLP result is proved assuming a
 deterministic algorithm in the asynchronous system model (see [“System Model and Reality”](/en/ch9#sec_distributed_system_model)),
 which means the algorithm cannot use any clocks or timeouts. If it can use timeouts to suspect that
-another node may have crashed (even if the suspicion is sometimes wrong), then consensus becomes
+another node may have crashed (even if the suspicion is sometimes wrong), then consensus becomes solvable [^73].
-solvable [^73].
+Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility result [^74].
-Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility
-result [^74].
 Thus, although the FLP result about the impossibility of consensus is of great theoretical
 importance, distributed systems can usually achieve consensus in practice.
@@ -1103,10 +1065,8 @@ name has not been created or modified by another client since the current client
 However, a linearizable read-write register is not sufficient to solve consensus. The FLP result
 tells us that consensus cannot be solved by a deterministic algorithm in the asynchronous crash-stop
-model [^72], but we saw in
+model [^72], but we saw in [“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum
-[“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum
+reads/writes in this model [^24] [^25] [^26]. From this it follows that a linearizable register cannot solve consensus.
-reads/writes in this model [^24] [^25] [^26].
-From this it follows that a linearizable register cannot solve consensus.
 ### Shared logs as consensus
@@ -1304,8 +1264,7 @@ A shared log is also powerful because it can easily be adapted to other forms of
 * If you want an atomic fetch-and-add, put the number to add to the counter in a log entry, and the
 current counter value is the sum of all of the log entries so far. A simple counter on log entries
 can be used to generate fencing tokens (see [“Fencing off zombies and delayed requests”](/en/ch9#sec_distributed_fencing_tokens)); for example, in
- ZooKeeper, this sequence number is called `zxid`
+ ZooKeeper, this sequence number is called `zxid` [^18].
- [^18].
 ### From single-leader replication to consensus
@@ -1340,8 +1299,7 @@ leader with the higher epoch number prevails.
 Before a leader is allowed to append the next entry to the shared log, it must first check that
 there isn’t some other leader with a higher epoch number which might append a different entry. It
 can do this by collecting votes from a quorum of nodes—typically, but not always, a majority of
-nodes [^85].
+nodes [^85]. A node votes yes only if it is not aware of any other leader with a higher epoch.
-A node votes yes only if it is not aware of any other leader with a higher epoch.
 Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s
 proposal for the next entry to append to the log. The quorums for those two votes must overlap: if
@@ -1436,8 +1394,7 @@ terrible performance as the system can end up spending more time choosing leader
 work.
 Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft
-has been shown to have unpleasant edge cases
+has been shown to have unpleasant edge cases [^88] [^89]:
-[^88] [^89]:
 if the entire network is working correctly except for one particular network link that is
 consistently unreliable, Raft can get into situations where leadership continually bounces between
 two nodes, or the current leader is continually forced to resign, so the system effectively never
@@ -1463,8 +1420,7 @@ in the background. Coordination services are designed to hold small amounts of d
 entirely in memory (although they still write to disk for durability), which is replicated across
 multiple nodes using a fault-tolerant consensus algorithm.
-Coordination services are modeled after Google’s Chubby lock service
+Coordination services are modeled after Google’s Chubby lock service [^17] [^58].
-[^17] [^58].
 They combine a consensus algorithm with several other features that turn out to be particularly
 useful when building distributed systems:
@@ -1540,8 +1496,7 @@ Normally, the kind of data managed by a coordination service is quite slow-chang
 information like “the node running on IP address 10.1.1.23 is the leader for shard 7,” and such
 assignments usually change on a timescale of minutes or hours. Coordination services are not
 intended for storing data that may change thousands of times per second. For that, it is better to
-use a conventional database; alternatively, tools like Apache BookKeeper
+use a conventional database; alternatively, tools like Apache BookKeeper [^90] [^91]
-[^90] [^91]
 can be used to replicate fast-changing internal state of a service.
 ### Service discovery
--- a/content/en/ch2.md
+++ b/content/en/ch2.md
@@ -56,9 +56,7 @@ Barack Obama have over 100 million followers).
 Imagine we keep all of the data in a relational database as shown in [Figure 2-1](/en/ch2#fig_twitter_relational). We
 have one table for users, one table for posts, and one table for follow relationships.
-![ddia 0201](/fig/ddia_0201.png)
+{{< figure src="/fig/ddia_0201.png" id="fig_twitter_relational" title="Figure 2-1. Simple relational schema for a social network in which users can follow each other." class="w-full my-4" >}}
-###### Figure 2-1. Simple relational schema for a social network in which users can follow each other.
 Let’s say the main read operation that our social network must support is the *home timeline*, which
 displays recent posts by people you are following (for simplicity we will ignore ads, suggested
@@ -111,9 +109,7 @@ because the home timelines are derived data that needs to be updated. The proces
 carried out, we use the term *fan-out* to describe the factor by which the number of requests
 increases.
-![ddia 0202](/fig/ddia_0202.png)
+{{< figure src="/fig/ddia_0202.png" id="fig_twitter_timelines" title="Figure 2-2. Fan-out: delivering new posts to every follower of the user who made the post." class="w-full my-4" >}}
-###### Figure 2-2. Fan-out: delivering new posts to every follower of the user who made the post.
 At a rate of 5,700 posts posted per second, if the average post reaches 200 followers (i.e., a
 fan-out factor of 200), we will need to do just over 1 million home timeline writes per second. This
@@ -171,9 +167,7 @@ the process of handling an earlier request, and therefore the incoming request n
 the earlier request has been completed. As throughput approaches the maximum that the hardware can
 handle, queueing delays increase sharply.
-![ddia 0203](/fig/ddia_0203.png)
+{{< figure src="/fig/ddia_0203.png" id="fig_throughput" title="Figure 2-3. As the throughput of a service approaches its capacity, the response time increases dramatically due to queueing." class="w-full my-4" >}}
-###### Figure 2-3. As the throughput of a service approaches its capacity, the response time increases dramatically due to queueing.
 # When an overloaded system won’t recover
@@ -217,9 +211,7 @@ terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)
 i.e., during which it is *latent*. In particular, *network latency* or *network delay* refers to
 the time that request and response spend traveling through the network.
-![ddia 0204](/fig/ddia_0204.png)
+{{< figure src="/fig/ddia_0204.png" id="fig_response_time" title="Figure 2-4. Response time, service time, network latency, and queueing delay." class="w-full my-4" >}}
-###### Figure 2-4. Response time, service time, network latency, and queueing delay.
 In [Figure 2-4](/en/ch2#fig_response_time), time flows from left to right, each communicating node is shown as a
 horizontal line, and a request or response message is shown as a thick diagonal arrow from one node
@@ -247,9 +239,7 @@ gray bar represents a request to a service, and its height shows how long that r
 requests are reasonably fast, but there are occasional *outliers* that take much longer.
 Variation in network delay is also known as *jitter*.
-![ddia 0205](/fig/ddia_0205.png)
+{{< figure src="/fig/ddia_0205.png" id="fig_lognormal" title="Figure 2-5. Illustrating mean and percentiles: response times for a sample of 100 requests to a service." class="w-full my-4" >}}
-###### Figure 2-5. Illustrating mean and percentiles: response times for a sample of 100 requests to a service.
 It’s common to report the *average* response time of a service (technically, the *arithmetic mean*:
 that is, sum all the response times, and divide by the number of requests). The mean response time
@@ -322,9 +312,7 @@ increases if an end-user request requires multiple backend calls, and so a highe
 end-user requests end up being slow (an effect known as *tail latency amplification*
 [^26]).
-![ddia 0206](/fig/ddia_0206.png)
+{{< figure src="/fig/ddia_0206.png" id="fig_tail_amplification" title="Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request." class="w-full my-4" >}}
-###### Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.
 Percentiles are often used in *service level objectives* (SLOs) and *service level agreements*
 (SLAs) as ways of defining the expected performance and availability of a service [^27].
@@ -423,16 +411,12 @@ cured, as described in the following sections.
 When we think of causes of system failure, hardware faults quickly come to mind:
-* Approximately 2–5% of magnetic hard drives fail per year [^40] [^41];
+* Approximately 2–5% of magnetic hard drives fail per year [^40] [^41]; in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
- in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
 Recent data suggests that disks are getting more reliable, but failure rates remain significant [^42].
-* Approximately 0.5–1% of solid state drives (SSDs) fail per year [^43].
+* Approximately 0.5–1% of solid state drives (SSDs) fail per year [^43]. Small numbers of bit errors are corrected automatically [^44], but uncorrectable errors occur approximately once per year per drive, even in drives that are
- Small numbers of bit errors are corrected automatically [^44],
- but uncorrectable errors occur approximately once per year per drive, even in drives that are
 fairly new (i.e., that have experienced little wear); this error rate is higher than that of
 magnetic hard drives [^45], [^46].
-* Other hardware components such as power supplies, RAID controllers, and memory modules also fail,
+* Other hardware components such as power supplies, RAID controllers, and memory modules also fail, although less frequently than hard drives [^47] [^48].
- although less frequently than hard drives [^47] [^48].
 * Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result,
 likely due to manufacturing defects [^49] [^50] [^51]. In some cases, an erroneous computation leads to a crash, but in other cases it leads to a program simply returning the wrong result.
 * Data in RAM can also be corrupted, either due to random events such as cosmic rays, or due to
--- a/content/en/ch3.md
+++ b/content/en/ch3.md
@@ -165,9 +165,7 @@ representing such *one-to-many relationships* is to put positions, education, an
 information in separate tables, with a foreign key reference to the `users` table, as in
 [Figure 3-1](/en/ch3#fig_obama_relational).
-![ddia 0301](/fig/ddia_0301.png)
+{{< figure src="/fig/ddia_0301.png" id="fig_obama_relational" title="Figure 3-1. Representing a LinkedIn profile using a relational schema." class="w-full my-4" >}}
-###### Figure 3-1. Representing a LinkedIn profile using a relational schema.
 Another way of representing the same information, which is perhaps more natural and maps more
 closely to an object structure in application code, is as a JSON document as shown in
@@ -214,9 +212,7 @@ The one-to-many relationships from the user profile to the user’s positions, e
 contact information imply a tree structure in the data, and the JSON representation makes this tree
 structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
-![ddia 0302](/fig/ddia_0302.png)
+{{< figure src="/fig/ddia_0302.png" id="fig_json_tree" title="Figure 3-2. One-to-many relationships forming a tree structure." class="w-full my-4" >}}
-###### Figure 3-2. One-to-many relationships forming a tree structure.
 > [!NOTE]
 > This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé typically has a small number of positions [^9] [^10].
@ -388,9 +384,7 @@ an organization has several past or present employees). In a relational model, s
 is usually represented as an *associative table* or *join table*, as shown in
 [Figure 3-3](/en/ch3#fig_datamodels_m2m_rel): each position associates one user ID with one organization ID.
-![ddia 0303](/fig/ddia_0303.png)
+{{< figure src="/fig/ddia_0303.png" id="fig_datamodels_m2m_rel" title="Figure 3-3. Many-to-many relationships in the relational model." class="w-full my-4" >}}
-###### Figure 3-3. Many-to-many relationships in the relational model.
 Many-to-one and many-to-many relationships do not easily fit within one self-contained JSON
 document; they lend themselves more to a normalized representation. In a document model, one
@ -414,9 +408,7 @@ documents.
 }
 ```
-![ddia 0304](/fig/ddia_0304.png)
+{{< figure src="/fig/ddia_0304.png" id="fig_datamodels_many_to_many" title="Figure 3-4. Many-to-many relationships in the document model: the data within each dotted box can be grouped into one document." class="w-full my-4" >}}
-###### Figure 3-4. Many-to-many relationships in the document model: the data within each dotted box can be grouped into one document.
 Many-to-many relationships often need to be queried in “both directions”: for example, finding all
 of the organizations that a particular person has worked for, and finding all of the people who have
@ -450,9 +442,7 @@ retailer. At the center of the schema is a so-called *fact table* (in this examp
 (here, each row represents a customer’s purchase of a product). If we were analyzing website traffic
 rather than retail sales, each row might represent a page view or a click by a user.
-![ddia 0305](/fig/ddia_0305.png)
+{{< figure src="/fig/ddia_0305.png" id="fig_dwh_schema" title="Figure 3-5. Example of a star schema for use in a data warehouse." class="w-full my-4" >}}
-###### Figure 3-5. Example of a star schema for use in a data warehouse.
 Usually, facts are captured as individual events, because this allows maximum flexibility of
 analysis later. However, this means that the fact table can become extremely large. A big enterprise
@ -775,9 +765,7 @@ are married and living in London. Each person and each location is represented a
 relationships between them as edges. This example will help demonstrate some queries that are easy
 in graph databases, but difficult in other models.
-![ddia 0306](/fig/ddia_0306.png)
+{{< figure src="/fig/ddia_0306.png" id="fig_datamodels_graph" title="Figure 3-6. Example of graph-structured data (boxes represent vertices, arrows represent edges)." class="w-full my-4" >}}
-###### Figure 3-6. Example of graph-structured data (boxes represent vertices, arrows represent edges).
 ## Property Graphs
@ -1271,7 +1259,7 @@ before if you’ve studied computer science.
 ##### Example 3-12. The same query as [Example 3-5](/en/ch3#fig_cypher_query), expressed in Datalog
-```
+```sql
 within_recursive(LocID, PlaceName) :- location(LocID, PlaceName, _). /* Rule 1 */
 within_recursive(LocID, PlaceName) :- within(LocID, ViaID), /* Rule 2 */
@ -1320,9 +1308,9 @@ One possible way of applying the rules is thus (and as illustrated in [Figure 3
 By repeated application of rules 1 and 2, the `within_recursive` virtual table can tell us all the
 locations in North America (or any other location) contained in our database.
-![ddia 0307](/fig/ddia_0307.png)
+{{< figure link="#fig_datalog_query" src="/fig/ddia_0307.png" id="fig_datalog_naive" title="Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from Example 3-12." class="w-full my-4" >}}
-###### Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](/en/ch3#fig_datalog_query).
+> Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](/en/ch3#fig_datalog_query).
 Now rule 3 can find people who were born in some location `BornIn` and live in some location
 `LivingIn`. Rule 4 invokes rule 3 with `BornIn = 'United States'` and
@ -1474,9 +1462,7 @@ and so on. Reservations may also be cancelled, and meanwhile, the conference org
 the capacity of the event by moving it to a different room. With all of this going on, simply
 calculating the number of available seats becomes a challenging query.
-![ddia 0308](/fig/ddia_0308.png)
+{{< figure src="/fig/ddia_0308.png" id="fig_event_sourcing" title="Figure 3-8. Using a log of immutable events as source of truth, and deriving materialized views from it." class="w-full my-4" >}}
-###### Figure 3-8. Using a log of immutable events as source of truth, and deriving materialized views from it.
 In [Figure 3-8](/en/ch3#fig_event_sourcing), every change to the state of the conference (such as the organizer
 opening registrations, or attendees making and cancelling registrations) is first stored as an
@ -1617,9 +1603,7 @@ is no data for many user-movie combinations, but this is fine. This matrix may h
 of columns and would therefore not fit well in a relational database, but dataframes and libraries
 that offer sparse arrays (such as NumPy for Python) can handle such data easily.
-![ddia 0309](/fig/ddia_0309.png)
+{{< figure src="/fig/ddia_0309.png" id="fig_dataframe_to_matrix" title="Figure 3-9. Transforming a relational database of movie ratings into a matrix representation." class="w-full my-4" >}}
-###### Figure 3-9. Transforming a relational database of movie ratings into a matrix representation.
 A matrix can only contain numbers, and various techniques are used to transform non-numerical data
 into numbers in the matrix. For example:
--- a/content/en/ch4.md
+++ b/content/en/ch4.md
@ -41,7 +41,7 @@ queries, such as text retrieval.
 Consider the world’s simplest database, implemented as two Bash functions:
-```
+```bash
 #!/bin/bash
 db_set () {
@ -60,14 +60,13 @@ recent value associated with that particular key and returns it.
 And it works:
-```
+```bash
 $ db_set 12 '{"name":"London","attractions":["Big Ben","London Eye"]}'
 $ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'
 $ db_get 42
 {"name":"San Francisco","attractions":["Golden Gate Bridge"]}
 ```
 The storage format is very simple: a text file where each line contains a key-value pair, separated
@ -76,7 +75,7 @@ the end of the file. If you update a key several times, old versions of the valu
 overwritten—you need to look at the last occurrence of a key in a file to find the latest value
 (hence the `tail -n 1` in `db_get`):
-```
+```bash
 $ db_set 42 '{"name":"San Francisco","attractions":["Exploratorium"]}'
 $ db_get 42
@ -136,9 +135,7 @@ To start, let’s assume that you want to continue storing data in the append-on
 memory, in which every key is mapped to the byte offset in the file at which the most recent value
 for that key can be found, as illustrated in [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
-![ddia 0401](/fig/ddia_0401.png)
+{{< figure src="/fig/ddia_0401.png" id="fig_storage_csv_hash_index" title="Figure 4-1. Storing a log of key-value pairs in a CSV-like format, indexed with an in-memory hash map." class="w-full my-4" >}}
-###### Figure 4-1. Storing a log of key-value pairs in a CSV-like format, indexed with an in-memory hash map.
 Whenever you append a new key-value pair to the file, you also update the hash map to reflect the
 offset of the data you just wrote. When you want to look up a value, you use the hash map to find
@ -162,15 +159,12 @@ This approach is much faster, but it still suffers from several problems:
 ### The SSTable file format
 In practice, hash tables are not used very often for database indexes, and instead it is much more
-common to keep data in a structure that is *sorted by key*
+common to keep data in a structure that is *sorted by key* [^3].
-[^3].
 One example of such a structure is a *Sorted String Table*, or *SSTable* for short, as shown in
 [Figure 4-2](/en/ch4#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that
 they are sorted by key, and each key only appears once in the file.
-![ddia 0402](/fig/ddia_0402.png)
+{{< figure src="/fig/ddia_0402.png" id="fig_storage_sstable_index" title="Figure 4-2. An SSTable with a sparse index, allowing queries to jump to the right block." class="w-full my-4" >}}
-###### Figure 4-2. An SSTable with a sparse index, allowing queries to jump to the right block.
 Now you do not need to keep all the keys in memory: you can group the key-value pairs within an
 SSTable into *blocks* of a few kilobytes, and then store the first key of each block in the index.
@ -224,9 +218,7 @@ the same key appears in more than one input file, keep only the more recent valu
 new merged segment file, also sorted by key, with one value per key, and it uses minimal memory
 because we can iterate over the SSTables one key at a time.
-![ddia 0403](/fig/ddia_0403.png)
+{{< figure src="/fig/ddia_0403.png" id="fig_storage_sstable_merging" title="Figure 4-3. Merging several SSTable segments, retaining only the most recent value for each key." class="w-full my-4" >}}
-###### Figure 4-3. Merging several SSTable segments, retaining only the most recent value for each key.
 To ensure that the data in the memtable is not lost if the database crashes, the storage engine
 keeps a separate log on disk to which every write is immediately appended. This log is not sorted by
@ -285,9 +277,7 @@ We set the bits corresponding to those indexes to 1, and leave the rest as 0. Fo
 is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of
 extra space, but the Bloom filter is generally small compared to the rest of the SSTable.
-![ddia 0404](/fig/ddia_0404.png)
+{{< figure src="/fig/ddia_0404.png" id="fig_storage_bloom" title="Figure 4-4. A Bloom filter provides a fast, probabilistic check whether a particular key exists in a particular SSTable." class="w-full my-4" >}}
-###### Figure 4-4. A Bloom filter provides a fast, probabilistic check whether a particular key exists in a particular SSTable.
 When we want to know whether a key appears in the SSTable, we compute the same hash of that key as
 before, and check the bits at those indexes. For example, in [Figure 4-4](/en/ch4#fig_storage_bloom), we’re querying
@ -366,8 +356,7 @@ for scaling a database across multiple machines.
 The log-structured approach is popular, but it is not the only form of key-value storage. The most
 widely used structure for reading and writing database records by key is the *B-tree*.
-Introduced in 1970 [^21]
+Introduced in 1970 [^21] and called “ubiquitous” less than 10 years later [^22],
-and called “ubiquitous” less than 10 years later [^22],
 B-trees have stood the test of time very well. They remain the standard index implementation in
 almost all relational databases, and many nonrelational databases use them too.
@ -387,9 +376,7 @@ multiplying the page number by the page size gives us the byte offset in the fil
 located. We can use these page references to construct a tree of pages, as illustrated in
 [Figure 4-5](/en/ch4#fig_storage_b_tree).
-![ddia 0405](/fig/ddia_0405.png)
+{{< figure src="/fig/ddia_0405.png" id="fig_storage_b_tree" title="Figure 4-5. Looking up the key 251 using a B-tree index. From the root page we first follow the reference to the page for keys 200–300, then the page for keys 250–270." class="w-full my-4" >}}
-###### Figure 4-5. Looking up the key 251 using a B-tree index. From the root page we first follow the reference to the page for keys 200–300, then the page for keys 250–270.
 One page is designated as the *root* of the B-tree; whenever you want to look up a key in the index,
 you start here. The page contains several keys and references to child pages.
@ -416,9 +403,7 @@ it to that page. If there isn’t enough free space in the page to accommodate t
 is split into two half-full pages, and the parent page is updated to account for the new subdivision
 of key ranges.
-![ddia 0406](/fig/ddia_0406.png)
+{{< figure src="/fig/ddia_0406.png" id="fig_storage_b_tree_split" title="Figure 4-6. Growing a B-tree by splitting a page on the boundary key 337. The parent page is updated to reference both children." class="w-full my-4" >}}
-###### Figure 4-6. Growing a B-tree by splitting a page on the boundary key 337. The parent page is updated to reference both children.
 In the example of [Figure 4-6](/en/ch4#fig_storage_b_tree_split), we want to insert the key 334, but the page for the
 range 333–345 is already full. We therefore split it into a page for the range 333–337 (including
@ -444,8 +429,7 @@ modify files in place.
 Overwriting several pages at once, like in a page split, is a dangerous operation: if the database
 crashes after only some of the pages have been written, you end up with a corrupted tree (e.g.,
 there may be an *orphan* page that is not a child of any parent). If the hardware can’t atomically
-write an entire page, you can also end up with a partially written page (this is known as a *torn
+write an entire page, you can also end up with a partially written page (this is known as a *torn page* [^23]).
-page* [^23]).
 In order to make the database resilient to crashes, it is common for B-tree implementations to
 include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file
@ -509,8 +493,7 @@ High write throughput can cause latency spikes in a log-structured storage engin
 memtable fills up. This happens if data can’t be written out to disk fast enough, perhaps because
 the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB,
 perform *backpressure* in this situation: they suspend all reads and writes until the memtable has
-been written out to disk
+been written out to disk [^30] [^31].
-[^30] [^31].
 Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read
 requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but
@ -552,8 +535,7 @@ A sequential write workload writes larger chunks of data at a time, so it is lik
 512 KiB block belongs to a single file; when that file is later deleted again, the whole block
 can be erased without having to perform any GC. On the other hand, with a random write workload, it
 is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
-to perform more work before a block can be erased
+to perform more work before a block can be erased [^34] [^35] [^36].
-[^34] [^35] [^36].
 The write bandwidth consumed by GC is then not available for the application. Moreover, the
 additional writes performed by GC contribute to wear on the flash memory; therefore, random writes
@ -654,14 +636,12 @@ The key in an index is the thing that queries search by, but the value can be on
 * If the actual data (row, document, vertex) is stored directly within the index structure, it is
 called a *clustered index*. For example, in MySQL’s InnoDB storage engine, the primary key of a
- table is always a clustered index, and in SQL Server, you can specify one clustered index per
+ table is always a clustered index, and in SQL Server, you can specify one clustered index per table [^43].
- table [^43].
 * Alternatively, the value can be a reference to the actual data: either the primary key of the row
 in question (InnoDB does this for secondary indexes), or a direct reference to a location on disk.
 In the latter case, the place where rows are stored is known as a *heap file*, and it stores data
 in no particular order (it may be append-only, or it may keep track of deleted rows in order to
- overwrite them with new data later). For example, Postgres uses the heap file approach
+ overwrite them with new data later). For example, Postgres uses the heap file approach [^44].
- [^44].
 * A middle ground between the two is a *covering index* or *index with included columns*, which
 stores *some* of a table’s columns within the index, in addition to storing the full row on the
 heap or in the primary key clustered index [^45].
@ -707,8 +687,7 @@ easily be backed up, inspected, and analyzed by external utilities.
 Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model,
 and the vendors claim that they can offer big performance improvements by removing all the overheads
-associated with managing on-disk data structures
+associated with managing on-disk data structures [^46] [^47].
-[^46] [^47].
 RAMCloud is an open source, in-memory key-value store with durability (using a log-structured
 approach for the data in memory as well as the data on disk) [^48].
@ -741,8 +720,7 @@ Some databases, such as Microsoft SQL Server, SAP HANA, and SingleStore, have su
 transaction processing and data warehousing in the same product. However, these hybrid transactional
 and analytical processing (HTAP) databases (introduced in [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are increasingly
 becoming two separate storage and query engines, which happen to be accessible through a common SQL
-interface
+interface [^50] [^51] [^52] [^53].
-[^50] [^51] [^52] [^53].
 ## Cloud Data Warehouses
@ -774,8 +752,7 @@ Query engine
 Storage format
 : The storage format determines how the rows of a table are encoded as bytes in a file, which is
- then typically stored in object storage or a distributed filesystem
+ then typically stored in object storage or a distributed filesystem [^12].
- [^12].
 This data can then be accessed by the query engine, but also by other applications using the data
 lake. Examples of such storage formats are Parquet, ORC, Lance, or Nimble, and we will see more
 about them in the next section.
@ -833,8 +810,7 @@ How can we execute this query efficiently?
 In most OLTP databases, storage is laid out in a *row-oriented* fashion: all the values from one row
 of a table are stored next to each other. Document databases are similar: an entire document is
-typically stored as one contiguous sequence of bytes. You can see this in the CSV example of
+typically stored as one contiguous sequence of bytes. You can see this in the CSV example of [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
-[Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
 In order to process a query like [Example 4-1](/en/ch4#fig_storage_analytics_query), you may have indexes on
 `fact_sales.date_key` and/or `fact_sales.product_sk` that tell the storage engine where to find
@ -851,16 +827,10 @@ an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema)
 > [!NOTE]
 > Column storage is easiest to understand in a relational data model, but it applies equally to
-> nonrelational data. For example, Parquet
+> nonrelational data. For example, Parquet [^57] is a columnar storage format that supports a document data model, based on Google’s Dremel [^58],
-> [^57]
+> using a technique known as *shredding* or *striping* [^59].
-> is a columnar storage format that supports a document data model, based on Google’s Dremel
-> [^58],
-> using a technique known as *shredding* or *striping*
-> [^59].
-![ddia 0407](/fig/ddia_0407.png)
+{{< figure src="/fig/ddia_0407.png" id="fig_column_store" title="Figure 4-7. Storing relational data by column, rather than by row." class="w-full my-4" >}}
-###### Figure 4-7. Storing relational data by column, rather than by row.
 The column-oriented storage layout relies on each column storing the rows in the same order.
 Thus, if you need to reassemble an entire row, you can take the 23rd entry from each of the
@ -873,20 +843,10 @@ Since many queries are restricted to a particular date range, it is common to ma
 contain the rows for a particular timestamp range. A query then only needs to load the columns it
 needs in those blocks that overlap with the required date range.
-Columnar storage is used in almost all analytic databases nowadays [^60],
+Columnar storage is used in almost all analytic databases nowadays [^60], ranging from large-scale cloud data warehouses such as Snowflake [^61]
-ranging from large-scale cloud data warehouses such as Snowflake [^61]
+to single-node embedded databases such as DuckDB [^62], and product analytics systems such as Pinot [^63] and Druid [^64].
-to single-node embedded databases such as DuckDB [^62],
+It is used in storage formats such as Parquet, ORC [^65] [^66], Lance [^67], and Nimble [^68], and in-memory analytics formats like Apache Arrow
-and product analytics systems such as Pinot [^63]
+[^65] [^69] and Pandas/NumPy [^70]. Some time-series databases, such as InfluxDB IOx [^71] and TimescaleDB [^72], are also based on column-oriented storage.
-and Druid [^64].
-It is used in storage formats such as Parquet, ORC
-[^65] [^66],
-Lance [^67],
-and Nimble [^68],
-and in-memory analytics formats like Apache Arrow
-[^65] [^69]
-and Pandas/NumPy [^70].
-Some time-series databases, such as InfluxDB IOx [^71] and TimescaleDB [^72],
-are also based on column-oriented storage.
 ### Column Compression
@ -899,9 +859,7 @@ repetitive, which is a good sign for compression. Depending on the data in the c
 compression techniques can be used. One technique that is particularly effective in data warehouses
 is *bitmap encoding*, illustrated in [Figure 4-8](/en/ch4#fig_bitmap_index).
-![ddia 0408](/fig/ddia_0408.png)
+{{< figure src="/fig/ddia_0408.png" id="fig_bitmap_index" title="Figure 4-8. Compressed, bitmap-indexed storage of a single column." class="w-full my-4" >}}
-###### Figure 4-8. Compressed, bitmap-indexed storage of a single column.
 Often, the number of distinct values in a column is small compared to the number of rows (for
 example, a retailer may have billions of sales transactions, but only 100,000 distinct products).
@ -1041,9 +999,7 @@ Vectorized processing
 shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in
 a particular store.
-![ddia 0409](/fig/ddia_0409.png)
+{{< figure src="/fig/ddia_0409.png" id="fig_bitmap_and" title="Figure 4-9. A bitwise AND between two bitmaps lends itself to vectorization." class="w-full my-4" >}}
-###### Figure 4-9. A bitwise AND between two bitmaps lends itself to vectorization.
 The two approaches are very different in terms of their implementation, but both are used in
 practice [^77]. Both can achieve very good
@ -1081,9 +1037,7 @@ queries use most often? A *data cube* or *OLAP cube* does this by creating a gri
 grouped by different dimensions [^82].
 [Figure 4-10](/en/ch4#fig_data_cube) shows an example.
-![ddia 0410](/fig/ddia_0410.png)
+{{< figure src="/fig/ddia_0410.png" id="fig_data_cube" title="Figure 4-10. Two dimensions of a data cube, aggregating data by summing." class="w-full my-4" >}}
-###### Figure 4-10. Two dimensions of a data cube, aggregating data by summing.
 Imagine for now that each fact has foreign keys to only two dimension tables—in [Figure 4-10](/en/ch4#fig_data_cube),
 these are `date_key` and `product_sk`. You can now draw a two-dimensional table, with
@ -1282,9 +1236,7 @@ Hierarchical Navigable Small World (HNSW)
 query vector. The process continues until the last layer is reached. As with IVF indexes, HNSW
 indexes are approximate.
-![ddia 0411](/fig/ddia_0411.png)
+{{< figure src="/fig/ddia_0411.png" id="fig_vector_hnsw" title="Figure 4-11. Searching for the database entry that is closest to a given query vector in a HNSW index." class="w-full my-4" >}}
-###### Figure 4-11. Searching for the database entry that is closest to a given query vector in a HNSW index.
 Many popular vector databases implement IVF and HNSW indexes. Facebook’s Faiss library has many
 variations of each [^101],
--- a/content/en/ch5.md
+++ b/content/en/ch5.md
@ -243,7 +243,7 @@ will need to include the strings `userName`, `favoriteNumber`, and `interests` s
 ##### Example 5-2. Example record which we will encode in several binary formats in this chapter
-```
+```json
 {
 "userName": "Martin",
 "favoriteNumber": 1337,
@ -273,9 +273,8 @@ is worth the loss of human-readability.
 In the following sections we will see how we can do much better, and encode the same record in just
 32 bytes.
-![ddia 0502](/fig/ddia_0502.png)
+{{< figure src="/fig/ddia_0502.png" id="fig_encoding_messagepack" title="Figure 5-2. Example record ([Example 5-2](/en/ch5#fig_encoding_json)) encoded using MessagePack." class="w-full my-4" >}}
 ###### Figure 5-2. Example record ([Example 5-2](/en/ch5#fig_encoding_json)) encoded using MessagePack.
 ## Protocol Buffers
@ -306,9 +305,8 @@ types, but it does not support other restrictions on the possible values of fiel
 Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in
 [Figure 5-3](/en/ch5#fig_encoding_protobuf) [^14].
-![ddia 0503](/fig/ddia_0503.png)
+{{< figure src="/fig/ddia_0503.png" id="fig_encoding_protobuf" title="Figure 5-3. Example record encoded using Protocol Buffers." class="w-full my-4" >}}
 ###### Figure 5-3. Example record encoded using Protocol Buffers.
 Similarly to [Figure 5-2](/en/ch5#fig_encoding_messagepack), each field has a type annotation (to indicate whether it
 is a string, integer, etc.) and, where required, a length indication (such as the length of a
@ -416,9 +414,8 @@ prefix followed by UTF-8 bytes, but there’s nothing in the encoded data that t
 string. It could just as well be an integer, or something else entirely. An integer is encoded using
 a variable-length encoding.
-![ddia 0504](/fig/ddia_0504.png)
+{{< figure src="/fig/ddia_0504.png" id="fig_encoding_avro" title="Figure 5-4. Example record encoded using Avro." class="w-full my-4" >}}
 ###### Figure 5-4. Example record encoded using Avro.
 To parse the binary data, you go through the fields in the order that they appear in the schema and
 use the schema to tell you the datatype of each field. This means that the binary data can only be
@ -440,9 +437,8 @@ encoding, and the *reader’s schema*, which may be different. This is illustrat
 [Figure 5-5](/en/ch5#fig_encoding_avro_schemas). The reader’s schema defines the fields of each record that the
 application code is expecting, and their types.
-![ddia 0505](/fig/ddia_0505.png)
+{{< figure src="/fig/ddia_0505.png" id="fig_encoding_avro_schemas" title="Figure 5-5. In Protocol Buffers, encoding and decoding can use different versions of a schema. In Avro, decoding uses two schemas: the writer's schema must be identical to the one used for encoding, but the reader's schema can be an older or newer version." class="w-full my-4" >}}
 ###### Figure 5-5. In Protocol Buffers, encoding and decoding can use different versions of a schema. In Avro, decoding uses two schemas: the writer’s schema must be identical to the one used for encoding, but the reader’s schema can be an older or newer version.
 If the reader’s and writer’s schema are the same, decoding is easy. If they are different, Avro
 resolves the differences by looking at the writer’s schema and the reader’s schema side by side and
@ -458,9 +454,8 @@ schema, it is ignored. If the code reading the data expects some field, but the
 not contain a field of that name, it is filled in with a default value declared in the reader’s
 schema.
-![ddia 0506](/fig/ddia_0506.png)
+{{< figure src="/fig/ddia_0506.png" id="fig_encoding_avro_resolution" title="Figure 5-6. An Avro reader resolves differences between the writer's schema and the reader's schema." class="w-full my-4" >}}
 ###### Figure 5-6. An Avro reader resolves differences between the writer’s schema and the reader’s schema.
 ### Schema evolution rules
@ -515,11 +510,7 @@ Database with individually written records
 and then fetch the writer’s schema for that version number from the database. Using that writer’s
 schema, it can decode the rest of the record.
- Confluent’s schema registry for Apache Kafka
+ Confluent’s schema registry for Apache Kafka [^19] and LinkedIn’s Espresso [^20] work this way, for example.
- [^19]
- and LinkedIn’s Espresso
- [^20]
- work this way, for example.
 Sending records over a network connection
 : When two processes are communicating over a bidirectional network connection, they can negotiate
@ -528,8 +519,7 @@ Sending records over a network connection
 A database of schema versions is a useful thing to have in any case, since it acts as documentation
 and gives you a chance to check schema compatibility [^21].
-As the version number, you could use a simple incrementing integer, or you could use a hash of the
+As the version number, you could use a simple incrementing integer, or you could use a hash of the schema.
-schema.
 ### Dynamically generated schemas
@ -570,13 +560,11 @@ implement and simpler to use, they have grown to support a fairly wide range of
 languages.
 The ideas on which these encodings are based are by no means new. For example, they have a lot in
-common with ASN.1, a schema definition language that was first standardized in 1984
+common with ASN.1, a schema definition language that was first standardized in 1984 [^23] [^24].
-[^23] [^24].
 It was used to define various network protocols, and its binary encoding (DER) is still used to encode
 SSL certificates (X.509), for example [^25].
 ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers [^26].
-However, it’s also very complex and badly documented, so ASN.1
+However, it’s also very complex and badly documented, so ASN.1 is probably not a good choice for new applications.
-is probably not a good choice for new applications.
 Many data systems also implement some kind of proprietary binary encoding for their data. For
 example, most relational databases have a network protocol over which you can send queries to the
@ -666,8 +654,7 @@ schema, even though the underlying storage may contain records encoded with vari
 versions of the schema.
 More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or
-moving some data into a separate table—still require data to be rewritten, often at the application
+moving some data into a separate table—still require data to be rewritten, often at the application level [^27].
-level [^27].
 Maintaining forward and backward compatibility across such migrations is still a research problem [^28].
 ### Archival storage
@ -736,8 +723,7 @@ different contexts. For example:
 category includes public APIs provided by online services, such as credit card processing
 systems, or OAuth for shared access to user data.
-The most popular service design philosophy is REST, which builds upon the principles of HTTP
+The most popular service design philosophy is REST, which builds upon the principles of HTTP [^30] [^31].
-[^30] [^31].
 It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for
 cache control, authentication, and content type negotiation. An API designed according to the
 principles of REST is called *RESTful*.
@ -753,12 +739,11 @@ and receive Protocol Buffers.
 Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](/en/ch5#fig_open_api_def).
 The service definition allows developers to define service endpoints, documentation, versions, data
-models, and much more. gRPC definitions look similar, but are defined using Protocol Buffers service
+models, and much more. gRPC definitions look similar, but are defined using Protocol Buffers service definitions.
-definitions.
 ##### Example 5-3. Example OpenAPI service definition in YAML
-```
+```yaml
 openapi: 3.0.0
 info:
 title: Ping, Pong
@ -981,9 +966,8 @@ Different workflow engines use different names for tasks. Temporal, for example,
 *activity*. Others refer to tasks as *durable functions*. Though the names differ, the concepts are
 the same.
-![ddia 0507](/fig/ddia_0507.png)
+{{< figure src="/fig/ddia_0507.png" id="fig_encoding_workflow" title="Figure 5-7. Example of a workflow expressed using Business Process Model and Notation (BPMN), a graphical notation." class="w-full my-4" >}}
 ###### Figure 5-7. Example of a workflow expressed using Business Process Model and Notation (BPMN), a graphical notation.
 Workflows are run, or executed, by a *workflow engine*. Workflow engines determine when to run each
 task, on which machine a task must be run, what to do if a task fails (e.g., if the machine crashes
--- a/content/en/ch6.md
+++ b/content/en/ch6.md
@ -69,8 +69,7 @@ longer contain the same data. The most common solution is called *leader-based r
 *primary-backup*, or *active/passive*. It works as follows (see
 [Figure 6-1](/en/ch6#fig_replication_leader_follower)):
-1. One of the replicas is designated the *leader* (also known as *primary* or *source*
+1. One of the replicas is designated the *leader* (also known as *primary* or *source* [^2]).
-   [^2]).
   When clients want to write to the database, they must send their requests to the leader, which
   first writes the new data to its local storage.
 2. The other replicas are known as *followers* (*read replicas*, *secondaries*, or *hot standbys*).
@ -82,9 +81,7 @@ longer contain the same data. The most common solution is called *leader-based r
   followers. However, writes are only accepted on the leader (the followers are read-only from the
   client’s point of view).
-![ddia 0601](/fig/ddia_0601.png)
+{{< figure src="/fig/ddia_0601.png" id="fig_replication_leader_follower" title="Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas." class="w-full my-4" >}}
-###### Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas.
 If the database is sharded (see [Chapter 7](/en/ch7#ch_sharding)), each shard has one leader. Different shards may
 have their leaders on different nodes, but each shard must nevertheless have one leader node. In
@ -92,24 +89,16 @@ have their leaders on different nodes, but each shard must nevertheless have one
 multiple leaders for the same shard at the same time.
 Single-leader replication is very widely used. It’s a built-in feature of many relational databases,
-such as PostgreSQL, MySQL, Oracle Data Guard
+such as PostgreSQL, MySQL, Oracle Data Guard [^3], and SQL Server’s Always On Availability Groups [^4].
-[^3],
+It is also used in some document databases such as MongoDB and DynamoDB [^5],
-and SQL Server’s Always On Availability Groups
-[^4].
-It is also used in some document databases such as MongoDB and DynamoDB
-[^5],
 message brokers such as Kafka, replicated block devices such as DRBD, and some network filesystems.
-Many consensus algorithms such as Raft, which is used for replication in CockroachDB
+Many consensus algorithms such as Raft, which is used for replication in CockroachDB [^6], TiDB [^7],
-[^6],
+etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and automatically 
-TiDB [^7],
+elect a new leader if the old one fails (we will discuss consensus in more detail in [Chapter 10](/en/ch10#ch_consistency)).
-etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and
-automatically elect a new leader if the old one fails (we will discuss consensus in more detail in
-[Chapter 10](/en/ch10#ch_consistency)).
 > [!NOTE]
 > In older documents you may see the term *master–slave replication*. It means the same as
-> leader-based replication, but the term should be avoided as it is widely considered offensive
+> leader-based replication, but the term should be avoided as it is widely considered offensive [^8].
-> [^8].
 ## Synchronous Versus Asynchronous Replication
@ -123,9 +112,7 @@ shortly afterward, it is received by the leader. At some point, the leader forwa
 to the followers. Eventually, the leader notifies the client that the update was successful.
 [Figure 6-2](/en/ch6#fig_replication_sync_replication) shows one possible way how the timings could work out.
-![ddia 0602](/fig/ddia_0602.png)
+{{< figure src="/fig/ddia_0602.png" id="fig_replication_sync_replication" title="Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower." class="w-full my-4" >}}
-###### Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower.
 In the example of [Figure 6-2](/en/ch6#fig_replication_sync_replication), the replication to follower 1 is
 *synchronous*: the leader waits until follower 1 has confirmed that it received the write before
@ -168,8 +155,7 @@ client. However, a fully asynchronous configuration has the advantage that the l
 processing writes, even if all of its followers have fallen behind.
 Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless
-widely used, especially if there are many followers or if they are geographically distributed
+widely used, especially if there are many followers or if they are geographically distributed [^9].
-[^9].
 We will return to this issue in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag).
 ## Setting Up New Followers
@ -304,8 +290,7 @@ consists of the following steps:
   maintenance, this doesn’t apply.)
 2. *Choosing a new leader.* This could be done through an election process (where the leader is chosen by
   a majority of the remaining replicas), or a new leader could be appointed by a previously
-   established *controller node*
+   established *controller node* [^13].
-   [^13].
   The best candidate for leadership is usually the replica with the most up-to-date data changes
   from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader
   is a consensus problem, discussed in detail in [Chapter 10](/en/ch10#ch_consistency).
@ -324,9 +309,7 @@ Failover is fraught with things that can go wrong:
  in the meantime. The most common solution is for the old leader’s unreplicated writes to simply be
  discarded, which means that writes you believed to be committed actually weren’t durable after all.
 * Discarding writes is especially dangerous if other storage systems outside of the database need to
-  be coordinated with the database contents.
+  be coordinated with the database contents. For example, in one incident at GitHub [^14],
-  For example, in one incident at GitHub
-  [^14],
  an out-of-date MySQL follower
  was promoted to leader. The database used an autoincrementing counter to assign primary keys to
  new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some
@ -338,8 +321,7 @@ Failover is fraught with things that can go wrong:
  leaders accept writes, and there is no process for resolving conflicts (see
  [“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some
  systems have a mechanism to shut down one node if two leaders are detected. However, if this
-  mechanism is not carefully designed, you can end up with both nodes being shut down
+  mechanism is not carefully designed, you can end up with both nodes being shut down [^15].
-  [^15].
  Moreover, there is a risk that by the time the split brain is detected and the old node is shut
  down, it is already too late and data has already been corrupted.
 * What is the right timeout before the leader is declared dead? A longer timeout means a longer
@ -404,10 +386,8 @@ also known as *state machine replication*, and we will discuss the theory behind
 Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today,
 as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if
 there is any nondeterminism in a statement. VoltDB uses statement-based replication, and makes it
-safe by requiring transactions to be deterministic
+safe by requiring transactions to be deterministic [^16]. However, determinism can be hard to guarantee 
-[^16].
+in practice, so many databases prefer other replication methods.
-However, determinism can be hard to guarantee in practice, so many databases prefer other
-replication methods.
 ### Write-ahead log (WAL) shipping
@ -453,18 +433,15 @@ A transaction that modifies several rows generates several such log records, fol
 indicating that the transaction was committed. MySQL keeps a separate logical replication log,
 called the *binlog*, in addition to the WAL (when configured to use row-based replication).
 PostgreSQL implements logical replication by decoding the physical WAL into row
-insertion/update/delete events
+insertion/update/delete events [^19].
-[^19].
 Since a logical log is decoupled from the storage engine internals, it can more easily be kept
 backward compatible, allowing the leader and the follower to run different versions of the database
-software. This in turn enables upgrading to a new version with minimal downtime
+software. This in turn enables upgrading to a new version with minimal downtime [^20].
-[^20].
 A logical log format is also easier for external applications to parse. This aspect is useful if you want
 to send the contents of a database to an external system, such as a data warehouse for offline
-analysis, or for building custom indexes and caches
+analysis, or for building custom indexes and caches [^21].
-[^21].
 This technique is called *change data capture*, and we will return to it in [Link to Come].
 # Problems with Replication Lag
@ -526,9 +503,7 @@ With asynchronous replication, there is a problem, illustrated in
 new data may not yet have reached the replica. To the user, it looks as though the data they
 submitted was lost, so they will be understandably unhappy.
-![ddia 0603](/fig/ddia_0603.png)
+{{< figure src="/fig/ddia_0603.png" id="fig_replication_read_your_writes" title="Figure 6-3. A user makes a write, followed by a read from a stale replica. To prevent this anomaly, we need read-after-write consistency." class="w-full my-4" >}}
-###### Figure 6-3. A user makes a write, followed by a read from a stale replica. To prevent this anomaly, we need read-after-write consistency.
 In this situation, we need *read-after-write consistency*, also known as *read-your-writes consistency*
 [^23].
@ -617,9 +592,7 @@ hadn’t returned anything, because user 2345 probably wouldn’t know that user
 a comment. However, it’s very confusing for user 2345 if they first see user 1234’s comment appear,
 and then see it disappear again.
-![ddia 0604](/fig/ddia_0604.png)
+{{< figure src="/fig/ddia_0604.png" id="fig_replication_monotonic_reads" title="Figure 6-4. A user first reads from a fresh replica, then from a stale replica. Time appears to go backward. To prevent this anomaly, we need monotonic reads." class="w-full my-4" >}}
-###### Figure 6-4. A user first reads from a fresh replica, then from a stale replica. Time appears to go backward. To prevent this anomaly, we need monotonic reads.
 *Monotonic reads* [^22] is a guarantee that this
 kind of anomaly does not happen. It’s a lesser guarantee than strong consistency, but a stronger
@ -660,9 +633,7 @@ To the observer it looks as though Mrs. Cake is answering the question before Mr
 it. Such psychic powers are impressive, but very confusing
 [^27].
-![ddia 0605](/fig/ddia_0605.png)
+{{< figure src="/fig/ddia_0605.png" id="fig_replication_consistent_prefix" title="Figure 6-5. If some shards are replicated slower than others, an observer may see the answer before they see the question." class="w-full my-4" >}}
-###### Figure 6-5. If some shards are replicated slower than others, an observer may see the answer before they see the question.
 Preventing this kind of anomaly requires another type of guarantee: *consistent prefix reads*
 [^22]. This guarantee says that if a sequence of
@ -757,9 +728,7 @@ regular leader–follower replication is used (with followers maybe in a differe
 from the leader); between regions, each region’s leader replicates its changes to the leaders in
 other regions.
-![ddia 0606](/fig/ddia_0606.png)
+{{< figure src="/fig/ddia_0606.png" id="fig_replication_multi_dc" title="Figure 6-6. Multi-leader replication across multiple regions." class="w-full my-4" >}}
-###### Figure 6-6. Multi-leader replication across multiple regions.
 Let’s compare how the single-leader and multi-leader configurations fare in a multi-region
 deployment:
@ -825,9 +794,7 @@ only one plausible topology: leader 1 must send all of its writes to leader 2, a
 more than two leaders, various different topologies are possible. Some examples are illustrated in
 [Figure 6-7](/en/ch6#fig_replication_topologies).
-![ddia 0607](/fig/ddia_0607.png)
+{{< figure src="/fig/ddia_0607.png" id="fig_replication_topologies" title="Figure 6-7. Three example topologies in which multi-leader replication can be set up." class="w-full my-4" >}}
-###### Figure 6-7. Three example topologies in which multi-leader replication can be set up.
 The most general topology is *all-to-all*, shown in
 [Figure 6-7](/en/ch6#fig_replication_topologies)(c),
@ -862,9 +829,7 @@ On the other hand, all-to-all topologies can have issues too. In particular, som
 be faster than others (e.g., due to network congestion), with the result that some replication
 messages may “overtake” others, as illustrated in [Figure 6-8](/en/ch6#fig_replication_causality).
-![ddia 0608](/fig/ddia_0608.png)
+{{< figure src="/fig/ddia_0608.png" id="fig_replication_causality" title="Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas." class="w-full my-4" >}}
-###### Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas.
 In [Figure 6-8](/en/ch6#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
 updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may
@ -958,12 +923,10 @@ approach has a number of advantages:
  service calls in application code. Every service call requires error handling, as discussed in
  [“The problems with remote procedure calls (RPCs)”](/en/ch5#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user
  interface needs to somehow reflect that error. A sync engine allows the app to perform reads and
-  writes on local data, which almost never fails, leading to a more declarative programming style
+  writes on local data, which almost never fails, leading to a more declarative programming style [^41].
-  [^41].
 * In order to display edits from other users in real-time, you need to receive notifications of
  those edits and efficiently update the user interface accordingly. A sync engine combined with a
-  *reactive programming* model is a good way of implementing this
+  *reactive programming* model is a good way of implementing this [^42].
-  [^42].
 Sync engines work best when all the data that the user may need is downloaded in advance and stored
 persistently on the client. This means that the data is available for offline access when needed,
@ -972,8 +935,7 @@ of data. For example, downloading all the files that the user themselves created
 (one user generally doesn’t generate that much data), but downloading the entire catalog of an
 e-commerce website probably doesn’t make sense.
-The sync engine was pioneered by Lotus Notes in the 1980s
+The sync engine was pioneered by Lotus Notes in the 1980s [^43]
-[^43]
 (without using that term), and sync for specific apps such as calendars has also existed for a long
 time. Today there are a number of general-purpose sync engines, some of which use a proprietary
 backend service (e.g., Google Firestore, Realm, or Ditto), and some have an open source backend,
@ -982,8 +944,7 @@ making them suitable for creating local-first software (e.g., PouchDB/CouchDB, A
 Multiplayer video games have a similar need to respond immediately to the user’s local actions, and
 reconcile them with other players’ actions received asynchronously over the network. In game
 development jargon the equivalent of a sync engine is called *netcode*. The techniques used in
-netcode are quite specific to the requirements of games
+netcode are quite specific to the requirements of games [^44], and don’t directly
-[^44], and don’t directly
 carry over to other types of software, so we won’t consider them further in this book.
 ## Dealing with Conflicting Writes
@ -998,9 +959,7 @@ independently changes the title from A to C. Each user’s change is successfull
 local leader. However, when the changes are asynchronously replicated, a conflict is detected.
 This problem does not occur in a single-leader database.
-![ddia 0609](/fig/ddia_0609.png)
+{{< figure src="/fig/ddia_0609.png" id="fig_replication_write_conflict" title="Figure 6-9. A write conflict caused by two leaders concurrently updating the same record." class="w-full my-4" >}}
-###### Figure 6-9. A write conflict caused by two leaders concurrently updating the same record.
 > [!NOTE]
 > We say that the two writes in [Figure 6-9](/en/ch6#fig_replication_write_conflict) are *concurrent* because neither
@ -1114,9 +1073,8 @@ suffers from a number of problems:
  not careful to order them consistently. When the conflict between “B/C” and “C/B” is merged, it
  may result in “B/C/C/B” or something similarly surprising.
-![ddia 0610](/fig/ddia_0610.png)
+{{< figure src="/fig/ddia_0610.png" id="fig_replication_amazon_anomaly" title="Figure 6-10. Example of Amazon's shopping cart anomaly: if conflicts on a shopping cart are merged by taking the union, deleted items may reappear." class="w-full my-4" >}}
 ###### Figure 6-10. Example of Amazon’s shopping cart anomaly: if conflicts on a shopping cart are merged by taking the union, deleted items may reappear.
 ### Automatic conflict resolution
@ -1166,9 +1124,8 @@ text. Assume you have two replicas that both start off with the text “ice”.
 letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make
 “ice!”.
-![ddia 0611](/fig/ddia_0611.png)
+{{< figure src="/fig/ddia_0611.png" id="fig_replication_ot_crdt" title="Figure 6-11. How two concurrent insertions into a string are merged by OT and a CRDT respectively." class="w-full my-4" >}}
 ###### Figure 6-11. How two concurrent insertions into a string are merged by OT and a CRDT respectively.
 The merged result “nice!” is achieved differently by both types of algorithms:
@ -1192,15 +1149,11 @@ CRDT
 There are many algorithms based on variations of these ideas. Lists/arrays can be supported
 similarly, using list elements instead of characters, and other datatypes such as key-value maps can
 be added quite easily. There are some performance and functionality trade-offs between OT and CRDTs,
-but it’s possible to combine the advantages of CRDTs and OT in one algorithm
+but it’s possible to combine the advantages of CRDTs and OT in one algorithm [^48].
-[^48].
-OT is most often used for real-time collaborative editing of text, e.g. in Google Docs
+OT is most often used for real-time collaborative editing of text, e.g. in Google Docs [^32], whereas CRDTs can be found in
-[^32], whereas CRDTs can be found in
+distributed databases such as Redis Enterprise, Riak, and Azure Cosmos DB [^49].
-distributed databases such as Redis Enterprise, Riak, and Azure Cosmos DB
+Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge or Yjs) and with OT (e.g., ShareDB).
-[^49].
-Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge or Yjs) and with OT
-(e.g., ShareDB).
 ### What is a conflict?
@ -1233,8 +1186,7 @@ Some data storage systems take a different approach, abandoning the concept of a
 allowing any replica to directly accept writes from clients. Some of the earliest replicated data
 systems were leaderless [^1] [^50], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became
 a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in
-2007 [^45].
+2007 [^45]. Riak, Cassandra, and ScyllaDB are open source datastores with leaderless replication models inspired
-Riak, Cassandra, and ScyllaDB are open source datastores with leaderless replication models inspired
 by Dynamo, so this kind of database is also known as *Dynamo-style*.
 > [!NOTE]
@ -1261,9 +1213,8 @@ replica misses it. Let’s say that it’s sufficient for two out of three repli
 acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be
 successful. The client simply ignores the fact that one of the replicas missed the write.
-![ddia 0612](/fig/ddia_0612.png)
+{{< figure src="/fig/ddia_0612.png" id="fig_replication_quorum_node_outage" title="Figure 6-12. A quorum write, quorum read, and read repair after a node outage." class="w-full my-4" >}}
 ###### Figure 6-12. A quorum write, quorum read, and read repair after a node outage.
 Now imagine that the unavailable node comes back online, and clients start reading from it. Any
 writes that happened while the node was down are missing from that node. Thus, if you read from that
@ -1352,9 +1303,8 @@ Normally, reads and writes are always sent to all *n* replicas in parallel. The
 *r* determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success
 before we consider the read or write to be successful.
-![ddia 0613](/fig/ddia_0613.png)
+{{< figure src="/fig/ddia_0613.png" id="fig_replication_quorum_overlap" title="Figure 6-13. If *w* + *r* > *n*, at least one of the *r* replicas you read from must have seen the most recent successful write." class="w-full my-4" >}}
 ###### Figure 6-13. If *w* + *r* > *n*, at least one of the *r* replicas you read from must have seen the most recent successful write.
 If fewer than the required *w* or *r* nodes are available, writes or reads return an error. A node
 could be unavailable for many reasons: because the node is down (crashed, powered down), due to an
@ -1404,8 +1354,7 @@ properties can be confusing. Some scenarios include:
 * If a write succeeded on some replicas but failed on others (for example because the disks on some
  nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the
  replicas where it succeeded. This means that if a write was reported as failed, subsequent reads
-  may or may not return the value from that write
+  may or may not return the value from that write [^52].
-  [^52].
 * If the database uses timestamps from a real-time clock to determine which write is newer (as
  Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a
  faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww).
@ -1418,8 +1367,7 @@ properties can be confusing. Some scenarios include:
 Thus, although quorums appear to guarantee that a read returns the latest written value, in practice
 it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate
 eventual consistency. The parameters *w* and *r* allow you to adjust the probability of stale values
-being read [^53],
+being read [^53], but it’s wise to not take them as absolute guarantees.
-but it’s wise to not take them as absolute guarantees.
 ### Monitoring staleness
@ -1436,8 +1384,7 @@ current position, you can measure the amount of replication lag.
 However, in systems with leaderless replication, there is no fixed order in which writes are
 applied, which makes monitoring more difficult. The number of hints that a replica stores for
-handoff can be one measure of system health, but it’s difficult to interpret usefully
+handoff can be one measure of system health, but it’s difficult to interpret usefully [^54].
-[^54].
 Eventual consistency is a deliberately vague guarantee, but for operability it’s important to be
 able to quantify “eventual.”
@ -1465,15 +1412,12 @@ A big advantage of a leaderless architecture is that it is more resilient agains
 Because there is no failover, and requests go to multiple replicas in parallel anyway, one replica
 becoming slow or unavailable has very little impact on response times: the client simply uses the
 responses from the other replicas that are faster to respond. Using the fastest responses is called
-*request hedging*, and it can significantly reduce tail latency
+*request hedging*, and it can significantly reduce tail latency [^55]).
-[^55]).
 At its core, the resilience of a leaderless system comes from the fact that it doesn’t distinguish
 between the normal case and the failure case. This is especially helpful when handling so-called
 *gray failures*, in which a node isn’t completely down, but running in a degraded state where it is
-unusually slow to handle requests
+unusually slow to handle requests [^56], or when a node is simply overloaded (for example, if a node has been offline for a while, recovery
-[^56],
-or when a node is simply overloaded (for example, if a node has been offline for a while, recovery
 via hinted handoff can cause a lot of additional load). A leader-based system has to decide whether
 the situation is bad enough to warrant a failover (which can itself cause further disruption),
 whereas in a leaderless system that question doesn’t even arise.
@ -1493,8 +1437,7 @@ That said, leaderless systems can have performance problems as well:
 * A large-scale network interruption that disconnects a client from a large number of replicas can
  make it impossible to form a quorum. Some leaderless databases offer a configuration option that
  allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that
-  key (Riak and Dynamo call this a *sloppy quorum*
+  key (Riak and Dynamo call this a *sloppy quorum* [^45];
-  [^45];
  Cassandra and ScyllaDB call it *consistency level ANY*). There is no guarantee that subsequent
  reads will see the written value, but depending on the application it may still be better than
  having the write fail.
@ -1539,9 +1482,8 @@ A and B, simultaneously writing to a key *X* in a three-node datastore:
 * Node 2 first receives the write from A, then the write from B.
 * Node 3 first receives the write from B, then the write from A.
-![ddia 0614](/fig/ddia_0614.png)
+{{< figure src="/fig/ddia_0614.png" id="fig_replication_concurrency" title="Figure 6-14. Concurrent writes in a Dynamo-style datastore: there is no well-defined ordering." class="w-full my-4" >}}
 ###### Figure 6-14. Concurrent writes in a Dynamo-style datastore: there is no well-defined ordering.
 If each node simply overwrote the value for a key whenever it received a write request from a
 client, the nodes would become permanently inconsistent, as shown by the final *get* request in
@ -1642,9 +1584,8 @@ empty. Between them, the clients make five writes to the database:
   `[milk, flour]` (note that `[eggs]` was already overwritten in the last step) but is concurrent
   with `[eggs, milk, ham]`, so the server keeps those two concurrent values.
-![ddia 0615](/fig/ddia_0615.png)
+{{< figure src="/fig/ddia_0615.png" id="fig_replication_causality_single" title="Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart." class="w-full my-4" >}}
 ###### Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart.
 The dataflow between the operations in [Figure 6-15](/en/ch6#fig_replication_causality_single) is illustrated
 graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The arrows indicate which operation
@ -1653,9 +1594,8 @@ graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The
 on the server, since there is always another operation going on concurrently. But old versions of
 the value do get overwritten eventually, and no writes are lost.
-![ddia 0616](/fig/ddia_0616.png)
+{{< figure link="#fig_replication_causality_single" src="/fig/ddia_0616.png" id="fig_replication_causal_dependencies" title="Figure 6-16. Graph of causal dependencies in Figure 6-15." class="w-full my-4" >}}
 ###### Figure 6-16. Graph of causal dependencies in [Figure 6-15](/en/ch6#fig_replication_causality_single).
 Note that the server can determine whether two operations are concurrent by looking at the version
 numbers—it does not need to interpret the value itself (so the value could be any data
--- a/content/en/ch7.md
+++ b/content/en/ch7.md
@ -32,9 +32,7 @@ of sharding and replication can look like [Figure 7-1](/en/ch7#fig_sharding_rep
 leader is assigned to one node, and its followers are assigned to other nodes. Each node may be the
 leader for some shards and a follower for other shards, but each shard still only has one leader.
-![ddia 0701](/fig/ddia_0701.png)
+{{< figure src="/fig/ddia_0701.png" id="fig_sharding_replicas" title="Figure 7-1. Combining replication and sharding: each node acts as leader for some shards and follower for other shards." class="w-full my-4" >}}
-###### Figure 7-1. Combining replication and sharding: each node acts as leader for some shards and follower for other shards.
 Everything we discussed in [Chapter 6](/en/ch6#ch_replication) about replication of databases applies equally to
 replication of shards. Since the choice of sharding scheme is mostly independent of the choice of
@ -50,8 +48,7 @@ Couchbase, to name just a few.
 Some databases treat partitions and shards as two distinct concepts. For example, in PostgreSQL,
 partitioning is a way of splitting a large table into several files that are stored on the same
 machine (which has several advantages, such as making it very fast to delete an entire partition),
-whereas sharding splits a dataset across multiple machines
+whereas sharding splits a dataset across multiple machines [^1] [^2].
-[^1] [^2].
 In many other systems, partitioning is just another word for sharding.
 While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to
@ -136,31 +133,26 @@ Cell-based architecture
 application code. In a *cell-based architecture*, the services and storage for a particular set of
 tenants are grouped into a self-contained *cell*, and different cells are set up such that they
 can run largely independently from each other. This approach provides *fault isolation*: that is,
- a fault in one cell remains limited to that cell, and tenants in other cells are not affected
+ a fault in one cell remains limited to that cell, and tenants in other cells are not affected [^8].
- [^8].
 Per-tenant backup and restore
 : Backing up each tenant’s shard separately makes it possible to restore a tenant’s state from a
 backup without affecting other tenants, which can be useful in case the tenant accidentally
- deletes or overwrites important data
+ deletes or overwrites important data [^9].
- [^9].
 Regulatory compliance
 : Data privacy regulation such as the GDPR gives individuals the right to access and delete all data
 stored about them. If each person’s data is stored in a separate shard, this translates into
- simple data export and deletion operations on their shard
+ simple data export and deletion operations on their shard [^10].
- [^10].
 Data residence
 : If a particular tenant’s data needs to be stored in a particular jurisdiction in order to comply
- with data residency laws, a region-aware database can allow you to assign that tenant’s shard to a
+ with data residency laws, a region-aware database can allow you to assign that tenant’s shard to a particular region.
- particular region.
 Gradual schema rollout
 : Schema migrations (previously discussed in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)) can be rolled
 out gradually, one tenant at a time. This reduces risk, as you can detect problems before they
- affect all tenants, but it can be difficult to do transactionally
+ affect all tenants, but it can be difficult to do transactionally [^11].
 [^11].
 The main challenges around using sharding for multitenancy are:
@ -207,9 +199,7 @@ to look up the entry for a particular title, you can easily determine which shar
 entry by finding the volume whose key range contains the title you’re looking for, and thus pick the
 correct book off the shelf.
-![ddia 0702](/fig/ddia_0702.png)
+{{< figure src="/fig/ddia_0702.png" id="fig_sharding_encyclopedia" title="Figure 7-2. A print encyclopedia is sharded by key range." class="w-full my-4" >}}
 ###### Figure 7-2. A print encyclopedia is sharded by key range.
 The ranges of keys are not necessarily evenly spaced, because your data may not be evenly
 distributed. For example, in [Figure 7-2](/en/ch7#fig_sharding_encyclopedia), volume 1 contains words starting with A
@ -307,9 +297,7 @@ have three nodes and add a fourth. Before the rebalancing, node 0 stored the key
 0, 3, 6, 9, and so on. After adding the fourth node, the key with hash 3 has moved to node 3, the
 key with hash 6 has moved to node 2, the key with hash 9 has moved to node 1, and so on.
-![ddia 0703](/fig/ddia_0703.png)
+{{< figure src="/fig/ddia_0703.png" id="fig_sharding_hash_mod_n" title="Figure 7-3. Assigning keys to nodes by hashing the key and taking it modulo the number of nodes. Changing the number of nodes results in many keys moving from one node to another." class="w-full my-4" >}}
 ###### Figure 7-3. Assigning keys to nodes by hashing the key and taking it modulo the number of nodes. Changing the number of nodes results in many keys moving from one node to another.
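As a rough sketch of the key movement shown in Figure 7-3, the following assumes for illustration that the hash values are simply the integers 0 through 11. The names `node_for`, `before`, `after`, and `moved` are hypothetical and not taken from any particular database.

```python
def node_for(hash_value: int, num_nodes: int) -> int:
    """Assign a hash value to a node by taking it modulo the number of nodes."""
    return hash_value % num_nodes


# Illustrative assumption: the keys hash to the integers 0..11.
hashes = range(12)

before = {h: node_for(h, 3) for h in hashes}  # cluster with three nodes
after = {h: node_for(h, 4) for h in hashes}   # after adding a fourth node

# Hash values whose node assignment changed when the node was added.
moved = [h for h in hashes if before[h] != after[h]]

print(f"{len(moved)} of {len(hashes)} hash values moved: {moved}")
```

In this small range, only the hash values 0, 1, and 2 stay where they were. Everything else is relocated, including 3, 6, and 9 as described above.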
 The *mod N* function is easy to compute, but it leads to very inefficient rebalancing because there
 is a lot of unnecessary movement of records from one node to another. We need an approach that
@ -328,9 +316,7 @@ nodes to the new node until they are fairly distributed once again. This process
 [Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in
 reverse.
-![ddia 0704](/fig/ddia_0704.png)
+{{< figure src="/fig/ddia_0704.png" id="fig_sharding_rebalance_fixed" title="Figure 7-4. Adding a new node to a database cluster with multiple shards per node." class="w-full my-4" >}}
 ###### Figure 7-4. Adding a new node to a database cluster with multiple shards per node.
 In this model, only entire shards are moved between nodes, which is cheaper than splitting shards.
 The number of shards does not change, nor does the assignment of keys to shards. The only thing that
@ -377,9 +363,7 @@ Even if the input keys are very similar (e.g., consecutive timestamps), their ha
 distributed across that range. We can then assign a range of hash values to each shard: for example,
 values between 0 and 16,383 to shard 0, values between 16,384 and 32,767 to shard 1, and so on.
-![ddia 0705](/fig/ddia_0705.png)
+{{< figure src="/fig/ddia_0705.png" id="fig_sharding_hash_range" title="Figure 7-5. Assigning a contiguous range of hash values to each shard." class="w-full my-4" >}}
 ###### Figure 7-5. Assigning a contiguous range of hash values to each shard.
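The range assignment in Figure 7-5 can be sketched as a lookup over sorted boundaries. This is a minimal illustration and makes two assumptions that go beyond the text: the hash space is 16 bits wide (0 to 65,535), and it is split into four equal ranges. `SHARD_BOUNDARIES` and `shard_for_hash` are hypothetical names.

```python
from bisect import bisect_right

# Assumption for illustration: a 16-bit hash space split into four equal
# ranges. Each entry is the first hash value that belongs to the next shard.
SHARD_BOUNDARIES = [16_384, 32_768, 49_152]
MAX_HASH = 65_535


def shard_for_hash(hash_value: int) -> int:
    """Return the index of the shard whose hash range contains hash_value."""
    if not 0 <= hash_value <= MAX_HASH:
        raise ValueError("hash value outside the assumed 16-bit range")
    # bisect_right counts how many boundaries are <= hash_value,
    # which is exactly the index of the containing range.
    return bisect_right(SHARD_BOUNDARIES, hash_value)


print(shard_for_hash(16_383), shard_for_hash(16_384))
```

Keeping the boundaries in a sorted list means that splitting a shard amounts to inserting one more boundary, without changing how any other range is looked up.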
 Like with key-range sharding, a shard in hash-range sharding can be split when it becomes too big or
 too heavily loaded. This is still an expensive operation, but it can happen as needed, so the number
@ -407,12 +391,9 @@ Cassandra and ScyllaDB use a variant of this approach that is illustrated in
 to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
 per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between
 those ranges. This means some ranges are bigger than others, but by having multiple ranges per node
-those imbalances tend to even out
+those imbalances tend to even out [^15] [^18].
 [^15] [^18].
-![ddia 0706](/fig/ddia_0706.png)
+{{< figure src="/fig/ddia_0706.png" id="fig_sharding_cassandra" title="Figure 7-6. Cassandra and ScyllaDB split the range of possible hash values (here 0–1023) into contiguous ranges with random boundaries, and assign several ranges to each node." class="w-full my-4" >}}
 ###### Figure 7-6. Cassandra and ScyllaDB split the range of possible hash values (here 0–1023) into contiguous ranges with random boundaries, and assign several ranges to each node.
 When nodes are added or removed, range boundaries are added and removed, and shards are split or
 merged accordingly [^19].
@ -433,13 +414,9 @@ Note that *consistent* here has nothing to do with replica consistency (see [Cha
 ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describes the tendency of a key to stay in
 the same shard as much as possible.
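One algorithm with this property is highest random weight (rendezvous) hashing. The sketch below is a simplified illustration rather than the implementation used by any specific database. The node names, the key names, and the choice of SHA-256 as the weight function are all assumptions.

```python
import hashlib


def weight(node: str, key: str) -> int:
    """Pseudo-random weight for a (node, key) pair, derived from a hash."""
    digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def owner(key: str, nodes: list[str]) -> str:
    """A key is owned by whichever node has the highest weight for it."""
    return max(nodes, key=lambda node: weight(node, key))


keys = [f"key-{i}" for i in range(1000)]       # hypothetical keys
old_nodes = ["node0", "node1", "node2"]        # hypothetical node names
new_nodes = old_nodes + ["node3"]

# Keys whose owner changes when node3 joins the cluster.
moved = [k for k in keys if owner(k, old_nodes) != owner(k, new_nodes)]

# Every key that moved went to the new node; no key moved between old nodes.
assert all(owner(k, new_nodes) == "node3" for k in moved)
print(f"{len(moved)} of {len(keys)} keys moved, all of them to node3")
```

Because the weights for the existing nodes do not change when a node is added, a key can only change owner if the new node has the highest weight for it. Keys never shuffle between the old nodes.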
-The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of
+The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of consistent hashing [^20],
-consistent hashing [^20],
+but several other consistent hashing algorithms have also been proposed [^21], such as *highest random weight*, also known as *rendezvous hashing* [^22],
-but several other consistent hashing algorithms have also been proposed [^21],
+and *jump consistent hash* [^23].
 such as *highest random weight*, also known as *rendezvous hashing*
 [^22],
 and *jump consistent hash*
 [^23].
 With Cassandra’s algorithm, if one node is added, a small number of existing shards are split into
 sub-ranges; on the other hand, with rendezvous and jump consistent hashes, the new node is assigned
 individual keys that were previously scattered across all of the other nodes. Which one is
@ -458,8 +435,7 @@ of activity when they do something [^24].
 This event can result in a large volume of reads and writes to the same key (where the partition key
 is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on).
-In such situations, a more flexible sharding policy is required
+In such situations, a more flexible sharding policy is required [^25] [^26].
 [^25] [^26].
 A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put
 an individual hot key in a shard of its own, and perhaps even to assign it a dedicated machine [^27].
@ -482,9 +458,7 @@ likely to calm down again. Moreover, some keys may be hot for writes while other
 necessitating different strategies for handling them.
 Some systems (especially cloud services designed for large scale) have automated approaches for
-dealing with hot shards; for example, Amazon calls it *heat management*
+dealing with hot shards; for example, Amazon calls it *heat management* [^28] or *adaptive capacity* [^17].
 [^28]
 or *adaptive capacity* [^17].
 The details of how these systems work go beyond the scope of this book.
 ## Operations: Automatic or Manual Rebalancing
@ -501,8 +475,7 @@ effect.
 Fully automated rebalancing can be convenient, because there is less operational work to do for
 normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud
 databases such as DynamoDB are promoted as being able to automatically add and remove shards to
-adapt to big increases or decreases of load within a matter of minutes
+adapt to big increases or decreases of load within a matter of minutes [^17] [^29].
 [^17] [^29].
 However, automatic shard management can also be unpredictable. Rebalancing is an expensive
 operation, because it requires rerouting requests and moving a large amount of data from one node to
@ -548,9 +521,7 @@ in [Figure 7-7](/en/ch7#fig_sharding_routing)):
 3. Require that clients be aware of the sharding and the assignment of shards to nodes. In this
 case, a client can connect directly to the appropriate node, without any intermediary.
-![ddia 0707](/fig/ddia_0707.png)
+{{< figure src="/fig/ddia_0707.png" id="fig_sharding_routing" title="Figure 7-7. Three different ways of routing a request to the right node." class="w-full my-4" >}}
 ###### Figure 7-7. Three different ways of routing a request to the right node.
 In all cases, there are some key problems:
@ -573,9 +544,7 @@ to nodes. Other actors, such as the routing tier or the sharding-aware client, c
 information in ZooKeeper. Whenever a shard changes ownership, or a node is added or removed,
 ZooKeeper notifies the routing tier so that it can keep its routing information up to date.
-![ddia 0708](/fig/ddia_0708.png)
+{{< figure src="/fig/ddia_0708.png" id="fig_sharding_zookeeper" title="Figure 7-8. Using ZooKeeper to keep track of assignment of shards to nodes." class="w-full my-4" >}}
 ###### Figure 7-8. Using ZooKeeper to keep track of assignment of shards to nodes.
 For example, HBase and SolrCloud use ZooKeeper to manage shard assignment, and Kubernetes uses etcd
 to keep track of which service instance is running where. MongoDB has a similar architecture, but it
@ -631,9 +600,7 @@ indexing automatically. For example, whenever a red car is added to the database
 automatically adds its ID to the list of IDs for the index entry `color:red`. As discussed in
 [Chapter 4](/en/ch4#ch_storage), that list of IDs is also called a *postings list*.
-![ddia 0709](/fig/ddia_0709.png)
+{{< figure src="/fig/ddia_0709.png" id="fig_sharding_local_secondary" title="Figure 7-9. Local secondary indexes: each shard indexes only the records within its own shard." class="w-full my-4" >}}
 ###### Figure 7-9. Local secondary indexes: each shard indexes only the records within its own shard.
 ###### Warning
@ -648,8 +615,7 @@ indexes, covering only the records in that shard. It doesn’t care what data is
 shards. Whenever you write to the database—to add, remove, or update a record—you only need to
 deal with the shard that contains the record that you are writing. For that reason, this type of
 secondary index is known as a *local index*. In an information retrieval context it is also known as
-a *document-partitioned index*
+a *document-partitioned index* [^30].
 [^30].
 When reading from a local secondary index, if you already know the partition key of the record
 you’re looking for, you can just perform the search on the appropriate shard. Moreover, if you only
@ -666,11 +632,8 @@ expensive. Even if you query the shards in parallel, it is prone to tail latency
 shards lets you store more data, but it doesn’t increase your query throughput if every shard has to
 process every query anyway.
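The write and read paths of a local secondary index can be sketched as follows. This is a simplified in-memory illustration that only handles inserts. The `Shard` class, the sharding of records by `record_id % 2`, and the sample cars are all assumptions made for the example.

```python
from collections import defaultdict


class Shard:
    """One shard holding its own records and its own local secondary index."""

    def __init__(self) -> None:
        self.records: dict[int, dict[str, str]] = {}
        # Index entry such as "color:red" -> set of record IDs (postings list).
        self.index: dict[str, set[int]] = defaultdict(set)

    def put(self, record_id: int, record: dict[str, str]) -> None:
        self.records[record_id] = record
        for field, value in record.items():
            self.index[f"{field}:{value}"].add(record_id)


shards = [Shard(), Shard()]


def insert(record_id: int, record: dict[str, str]) -> None:
    # A write touches only the one shard that owns the record.
    shards[record_id % len(shards)].put(record_id, record)


def find(term: str) -> list[int]:
    # A read by indexed value has to query every shard and merge the results.
    matches: set[int] = set()
    for shard in shards:
        matches |= shard.index.get(term, set())
    return sorted(matches)


insert(1, {"color": "red", "make": "Honda"})
insert(2, {"color": "silver", "make": "Audi"})
insert(4, {"color": "red", "make": "Ford"})
print(find("color:red"))
```

The loop in `find` is the reason that adding shards does not increase query throughput here: each query still does work on every shard.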
-Nevertheless, local secondary indexes are widely used [^31]:
+Nevertheless, local secondary indexes are widely used [^31]: for example, MongoDB, Riak, Cassandra [^32], Elasticsearch [^33], 
-for example, MongoDB, Riak, Cassandra [^32],
+SolrCloud, and VoltDB [^34] all use local secondary indexes.
 Elasticsearch [^33], SolrCloud,
 and VoltDB [^34]
 all use local secondary indexes.
 ## Global Secondary Indexes
@ -685,9 +648,7 @@ with the letters *a* to *r* appear in shard 0 and colors starting with *s* to *z
 The index on the make of car is partitioned similarly (with the shard boundary being between *f* and
 *h*).
-![ddia 0710](/fig/ddia_0710.png)
+{{< figure src="/fig/ddia_0710.png" id="fig_sharding_global_secondary" title="Figure 7-10. A global secondary index reflects data from all shards, and is itself sharded by the indexed value." class="w-full my-4" >}}
 ###### Figure 7-10. A global secondary index reflects data from all shards, and is itself sharded by the indexed value.
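A minimal sketch of the term-partitioned layout in Figure 7-10, covering only the color index. It assumes the boundary described above (colors starting with *a* to *r* in shard 0, *s* to *z* in shard 1). The function names and sample car IDs are hypothetical.

```python
from collections import defaultdict

# Two index shards; each maps an index entry such as "color:red" to car IDs.
index_shards: list[dict[str, set[int]]] = [defaultdict(set), defaultdict(set)]


def index_shard_for(color: str) -> int:
    """Colors starting with a-r go to index shard 0, s-z to index shard 1."""
    return 0 if color[0].lower() <= "r" else 1


def index_car(car_id: int, color: str) -> None:
    # The index shard is chosen by the indexed value, not by the car's ID.
    index_shards[index_shard_for(color)][f"color:{color}"].add(car_id)


def find_by_color(color: str) -> list[int]:
    # A lookup for one color only needs to contact a single index shard.
    shard = index_shards[index_shard_for(color)]
    return sorted(shard.get(f"color:{color}", set()))


index_car(1, "red")
index_car(2, "silver")
index_car(3, "red")
print(find_by_color("red"))
```

In contrast to a local index, a read for a single value goes to one index shard, while a write may have to update an index shard different from the one that stores the record itself.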
 This kind of index is also called *term-partitioned*
 [^30]:
--- a/content/en/ch8.md
+++ b/content/en/ch8.md
@ -82,10 +82,7 @@ much weaker set of guarantees than had previously been understood.
 The hype around NoSQL distributed databases led to a popular belief that transactions were
 fundamentally unscalable, and that any large-scale system would have to abandon transactions in
 order to maintain good performance and high availability. More recently, that belief has turned out
-to be wrong. So-called “NewSQL” databases such as CockroachDB [^5],
+to be wrong. So-called “NewSQL” databases such as CockroachDB [^5], TiDB [^6], Spanner [^7], FoundationDB [^8],
 TiDB [^6],
 Spanner [^7],
 FoundationDB [^8],
 and Yugabyte have shown that transactional systems can scale to large data volumes and high
 throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide
 strong ACID guarantees at scale.
@ -99,19 +96,16 @@ operation and in various extreme (but realistic) circumstances.
 The safety guarantees provided by transactions are often described by the well-known acronym *ACID*,
 which stands for *Atomicity*, *Consistency*, *Isolation*, and *Durability*. It was coined in 1983 by
-Theo Härder and Andreas Reuter [^9]
+Theo Härder and Andreas Reuter [^9] in an effort to establish precise terminology for fault-tolerance mechanisms in databases.
 in an effort to establish precise terminology for fault-tolerance mechanisms in databases.
 However, in practice, one database’s implementation of ACID does not equal another’s implementation.
-For example, as we shall see, there is a lot of ambiguity around the meaning of *isolation*
+For example, as we shall see, there is a lot of ambiguity around the meaning of *isolation* [^10].
 [^10].
 The high-level idea is sound, but the devil is in the details. Today, when a system claims to be
 “ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately
 become mostly a marketing term.
 (Systems that do not meet the ACID criteria are sometimes called *BASE*, which stands for
-*Basically Available*, *Soft state*, and *Eventual consistency*
+*Basically Available*, *Soft state*, and *Eventual consistency* [^11].
 [^11].
 This is even more vague than the definition of ACID. It seems that the only sensible definition of
 BASE is “not ACID”; i.e., it can mean almost anything you want.)
@ -199,9 +193,8 @@ current value, add 1, and write the new value back (assuming there is no increme
 into the database). In [Figure 8-1](/en/ch8#fig_transactions_increment) the counter should have increased from 42 to
 44, because two increments happened, but it actually only went to 43 because of the race condition.
-![ddia 0801](/fig/ddia_0801.png)
+{{< figure src="/fig/ddia_0801.png" id="fig_transactions_increment" title="Figure 8-1. A race condition between two clients concurrently incrementing a counter." class="w-full my-4" >}}
 ###### Figure 8-1. A race condition between two clients concurrently incrementing a counter.
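The interleaving in Figure 8-1 can be reproduced deterministically in a few lines. This sketch stands in for the database with a plain dictionary and replays the two clients' steps in the problematic order; the variable names are hypothetical.

```python
# A plain dictionary stands in for the database in this illustration.
database = {"counter": 42}

# Both clients read the current value before either has written.
value_seen_by_client_1 = database["counter"]
value_seen_by_client_2 = database["counter"]

# Each client adds 1 to the value it read and writes the result back.
database["counter"] = value_seen_by_client_1 + 1
database["counter"] = value_seen_by_client_2 + 1

# Two increments happened, but the second write clobbered the first.
print(database["counter"])
```

The final value is 43 rather than 44 because the second client computed its new value from a read that predates the first client's write.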
 *Isolation* in the sense of ACID means that concurrently executing transactions are isolated from
 each other: they cannot step on each other’s toes. The classic database textbooks formalize
@ -300,9 +293,8 @@ number of unread messages for a user, you could query something like:
 SELECT COUNT(*) FROM emails WHERE recipient_id = 2 AND unread_flag = true
 ```
-![ddia 0802](/fig/ddia_0802.png)
+{{< figure src="/fig/ddia_0802.png" id="fig_transactions_read_uncommitted" title="Figure 8-2. Violating isolation: one transaction reads another transaction's uncommitted writes (a \"dirty read\")." class="w-full my-4" >}}
 ###### Figure 8-2. Violating isolation: one transaction reads another transaction’s uncommitted writes (a “dirty read”).
 However, you might find this query to be too slow if there are many emails, and decide to store the
 number of unread messages in a separate field (a kind of denormalization, which we discuss in
@ -322,9 +314,8 @@ over the course of the transaction, the contents of the mailbox and the unread c
 of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted
 and the inserted email is rolled back.
-![ddia 0803](/fig/ddia_0803.png)
+{{< figure src="/fig/ddia_0803.png" id="fig_transactions_atomicity" title="Figure 8-3. Atomicity ensures that if an error occurs any prior writes from that transaction are undone, to avoid an inconsistent state." class="w-full my-4" >}}
 ###### Figure 8-3. Atomicity ensures that if an error occurs any prior writes from that transaction are undone, to avoid an inconsistent state.
 Multi-object transactions require some way of determining which read and write operations belong to
 the same transaction. In relational databases, that is typically done based on the client’s TCP
@ -473,18 +464,14 @@ levels of isolation are much harder to understand, and they can lead to subtle b
 nevertheless used in practice [^29].
 Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have
-caused substantial loss of money
+caused substantial loss of money [^30] [^31] [^32], led to investigation by financial auditors [^33],
-[^30] [^31] [^32],
+and caused customer data to be corrupted [^34]. A popular comment on revelations of such problems is “Use an ACID database if you’re handling
 led to investigation by financial auditors [^33],
 and caused customer data to be corrupted [^34].
 A popular comment on revelations of such problems is “Use an ACID database if you’re handling
 financial data!”—but that misses the point. Even many popular relational database systems (which
 are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these
 bugs from occurring.
 > [!NOTE]
-> Incidentally, much of the banking system relies on text files that are exchanged via secure FTP
+> Incidentally, much of the banking system relies on text files that are exchanged via secure FTP [^35].
 > [^35].
 > In this context, having an audit trail and some human-level fraud prevention measures is actually
 > more important than ACID properties.
@ -499,17 +486,14 @@ practice, and discuss in detail what kinds of race conditions can and cannot occ
 decide what level is appropriate to your application. Once we’ve done that, we will discuss
 serializability in detail (see [“Serializability”](/en/ch8#sec_transactions_serializability)). Our discussion of isolation
 levels will be informal, using examples. If you want rigorous definitions and analyses of their
-properties, you can find them in the academic literature
+properties, you can find them in the academic literature [^36] [^37] [^38] [^39].
 [^36] [^37] [^38] [^39].
 ## Read Committed
 The most basic level of transaction isolation is *read committed*. It makes two guarantees:
-1. When reading from the database, you will only see data that has been committed (no *dirty
+1. When reading from the database, you will only see data that has been committed (no *dirty reads*).
- reads*).
+2. When writing to the database, you will only overwrite data that has been committed (no *dirty writes*).
 2. When writing to the database, you will only overwrite data that has been committed (no *dirty
 writes*).
 Some databases support an even weaker isolation level called *read uncommitted*. It prevents dirty
 writes, but does not prevent dirty reads. Let’s discuss these two guarantees in more detail.
@ -526,9 +510,8 @@ all of its writes become visible at once). This is illustrated in
 [Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still
 returns the old value, 2, while user 1 has not yet committed.
-![ddia 0804](/fig/ddia_0804.png)
+{{< figure src="/fig/ddia_0804.png" id="fig_transactions_read_committed" title="Figure 8-4. No dirty reads: user 2 sees the new value for x only after user 1's transaction has committed." class="w-full my-4" >}}
 ###### Figure 8-4. No dirty reads: user 2 sees the new value for *x* only after user 1’s transaction has committed.
 There are a few reasons why it’s useful to prevent dirty reads:
@ -550,8 +533,7 @@ know in which order the writes will happen, but we normally assume that the late
 the earlier write.
 However, what happens if the earlier write is part of a transaction that has not yet committed, so
-the later write overwrites an uncommitted value? This is called a *dirty write*
+the later write overwrites an uncommitted value? This is called a *dirty write* [^36]. Transactions running at the read
 [^36]. Transactions running at the read
 committed isolation level must prevent dirty writes, usually by delaying the second write until the
 first write’s transaction has committed or aborted.
@ -570,9 +552,8 @@ By preventing dirty writes, this isolation level avoids some kinds of concurrenc
 has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason—in
 [“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update) we will discuss how to make such counter increments safe.
-![ddia 0805](/fig/ddia_0805.png)
+{{< figure src="/fig/ddia_0805.png" id="fig_transactions_dirty_writes" title="Figure 8-5. With dirty writes, conflicting writes from different transactions can be mixed up." class="w-full my-4" >}}
 ###### Figure 8-5. With dirty writes, conflicting writes from different transactions can be mixed up.
 ### Implementing read committed
@ -623,9 +604,8 @@ However, there are still plenty of ways in which you can have concurrency bugs w
 isolation level. For example, [Figure 8-6](/en/ch8#fig_transactions_item_many_preceders) illustrates a problem that
 can occur with read committed.
-![ddia 0806](/fig/ddia_0806.png)
+{{< figure src="/fig/ddia_0806.png" id="fig_transactions_item_many_preceders" title="Figure 8-6. Read skew: Aaliyah observes the database in an inconsistent state." class="w-full my-4" >}}
 ###### Figure 8-6. Read skew: Aaliyah observes the database in an inconsistent state.
 Say Aaliyah has $1,000 of savings at a bank, split across two accounts with $500 each. Now a
 transaction transfers $100 from one of her accounts to the other. If she is unlucky enough to look at her
@ -705,9 +685,8 @@ transaction ID of the writer. (To be precise, transaction IDs in PostgreSQL are
 they overflow after approximately 4 billion transactions. The vacuum process performs cleanup to
 ensure that overflow does not affect the data.)
-![ddia 0807](/fig/ddia_0807.png)
+{{< figure src="/fig/ddia_0807.png" id="fig_transactions_mvcc" title="Figure 8-7. Implementing snapshot isolation using multi-version concurrency control." class="w-full my-4" >}}
 ###### Figure 8-7. Implementing snapshot isolation using multi-version concurrency control.
 Each row in a table has an `inserted_by` field, containing the ID of the transaction that inserted
 this row into the table. Moreover, each row has a `deleted_by` field, which is initially empty. If a
@ -726,8 +705,7 @@ $400 which was inserted by transaction 13.
 All of the versions of a row are stored within the same database heap (see
 [“Storing values within the index”](/en/ch4#sec_storage_index_heap)), regardless of whether the transactions that wrote them have committed
 or not. The versions of the same row form a linked list, going either from newest version to oldest
-version or the other way round, so that queries can internally iterate over all versions of a row
+version or the other way round, so that queries can internally iterate over all versions of a row [^45] [^46].
 [^45] [^46].
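A minimal sketch of such a version chain, assuming that an update is recorded as marking the current version deleted and appending a new version. The `RowVersion` class, the starting balance of 500, and transaction ID 5 are illustrative assumptions; the list here is ordered from oldest to newest.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    balance: int
    inserted_by: int                  # ID of the transaction that created it
    deleted_by: Optional[int] = None  # initially empty


def update_balance(versions: list[RowVersion], txid: int, new_balance: int) -> None:
    """Record an update as a delete of the current version plus an insert."""
    versions[-1].deleted_by = txid
    versions.append(RowVersion(balance=new_balance, inserted_by=txid))


# Version chain for one account row (assumed initial state).
account = [RowVersion(balance=500, inserted_by=5)]
update_balance(account, txid=13, new_balance=400)

for version in account:
    print(version)
```

Both versions remain in the chain after the update, so a transaction that started before transaction 13 can still be served the older balance.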
 ### Visibility rules for observing a consistent snapshot
@ -774,8 +752,7 @@ query that uses the index must then iterate over the rows to find one that is vi
 value matches what the query is looking for. When garbage collection removes old row versions that
 are no longer visible to any transaction, the corresponding index entries can also be removed.
-Many implementation details affect the performance of multi-version concurrency control
+Many implementation details affect the performance of multi-version concurrency control [^45] [^46].
 [^45] [^46].
 For example, PostgreSQL has optimizations for avoiding index updates if different versions of the
 same row can fit on the same page [^40].
 Some other databases avoid storing full copies of modified rows, and only store differences between
@ -852,7 +829,7 @@ read-modify-write cycles in application code. They are usually the best solution
 expressed in terms of those operations. For example, the following instruction is concurrency-safe
 in most relational databases:
-```
+```sql
 UPDATE counters SET value = value + 1 WHERE key = 'foo';
 ```
@ -888,12 +865,12 @@ players from concurrently moving the same piece, as illustrated in [Example 8-1
 ##### Example 8-1. Explicitly locking rows to prevent lost updates
-```
+```sql
 BEGIN TRANSACTION;
 SELECT * FROM figures
 WHERE name = 'robot' AND game_id = 222
- FOR UPDATE; ![1](/fig/1.png)
+ FOR UPDATE; ❶
 -- Check whether move is valid, then update the position
 -- of the piece that was returned by the previous SELECT.
@ -902,9 +879,7 @@ UPDATE figures SET position = 'c4' WHERE id = 1234;
 COMMIT;
 ```
-[![1](/fig/1.png)](/en/ch8#co_transactions_CO1-1)
+❶: The `FOR UPDATE` clause indicates that the database should take a lock on all rows returned by this query.
 : The `FOR UPDATE` clause indicates that the database should take a lock on all rows returned by
 this query.
 This works, but to get it right, you need to carefully think about your application logic. It’s easy
 to forget to add a necessary lock somewhere in the code, and thus introduce a race condition.
@ -924,8 +899,7 @@ its read-modify-write cycle.
 An advantage of this approach is that databases can perform this check efficiently in conjunction
 with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL
 Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort
-the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates
+the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates [^29] [^41].
 [^29] [^41].
 Some authors [^36] [^38] argue that a database must prevent lost
 updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot
 isolation under this definition.
@ -948,7 +922,7 @@ For example, to prevent two users concurrently updating the same wiki page, you
 like this, expecting the update to occur only if the content of the page hasn’t changed since the
 user started editing it:
-```
+```sql
 -- This may or may not be safe, depending on the database implementation
 UPDATE wiki_pages SET content = 'new content'
 WHERE id = 1234 AND content = 'old content';
@ -991,8 +965,8 @@ behind CRDTs, which we encountered in [“CRDTs and Operational Transformation
 conditional writes cannot be made commutative.
 On the other hand, the *last write wins* (LWW) conflict resolution method is prone to lost updates,
-as discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww). Unfortunately, LWW is the default in many replicated
+as discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww). 
-databases.
+Unfortunately, LWW is the default in many replicated databases.
 ## Write Skew and Phantoms
@ -1007,17 +981,15 @@ concurrent writes. In this section we will see some subtler examples of conflict
 To begin, imagine this example: you are writing an application for doctors to manage their on-call
 shifts at a hospital. The hospital usually tries to have several doctors on call at any one time,
 but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if
-they are sick themselves), provided that at least one colleague remains on call in that shift
+they are sick themselves), provided that at least one colleague remains on call in that shift [^53] [^54].
 [^53] [^54].
 Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are
 feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button
 to go off call at approximately the same time. What happens next is illustrated in
 [Figure 8-8](/en/ch8#fig_transactions_write_skew).
-![ddia 0808](/fig/ddia_0808.png)
+{{< figure src="/fig/ddia_0808.png" id="fig_transactions_write_skew" title="Figure 8-8. Example of write skew causing an application bug." class="w-full my-4" >}}
 ###### Figure 8-8. Example of write skew causing an application bug.
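The anomaly in Figure 8-8 can be sketched as a small simulation in which each transaction checks the on-call count against its own snapshot before writing. A dictionary stands in for the database, and `may_go_off_call` is a hypothetical helper, so this only illustrates the interleaving and not how a real database implements snapshots.

```python
# doctor -> currently on call? (a dictionary stands in for the database)
database = {"Aaliyah": True, "Bryce": True}

# Under snapshot isolation, each transaction reads from its own snapshot,
# taken before either transaction has written anything.
snapshot_aaliyah = dict(database)
snapshot_bryce = dict(database)


def may_go_off_call(snapshot: dict[str, bool]) -> bool:
    """The application's check: are two or more doctors currently on call?"""
    return sum(snapshot.values()) >= 2


# Both checks pass, because both snapshots still show two doctors on call.
if may_go_off_call(snapshot_aaliyah):
    database["Aaliyah"] = False
if may_go_off_call(snapshot_bryce):
    database["Bryce"] = False

doctors_on_call = sum(database.values())
print(doctors_on_call)
```

The two transactions update different rows, so there is no write conflict to detect, yet the requirement of at least one doctor on call is violated.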
 In each transaction, your application first checks that two or more doctors are currently on call;
 if yes, it assumes it’s safe for one doctor to go off call. Since the database is using snapshot
@ -1047,9 +1019,8 @@ options are more restricted:
 * The automatic detection of lost updates that you find in some implementations of snapshot
 isolation unfortunately doesn’t help either: write skew is not automatically detected in
 PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL
- Server’s snapshot isolation level [^29].
+ Server’s snapshot isolation level [^29]. 
- Automatically preventing write skew requires true serializable isolation (see
+ Automatically preventing write skew requires true serializable isolation (see [“Serializability”](/en/ch8#sec_transactions_serializability)).
 [“Serializability”](/en/ch8#sec_transactions_serializability)).
 * Some databases allow you to configure constraints, which are then enforced by the database (e.g.,
 uniqueness, foreign key constraints, or restrictions on a particular value). However, in order to
 specify that at least one doctor must be on call, you would need a constraint that involves
@ -1060,12 +1031,12 @@ options are more restricted:
 to explicitly lock the rows that the transaction depends on. In the doctors example, you could
 write something like the following:
- ```
+ ```sql
 BEGIN TRANSACTION;
 SELECT * FROM doctors
 WHERE on_call = true
- AND shift_id = 1234 FOR UPDATE; ![1](/fig/1.png)
+ AND shift_id = 1234 FOR UPDATE; ❶
 UPDATE doctors
 SET on_call = false
@ -1075,8 +1046,7 @@ options are more restricted:
 COMMIT;
 ```
- [![1](/fig/1.png)](/en/ch8#co_transactions_CO2-1)
+❶: As before, `FOR UPDATE` tells the database to lock all rows returned by this query.
 : As before, `FOR UPDATE` tells the database to lock all rows returned by this query.
 ### More examples of write skew
@ -1084,15 +1054,14 @@ Write skew may seem like an esoteric issue at first, but once you’re aware of
 more situations in which it can occur. Here are some more examples:
 Meeting room booking system
-: Say you want to enforce that there cannot be two bookings for the same meeting room at the same
+: Say you want to enforce that there cannot be two bookings for the same meeting room at the same time [^55].
 time [^55].
 When someone wants to make a booking, you first check for any conflicting bookings (i.e.,
 bookings for the same room with an overlapping time range), and if none are found, you create the
 meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)).
 ##### Example 8-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)
- ```
+ ```sql
 BEGIN TRANSACTION;
 -- Check for any existing bookings that overlap with the period of noon-1pm
@ -1168,8 +1137,7 @@ This effect, where a write in one transaction changes the result of a search que
 transaction, is called a *phantom* [^4].
 Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the
 examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL
-generated by ORMs is also prone to write skew
+generated by ORMs is also prone to write skew [^50] [^51].
 [^50] [^51].
 ### Materializing conflicts
@ -1300,9 +1268,8 @@ The differences between interactive transactions and stored procedures is illust
 [Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the
 stored procedure can execute very quickly, without waiting for any network or disk I/O.
-![ddia 0809](/fig/ddia_0809.png)
+{{< figure src="/fig/ddia_0809.png" id="fig_transactions_stored_proc" title="Figure 8-9. The difference between an interactive transaction and a stored procedure (using the example transaction of [Figure 8-8](/en/ch8#fig_transactions_write_skew))." class="w-full my-4" >}}
 ###### Figure 8-9. The difference between an interactive transaction and a stored procedure (using the example transaction of [Figure 8-8](/en/ch8#fig_transactions_write_skew)).
 ### Pros and cons of stored procedures
@ -1321,8 +1288,7 @@ SQL standard (SQL/PSM) since 1999. They have gained a somewhat bad reputation, f
 (e.g., using a lot of memory or CPU time) in a database can cause much more trouble than equivalent
 badly written code in an application server.
 * In a multitenant system that allows tenants to write their own stored procedures, it’s a security
- risk to execute untrusted code in the same process as the database kernel
+ risk to execute untrusted code in the same process as the database kernel [^62].
 [^62].
 However, those issues can be overcome. Modern implementations of stored procedures have abandoned
 PL/SQL and use existing general-purpose programming languages instead: VoltDB uses Java or Groovy,
@ -1332,8 +1298,7 @@ Stored procedures are also useful in cases where application logic can’t easil
 elsewhere. Applications that use GraphQL, for example, might directly expose their database through
 a GraphQL proxy. If the proxy doesn’t support complex validation logic, you can embed such logic
 directly in the database using a stored procedure. If the database doesn’t support stored
-procedures, you would have to deploy a validation service between the proxy and the database to do
+procedures, you would have to deploy a validation service between the proxy and the database to do validation.
 validation.
 With stored procedures and in-memory data, executing all transactions on a single thread becomes
 feasible. When stored procedures don’t need to wait for I/O and avoid the overhead of other
@ -1494,8 +1459,7 @@ transaction is not allowed to concurrently insert or update another booking for
 time range. (It’s okay to concurrently insert bookings for other rooms, or for the same room at a
 different time that doesn’t affect the proposed booking.)
-How do we implement this? Conceptually, we need a *predicate lock*
+How do we implement this? Conceptually, we need a *predicate lock* [^4]. It works similarly to the
 [^4]. It works similarly to the
 shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one
 row in a table), it belongs to all objects that match some search condition, such as:
@ -1569,15 +1533,11 @@ serializable isolation and good performance fundamentally at odds with each othe
 It seems not: an algorithm called *serializable snapshot isolation* (SSI) provides full
 serializability with only a small performance penalty compared to snapshot isolation. SSI is
-comparatively new: it was first described in 2008
+comparatively new: it was first described in 2008 [^53] [^65].
 [^53] [^65].
 Today SSI and similar algorithms are used in single-node databases (the serializable isolation level
-in PostgreSQL [^54], SQL Server’s In-Memory
+in PostgreSQL [^54], SQL Server’s In-Memory OLTP/Hekaton [^66], and HyPer [^67]), distributed databases (CockroachDB [^5] and
-OLTP/Hekaton [^66], and HyPer [^67]),
+FoundationDB [^8]), and embedded storage engines such as BadgerDB.
 distributed databases (CockroachDB [^5] and
 FoundationDB [^8]), and embedded storage
 engines such as BadgerDB.
 ### Pessimistic versus optimistic concurrency control
@ -1658,9 +1618,8 @@ now taken effect, and transaction 43’s premise is no longer true. Things get e
 when a writer inserts data that didn’t exist before (see [“Phantoms causing write skew”](/en/ch8#sec_transactions_phantom)). We’ll
 discuss detecting phantom writes for SSI in [“Detecting writes that affect prior reads”](/en/ch8#sec_detecting_writes_affect_reads).
-![ddia 0810](/fig/ddia_0810.png)
+{{< figure src="/fig/ddia_0810.png" id="fig_transactions_detect_mvcc" title="Figure 8-10. Detecting when a transaction reads outdated values from an MVCC snapshot." class="w-full my-4" >}}
 ###### Figure 8-10. Detecting when a transaction reads outdated values from an MVCC snapshot.
 In order to prevent this anomaly, the database needs to track when a transaction ignores another
 transaction’s writes due to MVCC visibility rules. When the transaction wants to commit, the
@ -1680,9 +1639,8 @@ isolation’s support for long-running reads from a consistent snapshot.
 The second case to consider is when another transaction modifies data after it has been read. This
 case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range).
-![ddia 0811](/fig/ddia_0811.png)
+{{< figure src="/fig/ddia_0811.png" id="fig_transactions_detect_index_range" title="Figure 8-11. In serializable snapshot isolation, detecting when one transaction modifies another transaction's reads." class="w-full my-4" >}}
 ###### Figure 8-11. In serializable snapshot isolation, detecting when one transaction modifies another transaction’s reads.
 In the context of two-phase locking we discussed index-range locks (see
 [“Index-range locks”](/en/ch8#sec_transactions_2pl_range)), which allow the database to lock access to all rows matching some
@ -1788,9 +1746,8 @@ some nodes and fails on other nodes, as shown in [Figure 8-12](/en/ch8#fig_tran
 * Some nodes may crash before the commit record is fully written and roll back on recovery, while
 others successfully commit.
-![ddia 0812](/fig/ddia_0812.png)
+{{< figure src="/fig/ddia_0812.png" id="fig_transactions_non_atomic" title="Figure 8-12. When a transaction involves multiple database nodes, it may commit on some and fail on others." class="w-full my-4" >}}
 ###### Figure 8-12. When a transaction involves multiple database nodes, it may commit on some and fail on others.
 If some nodes commit the transaction but others abort it, the nodes become inconsistent with each
 other. And once a transaction has been committed on one node, it cannot be retracted again if it
@ -1808,21 +1765,17 @@ problem.
 ## Two-Phase Commit (2PC)
 Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It
-is a classic algorithm in distributed databases
+is a classic algorithm in distributed databases [^13] [^71] [^72]. 2PC is used
-[^13] [^71] [^72]. 2PC is used
+internally in some databases and also made available to applications in the form of *XA transactions* [^73]
 internally in some databases and also made available to applications in the form of *XA transactions*
 [^73]
 (which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP
-web services
+web services [^74] [^75].
 [^74] [^75].
 The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
 commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two
 phases (hence the name).
-![ddia 0813](/fig/ddia_0813.png)
+{{< figure src="/fig/ddia_0813.png" id="fig_transactions_two_phase_commit" title="Figure 8-13. A successful execution of two-phase commit (2PC)." class="w-full my-4" >}}
 ###### Figure 8-13. A successful execution of two-phase commit (2PC).
 2PC uses a new component that does not normally appear in single-node transactions: a
 *coordinator* (also known as *transaction manager*). The coordinator is often implemented as a
@ -1839,8 +1792,7 @@ participants:
 * If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends
 out a *commit* request in phase 2, and the commit actually takes place.
-* If any of the participants replies “no,” the coordinator sends an *abort* request to all nodes in
+* If any of the participants replies “no,” the coordinator sends an *abort* request to all nodes in phase 2.
 phase 2.
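The two rules above can be sketched as a toy coordinator. This is a minimal illustration rather than a real implementation: the `Participant` class and `run_two_phase_commit` function are hypothetical names, everything runs in one process, and the durable logging, timeouts, and crash recovery that a real coordinator needs are deliberately left out.

```python
from typing import List


class Participant:
    """Hypothetical in-memory participant; a real one would write to durable storage."""

    def __init__(self, name: str, can_commit: bool) -> None:
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self) -> bool:
        # Phase 1: vote "yes" only if this node can promise to commit later.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self) -> None:
        self.state = "committed"

    def abort(self) -> None:
        self.state = "aborted"


def run_two_phase_commit(participants: List[Participant]) -> str:
    # Phase 1: collect a vote from every participant.
    votes = [p.prepare() for p in participants]

    # The coordinator commits only if all votes are "yes".
    decision = "commit" if all(votes) else "abort"

    # Phase 2: send the same decision to all participants.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision
```

A single "no" vote is enough to abort everywhere, which is what keeps the outcome the same on every node.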
 This process is somewhat like the traditional marriage ceremony in Western cultures: the minister
 asks the bride and groom individually whether each wants to marry the other, and typically receives
@ -1920,9 +1872,8 @@ not know whether to commit or abort. Even a timeout does not help here: if datab
 aborts after a timeout, it will end up inconsistent with database 2, which has committed. Similarly,
 it is not safe to unilaterally commit, because another participant may have aborted.
-![ddia 0814](/fig/ddia_0814.png)
+{{< figure src="/fig/ddia_0814.png" id="fig_transactions_2pc_crash" title="Figure 8-14. The coordinator crashes after participants vote \"yes.\" Database 1 does not know whether to commit or abort." class="w-full my-4" >}}
 ###### Figure 8-14. The coordinator crashes after participants vote “yes.” Database 1 does not know whether to commit or abort.
 Without hearing from the coordinator, the participant has no way of knowing whether to commit or
 abort. In principle, the participants could communicate among themselves to find out how each
@ -1942,8 +1893,7 @@ stuck waiting for the coordinator to recover. It is possible to make an atomic c
 *nonblocking*, so that it does not get stuck if a node fails. However, making this work in practice
 is not so straightforward.
-As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed
+As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed [^13] [^77].
 [^13] [^77].
 However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most
 practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
 cannot guarantee atomicity.
@ -1957,8 +1907,7 @@ Distributed transactions and two-phase commit have a mixed reputation. On the on
 seen as providing an important safety guarantee that would be hard to achieve otherwise; on the
 other hand, they are criticized for causing operational problems, killing performance, and promising
 more than they can deliver [^78] [^79] [^80] [^81].
-Many cloud services choose not to implement distributed transactions due to the operational
+Many cloud services choose not to implement distributed transactions due to the operational problems they engender [^82].
 problems they engender [^82].
 Some implementations of distributed transactions carry a heavy performance penalty. Much of the
 performance cost inherent in two-phase commit is due to the additional disk forcing (`fsync`) that
@ -2073,8 +2022,7 @@ transaction is resolved.
 ### Recovering from coordinator failure
 In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the
-log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do
+log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do occur [^83] [^84] — that is,
 occur [^83] [^84]—that is,
 transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because
 the transaction log has been lost or corrupted due to a software bug). These transactions cannot be
 resolved automatically, so they sit forever in the database, holding locks and blocking other
@ -2135,11 +2083,8 @@ As explained previously, there is a big difference between distributed transacti
 multiple heterogeneous storage technologies, and those that are internal to a system—i.e., where all
 the participating nodes are shards of the same database running the same software. Such internal
 distributed transactions are a defining feature of “NewSQL” databases such as
-CockroachDB [^5],
+CockroachDB [^5], TiDB [^6], Spanner [^7], FoundationDB [^8], and YugabyteDB, for example. 
-TiDB [^6],
+Some message brokers such as Kafka also support internal distributed transactions [^85].
 Spanner [^7],
 FoundationDB [^8], and YugabyteDB, for
 example. Some message brokers such as Kafka also support internal distributed transactions [^85].
 Many of these systems use two-phase commit to ensure atomicity of transactions that write to multiple
 shards, and yet they don’t suffer the same problems as XA transactions. The reason is that because
@ -2149,14 +2094,10 @@ are more reliable and faster.
 The biggest problems with XA can be fixed by:
-* Replicating the coordinator, with automatic failover to another coordinator node if the primary
+* Replicating the coordinator, with automatic failover to another coordinator node if the primary one crashes;
- one crashes;
+* Allowing the coordinator and data shards to communicate directly without going via application code;
-* Allowing the coordinator and data shards to communicate directly without going via application
+* Replicating the participating shards, so that the risk of having to abort a transaction because of a fault in one of the shards is reduced; and
- code;
+* Coupling the atomic commitment protocol with a distributed concurrency control protocol that supports deadlock detection and consistent reads across shards.
 * Replicating the participating shards, so that the risk of having to abort a transaction because of
 a fault in one of the shards is reduced; and
 * Coupling the atomic commitment protocol with a distributed concurrency control protocol that
 supports deadlock detection and consistent reads across shards.
 Consensus algorithms are commonly used to replicate the coordinator and the database shards. We will
 see in [Chapter 10](/en/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented
@ -2240,12 +2181,12 @@ discussing various examples of race conditions, summarized in [Table 8-1](/en/c
 Table 8-1. Summary of anomalies that can occur at various isolation levels
-| Isolation level | Dirty reads | Read skew | Phantom reads | Lost updates | Write skew |
+| Isolation level    | Dirty reads | Read skew   | Phantom reads | Lost updates | Write skew  |
 |--------------------|-------------|-------------|---------------|--------------|-------------|
-| Read uncommitted | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible |
+| Read uncommitted   | ✗ Possible  | ✗ Possible  | ✗ Possible    | ✗ Possible   | ✗ Possible  |
-| Read committed | ✓ Prevented | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible |
+| Read committed     | ✓ Prevented | ✗ Possible  | ✗ Possible    | ✗ Possible   | ✗ Possible  |
-| Snapshot isolation | ✓ Prevented | ✓ Prevented | ✓ Prevented | ? Depends | ✗ Possible |
+| Snapshot isolation | ✓ Prevented | ✓ Prevented | ✓ Prevented   | ? Depends    | ✗ Possible  |
-| Serializable | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented |
+| Serializable       | ✓ Prevented | ✓ Prevented | ✓ Prevented   | ✓ Prevented  | ✓ Prevented |
 Dirty reads
 : One client reads another client’s writes before they have been committed. The read committed
@ -2305,9 +2246,7 @@ mechanisms. Fortunately, idempotence can ensure exactly-once semantics without r
 commit across different storage technologies, and we will see more on this in later chapters.
 The examples in this chapter used a relational data model. However, as discussed in
-[“The need for multi-object transactions”](/en/ch8#sec_transactions_need), transactions are a valuable database feature, no matter which data model
+[“The need for multi-object transactions”](/en/ch8#sec_transactions_need), transactions are a valuable database feature, no matter which data model is used.
 is used.
--- a/content/en/ch9.md
+++ b/content/en/ch9.md
@ -117,9 +117,8 @@ a request and expect a response, many things could go wrong (some of which are i
 6. The remote node may have processed your request, but the response has been delayed and will be
 delivered later (perhaps the network or your own machine is overloaded).
-![ddia 0901](/fig/ddia_0901.png)
+{{< figure src="/fig/ddia_0901.png" id="fig_distributed_network" title="Figure 9-1. If you send a request and don't get a response, it's not possible to distinguish whether (a) the request was lost, (b) the remote node is down, or (c) the response was lost." class="w-full my-4" >}}
 ###### Figure 9-1. If you send a request and don’t get a response, it’s not possible to distinguish whether (a) the request was lost, (b) the remote node is down, or (c) the response was lost.
 The sender can’t even tell whether the packet was delivered: the only option is for the recipient to
 send a response message, which may in turn be lost or delayed. These issues are indistinguishable in
@ -147,8 +146,7 @@ TCP is often described as providing “reliable” delivery, in the sense that i
 retransmits dropped packets, it detects reordered packets and puts them back in the correct order,
 and it detects packet corruption using a simple checksum. It also figures out how fast it can send
 data so that it is transferred as quickly as possible, but without overloading the network or the
-receiving node; this is known as *congestion control*, *flow control*, or *backpressure*
+receiving node; this is known as *congestion control*, *flow control*, or *backpressure* [^5].
 [^5].
 When you “send” some data by writing it to a socket, it actually doesn’t get sent immediately,
 but it’s only placed in a buffer managed by your operating system. When the congestion control
@ -252,8 +250,7 @@ that something is not working:
 or refuse TCP connections by sending a `RST` or `FIN` packet in reply.
 * If a node process crashed (or was killed by an administrator) but the node’s operating system is
 still running, a script can notify other nodes about the crash so that another node can take over
- quickly without having to wait for a timeout to expire. For example, HBase does this
+ quickly without having to wait for a timeout to expire. For example, HBase does this [^26].
 [^26].
 * If you have access to the management interface of the network switches in your datacenter, you can
 query them to detect link failures at a hardware level (e.g., if the remote machine is powered
 down). This option is ruled out if you’re connecting via the internet, or if you’re in a shared
@ -282,9 +279,7 @@ to a load spike on the node or the network).
 Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of
 performing some action (for example, sending an email), and another node takes over, the action may
 end up being performed twice. We will discuss this issue in more detail in
-[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), and in
+[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), and in Chapters [^10] and [Link to Come].
 Chapters [^10]
 and [Link to Come].
 When a node is declared dead, its responsibilities need to be transferred to other nodes, which
 places additional load on other nodes and the network. If the system is already struggling with high
@ -322,19 +317,16 @@ Similarly, the variability of packet delays on computer networks is most often d
 the network is functioning fine.
 * When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming
 request from the network is queued by the operating system until the application is ready to
- handle it. Depending on the load on the machine, this may take an arbitrary length of time
+ handle it. Depending on the load on the machine, this may take an arbitrary length of time [^28].
 [^28].
 * In virtualized environments, a running operating system is often paused for tens of milliseconds
 while another virtual machine uses a CPU core. During this time, the VM cannot consume any data
- from the network, so the incoming data is queued (buffered) by the virtual machine monitor
+ from the network, so the incoming data is queued (buffered) by the virtual machine monitor [^29],
 [^29],
 further increasing the variability of network delays.
 * As mentioned earlier, in order to avoid overloading the network, TCP limits the rate at which it
 sends data. This means additional queueing at the sender before the data even enters the network.
-![ddia 0902](/fig/ddia_0902.png)
+{{< figure src="/fig/ddia_0902.png" id="fig_distributed_switch_queueing" title="Figure 9-2. If several machines send network traffic to the same destination, its switch queue can fill up. Here, ports 1, 2, and 4 are all trying to send packets to port 3." class="w-full my-4" >}}
 ###### Figure 9-2. If several machines send network traffic to the same destination, its switch queue can fill up. Here, ports 1, 2, and 4 are all trying to send packets to port 3.
 Moreover, when TCP detects and automatically retransmits a lost packet, although the application
 does not see the packet loss directly, it does see the resulting delay (waiting for the timeout to
@ -588,26 +580,21 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples
 * The quartz clock in a computer is not very accurate: it *drifts* (runs faster or slower than it
 should). Clock drift varies depending on the temperature of the machine. Google assumes a clock
- drift of up to 200 ppm (parts per million) for its servers
+ drift of up to 200 ppm (parts per million) for its servers [^45],
 [^45],
 which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30
 seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best
 possible accuracy you can achieve, even if everything is working correctly.
 * If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the
- local clock will be forcibly reset [^39]. Any
+ local clock will be forcibly reset [^39]. Any applications observing the time before and after this reset may see time go backward or suddenly jump forward.
 applications observing the time before and after this reset may see time go backward or suddenly
 jump forward.
 * If a node is accidentally firewalled off from NTP servers, the misconfiguration may go
 unnoticed for some time, during which the drift may add up to large discrepancies between
 different nodes’ clocks. Anecdotal evidence suggests that this does happen in practice.
 * NTP synchronization can only be as good as the network delay, so there is a limit to its
 accuracy when you’re on a congested network with variable packet delays. One experiment showed
- that a minimum error of 35 ms is achievable when synchronizing over the internet
+ that a minimum error of 35 ms is achievable when synchronizing over the internet [^46],
 [^46],
 though occasional spikes in network delay lead to errors of around a second. Depending on the
 configuration, large network delays can cause the NTP client to give up entirely.
-* Some NTP servers are wrong or misconfigured, reporting time that is off by hours
+* Some NTP servers are wrong or misconfigured, reporting time that is off by hours [^47] [^48].
 [^47] [^48].
 NTP clients mitigate such errors by querying several servers and ignoring outliers.
 Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you
 were told by a stranger on the internet.
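The drift figures quoted in this hunk (6 ms after 30 seconds, 17 seconds after a day) follow directly from the 200 ppm bound. A small sketch of that arithmetic, where the function name `max_drift_seconds` is purely illustrative:

```python
def max_drift_seconds(drift_ppm: float, resync_interval_s: float) -> float:
    """Upper bound on the clock error that builds up between two synchronizations."""
    # ppm means parts per million, so scale by 1e-6.
    return drift_ppm * 1e-6 * resync_interval_s


every_30_s = max_drift_seconds(200, 30)             # about 0.006 s, i.e. 6 ms
once_a_day = max_drift_seconds(200, 24 * 60 * 60)   # about 17.28 s

print(f"resync every 30 s: up to {every_30_s * 1000:.1f} ms of drift")
print(f"resync once a day: up to {once_a_day:.2f} s of drift")
```

The bound grows linearly with the resynchronization interval, which is why a blocked or misconfigured NTP client becomes a problem so quickly.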
@ -619,9 +606,7 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples
 adjustment gradually over the course of a day (this is known as *smearing*) [^51] [^52],
 although actual NTP server behavior varies in practice [^53].
 Leap seconds will no longer be used from 2035 onwards, so this problem will fortunately go away.
-* In virtual machines, the hardware clock is virtualized, which raises additional challenges for
+* In virtual machines, the hardware clock is virtualized, which raises additional challenges for applications that need accurate timekeeping [^54].
 applications that need accurate timekeeping
 [^54].
 When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds
 while another VM is running. From an application’s point of view, this pause manifests itself as
 the clock suddenly jumping forward [^29].
@ -642,8 +627,7 @@ Such accuracy can be achieved with some special hardware (GPS receivers and/or a
 Precision Time Protocol (PTP) and careful deployment and monitoring [^58] [^59].
 Relying on GPS alone can be risky because GPS signals can easily be jammed. In some locations this
 happens frequently, e.g. close to military facilities [^60].
-Some cloud providers have begun offering high-accuracy clock synchronization for their virtual
+Some cloud providers have begun offering high-accuracy clock synchronization for their virtual machines [^61].
 machines [^61].
 However, clock synchronization still requires a lot of care. If your NTP daemon is misconfigured, or
 a firewall is blocking NTP traffic, the clock error due to drift can quickly become large.
@ -664,8 +648,7 @@ its network is misconfigured, it most likely won’t work at all, so it will qui
 fixed. On the other hand, if its quartz clock is defective or its NTP client is misconfigured, most
 things will seem to work fine, even though its clock gradually drifts further and further away from
 reality. If some piece of software is relying on an accurately synchronized clock, the result is
-more likely to be silent and subtle data loss than a dramatic crash
+more likely to be silent and subtle data loss than a dramatic crash [^62] [^63].
 [^62] [^63].
 Thus, if you use software that requires synchronized clocks, it is essential that you also carefully
 monitor the clock offsets between all the machines. Any node whose clock drifts too far from the
@ -684,9 +667,8 @@ multi-leader replication (the example is similar to [Figure 6-8](/en/ch6#fig_re
 *x* = 1 on node 1; the write is replicated to node 3; client B increments *x* on node
 3 (we now have *x* = 2); and finally, both writes are replicated to node 2.
-![ddia 0903](/fig/ddia_0903.png)
+{{< figure src="/fig/ddia_0903.png" id="fig_distributed_timestamps" title="Figure 9-3. The write by client B is causally later than the write by client A, but B's write has an earlier timestamp." class="w-full my-4" >}}
 ###### Figure 9-3. The write by client B is causally later than the write by client A, but B’s write has an earlier timestamp.
 In [Figure 9-3](/en/ch9#fig_distributed_timestamps), when a write is replicated to other nodes, it is tagged with a
 timestamp according to the time-of-day clock on the node where the write originated. The clock
@ -710,12 +692,10 @@ a higher timestamp than the overwritten value, even if that timestamp is ahead o
 clock. However, that incurs the cost of an additional read to find the greatest existing timestamp.
 Some systems, including Cassandra and ScyllaDB, want to write to all replicas in a single round
 trip, and therefore they simply use the client clock’s timestamp along with a last write wins
-policy [^62]. This approach has some
+policy [^62]. This approach has some serious problems:
 serious problems:
 * Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite
- values previously written by a node with a fast clock until the clock skew between the nodes has
+ values previously written by a node with a fast clock until the clock skew between the nodes has elapsed [^63] [^65].
 elapsed [^63] [^65].
 This scenario can cause arbitrary amounts of data to be silently dropped without any error being
 reported to the application.
 * LWW cannot distinguish between writes that occurred sequentially in quick succession (in
@ -741,9 +721,7 @@ time, in addition to other sources of error such as quartz drift. To guarantee a
 you would need the clock error to be significantly lower than the network delay, which is not
 possible.
-So-called *logical clocks*
+So-called *logical clocks* [^66], which are based on incrementing counters rather than an oscillating quartz crystal, are a safer
 [^66],
 which are based on incrementing counters rather than an oscillating quartz crystal, are a safer
 alternative for ordering events (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). Logical clocks do not measure
 the time of day or the number of seconds elapsed, only the relative ordering of events (whether one
 event happened before or after another). In contrast, time-of-day and monotonic clocks, which
@ -810,13 +788,11 @@ Can we use the timestamps from synchronized time-of-day clocks as transaction ID
 the synchronization good enough, they would have the right properties: later transactions have a
 higher timestamp. The problem, of course, is the uncertainty about clock accuracy.
-Spanner implements snapshot isolation across datacenters in this way
+Spanner implements snapshot isolation across datacenters in this way [^68] [^69].
 [^68] [^69].
 It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the
 following observation: if you have two confidence intervals, each consisting of an earliest and
-latest possible timestamp (*A* = [*Aearliest*, *Alatest*] and
+latest possible timestamp (*A* = [*Aearliest*, *Alatest*] and *B* = [*Bearliest*, *Blatest*]), and those two intervals do not overlap 
-*B* = [*Bearliest*, *Blatest*]), and those two intervals do not overlap (i.e.,
+(i.e., *Aearliest* < *Alatest* < *Bearliest* < *Blatest*), then B definitely happened after A—there
 *Aearliest* < *Alatest* < *Bearliest* < *Blatest*), then B definitely happened after A—there
 can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened.
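The interval comparison described above can be written down in a few lines. This is only a sketch of the observation, not the TrueTime API: the `Interval` class and the `order` function are hypothetical names, and timestamps are plain numbers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Interval:
    """Confidence interval for a timestamp: the true time lies in [earliest, latest]."""
    earliest: float
    latest: float


def definitely_before(a: Interval, b: Interval) -> bool:
    # A happened before B for certain only if A's latest bound precedes B's earliest bound.
    return a.latest < b.earliest


def order(a: Interval, b: Interval) -> Optional[str]:
    """Return the certain order of A and B, or None if the intervals overlap."""
    if definitely_before(a, b):
        return "A before B"
    if definitely_before(b, a):
        return "B before A"
    return None  # overlapping intervals: the order cannot be determined


print(order(Interval(10.0, 12.0), Interval(13.0, 15.0)))
print(order(Interval(10.0, 14.0), Interval(13.0, 15.0)))
```

Returning `None` for overlapping intervals makes the uncertainty explicit instead of guessing an order.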
 In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the
@ -824,8 +800,7 @@ length of the confidence interval before committing a read-write transaction. By
 ensures that any transaction that may read the data is at a sufficiently later time, so their
 confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner
 needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS
-receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about
+receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [^45].
 7 ms [^45].
 The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to
 have a confidence interval, and the accurate clock sources only help keep that interval small. Other
@ -839,8 +814,7 @@ database with a single leader per shard. Only the leader is allowed to accept wr
 node know that it is still leader (that it hasn’t been declared dead by the others), and that it may
 safely accept writes?
-One option is for the leader to obtain a *lease* from the other nodes, which is similar to a lock
+One option is for the leader to obtain a *lease* from the other nodes, which is similar to a lock with a timeout [^73].
 with a timeout [^73].
 Only one node can hold the lease at any one time—thus, when a node obtains a lease, it knows that
 it is the leader for some amount of time, until the lease expires. In order to remain leader, the
 node must periodically renew the lease before it expires. If the node fails, it stops renewing the
@ -887,12 +861,10 @@ various reasons why this could happen:
 * Contention among threads accessing a shared resource, such as a lock or queue, can cause threads
 to spend a lot of their time waiting. Moving to a machine with more CPU cores can make such
- problems worse, and contention problems can be difficult to diagnose
+ problems worse, and contention problems can be difficult to diagnose [^74].
 [^74].
 * Many programming language runtimes (such as the Java Virtual Machine) have a *garbage collector*
 (GC) that occasionally needs to stop all running threads. In the past, such *“stop-the-world” GC
- pauses* would sometimes last for several minutes
+ pauses* would sometimes last for several minutes [^75]!
 [^75]!
 With modern GC algorithms this is less of a problem, but GC pauses can still be noticeable (see
 [“Limiting the impact of garbage collection”](/en/ch9#sec_distributed_gc_impact)).
 * In virtualized environments, a virtual machine can be *suspended* (pausing the execution of all
@ -900,8 +872,7 @@ various reasons why this could happen:
 memory and continuing execution). This pause can occur at any time in a process’s execution and can
 last for an arbitrary length of time. This feature is sometimes used for *live migration* of
 virtual machines from one host to another without a reboot, in which case the length of the pause
- depends on the rate at which processes are writing to memory
+ depends on the rate at which processes are writing to memory [^76].
 [^76].
 * On end-user devices such as laptops and phones, execution may also be suspended and resumed
 arbitrarily, e.g., when the user closes the lid of their laptop.
 * When the operating system context-switches to another thread, or when the hypervisor switches to a
@ -914,11 +885,9 @@ various reasons why this could happen:
 disk I/O operation to complete [^77]. In many languages, disk access can happen
 surprisingly, even if the code doesn’t explicitly mention file access—for example, the Java
 classloader lazily loads class files when they are first used, which could happen at any time in
- the program execution. I/O pauses and GC pauses may even conspire to combine their delays
+ the program execution. I/O pauses and GC pauses may even conspire to combine their delays [^78].
 [^78].
 If the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the
- I/O latency is further subject to the variability of network delays
+ I/O latency is further subject to the variability of network delays [^31].
 [^31].
 * If the operating system is configured to allow *swapping to disk* (*paging*), a simple memory
 access may result in a page fault that requires a page from disk to be loaded into memory. The
 thread is paused while this slow I/O operation takes place. If memory pressure is high, this may
@ -1126,9 +1095,8 @@ become corrupted. You try to implement this by requiring a client to obtain a le
 service before accessing the file. Such a lock service is often implemented using a consensus
 algorithm; we will discuss this further in [Chapter 10](/en/ch10#ch_consistency).
-![ddia 0904](/fig/ddia_0904.png)
+{{< figure src="/fig/ddia_0904.png" id="fig_distributed_lease_pause" title="Figure 9-4. Incorrect implementation of a distributed lock: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage." class="w-full my-4" >}}
 ###### Figure 9-4. Incorrect implementation of a distributed lock: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage.
 The problem is an example of what we discussed in [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses): if the client
 holding the lease is paused for too long, its lease expires. Another client can obtain a lease for
@ -1144,9 +1112,8 @@ or more.) By the time the write request arrives at the storage service, the leas
 out, allowing client 2 to acquire it and issue a write of its own. The result is corruption similar
 to [Figure 9-4](/en/ch9#fig_distributed_lease_pause).
-![ddia 0905](/fig/ddia_0905.png)
+{{< figure src="/fig/ddia_0905.png" id="fig_distributed_lease_delay" title="Figure 9-5. A message from a former leaseholder might be delayed for a long time, and arrive after another node has taken over the lease." class="w-full my-4" >}}
 ###### Figure 9-5. A message from a former leaseholder might be delayed for a long time, and arrive after another node has taken over the lease.
 ### Fencing off zombies and delayed requests
@ -1166,9 +1133,8 @@ detected and shut down, it may already be too late and data may already have bee
 A more robust fencing solution, which protects against both zombies and delayed requests, is
 illustrated in [Figure 9-6](/en/ch9#fig_distributed_fencing).
-![ddia 0906](/fig/ddia_0906.png)
+{{< figure src="/fig/ddia_0906.png" id="fig_distributed_fencing" title="Figure 9-6. Making access to storage safe by allowing writes only in the order of increasing fencing tokens." class="w-full my-4" >}}
 ###### Figure 9-6. Making access to storage safe by allowing writes only in the order of increasing fencing tokens.
 Let’s assume that every time the lock service grants a lock or lease, it also returns a *fencing
 token*, which is a number that increases every time a lock is granted (e.g., incremented by the lock
@ -1221,9 +1187,8 @@ the most significant bits or digits of the timestamp. You can then be sure that
 generated by the new leaseholder will be greater than any timestamp from the old leaseholder, even
 if the old leaseholder’s writes happened later.
-![ddia 0907](/fig/ddia_0907.png)
+{{< figure src="/fig/ddia_0907.png" id="fig_distributed_fencing_leaderless" title="Figure 9-7. Using fencing tokens to protect writes to a leaderless replicated database." class="w-full my-4" >}}
 ###### Figure 9-7. Using fencing tokens to protect writes to a leaderless replicated database.
 In [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), Client 2 has a fencing token of 34, so all of its
 timestamps starting with 34… are greater than any timestamps starting with 33… that are
@ -1252,13 +1217,11 @@ playing by the rules of the protocol.
 Distributed systems problems become much harder if there is a risk that nodes may “lie” (send
 arbitrary faulty or corrupted responses)—for example, it might cast multiple contradictory votes in
 the same election. Such behavior is known as a *Byzantine fault*, and the problem of reaching
-consensus in this untrusting environment is known as the *Byzantine Generals Problem*
+consensus in this untrusting environment is known as the *Byzantine Generals Problem* [^94].
 [^94].
 # The Byzantine Generals Problem
-The Byzantine Generals Problem is a generalization of the so-called *Two Generals Problem*
+The Byzantine Generals Problem is a generalization of the so-called *Two Generals Problem* [^95],
 [^95],
 which imagines a situation in which two army generals need to agree on a battle plan. As they
 have set up camp on two different sites, they can only communicate by messenger, and the messengers
 sometimes get delayed or lost (like packets in a network). We will discuss this problem of
@ -1290,8 +1253,7 @@ with the network. This concern is relevant in certain specific circumstances. Fo
 defraud others. In such circumstances, it is not safe for a node to simply trust another node’s
 messages, since they may be sent with malicious intent. For example, cryptocurrencies like
 Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties
- to agree whether a transaction happened or not, without relying on a central authority
+ to agree whether a transaction happened or not, without relying on a central authority [^100].
 [^100].
 However, in the kinds of systems we discuss in this book, we can usually safely assume that there
 are no Byzantine faults. In a datacenter, all the nodes are controlled by your organization (so
@ -1308,8 +1270,7 @@ end-user control, such as web browsers. This is why input validation, sanitizati
 escaping are so important: to prevent SQL injection and cross-site scripting, for example. However,
 we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the
 authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where
-there is no such central authority, Byzantine fault tolerance is more relevant
+there is no such central authority, Byzantine fault tolerance is more relevant [^103] [^104].
 [^103] [^104].
 A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to
 all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant
@ -1336,18 +1297,15 @@ pragmatic steps toward better reliability. For example:
 drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and
 UDP, but sometimes they evade detection [^105] [^106] [^107].
 Simple measures are usually sufficient protection against such corruption, such as checksums in
- the application-level protocol. TLS-encrypted connections also offer protection against
+ the application-level protocol. TLS-encrypted connections also offer protection against corruption.
 corruption.
 * A publicly accessible application must carefully sanitize any inputs from users, for example
 checking that a value is within a reasonable range and limiting the size of strings to prevent
 denial of service through large memory allocations. An internal service behind a firewall may be
- able to get away with less strict checks on inputs, but basic checks in protocol parsers are still
+ able to get away with less strict checks on inputs, but basic checks in protocol parsers are still a good idea [^105].
 a good idea [^105].
 * NTP clients can be configured with multiple server addresses. When synchronizing, the client
 contacts all of them, estimates their errors, and checks that a majority of servers agree on some
 time range. As long as most of the servers are okay, a misconfigured NTP server that is reporting an
- incorrect time is detected as an outlier and is excluded from synchronization
+ incorrect time is detected as an outlier and is excluded from synchronization [^39]. The use of multiple servers makes NTP
 [^39]. The use of multiple servers makes NTP
 more robust than if it only uses a single server.
 ## System Model and Reality
@ -1367,15 +1325,13 @@ With regard to timing assumptions, three system models are in common use:
 Synchronous model
 : The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock
 error. This does not imply exactly synchronized clocks or zero network delay; it just means you
- know that network delay, pauses, and clock drift will never exceed some fixed upper bound
+ know that network delay, pauses, and clock drift will never exceed some fixed upper bound [^108].
 [^108].
 The synchronous model is not a realistic model of most practical
 systems, because (as discussed in this chapter) unbounded delays and pauses do occur.
 Partially synchronous model
 : Partial synchrony means that a system behaves like a synchronous system *most of the time*, but it
- sometimes exceeds the bounds for network delay, process pauses, and clock drift
+ sometimes exceeds the bounds for network delay, process pauses, and clock drift [^108]. This is a realistic model of many
 [^108]. This is a realistic model of many
 systems: most of the time, networks and processes are quite well behaved—otherwise we would never
 be able to get anything done—but we have to reckon with the fact that any timing assumptions
 may be shattered occasionally. When this happens, network delay, pauses, and clock error may become
@ -1391,8 +1347,7 @@ nodes are:
 Crash-stop faults
 : In the *crash-stop* (or *fail-stop*) model, an algorithm may assume that a node can fail in only
- one way, namely by crashing
+ one way, namely by crashing [^109].
 [^109].
 This means that the node may suddenly stop responding at any moment, and thereafter that node is
 gone forever—it never comes back.
@ -1405,19 +1360,14 @@ Crash-recovery faults
 Degraded performance and partial functionality
 : In addition to crashing and restarting, nodes may go slow: they may still be able to respond to
 health check requests, while being too slow to get any real work done. For example, a Gigabit
- network interface could suddenly drop to 1 Kb/s throughput due to a driver bug
+ network interface could suddenly drop to 1 Kb/s throughput due to a driver bug [^110];
- [^110];
+ a process that is under memory pressure may spend most of its time performing garbage collection [^111];
 a process that is under memory pressure may spend most of its time performing garbage collection
 [^111];
 worn-out SSDs can have erratic performance; and hardware can be affected by high temperature,
- loose connectors, mechanical vibration, power supply problems, firmware bugs, and more
+ loose connectors, mechanical vibration, power supply problems, firmware bugs, and more [^112].
- [^112].
+ Such a situation is called a *limping node*, *gray failure*, or *fail-slow* [^113],
 Such a situation is called a *limping node*, *gray failure*, or *fail-slow*
 [^113],
 and it can be even more difficult to deal with than a cleanly failed node. A related problem is
 when a process stops doing some of the things it is supposed to do while other aspects continue
- working, for example because a background thread is crashed or deadlocked
+ working, for example because a background thread is crashed or deadlocked [^114].
 [^114].
 Byzantine (arbitrary) faults
 : Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described
@ -1558,15 +1508,13 @@ longer executions would then not be found.
 Still, model checkers strike a nice balance between ease of use and the ability to find non-obvious
 bugs. CockroachDB, TiDB, Kafka, and many other distributed systems use model specifications to find
-and fix bugs
+and fix bugs [^122] [^123] [^124]. For example,
 [^122] [^123] [^124]. For example,
 using TLA+, researchers were able to demonstrate the potential for data loss in viewstamped
 replication (VR) caused by ambiguity in the prose description of the algorithm [^125].
 By design, model checkers don’t run your actual code, but rather a simplified model that specifies
 only the core ideas of your protocol. This makes it more tractable to systematically explore the
-state space, but it risks that your specification and your implementation go out of sync with each
+state space, but it risks that your specification and your implementation go out of sync with each other [^126].
 other [^126].
 It is possible to check whether the model and the real implementation have equivalent behavior, but
 this requires instrumentation in the real implementation [^127].
@ -1596,8 +1544,7 @@ The myriad of tools required to trigger failures make fault injection tests cumb
 It’s common to adopt a fault injection framework like Jepsen to run fault injection tests to
 simplify the process. Such frameworks come with integrations for various operating systems and many
 pre-built fault injectors [^129].
-Jepsen has been remarkably effective at finding critical bugs in many widely-used systems
+Jepsen has been remarkably effective at finding critical bugs in many widely-used systems [^130] [^131].
 [^130] [^131].
 ### Deterministic simulation testing
@ -1620,19 +1567,16 @@ Application-level
 : Some systems are built from the ground up to make it easy to execute code deterministically. For
 example, FoundationDB, one of the pioneers in the DST space, is built using an asynchronous
 communication library called Flow. Flow provides a point for developers to inject a deterministic
- network simulation into the system
+ network simulation into the system [^132].
 [^132].
 Similarly, TigerBeetle is an online transaction processing (OLTP) database with first-class DST
 support. The system’s state is modeled as a state machine, with all mutations occurring within a
 single event loop. When combined with mock deterministic primitives such as clocks, such an
- architecture is able to run deterministically
+ architecture is able to run deterministically [^133].
 [^133].
 Runtime-level
 : Languages with asynchronous runtimes and commonly used libraries provide an insertion point
 to introduce determinism. A single-threaded runtime is used to force all asynchronous code to run
- sequentially. FrostDB, for example, patches Go’s runtime to execute goroutines sequentially
+ sequentially. FrostDB, for example, patches Go’s runtime to execute goroutines sequentially [^134].
 [^134].
 Rust’s madsim library works in a similar manner. Madsim provides deterministic implementations of
 Tokio’s asynchronous runtime API, AWS’s S3 library, Kafka’s Rust library, and many others.
 Applications can swap in deterministic libraries and runtimes to get deterministic test executions
@ -1710,8 +1654,7 @@ node to be falsely suspected of crashing. Handling limping nodes, which are resp
 slow to do anything useful, is even harder.
 Once a fault is detected, making a system tolerate it is not easy either: there is no global
-variable, no shared memory, no common knowledge or any other kind of shared state between the
+variable, no shared memory, no common knowledge or any other kind of shared state between the machines [^83].
 machines [^83].
 Nodes can’t even agree on what time it is, let alone on anything more profound. The only way
 information can flow from one node to another is by sending it over the unreliable network. Major
 decisions cannot be safely made by a single node, so we require protocols that enlist help from
@ -1722,8 +1665,7 @@ where the same operation always deterministically returns the same result, then
 physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems
 engineers will often regard a problem as trivial if it can be solved on a single computer [^4],
 and indeed a single computer can do a lot nowadays. If you can avoid opening Pandora’s box and
-simply keep things on a single machine, for example by using an embedded storage engine (see
+simply keep things on a single machine, for example by using an embedded storage engine (see [“Embedded storage engines”](/en/ch4#sidebar_embedded)), it is generally worth doing so.
 [“Embedded storage engines”](/en/ch4#sidebar_embedded)), it is generally worth doing so.
 However, as discussed in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), scalability is not the only reason for
 wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically