add ddia 11-14

2026-06-21 00:47:05 +08:00 · 2026-02-15 10:03:16 +08:00
commit a320e1b551
parent 181eb7970d
36 changed files with 9599 additions and 946 deletions
--- a/.github/workflows/pages.yaml
+++ b/.github/workflows/pages.yaml
@@ -31,7 +31,7 @@ jobs:
  build:
    runs-on: ubuntu-latest
    env:
-      HUGO_VERSION: 0.147.7
+      HUGO_VERSION: 0.155.3
    steps:
      - name: Checkout
        uses: actions/checkout@v4
@@ -41,7 +41,7 @@ jobs:
      - name: Setup Go
        uses: actions/setup-go@v5
        with:
-          go-version: '1.24'
+          go-version: '1.26'
      - name: Setup Pages
        id: pages
        uses: actions/configure-pages@v4
--- a/README.md
+++ b/README.md
@@ -67,9 +67,10 @@
  - [9. The Trouble with Distributed Systems](https://ddia.vonng.com/ch9)
  - [10. Consistency and Consensus](https://ddia.vonng.com/ch10)
 * [Part III: Derived Data](https://ddia.vonng.com/part-iii)
-  - [11. Batch Processing](https://ddia.vonng.com/ch11) (not yet published)
-  - [12. Stream Processing](https://ddia.vonng.com/ch12) (not yet published)
-  - [13. Doing the Right Thing](https://ddia.vonng.com/ch13) (not yet published)
+  - [11. Batch Processing](https://ddia.vonng.com/ch11)
+  - [12. Stream Processing](https://ddia.vonng.com/ch12)
+  - [13. A Philosophy of Streaming Systems](https://ddia.vonng.com/ch13)
+  - [14. Doing the Right Thing](https://ddia.vonng.com/ch14)
 * [Glossary](https://ddia.vonng.com/glossary)
 * [Colophon](https://ddia.vonng.com/colophon)

--- a/content/en/_index.md
+++ b/content/en/_index.md
@@ -49,9 +49,10 @@ breadcrumbs: false
  - [10. Consistency and Consensus](/en/ch10)

 ### [Part III: Derived Data](/en/part-iii)
-  - [11. Batch Processing](/en/ch11) (WIP)
-  - [12. Stream Processing](/en/ch12) (WIP)
-  - [13. Doing the Right Thing](/en/ch13) (WIP)
+  - [11. Batch Processing](/en/ch11)
+  - [12. Stream Processing](/en/ch12)
+  - [13. A Philosophy of Streaming Systems](/en/ch13)
+  - [14. Doing the Right Thing](/en/ch14)

 ### [Glossary](/en/glossary)

--- a/content/en/ch1.md
+++ b/content/en/ch1.md
@@ -4,6 +4,8 @@ weight: 101
 breadcrumbs: false
 ---

+<a id="ch_tradeoffs"></a>
+
 > *There are no solutions, there are only trade-offs. […] But you try to get the best
 > trade-off you can get, and that’s all you can hope for.*
 >
@@ -156,7 +158,7 @@ the term *transaction* nevertheless stuck, referring to a group of reads and wri
 logical unit.

 > [!NOTE]
-> [Chapter 8](/en/ch8#ch_transactions) explores in detail what we mean with a transaction. This chapter uses the term
+> [Chapter 8](/en/ch8#ch_transactions) explores in detail what we mean with a transaction. This chapter uses the term
 > loosely to refer to low-latency reads and writes.

 Even though databases started being used for many different kinds of data—posts on social media,
@@ -179,7 +181,7 @@ answer analytic queries such as:
 The reports that result from these types of queries are important for business intelligence, helping
 the management decide what to do next. In order to differentiate this pattern of using databases
 from transaction processing, it has been called *online analytic processing* (OLAP) [^5].
-The difference between OLTP and analytics is not always clear-cut, but some typical characteristics are listed in [Table 1-1](/en/ch1#tab_oltp_vs_olap).
+The difference between OLTP and analytics is not always clear-cut, but some typical characteristics are listed in [Table 1-1](/en/ch1#tab_oltp_vs_olap).

 {{< figure id="tab_oltp_vs_olap" title="Table 1-1. Comparing characteristics of operational and analytic systems" class="w-full my-4" >}}

@@ -241,14 +243,14 @@ systems, for several reasons:

 A *data warehouse*, by contrast, is a separate database that analysts can query to their hearts’
 content, without affecting OLTP operations [^7].
-As we shall see in [Chapter 4](/en/ch4#ch_storage), data warehouses often store data in a way that is very different
+As we shall see in [Chapter 4](/en/ch4#ch_storage), data warehouses often store data in a way that is very different
 from OLTP databases, in order to optimize for the types of queries that are common in analytics.

 The data warehouse contains a read-only copy of the data in all the various OLTP systems in the
 company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous
 stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into
 the data warehouse. This process of getting data into the data warehouse is known as
-*Extract–Transform–Load* (ETL) and is illustrated in [Figure 1-1](/en/ch1#fig_dwh_etl). Sometimes the order of the
+*Extract–Transform–Load* (ETL) and is illustrated in [Figure 1-1](/en/ch1#fig_dwh_etl). Sometimes the order of the
 *transform* and *load* steps is swapped (i.e., the transformation is done in the data warehouse,
 after loading), resulting in *ELT*.

@@ -287,7 +289,7 @@ scale, the more specialized systems tend to become [^11].
 #### From data warehouse to data lake {#from-data-warehouse-to-data-lake}

 A data warehouse often uses a *relational* data model that is queried through SQL (see
-[Chapter 3](/en/ch3#ch_datamodels)), perhaps using specialized business intelligence software. This model works well
+[Chapter 3](/en/ch3#ch_datamodels)), perhaps using specialized business intelligence software. This model works well
 for the types of queries that business analysts need to make, but it is less well suited to the
 needs of data scientists, who might need to perform tasks such as:

@@ -313,7 +315,7 @@ data scientists. The answer is a *data lake*: a centralized data repository that
 data that might be useful for analysis, obtained from operational systems via ETL processes. The
 difference from a data warehouse is that a data lake simply contains files, without imposing any
 particular file format or data model. Files in a data lake might be collections of database records,
-encoded using a file format such as Avro or Parquet (see [Chapter 5](/en/ch5#ch_encoding)), but they can equally well
+encoded using a file format such as Avro or Parquet (see [Chapter 5](/en/ch5#ch_encoding)), but they can equally well
 contain text, images, videos, sensor readings, sparse matrices, feature vectors, genome sequences,
 or any other kind of data [^15].
 Besides being more flexible, this is also often cheaper than relational data storage, since the data
@@ -340,10 +342,10 @@ As analytics practices have matured, organizations have been increasingly paying
 management and operations of analytics systems and data pipelines, as captured for example in the
 DataOps manifesto [^18].
 Part of this are issues of governance, privacy, and compliance with regulation such as GDPR and
-CCPA, which we discuss in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance) and [Link to Come].
+CCPA, which we discuss in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance) and [“Legislation and Self-Regulation”](/en/ch14#sec_future_legislation).

 Moreover, analytical data is increasingly made available not only as files and relational tables,
-but also as streams of events (see [Link to Come]). With file-based data analysis you can re-run the
+but also as streams of events (see [Chapter 12](/en/ch12#ch_stream)). With file-based data analysis you can re-run the
 analysis periodically (e.g., daily) in order to respond to changes in the data, but stream processing
 allows analytics systems to respond to events much faster, on the order of seconds. Depending on the
 application and how time-sensitive it is, a stream processing approach can be valuable, for example
@@ -398,7 +400,7 @@ When the data in one system is derived from the data in another, you need a proc
 derived data when the original in the system of record changes. Unfortunately, many databases are
 designed based on the assumption that your application only ever needs to use that one database, and
 they do not make it easy to integrate multiple systems in order to propagate such updates. In
-[Link to Come] we will discuss approaches to *data integration*, which allow us to compose multiple
+[“Data Integration”](/en/ch13#sec_future_integration) we will discuss approaches to *data integration*, which allow us to compose multiple
 data systems to achieve things that one system alone cannot do.

 That brings us to the end of our comparison of analytics and transaction processing. In the next
@@ -420,7 +422,7 @@ energy company, and leaving aside emergency backup power), since it is cheaper t

 With software, two important decisions to be made are who builds the software and who deploys it.
 There is a spectrum of possibilities that outsource each decision to various degrees, as illustrated
-in [Figure 1-2](/en/ch1#fig_cloud_spectrum). At one extreme is bespoke software that you write and run in-house; at
+in [Figure 1-2](/en/ch1#fig_cloud_spectrum). At one extreme is bespoke software that you write and run in-house; at
 the other extreme are widely-used cloud services or Software as a Service (SaaS) products that are
 implemented and operated by an external vendor, and which you only access through a web interface or API.

@@ -519,9 +521,9 @@ and indeed such managed services are now available for many popular data systems
 that have been designed from the ground up to be cloud-native have been shown to have several
 advantages: better performance on the same hardware, faster recovery from failures, being able to
 quickly scale computing resources to match the load, and supporting larger datasets [^25] [^26] [^27].
-[Table 1-2](/en/ch1#tab_cloud_native_dbs) lists some examples of both types of systems.
+[Table 1-2](/en/ch1#tab_cloud_native_dbs) lists some examples of both types of systems.

-{{< figure id="#tab_cloud_native_dbs" title="Table 1-2. Examples of self-hosted and cloud-native database systems" class="w-full my-4" >}}
+{{< figure id="tab_cloud_native_dbs" title="Table 1-2. Examples of self-hosted and cloud-native database systems" class="w-full my-4" >}}

 | Category         | Self-hosted systems         | Cloud-native systems                                                  |
 |------------------|-----------------------------|-----------------------------------------------------------------------|
@@ -580,7 +582,7 @@ As an alternative to local disks, cloud services also offer virtual disk storage
 detached from one instance and attached to a different one (Amazon EBS, Azure managed disks, and
 persistent disks in Google Cloud). Such a virtual disk is not actually a physical disk, but rather a
 cloud service provided by a separate set of machines, which emulates the behavior of a disk (a
-*block device*, where each block is typically 4 KiB in size). This technology makes it
+*block device*, where each block is typically 4 KiB in size). This technology makes it
 possible to run traditional disk-based software in the cloud, but the block device emulation
 introduces overheads that can be avoided in systems that are designed from the ground up for the cloud [^25]. It also makes the application
 very sensitive to network glitches, since every I/O on the virtual block device is actually a network call [^28].
@@ -591,7 +593,7 @@ services such as S3 are designed for long-term storage of fairly large files, ra
 of kilobytes to several gigabytes in size. The individual rows or values stored in a database are
 typically much smaller than this; cloud databases therefore typically manage smaller values in a
 separate service, and store larger data blocks (containing many individual values) in an object
-store [^26] [^29]. We will see ways of doing this in [Chapter 4](/en/ch4#ch_storage).
+store [^26] [^29]. We will see ways of doing this in [Chapter 4](/en/ch4#ch_storage).

 In a traditional systems architecture, the same computer is responsible for both storage (disk) and
 computation (CPU and RAM), but in cloud-native systems, these two responsibilities have become
@@ -691,7 +693,7 @@ Fault tolerance/high availability
 :   If your application needs to continue working even if one machine (or several machines, or
    the network, or an entire datacenter) goes down, you can use multiple machines to give you
    redundancy. When one fails, another one can take over. See [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability) and
-    [Chapter 6](/en/ch6#ch_replication) on replication.
+    [Chapter 6](/en/ch6#ch_replication) on replication.

 Scalability
 :   If your data volume or computing requirements grow bigger than a single machine can handle,
@@ -739,7 +741,7 @@ Distributed systems also have downsides. Every request and API call that goes vi
 to deal with the possibility of failure: the network may be interrupted, or the service may be
 overloaded or crashed, and therefore any request may time out without receiving a response. In this
 case, we don’t know whether the service received the request, and simply retrying it might not be
-safe. We will discuss these problems in detail in [Chapter 9](/en/ch9#ch_distributed).
+safe. We will discuss these problems in detail in [Chapter 9](/en/ch9#ch_distributed).

 Although datacenter networks are fast, making a call to another service is still vastly slower than
 calling a function in the same process [^44].
@@ -760,9 +762,9 @@ as OpenTelemetry, Zipkin, and Jaeger allow you to track which client called whic
 operation, and how long each call took [^49].

 Databases provide various mechanisms for ensuring data consistency, as we shall see in
-[Chapter 6](/en/ch6#ch_replication) and [Chapter 8](/en/ch8#ch_transactions). However, when each service has its own database,
+[Chapter 6](/en/ch6#ch_replication) and [Chapter 8](/en/ch8#ch_transactions). However, when each service has its own database,
 maintaining consistency of data across those different services becomes the application’s problem.
-Distributed transactions, which we explore in [Chapter 8](/en/ch8#ch_transactions), are a possible technique for
+Distributed transactions, which we explore in [Chapter 8](/en/ch8#ch_transactions), are a possible technique for
 ensuring consistency, but they are rarely used in a microservices context because they run counter
 to the goal of making services independent from each other, and many databases don’t support them [^50].

@@ -770,7 +772,7 @@ For all these reasons, if you can do something on a single machine, this is ofte
 cheaper compared to setting up a distributed system [^23] [^46] [^51].
 CPUs, memory, and disks have grown larger, faster, and more reliable. When combined with single-node
 databases such as DuckDB, SQLite, and KùzuDB, many workloads can now run on a single node. We will
-explore more on this topic in [Chapter 4](/en/ch4#ch_storage).
+explore more on this topic in [Chapter 4](/en/ch4#ch_storage).

 ### Microservices and Serverless {#sec_introduction_microservices}

@@ -807,7 +809,7 @@ certain fields. Developers might wish to add or remove fields to an API as busin
 but doing so can cause clients to fail. Worse still, such failures are often not discovered until
 late in the development cycle when the updated service API is deployed to a staging or production
 environment. API description standards such as OpenAPI and gRPC help manage the relationship between
-client and server APIs; we discuss these further in [Chapter 5](/en/ch5#ch_encoding).
+client and server APIs; we discuss these further in [Chapter 5](/en/ch5#ch_encoding).

 Microservices are primarily a technical solution to a people problem: allowing different teams to
 make progress independently without having to coordinate with each other. This is valuable in a large
@@ -937,7 +939,7 @@ Service Organization Control (SOC) Type 2 standards. As with PCI compliance, ven
 party audits to verify adherence.

 Generally, it is important to balance the needs of your business against the needs of the people
-whose data you are collecting and processing. There is much more to this topic; in [Link to Come] we
+whose data you are collecting and processing. There is much more to this topic; in [Chapter 14](/en/ch14#ch_right_thing) we
 will go deeper into the topics of ethics and legal compliance, including the problems of bias and
 discrimination.

@@ -952,7 +954,7 @@ We started by making a distinction between operational (transaction-processing,
 (OLAP) systems, and saw their different characteristics: not only managing different types of data
 with different access patterns, but also serving different audiences. We encountered the concept of
 a data warehouse and data lake, which receive data feeds from operational systems via ETL. In
-[Chapter 4](/en/ch4#ch_storage) we will see that operational and analytical systems often use very different internal
+[Chapter 4](/en/ch4#ch_storage) we will see that operational and analytical systems often use very different internal
 data layouts because of the different types of queries they need to serve.

 We then compared cloud services, a comparatively recent development, to the traditional paradigm of
@@ -964,7 +966,7 @@ example in the way they separate storage and compute.
 Cloud systems are intrinsically distributed, and we briefly examined some of the trade-offs of
 distributed systems compared to using a single machine. There are situations in which you can’t
 avoid going distributed, but it’s advisable not to rush into making a system distributed if it’s
-possible to keep it on a single machine. In [Chapter 9](/en/ch9#ch_distributed) we will cover the challenges with
+possible to keep it on a single machine. In [Chapter 9](/en/ch9#ch_distributed) we will cover the challenges with
 distributed systems in more detail.

 Finally, we saw that data systems architecture is determined not only by the needs of the business
@@ -1038,4 +1040,3 @@ this question in mind as we move through the rest of this book.
 [^61]: Supreeth Shastri, Vinay Banakar, Melissa Wasserman, Arun Kumar, and Vijay Chidambaram. [Understanding and Benchmarking the Impact of GDPR on Database Systems](https://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 7, pages 1064–1077, March 2020. [doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354)
 [^62]: Martin Fowler. [Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html). *martinfowler.com*, December 2013. Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6)
 [^63]: [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016.
-
--- a/content/en/ch10.md
+++ b/content/en/ch10.md
@@ -4,18 +4,20 @@ weight: 210
 breadcrumbs: false
 ---

+<a id="ch_consistency"></a>
+
 ![](/map/ch09.png)

 > *An ancient adage warns, “Never go to sea with two chronometers; take one or three.”*
 >
 > Frederick P. Brooks Jr., *The Mythical Man-Month: Essays on Software Engineering* (1995)

-Lots of things can go wrong in distributed systems, as discussed in [Chapter 9](/en/ch9#ch_distributed). If we want a
+Lots of things can go wrong in distributed systems, as discussed in [Chapter 9](/en/ch9#ch_distributed). If we want a
 service to continue working correctly despite those things going wrong, we need to find ways of
 tolerating faults.

 One of the best tools we have for fault tolerance is *replication*. However, as we saw in
-[Chapter 6](/en/ch6#ch_replication), having multiple copies of the data on multiple replicas opens up the risk of
+[Chapter 6](/en/ch6#ch_replication), having multiple copies of the data on multiple replicas opens up the risk of
 inconsistencies. Reads might be handled by a replica that is not up-to-date, yielding stale results.
 If multiple replicas can accept writes, we have to deal with conflicts between values that were
 concurrently written on different replicas. At a high level, there are two competing philosophies
@@ -87,7 +89,7 @@ guarantee*. To clarify this idea, let’s look at an example of a system that is

 {{< figure src="/fig/ddia_1001.png" id="fig_consistency_linearizability_0" caption="Figure 10-1. If this database were linearizable, then either Alice's read would return 1 instead of 0, or Bob's read would return 0 instead of 1." class="w-full my-4" >}}

-[Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4].
+[Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4].
 Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a
 game their favorite team is playing. Just after the final score is announced, Aaliyah refreshes the
 page, sees the winner announced, and excitedly tells Bryce about it. Bryce incredulously hits
@@ -104,7 +106,7 @@ violation of linearizability.
 ### What Makes a System Linearizable? {#sec_consistency_lin_definition}

 In order to understand linearizability better, let’s look at some more examples.
-[Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows three clients concurrently reading and writing the same
+[Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows three clients concurrently reading and writing the same
 object *x* in a linearizable database. In distributed systems theory, *x* is called a *register*—in
 practice, it could be one key in a key-value store, one row in a relational database, or one
 document in a document database, for example.
@@ -112,7 +114,7 @@ document in a document database, for example.
 {{< figure src="/fig/ddia_1002.png" id="fig_consistency_linearizability_1" caption="Figure 10-2. Alice observes that x = 0 and y = 1, while Bob observes that x = 1 and y = 0. It's as if Alice's and Bob's computers disagree on the order in which the writes happened." class="w-full my-4" >}}


-For simplicity, [Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows only the requests from the clients’
+For simplicity, [Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows only the requests from the clients’
 point of view, not the internals of the database. Each bar is a request made by a client, where the
 start of a bar is the time when the request was sent, and the end of a bar is when the response was
 received by the client. Due to variable network delays, a client doesn’t know exactly when the
@@ -121,12 +123,12 @@ client sending the request and receiving the response.

 In this example, the register has two types of operations:

-* *read*(*x*) ⇒ *v* means the client requested to read the value of register
+* *read*(*x*) ⇒ *v* means the client requested to read the value of register
 *x*, and the database returned the value *v*.
-* *write*(*x*, *v*) ⇒ *r* means the client requested to set the
+* *write*(*x*, *v*) ⇒ *r* means the client requested to set the
 register *x* to value *v*, and the database returned response *r* (which could be *ok* or *error*).

-In [Figure 10-2](/en/ch10#fig_consistency_linearizability_1), the value of *x* is initially 0, and client C performs a
+In [Figure 10-2](/en/ch10#fig_consistency_linearizability_1), the value of *x* is initially 0, and client C performs a
 write request to set it to 1. While this is happening, clients A and B are repeatedly polling the
 database to read the latest value. What are the possible responses that A and B might get for their
 read requests?
@@ -146,7 +148,7 @@ and forth between the old and the new value several times while a write is going
 what we expect of a system that emulates a “single copy of the data.”

 To make the system linearizable, we need to add another constraint, illustrated in
-[Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
+[Figure 10-3](/en/ch10#fig_consistency_linearizability_2).

 {{< figure src="/fig/ddia_1003.png" id="fig_consistency_linearizability_2" caption="Figure 10-3. If Alice and Bob had perfect clocks, linearizability would require that x = 1 is returned, since the read of x begins after the write x = 1 completes." class="w-full my-4" >}}

@@ -156,25 +158,25 @@ of the write operation) at which the value of *x* atomically flips from 0 to 1.
 client’s read returns the new value 1, all subsequent reads must also return the new value, even if
 the write operation has not yet completed.

-This timing dependency is illustrated with an arrow in [Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
+This timing dependency is illustrated with an arrow in [Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
 Client A is the first to read the new value, 1. Just after A’s read returns, B begins a new read.
 Since B’s read occurs strictly after A’s read, it must also return 1, even though the write by C is
 still ongoing. (It’s the same situation as with Aaliyah and Bryce in
-[Figure 10-1](/en/ch10#fig_consistency_linearizability_0): after Aaliyah has read the new value, Bryce also expects to
+[Figure 10-1](/en/ch10#fig_consistency_linearizability_0): after Aaliyah has read the new value, Bryce also expects to
 read the new value.)

 We can further refine this timing diagram to visualize each operation taking effect atomically at
 some point in time [^5],
-like in the more complex example shown in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3). In this example we
+like in the more complex example shown in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3). In this example we
 add a third type of operation besides *read* and *write*:

-* *cas*(*x*, *v*old, *v*new) ⇒ *r* means the client
+* *cas*(*x*, *v*old, *v*new) ⇒ *r* means the client
 requested an atomic *compare-and-set* operation (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)). If the
 current value of the register *x* equals *v*old, it should be atomically set to *v*new. If
 the value of *x* is different from *v*old, then the operation should leave the register
 unchanged and return an error. *r* is the database’s response (*ok* or *error*).

-Each operation in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3) is marked with a vertical line (inside the
+Each operation in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3) is marked with a vertical line (inside the
 bar for each operation) at the time when we think the operation was executed. Those markers are
 joined up in a sequential order, and the result must be a valid sequence of reads and writes for a
 register (every read must return the value set by the most recent write).
@@ -187,7 +189,7 @@ that was written, until it is overwritten again.
 {{< figure src="/fig/ddia_1004.png" id="fig_consistency_linearizability_3" caption="Figure 10-4. The read of x is concurrent with the write x = 1. Since we don't know the exact timing of the operations, the read is allowed to return either 0 or 1." class="w-full my-4" >}}


-There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3):
+There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3):

 * First client B sent a request to read *x*, then client D sent a request to set *x* to 0, and then
 client A sent a request to set *x* to 1. Nevertheless, the value returned to B’s read is 1 (the
@@ -207,7 +209,7 @@ There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_
 C’s *cas* write, which updates *x* from 2 to 4. In the absence of other requests, it would be okay for
 B’s read to return 2. However, client A has already read the new value 4 before B’s read started,
 so B is not allowed to read an older value than A. Again, it’s the same situation as with Aaliyah
- and Bryce in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0).
+ and Bryce in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0).

 That is the intuition behind linearizability; the formal definition [^1] describes it more precisely. It is
 possible (though computationally expensive) to test whether a system’s behavior is linearizable by
@@ -225,6 +227,8 @@ which is the strongest consistency model in common use.

 --------

+<a id="sidebar_consistency_serializability"></a>
+
 > [!TIP] LINEARIZABILITY VERSUS SERIALIZABILITY

 Linearizability is easily confused with serializability (see [“Serializability”](/en/ch8#sec_transactions_serializability)),
@@ -325,7 +329,7 @@ nodes agree on.
 In real applications, it is sometimes acceptable to treat such constraints loosely (for example, if
 a flight is overbooked, you can move customers to a different flight and offer them compensation for
 the inconvenience). In such cases, linearizability may not be needed, and we will discuss such
-loosely interpreted constraints in [Link to Come].
+loosely interpreted constraints in [“Timeliness and Integrity”](/en/ch13#sec_future_integrity).

 However, a hard uniqueness constraint, such as the one you typically find in relational databases,
 requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints,
@@ -333,7 +337,7 @@ can be implemented without linearizability [^20].

 #### Cross-channel timing dependencies {#cross-channel-timing-dependencies}

-Notice a detail in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0): if Aaliyah hadn’t exclaimed the score,
+Notice a detail in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0): if Aaliyah hadn’t exclaimed the score,
 Bryce wouldn’t have known that the result of his query was stale. He would have just refreshed the
 page again a few seconds later, and eventually seen the final score. The linearizability violation
 was only noticed because there was an additional communication channel in the system (Aaliyah’s
@@ -342,10 +346,10 @@ voice to Bryce’s ears).
 Similar situations can arise in computer systems. For example, say you have a website where users
 can upload a video, and a background process transcodes the video to a lower quality that can be
 streamed on slow internet connections. The architecture and dataflow of this system is illustrated
-in [Figure 10-5](/en/ch10#fig_consistency_transcoder).
+in [Figure 10-5](/en/ch10#fig_consistency_transcoder).

 The video transcoder needs to be explicitly instructed to perform a transcoding job, and this
-instruction is sent from the web server to the transcoder via a message queue (see [Link to Come]).
+instruction is sent from the web server to the transcoder via a message queue (see [“Messaging Systems”](/en/ch12#sec_stream_messaging)).
 The web server doesn’t place the entire video on the queue, since most message brokers are designed
 for small messages, and a video may be many megabytes in size. Instead, the video is first written
 to a file storage service, and once the write is complete, the instruction to the transcoder is
@@ -356,7 +360,7 @@ placed on the queue.

 If the file storage service is linearizable, then this system should work fine. If it is not
 linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in
-[Figure 10-5](/en/ch10#fig_consistency_transcoder)) might be faster than the internal replication inside the storage
+[Figure 10-5](/en/ch10#fig_consistency_transcoder)) might be faster than the internal replication inside the storage
 service. In this case, when the transcoder fetches the original video (step 5), it might see an old
 version of the file, or nothing at all. If it processes an old version of the video, the original
 and transcoded videos in the file storage become permanently inconsistent with each other.
@@ -364,7 +368,7 @@ and transcoded videos in the file storage become permanently inconsistent with e
 This problem arises because there are two different communication channels between the web server
 and the transcoder: the file storage and the message queue. Without the recency guarantee of
 linearizability, race conditions between these two channels are possible. This situation is
-analogous to [Figure 10-1](/en/ch10#fig_consistency_linearizability_0), where there was also a race condition between
+analogous to [Figure 10-1](/en/ch10#fig_consistency_linearizability_0), where there was also a race condition between
 two communication channels: the database replication and the real-life audio channel between
 Aaliyah’s mouth and Bryce’s ears.

@ -389,7 +393,7 @@ and all operations on it are atomic,” the simplest answer would be to really o
 of the data. However, that approach would not be able to tolerate faults: if the node holding that
 one copy failed, the data would be lost, or at least inaccessible until the node was brought up again.

-Let’s revisit the replication methods from [Chapter 6](/en/ch6#ch_replication), and compare whether they can be made linearizable:
+Let’s revisit the replication methods from [Chapter 6](/en/ch6#ch_replication), and compare whether they can be made linearizable:

 Single-leader replication (potentially linearizable)
 : In a system with single-leader replication, the leader has the primary copy of the data that is
@ -423,7 +427,7 @@ Multi-leader replication (not linearizable)
 Leaderless replication (probably not linearizable)
 : For systems with leaderless replication (Dynamo-style; see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)), people
 sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes
- (*w* + *r* > *n*). Depending on the exact algorithm, and depending on how you define
+ (*w* + *r* > *n*). Depending on the exact algorithm, and depending on how you define
 strong consistency, this is not quite true.

 “Last write wins” conflict resolution methods based on time-of-day clocks (e.g., in Cassandra and
@ -435,21 +439,21 @@ Leaderless replication (probably not linearizable)

 Intuitively, it seems as though quorum reads and writes should be linearizable in a
 Dynamo-style model. However, when we have variable network delays, it is possible to have race
-conditions, as demonstrated in [Figure 10-6](/en/ch10#fig_consistency_leaderless).
+conditions, as demonstrated in [Figure 10-6](/en/ch10#fig_consistency_leaderless).

 {{< figure src="/fig/ddia_1006.png" id="fig_consistency_leaderless" caption="Figure 10-6. Quorums are not sufficient to ensure linearizability if network delays are variable." class="w-full my-4" >}}


-In [Figure 10-6](/en/ch10#fig_consistency_leaderless), the initial value of *x* is 0, and a writer client is updating
-*x* to 1 by sending the write to all three replicas (*n* = 3, *w* = 3).
-Concurrently, client A reads from a quorum of two nodes (*r* = 2) and sees the new value 1
+In [Figure 10-6](/en/ch10#fig_consistency_leaderless), the initial value of *x* is 0, and a writer client is updating
+*x* to 1 by sending the write to all three replicas (*n* = 3, *w* = 3).
+Concurrently, client A reads from a quorum of two nodes (*r* = 2) and sees the new value 1
 on one of the nodes. Also concurrently with the write, client B reads from a different quorum of two
 nodes, and gets back the old value 0 from both.

-The quorum condition is met (*w* + *r* > *n*), but this execution is nevertheless not
+The quorum condition is met (*w* + *r* > *n*), but this execution is nevertheless not
 linearizable: B’s request begins after A’s request completes, but B returns the old value while A
 returns the new value. (It’s once again the Aaliyah and Bryce situation from
-[Figure 10-1](/en/ch10#fig_consistency_linearizability_0).)
+[Figure 10-1](/en/ch10#fig_consistency_linearizability_0).)
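
The race described above can be replayed in a few lines. This is a minimal sketch, not the algorithm of any particular database: the replica names `r1` to `r3`, the `(version, value)` pairs, and the order in which the write is delivered are all assumptions made for illustration.

```python
# Quorum parameters from the example: n replicas, write quorum w, read quorum r.
n, w, r = 3, 3, 2

# Each replica stores (version, value); x starts at 0 everywhere.
replicas = {"r1": (0, 0), "r2": (0, 0), "r3": (0, 0)}


def quorum_read(nodes):
    """Read from the given nodes and return the value with the highest version."""
    assert len(nodes) >= r
    return max(replicas[node] for node in nodes)[1]


# The write x = 1 is sent to all three replicas, but because of variable
# network delays it has so far been delivered to r1 only (assumed order).
replicas["r1"] = (1, 1)

# Client A reads from a quorum that happens to include r1.
a_result = quorum_read(["r1", "r2"])

# Client B starts after A has finished, but reads from a different quorum.
# The write is still in flight to r2 and r3, so both return the old value.
b_result = quorum_read(["r2", "r3"])

# Only now does the write reach the remaining replicas.
replicas["r2"] = (1, 1)
replicas["r3"] = (1, 1)

print(w + r > n, a_result, b_result)
```

The quorum condition holds throughout, yet B reads the old value after A has already read the new one, which is exactly the ordering that linearizability forbids.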

 It is possible to make Dynamo-style quorums linearizable at the cost of reduced
 performance: a reader must perform read repair (see [“Catching up on missed writes”](/en/ch6#sec_replication_read_repair)) synchronously,
@ -471,10 +475,10 @@ provide linearizability, even with quorum reads and writes.
 As some replication methods can provide linearizability and others cannot, it is interesting to
 explore the pros and cons of linearizability in more depth.

-We already discussed some use cases for different replication methods in [Chapter 6](/en/ch6#ch_replication); for
+We already discussed some use cases for different replication methods in [Chapter 6](/en/ch6#ch_replication); for
 example, we saw that multi-leader replication is often a good choice for multi-region
 replication (see [“Geographically Distributed Operation”](/en/ch6#sec_replication_multi_dc)). An example of such a deployment is illustrated in
-[Figure 10-7](/en/ch10#fig_consistency_cap_availability).
+[Figure 10-7](/en/ch10#fig_consistency_cap_availability).

 {{< figure src="/fig/ddia_1007.png" id="fig_consistency_cap_availability" caption="Figure 10-7. If clients cannot contact enough replicas due to a network partition, they cannot process writes." class="w-full my-4" >}}

@ -600,7 +604,7 @@ proportional to the uncertainty of delays in the network. In a network with high
 like most computer networks (see [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing)), the response time of linearizable
 reads and writes is inevitably going to be high. A faster algorithm for linearizability does not
 exist, but weaker consistency models can be much faster, so this trade-off is important for
-latency-sensitive systems. In [Link to Come] we will discuss some approaches for avoiding
+latency-sensitive systems. In [“Timeliness and Integrity”](/en/ch13#sec_future_integrity) we will discuss some approaches for avoiding
 linearizability without sacrificing correctness.


@ -613,7 +617,7 @@ stored in only 64 bits (or even 32 bits if you are sure that you will never have
 records, but that is risky).

 Another advantage of such auto-incrementing IDs is that the order of the IDs tells you the order in
-which the records were created. For example, [Figure 10-8](/en/ch10#fig_consistency_id_generator) shows a chat
+which the records were created. For example, [Figure 10-8](/en/ch10#fig_consistency_id_generator) shows a chat
 application that assigns auto-incrementing IDs to chat messages as they are posted. You can then
 display the messages in order of increasing ID, and the resulting chat threads will make sense:
 Aaliyah posts a question that is assigned ID 1, and Bryce’s answer to the question is assigned a
@ -626,7 +630,7 @@ This single-node ID generator is another example of a linearizable system. Each
 ID is an operation that atomically increments a counter and returns the old counter value (a
 *fetch-and-add* operation); linearizability ensures that if the posting of Aaliyah’s message
 completes before Bryce’s posting begins, then Bryce’s ID must be greater than Aaliyah’s. The
-messages by Aaliyah and Caleb in [Figure 10-8](/en/ch10#fig_consistency_id_generator) are concurrent, so linearizability
+messages by Aaliyah and Caleb in [Figure 10-8](/en/ch10#fig_consistency_id_generator) are concurrent, so linearizability
 doesn’t specify how their IDs must be ordered, as long as they are unique.
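
As a rough illustration of the fetch-and-add behavior just described, the sketch below guards a counter with a lock so that reading and incrementing happen as one step. The class name `IdGenerator`, the starting value of 1, and the thread counts are assumptions chosen for the example, not details taken from the text.

```python
import threading


class IdGenerator:
    """Single-node, in-memory ID generator (hypothetical helper)."""

    def __init__(self, first_id=1):
        self._counter = first_id
        self._lock = threading.Lock()

    def fetch_and_add(self):
        # Holding the lock makes "read the counter, increment it, return the
        # old value" a single atomic step.
        with self._lock:
            old_value = self._counter
            self._counter += 1
            return old_value


generator = IdGenerator()

# Non-overlapping requests: the first call completes before the second begins,
# so the second ID must be greater.
aaliyah_id = generator.fetch_and_add()
bryce_id = generator.fetch_and_add()


def post_many(out, count):
    for _ in range(count):
        out.append(generator.fetch_and_add())


# Concurrent requests: the relative order is unspecified, but IDs stay unique.
results = [[] for _ in range(4)]
threads = [
    threading.Thread(target=post_many, args=(results[i], 1000)) for i in range(4)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
concurrent_ids = [i for ids in results for i in ids]
```

Because the counter lives in the memory of one process, this sketch loses its state on a crash and offers no fault tolerance.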

 An in-memory single-node ID generator is easy to implement: you can use the atomic increment
@ -720,9 +724,9 @@ causality, and which you can use as a distributed ID generator. It is called a *
 proposed in 1978 by Leslie Lamport [^54],
 in what is now one of the most-cited papers in the field of distributed systems.

-[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) shows how a Lamport clock would work in the chat example of
-[Figure 10-8](/en/ch10#fig_consistency_id_generator). Each node has a unique identifier, which in
-[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) is the name “Aaliyah”, “Bryce”, or “Caleb”, but which in practice
+[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) shows how a Lamport clock would work in the chat example of
+[Figure 10-8](/en/ch10#fig_consistency_id_generator). Each node has a unique identifier, which in
+[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) is the name “Aaliyah”, “Bryce”, or “Caleb”, but which in practice
 could be a random UUID or something similar. Moreover, each node keeps a counter of the number of
 operations it has processed. A Lamport timestamp is then simply a pair of (*counter*, *node ID*).
 Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp,
@ -735,7 +739,7 @@ Every time a node generates a timestamp, it increments its counter value and use
 Moreover, every time a node sees a timestamp from another node, if the counter value in that
 timestamp is greater than its local counter value, it increases its local counter to match the value in the timestamp.

-In [Figure 10-9](/en/ch10#fig_consistency_lamport_ts), Aaliyah had not yet seen Caleb’s message when posting her own,
+In [Figure 10-9](/en/ch10#fig_consistency_lamport_ts), Aaliyah had not yet seen Caleb’s message when posting her own,
 and vice versa. Assuming both users start with an initial counter value of 0, both therefore
 increment their local counter and attach the new counter value of 1 to their message. When Bryce
 receives those messages, he increases his local counter value to 1. Finally, Bryce sends a reply to
@ -743,10 +747,10 @@ Aaliyah’s message, for which he increments his local counter and attaches the
 message.

 To compare two Lamport timestamps, we first compare their counter value: for example,
-(2, “Bryce”) is greater than (1, “Aaliyah”) and also greater than (1, “Caleb”). If
+(2, “Bryce”) is greater than (1, “Aaliyah”) and also greater than (1, “Caleb”). If
 two timestamps have the same counter, we compare their node IDs instead, using the usual
 lexicographic string comparison. Thus, the timestamp order in this example is
-(1, “Aaliyah”) < (1, “Caleb”) < (2, “Bryce”).
+(1, “Aaliyah”) < (1, “Caleb”) < (2, “Bryce”).
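
The rules above fit in a short sketch. The class and method names (`LamportClock`, `tick`, `observe`) are made up for illustration; timestamps are plain `(counter, node_id)` tuples, which Python compares element by element, so the counter is compared first and the node ID breaks ties lexicographically.

```python
from dataclasses import dataclass


@dataclass
class LamportClock:
    node_id: str
    counter: int = 0

    def tick(self):
        """Generate a timestamp: increment the counter and use the new value."""
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        """On seeing another node's timestamp, catch up if its counter is greater."""
        counter, _ = timestamp
        if counter > self.counter:
            self.counter = counter


aaliyah = LamportClock("Aaliyah")
bryce = LamportClock("Bryce")
caleb = LamportClock("Caleb")

# Aaliyah and Caleb post without having seen each other's message.
ts_aaliyah = aaliyah.tick()
ts_caleb = caleb.tick()

# Bryce receives both messages, then replies.
bryce.observe(ts_aaliyah)
bryce.observe(ts_caleb)
ts_bryce = bryce.tick()

# Tuple comparison gives the timestamp order described in the text.
ordered = sorted([ts_bryce, ts_caleb, ts_aaliyah])
print(ordered)
```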

 #### Hybrid logical clocks {#hybrid-logical-clocks}

@ -789,7 +793,7 @@ IDs, because they ensure that the snapshot is consistent with causality [^56].
 When multiple timestamps are generated concurrently, these algorithms order them arbitrarily. This
 means that when you look at two timestamps, you generally can’t tell whether they were generated
 concurrently or whether one happened before the other. (In the example of
-[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) you actually can tell that Aaliyah and Caleb’s messages must have
+[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) you actually can tell that Aaliyah and Caleb’s messages must have
 been concurrent, because they have the same counter value, but when the counter values are different
 you can’t tell whether they were concurrent.)

@ -807,7 +811,7 @@ the higher ID, even if A and B never communicated with each other. On the other
 can only ensure that a node generates timestamps that are greater than any other timestamp that node
 has seen, but it can’t say anything about timestamps that it hasn’t seen.

-[Figure 10-10](/en/ch10#fig_consistency_permissions) shows how a non-linearizable ID generator could cause problems.
+[Figure 10-10](/en/ch10#fig_consistency_permissions) shows how a non-linearizable ID generator could cause problems.
 Imagine a social media website where user A wants to share an embarrassing photo privately with
 their friends. A’s account is initially public, but using their laptop, A first changes their
 account settings to private. Then A uses their phone to upload the photo. Since A performed these
@ -917,7 +921,7 @@ It turns out that all of these are instances of the same fundamental distributed
 *consensus*. Consensus is one of the most important and fundamental problems in distributed
 computing; it is also infamously difficult to get right [^58] [^59],
 and many systems have got it wrong in the past. Now that we have discussed replication
-([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
+([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
 linearizability (this chapter), we are finally ready to tackle the consensus problem.

 The best-known consensus algorithms are Viewstamped Replication [^60] [^61], Paxos [^58] [^62] [^63] [^64],
@ -1243,7 +1247,7 @@ A shared log is a good fit for database replication: if every log entry represen
 database, and every replica processes the same writes in the same order using deterministic logic,
 then the replicas will all end up in a consistent state. This idea is known as *state machine replication* [^80],
 and it is the principle behind event sourcing, which we saw in [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events). Shared
-logs are also useful for stream processing, as we shall see in [Link to Come].
+logs are also useful for stream processing, as we shall see in [Chapter 12](/en/ch12#ch_stream).

 Similarly, a shared log can be used to implement serializable transactions: as discussed in
 [“Actual Serial Execution”](/en/ch8#sec_transactions_serial), if every log entry represents a deterministic transaction to be
@ -1355,7 +1359,7 @@ fails.

 If you drop the requirement for the new leader to be up-to-date, you may improve performance and
 availability, but you are on thin ice, since the theory of consensus no longer applies. While things
-will work fine as long as there are no faults, the problems discussed in [Chapter 9](/en/ch9#ch_distributed) can
+will work fine as long as there are no faults, the problems discussed in [Chapter 9](/en/ch9#ch_distributed) can
 easily cause a lot of data loss or corruption.

 --------
@ -1381,7 +1385,7 @@ one location to another (by first adding the new nodes, and then removing the ol
 Although they are complex and subtle, consensus algorithms are a huge breakthrough for distributed
 systems. Consensus is essentially “single-leader replication done right”, with automatic failover on
 leader failure, ensuring that no committed data is lost and no split-brain is possible, even in the
-face of all the problems we discussed in [Chapter 9](/en/ch9#ch_distributed).
+face of all the problems we discussed in [Chapter 9](/en/ch9#ch_distributed).

 Since single-leader replication with automatic failover is essentially one of the definitions of
 consensus, any system that provides automatic failover but does not use a proven consensus algorithm
@ -1413,7 +1417,7 @@ research problem.

 For systems that want to be highly available, but don’t want to accept the cost of consensus, the
 only real alternative is to use a weaker consistency model instead, such as those offered by
-leaderless or multi-leader replication as discussed in [Chapter 6](/en/ch6#ch_replication). These approaches
+leaderless or multi-leader replication as discussed in [Chapter 6](/en/ch6#ch_replication). These approaches
 generally don’t offer linearizability, but for applications that don’t need it that is fine.


@ -1617,14 +1621,14 @@ a coordination service. It won’t guarantee that you will get it right, but it

 Consensus algorithms are complicated and subtle, but they are supported by a rich body of theory
 that has been developed since the 1980s. This theory makes it possible to build systems that can
-tolerate all the faults that we discussed in [Chapter 9](/en/ch9#ch_distributed), and still ensure that your data is
+tolerate all the faults that we discussed in [Chapter 9](/en/ch9#ch_distributed), and still ensure that your data is
 not corrupted. This is an amazing achievement, and the references at the end of this chapter feature
 some of the highlights of this work.

 Nevertheless, consensus is not always the right tool: in some systems, the strong consistency
 properties it provides are not needed, and it is better to have weaker consistency with higher
 availability and better performance. In these cases, it is common to use leaderless or multi-leader
-replication, which we previously discussed in [Chapter 6](/en/ch6#ch_replication). The logical clocks that we
+replication, which we previously discussed in [Chapter 6](/en/ch6#ch_replication). The logical clocks that we
 discussed in this chapter are helpful in that context.

 ### References
--- a/content/en/ch11.md
+++ b/content/en/ch11.md
--- a/content/en/ch12.md
+++ b/content/en/ch12.md
--- a/content/en/ch13.md
+++ b/content/en/ch13.md
--- a/content/en/ch14.md
+++ b/content/en/ch14.md
@ -0,0 +1,625 @@
+---
+title: "14. Doing the Right Thing"
+weight: 314
+breadcrumbs: false
+---
+
+<a id="ch_right_thing"></a>
+
+![](/map/ch13.png)
+
+> *Feeding AI systems on the world's beauty, ugliness, and cruelty, but expecting it to reflect only
+> the beauty is a fantasy.*
+>
+> Vinay Uday Prabhu and Abeba Birhane, *Large Datasets: A Pyrrhic Win for Computer Vision?* (2020)
+
+> [!TIP] A NOTE FOR EARLY RELEASE READERS
+> With Early Release ebooks, you get books in their earliest form---the author's raw and unedited
+> content as they write---so you can take advantage of these technologies long before the official
+> release of these titles.
+>
+> This will be the 14th chapter of the final book. The GitHub repo for this book is
+> [https://github.com/ept/ddia2-feedback](https://github.com/ept/ddia2-feedback).
+>
+> If you'd like to be actively involved in reviewing and commenting on this draft, please reach out on GitHub.
+
+In the final chapter of this book, let's take a step back. Throughout this book we have examined a
+wide range of different architectures for data systems, evaluated their pros and cons, and explored
+techniques for building reliable, scalable, and maintainable applications. However, we have left out
+an important and fundamental part of the discussion, which we should now fill in.
+
+Every system is built for a purpose; every action we take has both intended and unintended
+consequences. The purpose may be as simple as making money, but the consequences for the world may
+reach far beyond that original purpose. We, the engineers building these systems, have a
+responsibility to carefully consider those consequences and to consciously decide what kind of world
+we want to live in.
+
+We talk about data as an abstract thing, but remember that many datasets are about people: their
+behavior, their interests, their identity. We must treat such data with humanity and respect. Users
+are humans too, and human dignity is paramount [^1].
+
+Software development increasingly involves making important ethical choices. There are guidelines to
+help software engineers navigate these issues, such as the ACM Code of Ethics and Professional
+Conduct [^2], but they are rarely discussed, applied, and enforced in practice. As a
+result, engineers and product managers sometimes take a very cavalier attitude to privacy and
+potential negative consequences of their products [^3], [^4].
+
+A technology is not good or bad in itself---what matters is how it is used and how it affects
+people. This is true for a software system like a search engine in much the same way as it is for a
+weapon like a gun. It is not sufficient for software engineers to focus exclusively on the technology
+and ignore its consequences: the ethical responsibility is ours to bear also. Reasoning about ethics
+is difficult, but it is too important to ignore.
+
+However, what makes something "good" or "bad" is not well-defined, and most people in computing
+don't even discuss that question [^5]. In contrast to much of computing, the concepts at
+the heart of ethics are not fixed or determinate in their precise meaning, and they require
+interpretation, which may be subjective [^6]. Ethics is not going through some checklist
+to confirm you comply; it's a participatory and iterative process of reflection, in dialog with the
+people involved, with accountability for the results [^7].
+
+## Predictive Analytics {#id369}
+
+For example, predictive analytics is a major part of why people are excited about big data and AI.
+Using data analysis to predict the weather, or the spread of diseases, is one thing [^8];
+it is another matter to predict whether a convict is likely to reoffend, whether an applicant for a
+loan is likely to default, or whether an insurance customer is likely to make expensive claims
+[^9]. The latter have a direct effect on individual people's lives.
+
+Naturally, payment networks want to prevent fraudulent transactions, banks want to avoid bad loans,
+airlines want to avoid hijackings, and companies want to avoid hiring ineffective or untrustworthy
+people. From their point of view, the cost of a missed business opportunity is low, but the cost of
+a bad loan or a problematic employee is much higher, so it is natural for organizations to want to
+be cautious. If in doubt, they are better off saying no.
+
+However, as algorithmic decision-making becomes more widespread, someone who has (accurately or
+falsely) been labeled as risky by some algorithm may suffer a large number of those "no" decisions.
+Systematically being excluded from jobs, air travel, insurance coverage, property rental, financial
+services, and other key aspects of society is such a large constraint of the individual's freedom
+that it has been called "algorithmic prison" [^10]. In countries that respect human
+rights, the criminal justice system presumes innocence until proven guilty; on the other hand,
+automated systems can systematically and arbitrarily exclude a person from participating in society
+without any proof of guilt, and with little chance of appeal.
+
+### Bias and Discrimination {#id370}
+
+Decisions made by an algorithm are not necessarily any better or any worse than those made by a
+human. Every person is likely to have biases, even if they actively try to counteract them, and
+discriminatory practices can become culturally institutionalized. There is hope that basing
+decisions on data, rather than subjective and instinctive assessments by people, could be more fair
+and give a better chance to people who are often overlooked in the traditional system
+[^11].
+
+When we develop predictive analytics and AI systems, we are not merely automating a human's decision
+by using software to specify the rules for when to say yes or no; we are even leaving the rules
+themselves to be inferred from data. However, the patterns learned by these systems are opaque: even
+if there is some correlation in the data, we may not know why. If there is a systematic bias in the
+input to an algorithm, the system will most likely learn and amplify that bias in its output
+[^12].
+
+In many countries, anti-discrimination laws prohibit treating people differently depending on
+protected traits such as ethnicity, age, gender, sexuality, disability, or beliefs. Other features
+of a person's data may be analyzed, but what happens if they are correlated with protected traits?
+For example, in racially segregated neighborhoods, a person's postal code or even their IP address
+is a strong predictor of race. Put like this, it seems ridiculous to believe that an algorithm could
+somehow take biased data as input and produce fair and impartial output from it [^13],
+[^14]. Yet this belief often seems to be implied by proponents of data-driven decision
+making, an attitude that has been satirized as "machine learning is like money laundering for bias"
+[^15].
+
+Predictive analytics systems merely extrapolate from the past; if the past is discriminatory, they
+codify and amplify that discrimination [^16]. If we want the future to be better than the
+past, moral imagination is required, and that's something only humans can provide [^17].
+Data and models should be our tools, not our masters.
+
+### Responsibility and Accountability {#id371}
+
+Automated decision making opens the question of responsibility and accountability [^17].
+If a human makes a mistake, they can be held accountable, and the person affected by the decision
+can appeal. Algorithms make mistakes too, but who is accountable if they go wrong [^18]?
+When a self-driving car causes an accident, who is responsible? If an automated credit scoring
+algorithm systematically discriminates against people of a particular race or religion, is there any
+recourse? If a decision by your machine learning system comes under judicial review, can you explain
+to the judge how the algorithm made its decision? People should not be able to evade their
+responsibility by blaming an algorithm.
+
+Credit rating agencies are an old example of collecting data to make decisions about people. A bad
+credit score makes life difficult, but at least a credit score is normally based on relevant facts
+about a person's actual borrowing history, and any errors in the record can be corrected (although
+the agencies normally do not make this easy). However, scoring algorithms based on machine learning
+typically use a much wider range of inputs and are much more opaque, making it harder to understand
+how a particular decision has come about and whether someone is being treated in an unfair or
+discriminatory way [^19].
+
+A credit score summarizes "How did you behave in the past?" whereas predictive analytics usually
+work on the basis of "Who is similar to you, and how did people like you behave in the past?"
+Drawing parallels to others' behavior implies stereotyping people, for example based on where they
+live (a close proxy for race and socioeconomic class). What about people who get put in the wrong
+bucket? Furthermore, if a decision is incorrect due to erroneous data, recourse is almost impossible
+[^17].
+
+Much data is statistical in nature, which means that even if the probability distribution on the
+whole is correct, individual cases may well be wrong. For example, if the average life expectancy in
+your country is 80 years, that doesn't mean you're expected to drop dead on your 80th birthday. From
+the average and the probability distribution, you can't say much about the age to which one
+particular person will live. Similarly, the output of a prediction system is probabilistic and may
+well be wrong in individual cases.
+
+A blind belief in the supremacy of data for making decisions is not only delusional, it is
+positively dangerous. As data-driven decision making becomes more widespread, we will need to figure
+out how to make algorithms accountable and transparent, how to avoid reinforcing existing biases,
+and how to fix them when they inevitably make mistakes.
+
+We will also need to figure out how to prevent data being used to harm people, and realize its
+positive potential instead. For example, analytics can reveal financial and social characteristics
+of people's lives. On the one hand, this power could be used to focus aid and support to help those
+people who most need it. On the other hand, it is sometimes used by predatory businesses seeking to
+identify vulnerable people and sell them risky products such as high-cost loans and worthless
+college degrees [^17], [^20].
+
+### Feedback Loops {#id372}
+
+Even with predictive applications that have less immediately far-reaching effects on people, such as
+recommendation systems, there are difficult issues that we must confront. When services become good
+at predicting what content users want to see, they may end up showing people only opinions they
+already agree with, leading to echo chambers in which stereotypes, misinformation, and polarization
+can breed. We are already seeing the impact of social media echo chambers on election campaigns.
+
+When predictive analytics affect people's lives, particularly pernicious problems arise due to
+self-reinforcing feedback loops. For example, consider the case of employers using credit scores to
+evaluate potential hires. You may be a good worker with a good credit score, but suddenly find
+yourself in financial difficulties due to a misfortune outside of your control. As you miss payments
+on your bills, your credit score suffers, and you will be less likely to find work. Joblessness
+pushes you toward poverty, which further worsens your scores, making it even harder to find
+employment [^17]. It's a downward spiral due to poisonous assumptions, hidden behind a
+camouflage of mathematical rigor and data.
+
+As another example of a feedback loop, economists found that when gas stations in Germany introduced
+algorithmic prices, competition was reduced and prices for consumers went up because the algorithms
+learned to collude [^21].
+
+We can't always predict when such feedback loops happen. However, many consequences can be predicted
+by thinking about the entire system (not just the computerized parts, but also the people
+interacting with it)---an approach known as *systems thinking* [^22]. We can try to
+understand how a data analysis system responds to different behaviors, structures, or
+characteristics. Does the system reinforce and amplify existing differences between people (e.g.,
+making the rich richer or the poor poorer), or does it try to combat injustice? And even with the
+best intentions, we must beware of unintended consequences.
+
+## Privacy and Tracking {#id373}
+
+Besides the problems of predictive analytics---i.e., using data to make automated decisions about
+people---there are ethical problems with data collection itself. What is the relationship between
+the organizations collecting data and the people whose data is being collected?
+
+When a system only stores data that a user has explicitly entered, because they want the system to
+store and process it in a certain way, the system is performing a service for the user: the user is
+the customer. But when a user's activity is tracked and logged as a side effect of other things they
+are doing, the relationship is less clear. The service no longer just does what the user tells it to
+do, but it takes on interests of its own, which may conflict with the user's interests.
+
+Tracking behavioral data has become increasingly important for user-facing features of many online
+services: tracking which search results are clicked helps improve the ranking of search results;
+recommending "people who liked X also liked Y" helps users discover interesting and useful things;
+A/B tests and user flow analysis can help indicate how a user interface might be improved. Those
+features require some amount of tracking of user behavior, and users benefit from them.
+
+However, depending on a company's business model, tracking often doesn't stop there. If the service
+is funded through advertising, the advertisers are the actual customers, and the users' interests
+take second place. Tracking data becomes more detailed, analyses become further-reaching, and data
+is retained for a long time in order to build up detailed profiles of each person for marketing
+purposes.
+
+Now the relationship between the company and the user whose data is being collected starts looking
+quite different. The user is given a free service and is coaxed into engaging with it as much as
+possible. The tracking of the user serves not primarily that individual, but rather the needs of the
+advertisers who are funding the service. This relationship can be appropriately described with a
+word that has more sinister connotations: *surveillance*.
+
+### Surveillance {#id374}
+
+As a thought experiment, try replacing the word *data* with *surveillance*, and observe if common
+phrases still sound so good [^23]. How about this: "In our surveillance-driven
+organization we collect real-time surveillance streams and store them in our surveillance warehouse.
+Our surveillance scientists use advanced analytics and surveillance processing in order to derive
+new insights."
+
+This thought experiment is unusually polemic for this book, *Designing Surveillance-Intensive
+Applications*, but strong words are needed to emphasize this point. In our attempts to make software
+"eat the world" [^24], we have built the greatest mass surveillance infrastructure the
+world has ever seen. We are rapidly approaching a world in which every inhabited space contains at
+least one internet-connected microphone, in the form of smartphones, smart TVs, voice-controlled
+assistant devices, baby monitors, and even children's toys that use cloud-based speech recognition.
+Many of these devices have a terrible security record [^25].
+
+What is new compared to the past is that digitization has made it easy to collect large amounts of
+data about people. Surveillance of our location and movements, our social relationships and
+communications, our purchases and payments, and data about our health have become almost
+unavoidable. A surveillance organization may end up knowing more about a person than that person
+knows about themselves---for example, identifying illnesses or economic problems before the person
+themselves is aware of them.
+
+Even the most totalitarian and repressive regimes of the past could only dream of putting a
+microphone in every room and forcing every person to constantly carry a device capable of tracking
+their location and movements. Yet the benefits that we get from digital technology are so great that
+we now voluntarily accept this world of total surveillance. The difference is just that the data is
+being collected by corporations to provide us with services, rather than government agencies seeking
+control [^26].
+
+Not all data collection necessarily qualifies as surveillance, but examining it as such can help us
+understand our relationship with the data collector. Why are we seemingly happy to accept
+surveillance by corporations? Perhaps you feel you have nothing to hide---in other words, you are
+totally in line with existing power structures, you are not a marginalized minority, and you needn't
+fear persecution [^27]. Not everyone is so fortunate. Or perhaps it's because the purpose
+seems benign---it's not overt coercion and conformance, but merely better recommendations and more
+personalized marketing. However, combined with the discussion of predictive analytics from the last
+section, that distinction seems less clear.
+
+We are already seeing behavioral data on car driving, tracked by cars without drivers' consent,
+affecting their insurance premiums [^28], and health insurance coverage that depends on
+people wearing a fitness tracking device. When surveillance is used to determine things that hold
+sway over important aspects of life, such as insurance coverage or employment, it starts to appear
+less benign. Moreover, data analysis can reveal surprisingly intrusive things: for example, the
+movement sensor in a smartwatch or fitness tracker can be used to work out what you are typing (for
+example, passwords) with fairly good accuracy [^29]. Sensor accuracy and algorithms for
+analysis are only going to get better.
+
+### Consent and Freedom of Choice {#id375}
+
+We might assert that users voluntarily choose to use a service that tracks their activity, and they
+have agreed to the terms of service and privacy policy, so they consent to data collection. We might
+even claim that users are receiving a valuable service in return for the data they provide, and that
+the tracking is necessary in order to provide the service. Undoubtedly, social networks, search
+engines, and various other free online services are valuable to users---but there are problems with
+this argument.
+
+First, we should ask in what way the tracking is necessary. Some forms of tracking directly feed
+into improving features for users: for example, tracking the click-through rate on search results
+can help improve a search engine's result ranking and relevance, and tracking which products
+customers tend to buy together can help an online shop suggest related products. However, when
+tracking user interaction for content recommendations, or to build user profiles for advertising
+purposes, it is less clear whether this is genuinely in the user's interest---or is it only
+necessary because the ads pay for the service?
+
+Second, users have little knowledge of what data they are feeding into our databases, or how it is
+retained and processed---and most privacy policies do more to obscure than to illuminate. Without
+understanding what happens to their data, users cannot give any meaningful consent. Often, data from
+one user also says things about other people who are not users of the service and who have not
+agreed to any terms. The derived datasets that we discussed in this part of the book---in which data
+from the entire user base may have been combined with behavioral tracking and external data
+sources---are precisely the kinds of data of which users cannot have any meaningful understanding.
+
+Moreover, data is extracted from users through a one-way process, not a relationship with true
+reciprocity, and not a fair value exchange. There is no dialog, no option for users to negotiate how
+much data they provide and what service they receive in return: the relationship between the service
+and the user is very asymmetric and one-sided. The terms are set by the service, not by the user
+[^30], [^31].
+
+In the European Union, the *General Data Protection Regulation* (GDPR) requires that consent be
+"freely given, specific, informed, and unambiguous", and that the user must be able to "refuse or
+withdraw consent without detriment"---otherwise it is not considered "freely given". Any request for
+consent must be written "in an intelligible and easily accessible form, using clear and plain
+language". Moreover, "silence, pre-ticked boxes or inactivity \[do not\] constitute consent"
+[^32]. There are other bases for lawful processing of personal data besides consent, such
+as *legitimate interest*, which permits certain uses of data such as fraud prevention
+[^33].
+
+You might argue that a user who does not consent to surveillance can simply choose not to use a
+service. But this choice is not free either: if a service is so popular that it is "regarded by most
+people as essential for basic social participation" [^30], then it is not reasonable to
+expect people to opt out of this service---using it is *de facto* mandatory. For example, in most
+Western social communities, it has become the norm to carry a smartphone, to use social networks for
+socializing, and to use Google for finding information. Especially when a service has network
+effects, there is a social cost to people choosing *not* to use it.
+
+Declining to use a service due to its user tracking policies is easier said than done. These
+platforms are designed specifically to engage users. Many use game mechanics and tactics common in
+gambling to keep users coming back [^34]. Even if a user gets past this, declining to
+engage is only an option for the small number of people who are privileged enough to have the time
+and knowledge to understand a service's privacy policy, and who can afford to potentially miss out
+on social participation or professional opportunities that might have arisen if they had
+participated in the service. For people in a less privileged position, there is no meaningful freedom of choice:
+surveillance becomes inescapable.
+
+### Privacy and Use of Data {#id457}
+
+Sometimes people claim that "privacy is dead" on the grounds that some users are willing to post all
+sorts of things about their lives to social media, sometimes mundane and sometimes deeply personal.
+However, this claim is false and rests on a misunderstanding of the word *privacy*.
+
+Having privacy does not mean keeping everything secret; it means having the freedom to choose which
+things to reveal to whom, what to make public, and what to keep secret. The right to privacy is a
+decision right: it enables each person to decide where they want to be on the spectrum between
+secrecy and transparency in each situation [^30]. It is an important aspect of a person's
+freedom and autonomy.
+
+For example, someone who suffers from a rare medical condition might be very happy to provide their
+private medical data to researchers if there is a chance that it might help the development of
+treatments for their condition. However, the important thing is that this person has a choice over
+who may access this data, and for what purpose. If there were a risk that information about their
+medical condition would harm their access to medical insurance or employment or other important
+things, this person would probably be much more cautious about sharing their data.
+
+When data is extracted from people through surveillance infrastructure, privacy rights are not
+necessarily eroded, but rather transferred to the data collector. Companies that acquire data
+essentially say "trust us to do the right thing with your data," which means that the right to
+decide what to reveal and what to keep secret is transferred from the individual to the company.
+
+The companies in turn choose to keep much of the outcome of this surveillance secret, because to
+reveal it would be perceived as creepy, and would harm their business model (which relies on knowing
+more about people than other companies do). Intimate information about users is only revealed
+indirectly, for example in the form of tools for targeting advertisements to specific groups of
+people (such as those suffering from a particular illness).
+
+Even if particular users cannot be personally reidentified from the bucket of people targeted by a
+particular ad, they have lost their agency over the disclosure of some intimate information. It is
+not the user who decides what is revealed to whom on the basis of their personal preferences---it is
+the company that exercises the privacy right with the goal of maximizing its profit.
+
+Many companies have a goal of not being *perceived* as creepy---avoiding the question of how
+intrusive their data collection actually is, and instead focusing on managing user perceptions. And
+even these perceptions are often managed poorly: for example, something may be factually correct,
+but if it triggers painful memories, the user may not want to be reminded about it [^35].
+With any kind of data we should expect the possibility that it is wrong, undesirable, or
+inappropriate in some way, and we need to build mechanisms for handling those failures. Whether
+something is "undesirable" or "inappropriate" is of course down to human judgment; algorithms are
+oblivious to such notions unless we explicitly program them to respect human needs. As engineers of
+these systems we must be humble, accepting and planning for such failings.
+
+Privacy settings that allow a user of an online service to control which aspects of their data other
+users can see are a starting point for handing back some control to users. However, regardless of
+the setting, the service itself still has unfettered access to the data, and is free to use it in
+any way permitted by the privacy policy. Even if the service promises not to sell the data to third
+parties, it usually grants itself unrestricted rights to process and analyze the data internally,
+often going much further than what is overtly visible to users.
+
+This kind of large-scale transfer of privacy rights from individuals to corporations is historically
+unprecedented [^30]. Surveillance has always existed, but it used to be expensive and
+manual, not scalable and automated. Trust relationships have always existed, for example between a
+patient and their doctor, or between a defendant and their attorney---but in these cases the use of
+data has been strictly governed by ethical, legal, and regulatory constraints. Internet services
+have made it much easier to amass huge amounts of sensitive information without meaningful consent,
+and to use it at massive scale without users understanding what is happening to their private data.
+
+### Data as Assets and Power {#id376}
+
+Since behavioral data is a byproduct of users interacting with a service, it is sometimes called
+"data exhaust"---suggesting that the data is worthless waste material. Viewed this way, behavioral
+and predictive analytics can be seen as a form of recycling that extracts value from data that would
+have otherwise been thrown away.
+
+It would be more accurate to view it the other way round: from an economic point of view, if targeted
+advertising is what pays for a service, then the user activity that generates behavioral data could
+be regarded as a form of labor [^36]. One could go even further and argue that the
+application with which the user interacts is merely a means to lure users into feeding more and more
+personal information into the surveillance infrastructure [^30]. The delightful human
+creativity and social relationships that often find expression in online services are cynically
+exploited by the data extraction machine.
+
+Personal data is a valuable asset, as evidenced by the existence of data brokers, a shady industry
+operating in secrecy, purchasing, aggregating, analyzing, inferring, and reselling intrusive
+personal data about people, mostly for marketing purposes [^20]. Startups are valued by
+their user numbers, by "eyeballs"---i.e., by their surveillance capabilities.
+
+Because the data is valuable, many people want it. Of course companies want it---that's why they
+collect it in the first place. But governments want to obtain it too: by means of secret deals,
+coercion, legal compulsion, or simply stealing it [^37]. When a company goes bankrupt, the
+personal data it has collected is one of the assets that gets sold. Moreover, the data is difficult
+to secure, so breaches happen disconcertingly often.
+
+These observations have led critics to say that data is not just an asset, but a "toxic asset"
+[^37], or at least "hazardous material" [^38]. Maybe data is not the new gold,
+nor the new oil, but rather the new uranium [^39]. Even if we think that we are capable of
+preventing abuse of data, whenever we collect data, we need to balance the benefits with the risk of
+it falling into the wrong hands: computer systems may be compromised by criminals or hostile foreign
+intelligence services, data may be leaked by insiders, the company may fall into the hands of
+unscrupulous management that does not share our values, or the country may be taken over by a regime
+that has no qualms about compelling us to hand over the data.
+
+When collecting data, we need to consider not just today's political environment, but all possible
+future governments. There is no guarantee that every government elected in future will respect human
+rights and civil liberties, so "it is poor civic hygiene to install technologies that could someday
+facilitate a police state" [^40].
+
+"Knowledge is power," as the old adage goes. And furthermore, "to scrutinize others while avoiding
+scrutiny oneself is one of the most important forms of power" [^41]. This is why
+totalitarian governments want surveillance: it gives them the power to control the population.
+Although today's technology companies are not overtly seeking political power, the data and
+knowledge they have accumulated nevertheless gives them a lot of power over our lives, much of which
+is surreptitious, outside of public oversight [^42].
+
+### Remembering the Industrial Revolution {#id377}
+
+Data is the defining feature of the information age. The internet, data storage, processing, and
+software-driven automation are having a major impact on the global economy and human society. As our
+daily lives and social organization have been changed by information technology, and will probably
+continue to radically change in the coming decades, comparisons to the Industrial Revolution come to
+mind [^17], [^26].
+
+The Industrial Revolution came about through major technological and agricultural advances, and it
+brought sustained economic growth and significantly improved living standards in the long run. Yet
+it also came with major problems: pollution of the air (due to smoke and chemical processes) and the
+water (from industrial and human waste) was dreadful. Factory owners lived in splendor, while urban
+workers often lived in very poor housing and worked long hours in harsh conditions. Child labor was
+common, including dangerous and poorly paid work in mines.
+
+It took a long time before safeguards were established, such as environmental protection
+regulations, safety protocols for workplaces, outlawing child labor, and health inspections for
+food. Undoubtedly the cost of doing business increased when factories were no longer allowed to dump
+their waste into rivers, sell tainted foods, or exploit workers. But society as a whole benefited
+hugely from these regulations, and few of us would want to return to a time before [^17].
+
+Just as the Industrial Revolution had a dark side that needed to be managed, our transition to the
+information age has major problems that we need to confront and solve [^43], [^44].
+The collection and use of data is one of those problems. In the words of Bruce Schneier
+[^26]:
+
+> Data is the pollution problem of the information age, and protecting privacy is the environmental
+> challenge. Almost all computers produce information. It stays around, festering. How we deal with
+> it---how we contain it and how we dispose of it---is central to the health of our information
+> economy. Just as we look back today at the early decades of the industrial age and wonder how our
+> ancestors could have ignored pollution in their rush to build an industrial world, our
+> grandchildren will look back at us during these early decades of the information age and judge us
+> on how we addressed the challenge of data collection and misuse.
+>
+> We should try to make them proud.
+
+### Legislation and Self-Regulation {#sec_future_legislation}
+
+Data protection laws might be able to help preserve individuals' rights. For example, the European
+GDPR states that personal data must be "collected for specified, explicit and legitimate purposes
+and not further processed in a manner that is incompatible with those purposes", and furthermore
+that data must be "adequate, relevant and limited to what is necessary in relation to the purposes
+for which they are processed" [^32].
+
+However, this principle of *data minimization* runs directly counter to the philosophy of Big Data,
+which is to maximize data collection, to combine it with other datasets, to experiment and to
+explore in order to generate new insights. Exploration means using data for unforeseen purposes,
+which is the opposite of the "specified and explicit" purposes for which the data must have been
+collected. While the GDPR has had some effect on the online advertising industry [^45],
+the regulation has been weakly enforced [^46], and it does not seem to have led to much of
+a change in culture and practices across the wider tech industry.
+
+Companies that collect lots of data about people oppose regulation as being a burden and a hindrance
+to innovation. To some extent that opposition is justified. For example, when sharing medical data,
+there are clear risks to privacy, but there are also potential opportunities: how many deaths could
+be prevented if data analysis was able to help us achieve better diagnostics or find better
+treatments [^47]? Over-regulation may prevent such breakthroughs. It is difficult to
+balance such potential opportunities with the risks [^41].
+
+Fundamentally, we need a culture shift in the tech industry with regard to personal data. We should
+stop regarding users as metrics to be optimized, and remember that they are humans who deserve
+respect, dignity, and agency. We should self-regulate our data collection and processing practices
+in order to establish and maintain the trust of the people who depend on our software
+[^48]. And we should take it upon ourselves to educate end users about how their data is
+used, rather than keeping them in the dark.
+
+We should allow each individual to maintain their privacy---i.e., their control over their own data---and
+not steal that control from them through surveillance. Our individual right to control our data is
+like the natural environment of a national park: if we don't explicitly protect and care for it, it
+will be destroyed. It will be the tragedy of the commons, and we will all be worse off for it.
+Ubiquitous surveillance is not inevitable---we are still able to stop it.
+
+As a first step, we should not retain data forever, but purge it as soon as it is no longer needed,
+and minimize what we collect in the first place [^48], [^49]. Data you don't have is
+data that can't be leaked, stolen, or compelled by governments to be handed over. Overall, culture
+and attitude changes will be necessary. As people working in technology, if we don't consider the
+societal impact of our work, we're not doing our job [^50].
+
+## Summary {#id594}
+
+This brings us to the end of the book. We have covered a lot of ground:
+
+- In [Chapter 1](/en/ch1#ch_tradeoffs) we contrasted analytical and operational systems, compared
+  the cloud to self-hosting, weighed up distributed and single-node systems, and discussed balancing
+  the needs of your business with the needs of your users.
+
+- In [Chapter 2](/en/ch2#ch_nonfunctional) we saw how to define several nonfunctional requirements
+  such as performance, reliability, scalability, and maintainability.
+
+- In [Chapter 3](/en/ch3#ch_datamodels) we explored a spectrum of data models, including the
+  relational, document, and graph models, event sourcing, and DataFrames. We also looked at examples
+  of various query languages, including SQL, Cypher, SPARQL, Datalog, and GraphQL.
+
+- In [Chapter 4](/en/ch4#ch_storage) we discussed storage engines for OLTP (LSM-trees and B-trees),
+  for analytics (column-oriented storage), and indexes for information retrieval (full-text and
+  vector search).
+
+- In [Chapter 5](/en/ch5#ch_encoding) we examined different ways of encoding data objects as bytes,
+  and how to support evolution as requirements change. We also compared several ways in which data flows
+  between processes: via databases, service calls, workflow engines, or event-driven architectures.
+
+- In [Chapter 6](/en/ch6#ch_replication) we studied the trade-offs between single-leader,
+  multi-leader, and leaderless replication. We also looked at consistency models such as
+  read-after-write consistency, and sync engines that allow clients to work offline.
+
+- In [Chapter 7](/en/ch7#ch_sharding) we went into sharding, including strategies for rebalancing,
+  request routing, and secondary indexing.
+
+- In [Chapter 8](/en/ch8#ch_transactions) we covered transactions: durability, how various isolation
+  levels (read committed, snapshot isolation, and serializable) can be achieved, and how atomicity
+  can be ensured in distributed transactions.
+
+- In [Chapter 9](/en/ch9#ch_distributed) we surveyed fundamental problems that occur in distributed
+  systems (network faults and delays, clock errors, process pauses, crashes), and saw how they make
+  it difficult to correctly implement even something seemingly simple like a lock.
+
+- In [Chapter 10](/en/ch10#ch_consistency) we went on a deep dive into various forms of consensus
+  and the consistency model (linearizability) it enables.
+
+- In [Chapter 11](/en/ch11#ch_batch) we dug into batch processing, building up from simple chains of
+  Unix tools to large-scale distributed batch processors using distributed filesystems or object
+  stores.
+
+- In [Chapter 12](/en/ch12#ch_stream) we generalized batch processing to stream processing,
+  discussed the underlying message brokers, change data capture, fault tolerance, and processing
+  patterns such as streaming joins.
+
+- In [Chapter 13](/en/ch13#ch_philosophy) we explored a philosophy of streaming systems that allows
+  disparate data systems to be integrated, systems to be evolved, and applications to be scaled more
+  easily.
+
+Finally, in this last chapter, we took a step back and examined some ethical aspects of building
+data-intensive applications. We saw that although data can be used to do good, it can also do
+significant harm: making decisions that seriously affect people's lives and are difficult to appeal
+against, leading to discrimination and exploitation, normalizing surveillance, and exposing intimate
+information. We also run the risk of data breaches, and we may find that a well-intentioned use of
+data has unintended consequences.
+
+As software and data are having such a large impact on the world, we as engineers must remember that
+we carry a responsibility to work toward the kind of world that we want to live in: a world that
+treats people with humanity and respect. Let's work together towards that goal.
+
+### References {#references}
+
+[^1]: David Schmudde. [What If Data Is a Bad Idea?](https://schmud.de/posts/2024-08-18-data-is-a-bad-idea.html). *schmud.de*, August 2024. Archived at [perma.cc/ZXU5-XMCT](https://perma.cc/ZXU5-XMCT)
+[^2]: [ACM Code of Ethics and Professional Conduct](https://www.acm.org/code-of-ethics). Association for Computing Machinery, *acm.org*, 2018. Archived at [perma.cc/SEA8-CMB8](https://perma.cc/SEA8-CMB8)
+[^3]: Igor Perisic. [Making Hard Choices: The Quest for Ethics in Machine Learning](https://www.linkedin.com/blog/engineering/archive/making-hard-choices-the-quest-for-ethics-in-machine-learning). *linkedin.com*, November 2016. Archived at [perma.cc/DGF8-KNT7](https://perma.cc/DGF8-KNT7)
+[^4]: John Naughton. [Algorithm Writers Need a Code of Conduct](https://www.theguardian.com/commentisfree/2015/dec/06/algorithm-writers-should-have-code-of-conduct). *theguardian.com*, December 2015. Archived at [perma.cc/TBG2-3NG6](https://perma.cc/TBG2-3NG6)
+[^5]: Ben Green. ["Good" isn't good enough](https://www.benzevgreen.com/wp-content/uploads/2019/11/19-ai4sg.pdf). At *NeurIPS Joint Workshop on AI for Social Good*, December 2019. Archived at [perma.cc/H4LN-7VY3](https://perma.cc/H4LN-7VY3)
+[^6]: Deborah G. Johnson and Mario Verdicchio. [Ethical AI is Not about AI](https://cacm.acm.org/opinion/ethical-ai-is-not-about-ai/). *Communications of the ACM*, volume 66, issue 2, pages 32--34, January 2023. [doi:10.1145/3576932](https://doi.org/10.1145/3576932)
+[^7]: Marc Steen. [Ethics as a Participatory and Iterative Process](https://cacm.acm.org/opinion/ethics-as-a-participatory-and-iterative-process/). *Communications of the ACM*, volume 66, issue 5, pages 27--29, April 2023. [doi:10.1145/3550069](https://doi.org/10.1145/3550069)
+[^8]: Logan Kugler. [What Happens When Big Data Blunders?](https://cacm.acm.org/news/what-happens-when-big-data-blunders/) *Communications of the ACM*, volume 59, issue 6, pages 15--16, June 2016. [doi:10.1145/2911975](https://doi.org/10.1145/2911975)
+[^9]: Miri Zilka. [Algorithms and the criminal justice system: promises and challenges in deployment and research](https://www.cl.cam.ac.uk/research/security/seminars/archive/video/2023-03-07-t196231.html). At *University of Cambridge Security Seminar Series*, March 2023.
+[^10]: Bill Davidow. [Welcome to Algorithmic Prison](https://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/). *theatlantic.com*, February 2014. Archived at [archive.org](https://web.archive.org/web/20171019201812/https://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/)
+[^11]: Don Peck. [They're Watching You at Work](https://www.theatlantic.com/magazine/archive/2013/12/theyre-watching-you-at-work/354681/). *theatlantic.com*, December 2013. Archived at [perma.cc/YR9T-6M38](https://perma.cc/YR9T-6M38)
+[^12]: Leigh Alexander. [Is an Algorithm Any Less Racist Than a Human?](https://www.theguardian.com/technology/2016/aug/03/algorithm-racist-human-employers-work) *theguardian.com*, August 2016. Archived at [perma.cc/XP93-DSVX](https://perma.cc/XP93-DSVX)
+[^13]: Jesse Emspak. [How a Machine Learns Prejudice](https://www.scientificamerican.com/article/how-a-machine-learns-prejudice/). *scientificamerican.com*, December 2016. Archived at [perma.cc/R3L5-55E6](https://perma.cc/R3L5-55E6)
+[^14]: Rohit Chopra, Kristen Clarke, Charlotte A. Burrows, and Lina M. Khan. [Joint Statement on Enforcement Efforts Against Discrimination and Bias in Automated Systems](https://www.ftc.gov/system/files/ftc_gov/pdf/EEOC-CRT-FTC-CFPB-AI-Joint-Statement%28final%29.pdf). *ftc.gov*, April 2023. Archived at [perma.cc/YY4Y-RCCA](https://perma.cc/YY4Y-RCCA)
+[^15]: Maciej Cegłowski. [The Moral Economy of Tech](https://idlewords.com/talks/sase_panel.htm). *idlewords.com*, June 2016. Archived at [perma.cc/L8XV-BKTD](https://perma.cc/L8XV-BKTD)
+[^16]: Greg Nichols. [Artificial Intelligence in healthcare is racist](https://www.zdnet.com/article/artificial-intelligence-in-healthcare-is-racist/). *zdnet.com*, November 2020. Archived at [perma.cc/3MKW-YKRS](https://perma.cc/3MKW-YKRS)
+[^17]: Cathy O'Neil. *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. ISBN: 978-0-553-41881-1
+[^18]: Julia Angwin. [Make Algorithms Accountable](https://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html). *nytimes.com*, August 2016. Archived at [archive.org](https://web.archive.org/web/20230819055242/https://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html)
+[^19]: Bryce Goodman and Seth Flaxman. [European Union Regulations on Algorithmic Decision-Making and a 'Right to Explanation'](https://arxiv.org/abs/1606.08813). At *ICML Workshop on Human Interpretability in Machine Learning*, June 2016. Archived at [arxiv.org/abs/1606.08813](https://arxiv.org/abs/1606.08813)
+[^20]: [A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes](https://www.commerce.senate.gov/services/files/0d2b3642-6221-4888-a631-08f2f255b577). Staff Report, *United States Senate Committee on Commerce, Science, and Transportation*, *commerce.senate.gov*, December 2013. Archived at [perma.cc/32NV-YWLQ](https://perma.cc/32NV-YWLQ)
+[^21]: Stephanie Assad, Robert Clark, Daniel Ershov, and Lei Xu. [Algorithmic Pricing and Competition: Empirical Evidence from the German Retail Gasoline Market](https://economics.yale.edu/sites/default/files/clark_acex_jan_2021.pdf). *Journal of Political Economy*, volume 132, issue 3, pages 723--771, March 2024. [doi:10.1086/726906](https://doi.org/10.1086/726906)
+[^22]: Donella H. Meadows and Diana Wright. *Thinking in Systems: A Primer*. Chelsea Green Publishing, 2008. ISBN: 978-1-603-58055-7
+[^23]: Daniel J. Bernstein. [Listening to a "big data"/"data science" talk. Mentally translating "data" to "surveillance": "\...everything starts with surveillance\..."](https://x.com/hashbreaker/status/598076230437568512) *x.com*, May 2015. Archived at [perma.cc/EY3D-WBBJ](https://perma.cc/EY3D-WBBJ)
+[^24]: Marc Andreessen. [Why Software Is Eating the World](https://a16z.com/why-software-is-eating-the-world/). *a16z.com*, August 2011. Archived at [perma.cc/3DCC-W3G6](https://perma.cc/3DCC-W3G6)
+[^25]: J. M. Porup. ['Internet of Things' Security Is Hilariously Broken and Getting Worse](https://arstechnica.com/information-technology/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/). *arstechnica.com*, January 2016. Archived at [archive.org](https://web.archive.org/web/20250823001716/https://arstechnica.com/information-technology/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/)
+[^26]: Bruce Schneier. [*Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World*](https://www.schneier.com/books/data_and_goliath/). W. W. Norton, 2015. ISBN: 978-0-393-35217-7
+[^27]: The Grugq. [Nothing to Hide](https://grugq.tumblr.com/post/142799983558/nothing-to-hide). *grugq.tumblr.com*, April 2016. Archived at [perma.cc/BL95-8W5M](https://perma.cc/BL95-8W5M)
+[^28]: Federal Trade Commission. [FTC Takes Action Against General Motors for Sharing Drivers' Precise Location and Driving Behavior Data Without Consent](https://www.ftc.gov/news-events/news/press-releases/2025/01/ftc-takes-action-against-general-motors-sharing-drivers-precise-location-driving-behavior-data). *ftc.gov*, January 2025. Archived at [perma.cc/3XGV-3HRD](https://perma.cc/3XGV-3HRD)
+[^29]: Tony Beltramelli. [Deep-Spying: Spying Using Smartwatch and Deep Learning](https://arxiv.org/abs/1512.05616). Master's Thesis, IT University of Copenhagen, December 2015. Archived at *arxiv.org/abs/1512.05616*
+[^30]: Shoshana Zuboff. [Big Other: Surveillance Capitalism and the Prospects of an Information Civilization](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2594754). *Journal of Information Technology*, volume 30, issue 1, pages 75-89, April 2015. [doi:10.1057/jit.2015.5](https://doi.org/10.1057/jit.2015.5)
+[^31]: Michiel Rhoen. [Beyond Consent: Improving Data Protection Through Consumer Protection Law](https://policyreview.info/articles/analysis/beyond-consent-improving-data-protection-through-consumer-protection-law). *Internet Policy Review*, volume 5, issue 1, March 2016. [doi:10.14763/2016.1.404](https://doi.org/10.14763/2016.1.404)
+[^32]: [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016](https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng). *Official Journal of the European Union*, L 119/1, May 2016.
+[^33]: UK Information Commissioner's Office. [What is the 'legitimate interests' basis?](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/lawful-basis/legitimate-interests/what-is-the-legitimate-interests-basis/) *ico.org.uk*. Archived at [perma.cc/W8XR-F7ML](https://perma.cc/W8XR-F7ML)
+[^34]: Tristan Harris. [How a handful of tech companies control billions of minds every day](https://www.ted.com/talks/tristan_harris_how_a_handful_of_tech_companies_control_billions_of_minds_every_day). At *TED2017*, April 2017.
+[^35]: Carina C. Zona. [Consequences of an Insightful Algorithm](https://www.youtube.com/watch?v=YRI40A4tyWU). At *GOTO Berlin*, November 2016.
+[^36]: Imanol Arrieta Ibarra, Leonard Goff, Diego Jiménez Hernández, Jaron Lanier, and E. Glen Weyl. [Should We Treat Data as Labor? Moving Beyond 'Free'](https://www.aeaweb.org/conference/2018/preliminary/paper/2Y7N88na). *American Economic Association Papers Proceedings*, volume 1, issue 1, December 2017.
+[^37]: Bruce Schneier. [Data Is a Toxic Asset, So Why Not Throw It Out?](https://www.schneier.com/essays/archives/2016/03/data_is_a_toxic_asse.html) *schneier.com*, March 2016. Archived at [perma.cc/4GZH-WR3D](https://perma.cc/4GZH-WR3D)
+[^38]: Cory Scott. [Data is not toxic - which implies no benefit - but rather hazardous material, where we must balance need vs. want](https://x.com/cory_scott/status/706586399483437056). *x.com*, March 2016. Archived at [perma.cc/CLV7-JF2E](https://perma.cc/CLV7-JF2E)
+[^39]: Mark Pesce. [Data is the new uranium -- incredibly powerful and amazingly dangerous](https://www.theregister.com/2024/11/20/data_is_the_new_uranium/). *theregister.com*, November 2024. Archived at [perma.cc/NV8B-GYGV](https://perma.cc/NV8B-GYGV)
+[^40]: Bruce Schneier. [Mission Creep: When Everything Is Terrorism](https://www.schneier.com/essays/archives/2013/07/mission_creep_when_e.html). *schneier.com*, July 2013. Archived at [perma.cc/QB2C-5RCE](https://perma.cc/QB2C-5RCE)
+[^41]: Lena Ulbricht and Maximilian von Grafenstein. [Big Data: Big Power Shifts?](https://policyreview.info/articles/analysis/big-data-big-power-shifts) *Internet Policy Review*, volume 5, issue 1, March 2016. [doi:10.14763/2016.1.406](https://doi.org/10.14763/2016.1.406)
+[^42]: Ellen P. Goodman and Julia Powles. [Facebook and Google: Most Powerful and Secretive Empires We've Ever Known](https://www.theguardian.com/technology/2016/sep/28/google-facebook-powerful-secretive-empire-transparency). *theguardian.com*, September 2016. Archived at [perma.cc/8UJA-43G6](https://perma.cc/8UJA-43G6)
+[^43]: Judy Estrin and Sam Gill. [The World Is Choking on Digital Pollution](https://washingtonmonthly.com/2019/01/13/the-world-is-choking-on-digital-pollution/). *washingtonmonthly.com*, January 2019. Archived at [perma.cc/3VHF-C6UC](https://perma.cc/3VHF-C6UC)
+[^44]: A. Michael Froomkin. [Regulating Mass Surveillance as Privacy Pollution: Learning from Environmental Impact Statements](https://repository.law.miami.edu/cgi/viewcontent.cgi?article=1062&context=fac_articles). *University of Illinois Law Review*, volume 2015, issue 5, August 2015. Archived at [perma.cc/24ZL-VK2T](https://perma.cc/24ZL-VK2T)
+[^45]: Pengyuan Wang, Li Jiang, and Jian Yang. [The Early Impact of GDPR Compliance on Display Advertising: The Case of an Ad Publisher](https://openreview.net/pdf?id=TUnLHNo19S). *Journal of Marketing Research*, volume 61, issue 1, April 2023. [doi:10.1177/00222437231171848](https://doi.org/10.1177/00222437231171848)
+[^46]: Johnny Ryan. [Don't be fooled by Meta's fine for data breaches](https://www.economist.com/by-invitation/2023/05/24/dont-be-fooled-by-metas-fine-for-data-breaches-says-johnny-ryan). *The Economist*, May 2023. Archived at [perma.cc/VCR6-55HR](https://perma.cc/VCR6-55HR)
+[^47]: Jessica Leber. [Your Data Footprint Is Affecting Your Life in Ways You Can't Even Imagine](https://www.fastcompany.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine). *fastcompany.com*, March 2016. Archived at [archive.org](https://web.archive.org/web/20161128133016/https://www.fastcoexist.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine)
+[^48]: Maciej Cegłowski. [Haunted by Data](https://idlewords.com/talks/haunted_by_data.htm). *idlewords.com*, October 2015. Archived at [archive.org](https://web.archive.org/web/20161130143932/https://idlewords.com/talks/haunted_by_data.htm)
+[^49]: Sam Thielman. [You Are Not What You Read: Librarians Purge User Data to Protect Privacy](https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy). *theguardian.com*, January 2016. Archived at [archive.org](https://web.archive.org/web/20250828224851/https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy)
+[^50]: Jez Humble. [It's a cliché that people get into tech to "change the world". So then, you have to actually consider what the impact of your work is on the world. The idea that you can or should exclude societal and political discussions in tech is idiotic. It means you're not doing your job](https://x.com/jezhumble/status/1386758340894597122). *x.com*, April 2021. Archived at [perma.cc/3NYS-MHLC](https://perma.cc/3NYS-MHLC)
--- a/content/en/ch2.md
+++ b/content/en/ch2.md
@ -4,6 +4,8 @@ weight: 102
 breadcrumbs: false
 ---

+<a id="ch_nonfunctional"></a>
+
 ![](/map/ch01.png)

 > *The Internet was done so well that most people think of it as a natural resource like the Pacific
@ -55,7 +57,7 @@ Barack Obama have over 100 million followers).

 ### Representing Users, Posts, and Follows {#id20}

-Imagine we keep all of the data in a relational database as shown in [Figure 2-1](/en/ch2#fig_twitter_relational). We
+Imagine we keep all of the data in a relational database as shown in [Figure 2-1](/en/ch2#fig_twitter_relational). We
 have one table for users, one table for posts, and one table for follow relationships.

 {{< figure src="/fig/ddia_0201.png" id="fig_twitter_relational" caption="Figure 2-1. Simple relational schema for a social network in which users can follow each other." class="w-full my-4" >}}
@ -107,7 +109,7 @@ needs to subscribe to the stream of posts being added to their home timeline.

 The downside of this approach is that we now need to do more work every time a user makes a post,
 because the home timelines are derived data that needs to be updated. The process is illustrated in
-[Figure 2-2](/en/ch2#fig_twitter_timelines). When one initial request results in several downstream requests being
+[Figure 2-2](/en/ch2#fig_twitter_timelines). When one initial request results in several downstream requests being
 carried out, we use the term *fan-out* to describe the factor by which the number of requests
 increases.

@ -126,7 +128,7 @@ load, since we simply serve them from a cache.

 This process of precomputing and updating the results of a query is called *materialization*, and
 the timeline cache is an example of a *materialized view* (a concept we will discuss further in
-[Link to Come]). The materialized view speeds up reads, but in return we have to do more work on
+[“Maintaining materialized views”](/en/ch12#sec_stream_mat_view)). The materialized view speeds up reads, but in return we have to do more work on
 write. The cost of writes for most users is modest, but a social network also has to consider some
 extreme cases:

@ -163,7 +165,7 @@ metrics, whereas the “time it takes to load the home timeline” or the “tim
 delivered to followers” are response time metrics.

 There is often a connection between throughput and response time; an example of such a relationship
-for an online service is sketched in [Figure 2-3](/en/ch2#fig_throughput). The service has a low response time when
+for an online service is sketched in [Figure 2-3](/en/ch2#fig_throughput). The service has a low response time when
 request throughput is low, but response time increases as load increases. This is because of
 *queueing*: when a request arrives on a highly loaded system, it’s likely that the CPU is already in
 the process of handling an earlier request, and therefore the incoming request needs to wait until
@ -175,6 +177,8 @@ handle, queueing delays increase sharply.

 --------

+<a id="sidebar_metastable"></a>
+
 > [!TIP] WHEN AN OVERLOADED SYSTEM WON'T RECOVER

 If a system is close to overload, with throughput pushed close to the limit, it can sometimes enter a
@ -206,7 +210,7 @@ scalability in [“Scalability”](/en/ch2#sec_introduction_scalability).
 ### Latency and Response Time {#id23}

 “Latency” and “response time” are sometimes used interchangeably, but in this book we will use the
-terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)):
+terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)):

 * The *response time* is what the client sees; it includes all delays incurred anywhere in the
 system.
@ -221,7 +225,7 @@ terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)

 {{< figure src="/fig/ddia_0204.png" id="fig_response_time" caption="Figure 2-4. Response time, service time, network latency, and queueing delay." class="w-full my-4" >}}

-In [Figure 2-4](/en/ch2#fig_response_time), time flows from left to right, each communicating node is shown as a
+In [Figure 2-4](/en/ch2#fig_response_time), time flows from left to right, each communicating node is shown as a
 horizontal line, and a request or response message is shown as a thick diagonal arrow from one node
 to another. You will encounter this style of diagram frequently over the course of this book.

@ -242,7 +246,7 @@ it is important to measure response times on the client side.
 ### Average, Median, and Percentiles {#id24}

 Because the response time varies from one request to the next, we need to think of it not as a
-single number, but as a *distribution* of values that you can measure. In [Figure 2-5](/en/ch2#fig_lognormal), each
+single number, but as a *distribution* of values that you can measure. In [Figure 2-5](/en/ch2#fig_lognormal), each
 gray bar represents a request to a service, and its height shows how long that request took. Most
 requests are reasonably fast, but there are occasional *outliers* that take much longer.
 Variation in network delay is also known as *jitter*.
@ -257,7 +261,7 @@ because it doesn’t tell you how many users actually experienced that delay.

 Usually it is better to use *percentiles*. If you take your list of response times and sort it from
 fastest to slowest, then the *median* is the halfway point: for example, if your median response
-time is 200 ms, that means half your requests return in less than 200 ms, and half your
+time is 200 ms, that means half your requests return in less than 200 ms, and half your
 requests take longer than that. This makes the median a good metric if you want to know how long
 users typically have to wait. The median is also known as the *50th percentile*, and sometimes
 abbreviated as *p50*.
@ -267,7 +271,7 @@ In order to figure out how bad your outliers are, you can look at higher percent
 response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular
 threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of
 100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is
-illustrated in [Figure 2-5](/en/ch2#fig_lognormal).
+illustrated in [Figure 2-5](/en/ch2#fig_lognormal).

 High percentiles of response times, also known as *tail latencies*, are important because they
 directly affect users’ experience of the service. For example, Amazon describes response time
@ -291,14 +295,14 @@ However, it is surprisingly difficult to get hold of reliable data to quantify t
 latency has on user behavior.

 Some often-cited statistics are unreliable. In 2006 Google reported that a slowdown in search
-results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue [^21].
-However, another Google study from 2009 reported that a 400 ms increase in latency resulted in
+results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue [^21].
+However, another Google study from 2009 reported that a 400 ms increase in latency resulted in
 only 0.6% fewer searches per day [^22],
 and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3% [^23].
 Newer data from these companies appears not to be publicly available.

 A more recent Akamai study [^24]
-claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites
+claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites
 by up to 7%; however, on closer inspection, the same study reveals that very *fast* page load times
 are also correlated with lower conversion rates! This seemingly paradoxical result is explained by
 the fact that the pages that load fastest are often those that have no useful content (e.g., 404
@ -316,7 +320,7 @@ fast and slow responses is 1.25 seconds or more.
 High percentiles are especially important in backend services that are called multiple times as
 part of serving a single end-user request. Even if you make the calls in parallel, the end-user
 request still needs to wait for the slowest of the parallel calls to complete. It takes just one
-slow call to make the entire end-user request slow, as illustrated in [Figure 2-6](/en/ch2#fig_tail_amplification).
+slow call to make the entire end-user request slow, as illustrated in [Figure 2-6](/en/ch2#fig_tail_amplification).
 Even if only a small percentage of backend calls are slow, the chance of getting a slow call
 increases if an end-user request requires multiple backend calls, and so a higher proportion of
 end-user requests end up being slow (an effect known as *tail latency amplification* [^26]).
@ -326,13 +330,15 @@ end-user requests end up being slow (an effect known as *tail latency amplificat
 Percentiles are often used in *service level objectives* (SLOs) and *service level agreements*
 (SLAs) as ways of defining the expected performance and availability of a service [^27].
 For example, an SLO may set a target for a service to have a median response time of less than
-200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests
+200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests
 result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not
 met (for example, customers may be entitled to a refund). That is the basic idea, at least; in
 practice, defining good availability metrics for SLOs and SLAs is not straightforward [^28] [^29].

 --------

+<a id="sidebar_percentiles"></a>
+
 > [!TIP] COMPUTING PERCENTILES

 If you want to add response time percentiles to the monitoring dashboards for your services, you
@ -395,7 +401,7 @@ For example, in the social network case study, a fault that might happen is that
 process, a machine involved in updating the materialized timelines crashes or becomes unavailable.
 To make this process fault-tolerant, we would need to ensure that another machine can take over this
 task without missing any posts that should have been delivered, and without duplicating any posts.
-(This idea is known as *exactly-once semantics*, and we will examine it in detail in [Link to Come].)
+(This idea is known as *exactly-once semantics*, and we will examine it in detail in [“The End-to-End Argument for Databases”](/en/ch13#sec_future_end_to_end).)

 Fault tolerance is always limited to a certain number of certain types of faults. For example, a
 system might be able to tolerate a maximum of two hard drives failing at the same time, or a maximum
@ -473,14 +479,14 @@ resources.
 The fault-tolerance techniques we discuss in this book are designed to tolerate the loss of entire
 machines, racks, or availability zones. They generally work by allowing a machine in one datacenter
 to take over when a machine in another datacenter fails or becomes unreachable. We will discuss such
-techniques for fault tolerance in [Chapter 6](/en/ch6#ch_replication), [Chapter 10](/en/ch10#ch_consistency), and at various other
+techniques for fault tolerance in [Chapter 6](/en/ch6#ch_replication), [Chapter 10](/en/ch10#ch_consistency), and at various other
 points in this book.

 Systems that can tolerate the loss of entire machines also have operational advantages: a
 single-server system requires planned downtime if you need to reboot the machine (to apply operating
 system security patches, for example), whereas a multi-node fault-tolerant system can be patched by
 restarting one node at a time, without affecting the service for users. This is called a *rolling
-upgrade*, and we will discuss it further in [Chapter 5](/en/ch5#ch_encoding).
+upgrade*, and we will discuss it further in [Chapter 5](/en/ch5#ch_encoding).

 #### Software faults {#software-faults}

@ -559,6 +565,8 @@ work with it every day, and take steps to improve it based on this feedback [^71

 --------

+<a id="sidebar_reliability_importance"></a>
+
 > [!TIP] HOW IMPORTANT IS RELIABILITY?

 Reliability is not just for nuclear power stations and air traffic control—more mundane applications
@ -691,8 +699,8 @@ The advantages of shared-nothing are that it has the potential to scale linearly
 whatever hardware offers the best price/performance ratio (especially in the cloud), it can more
 easily adjust its hardware resources as load increases or decreases, and it can achieve greater
 fault tolerance by distributing the system across multiple data centers and regions. The downsides
-are that it requires explicit sharding (see [Chapter 7](/en/ch7#ch_sharding)), and it incurs all the complexity of
-distributed systems ([Chapter 9](/en/ch9#ch_distributed)).
+are that it requires explicit sharding (see [Chapter 7](/en/ch7#ch_sharding)), and it incurs all the complexity of
+distributed systems ([Chapter 9](/en/ch9#ch_distributed)).

 Some cloud-native database systems use separate services for storage and transaction execution (see
 [“Separation of storage and compute”](/en/ch1#sec_introduction_storage_compute)), with multiple compute nodes sharing access to the same
@ -706,9 +714,9 @@ the database [^83].
 The architecture of systems that operate at large scale is usually highly specific to the
 application—there is no such thing as a generic, one-size-fits-all scalable architecture
 (informally known as *magic scaling sauce*). For example, a system that is designed to handle
-100,000 requests per second, each 1 kB in size, looks very different from a system that is
-designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same
-data throughput (100 MB/sec).
+100,000 requests per second, each 1 kB in size, looks very different from a system that is
+designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same
+data throughput (100 MB/sec).

 Moreover, an architecture that is appropriate for one level of load is unlikely to cope with 10
 times that load. If you are working on a fast-growing service, it is therefore likely that you will
@ -718,11 +726,11 @@ one order of magnitude in advance.

 A good general principle for scalability is to break a system down into smaller components that can
 operate largely independently from each other. This is the underlying principle behind microservices
-(see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing
-([Link to Come]), and shared-nothing architectures. However, the challenge is in knowing where to
+(see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing
+([Chapter 12](/en/ch12#ch_stream)), and shared-nothing architectures. However, the challenge is in knowing where to
 draw the line between things that should be together, and things that should be apart. Design
 guidelines for microservices can be found in other books [^84],
-and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding).
+and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding).

 Another good principle is not to make things more complicated than necessary. If a single-machine
 database will do the job, it’s probably preferable to a complicated distributed setup. Auto-scaling
@ -997,4 +1005,3 @@ this book will cover a selection of building blocks that have proved to be valua
 [^96]: Eric Evans. [*Domain-Driven Design: Tackling Complexity in the Heart of Software*](https://learning.oreilly.com/library/view/domain-driven-design-tackling/0321125215/). Addison-Wesley Professional, August 2003. ISBN: 9780321125217 
 [^97]: Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson. [Analyzing Software Evolvability](https://www.es.mdh.se/pdf_publications/1251.pdf). At *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008. [doi:10.1109/COMPSAC.2008.50](https://doi.org/10.1109/COMPSAC.2008.50) 
 [^98]: Enrico Zaninotto. [From X programming to the X organisation](https://martinfowler.com/articles/zaninotto.pdf). At *XP Conference*, May 2002. Archived at [perma.cc/R9AR-QCKZ](https://perma.cc/R9AR-QCKZ)
-
--- a/content/en/ch3.md
+++ b/content/en/ch3.md
@ -4,6 +4,8 @@ weight: 103
 breadcrumbs: false
 ---

+<a id="ch_datamodels"></a>
+
 ![](/map/ch02.png)

 > *The limits of my language mean the limits of my world.*
@ -27,7 +29,7 @@ question is: how is it *represented* in terms of the next-lower layer? For examp
 3. The engineers who built your database software decided on a way of representing that
 document/relational/graph data in terms of bytes in memory, on disk, or on a network. The
 representation may allow the data to be queried, searched, manipulated, and processed in various
- ways. We will discuss these storage engine designs in [Chapter 4](/en/ch4#ch_storage).
+ ways. We will discuss these storage engine designs in [Chapter 4](/en/ch4#ch_storage).
 4. On yet lower levels, hardware engineers have figured out how to represent bytes in terms of
 electrical currents, pulses of light, magnetic fields, and more.

@ -156,7 +158,7 @@ Nevertheless, ORMs also have advantages:
 #### The document data model for one-to-many relationships {#the-document-data-model-for-one-to-many-relationships}

 Not all data lends itself well to a relational representation; let’s look at an example to explore a
-limitation of the relational model. [Figure 3-1](/en/ch3#fig_obama_relational) illustrates how a résumé (a LinkedIn
+limitation of the relational model. [Figure 3-1](/en/ch3#fig_obama_relational) illustrates how a résumé (a LinkedIn
 profile) could be expressed in a relational schema. The profile as a whole can be identified by a
 unique identifier, `user_id`. Fields like `first_name` and `last_name` appear exactly once per user,
 so they can be modeled as columns on the `users` table.
@ -165,13 +167,13 @@ Most people have had more than one job in their career (positions), and people m
 numbers of periods of education and any number of pieces of contact information. One way of
 representing such *one-to-many relationships* is to put positions, education, and contact
 information in separate tables, with a foreign key reference to the `users` table, as in
-[Figure 3-1](/en/ch3#fig_obama_relational).
+[Figure 3-1](/en/ch3#fig_obama_relational).

 {{< figure src="/fig/ddia_0301.png" id="fig_obama_relational" caption="Figure 3-1. Representing a LinkedIn profile using a relational schema." class="w-full my-4" >}}

 Another way of representing the same information, which is perhaps more natural and maps more
 closely to an object structure in application code, is as a JSON document as shown in
-[Example 3-1](/en/ch3#fig_obama_json).
+[Example 3-1](/en/ch3#fig_obama_json).

 {{< figure id="fig_obama_json" title="Example 3-1. Representing a LinkedIn profile as a JSON document" class="w-full my-4" >}}

@ -199,12 +201,12 @@ closely to an object structure in application code, is as a JSON document as sho
 ```

 Some developers feel that the JSON model reduces the impedance mismatch between the application code
-and the storage layer. However, as we shall see in [Chapter 5](/en/ch5#ch_encoding), there are also problems with
+and the storage layer. However, as we shall see in [Chapter 5](/en/ch5#ch_encoding), there are also problems with
 JSON as a data encoding format. The lack of a schema is often cited as an advantage; we will discuss
 this in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility).

 The JSON representation has better *locality* than the multi-table schema in
-[Figure 3-1](/en/ch3#fig_obama_relational) (see [“Data locality for reads and writes”](/en/ch3#sec_datamodels_document_locality)). If you want to fetch a profile
+[Figure 3-1](/en/ch3#fig_obama_relational) (see [“Data locality for reads and writes”](/en/ch3#sec_datamodels_document_locality)). If you want to fetch a profile
 in the relational example, you need to either perform multiple queries (query each table by
 `user_id`) or perform a messy multi-way join between the `users` table and its subordinate tables [^8].
 In the JSON representation, all the relevant information is in one place, making the query both
@ -212,7 +214,7 @@ faster and simpler.

 The one-to-many relationships from the user profile to the user’s positions, educational history, and
 contact information imply a tree structure in the data, and the JSON representation makes this tree
-structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
+structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).

 {{< figure src="/fig/ddia_0302.png" id="fig_json_tree" caption="Figure 3-2. One-to-many relationships forming a tree structure." class="w-full my-4" >}}

@ -222,13 +224,13 @@ structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
 > This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé typically has a small number of positions [^9] [^10].
 > In situations where there may be a genuinely large number of related items—say, comments on a
 > celebrity’s social media post, of which there could be many thousands—embedding them all in the same
-> document may be too unwieldy, so the relational approach in [Figure 3-1](/en/ch3#fig_obama_relational) is preferable.
+> document may be too unwieldy, so the relational approach in [Figure 3-1](/en/ch3#fig_obama_relational) is preferable.

 --------

 ### Normalization, Denormalization, and Joins {#sec_datamodels_normalization}

-In [Example 3-1](/en/ch3#fig_obama_json) in the preceding section, `region_id` is given as an ID, not as the plain-text
+In [Example 3-1](/en/ch3#fig_obama_json) in the preceding section, `region_id` is given as an ID, not as the plain-text
 string `"Washington, DC, United States"`. Why?

 If the user interface has a free-text field for entering the region, it makes sense to store it as a
@ -321,7 +323,7 @@ Besides the cost of performing all these updates, you also need to consider the
 database if a process crashes halfway through making its updates. Databases that offer atomic
 transactions (see [“Atomicity”](/en/ch8#sec_transactions_acid_atomicity)) make it easier to remain consistent, but not
 all databases offer atomicity across multiple documents. It is also possible to ensure consistency
-through stream processing, which we discuss in [Link to Come].
+through stream processing, which we discuss in [“Keeping Systems in Sync”](/en/ch12#sec_stream_sync).

 Normalization tends to be better for OLTP systems, where both reads and updates need to be fast;
 analytics systems often fare better with denormalized data, since they perform updates in bulk, and
@ -332,7 +334,7 @@ acceptable. However, in very large-scale systems, the cost of joins can become p

 #### Denormalization in the social networking case study {#denormalization-in-the-social-networking-case-study}

-In [“Case Study: Social Network Home Timelines”](/en/ch2#sec_introduction_twitter) we compared a normalized representation ([Figure 2-1](/en/ch2#fig_twitter_relational))
+In [“Case Study: Social Network Home Timelines”](/en/ch2#sec_introduction_twitter) we compared a normalized representation ([Figure 2-1](/en/ch2#fig_twitter_relational))
 and a denormalized one (precomputed, materialized timelines): here, the join between `posts` and
 `follows` was too expensive, and the materialized timeline is a cache of the result of that join.
 The fan-out process that inserts a new post into followers’ timelines was our way of keeping the
@ -380,7 +382,7 @@ of performance of reads and writes, as well as the amount of effort to implement

 ### Many-to-One and Many-to-Many Relationships {#sec_datamodels_many_to_many}

-While `positions` and `education` in [Figure 3-1](/en/ch3#fig_obama_relational) are examples of one-to-many or
+While `positions` and `education` in [Figure 3-1](/en/ch3#fig_obama_relational) are examples of one-to-many or
 one-to-few relationships (one résumé has several positions, but each position belongs only to one
 résumé), the `region_id` field is an example of a *many-to-one* relationship (many people live in
 the same region, but we assume that each person lives in only one region at any one time).
@ -389,14 +391,14 @@ If we introduce entities for organizations and schools, and reference them by ID
 then we also have *many-to-many* relationships (one person has worked for several organizations, and
 an organization has several past or present employees). In a relational model, such a relationship
 is usually represented as an *associative table* or *join table*, as shown in
-[Figure 3-3](/en/ch3#fig_datamodels_m2m_rel): each position associates one user ID with one organization ID.
+[Figure 3-3](/en/ch3#fig_datamodels_m2m_rel): each position associates one user ID with one organization ID.

 {{< figure src="/fig/ddia_0303.png" id="fig_datamodels_m2m_rel" caption="Figure 3-3. Many-to-many relationships in the relational model." class="w-full my-4" >}}

 Many-to-one and many-to-many relationships do not easily fit within one self-contained JSON
 document; they lend themselves more to a normalized representation. In a document model, one
-possible representation is given in [Example 3-2](/en/ch3#fig_datamodels_m2m_json) and illustrated in
-[Figure 3-4](/en/ch3#fig_datamodels_many_to_many): the data within each dotted rectangle can be grouped into one
+possible representation is given in [Example 3-2](/en/ch3#fig_datamodels_m2m_json) and illustrated in
+[Figure 3-4](/en/ch3#fig_datamodels_many_to_many): the data within each dotted rectangle can be grouped into one
 document, but the links to organizations and schools are best represented as references to other
 documents.

@ -426,11 +428,11 @@ representation is denormalized, since the relationship is stored in two places,
 inconsistent with each other.

 A normalized representation stores the relationship in only one place, and relies on *secondary
-indexes* (which we discuss in [Chapter 4](/en/ch4#ch_storage)) to allow the relationship to be efficiently queried in
-both directions. In the relational schema of [Figure 3-3](/en/ch3#fig_datamodels_m2m_rel), we would tell the database
+indexes* (which we discuss in [Chapter 4](/en/ch4#ch_storage)) to allow the relationship to be efficiently queried in
+both directions. In the relational schema of [Figure 3-3](/en/ch3#fig_datamodels_m2m_rel), we would tell the database
 to create indexes on both the `user_id` and the `org_id` columns of the `positions` table.

-In the document model of [Example 3-2](/en/ch3#fig_datamodels_m2m_json), the database needs to index the `org_id` field
+In the document model of [Example 3-2](/en/ch3#fig_datamodels_m2m_json), the database needs to index the `org_id` field
 of objects inside the `positions` array. Many document databases and relational databases with JSON
 support are able to create such indexes on values inside a document.

@ -442,7 +444,7 @@ widely-used conventions for the structure of tables in a data warehouse: a *star
 and *one big table* (OBT). These structures are optimized for the needs of business analysts. ETL
 processes translate data from operational systems into this schema.

-[Figure 3-5](/en/ch3#fig_dwh_schema) shows an example of a star schema that might be found in the data warehouse of a grocery
+[Figure 3-5](/en/ch3#fig_dwh_schema) shows an example of a star schema that might be found in the data warehouse of a grocery
 retailer. At the center of the schema is a so-called *fact table* (in this example, it is called
 `fact_sales`). Each row of the fact table represents an event that occurred at a particular time
 (here, each row represents a customer’s purchase of a product). If we were analyzing website traffic
@ -460,7 +462,7 @@ Other columns in the fact table are foreign key references to other tables, call
 tables*. As each row in the fact table represents an event, the dimensions represent the *who*,
 *what*, *where*, *when*, *how*, and *why* of the event.

-For example, in [Figure 3-5](/en/ch3#fig_dwh_schema), one of the dimensions is the product that was sold. Each row in
+For example, in [Figure 3-5](/en/ch3#fig_dwh_schema), one of the dimensions is the product that was sold. Each row in
 the `dim_product` table represents one type of product that is for sale, including its stock-keeping
 unit (SKU), description, brand name, category, fat content, package size, etc. Each row in the
 `fact_sales` table uses a foreign key to indicate which product was sold in that particular
@ -470,7 +472,7 @@ Even date and time are often represented using dimension tables, because this al
 information about dates (such as public holidays) to be encoded, allowing queries to differentiate
 between sales on holidays and non-holidays.

-[Figure 3-5](/en/ch3#fig_dwh_schema) is an example of a star schema. The name comes from the fact that when the table
+[Figure 3-5](/en/ch3#fig_dwh_schema) is an example of a star schema. The name comes from the fact that when the table
 relationships are visualized, the fact table is in the middle, surrounded by its dimension tables;
 the connections to these tables are like the rays of a star.

@ -516,7 +518,7 @@ many-to-many relationships. Let’s examine these arguments in more detail.
 If the data in your application has a document-like structure (i.e., a tree of one-to-many
 relationships, where typically the entire tree is loaded at once), then it’s probably a good idea to
 use a document model. The relational technique of *shredding*—splitting a document-like structure
-into multiple tables (like `positions`, `education`, and `contact_info` in [Figure 3-1](/en/ch3#fig_obama_relational))
+into multiple tables (like `positions`, `education`, and `contact_info` in [Figure 3-1](/en/ch3#fig_obama_relational))
 — can lead to cumbersome schemas and unnecessarily complicated application code.

 The document model has limitations: for example, you cannot refer directly to a nested item within a
@ -595,14 +597,14 @@ structure for some reason (i.e., the data is heterogeneous)—for example, becau
 In situations like these, a schema may hurt more than it helps, and schemaless documents can be a
 much more natural data model. But in cases where all records are expected to have the same
 structure, schemas are a useful mechanism for documenting and enforcing that structure. We will
-discuss schemas and schema evolution in more detail in [Chapter 5](/en/ch5#ch_encoding).
+discuss schemas and schema evolution in more detail in [Chapter 5](/en/ch5#ch_encoding).

 #### Data locality for reads and writes {#sec_datamodels_document_locality}

 A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant
 thereof (such as MongoDB’s BSON). If your application often needs to access the entire document
 (for example, to render it on a web page), there is a performance advantage to this *storage
-locality*. If data is split across multiple tables, like in [Figure 3-1](/en/ch3#fig_obama_relational), multiple
+locality*. If data is split across multiple tables, like in [Figure 3-1](/en/ch3#fig_obama_relational), multiple
 index lookups are required to retrieve it all, which may require more disk seeks and take more time.

 The locality advantage only applies if you need large parts of the document at the same time. The
@ -755,7 +757,7 @@ as SQL support for querying graphs. Other graph query languages exist, such as G
 but these will give us a representative overview.

 To illustrate these different languages and models, this section uses the graph shown in
-[Figure 3-6](/en/ch3#fig_datamodels_graph) as running example. It could be taken from a social network or a
+[Figure 3-6](/en/ch3#fig_datamodels_graph) as a running example. It could be taken from a social network or a
 genealogical database: it shows two people, Lucy from Idaho and Alain from Saint-Lô, France. They
 are married and living in London. Each person and each location is represented as a vertex, and the
 relationships between them as edges. This example will help demonstrate some queries that are easy
@ -782,7 +784,7 @@ Each edge consists of:
 * A collection of properties (key-value pairs)

 You can think of a graph store as consisting of two relational tables, one for vertices and one for
-edges, as shown in [Example 3-3](/en/ch3#fig_graph_sql_schema) (this schema uses the PostgreSQL `jsonb` datatype to
+edges, as shown in [Example 3-3](/en/ch3#fig_graph_sql_schema) (this schema uses the PostgreSQL `jsonb` datatype to
 store the properties of each vertex or edge). The head and tail vertex are stored for each edge; if
 you want the set of incoming or outgoing edges for a vertex, you can query the `edges` table by
 `head_vertex` or `tail_vertex`, respectively.
@ -814,7 +816,7 @@ Some important aspects of this model are:
 restricts which kinds of things can or cannot be associated.
 2. Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus
 *traverse* the graph—i.e., follow a path through a chain of vertices—both forward and backward.
- (That’s why [Example 3-3](/en/ch3#fig_graph_sql_schema) has indexes on both the `tail_vertex` and `head_vertex`
+ (That’s why [Example 3-3](/en/ch3#fig_graph_sql_schema) has indexes on both the `tail_vertex` and `head_vertex`
 columns.)
 3. By using different labels for different kinds of vertices and relationships, you can store
 several different kinds of information in a single graph, while still maintaining a clean data
@ -837,7 +839,7 @@ vertices or edges with certain properties to be found efficiently.
 --------

 Those features give graphs a great deal of flexibility for data modeling, as illustrated in
-[Figure 3-6](/en/ch3#fig_datamodels_graph). The figure shows a few things that would be difficult to express in a
+[Figure 3-6](/en/ch3#fig_datamodels_graph). The figure shows a few things that would be difficult to express in a
 traditional relational schema, such as different kinds of regional structures in different countries
 (France has *départements* and *régions*, whereas the US has *counties* and *states*), quirks of
 history such as a country within a country (ignoring for now the intricacies of sovereign states and
@ -859,8 +861,8 @@ and later developed into an open standard as *openCypher* [^38]. Besides Neo4j,
 Amazon Neptune, Apache AGE (with storage in PostgreSQL), and others. It is named after a character
 in the movie *The Matrix* and is not related to ciphers in cryptography [^39].

-[Example 3-4](/en/ch3#fig_cypher_create) shows the Cypher query to insert the lefthand portion of
-[Figure 3-6](/en/ch3#fig_datamodels_graph) into a graph database. The rest of the graph can be added similarly. Each
+[Example 3-4](/en/ch3#fig_cypher_create) shows the Cypher query to insert the lefthand portion of
+[Figure 3-6](/en/ch3#fig_datamodels_graph) into a graph database. The rest of the graph can be added similarly. Each
 vertex is given a symbolic name like `usa` or `idaho`. That name is not stored in the database, but
 only used internally within the query to create edges between the vertices, using an arrow notation:
 `(idaho) -[:WITHIN]-> (usa)` creates an edge labeled `WITHIN`, with `idaho` as the tail node and
@ -878,13 +880,13 @@ CREATE
    (lucy) -[:BORN_IN]-> (idaho)
 ```

-When all the vertices and edges of [Figure 3-6](/en/ch3#fig_datamodels_graph) are added to the database, we can start
+When all the vertices and edges of [Figure 3-6](/en/ch3#fig_datamodels_graph) are added to the database, we can start
 asking interesting questions: for example, *find the names of all the people who emigrated from the
 United States to Europe*. That is, find all the vertices that have a `BORN_IN` edge to a location
 within the US, and also a `LIVING_IN` edge to a location within Europe, and return the `name`
 property of each of those vertices.

-[Example 3-5](/en/ch3#fig_cypher_query) shows how to express that query in Cypher. The same arrow notation is used in a
+[Example 3-5](/en/ch3#fig_cypher_query) shows how to express that query in Cypher. The same arrow notation is used in a
 `MATCH` clause to find patterns in the graph: `(person) -[:BORN_IN]-> ()` matches any two vertices
 that are related by an edge labeled `BORN_IN`. The tail vertex of that edge is bound to the
 variable `person`, and the head vertex is left unnamed.
@ -923,7 +925,7 @@ can be found through an incoming `BORN_IN` or `LIVES_IN` edge at one of the loca

 ### Graph Queries in SQL {#id58}

-[Example 3-3](/en/ch3#fig_graph_sql_schema) suggested that graph data can be represented in a relational database. But
+[Example 3-3](/en/ch3#fig_graph_sql_schema) suggested that graph data can be represented in a relational database. But
 if we put graph data in a relational structure, can we also query it using SQL?

 The answer is yes, but with some difficulty. Every edge that you traverse in a graph query is
@ -943,7 +945,7 @@ or more times.” It is like the `*` operator in a regular expression.

 Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using
 something called *recursive common table expressions* (the `WITH RECURSIVE` syntax).
-[Example 3-6](/en/ch3#fig_graph_sql_query) shows the same query—finding the names of people who emigrated from the US
+[Example 3-6](/en/ch3#fig_graph_sql_query) shows the same query—finding the names of people who emigrated from the US
 to Europe—expressed in SQL using this technique. However, the syntax is very clumsy in comparison to
 Cypher.

@ -1035,7 +1037,7 @@ The subject of a triple is equivalent to a vertex in a graph. The object is one

 1. A value of a primitive datatype, such as a string or a number. In that case, the predicate and
 object of the triple are equivalent to the key and value of a property on the subject vertex.
- Using the example from [Figure 3-6](/en/ch3#fig_datamodels_graph), (*lucy*, *birthYear*, *1989*) is like a vertex
+ Using the example from [Figure 3-6](/en/ch3#fig_datamodels_graph), (*lucy*, *birthYear*, *1989*) is like a vertex
 `lucy` with properties `{"birthYear": 1989}`.
 2. Another vertex in the graph. In that case, the predicate is an edge in the
 graph, the subject is the tail vertex, and the object is the head vertex. For example, in
@ -1051,7 +1053,7 @@ The subject of a triple is equivalent to a vertex in a graph. The object is one
 > Since these databases retain the basic *subject-predicate-object* structure explained above, this
 > book nevertheless calls them triple-stores.

-[Example 3-7](/en/ch3#fig_graph_n3_triples) shows the same data as in [Example 3-4](/en/ch3#fig_cypher_create), written as
+[Example 3-7](/en/ch3#fig_graph_n3_triples) shows the same data as in [Example 3-4](/en/ch3#fig_cypher_create), written as
 triples in a format called *Turtle*, a subset of *Notation3* (*N3*) [^48].

 {{< figure id="fig_graph_n3_triples" title="Example 3-7. A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as Turtle triples" class="w-full my-4" >}}
@ -1081,7 +1083,7 @@ _:usa`. When the predicate is a property, the object is a string literal, as in

 It’s quite repetitive to repeat the same subject over and over again, but fortunately you can use
 semicolons to say multiple things about the same subject. This makes the Turtle format quite
-readable: see [Example 3-8](/en/ch3#fig_graph_n3_shorthand).
+readable: see [Example 3-8](/en/ch3#fig_graph_n3_shorthand).

 {{< figure id="fig_graph_n3_shorthand" title="Example 3-8. A more concise way of writing the data in [Example 3-7](/en/ch3#fig_graph_n3_triples)" class="w-full my-4" >}}

@ -1112,10 +1114,10 @@ case: even if you have no interest in the Semantic Web, triples can be a good in

 #### The RDF data model {#the-rdf-data-model}

-The Turtle language we used in [Example 3-8](/en/ch3#fig_graph_n3_shorthand) is actually a way of encoding data in the
+The Turtle language we used in [Example 3-8](/en/ch3#fig_graph_n3_shorthand) is actually a way of encoding data in the
 *Resource Description Framework* (RDF) [^55],
 a data model that was designed for the Semantic Web. RDF data can also be encoded in other ways, for
-example (more verbosely) in XML, as shown in [Example 3-9](/en/ch3#fig_graph_rdf_xml). Tools like Apache Jena can
+example (more verbosely) in XML, as shown in [Example 3-9](/en/ch3#fig_graph_rdf_xml). Tools like Apache Jena can
 automatically convert between different RDF encodings.

 {{< figure id="fig_graph_rdf_xml" title="Example 3-9. The data of [Example 3-8](/en/ch3#fig_graph_n3_shorthand), expressed using RDF/XML syntax" class="w-full my-4" >}}
@ -1169,7 +1171,7 @@ It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQ
 similar.

 The same query as before—finding people who have moved from the US to Europe—is similarly concise in
-SPARQL as it is in Cypher (see [Example 3-10](/en/ch3#fig_sparql_query)).
+SPARQL as it is in Cypher (see [Example 3-10](/en/ch3#fig_sparql_query)).

 {{< figure id="fig_sparql_query" title="Example 3-10. The same query as [Example 3-5](/en/ch3#fig_cypher_query), expressed in SPARQL" class="w-full my-4" >}}

@ -1224,8 +1226,8 @@ columns: *ID*, *name*, and *type*. The fact that the US is a country could then
 `table(val1, val2, …)` means that `table` contains a row where the first column contains `val1`,
 the second column contains `val2`, and so on.

-[Example 3-11](/en/ch3#fig_datalog_triples) shows how to write the data from the left-hand side of
-[Figure 3-6](/en/ch3#fig_datamodels_graph) in Datalog. The edges of the graph (`within`, `born_in`, and `lives_in`)
+[Example 3-11](/en/ch3#fig_datalog_triples) shows how to write the data from the left-hand side of
+[Figure 3-6](/en/ch3#fig_datamodels_graph) in Datalog. The edges of the graph (`within`, `born_in`, and `lives_in`)
 are represented as two-column join tables. For example, Lucy has the ID 100 and Idaho has the ID 3,
 so the relationship “Lucy was born in Idaho” is represented as `born_in(100, 3)`.

@ -1244,7 +1246,7 @@ born_in(100, 3). /* Lucy was born in Idaho */
 ```

 Now that we have defined the data, we can write the same query as before, as shown in
-[Example 3-12](/en/ch3#fig_datalog_query). It looks a bit different from the equivalent in Cypher or SPARQL, but don’t
+[Example 3-12](/en/ch3#fig_datalog_query). It looks a bit different from the equivalent in Cypher or SPARQL, but don’t
 let that put you off. Datalog is a subset of Prolog, a programming language that you might have seen
 before if you’ve studied computer science.

@ -1271,7 +1273,7 @@ define *rules* that derive new virtual tables from the underlying facts. These d
 like (virtual) SQL views: they are not stored in the database, but you can query them in the same
 way as a table containing stored facts.

-In [Example 3-12](/en/ch3#fig_datalog_query) we define three derived tables: `within_recursive`, `migrated`, and
+In [Example 3-12](/en/ch3#fig_datalog_query) we define three derived tables: `within_recursive`, `migrated`, and
 `us_to_europe`. The name and columns of the virtual tables are defined by what appears before the
 `:-` symbol of each rule. For example, `migrated(PName, BornIn, LivingIn)` is a virtual table with
 three columns: the name of a person, the name of the place where they were born, and the name of the
@ -1284,7 +1286,7 @@ variable `PName` bound to the value `"Lucy"`. A rule applies if the system can f
 *all* patterns on the righthand side of the `:-` operator. When the rule applies, it’s as though the
 lefthand side of the `:-` was added to the database (with variables replaced by the values they matched).

-One possible way of applying the rules is thus (and as illustrated in [Figure 3-7](/en/ch3#fig_datalog_naive)):
+One possible way of applying the rules is thus (and as illustrated in [Figure 3-7](/en/ch3#fig_datalog_naive)):

 1. `location(1, "North America", "continent")` exists in the database, so rule 1 applies. It generates `within_recursive(1, "North America")`.
 2. `within(2, 1)` exists in the database and the previous step generated `within_recursive(1, "North America")`, so rule 2 applies. It generates `within_recursive(2, "North America")`.
@ -1295,7 +1297,7 @@ locations in North America (or any other location) contained in our database.

 {{< figure link="#fig_datalog_query" src="/fig/ddia_0307.png" id="fig_datalog_naive" title="Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from Example 3-12." class="w-full my-4" >}}

-> Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](/en/ch3#fig_datalog_query).
+> Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](/en/ch3#fig_datalog_query).

 Now rule 3 can find people who were born in some location `BornIn` and live in some location
 `LivingIn`. Rule 4 invokes rule 3 with `BornIn = 'United States'` and
@ -1307,7 +1309,7 @@ The Datalog approach requires a different kind of thinking compared to the other
 discussed in this chapter. It allows complex queries to be built up rule by rule, with one rule
 referring to other rules, similarly to the way that you break down code into functions that call
 each other. Just like functions can be recursive, Datalog rules can also invoke themselves, like
-rule 2 in [Example 3-12](/en/ch3#fig_datalog_query), which enables graph traversals in Datalog queries.
+rule 2 in [Example 3-12](/en/ch3#fig_datalog_query), which enables graph traversals in Datalog queries.

 ### GraphQL {#id63}

@ -1319,7 +1321,7 @@ interfaces allow developers to rapidly change queries in client code without cha

 GraphQL’s flexibility comes at a cost. Organizations that adopt GraphQL often need tooling to
 convert GraphQL queries into requests to internal services, which often use REST or gRPC (see
-[Chapter 5](/en/ch5#ch_encoding)). Authorization, rate limiting, and performance challenges are additional concerns [^61].
+[Chapter 5](/en/ch5#ch_encoding)). Authorization, rate limiting, and performance challenges are additional concerns [^61].
 GraphQL’s query language is also limited since GraphQL queries come from an untrusted source. The language
 does not allow anything that could be expensive to execute, since otherwise users could perform
 denial-of-service attacks on a server by running lots of expensive queries. In particular, GraphQL
@ -1327,7 +1329,7 @@ does not allow recursive queries (unlike Cypher, SPARQL, SQL, or Datalog), and i
 arbitrary search conditions such as “find people who were born in the US and are now living in
 Europe” (unless the service owners specifically choose to offer such search functionality).

-Nevertheless, GraphQL is useful. [Example 3-13](/en/ch3#fig_graphql_query) shows how you might implement a group chat
+Nevertheless, GraphQL is useful. [Example 3-13](/en/ch3#fig_graphql_query) shows how you might implement a group chat
 application such as Discord or Slack using GraphQL. The query requests all the channels that the
 user has access to, including the channel name and the 50 most recent messages in each channel. For
 each message it requests the timestamp, the message content, and the name and profile picture URL
@ -1359,7 +1361,7 @@ query ChatApp {
 }
 ```

-[Example 3-14](/en/ch3#fig_graphql_response) shows what a response to the query in [Example 3-13](/en/ch3#fig_graphql_query) might look
+[Example 3-14](/en/ch3#fig_graphql_response) shows what a response to the query in [Example 3-13](/en/ch3#fig_graphql_query) might look
 like. The response is a JSON document that mirrors the structure of the query: it contains exactly
 those attributes that were requested, no more and no less. This approach has the advantage that the
 server does not need to know which attributes the client requires in order to render the user
@ -1395,13 +1397,13 @@ were changed to add that profile picture, it would be easy for the client to add
 ...
 ```

-In [Example 3-14](/en/ch3#fig_graphql_response) the name and image URL of a message sender is embedded directly in the
+In [Example 3-14](/en/ch3#fig_graphql_response) the name and image URL of a message sender are embedded directly in the
 message object. If the same user sends multiple messages, this information is repeated on each
 message. In principle, it would be possible to reduce this duplication, but GraphQL makes the design
 choice to accept a larger response size in order to make it simpler to render the user interface
 based on the data.

-The `replyTo` field is similar: in [Example 3-14](/en/ch3#fig_graphql_response), the second message is a reply to the
+The `replyTo` field is similar: in [Example 3-14](/en/ch3#fig_graphql_response), the second message is a reply to the
 first, and the content (“Hey!…”) and sender Aaliyah are duplicated under `replyTo`. It would be
 possible to instead return the ID of the message being replied to, but then the client would have to
 make an additional request to the server if that ID is not among the 50 most recent messages
@ -1439,7 +1441,7 @@ timestamp, and then append it to a sequence of events. Events in this log are *i
 change or delete them, you only ever append more events to the log (which may supersede earlier
 events). An event can contain arbitrary properties.

-[Figure 3-8](/en/ch3#fig_event_sourcing) shows an example that could be taken from a conference management system. A
+[Figure 3-8](/en/ch3#fig_event_sourcing) shows an example that could be taken from a conference management system. A
 conference can be a complex business domain: not only can individual attendees register and pay by
 card, but companies can also order seats in bulk, pay by invoice, and then later assign the seats to
 individual people. Some number of seats may be reserved for speakers, sponsors, volunteer helpers,
@ -1449,7 +1451,7 @@ calculating the number of available seats becomes a challenging query.

 {{< figure src="/fig/ddia_0308.png" id="fig_event_sourcing" title="Figure 3-8. Using a log of immutable events as source of truth, and deriving materialized views from it." class="w-full my-4" >}}

-In [Figure 3-8](/en/ch3#fig_event_sourcing), every change to the state of the conference (such as the organizer
+In [Figure 3-8](/en/ch3#fig_event_sourcing), every change to the state of the conference (such as the organizer
 opening registrations, or attendees making and cancelling registrations) is first stored as an
 event. Whenever an event is appended to the log, several *materialized views* (also known as
 *projections* or *read models*) are also updated to reflect the effect of that event. In the
@ -1540,11 +1542,11 @@ You can implement event sourcing on top of any database, but there are also some
 specifically designed to support this pattern, such as EventStoreDB, MartenDB (based on PostgreSQL),
 and Axon Framework. You can also use message brokers such as Apache Kafka to store the event log,
 and stream processors can keep the materialized views up-to-date; we will return to these topics in
-[Link to Come].
+[“Change data capture versus event sourcing”](/en/ch12#sec_stream_event_sourcing).

 The only important requirement is that the event storage system must guarantee that all materialized
 views process the events in exactly the same order as they appear in the log; as we shall see in
-[Chapter 10](/en/ch10#ch_consistency), this is not always easy to achieve in a distributed system.
+[Chapter 10](/en/ch10#ch_consistency), this is not always easy to achieve in a distributed system.


 ## Dataframes, Matrices, and Arrays {#sec_datamodels_dataframes}
@ -1579,7 +1581,7 @@ For example, a common use of dataframes is to transform data from a relational-l
 into a matrix or multidimensional array representation, which is the form that many machine learning
 algorithms expect of their input.

-A simple example of such a transformation is shown in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix). On the left we
+A simple example of such a transformation is shown in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix). On the left we
 have a relational table of how different users have rated various movies (on a scale of 1 to 5), and
 on the right the data has been transformed into a matrix where each column is a movie and each row
 is a user (similarly to a *pivot table* in a spreadsheet). The matrix is *sparse*, which means there
@ -1592,7 +1594,7 @@ that offer sparse arrays (such as NumPy for Python) can handle such data easily.
 A matrix can only contain numbers, and various techniques are used to transform non-numerical data
 into numbers in the matrix. For example:

-* Dates (which are omitted from the example matrix in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix)) could be scaled
+* Dates (which are omitted from the example matrix in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix)) could be scaled
 to be floating-point numbers within some suitable range.
 * For columns that can only take one of a small, fixed set of values (for example, the genre of a
 movie in a database of movies), a *one-hot encoding* is often used: we create a column for each
@ -1603,7 +1605,7 @@ into numbers in the matrix. For example:

 Once the data is in the form of a matrix of numbers, it is amenable to linear algebra operations,
 which form the basis of many machine learning algorithms. For example, the data in
-[Figure 3-9](/en/ch3#fig_dataframe_to_matrix) could be a part of a system for recommending movies that the user may
+[Figure 3-9](/en/ch3#fig_dataframe_to_matrix) could be a part of a system for recommending movies that the user may
 like. Dataframes are flexible enough to allow data to be gradually evolved from a relational form
 into a matrix representation, while giving the data scientist control over the representation that
 is most suitable for achieving the goals of the data analysis or model training process.
@ -1648,7 +1650,7 @@ gradually improving.

 Another model we discussed is *event sourcing*, which represents data as an append-only log of
 immutable events, and which can be advantageous for modeling activities in complex business domains.
-An append-only log is good for writing data (as we shall see in [Chapter 4](/en/ch4#ch_storage)); in order to support
+An append-only log is good for writing data (as we shall see in [Chapter 4](/en/ch4#ch_storage)); in order to support
 efficient queries, the event log is translated into read-optimized materialized views through CQRS.

 One thing that non-relational data models have in common is that they typically don’t enforce a
--- a/content/en/ch4.md
+++ b/content/en/ch4.md
@ -4,6 +4,8 @@ weight: 104
 breadcrumbs: false
 ---

+<a id="ch_storage"></a>
+
 ![](/map/ch03.png)

 > *One of the miseries of life is that everybody names things a little bit wrong. And so it makes
@ -17,7 +19,7 @@ breadcrumbs: false
 On the most fundamental level, a database needs to do two things: when you give it some data, it
 should store the data, and when you ask it again later, it should give the data back to you.

-In [Chapter 3](/en/ch3#ch_datamodels) we discussed data models and query languages—i.e., the format in which you give
+In [Chapter 3](/en/ch3#ch_datamodels) we discussed data models and query languages—i.e., the format in which you give
 the database your data, and the interface through which you can ask for it again later. In this
 chapter we discuss the same from the database’s point of view: how the database can store the data
 that you give it, and how it can find the data again when you ask for it.
@ -140,7 +142,7 @@ your application the greatest benefit, without introducing more overhead on writ
 To start, let’s assume that you want to continue storing data in the append-only file written by
 `db_set`, and you just want to speed up reads. One way you could do this is by keeping a hash map in
 memory, in which every key is mapped to the byte offset in the file at which the most recent value
-for that key can be found, as illustrated in [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
+for that key can be found, as illustrated in [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).

 {{< figure src="/fig/ddia_0401.png" id="fig_storage_csv_hash_index" caption="Figure 4-1. Storing a log of key-value pairs in a CSV-like format, indexed with an in-memory hash map." class="w-full my-4" >}}

@ -167,7 +169,7 @@ This approach is much faster, but it still suffers from several problems:
 In practice, hash tables are not used very often for database indexes, and instead it is much more
 common to keep data in a structure that is *sorted by key* [^3].
 One example of such a structure is a *Sorted String Table*, or *SSTable* for short, as shown in
-[Figure 4-2](/en/ch4#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that
+[Figure 4-2](/en/ch4#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that
 they are sorted by key, and each key only appears once in the file.

 {{< figure src="/fig/ddia_0402.png" id="fig_storage_sstable_index" caption="Figure 4-2. An SSTable with a sparse index, allowing queries to jump to the right block." class="w-full my-4" >}}
@ -178,7 +180,7 @@ This kind of index, which stores only some of the keys, is called *sparse*. This
 a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data
 structure that allows queries to quickly look up a particular key [^4].

-For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the
+For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the
 first key of the next block is `handsome`. Now say you’re looking for the key `handiwork`, which
 doesn’t appear in the sparse index. Because of the sorting you know that `handiwork` must appear
 between `handbag` and `handsome`. This means you can seek to the offset for `handbag` and scan the
@ -186,7 +188,7 @@ file from there until you find `handiwork` (or not, if the key is not present in
 of a few kilobytes can be scanned very quickly.

 Moreover, each block of records can be compressed (indicated by the shaded area in
-[Figure 4-2](/en/ch4#fig_storage_sstable_index)). Besides saving disk space, compression also reduces the I/O
+[Figure 4-2](/en/ch4#fig_storage_sstable_index)). Besides saving disk space, compression also reduces the I/O
 bandwidth use, at the cost of using a bit more CPU time.

 #### Constructing and merging SSTables {#constructing-and-merging-sstables}
@ -217,7 +219,7 @@ log and a sorted file:
 and to discard overwritten or deleted values.

 Merging segments works similarly to the *mergesort* algorithm [^5]. The process is illustrated in
-[Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key
+[Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key
 in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If
 the same key appears in more than one input file, keep only the more recent value. This produces a
 new merged segment file, also sorted by key, with one value per key, and it uses minimal memory
@ -258,7 +260,9 @@ the memtable or while merging segments, the database can just delete the unfinis
 start afresh. The log that persists writes to the memtable could contain incomplete records if there
 was a crash halfway through writing a record, or if the disk was full; these are typically detected
 by including checksums in the log, and discarding corrupted or incomplete log entries. We will talk
-more about durability and crash recovery in [Chapter 8](/en/ch8#ch_transactions).
+more about durability and crash recovery in [Chapter 8](/en/ch8#ch_transactions).
+
+<a id="sec_storage_bloom_filter"></a>

 #### Bloom filters {#bloom-filters}

@ -268,7 +272,7 @@ reads, LSM storage engines often include a *Bloom filter* [^13]
 in each segment, which provides a fast but approximate way of checking whether a particular key
 appears in a particular SSTable.

-[Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in
+[Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in
 reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash
 function, producing a set of numbers that are then interpreted as indexes into the array of bits [^14].
 We set the bits corresponding to those indexes to 1, and leave the rest as 0. For example, the key
@ -279,7 +283,7 @@ extra space, but the Bloom filter is generally small compared to the rest of the
 {{< figure src="/fig/ddia_0404.png" id="fig_storage_bloom" caption="Figure 4-4. A Bloom filter provides a fast, probabilistic check whether a particular key exists in a particular SSTable." class="w-full my-4" >}}

 When we want to know whether a key appears in the SSTable, we compute the same hash of that key as
-before, and check the bits at those indexes. For example, in [Figure 4-4](/en/ch4#fig_storage_bloom), we’re querying
+before, and check the bits at those indexes. For example, in [Figure 4-4](/en/ch4#fig_storage_bloom), we’re querying
 the key `handheld`, which hashes to (6, 11, 2). One of those bits is 1 (namely, bit number 2),
 while the other two are 0. These checks can be made extremely fast using the bitwise operations that
 all CPUs support.
@ -333,6 +337,8 @@ characteristics in more detail in [“Comparing B-Trees and LSM-Trees”](/en/ch

 --------

+<a id="sidebar_embedded"></a>
+
 > [!TIP] EMBEDDED STORAGE ENGINES

 Many databases run as a service that accepts queries over a network, but there are also *embedded*
@ -349,7 +355,7 @@ queries that combine data from multiple tenants), you can potentially use a sepa
 database instance per tenant [^20].

 The storage and retrieval methods we discuss in this chapter are used in both embedded and in
-client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques
+client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques
 for scaling a database across multiple machines.

 --------
@ -370,14 +376,14 @@ philosophy.
 The log-structured indexes we saw earlier break the database down into variable-size *segments*,
 typically several megabytes or more in size, that are written once and are then immutable. By
 contrast, B-trees break the database down into fixed-size *blocks* or *pages*, and may overwrite a
-page in-place. A page is traditionally 4 KiB in size, but PostgreSQL now uses 8 KiB and
-MySQL uses 16 KiB by default.
+page in-place. A page is traditionally 4 KiB in size, but PostgreSQL now uses 8 KiB and
+MySQL uses 16 KiB by default.

 Each page can be identified using a page number, which allows one page to refer to another—similar
 to a pointer, but on disk instead of in memory. If all the pages are stored in the same file,
 multiplying the page number by the page size gives us the byte offset in the file where the page is
 located. We can use these page references to construct a tree of pages, as illustrated in
-[Figure 4-5](/en/ch4#fig_storage_b_tree).
+[Figure 4-5](/en/ch4#fig_storage_b_tree).

 {{< figure src="/fig/ddia_0405.png" id="fig_storage_b_tree" caption="Figure 4-5. Looking up the key 251 using a B-tree index. From the root page we first follow the reference to the page for keys 200–300, then the page for keys 250–270." class="w-full my-4" >}}

@ -388,14 +394,14 @@ where the boundaries between those ranges lie.
 (This structure is sometimes called a B+ tree, but we don’t need to distinguish it
 from other B-tree variants.)

-In the example in [Figure 4-5](/en/ch4#fig_storage_b_tree), we are looking for the key 251, so we know that we need to
+In the example in [Figure 4-5](/en/ch4#fig_storage_b_tree), we are looking for the key 251, so we know that we need to
 follow the page reference between the boundaries 200 and 300. That takes us to a similar-looking
 page that further breaks down the 200–300 range into subranges. Eventually we get down to a
 page containing individual keys (a *leaf page*), which either contains the value for each key
 inline or contains references to the pages where the values can be found.

 The number of references to child pages in one page of the B-tree is called the *branching factor*.
-For example, in [Figure 4-5](/en/ch4#fig_storage_b_tree) the branching factor is six. In practice, the branching
+For example, in [Figure 4-5](/en/ch4#fig_storage_b_tree) the branching factor is six. In practice, the branching
 factor depends on the amount of space required to store the page references and the range
 boundaries, but typically it is several hundred.

@ -408,7 +414,7 @@ of key ranges.

 {{< figure src="/fig/ddia_0406.png" id="fig_storage_b_tree_split" caption="Figure 4-6. Growing a B-tree by splitting a page on the boundary key 337. The parent page is updated to reference both children." class="w-full my-4" >}}

-In the example of [Figure 4-6](/en/ch4#fig_storage_b_tree_split), we want to insert the key 334, but the page for the
+In the example of [Figure 4-6](/en/ch4#fig_storage_b_tree_split), we want to insert the key 334, but the page for the
 range 333–345 is already full. We therefore split it into a page for the range 333–337 (including
 the new key), and a page for 337–344. We also have to update the parent page to have references to
 both children, with a boundary value of 337 between them. If the parent page doesn’t have enough
@ -417,9 +423,9 @@ to the root of the tree. When the root is split, we make a new root above it. De
 may require nodes to be merged) is more complex [^5].

 This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth
-of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so
+of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so
 you don’t need to follow many page references to find the page you are looking for. (A four-level
-tree of 4 KiB pages with a branching factor of 500 can store up to 250 TB.)
+tree of 4 KiB pages with a branching factor of 500 can store up to 250 TB.)

 #### Making B-trees reliable {#sec_storage_btree_wal}

@ -530,14 +536,14 @@ flash memory attached to the PCI Express bus) have now overtaken HDDs for many u
 are not subject to such mechanical limitations.

 Nevertheless, SSDs also have higher throughput for sequential writes than for random writes.
-The reason is that flash memory can be read or written one page (typically 4 KiB) at a time,
-but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block
+The reason is that flash memory can be read or written one page (typically 4 KiB) at a time,
+but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block
 may contain valid data, whereas others may contain data that is no longer needed. Before erasing a
 block, the controller must first move pages containing valid data into other blocks; this process is
 called *garbage collection* (GC) [^33].

 A sequential write workload writes larger chunks of data at a time, so it is likely that a whole
-512 KiB block belongs to a single file; when that file is later deleted again, the whole block
+512 KiB block belongs to a single file; when that file is later deleted again, the whole block
 can be erased without having to perform any GC. On the other hand, with a random write workload, it
 is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
 to perform more work before a block can be erased [^34] [^35] [^36].
@ -624,7 +630,7 @@ to that row/document/vertex by its primary key (or ID), and the index is used to

 It is also very common to have *secondary indexes*. In relational databases, you can create several
 secondary indexes on the same table using the `CREATE INDEX` command, allowing you to search by
-columns other than the primary key. For example, in [Figure 3-1](/en/ch3#fig_obama_relational) in [Chapter 3](/en/ch3#ch_datamodels)
+columns other than the primary key. For example, in [Figure 3-1](/en/ch3#fig_obama_relational) in [Chapter 3](/en/ch3#ch_datamodels)
 you would most likely have a secondary index on the `user_id` columns so that you can find all the
 rows belonging to the same user in each of the tables.

@ -791,7 +797,7 @@ rows), so in this section we will focus on storage of facts.

 Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4
 or 5 of them at one time (`"SELECT *"` queries are rarely needed for analytics) [^52]. Take the query in
-[Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone
+[Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone
 buying fruit or candy during the 2024 calendar year), but it only needs to access three columns of
 the `fact_sales` table: `date_key`, `product_sk`,
 and `quantity`. The query ignores all other columns.
@ -816,9 +822,9 @@ How can we execute this query efficiently?

 In most OLTP databases, storage is laid out in a *row-oriented* fashion: all the values from one row
 of a table are stored next to each other. Document databases are similar: an entire document is
-typically stored as one contiguous sequence of bytes. You can see this in the CSV example of [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
+typically stored as one contiguous sequence of bytes. You can see this in the CSV example of [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).

-In order to process a query like [Example 4-1](/en/ch4#fig_storage_analytics_query), you may have indexes on
+In order to process a query like [Example 4-1](/en/ch4#fig_storage_analytics_query), you may have indexes on
 `fact_sales.date_key` and/or `fact_sales.product_sk` that tell the storage engine where to find
 all the sales for a particular date or for a particular product. But then, a row-oriented storage
 engine still needs to load all of those rows (each consisting of over 100 attributes) from disk into
@ -828,8 +834,8 @@ long time.
 The idea behind *column-oriented* (or *columnar*) storage is simple: don’t store all the values from
 one row together, but store all the values from each *column* together instead [^56].
 If each column is stored separately, a query only needs to read and parse those columns that are
-used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using
-an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema).
+used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using
+an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema).

 --------

@ -864,10 +870,10 @@ Besides only loading those columns from disk that are required for a query, we c
 the demands on disk throughput and network bandwidth by compressing data. Fortunately,
 column-oriented storage often lends itself very well to compression.

-Take a look at the sequences of values for each column in [Figure 4-7](/en/ch4#fig_column_store): they often look quite
+Take a look at the sequences of values for each column in [Figure 4-7](/en/ch4#fig_column_store): they often look quite
 repetitive, which is a good sign for compression. Depending on the data in the column, different
 compression techniques can be used. One technique that is particularly effective in data warehouses
-is *bitmap encoding*, illustrated in [Figure 4-8](/en/ch4#fig_bitmap_index).
+is *bitmap encoding*, illustrated in [Figure 4-8](/en/ch4#fig_bitmap_index).

 {{< figure src="/fig/ddia_0408.png" id="fig_bitmap_index" caption="Figure 4-8. Compressed, bitmap-indexed storage of a single column." class="w-full my-4" >}}

@ -880,7 +886,7 @@ not.
 One option is to store those bitmaps using one bit per row. However, these bitmaps typically contain
 a lot of zeros (we say that they are *sparse*). In that case, the bitmaps can additionally be
 run-length encoded: counting the number of consecutive zeros or ones and storing that number, as
-shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the
+shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the
 two bitmap representations, using whichever is the most compact [^73].
 This can make the encoding of a column remarkably efficient.

@ -928,7 +934,7 @@ last month, it might make sense to make `date_key` the first sort key. Then the
 scan only the rows from the last month, which will be much faster than scanning all rows.

 A second column can determine the sort order of any rows that have the same value in the first
-column. For example, if `date_key` is the first sort key in [Figure 4-7](/en/ch4#fig_column_store), it might make
+column. For example, if `date_key` is the first sort key in [Figure 4-7](/en/ch4#fig_column_store), it might make
 sense for `product_sk` to be the second sort key so that all sales for the same product on the same
 day are grouped together in storage. That will help queries that need to group or filter sales by
 product within a certain date range.
@ -936,7 +942,7 @@ product within a certain date range.
 Another advantage of sorted order is that it can help with compression of columns. If the primary
 sort column does not have many distinct values, then after sorting, it will have long sequences
 where the same value is repeated many times in a row. A simple run-length encoding, like we used for
-the bitmaps in [Figure 4-8](/en/ch4#fig_bitmap_index), could compress that column down to a few kilobytes—even if
+the bitmaps in [Figure 4-8](/en/ch4#fig_bitmap_index), could compress that column down to a few kilobytes—even if
 the table has billions of rows.

 That compression effect is strongest on the first sort key. The second and third sort keys will be
@ -1004,7 +1010,7 @@ Vectorized processing
 and get back a bitmap (one bit per value in the input column, which is 1 if it’s a banana); we could
 then pass the `store_sk` column and the ID of the store of interest to the same equality operator,
 and get back another bitmap; and then we could pass the two bitmaps to a “bitwise AND” operator, as
- shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in
+ shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in
 a particular store.

 {{< figure src="/fig/ddia_0409.png" id="fig_bitmap_and" caption="Figure 4-9. A bitwise AND between two bitmaps lends itself to vectorization." class="w-full my-4" >}}
@ -1039,18 +1045,18 @@ discussed earlier, data warehouse queries often involve an aggregate function, s
 `AVG`, `MIN`, or `MAX` in SQL. If the same aggregates are used by many different queries, it can be
 wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that
 queries use most often? A *data cube* or *OLAP cube* does this by creating a grid of aggregates grouped by different dimensions [^82].
-[Figure 4-10](/en/ch4#fig_data_cube) shows an example.
+[Figure 4-10](/en/ch4#fig_data_cube) shows an example.

 {{< figure src="/fig/ddia_0410.png" id="fig_data_cube" caption="Figure 4-10. Two dimensions of a data cube, aggregating data by summing." class="w-full my-4" >}}

-Imagine for now that each fact has foreign keys to only two dimension tables—in [Figure 4-10](/en/ch4#fig_data_cube),
+Imagine for now that each fact has foreign keys to only two dimension tables—in [Figure 4-10](/en/ch4#fig_data_cube),
 these are `date_key` and `product_sk`. You can now draw a two-dimensional table, with
 dates along one axis and products along the other. Each cell contains the aggregate (e.g., `SUM`) of
 an attribute (e.g., `net_price`) of all facts with that date-product combination. Then you can apply
 the same aggregate along each row or column and get a summary that has been reduced by one
 dimension (the sales by product regardless of date, or the sales by date regardless of product).

-In general, facts often have more than two dimensions. In [Figure 3-5](/en/ch3#fig_dwh_schema) there are five
+In general, facts often have more than two dimensions. In [Figure 3-5](/en/ch3#fig_dwh_schema) there are five
 dimensions: date, product, store, promotion, and customer. It’s a lot harder to imagine what a
 five-dimensional hypercube would look like, but the principle remains the same: each cell contains
 the sales for a particular date-product-store-promotion-customer combination. These values can then
@ -1132,11 +1138,11 @@ value of 0. Searching for documents mentioning “red apples” means a query th
 The data structure that many search engines use to answer such queries is called an *inverted
 index*. This is a key-value structure where the key is a term, and the value is the list of IDs of
 all the documents that contain the term (the *postings list*). If the document IDs are sequential
-numbers, the postings list can also be represented as a sparse bitmap, like in [Figure 4-8](/en/ch4#fig_bitmap_index):
+numbers, the postings list can also be represented as a sparse bitmap, like in [Figure 4-8](/en/ch4#fig_bitmap_index):
 the *n*th bit in the bitmap for term *x* is a 1 if the document with ID *n* contains the term *x* [^89].

 Finding all the documents that contain both terms *x* and *y* is now similar to a vectorized data
-warehouse query that searches for rows matching two conditions ([Figure 4-9](/en/ch4#fig_bitmap_and)): load the two
+warehouse query that searches for rows matching two conditions ([Figure 4-9](/en/ch4#fig_bitmap_and)): load the two
 bitmaps for terms *x* and *y* and compute their bitwise AND. Even if the bitmaps are run-length
 encoded, this can be done very efficiently.

@ -1147,7 +1153,7 @@ PostgreSQL’s GIN index type also uses postings lists to support full-text sear
 JSON documents [^92] [^93].

 Instead of breaking text into words, an alternative is to find all the substrings of length *n*,
-which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
+which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
 `"hello"` are `"hel"`, `"ell"`, and `"llo"`. If we build an inverted index of all trigrams, we can
 search the documents for arbitrary substrings that are at least three characters long. Trigram
 indexes even allow regular expressions in search queries; the downside is that they are quite large [^94].
@ -1226,7 +1232,7 @@ Inverted file (IVF) indexes
 more vectors must be compared.

 Hierarchical Navigable Small World (HNSW)
-: HNSW indexes maintain multiple layers of the vector space, as illustrated in [Figure 4-11](/en/ch4#fig_vector_hnsw).
+: HNSW indexes maintain multiple layers of the vector space, as illustrated in [Figure 4-11](/en/ch4#fig_vector_hnsw).
 Each layer is represented as a graph, where nodes represent vectors, and edges represent proximity
 to nearby vectors. A query starts by locating the nearest vector in the topmost layer, which has a
 small number of nodes. The query then moves to the same node in the layer below and follows the
@ -1395,4 +1401,4 @@ documentation for the database of your choice.
 [^101]: Matthijs Douze, Maria Lomeli, and Lucas Hosseini. [Faiss indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). *github.com*, August 2024. Archived at [perma.cc/2EWG-FPBS](https://perma.cc/2EWG-FPBS) 
 [^102]: Varik Matevosyan. [Understanding pgvector’s HNSW Index Storage in Postgres](https://lantern.dev/blog/pgvector-storage). *lantern.dev*, August 2024. Archived at [perma.cc/B2YB-JB59](https://perma.cc/B2YB-JB59) 
 [^103]: Dmitry Baranchuk, Artem Babenko, and Yury Malkov. [Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors](https://arxiv.org/pdf/1802.02422). At *European Conference on Computer Vision* (ECCV), pages 202–216, September 2018. [doi:10.1007/978-3-030-01258-8\_13](https://doi.org/10.1007/978-3-030-01258-8_13) 
-[^104]: Yury A. Malkov and Dmitry A. Yashunin. [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/pdf/1603.09320). *IEEE Transactions on Pattern Analysis and Machine Intelligence*, volume 42, issue 4, pages 824–836, April 2020. [doi:10.1109/TPAMI.2018.2889473](https://doi.org/10.1109/TPAMI.2018.2889473) 
+[^104]: Yury A. Malkov and Dmitry A. Yashunin. [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/pdf/1603.09320). *IEEE Transactions on Pattern Analysis and Machine Intelligence*, volume 42, issue 4, pages 824–836, April 2020. [doi:10.1109/TPAMI.2018.2889473](https://doi.org/10.1109/TPAMI.2018.2889473) 
--- a/content/en/ch5.md
+++ b/content/en/ch5.md
@ -4,6 +4,8 @@ weight: 105
 breadcrumbs: false
 ---

+<a id="ch_encoding"></a>
+
 ![](/map/ch04.png)

 > *Everything changes and nothing stands still.*
@ -12,14 +14,14 @@ breadcrumbs: false

 Applications inevitably change over time. Features are added or modified as new products are
 launched, user requirements become better understood, or business circumstances change. In
-[Chapter 2](/en/ch2#ch_nonfunctional) we introduced the idea of *evolvability*: we should aim to build systems that
+[Chapter 2](/en/ch2#ch_nonfunctional) we introduced the idea of *evolvability*: we should aim to build systems that
 make it easy to adapt to change (see [“Evolvability: Making Change Easy”](/en/ch2#sec_introduction_evolvability)).

 In most cases, a change to an application’s features also requires a change to data that it stores:
 perhaps a new field or record type needs to be captured, or perhaps existing data needs to be
 presented in a new way.

-The data models we discussed in [Chapter 3](/en/ch3#ch_datamodels) have different ways of coping with such change.
+The data models we discussed in [Chapter 3](/en/ch3#ch_datamodels) have different ways of coping with such change.
 Relational databases generally assume that all data in the database conforms to one schema: although
 that schema can be changed (through schema migrations; i.e., `ALTER` statements), there is exactly
 one schema in force at any one point in time. By contrast, schema-on-read (“schemaless”) databases
@ -52,13 +54,13 @@ format of data written by older code, and so you can explicitly handle it (if ne
 keeping the old code to read the old data). Forward compatibility can be trickier, because it
 requires older code to ignore additions made by a newer version of the code.

-Another challenge with forward compatibility is illustrated in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
+Another challenge with forward compatibility is illustrated in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
 Say you add a field to a record schema, and the newer code creates a record containing that new
 field and stores it in a database. Subsequently, an older version of the code (which doesn’t yet
 know about the new field) reads the record, updates it, and writes it back. In this situation, the
 desirable behavior is usually for the old code to keep the new field intact, even though it couldn’t
 be interpreted. But if the record is decoded into a model object that does not explicitly
-preserve unknown fields, data can be lost, like in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
+preserve unknown fields, data can be lost, like in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).

 {{< figure src="/fig/ddia_0501.png" id="fig_encoding_preserve_field" caption="When an older version of the application updates data previously written by a newer version of the application, data may be lost if you’re not careful." class="w-full my-4" >}}

@ -90,7 +92,7 @@ in-memory representation to a byte sequence is called *encoding* (also known as

 > [!TIP] TERMINOLOGY CLASH

-*Serialization* is unfortunately also used in the context of transactions (see [Chapter 8](/en/ch8#ch_transactions)),
+*Serialization* is unfortunately also used in the context of transactions (see [Chapter 8](/en/ch8#ch_transactions)),
 with a completely different meaning. To avoid overloading the word we’ll stick with *encoding* in
 this book, even though *serialization* is perhaps a more common term.

@ -202,7 +204,7 @@ Open content models are powerful, but can be complex. For example, say you want
 integers (such as IDs) to strings. JSON does not have a map or dictionary type, only an “object”
 type that can contain string keys, and values of any type. You can then constrain this type with
 JSON Schema so that keys may only contain digits, and values can only be strings, using
-`patternProperties` and `additionalProperties` as shown in [Example 5-1](/en/ch5#fig_encoding_json_schema).
+`patternProperties` and `additionalProperties` as shown in [Example 5-1](/en/ch5#fig_encoding_json_schema).


 {{< figure id="fig_encoding_json_schema" title="Example 5-1. Example JSON Schema with integer keys and string values. Integer keys are represented as strings containing only integers since JSON Schema requires all keys to be strings." class="w-full my-4" >}}
@ -237,7 +239,7 @@ sometimes faster to parse, but none of them are as widely adopted as the textual
 Some of these formats extend the set of datatypes (e.g., distinguishing integers and floating-point numbers,
 or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. In
 particular, since they don’t prescribe a schema, they need to include all the object field names within
-the encoded data. That is, in a binary encoding of the JSON document in [Example 5-2](/en/ch5#fig_encoding_json), they
+the encoded data. That is, in a binary encoding of the JSON document in [Example 5-2](/en/ch5#fig_encoding_json), they
 will need to include the strings `userName`, `favoriteNumber`, and `interests` somewhere.

 {{< figure id="fig_encoding_json" title="Example 5-2. Example record which we will encode in several binary formats in this chapter" class="w-full my-4" >}}
@ -250,8 +252,8 @@ will need to include the strings `userName`, `favoriteNumber`, and `interests` s
 }
 ```

-Let’s look at an example of MessagePack, a binary encoding for JSON. [Figure 5-2](/en/ch5#fig_encoding_messagepack)
-shows the byte sequence that you get if you encode the JSON document in [Example 5-2](/en/ch5#fig_encoding_json) with
+Let’s look at an example of MessagePack, a binary encoding for JSON. [Figure 5-2](/en/ch5#fig_encoding_messagepack)
+shows the byte sequence that you get if you encode the JSON document in [Example 5-2](/en/ch5#fig_encoding_json) with
 MessagePack. The first few bytes are as follows:

 1. The first byte, `0x83`, indicates that what follows is an object (top four bits = `0x80`) with three
@ -281,7 +283,7 @@ It is similar to Apache Thrift, which was originally developed by Facebook [^13]
 most of what this section says about Protocol Buffers applies also to Thrift.

 Protocol Buffers requires a schema for any data that is encoded. To encode the data
-in [Example 5-2](/en/ch5#fig_encoding_json) in Protocol Buffers, you would describe the schema in the Protocol Buffers
+in [Example 5-2](/en/ch5#fig_encoding_json) in Protocol Buffers, you would describe the schema in the Protocol Buffers
 interface definition language (IDL) like this:

 ```protobuf
@ -300,17 +302,17 @@ application code can call this generated code to encode or decode records of the
 language is very simple compared to JSON Schema: it only defines the fields of records and their
 types, but it does not support other restrictions on the possible values of fields.

-Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in [Figure 5-3](/en/ch5#fig_encoding_protobuf) [^14].
+Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in [Figure 5-3](/en/ch5#fig_encoding_protobuf) [^14].

 {{< figure src="/fig/ddia_0503.png" id="fig_encoding_protobuf" caption="Figure 5-3. Example record encoded using Protocol Buffers." class="w-full my-4" >}}


-Similarly to [Figure 5-2](/en/ch5#fig_encoding_messagepack), each field has a type annotation (to indicate whether it
+Similarly to [Figure 5-2](/en/ch5#fig_encoding_messagepack), each field has a type annotation (to indicate whether it
 is a string, integer, etc.) and, where required, a length indication (such as the length of a
 string). The strings that appear in the data (“Martin”, “daydreaming”, “hacking”) are also encoded
 as ASCII (to be precise, UTF-8), similar to before.

-The big difference compared to [Figure 5-2](/en/ch5#fig_encoding_messagepack) is that there are no field names
+The big difference compared to [Figure 5-2](/en/ch5#fig_encoding_messagepack) is that there are no field names
 (`userName`, `favoriteNumber`, `interests`). Instead, the encoded data contains *field tags*, which
 are numbers (`1`, `2`, and `3`). Those are the numbers that appear in the schema definition. Field tags
 are like aliases for fields—they are a compact way of saying what field we’re talking about,
@ -344,7 +346,7 @@ You can add new fields to the schema, provided that you give each field a new ta
 code (which doesn’t know about the new tag numbers you added) tries to read data written by new
 code, including a new field with a tag number it doesn’t recognize, it can simply ignore that field.
 The datatype annotation allows the parser to determine how many bytes it needs to skip, and preserve
-the unknown fields to avoid the problem in [Figure 5-1](/en/ch5#fig_encoding_preserve_field). This maintains forward
+the unknown fields to avoid the problem in [Figure 5-1](/en/ch5#fig_encoding_preserve_field). This maintains forward
 compatibility: old code can read records that were written by new code.

 What about backward compatibility? As long as each field has a unique tag number, new code can
@ -400,9 +402,9 @@ The equivalent JSON representation of that schema is as follows:
 ```

 First of all, notice that there are no tag numbers in the schema. If we encode our example record
-([Example 5-2](/en/ch5#fig_encoding_json)) using this schema, the Avro binary encoding is just 32 bytes long—the
+([Example 5-2](/en/ch5#fig_encoding_json)) using this schema, the Avro binary encoding is just 32 bytes long—the
 most compact of all the encodings we have seen. The breakdown of the encoded byte sequence is shown
-in [Figure 5-4](/en/ch5#fig_encoding_avro).
+in [Figure 5-4](/en/ch5#fig_encoding_avro).

 If you examine the byte sequence, you can see that there is nothing to identify fields or their
 datatypes. The encoding simply consists of values concatenated together. A string is just a length
@ -430,7 +432,7 @@ example, that schema may be compiled into the application. This is known as the
 When an application wants to decode some data (read it from a file or database, receive it from the
 network, etc.), it uses two schemas: the writer’s schema that is identical to the one used for
 encoding, and the *reader’s schema*, which may be different. This is illustrated in
-[Figure 5-5](/en/ch5#fig_encoding_avro_schemas). The reader’s schema defines the fields of each record that the
+[Figure 5-5](/en/ch5#fig_encoding_avro_schemas). The reader’s schema defines the fields of each record that the
 application code is expecting, and their types.

 {{< figure src="/fig/ddia_0505.png" id="fig_encoding_avro_schemas" caption="Figure 5-5. In Protocol Buffers, encoding and decoding can use different versions of a schema. In Avro, decoding uses two schemas: the writer's schema must be identical to the one used for encoding, but the reader's schema can be an older or newer version." class="w-full my-4" >}}
@ -438,7 +440,7 @@ application code is expecting, and their types.
 If the reader’s and writer’s schema are the same, decoding is easy. If they are different, Avro
 resolves the differences by looking at the writer’s schema and the reader’s schema side by side and
 translating the data from the writer’s schema into the reader’s schema. The Avro specification [^16] [^17]
-defines exactly how this resolution works, and it is illustrated in [Figure 5-6](/en/ch5#fig_encoding_avro_resolution).
+defines exactly how this resolution works, and it is illustrated in [Figure 5-6](/en/ch5#fig_encoding_avro_resolution).

 For example, it’s no problem if the writer’s schema and the reader’s schema have their fields in a
 different order, because the schema resolution matches up the fields by field name. If the code
@ -490,7 +492,7 @@ The answer depends on the context in which Avro is being used. To give a few exa

 Large file with lots of records
 : A common use for Avro is for storing a large file containing millions of records, all encoded with
- the same schema. (We will discuss this kind of situation in [Link to Come].) In this case, the
+ the same schema. (We will discuss this kind of situation in [Chapter 11](/en/ch11#ch_batch).) In this case, the
 writer of that file can just include the writer’s schema once at the beginning of the file. Avro
 specifies a file format (object container files) to do this.

@ -661,7 +663,7 @@ As the data dump is written in one go and is thereafter immutable, formats like
 container files are a good fit. This is also a good opportunity to encode the data in an
 analytics-friendly column-oriented format such as Parquet (see [“Column Compression”](/en/ch4#sec_storage_column_compression)).

-In [Link to Come] we will talk more about using data in archival storage.
+In [Chapter 11](/en/ch11#ch_batch) we will talk more about using data in archival storage.

 ### Dataflow Through Services: REST and RPC {#sec_encoding_dataflow_rpc}

@ -686,7 +688,7 @@ application-specific, and the client and server need to agree on the details of

 In some ways, services are similar to databases: they typically allow clients to submit and query
 data. However, while databases allow arbitrary queries using the query languages we discussed in
-[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
+[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
 that are predetermined by the business logic (application code) of the service [^29]. This restriction provides a degree of encapsulation: services can impose
 fine-grained restrictions on what clients can and cannot do.

@ -728,7 +730,7 @@ service. The two most popular service IDLs are OpenAPI (also known as Swagger [^
 and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send
 and receive Protocol Buffers.

-Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](/en/ch5#fig_open_api_def).
+Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](/en/ch5#fig_open_api_def).
 The service definition allows developers to define service endpoints, documentation, versions, data
 models, and much more. gRPC definitions look similar, but are defined using Protocol Buffers service definitions.

@ -762,8 +764,8 @@ Even if a design philosophy and IDL are adopted, developers must still write the
 implements their service’s API calls. A service framework is often adopted to simplify this
 effort. Service frameworks such as Spring Boot, FastAPI, and gRPC allow developers to write the
 business logic for each API endpoint while the framework code handles routing, metrics, caching,
-authentication, and so on. [Example 5-4](/en/ch5#fig_fastapi_def) shows an example Python implementation of the service
-defined in [Example 5-3](/en/ch5#fig_open_api_def).
+authentication, and so on. [Example 5-4](/en/ch5#fig_fastapi_def) shows an example Python implementation of the service
+defined in [Example 5-3](/en/ch5#fig_open_api_def).

 {{< figure id="fig_fastapi_def" title="Example 5-4. Example FastAPI service implementing the definition from [Example 5-3](/en/ch5#fig_open_api_def)" class="w-full my-4" >}}

@ -815,11 +817,11 @@ A network request is very different from a local function call:
 it goes into an infinite loop or the process crashes). A network request has another possible
 outcome: it may return without a result, due to a *timeout*. In that case, you simply don’t know
 what happened: if you don’t get a response from the remote service, you have no way of knowing
- whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/en/ch9#ch_distributed).)
+ whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/en/ch9#ch_distributed).)
 * If you retry a failed network request, it could happen that the previous request actually got
 through, and only the response was lost. In that case, retrying will cause the action to
 be performed multiple times, unless you build a mechanism for deduplication (*idempotence*) into the protocol [^40].
- Local function calls don’t have this problem. (We discuss idempotence in more detail in [Link to Come].)
+ Local function calls don’t have this problem. (We discuss idempotence in more detail in [“Idempotence”](/en/ch12#sec_stream_idempotence).)
 * Every time you call a local function, it normally takes about the same time to execute. A network
 request is much slower than a function call, and its latency is also wildly variable: at good
 times it may complete in less than a millisecond, but when the network is congested or the remote
@ -870,7 +872,7 @@ There are many load balancing and service discovery solutions available:
 * *Service discovery systems* use a centralized registry rather than DNS to track which service
 endpoints are available. When a new service instance starts up, it registers itself with the
 service discovery system by declaring the host and port it’s listening on, along with relevant
- metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location,
+ metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location,
 and more. The service then periodically sends a heartbeat signal to the discovery system to signal
 that the service is still available.

@ -936,7 +938,7 @@ services responsible for fraud detection, credit card integration, bank integrat
 Processing a single payment in our example requires many service calls. A payment processor service
 might invoke the fraud detection service to check for fraud, call the credit card service to debit
 the credit card, and call the banking service to deposit debited funds, as shown in
-[Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*.
+[Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*.
 Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a
 general-purpose programming language, a domain-specific language (DSL), or a markup language such as
 Business Process Execution Language (BPEL) [^44].
@ -967,7 +969,7 @@ tasks.
 There are many kinds of workflow engines that address a diverse set of use cases. Some, such as
 Airflow, Dagster, and Prefect, integrate with data systems and orchestrate ETL tasks. Others, such
 as Camunda and Orkes, provide a graphical notation for workflows (such as BPMN, used in
-[Figure 5-7](/en/ch5#fig_encoding_workflow)) so that non-engineers can more easily define and execute workflows. Still
+[Figure 5-7](/en/ch5#fig_encoding_workflow)) so that non-engineers can more easily define and execute workflows. Still
 others, such as Temporal and Restate, provide *durable execution*.

 #### Durable execution {#durable-execution}
@ -984,7 +986,7 @@ task fails, the framework will re-execute the task, but will skip any RPC calls
 that the task made successfully before failing. Instead, the framework will pretend to make the
 call, but will instead return the results from the previous call. This is possible because durable
 execution frameworks log all RPCs and state changes to durable storage like a write-ahead log [^45] [^46].
-[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
+[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
 using Temporal.

 {{< figure id="fig_temporal_workflow" title="Example 5-5. A Temporal workflow definition fragment for the payment workflow in [Figure 5-7](/en/ch5#fig_encoding_workflow)." class="w-full my-4" >}}
@ -1060,7 +1062,7 @@ In the past, the landscape of message brokers was dominated by commercial enterp
 companies such as TIBCO, IBM WebSphere, and webMethods, before open source implementations such as
 RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka became popular. More recently, cloud services
 such as Amazon Kinesis, Azure Service Bus, and Google Cloud Pub/Sub have gained adoption. We will
-compare them in more detail in [Link to Come].
+compare them in more detail in [“Messaging Systems”](/en/ch12#sec_stream_messaging).

 The detailed delivery semantics vary by implementation and configuration, but in general, two
 message distribution patterns are most often used:
@ -1084,7 +1086,7 @@ to use event sourcing (see [“Event Sourcing and CQRS”](/en/ch3#sec_datamodel

 If a consumer republishes messages to another topic, you may need to be careful to preserve unknown
 fields, to prevent the issue described previously in the context of databases
-([Figure 5-1](/en/ch5#fig_encoding_preserve_field)).
+([Figure 5-1](/en/ch5#fig_encoding_preserve_field)).

 #### Distributed actor frameworks {#distributed-actor-frameworks}

@ -1213,4 +1215,4 @@ quite achievable. May your application’s evolution be rapid and your deploymen
 [^48]: [What is a Temporal Workflow?](https://docs.temporal.io/workflows) *docs.temporal.io*, 2024. Archived at [perma.cc/B5C5-Y396](https://perma.cc/B5C5-Y396) 
 [^49]: Jack Kleeman. [Solving durable execution’s immutability problem](https://restate.dev/blog/solving-durable-executions-immutability-problem/). *restate.dev*, February 2024. Archived at [perma.cc/G55L-EYH5](https://perma.cc/G55L-EYH5) 
 [^50]: Srinath Perera. [Exploring Event-Driven Architecture: A Beginner’s Guide for Cloud Native Developers](https://wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/). *wso2.com*, August 2023. Archived at [archive.org](https://web.archive.org/web/20240716204613/https%3A//wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/) 
-[^51]: Philip A. Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. [Orleans: Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical Report MSR-TR-2014-41, March 2014. Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF) 
+[^51]: Philip A. Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. [Orleans: Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical Report MSR-TR-2014-41, March 2014. Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF) 
--- a/content/en/ch6.md
+++ b/content/en/ch6.md
@ -4,6 +4,8 @@ weight: 206
 breadcrumbs: false
 ---

+<a id="ch_replication"></a>
+
 ![](/map/ch05.png)

 > *The major difference between a thing that might go wrong and a thing that cannot possibly go wrong
@ -21,7 +23,7 @@ why you might want to replicate data:
 * To scale out the number of machines that can serve read queries (and thus increase read throughput)

 In this chapter we will assume that your dataset is small enough that each machine can hold a copy of
-the entire dataset. In [Chapter 7](/en/ch7#ch_sharding) we will relax that assumption and discuss *sharding*
+the entire dataset. In [Chapter 7](/en/ch7#ch_sharding) we will relax that assumption and discuss *sharding*
 (*partitioning*) of datasets that are too big for a single machine. In later chapters we will discuss
 various kinds of faults that can occur in a replicated data system, and how to deal with them.

@ -72,7 +74,7 @@ question inevitably arises: how do we ensure that all the data ends up on all th

 Every write to the database needs to be processed by every replica; otherwise, the replicas would no
 longer contain the same data. The most common solution is called *leader-based replication*,
-*primary-backup*, or *active/passive*. It works as follows (see [Figure 6-1](/en/ch6#fig_replication_leader_follower)):
+*primary-backup*, or *active/passive*. It works as follows (see [Figure 6-1](/en/ch6#fig_replication_leader_follower)):

 1. One of the replicas is designated the *leader* (also known as *primary* or *source* [^2]).
   When clients want to write to the database, they must send their requests to the leader, which
@ -88,7 +90,7 @@ longer contain the same data. The most common solution is called *leader-based r

 {{< figure src="/fig/ddia_0601.png" id="fig_replication_leader_follower" caption="Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas." class="w-full my-4" >}}

-If the database is sharded (see [Chapter 7](/en/ch7#ch_sharding)), each shard has one leader. Different shards may
+If the database is sharded (see [Chapter 7](/en/ch7#ch_sharding)), each shard has one leader. Different shards may
 have their leaders on different nodes, but each shard must nevertheless have one leader node. In
 [“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader) we will discuss an alternative model in which a system may have
 multiple leaders for the same shard at the same time.
@ -99,7 +101,7 @@ It is also used in some document databases such as MongoDB and DynamoDB [^5],
 message brokers such as Kafka, replicated block devices such as DRBD, and some network filesystems.
 Many consensus algorithms such as Raft, which is used for replication in CockroachDB [^6], TiDB [^7],
 etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and automatically 
-elect a new leader if the old one fails (we will discuss consensus in more detail in [Chapter 10](/en/ch10#ch_consistency)).
+elect a new leader if the old one fails (we will discuss consensus in more detail in [Chapter 10](/en/ch10#ch_consistency)).

 --------

@ -115,15 +117,15 @@ An important detail of a replicated system is whether the replication happens *s
 *asynchronously*. (In relational databases, this is often a configurable option; other systems are
 often hardcoded to be either one or the other.)

-Think about what happens in [Figure 6-1](/en/ch6#fig_replication_leader_follower), where the user of a website updates
+Think about what happens in [Figure 6-1](/en/ch6#fig_replication_leader_follower), where the user of a website updates
 their profile image. At some point in time, the client sends the update request to the leader;
 shortly afterward, it is received by the leader. At some point, the leader forwards the data change
 to the followers. Eventually, the leader notifies the client that the update was successful.
-[Figure 6-2](/en/ch6#fig_replication_sync_replication) shows one possible way how the timings could work out.
+[Figure 6-2](/en/ch6#fig_replication_sync_replication) shows one possible way how the timings could work out.

 {{< figure src="/fig/ddia_0602.png" id="fig_replication_sync_replication" caption="Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower." class="w-full my-4" >}}

-In the example of [Figure 6-2](/en/ch6#fig_replication_sync_replication), the replication to follower 1 is
+In the example of [Figure 6-2](/en/ch6#fig_replication_sync_replication), the replication to follower 1 is
 *synchronous*: the leader waits until follower 1 has confirmed that it received the write before
 reporting success to the user, and before making the write visible to other clients. The replication
 to follower 2 is *asynchronous*: the leader sends the message, but doesn’t wait for a response from
@ -155,7 +157,7 @@ In some systems, a *majority* (e.g., 3 out of 5 replicas, including the leader)
 updated synchronously, and the remaining minority is asynchronous. This is an example of a *quorum*,
 which we will discuss further in [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition). Majority quorums are often
 used in systems that use a consensus protocol for automatic leader election, which we will return to
-in [Chapter 10](/en/ch10#ch_consistency).
+in [Chapter 10](/en/ch10#ch_consistency).

 Sometimes, leader-based replication is configured to be completely asynchronous. In this case, if the
 leader fails and is not recoverable, any writes that have not yet been replicated to followers are
@ -206,6 +208,8 @@ Litestream does the equivalent for SQLite.

 --------

+<a id="sec_replication_object_storage"></a>
+
 > [!TIP] DATABASES BACKED BY OBJECT STORAGE

 Object storage can be used for more than archiving data. Many databases are beginning to use object
@ -303,7 +307,7 @@ consists of the following steps:
   established *controller node* [^13].
   The best candidate for leadership is usually the replica with the most up-to-date data changes
   from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader
-   is a consensus problem, discussed in detail in [Chapter 10](/en/ch10#ch_consistency).
+   is a consensus problem, discussed in detail in [Chapter 10](/en/ch10#ch_consistency).
 3. *Reconfiguring the system to use the new leader.* Clients now need to send
   their write requests to the new leader (we discuss this
   in [“Request Routing”](/en/ch7#sec_sharding_routing)). If the old leader comes back, it might still believe that it is
@ -326,7 +330,7 @@ Failover is fraught with things that can go wrong:
  primary keys that were previously assigned by the old leader. These primary keys were also used in
  a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis,
  which caused some private data to be disclosed to the wrong users.
-* In certain fault scenarios (see [Chapter 9](/en/ch9#ch_distributed)), it could happen that two nodes both believe
+* In certain fault scenarios (see [Chapter 9](/en/ch9#ch_distributed)), it could happen that two nodes both believe
  that they are the leader. This situation is called *split brain*, and it is dangerous: if both
  leaders accept writes, and there is no process for resolving conflicts (see
  [“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some
@ -362,7 +366,7 @@ behind by several days could be catastrophic.

 These issues—node failures; unreliable networks; and trade-offs around replica consistency,
 durability, availability, and latency—are in fact fundamental problems in distributed systems.
-In [Chapter 9](/en/ch9#ch_distributed) and [Chapter 10](/en/ch10#ch_consistency) we will discuss them in greater depth.
+In [Chapter 9](/en/ch9#ch_distributed) and [Chapter 10](/en/ch10#ch_consistency) we will discuss them in greater depth.

 ### Implementation of Replication Logs {#sec_replication_implementation}

@ -405,7 +409,7 @@ in practice, so many databases prefer other replication methods.

 #### Write-ahead log (WAL) shipping {#write-ahead-log-wal-shipping}

-In [Chapter 4](/en/ch4#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
+In [Chapter 4](/en/ch4#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
 every modification is first written to the WAL so that the tree can be restored to a consistent
 state after a crash. Since the WAL contains all the information necessary to restore the indexes and
 heap into a consistent state, we can use the exact same log to build a replica on another node:
@ -426,6 +430,8 @@ performing a failover to make one of the upgraded nodes the new leader. If the r
 does not allow this version mismatch, as is often the case with WAL shipping, such upgrades require
 downtime.

+<a id="sec_replication_logical"></a>
+
 #### Logical (row-based) log replication {#logical-row-based-log-replication}

 An alternative is to use different log formats for replication and for the storage engine, which
@ -456,7 +462,7 @@ software. This in turn enables upgrading to a new version with minimal downtime
 A logical log format is also easier for external applications to parse. This aspect is useful if you want
 to send the contents of a database to an external system, such as a data warehouse for offline
 analysis, or for building custom indexes and caches [^21].
-This technique is called *change data capture*, and we will return to it in [Link to Come].
+This technique is called *change data capture*, and we will return to it in [“Change Data Capture”](/en/ch12#sec_stream_cdc).


 ## Problems with Replication Lag {#sec_replication_lag}
@ -513,7 +519,7 @@ be read from a follower. This is especially appropriate if data is frequently vi
 occasionally written.

 With asynchronous replication, there is a problem, illustrated in
-[Figure 6-3](/en/ch6#fig_replication_read_your_writes): if the user views the data shortly after making a write, the
+[Figure 6-3](/en/ch6#fig_replication_read_your_writes): if the user views the data shortly after making a write, the
 new data may not yet have reached the replica. To the user, it looks as though the data they
 submitted was lost, so they will be understandably unhappy.

@ -597,7 +603,7 @@ Our second example of an anomaly that can occur when reading from asynchronous f
 possible for a user to see things *moving backward in time*.

 This can happen if a user makes several reads from different replicas. For example,
-[Figure 6-4](/en/ch6#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower
+[Figure 6-4](/en/ch6#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower
 with little lag, then to a follower with greater lag. (This scenario is quite likely if the user
 refreshes a web page, and each request is routed to a random server.) The first query returns a
 comment that was recently added by user 1234, but the second query doesn’t return anything because
@ -636,7 +642,7 @@ answered it.

 Now, imagine a third person is listening to this conversation through followers. The things said by
 Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer
-replication lag (see [Figure 6-5](/en/ch6#fig_replication_consistent_prefix)). This observer would hear the following:
+replication lag (see [Figure 6-5](/en/ch6#fig_replication_consistent_prefix)). This observer would hear the following:

 Mrs. Cake
 :   About ten seconds usually, Mr. Poons.
@ -654,7 +660,7 @@ This guarantee says that if a sequence of writes happens in a certain order,
 then anyone reading those writes will see them appear in the same order.

 This is a particular problem in sharded (partitioned) databases, which we will discuss in
-[Chapter 7](/en/ch7#ch_sharding). If the database always applies writes in the same order, reads always see a
+[Chapter 7](/en/ch7#ch_sharding). If the database always applies writes in the same order, reads always see a
 consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different
 shards operate independently, so there is no global ordering of writes: when a user reads from the
 database, they may see some parts of the database in an older state and some in a newer state.
@ -678,8 +684,8 @@ synchronously updated follower. However, dealing with these issues in applicatio
 and easy to get wrong.

 The simplest programming model for application developers is to choose a database that provides a
-strong consistency guarantee for replicas such as linearizability (see [Chapter 10](/en/ch10#ch_consistency)), and ACID
-transactions (see [Chapter 8](/en/ch8#ch_transactions)). This allows you to mostly ignore the challenges that arise
+strong consistency guarantee for replicas such as linearizability (see [Chapter 10](/en/ch10#ch_consistency)), and ACID
+transactions (see [Chapter 8](/en/ch8#ch_transactions)). This allows you to mostly ignore the challenges that arise
 from replication, and treat the database as if it had just a single node. In the early 2010s the
 *NoSQL* movement promoted the view that these features limited scalability, and that large-scale
 systems would have to embrace eventual consistency.
@ -738,7 +744,7 @@ single-leader replication, the leader has to be in *one* of the regions, and all
 through that region.

 In a multi-leader configuration, you can have a leader in *each* region.
-[Figure 6-6](/en/ch6#fig_replication_multi_dc) shows what this architecture might look like. Within each region,
+[Figure 6-6](/en/ch6#fig_replication_multi_dc) shows what this architecture might look like. Within each region,
 regular leader–follower replication is used (with followers maybe in a different availability zone
 from the leader); between regions, each region’s leader replicates its changes to the leaders in
 other regions.
@ -774,7 +780,7 @@ Tolerance of network problems

 Consistency
 :   A single-leader system can provide strong consistency guarantees, such as serializable
-    transactions, which we will discuss in [Chapter 8](/en/ch8#ch_transactions). The biggest downside of multi-leader
+    transactions, which we will discuss in [Chapter 8](/en/ch8#ch_transactions). The biggest downside of multi-leader
    systems is that the consistency they can achieve is much weaker. For example, you can’t guarantee
    that a bank account won’t go negative or that a username is unique: it’s always possible for
    different leaders to process writes that are individually fine (paying out some of the money in an
@ -798,14 +804,14 @@ multi-leader replication is often considered dangerous territory that should be
 #### Multi-leader replication topologies {#sec_replication_topologies}

 A *replication topology* describes the communication paths along which writes are propagated from
-one node to another. If you have two leaders, like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), there is
+one node to another. If you have two leaders, like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), there is
 only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With
 more than two leaders, various different topologies are possible. Some examples are illustrated in
-[Figure 6-7](/en/ch6#fig_replication_topologies).
+[Figure 6-7](/en/ch6#fig_replication_topologies).

 {{< figure src="/fig/ddia_0607.png" id="fig_replication_topologies" caption="Figure 6-7. Three example topologies in which multi-leader replication can be set up." class="w-full my-4" >}}

-The most general topology is *all-to-all*, shown in [Figure 6-7](/en/ch6#fig_replication_topologies)(c),
+The most general topology is *all-to-all*, shown in [Figure 6-7](/en/ch6#fig_replication_topologies)(c),
 in which every leader sends its writes to every other leader. However, more restricted topologies
 are also used: for example a *circular topology* in which each node receives writes from one node
 and forwards those writes (plus any writes of its own) to one other node. Another popular topology
@ -839,11 +845,11 @@ along different paths, avoiding a single point of failure.

 On the other hand, all-to-all topologies can have issues too. In particular, some network links may
 be faster than others (e.g., due to network congestion), with the result that some replication
-messages may “overtake” others, as illustrated in [Figure 6-8](/en/ch6#fig_replication_causality).
+messages may “overtake” others, as illustrated in [Figure 6-8](/en/ch6#fig_replication_causality).

 {{< figure src="/fig/ddia_0608.png" id="fig_replication_causality" caption="Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas." class="w-full my-4" >}}

-In [Figure 6-8](/en/ch6#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
+In [Figure 6-8](/en/ch6#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
 updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may
 first receive the update (which, from its point of view, is an update to a row that does not exist
 in the database) and only later receive the corresponding insert (which should have preceded the
@ -853,12 +859,12 @@ This is a problem of causality, similar to the one we saw in [“Consistent Pref
 the update depends on the prior insert, so we need to make sure that all nodes process the insert
 first, and then the update. Simply attaching a timestamp to every write is not sufficient, because
 clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see
-[Chapter 9](/en/ch9#ch_distributed)).
+[Chapter 9](/en/ch9#ch_distributed)).

 To order these events correctly, a technique called *version vectors* can be used, which we will
 discuss later in this chapter (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). However, many multi-leader
 replication systems don’t use good techniques for ordering updates, leaving them vulnerable to
-issues like the one in [Figure 6-8](/en/ch6#fig_replication_causality). If you are using multi-leader replication, it
+issues like the one in [Figure 6-8](/en/ch6#fig_replication_causality). If you are using multi-leader replication, it
 is worth being aware of these issues, carefully reading the documentation, and thoroughly testing
 your database to ensure that it really does provide the guarantees you believe it to have.

@ -926,8 +932,8 @@ approach has a number of advantages:

 * Having the data locally means the user interface can be much faster to respond than if it had to
  wait for a service call to fetch some data. Some apps aim to respond to user input in the *next
-  frame* of the graphics system, which means rendering within 16 ms on a display with a
-  60 Hz refresh rate.
+  frame* of the graphics system, which means rendering within 16 ms on a display with a
+  60 Hz refresh rate.
 * Allowing users to continue working while offline is valuable, especially on mobile devices with
  intermittent connectivity. With a sync engine, an app doesn’t need a separate offline mode: being
  offline is the same as having very large network delay.
@ -967,7 +973,7 @@ a local-first sync engine on end user devices—is that concurrent writes on dif
 lead to conflicts that need to be resolved.

 For example, consider a wiki page that is simultaneously being edited by two users, as shown in
-[Figure 6-9](/en/ch6#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2
+[Figure 6-9](/en/ch6#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2
 independently changes the title from A to C. Each user’s change is successfully applied to their
 local leader. However, when the changes are asynchronously replicated, a conflict is detected.
 This problem does not occur in a single-leader database.
@ -975,7 +981,7 @@ This problem does not occur in a single-leader database.
 {{< figure src="/fig/ddia_0609.png" id="fig_replication_write_conflict" caption="Figure 6-9. A write conflict caused by two leaders concurrently updating the same record." class="w-full my-4" >}}

 > [!NOTE]
-> We say that the two writes in [Figure 6-9](/en/ch6#fig_replication_write_conflict) are *concurrent* because neither
+> We say that the two writes in [Figure 6-9](/en/ch6#fig_replication_write_conflict) are *concurrent* because neither
 > was “aware” of the other at the time the write was originally made. It doesn’t matter whether the
 > writes literally happened at the same time; indeed, if the writes were made while offline, they
 > might have actually happened some time apart. What matters is whether one write occurred in a state
@ -1017,7 +1023,7 @@ We will discuss other ID assignment schemes in [“ID Generators and Logical Clo

 If conflicts can’t be avoided, the simplest way of resolving them is to attach a timestamp to each
 write, and to always use the value with the greatest timestamp. For example, in
-[Figure 6-9](/en/ch6#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than
+[Figure 6-9](/en/ch6#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than
 the timestamp of user 2’s write. In that case, both leaders will determine that the new title of the
 page should be B, and they discard the write that sets it to C. If the writes coincidentally have
 the same timestamp, the winner can be chosen by comparing the values (e.g., in the case of strings,
@ -1025,7 +1031,7 @@ taking the one that’s earlier in the alphabet).

 This approach is called *last write wins* (LWW) because the write with the greatest timestamp can be
 considered the “last” one. The term is misleading though, because when two writes are concurrent
-like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), which one is older and which is later is undefined, and
+like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), which one is older and which is later is undefined, and
 so the timestamp order of concurrent writes is essentially random.

 Therefore the real meaning of LWW is: when the same record is concurrently written on different
@ -1055,7 +1061,7 @@ merge is complete.

 In a database, it would be impractical for a conflict to stop the entire replication process until a
 human has resolved it. Instead, databases typically store all the concurrently written values for a
-given record—for example, both B and C in [Figure 6-9](/en/ch6#fig_replication_write_conflict). These values are
+given record—for example, both B and C in [Figure 6-9](/en/ch6#fig_replication_write_conflict). These values are
 sometimes called *siblings*. The next time you query that record, the database returns *all* those
 values, rather than just the latest one. You can then resolve those values in whatever way you want,
 either automatically in application code (for example, you could concatenate B and C into “B/C”), or
@ -1077,7 +1083,7 @@ suffers from a number of problems:
  keeping all the shopping cart items that appeared in any of the siblings (i.e., taking the set
  union of the carts). This meant that if the customer had removed an item from their cart in one
  sibling, but another sibling still contained that old item, the removed item would unexpectedly
-  reappear in the customer’s cart [^45]. [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
+  reappear in the customer’s cart [^45]. [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
  cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear.
 * If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution
  process can itself introduce a new conflict. Those resolutions could even be inconsistent: for
@ -1088,6 +1094,8 @@ suffers from a number of problems:
 {{< figure src="/fig/ddia_0610.png" id="fig_replication_amazon_anomaly" caption="Figure 6-10. Example of Amazon's shopping cart anomaly: if conflicts on a shopping cart are merged by taking the union, deleted items may reappear." class="w-full my-4" >}}


+<a id="sec_replication_automatic_resolution"></a>
+
 #### Automatic conflict resolution {#automatic-conflict-resolution}

 For many applications, the best way of handling conflicts is to use an algorithm that automatically
@ -1105,8 +1113,8 @@ updates as much as possible, and hence avoiding data loss:
  same position, it can be ordered deterministically so that all nodes get the same merged outcome.
 * If the data is a collection of items (ordered like a to-do list, or unordered like a shopping
  cart), we can merge it similarly to text by tracking insertions and deletions. To avoid the
-  shopping cart issue in [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly), the algorithms track the fact that Book
-  and DVD were deleted, so the merged result is Cart = {Soap}.
+  shopping cart issue in [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly), the algorithms track the fact that Book
+  and DVD were deleted, so the merged result is Cart = {Soap}.
 * If the data is an integer representing a counter that can be incremented or decremented (e.g., the
  number of likes on a social media post), the merge algorithm can tell how many increments and
  decrements happened on each sibling, and add them together correctly so that the result does not
@ -1129,7 +1137,7 @@ Two families of algorithms are commonly used to implement automatic conflict res
 They have different design philosophies and performance characteristics, but both are able to
 perform automatic merges for all the aforementioned types of data.

-[Figure 6-11](/en/ch6#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
+[Figure 6-11](/en/ch6#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
 text. Assume you have two replicas that both start off with the text “ice”. One replica prepends the
 letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make “ice!”.

@ -1147,7 +1155,7 @@ OT

 CRDT
 :   Most CRDTs give each character a unique, immutable ID and use those to determine the positions of
-    insertions/deletions, instead of indexes. For example, in [Figure 6-11](/en/ch6#fig_replication_ot_crdt) we assign
+    insertions/deletions, instead of indexes. For example, in [Figure 6-11](/en/ch6#fig_replication_ot_crdt) we assign
    the ID 1A to “i”, the ID 2A to “c”, etc. When inserting the exclamation mark, we generate an
    operation containing the ID of the new character (4B) and the ID of the existing character after
    which we want to insert (3A). To insert at the beginning of the string we give “nil” as the
@ -1165,7 +1173,7 @@ Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge o

 #### What is a conflict? {#what-is-a-conflict}

-Some kinds of conflict are obvious. In the example in [Figure 6-9](/en/ch6#fig_replication_write_conflict), two writes
+Some kinds of conflict are obvious. In the example in [Figure 6-9](/en/ch6#fig_replication_write_conflict), two writes
 concurrently modified the same field in the same record, setting it to two different values. There
 is little doubt that this is a conflict.

@ -1179,7 +1187,7 @@ are made on two different leaders.

 There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a
 good understanding of this problem. We will see some more examples of conflicts in
-[Chapter 8](/en/ch8#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and
+[Chapter 8](/en/ch8#ch_transactions), and in [“Ordering events to capture causality”](/en/ch13#sec_future_capture_causality) we will discuss scalable approaches for detecting and
 resolving conflicts in a replicated system.


@ -1220,7 +1228,7 @@ configuration, if you want to continue processing writes, you may need to perfor
 [“Handling Node Outages”](/en/ch6#sec_replication_failover)).

 On the other hand, in a leaderless configuration, failover does not exist.
-[Figure 6-12](/en/ch6#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to
+[Figure 6-12](/en/ch6#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to
 all three replicas in parallel, and the two available replicas accept the write but the unavailable
 replica misses it. Let’s say that it’s sufficient for two out of three replicas to
 acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be
@ -1252,7 +1260,7 @@ mechanisms are used in Dynamo-style datastores:

 Read repair
 :   When a client makes a read from several nodes in parallel, it can detect any stale responses.
-    For example, in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from
+    For example, in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from
    replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale
    value and writes the newer value back to that replica. This approach works well for values that are
    frequently read.
@ -1272,7 +1280,7 @@ Anti-entropy

 #### Quorums for reading and writing {#sec_replication_quorum_condition}

-In the example of [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), we considered the write to be successful
+In the example of [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), we considered the write to be successful
 even though it was only processed on two out of three replicas. What if only one out of three
 replicas accepted the write? How far can we push this?

@ -1283,14 +1291,14 @@ respond, reads can nevertheless continue returning an up-to-date value.

 More generally, if there are *n* replicas, every write must be confirmed by *w* nodes to be
 considered successful, and we must query at least *r* nodes for each read. (In our example,
-*n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* > *n*, 
+*n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* > *n*, 
 we expect to get an up-to-date value when reading, because at least one of the *r* nodes we’re
 reading from must be up to date. Reads and writes that obey these *r* and *w* values are called *quorum* reads and writes [^50].
 You can think of *r* and *w* as the minimum number of votes required for the read or write to be valid.

 In Dynamo-style databases, the parameters *n*, *w*, and *r* are typically configurable. A common
 choice is to make *n* an odd number (typically 3 or 5) and to set *w* = *r* =
-(*n* + 1) / 2 (rounded up). However, you can vary the numbers as you see fit.
+(*n* + 1) / 2 (rounded up). However, you can vary the numbers as you see fit.
 For example, a workload with few writes and many reads may benefit from setting *w* = *n* and
 *r* = 1. This makes reads faster, but has the disadvantage that just one failed node causes all
 database writes to fail.
@ -1300,19 +1308,19 @@ database writes to fail.
 > [!NOTE]
 > There may be more than *n* nodes in the cluster, but any given value is stored only on *n*
 > nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit
-> on one node. We will return to sharding in [Chapter 7](/en/ch7#ch_sharding).
+> on one node. We will return to sharding in [Chapter 7](/en/ch7#ch_sharding).

 --------

-The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes
+The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes
 as follows:

-* If *w* < *n*, we can still process writes if a node is unavailable.
-* If *r* < *n*, we can still process reads if a node is unavailable.
-* With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable
-  node, like in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage).
-* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes.
-  This case is illustrated in [Figure 6-13](/en/ch6#fig_replication_quorum_overlap).
+* If *w* < *n*, we can still process writes if a node is unavailable.
+* If *r* < *n*, we can still process reads if a node is unavailable.
+* With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable
+  node, like in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage).
+* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes.
+  This case is illustrated in [Figure 6-13](/en/ch6#fig_replication_quorum_overlap).

 Normally, reads and writes are always sent to all *n* replicas in parallel. The parameters *w* and *r* 
 determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success
@ -1329,19 +1337,19 @@ returned a successful response and don’t need to distinguish between different

 ### Limitations of Quorum Consistency {#sec_replication_quorum_limitations}

-If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*, you can
+If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*, you can
 generally expect every read to return the most recent value written for a key. This is the case because the
 set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That
 is, among the nodes you read there must be at least one node with the latest value (illustrated in
-[Figure 6-13](/en/ch6#fig_replication_quorum_overlap)).
+[Figure 6-13](/en/ch6#fig_replication_quorum_overlap)).

 Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, because that ensures
-*w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are
+*w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are
 not necessarily majorities—it only matters that the sets of nodes used by the read and write
 operations overlap in at least one node. Other quorum assignments are possible, which allows some
 flexibility in the design of distributed algorithms [^51].

-You may also set *w* and *r* to smaller numbers, so that *w* + *r* ≤ *n* (i.e.,
+You may also set *w* and *r* to smaller numbers, so that *w* + *r* ≤ *n* (i.e.,
 the quorum condition is not satisfied). In this case, reads and writes will still be sent to *n*
 nodes, but a smaller number of successful responses is required for the operation to succeed.

@ -1352,14 +1360,14 @@ unreachable, there’s a higher chance that you can continue processing reads an
 the number of reachable replicas falls below *w* or *r* does the database become unavailable for
 writing or reading, respectively.

-However, even with *w* + *r* > *n*, there are edge cases in which the consistency
+However, even with *w* + *r* > *n*, there are edge cases in which the consistency
 properties can be confusing. Some scenarios include:

 * If a node carrying a new value fails, and its data is restored from a replica carrying an old
  value, the number of replicas storing the new value may fall below *w*, breaking the quorum
  condition.
 * While a rebalancing is in progress, where some data is moved from one node to another (see
-  [Chapter 7](/en/ch7#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n*
+  [Chapter 7](/en/ch7#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n*
  replicas for a particular value. This can result in the read and write quorums no longer
  overlapping.
 * If a read is concurrent with a write operation, the read may or may not see the concurrently
@ -1489,7 +1497,7 @@ resulting in conflicts that need to be resolved. Such conflicts may occur as the
 not always: they could also be detected later during read repair, hinted handoff, or anti-entropy.

 The problem is that events may arrive in a different order at different nodes, due to variable
-network delays and partial failures. For example, [Figure 6-14](/en/ch6#fig_replication_concurrency) shows two clients,
+network delays and partial failures. For example, [Figure 6-14](/en/ch6#fig_replication_concurrency) shows two clients,
 A and B, simultaneously writing to a key *X* in a three-node datastore:

 * Node 1 receives the write from A, but never receives the write from B due to a transient outage.
@ -1501,7 +1509,7 @@ A and B, simultaneously writing to a key *X* in a three-node datastore:

 If each node simply overwrote the value for a key whenever it received a write request from a
 client, the nodes would become permanently inconsistent, as shown by the final *get* request in
-[Figure 6-14](/en/ch6#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other
+[Figure 6-14](/en/ch6#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other
 nodes think that the value is A.

 In order to become eventually consistent, the replicas should converge toward the same value. For
@ -1520,11 +1528,11 @@ take more care to detect concurrent writes.
 How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look
 at some examples:

-* In [Figure 6-8](/en/ch6#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before*
+* In [Figure 6-8](/en/ch6#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before*
  B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s
  operation builds upon A’s operation, so B’s operation must have happened later.
  We also say that B is *causally dependent* on A.
-* On the other hand, the two writes in [Figure 6-14](/en/ch6#fig_replication_concurrency) are concurrent: when each
+* On the other hand, the two writes in [Figure 6-14](/en/ch6#fig_replication_concurrency) are concurrent: when each
  client starts the operation, it does not know that another client is also performing an operation
  on the same key. Thus, there is no causal dependency between the operations.

@ -1546,7 +1554,7 @@ conflict that needs to be resolved.
 It may seem that two operations should be called concurrent if they occur “at the same time”—but
 in fact, it is not important whether they literally overlap in time. Because of problems with clocks
 in distributed systems, it is actually quite difficult to tell whether two things happened
-at exactly the same time—an issue we will discuss in more detail in [Chapter 9](/en/ch9#ch_distributed).
+at exactly the same time—an issue we will discuss in more detail in [Chapter 9](/en/ch9#ch_distributed).

 For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if
 they are both unaware of each other, regardless of the physical time at which they occurred. People
@ -1570,7 +1578,7 @@ happened before another. To keep things simple, let’s start with a database th
 replica. Once we have worked out how to do this on a single replica, we can generalize the approach
 to a leaderless database with multiple replicas.

-[Figure 6-15](/en/ch6#fig_replication_causality_single) shows two clients concurrently adding items to the same
+[Figure 6-15](/en/ch6#fig_replication_causality_single) shows two clients concurrently adding items to the same
 shopping cart. (If that example strikes you as too inane, imagine instead two air traffic
 controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is
 empty. Between them, the clients make five writes to the database:
@ -1604,8 +1612,8 @@ empty. Between them, the clients make five writes to the database:
 {{< figure src="/fig/ddia_0615.png" id="fig_replication_causality_single" caption="Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart." class="w-full my-4" >}}


-The dataflow between the operations in [Figure 6-15](/en/ch6#fig_replication_causality_single) is illustrated
-graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The arrows indicate which operation
+The dataflow between the operations in [Figure 6-15](/en/ch6#fig_replication_causality_single) is illustrated
+graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The arrows indicate which operation
 *happened before* which other operation, in the sense that the later operation *knew about* or
 *depended on* the earlier one. In this example, the clients are never fully up to date with the data
 on the server, since there is always another operation going on concurrently. But old versions of
@ -1638,10 +1646,10 @@ on subsequent reads.

 #### Version vectors {#version-vectors}

-The example in [Figure 6-15](/en/ch6#fig_replication_causality_single) used only a single replica. How does the
+The example in [Figure 6-15](/en/ch6#fig_replication_causality_single) used only a single replica. How does the
 algorithm change when there are multiple replicas, but no leader?

-[Figure 6-15](/en/ch6#fig_replication_causality_single) uses a single version number to capture dependencies between
+[Figure 6-15](/en/ch6#fig_replication_causality_single) uses a single version number to capture dependencies between
 operations, but that is not sufficient when there are multiple replicas accepting writes
 concurrently. Instead, we need to use a version number *per replica* as well as per key. Each
 replica increments its own version number when processing a write, and also keeps track of the
@ -1653,7 +1661,7 @@ A few variants of this idea are in use, but the most interesting is probably the
 which is used in Riak 2.0 [^61] [^62].
 We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.

-Like the version numbers in [Figure 6-15](/en/ch6#fig_replication_causality_single), version vectors are sent from the
+Like the version numbers in [Figure 6-15](/en/ch6#fig_replication_causality_single), version vectors are sent from the
 database replicas to clients when values are read, and need to be sent back to the database when a
 value is subsequently written. (Riak encodes the version vector as a string that it calls *causal
 context*.) The version vector allows the database to distinguish between overwrites and concurrent
@ -1818,4 +1826,4 @@ machine to store only a subset of the data.
 [^61]: Sean Cribbs. [A Brief History of Time in Riak](https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak). At *RICON*, October 2014. Archived at [perma.cc/7U9P-6JFX](https://perma.cc/7U9P-6JFX)
 [^62]: Russell Brown. [Vector Clocks Revisited Part 2: Dotted Version Vectors](https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/). *riak.com*, November 2015. Archived at [perma.cc/96QP-W98R](https://perma.cc/96QP-W98R)
 [^63]: Carlos Baquero. [Version Vectors Are Not Vector Clocks](https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/). *haslab.wordpress.com*, July 2011. Archived at [perma.cc/7PNU-4AMG](https://perma.cc/7PNU-4AMG)
-[^64]: Reinhard Schwarz and Friedemann Mattern. [Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). *Distributed Computing*, volume 7, issue 3, pages 149–174, March 1994. [doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859) 
+[^64]: Reinhard Schwarz and Friedemann Mattern. [Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). *Distributed Computing*, volume 7, issue 3, pages 149–174, March 1994. [doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859) 
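
Note on the quorum passages in the `content/en/ch6.md` hunks above: the condition *w* + *r* > *n* and the two tolerance examples (*n* = 3, *w* = 2, *r* = 2 and *n* = 5, *w* = 3, *r* = 3) can be checked with a few lines of arithmetic. The sketch below is an illustration only and is not part of this commit. The function names `quorum_overlaps` and `tolerated_outages` are hypothetical, and the code assumes the simple model described in the text, where a write needs *w* acknowledgements and a read needs *r* responses out of *n* replicas.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """Return True if any w-node write set and any r-node read set must share a node."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    # Two subsets of n nodes with sizes w and r are forced to intersect
    # exactly when their sizes add up to more than n.
    return w + r > n


def tolerated_outages(n: int, w: int, r: int) -> int:
    """Return how many replicas may be unavailable while reads AND writes still succeed."""
    # Writes need w reachable replicas, reads need r, so the stricter one limits us.
    return min(n - w, n - r)


if __name__ == "__main__":
    # (n, w, r) combinations mentioned in the text, plus one that breaks the condition.
    for n, w, r in [(3, 2, 2), (5, 3, 3), (3, 3, 1), (3, 1, 1)]:
        print(
            f"n={n} w={w} r={r} "
            f"overlap={quorum_overlaps(n, w, r)} "
            f"tolerated_outages={tolerated_outages(n, w, r)}"
        )
```

The `(3, 3, 1)` case corresponds to the read-heavy configuration in the text: the quorum condition still holds, but no outage is tolerated because a single failed node blocks all writes. The `(3, 1, 1)` case shows a configuration where *w* + *r* ≤ *n*, so a read is not guaranteed to include a node that has the latest write.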
--- a/content/en/ch7.md
+++ b/content/en/ch7.md
@ -4,6 +4,8 @@ weight: 207
 breadcrumbs: false
 ---

+<a id="ch_sharding"></a>
+
 ![](/map/ch06.png)

 > *Clearly, we must break away from the sequential and not limit the computers. We must state
@ -14,7 +16,7 @@ breadcrumbs: false

 A distributed database typically distributes data across nodes in two ways:

-1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in [Chapter 6](/en/ch6#ch_replication).
+1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in [Chapter 6](/en/ch6#ch_replication).
 2. If we don’t want every node to store all the data, we can split up a large amount of data into
 smaller *shards* or *partitions*, and store different shards on different nodes. We’ll discuss
 sharding in this chapter.
@ -29,13 +31,13 @@ nodes. This means that, even though each record belongs to exactly one shard, it
 on several different nodes for fault tolerance.

 A node may store more than one shard. If a single-leader replication model is used, the combination
-of sharding and replication can look like [Figure 7-1](/en/ch7#fig_sharding_replicas), for example. Each shard’s
+of sharding and replication can look like [Figure 7-1](/en/ch7#fig_sharding_replicas), for example. Each shard’s
 leader is assigned to one node, and its followers are assigned to other nodes. Each node may be the
 leader for some shards and a follower for other shards, but each shard still only has one leader.

 {{< figure src="/fig/ddia_0701.png" id="fig_sharding_replicas" caption="Figure 7-1. Combining replication and sharding: each node acts as leader for some shards and follower for other shards." class="w-full my-4" >}}

-Everything we discussed in [Chapter 6](/en/ch6#ch_replication) about replication of databases applies equally to
+Everything we discussed in [Chapter 6](/en/ch6#ch_replication) about replication of databases applies equally to
 replication of shards. Since the choice of sharding scheme is mostly independent of the choice of
 replication scheme, we will ignore replication in this chapter for the sake of simplicity.

@ -62,7 +64,7 @@ to databases. Another theory is that *shard* was originally an acronym of *Syste
 Available Replicated Data*—reportedly a 1980s database, details of which are lost to history.

 By the way, partitioning has nothing to do with *network partitions* (netsplits), a type of fault in
-the network between nodes. We will discuss such faults in [Chapter 9](/en/ch9#ch_distributed).
+the network between nodes. We will discuss such faults in [Chapter 9](/en/ch9#ch_distributed).

 --------

@ -71,7 +73,7 @@ the network between nodes. We will discuss such faults in [Chapter 9](/en/ch9#c
 The primary reason for sharding a database is *scalability*: it’s a solution if the volume of data
 or the write throughput has become too great for a single node to handle, as it allows you to spread
 that data and those writes across multiple nodes. (If read throughput is the problem, you don’t
-necessarily need sharding—you can use *read scaling* as discussed in [Chapter 6](/en/ch6#ch_replication).)
+necessarily need sharding—you can use *read scaling* as discussed in [Chapter 6](/en/ch6#ch_replication).)

 In fact, sharding is one of the main tools we have for achieving *horizontal scaling* (a *scale-out*
 architecture), as discussed in [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](/en/ch2#sec_introduction_shared_nothing): that is, allowing a system to
@ -98,9 +100,9 @@ may be distributed across different shards. We will discuss this further in
 [“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes).

 Another problem with sharding is that a write may need to update related records in several
-different shards. While transactions on a single node are quite common (see [Chapter 8](/en/ch8#ch_transactions)),
+different shards. While transactions on a single node are quite common (see [Chapter 8](/en/ch8#ch_transactions)),
 ensuring consistency across multiple shards requires a *distributed transaction*. As we shall see in
-[Chapter 8](/en/ch8#ch_transactions), distributed transactions are available in some databases, but they are usually
+[Chapter 8](/en/ch8#ch_transactions), distributed transactions are available in some databases, but they are usually
 much slower than single-node transactions, may become a bottleneck for the system as a whole, and
 some systems don’t support them at all.

@ -201,7 +203,7 @@ hot spots.

 One way of sharding is to assign a contiguous range of partition keys (from some minimum to some
 maximum) to each shard, like the volumes of a paper encyclopedia, as illustrated in
-[Figure 7-2](/en/ch7#fig_sharding_encyclopedia). In this example, an entry’s partition key is its title. If you want
+[Figure 7-2](/en/ch7#fig_sharding_encyclopedia). In this example, an entry’s partition key is its title. If you want
 to look up the entry for a particular title, you can easily determine which shard contains that
 entry by finding the volume whose key range contains the title you’re looking for, and thus pick the
 correct book off the shelf.
@ -209,7 +211,7 @@ correct book off the shelf.
 {{< figure src="/fig/ddia_0702.png" id="fig_sharding_encyclopedia" caption="Figure 7-2. A print encyclopedia is sharded by key range." class="w-full my-4" >}}

 The ranges of keys are not necessarily evenly spaced, because your data may not be evenly
-distributed. For example, in [Figure 7-2](/en/ch7#fig_sharding_encyclopedia), volume 1 contains words starting with A
+distributed. For example, in [Figure 7-2](/en/ch7#fig_sharding_encyclopedia), volume 1 contains words starting with A
 and B, but volume 12 contains words starting with T, U, V, W, X, Y, and Z. Simply having one volume
 per two letters of the alphabet would lead to some volumes being much bigger than others. In order
 to distribute the data evenly, the shard boundaries need to adapt to the data.
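
Looking up the shard for a key under key-range sharding amounts to a search over sorted boundaries. The following is a minimal sketch, assuming made-up, uneven boundaries loosely modelled on encyclopedia volumes; `LOWER_BOUNDS` and `shard_for_key` are hypothetical names, not part of any database's API:

```python
from bisect import bisect_right

# Hypothetical lower bounds of each shard's key range, in sorted order.
# Shard i holds keys k with LOWER_BOUNDS[i] <= k < LOWER_BOUNDS[i + 1].
# The ranges are deliberately uneven, as in the encyclopedia example.
LOWER_BOUNDS = ["a", "c", "e", "h", "m", "t"]


def shard_for_key(title: str) -> int:
    """Return the index of the shard whose key range contains the title."""
    key = title.lower()
    # bisect_right finds the first boundary greater than the key;
    # the shard just before that position owns the key.
    return max(bisect_right(LOWER_BOUNDS, key) - 1, 0)


print(shard_for_key("Aardvark"))  # 0 (keys starting with a or b)
print(shard_for_key("Zebra"))     # 5 (keys from t onward)
```

Because the boundaries are plain data, a system that adapts to skew only has to change the list, not the lookup logic.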
@ -221,7 +223,7 @@ range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB
 tablet splitting.

 Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in
-[Chapter 4](/en/ch4#ch_storage)). This has the advantage that range scans are easy, and you can treat the key as a
+[Chapter 4](/en/ch4#ch_storage)). This has the advantage that range scans are easy, and you can treat the key as a
 concatenated index in order to fetch several related records in one query (see
 [“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional)). For example, consider an application that stores data from a
 network of sensors, where the key is the timestamp of the measurement. Range scans are very useful
@ -256,7 +258,7 @@ This process is similar to what happens at the top level of a B-tree (see [“B-

 With databases that manage shard boundaries automatically, a shard split is typically triggered by:

-* the shard reaching a configured size (for example, on HBase, the default is 10 GB), or
+* the shard reaching a configured size (for example, on HBase, the default is 10 GB), or
 * in some systems, the write throughput being persistently above some threshold. Thus, a hot shard
 may be split even if it is not storing a lot of data, so that its write load can be distributed more uniformly.

@ -278,7 +280,7 @@ application), a common approach is to first hash the partition key before mappin

 A good hash function takes skewed data and makes it uniformly distributed. Say you have a 32-bit
 hash function that takes a string. Whenever you give it a new string, it returns a seemingly random
-number between 0 and 2<sup>32</sup> − 1. Even if the input strings are very similar, their hashes are evenly
+number between 0 and 2<sup>32</sup> − 1. Even if the input strings are very similar, their hashes are evenly
 distributed across that range of numbers (but the same input always produces the same output).

 For sharding purposes, the hash function need not be cryptographically strong: for example, MongoDB
@ -291,12 +293,12 @@ different hash value in different processes, making them unsuitable for sharding

 Once you have hashed the key, how do you choose which shard to store it in? Maybe your first thought
 is to take the hash value *modulo* the number of nodes in the system (using the `%` operator in many
-programming languages). For example, *hash*(*key*) % 10 would return a number between
-0 and 9 (if we write the hash as a decimal number, the hash % 10 would be the last digit).
+programming languages). For example, *hash*(*key*) % 10 would return a number between
+0 and 9 (if we write the hash as a decimal number, the hash % 10 would be the last digit).
 If we have 10 nodes, numbered 0 to 9, that seems like an easy way of assigning each key to a node.

 The problem with the *mod N* approach is that if the number of nodes *N* changes, most of the keys
-have to be moved from one node to another. [Figure 7-3](/en/ch7#fig_sharding_hash_mod_n) shows what happens when you
+have to be moved from one node to another. [Figure 7-3](/en/ch7#fig_sharding_hash_mod_n) shows what happens when you
 have three nodes and add a fourth. Before the rebalancing, node 0 stored the keys whose hashes are
 0, 3, 6, 9, and so on. After adding the fourth node, the key with hash 3 has moved to node 3, the
 key with hash 6 has moved to node 2, the key with hash 9 has moved to node 1, and so on.
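
The effect can be checked with a few lines of arithmetic. This is only an illustration: the twelve integers stand in for the hash values of twelve keys, and `node_for` is a hypothetical helper:

```python
def node_for(hash_value: int, num_nodes: int) -> int:
    """Assign a key to a node by taking its hash modulo the node count."""
    return hash_value % num_nodes


hashes = range(12)  # stand-ins for the hash values of twelve keys
# Keys whose node changes when the cluster grows from 3 to 4 nodes.
moved = [h for h in hashes if node_for(h, 3) != node_for(h, 4)]

print(moved)                     # [3, 4, 5, 6, 7, 8, 9, 10, 11]
print(len(moved) / len(hashes))  # 0.75
```

Only the keys with hashes 0, 1, and 2 stay where they were; three quarters of the keys move even though the cluster grew by just one node.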
@ -312,12 +314,12 @@ doesn’t move data around more than necessary.
 One simple but widely used solution is to create many more shards than there are nodes, and to
 assign several shards to each node. For example, a database running on a cluster of 10 nodes may be
 split into 1,000 shards from the outset so that 100 shards are assigned to each node. A key is then
-stored in shard number *hash*(*key*) % 1,000, and the system separately keeps track of
+stored in shard number *hash*(*key*) % 1,000, and the system separately keeps track of
 which shard is stored on which node.

 Now, if a node is added to the cluster, the system can reassign some of the shards from existing
 nodes to the new node until they are fairly distributed once again. This process is illustrated in
-[Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in reverse.
+[Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in reverse.

 {{< figure src="/fig/ddia_0704.png" id="fig_sharding_rebalance_fixed" caption="Figure 7-4. Adding a new node to a database cluster with multiple shards per node." class="w-full my-4" >}}
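
A rough sketch of this scheme, assuming 1,000 shards on 10 nodes as in the text. The rebalancing policy shown here (take whole shards from nodes that hold more than a fair share) is one simple possibility chosen for illustration, not the policy of any particular database, and the function names are hypothetical:

```python
from collections import Counter

NUM_SHARDS = 1000


def shard_for(hash_value: int) -> int:
    """The key-to-shard mapping never changes, whatever the node count."""
    return hash_value % NUM_SHARDS


def add_node(assignment: dict[int, int], new_node: int) -> int:
    """Move whole shards to new_node until it holds a fair share.

    Returns the number of shards that were reassigned.
    """
    num_nodes = len(set(assignment.values())) + 1
    fair_share = NUM_SHARDS // num_nodes
    load = Counter(assignment.values())
    moved = 0
    for shard in sorted(assignment):
        if moved == fair_share:
            break
        owner = assignment[shard]
        # Only take shards from nodes that hold more than a fair share.
        if load[owner] > fair_share:
            load[owner] -= 1
            assignment[shard] = new_node
            moved += 1
    return moved


# 10 nodes with 100 shards each (shard i starts on node i % 10).
assignment = {shard: shard % 10 for shard in range(NUM_SHARDS)}
moved = add_node(assignment, new_node=10)
print(moved)  # 90 of 1,000 shards change nodes
```

Only the shard-to-node assignment changes; `shard_for` is untouched, so no key ever moves between shards.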

@ -360,8 +362,8 @@ has this property, but it has a risk of hot spots when there are a lot of writes
 solution is to combine key-range sharding with a hash function so that each shard contains a range
 of *hash values* rather than a range of *keys*.

-[Figure 7-5](/en/ch7#fig_sharding_hash_range) shows an example using a 16-bit hash function that returns a number
-between 0 and 65,535 = 2<sup>16</sup> − 1 (in reality, the hash is usually 32 bits or more).
+[Figure 7-5](/en/ch7#fig_sharding_hash_range) shows an example using a 16-bit hash function that returns a number
+between 0 and 65,535 = 2<sup>16</sup> − 1 (in reality, the hash is usually 32 bits or more).
 Even if the input keys are very similar (e.g., consecutive timestamps), their hashes are uniformly
 distributed across that range. We can then assign a range of hash values to each shard: for example,
 values between 0 and 16,383 to shard 0, values between 16,384 and 32,767 to shard 1, and so on.
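
As a sketch of this idea, assume four shards of equal size and a 16-bit hash obtained by truncating MD5 to its first two bytes. Both choices are made purely for illustration (real systems use wider hashes and may use uneven ranges), and `hash16` and `shard_for_key` are hypothetical names:

```python
import hashlib

NUM_SHARDS = 4
RANGE_SIZE = 2**16 // NUM_SHARDS  # 16,384 hash values per shard


def hash16(key: str) -> int:
    """A 16-bit hash for illustration: the first two bytes of MD5."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:2], "big")


def shard_for_key(key: str) -> int:
    """Shard 0 owns hashes 0..16,383, shard 1 owns 16,384..32,767, etc."""
    return hash16(key) // RANGE_SIZE


# Consecutive timestamps are nearly identical strings, yet their hashes
# (and therefore their shards) are unrelated to one another.
for ts in ["2025-01-01T00:00:00", "2025-01-01T00:00:01", "2025-01-01T00:00:02"]:
    print(ts, hash16(ts), shard_for_key(ts))
```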
@ -394,8 +396,8 @@ improve compression and filtering performance as well.

 Hash-range sharding is used in YugabyteDB and DynamoDB [^17], and is an option in MongoDB.
 Cassandra and ScyllaDB use a variant of this approach that is illustrated in
-[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
-to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
+[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
+to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
 per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between
 those ranges. This means some ranges are bigger than others, but by having multiple ranges per node
 those imbalances tend to even out [^15] [^18].
@ -404,7 +406,7 @@ those imbalances tend to even out [^15] [^18].

 When nodes are added or removed, range boundaries are added and removed, and shards are split or
 merged accordingly [^19].
-In the example of [Figure 7-6](/en/ch7#fig_sharding_cassandra), when node 3 is added, node 1
+In the example of [Figure 7-6](/en/ch7#fig_sharding_cassandra), when node 3 is added, node 1
 transfers parts of two of its ranges to node 3, and node 2 transfers part of one of its ranges to
 node 3. This has the effect of giving the new node an approximately fair share of the dataset,
 without transferring more data than necessary from one node to another.
@ -417,8 +419,8 @@ in a way that satisfies two properties:
 1. the number of keys mapped to each shard is roughly equal, and
 2. when the number of shards changes, as few keys as possible are moved from one shard to another.

-Note that *consistent* here has nothing to do with replica consistency (see [Chapter 6](/en/ch6#ch_replication)) or
-ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describes the tendency of a key to stay in
+Note that *consistent* here has nothing to do with replica consistency (see [Chapter 6](/en/ch6#ch_replication)) or
+ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describes the tendency of a key to stay in
 the same shard as much as possible.

 The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of consistent hashing [^20],
@ -516,7 +518,7 @@ only be handled by a node that is a replica for the shard containing that key.

 This means that request routing has to be aware of the assignment from keys to shards, and from
 shards to nodes. On a high level, there are a few different approaches to this problem 
-(illustrated in [Figure 7-7](/en/ch7#fig_sharding_routing)):
+(illustrated in [Figure 7-7](/en/ch7#fig_sharding_routing)):

 1. Allow clients to contact any node (e.g., via a round-robin load balancer). If that node
 coincidentally owns the shard to which the request applies, it can handle the request directly;
@ -544,8 +546,8 @@ In all cases, there are some key problems:
 those?

 Many distributed data systems rely on a separate coordination service such as ZooKeeper or etcd to
-keep track of shard assignments, as illustrated in [Figure 7-8](/en/ch7#fig_sharding_zookeeper). They use consensus
-algorithms (see [Chapter 10](/en/ch10#ch_consistency)) to provide fault tolerance and protection against split-brain.
+keep track of shard assignments, as illustrated in [Figure 7-8](/en/ch7#fig_sharding_zookeeper). They use consensus
+algorithms (see [Chapter 10](/en/ch10#ch_consistency)) to provide fault tolerance and protection against split-brain.
 Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of shards
 to nodes. Other actors, such as the routing tier or the sharding-aware client, can subscribe to this
 information in ZooKeeper. Whenever a shard changes ownership, or a node is added or removed,
@ -573,7 +575,7 @@ This discussion of request routing has focused on finding the shard for an indiv
 most relevant for sharded OLTP databases. Analytic databases often use sharding as well, but they
 typically have a very different kind of query execution: rather than executing in a single shard, a
 query typically needs to aggregate and join data from many different shards in parallel. We will
-discuss techniques for such parallel query execution in [Link to Come].
+discuss techniques for such parallel query execution in [“JOIN and GROUP BY”](/en/ch11#sec_batch_join).

 ## Sharding and Secondary Indexes {#sec_sharding_secondary_indexes}

@ -597,7 +599,7 @@ local and global indexes.
 ### Local Secondary Indexes {#id166}

 For example, imagine you are operating a website for selling used cars (illustrated in
-[Figure 7-9](/en/ch7#fig_sharding_local_secondary)). Each listing has a unique ID, and you use that ID as partition
+[Figure 7-9](/en/ch7#fig_sharding_local_secondary)). Each listing has a unique ID, and you use that ID as partition
 key for sharding (for example, IDs 0 to 499 in shard 0, IDs 500 to 999 in shard 1, etc.).

 If you want to let users search for cars, allowing them to filter by color and by make, you need a
@ -605,7 +607,7 @@ secondary index on `color` and `make` (in a document database these would be fie
 database they would be columns). If you have declared the index, the database can perform the
 indexing automatically. For example, whenever a red car is added to the database, the database shard
 automatically adds its ID to the list of IDs for the index entry `color:red`. As discussed in
-[Chapter 4](/en/ch4#ch_storage), that list of IDs is also called a *postings list*.
+[Chapter 4](/en/ch4#ch_storage), that list of IDs is also called a *postings list*.

 {{< figure src="/fig/ddia_0709.png" id="fig_sharding_local_secondary" caption="Figure 7-9. Local secondary indexes: each shard indexes only the records within its own shard." class="w-full my-4" >}}

@ -632,7 +634,7 @@ want *some* results, and you don’t need all, you can send the request to any s

 However, if you want all the results and don’t know their partition key in advance, you need to send
 the query to all shards, and combine the results you get back, because the matching records might be
-scattered across all the shards. In [Figure 7-9](/en/ch7#fig_sharding_local_secondary), red cars appear in both shard
+scattered across all the shards. In [Figure 7-9](/en/ch7#fig_sharding_local_secondary), red cars appear in both shard
 0 and shard 1.
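
The scatter/gather pattern can be sketched as follows. This is an in-memory toy, assuming two shards split at ID 500 as in the example; the class, the function names, and the sample listings are invented for illustration:

```python
from collections import defaultdict


class Shard:
    """One shard holding car listings plus its own local index on color."""

    def __init__(self) -> None:
        self.listings: dict[int, dict] = {}
        self.color_index: dict[str, list[int]] = defaultdict(list)

    def add(self, listing_id: int, listing: dict) -> None:
        self.listings[listing_id] = listing
        # The postings list lives on the same shard as the record itself.
        self.color_index[listing["color"]].append(listing_id)

    def find_by_color(self, color: str) -> list[int]:
        return list(self.color_index.get(color, []))


shards = [Shard(), Shard()]


def add_listing(listing_id: int, listing: dict) -> None:
    # IDs 0 to 499 go to shard 0, IDs 500 to 999 to shard 1.
    shards[listing_id // 500].add(listing_id, listing)


def find_all_by_color(color: str) -> list[int]:
    """Scatter the query to every shard, then gather and merge the results."""
    results: list[int] = []
    for shard in shards:
        results.extend(shard.find_by_color(color))
    return sorted(results)


add_listing(191, {"color": "red", "make": "Honda"})
add_listing(306, {"color": "red", "make": "Ford"})
add_listing(515, {"color": "silver", "make": "BMW"})
add_listing(768, {"color": "red", "make": "Volvo"})
print(find_all_by_color("red"))  # [191, 306, 768]
```

A write touches exactly one shard, but `find_all_by_color` has to visit every shard even when most of them hold no matching record.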

 This approach to querying a sharded database can make read queries on secondary indexes quite
@ -651,7 +653,7 @@ covers data in all shards. However, we can’t just store that index on one node
 likely become a bottleneck and defeat the purpose of sharding. A global index must also be sharded,
 but it can be sharded differently from the primary key index.

-[Figure 7-10](/en/ch7#fig_sharding_global_secondary) illustrates what this could look like: the IDs of red cars from
+[Figure 7-10](/en/ch7#fig_sharding_global_secondary) illustrates what this could look like: the IDs of red cars from
 all shards appear under `color:red` in the index, but the index is sharded so that colors starting
 with the letters *a* to *r* appear in shard 0 and colors starting with *s* to *z* appear in shard 1.
 The index on the make of car is partitioned similarly (with the shard boundary being between *f* and *h*).
@ -664,7 +666,7 @@ you can search for. Here we generalise it to mean any value that you can search

 The global index uses the term as partition key, so that when you’re looking for a particular term
 or value, you can figure out which shard you need to query. As before, a shard can contain a
-contiguous range of terms (as in [Figure 7-10](/en/ch7#fig_sharding_global_secondary)), or you can assign terms to
+contiguous range of terms (as in [Figure 7-10](/en/ch7#fig_sharding_global_secondary)), or you can assign terms to
 shards based on a hash of the term.
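
A term-partitioned index can be sketched in the same toy style. The split between the letters *r* and *s* follows the example above; the function names and sample data are assumptions made for illustration:

```python
from collections import defaultdict

# Two index shards, partitioned by the first letter of the indexed value:
# colors starting with a to r in index shard 0, s to z in index shard 1.
index_shards: list[dict[str, list[int]]] = [defaultdict(list), defaultdict(list)]


def index_shard_for(value: str) -> int:
    """Pick the index shard from the term itself, not from the record ID."""
    return 0 if value[0].lower() <= "r" else 1


def index_listing(listing_id: int, color: str) -> None:
    index_shards[index_shard_for(color)][f"color:{color}"].append(listing_id)


def find_by_color(color: str) -> list[int]:
    """A single-term query reads exactly one index shard."""
    shard = index_shards[index_shard_for(color)]
    return list(shard.get(f"color:{color}", []))


# Listings 191 and 768 live on different primary shards, but both
# postings end up under the same term in the same index shard.
index_listing(191, "red")
index_listing(768, "red")
index_listing(515, "silver")
print(find_by_color("red"))     # [191, 768]
print(find_by_color("silver"))  # [515]
```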

 Global indexes have the advantage that a query with a single condition (such as *color = red*) only
@ -682,7 +684,7 @@ Another challenge with global secondary indexes is that writes are more complica
 indexes, because writing a single record might affect multiple shards of the index (every term in
 the document might be on a different shard). This makes it harder to keep the secondary index in
 sync with the underlying data. One option is to use a distributed transaction to atomically update
-the shards storing the primary record and its secondary indexes (see [Chapter 8](/en/ch8#ch_transactions)).
+the shards storing the primary record and its secondary indexes (see [Chapter 8](/en/ch8#ch_transactions)).

 Global secondary indexes are used by CockroachDB, TiDB, and YugabyteDB; DynamoDB supports both local
 and global secondary indexes. In the case of DynamoDB, writes are asynchronously reflected in global
@ -781,4 +783,4 @@ that question in the following chapters.
 [^31]: Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. [Earlybird: Real-Time Search at Twitter](https://cs.uwaterloo.ca/~jimmylin/publications/Busch_etal_ICDE2012.pdf). At *28th IEEE International Conference on Data Engineering* (ICDE), April 2012. [doi:10.1109/ICDE.2012.149](https://doi.org/10.1109/ICDE.2012.149) 
 [^32]: Nadav Har’El. [Indexing in Cassandra 3](https://github.com/scylladb/scylladb/wiki/Indexing-in-Cassandra-3). *github.com*, April 2017. Archived at [perma.cc/3ENV-8T9P](https://perma.cc/3ENV-8T9P) 
 [^33]: Zachary Tong. [Customizing Your Document Routing](https://www.elastic.co/blog/customizing-your-document-routing/). *elastic.co*, June 2013. Archived at [perma.cc/97VM-MREN](https://perma.cc/97VM-MREN) 
-[^34]: Andrew Pavlo. [H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/). *hstore.cs.brown.edu*, October 2013. Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z) 
+[^34]: Andrew Pavlo. [H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/). *hstore.cs.brown.edu*, October 2013. Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z) 
--- a/content/en/ch8.md
+++ b/content/en/ch8.md
@ -4,6 +4,8 @@ weight: 208
 breadcrumbs: false
 ---

+<a id="ch_transactions"></a>
+
 ![](/map/ch07.png)

 > *Some authors have claimed that general two-phase commit is too expensive to support, because of the
@ -75,8 +77,8 @@ similar to that of System R.

 In the late 2000s, nonrelational (NoSQL) databases started gaining popularity. They aimed to
 improve upon the relational status quo by offering a choice of new data models (see
-[Chapter 3](/en/ch3#ch_datamodels)), and by including replication ([Chapter 6](/en/ch6#ch_replication)) and sharding
-([Chapter 7](/en/ch7#ch_sharding)) by default. Transactions were the main casualty of this movement: many of this
+[Chapter 3](/en/ch3#ch_datamodels)), and by including replication ([Chapter 6](/en/ch6#ch_replication)) and sharding
+([Chapter 7](/en/ch7#ch_sharding)) by default. Transactions were the main casualty of this movement: many of this
 generation of databases abandoned transactions entirely, or redefined the word to describe a
 much weaker set of guarantees than had previously been understood.

@ -85,7 +87,7 @@ fundamentally unscalable, and that any large-scale system would have to abandon
 order to maintain good performance and high availability. More recently, that belief has turned out
 to be wrong. So-called “NewSQL” databases such as CockroachDB [^5], TiDB [^6], Spanner [^7], FoundationDB [^8],
 and Yugabyte have shown that transactional systems can scale to large data volumes and high
-throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide
+throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide
 strong ACID guarantees at scale.

 However, that doesn’t mean that every system must be transactional either: like every other
@ -146,7 +148,7 @@ the defining feature of ACID atomicity. Perhaps *abortability* would have been a

 The word *consistency* is terribly overloaded:

-* In [Chapter 6](/en/ch6#ch_replication) we discussed *replica consistency* and the issue of *eventual consistency*
+* In [Chapter 6](/en/ch6#ch_replication) we discussed *replica consistency* and the issue of *eventual consistency*
 that arises in asynchronously replicated systems (see [“Problems with Replication Lag”](/en/ch6#sec_replication_lag)).
 * A *consistent snapshot* of a database, e.g. for a backup, is a snapshot of the entire database as
 it existed at one moment in time. More precisely, it is consistent with the happens-before
@ -155,7 +157,7 @@ The word *consistency* is terribly overloaded:
 value was written.
 * *Consistent hashing* is an approach to sharding that some systems use for rebalancing (see
 [“Consistent hashing”](/en/ch7#sec_sharding_consistent_hashing)).
-* In the CAP theorem (see [Chapter 10](/en/ch10#ch_consistency)), the word *consistency* is used to mean
+* In the CAP theorem (see [Chapter 10](/en/ch10#ch_consistency)), the word *consistency* is used to mean
 *linearizability* (see [“Linearizability”](/en/ch10#sec_consistency_linearizability)).
 * In the context of ACID, *consistency* refers to an application-specific notion of the database
 being in a “good state.”
@ -188,10 +190,10 @@ Most databases are accessed by several clients at the same time. That is no prob
 reading and writing different parts of the database, but if they are accessing the same database
 records, you can run into concurrency problems (race conditions).

-[Figure 8-1](/en/ch8#fig_transactions_increment) is a simple example of this kind of problem. Say you have two clients
+[Figure 8-1](/en/ch8#fig_transactions_increment) is a simple example of this kind of problem. Say you have two clients
 simultaneously incrementing a counter that is stored in a database. Each client needs to read the
 current value, add 1, and write the new value back (assuming there is no increment operation built
-into the database). In [Figure 8-1](/en/ch8#fig_transactions_increment) the counter should have increased from 42 to
+into the database). In [Figure 8-1](/en/ch8#fig_transactions_increment) the counter should have increased from 42 to
 44, because two increments happened, but it actually only went to 43 because of the race condition.

 {{< figure src="/fig/ddia_0801.png" id="fig_transactions_increment" caption="Figure 8-1. A race condition between two clients concurrently incrementing a counter." class="w-full my-4" >}}
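
The interleaving can be replayed deterministically in a few lines. This sketch uses a plain dictionary as a stand-in for the database and fixes the order of operations by hand instead of using real concurrency; the function names are hypothetical:

```python
def increment_interleaved(db: dict) -> None:
    """Two clients each do read, add 1, write; both reads happen first."""
    seen_by_client_1 = db["counter"]  # client 1 reads 42
    seen_by_client_2 = db["counter"]  # client 2 also reads 42
    db["counter"] = seen_by_client_1 + 1  # client 1 writes 43
    db["counter"] = seen_by_client_2 + 1  # client 2 overwrites with 43


def increment_one_at_a_time(db: dict) -> None:
    """The same two increments, but each read sees the previous write."""
    for _client in (1, 2):
        db["counter"] = db["counter"] + 1


racy = {"counter": 42}
increment_interleaved(racy)
print(racy["counter"])  # 43: one increment was lost

serial = {"counter": 42}
increment_one_at_a_time(serial)
print(serial["counter"])  # 44
```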
@ -234,6 +236,8 @@ database can do to save you.

 --------

+<a id="sidebar_transactions_durability"></a>
+
 > [!TIP] REPLICATION AND DURABILITY

 Historically, durability meant writing to an archive tape. Then it was understood as writing to a disk
@ -291,7 +295,7 @@ Isolation

 These definitions assume that you want to modify several objects (rows, documents, records) at once.
 Such *multi-object transactions* are often needed if several pieces of data need to be kept in sync.
-[Figure 8-2](/en/ch8#fig_transactions_read_uncommitted) shows an example from an email application. To display the
+[Figure 8-2](/en/ch8#fig_transactions_read_uncommitted) shows an example from an email application. To display the
 number of unread messages for a user, you could query something like:

 ```
@ -307,14 +311,14 @@ number of unread messages in a separate field (a kind of denormalization, which
 unread counter as well, and whenever a message is marked as read, you also have to decrement the
 unread counter.

-In [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), user 2 experiences an anomaly: the mailbox listing shows
+In [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), user 2 experiences an anomaly: the mailbox listing shows
 an unread message, but the counter shows zero unread messages because the counter increment has not
 yet happened. (If an incorrect counter in an email application seems too insignificant, think of a
 customer account balance instead of an unread counter, and a payment transaction instead of an
 email.) Isolation would have prevented this issue by ensuring that user 2 sees either both the
 inserted email and the updated counter, or neither, but not an inconsistent halfway point.

-[Figure 8-3](/en/ch8#fig_transactions_atomicity) illustrates the need for atomicity: if an error occurs somewhere
+[Figure 8-3](/en/ch8#fig_transactions_atomicity) illustrates the need for atomicity: if an error occurs somewhere
 over the course of the transaction, the contents of the mailbox and the unread counter might become out
 of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted
 and the inserted email is rolled back.
@ -337,10 +341,10 @@ database in a partially updated state.
 #### Single-object writes {#sec_transactions_single_object}

 Atomicity and isolation also apply when a single object is being changed. For example, imagine you
-are writing a 20 KB JSON document to a database:
+are writing a 20 KB JSON document to a database:

-* If the network connection is interrupted after the first 10 KB have been sent, does the
- database store that unparseable 10 KB fragment of JSON?
+* If the network connection is interrupted after the first 10 KB have been sent, does the
+ database store that unparseable 10 KB fragment of JSON?
 * If the power fails while the database is in the middle of overwriting the previous value on disk,
 do you end up with the old and new values spliced together?
 * If another client reads that document while the write is in progress, will it see a partially
@ -353,7 +357,7 @@ isolation can be implemented using a lock on each object (allowing only one thre
 object at any one time).

 Some databases also provide more complex atomic operations, such as an increment operation, which
-removes the need for a read-modify-write cycle like that in [Figure 8-1](/en/ch8#fig_transactions_increment).
+removes the need for a read-modify-write cycle like that in [Figure 8-1](/en/ch8#fig_transactions_increment).
 Similarly popular is a *conditional write* operation, which allows a write to happen only if the value
 has not been concurrently changed by someone else (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)),
 similarly to a compare-and-set or compare-and-swap (CAS) operation in shared-memory concurrency.
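
A conditional write can make the counter increment safe by retrying whenever the value changed in the meantime. The following is a sketch using an in-memory stand-in for a single database object; the `Store` class and its `compare_and_set` method are hypothetical and do not correspond to a specific database API:

```python
import threading


class Store:
    """A single value with an atomic compare-and-set operation."""

    def __init__(self, value: int) -> None:
        self._value = value
        self._lock = threading.Lock()

    def get(self) -> int:
        with self._lock:
            return self._value

    def compare_and_set(self, expected: int, new: int) -> bool:
        """Write new only if the current value still equals expected."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


def increment(store: Store) -> None:
    # Retry the read-modify-write cycle until no one else interfered.
    while True:
        current = store.get()
        if store.compare_and_set(current, current + 1):
            return


store = Store(42)
clients = [threading.Thread(target=increment, args=(store,)) for _ in range(2)]
for client in clients:
    client.start()
for client in clients:
    client.join()
print(store.get())  # 44
```

If both clients read 42, only one conditional write succeeds; the other sees that the value is no longer 42, reads 43, and tries again.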
@ -391,7 +395,7 @@ However, in many other cases writes to several different objects need to be coor
 document, which is treated as a single object—no multi-object transactions are needed when
 updating a single document. However, document databases lacking join functionality also encourage
 denormalization (see [“When to Use Which Model”](/en/ch3#sec_datamodels_document_summary)). When denormalized information needs to
- be updated, like in the example of [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), you need to update
+ be updated, like in the example of [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), you need to update
 several documents in one go. Transactions are very useful in this situation to prevent
 denormalized data from going out of sync.
 * In databases with secondary indexes (almost everything except pure key-value stores), the indexes
@ -403,7 +407,7 @@ However, in many other cases writes to several different objects need to be coor
 Such applications can still be implemented without transactions. However, error handling becomes
 much more complicated without atomicity, and the lack of isolation can cause concurrency problems.
 We will discuss those in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels), and explore alternative approaches
-in [Link to Come].
+in [“Derived data versus distributed transactions”](/en/ch13#sec_future_derived_vs_transactions).

 #### Handling errors and aborts {#handling-errors-and-aborts}

@ -521,7 +525,7 @@ Can another transaction see that uncommitted data? If yes, that is called a

 Transactions running at the read committed isolation level must prevent dirty reads. This means that
 any writes by a transaction only become visible to others when that transaction commits (and then
-all of its writes become visible at once). This is illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still
+all of its writes become visible at once). This is illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still
 returns the old value, 2, while user 1 has not yet committed.

 {{< figure src="/fig/ddia_0804.png" id="fig_transactions_read_committed" caption="Figure 8-4. No dirty reads: user 2 sees the new value for x only after user 1's transaction has committed." class="w-full my-4" >}}
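
The "no dirty reads" behaviour can be sketched with a toy store that keeps uncommitted writes separate from committed values. This is an in-memory illustration under simplifying assumptions (no write locks, so it does not prevent dirty writes); the class and method names are invented:

```python
class ReadCommittedStore:
    """Readers only ever see committed values, plus their own writes."""

    def __init__(self) -> None:
        self._committed: dict[str, int] = {}
        self._pending: dict[int, dict[str, int]] = {}  # txid -> uncommitted writes

    def write(self, txid: int, key: str, value: int) -> None:
        self._pending.setdefault(txid, {})[key] = value

    def read(self, txid: int, key: str):
        own_writes = self._pending.get(txid, {})
        if key in own_writes:
            return own_writes[key]
        return self._committed.get(key)

    def commit(self, txid: int) -> None:
        # All of the transaction's writes become visible at once.
        self._committed.update(self._pending.pop(txid, {}))

    def abort(self, txid: int) -> None:
        # Discarding pending writes means nobody ever saw them.
        self._pending.pop(txid, None)


store = ReadCommittedStore()
store.write(0, "x", 2)
store.commit(0)

store.write(1, "x", 3)              # user 1 sets x = 3 but has not committed
before_commit = store.read(2, "x")  # user 2 still sees the old value
store.commit(1)
after_commit = store.read(2, "x")   # now user 2 sees the new value
print(before_commit, after_commit)  # 2 3
```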
@ -529,12 +533,12 @@ returns the old value, 2, while user 1 has not yet committed.
 There are a few reasons why it’s useful to prevent dirty reads:

 * If a transaction needs to update several rows, a dirty read means that another transaction may
- see some of the updates but not others. For example, in [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), the
+ see some of the updates but not others. For example, in [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), the
 user sees the new unread email but not the updated counter. This is a dirty read of the email.
 Seeing the database in a partially updated state is confusing to users and may cause other
 transactions to take incorrect decisions.
 * If a transaction aborts, any writes it has made need to be rolled back (like in
- [Figure 8-3](/en/ch8#fig_transactions_atomicity)). If the database allows dirty reads, that means a transaction may
+ [Figure 8-3](/en/ch8#fig_transactions_atomicity)). If the database allows dirty reads, that means a transaction may
 see data that is later rolled back—i.e., which is never actually committed to the database. Any
 transaction that read uncommitted data would also need to be aborted, leading to a problem called
 *cascading aborts*.
@ -553,15 +557,15 @@ first write’s transaction has committed or aborted.
 By preventing dirty writes, this isolation level avoids some kinds of concurrency problems:

 * If transactions update multiple rows, dirty writes can lead to a bad outcome. For example,
- consider [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), which illustrates a used car sales website on which
+ consider [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), which illustrates a used car sales website on which
 two people, Aaliyah and Bryce, are simultaneously trying to buy the same car. Buying a car requires
 two database writes: the listing on the website needs to be updated to reflect the buyer, and the
- sales invoice needs to be sent to the buyer. In the case of [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), the
+ sales invoice needs to be sent to the buyer. In the case of [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), the
 sale is awarded to Bryce (because he performs the winning update to the `listings` table), but the
 invoice is sent to Aaliyah (because she performs the winning update to the `invoices` table). Read
 committed prevents such mishaps.
 * However, read committed does *not* prevent the race condition between two counter increments in
- [Figure 8-1](/en/ch8#fig_transactions_increment). In this case, the second write happens after the first transaction
+ [Figure 8-1](/en/ch8#fig_transactions_increment). In this case, the second write happens after the first transaction
 has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason—in
 [“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update) we will discuss how to make such counter increments safe.

@ -597,7 +601,7 @@ different part of the application, due to waiting for locks.
 Nevertheless, locks are used to prevent dirty reads in some databases, such as IBM
 Db2 and Microsoft SQL Server in the `read_committed_snapshot=off` setting [^29].

-A more commonly used approach to preventing dirty reads is the one illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed): for every
+A more commonly used approach to preventing dirty reads is the one illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed): for every
 row that is written, the database remembers both the old committed value and the new value
 set by the transaction that currently holds the write lock. While the transaction is ongoing, any
 other transactions that read the row are simply given the old value. Only when the new value is
@ -613,7 +617,7 @@ getting intermingled. Indeed, those are useful features, and much stronger guara
 get from a system that has no transactions.

 However, there are still plenty of ways in which you can have concurrency bugs when using this
-isolation level. For example, [Figure 8-6](/en/ch8#fig_transactions_item_many_preceders) illustrates a problem that
+isolation level. For example, [Figure 8-6](/en/ch8#fig_transactions_item_many_preceders) illustrates a problem that
 can occur with read committed.

 {{< figure src="/fig/ddia_0806.png" id="fig_transactions_item_many_preceders" caption="Figure 8-6. Read skew: Aaliyah observes the database in an inconsistent state." class="w-full my-4" >}}
@ -685,14 +689,14 @@ database to handle long-running read queries on a consistent snapshot at the sam
 writes normally, without any lock contention between the two.

 To implement snapshot isolation, databases use a generalization of the mechanism we saw for
-preventing dirty reads in [Figure 8-4](/en/ch8#fig_transactions_read_committed). Instead of two versions of each row
+preventing dirty reads in [Figure 8-4](/en/ch8#fig_transactions_read_committed). Instead of two versions of each row
 (the committed version and the overwritten-but-not-yet-committed version), the database must
 potentially keep several different committed versions of a row, because various in-progress
 transactions may need to see the state of the database at different points in time. Because it
 maintains several versions of a row side by side, this technique is known as *multi-version
 concurrency control* (MVCC).

-[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL
+[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL
 [^40] [^42] [^43] (other implementations are similar).
 When a transaction is started, it is given a unique, always-increasing transaction ID (`txid`).
 Whenever a transaction writes anything to the database, the data it writes is tagged with the
@ -712,7 +716,7 @@ garbage collection process in the database removes any rows marked for deletion
 space.

 An update is internally translated into a delete and an insert [^44].
-For example, in [Figure 8-7](/en/ch8#fig_transactions_mvcc), transaction 13 deducts $100 from account 2, changing the
+For example, in [Figure 8-7](/en/ch8#fig_transactions_mvcc), transaction 13 deducts $100 from account 2, changing the
 balance from $500 to $400. The `accounts` table now actually contains two rows for account 2: a row
 with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of
 $400 which was inserted by transaction 13.
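
The delete-plus-insert translation described above can be sketched in a few lines of Python. This is only an illustration, not PostgreSQL's actual storage format: the `RowVersion` class, the `update_balance` helper, and the `inserted_by`/`deleted_by` field names are assumptions, as is the transaction ID 5 for the original row.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class RowVersion:
    account_id: int
    balance: int
    inserted_by: int                  # txid that created this version
    deleted_by: Optional[int] = None  # txid that marked it deleted, if any


def update_balance(table: List[RowVersion], account_id: int,
                   new_balance: int, txid: int) -> None:
    """Apply an update as 'mark the live version deleted' plus 'insert a new version'."""
    for row in table:
        if row.account_id == account_id and row.deleted_by is None:
            row.deleted_by = txid  # the old version stays in the table
            break
    table.append(RowVersion(account_id, new_balance, inserted_by=txid))


# Hypothetical starting state: account 2 holds $500, written by an earlier transaction.
accounts = [RowVersion(account_id=2, balance=500, inserted_by=5)]

# Transaction 13 deducts $100 from account 2.
update_balance(accounts, account_id=2, new_balance=400, txid=13)

for row in accounts:
    print(row)
```

After the update the list holds both versions of account 2: the $500 row marked as deleted by transaction 13, and the $400 row inserted by transaction 13.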
@ -741,7 +745,7 @@ consistent snapshot of the database to the application. This works roughly as fo
 process can remove them later.
 4. All other writes are visible to the application’s queries.

-These rules apply to both insertion and deletion of rows. In [Figure 8-7](/en/ch8#fig_transactions_mvcc), when
+These rules apply to both insertion and deletion of rows. In [Figure 8-7](/en/ch8#fig_transactions_mvcc), when
 transaction 12 reads from account 2, it sees a balance of $500 because the deletion of the $500
 balance was made by transaction 13 (according to rule 2, transaction 12 cannot see a deletion made
 by transaction 13), and the insertion of the $400 balance is not yet visible (by the same rule).
@ -758,6 +762,8 @@ that (from other transactions’ point of view) have long been overwritten or de
 updating values in place but instead inserting a new version every time a value is changed, the
 database can provide a consistent snapshot while incurring only a small overhead.

+<a id="sec_transactions_snapshot_indexes"></a>
+
 #### Indexes and snapshot isolation {#indexes-and-snapshot-isolation}

 How do indexes work in a multi-version database? The most common approach is that each index entry
@ -819,7 +825,7 @@ the issue of two transactions writing concurrently—we have only discussed dirt

 There are several other interesting kinds of conflicts that can occur between concurrently writing
 transactions. The best known of these is the *lost update* problem, illustrated in
-[Figure 8-1](/en/ch8#fig_transactions_increment) with the example of two concurrent counter increments.
+[Figure 8-1](/en/ch8#fig_transactions_increment) with the example of two concurrent counter increments.

 The lost update problem can occur if an application reads some value from the database, modifies it,
 and writes back the modified value (a *read-modify-write cycle*). If two transactions do this
@ -875,7 +881,7 @@ For example, consider a multiplayer game in which several players can move the s
 concurrently. In this case, an atomic operation may not be sufficient, because the application also
 needs to ensure that a player’s move abides by the rules of the game, which involves some logic that
 you cannot sensibly implement as a database query. Instead, you may use a lock to prevent two
-players from concurrently moving the same piece, as illustrated in [Example 8-1](/en/ch8#fig_transactions_select_for_update).
+players from concurrently moving the same piece, as illustrated in [Example 8-1](/en/ch8#fig_transactions_select_for_update).

 {{< figure id="fig_transactions_select_for_update" title="Example 8-1. Explicitly locking rows to prevent lost updates" class="w-full my-4" >}}

@ -956,7 +962,7 @@ written by other transactions are visible to the evaluation of the `WHERE` claus

 #### Conflict resolution and replication {#conflict-resolution-and-replication}

-In replicated databases (see [Chapter 6](/en/ch6#ch_replication)), preventing lost updates takes on another
+In replicated databases (see [Chapter 6](/en/ch6#ch_replication)), preventing lost updates takes on another
 dimension: since they have copies of the data on multiple nodes, and the data can potentially be
 modified concurrently on different nodes, some additional steps need to be taken to prevent lost
 updates.
@ -1000,7 +1006,7 @@ they are sick themselves), provided that at least one colleague remains on call
 Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are
 feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button
 to go off call at approximately the same time. What happens next is illustrated in
-[Figure 8-8](/en/ch8#fig_transactions_write_skew).
+[Figure 8-8](/en/ch8#fig_transactions_write_skew).

 {{< figure src="/fig/ddia_0808.png" id="fig_transactions_write_skew" caption="Figure 8-8. Example of write skew causing an application bug." class="w-full my-4" >}}

@ -1070,7 +1076,7 @@ Meeting room booking system
 : Say you want to enforce that there cannot be two bookings for the same meeting room at the same time [^55].
    When someone wants to make a booking, you first check for any conflicting bookings (i.e.,
    bookings for the same room with an overlapping time range), and if none are found, you create the
-    meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)).
+    meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)).
    
    {{< figure id="fig_transactions_meeting_rooms" title="Example 8-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)" class="w-full my-4" >}}
    
@ -1094,7 +1100,7 @@ Meeting room booking system
     isolation.

 Multiplayer game
-: In [Example 8-1](/en/ch8#fig_transactions_select_for_update), we used a lock to prevent lost updates (that is, making
+: In [Example 8-1](/en/ch8#fig_transactions_select_for_update), we used a lock to prevent lost updates (that is, making
 sure that two players can’t move the same figure at the same time). However, the lock doesn’t
 prevent players from moving two different figures to the same position on the board or potentially
 making some other move that violates the rules of the game. Depending on the kind of rule you are
@ -1278,7 +1284,7 @@ containing a single statement, or submit the entire transaction code to the data
 as a *stored procedure* [^61].

 The difference between interactive transactions and stored procedures is illustrated in
-[Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the
+[Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the
 stored procedure can execute very quickly, without waiting for any network or disk I/O.

 {{< figure src="/fig/ddia_0809.png" id="fig_transactions_stored_proc" caption="Figure 8-9. The difference between an interactive transaction and a stored procedure (using the example transaction of [Figure 8-8](/en/ch8#fig_transactions_write_skew))." class="w-full my-4" >}}
@ -1322,7 +1328,7 @@ requires that stored procedures are *deterministic* (when run on different nodes
 the same result). If a transaction needs to use the current date and time, for example, it must do
 so through special deterministic APIs (see [“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows) for more details on
 deterministic operations). This approach is called *state machine replication*, and we will return
-to it in [Chapter 10](/en/ch10#ch_consistency).
+to it in [Chapter 10](/en/ch10#ch_consistency).

 #### Sharding {#sharding}

@ -1332,7 +1338,7 @@ Read-only transactions may execute elsewhere, using snapshot isolation, but for
 high write throughput, the single-threaded transaction processor can become a serious bottleneck.

 In order to scale to multiple CPU cores, and multiple nodes, you can shard your data
-(see [Chapter 7](/en/ch7#ch_sharding)), which is supported in VoltDB. If you can find a way of sharding your dataset
+(see [Chapter 7](/en/ch7#ch_sharding)), which is supported in VoltDB. If you can find a way of sharding your dataset
 so that each transaction only needs to read and write data within a single shard, then each shard
 can have its own transaction processing thread running independently from the others. In this case,
 you can give each CPU core its own shard, which allows your transaction throughput to scale linearly
@ -1398,7 +1404,7 @@ anyone wants to write (modify or delete) an object, exclusive access is required
 unexpectedly behind A’s back.)
 * If transaction A has written an object and transaction B wants to read that object, B must wait
 until A commits or aborts before it can continue. (Reading an old version of the object, like in
- [Figure 8-4](/en/ch8#fig_transactions_read_committed), is not acceptable under 2PL.)
+ [Figure 8-4](/en/ch8#fig_transactions_read_committed), is not acceptable under 2PL.)

 In 2PL, writers don’t just block other writers; they also block readers and vice
 versa. Snapshot isolation has the mantra *readers never block writers, and writers never block
@ -1470,7 +1476,7 @@ changing the results of another transaction’s search query. A database with se
 must prevent phantoms.

 In the meeting room booking example this means that if one transaction has searched for existing
-bookings for a room within a certain time window (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)), another
+bookings for a room within a certain time window (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)), another
 transaction is not allowed to concurrently insert or update another booking for the same room and
 time range. (It’s okay to concurrently insert bookings for other rooms, or for the same room at a
 different time that doesn’t affect the proposed booking.)
@ -1623,7 +1629,7 @@ see [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_sn
 MVCC database, it ignores writes that were made by any other transactions that hadn’t yet committed
 at the time when the snapshot was taken.

-In [Figure 8-10](/en/ch8#fig_transactions_detect_mvcc), transaction 43 sees
+In [Figure 8-10](/en/ch8#fig_transactions_detect_mvcc), transaction 43 sees
 Aaliyah as having `on_call = true`, because transaction 42 (which modified Aaliyah’s on-call status) is
 uncommitted. However, by the time transaction 43 wants to commit, transaction 42 has already
 committed. This means that the write that was ignored when reading from the consistent snapshot has
@ -1650,7 +1656,7 @@ isolation’s support for long-running reads from a consistent snapshot.
 #### Detecting writes that affect prior reads {#sec_detecting_writes_affect_reads}

 The second case to consider is when another transaction modifies data after it has been read. This
-case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range).
+case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range).

 {{< figure src="/fig/ddia_0811.png" id="fig_transactions_detect_index_range" caption="Figure 8-11. In serializable snapshot isolation, detecting when one transaction modifies another transaction's reads." class="w-full my-4" >}}

@ -1660,7 +1666,7 @@ In the context of two-phase locking we discussed index-range locks (see
 search query, such as `WHERE shift_id = 1234`. We can use a similar technique here, except that SSI
 locks don’t block other transactions.

-In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transactions 42 and 43 both search for on-call doctors
+In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transactions 42 and 43 both search for on-call doctors
 during shift `1234`. If there is an index on `shift_id`, the database can use the index entry 1234 to
 record the fact that transactions 42 and 43 read this data. (If there is no index, this information
 can be tracked at the table level.) This information only needs to be kept for a while: after a
@ -1672,7 +1678,7 @@ that have recently read the affected data. This process is similar to acquiring
 key range, but rather than blocking until the readers have committed, the lock acts as a tripwire:
 it simply notifies the transactions that the data they read may no longer be up to date.

-In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transaction 43 notifies transaction 42 that its prior
+In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transaction 43 notifies transaction 42 that its prior
 read is outdated, and vice versa. Transaction 42 is first to commit, and it is successful: although
 transaction 43’s write affected 42, 43 hasn’t yet committed, so the write has not yet taken effect.
 However, when transaction 43 wants to commit, the conflicting write from 42 has already been
@ -1750,7 +1756,7 @@ distributed transactions, but various distributed relational databases do.

 In these cases, it is not sufficient to simply send a commit request to all of the nodes and
 independently commit the transaction on each one. It could easily happen that the commit succeeds on
-some nodes and fails on other nodes, as shown in [Figure 8-12](/en/ch8#fig_transactions_non_atomic):
+some nodes and fails on other nodes, as shown in [Figure 8-12](/en/ch8#fig_transactions_non_atomic):

 * Some nodes may detect a constraint violation or conflict, making an abort necessary, while other
 nodes are successfully able to commit.
@ -1766,7 +1772,7 @@ If some nodes commit the transaction but others abort it, the nodes become incon
 other. And once a transaction has been committed on one node, it cannot be retracted again if it
 later turns out that it was aborted on another node. This is because once data has been committed,
 it becomes visible to other transactions under *read committed* or stronger isolation. For example,
-in [Figure 8-12](/en/ch8#fig_transactions_non_atomic), by the time user 1 notices that its commit failed on database 1,
+in [Figure 8-12](/en/ch8#fig_transactions_non_atomic), by the time user 1 notices that its commit failed on database 1,
 user 2 has already read the data from the same transaction on database 2. If user 1’s transaction
 was later aborted, user 2’s transaction would have to be reverted as well, since it was based on
 data that was retroactively declared not to have existed.
@ -1782,7 +1788,7 @@ internally in some databases and also made available to applications in the form
 (which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP
 web services [^74] [^75].

-The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
+The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
 commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two
 phases (hence the name).

@ -1877,7 +1883,7 @@ was committed or aborted. If the coordinator crashes or the network fails at thi
 participant can do nothing but wait. A participant’s transaction in this state is called *in doubt*
 or *uncertain*.

-The situation is illustrated in [Figure 8-14](/en/ch8#fig_transactions_2pc_crash). In this particular example, the
+The situation is illustrated in [Figure 8-14](/en/ch8#fig_transactions_2pc_crash). In this particular example, the
 coordinator actually decided to commit, and database 2 received the commit request. However, the
 coordinator crashed before it could send the commit request to database 1, and so database 1 does
 not know whether to commit or abort. Even a timeout does not help here: if database 1 unilaterally
@ -1907,11 +1913,11 @@ is not so straightforward.

 As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed [^13] [^77].
 However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most
-practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
+practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
 cannot guarantee atomicity.

 A better solution in practice is to replace the single-node coordinator with a fault-tolerant
-consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_consistency).
+consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_consistency).

 ### Distributed Transactions Across Different Systems {#sec_transactions_xa}

@ -2018,7 +2024,7 @@ writes. In addition, if you want serializable isolation, a database using two-ph
 also have to take a shared lock on any rows *read* by the transaction.

 The database cannot release those locks until the transaction commits or aborts (illustrated as a
-shaded area in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit)). Therefore, when using two-phase commit, a
+shaded area in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit)). Therefore, when using two-phase commit, a
 transaction must hold onto the locks throughout the time it is in doubt. If the coordinator has
 crashed and takes 20 minutes to start up again, those locks will be held for 20 minutes. If the
 coordinator’s log is entirely lost for some reason, those locks will be held forever—or at least
@ -2086,7 +2092,7 @@ different systems.
 These problems are somewhat inherent in performing transactions across heterogeneous technologies.
 However, keeping several heterogeneous data systems consistent with each other is still a real and
 important problem, so we need to find a different solution to it. This can be done, as we will see
-in the next section and in [Link to Come].
+in the next section and in [“Derived data versus distributed transactions”](/en/ch13#sec_future_derived_vs_transactions).

 ### Database-internal Distributed Transactions {#sec_transactions_internal}

@ -2111,7 +2117,7 @@ The biggest problems with XA can be fixed by:
 * Coupling the atomic commitment protocol with a distributed concurrency control protocol that supports deadlock detection and consistent reads across shards.

 Consensus algorithms are commonly used to replicate the coordinator and the database shards. We will
-see in [Chapter 10](/en/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented
+see in [Chapter 10](/en/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented
 using a consensus algorithm. These algorithms tolerate faults by automatically failing over from one
 node to another without any human intervention, and while continuing to guarantee strong consistency
 properties.
@ -2159,7 +2165,7 @@ Thus, achieving exactly-once processing only requires transactions within the da
 across database and message broker is not necessary for this use case. Recording the message ID in
 the database makes the message processing *idempotent*, so that message processing can be safely
 retried without duplicating its side-effects. A similar approach is used in stream processing
-frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in [Link to Come].
+frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in [“Fault Tolerance”](/en/ch12#sec_stream_fault_tolerance).
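
As a rough sketch of the idempotence idea, the following Python example uses the standard `sqlite3` module to record a message ID and apply the message's effect in one transaction. The table names, the counter side effect, and the `process_message` helper are assumptions made for illustration; this is not how any particular broker or stream processor implements it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE processed_messages (message_id TEXT PRIMARY KEY)")
conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER NOT NULL)")
conn.execute("INSERT INTO counters VALUES ('orders', 0)")
conn.commit()


def process_message(conn: sqlite3.Connection, message_id: str) -> bool:
    """Apply the message's effect at most once; return True if it was applied."""
    try:
        with conn:  # one transaction: commits on success, rolls back on an exception
            # The primary key rejects a message ID that was already recorded.
            conn.execute("INSERT INTO processed_messages VALUES (?)", (message_id,))
            conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'orders'")
        return True
    except sqlite3.IntegrityError:
        return False  # duplicate delivery: the effect is already in the database


first = process_message(conn, "msg-1")
second = process_message(conn, "msg-1")  # the broker redelivers the same message
value = conn.execute("SELECT value FROM counters WHERE name = 'orders'").fetchone()[0]
print(first, second, value)
```

Because the ID insert and the side effect commit or roll back together, a retried delivery leaves the counter unchanged.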

 However, internal distributed transactions within the database are still useful for the scalability
 of patterns such as these: for example, they would allow the message IDs to be stored on one shard
@ -2189,7 +2195,7 @@ can have on the database.
 In this chapter, we went particularly deep into the topic of concurrency control. We discussed
 several widely used isolation levels, in particular *read committed*, *snapshot isolation*
 (sometimes called *repeatable read*), and *serializable*. We characterized those isolation levels by
-discussing various examples of race conditions, summarized in [Table 8-1](/en/ch8#ch_transactions_isolation_levels):
+discussing various examples of race conditions, summarized in [Table 8-1](/en/ch8#ch_transactions_isolation_levels):

 {{< figure id="ch_transactions_isolation_levels" title="Table 8-1. Summary of anomalies that can occur at various isolation levels" class="w-full my-4" >}}

--- a/content/en/ch9.md
+++ b/content/en/ch9.md
@ -4,6 +4,8 @@ weight: 209
 breadcrumbs: false
 ---

+<a id="ch_distributed"></a>
+
 ![](/map/ch08.png)

 > *They’re funny things, Accidents. You never have them till you’re having them.*
@ -33,7 +35,7 @@ explore the things that may go wrong in a distributed system. We will look into
 networks ([“Unreliable Networks”](/en/ch9#sec_distributed_networks)) as well as clocks and timing issues
 ([“Unreliable Clocks”](/en/ch9#sec_distributed_clocks)). The consequences of all these issues are disorienting, so we’ll
 explore how to think about the state of a distributed system and how to reason about things that
-have happened ([“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth)). Later, in [Chapter 10](/en/ch10#ch_consistency), we will look at some
+have happened ([“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth)). Later, in [Chapter 10](/en/ch10#ch_consistency), we will look at some
 examples of how we can achieve fault tolerance in the face of those faults.

 ## Faults and Partial Failures {#sec_distributed_partial_failure}
@ -104,7 +106,7 @@ The internet and most internal networks in datacenters (often Ethernet) are *asy
 networks*. In this kind of network, one node can send a message (a packet) to another node, but the
 network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send
 a request and expect a response, many things could go wrong (some of which are illustrated in
-[Figure 9-1](/en/ch9#fig_distributed_network)):
+[Figure 9-1](/en/ch9#fig_distributed_network)):

 1. Your request may have been lost (perhaps someone unplugged a network cable).
 2. Your request may be waiting in a queue and will be delivered later (perhaps the network or the
@ -219,7 +221,7 @@ even in controlled environments like a datacenter operated by one company [^8]:
 When one part of the network is cut off from the rest due to a network fault, that is sometimes
 called a *network partition* or *netsplit*, but it is not fundamentally different from other kinds
 of network interruption. Network partitions are not related to sharding of a storage system, which
-is sometimes also called *partitioning* (see [Chapter 7](/en/ch7#ch_sharding)).
+is sometimes also called *partitioning* (see [Chapter 7](/en/ch7#ch_sharding)).

 --------

@ -286,7 +288,7 @@ to a load spike on the node or the network).
 Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of
 performing some action (for example, sending an email), and another node takes over, the action may
 end up being performed twice. We will discuss this issue in more detail in
-[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), and in Chapters [^10] and [Link to Come].
+[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), [Chapter 10](/en/ch10#ch_consistency), and [“The End-to-End Argument for Databases”](/en/ch13#sec_future_end_to_end).

 When a node is declared dead, its responsibilities need to be transferred to other nodes, which
 places additional load on other nodes and the network. If the system is already struggling with high
@ -299,9 +301,9 @@ Imagine a fictitious system with a network that guaranteed a maximum delay for p
 is either delivered within some time *d*, or it is lost, but delivery never takes longer than *d*.
 Furthermore, assume that you can guarantee that a non-failed node always handles a request within
 some time *r*. In this case, you could guarantee that every successful request receives a response
-within time 2*d* + *r*—and if you don’t receive a response within that time, you know
+within time 2*d* + *r*—and if you don’t receive a response within that time, you know
 that either the network or the remote node is not working. If this were true,
-2*d* + *r* would be a reasonable timeout to use.
+2*d* + *r* would be a reasonable timeout to use.
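
The bound for this fictitious system can be written as a small calculation. The function name and the example values of 10 ms and 50 ms below are assumptions chosen only to make the arithmetic concrete.

```python
def request_timeout_ms(max_delay_ms: int, max_handling_ms: int) -> int:
    """Upper bound on round-trip time when delays are bounded.

    The request takes at most d to arrive, the node takes at most r to
    handle it, and the response takes at most d to return: 2d + r.
    """
    return 2 * max_delay_ms + max_handling_ms


# Hypothetical bounds: d = 10 ms per network hop, r = 50 ms of processing.
timeout = request_timeout_ms(max_delay_ms=10, max_handling_ms=50)
print(timeout)  # 70
```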

 Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks
 have *unbounded delays* (that is, they try to deliver packets as quickly as possible, but there is
@ -311,6 +313,8 @@ cannot guarantee that they can handle requests within some maximum time (see
 be fast most of the time: if your timeout is low, it only takes a transient spike in round-trip
 times to throw the system off-balance.

+<a id="sec_distributed_congestion"></a>
+
 #### Network congestion and queueing {#network-congestion-and-queueing}

 When driving a car, travel times on road networks often vary most due to traffic congestion.
@ -318,7 +322,7 @@ Similarly, the variability of packet delays on computer networks is most often d

 * If several different nodes simultaneously try to send packets to the same destination, the network
 switch must queue them up and feed them into the destination network link one by one (as illustrated
- in [Figure 9-2](/en/ch9#fig_distributed_switch_queueing)). On a busy network link, a packet may have to wait a while
+ in [Figure 9-2](/en/ch9#fig_distributed_switch_queueing)). On a busy network link, a packet may have to wait a while
 until it can get a slot (this is called *network congestion*). If there is so much incoming data
 that the switch queue fills up, the packet is dropped, so it needs to be resent—even though
 the network is functioning fine.
@ -340,6 +344,8 @@ expire, and then waiting for the retransmitted packet to be acknowledged).

 --------

+<a id="sidebar_distributed_tcp_udp"></a>
+
 > [!TIP] TCP VERSUS UDP

 Some latency-sensitive applications, such as videoconferencing and Voice over IP (VoIP), use UDP
@ -445,6 +451,8 @@ applications to reprioritize packets for QoS purposes.

 --------

+<a id="sidebar_distributed_latency_utilization"></a>
+
 > [!TIP] LATENCY AND RESOURCE UTILIZATION

 More generally, you can think of variable delays as a consequence of dynamic resource partitioning.
@ -548,7 +556,7 @@ unsuitable for measuring elapsed time [^40].
 Time-of-day clocks can experience jumps due to the start and end of Daylight Saving Time (DST);
 these can be avoided by always using UTC as time zone, which does not have DST.
 Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward
-in steps of 10 ms on older Windows systems [^41].
+in steps of 10 ms on older Windows systems [^41].
 On recent systems, this is less of a problem.

 #### Monotonic clocks {#monotonic-clocks}
@ -591,8 +599,8 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples

 * The quartz clock in a computer is not very accurate: it *drifts* (runs faster or slower than it
 should). Clock drift varies depending on the temperature of the machine. Google assumes a clock
- drift of up to 200 ppm (parts per million) for its servers  [^45],
- which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30
+ drift of up to 200 ppm (parts per million) for its servers  [^45],
+ which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30
 seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best
 possible accuracy you can achieve, even if everything is working correctly.
 * If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the
@ -602,7 +610,7 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples
 different nodes’ clocks. Anecdotal evidence suggests that this does happen in practice.
 * NTP synchronization can only be as good as the network delay, so there is a limit to its
 accuracy when you’re on a congested network with variable packet delays. One experiment showed
- that a minimum error of 35 ms is achievable when synchronizing over the internet [^46],
+ that a minimum error of 35 ms is achievable when synchronizing over the internet [^46],
 though occasional spikes in network delay lead to errors of around a second. Depending on the
 configuration, large network delays can cause the NTP client to give up entirely.
 * Some NTP servers are wrong or misconfigured, reporting time that is off by hours [^47] [^48].
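
The drift figures quoted in the first bullet above follow from a simple multiplication, sketched here in Python. The function name is an assumption, and the calculation takes the 200 ppm figure as a worst-case constant rate.

```python
def max_drift_seconds(drift_ppm: float, interval_seconds: float) -> float:
    """Worst-case clock error accumulated between two resynchronizations."""
    return drift_ppm * 1e-6 * interval_seconds


every_30_seconds = max_drift_seconds(200, 30)   # roughly 0.006 s, i.e. 6 ms
once_a_day = max_drift_seconds(200, 24 * 3600)  # roughly 17.28 s

print(f"{every_30_seconds * 1000:.1f} ms, {once_a_day:.2f} s")
```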
@ -673,29 +681,29 @@ ordering of events across multiple nodes [^64].
 For example, if two clients write to a distributed database, who got there first? Which write is the
 more recent one?

-[Figure 9-3](/en/ch9#fig_distributed_timestamps) illustrates a dangerous use of time-of-day clocks in a database with
-multi-leader replication (the example is similar to [Figure 6-8](/en/ch6#fig_replication_causality)). Client A writes
-*x* = 1 on node 1; the write is replicated to node 3; client B increments *x* on node
-3 (we now have *x* = 2); and finally, both writes are replicated to node 2.
+[Figure 9-3](/en/ch9#fig_distributed_timestamps) illustrates a dangerous use of time-of-day clocks in a database with
+multi-leader replication (the example is similar to [Figure 6-8](/en/ch6#fig_replication_causality)). Client A writes
+*x* = 1 on node 1; the write is replicated to node 3; client B increments *x* on node
+3 (we now have *x* = 2); and finally, both writes are replicated to node 2.

 {{< figure src="/fig/ddia_0903.png" id="fig_distributed_timestamps" caption="Figure 9-3. The write by client B is causally later than the write by client A, but B's write has an earlier timestamp." class="w-full my-4" >}}


-In [Figure 9-3](/en/ch9#fig_distributed_timestamps), when a write is replicated to other nodes, it is tagged with a
+In [Figure 9-3](/en/ch9#fig_distributed_timestamps), when a write is replicated to other nodes, it is tagged with a
 timestamp according to the time-of-day clock on the node where the write originated. The clock
 synchronization is very good in this example: the skew between node 1 and node 3 is less than
-3 ms, which is probably better than you can expect in practice.
+3 ms, which is probably better than you can expect in practice.

-Since the increment builds upon the earlier write of *x* = 1, we might expect that the
-write of *x* = 2 should have the greater timestamp of the two. Unfortunately, that is
-not what happens in [Figure 9-3](/en/ch9#fig_distributed_timestamps): the write *x* = 1 has a timestamp of
-42.004 seconds, but the write *x* = 2 has a timestamp of 42.003 seconds.
+Since the increment builds upon the earlier write of *x* = 1, we might expect that the
+write of *x* = 2 should have the greater timestamp of the two. Unfortunately, that is
+not what happens in [Figure 9-3](/en/ch9#fig_distributed_timestamps): the write *x* = 1 has a timestamp of
+42.004 seconds, but the write *x* = 2 has a timestamp of 42.003 seconds.

 As discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww), one way of resolving conflicts between concurrently written
 values on different nodes is *last write wins* (LWW), which means keeping the write with the
 greatest timestamp for a given key and discarding all writes with older timestamps. In the example
-of [Figure 9-3](/en/ch9#fig_distributed_timestamps), when node 2 receives these two events, it will incorrectly
-conclude that *x* = 1 is the more recent value and drop the write *x* = 2,
+of [Figure 9-3](/en/ch9#fig_distributed_timestamps), when node 2 receives these two events, it will incorrectly
+conclude that *x* = 1 is the more recent value and drop the write *x* = 2,
 so the increment is lost.
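
The lost increment can be reproduced in a few lines. The following is a minimal sketch, not code from any particular database: the `Write` record and the `lww_merge` function are hypothetical names introduced for illustration, and the values and timestamps are those of Figure 9-3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: int
    timestamp: float  # seconds, from the originating node's time-of-day clock


def lww_merge(writes):
    """Keep, for each key, only the write with the greatest timestamp."""
    winners = {}
    for w in writes:
        current = winners.get(w.key)
        if current is None or w.timestamp > current.timestamp:
            winners[w.key] = w
    return winners


# Client A writes x = 1 on node 1; client B then increments x on node 3.
write_a = Write("x", 1, 42.004)
write_b = Write("x", 2, 42.003)  # causally later, but with an earlier timestamp

result = lww_merge([write_a, write_b])
print(result["x"].value)  # 1: the increment has been discarded
```

The order in which node 2 receives the two writes makes no difference: the comparison looks only at the timestamps, so the causally later write loses either way.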

 This problem can be prevented by ensuring that when a value is overwritten, the new value always has
@ -710,7 +718,7 @@ policy [^62]. This approach has some serious problems:
 This scenario can cause arbitrary amounts of data to be silently dropped without any error being
 reported to the application.
 * LWW cannot distinguish between writes that occurred sequentially in quick succession (in
- [Figure 9-3](/en/ch9#fig_distributed_timestamps), client B’s increment definitely occurs *after* client A’s write)
+ [Figure 9-3](/en/ch9#fig_distributed_timestamps), client B’s increment definitely occurs *after* client A’s write)
 and writes that were truly concurrent (neither writer was aware of the other). Additional
 causality tracking mechanisms, such as version vectors, are needed in order to prevent violations
 of causality (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)).
@ -722,8 +730,8 @@ policy [^62]. This approach has some serious problems:
 Thus, even though it is tempting to resolve conflicts by keeping the most “recent” value and
 discarding others, it’s important to be aware that the definition of “recent” depends on a local
 time-of-day clock, which may well be incorrect. Even with tightly NTP-synchronized clocks, you could
-send a packet at timestamp 100 ms (according to the sender’s clock) and have it arrive at
-timestamp 99 ms (according to the recipient’s clock)—so it appears as though the packet
+send a packet at timestamp 100 ms (according to the sender’s clock) and have it arrive at
+timestamp 99 ms (according to the recipient’s clock)—so it appears as though the packet
 arrived before it was sent, which is impossible.

 Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur?
@ -746,12 +754,12 @@ actually accurate to such precision. In fact, it most likely is not—as mention
 drift in an imprecise quartz clock can easily be several milliseconds, even if you synchronize with
 an NTP server on the local network every minute. With an NTP server on the public internet, the best
 possible accuracy is probably to the tens of milliseconds, and the error may easily spike to over
-100 ms when there is network congestion.
+100 ms when there is network congestion.

 Thus, it doesn’t make sense to think of a clock reading as a point in time—it is more like a
 range of times, within a confidence interval: for example, a system may be 95% confident that the
 time now is between 10.3 and 10.5 seconds past the minute, but it doesn’t know any more precisely than that [^67].
-If we only know the time +/– 100 ms, the microsecond digits in the timestamp are essentially meaningless.
+If we only know the time +/– 100 ms, the microsecond digits in the timestamp are essentially meaningless.

 The uncertainty bound can be calculated based on your time source. If you have a GPS receiver or
 atomic clock directly attached to your computer, the expected error range is determined by
@ -808,7 +816,7 @@ length of the confidence interval before committing a read-write transaction. By
 ensures that any transaction that may read the data is at a sufficiently later time, so their
 confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner
 needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS
-receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [^45].
+receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [^45].
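
The idea of ordering by confidence intervals can be sketched as follows. This is an illustration only and not Spanner's actual API: `ClockInterval`, `reading`, and `commit_wait_ms` are hypothetical names, the uncertainty of 4 ms is an arbitrary assumption, and times are whole milliseconds to keep the arithmetic exact.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockInterval:
    """A clock reading as a range [earliest, latest], in milliseconds."""
    earliest: int
    latest: int

    def definitely_before(self, other: "ClockInterval") -> bool:
        # The order is only trustworthy if the intervals do not overlap.
        return self.latest < other.earliest


def reading(now_ms: int, uncertainty_ms: int) -> ClockInterval:
    """Turn a local clock value and an error bound into an interval."""
    return ClockInterval(now_ms - uncertainty_ms, now_ms + uncertainty_ms)


def commit_wait_ms(interval: ClockInterval) -> int:
    """Length of the confidence interval, i.e. the wait before committing."""
    return interval.latest - interval.earliest


a = reading(10_000, 4)  # [9996, 10004]
b = reading(10_006, 4)  # [10002, 10010], overlaps a
c = reading(10_009, 4)  # [10005, 10013], starts after a ends

print(a.definitely_before(b))  # False: the order is uncertain
print(a.definitely_before(c))  # True
print(commit_wait_ms(a))       # 8
```

The wait grows with the uncertainty, which is why a smaller clock error translates directly into lower commit latency.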

 The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to
 have a confidence interval, and the accurate clock sources only help keep that interval small. Other
@ -943,7 +951,7 @@ failure of the entire system. These are so-called *hard real-time* systems.
 > In embedded systems, *real-time* means that a system is carefully designed and tested to meet
 > specified timing guarantees in all circumstances. This meaning is in contrast to the more vague use of the
 > term *real-time* on the web, where it describes servers pushing data to clients and stream
-> processing without hard response time constraints (see [Link to Come]).
+> processing without hard response time constraints (see [Chapter 12](/en/ch12#ch_stream)).

 --------

@ -997,7 +1005,7 @@ A variant of this idea is to use the garbage collector only for short-lived obje
 to collect) and to restart processes periodically, before they accumulate enough long-lived objects
 to require a full GC of long-lived objects [^79] [^82].
 One node can be restarted at a time, and traffic can be shifted away from the node before the
-planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding)).
+planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding)).

 These measures cannot fully prevent garbage collection pauses, but they can usefully reduce their
 impact on the application.
@ -1031,7 +1039,7 @@ even if the underlying system model provides very few guarantees.
 However, although it is possible to make software well behaved in an unreliable system model, it
 is not straightforward to do so. In the rest of this chapter we will further explore the notions of
 knowledge and truth in distributed systems, which will help us think about the kinds of assumptions
-we can make and the guarantees we may want to provide. In [Chapter 10](/en/ch10#ch_consistency) we will proceed to
+we can make and the guarantees we may want to provide. In [Chapter 10](/en/ch10#ch_consistency) we will proceed to
 look at some examples of distributed algorithms that provide particular guarantees under particular
 assumptions.

@ -1075,7 +1083,7 @@ of quorums are possible). A majority quorum allows the system to continue workin
 are faulty (with three nodes, one faulty node can be tolerated; with five nodes, two faulty nodes can be
 tolerated). However, it is still safe, because there can be only one majority in the
 system—there cannot be two majorities with conflicting decisions at the same time. We will discuss
-the use of quorums in more detail when we get to *consensus algorithms* in [Chapter 10](/en/ch10#ch_consistency).
+the use of quorums in more detail when we get to *consensus algorithms* in [Chapter 10](/en/ch10#ch_consistency).
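
The arithmetic behind these numbers is short enough to write down. In this sketch the function names `majority` and `max_faulty` are chosen for illustration:

```python
def majority(n: int) -> int:
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


def max_faulty(n: int) -> int:
    """How many nodes can fail while a majority remains reachable."""
    return n - majority(n)


for n in (3, 4, 5):
    print(n, majority(n), max_faulty(n))
# 3 2 1
# 4 3 1
# 5 3 2

# Two majorities always share at least one node, so they cannot
# reach conflicting decisions independently of each other.
assert all(2 * majority(n) > n for n in range(1, 100))
```

Note that a fourth node does not buy any additional fault tolerance over three, which is one reason clusters are usually sized with an odd number of nodes.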

 ### Distributed Locks and Leases {#sec_distributed_lock_fencing}

@ -1099,13 +1107,13 @@ hold the lease, perhaps due to a process pause. In the third example, the conseq
 wasted computational resources, which is not a big deal. But in the first two cases, the consequence
 could be lost or corrupted data, which is much more serious.

-For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect
+For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect
 implementation of locking. (The bug is not theoretical: HBase used to have this problem [^85] [^86].)
 Say you want to ensure that a file in a storage service can only be
 accessed by one client at a time, because if multiple clients tried to write to it, the file would
 become corrupted. You try to implement this by requiring a client to obtain a lease from a lock
 service before accessing the file. Such a lock service is often implemented using a consensus
-algorithm; we will discuss this further in [Chapter 10](/en/ch10#ch_consistency).
+algorithm; we will discuss this further in [Chapter 10](/en/ch10#ch_consistency).

 {{< figure src="/fig/ddia_0904.png" id="fig_distributed_lease_pause" caption="Figure 9-4. Incorrect implementation of a distributed lock: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage." class="w-full my-4" >}}

@ -1116,13 +1124,13 @@ the same file, and start writing to the file. When the paused client comes back,
 (incorrectly) that it still has a valid lease and proceeds to also write to the file. We now have a
 split brain situation: the clients’ writes clash and corrupt the file.

-[Figure 9-5](/en/ch9#fig_distributed_lease_delay) shows a different problem that has similar consequences. In this
+[Figure 9-5](/en/ch9#fig_distributed_lease_delay) shows a different problem that has similar consequences. In this
 example there is no process pause, only a crash by client 1. Just before client 1 crashes it sends a
 write request to the storage service, but this request is delayed for a long time in the network.
 (Remember from [“Network Faults in Practice”](/en/ch9#sec_distributed_network_faults) that packets can sometimes be delayed by a minute
 or more.) By the time the write request arrives at the storage service, the lease has already timed
 out, allowing client 2 to acquire it and issue a write of its own. The result is corruption similar
-to [Figure 9-4](/en/ch9#fig_distributed_lease_pause).
+to [Figure 9-4](/en/ch9#fig_distributed_lease_pause).

 {{< figure src="/fig/ddia_0905.png" id="fig_distributed_lease_delay" caption="Figure 9-5. A message from a former leaseholder might be delayed for a long time, and arrive after another node has taken over the lease." class="w-full my-4" >}}

@ -1139,11 +1147,11 @@ from the network [^9], shutting down the VM via
 the cloud provider’s management interface, or even physically powering down the machine [^87].
 This approach is known as *Shoot The Other Node In The Head* or STONITH. Unfortunately, it suffers
 from some problems: it does not protect against large network delays like in
-[Figure 9-5](/en/ch9#fig_distributed_lease_delay); it can happen that all of the nodes shut each other down [^19]; and by the time the zombie has been
+[Figure 9-5](/en/ch9#fig_distributed_lease_delay); it can happen that all of the nodes shut each other down [^19]; and by the time the zombie has been
 detected and shut down, it may already be too late and data may already have been corrupted.

 A more robust fencing solution, which protects against both zombies and delayed requests, is
-illustrated in [Figure 9-6](/en/ch9#fig_distributed_fencing).
+illustrated in [Figure 9-6](/en/ch9#fig_distributed_fencing).

 {{< figure src="/fig/ddia_0906.png" id="fig_distributed_fencing" caption="Figure 9-6. Making access to storage safe by allowing writes only in the order of increasing fencing tokens." class="w-full my-4" >}}
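
The rule in the caption of Figure 9-6 can be sketched from the storage service's point of view. This is a simplified, single-process illustration under the assumption that every write carries a fencing token; `FencedStorage` and `StaleTokenError` are hypothetical names, not the API of a real storage system.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class FencedStorage:
    def __init__(self):
        self.highest_token = None  # greatest fencing token seen so far
        self.data = {}

    def write(self, token: int, key: str, value: str) -> None:
        # Reject writes from a former leaseholder. An equal token is
        # accepted, since one leaseholder may issue several writes.
        if self.highest_token is not None and token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


storage = FencedStorage()
storage.write(34, "file", "written by client 2")

try:
    # A paused or delayed client still holding the older token 33.
    storage.write(33, "file", "written by client 1")
except StaleTokenError:
    print("stale write rejected")

print(storage.data["file"])  # written by client 2
```

The check lives in the storage service rather than in the clients, because a paused client cannot be relied upon to notice that its own lease has expired.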

@ -1158,12 +1166,12 @@ it must include its current fencing token.
 > [!NOTE]
 > There are several alternative names for fencing tokens. In Chubby, Google’s lock service, they are
 > called *sequencers* [^88], and in Kafka they are called *epoch numbers*.
-> In consensus algorithms, which we will discuss in [Chapter 10](/en/ch10#ch_consistency), the *ballot number* (Paxos) or
+> In consensus algorithms, which we will discuss in [Chapter 10](/en/ch10#ch_consistency), the *ballot number* (Paxos) or
 > *term number* (Raft) serves a similar purpose.

 --------

-In [Figure 9-6](/en/ch9#fig_distributed_fencing), client 1 acquires the lease with a token of 33, but then
+In [Figure 9-6](/en/ch9#fig_distributed_fencing), client 1 acquires the lease with a token of 33, but then
 it goes into a long pause and the lease expires. Client 2 acquires the lease with a token of 34 (the
 number always increases) and then sends its write request to the storage service, including the
 token of 34. Later, client 1 comes back to life and sends its write to the storage service,
@ -1196,7 +1204,7 @@ last-write-wins conflict resolution (see [“Leaderless Replication”](/en/ch6#
 client sends writes directly to each replica, and each replica independently decides whether to
 accept a write based on a timestamp assigned by the client.

-As illustrated in [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), you can put the writer’s fencing token in
+As illustrated in [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), you can put the writer’s fencing token in
 the most significant bits or digits of the timestamp. You can then be sure that any timestamp
 generated by the new leaseholder will be greater than any timestamp from the old leaseholder, even
 if the old leaseholder’s writes happened later.
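
One way to build such a timestamp is sketched below, under the assumption that the clock value is a millisecond count that fits in a fixed number of decimal digits. The function name `fenced_timestamp` and the 13-digit width are illustrative choices, not taken from any particular database.

```python
def fenced_timestamp(token: int, clock_ms: int, clock_digits: int = 13) -> int:
    """Place the fencing token in the most significant digits of the timestamp."""
    if not 0 <= clock_ms < 10 ** clock_digits:
        raise ValueError("clock value does not fit in the reserved digits")
    return token * 10 ** clock_digits + clock_ms


# The old leaseholder (token 33) writes five seconds *later* by wall-clock
# time than the new leaseholder (token 34).
old = fenced_timestamp(33, 1_700_000_005_000)
new = fenced_timestamp(34, 1_700_000_000_000)

print(new > old)  # True: the token dominates the comparison
```

Because the token occupies the high-order digits, the clock value only breaks ties between writes made under the same lease.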
@ -1204,7 +1212,7 @@ if the old leaseholder’s writes happened later.
 {{< figure src="/fig/ddia_0907.png" id="fig_distributed_fencing_leaderless" caption="Figure 9-7. Using fencing tokens to protect writes to a leaderless replicated database." class="w-full my-4" >}}


-In [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), Client 2 has a fencing token of 34, so all of its
+In [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), Client 2 has a fencing token of 34, so all of its
 timestamps starting with 34… are greater than any timestamps starting with 33… that are
 generated by Client 1. Client 2 writes to a quorum of replicas but it can’t reach Replica 3. This
 means that when the zombie Client 1 later tries to write, its write may succeed at Replica 3 even
@ -1239,7 +1247,7 @@ The Byzantine Generals Problem is a generalization of the so-called *Two General
 which imagines a situation in which two army generals need to agree on a battle plan. As they
 have set up camp on two different sites, they can only communicate by messenger, and the messengers
 sometimes get delayed or lost (like packets in a network). We will discuss this problem of
-*consensus* in [Chapter 10](/en/ch10#ch_consistency).
+*consensus* in [Chapter 10](/en/ch10#ch_consistency).

 In the Byzantine version of the problem, there are *n* generals who need to agree, and their
 endeavor is hampered by the fact that there are some traitors in their midst. Most of the generals
@ -1301,6 +1309,8 @@ an attacker can compromise one node, they can probably compromise all of them, b
 probably running the same software. Thus, traditional mechanisms (authentication, access control,
 encryption, firewalls, and so on) continue to be the main protection against attackers.

+<a id="sec_distributed_weak_lying"></a>
+
 #### Weak forms of lying {#weak-forms-of-lying}

 Although we assume that nodes are generally honest, it can be worth adding mechanisms to software
@ -1327,7 +1337,7 @@ pragmatic steps toward better reliability. For example:
 ### System Model and Reality {#sec_distributed_system_model}

 Many algorithms have been designed to solve distributed systems problems—for example, we will
-examine solutions for the consensus problem in [Chapter 10](/en/ch10#ch_consistency). In order to be useful, these
+examine solutions for the consensus problem in [Chapter 10](/en/ch10#ch_consistency). In order to be useful, these
 algorithms need to tolerate the various faults of distributed systems that we discussed in this
 chapter.

@ -1409,7 +1419,7 @@ Uniqueness

 Monotonic sequence
 : If request *x* returned token *t*<sub>*x*</sub>, and request *y* returned token *t*<sub>*y*</sub>, and
- *x* completed before *y* began, then *t*<sub>*x*</sub> < *t*<sub>*y*</sub>.
+ *x* completed before *y* began, then *t*<sub>*x*</sub> < *t*<sub>*y*</sub>.

 Availability
 : A node that requests a fencing token and does not crash eventually receives a response.
@ -1615,7 +1625,7 @@ TigerBeetle’s time abstraction allows simulations to simulate network latency
 actually taking the full length of time to trigger the timeout. Such techniques allow the simulator
 to explore more code paths faster.

-# The Power of Determinism
+#### The Power of Determinism {#sidebar_distributed_determinism}

 Nondeterminism is at the core of all of the distributed systems challenges we discussed in this
 chapter: concurrency, network delay, process pauses, clock jumps, and crashes all happen in
@ -1839,4 +1849,4 @@ problems in distributed systems.
 [^131]: Rupak Majumdar and Filip Niksic. [Why is random testing effective for partition tolerance bugs?](https://dl.acm.org/doi/pdf/10.1145/3158134) *Proceedings of the ACM on Programming Languages* (PACMPL), volume 2, issue POPL, article no. 46, December 2017. [doi:10.1145/3158134](https://doi.org/10.1145/3158134) 
 [^132]: FoundationDB project authors. [Simulation and Testing](https://apple.github.io/foundationdb/testing.html). *apple.github.io*. Archived at [perma.cc/NQ3L-PM4C](https://perma.cc/NQ3L-PM4C) 
 [^133]: Alex Kladov. [Simulation Testing For Liveness](https://tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness/). *tigerbeetle.com*, July 2023. Archived at [perma.cc/RKD4-HGCR](https://perma.cc/RKD4-HGCR) 
-[^134]: Alfonso Subiotto Marqués. [(Mostly) Deterministic Simulation Testing in Go](https://www.polarsignals.com/blog/posts/2024/05/28/mostly-dst-in-go). *polarsignals.com*, May 2024. Archived at [perma.cc/ULD6-TSA4](https://perma.cc/ULD6-TSA4) 
+[^134]: Alfonso Subiotto Marqués. [(Mostly) Deterministic Simulation Testing in Go](https://www.polarsignals.com/blog/posts/2024/05/28/mostly-dst-in-go). *polarsignals.com*, May 2024. Archived at [perma.cc/ULD6-TSA4](https://perma.cc/ULD6-TSA4) 
--- a/content/en/colophon.md
+++ b/content/en/colophon.md
@ -4,23 +4,20 @@ weight: 600
 breadcrumbs: false
 ---

-{{< callout type="warning" >}}
-This page is from the 1st edition， 2nd edition is not available yet.
-{{< /callout >}}
-
 ## About the Author

-**Martin Kleppmann** is a researcher in distributed systems at the University of Cambridge, UK.
-Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure.
-In the process he learned a few things the hard way, and he hopes this book will save you from repeating the same mistakes.
-
-Martin is a regular conference speaker, blogger, and open source contributor. He believes that profound technical ideas should be accessible to everyone, and that deeper understanding will help us develop better software.
+**Martin Kleppmann** is an Associate Professor at the University of Cambridge, UK, where he teaches on distributed systems and cryptographic protocols. 
+The first edition of *Designing Data-Intensive Applications* in 2017 established him as an authority on data systems, 
+and through his research on distributed systems he helped start the local-first software movement. 
+Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive, 
+where he worked on large-scale data infrastructure.

 ![](http://martin.kleppmann.com/2017/03/ddia-poster.jpg)

-**Chris Riccomini** is a software engineer, startup investor, and author with 15+ years of experience at PayPal, LinkedIn, and WePay.
-He runs Materialized View Capital, where he invests in infrastructure startups. He is also the cocreator of Apache Samza and SlateDB,
-and coauthor of The Missing README: A Guide for the New Software Engineer.
+**Chris Riccomini** is a software engineer, startup investor, and author with 15+ years of experience at PayPal, 
+LinkedIn, and WePay. He runs Materialized View Capital, where he invests in infrastructure startups. 
+He is also the co-creator of Apache Samza and SlateDB, 
+and co-author of *The Missing README: A Guide for the New Software Engineer*.


 ## Colophon
--- a/content/en/glossary.md
+++ b/content/en/glossary.md
@ -4,38 +4,33 @@ weight: 500
 breadcrumbs: false
 ---

-{{< callout type="warning" >}}
-This page is from the 1st edition， 2nd edition is not available yet.
-{{< /callout >}}
-
 > Please note that the definitions in this glossary are short and simple, intended to convey the core idea but not the full subtleties of a term. For more detail, please follow the references into the main text.

 ### asynchronous

-Not waiting for something to complete (e.g., sending data over the network to another node), and not making any assumptions about how long it is going to take. See “Synchronous Versus Asynchro‐ nous Replication” on page 153, “Synchro‐ nous Versus Asynchronous Networks” on page 284, and “System Model and Reality” on page 306.
+Not waiting for something to complete (e.g., sending data over the network to another node), and not making any assumptions about how long it is going to take. See [“Synchronous Versus Asynchronous Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_sync_async), [“Synchronous Versus Asynchronous Networks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_sync_networks), and [“System Model and Reality”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_system_model).

 ### atomic

-1. In the context of concurrent operations: describing an operation that appears to take effect at a single point in time, so another concurrent process can never encounter the operation in a “half- finished” state. See also *isolation*.
-2. In the context of transactions: grouping together a set of writes that must either all be committed or all be rolled back, even if faults occur. See “Atomicity” on page 223 and “Atomic Commit and Two-Phase Commit (2PC)” on page 354.
+1.  In the context of concurrency: describing an operation that appears to take effect at a single point in time, so another concurrent process can never encounter the operation in a “half-finished” state. See also *isolation*.
+
+2.  In the context of transactions: grouping together a set of writes that must either all be committed or all be rolled back, even if faults occur. See [“Atomicity”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_acid_atomicity) and [“Two-Phase Commit (2PC)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pc).

 ### backpressure

-Forcing the sender of some data to slow down because the recipient cannot keep
-
-up with it. Also known as *flow control*. See “Messaging Systems” on page 441.
+Forcing the sender of some data to slow down when the recipient cannot keep up with it. Also known as *flow control*. See [“When an Overloaded System Won’t Recover”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sidebar_metastable).

 ### batch process

-A computation that takes some fixed (and usually large) set of data as input and pro‐ duces some other data as output, without modifying the input. See Chapter 10.
+A computation that takes some fixed (and usually large) set of data as input and produces some other data as output, without modifying the input. See [Chapter 11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch11.html#ch_batch).

 ### bounded

-Having some known upper limit or size. Used for example in the context of net‐ work delay (see “Timeouts and Unboun‐ ded Delays” on page 281) and datasets (see the introduction to Chapter 11).
+Having some known upper limit or size. Used for example in the context of network delay (see [“Timeouts and Unbounded Delays”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_queueing)) and datasets (see the introduction to [Chapter 12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#ch_stream)).

 ### Byzantine fault

-A node that behaves incorrectly in some arbitrary way, for example by sending contradictory or malicious messages to other nodes. See “Byzantine Faults” on page 304.
+A node that behaves incorrectly in some arbitrary way, for example by sending contradictory or malicious messages to other nodes. See [“Byzantine Faults”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_byzantine).

 ### cache

@ -43,55 +38,55 @@ A component that remembers recently used data in order to speed up future reads

 ### CAP theorem

-A widely misunderstood theoretical result that is not useful in practice. See “The CAP theorem” on page 336.
+A widely misunderstood theoretical result that is not useful in practice. See [“The CAP theorem”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_cap).

 ### causality

-The dependency between events that ari‐ ses when one thing “happens before” another thing in a system. For example, a later event that is in response to an earlier event, or builds upon an earlier event, or should be understood in the light of an earlier event. See “The “happens-before” relationship and concurrency” on page 186 and “Ordering and Causality” on page 339.
+The dependency between events that arises when one thing “happens before” another thing in a system. For example, a later event that is in response to an earlier event, or builds upon an earlier event, or should be understood in the light of an earlier event. See [“The “happens-before” relation and concurrency”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_happens_before).

 ### consensus

-A fundamental problem in distributed computing, concerning getting several nodes to agree on something (for exam‐ ple, which node should be the leader for a database cluster). The problem is much harder than it seems at first glance. See “Fault-Tolerant Consensus” on page 364.
+A fundamental problem in distributed computing, concerning getting several nodes to agree on something (for example, which node should be the leader for a database cluster). The problem is much harder than it seems at first glance. See [“Consensus”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_consensus).

 ### data warehouse

-A database in which data from several dif‐ ferent OLTP systems has been combined and prepared to be used for analytics pur‐ poses. See “Data Warehousing” on page 91.
+A database in which data from several different OLTP systems has been combined and prepared to be used for analytics purposes. See [“Data Warehousing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_dwh).

 ### declarative

-Describing the properties that something should have, but not the exact steps for how to achieve it. In the context of quer‐ ies, a query optimizer takes a declarative query and decides how it should best be executed. See “Query Languages for Data” on page 42.
+Describing the properties that something should have, but not the exact steps for how to achieve it. In the context of database queries, a query optimizer takes a declarative query and decides how it should best be executed. See [“Terminology: Declarative Query Languages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sidebar_declarative).

 ### denormalize

-To introduce some amount of redun‐ dancy or duplication in a *normalized* dataset, typically in the form of a *cache* or *index*, in order to speed up reads. A denormalized value is a kind of precom‐ puted query result, similar to a materialized view. See “Single-Object and Multi- Object Operations” on page 228 and “Deriving several views from the same event log” on page 461.
+To introduce some amount of redundancy or duplication in a *normalized* dataset, typically in the form of a *cache* or *index*, in order to speed up reads. A denormalized value is a kind of precomputed query result, similar to a materialized view. See [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization).

 ### derived data

-A dataset that is created from some other data through a repeatable process, which you could run again if necessary. Usually, derived data is needed to speed up a par‐ ticular kind of read access to the data. Indexes, caches, and materialized views are examples of derived data. See the introduction to Part III.
+A dataset that is created from some other data through a repeatable process, which you could run again if necessary. Usually, derived data is needed to speed up a particular kind of read access to the data. Indexes, caches, and materialized views are examples of derived data. See [“Systems of Record and Derived Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_derived).

 ### deterministic

-Describing a function that always pro‐ duces the same output if you give it the same input. This means it cannot depend on random numbers, the time of day, net‐ work communication, or other unpredict‐ able things.
+Describing a function that always produces the same output if you give it the same input. This means it cannot depend on random numbers, the time of day, network communication, or other unpredictable things. See [“The Power of Determinism”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sidebar_distributed_determinism).

 ### distributed

-Running on several nodes connected by a network. Characterized by *partial failures*: some part of the system may be broken while other parts are still working, and it is often impossible for the software to know what exactly is broken. See “Faults and Partial Failures” on page 274.
+Running on several nodes connected by a network. Characterized by *partial failures*: some part of the system may be broken while other parts are still working, and it is often impossible for the software to know what exactly is broken. See [“Faults and Partial Failures”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_partial_failure).

 ### durable

-Storing data in a way such that you believe it will not be lost, even if various faults occur. See “Durability” on page 226.
+Storing data in a way such that you believe it will not be lost, even if various faults occur. See [“Durability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_acid_durability).

 ### ETL

-Extract–Transform–Load. The process of extracting data from a source database, transforming it into a form that is more suitable for analytic queries, and loading it into a data warehouse or batch processing system. See “Data Warehousing” on page 91.
+Extract–Transform–Load. The process of extracting data from a source database, transforming it into a form that is more suitable for analytic queries, and loading it into a data warehouse or batch processing system. See [“Data Warehousing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_dwh).

 ### failover

-In systems that have a single leader, fail‐ over is the process of moving the leader‐ ship role from one node to another. See “Handling Node Outages” on page 156.
+In systems that have a single leader, failover is the process of moving the leadership role from one node to another. See [“Handling Node Outages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_failover).

 ### fault-tolerant

-Able to recover automatically if some‐ thing goes wrong (e.g., if a machine crashes or a network link fails). See “Reli‐ ability” on page 6.
+Able to recover automatically if something goes wrong (e.g., if a machine crashes or a network link fails). See [“Reliability and Fault Tolerance”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_reliability).

 ### flow control

@ -99,150 +94,164 @@ See *backpressure*.

 ### follower

-A replica that does not directly accept any writes from clients, but only processes data changes that it receives from a leader. Also known as a *secondary*, *slave*, *read replica*, or *hot standby*. See “Leaders and Followers” on page 152.
+A replica that does not directly accept any writes from clients, but only processes data changes that it receives from a leader. Also known as a *secondary*, *read replica*, or *hot standby*. See [“Single-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_leader).

 ### full-text search

-Searching text by arbitrary keywords, often with additional features such as matching similarly spelled words or syno‐ nyms. A full-text index is a kind of *secon‐ dary index* that supports such queries. See “Full-text search and fuzzy indexes” on page 88.
+Searching text by arbitrary keywords, often with additional features such as matching similarly spelled words or synonyms. A full-text index is a kind of *secondary index* that supports such queries. See [“Full-Text Search”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_full_text).

 ### graph

-A data structure consisting of *vertices* (things that you can refer to, also known as *nodes* or *entities*) and *edges* (connec‐ tions from one vertex to another, also known as *relationships* or *arcs*). See “Graph-Like Data Models” on page 49.
+A data structure consisting of *vertices* (things that you can refer to, also known as *nodes* or *entities*) and *edges* (connections from one vertex to another, also known as *relationships* or *arcs*). See [“Graph-Like Data Models”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_graph).

 ### hash

-A function that turns an input into a random-looking number. The same input always returns the same number as out‐ put. Two different inputs are very likely to have two different numbers as output, although it is possible that two different inputs produce the same output (this is called a *collision*). See “Partitioning by Hash of Key” on page 203.
+A function that turns an input into a random-looking number. The same input always returns the same number as output. Two different inputs are very likely to have two different numbers as output, although it is possible that two different inputs produce the same output (this is called a *collision*). See [“Sharding by Hash of Key”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_hash).

 ### idempotent

-Describing an operation that can be safely retried; if it is executed more than once, it has the same effect as if it was only exe‐ cuted once. See “Idempotence” on page 478.
+Describing an operation that can be safely retried; if it is executed more than once, it has the same effect as if it was only executed once. See [“Idempotence”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#sec_stream_idempotence).

 ### index

-A data structure that lets you efficiently search for all records that have a particular value in a particular field. See “Data Structures That Power Your Database” on page 70.
+A data structure that lets you efficiently search for all records that have a particular value in a particular field. See [“Storage and Indexing for OLTP”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_oltp).

 ### isolation

-In the context of transactions, describing the degree to which concurrently execut‐ ing transactions can interfere with each other. *Serializable* isolation provides the strongest guarantees, but weaker isolation levels are also used. See “Isolation” on page 225.
+In the context of transactions, describing the degree to which concurrently executing transactions can interfere with each other. *Serializable* isolation provides the strongest guarantees, but weaker isolation levels are also used. See [“Isolation”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_acid_isolation).

 ### join

-To bring together records that have some‐ thing in common. Most commonly used in the case where one record has a refer‐ ence to another (a foreign key, a docu‐ ment reference, an edge in a graph) and a query needs to get the record that the ref‐ erence points to. See “Many-to-One and Many-to-Many Relationships” on page 33 and “Reduce-Side Joins and Grouping” on page 403.
+To bring together records that have something in common. Most commonly used in the case where one record has a reference to another (a foreign key, a document reference, an edge in a graph) and a query needs to get the record that the reference points to. See [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization) and [“JOIN and GROUP BY”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch11.html#sec_batch_join).

 ### leader

-When data or a service is replicated across several nodes, the leader is the designated replica that is allowed to make changes. A leader may be elected through some pro‐ tocol, or manually chosen by an adminis‐ trator. Also known as the *primary* or *master*. See “Leaders and Followers” on page 152.
+When data or a service is replicated across several nodes, the leader is the designated replica that is allowed to make changes. A leader may be elected through some protocol, or manually chosen by an administrator. Also known as the *primary* or *source*. See [“Single-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_leader).

 ### linearizable

-Behaving as if there was only a single copy of data in the system, which is updated by atomic operations. See “Linearizability” on page 324.
+Behaving as if there was only a single copy of data in the system, which is updated by atomic operations. See [“Linearizability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_linearizability).

 ### locality

-A performance optimization: putting sev‐ eral pieces of data in the same place if they are frequently needed at the same time. See “Data locality for queries” on page 41.
+A performance optimization: putting several pieces of data in the same place if they are frequently needed at the same time. See [“Data locality for reads and writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_document_locality).

 ### lock

-A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See “Two-Phase Locking (2PL)” on page 257 and “The leader and the lock” on page 301.
+A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See [“Two-Phase Locking (2PL)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pl) and [“Distributed Locks and Leases”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_lock_fencing).

 ### log

-A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See “Two-Phase Locking (2PL)” on page 257 and “The leader and the lock” on page 301.
-
+An append-only file for storing data. A *write-ahead log* is used to make a storage engine resilient against crashes (see [“Making B-trees reliable”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_btree_wal)), a *log-structured* storage engine uses logs as its primary storage format (see [“Log-Structured Storage”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_log_structured)), a *replication log* is used to copy writes from a leader to followers (see [“Single-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_leader)), and an *event log* can represent a data stream (see [“Log-based Message Brokers”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#sec_stream_log)).

 ### materialize
-To perform a computation eagerly and write out its result, as opposed to calculat‐ ing it on demand when requested. See “Aggregation: Data Cubes and Material‐ ized Views” on page 101 and “Materialization of Intermediate State” on page 419.

+To perform a computation eagerly and write out its result, as opposed to calculating it on demand when requested. See [“Event Sourcing and CQRS”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_events).

 ### node

 An instance of some software running on a computer, which communicates with other nodes via a network in order to accomplish some task.

-
 ### normalized
-An instance of some software running on a computer, which communicates with other nodes via a network in order to accomplish some task.
-Structured in such a way that there is no redundancy or duplication. In a normal‐ ized database, when some piece of data changes, you only need to change it in one place, not many copies in many different places. See “Many-to-One and Many-to- Many Relationships” on page 33.
+
+Structured in such a way that there is no redundancy or duplication. In a normalized database, when some piece of data changes, you only need to change it in one place, not many copies in many different places. See [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization).

 ### OLAP
-Online analytic processing. Access pattern characterized by aggregating (e.g., count, sum, average) over a large number of records. See “Transaction Processing or Analytics?” on page 90.

+Online analytic processing. Access pattern characterized by aggregating (e.g., count, sum, average) over a large number of records. See [“Operational Versus Analytical Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_analytics).

 ### OLTP

-Online transaction processing. Access pattern characterized by fast queries that read or write a small number of records, usually indexed by key. See “Transaction Processing or Analytics?” on page 90.
+Online transaction processing. Access pattern characterized by fast queries that read or write a small number of records, usually indexed by key. See [“Operational Versus Analytical Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_analytics).

-### partitioning
+### sharding

-Splitting up a large dataset or computa‐ tion that is too big for a single machine into smaller parts and spreading them across several machines. Also known as sharding. See Chapter 6.
+Splitting up a large dataset or computation that is too big for a single machine into smaller parts and spreading them across several machines. Also known as *partitioning*. See [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding).

 ### percentile
-A way of measuring the distribution of values by counting how many values are above or below some threshold. For example, the 95th percentile response time during some period is the time t such that 95% of requests in that period com‐ plete in less than t, and 5% take longer than t. See “Describing Performance” on page 13.
+
+A way of measuring the distribution of values by counting how many values are above or below some threshold. For example, the 95th percentile response time during some period is the time *t* such that 95% of requests in that period complete in less than *t*, and 5% take longer than *t*. See [“Describing Performance”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_percentiles).

 ### primary key
-A value (typically a number or a string) that uniquely identifies a record. In many applications, primary keys are generated by the system when a record is created (e.g., sequentially or randomly); they are not usually set by users. See also secondary index.
+
+A value (typically a number or a string) that uniquely identifies a record. In many applications, primary keys are generated by the system when a record is created (e.g., sequentially or randomly); they are not usually set by users. See also *secondary index*.

 ### quorum

-The minimum number of nodes that need to vote on an operation before it can be considered successful. See “Quorums for reading and writing” on page 179.
+The minimum number of nodes that need to vote on an operation before it can be considered successful. See [“Quorums for reading and writing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_quorum_condition).

 ### rebalance
-To move data or services from one node to another in order to spread the load fairly. See “Rebalancing Partitions” on page 209.
+
+To move data or services from one node to another in order to spread the load fairly. See [“Sharding of Key-Value Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_key_value).

 ### replication
-Keeping a copy of the same data on sev‐ eral nodes (replicas) so that it remains accessible if a node becomes unreachable. See Chapter 5.
+
+Keeping a copy of the same data on several nodes (*replicas*) so that it remains accessible if a node becomes unreachable. See [Chapter 6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#ch_replication).

 ### schema
-A description of the structure of some data, including its fields and datatypes. Whether some data conforms to a schema can be checked at various points in the data’s lifetime (see “Schema flexibility in the document model” on page 39), and a schema can change over time (see Chap‐ ter 4).
+
+A description of the structure of some data, including its fields and datatypes. Whether some data conforms to a schema can be checked at various points in the data’s lifetime (see [“Schema flexibility in the document model”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_schema_flexibility)), and a schema can change over time (see [Chapter 5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch05.html#ch_encoding)).

 ### secondary index
-An additional data structure that is main‐ tained alongside the primary data storage and which allows you to efficiently search for records that match a certain kind of condition. See “Other Indexing Struc‐ tures” on page 85 and “Partitioning and Secondary Indexes” on page 206.
+
+An additional data structure that is maintained alongside the primary data storage and which allows you to efficiently search for records that match a certain kind of condition. See [“Multi-Column and Secondary Indexes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_index_multicolumn) and [“Sharding and Secondary Indexes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_secondary_indexes).

 ### serializable
-A guarantee that if several transactions execute concurrently, they behave the same as if they had executed one at a time, in some serial order. See “Serializability” on page 251.
+
+An *isolation* guarantee that if several transactions execute concurrently, they behave the same as if they had executed one at a time, in some serial order. See [“Serializability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_serializability).

 ### shared-nothing
-An architecture in which independent nodes—each with their own CPUs, mem‐ ory, and disks—are connected via a con‐ ventional network, in contrast to shared- memory or shared-disk architectures. See the introduction to Part II.
+
+An architecture in which independent nodes—each with their own CPUs, memory, and disks—are connected via a conventional network, in contrast to shared-memory or shared-disk architectures. See [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_shared_nothing).

 ### skew
-1. Imbalanced load across partitions, such that some partitions have lots of requests or data, and others have much less. Also known as hot spots. See “Skewed Work‐ loads and Relieving Hot Spots” on page 205 and “Handling skew” on page 407.
-2. A timing anomaly that causes events to appear in an unexpected, nonsequential order. See the discussions of read skew in “Snapshot Isolation and Repeatable Read” on page 237, write skew in “Write Skew and Phantoms” on page 246, and clock skew in “Timestamps for ordering events” on page 291.
+
+1.  Imbalanced load across shards, such that some shards have lots of requests or data, and others have much less. Also known as *hot spots*. See [“Skewed Workloads and Relieving Hot Spots”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_skew).
+
+2.  A timing anomaly that causes events to appear in an unexpected, nonsequential order. See the discussions of *read skew* in [“Snapshot Isolation and Repeatable Read”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_snapshot_isolation), *write skew* in [“Write Skew and Phantoms”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_write_skew), and *clock skew* in [“Timestamps for ordering events”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_lww).

 ### split brain
-A scenario in which two nodes simultane‐ ously believe themselves to be the leader, and which may cause system guarantees to be violated. See “Handling Node Out‐ ages” on page 156 and “The Truth Is Defined by the Majority” on page 300.
+
+A scenario in which two nodes simultaneously believe themselves to be the leader, and which may cause system guarantees to be violated. See [“Handling Node Outages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_failover) and [“The Majority Rules”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_majority).

 ### stored procedure
-A way of encoding the logic of a transac‐ tion such that it can be entirely executed on a database server, without communi‐ cating back and forth with a client during the transaction. See “Actual Serial Execu‐ tion” on page 252.
+
+A way of encoding the logic of a transaction such that it can be entirely executed on a database server, without communicating back and forth with a client during the transaction. See [“Actual Serial Execution”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_serial).

 ### stream process
-A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See Chapter 11.
+
+A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See [Chapter 12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#ch_stream).

 ### synchronous
-The opposite of asynchronous.
+
+The opposite of *asynchronous*.

 ### system of record
-A system that holds the primary, authori‐ tative version of some data, also known as the source of truth. Changes are first writ‐ ten here, and other datasets may be derived from the system of record. See the introduction to Part III.
+
+A system that holds the primary, authoritative version of some data, also known as the *source of truth*. Changes are first written here, and other datasets may be derived from the system of record. See [“Systems of Record and Derived Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_derived).

 ### timeout
-One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See “Timeouts and Unbounded Delays” on page 281.
+
+One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See [“Timeouts and Unbounded Delays”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_queueing).

 ### total order
-A way of comparing things (e.g., time‐ stamps) that allows you to always say which one of two things is greater and which one is lesser. An ordering in which some things are incomparable (you can‐ not say which is greater or smaller) is called a partial order. See “The causal order is not a total order” on page 341.
+
+A way of comparing things (e.g., timestamps) that allows you to always say which one of two things is greater and which one is lesser. An ordering in which some things are incomparable (you cannot say which is greater or smaller) is called a *partial order*.

 ### transaction
-Grouping together several reads and writes into a logical unit, in order to sim‐ plify error handling and concurrency issues. See Chapter 7.
+
+Grouping together several reads and writes into a logical unit, in order to simplify error handling and concurrency issues. See [Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions).

 ### two-phase commit (2PC)
-An algorithm to ensure that several data‐ base nodes either all commit or all abort a transaction. See “Atomic Commit and Two-Phase Commit (2PC)” on page 354.
+
+An algorithm to ensure that several database nodes either all *atomically* commit or all abort a transaction. See [“Two-Phase Commit (2PC)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pc).

 ### two-phase locking (2PL)
-An algorithm for achieving serializable isolation that works by a transaction acquiring a lock on all data it reads or writes, and holding the lock until the end of the transaction. See “Two-Phase Lock‐ ing (2PL)” on page 257.
+
+An algorithm for achieving *serializable isolation* that works by a transaction acquiring a lock on all data it reads or writes, and holding the lock until the end of the transaction. See [“Two-Phase Locking (2PL)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pl).

 ### unbounded
-Not having any known upper limit or size. The opposite of bounded.
+
+Not having any known upper limit or size. The opposite of *bounded*.


-
-
-
-
-……
--- a/content/en/indexes.md
+++ b/content/en/indexes.md
--- a/content/en/part-iii.md
+++ b/content/en/part-iii.md
@ -61,12 +61,13 @@ This point will be a running theme throughout this part of the book.

 We will start in [Chapter 11](/en/ch11) by examining batch-oriented dataflow systems such as MapReduce, and see how they give us good tools and principles for building large-scale data systems.
 In [Chapter 12](/en/ch12) we will take those ideas and apply them to data streams, which allow us to do the same kinds of things with lower delays. 
-[Chapter 13](/en/ch13) concludes the book by exploring ideas about how we might use these tools to build reliable, scalable, and maintainable applications in the future.
+In [Chapter 13](/en/ch13) we explore ideas about how we might use these tools to build reliable, scalable, and maintainable applications in the future.
+[Chapter 14](/en/ch14) concludes the book with ethics, privacy, and the social impact of data systems.


 ## Index

 - [11. Batch Processing](/en/ch11) (WIP)
 - [12. Stream Processing](/en/ch12) (WIP)
- [13. Doing the Right Thing](/en/ch13) (WIP)
-
+- [13. A Philosophy of Streaming Systems](/en/ch13) (WIP)
+- [14. Doing the Right Thing](/en/ch14) (WIP)
--- a/content/en/toc.md
+++ b/content/en/toc.md
@ -368,22 +368,26 @@ breadcrumbs: false

 ## [11. Batch Processing](/en/ch11)
 - [……](/en/ch11#)
- [Summary](/en/ch11#summary)
+- [Summary](/en/ch11#id292)
    - [References](/en/ch11#references)

 ## [12. Stream Processing](/en/ch12)
 - [……](/en/ch12#)
- [Summary](/en/ch12#summary)
+- [Summary](/en/ch12#id332)
    - [References](/en/ch12#references)

-## [13. Do the Right Thing](/en/ch13)
+## [13. A Philosophy of Streaming Systems](/en/ch13)
 - [……](/en/ch13#)
- [Summary](/en/ch13#summary)
+- [Summary](/en/ch13#id367)
    - [References](/en/ch13#references)

+## [14. Doing the Right Thing](/en/ch14)
+- [……](/en/ch14#)
+- [Summary](/en/ch14#id594)
+    - [References](/en/ch14#references)
+
 ## [Glossary](/en/glossary)

 ## [Colophon](/en/colophon)
 - [About the Author](/en/colophon#about-the-author)
 - [Colophon](/en/colophon#colophon)
-
--- a/hugo.yaml
+++ b/hugo.yaml
@ -127,22 +127,26 @@ menu:
      name: "PostgreSQL 14 内参 ↗"
      url: "https://postgres-internals.cn/"
      weight: 9
-    - identifier: pigsty
-      name: "Pigsty Free PG RDS ↗"
-      url: "https://pgsty.com/"
+    - identifier: pigsty-cc
+      name: "Pigsty：开源 PG RDS ↗"
+      url: "https://pigsty.cc/"
      weight: 10
+    - identifier: pigsty-io
+      name: "Pigsty: Free PG RDS ↗"
+      url: "https://pigsty.io/"
+      weight: 11
    - identifier: pgext
      name: "PG 扩展目录 ↗"
      url: "https://ext.pgsty.com/zh"
-      weight: 11
+      weight: 12
    - identifier: ddia1
      name: "DDIA O'reilly ↗"
      url: "https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/"
-      weight: 12
+      weight: 13
    - identifier: ddia2
      name: "DDIA 2nd O'reilly ↗"
      url: "https://www.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/"
-      weight: 13
+      weight: 14


 params:
--- a/static/fig/ddia_1101.png
+++ b/static/fig/ddia_1101.png
--- a/static/fig/ddia_1102.png
+++ b/static/fig/ddia_1102.png
--- a/static/fig/ddia_1103.png
+++ b/static/fig/ddia_1103.png
--- a/static/fig/ddia_1201.png
+++ b/static/fig/ddia_1201.png
--- a/static/fig/ddia_1202.png
+++ b/static/fig/ddia_1202.png
--- a/static/fig/ddia_1203.png
+++ b/static/fig/ddia_1203.png
--- a/static/fig/ddia_1204.png
+++ b/static/fig/ddia_1204.png
--- a/static/fig/ddia_1205.png
+++ b/static/fig/ddia_1205.png
--- a/static/fig/ddia_1206.png
+++ b/static/fig/ddia_1206.png
--- a/static/fig/ddia_1207.png
+++ b/static/fig/ddia_1207.png
--- a/static/fig/ddia_1208.png
+++ b/static/fig/ddia_1208.png
--- a/static/fig/ddia_1301.png
+++ b/static/fig/ddia_1301.png
--- a/static/fig/ddia_1302.png
+++ b/static/fig/ddia_1302.png