diff --git a/.github/workflows/pages.yaml b/.github/workflows/pages.yaml
index 5b9f907..25ac531 100644
--- a/.github/workflows/pages.yaml
+++ b/.github/workflows/pages.yaml
@@ -31,7 +31,7 @@ jobs:
build:
runs-on: ubuntu-latest
env:
- HUGO_VERSION: 0.147.7
+ HUGO_VERSION: 0.155.3
steps:
- name: Checkout
uses: actions/checkout@v4
@@ -41,7 +41,7 @@ jobs:
- name: Setup Go
uses: actions/setup-go@v5
with:
- go-version: '1.24'
+ go-version: '1.26'
- name: Setup Pages
id: pages
uses: actions/configure-pages@v4
diff --git a/README.md b/README.md
index 0ce7d3c..2d49af8 100644
--- a/README.md
+++ b/README.md
@@ -67,9 +67,10 @@
- [9. 分布式系统的麻烦](https://ddia.vonng.com/ch9)
- [10.一致性与共识](https://ddia.vonng.com/ch10)
* [第三部分:派生数据](https://ddia.vonng.com/part-iii)
- - [11. 批处理](https://ddia.vonng.com/ch11) (尚未发布)
- - [12. 流处理](https://ddia.vonng.com/ch12) (尚未发布)
- - [13. 做正确的事](https://ddia.vonng.com/ch13)(尚未发布)
+ - [11. 批处理](https://ddia.vonng.com/ch11)
+ - [12. 流处理](https://ddia.vonng.com/ch12)
+ - [13. 流处理系统哲学](https://ddia.vonng.com/ch13)
+ - [14. 做正确的事](https://ddia.vonng.com/ch14)
* [术语表](https://ddia.vonng.com/glossary)
* [后记](https://ddia.vonng.com/colophon)
diff --git a/content/en/_index.md b/content/en/_index.md
index c4aced7..7c50f74 100644
--- a/content/en/_index.md
+++ b/content/en/_index.md
@@ -49,9 +49,10 @@ breadcrumbs: false
- [10. Consistency and Consensus](/en/ch10)
### [Part III: Derived Data](/en/part-iii)
- - [11. Batch Processing](/en/ch11) (WIP)
- - [12. Stream Processing](/en/ch12) (WIP)
- - [13. Doing the Right Thing](/en/ch13) (WIP)
+ - [11. Batch Processing](/en/ch11)
+ - [12. Stream Processing](/en/ch12)
+ - [13. A Philosophy of Streaming Systems](/en/ch13)
+ - [14. Doing the Right Thing](/en/ch14)
### [Glossary](/en/glossary)
diff --git a/content/en/ch1.md b/content/en/ch1.md
index 6d33fcf..e18c6f6 100644
--- a/content/en/ch1.md
+++ b/content/en/ch1.md
@@ -4,6 +4,8 @@ weight: 101
breadcrumbs: false
---
+
+
> *There are no solutions, there are only trade-offs. […] But you try to get the best
> trade-off you can get, and that’s all you can hope for.*
>
@@ -156,7 +158,7 @@ the term *transaction* nevertheless stuck, referring to a group of reads and wri
logical unit.
> [!NOTE]
-> [Chapter 8](/en/ch8#ch_transactions) explores in detail what we mean with a transaction. This chapter uses the term
+> [Chapter 8](/en/ch8#ch_transactions) explores in detail what we mean with a transaction. This chapter uses the term
> loosely to refer to low-latency reads and writes.
Even though databases started being used for many different kinds of data—posts on social media,
@@ -179,7 +181,7 @@ answer analytic queries such as:
The reports that result from these types of queries are important for business intelligence, helping
the management decide what to do next. In order to differentiate this pattern of using databases
from transaction processing, it has been called *online analytic processing* (OLAP) [^5].
-The difference between OLTP and analytics is not always clear-cut, but some typical characteristics are listed in [Table 1-1](/en/ch1#tab_oltp_vs_olap).
+The difference between OLTP and analytics is not always clear-cut, but some typical characteristics are listed in [Table 1-1](/en/ch1#tab_oltp_vs_olap).
{{< figure id="tab_oltp_vs_olap" title="Table 1-1. Comparing characteristics of operational and analytic systems" class="w-full my-4" >}}
@@ -241,14 +243,14 @@ systems, for several reasons:
A *data warehouse*, by contrast, is a separate database that analysts can query to their hearts’
content, without affecting OLTP operations [^7].
-As we shall see in [Chapter 4](/en/ch4#ch_storage), data warehouses often store data in a way that is very different
+As we shall see in [Chapter 4](/en/ch4#ch_storage), data warehouses often store data in a way that is very different
from OLTP databases, in order to optimize for the types of queries that are common in analytics.
The data warehouse contains a read-only copy of the data in all the various OLTP systems in the
company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous
stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into
the data warehouse. This process of getting data into the data warehouse is known as
-*Extract–Transform–Load* (ETL) and is illustrated in [Figure 1-1](/en/ch1#fig_dwh_etl). Sometimes the order of the
+*Extract–Transform–Load* (ETL) and is illustrated in [Figure 1-1](/en/ch1#fig_dwh_etl). Sometimes the order of the
*transform* and *load* steps is swapped (i.e., the transformation is done in the data warehouse,
after loading), resulting in *ELT*.
@@ -287,7 +289,7 @@ scale, the more specialized systems tend to become [^11].
#### From data warehouse to data lake {#from-data-warehouse-to-data-lake}
A data warehouse often uses a *relational* data model that is queried through SQL (see
-[Chapter 3](/en/ch3#ch_datamodels)), perhaps using specialized business intelligence software. This model works well
+[Chapter 3](/en/ch3#ch_datamodels)), perhaps using specialized business intelligence software. This model works well
for the types of queries that business analysts need to make, but it is less well suited to the
needs of data scientists, who might need to perform tasks such as:
@@ -313,7 +315,7 @@ data scientists. The answer is a *data lake*: a centralized data repository that
data that might be useful for analysis, obtained from operational systems via ETL processes. The
difference from a data warehouse is that a data lake simply contains files, without imposing any
particular file format or data model. Files in a data lake might be collections of database records,
-encoded using a file format such as Avro or Parquet (see [Chapter 5](/en/ch5#ch_encoding)), but they can equally well
+encoded using a file format such as Avro or Parquet (see [Chapter 5](/en/ch5#ch_encoding)), but they can equally well
contain text, images, videos, sensor readings, sparse matrices, feature vectors, genome sequences,
or any other kind of data [^15].
Besides being more flexible, this is also often cheaper than relational data storage, since the data
@@ -340,10 +342,10 @@ As analytics practices have matured, organizations have been increasingly paying
management and operations of analytics systems and data pipelines, as captured for example in the
DataOps manifesto [^18].
Part of this are issues of governance, privacy, and compliance with regulation such as GDPR and
-CCPA, which we discuss in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance) and [Link to Come].
+CCPA, which we discuss in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance) and [“Legislation and Self-Regulation”](/en/ch14#sec_future_legislation).
Moreover, analytical data is increasingly made available not only as files and relational tables,
-but also as streams of events (see [Link to Come]). With file-based data analysis you can re-run the
+but also as streams of events (see [Chapter 12](/en/ch12#ch_stream)). With file-based data analysis you can re-run the
analysis periodically (e.g., daily) in order to respond to changes in the data, but stream processing
allows analytics systems to respond to events much faster, on the order of seconds. Depending on the
application and how time-sensitive it is, a stream processing approach can be valuable, for example
@@ -398,7 +400,7 @@ When the data in one system is derived from the data in another, you need a proc
derived data when the original in the system of record changes. Unfortunately, many databases are
designed based on the assumption that your application only ever needs to use that one database, and
they do not make it easy to integrate multiple systems in order to propagate such updates. In
-[Link to Come] we will discuss approaches to *data integration*, which allow us to compose multiple
+[“Data Integration”](/en/ch13#sec_future_integration) we will discuss approaches to *data integration*, which allow us to compose multiple
data systems to achieve things that one system alone cannot do.
That brings us to the end of our comparison of analytics and transaction processing. In the next
@@ -420,7 +422,7 @@ energy company, and leaving aside emergency backup power), since it is cheaper t
With software, two important decisions to be made are who builds the software and who deploys it.
There is a spectrum of possibilities that outsource each decision to various degrees, as illustrated
-in [Figure 1-2](/en/ch1#fig_cloud_spectrum). At one extreme is bespoke software that you write and run in-house; at
+in [Figure 1-2](/en/ch1#fig_cloud_spectrum). At one extreme is bespoke software that you write and run in-house; at
the other extreme are widely-used cloud services or Software as a Service (SaaS) products that are
implemented and operated by an external vendor, and which you only access through a web interface or API.
@@ -519,9 +521,9 @@ and indeed such managed services are now available for many popular data systems
that have been designed from the ground up to be cloud-native have been shown to have several
advantages: better performance on the same hardware, faster recovery from failures, being able to
quickly scale computing resources to match the load, and supporting larger datasets [^25] [^26] [^27].
-[Table 1-2](/en/ch1#tab_cloud_native_dbs) lists some examples of both types of systems.
+[Table 1-2](/en/ch1#tab_cloud_native_dbs) lists some examples of both types of systems.
-{{< figure id="#tab_cloud_native_dbs" title="Table 1-2. Examples of self-hosted and cloud-native database systems" class="w-full my-4" >}}
+{{< figure id="tab_cloud_native_dbs" title="Table 1-2. Examples of self-hosted and cloud-native database systems" class="w-full my-4" >}}
| Category | Self-hosted systems | Cloud-native systems |
|------------------|-----------------------------|-----------------------------------------------------------------------|
@@ -580,7 +582,7 @@ As an alternative to local disks, cloud services also offer virtual disk storage
detached from one instance and attached to a different one (Amazon EBS, Azure managed disks, and
persistent disks in Google Cloud). Such a virtual disk is not actually a physical disk, but rather a
cloud service provided by a separate set of machines, which emulates the behavior of a disk (a
-*block device*, where each block is typically 4 KiB in size). This technology makes it
+*block device*, where each block is typically 4 KiB in size). This technology makes it
possible to run traditional disk-based software in the cloud, but the block device emulation
introduces overheads that can be avoided in systems that are designed from the ground up for the cloud [^25]. It also makes the application
very sensitive to network glitches, since every I/O on the virtual block device is actually a network call [^28].
@@ -591,7 +593,7 @@ services such as S3 are designed for long-term storage of fairly large files, ra
of kilobytes to several gigabytes in size. The individual rows or values stored in a database are
typically much smaller than this; cloud databases therefore typically manage smaller values in a
separate service, and store larger data blocks (containing many individual values) in an object
-store [^26] [^29]. We will see ways of doing this in [Chapter 4](/en/ch4#ch_storage).
+store [^26] [^29]. We will see ways of doing this in [Chapter 4](/en/ch4#ch_storage).
In a traditional systems architecture, the same computer is responsible for both storage (disk) and
computation (CPU and RAM), but in cloud-native systems, these two responsibilities have become
@@ -691,7 +693,7 @@ Fault tolerance/high availability
: If your application needs to continue working even if one machine (or several machines, or
the network, or an entire datacenter) goes down, you can use multiple machines to give you
redundancy. When one fails, another one can take over. See [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability) and
- [Chapter 6](/en/ch6#ch_replication) on replication.
+ [Chapter 6](/en/ch6#ch_replication) on replication.
Scalability
: If your data volume or computing requirements grow bigger than a single machine can handle,
@@ -739,7 +741,7 @@ Distributed systems also have downsides. Every request and API call that goes vi
to deal with the possibility of failure: the network may be interrupted, or the service may be
overloaded or crashed, and therefore any request may time out without receiving a response. In this
case, we don’t know whether the service received the request, and simply retrying it might not be
-safe. We will discuss these problems in detail in [Chapter 9](/en/ch9#ch_distributed).
+safe. We will discuss these problems in detail in [Chapter 9](/en/ch9#ch_distributed).
Although datacenter networks are fast, making a call to another service is still vastly slower than
calling a function in the same process [^44].
@@ -760,9 +762,9 @@ as OpenTelemetry, Zipkin, and Jaeger allow you to track which client called whic
operation, and how long each call took [^49].
Databases provide various mechanisms for ensuring data consistency, as we shall see in
-[Chapter 6](/en/ch6#ch_replication) and [Chapter 8](/en/ch8#ch_transactions). However, when each service has its own database,
+[Chapter 6](/en/ch6#ch_replication) and [Chapter 8](/en/ch8#ch_transactions). However, when each service has its own database,
maintaining consistency of data across those different services becomes the application’s problem.
-Distributed transactions, which we explore in [Chapter 8](/en/ch8#ch_transactions), are a possible technique for
+Distributed transactions, which we explore in [Chapter 8](/en/ch8#ch_transactions), are a possible technique for
ensuring consistency, but they are rarely used in a microservices context because they run counter
to the goal of making services independent from each other, and many databases don’t support them [^50].
@@ -770,7 +772,7 @@ For all these reasons, if you can do something on a single machine, this is ofte
cheaper compared to setting up a distributed system [^23] [^46] [^51].
CPUs, memory, and disks have grown larger, faster, and more reliable. When combined with single-node
databases such as DuckDB, SQLite, and KùzuDB, many workloads can now run on a single node. We will
-explore more on this topic in [Chapter 4](/en/ch4#ch_storage).
+explore more on this topic in [Chapter 4](/en/ch4#ch_storage).
### Microservices and Serverless {#sec_introduction_microservices}
@@ -807,7 +809,7 @@ certain fields. Developers might wish to add or remove fields to an API as busin
but doing so can cause clients to fail. Worse still, such failures are often not discovered until
late in the development cycle when the updated service API is deployed to a staging or production
environment. API description standards such as OpenAPI and gRPC help manage the relationship between
-client and server APIs; we discuss these further in [Chapter 5](/en/ch5#ch_encoding).
+client and server APIs; we discuss these further in [Chapter 5](/en/ch5#ch_encoding).
Microservices are primarily a technical solution to a people problem: allowing different teams to
make progress independently without having to coordinate with each other. This is valuable in a large
@@ -937,7 +939,7 @@ Service Organization Control (SOC) Type 2 standards. As with PCI compliance, ven
party audits to verify adherence.
Generally, it is important to balance the needs of your business against the needs of the people
-whose data you are collecting and processing. There is much more to this topic; in [Link to Come] we
+whose data you are collecting and processing. There is much more to this topic; in [Chapter 14](/en/ch14#ch_right_thing) we
will go deeper into the topics of ethics and legal compliance, including the problems of bias and
discrimination.
@@ -952,7 +954,7 @@ We started by making a distinction between operational (transaction-processing,
(OLAP) systems, and saw their different characteristics: not only managing different types of data
with different access patterns, but also serving different audiences. We encountered the concept of
a data warehouse and data lake, which receive data feeds from operational systems via ETL. In
-[Chapter 4](/en/ch4#ch_storage) we will see that operational and analytical systems often use very different internal
+[Chapter 4](/en/ch4#ch_storage) we will see that operational and analytical systems often use very different internal
data layouts because of the different types of queries they need to serve.
We then compared cloud services, a comparatively recent development, to the traditional paradigm of
@@ -964,7 +966,7 @@ example in the way they separate storage and compute.
Cloud systems are intrinsically distributed, and we briefly examined some of the trade-offs of
distributed systems compared to using a single machine. There are situations in which you can’t
avoid going distributed, but it’s advisable not to rush into making a system distributed if it’s
-possible to keep it on a single machine. In [Chapter 9](/en/ch9#ch_distributed) we will cover the challenges with
+possible to keep it on a single machine. In [Chapter 9](/en/ch9#ch_distributed) we will cover the challenges with
distributed systems in more detail.
Finally, we saw that data systems architecture is determined not only by the needs of the business
@@ -1038,4 +1040,3 @@ this question in mind as we move through the rest of this book.
[^61]: Supreeth Shastri, Vinay Banakar, Melissa Wasserman, Arun Kumar, and Vijay Chidambaram. [Understanding and Benchmarking the Impact of GDPR on Database Systems](https://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 7, pages 1064–1077, March 2020. [doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354)
[^62]: Martin Fowler. [Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html). *martinfowler.com*, December 2013. Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6)
[^63]: [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016.
-
diff --git a/content/en/ch10.md b/content/en/ch10.md
index 424eff2..d8721d1 100644
--- a/content/en/ch10.md
+++ b/content/en/ch10.md
@@ -4,18 +4,20 @@ weight: 210
breadcrumbs: false
---
+
+

> *An ancient adage warns, “Never go to sea with two chronometers; take one or three.”*
>
> Frederick P. Brooks Jr., *The Mythical Man-Month: Essays on Software Engineering* (1995)
-Lots of things can go wrong in distributed systems, as discussed in [Chapter 9](/en/ch9#ch_distributed). If we want a
+Lots of things can go wrong in distributed systems, as discussed in [Chapter 9](/en/ch9#ch_distributed). If we want a
service to continue working correctly despite those things going wrong, we need to find ways of
tolerating faults.
One of the best tools we have for fault tolerance is *replication*. However, as we saw in
-[Chapter 6](/en/ch6#ch_replication), having multiple copies of the data on multiple replicas opens up the risk of
+[Chapter 6](/en/ch6#ch_replication), having multiple copies of the data on multiple replicas opens up the risk of
inconsistencies. Reads might be handled by a replica that is not up-to-date, yielding stale results.
If multiple replicas can accept writes, we have to deal with conflicts between values that were
concurrently written on different replicas. At a high level, there are two competing philosophies
@@ -87,7 +89,7 @@ guarantee*. To clarify this idea, let’s look at an example of a system that is
{{< figure src="/fig/ddia_1001.png" id="fig_consistency_linearizability_0" caption="Figure 10-1. If this database were linearizable, then either Alice's read would return 1 instead of 0, or Bob's read would return 0 instead of 1." class="w-full my-4" >}}
-[Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4].
+[Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4].
Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a
game their favorite team is playing. Just after the final score is announced, Aaliyah refreshes the
page, sees the winner announced, and excitedly tells Bryce about it. Bryce incredulously hits
@@ -104,7 +106,7 @@ violation of linearizability.
### What Makes a System Linearizable? {#sec_consistency_lin_definition}
In order to understand linearizability better, let’s look at some more examples.
-[Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows three clients concurrently reading and writing the same
+[Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows three clients concurrently reading and writing the same
object *x* in a linearizable database. In distributed systems theory, *x* is called a *register*—in
practice, it could be one key in a key-value store, one row in a relational database, or one
document in a document database, for example.
@@ -112,7 +114,7 @@ document in a document database, for example.
{{< figure src="/fig/ddia_1002.png" id="fig_consistency_linearizability_1" caption="Figure 10-2. Alice observes that x = 0 and y = 1, while Bob observes that x = 1 and y = 0. It's as if Alice's and Bob's computers disagree on the order in which the writes happened." class="w-full my-4" >}}
-For simplicity, [Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows only the requests from the clients’
+For simplicity, [Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows only the requests from the clients’
point of view, not the internals of the database. Each bar is a request made by a client, where the
start of a bar is the time when the request was sent, and the end of a bar is when the response was
received by the client. Due to variable network delays, a client doesn’t know exactly when the
@@ -121,12 +123,12 @@ client sending the request and receiving the response.
In this example, the register has two types of operations:
-* *read*(*x*) ⇒ *v* means the client requested to read the value of register
+* *read*(*x*) ⇒ *v* means the client requested to read the value of register
*x*, and the database returned the value *v*.
-* *write*(*x*, *v*) ⇒ *r* means the client requested to set the
+* *write*(*x*, *v*) ⇒ *r* means the client requested to set the
register *x* to value *v*, and the database returned response *r* (which could be *ok* or *error*).
-In [Figure 10-2](/en/ch10#fig_consistency_linearizability_1), the value of *x* is initially 0, and client C performs a
+In [Figure 10-2](/en/ch10#fig_consistency_linearizability_1), the value of *x* is initially 0, and client C performs a
write request to set it to 1. While this is happening, clients A and B are repeatedly polling the
database to read the latest value. What are the possible responses that A and B might get for their
read requests?
@@ -146,7 +148,7 @@ and forth between the old and the new value several times while a write is going
what we expect of a system that emulates a “single copy of the data.”
To make the system linearizable, we need to add another constraint, illustrated in
-[Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
+[Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
{{< figure src="/fig/ddia_1003.png" id="fig_consistency_linearizability_2" caption="Figure 10-3. If Alice and Bob had perfect clocks, linearizability would require that x = 1 is returned, since the read of x begins after the write x = 1 completes." class="w-full my-4" >}}
@@ -156,25 +158,25 @@ of the write operation) at which the value of *x* atomically flips from 0 to 1.
client’s read returns the new value 1, all subsequent reads must also return the new value, even if
the write operation has not yet completed.
-This timing dependency is illustrated with an arrow in [Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
+This timing dependency is illustrated with an arrow in [Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
Client A is the first to read the new value, 1. Just after A’s read returns, B begins a new read.
Since B’s read occurs strictly after A’s read, it must also return 1, even though the write by C is
still ongoing. (It’s the same situation as with Aaliyah and Bryce in
-[Figure 10-1](/en/ch10#fig_consistency_linearizability_0): after Aaliyah has read the new value, Bryce also expects to
+[Figure 10-1](/en/ch10#fig_consistency_linearizability_0): after Aaliyah has read the new value, Bryce also expects to
read the new value.)
We can further refine this timing diagram to visualize each operation taking effect atomically at
some point in time [^5],
-like in the more complex example shown in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3). In this example we
+like in the more complex example shown in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3). In this example we
add a third type of operation besides *read* and *write*:
-* *cas*(*x*, *v*old, *v*new) ⇒ *r* means the client
+* *cas*(*x*, *v*old, *v*new) ⇒ *r* means the client
requested an atomic *compare-and-set* operation (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)). If the
current value of the register *x* equals *v*old, it should be atomically set to *v*new. If
the value of *x* is different from *v*old, then the operation should leave the register
unchanged and return an error. *r* is the database’s response (*ok* or *error*).
-Each operation in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3) is marked with a vertical line (inside the
+Each operation in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3) is marked with a vertical line (inside the
bar for each operation) at the time when we think the operation was executed. Those markers are
joined up in a sequential order, and the result must be a valid sequence of reads and writes for a
register (every read must return the value set by the most recent write).
@@ -187,7 +189,7 @@ that was written, until it is overwritten again.
{{< figure src="/fig/ddia_1004.png" id="fig_consistency_linearizability_3" caption="Figure 10-4. The read of x is concurrent with the write x = 1. Since we don't know the exact timing of the operations, the read is allowed to return either 0 or 1." class="w-full my-4" >}}
-There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3):
+There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3):
* First client B sent a request to read *x*, then client D sent a request to set *x* to 0, and then
client A sent a request to set *x* to 1. Nevertheless, the value returned to B’s read is 1 (the
@@ -207,7 +209,7 @@ There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_
C’s *cas* write, which updates *x* from 2 to 4. In the absence of other requests, it would be okay for
B’s read to return 2. However, client A has already read the new value 4 before B’s read started,
so B is not allowed to read an older value than A. Again, it’s the same situation as with Aaliyah
- and Bryce in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0).
+ and Bryce in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0).
That is the intuition behind linearizability; the formal definition [^1] describes it more precisely. It is
possible (though computationally expensive) to test whether a system’s behavior is linearizable by
@@ -225,6 +227,8 @@ which is the strongest consistency model in common use.
--------
+
+
> [!TIP] LINEARIZABILITY VERSUS SERIALIZABILITY
Linearizability is easily confused with serializability (see [“Serializability”](/en/ch8#sec_transactions_serializability)),
@@ -325,7 +329,7 @@ nodes agree on.
In real applications, it is sometimes acceptable to treat such constraints loosely (for example, if
a flight is overbooked, you can move customers to a different flight and offer them compensation for
the inconvenience). In such cases, linearizability may not be needed, and we will discuss such
-loosely interpreted constraints in [Link to Come].
+loosely interpreted constraints in [“Timeliness and Integrity”](/en/ch13#sec_future_integrity).
However, a hard uniqueness constraint, such as the one you typically find in relational databases,
requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints,
@@ -333,7 +337,7 @@ can be implemented without linearizability [^20].
#### Cross-channel timing dependencies {#cross-channel-timing-dependencies}
-Notice a detail in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0): if Aaliyah hadn’t exclaimed the score,
+Notice a detail in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0): if Aaliyah hadn’t exclaimed the score,
Bryce wouldn’t have known that the result of his query was stale. He would have just refreshed the
page again a few seconds later, and eventually seen the final score. The linearizability violation
was only noticed because there was an additional communication channel in the system (Aaliyah’s
@@ -342,10 +346,10 @@ voice to Bryce’s ears).
Similar situations can arise in computer systems. For example, say you have a website where users
can upload a video, and a background process transcodes the video to a lower quality that can be
streamed on slow internet connections. The architecture and dataflow of this system is illustrated
-in [Figure 10-5](/en/ch10#fig_consistency_transcoder).
+in [Figure 10-5](/en/ch10#fig_consistency_transcoder).
The video transcoder needs to be explicitly instructed to perform a transcoding job, and this
-instruction is sent from the web server to the transcoder via a message queue (see [Link to Come]).
+instruction is sent from the web server to the transcoder via a message queue (see [“Messaging Systems”](/en/ch12#sec_stream_messaging)).
The web server doesn’t place the entire video on the queue, since most message brokers are designed
for small messages, and a video may be many megabytes in size. Instead, the video is first written
to a file storage service, and once the write is complete, the instruction to the transcoder is
@@ -356,7 +360,7 @@ placed on the queue.
If the file storage service is linearizable, then this system should work fine. If it is not
linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in
-[Figure 10-5](/en/ch10#fig_consistency_transcoder)) might be faster than the internal replication inside the storage
+[Figure 10-5](/en/ch10#fig_consistency_transcoder)) might be faster than the internal replication inside the storage
service. In this case, when the transcoder fetches the original video (step 5), it might see an old
version of the file, or nothing at all. If it processes an old version of the video, the original
and transcoded videos in the file storage become permanently inconsistent with each other.
@@ -364,7 +368,7 @@ and transcoded videos in the file storage become permanently inconsistent with e
This problem arises because there are two different communication channels between the web server
and the transcoder: the file storage and the message queue. Without the recency guarantee of
linearizability, race conditions between these two channels are possible. This situation is
-analogous to [Figure 10-1](/en/ch10#fig_consistency_linearizability_0), where there was also a race condition between
+analogous to [Figure 10-1](/en/ch10#fig_consistency_linearizability_0), where there was also a race condition between
two communication channels: the database replication and the real-life audio channel between
Aaliyah’s mouth and Bryce’s ears.
@@ -389,7 +393,7 @@ and all operations on it are atomic,” the simplest answer would be to really o
of the data. However, that approach would not be able to tolerate faults: if the node holding that
one copy failed, the data would be lost, or at least inaccessible until the node was brought up again.
-Let’s revisit the replication methods from [Chapter 6](/en/ch6#ch_replication), and compare whether they can be made linearizable:
+Let’s revisit the replication methods from [Chapter 6](/en/ch6#ch_replication), and compare whether they can be made linearizable:
Single-leader replication (potentially linearizable)
: In a system with single-leader replication, the leader has the primary copy of the data that is
@@ -423,7 +427,7 @@ Multi-leader replication (not linearizable)
Leaderless replication (probably not linearizable)
: For systems with leaderless replication (Dynamo-style; see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)), people
sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes
- (*w* + *r* > *n*). Depending on the exact algorithm, and depending on how you define
+ (*w* + *r* > *n*). Depending on the exact algorithm, and depending on how you define
strong consistency, this is not quite true.
“Last write wins” conflict resolution methods based on time-of-day clocks (e.g., in Cassandra and
@@ -435,21 +439,21 @@ Leaderless replication (probably not linearizable)
Intuitively, it seems as though quorum reads and writes should be linearizable in a
Dynamo-style model. However, when we have variable network delays, it is possible to have race
-conditions, as demonstrated in [Figure 10-6](/en/ch10#fig_consistency_leaderless).
+conditions, as demonstrated in [Figure 10-6](/en/ch10#fig_consistency_leaderless).
{{< figure src="/fig/ddia_1006.png" id="fig_consistency_leaderless" caption="Figure 10-6. Quorums are not sufficient to ensure linearizability if network delays are variable." class="w-full my-4" >}}
-In [Figure 10-6](/en/ch10#fig_consistency_leaderless), the initial value of *x* is 0, and a writer client is updating
-*x* to 1 by sending the write to all three replicas (*n* = 3, *w* = 3).
-Concurrently, client A reads from a quorum of two nodes (*r* = 2) and sees the new value 1
+In [Figure 10-6](/en/ch10#fig_consistency_leaderless), the initial value of *x* is 0, and a writer client is updating
+*x* to 1 by sending the write to all three replicas (*n* = 3, *w* = 3).
+Concurrently, client A reads from a quorum of two nodes (*r* = 2) and sees the new value 1
on one of the nodes. Also concurrently with the write, client B reads from a different quorum of two
nodes, and gets back the old value 0 from both.
-The quorum condition is met (*w* + *r* > *n*), but this execution is nevertheless not
+The quorum condition is met (*w* + *r* > *n*), but this execution is nevertheless not
linearizable: B’s request begins after A’s request completes, but B returns the old value while A
returns the new value. (It’s once again the Aaliyah and Bryce situation from
-[Figure 10-1](/en/ch10#fig_consistency_linearizability_0).)
+[Figure 10-1](/en/ch10#fig_consistency_linearizability_0).)
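
The execution described above can be replayed as a small simulation. This is only a sketch: which replica receives the write first and which replicas each reader contacts are assumptions chosen to reproduce the anomaly. The `(version, value)` representation and the `quorum_read` helper are hypothetical and are not taken from any particular database.

```python
# Replay of the race in Figure 10-6 (n = 3, w = 3, r = 2).
# Assumption: each replica stores a (version, value) pair, and a reader
# returns the value with the highest version among its responses.

replicas = {"r1": (0, 0), "r2": (0, 0), "r3": (0, 0)}  # x starts at 0


def quorum_read(names):
    """Read from the given replicas and return the newest value seen."""
    responses = [replicas[name] for name in names]
    return max(responses)[1]


n, w, r = 3, 3, 2
assert w + r > n  # the quorum condition holds

# The write x = 1 (version 1) is sent to all replicas, but because of
# variable network delays it has so far only arrived at r3.
replicas["r3"] = (1, 1)

# Client A reads from r2 and r3 and sees the new value on one of them.
a_result = quorum_read(["r2", "r3"])

# Client B starts after A has finished and reads from r1 and r2, both of
# which still hold the old value.
b_result = quorum_read(["r1", "r2"])

# Only now does the write reach the remaining replicas.
replicas["r1"] = (1, 1)
replicas["r2"] = (1, 1)

print(a_result, b_result)  # A returned the new value, then B the old one
```

Every read contacted *r* = 2 replicas and the write is destined for all *w* = 3, yet B returns an older value than a read that completed before it started.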
It is possible to make Dynamo-style quorums linearizable at the cost of reduced
performance: a reader must perform read repair (see [“Catching up on missed writes”](/en/ch6#sec_replication_read_repair)) synchronously,
@@ -471,10 +475,10 @@ provide linearizability, even with quorum reads and writes.
As some replication methods can provide linearizability and others cannot, it is interesting to
explore the pros and cons of linearizability in more depth.
-We already discussed some use cases for different replication methods in [Chapter 6](/en/ch6#ch_replication); for
+We already discussed some use cases for different replication methods in [Chapter 6](/en/ch6#ch_replication); for
example, we saw that multi-leader replication is often a good choice for multi-region
replication (see [“Geographically Distributed Operation”](/en/ch6#sec_replication_multi_dc)). An example of such a deployment is illustrated in
-[Figure 10-7](/en/ch10#fig_consistency_cap_availability).
+[Figure 10-7](/en/ch10#fig_consistency_cap_availability).
{{< figure src="/fig/ddia_1007.png" id="fig_consistency_cap_availability" caption="Figure 10-7. If clients cannot contact enough replicas due to a network partition, they cannot process writes." class="w-full my-4" >}}
@@ -600,7 +604,7 @@ proportional to the uncertainty of delays in the network. In a network with high
like most computer networks (see [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing)), the response time of linearizable
reads and writes is inevitably going to be high. A faster algorithm for linearizability does not
exist, but weaker consistency models can be much faster, so this trade-off is important for
-latency-sensitive systems. In [Link to Come] we will discuss some approaches for avoiding
+latency-sensitive systems. In [“Timeliness and Integrity”](/en/ch13#sec_future_integrity) we will discuss some approaches for avoiding
linearizability without sacrificing correctness.
@@ -613,7 +617,7 @@ stored in only 64 bits (or even 32 bits if you are sure that you will never have
records, but that is risky).
Another advantage of such auto-incrementing IDs is that the order of the IDs tells you the order in
-which the records were created. For example, [Figure 10-8](/en/ch10#fig_consistency_id_generator) shows a chat
+which the records were created. For example, [Figure 10-8](/en/ch10#fig_consistency_id_generator) shows a chat
application that assigns auto-incrementing IDs to chat messages as they are posted. You can then
display the messages in order of increasing ID, and the resulting chat threads will make sense:
Aaliyah posts a question that is assigned ID 1, and Bryce’s answer to the question is assigned a
@@ -626,7 +630,7 @@ This single-node ID generator is another example of a linearizable system. Each
ID is an operation that atomically increments a counter and returns the old counter value (a
*fetch-and-add* operation); linearizability ensures that if the posting of Aaliyah’s message
completes before Bryce’s posting begins, then Bryce’s ID must be greater than Aaliyah’s. The
-messages by Aaliyah and Caleb in [Figure 10-8](/en/ch10#fig_consistency_id_generator) are concurrent, so linearizability
+messages by Aaliyah and Caleb in [Figure 10-8](/en/ch10#fig_consistency_id_generator) are concurrent, so linearizability
doesn’t specify how their IDs must be ordered, as long as they are unique.
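
As a rough illustration of fetch-and-add, the following sketch uses a lock to make the read and the increment a single atomic step. The lock is a stand-in assumed here for a hardware atomic increment, and the `IdGenerator` class and `post_messages` function are hypothetical names.

```python
import threading


class IdGenerator:
    """Single-node ID generator built on a fetch-and-add operation."""

    def __init__(self):
        self._counter = 0
        self._lock = threading.Lock()

    def fetch_and_add(self):
        # Holding the lock makes "read the counter, then increment it"
        # appear as one atomic operation to all threads.
        with self._lock:
            old = self._counter
            self._counter += 1
            return old


gen = IdGenerator()
ids = []
ids_lock = threading.Lock()


def post_messages(count):
    """Simulate one client posting `count` messages concurrently with others."""
    for _ in range(count):
        new_id = gen.fetch_and_add()
        with ids_lock:
            ids.append(new_id)


threads = [threading.Thread(target=post_messages, args=(1000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Two requests that do not overlap in time: the later one gets the larger ID.
aaliyah_id = gen.fetch_and_add()
bryce_id = gen.fetch_and_add()
```

The concurrent calls may receive their IDs in any interleaving, but no ID is handed out twice, and a call that starts after another has completed always receives a greater ID.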
An in-memory single-node ID generator is easy to implement: you can use the atomic increment
@@ -720,9 +724,9 @@ causality, and which you can use as a distributed ID generator. It is called a *
proposed in 1978 by Leslie Lamport [^54],
in what is now one of the most-cited papers in the field of distributed systems.
-[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) shows how a Lamport clock would work in the chat example of
-[Figure 10-8](/en/ch10#fig_consistency_id_generator). Each node has a unique identifier, which in
-[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) is the name “Aaliyah”, “Bryce”, or “Caleb”, but which in practice
+[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) shows how a Lamport clock would work in the chat example of
+[Figure 10-8](/en/ch10#fig_consistency_id_generator). Each node has a unique identifier, which in
+[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) is the name “Aaliyah”, “Bryce”, or “Caleb”, but which in practice
could be a random UUID or something similar. Moreover, each node keeps a counter of the number of
operations it has processed. A Lamport timestamp is then simply a pair of (*counter*, *node ID*).
Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp,
@@ -735,7 +739,7 @@ Every time a node generates a timestamp, it increments its counter value and use
Moreover, every time a node sees a timestamp from another node, if the counter value in that
timestamp is greater than its local counter value, it increases its local counter to match the value in the timestamp.
-In [Figure 10-9](/en/ch10#fig_consistency_lamport_ts), Aaliyah had not yet seen Caleb’s message when posting her own,
+In [Figure 10-9](/en/ch10#fig_consistency_lamport_ts), Aaliyah had not yet seen Caleb’s message when posting her own,
and vice versa. Assuming both users start with an initial counter value of 0, both therefore
increment their local counter and attach the new counter value of 1 to their message. When Bryce
receives those messages, he increases his local counter value to 1. Finally, Bryce sends a reply to
@@ -743,10 +747,10 @@ Aaliyah’s message, for which he increments his local counter and attaches the
message.
To compare two Lamport timestamps, we first compare their counter value: for example,
-(2, “Bryce”) is greater than (1, “Aaliyah”) and also greater than (1, “Caleb”). If
+(2, “Bryce”) is greater than (1, “Aaliyah”) and also greater than (1, “Caleb”). If
two timestamps have the same counter, we compare their node IDs instead, using the usual
lexicographic string comparison. Thus, the timestamp order in this example is
-(1, “Aaliyah”) < (1, “Caleb”) < (2, “Bryce”).
+(1, “Aaliyah”) < (1, “Caleb”) < (2, “Bryce”).
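
The rules above can be sketched in a few lines. This is a minimal illustration, not a production implementation: the class names `LamportTimestamp` and `LamportClock` and the method names `generate` and `observe` are made up for this example.

```python
from dataclasses import dataclass


@dataclass(order=True, frozen=True)
class LamportTimestamp:
    # order=True compares fields in declaration order: first the counter,
    # then the node ID (ordinary string comparison) as a tie-breaker.
    counter: int
    node_id: str


class LamportClock:
    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def generate(self):
        """Increment the local counter and return a new timestamp."""
        self.counter += 1
        return LamportTimestamp(self.counter, self.node_id)

    def observe(self, timestamp):
        """On seeing another node's timestamp, catch up if it is ahead."""
        self.counter = max(self.counter, timestamp.counter)


aaliyah, bryce, caleb = (LamportClock(n) for n in ("Aaliyah", "Bryce", "Caleb"))

# Aaliyah and Caleb post concurrently, without having seen each other's message.
ts_a = aaliyah.generate()
ts_c = caleb.generate()

# Bryce receives both messages and then replies.
bryce.observe(ts_a)
bryce.observe(ts_c)
ts_b = bryce.generate()

for ts in sorted([ts_b, ts_c, ts_a]):
    print(ts)
```

Sorting the three timestamps yields the same order as in the text: Aaliyah's and Caleb's timestamps share counter 1 and are ordered by node ID, and Bryce's reply with counter 2 comes last.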
#### Hybrid logical clocks {#hybrid-logical-clocks}
@@ -789,7 +793,7 @@ IDs, because they ensure that the snapshot is consistent with causality [^56].
When multiple timestamps are generated concurrently, these algorithms order them arbitrarily. This
means that when you look at two timestamps, you generally can’t tell whether they were generated
concurrently or whether one happened before the other. (In the example of
-[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) you actually can tell that Aaliyah and Caleb’s messages must have
+[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) you actually can tell that Aaliyah and Caleb’s messages must have
been concurrent, because they have the same counter value, but when the counter values are different
you can’t tell whether they were concurrent.)
@@ -807,7 +811,7 @@ the higher ID, even if A and B never communicated with each other. On the other
can only ensure that a node generates timestamps that are greater than any other timestamp that node
has seen, but it can’t say anything about timestamps that it hasn’t seen.
-[Figure 10-10](/en/ch10#fig_consistency_permissions) shows how a non-linearizable ID generator could cause problems.
+[Figure 10-10](/en/ch10#fig_consistency_permissions) shows how a non-linearizable ID generator could cause problems.
Imagine a social media website where user A wants to share an embarrassing photo privately with
their friends. A’s account is initially public, but using their laptop, A first changes their
account settings to private. Then A uses their phone to upload the photo. Since A performed these
@@ -917,7 +921,7 @@ It turns out that all of these are instances of the same fundamental distributed
*consensus*. Consensus is one of the most important and fundamental problems in distributed
computing; it is also infamously difficult to get right [^58] [^59],
and many systems have got it wrong in the past. Now that we have discussed replication
-([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
+([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
linearizability (this chapter), we are finally ready to tackle the consensus problem.
The best-known consensus algorithms are Viewstamped Replication [^60] [^61], Paxos [^58] [^62] [^63] [^64],
@@ -1243,7 +1247,7 @@ A shared log is a good fit for database replication: if every log entry represen
database, and every replica processes the same writes in the same order using deterministic logic,
then the replicas will all end up in a consistent state. This idea is known as *state machine replication* [^80],
and it is the principle behind event sourcing, which we saw in [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events). Shared
-logs are also useful for stream processing, as we shall see in [Link to Come].
+logs are also useful for stream processing, as we shall see in [Chapter 12](/en/ch12#ch_stream).
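
To make the idea concrete, here is a toy sketch of state machine replication over a shared log. The log entry format and the `apply_entry` function are assumptions made for this illustration; the point is only that deterministic logic applied in log order produces the same state on every replica.

```python
def apply_entry(state, entry):
    """Deterministically apply one log entry (a write) to a replica's state."""
    op, key, value = entry
    if op == "set":
        state[key] = value
    elif op == "delete":
        state.pop(key, None)


# The shared log fixes a single order for all writes.
shared_log = [
    ("set", "x", 1),
    ("set", "y", 2),
    ("set", "x", 3),
    ("delete", "y", None),
]

# Replica A processes the whole log in one go.
replica_a = {}
for entry in shared_log:
    apply_entry(replica_a, entry)

# Replica B falls behind after two entries and catches up later.
replica_b = {}
for entry in shared_log[:2]:
    apply_entry(replica_b, entry)
for entry in shared_log[2:]:
    apply_entry(replica_b, entry)

print(replica_a, replica_b)
```

Both replicas end in the same state even though they processed the log at different speeds, because neither the order of entries nor the outcome of applying an entry depends on the replica.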
Similarly, a shared log can be used to implement serializable transactions: as discussed in
[“Actual Serial Execution”](/en/ch8#sec_transactions_serial), if every log entry represents a deterministic transaction to be
@@ -1355,7 +1359,7 @@ fails.
If you drop the requirement for the new leader to be up-to-date, you may improve performance and
availability, but you are on thin ice, since the theory of consensus no longer applies. While things
-will work fine as long as there are no faults, the problems discussed in [Chapter 9](/en/ch9#ch_distributed) can
+will work fine as long as there are no faults, the problems discussed in [Chapter 9](/en/ch9#ch_distributed) can
easily cause a lot of data loss or corruption.
--------
@@ -1381,7 +1385,7 @@ one location to another (by first adding the new nodes, and then removing the ol
Although they are complex and subtle, consensus algorithms are a huge breakthrough for distributed
systems. Consensus is essentially “single-leader replication done right”, with automatic failover on
leader failure, ensuring that no committed data is lost and no split-brain is possible, even in the
-face of all the problems we discussed in [Chapter 9](/en/ch9#ch_distributed).
+face of all the problems we discussed in [Chapter 9](/en/ch9#ch_distributed).
Since single-leader replication with automatic failover is essentially one of the definitions of
consensus, any system that provides automatic failover but does not use a proven consensus algorithm
@@ -1413,7 +1417,7 @@ research problem.
For systems that want to be highly available, but don’t want to accept the cost of consensus, the
only real alternative is to use a weaker consistency model instead, such as those offered by
-leaderless or multi-leader replication as discussed in [Chapter 6](/en/ch6#ch_replication). These approaches
+leaderless or multi-leader replication as discussed in [Chapter 6](/en/ch6#ch_replication). These approaches
generally don’t offer linearizability, but for applications that don’t need it that is fine.
@@ -1617,14 +1621,14 @@ a coordination service. It won’t guarantee that you will get it right, but it
Consensus algorithms are complicated and subtle, but they are supported by a rich body of theory
that has been developed since the 1980s. This theory makes it possible to build systems that can
-tolerate all the faults that we discussed in [Chapter 9](/en/ch9#ch_distributed), and still ensure that your data is
+tolerate all the faults that we discussed in [Chapter 9](/en/ch9#ch_distributed), and still ensure that your data is
not corrupted. This is an amazing achievement, and the references at the end of this chapter feature
some of the highlights of this work.
Nevertheless, consensus is not always the right tool: in some systems, the strong consistency
properties it provides are not needed, and it is better to have weaker consistency with higher
availability and better performance. In these cases, it is common to use leaderless or multi-leader
-replication, which we previously discussed in [Chapter 6](/en/ch6#ch_replication). The logical clocks that we
+replication, which we previously discussed in [Chapter 6](/en/ch6#ch_replication). The logical clocks that we
discussed in this chapter are helpful in that context.
### References
diff --git a/content/en/ch11.md b/content/en/ch11.md
index b7dc78f..e43a9ec 100644
--- a/content/en/ch11.md
+++ b/content/en/ch11.md
@@ -4,177 +4,1300 @@ weight: 311
breadcrumbs: false
---
-{{< callout type="warning" >}}
-This page is from the 1st edition, 2nd edition is not available yet.
-{{< /callout >}}
+

-> *A system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments.*
+> *A system cannot be successful if it is too strongly influenced by a single person. Once the
+> initial design is complete and fairly robust, the real test begins as people with many different
+> viewpoints undertake their own experiments.*
>
-> — Donald Knuth
+> Donald Knuth
----------------
+> [!TIP] A NOTE FOR EARLY RELEASE READERS
+> With Early Release ebooks, you get books in their earliest form---the author's raw and unedited
+> content as they write---so you can take advantage of these technologies long before the official
+> release of these titles.
+>
+> This will be the 11th chapter of the final book. The GitHub repo for this book is
+> [*https://github.com/ept/ddia2-feedback*](https://github.com/ept/ddia2-feedback).
+>
+> If you'd like to be actively involved in reviewing and commenting on this draft, please reach out on
+> GitHub.
-In the first two parts of this book we talked a lot about *requests* and *queries*, and the corresponding *responses* or *results*. This style of data processing is assumed in many modern data systems: you ask for something, or you send an instruction, and some time later the system (hopefully) gives you an answer. Databases, caches, search indexes, web servers, and many other systems work this way.
+Much of this book so far has talked about *requests* and *queries*, and the corresponding
+*responses* or *results*. This style of data processing is assumed in many modern data systems: you
+ask for something, or you send an instruction, and the system tries to give you an answer as quickly
+as possible.
-In such *online* systems, whether it’s a web browser requesting a page or a service call‐ ing a remote API, we generally assume that the request is triggered by a human user, and that the user is waiting for the response. They shouldn’t have to wait too long, so we pay a lot of attention to the *response time* of these systems (see “[Describing Performance](/en/ch1#describing-performance)”).
+A web browser requesting a page, a service calling a remote API, databases, caches, search indexes,
+and many other systems work this way. We call these *online systems*. Response time is usually their
+primary measure of performance, and they often require fault tolerance to ensure high availability.
-The web, and increasing numbers of HTTP/REST-based APIs, has made the request/ response style of interaction so common that it’s easy to take it for granted. But we should remember that it’s not the only way of building systems, and that other approaches have their merits too. Let’s distinguish three different types of systems:
+However, sometimes you need to run a bigger computation or process larger amounts of data than you
+can do in an interactive request. Maybe you need to train an AI model, or transform lots of data
+from one form into another, or compute analytics over a very large dataset. We call these tasks
+*batch processing* jobs, or sometimes *offline systems*.
-***Services (online systems)***
+A batch processing job takes some input data (which is read-only), and produces some output data
+(which is generated from scratch every time the job runs). It typically does not mutate data in the
+way a read/write transaction would. The output is therefore *derived* from the input (as discussed
+in ["Systems of Record and Derived Data"](/en/ch1#sec_introduction_derived)): if you don't like the
+output, you can just delete it, adjust the job logic, and run it again. By treating inputs as
+immutable and avoiding side effects (such as writing to external databases), batch jobs not only
+achieve good performance but also have other benefits:
-A service waits for a request or instruction from a client to arrive. When one is received, the service tries to handle it as quickly as possible and sends a response back. Response time is usually the primary measure of performance of a service, and availability is often very important (if the client can’t reach the service, the user will probably get an error message).
+- If you introduce a bug into the code and the output is wrong or corrupted, you can simply roll
+ back to a previous version of the code and rerun the job, and the output will be correct again.
+ Or, even simpler, you can keep the old output in a different directory and simply switch back to
+ it. Most object stores and open table formats (see ["Cloud Data
+ Warehouses"](/en/ch4#sec_cloud_data_warehouses)) support this feature, which is known as *time
+ travel*. Most databases with read-write transactions do not have this property: if you deploy
+ buggy code that writes bad data to the database, then rolling back the code will do nothing to fix
+ the data in the database. The idea of being able to recover from buggy code has been called *human
+ fault tolerance* [^1].
-***Batch processing systems (offline systems)***
+- As a consequence of this ease of rolling back, feature development can proceed more quickly than
+ in an environment where mistakes could mean irreversible damage. This principle of *minimizing
+ irreversibility* is beneficial for Agile software development [^2].
-A batch processing system takes a large amount of input data, runs a *job* to pro‐ cess it, and produces some output data. Jobs often take a while (from a few minutes to several days), so there normally isn’t a user waiting for the job to fin‐ ish. Instead, batch jobs are often scheduled to run periodically (for example, once a day). The primary performance measure of a batch job is usually *throughput* (the time it takes to crunch through an input dataset of a certain size). We dis‐ cuss batch processing in this chapter.
+- The same set of files can be used as input for various different jobs, including monitoring jobs
+ that calculate metrics and evaluate whether a job's output has the expected characteristics (for
+ example, by comparing it to the output from the previous run and measuring discrepancies).
-***Stream processing systems (near-real-time systems)***
+- Batch processing frameworks make efficient use of computing resources. Even though it's possible
+ to batch process data using online data systems such as OLTP databases and application servers,
+ doing so can be much more expensive in terms of the resources required.
-Stream processing is somewhere between online and offline/batch processing (so it is sometimes called *near-real-time* or *nearline* processing). Like a batch pro‐ cessing system, a stream processor consumes inputs and produces outputs (rather than responding to requests). However, a stream job operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data. This difference allows stream processing systems to have lower latency than the equivalent batch systems. As stream processing builds upon batch process‐ ing, we discuss it in [Chapter 11](/en/ch11).
+Batch data processing also presents challenges. With most frameworks, output can only be processed
+by other jobs after the whole job finishes. Batch processing can also be inefficient: any change to
+input data---even a single byte---means the batch job must reprocess the entire input dataset.
+Despite these limitations, batch processing has proven useful in a wide range of use cases, which
+we'll revisit in ["Batch Use Cases"](/en/ch11#sec_batch_output).
-As we shall see in this chapter, batch processing is an important building block in our quest to build reliable, scalable, and maintainable applications. For example, Map‐ Reduce, a batch processing algorithm published in 2004 [1], was (perhaps over- enthusiastically) called “the algorithm that makes Google so massively scalable” [2]. It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB.
+A batch job may take a long time to run: minutes, hours, or even days. Jobs may be scheduled to run
+periodically (for example, once per day). The primary measure of performance is usually throughput:
+how much data the job can process per unit time. Some batch systems handle faults by simply aborting
+and restarting the whole job, while others have fault tolerance so that a job can complete
+successfully despite some of its nodes crashing.
-MapReduce is a fairly low-level programming model compared to the parallel pro‐ cessing systems that were developed for data warehouses many years previously [^3] [^4], but it was a major step forward in terms of the scale of processing that could be achieved on commodity hardware. Although the importance of MapReduce is now declining [5], it is still worth understanding, because it provides a clear picture of why and how batch processing is useful.
+> [!NOTE]
+> An alternative to batch processing is *stream processing*, in which the job doesn't finish running
+> when it has processed the input, but instead continues watching the input and processes changes in
+> the input shortly after they happen. We will turn to stream processing in
+> [Chapter 12](/en/ch12#ch_stream).
-In fact, batch processing is a very old form of computing. Long before programmable digital computers were invented, punch card tabulating machines—such as the Hol‐ lerith machines used in the 1890 US Census [6]—implemented a semi-mechanized form of batch processing to compute aggregate statistics from large inputs. And Map‐ Reduce bears an uncanny resemblance to the electromechanical IBM card-sorting machines that were widely used for business data processing in the 1940s and 1950s [7]. As usual, history has a tendency of repeating itself.
+The boundary between online and batch processing systems is not always clear: a long-running
+database query looks quite like a batch process. But batch processing also has some particular
+characteristics that make it a useful building block for building reliable, scalable, and
+maintainable applications. For example, it often plays a role in *data integration*, i.e., composing
+multiple data systems to achieve things that one system alone cannot do. ETL, as discussed in ["Data
+Warehousing"](/en/ch1#sec_introduction_dwh), is an example of this.
-In this chapter, we will look at MapReduce and several other batch processing algo‐ rithms and frameworks, and explore how they are used in modern data systems. But first, to get started, we will look at data processing using standard Unix tools. Even if you are already familiar with them, a reminder about the Unix philosophy is worthwhile because the ideas and lessons from Unix carry over to large-scale, heterogene‐ ous distributed data systems.
+Modern batch processing has been heavily influenced by MapReduce, a batch processing algorithm that
+was published by Google in 2004 [^3], and subsequently implemented in various open source
+data systems, including Hadoop, CouchDB, and MongoDB. MapReduce is a fairly low-level programming
+model, and less sophisticated than the parallel query execution engines found, for example, in data
+warehouses [^4], [^5]. When it was new, MapReduce was a step forward in terms of the
+scale of processing that could be achieved on commodity hardware, but now it is largely obsolete,
+and no longer used at Google [^6], [^7].
+Batch processing today is more often done using frameworks such as Spark or Flink, or data warehouse
+query engines. Like MapReduce, they rely heavily on sharding (see [Chapter 7](/en/ch7#ch_sharding))
+and parallel execution, but they have far more sophisticated caching and execution strategies. As
+these systems have matured, operational concerns have been largely solved, so focus has shifted
+toward usability. New processing models such as dataflow APIs, query languages, and DataFrame APIs
+are now widely supported. Job and workflow orchestration has also matured. Hadoop-centric workflow
+schedulers such as Oozie and Azkaban have been replaced with more generalized solutions such as
+Airflow, Dagster, and Prefect, which support a wide array of batch processing frameworks and cloud
+data warehouses.
+Cloud computing has grown ubiquitous. Batch storage layers are shifting from distributed filesystems
+(DFSs) like HDFS, GlusterFS, and CephFS to object storage systems such as S3. Scalable cloud data
+warehouses like BigQuery and Snowflake are blurring the line between data warehouses and batch
+processing.
-## ……
+To build an intuition of what batch processing is about, we will start this chapter with an example
+that uses standard Unix tools on a single machine. We will then investigate how we can extend data
+processing to multiple machines in a distributed system. We will see that, much like an operating
+system, distributed batch processing frameworks have a scheduler and a filesystem. We will then
+explore various processing models that we use to write batch jobs. Finally, we discuss common batch
+processing use cases.
+## Batch Processing with Unix Tools {#sec_batch_unix}
+Say you have a web server that appends a line to a log file every time it serves a request. For
+example, using the nginx default access log format, one line of the log might look like this:
-## Summary
+ 216.58.210.78 - - [27/Jun/2025:17:55:11 +0000] "GET /css/typography.css HTTP/1.1"
+ 200 3377 "https://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X
+ 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36"
+(That is actually one line; it's only broken onto multiple lines here for readability.) There's a
+lot of information in that line. In order to interpret it, you need to look at the definition of the
+log format, which is as follows:
-In this chapter we explored the topic of batch processing. We started by looking at Unix tools such as awk, grep, and sort, and we saw how the design philosophy of those tools is carried forward into MapReduce and more recent dataflow engines. Some of those design principles are that inputs are immutable, outputs are intended to become the input to another (as yet unknown) program, and complex problems are solved by composing small tools that “do one thing well.”
+ $remote_addr - $remote_user [$time_local] "$request"
+ $status $body_bytes_sent "$http_referer" "$http_user_agent"
-In the Unix world, the uniform interface that allows one program to be composed with another is files and pipes; in MapReduce, that interface is a distributed filesys‐ tem. We saw that dataflow engines add their own pipe-like data transport mecha‐ nisms to avoid materializing intermediate state to the distributed filesystem, but the initial input and final output of a job is still usually HDFS.
+So, this one line of the log indicates that on June 27, 2025, at 17:55:11 UTC, the server received a
+request for the file */css/typography.css* from the client IP address 216.58.210.78. The user was
+not authenticated, so `$remote_user` is set to a hyphen (`-`). The response status was 200 (i.e.,
+the request was successful), and the response was 3,377 bytes in size. The web browser was Chrome
+137, and it loaded the file because it was referenced in the page at the URL
+[*https://martin.kleppmann.com/*](https://martin.kleppmann.com/).
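
The same interpretation can be done programmatically. The following sketch parses the example line with a regular expression whose group names mirror the variables in the log format definition. The pattern is an assumption written for this one format and does not handle every input (for example, escaped quotes inside the user agent).

```python
import re

# One named group per variable in the log format definition above.
LOG_PATTERN = re.compile(
    r'(?P<remote_addr>\S+) - (?P<remote_user>\S+) \[(?P<time_local>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<body_bytes_sent>\d+) '
    r'"(?P<http_referer>[^"]*)" "(?P<http_user_agent>[^"]*)"'
)

# The example log line, joined back into a single line.
line = (
    '216.58.210.78 - - [27/Jun/2025:17:55:11 +0000] '
    '"GET /css/typography.css HTTP/1.1" 200 3377 '
    '"https://martin.kleppmann.com/" "Mozilla/5.0 (Macintosh; Intel Mac OS X '
    '10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.0.0 Safari/537.36"'
)

match = LOG_PATTERN.match(line)
fields = match.groupdict()

# The request field holds the HTTP method, the requested path, and the protocol.
method, path, protocol = fields["request"].split()

print(fields["remote_addr"], fields["status"], path)
```

Splitting on whitespace, as the shell pipeline in the next section does, is simpler but relies on the requested URL always being the seventh field. Naming the fields makes that assumption explicit.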
-The two main problems that distributed batch processing frameworks need to solve are:
+Though log parsing might seem contrived, it's actually a critical part of many modern technology
+companies, and is used for everything from ad pipelines to payment processing. Indeed, it was a
+driving force behind the rapid adoption of MapReduce and the "big data" movement.
-***Partitioning***
+### Simple Log Analysis {#sec_batch_log_analysis}
-In MapReduce, mappers are partitioned according to input file blocks. The out‐ put of mappers is repartitioned, sorted, and merged into a configurable number of reducer partitions. The purpose of this process is to bring all the related data— e.g., all the records with the same key—together in the same place.
+Various tools can take these log files and produce pretty reports about your website traffic, but
+for the sake of exercise, let's build our own, using basic Unix tools. For example, say you want to
+find the five most popular pages on your website. You can do this in a Unix shell as follows:
-Post-MapReduce dataflow engines try to avoid sorting unless it is required, but they otherwise take a broadly similar approach to partitioning.
+``` bash
+cat /var/log/nginx/access.log | #1
+ awk '{print $7}' | #2
+ sort | #3
+ uniq -c | #4
+ sort -r -n | #5
+ head -n 5 #6
+```
-***Fault tolerance***
+1. Read the log file. (Strictly speaking, `cat` is unnecessary here, as the input file could be
+ given directly as an argument to `awk`. However, the linear pipeline is more apparent when
+ written like this.)
-MapReduce frequently writes to disk, which makes it easy to recover from an individual failed task without restarting the entire job but slows down execution in the failure-free case. Dataflow engines perform less materialization of inter‐ mediate state and keep more in memory, which means that they need to recom‐ pute more data if a node fails. Deterministic operators reduce the amount of data that needs to be recomputed.
+2. Split each line into fields by whitespace, and output only the seventh such field from each
+ line, which happens to be the requested URL. In our example line, this request URL is
+ */css/typography.css*.
+3. Alphabetically `sort` the list of requested URLs. If some URL has been requested *n* times, then
+ after sorting, the file contains the same URL repeated *n* times in a row.
+4. The `uniq` command filters out repeated lines in its input by checking whether two adjacent
+ lines are the same. The `-c` option tells it to also output a counter: for every distinct URL,
+ it reports how many times that URL appeared in the input.
-We discussed several join algorithms for MapReduce, most of which are also inter‐ nally used in MPP databases and dataflow engines. They also provide a good illustra‐ tion of how partitioned algorithms work:
+5. The second `sort` sorts by the number (`-n`) at the start of each line, which is the number of
+ times the URL was requested. It then returns the results in reverse (`-r`) order, i.e. with the
+ largest number first.
-***Sort-merge joins***
+6. Finally, `head` outputs just the first five lines (`-n 5`) of input, and discards the rest.
-Each of the inputs being joined goes through a mapper that extracts the join key. By partitioning, sorting, and merging, all the records with the same key end up going to the same call of the reducer. This function can then output the joined records.
+The output of that series of commands looks something like this:
-***Broadcast hash joins***
+ 4189 /favicon.ico
+ 3631 /2016/02/08/how-to-do-distributed-locking.html
+ 2124 /2020/11/18/distributed-systems-and-elliptic-curves.html
+ 1369 /
+ 915 /css/typography.css
-One of the two join inputs is small, so it is not partitioned and it can be entirely loaded into a hash table. Thus, you can start a mapper for each partition of the large join input, load the hash table for the small input into each mapper, and then scan over the large input one record at a time, querying the hash table for each record.
+Although the preceding command line likely looks a bit obscure if you're unfamiliar with Unix tools,
+it is incredibly powerful. It will process gigabytes of log files in a matter of seconds, and you
+can easily modify the analysis to suit your needs. For example, if you want to omit CSS files from
+the report, change the `awk` argument to `'$7 !~ /\.css$/ {print $7}'`. If you want to count top
+client IP addresses instead of top pages, change the `awk` argument to `'{print $1}'`. And so on.
-***Partitioned hash joins***
+We don't have space in this book to explore Unix tools in detail, but they are very much worth
+learning about. Surprisingly many data analyses can be done in a few minutes using some combination
+of `awk`, `sed`, `grep`, `sort`, `uniq`, and `xargs`, and they perform surprisingly well
+[^8].
-If the two join inputs are partitioned in the same way (using the same key, same hash function, and same number of partitions), then the hash table approach can be used independently for each partition.
+### Chain of Commands Versus Custom Program {#sec_batch_custom_program}
+Instead of the chain of Unix commands, you could write a simple program to do the same thing. For
+example, in Python, it might look something like this:
+``` python
+from collections import defaultdict
-Distributed batch processing engines have a deliberately restricted programming model: callback functions (such as mappers and reducers) are assumed to be stateless and to have no externally visible side effects besides their designated output. This restriction allows the framework to hide some of the hard distributed systems prob‐ lems behind its abstraction: in the face of crashes and network issues, tasks can be retried safely, and the output from any failed tasks is discarded. If several tasks for a partition succeed, only one of them actually makes its output visible.
+counts = defaultdict(int) #1
-Thanks to the framework, your code in a batch processing job does not need to worry about implementing fault-tolerance mechanisms: the framework can guarantee that the final output of a job is the same as if no faults had occurred, even though in real‐ ity various tasks perhaps had to be retried. These reliable semantics are much stron‐ ger than what you usually have in online services that handle user requests and that write to databases as a side effect of processing a request.
+with open('/var/log/nginx/access.log', 'r') as file:
+ for line in file:
+ url = line.split()[6] #2
+ counts[url] += 1 #3
-The distinguishing feature of a batch processing job is that it reads some input data and produces some output data, without modifying the input—in other words, the output is derived from the input. Crucially, the input data is *bounded*: it has a known, fixed size (for example, it consists of a set of log files at some point in time, or a snap‐ shot of a database’s contents). Because it is bounded, a job knows when it has finished reading the entire input, and so a job eventually completes when it is done.
+top5 = sorted(((count, url) for url, count in counts.items()), reverse=True)[:5] #4
-In the next chapter, we will turn to stream processing, in which the input is *unboun‐ ded*—that is, you still have a job, but its inputs are never-ending streams of data. In this case, a job is never complete, because at any time there may still be more work coming in. We shall see that stream and batch processing are similar in some respects, but the assumption of unbounded streams also changes a lot about how we build systems.
+for count, url in top5: #5
+ print(f"{count} {url}")
+```
+1. `counts` is a hash table that keeps a counter for the number of times we've seen each URL. A
+ counter is zero by default.
+2. From each line of the log, we take the URL to be the seventh whitespace-separated field (the
+ list index is 6 because Python lists are zero-indexed).
-### References
+3. Increment the counter for the URL in the current line of the log.
-1. Jeffrey Dean and Sanjay Ghemawat: “[MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/pub62/),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004.
-1. Joel Spolsky: “[The Perils of JavaSchools](https://www.joelonsoftware.com/2005/12/29/the-perils-of-javaschools-2/),” *joelonsoftware.com*, December 29, 2005.
-1. Shivnath Babu and Herodotos Herodotou: “[Massively Parallel Databases and MapReduce Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/db-mr-survey-final.pdf),” *Foundations and Trends in Databases*, volume 5, number 1, pages 1–104, November 2013. [doi:10.1561/1900000036](http://dx.doi.org/10.1561/1900000036)
-1. David J. DeWitt and Michael Stonebraker: “[MapReduce: A Major Step Backwards](https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html),” originally published at *databasecolumn.vertica.com*, January 17, 2008.
-1. Henry Robinson: “[The Elephant Was a Trojan Horse: On the Death of Map-Reduce at Google](https://www.the-paper-trail.org/post/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/),” *the-paper-trail.org*, June 25, 2014.
-1. “[The Hollerith Machine](https://www.census.gov/history/www/innovations/technology/the_hollerith_tabulator.html),” United States Census Bureau, *census.gov*.
-1. “[IBM 82, 83, and 84 Sorters Reference Manual](https://bitsavers.org/pdf/ibm/punchedCard/Sorter/A24-1034-1_82-83-84_sorters.pdf),” Edition A24-1034-1, International Business Machines Corporation, July 1962.
-1. Adam Drake: “[Command-Line Tools Can Be 235x Faster than Your Hadoop Cluster](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html),” *aadrake.com*, January 25, 2014.
-1. “[GNU Coreutils 8.23 Documentation](http://www.gnu.org/software/coreutils/manual/html_node/index.html),” Free Software Foundation, Inc., 2014.
-1. Martin Kleppmann: “[Kafka, Samza, and the Unix Philosophy of Distributed Data](http://martin.kleppmann.com/2015/08/05/kafka-samza-unix-philosophy-distributed-data.html),” *martin.kleppmann.com*, August 5, 2015.
-1. Doug McIlroy: [Internal Bell Labs memo](https://swtch.com/~rsc/thread/mdmpipe.pdf), October 1964. Cited in: Dennis M. Richie: “[Advice from Doug McIlroy](https://www.bell-labs.com/usr/dmr/www/mdmpipe.html),” *bell-labs.com*.
-1. M. D. McIlroy, E. N. Pinson, and B. A. Tague: “[UNIX Time-Sharing System: Foreword](https://archive.org/details/bstj57-6-1899),” *The Bell System Technical Journal*, volume 57, number 6, pages 1899–1904, July 1978.
-1. Eric S. Raymond: [*The Art of UNIX Programming*](http://www.catb.org/~esr/writings/taoup/html/). Addison-Wesley, 2003. ISBN: 978-0-13-142901-7
-1. Ronald Duncan: “[Text File Formats – ASCII Delimited Text – Not CSV or TAB Delimited Text](https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/),” *ronaldduncan.wordpress.com*, October 31, 2009.
-1. Alan Kay: “[Is 'Software Engineering' an Oxymoron?](http://tinlizzie.org/~takashi/IsSoftwareEngineeringAnOxymoron.pdf),” *tinlizzie.org*.
-1. Martin Fowler: “[InversionOfControl](http://martinfowler.com/bliki/InversionOfControl.html),” *martinfowler.com*, June 26, 2005.
-1. Daniel J. Bernstein: “[Two File Descriptors for Sockets](http://cr.yp.to/tcpip/twofd.html),” *cr.yp.to*.
-1. Rob Pike and Dennis M. Ritchie: “[The Styx Architecture for Distributed Systems](http://doc.cat-v.org/inferno/4th_edition/styx),” *Bell Labs Technical Journal*, volume 4, number 2, pages 146–152, April 1999.
-1. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: “[The Google File System](http://research.google.com/archive/gfs-sosp2003.pdf),” at *19th ACM Symposium on Operating Systems Principles* (SOSP), October 2003. [doi:10.1145/945445.945450](http://dx.doi.org/10.1145/945445.945450)
-1. Michael Ovsiannikov, Silvius Rus, Damian Reeves, et al.: “[The Quantcast File System](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p808-ovsiannikov.pdf),” *Proceedings of the VLDB Endowment*, volume 6, number 11, pages 1092–1101, August 2013. [doi:10.14778/2536222.2536234](http://dx.doi.org/10.14778/2536222.2536234)
-1. “[OpenStack Swift 2.6.1 Developer Documentation](http://docs.openstack.org/developer/swift/),” OpenStack Foundation, *docs.openstack.org*, March 2016.
-1. Zhe Zhang, Andrew Wang, Kai Zheng, et al.: “[Introduction to HDFS Erasure Coding in Apache Hadoop](https://blog.cloudera.com/introduction-to-hdfs-erasure-coding-in-apache-hadoop/),” *blog.cloudera.com*, September 23, 2015.
-1. Peter Cnudde: “[Hadoop Turns 10](https://web.archive.org/web/20190119112713/https://yahoohadoop.tumblr.com/post/138739227316/hadoop-turns-10),” *yahoohadoop.tumblr.com*, February 5, 2016.
-1. Eric Baldeschwieler: “[Thinking About the HDFS vs. Other Storage Technologies](https://web.archive.org/web/20190529215115/http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/),” *hortonworks.com*, July 25, 2012.
-1. Brendan Gregg: “[Manta: Unix Meets Map Reduce](https://web.archive.org/web/20220125052545/http://dtrace.org/blogs/brendan/2013/06/25/manta-unix-meets-map-reduce/),” *dtrace.org*, June 25, 2013.
-1. Tom White: *Hadoop: The Definitive Guide*, 4th edition. O'Reilly Media, 2015. ISBN: 978-1-491-90163-2
-1. Jim N. Gray: “[Distributed Computing Economics](http://arxiv.org/pdf/cs/0403019.pdf),” Microsoft Research Tech Report MSR-TR-2003-24, March 2003.
-1. Márton Trencséni: “[Luigi vs Airflow vs Pinball](http://bytepawn.com/luigi-airflow-pinball.html),” *bytepawn.com*, February 6, 2016.
-1. Roshan Sumbaly, Jay Kreps, and Sam Shah: “[The 'Big Data' Ecosystem at LinkedIn](http://www.slideshare.net/s_shah/the-big-data-ecosystem-at-linkedin-23512853),” at *ACM International Conference on Management of Data* (SIGMOD), July 2013. [doi:10.1145/2463676.2463707](http://dx.doi.org/10.1145/2463676.2463707)
-1. Alan F. Gates, Olga Natkovich, Shubham Chopra, et al.: “[Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience](http://www.vldb.org/pvldb/vol2/vldb09-1074.pdf),” at *35th International Conference on Very Large Data Bases* (VLDB), August 2009.
-1. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, et al.: “[Hive – A Petabyte Scale Data Warehouse Using Hadoop](http://i.stanford.edu/~ragho/hive-icde2010.pdf),” at *26th IEEE International Conference on Data Engineering* (ICDE), March 2010. [doi:10.1109/ICDE.2010.5447738](http://dx.doi.org/10.1109/ICDE.2010.5447738)
-1. “[Cascading 3.0 User Guide](https://web.archive.org/web/20231206195311/http://docs.cascading.org/cascading/3.0/userguide/),” Concurrent, Inc., *docs.cascading.org*, January 2016.
-1. “[Apache Crunch User Guide](https://crunch.apache.org/user-guide.html),” Apache Software Foundation, *crunch.apache.org*.
-1. Craig Chambers, Ashish Raniwala, Frances Perry, et al.: “[FlumeJava: Easy, Efficient Data-Parallel Pipelines](https://research.google.com/pubs/archive/35650.pdf),” at *31st ACM SIGPLAN Conference on Programming Language Design and Implementation* (PLDI), June 2010. [doi:10.1145/1806596.1806638](http://dx.doi.org/10.1145/1806596.1806638)
-1. Jay Kreps: “[Why Local State is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing),” *oreilly.com*, July 31, 2014.
-1. Martin Kleppmann: “[Rethinking Caching in Web Apps](http://martin.kleppmann.com/2012/10/01/rethinking-caching-in-web-apps.html),” *martin.kleppmann.com*, October 1, 2012.
-1. Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira: *[Hadoop Application Architectures](http://shop.oreilly.com/product/0636920033196.do)*. O'Reilly Media, 2015. ISBN: 978-1-491-90004-8
-1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
-1. Sriranjan Manjunath: “[Skewed Join](https://web.archive.org/web/20151228114742/https://wiki.apache.org/pig/PigSkewedJoinSpec),” *wiki.apache.org*, 2009.
-1. David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri: “[Practical Skew Handling in Parallel Joins](http://www.vldb.org/conf/1992/P027.PDF),” at *18th International Conference on Very Large Data Bases* (VLDB), August 1992.
-1. Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: “[Impala: A Modern, Open-Source SQL Engine for Hadoop](http://pandis.net/resources/cidr15impala.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
-1. Matthieu Monsch: “[Open-Sourcing PalDB, a Lightweight Companion for Storing Side Data](https://engineering.linkedin.com/blog/2015/10/open-sourcing-paldb--a-lightweight-companion-for-storing-side-da),” *engineering.linkedin.com*, October 26, 2015.
-1. Daniel Peng and Frank Dabek: “[Large-Scale Incremental Processing Using Distributed Transactions and Notifications](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf),” at *9th USENIX conference on Operating Systems Design and Implementation* (OSDI), October 2010.
-1. “["Cloudera Search User Guide,"](http://www.cloudera.com/documentation/cdh/5-1-x/Search/Cloudera-Search-User-Guide/Cloudera-Search-User-Guide.html) Cloudera, Inc., September 2015.
-1. Lili Wu, Sam Shah, Sean Choi, et al.: “[The Browsemaps: Collaborative Filtering at LinkedIn](http://ceur-ws.org/Vol-1271/Paper3.pdf),” at *6th Workshop on Recommender Systems and the Social Web* (RSWeb), October 2014.
-1. Roshan Sumbaly, Jay Kreps, Lei Gao, et al.: “[Serving Large-Scale Batch Computed Data with Project Voldemort](http://static.usenix.org/events/fast12/tech/full_papers/Sumbaly.pdf),” at *10th USENIX Conference on File and Storage Technologies* (FAST), February 2012.
-1. Varun Sharma: “[Open-Sourcing Terrapin: A Serving System for Batch Generated Data](https://web.archive.org/web/20170215032514/https://engineering.pinterest.com/blog/open-sourcing-terrapin-serving-system-batch-generated-data-0),” *engineering.pinterest.com*, September 14, 2015.
-1. Nathan Marz: “[ElephantDB](http://www.slideshare.net/nathanmarz/elephantdb),” *slideshare.net*, May 30, 2011.
-1. Jean-Daniel (JD) Cryans: “[How-to: Use HBase Bulk Loading, and Why](https://blog.cloudera.com/how-to-use-hbase-bulk-loading-and-why/),” *blog.cloudera.com*, September 27, 2013.
-1. Nathan Marz: “[How to Beat the CAP Theorem](http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html),” *nathanmarz.com*, October 13, 2011.
-1. Molly Bartlett Dishman and Martin Fowler: “[Agile Architecture](https://web.archive.org/web/20161130034721/http://conferences.oreilly.com/software-architecture/sa2015/public/schedule/detail/40388),” at *O'Reilly Software Architecture Conference*, March 2015.
-1. David J. DeWitt and Jim N. Gray: “[Parallel Database Systems: The Future of High Performance Database Systems](http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/dewittgray92.pdf),” *Communications of the ACM*, volume 35, number 6, pages 85–98, June 1992. [doi:10.1145/129888.129894](http://dx.doi.org/10.1145/129888.129894)
-1. Jay Kreps: “[But the multi-tenancy thing is actually really really hard](https://twitter.com/jaykreps/status/528235702480142336),” tweetstorm, *twitter.com*, October 31, 2014.
-1. Jeffrey Cohen, Brian Dolan, Mark Dunlap, et al.: “[MAD Skills: New Analysis Practices for Big Data](http://www.vldb.org/pvldb/vol2/vldb09-219.pdf),” *Proceedings of the VLDB Endowment*, volume 2, number 2, pages 1481–1492, August 2009. [doi:10.14778/1687553.1687576](http://dx.doi.org/10.14778/1687553.1687576)
-1. Ignacio Terrizzano, Peter Schwarz, Mary Roth, and John E. Colino: “[Data Wrangling: The Challenging Journey from the Wild to the Lake](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper2.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
-1. Paige Roberts: “[To Schema on Read or to Schema on Write, That Is the Hadoop Data Lake Question](https://web.archive.org/web/20171105001306/http://adaptivesystemsinc.com/blog/to-schema-on-read-or-to-schema-on-write-that-is-the-hadoop-data-lake-question/),” *adaptivesystemsinc.com*, July 2, 2015.
-1. Bobby Johnson and Joseph Adler: “[The Sushi Principle: Raw Data Is Better](https://web.archive.org/web/20161126104941/https://conferences.oreilly.com/strata/big-data-conference-ca-2015/public/schedule/detail/38737),” at *Strata+Hadoop World*, February 2015.
-1. Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, et al.: “[Apache Hadoop YARN: Yet Another Resource Negotiator](https://www.cs.cmu.edu/~garth/15719/papers/yarn.pdf),” at *4th ACM Symposium on Cloud Computing* (SoCC), October 2013. [doi:10.1145/2523616.2523633](http://dx.doi.org/10.1145/2523616.2523633)
-1. Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, et al.: “[Large-Scale Cluster Management at Google with Borg](http://research.google.com/pubs/pub43438.html),” at *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741964](http://dx.doi.org/10.1145/2741948.2741964)
-1. Malte Schwarzkopf: “[The Evolution of Cluster Scheduler Architectures](https://web.archive.org/web/20201109052657/http://www.firmament.io/blog/scheduler-architectures.html),” *firmament.io*, March 9, 2016.
-1. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al.: “[Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf),” at *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012.
-1. Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia: *Learning Spark*. O'Reilly Media, 2015. ISBN: 978-1-449-35904-1
-1. Bikas Saha and Hitesh Shah: “[Apache Tez: Accelerating Hadoop Query Processing](http://www.slideshare.net/Hadoop_Summit/w-1205phall1saha),” at *Hadoop Summit*, June 2014.
-1. Bikas Saha, Hitesh Shah, Siddharth Seth, et al.: “[Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications](http://home.cse.ust.hk/~weiwa/teaching/Fall15-COMP6611B/reading_list/Tez.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2742790](http://dx.doi.org/10.1145/2723372.2742790)
-1. Kostas Tzoumas: “[Apache Flink: API, Runtime, and Project Roadmap](http://www.slideshare.net/KostasTzoumas/apache-flink-api-runtime-and-project-roadmap),” *slideshare.net*, January 14, 2015.
-1. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al.: “[The Stratosphere Platform for Big Data Analytics](https://ssc.io/pdf/2014-VLDBJ_Stratosphere_Overview.pdf),” *The VLDB Journal*, volume 23, number 6, pages 939–964, May 2014. [doi:10.1007/s00778-014-0357-y](http://dx.doi.org/10.1007/s00778-014-0357-y)
-1. Michael Isard, Mihai Budiu, Yuan Yu, et al.: “[Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks](https://www.microsoft.com/en-us/research/publication/dryad-distributed-data-parallel-programs-from-sequential-building-blocks/),” at *European Conference on Computer Systems* (EuroSys), March 2007. [doi:10.1145/1272996.1273005](http://dx.doi.org/10.1145/1272996.1273005)
-1. Daniel Warneke and Odej Kao: “[Nephele: Efficient Parallel Data Processing in the Cloud](https://stratosphere2.dima.tu-berlin.de/assets/papers/Nephele_09.pdf),” at *2nd Workshop on Many-Task Computing on Grids and Supercomputers* (MTAGS), November 2009. [doi:10.1145/1646468.1646476](http://dx.doi.org/10.1145/1646468.1646476)
-1. Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd: “[The PageRank Citation Ranking: Bringing Order to the Web](https://web.archive.org/web/20230219170930/http://ilpubs.stanford.edu:8090/422/),” Stanford InfoLab Technical Report 422, 1999.
-1. Leslie G. Valiant: “[A Bridging Model for Parallel Computation](http://dl.acm.org/citation.cfm?id=79181),” *Communications of the ACM*, volume 33, number 8, pages 103–111, August 1990. [doi:10.1145/79173.79181](http://dx.doi.org/10.1145/79173.79181)
-1. Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl: “[Spinning Fast Iterative Data Flows](http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf),” *Proceedings of the VLDB Endowment*, volume 5, number 11, pages 1268-1279, July 2012. [doi:10.14778/2350229.2350245](http://dx.doi.org/10.14778/2350229.2350245)
-1. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, et al.: “[Pregel: A System for Large-Scale Graph Processing](https://kowshik.github.io/JPregel/pregel_paper.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2010. [doi:10.1145/1807167.1807184](http://dx.doi.org/10.1145/1807167.1807184)
-1. Frank McSherry, Michael Isard, and Derek G. Murray: “[Scalability! But at What COST?](http://www.frankmcsherry.org/assets/COST.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
-1. Ionel Gog, Malte Schwarzkopf, Natacha Crooks, et al.: “[Musketeer: All for One, One for All in Data Processing Systems](http://www.cl.cam.ac.uk/research/srg/netos/camsas/pubs/eurosys15-musketeer.pdf),” at *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741968](http://dx.doi.org/10.1145/2741948.2741968)
-1. Aapo Kyrola, Guy Blelloch, and Carlos Guestrin: “[GraphChi: Large-Scale Graph Computation on Just a PC](https://www.usenix.org/system/files/conference/osdi12/osdi12-final-126.pdf),” at *10th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2012.
-1. Andrew Lenharth, Donald Nguyen, and Keshav Pingali: “[Parallel Graph Analytics](http://cacm.acm.org/magazines/2016/5/201591-parallel-graph-analytics/fulltext),” *Communications of the ACM*, volume 59, number 5, pages 78–87, May 2016. [doi:10.1145/2901919](http://dx.doi.org/10.1145/2901919)
-1. Fabian Hüske: “[Peeking into Apache Flink's Engine Room](http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html),” *flink.apache.org*, March 13, 2015.
-1. Mostafa Mokhtar: “[Hive 0.14 Cost Based Optimizer (CBO) Technical Overview](https://web.archive.org/web/20170607112708/http://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/),” *hortonworks.com*, March 2, 2015.
-1. Michael Armbrust, Reynold S Xin, Cheng Lian, et al.: “[Spark SQL: Relational Data Processing in Spark](http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2742797](http://dx.doi.org/10.1145/2723372.2742797)
-1. Daniel Blazevski: “[Planting Quadtrees for Apache Flink](https://blog.insightdatascience.com/planting-quadtrees-for-apache-flink-b396ebc80d35),” *insightdataengineering.com*, March 25, 2016.
-1. Tom White: “[Genome Analysis Toolkit: Now Using Apache Spark for Data Processing](https://web.archive.org/web/20190215132904/http://blog.cloudera.com/blog/2016/04/genome-analysis-toolkit-now-using-apache-spark-for-data-processing/),” *blog.cloudera.com*, April 6, 2016.
+4. Sort the hash table contents by counter value (descending), and take the top five entries.
+
+5. Print out those top five entries.
+
+This program is not as concise as the chain of Unix pipes, but it's fairly readable, and which of
+the two you prefer is partly a matter of taste. However, besides the superficial syntactic
+differences between the two, there is a big difference in the execution flow, which becomes apparent
+if you run this analysis on a large file.
+
+### Sorting Versus In-memory Aggregation {#id275}
+
+The Python script keeps an in-memory hash table of URLs, where each URL is mapped to the number of
+times it has been seen. The Unix pipeline example does not have such a hash table, but instead
+relies on sorting a list of URLs in which multiple occurrences of the same URL are simply repeated.
+
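To make the contrast concrete, here is a minimal Python sketch of the sort-based approach, mirroring the `sort | uniq -c | sort -r -n | head -n 5` stages of the pipeline. The function name `top_urls_by_sorting` and the sample data are illustrative assumptions, and unlike GNU `sort` this version sorts entirely in memory, so it only demonstrates the logic, not the disk behavior.

```python
from itertools import groupby


def top_urls_by_sorting(urls, n=5):
    """Count URLs the way `sort | uniq -c | sort -r -n | head` does."""
    # Sorting puts repeated URLs next to each other (like `sort`).
    ordered = sorted(urls)
    # groupby only merges adjacent equal items, just like `uniq -c`.
    counted = [(sum(1 for _ in group), url) for url, group in groupby(ordered)]
    # Largest count first (like `sort -r -n`), then keep the first n (like `head`).
    return sorted(counted, reverse=True)[:n]


if __name__ == "__main__":
    sample = ["/", "/favicon.ico", "/", "/css/typography.css", "/favicon.ico", "/"]
    for count, url in top_urls_by_sorting(sample, n=2):
        print(f"{count} {url}")
```
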
+Which approach is better? It depends how many different URLs you have. For most small to mid-sized
+websites, you can probably fit all distinct URLs, and a counter for each URL, in (say) 1 GB of
+memory. In this example, the *working set* of the job (the amount of memory to which the job needs
+random access) depends only on the number of distinct URLs: if there are a million log entries for a
+single URL, the space required in the hash table is still just one URL plus the size of the counter.
+If this working set is small enough, an in-memory hash table works fine---even on a laptop.
+
+On the other hand, if the job's working set is larger than the available memory, the sorting
+approach has the advantage that it can make efficient use of disks. It's the same principle as we
+discussed in ["Log-Structured Storage"](/en/ch4#sec_storage_log_structured): chunks of data can be
+sorted in memory and written out to disk as segment files, and then multiple sorted segments can be
+merged into a larger sorted file. Mergesort has sequential access patterns that perform well on
+disks (see ["Sequential Versus Random Writes on SSDs"](/en/ch4#sidebar_sequential)).
+
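The sort-then-merge principle described above can be sketched in a few lines of Python. This is an illustration of the idea only, not how GNU `sort` is implemented: the function name `external_sort` and the `chunk_size` parameter are assumptions made for this example. Each chunk is sorted in memory and written to a temporary segment file, and `heapq.merge` then reads all segments sequentially to produce one sorted stream.

```python
import heapq
import tempfile


def external_sort(lines, chunk_size=100_000):
    """Yield the given lines in sorted order, spilling sorted chunks to disk."""
    segments = []
    chunk = []

    def spill():
        # Sort one chunk in memory and write it out as a sorted segment file.
        chunk.sort()
        segment = tempfile.TemporaryFile(mode="w+", encoding="utf-8", newline="\n")
        segment.writelines(line + "\n" for line in chunk)
        segment.seek(0)
        segments.append(segment)
        chunk.clear()

    for line in lines:
        chunk.append(line.rstrip("\n"))
        if len(chunk) >= chunk_size:
            spill()
    if chunk:
        spill()

    try:
        # Strip the newline before comparing so the merge order matches the chunk order.
        streams = [(line.rstrip("\n") for line in segment) for segment in segments]
        # heapq.merge consumes each sorted segment lazily and sequentially.
        yield from heapq.merge(*streams)
    finally:
        for segment in segments:
            segment.close()
```
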
+The `sort` utility in GNU Coreutils (Linux) automatically handles larger-than-memory datasets by
+spilling to disk, and automatically parallelizes sorting across multiple CPU cores [^9].
+This means that the simple chain of Unix commands we saw earlier easily scales to large datasets,
+without running out of memory. The bottleneck is likely to be the rate at which the input file can
+be read from disk.
+
+A limitation of Unix tools is that they run only on a single machine. Datasets that are too large to
+fit in memory or local disk present a problem---and that's where distributed batch processing
+frameworks come in.
+
+## Batch Processing in Distributed Systems {#sec_batch_distributed}
+
+The machine that runs our Unix tool example has a number of components that work together to process
+the log data:
+
+- Storage devices that are accessed through the operating system's filesystem interface.
+
+- A scheduler that determines when processes get to run, and how to allocate CPU resources to them.
+
+- A series of Unix programs whose `stdin` and `stdout` are connected together by pipes.
+
+These same components exist in distributed data processing frameworks. In fact, you can think of a
+distributed processing framework as a distributed operating system: it has a filesystem, a job
+scheduler, and programs that send data to each other through the filesystem or other communication
+channels.
+
+### Distributed Filesystems {#sec_batch_dfs}
+
+The filesystem provided by your operating system is composed of several layers:
+
+- At the lowest level, block device drivers speak directly to the disk, and allow the layers above
+ to read and write raw blocks.
+
+- Above the block layer sits a page cache that keeps recently accessed blocks in memory for faster
+ access.
+
+- The block API is wrapped in a filesystem layer that breaks up large files into blocks, and tracks
+ file metadata such as inodes, directories, and files. ext4 and XFS are two common implementations
+ on Linux, for example.
+
+- Finally, the operating system exposes different filesystems to applications through a common API
+ called the virtual file system (VFS). The VFS is what allows applications to read and write in a
+ standard way regardless of the underlying filesystem.
+
+Distributed filesystems work in much the same way. Files are broken up into blocks, which are
+distributed across many machines. DFS blocks are typically much larger than local blocks: HDFS
+(Hadoop Distributed File System) defaults to 128 MB, while JuiceFS and many object stores use 4 MB
+blocks---much larger than ext4's 4096 bytes. Larger blocks mean less metadata to keep track of,
+which makes a big difference on petabyte-sized datasets. Larger blocks also lower the overhead of
+seeking to a block relative to reading it.
+
+Most physical storage devices can't write partial blocks, so operating systems require writes to use
+an entire block even if the data doesn't take up the whole block. Since distributed filesystems have
+larger blocks and are usually implemented on top of operating system filesystems, they don't have
+this requirement. For example, a 900 MB file stored with 128 MB blocks would have 7 full blocks of
+128 MB (896 MB in total) and 1 final block that holds only the remaining 4 MB.
+
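The block arithmetic in the example above can be checked with a short Python sketch. The helper name `split_into_blocks` is hypothetical and does not correspond to any HDFS API; it only models the rule that the last block stores just the leftover bytes.

```python
def split_into_blocks(file_size, block_size):
    """Return the sizes of the blocks needed to store a file of file_size bytes."""
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block holds only the leftover data, not a full block.
        sizes.append(remainder)
    return sizes


MB = 1024 * 1024
blocks = split_into_blocks(900 * MB, 128 * MB)
print(len(blocks), blocks[-1] // MB)  # 8 blocks in total; the last one is 4 MB
```
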
+DFS blocks are read by making network requests to a machine in the cluster that stores the block.
+Each machine runs a daemon, exposing an API that allows remote processes to read and write blocks as
+files on its local filesystem. HDFS refers to these daemons as DataNodes, while GlusterFS calls them
+glusterfsd processes. We'll call them *data nodes* in this book.
+
+Distributed filesystems also implement the distributed equivalent of a page cache. Since DFS blocks
+are stored as files on data nodes, reads and writes go through each data node's operating system,
+which includes an in-memory page cache. This keeps frequently read data blocks in-memory on the data
+nodes. Some distributed filesystems also implement more caching tiers such as the client-side and
+local-disk caching found in JuiceFS.
+
+Filesystems such as ext4 and XFS keep track of storage metadata including free space, file block
+locations, directory structures, permission settings, and more. Distributed filesystems also need a
+way to track file locations spread across machines, permission settings, and so on. Hadoop has a
+service called the NameNode, which maintains metadata for the cluster. DeepSeek's 3FS has a metadata
+service that persists its data to a key-value store such as FoundationDB.
+
+Above the filesystem sits the VFS. A close analogue in batch processing is a distributed
+filesystem's protocol. Distributed filesystems must expose a protocol or interface so that batch
+processing systems can read and write files. This protocol acts as a pluggable interface: any DFS
+may be used so long as it implements the protocol. For example, Amazon S3's API has been widely
+adopted by other storage systems such as MinIO, Cloudflare's R2, Tigris, Backblaze's B2, and many
+others. Batch processing systems with S3 support can use any of these storage systems.
+
+Some DFSs implement POSIX-compliant filesystems that appear to the operating system's VFS like any
+other filesystem. Filesystem in Userspace (FUSE) or the Network File System (NFS) protocol is often
+used to integrate with the VFS. NFS is perhaps the best-known distributed filesystem protocol.
+The protocol was originally developed to allow multiple clients to read and write data on a single
+server. More recently, filesystems such as AWS's Elastic File System (EFS) and Archil provide
+NFS-compatible distributed filesystem implementations that are far more scalable. NFS clients still
+connect to one end point, but underneath, these systems communicate with distributed metadata
+services and data nodes to read and write data.
+
+> [!TIP] DISTRIBUTED FILESYSTEMS AND NETWORK STORAGE
+> Distributed filesystems are based on the *shared-nothing* principle (see ["Shared-Memory,
+> Shared-Disk, and Shared-Nothing
+> Architecture"](/en/ch2#sec_introduction_shared_nothing)), in contrast to the
+> shared-disk approach of *Network Attached Storage* (NAS) and *Storage Area Network* (SAN)
+> architectures. Shared-disk storage is implemented by a centralized storage appliance, often using
+> custom hardware and special network infrastructure such as Fibre Channel. On the other hand, the
+> shared-nothing approach requires no special hardware, only computers connected by a conventional
+> datacenter network.
+
+Many distributed filesystems are built on commodity hardware, which is less expensive but has higher
+failure rates than enterprise-grade hardware. In order to tolerate machine and disk failures, file
+blocks are replicated on multiple machines. This also allows schedulers to distribute workloads
+more evenly, since they can execute a task on any node that contains a replica of the task's input data.
+Replication may mean simply several copies of the same data on multiple machines, as in
+[Chapter 6](/en/ch6#ch_replication), or an *erasure coding* scheme such as Reed--Solomon codes,
+which allows lost data to be recovered with lower storage overhead than full replication
+[^10], [^11], [^12]. The techniques are similar to RAID, which provides
+redundancy across several disks attached to the same machine; the difference is that in a
+distributed filesystem, file access and replication are done over a conventional datacenter network
+without special hardware.
+
+### Object Stores {#id277}
+
+Object storage services such as Amazon S3, Google Cloud Storage, Azure Blob Storage, and OpenStack
+Swift have become a popular alternative to distributed filesystems for batch processing jobs. In
+fact, the line between the two is somewhat blurry. As we saw in the previous section and ["Databases
+Backed by Object Storage"](/en/ch6#sec_replication_object_storage), Filesystem in Userspace (FUSE)
+drivers allow users to treat object stores such as S3 as a filesystem. Some DFS implementations such
+as JuiceFS and Ceph offer both object storage and filesystem APIs. However, their APIs, performance,
+and consistency guarantees are very different. Care must be taken when adopting such systems to make
+sure they behave as expected, even if they seem to implement the requisite APIs.
+
+Each object in an object store has a URL such as `s3://my-photo-bucket/2025/04/01/birthday.png`. The
+host portion of the URL (`my-photo-bucket`) describes the bucket where objects are stored, and the
+part that follows is the object's *key* (`2025/04/01/birthday.png` in our example). A bucket has a
+globally unique name, and each object's key must be unique within its bucket.
+
+Objects are read using a `get` call and written using a `put` call. Unlike files on a filesystem,
+objects are immutable once written. To update an object, it must be fully rewritten using a `put`
+call, similarly to a key-value store. Azure Blob Storage and S3 Express One Zone support appends,
+but most other stores do not. There are no file handle APIs with functions like `fopen` and `fseek`.
+
+Objects may look as if they are organised into directories, which is somewhat confusing, since
+object stores do not have the concept of directories. The path structure is simply a convention, and
+the slashes are a part of the object's key. This convention allows you to perform something similar
+to a directory listing by requesting a list of objects with a particular prefix. However, listing
+objects by prefix is different from a filesystem directory listing in two ways:
+
+- A prefix `list` operation behaves like a recursive `ls -R` call on a Unix system: it returns all
+ objects that start with the prefix---objects in subpaths are included.
+
+- Empty directories are not possible: if you were to remove all objects underneath
+ `s3://my-photo-bucket/2025/04/01`, then `01` would no longer appear when you call `list` on
+ `s3://my-photo-bucket/2025/04`. It is a common practice to create a zero-byte object as a way to
+ represent an empty directory (e.g. creating an empty `s3://my-photo-bucket/2025/04/01` file to
+ keep it present when all child objects are deleted).
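+
+Both differences can be illustrated with a minimal sketch that models a bucket as a flat Python
+dictionary from key to contents. This is an assumption made purely for illustration: real object
+stores expose listing through their own APIs, and the `list_objects` helper here is hypothetical.
+
+```python
+# A bucket modelled as a flat mapping from key to object contents.
+bucket = {
+    "2025/04/01/birthday.png": b"...",
+    "2025/04/01/cake.png": b"...",
+    "2025/04/02/hike.png": b"...",
+    "2025/05/01/beach.png": b"...",
+}
+
+
+def list_objects(bucket: dict[str, bytes], prefix: str) -> list[str]:
+    """Return every key that starts with the prefix, in sorted order."""
+    return sorted(key for key in bucket if key.startswith(prefix))
+
+
+# Objects in "subdirectories" are included, as with a recursive listing.
+print(list_objects(bucket, "2025/04/"))
+
+# Deleting the only object under a prefix makes that "directory" vanish.
+del bucket["2025/04/02/hike.png"]
+print(list_objects(bucket, "2025/04/02/"))
+```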
+
+DFS implementations often support many common filesystem operations such as hard links, symbolic
+links, file locking, and atomic renames. Such features are missing from object stores. Linking and
+locks are typically not supported, while renames are non-atomic; they're accomplished by copying the
+object to the new key, and then deleting the old object. If you want to rename a directory, you have
+to individually rename every object within it, since the directory name is a part of the key.
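+
+A rough sketch of such a directory rename, again modelling the bucket as an in-memory dictionary
+(an assumption for illustration; `rename_prefix` is a hypothetical helper, not a real client call):
+
+```python
+def rename_prefix(bucket: dict[str, bytes], old_prefix: str, new_prefix: str) -> int:
+    """Rename every object under old_prefix; returns the number of objects moved."""
+    moved = 0
+    # Snapshot the matching keys first, because the loop mutates the bucket.
+    for old_key in [k for k in bucket if k.startswith(old_prefix)]:
+        new_key = new_prefix + old_key[len(old_prefix):]
+        bucket[new_key] = bucket[old_key]  # step 1: copy to the new key
+        # A crash at this point would leave the object visible under both keys.
+        del bucket[old_key]  # step 2: delete the old object
+        moved += 1
+    return moved
+
+
+photos = {"2025/04/01/a.png": b"a", "2025/04/01/b.png": b"b", "2025/05/01/c.png": b"c"}
+rename_prefix(photos, "2025/04/01/", "archive/2025-04-01/")
+print(sorted(photos))
+```
+
+Because each object is copied and deleted separately, a reader that lists the bucket midway through
+the loop can observe some objects under the old prefix and some under the new one.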
+
+The key-value stores we discussed in [Chapter 4](/en/ch4#ch_storage) are optimized for small values
+(typically kilobytes) and frequent, low-latency reads/writes. In contrast, distributed filesystems
+and object stores are generally optimized for large objects (megabytes to gigabytes) and less
+frequent, larger reads. Recently, however, object stores have begun to add support for frequent and
+smaller reads/writes. For example, S3 Express One Zone now offers single-millisecond latency and a
+pricing model that is more similar to key-value stores.
+
+Another difference between distributed filesystems and object stores is that DFSs such as HDFS
+allow computing tasks to be run on the machine that stores a copy of a particular file. This allows
+the task to read that file without having to send it over the network, which saves bandwidth if the
+executable code of the task is smaller than the file it needs to read. On the other hand, object
+stores usually keep storage and computation separate. Doing so might use more bandwidth, but modern
+datacenter networks are very fast, so this is often acceptable. This architecture also allows
+machine resources such as CPU and memory to be scaled independently of storage since the two are
+decoupled.
+
+### Distributed Job Orchestration {#id278}
+
+Our operating system analogy also applies to job orchestration. When you execute a Unix batch job,
+something needs to actually run the `awk`, `sort`, `uniq`, and `head` processes. Data needs to be
+transferred from one process's output to another process's input, memory must be allocated for each
+process, instructions from each process must be scheduled fairly and executed on the CPU, memory and
+I/O boundaries must be enforced, and so on. On a single machine, an operating system's kernel is
+responsible for such work. In a distributed environment, this is the role of a job orchestrator.
+
+Batch processing frameworks send a request to an orchestrator's scheduler to run a job. Requests to
+start a job contain metadata such as:
+
+- the number of tasks to execute,
+
+- the amount of memory, CPU, and disk needed for each task,
+
+- a job identifier,
+
+- access credentials,
+
+- job parameters such as input and output data,
+
+- required hardware details such as GPUs or disk types, and
+
+- where the job's executable code is located.
+
+Orchestrators such as Kubernetes and Hadoop YARN (Yet Another Resource Negotiator) [^13]
+combine this information with cluster metadata to execute the job using the following components:
+
+Task executors
+
+: An executor daemon such as YARN's *NodeManager* or Kubernetes's *kubelet* runs on each node in
+ the cluster. Executors are responsible for running job tasks, sending heartbeats to signal their
+ liveness, and tracking task status and resource allocation on the node. When a task-start
+ request is sent to an executor, it retrieves the job's executable code and runs a command to
+ start the task. The executor then monitors the process until it finishes or fails, at which
+ point it updates the task status metadata accordingly.
+
+ Many executors also work with the operating system to provide both security and performance
+ isolation. YARN and Kubernetes both use Linux *cgroups*, for example. This prevents tasks from
+ accessing data without permission, or from negatively affecting the performance of other tasks
+ on the node by using excessive resources.
+
+Resource manager
+
+: An orchestrator's resource manager stores metadata about each node, including available hardware
+ (CPUs, GPUs, memory, disks, and so on), task statuses, network location, node status, and other
+ relevant information. Thus, the manager provides a global view of the cluster's current state.
+ The centralized nature of the resource manager can lead to both scalability and availability
+ bottlenecks. YARN uses ZooKeeper and Kubernetes uses etcd to store cluster state (see
+ ["Coordination Services"](/en/ch10#sec_consistency_coordination)).
+
+Scheduler
+
+: Orchestrators usually have a centralized scheduler subsystem, which receives requests to start,
+ stop, or check on the status of a job. For example, a scheduler might receive a request to start
+ a job with 10 tasks using a specific Docker image on nodes that have a specific type of GPU. The
+ scheduler uses the information from the request and state of the resource manager to determine
+ which tasks to run on which nodes. The task executors are then informed of their assigned work
+ and begin execution.
+
+Though each orchestrator uses different terminology, you will find these components in nearly all
+orchestration systems.
+
+> [!NOTE]
+> Scheduling decisions sometimes require application-specific schedulers that can take into account
+> particular requirements, such as auto-scaling read replicas when a certain query threshold is
+> reached. The centralized scheduler and application-specific schedulers work together to determine
+> how to best execute tasks. YARN refers to its sub-schedulers as *ApplicationMasters*, while
+> Kubernetes calls them *operators*.
+
+#### Resource Allocation {#id279}
+
+Schedulers have a particularly challenging role in job orchestration: they must figure out how to
+best allocate the cluster's limited resources amongst jobs with competing needs. Fundamentally, their
+decisions must balance fairness and efficiency.
+
+Imagine a small cluster with five nodes that has a total of 160 CPU cores available. The cluster's
+scheduler receives two job requests, each wanting 100 cores to complete its work. What's the best
+way to schedule the workload?
+
+- The scheduler could decide to run 80 tasks for each job, starting the remaining 20 tasks for each
+ job as earlier tasks complete.
+
+- The scheduler could run all of one job's tasks, and begin running the second job's tasks only when
+ 100 cores are available, a strategy known as *gang scheduling*.
+
+- One job request comes before the other. The scheduler has to decide whether to allocate all 100
+ cores to that job, or hold some back in anticipation of future jobs.
+
+This is a very simple example, but we already see many difficult trade-offs. In the gang-scheduling
+scenario, for example, if the scheduler reserves CPU cores until all 100 are available at the same
+time, nodes will sit idle. The cluster's resource utilization will drop and a deadlock might occur
+if other jobs also attempt to reserve CPU cores.
+
+On the other hand, if the scheduler simply waits for 100 cores to become available, other jobs might
+grab the cores in the meantime. The cluster might not have 100 cores available for a very long time,
+which leads to *starvation*. The scheduler could decide to *preempt* some of the first job's tasks,
+killing them to make room for the second job. Task preemption decreases cluster efficiency as well,
+since the killed tasks will need to be restarted later and re-run.
+
+Now imagine a scheduler that must make allocation decisions for hundreds or even millions of such
+job requests. Finding an optimal solution seems intractable. In fact, the problem is *NP-hard*,
+which means that it is prohibitively slow to calculate an optimal solution for all but the smallest
+examples [^14], [^15].
+
+In practice, schedulers therefore use heuristics to make non-optimal but reasonable decisions.
+Several algorithms are commonly used, including first-in first-out (FIFO), dominant resource
+fairness (DRF), priority queues, capacity or quota-based scheduling, and various bin-packing
+algorithms. The details for such algorithms are beyond the scope of this book, but they're a
+fascinating area of research.
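+
+To give a flavor of what such a heuristic looks like, here is a minimal sketch of *first-fit
+decreasing*, a classic bin-packing heuristic. It is not taken from any particular orchestrator, and
+it considers only CPU cores; the function name and the use of `-1` for unplaced tasks are our own
+choices.
+
+```python
+def first_fit_decreasing(task_cores: list[int], node_capacity: list[int]) -> dict[int, list[int]]:
+    """Greedily assign tasks to nodes; returns node index -> cores of tasks placed there.
+
+    Tasks that fit on no node are collected under the key -1.
+    """
+    free = list(node_capacity)
+    placement: dict[int, list[int]] = {}
+    # Placing the largest tasks first tends to leave less fragmented capacity.
+    for cores in sorted(task_cores, reverse=True):
+        for node, available in enumerate(free):
+            if cores <= available:
+                free[node] -= cores
+                placement.setdefault(node, []).append(cores)
+                break
+        else:
+            # No node had enough free cores for this task.
+            placement.setdefault(-1, []).append(cores)
+    return placement
+
+
+print(first_fit_decreasing([16, 8, 8, 24, 4, 12], [32, 32]))
+```
+
+The result is reasonable but not guaranteed to be optimal, and the sketch ignores fairness,
+priorities, and preemption entirely.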
+
+#### Scheduling Workflows {#sec_batch_workflows}
+
+The Unix tools example at the start of this chapter involved a chain of several commands, connected
+by Unix pipes. The same pattern arises in distributed batch processes: often the output from one job
+needs to become the input to one or more other jobs, and each job may have several inputs that are
+produced by other jobs. This is called a *workflow* or *directed acyclic graph (DAG)* of jobs.
+
+> [!NOTE]
+> In ["Durable Execution and Workflows"](/en/ch5#sec_encoding_dataflow_workflows) we
+> saw workflow engines that offer durable execution of a sequence of steps, typically performing RPCs.
+> In the context of batch processing, "workflow" has a different meaning: it's a sequence of batch
+> processes, each taking input data and producing output data, but normally not making RPCs to
+> external services. Durable execution engines typically process less data per request than their
+> batch processing counterparts, though the line is somewhat fuzzy.
+
+There are several reasons why a workflow of multiple jobs might be needed:
+
+- If the output of one job needs to become the input to several other jobs, which are maintained by
+ different teams, it's best for the first job to first write its output to a location where all the
+ other jobs can read it. Those consuming jobs can then be scheduled to run every time that data has
+ been updated, or on some other schedule.
+
+- You might want to transfer data from one processing tool to another. For example, a Spark job
+ might output its data to HDFS, then a Python script might trigger a Trino SQL query (see ["Cloud
+ Data Warehouses"](/en/ch4#sec_cloud_data_warehouses)) that does further processing on the HDFS
+ files and outputs to S3.
+
+- Some data pipelines internally require multiple processing stages. For example, if one stage needs
+ to shard the data by one key, and the next stage needs to shard by a different key, the first
+ stage can output data sharded in the way that is required by the second stage.
+
+In the Unix tools example, the pipe that connects the output of one command to the input of another
+uses only a small in-memory buffer, and doesn't write the data into a file. If that buffer fills up,
+the producing process needs to wait until the consuming process has read some data from the buffer
+before it can output more---a form of backpressure. Spark, Flink, and other batch execution engines
+support a similar model where the output of one task is directly passed to another task (over the
+network if the tasks are running on different machines).
+
+However, in a workflow it is more usual for one job to write its output to a distributed filesystem
+or object store, and for the next job to read it from there. This decouples the jobs from each
+other, allowing them to run at different times. If a job has several inputs, a workflow scheduler
+typically waits until all of the jobs that produce its inputs have completed successfully before
+running the job that consumes those inputs.
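+
+This waiting behavior amounts to running jobs in topological order of the DAG. The following sketch
+uses `graphlib.TopologicalSorter` from the Python standard library (Python 3.9+); the job names are
+hypothetical, and a real workflow scheduler would launch jobs and wait for them rather than just
+recording their names.
+
+```python
+from graphlib import TopologicalSorter
+
+# Hypothetical workflow: each job maps to the set of jobs whose output it reads.
+workflow = {
+    "clean_logs": set(),
+    "load_users": set(),
+    "join_activity": {"clean_logs", "load_users"},
+    "daily_report": {"join_activity"},
+}
+
+
+def run_workflow(dependencies: dict[str, set[str]]) -> list[str]:
+    """Run jobs in dependency order and return the order in which they ran."""
+    sorter = TopologicalSorter(dependencies)
+    sorter.prepare()  # raises graphlib.CycleError if the graph is not a DAG
+    completed = []
+    while sorter.is_active():
+        # get_ready() returns only jobs whose upstream jobs are all marked done.
+        for job in sorted(sorter.get_ready()):
+            completed.append(job)  # a real scheduler would launch the job here
+            sorter.done(job)
+    return completed
+
+
+print(run_workflow(workflow))
+```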
+
+Schedulers found in orchestration frameworks such as YARN's ResourceManager or Spark's built-in
+scheduler do not manage entire workflows; they do scheduling on a per-job basis. To handle these
+dependencies between job executions, various workflow schedulers have been developed, including
+Airflow, Dagster, and Prefect. Workflow schedulers have management features that are useful when
+maintaining a large collection of batch jobs. Workflows consisting of 50 to 100 jobs are common in
+many data pipelines, and in a large organization, many different teams may be running different jobs
+or workflows that read each other's output across many different systems. Tool support is important
+for managing such complex dataflows.
+
+#### Handling Faults {#id281}
+
+Batch jobs often run for long periods of time. Long-running jobs with many parallel tasks are likely
+to experience at least one task failure along the way. As discussed in ["Hardware and Software
+Faults"](/en/ch2#sec_introduction_hardware_faults) and ["Unreliable
+Networks"](/en/ch9#sec_distributed_networks), there are many reasons why this could happen,
+including hardware faults (especially on commodity hardware), or network interruptions.
+
+Another reason why a task might not finish running is that the scheduler may intentionally preempt
+(kill) it. Preemption is particularly useful if you have multiple priority levels: low-priority
+tasks that are cheaper to run, and high-priority tasks that cost more. Low-priority tasks can run
+whenever there is spare computing capacity, but they run the risk of being preempted at any moment
+if a higher-priority task arrives. Such cheaper, low-priority virtual machines are called *spot
+instances* on Amazon EC2, *spot virtual machines* on Azure, and *preemptible instances* on Google
+Cloud [^16].
+
+Since batch processing is often used for jobs that are not time-sensitive, it is well suited for
+using low-priority tasks and spot instances to reduce the cost of running jobs. Essentially, those
+jobs can use spare computing resources that would otherwise be idle, and thereby increase the
+utilization of the cluster. However, this also means that those tasks are more likely to be killed
+by the scheduler: preemptions occur more frequently than hardware faults [^17].
+
+Since batch jobs regenerate their output from scratch every time they are run, task failures are
+easier to handle than in online systems: the system can delete the partial output from the failed
+execution and schedule it to run again on another machine. It would be wasteful to rerun the entire
+job due to a single task failure, though. MapReduce and its successors therefore keep the execution
+of parallel tasks independent from each other, so that they can retry work at the granularity of an
+individual task [^3].
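+
+Task-level retry can be sketched as follows. This is a simplified, single-process illustration, not
+how any particular framework implements it: `run_job` and `flaky_task` are hypothetical names, and
+a failure is simulated by raising `RuntimeError`.
+
+```python
+def run_job(task_ids, run_task, max_attempts=3):
+    """Run each task independently, retrying only the tasks that fail."""
+    results, attempts = {}, {}
+    for task_id in task_ids:
+        for attempt in range(1, max_attempts + 1):
+            attempts[task_id] = attempt
+            try:
+                results[task_id] = run_task(task_id, attempt)
+                break
+            except RuntimeError:
+                # A real framework would delete the task's partial output and
+                # reschedule it elsewhere; this sketch simply tries again.
+                continue
+        else:
+            raise RuntimeError(f"task {task_id} failed {max_attempts} times")
+    return results, attempts
+
+
+def flaky_task(task_id, attempt):
+    # Hypothetical task: task 2 is "preempted" on its first attempt.
+    if task_id == 2 and attempt == 1:
+        raise RuntimeError("preempted")
+    return task_id * 10
+
+
+results, attempts = run_job([1, 2, 3], flaky_task)
+print(results, attempts)
+```
+
+Only task 2 runs twice; the results of tasks 1 and 3 are kept, which is the point of retrying at
+task granularity.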
+
+Fault tolerance is trickier when the output of one task becomes the input to another task as part of
+a workflow. MapReduce solves this by always writing such intermediate data back to the distributed
+filesystem, and waiting for the writing task to complete successfully before allowing other tasks to
+read the data. This works, even in an environment where preemption is common, but it means a lot of
+writes to the DFS, which can be inefficient.
+
+Spark keeps intermediate data in memory or "spills" to local disk, and only writes the final result
+to the DFS. It also keeps track of how the intermediate data was computed, allowing Spark to
+recompute it in case it is lost [^18]. Flink uses a different approach based on
+periodically checkpointing a snapshot of tasks [^19]. We will return to this topic in
+["Dataflow Engines"](/en/ch11#sec_batch_dataflow).
+
+## Batch Processing Models {#id431}
+
+We have seen how batch jobs are scheduled in a distributed environment. Let us now turn our
+attention to how batch processing frameworks actually process data. The two most common models are
+MapReduce and dataflow engines. Although dataflow engines have largely replaced MapReduce in
+practice, it is useful to understand how MapReduce works, since it influenced many modern batch
+processing frameworks.
+
+MapReduce and dataflow engines have evolved to support multiple programming models including
+low-level programmatic APIs, relational query languages, and DataFrame APIs. A variety of options
+enable application engineers, analytics engineers, business analysts, and even non-technical
+employees to process company data for various use cases, which we'll discuss in ["Batch Use
+Cases"](/en/ch11#sec_batch_output).
+
+### MapReduce {#sec_batch_mapreduce}
+
+The pattern of data processing in MapReduce is very similar to the web server log analysis example
+in ["Simple Log Analysis"](/en/ch11#sec_batch_log_analysis):
+
+1. Read a set of input files, and break it up into *records*. In the web server log example, each
+ record is one line in the log (that is, `\n` is the record separator). In Hadoop's MapReduce,
+ the input file is stored in a distributed filesystem like HDFS or an object store like S3.
+ Various file formats are used, such as Apache Parquet (a columnar format, see ["Column-Oriented
+ Storage"](/en/ch4#sec_storage_column)) or Apache Avro (a row-based format, see
+ ["Avro"](/en/ch5#sec_encoding_avro)).
+
+2. Call the mapper function to extract a key and value from each input record. In the Unix tool
+ example, the mapper function is `awk '{print $7}'`: it extracts the URL (`$7`) as the key, and
+ leaves the value empty.
+
+3. Sort all of the key-value pairs by key. In the log example, this is done by the first `sort`
+ command.
+
+4. Call the reducer function to iterate over the sorted key-value pairs. If there are multiple
+ occurrences of the same key, the sorting has made them adjacent in the list, so it is easy to
+ combine those values without having to keep a lot of state in memory. In the Unix tool example,
+ the reducer is implemented by the command `uniq -c`, which counts the number of adjacent records
+ with the same key.
+
+Those four steps can be performed by one MapReduce job. Steps 2 (map) and 4 (reduce) are where you
+write your custom data processing code. Step 1 (breaking files into records) is handled by the input
+format parser. Step 3, the `sort` step, is implicit in MapReduce---you don't have to write it,
+because the output from the mapper is always sorted before it is given to the reducer. This sorting
+step is a foundational batch processing algorithm, which we'll revisit in ["Shuffling
+Data"](/en/ch11#sec_shuffle).
+
+To create a MapReduce job, you need to implement two callback functions, the mapper and reducer,
+which behave as follows:
+
+Mapper
+
+: The mapper is called once for every input record, and its job is to extract the key and value
+ from the input record. For each input, it may generate any number of key-value pairs (including
+ none). It does not keep any state from one input record to the next, so each record is handled
+ independently.
+
+Reducer
+
+: The MapReduce framework takes the key-value pairs produced by the mappers, collects all the
+ values belonging to the same key, and calls the reducer with an iterator over that collection of
+ values. The reducer can produce output records (such as the number of occurrences of the same
+ URL).
+
+In the web server log example, we had a second `sort` command in step 5, which ranked URLs by number
+of requests. In MapReduce, if you need a second sorting stage, you can implement it by writing a
+second MapReduce job and using the output of the first job as input to the second job. Viewed like
+this, the role of the mapper is to prepare the data by putting it into a form that is suitable for
+sorting, and the role of the reducer is to process the data that has been sorted.
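+
+To make the model concrete, here is a single-process sketch of the URL-counting job. It only mimics
+the programming model (it is not the Hadoop API), the log lines are made-up sample data, and the
+`map_reduce` driver is a hypothetical stand-in for the framework.
+
+```python
+from itertools import groupby
+from operator import itemgetter
+
+
+def mapper(line):
+    """Emit (url, None): the URL is the 7th whitespace-separated field."""
+    fields = line.split()
+    if len(fields) >= 7:
+        yield fields[6], None
+
+
+def reducer(url, values):
+    """Count how many values were emitted for this URL."""
+    yield url, sum(1 for _ in values)
+
+
+def map_reduce(records, mapper, reducer):
+    # Step 2: call the mapper on every record.
+    pairs = [pair for record in records for pair in mapper(record)]
+    # Step 3: sort by key, which the framework does implicitly.
+    pairs.sort(key=itemgetter(0))
+    # Step 4: call the reducer once per key with an iterator over its values.
+    output = []
+    for key, group in groupby(pairs, key=itemgetter(0)):
+        output.extend(reducer(key, (value for _, value in group)))
+    return output
+
+
+log = [
+    '216.58.210.78 - - [27/Jun/2025:17:55:11 +0000] "GET /index.html HTTP/1.1" 200 3377',
+    '216.58.210.78 - - [27/Jun/2025:17:55:12 +0000] "GET /about.html HTTP/1.1" 200 1024',
+    '216.58.210.79 - - [27/Jun/2025:17:55:13 +0000] "GET /index.html HTTP/1.1" 200 3377',
+]
+print(map_reduce(log, mapper, reducer))
+```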
+
+> [!TIP] MAPREDUCE AND FUNCTIONAL PROGRAMMING
+> Though MapReduce is used for batch processing, the programming model comes from functional
+> programming. Lisp introduced *map* and *reduce* (or *fold*) as higher-order functions on lists, and
+> they have made their way into mainstream languages such as Python, Rust, and Java. Many common data
+> processing operations, including those offered by SQL, can be implemented on top of MapReduce. Both
+> functions, and functional programming in general, have important properties that MapReduce benefits
+> from. Map and reduce are composable, which fits nicely with data processing (as we saw in our Unix
+> example). Map is also *embarrassingly parallel* (each input is processed independently), which
+> simplifies MapReduce's parallel execution. For reduce, different keys can be processed in parallel.
+
+Implementing a complex processing job using the raw MapReduce APIs is actually quite hard and
+laborious---for instance, any join algorithms used by the job would need to be implemented from
+scratch [^20]. MapReduce is also quite slow compared to more modern batch processors. One
+reason is that its file-based I/O prevents job pipelining, i.e., processing output data in a
+downstream job before the upstream job is complete.
+
+### Dataflow Engines {#sec_batch_dataflow}
+
+In order to fix some of MapReduce's problems, several new execution engines for distributed batch
+computations were developed, the most well known of which are Spark [^18], [^21] and
+Flink [^19]. There are various differences in the way they are designed, but they have one
+thing in common: they handle an entire workflow as one job, rather than breaking it up into
+independent subjobs.
+
+Since they explicitly model the flow of data through several processing stages, these systems are
+known as *dataflow engines*. Like MapReduce, they support a low-level API that repeatedly calls a
+user-defined function to process one record at a time, but they also offer higher-level operators
+such as *join* and *group by*. They parallelize work by sharding inputs, and they copy the output of
+one task over the network to become the input to another task. Unlike in MapReduce, operators need
+not take the strict roles of alternating map and reduce, but instead can be assembled in more
+flexible ways.
+
+These dataflow APIs generally use relational-style building blocks to express a computation: joining
+datasets on the value of some field; grouping tuples by key; filtering by some condition; and
+aggregating tuples by counting, summing, or other functions. Internally, these operations are
+implemented using the shuffle algorithms that we discuss in the next section.
+
+This style of processing engine is based on research systems like Dryad [^22] and Nephele
+[^23], and it offers several advantages compared to the MapReduce model:
+
+- Expensive work such as sorting need only be performed in places where it is actually required,
+ rather than always happening by default between every map and reduce stage.
+
+- When there are several operators in a row that don't change the sharding of the dataset (such as
+ map or filter), they can be combined into a single task, reducing data copying overheads.
+
+- Because all joins and data dependencies in a workflow are explicitly declared, the scheduler has
+ an overview of what data is required where, so it can make locality optimizations. For example, it
+ can try to place the task that consumes some data on the same machine as the task that produces
+ it, so that the data can be exchanged through a shared memory buffer rather than having to copy it
+ over the network.
+
+- It is usually sufficient for intermediate state between operators to be kept in memory or written
+ to local disk, which requires less I/O than writing it to a distributed filesystem or object store
+ (where it must be replicated to several machines and written to disk on each replica). MapReduce
+ already uses this optimization for mapper output, but dataflow engines generalize the idea to all
+ intermediate state.
+
+- Operators can start executing as soon as their input is ready; there is no need to wait for the
+ entire preceding stage to finish before the next one starts.
+
+- Existing processes can be reused to run new operators, reducing startup overheads compared to
+ MapReduce (which launches a new JVM for each task).
+
+You can use dataflow engines to implement the same computations as MapReduce workflows, and they
+usually execute significantly faster due to the optimizations described here.
+
+### Shuffling Data {#sec_shuffle}
+
+We saw that both the Unix tools example at the beginning of the chapter and MapReduce are based on
+sorting. Batch processors need to be able to sort datasets petabytes in size, which are too large to
+fit on a single machine. They therefore require a distributed sorting algorithm where both the input
+and the output are sharded. Such an algorithm is called a *shuffle*.
+
+> [!NOTE] SHUFFLE IS NOT RANDOM
+> The term *shuffle* is confusing. When you shuffle a deck of cards, you end up with a random order.
+> In contrast, the shuffle we're talking about here produces a sorted order, with no randomness.
+
+Shuffling is a foundational algorithm for batch processors, where it is used for joins and
+aggregations. MapReduce, Spark, Flink, Daft, Dataflow, and BigQuery [^24] all implement
+scalable and performant shuffle algorithms in order to handle large datasets. We'll use the shuffle
+in Hadoop MapReduce [^25] for illustration purposes, but the concepts in this section
+translate to other systems as well.
+
+[Figure 11-1](/en/ch11#fig_batch_mapreduce) shows the dataflow in a MapReduce job. We assume that
+the input to the job is sharded, and the shards are labelled *m 1*, *m 2*, and *m 3*. For example,
+each shard may be a separate file on HDFS or a separate object in an object store, and all the
+shards belonging to the same dataset are grouped into the same HDFS directory or have the same key
+prefix in an object store bucket.
+
+{{< figure src="/fig/ddia_1101.png" id="fig_batch_mapreduce" caption="Figure 11-1. A MapReduce job with three mappers and three reducers." class="w-full my-4" >}}
+
+The framework starts a separate map task for each input shard. A task reads its assigned file,
+passing one record at a time to the mapper callback. The reduce side of the computation is also
+sharded. While the number of map tasks is determined by the number of input shards, the number of
+reduce tasks is configured by the job author (it can be different from the number of map tasks).
+
+The output of the mapper consists of key-value pairs, and the framework needs to ensure that if two
+different mappers output the same key, those key-value pairs end up being processed by the same
+reducer task. To achieve this, each mapper creates a separate output file on its local disk for
+every reducer (for example, the file *m 1, r 2* in [Figure 11-1](/en/ch11#fig_batch_mapreduce) is
+the file created by mapper 1 containing the data destined for reducer 2). When the mapper outputs a
+key-value pair, a hash of the key typically determines which reducer file it is written to
+(similarly to ["Sharding by Hash of Key"](/en/ch7#sec_sharding_hash)).
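As a rough sketch of the partitioning step described above, the following Python snippet routes each key-value pair to one of a fixed number of reducer buckets. The function names, the sample data, and the choice of CRC32 are illustrative assumptions rather than Hadoop's actual partitioner; the only property that matters is that every mapper computes the same hash for the same key.

```python
import zlib
from collections import defaultdict


def reducer_for(key: str, num_reducers: int) -> int:
    """Pick a reducer index from a deterministic hash of the key."""
    # zlib.crc32 gives the same value in every process, unlike Python's
    # built-in hash() for strings, which is randomized per process.
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def partition(pairs, num_reducers):
    """Group one mapper's output into one bucket per reducer."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[reducer_for(key, num_reducers)].append((key, value))
    return buckets


# Two mappers (made-up data) that emit the same key route it to the same reducer.
mapper_1 = partition([("alice", "/home"), ("bob", "/cart")], num_reducers=3)
mapper_2 = partition([("alice", "/checkout")], num_reducers=3)
```

Each bucket here plays the role of one of the per-reducer files (such as *m 1, r 2*) that a mapper writes to its local disk.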
+
+While a mapper is writing these files, it also sorts the key-value pairs within each file. This can
+be done using the techniques we saw in ["Log-Structured
+Storage"](/en/ch4#sec_storage_log_structured): batches of key-value pairs are first collected in a
+sorted data structure in memory, then written out as sorted segment files, and smaller segment files
+are progressively merged into larger ones.
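The sort-then-merge approach in the previous paragraph can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the "segments" are in-memory lists rather than files on disk, and `max_in_memory` is a hypothetical parameter standing in for the size of the mapper's sort buffer.

```python
import heapq


def sorted_runs(pairs, max_in_memory):
    """Yield sorted runs of at most max_in_memory key-value pairs."""
    buffer = []
    for pair in pairs:
        buffer.append(pair)
        if len(buffer) >= max_in_memory:
            yield sorted(buffer)  # stands in for writing a sorted segment file
            buffer = []
    if buffer:
        yield sorted(buffer)


def merge_runs(runs):
    """Merge already-sorted runs into a single sorted sequence."""
    return list(heapq.merge(*runs))


pairs = [("d", 4), ("a", 1), ("c", 3), ("b", 2), ("e", 5)]
runs = list(sorted_runs(pairs, max_in_memory=2))
merged = merge_runs(runs)
```

Because `heapq.merge` only looks at the head of each run, the merge step needs memory proportional to the number of runs, not to the total amount of data.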
+
+After each mapper finishes, reducers connect to it and copy the appropriate file of sorted key-value
+pairs to their local disk. Once the reduce task has its share of the output from all of the mappers,
+it merges these files together, preserving the sort order, mergesort-style. Key-value pairs with the
+same key are now consecutive, even if they came from different mappers. The reducer function is then
+called once per key, each time with an iterator that returns all the values for that key.
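To make the reduce side concrete, here is a minimal Python sketch of merging sorted mapper outputs and invoking a reducer once per key. The helper names (`run_reducer`, `count_values`) and the sample data are assumptions for illustration; a real framework streams these files from disk rather than holding them in lists.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def run_reducer(sorted_files, reduce_fn):
    """Merge sorted mapper outputs and call reduce_fn once per key."""
    # The merge preserves sort order, so equal keys become consecutive.
    merged = heapq.merge(*sorted_files, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        values = (value for _, value in group)  # iterator over this key's values
        yield from reduce_fn(key, values)


def count_values(key, values):
    """Example reducer: count how many values were seen for the key."""
    yield key, sum(1 for _ in values)


from_mapper_1 = [("apple", 1), ("cherry", 1)]
from_mapper_2 = [("apple", 1), ("banana", 1), ("cherry", 1)]
result = list(run_reducer([from_mapper_1, from_mapper_2], count_values))
```

The reducer never sees the whole dataset at once: it receives one key and a lazy iterator over that key's values, which is what allows the values for a single key to exceed available memory.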
+
+Any records output by the reducer function are sequentially written to a file, with one file per
+reduce task. These files (*r 1*, *r 2*, *r 3* in [Figure 11-1](/en/ch11#fig_batch_mapreduce)) become
+the shards of the job's output dataset, and they are written back to the distributed filesystem or
+object store.
+
+Though MapReduce executes the shuffle step between its map and reduce steps, modern dataflow engines
+and cloud data warehouses are more sophisticated. Systems such as BigQuery have optimized their
+shuffle algorithms to keep data in memory and to write data to external sorting services
+[^24]. Such services speed up shuffling and replicate shuffled data to provide resilience.
+
+#### JOIN and GROUP BY {#sec_batch_join}
+
+Let's look at how sorted data simplifies distributed joins and aggregations. We'll continue with
+MapReduce for illustration purposes, though these concepts apply to most batch processing systems.
+
+A typical example of a join in a batch job is illustrated in
+[Figure 11-2](/en/ch11#fig_batch_join_example). On the left is a log of events describing the things
+that logged-in users did on a website (known as *activity events* or *clickstream data*), and on the
+right is a database of users. You can think of this example as being part of a star schema (see
+["Stars and Snowflakes: Schemas for Analytics"](/en/ch3#sec_datamodels_analytics)): the log of
+events is the fact table, and the user database is one of the dimensions.
+
+{{< figure src="/fig/ddia_1102.png" id="fig_batch_join_example" caption="Figure 11-2. A join between a log of user activity events and a database of user profiles." class="w-full my-4" >}}
+
+If you want to perform an analysis of the activity events that takes into account information from
+the user database (for example, find out whether certain pages are more popular with younger or
+older users, using the date of birth field in the user profile), you need to compute a join between
+these two tables. How would you compute that join, assuming both tables are so large that they have
+to be sharded?
+
+You can use the fact that in MapReduce, the shuffle brings together all the key-value pairs with the
+same key to the same reducer, no matter which shard they were on originally. Here, the user ID can
+serve as the key. You can therefore write a mapper that goes over the user activity events, and
+emits page view URLs keyed by user ID, as illustrated in
+[Figure 11-3](/en/ch11#fig_batch_join_reduce). Another mapper goes over the user database row by
+row, extracting the user ID as the key and the user's date of birth as the value.
+
+{{< figure src="/fig/ddia_1103.png" id="fig_batch_join_reduce" caption="Figure 11-3. A sort-merge join on user ID. If the input datasets are sharded into multiple files, each could be processed with multiple mappers in parallel." class="w-full my-4" >}}
+
+The shuffle then ensures that a reducer function can access a particular user's date of birth and
+all of that user's page view events at the same time. The MapReduce job can even arrange the records
+to be sorted such that the reducer always sees the record from the user database first, followed by
+the activity events in timestamp order---this technique is known as a *secondary sort*
+[^25].
+
+The reducer can then perform the actual join logic easily. The first value is expected to be the
+date of birth, which the reducer stores in a local variable. It then iterates over the activity
+events with the same user ID, outputting each viewed URL along with the viewer's date of birth.
+Since the reducer processes all of the records for a particular user ID in one go, it only needs to
+keep one user record in memory at any one time, and it never needs to make any requests over the
+network. This algorithm is known as a *sort-merge join*, since mapper output is sorted by key, and
+the reducers then merge together the sorted lists of records from both sides of the join.
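The reducer logic for this join might look roughly like the following Python sketch. The record layout `(user_id, tag, timestamp, payload)`, the tag constants, and the sample values are all made up for illustration; sorting the combined list stands in for the shuffle and secondary sort that the framework would perform.

```python
from itertools import groupby

USER, EVENT = 0, 1  # tag values chosen so that USER sorts before EVENT

# Hypothetical mapper output: (user_id, tag, timestamp, payload).
user_records = [(104, USER, 0, "1989-01-30"), (173, USER, 0, "1975-06-12")]
events = [(173, EVENT, 20, "/x"), (104, EVENT, 12, "/x"), (104, EVENT, 5, "/y")]


def join_reducer(user_id, records):
    """Join one user's date of birth with that user's page views."""
    date_of_birth = None
    for _, tag, _, payload in records:
        if tag == USER:
            date_of_birth = payload  # only one user record is kept in memory
        else:
            yield payload, date_of_birth


# Sorting by (user_id, tag, timestamp) mimics the shuffle with secondary sort:
# each user's profile record arrives first, then events in timestamp order.
shuffled = sorted(user_records + events)
joined = [
    row
    for user_id, group in groupby(shuffled, key=lambda r: r[0])
    for row in join_reducer(user_id, group)
]
```

If an activity event has no matching user record, this sketch emits `None` as the date of birth, which corresponds to a left outer join; an inner join would skip such events instead.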
+
+The next MapReduce job in the workflow can then calculate the distribution of viewer ages for each
+URL. To do so, the job would first shuffle the data using the URL as key. Once sorted, the reducers
+would then iterate over all the page views (with viewer birth date) for a single URL, keep a counter
+for the number of views by each age group, and increment the appropriate counter for each page view.
+This way you can implement a *group by* operation and aggregation.
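A minimal sketch of this second job, again in plain Python, could look as follows. For brevity it assumes the previous job emitted a birth year rather than a full date, and the `age_group` bucketing and `current_year` default are hypothetical choices, not something the example in the text prescribes.

```python
from collections import Counter
from itertools import groupby


def age_group(birth_year, current_year=2025):
    """Bucket a viewer into a decade-wide age group such as '30-39'."""
    decade = (current_year - birth_year) // 10 * 10
    return f"{decade}-{decade + 9}"


def reduce_url(url, birth_years):
    """Count page views per age group for a single URL."""
    counts = Counter()
    for year in birth_years:
        counts[age_group(year)] += 1  # increment the counter for this view
    return url, dict(counts)


# Made-up output of the join job: (url, viewer birth year).
page_views = [("/x", 1989), ("/y", 1989), ("/x", 1975), ("/x", 1991)]
shuffled = sorted(page_views)  # stands in for shuffling with the URL as key
result = [
    reduce_url(url, (year for _, year in group))
    for url, group in groupby(shuffled, key=lambda r: r[0])
]
```

In SQL terms this corresponds roughly to a `GROUP BY url, age_group` with a `COUNT(*)` aggregate.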
+
+### Query Languages {#sec_batch_query_lanauges}
+
+Over the years, execution engines for distributed batch processing have matured. By now, the
+infrastructure has become robust enough to store and process many petabytes of data on clusters of
+over 10,000 machines. As the problem of physically operating batch processes at such scale has been
+considered more or less solved, attention has turned to improving the programming model.
+
+MapReduce, dataflow engines, and cloud data warehouses have all embraced SQL as the lingua franca
+for batch processing. It's a natural fit: legacy data warehouses used SQL, data analytics and ETL
+tools already support SQL, and most developers and analysts already know it.
+
+Besides the obvious advantage of requiring less code than handwritten MapReduce jobs, these query
+language interfaces also allow interactive use, in which you write analytical queries and run them
+from a terminal or GUI. This style of interactive querying is an efficient and natural way for
+business analysts, product managers, sales and finance teams, and others to explore data in a batch
+processing environment. Though not a classic form of batch processing, SQL support has made
+exploratory queries suitable for distributed batch processing systems.
+
+High-level query languages not only make the humans using the system more productive, but they also
+improve the job execution efficiency at a machine level. As we saw in ["Cloud Data
+Warehouses"](/en/ch4#sec_cloud_data_warehouses), query engines are responsible for converting SQL
+queries into batch jobs to be executed in a cluster. This translation step from query to syntax tree
+to physical operators allows the engine to optimize queries. Query engines such as Hive, Trino,
+Spark, and Flink have cost-based query optimizers that can analyze the properties of join inputs and
+automatically decide which algorithm would be most suitable for the task at hand. Optimizers might
+even change the order of joins so that the amount of intermediate state is minimized [^19],
+[^26], [^27], [^28].
+
+While SQL is the most popular general-purpose batch processing query language, other languages
+remain in use for niche use cases. Apache Pig was a language based on relational operators that
+allowed data pipelines to be specified step by step, rather than as one big SQL query. DataFrames
+(see next section) have similar characteristics, and Morel is a more modern language influenced by
+Pig. Other users have adopted JSON query languages such as jq, JMESPath, or JsonPath.
+
+In ["Graph-Like Data Models"](/en/ch3#sec_datamodels_graph) we discussed using graphs for modeling
+data, and using graph query languages to traverse the edges and vertices in a graph. Many graph
+processing frameworks also support batch computation through query languages such as Apache
+TinkerPop's Gremlin. We will look at graph processing use cases in more detail in ["Batch Use
+Cases"](/en/ch11#sec_batch_output).
+
+> [!TIP] BATCH PROCESSING AND CLOUD DATA WAREHOUSES CONVERGE
+> Historically, data warehouses ran on specialized hardware appliances, and provided SQL analytics
+> queries over relational data. In contrast, batch processing frameworks like MapReduce set out to
+> provide greater scalability and greater flexibility by supporting processing logic written in a
+> general-purpose programming language, allowing it to read and write arbitrary data formats.
+>
+> Over time, the two have become much more similar. Modern batch processing frameworks now support SQL
+> as a language for writing batch jobs, and they achieve good performance on relational queries by
+> using columnar storage formats such as Parquet and optimized query execution engines (see ["Query
+> Execution: Compilation and Vectorization"](/en/ch4#sec_storage_vectorized)).
+> Meanwhile, data warehouses have grown more scalable by moving to the cloud (see ["Cloud Data
+> Warehouses"](/en/ch4#sec_cloud_data_warehouses)), and implementing many of the
+> same scheduling, fault tolerance, and shuffling techniques that distributed batch frameworks do.
+> Many use distributed filesystems as well.
+>
+> Just as batch processing systems adopted SQL as a processing model, cloud warehouses have adopted
+> alternative processing models such as DataFrames as well (discussed in the next section). For
+> example, Google Cloud BigQuery offers a BigQuery DataFrames library and Snowflake's Snowpark
+> integrates with Pandas. Batch processing workflow orchestrators such as Airflow, Prefect, and
+> Dagster also integrate with cloud warehouses.
+>
+> Not all batch jobs are easily expressed in SQL, though. Iterative graph algorithms such as PageRank,
+> complex machine learning, and many other tasks are difficult to express in SQL. AI data processing,
+> which includes non-relational and multi-modal data such as images, video, and audio, can also be
+> difficult to do in SQL.
+>
+> Moreover, cloud data warehouses struggle with certain workloads. Row-by-row computation is less
+> efficient when using column-oriented storage formats. Alternative warehouse APIs or a batch
+> processing system are preferable in such cases. Cloud data warehouses also tend to be more expensive
+> than other batch processing systems. It can be more cost-efficient to run large jobs in batch
+> processing systems such as Spark or Flink instead.
+>
+> Ultimately, the decision between processing data in batch systems or data warehouses comes down to
+> factors such as cost, convenience, ease of implementation, availability, and so on. Most large
+> enterprises have many data processing systems, which give them flexibility in this decision. Smaller
+> companies often get by with just one.
+
+### DataFrames {#id287}
+
+As data scientists and statisticians began using distributed batch processing frameworks for machine
+learning use cases, they found existing processing models cumbersome, as they were used to working
+with the DataFrame data model found in R and Pandas (see ["DataFrames, Matrices, and
+Arrays"](/en/ch3#sec_datamodels_dataframes)). A DataFrame is similar to a table in a relational
+database: it is a collection of rows, and all the values in the same column have the same type.
+Instead of writing one big SQL query, users call functions corresponding to relational operators to
+perform filters, joins, sorting, group by, and other operations.
+
+Originally, DataFrame manipulation typically occurred locally, in memory. Consequently, DataFrames
+were limited to datasets that fit on a single machine. Data scientists wanted to interact with the
+large datasets found in batch processing environments using the DataFrame APIs they were used to.
+Distributed data processing frameworks such as Spark, Flink, and Daft have adopted DataFrame APIs to
+meet this need. However, local DataFrames are usually indexed and ordered, while distributed
+DataFrames generally are not [^29]. This difference can lead to performance surprises
+when migrating to batch frameworks.
+
+DataFrame APIs appear similar to dataflow APIs, but implementations vary. While Pandas executes
+operations immediately when the DataFrame methods are called, Apache Spark first translates all the
+DataFrame API calls into a query plan and runs query optimization before executing the workflow on
+top of its distributed dataflow engine. This allows it to improve performance.
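The difference between eager and lazy execution can be illustrated with a toy class. The `LazyFrame` below is a hypothetical, heavily simplified stand-in written for this example; it is not the Spark or Pandas API. Method calls only record a plan, and work happens when `collect()` is called, at which point the whole plan is applied in a single pass over the data.

```python
class LazyFrame:
    """Toy lazy DataFrame: records operations, runs them only on collect()."""

    def __init__(self, rows, plan=()):
        self._rows = rows
        self._plan = plan  # tuple of ("filter" | "select", argument)

    def filter(self, predicate):
        return LazyFrame(self._rows, self._plan + (("filter", predicate),))

    def select(self, *columns):
        return LazyFrame(self._rows, self._plan + (("select", columns),))

    def explain(self):
        """Return the recorded operations without executing anything."""
        return [op for op, _ in self._plan]

    def collect(self):
        out = []
        for row in self._rows:  # one pass applies the whole plan to each row
            keep = True
            for op, arg in self._plan:
                if op == "filter":
                    if not arg(row):
                        keep = False
                        break
                else:
                    row = {column: row[column] for column in arg}
            if keep:
                out.append(row)
        return out


rows = [{"url": "/x", "age": 36}, {"url": "/y", "age": 17}]
query = LazyFrame(rows).filter(lambda r: r["age"] >= 18).select("url")
plan = query.explain()    # nothing has been executed yet
adults = query.collect()  # execution happens here
```

An eager library would materialize an intermediate result after each call. Deferring execution is what gives an engine the chance to inspect the full plan and optimize it before running anything.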
+
+Frameworks such as Daft even support both client and server-side computation. Smaller, in-memory
+operations are executed on the client while larger datasets and computation are executed on the
+server. Columnar storage formats such as Apache Arrow offer a unified data model that both client
+and server-side execution engines can share.
+
+## Batch Use Cases {#sec_batch_output}
+
+Now that we've seen how batch processing works, let's see how it is applied to a range of different
+applications. Batch jobs are excellent for processing large datasets in bulk, but they aren't good
+for low latency use cases. Consequently, you'll find batch jobs wherever there's a lot of data and
+data freshness isn't important. This might sound limiting, but it turns out that a significant
+amount of data processing fits this model:
+
+- Accounting and inventory reconciliation, where companies verify that transactions line up with
+ their bank accounts and inventory, are often done in batch [^30].
+
+- In manufacturing, demand forecasting is computed in periodic batch jobs [^31].
+
+- Ecommerce, media, and social media companies train their recommendation models using batch jobs
+ [^32], [^33].
+
+- Many financial systems are batch-based, as well. For example, the United States's banking network
+ runs almost entirely on batch jobs [^34].
+
+In the following sections, we'll discuss some of the batch processing use cases you'll find in
+nearly every industry.
+
+### Extract--Transform--Load (ETL) {#sec_batch_etl_usage}
+
+["Data Warehousing"](/en/ch1#sec_introduction_dwh) introduced the idea of ETL and ELT, where a data
+processing pipeline extracts data from a production database, transforms it, and loads results into
+a downstream system (we'll use "ETL" in this section to represent both ETL and ELT workloads). Batch
+jobs are often used for such workloads, especially when the downstream system is a data warehouse.
+
+The parallel nature of batch jobs makes them a great fit for data transformation. Much of data
+transformation involves "embarrassingly parallel" workloads. Filtering data, projecting fields, and
+many other common data warehouse transformations can all be done in parallel.
+
+Batch processing environments also come with robust workflow schedulers, which make it easy to
+schedule, orchestrate, and debug ETL data pipeline jobs. When a failure occurs, schedulers often
+retry jobs to mitigate transient issues that might occur. A job that fails repeatedly will be marked
+as failed, which helps developers easily see which job in their data pipeline stopped working.
+Schedulers like Airflow even come with built-in source, sink, and query operators for MySQL,
+PostgreSQL, Snowflake, Spark, Flink, and dozens of other popular systems. A tight integration
+between schedulers and data processing systems simplifies data integration.
+
+We've also seen that batch jobs are easy to troubleshoot and fix when things go awry. This feature
+is invaluable when debugging data pipelines. Failed files can be easily inspected to see what went
+wrong, and ETL batch jobs can be fixed and re-run. For example, an input file might no longer
+contain a field that a transformation batch job intends to use. Data engineers will see that the
+field is missing, and update the transformation logic or the job that produced the input.
+
+Data pipelines used to be managed by a single data engineering team, as it was considered unfair to
+ask other teams working on product features to write and manage complex batch data pipelines.
+Recently, improvements in batch processing models and metadata management have made it much easier
+for engineers across an organization to contribute to and manage their own data pipelines. *Data
+mesh* [^35], [^36], *data contract* [^37], and *data fabric*
+[^38] practices provide standards and tools to help teams safely publish their data for
+consumption by anybody in the organization.
+
+Data pipelines and analytic queries have begun to share not only processing models, but execution
+engines as well. Many batch ETL jobs now run on the same systems as the analytic queries that read
+their output. It is not uncommon to see data pipeline transformations and analytic queries both run
+as SparkSQL, Trino, or DuckDB queries. Such an architecture further blurs the line between
+application engineering, data engineering, analytics engineering, and business analysis.
+
+### Analytics {#sec_batch_olap}
+
+In ["Operational Versus Analytical Systems"](/en/ch1#sec_introduction_analytics), we saw that
+analytic queries (OLAP) often scan over a large number of records, performing groupings and
+aggregations. It is possible to run such workloads in a batch processing system, alongside other
+batch processing workloads. Analysts write SQL queries that execute atop a query engine, which reads
+from and writes to a distributed filesystem or object store. Table metadata such as table-to-file
+mappings, names, and types are managed with table formats such as Apache Iceberg and catalogs such
+as Unity (see ["Cloud Data Warehouses"](/en/ch4#sec_cloud_data_warehouses)). This architecture is
+known as a *data lakehouse* [^39].
+
+As with ETL, improvements in SQL query interfaces mean many organizations now use batch frameworks
+such as Spark for analytics. Such query patterns come in two styles:
+
+- Pre-aggregation queries, where data is rolled up into OLAP cubes or data marts to speed up queries
+ (see ["Materialized Views and Data Cubes"](/en/ch4#sec_storage_materialized_views)).
+ Pre-aggregated data is queried in the warehouse or pushed to purpose-built realtime OLAP systems
+ such as Apache Druid or Apache Pinot. Pre-aggregation normally takes place at a scheduled
+ interval. The workflow schedulers discussed in ["Scheduling
+ Workflows"](/en/ch11#sec_batch_workflows) are used to manage these workloads.
+
+- Ad hoc queries that users run to answer specific business questions, investigate user behavior,
+ debug operational issues, and much more. Response times are important for this use case. Analysts
+ run queries iteratively as they get responses and learn more about the data they're investigating.
+ Batch processing frameworks with fast query execution help reduce waiting times for analysts.
+
+SQL support enables batch processing frameworks to integrate with spreadsheets and data
+visualization tools such as Tableau, Power BI, Looker, and Apache Superset. For example, Tableau
+offers SparkSQL and Presto connectors, while Apache Superset supports Trino, Hive, Spark SQL,
+Presto, and many other systems that ultimately execute batch jobs to query data.
+
+### Machine Learning {#id290}
+
+Machine learning (ML) makes frequent use of batch processing. Data scientists, ML engineers, and AI
+engineers use batch processing frameworks to investigate data patterns, transform data, and train
+machine learning models. Common uses include:
+
+- Feature engineering: Raw data is filtered and transformed into data that models can be trained on.
+ Predictive models often need numeric data, so engineers must transform other forms of data (such
+ as text or discrete values) into the required format.
+
+- Model training: The training data is the input to the batch process, and the weights of the
+ trained model are the output.
+
+- Batch inference: A trained model can then be used to make predictions in bulk if datasets are
+ large and realtime results are not required. This includes evaluating the model's predictions on a
+ test dataset.
+
+Batch processing frameworks provide tools explicitly for these use cases. For example, Apache
+Spark's MLlib and Apache Flink's FlinkML come with a wide variety of feature engineering tools,
+statistical functions, and classifiers.
+
+Machine learning applications such as recommendation engines and ranking systems also make heavy use
+of graph processing (see ["Graph-Like Data Models"](/en/ch3#sec_datamodels_graph)). Many graph
+algorithms are expressed by traversing one edge at a time, joining one vertex with an adjacent
+vertex in order to propagate some information, and repeating until some condition is met---for
+example, until there are no more edges to follow, or until some metric converges.
+
+The *bulk synchronous parallel* (BSP) model of computation [^40] has become popular for
+batch processing graphs. Among others, it is implemented by Apache Giraph [^20], Spark's
+GraphX API, and Flink's Gelly API [^41]. It is also known as the *Pregel* model, as
+Google's Pregel paper popularized this approach for processing graphs [^42].
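As a small illustration of the superstep structure, the following Python sketch labels connected components by repeatedly propagating the smallest vertex ID along edges. It runs in a single process and is only meant to show the shape of the computation; the function name and the graph are assumptions for this example, and real BSP implementations shard the vertices across machines.

```python
def connected_components(edges):
    """Label each vertex with the smallest vertex ID in its component."""
    neighbors = {}
    for a, b in edges:
        neighbors.setdefault(a, set()).add(b)
        neighbors.setdefault(b, set()).add(a)

    label = {v: v for v in neighbors}

    # Superstep 0: every vertex sends its own label to its neighbors.
    inbox = {v: [] for v in neighbors}
    for v in neighbors:
        for n in neighbors[v]:
            inbox[n].append(label[v])

    # Keep running supersteps until no vertex has any incoming messages.
    while any(inbox.values()):
        outbox = {v: [] for v in neighbors}
        for v, messages in inbox.items():
            # A vertex only reads its own state and its incoming messages.
            if messages and min(messages) < label[v]:
                label[v] = min(messages)
                for n in neighbors[v]:
                    outbox[n].append(label[v])
        inbox = outbox  # barrier: messages are delivered in the next superstep
    return label


labels = connected_components([(1, 2), (2, 3), (4, 5)])
```

The loop terminates when a superstep produces no messages, which matches the "until some metric converges" condition described above.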
+
+Batch processing is also an integral part of large language model (LLM) data preparation and
+training. Raw text input data such as websites typically reside in a DFS or object store. This data
+must be pre-processed to make it suitable for training. Pre-processing steps that are well-suited
+for batch processing frameworks include:
+
+- Plain text must be extracted from HTML and malformed text must be fixed.
+
+- Low quality, irrelevant, and duplicate documents must be detected and removed.
+
+- Text must be tokenized (split into words) and converted into embeddings, which are numeric
+ representations of each word.
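The first three of these steps can be sketched with the Python standard library alone. This is a deliberately naive illustration: it removes only exact duplicates (after lowercasing and whitespace normalization), tokenizes on whitespace, and skips the embedding step entirely. Production pipelines typically use more robust techniques for each stage.

```python
import hashlib
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML document, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def extract_text(html):
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())  # normalize whitespace


def preprocess(documents):
    """Yield a token list for each distinct, non-empty document."""
    seen = set()
    for html in documents:
        text = extract_text(html).lower()
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if not text or digest in seen:
            continue  # drop empty documents and exact duplicates
        seen.add(digest)
        yield text.split()  # naive whitespace tokenization


docs = ["<p>Hello <b>world</b></p>", "<div>hello   world</div>", "<p>Batch jobs</p>"]
tokens = list(preprocess(docs))
```

Text extraction and tokenization are independent per document, so they parallelize trivially across shards. Deduplication is different: the `seen` set is global state, so a distributed version would first shuffle documents by their digest so that duplicates meet at the same task.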
+
+Batch processing frameworks such as Kubeflow, Flyte, and Ray are purpose-built for such workloads.
+OpenAI uses Ray as part of its ChatGPT training process, for example [^43]. These
+frameworks have built-in integrations for LLM and AI libraries such as PyTorch, TensorFlow, XGBoost,
+and many others. They also offer built-in support for feature engineering, model training, batch
+inference, and fine tuning (adjusting a foundational model for specific use cases).
+
+Finally, data scientists often experiment with data in interactive notebooks such as Jupyter or Hex.
+Notebooks are made up of *cells*, which are small chunks of Markdown, Python, or SQL. Cells are
+executed sequentially to produce spreadsheets, graphs, or data. Many notebooks use batch processing
+via DataFrame APIs or query such systems using SQL.
+
+### Serving Derived Data {#sec_batch_serving_derived}
+
+Batch jobs are often used to build pre-computed or derived datasets such as product recommendations,
+user-facing reports, and features for machine learning models. These datasets are typically served
+from a production database, key-value store, or search engine. Regardless of the system used, the
+pre-computed data needs to make its way from the batch processor's distributed filesystem or object
+store back into the database that's serving live traffic.
+
+The most obvious choice might be to use the client library for your favorite database directly
+within a batch job, and to write directly to the database server, one record at a time. This will
+work (assuming your firewall rules allow direct access from your batch processing environment to
+your production databases), but it is a bad idea for several reasons:
+
+- Making a network request for every single record is orders of magnitude slower than the normal
+ throughput of a batch task. Even if the client library supports batching, performance is likely to
+ be poor.
+
+- Batch processing frameworks often run many tasks in parallel. If all the tasks concurrently write
+ to the same output database, at the rate expected of a batch process, that database can easily be
+ overwhelmed, and its performance for queries is likely to suffer. This can in turn cause
+ operational problems in other parts of the system [^44].
+
+- Normally, batch jobs provide a clean all-or-nothing guarantee for job output: if a job succeeds,
+ the result is the output of running every task exactly once, even if some tasks failed and had to
+ be retried along the way; if the entire job fails, no output is produced. However, writing to an
+ external system from inside a job produces externally visible side effects that cannot be hidden
+ in this way. Thus, you have to worry about the results from partially completed jobs being visible
+ to other systems. If a task fails and is restarted, it may duplicate output from the failed
+ execution.
+
+A better solution is to have batch jobs push pre-computed datasets to streams such as Kafka topics,
+which we discuss further in [Chapter 12](/en/ch12#ch_stream). Search engines like Elasticsearch,
+realtime OLAP systems like Apache Pinot and Apache Druid, derived datastores like Venice
+[^45], and cloud data warehouses like ClickHouse all have the built-in ability to ingest
+data from Kafka into their systems. Pushing data through a streaming system fixes a few of the
+problems we discussed above:
+
+- Streaming systems are optimized for sequential writes, which make them better suited for the bulk
+ write workload of a batch job.
+
+- Streaming systems can also act as a buffer between the batch job and the production databases.
+ Downstream systems can throttle their read rate to ensure they can continue to comfortably serve
+ production traffic.
+
+- The output of a single batch job can be consumed by multiple downstream systems.
+
+- Streaming systems can serve as a security boundary between batch processing environments and
+ production networks: they can be deployed in a so-called DMZ (demilitarized zone) network that
+ sits between the batch processing network and production network.
+
+Pushing data through streams doesn't inherently solve the all-or-nothing guarantee issue we
+discussed above. To make this work, batch jobs must send a notification to downstream systems that
+their job is complete and the data can now be served. Consumers of the stream need to be able to
+keep data they receive invisible to queries, like an uncommitted transaction with *read committed*
+isolation (see ["Read Committed"](/en/ch8#sec_transactions_read_committed)), until they are notified
+that it is complete.
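One way a consumer might implement this is sketched below. `StagedStore` and its method names are hypothetical, invented for this example; the point is only that records are buffered per batch and become visible to queries in a single step once the completion notification arrives.

```python
class StagedStore:
    """Toy serving store that hides a batch's records until it completes."""

    def __init__(self):
        self._live = {}    # visible to queries
        self._staged = {}  # batch_id -> records received but not yet visible

    def on_record(self, batch_id, key, value):
        # Keying by record key means a retried task overwrites its earlier
        # output instead of duplicating it.
        self._staged.setdefault(batch_id, {})[key] = value

    def on_batch_complete(self, batch_id):
        # Build the new view first, then swap the reference, so a reader
        # never observes a half-applied batch.
        self._live = {**self._live, **self._staged.pop(batch_id, {})}

    def on_batch_failed(self, batch_id):
        self._staged.pop(batch_id, None)  # discard partial output

    def get(self, key):
        return self._live.get(key)


store = StagedStore()
store.on_record("job-1", "user-104", ["item-a", "item-b"])
before = store.get("user-104")  # still invisible to queries
store.on_batch_complete("job-1")
after = store.get("user-104")
```

A failed job simply has its staged records dropped, so queries continue to see the output of the last successful job.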
+
+Another pattern that is more common when bootstrapping databases is to build a brand-new database
+*inside* the batch job and bulk load those files directly into the database from a distributed
+filesystem, object store, or local filesystem. Many data systems offer bulk import tools such as
+TiDB's Lightning tool, or Apache Pinot's and Apache Druid's Hadoop import jobs. RocksDB also offers
+an API to bulk import SSTs from batch jobs.
+
+Building databases in batch and bulk importing the data is very fast, and makes it easier for
+systems to atomically switch between dataset versions. On the other hand, it can be challenging to
+incrementally update datasets from batch jobs that build brand-new databases. It's common to take a
+hybrid approach in situations where both bootstrapping and incremental loads are needed. Venice, for
+example, supports hybrid stores that allow for batch row-based updates and full dataset swaps.
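The atomic switch between dataset versions can be illustrated on a local filesystem. In this sketch, which uses made-up file names and JSON files in place of real database files, each version is written out in full, and a small `CURRENT` pointer file is then replaced with `os.replace`, which renames atomically on POSIX systems when source and destination are on the same filesystem.

```python
import json
import os
import tempfile


def publish_version(root, version, records):
    """Write a complete dataset version, then atomically point CURRENT at it."""
    data_path = os.path.join(root, f"dataset-{version}.json")
    with open(data_path, "w") as f:
        json.dump(records, f)

    tmp_pointer = os.path.join(root, "CURRENT.tmp")
    with open(tmp_pointer, "w") as f:
        f.write(data_path)
    # Readers see either the old pointer or the new one, never a mixture.
    os.replace(tmp_pointer, os.path.join(root, "CURRENT"))


def read_current(root):
    """Load whichever dataset version CURRENT points to."""
    with open(os.path.join(root, "CURRENT")) as f:
        data_path = f.read()
    with open(data_path) as f:
        return json.load(f)


root = tempfile.mkdtemp()
publish_version(root, 1, {"user-104": ["item-a"]})
publish_version(root, 2, {"user-104": ["item-b", "item-c"]})
current = read_current(root)
```

Old versions stay on disk until they are deleted, which also makes it straightforward to roll back by pointing `CURRENT` at an earlier file.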
+
+## Summary {#id292}
+
+In this chapter, we explored the design and implementation of batch processing systems. We began
+with the classic Unix toolchain (awk, sort, uniq, etc.), to illustrate fundamental batch processing
+primitives such as sorting and counting.
+
+We then scaled up to distributed batch processing systems. We saw that batch-style I/O processes
+immutable, bounded input datasets to produce output data, allowing reruns and debugging without side
+effects. To process files, we saw that batch frameworks have three main components: an orchestration
+layer that determines where and when jobs run, a storage layer to persist data, and a computation
+layer that processes the actual data.
+
+We looked at how distributed filesystems and object stores manage large files through block-based
+replication, caching, and metadata services, and how modern batch frameworks interact with these
+systems using pluggable APIs. We discussed how orchestrators schedule tasks, allocate resources, and
+handle faults in large clusters, and we compared job orchestrators that schedule individual jobs
+with workflow orchestrators that manage the lifecycle of a collection of jobs that run in a
+dependency graph.
+
+We surveyed batch processing models, starting with MapReduce and its canonical map and reduce
+functions. Next, we turned to dataflow engines like Spark and Flink, which offer simpler-to-use
+dataflow APIs and better performance. To understand how batch jobs scale, we covered the shuffle
+algorithm, a foundational operation that enables grouping, joining, and aggregation.
+
+As batch systems matured, focus shifted to usability. We looked at high-level query languages
+like SQL and DataFrame APIs, which make batch jobs more accessible and easier to optimize. Query
+optimizers translate declarative queries into efficient execution plans.
+
+We finished the chapter with common batch processing use cases:
+
+- ETL pipelines, which extract, transform, and load data between different systems using scheduled
+ workflows;
+
+- Analytics, where batch jobs support both pre-aggregated dashboards and ad hoc queries;
+
+- Machine learning, where batch jobs prepare and process large training datasets;
+
+- Populating production-facing systems from batch outputs, often via streams or bulk loading tools,
+ in order to serve the derived data to users.
+
+In the next chapter, we will turn to stream processing, in which the input is *unbounded*---that is,
+you still have a job, but its inputs are never-ending streams of data. In this case, a job is never
+complete, because at any time there may still be more work coming in. We shall see that stream and
+batch processing are similar in some respects, but the assumption of unbounded streams also changes
+a lot about how we build systems.
+
+##### Footnotes
+
+### References {#references}
+
+[^1]: Nathan Marz. [How to Beat the CAP Theorem](http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html). *nathanmarz.com*, October 2011. Archived at [perma.cc/4BS9-R9A4](https://perma.cc/4BS9-R9A4)
+[^2]: Molly Bartlett Dishman and Martin Fowler. [Agile Architecture](https://www.youtube.com/watch?v=VjKYO6DP3fo&list=PL055Epbe6d5aFJdvWNtTeg_UEHZEHdInE). At *O'Reilly Software Architecture Conference*, March 2015.
+[^3]: Jeffrey Dean and Sanjay Ghemawat. [MapReduce: Simplified Data Processing on Large Clusters](https://www.usenix.org/legacy/publications/library/proceedings/osdi04/tech/full_papers/dean/dean.pdf). At *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004.
+[^4]: Shivnath Babu and Herodotos Herodotou. [Massively Parallel Databases and MapReduce Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/db-mr-survey-final.pdf). *Foundations and Trends in Databases*, volume 5, issue 1, pages 1--104, November 2013. [doi:10.1561/1900000036](https://doi.org/10.1561/1900000036)
+[^5]: David J. DeWitt and Michael Stonebraker. [MapReduce: A Major Step Backwards](https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html). Originally published at *databasecolumn.vertica.com*, January 2008. Archived at [perma.cc/U8PA-K48V](https://perma.cc/U8PA-K48V)
+[^6]: Henry Robinson. [The Elephant Was a Trojan Horse: On the Death of Map-Reduce at Google](https://www.the-paper-trail.org/post/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/). *the-paper-trail.org*, June 2014. Archived at [perma.cc/9FEM-X787](https://perma.cc/9FEM-X787)
+[^7]: Urs Hölzle. [R.I.P. MapReduce. After having served us well since 2003, today we removed the remaining internal codebase for good](https://twitter.com/uhoelzle/status/1177360023976067077). *twitter.com*, September 2019. Archived at [perma.cc/B34T-LLY7](https://perma.cc/B34T-LLY7)
+[^8]: Adam Drake. [Command-Line Tools Can Be 235x Faster than Your Hadoop Cluster](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html). *aadrake.com*, January 2014. Archived at [perma.cc/87SP-ZMCY](https://perma.cc/87SP-ZMCY)
+[^9]: [`sort`: Sort text files](https://www.gnu.org/software/coreutils/manual/html_node/sort-invocation.html). GNU Coreutils 9.7 Documentation, Free Software Foundation, Inc., 2025.
+[^10]: Michael Ovsiannikov, Silvius Rus, Damian Reeves, Paul Sutter, Sriram Rao, and Jim Kelly. [The Quantcast File System](https://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p808-ovsiannikov.pdf). *Proceedings of the VLDB Endowment*, volume 6, issue 11, pages 1092--1101, August 2013. [doi:10.14778/2536222.2536234](https://doi.org/10.14778/2536222.2536234)
+[^11]: Andrew Wang, Zhe Zhang, Kai Zheng, Uma Maheswara G., and Vinayakumar B. [Introduction to HDFS Erasure Coding in Apache Hadoop](https://www.cloudera.com/blog/technical/introduction-to-hdfs-erasure-coding-in-apache-hadoop.html). *blog.cloudera.com*, September 2015. Archived at [archive.org](https://web.archive.org/web/20250731115546/https://www.cloudera.com/blog/technical/introduction-to-hdfs-erasure-coding-in-apache-hadoop.html)
+[^12]: Andy Warfield. [Building and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. Archived at [perma.cc/7LPK-TP7V](https://perma.cc/7LPK-TP7V)
+[^13]: Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, Sharad Agarwal, Mahadev Konar, Robert Evans, Thomas Graves, Jason Lowe, Hitesh Shah, Siddharth Seth, Bikas Saha, Carlo Curino, Owen O'Malley, Sanjay Radia, Benjamin Reed, and Eric Baldeschwieler. [Apache Hadoop YARN: Yet Another Resource Negotiator](https://opencourse.inf.ed.ac.uk/sites/default/files/2023-10/yarn-socc13.pdf). At *4th Annual Symposium on Cloud Computing* (SoCC), October 2013. [doi:10.1145/2523616.2523633](https://doi.org/10.1145/2523616.2523633)
+[^14]: Richard M. Karp. [Reducibility Among Combinatorial Problems](https://www.cs.purdue.edu/homes/hosking/197/canon/karp.pdf). *Complexity of Computer Computations. The IBM Research Symposia Series*. Springer, 1972. [doi:10.1007/978-1-4684-2001-2_9](https://doi.org/10.1007/978-1-4684-2001-2_9)
+[^15]: J. D. Ullman. [NP-Complete Scheduling Problems](https://www.cs.montana.edu/bhz/classes/fall-2018/csci460/paper4.pdf). *Journal of Computer and System Sciences*, volume 10, issue 3, June 1975. [doi:10.1016/S0022-0000(75)80008-0](https://doi.org/10.1016/S0022-0000(75)80008-0)
+[^16]: Gilad David Maayan. [The complete guide to spot instances on AWS, Azure and GCP](https://www.datacenterdynamics.com/en/opinions/complete-guide-spot-instances-aws-azure-and-gcp/). *datacenterdynamics.com*, March 2021. Archived at [archive.org](https://web.archive.org/web/20250722114617/https://www.datacenterdynamics.com/en/opinions/complete-guide-spot-instances-aws-azure-and-gcp/)
+[^17]: Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. [Large-Scale Cluster Management at Google with Borg](https://dl.acm.org/doi/pdf/10.1145/2741948.2741964). At *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741964](https://doi.org/10.1145/2741948.2741964)
+[^18]: Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. [Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf). At *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012.
+[^19]: Paris Carbone, Stephan Ewen, Seif Haridi, Asterios Katsifodimos, Volker Markl, and Kostas Tzoumas. [Apache Flink™: Stream and Batch Processing in a Single Engine](http://sites.computer.org/debull/A15dec/p28.pdf). *Bulletin of the IEEE Computer Society Technical Committee on Data Engineering*, volume 38, issue 4, December 2015. Archived at [perma.cc/G3N3-BKX5](https://perma.cc/G3N3-BKX5)
+[^20]: Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira. *[Hadoop Application Architectures](https://learning.oreilly.com/library/view/hadoop-application-architectures/9781491910313/)*. O'Reilly Media, 2015. ISBN: 978-1-491-90004-8
+[^21]: Jules S. Damji, Brooke Wenig, Tathagata Das, and Denny Lee. *[Learning Spark, 2nd Edition](https://learning.oreilly.com/library/view/learning-spark-2nd/9781492050032/)*. O'Reilly Media, 2020. ISBN: 978-1-492-05004-9
+[^22]: Michael Isard, Mihai Budiu, Yuan Yu, Andrew Birrell, and Dennis Fetterly. [Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks](https://www.microsoft.com/en-us/research/publication/dryad-distributed-data-parallel-programs-from-sequential-building-blocks/). At *2nd European Conference on Computer Systems* (EuroSys), March 2007. [doi:10.1145/1272996.1273005](https://doi.org/10.1145/1272996.1273005)
+[^23]: Daniel Warneke and Odej Kao. [Nephele: Efficient Parallel Data Processing in the Cloud](https://stratosphere2.dima.tu-berlin.de/assets/papers/Nephele_09.pdf). At *2nd Workshop on Many-Task Computing on Grids and Supercomputers* (MTAGS), November 2009. [doi:10.1145/1646468.1646476](https://doi.org/10.1145/1646468.1646476)
+[^24]: Hossein Ahmadi. [In-memory query execution in Google BigQuery](https://cloud.google.com/blog/products/bigquery/in-memory-query-execution-in-google-bigquery). *cloud.google.com*, August 2016. Archived at [perma.cc/DGG2-FL9W](https://perma.cc/DGG2-FL9W)
+[^25]: Tom White. *[Hadoop: The Definitive Guide](https://learning.oreilly.com/library/view/hadoop-the-definitive/9781491901687/)*, 4th edition. O'Reilly Media, 2015. ISBN: 978-1-491-90163-2
+[^26]: Fabian Hüske. [Peeking into Apache Flink's Engine Room](https://flink.apache.org/2015/03/13/peeking-into-apache-flinks-engine-room/). *flink.apache.org*, March 2015. Archived at [perma.cc/44BW-ALJX](https://perma.cc/44BW-ALJX)
+[^27]: Mostafa Mokhtar. [Hive 0.14 Cost Based Optimizer (CBO) Technical Overview](https://web.archive.org/web/20170607112708/http://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/). *hortonworks.com*, March 2015. Archived at [archive.org](https://web.archive.org/web/20170607112708/http://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/)
+[^28]: Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. [Spark SQL: Relational Data Processing in Spark](https://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2742797](https://doi.org/10.1145/2723372.2742797)
+[^29]: Kaya Kupferschmidt. [Spark vs Pandas, part 2 -- Spark](https://towardsdatascience.com/spark-vs-pandas-part-2-spark-c57f8ea3a781/). *towardsdatascience.com*, October 2020. Archived at [perma.cc/5BRK-G4N5](https://perma.cc/5BRK-G4N5)
+[^30]: Ammar Chalifah. [Tracking payments at scale](https://bolt.eu/en/blog/tracking-payments-at-scale). *bolt.eu*, June 2025. Archived at [perma.cc/Q4KX-8K3J](https://perma.cc/Q4KX-8K3J)
+[^31]: Nafi Ahmet Turgut, Hamza Akyıldız, Hasan Burak Yel, Mehmet İkbal Özmen, Mutlu Polatcan, Pinar Baki, and Esra Kayabali. [Demand forecasting at Getir built with Amazon Forecast](https://aws.amazon.com/blogs/machine-learning/demand-forecasting-at-getir-built-with-amazon-forecast). *aws.amazon.com*, May 2023. Archived at [perma.cc/H3H6-GNL7](https://perma.cc/H3H6-GNL7)
+[^32]: Jason (Siyu) Zhu. [Enhancing homepage feed relevance by harnessing the power of large corpus sparse ID embeddings](https://www.linkedin.com/blog/engineering/feed/enhancing-homepage-feed-relevance-by-harnessing-the-power-of-lar). *linkedin.com*, August 2023. Archived at [archive.org](https://web.archive.org/web/20250225094424/https://www.linkedin.com/blog/engineering/feed/enhancing-homepage-feed-relevance-by-harnessing-the-power-of-lar)
+[^33]: Avery Ching, Sital Kedia, and Shuojie Wang. [Apache Spark @Scale: A 60 TB+ production use case](https://engineering.fb.com/2016/08/31/core-infra/apache-spark-scale-a-60-tb-production-use-case/). *engineering.fb.com*, August 2016. Archived at [perma.cc/F7R5-YFAV](https://perma.cc/F7R5-YFAV)
+[^34]: Edward Kim. [How ACH works: A developer perspective --- Part 1](https://engineering.gusto.com/how-ach-works-a-developer-perspective-part-1-339d3e7bea1). *engineering.gusto.com*, April 2014. Archived at [perma.cc/F67P-VBLK](https://perma.cc/F67P-VBLK)
+[^35]: Zhamak Dehghani. [How to Move Beyond a Monolithic Data Lake to a Distributed Data Mesh](https://martinfowler.com/articles/data-monolith-to-mesh.html). *martinfowler.com*, May 2019. Archived at [perma.cc/LN2L-L4VC](https://perma.cc/LN2L-L4VC)
+[^36]: Chris Riccomini. [What the Heck is a Data Mesh?!](https://cnr.sh/essays/what-the-heck-data-mesh) *cnr.sh*, June 2021. Archived at [perma.cc/NEJ2-BAX3](https://perma.cc/NEJ2-BAX3)
+[^37]: Chad Sanderson, Mark Freeman, and B. E. Schmidt. *[Data Contracts](https://www.oreilly.com/library/view/data-contracts/9781098157623/)*. O'Reilly Media, 2025. ISBN: 978-1-098-15762-3
+[^38]: Daniel Abadi. [Data Fabric vs. Data Mesh: What's the Difference?](https://www.starburst.io/blog/data-fabric-vs-data-mesh-whats-the-difference/) *starburst.io*, November 2021. Archived at [perma.cc/RSK3-HXDK](https://perma.cc/RSK3-HXDK)
+[^39]: Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference on Innovative Data Systems Research* (CIDR), January 2021.
+[^40]: Leslie G. Valiant. [A Bridging Model for Parallel Computation](https://dl.acm.org/doi/pdf/10.1145/79173.79181). *Communications of the ACM*, volume 33, issue 8, pages 103--111, August 1990. [doi:10.1145/79173.79181](https://doi.org/10.1145/79173.79181)
+[^41]: Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl. [Spinning Fast Iterative Data Flows](https://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf). *Proceedings of the VLDB Endowment*, volume 5, issue 11, pages 1268--1279, July 2012. [doi:10.14778/2350229.2350245](https://doi.org/10.14778/2350229.2350245)
+[^42]: Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkowski. [Pregel: A System for Large-Scale Graph Processing](https://kowshik.github.io/JPregel/pregel_paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2010. [doi:10.1145/1807167.1807184](https://doi.org/10.1145/1807167.1807184)
+[^43]: Richard MacManus. [OpenAI Chats about Scaling LLMs at Anyscale's Ray Summit](https://thenewstack.io/openai-chats-about-scaling-llms-at-anyscales-ray-summit/). *thenewstack.io*, September 2023. Archived at [perma.cc/YJD6-KUXU](https://perma.cc/YJD6-KUXU)
+[^44]: Jay Kreps. [Why Local State is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing). *oreilly.com*, July 2014. Archived at [perma.cc/P8HU-R5LA](https://perma.cc/P8HU-R5LA)
+[^45]: Félix GV. [Open Sourcing Venice -- LinkedIn's Derived Data Platform](https://www.linkedin.com/blog/engineering/open-source/open-sourcing-venice-linkedin-s-derived-data-platform). *linkedin.com*, September 2022. Archived at [archive.org](https://web.archive.org/web/20250226160927/https://www.linkedin.com/blog/engineering/open-source/open-sourcing-venice-linkedin-s-derived-data-platform)
diff --git a/content/en/ch12.md b/content/en/ch12.md
index 81a9447..a731ee3 100644
--- a/content/en/ch12.md
+++ b/content/en/ch12.md
@@ -4,177 +4,1864 @@ weight: 312
breadcrumbs: false
---
-{{< callout type="warning" >}}
-This page is from the 1st edition, 2nd edition is not available yet.
-{{< /callout >}}
+

-> *A complex system that works is invariably found to have evolved from a simple system that works. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work.*
+> *A complex system that works is invariably found to have evolved from a simple system that works.
+> The inverse proposition also appears to be true: A complex system designed from scratch never
+> works and cannot be made to work.*
>
-> — John Gall, *Systemantics* (1975)
+> John Gall, *Systemantics* (1975)
----------------
+> [!TIP] A NOTE FOR EARLY RELEASE READERS
+> With Early Release ebooks, you get books in their earliest form---the author's raw and unedited
+> content as they write---so you can take advantage of these technologies long before the official
+> release of these titles.
+>
+> This will be the 12th chapter of the final book. The GitHub repo for this book is
+> [https://github.com/ept/ddia2-feedback](https://github.com/ept/ddia2-feedback).
+>
+> If you'd like to be actively involved in reviewing and commenting on this draft, please reach out on
+> GitHub.
-In [Chapter 10](/en/ch10) we discussed batch processing—techniques that read a set of files as input and produce a new set of output files. The output is a form of *derived data*; that is, a dataset that can be recreated by running the batch process again if necessary. We saw how this simple but powerful idea can be used to create search indexes, recom‐ mendation systems, analytics, and more.
+In [Chapter 11](/en/ch11#ch_batch) we discussed batch processing---techniques that read a set of
+files as input and produce a new set of output files. The output is a form of *derived data*; that
+is, a dataset that can be recreated by running the batch process again if necessary. We saw how this
+simple but powerful idea can be used to create search indexes, recommendation systems, analytics,
+and more.
-However, one big assumption remained throughout [Chapter 10](/en/ch10): namely, that the input is bounded—i.e., of a known and finite size—so the batch process knows when it has finished reading its input. For example, the sorting operation that is central to MapReduce must read its entire input before it can start producing output: it could happen that the very last input record is the one with the lowest key, and thus needs to be the very first output record, so starting the output early is not an option.
+However, one big assumption remained throughout [Chapter 11](/en/ch11#ch_batch): namely, that the
+input is bounded---i.e., of a known and finite size---so the batch process knows when it has
+finished reading its input. For example, the sorting operation that is central to MapReduce must
+read its entire input before it can start producing output: it could happen that the very last input
+record is the one with the lowest key, and thus needs to be the very first output record, so
+starting the output early is not an option.
-In reality, a lot of data is unbounded because it arrives gradually over time: your users produced data yesterday and today, and they will continue to produce more data tomorrow. Unless you go out of business, this process never ends, and so the dataset is never “complete” in any meaningful way [1]. Thus, batch processors must artifi‐ cially divide the data into chunks of fixed duration: for example, processing a day’s worth of data at the end of every day, or processing an hour’s worth of data at the end of every hour.
+In reality, a lot of data is unbounded because it arrives gradually over time: your users produced
+data yesterday and today, and they will continue to produce more data tomorrow. Unless you go out of
+business, this process never ends, and so the dataset is never "complete" in any meaningful way
+[^1]. Thus, batch processors must artificially divide the data into chunks of fixed
+duration: for example, processing a day's worth of data at the end of every day, or processing an
+hour's worth of data at the end of every hour.
-The problem with daily batch processes is that changes in the input are only reflected in the output a day later, which is too slow for many impatient users. To reduce the delay, we can run the processing more frequently—say, processing a second’s worth of data at the end of every second—or even continuously, abandoning the fixed time slices entirely and simply processing every event as it happens. That is the idea behind *stream processing*.
+The problem with daily batch processes is that changes in the input are only reflected in the output
+a day later, which is too slow for many impatient users. To reduce the delay, we can run the
+processing more frequently---say, processing a second's worth of data at the end of every
+second---or even continuously, abandoning the fixed time slices entirely and simply processing every
+event as it happens. That is the idea behind *stream processing*.
-In general, a “stream” refers to data that is incrementally made available over time. The concept appears in many places: in the stdin and stdout of Unix, programming languages (lazy lists) [2], filesystem APIs (such as Java’s `FileInputStream`), TCP con‐ nections, delivering audio and video over the internet, and so on.
+In general, a "stream" refers to data that is incrementally made available over time. The concept
+appears in many places: in the `stdin` and `stdout` of Unix, programming languages (lazy lists)
+[^2], filesystem APIs (such as Java's `FileInputStream`), TCP connections, delivering
+audio and video over the internet, and so on.
-In this chapter we will look at *event streams* as a data management mechanism: the unbounded, incrementally processed counterpart to the batch data we saw in the last chapter. We will first discuss how streams are represented, stored, and transmit‐ ted over a network. In “[Databases and Streams](#databases-and-streams)” we will investigate the relationship between streams and databases. And finally, in “[Processing Streams](#processing-streams)” we will explore approaches and tools for processing those streams continually, and ways that they can be used to build applications.
+In this chapter we will look at *event streams* as a data management mechanism: the unbounded,
+incrementally processed counterpart to the batch data we saw in the last chapter. We will first
+discuss how streams are represented, stored, and transmitted over a network. In ["Databases and
+Streams"](/en/ch12#sec_stream_databases) we will investigate the relationship between streams and
+databases. And finally, in ["Processing Streams"](/en/ch12#sec_stream_processing) we will explore
+approaches and tools for processing those streams continually, and ways that they can be used to
+build applications.
+## Transmitting Event Streams {#sec_stream_transmit}
-## ……
+In the batch processing world, the inputs and outputs of a job are files (perhaps on a distributed
+filesystem). What does the streaming equivalent look like?
+When the input is a file (a sequence of bytes), the first processing step is usually to parse it
+into a sequence of records. In a stream processing context, a record is more commonly known as an
+*event*, but it is essentially the same thing: a small, self-contained, immutable object containing
+the details of something that happened at some point in time. An event usually contains a timestamp
+indicating when it happened according to a time-of-day clock (see ["Monotonic Versus Time-of-Day
+Clocks"](/en/ch9#sec_distributed_monotonic_timeofday)).
+For example, the thing that happened might be an action that a user took, such as viewing a page or
+making a purchase. It might also originate from a machine, such as a periodic measurement from a
+temperature sensor, or a CPU utilization metric. In the example of ["Batch Processing with Unix
+Tools"](/en/ch11#sec_batch_unix), each line of the web server log is an event.
-## Summary
+An event may be encoded as a text string, or JSON, or perhaps in some binary form, as discussed in
+[Chapter 5](/en/ch5#ch_encoding). This encoding allows you to store an event, for example by
+appending it to a file, inserting it into a relational table, or writing it to a document database.
+It also allows you to send the event over the network to another node in order to process it.
-In this chapter we have discussed event streams, what purposes they serve, and how to process them. In some ways, stream processing is very much like the batch pro‐ cessing we discussed in [Chapter 10](/en/ch10), but done continuously on unbounded (neverending) streams rather than on a fixed-size input. From this perspective, message brokers and event logs serve as the streaming equivalent of a filesystem.
+In batch processing, a file is written once and then potentially read by multiple jobs. Analogously,
+in streaming terminology, an event is generated once by a *producer* (also known as a *publisher* or
+*sender*), and then potentially processed by multiple *consumers* (*subscribers* or *recipients*)
+[^3]. In a filesystem, a filename identifies a set of related records; in a streaming
+system, related events are usually grouped together into a *topic* or *stream*.
+
+In principle, a file or database is sufficient to connect producers and consumers: a producer writes
+every event that it generates to the datastore, and each consumer periodically polls the datastore
+to check for events that have appeared since it last ran. This is essentially what a batch process
+does when it processes a day's worth of data at the end of every day.
+
+However, when moving toward continual processing with low delays, polling becomes expensive if the
+datastore is not designed for this kind of usage. The more often you poll, the lower the percentage
+of requests that return new events, and thus the higher the overheads become. Instead, it is better
+for consumers to be notified when new events appear.
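
The polling overhead described above can be made concrete with a little arithmetic. The following is a minimal sketch, not a measurement of any real datastore: it assumes, purely for illustration, that events arrive as a Poisson process at an average of one event every two seconds, and the function name `fraction_of_useful_polls` is hypothetical.

```python
import math


def fraction_of_useful_polls(events_per_second: float, poll_interval_seconds: float) -> float:
    """Probability that a single poll finds at least one new event.

    Assumes Poisson arrivals (an illustrative assumption): the chance of
    zero events in an interval of length T at rate r is exp(-r * T).
    """
    return 1.0 - math.exp(-events_per_second * poll_interval_seconds)


# Hypothetical workload: on average one event every two seconds.
RATE = 0.5

for interval in (60.0, 1.0, 0.01):
    useful = fraction_of_useful_polls(RATE, interval)
    print(f"poll every {interval} s: {useful:.2%} of polls return new events")
```

Under these assumptions, polling once a minute almost always finds something, whereas polling every 10 ms returns new events on well under 1% of requests, so nearly all of the polling work is wasted.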
+
+Databases have traditionally not supported this kind of notification mechanism very well: relational
+databases commonly have *triggers*, which can react to a change (e.g., a row being inserted into a
+table), but they are very limited in what they can do and have been somewhat of an afterthought in
+database design [^4]. Instead, specialized tools have been developed for the purpose of
+delivering event notifications.
+
+### Messaging Systems {#sec_stream_messaging}
+
+A common approach for notifying consumers about new events is to use a *messaging system*: a
+producer sends a message containing the event, which is then pushed to consumers. We touched on
+these systems previously in ["Event-Driven Architectures"](/en/ch5#sec_encoding_dataflow_msg), but
+we will now go into more detail.
+
+A direct communication channel like a Unix pipe or TCP connection between producer and consumer
+would be a simple way of implementing a messaging system. However, most messaging systems expand on
+this basic model. In particular, Unix pipes and TCP connect exactly one sender with one recipient,
+whereas a messaging system allows multiple producer nodes to send messages to the same topic and
+allows multiple consumer nodes to receive messages in a topic.
+
+Within this *publish/subscribe* model, different systems take a wide range of approaches, and there
+is no one right answer for all purposes. To differentiate the systems, it is particularly helpful to
+ask the following two questions:
+
+1. *What happens if the producers send messages faster than the consumers can process them?*
+ Broadly speaking, there are three options: the system can drop messages, buffer messages in a
+ queue, or apply *backpressure* (also known as *flow control*; i.e., blocking the producer from
+ sending more messages). For example, Unix pipes and TCP use backpressure: they have a small
+ fixed-size buffer, and if it fills up, the sender is blocked until the recipient takes data out
+ of the buffer (see ["Network congestion and queueing"](/en/ch9#sec_distributed_congestion)).
+
+ If messages are buffered in a queue, it is important to understand what happens as that queue
+ grows. Does the system crash if the queue no longer fits in memory, or does it write messages to
+ disk? In the latter case, how does the disk access affect the performance of the messaging
+ system [^5], and what happens when the disk fills up [^6]?
+
+2. *What happens if nodes crash or temporarily go offline---are any messages lost?* As with
+ databases, durability may require some combination of writing to disk and/or replication (see
+ the sidebar ["Replication and Durability"](/en/ch8#sidebar_transactions_durability)), which has
+ a cost. If you can afford to sometimes lose messages, you can probably get higher throughput and
+ lower latency on the same hardware.
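
The three options from the first question (drop, buffer, or apply backpressure) can be illustrated with Python's standard `queue` module standing in for the buffer between a producer and a consumer. This is only a sketch: the helper names `send_or_drop` and `send_with_backpressure` are hypothetical, and the timeout exists solely so that the demonstration terminates, whereas real backpressure would block the producer for as long as necessary.

```python
import queue


def send_or_drop(buffer: queue.Queue, message) -> bool:
    """Option 1: if the buffer is full, drop the message instead of waiting."""
    try:
        buffer.put_nowait(message)
        return True
    except queue.Full:
        return False


def send_with_backpressure(buffer: queue.Queue, message, timeout: float) -> bool:
    """Option 3: block the producer until the consumer frees up space.

    The timeout is an assumption made for this demo only.
    """
    try:
        buffer.put(message, timeout=timeout)
        return True
    except queue.Full:
        return False


# A small fixed-size buffer, loosely analogous to a Unix pipe or TCP buffer.
pipe_like = queue.Queue(maxsize=2)
accepted = [send_or_drop(pipe_like, m) for m in ("a", "b", "c")]

# Nobody is consuming, so a blocking send cannot make progress.
blocked = send_with_backpressure(pipe_like, "d", timeout=0.05)

# Once the consumer takes one message out, the producer can continue.
pipe_like.get()
resumed = send_with_backpressure(pipe_like, "d", timeout=0.05)

# Option 2: an unbounded queue (maxsize=0) accepts everything and grows in memory.
backlog = queue.Queue()
for n in range(10_000):
    backlog.put(n)
```

The unbounded variant never rejects or delays the producer, which is exactly why the follow-up questions about memory and disk usage matter once the backlog grows.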
+
+Whether message loss is acceptable depends very much on the application. For example, with sensor
+readings and metrics that are transmitted periodically, an occasional missing data point is perhaps
+not important, since an updated value will be sent a short time later anyway. However, beware that
+if a large number of messages are dropped, it may not be immediately apparent that the metrics are
+incorrect [^7]. If you are counting events, it is more important that they are delivered
+reliably, since every lost message means incorrect counters.
+
+A nice property of the batch processing systems we explored in [Chapter 11](/en/ch11#ch_batch) is
+that they provide a strong reliability guarantee: failed tasks are automatically retried, and
+partial output from failed tasks is automatically discarded. This means the output is the same as if
+no failures had occurred, which helps simplify the programming model. Later in this chapter we will
+examine how we can provide similar guarantees in a streaming context.
+
+#### Direct messaging from producers to consumers {#id296}
+
+A number of messaging systems use direct network communication between producers and consumers
+without going via intermediary nodes:
+
+- UDP multicast is widely used in the financial industry for streams such as stock market feeds,
+ where low latency is important [^8]. Although UDP itself is unreliable,
+ application-level protocols can recover lost packets (the producer must remember packets it has
+ sent so that it can retransmit them on demand).
+
+- Brokerless messaging libraries such as ZeroMQ and nanomsg take a similar approach, implementing
+ publish/subscribe messaging over TCP or IP multicast.
+
+- Some metrics collection agents, such as StatsD [^9], use unreliable UDP messaging to
+ collect metrics from all machines on the network and monitor them. (In the StatsD protocol,
+ counter metrics are only correct if all messages are received; using UDP makes the metrics at best
+ approximate [^10]. See also ["TCP Versus UDP"](/en/ch9#sidebar_distributed_tcp_udp).)
+
+- If the consumer exposes a service on the network, producers can make a direct HTTP or RPC request
+ (see ["Dataflow Through Services: REST and RPC"](/en/ch5#sec_encoding_dataflow_rpc)) to push
+ messages to the consumer. This is the idea behind webhooks [^11], a pattern in which a
+ callback URL of one service is registered with another service, which then makes a request to that
+ URL whenever an event occurs.
+
+Although these direct messaging systems work well in the situations for which they are designed,
+they generally require the application code to be aware of the possibility of message loss. The
+faults they can tolerate are quite limited: even if the protocols detect and retransmit packets that
+are lost in the network, they generally assume that producers and consumers are constantly online.
+
+If a consumer is offline, it may miss messages that were sent while it was unreachable. Some
+protocols allow the producer to retry failed message deliveries, but this approach may break down if
+the producer crashes, losing the buffer of messages that it was supposed to retry.
+
+#### Message brokers {#id433}
+
+A widely used alternative is to send messages via a *message broker* (also known as a *message
+queue*), which is essentially a kind of database that is optimized for handling message streams
+[^12]. It runs as a server, with producers and consumers connecting to it as clients.
+Producers write messages to the broker, and consumers receive them by reading them from the broker.
+
+By centralizing the data in the broker, these systems can more easily tolerate clients that come and
+go (connect, disconnect, and crash), and the question of durability is moved to the broker instead.
+Some message brokers only keep messages in memory, while others (depending on configuration) write
+them to disk so that they are not lost in case of a broker crash. Faced with slow consumers, they
+generally allow unbounded queueing (as opposed to dropping messages or backpressure), although this
+choice may also depend on the configuration.
+
+A consequence of queueing is also that consumers are generally *asynchronous*: when a producer sends
+a message, it normally only waits for the broker to confirm that it has buffered the message and
+does not wait for the message to be processed by consumers. The delivery to consumers will happen at
+some undetermined future point in time---often within a fraction of a second, but sometimes
+significantly later if there is a queue backlog.
+
+#### Message brokers compared to databases {#id297}
+
+Some message brokers can even participate in two-phase commit protocols using XA or JTA (see
+["Distributed Transactions Across Different Systems"](/en/ch8#sec_transactions_xa)). This feature
+makes them quite similar in nature to databases, although there are still important practical
+differences between message brokers and databases:
+
+- Databases usually keep data until it is explicitly deleted, whereas some message brokers
+ automatically delete a message when it has been successfully delivered to its consumers. Such
+ message brokers are not suitable for long-term data storage.
+
+- Since they quickly delete messages, most message brokers assume that their working set is fairly
+ small---i.e., the queues are short. If the broker needs to buffer a lot of messages because the
+ consumers are slow (perhaps spilling messages to disk if they no longer fit in memory), each
+ individual message takes longer to process, and the overall throughput may degrade [^5].
+
+- Databases often support secondary indexes and various ways of searching for data using a query
+ language, while message brokers often support some way of subscribing to a subset of topics
+ matching some pattern. Both are essentially ways for a client to select the portion of the data
+ that it wants to know about, but databases typically offer much more advanced query functionality.
+
+- When querying a database, the result is typically based on a point-in-time snapshot of the data;
+ if another client subsequently writes something to the database that changes the query result, the
+ first client does not find out that its prior result is now outdated (unless it repeats the query,
+ or polls for changes). By contrast, message brokers do not support arbitrary queries and don't
+ allow message updates once they're sent, but they do notify clients when data changes (i.e., when
+ new messages become available).
+
+This is the traditional view of message brokers, which is encapsulated in standards like JMS
+[^13] and AMQP [^14] and implemented in software like RabbitMQ, ActiveMQ,
+HornetQ, Qpid, TIBCO Enterprise Message Service, IBM MQ, Azure Service Bus, and Google Cloud Pub/Sub
+[^15]. Although it is possible to use databases as queues, tuning them to get good
+performance is not straightforward [^16].
+
+#### Multiple consumers {#id298}
+
+When multiple consumers read messages in the same topic, two main patterns of messaging are used, as
+illustrated in [Figure 12-1](/en/ch12#fig_stream_multi_consumer):
+
+Load balancing
+
+: Each message is delivered to *one* of the consumers, so the consumers can share the work of
+ processing the messages in the topic. The broker may assign messages to consumers arbitrarily.
+ This pattern is useful when the messages are expensive to process, and so you want to be able to
+ add consumers to parallelize the processing. (In AMQP, you can implement load balancing by
+ having multiple clients consuming from the same queue, and in JMS it is called a
+ *shared subscription*.)
+
+Fan-out
+
+: Each message is delivered to *all* of the consumers. Fan-out allows several independent
+ consumers to each "tune in" to the same broadcast of messages, without affecting each
+ other---the streaming equivalent of having several different batch jobs that read the same input
+ file. (This feature is provided by topic subscriptions in JMS, and exchange bindings in AMQP.)
+
+{{< figure src="/fig/ddia_1201.png" id="fig_stream_multi_consumer" caption="Figure 12-1. (a) Load balancing: sharing the work of consuming a topic among consumers; (b) fan-out: delivering each message to multiple consumers." class="w-full my-4" >}}
+
+The two patterns can be combined, for example using Kafka's *consumer groups* feature. When a
+consumer group subscribes to a topic, each message in the topic is sent to one of the consumers in
+the group (load-balancing across the consumers in the group). If two separate consumer groups
+subscribe to the same topic, each message is sent to one consumer in each group (providing fan-out
+across consumer groups).
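+
+As a rough illustration of how the two patterns combine, the following Python sketch simulates
+delivery to consumer groups in memory. The `deliver` function, the group and consumer names, and
+the round-robin choice of consumer are assumptions made for this example, not any broker's actual
+API; a real broker may pick the consumer within a group differently.
+
+```python
+from collections import defaultdict
+from itertools import cycle
+
+
+def deliver(messages, groups):
+    """Send each message to exactly one consumer in every group.
+
+    `groups` maps a (hypothetical) group name to its list of consumer names.
+    """
+    received = defaultdict(list)
+    # One round-robin iterator per group: load balancing within the group.
+    next_consumer = {name: cycle(members) for name, members in groups.items()}
+    for message in messages:
+        # Every group sees every message: fan-out across groups.
+        for name in groups:
+            received[next(next_consumer[name])].append(message)
+    return dict(received)
+
+
+groups = {"indexers": ["i1", "i2"], "auditors": ["a1"]}
+print(deliver(["m1", "m2", "m3", "m4"], groups))
+```
+
+In this sketch the two indexers split the messages between them, while the single auditor
+receives all of them.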
+
+#### Acknowledgments and redelivery {#sec_stream_reordering}
+
+Consumers may crash at any time, so it could happen that a broker delivers a message to a consumer
+but the consumer never processes it, or only partially processes it before crashing. In order to
+ensure that the message is not lost, message brokers use *acknowledgments*: a client must explicitly
+tell the broker when it has finished processing a message so that the broker can remove it from the
+queue.
+
+If the connection to a client is closed or times out without the broker receiving an acknowledgment,
+it assumes that the message was not processed, and therefore it delivers the message again to
+another consumer. (Note that it could happen that the message actually *was* fully processed, but
+the acknowledgment was lost in the network. Handling this case requires an atomic commit protocol,
+as discussed in ["Exactly-once message processing"](/en/ch8#sec_transactions_exactly_once), unless
+the operation was idempotent or exactly-once semantics are not required.)
+
+When combined with load balancing, this redelivery behavior has an interesting effect on the
+ordering of messages. In [Figure 12-2](/en/ch12#fig_stream_redelivery), the consumers generally
+process messages in the order they were sent by producers. However, consumer 2 crashes while
+processing message *m3*, at the same time as consumer 1 is processing message *m4*. The
+unacknowledged message *m3* is subsequently redelivered to consumer 1, with the result that consumer
+1 processes messages in the order *m4*, *m3*, *m5*. Thus, *m3* and *m4* are not delivered in the
+same order as they were sent by producer 1.
+
+{{< figure src="/fig/ddia_1202.png" id="fig_stream_redelivery" caption="Figure 12-2. Consumer 2 crashes while processing m3, so it is redelivered to consumer 1 at a later time." class="w-full my-4" >}}
+
+Even if the message broker otherwise tries to preserve the order of messages (as required by both
+the JMS and AMQP standards), the combination of load balancing with redelivery inevitably leads to
+messages being reordered. To avoid this issue, you can use a separate queue per consumer (i.e., not
+use the load balancing feature). Message reordering is not a problem if messages are completely
+independent of each other, but it can be important if there are causal dependencies between
+messages, as we shall see later in the chapter.
+
+Redelivery can also result in wasted resources, resource starvation, or permanent blockages in a
+stream. A common scenario is a producer that improperly serializes a message; for example, by
+leaving out a required key in a JSON-encoded object. Any consumer that reads the message will expect
+the key, and fail if it's missing. No acknowledgment is sent, so the broker will re-send the
+message, which will cause another consumer to fail. This loop repeats itself indefinitely. If the
+broker guarantees strong ordering, no further progress can be made. Brokers that allow message
+reordering can continue to make progress, but will waste resources on messages that will never be
+acknowledged.
+
+Dead letter queues (DLQs) are used to handle this problem. Rather than keeping the message in the
+current queue and retrying forever, the message is moved to a different queue to unblock consumers
+[^17], [^18]. Monitoring is usually set up on dead letter queues---any message in
+the queue is an error. Once a new message is detected, an operator can decide to permanently drop
+it, manually modify and re-produce the message, or fix consumer code to handle the message
+appropriately. DLQs are common in most queueing systems, but log-based messaging systems such as
+Apache Pulsar and stream processing systems such as Kafka Streams now support them as well
+[^19].
+
+### Log-based Message Brokers {#sec_stream_log}
+
+Sending a packet over a network or making a request to a network service is normally a transient
+operation that leaves no permanent trace. Although it is possible to record it permanently (using
+packet capture and logging), we normally don't think of it that way. AMQP/JMS-style message brokers
+inherited this transient messaging mindset: even though they may write messages to disk, they
+quickly delete the messages again after they have been delivered to consumers.
+
+Databases and filesystems take the opposite approach: everything that is written to a database or
+file is normally expected to be permanently recorded, at least until someone explicitly chooses to
+delete it again.
+
+This difference in mindset has a big impact on how derived data is created. A key feature of batch
+processes, as discussed in [Chapter 11](/en/ch11#ch_batch), is that you can run them repeatedly,
+experimenting with the processing steps, without risk of damaging the input (since the input is
+read-only). This is not the case with AMQP/JMS-style messaging: receiving a message is destructive
+if the acknowledgment causes it to be deleted from the broker, so you cannot run the same consumer
+again and expect to get the same result.
+
+If you add a new consumer to a messaging system, it typically only starts receiving messages sent
+after the time it was registered; any prior messages are already gone and cannot be recovered.
+Contrast this with files and databases, where you can add a new client at any time, and it can read
+data written arbitrarily far in the past (as long as it has not been explicitly overwritten or
+deleted by the application).
+
+Why can we not have a hybrid, combining the durable storage approach of databases with the
+low-latency notification facilities of messaging? This is the idea behind *log-based message
+brokers*, which have become very popular in recent years.
+
+#### Using logs for message storage {#id300}
+
+A log is simply an append-only sequence of records on disk. We previously discussed logs in the
+context of log-structured storage engines and write-ahead logs in [Chapter 4](/en/ch4#ch_storage),
+in the context of replication in [Chapter 6](/en/ch6#ch_replication), and as a form of consensus in
+[Chapter 10](/en/ch10#ch_consistency).
+
+The same structure can be used to implement a message broker: a producer sends a message by
+appending it to the end of the log, and a consumer receives messages by reading the log
+sequentially. If a consumer reaches the end of the log, it waits for a notification that a new
+message has been appended. The Unix tool `tail -f`, which watches a file for data being appended,
+essentially works like this.
+
+In order to scale to higher throughput than a single disk can offer, the log can be *sharded* (in
+the sense of [Chapter 7](/en/ch7#ch_sharding)). Different shards can then be hosted on different
+machines, making each shard a separate log that can be read and written independently from other
+shards. A topic can then be defined as a group of shards that all carry messages of the same type.
+This approach is illustrated in [Figure 12-3](/en/ch12#fig_stream_kafka_partitions).
+
+Within each shard, which Kafka calls a *partition*, the broker assigns a monotonically increasing
+sequence number, or *offset*, to every message (in
+[Figure 12-3](/en/ch12#fig_stream_kafka_partitions), the numbers in boxes are message offsets). Such
+a sequence number makes sense because a partition (shard) is append-only, so the messages within a
+partition are totally ordered. There is no ordering guarantee across different partitions.
+
+{{< figure src="/fig/ddia_1203.png" id="fig_stream_kafka_partitions" caption="Figure 12-3. Producers send messages by appending them to a topic-partition file, and consumers read these files sequentially." class="w-full my-4" >}}
+
+Apache Kafka [^20] and Amazon Kinesis Streams are log-based message brokers that work like
+this. Google Cloud Pub/Sub is architecturally similar but exposes a JMS-style API rather than a log
+abstraction [^15]. Even though these message brokers write all messages to disk, they are
+able to achieve throughput of millions of messages per second by sharding across multiple machines,
+and fault tolerance by replicating messages [^21], [^22].
+
+#### Logs compared to traditional messaging {#sec_stream_logs_vs_messaging}
+
+The log-based approach trivially supports fan-out messaging, because several consumers can
+independently read the log without affecting each other---reading a message does not delete it from
+the log. To achieve load balancing across a group of consumers, instead of assigning individual
+messages to consumer clients, the broker can assign entire shards to nodes in the consumer group.
+
+Each client then consumes *all* the messages in the shards it has been assigned. Typically, when a
+consumer has been assigned a log shard, it reads the messages in the shard sequentially, in a
+straightforward single-threaded manner. This coarse-grained load balancing approach has some
+downsides:
+
+- The number of nodes sharing the work of consuming a topic can be at most the number of log shards
+ in that topic, because messages within the same shard are delivered to the same node. (It's
+ possible to create a load balancing scheme in which two consumers share the work of processing a
+ shard by having both read the full set of messages, but one of them only considers messages with
+ even-numbered offsets while the other deals with the odd-numbered offsets. Alternatively, you
+ could spread message processing over a thread pool, but that approach complicates consumer offset
+ management. In general, single-threaded processing of a shard is preferable, and parallelism can
+ be increased by using more shards.)
+
+- If a single message is slow to process, it holds up the processing of subsequent messages in that
+ shard (a form of head-of-line blocking; see ["Describing
+ Performance"](/en/ch2#sec_introduction_percentiles)).
+
+Thus, in situations where messages may be expensive to process and you want to parallelize
+processing on a message-by-message basis, and where message ordering is not so important, the
+JMS/AMQP style of message broker is preferable. On the other hand, in situations with high message
+throughput, where each message is fast to process and where message ordering is important, the
+log-based approach works very well [^23], [^24]. However, the distinction between
+the two architectures is being blurred as log-based messaging systems such as Kafka now support
+JMS/AMQP style consumer groups, which allow multiple consumers to receive messages from the same
+partition [^25], [^26].
+
+Since sharded logs typically preserve message ordering only within a single shard, all messages that
+need to be ordered consistently need to be routed to the same shard. For example, an application may
+require that the events relating to one particular user appear in a fixed order. This can be
+achieved by choosing the shard for an event based on the user ID of that event (in other words,
+making the user ID the *partition key*).
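+
+A minimal sketch of this routing in Python is shown below. The `choose_shard` function and the
+use of SHA-256 are assumptions for illustration only; real brokers use their own partitioning
+functions. The essential property is that the hash is stable, so the same key always maps to the
+same shard.
+
+```python
+import hashlib
+
+
+def choose_shard(partition_key: str, num_shards: int) -> int:
+    """Map a partition key (e.g., a user ID) to a shard number."""
+    # A stable hash is used here because Python's built-in hash() is
+    # randomized per process for strings.
+    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
+    return int.from_bytes(digest[:8], "big") % num_shards
+
+
+# All events for the same (hypothetical) user land in the same shard,
+# so their relative order is preserved.
+events = [("user-17", "login"), ("user-42", "login"), ("user-17", "logout")]
+for user_id, action in events:
+    print(user_id, action, "-> shard", choose_shard(user_id, num_shards=4))
+```
+
+Note that with this simple modulo scheme, changing the number of shards changes where keys are
+routed, so ordering for a given key is only guaranteed while the shard count stays fixed.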
+
+#### Consumer offsets {#sec_stream_log_offsets}
+
+Consuming a shard sequentially makes it easy to tell which messages have been processed: all
+messages with an offset less than a consumer's current offset have already been processed, and all
+messages with a greater offset have not yet been seen. Thus, the broker does not need to track
+acknowledgments for every single message---it only needs to periodically record the consumer
+offsets. The reduced bookkeeping overhead and the opportunities for batching and pipelining in this
+approach help increase the throughput of log-based systems. If a consumer fails, however, it will
+resume from the last recorded offset rather than the most recent offset it saw. This can cause
+the consumer to see some messages twice.
+
+This offset is in fact very similar to the *log sequence number* that is commonly found in
+single-leader database replication, and which we discussed in ["Setting Up New
+Followers"](/en/ch6#sec_replication_new_replica). In database replication, the log sequence number
+allows a follower to reconnect to a leader after it has become disconnected, and resume replication
+without skipping any writes. Exactly the same principle is used here: the message broker behaves
+like a leader database, and the consumer like a follower.
+
+If a consumer node fails, another node in the consumer group is assigned the failed consumer's
+shards, and it starts consuming messages at the last recorded offset. If the consumer had processed
+subsequent messages but not yet recorded their offset, those messages will be processed a second
+time upon restart. We will discuss ways of dealing with this issue later in the chapter.
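+
+The effect of periodic offset recording can be sketched with a small in-memory simulation in
+Python. The `consume` function, its parameters, and the message names are hypothetical; the
+recorded offset is taken to mean "the next offset to read".
+
+```python
+def consume(log, start_offset, checkpoint_every, crash_after=None):
+    """Process messages from start_offset; return (processed, recorded_offset).
+
+    The offset is recorded only every `checkpoint_every` messages. If
+    `crash_after` is set, the consumer "crashes" after processing that many
+    messages, losing any progress made since the last recorded offset.
+    """
+    recorded = start_offset
+    processed = []
+    for offset in range(start_offset, len(log)):
+        if crash_after is not None and len(processed) == crash_after:
+            return processed, recorded  # crash: unrecorded progress is lost
+        processed.append(log[offset])
+        if (offset + 1) % checkpoint_every == 0:
+            recorded = offset + 1
+    return processed, len(log)  # clean shutdown records the final position
+
+
+log = ["m0", "m1", "m2", "m3", "m4", "m5"]
+first_run, recorded = consume(log, 0, checkpoint_every=3, crash_after=5)
+second_run, _ = consume(log, recorded, checkpoint_every=3)
+duplicates = [m for m in second_run if m in first_run]
+print(first_run, recorded, second_run, duplicates)
+```
+
+In this run the first consumer processes five messages but has only recorded offset 3 when it
+crashes, so its replacement processes the messages at offsets 3 and 4 a second time.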
+
+#### Disk space usage {#sec_stream_disk_usage}
+
+If you only ever append to the log, you will eventually run out of disk space. To reclaim disk
+space, the log is actually divided into segments, and from time to time old segments are deleted or
+moved to archive storage. (We'll discuss a more sophisticated way of freeing disk space in ["Log
+compaction"](/en/ch12#sec_stream_log_compaction).)
+
+This means that if a slow consumer cannot keep up with the rate of messages, and it falls so far
+behind that its consumer offset points to a deleted segment, it will miss some of the messages.
+Effectively, the log implements a bounded-size buffer that discards old messages when it gets full,
+also known as a *circular buffer* or *ring buffer*. However, since that buffer is on disk, it can be
+quite large.
+
+Let's do a back-of-the-envelope calculation. At the time of writing, a typical large hard drive has
+a capacity of 20 TB and a sequential write throughput of 250 MB/s. If you are writing messages at
+the fastest possible rate, it takes about 22 hours (20 TB ÷ 250 MB/s = 80,000 seconds) until the
+drive is full and you need to start deleting the oldest messages. That means a disk-based log can
+always buffer at least 22 hours' worth of messages, even if you have many disks with many machines
+(having more disks increases both the available space and the total write bandwidth). In practice,
+deployments rarely use the full write bandwidth of the disk, so the log can typically keep a buffer
+of several days' or even weeks' worth of messages.
+
+Many log-based message brokers now store messages in object storage to increase their storage
+capacity, similarly to databases as we saw in ["Databases Backed by Object
+Storage"](/en/ch6#sec_replication_object_storage). Message brokers such as Apache Kafka and Redpanda
+serve older messages from object storage as part of their tiered storage. Others, such as
+WarpStream, Confluent Freight, and Bufstream, store all of their data in the object store. In
+addition to cost-efficiency, this architecture also makes data integration easier: messages in
+object storage can be stored as Iceberg tables, which enable batch and data warehouse job execution
+directly on the data without having to copy it into another system.
+
+#### When consumers cannot keep up with producers {#id459}
+
+At the beginning of ["Messaging Systems"](/en/ch12#sec_stream_messaging) we discussed three choices
+of what to do if a consumer cannot keep up with the rate at which producers are sending messages:
+dropping messages, buffering, or applying backpressure. In this taxonomy, the log-based approach is
+a form of buffering with a large but fixed-size buffer (limited by the available disk space).
+
+If a consumer falls so far behind that the messages it requires are older than what is retained on
+disk, it will not be able to read those messages---so the broker effectively drops old messages that
+go back further than the size of the buffer can accommodate. You can monitor how far a consumer is
+behind the head of the log, and raise an alert if it falls behind significantly. As the buffer is
+large, there is enough time for a human operator to fix the slow consumer and allow it to catch up
+before it starts missing messages.
+
+Even if a consumer does fall too far behind and starts missing messages, only that consumer is
+affected; it does not disrupt the service for other consumers. This fact is a big operational
+advantage: you can experimentally consume a production log for development, testing, or debugging
+purposes, without having to worry much about disrupting production services. When a consumer is shut
+down or crashes, it stops consuming resources---the only thing that remains is its consumer offset.
+
+This behavior also contrasts with traditional message brokers, where you need to be careful to
+delete any queues whose consumers have been shut down---otherwise they continue unnecessarily
+accumulating messages and taking away memory from consumers that are still active.
+
+#### Replaying old messages {#sec_stream_replay}
+
+We noted previously that with AMQP- and JMS-style message brokers, processing and acknowledging
+messages is a destructive operation, since it causes the messages to be deleted on the broker. On
+the other hand, in a log-based message broker, consuming messages is more like reading from a file:
+it is a read-only operation that does not change the log.
+
+The only side effect of processing, besides any output of the consumer, is that the consumer offset
+moves forward. But the offset is under the consumer's control, so it can easily be manipulated if
+necessary: for example, you can start a copy of a consumer with yesterday's offsets and write the
+output to a different location, in order to reprocess the last day's worth of messages. You can
+repeat this any number of times, varying the processing code.
+
+This aspect makes log-based messaging more like the batch processes of the last chapter, where
+derived data is clearly separated from input data through a repeatable transformation process. It
+allows more experimentation and easier recovery from errors and bugs, making it a good tool for
+integrating dataflows within an organization [^27].
+
+## Databases and Streams {#sec_stream_databases}
+
+We have drawn some comparisons between message brokers and databases. Even though they have
+traditionally been considered separate categories of tools, we saw that log-based message brokers
+have been successful in taking ideas from databases and applying them to messaging. We can also go
+in reverse: take ideas from messaging and streams, and apply them to databases.
+
+One approach is to use an *event stream as the system of record* for storing data (see ["Systems of
+Record and Derived Data"](/en/ch1#sec_introduction_derived)). This is what happens in *event
+sourcing*, which we discussed in ["Event Sourcing and CQRS"](/en/ch3#sec_datamodels_events): instead
+of storing data in a data model that is mutated by updating and deleting, you can model every state
+change as an immutable event, and write it to an append-only log. Any read-optimized materialized
+views are derived from these events. Log-based message brokers (configured to never delete old
+events) are well suited for event sourcing since they use append-only storage, and they can notify
+consumers about new events with low latency.
+
+But you don't have to go as far as adopting event sourcing; even with mutable data models, event
+streams are useful for databases. In fact, every write to a database is an event that can be
+captured, stored, and processed. The connection between databases and streams runs deeper than just
+the physical storage of logs on disk---it is quite fundamental.
+
+For example, a replication log (see ["Implementation of Replication
+Logs"](/en/ch6#sec_replication_implementation)) is a stream of database write events, produced by
+the leader as it processes transactions. The followers apply that stream of writes to their own copy
+of the database and thus end up with an accurate copy of the same data. The events in the
+replication log describe the data changes that occurred.
+
+We also came across the *state machine replication* principle in ["Using shared
+logs"](/en/ch10#sec_consistency_smr), which states: if every event represents a write to the
+database, and every replica processes the same events in the same order, then the replicas will all
+end up in the same final state. (Processing an event is assumed to be a deterministic operation.)
+It's just another case of event streams!
+
+In this section we will first look at a problem that arises in heterogeneous data systems, and then
+explore how we can solve it by bringing ideas from event streams to databases.
+
+### Keeping Systems in Sync {#sec_stream_sync}
+
+As we have seen throughout this book, there is no single system that can satisfy all data storage,
+querying, and processing needs. In practice, most nontrivial applications need to combine several
+different technologies in order to satisfy their requirements: for example, using an OLTP database
+to serve user requests, a cache to speed up common requests, a full-text index to handle search
+queries, and a data warehouse for analytics. Each of these has its own copy of the data, stored in
+its own representation that is optimized for its own purposes.
+
+As the same or related data appears in several different places, they need to be kept in sync with
+one another: if an item is updated in the database, it also needs to be updated in the cache, search
+indexes, and data warehouse. With data warehouses this synchronization is usually performed by ETL
+processes (see ["Data Warehousing"](/en/ch1#sec_introduction_dwh)), often by taking a full copy of a
+database, transforming it, and bulk-loading it into the data warehouse---in other words, a batch
+process. Similarly, we saw in ["Batch Use Cases"](/en/ch11#sec_batch_output) how search indexes,
+recommendation systems, and other derived data systems might be created using batch processes.
+
+If periodic full database dumps are too slow, an alternative that is sometimes used is *dual
+writes*, in which the application code explicitly writes to each of the systems when data changes:
+for example, first writing to the database, then updating the search index, then invalidating the
+cache entries (or even performing those writes concurrently).
+
+However, dual writes have some serious problems, one of which is a race condition illustrated in
+[Figure 12-4](/en/ch12#fig_stream_write_order). In this example, two clients concurrently want to
+update an item X: client 1 wants to set the value to A, and client 2 wants to set it to B. Both
+clients first write the new value to the database, then write it to the search index. Due to unlucky
+timing, the requests are interleaved: the database first sees the write from client 1 setting the
+value to A, then the write from client 2 setting the value to B, so the final value in the database
+is B. The search index first sees the write from client 2, then client 1, so the final value in the
+search index is A. The two systems are now permanently inconsistent with each other, even though no
+error occurred.
+
+{{< figure src="/fig/ddia_1204.png" id="fig_stream_write_order" caption="Figure 12-4. In the database, X is first set to A and then to B, while at the search index the writes arrive in the opposite order." class="w-full my-4" >}}
+
+Unless you have some additional concurrency detection mechanism, such as the version vectors we
+discussed in ["Detecting Concurrent Writes"](/en/ch6#sec_replication_concurrent), you will not even
+notice that concurrent writes occurred---one value will simply silently overwrite another value.
+
+Another problem with dual writes is that one of the writes may fail while the other succeeds. This
+is a fault-tolerance problem rather than a concurrency problem, but it also has the effect of the
+two systems becoming inconsistent with each other. Ensuring that they either both succeed or both
+fail is a case of the atomic commit problem, which is expensive to solve (see ["Two-Phase Commit
+(2PC)"](/en/ch8#sec_transactions_2pc)).
+
+If you only have one replicated database with a single leader, then that leader determines the order
+of writes, so the state machine replication approach works among replicas of the database. However,
+in [Figure 12-4](/en/ch12#fig_stream_write_order) there isn't a single leader: the database may have
+a leader and the search index may have a leader, but neither follows the other, and so conflicts can
+occur (see ["Multi-Leader Replication"](/en/ch6#sec_replication_multi_leader)).
+
+The situation would be better if there really was only one leader---for example, the database---and
+if we could make the search index a follower of the database. But is this possible in practice?
+
+### Change Data Capture {#sec_stream_cdc}
+
+The problem with most databases' replication logs is that they have long been considered to be an
+internal implementation detail of the database, not a public API. Clients are supposed to query the
+database through its data model and query language, not parse the replication logs and try to
+extract data from them.
+
+For decades, many databases simply did not have a documented way of getting the log of changes
+written to them. For this reason it was difficult to take all the changes made in a database and
+replicate them to a different storage technology such as a search index, cache, or data warehouse.
+
+More recently, there has been growing interest in *change data capture* (CDC), which is the process
+of observing all data changes written to a database and extracting them in a form in which they can
+be replicated to other systems [^28]. CDC is especially interesting if changes are made
+available as a stream, immediately as they are written.
+
+For example, you can capture the changes in a database and continually apply the same changes to a
+search index. If the log of changes is applied in the same order, you can expect the data in the
+search index to match the data in the database. The search index and any other derived data systems
+are just consumers of the change stream.
+
+[Figure 12-5](/en/ch12#fig_stream_change_capture) shows how the concurrency problem of
+[Figure 12-4](/en/ch12#fig_stream_write_order) is solved with CDC. Even though the two requests to
+set X to A and B respectively arrive concurrently at the database, the database decides on some
+order in which to execute them, and writes them to its replication log in that order. The search
+index picks them up and applies them in the same order. If you need the data in another system, such
+as a data warehouse, you can simply add it as another consumer of the CDC event stream.
+
+{{< figure src="/fig/ddia_1205.png" id="fig_stream_change_capture" caption="Figure 12-5. Taking data in the order it was written to one database, and applying the changes to other systems in the same order." class="w-full my-4" >}}
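+
+The idea can be sketched in a few lines of Python. The `Database` and `SearchIndex` classes below
+are toy in-memory stand-ins invented for this example, not the API of any real system: the
+database appends every write to a change log in the order it executed them, and the index only
+ever applies that log.
+
+```python
+class Database:
+    """Toy system of record that exposes its writes as an ordered change log."""
+
+    def __init__(self):
+        self.data = {}
+        self.change_log = []  # list of (key, value) in execution order
+
+    def write(self, key, value):
+        self.data[key] = value
+        self.change_log.append((key, value))
+
+
+class SearchIndex:
+    """Toy derived system that follows the database's change log."""
+
+    def __init__(self):
+        self.data = {}
+        self.position = 0  # offset of the next change to apply
+
+    def catch_up(self, change_log):
+        for key, value in change_log[self.position:]:
+            self.data[key] = value
+        self.position = len(change_log)
+
+
+db = Database()
+index = SearchIndex()
+db.write("X", "A")  # client 1
+db.write("X", "B")  # client 2; the database decided this write comes second
+index.catch_up(db.change_log)
+print(db.data, index.data)
+```
+
+Because the index never receives writes directly from clients, it cannot observe them in a
+different order than the database did, and both end up with the same final value for X.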
+
+#### Implementing change data capture {#id307}
+
+We can call the log consumers *derived data systems*, as discussed in ["Systems of Record and
+Derived Data"](/en/ch1#sec_introduction_derived): the data stored in the search index and the data
+warehouse is just another view onto the data in the system of record. Change data capture is a
+mechanism for ensuring that all changes made to the system of record are also reflected in the
+derived data systems so that the derived systems have an accurate copy of the data.
+
+Essentially, change data capture makes one database the leader (the one from which the changes are
+captured), and turns the others into followers. A log-based message broker is well suited for
+transporting the change events from the source database to the derived systems, since it preserves
+the ordering of messages (avoiding the reordering issue of
+[Figure 12-2](/en/ch12#fig_stream_redelivery)).
+
+Logical replication logs can be used to implement change data capture (see ["Logical (row-based) log
+replication"](/en/ch6#sec_replication_logical)), although it comes with challenges, such as handling
+schema changes and properly modeling updates. The Debezium open source project addresses these
+challenges. The project contains *source connectors* for MySQL, PostgreSQL, Oracle, SQL Server, Db2,
+Cassandra, and many other databases. These connectors attach to database replication logs and
+surface the changes in a standard event schema. Messages can then be transformed and written to
+downstream databases. The Kafka Connect framework offers further CDC connectors for various
+databases as well. Maxwell does something similar for MySQL by parsing the binlog [^29],
+GoldenGate provides similar facilities for Oracle, and pgcapture does the same for PostgreSQL.
+
+Like message brokers, change data capture is usually asynchronous: the system of record database
+does not wait for the change to be applied to consumers before committing it. This design has the
+operational advantage that adding a slow consumer does not affect the system of record too much, but
+it has the downside that all the issues of replication lag apply (see ["Problems with Replication
+Lag"](/en/ch6#sec_replication_lag)).
+
+#### Initial snapshot {#sec_stream_cdc_snapshot}
+
+If you have the log of all changes that were ever made to a database, you can reconstruct the entire
+state of the database by replaying the log. However, in many cases, keeping all changes forever
+would require too much disk space, and replaying it would take too long, so the log needs to be
+truncated.
+
+Building a new full-text index, for example, requires a full copy of the entire database---it is not
+sufficient to only apply a log of recent changes, since it would be missing items that were not
+recently updated. Thus, if you don't have the entire log history, you need to start with a
+consistent snapshot, as previously discussed in ["Setting Up New
+Followers"](/en/ch6#sec_replication_new_replica).
+
+The snapshot of the database must correspond to a known position or offset in the change log, so
+that you know at which point to start applying changes after the snapshot has been processed. Some
+CDC tools integrate this snapshot facility, while others leave it as a manual operation. Debezium
+uses Netflix's DBLog watermarking algorithm to provide incremental snapshots [^30],
+[^31].
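
As a rough illustration of that handoff, the following Python sketch loads a snapshot and then applies only the log entries at or after the snapshot's offset. The `ChangeEvent` type, the `bootstrap` function, and the convention that the snapshot reflects every event below `snapshot_offset` are assumptions made for this example; it does not show how Debezium or DBLog work internally.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ChangeEvent:
    offset: int           # position of the event in the change log
    key: str
    value: Optional[str]  # None marks a deletion


def bootstrap(snapshot: dict, snapshot_offset: int, log: list) -> dict:
    """Build a derived store from a snapshot, then catch up from the log.

    Assumption: the snapshot already reflects every event whose offset is
    below snapshot_offset, so replay starts at exactly that offset.
    """
    derived = dict(snapshot)
    for event in log:
        if event.offset < snapshot_offset:
            continue  # already included in the snapshot
        if event.value is None:
            derived.pop(event.key, None)
        else:
            derived[event.key] = event.value
    return derived


# Events at offsets 0 and 1 have been truncated; the snapshot covers them.
snapshot = {"a": "1", "b": "2"}
log = [
    ChangeEvent(2, "b", "3"),
    ChangeEvent(3, "c", "4"),
    ChangeEvent(4, "a", None),
]
state = bootstrap(snapshot, snapshot_offset=2, log=log)
```

If the snapshot did not correspond to a known offset, the consumer could not tell which log entries were already reflected in it, and would risk skipping changes or applying them twice.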
+
+#### Log compaction {#sec_stream_log_compaction}
+
+If you can only keep a limited amount of log history, you need to go through the snapshot process
+every time you want to add a new derived data system. However, *log compaction* provides a good
+alternative.
+
+We discussed log compaction previously in ["Log-Structured
+Storage"](/en/ch4#sec_storage_log_structured), in the context of log-structured storage engines (see
+[Figure 4-3](/en/ch4#fig_storage_sstable_merging) for an example). The principle is simple: the
+storage engine periodically looks for log records with the same key, throws away any duplicates, and
+keeps only the most recent update for each key. This might make log segments much smaller, so
+segments may also be merged as part of the compaction process, as shown in
+[Figure 12-6](/en/ch12#fig_stream_compaction). This process runs in the background.
+
+{{< figure src="/fig/ddia_1206.png" id="fig_stream_compaction" caption="Figure 12-6. A log of key-value pairs, where the key is the ID of a cat video (mew, purr, scratch, or yawn), and the value is the number of times it has been played. Log compaction retains only the most recent value for each key." class="w-full my-4" >}}
+
+In a log-structured storage engine, an update with a special null value (a *tombstone*) indicates
+that a key was deleted, and causes it to be removed during log compaction. But as long as a key is
+not overwritten or deleted, it stays in the log forever. The disk space required for such a
+compacted log depends only on the current contents of the database, not the number of writes that
+have ever occurred in the database. If the same key is frequently overwritten, previous values will
+eventually be garbage-collected, and only the latest value will be retained.
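
The compaction rule can be sketched in a few lines of Python. This is a simplified, in-memory illustration rather than the algorithm of any particular storage engine or broker: the `compact` function and the play counts are made up for the example, and a value of `None` stands in for a tombstone.

```python
def compact(log: list) -> list:
    """Keep only the most recent record for each key, preserving log order.

    Each record is a (key, value) pair. A value of None is a tombstone,
    which removes the key from the compacted log altogether.
    """
    latest = {}  # key -> index of the most recent record for that key
    for index, (key, _value) in enumerate(log):
        latest[key] = index
    return [
        (key, value)
        for index, (key, value) in enumerate(log)
        if latest[key] == index and value is not None
    ]


log = [
    ("mew", 1078), ("purr", 2103), ("purr", 2104), ("mew", 1079),
    ("yawn", 511), ("purr", 2105), ("scratch", 252), ("yawn", None),
]
compacted = compact(log)
```

Here `compacted` contains one record each for `mew`, `purr`, and `scratch`, while `yawn` disappears because its latest record is a tombstone. The sketch drops tombstones immediately; a system whose consumers read the log may need to keep them for a while so that those consumers still observe the deletion.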
+
+The same idea works in the context of log-based message brokers and change data capture. If the CDC
+system is set up such that every change has a primary key, and every update for a key replaces the
+previous value for that key, then it's sufficient to keep just the most recent write for a
+particular key.
+
+Now, whenever you want to rebuild a derived data system such as a search index, you can start a new
+consumer from offset 0 of the log-compacted topic, and sequentially scan over all messages in the
+log. The log is guaranteed to contain the most recent value for every key in the database (and maybe
+some older values)---in other words, you can use it to obtain a full copy of the database contents
+without having to take another snapshot of the CDC source database.
+
+This log compaction feature is supported by Apache Kafka. As we shall see later in this chapter, it
+allows the message broker to be used for durable storage, not just for transient messaging.
+
+#### API support for change streams {#sec_stream_change_api}
+
+Most popular databases now expose change streams as a first-class interface, rather than the
+retrofitted and reverse-engineered CDC efforts of the past. Relational databases such as MySQL and
+PostgreSQL typically send changes through the same replication log they use for their own replicas.
+Most cloud vendors offer CDC solutions for their products as well: for example, Datastream offers
+streaming data access for Google Cloud's relational databases and data warehouses.
+
+Even eventually consistent, quorum-based databases such as Cassandra now support change data
+capture. As we saw in ["Linearizability and quorums"](/en/ch10#sec_consistency_quorum_linearizable),
+clients must persist writes to a majority of nodes before they're considered visible. CDC support
+for quorum writes is challenging because there's no single source of truth to subscribe to. Whether
+the data is visible or not depends on each reader's consistency preferences. Cassandra sidesteps
+this issue by exposing raw log segments for each node rather than providing a single stream of
+mutations. Systems that wish to consume the data must read the raw log segments for each node and
+decide how best to merge them into a single stream (much like a quorum reader does) [^32].
+
+Kafka Connect [^33] integrates change data capture tools for a wide range of database
+systems with Kafka. Once the stream of change events is in Kafka, it can be used to update derived
+data systems such as search indexes, and also feed into stream processing systems as discussed later
+in this chapter.
+
+#### Change data capture versus event sourcing {#sec_stream_event_sourcing}
+
+Let's compare change data capture to event sourcing. Similarly to change data capture, event
+sourcing involves storing all changes to the application state as a log of change events. The
+biggest difference is that event sourcing applies the idea at a different level of abstraction:
+
+- In change data capture, the application uses the database in a mutable way, updating and deleting
+ records at will. The log of changes is extracted from the database at a low level (e.g., by
+ parsing the replication log), which ensures that the order of writes extracted from the database
+ matches the order in which they were actually written, avoiding the race condition in
+ [Figure 12-4](/en/ch12#fig_stream_write_order).
+
+- In event sourcing, the application logic is explicitly built on the basis of immutable events that
+ are written to an event log. In this case, the event store is append-only, and updates or deletes
+ of events are discouraged or prohibited. Events are designed to reflect things that happened at
+ the application level, rather than low-level state changes.
+
+Which one is better depends on your situation. Adopting event sourcing is a big change for an
+application that is not already doing it; it has a number of pros and cons, which we discussed in
+["Event Sourcing and CQRS"](/en/ch3#sec_datamodels_events). In contrast, CDC can be added to an
+existing database with minimal changes---the application writing to the database might not even know
+that CDC is occurring.
+
+> [!TIP] CHANGE DATA CAPTURE AND DATABASE SCHEMAS
+> Though change data capture appears easier to adopt than event sourcing, it comes with its own set of
+> challenges.
+>
+> In a microservices architecture, a database is typically only accessed from one service. Other
+> services interact with it through that service's public API, but they don't normally access the
+> database directly. This makes the database an internal implementation detail of the service,
+> allowing the developers to change its database schema without affecting the public API.
+>
+> However, CDC systems typically use the upstream database's schema when replicating its data, which
+> turns these schemas into public APIs that must be managed much like the public API of the service. A
+> developer who removes a column from a database table will break downstream consumers that
+> depend on that column. Such challenges have always existed with data pipelines, but they typically
+> only impacted data warehouse ETL. Since CDC is often implemented as a data stream, other production
+> services might be consumers. Breaking such consumers can cause a customer-facing outage
+> [^34]. Data contracts are often used to prevent these breakages.
+>
+> A common way to decouple internal from external schemas is to use the *outbox pattern*. Outboxes are
+> tables with their own schemas, which are exposed to the CDC system rather than the internal domain
+> model in the database [^35], [^36]. Developers can then modify their internal
+> schemas as they see fit while leaving their outbox tables untouched. This might look like a dual
+> write---it is. However, outboxes avoid the challenges we discussed in ["Keeping Systems in
+> Sync"](/en/ch12#sec_stream_sync) by keeping both writes in the same system (the
+> database). This design allows both writes to appear in a single transaction.
+>
+> Outboxes present a few tradeoffs, though. Developers must still maintain the transformation between
+> their internal and outbox schemas, which can be challenging. An outbox also increases the amount of
+> data that the database has to write to its underlying storage, which might trigger performance
+> problems.
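
The transactional part of the outbox pattern can be illustrated with a small sketch using SQLite from Python's standard library. The `orders` and `outbox` tables, their columns, and the `OrderPlaced` event are hypothetical; a CDC tool would be pointed at the `outbox` table only, which is not shown here.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, total_cents INTEGER);
    CREATE TABLE outbox (
        id INTEGER PRIMARY KEY AUTOINCREMENT,
        event_type TEXT NOT NULL,
        payload TEXT NOT NULL
    );
""")


def place_order(conn, order_id, customer, total_cents):
    """Write the public outbox event and the internal row in one transaction."""
    with conn:  # commits on success, rolls back if an exception is raised
        # The outbox payload is the public contract; it need not mirror the
        # columns of the internal orders table.
        event = {"order_id": order_id, "amount_cents": total_cents}
        conn.execute(
            "INSERT INTO outbox (event_type, payload) VALUES (?, ?)",
            ("OrderPlaced", json.dumps(event)),
        )
        conn.execute(
            "INSERT INTO orders (id, customer, total_cents) VALUES (?, ?, ?)",
            (order_id, customer, total_cents),
        )


place_order(conn, 1, "alice", 1999)
try:
    place_order(conn, 1, "bob", 500)  # duplicate primary key in orders
except sqlite3.IntegrityError:
    pass

outbox_rows = conn.execute("SELECT event_type, payload FROM outbox").fetchall()
```

After both calls, `outbox_rows` holds a single `OrderPlaced` event: the second call fails on the duplicate primary key in `orders`, and the outbox row written earlier in the same transaction is rolled back with it. Because both writes go to the same database, no event is published for an order that was never stored.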
+
+As with change data capture, replaying the event log allows you to reconstruct the current state
+of the system. However, log compaction needs to be handled differently:
+
+- A CDC event for the update of a record typically contains the entire new version of the record, so
+ the current value for a primary key is entirely determined by the most recent event for that
+ primary key, and log compaction can discard previous events for the same key.
+
+- On the other hand, with event sourcing, events are modeled at a higher level: an event typically
+ expresses the intent of a user action, not the mechanics of the state update that occurred as a
+ result of the action. In this case, later events typically do not override prior events, and so
+ you need the full history of events to reconstruct the final state. Log compaction is not possible
+ in the same way.
+
+Applications that use event sourcing typically have some mechanism for storing snapshots of the
+current state that is derived from the log of events, so they don't need to repeatedly reprocess the
+full log. However, this is only a performance optimization to speed up reads and recovery from
+crashes; the intention is that the system is able to store all raw events forever and reprocess the
+full event log whenever required. We discuss this assumption in ["Limitations of
+immutability"](/en/ch12#sec_stream_immutability_limitations).
+
+### State, Streams, and Immutability {#sec_stream_immutability}
+
+We saw in [Chapter 11](/en/ch11#ch_batch) that batch processing benefits from the immutability of
+its input files, so you can run experimental processing jobs on existing input files without fear of
+damaging them. This principle of immutability is also what makes event sourcing and change data
+capture so powerful.
+
+We normally think of databases as storing the current state of the application---this representation
+is optimized for reads, and it is usually the most convenient for serving queries. The nature of
+state is that it changes, so databases support updating and deleting data as well as inserting it.
+How does this fit with immutability?
+
+Whenever you have state that changes, that state is the result of the events that mutated it over
+time. For example, your list of currently available seats is the result of the reservations you have
+processed, the current account balance is the result of the credits and debits on the account, and
+the response time graph for your web server is an aggregation of the individual response times of
+all web requests that have occurred.
+
+No matter how the state changes, there was always a sequence of events that caused those changes.
+Even as things are done and undone, the fact remains true that those events occurred. The key idea
+is that mutable state and an append-only log of immutable events do not contradict each other: they
+are two sides of the same coin. The log of all changes, the *changelog*, represents the evolution of
+state over time.
+
+If you are mathematically inclined, you might say that the application state is what you get when
+you integrate an event stream over time, and a change stream is what you get when you differentiate
+the state by time, as shown in [Figure 12-7](/en/ch12#fig_stream_integral) [^37],
+[^38]. The analogy has limitations (for example, the second derivative of state does not
+seem to be meaningful), but it's a useful starting point for thinking about data.
+
+{{< figure src="/fig/ddia_1207.png" id="fig_stream_integral" caption="Figure 12-7. The relationship between the current application state and an event stream." class="w-full my-4" >}}
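
The analogy can be made concrete with a small Python sketch. The event values are hypothetical credits and debits on a single account, in cents; a running sum plays the role of the integral, and the difference between consecutive states plays the role of the derivative.

```python
from itertools import accumulate

# Hypothetical change events for one account: credits and debits in cents.
events = [+5000, -1200, -300, +250]

# "Integrating" the event stream: each state is the sum of all events so far.
states = list(accumulate(events, initial=0))

# "Differentiating" the state: the difference between consecutive states
# recovers the original change stream.
changes = [after - before for before, after in zip(states, states[1:])]
```

The last element of `states` is the current balance, and `changes` is identical to `events`, which is the sense in which the state and the changelog are two representations of the same information.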
+
+If you store the changelog durably, that simply has the effect of making the state reproducible. If
+you consider the log of events to be your system of record, and any mutable state as being derived
+from it, it becomes easier to reason about the flow of data through a system. As Jim Gray and
+Andreas Reuter put it in 1992 [^39]:
+
+> \[T\]here is no fundamental need to keep a database at all; the log contains all the information
+> there is. The only reason for storing the database (i.e., the current end-of-the-log) is
+> performance of retrieval operations.
+
+Log compaction is one way of bridging the distinction between log and database state: it retains
+only the latest version of each record, and discards overwritten versions.
+
+#### Advantages of immutable events {#sec_stream_immutability_pros}
+
+Immutability in databases is an old idea. For example, accountants have been using immutability for
+centuries in financial bookkeeping. When a transaction occurs, it is recorded in an append-only
+*ledger*, which is essentially a log of events describing money, goods, or services that have
+changed hands. The accounts, such as profit and loss or the balance sheet, are derived from the
+transactions in the ledger by adding them up [^40].
+
+If a mistake is made, accountants don't erase or change the incorrect transaction in the
+ledger---instead, they add another transaction that compensates for the mistake, for example
+refunding an incorrect charge. The incorrect transaction still remains in the ledger forever,
+because it might be important for auditing reasons. If incorrect figures, derived from the incorrect
+ledger, have already been published, then the figures for the next accounting period include a
+correction. This process is entirely normal in accounting [^41].
+
+Although such auditability is particularly important in financial systems, it is also beneficial for
+many other systems that are not subject to such strict regulation. If you accidentally deploy buggy
+code that writes bad data to a database, recovery is much harder if the code is able to
+destructively overwrite data. With an append-only log of immutable events, it is much easier to
+diagnose what happened and recover from the problem. Similarly, customer service can use an audit
+log to diagnose customer requests and complaints.
+
+Immutable events also capture more information than just the current state. For example, on a
+shopping website, a customer may add an item to their cart and then remove it again. Although the
+second event cancels out the first event from the point of view of order fulfillment, it may be
+useful to know for analytics purposes that the customer was considering a particular item but then
+decided against it. Perhaps they will choose to buy it in the future, or perhaps they found a
+substitute. This information is recorded in an event log, but would be lost in a database that
+deletes items when they are removed from the cart.
+
+#### Deriving several views from the same event log {#sec_stream_deriving_views}
+
+Moreover, by separating mutable state from the immutable event log, you can derive several different
+read-oriented representations from the same log of events. This works just like having multiple
+consumers of a stream ([Figure 12-5](/en/ch12#fig_stream_change_capture)): for example, the analytic
+database Druid ingests directly from Kafka using this approach, and Kafka Connect sinks can export
+data from Kafka to various different databases and indexes [^33].
+
+Having an explicit translation step from an event log to a database makes it easier to evolve your
+application over time: if you want to introduce a new feature that presents your existing data in
+some new way, you can use the event log to build a separate read-optimized view for the new feature,
+and run it alongside the existing systems without having to modify them. Running old and new systems
+side by side is often easier than performing a complicated schema migration in an existing system.
+Once readers have switched to the new system and the old system is no longer needed, you can simply
+shut it down and reclaim its resources [^42], [^43].
+
+This idea of writing data in one write-optimized form, and then translating it into different
+read-optimized representations as needed, is the *command query responsibility segregation* (CQRS)
+pattern that we already encountered in ["Event Sourcing and CQRS"](/en/ch3#sec_datamodels_events).
+It doesn't necessarily require event sourcing: you can just as well build multiple materialized
+views from a stream of CDC events [^44].
+
+The traditional approach to database and schema design is based on the fallacy that data must be
+written in the same form as it will be queried. Debates about normalization and denormalization (see
+["Normalization, Denormalization, and Joins"](/en/ch3#sec_datamodels_normalization)) become largely
+irrelevant if you can translate data from a write-optimized event log to read-optimized application
+state: it is entirely reasonable to denormalize data in the read-optimized views, as the translation
+process gives you a mechanism for keeping it consistent with the event log.
+
+In ["Case Study: Social Network Home Timelines"](/en/ch2#sec_introduction_twitter) we discussed a
+social network's home timelines, a cache of recent posts by the people a particular user is
+following (like a mailbox). This is another example of read-optimized state: home timelines are
+highly denormalized, since your posts are duplicated in all of the timelines of the people following
+you. However, the fan-out service keeps this duplicated state in sync with new posts and new
+following relationships, which keeps the duplication manageable.
+
+#### Concurrency control {#sec_stream_concurrency}
+
+The biggest downside of CQRS is that the consumers of the event log are usually asynchronous, so
+there is a possibility that a user may make a write to the log, then read from a derived view and
+find that their write has not yet been reflected in the view. We discussed this problem and
+potential solutions previously in ["Reading Your Own Writes"](/en/ch6#sec_replication_ryw).
+
+One solution would be to perform the updates of the read view synchronously with appending the event
+to the log. This either requires a distributed transaction across the event log and the derived
+view, or some way of waiting until an event has been reflected in the view. Both approaches are
+usually impractical, so views are normally updated asynchronously.
+
+On the other hand, deriving the current state from an event log also simplifies some aspects of
+concurrency control. Much of the need for multi-object transactions (see ["Single-Object and
+Multi-Object Operations"](/en/ch8#sec_transactions_multi_object)) stems from a single user action
+requiring data to be changed in several different places. With event sourcing, you can design an
+event such that it is a self-contained description of a user action. The user action then requires
+only a single write in one place---namely appending the event to the log---which is easy to make
+atomic.
+
+If the event log and the application state are sharded in the same way (for example, processing an
+event for a customer in shard 3 only requires updating shard 3 of the application state), then a
+straightforward single-threaded log consumer needs no concurrency control for writes---by
+construction, it only processes a single event at a time (see also ["Actual Serial
+Execution"](/en/ch8#sec_transactions_serial)). The log removes the nondeterminism of concurrency by
+defining a serial order of events in a shard [^27]. If an event touches multiple state
+shards, a bit more work is required, which we will discuss in [Chapter 13](/en/ch13#ch_philosophy).
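
A minimal sketch of such a consumer, assuming a hypothetical seat-reservation application in which every event is a self-contained dictionary and all events for a given seat land in the same shard:

```python
def apply_event(state: dict, event: dict) -> None:
    """Apply one self-contained event to the state of a single shard."""
    if event["type"] == "reserve":
        # No lock is needed: this consumer is the only writer to this shard,
        # and it has fully processed every earlier event in the log.
        if event["seat"] not in state:
            state[event["seat"]] = event["user"]
    elif event["type"] == "cancel":
        if state.get(event["seat"]) == event["user"]:
            del state[event["seat"]]


def consume(shard_log: list) -> dict:
    state = {}
    for event in shard_log:  # strictly one event at a time, in log order
        apply_event(state, event)
    return state


shard_log = [
    {"type": "reserve", "seat": "12A", "user": "alice"},
    {"type": "reserve", "seat": "12A", "user": "bob"},    # loses: seat is taken
    {"type": "cancel", "seat": "12A", "user": "alice"},
    {"type": "reserve", "seat": "12A", "user": "carol"},
]
seats = consume(shard_log)
```

The conflict between the two competing reservations is resolved purely by their position in the log, and replaying the same log always produces the same state.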
+
+Many systems that don't use an event-sourced model nevertheless rely on immutability for concurrency
+control: various databases internally use immutable data structures or multi-version data to support
+point-in-time snapshots (see ["Indexes and snapshot
+isolation"](/en/ch8#sec_transactions_snapshot_indexes)). Version control systems such as Git,
+Mercurial, and Fossil also rely on immutable data to preserve version history of files.
+
+#### Limitations of immutability {#sec_stream_immutability_limitations}
+
+To what extent is it feasible to keep an immutable history of all changes forever? The answer
+depends on the amount of churn in the dataset. Some workloads mostly add data and rarely update or
+delete; they are easy to make immutable. Other workloads have a high rate of updates and deletes on
+a comparatively small dataset; in these cases, the immutable history may grow prohibitively large,
+fragmentation may become an issue, and the performance of compaction and garbage collection becomes
+crucial for operational robustness [^45], [^46].
+
+Besides the performance reasons, there may also be circumstances in which you need data to be
+deleted for administrative or legal reasons, in spite of all immutability. For example, privacy
+regulations such as the European General Data Protection Regulation (GDPR) require that a user's
+personal information be deleted and erroneous information be removed on demand, or an accidental
+leak of sensitive information may need to be contained.
+
+In these circumstances, it's not sufficient to just append another event to the log to indicate that
+the prior data should be considered deleted---you actually want to rewrite history and pretend that
+the data was never written in the first place. For example, Datomic calls this feature *excision*
+[^47], and the Fossil version control system has a similar concept called *shunning*
+[^48].
+
+Truly deleting data is surprisingly hard [^49], since copies can live in many places: for
+example, storage engines, filesystems, and SSDs often write to a new location rather than
+overwriting in place [^41], and backups are often deliberately immutable to prevent
+accidental deletion or corruption.
+
+One way of enabling deletion of immutable data is *crypto-shredding* [^50]: data that you
+may want to delete in the future is stored encrypted, and when you want to get rid of it, you forget
+the encryption key. The encrypted data is then still there, but nobody can use it. In some sense
+this only moves the problem around: the actual data is now immutable, but your key storage is
+mutable.
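
The division between an immutable log and a mutable key store can be sketched as follows. This is an illustration of the idea only: the per-user key granularity, the `key_store` and `event_log` structures, and the function names are assumptions, and the toy XOR keystream built from SHA-256 is not a vetted cipher. A real system should use an authenticated cipher from an established cryptography library.

```python
import hashlib
import secrets


def keystream(key: bytes, nonce: bytes, length: int) -> bytes:
    """Toy keystream from SHA-256 in counter mode. For illustration only."""
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(8, "big")).digest()
        counter += 1
    return out[:length]


def xor(data: bytes, stream: bytes) -> bytes:
    return bytes(a ^ b for a, b in zip(data, stream))


key_store = {}  # mutable: user ID -> key; deleting an entry shreds the data
event_log = []  # append-only: (user ID, nonce, ciphertext)


def append_event(user_id: str, plaintext: bytes) -> None:
    if user_id not in key_store:
        key_store[user_id] = secrets.token_bytes(32)
    nonce = secrets.token_bytes(16)  # fresh per record, stored in the clear
    stream = keystream(key_store[user_id], nonce, len(plaintext))
    event_log.append((user_id, nonce, xor(plaintext, stream)))


def read_events(user_id: str) -> list:
    key = key_store.get(user_id)
    if key is None:
        return []  # key forgotten: the ciphertext remains but is unreadable
    return [
        xor(ciphertext, keystream(key, nonce, len(ciphertext)))
        for uid, nonce, ciphertext in event_log
        if uid == user_id
    ]


append_event("alice", b"alice@example.com")
append_event("bob", b"bob@example.com")
del key_store["alice"]  # crypto-shredding: delete the key, not the log entry
```

After the key is deleted, the log still holds both records, but only the one belonging to `bob` can be decrypted.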
+
+Moreover, you have to decide up front which data is going to be encrypted with the same key, and
+when you are going to use different keys---an important decision, since you can later crypto-shred
+either all or none of the data encrypted with a particular key, but not some of it. Storing a
+separate key for every single data item would get too unwieldy, as the key storage would get as big
+as the primary data storage. More sophisticated schemes such as puncturable encryption
+[^51] make it possible to selectively revoke a key's decryption abilities, but they are
+not widely used.
+
+Overall, deletion is more a matter of "making it harder to retrieve the data" than actually "making
+it impossible to retrieve the data." Nevertheless, you sometimes have to try, as we shall see in
+["Legislation and Self-Regulation"](/en/ch14#sec_future_legislation).
+
+## Processing Streams {#sec_stream_processing}
+
+So far in this chapter we have talked about where streams come from (user activity events, sensors,
+and writes to databases), and we have talked about how streams are transported (through direct
+messaging, via message brokers, and in event logs).
+
+What remains is to discuss what you can do with the stream once you have it---namely, you can
+process it. Broadly, there are three options:
+
+1. You can take the data in the events and write it to a database, cache, search index, or similar
+ storage system, from where it can then be queried by other clients. As shown in
+ [Figure 12-5](/en/ch12#fig_stream_change_capture), this is a good way of keeping a database in
+ sync with changes happening in other parts of the system---especially if the stream consumer is
+ the only client writing to the database. Writing to a storage system is the streaming equivalent
+ of what we discussed in ["Batch Use Cases"](/en/ch11#sec_batch_output).
+
+2. You can push the events to users in some way, for example by sending email alerts or push
+ notifications, or by streaming the events to a real-time dashboard where they are visualized. In
+ this case, a human is the ultimate consumer of the stream.
+
+3. You can process one or more input streams to produce one or more output streams. Streams may go
+ through a pipeline consisting of several such processing stages before they eventually end up at
+ an output (option 1 or 2).
+
+In the rest of this chapter, we will discuss option 3: processing streams to produce other, derived
+streams. A piece of code that processes streams like this is known as an *operator* or a *job*. It
+is closely related to the Unix processes and MapReduce jobs we discussed in
+[Chapter 11](/en/ch11#ch_batch), and the pattern of dataflow is similar: a stream processor consumes
+input streams in a read-only fashion and writes its output to a different location in an append-only
+fashion.
+
+The patterns for sharding and parallelization in stream processors are also very similar to those in
+MapReduce and the dataflow engines we saw in [Chapter 11](/en/ch11#ch_batch), so we won't repeat
+those topics here. Basic mapping operations such as transforming and filtering records also work the
+same.
+
+The one crucial difference from batch jobs is that a stream never ends. This difference has many
+implications: as discussed at the start of this chapter, sorting does not make sense with an
+unbounded dataset, and so sort-merge joins (see ["JOIN and GROUP BY"](/en/ch11#sec_batch_join))
+cannot be used. Fault-tolerance mechanisms must also change: with a batch job that has been running
+for a few minutes, a failed task can simply be restarted from the beginning, but with a stream job
+that has been running for several years, restarting from the beginning after a crash may not be a
+viable option.
+
+### Uses of Stream Processing {#sec_stream_uses}
+
+Stream processing has long been used for monitoring purposes, where an organization wants to be
+alerted if certain things happen. For example:
+
+- Fraud detection systems need to determine if the usage patterns of a credit card have unexpectedly
+ changed, and block the card if it is likely to have been stolen.
+
+- Trading systems need to examine price changes in a financial market and execute trades according
+ to specified rules.
+
+- Manufacturing systems need to monitor the status of machines in a factory, and quickly identify
+ the problem if there is a malfunction.
+
+- Military and intelligence systems need to track the activities of a potential aggressor, and raise
+ the alarm if there are signs of an attack.
+
+These kinds of applications require quite sophisticated pattern matching and correlations. However,
+other uses of stream processing have also emerged over time. In this section we will briefly compare
+and contrast some of these applications.
+
+#### Complex event processing {#id317}
+
+*Complex event processing* (CEP) is an approach developed in the 1990s for analyzing event streams,
+especially geared toward the kind of application that requires searching for certain event patterns
+[^52]. Similarly to the way that a regular expression allows you to search for certain
+patterns of characters in a string, CEP allows you to specify rules to search for certain patterns
+of events in a stream.
+
+CEP systems often use a high-level declarative query language like SQL, or a graphical user
+interface, to describe the patterns of events that should be detected. These queries are submitted
+to a processing engine that consumes the input streams and internally maintains a state machine that
+performs the required matching. When a match is found, the engine emits a *complex event* (hence the
+name) with the details of the event pattern that was detected [^53].
+
+In these systems, the relationship between queries and data is reversed compared to normal
+databases. Usually, a database stores data persistently and treats queries as transient: when a
+query comes in, the database searches for data matching the query, and then forgets about the query
+when it has finished. CEP engines reverse these roles: queries are stored long-term; as each event
+arrives, the engine checks whether it has now seen an event pattern that matches any of its standing
+queries [^54].
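
The reversal of roles can be sketched with a toy engine in Python. The `StandingQuery` class, the event types, and the account-takeover pattern are hypothetical, and the matching rule is deliberately simple: the listed event types must occur in order for the same entity, with unrelated events allowed in between. Real CEP engines support much richer patterns, including time bounds.

```python
from collections import defaultdict


class StandingQuery:
    """A stored pattern: event types that must occur in this order for one
    entity, possibly with unrelated events in between."""

    def __init__(self, name: str, pattern: list):
        self.name = name
        self.pattern = pattern
        self.progress = defaultdict(int)  # entity -> number of steps matched

    def on_event(self, entity: str, event_type: str):
        step = self.progress[entity]
        if event_type != self.pattern[step]:
            return None  # not the next expected step; keep waiting
        if step + 1 < len(self.pattern):
            self.progress[entity] = step + 1
            return None
        self.progress[entity] = 0  # full match: reset and emit
        return {"query": self.name, "entity": entity}


queries = [StandingQuery("takeover", ["password_changed", "large_withdrawal"])]

stream = [
    ("acct-1", "login"),
    ("acct-1", "password_changed"),
    ("acct-2", "large_withdrawal"),  # no preceding password change
    ("acct-1", "large_withdrawal"),  # completes the pattern for acct-1
]

complex_events = []
for entity, event_type in stream:  # queries persist; events flow past them
    for query in queries:
        match = query.on_event(entity, event_type)
        if match is not None:
            complex_events.append(match)
```

The queries outlive every individual event, and the per-entity `progress` counter is the state machine that the engine maintains on their behalf.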
+
+Implementations of CEP include Esper, Apama, and TIBCO StreamBase. Distributed stream processors
+like Flink and Spark Streaming also have SQL support for declarative queries on streams.
+
+#### Stream analytics {#id318}
+
+Another area in which stream processing is used is *analytics* on streams. The boundary between
+CEP and stream analytics is blurry, but as a general rule, analytics tends to be less interested in
+finding specific event sequences and is more oriented toward aggregations and statistical metrics
+over a large number of events---for example:
+
+- Measuring the rate of some type of event (how often it occurs per time interval)
+
+- Calculating the rolling average of a value over some time period
+
+- Comparing current statistics to previous time intervals (e.g., to detect trends or to alert on
+ metrics that are unusually high or low compared to the same time last week)
+
+Such statistics are usually computed over fixed time intervals---for example, you might want to know
+the average number of queries per second to a service over the last 5 minutes, and their 99th
+percentile response time during that period. Averaging over a few minutes smooths out irrelevant
+fluctuations from one second to the next, while still giving you a timely picture of any changes in
+traffic pattern. The time interval over which you aggregate is known as a *window*, and we will look
+into windowing in more detail in ["Reasoning About Time"](/en/ch12#sec_stream_time).
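
As a rough sketch of such a computation, the following Python class keeps the events of the last five minutes and derives a request rate and a percentile from them. The `SlidingWindow` class is hypothetical, it assumes that events arrive in timestamp order, and it computes exact results by sorting rather than using an approximation algorithm.

```python
import math
from collections import deque


class SlidingWindow:
    """Keeps (timestamp, response_ms) pairs from the last `size` seconds."""

    def __init__(self, size_seconds: float):
        self.size = size_seconds
        self.events = deque()

    def add(self, timestamp: float, response_ms: float) -> None:
        # Assumes events arrive in timestamp order.
        self.events.append((timestamp, response_ms))
        while self.events[0][0] <= timestamp - self.size:
            self.events.popleft()  # expire events that fell out of the window

    def rate_per_second(self) -> float:
        return len(self.events) / self.size

    def percentile(self, p: int) -> float:
        # Nearest-rank percentile over the response times in the window.
        ordered = sorted(ms for _, ms in self.events)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]


window = SlidingWindow(size_seconds=300)
for second in range(600):     # one request per second for 10 minutes
    slow = second % 50 == 0   # every 50th request is slow
    window.add(float(second), 900.0 if slow else 20.0)
```

At the end of the loop the window holds only the most recent 300 events, so the rate is one request per second. The occasional slow request shows up in the 99th percentile but not in the median, which is one reason percentiles are tracked alongside averages.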
+
+Stream analytics systems sometimes use probabilistic algorithms, such as Bloom filters (which we
+encountered in ["Bloom filters"](/en/ch4#sec_storage_bloom_filter)) for set membership, HyperLogLog
+[^55] for cardinality estimation, and various percentile estimation algorithms (see
+["Computing Percentiles"](/en/ch2#sidebar_percentiles)). Probabilistic algorithms produce
+approximate results, but have the advantage of requiring significantly less memory in the stream
+processor than exact algorithms. This use of approximation algorithms sometimes leads people to
+believe that stream processing systems are always lossy and inexact, but that is wrong: there is
+nothing inherently approximate about stream processing, and using probabilistic algorithms is merely
+an optimization [^56].
+
+Many open source distributed stream processing frameworks are designed with analytics in mind: for
+example, Apache Storm, Spark Streaming, Flink, Samza, Apache Beam, and Kafka Streams
+[^57]. Hosted services include Google Cloud Dataflow and Azure Stream Analytics.
+
+#### Maintaining materialized views {#sec_stream_mat_view}
+
+We saw that a stream of changes to a database can be used to keep derived data systems, such as
+caches, search indexes, and data warehouses, up to date with a source database. These are examples
+of maintaining materialized views: deriving an alternative view onto some dataset so that you can
+query it efficiently, and updating that view whenever the underlying data changes [^37].
+
+Similarly, in event sourcing, application state is maintained by applying a log of events; here the
+application state is also a kind of materialized view. Unlike stream analytics scenarios, it is
+usually not sufficient to consider only events within some time window: building the materialized
+view potentially requires *all* events over an arbitrary time period, apart from any obsolete events
+that may be discarded by log compaction. In effect, you need a window that stretches all the way
+back to the beginning of time.
+
+In principle, any stream processor could be used for materialized view maintenance, although the
+need to maintain events forever runs counter to the assumptions of some analytics-oriented
+frameworks that mostly operate on windows of a limited duration. Kafka Streams and Confluent's
+ksqlDB support this kind of usage, building upon Kafka's support for log compaction [^58].
+
+> [!TIP] INCREMENTAL VIEW MAINTENANCE
+> Databases might seem well suited for materialized view maintenance; they are designed to keep full
+> copies of a dataset, after all. Many also support materialized views. We saw in ["Materialized Views
+> and Data Cubes"](/en/ch4#sec_storage_materialized_views) that analytical queries
+> typical of a data warehouse can be materialized into OLAP cubes.
+>
+> Unfortunately, databases often refresh materialized view tables using batch jobs or on-demand
+> requests such as PostgreSQL's `REFRESH MATERIALIZED VIEW`. Views are recalculated
+> periodically rather than as updates to source data occur. This approach has two significant
+> drawbacks that make it inappropriate for stream processing view maintenance:
+>
+> 1. Poor efficiency: All data is reprocessed every time the view is updated, though it's likely that
+> most of the data remains unchanged.
+>
+> 2. Data freshness: changes in source data are not reflected in a materialized view until its query
+> is re-run during its next scheduled update.
+>
+> It is possible to write database triggers that update materialized views efficiently in scenarios
+> where the data is easily partitioned and the computation is naturally incremental. For example, if a
+> materialized view maintains total sales revenue per day, the row for the appropriate day can be
+> updated every time a new sale occurs. Bespoke solutions work in a few cases, but many SQL queries
+> can't be easily or efficiently converted to incremental computation.
+>
+> *Incremental view maintenance (IVM)* is a more general solution to the problems listed above. IVM
+> techniques convert relational grammars such as SQL into operators capable of incremental
+> computations. Rather than processing entire datasets, IVM algorithms recompute and update only data
+> that has changed [^38], [^59], [^60]. View computation becomes far more
+> efficient. Updates can then be run much more frequently, which dramatically increases data
+> freshness.
+>
+> Databases such as Materialize [^61], RisingWave, ClickHouse, and Feldera all use IVM
+> techniques to provide efficient incremental materialized views. These databases ingest streams of
+> events to expose materialized views in real time. Recent events are buffered in memory and
+> periodically used to update on-disk materialized views. Reads combine the recent events and the
+> materialized data to provide a single real-time view. Since reads are often expressed in SQL and
+> materialized views are often stored in OLAP-style formats, these systems also support large-scale
+> data warehouse-style queries such as those discussed in
+> [Chapter 11](/en/ch11#ch_batch).
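+
+The difference between a full refresh and an incremental update can be sketched for the
+revenue-per-day example. This covers only the easy, hand-written case of a naturally incremental
+aggregate; it is not a general IVM engine, and the function and class names are hypothetical.
+
+```python
+from collections import defaultdict
+
+
+def full_refresh(sales):
+    """Recompute the whole view from scratch, as a periodic batch refresh would."""
+    view = defaultdict(int)
+    for day, amount_cents in sales:
+        view[day] += amount_cents
+    return dict(view)
+
+
+class IncrementalRevenueView:
+    """Maintain the same view by touching only the row affected by each new sale."""
+
+    def __init__(self):
+        self.revenue_by_day = defaultdict(int)
+
+    def on_sale(self, day, amount_cents):
+        self.revenue_by_day[day] += amount_cents
+
+
+sales = [("2024-05-01", 1200), ("2024-05-01", 800), ("2024-05-02", 500)]
+view = IncrementalRevenueView()
+for day, amount in sales:
+    view.on_sale(day, amount)
+```
+
+Both approaches arrive at the same view, but `full_refresh` rereads every sale on each update,
+whereas `on_sale` does a constant amount of work per event and keeps the view fresh continuously.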
+
+#### Search on streams {#id320}
+
+Besides CEP, which allows searching for patterns consisting of multiple events, there is also
+sometimes a need to search for individual events based on complex criteria, such as full-text search
+queries.
+
+For example, media monitoring services subscribe to feeds of news articles and broadcasts from media
+outlets, and search for any news mentioning companies, products, or topics of interest. This is done
+by formulating a search query in advance, and then continually matching the stream of news items
+against this query. Similar features exist on some websites: for example, users of real estate
+websites can ask to be notified when a new property matching their search criteria appears on the
+market. The percolator feature of Elasticsearch [^62] is one option for implementing this
+kind of stream search.
+
+Conventional search engines first index the documents and then run queries over the index. By
+contrast, searching a stream turns the processing on its head: the queries are stored, and the
+documents run past the queries, as in CEP. In the simplest case, you can test every document
+against every query, although this can get slow if you have a large number of queries. To optimize
+the process, it is possible to index the queries as well as the documents, and thus narrow down the
+set of queries that may match [^63].
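+
+A rough sketch of indexing the stored queries: each query is assumed to be a set of terms that must
+all appear in a document, and an inverted index from term to query narrows down the candidates. The
+class and method names are hypothetical, and this is not how the Elasticsearch percolator is
+implemented.
+
+```python
+from collections import defaultdict
+
+
+class StoredQueryMatcher:
+    """Match each incoming document against stored queries (all terms must appear)."""
+
+    def __init__(self):
+        self.queries = {}                        # query_id -> set of required terms
+        self.queries_by_term = defaultdict(set)  # term -> ids of queries containing it
+
+    def register(self, query_id, terms):
+        required = {t.lower() for t in terms}
+        self.queries[query_id] = required
+        for term in required:
+            self.queries_by_term[term].add(query_id)
+
+    def match(self, document):
+        words = set(document.lower().split())
+        # Only queries sharing at least one term with the document are candidates.
+        candidates = set()
+        for word in words:
+            candidates |= self.queries_by_term.get(word, set())
+        return sorted(q for q in candidates if self.queries[q] <= words)
+
+
+matcher = StoredQueryMatcher()
+matcher.register("q1", ["acme", "merger"])
+matcher.register("q2", ["housing", "prices"])
+hits = matcher.match("Acme announces merger with rival")
+```
+
+Here `hits` contains only `q1`, and `q2` is never examined for this document because none of its
+terms occur in it.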
+
+#### Event-Driven Architectures and RPC {#sec_stream_actors_drpc}
+
+In ["Event-Driven Architectures"](/en/ch5#sec_encoding_dataflow_msg) we discussed message-passing
+systems as an alternative to RPC---i.e., as a mechanism for services to communicate, as used for
+example in the actor model. Although these systems are also based on messages and events, we
+normally don't think of them as stream processors:
+
+- Actor frameworks are primarily a mechanism for managing concurrency and distributed execution of
+ communicating modules, whereas stream processing is primarily a data management technique.
+
+- Communication between actors is often ephemeral and one-to-one, whereas event logs are durable and
+ multi-subscriber.
+
+- Actors can communicate in arbitrary ways (including cyclic request/response patterns), but stream
+ processors are usually set up in acyclic pipelines where every stream is the output of one
+ particular job, and derived from a well-defined set of input streams.
+
+That said, there is some crossover area between RPC-like systems and stream processing. For example,
+Apache Storm has a feature called *distributed RPC*, which allows user queries to be farmed out to a
+set of nodes that also process event streams; these queries are then interleaved with events from
+the input streams, and results can be aggregated and sent back to the user. (See also ["Multi-shard
+data processing"](/en/ch13#sec_future_unbundled_multi_shard).)
+
+It is also possible to process streams using actor frameworks. However, many such frameworks do not
+guarantee message delivery in the case of crashes, so the processing is not fault-tolerant unless
+you implement additional retry logic.
+
+### Reasoning About Time {#sec_stream_time}
+
+Stream processors often need to deal with time, especially when used for analytics purposes, which
+frequently use time windows such as "the average over the last five minutes." It might seem that the
+meaning of "the last five minutes" should be unambiguous and clear, but unfortunately the notion is
+surprisingly tricky.
+
+In a batch process, the processing tasks rapidly crunch through a large collection of historical
+events. If some kind of breakdown by time needs to happen, the batch process needs to look at the
+timestamp embedded in each event. There is no point in looking at the system clock of the machine
+running the batch process, because the time at which the process is run has nothing to do with the
+time at which the events actually occurred.
+
+A batch process may read a year's worth of historical events within a few minutes; in most cases,
+the timeline of interest is the year of history, not the few minutes of processing. Moreover, using
+the timestamps in the events allows the processing to be deterministic: running the same process
+again on the same input yields the same result.
+
+On the other hand, many stream processing frameworks use the local system clock on the processing
+machine (the *processing time*) to determine windowing [^64]. This approach has the
+advantage of being simple, and it is reasonable if the delay between event creation and event
+processing is negligibly short. However, it breaks down if there is any significant processing
+lag---i.e., if the processing may happen noticeably later than the time at which the event actually
+occurred.
+
+#### Event time versus processing time {#id322}
+
+There are many reasons why processing may be delayed: queueing, network faults, a performance issue
+leading to contention in the message broker or processor, a restart of the stream consumer, or
+reprocessing of past events while recovering from a fault or after fixing a bug in the code.
+
+Moreover, message delays can also lead to unpredictable ordering of messages. For example, say a
+user first makes one web request (which is handled by web server A), and then a second request
+(which is handled by server B). A and B emit events describing the requests they handled, but B's
+event reaches the message broker before A's event does. Now stream processors will first see the B
+event and then the A event, even though they actually occurred in the opposite order.
+
+If it helps to have an analogy, consider the *Star Wars* movies: Episode IV was released in 1977,
+Episode V in 1980, and Episode VI in 1983, followed by Episodes I, II, and III in 1999, 2002, and
+2005, respectively, and Episodes VII, VIII, and IX in 2015, 2017, and 2019 [^65]. If you
+watched the movies in the order they came out, the order in which you processed the movies is
+inconsistent with the order of their narrative. (The episode number is like the event timestamp, and
+the date when you watched the movie is the processing time.) As humans, we are able to cope with
+such discontinuities, but stream processing algorithms need to be specifically written to
+accommodate such timing and ordering issues.
+
+Confusing event time and processing time leads to bad data. For example, say you have a stream
+processor that measures the rate of requests (counting the number of requests per second). If you
+redeploy the stream processor, it may be shut down for a minute and process the backlog of events
+when it comes back up. If you measure the rate based on the processing time, it will look as if
+there was a sudden anomalous spike of requests while processing the backlog, when in fact the real
+rate of requests was steady ([Figure 12-8](/en/ch12#fig_stream_processing_time)).
+
+{{< figure src="/fig/ddia_1208.png" id="fig_stream_processing_time" caption="Figure 12-8. Windowing by processing time introduces artifacts due to variations in processing rate." class="w-full my-4" >}}
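+
+The artifact can be reproduced with a handful of made-up events. In this sketch, requests arrive
+steadily at one per second, and the processor is assumed to be down for a few seconds before
+draining its backlog; all numbers are hypothetical.
+
+```python
+from collections import Counter
+
+# (event_time, processing_time) pairs in seconds. The processor is assumed to be
+# down from t=2 to t=4, and to drain the backlog at t=5.
+events = [(0, 0), (1, 1), (2, 5), (3, 5), (4, 5), (5, 5), (6, 6)]
+
+rate_by_event_time = Counter(event_time for event_time, _ in events)
+rate_by_processing_time = Counter(processing_time for _, processing_time in events)
+```
+
+Counting by event time gives one request in every second, while counting by processing time shows
+no requests at all for seconds 2 to 4 followed by four requests in second 5.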
+
+#### Handling straggler events {#id323}
+
+A tricky problem when defining windows in terms of event time is that you can never be sure when you
+have received all of the events for a particular window, or whether there are some events still to
+come.
+
+For example, say you're grouping events into one-minute windows so that you can count the number of
+requests per minute. You have counted some number of events with timestamps that fall in the 37th
+minute of the hour, and time has moved on; now most of the incoming events fall within the 38th and
+39th minutes of the hour. When do you declare that you have finished the window for the 37th minute,
+and output its counter value?
+
+You can time out and declare a window ready after you have not seen any new events for that window
+in a while. However, it could still happen that some events were buffered on another machine
+somewhere, delayed due to a network interruption. You need to be able to handle such *straggler*
+events that arrive after the window has already been declared complete. Broadly, you have two
+options [^1]:
+
+1. Ignore the straggler events, as they are probably a small percentage of events in normal
+ circumstances. You can track the number of dropped events as a metric, and alert if you start
+ dropping a significant amount of data.
+
+2. Publish a *correction*, an updated value for the window with stragglers included. You may also
+ need to retract the previous output.
+
+In some cases it is possible to use a special message to indicate, "From now on there will be no
+more messages with a timestamp earlier than *t*," which can be used by consumers to trigger windows
+[^66]. However, if several producers on different machines are generating events, each
+with their own minimum timestamp thresholds, the consumers need to keep track of each producer
+individually. Adding and removing producers is trickier in this case.
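+
+The following sketch combines the first option (dropping stragglers and counting them) with
+per-producer "no earlier than *t*" messages. A window is declared complete only when the lowest
+threshold across all producers has passed its end. The class name, the one-minute window size, and
+the fixed set of producers are assumptions made for illustration.
+
+```python
+from collections import defaultdict
+
+
+class WindowedCounter:
+    """Count events per one-minute event-time window; drop and count stragglers."""
+
+    def __init__(self, producers):
+        self.watermarks = {p: 0 for p in producers}  # per-producer "no earlier than t"
+        self.open_windows = defaultdict(int)          # window start -> running count
+        self.closed = {}                              # window start -> final count
+        self.dropped = 0                              # stragglers that were ignored
+
+    def low_watermark(self):
+        return min(self.watermarks.values())
+
+    def on_event(self, timestamp):
+        window = timestamp - timestamp % 60
+        if window + 60 <= self.low_watermark():
+            self.dropped += 1  # the window was already declared complete
+        else:
+            self.open_windows[window] += 1
+
+    def on_watermark(self, producer, t):
+        self.watermarks[producer] = max(self.watermarks[producer], t)
+        # A window is complete only once every producer has moved past its end.
+        for window in sorted(self.open_windows):
+            if window + 60 <= self.low_watermark():
+                self.closed[window] = self.open_windows.pop(window)
+
+
+counter = WindowedCounter(["a", "b"])
+counter.on_event(10)
+counter.on_event(70)
+counter.on_watermark("a", 60)  # producer b may still send older events
+counter.on_event(20)           # accepted: the first window is still open
+counter.on_watermark("b", 65)  # now both producers are past t=60
+counter.on_event(30)           # straggler: the first window is already closed
+```
+
+The `dropped` counter is the metric you would alert on. Publishing corrections instead would mean
+keeping closed windows around and re-emitting their updated values.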
+
+#### Whose clock are you using, anyway? {#id438}
+
+Assigning timestamps to events is even more difficult when events can be buffered at several points
+in the system. For example, consider a mobile app that reports events for usage metrics to a server.
+The app may be used while the device is offline, in which case it will buffer events locally on the
+device and send them to a server when an internet connection is next available (which may be hours
+or even days later). To any consumers of this stream, the events will appear as extremely delayed
+stragglers.
+
+In this context, the timestamp on the events should really be the time at which the user interaction
+occurred, according to the mobile device's local clock. However, the clock on a user-controlled
+device often cannot be trusted, as it may be accidentally or deliberately set to the wrong time (see
+["Clock Synchronization and Accuracy"](/en/ch9#sec_distributed_clock_accuracy)). The time at which
+the event was received by the server (according to the server's clock) is more likely to be
+accurate, since the server is under your control, but less meaningful in terms of describing the
+user interaction.
+
+To adjust for incorrect device clocks, one approach is to log three timestamps [^67]:
+
+- The time at which the event occurred, according to the device clock
+
+- The time at which the event was sent to the server, according to the device clock
+
+- The time at which the event was received by the server, according to the server clock
+
+By subtracting the second timestamp from the third, you can estimate the offset between the device
+clock and the server clock (assuming the network delay is negligible compared to the required
+timestamp accuracy). You can then apply that offset to the event timestamp, and thus estimate the
+true time at which the event actually occurred (assuming the device clock offset did not change
+between the time the event occurred and the time it was sent to the server).
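+
+As a worked example of this correction, under the same two assumptions (negligible network delay
+and a constant device clock offset), with a hypothetical function name and made-up timestamps:
+
+```python
+def estimate_event_time(device_event_ts, device_send_ts, server_receive_ts):
+    """Estimate when an event really happened, in server-clock terms.
+
+    Assumes negligible network delay and a device clock offset that stayed
+    constant between the event and the send.
+    """
+    clock_offset = server_receive_ts - device_send_ts
+    return device_event_ts + clock_offset
+
+
+# Hypothetical values in seconds: the device clock runs 300 s behind the server,
+# and the event was buffered offline for an hour before being sent.
+corrected = estimate_event_time(
+    device_event_ts=1_000, device_send_ts=4_600, server_receive_ts=4_900
+)
+```
+
+The estimated offset is 4,900 − 4,600 = 300 seconds, so the event is placed at 1,300 on the server's
+clock rather than at the 1,000 reported by the device.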
+
+This problem is not unique to stream processing---batch processing suffers from exactly the same
+issues of reasoning about time. It is just more noticeable in a streaming context, where we are more
+aware of the passage of time.
+
+#### Types of windows {#id324}
+
+Once you know how the timestamp of an event should be determined, the next step is to decide how
+windows over time periods should be defined. The window can then be used for aggregations, for
+example to count events, or to calculate the average of values within the window. Several types of
+windows are in common use [^64], [^68]:
+
+Tumbling window
+
+: A tumbling window has a fixed length, and every event belongs to exactly one window. For
+ example, if you have a 1-minute tumbling window, all the events with timestamps between 10:03:00
+ and 10:03:59 are grouped into one window, events between 10:04:00 and 10:04:59 into the next
+ window, and so on. You could implement a 1-minute tumbling window by taking each event timestamp
+ and rounding it down to the nearest minute to determine the window that it belongs to.
+
+Hopping window
+
+: A hopping window also has a fixed length, but allows windows to overlap in order to provide some
+ smoothing. For example, a 5-minute window with a hop size of 1 minute would contain the events
+ between 10:03:00 and 10:07:59, then the next window would cover events between 10:04:00 and
+ 10:08:59, and so on. You can implement this hopping window by first calculating 1-minute
+ tumbling windows, and then aggregating over several adjacent windows.
+
+Sliding window
+
+: A sliding window contains all the events that occur within some interval of each other. For
+ example, a 5-minute sliding window would cover events at 10:03:39 and 10:08:12, because they are
+ less than 5 minutes apart (note that tumbling and hopping 5-minute windows would not have put
+ these two events in the same window, as they use fixed boundaries). A sliding window can be
+ implemented by keeping a buffer of events sorted by time and removing old events when they
+ expire from the window.
+
+Session window
+
+: Unlike the other window types, a session window has no fixed duration. Instead, it is defined by
+ grouping together all events for the same user that occur closely together in time, and the
+ window ends when the user has been inactive for some time (for example, if there have been no
+ events for 30 minutes). Sessionization is a common requirement for website analytics.
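+
+Assigning events to tumbling, hopping, and session windows can be sketched as follows. Timestamps
+are assumed to be integer seconds, the hop size is assumed to divide the window size evenly, and the
+function names are hypothetical.
+
+```python
+def tumbling_window(ts, size):
+    """Start of the single fixed-size window containing ts."""
+    return ts - ts % size
+
+
+def hopping_windows(ts, size, hop):
+    """Starts of all overlapping windows [start, start + size) containing ts."""
+    last_start = ts - ts % hop
+    first_start = last_start - size + hop
+    return list(range(first_start, last_start + 1, hop))
+
+
+def session_windows(timestamps, gap):
+    """Group one user's event timestamps into sessions ended by `gap` of inactivity."""
+    sessions = []
+    for ts in sorted(timestamps):
+        if sessions and ts - sessions[-1][-1] < gap:
+            sessions[-1].append(ts)
+        else:
+            sessions.append([ts])
+    return sessions
+```
+
+For example, an event at second 330 falls into the 1-minute tumbling window starting at 300, and
+into five 5-minute hopping windows (starting at 60, 120, 180, 240, and 300) when the hop size is 1
+minute. With a 30-minute gap, `session_windows([0, 100, 5000, 5100], 1800)` yields two sessions.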
+
+Window operations usually maintain temporary state. In some cases, the state is of a fixed size, no
+matter how large the window or how many events occur: for example, a counting operation will only
+have one counter regardless of the window size or event count. On the other hand, sliding windows or
+stream joins, which we discuss in the next section, require that events be buffered until the window
+finishes. Therefore, large window sizes or high-throughput streams can cause stream processors to
+keep a lot of temporary state. You must then take care to ensure the machines running stream
+processing tasks have enough capacity to maintain this state, whether in-memory or on-disk.
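+
+To illustrate why sliding windows need buffered state, here is a sketch of a sliding-window average
+that keeps every event still inside the window. It assumes events arrive in timestamp order; the
+class name is hypothetical.
+
+```python
+from collections import deque
+
+
+class SlidingWindowAverage:
+    """Average of values whose timestamps lie within `size` of the newest event."""
+
+    def __init__(self, size):
+        self.size = size
+        self.buffer = deque()  # (timestamp, value), assumed to arrive in time order
+        self.total = 0.0
+
+    def add(self, ts, value):
+        self.buffer.append((ts, value))
+        self.total += value
+        # Evict events that have expired from the window.
+        while self.buffer and self.buffer[0][0] <= ts - self.size:
+            _, old_value = self.buffer.popleft()
+            self.total -= old_value
+        return self.total / len(self.buffer)
+
+
+window = SlidingWindowAverage(size=300)
+averages = [window.add(0, 10), window.add(100, 20), window.add(350, 30)]
+```
+
+The buffer grows with the number of events in the window, unlike a simple counter, so a 5-minute
+window over a high-throughput stream may hold a large number of entries.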
+
+### Stream Joins {#sec_stream_joins}
+
+In ["JOIN and GROUP BY"](/en/ch11#sec_batch_join) we discussed how batch jobs can join datasets by
+key, and how such joins form an important part of data pipelines. Since stream processing
+generalizes data pipelines to incremental processing of unbounded datasets, there is exactly the
+same need for joins on streams.
+
+However, the fact that new events can appear anytime on a stream makes joins on streams more
+challenging than in batch jobs. To understand the situation better, let's distinguish three
+different types of joins: *stream-stream* joins, *stream-table* joins, and *table-table* joins. In
+the following sections we'll illustrate each by example.
+
+#### Stream-stream join (window join) {#id440}
+
+Say you have a search feature on your website, and you want to detect recent trends in searched-for
+URLs. Every time someone types a search query, you log an event containing the query and the results
+returned. Every time someone clicks one of the search results, you log another event recording the
+click. In order to calculate the click-through rate for each URL in the search results, you need to
+bring together the events for the search action and the click action, which are connected by having
+the same session ID. Similar analyses are needed in advertising systems [^69].
+
+The click may never come if the user abandons their search, and even if it comes, the time between
+the search and the click may be highly variable: in many cases it might be a few seconds, but it
+could be as long as days or weeks (if a user runs a search, forgets about that browser tab, and then
+returns to the tab and clicks a result sometime later). Due to variable network delays, the click
+event may even arrive before the search event. You can choose a suitable window for the join---for
+example, you may choose to join a click with a search if they occur at most one hour apart.
+
+Note that embedding the details of the search in the click event is not equivalent to joining the
+events: doing so would only tell you about the cases where the user clicked a search result, not
+about the searches where the user did not click any of the results. In order to measure search
+quality, you need accurate click-through rates, for which you need both the search events and the
+click events.
+
+To implement this type of join, a stream processor needs to maintain *state*: for example, all the
+events that occurred in the last hour, indexed by session ID. Whenever a search event or click event
+occurs, it is added to the appropriate index, and the stream processor also checks the other index
+to see if another event for the same session ID has already arrived. If there is a matching event,
+you emit an event saying which search result was clicked. If the search event expires without you
+seeing a matching click event, you emit an event saying which search results were not clicked.
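+
+A simplified sketch of this state management follows. It assumes at most one search and one click
+per session ID and omits expiry of unmatched clicks; the class and field names are hypothetical.
+
+```python
+class SearchClickJoin:
+    """Join search and click events by session ID within a time window."""
+
+    def __init__(self, window=3600):
+        self.window = window
+        self.searches = {}  # session_id -> (timestamp, query)
+        self.clicks = {}    # session_id -> (timestamp, url)
+
+    def on_search(self, session_id, ts, query):
+        self.searches[session_id] = (ts, query)
+        return self._try_join(session_id)
+
+    def on_click(self, session_id, ts, url):
+        self.clicks[session_id] = (ts, url)
+        return self._try_join(session_id)
+
+    def _try_join(self, session_id):
+        search = self.searches.get(session_id)
+        click = self.clicks.get(session_id)
+        if search and click and abs(click[0] - search[0]) <= self.window:
+            del self.searches[session_id], self.clicks[session_id]
+            return {"query": search[1], "clicked": click[1]}
+        return None
+
+    def expire(self, now):
+        """Emit searches whose window has passed without a matching click."""
+        expired = [s for s, (ts, _) in self.searches.items() if now - ts > self.window]
+        return [{"query": self.searches.pop(s)[1], "clicked": None} for s in expired]
+
+
+join = SearchClickJoin()
+early_click = join.on_click("s1", 105, "example.com/a")  # click arrives before its search
+matched = join.on_search("s1", 100, "stream joins")
+join.on_search("s2", 200, "kafka")
+unclicked = join.expire(now=4000)
+```
+
+Because either side may arrive first, both handlers check the other index. The `expire` call is
+what produces the "not clicked" events needed for an accurate click-through rate.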
+
+#### Stream-table join (stream enrichment) {#sec_stream_table_joins}
+
+In ["JOIN and GROUP BY"](/en/ch11#sec_batch_join) ([Figure 11-2](/en/ch11#fig_batch_join_example))
+we saw an example of a batch job joining two datasets: a set of user activity events and a database
+of user profiles. It is natural to think of the user activity events as a stream, and to perform the
+same join on a continuous basis in a stream processor: the input is a stream of activity events
+containing a user ID, and the output is a stream of activity events in which the user ID has been
+augmented with profile information about the user. This process is sometimes known as *enriching*
+the activity events with information from the database.
+
+To perform this join, the stream process needs to look at one activity event at a time, look up the
+event's user ID in the database, and add the profile information to the activity event. The database
+lookup could be implemented by querying a remote database; however, as discussed in ["JOIN and GROUP
+BY"](/en/ch11#sec_batch_join), such remote queries are likely to be slow and risk overloading the
+database [^58].
+
+Another approach is to load a copy of the database into the stream processor so that it can be
+queried locally without a network round-trip. This technique is called a *hash join* since the local
+copy of the database might be an in-memory hash table if it is small enough, or an index on the
+local disk.
+
+The difference from batch jobs is that a batch job uses a point-in-time snapshot of the database as
+input, whereas a stream processor is long-running, and the contents of the database are likely to
+change over time, so the stream processor's local copy of the database needs to be kept up to date.
+This issue can be solved by change data capture: the stream processor can subscribe to a changelog
+of the user profile database as well as the stream of activity events. When a profile is created or
+modified, the stream processor updates its local copy. Thus, we obtain a join between two streams:
+the activity events and the profile updates.
+
+A stream-table join is actually very similar to a stream-stream join; the biggest difference is that
+for the table changelog stream, the join uses a window that reaches back to the "beginning of time"
+(a conceptually infinite window), with newer versions of records overwriting older ones. For the
+stream input, the join might not maintain a window at all.
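+
+A minimal sketch of such an enrichment join is shown below, with the local copy of the table held
+in an in-memory dictionary and kept current by changelog events. The class name, the event shapes,
+and the convention that `None` marks a deletion are assumptions.
+
+```python
+class EnrichmentJoin:
+    """Enrich activity events using a local copy of the user profile table."""
+
+    def __init__(self):
+        self.profiles = {}  # user_id -> latest profile (in-memory hash table)
+
+    def on_profile_change(self, user_id, profile):
+        # Changelog event: newer versions overwrite older ones; None means deletion.
+        if profile is None:
+            self.profiles.pop(user_id, None)
+        else:
+            self.profiles[user_id] = profile
+
+    def on_activity(self, event):
+        # Local lookup, with no network round-trip to the source database.
+        return {**event, "profile": self.profiles.get(event["user_id"])}
+
+
+join = EnrichmentJoin()
+join.on_profile_change(42, {"name": "Ada", "country": "UK"})
+first = join.on_activity({"user_id": 42, "action": "view"})
+join.on_profile_change(42, {"name": "Ada", "country": "FR"})
+second = join.on_activity({"user_id": 42, "action": "view"})
+```
+
+The two activity events are enriched with different profile versions, depending on whether they
+were processed before or after the profile update.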
+
+#### Table-table join (materialized view maintenance) {#id326}
+
+Consider the social network timeline example that we discussed in ["Case Study: Social Network Home
+Timelines"](/en/ch2#sec_introduction_twitter). We said that when a user wants to view their home
+timeline, it is too expensive to iterate over all the people the user is following, find their
+recent posts, and merge them.
+
+Instead, we want a timeline cache: a kind of per-user "inbox" to which posts are written as they are
+sent, so that reading the timeline is a single lookup. Materializing and maintaining this cache
+requires the following event processing:
+
+- When user *u* sends a new post, it is added to the timeline of every user who is following *u*.
+
+- When a user deletes a post, or deletes their entire account, the affected posts are removed from
+ all users' timelines.
+
+- When user *u*~1~ starts following user *u*~2~, recent posts by *u*~2~ are added to *u*~1~'s
+ timeline.
+
+- When user *u*~1~ unfollows user *u*~2~, posts by *u*~2~ are removed from *u*~1~'s timeline.
+
+To implement this cache maintenance in a stream processor, you need streams of events for posts
+(sending and deleting) and for follow relationships (following and unfollowing). The stream process
+needs to maintain a database containing the set of followers for each user so that it knows which
+timelines need to be updated when a new post arrives.
+
+Another way of looking at this stream process is that it maintains a materialized view for a query
+that joins two tables (posts and follows), something like the following:
+
+``` sql
+SELECT follows.follower_id AS timeline_id,
+ array_agg(posts.* ORDER BY posts.timestamp DESC)
+FROM posts
+JOIN follows ON follows.followee_id = posts.sender_id
+GROUP BY follows.follower_id
+```
+
+The join of the streams corresponds directly to the join of the tables in that query. The timelines
+are effectively a cache of the result of this query, updated every time the underlying tables
+change.
+
+> [!NOTE]
+> If you regard a stream as the derivative of a table, as in
+> [Figure 12-7](/en/ch12#fig_stream_integral), and regard a join as a product of two
+> tables *u·v*, something interesting happens: the stream of changes to the materialized join follows
+> the product rule (*u·v*)′ = *u*′*v* + *uv*′. In words: any change of posts is joined with the
+> current followers, and any change of followers is joined with the current posts [^37].
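+
+The two halves of that rule show up directly in a sketch of the timeline maintenance logic. This
+version is deliberately simplified: it omits post deletion and ordering by timestamp, adds all of a
+followee's posts rather than only recent ones, and assumes a user does not follow the same account
+twice. The class and method names are hypothetical.
+
+```python
+from collections import defaultdict
+
+
+class TimelineView:
+    """Materialized home timelines, updated from post and follow event streams."""
+
+    def __init__(self):
+        self.followers = defaultdict(set)   # followee -> set of followers
+        self.posts = defaultdict(list)      # sender -> list of post IDs
+        self.timelines = defaultdict(list)  # follower -> list of (sender, post ID)
+
+    def on_post(self, sender, post_id):
+        # Change to posts, joined with the current followers.
+        self.posts[sender].append(post_id)
+        for follower in self.followers[sender]:
+            self.timelines[follower].append((sender, post_id))
+
+    def on_follow(self, follower, followee):
+        # Change to follows, joined with the current posts.
+        self.followers[followee].add(follower)
+        for post_id in self.posts[followee]:
+            self.timelines[follower].append((followee, post_id))
+
+    def on_unfollow(self, follower, followee):
+        self.followers[followee].discard(follower)
+        self.timelines[follower] = [
+            entry for entry in self.timelines[follower] if entry[0] != followee
+        ]
+
+
+view = TimelineView()
+view.on_post("bob", "p1")
+view.on_follow("alice", "bob")
+view.on_post("bob", "p2")
+timeline_while_following = list(view.timelines["alice"])
+view.on_unfollow("alice", "bob")
+```
+
+Reading a timeline is then a single dictionary lookup, at the cost of doing the join work at write
+time.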
+
+#### Time-dependence of joins {#sec_stream_join_time}
+
+The three types of joins described here (stream-stream, stream-table, and table-table) have a lot in
+common: they all require the stream processor to maintain some state (search and click events, user
+profiles, or follower list) based on one join input, and query that state on messages from the other
+join input.
+
+The order of the events that maintain the state is important (it matters whether you first follow
+and then unfollow, or the other way round). In a sharded event log like Kafka, the ordering of
+events within a single shard (partition) is preserved, but there is typically no ordering guarantee
+across different streams or shards.
+
+This raises a question: if events on different streams happen around a similar time, in which order
+are they processed? In the stream-table join example, if a user updates their profile, which
+activity events are joined with the old profile (processed before the profile update), and which are
+joined with the new profile (processed after the profile update)? Put another way: if state changes
+over time, and you join with some state, what point in time do you use for the join?
+
+Such time dependence can occur in many places. For example, if you sell things, you need to apply
+the right tax rate to invoices, which depends on the country or state, the type of product, and the
+date of sale (since tax rates change from time to time). When joining sales to a table of tax rates,
+you probably want to join with the tax rate at the time of the sale, which may be different from the
+current tax rate if you are reprocessing historical data.
+
+If the ordering of events across streams is undetermined, the join becomes nondeterministic
+[^70], which means you cannot rerun the same job on the same input and necessarily get the
+same result: the events on the input streams may be interleaved in a different way when you run the
+job again.
+
+In data warehouses, this issue is known as a *slowly changing dimension* (SCD), and it is often
+addressed by using a unique identifier for a particular version of the joined record: for example,
+every time the tax rate changes, it is given a new identifier, and the invoice includes the
+identifier for the tax rate at the time of sale [^71], [^72]. This change makes the
+join deterministic, but has the consequence that log compaction is not possible, since all versions
+of the records in the table need to be retained. Alternatively, you can denormalize the data and
+include the applicable tax rate directly in every sale event.
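+
+A sketch of a join against versioned tax rates: each rate change gets its own identifier and
+effective date, and a sale is matched to the version in force on its date of sale. The version
+identifiers, dates, and rates below are invented for illustration.
+
+```python
+import bisect
+
+# Hypothetical history of tax rate versions for one jurisdiction and product type:
+# (effective_from_date, version_id, rate), sorted by date (ISO dates sort as strings).
+TAX_RATE_VERSIONS = [
+    ("2020-01-01", "vat-v1", 0.19),
+    ("2023-07-01", "vat-v2", 0.21),
+]
+
+
+def tax_rate_at(sale_date):
+    """Return the (version_id, rate) that was in force on sale_date."""
+    effective_dates = [version[0] for version in TAX_RATE_VERSIONS]
+    index = bisect.bisect_right(effective_dates, sale_date) - 1
+    if index < 0:
+        raise ValueError(f"no tax rate known for {sale_date}")
+    _, version_id, rate = TAX_RATE_VERSIONS[index]
+    return version_id, rate
+```
+
+Because the lookup depends only on the sale date and the retained version history, reprocessing a
+sale from 2022 yields `vat-v1` regardless of when the job runs or how the input streams happen to
+be interleaved.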
+
+### Fault Tolerance {#sec_stream_fault_tolerance}
+
+In the final section of this chapter, let's consider how stream processors can tolerate faults. We
+saw in [Chapter 11](/en/ch11#ch_batch) that batch processing frameworks can tolerate faults fairly
+easily: if a task fails, it can simply be started again on another machine, and the output of the
+failed task is discarded. This transparent retry is possible because input files are immutable, each
+task writes its output to a separate file, and output is only made visible when a task completes
+successfully.
+
+In particular, the batch approach to fault tolerance ensures that the output of the batch job is the
+same as if nothing had gone wrong, even if in fact some tasks did fail. It appears as though every
+input record was processed exactly once---no records are skipped, and none are processed twice.
+Although restarting tasks means that records may in fact be processed multiple times, the visible
+effect in the output is as if they had only been processed once. This principle is known as
+*exactly-once semantics*, although *effectively-once* would be a more descriptive term
+[^73].
+
+The same issue of fault tolerance arises in stream processing, but it is less straightforward to
+handle: waiting until a task is finished before making its output visible is not an option, because
+a stream is infinite and so you can never finish processing it.
+
+#### Microbatching and checkpointing {#id329}
+
+One solution is to break the stream into small blocks, and treat each block like a miniature batch
+process. This approach is called *microbatching*, and it is used in Spark Streaming [^74].
+The batch size is typically around one second, which is the result of a performance compromise:
+smaller batches incur greater scheduling and coordination overhead, while larger batches mean a
+longer delay before results of the stream processor become visible.
+
+Microbatching also implicitly provides a tumbling window equal to the batch size (windowed by
+processing time, not event timestamps); any jobs that require larger windows need to explicitly
+carry over state from one microbatch to the next.
+
+A variant approach, used in Apache Flink, is to periodically generate rolling checkpoints of state
+and write them to durable storage [^75], [^76]. If a stream operator crashes, it can
+restart from its most recent checkpoint and discard any output generated between the last checkpoint
+and the crash. The checkpoints are triggered by barriers in the message stream, similar to the
+boundaries between microbatches, but without forcing a particular window size.
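+
+The recovery behavior can be sketched with a single in-process operator. This is only an
+illustration of restarting from a snapshot of state plus input offset; it does not model Flink's
+barriers or distributed snapshots, and the class name is hypothetical.
+
+```python
+import copy
+
+
+class CountingOperator:
+    """Stateful operator whose state and input offset are checkpointed together."""
+
+    def __init__(self):
+        self.offset = 0   # position of the next input event to process
+        self.counts = {}  # operator state: key -> count
+
+    def process(self, log, upto):
+        while self.offset < upto:
+            key = log[self.offset]
+            self.counts[key] = self.counts.get(key, 0) + 1
+            self.offset += 1
+
+    def checkpoint(self):
+        # A real system would write this snapshot to durable storage.
+        return copy.deepcopy({"offset": self.offset, "counts": self.counts})
+
+    def restore(self, snapshot):
+        self.offset = snapshot["offset"]
+        self.counts = copy.deepcopy(snapshot["counts"])
+
+
+log = ["a", "b", "a", "c", "a"]
+op = CountingOperator()
+op.process(log, upto=2)
+snapshot = op.checkpoint()
+op.process(log, upto=4)         # work done after the checkpoint...
+op.restore(snapshot)            # ...is discarded when the operator restarts
+op.process(log, upto=len(log))  # events 2 and 3 are replayed, then event 4
+```
+
+Events 2 and 3 are processed twice, yet the final counts are the same as if no crash had occurred,
+because the state and the offset were rolled back together.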
+
+Within the confines of the stream processing framework, the microbatching and checkpointing
+approaches provide the same exactly-once semantics as batch processing. However, as soon as output
+leaves the stream processor (for example, by writing to a database, sending messages to an external
+message broker, or sending emails), the framework is no longer able to discard the output of a
+failed microbatch. In this case, restarting a failed task causes the external side effect to happen
+twice, and microbatching or checkpointing alone is not sufficient to prevent this problem.
+
+#### Atomic commit revisited {#sec_stream_atomic_commit}
+
+In order to give the appearance of exactly-once processing in the presence of faults, we need to
+ensure that all outputs and side effects of processing an event take effect *if and only if* the
+processing is successful. Those effects include any messages sent to downstream operators or
+external messaging systems (including email or push notifications), any database writes, any changes
+to operator state, and any acknowledgment of input messages (including moving the consumer offset
+forward in a log-based message broker).
+
+Those things must either all happen atomically or not happen at all; they must not go out of sync
+with each other. If this approach sounds familiar, it is because we discussed it in
+["Exactly-once message processing"](/en/ch8#sec_transactions_exactly_once) in the context of
+distributed transactions and two-phase commit.
+
+In [Chapter 10](/en/ch10#ch_consistency) we discussed the problems in the traditional
+implementations of distributed transactions, such as XA. However, in more restricted environments it
+is possible to implement such an atomic commit facility efficiently. This approach is used in Google
+Cloud Dataflow [^66], [^75], VoltDB [^77], and Apache Kafka [^78],
+[^79]. Unlike XA, these implementations do not attempt to provide transactions across
+heterogeneous technologies, but instead keep the transactions internal by managing both state
+changes and messaging within the stream processing framework. The overhead of the transaction
+protocol can be amortized by processing several input messages within a single transaction.
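
As a rough illustration of keeping state changes and input acknowledgment in one transaction, the sketch below uses SQLite (from Python's standard library) as a stand-in for a framework-internal transactional store. This is an assumption made for the example, not how Dataflow, VoltDB, or Kafka implement it; the table names and the `process_batch` function are hypothetical.

```python
import sqlite3


def setup():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counts (key TEXT PRIMARY KEY, n INTEGER NOT NULL)")
    conn.execute("CREATE TABLE progress (id INTEGER PRIMARY KEY, next_offset INTEGER NOT NULL)")
    conn.execute("INSERT INTO progress VALUES (1, 0)")
    conn.commit()
    return conn


def process_batch(conn, log, batch_size, fail_on=None):
    """Apply several input messages and advance the offset in ONE transaction."""
    (start,) = conn.execute("SELECT next_offset FROM progress WHERE id = 1").fetchone()
    batch = log[start:start + batch_size]
    with conn:  # commits on success, rolls back if an exception escapes
        for key in batch:
            if key == fail_on:
                raise RuntimeError("simulated failure mid-batch")
            conn.execute("INSERT OR IGNORE INTO counts (key, n) VALUES (?, 0)", (key,))
            conn.execute("UPDATE counts SET n = n + 1 WHERE key = ?", (key,))
        conn.execute(
            "UPDATE progress SET next_offset = ? WHERE id = 1", (start + len(batch),)
        )
    return len(batch)


log = ["a", "b", "a", "c", "a", "b"]
conn = setup()
process_batch(conn, log, batch_size=3)
try:
    process_batch(conn, log, batch_size=3, fail_on="a")  # rolled back as a whole
except RuntimeError:
    pass
process_batch(conn, log, batch_size=3)  # retry of the same three messages

final_counts = dict(conn.execute("SELECT key, n FROM counts"))
(final_offset,) = conn.execute("SELECT next_offset FROM progress WHERE id = 1").fetchone()
```

Because the failed batch rolls back both the partial count updates and the offset, the retry starts from the same offset and no message is counted twice. Grouping three messages per transaction is what amortizes the commit cost.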
+
+#### Idempotence {#sec_stream_idempotence}
+
+Our goal is to discard the partial output of any failed tasks so that they can be safely retried
+without taking effect twice. Distributed transactions are one way of achieving that goal, but
+another way is to rely on *idempotence*, as we saw in ["Durable Execution and
+Workflows"](/en/ch5#sec_encoding_dataflow_workflows) [^80].
+
+An idempotent operation is one that you can perform multiple times, and it has the same effect as if
+you performed it only once. For example, deleting a key in a key-value store is idempotent (deleting
+the key again has no further effect), whereas incrementing a counter is not idempotent (performing
+the increment again means the value is incremented twice).
+
+Even if an operation is not naturally idempotent, it can often be made idempotent with a bit of
+extra metadata. For example, when consuming messages from Kafka, every message has a persistent,
+monotonically increasing offset. When writing a value to an external database, you can store the
+offset of the message that triggered the last write alongside the value. Thus, you can tell whether an
+update has already been applied, and avoid performing the same update again.
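
A minimal sketch of this offset-tagging idea follows, using an in-memory object in place of the external database. The class `OffsetTaggedCounter` and its fields are hypothetical, and the sketch assumes that the value and the stored offset are updated together atomically, which a real database would need to guarantee.

```python
class OffsetTaggedCounter:
    """A counter that remembers the offset of the last message applied to it."""

    def __init__(self):
        self.value = 0
        self.last_applied_offset = -1  # no message applied yet

    def apply(self, offset, increment):
        """Apply an increment unless this offset was already applied."""
        if offset <= self.last_applied_offset:
            return False  # duplicate delivery: skip the update
        # Assumption: these two assignments happen atomically in a real store.
        self.value += increment
        self.last_applied_offset = offset
        return True


messages = [(0, 5), (1, 2), (2, 7)]  # (offset, increment) pairs
counter = OffsetTaggedCounter()
for offset, increment in messages:
    counter.apply(offset, increment)

# After a failure, the consumer restarts and replays from offset 1.
replayed = [counter.apply(offset, increment) for offset, increment in messages[1:]]
```

The increment itself is not idempotent, but the offset check makes the combined operation safe to retry: the replayed messages are recognized and skipped.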
+
+The state handling in Storm's Trident is based on a similar idea. Relying on idempotence implies
+several assumptions: restarting a failed task must replay the same messages in the same order (a
+log-based message broker does this), the processing must be deterministic, and no other node may
+concurrently update the same value [^81], [^82].
+
+When failing over from one processing node to another, fencing may be required (see ["Distributed
+Locks and Leases"](/en/ch9#sec_distributed_lock_fencing)) to prevent interference from a node that
+is thought to be dead but is actually alive. Despite all those caveats, idempotent operations can be
+an effective way of achieving exactly-once semantics with only a small overhead.
+
+#### Rebuilding state after a failure {#sec_stream_state_fault_tolerance}
+
+Any stream process that requires state---for example, any windowed aggregations (such as counters,
+averages, and histograms) and any tables and indexes used for joins---must ensure that this state
+can be recovered after a failure.
+
+One option is to keep the state in a remote datastore and replicate it, although having to query a
+remote database for each individual message can be slow. An alternative is to keep state local to
+the stream processor, and replicate it periodically. Then, when the stream processor is recovering
+from a failure, the new task can read the replicated state and resume processing without data loss.
+
+For example, Flink periodically captures snapshots of operator state and writes them to durable
+storage such as a distributed filesystem [^75], [^76], and Kafka Streams replicates
+state changes by sending them to a dedicated Kafka topic with log compaction, similar to change data
+capture [^83]. VoltDB replicates state by redundantly processing each input message on
+several nodes (see ["Actual Serial Execution"](/en/ch8#sec_transactions_serial)).
+
+In some cases, it may not even be necessary to replicate the state, because it can be rebuilt from
+the input streams. For example, if the state consists of aggregations over a fairly short window, it
+may be fast enough to simply replay the input events corresponding to that window. If the state is a
+local replica of a database, maintained by change data capture, the database can also be rebuilt
+from the log-compacted change stream.
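
The last case can be sketched as follows: a toy changelog of `(key, value)` records, where a value of `None` is treated as a deletion marker (tombstone). The functions `compact` and `rebuild` are hypothetical illustrations of the idea, not the API of any particular broker.

```python
def compact(changelog):
    """Keep only the most recent record for each key, in log order."""
    latest = {}
    for offset, (key, value) in enumerate(changelog):
        latest[key] = (offset, value)
    ordered = sorted(latest.items(), key=lambda item: item[1][0])
    return [(key, value) for key, (_offset, value) in ordered]


def rebuild(changelog):
    """Reconstruct the table contents by replaying a changelog from the start."""
    table = {}
    for key, value in changelog:
        if value is None:  # tombstone: the key was deleted
            table.pop(key, None)
        else:
            table[key] = value
    return table


changelog = [
    ("u1", "alice"),
    ("u2", "bob"),
    ("u1", "alicia"),
    ("u2", None),
    ("u3", "carol"),
]
compacted = compact(changelog)
table = rebuild(compacted)
```

Replaying the compacted log yields the same table as replaying the full log, which is why a recovering task can rebuild its local replica from the compacted stream alone.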
+
+However, all of these trade-offs depend on the performance characteristics of the underlying
+infrastructure: in some systems, network delay may be lower than disk access latency, and network
+bandwidth may be comparable to disk bandwidth. There is no universally ideal trade-off for all
+situations, and the merits of local versus remote state may also shift as storage and networking
+technologies evolve.
+
+## Summary {#id332}
+
+In this chapter we have discussed event streams, what purposes they serve, and how to process them.
+In some ways, stream processing is very much like the batch processing we discussed in
+[Chapter 11](/en/ch11#ch_batch), but done continuously on unbounded (never-ending) streams rather
+than on a fixed-size input [^84]. From this perspective, message brokers and event logs
+serve as the streaming equivalent of a filesystem.
We spent some time comparing two types of message brokers:
-***AMQP/JMS-style message broker***
+AMQP/JMS-style message broker
-The broker assigns individual messages to consumers, and consumers acknowl‐ edge individual messages when they have been successfully processed. Messages are deleted from the broker once they have been acknowledged. This approach is appropriate as an asynchronous form of RPC (see also “[Message-Passing Data‐ flow]()”), for example in a task queue, where the exact order of mes‐ sage processing is not important and where there is no need to go back and read old messages again after they have been processed.
+: The broker assigns individual messages to consumers, and consumers acknowledge individual
+ messages when they have been successfully processed. Messages are deleted from the broker once
+ they have been acknowledged. This approach is appropriate as an asynchronous form of RPC (see
+ also ["Event-Driven Architectures"](/en/ch5#sec_encoding_dataflow_msg)), for example in a task
+ queue, where the exact order of message processing is not important and where there is no need
+ to go back and read old messages again after they have been processed.
-***Log-based message broker***
+Log-based message broker
-The broker assigns all messages in a partition to the same consumer node, and always delivers messages in the same order. Parallelism is achieved through par‐ titioning, and consumers track their progress by checkpointing the offset of the last message they have processed. The broker retains messages on disk, so it is possible to jump back and reread old messages if necessary.
+: The broker assigns all messages in a shard to the same consumer node, and always delivers
+ messages in the same order. Parallelism is achieved through sharding, and consumers track their
+ progress by checkpointing the offset of the last message they have processed. The broker retains
+ messages on disk, so it is possible to jump back and reread old messages if necessary.
-The log-based approach has similarities to the replication logs found in databases (see [Chapter 5](/en/ch5)) and log-structured storage engines (see [Chapter 3](/en/ch3)). We saw that this approach is especially appropriate for stream processing systems that consume input streams and generate derived state or derived output streams.
+The log-based approach has similarities to the replication logs found in databases (see
+[Chapter 6](/en/ch6#ch_replication)) and log-structured storage engines (see
+[Chapter 4](/en/ch4#ch_storage)). It is also a form of consensus, as we saw in
+[Chapter 10](/en/ch10#ch_consistency). We saw that this approach is especially appropriate for
+stream processing systems that consume input streams and generate derived state or derived output
+streams.
-In terms of where streams come from, we discussed several possibilities: user activity events, sensors providing periodic readings, and data feeds (e.g., market data in finance) are naturally represented as streams. We saw that it can also be useful to think of the writes to a database as a stream: we can capture the changelog—i.e., the history of all changes made to a database—either implicitly through change data cap‐ ture or explicitly through event sourcing. Log compaction allows the stream to retain a full copy of the contents of a database.
+In terms of where streams come from, we discussed several possibilities: user activity events,
+sensors providing periodic readings, and data feeds (e.g., market data in finance) are naturally
+represented as streams. We saw that it can also be useful to think of the writes to a database as a
+stream: we can capture the changelog---i.e., the history of all changes made to a database---either
+implicitly through change data capture or explicitly through event sourcing. Log compaction allows
+the stream to retain a full copy of the contents of a database.
-Representing databases as streams opens up powerful opportunities for integrating systems. You can keep derived data systems such as search indexes, caches, and analytics systems continually up to date by consuming the log of changes and applying them to the derived system. You can even build fresh views onto existing data by starting from scratch and consuming the log of changes from the beginning all the way to the present.
+Representing databases as streams opens up powerful opportunities for integrating systems. You can
+keep derived data systems such as search indexes, caches, and analytics systems continually up to
+date by consuming the log of changes and applying them to the derived system. You can even build
+fresh views onto existing data by starting from scratch and consuming the log of changes from the
+beginning all the way to the present.
-The facilities for maintaining state as streams and replaying messages are also the basis for the techniques that enable stream joins and fault tolerance in various stream processing frameworks. We discussed several purposes of stream processing, including searching for event patterns (complex event processing), computing windowed aggregations (stream analytics), and keeping derived data systems up to date (materialized views).
+The facilities for maintaining state as streams and replaying messages are also the basis for the
+techniques that enable stream joins and fault tolerance in various stream processing frameworks. We
+discussed several purposes of stream processing, including searching for event patterns (complex
+event processing), computing windowed aggregations (stream analytics), and keeping derived data
+systems up to date (materialized views).
-We then discussed the difficulties of reasoning about time in a stream processor, including the distinction between processing time and event timestamps, and the problem of dealing with straggler events that arrive after you thought your window was complete.
+We then discussed the difficulties of reasoning about time in a stream processor, including the
+distinction between processing time and event timestamps, and the problem of dealing with straggler
+events that arrive after you thought your window was complete.
We distinguished three types of joins that may appear in stream processes:
-***Stream-stream joins***
+Stream-stream joins
-Both input streams consist of activity events, and the join operator searches for related events that occur within some window of time. For example, it may match two actions taken by the same user within 30 minutes of each other. The two join inputs may in fact be the same stream (a *self-join*) if you want to find related events within that one stream.
+: Both input streams consist of activity events, and the join operator searches for related events
+ that occur within some window of time. For example, it may match two actions taken by the same
+ user within 30 minutes of each other. The two join inputs may in fact be the same stream (a
+ *self-join*) if you want to find related events within that one stream.
-***Stream-table joins***
+Stream-table joins
-One input stream consists of activity events, while the other is a database change‐ log. The changelog keeps a local copy of the database up to date. For each activity event, the join operator queries the database and outputs an enriched activity event.
+: One input stream consists of activity events, while the other is a database changelog. The
+ changelog keeps a local copy of the database up to date. For each activity event, the join
+ operator queries the database and outputs an enriched activity event.
-***Table-table joins***
+Table-table joins
-Both input streams are database changelogs. In this case, every change on one side is joined with the latest state of the other side. The result is a stream of changes to the materialized view of the join between the two tables.
+: Both input streams are database changelogs. In this case, every change on one side is joined
+ with the latest state of the other side. The result is a stream of changes to the materialized
+ view of the join between the two tables.
-Finally, we discussed techniques for achieving fault tolerance and exactly-once semantics in a stream processor. As with batch processing, we need to discard the partial output of any failed tasks. However, since a stream process is long-running and produces output continuously, we can’t simply discard all output. Instead, a finer-grained recovery mechanism can be used, based on microbatching, checkpoint‐ ing, transactions, or idempotent writes.
+Finally, we discussed techniques for achieving fault tolerance and exactly-once semantics in a
+stream processor. As with batch processing, we need to discard the partial output of any failed
+tasks. However, since a stream process is long-running and produces output continuously, we can't
+simply discard all output. Instead, a finer-grained recovery mechanism can be used, based on
+microbatching, checkpointing, transactions, or idempotent writes.
+##### Footnotes
+### References {#references}
-
-### References
-
-1. Tyler Akidau, Robert Bradshaw, Craig Chambers, et al.: “[The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 12, pages 1792–1803, August 2015. [doi:10.14778/2824032.2824076](http://dx.doi.org/10.14778/2824032.2824076)
-1. Harold Abelson, Gerald Jay Sussman, and Julie Sussman: [*Structure and Interpretation of Computer Programs*](https://web.archive.org/web/20220807043536/https://mitpress.mit.edu/sites/default/files/sicp/index.html), 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, available online at *mitpress.mit.edu*
-1. Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “[The Many Faces of Publish/Subscribe](http://www.cs.ru.nl/~pieter/oss/manyfaces.pdf),” *ACM Computing Surveys*, volume 35, number 2, pages 114–131, June 2003. [doi:10.1145/857076.857078](http://dx.doi.org/10.1145/857076.857078)
-1. Joseph M. Hellerstein and Michael Stonebraker: [*Readings in Database Systems*](http://redbook.cs.berkeley.edu/), 4th edition. MIT Press, 2005. ISBN: 978-0-262-69314-1, available online at *redbook.cs.berkeley.edu*
-1. Don Carney, Uğur Çetintemel, Mitch Cherniack, et al.: “[Monitoring Streams – A New Class of Data Management Applications](http://www.vldb.org/conf/2002/S07P02.pdf),” at *28th International Conference on Very Large Data Bases* (VLDB), August 2002.
-1. Matthew Sackman: “[Pushing Back](https://wellquite.org/posts/lshift/pushing_back/),” *lshift.net*, May 5, 2016.
-1. Vicent Martí: “[Brubeck, a statsd-Compatible Metrics Aggregator](http://githubengineering.com/brubeck/),” *githubengineering.com*, June 15, 2015.
-1. Seth Lowenberger: “[MoldUDP64 Protocol Specification V 1.00](http://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf),” *nasdaqtrader.com*, July 2009.
-1. Pieter Hintjens: [*ZeroMQ – The Guide*](http://zguide.zeromq.org/page:all). O'Reilly Media, 2013. ISBN: 978-1-449-33404-8
-1. Ian Malpass: “[Measure Anything, Measure Everything](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/),” *codeascraft.com*, February 15, 2011.
-1. Dieter Plaetinck: “[25 Graphite, Grafana and statsd Gotchas](https://grafana.com/blog/2016/03/03/25-graphite-grafana-and-statsd-gotchas/),” *grafana.com*, March 3, 2016.
-1. Jeff Lindsay: “[Web Hooks to Revolutionize the Web](https://web.archive.org/web/20180928201955/http://progrium.com/blog/2007/05/03/web-hooks-to-revolutionize-the-web/),” *progrium.com*, May 3, 2007.
-1. Jim N. Gray: “[Queues Are Databases](https://arxiv.org/pdf/cs/0701158.pdf),” Microsoft Research Technical Report MSR-TR-95-56, December 1995.
-1. Mark Hapner, Rich Burridge, Rahul Sharma, et al.: “[JSR-343 Java Message Service (JMS) 2.0 Specification](https://jcp.org/en/jsr/detail?id=343),” *jms-spec.java.net*, March 2013.
-1. Sanjay Aiyagari, Matthew Arrott, Mark Atwell, et al.: “[AMQP: Advanced Message Queuing Protocol Specification](http://www.rabbitmq.com/resources/specs/amqp0-9-1.pdf),” Version 0-9-1, November 2008.
-1. “[Google Cloud Pub/Sub: A Google-Scale Messaging Service](https://cloud.google.com/pubsub/architecture),” *cloud.google.com*, 2016.
-1. “[Apache Kafka 0.9 Documentation](http://kafka.apache.org/documentation.html),” *kafka.apache.org*, November 2015.
-1. Jay Kreps, Neha Narkhede, and Jun Rao: “[Kafka: A Distributed Messaging System for Log Processing](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf),” at *6th International Workshop on Networking Meets Databases* (NetDB), June 2011.
-1. “[Amazon Kinesis Streams Developer Guide](http://docs.aws.amazon.com/streams/latest/dev/introduction.html),” *docs.aws.amazon.com*, April 2016.
-1. Leigh Stewart and Sijie Guo: “[Building DistributedLog: Twitter’s High-Performance Replicated Log Service](https://blog.twitter.com/2015/building-distributedlog-twitter-s-high-performance-replicated-log-service),” *blog.twitter.com*, September 16, 2015.
-1. “[DistributedLog Documentation](https://web.archive.org/web/20210517201308/https://bookkeeper.apache.org/distributedlog/docs/latest/),” Apache Software Foundation, *distributedlog.io*.
-1. Jay Kreps: “[Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)](https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines),” *engineering.linkedin.com*, April 27, 2014.
-1. Kartik Paramasivam: “[How We’re Improving and Advancing Kafka at LinkedIn](https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin),” *engineering.linkedin.com*, September 2, 2015.
-1. Jay Kreps: “[The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying),” *engineering.linkedin.com*, December 16, 2013.
-1. Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “[All Aboard the Databus!](http://www.socc2012.org/s18-das.pdf),” at *3rd ACM Symposium on Cloud Computing* (SoCC), October 2012.
-1. Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, et al.: “[Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf),” at *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015.
-1. P. P. S. Narayan: “[Sherpa Update](http://web.archive.org/web/20160801221400/https://developer.yahoo.com/blogs/ydn/sherpa-7992.html),” *developer.yahoo.com*, June 8, .
-1. Martin Kleppmann: “[Bottled Water: Real-Time Integration of PostgreSQL and Kafka](http://martin.kleppmann.com/2015/04/23/bottled-water-real-time-postgresql-kafka.html),” *martin.kleppmann.com*, April 23, 2015.
-1. Ben Osheroff: “[Introducing Maxwell, a mysql-to-kafka Binlog Processor](https://web.archive.org/web/20170208100334/https://developer.zendesk.com/blog/introducing-maxwell-a-mysql-to-kafka-binlog-processor),” *developer.zendesk.com*, August 20, 2015.
-1. Randall Hauch: “[Debezium 0.2.1 Released](https://debezium.io/blog/2016/06/10/Debezium-0.2.1-Released/),” *debezium.io*, June 10, 2016.
-1. Prem Santosh Udaya Shankar: “[Streaming MySQL Tables in Real-Time to Kafka](https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html),” *engineeringblog.yelp.com*, August 1, 2016.
-1. “[Mongoriver](https://github.com/stripe/mongoriver),” Stripe, Inc., *github.com*, September 2014.
-1. Dan Harvey: “[Change Data Capture with Mongo + Kafka](http://www.slideshare.net/danharvey/change-data-capture-with-mongodb-and-kafka),” at *Hadoop Users Group UK*, August 2015.
-1. “[Oracle GoldenGate 12c: Real-Time Access to Real-Time Information](https://web.archive.org/web/20160923105841/http://www.oracle.com/us/products/middleware/data-integration/oracle-goldengate-realtime-access-2031152.pdf),” Oracle White Paper, March 2015.
-1. “[Oracle GoldenGate Fundamentals: How Oracle GoldenGate Works](https://www.youtube.com/watch?v=6H9NibIiPQE),” Oracle Corporation, *youtube.com*, November 2012.
-1. Slava Akhmechet: “[Advancing the Realtime Web](http://rethinkdb.com/blog/realtime-web/),” *rethinkdb.com*, January 27, 2015.
-1. “[Firebase Realtime Database Documentation](https://firebase.google.com/docs/database/),” Google, Inc., *firebase.google.com*, May 2016.
-1. “[Apache CouchDB 1.6 Documentation](http://docs.couchdb.org/en/latest/),” *docs.couchdb.org*, 2014.
-1. Matt DeBergalis: “[Meteor 0.7.0: Scalable Database Queries Using MongoDB Oplog Instead of Poll-and-Diff](https://web.archive.org/web/20160324055429/http://info.meteor.com/blog/meteor-070-scalable-database-queries-using-mongodb-oplog-instead-of-poll-and-diff),” *info.meteor.com*, December 17, 2013.
-1. “[Chapter 15. Importing and Exporting Live Data](https://docs.voltdb.com/UsingVoltDB/ChapExport.php),” VoltDB 6.4 User Manual, *docs.voltdb.com*, June 2016.
-1. Neha Narkhede: “[Announcing Kafka Connect: Building Large-Scale Low-Latency Data Pipelines](http://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines),” *confluent.io*, February 18, 2016.
-1. Greg Young: “[CQRS and Event Sourcing](https://www.youtube.com/watch?v=JHGkaShoyNs),” at *Code on the Beach*, August 2014.
-1. Martin Fowler: “[Event Sourcing](http://martinfowler.com/eaaDev/EventSourcing.html),” *martinfowler.com*, December 12, 2005.
-1. Vaughn Vernon: [*Implementing Domain-Driven Design*](https://www.informit.com/store/implementing-domain-driven-design-9780321834577). Addison-Wesley Professional, 2013. ISBN: 978-0-321-83457-7
-1. H. V. Jagadish, Inderpal Singh Mumick, and Abraham Silberschatz: “[View Maintenance Issues for the Chronicle Data Model](https://dl.acm.org/doi/10.1145/212433.220201),” at *14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems* (PODS), May 1995. [doi:10.1145/212433.220201](http://dx.doi.org/10.1145/212433.220201)
-1. “[Event Store 3.5.0 Documentation](http://docs.geteventstore.com/),” Event Store LLP, *docs.geteventstore.com*, February 2016.
-1. Martin Kleppmann: [*Making Sense of Stream Processing*](http://www.oreilly.com/data/free/stream-processing.csp). Report, O'Reilly Media, May 2016.
-1. Sander Mak: “[Event-Sourced Architectures with Akka](http://www.slideshare.net/SanderMak/eventsourced-architectures-with-akka),” at *JavaOne*, September 2014.
-1. Julian Hyde: [personal communication](https://twitter.com/julianhyde/status/743374145006641153), June 2016.
-1. Ashish Gupta and Inderpal Singh Mumick: *Materialized Views: Techniques, Implementations, and Applications*. MIT Press, 1999. ISBN: 978-0-262-57122-7
-1. Timothy Griffin and Leonid Libkin: “[Incremental Maintenance of Views with Duplicates](http://homepages.inf.ed.ac.uk/libkin/papers/sigmod95.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), May 1995. [doi:10.1145/223784.223849](http://dx.doi.org/10.1145/223784.223849)
-1. Pat Helland: “[Immutability Changes Everything](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
-1. Martin Kleppmann: “[Accounting for Computer Scientists](http://martin.kleppmann.com/2011/03/07/accounting-for-computer-scientists.html),” *martin.kleppmann.com*, March 7, 2011.
-1. Pat Helland: “[Accountants Don't Use Erasers](https://web.archive.org/web/20200220161036/https://blogs.msdn.microsoft.com/pathelland/2007/06/14/accountants-dont-use-erasers/),” *blogs.msdn.com*, June 14, 2007.
-1. Fangjin Yang: “[Dogfooding with Druid, Samza, and Kafka: Metametrics at Metamarkets](https://metamarkets.com/2015/dogfooding-with-druid-samza-and-kafka-metametrics-at-metamarkets/),” *metamarkets.com*, June 3, 2015.
-1. Gavin Li, Jianqiu Lv, and Hang Qi: “[Pistachio: Co-Locate the Data and Compute for Fastest Cloud Compute](https://web.archive.org/web/20181214032620/https://yahoohadoop.tumblr.com/post/116365275781/pistachio-co-locate-the-data-and-compute-for),” *yahoohadoop.tumblr.com*, April 13, 2015.
-1. Kartik Paramasivam: “[Stream Processing Hard Problems – Part 1: Killing Lambda](https://engineering.linkedin.com/blog/2016/06/stream-processing-hard-problems-part-1-killing-lambda),” *engineering.linkedin.com*, June 27, 2016.
-1. Martin Fowler: “[CQRS](http://martinfowler.com/bliki/CQRS.html),” *martinfowler.com*, July 14, 2011.
-1. Greg Young: “[CQRS Documents](https://cqrs.files.wordpress.com/2010/11/cqrs_documents.pdf),” *cqrs.files.wordpress.com*, November 2010.
-1. Baron Schwartz: “[Immutability, MVCC, and Garbage Collection](https://web.archive.org/web/20161110094746/http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/),” *xaprb.com*, December 28, 2013.
-1. Daniel Eloff, Slava Akhmechet, Jay Kreps, et al.: ["Re: Turning the Database Inside-out with Apache Samza](https://news.ycombinator.com/item?id=9145197)," Hacker News discussion, *news.ycombinator.com*, March 4, 2015.
-1. “[Datomic Development Resources: Excision](http://docs.datomic.com/excision.html),” Cognitect, Inc., *docs.datomic.com*.
-1. “[Fossil Documentation: Deleting Content from Fossil](http://fossil-scm.org/index.html/doc/trunk/www/shunning.wiki),” *fossil-scm.org*, 2016.
-1. Jay Kreps: “[The irony of distributed systems is that data loss is really easy but deleting data is surprisingly hard,](https://twitter.com/jaykreps/status/582580836425330688)” *twitter.com*, March 30, 2015.
-1. David C. Luckham: “[What’s the Difference Between ESP and CEP?](http://www.complexevents.com/2006/08/01/what%E2%80%99s-the-difference-between-esp-and-cep/),” *complexevents.com*, August 1, 2006.
-1. Srinath Perera: “[How Is Stream Processing and Complex Event Processing (CEP) Different?](https://www.quora.com/How-is-stream-processing-and-complex-event-processing-CEP-different),” *quora.com*, December 3, 2015.
-1. Arvind Arasu, Shivnath Babu, and Jennifer Widom: “[The CQL Continuous Query Language: Semantic Foundations and Query Execution](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cql.pdf),” *The VLDB Journal*, volume 15, number 2, pages 121–142, June 2006. [doi:10.1007/s00778-004-0147-z](http://dx.doi.org/10.1007/s00778-004-0147-z)
-1. Julian Hyde: “[Data in Flight: How Streaming SQL Technology Can Help Solve the Web 2.0 Data Crunch](http://queue.acm.org/detail.cfm?id=1667562),” *ACM Queue*, volume 7, number 11, December 2009. [doi:10.1145/1661785.1667562](http://dx.doi.org/10.1145/1661785.1667562)
-1. “[Esper Reference, Version 5.4.0](http://esper.espertech.com/release-5.4.0/esper-reference/html_single/index.html),” EsperTech, Inc., *espertech.com*, April 2016.
-1. Zubair Nabi, Eric Bouillet, Andrew Bainbridge, and Chris Thomas: “[Of Streams and Storms](https://web.archive.org/web/20170711081434/https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2014/04/Streams-and-Storm-April-2014-Final.pdf),” IBM technical report, *developer.ibm.com*, April 2014.
-1. Milinda Pathirage, Julian Hyde, Yi Pan, and Beth Plale: “[SamzaSQL: Scalable Fast Data Management with Streaming SQL](https://github.com/milinda/samzasql-hpbdc2016/blob/master/samzasql-hpbdc2016.pdf),” at *IEEE International Workshop on High-Performance Big Data Computing* (HPBDC), May 2016. [doi:10.1109/IPDPSW.2016.141](http://dx.doi.org/10.1109/IPDPSW.2016.141)
-1. Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier: “[HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf),” at *Conference on Analysis of Algorithms* (AofA), June 2007.
-1. Jay Kreps: “[Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture),” *oreilly.com*, July 2, 2014.
-1. Ian Hellström: “[An Overview of Apache Streaming Technologies](https://databaseline.bitbucket.io/an-overview-of-apache-streaming-technologies/),” *databaseline.bitbucket.io*, March 12, 2016.
-1. Jay Kreps: “[Why Local State Is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing),” *oreilly.com*, July 31, 2014.
-1. Shay Banon: “[Percolator](https://www.elastic.co/blog/percolator),” *elastic.co*, February 8, 2011.
-1. Alan Woodward and Martin Kleppmann: “[Real-Time Full-Text Search with Luwak and Samza](http://martin.kleppmann.com/2015/04/13/real-time-full-text-search-luwak-samza.html),” *martin.kleppmann.com*, April 13, 2015.
-1. “[Apache Storm 2.1.0 Documentation](https://storm.apache.org/releases/2.1.0/index.html),” *storm.apache.org*, October 2019.
-1. Tyler Akidau: “[The World Beyond Batch: Streaming 102](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102),” *oreilly.com*, January 20, 2016.
-1. Stephan Ewen: “[Streaming Analytics with Apache Flink](https://www.confluent.io/resources/kafka-summit-2016/advanced-streaming-analytics-apache-flink-apache-kafka/),” at *Kafka Summit*, April 2016.
-1. Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, et al.: “[MillWheel: Fault-Tolerant Stream Processing at Internet Scale](http://research.google.com/pubs/pub41378.html),” at *39th International Conference on Very Large Data Bases* (VLDB), August 2013.
-1. Alex Dean: “[Improving Snowplow's Understanding of Time](https://snowplow.io/blog/improving-snowplows-understanding-of-time/),” *snowplowanalytics.com*, September 15, 2015.
-1. “[Windowing (Azure Stream Analytics)](https://msdn.microsoft.com/en-us/library/azure/dn835019.aspx),” Microsoft Azure Reference, *msdn.microsoft.com*, April 2016.
-1. “[State Management](http://samza.apache.org/learn/documentation/0.10/container/state-management.html),” Apache Samza 0.10 Documentation, *samza.apache.org*, December 2015.
-1. Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, et al.: “[Photon: Fault-Tolerant and Scalable Joining of Continuous Data Streams](http://research.google.com/pubs/pub41318.html),” at *ACM International Conference on Management of Data* (SIGMOD), June 2013. [doi:10.1145/2463676.2465272](http://dx.doi.org/10.1145/2463676.2465272)
-1. Martin Kleppmann: “[Samza Newsfeed Demo](https://github.com/ept/newsfeed),” *github.com*, September 2014.
-1. Ben Kirwin: “[Doing the Impossible: Exactly-Once Messaging Patterns in Kafka](http://ben.kirw.in/2014/11/28/kafka-patterns/),” *ben.kirw.in*, November 28, 2014.
-1. Pat Helland: “[Data on the Outside Versus Data on the Inside](http://cidrdb.org/cidr2005/papers/P12.pdf),” at *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005.
-1. Ralph Kimball and Margy Ross: *The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*, 3rd edition. John Wiley & Sons, 2013. ISBN: 978-1-118-53080-1
-1. Viktor Klang: “[I'm coining the phrase 'effectively-once' for message processing with at-least-once + idempotent operations](https://twitter.com/viktorklang/status/789036133434978304),” *twitter.com*, October 20, 2016.
-1. Matei Zaharia, Tathagata Das, Haoyuan Li, et al.: “[Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters](https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final28.pdf),” at *4th USENIX Conference in Hot Topics in Cloud Computing* (HotCloud), June 2012.
-1. Kostas Tzoumas, Stephan Ewen, and Robert Metzger: “[High-Throughput, Low-Latency, and Exactly-Once Stream Processing with Apache Flink](https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink),” *ververica.com*, August 5, 2015.
-1. Paris Carbone, Gyula Fóra, Stephan Ewen, et al.: “[Lightweight Asynchronous Snapshots for Distributed Dataflows](http://arxiv.org/abs/1506.08603),” arXiv:1506.08603 [cs.DC], June 29, 2015.
-1. Ryan Betts and John Hugg: [*Fast Data: Smart and at Scale*](http://www.oreilly.com/data/free/fast-data-smart-and-at-scale.csp). Report, O'Reilly Media, October 2015.
-1. Flavio Junqueira: “[Making Sense of Exactly-Once Semantics](https://web.archive.org/web/20160812172900/http://conferences.oreilly.com/strata/hadoop-big-data-eu/public/schedule/detail/49690),” at *Strata+Hadoop World London*, June 2016.
-1. Jason Gustafson, Flavio Junqueira, Apurva Mehta, Sriram Subramanian, and Guozhang Wang: “[KIP-98 – Exactly Once Delivery and Transactional Messaging](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging),” *cwiki.apache.org*, November 2016.
-1. Pat Helland: “[Idempotence Is Not a Medical Condition](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=4b6dda7fe75b51e1c543a87ca7b3b322fbf55614),” *Communications of the ACM*, volume 55, number 5, page 56, May 2012. [doi:10.1145/2160718.2160734](http://dx.doi.org/10.1145/2160718.2160734)
-1. Jay Kreps: “[Re: Trying to Achieve Deterministic Behavior on Recovery/Rewind](http://mail-archives.apache.org/mod_mbox/samza-dev/201409.mbox/%3CCAOeJiJg%2Bc7Ei%3DgzCuOz30DD3G5Hm9yFY%3DUJ6SafdNUFbvRgorg%40mail.gmail.com%3E),” email to *samza-dev* mailing list, September 9, 2014.
-1. E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson: “[A Survey of Rollback-Recovery Protocols in Message-Passing Systems](http://www.cs.utexas.edu/~lorenzo/papers/SurveyFinal.pdf),” *ACM Computing Surveys*, volume 34, number 3, pages 375–408, September 2002. [doi:10.1145/568522.568525](http://dx.doi.org/10.1145/568522.568525)
-1. Adam Warski: “[Kafka Streams – How Does It Fit the Stream Processing Landscape?](https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/),” *softwaremill.com*, June 1, 2016.
+[^1]: Tyler Akidau, Robert Bradshaw, Craig Chambers, Slava Chernyak, Rafael J. Fernández-Moctezuma, Reuven Lax, Sam McVeety, Daniel Mills, Frances Perry, Eric Schmidt, and Sam Whittle. [The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing](https://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf). *Proceedings of the VLDB Endowment*, volume 8, issue 12, pages 1792--1803, August 2015. [doi:10.14778/2824032.2824076](https://doi.org/10.14778/2824032.2824076)
+[^2]: Harold Abelson, Gerald Jay Sussman, and Julie Sussman. [*Structure and Interpretation of Computer Programs*](https://web.mit.edu/6.001/6.037/sicp.pdf), 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, archived at [archive.org/details/sicp_20211010](https://archive.org/details/sicp_20211010)
+[^3]: Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec. [The Many Faces of Publish/Subscribe](https://www.cs.ru.nl/~pieter/oss/manyfaces.pdf). *ACM Computing Surveys*, volume 35, issue 2, pages 114--131, June 2003. [doi:10.1145/857076.857078](https://doi.org/10.1145/857076.857078)
+[^4]: Don Carney, Uğur Çetintemel, Mitch Cherniack, Christian Convey, Sangdon Lee, Greg Seidman, Michael Stonebraker, Nesime Tatbul, and Stan Zdonik. [Monitoring Streams -- A New Class of Data Management Applications](https://www.vldb.org/conf/2002/S07P02.pdf). At *28th International Conference on Very Large Data Bases* (VLDB), August 2002. [doi:10.1016/B978-155860869-6/50027-5](https://doi.org/10.1016/B978-155860869-6/50027-5)
+[^5]: Matthew Sackman. [Pushing Back](https://wellquite.org/posts/lshift/pushing_back/). *wellquite.org*, May 2016. Archived at [perma.cc/3KCZ-RUFY](https://perma.cc/3KCZ-RUFY)
+[^6]: Thomas Figg (tef). [how (not) to write a pipeline](https://web.archive.org/web/20250107135013/https://cohost.org/tef/post/1764930-how-not-to-write-a). *cohost.org*, June 2023. Archived at [perma.cc/A3V8-NYCM](https://perma.cc/A3V8-NYCM)
+[^7]: Vicent Martí. [Brubeck, a statsd-Compatible Metrics Aggregator](https://github.blog/news-insights/the-library/brubeck/). *github.blog*, June 2015. Archived at [perma.cc/TP3Q-DJYM](https://perma.cc/TP3Q-DJYM)
+[^8]: Seth Lowenberger. [MoldUDP64 Protocol Specification V 1.00](https://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf). *nasdaqtrader.com*, July 2009. Archived at
+[^9]: Ian Malpass. [Measure Anything, Measure Everything](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/). *codeascraft.com*, February 2011. Archived at [archive.org](https://web.archive.org/web/20250820034209/https://www.etsy.com/codeascraft/measure-anything-measure-everything/)
+[^10]: Dieter Plaetinck. [25 Graphite, Grafana and statsd Gotchas](https://grafana.com/blog/2016/03/03/25-graphite-grafana-and-statsd-gotchas/). *grafana.com*, March 2016. Archived at [perma.cc/3NP3-67U7](https://perma.cc/3NP3-67U7)
+[^11]: Jeff Lindsay. [Web Hooks to Revolutionize the Web](https://progrium.github.io/blog/2007/05/03/web-hooks-to-revolutionize-the-web/). *progrium.com*, May 2007. Archived at [perma.cc/BF9U-XNX4](https://perma.cc/BF9U-XNX4)
+[^12]: Jim N. Gray. [Queues Are Databases](https://arxiv.org/pdf/cs/0701158.pdf). Microsoft Research Technical Report MSR-TR-95-56, December 1995. Archived at [arxiv.org](https://arxiv.org/pdf/cs/0701158)
+[^13]: Mark Hapner, Rich Burridge, Rahul Sharma, Joseph Fialli, Kate Stout, and Nigel Deakin. [JSR-343 Java Message Service (JMS) 2.0 Specification](https://jcp.org/en/jsr/detail?id=343). *jms-spec.java.net*, March 2013. Archived at [perma.cc/E4YG-46TA](https://perma.cc/E4YG-46TA)
+[^14]: Sanjay Aiyagari, Matthew Arrott, Mark Atwell, Jason Brome, Alan Conway, Robert Godfrey, Robert Greig, Pieter Hintjens, John O'Hara, Matthias Radestock, Alexis Richardson, Martin Ritchie, Shahrokh Sadjadi, Rafael Schloming, Steven Shaw, Martin Sustrik, Carl Trieloff, Kim van der Riet, and Steve Vinoski. [AMQP: Advanced Message Queuing Protocol Specification](https://www.rabbitmq.com/resources/specs/amqp0-9-1.pdf). Version 0-9-1, November 2008. Archived at [perma.cc/6YJJ-GM9X](https://perma.cc/6YJJ-GM9X)
+[^15]: [Architectural overview of Pub/Sub](https://cloud.google.com/pubsub/architecture). *cloud.google.com*, 2025. Archived at [perma.cc/VWF5-ABP4](https://perma.cc/VWF5-ABP4)
+[^16]: Aris Tzoumas. [Lessons from scaling PostgreSQL queues to 100k events per second](https://www.rudderstack.com/blog/scaling-postgres-queue/). *rudderstack.com*, July 2025. Archived at [perma.cc/QD8C-VA4Y](https://perma.cc/QD8C-VA4Y)
+[^17]: Robin Moffatt. [Kafka Connect Deep Dive -- Error Handling and Dead Letter Queues](https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues/). *confluent.io*, March 2019. Archived at [perma.cc/KQ5A-AB28](https://perma.cc/KQ5A-AB28)
+[^18]: Dunith Danushka. [Message reprocessing: How to implement the dead letter queue](https://redpanda.com/blog/reliable-message-processing-with-dead-letter-queue). *redpanda.com*. Archived at [perma.cc/R7UB-WEWF](https://perma.cc/R7UB-WEWF)
+[^19]: Damien Gasparina, Loic Greffier, and Sebastien Viale. [KIP-1034: Dead letter queue in Kafka Streams](https://cwiki.apache.org/confluence/display/KAFKA/KIP-1034%3A+Dead+letter+queue+in+Kafka+Streams). *cwiki.apache.org*, April 2024. Archived at [perma.cc/3VXV-QXAN](https://perma.cc/3VXV-QXAN)
+[^20]: Jay Kreps, Neha Narkhede, and Jun Rao. [Kafka: A Distributed Messaging System for Log Processing](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf). At *6th International Workshop on Networking Meets Databases* (NetDB), June 2011. Archived at [perma.cc/CSW7-TCQ5](https://perma.cc/CSW7-TCQ5)
+[^21]: Jay Kreps. [Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)](https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines). *engineering.linkedin.com*, April 2014. Archived at [archive.org](https://web.archive.org/web/20140921000742/https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines)
+[^22]: Kartik Paramasivam. [How We're Improving and Advancing Kafka at LinkedIn](https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin). *engineering.linkedin.com*, September 2015. Archived at [perma.cc/3S3V-JCYJ](https://perma.cc/3S3V-JCYJ)
+[^23]: Philippe Dobbelaere and Kyumars Sheykh Esmaili. [Kafka versus RabbitMQ: A comparative study of two industry reference publish/subscribe implementations](https://arxiv.org/abs/1709.00333). At *11th ACM International Conference on Distributed and Event-based Systems* (DEBS), June 2017. [doi:10.1145/3093742.3093908](https://doi.org/10.1145/3093742.3093908)
+[^24]: Kate Holterhoff. [Why Message Queues Endure: A History](https://redmonk.com/kholterhoff/2024/12/12/why-message-queues-endure-a-history/). *redmonk.com*, December 2024. Archived at [perma.cc/6DX8-XK4W](https://perma.cc/6DX8-XK4W)
+[^25]: Andrew Schofield. [KIP-932: Queues for Kafka](https://cwiki.apache.org/confluence/display/KAFKA/KIP-932%3A+Queues+for+Kafka). *cwiki.apache.org*, May 2023. Archived at [perma.cc/LBE4-BEMK](https://perma.cc/LBE4-BEMK)
+[^26]: Jack Vanlightly. [The advantages of queues on logs](https://jack-vanlightly.com/blog/2023/10/2/the-advantages-of-queues-on-logs). *jack-vanlightly.com*, October 2023. Archived at [perma.cc/WJ7V-287K](https://perma.cc/WJ7V-287K)
+[^27]: Jay Kreps. [The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying). *engineering.linkedin.com*, December 2013. Archived at [perma.cc/2JHR-FR64](https://perma.cc/2JHR-FR64)
+[^28]: Andy Hattemer. [Change Data Capture is having a moment. Why?](https://materialize.com/blog/change-data-capture-is-having-a-moment-why/) *materialize.com*, September 2021. Archived at [perma.cc/AL37-P53C](https://perma.cc/AL37-P53C)
+[^29]: Prem Santosh Udaya Shankar. [Streaming MySQL Tables in Real-Time to Kafka](https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html). *engineeringblog.yelp.com*, August 2016. Archived at [perma.cc/5ZR3-2GVV](https://perma.cc/5ZR3-2GVV)
+[^30]: Andreas Andreakis and Ioannis Papapanagiotou. [DBLog: A Watermark Based Change-Data-Capture Framework](https://arxiv.org/pdf/2010.12597). October 2020. Archived at [arxiv.org](https://arxiv.org/pdf/2010.12597)
+[^31]: Jiri Pechanec. [Incremental Snapshots in Debezium](https://debezium.io/blog/2021/10/07/incremental-snapshots/). *debezium.io*, October 2021. Archived at [perma.cc/EQ8E-W6KQ](https://perma.cc/EQ8E-W6KQ)
+[^32]: Debezium maintainers. [Debezium Connector for Cassandra](https://debezium.io/documentation/reference/stable/connectors/cassandra.html). *debezium.io*. Archived at [perma.cc/WR6K-EKMD](https://perma.cc/WR6K-EKMD)
+[^33]: Neha Narkhede. [Announcing Kafka Connect: Building Large-Scale Low-Latency Data Pipelines](https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/). *confluent.io*, February 2016. Archived at [perma.cc/8WXJ-L6GF](https://perma.cc/8WXJ-L6GF)
+[^34]: Chris Riccomini. [Kafka change data capture breaks database encapsulation](https://cnr.sh/posts/2018-11-05-kafka-change-data-capture-breaks-database-encapsulation/). *cnr.sh*, November 2018. Archived at [perma.cc/P572-9MKF](https://perma.cc/P572-9MKF)
+[^35]: Gunnar Morling. ["Change Data Capture Breaks Encapsulation". Does it, though?](https://www.decodable.co/blog/change-data-capture-breaks-encapsulation-does-it-though) *decodable.co*, November 2023. Archived at [perma.cc/YX2P-WNWR](https://perma.cc/YX2P-WNWR)
+[^36]: Gunnar Morling. [Revisiting the Outbox Pattern](https://www.decodable.co/blog/revisiting-the-outbox-pattern). *decodable.co*, October 2024. Archived at [perma.cc/M5ZL-RPS9](https://perma.cc/M5ZL-RPS9)
+[^37]: Ashish Gupta and Inderpal Singh Mumick. [Maintenance of Materialized Views: Problems, Techniques, and Applications](https://web.archive.org/web/20220407025818id_/http://sites.computer.org/debull/95JUN-CD.pdf#page=5). *IEEE Data Engineering Bulletin*, volume 18, issue 2, pages 3--18, June 1995. Archived at [archive.org](https://web.archive.org/web/20220407025818id_/http://sites.computer.org/debull/95JUN-CD.pdf#page=5)
+[^38]: Mihai Budiu, Tej Chajed, Frank McSherry, Leonid Ryzhyk, and Val Tannen. [DBSP: Incremental Computation on Streams and Its Applications to Databases](https://sigmodrecord.org/publications/sigmodRecord/2403/pdfs/20_dbsp-budiu.pdf). *SIGMOD Record*, volume 53, issue 1, pages 87--95, March 2024. [doi:10.1145/3665252.3665271](https://doi.org/10.1145/3665252.3665271)
+[^39]: Jim Gray and Andreas Reuter. [*Transaction Processing: Concepts and Techniques*](https://learning.oreilly.com/library/view/transaction-processing/9780080519555/). Morgan Kaufmann, 1992. ISBN: 9781558601901
+[^40]: Martin Kleppmann. [Accounting for Computer Scientists](https://martin.kleppmann.com/2011/03/07/accounting-for-computer-scientists.html). *martin.kleppmann.com*, March 2011. Archived at [perma.cc/9EGX-P38N](https://perma.cc/9EGX-P38N)
+[^41]: Pat Helland. [Immutability Changes Everything](https://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf). At *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
+[^42]: Martin Kleppmann. [*Making Sense of Stream Processing*](https://martin.kleppmann.com/papers/stream-processing.pdf). Report, O'Reilly Media, May 2016. Archived at [perma.cc/RAY4-JDVX](https://perma.cc/RAY4-JDVX)
+[^43]: Kartik Paramasivam. [Stream Processing Hard Problems -- Part 1: Killing Lambda](https://engineering.linkedin.com/blog/2016/06/stream-processing-hard-problems-part-1-killing-lambda). *engineering.linkedin.com*, June 2016. Archived at [archive.org](https://web.archive.org/web/20240621211312/https://www.linkedin.com/blog/engineering/data-streaming-processing/stream-processing-hard-problems-part-1-killing-lambda)
+[^44]: Stéphane Derosiaux. [CQRS: What? Why? How?](https://sderosiaux.medium.com/cqrs-what-why-how-945543482313) *sderosiaux.medium.com*, September 2019. Archived at [perma.cc/FZ3U-HVJ4](https://perma.cc/FZ3U-HVJ4)
+[^45]: Baron Schwartz. [Immutability, MVCC, and Garbage Collection](https://web.archive.org/web/20220122020806/http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/). *xaprb.com*, December 2013. Archived at [archive.org](https://web.archive.org/web/20220122020806/http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/)
+[^46]: Daniel Eloff, Slava Akhmechet, Jay Kreps, et al. [Re: Turning the Database Inside-out with Apache Samza](https://news.ycombinator.com/item?id=9145197). Hacker News discussion, *news.ycombinator.com*, March 2015. Archived at [perma.cc/ML9E-JC83](https://perma.cc/ML9E-JC83)
+[^47]: [Datomic Documentation: Excision](https://docs.datomic.com/operation/excision.html). Cognitect, Inc., *docs.datomic.com*. Archived at [perma.cc/J5QQ-SH32](https://perma.cc/J5QQ-SH32)
+[^48]: [Fossil Documentation: Deleting Content from Fossil](https://fossil-scm.org/home/doc/trunk/www/shunning.wiki). *fossil-scm.org*, 2025. Archived at [perma.cc/DS23-GTNG](https://perma.cc/DS23-GTNG)
+[^49]: Jay Kreps. [The irony of distributed systems is that data loss is really easy but deleting data is surprisingly hard.](https://x.com/jaykreps/status/582580836425330688) *x.com*, March 2015. Archived at [perma.cc/7RRZ-V7B7](https://perma.cc/7RRZ-V7B7)
+[^50]: Brent Robinson. [Crypto shredding: How it can solve modern data retention challenges](https://medium.com/@brentrobinson5/crypto-shredding-how-it-can-solve-modern-data-retention-challenges-da874b01745b). *medium.com*, January 2019. Archived at
+[^51]: Matthew D. Green and Ian Miers. [Forward Secure Asynchronous Messaging from Puncturable Encryption](https://isi.jhu.edu/~mgreen/forward_sec.pdf). At *IEEE Symposium on Security and Privacy*, May 2015. [doi:10.1109/SP.2015.26](https://doi.org/10.1109/SP.2015.26)
+[^52]: David C. Luckham. [What's the Difference Between ESP and CEP?](https://complexevents.com/2020/06/15/whats-the-difference-between-esp-and-cep-2/) *complexevents.com*, June 2019. Archived at [perma.cc/E7PZ-FDEF](https://perma.cc/E7PZ-FDEF)
+[^53]: Arvind Arasu, Shivnath Babu, and Jennifer Widom. [The CQL Continuous Query Language: Semantic Foundations and Query Execution](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cql.pdf). *The VLDB Journal*, volume 15, issue 2, pages 121--142, June 2006. [doi:10.1007/s00778-004-0147-z](https://doi.org/10.1007/s00778-004-0147-z)
+[^54]: Julian Hyde. [Data in Flight: How Streaming SQL Technology Can Help Solve the Web 2.0 Data Crunch](https://queue.acm.org/detail.cfm?id=1667562). *ACM Queue*, volume 7, issue 11, December 2009. [doi:10.1145/1661785.1667562](https://doi.org/10.1145/1661785.1667562)
+[^55]: Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier. [HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm](https://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf). At *Conference on Analysis of Algorithms* (AofA), June 2007. [doi:10.46298/dmtcs.3545](https://doi.org/10.46298/dmtcs.3545)
+[^56]: Jay Kreps. [Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture). *oreilly.com*, July 2014. Archived at [perma.cc/2WY5-HC8Y](https://perma.cc/2WY5-HC8Y)
+[^57]: Ian Reppel. [An Overview of Apache Streaming Technologies](https://ianreppel.org/an-overview-of-apache-streaming-technologies/). *ianreppel.org*, March 2016. Archived at [perma.cc/BB3E-QJLW](https://perma.cc/BB3E-QJLW)
+[^58]: Jay Kreps. [Why Local State is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing). *oreilly.com*, July 2014. Archived at [perma.cc/P8HU-R5LA](https://perma.cc/P8HU-R5LA)
+[^59]: RisingWave Labs. [Deep Dive Into the RisingWave Stream Processing Engine - Part 2: Computational Model](https://risingwave.com/blog/deep-dive-into-the-risingwave-stream-processing-engine-part-2-computational-model/). *risingwave.com*, November 2023. Archived at [perma.cc/LM74-XDEL](https://perma.cc/LM74-XDEL)
+[^60]: Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard. [Differential dataflow](https://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf). At *6th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2013.
+[^61]: Andy Hattemer. [Incremental Computation in the Database](https://materialize.com/guides/incremental-computation/). *materialize.com*, March 2020. Archived at [perma.cc/AL94-YVRN](https://perma.cc/AL94-YVRN)
+[^62]: Shay Banon. [Percolator](https://www.elastic.co/blog/percolator). *elastic.co*, February 2011. Archived at [perma.cc/LS5R-4FQX](https://perma.cc/LS5R-4FQX)
+[^63]: Alan Woodward and Martin Kleppmann. [Real-Time Full-Text Search with Luwak and Samza](https://martin.kleppmann.com/2015/04/13/real-time-full-text-search-luwak-samza.html). *martin.kleppmann.com*, April 2015. Archived at [perma.cc/2U92-Q7R4](https://perma.cc/2U92-Q7R4)
+[^64]: Tyler Akidau. [The World Beyond Batch: Streaming 102](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102). *oreilly.com*, January 2016. Archived at [perma.cc/4XF9-8M2K](https://perma.cc/4XF9-8M2K)
+[^65]: Stephan Ewen. [Streaming Analytics with Apache Flink](https://www.slideshare.net/slideshow/advanced-streaming-analytics-with-apache-flink-and-apache-kafka-stephan-ewen/61920008). At *Kafka Summit*, April 2016. Archived at [perma.cc/QBQ4-F9MR](https://perma.cc/QBQ4-F9MR)
+[^66]: Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, Slava Chernyak, Josh Haberman, Reuven Lax, Sam McVeety, Daniel Mills, Paul Nordstrom, and Sam Whittle. [MillWheel: Fault-Tolerant Stream Processing at Internet Scale](https://www.vldb.org/pvldb/vol6/p1033-akidau.pdf). *Proceedings of the VLDB Endowment*, volume 6, issue 11, pages 1033--1044, August 2013. [doi:10.14778/2536222.2536229](https://doi.org/10.14778/2536222.2536229)
+[^67]: Alex Dean. [Improving Snowplow's Understanding of Time](https://snowplow.io/blog/improving-snowplows-understanding-of-time). *snowplow.io*, September 2015. Archived at [perma.cc/6CT9-Z3Q2](https://perma.cc/6CT9-Z3Q2)
+[^68]: [Azure Stream Analytics: Windowing functions](https://learn.microsoft.com/en-gb/stream-analytics-query/windowing-azure-stream-analytics). Microsoft Azure Reference, *learn.microsoft.com*, July 2025. Archived at [archive.org](https://web.archive.org/web/20250901140013/https://learn.microsoft.com/en-gb/stream-analytics-query/windowing-azure-stream-analytics)
+[^69]: Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, Ashish Gupta, Haifeng Jiang, Tianhao Qiu, Alexey Reznichenko, Deomid Ryabkov, Manpreet Singh, and Shivakumar Venkataraman. [Photon: Fault-Tolerant and Scalable Joining of Continuous Data Streams](https://research.google.com/pubs/archive/41529.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2013. [doi:10.1145/2463676.2465272](https://doi.org/10.1145/2463676.2465272)
+[^70]: Ben Kirwin. [Doing the Impossible: Exactly-Once Messaging Patterns in Kafka](https://ben.kirw.in/2014/11/28/kafka-patterns/). *ben.kirw.in*, November 2014. Archived at [perma.cc/A5QL-QRX7](https://perma.cc/A5QL-QRX7)
+[^71]: Pat Helland. [Data on the Outside Versus Data on the Inside](https://www.cidrdb.org/cidr2005/papers/P12.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005.
+[^72]: Ralph Kimball and Margy Ross. [*The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*](https://learning.oreilly.com/library/view/the-data-warehouse/9781118530801/), 3rd edition. John Wiley & Sons, 2013. ISBN: 978-1-118-53080-1
+[^73]: Viktor Klang. [I'm coining the phrase 'effectively-once' for message processing with at-least-once + idempotent operations](https://x.com/viktorklang/status/789036133434978304). *x.com*, October 2016. Archived at [perma.cc/7DT9-TDG2](https://perma.cc/7DT9-TDG2)
+[^74]: Matei Zaharia, Tathagata Das, Haoyuan Li, Scott Shenker, and Ion Stoica. [Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters](https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final28.pdf). At *4th USENIX Conference in Hot Topics in Cloud Computing* (HotCloud), June 2012.
+[^75]: Kostas Tzoumas, Stephan Ewen, and Robert Metzger. [High-Throughput, Low-Latency, and Exactly-Once Stream Processing with Apache Flink](https://web.archive.org/web/20250429165534/https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink). *ververica.com*, August 2015. Archived at [archive.org](https://web.archive.org/web/20250429165534/https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink)
+[^76]: Paris Carbone, Gyula Fóra, Stephan Ewen, Seif Haridi, and Kostas Tzoumas. [Lightweight Asynchronous Snapshots for Distributed Dataflows](https://arxiv.org/abs/1506.08603). arXiv:1506.08603 \[cs.DC\], June 2015.
+[^77]: Ryan Betts and John Hugg. [*Fast Data: Smart and at Scale*](https://www.voltactivedata.com/wp-content/uploads/2017/03/hv-ebook-fast-data-smart-and-at-scale.pdf). Report, O'Reilly Media, October 2015. Archived at [perma.cc/VQ6S-XQQY](https://perma.cc/VQ6S-XQQY)
+[^78]: Neha Narkhede and Guozhang Wang. [Exactly-Once Semantics Are Possible: Here's How Kafka Does It](https://www.confluent.io/blog/exactly-once-semantics-are-possible-heres-how-apache-kafka-does-it/). *confluent.io*, June 2019. Archived at [perma.cc/Q2AU-Q2ED](https://perma.cc/Q2AU-Q2ED)
+[^79]: Jason Gustafson, Flavio Junqueira, Apurva Mehta, Sriram Subramanian, and Guozhang Wang. [KIP-98 -- Exactly Once Delivery and Transactional Messaging](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging). *cwiki.apache.org*, November 2016. Archived at [perma.cc/95PT-RCTG](https://perma.cc/95PT-RCTG)
+[^80]: Pat Helland. [Idempotence Is Not a Medical Condition](https://dl.acm.org/doi/pdf/10.1145/2160718.2160734). *Communications of the ACM*, volume 55, issue 5, page 56, May 2012. [doi:10.1145/2160718.2160734](https://doi.org/10.1145/2160718.2160734)
+[^81]: Jay Kreps. [Re: Trying to Achieve Deterministic Behavior on Recovery/Rewind](https://lists.apache.org/thread/n0sz6zld72nvjtnytv09pxc57mdcf9ft). Email to *samza-dev* mailing list, September 2014. Archived at [perma.cc/7DPD-GJNL](https://perma.cc/7DPD-GJNL)
+[^82]: E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson. [A Survey of Rollback-Recovery Protocols in Message-Passing Systems](https://www.cs.utexas.edu/~lorenzo/papers/SurveyFinal.pdf). *ACM Computing Surveys*, volume 34, issue 3, pages 375--408, September 2002. [doi:10.1145/568522.568525](https://doi.org/10.1145/568522.568525)
+[^83]: Adam Warski. [Kafka Streams -- How Does It Fit the Stream Processing Landscape?](https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/) *softwaremill.com*, June 2016. Archived at [perma.cc/WQ5Q-H2J2](https://perma.cc/WQ5Q-H2J2)
+[^84]: Stephan Ewen, Fabian Hueske, and Xiaowei Jiang. [Batch as a Special Case of Streaming and Alibaba's contribution of Blink](https://flink.apache.org/2019/02/13/batch-as-a-special-case-of-streaming-and-alibabas-contribution-of-blink/). *flink.apache.org*, February 2019. Archived at [perma.cc/A529-SKA9](https://perma.cc/A529-SKA9)
diff --git a/content/en/ch13.md b/content/en/ch13.md
index 05f4a4d..e2a9418 100644
--- a/content/en/ch13.md
+++ b/content/en/ch13.md
@@ -1,167 +1,1778 @@
---
-title: "13. Do the Right Thing"
+title: "13. A Philosophy of Streaming Systems"
weight: 313
breadcrumbs: false
---
-{{< callout type="warning" >}}
-This page is from the 1st edition, 2nd edition is not available yet.
-{{< /callout >}}
+

-> *If a thing be ordained to another as to its end, its last end cannot consist in the preservation of its being. Hence a captain does not intend as a last end, the preservation of the ship entrusted to him, since a ship is ordained to something else as its end, viz. to navigation.*
+> *If a thing be ordained to another as to its end, its last end cannot consist in the preservation
+> of its being. Hence a captain does not intend as a last end, the preservation of the ship
+> entrusted to him, since a ship is ordained to something else as its end, viz. to navigation.*
>
-> *(Often quoted as: If the highest aim of a captain was the preserve his ship, he would keep it in port forever.)*
+> *(Often quoted as: If the highest aim of a captain was to preserve his ship, he would keep it in
+> port forever.)*
>
-> — St. Thomas Aquinas, *Summa Theologica* (1265–1274)
+> St. Thomas Aquinas, *Summa Theologica* (1265--1274)
----------------
+> [!TIP] A NOTE FOR EARLY RELEASE READERS
+> With Early Release ebooks, you get books in their earliest form---the author's raw and unedited
+> content as they write---so you can take advantage of these technologies long before the official
+> release of these titles.
+>
+> This will be the 13th chapter of the final book. The GitHub repo for this book is
+> *[https://github.com/ept/ddia2-feedback](https://github.com/ept/ddia2-feedback)*.
+>
+> If you'd like to be actively involved in reviewing and commenting on this draft, please reach out on GitHub.
-So far, this book has been mostly about describing things as they *are* at present. In this final chapter, we will shift our perspective toward the future and discuss how things *should be*: I will propose some ideas and approaches that, I believe, may funda‐ mentally improve the ways we design and build applications.
+In [Chapter 2](/en/ch2#ch_nonfunctional) we discussed the goal of creating applications and systems
+that are *reliable*, *scalable*, and *maintainable*. These themes have run through all of the
+chapters: for example, we discussed many fault-tolerance algorithms that help improve reliability,
+sharding to improve scalability, and mechanisms for evolution and abstraction that improve
+maintainability.
-Opinions and speculation about the future are of course subjective, and so I will use the first person in this chapter when writing about my personal opinions. You are welcome to disagree with them and form your own opinions, but I hope that the ideas in this chapter will at least be a starting point for a productive discussion and bring some clarity to concepts that are often confused.
+In this chapter we will bring all of these ideas together, and build on the streaming/event-driven
+architecture ideas from [Chapter 12](/en/ch12#ch_stream) in particular to develop a philosophy of
+application development that meets those goals. This chapter is more opinionated than previous
+chapters, presenting a deep dive into one particular philosophy rather than comparing multiple
+approaches.
-The goal of this book was outlined in [Chapter 1](/en/ch1): to explore how to create applications and systems that are *reliable*, *scalable*, and *maintainable*. These themes have run through all of the chapters: for example, we discussed many fault-tolerance algo‐ rithms that help improve reliability, partitioning to improve scalability, and mecha‐ nisms for evolution and abstraction that improve maintainability. In this chapter we will bring all of these ideas together, and build on them to envisage the future. Our goal is to discover how to design applications that are better than the ones of today— robust, correct, evolvable, and ultimately beneficial to humanity.
+## Data Integration {#sec_future_integration}
+A recurring theme in this book has been that for any given problem, there are several solutions, all
+of which have different pros, cons, and trade-offs. For example, when discussing storage engines in
+[Chapter 4](/en/ch4#ch_storage), we saw log-structured storage, B-trees, and column-oriented
+storage. When discussing replication in [Chapter 6](/en/ch6#ch_replication), we saw single-leader,
+multi-leader, and leaderless approaches.
-## ……
+If you have a problem such as "I want to store some data and look it up again later," there is no
+one right solution, but many different approaches that are each appropriate in different
+circumstances. A software implementation typically has to pick one particular approach. It's hard
+enough to get one code path robust and performing well---trying to do everything in one piece of
+software almost guarantees that the implementation will be poor.
+Thus, the most appropriate choice of software tool also depends on the circumstances. Every piece of
+software, even a so-called "general-purpose" database, is designed for a particular usage pattern.
+Faced with this profusion of alternatives, the first challenge is then to figure out the mapping
+between the software products and the circumstances in which they are a good fit. Vendors are
+understandably reluctant to tell you about the kinds of workloads for which their software is poorly
+suited, but hopefully the previous chapters have equipped you with some questions to ask in order to
+read between the lines and better understand the trade-offs.
-## Summary
+However, even if you perfectly understand the mapping between tools and circumstances for their use,
+there is another challenge: in complex applications, data is often used in several different ways.
+There is unlikely to be one piece of software that is suitable for *all* the different circumstances
+in which the data is used, so you inevitably end up having to cobble together several different
+pieces of software in order to provide your application's functionality.
-In this chapter we discussed new approaches to designing data systems, and I included my personal opinions and speculations about the future. We started with the observation that there is no one single tool that can efficiently serve all possible use cases, and so applications necessarily need to compose several different pieces of software to accomplish their goals. We discussed how to solve this *data integration* problem by using batch processing and event streams to let data changes flow between different systems.
+### Combining Specialized Tools by Deriving Data {#id442}
-In this approach, certain systems are designated as systems of record, and other data is derived from them through transformations. In this way we can maintain indexes, materialized views, machine learning models, statistical summaries, and more. By making these derivations and transformations asynchronous and loosely coupled, a problem in one area is prevented from spreading to unrelated parts of the system, increasing the robustness and fault-tolerance of the system as a whole.
+For example, it is common to need to integrate an OLTP database with a full-text search index in
+order to handle queries for arbitrary keywords. Although some databases (such as PostgreSQL) include
+a full-text indexing feature, which can be sufficient for simple applications [^1], more
+sophisticated search facilities require specialist information retrieval tools. Conversely, search
+indexes are generally not very suitable as a durable system of record, and so many applications need
+to combine two different tools in order to satisfy all of the requirements.
-Expressing dataflows as transformations from one dataset to another also helps evolve applications: if you want to change one of the processing steps, for example to change the structure of an index or cache, you can just rerun the new transformation code on the whole input dataset in order to rederive the output. Similarly, if something goes wrong, you can fix the code and reprocess the data in order to recover.
+We touched on the issue of integrating data systems in ["Keeping Systems in
+Sync"](/en/ch12#sec_stream_sync). As the number of different representations of the data increases,
+the integration problem becomes harder. Besides the database and the search index, perhaps you need
+to keep copies of the data in analytics systems (data warehouses, or batch and stream processing
+systems); maintain caches or denormalized versions of objects that were derived from the original
+data; pass the data through machine learning, classification, ranking, or recommendation systems; or
+send notifications based on changes to the data.
-These processes are quite similar to what databases already do internally, so we recast the idea of dataflow applications as *unbundling* the components of a database, and building an application by composing these loosely coupled components.
+#### Reasoning about dataflows {#id443}
-Derived state can be updated by observing changes in the underlying data. Moreover, the derived state itself can further be observed by downstream consumers. We can even take this dataflow all the way through to the end-user device that is displaying the data, and thus build user interfaces that dynamically update to reflect data changes and continue to work offline.
+When copies of the same data need to be maintained in several storage systems in order to satisfy
+different access patterns, you need to be very clear about the inputs and outputs: where is data
+written first, and which representations are derived from which sources? How do you get data into
+all the right places, in the right formats?
-Next, we discussed how to ensure that all of this processing remains correct in the presence of faults. We saw that strong integrity guarantees can be implemented scalably with asynchronous event processing, by using end-to-end operation identifiers to make operations idempotent and by checking constraints asynchronously. Clients can either wait until the check has passed, or go ahead without waiting but risk having to apologize about a constraint violation. This approach is much more scalable and robust than the traditional approach of using distributed transactions, and fits with how many business processes work in practice.
+For example, you might arrange for data to first be written to a system of record database,
+capturing the changes made to that database (see ["Change Data Capture"](/en/ch12#sec_stream_cdc))
+and then applying the changes to the search index in the same order. If change data capture (CDC) is
+the only way of updating the index, you can be confident that the index is entirely derived from the
+system of record, and therefore consistent with it (barring bugs in the software). Writing to the
+database is the only way of supplying new input into this system.
-By structuring applications around dataflow and checking constraints asynchronously, we can avoid most coordination and create systems that maintain integrity but still perform well, even in geographically distributed scenarios and in the presence of faults. We then talked a little about using audits to verify the integrity of data and detect corruption.
+Allowing the application to directly write to both the search index and the database introduces the
+problem shown in [Figure 12-4](/en/ch12#fig_stream_write_order), in which two clients concurrently
+send conflicting writes, and the two storage systems process them in a different order. In this
+case, neither the database nor the search index is "in charge" of determining the order of writes,
+and so they may make contradictory decisions and become permanently inconsistent with each other.
-Finally, we took a step back and examined some ethical aspects of building data-intensive applications. We saw that although data can be used to do good, it can also do significant harm: making decisions that seriously affect people’s lives and are difficult to appeal against, leading to discrimination and exploitation, normalizing surveillance, and exposing intimate information. We also run the risk of data breaches, and we may find that a well-intentioned use of data has unintended consequences.
+If it is possible for you to funnel all user input through a single system that decides on an
+ordering for all writes, it becomes much easier to derive other representations of the data by
+processing the writes in the same order. This is an application of the state machine replication
+approach that we saw in ["Consensus in Practice"](/en/ch10#sec_consistency_total_order). Whether you
+use change data capture or an event sourcing log is less important than simply the principle of
+deciding on a total order.
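The effect of a shared ordering can be illustrated with a minimal sketch. It assumes last-write-wins semantics on a single key, and the names `apply_writes`, `write_a`, and `write_b` are hypothetical, chosen only for this illustration. Two stores that each receive conflicting writes in their own order may diverge, whereas two stores that consume the same log end up identical.

```python
# Hypothetical sketch: the same two writes, applied in different orders
# versus one agreed order. Assumes the last write to a key wins.

def apply_writes(writes):
    """Apply (key, value) writes in sequence and return the resulting store."""
    store = {}
    for key, value in writes:
        store[key] = value
    return store

write_a = ("x", "A")  # sent by client 1
write_b = ("x", "B")  # sent by client 2

# Dual writes: each storage system happens to see the writes in a different order.
database = apply_writes([write_a, write_b])
search_index = apply_writes([write_b, write_a])
dual_writes_agree = database == search_index

# Single ordering: one log decides the order, and both systems consume that log.
log = [write_a, write_b]
database_from_log = apply_writes(log)
index_from_log = apply_writes(log)
log_consumers_agree = database_from_log == index_from_log
```

In this sketch `dual_writes_agree` is `False` and `log_consumers_agree` is `True`. Which write ends up winning matters less than the fact that every consumer of the log reaches the same answer.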
-As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal.
+Updating a derived data system based on an event log can often be made deterministic and idempotent
+(see ["Idempotence"](/en/ch12#sec_stream_idempotence)), making it quite easy to recover from faults.
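As a rough illustration of idempotent log consumption, the sketch below tracks the offset of the last event applied and ignores anything at or below it. The `DerivedIndex` class, the event tuple format, and the use of list positions as offsets are all assumptions made for this example, not the API of any particular system. A real implementation would also need to persist the offset atomically with the data it protects.

```python
class DerivedIndex:
    """Hypothetical derived store that remembers the last log offset it applied."""

    def __init__(self):
        self.docs = {}
        self.last_offset = -1

    def apply(self, offset, event):
        # Skip events that were already applied, so redelivery is harmless.
        if offset <= self.last_offset:
            return False
        op, key, value = event
        if op == "put":
            self.docs[key] = value
        elif op == "delete":
            self.docs.pop(key, None)
        self.last_offset = offset
        return True

# An ordered change log; the list position stands in for the log offset.
log = [
    ("put", "doc1", "hello"),
    ("put", "doc2", "world"),
    ("delete", "doc1", None),
]

index = DerivedIndex()
for offset, event in enumerate(log):
    index.apply(offset, event)

# Simulate recovery from a fault by replaying the entire log a second time.
for offset, event in enumerate(log):
    index.apply(offset, event)
```

Replaying the log leaves the index unchanged, because every replayed offset has already been seen. Because applying events is deterministic, a fresh index built from the same log would reach the same state.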
-### References
+#### Derived data versus distributed transactions {#sec_future_derived_vs_transactions}
-1. Rachid Belaid: “[Postgres Full-Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/),” *rachbelaid.com*, July 13, 2015.
-1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
-1. Pat Helland and Dave Campbell: “[Building on Quicksand](https://web.archive.org/web/20220606172817/https://database.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf),” at *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009.
-1. Jessica Kerr: “[Provenance and Causality in Distributed Systems](https://web.archive.org/web/20190425150540/http://blog.jessitron.com/2016/09/provenance-and-causality-in-distributed.html),” *blog.jessitron.com*, September 25, 2016.
-1. Kostas Tzoumas: “[Batch Is a Special Case of Streaming](http://data-artisans.com/blog/batch-is-a-special-case-of-streaming/),” *data-artisans.com*, September 15, 2015.
-1. Shinji Kim and Robert Blafford: “[Stream Windowing Performance Analysis: Concord and Spark Streaming](https://web.archive.org/web/20180125074821/http://concord.io/posts/windowing_performance_analysis_w_spark_streaming),” *concord.io*, July 6, 2016.
-1. Jay Kreps: “[The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying),” *engineering.linkedin.com*, December 16, 2013.
-1. Pat Helland: “[Life Beyond Distributed Transactions: An Apostate’s Opinion](https://web.archive.org/web/20200730171311/http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p15.pdf),” at *3rd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2007.
-1. “[Great Western Railway (1835–1948)](https://web.archive.org/web/20160122155425/https://www.networkrail.co.uk/VirtualArchive/great-western/),” Network Rail Virtual Archive, *networkrail.co.uk*.
-1. Jacqueline Xu: “[Online Migrations at Scale](https://stripe.com/blog/online-migrations),” *stripe.com*, February 2, 2017.
-1. Molly Bartlett Dishman and Martin Fowler: “[Agile Architecture](https://web.archive.org/web/20161130034721/http://conferences.oreilly.com/software-architecture/sa2015/public/schedule/detail/40388),” at *O'Reilly Software Architecture Conference*, March 2015.
-1. Nathan Marz and James Warren: [*Big Data: Principles and Best Practices of Scalable Real-Time Data Systems*](https://www.manning.com/books/big-data). Manning, 2015. ISBN: 978-1-617-29034-3
-1. Oscar Boykin, Sam Ritchie, Ian O'Connell, and Jimmy Lin: “[Summingbird: A Framework for Integrating Batch and Online MapReduce Computations](http://www.vldb.org/pvldb/vol7/p1441-boykin.pdf),” at *40th International Conference on Very Large Data Bases* (VLDB), September 2014.
-1. Jay Kreps: “[Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture),” *oreilly.com*, July 2, 2014.
-1. Raul Castro Fernandez, Peter Pietzuch, Jay Kreps, et al.: “[Liquid: Unifying Nearline and Offline Big Data Integration](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper25u.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
-1. Dennis M. Ritchie and Ken Thompson: “[The UNIX Time-Sharing System](http://web.eecs.utk.edu/~qcao1/cs560/papers/paper-unix.pdf),” *Communications of the ACM*, volume 17, number 7, pages 365–375, July 1974. [doi:10.1145/361011.361061](http://dx.doi.org/10.1145/361011.361061)
-1. Eric A. Brewer and Joseph M. Hellerstein: “[CS262a: Advanced Topics in Computer Systems](http://people.eecs.berkeley.edu/~brewer/cs262/systemr.html),” lecture notes, University of California, Berkeley, *cs.berkeley.edu*, August 2011.
-1. Michael Stonebraker: “[The Case for Polystores](http://wp.sigmod.org/?p=1629),” *wp.sigmod.org*, July 13, 2015.
-1. Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, et al.: “[The BigDAWG Polystore System](https://dspace.mit.edu/handle/1721.1/100936),” *ACM SIGMOD Record*, volume 44, number 2, pages 11–16, June 2015. [doi:10.1145/2814710.2814713](http://dx.doi.org/10.1145/2814710.2814713)
-1. Patrycja Dybka: “[Foreign Data Wrappers for PostgreSQL](https://web.archive.org/web/20221003115732/https://www.vertabelo.com/blog/foreign-data-wrappers-for-postgresql/),” *vertabelo.com*, March 24, 2015.
-1. David B. Lomet, Alan Fekete, Gerhard Weikum, and Mike Zwilling: “[Unbundling Transaction Services in the Cloud](https://www.microsoft.com/en-us/research/publication/unbundling-transaction-services-in-the-cloud/),” at *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009.
-1. Martin Kleppmann and Jay Kreps: “[Kafka, Samza and the Unix Philosophy of Distributed Data](http://martin.kleppmann.com/papers/kafka-debull15.pdf),” *IEEE Data Engineering Bulletin*, volume 38, number 4, pages 4–14, December 2015.
-1. John Hugg: “[Winning Now and in the Future: Where VoltDB Shines](https://voltdb.com/blog/winning-now-and-future-where-voltdb-shines),” *voltdb.com*, March 23, 2016.
-1. Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard: “[Differential Dataflow](http://cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf),” at *6th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2013.
-1. Derek G Murray, Frank McSherry, Rebecca Isaacs, et al.: “[Naiad: A Timely Dataflow System](http://sigops.org/s/conferences/sosp/2013/papers/p439-murray.pdf),” at *24th ACM Symposium on Operating Systems Principles* (SOSP), pages 439–455, November 2013. [doi:10.1145/2517349.2522738](http://dx.doi.org/10.1145/2517349.2522738)
-1. Gwen Shapira: “[We have a bunch of customers who are implementing ‘database inside-out’ concept and they all ask ‘is anyone else doing it? are we crazy?’](https://twitter.com/gwenshap/status/758800071110430720)” *twitter.com*, July 28, 2016.
-1. Martin Kleppmann: “[Turning the Database Inside-out with Apache Samza,](http://martin.kleppmann.com/2015/03/04/turning-the-database-inside-out.html)” at *Strange Loop*, September 2014.
-1. Peter Van Roy and Seif Haridi: [*Concepts, Techniques, and Models of Computer Programming*](https://www.info.ucl.ac.be/~pvr/book.html). MIT Press, 2004. ISBN: 978-0-262-22069-9
-1. “[Juttle Documentation](http://juttle.github.io/juttle/),” *juttle.github.io*, 2016.
-1. Evan Czaplicki and Stephen Chong: “[Asynchronous Functional Reactive Programming for GUIs](http://people.seas.harvard.edu/~chong/pubs/pldi13-elm.pdf),” at *34th ACM SIGPLAN Conference on Programming Language Design and Implementation* (PLDI), June 2013. [doi:10.1145/2491956.2462161](http://dx.doi.org/10.1145/2491956.2462161)
-1. Engineer Bainomugisha, Andoni Lombide Carreton, Tom van Cutsem, Stijn Mostinckx, and Wolfgang de Meuter: “[A Survey on Reactive Programming](http://soft.vub.ac.be/Publications/2012/vub-soft-tr-12-13.pdf),” *ACM Computing Surveys*, volume 45, number 4, pages 1–34, August 2013. [doi:10.1145/2501654.2501666](http://dx.doi.org/10.1145/2501654.2501666)
-1. Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and William R. Marczak: “[Consistency Analysis in Bloom: A CALM and Collected Approach](https://dsf.berkeley.edu/cs286/papers/calm-cidr2011.pdf),” at *5th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2011.
-1. Felienne Hermans: “[Spreadsheets Are Code](https://vimeo.com/145492419),” at *Code Mesh*, November 2015.
-1. Dan Bricklin and Bob Frankston: “[VisiCalc: Information from Its Creators](http://danbricklin.com/visicalc.htm),” *danbricklin.com*.
-1. D. Sculley, Gary Holt, Daniel Golovin, et al.: “[Machine Learning: The High-Interest Credit Card of Technical Debt](http://research.google.com/pubs/pub43146.html),” at *NIPS Workshop on Software Engineering for Machine Learning* (SE4ML), December 2014.
-1. Peter Bailis, Alan Fekete, Michael J Franklin, et al.: “[Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2737784](http://dx.doi.org/10.1145/2723372.2737784)
-1. Guy Steele: “[Re: Need for Macros (Was Re: Icon)](https://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01134.html),” email to *ll1-discuss* mailing list, *people.csail.mit.edu*, December 24, 2001.
-1. David Gelernter: “[Generative Communication in Linda](http://cseweb.ucsd.edu/groups/csag/html/teaching/cse291s03/Readings/p80-gelernter.pdf),” *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 7, number 1, pages 80–112, January 1985. [doi:10.1145/2363.2433](http://dx.doi.org/10.1145/2363.2433)
-1. Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “[The Many Faces of Publish/Subscribe](http://www.cs.ru.nl/~pieter/oss/manyfaces.pdf),” *ACM Computing Surveys*, volume 35, number 2, pages 114–131, June 2003. [doi:10.1145/857076.857078](http://dx.doi.org/10.1145/857076.857078)
-1. Ben Stopford: “[Microservices in a Streaming World](https://www.infoq.com/presentations/microservices-streaming),” at *QCon London*, March 2016.
-1. Christian Posta: “[Why Microservices Should Be Event Driven: Autonomy vs Authority](http://blog.christianposta.com/microservices/why-microservices-should-be-event-driven-autonomy-vs-authority/),” *blog.christianposta.com*, May 27, 2016.
-1. Alex Feyerke: “[Say Hello to Offline First](https://web.archive.org/web/20210420014747/http://hood.ie/blog/say-hello-to-offline-first.html),” *hood.ie*, November 5, 2013.
-1. Sebastian Burckhardt, Daan Leijen, Jonathan Protzenko, and Manuel Fähndrich: “[Global Sequence Protocol: A Robust Abstraction for Replicated Shared State](http://drops.dagstuhl.de/opus/volltexte/2015/5238/),” at *29th European Conference on Object-Oriented Programming* (ECOOP), July 2015. [doi:10.4230/LIPIcs.ECOOP.2015.568](http://dx.doi.org/10.4230/LIPIcs.ECOOP.2015.568)
-1. Mark Soper: “[Clearing Up React Data Management Confusion with Flux, Redux, and Relay](https://medium.com/@marksoper/clearing-up-react-data-management-confusion-with-flux-redux-and-relay-aad504e63cae),” *medium.com*, December 3, 2015.
-1. Eno Thereska, Damian Guy, Michael Noll, and Neha Narkhede: “[Unifying Stream Processing and Interactive Queries in Apache Kafka](http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/),” *confluent.io*, October 26, 2016.
-1. Frank McSherry: “[Dataflow as Database](https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-17.md),” *github.com*, July 17, 2016.
-1. Peter Alvaro: “[I See What You Mean](https://www.youtube.com/watch?v=R2Aa4PivG0g),” at *Strange Loop*, September 2015.
-1. Nathan Marz: “[Trident: A High-Level Abstraction for Realtime Computation](https://blog.twitter.com/2012/trident-a-high-level-abstraction-for-realtime-computation),” *blog.twitter.com*, August 2, 2012.
-1. Edi Bice: “[Low Latency Web Scale Fraud Prevention with Apache Samza, Kafka and Friends](http://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends),” at *Merchant Risk Council MRC Vegas Conference*, March 2016.
-1. Charity Majors: “[The Accidental DBA](https://charity.wtf/2016/10/02/the-accidental-dba/),” *charity.wtf*, October 2, 2016.
-1. Arthur J. Bernstein, Philip M. Lewis, and Shiyong Lu: “[Semantic Conditions for Correctness at Different Isolation Levels](http://db.cs.berkeley.edu/cs286/papers/isolation-icde2000.pdf),” at *16th International Conference on Data Engineering* (ICDE), February 2000. [doi:10.1109/ICDE.2000.839387](http://dx.doi.org/10.1109/ICDE.2000.839387)
-1. Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: “[Automating the Detection of Snapshot Isolation Anomalies](http://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf),” at *33rd International Conference on Very Large Data Bases* (VLDB), September 2007.
-1. Kyle Kingsbury: [Jepsen blog post series](https://aphyr.com/tags/jepsen), *aphyr.com*, 2013–2016.
-1. Michael Jouravlev: “[Redirect After Post](http://www.theserverside.com/news/1365146/Redirect-After-Post),” *theserverside.com*, August 1, 2004.
-1. Jerome H. Saltzer, David P. Reed, and David D. Clark: “[End-to-End Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf),” *ACM Transactions on Computer Systems*, volume 2, number 4, pages 277–288, November 1984. [doi:10.1145/357401.357402](http://dx.doi.org/10.1145/357401.357402)
-1. Peter Bailis, Alan Fekete, Michael J. Franklin, et al.: “[Coordination-Avoiding Database Systems](http://arxiv.org/pdf/1402.2237.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 3, pages 185–196, November 2014.
-1. Alex Yarmula: “[Strong Consistency in Manhattan](https://blog.twitter.com/2016/strong-consistency-in-manhattan),” *blog.twitter.com*, March 17, 2016.
-1. Douglas B Terry, Marvin M Theimer, Karin Petersen, et al.: “[Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](http://css.csail.mit.edu/6.824/2014/papers/bayou-conflicts.pdf),” at *15th ACM Symposium on Operating Systems Principles* (SOSP), pages 172–182, December 1995. [doi:10.1145/224056.224070](http://dx.doi.org/10.1145/224056.224070)
-1. Jim Gray: “[The Transaction Concept: Virtues and Limitations](http://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf),” at *7th International Conference on Very Large Data Bases* (VLDB), September 1981.
-1. Hector Garcia-Molina and Kenneth Salem: “[Sagas](http://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), May 1987. [doi:10.1145/38713.38742](http://dx.doi.org/10.1145/38713.38742)
-1. Pat Helland: “[Memories, Guesses, and Apologies](https://web.archive.org/web/20160304020907/http://blogs.msdn.com/b/pathelland/archive/2007/05/15/memories-guesses-and-apologies.aspx),” *blogs.msdn.com*, May 15, 2007.
-1. Yoongu Kim, Ross Daly, Jeremie Kim, et al.: “[Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf),” at *41st Annual International Symposium on Computer Architecture* (ISCA), June 2014. [doi:10.1145/2678373.2665726](http://dx.doi.org/10.1145/2678373.2665726)
-1. Mark Seaborn and Thomas Dullien: “[Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges](https://googleprojectzero.blogspot.co.uk/2015/03/exploiting-dram-rowhammer-bug-to-gain.html),” *googleprojectzero.blogspot.co.uk*, March 9, 2015.
-1. Jim N. Gray and Catharine van Ingen: “[Empirical Measurements of Disk Failure Rates and Error Rates](https://www.microsoft.com/en-us/research/publication/empirical-measurements-of-disk-failure-rates-and-error-rates/),” Microsoft Research, MSR-TR-2005-166, December 2005.
-1. Annamalai Gurusami and Daniel Price: “[Bug #73170: Duplicates in Unique Secondary Index Because of Fix of Bug#68021](http://bugs.mysql.com/bug.php?id=73170),” *bugs.mysql.com*, July 2014.
-1. Gary Fredericks: “[Postgres Serializability Bug](https://github.com/gfredericks/pg-serializability-bug),” *github.com*, September 2015.
-1. Xiao Chen: “[HDFS DataNode Scanners and Disk Checker Explained](http://blog.cloudera.com/blog/2016/12/hdfs-datanode-scanners-and-disk-checker-explained/),” *blog.cloudera.com*, December 20, 2016.
-1. Jay Kreps: “[Getting Real About Distributed System Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability),” *blog.empathybox.com*, March 19, 2012.
-1. Martin Fowler: “[The LMAX Architecture](http://martinfowler.com/articles/lmax.html),” *martinfowler.com*, July 12, 2011.
-1. Sam Stokes: “[Move Fast with Confidence](http://blog.samstokes.co.uk/blog/2016/07/11/move-fast-with-confidence/),” *blog.samstokes.co.uk*, July 11, 2016.
-1. “[Hyperledger Sawtooth documentation](https://web.archive.org/web/20220120211548/https://sawtooth.hyperledger.org/docs/core/releases/latest/introduction.html),” Intel Corporation, *sawtooth.hyperledger.org*, 2017.
-1. Richard Gendal Brown: “[Introducing R3 Corda™: A Distributed Ledger Designed for Financial Services](https://gendal.me/2016/04/05/introducing-r3-corda-a-distributed-ledger-designed-for-financial-services/),” *gendal.me*, April 5, 2016.
-1. Trent McConaghy, Rodolphe Marques, Andreas Müller, et al.: “[BigchainDB: A Scalable Blockchain Database](https://www.bigchaindb.com/whitepaper/bigchaindb-whitepaper.pdf),” *bigchaindb.com*, June 8, 2016.
-1. Ralph C. Merkle: “[A Digital Signature Based on a Conventional Encryption Function](https://people.eecs.berkeley.edu/~raluca/cs261-f15/readings/merkle.pdf),” at *CRYPTO '87*, August 1987. [doi:10.1007/3-540-48184-2_32](http://dx.doi.org/10.1007/3-540-48184-2_32)
-1. Ben Laurie: “[Certificate Transparency](http://queue.acm.org/detail.cfm?id=2668154),” *ACM Queue*, volume 12, number 8, pages 10-19, August 2014. [doi:10.1145/2668152.2668154](http://dx.doi.org/10.1145/2668152.2668154)
-1. Mark D. Ryan: “[Enhanced Certificate Transparency and End-to-End Encrypted Mail](https://www.ndss-symposium.org/wp-content/uploads/2017/09/12_2_1.pdf),” at *Network and Distributed System Security Symposium* (NDSS), February 2014. [doi:10.14722/ndss.2014.23379](http://dx.doi.org/10.14722/ndss.2014.23379)
-1. “[ACM Code of Ethics and Professional Conduct](https://www.acm.org/code-of-ethics),” Association for Computing Machinery, *acm.org*, 2018.
-1. François Chollet: “[Software development is starting to involve important ethical choices](https://twitter.com/fchollet/status/792958695722201088),” *twitter.com*, October 30, 2016.
-1. Igor Perisic: “[Making Hard Choices: The Quest for Ethics in Machine Learning](https://engineering.linkedin.com/blog/2016/11/making-hard-choices--the-quest-for-ethics-in-machine-learning),” *engineering.linkedin.com*, November 2016.
-1. John Naughton: “[Algorithm Writers Need a Code of Conduct](https://www.theguardian.com/commentisfree/2015/dec/06/algorithm-writers-should-have-code-of-conduct),” *theguardian.com*, December 6, 2015.
-1. Logan Kugler: “[What Happens When Big Data Blunders?](http://cacm.acm.org/magazines/2016/6/202655-what-happens-when-big-data-blunders/fulltext),” *Communications of the ACM*, volume 59, number 6, pages 15–16, June 2016. [doi:10.1145/2911975](http://dx.doi.org/10.1145/2911975)
-1. Bill Davidow: “[Welcome to Algorithmic Prison](http://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/),” *theatlantic.com*, February 20, 2014.
-1. Don Peck: “[They're Watching You at Work](http://www.theatlantic.com/magazine/archive/2013/12/theyre-watching-you-at-work/354681/),” *theatlantic.com*, December 2013.
-1. Leigh Alexander: “[Is an Algorithm Any Less Racist Than a Human?](https://www.theguardian.com/technology/2016/aug/03/algorithm-racist-human-employers-work)” *theguardian.com*, August 3, 2016.
-1. Jesse Emspak: “[How a Machine Learns Prejudice](https://www.scientificamerican.com/article/how-a-machine-learns-prejudice/),” *scientificamerican.com*, December 29, 2016.
-1. Maciej Cegłowski: “[The Moral Economy of Tech](http://idlewords.com/talks/sase_panel.htm),” *idlewords.com*, June 2016.
-1. Cathy O'Neil: [*Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*](https://web.archive.org/web/20210621234447/https://weaponsofmathdestructionbook.com/). Crown Publishing, 2016. ISBN: 978-0-553-41881-1
-1. Julia Angwin: “[Make Algorithms Accountable](http://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html),” *nytimes.com*, August 1, 2016.
-1. Bryce Goodman and Seth Flaxman: “[European Union Regulations on Algorithmic Decision-Making and a ‘Right to Explanation’](https://arxiv.org/abs/1606.08813),” *arXiv:1606.08813*, August 31, 2016.
-1. “[A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes](https://web.archive.org/web/20240619042302/http://educationnewyork.com/files/rockefeller_databroker.pdf),” Staff Report, *United States Senate Committee on Commerce, Science, and Transportation*, *commerce.senate.gov*, December 2013.
-1. Olivia Solon: “[Facebook’s Failure: Did Fake News and Polarized Politics Get Trump Elected?](https://www.theguardian.com/technology/2016/nov/10/facebook-fake-news-election-conspiracy-theories)” *theguardian.com*, November 10, 2016.
-1. Donella H. Meadows and Diana Wright: *Thinking in Systems: A Primer*. Chelsea Green Publishing, 2008. ISBN: 978-1-603-58055-7
-1. Daniel J. Bernstein: “[Listening to a ‘big data’/‘data science’ talk](https://twitter.com/hashbreaker/status/598076230437568512),” *twitter.com*, May 12, 2015.
-1. Marc Andreessen: “[Why Software Is Eating the World](http://genius.com/Marc-andreessen-why-software-is-eating-the-world-annotated),” *The Wall Street Journal*, 20 August 2011.
-1. J. M. Porup: “[‘Internet of Things’ Security Is Hilariously Broken and Getting Worse](http://arstechnica.com/security/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/),” *arstechnica.com*, January 23, 2016.
-1. Bruce Schneier: [*Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World*](https://www.schneier.com/books/data_and_goliath/). W. W. Norton, 2015. ISBN: 978-0-393-35217-7
-1. The Grugq: “[Nothing to Hide](https://grugq.tumblr.com/post/142799983558/nothing-to-hide),” *grugq.tumblr.com*, April 15, 2016.
-1. Tony Beltramelli: “[Deep-Spying: Spying Using Smartwatch and Deep Learning](https://arxiv.org/abs/1512.05616),” Masters Thesis, IT University of Copenhagen, December 2015. Available at *arxiv.org/abs/1512.05616*
-1. Shoshana Zuboff: “[Big Other: Surveillance Capitalism and the Prospects of an Information Civilization](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2594754),” *Journal of Information Technology*, volume 30, number 1, pages 75–89, April 2015. [doi:10.1057/jit.2015.5](http://dx.doi.org/10.1057/jit.2015.5)
-1. Carina C. Zona: “[Consequences of an Insightful Algorithm](https://www.youtube.com/watch?v=YRI40A4tyWU),” at *GOTO Berlin*, November 2016.
-1. Bruce Schneier: “[Data Is a Toxic Asset, So Why Not Throw It Out?](https://www.schneier.com/essays/archives/2016/03/data_is_a_toxic_asse.html),” *schneier.com*, March 1, 2016.
-1. John E. Dunn: “[The UK’s 15 Most Infamous Data Breaches](https://web.archive.org/web/20161120070058/http://www.techworld.com/security/uks-most-infamous-data-breaches-2016-3604586/),” *techworld.com*, November 18, 2016.
-1. Cory Scott: “[Data is not toxic - which implies no benefit - but rather hazardous material, where we must balance need vs. want](https://twitter.com/cory_scott/status/706586399483437056),” *twitter.com*, March 6, 2016.
-1. Bruce Schneier: “[Mission Creep: When Everything Is Terrorism](https://www.schneier.com/essays/archives/2013/07/mission_creep_when_e.html),” *schneier.com*, July 16, 2013.
-1. Lena Ulbricht and Maximilian von Grafenstein: “[Big Data: Big Power Shifts?](http://policyreview.info/articles/analysis/big-data-big-power-shifts),” *Internet Policy Review*, volume 5, number 1, March 2016. [doi:10.14763/2016.1.406](http://dx.doi.org/10.14763/2016.1.406)
-1. Ellen P. Goodman and Julia Powles: “[Facebook and Google: Most Powerful and Secretive Empires We've Ever Known](https://www.theguardian.com/technology/2016/sep/28/google-facebook-powerful-secretive-empire-transparency),” *theguardian.com*, September 28, 2016.
-1. [Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data](http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:31995L0046), Official Journal of the European Communities No. L 281/31, *eur-lex.europa.eu*, November 1995.
-1. Brendan Van Alsenoy: “[Regulating Data Protection: The Allocation of Responsibility and Risk Among Actors Involved in Personal Data Processing](https://lirias.kuleuven.be/handle/123456789/545027),” Thesis, KU Leuven Centre for IT and IP Law, August 2016.
-1. Michiel Rhoen: “[Beyond Consent: Improving Data Protection Through Consumer Protection Law](http://policyreview.info/articles/analysis/beyond-consent-improving-data-protection-through-consumer-protection-law),” *Internet Policy Review*, volume 5, number 1, March 2016. [doi:10.14763/2016.1.404](http://dx.doi.org/10.14763/2016.1.404)
-1. Jessica Leber: “[Your Data Footprint Is Affecting Your Life in Ways You Can’t Even Imagine](https://www.fastcoexist.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine),” *fastcoexist.com*, March 15, 2016.
-1. Maciej Cegłowski: “[Haunted by Data](http://idlewords.com/talks/haunted_by_data.htm),” *idlewords.com*, October 2015.
-1. Sam Thielman: “[You Are Not What You Read: Librarians Purge User Data to Protect Privacy](https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy),” *theguardian.com*, January 13, 2016.
-1. Conor Friedersdorf: “[Edward Snowden’s Other Motive for Leaking](http://www.theatlantic.com/politics/archive/2014/05/edward-snowdens-other-motive-for-leaking/370068/),” *theatlantic.com*, May 13, 2014.
-1. Phillip Rogaway: “[The Moral Character of Cryptographic Work](http://web.cs.ucdavis.edu/~rogaway/papers/moral-fn.pdf),” Cryptology ePrint 2015/1162, December 2015.
+The classic approach for keeping different data systems consistent with each other involves
+distributed transactions, as discussed in ["Two-Phase Commit (2PC)"](/en/ch8#sec_transactions_2pc).
+How does the approach of using derived data systems fare in comparison to distributed transactions?
+
+At an abstract level, they achieve a similar goal by different means. Distributed transactions
+decide on an ordering of writes by using locks for mutual exclusion, while CDC and event sourcing
+use a log for ordering. Distributed transactions use atomic commit to ensure that changes take
+effect exactly once, while log-based systems are often based on deterministic retry and idempotence.
+
+The biggest difference is that transaction systems usually guarantee that after a value is written,
+you can immediately read the up-to-date value (see ["Reading Your Own
+Writes"](/en/ch6#sec_replication_ryw)). On the other hand, derived data systems are often updated
+asynchronously, and so they do not by default guarantee that reads are up-to-date.
+
+Within limited environments that are willing to pay the cost of distributed transactions, they have
+been used successfully. However, XA, a standard protocol for distributed transactions across
+different systems, has poor fault tolerance and performance characteristics (see
+["Distributed Transactions Across Different Systems"](/en/ch8#sec_transactions_xa)), which severely
+limit its usefulness. It might be possible to create a better protocol for distributed transactions,
+but getting such a protocol widely adopted and integrated with existing tools would be challenging,
+and is unlikely to happen soon.
+
+In the absence of widespread support for a good distributed transaction protocol, log-based derived
+data is the most promising approach for integrating different data systems. However, guarantees such
+as reading your own writes are useful, and it is not productive to tell everyone "eventual
+consistency is inevitable---suck it up and learn to deal with it" (at least not without good
+guidance on *how* to deal with it).
+
+Later in this chapter we will discuss some approaches for implementing stronger guarantees on top of
+asynchronously derived systems, and work toward a middle ground between distributed transactions and
+asynchronous log-based systems.
+
+#### The limits of total ordering {#id335}
+
+With systems that are small enough, constructing a totally ordered event log is entirely feasible
+(as demonstrated by the popularity of databases with single-leader replication, which construct
+precisely such a log). However, as systems are scaled toward bigger and more complex workloads,
+limitations begin to emerge:
+
+- In most cases, constructing a totally ordered log requires all events to pass through a *single
+ leader node* that decides on the ordering. If the throughput of events is greater than a single
+ machine can handle, you need to shard the log across multiple machines. The order of events in two
+ different shards is then ambiguous.
+
+- If the servers are spread across multiple *geographically distributed* regions, for example in
+ order to tolerate an entire datacenter going offline, you typically have a separate leader in each
+ datacenter, because network delays make synchronous cross-datacenter coordination inefficient.
+ This implies an undefined ordering of events that originate in two different datacenters.
+
+- When applications are deployed as *microservices*, a common design choice is to deploy each
+ service and its durable state as an independent unit, with no durable state shared between
+ services. When two events originate in different services, there is no defined order for those
+ events.
+
+- Some applications maintain client-side state that is updated immediately on user input (without
+ waiting for confirmation from a server), and even continue to work offline. With such
+ applications, clients and servers are very likely to see events in different orders.
+
+In formal terms, deciding on a total order of events is known as *total order broadcast*, which is
+equivalent to consensus (see ["The Many Faces of Consensus"](/en/ch10#sec_consistency_faces)). Most
+consensus algorithms are designed for situations in which the throughput of a single node is
+sufficient to process the entire stream of events, and these algorithms do not provide a mechanism
+for multiple nodes to share the work of ordering the events.
+
+#### Ordering events to capture causality {#sec_future_capture_causality}
+
+In cases where there is no causal link between events, the lack of a total order is not a big
+problem, since concurrent events can be ordered arbitrarily. Some other cases are easy to handle:
+for example, when there are multiple updates of the same object, they can be totally ordered by
+routing all updates for a particular object ID to the same log shard. However, causal dependencies
+sometimes arise in more subtle ways.
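+
+The routing rule for updates to the same object can be sketched in a few lines of Python. This is
+only an illustration: the shard count, the `shard_for` and `append_update` helpers, and the
+in-memory lists standing in for log shards are all assumptions made for the example, not part of
+any particular system.
+
+```python
+import hashlib
+
+NUM_SHARDS = 8  # hypothetical shard count
+
+
+def shard_for(object_id, num_shards=NUM_SHARDS):
+    """Map an object ID to a log shard using a hash that is stable across processes."""
+    digest = hashlib.sha256(object_id.encode("utf-8")).digest()
+    return int.from_bytes(digest[:8], "big") % num_shards
+
+
+def append_update(shards, object_id, update):
+    """Append an update to the single shard that owns this object ID."""
+    shards[shard_for(object_id, len(shards))].append((object_id, update))
+
+
+# In-memory lists stand in for the log shards.
+shards = [[] for _ in range(NUM_SHARDS)]
+for object_id, update in [
+    ("user:42", "set name"),
+    ("user:7", "set email"),
+    ("user:42", "set avatar"),
+]:
+    append_update(shards, object_id, update)
+```
+
+Because every update for `user:42` lands in the same shard, those updates are totally ordered
+relative to each other, even though the order of events across different shards remains undefined.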
+
+For example, consider a social networking service, and two users who were in a relationship but have
+just broken up. One of the users removes the other as a friend, and then sends a message to their
+remaining friends complaining about their ex-partner. The user's intention is that their ex-partner
+should not see the rude message, since the message was sent after the friend status was revoked.
+
+However, in a system that stores friendship status in one place and messages in another place, that
+ordering dependency between the *unfriend* event and the *message-send* event may be lost. If the
+causal dependency is not captured, a service that sends notifications about new messages may process
+the *message-send* event before the *unfriend* event, and thus incorrectly send a notification to
+the ex-partner.
+
+In this example, the notifications are effectively a join between the messages and the friend list,
+making it related to the timing issues of joins that we discussed previously (see ["Time-dependence
+of joins"](/en/ch12#sec_stream_join_time)). Unfortunately, there does not seem to be a simple answer
+to this problem [^2], [^3]. Starting points include:
+
+- Logical timestamps can provide total ordering without coordination (see ["ID Generators and
+ Logical Clocks"](/en/ch10#sec_consistency_logical)), so they may help in cases where total order
+ broadcast is not feasible. However, they still require recipients to handle events that are
+ delivered out of order, and they require additional metadata to be passed around.
+
+- If you can log an event to record the state of the system that the user saw before making a
+ decision, and give that event a unique identifier, then any later events can reference that event
+ identifier in order to record the causal dependency [^4].
+
+- Conflict resolution algorithms (see ["Automatic conflict
+ resolution"](/en/ch6#sec_replication_automatic_resolution)) help with processing events that are
+ delivered in an unexpected order. They are useful for maintaining state, but they do not help if
+ actions have external side effects (such as sending a notification to a user).
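+
+As a rough sketch of the second starting point, the following Python code lets each event name
+the identifiers of the events it causally depends on, and has the notification service hold back
+an event until its dependencies have been processed. The `Event` and `NotificationService`
+classes, the `depends_on` field, and the in-memory friend list are hypothetical names chosen for
+this example; a real system would also need durable storage and a policy for dependencies that
+never arrive.
+
+```python
+from dataclasses import dataclass, field
+
+
+@dataclass
+class Event:
+    event_id: str
+    kind: str  # "unfriend" or "message-send"
+    payload: dict
+    depends_on: list = field(default_factory=list)  # IDs of causally prior events
+
+
+class NotificationService:
+    """Accepts events in any arrival order, but applies them in causal order."""
+
+    def __init__(self, friends):
+        self.friends = {user: set(fs) for user, fs in friends.items()}
+        self.processed = set()   # IDs of events already applied
+        self.pending = []        # events waiting for a dependency
+        self.notifications = []  # (recipient, text) pairs that were sent
+
+    def receive(self, event):
+        self.pending.append(event)
+        self._drain()
+
+    def _drain(self):
+        # Keep applying buffered events until none of them is ready.
+        progress = True
+        while progress:
+            progress = False
+            for event in list(self.pending):
+                if all(dep in self.processed for dep in event.depends_on):
+                    self.pending.remove(event)
+                    self._apply(event)
+                    self.processed.add(event.event_id)
+                    progress = True
+
+    def _apply(self, event):
+        p = event.payload
+        if event.kind == "unfriend":
+            self.friends[p["user"]].discard(p["friend"])
+        elif event.kind == "message-send":
+            for recipient in sorted(self.friends[p["user"]]):
+                self.notifications.append((recipient, p["text"]))
+
+
+service = NotificationService({"alice": {"bob", "carol"}})
+unfriend = Event("e1", "unfriend", {"user": "alice", "friend": "bob"})
+message = Event("e2", "message-send", {"user": "alice", "text": "moving on"},
+                depends_on=["e1"])
+
+service.receive(message)   # arrives first, but is buffered until e1 is processed
+service.receive(unfriend)  # unblocks the message
+```
+
+In this run the message arrives before the unfriend event, yet only `carol` is notified, because
+the message is not applied until the event it depends on has taken effect.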
+
+Perhaps patterns for application development will emerge in the future that allow causal
+dependencies to be captured efficiently, and derived state to be maintained correctly, without
+forcing all events to go through the bottleneck of total order broadcast.
+
+### Batch and Stream Processing {#sec_future_batch_streaming}
+
+The goal of data integration is to make sure that data ends up in the right form in all the right
+places. Doing so requires consuming inputs, transforming, joining, filtering, aggregating, training
+models, evaluating, and eventually writing to the appropriate outputs. Batch and stream processors
+are the tools for achieving this goal. The outputs of batch and stream processes are derived
+datasets such as search indexes, materialized views, recommendations to show to users, aggregate
+metrics, and so on.
+
+As we saw in [Chapter 11](/en/ch11#ch_batch) and [Chapter 12](/en/ch12#ch_stream), batch and stream
+processing have a lot of principles in common; the fundamental difference is that stream
+processors operate on unbounded datasets, whereas the inputs to batch processes are of a known,
+finite size.
+
+#### Maintaining derived state {#id446}
+
+Batch processing has a strongly functional flavor (even if the code is not written in a
+functional programming language): it encourages deterministic, pure functions whose output depends
+only on the input and which have no side effects other than the explicit outputs, treating inputs as
+immutable and outputs as append-only. Stream processing is similar, but it extends operators to
+allow managed, fault-tolerant state.
+
+The principle of deterministic functions with well-defined inputs and outputs is not only good for
+fault tolerance, but also simplifies reasoning about the dataflows in an organization
+[^5]. No matter whether the derived data is a search index, a statistical model, or a
+cache, it is helpful to think in terms of data pipelines that derive one thing from another, pushing
+state changes in one system through functional application code and applying the effects to derived
+systems.
+
+In principle, derived data systems could be maintained synchronously, just like a relational
+database updates secondary indexes synchronously within the same transaction as writes to the table
+being indexed. However, asynchrony is what makes systems based on event logs robust: it allows a
+fault in one part of the system to be contained locally, whereas distributed transactions abort if
+any one participant fails, so they tend to amplify failures by spreading them to the rest of the
+system.
+
+We saw in ["Sharding and Secondary Indexes"](/en/ch7#sec_sharding_secondary_indexes) that secondary
+indexes often cross shard boundaries. A sharded system with secondary indexes either needs to send
+writes to multiple shards (if the index is term-partitioned) or send reads to all shards (if the
+index is document-partitioned). Such cross-shard communication is also most reliable and scalable if
+the index is maintained asynchronously [^6].
+
+#### Reprocessing data for application evolution {#sec_future_reprocessing}
+
+When maintaining derived data, batch and stream processing are both useful. Stream processing allows
+changes in the input to be reflected in derived views with low delay, whereas batch processing
+allows large amounts of accumulated historical data to be reprocessed in order to derive new views
+onto an existing dataset.
+
+In particular, reprocessing existing data provides a good mechanism for maintaining a system,
+evolving it to support new features and changed requirements. Without reprocessing, schema evolution
+is limited to simple changes like adding a new optional field to a record, or adding a new type of
+record. On the other hand, with reprocessing it is possible to restructure a dataset into a
+completely different model in order to better serve new requirements.
+
+> [!TIP] SCHEMA MIGRATIONS ON RAILWAYS
+> Large-scale "schema migrations" occur in noncomputer systems as well. For example, in the early days
+> of railway building in 19th-century England there were various competing standards for the gauge
+> (the distance between the two rails). Trains built for one gauge couldn't run on tracks of another
+> gauge, which restricted the possible interconnections in the train network [^7].
+>
+> After a single standard gauge was finally decided upon in 1846, tracks with other gauges had to be
+> converted---but how do you do this without shutting down the train line for months or years? The
+> solution is to first convert the track to *dual gauge* or *mixed gauge* by adding a third rail. This
+> conversion can be done gradually, and when it is done, trains of both gauges can run on the line,
+> using two of the three rails. Eventually, once all trains have been converted to the standard gauge,
+> the rail providing the nonstandard gauge can be removed.
+>
+> "Reprocessing" the existing tracks in this way, and allowing the old and new versions to exist side
+> by side, makes it possible to change the gauge gradually over the course of years. Nevertheless, it
+> is an expensive undertaking, which is why nonstandard gauges still exist today. For example, the
+> BART system in the San Francisco Bay Area uses a different gauge from the majority of the US.
+
+Derived views allow *gradual* evolution. If you want to restructure a dataset, you do not need to
+perform the migration as a sudden switch. Instead, you can maintain the old schema and the new
+schema side by side as two independently derived views onto the same underlying data. You can then
+start shifting a small number of users to the new view in order to test its performance and find any
+bugs, while most users continue to be routed to the old view. Gradually, you can increase the
+proportion of users accessing the new view, and eventually you can drop the old view [^8],
+[^9].
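+
+One way to shift users gradually is to assign each user to a stable bucket and compare it with a
+rollout percentage. The sketch below assumes that approach; the `bucket` and `choose_view`
+functions and the view names are invented for illustration.
+
+```python
+import hashlib
+
+
+def bucket(user_id):
+    """Deterministically map a user to a bucket in the range 0..99."""
+    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
+    return int.from_bytes(digest[:4], "big") % 100
+
+
+def choose_view(user_id, rollout_percent):
+    """Route a user to the new view if their bucket falls below the rollout percentage."""
+    return "new_view" if bucket(user_id) < rollout_percent else "old_view"
+```
+
+Since the bucket depends only on the user ID, raising `rollout_percent` only ever moves users from
+the old view to the new one, and lowering it back to zero returns everyone to the old view.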
+
+The beauty of such a gradual migration is that every stage of the process is easily reversible if
+something goes wrong: you always have a working system to go back to. By reducing the risk of
+irreversible damage, you can be more confident about going ahead, and thus move faster to improve
+your system [^10].
+
+#### Unifying batch and stream processing {#id338}
+
+An early proposal for unifying batch and stream processing was the *lambda architecture*
+[^11], which had a number of problems [^12] and has fallen out of use. More
+recent systems allow batch computations (reprocessing historical data) and stream computations
+(processing events as they arrive) to be implemented in the same system [^13], an approach
+that is sometimes known as the *kappa architecture* [^12].
+
+Unifying batch and stream processing in one system requires the following features:
+
+- The ability to replay historical events through the same processing engine that handles the stream
+ of recent events. For example, log-based message brokers have the ability to replay messages, and
+ some stream processors can read input from a distributed filesystem or object storage.
+
+- Exactly-once semantics for stream processors---that is, ensuring that the output is the same as if
+ no faults had occurred, even if faults did in fact occur. As with batch processing, this
+ requires discarding the partial output of any failed tasks.
+
+- Tools for windowing by event time, not by processing time, since processing time is meaningless
+ when reprocessing historical events. For example, Apache Beam provides an API for expressing such
+ computations, which can then be run using Apache Flink or Google Cloud Dataflow.
+
+## Unbundling Databases {#sec_future_unbundling}
+
+At the most abstract level, databases, batch/stream processors, and operating systems all perform the
+same functions: they store some data, and they allow you to process and query that data
+[^14], [^15]. A database stores data in records of some data model (rows in tables,
+documents, vertices in a graph, etc.) while an operating system's filesystem stores data in
+files---but at their core, both are "information management" systems [^16]. As we saw in
+[Chapter 11](/en/ch11#ch_batch), batch processors are like a distributed version of Unix.
+
+Of course, there are many practical differences. For example, many filesystems do not cope very well
+with a directory containing 10 million small files, whereas a database containing 10 million small
+records is completely normal and unremarkable. Nevertheless, the similarities and differences
+between operating systems and databases are worth exploring.
+
+Unix and relational databases have approached the information management problem with very different
+philosophies. Unix viewed its purpose as presenting programmers with a logical but fairly low-level
+hardware abstraction, whereas relational databases wanted to give application programmers a
+high-level abstraction that would hide the complexities of data structures on disk, concurrency,
+crash recovery, and so on. Unix developed pipes and files that are just sequences of bytes, whereas
+databases developed SQL and transactions.
+
+Which approach is better? Of course, it depends on what you want. Unix is "simpler" in the sense that
+it is a fairly thin wrapper around hardware resources; relational databases are "simpler" in the
+sense that a short declarative query can draw on a lot of powerful infrastructure (query
+optimization, indexes, join methods, concurrency control, replication, etc.) without the author of
+the query needing to understand the implementation details.
+
+The tension between these philosophies has lasted for decades (both Unix and the relational model
+emerged in the early 1970s) and still isn't resolved. For example, the NoSQL movement could be
+interpreted as wanting to apply a Unix-esque approach of low-level abstractions to the domain of
+distributed OLTP data storage.
+
+This section attempts to reconcile the two philosophies, in the hope that we can combine the best of
+both worlds.
+
+### Composing Data Storage Technologies {#id447}
+
+Over the course of this book we have discussed various features provided by databases and how they
+work, including:
+
+- Secondary indexes, which allow you to efficiently search for records based on the value of a
+ field;
+
+- Materialized views, which are a kind of precomputed cache of query results;
+
+- Replication logs, which keep copies of the data on other nodes up to date; and
+
+- Full-text search indexes, which allow keyword search in text and which are built into some
+ relational databases [^1].
+
+In Chapters [11](/en/ch11#ch_batch) and [12](/en/ch12#ch_stream), similar themes emerged. We talked
+about building full-text search indexes, about materialized view maintenance, and about replicating
+changes from a database to derived data systems using change data capture.
+
+It seems that there are parallels between the features that are built into databases and the derived
+data systems that people are building with batch and stream processors.
+
+#### Creating an index {#id340}
+
+Think about what happens when you run `CREATE INDEX` to create a new index in a relational database.
+The database has to scan over a consistent snapshot of a table, pick out all of the field values
+being indexed, sort them, and write out the index. Then it must process the backlog of writes that
+have been made since the consistent snapshot was taken (assuming the table was not locked while
+creating the index, so writes could continue). Once that is done, the database must continue to keep
+the index up to date whenever a transaction writes to the table.
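+
+The three phases described above (scan a snapshot, catch up on the backlog, then stay in sync) can
+be imitated with a toy in-memory model. This is a simplified sketch rather than how any real
+database implements `CREATE INDEX`: the `build_index` and `apply_write` functions, the dictionary
+standing in for a table, and the sorted list standing in for the index are all assumptions made
+for the example.
+
+```python
+import bisect
+
+
+def build_index(snapshot, field):
+    """Scan a snapshot {primary_key: row} and return a sorted list of (value, primary_key)."""
+    return sorted((row[field], pk) for pk, row in snapshot.items())
+
+
+def apply_write(index, rows, field, pk, new_row):
+    """Apply one write to the rows and the index; new_row=None means delete."""
+    old_row = rows.get(pk)
+    if old_row is not None:
+        # Remove the index entry that pointed at the old version of the row.
+        entry = (old_row[field], pk)
+        pos = bisect.bisect_left(index, entry)
+        if pos < len(index) and index[pos] == entry:
+            del index[pos]
+    if new_row is None:
+        rows.pop(pk, None)
+    else:
+        rows[pk] = new_row
+        bisect.insort(index, (new_row[field], pk))
+
+
+# Phase 1: build the index from a consistent snapshot.
+rows = {1: {"city": "Oslo"}, 2: {"city": "Bern"}}
+index = build_index(rows, "city")
+
+# Phase 2: process the backlog of writes made since the snapshot was taken.
+backlog = [(3, {"city": "Cairo"}), (1, {"city": "Lima"}), (2, None)]
+for pk, new_row in backlog:
+    apply_write(index, rows, "city", pk, new_row)
+
+# Phase 3: from here on, every new write goes through apply_write as well.
+```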
+
+This process is remarkably similar to setting up a new follower replica (see ["Setting Up New
+Followers"](/en/ch6#sec_replication_new_replica)), and also very similar to bootstrapping change
+data capture in a streaming system (see ["Initial snapshot"](/en/ch12#sec_stream_cdc_snapshot)).
+
+Whenever you run `CREATE INDEX`, the database essentially reprocesses the existing dataset and
+derives the index as a new view onto the existing data. The existing data may be a snapshot of the
+state rather than a log of all changes that ever happened, but the two are closely related.
+
+#### The meta-database of everything {#id341}
+
+In this light, the dataflow across an entire organization starts looking like one huge database
+[^5]. Whenever a batch, stream, or ETL process transports data from one place and form to
+another place and form, it is acting like the database subsystem that keeps indexes or materialized
+views up to date.
+
+Viewed like this, batch and stream processors are like elaborate implementations of triggers, stored
+procedures, and materialized view maintenance algorithms. The derived data systems they maintain are
+like different index types. For example, a relational database may support B-tree indexes, hash
+indexes, spatial indexes, and other types of indexes. In the emerging architecture of derived data
+systems, instead of implementing those facilities as features of a single integrated database
+product, they are provided by various different pieces of software, running on different machines,
+administered by different teams.
+
+Where will these developments take us in the future? If we start from the premise that there is no
+single data model or storage format that is suitable for all access patterns, there are two avenues
+by which different storage and processing tools can nevertheless be composed into a cohesive system:
+
+Federated databases: unifying reads
+
+: It is possible to provide a unified query interface to a wide variety of underlying storage
+ engines and processing methods---an approach known as a *federated database* or *polystore*
+ [^17], [^18]. For example, PostgreSQL's *foreign data wrapper* feature fits this
+ pattern, as do federated query engines such as Trino, Hoptimator, and Xorq. Applications that
+ need a specialized data model or query interface can still access the underlying storage engines
+ directly, while users who want to combine data from disparate places can do so easily through
+ the federated interface.
+
+ A federated query interface follows the relational tradition of a single integrated system with
+ a high-level query language and elegant semantics, but a complicated implementation.
+
+Unbundled databases: unifying writes
+
+: While federation addresses read-only querying across several different systems, it does not have
+ a good answer to synchronizing writes across those systems. We said that within a single
+ database, creating a consistent index is a built-in feature. When we compose several storage
+ systems, we similarly need to ensure that all data changes end up in all the right places, even
+ in the face of faults. Making it easier to reliably plug together storage systems (e.g., through
+ change data capture and event logs) is like *unbundling* a database's index-maintenance features
+ in a way that can synchronize writes across disparate technologies [^5], [^19].
+
+ The unbundled approach follows the Unix tradition of small tools that do one thing well
+ [^20], that communicate through a uniform low-level API (pipes), and that can be
+ composed using a higher-level language (the shell) [^14].
+
+#### Making unbundling work {#sec_future_unbundling_favor}
+
+Federation and unbundling are two sides of the same coin: composing a reliable, scalable, and
+maintainable system out of diverse components. Federated read-only querying requires mapping one
+data model into another, which takes some thought but is ultimately quite a manageable problem.
+Keeping the writes to several storage systems in sync is the harder engineering problem, and so we
+will focus on it here.
+
+The traditional approach to synchronizing writes requires distributed transactions across
+heterogeneous storage systems [^17], which are problematic, as discussed previously.
+Transactions within a single storage or stream processing system are feasible, but when data crosses
+the boundary between different technologies, an asynchronous event log with idempotent writes is a
+much more robust and practicable approach.
+
+For example, distributed transactions are used within some stream processors to achieve exactly-once
+semantics, and this can work quite well. However, when a transaction would need to involve systems
+written by different groups of people (e.g., when data is written from a stream processor to a
+distributed key-value store or search index), the lack of a standardized transaction protocol makes
+integration much harder. An ordered log of events with idempotent consumers is a much simpler
+abstraction, and thus much more feasible to implement across heterogeneous systems [^5].
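+
+To make the idea of an idempotent consumer concrete, here is a minimal Python sketch in which the
+consumer remembers the highest log offset it has applied and ignores redelivered entries. The
+`IdempotentConsumer` class and the shape of the log entries are hypothetical, and the sketch
+assumes that the derived view and the offset can be updated atomically, which a real system would
+have to guarantee by storing them together.
+
+```python
+class IdempotentConsumer:
+    """Applies each log entry to a derived view at most once, using log offsets."""
+
+    def __init__(self):
+        self.view = {}         # derived state: a running total per key
+        self.last_offset = -1  # highest log offset already applied
+
+    def handle(self, offset, key, delta):
+        if offset <= self.last_offset:
+            return False       # duplicate delivery after a retry: ignore it
+        # Assumption: in a real system these two updates are committed atomically.
+        self.view[key] = self.view.get(key, 0) + delta
+        self.last_offset = offset
+        return True
+
+
+log = [(0, "a", 1), (1, "b", 2), (2, "a", 3)]
+consumer = IdempotentConsumer()
+
+for entry in log:
+    consumer.handle(*entry)
+
+# Simulate a retry that redelivers entries from offset 1 onward.
+for entry in log[1:]:
+    consumer.handle(*entry)
+```
+
+After the redelivery the view is unchanged, so the producer or broker can retry freely without the
+consumer needing to take part in a transaction protocol.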
+
+The big advantage of log-based integration is *loose coupling* between the various components, which
+manifests itself in two ways:
+
+1. At a system level, asynchronous event streams make the system as a whole more robust to outages
+ or performance degradation of individual components. If a consumer runs slow or fails, the event
+ log can buffer messages, allowing the producer and any other consumers to continue running
+ unaffected. The faulty consumer can catch up when it is fixed, so it doesn't miss any data, and
+ the fault is contained. By contrast, the synchronous interaction of distributed transactions
+ tends to escalate local faults into large-scale failures.
+
+2. At a human level, unbundling data systems allows different software components and services to
+ be developed, improved, and maintained independently from each other by different teams.
+ Specialization allows each team to focus on doing one thing well, with well-defined interfaces
+ to other teams' systems. Event logs provide an interface that is powerful enough to capture
+ fairly strong consistency properties (due to durability and ordering of events), but also
+ general enough to be applicable to almost any kind of data.
+
+#### Unbundled versus integrated systems {#id448}
+
+If unbundling does indeed become the way of the future, it will not replace databases in their
+current form---they will still be needed as much as ever. Databases are still required for
+maintaining state in stream processors, and in order to serve queries for the output of batch and
+stream processors. Specialized query engines will continue to be important for particular workloads:
+for example, query engines in data warehouses are optimized for exploratory analytic queries and
+handle this kind of workload very well.
+
+The complexity of running several different pieces of infrastructure can be a problem: each piece of
+software has a learning curve, configuration issues, and operational quirks, and so it is worth
+deploying as few moving parts as possible. A single integrated software product may also be able to
+achieve better and more predictable performance on the kinds of workloads for which it is designed,
+compared to a system consisting of several tools that you have composed with application code
+[^21]. Building for scale that you don't need is wasted effort and may lock you into an
+inflexible design. In effect, it is a form of premature optimization.
+
+The goal of unbundling is not to compete with individual databases on performance for particular
+workloads; the goal is to allow you to combine several different databases in order to achieve good
+performance for a much wider range of workloads than is possible with a single piece of software.
+It's about breadth, not depth.
+
+Thus, if there is a single technology that does everything you need, you're most likely best off
+simply using that product rather than trying to reimplement it yourself from lower-level components.
+The advantages of unbundling and composition only come into the picture when there is no single
+piece of software that satisfies all your requirements.
+
+The tools for composing data systems are getting better: Debezium can extract change streams from
+many databases, Kafka's protocol is becoming a de facto standard for event streams, and incremental
+view maintenance engines (see ["Incremental View Maintenance"](/en/ch12#sec_stream_ivm)) make it
+possible to precompute and update caches of complex queries.
+
+### Designing Applications Around Dataflow {#sec_future_dataflow}
+
+The general idea of updating derived data when its underlying data changes is nothing new. For
+example, spreadsheets have powerful dataflow programming capabilities [^22]: you can put a
+formula in one cell (for example, the sum of cells in another column), and whenever any input to the
+formula changes, the result of the formula is automatically recalculated. This is exactly what we
+want at a data system level: when a record in a database changes, we want any index for that record
+to be automatically updated, and any cached views or aggregations that depend on the record to be
+automatically refreshed. You should not have to worry about the technical details of how this
+refresh happens, but be able to simply trust that it works correctly.
+
+Thus, most data systems still have something to learn from the features that VisiCalc already had in
+1979 [^23]. The difference from spreadsheets is that today's data systems need to be
+fault-tolerant, scalable, and store data durably. They also need to be able to integrate disparate
+technologies written by different groups of people over time, and reuse existing libraries and
+services: it is unrealistic to expect all software to be developed using one particular language,
+framework, or tool.
+
+In this section we will expand on these ideas and explore some ways of building applications around
+the ideas of unbundled databases and dataflow.
+
+#### Application code as a derivation function {#sec_future_dataflow_derivation}
+
+When one dataset is derived from another, it goes through some kind of transformation function. For
+example:
+
+- A secondary index is a kind of derived dataset with a straightforward transformation function: for
+ each row or document in the base table, it picks out the values in the columns or fields being
+ indexed, and sorts by those values (assuming an SSTable or B-tree index, which are sorted by key).
+
+- A full-text search index is created by applying various natural language processing functions such
+ as language detection, word segmentation, stemming or lemmatization, spelling correction, and
+ synonym identification, followed by building a data structure for efficient lookups (such as an
+ inverted index).
+
+- In a machine learning system, we can consider the model as being derived from the training data by
+ applying various feature extraction and statistical analysis functions. When the model is applied
+ to new input data, the output of the model is derived from the input and the model (and hence,
+ indirectly, from the training data).
+
+- A cache often contains an aggregation of data in the form in which it is going to be displayed in
+ a user interface (UI). Populating the cache thus requires knowledge of what fields are referenced
+ in the UI; changes in the UI may require updating the definition of how the cache is populated and
+ rebuilding the cache.
+
+The derivation function for a secondary index is so commonly required that it is built into many
+databases as a core feature, and you can invoke it by merely saying `CREATE INDEX`. For full-text
+indexing, basic linguistic features for common languages may be built into a database, but the more
+sophisticated features often require domain-specific tuning. In machine learning, feature
+engineering is notoriously application-specific, and often has to incorporate detailed knowledge
+about the user interaction and deployment of an application [^24].
+
+When the function that creates a derived dataset is not a standard cookie-cutter function like
+creating a secondary index, custom code is required to handle the application-specific aspects. And
+this custom code is where many databases struggle. Although relational databases commonly support
+triggers, stored procedures, and user-defined functions, which can be used to execute application
+code within the database, they have been somewhat of an afterthought in database design.
+
+#### Separation of application code and state {#id344}
+
+In theory, databases could be deployment environments for arbitrary application code, like an
+operating system. However, in practice they have turned out to be poorly suited for this purpose.
+They do not fit well with the requirements of modern application development, such as dependency and
+package management, version control, rolling upgrades, evolvability, monitoring, metrics, calls to
+network services, and integration with external systems.
+
+On the other hand, deployment and cluster management tools such as Kubernetes, Docker, Mesos, YARN,
+and others are designed specifically for the purpose of running application code. By focusing on
+doing one thing well, they are able to do it much better than a database that provides execution of
+user-defined functions as one of its many features.
+
+Most web applications today are deployed as stateless services, in which any user request can be
+routed to any application server, and the server forgets everything about the request once it has
+sent the response. This style of deployment is convenient, as servers can be added or removed at
+will, but the state has to go somewhere: typically, a database. The trend has been to keep stateless
+application logic separate from state management (databases): not putting application logic in the
+database and not putting persistent state in the application [^25]. As people in the
+functional programming community like to joke, "We believe in the separation of Church and state"
+[^26].
+
+> [!NOTE]
+> Explaining a joke usually ruins it, but here is an explanation anyway so that nobody feels left out.
+> *Church* is a reference to the mathematician Alonzo Church, who created the lambda calculus, an
+> early form of computation that is the basis for most functional programming languages. The lambda
+> calculus has no mutable state (i.e., no variables that can be overwritten), so one could say that
+> mutable state is separate from Church's work.
+
+In this typical web application model, the database acts as a kind of mutable shared variable that
+can be accessed synchronously over the network. The application can read and update the variable,
+and the database takes care of making it durable, providing some concurrency control and fault
+tolerance.
+
+However, in most programming languages you cannot subscribe to changes in a mutable variable---you
+can only read it periodically. Unlike in a spreadsheet, readers of the variable don't get notified
+if the value of the variable changes. (You can implement such notifications in your own code---this
+is known as the *observer pattern*---but most languages do not have this pattern as a built-in
+feature.)
+
+Databases have inherited this passive approach to mutable data: if you want to find out whether the
+content of the database has changed, often your only option is to poll (i.e., to repeat your query
+periodically). Subscribing to changes is only just beginning to emerge as a feature.
+
+#### Dataflow: Interplay between state changes and application code {#id450}
+
+Thinking about applications in terms of dataflow implies renegotiating the relationship between
+application code and state management. Instead of treating a database as a passive variable that is
+manipulated by the application, we think much more about the interplay and collaboration between
+state, state changes, and code that processes them. Application code responds to state changes in
+one place by triggering state changes in another place.
+
+We have already seen this idea in change data capture, in the actor model, in triggers, and in
+incremental view maintenance. Unbundling the database means taking this idea and applying it to the
+creation of derived datasets outside of the primary database: caches, full-text search indexes,
+machine learning, or analytics systems. We can use stream processing and messaging systems for this
+purpose.
+
+Maintaining derived data requires the following properties, which log-based message brokers can
+provide:
+
+- When maintaining derived data, the order of state changes is often important (if several views are
+ derived from an event log, they need to process the events in the same order so that they remain
+ consistent with each other).
+
+- Fault tolerance is essential: losing just a single message causes the derived dataset to go
+ permanently out of sync with its data source. Both message delivery and derived state updates must
+ be reliable.
+
+Stable message ordering and fault-tolerant message processing are quite stringent demands, but they
+are much less expensive and more operationally robust than distributed transactions. Modern stream
+processors can provide these ordering and reliability guarantees at scale, and they allow
+application code to be run as stream operators.
+
+This application code can do the arbitrary processing that built-in derivation functions in
+databases generally don't provide. Like Unix tools chained by pipes, stream operators can be
+composed to build large systems around dataflow. Each operator takes streams of state changes as
+input, and produces other streams of state changes as output.
+
+#### Stream processors and services {#id345}
+
+The currently dominant style of application development involves breaking down functionality into a
+set of *services* that communicate via synchronous network requests such as REST APIs. The advantage
+of such a service-oriented architecture over a single monolithic application is primarily
+organizational scalability through loose coupling: different teams can work on different services,
+which reduces coordination effort between teams (as long as the services can be deployed and updated
+independently).
+
+Composing stream operators into dataflow systems has a lot of similar characteristics to the
+microservices approach [^27], [^28]. However, the underlying communication mechanism
+is very different: one-directional, asynchronous message streams rather than synchronous
+request/response interactions.
+
+Besides the advantages listed in ["Event-Driven Architectures"](/en/ch5#sec_encoding_dataflow_msg),
+such as better fault tolerance, dataflow systems can also achieve better performance than
+traditional REST APIs or RPC. For example, say a customer is purchasing an item that is priced in
+one currency but paid for in another currency. In order to perform the currency conversion, you need
+to know the current exchange rate. This operation could be implemented in two ways [^27],
+[^29]:
+
+1. In the microservices approach, the code that processes the purchase would probably query an
+ exchange-rate service or database in order to obtain the current rate for a particular currency.
+
+2. In the dataflow approach, the code that processes purchases would subscribe to a stream of
+ exchange rate updates ahead of time, and record the current rate in a local database whenever it
+ changes. When it comes to processing the purchase, it only needs to query the local database.
+
+The second approach has replaced a synchronous network request to another service with a query to a
+local database (which may be on the same machine, even in the same process). In the microservices
+approach, you could avoid the synchronous network request by caching the exchange rate locally in
+the service that processes the purchase. However, in order to keep that cache fresh, you would need
+to periodically poll for updated exchange rates, or subscribe to a stream of changes---which is
+exactly what happens in the dataflow approach.
+
+Not only is the dataflow approach faster, but it is also more robust to the failure of another
+service. The fastest and most reliable network request is no network request at all! Instead of RPC,
+we now have a stream join between purchase events and exchange rate update events.
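+
+As a minimal sketch of the dataflow approach, the purchase processor below applies exchange-rate
+updates to a local table and consults only that table when a purchase arrives. The class name,
+method names, currencies, and rates are hypothetical, and an in-memory dictionary stands in for
+both the update stream and the local database.
+
+```python
+from decimal import Decimal
+
+
+class PurchaseProcessor:
+    """Joins purchase events against a locally maintained table of exchange rates."""
+
+    def __init__(self):
+        # Local "database": latest known rate per (from, to) currency pair.
+        self.rates = {}
+
+    def on_rate_update(self, from_ccy, to_ccy, rate):
+        # Apply each exchange-rate change as it arrives on the update stream.
+        self.rates[(from_ccy, to_ccy)] = Decimal(rate)
+
+    def on_purchase(self, amount, from_ccy, to_ccy):
+        # Only a local lookup is needed here; no request to another service.
+        rate = self.rates[(from_ccy, to_ccy)]
+        return (Decimal(amount) * rate).quantize(Decimal("0.01"))
+
+
+processor = PurchaseProcessor()
+processor.on_rate_update("EUR", "USD", "1.10")   # made-up rate
+first = processor.on_purchase("20.00", "EUR", "USD")
+processor.on_rate_update("EUR", "USD", "1.20")   # rate changes between purchases
+second = processor.on_purchase("20.00", "EUR", "USD")
+```
+
+The two purchases convert at different rates because each one sees whatever rate was current when
+it was processed. That is the time dependence of the join in miniature.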
+
+The join is time-dependent: if the purchase events are reprocessed at a later point in time, the
+exchange rate will have changed. If you want to reconstruct the original output, you will need to
+obtain the historical exchange rate at the original time of purchase. No matter whether you query a
+service or subscribe to a stream of exchange rate updates, you will need to handle this time
+dependence (see ["Time-dependence of joins"](/en/ch12#sec_stream_join_time)).
+
+Subscribing to a stream of changes, rather than querying the current state when needed, brings us
+closer to a spreadsheet-like model of computation: when some piece of data changes, any derived data
+that depends on it can swiftly be updated. There are still many open questions, for example around
+issues like time-dependent joins, but building applications around dataflow ideas is a very
+promising direction to explore.
+
+### Observing Derived State {#sec_future_observing}
+
+At an abstract level, the dataflow systems discussed in the last section give you a process for
+creating derived datasets (such as search indexes, materialized views, and predictive models) and
+keeping them up to date. Let's call that process the *write path*: whenever some piece of
+information is written to the system, it may go through multiple stages of batch and stream
+processing, and eventually every derived dataset is updated to incorporate the data that was
+written. [Figure 13-1](/en/ch13#fig_future_write_read_paths) shows an example of updating a search
+index.
+
+{{< figure src="/fig/ddia_1301.png" id="fig_future_write_read_paths" caption="Figure 13-1. In a search index, writes (document updates) meet reads (queries)." class="w-full my-4" >}}
+
+But why do you create the derived dataset in the first place? Most likely because you want to query
+it again at a later time. This is the *read path*: when serving a user request you read from the
+derived dataset, perhaps perform some more processing on the results, and construct the response to
+the user.
+
+Taken together, the write path and the read path encompass the whole journey of the data, from the
+point where it is collected to the point where it is consumed (probably by another human). The write
+path is the portion of the journey that is precomputed---i.e., that is done eagerly as soon as the
+data comes in, regardless of whether anyone has asked to see it. The read path is the portion of the
+journey that only happens when someone asks for it. If you are familiar with functional programming
+languages, you might notice that the write path is similar to eager evaluation, and the read path is
+similar to lazy evaluation.
+
+The derived dataset is the place where the write path and the read path meet, as illustrated in
+[Figure 13-1](/en/ch13#fig_future_write_read_paths). It represents a trade-off between the amount of
+work that needs to be done at write time and the amount that needs to be done at read time.
+
+#### Materialized views and caching {#id451}
+
+A full-text search index is a good example: the write path updates the index, and the read path
+searches the index for keywords. Both reads and writes need to do some work. Writes need to update
+the index entries for all terms that appear in the document. Reads need to search for each of the
+words in the query, and apply Boolean logic to find documents that contain *all* of the words in the
+query (an `AND` operator), or *any* synonym of each of the words (an `OR` operator).
+
+If you didn't have an index, a search query would have to scan over all documents (like `grep`),
+which would get very expensive if you had a large number of documents. No index means less work on
+the write path (no index to update), but a lot more work on the read path.
+
+On the other hand, you could imagine precomputing the search results for all possible queries. In
+that case, you would have less work to do on the read path: no Boolean logic, just find the results
+for your query and return them. However, the write path would be a lot more expensive: the set of
+possible search queries that could be asked is infinite (or at least exponential in the number of
+terms in the corpus), so precomputing results for all of them is infeasible.
+
+Another option would be to precompute the search results for only a fixed set of the most common
+queries, so that they can be served quickly without having to go to the index. The uncommon queries
+can still be served from the index. This would generally be called a *cache* of common queries,
+although we could also call it a materialized view, as it would need to be updated when new
+documents appear that should be included in the results of one of the common queries.
+
+From this example we can see that an index is not the only possible boundary between the write path
+and the read path. Caching of common search results is possible, and `grep`-like scanning without
+the index is also possible on a small number of documents. Viewed like this, the role of caches,
+indexes, and materialized views is simple: they shift the boundary between the read path and the
+write path. They allow us to do more work on the write path, by precomputing results, in order to
+save effort on the read path.
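+
+The trade-off can be sketched with a toy inverted index. This is an illustration under simplifying
+assumptions (whitespace tokenization, no updates or deletions of existing documents), and all
+function names are hypothetical. Both search functions return the same results; they differ only in
+whether the work happens on the write path or the read path.
+
+```python
+from collections import defaultdict
+
+documents = {}              # doc_id -> text
+index = defaultdict(set)    # term -> set of doc_ids containing it
+
+
+def write_document(doc_id, text):
+    # Write path: extra work up front to keep the derived index up to date.
+    documents[doc_id] = text
+    for term in text.lower().split():
+        index[term].add(doc_id)
+
+
+def search_with_index(query):
+    # Read path with an index: intersect the posting sets (Boolean AND).
+    postings = [index.get(term, set()) for term in query.lower().split()]
+    return set.intersection(*postings) if postings else set()
+
+
+def search_by_scanning(query):
+    # Read path without an index: grep-like scan over every document.
+    terms = query.lower().split()
+    return {doc_id for doc_id, text in documents.items()
+            if all(term in text.lower().split() for term in terms)}
+
+
+write_document(1, "stream processing with logs")
+write_document(2, "batch processing with files")
+write_document(3, "logs and streams")
+```
+
+A cache of common queries would sit one step further along: it would store the output of
+`search_with_index` for selected queries and would need refreshing inside `write_document`.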
+
+Shifting the boundary between work done on the write path and the read path was in fact the topic of
+the social networking example in ["Case Study: Social Network Home
+Timelines"](/en/ch2#sec_introduction_twitter). In that example, we also saw how the boundary between
+write path and read path might be drawn differently for celebrities compared to ordinary users.
+After 500 pages we have come full circle!
+
+#### Stateful, offline-capable clients {#id347}
+
+The idea of a boundary between write and read paths is interesting because we can discuss shifting
+that boundary and explore what that shift means in practical terms. Let's look at the idea in a
+different context.
+
+In the past, web browsers were stateless clients that could only do useful things when you had an
+internet connection (just about the only thing you could do offline was to scroll up and down in a
+page that you had previously loaded while online). However, single-page JavaScript web apps now have
+a lot of stateful capabilities, including client-side user interface interaction and persistent
+local storage in the web browser. Mobile apps can similarly store a lot of state on the device and
+don't require a round-trip to the server for most user interactions.
+
+In ["Sync Engines and Local-First Software"](/en/ch6#sec_replication_offline_clients) we saw how
+persistent local state enables a class of applications in which users can work offline, without an
+internet connection, and sync with remote servers in the background when a network connection is
+available [^30]. Since mobile devices sometimes have slow and unreliable cellular internet
+connections, it's a big advantage for users if their user interface does not have to wait for
+synchronous network requests, and if apps mostly work offline.
+
+When we move away from the assumption of stateless clients talking to a central database and toward
+state that is maintained on end-user devices, a world of new opportunities opens up. In particular,
+we can think of the on-device state as a *cache of state on the server*. The pixels on the screen
+are a materialized view onto model objects in the client app; the model objects are a local replica
+of state in a remote datacenter [^31].
+
+#### Pushing state changes to clients {#id348}
+
+In a typical web page, if you load the page in a web browser and the data subsequently changes on
+the server, the browser does not find out about the change until you reload the page. The browser
+only reads the data at one point in time, assuming that it is static---it does not subscribe to
+updates from the server. Thus, the state in the browser is a stale cache that is not updated unless
+you explicitly poll for changes. (HTTP-based feed subscription protocols like RSS are really just a
+basic form of polling.)
+
+More recent protocols have moved beyond the basic request/response pattern of HTTP: server-sent
+events (the EventSource API) and WebSockets provide communication channels by which a web browser
+can keep an open TCP connection to a server, and the server can actively push messages to the
+browser as long as it remains connected. This provides an opportunity for the server to actively
+inform the end-user client about any changes to the state it has stored locally, reducing the
+staleness of the client-side state.
+
+In terms of our model of write path and read path, actively pushing state changes all the way to
+client devices means extending the write path all the way to the end user. When a client is first
+initialized, it would still need to use a read path to get its initial state, but thereafter it
+could rely on a stream of state changes sent by the server. The ideas we discussed around stream
+processing and messaging are not restricted to running only in a datacenter: we can take the ideas
+further, and extend them all the way to end-user devices [^32].
+
+The devices will be offline some of the time, and unable to receive any notifications of state
+changes from the server during that time. But we already solved that problem: in ["Consumer
+offsets"](/en/ch12#sec_stream_log_offsets) we discussed how a consumer of a log-based message broker
+can reconnect after failing or becoming disconnected, and ensure that it doesn't miss any messages
+that arrived while it was disconnected. The same technique works for individual users, where each
+device is a small subscriber to a small stream of events.
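+
+A rough sketch of that technique applied to a device follows. It assumes an in-memory list as the
+log and a device that remembers the next offset it needs; `EventLog`, `Device`, and `sync` are
+hypothetical names, not the API of any particular broker.
+
+```python
+class EventLog:
+    """Append-only log; an event's offset is its position in the list."""
+
+    def __init__(self):
+        self.events = []
+
+    def append(self, event):
+        self.events.append(event)
+
+    def read_from(self, offset):
+        # Return (offset, event) pairs starting at the given offset.
+        return list(enumerate(self.events))[offset:]
+
+
+class Device:
+    def __init__(self):
+        self.next_offset = 0   # a real client would persist this locally
+        self.state = {}
+
+    def sync(self, log):
+        # Catch up on everything appended since the last processed offset.
+        for offset, (key, value) in log.read_from(self.next_offset):
+            self.state[key] = value
+            self.next_offset = offset + 1
+
+
+log = EventLog()
+device = Device()
+log.append(("title", "Draft"))
+device.sync(log)                       # online: receives the first event
+log.append(("title", "Final"))         # device is offline during these writes
+log.append(("status", "published"))
+device.sync(log)                       # reconnect: resumes from offset 1
+```
+
+Because the device resumes from its stored offset, it applies the two missed events exactly once
+and in log order, without re-reading the event it had already seen.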
+
+#### End-to-end event streams {#id349}
+
+Tools for developing stateful clients and user interfaces, such as React and Elm [^33],
+already have the ability to update the rendered user interface in response to changes in the
+underlying state. It would be very natural to extend this programming model to also allow a server
+to push state-change events into this client-side event pipeline.
+
+Thus, state changes could flow through an end-to-end write path: from the interaction on one device
+that triggers a state change, via event logs and through several derived data systems and stream
+processors, all the way to the user interface of a person observing the state on another device.
+These state changes could be propagated with fairly low delay---say, under one second end to end.
+
+Some applications, such as instant messaging and online games, already have such a "real-time"
+architecture (in the sense of interactions with low delay, not in the sense of response time
+guarantees). But why don't we build all applications this way?
+
+The challenge is that the assumption of stateless clients and request/response interactions is very
+deeply ingrained in our databases, libraries, frameworks, and protocols. Many datastores support
+read and write operations where a request returns one response, but far fewer provide an ability to
+subscribe to changes---i.e., a request that returns a stream of responses over time.
+
+In order to extend the write path all the way to the end user, we would need to fundamentally
+rethink the way we build many of these systems: moving away from request/response interaction and
+toward publish/subscribe dataflow [^31]. This would require effort, but it would have the
+advantage of making user interfaces more responsive and providing better offline support.
+
+#### Reads are events too {#sec_future_read_events}
+
+We discussed that when a stream processor writes derived data to a store (database, cache, or
+index), and that store is queried, the store acts as the boundary between the write path and the
+read path. The store allows random-access read queries to the data that would otherwise require
+scanning the whole event log.
+
+In many cases, the data storage is separate from the streaming system. But recall that stream
+processors also need to maintain state to perform aggregations and joins. This state is normally
+hidden inside the stream processor, but some frameworks allow it to also be queried by outside
+clients [^34], turning the stream processor itself into a kind of simple database.
+
+Let's take that idea further. As discussed so far, the writes to the store go through an event log,
+while reads are transient network requests that go directly to the nodes that store the data being
+queried. This is a reasonable design, but not the only possible one. It is also possible to
+represent read requests as streams of events, and send both the read events and the write events
+through a stream processor; the processor responds to read events by emitting the result of the read
+to an output stream [^35].
+
+When both the writes and the reads are represented as events, and routed to the same stream operator
+in order to be handled, we are in fact performing a stream-table join between the stream of read
+queries and the database. The read event needs to be sent to the database shard holding the data,
+just like batch and stream processors need to copartition inputs on the same key when joining.
+
+This correspondence between serving requests and performing joins is quite fundamental
+[^36]. A one-off read request passes through the join operator, which then immediately
+forgets the request; a subscribe request is a persistent join with past and future events on the
+other side of the join.
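+
+To make the correspondence concrete, here is a small single-shard sketch in which writes, one-off
+reads, and subscriptions all arrive as events at the same operator. The event shapes and the
+`KeyValueOperator` class are assumptions made for illustration, and a plain list stands in for the
+output stream.
+
+```python
+class KeyValueOperator:
+    """A stream operator handling write, read, and subscribe events for one shard."""
+
+    def __init__(self):
+        self.table = {}          # operator state: the "database" side of the join
+        self.subscribers = {}    # key -> list of subscribed client IDs
+        self.output = []         # output stream of (recipient, key, value)
+
+    def process(self, event):
+        kind, key = event["type"], event["key"]
+        if kind == "write":
+            self.table[key] = event["value"]
+            # Persistent join: notify every client subscribed to this key.
+            for client in self.subscribers.get(key, []):
+                self.output.append((client, key, event["value"]))
+        elif kind == "read":
+            # One-off read: emit the current value, then forget the request.
+            self.output.append((event["client"], key, self.table.get(key)))
+        elif kind == "subscribe":
+            # Emit the current value and remember the request for future writes.
+            self.subscribers.setdefault(key, []).append(event["client"])
+            self.output.append((event["client"], key, self.table.get(key)))
+
+
+op = KeyValueOperator()
+for event in [
+    {"type": "write", "key": "stock:42", "value": 5},
+    {"type": "read", "key": "stock:42", "client": "alice"},
+    {"type": "subscribe", "key": "stock:42", "client": "bob"},
+    {"type": "write", "key": "stock:42", "value": 4},
+]:
+    op.process(event)
+```
+
+Here `alice` receives a single result, whereas `bob` receives the current value and then every
+later change to the same key.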
+
+Recording a log of read events potentially also has benefits with regard to tracking causal
+dependencies and data provenance across a system: it would allow you to reconstruct what the user
+saw before they made a particular decision. For example, in an online shop, it is likely that the
+predicted shipping date and the inventory status shown to a customer affect whether they choose to
+buy an item [^4]. To analyze this connection, you need to record the result of the user's
+query of the shipping and inventory status.
+
+Writing read requests to durable storage thus enables better tracking of causal dependencies, but it
+incurs additional storage and I/O cost. Optimizing such systems to reduce the overhead is still an
+open research problem [^2]. But if you already log read requests for operational purposes,
+as a side effect of request processing, it is not such a great change to make the log the source of
+the requests instead.
+
+#### Multi-shard data processing {#sec_future_unbundled_multi_shard}
+
+For queries that only touch a single shard, the effort of sending queries through a stream and
+collecting a stream of responses is perhaps overkill. However, this idea opens the possibility of
+distributed execution of complex queries that need to combine data from several shards, taking
+advantage of the infrastructure for message routing, sharding, and joining that is already provided
+by stream processors.
+
+Storm's distributed RPC feature supports this usage pattern. For example, it has been used to
+compute the number of people who have seen a URL on a social network---i.e., the union of the
+follower sets of everyone who has posted that URL [^37]. As the set of users is sharded,
+this computation requires combining results from many shards.
+
+Another example of this pattern occurs in fraud prevention: in order to assess the risk of whether a
+particular purchase event is fraudulent, you can examine the reputation scores of the user's IP
+address, email address, billing address, shipping address, and so on. Each of these reputation
+databases is itself sharded, and so collecting the scores for a particular purchase event requires a
+sequence of joins with differently sharded datasets [^38].
+
+The internal query execution graphs of data warehouse query engines have similar characteristics. If
+you need to perform this kind of multi-shard join, it is probably simpler to use a database that
+provides this feature than to implement it using a stream processor. However, treating queries as
+streams provides an option for implementing large-scale applications that run against the limits of
+conventional off-the-shelf solutions.
+
+## Aiming for Correctness {#sec_future_correctness}
+
+With stateless services that only read data, it is not a big deal if something goes wrong: you can
+fix the bug and restart the service, and everything returns to normal. Stateful systems such as
+databases are not so simple: they are designed to remember things forever (more or less), so if
+something goes wrong, the effects also potentially last forever---which means they require more
+careful thought [^39].
+
+We want to build applications that are reliable and *correct* (i.e., programs whose semantics are
+well defined and understood, even in the face of various faults). For approximately four decades,
+the transaction properties of atomicity, isolation, and durability have been the tools of choice for
+building correct applications. However, those foundations are weaker than they seem: witness for
+example the confusion of weak isolation levels (see ["Weak Isolation
+Levels"](/en/ch8#sec_transactions_isolation_levels)).
+
+In some areas, transactions have been abandoned entirely and replaced with models that offer better
+performance and scalability, but much messier semantics. *Consistency* is often talked about, but
+poorly defined. Some people assert that we should "embrace weak consistency" for the sake of better
+availability, while lacking a clear idea of what that actually means in practice.
+
+For a topic that is so important, our understanding and our engineering methods are surprisingly
+flaky. For example, it is very difficult to determine whether it is safe to run a particular
+application at a particular transaction isolation level or replication configuration [^40],
+[^41]. Often simple solutions appear to work correctly when concurrency is low and there are
+no faults, but turn out to have many subtle bugs in more demanding circumstances.
+
+For example, Kyle Kingsbury's Jepsen experiments [^42] have highlighted the stark
+discrepancies between some products' claimed safety guarantees and their actual behavior in the
+presence of network problems and crashes. Even if infrastructure products like databases were free
+from problems, application code would still need to correctly use the features they provide, which
+is error-prone if the configuration is hard to understand (which is the case with weak isolation
+levels, quorum configurations, and so on).
+
+If your application can tolerate occasionally corrupting or losing data in unpredictable ways, life
+is a lot simpler, and you might be able to get away with simply crossing your fingers and hoping for
+the best. On the other hand, if you need stronger assurances of correctness, then serializability
+and atomic commit are established approaches, but they come at a cost: they typically only work in a
+single datacenter (ruling out geographically distributed architectures), and they limit the scale
+and fault-tolerance properties you can achieve.
+
+While the traditional transaction approach is not going away, it is not the last word in making
+applications correct and resilient to faults. In this section we will explore some ways of thinking
+about correctness in the context of dataflow architectures.
+
+### The End-to-End Argument for Databases {#sec_future_end_to_end}
+
+Just because an application uses a data system that provides comparatively strong safety properties,
+such as serializable transactions, that does not mean the application is guaranteed to be free from
+data loss or corruption. For example, if an application has a bug that causes it to write incorrect
+data, or delete data from a database, serializable transactions aren't going to save you. This is an
+argument in favor of immutable and append-only data, because it is easier to recover from such
+mistakes if you remove the ability of faulty code to destroy good data.
+
+Although immutability is useful, it is not a cure-all by itself. Let's look at a more subtle example
+of data corruption that can occur.
+
+#### Exactly-once execution of an operation {#id353}
+
+In ["Fault Tolerance"](/en/ch12#sec_stream_fault_tolerance) we encountered *exactly-once* (or
+*effectively-once*) semantics. If something goes wrong while processing a message, you can either
+give up (drop the message---i.e., incur data loss) or try again. If you try again, there is the risk
+that it actually succeeded the first time, but you just didn't find out about the success, and so
+the message ends up being processed twice.
+
+Processing twice is a form of data corruption: it is undesirable to charge a customer twice for the
+same service (billing them too much) or increment a counter twice (overstating some metric). In this
+context, *exactly-once* means arranging the computation such that the final effect is the same as if
+no faults had occurred, even if the operation actually was retried due to some fault. We previously
+discussed a few approaches for achieving this goal.
+
+One of the most effective approaches is to make the operation *idempotent*; that is, to ensure that
+it has the same effect, no matter whether it is executed once or multiple times. However, taking an
+operation that is not naturally idempotent and making it idempotent requires some effort and care:
+you may need to maintain some additional metadata (such as the set of operation IDs that have
+updated a value), and ensure fencing when failing over from one node to another (see ["Distributed
+Locks and Leases"](/en/ch9#sec_distributed_lock_fencing)).
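+
+As a minimal sketch of the metadata approach, the counter below remembers the IDs of operations it
+has already applied, so a retried increment has no further effect. The class and the operation IDs
+are hypothetical. A real system would also need to persist the ID set atomically with the value,
+bound its growth, and handle fencing, none of which is shown here.
+
+```python
+class IdempotentCounter:
+    def __init__(self):
+        self.value = 0
+        self.applied_ops = set()   # extra metadata: IDs of operations already applied
+
+    def increment(self, op_id, amount=1):
+        # A retry carrying an already-seen operation ID changes nothing.
+        if op_id in self.applied_ops:
+            return self.value
+        self.applied_ops.add(op_id)
+        self.value += amount
+        return self.value
+
+
+counter = IdempotentCounter()
+counter.increment("op-1")
+counter.increment("op-1")   # retry after a lost acknowledgment
+counter.increment("op-2")
+```
+
+After these three calls the counter holds 2 rather than 3, which is the result a fault-free
+execution would have produced.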
+
+#### Duplicate suppression {#id354}
+
+The same pattern of needing to suppress duplicates occurs in many other places besides stream
+processing. For example, TCP uses sequence numbers on packets to put them in the correct order at
+the recipient, and to determine whether any packets were lost or duplicated on the network. Any lost
+packets are retransmitted and any duplicates are removed by the TCP stack before it hands the data
+to an application.
+
+However, this duplicate suppression only works within the context of a single TCP connection.
+Imagine the TCP connection is a client's connection to a database, and it is currently executing the
+transaction in [Example 13-1](/en/ch13#fig_future_non_idempotent). In many databases, a transaction
+is tied to a client connection (if the client sends several queries, the database knows that they
+belong to the same transaction because they are sent on the same TCP connection). If the client
+suffers a network interruption and connection timeout after sending the `COMMIT`, but before hearing
+back from the database server, it does not know whether the transaction has been committed or
+aborted ([Figure 9-1](/en/ch9#fig_distributed_network)).
+
+
+
+##### Example 13-1. A nonidempotent transfer of money from one account to another {#fig_future_non_idempotent}
+
+``` sql
+BEGIN TRANSACTION;
+UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
+UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
+COMMIT;
+```
+
+The client can reconnect to the database and retry the transaction, but now it is outside of the
+scope of TCP duplicate suppression. Since the transaction in
+[Example 13-1](/en/ch13#fig_future_non_idempotent) is not idempotent, it could happen that \$22 is
+transferred instead of the desired \$11. Thus, even though
+[Example 13-1](/en/ch13#fig_future_non_idempotent) is a standard example for transaction atomicity,
+it is actually not correct, and real banks do not work like this [^3].
+
+Two-phase commit protocols (see ["Two-Phase Commit (2PC)"](/en/ch8#sec_transactions_2pc)) break the
+1:1 mapping between a TCP connection and a transaction, since they must allow a transaction
+coordinator to reconnect to a database after a network fault, and tell it whether to commit or abort
+an in-doubt transaction. Is this sufficient to ensure that the transaction will only be executed
+once? Unfortunately not.
+
+Even if we can suppress duplicate transactions between the database client and server, we still need
+to worry about the network between the end-user device and the application server. For example, if
+the end-user client is a web browser, it probably uses an HTTP POST request to submit an instruction
+to the server. Perhaps the user is on a weak cellular data connection, and they succeed in sending
+the POST, but the signal becomes too weak before they are able to receive the response from the
+server.
+
+In this case, the user will probably be shown an error message, and they may retry manually. Web
+browsers warn, "Are you sure you want to submit this form again?"---and the user says yes, because
+they wanted the operation to happen. (The Post/Redirect/Get pattern [^43] avoids this
+warning message in normal operation, but it doesn't help if the POST request times out.) From the
+web server's point of view the retry is a separate request, and from the database's point of view it
+is a separate transaction. The usual deduplication mechanisms don't help.
+
+#### Uniquely identifying requests {#id355}
+
+To make the request idempotent through several hops of network communication, it is not sufficient
+to rely just on a transaction mechanism provided by a database---you need to consider the
+*end-to-end* flow of the request.
+
+For example, you could generate a unique identifier for a request (such as a UUID) and include it as
+a hidden form field in the client application, or calculate a hash of all the relevant form fields
+to derive the request ID [^3]. If the web browser submits the POST request twice, the two
+requests will have the same request ID. You can then pass that request ID all the way through to the
+database and check that you only ever execute one request with a given ID, as shown in
+[Example 13-2](/en/ch13#fig_future_request_id).
+
+
+
+##### Example 13-2. Suppressing duplicate requests using a unique ID {#fig_future_request_id}
+
+``` sql
+ALTER TABLE requests ADD UNIQUE (request_id);
+
+BEGIN TRANSACTION;
+
+INSERT INTO requests
+ (request_id, from_account, to_account, amount)
+ VALUES('0286FDB8-D7E1-423F-B40B-792B3608036C', 4321, 1234, 11.00);
+
+UPDATE accounts SET balance = balance + 11.00 WHERE account_id = 1234;
+UPDATE accounts SET balance = balance - 11.00 WHERE account_id = 4321;
+
+COMMIT;
+```
+
+[Example 13-2](/en/ch13#fig_future_request_id) relies on a uniqueness constraint on the `request_id`
+column. If a transaction attempts to insert an ID that already exists, the `INSERT` fails and the
+transaction is aborted, preventing it from taking effect twice. Relational databases can generally
+maintain a uniqueness constraint correctly, even at weak isolation levels (whereas an
+application-level check-then-insert may fail under nonserializable isolation, as discussed in
+["Write Skew and Phantoms"](/en/ch8#sec_transactions_write_skew)).
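+
+To see the effect of a retry, here is a small sketch that runs the same pattern against an
+in-memory SQLite database using Python's standard `sqlite3` module. The starting balances, the
+`transfer` helper, and the use of integer cents instead of decimal amounts are assumptions made
+for the illustration.
+
+```python
+import sqlite3
+import uuid
+
+conn = sqlite3.connect(":memory:")
+conn.executescript("""
+    CREATE TABLE accounts (account_id INTEGER PRIMARY KEY, balance_cents INTEGER NOT NULL);
+    CREATE TABLE requests (
+        request_id   TEXT UNIQUE,
+        from_account INTEGER,
+        to_account   INTEGER,
+        amount_cents INTEGER
+    );
+    INSERT INTO accounts VALUES (1234, 0), (4321, 10000);
+""")
+
+
+def transfer(request_id: str, from_account: int, to_account: int, amount_cents: int) -> bool:
+    """Run the transfer at most once per request_id; return False for a duplicate."""
+    try:
+        with conn:  # commits on success, rolls back if an exception escapes
+            conn.execute(
+                "INSERT INTO requests VALUES (?, ?, ?, ?)",
+                (request_id, from_account, to_account, amount_cents),
+            )
+            conn.execute(
+                "UPDATE accounts SET balance_cents = balance_cents + ? WHERE account_id = ?",
+                (amount_cents, to_account),
+            )
+            conn.execute(
+                "UPDATE accounts SET balance_cents = balance_cents - ? WHERE account_id = ?",
+                (amount_cents, from_account),
+            )
+        return True
+    except sqlite3.IntegrityError:
+        return False  # the uniqueness constraint rejected a repeated request ID
+
+
+request_id = str(uuid.uuid4())  # generated once by the client, reused on every retry
+first = transfer(request_id, 4321, 1234, 1100)
+retry = transfer(request_id, 4321, 1234, 1100)
+balances = dict(conn.execute("SELECT account_id, balance_cents FROM accounts"))
+print(first, retry, balances)  # True False {1234: 1100, 4321: 8900}
+```
+
+The retry is rejected by the constraint, so the balances change only once. The important detail is
+that the client reuses the same request ID when it retries; generating a fresh ID per attempt would
+defeat the deduplication.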
+
+Besides suppressing duplicate requests, the `requests` table in
+[Example 13-2](/en/ch13#fig_future_request_id) acts as a kind of event log, which can be useful for
+event sourcing or change data capture. The updates to the account balances don't actually have to
+happen in the same transaction as the insertion of the event, since they are redundant and could be
+derived from the request event in a downstream consumer---as long as the event is processed exactly
+once, which can again be enforced using the request ID.
+
+#### The end-to-end argument {#sec_future_e2e_argument}
+
+This scenario of suppressing duplicate transactions is just one example of a more general principle
+called the *end-to-end argument*, which was articulated by Saltzer, Reed, and Clark in 1984
+[^44]:
+
+> The function in question can completely and correctly be implemented only with the knowledge and
+> help of the application standing at the endpoints of the communication system. Therefore,
+> providing that questioned function as a feature of the communication system itself is not
+> possible. (Sometimes an incomplete version of the function provided by the communication system
+> may be useful as a performance enhancement.)
+
+In our example, the *function in question* was duplicate suppression. We saw that TCP suppresses
+duplicate packets at the TCP connection level, and some stream processors provide so-called
+exactly-once semantics at the message processing level, but that is not enough to prevent a user
+from submitting a duplicate request if the first one times out. By themselves, TCP, database
+transactions, and stream processors cannot entirely rule out these duplicates. Solving the problem
+requires an end-to-end solution: a transaction identifier that is passed all the way from the
+end-user client to the database.
+
+The end-to-end argument also applies to checking the integrity of data: checksums built into
+Ethernet, TCP, and TLS can detect corruption of packets in the network, but they cannot detect
+corruption due to bugs in the software at the sending and receiving ends of the network connection,
+or corruption on the disks where the data is stored. If you want to catch all possible sources of
+data corruption, you also need end-to-end checksums.
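+
+An end-to-end checksum can be sketched as follows: the sending application computes a digest over
+the payload, and the receiving application recomputes it after the data has passed through every
+intermediate layer. The function names `seal` and `verify` are hypothetical, and SHA-256 is just
+one reasonable choice of hash.
+
+```python
+import hashlib
+
+
+def seal(payload: bytes) -> tuple[bytes, str]:
+    """Sender side: pair the payload with a digest computed by the application."""
+    return payload, hashlib.sha256(payload).hexdigest()
+
+
+def verify(payload: bytes, digest: str) -> bool:
+    """Receiver side: recompute the digest and compare it with the one sent."""
+    return hashlib.sha256(payload).hexdigest() == digest
+
+
+payload, digest = seal(b"transfer 11.00 from 4321 to 1234")
+corrupted = b"transfer 91.00 from 4321 to 1234"  # e.g. damaged by a buggy component or a bad disk
+print(verify(payload, digest), verify(corrupted, digest))  # True False
+```
+
+A plain digest like this detects accidental corruption only. Protecting against deliberate
+tampering would additionally require a keyed construction or a signature.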
+
+A similar argument applies with encryption [^44]: the password on your home WiFi network
+protects against people snooping your WiFi traffic, but not against attackers elsewhere on the
+internet; TLS/SSL between your client and the server protects against network attackers, but not
+against compromises of the server. Only end-to-end encryption and authentication can protect against
+all of these things.
+
+Although the low-level features (TCP duplicate suppression, Ethernet checksums, WiFi encryption)
+cannot provide the desired end-to-end features by themselves, they are still useful, since they
+reduce the probability of problems at the higher levels. For example, HTTP requests would often get
+mangled if we didn't have TCP putting the packets back in the right order. We just need to remember
+that the low-level reliability features are not by themselves sufficient to ensure end-to-end
+correctness.
+
+#### Applying end-to-end thinking in data systems {#id357}
+
+This brings us back to the original thesis: just because an application uses a data system that
+provides comparatively strong safety properties, such as serializable transactions, that does not
+mean the application is guaranteed to be free from data loss or corruption. The application itself
+needs to take end-to-end measures, such as duplicate suppression, as well.
+
+That is a shame, because fault-tolerance mechanisms are hard to get right. Low-level reliability
+mechanisms, such as those in TCP, work quite well, and so the remaining higher-level faults occur
+fairly rarely. It would be really nice to wrap up the remaining high-level fault-tolerance machinery
+in an abstraction so that application code needn't worry about it---but it seems that we have not
+yet found the right abstraction.
+
+Transactions have long been seen as a useful abstraction. As discussed in
+[Chapter 8](/en/ch8#ch_transactions), they take a wide range of possible issues (concurrent writes,
+constraint violations, crashes, network interruptions, disk failures) and collapse them down to two
+possible outcomes: commit or abort. That is a huge simplification of the programming model, but it
+is not enough.
+
+Transactions are expensive, especially when they involve heterogeneous storage technologies (see
+["Distributed Transactions Across Different Systems"](/en/ch8#sec_transactions_xa)). When we refuse
+to use distributed transactions because they are too expensive, we end up having to reimplement
+fault-tolerance mechanisms in application code. As numerous examples throughout this book have
+shown, reasoning about concurrency and partial failure is difficult and counterintuitive, and so
+most application-level mechanisms do not work correctly. The consequence is lost or corrupted data.
+
+For these reasons, it is worth exploring fault-tolerance abstractions that make it easy to provide
+application-specific end-to-end correctness properties, but also maintain good performance and good
+operational characteristics in a large-scale distributed environment.
+
+### Enforcing Constraints {#sec_future_constraints}
+
+Let's think about correctness in the context of the ideas around unbundling databases. We saw that
+end-to-end duplicate suppression can be achieved with a request ID that is passed all the way from
+the client to the database that records the write. What about other kinds of constraints?
+
+In particular, let's focus on uniqueness constraints---such as the one we relied on in
+[Example 13-2](/en/ch13#fig_future_request_id). In ["Constraints and uniqueness
+guarantees"](/en/ch10#sec_consistency_uniqueness) we saw several other examples of application
+features that need to enforce uniqueness: a username or email address must uniquely identify a user,
+a file storage service cannot have more than one file with the same name, and two people cannot book
+the same seat on a flight or in a theater.
+
+Other kinds of constraints are very similar: for example, ensuring that an account balance never
+goes negative, that you don't sell more items than you have in stock in the warehouse, or that a
+meeting room does not have overlapping bookings. Techniques that enforce uniqueness can often be
+used for these kinds of constraints as well.
+
+#### Uniqueness constraints require consensus {#id452}
+
+In [Chapter 10](/en/ch10#ch_consistency) we saw that in a distributed setting, enforcing a
+uniqueness constraint requires consensus: if there are several concurrent requests with the same
+value, the system somehow needs to decide which one of the conflicting operations is accepted, and
+reject the others as violations of the constraint.
+
+The most common way of achieving this consensus is to make a single node the leader, and put it in
+charge of making all the decisions. That works fine as long as you don't mind funneling all requests
+through a single node (even if the client is on the other side of the world), and as long as that
+node doesn't fail. Consensus algorithms like Raft tackle the problem of safely electing a new leader
+if the current leader has failed (or is believed to have failed due to a network problem), and
+preventing split brain.
+
+Uniqueness checking can be scaled out by sharding based on the value that needs to be unique. For
+example, if you need to ensure uniqueness by request ID, as in
+[Example 13-2](/en/ch13#fig_future_request_id), you can ensure all requests with the same request ID
+are routed to the same shard. If you need usernames to be unique, you can shard by hash of username.
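+
+A minimal sketch of such routing is shown below. The shard count and the `shard_for` helper are
+assumptions for illustration. The sketch uses a cryptographic hash from `hashlib` because Python's
+built-in `hash()` is randomized per process for strings, so it would not route the same username
+to the same shard across different machines.
+
+```python
+import hashlib
+
+NUM_SHARDS = 8  # assumed number of shards
+
+
+def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
+    """Map a key to a shard using a hash that is stable across processes."""
+    digest = hashlib.sha256(key.encode("utf-8")).digest()
+    return int.from_bytes(digest[:8], "big") % num_shards
+
+
+# Every request for the same username lands on the same shard,
+# so a single node sees all potentially conflicting requests.
+print(shard_for("alice") == shard_for("alice"))  # True
+```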
+
+However, asynchronous multi-leader replication is ruled out, because it could happen that different
+leaders concurrently accept conflicting writes, and thus the values are no longer unique. If you
+want to be able to immediately reject any writes that would violate the constraint, synchronous
+coordination is unavoidable [^45].
+
+#### Uniqueness in log-based messaging {#sec_future_uniqueness_log}
+
+A shared log ensures that all consumers see messages in the same order---a guarantee that is
+formally known as *total order broadcast* and is equivalent to consensus (see ["The Many Faces of
+Consensus"](/en/ch10#sec_consistency_faces)). In the unbundled database approach with log-based
+messaging, we can use a very similar approach to enforce uniqueness constraints.
+
+A stream processor consumes all the messages in a log shard sequentially on a single thread. Thus,
+if the log is sharded based on the value that needs to be unique, a stream processor can
+unambiguously and deterministically decide which one of several conflicting operations came first in
+the log. For example, in the case of several users trying to claim the same username
+[^46]:
+
+1. Every request for a username is encoded as a message, and appended to a shard determined by the
+ hash of the username.
+
+2. A stream processor sequentially reads the requests in the log, using a local database to keep
+ track of which usernames are taken. For every request for a username that is available, it
+ records the name as taken and emits a success message to an output stream. For every request for
+ a username that is already taken, it emits a rejection message to an output stream.
+
+3. The client that requested the username watches the output stream and waits for a success or
+ rejection message corresponding to its request.
+
+This algorithm is the same as the construction for achieving consensus using a shared log, which we
+saw in [Chapter 10](/en/ch10#ch_consistency). It scales easily to a large request throughput by
+increasing the number of shards, as each shard can be processed independently.
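+
+The processing step for one shard can be sketched in a few lines. Here a Python list stands in for
+the log shard and a set stands in for the stream processor's local database; `ClaimRequest` and
+`process_shard` are hypothetical names.
+
+```python
+from dataclasses import dataclass
+
+
+@dataclass(frozen=True)
+class ClaimRequest:
+    request_id: str
+    username: str
+
+
+def process_shard(log: list[ClaimRequest]) -> list[tuple[str, str]]:
+    """Consume one log shard in order; return (request_id, outcome) output messages."""
+    taken: set[str] = set()  # stands in for the processor's local database
+    output: list[tuple[str, str]] = []
+    for request in log:
+        if request.username in taken:
+            output.append((request.request_id, "rejected"))
+        else:
+            taken.add(request.username)
+            output.append((request.request_id, "success"))
+    return output
+
+
+log = [
+    ClaimRequest("r1", "alice"),
+    ClaimRequest("r2", "alice"),  # conflicts with r1, which came first in the log
+    ClaimRequest("r3", "bob"),
+]
+print(process_shard(log))
+# [('r1', 'success'), ('r2', 'rejected'), ('r3', 'success')]
+```
+
+Because the outcome depends only on the order of requests in the log, replaying the same log always
+yields the same decisions.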
+
+The approach works not only for uniqueness constraints, but also for many other kinds of
+constraints. Its fundamental principle is that any writes that may conflict are routed to the same
+shard and processed sequentially. The definition of a conflict may depend on the application, but
+the stream processor can use arbitrary logic to validate a request.
+
+#### Multi-shard request processing {#id360}
+
+Ensuring that an operation is executed atomically, while satisfying constraints, becomes more
+interesting when several shards are involved. In [Example 13-2](/en/ch13#fig_future_request_id),
+there are potentially three shards: the one containing the request ID, the one containing the payee
+account, and the one containing the payer account. There is no reason why those three things should
+be in the same shard, since they are all independent from each other.
+
+In the traditional approach to databases, executing this transaction would require an atomic commit
+across all three shards, which essentially forces it into a total order with respect to all other
+transactions on any of those shards. Since there is now cross-shard coordination, different shards
+can no longer be processed independently, so throughput is likely to suffer.
+
+However, equivalent correctness can be achieved without cross-shard transactions using sharded logs
+and stream processors. [Figure 13-2](/en/ch13#fig_future_multi_shard) shows an example of a payment
+transaction that needs to check whether there is sufficient money in the source account, and if so,
+atomically transfers some amount to a destination account while deducting fees. It works as follows
+[^47]:
+
+{{< figure src="/fig/ddia_1302.png" id="fig_future_multi_shard" caption="Figure 13-2. Checking whether a source account has enough money, and atomically transferring money to a destination account and a fees account, using event logs and stream processors." class="w-full my-4" >}}
+
+1. The request to transfer money from the source account to the destination account is given a
+ unique request ID by the user's client, and appended to a log shard based on the source account
+ ID.
+
+2. A stream processor reads the log of requests and maintains a database containing the state of
+ the source account and the IDs of requests it has already processed. The contents of this
+ database are entirely derived from the log. When the stream processor encounters a request with
+ an ID that it has not seen before, it checks in its local database whether the source account
+ has enough money to perform the transfer.
+
+ If yes, it updates its local database to reserve the payment amount on the source account, and
+ emits events to several other logs: an outgoing payment event to the log shard for the source
+ account (its own input log), an incoming payment event to the log shard for the destination
+ account, and an incoming payment event to the log shard for the fees account. The original
+ request ID is included in those emitted events.
+
+3. Eventually the outgoing payment event is delivered back to the source account processor
+ (possibly after having received unrelated events in the meantime). The stream processor
+ recognizes based on the request ID that this is a payment it previously reserved, and it now
+ executes the payment, again updating its local state of the source account. It ignores
+ duplicates based on request ID.
+
+4. The log shards for the destination and fees accounts are consumed by independent stream
+ processing tasks. When they receive an incoming payment event, they update their local state to
+ reflect the payment, and they deduplicate events based on request ID.
+
+[Figure 13-2](/en/ch13#fig_future_multi_shard) shows the three accounts as being in three separate
+shards, but they could just as well be in the same shard; it doesn't matter. All we need is that
+the events for any given account are processed strictly in log order with at-least-once semantics,
+and that the stream processors are deterministic.
+
+For example, consider what happens if the source account processor crashes while processing a
+payment request. The output messages may or may not have been emitted before the crash occurred.
+When it recovers from the crash, it will process the same request again (due to at-least-once
+semantics), and it will make the same decision on whether to allow the payment (since it's
+deterministic). It will therefore emit the same output messages with the same request ID to the
+outgoing, incoming, and fees account shards. If they are duplicates, the downstream consumers will
+ignore them based on the request ID.
+
+Atomicity in this system comes not from any transactions, but from the fact that writing the initial
+request event to the source account log is an atomic action. Once that one event is in the log, all the
+downstream events will eventually be written as well---possibly after stream processors have
+recovered from crashes, and possibly with duplicates, but they will appear eventually.
+
+With exactly-once semantics this example becomes easier to implement, since it ensures that the
+stream processor's local state is consistent with the set of messages it has processed. Thus, if it
+crashes and re-processes some messages, its local state is also reset to what it was before those
+messages were processed.
+
+If the user in [Figure 13-2](/en/ch13#fig_future_multi_shard) wants to find out whether their
+transfer was approved or not, they can subscribe to the source account log shard and wait for the
+outgoing payment event. In order to explicitly notify the user if the balance is insufficient, the
+stream processor can emit a "declined payment" event to that log shard.
+
+By breaking down the multi-shard transaction into several differently sharded stages and using the
+end-to-end request ID, we have achieved the same correctness property (every request is applied
+exactly once to both the payer and payee accounts), even in the presence of faults, and without
+using an atomic commit protocol.
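+
+The following single-process sketch illustrates the flow under strong simplifying assumptions:
+in-memory deques stand in for the log shards (and are consumed as they are read, unlike a durable
+log), amounts are plain integers, and crashes are not simulated. All class, field, and event names
+are hypothetical.
+
+```python
+from collections import defaultdict, deque
+
+logs: dict[str, deque] = defaultdict(deque)  # one in-memory log per account ID
+
+
+class SourceAccountProcessor:
+    def __init__(self, account: str, balance: int) -> None:
+        self.account = account
+        self.balance = balance  # funds that are neither reserved nor paid out
+        self.reserved: dict[str, int] = {}  # request_id -> reserved amount
+        self.seen: set[str] = set()  # request IDs that have already been decided
+
+    def handle(self, event: dict) -> None:
+        rid = event["request_id"]
+        if event["type"] == "request":
+            if rid in self.seen:
+                return  # duplicate request: the decision was already made
+            self.seen.add(rid)
+            total = event["amount"] + event["fee"]
+            if self.balance < total:
+                logs[self.account].append({"type": "declined", "request_id": rid})
+                return
+            self.balance -= total  # reserve the money
+            self.reserved[rid] = total
+            logs[self.account].append({"type": "outgoing", "request_id": rid})
+            logs[event["to"]].append(
+                {"type": "incoming", "request_id": rid, "amount": event["amount"]}
+            )
+            logs["fees"].append(
+                {"type": "incoming", "request_id": rid, "amount": event["fee"]}
+            )
+        elif event["type"] == "outgoing":
+            self.reserved.pop(rid, None)  # execute the payment; a duplicate is a no-op
+
+
+class IncomingProcessor:
+    def __init__(self) -> None:
+        self.balance = 0
+        self.seen: set[str] = set()
+
+    def handle(self, event: dict) -> None:
+        if event["type"] != "incoming" or event["request_id"] in self.seen:
+            return  # not a payment, or a duplicate of one already applied
+        self.seen.add(event["request_id"])
+        self.balance += event["amount"]
+
+
+def drain(account: str, processor) -> None:
+    """Deliver every event in the account's log to the processor, in log order."""
+    log = logs[account]
+    while log:
+        processor.handle(log.popleft())
+
+
+source = SourceAccountProcessor("alice", balance=100)
+dest, fees = IncomingProcessor(), IncomingProcessor()
+request = {"type": "request", "request_id": "req-1", "to": "bob", "amount": 50, "fee": 1}
+logs["alice"].extend([request, request])  # the client retried, so the request appears twice
+drain("alice", source)
+drain("bob", dest)
+drain("fees", fees)
+print(source.balance, dest.balance, fees.balance)  # 49 50 1
+```
+
+Even though the request was appended twice, the money moves once, because each stage deduplicates
+by request ID. The same check is what makes re-emitted events harmless after a crash and replay.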
+
+### Timeliness and Integrity {#sec_future_integrity}
+
+A convenient property of many transactional systems is that as soon as one transaction commits, its
+writes are immediately visible to other transactions. This property is formalized as *strict
+serializability* (see ["Linearizability Versus
+Serializability"](/en/ch10#sidebar_consistency_serializability)).
+
+This is not the case when unbundling an operation across multiple stages of stream processors:
+consumers of a log are asynchronous by design, so a sender does not wait until its message has been
+processed by consumers. However, it is possible for a client to wait for a message to appear on an
+output stream, like the user waiting for an outgoing payment or declined payment event in
+[Figure 13-2](/en/ch13#fig_future_multi_shard), which depends on whether there was enough money in
+the source account.
+
+In this example, the correctness of the source account balance check does not depend on whether the
+user making the request waits for the outcome. The waiting only has the purpose of synchronously
+informing the user whether or not the payment succeeded, but this notification is decoupled from the
+effects of processing the request.
+
+More generally, the term *consistency* conflates two different requirements that are worth
+considering separately:
+
+Timeliness
+
+: Timeliness means ensuring that users observe the system in an up-to-date state. We saw
+ previously that if a user reads from a stale copy of the data, they may observe it in an
+ inconsistent state (see ["Problems with Replication Lag"](/en/ch6#sec_replication_lag)).
+ However, that inconsistency is temporary, and will eventually be resolved simply by waiting and
+ trying again.
+
+ The CAP theorem uses consistency in the sense of linearizability, which is a strong way of
+ achieving timeliness. Weaker timeliness properties like *read-after-write* consistency can also
+ be useful.
+
+Integrity
+
+: Integrity means absence of corruption; i.e., no data loss, and no contradictory or false data.
+ In particular, if some derived dataset is maintained as a view onto some underlying data, the
+ derivation must be correct. For example, a database index must correctly reflect the contents of
+ the database---an index in which some records are missing is not very useful.
+
+ If integrity is violated, the inconsistency is permanent: waiting and trying again is not going
+ to fix database corruption in most cases. Instead, explicit checking and repair is needed. In
+ the context of ACID transactions, "consistency" is usually understood as some kind of
+ application-specific notion of integrity. Atomicity and durability are important tools for
+ preserving integrity.
+
+In slogan form: violations of timeliness are "eventual consistency," whereas violations of integrity
+are "perpetual inconsistency."
+
+In most applications, integrity is much more important than timeliness. Violations of timeliness can
+be annoying and confusing, but violations of integrity can be catastrophic.
+
+For example, on your credit card statement, it is not surprising if a transaction that you made
+within the last 24 hours does not yet appear---it is normal that these systems have a certain lag.
+We know that banks reconcile and settle transactions asynchronously, and timeliness is not very
+important here [^3]. However, it would be very bad if the statement balance was not equal
+to the sum of the transactions plus the previous statement balance (an error in the sums), or if a
+transaction was charged to you but not paid to the merchant (disappearing money). Such problems
+would be violations of the integrity of the system.
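+
+The first of these integrity properties is easy to state as a check. The sketch below uses made-up
+amounts and a hypothetical function name; it verifies that a statement balance equals the previous
+balance plus the sum of the transactions, regardless of how late those transactions were posted.
+
+```python
+from decimal import Decimal
+
+
+def statement_is_consistent(
+    previous_balance: Decimal, transactions: list[Decimal], statement_balance: Decimal
+) -> bool:
+    """Integrity check: the statement must add up exactly."""
+    return previous_balance + sum(transactions, Decimal("0")) == statement_balance
+
+
+transactions = [Decimal("-11.00"), Decimal("-4.50"), Decimal("100.00")]
+ok = statement_is_consistent(Decimal("20.00"), transactions, Decimal("104.50"))
+bad = statement_is_consistent(Decimal("20.00"), transactions, Decimal("115.50"))
+print(ok, bad)  # True False
+```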
+
+#### Correctness of dataflow systems {#id453}
+
+ACID transactions usually provide both timeliness (e.g., linearizability) and integrity (e.g.,
+atomic commit) guarantees. Thus, if you approach application correctness from the point of view of
+ACID transactions, the distinction between timeliness and integrity is fairly inconsequential.
+
+On the other hand, an interesting property of the event-based dataflow systems that we have
+discussed in this chapter is that they decouple timeliness and integrity. When processing event
+streams asynchronously, there is no guarantee of timeliness, unless you explicitly build consumers
+that wait for a message to arrive before returning. For example, a user could request a payment and
+then read the state of their account before the stream processor has executed the request; the user
+will not see the payment they just requested.
+
+However, integrity is in fact central to streaming systems. *Exactly-once* or *effectively-once*
+semantics is a mechanism for preserving integrity. If an event is lost, or if an event takes effect
+twice, the integrity of a data system could be violated. Thus, fault-tolerant message delivery and
+duplicate suppression (e.g., idempotent operations) are important for maintaining the integrity of a
+data system in the face of faults.
+
+As we saw in the last section, reliable stream processing systems can preserve integrity without
+requiring distributed transactions and an atomic commit protocol, which means they can potentially
+achieve comparable correctness with much better performance and operational robustness. We achieved
+this integrity through a combination of mechanisms:
+
+- Representing the content of the write operation as a single message, which can easily be written
+ atomically---an approach that fits very well with event sourcing
+
+- Deriving all other state updates from that single message using deterministic derivation
+ functions, similarly to stored procedures
+
+- Passing a client-generated request ID through all these levels of processing, enabling end-to-end
+ duplicate suppression and idempotence
+
+- Making messages immutable and allowing derived data to be reprocessed from time to time, which
+ makes it easier to recover from bugs
+
+#### Loosely interpreted constraints {#id362}
+
+As discussed previously, enforcing a uniqueness constraint requires consensus, typically implemented
+by funneling all events in a particular shard through a single node. This limitation is unavoidable
+if we want the traditional form of uniqueness constraint, and stream processing cannot get around
+it.
+
+However, another thing to realize is that in many real applications there is actually a business
+requirement to allow violations of what you might think of as hard constraints:
+
+- If customers order more items than you have in your warehouse, you can order in more stock,
+ apologize to customers for the delay, and offer them a discount. This is actually the same as what
+ you'd have to do if, say, a forklift truck ran over some of the items in your warehouse, leaving
+ you with fewer items in stock than you thought you had [^3]. Thus, the apology workflow
+ already needs to be part of your business processes anyway in order to deal with forklift
+ incidents, and a hard constraint on the number of items in stock might be unnecessary.
+
+- Similarly, many airlines overbook airplanes in the expectation that some passengers will miss
+ their flight, and many hotels overbook rooms, expecting that some guests will cancel. In these
+ cases, the constraint of "one person per seat" is deliberately violated for business reasons, and
+ compensation processes (refunds, upgrades, providing a complimentary room at a neighboring hotel)
+ are put in place to handle situations in which demand exceeds supply. Even if there was no
+ overbooking, apology and compensation processes would be needed in order to deal with flights
+ being cancelled due to bad weather or staff on strike---recovering from such issues is just a
+ normal part of business [^3].
+
+- If someone withdraws more money than they have in their account, the bank can charge them an
+ overdraft fee and ask them to pay back what they owe. By limiting the total withdrawals per day,
+ the risk to the bank is bounded.
+
+- In systems that integrate data between different organizations, inconsistencies will inevitably
+ arise, and correction mechanisms are necessary to handle them. As noted in ["Batch Use
+ Cases"](/en/ch11#sec_batch_output), settlement of payments between banks is an example of this.
+
+In many business contexts, it is therefore acceptable to temporarily violate a constraint and fix it
+up later by apologizing. This kind of change to correct a mistake is called a *compensating
+transaction* [^48], [^49]. The cost of the apology (in terms of money or reputation)
+varies, but it is often quite low: you can't unsend an email, but you can send a follow-up email
+with a correction. If you accidentally charge a credit card twice, you can refund one of the
+charges, and the cost to you is just the processing fees and perhaps a customer complaint. Once
+money has been paid out of an ATM, you can't directly get it back, although in principle you can
+send debt collectors to recover the money if the account was overdrawn and the customer won't pay it
+back.
+
+Whether the cost of the apology is acceptable is a business decision. If it is acceptable, the
+traditional model of checking all constraints before even writing the data is unnecessarily
+restrictive. It may well be reasonable to go ahead with a write optimistically, and to check the
+constraint after the fact. You can still ensure that the validation occurs before doing things that
+would be expensive to recover from, but that doesn't imply you must do the validation before you
+even write the data.
+
+These applications *do* require integrity: you would not want to lose a reservation, or have money
+disappear due to mismatched credits and debits. But they *don't* require timeliness on the
+enforcement of the constraint: if you have sold more items than you have in the warehouse, you can
+patch up the problem after the fact by apologizing. Doing so is similar to the conflict resolution
+approaches we discussed in ["Dealing with Conflicting
+Writes"](/en/ch6#sec_replication_write_conflicts).
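+
+As a rough sketch of checking a constraint after the fact, the function below takes orders that
+have already been accepted and written, walks through them in log order, and separates those that
+can be fulfilled from those that need a compensating action. The function name, the order format,
+and the policy of skipping an order that cannot be fulfilled in full are assumptions for
+illustration.
+
+```python
+def reconcile(stock: int, orders: list[tuple[str, int]]) -> tuple[list[str], list[str]]:
+    """Check the stock constraint after orders were accepted.
+
+    Returns (fulfilled order IDs, order IDs that need an apology), in log order.
+    """
+    fulfilled: list[str] = []
+    apologies: list[str] = []
+    for order_id, quantity in orders:
+        if quantity <= stock:
+            stock -= quantity
+            fulfilled.append(order_id)
+        else:
+            apologies.append(order_id)  # compensate: reorder stock, refund, or discount
+    return fulfilled, apologies
+
+
+# Five items in the warehouse, but eight were sold before the check ran.
+print(reconcile(5, [("o1", 3), ("o2", 3), ("o3", 2)]))
+# (['o1', 'o3'], ['o2'])
+```
+
+No order is lost in this process, which is the integrity requirement; only the enforcement of the
+stock limit is delayed.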
+
+#### Coordination-avoiding data systems {#id454}
+
+We have now made two interesting observations:
+
+1. Dataflow systems can maintain integrity guarantees on derived data without atomic commit,
+ linearizability, or synchronous cross-shard coordination.
+
+2. Although strict uniqueness constraints require timeliness and coordination, many applications
+ are actually fine with loose constraints that may be temporarily violated and fixed up later, as
+ long as integrity is preserved throughout.
+
+Taken together, these observations mean that dataflow systems can provide the data management
+services for many applications without requiring coordination, while still giving strong integrity
+guarantees. Such *coordination-avoiding* data systems have a lot of appeal: they can achieve better
+performance and fault tolerance than systems that need to perform synchronous coordination
+[^45].
+
+For example, such a system could operate distributed across multiple datacenters in a multi-leader
+configuration, asynchronously replicating between regions. Any one datacenter can continue operating
+independently from the others, because no synchronous cross-region coordination is required. Such a
+system would have weak timeliness guarantees---it could not be linearizable without introducing
+coordination---but it can still have strong integrity guarantees.
+
+In this context, serializable transactions are still useful as part of maintaining derived state,
+but they can be run at a small scope where they work well [^6]. Heterogeneous distributed
+transactions such as XA transactions are not required. Synchronous coordination can still be
+introduced in places where it is needed (for example, to enforce strict constraints before an
+operation from which recovery is not possible), but there is no need for everything to pay the cost
+of coordination if only a small part of an application needs it [^32].
+
+Another way of looking at coordination and constraints: they reduce the number of apologies you have
+to make due to inconsistencies, but potentially also reduce the performance and availability of your
+system, and thus potentially increase the number of apologies you have to make due to outages. You
+cannot reduce the number of apologies to zero, but you can aim to find the best trade-off for your
+needs---the sweet spot where there are neither too many inconsistencies nor too many availability
+problems.
+
+### Trust, but Verify {#sec_future_verification}
+
+All of our discussion of correctness, integrity, and fault-tolerance has been under the assumption
+that certain things might go wrong, but other things won't. We call these assumptions our *system
+model* (see ["System Model and Reality"](/en/ch9#sec_distributed_system_model)): for example, we
+should assume that processes can crash, machines can suddenly lose power, and the network can
+arbitrarily delay or drop messages. But we might also assume that data written to disk is not lost
+after `fsync`, that data in memory is not corrupted, and that the multiplication instruction of our
+CPU always returns the correct result.
+
+These assumptions are quite reasonable, as they are true most of the time, and it would be difficult
+to get anything done if we had to constantly worry about our computers making mistakes.
+Traditionally, system models take a binary approach toward faults: we assume that some things can
+happen, and other things can never happen. In reality, it is more a question of probabilities: some
+things are more likely, other things less likely. The question is whether violations of our
+assumptions happen often enough that we may encounter them in practice.
+
+We have seen that data can become corrupted in memory (see ["Hardware and Software
+Faults"](/en/ch2#sec_introduction_hardware_faults)), on disk (see ["Replication and
+Durability"](/en/ch8#sidebar_transactions_durability)), and on the network (see ["Weak forms of
+lying"](/en/ch9#sec_distributed_weak_lying)). Maybe this is something we should be paying more
+attention to? If you are operating at large enough scale, even very unlikely things do happen.
+
+#### Maintaining integrity in the face of software bugs {#id455}
+
+Besides such hardware issues, there is always the risk of software bugs, which would not be caught
+by lower-level network, memory, or filesystem checksums. Even widely used database software has
+bugs: for example, past versions of MySQL have failed to correctly maintain uniqueness constraints
+[^50] and PostgreSQL's serializable isolation level has exhibited write skew anomalies in
+the past [^51], even though MySQL and PostgreSQL are robust and well-regarded databases
+that have been battle-tested by many people for many years. In less mature software, the situation
+is likely to be much worse.
+
+Despite considerable efforts in careful design, testing, and review, bugs still creep in. Although
+they are rare, and they eventually get found and fixed, there is still a period during which such
+bugs can corrupt data.
+
+When it comes to application code, we have to assume many more bugs, since most applications don't
+receive anywhere near the amount of review and testing that database code does. Many applications
+don't even correctly use the features that databases offer for preserving integrity, such as foreign
+key or uniqueness constraints [^25].
+
+Consistency in the sense of ACID is based on the idea that the database starts off in a consistent
+state, and a transaction transforms it from one consistent state to another consistent state. Thus,
+we expect the database to always be in a consistent state. However, this notion only makes sense if
+you assume that the transaction is free from bugs. If the application uses the database incorrectly
+in some way, for example using a weak isolation level unsafely, the integrity of the database cannot
+be guaranteed.
+
+#### Don't just blindly trust what they promise {#id364}
+
+Since neither hardware nor software always lives up to the ideal we would like, it
+seems that data corruption is inevitable sooner or later. Thus, we should at least have a way of
+finding out if data has been corrupted so that we can fix it and try to track down the source of the
+error. Checking the integrity of data is known as *auditing*.
+
+As discussed in ["Advantages of immutable events"](/en/ch12#sec_stream_immutability_pros), auditing
+is not just for financial applications. However, auditability is very important in finance precisely
+because everyone knows that mistakes happen, and we all recognize the need to be able to detect and
+fix problems.
+
+Mature systems similarly tend to consider the possibility of unlikely things going wrong, and manage
+that risk. For example, large-scale storage systems such as HDFS and Amazon S3 do not fully trust
+disks: they run background processes that continually read back files, compare them to other
+replicas, and move files from one disk to another, in order to mitigate the risk of silent
+corruption [^52], [^53].
+
+If you want to be sure that your data is still there, you have to actually read it and check. Most
+of the time it will still be there, but if it isn't, you really want to find out sooner rather than
+later. By the same argument, it is important to try restoring from your backups from time to
+time---otherwise you may only find out that your backup is broken when it is too late and you have
+already lost data. Don't just blindly trust that it is all working.
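
As a rough illustration of "read it and check", the sketch below compares each replica of an object against a checksum recorded at write time. It is an assumption-laden toy, not how HDFS or S3 actually implement scrubbing; the function names `checksum` and `scrub` and the in-memory `replicas` dictionary are hypothetical stand-ins for real disks.

```python
import hashlib


def checksum(data: bytes) -> str:
    """SHA-256 digest of the stored bytes, as a hex string."""
    return hashlib.sha256(data).hexdigest()


def scrub(replicas: dict[str, bytes], expected: str) -> list[str]:
    """Read back every replica and return the names of those whose
    content no longer matches the checksum recorded at write time."""
    return sorted(
        name for name, data in replicas.items() if checksum(data) != expected
    )


original = b"order 1234: 2 widgets"
expected = checksum(original)  # recorded when the object was written

# Hypothetical replicas; one of them has silently changed.
replicas = {
    "disk-a": original,
    "disk-b": original,
    "disk-c": b"order 1234: 2 widgetz",
}
corrupted = scrub(replicas, expected)
print(corrupted)
```

A real scrubber would run continuously in the background and repair a bad replica from a good one; the point here is only that corruption is found by actually reading the data.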
+
+Systems like HDFS and S3 still have to assume that disks work correctly most of the time---which is
+a reasonable assumption, but not the same as assuming that they *always* work correctly. However,
+not many systems currently have this kind of "trust, but verify" approach of continually auditing
+themselves. Many assume that correctness guarantees are absolute and make no provision for the
+possibility of rare data corruption. In the future we may see more *self-validating* or
+*self-auditing* systems that continually check their own integrity, rather than relying on blind
+trust [^54].
+
+#### Designing for auditability {#id365}
+
+If a transaction mutates several objects in a database, it is difficult to tell after the fact what
+that transaction means. Even if you capture the transaction logs, the insertions, updates, and
+deletions in various tables do not necessarily give a clear picture of *why* those mutations were
+performed. The invocation of the application logic that decided on those mutations is transient and
+cannot be reproduced.
+
+By contrast, event-based systems can provide better auditability. In the event sourcing approach,
+user input to the system is represented as a single immutable event, and any resulting state updates
+are derived from that event. The derivation can be made deterministic and repeatable, so that
+running the same log of events through the same version of the derivation code will result in the
+same state updates.
+
+Being explicit about dataflow makes the *provenance* of data much clearer, which makes integrity
+checking much more feasible. For the event log, we can use hashes to check that the event storage
+has not been corrupted. For any derived state, we can rerun the batch and stream processors that
+derived it from the event log in order to check whether we get the same result, or even run a
+redundant derivation in parallel.
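
Both checks can be sketched in a few lines. The example below is one possible approach, assuming a hash chain over the event log (each entry's hash covers the previous hash plus the event) and a deterministic derivation function. The names `append`, `verify_log`, and `derive_balances` and the account events are hypothetical.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def chain_hash(prev_hash: str, event: dict) -> str:
    """Hash of the previous entry's hash plus a canonical event encoding."""
    payload = json.dumps(event, sort_keys=True).encode("utf-8")
    return hashlib.sha256(prev_hash.encode("ascii") + payload).hexdigest()


def append(log: list, event: dict) -> None:
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "hash": chain_hash(prev, event)})


def verify_log(log: list) -> bool:
    """Recompute the chain; any altered or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["hash"] != chain_hash(prev, entry["event"]):
            return False
        prev = entry["hash"]
    return True


def derive_balances(log: list) -> dict:
    """Deterministic derivation: same log in, same balances out."""
    balances = {}
    for entry in log:
        event = entry["event"]
        account = event["account"]
        balances[account] = balances.get(account, 0) + event["amount"]
    return balances


log = []
append(log, {"account": "alice", "amount": 100})
append(log, {"account": "bob", "amount": 50})
append(log, {"account": "alice", "amount": -30})

stored_view = derive_balances(log)  # what a materialized view would hold

# Audit: the log is intact, and rederiving reproduces the stored view.
print(verify_log(log), derive_balances(log) == stored_view)
```

A chain like this only detects corruption if the most recent hash is kept somewhere the corrupting fault cannot also reach; where to store it is outside the scope of this sketch.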
+
+A deterministic and well-defined dataflow also makes it easier to debug and trace the execution of a
+system in order to determine why it did something [^4], [^55]. If something
+unexpected occurred, it is valuable to have the diagnostic capability to reproduce the exact
+circumstances that led to the unexpected event---a kind of time-travel debugging capability.
+
+#### The end-to-end argument again {#id456}
+
+If we cannot fully trust that every individual component of the system will be free from
+corruption---that every piece of hardware is fault-free and that every piece of software is
+bug-free---then we must at least periodically check the integrity of our data. If we don't check, we
+won't find out about corruption until it is too late and it has caused some downstream damage, at
+which point it will be much harder and more expensive to track down the problem.
+
+Checking the integrity of data systems is best done in an end-to-end fashion: the more systems we
+can include in an integrity check, the fewer opportunities there are for corruption to go unnoticed
+at some stage of the process. If we can check that an entire derived data pipeline is correct end to
+end, then any disks, networks, services, and algorithms along the path are implicitly included in
+the check.
+
+Having continuous end-to-end integrity checks gives you increased confidence about the correctness
+of your systems, which in turn allows you to move faster [^56]. Like automated testing,
+auditing increases the chances that bugs will be found quickly, and thus reduces the risk that a
+change to the system or a new storage technology will cause damage. If you are not afraid of making
+changes, you can much better evolve an application to meet changing requirements.
+
+#### Tools for auditable data systems {#id366}
+
+At present, not many data systems make auditability a top-level concern. Some applications implement
+their own audit mechanisms, for example by logging all changes to a separate audit table, but
+guaranteeing the integrity of the audit log and the database state is still difficult. A transaction
+log can be made tamper-proof by periodically signing it with a hardware security module, but that
+does not guarantee that the right transactions went into the log in the first place.
+
+Blockchains such as Bitcoin or Ethereum are shared append-only logs with cryptographic consistency
+checks; the transactions they store are events, and smart contracts are basically stream processors.
+The consensus protocols they use ensure that all nodes agree on the same sequence of events. The
+difference to the consensus protocols of [Chapter 10](/en/ch10#ch_consistency) is that blockchains
+are Byzantine fault tolerant, i.e. they still work if some of the participating nodes have corrupted
+data because the replicas continually check each other's integrity.
+
+For most applications, blockchains have too much overhead to be useful. However, some of their
+cryptographic tools can also be used in a lighter-weight context. For example, *Merkle trees*
+[^57] are trees of hashes that can be used to efficiently prove that a record appears in
+some dataset (and a few other things). *Certificate transparency* uses cryptographically verified
+append-only logs and Merkle trees to check the validity of TLS/SSL certificates [^58],
+[^59]; it avoids needing a consensus protocol by having a single leader per log.
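
To make the idea of proving that a record appears in a dataset concrete, here is a minimal Merkle tree sketch. It is illustrative only and does not follow the exact tree layout of any particular system; the function names, the one-byte leaf/node prefixes, and the rule of promoting an unpaired node unchanged are assumptions of this example. It also assumes a non-empty list of records.

```python
import hashlib


def leaf_hash(record: bytes) -> bytes:
    # Distinct prefixes keep leaf hashes and interior hashes apart.
    return hashlib.sha256(b"\x00" + record).digest()


def node_hash(left: bytes, right: bytes) -> bytes:
    return hashlib.sha256(b"\x01" + left + right).digest()


def build_levels(records):
    """Return all tree levels, from the leaf hashes up to the single root."""
    level = [leaf_hash(r) for r in records]
    levels = [level]
    while len(level) > 1:
        parents = []
        for i in range(0, len(level), 2):
            if i + 1 < len(level):
                parents.append(node_hash(level[i], level[i + 1]))
            else:
                parents.append(level[i])  # unpaired node moves up unchanged
        level = parents
        levels.append(level)
    return levels


def inclusion_proof(levels, index):
    """Sibling hashes on the path from leaf `index` up to the root."""
    proof = []
    for level in levels[:-1]:
        sibling = index ^ 1
        if sibling < len(level):
            proof.append((level[sibling], sibling < index))
        index //= 2
    return proof


def verify_inclusion(record, proof, root):
    """Recompute the root from one record plus its sibling hashes."""
    current = leaf_hash(record)
    for sibling, sibling_is_left in proof:
        if sibling_is_left:
            current = node_hash(sibling, current)
        else:
            current = node_hash(current, sibling)
    return current == root


records = [b"a", b"b", b"c", b"d", b"e"]
levels = build_levels(records)
root = levels[-1][0]
proof = inclusion_proof(levels, 2)
print(verify_inclusion(b"c", proof, root))
```

The verifier needs only the root, the record, and a number of sibling hashes that grows logarithmically with the dataset size, which is what makes the proof efficient.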
+
+Integrity-checking and auditing algorithms, like those of certificate transparency and distributed
+ledgers, might become more widely used in data systems in general in the future. Some work will be
+needed to make them as scalable as systems without cryptographic auditing, and to keep the
+performance penalty as low as possible, but they are nevertheless interesting.
+
+## Summary {#id367}
+
+In this chapter we discussed new approaches to designing data systems based on ideas from stream
+processing. We started with the observation that there is no one single tool that can efficiently
+serve all possible use cases, and so applications necessarily need to compose several different
+pieces of software to accomplish their goals. We discussed how to solve this *data integration*
+problem by using batch processing and event streams to let data changes flow between different
+systems.
+
+In this approach, certain systems are designated as systems of record, and other data is derived
+from them through transformations. In this way we can maintain indexes, materialized views, machine
+learning models, statistical summaries, and more. By making these derivations and transformations
+asynchronous and loosely coupled, a problem in one area is prevented from spreading to unrelated
+parts of the system, increasing the robustness and fault-tolerance of the system as a whole.
+
+Expressing dataflows as transformations from one dataset to another also helps evolve applications:
+if you want to change one of the processing steps, for example to change the structure of an index
+or cache, you can just rerun the new transformation code on the whole input dataset in order to
+rederive the output. Similarly, if something goes wrong, you can fix the code and reprocess the data
+in order to recover.
+
+These processes are quite similar to what databases already do internally, so we recast the idea of
+dataflow applications as *unbundling* the components of a database, and building an application by
+composing these loosely coupled components.
+
+Derived state can be updated by observing changes in the underlying data. Moreover, the derived
+state itself can further be observed by downstream consumers. We can even take this dataflow all the
+way through to the end-user device that is displaying the data, and thus build user interfaces that
+dynamically update to reflect data changes and continue to work offline.
+
+Next, we discussed how to ensure that all of this processing remains correct in the presence of
+faults. We saw that strong integrity guarantees can be implemented scalably with asynchronous event
+processing, by using end-to-end request identifiers to make operations idempotent and by checking
+constraints asynchronously. Clients can either wait until the check has passed, or go ahead without
+waiting but risk having to apologize about a constraint violation. This approach is much more
+scalable and robust than the traditional approach of using distributed transactions, and fits with
+how many business processes work in practice.
+
+By structuring applications around dataflow and checking constraints asynchronously, we can avoid
+most coordination and create systems that maintain integrity but still perform well, even in
+geographically distributed scenarios and in the presence of faults. We then talked a little about
+using audits to verify the integrity of data and detect corruption, and observed that the techniques
+used by blockchains also have a similarity to event-based systems.
+
+##### Footnotes
+
+### References {#references}
+
+[^1]: Rachid Belaid. [Postgres Full-Text Search is Good Enough!](https://rachbelaid.com/postgres-full-text-search-is-good-enough/) *rachbelaid.com*, July 2015. Archived at [perma.cc/ZVP9-YDCB](https://perma.cc/ZVP9-YDCB)
+[^2]: Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, Wyatt Lloyd, and Kaushik Veeraraghavan. [Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf). At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
+[^3]: Pat Helland and Dave Campbell. [Building on Quicksand](https://arxiv.org/pdf/0909.1788). At *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009.
+[^4]: Jessica Kerr. [Provenance and Causality in Distributed Systems](https://jessitron.com/2016/09/25/provenance-and-causality-in-distributed-systems/). *jessitron.com*, September 2016. Archived at [perma.cc/DTD2-F8ZM](https://perma.cc/DTD2-F8ZM)
+[^5]: Jay Kreps. [The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](https://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying). *engineering.linkedin.com*, December 2013. Archived at [perma.cc/2JHR-FR64](https://perma.cc/2JHR-FR64)
+[^6]: Pat Helland. [Life Beyond Distributed Transactions: An Apostate's Opinion](https://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf). At *3rd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2007.
+[^7]: Lionel A. Smith. [The Broad Gauge Story](https://lionels.orpheusweb.co.uk/RailSteam/GWRBroadG/BGHist.html). *Journal of the Monmouthshire Railway Society*, Summer 1985. Archived at [perma.cc/DDK9-JA6X](https://perma.cc/DDK9-JA6X)
+[^8]: Jacqueline Xu. [Online Migrations at Scale](https://stripe.com/blog/online-migrations). *stripe.com*, February 2017. Archived at [perma.cc/ZQY2-EAU2](https://perma.cc/ZQY2-EAU2)
+[^9]: Flavio Santos and Robert Stephenson. [Changing the Wheels on a Moving Bus --- Spotify's Event Delivery Migration](https://engineering.atspotify.com/2021/10/changing-the-wheels-on-a-moving-bus-spotify-event-delivery-migration). *engineering.atspotify.com*, October 2021. Archived at [perma.cc/5C4V-G8EV](https://perma.cc/5C4V-G8EV)
+[^10]: Molly Bartlett Dishman and Martin Fowler. [Agile Architecture](https://www.youtube.com/watch?v=VjKYO6DP3fo&list=PL055Epbe6d5aFJdvWNtTeg_UEHZEHdInE). At *O'Reilly Software Architecture Conference*, March 2015.
+[^11]: Nathan Marz and James Warren. [*Big Data: Principles and Best Practices of Scalable Real-Time Data Systems*](https://www.manning.com/books/big-data). Manning, 2015. ISBN: 978-1-617-29034-3
+[^12]: Jay Kreps. [Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture). *oreilly.com*, July 2014. Archived at [perma.cc/PGH6-XUCH](https://perma.cc/PGH6-XUCH)
+[^13]: Raul Castro Fernandez, Peter Pietzuch, Jay Kreps, Neha Narkhede, Jun Rao, Joel Koshy, Dong Lin, Chris Riccomini, and Guozhang Wang. [Liquid: Unifying Nearline and Offline Big Data Integration](https://www.cidrdb.org/cidr2015/Papers/CIDR15_Paper25u.pdf). At *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015.
+[^14]: Dennis M. Ritchie and Ken Thompson. [The UNIX Time-Sharing System](https://web.eecs.utk.edu/~qcao1/cs560/papers/paper-unix.pdf). *Communications of the ACM*, volume 17, issue 7, pages 365--375, July 1974. [doi:10.1145/361011.361061](https://doi.org/10.1145/361011.361061)
+[^15]: Wes McKinney. [The Road to Composable Data Systems: Thoughts on the Last 15 Years and the Future](https://wesmckinney.com/blog/looking-back-15-years/). *wesmckinney.com*, September 2023. Archived at [perma.cc/J9SJ-886N](https://perma.cc/J9SJ-886N)
+[^16]: Eric A. Brewer and Joseph M. Hellerstein. [CS262a: Advanced Topics in Computer Systems](https://people.eecs.berkeley.edu/~brewer/cs262/systemr.html). Lecture notes, University of California, Berkeley, *cs.berkeley.edu*, August 2011. Archived at [perma.cc/TE79-LGWU](https://perma.cc/TE79-LGWU)
+[^17]: Michael Stonebraker. [The Case for Polystores](https://wp.sigmod.org/?p=1629). *wp.sigmod.org*, July 2015. Archived at [perma.cc/G7J2-KR45](https://perma.cc/G7J2-KR45)
+[^18]: Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, Magda Balazinska, Bill Howe, Jeremy Kepner, Sam Madden, David Maier, Tim Mattson, and Stan Zdonik. [The BigDAWG Polystore System](https://sigmod.org/publications/sigmodRecord/1506/pdfs/04_vision_Duggan.pdf). *ACM SIGMOD Record*, volume 44, issue 2, pages 11--16, June 2015. [doi:10.1145/2814710.2814713](https://doi.org/10.1145/2814710.2814713)
+[^19]: David B. Lomet, Alan Fekete, Gerhard Weikum, and Mike Zwilling. [Unbundling Transaction Services in the Cloud](https://arxiv.org/pdf/0909.1768). At *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009.
+[^20]: Martin Kleppmann and Jay Kreps. [Kafka, Samza and the Unix Philosophy of Distributed Data](https://martin.kleppmann.com/papers/kafka-debull15.pdf). *IEEE Data Engineering Bulletin*, volume 38, issue 4, pages 4--14, December 2015.
+[^21]: John Hugg. [Winning Now and in the Future: Where Volt Active Data Shines](https://www.voltactivedata.com/blog/2016/03/winning-now-future-voltdb-shines/). *voltactivedata.com*, March 2016. Archived at [perma.cc/44MP-3MWM](https://perma.cc/44MP-3MWM)
+[^22]: Felienne Hermans. [Spreadsheets Are Code](https://vimeo.com/145492419). At *Code Mesh*, November 2015.
+[^23]: Dan Bricklin and Bob Frankston. [VisiCalc: Information from Its Creators](http://danbricklin.com/visicalc.htm). *danbricklin.com*. Archived at [archive.org](https://web.archive.org/web/20250905040530/http://danbricklin.com/visicalc.htm)
+[^24]: D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, and Michael Young. [Machine Learning: The High-Interest Credit Card of Technical Debt](https://research.google.com/pubs/archive/43146.pdf). At *NIPS Workshop on Software Engineering for Machine Learning* (SE4ML), December 2014. Archived at
+[^25]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2737784](https://doi.org/10.1145/2723372.2737784)
+[^26]: Guy Steele. [Re: Need for Macros (Was Re: Icon)](https://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01134.html). email to *ll1-discuss* mailing list, *people.csail.mit.edu*, December 2001. Archived at [perma.cc/K9X8-CJ65](https://perma.cc/K9X8-CJ65)
+[^27]: Ben Stopford. [Microservices in a Streaming World](https://www.infoq.com/presentations/microservices-streaming). At *QCon London*, March 2016.
+[^28]: Adam Bellemare. [*Building Event-Driven Microservices, 2nd Edition*](https://learning.oreilly.com/library/view/building-event-driven-microservices/9798341622180/). O'Reilly Media, 2025.
+[^29]: Christian Posta. [Why Microservices Should Be Event Driven: Autonomy vs Authority](https://blog.christianposta.com/microservices/why-microservices-should-be-event-driven-autonomy-vs-authority/). *blog.christianposta.com*, May 2016. Archived at [perma.cc/E6N9-3X92](https://perma.cc/E6N9-3X92)
+[^30]: Alex Feyerke. [Designing Offline-First Web Apps](https://alistapart.com/article/offline-first/). *alistapart.com*, December 2013. Archived at [perma.cc/WH7R-S2DS](https://perma.cc/WH7R-S2DS)
+[^31]: Martin Kleppmann. [Turning the Database Inside-out with Apache Samza](https://martin.kleppmann.com/2015/03/04/turning-the-database-inside-out.html). At *Strange Loop*, September 2014. Archived at [perma.cc/U6E8-A9MT](https://perma.cc/U6E8-A9MT)
+[^32]: Sebastian Burckhardt, Daan Leijen, Jonathan Protzenko, and Manuel Fähndrich. [Global Sequence Protocol: A Robust Abstraction for Replicated Shared State](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.ECOOP.2015.568). At *29th European Conference on Object-Oriented Programming* (ECOOP), July 2015. [doi:10.4230/LIPIcs.ECOOP.2015.568](https://doi.org/10.4230/LIPIcs.ECOOP.2015.568)
+[^33]: Evan Czaplicki and Stephen Chong. [Asynchronous Functional Reactive Programming for GUIs](https://people.seas.harvard.edu/~chong/pubs/pldi13-elm.pdf). At *34th ACM SIGPLAN Conference on Programming Language Design and Implementation* (PLDI), June 2013. [doi:10.1145/2491956.2462161](https://doi.org/10.1145/2491956.2462161)
+[^34]: Eno Thereska, Damian Guy, Michael Noll, and Neha Narkhede. [Unifying Stream Processing and Interactive Queries in Apache Kafka](https://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/). *confluent.io*, October 2016. Archived at [perma.cc/W8JG-EAZF](https://perma.cc/W8JG-EAZF)
+[^35]: Frank McSherry. [Dataflow as Database](https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-17.md). *github.com*, July 2016. Archived at [perma.cc/384D-DUFH](https://perma.cc/384D-DUFH)
+[^36]: Peter Alvaro. [I See What You Mean](https://www.youtube.com/watch?v=R2Aa4PivG0g). At *Strange Loop*, September 2015.
+[^37]: Nathan Marz. [Trident: A High-Level Abstraction for Realtime Computation](https://blog.x.com/engineering/en_us/a/2012/trident-a-high-level-abstraction-for-realtime-computation). *blog.x.com*, August 2012. Archived at [archive.org](https://web.archive.org/web/20250515030808/https://blog.x.com/engineering/en_us/a/2012/trident-a-high-level-abstraction-for-realtime-computation)
+[^38]: Edi Bice. [Low Latency Web Scale Fraud Prevention with Apache Samza, Kafka and Friends](https://www.slideshare.net/slideshow/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends/57068078). At *Merchant Risk Council MRC Vegas Conference*, March 2016. Archived at [perma.cc/T3H5-QN3R](https://perma.cc/T3H5-QN3R)
+[^39]: Charity Majors. [The Accidental DBA](https://charity.wtf/2016/10/02/the-accidental-dba/). *charity.wtf*, October 2016. Archived at [perma.cc/6ANP-ARB6](https://perma.cc/6ANP-ARB6)
+[^40]: Arthur J. Bernstein, Philip M. Lewis, and Shiyong Lu. [Semantic Conditions for Correctness at Different Isolation Levels](https://dsf.berkeley.edu/cs286/papers/isolation-icde2000.pdf). At *16th International Conference on Data Engineering* (ICDE), February 2000. [doi:10.1109/ICDE.2000.839387](https://doi.org/10.1109/ICDE.2000.839387)
+[^41]: Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan. [Automating the Detection of Snapshot Isolation Anomalies](https://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf). At *33rd International Conference on Very Large Data Bases* (VLDB), September 2007.
+[^42]: Kyle Kingsbury. [Jepsen: Distributed Systems Safety Research](https://jepsen.io/). *jepsen.io*.
+[^43]: Michael Jouravlev. [Redirect After Post](https://www.theserverside.com/news/1365146/Redirect-After-Post). *theserverside.com*, August 2004. Archived at [archive.org](https://web.archive.org/web/20250904205736/https://www.theserverside.com/news/1365146/Redirect-After-Post)
+[^44]: Jerome H. Saltzer, David P. Reed, and David D. Clark. [End-to-End Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf). *ACM Transactions on Computer Systems*, volume 2, issue 4, pages 277--288, November 1984. [doi:10.1145/357401.357402](https://doi.org/10.1145/357401.357402)
+[^45]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). *Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185--196, November 2014. [doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509)
+[^46]: Alex Yarmula. [Strong Consistency in Manhattan](https://blog.x.com/engineering/en_us/a/2016/strong-consistency-in-manhattan). *blog.x.com*, March 2016. Archived at [archive.org](https://web.archive.org/web/20250713175819/https://blog.x.com/engineering/en_us/a/2016/strong-consistency-in-manhattan)
+[^47]: Martin Kleppmann, Alastair R. Beresford, and Boerge Svingen. [Online Event Processing: Achieving consistency where distributed transactions have failed](https://martin.kleppmann.com/papers/olep-cacm.pdf). *Communications of the ACM*, volume 62, issue 5, pages 43--49, May 2019. [doi:10.1145/3312527](https://doi.org/10.1145/3312527)
+[^48]: Jim Gray. [The Transaction Concept: Virtues and Limitations](https://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf). At *7th International Conference on Very Large Data Bases* (VLDB), September 1981. Archived at [perma.cc/8VPT-N5H6](https://perma.cc/8VPT-N5H6)
+[^49]: Hector Garcia-Molina and Kenneth Salem. [Sagas](https://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 1987. [doi:10.1145/38713.38742](https://doi.org/10.1145/38713.38742)
+[^50]: Annamalai Gurusami and Daniel Price. [Bug #73170: Duplicates in Unique Secondary Index Because of Fix of Bug#68021](https://bugs.mysql.com/bug.php?id=73170). *bugs.mysql.com*, July 2014. Archived at [perma.cc/P6BV-W7JJ](https://perma.cc/P6BV-W7JJ)
+[^51]: Gary Fredericks. [Postgres Serializability Bug](https://github.com/gfredericks/pg-serializability-bug). *github.com*, September 2015. Archived at [perma.cc/N8UP-2822](https://perma.cc/N8UP-2822)
+[^52]: Xiao Chen. [HDFS DataNode Scanners and Disk Checker Explained](https://www.cloudera.com/blog/technical/hdfs-datanode-scanners-and-disk-checker-explained.html). *blog.cloudera.com*, December 2016. Archived at [perma.cc/6S36-X98L](https://perma.cc/6S36-X98L)
+[^53]: Daniel Persson. [How does Ceph scrubbing work?](https://www.youtube.com/watch?v=M9QGMoc3GU8) *youtube.com*, March 2022.
+[^54]: Jay Kreps. [Getting Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW)
+[^55]: Martin Fowler. [The LMAX Architecture](https://martinfowler.com/articles/lmax.html). *martinfowler.com*, July 2011. Archived at [perma.cc/5AV4-N6RJ](https://perma.cc/5AV4-N6RJ)
+[^56]: Sam Stokes. [Move Fast with Confidence](https://five-eights.com/2016/07/11/move-fast-with-confidence/). *five-eights.com*, July 2016. Archived at [perma.cc/J8C6-DHXB](https://perma.cc/J8C6-DHXB)
+[^57]: Ralph C. Merkle. [A Digital Signature Based on a Conventional Encryption Function](https://people.eecs.berkeley.edu/~raluca/cs261-f15/readings/merkle.pdf). At *CRYPTO '87*, August 1987. [doi:10.1007/3-540-48184-2_32](https://doi.org/10.1007/3-540-48184-2_32)
+[^58]: Ben Laurie. [Certificate Transparency](https://queue.acm.org/detail.cfm?id=2668154). *ACM Queue*, volume 12, issue 8, pages 10--19, August 2014. [doi:10.1145/2668152.2668154](https://doi.org/10.1145/2668152.2668154)
+[^59]: Mark D. Ryan. [Enhanced Certificate Transparency and End-to-End Encrypted Mail](https://www.ndss-symposium.org/wp-content/uploads/2017/09/12_2_1.pdf). At *Network and Distributed System Security Symposium* (NDSS), February 2014. [doi:10.14722/ndss.2014.23379](https://doi.org/10.14722/ndss.2014.23379)
diff --git a/content/en/ch14.md b/content/en/ch14.md
new file mode 100644
index 0000000..5c87a24
--- /dev/null
+++ b/content/en/ch14.md
@@ -0,0 +1,625 @@
+---
+title: "14. Doing the Right Thing"
+weight: 314
+breadcrumbs: false
+---
+
+
+
+
+
+> *Feeding AI systems on the world's beauty, ugliness, and cruelty, but expecting it to reflect only
+> the beauty is a fantasy.*
+>
+> Vinay Uday Prabhu and Abeba Birhane, *Large Datasets: A Pyrrhic Win for Computer Vision?* (2020)
+
+> [!TIP] A NOTE FOR EARLY RELEASE READERS
+> With Early Release ebooks, you get books in their earliest form---the author's raw and unedited
+> content as they write---so you can take advantage of these technologies long before the official
+> release of these titles.
+>
+> This will be the 14th chapter of the final book. The GitHub repo for this book is
+> [https://github.com/ept/ddia2-feedback](https://github.com/ept/ddia2-feedback).
+>
+> If you'd like to be actively involved in reviewing and commenting on this draft, please reach out on GitHub.
+
+In the final chapter of this book, let's take a step back. Throughout this book we have examined a
+wide range of different architectures for data systems, evaluated their pros and cons, and explored
+techniques for building reliable, scalable, and maintainable applications. However, we have left out
+an important and fundamental part of the discussion, which we should now fill in.
+
+Every system is built for a purpose; every action we take has both intended and unintended
+consequences. The purpose may be as simple as making money, but the consequences for the world may
+reach far beyond that original purpose. We, the engineers building these systems, have a
+responsibility to carefully consider those consequences and to consciously decide what kind of world
+we want to live in.
+
+We talk about data as an abstract thing, but remember that many datasets are about people: their
+behavior, their interests, their identity. We must treat such data with humanity and respect. Users
+are humans too, and human dignity is paramount [^1].
+
+Software development increasingly involves making important ethical choices. There are guidelines to
+help software engineers navigate these issues, such as the ACM Code of Ethics and Professional
+Conduct [^2], but they are rarely discussed, applied, or enforced in practice. As a
+result, engineers and product managers sometimes take a very cavalier attitude to privacy and
+potential negative consequences of their products [^3], [^4].
+
+A technology is not good or bad in itself---what matters is how it is used and how it affects
+people. This is true for a software system like a search engine in much the same way as it is for a
+weapon like a gun. It is not sufficient for software engineers to focus exclusively on the technology
+and ignore its consequences: the ethical responsibility is ours to bear also. Reasoning about ethics
+is difficult, but it is too important to ignore.
+
+However, what makes something "good" or "bad" is not well-defined, and most people in computing
+don't even discuss that question [^5]. In contrast to much of computing, the concepts at
+the heart of ethics are not fixed or determinate in their precise meaning, and they require
+interpretation, which may be subjective [^6]. Ethics is not going through some checklist
+to confirm you comply; it's a participatory and iterative process of reflection, in dialog with the
+people involved, with accountability for the results [^7].
+
+## Predictive Analytics {#id369}
+
+For example, predictive analytics is a major part of why people are excited about big data and AI.
+Using data analysis to predict the weather, or the spread of diseases, is one thing [^8];
+it is another matter to predict whether a convict is likely to reoffend, whether an applicant for a
+loan is likely to default, or whether an insurance customer is likely to make expensive claims
+[^9]. The latter have a direct effect on individual people's lives.
+
+Naturally, payment networks want to prevent fraudulent transactions, banks want to avoid bad loans,
+airlines want to avoid hijackings, and companies want to avoid hiring ineffective or untrustworthy
+people. From their point of view, the cost of a missed business opportunity is low, but the cost of
+a bad loan or a problematic employee is much higher, so it is natural for organizations to want to
+be cautious. If in doubt, they are better off saying no.
+
+However, as algorithmic decision-making becomes more widespread, someone who has (accurately or
+falsely) been labeled as risky by some algorithm may suffer a large number of those "no" decisions.
+Systematically being excluded from jobs, air travel, insurance coverage, property rental, financial
+services, and other key aspects of society is such a large constraint of the individual's freedom
+that it has been called "algorithmic prison" [^10]. In countries that respect human
+rights, the criminal justice system presumes innocence until proven guilty; on the other hand,
+automated systems can systematically and arbitrarily exclude a person from participating in society
+without any proof of guilt, and with little chance of appeal.
+
+### Bias and Discrimination {#id370}
+
+Decisions made by an algorithm are not necessarily any better or any worse than those made by a
+human. Every person is likely to have biases, even if they actively try to counteract them, and
+discriminatory practices can become culturally institutionalized. There is hope that basing
+decisions on data, rather than subjective and instinctive assessments by people, could be more fair
+and give a better chance to people who are often overlooked in the traditional system
+[^11].
+
+When we develop predictive analytics and AI systems, we are not merely automating a human's decision
+by using software to specify the rules for when to say yes or no; we are even leaving the rules
+themselves to be inferred from data. However, the patterns learned by these systems are opaque: even
+if there is some correlation in the data, we may not know why. If there is a systematic bias in the
+input to an algorithm, the system will most likely learn and amplify that bias in its output
+[^12].
+
+In many countries, anti-discrimination laws prohibit treating people differently depending on
+protected traits such as ethnicity, age, gender, sexuality, disability, or beliefs. Other features
+of a person's data may be analyzed, but what happens if they are correlated with protected traits?
+For example, in racially segregated neighborhoods, a person's postal code or even their IP address
+is a strong predictor of race. Put like this, it seems ridiculous to believe that an algorithm could
+somehow take biased data as input and produce fair and impartial output from it [^13],
+[^14]. Yet this belief often seems to be implied by proponents of data-driven decision
+making, an attitude that has been satirized as "machine learning is like money laundering for bias"
+[^15].
+
+Predictive analytics systems merely extrapolate from the past; if the past is discriminatory, they
+codify and amplify that discrimination [^16]. If we want the future to be better than the
+past, moral imagination is required, and that's something only humans can provide [^17].
+Data and models should be our tools, not our masters.
+
+### Responsibility and Accountability {#id371}
+
+Automated decision making opens the question of responsibility and accountability [^17].
+If a human makes a mistake, they can be held accountable, and the person affected by the decision
+can appeal. Algorithms make mistakes too, but who is accountable if they go wrong [^18]?
+When a self-driving car causes an accident, who is responsible? If an automated credit scoring
+algorithm systematically discriminates against people of a particular race or religion, is there any
+recourse? If a decision by your machine learning system comes under judicial review, can you explain
+to the judge how the algorithm made its decision? People should not be able to evade their
+responsibility by blaming an algorithm.
+
+Credit rating agencies are an old example of collecting data to make decisions about people. A bad
+credit score makes life difficult, but at least a credit score is normally based on relevant facts
+about a person's actual borrowing history, and any errors in the record can be corrected (although
+the agencies normally do not make this easy). However, scoring algorithms based on machine learning
+typically use a much wider range of inputs and are much more opaque, making it harder to understand
+how a particular decision has come about and whether someone is being treated in an unfair or
+discriminatory way [^19].
+
+A credit score summarizes "How did you behave in the past?" whereas predictive analytics usually
+work on the basis of "Who is similar to you, and how did people like you behave in the past?"
+Drawing parallels to others' behavior implies stereotyping people, for example based on where they
+live (a close proxy for race and socioeconomic class). What about people who get put in the wrong
+bucket? Furthermore, if a decision is incorrect due to erroneous data, recourse is almost impossible
+[^17].
+
+Much data is statistical in nature, which means that even if the probability distribution on the
+whole is correct, individual cases may well be wrong. For example, if the average life expectancy in
+your country is 80 years, that doesn't mean you're expected to drop dead on your 80th birthday. From
+the average and the probability distribution, you can't say much about the age to which one
+particular person will live. Similarly, the output of a prediction system is probabilistic and may
+well be wrong in individual cases.
+
+A blind belief in the supremacy of data for making decisions is not only delusional, it is
+positively dangerous. As data-driven decision making becomes more widespread, we will need to figure
+out how to make algorithms accountable and transparent, how to avoid reinforcing existing biases,
+and how to fix them when they inevitably make mistakes.
+
+We will also need to figure out how to prevent data being used to harm people, and realize its
+positive potential instead. For example, analytics can reveal financial and social characteristics
+of people's lives. On the one hand, this power could be used to focus aid and support to help those
+people who most need it. On the other hand, it is sometimes used by predatory businesses seeking to
+identify vulnerable people and sell them risky products such as high-cost loans and worthless
+college degrees [^17], [^20].
+
+### Feedback Loops {#id372}
+
+Even with predictive applications that have less immediately far-reaching effects on people, such as
+recommendation systems, there are difficult issues that we must confront. When services become good
+at predicting what content users want to see, they may end up showing people only opinions they
+already agree with, leading to echo chambers in which stereotypes, misinformation, and polarization
+can breed. We are already seeing the impact of social media echo chambers on election campaigns.
+
+When predictive analytics affect people's lives, particularly pernicious problems arise due to
+self-reinforcing feedback loops. For example, consider the case of employers using credit scores to
+evaluate potential hires. You may be a good worker with a good credit score, but suddenly find
+yourself in financial difficulties due to a misfortune outside of your control. As you miss payments
+on your bills, your credit score suffers, and you will be less likely to find work. Joblessness
+pushes you toward poverty, which further worsens your scores, making it even harder to find
+employment [^17]. It's a downward spiral due to poisonous assumptions, hidden behind a
+camouflage of mathematical rigor and data.
+
+As another example of a feedback loop, economists found that when gas stations in Germany introduced
+algorithmic pricing, competition was reduced and prices for consumers went up because the algorithms
+learned to collude [^21].
+
+We can't always predict when such feedback loops happen. However, many consequences can be predicted
+by thinking about the entire system (not just the computerized parts, but also the people
+interacting with it)---an approach known as *systems thinking* [^22]. We can try to
+understand how a data analysis system responds to different behaviors, structures, or
+characteristics. Does the system reinforce and amplify existing differences between people (e.g.,
+making the rich richer or the poor poorer), or does it try to combat injustice? And even with the
+best intentions, we must beware of unintended consequences.
+
+## Privacy and Tracking {#id373}
+
+Besides the problems of predictive analytics---i.e., using data to make automated decisions about
+people---there are ethical problems with data collection itself. What is the relationship between
+the organizations collecting data and the people whose data is being collected?
+
+When a system only stores data that a user has explicitly entered, because they want the system to
+store and process it in a certain way, the system is performing a service for the user: the user is
+the customer. But when a user's activity is tracked and logged as a side effect of other things they
+are doing, the relationship is less clear. The service no longer just does what the user tells it to
+do, but it takes on interests of its own, which may conflict with the user's interests.
+
+Tracking behavioral data has become increasingly important for user-facing features of many online
+services: tracking which search results are clicked helps improve the ranking of search results;
+recommending "people who liked X also liked Y" helps users discover interesting and useful things;
+A/B tests and user flow analysis can help indicate how a user interface might be improved. Those
+features require some amount of tracking of user behavior, and users benefit from them.
+
+However, depending on a company's business model, tracking often doesn't stop there. If the service
+is funded through advertising, the advertisers are the actual customers, and the users' interests
+take second place. Tracking data becomes more detailed, analyses become further-reaching, and data
+is retained for a long time in order to build up detailed profiles of each person for marketing
+purposes.
+
+Now the relationship between the company and the user whose data is being collected starts looking
+quite different. The user is given a free service and is coaxed into engaging with it as much as
+possible. The tracking of the user serves not primarily that individual, but rather the needs of the
+advertisers who are funding the service. This relationship can be appropriately described with a
+word that has more sinister connotations: *surveillance*.
+
+### Surveillance {#id374}
+
+As a thought experiment, try replacing the word *data* with *surveillance*, and observe if common
+phrases still sound so good [^23]. How about this: "In our surveillance-driven
+organization we collect real-time surveillance streams and store them in our surveillance warehouse.
+Our surveillance scientists use advanced analytics and surveillance processing in order to derive
+new insights."
+
+This thought experiment is unusually polemic for this book, *Designing Surveillance-Intensive
+Applications*, but strong words are needed to emphasize this point. In our attempts to make software
+"eat the world" [^24], we have built the greatest mass surveillance infrastructure the
+world has ever seen. We are rapidly approaching a world in which every inhabited space contains at
+least one internet-connected microphone, in the form of smartphones, smart TVs, voice-controlled
+assistant devices, baby monitors, and even children's toys that use cloud-based speech recognition.
+Many of these devices have a terrible security record [^25].
+
+What is new compared to the past is that digitization has made it easy to collect large amounts of
+data about people. Surveillance of our location and movements, our social relationships and
+communications, our purchases and payments, and data about our health have become almost
+unavoidable. A surveillance organization may end up knowing more about a person than that person
+knows about themselves---for example, identifying illnesses or economic problems before the person
+themselves is aware of them.
+
+Even the most totalitarian and repressive regimes of the past could only dream of putting a
+microphone in every room and forcing every person to constantly carry a device capable of tracking
+their location and movements. Yet the benefits that we get from digital technology are so great that
+we now voluntarily accept this world of total surveillance. The difference is just that the data is
+being collected by corporations to provide us with services, rather than government agencies seeking
+control [^26].
+
+Not all data collection necessarily qualifies as surveillance, but examining it as such can help us
+understand our relationship with the data collector. Why are we seemingly happy to accept
+surveillance by corporations? Perhaps you feel you have nothing to hide---in other words, you are
+totally in line with existing power structures, you are not a marginalized minority, and you needn't
+fear persecution [^27]. Not everyone is so fortunate. Or perhaps it's because the purpose
+seems benign---it's not overt coercion and conformance, but merely better recommendations and more
+personalized marketing. However, combined with the discussion of predictive analytics from the last
+section, that distinction seems less clear.
+
+We are already seeing behavioral data on car driving, tracked by cars without drivers' consent,
+affecting their insurance premiums [^28], and health insurance coverage that depends on
+people wearing a fitness tracking device. When surveillance is used to determine things that hold
+sway over important aspects of life, such as insurance coverage or employment, it starts to appear
+less benign. Moreover, data analysis can reveal surprisingly intrusive things: for example, the
+movement sensor in a smartwatch or fitness tracker can be used to work out what you are typing (for
+example, passwords) with fairly good accuracy [^29]. Sensor accuracy and algorithms for
+analysis are only going to get better.
+
+### Consent and Freedom of Choice {#id375}
+
+We might assert that users voluntarily choose to use a service that tracks their activity, and they
+have agreed to the terms of service and privacy policy, so they consent to data collection. We might
+even claim that users are receiving a valuable service in return for the data they provide, and that
+the tracking is necessary in order to provide the service. Undoubtedly, social networks, search
+engines, and various other free online services are valuable to users---but there are problems with
+this argument.
+
+First, we should ask in what way the tracking is necessary. Some forms of tracking directly feed
+into improving features for users: for example, tracking the click-through rate on search results
+can help improve a search engine's result ranking and relevance, and tracking which products
+customers tend to buy together can help an online shop suggest related products. However, when
+tracking user interaction for content recommendations, or to build user profiles for advertising
+purposes, it is less clear whether this is genuinely in the user's interest---or is it only
+necessary because the ads pay for the service?
+
+Second, users have little knowledge of what data they are feeding into our databases, or how it is
+retained and processed---and most privacy policies do more to obscure than to illuminate. Without
+understanding what happens to their data, users cannot give any meaningful consent. Often, data from
+one user also says things about other people who are not users of the service and who have not
+agreed to any terms. The derived datasets that we discussed in this part of the book---in which data
+from the entire user base may have been combined with behavioral tracking and external data
+sources---are precisely the kinds of data of which users cannot have any meaningful understanding.
+
+Moreover, data is extracted from users through a one-way process, not a relationship with true
+reciprocity, and not a fair value exchange. There is no dialog, no option for users to negotiate how
+much data they provide and what service they receive in return: the relationship between the service
+and the user is very asymmetric and one-sided. The terms are set by the service, not by the user
+[^30], [^31].
+
+In the European Union, the *General Data Protection Regulation* (GDPR) requires that consent must be
+"freely given, specific, informed, and unambiguous", and that the user must be able to "refuse or
+withdraw consent without detriment"---otherwise it is not considered "freely given". Any request for
+consent must be written "in an intelligible and easily accessible form, using clear and plain
+language". Moreover, "silence, pre-ticked boxes or inactivity \[do not\] constitute consent"
+[^32]. There are other bases for lawful processing of personal data besides consent, such
+as *legitimate interest*, which permits certain uses of data such as fraud prevention
+[^33].
+
+You might argue that a user who does not consent to surveillance can simply choose not to use a
+service. But this choice is not free either: if a service is so popular that it is "regarded by most
+people as essential for basic social participation" [^30], then it is not reasonable to
+expect people to opt out of this service---using it is *de facto* mandatory. For example, in most
+Western social communities, it has become the norm to carry a smartphone, to use social networks for
+socializing, and to use Google for finding information. Especially when a service has network
+effects, there is a social cost to people choosing *not* to use it.
+
+Declining to use a service due to its user tracking policies is easier said than done. These
+platforms are designed specifically to engage users. Many use game mechanics and tactics common in
+gambling to keep users coming back [^34]. Even if a user gets past this, declining to
+engage is only an option for the small number of people who are privileged enough to have the time
+and knowledge to understand its privacy policy, and who can afford to potentially miss out on social
+participation or professional opportunities that may have arisen if they had participated in the
+service. For people in a less privileged position, there is no meaningful freedom of choice:
+surveillance becomes inescapable.
+
+### Privacy and Use of Data {#id457}
+
+Sometimes people claim that "privacy is dead" on the grounds that some users are willing to post all
+sorts of things about their lives to social media, sometimes mundane and sometimes deeply personal.
+However, this claim is false and rests on a misunderstanding of the word *privacy*.
+
+Having privacy does not mean keeping everything secret; it means having the freedom to choose which
+things to reveal to whom, what to make public, and what to keep secret. The right to privacy is a
+decision right: it enables each person to decide where they want to be on the spectrum between
+secrecy and transparency in each situation [^30]. It is an important aspect of a person's
+freedom and autonomy.
+
+For example, someone who suffers from a rare medical condition might be very happy to provide their
+private medical data to researchers if there is a chance that it might help the development of
+treatments for their condition. However, the important thing is that this person has a choice over
+who may access this data, and for what purpose. If there were a risk that information about their
+medical condition would harm their access to medical insurance or employment or other important
+things, this person would probably be much more cautious about sharing their data.
+
+When data is extracted from people through surveillance infrastructure, privacy rights are not
+necessarily eroded, but rather transferred to the data collector. Companies that acquire data
+essentially say "trust us to do the right thing with your data," which means that the right to
+decide what to reveal and what to keep secret is transferred from the individual to the company.
+
+The companies in turn choose to keep much of the outcome of this surveillance secret, because to
+reveal it would be perceived as creepy, and would harm their business model (which relies on knowing
+more about people than other companies do). Intimate information about users is only revealed
+indirectly, for example in the form of tools for targeting advertisements to specific groups of
+people (such as those suffering from a particular illness).
+
+Even if particular users cannot be personally reidentified from the bucket of people targeted by a
+particular ad, they have lost their agency about the disclosure of some intimate information. It is
+not the user who decides what is revealed to whom on the basis of their personal preferences---it is
+the company that exercises the privacy right with the goal of maximizing its profit.
+
+Many companies have a goal of not being *perceived* as creepy---avoiding the question of how
+intrusive their data collection actually is, and instead focusing on managing user perceptions. And
+even these perceptions are often managed poorly: for example, something may be factually correct,
+but if it triggers painful memories, the user may not want to be reminded about it [^35].
+With any kind of data we should expect the possibility that it is wrong, undesirable, or
+inappropriate in some way, and we need to build mechanisms for handling those failures. Whether
+something is "undesirable" or "inappropriate" is of course down to human judgment; algorithms are
+oblivious to such notions unless we explicitly program them to respect human needs. As engineers of
+these systems we must be humble, accepting and planning for such failings.
+
+Privacy settings that allow a user of an online service to control which aspects of their data other
+users can see are a starting point for handing back some control to users. However, regardless of
+the setting, the service itself still has unfettered access to the data, and is free to use it in
+any way permitted by the privacy policy. Even if the service promises not to sell the data to third
+parties, it usually grants itself unrestricted rights to process and analyze the data internally,
+often going much further than what is overtly visible to users.
+
+This kind of large-scale transfer of privacy rights from individuals to corporations is historically
+unprecedented [^30]. Surveillance has always existed, but it used to be expensive and
+manual, not scalable and automated. Trust relationships have always existed, for example between a
+patient and their doctor, or between a defendant and their attorney---but in these cases the use of
+data has been strictly governed by ethical, legal, and regulatory constraints. Internet services
+have made it much easier to amass huge amounts of sensitive information without meaningful consent,
+and to use it at massive scale without users understanding what is happening to their private data.
+
+### Data as Assets and Power {#id376}
+
+Since behavioral data is a byproduct of users interacting with a service, it is sometimes called
+"data exhaust"---suggesting that the data is worthless waste material. Viewed this way, behavioral
+and predictive analytics can be seen as a form of recycling that extracts value from data that would
+have otherwise been thrown away.
+
+More correct would be to view it the other way round: from an economic point of view, if targeted
+advertising is what pays for a service, then the user activity that generates behavioral data could
+be regarded as a form of labor [^36]. One could go even further and argue that the
+application with which the user interacts is merely a means to lure users into feeding more and more
+personal information into the surveillance infrastructure [^30]. The delightful human
+creativity and social relationships that often find expression in online services are cynically
+exploited by the data extraction machine.
+
+Personal data is a valuable asset, as evidenced by the existence of data brokers, a shady industry
+operating in secrecy, purchasing, aggregating, analyzing, inferring, and reselling intrusive
+personal data about people, mostly for marketing purposes [^20]. Startups are valued by
+their user numbers, by "eyeballs"---i.e., by their surveillance capabilities.
+
+Because the data is valuable, many people want it. Of course companies want it---that's why they
+collect it in the first place. But governments want to obtain it too: by means of secret deals,
+coercion, legal compulsion, or simply stealing it [^37]. When a company goes bankrupt, the
+personal data it has collected is one of the assets that gets sold. Moreover, the data is difficult
+to secure, so breaches happen disconcertingly often.
+
+These observations have led critics to say that data is not just an asset, but a "toxic asset"
+[^37], or at least "hazardous material" [^38]. Maybe data is not the new gold,
+nor the new oil, but rather the new uranium [^39]. Even if we think that we are capable of
+preventing abuse of data, whenever we collect data, we need to balance the benefits with the risk of
+it falling into the wrong hands: computer systems may be compromised by criminals or hostile foreign
+intelligence services, data may be leaked by insiders, the company may fall into the hands of
+unscrupulous management that does not share our values, or the country may be taken over by a regime
+that has no qualms about compelling us to hand over the data.
+
+When collecting data, we need to consider not just today's political environment, but all possible
+future governments. There is no guarantee that every government elected in the future will respect human
+rights and civil liberties, so "it is poor civic hygiene to install technologies that could someday
+facilitate a police state" [^40].
+
+"Knowledge is power," as the old adage goes. And furthermore, "to scrutinize others while avoiding
+scrutiny oneself is one of the most important forms of power" [^41]. This is why
+totalitarian governments want surveillance: it gives them the power to control the population.
+Although today's technology companies are not overtly seeking political power, the data and
+knowledge they have accumulated nevertheless gives them a lot of power over our lives, much of which
+is surreptitious, outside of public oversight [^42].
+
+### Remembering the Industrial Revolution {#id377}
+
+Data is the defining feature of the information age. The internet, data storage, processing, and
+software-driven automation are having a major impact on the global economy and human society. As our
+daily lives and social organization have been changed by information technology, and will probably
+continue to radically change in the coming decades, comparisons to the Industrial Revolution come to
+mind [^17], [^26].
+
+The Industrial Revolution came about through major technological and agricultural advances, and it
+brought sustained economic growth and significantly improved living standards in the long run. Yet
+it also came with major problems: pollution of the air (due to smoke and chemical processes) and the
+water (from industrial and human waste) was dreadful. Factory owners lived in splendor, while urban
+workers often lived in very poor housing and worked long hours in harsh conditions. Child labor was
+common, including dangerous and poorly paid work in mines.
+
+It took a long time before safeguards were established, such as environmental protection
+regulations, safety protocols for workplaces, outlawing child labor, and health inspections for
+food. Undoubtedly the cost of doing business increased when factories were no longer allowed to dump
+their waste into rivers, sell tainted foods, or exploit workers. But society as a whole benefited
+hugely from these regulations, and few of us would want to return to a time before [^17].
+
+Just as the Industrial Revolution had a dark side that needed to be managed, our transition to the
+information age has major problems that we need to confront and solve [^43], [^44].
+The collection and use of data is one of those problems. In the words of Bruce Schneier
+[^26]:
+
+> Data is the pollution problem of the information age, and protecting privacy is the environmental
+> challenge. Almost all computers produce information. It stays around, festering. How we deal with
+> it---how we contain it and how we dispose of it---is central to the health of our information
+> economy. Just as we look back today at the early decades of the industrial age and wonder how our
+> ancestors could have ignored pollution in their rush to build an industrial world, our
+> grandchildren will look back at us during these early decades of the information age and judge us
+> on how we addressed the challenge of data collection and misuse.
+>
+> We should try to make them proud.
+
+### Legislation and Self-Regulation {#sec_future_legislation}
+
+Data protection laws might be able to help preserve individuals' rights. For example, the European
+GDPR states that personal data must be "collected for specified, explicit and legitimate purposes
+and not further processed in a manner that is incompatible with those purposes", and furthermore
+that data must be "adequate, relevant and limited to what is necessary in relation to the purposes
+for which they are processed" [^32].
+
+However, this principle of *data minimization* runs directly counter to the philosophy of Big Data,
+which is to maximize data collection, to combine it with other datasets, to experiment and to
+explore in order to generate new insights. Exploration means using data for unforeseen purposes,
+which is the opposite of the "specified and explicit" purposes for which the data must have been
+collected. While the GDPR has had some effect on the online advertising industry [^45],
+the regulation has been weakly enforced [^46], and it does not seem to have led to much of
+a change in culture and practices across the wider tech industry.
+
+Companies that collect lots of data about people oppose regulation as being a burden and a hindrance
+to innovation. To some extent that opposition is justified. For example, when sharing medical data,
+there are clear risks to privacy, but there are also potential opportunities: how many deaths could
+be prevented if data analysis was able to help us achieve better diagnostics or find better
+treatments [^47]? Over-regulation may prevent such breakthroughs. It is difficult to
+balance such potential opportunities with the risks [^41].
+
+Fundamentally, we need a culture shift in the tech industry with regard to personal data. We should
+stop regarding users as metrics to be optimized, and remember that they are humans who deserve
+respect, dignity, and agency. We should self-regulate our data collection and processing practices
+in order to establish and maintain the trust of the people who depend on our software
+[^48]. And we should take it upon ourselves to educate end users about how their data is
+used, rather than keeping them in the dark.
+
+We should allow each individual to maintain their privacy---i.e., their control over their own data---and
+not steal that control from them through surveillance. Our individual right to control our data is
+like the natural environment of a national park: if we don't explicitly protect and care for it, it
+will be destroyed. It will be the tragedy of the commons, and we will all be worse off for it.
+Ubiquitous surveillance is not inevitable---we are still able to stop it.
+
+As a first step, we should not retain data forever, but purge it as soon as it is no longer needed,
+and minimize what we collect in the first place [^48], [^49]. Data you don't have is
+data that can't be leaked, stolen, or compelled by governments to be handed over. Overall, culture
+and attitude changes will be necessary. As people working in technology, if we don't consider the
+societal impact of our work, we're not doing our job [^50].
+
+## Summary {#id594}
+
+This brings us to the end of the book. We have covered a lot of ground:
+
+- In [Chapter 1](/en/ch1#ch_tradeoffs) we contrasted analytical and operational systems, compared
+ the cloud to self-hosting, weighed up distributed and single-node systems, and discussed balancing
+ the needs of your business with the needs of your users.
+
+- In [Chapter 2](/en/ch2#ch_nonfunctional) we saw how to define several nonfunctional requirements
+ such as performance, reliability, scalability, and maintainability.
+
+- In [Chapter 3](/en/ch3#ch_datamodels) we explored a spectrum of data models, including the
+ relational, document, and graph models, event sourcing, and DataFrames. We also looked at examples
+ of various query languages, including SQL, Cypher, SPARQL, Datalog, and GraphQL.
+
+- In [Chapter 4](/en/ch4#ch_storage) we discussed storage engines for OLTP (LSM-trees and B-trees),
+ for analytics (column-oriented storage), and indexes for information retrieval (full-text and
+ vector search).
+
+- In [Chapter 5](/en/ch5#ch_encoding) we examined different ways of encoding data objects as bytes,
+ and how to support evolution as requirements change. We also compared several ways in which data flows
+ between processes: via databases, service calls, workflow engines, or event-driven architectures.
+
+- In [Chapter 6](/en/ch6#ch_replication) we studied the trade-offs between single-leader,
+ multi-leader, and leaderless replication. We also looked at consistency models such as
+ read-after-write consistency, and sync engines that allow clients to work offline.
+
+- In [Chapter 7](/en/ch7#ch_sharding) we went into sharding, including strategies for rebalancing,
+ request routing, and secondary indexing.
+
+- In [Chapter 8](/en/ch8#ch_transactions) we covered transactions: durability, how various isolation
+ levels (read committed, snapshot isolation, and serializable) can be achieved, and how atomicity
+ can be ensured in distributed transactions.
+
+- In [Chapter 9](/en/ch9#ch_distributed) we surveyed fundamental problems that occur in distributed
+ systems (network faults and delays, clock errors, process pauses, crashes), and saw how they make
+ it difficult to correctly implement even something seemingly simple like a lock.
+
+- In [Chapter 10](/en/ch10#ch_consistency) we went on a deep dive into various forms of consensus
+ and the consistency model (linearizability) it enables.
+
+- In [Chapter 11](/en/ch11#ch_batch) we dug into batch processing, building up from simple chains of
+ Unix tools to large-scale distributed batch processors using distributed filesystems or object
+ stores.
+
+- In [Chapter 12](/en/ch12#ch_stream) we generalized batch processing to stream processing,
+ discussed the underlying message brokers, change data capture, fault tolerance, and processing
+ patterns such as streaming joins.
+
+- In [Chapter 13](/en/ch13#ch_philosophy) we explored a philosophy of streaming systems that allows
+ disparate data systems to be integrated, systems to be evolved, and applications to be scaled more
+ easily.
+
+Finally, in this last chapter, we took a step back and examined some ethical aspects of building
+data-intensive applications. We saw that although data can be used to do good, it can also do
+significant harm: making decisions that seriously affect people's lives and are difficult to appeal
+against, leading to discrimination and exploitation, normalizing surveillance, and exposing intimate
+information. We also run the risk of data breaches, and we may find that a well-intentioned use of
+data has unintended consequences.
+
+As software and data are having such a large impact on the world, we as engineers must remember that
+we carry a responsibility to work toward the kind of world that we want to live in: a world that
+treats people with humanity and respect. Let's work together towards that goal.
+
+##### Footnotes
+
+### References {#references}
+
+[^1]: David Schmudde. [What If Data Is a Bad Idea?](https://schmud.de/posts/2024-08-18-data-is-a-bad-idea.html). *schmud.de*, August 2024. Archived at [perma.cc/ZXU5-XMCT](https://perma.cc/ZXU5-XMCT)
+[^2]: [ACM Code of Ethics and Professional Conduct](https://www.acm.org/code-of-ethics). Association for Computing Machinery, *acm.org*, 2018. Archived at [perma.cc/SEA8-CMB8](https://perma.cc/SEA8-CMB8)
+[^3]: Igor Perisic. [Making Hard Choices: The Quest for Ethics in Machine Learning](https://www.linkedin.com/blog/engineering/archive/making-hard-choices-the-quest-for-ethics-in-machine-learning). *linkedin.com*, November 2016. Archived at [perma.cc/DGF8-KNT7](https://perma.cc/DGF8-KNT7)
+[^4]: John Naughton. [Algorithm Writers Need a Code of Conduct](https://www.theguardian.com/commentisfree/2015/dec/06/algorithm-writers-should-have-code-of-conduct). *theguardian.com*, December 2015. Archived at [perma.cc/TBG2-3NG6](https://perma.cc/TBG2-3NG6)
+[^5]: Ben Green. ["Good" isn't good enough](https://www.benzevgreen.com/wp-content/uploads/2019/11/19-ai4sg.pdf). At *NeurIPS Joint Workshop on AI for Social Good*, December 2019. Archived at [perma.cc/H4LN-7VY3](https://perma.cc/H4LN-7VY3)
+[^6]: Deborah G. Johnson and Mario Verdicchio. [Ethical AI is Not about AI](https://cacm.acm.org/opinion/ethical-ai-is-not-about-ai/). *Communications of the ACM*, volume 66, issue 2, pages 32--34, January 2023. [doi:10.1145/3576932](https://doi.org/10.1145/3576932)
+[^7]: Marc Steen. [Ethics as a Participatory and Iterative Process](https://cacm.acm.org/opinion/ethics-as-a-participatory-and-iterative-process/). *Communications of the ACM*, volume 66, issue 5, pages 27--29, April 2023. [doi:10.1145/3550069](https://doi.org/10.1145/3550069)
+[^8]: Logan Kugler. [What Happens When Big Data Blunders?](https://cacm.acm.org/news/what-happens-when-big-data-blunders/) *Communications of the ACM*, volume 59, issue 6, pages 15--16, June 2016. [doi:10.1145/2911975](https://doi.org/10.1145/2911975)
+[^9]: Miri Zilka. [Algorithms and the criminal justice system: promises and challenges in deployment and research](https://www.cl.cam.ac.uk/research/security/seminars/archive/video/2023-03-07-t196231.html). At *University of Cambridge Security Seminar Series*, March 2023.
+[^10]: Bill Davidow. [Welcome to Algorithmic Prison](https://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/). *theatlantic.com*, February 2014. Archived at [archive.org](https://web.archive.org/web/20171019201812/https://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/)
+[^11]: Don Peck. [They're Watching You at Work](https://www.theatlantic.com/magazine/archive/2013/12/theyre-watching-you-at-work/354681/). *theatlantic.com*, December 2013. Archived at [perma.cc/YR9T-6M38](https://perma.cc/YR9T-6M38)
+[^12]: Leigh Alexander. [Is an Algorithm Any Less Racist Than a Human?](https://www.theguardian.com/technology/2016/aug/03/algorithm-racist-human-employers-work) *theguardian.com*, August 2016. Archived at [perma.cc/XP93-DSVX](https://perma.cc/XP93-DSVX)
+[^13]: Jesse Emspak. [How a Machine Learns Prejudice](https://www.scientificamerican.com/article/how-a-machine-learns-prejudice/). *scientificamerican.com*, December 2016. Archived at [perma.cc/R3L5-55E6](https://perma.cc/R3L5-55E6)
+[^14]: Rohit Chopra, Kristen Clarke, Charlotte A. Burrows, and Lina M. Khan. [Joint Statement on Enforcement Efforts Against Discrimination and Bias in Automated Systems](https://www.ftc.gov/system/files/ftc_gov/pdf/EEOC-CRT-FTC-CFPB-AI-Joint-Statement%28final%29.pdf). *ftc.gov*, April 2023. Archived at [perma.cc/YY4Y-RCCA](https://perma.cc/YY4Y-RCCA)
+[^15]: Maciej Cegłowski. [The Moral Economy of Tech](https://idlewords.com/talks/sase_panel.htm). *idlewords.com*, June 2016. Archived at [perma.cc/L8XV-BKTD](https://perma.cc/L8XV-BKTD)
+[^16]: Greg Nichols. [Artificial Intelligence in healthcare is racist](https://www.zdnet.com/article/artificial-intelligence-in-healthcare-is-racist/). *zdnet.com*, November 2020. Archived at [perma.cc/3MKW-YKRS](https://perma.cc/3MKW-YKRS)
+[^17]: Cathy O'Neil. *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. ISBN: 978-0-553-41881-1
+[^18]: Julia Angwin. [Make Algorithms Accountable](https://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html). *nytimes.com*, August 2016. Archived at [archive.org](https://web.archive.org/web/20230819055242/https://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html)
+[^19]: Bryce Goodman and Seth Flaxman. [European Union Regulations on Algorithmic Decision-Making and a 'Right to Explanation'](https://arxiv.org/abs/1606.08813). At *ICML Workshop on Human Interpretability in Machine Learning*, June 2016. Archived at [arxiv.org/abs/1606.08813](https://arxiv.org/abs/1606.08813)
+[^20]: [A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes](https://www.commerce.senate.gov/services/files/0d2b3642-6221-4888-a631-08f2f255b577). Staff Report, *United States Senate Committee on Commerce, Science, and Transportation*, *commerce.senate.gov*, December 2013. Archived at [perma.cc/32NV-YWLQ](https://perma.cc/32NV-YWLQ)
+[^21]: Stephanie Assad, Robert Clark, Daniel Ershov, and Lei Xu. [Algorithmic Pricing and Competition: Empirical Evidence from the German Retail Gasoline Market](https://economics.yale.edu/sites/default/files/clark_acex_jan_2021.pdf). *Journal of Political Economy*, volume 132, issue 3, pages 723--771, March 2024. [doi:10.1086/726906](https://doi.org/10.1086/726906)
+[^22]: Donella H. Meadows and Diana Wright. *Thinking in Systems: A Primer*. Chelsea Green Publishing, 2008. ISBN: 978-1-603-58055-7
+[^23]: Daniel J. Bernstein. [Listening to a "big data"/"data science" talk. Mentally translating "data" to "surveillance": "...everything starts with surveillance..."](https://x.com/hashbreaker/status/598076230437568512) *x.com*, May 2015. Archived at [perma.cc/EY3D-WBBJ](https://perma.cc/EY3D-WBBJ)
+[^24]: Marc Andreessen. [Why Software Is Eating the World](https://a16z.com/why-software-is-eating-the-world/). *a16z.com*, August 2011. Archived at [perma.cc/3DCC-W3G6](https://perma.cc/3DCC-W3G6)
+[^25]: J. M. Porup. ['Internet of Things' Security Is Hilariously Broken and Getting Worse](https://arstechnica.com/information-technology/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/). *arstechnica.com*, January 2016. Archived at [archive.org](https://web.archive.org/web/20250823001716/https://arstechnica.com/information-technology/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/)
+[^26]: Bruce Schneier. [*Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World*](https://www.schneier.com/books/data_and_goliath/). W. W. Norton, 2015. ISBN: 978-0-393-35217-7
+[^27]: The Grugq. [Nothing to Hide](https://grugq.tumblr.com/post/142799983558/nothing-to-hide). *grugq.tumblr.com*, April 2016. Archived at [perma.cc/BL95-8W5M](https://perma.cc/BL95-8W5M)
+[^28]: Federal Trade Commission. [FTC Takes Action Against General Motors for Sharing Drivers' Precise Location and Driving Behavior Data Without Consent](https://www.ftc.gov/news-events/news/press-releases/2025/01/ftc-takes-action-against-general-motors-sharing-drivers-precise-location-driving-behavior-data). *ftc.gov*, January 2025. Archived at [perma.cc/3XGV-3HRD](https://perma.cc/3XGV-3HRD)
+[^29]: Tony Beltramelli. [Deep-Spying: Spying Using Smartwatch and Deep Learning](https://arxiv.org/abs/1512.05616). Master's Thesis, IT University of Copenhagen, December 2015. Archived at *arxiv.org/abs/1512.05616*
+[^30]: Shoshana Zuboff. [Big Other: Surveillance Capitalism and the Prospects of an Information Civilization](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2594754). *Journal of Information Technology*, volume 30, issue 1, pages 75--89, April 2015. [doi:10.1057/jit.2015.5](https://doi.org/10.1057/jit.2015.5)
+[^31]: Michiel Rhoen. [Beyond Consent: Improving Data Protection Through Consumer Protection Law](https://policyreview.info/articles/analysis/beyond-consent-improving-data-protection-through-consumer-protection-law). *Internet Policy Review*, volume 5, issue 1, March 2016. [doi:10.14763/2016.1.404](https://doi.org/10.14763/2016.1.404)
+[^32]: [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016](https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng). *Official Journal of the European Union*, L 119/1, May 2016.
+[^33]: UK Information Commissioner's Office. [What is the 'legitimate interests' basis?](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/lawful-basis/legitimate-interests/what-is-the-legitimate-interests-basis/) *ico.org.uk*. Archived at [perma.cc/W8XR-F7ML](https://perma.cc/W8XR-F7ML)
+[^34]: Tristan Harris. [How a handful of tech companies control billions of minds every day](https://www.ted.com/talks/tristan_harris_how_a_handful_of_tech_companies_control_billions_of_minds_every_day). At *TED2017*, April 2017.
+[^35]: Carina C. Zona. [Consequences of an Insightful Algorithm](https://www.youtube.com/watch?v=YRI40A4tyWU). At *GOTO Berlin*, November 2016.
+[^36]: Imanol Arrieta Ibarra, Leonard Goff, Diego Jiménez Hernández, Jaron Lanier, and E. Glen Weyl. [Should We Treat Data as Labor? Moving Beyond 'Free'](https://www.aeaweb.org/conference/2018/preliminary/paper/2Y7N88na). *American Economic Association Papers Proceedings*, volume 1, issue 1, December 2017.
+[^37]: Bruce Schneier. [Data Is a Toxic Asset, So Why Not Throw It Out?](https://www.schneier.com/essays/archives/2016/03/data_is_a_toxic_asse.html) *schneier.com*, March 2016. Archived at [perma.cc/4GZH-WR3D](https://perma.cc/4GZH-WR3D)
+[^38]: Cory Scott. [Data is not toxic - which implies no benefit - but rather hazardous material, where we must balance need vs. want](https://x.com/cory_scott/status/706586399483437056). *x.com*, March 2016. Archived at [perma.cc/CLV7-JF2E](https://perma.cc/CLV7-JF2E)
+[^39]: Mark Pesce. [Data is the new uranium -- incredibly powerful and amazingly dangerous](https://www.theregister.com/2024/11/20/data_is_the_new_uranium/). *theregister.com*, November 2024. Archived at [perma.cc/NV8B-GYGV](https://perma.cc/NV8B-GYGV)
+[^40]: Bruce Schneier. [Mission Creep: When Everything Is Terrorism](https://www.schneier.com/essays/archives/2013/07/mission_creep_when_e.html). *schneier.com*, July 2013. Archived at [perma.cc/QB2C-5RCE](https://perma.cc/QB2C-5RCE)
+[^41]: Lena Ulbricht and Maximilian von Grafenstein. [Big Data: Big Power Shifts?](https://policyreview.info/articles/analysis/big-data-big-power-shifts) *Internet Policy Review*, volume 5, issue 1, March 2016. [doi:10.14763/2016.1.406](https://doi.org/10.14763/2016.1.406)
+[^42]: Ellen P. Goodman and Julia Powles. [Facebook and Google: Most Powerful and Secretive Empires We've Ever Known](https://www.theguardian.com/technology/2016/sep/28/google-facebook-powerful-secretive-empire-transparency). *theguardian.com*, September 2016. Archived at [perma.cc/8UJA-43G6](https://perma.cc/8UJA-43G6)
+[^43]: Judy Estrin and Sam Gill. [The World Is Choking on Digital Pollution](https://washingtonmonthly.com/2019/01/13/the-world-is-choking-on-digital-pollution/). *washingtonmonthly.com*, January 2019. Archived at [perma.cc/3VHF-C6UC](https://perma.cc/3VHF-C6UC)
+[^44]: A. Michael Froomkin. [Regulating Mass Surveillance as Privacy Pollution: Learning from Environmental Impact Statements](https://repository.law.miami.edu/cgi/viewcontent.cgi?article=1062&context=fac_articles). *University of Illinois Law Review*, volume 2015, issue 5, August 2015. Archived at [perma.cc/24ZL-VK2T](https://perma.cc/24ZL-VK2T)
+[^45]: Pengyuan Wang, Li Jiang, and Jian Yang. [The Early Impact of GDPR Compliance on Display Advertising: The Case of an Ad Publisher](https://openreview.net/pdf?id=TUnLHNo19S). *Journal of Marketing Research*, volume 61, issue 1, April 2023. [doi:10.1177/00222437231171848](https://doi.org/10.1177/00222437231171848)
+[^46]: Johnny Ryan. [Don't be fooled by Meta's fine for data breaches](https://www.economist.com/by-invitation/2023/05/24/dont-be-fooled-by-metas-fine-for-data-breaches-says-johnny-ryan). *The Economist*, May 2023. Archived at [perma.cc/VCR6-55HR](https://perma.cc/VCR6-55HR)
+[^47]: Jessica Leber. [Your Data Footprint Is Affecting Your Life in Ways You Can't Even Imagine](https://www.fastcompany.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine). *fastcompany.com*, March 2016. Archived at [archive.org](https://web.archive.org/web/20161128133016/https://www.fastcoexist.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine)
+[^48]: Maciej Cegłowski. [Haunted by Data](https://idlewords.com/talks/haunted_by_data.htm). *idlewords.com*, October 2015. Archived at [archive.org](https://web.archive.org/web/20161130143932/https://idlewords.com/talks/haunted_by_data.htm)
+[^49]: Sam Thielman. [You Are Not What You Read: Librarians Purge User Data to Protect Privacy](https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy). *theguardian.com*, January 2016. Archived at [archive.org](https://web.archive.org/web/20250828224851/https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy)
+[^50]: Jez Humble. [It's a cliché that people get into tech to "change the world". So then, you have to actually consider what the impact of your work is on the world. The idea that you can or should exclude societal and political discussions in tech is idiotic. It means you're not doing your job](https://x.com/jezhumble/status/1386758340894597122). *x.com*, April 2021. Archived at [perma.cc/3NYS-MHLC](https://perma.cc/3NYS-MHLC)
diff --git a/content/en/ch2.md b/content/en/ch2.md
index f148e59..881247f 100644
--- a/content/en/ch2.md
+++ b/content/en/ch2.md
@@ -4,6 +4,8 @@ weight: 102
breadcrumbs: false
---
+
+

> *The Internet was done so well that most people think of it as a natural resource like the Pacific
@@ -55,7 +57,7 @@ Barack Obama have over 100 million followers).
### Representing Users, Posts, and Follows {#id20}
-Imagine we keep all of the data in a relational database as shown in [Figure 2-1](/en/ch2#fig_twitter_relational). We
+Imagine we keep all of the data in a relational database as shown in [Figure 2-1](/en/ch2#fig_twitter_relational). We
have one table for users, one table for posts, and one table for follow relationships.
{{< figure src="/fig/ddia_0201.png" id="fig_twitter_relational" caption="Figure 2-1. Simple relational schema for a social network in which users can follow each other." class="w-full my-4" >}}
@@ -107,7 +109,7 @@ needs to subscribe to the stream of posts being added to their home timeline.
The downside of this approach is that we now need to do more work every time a user makes a post,
because the home timelines are derived data that needs to be updated. The process is illustrated in
-[Figure 2-2](/en/ch2#fig_twitter_timelines). When one initial request results in several downstream requests being
+[Figure 2-2](/en/ch2#fig_twitter_timelines). When one initial request results in several downstream requests being
carried out, we use the term *fan-out* to describe the factor by which the number of requests
increases.
@@ -126,7 +128,7 @@ load, since we simply serve them from a cache.
This process of precomputing and updating the results of a query is called *materialization*, and
the timeline cache is an example of a *materialized view* (a concept we will discuss further in
-[Link to Come]). The materialized view speeds up reads, but in return we have to do more work on
+[“Maintaining materialized views”](/en/ch12#sec_stream_mat_view)). The materialized view speeds up reads, but in return we have to do more work on
write. The cost of writes for most users is modest, but a social network also has to consider some
extreme cases:
@@ -163,7 +165,7 @@ metrics, whereas the “time it takes to load the home timeline” or the “tim
delivered to followers” are response time metrics.
There is often a connection between throughput and response time; an example of such a relationship
-for an online service is sketched in [Figure 2-3](/en/ch2#fig_throughput). The service has a low response time when
+for an online service is sketched in [Figure 2-3](/en/ch2#fig_throughput). The service has a low response time when
request throughput is low, but response time increases as load increases. This is because of
*queueing*: when a request arrives on a highly loaded system, it’s likely that the CPU is already in
the process of handling an earlier request, and therefore the incoming request needs to wait until
@@ -175,6 +177,8 @@ handle, queueing delays increase sharply.
--------
+
+
> [!TIP] WHEN AN OVERLOADED SYSTEM WON'T RECOVER
If a system is close to overload, with throughput pushed close to the limit, it can sometimes enter a
@@ -206,7 +210,7 @@ scalability in [“Scalability”](/en/ch2#sec_introduction_scalability).
### Latency and Response Time {#id23}
“Latency” and “response time” are sometimes used interchangeably, but in this book we will use the
-terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)):
+terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)):
* The *response time* is what the client sees; it includes all delays incurred anywhere in the
system.
@@ -221,7 +225,7 @@ terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)
{{< figure src="/fig/ddia_0204.png" id="fig_response_time" caption="Figure 2-4. Response time, service time, network latency, and queueing delay." class="w-full my-4" >}}
-In [Figure 2-4](/en/ch2#fig_response_time), time flows from left to right, each communicating node is shown as a
+In [Figure 2-4](/en/ch2#fig_response_time), time flows from left to right, each communicating node is shown as a
horizontal line, and a request or response message is shown as a thick diagonal arrow from one node
to another. You will encounter this style of diagram frequently over the course of this book.
@@ -242,7 +246,7 @@ it is important to measure response times on the client side.
### Average, Median, and Percentiles {#id24}
Because the response time varies from one request to the next, we need to think of it not as a
-single number, but as a *distribution* of values that you can measure. In [Figure 2-5](/en/ch2#fig_lognormal), each
+single number, but as a *distribution* of values that you can measure. In [Figure 2-5](/en/ch2#fig_lognormal), each
gray bar represents a request to a service, and its height shows how long that request took. Most
requests are reasonably fast, but there are occasional *outliers* that take much longer.
Variation in network delay is also known as *jitter*.
@@ -257,7 +261,7 @@ because it doesn’t tell you how many users actually experienced that delay.
Usually it is better to use *percentiles*. If you take your list of response times and sort it from
fastest to slowest, then the *median* is the halfway point: for example, if your median response
-time is 200 ms, that means half your requests return in less than 200 ms, and half your
+time is 200 ms, that means half your requests return in less than 200 ms, and half your
requests take longer than that. This makes the median a good metric if you want to know how long
users typically have to wait. The median is also known as the *50th percentile*, and sometimes
abbreviated as *p50*.
@@ -267,7 +271,7 @@ In order to figure out how bad your outliers are, you can look at higher percent
response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular
threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of
100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is
-illustrated in [Figure 2-5](/en/ch2#fig_lognormal).
+illustrated in [Figure 2-5](/en/ch2#fig_lognormal).
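
As a rough illustration of the definitions above, the following sketch computes the median and higher percentiles from a list of measured response times using the nearest-rank method. The function name `percentile` and the sample values are made up for this example; real monitoring systems typically use approximation algorithms rather than sorting every sample.

```python
import math


def percentile(response_times_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not response_times_ms:
        raise ValueError("need at least one sample")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(response_times_ms)           # fastest to slowest
    rank = math.ceil(p / 100 * len(ordered))      # 1-based rank in the sorted list
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, including two slow outliers.
samples = [120, 95, 200, 180, 1500, 210, 190, 205, 170, 3000]

p50 = percentile(samples, 50)   # median: half of the requests are at or below this
p95 = percentile(samples, 95)   # tail: dominated by the slow outliers
print(f"p50={p50} ms, p95={p95} ms")
```

With only ten samples the high percentiles are determined entirely by the slowest request, which is one reason percentiles are usually computed over a large window of requests.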
High percentiles of response times, also known as *tail latencies*, are important because they
directly affect users’ experience of the service. For example, Amazon describes response time
@@ -291,14 +295,14 @@ However, it is surprisingly difficult to get hold of reliable data to quantify t
latency has on user behavior.
Some often-cited statistics are unreliable. In 2006 Google reported that a slowdown in search
-results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue [^21].
-However, another Google study from 2009 reported that a 400 ms increase in latency resulted in
+results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue [^21].
+However, another Google study from 2009 reported that a 400 ms increase in latency resulted in
only 0.6% fewer searches per day [^22],
and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3% [^23].
Newer data from these companies appears not to be publicly available.
A more recent Akamai study [^24]
-claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites
+claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites
by up to 7%; however, on closer inspection, the same study reveals that very *fast* page load times
are also correlated with lower conversion rates! This seemingly paradoxical result is explained by
the fact that the pages that load fastest are often those that have no useful content (e.g., 404
@@ -316,7 +320,7 @@ fast and slow responses is 1.25 seconds or more.
High percentiles are especially important in backend services that are called multiple times as
part of serving a single end-user request. Even if you make the calls in parallel, the end-user
request still needs to wait for the slowest of the parallel calls to complete. It takes just one
-slow call to make the entire end-user request slow, as illustrated in [Figure 2-6](/en/ch2#fig_tail_amplification).
+slow call to make the entire end-user request slow, as illustrated in [Figure 2-6](/en/ch2#fig_tail_amplification).
Even if only a small percentage of backend calls are slow, the chance of getting a slow call
increases if an end-user request requires multiple backend calls, and so a higher proportion of
end-user requests end up being slow (an effect known as *tail latency amplification* [^26]).
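
To get a feel for the size of this effect, here is a minimal sketch that assumes each backend call is slow independently with the same probability (a simplification; real slowdowns are often correlated). The function name `chance_of_slow_request` and the 1% figure are illustrative assumptions, not values from the text.

```python
def chance_of_slow_request(p_slow_call, num_calls):
    """Probability that at least one of num_calls parallel backend calls is slow,
    assuming every call is slow independently with probability p_slow_call."""
    return 1 - (1 - p_slow_call) ** num_calls


# Suppose 1% of backend calls exceed some response time threshold.
for num_calls in (1, 10, 100):
    chance = chance_of_slow_request(0.01, num_calls)
    print(f"{num_calls:>3} backend calls: {chance:.1%} of end-user requests are slow")
```

Under these assumptions, a request that fans out to 100 backend calls is slow well over half the time, even though only 1 in 100 individual calls is slow.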
@@ -326,13 +330,15 @@ end-user requests end up being slow (an effect known as *tail latency amplificat
Percentiles are often used in *service level objectives* (SLOs) and *service level agreements*
(SLAs) as ways of defining the expected performance and availability of a service [^27].
For example, an SLO may set a target for a service to have a median response time of less than
-200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests
+200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests
result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not
met (for example, customers may be entitled to a refund). That is the basic idea, at least; in
practice, defining good availability metrics for SLOs and SLAs is not straightforward [^28] [^29].
--------
+
+
> [!TIP] COMPUTING PERCENTILES
If you want to add response time percentiles to the monitoring dashboards for your services, you
@@ -395,7 +401,7 @@ For example, in the social network case study, a fault that might happen is that
process, a machine involved in updating the materialized timelines crashes or becomes unavailable.
To make this process fault-tolerant, we would need to ensure that another machine can take over this
task without missing any posts that should have been delivered, and without duplicating any posts.
-(This idea is known as *exactly-once semantics*, and we will examine it in detail in [Link to Come].)
+(This idea is known as *exactly-once semantics*, and we will examine it in detail in [“The End-to-End Argument for Databases”](/en/ch13#sec_future_end_to_end).)
Fault tolerance is always limited to a certain number of certain types of faults. For example, a
system might be able to tolerate a maximum of two hard drives failing at the same time, or a maximum
@@ -473,14 +479,14 @@ resources.
The fault-tolerance techniques we discuss in this book are designed to tolerate the loss of entire
machines, racks, or availability zones. They generally work by allowing a machine in one datacenter
to take over when a machine in another datacenter fails or becomes unreachable. We will discuss such
-techniques for fault tolerance in [Chapter 6](/en/ch6#ch_replication), [Chapter 10](/en/ch10#ch_consistency), and at various other
+techniques for fault tolerance in [Chapter 6](/en/ch6#ch_replication), [Chapter 10](/en/ch10#ch_consistency), and at various other
points in this book.
Systems that can tolerate the loss of entire machines also have operational advantages: a
single-server system requires planned downtime if you need to reboot the machine (to apply operating
system security patches, for example), whereas a multi-node fault-tolerant system can be patched by
restarting one node at a time, without affecting the service for users. This is called a *rolling
-upgrade*, and we will discuss it further in [Chapter 5](/en/ch5#ch_encoding).
+upgrade*, and we will discuss it further in [Chapter 5](/en/ch5#ch_encoding).
#### Software faults {#software-faults}
@@ -559,6 +565,8 @@ work with it every day, and take steps to improve it based on this feedback [^71
--------
+
+
> [!TIP] HOW IMPORTANT IS RELIABILITY?
Reliability is not just for nuclear power stations and air traffic control—more mundane applications
@@ -691,8 +699,8 @@ The advantages of shared-nothing are that it has the potential to scale linearly
whatever hardware offers the best price/performance ratio (especially in the cloud), it can more
easily adjust its hardware resources as load increases or decreases, and it can achieve greater
fault tolerance by distributing the system across multiple data centers and regions. The downsides
-are that it requires explicit sharding (see [Chapter 7](/en/ch7#ch_sharding)), and it incurs all the complexity of
-distributed systems ([Chapter 9](/en/ch9#ch_distributed)).
+are that it requires explicit sharding (see [Chapter 7](/en/ch7#ch_sharding)), and it incurs all the complexity of
+distributed systems ([Chapter 9](/en/ch9#ch_distributed)).
Some cloud-native database systems use separate services for storage and transaction execution (see
[“Separation of storage and compute”](/en/ch1#sec_introduction_storage_compute)), with multiple compute nodes sharing access to the same
@@ -706,9 +714,9 @@ the database [^83].
The architecture of systems that operate at large scale is usually highly specific to the
application—there is no such thing as a generic, one-size-fits-all scalable architecture
(informally known as *magic scaling sauce*). For example, a system that is designed to handle
-100,000 requests per second, each 1 kB in size, looks very different from a system that is
-designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same
-data throughput (100 MB/sec).
+100,000 requests per second, each 1 kB in size, looks very different from a system that is
+designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same
+data throughput (100 MB/sec).
Moreover, an architecture that is appropriate for one level of load is unlikely to cope with 10
times that load. If you are working on a fast-growing service, it is therefore likely that you will
@@ -718,11 +726,11 @@ one order of magnitude in advance.
A good general principle for scalability is to break a system down into smaller components that can
operate largely independently from each other. This is the underlying principle behind microservices
-(see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing
-([Link to Come]), and shared-nothing architectures. However, the challenge is in knowing where to
+(see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing
+([Chapter 12](/en/ch12#ch_stream)), and shared-nothing architectures. However, the challenge is in knowing where to
draw the line between things that should be together, and things that should be apart. Design
guidelines for microservices can be found in other books [^84],
-and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding).
+and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding).
Another good principle is not to make things more complicated than necessary. If a single-machine
database will do the job, it’s probably preferable to a complicated distributed setup. Auto-scaling
@@ -997,4 +1005,3 @@ this book will cover a selection of building blocks that have proved to be valua
[^96]: Eric Evans. [*Domain-Driven Design: Tackling Complexity in the Heart of Software*](https://learning.oreilly.com/library/view/domain-driven-design-tackling/0321125215/). Addison-Wesley Professional, August 2003. ISBN: 9780321125217
[^97]: Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson. [Analyzing Software Evolvability](https://www.es.mdh.se/pdf_publications/1251.pdf). at *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008. [doi:10.1109/COMPSAC.2008.50](https://doi.org/10.1109/COMPSAC.2008.50)
[^98]: Enrico Zaninotto. [From X programming to the X organisation](https://martinfowler.com/articles/zaninotto.pdf). At *XP Conference*, May 2002. Archived at [perma.cc/R9AR-QCKZ](https://perma.cc/R9AR-QCKZ)
-
diff --git a/content/en/ch3.md b/content/en/ch3.md
index 9486423..a5370cf 100644
--- a/content/en/ch3.md
+++ b/content/en/ch3.md
@@ -4,6 +4,8 @@ weight: 103
breadcrumbs: false
---
+
+

> *The limits of my language mean the limits of my world.*
@@ -27,7 +29,7 @@ question is: how is it *represented* in terms of the next-lower layer? For examp
3. The engineers who built your database software decided on a way of representing that
document/relational/graph data in terms of bytes in memory, on disk, or on a network. The
representation may allow the data to be queried, searched, manipulated, and processed in various
- ways. We will discuss these storage engine designs in [Chapter 4](/en/ch4#ch_storage).
+ ways. We will discuss these storage engine designs in [Chapter 4](/en/ch4#ch_storage).
4. On yet lower levels, hardware engineers have figured out how to represent bytes in terms of
electrical currents, pulses of light, magnetic fields, and more.
@@ -156,7 +158,7 @@ Nevertheless, ORMs also have advantages:
#### The document data model for one-to-many relationships {#the-document-data-model-for-one-to-many-relationships}
Not all data lends itself well to a relational representation; let’s look at an example to explore a
-limitation of the relational model. [Figure 3-1](/en/ch3#fig_obama_relational) illustrates how a résumé (a LinkedIn
+limitation of the relational model. [Figure 3-1](/en/ch3#fig_obama_relational) illustrates how a résumé (a LinkedIn
profile) could be expressed in a relational schema. The profile as a whole can be identified by a
unique identifier, `user_id`. Fields like `first_name` and `last_name` appear exactly once per user,
so they can be modeled as columns on the `users` table.
@@ -165,13 +167,13 @@ Most people have had more than one job in their career (positions), and people m
numbers of periods of education and any number of pieces of contact information. One way of
representing such *one-to-many relationships* is to put positions, education, and contact
information in separate tables, with a foreign key reference to the `users` table, as in
-[Figure 3-1](/en/ch3#fig_obama_relational).
+[Figure 3-1](/en/ch3#fig_obama_relational).
{{< figure src="/fig/ddia_0301.png" id="fig_obama_relational" caption="Figure 3-1. Representing a LinkedIn profile using a relational schema." class="w-full my-4" >}}
Another way of representing the same information, which is perhaps more natural and maps more
closely to an object structure in application code, is as a JSON document as shown in
-[Example 3-1](/en/ch3#fig_obama_json).
+[Example 3-1](/en/ch3#fig_obama_json).
{{< figure id="fig_obama_json" title="Example 3-1. Representing a LinkedIn profile as a JSON document" class="w-full my-4" >}}
@@ -199,12 +201,12 @@ closely to an object structure in application code, is as a JSON document as sho
```
Some developers feel that the JSON model reduces the impedance mismatch between the application code
-and the storage layer. However, as we shall see in [Chapter 5](/en/ch5#ch_encoding), there are also problems with
+and the storage layer. However, as we shall see in [Chapter 5](/en/ch5#ch_encoding), there are also problems with
JSON as a data encoding format. The lack of a schema is often cited as an advantage; we will discuss
this in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility).
The JSON representation has better *locality* than the multi-table schema in
-[Figure 3-1](/en/ch3#fig_obama_relational) (see [“Data locality for reads and writes”](/en/ch3#sec_datamodels_document_locality)). If you want to fetch a profile
+[Figure 3-1](/en/ch3#fig_obama_relational) (see [“Data locality for reads and writes”](/en/ch3#sec_datamodels_document_locality)). If you want to fetch a profile
in the relational example, you need to either perform multiple queries (query each table by
`user_id`) or perform a messy multi-way join between the `users` table and its subordinate tables [^8].
In the JSON representation, all the relevant information is in one place, making the query both
@@ -212,7 +214,7 @@ faster and simpler.
The one-to-many relationships from the user profile to the user’s positions, educational history, and
contact information imply a tree structure in the data, and the JSON representation makes this tree
-structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
+structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
{{< figure src="/fig/ddia_0302.png" id="fig_json_tree" caption="Figure 3-2. One-to-many relationships forming a tree structure." class="w-full my-4" >}}
@@ -222,13 +224,13 @@ structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
> This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé typically has a small number of positions [^9] [^10].
> In situations where there may be a genuinely large number of related items—say, comments on a
> celebrity’s social media post, of which there could be many thousands—embedding them all in the same
-> document may be too unwieldy, so the relational approach in [Figure 3-1](/en/ch3#fig_obama_relational) is preferable.
+> document may be too unwieldy, so the relational approach in [Figure 3-1](/en/ch3#fig_obama_relational) is preferable.
--------
### Normalization, Denormalization, and Joins {#sec_datamodels_normalization}
-In [Example 3-1](/en/ch3#fig_obama_json) in the preceding section, `region_id` is given as an ID, not as the plain-text
+In [Example 3-1](/en/ch3#fig_obama_json) in the preceding section, `region_id` is given as an ID, not as the plain-text
string `"Washington, DC, United States"`. Why?
If the user interface has a free-text field for entering the region, it makes sense to store it as a
@@ -321,7 +323,7 @@ Besides the cost of performing all these updates, you also need to consider the
database if a process crashes halfway through making its updates. Databases that offer atomic
transactions (see [“Atomicity”](/en/ch8#sec_transactions_acid_atomicity)) make it easier to remain consistent, but not
all databases offer atomicity across multiple documents. It is also possible to ensure consistency
-through stream processing, which we discuss in [Link to Come].
+through stream processing, which we discuss in [“Keeping Systems in Sync”](/en/ch12#sec_stream_sync).
Normalization tends to be better for OLTP systems, where both reads and updates need to be fast;
analytics systems often fare better with denormalized data, since they perform updates in bulk, and
@@ -332,7 +334,7 @@ acceptable. However, in very large-scale systems, the cost of joins can become p
#### Denormalization in the social networking case study {#denormalization-in-the-social-networking-case-study}
-In [“Case Study: Social Network Home Timelines”](/en/ch2#sec_introduction_twitter) we compared a normalized representation ([Figure 2-1](/en/ch2#fig_twitter_relational))
+In [“Case Study: Social Network Home Timelines”](/en/ch2#sec_introduction_twitter) we compared a normalized representation ([Figure 2-1](/en/ch2#fig_twitter_relational))
and a denormalized one (precomputed, materialized timelines): here, the join between `posts` and
`follows` was too expensive, and the materialized timeline is a cache of the result of that join.
The fan-out process that inserts a new post into followers’ timelines was our way of keeping the
@@ -380,7 +382,7 @@ of performance of reads and writes, as well as the amount of effort to implement
### Many-to-One and Many-to-Many Relationships {#sec_datamodels_many_to_many}
-While `positions` and `education` in [Figure 3-1](/en/ch3#fig_obama_relational) are examples of one-to-many or
+While `positions` and `education` in [Figure 3-1](/en/ch3#fig_obama_relational) are examples of one-to-many or
one-to-few relationships (one résumé has several positions, but each position belongs only to one
résumé), the `region_id` field is an example of a *many-to-one* relationship (many people live in
the same region, but we assume that each person lives in only one region at any one time).
@@ -389,14 +391,14 @@ If we introduce entities for organizations and schools, and reference them by ID
then we also have *many-to-many* relationships (one person has worked for several organizations, and
an organization has several past or present employees). In a relational model, such a relationship
is usually represented as an *associative table* or *join table*, as shown in
-[Figure 3-3](/en/ch3#fig_datamodels_m2m_rel): each position associates one user ID with one organization ID.
+[Figure 3-3](/en/ch3#fig_datamodels_m2m_rel): each position associates one user ID with one organization ID.
{{< figure src="/fig/ddia_0303.png" id="fig_datamodels_m2m_rel" caption="Figure 3-3. Many-to-many relationships in the relational model." class="w-full my-4" >}}
Many-to-one and many-to-many relationships do not easily fit within one self-contained JSON
document; they lend themselves more to a normalized representation. In a document model, one
-possible representation is given in [Example 3-2](/en/ch3#fig_datamodels_m2m_json) and illustrated in
-[Figure 3-4](/en/ch3#fig_datamodels_many_to_many): the data within each dotted rectangle can be grouped into one
+possible representation is given in [Example 3-2](/en/ch3#fig_datamodels_m2m_json) and illustrated in
+[Figure 3-4](/en/ch3#fig_datamodels_many_to_many): the data within each dotted rectangle can be grouped into one
document, but the links to organizations and schools are best represented as references to other
documents.
@@ -426,11 +428,11 @@ representation is denormalized, since the relationship is stored in two places,
inconsistent with each other.
A normalized representation stores the relationship in only one place, and relies on *secondary
-indexes* (which we discuss in [Chapter 4](/en/ch4#ch_storage)) to allow the relationship to be efficiently queried in
-both directions. In the relational schema of [Figure 3-3](/en/ch3#fig_datamodels_m2m_rel), we would tell the database
+indexes* (which we discuss in [Chapter 4](/en/ch4#ch_storage)) to allow the relationship to be efficiently queried in
+both directions. In the relational schema of [Figure 3-3](/en/ch3#fig_datamodels_m2m_rel), we would tell the database
to create indexes on both the `user_id` and the `org_id` columns of the `positions` table.
-In the document model of [Example 3-2](/en/ch3#fig_datamodels_m2m_json), the database needs to index the `org_id` field
+In the document model of [Example 3-2](/en/ch3#fig_datamodels_m2m_json), the database needs to index the `org_id` field
of objects inside the `positions` array. Many document databases and relational databases with JSON
support are able to create such indexes on values inside a document.
@@ -442,7 +444,7 @@ widely-used conventions for the structure of tables in a data warehouse: a *star
and *one big table* (OBT). These structures are optimized for the needs of business analysts. ETL
processes translate data from operational systems into this schema.
-[Figure 3-5](/en/ch3#fig_dwh_schema) shows an example of a star schema that might be found in the data warehouse of a grocery
+[Figure 3-5](/en/ch3#fig_dwh_schema) shows an example of a star schema that might be found in the data warehouse of a grocery
retailer. At the center of the schema is a so-called *fact table* (in this example, it is called
`fact_sales`). Each row of the fact table represents an event that occurred at a particular time
(here, each row represents a customer’s purchase of a product). If we were analyzing website traffic
@@ -460,7 +462,7 @@ Other columns in the fact table are foreign key references to other tables, call
tables*. As each row in the fact table represents an event, the dimensions represent the *who*,
*what*, *where*, *when*, *how*, and *why* of the event.
-For example, in [Figure 3-5](/en/ch3#fig_dwh_schema), one of the dimensions is the product that was sold. Each row in
+For example, in [Figure 3-5](/en/ch3#fig_dwh_schema), one of the dimensions is the product that was sold. Each row in
the `dim_product` table represents one type of product that is for sale, including its stock-keeping
unit (SKU), description, brand name, category, fat content, package size, etc. Each row in the
`fact_sales` table uses a foreign key to indicate which product was sold in that particular
@@ -470,7 +472,7 @@ Even date and time are often represented using dimension tables, because this al
information about dates (such as public holidays) to be encoded, allowing queries to differentiate
between sales on holidays and non-holidays.
-[Figure 3-5](/en/ch3#fig_dwh_schema) is an example of a star schema. The name comes from the fact that when the table
+[Figure 3-5](/en/ch3#fig_dwh_schema) is an example of a star schema. The name comes from the fact that when the table
relationships are visualized, the fact table is in the middle, surrounded by its dimension tables;
the connections to these tables are like the rays of a star.
@@ -516,7 +518,7 @@ many-to-many relationships. Let’s examine these arguments in more detail.
If the data in your application has a document-like structure (i.e., a tree of one-to-many
relationships, where typically the entire tree is loaded at once), then it’s probably a good idea to
use a document model. The relational technique of *shredding*—splitting a document-like structure
-into multiple tables (like `positions`, `education`, and `contact_info` in [Figure 3-1](/en/ch3#fig_obama_relational))
+into multiple tables (like `positions`, `education`, and `contact_info` in [Figure 3-1](/en/ch3#fig_obama_relational))
— can lead to cumbersome schemas and unnecessarily complicated application code.
The document model has limitations: for example, you cannot refer directly to a nested item within a
@@ -595,14 +597,14 @@ structure for some reason (i.e., the data is heterogeneous)—for example, becau
In situations like these, a schema may hurt more than it helps, and schemaless documents can be a
much more natural data model. But in cases where all records are expected to have the same
structure, schemas are a useful mechanism for documenting and enforcing that structure. We will
-discuss schemas and schema evolution in more detail in [Chapter 5](/en/ch5#ch_encoding).
+discuss schemas and schema evolution in more detail in [Chapter 5](/en/ch5#ch_encoding).
#### Data locality for reads and writes {#sec_datamodels_document_locality}
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant
thereof (such as MongoDB’s BSON). If your application often needs to access the entire document
(for example, to render it on a web page), there is a performance advantage to this *storage
-locality*. If data is split across multiple tables, like in [Figure 3-1](/en/ch3#fig_obama_relational), multiple
+locality*. If data is split across multiple tables, like in [Figure 3-1](/en/ch3#fig_obama_relational), multiple
index lookups are required to retrieve it all, which may require more disk seeks and take more time.
The locality advantage only applies if you need large parts of the document at the same time. The
@@ -755,7 +757,7 @@ as SQL support for querying graphs. Other graph query languages exist, such as G
but these will give us a representative overview.
To illustrate these different languages and models, this section uses the graph shown in
-[Figure 3-6](/en/ch3#fig_datamodels_graph) as running example. It could be taken from a social network or a
+[Figure 3-6](/en/ch3#fig_datamodels_graph) as a running example. It could be taken from a social network or a
genealogical database: it shows two people, Lucy from Idaho and Alain from Saint-Lô, France. They
are married and living in London. Each person and each location is represented as a vertex, and the
relationships between them as edges. This example will help demonstrate some queries that are easy
@@ -782,7 +784,7 @@ Each edge consists of:
* A collection of properties (key-value pairs)
You can think of a graph store as consisting of two relational tables, one for vertices and one for
-edges, as shown in [Example 3-3](/en/ch3#fig_graph_sql_schema) (this schema uses the PostgreSQL `jsonb` datatype to
+edges, as shown in [Example 3-3](/en/ch3#fig_graph_sql_schema) (this schema uses the PostgreSQL `jsonb` datatype to
store the properties of each vertex or edge). The head and tail vertex are stored for each edge; if
you want the set of incoming or outgoing edges for a vertex, you can query the `edges` table by
`head_vertex` or `tail_vertex`, respectively.
@@ -814,7 +816,7 @@ Some important aspects of this model are:
restricts which kinds of things can or cannot be associated.
2. Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus
*traverse* the graph—i.e., follow a path through a chain of vertices—both forward and backward.
- (That’s why [Example 3-3](/en/ch3#fig_graph_sql_schema) has indexes on both the `tail_vertex` and `head_vertex`
+ (That’s why [Example 3-3](/en/ch3#fig_graph_sql_schema) has indexes on both the `tail_vertex` and `head_vertex`
columns.)
3. By using different labels for different kinds of vertices and relationships, you can store
several different kinds of information in a single graph, while still maintaining a clean data
@@ -837,7 +839,7 @@ vertices or edges with certain properties to be found efficiently.
--------
Those features give graphs a great deal of flexibility for data modeling, as illustrated in
-[Figure 3-6](/en/ch3#fig_datamodels_graph). The figure shows a few things that would be difficult to express in a
+[Figure 3-6](/en/ch3#fig_datamodels_graph). The figure shows a few things that would be difficult to express in a
traditional relational schema, such as different kinds of regional structures in different countries
(France has *départements* and *régions*, whereas the US has *counties* and *states*), quirks of
history such as a country within a country (ignoring for now the intricacies of sovereign states and
@@ -859,8 +861,8 @@ and later developed into an open standard as *openCypher* [^38]. Besides Neo4j,
Amazon Neptune, Apache AGE (with storage in PostgreSQL), and others. It is named after a character
in the movie *The Matrix* and is not related to ciphers in cryptography [^39].
-[Example 3-4](/en/ch3#fig_cypher_create) shows the Cypher query to insert the lefthand portion of
-[Figure 3-6](/en/ch3#fig_datamodels_graph) into a graph database. The rest of the graph can be added similarly. Each
+[Example 3-4](/en/ch3#fig_cypher_create) shows the Cypher query to insert the lefthand portion of
+[Figure 3-6](/en/ch3#fig_datamodels_graph) into a graph database. The rest of the graph can be added similarly. Each
vertex is given a symbolic name like `usa` or `idaho`. That name is not stored in the database, but
only used internally within the query to create edges between the vertices, using an arrow notation:
`(idaho) -[:WITHIN]-> (usa)` creates an edge labeled `WITHIN`, with `idaho` as the tail node and
@@ -878,13 +880,13 @@ CREATE
(lucy) -[:BORN_IN]-> (idaho)
```
-When all the vertices and edges of [Figure 3-6](/en/ch3#fig_datamodels_graph) are added to the database, we can start
+When all the vertices and edges of [Figure 3-6](/en/ch3#fig_datamodels_graph) are added to the database, we can start
asking interesting questions: for example, *find the names of all the people who emigrated from the
United States to Europe*. That is, find all the vertices that have a `BORN_IN` edge to a location
within the US, and also a `LIVES_IN` edge to a location within Europe, and return the `name`
property of each of those vertices.
-[Example 3-5](/en/ch3#fig_cypher_query) shows how to express that query in Cypher. The same arrow notation is used in a
+[Example 3-5](/en/ch3#fig_cypher_query) shows how to express that query in Cypher. The same arrow notation is used in a
`MATCH` clause to find patterns in the graph: `(person) -[:BORN_IN]-> ()` matches any two vertices
that are related by an edge labeled `BORN_IN`. The tail vertex of that edge is bound to the
variable `person`, and the head vertex is left unnamed.
@@ -923,7 +925,7 @@ can be found through an incoming `BORN_IN` or `LIVES_IN` edge at one of the loca
### Graph Queries in SQL {#id58}
-[Example 3-3](/en/ch3#fig_graph_sql_schema) suggested that graph data can be represented in a relational database. But
+[Example 3-3](/en/ch3#fig_graph_sql_schema) suggested that graph data can be represented in a relational database. But
if we put graph data in a relational structure, can we also query it using SQL?
The answer is yes, but with some difficulty. Every edge that you traverse in a graph query is
@@ -943,7 +945,7 @@ or more times.” It is like the `*` operator in a regular expression.
Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using
something called *recursive common table expressions* (the `WITH RECURSIVE` syntax).
-[Example 3-6](/en/ch3#fig_graph_sql_query) shows the same query—finding the names of people who emigrated from the US
+[Example 3-6](/en/ch3#fig_graph_sql_query) shows the same query—finding the names of people who emigrated from the US
to Europe—expressed in SQL using this technique. However, the syntax is very clumsy in comparison to
Cypher.
@@ -1035,7 +1037,7 @@ The subject of a triple is equivalent to a vertex in a graph. The object is one
1. A value of a primitive datatype, such as a string or a number. In that case, the predicate and
object of the triple are equivalent to the key and value of a property on the subject vertex.
- Using the example from [Figure 3-6](/en/ch3#fig_datamodels_graph), (*lucy*, *birthYear*, *1989*) is like a vertex
+ Using the example from [Figure 3-6](/en/ch3#fig_datamodels_graph), (*lucy*, *birthYear*, *1989*) is like a vertex
`lucy` with properties `{"birthYear": 1989}`.
2. Another vertex in the graph. In that case, the predicate is an edge in the
graph, the subject is the tail vertex, and the object is the head vertex. For example, in
@@ -1051,7 +1053,7 @@ The subject of a triple is equivalent to a vertex in a graph. The object is one
> Since these databases retain the basic *subject-predicate-object* structure explained above, this
> book nevertheless calls them triple-stores.
-[Example 3-7](/en/ch3#fig_graph_n3_triples) shows the same data as in [Example 3-4](/en/ch3#fig_cypher_create), written as
+[Example 3-7](/en/ch3#fig_graph_n3_triples) shows the same data as in [Example 3-4](/en/ch3#fig_cypher_create), written as
triples in a format called *Turtle*, a subset of *Notation3* (*N3*) [^48].
{{< figure id="fig_graph_n3_triples" title="Example 3-7. A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as Turtle triples" class="w-full my-4" >}}
@@ -1081,7 +1083,7 @@ _:usa`. When the predicate is a property, the object is a string literal, as in
It’s quite repetitive to repeat the same subject over and over again, but fortunately you can use
semicolons to say multiple things about the same subject. This makes the Turtle format quite
-readable: see [Example 3-8](/en/ch3#fig_graph_n3_shorthand).
+readable: see [Example 3-8](/en/ch3#fig_graph_n3_shorthand).
{{< figure id="fig_graph_n3_shorthand" title="Example 3-8. A more concise way of writing the data in [Example 3-7](/en/ch3#fig_graph_n3_triples)" class="w-full my-4" >}}
@@ -1112,10 +1114,10 @@ case: even if you have no interest in the Semantic Web, triples can be a good in
#### The RDF data model {#the-rdf-data-model}
-The Turtle language we used in [Example 3-8](/en/ch3#fig_graph_n3_shorthand) is actually a way of encoding data in the
+The Turtle language we used in [Example 3-8](/en/ch3#fig_graph_n3_shorthand) is actually a way of encoding data in the
*Resource Description Framework* (RDF) [^55],
a data model that was designed for the Semantic Web. RDF data can also be encoded in other ways, for
-example (more verbosely) in XML, as shown in [Example 3-9](/en/ch3#fig_graph_rdf_xml). Tools like Apache Jena can
+example (more verbosely) in XML, as shown in [Example 3-9](/en/ch3#fig_graph_rdf_xml). Tools like Apache Jena can
automatically convert between different RDF encodings.
{{< figure id="fig_graph_rdf_xml" title="Example 3-9. The data of [Example 3-8](/en/ch3#fig_graph_n3_shorthand), expressed using RDF/XML syntax" class="w-full my-4" >}}
@@ -1169,7 +1171,7 @@ It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQ
similar.
The same query as before—finding people who have moved from the US to Europe—is similarly concise in
-SPARQL as it is in Cypher (see [Example 3-10](/en/ch3#fig_sparql_query)).
+SPARQL as it is in Cypher (see [Example 3-10](/en/ch3#fig_sparql_query)).
{{< figure id="fig_sparql_query" title="Example 3-10. The same query as [Example 3-5](/en/ch3#fig_cypher_query), expressed in SPARQL" class="w-full my-4" >}}
@@ -1224,8 +1226,8 @@ columns: *ID*, *name*, and *type*. The fact that the US is a country could then
`table(val1, val2, …)` means that `table` contains a row where the first column contains `val1`,
the second column contains `val2`, and so on.
-[Example 3-11](/en/ch3#fig_datalog_triples) shows how to write the data from the left-hand side of
-[Figure 3-6](/en/ch3#fig_datamodels_graph) in Datalog. The edges of the graph (`within`, `born_in`, and `lives_in`)
+[Example 3-11](/en/ch3#fig_datalog_triples) shows how to write the data from the left-hand side of
+[Figure 3-6](/en/ch3#fig_datamodels_graph) in Datalog. The edges of the graph (`within`, `born_in`, and `lives_in`)
are represented as two-column join tables. For example, Lucy has the ID 100 and Idaho has the ID 3,
so the relationship “Lucy was born in Idaho” is represented as `born_in(100, 3)`.
@@ -1244,7 +1246,7 @@ born_in(100, 3). /* Lucy was born in Idaho */
```
Now that we have defined the data, we can write the same query as before, as shown in
-[Example 3-12](/en/ch3#fig_datalog_query). It looks a bit different from the equivalent in Cypher or SPARQL, but don’t
+[Example 3-12](/en/ch3#fig_datalog_query). It looks a bit different from the equivalent in Cypher or SPARQL, but don’t
let that put you off. Datalog is a subset of Prolog, a programming language that you might have seen
before if you’ve studied computer science.
@@ -1271,7 +1273,7 @@ define *rules* that derive new virtual tables from the underlying facts. These d
like (virtual) SQL views: they are not stored in the database, but you can query them in the same
way as a table containing stored facts.
-In [Example 3-12](/en/ch3#fig_datalog_query) we define three derived tables: `within_recursive`, `migrated`, and
+In [Example 3-12](/en/ch3#fig_datalog_query) we define three derived tables: `within_recursive`, `migrated`, and
`us_to_europe`. The name and columns of the virtual tables are defined by what appears before the
`:-` symbol of each rule. For example, `migrated(PName, BornIn, LivingIn)` is a virtual table with
three columns: the name of a person, the name of the place where they were born, and the name of the
@@ -1284,7 +1286,7 @@ variable `PName` bound to the value `"Lucy"`. A rule applies if the system can f
*all* patterns on the righthand side of the `:-` operator. When the rule applies, it’s as though the
lefthand side of the `:-` was added to the database (with variables replaced by the values they matched).
-One possible way of applying the rules is thus (and as illustrated in [Figure 3-7](/en/ch3#fig_datalog_naive)):
+One possible way of applying the rules is thus (and as illustrated in [Figure 3-7](/en/ch3#fig_datalog_naive)):
1. `location(1, "North America", "continent")` exists in the database, so rule 1 applies. It generates `within_recursive(1, "North America")`.
2. `within(2, 1)` exists in the database and the previous step generated `within_recursive(1, "North America")`, so rule 2 applies. It generates `within_recursive(2, "North America")`.
@@ -1295,7 +1297,7 @@ locations in North America (or any other location) contained in our database.
{{< figure link="#fig_datalog_query" src="/fig/ddia_0307.png" id="fig_datalog_naive" title="Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from Example 3-12." class="w-full my-4" >}}
-> Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](/en/ch3#fig_datalog_query).
+> Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](/en/ch3#fig_datalog_query).
Now rule 3 can find people who were born in some location `BornIn` and live in some location
`LivingIn`. Rule 4 invokes rule 3 with `BornIn = 'United States'` and
@@ -1307,7 +1309,7 @@ The Datalog approach requires a different kind of thinking compared to the other
discussed in this chapter. It allows complex queries to be built up rule by rule, with one rule
referring to other rules, similarly to the way that you break down code into functions that call
each other. Just like functions can be recursive, Datalog rules can also invoke themselves, like
-rule 2 in [Example 3-12](/en/ch3#fig_datalog_query), which enables graph traversals in Datalog queries.
+rule 2 in [Example 3-12](/en/ch3#fig_datalog_query), which enables graph traversals in Datalog queries.
### GraphQL {#id63}
@@ -1319,7 +1321,7 @@ interfaces allow developers to rapidly change queries in client code without cha
GraphQL’s flexibility comes at a cost. Organizations that adopt GraphQL often need tooling to
convert GraphQL queries into requests to internal services, which often use REST or gRPC (see
-[Chapter 5](/en/ch5#ch_encoding)). Authorization, rate limiting, and performance challenges are additional concerns [^61].
+[Chapter 5](/en/ch5#ch_encoding)). Authorization, rate limiting, and performance challenges are additional concerns [^61].
GraphQL’s query language is also limited since GraphQL queries come from an untrusted source. The language
does not allow anything that could be expensive to execute, since otherwise users could perform
denial-of-service attacks on a server by running lots of expensive queries. In particular, GraphQL
@@ -1327,7 +1329,7 @@ does not allow recursive queries (unlike Cypher, SPARQL, SQL, or Datalog), and i
arbitrary search conditions such as “find people who were born in the US and are now living in
Europe” (unless the service owners specifically choose to offer such search functionality).
-Nevertheless, GraphQL is useful. [Example 3-13](/en/ch3#fig_graphql_query) shows how you might implement a group chat
+Nevertheless, GraphQL is useful. [Example 3-13](/en/ch3#fig_graphql_query) shows how you might implement a group chat
application such as Discord or Slack using GraphQL. The query requests all the channels that the
user has access to, including the channel name and the 50 most recent messages in each channel. For
each message it requests the timestamp, the message content, and the name and profile picture URL
@@ -1359,7 +1361,7 @@ query ChatApp {
}
```
-[Example 3-14](/en/ch3#fig_graphql_response) shows what a response to the query in [Example 3-13](/en/ch3#fig_graphql_query) might look
+[Example 3-14](/en/ch3#fig_graphql_response) shows what a response to the query in [Example 3-13](/en/ch3#fig_graphql_query) might look
like. The response is a JSON document that mirrors the structure of the query: it contains exactly
those attributes that were requested, no more and no less. This approach has the advantage that the
server does not need to know which attributes the client requires in order to render the user
@@ -1395,13 +1397,13 @@ were changed to add that profile picture, it would be easy for the client to add
...
```
-In [Example 3-14](/en/ch3#fig_graphql_response) the name and image URL of a message sender is embedded directly in the
+In [Example 3-14](/en/ch3#fig_graphql_response) the name and image URL of a message sender are embedded directly in the
message object. If the same user sends multiple messages, this information is repeated on each
message. In principle, it would be possible to reduce this duplication, but GraphQL makes the design
choice to accept a larger response size in order to make it simpler to render the user interface
based on the data.
-The `replyTo` field is similar: in [Example 3-14](/en/ch3#fig_graphql_response), the second message is a reply to the
+The `replyTo` field is similar: in [Example 3-14](/en/ch3#fig_graphql_response), the second message is a reply to the
first, and the content (“Hey!…”) and sender Aaliyah are duplicated under `replyTo`. It would be
possible to instead return the ID of the message being replied to, but then the client would have to
make an additional request to the server if that ID is not among the 50 most recent messages
@@ -1439,7 +1441,7 @@ timestamp, and then append it to a sequence of events. Events in this log are *i
change or delete them, you only ever append more events to the log (which may supersede earlier
events). An event can contain arbitrary properties.
-[Figure 3-8](/en/ch3#fig_event_sourcing) shows an example that could be taken from a conference management system. A
+[Figure 3-8](/en/ch3#fig_event_sourcing) shows an example that could be taken from a conference management system. A
conference can be a complex business domain: not only can individual attendees register and pay by
card, but companies can also order seats in bulk, pay by invoice, and then later assign the seats to
individual people. Some number of seats may be reserved for speakers, sponsors, volunteer helpers,
@@ -1449,7 +1451,7 @@ calculating the number of available seats becomes a challenging query.
{{< figure src="/fig/ddia_0308.png" id="fig_event_sourcing" title="Figure 3-8. Using a log of immutable events as source of truth, and deriving materialized views from it." class="w-full my-4" >}}
-In [Figure 3-8](/en/ch3#fig_event_sourcing), every change to the state of the conference (such as the organizer
+In [Figure 3-8](/en/ch3#fig_event_sourcing), every change to the state of the conference (such as the organizer
opening registrations, or attendees making and cancelling registrations) is first stored as an
event. Whenever an event is appended to the log, several *materialized views* (also known as
*projections* or *read models*) are also updated to reflect the effect of that event. In the
@@ -1540,11 +1542,11 @@ You can implement event sourcing on top of any database, but there are also some
specifically designed to support this pattern, such as EventStoreDB, MartenDB (based on PostgreSQL),
and Axon Framework. You can also use message brokers such as Apache Kafka to store the event log,
and stream processors can keep the materialized views up-to-date; we will return to these topics in
-[Link to Come].
+[“Change data capture versus event sourcing”](/en/ch12#sec_stream_event_sourcing).
The only important requirement is that the event storage system must guarantee that all materialized
views process the events in exactly the same order as they appear in the log; as we shall see in
-[Chapter 10](/en/ch10#ch_consistency), this is not always easy to achieve in a distributed system.
+[Chapter 10](/en/ch10#ch_consistency), this is not always easy to achieve in a distributed system.
## Dataframes, Matrices, and Arrays {#sec_datamodels_dataframes}
@@ -1579,7 +1581,7 @@ For example, a common use of dataframes is to transform data from a relational-l
into a matrix or multidimensional array representation, which is the form that many machine learning
algorithms expect of their input.
-A simple example of such a transformation is shown in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix). On the left we
+A simple example of such a transformation is shown in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix). On the left we
have a relational table of how different users have rated various movies (on a scale of 1 to 5), and
on the right the data has been transformed into a matrix where each column is a movie and each row
is a user (similarly to a *pivot table* in a spreadsheet). The matrix is *sparse*, which means there
@@ -1592,7 +1594,7 @@ that offer sparse arrays (such as NumPy for Python) can handle such data easily.
A matrix can only contain numbers, and various techniques are used to transform non-numerical data
into numbers in the matrix. For example:
-* Dates (which are omitted from the example matrix in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix)) could be scaled
+* Dates (which are omitted from the example matrix in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix)) could be scaled
to be floating-point numbers within some suitable range.
* For columns that can only take one of a small, fixed set of values (for example, the genre of a
movie in a database of movies), a *one-hot encoding* is often used: we create a column for each
@@ -1603,7 +1605,7 @@ into numbers in the matrix. For example:
Once the data is in the form of a matrix of numbers, it is amenable to linear algebra operations,
which form the basis of many machine learning algorithms. For example, the data in
-[Figure 3-9](/en/ch3#fig_dataframe_to_matrix) could be a part of a system for recommending movies that the user may
+[Figure 3-9](/en/ch3#fig_dataframe_to_matrix) could be a part of a system for recommending movies that the user may
like. Dataframes are flexible enough to allow data to be gradually evolved from a relational form
into a matrix representation, while giving the data scientist control over the representation that
is most suitable for achieving the goals of the data analysis or model training process.
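The relational-to-matrix transformation described above can be sketched in plain Python. The ratings, user IDs, and movie IDs are made-up values, and a real workload would more likely use a dataframe library; this only illustrates the pivot and the sparse representation.

```python
# (user_id, movie_id, rating) rows, as in a relational table; values are made up.
ratings = [
    (1, "m1", 5),
    (1, "m3", 3),
    (2, "m2", 4),
]

users = sorted({u for u, _, _ in ratings})    # one matrix row per user
movies = sorted({m for _, m, _ in ratings})   # one matrix column per movie

# Sparse form: only the cells that actually have a rating are stored.
sparse = {(u, m): r for u, m, r in ratings}

# Dense form: missing ratings become None (empty cells of the matrix).
dense = [[sparse.get((u, m)) for m in movies] for u in users]
```

The sparse dictionary holds three entries, while the dense matrix has six cells, half of them empty; the gap grows quickly with thousands of movies.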
@@ -1648,7 +1650,7 @@ gradually improving.
Another model we discussed is *event sourcing*, which represents data as an append-only log of
immutable events, and which can be advantageous for modeling activities in complex business domains.
-An append-only log is good for writing data (as we shall see in [Chapter 4](/en/ch4#ch_storage)); in order to support
+An append-only log is good for writing data (as we shall see in [Chapter 4](/en/ch4#ch_storage)); in order to support
efficient queries, the event log is translated into read-optimized materialized views through CQRS.
One thing that non-relational data models have in common is that they typically don’t enforce a
diff --git a/content/en/ch4.md b/content/en/ch4.md
index af8b274..bf20a18 100644
--- a/content/en/ch4.md
+++ b/content/en/ch4.md
@@ -4,6 +4,8 @@ weight: 104
breadcrumbs: false
---
+
+

> *One of the miseries of life is that everybody names things a little bit wrong. And so it makes
@@ -17,7 +19,7 @@ breadcrumbs: false
On the most fundamental level, a database needs to do two things: when you give it some data, it
should store the data, and when you ask it again later, it should give the data back to you.
-In [Chapter 3](/en/ch3#ch_datamodels) we discussed data models and query languages—i.e., the format in which you give
+In [Chapter 3](/en/ch3#ch_datamodels) we discussed data models and query languages—i.e., the format in which you give
the database your data, and the interface through which you can ask for it again later. In this
chapter we discuss the same from the database’s point of view: how the database can store the data
that you give it, and how it can find the data again when you ask for it.
@@ -140,7 +142,7 @@ your application the greatest benefit, without introducing more overhead on writ
To start, let’s assume that you want to continue storing data in the append-only file written by
`db_set`, and you just want to speed up reads. One way you could do this is by keeping a hash map in
memory, in which every key is mapped to the byte offset in the file at which the most recent value
-for that key can be found, as illustrated in [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
+for that key can be found, as illustrated in [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
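A minimal sketch of this idea follows. It assumes an in-memory byte buffer standing in for the data file and a hypothetical `HashIndexedLog` class; neither is taken from the `db_set`/`db_get` scripts in the text, and keys containing commas are not handled.

```python
import io


class HashIndexedLog:
    """Append-only log of "key,value" lines plus an in-memory hash index."""

    def __init__(self):
        self.log = io.BytesIO()   # stands in for the append-only file on disk
        self.index = {}           # key -> byte offset of the most recent record

    def set(self, key, value):
        offset = self.log.seek(0, io.SEEK_END)   # always append at the end
        self.log.write(f"{key},{value}\n".encode("utf-8"))
        self.index[key] = offset                  # point at the newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None                           # key was never written
        self.log.seek(offset)                     # one seek, no file scan
        line = self.log.readline().decode("utf-8").rstrip("\n")
        _, _, value = line.partition(",")
        return value
```

Writing the same key twice leaves both records in the log, but the index only points at the later one, so reads return the most recent value.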
{{< figure src="/fig/ddia_0401.png" id="fig_storage_csv_hash_index" caption="Figure 4-1. Storing a log of key-value pairs in a CSV-like format, indexed with an in-memory hash map." class="w-full my-4" >}}
@@ -167,7 +169,7 @@ This approach is much faster, but it still suffers from several problems:
In practice, hash tables are not used very often for database indexes, and instead it is much more
common to keep data in a structure that is *sorted by key* [^3].
One example of such a structure is a *Sorted String Table*, or *SSTable* for short, as shown in
-[Figure 4-2](/en/ch4#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that
+[Figure 4-2](/en/ch4#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that
they are sorted by key, and each key only appears once in the file.
{{< figure src="/fig/ddia_0402.png" id="fig_storage_sstable_index" caption="Figure 4-2. An SSTable with a sparse index, allowing queries to jump to the right block." class="w-full my-4" >}}
@@ -178,7 +180,7 @@ This kind of index, which stores only some of the keys, is called *sparse*. This
a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data
structure that allows queries to quickly look up a particular key [^4].
-For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the
+For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the
first key of the next block is `handsome`. Now say you’re looking for the key `handiwork`, which
doesn’t appear in the sparse index. Because of the sorting you know that `handiwork` must appear
between `handbag` and `handsome`. This means you can seek to the offset for `handbag` and scan the
@@ -186,7 +188,7 @@ file from there until you find `handiwork` (or not, if the key is not present in
of a few kilobytes can be scanned very quickly.
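The lookup just described can be sketched with a binary search over the sparse index. The block contents and values below are invented for illustration, and in-memory lists stand in for blocks on disk.

```python
import bisect

# Hypothetical sparse index: the first key of each block, in sorted order.
block_first_keys = ["handbag", "handsome"]

# Hypothetical block contents (sorted key-value pairs); values are made up.
blocks = [
    [("handbag", 8786), ("handcuffs", 2729), ("handiwork", 11481)],
    [("handsome", 86478), ("handwriting", 5741)],
]


def sstable_get(key):
    # Find the last block whose first key is <= the requested key.
    i = bisect.bisect_right(block_first_keys, key) - 1
    if i < 0:
        return None              # key sorts before the first block
    for k, v in blocks[i]:       # scan within one small block
        if k == key:
            return v
        if k > key:
            break                # sorted order: the key cannot appear later
    return None
```

Here `sstable_get("handiwork")` lands in the block starting at `handbag` and finds the key there, while a lookup for an absent key such as `handful` stops as soon as it passes the position where the key would have been.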
Moreover, each block of records can be compressed (indicated by the shaded area in
-[Figure 4-2](/en/ch4#fig_storage_sstable_index)). Besides saving disk space, compression also reduces the I/O
+[Figure 4-2](/en/ch4#fig_storage_sstable_index)). Besides saving disk space, compression also reduces the I/O
bandwidth use, at the cost of using a bit more CPU time.
#### Constructing and merging SSTables {#constructing-and-merging-sstables}
@@ -217,7 +219,7 @@ log and a sorted file:
and to discard overwritten or deleted values.
Merging segments works similarly to the *mergesort* algorithm [^5]. The process is illustrated in
-[Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key
+[Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key
in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If
the same key appears in more than one input file, keep only the more recent value. This produces a
new merged segment file, also sorted by key, with one value per key, and it uses minimal memory
@@ -258,7 +260,9 @@ the memtable or while merging segments, the database can just delete the unfinis
start afresh. The log that persists writes to the memtable could contain incomplete records if there
was a crash halfway through writing a record, or if the disk was full; these are typically detected
by including checksums in the log, and discarding corrupted or incomplete log entries. We will talk
-more about durability and crash recovery in [Chapter 8](/en/ch8#ch_transactions).
+more about durability and crash recovery in [Chapter 8](/en/ch8#ch_transactions).
+
+
#### Bloom filters {#bloom-filters}
@@ -268,7 +272,7 @@ reads, LSM storage engines often include a *Bloom filter* [^13]
in each segment, which provides a fast but approximate way of checking whether a particular key
appears in a particular SSTable.
-[Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in
+[Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in
reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash
function, producing a set of numbers that are then interpreted as indexes into the array of bits [^14].
We set the bits corresponding to those indexes to 1, and leave the rest as 0. For example, the key
@@ -279,7 +283,7 @@ extra space, but the Bloom filter is generally small compared to the rest of the
{{< figure src="/fig/ddia_0404.png" id="fig_storage_bloom" caption="Figure 4-4. A Bloom filter provides a fast, probabilistic check whether a particular key exists in a particular SSTable." class="w-full my-4" >}}
When we want to know whether a key appears in the SSTable, we compute the same hash of that key as
-before, and check the bits at those indexes. For example, in [Figure 4-4](/en/ch4#fig_storage_bloom), we’re querying
+before, and check the bits at those indexes. For example, in [Figure 4-4](/en/ch4#fig_storage_bloom), we’re querying
the key `handheld`, which hashes to (6, 11, 2). One of those bits is 1 (namely, bit number 2),
while the other two are 0. These checks can be made extremely fast using the bitwise operations that
all CPUs support.
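As a rough sketch of the mechanism, the following assumes three bit positions per key derived from salted SHA-256 digests. That hashing scheme is an assumption made for illustration (real storage engines use cheaper non-cryptographic hashes), so the bit positions will not match those in the figure.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0                     # bit array packed into one integer

    def _indexes(self, key):
        # Derive several bit positions per key from salted digests.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for idx in self._indexes(key):
            self.bits |= 1 << idx         # set the bit at each index

    def might_contain(self, key):
        # False means definitely absent; True only means "possibly present".
        return all((self.bits >> idx) & 1 for idx in self._indexes(key))
```

A key that was added always yields `True`. A key that was never added usually yields `False`, but may yield `True` if all of its bits happen to have been set by other keys.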
@@ -333,6 +337,8 @@ characteristics in more detail in [“Comparing B-Trees and LSM-Trees”](/en/ch
--------
+
+
> [!TIP] EMBEDDED STORAGE ENGINES
Many databases run as a service that accepts queries over a network, but there are also *embedded*
@@ -349,7 +355,7 @@ queries that combine data from multiple tenants), you can potentially use a sepa
database instance per tenant [^20].
The storage and retrieval methods we discuss in this chapter are used in both embedded and in
-client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques
+client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques
for scaling a database across multiple machines.
--------
@@ -370,14 +376,14 @@ philosophy.
The log-structured indexes we saw earlier break the database down into variable-size *segments*,
typically several megabytes or more in size, that are written once and are then immutable. By
contrast, B-trees break the database down into fixed-size *blocks* or *pages*, and may overwrite a
-page in-place. A page is traditionally 4 KiB in size, but PostgreSQL now uses 8 KiB and
-MySQL uses 16 KiB by default.
+page in-place. A page is traditionally 4 KiB in size, but PostgreSQL now uses 8 KiB and
+MySQL uses 16 KiB by default.
Each page can be identified using a page number, which allows one page to refer to another—similar
to a pointer, but on disk instead of in memory. If all the pages are stored in the same file,
multiplying the page number by the page size gives us the byte offset in the file where the page is
located. We can use these page references to construct a tree of pages, as illustrated in
-[Figure 4-5](/en/ch4#fig_storage_b_tree).
+[Figure 4-5](/en/ch4#fig_storage_b_tree).
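The page-number arithmetic can be sketched in a few lines. The 4 KiB page size and the `read_page` helper are assumptions for illustration only.

```python
import io

PAGE_SIZE = 4096  # 4 KiB; an assumed page size for this sketch


def read_page(f, page_number, page_size=PAGE_SIZE):
    """Read one fixed-size page from a file holding all pages back to back."""
    f.seek(page_number * page_size)   # page number -> byte offset
    return f.read(page_size)


# Three pages, each filled with its own page number, in an in-memory "file".
pages_file = io.BytesIO(b"".join(bytes([n]) * PAGE_SIZE for n in range(3)))
page_two = read_page(pages_file, 2)
```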
{{< figure src="/fig/ddia_0405.png" id="fig_storage_b_tree" caption="Figure 4-5. Looking up the key 251 using a B-tree index. From the root page we first follow the reference to the page for keys 200–300, then the page for keys 250–270." class="w-full my-4" >}}
@@ -388,14 +394,14 @@ where the boundaries between those ranges lie.
(This structure is sometimes called a B+ tree, but we don’t need to distinguish it
from other B-tree variants.)
-In the example in [Figure 4-5](/en/ch4#fig_storage_b_tree), we are looking for the key 251, so we know that we need to
+In the example in [Figure 4-5](/en/ch4#fig_storage_b_tree), we are looking for the key 251, so we know that we need to
follow the page reference between the boundaries 200 and 300. That takes us to a similar-looking
page that further breaks down the 200–300 range into subranges. Eventually we get down to a
page containing individual keys (a *leaf page*), which either contains the value for each key
inline or contains references to the pages where the values can be found.
The number of references to child pages in one page of the B-tree is called the *branching factor*.
-For example, in [Figure 4-5](/en/ch4#fig_storage_b_tree) the branching factor is six. In practice, the branching
+For example, in [Figure 4-5](/en/ch4#fig_storage_b_tree) the branching factor is six. In practice, the branching
factor depends on the amount of space required to store the page references and the range
boundaries, but typically it is several hundred.
@@ -408,7 +414,7 @@ of key ranges.
{{< figure src="/fig/ddia_0406.png" id="fig_storage_b_tree_split" caption="Figure 4-6. Growing a B-tree by splitting a page on the boundary key 337. The parent page is updated to reference both children." class="w-full my-4" >}}
-In the example of [Figure 4-6](/en/ch4#fig_storage_b_tree_split), we want to insert the key 334, but the page for the
+In the example of [Figure 4-6](/en/ch4#fig_storage_b_tree_split), we want to insert the key 334, but the page for the
range 333–345 is already full. We therefore split it into a page for the range 333–337 (including
the new key), and a page for 337–344. We also have to update the parent page to have references to
both children, with a boundary value of 337 between them. If the parent page doesn’t have enough
@@ -417,9 +423,9 @@ to the root of the tree. When the root is split, we make a new root above it. De
may require nodes to be merged) is more complex [^5].
This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth
-of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so
+of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so
you don’t need to follow many page references to find the page you are looking for. (A four-level
-tree of 4 KiB pages with a branching factor of 500 can store up to 250 TB.)
+tree of 4 KiB pages with a branching factor of 500 can store up to 250 TB.)
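The figure in parentheses can be reproduced approximately. This reading assumes that four levels of 500-way branching address 500⁴ pages of 4 KiB each; the text does not spell out the calculation, so treat it as one plausible derivation.

```python
PAGE_SIZE = 4 * 1024        # 4 KiB per page
BRANCHING_FACTOR = 500      # child references per page
LEVELS = 4                  # levels of branching

pages_reachable = BRANCHING_FACTOR ** LEVELS     # 62,500,000,000 pages
capacity_bytes = pages_reachable * PAGE_SIZE     # 256,000,000,000,000 bytes
capacity_tb = capacity_bytes / 10**12            # 256.0 (decimal terabytes)
```

The result is 256 TB, which rounds to the quoted 250 TB (taking a page as 4,000 bytes gives exactly 250 TB).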
#### Making B-trees reliable {#sec_storage_btree_wal}
@@ -530,14 +536,14 @@ flash memory attached to the PCI Express bus) have now overtaken HDDs for many u
are not subject to such mechanical limitations.
Nevertheless, SSDs also have higher throughput for sequential writes than for random writes.
-The reason is that flash memory can be read or written one page (typically 4 KiB) at a time,
-but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block
+The reason is that flash memory can be read or written one page (typically 4 KiB) at a time,
+but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block
may contain valid data, whereas others may contain data that is no longer needed. Before erasing a
block, the controller must first move pages containing valid data into other blocks; this process is
called *garbage collection* (GC) [^33].
A sequential write workload writes larger chunks of data at a time, so it is likely that a whole
-512 KiB block belongs to a single file; when that file is later deleted again, the whole block
+512 KiB block belongs to a single file; when that file is later deleted again, the whole block
can be erased without having to perform any GC. On the other hand, with a random write workload, it
is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
to perform more work before a block can be erased [^34] [^35] [^36].
@@ -624,7 +630,7 @@ to that row/document/vertex by its primary key (or ID), and the index is used to
It is also very common to have *secondary indexes*. In relational databases, you can create several
secondary indexes on the same table using the `CREATE INDEX` command, allowing you to search by
-columns other than the primary key. For example, in [Figure 3-1](/en/ch3#fig_obama_relational) in [Chapter 3](/en/ch3#ch_datamodels)
+columns other than the primary key. For example, in [Figure 3-1](/en/ch3#fig_obama_relational) in [Chapter 3](/en/ch3#ch_datamodels)
you would most likely have a secondary index on the `user_id` columns so that you can find all the
rows belonging to the same user in each of the tables.
@@ -791,7 +797,7 @@ rows), so in this section we will focus on storage of facts.
Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4
or 5 of them at one time (`"SELECT *"` queries are rarely needed for analytics) [^52]. Take the query in
-[Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone
+[Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone
buying fruit or candy during the 2024 calendar year), but it only needs to access three columns of
the `fact_sales` table: `date_key`, `product_sk`,
and `quantity`. The query ignores all other columns.
@@ -816,9 +822,9 @@ How can we execute this query efficiently?
In most OLTP databases, storage is laid out in a *row-oriented* fashion: all the values from one row
of a table are stored next to each other. Document databases are similar: an entire document is
-typically stored as one contiguous sequence of bytes. You can see this in the CSV example of [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
+typically stored as one contiguous sequence of bytes. You can see this in the CSV example of [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
-In order to process a query like [Example 4-1](/en/ch4#fig_storage_analytics_query), you may have indexes on
+In order to process a query like [Example 4-1](/en/ch4#fig_storage_analytics_query), you may have indexes on
`fact_sales.date_key` and/or `fact_sales.product_sk` that tell the storage engine where to find
all the sales for a particular date or for a particular product. But then, a row-oriented storage
engine still needs to load all of those rows (each consisting of over 100 attributes) from disk into
@@ -828,8 +834,8 @@ long time.
The idea behind *column-oriented* (or *columnar*) storage is simple: don’t store all the values from
one row together, but store all the values from each *column* together instead [^56].
If each column is stored separately, a query only needs to read and parse those columns that are
-used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using
-an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema).
+used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using
+an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema).
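The difference between the two layouts can be sketched as follows. The column names come from the `fact_sales` example, but the rows are made-up values and Python lists stand in for column files.

```python
# Row-oriented: each record keeps all of its values together (made-up data).
rows = [
    {"date_key": 240102, "product_sk": 69, "store_sk": 4, "quantity": 1},
    {"date_key": 240102, "product_sk": 74, "store_sk": 3, "quantity": 3},
    {"date_key": 240103, "product_sk": 69, "store_sk": 5, "quantity": 2},
]

# Column-oriented: one sequence per column; row i is the i-th entry of each.
columns = {name: [row[name] for row in rows] for name in rows[0]}

# A query on two columns touches only those two sequences.
quantity_of_69 = sum(
    q for p, q in zip(columns["product_sk"], columns["quantity"]) if p == 69
)
```

The query never reads `date_key` or `store_sk`, which is the saving that matters when a table has over 100 columns.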
--------
@@ -864,10 +870,10 @@ Besides only loading those columns from disk that are required for a query, we c
the demands on disk throughput and network bandwidth by compressing data. Fortunately,
column-oriented storage often lends itself very well to compression.
-Take a look at the sequences of values for each column in [Figure 4-7](/en/ch4#fig_column_store): they often look quite
+Take a look at the sequences of values for each column in [Figure 4-7](/en/ch4#fig_column_store): they often look quite
repetitive, which is a good sign for compression. Depending on the data in the column, different
compression techniques can be used. One technique that is particularly effective in data warehouses
-is *bitmap encoding*, illustrated in [Figure 4-8](/en/ch4#fig_bitmap_index).
+is *bitmap encoding*, illustrated in [Figure 4-8](/en/ch4#fig_bitmap_index).
{{< figure src="/fig/ddia_0408.png" id="fig_bitmap_index" caption="Figure 4-8. Compressed, bitmap-indexed storage of a single column." class="w-full my-4" >}}
@@ -880,7 +886,7 @@ not.
One option is to store those bitmaps using one bit per row. However, these bitmaps typically contain
a lot of zeros (we say that they are *sparse*). In that case, the bitmaps can additionally be
run-length encoded: counting the number of consecutive zeros or ones and storing that number, as
-shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the
+shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the
two bitmap representations, using whichever is the most compact [^73].
This can make the encoding of a column remarkably efficient.
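A small sketch of bitmap encoding followed by run-length encoding is shown below. The column values are invented, and the exact run-length format (alternating counts of zeros and ones, starting with zeros, with trailing runs kept) is an assumption; real formats differ in detail.

```python
from itertools import groupby


def bitmaps(column):
    """One bitmap (list of 0/1, one bit per row) per distinct column value."""
    return {v: [1 if x == v else 0 for x in column] for v in sorted(set(column))}


def run_lengths(bitmap):
    """Alternating run lengths, starting with the number of leading zeros."""
    runs = [(bit, len(list(group))) for bit, group in groupby(bitmap)]
    lengths = [n for _, n in runs]
    if runs and runs[0][0] == 1:
        lengths.insert(0, 0)    # bitmap starts with ones: zero leading zeros
    return lengths


product_sk = [69, 69, 69, 74, 31, 31, 69]     # made-up column values
encoded = {v: run_lengths(b) for v, b in bitmaps(product_sk).items()}
```

For a long, sparse bitmap the list of run lengths is much shorter than one bit per row, which is where the space saving comes from.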
@@ -928,7 +934,7 @@ last month, it might make sense to make `date_key` the first sort key. Then the
scan only the rows from the last month, which will be much faster than scanning all rows.
A second column can determine the sort order of any rows that have the same value in the first
-column. For example, if `date_key` is the first sort key in [Figure 4-7](/en/ch4#fig_column_store), it might make
+column. For example, if `date_key` is the first sort key in [Figure 4-7](/en/ch4#fig_column_store), it might make
sense for `product_sk` to be the second sort key so that all sales for the same product on the same
day are grouped together in storage. That will help queries that need to group or filter sales by
product within a certain date range.
@@ -936,7 +942,7 @@ product within a certain date range.
Another advantage of sorted order is that it can help with compression of columns. If the primary
sort column does not have many distinct values, then after sorting, it will have long sequences
where the same value is repeated many times in a row. A simple run-length encoding, like we used for
-the bitmaps in [Figure 4-8](/en/ch4#fig_bitmap_index), could compress that column down to a few kilobytes—even if
+the bitmaps in [Figure 4-8](/en/ch4#fig_bitmap_index), could compress that column down to a few kilobytes—even if
the table has billions of rows.
That compression effect is strongest on the first sort key. The second and third sort keys will be
@@ -1004,7 +1010,7 @@ Vectorized processing
and get back a bitmap (one bit per value in the input column, which is 1 if it’s a banana); we could
then pass the `store_sk` column and the ID of the store of interest to the same equality operator,
and get back another bitmap; and then we could pass the two bitmaps to a “bitwise AND” operator, as
- shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in
+ shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in
a particular store.
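The operator pipeline described above can be sketched with Python integers used as bitmaps. This is only an analogy for vectorized execution: a single `&` on two integers combines many bits at once, but it does not use SIMD instructions the way a real query engine would. The column values and the `equals_bitmap` helper are made up.

```python
def equals_bitmap(column, value):
    """Return an integer whose bit i is 1 if column[i] == value."""
    bits = 0
    for i, x in enumerate(column):
        if x == value:
            bits |= 1 << i
    return bits


product_sk = [69, 31, 69, 74, 69]   # made-up column values
store_sk = [4, 4, 5, 4, 4]

# One bitwise AND combines the two conditions for all rows at once.
matches = equals_bitmap(product_sk, 69) & equals_bitmap(store_sk, 4)
matching_rows = [i for i in range(len(product_sk)) if (matches >> i) & 1]
```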
{{< figure src="/fig/ddia_0409.png" id="fig_bitmap_and" caption="Figure 4-9. A bitwise AND between two bitmaps lends itself to vectorization." class="w-full my-4" >}}
@@ -1039,18 +1045,18 @@ discussed earlier, data warehouse queries often involve an aggregate function, s
`AVG`, `MIN`, or `MAX` in SQL. If the same aggregates are used by many different queries, it can be
wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that
queries use most often? A *data cube* or *OLAP cube* does this by creating a grid of aggregates grouped by different dimensions [^82].
-[Figure 4-10](/en/ch4#fig_data_cube) shows an example.
+[Figure 4-10](/en/ch4#fig_data_cube) shows an example.
{{< figure src="/fig/ddia_0410.png" id="fig_data_cube" caption="Figure 4-10. Two dimensions of a data cube, aggregating data by summing." class="w-full my-4" >}}
-Imagine for now that each fact has foreign keys to only two dimension tables—in [Figure 4-10](/en/ch4#fig_data_cube),
+Imagine for now that each fact has foreign keys to only two dimension tables—in [Figure 4-10](/en/ch4#fig_data_cube),
these are `date_key` and `product_sk`. You can now draw a two-dimensional table, with
dates along one axis and products along the other. Each cell contains the aggregate (e.g., `SUM`) of
an attribute (e.g., `net_price`) of all facts with that date-product combination. Then you can apply
the same aggregate along each row or column and get a summary that has been reduced by one
dimension (the sales by product regardless of date, or the sales by date regardless of product).
-In general, facts often have more than two dimensions. In [Figure 3-5](/en/ch3#fig_dwh_schema) there are five
+In general, facts often have more than two dimensions. In [Figure 3-5](/en/ch3#fig_dwh_schema) there are five
dimensions: date, product, store, promotion, and customer. It’s a lot harder to imagine what a
five-dimensional hypercube would look like, but the principle remains the same: each cell contains
the sales for a particular date-product-store-promotion-customer combination. These values can then
@@ -1132,11 +1138,11 @@ value of 0. Searching for documents mentioning “red apples” means a query th
The data structure that many search engines use to answer such queries is called an *inverted
index*. This is a key-value structure where the key is a term, and the value is the list of IDs of
all the documents that contain the term (the *postings list*). If the document IDs are sequential
-numbers, the postings list can also be represented as a sparse bitmap, like in [Figure 4-8](/en/ch4#fig_bitmap_index):
+numbers, the postings list can also be represented as a sparse bitmap, like in [Figure 4-8](/en/ch4#fig_bitmap_index):
the *n*th bit in the bitmap for term *x* is a 1 if the document with ID *n* contains the term *x* [^89].
Finding all the documents that contain both terms *x* and *y* is now similar to a vectorized data
-warehouse query that searches for rows matching two conditions ([Figure 4-9](/en/ch4#fig_bitmap_and)): load the two
+warehouse query that searches for rows matching two conditions ([Figure 4-9](/en/ch4#fig_bitmap_and)): load the two
bitmaps for terms *x* and *y* and compute their bitwise AND. Even if the bitmaps are run-length
encoded, this can be done very efficiently.
@@ -1147,7 +1153,7 @@ PostgreSQL’s GIN index type also uses postings lists to support full-text sear
JSON documents [^92] [^93].
Instead of breaking text into words, an alternative is to find all the substrings of length *n*,
-which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
+which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
`"hello"` are `"hel"`, `"ell"`, and `"llo"`. If we build an inverted index of all trigrams, we can
search the documents for arbitrary substrings that are at least three characters long. Trigram
indexes even allow regular expressions in search queries; the downside is that they are quite large [^94].
@@ -1226,7 +1232,7 @@ Inverted file (IVF) indexes
more vectors must be compared.
Hierarchical Navigable Small World (HNSW)
-: HNSW indexes maintain multiple layers of the vector space, as illustrated in [Figure 4-11](/en/ch4#fig_vector_hnsw).
+: HNSW indexes maintain multiple layers of the vector space, as illustrated in [Figure 4-11](/en/ch4#fig_vector_hnsw).
Each layer is represented as a graph, where nodes represent vectors, and edges represent proximity
to nearby vectors. A query starts by locating the nearest vector in the topmost layer, which has a
small number of nodes. The query then moves to the same node in the layer below and follows the
@@ -1395,4 +1401,4 @@ documentation for the database of your choice.
[^101]: Matthijs Douze, Maria Lomeli, and Lucas Hosseini. [Faiss indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). *github.com*, August 2024. Archived at [perma.cc/2EWG-FPBS](https://perma.cc/2EWG-FPBS)
[^102]: Varik Matevosyan. [Understanding pgvector’s HNSW Index Storage in Postgres](https://lantern.dev/blog/pgvector-storage). *lantern.dev*, August 2024. Archived at [perma.cc/B2YB-JB59](https://perma.cc/B2YB-JB59)
[^103]: Dmitry Baranchuk, Artem Babenko, and Yury Malkov. [Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors](https://arxiv.org/pdf/1802.02422). At *European Conference on Computer Vision* (ECCV), pages 202–216, September 2018. [doi:10.1007/978-3-030-01258-8\_13](https://doi.org/10.1007/978-3-030-01258-8_13)
-[^104]: Yury A. Malkov and Dmitry A. Yashunin. [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/pdf/1603.09320). *IEEE Transactions on Pattern Analysis and Machine Intelligence*, volume 42, issue 4, pages 824–836, April 2020. [doi:10.1109/TPAMI.2018.2889473](https://doi.org/10.1109/TPAMI.2018.2889473)
\ No newline at end of file
+[^104]: Yury A. Malkov and Dmitry A. Yashunin. [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/pdf/1603.09320). *IEEE Transactions on Pattern Analysis and Machine Intelligence*, volume 42, issue 4, pages 824–836, April 2020. [doi:10.1109/TPAMI.2018.2889473](https://doi.org/10.1109/TPAMI.2018.2889473)
diff --git a/content/en/ch5.md b/content/en/ch5.md
index b934872..48993e6 100644
--- a/content/en/ch5.md
+++ b/content/en/ch5.md
@@ -4,6 +4,8 @@ weight: 105
breadcrumbs: false
---
+
+

> *Everything changes and nothing stands still.*
@@ -12,14 +14,14 @@ breadcrumbs: false
Applications inevitably change over time. Features are added or modified as new products are
launched, user requirements become better understood, or business circumstances change. In
-[Chapter 2](/en/ch2#ch_nonfunctional) we introduced the idea of *evolvability*: we should aim to build systems that
+[Chapter 2](/en/ch2#ch_nonfunctional) we introduced the idea of *evolvability*: we should aim to build systems that
make it easy to adapt to change (see [“Evolvability: Making Change Easy”](/en/ch2#sec_introduction_evolvability)).
In most cases, a change to an application’s features also requires a change to data that it stores:
perhaps a new field or record type needs to be captured, or perhaps existing data needs to be
presented in a new way.
-The data models we discussed in [Chapter 3](/en/ch3#ch_datamodels) have different ways of coping with such change.
+The data models we discussed in [Chapter 3](/en/ch3#ch_datamodels) have different ways of coping with such change.
Relational databases generally assume that all data in the database conforms to one schema: although
that schema can be changed (through schema migrations; i.e., `ALTER` statements), there is exactly
one schema in force at any one point in time. By contrast, schema-on-read (“schemaless”) databases
@@ -52,13 +54,13 @@ format of data written by older code, and so you can explicitly handle it (if ne
keeping the old code to read the old data). Forward compatibility can be trickier, because it
requires older code to ignore additions made by a newer version of the code.
-Another challenge with forward compatibility is illustrated in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
+Another challenge with forward compatibility is illustrated in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
Say you add a field to a record schema, and the newer code creates a record containing that new
field and stores it in a database. Subsequently, an older version of the code (which doesn’t yet
know about the new field) reads the record, updates it, and writes it back. In this situation, the
desirable behavior is usually for the old code to keep the new field intact, even though it couldn’t
be interpreted. But if the record is decoded into a model object that does not explicitly
-preserve unknown fields, data can be lost, like in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
+preserve unknown fields, data can be lost, like in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
{{< figure src="/fig/ddia_0501.png" id="fig_encoding_preserve_field" caption="When an older version of the application updates data previously written by a newer version of the application, data may be lost if you’re not careful." class="w-full my-4" >}}
@@ -90,7 +92,7 @@ in-memory representation to a byte sequence is called *encoding* (also known as
> [!TIP] TERMINOLOGY CLASH
-*Serialization* is unfortunately also used in the context of transactions (see [Chapter 8](/en/ch8#ch_transactions)),
+*Serialization* is unfortunately also used in the context of transactions (see [Chapter 8](/en/ch8#ch_transactions)),
with a completely different meaning. To avoid overloading the word we’ll stick with *encoding* in
this book, even though *serialization* is perhaps a more common term.
@@ -202,7 +204,7 @@ Open content models are powerful, but can be complex. For example, say you want
integers (such as IDs) to strings. JSON does not have a map or dictionary type, only an “object”
type that can contain string keys, and values of any type. You can then constrain this type with
JSON Schema so that keys may only contain digits, and values can only be strings, using
-`patternProperties` and `additionalProperties` as shown in [Example 5-1](/en/ch5#fig_encoding_json_schema).
+`patternProperties` and `additionalProperties` as shown in [Example 5-1](/en/ch5#fig_encoding_json_schema).
{{< figure id="fig_encoding_json_schema" title="Example 5-1. Example JSON Schema with integer keys and string values. Integer keys are represented as strings containing only integers since JSON Schema requires all keys to be strings." class="w-full my-4" >}}
@@ -237,7 +239,7 @@ sometimes faster to parse, but none of them are as widely adopted as the textual
Some of these formats extend the set of datatypes (e.g., distinguishing integers and floating-point numbers,
or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. In
particular, since they don’t prescribe a schema, they need to include all the object field names within
-the encoded data. That is, in a binary encoding of the JSON document in [Example 5-2](/en/ch5#fig_encoding_json), they
+the encoded data. That is, in a binary encoding of the JSON document in [Example 5-2](/en/ch5#fig_encoding_json), they
will need to include the strings `userName`, `favoriteNumber`, and `interests` somewhere.
{{< figure id="fig_encoding_json" title="Example 5-2. Example record which we will encode in several binary formats in this chapter" class="w-full my-4" >}}
@@ -250,8 +252,8 @@ will need to include the strings `userName`, `favoriteNumber`, and `interests` s
}
```
-Let’s look at an example of MessagePack, a binary encoding for JSON. [Figure 5-2](/en/ch5#fig_encoding_messagepack)
-shows the byte sequence that you get if you encode the JSON document in [Example 5-2](/en/ch5#fig_encoding_json) with
+Let’s look at an example of MessagePack, a binary encoding for JSON. [Figure 5-2](/en/ch5#fig_encoding_messagepack)
+shows the byte sequence that you get if you encode the JSON document in [Example 5-2](/en/ch5#fig_encoding_json) with
MessagePack. The first few bytes are as follows:
1. The first byte, `0x83`, indicates that what follows is an object (top four bits = `0x80`) with three
@@ -281,7 +283,7 @@ It is similar to Apache Thrift, which was originally developed by Facebook [^13]
most of what this section says about Protocol Buffers applies also to Thrift.
Protocol Buffers requires a schema for any data that is encoded. To encode the data
-in [Example 5-2](/en/ch5#fig_encoding_json) in Protocol Buffers, you would describe the schema in the Protocol Buffers
+in [Example 5-2](/en/ch5#fig_encoding_json) in Protocol Buffers, you would describe the schema in the Protocol Buffers
interface definition language (IDL) like this:
```protobuf
@@ -300,17 +302,17 @@ application code can call this generated code to encode or decode records of the
language is very simple compared to JSON Schema: it only defines the fields of records and their
types, but it does not support other restrictions on the possible values of fields.
-Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in [Figure 5-3](/en/ch5#fig_encoding_protobuf) [^14].
+Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in [Figure 5-3](/en/ch5#fig_encoding_protobuf) [^14].
{{< figure src="/fig/ddia_0503.png" id="fig_encoding_protobuf" caption="Figure 5-3. Example record encoded using Protocol Buffers." class="w-full my-4" >}}
-Similarly to [Figure 5-2](/en/ch5#fig_encoding_messagepack), each field has a type annotation (to indicate whether it
+Similarly to [Figure 5-2](/en/ch5#fig_encoding_messagepack), each field has a type annotation (to indicate whether it
is a string, integer, etc.) and, where required, a length indication (such as the length of a
string). The strings that appear in the data (“Martin”, “daydreaming”, “hacking”) are also encoded
as ASCII (to be precise, UTF-8), similar to before.
-The big difference compared to [Figure 5-2](/en/ch5#fig_encoding_messagepack) is that there are no field names
+The big difference compared to [Figure 5-2](/en/ch5#fig_encoding_messagepack) is that there are no field names
(`userName`, `favoriteNumber`, `interests`). Instead, the encoded data contains *field tags*, which
are numbers (`1`, `2`, and `3`). Those are the numbers that appear in the schema definition. Field tags
are like aliases for fields—they are a compact way of saying what field we’re talking about,
@@ -344,7 +346,7 @@ You can add new fields to the schema, provided that you give each field a new ta
code (which doesn’t know about the new tag numbers you added) tries to read data written by new
code, including a new field with a tag number it doesn’t recognize, it can simply ignore that field.
The datatype annotation allows the parser to determine how many bytes it needs to skip, and preserve
-the unknown fields to avoid the problem in [Figure 5-1](/en/ch5#fig_encoding_preserve_field). This maintains forward
+the unknown fields to avoid the problem in [Figure 5-1](/en/ch5#fig_encoding_preserve_field). This maintains forward
compatibility: old code can read records that were written by new code.
What about backward compatibility? As long as each field has a unique tag number, new code can
@@ -400,9 +402,9 @@ The equivalent JSON representation of that schema is as follows:
```
First of all, notice that there are no tag numbers in the schema. If we encode our example record
-([Example 5-2](/en/ch5#fig_encoding_json)) using this schema, the Avro binary encoding is just 32 bytes long—the
+([Example 5-2](/en/ch5#fig_encoding_json)) using this schema, the Avro binary encoding is just 32 bytes long—the
most compact of all the encodings we have seen. The breakdown of the encoded byte sequence is shown
-in [Figure 5-4](/en/ch5#fig_encoding_avro).
+in [Figure 5-4](/en/ch5#fig_encoding_avro).
If you examine the byte sequence, you can see that there is nothing to identify fields or their
datatypes. The encoding simply consists of values concatenated together. A string is just a length
@@ -430,7 +432,7 @@ example, that schema may be compiled into the application. This is known as the
When an application wants to decode some data (read it from a file or database, receive it from the
network, etc.), it uses two schemas: the writer’s schema that is identical to the one used for
encoding, and the *reader’s schema*, which may be different. This is illustrated in
-[Figure 5-5](/en/ch5#fig_encoding_avro_schemas). The reader’s schema defines the fields of each record that the
+[Figure 5-5](/en/ch5#fig_encoding_avro_schemas). The reader’s schema defines the fields of each record that the
application code is expecting, and their types.
{{< figure src="/fig/ddia_0505.png" id="fig_encoding_avro_schemas" caption="Figure 5-5. In Protocol Buffers, encoding and decoding can use different versions of a schema. In Avro, decoding uses two schemas: the writer's schema must be identical to the one used for encoding, but the reader's schema can be an older or newer version." class="w-full my-4" >}}
@@ -438,7 +440,7 @@ application code is expecting, and their types.
If the reader’s and writer’s schema are the same, decoding is easy. If they are different, Avro
resolves the differences by looking at the writer’s schema and the reader’s schema side by side and
translating the data from the writer’s schema into the reader’s schema. The Avro specification [^16] [^17]
-defines exactly how this resolution works, and it is illustrated in [Figure 5-6](/en/ch5#fig_encoding_avro_resolution).
+defines exactly how this resolution works, and it is illustrated in [Figure 5-6](/en/ch5#fig_encoding_avro_resolution).
For example, it’s no problem if the writer’s schema and the reader’s schema have their fields in a
different order, because the schema resolution matches up the fields by field name. If the code
@@ -490,7 +492,7 @@ The answer depends on the context in which Avro is being used. To give a few exa
Large file with lots of records
: A common use for Avro is for storing a large file containing millions of records, all encoded with
- the same schema. (We will discuss this kind of situation in [Link to Come].) In this case, the
+ the same schema. (We will discuss this kind of situation in [Chapter 11](/en/ch11#ch_batch).) In this case, the
writer of that file can just include the writer’s schema once at the beginning of the file. Avro
specifies a file format (object container files) to do this.
@@ -661,7 +663,7 @@ As the data dump is written in one go and is thereafter immutable, formats like
container files are a good fit. This is also a good opportunity to encode the data in an
analytics-friendly column-oriented format such as Parquet (see [“Column Compression”](/en/ch4#sec_storage_column_compression)).
-In [Link to Come] we will talk more about using data in archival storage.
+In [Chapter 11](/en/ch11#ch_batch) we will talk more about using data in archival storage.
### Dataflow Through Services: REST and RPC {#sec_encoding_dataflow_rpc}
@@ -686,7 +688,7 @@ application-specific, and the client and server need to agree on the details of
In some ways, services are similar to databases: they typically allow clients to submit and query
data. However, while databases allow arbitrary queries using the query languages we discussed in
-[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
+[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
that are predetermined by the business logic (application code) of the service [^29]. This restriction provides a degree of encapsulation: services can impose
fine-grained restrictions on what clients can and cannot do.
@@ -728,7 +730,7 @@ service. The two most popular service IDLs are OpenAPI (also known as Swagger [^
and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send
and receive Protocol Buffers.
-Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](/en/ch5#fig_open_api_def).
+Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](/en/ch5#fig_open_api_def).
The service definition allows developers to define service endpoints, documentation, versions, data
models, and much more. gRPC definitions look similar, but are defined using Protocol Buffers service definitions.
@@ -762,8 +764,8 @@ Even if a design philosophy and IDL are adopted, developers must still write the
implements their service’s API calls. A service framework is often adopted to simplify this
effort. Service frameworks such as Spring Boot, FastAPI, and gRPC allow developers to write the
business logic for each API endpoint while the framework code handles routing, metrics, caching,
-authentication, and so on. [Example 5-4](/en/ch5#fig_fastapi_def) shows an example Python implementation of the service
-defined in [Example 5-3](/en/ch5#fig_open_api_def).
+authentication, and so on. [Example 5-4](/en/ch5#fig_fastapi_def) shows an example Python implementation of the service
+defined in [Example 5-3](/en/ch5#fig_open_api_def).
{{< figure id="fig_fastapi_def" title="Example 5-4. Example FastAPI service implementing the definition from [Example 5-3](/en/ch5#fig_open_api_def)" class="w-full my-4" >}}
@@ -815,11 +817,11 @@ A network request is very different from a local function call:
it goes into an infinite loop or the process crashes). A network request has another possible
outcome: it may return without a result, due to a *timeout*. In that case, you simply don’t know
what happened: if you don’t get a response from the remote service, you have no way of knowing
- whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/en/ch9#ch_distributed).)
+ whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/en/ch9#ch_distributed).)
* If you retry a failed network request, it could happen that the previous request actually got
through, and only the response was lost. In that case, retrying will cause the action to
be performed multiple times, unless you build a mechanism for deduplication (*idempotence*) into the protocol [^40].
- Local function calls don’t have this problem. (We discuss idempotence in more detail in [Link to Come].)
+ Local function calls don’t have this problem. (We discuss idempotence in more detail in [“Idempotence”](/en/ch12#sec_stream_idempotence).)
* Every time you call a local function, it normally takes about the same time to execute. A network
request is much slower than a function call, and its latency is also wildly variable: at good
times it may complete in less than a millisecond, but when the network is congested or the remote
@@ -870,7 +872,7 @@ There are many load balancing and service discovery solutions available:
* *Service discovery systems* use a centralized registry rather than DNS to track which service
endpoints are available. When a new service instance starts up, it registers itself with the
service discovery system by declaring the host and port it’s listening on, along with relevant
- metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location,
+ metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location,
and more. The service then periodically sends a heartbeat signal to the discovery system to signal
that the service is still available.
@@ -936,7 +938,7 @@ services responsible for fraud detection, credit card integration, bank integrat
Processing a single payment in our example requires many service calls. A payment processor service
might invoke the fraud detection service to check for fraud, call the credit card service to debit
the credit card, and call the banking service to deposit debited funds, as shown in
-[Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*.
+[Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*.
Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a
general-purpose programming language, a domain specific language (DSL), or a markup language such as
Business Process Execution Language (BPEL) [^44].
@@ -967,7 +969,7 @@ tasks.
There are many kinds of workflow engines that address a diverse set of use cases. Some, such as
Airflow, Dagster, and Prefect, integrate with data systems and orchestrate ETL tasks. Others, such
as Camunda and Orkes, provide a graphical notation for workflows (such as BPMN, used in
-[Figure 5-7](/en/ch5#fig_encoding_workflow)) so that non-engineers can more easily define and execute workflows. Still
+[Figure 5-7](/en/ch5#fig_encoding_workflow)) so that non-engineers can more easily define and execute workflows. Still
others, such as Temporal and Restate, provide *durable execution*.
#### Durable execution {#durable-execution}
@@ -984,7 +986,7 @@ task fails, the framework will re-execute the task, but will skip any RPC calls
that the task made successfully before failing. Instead, the framework will pretend to make the
call, but will instead return the results from the previous call. This is possible because durable
execution frameworks log all RPCs and state changes to durable storage like a write-ahead log [^45] [^46].
-[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
+[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
using Temporal.
{{< figure id="fig_temporal_workflow" title="Example 5-5. A Temporal workflow definition fragment for the payment workflow in [Figure 5-7](/en/ch5#fig_encoding_workflow)." class="w-full my-4" >}}
@@ -1060,7 +1062,7 @@ In the past, the landscape of message brokers was dominated by commercial enterp
companies such as TIBCO, IBM WebSphere, and webMethods, before open source implementations such as
RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka became popular. More recently, cloud services
such as Amazon Kinesis, Azure Service Bus, and Google Cloud Pub/Sub have gained adoption. We will
-compare them in more detail in [Link to Come].
+compare them in more detail in [“Messaging Systems”](/en/ch12#sec_stream_messaging).
The detailed delivery semantics vary by implementation and configuration, but in general, two
message distribution patterns are most often used:
@@ -1084,7 +1086,7 @@ to use event sourcing (see [“Event Sourcing and CQRS”](/en/ch3#sec_datamodel
If a consumer republishes messages to another topic, you may need to be careful to preserve unknown
fields, to prevent the issue described previously in the context of databases
-([Figure 5-1](/en/ch5#fig_encoding_preserve_field)).
+([Figure 5-1](/en/ch5#fig_encoding_preserve_field)).
#### Distributed actor frameworks {#distributed-actor-frameworks}
@@ -1213,4 +1215,4 @@ quite achievable. May your application’s evolution be rapid and your deploymen
[^48]: [What is a Temporal Workflow?](https://docs.temporal.io/workflows) *docs.temporal.io*, 2024. Archived at [perma.cc/B5C5-Y396](https://perma.cc/B5C5-Y396)
[^49]: Jack Kleeman. [Solving durable execution’s immutability problem](https://restate.dev/blog/solving-durable-executions-immutability-problem/). *restate.dev*, February 2024. Archived at [perma.cc/G55L-EYH5](https://perma.cc/G55L-EYH5)
[^50]: Srinath Perera. [Exploring Event-Driven Architecture: A Beginner’s Guide for Cloud Native Developers](https://wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/). *wso2.com*, August 2023. Archived at [archive.org](https://web.archive.org/web/20240716204613/https%3A//wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/)
-[^51]: Philip A. Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. [Orleans: Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical Report MSR-TR-2014-41, March 2014. Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF)
\ No newline at end of file
+[^51]: Philip A. Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. [Orleans: Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical Report MSR-TR-2014-41, March 2014. Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF)
diff --git a/content/en/ch6.md b/content/en/ch6.md
index 45a03e1..bd69bcf 100644
--- a/content/en/ch6.md
+++ b/content/en/ch6.md
@@ -4,6 +4,8 @@ weight: 206
breadcrumbs: false
---
+
+

> *The major difference between a thing that might go wrong and a thing that cannot possibly go wrong
@@ -21,7 +23,7 @@ why you might want to replicate data:
* To scale out the number of machines that can serve read queries (and thus increase read throughput)
In this chapter we will assume that your dataset is small enough that each machine can hold a copy of
-the entire dataset. In [Chapter 7](/en/ch7#ch_sharding) we will relax that assumption and discuss *sharding*
+the entire dataset. In [Chapter 7](/en/ch7#ch_sharding) we will relax that assumption and discuss *sharding*
(*partitioning*) of datasets that are too big for a single machine. In later chapters we will discuss
various kinds of faults that can occur in a replicated data system, and how to deal with them.
@@ -72,7 +74,7 @@ question inevitably arises: how do we ensure that all the data ends up on all th
Every write to the database needs to be processed by every replica; otherwise, the replicas would no
longer contain the same data. The most common solution is called *leader-based replication*,
-*primary-backup*, or *active/passive*. It works as follows (see [Figure 6-1](/en/ch6#fig_replication_leader_follower)):
+*primary-backup*, or *active/passive*. It works as follows (see [Figure 6-1](/en/ch6#fig_replication_leader_follower)):
1. One of the replicas is designated the *leader* (also known as *primary* or *source* [^2]).
When clients want to write to the database, they must send their requests to the leader, which
@@ -88,7 +90,7 @@ longer contain the same data. The most common solution is called *leader-based r
{{< figure src="/fig/ddia_0601.png" id="fig_replication_leader_follower" caption="Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas." class="w-full my-4" >}}
-If the database is sharded (see [Chapter 7](/en/ch7#ch_sharding)), each shard has one leader. Different shards may
+If the database is sharded (see [Chapter 7](/en/ch7#ch_sharding)), each shard has one leader. Different shards may
have their leaders on different nodes, but each shard must nevertheless have one leader node. In
[“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader) we will discuss an alternative model in which a system may have
multiple leaders for the same shard at the same time.
@@ -99,7 +101,7 @@ It is also used in some document databases such as MongoDB and DynamoDB [^5],
message brokers such as Kafka, replicated block devices such as DRBD, and some network filesystems.
Many consensus algorithms such as Raft, which is used for replication in CockroachDB [^6], TiDB [^7],
etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and automatically
-elect a new leader if the old one fails (we will discuss consensus in more detail in [Chapter 10](/en/ch10#ch_consistency)).
+elect a new leader if the old one fails (we will discuss consensus in more detail in [Chapter 10](/en/ch10#ch_consistency)).
--------
@@ -115,15 +117,15 @@ An important detail of a replicated system is whether the replication happens *s
*asynchronously*. (In relational databases, this is often a configurable option; other systems are
often hardcoded to be either one or the other.)
-Think about what happens in [Figure 6-1](/en/ch6#fig_replication_leader_follower), where the user of a website updates
+Think about what happens in [Figure 6-1](/en/ch6#fig_replication_leader_follower), where the user of a website updates
their profile image. At some point in time, the client sends the update request to the leader;
shortly afterward, it is received by the leader. At some point, the leader forwards the data change
to the followers. Eventually, the leader notifies the client that the update was successful.
-[Figure 6-2](/en/ch6#fig_replication_sync_replication) shows one possible way how the timings could work out.
+[Figure 6-2](/en/ch6#fig_replication_sync_replication) shows one possible way the timings could work out.
{{< figure src="/fig/ddia_0602.png" id="fig_replication_sync_replication" caption="Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower." class="w-full my-4" >}}
-In the example of [Figure 6-2](/en/ch6#fig_replication_sync_replication), the replication to follower 1 is
+In the example of [Figure 6-2](/en/ch6#fig_replication_sync_replication), the replication to follower 1 is
*synchronous*: the leader waits until follower 1 has confirmed that it received the write before
reporting success to the user, and before making the write visible to other clients. The replication
to follower 2 is *asynchronous*: the leader sends the message, but doesn’t wait for a response from
@@ -155,7 +157,7 @@ In some systems, a *majority* (e.g., 3 out of 5 replicas, including the leader)
updated synchronously, and the remaining minority is asynchronous. This is an example of a *quorum*,
which we will discuss further in [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition). Majority quorums are often
used in systems that use a consensus protocol for automatic leader election, which we will return to
-in [Chapter 10](/en/ch10#ch_consistency).
+in [Chapter 10](/en/ch10#ch_consistency).
Sometimes, leader-based replication is configured to be completely asynchronous. In this case, if the
leader fails and is not recoverable, any writes that have not yet been replicated to followers are
@@ -206,6 +208,8 @@ Litestream does the equivalent for SQLite.
--------
+
+
> [!TIP] DATABASES BACKED BY OBJECT STORAGE
Object storage can be used for more than archiving data. Many databases are beginning to use object
@@ -303,7 +307,7 @@ consists of the following steps:
established *controller node* [^13].
The best candidate for leadership is usually the replica with the most up-to-date data changes
from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader
- is a consensus problem, discussed in detail in [Chapter 10](/en/ch10#ch_consistency).
+ is a consensus problem, discussed in detail in [Chapter 10](/en/ch10#ch_consistency).
3. *Reconfiguring the system to use the new leader.* Clients now need to send
their write requests to the new leader (we discuss this
in [“Request Routing”](/en/ch7#sec_sharding_routing)). If the old leader comes back, it might still believe that it is
@@ -326,7 +330,7 @@ Failover is fraught with things that can go wrong:
primary keys that were previously assigned by the old leader. These primary keys were also used in
a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis,
which caused some private data to be disclosed to the wrong users.
-* In certain fault scenarios (see [Chapter 9](/en/ch9#ch_distributed)), it could happen that two nodes both believe
+* In certain fault scenarios (see [Chapter 9](/en/ch9#ch_distributed)), it could happen that two nodes both believe
that they are the leader. This situation is called *split brain*, and it is dangerous: if both
leaders accept writes, and there is no process for resolving conflicts (see
[“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some
@@ -362,7 +366,7 @@ behind by several days could be catastrophic.
These issues—node failures; unreliable networks; and trade-offs around replica consistency,
durability, availability, and latency—are in fact fundamental problems in distributed systems.
-In [Chapter 9](/en/ch9#ch_distributed) and [Chapter 10](/en/ch10#ch_consistency) we will discuss them in greater depth.
+In [Chapter 9](/en/ch9#ch_distributed) and [Chapter 10](/en/ch10#ch_consistency) we will discuss them in greater depth.
### Implementation of Replication Logs {#sec_replication_implementation}
@@ -405,7 +409,7 @@ in practice, so many databases prefer other replication methods.
#### Write-ahead log (WAL) shipping {#write-ahead-log-wal-shipping}
-In [Chapter 4](/en/ch4#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
+In [Chapter 4](/en/ch4#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
every modification is first written to the WAL so that the tree can be restored to a consistent
state after a crash. Since the WAL contains all the information necessary to restore the indexes and
heap into a consistent state, we can use the exact same log to build a replica on another node:
@@ -426,6 +430,8 @@ performing a failover to make one of the upgraded nodes the new leader. If the r
does not allow this version mismatch, as is often the case with WAL shipping, such upgrades require
downtime.
+
+
#### Logical (row-based) log replication {#logical-row-based-log-replication}
An alternative is to use different log formats for replication and for the storage engine, which
@@ -456,7 +462,7 @@ software. This in turn enables upgrading to a new version with minimal downtime
A logical log format is also easier for external applications to parse. This aspect is useful if you want
to send the contents of a database to an external system, such as a data warehouse for offline
analysis, or for building custom indexes and caches [^21].
-This technique is called *change data capture*, and we will return to it in [Link to Come].
+This technique is called *change data capture*, and we will return to it in [“Change Data Capture”](/en/ch12#sec_stream_cdc).
## Problems with Replication Lag {#sec_replication_lag}
@@ -513,7 +519,7 @@ be read from a follower. This is especially appropriate if data is frequently vi
occasionally written.
With asynchronous replication, there is a problem, illustrated in
-[Figure 6-3](/en/ch6#fig_replication_read_your_writes): if the user views the data shortly after making a write, the
+[Figure 6-3](/en/ch6#fig_replication_read_your_writes): if the user views the data shortly after making a write, the
new data may not yet have reached the replica. To the user, it looks as though the data they
submitted was lost, so they will be understandably unhappy.
@@ -597,7 +603,7 @@ Our second example of an anomaly that can occur when reading from asynchronous f
possible for a user to see things *moving backward in time*.
This can happen if a user makes several reads from different replicas. For example,
-[Figure 6-4](/en/ch6#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower
+[Figure 6-4](/en/ch6#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower
with little lag, then to a follower with greater lag. (This scenario is quite likely if the user
refreshes a web page, and each request is routed to a random server.) The first query returns a
comment that was recently added by user 1234, but the second query doesn’t return anything because
@@ -636,7 +642,7 @@ answered it.
Now, imagine a third person is listening to this conversation through followers. The things said by
Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer
-replication lag (see [Figure 6-5](/en/ch6#fig_replication_consistent_prefix)). This observer would hear the following:
+replication lag (see [Figure 6-5](/en/ch6#fig_replication_consistent_prefix)). This observer would hear the following:
Mrs. Cake
: About ten seconds usually, Mr. Poons.
@@ -654,7 +660,7 @@ This guarantee says that if a sequence of writes happens in a certain order,
then anyone reading those writes will see them appear in the same order.
This is a particular problem in sharded (partitioned) databases, which we will discuss in
-[Chapter 7](/en/ch7#ch_sharding). If the database always applies writes in the same order, reads always see a
+[Chapter 7](/en/ch7#ch_sharding). If the database always applies writes in the same order, reads always see a
consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different
shards operate independently, so there is no global ordering of writes: when a user reads from the
database, they may see some parts of the database in an older state and some in a newer state.
@@ -678,8 +684,8 @@ synchronously updated follower. However, dealing with these issues in applicatio
and easy to get wrong.
The simplest programming model for application developers is to choose a database that provides a
-strong consistency guarantee for replicas such as linearizability (see [Chapter 10](/en/ch10#ch_consistency)), and ACID
-transactions (see [Chapter 8](/en/ch8#ch_transactions)). This allows you to mostly ignore the challenges that arise
+strong consistency guarantee for replicas such as linearizability (see [Chapter 10](/en/ch10#ch_consistency)), and ACID
+transactions (see [Chapter 8](/en/ch8#ch_transactions)). This allows you to mostly ignore the challenges that arise
from replication, and treat the database as if it had just a single node. In the early 2010s the
*NoSQL* movement promoted the view that these features limited scalability, and that large-scale
systems would have to embrace eventual consistency.
@@ -738,7 +744,7 @@ single-leader replication, the leader has to be in *one* of the regions, and all
through that region.
In a multi-leader configuration, you can have a leader in *each* region.
-[Figure 6-6](/en/ch6#fig_replication_multi_dc) shows what this architecture might look like. Within each region,
+[Figure 6-6](/en/ch6#fig_replication_multi_dc) shows what this architecture might look like. Within each region,
regular leader–follower replication is used (with followers maybe in a different availability zone
from the leader); between regions, each region’s leader replicates its changes to the leaders in
other regions.
@@ -774,7 +780,7 @@ Tolerance of network problems
Consistency
: A single-leader system can provide strong consistency guarantees, such as serializable
- transactions, which we will discuss in [Chapter 8](/en/ch8#ch_transactions). The biggest downside of multi-leader
+ transactions, which we will discuss in [Chapter 8](/en/ch8#ch_transactions). The biggest downside of multi-leader
systems is that the consistency they can achieve is much weaker. For example, you can’t guarantee
that a bank account won’t go negative or that a username is unique: it’s always possible for
different leaders to process writes that are individually fine (paying out some of the money in an
@@ -798,14 +804,14 @@ multi-leader replication is often considered dangerous territory that should be
#### Multi-leader replication topologies {#sec_replication_topologies}
A *replication topology* describes the communication paths along which writes are propagated from
-one node to another. If you have two leaders, like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), there is
+one node to another. If you have two leaders, like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), there is
only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With
more than two leaders, various different topologies are possible. Some examples are illustrated in
-[Figure 6-7](/en/ch6#fig_replication_topologies).
+[Figure 6-7](/en/ch6#fig_replication_topologies).
{{< figure src="/fig/ddia_0607.png" id="fig_replication_topologies" caption="Figure 6-7. Three example topologies in which multi-leader replication can be set up." class="w-full my-4" >}}
-The most general topology is *all-to-all*, shown in [Figure 6-7](/en/ch6#fig_replication_topologies)(c),
+The most general topology is *all-to-all*, shown in [Figure 6-7](/en/ch6#fig_replication_topologies)(c),
in which every leader sends its writes to every other leader. However, more restricted topologies
are also used: for example a *circular topology* in which each node receives writes from one node
and forwards those writes (plus any writes of its own) to one other node. Another popular topology
@@ -839,11 +845,11 @@ along different paths, avoiding a single point of failure.
On the other hand, all-to-all topologies can have issues too. In particular, some network links may
be faster than others (e.g., due to network congestion), with the result that some replication
-messages may “overtake” others, as illustrated in [Figure 6-8](/en/ch6#fig_replication_causality).
+messages may “overtake” others, as illustrated in [Figure 6-8](/en/ch6#fig_replication_causality).
{{< figure src="/fig/ddia_0608.png" id="fig_replication_causality" caption="Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas." class="w-full my-4" >}}
-In [Figure 6-8](/en/ch6#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
+In [Figure 6-8](/en/ch6#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may
first receive the update (which, from its point of view, is an update to a row that does not exist
in the database) and only later receive the corresponding insert (which should have preceded the
@@ -853,12 +859,12 @@ This is a problem of causality, similar to the one we saw in [“Consistent Pref
the update depends on the prior insert, so we need to make sure that all nodes process the insert
first, and then the update. Simply attaching a timestamp to every write is not sufficient, because
clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see
-[Chapter 9](/en/ch9#ch_distributed)).
+[Chapter 9](/en/ch9#ch_distributed)).
To order these events correctly, a technique called *version vectors* can be used, which we will
discuss later in this chapter (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). However, many multi-leader
replication systems don’t use good techniques for ordering updates, leaving them vulnerable to
-issues like the one in [Figure 6-8](/en/ch6#fig_replication_causality). If you are using multi-leader replication, it
+issues like the one in [Figure 6-8](/en/ch6#fig_replication_causality). If you are using multi-leader replication, it
is worth being aware of these issues, carefully reading the documentation, and thoroughly testing
your database to ensure that it really does provide the guarantees you believe it to have.
@@ -926,8 +932,8 @@ approach has a number of advantages:
* Having the data locally means the user interface can be much faster to respond than if it had to
wait for a service call to fetch some data. Some apps aim to respond to user input in the *next
- frame* of the graphics system, which means rendering within 16 ms on a display with a
- 60 Hz refresh rate.
+ frame* of the graphics system, which means rendering within 16 ms on a display with a
+ 60 Hz refresh rate.
* Allowing users to continue working while offline is valuable, especially on mobile devices with
intermittent connectivity. With a sync engine, an app doesn’t need a separate offline mode: being
offline is the same as having very large network delay.
@@ -967,7 +973,7 @@ a local-first sync engine on end user devices—is that concurrent writes on dif
lead to conflicts that need to be resolved.
For example, consider a wiki page that is simultaneously being edited by two users, as shown in
-[Figure 6-9](/en/ch6#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2
+[Figure 6-9](/en/ch6#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2
independently changes the title from A to C. Each user’s change is successfully applied to their
local leader. However, when the changes are asynchronously replicated, a conflict is detected.
This problem does not occur in a single-leader database.
@@ -975,7 +981,7 @@ This problem does not occur in a single-leader database.
{{< figure src="/fig/ddia_0609.png" id="fig_replication_write_conflict" caption="Figure 6-9. A write conflict caused by two leaders concurrently updating the same record." class="w-full my-4" >}}
> [!NOTE]
-> We say that the two writes in [Figure 6-9](/en/ch6#fig_replication_write_conflict) are *concurrent* because neither
+> We say that the two writes in [Figure 6-9](/en/ch6#fig_replication_write_conflict) are *concurrent* because neither
> was “aware” of the other at the time the write was originally made. It doesn’t matter whether the
> writes literally happened at the same time; indeed, if the writes were made while offline, they
> might have actually happened some time apart. What matters is whether one write occurred in a state
@@ -1017,7 +1023,7 @@ We will discuss other ID assignment schemes in [“ID Generators and Logical Clo
If conflicts can’t be avoided, the simplest way of resolving them is to attach a timestamp to each
write, and to always use the value with the greatest timestamp. For example, in
-[Figure 6-9](/en/ch6#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than
+[Figure 6-9](/en/ch6#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than
the timestamp of user 2’s write. In that case, both leaders will determine that the new title of the
page should be B, and they discard the write that sets it to C. If the writes coincidentally have
the same timestamp, the winner can be chosen by comparing the values (e.g., in the case of strings,
@@ -1025,7 +1031,7 @@ taking the one that’s earlier in the alphabet).
This approach is called *last write wins* (LWW) because the write with the greatest timestamp can be
considered the “last” one. The term is misleading though, because when two writes are concurrent
-like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), which one is older and which is later is undefined, and
+like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), which one is earlier and which is later is undefined, and
so the timestamp order of concurrent writes is essentially random.
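As a minimal sketch of this rule, each replica keeps the write with the greatest timestamp and breaks ties by taking the value that sorts earlier. The `Write` type, the integer timestamps, and the `lww_merge` function are hypothetical names chosen for illustration, not the API of any particular database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    timestamp: int  # assumed: an integer timestamp attached by the writer
    value: str


def lww_merge(a: Write, b: Write) -> Write:
    """Pick the winner of two conflicting writes to the same record."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Equal timestamps: break the tie deterministically by comparing values,
    # taking the string that sorts earlier.
    return a if a.value <= b.value else b


# Title conflict from Figure 6-9, assuming user 1's timestamp is greater.
user1 = Write(timestamp=2, value="B")
user2 = Write(timestamp=1, value="C")

# Both leaders reach the same result regardless of argument order.
assert lww_merge(user1, user2) == lww_merge(user2, user1) == user1
```

Because the function is deterministic and does not depend on argument order, every leader converges on the same value; the losing write is silently discarded.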
Therefore the real meaning of LWW is: when the same record is concurrently written on different
@@ -1055,7 +1061,7 @@ merge is complete.
In a database, it would be impractical for a conflict to stop the entire replication process until a
human has resolved it. Instead, databases typically store all the concurrently written values for a
-given record—for example, both B and C in [Figure 6-9](/en/ch6#fig_replication_write_conflict). These values are
+given record—for example, both B and C in [Figure 6-9](/en/ch6#fig_replication_write_conflict). These values are
sometimes called *siblings*. The next time you query that record, the database returns *all* those
values, rather than just the latest one. You can then resolve those values in whatever way you want,
either automatically in application code (for example, you could concatenate B and C into “B/C”), or
@@ -1077,7 +1083,7 @@ suffers from a number of problems:
keeping all the shopping cart items that appeared in any of the siblings (i.e., taking the set
union of the carts). This meant that if the customer had removed an item from their cart in one
sibling, but another sibling still contained that old item, the removed item would unexpectedly
- reappear in the customer’s cart [^45]. [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
+ reappear in the customer’s cart [^45]. [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear.
* If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution
process can itself introduce a new conflict. Those resolutions could even be inconsistent: for
@@ -1088,6 +1094,8 @@ suffers from a number of problems:
{{< figure src="/fig/ddia_0610.png" id="fig_replication_amazon_anomaly" caption="Figure 6-10. Example of Amazon's shopping cart anomaly: if conflicts on a shopping cart are merged by taking the union, deleted items may reappear." class="w-full my-4" >}}
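The anomaly in Figure 6-10 can be reproduced in a few lines. This sketch assumes the cart initially contains Book, DVD, and Soap, and uses plain Python sets as a stand-in for the siblings; it is an illustration only, not the code of any real shopping cart.

```python
# Assumed starting state, shared by both devices before they diverge.
initial = {"Book", "DVD", "Soap"}

# Device 1 removes Book; concurrently, Device 2 removes DVD.
sibling1 = initial - {"Book"}
sibling2 = initial - {"DVD"}

# Merging the siblings by set union brings both removed items back.
merged = sibling1 | sibling2
assert merged == {"Book", "DVD", "Soap"}
```

A plain union cannot distinguish "this item was removed" from "this item was never added", which is why the deleted items reappear.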
+
+
#### Automatic conflict resolution {#automatic-conflict-resolution}
For many applications, the best way of handling conflicts is to use an algorithm that automatically
@@ -1105,8 +1113,8 @@ updates as much as possible, and hence avoiding data loss:
same position, it can be ordered deterministically so that all nodes get the same merged outcome.
* If the data is a collection of items (ordered like a to-do list, or unordered like a shopping
cart), we can merge it similarly to text by tracking insertions and deletions. To avoid the
- shopping cart issue in [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly), the algorithms track the fact that Book
- and DVD were deleted, so the merged result is Cart = {Soap}.
+ shopping cart issue in [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly), the algorithms track the fact that Book
+ and DVD were deleted, so the merged result is Cart = {Soap}.
* If the data is an integer representing a counter that can be incremented or decremented (e.g., the
number of likes on a social media post), the merge algorithm can tell how many increments and
decrements happened on each sibling, and add them together correctly so that the result does not
@@ -1129,7 +1137,7 @@ Two families of algorithms are commonly used to implement automatic conflict res
They have different design philosophies and performance characteristics, but both are able to
perform automatic merges for all the aforementioned types of data.
-[Figure 6-11](/en/ch6#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
+[Figure 6-11](/en/ch6#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
text. Assume you have two replicas that both start off with the text “ice”. One replica prepends the
letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make “ice!”.
@@ -1147,7 +1155,7 @@ OT
CRDT
: Most CRDTs give each character a unique, immutable ID and use those to determine the positions of
- insertions/deletions, instead of indexes. For example, in [Figure 6-11](/en/ch6#fig_replication_ot_crdt) we assign
+ insertions/deletions, instead of indexes. For example, in [Figure 6-11](/en/ch6#fig_replication_ot_crdt) we assign
the ID 1A to “i”, the ID 2A to “c”, etc. When inserting the exclamation mark, we generate an
operation containing the ID of the new character (4B) and the ID of the existing character after
which we want to insert (3A). To insert at the beginning of the string we give “nil” as the
@@ -1165,7 +1173,7 @@ Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge o
#### What is a conflict? {#what-is-a-conflict}
-Some kinds of conflict are obvious. In the example in [Figure 6-9](/en/ch6#fig_replication_write_conflict), two writes
+Some kinds of conflict are obvious. In the example in [Figure 6-9](/en/ch6#fig_replication_write_conflict), two writes
concurrently modified the same field in the same record, setting it to two different values. There
is little doubt that this is a conflict.
@@ -1179,7 +1187,7 @@ are made on two different leaders.
There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a
good understanding of this problem. We will see some more examples of conflicts in
-[Chapter 8](/en/ch8#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and
+[Chapter 8](/en/ch8#ch_transactions), and in [“Ordering events to capture causality”](/en/ch13#sec_future_capture_causality) we will discuss scalable approaches for detecting and
resolving conflicts in a replicated system.
@@ -1220,7 +1228,7 @@ configuration, if you want to continue processing writes, you may need to perfor
[“Handling Node Outages”](/en/ch6#sec_replication_failover)).
On the other hand, in a leaderless configuration, failover does not exist.
-[Figure 6-12](/en/ch6#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to
+[Figure 6-12](/en/ch6#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to
all three replicas in parallel, and the two available replicas accept the write but the unavailable
replica misses it. Let’s say that it’s sufficient for two out of three replicas to
acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be
@@ -1252,7 +1260,7 @@ mechanisms are used in Dynamo-style datastores:
Read repair
: When a client makes a read from several nodes in parallel, it can detect any stale responses.
- For example, in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from
+ For example, in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from
replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale
value and writes the newer value back to that replica. This approach works well for values that are
frequently read.
@@ -1272,7 +1280,7 @@ Anti-entropy
#### Quorums for reading and writing {#sec_replication_quorum_condition}
-In the example of [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), we considered the write to be successful
+In the example of [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), we considered the write to be successful
even though it was only processed on two out of three replicas. What if only one out of three
replicas accepted the write? How far can we push this?
@@ -1283,14 +1291,14 @@ respond, reads can nevertheless continue returning an up-to-date value.
More generally, if there are *n* replicas, every write must be confirmed by *w* nodes to be
considered successful, and we must query at least *r* nodes for each read. (In our example,
-*n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* > *n*,
+*n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* > *n*,
we expect to get an up-to-date value when reading, because at least one of the *r* nodes we’re
reading from must be up to date. Reads and writes that obey these *r* and *w* values are called *quorum* reads and writes [^50].
You can think of *r* and *w* as the minimum number of votes required for the read or write to be valid.
In Dynamo-style databases, the parameters *n*, *w*, and *r* are typically configurable. A common
choice is to make *n* an odd number (typically 3 or 5) and to set *w* = *r* =
-(*n* + 1) / 2 (rounded up). However, you can vary the numbers as you see fit.
+(*n* + 1) / 2 (rounded up). However, you can vary the numbers as you see fit.
For example, a workload with few writes and many reads may benefit from setting *w* = *n* and
*r* = 1. This makes reads faster, but has the disadvantage that just one failed node causes all
database writes to fail.
@@ -1300,19 +1308,19 @@ database writes to fail.
> [!NOTE]
> There may be more than *n* nodes in the cluster, but any given value is stored only on *n*
> nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit
-> on one node. We will return to sharding in [Chapter 7](/en/ch7#ch_sharding).
+> on one node. We will return to sharding in [Chapter 7](/en/ch7#ch_sharding).
--------
-The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes
+The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes
as follows:
-* If *w* < *n*, we can still process writes if a node is unavailable.
-* If *r* < *n*, we can still process reads if a node is unavailable.
-* With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable
- node, like in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage).
-* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes.
- This case is illustrated in [Figure 6-13](/en/ch6#fig_replication_quorum_overlap).
+* If *w* < *n*, we can still process writes if a node is unavailable.
+* If *r* < *n*, we can still process reads if a node is unavailable.
+* With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable
+ node, like in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage).
+* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes.
+ This case is illustrated in [Figure 6-13](/en/ch6#fig_replication_quorum_overlap).
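The arithmetic behind these bullets can be written down directly. The helper names below are hypothetical; the sketch only encodes the rule that a write needs *w* reachable replicas and a read needs *r*.

```python
def is_quorum(n: int, w: int, r: int) -> bool:
    """True if every read set of size r overlaps every write set of size w."""
    return w + r > n


def tolerated_failures(n: int, w: int, r: int) -> int:
    """Unavailable nodes with which both reads and writes can still succeed."""
    # Writes need w reachable nodes and reads need r, so the larger of the
    # two determines how many nodes can be down.
    return n - max(w, r)


assert is_quorum(3, 2, 2) and tolerated_failures(3, 2, 2) == 1
assert is_quorum(5, 3, 3) and tolerated_failures(5, 3, 3) == 2
# w = n, r = 1 is still a quorum, but a single failed node blocks all writes.
assert is_quorum(3, 3, 1) and tolerated_failures(3, 3, 1) == 0
```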
Normally, reads and writes are sent to all *n* replicas in parallel. The parameters *w* and *r*
determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success
@@ -1329,19 +1337,19 @@ returned a successful response and don’t need to distinguish between different
### Limitations of Quorum Consistency {#sec_replication_quorum_limitations}
-If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*, you can
+If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*, you can
generally expect every read to return the most recent value written for a key. This is the case because the
set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That
is, among the nodes you read there must be at least one node with the latest value (illustrated in
-[Figure 6-13](/en/ch6#fig_replication_quorum_overlap)).
+[Figure 6-13](/en/ch6#fig_replication_quorum_overlap)).
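The overlap follows from a counting argument. Write $W$ for the set of nodes that acknowledged the write and $R$ for the set of nodes that answered the read (these set names are introduced here for illustration). Both are subsets of the *n* replicas, with $|W| \ge w$ and $|R| \ge r$, so:

$$
|W \cap R| = |W| + |R| - |W \cup R| \ge w + r - n \ge 1
$$

The last step uses the fact that *w*, *r*, and *n* are integers, so *w* + *r* > *n* implies *w* + *r* − *n* ≥ 1.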
Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, because that ensures
-*w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are
+*w* + *r* > *n* while still tolerating fewer than *n*/2 node failures. But quorums are
not necessarily majorities—it only matters that the sets of nodes used by the read and write
operations overlap in at least one node. Other quorum assignments are possible, which allows some
flexibility in the design of distributed algorithms [^51].
-You may also set *w* and *r* to smaller numbers, so that *w* + *r* ≤ *n* (i.e.,
+You may also set *w* and *r* to smaller numbers, so that *w* + *r* ≤ *n* (i.e.,
the quorum condition is not satisfied). In this case, reads and writes will still be sent to *n*
nodes, but a smaller number of successful responses is required for the operation to succeed.
@@ -1352,14 +1360,14 @@ unreachable, there’s a higher chance that you can continue processing reads an
the number of reachable replicas falls below *w* or *r* does the database become unavailable for
writing or reading, respectively.
-However, even with *w* + *r* > *n*, there are edge cases in which the consistency
+However, even with *w* + *r* > *n*, there are edge cases in which the consistency
properties can be confusing. Some scenarios include:
* If a node carrying a new value fails, and its data is restored from a replica carrying an old
value, the number of replicas storing the new value may fall below *w*, breaking the quorum
condition.
* While a rebalancing is in progress, where some data is moved from one node to another (see
- [Chapter 7](/en/ch7#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n*
+ [Chapter 7](/en/ch7#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n*
replicas for a particular value. This can result in the read and write quorums no longer
overlapping.
* If a read is concurrent with a write operation, the read may or may not see the concurrently
@@ -1489,7 +1497,7 @@ resulting in conflicts that need to be resolved. Such conflicts may occur as the
not always: they could also be detected later during read repair, hinted handoff, or anti-entropy.
The problem is that events may arrive in a different order at different nodes, due to variable
-network delays and partial failures. For example, [Figure 6-14](/en/ch6#fig_replication_concurrency) shows two clients,
+network delays and partial failures. For example, [Figure 6-14](/en/ch6#fig_replication_concurrency) shows two clients,
A and B, simultaneously writing to a key *X* in a three-node datastore:
* Node 1 receives the write from A, but never receives the write from B due to a transient outage.
@@ -1501,7 +1509,7 @@ A and B, simultaneously writing to a key *X* in a three-node datastore:
If each node simply overwrote the value for a key whenever it received a write request from a
client, the nodes would become permanently inconsistent, as shown by the final *get* request in
-[Figure 6-14](/en/ch6#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other
+[Figure 6-14](/en/ch6#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other
nodes think that the value is A.
In order to become eventually consistent, the replicas should converge toward the same value. For
@@ -1520,11 +1528,11 @@ take more care to detect concurrent writes.
How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look
at some examples:
-* In [Figure 6-8](/en/ch6#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before*
+* In [Figure 6-8](/en/ch6#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before*
B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s
operation builds upon A’s operation, so B’s operation must have happened later.
We also say that B is *causally dependent* on A.
-* On the other hand, the two writes in [Figure 6-14](/en/ch6#fig_replication_concurrency) are concurrent: when each
+* On the other hand, the two writes in [Figure 6-14](/en/ch6#fig_replication_concurrency) are concurrent: when each
client starts the operation, it does not know that another client is also performing an operation
on the same key. Thus, there is no causal dependency between the operations.
@@ -1546,7 +1554,7 @@ conflict that needs to be resolved.
It may seem that two operations should be called concurrent if they occur “at the same time”—but
in fact, it is not important whether they literally overlap in time. Because of problems with clocks
in distributed systems, it is actually quite difficult to tell whether two things happened
-at exactly the same time—an issue we will discuss in more detail in [Chapter 9](/en/ch9#ch_distributed).
+at exactly the same time—an issue we will discuss in more detail in [Chapter 9](/en/ch9#ch_distributed).
For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if
they are both unaware of each other, regardless of the physical time at which they occurred. People
@@ -1570,7 +1578,7 @@ happened before another. To keep things simple, let’s start with a database th
replica. Once we have worked out how to do this on a single replica, we can generalize the approach
to a leaderless database with multiple replicas.
-[Figure 6-15](/en/ch6#fig_replication_causality_single) shows two clients concurrently adding items to the same
+[Figure 6-15](/en/ch6#fig_replication_causality_single) shows two clients concurrently adding items to the same
shopping cart. (If that example strikes you as too inane, imagine instead two air traffic
controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is
empty. Between them, the clients make five writes to the database:
@@ -1604,8 +1612,8 @@ empty. Between them, the clients make five writes to the database:
{{< figure src="/fig/ddia_0615.png" id="fig_replication_causality_single" caption="Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart." class="w-full my-4" >}}
-The dataflow between the operations in [Figure 6-15](/en/ch6#fig_replication_causality_single) is illustrated
-graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The arrows indicate which operation
+The dataflow between the operations in [Figure 6-15](/en/ch6#fig_replication_causality_single) is illustrated
+graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The arrows indicate which operation
*happened before* which other operation, in the sense that the later operation *knew about* or
*depended on* the earlier one. In this example, the clients are never fully up to date with the data
on the server, since there is always another operation going on concurrently. But old versions of
@@ -1638,10 +1646,10 @@ on subsequent reads.
#### Version vectors {#version-vectors}
-The example in [Figure 6-15](/en/ch6#fig_replication_causality_single) used only a single replica. How does the
+The example in [Figure 6-15](/en/ch6#fig_replication_causality_single) used only a single replica. How does the
algorithm change when there are multiple replicas, but no leader?
-[Figure 6-15](/en/ch6#fig_replication_causality_single) uses a single version number to capture dependencies between
+[Figure 6-15](/en/ch6#fig_replication_causality_single) uses a single version number to capture dependencies between
operations, but that is not sufficient when there are multiple replicas accepting writes
concurrently. Instead, we need to use a version number *per replica* as well as per key. Each
replica increments its own version number when processing a write, and also keeps track of the
@@ -1653,7 +1661,7 @@ A few variants of this idea are in use, but the most interesting is probably the
which is used in Riak 2.0 [^61] [^62].
We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.
-Like the version numbers in [Figure 6-15](/en/ch6#fig_replication_causality_single), version vectors are sent from the
+Like the version numbers in [Figure 6-15](/en/ch6#fig_replication_causality_single), version vectors are sent from the
database replicas to clients when values are read, and need to be sent back to the database when a
value is subsequently written. (Riak encodes the version vector as a string that it calls *causal
context*.) The version vector allows the database to distinguish between overwrites and concurrent
@@ -1818,4 +1826,4 @@ machine to store only a subset of the data.
[^61]: Sean Cribbs. [A Brief History of Time in Riak](https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak). At *RICON*, October 2014. Archived at [perma.cc/7U9P-6JFX](https://perma.cc/7U9P-6JFX)
[^62]: Russell Brown. [Vector Clocks Revisited Part 2: Dotted Version Vectors](https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/). *riak.com*, November 2015. Archived at [perma.cc/96QP-W98R](https://perma.cc/96QP-W98R)
[^63]: Carlos Baquero. [Version Vectors Are Not Vector Clocks](https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/). *haslab.wordpress.com*, July 2011. Archived at [perma.cc/7PNU-4AMG](https://perma.cc/7PNU-4AMG)
-[^64]: Reinhard Schwarz and Friedemann Mattern. [Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). *Distributed Computing*, volume 7, issue 3, pages 149–174, March 1994. [doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859)
\ No newline at end of file
+[^64]: Reinhard Schwarz and Friedemann Mattern. [Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). *Distributed Computing*, volume 7, issue 3, pages 149–174, March 1994. [doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859)
diff --git a/content/en/ch7.md b/content/en/ch7.md
index 2f5bb13..61afc9f 100644
--- a/content/en/ch7.md
+++ b/content/en/ch7.md
@@ -4,6 +4,8 @@ weight: 207
breadcrumbs: false
---
+
+

> *Clearly, we must break away from the sequential and not limit the computers. We must state
@@ -14,7 +16,7 @@ breadcrumbs: false
A distributed database typically distributes data across nodes in two ways:
-1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in [Chapter 6](/en/ch6#ch_replication).
+1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in [Chapter 6](/en/ch6#ch_replication).
2. If we don’t want every node to store all the data, we can split up a large amount of data into
smaller *shards* or *partitions*, and store different shards on different nodes. We’ll discuss
sharding in this chapter.
@@ -29,13 +31,13 @@ nodes. This means that, even though each record belongs to exactly one shard, it
on several different nodes for fault tolerance.
A node may store more than one shard. If a single-leader replication model is used, the combination
-of sharding and replication can look like [Figure 7-1](/en/ch7#fig_sharding_replicas), for example. Each shard’s
+of sharding and replication can look like [Figure 7-1](/en/ch7#fig_sharding_replicas), for example. Each shard’s
leader is assigned to one node, and its followers are assigned to other nodes. Each node may be the
leader for some shards and a follower for other shards, but each shard still only has one leader.
{{< figure src="/fig/ddia_0701.png" id="fig_sharding_replicas" caption="Figure 7-1. Combining replication and sharding: each node acts as leader for some shards and follower for other shards." class="w-full my-4" >}}
-Everything we discussed in [Chapter 6](/en/ch6#ch_replication) about replication of databases applies equally to
+Everything we discussed in [Chapter 6](/en/ch6#ch_replication) about replication of databases applies equally to
replication of shards. Since the choice of sharding scheme is mostly independent of the choice of
replication scheme, we will ignore replication in this chapter for the sake of simplicity.
@@ -62,7 +64,7 @@ to databases. Another theory is that *shard* was originally an acronym of *Syste
Available Replicated Data*—reportedly a 1980s database, details of which are lost to history.
By the way, partitioning has nothing to do with *network partitions* (netsplits), a type of fault in
-the network between nodes. We will discuss such faults in [Chapter 9](/en/ch9#ch_distributed).
+the network between nodes. We will discuss such faults in [Chapter 9](/en/ch9#ch_distributed).
--------
@@ -71,7 +73,7 @@ the network between nodes. We will discuss such faults in [Chapter 9](/en/ch9#c
The primary reason for sharding a database is *scalability*: it’s a solution if the volume of data
or the write throughput has become too great for a single node to handle, as it allows you to spread
that data and those writes across multiple nodes. (If read throughput is the problem, you don’t
-necessarily need sharding—you can use *read scaling* as discussed in [Chapter 6](/en/ch6#ch_replication).)
+necessarily need sharding—you can use *read scaling* as discussed in [Chapter 6](/en/ch6#ch_replication).)
In fact, sharding is one of the main tools we have for achieving *horizontal scaling* (a *scale-out*
architecture), as discussed in [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](/en/ch2#sec_introduction_shared_nothing): that is, allowing a system to
@@ -98,9 +100,9 @@ may be distributed across different shards. We will discuss this further in
[“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes).
Another problem with sharding is that a write may need to update related records in several
-different shards. While transactions on a single node are quite common (see [Chapter 8](/en/ch8#ch_transactions)),
+different shards. While transactions on a single node are quite common (see [Chapter 8](/en/ch8#ch_transactions)),
ensuring consistency across multiple shards requires a *distributed transaction*. As we shall see in
-[Chapter 8](/en/ch8#ch_transactions), distributed transactions are available in some databases, but they are usually
+[Chapter 8](/en/ch8#ch_transactions), distributed transactions are available in some databases, but they are usually
much slower than single-node transactions, may become a bottleneck for the system as a whole, and
some systems don’t support them at all.
@@ -201,7 +203,7 @@ hot spots.
One way of sharding is to assign a contiguous range of partition keys (from some minimum to some
maximum) to each shard, like the volumes of a paper encyclopedia, as illustrated in
-[Figure 7-2](/en/ch7#fig_sharding_encyclopedia). In this example, an entry’s partition key is its title. If you want
+[Figure 7-2](/en/ch7#fig_sharding_encyclopedia). In this example, an entry’s partition key is its title. If you want
to look up the entry for a particular title, you can easily determine which shard contains that
entry by finding the volume whose key range contains the title you’re looking for, and thus pick the
correct book off the shelf.
@@ -209,7 +211,7 @@ correct book off the shelf.
{{< figure src="/fig/ddia_0702.png" id="fig_sharding_encyclopedia" caption="Figure 7-2. A print encyclopedia is sharded by key range." class="w-full my-4" >}}
The ranges of keys are not necessarily evenly spaced, because your data may not be evenly
-distributed. For example, in [Figure 7-2](/en/ch7#fig_sharding_encyclopedia), volume 1 contains words starting with A
+distributed. For example, in [Figure 7-2](/en/ch7#fig_sharding_encyclopedia), volume 1 contains words starting with A
and B, but volume 12 contains words starting with T, U, V, W, X, Y, and Z. Simply having one volume
per two letters of the alphabet would lead to some volumes being much bigger than others. In order
to distribute the data evenly, the shard boundaries need to adapt to the data.
@@ -221,7 +223,7 @@ range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB
tablet splitting.
Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in
-[Chapter 4](/en/ch4#ch_storage)). This has the advantage that range scans are easy, and you can treat the key as a
+[Chapter 4](/en/ch4#ch_storage)). This has the advantage that range scans are easy, and you can treat the key as a
concatenated index in order to fetch several related records in one query (see
[“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional)). For example, consider an application that stores data from a
network of sensors, where the key is the timestamp of the measurement. Range scans are very useful
@@ -256,7 +258,7 @@ This process is similar to what happens at the top level of a B-tree (see [“B-
With databases that manage shard boundaries automatically, a shard split is typically triggered by:
-* the shard reaching a configured size (for example, on HBase, the default is 10 GB), or
+* the shard reaching a configured size (for example, on HBase, the default is 10 GB), or
* in some systems, the write throughput being persistently above some threshold. Thus, a hot shard
may be split even if it is not storing a lot of data, so that its write load can be distributed more uniformly.
@@ -278,7 +280,7 @@ application), a common approach is to first hash the partition key before mappin
A good hash function takes skewed data and makes it uniformly distributed. Say you have a 32-bit
hash function that takes a string. Whenever you give it a new string, it returns a seemingly random
-number between 0 and 2^32 − 1. Even if the input strings are very similar, their hashes are evenly
+number between 0 and 2^32 − 1. Even if the input strings are very similar, their hashes are evenly
distributed across that range of numbers (but the same input always produces the same output).
For sharding purposes, the hash function need not be cryptographically strong: for example, MongoDB
@@ -291,12 +293,12 @@ different hash value in different processes, making them unsuitable for sharding
Once you have hashed the key, how do you choose which shard to store it in? Maybe your first thought
is to take the hash value *modulo* the number of nodes in the system (using the `%` operator in many
-programming languages). For example, *hash*(*key*) % 10 would return a number between
-0 and 9 (if we write the hash as a decimal number, the hash % 10 would be the last digit).
+programming languages). For example, *hash*(*key*) % 10 would return a number between
+0 and 9 (if we write the hash as a decimal number, the hash % 10 would be the last digit).
If we have 10 nodes, numbered 0 to 9, that seems like an easy way of assigning each key to a node.
The problem with the *mod N* approach is that if the number of nodes *N* changes, most of the keys
-have to be moved from one node to another. [Figure 7-3](/en/ch7#fig_sharding_hash_mod_n) shows what happens when you
+have to be moved from one node to another. [Figure 7-3](/en/ch7#fig_sharding_hash_mod_n) shows what happens when you
have three nodes and add a fourth. Before the rebalancing, node 0 stored the keys whose hashes are
0, 3, 6, 9, and so on. After adding the fourth node, the key with hash 3 has moved to node 3, the
key with hash 6 has moved to node 2, the key with hash 9 has moved to node 1, and so on.
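
The effect described above can be reproduced with a few lines of Python. This is only an illustrative sketch: the helper names are made up, and the small range of hash values stands in for real hashed keys.

```python
def node_for(hash_value: int, num_nodes: int) -> int:
    """Assign a hash to a node using the naive mod-N scheme."""
    return hash_value % num_nodes


def moved_hashes(hashes, old_n: int, new_n: int) -> list[int]:
    """Return the hashes whose node changes when the cluster grows from old_n to new_n."""
    return [h for h in hashes if node_for(h, old_n) != node_for(h, new_n)]


if __name__ == "__main__":
    hashes = range(12)
    moved = moved_hashes(hashes, old_n=3, new_n=4)
    print(moved)
    print(f"{len(moved)} of {len(hashes)} hashes change node")
```

For the hashes 0 to 11, only 0, 1, and 2 stay where they were; the other nine move, even though a fair rebalance would only need to move about a quarter of the data to the new node.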
@@ -312,12 +314,12 @@ doesn’t move data around more than necessary.
One simple but widely-used solution is to create many more shards than there are nodes, and to
assign several shards to each node. For example, a database running on a cluster of 10 nodes may be
split into 1,000 shards from the outset so that 100 shards are assigned to each node. A key is then
-stored in shard number *hash*(*key*) % 1,000, and the system separately keeps track of
+stored in shard number *hash*(*key*) % 1,000, and the system separately keeps track of
which shard is stored on which node.
Now, if a node is added to the cluster, the system can reassign some of the shards from existing
nodes to the new node until they are fairly distributed once again. This process is illustrated in
-[Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in reverse.
+[Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in reverse.
{{< figure src="/fig/ddia_0704.png" id="fig_sharding_rebalance_fixed" caption="Figure 7-4. Adding a new node to a database cluster with multiple shards per node." class="w-full my-4" >}}
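
A minimal sketch of this scheme in Python follows. It assumes CRC32 as a stand-in for whatever stable hash function a real system uses, and the `add_node` policy (take shards from the most loaded nodes until the new node has a fair share) is a hypothetical choice for illustration, not the algorithm of any particular database.

```python
import zlib

NUM_SHARDS = 1000  # fixed at the outset, independent of the number of nodes


def shard_for(key: str) -> int:
    """Map a key to a shard; this never changes when nodes are added or removed."""
    return zlib.crc32(key.encode("utf-8")) % NUM_SHARDS


def initial_assignment(num_nodes: int) -> dict[int, int]:
    """Spread the shards over the nodes round-robin (shard -> node)."""
    return {shard: shard % num_nodes for shard in range(NUM_SHARDS)}


def add_node(assignment: dict[int, int], new_node: int) -> dict[int, int]:
    """Return a new shard -> node map in which new_node holds a fair share of shards."""
    assignment = dict(assignment)
    shards_by_node: dict[int, list[int]] = {}
    for shard, node in assignment.items():
        shards_by_node.setdefault(node, []).append(shard)
    fair_share = NUM_SHARDS // (len(shards_by_node) + 1)
    for _ in range(fair_share):
        # Take one shard from whichever existing node currently holds the most.
        donor = max(shards_by_node, key=lambda n: len(shards_by_node[n]))
        assignment[shards_by_node[donor].pop()] = new_node
    return assignment


if __name__ == "__main__":
    before = initial_assignment(10)
    after = add_node(before, new_node=10)
    moved = sum(1 for s in before if before[s] != after[s])
    print(f"{moved} of {NUM_SHARDS} shards moved")
```

Only whole shards move between nodes; the mapping from key to shard stays the same, so no key has to be rehashed.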
@@ -360,8 +362,8 @@ has this property, but it has a risk of hot spots when there are a lot of writes
solution is to combine key-range sharding with a hash function so that each shard contains a range
of *hash values* rather than a range of *keys*.
-[Figure 7-5](/en/ch7#fig_sharding_hash_range) shows an example using a 16-bit hash function that returns a number
-between 0 and 65,535 = 2^16 − 1 (in reality, the hash is usually 32 bits or more).
+[Figure 7-5](/en/ch7#fig_sharding_hash_range) shows an example using a 16-bit hash function that returns a number
+between 0 and 65,535 = 2^16 − 1 (in reality, the hash is usually 32 bits or more).
Even if the input keys are very similar (e.g., consecutive timestamps), their hashes are uniformly
distributed across that range. We can then assign a range of hash values to each shard: for example,
values between 0 and 16,383 to shard 0, values between 16,384 and 32,767 to shard 1, and so on.
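
The lookup from hash value to shard can be sketched as a binary search over the range boundaries. In the Python below, the low 16 bits of CRC32 are an assumed stand-in for a real 16-bit hash function, and the four equal ranges continue the example boundaries given above; both are illustrative choices.

```python
import bisect
import zlib

# Lower bounds of shards 1, 2, and 3; shard 0 starts at hash value 0.
BOUNDARIES = [16_384, 32_768, 49_152]


def hash16(key: str) -> int:
    """Illustrative 16-bit hash: the low 16 bits of CRC32."""
    return zlib.crc32(key.encode("utf-8")) & 0xFFFF


def shard_for_hash(hash_value: int) -> int:
    """Find the shard whose hash range contains hash_value."""
    return bisect.bisect_right(BOUNDARIES, hash_value)


def shard_for_key(key: str) -> int:
    return shard_for_hash(hash16(key))


if __name__ == "__main__":
    for ts in ("2025-01-01T00:00:00", "2025-01-01T00:00:01", "2025-01-01T00:00:02"):
        print(ts, hash16(ts), shard_for_key(ts))
```

Splitting a shard then amounts to inserting one more boundary into the sorted list, without touching the hash function.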
@@ -394,8 +396,8 @@ improve compression and filtering performance as well.
Hash-range sharding is used in YugabyteDB and DynamoDB [^17], and is an option in MongoDB.
Cassandra and ScyllaDB use a variant of this approach that is illustrated in
-[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
-to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
+[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
+to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between
those ranges. This means some ranges are bigger than others, but by having multiple ranges per node
those imbalances tend to even out [^15] [^18].
@@ -404,7 +406,7 @@ those imbalances tend to even out [^15] [^18].
When nodes are added or removed, range boundaries are added and removed, and shards are split or
merged accordingly [^19].
-In the example of [Figure 7-6](/en/ch7#fig_sharding_cassandra), when node 3 is added, node 1
+In the example of [Figure 7-6](/en/ch7#fig_sharding_cassandra), when node 3 is added, node 1
transfers parts of two of its ranges to node 3, and node 2 transfers part of one of its ranges to
node 3. This has the effect of giving the new node an approximately fair share of the dataset,
without transferring more data than necessary from one node to another.
@@ -417,8 +419,8 @@ in a way that satisfies two properties:
1. the number of keys mapped to each shard is roughly equal, and
2. when the number of shards changes, as few keys as possible are moved from one shard to another.
-Note that *consistent* here has nothing to do with replica consistency (see [Chapter 6](/en/ch6#ch_replication)) or
-ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describes the tendency of a key to stay in
+Note that *consistent* here has nothing to do with replica consistency (see [Chapter 6](/en/ch6#ch_replication)) or
+ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describes the tendency of a key to stay in
the same shard as much as possible.
The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of consistent hashing [^20],
@@ -516,7 +518,7 @@ only be handled by a node that is a replica for the shard containing that key.
This means that request routing has to be aware of the assignment from keys to shards, and from
shards to nodes. On a high level, there are a few different approaches to this problem
-(illustrated in [Figure 7-7](/en/ch7#fig_sharding_routing)):
+(illustrated in [Figure 7-7](/en/ch7#fig_sharding_routing)):
1. Allow clients to contact any node (e.g., via a round-robin load balancer). If that node
coincidentally owns the shard to which the request applies, it can handle the request directly;
@@ -544,8 +546,8 @@ In all cases, there are some key problems:
those?
Many distributed data systems rely on a separate coordination service such as ZooKeeper or etcd to
-keep track of shard assignments, as illustrated in [Figure 7-8](/en/ch7#fig_sharding_zookeeper). They use consensus
-algorithms (see [Chapter 10](/en/ch10#ch_consistency)) to provide fault tolerance and protection against split-brain.
+keep track of shard assignments, as illustrated in [Figure 7-8](/en/ch7#fig_sharding_zookeeper). They use consensus
+algorithms (see [Chapter 10](/en/ch10#ch_consistency)) to provide fault tolerance and protection against split-brain.
Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of shards
to nodes. Other actors, such as the routing tier or the sharding-aware client, can subscribe to this
information in ZooKeeper. Whenever a shard changes ownership, or a node is added or removed,
@@ -573,7 +575,7 @@ This discussion of request routing has focused on finding the shard for an indiv
most relevant for sharded OLTP databases. Analytic databases often use sharding as well, but they
typically have a very different kind of query execution: rather than executing in a single shard, a
query typically needs to aggregate and join data from many different shards in parallel. We will
-discuss techniques for such parallel query execution in [Link to Come].
+discuss techniques for such parallel query execution in [“JOIN and GROUP BY”](/en/ch11#sec_batch_join).
## Sharding and Secondary Indexes {#sec_sharding_secondary_indexes}
@@ -597,7 +599,7 @@ local and global indexes.
### Local Secondary Indexes {#id166}
For example, imagine you are operating a website for selling used cars (illustrated in
-[Figure 7-9](/en/ch7#fig_sharding_local_secondary)). Each listing has a unique ID, and you use that ID as partition
+[Figure 7-9](/en/ch7#fig_sharding_local_secondary)). Each listing has a unique ID, and you use that ID as partition
key for sharding (for example, IDs 0 to 499 in shard 0, IDs 500 to 999 in shard 1, etc.).
If you want to let users search for cars, allowing them to filter by color and by make, you need a
@@ -605,7 +607,7 @@ secondary index on `color` and `make` (in a document database these would be fie
database they would be columns). If you have declared the index, the database can perform the
indexing automatically. For example, whenever a red car is added to the database, the database shard
automatically adds its ID to the list of IDs for the index entry `color:red`. As discussed in
-[Chapter 4](/en/ch4#ch_storage), that list of IDs is also called a *postings list*.
+[Chapter 4](/en/ch4#ch_storage), that list of IDs is also called a *postings list*.
{{< figure src="/fig/ddia_0709.png" id="fig_sharding_local_secondary" caption="Figure 7-9. Local secondary indexes: each shard indexes only the records within its own shard." class="w-full my-4" >}}
@@ -632,7 +634,7 @@ want *some* results, and you don’t need all, you can send the request to any s
However, if you want all the results and don’t know their partition key in advance, you need to send
the query to all shards, and combine the results you get back, because the matching records might be
-scattered across all the shards. In [Figure 7-9](/en/ch7#fig_sharding_local_secondary), red cars appear in both shard
+scattered across all the shards. In [Figure 7-9](/en/ch7#fig_sharding_local_secondary), red cars appear in both shard
0 and shard 1.
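
The scatter/gather pattern can be sketched in Python as follows. The `Shard` class, the sample listings, and the ID ranges are made up for illustration; a real database would also need to remove stale index entries when a record is updated or deleted, which this sketch omits.

```python
from collections import defaultdict


class Shard:
    """One shard holding its records plus a local secondary index over them."""

    def __init__(self) -> None:
        self.records: dict[int, dict[str, str]] = {}
        # term such as "color:red" -> postings list (set of record IDs in this shard)
        self.index: defaultdict[str, set[int]] = defaultdict(set)

    def put(self, car_id: int, record: dict[str, str]) -> None:
        self.records[car_id] = record
        for field in ("color", "make"):
            self.index[f"{field}:{record[field]}"].add(car_id)

    def lookup(self, term: str) -> set[int]:
        return set(self.index.get(term, ()))


def shard_for_id(car_id: int) -> int:
    """Key-range sharding on the listing ID: 0-499 -> shard 0, 500-999 -> shard 1."""
    return car_id // 500


def scatter_gather(shards: list[Shard], term: str) -> list[int]:
    """Send the query to every shard and merge the partial results."""
    matches: set[int] = set()
    for shard in shards:
        matches |= shard.lookup(term)
    return sorted(matches)


if __name__ == "__main__":
    shards = [Shard(), Shard()]
    listings = {
        191: {"color": "red", "make": "Honda"},
        306: {"color": "red", "make": "Ford"},
        515: {"color": "silver", "make": "Ford"},
        768: {"color": "red", "make": "Volvo"},
    }
    for car_id, record in listings.items():
        shards[shard_for_id(car_id)].put(car_id, record)
    print(scatter_gather(shards, "color:red"))
```

A write touches only the shard that owns the listing ID, while the read for `color:red` has to visit every shard.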
This approach to querying a sharded database can make read queries on secondary indexes quite
@@ -651,7 +653,7 @@ covers data in all shards. However, we can’t just store that index on one node
likely become a bottleneck and defeat the purpose of sharding. A global index must also be sharded,
but it can be sharded differently from the primary key index.
-[Figure 7-10](/en/ch7#fig_sharding_global_secondary) illustrates what this could look like: the IDs of red cars from
+[Figure 7-10](/en/ch7#fig_sharding_global_secondary) illustrates what this could look like: the IDs of red cars from
all shards appear under `color:red` in the index, but the index is sharded so that colors starting
with the letters *a* to *r* appear in shard 0 and colors starting with *s* to *z* appear in shard 1.
The index on the make of car is partitioned similarly (with the shard boundary being between *f* and *h*).
@@ -664,7 +666,7 @@ you can search for. Here we generalise it to mean any value that you can search
The global index uses the term as partition key, so that when you’re looking for a particular term
or value, you can figure out which shard you need to query. As before, a shard can contain a
-contiguous range of terms (as in [Figure 7-10](/en/ch7#fig_sharding_global_secondary)), or you can assign terms to
+contiguous range of terms (as in [Figure 7-10](/en/ch7#fig_sharding_global_secondary)), or you can assign terms to
shards based on a hash of the term.
Global indexes have the advantage that a query with a single condition (such as *color = red*) only
@@ -682,7 +684,7 @@ Another challenge with global secondary indexes is that writes are more complica
indexes, because writing a single record might affect multiple shards of the index (every term in
the document might be on a different shard). This makes it harder to keep the secondary index in
sync with the underlying data. One option is to use a distributed transaction to atomically update
-the shards storing the primary record and its secondary indexes (see [Chapter 8](/en/ch8#ch_transactions)).
+the shards storing the primary record and its secondary indexes (see [Chapter 8](/en/ch8#ch_transactions)).
Global secondary indexes are used by CockroachDB, TiDB, and YugabyteDB; DynamoDB supports both local
and global secondary indexes. In the case of DynamoDB, writes are asynchronously reflected in global
@@ -781,4 +783,4 @@ that question in the following chapters.
[^31]: Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. [Earlybird: Real-Time Search at Twitter](https://cs.uwaterloo.ca/~jimmylin/publications/Busch_etal_ICDE2012.pdf). At *28th IEEE International Conference on Data Engineering* (ICDE), April 2012. [doi:10.1109/ICDE.2012.149](https://doi.org/10.1109/ICDE.2012.149)
[^32]: Nadav Har’El. [Indexing in Cassandra 3](https://github.com/scylladb/scylladb/wiki/Indexing-in-Cassandra-3). *github.com*, April 2017. Archived at [perma.cc/3ENV-8T9P](https://perma.cc/3ENV-8T9P)
[^33]: Zachary Tong. [Customizing Your Document Routing](https://www.elastic.co/blog/customizing-your-document-routing/). *elastic.co*, June 2013. Archived at [perma.cc/97VM-MREN](https://perma.cc/97VM-MREN)
-[^34]: Andrew Pavlo. [H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/). *hstore.cs.brown.edu*, October 2013. Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z)
\ No newline at end of file
+[^34]: Andrew Pavlo. [H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/). *hstore.cs.brown.edu*, October 2013. Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z)
diff --git a/content/en/ch8.md b/content/en/ch8.md
index e4924a2..36375f9 100644
--- a/content/en/ch8.md
+++ b/content/en/ch8.md
@@ -4,6 +4,8 @@ weight: 208
breadcrumbs: false
---
+
+

> *Some authors have claimed that general two-phase commit is too expensive to support, because of the
@@ -75,8 +77,8 @@ similar to that of System R.
In the late 2000s, nonrelational (NoSQL) databases started gaining popularity. They aimed to
improve upon the relational status quo by offering a choice of new data models (see
-[Chapter 3](/en/ch3#ch_datamodels)), and by including replication ([Chapter 6](/en/ch6#ch_replication)) and sharding
-([Chapter 7](/en/ch7#ch_sharding)) by default. Transactions were the main casualty of this movement: many of this
+[Chapter 3](/en/ch3#ch_datamodels)), and by including replication ([Chapter 6](/en/ch6#ch_replication)) and sharding
+([Chapter 7](/en/ch7#ch_sharding)) by default. Transactions were the main casualty of this movement: many of this
generation of databases abandoned transactions entirely, or redefined the word to describe a
much weaker set of guarantees than had previously been understood.
@@ -85,7 +87,7 @@ fundamentally unscalable, and that any large-scale system would have to abandon
order to maintain good performance and high availability. More recently, that belief has turned out
to be wrong. So-called “NewSQL” databases such as CockroachDB [^5], TiDB [^6], Spanner [^7], FoundationDB [^8],
and Yugabyte have shown that transactional systems can scale to large data volumes and high
-throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide
+throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide
strong ACID guarantees at scale.
However, that doesn’t mean that every system must be transactional either: like every other
@@ -146,7 +148,7 @@ the defining feature of ACID atomicity. Perhaps *abortability* would have been a
The word *consistency* is terribly overloaded:
-* In [Chapter 6](/en/ch6#ch_replication) we discussed *replica consistency* and the issue of *eventual consistency*
+* In [Chapter 6](/en/ch6#ch_replication) we discussed *replica consistency* and the issue of *eventual consistency*
that arises in asynchronously replicated systems (see [“Problems with Replication Lag”](/en/ch6#sec_replication_lag)).
* A *consistent snapshot* of a database, e.g. for a backup, is a snapshot of the entire database as
it existed at one moment in time. More precisely, it is consistent with the happens-before
@@ -155,7 +157,7 @@ The word *consistency* is terribly overloaded:
value was written.
* *Consistent hashing* is an approach to sharding that some systems use for rebalancing (see
[“Consistent hashing”](/en/ch7#sec_sharding_consistent_hashing)).
-* In the CAP theorem (see [Chapter 10](/en/ch10#ch_consistency)), the word *consistency* is used to mean
+* In the CAP theorem (see [Chapter 10](/en/ch10#ch_consistency)), the word *consistency* is used to mean
*linearizability* (see [“Linearizability”](/en/ch10#sec_consistency_linearizability)).
* In the context of ACID, *consistency* refers to an application-specific notion of the database
being in a “good state.”
@@ -188,10 +190,10 @@ Most databases are accessed by several clients at the same time. That is no prob
reading and writing different parts of the database, but if they are accessing the same database
records, you can run into concurrency problems (race conditions).
-[Figure 8-1](/en/ch8#fig_transactions_increment) is a simple example of this kind of problem. Say you have two clients
+[Figure 8-1](/en/ch8#fig_transactions_increment) is a simple example of this kind of problem. Say you have two clients
simultaneously incrementing a counter that is stored in a database. Each client needs to read the
current value, add 1, and write the new value back (assuming there is no increment operation built
-into the database). In [Figure 8-1](/en/ch8#fig_transactions_increment) the counter should have increased from 42 to
+into the database). In [Figure 8-1](/en/ch8#fig_transactions_increment) the counter should have increased from 42 to
44, because two increments happened, but it actually only went to 43 because of the race condition.
{{< figure src="/fig/ddia_0801.png" id="fig_transactions_increment" caption="Figure 8-1. A race condition between two clients concurrently incrementing a counter." class="w-full my-4" >}}
@@ -234,6 +236,8 @@ database can do to save you.
--------
+
+
> [!TIP] REPLICATION AND DURABILITY
Historically, durability meant writing to an archive tape. Then it was understood as writing to a disk
@@ -291,7 +295,7 @@ Isolation
These definitions assume that you want to modify several objects (rows, documents, records) at once.
Such *multi-object transactions* are often needed if several pieces of data need to be kept in sync.
-[Figure 8-2](/en/ch8#fig_transactions_read_uncommitted) shows an example from an email application. To display the
+[Figure 8-2](/en/ch8#fig_transactions_read_uncommitted) shows an example from an email application. To display the
number of unread messages for a user, you could query something like:
```
@@ -307,14 +311,14 @@ number of unread messages in a separate field (a kind of denormalization, which
unread counter as well, and whenever a message is marked as read, you also have to decrement the
unread counter.
-In [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), user 2 experiences an anomaly: the mailbox listing shows
+In [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), user 2 experiences an anomaly: the mailbox listing shows
an unread message, but the counter shows zero unread messages because the counter increment has not
yet happened. (If an incorrect counter in an email application seems too insignificant, think of a
customer account balance instead of an unread counter, and a payment transaction instead of an
email.) Isolation would have prevented this issue by ensuring that user 2 sees either both the
inserted email and the updated counter, or neither, but not an inconsistent halfway point.
-[Figure 8-3](/en/ch8#fig_transactions_atomicity) illustrates the need for atomicity: if an error occurs somewhere
+[Figure 8-3](/en/ch8#fig_transactions_atomicity) illustrates the need for atomicity: if an error occurs somewhere
over the course of the transaction, the contents of the mailbox and the unread counter might become out
of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted
and the inserted email is rolled back.
@@ -337,10 +341,10 @@ database in a partially updated state.
#### Single-object writes {#sec_transactions_single_object}
Atomicity and isolation also apply when a single object is being changed. For example, imagine you
-are writing a 20 KB JSON document to a database:
+are writing a 20 KB JSON document to a database:
-* If the network connection is interrupted after the first 10 KB have been sent, does the
- database store that unparseable 10 KB fragment of JSON?
+* If the network connection is interrupted after the first 10 KB have been sent, does the
+ database store that unparseable 10 KB fragment of JSON?
* If the power fails while the database is in the middle of overwriting the previous value on disk,
do you end up with the old and new values spliced together?
* If another client reads that document while the write is in progress, will it see a partially
@@ -353,7 +357,7 @@ isolation can be implemented using a lock on each object (allowing only one thre
object at any one time).
Some databases also provide more complex atomic operations, such as an increment operation, which
-removes the need for a read-modify-write cycle like that in [Figure 8-1](/en/ch8#fig_transactions_increment).
+removes the need for a read-modify-write cycle like that in [Figure 8-1](/en/ch8#fig_transactions_increment).
Similarly popular is a *conditional write* operation, which allows a write to happen only if the value
has not been concurrently changed by someone else (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)),
similarly to a compare-and-set or compare-and-swap (CAS) operation in shared-memory concurrency.
@@ -391,7 +395,7 @@ However, in many other cases writes to several different objects need to be coor
document, which is treated as a single object—no multi-object transactions are needed when
updating a single document. However, document databases lacking join functionality also encourage
denormalization (see [“When to Use Which Model”](/en/ch3#sec_datamodels_document_summary)). When denormalized information needs to
- be updated, like in the example of [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), you need to update
+ be updated, like in the example of [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), you need to update
several documents in one go. Transactions are very useful in this situation to prevent
denormalized data from going out of sync.
* In databases with secondary indexes (almost everything except pure key-value stores), the indexes
@@ -403,7 +407,7 @@ However, in many other cases writes to several different objects need to be coor
Such applications can still be implemented without transactions. However, error handling becomes
much more complicated without atomicity, and the lack of isolation can cause concurrency problems.
We will discuss those in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels), and explore alternative approaches
-in [Link to Come].
+in [“Derived data versus distributed transactions”](/en/ch13#sec_future_derived_vs_transactions).
#### Handling errors and aborts {#handling-errors-and-aborts}
@@ -521,7 +525,7 @@ Can another transaction see that uncommitted data? If yes, that is called a
Transactions running at the read committed isolation level must prevent dirty reads. This means that
any writes by a transaction only become visible to others when that transaction commits (and then
-all of its writes become visible at once). This is illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still
+all of its writes become visible at once). This is illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still
returns the old value, 2, while user 1 has not yet committed.
{{< figure src="/fig/ddia_0804.png" id="fig_transactions_read_committed" caption="Figure 8-4. No dirty reads: user 2 sees the new value for x only after user 1's transaction has committed." class="w-full my-4" >}}
@@ -529,12 +533,12 @@ returns the old value, 2, while user 1 has not yet committed.
There are a few reasons why it’s useful to prevent dirty reads:
* If a transaction needs to update several rows, a dirty read means that another transaction may
- see some of the updates but not others. For example, in [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), the
+ see some of the updates but not others. For example, in [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), the
user sees the new unread email but not the updated counter. This is a dirty read of the email.
Seeing the database in a partially updated state is confusing to users and may cause other
transactions to take incorrect decisions.
* If a transaction aborts, any writes it has made need to be rolled back (like in
- [Figure 8-3](/en/ch8#fig_transactions_atomicity)). If the database allows dirty reads, that means a transaction may
+ [Figure 8-3](/en/ch8#fig_transactions_atomicity)). If the database allows dirty reads, that means a transaction may
see data that is later rolled back—i.e., which is never actually committed to the database. Any
transaction that read uncommitted data would also need to be aborted, leading to a problem called
*cascading aborts*.
@@ -553,15 +557,15 @@ first write’s transaction has committed or aborted.
By preventing dirty writes, this isolation level avoids some kinds of concurrency problems:
* If transactions update multiple rows, dirty writes can lead to a bad outcome. For example,
- consider [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), which illustrates a used car sales website on which
+ consider [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), which illustrates a used car sales website on which
two people, Aaliyah and Bryce, are simultaneously trying to buy the same car. Buying a car requires
two database writes: the listing on the website needs to be updated to reflect the buyer, and the
- sales invoice needs to be sent to the buyer. In the case of [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), the
+ sales invoice needs to be sent to the buyer. In the case of [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), the
sale is awarded to Bryce (because he performs the winning update to the `listings` table), but the
invoice is sent to Aaliyah (because she performs the winning update to the `invoices` table). Read
committed prevents such mishaps.
* However, read committed does *not* prevent the race condition between two counter increments in
- [Figure 8-1](/en/ch8#fig_transactions_increment). In this case, the second write happens after the first transaction
+ [Figure 8-1](/en/ch8#fig_transactions_increment). In this case, the second write happens after the first transaction
has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason—in
[“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update) we will discuss how to make such counter increments safe.
@@ -597,7 +601,7 @@ different part of the application, due to waiting for locks.
Nevertheless, locks are used to prevent dirty reads in some databases, such as IBM
Db2 and Microsoft SQL Server in the `read_committed_snapshot=off` setting [^29].
-A more commonly used approach to preventing dirty reads is the one illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed): for every
+A more commonly used approach to preventing dirty reads is the one illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed): for every
row that is written, the database remembers both the old committed value and the new value
set by the transaction that currently holds the write lock. While the transaction is ongoing, any
other transactions that read the row are simply given the old value. Only when the new value is
@@ -613,7 +617,7 @@ getting intermingled. Indeed, those are useful features, and much stronger guara
get from a system that has no transactions.
However, there are still plenty of ways in which you can have concurrency bugs when using this
-isolation level. For example, [Figure 8-6](/en/ch8#fig_transactions_item_many_preceders) illustrates a problem that
+isolation level. For example, [Figure 8-6](/en/ch8#fig_transactions_item_many_preceders) illustrates a problem that
can occur with read committed.
{{< figure src="/fig/ddia_0806.png" id="fig_transactions_item_many_preceders" caption="Figure 8-6. Read skew: Aaliyah observes the database in an inconsistent state." class="w-full my-4" >}}
@@ -685,14 +689,14 @@ database to handle long-running read queries on a consistent snapshot at the sam
writes normally, without any lock contention between the two.
To implement snapshot isolation, databases use a generalization of the mechanism we saw for
-preventing dirty reads in [Figure 8-4](/en/ch8#fig_transactions_read_committed). Instead of two versions of each row
+preventing dirty reads in [Figure 8-4](/en/ch8#fig_transactions_read_committed). Instead of two versions of each row
(the committed version and the overwritten-but-not-yet-committed version), the database must
potentially keep several different committed versions of a row, because various in-progress
transactions may need to see the state of the database at different points in time. Because it
maintains several versions of a row side by side, this technique is known as *multi-version
concurrency control* (MVCC).
-[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL
+[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL
[^40] [^42] [^43] (other implementations are similar).
When a transaction is started, it is given a unique, always-increasing transaction ID (`txid`).
Whenever a transaction writes anything to the database, the data it writes is tagged with the
@@ -712,7 +716,7 @@ garbage collection process in the database removes any rows marked for deletion
space.
An update is internally translated into a delete and an insert [^44].
-For example, in [Figure 8-7](/en/ch8#fig_transactions_mvcc), transaction 13 deducts $100 from account 2, changing the
+For example, in [Figure 8-7](/en/ch8#fig_transactions_mvcc), transaction 13 deducts $100 from account 2, changing the
balance from $500 to $400. The `accounts` table now actually contains two rows for account 2: a row
with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of
$400 which was inserted by transaction 13.
@@ -741,7 +745,7 @@ consistent snapshot of the database to the application. This works roughly as fo
process can remove them later.
4. All other writes are visible to the application’s queries.
-These rules apply to both insertion and deletion of rows. In [Figure 8-7](/en/ch8#fig_transactions_mvcc), when
+These rules apply to both insertion and deletion of rows. In [Figure 8-7](/en/ch8#fig_transactions_mvcc), when
transaction 12 reads from account 2, it sees a balance of $500 because the deletion of the $500
balance was made by transaction 13 (according to rule 2, transaction 12 cannot see a deletion made
by transaction 13), and the insertion of the $400 balance is not yet visible (by the same rule).
@@ -758,6 +762,8 @@ that (from other transactions’ point of view) have long been overwritten or de
updating values in place but instead inserting a new version every time a value is changed, the
database can provide a consistent snapshot while incurring only a small overhead.
+
+
#### Indexes and snapshot isolation {#indexes-and-snapshot-isolation}
How do indexes work in a multi-version database? The most common approach is that each index entry
@@ -819,7 +825,7 @@ the issue of two transactions writing concurrently—we have only discussed dirt
There are several other interesting kinds of conflicts that can occur between concurrently writing
transactions. The best known of these is the *lost update* problem, illustrated in
-[Figure 8-1](/en/ch8#fig_transactions_increment) with the example of two concurrent counter increments.
+[Figure 8-1](/en/ch8#fig_transactions_increment) with the example of two concurrent counter increments.
The lost update problem can occur if an application reads some value from the database, modifies it,
and writes back the modified value (a *read-modify-write cycle*). If two transactions do this
@@ -875,7 +881,7 @@ For example, consider a multiplayer game in which several players can move the s
concurrently. In this case, an atomic operation may not be sufficient, because the application also
needs to ensure that a player’s move abides by the rules of the game, which involves some logic that
you cannot sensibly implement as a database query. Instead, you may use a lock to prevent two
-players from concurrently moving the same piece, as illustrated in [Example 8-1](/en/ch8#fig_transactions_select_for_update).
+players from concurrently moving the same piece, as illustrated in [Example 8-1](/en/ch8#fig_transactions_select_for_update).
{{< figure id="fig_transactions_select_for_update" title="Example 8-1. Explicitly locking rows to prevent lost updates" class="w-full my-4" >}}
@@ -956,7 +962,7 @@ written by other transactions are visible to the evaluation of the `WHERE` claus
#### Conflict resolution and replication {#conflict-resolution-and-replication}
-In replicated databases (see [Chapter 6](/en/ch6#ch_replication)), preventing lost updates takes on another
+In replicated databases (see [Chapter 6](/en/ch6#ch_replication)), preventing lost updates takes on another
dimension: since they have copies of the data on multiple nodes, and the data can potentially be
modified concurrently on different nodes, some additional steps need to be taken to prevent lost
updates.
@@ -1000,7 +1006,7 @@ they are sick themselves), provided that at least one colleague remains on call
Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are
feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button
to go off call at approximately the same time. What happens next is illustrated in
-[Figure 8-8](/en/ch8#fig_transactions_write_skew).
+[Figure 8-8](/en/ch8#fig_transactions_write_skew).
{{< figure src="/fig/ddia_0808.png" id="fig_transactions_write_skew" caption="Figure 8-8. Example of write skew causing an application bug." class="w-full my-4" >}}
@@ -1070,7 +1076,7 @@ Meeting room booking system
: Say you want to enforce that there cannot be two bookings for the same meeting room at the same time [^55].
When someone wants to make a booking, you first check for any conflicting bookings (i.e.,
bookings for the same room with an overlapping time range), and if none are found, you create the
- meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)).
+ meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)).
{{< figure id="fig_transactions_meeting_rooms" title="Example 8-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)" class="w-full my-4" >}}
@@ -1094,7 +1100,7 @@ Meeting room booking system
isolation.
Multiplayer game
-: In [Example 8-1](/en/ch8#fig_transactions_select_for_update), we used a lock to prevent lost updates (that is, making
+: In [Example 8-1](/en/ch8#fig_transactions_select_for_update), we used a lock to prevent lost updates (that is, making
sure that two players can’t move the same figure at the same time). However, the lock doesn’t
prevent players from moving two different figures to the same position on the board or potentially
making some other move that violates the rules of the game. Depending on the kind of rule you are
@@ -1278,7 +1284,7 @@ containing a single statement, or submit the entire transaction code to the data
as a *stored procedure* [^61].
The difference between interactive transactions and stored procedures is illustrated in
-[Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the
+[Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the
stored procedure can execute very quickly, without waiting for any network or disk I/O.
{{< figure src="/fig/ddia_0809.png" id="fig_transactions_stored_proc" caption="Figure 8-9. The difference between an interactive transaction and a stored procedure (using the example transaction of [Figure 8-8](/en/ch8#fig_transactions_write_skew))." class="w-full my-4" >}}
@@ -1322,7 +1328,7 @@ requires that stored procedures are *deterministic* (when run on different nodes
the same result). If a transaction needs to use the current date and time, for example, it must do
so through special deterministic APIs (see [“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows) for more details on
deterministic operations). This approach is called *state machine replication*, and we will return
-to it in [Chapter 10](/en/ch10#ch_consistency).
+to it in [Chapter 10](/en/ch10#ch_consistency).
#### Sharding {#sharding}
@@ -1332,7 +1338,7 @@ Read-only transactions may execute elsewhere, using snapshot isolation, but for
high write throughput, the single-threaded transaction processor can become a serious bottleneck.
In order to scale to multiple CPU cores, and multiple nodes, you can shard your data
-(see [Chapter 7](/en/ch7#ch_sharding)), which is supported in VoltDB. If you can find a way of sharding your dataset
+(see [Chapter 7](/en/ch7#ch_sharding)), which is supported in VoltDB. If you can find a way of sharding your dataset
so that each transaction only needs to read and write data within a single shard, then each shard
can have its own transaction processing thread running independently from the others. In this case,
you can give each CPU core its own shard, which allows your transaction throughput to scale linearly
@@ -1398,7 +1404,7 @@ anyone wants to write (modify or delete) an object, exclusive access is required
unexpectedly behind A’s back.)
* If transaction A has written an object and transaction B wants to read that object, B must wait
until A commits or aborts before it can continue. (Reading an old version of the object, like in
- [Figure 8-4](/en/ch8#fig_transactions_read_committed), is not acceptable under 2PL.)
+ [Figure 8-4](/en/ch8#fig_transactions_read_committed), is not acceptable under 2PL.)
In 2PL, writers don’t just block other writers; they also block readers and vice
versa. Snapshot isolation has the mantra *readers never block writers, and writers never block
@@ -1470,7 +1476,7 @@ changing the results of another transaction’s search query. A database with se
must prevent phantoms.
In the meeting room booking example this means that if one transaction has searched for existing
-bookings for a room within a certain time window (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)), another
+bookings for a room within a certain time window (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)), another
transaction is not allowed to concurrently insert or update another booking for the same room and
time range. (It’s okay to concurrently insert bookings for other rooms, or for the same room at a
different time that doesn’t affect the proposed booking.)
@@ -1623,7 +1629,7 @@ see [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_sn
MVCC database, it ignores writes that were made by any other transactions that hadn’t yet committed
at the time when the snapshot was taken.
-In [Figure 8-10](/en/ch8#fig_transactions_detect_mvcc), transaction 43 sees
+In [Figure 8-10](/en/ch8#fig_transactions_detect_mvcc), transaction 43 sees
Aaliyah as having `on_call = true`, because transaction 42 (which modified Aaliyah’s on-call status) is
uncommitted. However, by the time transaction 43 wants to commit, transaction 42 has already
committed. This means that the write that was ignored when reading from the consistent snapshot has
@@ -1650,7 +1656,7 @@ isolation’s support for long-running reads from a consistent snapshot.
#### Detecting writes that affect prior reads {#sec_detecting_writes_affect_reads}
The second case to consider is when another transaction modifies data after it has been read. This
-case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range).
+case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range).
{{< figure src="/fig/ddia_0811.png" id="fig_transactions_detect_index_range" caption="Figure 8-11. In serializable snapshot isolation, detecting when one transaction modifies another transaction's reads." class="w-full my-4" >}}
@@ -1660,7 +1666,7 @@ In the context of two-phase locking we discussed index-range locks (see
search query, such as `WHERE shift_id = 1234`. We can use a similar technique here, except that SSI
locks don’t block other transactions.
-In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transactions 42 and 43 both search for on-call doctors
+In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transactions 42 and 43 both search for on-call doctors
during shift `1234`. If there is an index on `shift_id`, the database can use the index entry 1234 to
record the fact that transactions 42 and 43 read this data. (If there is no index, this information
can be tracked at the table level.) This information only needs to be kept for a while: after a
@@ -1672,7 +1678,7 @@ that have recently read the affected data. This process is similar to acquiring
key range, but rather than blocking until the readers have committed, the lock acts as a tripwire:
it simply notifies the transactions that the data they read may no longer be up to date.
-In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transaction 43 notifies transaction 42 that its prior
+In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transaction 43 notifies transaction 42 that its prior
read is outdated, and vice versa. Transaction 42 is first to commit, and it is successful: although
transaction 43’s write affected 42, 43 hasn’t yet committed, so the write has not yet taken effect.
However, when transaction 43 wants to commit, the conflicting write from 42 has already been
@@ -1750,7 +1756,7 @@ distributed transactions, but various distributed relational databases do.
In these cases, it is not sufficient to simply send a commit request to all of the nodes and
independently commit the transaction on each one. It could easily happen that the commit succeeds on
-some nodes and fails on other nodes, as shown in [Figure 8-12](/en/ch8#fig_transactions_non_atomic):
+some nodes and fails on other nodes, as shown in [Figure 8-12](/en/ch8#fig_transactions_non_atomic):
* Some nodes may detect a constraint violation or conflict, making an abort necessary, while other
nodes are successfully able to commit.
@@ -1766,7 +1772,7 @@ If some nodes commit the transaction but others abort it, the nodes become incon
other. And once a transaction has been committed on one node, it cannot be retracted again if it
later turns out that it was aborted on another node. This is because once data has been committed,
it becomes visible to other transactions under *read committed* or stronger isolation. For example,
-in [Figure 8-12](/en/ch8#fig_transactions_non_atomic), by the time user 1 notices that its commit failed on database 1,
+in [Figure 8-12](/en/ch8#fig_transactions_non_atomic), by the time user 1 notices that its commit failed on database 1,
user 2 has already read the data from the same transaction on database 2. If user 1’s transaction
was later aborted, user 2’s transaction would have to be reverted as well, since it was based on
data that was retroactively declared not to have existed.
@@ -1782,7 +1788,7 @@ internally in some databases and also made available to applications in the form
(which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP
web services [^74] [^75].
-The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
+The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two
phases (hence the name).
@@ -1877,7 +1883,7 @@ was committed or aborted. If the coordinator crashes or the network fails at thi
participant can do nothing but wait. A participant’s transaction in this state is called *in doubt*
or *uncertain*.
-The situation is illustrated in [Figure 8-14](/en/ch8#fig_transactions_2pc_crash). In this particular example, the
+The situation is illustrated in [Figure 8-14](/en/ch8#fig_transactions_2pc_crash). In this particular example, the
coordinator actually decided to commit, and database 2 received the commit request. However, the
coordinator crashed before it could send the commit request to database 1, and so database 1 does
not know whether to commit or abort. Even a timeout does not help here: if database 1 unilaterally
@@ -1907,11 +1913,11 @@ is not so straightforward.
As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed [^13] [^77].
However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most
-practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
+practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
cannot guarantee atomicity.
A better solution in practice is to replace the single-node coordinator with a fault-tolerant
-consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_consistency).
+consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_consistency).
### Distributed Transactions Across Different Systems {#sec_transactions_xa}
@@ -2018,7 +2024,7 @@ writes. In addition, if you want serializable isolation, a database using two-ph
also have to take a shared lock on any rows *read* by the transaction.
The database cannot release those locks until the transaction commits or aborts (illustrated as a
-shaded area in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit)). Therefore, when using two-phase commit, a
+shaded area in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit)). Therefore, when using two-phase commit, a
transaction must hold onto the locks throughout the time it is in doubt. If the coordinator has
crashed and takes 20 minutes to start up again, those locks will be held for 20 minutes. If the
coordinator’s log is entirely lost for some reason, those locks will be held forever—or at least
@@ -2086,7 +2092,7 @@ different systems.
These problems are somewhat inherent in performing transactions across heterogeneous technologies.
However, keeping several heterogeneous data systems consistent with each other is still a real and
important problem, so we need to find a different solution to it. This can be done, as we will see
-in the next section and in [Link to Come].
+in the next section and in [“Derived data versus distributed transactions”](/en/ch13#sec_future_derived_vs_transactions).
### Database-internal Distributed Transactions {#sec_transactions_internal}
@@ -2111,7 +2117,7 @@ The biggest problems with XA can be fixed by:
* Coupling the atomic commitment protocol with a distributed concurrency control protocol that supports deadlock detection and consistent reads across shards.
Consensus algorithms are commonly used to replicate the coordinator and the database shards. We will
-see in [Chapter 10](/en/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented
+see in [Chapter 10](/en/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented
using a consensus algorithm. These algorithms tolerate faults by automatically failing over from one
node to another without any human intervention, and while continuing to guarantee strong consistency
properties.
@@ -2159,7 +2165,7 @@ Thus, achieving exactly-once processing only requires transactions within the da
across database and message broker is not necessary for this use case. Recording the message ID in
the database makes the message processing *idempotent*, so that message processing can be safely
retried without duplicating its side-effects. A similar approach is used in stream processing
-frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in [Link to Come].
+frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in [“Fault Tolerance”](/en/ch12#sec_stream_fault_tolerance).
However, internal distributed transactions within the database are still useful for the scalability
of patterns such as these: for example, they would allow the message IDs to be stored on one shard
@@ -2189,7 +2195,7 @@ can have on the database.
In this chapter, we went particularly deep into the topic of concurrency control. We discussed
several widely used isolation levels, in particular *read committed*, *snapshot isolation*
(sometimes called *repeatable read*), and *serializable*. We characterized those isolation levels by
-discussing various examples of race conditions, summarized in [Table 8-1](/en/ch8#ch_transactions_isolation_levels):
+discussing various examples of race conditions, summarized in [Table 8-1](/en/ch8#ch_transactions_isolation_levels):
{{< figure id="ch_transactions_isolation_levels" title="Table 8-1. Summary of anomalies that can occur at various isolation levels" class="w-full my-4" >}}
diff --git a/content/en/ch9.md b/content/en/ch9.md
index 96a4089..30104ba 100644
--- a/content/en/ch9.md
+++ b/content/en/ch9.md
@@ -4,6 +4,8 @@ weight: 209
breadcrumbs: false
---
+
+

> *They’re funny things, Accidents. You never have them till you’re having them.*
@@ -33,7 +35,7 @@ explore the things that may go wrong in a distributed system. We will look into
networks ([“Unreliable Networks”](/en/ch9#sec_distributed_networks)) as well as clocks and timing issues
([“Unreliable Clocks”](/en/ch9#sec_distributed_clocks)). The consequences of all these issues are disorienting, so we’ll
explore how to think about the state of a distributed system and how to reason about things that
-have happened ([“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth)). Later, in [Chapter 10](/en/ch10#ch_consistency), we will look at some
+have happened ([“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth)). Later, in [Chapter 10](/en/ch10#ch_consistency), we will look at some
examples of how we can achieve fault tolerance in the face of those faults.
## Faults and Partial Failures {#sec_distributed_partial_failure}
@@ -104,7 +106,7 @@ The internet and most internal networks in datacenters (often Ethernet) are *asy
networks*. In this kind of network, one node can send a message (a packet) to another node, but the
network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send
a request and expect a response, many things could go wrong (some of which are illustrated in
-[Figure 9-1](/en/ch9#fig_distributed_network)):
+[Figure 9-1](/en/ch9#fig_distributed_network)):
1. Your request may have been lost (perhaps someone unplugged a network cable).
2. Your request may be waiting in a queue and will be delivered later (perhaps the network or the
@@ -219,7 +221,7 @@ even in controlled environments like a datacenter operated by one company [^8]:
When one part of the network is cut off from the rest due to a network fault, that is sometimes
called a *network partition* or *netsplit*, but it is not fundamentally different from other kinds
of network interruption. Network partitions are not related to sharding of a storage system, which
-is sometimes also called *partitioning* (see [Chapter 7](/en/ch7#ch_sharding)).
+is sometimes also called *partitioning* (see [Chapter 7](/en/ch7#ch_sharding)).
--------
@@ -286,7 +288,7 @@ to a load spike on the node or the network).
Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of
performing some action (for example, sending an email), and another node takes over, the action may
end up being performed twice. We will discuss this issue in more detail in
-[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), and in Chapters [^10] and [Link to Come].
+[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), [Chapter 10](/en/ch10#ch_consistency), and [“The End-to-End Argument for Databases”](/en/ch13#sec_future_end_to_end).
When a node is declared dead, its responsibilities need to be transferred to other nodes, which
places additional load on other nodes and the network. If the system is already struggling with high
@@ -299,9 +301,9 @@ Imagine a fictitious system with a network that guaranteed a maximum delay for p
is either delivered within some time *d*, or it is lost, but delivery never takes longer than *d*.
Furthermore, assume that you can guarantee that a non-failed node always handles a request within
some time *r*. In this case, you could guarantee that every successful request receives a response
-within time 2*d* + *r*—and if you don’t receive a response within that time, you know
+within time 2*d* + *r*—and if you don’t receive a response within that time, you know
that either the network or the remote node is not working. If this was true,
-2*d* + *r* would be a reasonable timeout to use.
+2*d* + *r* would be a reasonable timeout to use.
Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks
have *unbounded delays* (that is, they try to deliver packets as quickly as possible, but there is
@@ -311,6 +313,8 @@ cannot guarantee that they can handle requests within some maximum time (see
be fast most of the time: if your timeout is low, it only takes a transient spike in round-trip
times to throw the system off-balance.
+
+
#### Network congestion and queueing {#network-congestion-and-queueing}
When driving a car, travel times on road networks often vary most due to traffic congestion.
@@ -318,7 +322,7 @@ Similarly, the variability of packet delays on computer networks is most often d
* If several different nodes simultaneously try to send packets to the same destination, the network
switch must queue them up and feed them into the destination network link one by one (as illustrated
- in [Figure 9-2](/en/ch9#fig_distributed_switch_queueing)). On a busy network link, a packet may have to wait a while
+ in [Figure 9-2](/en/ch9#fig_distributed_switch_queueing)). On a busy network link, a packet may have to wait a while
until it can get a slot (this is called *network congestion*). If there is so much incoming data
that the switch queue fills up, the packet is dropped, so it needs to be resent—even though
the network is functioning fine.
@@ -340,6 +344,8 @@ expire, and then waiting for the retransmitted packet to be acknowledged).
--------
+
+
> [!TIP] TCP VERSUS UDP
Some latency-sensitive applications, such as videoconferencing and Voice over IP (VoIP), use UDP
@@ -445,6 +451,8 @@ applications to reprioritize packets for QoS purposes.
--------
+
+
> [!TIP] LATENCY AND RESOURCE UTILIZATION
More generally, you can think of variable delays as a consequence of dynamic resource partitioning.
@@ -548,7 +556,7 @@ unsuitable for measuring elapsed time [^40].
Time-of-day clocks can experience jumps due to the start and end of Daylight Saving Time (DST);
these can be avoided by always using UTC as time zone, which does not have DST.
Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward
-in steps of 10 ms on older Windows systems [^41].
+in steps of 10 ms on older Windows systems [^41].
On recent systems, this is less of a problem.
#### Monotonic clocks {#monotonic-clocks}
@@ -591,8 +599,8 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples
* The quartz clock in a computer is not very accurate: it *drifts* (runs faster or slower than it
should). Clock drift varies depending on the temperature of the machine. Google assumes a clock
- drift of up to 200 ppm (parts per million) for its servers [^45],
- which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30
+ drift of up to 200 ppm (parts per million) for its servers [^45],
+ which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30
seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best
possible accuracy you can achieve, even if everything is working correctly.
* If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the
@@ -602,7 +610,7 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples
different nodes’ clocks. Anecdotal evidence suggests that this does happen in practice.
* NTP synchronization can only be as good as the network delay, so there is a limit to its
accuracy when you’re on a congested network with variable packet delays. One experiment showed
- that a minimum error of 35 ms is achievable when synchronizing over the internet [^46],
+ that a minimum error of 35 ms is achievable when synchronizing over the internet [^46],
though occasional spikes in network delay lead to errors of around a second. Depending on the
configuration, large network delays can cause the NTP client to give up entirely.
* Some NTP servers are wrong or misconfigured, reporting time that is off by hours [^47] [^48].
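
The drift figures quoted in the list above follow from simple arithmetic: the worst-case drift is the drift rate multiplied by the time since the last resynchronization. A small sketch (the function name is an illustrative choice, not part of any library):

```python
def max_drift_seconds(drift_ppm, resync_interval_seconds):
    """Worst-case clock drift accumulated between two resynchronizations."""
    return drift_ppm / 1_000_000 * resync_interval_seconds


# 200 ppm with a resync every 30 seconds: about 0.006 s, i.e. 6 ms
drift_30s = max_drift_seconds(200, 30)

# 200 ppm with a resync once a day: about 17.28 s, i.e. roughly 17 seconds
drift_day = max_drift_seconds(200, 24 * 60 * 60)
```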
@@ -673,29 +681,29 @@ ordering of events across multiple nodes [^64].
For example, if two clients write to a distributed database, who got there first? Which write is the
more recent one?
-[Figure 9-3](/en/ch9#fig_distributed_timestamps) illustrates a dangerous use of time-of-day clocks in a database with
-multi-leader replication (the example is similar to [Figure 6-8](/en/ch6#fig_replication_causality)). Client A writes
-*x* = 1 on node 1; the write is replicated to node 3; client B increments *x* on node
-3 (we now have *x* = 2); and finally, both writes are replicated to node 2.
+[Figure 9-3](/en/ch9#fig_distributed_timestamps) illustrates a dangerous use of time-of-day clocks in a database with
+multi-leader replication (the example is similar to [Figure 6-8](/en/ch6#fig_replication_causality)). Client A writes
+*x* = 1 on node 1; the write is replicated to node 3; client B increments *x* on node
+3 (we now have *x* = 2); and finally, both writes are replicated to node 2.
{{< figure src="/fig/ddia_0903.png" id="fig_distributed_timestamps" caption="Figure 9-3. The write by client B is causally later than the write by client A, but B's write has an earlier timestamp." class="w-full my-4" >}}
-In [Figure 9-3](/en/ch9#fig_distributed_timestamps), when a write is replicated to other nodes, it is tagged with a
+In [Figure 9-3](/en/ch9#fig_distributed_timestamps), when a write is replicated to other nodes, it is tagged with a
timestamp according to the time-of-day clock on the node where the write originated. The clock
synchronization is very good in this example: the skew between node 1 and node 3 is less than
-3 ms, which is probably better than you can expect in practice.
+3 ms, which is probably better than you can expect in practice.
-Since the increment builds upon the earlier write of *x* = 1, we might expect that the
-write of *x* = 2 should have the greater timestamp of the two. Unfortunately, that is
-not what happens in [Figure 9-3](/en/ch9#fig_distributed_timestamps): the write *x* = 1 has a timestamp of
-42.004 seconds, but the write *x* = 2 has a timestamp of 42.003 seconds.
+Since the increment builds upon the earlier write of *x* = 1, we might expect that the
+write of *x* = 2 should have the greater timestamp of the two. Unfortunately, that is
+not what happens in [Figure 9-3](/en/ch9#fig_distributed_timestamps): the write *x* = 1 has a timestamp of
+42.004 seconds, but the write *x* = 2 has a timestamp of 42.003 seconds.
As discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww), one way of resolving conflicts between concurrently written
values on different nodes is *last write wins* (LWW), which means keeping the write with the
greatest timestamp for a given key and discarding all writes with older timestamps. In the example
-of [Figure 9-3](/en/ch9#fig_distributed_timestamps), when node 2 receives these two events, it will incorrectly
-conclude that *x* = 1 is the more recent value and drop the write *x* = 2,
+of [Figure 9-3](/en/ch9#fig_distributed_timestamps), when node 2 receives these two events, it will incorrectly
+conclude that *x* = 1 is the more recent value and drop the write *x* = 2,
so the increment is lost.
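
The lost increment can be reproduced with a small sketch. The `Write` class and `lww_merge` function below are hypothetical stand-ins for what a replica does when applying last write wins; the timestamps are the ones from Figure 9-3.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: int
    timestamp: float  # time-of-day clock of the originating node, in seconds


def lww_merge(writes):
    """Keep, for each key, only the write with the greatest timestamp."""
    winners = {}
    for w in writes:
        current = winners.get(w.key)
        if current is None or w.timestamp > current.timestamp:
            winners[w.key] = w
    return winners


# Client A writes x = 1 on node 1; client B then increments x on node 3.
# B's write is causally later, but node 3's clock is slightly behind.
write_a = Write("x", 1, 42.004)
write_b = Write("x", 2, 42.003)

result = lww_merge([write_a, write_b])
# result["x"].value is 1: the increment has been silently discarded
```

The outcome does not depend on the order in which node 2 receives the two writes, since only the timestamps are compared.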
This problem can be prevented by ensuring that when a value is overwritten, the new value always has
@@ -710,7 +718,7 @@ policy [^62]. This approach has some serious problems:
This scenario can cause arbitrary amounts of data to be silently dropped without any error being
reported to the application.
* LWW cannot distinguish between writes that occurred sequentially in quick succession (in
- [Figure 9-3](/en/ch9#fig_distributed_timestamps), client B’s increment definitely occurs *after* client A’s write)
+ [Figure 9-3](/en/ch9#fig_distributed_timestamps), client B’s increment definitely occurs *after* client A’s write)
and writes that were truly concurrent (neither writer was aware of the other). Additional
causality tracking mechanisms, such as version vectors, are needed in order to prevent violations
of causality (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)).
@@ -722,8 +730,8 @@ policy [^62]. This approach has some serious problems:
Thus, even though it is tempting to resolve conflicts by keeping the most “recent” value and
discarding others, it’s important to be aware that the definition of “recent” depends on a local
time-of-day clock, which may well be incorrect. Even with tightly NTP-synchronized clocks, you could
-send a packet at timestamp 100 ms (according to the sender’s clock) and have it arrive at
-timestamp 99 ms (according to the recipient’s clock)—so it appears as though the packet
+send a packet at timestamp 100 ms (according to the sender’s clock) and have it arrive at
+timestamp 99 ms (according to the recipient’s clock)—so it appears as though the packet
arrived before it was sent, which is impossible.
Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur?
@@ -746,12 +754,12 @@ actually accurate to such precision. In fact, it most likely is not—as mention
drift in an imprecise quartz clock can easily be several milliseconds, even if you synchronize with
an NTP server on the local network every minute. With an NTP server on the public internet, the best
possible accuracy is probably to the tens of milliseconds, and the error may easily spike to over
-100 ms when there is network congestion.
+100 ms when there is network congestion.
Thus, it doesn’t make sense to think of a clock reading as a point in time—it is more like a
range of times, within a confidence interval: for example, a system may be 95% confident that the
time now is between 10.3 and 10.5 seconds past the minute, but it doesn’t know any more precisely than that [^67].
-If we only know the time +/– 100 ms, the microsecond digits in the timestamp are essentially meaningless.
+If we only know the time +/– 100 ms, the microsecond digits in the timestamp are essentially meaningless.
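
One way to make this concrete is to represent a clock reading as an interval rather than a single number. The sketch below is illustrative only; `TimeInterval`, `reading_with_uncertainty`, and `definitely_before` are hypothetical names, and the values are in milliseconds.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TimeInterval:
    earliest_ms: int
    latest_ms: int


def reading_with_uncertainty(reading_ms, uncertainty_ms):
    """Turn a point reading into the range of times it might really be."""
    return TimeInterval(reading_ms - uncertainty_ms, reading_ms + uncertainty_ms)


def definitely_before(a, b):
    """True only if interval a ends before interval b begins."""
    return a.latest_ms < b.earliest_ms


# Two readings 1 ms apart, each known only to within +/- 100 ms:
sent = reading_with_uncertainty(100, 100)
received = reading_with_uncertainty(99, 100)
```

With these values, neither `definitely_before(sent, received)` nor `definitely_before(received, sent)` holds: the intervals overlap, so the two readings cannot be ordered.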
The uncertainty bound can be calculated based on your time source. If you have a GPS receiver or
atomic clock directly attached to your computer, the expected error range is determined by
@@ -808,7 +816,7 @@ length of the confidence interval before committing a read-write transaction. By
ensures that any transaction that may read the data is at a sufficiently later time, so their
confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner
needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS
-receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [^45].
+receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [^45].
The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to
have a confidence interval, and the accurate clock sources only help keep that interval small. Other
@@ -943,7 +951,7 @@ failure of the entire system. These are so-called *hard real-time* systems.
> In embedded systems, *real-time* means that a system is carefully designed and tested to meet
> specified timing guarantees in all circumstances. This meaning is in contrast to the more vague use of the
> term *real-time* on the web, where it describes servers pushing data to clients and stream
-> processing without hard response time constraints (see [Link to Come]).
+> processing without hard response time constraints (see [Chapter 12](/en/ch12#ch_stream)).
--------
@@ -997,7 +1005,7 @@ A variant of this idea is to use the garbage collector only for short-lived obje
to collect) and to restart processes periodically, before they accumulate enough long-lived objects
to require a full GC of long-lived objects [^79] [^82].
One node can be restarted at a time, and traffic can be shifted away from the node before the
-planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding)).
+planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding)).
These measures cannot fully prevent garbage collection pauses, but they can usefully reduce their
impact on the application.
@@ -1031,7 +1039,7 @@ even if the underlying system model provides very few guarantees.
However, although it is possible to make software well behaved in an unreliable system model, it
is not straightforward to do so. In the rest of this chapter we will further explore the notions of
knowledge and truth in distributed systems, which will help us think about the kinds of assumptions
-we can make and the guarantees we may want to provide. In [Chapter 10](/en/ch10#ch_consistency) we will proceed to
+we can make and the guarantees we may want to provide. In [Chapter 10](/en/ch10#ch_consistency) we will proceed to
look at some examples of distributed algorithms that provide particular guarantees under particular
assumptions.
@@ -1075,7 +1083,7 @@ of quorums are possible). A majority quorum allows the system to continue workin
are faulty (with three nodes, one faulty node can be tolerated; with five nodes, two faulty nodes can be
tolerated). However, it is still safe, because there can be only one majority in the
system—there cannot be two majorities with conflicting decisions at the same time. We will discuss
-the use of quorums in more detail when we get to *consensus algorithms* in [Chapter 10](/en/ch10#ch_consistency).
+the use of quorums in more detail when we get to *consensus algorithms* in [Chapter 10](/en/ch10#ch_consistency).
### Distributed Locks and Leases {#sec_distributed_lock_fencing}
@@ -1099,13 +1107,13 @@ hold the lease, perhaps due to a process pause. In the third example, the conseq
wasted computational resources, which is not a big deal. But in the first two cases, the consequence
could be lost or corrupted data, which is much more serious.
-For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect
+For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect
implementation of locking. (The bug is not theoretical: HBase used to have this problem [^85] [^86].)
Say you want to ensure that a file in a storage service can only be
accessed by one client at a time, because if multiple clients tried to write to it, the file would
become corrupted. You try to implement this by requiring a client to obtain a lease from a lock
service before accessing the file. Such a lock service is often implemented using a consensus
-algorithm; we will discuss this further in [Chapter 10](/en/ch10#ch_consistency).
+algorithm; we will discuss this further in [Chapter 10](/en/ch10#ch_consistency).
{{< figure src="/fig/ddia_0904.png" id="fig_distributed_lease_pause" caption="Figure 9-4. Incorrect implementation of a distributed lock: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage." class="w-full my-4" >}}
@@ -1116,13 +1124,13 @@ the same file, and start writing to the file. When the paused client comes back,
(incorrectly) that it still has a valid lease and proceeds to also write to the file. We now have a
split brain situation: the clients’ writes clash and corrupt the file.
-[Figure 9-5](/en/ch9#fig_distributed_lease_delay) shows a different problem that has similar consequences. In this
+[Figure 9-5](/en/ch9#fig_distributed_lease_delay) shows a different problem that has similar consequences. In this
example there is no process pause, only a crash by client 1. Just before client 1 crashes it sends a
write request to the storage service, but this request is delayed for a long time in the network.
(Remember from [“Network Faults in Practice”](/en/ch9#sec_distributed_network_faults) that packets can sometimes be delayed by a minute
or more.) By the time the write request arrives at the storage service, the lease has already timed
out, allowing client 2 to acquire it and issue a write of its own. The result is corruption similar
-to [Figure 9-4](/en/ch9#fig_distributed_lease_pause).
+to [Figure 9-4](/en/ch9#fig_distributed_lease_pause).
{{< figure src="/fig/ddia_0905.png" id="fig_distributed_lease_delay" caption="Figure 9-5. A message from a former leaseholder might be delayed for a long time, and arrive after another node has taken over the lease." class="w-full my-4" >}}
@@ -1139,11 +1147,11 @@ from the network [^9], shutting down the VM via
the cloud provider’s management interface, or even physically powering down the machine [^87].
This approach is known as *Shoot The Other Node In The Head* or STONITH. Unfortunately, it suffers
from some problems: it does not protect against large network delays like in
-[Figure 9-5](/en/ch9#fig_distributed_lease_delay); it can happen that all of the nodes shut each other down [^19]; and by the time the zombie has been
+[Figure 9-5](/en/ch9#fig_distributed_lease_delay); it can happen that all of the nodes shut each other down [^19]; and by the time the zombie has been
detected and shut down, it may already be too late and data may already have been corrupted.
A more robust fencing solution, which protects against both zombies and delayed requests, is
-illustrated in [Figure 9-6](/en/ch9#fig_distributed_fencing).
+illustrated in [Figure 9-6](/en/ch9#fig_distributed_fencing).
{{< figure src="/fig/ddia_0906.png" id="fig_distributed_fencing" caption="Figure 9-6. Making access to storage safe by allowing writes only in the order of increasing fencing tokens." class="w-full my-4" >}}
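
The check performed by the storage service in Figure 9-6 can be sketched as follows. This is a simplified, single-process illustration; the class `FencedStorage` and its method names are hypothetical, and a real storage service would also need to persist the highest token it has seen.

```python
class FencedStorage:
    """Accepts a write only if its fencing token is not older than any seen so far."""

    def __init__(self):
        self.max_token_seen = -1
        self.data = {}

    def write(self, key, value, token):
        # A token lower than one already processed means the writer's lease
        # has since been handed to someone else, so the write is rejected.
        if token < self.max_token_seen:
            return False
        self.max_token_seen = token
        self.data[key] = value
        return True


storage = FencedStorage()
accepted_34 = storage.write("file", "written by client 2", token=34)
accepted_33 = storage.write("file", "written by client 1", token=33)
```

Writes carrying the same token as the current maximum are accepted in this sketch, on the assumption that one leaseholder may issue several writes under a single lease.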
@@ -1158,12 +1166,12 @@ it must include its current fencing token.
> [!NOTE]
> There are several alternative names for fencing tokens. In Chubby, Google’s lock service, they are
> called *sequencers* [^88], and in Kafka they are called *epoch numbers*.
-> In consensus algorithms, which we will discuss in [Chapter 10](/en/ch10#ch_consistency), the *ballot number* (Paxos) or
+> In consensus algorithms, which we will discuss in [Chapter 10](/en/ch10#ch_consistency), the *ballot number* (Paxos) or
> *term number* (Raft) serves a similar purpose.
--------
-In [Figure 9-6](/en/ch9#fig_distributed_fencing), client 1 acquires the lease with a token of 33, but then
+In [Figure 9-6](/en/ch9#fig_distributed_fencing), client 1 acquires the lease with a token of 33, but then
it goes into a long pause and the lease expires. Client 2 acquires the lease with a token of 34 (the
number always increases) and then sends its write request to the storage service, including the
token of 34. Later, client 1 comes back to life and sends its write to the storage service,
@@ -1196,7 +1204,7 @@ last-write-wins conflict resolution (see [“Leaderless Replication”](/en/ch6#
client sends writes directly to each replica, and each replica independently decides whether to
accept a write based on a timestamp assigned by the client.
-As illustrated in [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), you can put the writer’s fencing token in
+As illustrated in [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), you can put the writer’s fencing token in
the most significant bits or digits of the timestamp. You can then be sure that any timestamp
generated by the new leaseholder will be greater than any timestamp from the old leaseholder, even
if the old leaseholder’s writes happened later.
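
A sketch of this encoding, assuming integer timestamps: the fencing token occupies the high bits and the writer's clock reading the low bits. The 48-bit width and the function name `fenced_timestamp` are arbitrary choices for illustration.

```python
TIMESTAMP_BITS = 48  # assumed width reserved for the clock reading


def fenced_timestamp(fencing_token, clock_ms):
    """Combine a fencing token (high bits) with a clock reading (low bits)."""
    if not 0 <= clock_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("clock reading does not fit in the reserved bits")
    return (fencing_token << TIMESTAMP_BITS) | clock_ms


# The old leaseholder (token 33) writes later by the clock, yet its timestamp
# is still smaller than any timestamp from the new leaseholder (token 34).
old_write = fenced_timestamp(33, 1_700_000_500_000)
new_write = fenced_timestamp(34, 1_700_000_000_000)
```

Since the token dominates the comparison, last-write-wins resolution on these combined timestamps always prefers the newer leaseholder.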
@@ -1204,7 +1212,7 @@ if the old leaseholder’s writes happened later.
{{< figure src="/fig/ddia_0907.png" id="fig_distributed_fencing_leaderless" caption="Figure 9-7. Using fencing tokens to protect writes to a leaderless replicated database." class="w-full my-4" >}}
-In [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), Client 2 has a fencing token of 34, so all of its
+In [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), Client 2 has a fencing token of 34, so all of its
timestamps starting with 34… are greater than any timestamps starting with 33… that are
generated by Client 1. Client 2 writes to a quorum of replicas but it can’t reach Replica 3. This
means that when the zombie Client 1 later tries to write, its write may succeed at Replica 3 even
@@ -1239,7 +1247,7 @@ The Byzantine Generals Problem is a generalization of the so-called *Two General
which imagines a situation in which two army generals need to agree on a battle plan. As they
have set up camp on two different sites, they can only communicate by messenger, and the messengers
sometimes get delayed or lost (like packets in a network). We will discuss this problem of
-*consensus* in [Chapter 10](/en/ch10#ch_consistency).
+*consensus* in [Chapter 10](/en/ch10#ch_consistency).
In the Byzantine version of the problem, there are *n* generals who need to agree, and their
endeavor is hampered by the fact that there are some traitors in their midst. Most of the generals
@@ -1301,6 +1309,8 @@ an attacker can compromise one node, they can probably compromise all of them, b
probably running the same software. Thus, traditional mechanisms (authentication, access control,
encryption, firewalls, and so on) continue to be the main protection against attackers.
+
+
#### Weak forms of lying {#weak-forms-of-lying}
Although we assume that nodes are generally honest, it can be worth adding mechanisms to software
@@ -1327,7 +1337,7 @@ pragmatic steps toward better reliability. For example:
### System Model and Reality {#sec_distributed_system_model}
Many algorithms have been designed to solve distributed systems problems—for example, we will
-examine solutions for the consensus problem in [Chapter 10](/en/ch10#ch_consistency). In order to be useful, these
+examine solutions for the consensus problem in [Chapter 10](/en/ch10#ch_consistency). In order to be useful, these
algorithms need to tolerate the various faults of distributed systems that we discussed in this
chapter.
@@ -1409,7 +1419,7 @@ Uniqueness
Monotonic sequence
: If request *x* returned token *t**x*, and request *y* returned token *t**y*, and
- *x* completed before *y* began, then *t**x* < *t**y*.
+ *x* completed before *y* began, then *t**x* < *t**y*.
Availability
: A node that requests a fencing token and does not crash eventually receives a response.
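
To see what the uniqueness and monotonic sequence properties ask for, consider a deliberately simplified generator that runs in a single process. The class `FencingTokenGenerator` is a hypothetical illustration: it loses its counter on a crash and is not fault-tolerant, so it says nothing about how a real lock service would provide these guarantees.

```python
import threading


class FencingTokenGenerator:
    """Single-process sketch: hands out strictly increasing integer tokens."""

    def __init__(self, start=1):
        self._next = start
        self._lock = threading.Lock()

    def next_token(self):
        # The lock makes read-and-increment atomic, so no two requests get the
        # same token (uniqueness), and a request that begins after another has
        # completed always receives a greater token (monotonic sequence).
        with self._lock:
            token = self._next
            self._next += 1
            return token


generator = FencingTokenGenerator()
tokens = [generator.next_token() for _ in range(5)]
```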
@@ -1615,7 +1625,7 @@ TigerBeetle’s time abstraction allows simulations to simulate network latency
actually taking the full length of time to trigger the timeout. Such techniques allow the simulator
to explore more code paths faster.
-# The Power of Determinism
+#### The Power of Determinism {#sidebar_distributed_determinism}
Nondeterminism is at the core of all of the distributed systems challenges we discussed in this
chapter: concurrency, network delay, process pauses, clock jumps, and crashes all happen in
@@ -1839,4 +1849,4 @@ problems in distributed systems.
[^131]: Rupak Majumdar and Filip Niksic. [Why is random testing effective for partition tolerance bugs?](https://dl.acm.org/doi/pdf/10.1145/3158134) *Proceedings of the ACM on Programming Languages* (PACMPL), volume 2, issue POPL, article no. 46, December 2017. [doi:10.1145/3158134](https://doi.org/10.1145/3158134)
[^132]: FoundationDB project authors. [Simulation and Testing](https://apple.github.io/foundationdb/testing.html). *apple.github.io*. Archived at [perma.cc/NQ3L-PM4C](https://perma.cc/NQ3L-PM4C)
[^133]: Alex Kladov. [Simulation Testing For Liveness](https://tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness/). *tigerbeetle.com*, July 2023. Archived at [perma.cc/RKD4-HGCR](https://perma.cc/RKD4-HGCR)
-[^134]: Alfonso Subiotto Marqués. [(Mostly) Deterministic Simulation Testing in Go](https://www.polarsignals.com/blog/posts/2024/05/28/mostly-dst-in-go). *polarsignals.com*, May 2024. Archived at [perma.cc/ULD6-TSA4](https://perma.cc/ULD6-TSA4)
\ No newline at end of file
+[^134]: Alfonso Subiotto Marqués. [(Mostly) Deterministic Simulation Testing in Go](https://www.polarsignals.com/blog/posts/2024/05/28/mostly-dst-in-go). *polarsignals.com*, May 2024. Archived at [perma.cc/ULD6-TSA4](https://perma.cc/ULD6-TSA4)
diff --git a/content/en/colophon.md b/content/en/colophon.md
index cac70d9..45d6775 100644
--- a/content/en/colophon.md
+++ b/content/en/colophon.md
@@ -4,23 +4,20 @@ weight: 600
breadcrumbs: false
---
-{{< callout type="warning" >}}
-This page is from the 1st edition, 2nd edition is not available yet.
-{{< /callout >}}
-
## About the Author
-**Martin Kleppmann** is a researcher in distributed systems at the University of Cambridge, UK.
-Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure.
-In the process he learned a few things the hard way, and he hopes this book will save you from repeating the same mistakes.
-
-Martin is a regular conference speaker, blogger, and open source contributor. He believes that profound technical ideas should be accessible to everyone, and that deeper understanding will help us develop better software.
+**Martin Kleppmann** is an Associate Professor at the University of Cambridge, UK, where he teaches on distributed systems and cryptographic protocols.
+The first edition of *Designing Data-Intensive Applications* in 2017 established him as an authority on data systems,
+and through his research on distributed systems he helped start the local-first software movement.
+Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive,
+where he worked on large-scale data infrastructure.

-**Chris Riccomini** is a software engineer, startup investor, and author with 15+ years of experience at PayPal, LinkedIn, and WePay.
-He runs Materialized View Capital, where he invests in infrastructure startups. He is also the cocreator of Apache Samza and SlateDB,
-and coauthor of The Missing README: A Guide for the New Software Engineer.
+**Chris Riccomini** is a software engineer, startup investor, and author with 15+ years of experience at PayPal,
+LinkedIn, and WePay. He runs Materialized View Capital, where he invests in infrastructure startups.
+He is also the co-creator of Apache Samza and SlateDB,
+and co-author of The Missing README: A Guide for the New Software Engineer.
## Colophon
diff --git a/content/en/glossary.md b/content/en/glossary.md
index 33476b1..c4cd9cf 100644
--- a/content/en/glossary.md
+++ b/content/en/glossary.md
@@ -4,38 +4,33 @@ weight: 500
breadcrumbs: false
---
-{{< callout type="warning" >}}
-This page is from the 1st edition, 2nd edition is not available yet.
-{{< /callout >}}
-
> Please note that the definitions in this glossary are short and simple, intended to convey the core idea but not the full subtleties of a term. For more detail, please follow the references into the main text.
### asynchronous
-Not waiting for something to complete (e.g., sending data over the network to another node), and not making any assumptions about how long it is going to take. See “Synchronous Versus Asynchro‐ nous Replication” on page 153, “Synchro‐ nous Versus Asynchronous Networks” on page 284, and “System Model and Reality” on page 306.
+Not waiting for something to complete (e.g., sending data over the network to another node), and not making any assumptions about how long it is going to take. See [“Synchronous Versus Asynchronous Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_sync_async), [“Synchronous Versus Asynchronous Networks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_sync_networks), and [“System Model and Reality”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_system_model).
### atomic
-1. In the context of concurrent operations: describing an operation that appears to take effect at a single point in time, so another concurrent process can never encounter the operation in a “half- finished” state. See also *isolation*.
-2. In the context of transactions: grouping together a set of writes that must either all be committed or all be rolled back, even if faults occur. See “Atomicity” on page 223 and “Atomic Commit and Two-Phase Commit (2PC)” on page 354.
+1. In the context of concurrency: describing an operation that appears to take effect at a single point in time, so another concurrent process can never encounter the operation in a “half-finished” state. See also *isolation*.
+
+2. In the context of transactions: grouping together a set of writes that must either all be committed or all be rolled back, even if faults occur. See [“Atomicity”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_acid_atomicity) and [“Two-Phase Commit (2PC)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pc).
### backpressure
-Forcing the sender of some data to slow down because the recipient cannot keep
-
-up with it. Also known as *flow control*. See “Messaging Systems” on page 441.
+Forcing the sender of some data to slow down when the recipient cannot keep up with it. Also known as *flow control*. See [“When an Overloaded System Won’t Recover”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sidebar_metastable).
### batch process
-A computation that takes some fixed (and usually large) set of data as input and pro‐ duces some other data as output, without modifying the input. See Chapter 10.
+A computation that takes some fixed (and usually large) set of data as input and produces some other data as output, without modifying the input. See [Chapter 11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch11.html#ch_batch).
### bounded
-Having some known upper limit or size. Used for example in the context of net‐ work delay (see “Timeouts and Unboun‐ ded Delays” on page 281) and datasets (see the introduction to Chapter 11).
+Having some known upper limit or size. Used for example in the context of network delay (see [“Timeouts and Unbounded Delays”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_queueing)) and datasets (see the introduction to [Chapter 12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#ch_stream)).
### Byzantine fault
-A node that behaves incorrectly in some arbitrary way, for example by sending contradictory or malicious messages to other nodes. See “Byzantine Faults” on page 304.
+A node that behaves incorrectly in some arbitrary way, for example by sending contradictory or malicious messages to other nodes. See [“Byzantine Faults”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_byzantine).
### cache
@@ -43,55 +38,55 @@ A component that remembers recently used data in order to speed up future reads
### CAP theorem
-A widely misunderstood theoretical result that is not useful in practice. See “The CAP theorem” on page 336.
+A widely misunderstood theoretical result that is not useful in practice. See [“The CAP theorem”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_cap).
### causality
-The dependency between events that ari‐ ses when one thing “happens before” another thing in a system. For example, a later event that is in response to an earlier event, or builds upon an earlier event, or should be understood in the light of an earlier event. See “The “happens-before” relationship and concurrency” on page 186 and “Ordering and Causality” on page 339.
+The dependency between events that arises when one thing “happens before” another thing in a system. For example, a later event that is in response to an earlier event, or builds upon an earlier event, or should be understood in the light of an earlier event. See [“The ‘happens-before’ relation and concurrency”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_happens_before).
### consensus
-A fundamental problem in distributed computing, concerning getting several nodes to agree on something (for exam‐ ple, which node should be the leader for a database cluster). The problem is much harder than it seems at first glance. See “Fault-Tolerant Consensus” on page 364.
+A fundamental problem in distributed computing, concerning getting several nodes to agree on something (for example, which node should be the leader for a database cluster). The problem is much harder than it seems at first glance. See [“Consensus”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_consensus).
### data warehouse
-A database in which data from several dif‐ ferent OLTP systems has been combined and prepared to be used for analytics pur‐ poses. See “Data Warehousing” on page 91.
+A database in which data from several different OLTP systems has been combined and prepared to be used for analytics purposes. See [“Data Warehousing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_dwh).
### declarative
-Describing the properties that something should have, but not the exact steps for how to achieve it. In the context of quer‐ ies, a query optimizer takes a declarative query and decides how it should best be executed. See “Query Languages for Data” on page 42.
+Describing the properties that something should have, but not the exact steps for how to achieve it. In the context of database queries, a query optimizer takes a declarative query and decides how it should best be executed. See [“Terminology: Declarative Query Languages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sidebar_declarative).
### denormalize
-To introduce some amount of redun‐ dancy or duplication in a *normalized* dataset, typically in the form of a *cache* or *index*, in order to speed up reads. A denormalized value is a kind of precom‐ puted query result, similar to a materialized view. See “Single-Object and Multi- Object Operations” on page 228 and “Deriving several views from the same event log” on page 461.
+To introduce some amount of redundancy or duplication in a *normalized* dataset, typically in the form of a *cache* or *index*, in order to speed up reads. A denormalized value is a kind of precomputed query result, similar to a materialized view. See [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization).
### derived data
-A dataset that is created from some other data through a repeatable process, which you could run again if necessary. Usually, derived data is needed to speed up a par‐ ticular kind of read access to the data. Indexes, caches, and materialized views are examples of derived data. See the introduction to Part III.
+A dataset that is created from some other data through a repeatable process, which you could run again if necessary. Usually, derived data is needed to speed up a particular kind of read access to the data. Indexes, caches, and materialized views are examples of derived data. See [“Systems of Record and Derived Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_derived).
### deterministic
-Describing a function that always pro‐ duces the same output if you give it the same input. This means it cannot depend on random numbers, the time of day, net‐ work communication, or other unpredict‐ able things.
+Describing a function that always produces the same output if you give it the same input. This means it cannot depend on random numbers, the time of day, network communication, or other unpredictable things. See [“The Power of Determinism”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sidebar_distributed_determinism).
### distributed
-Running on several nodes connected by a network. Characterized by *partial failures*: some part of the system may be broken while other parts are still working, and it is often impossible for the software to know what exactly is broken. See “Faults and Partial Failures” on page 274.
+Running on several nodes connected by a network. Characterized by *partial failures*: some part of the system may be broken while other parts are still working, and it is often impossible for the software to know what exactly is broken. See [“Faults and Partial Failures”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_partial_failure).
### durable
-Storing data in a way such that you believe it will not be lost, even if various faults occur. See “Durability” on page 226.
+Storing data in a way such that you believe it will not be lost, even if various faults occur. See [“Durability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_acid_durability).
### ETL
-Extract–Transform–Load. The process of extracting data from a source database, transforming it into a form that is more suitable for analytic queries, and loading it into a data warehouse or batch processing system. See “Data Warehousing” on page 91.
+Extract–Transform–Load. The process of extracting data from a source database, transforming it into a form that is more suitable for analytic queries, and loading it into a data warehouse or batch processing system. See [“Data Warehousing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_dwh).
### failover
-In systems that have a single leader, fail‐ over is the process of moving the leader‐ ship role from one node to another. See “Handling Node Outages” on page 156.
+In systems that have a single leader, failover is the process of moving the leadership role from one node to another. See [“Handling Node Outages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_failover).
### fault-tolerant
-Able to recover automatically if some‐ thing goes wrong (e.g., if a machine crashes or a network link fails). See “Reli‐ ability” on page 6.
+Able to recover automatically if something goes wrong (e.g., if a machine crashes or a network link fails). See [“Reliability and Fault Tolerance”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_reliability).
### flow control
@@ -99,150 +94,164 @@ See *backpressure*.
### follower
-A replica that does not directly accept any writes from clients, but only processes data changes that it receives from a leader. Also known as a *secondary*, *slave*, *read replica*, or *hot standby*. See “Leaders and Followers” on page 152.
+A replica that does not directly accept any writes from clients, but only processes data changes that it receives from a leader. Also known as a *secondary*, *read replica*, or *hot standby*. See [“Single-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_leader).
### full-text search
-Searching text by arbitrary keywords, often with additional features such as matching similarly spelled words or syno‐ nyms. A full-text index is a kind of *secon‐ dary index* that supports such queries. See “Full-text search and fuzzy indexes” on page 88.
+Searching text by arbitrary keywords, often with additional features such as matching similarly spelled words or synonyms. A full-text index is a kind of *secondary index* that supports such queries. See [“Full-Text Search”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_full_text).
### graph
-A data structure consisting of *vertices* (things that you can refer to, also known as *nodes* or *entities*) and *edges* (connec‐ tions from one vertex to another, also known as *relationships* or *arcs*). See “Graph-Like Data Models” on page 49.
+A data structure consisting of *vertices* (things that you can refer to, also known as *nodes* or *entities*) and *edges* (connections from one vertex to another, also known as *relationships* or *arcs*). See [“Graph-Like Data Models”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_graph).
### hash
-A function that turns an input into a random-looking number. The same input always returns the same number as out‐ put. Two different inputs are very likely to have two different numbers as output, although it is possible that two different inputs produce the same output (this is called a *collision*). See “Partitioning by Hash of Key” on page 203.
+A function that turns an input into a random-looking number. The same input always returns the same number as output. Two different inputs are very likely to have two different numbers as output, although it is possible that two different inputs produce the same output (this is called a *collision*). See [“Sharding by Hash of Key”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_hash).
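
As a rough illustration of the definition above (not taken from the book), the sketch below derives a deterministic, random-looking number from a key and uses it to pick a shard. The helper names `key_hash` and `shard_for` are hypothetical, and taking the hash modulo the shard count is only the simplest possible scheme: changing the number of shards would move most keys.

```python
import hashlib


def key_hash(key: str) -> int:
    """Map a key to a deterministic, random-looking 64-bit number."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    # Keep only the first 8 bytes so the result fits in 64 bits.
    return int.from_bytes(digest[:8], "big")


def shard_for(key: str, num_shards: int) -> int:
    """Pick a shard by taking the hash modulo the shard count (illustrative only)."""
    return key_hash(key) % num_shards


if __name__ == "__main__":
    # The same input always yields the same number, and so the same shard.
    print(key_hash("user:42") == key_hash("user:42"))
    print(shard_for("user:42", 8))
```

Python's built-in `hash()` is avoided here because for strings it is randomized per process by default, so it would not give the same number across runs.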
### idempotent
-Describing an operation that can be safely retried; if it is executed more than once, it has the same effect as if it was only exe‐ cuted once. See “Idempotence” on page 478.
+Describing an operation that can be safely retried; if it is executed more than once, it has the same effect as if it was only executed once. See [“Idempotence”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#sec_stream_idempotence).
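
One common way to make an operation safe to retry is to tag each request with a unique ID and ignore IDs that have already been applied. The sketch below is a minimal in-memory illustration under that assumption; the `Account` class and its method names are hypothetical, and a real system would have to store the set of applied IDs durably alongside the balance.

```python
class Account:
    def __init__(self):
        self.balance = 0
        self.applied_ids = set()  # request IDs that have already taken effect

    def deposit(self, amount):
        # Not idempotent: every retry adds the amount again.
        self.balance += amount

    def deposit_once(self, request_id, amount):
        # Idempotent: a retry with the same request ID has no further effect.
        if request_id in self.applied_ids:
            return
        self.applied_ids.add(request_id)
        self.balance += amount


if __name__ == "__main__":
    account = Account()
    account.deposit_once("req-1", 10)
    account.deposit_once("req-1", 10)  # retry of the same request
    print(account.balance)
```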
### index
-A data structure that lets you efficiently search for all records that have a particular value in a particular field. See “Data Structures That Power Your Database” on page 70.
+A data structure that lets you efficiently search for all records that have a particular value in a particular field. See [“Storage and Indexing for OLTP”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_oltp).
### isolation
-In the context of transactions, describing the degree to which concurrently execut‐ ing transactions can interfere with each other. *Serializable* isolation provides the strongest guarantees, but weaker isolation levels are also used. See “Isolation” on page 225.
+In the context of transactions, describing the degree to which concurrently executing transactions can interfere with each other. *Serializable* isolation provides the strongest guarantees, but weaker isolation levels are also used. See [“Isolation”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_acid_isolation).
### join
-To bring together records that have some‐ thing in common. Most commonly used in the case where one record has a refer‐ ence to another (a foreign key, a docu‐ ment reference, an edge in a graph) and a query needs to get the record that the ref‐ erence points to. See “Many-to-One and Many-to-Many Relationships” on page 33 and “Reduce-Side Joins and Grouping” on page 403.
+To bring together records that have something in common. Most commonly used in the case where one record has a reference to another (a foreign key, a document reference, an edge in a graph) and a query needs to get the record that the reference points to. See [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization) and [“JOIN and GROUP BY”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch11.html#sec_batch_join).
### leader
-When data or a service is replicated across several nodes, the leader is the designated replica that is allowed to make changes. A leader may be elected through some pro‐ tocol, or manually chosen by an adminis‐ trator. Also known as the *primary* or *master*. See “Leaders and Followers” on page 152.
+When data or a service is replicated across several nodes, the leader is the designated replica that is allowed to make changes. A leader may be elected through some protocol, or manually chosen by an administrator. Also known as the *primary* or *source*. See [“Single-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_leader).
### linearizable
-Behaving as if there was only a single copy of data in the system, which is updated by atomic operations. See “Linearizability” on page 324.
+Behaving as if there was only a single copy of data in the system, which is updated by atomic operations. See [“Linearizability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_linearizability).
### locality
-A performance optimization: putting sev‐ eral pieces of data in the same place if they are frequently needed at the same time. See “Data locality for queries” on page 41.
+A performance optimization: putting several pieces of data in the same place if they are frequently needed at the same time. See [“Data locality for reads and writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_document_locality).
### lock
-A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See “Two-Phase Locking (2PL)” on page 257 and “The leader and the lock” on page 301.
+A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See [“Two-Phase Locking (2PL)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pl) and [“Distributed Locks and Leases”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_lock_fencing).
### log
-A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See “Two-Phase Locking (2PL)” on page 257 and “The leader and the lock” on page 301.
-
+An append-only file for storing data. A *write-ahead log* is used to make a storage engine resilient against crashes (see [“Making B-trees reliable”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_btree_wal)), a *log-structured* storage engine uses logs as its primary storage format (see [“Log-Structured Storage”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_log_structured)), a *replication log* is used to copy writes from a leader to followers (see [“Single-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_leader)), and an *event log* can represent a data stream (see [“Log-based Message Brokers”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#sec_stream_log)).
### materialize
-To perform a computation eagerly and write out its result, as opposed to calculat‐ ing it on demand when requested. See “Aggregation: Data Cubes and Material‐ ized Views” on page 101 and “Materialization of Intermediate State” on page 419.
+To perform a computation eagerly and write out its result, as opposed to calculating it on demand when requested. See [“Event Sourcing and CQRS”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_events).
### node
An instance of some software running on a computer, which communicates with other nodes via a network in order to accomplish some task.
-
### normalized
-An instance of some software running on a computer, which communicates with other nodes via a network in order to accomplish some task.
-Structured in such a way that there is no redundancy or duplication. In a normal‐ ized database, when some piece of data changes, you only need to change it in one place, not many copies in many different places. See “Many-to-One and Many-to- Many Relationships” on page 33.
+
+Structured in such a way that there is no redundancy or duplication. In a normalized database, when some piece of data changes, you only need to change it in one place, not many copies in many different places. See [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization).
### OLAP
-Online analytic processing. Access pattern characterized by aggregating (e.g., count, sum, average) over a large number of records. See “Transaction Processing or Analytics?” on page 90.
+Online analytic processing. Access pattern characterized by aggregating (e.g., count, sum, average) over a large number of records. See [“Operational Versus Analytical Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_analytics).
### OLTP
-Online transaction processing. Access pattern characterized by fast queries that read or write a small number of records, usually indexed by key. See “Transaction Processing or Analytics?” on page 90.
+Online transaction processing. Access pattern characterized by fast queries that read or write a small number of records, usually indexed by key. See [“Operational Versus Analytical Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_analytics).
-### partitioning
+### sharding
-Splitting up a large dataset or computa‐ tion that is too big for a single machine into smaller parts and spreading them across several machines. Also known as sharding. See Chapter 6.
+Splitting up a large dataset or computation that is too big for a single machine into smaller parts and spreading them across several machines. Also known as *partitioning*. See [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding).
### percentile
-A way of measuring the distribution of values by counting how many values are above or below some threshold. For example, the 95th percentile response time during some period is the time t such that 95% of requests in that period com‐ plete in less than t, and 5% take longer than t. See “Describing Performance” on page 13.
+
+A way of measuring the distribution of values by counting how many values are above or below some threshold. For example, the 95th percentile response time during some period is the time *t* such that 95% of requests in that period complete in less than *t*, and 5% take longer than *t*. See [“Describing Performance”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_percentiles).
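
The definition above can be made concrete with a small calculation. The sketch below uses the nearest-rank method, which is only one of several conventions for computing percentiles; the function name `percentile` and the sample response times are illustrative assumptions.

```python
def percentile(values, p):
    """Return the p-th percentile (integer p, 0-100) using the nearest-rank method."""
    if not values:
        raise ValueError("percentile of an empty list is undefined")
    ordered = sorted(values)
    # Integer form of ceil(p / 100 * n), which avoids floating-point rounding.
    rank = (p * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


if __name__ == "__main__":
    response_times_ms = [30, 10, 1000, 20, 40]
    print(percentile(response_times_ms, 50))  # median
    print(percentile(response_times_ms, 95))  # dominated by the slow outlier
```

With only five samples, the single slow request determines the 95th percentile, which shows why high percentiles are sensitive to outliers.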
### primary key
-A value (typically a number or a string) that uniquely identifies a record. In many applications, primary keys are generated by the system when a record is created (e.g., sequentially or randomly); they are not usually set by users. See also secondary index.
+
+A value (typically a number or a string) that uniquely identifies a record. In many applications, primary keys are generated by the system when a record is created (e.g., sequentially or randomly); they are not usually set by users. See also *secondary index*.
### quorum
-The minimum number of nodes that need to vote on an operation before it can be considered successful. See “Quorums for reading and writing” on page 179.
+The minimum number of nodes that need to vote on an operation before it can be considered successful. See [“Quorums for reading and writing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_quorum_condition).
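
To illustrate the arithmetic behind quorums, the sketch below computes a simple majority and checks the usual overlap condition for quorum reads and writes: with `n` replicas, `w` write acknowledgements, and `r` read responses, `w + r > n` means every read set shares at least one node with every write set. The function names are hypothetical.

```python
def majority(n):
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


def quorums_overlap(n, w, r):
    """True if every read quorum must share at least one node with every write quorum."""
    return w + r > n


if __name__ == "__main__":
    print(majority(5))
    print(quorums_overlap(3, 2, 2))
    print(quorums_overlap(3, 1, 1))
```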
### rebalance
-To move data or services from one node to another in order to spread the load fairly. See “Rebalancing Partitions” on page 209.
+
+To move data or services from one node to another in order to spread the load fairly. See [“Sharding of Key-Value Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_key_value).
### replication
-Keeping a copy of the same data on sev‐ eral nodes (replicas) so that it remains accessible if a node becomes unreachable. See Chapter 5.
+
+Keeping a copy of the same data on several nodes (*replicas*) so that it remains accessible if a node becomes unreachable. See [Chapter 6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#ch_replication).
### schema
-A description of the structure of some data, including its fields and datatypes. Whether some data conforms to a schema can be checked at various points in the data’s lifetime (see “Schema flexibility in the document model” on page 39), and a schema can change over time (see Chap‐ ter 4).
+
+A description of the structure of some data, including its fields and datatypes. Whether some data conforms to a schema can be checked at various points in the data’s lifetime (see [“Schema flexibility in the document model”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_schema_flexibility)), and a schema can change over time (see [Chapter 5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch05.html#ch_encoding)).
### secondary index
-An additional data structure that is main‐ tained alongside the primary data storage and which allows you to efficiently search for records that match a certain kind of condition. See “Other Indexing Struc‐ tures” on page 85 and “Partitioning and Secondary Indexes” on page 206.
+
+An additional data structure that is maintained alongside the primary data storage and which allows you to efficiently search for records that match a certain kind of condition. See [“Multi-Column and Secondary Indexes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_index_multicolumn) and [“Sharding and Secondary Indexes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_secondary_indexes).
### serializable
-A guarantee that if several transactions execute concurrently, they behave the same as if they had executed one at a time, in some serial order. See “Serializability” on page 251.
+
+An *isolation* guarantee that if several transactions execute concurrently, they behave the same as if they had executed one at a time, in some serial order. See [“Serializability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_serializability).
### shared-nothing
-An architecture in which independent nodes—each with their own CPUs, mem‐ ory, and disks—are connected via a con‐ ventional network, in contrast to shared- memory or shared-disk architectures. See the introduction to Part II.
+
+An architecture in which independent nodes—each with their own CPUs, memory, and disks—are connected via a conventional network, in contrast to shared-memory or shared-disk architectures. See [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_shared_nothing).
### skew
-1. Imbalanced load across partitions, such that some partitions have lots of requests or data, and others have much less. Also known as hot spots. See “Skewed Work‐ loads and Relieving Hot Spots” on page 205 and “Handling skew” on page 407.
-2. A timing anomaly that causes events to appear in an unexpected, nonsequential order. See the discussions of read skew in “Snapshot Isolation and Repeatable Read” on page 237, write skew in “Write Skew and Phantoms” on page 246, and clock skew in “Timestamps for ordering events” on page 291.
+
+1. Imbalanced load across shards, such that some shards have lots of requests or data, and others have much less. Also known as *hot spots*. See [“Skewed Workloads and Relieving Hot Spots”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_skew).
+
+2. A timing anomaly that causes events to appear in an unexpected, nonsequential order. See the discussions of *read skew* in [“Snapshot Isolation and Repeatable Read”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_snapshot_isolation), *write skew* in [“Write Skew and Phantoms”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_write_skew), and *clock skew* in [“Timestamps for ordering events”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_lww).
### split brain
-A scenario in which two nodes simultane‐ ously believe themselves to be the leader, and which may cause system guarantees to be violated. See “Handling Node Out‐ ages” on page 156 and “The Truth Is Defined by the Majority” on page 300.
+
+A scenario in which two nodes simultaneously believe themselves to be the leader, and which may cause system guarantees to be violated. See [“Handling Node Outages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_failover) and [“The Majority Rules”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_majority).
### stored procedure
-A way of encoding the logic of a transac‐ tion such that it can be entirely executed on a database server, without communi‐ cating back and forth with a client during the transaction. See “Actual Serial Execu‐ tion” on page 252.
+
+A way of encoding the logic of a transaction such that it can be entirely executed on a database server, without communicating back and forth with a client during the transaction. See [“Actual Serial Execution”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_serial).
### stream process
-A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See Chapter 11.
+
+A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See [Chapter 12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#ch_stream).
### synchronous
-The opposite of asynchronous.
+
+The opposite of *asynchronous*.
### system of record
-A system that holds the primary, authori‐ tative version of some data, also known as the source of truth. Changes are first writ‐ ten here, and other datasets may be derived from the system of record. See the introduction to Part III.
+
+A system that holds the primary, authoritative version of some data, also known as the *source of truth*. Changes are first written here, and other datasets may be derived from the system of record. See [“Systems of Record and Derived Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_derived).
### timeout
-One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See “Timeouts and Unbounded Delays” on page 281.
+
+One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See [“Timeouts and Unbounded Delays”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_queueing).
### total order
-A way of comparing things (e.g., time‐ stamps) that allows you to always say which one of two things is greater and which one is lesser. An ordering in which some things are incomparable (you can‐ not say which is greater or smaller) is called a partial order. See “The causal order is not a total order” on page 341.
+
+A way of comparing things (e.g., timestamps) that allows you to always say which one of two things is greater and which one is lesser. An ordering in which some things are incomparable (you cannot say which is greater or smaller) is called a *partial order*.
### transaction
-Grouping together several reads and writes into a logical unit, in order to sim‐ plify error handling and concurrency issues. See Chapter 7.
+
+Grouping together several reads and writes into a logical unit, in order to simplify error handling and concurrency issues. See [Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions).
### two-phase commit (2PC)
-An algorithm to ensure that several data‐ base nodes either all commit or all abort a transaction. See “Atomic Commit and Two-Phase Commit (2PC)” on page 354.
+
+An algorithm to ensure that several database nodes either all *atomically* commit or all abort a transaction. See [“Two-Phase Commit (2PC)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pc).
### two-phase locking (2PL)
-An algorithm for achieving serializable isolation that works by a transaction acquiring a lock on all data it reads or writes, and holding the lock until the end of the transaction. See “Two-Phase Lock‐ ing (2PL)” on page 257.
+
+An algorithm for achieving *serializable isolation* that works by a transaction acquiring a lock on all data it reads or writes, and holding the lock until the end of the transaction. See [“Two-Phase Locking (2PL)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pl).
### unbounded
-Not having any known upper limit or size. The opposite of bounded.
+
+Not having any known upper limit or size. The opposite of *bounded*.
-
-
-
-
-……
\ No newline at end of file
diff --git a/content/en/indexes.md b/content/en/indexes.md
new file mode 100644
index 0000000..55e81ac
--- /dev/null
+++ b/content/en/indexes.md
@@ -0,0 +1,3542 @@
+---
+title: Indexes
+weight: 550
+breadcrumbs: false
+---
+
+### Symbols
+
+- 3FS (distributed filesystem), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+
+### A
+
+- aborts (transactions), [Transactions](/en/ch8#ch_transactions), [Atomicity](/en/ch8#sec_transactions_acid_atomicity)
+ - cascading, [No dirty reads](/en/ch8#no-dirty-reads)
+ - in two-phase commit, [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
+ - performance of optimistic concurrency control, [Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
+ - retrying aborted transactions, [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
+- abstraction, [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Simplicity: Managing Complexity](/en/ch2#id38), [Data Models and Query Languages](/en/ch3#ch_datamodels), [Transactions](/en/ch8#ch_transactions), [Summary](/en/ch8#summary)
+- accidental complexity, [Simplicity: Managing Complexity](/en/ch2#id38)
+- accountability, [Responsibility and Accountability](/en/ch14#id371)
+- accounting (financial data), [Summary](/en/ch3#summary), [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
+- Accumulo (database)
+ - wide-column data model, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality), [Column Compression](/en/ch4#sec_storage_column_compression)
+- ACID properties (transactions), [The Meaning of ACID](/en/ch8#sec_transactions_acid)
+ - atomicity, [Atomicity](/en/ch8#sec_transactions_acid_atomicity), [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
+ - consistency, [Consistency](/en/ch8#sec_transactions_acid_consistency), [Maintaining integrity in the face of software bugs](/en/ch13#id455)
+ - durability, [Making B-trees reliable](/en/ch4#sec_storage_btree_wal), [Durability](/en/ch8#durability)
+ - isolation, [Isolation](/en/ch8#sec_transactions_acid_isolation), [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
+- acknowledgments (messaging), [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
+- active/active replication (see multi-leader replication)
+- active/passive replication (see leader-based replication)
+- ActiveMQ (messaging), [Message brokers](/en/ch5#message-brokers), [Message brokers compared to databases](/en/ch12#id297)
+ - distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
+- ActiveRecord (object-relational mapper), [Object-relational mapping (ORM)](/en/ch3#object-relational-mapping-orm), [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
+- activity (workflows) (see workflow engines)
+- actor model, [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
+ - (see also event-driven architecture)
+ - comparison to stream processing, [Event-Driven Architectures and RPC](/en/ch12#sec_stream_actors_drpc)
+- adaptive capacity, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
+- Advanced Message Queuing Protocol (see AMQP)
+- aerospace systems, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
+- Aerospike (database)
+ - strong consistency mode, [Single-object writes](/en/ch8#sec_transactions_single_object)
+- AGE (graph database), [The Cypher Query Language](/en/ch3#id57)
+- aggregation
+ - data cubes and materialized views, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
+ - in batch processes, [Sorting Versus In-memory Aggregation](/en/ch11#id275)
+ - in stream processes, [Stream analytics](/en/ch12#id318)
+- aggregation pipeline (MongoDB), [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization), [Query languages for documents](/en/ch3#query-languages-for-documents)
+- Agile, [Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability)
+ - minimizing irreversibility, [Batch Processing](/en/ch11#ch_batch), [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
+ - moving faster with confidence, [The end-to-end argument again](/en/ch13#id456)
+- agreement, [Single-value consensus](/en/ch10#single-value-consensus), [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
+ - (see also consensus)
+- AI (artificial intelligence) (see machine learning)
+- AI Act (European Union), [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
+- AirByte, [Data Warehousing](/en/ch1#sec_introduction_dwh)
+- Airflow (workflow scheduler), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows), [Batch Processing](/en/ch11#ch_batch), [Scheduling Workflows](/en/ch11#sec_batch_workflows)
+ - cloud data warehouse integration, [Query languages](/en/ch11#sec_batch_query_lanauges)
+ - use for ETL, [Extract–Transform–Load (ETL)](/en/ch11#sec_batch_etl_usage)
+- Akamai
+ - response time study, [Average, Median, and Percentiles](/en/ch2#id24)
+- algorithms
+ - algorithm correctness, [Defining the correctness of an algorithm](/en/ch9#defining-the-correctness-of-an-algorithm)
+ - B-trees, [B-Trees](/en/ch4#sec_storage_b_trees)-[B-tree variants](/en/ch4#b-tree-variants)
+ - for distributed systems, [System Model and Reality](/en/ch9#sec_distributed_system_model)
+ - mergesort, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Shuffling Data](/en/ch11#sec_shuffle)
+ - scheduling, [Resource Allocation](/en/ch11#id279)
+ - SSTables and LSM-trees, [The SSTable file format](/en/ch4#the-sstable-file-format)-[Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+- all-to-all replication topologies, [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)
+- AllegroGraph (database), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+ - SPARQL query language, [The SPARQL query language](/en/ch3#the-sparql-query-language)
+- ALTER TABLE statement (SQL), [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility), [Encoding and Evolution](/en/ch5#ch_encoding)
+- Amazon
+ - Dynamo (see Dynamo (database))
+ - response time study, [Average, Median, and Percentiles](/en/ch2#id24)
+- Amazon Web Services (AWS)
+ - Aurora (see Aurora (cloud database))
+ - ClockBound (see ClockBound (time sync))
+ - correctness testing, [Formal Methods and Randomized Testing](/en/ch9#sec_distributed_formal)
+ - DynamoDB (see DynamoDB (database))
+ - EBS (see EBS (virtual block device))
+ - Kinesis (see Kinesis (messaging))
+ - Neptune (see Neptune (graph database))
+ - network reliability, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
+ - S3 (see S3 (object storage))
+- amplification
+ - of bias, [Bias and Discrimination](/en/ch14#id370)
+ - of failures, [Maintaining derived state](/en/ch13#id446)
+ - of tail latency, [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla), [Local Secondary Indexes](/en/ch7#id166)
+ - write amplification, [Write amplification](/en/ch4#write-amplification)
+- AMQP (Advanced Message Queuing Protocol), [Message brokers compared to databases](/en/ch12#id297)
+ - (see also messaging systems)
+ - comparison to log-based messaging, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Replaying old messages](/en/ch12#sec_stream_replay)
+ - message ordering, [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
+- analytical systems, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)
+ - as derived data systems, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
+ - ETL from operational systems, [Data Warehousing](/en/ch1#sec_introduction_dwh)
+ - governance, [Beyond the data lake](/en/ch1#beyond-the-data-lake)
+- analytics, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)-[Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
+ - comparison to transaction processing, [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
+ - data normalization, [Trade-offs of normalization](/en/ch3#trade-offs-of-normalization)
+ - data warehousing (see data warehousing)
+ - predictive (see predictive analytics)
+ - relation to batch processing, [Analytics](/en/ch11#sec_batch_olap)-[Analytics](/en/ch11#sec_batch_olap)
+ - schemas for, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)-[Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
+ - snapshot isolation for queries, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
+ - stream analytics, [Stream analytics](/en/ch12#id318)
+- analytics engineering, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)
+- anti-entropy, [Catching up on missed writes](/en/ch6#sec_replication_read_repair)
+- Antithesis (deterministic simulation testing), [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+- Apache Accumulo (see Accumulo)
+- Apache ActiveMQ (see ActiveMQ)
+- Apache AGE (see AGE)
+- Apache Arrow (see Arrow (data format))
+- Apache Avro (see Avro)
+- Apache Beam (see Beam)
+- Apache BookKeeper (see BookKeeper)
+- Apache Cassandra (see Cassandra)
+- Apache Curator (see Curator)
+- Apache DataFusion (see DataFusion (query engine))
+- Apache Druid (see Druid (database))
+- Apache Flink (see Flink (processing framework))
+- Apache HBase (see HBase)
+- Apache Iceberg (see Iceberg (table format))
+- Apache Jena (see Jena)
+- Apache Kafka (see Kafka)
+- Apache Lucene (see Lucene)
+- Apache Oozie (see Oozie (workflow scheduler))
+- Apache ORC (see ORC (data format))
+- Apache Parquet (see Parquet (data format))
+- Apache Pig (query language), [Query languages](/en/ch11#sec_batch_query_lanauges)
+- Apache Pinot (see Pinot (database))
+- Apache Pulsar (see Pulsar)
+- Apache Qpid (see Qpid)
+- Apache Samza (see Samza)
+- Apache Solr (see Solr)
+- Apache Spark (see Spark (processing framework))
+- Apache Storm (see Storm)
+- Apache Superset (see Superset (data visualization software))
+- Apache Thrift (see Thrift)
+- Apache ZooKeeper (see ZooKeeper)
+- Apama (stream analytics), [Complex event processing](/en/ch12#id317)
+- append-only files (see logs)
+- Application Programming Interfaces (APIs), [Data Models and Query Languages](/en/ch3#ch_datamodels)
+ - for change streams, [API support for change streams](/en/ch12#sec_stream_change_api)
+ - for distributed transactions, [XA transactions](/en/ch8#xa-transactions)
+ - for services, [Dataflow Through Services: REST and RPC](/en/ch5#sec_encoding_dataflow_rpc)-[Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
+ - (see also services)
+ - evolvability, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
+ - RESTful, [Web services](/en/ch5#sec_web_services)
+- application state (see state)
+- approximate search (see similarity search)
+- archival storage, data from databases, [Archival storage](/en/ch5#archival-storage)
+- arcs (see edges)
+- ArcticDB (database), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- arithmetic mean, [Average, Median, and Percentiles](/en/ch2#id24)
+- arrays
+ - array databases, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+ - multidimensional, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- Arrow (data format), [Column-Oriented Storage](/en/ch4#sec_storage_column), [DataFrames](/en/ch11#id287)
+- artificial intelligence (see machine learning)
+- ASCII text, [Protocol Buffers](/en/ch5#sec_encoding_protobuf)
+- ASN.1 (schema language), [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
+- associative table, [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many), [Property Graphs](/en/ch3#id56)
+- asynchronous networks, [Unreliable Networks](/en/ch9#sec_distributed_networks), [Glossary](/en/glossary)
+ - comparison to synchronous networks, [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
+ - system model, [System Model and Reality](/en/ch9#sec_distributed_system_model)
+- asynchronous replication, [Synchronous Versus Asynchronous Replication](/en/ch6#sec_replication_sync_async), [Glossary](/en/glossary)
+ - data loss on failover, [Leader failure: Failover](/en/ch6#leader-failure-failover)
+ - reads from asynchronous follower, [Problems with Replication Lag](/en/ch6#sec_replication_lag)
+ - with multiple leaders, [Multi-Leader Replication](/en/ch6#sec_replication_multi_leader)
+- Asynchronous Transfer Mode (ATM), [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
+- atomic broadcast, [Shared logs as consensus](/en/ch10#sec_consistency_shared_logs)
+- atomic clocks, [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+ - (see also clocks)
+- atomicity (concurrency), [Glossary](/en/glossary)
+ - atomic increment, [Single-object writes](/en/ch8#sec_transactions_single_object)
+ - compare-and-set (CAS), [Conditional writes (compare-and-set)](/en/ch8#sec_transactions_compare_and_set), [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+ - (see also compare-and-set (CAS))
+ - denormalized data, [Trade-offs of normalization](/en/ch3#trade-offs-of-normalization)
+ - fetch-and-add/increment, [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical), [Consensus](/en/ch10#sec_consistency_consensus), [Fetch-and-add as consensus](/en/ch10#fetch-and-add-as-consensus)
+ - write operations, [Atomic write operations](/en/ch8#atomic-write-operations)
+- atomicity (transactions), [Atomicity](/en/ch8#sec_transactions_acid_atomicity), [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object), [Glossary](/en/glossary)
+ - atomic commit
+ - avoiding, [Multi-shard request processing](/en/ch13#id360), [Coordination-avoiding data systems](/en/ch13#id454)
+ - blocking and nonblocking, [Three-phase commit](/en/ch8#three-phase-commit)
+ - in stream processing, [Exactly-once message processing](/en/ch8#sec_transactions_exactly_once), [Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited), [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit)
+ - maintaining derived data, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
+ - distributed transactions, [Distributed Transactions](/en/ch8#sec_transactions_distributed)-[Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
+ - for multi-object transactions, [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
+ - for single-object writes, [Single-object writes](/en/ch8#sec_transactions_single_object)
+ - relation to consensus, [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
+- auditability, [Trust, but Verify](/en/ch13#sec_future_verification)-[Tools for auditable data systems](/en/ch13#id366)
+ - designing for, [Designing for auditability](/en/ch13#id365)
+ - self-auditing systems, [Don't just blindly trust what they promise](/en/ch13#id364)
+ - through immutability, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
+ - tools for auditable data systems, [Tools for auditable data systems](/en/ch13#id366)
+- Aurora (cloud database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
+- Aurora DSQL (database)
+ - snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
+- auto-scaling, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
+- Automerge (CRDT library), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+- availability, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)
+ - (see also fault tolerance)
+ - in CAP theorem, [The CAP theorem](/en/ch10#the-cap-theorem)
+ - in leader election, [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
+ - in service level agreements (SLAs), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
+- availability zones, [Tolerating hardware faults through redundancy](/en/ch2#tolerating-hardware-faults-through-redundancy), [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
+- Avro (data format), [Avro](/en/ch5#sec_encoding_avro)-[Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
+ - dynamically generated schemas, [Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
+ - object container files, [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema), [Archival storage](/en/ch5#archival-storage)
+ - reader determining writer's schema, [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
+ - schema evolution, [The writer's schema and the reader's schema](/en/ch5#the-writers-schema-and-the-readers-schema)
+ - use in batch processing, [MapReduce](/en/ch11#sec_batch_mapreduce)
+- awk (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis), [Distributed Job Orchestration](/en/ch11#id278)
+- Axon Framework, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+- Azkaban (workflow scheduler), [Batch Processing](/en/ch11#ch_batch)
+- Azure Blob Storage (object storage), [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - conditional headers, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
+- Azure managed disks, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
+- Azure SQL DB (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
+- Azure Storage, [Object Stores](/en/ch11#id277)
+- Azure Synapse Analytics (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
+- Azure Virtual Machines
+ - spot virtual machines, [Handling Faults](/en/ch11#id281)
+
+### B
+
+- B-trees (indexes), [B-Trees](/en/ch4#sec_storage_b_trees)-[B-tree variants](/en/ch4#b-tree-variants)
+ - B+ trees, [B-tree variants](/en/ch4#b-tree-variants)
+ - branching factor, [B-Trees](/en/ch4#sec_storage_b_trees)
+ - comparison to LSM-trees, [Comparing B-Trees and LSM-Trees](/en/ch4#sec_storage_btree_lsm_comparison)-[Disk space usage](/en/ch4#disk-space-usage)
+ - crash recovery, [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
+ - growing by splitting a page, [B-Trees](/en/ch4#sec_storage_b_trees)
+ - immutable variants, [B-tree variants](/en/ch4#b-tree-variants), [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
+ - similarity to shard splitting, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)
+ - variants, [B-tree variants](/en/ch4#b-tree-variants)
+- B2 (object storage), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- Backblaze B2 (see B2 (object storage))
+- backend, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
+- backoff, exponential, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
+- backpressure, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Read performance](/en/ch4#read-performance), [Messaging Systems](/en/ch12#sec_stream_messaging), [Glossary](/en/glossary)
+ - in batch processing, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
+ - in TCP, [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
+- backups
+ - database snapshot for replication, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - in multitenant systems, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+ - integrity of, [Don't just blindly trust what they promise](/en/ch13#id364)
+ - snapshot isolation for, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
+ - using object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - versus replication, [Replication](/en/ch6#ch_replication)
+- backward compatibility, [Encoding and Evolution](/en/ch5#ch_encoding)
+- BadgerDB (database)
+ - serializable transactions, [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)
+- BASE, contrast to ACID, [The Meaning of ACID](/en/ch8#sec_transactions_acid)
+- bash shell (Unix), [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp)
+- batch processing, [Batch Processing](/en/ch11#ch_batch)-[Summary](/en/ch11#id292), [Glossary](/en/glossary)
+ - and functional programming, [MapReduce](/en/ch11#sec_batch_mapreduce)
+ - benefits of, [Batch Processing](/en/ch11#ch_batch)
+ - combining with stream processing, [Unifying batch and stream processing](/en/ch13#id338)
+ - comparison to stream processing, [Processing Streams](/en/ch12#sec_stream_processing)
+ - dataflow engines, [Dataflow Engines](/en/ch11#sec_batch_dataflow)-[Dataflow Engines](/en/ch11#sec_batch_dataflow)
+ - fault tolerance, [Handling Faults](/en/ch11#id281), [Messaging Systems](/en/ch12#sec_stream_messaging)
+ - for data integration, [Batch and Stream Processing](/en/ch13#sec_future_batch_streaming)-[Unifying batch and stream processing](/en/ch13#id338)
+ - graphs and iterative processing, [Machine Learning](/en/ch11#id290)
+ - high-level APIs and languages, [Query languages](/en/ch11#sec_batch_query_lanauges)-[Query languages](/en/ch11#sec_batch_query_lanauges)
+ - in cloud data warehouses, [Query languages](/en/ch11#sec_batch_query_lanauges)
+ - in distributed systems, [Batch Processing in Distributed Systems](/en/ch11#sec_batch_distributed)
+ - join and group by, [JOIN and GROUP BY](/en/ch11#sec_batch_join)-[JOIN and GROUP BY](/en/ch11#sec_batch_join)
+ - limitations, [Batch Processing](/en/ch11#ch_batch)
+ - log-based messaging and, [Replaying old messages](/en/ch12#sec_stream_replay)
+ - maintaining derived state, [Maintaining derived state](/en/ch13#id446)
+ - measuring performance, [Batch Processing](/en/ch11#ch_batch)
+ - models of, [Batch Processing Models](/en/ch11#id431)
+ - resource allocation, [Resource Allocation](/en/ch11#id279)-[Resource Allocation](/en/ch11#id279)
+ - resource managers, [Distributed Job Orchestration](/en/ch11#id278)
+ - schedulers, [Distributed Job Orchestration](/en/ch11#id278)
+ - serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)-[Serving Derived Data](/en/ch11#sec_batch_serving_derived)
+ - shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)-[Shuffling Data](/en/ch11#sec_shuffle)
+ - task execution, [Distributed Job Orchestration](/en/ch11#id278)
+ - use cases, [Batch Use Cases](/en/ch11#sec_batch_output)-[Serving Derived Data](/en/ch11#sec_batch_serving_derived)
+ - using Unix tools (example), [Batch Processing with Unix Tools](/en/ch11#sec_batch_unix)-[Sorting Versus In-memory Aggregation](/en/ch11#id275)
+- batch processing frameworks
+ - comparison to operating systems, [Batch Processing in Distributed Systems](/en/ch11#sec_batch_distributed)
+- Beam (dataflow library), [Unifying batch and stream processing](/en/ch13#id338)
+- BERT (language model), [Vector Embeddings](/en/ch4#id92)
+- bias, [Bias and Discrimination](/en/ch14#id370)
+- bidirectional replication (see multi-leader replication)
+- big ball of mud, [Simplicity: Managing Complexity](/en/ch2#id38)
+- big data
+ - versus data minimization, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
+- BigQuery (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Batch Processing](/en/ch11#ch_batch)
+ - DataFrames, [Query languages](/en/ch11#sec_batch_query_lanauges)
+ - sharding and clustering, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+ - shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
+ - snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
+- Bigtable (database)
+ - sharding scheme, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
+ - storage layout, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+ - tablets (sharding), [Sharding](/en/ch7#ch_sharding)
+ - wide-column data model, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality), [Column Compression](/en/ch4#sec_storage_column_compression)
+- binary data encodings, [Binary encoding](/en/ch5#binary-encoding)-[The Merits of Schemas](/en/ch5#sec_encoding_schemas)
+ - Avro, [Avro](/en/ch5#sec_encoding_avro)-[Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
+ - MessagePack, [Binary encoding](/en/ch5#binary-encoding)-[Binary encoding](/en/ch5#binary-encoding)
+ - Protocol Buffers, [Protocol Buffers](/en/ch5#sec_encoding_protobuf)-[Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
+- binary encoding
+ - based on schemas, [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
+ - by network drivers, [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
+- binary strings, lack of support in JSON and XML, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
+- Bitcoin (cryptocurrency), [Tools for auditable data systems](/en/ch13#id366)
+ - Byzantine fault tolerance, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
+ - concurrency bugs in exchanges, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
+- bitmap indexes, [Column Compression](/en/ch4#sec_storage_column_compression)
+- BitTorrent uTP protocol, [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
+- Bkd-trees (indexes), [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
+- blameless postmortems, [Humans and Reliability](/en/ch2#id31)
+- Blazegraph (database), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+ - SPARQL query language, [The SPARQL query language](/en/ch3#the-sparql-query-language)
+- blob storage (see object storage)
+- block (file system), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- block device (disk), [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
+- blockchains, [Summary](/en/ch3#summary)
+ - Byzantine fault tolerance, [Byzantine Faults](/en/ch9#sec_distributed_byzantine), [Consensus](/en/ch10#sec_consistency_consensus), [Tools for auditable data systems](/en/ch13#id366)
+- blocking atomic commit, [Three-phase commit](/en/ch8#three-phase-commit)
+- Bloom filter (algorithm), [Bloom filters](/en/ch4#bloom-filters), [Read performance](/en/ch4#read-performance), [Stream analytics](/en/ch12#id318)
+- BookKeeper (replicated log), [Allocating work to nodes](/en/ch10#allocating-work-to-nodes)
+- bounded datasets, [Stream Processing](/en/ch12#ch_stream), [Glossary](/en/glossary)
+ - (see also batch processing)
+- bounded delays, [Glossary](/en/glossary)
+ - in networks, [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
+ - process pauses, [Response time guarantees](/en/ch9#sec_distributed_clocks_realtime)
+- broadcast
+ - total order broadcast (see shared logs)
+- brokerless messaging, [Direct messaging from producers to consumers](/en/ch12#id296)
+- Brubeck (metrics aggregator), [Direct messaging from producers to consumers](/en/ch12#id296)
+- BTM (transaction coordinator), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
+- Buf
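+ - Bufstream (messaging), [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Disk space usage](/en/ch12#sec_stream_disk_usage)
+- Bufstream (messaging), [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Disk space usage](/en/ch12#sec_stream_disk_usage)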
+- build or buy, [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud)
+- bursty network traffic patterns, [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
+- business analyst, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
+- business data processing, [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
+- business intelligence, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)-[Data Warehousing](/en/ch1#sec_introduction_dwh)
+- Business Process Execution Language (BPEL), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
+- Business Process Model and Notation (BPMN), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
+ - example, [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
+- byte sequence, encoding data in, [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
+- Byzantine faults, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)-[Weak forms of lying](/en/ch9#weak-forms-of-lying), [System Model and Reality](/en/ch9#sec_distributed_system_model), [Glossary](/en/glossary)
+ - Byzantine fault-tolerant systems, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
+ - Byzantine Generals Problem, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
+ - consensus algorithms and, [Consensus](/en/ch10#sec_consistency_consensus), [Tools for auditable data systems](/en/ch13#id366)
+
+### C
+
+- caches, [Keeping everything in memory](/en/ch4#sec_storage_inmemory), [Glossary](/en/glossary)
+ - and materialized views, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
+ - as derived data, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived), [Composing Data Storage Technologies](/en/ch13#id447)-[Unbundled versus integrated systems](/en/ch13#id448)
+ - in CPUs, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized), [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
+ - invalidation and maintenance, [Keeping Systems in Sync](/en/ch12#sec_stream_sync), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+ - linearizability, [Linearizability](/en/ch10#sec_consistency_linearizability)
+ - local disks in the cloud, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
+- calendar sync, [Sync Engines and Local-First Software](/en/ch6#sec_replication_offline_clients), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+- California Consumer Privacy Act (CCPA), [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
+- Camunda (workflow engine), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
+- canonical version (of data), [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
+- CAP theorem, [The CAP theorem](/en/ch10#the-cap-theorem), [Glossary](/en/glossary)
+- capacity planning, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+- Cap'n Proto (data format), [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
+- carbon emissions, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
+- cascading aborts, [No dirty reads](/en/ch8#no-dirty-reads)
+- cascading failures, [Software faults](/en/ch2#software-faults), [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations), [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing)
+- Cassandra (database)
+ - change data capture, [Implementing change data capture](/en/ch12#id307), [API support for change streams](/en/ch12#sec_stream_change_api)
+ - compaction strategy, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+ - consistency level ANY, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
+ - hash-range sharding, [Sharding by Hash of Key](/en/ch7#sec_sharding_hash), [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+ - last-write-wins conflict resolution, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)
+ - leaderless replication, [Leaderless Replication](/en/ch6#sec_replication_leaderless)
+ - lightweight transactions, [Single-object writes](/en/ch8#sec_transactions_single_object)
+ - linearizability, lack of, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
+ - log-structured storage, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+ - multi-region support, [Multi-region operation](/en/ch6#multi-region-operation)
+ - secondary indexes, [Local Secondary Indexes](/en/ch7#id166)
+ - use of clocks, [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations), [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
+ - vnodes (sharding), [Sharding](/en/ch7#ch_sharding)
+- cat (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis)
+- catalog, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+- causal context, [Version vectors](/en/ch6#version-vectors)
+ - (see also causal dependencies)
+- causal dependencies, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)-[Version vectors](/en/ch6#version-vectors)
+ - capturing, [Version vectors](/en/ch6#version-vectors), [Ordering events to capture causality](/en/ch13#sec_future_capture_causality), [Reads are events too](/en/ch13#sec_future_read_events)
+   - by total ordering, [The limits of total ordering](/en/ch13#id335)
+ - in transactions, [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)
+ - sending message to friends (example), [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
+- causality, [Glossary](/en/glossary)
+ - causal ordering
+   - total order consistent with, [Logical Clocks](/en/ch10#sec_consistency_timestamps)
+ - consistency with, [Logical Clocks](/en/ch10#sec_consistency_timestamps)-[Enforcing constraints using logical clocks](/en/ch10#enforcing-constraints-using-logical-clocks)
+ - happens-before relation, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)
+ - in serializable transactions, [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)-[Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
+ - mismatch with clocks, [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
+ - ordering events to capture, [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
+ - violations of, [Consistent Prefix Reads](/en/ch6#sec_replication_consistent_prefix), [Problems with different topologies](/en/ch6#problems-with-different-topologies), [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
+ - with synchronized clocks, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+- cell-based architecture, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+- CEP (see complex event processing)
+- CephFS (distributed filesystem), [Batch Processing](/en/ch11#ch_batch), [Object Stores](/en/ch11#id277)
+- certificate transparency, [Tools for auditable data systems](/en/ch13#id366)
+- cgroups, [Distributed Job Orchestration](/en/ch11#id278)
+- change data capture, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication), [Change Data Capture](/en/ch12#sec_stream_cdc)
+ - API support for change streams, [API support for change streams](/en/ch12#sec_stream_change_api)
+ - comparison to event sourcing, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
+ - implementing, [Implementing change data capture](/en/ch12#id307)
+ - initial snapshot, [Initial snapshot](/en/ch12#sec_stream_cdc_snapshot)
+ - log compaction, [Log compaction](/en/ch12#sec_stream_log_compaction)
+- changelogs, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)
+ - change data capture, [Change Data Capture](/en/ch12#sec_stream_cdc)
+ - for operator state, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
+ - in stream joins, [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins)
+ - log compaction, [Log compaction](/en/ch12#sec_stream_log_compaction)
+ - maintaining derived state, [Databases and Streams](/en/ch12#sec_stream_databases)
+- chaos engineering, [Fault Tolerance](/en/ch2#id27), [Fault injection](/en/ch9#sec_fault_injection)
+- checkpointing
+ - in high-performance computing, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
+ - in stream processors, [Microbatching and checkpointing](/en/ch12#id329)
+- circuit breaker (limiting retries), [Describing Performance](/en/ch2#sec_introduction_percentiles)
+- circuit-switched networks, [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
+- circular buffers, [Disk space usage](/en/ch12#sec_stream_disk_usage)
+- circular replication topologies, [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)
+- Citus (database)
+ - hash sharding, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
+- ClickHouse (database), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
+ - incremental view maintenance, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+- clickstream data, analysis of, [JOIN and GROUP BY](/en/ch11#sec_batch_join)
+- clients
+ - calling services, [Dataflow Through Services: REST and RPC](/en/ch5#sec_encoding_dataflow_rpc)
+ - offline-capable, [Sync Engines and Local-First Software](/en/ch6#sec_replication_offline_clients), [Stateful, offline-capable clients](/en/ch13#id347)
+ - pushing state changes to, [Pushing state changes to clients](/en/ch13#id348)
+ - request routing, [Request Routing](/en/ch7#sec_sharding_routing)
+- ClockBound (time sync), [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval)
+ - use in YugabyteDB, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+- clocks, [Unreliable Clocks](/en/ch9#sec_distributed_clocks)-[Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
+ - atomic clocks, [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+ - confidence interval, [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval)-[Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+ - for global snapshots, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+ - hybrid logical clocks, [Hybrid logical clocks](/en/ch10#hybrid-logical-clocks)
+ - logical (see logical clocks)
+ - skew, [Last write wins (discarding concurrent writes)](/en/ch6#sec_replication_lww), [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations), [Relying on Synchronized Clocks](/en/ch9#sec_distributed_clocks_relying)-[Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval), [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
+ - slewing, [Monotonic clocks](/en/ch9#monotonic-clocks)
+ - synchronization and accuracy, [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
+ - synchronization using GPS, [Unreliable Clocks](/en/ch9#sec_distributed_clocks), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy), [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+ - time-of-day versus monotonic clocks, [Monotonic Versus Time-of-Day Clocks](/en/ch9#sec_distributed_monotonic_timeofday)
+ - timestamping events, [Whose clock are you using, anyway?](/en/ch12#id438)
+- cloud services, [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud)-[Cloud Computing Versus Supercomputing](/en/ch1#id17)
+ - availability zones, [Tolerating hardware faults through redundancy](/en/ch2#tolerating-hardware-faults-through-redundancy), [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
+ - data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+ - need for service discovery, [Service discovery](/en/ch10#service-discovery)
+ - network glitches, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
+ - pros and cons, [Pros and Cons of Cloud Services](/en/ch1#sec_introduction_cloud_tradeoffs)
+ - quotas, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+ - regions (see regions (geographic distribution))
+ - serverless, [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
+ - shared resources, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+ - versus supercomputing, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
+- cloud-native, [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)-[Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+- Cloudflare
+ - R2 (see R2 (object storage))
+- clustered indexes, [Storing values within the index](/en/ch4#sec_storage_index_heap)
+- clustering (record ordering), [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+- CockroachDB (database)
+ - consensus-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+ - consistency model, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+ - key-range sharding, [Sharding](/en/ch7#ch_sharding), [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
+ - serializable transactions, [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)
+ - sharded secondary indexes, [Global Secondary Indexes](/en/ch7#id167)
+ - transactions, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
+ - use of model-checking, [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
+- code generation
+ - for query execution, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+ - with Protocol Buffers, [Protocol Buffers](/en/ch5#sec_encoding_protobuf)
+- collaborative editing, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
+- column families (Bigtable), [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality), [Column Compression](/en/ch4#sec_storage_column_compression)
+- column-oriented storage, [Column-Oriented Storage](/en/ch4#sec_storage_column)-[Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+ - column compression, [Column Compression](/en/ch4#sec_storage_column_compression)
+ - Parquet, [Column-Oriented Storage](/en/ch4#sec_storage_column), [Archival storage](/en/ch5#archival-storage)
+ - sort order in, [Sort Order in Column Storage](/en/ch4#sort-order-in-column-storage)
+ - vectorized processing, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+ - versus wide-column model, [Column Compression](/en/ch4#sec_storage_column_compression)
+ - writing to, [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
+- comma-separated values (see CSV)
+- command query responsibility segregation (CQRS), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
+- commands (event sourcing), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+- commits (transactions), [Transactions](/en/ch8#ch_transactions)
+ - atomic commit, [Distributed Transactions](/en/ch8#sec_transactions_distributed)-[Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
+   - (see also atomicity; transactions)
+ - read committed isolation, [Read Committed](/en/ch8#sec_transactions_read_committed)
+ - three-phase commit (3PC), [Three-phase commit](/en/ch8#three-phase-commit)
+ - two-phase commit (2PC), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)-[Coordinator failure](/en/ch8#coordinator-failure)
+- commutative operations, [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
+- compaction
+ - of changelogs, [Log compaction](/en/ch12#sec_stream_log_compaction)
+   - (see also log compaction)
+   - for stream operator state, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
+ - of log-structured storage, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+   - issues with, [Read performance](/en/ch4#read-performance)
+   - size-tiered and leveled approaches, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction), [Disk space usage](/en/ch4#disk-space-usage)
+- compare-and-set (CAS), [Conditional writes (compare-and-set)](/en/ch8#sec_transactions_compare_and_set), [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+ - implementing locks, [Coordination Services](/en/ch10#sec_consistency_coordination)
+ - implementing uniqueness constraints, [Constraints and uniqueness guarantees](/en/ch10#sec_consistency_uniqueness)
+ - on object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - relation to consensus, [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable), [Consensus](/en/ch10#sec_consistency_consensus), [Compare-and-set as consensus](/en/ch10#compare-and-set-as-consensus)
+ - relation to fencing tokens, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
+ - relation to transactions, [Single-object writes](/en/ch8#sec_transactions_single_object)
+- compatibility, [Encoding and Evolution](/en/ch5#ch_encoding), [Modes of Dataflow](/en/ch5#sec_encoding_dataflow)
+ - calling services, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
+ - properties of encoding formats, [Summary](/en/ch5#summary)
+ - using databases, [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)-[Archival storage](/en/ch5#archival-storage)
+- compensating transactions, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros), [Loosely interpreted constraints](/en/ch13#id362)
+- compilation, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+- complex event processing (CEP), [Complex event processing](/en/ch12#id317)
+- complexity
+ - distilling in theoretical models, [Mapping system models to the real world](/en/ch9#mapping-system-models-to-the-real-world)
+ - essential and accidental, [Simplicity: Managing Complexity](/en/ch2#id38)
+ - hiding using abstraction, [Data Models and Query Languages](/en/ch3#ch_datamodels)
+ - managing, [Simplicity: Managing Complexity](/en/ch2#id38)
+- composing data systems (see unbundling databases)
+- compression
+ - in SSTables, [The SSTable file format](/en/ch4#the-sstable-file-format)
+- compute-intensive applications, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
+- computer games, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+- concatenated indexes, [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
+ - in hash-sharded systems, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+- concurrency
+ - actor programming model, [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks), [Event-Driven Architectures and RPC](/en/ch12#sec_stream_actors_drpc)
+   - (see also event-driven architecture)
+ - bugs from weak transaction isolation, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
+ - conflict resolution, [Dealing with Conflicting Writes](/en/ch6#sec_replication_write_conflicts)-[Types of conflict](/en/ch6#sec_replication_write_conflicts)
+ - definition, [Dealing with Conflicting Writes](/en/ch6#sec_replication_write_conflicts)
+ - detecting concurrent writes, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)-[Version vectors](/en/ch6#version-vectors)
+ - dual writes, problems with, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
+ - happens-before relation, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)
+ - in replicated systems, [Problems with Replication Lag](/en/ch6#sec_replication_lag)-[Version vectors](/en/ch6#version-vectors), [Linearizability](/en/ch10#sec_consistency_linearizability)-[Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
+ - lost updates, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)
+ - multi-version concurrency control (MVCC), [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+ - optimistic concurrency control, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
+ - ordering of operations, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+ - reducing, through event logs, [Concurrency control](/en/ch12#sec_stream_concurrency), [Dataflow: Interplay between state changes and application code](/en/ch13#id450)
+ - time and relativity, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)
+ - transaction isolation, [Isolation](/en/ch8#sec_transactions_acid_isolation)
+ - write skew (transaction isolation), [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Materializing conflicts](/en/ch8#materializing-conflicts)
+- conditional write, [Conditional writes (compare-and-set)](/en/ch8#sec_transactions_compare_and_set)
+ - in transactions, [Single-object writes](/en/ch8#sec_transactions_single_object)
+ - on object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+- conference management system (example), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+- conflict-free replicated datatypes (CRDTs), [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
+ - for leaderless replication, [Capturing the happens-before relationship](/en/ch6#capturing-the-happens-before-relationship)
+ - preventing lost updates, [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
+- conflicts
+ - avoidance, [Conflict avoidance](/en/ch6#conflict-avoidance)
+ - causal dependencies, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)
+ - conflict detection
+   - in distributed transactions, [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
+   - in log-based systems, [Uniqueness constraints require consensus](/en/ch13#id452)
+   - in serializable snapshot isolation (SSI), [Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
+   - in two-phase commit, [A system of promises](/en/ch8#a-system-of-promises)
+ - conflict resolution
+   - by aborting transactions, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
+   - by apologizing, [Loosely interpreted constraints](/en/ch13#id362)
+   - last write wins (LWW), [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
+   - using atomic operations, [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
+ - determining what is a conflict, [Types of conflict](/en/ch6#sec_replication_write_conflicts), [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
+ - in leaderless replication, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)
+ - lost updates, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)-[Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
+ - materializing, [Materializing conflicts](/en/ch8#materializing-conflicts)
+ - resolution, [Dealing with Conflicting Writes](/en/ch6#sec_replication_write_conflicts)-[Types of conflict](/en/ch6#sec_replication_write_conflicts)
+   - automatic, [Automatic conflict resolution](/en/ch6#automatic-conflict-resolution)
+   - in leaderless systems, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)
+   - last write wins (LWW), [Last write wins (discarding concurrent writes)](/en/ch6#sec_replication_lww)
+   - using custom logic, [Manual conflict resolution](/en/ch6#manual-conflict-resolution), [Capturing the happens-before relationship](/en/ch6#capturing-the-happens-before-relationship)
+ - siblings, [Manual conflict resolution](/en/ch6#manual-conflict-resolution), [Capturing the happens-before relationship](/en/ch6#capturing-the-happens-before-relationship)
+   - merging, [Capturing the happens-before relationship](/en/ch6#capturing-the-happens-before-relationship)
+ - write skew (transaction isolation), [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Materializing conflicts](/en/ch8#materializing-conflicts)
+- Confluent
+ - Freight (messaging), [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Disk space usage](/en/ch12#sec_stream_disk_usage)
+ - schema registry, [JSON Schema](/en/ch5#json-schema), [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
+- congestion (networks)
+ - avoidance, [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
+ - limiting accuracy of clocks, [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval)
+ - queueing delays, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+- consensus, [Consensus](/en/ch10#sec_consistency_consensus)-[Summary](/en/ch10#summary), [Glossary](/en/glossary)
+ - algorithms, [Consensus](/en/ch10#sec_consistency_consensus), [Consensus in Practice](/en/ch10#sec_consistency_total_order)
+ - consensus numbers, [Fetch-and-add as consensus](/en/ch10#fetch-and-add-as-consensus)
+ - coordination services, [Coordination Services](/en/ch10#sec_consistency_coordination)-[Service discovery](/en/ch10#service-discovery)
+ - cost of, [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
+ - impossibility of, [Consensus](/en/ch10#sec_consistency_consensus)
+ - preventing split brain, [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
+ - reconfiguration, [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
+ - relation to atomic commitment, [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
+ - relation to compare-and-set (CAS), [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable), [Compare-and-set as consensus](/en/ch10#compare-and-set-as-consensus)
+ - relation to fetch-and-add, [Fetch-and-add as consensus](/en/ch10#fetch-and-add-as-consensus)
+ - relation to replication, [Using shared logs](/en/ch10#sec_consistency_smr)
+ - relation to shared logs, [Shared logs as consensus](/en/ch10#sec_consistency_shared_logs)
+ - relation to uniqueness constraints, [Uniqueness constraints require consensus](/en/ch13#id452)
+ - safety and liveness properties, [Single-value consensus](/en/ch10#single-value-consensus)
+ - single-value consensus, [Single-value consensus](/en/ch10#single-value-consensus)
+- consent (GDPR), [Consent and Freedom of Choice](/en/ch14#id375)
+- consistency, [Consistency](/en/ch8#sec_transactions_acid_consistency), [Timeliness and Integrity](/en/ch13#sec_future_integrity)
+ - across different databases, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Keeping Systems in Sync](/en/ch12#sec_stream_sync), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views), [Derived data versus distributed transactions](/en/ch13#sec_future_derived_vs_transactions)
+ - causal, [Consistent Prefix Reads](/en/ch6#sec_replication_consistent_prefix), [Problems with different topologies](/en/ch6#problems-with-different-topologies), [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
+ - consistent prefix reads, [Consistent Prefix Reads](/en/ch6#sec_replication_consistent_prefix)
+ - consistent snapshots, [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)-[Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner), [Initial snapshot](/en/ch12#sec_stream_cdc_snapshot), [Creating an index](/en/ch13#id340)
+   - (see also snapshots)
+ - crash recovery, [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
+ - enforcing constraints (see constraints)
+ - eventual, [Problems with Replication Lag](/en/ch6#sec_replication_lag)
+   - (see also eventual consistency)
+ - in ACID transactions, [Consistency](/en/ch8#sec_transactions_acid_consistency), [Maintaining integrity in the face of software bugs](/en/ch13#id455)
+ - in CAP theorem, [The CAP theorem](/en/ch10#the-cap-theorem)
+ - in leader election, [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
+ - in microservices, [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems)
+ - linearizability, [Solutions for Replication Lag](/en/ch6#id131), [Linearizability](/en/ch10#sec_consistency_linearizability)-[Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
+ - meanings of, [Consistency](/en/ch8#sec_transactions_acid_consistency)
+ - monotonic reads, [Monotonic Reads](/en/ch6#sec_replication_monotonic_reads)
+ - of secondary indexes, [The need for multi-object transactions](/en/ch8#sec_transactions_need), [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation), [Reasoning about dataflows](/en/ch13#id443), [Creating an index](/en/ch13#id340)
+ - read-after-write, [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
+ - in derived data systems, [Derived data versus distributed transactions](/en/ch13#sec_future_derived_vs_transactions)
+ - strong (see linearizability)
+ - timeliness and integrity, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
+ - using quorums, [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations), [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
+- consistent hashing, [Consistent hashing](/en/ch7#sec_sharding_consistent_hashing)
+- consistent prefix reads, [Consistent Prefix Reads](/en/ch6#sec_replication_consistent_prefix)
+- constraints (databases), [Consistency](/en/ch8#sec_transactions_acid_consistency), [Characterizing write skew](/en/ch8#characterizing-write-skew)
+ - asynchronously checked, [Loosely interpreted constraints](/en/ch13#id362)
+ - coordination avoidance, [Coordination-avoiding data systems](/en/ch13#id454)
+ - ensuring idempotence, [Uniquely identifying requests](/en/ch13#id355)
+ - in log-based systems, [Enforcing Constraints](/en/ch13#sec_future_constraints)-[Multi-shard request processing](/en/ch13#id360)
+ - across multiple shards, [Multi-shard request processing](/en/ch13#id360)
+ - in two-phase commit, [Distributed Transactions](/en/ch8#sec_transactions_distributed), [A system of promises](/en/ch8#a-system-of-promises)
+ - relation to consensus, [Uniqueness constraints require consensus](/en/ch13#id452)
+ - requiring linearizability, [Constraints and uniqueness guarantees](/en/ch10#sec_consistency_uniqueness)
+- Consul (coordination service), [Coordination Services](/en/ch10#sec_consistency_coordination)
+ - use for service discovery, [Service discovery](/en/ch10#service-discovery)
+- consumers (message streams), [Message brokers](/en/ch5#message-brokers), [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
+ - backpressure, [Messaging Systems](/en/ch12#sec_stream_messaging)
+ - consumer groups, [Multiple consumers](/en/ch12#id298)
+ - consumer offsets in logs, [Consumer offsets](/en/ch12#sec_stream_log_offsets)
+ - failures, [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering), [Consumer offsets](/en/ch12#sec_stream_log_offsets)
+ - fan-out, [Materializing and Updating Timelines](/en/ch2#sec_introduction_materializing), [Multiple consumers](/en/ch12#id298), [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging)
+ - load balancing, [Multiple consumers](/en/ch12#id298), [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging)
+ - not keeping up with producers, [Messaging Systems](/en/ch12#sec_stream_messaging), [Disk space usage](/en/ch12#sec_stream_disk_usage), [Making unbundling work](/en/ch13#sec_future_unbundling_favor)
+- content models (JSON Schema), [JSON Schema](/en/ch5#json-schema)
+- contention
+ - between transactions, [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
+ - blocking threads, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+ - performance of optimistic concurrency control, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
+ - under two-phase locking, [Performance of two-phase locking](/en/ch8#performance-of-two-phase-locking)
+- context switches, [Latency and Response Time](/en/ch2#id23), [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+- convergence (conflict resolution), [Automatic conflict resolution](/en/ch6#automatic-conflict-resolution)-[CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
+- coordination
+ - avoidance, [Coordination-avoiding data systems](/en/ch13#id454)
+ - cross-datacenter, [The limits of total ordering](/en/ch13#id335)
+ - cross-region, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
+ - cross-shard ordering, [Sharding](/en/ch8#sharding), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner), [Using shared logs](/en/ch10#sec_consistency_smr), [Multi-shard request processing](/en/ch13#id360)
+ - routing requests to shards, [Request Routing](/en/ch7#sec_sharding_routing)
+ - services, [Locking and leader election](/en/ch10#locking-and-leader-election), [Coordination Services](/en/ch10#sec_consistency_coordination)-[Service discovery](/en/ch10#service-discovery)
+- coordinator (in 2PC), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
+ - failure, [Coordinator failure](/en/ch8#coordinator-failure)
+ - in XA transactions, [XA transactions](/en/ch8#xa-transactions)-[Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
+ - recovery, [Recovering from coordinator failure](/en/ch8#recovering-from-coordinator-failure)
+- copy-on-write (B-trees), [B-tree variants](/en/ch4#b-tree-variants), [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
+- CORBA (Common Object Request Broker Architecture), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
+- coronal mass ejection (see solar storm)
+- correctness
+ - auditability, [Trust, but Verify](/en/ch13#sec_future_verification)-[Tools for auditable data systems](/en/ch13#id366)
+ - Byzantine fault tolerance, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
+ - dealing with partial failures, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
+ - in log-based systems, [Enforcing Constraints](/en/ch13#sec_future_constraints)-[Multi-shard request processing](/en/ch13#id360)
+ - of algorithm within system model, [Defining the correctness of an algorithm](/en/ch9#defining-the-correctness-of-an-algorithm)
+ - of derived data, [Designing for auditability](/en/ch13#id365)
+ - of immutable data, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
+ - of personal data, [Responsibility and Accountability](/en/ch14#id371), [Privacy and Use of Data](/en/ch14#id457)
+ - of time, [Problems with different topologies](/en/ch6#problems-with-different-topologies), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)-[Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+ - of transactions, [Consistency](/en/ch8#sec_transactions_acid_consistency), [Aiming for Correctness](/en/ch13#sec_future_correctness), [Maintaining integrity in the face of software bugs](/en/ch13#id455)
+ - timeliness and integrity, [Timeliness and Integrity](/en/ch13#sec_future_integrity)-[Coordination-avoiding data systems](/en/ch13#id454)
+- corruption of data
+ - detecting, [The end-to-end argument](/en/ch13#sec_future_e2e_argument), [Don't just blindly trust what they promise](/en/ch13#id364)-[Tools for auditable data systems](/en/ch13#id366)
+ - due to pathological memory access, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
+ - due to radiation, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
+ - due to split brain, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing)
+ - due to weak transaction isolation, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
+ - integrity as absence of, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
+ - network packets, [Weak forms of lying](/en/ch9#weak-forms-of-lying)
+ - on disks, [Durability](/en/ch8#durability)
+ - preventing using write-ahead logs, [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
+ - recovering from, [Batch Processing](/en/ch11#ch_batch), [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
+- cosine similarity (semantic search), [Vector Embeddings](/en/ch4#id92)
+- Couchbase (database)
+ - document data model, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
+ - durability, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+ - hash sharding, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
+ - join support, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
+ - rebalancing, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
+ - vBuckets (sharding), [Sharding](/en/ch7#ch_sharding)
+- CouchDB (database)
+ - as sync engine, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+ - B-tree storage, [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
+ - conflict resolution, [Manual conflict resolution](/en/ch6#manual-conflict-resolution)
+- coupling (loose and tight), [Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability)
+- covering indexes, [Storing values within the index](/en/ch4#sec_storage_index_heap)
+- CozoDB (database), [Datalog: Recursive Relational Queries](/en/ch3#id62)
+- CPUs
+ - cache coherence and memory barriers, [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
+ - caching and pipelining, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+ - computing the wrong result, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
+ - SIMD instructions, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+- crash-stop and crash-recovery faults, [System Model and Reality](/en/ch9#sec_distributed_system_model)
+- CRDTs (see conflict-free replicated datatypes)
+- CREATE INDEX statement (SQL), [Multi-Column and Secondary Indexes](/en/ch4#sec_storage_index_multicolumn), [Creating an index](/en/ch13#id340)
+- credit rating agencies, [Responsibility and Accountability](/en/ch14#id371)
+- crypto-shredding, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events), [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+- cryptocurrencies, [Summary](/en/ch3#summary)
+- cryptography
+ - defense against attackers, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
+ - end-to-end encryption and authentication, [The end-to-end argument](/en/ch13#sec_future_e2e_argument)
+- CSV (comma-separated values), [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp), [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
+- Curator (ZooKeeper recipes), [Locking and leader election](/en/ch10#locking-and-leader-election), [Allocating work to nodes](/en/ch10#allocating-work-to-nodes)
+- Cypher (query language), [The Cypher Query Language](/en/ch3#id57)
+ - comparison to SPARQL, [The SPARQL query language](/en/ch3#the-sparql-query-language)
+
+### D
+
+- Daft (processing framework)
+ - DataFrames, [DataFrames](/en/ch11#id287)
+ - shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
+- Dagster (workflow scheduler), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows), [Batch Processing](/en/ch11#ch_batch), [Scheduling Workflows](/en/ch11#sec_batch_workflows)
+ - cloud data warehouse integration, [Query languages](/en/ch11#sec_batch_query_lanauges)
+- dashboard (business intelligence), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
+- Dask (processing framework), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- data catalog, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+- data connectors, [Data Warehousing](/en/ch1#sec_introduction_dwh)
+- data contracts, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
+ - change data capture, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
+- data corruption (see corruption of data)
+- data cubes, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
+- data engineering, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)
+- data fabric, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
+- data formats (see encoding)
+- data infrastructure, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
+- data integration, [Data Integration](/en/ch13#sec_future_integration)-[Unifying batch and stream processing](/en/ch13#id338), [Summary](/en/ch13#id367)
+ - batch and stream processing, [Batch and Stream Processing](/en/ch13#sec_future_batch_streaming)-[Unifying batch and stream processing](/en/ch13#id338)
+ - maintaining derived state, [Maintaining derived state](/en/ch13#id446)
+ - reprocessing data, [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
+ - unifying, [Unifying batch and stream processing](/en/ch13#id338)
+ - by unbundling databases, [Unbundling Databases](/en/ch13#sec_future_unbundling)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
+ - comparison to federated databases, [The meta-database of everything](/en/ch13#id341)
+ - combining tools by deriving data, [Combining Specialized Tools by Deriving Data](/en/ch13#id442)-[Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
+ - derived data versus distributed transactions, [Derived data versus distributed transactions](/en/ch13#sec_future_derived_vs_transactions)
+ - limits of total ordering, [The limits of total ordering](/en/ch13#id335)
+ - ordering events to capture causality, [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
+ - reasoning about dataflows, [Reasoning about dataflows](/en/ch13#id443)
+ - need for, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
+ - using batch processing, [Batch Processing](/en/ch11#ch_batch), [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
+- data lake, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
+ - data lakehouse, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Analytics](/en/ch11#sec_batch_olap)
+- data locality (see locality)
+- data mesh, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
+- data minimization, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
+- data models, [Data Models and Query Languages](/en/ch3#ch_datamodels)-[Summary](/en/ch3#summary)
+ - DataFrames and arrays, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+ - graph-like models, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)-[GraphQL](/en/ch3#id63)
+ - Datalog language, [Datalog: Recursive Relational Queries](/en/ch3#id62)
+ - property graphs, [Property Graphs](/en/ch3#id56)
+ - RDF and triple-stores, [Triple-Stores and SPARQL](/en/ch3#id59)-[The SPARQL query language](/en/ch3#the-sparql-query-language)
+ - relational model versus document model, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)-[Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
+ - supporting multiple, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+- data pipelines, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake), [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived), [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
+- data products, [Beyond the data lake](/en/ch1#beyond-the-data-lake)
+- data protection regulations (see GDPR)
+- data residence laws, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed), [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+- data science, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
+- data silo, [Data Warehousing](/en/ch1#sec_introduction_dwh)
+- data systems
+ - correctness, constraints, and integrity, [Aiming for Correctness](/en/ch13#sec_future_correctness)-[Tools for auditable data systems](/en/ch13#id366)
+ - data integration, [Data Integration](/en/ch13#sec_future_integration)-[Unifying batch and stream processing](/en/ch13#id338)
+ - goals for using, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
+ - heterogeneous, keeping in sync, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
+ - maintainability, [Maintainability](/en/ch2#sec_introduction_maintainability)-[Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability)
+ - possible faults in, [Transactions](/en/ch8#ch_transactions)
+ - reliability, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)-[Humans and Reliability](/en/ch2#id31)
+ - hardware faults, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
+ - human errors, [Humans and Reliability](/en/ch2#id31)
+ - importance of, [Humans and Reliability](/en/ch2#id31)
+ - software faults, [Software faults](/en/ch2#software-faults)
+ - scalability, [Scalability](/en/ch2#sec_introduction_scalability)-[Principles for Scalability](/en/ch2#id35)
+ - unbundling databases, [Unbundling Databases](/en/ch13#sec_future_unbundling)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
+ - unreliable clocks, [Unreliable Clocks](/en/ch9#sec_distributed_clocks)-[Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
+- data warehousing, [Data Warehousing](/en/ch1#sec_introduction_dwh), [Glossary](/en/glossary)
+ - cloud-based solutions, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+ - ETL (extract-transform-load), [Data Warehousing](/en/ch1#sec_introduction_dwh), [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
+ - for batch processing, [Batch Processing](/en/ch11#ch_batch)
+ - keeping data systems in sync, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
+ - schema design, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
+ - sharding and clustering, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+ - slowly changing dimension (SCD), [Time-dependence of joins](/en/ch12#sec_stream_join_time)
+- data-intensive applications, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
+- database administrator, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+- database-internal distributed transactions, [Distributed Transactions Across Different Systems](/en/ch8#sec_transactions_xa), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal), [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit)
+- databases
+ - archival storage, [Archival storage](/en/ch5#archival-storage)
+ - comparison of message brokers to, [Message brokers compared to databases](/en/ch12#id297)
+ - dataflow through, [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)
+ - end-to-end argument for, [The end-to-end argument](/en/ch13#sec_future_e2e_argument)-[Applying end-to-end thinking in data systems](/en/ch13#id357)
+ - checking integrity, [The end-to-end argument again](/en/ch13#id456)
+ - relation to event streams, [Databases and Streams](/en/ch12#sec_stream_databases)-[Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+ - (see also changelogs)
+ - API support for change streams, [API support for change streams](/en/ch12#sec_stream_change_api), [Separation of application code and state](/en/ch13#id344)
+ - change data capture, [Change Data Capture](/en/ch12#sec_stream_cdc)-[API support for change streams](/en/ch12#sec_stream_change_api)
+ - event sourcing, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
+ - keeping systems in sync, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
+ - philosophy of immutable events, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)-[Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+ - unbundling, [Unbundling Databases](/en/ch13#sec_future_unbundling)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
+ - composing data storage technologies, [Composing Data Storage Technologies](/en/ch13#id447)-[Unbundled versus integrated systems](/en/ch13#id448)
+ - designing applications around dataflow, [Designing Applications Around Dataflow](/en/ch13#sec_future_dataflow)-[Stream processors and services](/en/ch13#id345)
+ - observing derived state, [Observing Derived State](/en/ch13#sec_future_observing)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
+- datacenters
+ - failures of, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
+ - geographically distributed (see regions (geographic distribution))
+ - multitenancy and shared resources, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+ - network architecture, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
+ - network faults, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
+- dataflow, [Modes of Dataflow](/en/ch5#sec_encoding_dataflow)-[Distributed actor frameworks](/en/ch5#distributed-actor-frameworks), [Designing Applications Around Dataflow](/en/ch13#sec_future_dataflow)-[Stream processors and services](/en/ch13#id345)
+ - correctness of dataflow systems, [Correctness of dataflow systems](/en/ch13#id453)
+ - dataflow engines, [Dataflow Engines](/en/ch11#sec_batch_dataflow)
+ - comparison to stream processing, [Processing Streams](/en/ch12#sec_stream_processing)
+ - DataFrames, [DataFrames](/en/ch11#id287)
+ - support in batch processing frameworks, [Batch Processing](/en/ch11#ch_batch)
+ - event-driven, [Event-Driven Architectures](/en/ch5#sec_encoding_dataflow_msg)-[Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
+ - reasoning about, [Reasoning about dataflows](/en/ch13#id443)
+ - through databases, [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)
+ - through services, [Dataflow Through Services: REST and RPC](/en/ch5#sec_encoding_dataflow_rpc)-[Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
+ - workflow engines (see workflow engines)
+- DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+ - implementation, [DataFrames](/en/ch11#id287)
+ - in batch processing, [DataFrames](/en/ch11#id287)
+ - in notebooks, [Machine Learning](/en/ch11#id290)
+ - support in batch processing frameworks, [Batch Processing](/en/ch11#ch_batch)
+- DataFusion (query engine), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+- Datalog (query language), [Datalog: Recursive Relational Queries](/en/ch3#id62)
+- Datastream (change data capture), [API support for change streams](/en/ch12#sec_stream_change_api)
+- datatypes
+ - binary strings in XML and JSON, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
+ - conflict-free, [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
+ - in Avro encodings, [Avro](/en/ch5#sec_encoding_avro)
+ - in Protocol Buffers, [Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
+ - numbers in XML and JSON, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
+- Datensparsamkeit, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
+- Datomic (database)
+ - B-tree storage, [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
+ - data model, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph), [Triple-Stores and SPARQL](/en/ch3#id59)
+ - Datalog query language, [Datalog: Recursive Relational Queries](/en/ch3#id62)
+ - excision (deleting data), [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+ - languages for transactions, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
+ - serial execution of transactions, [Actual Serial Execution](/en/ch8#sec_transactions_serial)
+- Daylight Saving Time (DST), [Time-of-day clocks](/en/ch9#time-of-day-clocks)
+- Db2 (database)
+ - change data capture, [Implementing change data capture](/en/ch12#id307)
+- DBA (database administrator), [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+- deadlocks, [Explicit locking](/en/ch8#explicit-locking)
+ - detection, in distributed transactions, [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
+ - in two-phase locking (2PL), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+- Debezium (change data capture), [Implementing change data capture](/en/ch12#id307)
+ - Cassandra, [API support for change streams](/en/ch12#sec_stream_change_api)
+ - for data integration, [Unbundled versus integrated systems](/en/ch13#id448)
+- declarative languages, [Data Models and Query Languages](/en/ch3#ch_datamodels), [Glossary](/en/glossary)
+ - and sync engines, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+ - Datalog, [Datalog: Recursive Relational Queries](/en/ch3#id62)
+ - in document databases, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
+ - recursive SQL queries, [Graph Queries in SQL](/en/ch3#id58)
+ - SPARQL, [The SPARQL query language](/en/ch3#the-sparql-query-language)
+- DeepSeek
+ - 3FS (see 3FS)
+- delays
+ - bounded network delays, [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
+ - bounded process pauses, [Response time guarantees](/en/ch9#sec_distributed_clocks_realtime)
+ - unbounded network delays, [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing)
+ - unbounded process pauses, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+- deleting data, [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+ - in LSM storage, [Disk space usage](/en/ch4#disk-space-usage)
+ - legal basis, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
+- Delta Lake (table format), [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+ - sharding and clustering, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+- demilitarized zone (networking), [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
+- denormalization (data representation), [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)-[Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many), [Glossary](/en/glossary)
+ - in derived data systems, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
+ - in event sourcing/CQRS, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - in social network case study, [Denormalization in the social networking case study](/en/ch3#denormalization-in-the-social-networking-case-study)
+ - materialized views, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
+ - updating derived data, [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object), [The need for multi-object transactions](/en/ch8#sec_transactions_need), [Combining Specialized Tools by Deriving Data](/en/ch13#id442)
+ - versus normalization, [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
+- derived data, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived), [Stream Processing](/en/ch12#ch_stream), [Glossary](/en/glossary)
+ - batch processing, [Batch Processing](/en/ch11#ch_batch)
+ - event sourcing and CQRS, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - from change data capture, [Implementing change data capture](/en/ch12#id307)
+ - maintaining derived state through logs, [Databases and Streams](/en/ch12#sec_stream_databases)-[API support for change streams](/en/ch12#sec_stream_change_api), [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)-[Concurrency control](/en/ch12#sec_stream_concurrency)
+ - observing, by subscribing to streams, [End-to-end event streams](/en/ch13#id349)
+ - outputs of batch and stream processing, [Batch and Stream Processing](/en/ch13#sec_future_batch_streaming)
+ - through application code, [Application code as a derivation function](/en/ch13#sec_future_dataflow_derivation)
+ - versus distributed transactions, [Derived data versus distributed transactions](/en/ch13#sec_future_derived_vs_transactions)
+- design patterns, [Simplicity: Managing Complexity](/en/ch2#id38)
+- deterministic operations, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs), [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure), [Glossary](/en/glossary)
+ - and idempotence, [Idempotence](/en/ch12#sec_stream_idempotence), [Reasoning about dataflows](/en/ch13#id443)
+ - computing derived data, [Maintaining derived state](/en/ch13#id446), [Correctness of dataflow systems](/en/ch13#id453), [Designing for auditability](/en/ch13#id365)
+ - in event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - in state machine replication, [Using shared logs](/en/ch10#sec_consistency_smr), [Databases and Streams](/en/ch12#sec_stream_databases)
+ - in statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication)
+ - in testing, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+ - joins, [Time-dependence of joins](/en/ch12#sec_stream_join_time)
+ - making code deterministic, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+ - overview, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+- deterministic simulation testing (DST), [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+- DevOps, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+- dimension tables, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
+- dimensional modeling (see star schemas)
+- directed acyclic graphs (DAG)
+ - workflows, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
+ - (see also workflow engines)
+- dirty reads (transaction isolation), [No dirty reads](/en/ch8#no-dirty-reads)
+- dirty writes (transaction isolation), [No dirty writes](/en/ch8#sec_transactions_dirty_write)
+- disaggregation
+ - of storage and compute, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
+- Discord (group chat)
+ - GraphQL example, [GraphQL](/en/ch3#id63)
+- discrimination, [Bias and Discrimination](/en/ch14#id370)
+- disks (see hard disks)
+- distributed actor frameworks, [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
+- distributed filesystems, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+ - comparison to object storage, [Object Stores](/en/ch11#id277)
+ - use by Flink, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
+- distributed ledgers, [Summary](/en/ch3#summary)
+- distributed systems, [The Trouble with Distributed Systems](/en/ch9#ch_distributed)-[Summary](/en/ch9#summary), [Glossary](/en/glossary)
+ - Byzantine faults, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)-[Weak forms of lying](/en/ch9#weak-forms-of-lying)
+ - detecting network faults, [Detecting Faults](/en/ch9#id307)
+ - faults and partial failures, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
+ - formalization of consensus, [Single-value consensus](/en/ch10#single-value-consensus)
+ - impossibility results, [The CAP theorem](/en/ch10#the-cap-theorem), [Consensus](/en/ch10#sec_consistency_consensus)
+ - issues with failover, [Leader failure: Failover](/en/ch6#leader-failure-failover)
+ - multi-region (see regions (geographic distribution))
+ - network problems, [Unreliable Networks](/en/ch9#sec_distributed_networks)-[Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
+ - problems with, [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems)
+ - quorums, relying on, [The Majority Rules](/en/ch9#sec_distributed_majority)
+ - reasons for using, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed), [Replication](/en/ch6#ch_replication)
+ - synchronized clocks, relying on, [Relying on Synchronized Clocks](/en/ch9#sec_distributed_clocks_relying)-[Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+ - system models, [System Model and Reality](/en/ch9#sec_distributed_system_model)-[Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+ - use of clocks and time, [Unreliable Clocks](/en/ch9#sec_distributed_clocks)
+- distributed transactions (see transactions)
+- Django (web framework), [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
+- DMZ (demilitarized zone), [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
+- DNS (Domain Name System), [Request Routing](/en/ch7#sec_sharding_routing), [Service discovery](/en/ch10#service-discovery)
+ - for load balancing, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
+- Docker (container manager), [Separation of application code and state](/en/ch13#id344)
+- document data model, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)-[Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
+ - comparison to relational model, [When to Use Which Model](/en/ch3#sec_datamodels_document_summary)-[Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
+ - multi-object transactions, need for, [The need for multi-object transactions](/en/ch8#sec_transactions_need)
+ - sharded secondary indexes, [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)
+ - versus relational model
+ - convergence of models, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
+ - data locality, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
+- document-partitioned indexes (see local secondary indexes)
+- domain-driven design (DDD), [Simplicity: Managing Complexity](/en/ch2#id38), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+- dotted version vectors, [Version vectors](/en/ch6#version-vectors)
+- double-entry bookkeeping, [Summary](/en/ch3#summary)
+- DRBD (Distributed Replicated Block Device), [Single-Leader Replication](/en/ch6#sec_replication_leader)
+- drift (clocks), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
+- Druid (database), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Column-Oriented Storage](/en/ch4#sec_storage_column), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
+ - handling writes, [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
+ - pre-aggregation, [Analytics](/en/ch11#sec_batch_olap)
+ - serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
+- Dryad (dataflow engine), [Dataflow Engines](/en/ch11#sec_batch_dataflow)
+- dual writes, problems with, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
+- DuckDB (database), [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems), [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+ - column-oriented storage, [Column-Oriented Storage](/en/ch4#sec_storage_column)
+ - use for ETL, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
+- duplicates, suppression of, [Duplicate suppression](/en/ch13#id354)
+ - (see also idempotence)
+ - using a unique ID, [Uniquely identifying requests](/en/ch13#id355), [Multi-shard request processing](/en/ch13#id360)
+- durability (transactions), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal), [Durability](/en/ch8#durability), [Glossary](/en/glossary)
+- durable execution, [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
+ - reliance on determinism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+ - Restate (see Restate (workflow engine))
+ - Temporal (see Temporal (workflow engine))
+- durable functions (see workflow engines)
+- duration (time), [Unreliable Clocks](/en/ch9#sec_distributed_clocks)
+ - measurement with monotonic clocks, [Monotonic clocks](/en/ch9#monotonic-clocks)
+- dynamically typed languages
+ - analogy to schema-on-read, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
+- Dynamo (database), [Leaderless Replication](/en/ch6#sec_replication_leaderless)
+- Dynamo-style databases (see leaderless replication)
+- DynamoDB (database)
+ - auto-scaling, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
+ - hash-range sharding, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+ - leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+ - sharded secondary indexes, [Global Secondary Indexes](/en/ch7#id167)
+
+### E
+
+- EBS (virtual block device), [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
+ - compared to object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+- ECC (see error-correcting codes)
+- EDB Postgres Distributed (database), [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
+- edges (in graphs), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+ - property graph model, [Property Graphs](/en/ch3#id56)
+- edit distance (full-text search), [Full-Text Search](/en/ch4#sec_storage_full_text)
+- effectively-once semantics, [Fault Tolerance](/en/ch12#sec_stream_fault_tolerance), [Exactly-once execution of an operation](/en/ch13#id353)
+ - (see also exactly-once semantics)
+ - preservation of integrity, [Correctness of dataflow systems](/en/ch13#id453)
+- Elastic Compute Cloud (EC2)
+ - spot instances, [Handling Faults](/en/ch11#id281)
+- elasticity, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
+ - cloud data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Query languages](/en/ch11#sec_batch_query_lanauges)
+- Elasticsearch (search server)
+ - local secondary indexes, [Local Secondary Indexes](/en/ch7#id166)
+ - percolator (stream search), [Search on streams](/en/ch12#id320)
+ - serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
+ - shard rebalancing, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
+ - use of Lucene, [Full-Text Search](/en/ch4#sec_storage_full_text)
+- Elm (programming language), [End-to-end event streams](/en/ch13#id349)
+- ELT (extract-load-transform), [Data Warehousing](/en/ch1#sec_introduction_dwh)
+ - relation to batch processing, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
+- embarrassingly parallel (algorithms)
+ - ETL (see ETL (extract-transform-load))
+ - MapReduce, [MapReduce](/en/ch11#sec_batch_mapreduce)
+ - (see also MapReduce)
+- embedded storage engines, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+- embedding (vector), [Vector Embeddings](/en/ch4#id92)
+- encodings (data formats), [Encoding and Evolution](/en/ch5#ch_encoding)-[The Merits of Schemas](/en/ch5#sec_encoding_schemas)
+ - Avro, [Avro](/en/ch5#sec_encoding_avro)-[Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
+ - binary variants of JSON and XML, [Binary encoding](/en/ch5#binary-encoding)
+ - compatibility, [Encoding and Evolution](/en/ch5#ch_encoding)
+ - calling services, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
+ - using databases, [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)-[Archival storage](/en/ch5#archival-storage)
+ - defined, [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
+ - JSON, XML, and CSV, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
+ - language-specific formats, [Language-Specific Formats](/en/ch5#id96)
+ - merits of schemas, [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
+ - Protocol Buffers, [Protocol Buffers](/en/ch5#sec_encoding_protobuf)-[Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
+ - representations of data, [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
+- end-to-end argument, [The end-to-end argument](/en/ch13#sec_future_e2e_argument)-[Applying end-to-end thinking in data systems](/en/ch13#id357)
+ - checking integrity, [The end-to-end argument again](/en/ch13#id456)
+ - publish/subscribe streams, [End-to-end event streams](/en/ch13#id349)
+- enrichment (stream), [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins)
+- Enterprise JavaBeans (EJB), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
+- enterprise software, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
+- entities (see vertices)
+- ephemeral storage, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
+- epoch (consensus algorithms), [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
+- epoch (Unix timestamps), [Time-of-day clocks](/en/ch9#time-of-day-clocks)
+- erasure coding (error correction), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- error handling
+ - for network faults, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
+ - in transactions, [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
+- error-correcting codes, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- Esper (CEP engine), [Complex event processing](/en/ch12#id317)
+- essential complexity, [Simplicity: Managing Complexity](/en/ch2#id38)
+- etcd (coordination service), [Coordination Services](/en/ch10#sec_consistency_coordination)-[Service discovery](/en/ch10#service-discovery)
+ - generating fencing tokens, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens), [Coordination Services](/en/ch10#sec_consistency_coordination)
+ - linearizable operations, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable), [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
+ - locks and leader election, [Locking and leader election](/en/ch10#locking-and-leader-election)
+ - use for service discovery, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery), [Service discovery](/en/ch10#service-discovery)
+ - use for shard assignment, [Request Routing](/en/ch7#sec_sharding_routing)
+ - use of Raft algorithm, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+- Ethereum (blockchain), [Tools for auditable data systems](/en/ch13#id366)
+- Ethernet (networks), [Cloud Computing Versus Supercomputing](/en/ch1#id17), [Unreliable Networks](/en/ch9#sec_distributed_networks), [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
+ - packet checksums, [Weak forms of lying](/en/ch9#weak-forms-of-lying), [The end-to-end argument](/en/ch13#sec_future_e2e_argument)
+- ethics, [Doing the Right Thing](/en/ch14)-[Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
+ - code of ethics and professional practice, [Doing the Right Thing](/en/ch14)
+ - legislation and self-regulation, [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
+ - predictive analytics, [Predictive Analytics](/en/ch14#id369)-[Feedback Loops](/en/ch14#id372)
+ - amplifying bias, [Bias and Discrimination](/en/ch14#id370)
+ - feedback loops, [Feedback Loops](/en/ch14#id372)
+ - privacy and tracking, [Privacy and Tracking](/en/ch14#id373)-[Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
+ - consent and freedom of choice, [Consent and Freedom of Choice](/en/ch14#id375)
+ - data as assets and power, [Data as Assets and Power](/en/ch14#id376)
+ - meaning of privacy, [Privacy and Use of Data](/en/ch14#id457)
+ - surveillance, [Surveillance](/en/ch14#id374)
+ - respect, dignity, and agency, [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
+ - unintended consequences, [Doing the Right Thing](/en/ch14), [Feedback Loops](/en/ch14#id372)
+- ETL (extract-transform-load), [Data Warehousing](/en/ch1#sec_introduction_dwh), [Keeping Systems in Sync](/en/ch12#sec_stream_sync), [Glossary](/en/glossary)
+ - relation to batch processing, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
+ - using batch processing, [Batch Processing](/en/ch11#ch_batch)
+- Euclidean distance (semantic search), [Vector Embeddings](/en/ch4#id92)
+- European Union
+ - AI Act (see AI Act)
+ - GDPR (see GDPR)
+- event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - comparison to change data capture, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
+ - immutability and auditability, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability), [Designing for auditability](/en/ch13#id365)
+ - large, reliable data systems, [Uniquely identifying requests](/en/ch13#id355), [Correctness of dataflow systems](/en/ch13#id453)
+ - reliance on determinism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+- event streams (see streams)
+- event-driven architecture, [Event-Driven Architectures](/en/ch5#sec_encoding_dataflow_msg)-[Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
+ - distributed actor frameworks, [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
+- events, [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
+ - deciding on total order of, [The limits of total ordering](/en/ch13#id335)
+ - deriving views from event log, [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
+ - event time versus processing time, [Event time versus processing time](/en/ch12#id322), [Microbatching and checkpointing](/en/ch12#id329), [Unifying batch and stream processing](/en/ch13#id338)
+ - immutable, advantages of, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros), [Designing for auditability](/en/ch13#id365)
+ - ordering to capture causality, [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
+ - reads as, [Reads are events too](/en/ch13#sec_future_read_events)
+ - stragglers, [Handling straggler events](/en/ch12#id323)
+ - timestamp of, in stream processing, [Whose clock are you using, anyway?](/en/ch12#id438)
+- EventSource (browser API), [Pushing state changes to clients](/en/ch13#id348)
+- EventStoreDB (database), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+- eventual consistency, [Replication](/en/ch6#ch_replication), [Problems with Replication Lag](/en/ch6#sec_replication_lag), [Safety and liveness](/en/ch9#sec_distributed_safety_liveness)
+ - (see also conflicts)
+ - and perpetual inconsistency, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
+ - strong eventual consistency, [Automatic conflict resolution](/en/ch6#automatic-conflict-resolution)
+- evidence
+ - data used as, [Humans and Reliability](/en/ch2#id31)
+- evolvability, [Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability), [Encoding and Evolution](/en/ch5#ch_encoding)
+ - calling services, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
+ - event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - graph-structured data, [Property Graphs](/en/ch3#id56)
+ - of databases, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility), [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)-[Archival storage](/en/ch5#archival-storage), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views), [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
+ - reprocessing data, [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing), [Unifying batch and stream processing](/en/ch13#id338)
+ - schema evolution in Avro, [The writer's schema and the reader's schema](/en/ch5#the-writers-schema-and-the-readers-schema)
+ - schema evolution in Protocol Buffers, [Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
+ - schema-on-read, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility), [Encoding and Evolution](/en/ch5#ch_encoding), [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
+- exactly-once semantics, [Exactly-once message processing](/en/ch8#sec_transactions_exactly_once), [Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited), [Fault Tolerance](/en/ch12#sec_stream_fault_tolerance), [Exactly-once execution of an operation](/en/ch13#id353)
+ - parity with batch processors, [Unifying batch and stream processing](/en/ch13#id338)
+ - preservation of integrity, [Correctness of dataflow systems](/en/ch13#id453)
+ - using durable execution, [Durable execution](/en/ch5#durable-execution)
+- exclusive mode (locks), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+- exponential backoff, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
+- ext4 (file system), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- eXtended Architecture transactions (see XA transactions)
+- extract-transform-load (see ETL)
+
+### F
+
+- Facebook
+ - Faiss (vector index), [Vector Embeddings](/en/ch4#id92)
+ - React (user interface library), [End-to-end event streams](/en/ch13#id349)
+ - social graphs, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+- facts
+ - fact table (star schema), [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
+ - in Datalog, [Datalog: Recursive Relational Queries](/en/ch3#id62)
+ - in event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+- fail-slow faults, [System Model and Reality](/en/ch9#sec_distributed_system_model)
+- fail-stop model, [System Model and Reality](/en/ch9#sec_distributed_system_model)
+- failover, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Glossary](/en/glossary)
+ - (see also leader-based replication)
+ - in leaderless replication, absence of, [Writing to the Database When a Node Is Down](/en/ch6#id287)
+ - leader election, [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing), [Consensus](/en/ch10#sec_consistency_consensus), [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
+ - potential problems, [Leader failure: Failover](/en/ch6#leader-failure-failover)
+- failures
+ - amplification by distributed transactions, [Maintaining derived state](/en/ch13#id446)
+ - failure detection, [Detecting Faults](/en/ch9#id307)
+ - automatic rebalancing causing cascading failures, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
+ - timeouts and unbounded delays, [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing), [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+ - using a coordination service, [Coordination Services](/en/ch10#sec_consistency_coordination)
+ - faults versus, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)
+ - partial failures, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure), [Summary](/en/ch9#summary)
+- Faiss (vector index), [Vector Embeddings](/en/ch4#id92)
+- false positive (Bloom filters), [Bloom filters](/en/ch4#bloom-filters)
+- fan-out (messaging systems), [Materializing and Updating Timelines](/en/ch2#sec_introduction_materializing), [Multiple consumers](/en/ch12#id298)
+- fault injection, [Fault Tolerance](/en/ch2#id27), [Network Faults in Practice](/en/ch9#sec_distributed_network_faults), [Fault injection](/en/ch9#sec_fault_injection)
+- fault isolation, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+- fault tolerance, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)-[Humans and Reliability](/en/ch2#id31), [Glossary](/en/glossary)
+ - formalization in consensus, [Single-value consensus](/en/ch10#single-value-consensus)
+ - human fault tolerance, [Batch Processing](/en/ch11#ch_batch)
+ - in batch processing, [Handling Faults](/en/ch11#id281)
+ - in log-based systems, [Applying end-to-end thinking in data systems](/en/ch13#id357), [Timeliness and Integrity](/en/ch13#sec_future_integrity)-[Correctness of dataflow systems](/en/ch13#id453)
+ - in stream processing, [Fault Tolerance](/en/ch12#sec_stream_fault_tolerance)-[Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
+ - atomic commit, [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit)
+ - idempotence, [Idempotence](/en/ch12#sec_stream_idempotence)
+ - maintaining derived state, [Maintaining derived state](/en/ch13#id446)
+ - microbatching and checkpointing, [Microbatching and checkpointing](/en/ch12#id329)
+ - rebuilding state after a failure, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
+ - of distributed transactions, [XA transactions](/en/ch8#xa-transactions)-[Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
+ - of leader-based and leaderless replication, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
+ - transaction atomicity, [Atomicity](/en/ch8#sec_transactions_acid_atomicity), [Distributed Transactions](/en/ch8#sec_transactions_distributed)-[Exactly-once message processing](/en/ch8#sec_transactions_exactly_once)
+- faults
+ - Byzantine faults, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)-[Weak forms of lying](/en/ch9#weak-forms-of-lying)
+ - failures versus, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)
+ - handled by transactions, [Transactions](/en/ch8#ch_transactions)
+ - handling in supercomputers and cloud computing, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
+ - hardware, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
+ - in distributed systems, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
+ - introducing deliberately (see fault injection)
+ - network faults, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)-[Detecting Faults](/en/ch9#id307)
+ - asymmetric faults, [The Majority Rules](/en/ch9#sec_distributed_majority)
+ - detecting, [Detecting Faults](/en/ch9#id307)
+ - tolerance of, in multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
+ - software faults, [Software faults](/en/ch2#software-faults)
+ - tolerating (see fault tolerance)
+- feature engineering (machine learning), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
+- federated databases, [The meta-database of everything](/en/ch13#id341)
+- Feldera (database)
+ - incremental view maintenance, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+- fence (CPU instruction), [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
+- fencing (preventing split brain), [Leader failure: Failover](/en/ch6#leader-failure-failover), [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)-[Fencing with multiple replicas](/en/ch9#fencing-with-multiple-replicas)
+ - generating fencing tokens, [Using shared logs](/en/ch10#sec_consistency_smr), [Coordination Services](/en/ch10#sec_consistency_coordination)
+ - properties of fencing tokens, [Defining the correctness of an algorithm](/en/ch9#defining-the-correctness-of-an-algorithm)
+ - stream processors writing to databases, [Idempotence](/en/ch12#sec_stream_idempotence), [Exactly-once execution of an operation](/en/ch13#id353)
+- fetch-and-add
+ - relation to consensus, [Fetch-and-add as consensus](/en/ch10#fetch-and-add-as-consensus)
+- Fibre Channel (networks), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- field tags (Protocol Buffers), [Protocol Buffers](/en/ch5#sec_encoding_protobuf)-[Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
+- Figma (graphics software), [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
+- filesystem in userspace (FUSE), [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+ - on object storage, [Object Stores](/en/ch11#id277)
+- financial data
+ - accounting ledgers, [Summary](/en/ch3#summary)
+ - immutability, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
+ - time series data, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- Fivetran, [Data Warehousing](/en/ch1#sec_introduction_dwh)
+- FizzBee (specification language), [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
+- flat index (vector index), [Vector Embeddings](/en/ch4#id92)
+- FlatBuffers (data format), [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
+- Flink (processing framework), [Batch Processing](/en/ch11#ch_batch), [Dataflow Engines](/en/ch11#sec_batch_dataflow)
+ - cost efficiency, [Query languages](/en/ch11#sec_batch_query_lanauges)
+ - DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes), [DataFrames](/en/ch11#id287)
+ - fault tolerance, [Handling Faults](/en/ch11#id281), [Microbatching and checkpointing](/en/ch12#id329), [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
+ - FlinkML, [Machine Learning](/en/ch11#id290)
+ - for data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+ - high availability using ZooKeeper, [Coordination Services](/en/ch10#sec_consistency_coordination)
+ - integration of batch and stream processing, [Unifying batch and stream processing](/en/ch13#id338)
+ - query optimizer, [Query languages](/en/ch11#sec_batch_query_lanauges)
+ - shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
+ - stream processing, [Stream analytics](/en/ch12#id318)
+ - streaming SQL support, [Complex event processing](/en/ch12#id317)
+- flow control, [The Limitations of TCP](/en/ch9#sec_distributed_tcp), [Messaging Systems](/en/ch12#sec_stream_messaging), [Glossary](/en/glossary)
+- FLP result (on consensus), [Consensus](/en/ch10#sec_consistency_consensus)
+- Flyte (workflow scheduler), [Machine Learning](/en/ch11#id290)
+- followers, [Single-Leader Replication](/en/ch6#sec_replication_leader), [Glossary](/en/glossary)
+ - (see also leader-based replication)
+- formal methods, [Formal Methods and Randomized Testing](/en/ch9#sec_distributed_formal)-[Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+- forward compatibility, [Encoding and Evolution](/en/ch5#ch_encoding)
+- forward decay (algorithm), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
+- Fossil (version control system), [Concurrency control](/en/ch12#sec_stream_concurrency)
+ - shunning (deleting data), [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+- FoundationDB (database)
+ - consistency model, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+ - deterministic simulation testing, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+ - key-range sharding, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
+ - process-per-core model, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
+ - serializable transactions, [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi), [Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
+ - transactions, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
+- fractional indexing, [When to Use Which Model](/en/ch3#sec_datamodels_document_summary)
+- fragmentation (of B-trees), [Disk space usage](/en/ch4#disk-space-usage)
+- frame (computer graphics), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+- frontend (web development), [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
+- FrostDB (database)
+ - deterministic simulation testing (DST), [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+- fsync (system call), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal), [Durability](/en/ch8#durability)
+- full-text search, [Full-Text Search](/en/ch4#sec_storage_full_text), [Glossary](/en/glossary)
+ - and fuzzy indexes, [Full-Text Search](/en/ch4#sec_storage_full_text)
+ - Lucene storage engine, [Full-Text Search](/en/ch4#sec_storage_full_text)
+ - sharded indexes, [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)
+- Function as a Service (FaaS), [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
+- functional programming
+ - inspiration for MapReduce, [MapReduce](/en/ch11#sec_batch_mapreduce)
+- functional requirements, [Defining Nonfunctional Requirements](/en/ch2#ch_nonfunctional)
+- FUSE (see filesystem in userspace (FUSE))
+- fuzzing, [Formal Methods and Randomized Testing](/en/ch9#sec_distributed_formal)
+- fuzzy search (see similarity search)
+
+### G
+
+- Gallina (specification language), [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
+- game development, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+- garbage collection
+ - immutability and, [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+ - process pauses for, [Latency and Response Time](/en/ch2#id23), [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)-[Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact), [The Majority Rules](/en/ch9#sec_distributed_majority)
+ - (see also process pauses)
+- gas stations, algorithmic pricing, [Feedback Loops](/en/ch14#id372)
+- GDPR (regulation), [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+ - consent, [Consent and Freedom of Choice](/en/ch14#id375)
+ - data minimization, [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
+ - legitimate interest, [Consent and Freedom of Choice](/en/ch14#id375)
+ - right of access, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+ - right to erasure, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Disk space usage](/en/ch4#disk-space-usage), [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+- GenBank (genome database), [Summary](/en/ch3#summary)
+- General Data Protection Regulation (see GDPR (regulation))
+- genome analysis, [Summary](/en/ch3#summary)
+- geographic distribution (see regions (geographic distribution))
+- geospatial indexes, [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
+- Git (version control system), [Concurrency control](/en/ch12#sec_stream_concurrency)
+ - local-first software, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
+ - merge conflicts, [Manual conflict resolution](/en/ch6#manual-conflict-resolution)
+- GitHub, postmortems, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Mapping system models to the real world](/en/ch9#mapping-system-models-to-the-real-world)
+- global secondary indexes, [Global Secondary Indexes](/en/ch7#id167), [Summary](/en/ch7#summary)
+- globally unique identifiers (see UUIDs)
+- GlusterFS (distributed filesystem), [Batch Processing](/en/ch11#ch_batch), [Distributed Filesystems](/en/ch11#sec_batch_dfs), [Object Stores](/en/ch11#id277)
+- GNU Coreutils (Linux), [Sorting Versus In-memory Aggregation](/en/ch11#id275)
+- Go (programming language)
+ - garbage collection, [Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
+- GoldenGate (change data capture), [Implementing change data capture](/en/ch12#id307)
+ - (see also Oracle)
+- Google
+ - BigQuery (see BigQuery (database))
+ - Bigtable (see Bigtable (database))
+ - Chubby (lock service), [Coordination Services](/en/ch10#sec_consistency_coordination)
+ - Cloud Storage (object storage), [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Object Stores](/en/ch11#id277)
+ - request preconditions, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
+ - Compute Engine
+ - preemptible instances, [Handling Faults](/en/ch11#id281)
+ - Dataflow (stream processor), [Stream analytics](/en/ch12#id318), [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit), [Unifying batch and stream processing](/en/ch13#id338)
+ - (see also Beam)
+ - data warehouse integration, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+ - shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
+ - Datastream (change data capture), [API support for change streams](/en/ch12#sec_stream_change_api)
+ - Docs (collaborative editor), [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps), [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
+ - operational transformation, [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
+ - Dremel (query engine), [Column-Oriented Storage](/en/ch4#sec_storage_column)
+ - Firestore (database), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+ - MapReduce (batch processing), [Batch Processing](/en/ch11#ch_batch)
+ - (see also MapReduce)
+ - Percolator (transaction system), [Implementing a linearizable ID generator](/en/ch10#implementing-a-linearizable-id-generator)
+ - persistent disks (cloud service), [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
+ - Pub/Sub (messaging), [Message brokers](/en/ch5#message-brokers), [Message brokers compared to databases](/en/ch12#id297), [Using logs for message storage](/en/ch12#id300)
+ - response time study, [Average, Median, and Percentiles](/en/ch2#id24)
+ - Sheets (collaborative spreadsheet), [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps), [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
+ - Spanner (see Spanner (database))
+ - TrueTime (clock API), [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval)
+- gossip protocol, [Request Routing](/en/ch7#sec_sharding_routing)
+- governance, [Beyond the data lake](/en/ch1#beyond-the-data-lake)
+- government use of data, [Data as Assets and Power](/en/ch14#id376)
+- GPS (Global Positioning System)
+ - use for clock synchronization, [Unreliable Clocks](/en/ch9#sec_distributed_clocks), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy), [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval), [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+- GPT (language model), [Vector Embeddings](/en/ch4#id92)
+- GPU (graphics processing unit), [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
+- gradual rollout (see rolling upgrades)
+- GraphQL (query language), [GraphQL](/en/ch3#id63)
+ - validation, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
+- graphs, [Glossary](/en/glossary)
+ - as data models, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)-[GraphQL](/en/ch3#id63)
+ - property graphs, [Property Graphs](/en/ch3#id56)
+ - RDF and triple-stores, [Triple-Stores and SPARQL](/en/ch3#id59)-[The SPARQL query language](/en/ch3#the-sparql-query-language)
+ - DAGs (see directed acyclic graphs)
+ - processing and analysis, [Machine Learning](/en/ch11#id290)
+ - query languages
+ - Cypher, [The Cypher Query Language](/en/ch3#id57)
+ - Datalog, [Datalog: Recursive Relational Queries](/en/ch3#id62)
+ - GraphQL, [GraphQL](/en/ch3#id63)
+ - Gremlin, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+ - recursive SQL queries, [Graph Queries in SQL](/en/ch3#id58)
+ - SPARQL, [The SPARQL query language](/en/ch3#the-sparql-query-language)
+ - traversal, [Property Graphs](/en/ch3#id56)
+- gray failures, [System Model and Reality](/en/ch9#sec_distributed_system_model)
+ - in leaderless replication, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
+- Gremlin (graph query language), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+- grep (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis)
+- gRPC (service calls), [Microservices and Serverless](/en/ch1#sec_introduction_microservices), [Web services](/en/ch5#sec_web_services)
+ - forward and backward compatibility, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
+- GUIDs (see UUIDs)
+
+### H
+
+- Hadoop (data infrastructure)
+ - comparison to distributed databases, [Batch Processing](/en/ch11#ch_batch)
+ - MapReduce (see MapReduce)
+ - NodeManager, [Distributed Job Orchestration](/en/ch11#id278)
+ - YARN (see YARN (job scheduler))
+- HANA (see SAP HANA (database))
+- happens-before relation, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)
+- hard disks
+ - access patterns, [Sequential versus random writes](/en/ch4#sidebar_sequential)
+ - detecting corruption, [The end-to-end argument](/en/ch13#sec_future_e2e_argument), [Don't just blindly trust what they promise](/en/ch13#id364)
+ - faults in, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults), [Durability](/en/ch8#durability)
+ - sequential vs. random writes, [Sequential versus random writes](/en/ch4#sidebar_sequential)
+ - sequential write throughput, [Disk space usage](/en/ch12#sec_stream_disk_usage)
+- hardware faults, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
+- hash function
+ - in Bloom filters, [Bloom filters](/en/ch4#bloom-filters)
+- hash join
+ - in stream processing, [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins)
+- hash sharding, [Sharding by Hash of Key](/en/ch7#sec_sharding_hash)-[Consistent hashing](/en/ch7#sec_sharding_consistent_hashing), [Summary](/en/ch7#summary)
+ - consistent hashing, [Consistent hashing](/en/ch7#sec_sharding_consistent_hashing)
+ - problems with hash mod N, [Hash modulo number of nodes](/en/ch7#hash-modulo-number-of-nodes)
+ - range queries, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+ - suitable hash functions, [Sharding by Hash of Key](/en/ch7#sec_sharding_hash)
+ - with fixed number of shards, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
+- hash tables, [Log-Structured Storage](/en/ch4#sec_storage_log_structured)
+- Hazelcast (in-memory data grid)
+ - FencedLock, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
+ - Flake ID Generator, [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
+- HBase (database)
+ - bug due to lack of fencing, [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing)
+ - key-range sharding, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
+ - log-structured storage, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+ - regions (sharding), [Sharding](/en/ch7#ch_sharding)
+ - request routing, [Request Routing](/en/ch7#sec_sharding_routing)
+ - size-tiered compaction, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+ - wide-column data model, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality), [Column Compression](/en/ch4#sec_storage_column_compression)
+- HDFS (Hadoop Distributed File System), [Batch Processing](/en/ch11#ch_batch), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+ - (see also distributed filesystems)
+ - checking data integrity, [Don't just blindly trust what they promise](/en/ch13#id364)
+ - DataNode, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+ - NameNode, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+ - use in MapReduce, [MapReduce](/en/ch11#sec_batch_mapreduce)
+ - workflow example, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
+- HdrHistogram (numerical library), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
+- head (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis), [Distributed Job Orchestration](/en/ch11#id278)
+- head vertex (property graphs), [Property Graphs](/en/ch3#id56)
+- head-of-line blocking, [Latency and Response Time](/en/ch2#id23)
+- heap files (databases), [Storing values within the index](/en/ch4#sec_storage_index_heap)
+ - in multiversion concurrency control, [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl)
+- heat management, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
+- hedged requests, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
+- heterogeneous distributed transactions, [Distributed Transactions Across Different Systems](/en/ch8#sec_transactions_xa), [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
+- heuristic decisions (in 2PC), [Recovering from coordinator failure](/en/ch8#recovering-from-coordinator-failure)
+- Hex (notebook), [Machine Learning](/en/ch11#id290)
+- hexagons
+ - for geospatial indexing, [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
+- Hibernate (object-relational mapper), [Object-relational mapping (ORM)](/en/ch3#object-relational-mapping-orm)
+- hierarchical model, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
+- hierarchical navigable small world (vector index), [Vector Embeddings](/en/ch4#id92)
+- hierarchical queries (see recursive common table expressions)
+- high availability (see fault tolerance)
+- high-frequency trading, [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
+- high-performance computing (HPC), [Cloud Computing Versus Supercomputing](/en/ch1#id17)
+- hinted handoff (leaderless replication), [Catching up on missed writes](/en/ch6#sec_replication_read_repair)
+- histograms, [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
+- Hive (data warehouse), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+ - query optimizer, [Query languages](/en/ch11#sec_batch_query_lanauges)
+- HNSW (vector index), [Vector Embeddings](/en/ch4#id92)
+- hopping windows (stream processing), [Types of windows](/en/ch12#id324)
+ - (see also windows)
+- Hoptimator (query engine), [The meta-database of everything](/en/ch13#id341)
+- Horizon scandal, [Humans and Reliability](/en/ch2#id31)
+ - lack of transactions, [Transactions](/en/ch8#ch_transactions)
+- horizontal scaling (see scaling out)
+ - by sharding, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
+- HornetQ (messaging), [Message brokers](/en/ch5#message-brokers), [Message brokers compared to databases](/en/ch12#id297)
+ - distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
+- hot keys, [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
+- hot spots, [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
+ - due to celebrities, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
+ - for time-series data, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
+ - relieving, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
+- hot standbys (see leader-based replication)
+- HTAP (see hybrid transactional/analytic processing)
+- HTTP, use in APIs (see services)
+- human errors, [Humans and Reliability](/en/ch2#id31), [Network Faults in Practice](/en/ch9#sec_distributed_network_faults), [Batch Processing](/en/ch11#ch_batch)
+- hybrid logical clocks, [Hybrid logical clocks](/en/ch10#hybrid-logical-clocks)
+- hybrid transactional/analytic processing, [Data Warehousing](/en/ch1#sec_introduction_dwh), [Data Storage for Analytics](/en/ch4#sec_storage_analytics)
+- hydrating IDs (join), [Denormalization in the social networking case study](/en/ch3#denormalization-in-the-social-networking-case-study)
+- hypergraph, [Property Graphs](/en/ch3#id56)
+- HyperLogLog (algorithm), [Stream analytics](/en/ch12#id318)
+
+### I
+
+- I/O operations, waiting for, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+- IaaS (see infrastructure as a service (IaaS))
+- IBM
+ - Db2 (database)
+ - distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
+ - serializable isolation, [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+ - MQ (messaging), [Message brokers compared to databases](/en/ch12#id297)
+ - distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
+ - System R (database), [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview)
+ - WebSphere (messaging), [Message brokers](/en/ch5#message-brokers)
+- Iceberg (table format), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+ - databases on object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - log-based message broker storage, [Disk space usage](/en/ch12#sec_stream_disk_usage)
+- idempotence, [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc), [Idempotence](/en/ch12#sec_stream_idempotence), [Glossary](/en/glossary)
+ - by giving operations unique IDs, [Multi-shard request processing](/en/ch13#id360)
+ - by giving requests unique IDs, [Uniquely identifying requests](/en/ch13#id355)
+ - for exactly-once semantics, [Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
+ - idempotent operations, [Exactly-once execution of an operation](/en/ch13#id353)
+ - in workflow engines, [Durable execution](/en/ch5#durable-execution)
+- immutability
+ - advantages of, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros), [Designing for auditability](/en/ch13#id365)
+ - and right to erasure, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Disk space usage](/en/ch4#disk-space-usage)
+ - crypto-shredding for deletion, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events), [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+ - deriving state from event log, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)-[Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+ - for crash recovery, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+ - in B-trees, [B-tree variants](/en/ch4#b-tree-variants), [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
+ - in event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events), [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
+ - limitations of, [Concurrency control](/en/ch12#sec_stream_concurrency)
+- impedance mismatch, [The Object-Relational Mismatch](/en/ch3#sec_datamodels_document)
+- in doubt (transaction status), [Coordinator failure](/en/ch8#coordinator-failure)
+ - holding locks, [Holding locks while in doubt](/en/ch8#holding-locks-while-in-doubt)
+ - orphaned transactions, [Recovering from coordinator failure](/en/ch8#recovering-from-coordinator-failure)
+- in-memory databases, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+ - durability, [Durability](/en/ch8#durability)
+ - serial transaction execution, [Actual Serial Execution](/en/ch8#sec_transactions_serial)
+- incidents
+ - accounting software bugs leading to wrongful convictions, [Humans and Reliability](/en/ch2#id31)
+ - blameless postmortems, [Humans and Reliability](/en/ch2#id31)
+ - crashes due to leap seconds, [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
+ - data corruption and financial losses due to concurrency bugs, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
+ - data corruption on hard disks, [Durability](/en/ch8#durability)
+ - data loss due to last-write-wins, [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
+ - data on disks unreadable, [Mapping system models to the real world](/en/ch9#mapping-system-models-to-the-real-world)
+ - disclosure of sensitive data due to primary key reuse, [Leader failure: Failover](/en/ch6#leader-failure-failover)
+ - errors in transaction serializability, [Maintaining integrity in the face of software bugs](/en/ch13#id455)
+ - gigabit network interface with 1 Kb/s throughput, [System Model and Reality](/en/ch9#sec_distributed_system_model)
+ - leap second crash, [Software faults](/en/ch2#software-faults)
+ - network faults, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
+ - network interface dropping only inbound packets, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
+ - network partitions and whole-datacenter failures, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
+ - poor handling of network faults, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
+ - sending message to ex-partner, [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
+ - sharks biting undersea cables, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
+ - split brain due to 1-minute packet delay, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
+ - SSD failure after 32,768 hours, [Software faults](/en/ch2#software-faults)
+ - thread contention bringing down a service, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+ - vibrations in server rack, [Latency and Response Time](/en/ch2#id23)
+ - violation of uniqueness constraint, [Maintaining integrity in the face of software bugs](/en/ch13#id455)
+- incremental view maintenance (IVM), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+ - for data integration, [Unbundled versus integrated systems](/en/ch13#id448)
+- indexes, [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp), [Glossary](/en/glossary)
+ - and snapshot isolation, [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
+ - as derived data, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived), [Composing Data Storage Technologies](/en/ch13#id447)-[Unbundled versus integrated systems](/en/ch13#id448)
+ - B-trees, [B-Trees](/en/ch4#sec_storage_b_trees)-[B-tree variants](/en/ch4#b-tree-variants)
+ - clustered, [Storing values within the index](/en/ch4#sec_storage_index_heap)
+ - comparison of B-trees and LSM-trees, [Comparing B-Trees and LSM-Trees](/en/ch4#sec_storage_btree_lsm_comparison)-[Disk space usage](/en/ch4#disk-space-usage)
+ - covering (with included columns), [Storing values within the index](/en/ch4#sec_storage_index_heap)
+ - creating, [Creating an index](/en/ch13#id340)
+ - full-text search, [Full-Text Search](/en/ch4#sec_storage_full_text)
+ - geospatial, [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
+ - index-range locking, [Index-range locks](/en/ch8#sec_transactions_2pl_range)
+ - multi-column (concatenated), [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
+ - secondary, [Multi-Column and Secondary Indexes](/en/ch4#sec_storage_index_multicolumn)
+ - (see also secondary indexes)
+ - problems with dual writes, [Keeping Systems in Sync](/en/ch12#sec_stream_sync), [Reasoning about dataflows](/en/ch13#id443)
+ - sharding and secondary indexes, [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)-[Global Secondary Indexes](/en/ch7#id167), [Summary](/en/ch7#summary)
+ - sparse, [The SSTable file format](/en/ch4#the-sstable-file-format)
+ - SSTables and LSM-trees, [The SSTable file format](/en/ch4#the-sstable-file-format)-[Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+ - updating when data changes, [Keeping Systems in Sync](/en/ch12#sec_stream_sync), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+- Industrial Revolution, [Remembering the Industrial Revolution](/en/ch14#id377)
+- InfiniBand (networks), [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
+- InfluxDB IOx (storage engine), [Column-Oriented Storage](/en/ch4#sec_storage_column)
+- information retrieval (see full-text search)
+- infrastructure as a service (IaaS), [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud), [Layering of cloud services](/en/ch1#layering-of-cloud-services)
+- InnoDB (storage engine)
+ - clustered index on primary key, [Storing values within the index](/en/ch4#sec_storage_index_heap)
+ - not preventing lost updates, [Automatically detecting lost updates](/en/ch8#automatically-detecting-lost-updates)
+ - preventing write skew, [Characterizing write skew](/en/ch8#characterizing-write-skew), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+ - serializable isolation, [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+ - snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
+- instance (cloud computing), [Layering of cloud services](/en/ch1#layering-of-cloud-services)
+- integrating different data systems (see data integration)
+- integrity, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
+ - coordination-avoiding data systems, [Coordination-avoiding data systems](/en/ch13#id454)
+ - correctness of dataflow systems, [Correctness of dataflow systems](/en/ch13#id453)
+ - in consensus formalization, [Single-value consensus](/en/ch10#single-value-consensus), [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
+ - integrity checks, [Don't just blindly trust what they promise](/en/ch13#id364)
+ - (see also auditing)
+ - end-to-end, [The end-to-end argument](/en/ch13#sec_future_e2e_argument), [The end-to-end argument again](/en/ch13#id456)
+ - use of snapshot isolation, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
+ - maintaining despite software bugs, [Maintaining integrity in the face of software bugs](/en/ch13#id455)
+- Interface Definition Language (IDL), [Protocol Buffers](/en/ch5#sec_encoding_protobuf), [Avro](/en/ch5#sec_encoding_avro), [Web services](/en/ch5#sec_web_services)
+- invariants, [Consistency](/en/ch8#sec_transactions_acid_consistency)
+ - (see also constraints)
+- inverted file index (vector index), [Vector Embeddings](/en/ch4#id92)
+- inverted index, [Full-Text Search](/en/ch4#sec_storage_full_text)
+- irreversibility, minimizing, [Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events), [Batch Processing](/en/ch11#ch_batch)
+- ISDN (Integrated Services Digital Network), [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
+- isolation (in operating systems)
+ - cgroups (see cgroups)
+- isolation (in transactions), [Isolation](/en/ch8#sec_transactions_acid_isolation), [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object), [Glossary](/en/glossary)
+ - correctness and, [Aiming for Correctness](/en/ch13#sec_future_correctness)
+ - for single-object writes, [Single-object writes](/en/ch8#sec_transactions_single_object)
+ - serializability, [Serializability](/en/ch8#sec_transactions_serializability)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
+ - actual serial execution, [Actual Serial Execution](/en/ch8#sec_transactions_serial)-[Summary of serial execution](/en/ch8#summary-of-serial-execution)
+ - serializable snapshot isolation (SSI), [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
+ - two-phase locking (2PL), [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)-[Index-range locks](/en/ch8#sec_transactions_2pl_range)
+ - violating, [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
+ - weak isolation levels, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)-[Materializing conflicts](/en/ch8#materializing-conflicts)
+ - preventing lost updates, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)-[Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
+ - read committed, [Read Committed](/en/ch8#sec_transactions_read_committed)-[Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
+ - snapshot isolation, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)-[Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
+- IVF (vector index), [Vector Embeddings](/en/ch4#id92)
+
+### J
+
+- Java Database Connectivity (JDBC)
+ - distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
+ - network drivers, [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
+- Java Enterprise Edition (EE), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc), [XA transactions](/en/ch8#xa-transactions)
+- Java Message Service (JMS), [Message brokers compared to databases](/en/ch12#id297)
+ - (see also messaging systems)
+ - comparison to log-based messaging, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Replaying old messages](/en/ch12#sec_stream_replay)
+ - distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
+ - message ordering, [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
+- Java Transaction API (JTA), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc), [XA transactions](/en/ch8#xa-transactions)
+- Java Virtual Machine (JVM)
+ - garbage collection, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses), [Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
+ - JIT compilation, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+ - process reuse in batch processors, [Dataflow Engines](/en/ch11#sec_batch_dataflow)
+- Jena (RDF framework), [The RDF data model](/en/ch3#the-rdf-data-model)
+ - SPARQL query language, [The SPARQL query language](/en/ch3#the-sparql-query-language)
+- Jepsen (fault tolerance testing), [Fault injection](/en/ch9#sec_fault_injection), [Aiming for Correctness](/en/ch13#sec_future_correctness)
+- jitter (network delay), [Average, Median, and Percentiles](/en/ch2#id24), [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+- JMESPath (query language), [Query languages](/en/ch11#sec_batch_query_lanauges)
+- join table, [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many), [Property Graphs](/en/ch3#id56)
+- joins, [Glossary](/en/glossary)
+ - expressing as relational operators, [Query languages](/en/ch11#sec_batch_query_lanauges)
+ - handling GraphQL query, [GraphQL](/en/ch3#id63)
+ - in application code, [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization), [Denormalization in the social networking case study](/en/ch3#denormalization-in-the-social-networking-case-study)
+ - in DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+ - in relational and document databases, [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)
+ - secondary indexes and, [Multi-Column and Secondary Indexes](/en/ch4#sec_storage_index_multicolumn)
+ - sort-merge joins, [JOIN and GROUP BY](/en/ch11#sec_batch_join)
+ - stream joins, [Stream Joins](/en/ch12#sec_stream_joins)-[Time-dependence of joins](/en/ch12#sec_stream_join_time)
+ - stream-stream join, [Stream-stream join (window join)](/en/ch12#id440)
+ - stream-table join, [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins)
+ - table-table join, [Table-table join (materialized view maintenance)](/en/ch12#id326)
+ - time-dependence of, [Time-dependence of joins](/en/ch12#sec_stream_join_time)
+ - support in document databases, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
+- JOTM (transaction coordinator), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
+- journaling (filesystems), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
+- JSON
+ - aggregation pipeline (query language), [Query languages for documents](/en/ch3#query-languages-for-documents)
+ - Avro schema representation, [Avro](/en/ch5#sec_encoding_avro)
+ - binary variants, [Binary encoding](/en/ch5#binary-encoding)
+ - data locality, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
+ - document data model, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
+ - for application data, issues with, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
+ - GraphQL response, [GraphQL](/en/ch3#id63)
+ - in relational databases, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
+ - representing a résumé (example), [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
+ - Schema, [JSON Schema](/en/ch5#json-schema)
+- JSON-LD, [Triple-Stores and SPARQL](/en/ch3#id59)
+- JsonPath (query language), [Query languages](/en/ch11#sec_batch_query_lanauges)
+- JuiceFS (distributed filesystem), [Distributed Filesystems](/en/ch11#sec_batch_dfs), [Object Stores](/en/ch11#id277)
+- Jupyter (notebook), [Machine Learning](/en/ch11#id290)
+- just-in-time (JIT) compilation, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+
+### K
+
+- Kafka (messaging), [Message brokers](/en/ch5#message-brokers), [Using logs for message storage](/en/ch12#id300)
+ - consumer groups, [Multiple consumers](/en/ch12#id298)
+ - for data integration, [Unbundled versus integrated systems](/en/ch13#id448)
+ - for event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - Kafka Connect (database integration), [Implementing change data capture](/en/ch12#id307), [API support for change streams](/en/ch12#sec_stream_change_api), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
+ - Kafka Streams (stream processor), [Stream analytics](/en/ch12#id318), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+ - exactly-once semantics, [Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
+ - fault tolerance, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
+ - ksqlDB (stream database), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+ - leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+ - log compaction, [Log compaction](/en/ch12#sec_stream_log_compaction), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+ - message offsets, [Using logs for message storage](/en/ch12#id300), [Idempotence](/en/ch12#sec_stream_idempotence)
+ - partitions (sharding), [Sharding](/en/ch7#ch_sharding)
+ - request routing, [Request Routing](/en/ch7#sec_sharding_routing)
+ - schema registry, [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
+ - serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
+ - tiered storage, [Disk space usage](/en/ch12#sec_stream_disk_usage)
+ - transactions, [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal), [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit)
+ - unclean leader election, [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
+ - use of model-checking, [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
+- kappa architecture, [Unifying batch and stream processing](/en/ch13#id338)
+- key-value stores, [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp)
+ - comparison to object stores, [Object Stores](/en/ch11#id277)
+ - in-memory, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+ - LSM storage, [Log-Structured Storage](/en/ch4#sec_storage_log_structured)-[Disk space usage](/en/ch4#disk-space-usage)
+ - sharding, [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)-[Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
+ - by hash of key, [Sharding by Hash of Key](/en/ch7#sec_sharding_hash), [Summary](/en/ch7#summary)
+ - by key range, [Sharding by Key Range](/en/ch7#sec_sharding_key_range), [Summary](/en/ch7#summary)
+ - skew and hot spots, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
+- Kinesis (messaging), [Message brokers](/en/ch5#message-brokers), [Using logs for message storage](/en/ch12#id300)
+ - data warehouse integration, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+- Kryo (Java), [Language-Specific Formats](/en/ch5#id96)
+- ksqlDB (stream database), [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+- Kubernetes (cluster manager), [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud), [Microservices and Serverless](/en/ch1#sec_introduction_microservices), [Distributed Job Orchestration](/en/ch11#id278), [Separation of application code and state](/en/ch13#id344)
+ - Kubeflow, [Machine Learning](/en/ch11#id290)
+ - kubelet, [Distributed Job Orchestration](/en/ch11#id278)
+ - operators, [Distributed Job Orchestration](/en/ch11#id278)
+ - use of etcd, [Request Routing](/en/ch7#sec_sharding_routing), [Coordination Services](/en/ch10#sec_consistency_coordination)
+- KùzuDB (database), [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+ - as embedded storage engine, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+ - Cypher query language, [The Cypher Query Language](/en/ch3#id57)
+
+### L
+
+- labeled property graphs (see property graphs)
+- lambda architecture, [Unifying batch and stream processing](/en/ch13#id338)
+- Lamport timestamps, [Lamport timestamps](/en/ch10#lamport-timestamps)
+- Lance (data format), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Column-Oriented Storage](/en/ch4#sec_storage_column)
+ - (see also column-oriented storage)
+- large language models (LLMs)
+ - pre-processing training data, [Machine Learning](/en/ch11#id290)
+- last write wins (LWW), [Last write wins (discarding concurrent writes)](/en/ch6#sec_replication_lww), [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent), [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
+ - problems with, [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
+ - prone to lost updates, [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
+- latency, [Latency and Response Time](/en/ch2#id23)
+ - (see also response time)
+ - across regions, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
+ - instability under two-phase locking, [Performance of two-phase locking](/en/ch8#performance-of-two-phase-locking)
+ - network latency and resource utilization, [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
+ - reducing by request hedging, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
+ - response time versus, [Latency and Response Time](/en/ch2#id23)
+ - tail latency, [Average, Median, and Percentiles](/en/ch2#id24), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla), [Local Secondary Indexes](/en/ch7#id166)
+- law (see legal matters)
+- layering (of cloud services), [Layering of cloud services](/en/ch1#layering-of-cloud-services)
+- leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)-[Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
+ - (see also replication)
+ - failover, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing)
+ - handling node outages, [Handling Node Outages](/en/ch6#sec_replication_failover)
+ - implementation of replication logs
+ - change data capture, [Change Data Capture](/en/ch12#sec_stream_cdc)-[API support for change streams](/en/ch12#sec_stream_change_api)
+ - (see also changelogs)
+ - statement-based, [Statement-based replication](/en/ch6#statement-based-replication)
+ - write-ahead log (WAL) shipping, [Write-ahead log (WAL) shipping](/en/ch6#write-ahead-log-wal-shipping)
+ - linearizability of operations, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
+ - locking and leader election, [Locking and leader election](/en/ch10#locking-and-leader-election)
+ - log sequence number, [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Consumer offsets](/en/ch12#sec_stream_log_offsets)
+ - read-scaling architecture, [Problems with Replication Lag](/en/ch6#sec_replication_lag), [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
+ - relation to consensus, [Consensus](/en/ch10#sec_consistency_consensus), [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus), [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
+ - setting up new followers, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - synchronous versus asynchronous, [Synchronous Versus Asynchronous Replication](/en/ch6#sec_replication_sync_async)
+- leaderless replication, [Leaderless Replication](/en/ch6#sec_replication_leaderless)-[Version vectors](/en/ch6#version-vectors)
+ - (see also replication)
+ - catching up on missed writes, [Catching up on missed writes](/en/ch6#sec_replication_read_repair)
+ - detecting concurrent writes, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)-[Version vectors](/en/ch6#version-vectors)
+ - version vectors, [Version vectors](/en/ch6#version-vectors)
+ - multi-region, [Multi-region operation](/en/ch6#multi-region-operation)
+ - quorums, [Quorums for reading and writing](/en/ch6#sec_replication_quorum_condition)-[Multi-region operation](/en/ch6#multi-region-operation)
+ - consistency limitations, [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations)-[Monitoring staleness](/en/ch6#monitoring-staleness), [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
+- leap seconds, [Software faults](/en/ch2#software-faults), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
+ - in time-of-day clocks, [Time-of-day clocks](/en/ch9#time-of-day-clocks)
+- leases, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+ - implementation with coordination service, [Coordination Services](/en/ch10#sec_consistency_coordination)
+ - need for fencing, [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing)
+ - relation to consensus, [Single-value consensus](/en/ch10#single-value-consensus)
+- ledgers (accounting), [Summary](/en/ch3#summary)
+ - immutability, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
+- legacy systems, maintenance of, [Maintainability](/en/ch2#sec_introduction_maintainability)
+- legal matters, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
+ - data deletion, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Disk space usage](/en/ch4#disk-space-usage)
+ - data residency, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed), [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+ - privacy regulation, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
+- legitimate interest (GDPR), [Consent and Freedom of Choice](/en/ch14#id375)
+- leveled compaction, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction), [Disk space usage](/en/ch4#disk-space-usage)
+- Levenshtein automata, [Full-Text Search](/en/ch4#sec_storage_full_text)
+- limping (partial failure), [System Model and Reality](/en/ch9#sec_distributed_system_model)
+- Linear (project management software), [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
+- linear algebra, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- linear scalability, [Describing Load](/en/ch2#id33)
+- linearizability, [Solutions for Replication Lag](/en/ch6#id131), [Linearizability](/en/ch10#sec_consistency_linearizability)-[Linearizability and network delays](/en/ch10#linearizability-and-network-delays), [Glossary](/en/glossary)
+ - and consensus, [Consensus](/en/ch10#sec_consistency_consensus)
+ - cost of, [The Cost of Linearizability](/en/ch10#sec_linearizability_cost)-[Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
+ - CAP theorem, [The CAP theorem](/en/ch10#the-cap-theorem)
+ - memory on multi-core CPUs, [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
+ - definition, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+ - ID generation, [Linearizable ID Generators](/en/ch10#sec_consistency_linearizable_id)
+ - in coordination services, [Coordination Services](/en/ch10#sec_consistency_coordination)
+ - of derived data systems
+ - avoiding coordination, [Coordination-avoiding data systems](/en/ch13#id454)
+ - of different replication methods, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)-[Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
+ - using quorums, [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
+ - reads in consensus systems, [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
+ - relying on, [Relying on Linearizability](/en/ch10#sec_consistency_linearizability_usage)-[Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
+ - constraints and uniqueness, [Constraints and uniqueness guarantees](/en/ch10#sec_consistency_uniqueness)
+ - cross-channel timing dependencies, [Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
+ - locking and leader election, [Locking and leader election](/en/ch10#locking-and-leader-election)
+ - versus serializability, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+- linked data, [Triple-Stores and SPARQL](/en/ch3#id59)
+- LinkedIn
+ - Espresso (database), [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
+ - LIquid (database), [Datalog: Recursive Relational Queries](/en/ch3#id62)
+ - profile (example), [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
+- Linux, leap second bug, [Software faults](/en/ch2#software-faults), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
+- Litestream (backup tool), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+- liveness properties, [Safety and liveness](/en/ch9#sec_distributed_safety_liveness)
+- LLVM (compiler), [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+- LMDB (storage engine), [Compaction strategies](/en/ch4#sec_storage_lsm_compaction), [B-tree variants](/en/ch4#b-tree-variants), [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
+- load
+ - coping with, [Principles for Scalability](/en/ch2#id35)
+ - describing, [Describing Load](/en/ch2#id33)
+- load balancing, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
+ - in hardware, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
+ - in software, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
+ - using message brokers, [Multiple consumers](/en/ch12#id298)
+- load shedding, [Describing Performance](/en/ch2#sec_introduction_percentiles)
+- local secondary indexes, [Local Secondary Indexes](/en/ch7#id166), [Summary](/en/ch7#summary)
+- local-first software, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
+- locality (data access), [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships), [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality), [Glossary](/en/glossary)
+ - in batch processing, [Dataflow Engines](/en/ch11#sec_batch_dataflow)
+ - in stateful clients, [Sync Engines and Local-First Software](/en/ch6#sec_replication_offline_clients), [Stateful, offline-capable clients](/en/ch13#id347)
+ - in stream processing, [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins), [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance), [Stream processors and services](/en/ch13#id345), [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
+- location transparency, [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
+ - in the actor model, [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks)
+- lock-in, [Pros and Cons of Cloud Services](/en/ch1#sec_introduction_cloud_tradeoffs)
+- locks, [Glossary](/en/glossary)
+ - deadlock, [Explicit locking](/en/ch8#explicit-locking), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+ - distributed locking, [Distributed Locks and Leases](/en/ch9#sec_distributed_lock_fencing)-[Fencing with multiple replicas](/en/ch9#fencing-with-multiple-replicas), [Locking and leader election](/en/ch10#locking-and-leader-election)
+ - fencing tokens, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
+ - implementation with coordination service, [Coordination Services](/en/ch10#sec_consistency_coordination)
+ - relation to consensus, [Single-value consensus](/en/ch10#single-value-consensus)
+ - for transaction isolation
+ - in snapshot isolation, [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl)
+ - in two-phase locking (2PL), [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)-[Index-range locks](/en/ch8#sec_transactions_2pl_range)
+ - making operations atomic, [Atomic write operations](/en/ch8#atomic-write-operations)
+ - performance, [Performance of two-phase locking](/en/ch8#performance-of-two-phase-locking)
+ - preventing dirty writes, [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
+ - preventing phantoms with index-range locks, [Index-range locks](/en/ch8#sec_transactions_2pl_range), [Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
+ - read locks (shared mode), [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+ - shared mode and exclusive mode, [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+ - in distributed transactions
+ - deadlock detection, [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
+ - in-doubt transactions holding locks, [Holding locks while in doubt](/en/ch8#holding-locks-while-in-doubt)
+ - materializing conflicts with, [Materializing conflicts](/en/ch8#materializing-conflicts)
+ - preventing lost updates by explicit locking, [Explicit locking](/en/ch8#explicit-locking)
+- log sequence number, [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Consumer offsets](/en/ch12#sec_stream_log_offsets)
+- logical clocks, [Timestamps for ordering events](/en/ch9#sec_distributed_lww), [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)-[Enforcing constraints using logical clocks](/en/ch10#enforcing-constraints-using-logical-clocks), [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
+ - for last-write-wins, [Last write wins (discarding concurrent writes)](/en/ch6#sec_replication_lww)
+ - for read-after-write consistency, [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
+ - hybrid logical clocks, [Hybrid logical clocks](/en/ch10#hybrid-logical-clocks)
+ - insufficiency for enforcing constraints, [Enforcing constraints using logical clocks](/en/ch10#enforcing-constraints-using-logical-clocks)
+ - Lamport timestamps, [Lamport timestamps](/en/ch10#lamport-timestamps)
+- logical replication, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
+ - for change data capture, [Implementing change data capture](/en/ch12#id307)
+- LogicBlox (database), [Datalog: Recursive Relational Queries](/en/ch3#id62)
+- logs (data structure), [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp), [Shared logs as consensus](/en/ch10#sec_consistency_shared_logs), [Glossary](/en/glossary)
+ - (see also shared logs)
+ - advantages of immutability, [Advantages of immutable events](/en/ch12#sec_stream_immutability_pros)
+ - and right to erasure, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Disk space usage](/en/ch4#disk-space-usage)
+ - compaction, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Compaction strategies](/en/ch4#sec_storage_lsm_compaction), [Log compaction](/en/ch12#sec_stream_log_compaction), [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)
+ - for stream operator state, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
+ - implementing uniqueness constraints, [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
+ - log-based messaging, [Log-based Message Brokers](/en/ch12#sec_stream_log)-[Replaying old messages](/en/ch12#sec_stream_replay)
+ - comparison to traditional messaging, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Replaying old messages](/en/ch12#sec_stream_replay)
+ - consumer offsets, [Consumer offsets](/en/ch12#sec_stream_log_offsets)
+ - disk space usage, [Disk space usage](/en/ch12#sec_stream_disk_usage)
+ - replaying old messages, [Replaying old messages](/en/ch12#sec_stream_replay), [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing), [Unifying batch and stream processing](/en/ch13#id338)
+ - slow consumers, [When consumers cannot keep up with producers](/en/ch12#id459)
+ - using logs for message storage, [Using logs for message storage](/en/ch12#id300)
+ - log-structured storage, [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp)-[Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+ - log-structured merge tree (see LSM-trees)
+ - relation to consensus, [Shared logs as consensus](/en/ch10#sec_consistency_shared_logs)
+ - replication, [Single-Leader Replication](/en/ch6#sec_replication_leader), [Implementation of Replication Logs](/en/ch6#sec_replication_implementation)-[Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
+ - change data capture, [Change Data Capture](/en/ch12#sec_stream_cdc)-[API support for change streams](/en/ch12#sec_stream_change_api)
+ - (see also changelogs)
+ - coordination with snapshot, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - logical (row-based) replication, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
+ - statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication)
+ - write-ahead log (WAL) shipping, [Write-ahead log (WAL) shipping](/en/ch6#write-ahead-log-wal-shipping)
+ - scalability limits, [The limits of total ordering](/en/ch13#id335)
+- Looker (business intelligence software), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Analytics](/en/ch11#sec_batch_olap)
+- loose coupling, [Making unbundling work](/en/ch13#sec_future_unbundling_favor)
+- lost updates (see updates)
+- Lotus Notes (sync engine), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+- LSM-trees (indexes), [The SSTable file format](/en/ch4#the-sstable-file-format)-[Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+ - comparison to B-trees, [Comparing B-Trees and LSM-Trees](/en/ch4#sec_storage_btree_lsm_comparison)-[Disk space usage](/en/ch4#disk-space-usage)
+- Lucene (storage engine), [Full-Text Search](/en/ch4#sec_storage_full_text)
+ - similarity search, [Full-Text Search](/en/ch4#sec_storage_full_text)
+- LWW (see last write wins)
+
+### M
+
+- machine learning
+ - batch inference, [Machine Learning](/en/ch11#id290)
+ - data preparation with DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+ - deleting training data, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
+ - deploying data products, [Beyond the data lake](/en/ch1#beyond-the-data-lake)
+ - ethical considerations, [Predictive Analytics](/en/ch14#id369)
+ - (see also ethics)
+ - feature engineering, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake), [Machine Learning](/en/ch11#id290)
+ - in analytics systems, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)
+ - iterative processing, [Machine Learning](/en/ch11#id290)
+ - LLMs (see large language models (LLMs))
+ - models derived from training data, [Application code as a derivation function](/en/ch13#sec_future_dataflow_derivation)
+ - relation to batch processing, [Machine Learning](/en/ch11#id290)
+ - using a data lake, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
+ - using GPUs, [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
+ - using matrices, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- madsim (deterministic simulation testing), [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+- magic scaling sauce, [Principles for Scalability](/en/ch2#id35)
+- maintainability, [Maintainability](/en/ch2#sec_introduction_maintainability)-[Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability), [A Philosophy of Streaming Systems](/en/ch13#ch_philosophy)
+ - evolvability (see evolvability)
+ - operability, [Operability: Making Life Easy for Operations](/en/ch2#id37)
+ - simplicity and managing complexity, [Simplicity: Managing Complexity](/en/ch2#id38)
+- many-to-many relationships, [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many)
+ - modeling as graphs, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+- many-to-one relationships, [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many)
+ - in star schema, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
+- MapReduce (batch processing), [Batch Processing](/en/ch11#ch_batch), [MapReduce](/en/ch11#sec_batch_mapreduce)
+ - analysis of user activity events (example), [JOIN and GROUP BY](/en/ch11#sec_batch_join)
+ - comparison to stream processing, [Processing Streams](/en/ch12#sec_stream_processing)
+ - disadvantages and limitations of, [MapReduce](/en/ch11#sec_batch_mapreduce)
+ - fault tolerance, [Handling Faults](/en/ch11#id281)
+ - higher-level tools, [Query languages](/en/ch11#sec_batch_query_lanauges)
+ - mapper and reducer functions, [MapReduce](/en/ch11#sec_batch_mapreduce)
+ - shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
+ - sort-merge joins, [JOIN and GROUP BY](/en/ch11#sec_batch_join)
+ - workflows, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
+ - (see also workflow engines)
+- marshalling (see encoding)
+- MartenDB (database), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+- master-slave replication (obsolete term), [Single-Leader Replication](/en/ch6#sec_replication_leader)
+- materialization, [Glossary](/en/glossary)
+ - aggregate values, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
+ - conflicts, [Materializing conflicts](/en/ch8#materializing-conflicts)
+ - materialized views, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
+ - as derived data, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived), [Composing Data Storage Technologies](/en/ch13#id447)-[Unbundled versus integrated systems](/en/ch13#id448)
+ - in event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - incremental view maintenance, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+ - (see also incremental view maintenance (IVM))
+ - maintaining, using stream processing, [Maintaining materialized views](/en/ch12#sec_stream_mat_view), [Table-table join (materialized view maintenance)](/en/ch12#id326)
+ - social network timeline example, [Materializing and Updating Timelines](/en/ch2#sec_introduction_materializing)
+- Materialize (database), [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
+ - incremental view maintenance, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+- matrices, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+ - sparse, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- Maxwell (change data capture), [Implementing change data capture](/en/ch12#id307)
+- mean, [Average, Median, and Percentiles](/en/ch2#id24)
+- media monitoring, [Search on streams](/en/ch12#id320)
+- median, [Average, Median, and Percentiles](/en/ch2#id24)
+- meeting room booking (example), [More examples of write skew](/en/ch8#more-examples-of-write-skew), [Predicate locks](/en/ch8#predicate-locks), [Enforcing Constraints](/en/ch13#sec_future_constraints)
+- Memcached (caching server), [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+- Memgraph (database), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+ - Cypher query language, [The Cypher Query Language](/en/ch3#id57)
+- memory
+ - barrier (CPU instruction), [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
+ - corruption, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
+ - in-memory databases, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+ - durability, [Durability](/en/ch8#durability)
+ - serial transaction execution, [Actual Serial Execution](/en/ch8#sec_transactions_serial)
+ - in-memory representation of data, [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
+ - memtable (in LSM-trees), [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+ - random bit-flips in, [Trust, but Verify](/en/ch13#sec_future_verification)
+ - use by indexes, [Log-Structured Storage](/en/ch4#sec_storage_log_structured)
+- memtable (in LSM-trees), [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+- Mercurial (version control system), [Concurrency control](/en/ch12#sec_stream_concurrency)
+- merge (DataFrame operator), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- merging sorted files, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Shuffling Data](/en/ch11#sec_shuffle)
+- Merkle trees, [Tools for auditable data systems](/en/ch13#id366)
+- Mesos (cluster manager), [Separation of application code and state](/en/ch13#id344)
+- message brokers (see messaging systems)
+- message-passing (see event-driven architecture)
+- MessagePack (encoding format), [Binary encoding](/en/ch5#binary-encoding)
+- messaging systems, [Stream Processing](/en/ch12#ch_stream)-[Replaying old messages](/en/ch12#sec_stream_replay)
+ - (see also streams)
+ - backpressure, buffering, or dropping messages, [Messaging Systems](/en/ch12#sec_stream_messaging)
+ - brokerless messaging, [Direct messaging from producers to consumers](/en/ch12#id296)
+ - event logs, [Log-based Message Brokers](/en/ch12#sec_stream_log)-[Replaying old messages](/en/ch12#sec_stream_replay)
+ - as data model, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - comparison to traditional messaging, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Replaying old messages](/en/ch12#sec_stream_replay)
+ - consumer offsets, [Consumer offsets](/en/ch12#sec_stream_log_offsets)
+ - replaying old messages, [Replaying old messages](/en/ch12#sec_stream_replay), [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing), [Unifying batch and stream processing](/en/ch13#id338)
+ - slow consumers, [When consumers cannot keep up with producers](/en/ch12#id459)
+ - exactly-once semantics, [Exactly-once message processing](/en/ch8#sec_transactions_exactly_once), [Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited), [Fault Tolerance](/en/ch12#sec_stream_fault_tolerance)
+ - message brokers, [Message brokers](/en/ch12#id433)-[Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
+ - acknowledgments and redelivery, [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
+ - comparison to event logs, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Replaying old messages](/en/ch12#sec_stream_replay)
+ - multiple consumers of same topic, [Multiple consumers](/en/ch12#id298)
+ - versus RPC, [Event-Driven Architectures](/en/ch5#sec_encoding_dataflow_msg)
+ - message loss, [Messaging Systems](/en/ch12#sec_stream_messaging)
+ - reliability, [Messaging Systems](/en/ch12#sec_stream_messaging)
+ - uniqueness in log-based messaging, [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
+- metastable failure, [Describing Performance](/en/ch2#sec_introduction_percentiles)
+- metered billing
+ - serverless, [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
+ - storage, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+- microbatching, [Microbatching and checkpointing](/en/ch12#id329)
+- microservices, [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
+ - (see also services)
+ - causal dependencies across services, [The limits of total ordering](/en/ch13#id335)
+ - loose coupling, [Making unbundling work](/en/ch13#sec_future_unbundling_favor)
+ - relation to batch/stream processors, [Batch Processing](/en/ch11#ch_batch), [Stream processors and services](/en/ch13#id345)
+- Microsoft
+ - Azure Blob Storage (see Azure Blob Storage)
+ - Azure managed disks, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
+ - Azure Service Bus (messaging), [Message brokers](/en/ch5#message-brokers), [Message brokers compared to databases](/en/ch12#id297)
+ - Azure SQL DB (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
+ - Azure Storage, [Object Stores](/en/ch11#id277)
+ - Azure Stream Analytics, [Stream analytics](/en/ch12#id318)
+ - Azure Synapse Analytics (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
+ - DCOM (Distributed Component Object Model), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
+ - MSDTC (transaction coordinator), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
+ - SQL Server (see SQL Server)
+- Microsoft Power BI (see Power BI (business intelligence software))
+- migrating (rewriting) data, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility), [Different values written at different times](/en/ch5#different-values-written-at-different-times), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views), [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
+- MinIO (object storage), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- mobile apps, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
+ - embedded databases, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+- model checking, [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
+- modulus operator (%), [Hash modulo number of nodes](/en/ch7#hash-modulo-number-of-nodes)
+- Mojo (programming language)
+ - memory management, [Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
+- MongoDB (database)
+ - aggregation pipeline, [Query languages for documents](/en/ch3#query-languages-for-documents)
+ - atomic operations, [Atomic write operations](/en/ch8#atomic-write-operations)
+ - BSON, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
+ - document data model, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
+ - hash-range sharding, [Sharding by Hash of Key](/en/ch7#sec_sharding_hash), [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+ - in the cloud, [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
+ - join support, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
+ - joins (\$lookup operator), [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)
+ - JSON Schema validation, [JSON Schema](/en/ch5#json-schema)
+ - leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+ - ObjectIds, [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
+ - range-based sharding, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
+ - request routing, [Request Routing](/en/ch7#sec_sharding_routing)
+ - secondary indexes, [Local Secondary Indexes](/en/ch7#id166)
+ - shard splitting, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)
+ - stored procedures, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
+- monitoring, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations), [Humans and Reliability](/en/ch2#id31), [Operability: Making Life Easy for Operations](/en/ch2#id37)
+- monotonic clocks, [Monotonic clocks](/en/ch9#monotonic-clocks)
+- monotonic reads, [Monotonic Reads](/en/ch6#sec_replication_monotonic_reads)
+- Morel (query language), [Query languages](/en/ch11#sec_batch_query_lanauges)
+- MSMQ (messaging), [XA transactions](/en/ch8#xa-transactions)
+- multi-column indexes, [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
+- multi-leader replication, [Multi-Leader Replication](/en/ch6#sec_replication_multi_leader)-[Types of conflict](/en/ch6#sec_replication_write_conflicts)
+ - (see also replication)
+ - collaborative editing, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
+ - conflict detection, [Types of conflict](/en/ch6#sec_replication_write_conflicts)
+ - conflict resolution, [Dealing with Conflicting Writes](/en/ch6#sec_replication_write_conflicts)
+ - for multi-region replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc), [The Cost of Linearizability](/en/ch10#sec_linearizability_cost)
+ - linearizability, lack of, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
+ - offline-capable clients, [Sync Engines and Local-First Software](/en/ch6#sec_replication_offline_clients)
+ - replication topologies, [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)-[Problems with different topologies](/en/ch6#problems-with-different-topologies)
+- multi-object transactions, [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
+ - need for, [The need for multi-object transactions](/en/ch8#sec_transactions_need)
+- Multi-Paxos (consensus algorithm), [Consensus in Practice](/en/ch10#sec_consistency_total_order)
+- multi-reader single-writer lock, [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+- multi-table index cluster tables (Oracle), [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
+- multi-version concurrency control (MVCC), [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl), [Summary](/en/ch8#summary)
+ - detecting stale MVCC reads, [Detecting stale MVCC reads](/en/ch8#detecting-stale-mvcc-reads)
+ - indexes and snapshot isolation, [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
+ - using synchronized clocks, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+- multidimensional arrays, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- multitenancy, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute), [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+ - by sharding, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+ - using embedded databases, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+ - versus Byzantine fault tolerance, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
+- mutual exclusion, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
+ - (see also locks)
+- MySQL (database)
+ - archiving WAL to object stores, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - binlog coordinates, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - change data capture, [Implementing change data capture](/en/ch12#id307), [API support for change streams](/en/ch12#sec_stream_change_api)
+ - circular replication topology, [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)
+ - consistent snapshots, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
+ - global transaction identifiers (GTIDs), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - in the cloud, [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
+ - InnoDB storage engine (see InnoDB)
+ - leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+ - multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
+ - row-based replication, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
+ - sharding (see Vitess (database))
+ - snapshot isolation support, [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
+ - (see also InnoDB)
+ - statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication)
+
+### N
+
+- N+1 query problem, [Object-relational mapping (ORM)](/en/ch3#object-relational-mapping-orm)
+- nanomsg (messaging library), [Direct messaging from producers to consumers](/en/ch12#id296)
+- Narayana (transaction coordinator), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
+- NATS (messaging), [Message brokers](/en/ch5#message-brokers)
+- natural language processing, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
+- Neo4j (database)
+ - Cypher query language, [The Cypher Query Language](/en/ch3#id57)
+ - graph data model, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+- Neon (database), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+- Nephele (dataflow engine), [Dataflow Engines](/en/ch11#sec_batch_dataflow)
+- Neptune (graph database), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+ - Cypher query language, [The Cypher Query Language](/en/ch3#id57)
+ - SPARQL query language, [The SPARQL query language](/en/ch3#the-sparql-query-language)
+- netcode (game development), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+- Network Attached Storage (NAS), [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- network model (data representation), [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
+- Network Time Protocol (see NTP)
+- networks
+ - congestion and queueing, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+ - datacenter network topologies, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
+ - faults (see faults)
+ - linearizability and network delays, [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
+ - network partitions, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
+ - in CAP theorem, [The Cost of Linearizability](/en/ch10#sec_linearizability_cost)
+ - timeouts and unbounded delays, [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing)
+- NewSQL, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history), [Solutions for Replication Lag](/en/ch6#id131)
+ - transactions and, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
+- next-key locking, [Index-range locks](/en/ch8#sec_transactions_2pl_range)
+- NFS (network file system), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+ - on object storage, [Object Stores](/en/ch11#id277)
+- Nimble (data format), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Column-Oriented Storage](/en/ch4#sec_storage_column)
+ - (see also column-oriented storage)
+- node (in graphs) (see vertices)
+- nodes (processes), [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed), [Glossary](/en/glossary)
+ - handling outages in leader-based replication, [Handling Node Outages](/en/ch6#sec_replication_failover)
+ - system models for failure, [System Model and Reality](/en/ch9#sec_distributed_system_model)
+- noisy neighbors, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+- nonblocking atomic commit, [Three-phase commit](/en/ch8#three-phase-commit)
+- nondeterministic operations, [Statement-based replication](/en/ch6#statement-based-replication)
+ - (see also deterministic operations)
+ - in distributed systems, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+ - in workflow engines, [Durable execution](/en/ch5#durable-execution)
+ - partial failures, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
+ - sources of nondeterminism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+- nonfunctional requirements, [Defining Nonfunctional Requirements](/en/ch2#ch_nonfunctional), [Summary](/en/ch2#summary)
+- nonrepeatable reads, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
+ - (see also read skew)
+- normalization (data representation), [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)-[Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many), [Glossary](/en/glossary)
+ - foreign key references, [The need for multi-object transactions](/en/ch8#sec_transactions_need)
+ - in social network case study, [Denormalization in the social networking case study](/en/ch3#denormalization-in-the-social-networking-case-study)
+ - in systems of record, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
+ - versus denormalization, [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
+- NoSQL, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history), [Solutions for Replication Lag](/en/ch6#id131), [Unbundling Databases](/en/ch13#sec_future_unbundling)
+ - transactions and, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview)
+- Notation3 (N3), [Triple-Stores and SPARQL](/en/ch3#id59)
+- NTP (Network Time Protocol), [Unreliable Clocks](/en/ch9#sec_distributed_clocks)
+ - accuracy, [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy), [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
+ - adjustments to monotonic clocks, [Monotonic clocks](/en/ch9#monotonic-clocks)
+ - multiple server addresses, [Weak forms of lying](/en/ch9#weak-forms-of-lying)
+- numbers, in XML and JSON encodings, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
+- NumPy (Python library), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes), [Column-Oriented Storage](/en/ch4#sec_storage_column)
+- NVMe (Non-Volatile Memory Express) (see solid state drives (SSDs))
+
+### O
+
+- object databases, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
+- object storage, [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Object Stores](/en/ch11#id277)
+ - Azure Blob Storage (see Azure Blob Storage)
+ - comparison to distributed filesystems, [Object Stores](/en/ch11#id277)
+ - comparison to key-value stores, [Object Stores](/en/ch11#id277)
+ - databases backed by, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - for backups, [Replication](/en/ch6#ch_replication)
+ - for cloud data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
+ - for database replication, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - Google Cloud Storage (see Google Cloud Storage)
+ - object size, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
+ - S3 (see S3 (object storage))
+ - storing LSM segment files, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+ - support for fencing, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
+ - use in data lakes, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
+- object-relational mapping (ORM) frameworks, [Object-relational mapping (ORM)](/en/ch3#object-relational-mapping-orm)
+ - error handling and aborted transactions, [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
+ - unsafe read-modify-write cycle code, [Atomic write operations](/en/ch8#atomic-write-operations)
+- object-relational mismatch, [The Object-Relational Mismatch](/en/ch3#sec_datamodels_document)
+- observability, [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems), [Humans and Reliability](/en/ch2#id31), [Operability: Making Life Easy for Operations](/en/ch2#id37)
+- observer pattern, [Separation of application code and state](/en/ch13#id344)
+- OBT (one big table), [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
+- offline systems, [Batch Processing](/en/ch11#ch_batch)
+ - (see also batch processing)
+- offline-first applications, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps), [Stateful, offline-capable clients](/en/ch13#id347)
+- offsets
+ - consumer offsets in sharded logs, [Consumer offsets](/en/ch12#sec_stream_log_offsets)
+ - messages in sharded logs, [Using logs for message storage](/en/ch12#id300)
+- OLAP (online analytic processing), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Glossary](/en/glossary)
+ - data cubes, [Materialized Views and Data Cubes](/en/ch4#sec_storage_materialized_views)
+- OLTP (online transaction processing), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Glossary](/en/glossary)
+ - analytics queries versus, [Analytics](/en/ch11#sec_batch_olap)
+ - data normalization, [Trade-offs of normalization](/en/ch3#trade-offs-of-normalization)
+ - workload characteristics, [Actual Serial Execution](/en/ch8#sec_transactions_serial)
+- on-premises deployment, [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud)
+ - data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+- one big table (data warehouse schema), [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
+- one-hot encoding, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- one-to-few relationships, [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
+- one-to-many relationships, [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
+ - JSON representation, [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
+- online systems, [Batch Processing](/en/ch11#ch_batch)
+ - (see also services)
+ - versus scientific computing, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
+- ontologies, [Triple-Stores and SPARQL](/en/ch3#id59)
+- Oozie (workflow scheduler), [Batch Processing](/en/ch11#ch_batch)
+- OpenAPI (service definition format), [Microservices and Serverless](/en/ch1#sec_introduction_microservices), [Web services](/en/ch5#sec_web_services)
+ - use of JSON Schema, [JSON Schema](/en/ch5#json-schema)
+- openCypher (see Cypher (query language))
+- OpenLink Virtuoso (see Virtuoso (database))
+- OpenStack
+ - Swift (object storage), [Object Stores](/en/ch11#id277)
+- operability, [Operability: Making Life Easy for Operations](/en/ch2#id37)
+- operating systems versus databases, [Unbundling Databases](/en/ch13#sec_future_unbundling)
+- operational systems, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)
+ - (see also OLTP)
+ - as systems of record, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
+ - ETL into analytical systems, [Data Warehousing](/en/ch1#sec_introduction_dwh)
+- operational transformation, [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
+- operations teams, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+- operators (query execution), [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+ - in stream processing, [Processing Streams](/en/ch12#sec_stream_processing)
+- optimistic concurrency control, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
+- optimistic locking, [Conditional writes (compare-and-set)](/en/ch8#sec_transactions_compare_and_set)
+- Oracle (database)
+ - distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
+ - GoldenGate (change data capture), [Implementing change data capture](/en/ch12#id307)
+ - hierarchical queries, [Graph Queries in SQL](/en/ch3#id58)
+ - lack of serializability, [Isolation](/en/ch8#sec_transactions_acid_isolation)
+ - leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+ - multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
+ - multi-table index cluster tables, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
+ - not preventing write skew, [Characterizing write skew](/en/ch8#characterizing-write-skew)
+ - PL/SQL language, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
+ - preventing lost updates, [Automatically detecting lost updates](/en/ch8#automatically-detecting-lost-updates)
+ - read committed isolation, [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
+ - Real Application Clusters (RAC), [Locking and leader election](/en/ch10#locking-and-leader-election)
+ - snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation), [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
+ - TimesTen (in-memory database), [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+ - WAL-based replication, [Write-ahead log (WAL) shipping](/en/ch6#write-ahead-log-wal-shipping)
+- ORC (data format), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Column-Oriented Storage](/en/ch4#sec_storage_column)
+ - (see also column-oriented storage)
+- orchestration (service deployment), [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud), [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
+ - batch job execution, [Distributed Job Orchestration](/en/ch11#id278)
+ - workflow engines, [Batch Processing](/en/ch11#ch_batch)
+- ordering
+ - event logs, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - limits of total ordering, [The limits of total ordering](/en/ch13#id335)
+ - logical timestamps, [Logical Clocks](/en/ch10#sec_consistency_timestamps)
+ - of auto-incrementing IDs, [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
+ - shared logs, [Consensus in Practice](/en/ch10#sec_consistency_total_order)-[Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
+- Orkes (workflow engine), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
+- orphan pages (B-trees), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
+- outbox pattern, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
+- outliers (response time), [Average, Median, and Percentiles](/en/ch2#id24)
+- outsourcing, [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud)
+- overload, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
+
+### P
+
+- PACELC principle, [The CAP theorem](/en/ch10#the-cap-theorem)
+- package managers, [Separation of application code and state](/en/ch13#id344)
+- packet switching, [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
+- packets
+ - corruption of, [Weak forms of lying](/en/ch9#weak-forms-of-lying)
+ - sending via UDP, [Direct messaging from producers to consumers](/en/ch12#id296)
+- PageRank (algorithm), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph), [Query languages](/en/ch11#sec_batch_query_lanauges), [Machine Learning](/en/ch11#id290)
+- paging (see virtual memory)
+- pandas (Python library), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes), [Column-Oriented Storage](/en/ch4#sec_storage_column), [DataFrames](/en/ch11#id287)
+- Parquet (data format), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Column-Oriented Storage](/en/ch4#sec_storage_column), [Archival storage](/en/ch5#archival-storage), [Query languages](/en/ch11#sec_batch_query_lanauges)
+ - (see also column-oriented storage)
+ - databases on object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - document data model, [Column-Oriented Storage](/en/ch4#sec_storage_column)
+ - use in batch processing, [MapReduce](/en/ch11#sec_batch_mapreduce)
+- partial failures, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure), [Summary](/en/ch9#summary)
+ - limping, [System Model and Reality](/en/ch9#sec_distributed_system_model)
+- partial synchrony (system model), [System Model and Reality](/en/ch9#sec_distributed_system_model)
+- partition key, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons), [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
+- partitioning (see sharding)
+- Paxos (consensus algorithm), [Consensus](/en/ch10#sec_consistency_consensus), [Consensus in Practice](/en/ch10#sec_consistency_total_order)
+ - ballot number, [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
+ - Multi-Paxos, [Consensus in Practice](/en/ch10#sec_consistency_total_order)
+- payment card industry (PCI), [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
+- PCI (payment card industry) compliance, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
+- percentiles, [Average, Median, and Percentiles](/en/ch2#id24), [Glossary](/en/glossary)
+ - calculating efficiently, [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
+ - importance of high percentiles, [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
+ - use in service level agreements (SLAs), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
+- Percolator (Google), [Implementing a linearizable ID generator](/en/ch10#implementing-a-linearizable-id-generator)
+- Percona XtraBackup (MySQL tool), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+- performance
+ - degradation as fault, [System Model and Reality](/en/ch9#sec_distributed_system_model)
+ - describing, [Describing Performance](/en/ch2#sec_introduction_percentiles)
+ - of distributed transactions, [Distributed Transactions Across Different Systems](/en/ch8#sec_transactions_xa)
+ - of in-memory databases, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+ - of linearizability, [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
+ - of multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
+- permission isolation, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+- perpetual inconsistency, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
+- pessimistic concurrency control, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
+- pglogical (PostgreSQL extension), [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
+- pgvector (vector index), [Vector Embeddings](/en/ch4#id92)
+- phantoms (transaction isolation), [Phantoms causing write skew](/en/ch8#sec_transactions_phantom)
+ - materializing conflicts, [Materializing conflicts](/en/ch8#materializing-conflicts)
+ - preventing, in serializability, [Predicate locks](/en/ch8#predicate-locks)
+- physical clocks (see clocks)
+- pickle (Python), [Language-Specific Formats](/en/ch5#id96)
+- Pinot (database), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Column-Oriented Storage](/en/ch4#sec_storage_column)
+ - handling writes, [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
+ - pre-aggregation, [Analytics](/en/ch11#sec_batch_olap)
+ - serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
+- pipelined execution
+ - in data warehouse queries, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+- pivot table, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- point in time, [Unreliable Clocks](/en/ch9#sec_distributed_clocks)
+- point query, [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
+- Polaris (data catalog), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+- polling, [Representing Users, Posts, and Follows](/en/ch2#id20)
+- polystores, [The meta-database of everything](/en/ch13#id341)
+- POSIX (portable operating system interface)
+ - compliant filesystems, [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Distributed Filesystems](/en/ch11#sec_batch_dfs), [Object Stores](/en/ch11#id277)
+- Post Office Horizon scandal, [Humans and Reliability](/en/ch2#id31)
+ - lack of transactions, [Transactions](/en/ch8#ch_transactions)
+- PostgreSQL (database)
+ - archiving WAL to object stores, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - change data capture, [Implementing change data capture](/en/ch12#id307), [API support for change streams](/en/ch12#sec_stream_change_api)
+ - distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
+ - foreign data wrappers, [The meta-database of everything](/en/ch13#id341)
+ - full text search support, [Combining Specialized Tools by Deriving Data](/en/ch13#id442)
+ - in the cloud, [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
+ - JSON Schema validation, [JSON Schema](/en/ch5#json-schema)
+ - leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+ - log sequence number, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - logical decoding, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
+ - materialized view maintenance, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+ - multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
+ - MVCC implementation, [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl), [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
+ - partitioning vs. sharding, [Sharding](/en/ch7#ch_sharding)
+ - pgvector (vector index), [Vector Embeddings](/en/ch4#id92)
+ - PL/pgSQL language, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
+ - PostGIS geospatial indexes, [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
+ - preventing lost updates, [Automatically detecting lost updates](/en/ch8#automatically-detecting-lost-updates)
+ - preventing write skew, [Characterizing write skew](/en/ch8#characterizing-write-skew), [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)
+ - read committed isolation, [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
+ - representing graphs, [Property Graphs](/en/ch3#id56)
+ - serializable snapshot isolation (SSI), [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)
+ - sharding (see Citus (database))
+ - snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation), [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
+ - WAL-based replication, [Write-ahead log (WAL) shipping](/en/ch6#write-ahead-log-wal-shipping)
+- postings list, [Full-Text Search](/en/ch4#sec_storage_full_text)
+ - in sharded indexes, [Local Secondary Indexes](/en/ch7#id166)
+- postmortems, blameless, [Humans and Reliability](/en/ch2#id31)
+- PouchDB (database), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+- Power BI (business intelligence software), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Analytics](/en/ch11#sec_batch_olap)
+- pre-aggregation, [Analytics](/en/ch11#sec_batch_olap)
+ - serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
+- pre-splitting, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)
+- Precision Time Protocol (PTP), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
+- predicate locks, [Predicate locks](/en/ch8#predicate-locks)
+- predictive analytics, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics), [Predictive Analytics](/en/ch14#id369)-[Feedback Loops](/en/ch14#id372)
+ - amplifying bias, [Bias and Discrimination](/en/ch14#id370)
+ - ethics of (see ethics)
+ - feedback loops, [Feedback Loops](/en/ch14#id372)
+- preemption, [Resource Allocation](/en/ch11#id279)
+ - in distributed schedulers, [Handling Faults](/en/ch11#id281)
+ - of threads, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+- Prefect (workflow scheduler), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows), [Batch Processing](/en/ch11#ch_batch), [Scheduling Workflows](/en/ch11#sec_batch_workflows)
+ - cloud data warehouse integration, [Query languages](/en/ch11#sec_batch_query_lanauges)
+- Presto (query engine), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+- primary keys, [Multi-Column and Secondary Indexes](/en/ch4#sec_storage_index_multicolumn), [Glossary](/en/glossary)
+ - auto-incrementing, [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
+ - versus partition key, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+- primary-backup replication (see leader-based replication)
+- privacy, [Privacy and Tracking](/en/ch14#id373)-[Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
+ - consent and freedom of choice, [Consent and Freedom of Choice](/en/ch14#id375)
+ - data as assets and power, [Data as Assets and Power](/en/ch14#id376)
+ - deleting data, [Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+ - ethical considerations (see ethics)
+ - legislation and self-regulation, [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
+ - meaning of, [Privacy and Use of Data](/en/ch14#id457)
+ - regulation, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
+ - surveillance, [Surveillance](/en/ch14#id374)
+ - tracking behavioral data, [Privacy and Tracking](/en/ch14#id373)
+- probabilistic algorithms, [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla), [Stream analytics](/en/ch12#id318)
+- process pauses, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)-[Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
+- processing time (of events), [Reasoning About Time](/en/ch12#sec_stream_time)
+- producers (message streams), [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
+- product analytics, [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
+ - column-oriented storage, [Column-Oriented Storage](/en/ch4#sec_storage_column)
+- programming languages
+ - for stored procedures, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
+- projections (event sourcing), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+- Prolog (language), [Datalog: Recursive Relational Queries](/en/ch3#id62)
+ - (see also Datalog)
+- property graphs, [Property Graphs](/en/ch3#id56)
+ - Cypher query language, [The Cypher Query Language](/en/ch3#id57)
+ - Property Graph Query Language (PGQL), [Graph Queries in SQL](/en/ch3#id58)
+- property-based testing, [Humans and Reliability](/en/ch2#id31), [Formal Methods and Randomized Testing](/en/ch9#sec_distributed_formal)
+- Protocol Buffers (data format), [Protocol Buffers](/en/ch5#sec_encoding_protobuf)-[Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
+ - field tags and schema evolution, [Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
+- provenance of data, [Designing for auditability](/en/ch13#id365)
+- publish/subscribe model, [Messaging Systems](/en/ch12#sec_stream_messaging)
+- publishers (message streams), [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
+- Pulsar (streaming platform), [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
+- PyTorch (machine learning library), [Machine Learning](/en/ch11#id290)
+
+### Q
+
+- Qpid (messaging), [Message brokers compared to databases](/en/ch12#id297)
+- quality of service (QoS), [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
+- Quantcast File System (distributed filesystem), [Object Stores](/en/ch11#id277)
+- query engines
+ - compilation and vectorization, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+ - in cloud data warehouse, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+ - operators, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+ - optimizing declarative queries, [Data Models and Query Languages](/en/ch3#ch_datamodels)
+- query languages
+ - Cypher, [The Cypher Query Language](/en/ch3#id57)
+ - Datalog, [Datalog: Recursive Relational Queries](/en/ch3#id62)
+ - GraphQL, [GraphQL](/en/ch3#id63)
+ - MongoDB aggregation pipeline, [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization), [Query languages for documents](/en/ch3#query-languages-for-documents)
+ - recursive SQL queries, [Graph Queries in SQL](/en/ch3#id58)
+ - SPARQL, [The SPARQL query language](/en/ch3#the-sparql-query-language)
+ - SQL, [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)
+- query optimizers, [Query languages](/en/ch11#sec_batch_query_lanauges)
+- query plans, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+- queueing delays, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+ - head-of-line blocking, [Latency and Response Time](/en/ch2#id23)
+ - latency and response time, [Latency and Response Time](/en/ch2#id23)
+- queues (messaging), [Message brokers](/en/ch5#message-brokers)
+- QUIC (protocol), [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
+- quorums, [Quorums for reading and writing](/en/ch6#sec_replication_quorum_condition)-[Multi-region operation](/en/ch6#multi-region-operation), [Glossary](/en/glossary)
+ - for leaderless replication, [Quorums for reading and writing](/en/ch6#sec_replication_quorum_condition)
+ - in consensus algorithms, [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
+ - limitations of consistency, [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations)-[Monitoring staleness](/en/ch6#monitoring-staleness), [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
+ - making decisions in distributed systems, [The Majority Rules](/en/ch9#sec_distributed_majority)
+ - monitoring staleness, [Monitoring staleness](/en/ch6#monitoring-staleness)
+ - multi-region replication, [Multi-region operation](/en/ch6#multi-region-operation)
+ - relying on durability, [Mapping system models to the real world](/en/ch9#mapping-system-models-to-the-real-world)
+- quotas, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+
+### R
+
+- R (language), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes), [DataFrames](/en/ch11#id287)
+- R-trees (indexes), [Multidimensional and Full-Text Indexes](/en/ch4#sec_storage_multidimensional)
+- R2 (object storage), [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- RabbitMQ (messaging), [Message brokers](/en/ch5#message-brokers), [Message brokers compared to databases](/en/ch12#id297)
+ - quorum queues (replication), [Single-Leader Replication](/en/ch6#sec_replication_leader)
+- race conditions, [Isolation](/en/ch8#sec_transactions_acid_isolation)
+ - (see also concurrency)
+ - avoiding with linearizability, [Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
+ - caused by dual writes, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
+ - causing loss of money, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
+ - dirty writes, [No dirty writes](/en/ch8#sec_transactions_dirty_write)
+ - in counter increments, [No dirty writes](/en/ch8#sec_transactions_dirty_write)
+ - lost updates, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)-[Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
+ - preventing with event logs, [Concurrency control](/en/ch12#sec_stream_concurrency), [Dataflow: Interplay between state changes and application code](/en/ch13#id450)
+ - preventing with serializable isolation, [Serializability](/en/ch8#sec_transactions_serializability)
+ - weak transaction isolation, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)
+ - write skew, [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Materializing conflicts](/en/ch8#materializing-conflicts)
+- Raft (consensus algorithm), [Consensus](/en/ch10#sec_consistency_consensus), [Consensus in Practice](/en/ch10#sec_consistency_total_order)
+ - leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+ - sensitivity to network problems, [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
+ - term number, [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
+ - use in etcd, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
+- RAID (Redundant Array of Independent Disks), [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute), [Tolerating hardware faults through redundancy](/en/ch2#tolerating-hardware-faults-through-redundancy), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- railways, schema migration on, [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
+- RAM (see memory)
+- RAMCloud (in-memory storage), [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+- random writes (access pattern), [Sequential versus random writes](/en/ch4#sidebar_sequential)
+- range queries
+ - in B-trees, [B-Trees](/en/ch4#sec_storage_b_trees), [Read performance](/en/ch4#read-performance)
+ - in LSM-trees, [Read performance](/en/ch4#read-performance)
+ - not efficient in hash maps, [Log-Structured Storage](/en/ch4#sec_storage_log_structured)
+ - with hash sharding, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+- ranking algorithms, [Machine Learning](/en/ch11#id290)
+- Ray (workflow scheduler), [Machine Learning](/en/ch11#id290)
+- RDF (Resource Description Framework), [The RDF data model](/en/ch3#the-rdf-data-model)
+ - querying with SPARQL, [The SPARQL query language](/en/ch3#the-sparql-query-language)
+- RDMA (Remote Direct Memory Access), [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Cloud Computing Versus Supercomputing](/en/ch1#id17)
+- React (user interface library), [End-to-end event streams](/en/ch13#id349)
+- reactive programming, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+- read committed isolation level, [Read Committed](/en/ch8#sec_transactions_read_committed)-[Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
+ - implementing, [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
+ - multi-version concurrency control (MVCC), [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl)
+ - no dirty reads, [No dirty reads](/en/ch8#no-dirty-reads)
+ - no dirty writes, [No dirty writes](/en/ch8#sec_transactions_dirty_write)
+- read models (event sourcing), [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+- read path (derived data), [Observing Derived State](/en/ch13#sec_future_observing)
+- read repair (leaderless replication), [Catching up on missed writes](/en/ch6#sec_replication_read_repair)
+ - for linearizability, [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
+- read replicas (see leader-based replication)
+- read skew (transaction isolation), [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation), [Summary](/en/ch8#summary)
+- read uncommitted isolation level, [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
+- read-after-write consistency, [Reading Your Own Writes](/en/ch6#sec_replication_ryw), [Timeliness and Integrity](/en/ch13#sec_future_integrity)
+ - cross-device, [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
+ - in derived data systems, [Derived data versus distributed transactions](/en/ch13#sec_future_derived_vs_transactions)
+- read-modify-write cycle, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)
+- read-scaling architecture, [Problems with Replication Lag](/en/ch6#sec_replication_lag), [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
+ - versus sharding, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
+- reads as events, [Reads are events too](/en/ch13#sec_future_read_events)
+- real-time
+ - analytics (see product analytics)
+ - collaborative editing, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
+ - publish/subscribe dataflow, [End-to-end event streams](/en/ch13#id349)
+ - response time guarantees, [Response time guarantees](/en/ch9#sec_distributed_clocks_realtime)
+ - time-of-day clocks, [Time-of-day clocks](/en/ch9#time-of-day-clocks)
+- Realm (database), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+- rebalancing shards, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)-[Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations), [Glossary](/en/glossary)
+ - (see also sharding)
+ - automatic or manual rebalancing, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
+ - fixed number of shards, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
+ - fixed number of shards per node, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+ - problems with hash mod N, [Hash modulo number of nodes](/en/ch7#hash-modulo-number-of-nodes)
+- recency guarantee, [Linearizability](/en/ch10#sec_consistency_linearizability)
+- recommendation engines, [Operational Versus Analytical Systems](/en/ch1#sec_introduction_analytics)
+ - building using DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+ - iterative processing, [Machine Learning](/en/ch11#id290)
+- reconfiguration (consensus), [Subtleties of consensus](/en/ch10#subtleties-of-consensus)
+- records, [MapReduce](/en/ch11#sec_batch_mapreduce)
+ - events in stream processing, [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
+- recursive queries
+ - in Cypher, [The Cypher Query Language](/en/ch3#id57)
+ - in Datalog, [Datalog: Recursive Relational Queries](/en/ch3#id62)
+ - in SPARQL, [The SPARQL query language](/en/ch3#the-sparql-query-language)
+ - lack of, in GraphQL, [GraphQL](/en/ch3#id63)
+ - SQL common table expressions, [Graph Queries in SQL](/en/ch3#id58)
+- Red Hat
+ - Apicurio Registry, [JSON Schema](/en/ch5#json-schema)
+- red-black tree, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+- redelivery (messaging), [Acknowledgments and redelivery](/en/ch12#sec_stream_reordering)
+- Redis (database)
+ - atomic operations, [Atomic write operations](/en/ch8#atomic-write-operations)
+ - CRDT support, [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts)
+ - durability, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+ - Lua scripting, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
+ - multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
+ - process-per-core model, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
+ - single-threaded execution, [Actual Serial Execution](/en/ch8#sec_transactions_serial)
+- redo log (see write-ahead log)
+- Redpanda (messaging), [Message brokers](/en/ch5#message-brokers), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - tiered storage, [Disk space usage](/en/ch12#sec_stream_disk_usage)
+- Redshift (database), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+- redundancy
+ - hardware components, [Tolerating hardware faults through redundancy](/en/ch2#tolerating-hardware-faults-through-redundancy)
+ - of derived data, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived)
+ - (see also derived data)
+- Reed–Solomon codes (error correction), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- refactoring, [Evolvability: Making Change Easy](/en/ch2#sec_introduction_evolvability)
+ - (see also evolvability)
+- regions (geographic distribution), [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
+ - (see also datacenters)
+ - consensus across, [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
+ - definition, [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
+ - latency, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
+ - linearizable ID generation, [Implementing a linearizable ID generator](/en/ch10#implementing-a-linearizable-id-generator)
+ - replication across, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)-[Problems with different topologies](/en/ch6#problems-with-different-topologies), [The Cost of Linearizability](/en/ch10#sec_linearizability_cost), [The limits of total ordering](/en/ch13#id335)
+ - leaderless, [Multi-region operation](/en/ch6#multi-region-operation)
+ - multi-leader, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
+- regions (sharding), [Sharding](/en/ch7#ch_sharding)
+- register (data structure), [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+- regulation (see legal matters)
+- relational data model, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake), [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)-[Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
+ - comparison to document model, [When to Use Which Model](/en/ch3#sec_datamodels_document_summary)-[Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
+ - graph queries in SQL, [Graph Queries in SQL](/en/ch3#id58)
+ - in-memory databases with, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+ - many-to-one and many-to-many relationships, [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many)
+ - multi-object transactions, need for, [The need for multi-object transactions](/en/ch8#sec_transactions_need)
+ - object-relational mismatch, [The Object-Relational Mismatch](/en/ch3#sec_datamodels_document)
+ - representing a reorderable list, [When to Use Which Model](/en/ch3#sec_datamodels_document_summary)
+ - versus document model
+ - convergence of models, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
+ - data locality, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
+- relational databases
+ - eventual consistency, [Problems with Replication Lag](/en/ch6#sec_replication_lag)
+ - history, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history)
+ - leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+ - logical logs, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
+ - philosophy compared to Unix, [Unbundling Databases](/en/ch13#sec_future_unbundling), [The meta-database of everything](/en/ch13#id341)
+ - schema changes, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility), [Encoding and Evolution](/en/ch5#ch_encoding), [Different values written at different times](/en/ch5#different-values-written-at-different-times)
+ - sharded secondary indexes, [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)
+ - statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication)
+ - use of B-tree indexes, [B-Trees](/en/ch4#sec_storage_b_trees)
+- relationships (see edges)
+- reliability, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)-[Humans and Reliability](/en/ch2#id31), [A Philosophy of Streaming Systems](/en/ch13#ch_philosophy)
+ - building a reliable system from unreliable components, [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
+ - hardware faults, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
+ - human errors, [Humans and Reliability](/en/ch2#id31)
+ - importance of, [Humans and Reliability](/en/ch2#id31)
+ - of messaging systems, [Messaging Systems](/en/ch12#sec_stream_messaging)
+ - software faults, [Software faults](/en/ch2#software-faults)
+- Remote Method Invocation (Java RMI), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
+- remote procedure calls (RPCs), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)-[Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
+ - (see also services)
+ - data encoding and evolution, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
+ - issues with, [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
+ - using Avro, [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
+ - versus message brokers, [Event-Driven Architectures](/en/ch5#sec_encoding_dataflow_msg)
+- renewable energy, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
+- repeatable reads (transaction isolation), [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
+- replicas, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+- replication, [Replication](/en/ch6#ch_replication)-[Summary](/en/ch6#summary), [Glossary](/en/glossary)
+ - and durability, [Durability](/en/ch8#durability)
+ - conflict resolution and, [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
+ - consistency properties, [Problems with Replication Lag](/en/ch6#sec_replication_lag)-[Solutions for Replication Lag](/en/ch6#id131)
+ - consistent prefix reads, [Consistent Prefix Reads](/en/ch6#sec_replication_consistent_prefix)
+ - monotonic reads, [Monotonic Reads](/en/ch6#sec_replication_monotonic_reads)
+ - reading your own writes, [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
+ - in distributed filesystems, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+ - leaderless, [Leaderless Replication](/en/ch6#sec_replication_leaderless)-[Version vectors](/en/ch6#version-vectors)
+ - detecting concurrent writes, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)-[Version vectors](/en/ch6#version-vectors)
+ - limitations of quorum consistency, [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations)-[Monitoring staleness](/en/ch6#monitoring-staleness), [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
+ - monitoring staleness, [Monitoring staleness](/en/ch6#monitoring-staleness)
+ - multi-leader, [Multi-Leader Replication](/en/ch6#sec_replication_multi_leader)-[Types of conflict](/en/ch6#sec_replication_write_conflicts)
+ - across multiple regions, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc), [The Cost of Linearizability](/en/ch10#sec_linearizability_cost)
+ - conflict resolution, [Dealing with Conflicting Writes](/en/ch6#sec_replication_write_conflicts)-[Types of conflict](/en/ch6#sec_replication_write_conflicts)
+ - replication topologies, [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)-[Problems with different topologies](/en/ch6#problems-with-different-topologies)
+ - reasons for using, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed), [Replication](/en/ch6#ch_replication)
+ - sharding and, [Sharding](/en/ch7#ch_sharding)
+ - single-leader, [Single-Leader Replication](/en/ch6#sec_replication_leader)-[Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
+ - failover, [Leader failure: Failover](/en/ch6#leader-failure-failover)
+ - implementation of replication logs, [Implementation of Replication Logs](/en/ch6#sec_replication_implementation)-[Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
+ - relation to consensus, [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus), [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
+ - setting up new followers, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - synchronous versus asynchronous, [Synchronous Versus Asynchronous Replication](/en/ch6#sec_replication_sync_async)
+ - state machine replication, [Statement-based replication](/en/ch6#statement-based-replication), [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs), [Using shared logs](/en/ch10#sec_consistency_smr), [Databases and Streams](/en/ch12#sec_stream_databases)
+ - event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - reliance on determinism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+ - using consensus, [Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus)
+ - using erasure coding, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+ - using object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - versus backups, [Replication](/en/ch6#ch_replication)
+ - with heterogeneous data systems, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
+- replication logs (see logs)
+- representations of data (see data models)
+- reprocessing data, [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing), [Unifying batch and stream processing](/en/ch13#id338)
+ - (see also evolvability)
+ - from log-based messaging, [Replaying old messages](/en/ch12#sec_stream_replay)
+- request hedging, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
+- request identifiers, [Uniquely identifying requests](/en/ch13#id355), [Multi-shard request processing](/en/ch13#id360)
+- request routing, [Request Routing](/en/ch7#sec_sharding_routing)
+ - approaches to, [Request Routing](/en/ch7#sec_sharding_routing)
+- residence laws for data, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed), [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+- resilient systems, [Reliability and Fault Tolerance](/en/ch2#sec_introduction_reliability)
+ - (see also fault tolerance)
+- resource isolation, [Cloud Computing Versus Supercomputing](/en/ch1#id17), [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+- resource limits, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+- response time
+ - as performance metric, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Batch Processing](/en/ch11#ch_batch)
+ - guarantees on, [Response time guarantees](/en/ch9#sec_distributed_clocks_realtime)
+ - impact on users, [Average, Median, and Percentiles](/en/ch2#id24)
+ - in replicated systems, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
+ - latency versus, [Latency and Response Time](/en/ch2#id23)
+ - mean and percentiles, [Average, Median, and Percentiles](/en/ch2#id24)
+ - user experience, [Average, Median, and Percentiles](/en/ch2#id24)
+- responsibility and accountability, [Responsibility and Accountability](/en/ch14#id371)
+- REST (Representational State Transfer), [Web services](/en/ch5#sec_web_services)
+ - (see also services)
+- Restate (workflow engine), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
+- RethinkDB (database)
+ - join support, [Convergence of document and relational databases](/en/ch3#convergence-of-document-and-relational-databases)
+ - key-range sharding, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
+- retry storm, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Software faults](/en/ch2#software-faults)
+- reverse ETL, [Beyond the data lake](/en/ch1#beyond-the-data-lake)
+- Riak (database)
+ - CRDT support, [CRDTs and Operational Transformation](/en/ch6#sec_replication_crdts), [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)
+ - dotted version vectors, [Version vectors](/en/ch6#version-vectors)
+ - gossip protocol, [Request Routing](/en/ch7#sec_sharding_routing)
+ - hash sharding, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
+ - leaderless replication, [Leaderless Replication](/en/ch6#sec_replication_leaderless)
+ - linearizability, lack of, [Linearizability and quorums](/en/ch10#sec_consistency_quorum_linearizable)
+ - multi-region support, [Multi-region operation](/en/ch6#multi-region-operation)
+ - rebalancing, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
+ - secondary indexes, [Local Secondary Indexes](/en/ch7#id166)
+ - sloppy quorums, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
+ - vnodes (sharding), [Sharding](/en/ch7#ch_sharding)
+- ring buffers, [Disk space usage](/en/ch12#sec_stream_disk_usage)
+- RisingWave (database)
+ - incremental view maintenance, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+- rockets, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
+- RocksDB (storage engine), [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+ - as embedded storage engine, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+ - leveled compaction, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+ - serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
+- rollbacks (transactions), [Transactions](/en/ch8#ch_transactions)
+- rolling upgrades, [Tolerating hardware faults through redundancy](/en/ch2#tolerating-hardware-faults-through-redundancy), [Encoding and Evolution](/en/ch5#ch_encoding), [Faults and Partial Failures](/en/ch9#sec_distributed_partial_failure)
+ - in a multitenant system, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+- routing (see request routing)
+- row-based replication, [Logical (row-based) log replication](/en/ch6#logical-row-based-log-replication)
+- row-oriented storage, [Column-Oriented Storage](/en/ch4#sec_storage_column)
+- rowhammer (memory corruption), [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
+- RPCs (see remote procedure calls)
+- rules (Datalog), [Datalog: Recursive Relational Queries](/en/ch3#id62)
+- Rust (programming language)
+ - memory management, [Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
+
+### S
+
+- S3 (object storage), [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Batch Processing](/en/ch11#ch_batch), [Distributed Filesystems](/en/ch11#sec_batch_dfs), [Object Stores](/en/ch11#id277)
+ - checking data integrity, [Don't just blindly trust what they promise](/en/ch13#id364)
+ - conditional writes, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
+ - object size, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
+ - S3 Express One Zone, [Object Stores](/en/ch11#id277)
+ - use in MapReduce, [MapReduce](/en/ch11#sec_batch_mapreduce)
+ - workflow example, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
+- SaaS (see software as a service (SaaS))
+- safety and liveness properties, [Safety and liveness](/en/ch9#sec_distributed_safety_liveness)
+ - in consensus algorithms, [Single-value consensus](/en/ch10#single-value-consensus)
+ - in transactions, [Transactions](/en/ch8#ch_transactions)
+- sagas (see compensating transactions)
+- Samza (stream processor), [Stream analytics](/en/ch12#id318)
+- SAP HANA (database), [Data Storage for Analytics](/en/ch4#sec_storage_analytics)
+- scalability, [Scalability](/en/ch2#sec_introduction_scalability)-[Principles for Scalability](/en/ch2#id35), [A Philosophy of Streaming Systems](/en/ch13#ch_philosophy)
+ - auto-scaling, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
+ - by sharding, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
+ - describing load, [Describing Load](/en/ch2#id33)
+ - describing performance, [Describing Performance](/en/ch2#sec_introduction_percentiles)
+ - linear, [Describing Load](/en/ch2#id33)
+ - principles for, [Principles for Scalability](/en/ch2#id35)
+ - replication and, [Problems with Replication Lag](/en/ch6#sec_replication_lag)
+ - scaling up versus scaling out, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing)
+- scaling out, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing)
+ - (see also shared-nothing architecture)
+ - by sharding, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
+- scaling up, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing)
+- SCD (slowly changing dimension), [Time-dependence of joins](/en/ch12#sec_stream_join_time)
+- scheduling
+ - algorithms, [Resource Allocation](/en/ch11#id279)
+ - batch jobs, [Distributed Job Orchestration](/en/ch11#id278)-[Scheduling Workflows](/en/ch11#sec_batch_workflows)
+ - gang scheduling, [Resource Allocation](/en/ch11#id279)
+- schema-on-read, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
+ - comparison to evolvable schema, [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
+- schema-on-write, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
+- schemaless databases (see schema-on-read)
+- schemas, [Glossary](/en/glossary)
+ - Avro, [Avro](/en/ch5#sec_encoding_avro)-[Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
+ - reader determining writer's schema, [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
+ - schema evolution, [The writer's schema and the reader's schema](/en/ch5#the-writers-schema-and-the-readers-schema)
+ - dynamically generated, [Dynamically generated schemas](/en/ch5#dynamically-generated-schemas)
+ - evolution of, [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
+ - affecting application code, [Encoding and Evolution](/en/ch5#ch_encoding)
+ - compatibility checking, [But what is the writer's schema?](/en/ch5#but-what-is-the-writers-schema)
+ - in databases, [Dataflow Through Databases](/en/ch5#sec_encoding_dataflow_db)-[Archival storage](/en/ch5#archival-storage)
+ - in service calls, [Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
+ - flexibility in document model, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
+ - for analytics, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
+ - for JSON and XML, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json), [JSON Schema](/en/ch5#json-schema)
+ - generation and migration using ORMs, [Object-relational mapping (ORM)](/en/ch3#object-relational-mapping-orm)
+ - merits of, [The Merits of Schemas](/en/ch5#sec_encoding_schemas)
+ - migration, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
+ - Protocol Buffers, [Protocol Buffers](/en/ch5#sec_encoding_protobuf)-[Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
+ - schema evolution, [Field tags and schema evolution](/en/ch5#field-tags-and-schema-evolution)
+ - schema migration on railways, [Reprocessing data for application evolution](/en/ch13#sec_future_reprocessing)
+ - traditional approach to design, fallacy in, [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views)
+- scientific computing, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
+- scikit-learn (Python library), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
+- ScyllaDB (database)
+ - cluster metadata, [Request Routing](/en/ch7#sec_sharding_routing)
+ - consistency level ANY, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
+ - hash-range sharding, [Sharding by Hash of Key](/en/ch7#sec_sharding_hash), [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+ - last-write-wins conflict resolution, [Detecting Concurrent Writes](/en/ch6#sec_replication_concurrent)
+ - leaderless replication, [Leaderless Replication](/en/ch6#sec_replication_leaderless)
+ - lightweight transactions, [Single-object writes](/en/ch8#sec_transactions_single_object)
+ - linearizability, lack of, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
+ - log-structured storage, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+ - multi-region support, [Multi-region operation](/en/ch6#multi-region-operation)
+ - use of clocks, [Limitations of Quorum Consistency](/en/ch6#sec_replication_quorum_limitations), [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
+ - vnodes (sharding), [Sharding](/en/ch7#ch_sharding)
+- search engines (see full-text search)
+- searching on streams, [Search on streams](/en/ch12#id320)
+- secondaries (see leader-based replication)
+- secondary indexes, [Multi-Column and Secondary Indexes](/en/ch4#sec_storage_index_multicolumn), [Glossary](/en/glossary)
+ - for many-to-many relationships, [Many-to-One and Many-to-Many Relationships](/en/ch3#sec_datamodels_many_to_many)
+ - problems with dual writes, [Keeping Systems in Sync](/en/ch12#sec_stream_sync), [Reasoning about dataflows](/en/ch13#id443)
+ - sharding, [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)-[Global Secondary Indexes](/en/ch7#id167), [Summary](/en/ch7#summary)
+ - global, [Global Secondary Indexes](/en/ch7#id167)
+ - index maintenance, [Maintaining derived state](/en/ch13#id446)
+ - local, [Local Secondary Indexes](/en/ch7#id166)
+ - updating, transaction isolation and, [The need for multi-object transactions](/en/ch8#sec_transactions_need)
+- secondary sort (MapReduce), [JOIN and GROUP BY](/en/ch11#sec_batch_join)
+- sed (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis)
+- self-hosting, [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud)
+ - data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+- self-joins, [Summary](/en/ch12#id332)
+- self-validating systems, [Don't just blindly trust what they promise](/en/ch13#id364)
+- semantic search, [Vector Embeddings](/en/ch4#id92)
+- semantic similarity, [Vector Embeddings](/en/ch4#id92)
+- semantic web, [Triple-Stores and SPARQL](/en/ch3#id59)
+- semi-synchronous replication, [Synchronous Versus Asynchronous Replication](/en/ch6#sec_replication_sync_async)
+- sequential writes (access pattern), [Sequential versus random writes](/en/ch4#sidebar_sequential)
+- serializability, [Isolation](/en/ch8#sec_transactions_acid_isolation), [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels), [Serializability](/en/ch8#sec_transactions_serializability)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation), [Glossary](/en/glossary)
+ - linearizability versus, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+ - pessimistic versus optimistic concurrency control, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
+ - serial execution, [Actual Serial Execution](/en/ch8#sec_transactions_serial)-[Summary of serial execution](/en/ch8#summary-of-serial-execution)
+ - sharding, [Sharding](/en/ch8#sharding)
+ - using stored procedures, [Encapsulating transactions in stored procedures](/en/ch8#encapsulating-transactions-in-stored-procedures), [Using shared logs](/en/ch10#sec_consistency_smr)
+ - serializable snapshot isolation (SSI), [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
+ - detecting stale MVCC reads, [Detecting stale MVCC reads](/en/ch8#detecting-stale-mvcc-reads)
+ - detecting writes that affect prior reads, [Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
+ - distributed execution, [Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
+ - performance of SSI, [Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
+ - preventing write skew, [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)-[Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
+ - strict serializability, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+ - timeliness vs. integrity, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
+ - two-phase locking (2PL), [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)-[Index-range locks](/en/ch8#sec_transactions_2pl_range)
+ - index-range locks, [Index-range locks](/en/ch8#sec_transactions_2pl_range)
+ - performance, [Performance of two-phase locking](/en/ch8#performance-of-two-phase-locking)
+- Serializable (Java), [Language-Specific Formats](/en/ch5#id96)
+- serialization, [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
+ - (see also encoding)
+- serverless, [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
+- service discovery, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery), [Request Routing](/en/ch7#sec_sharding_routing), [Service discovery](/en/ch10#service-discovery)
+ - registration, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
+ - using DNS, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery), [Request Routing](/en/ch7#sec_sharding_routing), [Service discovery](/en/ch10#service-discovery)
+- service level agreements (SLAs), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla), [Describing Load](/en/ch2#id33)
+- service mesh, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery)
+- Service Organization Control (SOC), [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
+- service time, [Latency and Response Time](/en/ch2#id23)
+- service-oriented architecture (SOA), [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
+ - (see also services)
+- services, [Dataflow Through Services: REST and RPC](/en/ch5#sec_encoding_dataflow_rpc)-[Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
+ - microservices, [Microservices and Serverless](/en/ch1#sec_introduction_microservices)
+ - causal dependencies across services, [The limits of total ordering](/en/ch13#id335)
+ - loose coupling, [Making unbundling work](/en/ch13#sec_future_unbundling_favor)
+ - relation to batch/stream processors, [Batch Processing](/en/ch11#ch_batch), [Stream processors and services](/en/ch13#id345)
+ - remote procedure calls (RPCs), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)-[Data encoding and evolution for RPC](/en/ch5#data-encoding-and-evolution-for-rpc)
+ - issues with, [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
+ - similarity to databases, [Dataflow Through Services: REST and RPC](/en/ch5#sec_encoding_dataflow_rpc)
+ - web services, [Web services](/en/ch5#sec_web_services)
+- session windows (stream processing), [Types of windows](/en/ch12#id324)
+ - (see also windows)
+- sharding, [Sharding](/en/ch7#ch_sharding)-[Summary](/en/ch7#summary), [Glossary](/en/glossary)
+ - and consensus, [Using shared logs](/en/ch10#sec_consistency_smr)
+ - and replication, [Sharding](/en/ch7#ch_sharding)
+ - distributed transactions across shards, [Distributed Transactions](/en/ch8#sec_transactions_distributed)
+ - hot shards, [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
+ - in batch processing, [Batch Processing](/en/ch11#ch_batch)
+ - key-range splitting, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)
+ - multi-shard operations, [Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
+ - enforcing constraints, [Multi-shard request processing](/en/ch13#id360)
+ - secondary index maintenance, [Maintaining derived state](/en/ch13#id446)
+ - of key-value data, [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)-[Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
+ - by key range, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
+ - skew and hot spots, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
+ - origin of the term, [Sharding](/en/ch7#ch_sharding)
+ - partition key, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons), [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
+ - rebalancing
+ - of key-range sharded data, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)
+ - rebalancing shards, [Rebalancing key-range sharded data](/en/ch7#rebalancing-key-range-sharded-data)-[Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
+ - automatic or manual rebalancing, [Operations: Automatic or Manual Rebalancing](/en/ch7#sec_sharding_operations)
+ - problems with hash mod N, [Hash modulo number of nodes](/en/ch7#hash-modulo-number-of-nodes)
+ - using fixed number of shards, [Fixed number of shards](/en/ch7#fixed-number-of-shards)
+ - using N shards per node, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+ - request routing, [Request Routing](/en/ch7#sec_sharding_routing)
+ - secondary indexes, [Sharding and Secondary Indexes](/en/ch7#sec_sharding_secondary_indexes)-[Global Secondary Indexes](/en/ch7#id167)
+ - global, [Global Secondary Indexes](/en/ch7#id167)
+ - local, [Local Secondary Indexes](/en/ch7#id166)
+ - serial execution of transactions and, [Sharding](/en/ch8#sharding)
+ - sorting sharded data, [Shuffling Data](/en/ch11#sec_shuffle)
+- shared logs, [Consensus in Practice](/en/ch10#sec_consistency_total_order)-[Pros and cons of consensus](/en/ch10#pros-and-cons-of-consensus), [The limits of total ordering](/en/ch13#id335), [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
+ - algorithms, [Consensus in Practice](/en/ch10#sec_consistency_total_order)
+ - for event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - for messaging, [Log-based Message Brokers](/en/ch12#sec_stream_log)-[Replaying old messages](/en/ch12#sec_stream_replay)
+ - relation to consensus, [Shared logs as consensus](/en/ch10#sec_consistency_shared_logs)
+ - using, [Using shared logs](/en/ch10#sec_consistency_smr)
+- shared mode (locks), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+- shared-disk architecture, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- shared-memory architecture, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing)
+- shared-nothing architecture, [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing), [Glossary](/en/glossary)
+ - distributed filesystems, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+ - (see also distributed filesystems)
+ - use of network, [Unreliable Networks](/en/ch9#sec_distributed_networks)
+- sharks
+ - biting undersea cables, [Network Faults in Practice](/en/ch9#sec_distributed_network_faults)
+ - counting (example), [Query languages for documents](/en/ch3#query-languages-for-documents)
+- shredding (deletion) (see crypto-shredding)
+- shredding (in columnar encoding), [Column-Oriented Storage](/en/ch4#sec_storage_column)
+- shredding (in relational model), [When to Use Which Model](/en/ch3#sec_datamodels_document_summary)
+- shuffle (batch processing), [Shuffling Data](/en/ch11#sec_shuffle)
+- siblings (concurrent values), [Manual conflict resolution](/en/ch6#manual-conflict-resolution), [Capturing the happens-before relationship](/en/ch6#capturing-the-happens-before-relationship), [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
+ - (see also conflicts)
+- silo, [Data Warehousing](/en/ch1#sec_introduction_dwh)
+- similarity search
+ - edit distance, [Full-Text Search](/en/ch4#sec_storage_full_text)
+ - genome data, [Summary](/en/ch3#summary)
+- simplicity, [Simplicity: Managing Complexity](/en/ch2#id38)
+- Singer, [Data Warehousing](/en/ch1#sec_introduction_dwh)
+- single-instruction-multi-data (SIMD) instructions, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+- single-leader replication (see leader-based replication)
+- single-threaded execution, [Atomic write operations](/en/ch8#atomic-write-operations), [Actual Serial Execution](/en/ch8#sec_transactions_serial)
+ - in stream processing, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Concurrency control](/en/ch12#sec_stream_concurrency), [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
+- SingleStore (database)
+ - in-memory storage, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+- site reliability engineer, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+- size-tiered compaction, [Compaction strategies](/en/ch4#sec_storage_lsm_compaction), [Disk space usage](/en/ch4#disk-space-usage)
+- skew, [Glossary](/en/glossary)
+ - clock skew, [Relying on Synchronized Clocks](/en/ch9#sec_distributed_clocks_relying)-[Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval), [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
+ - in transaction isolation
+ - read skew, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation), [Summary](/en/ch8#summary)
+ - write skew, [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Materializing conflicts](/en/ch8#materializing-conflicts), [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)-[Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
+ - (see also write skew)
+ - meanings of, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
+ - unbalanced workload, [Sharding of Key-Value Data](/en/ch7#sec_sharding_key_value)
+ - compensating for, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
+ - due to celebrities, [Skewed Workloads and Relieving Hot Spots](/en/ch7#sec_sharding_skew)
+ - for time-series data, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
+- skip list, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+- SLA (see service level agreements)
+- Slack (group chat)
+ - GraphQL example, [GraphQL](/en/ch3#id63)
+- SlateDB (database), [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+- sliding windows (stream processing), [Types of windows](/en/ch12#id324)
+ - (see also windows)
+- sloppy quorums, [Single-Leader Versus Leaderless Replication Performance](/en/ch6#sec_replication_leaderless_perf)
+- slowly changing dimension (data warehouses), [Time-dependence of joins](/en/ch12#sec_stream_join_time)
+- smearing (leap seconds adjustments), [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
+- snapshots (databases)
+ - as backups, [Replication](/en/ch6#ch_replication)
+ - computing derived data, [Creating an index](/en/ch13#id340)
+ - in change data capture, [Initial snapshot](/en/ch12#sec_stream_cdc_snapshot)
+ - serializable snapshot isolation (SSI), [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
+ - setting up a new replica, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - snapshot isolation and repeatable read, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)-[Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
+ - implementing with MVCC, [Multi-version concurrency control (MVCC)](/en/ch8#sec_transactions_snapshot_impl)
+ - indexes and MVCC, [Indexes and snapshot isolation](/en/ch8#indexes-and-snapshot-isolation)
+ - visibility rules, [Visibility rules for observing a consistent snapshot](/en/ch8#sec_transactions_mvcc_visibility)
+ - synchronized clocks for global snapshots, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+- Snowflake (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native), [Layering of cloud services](/en/ch1#layering-of-cloud-services), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Batch Processing](/en/ch11#ch_batch)
+ - column-oriented storage, [Column-Oriented Storage](/en/ch4#sec_storage_column)
+ - handling writes, [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
+ - sharding and clustering, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+ - Snowpark, [Query languages](/en/ch11#sec_batch_query_lanauges)
+- Snowflake (ID generator), [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
+- snowflake schemas, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
+- SOAP (web services), [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
+- SOC2 (see Service Organization Control (SOC))
+- social graph, [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+- society
+ - responsibility towards, [Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance), [Legislation and Self-Regulation](/en/ch14#sec_future_legislation)
+- sociotechnical systems, [Humans and Reliability](/en/ch2#id31)
+- software as a service (SaaS), [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs), [Cloud Versus Self-Hosting](/en/ch1#sec_introduction_cloud)
+ - ETL from, [Data Warehousing](/en/ch1#sec_introduction_dwh)
+ - multitenancy, [Sharding for Multitenancy](/en/ch7#sec_sharding_multitenancy)
+- software bugs, [Software faults](/en/ch2#software-faults)
+ - maintaining integrity, [Maintaining integrity in the face of software bugs](/en/ch13#id455)
+- solar storm, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
+- solid state drives (SSDs)
+ - access patterns, [Sequential versus random writes](/en/ch4#sidebar_sequential)
+ - compared to object storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - detecting corruption, [The end-to-end argument](/en/ch13#sec_future_e2e_argument), [Don't just blindly trust what they promise](/en/ch13#id364)
+ - failure rate, [Hardware and Software Faults](/en/ch2#sec_introduction_hardware_faults)
+ - faults in, [Durability](/en/ch8#durability)
+ - firmware bugs, [Software faults](/en/ch2#software-faults)
+ - read throughput, [Read performance](/en/ch4#read-performance)
+ - sequential vs. random writes, [Sequential versus random writes](/en/ch4#sidebar_sequential)
+- Solr (search server)
+ - local secondary indexes, [Local Secondary Indexes](/en/ch7#id166)
+ - request routing, [Request Routing](/en/ch7#sec_sharding_routing)
+ - use of Lucene, [Full-Text Search](/en/ch4#sec_storage_full_text)
+- sort (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis), [Sorting Versus In-memory Aggregation](/en/ch11#id275), [Distributed Job Orchestration](/en/ch11#id278)
+- sort-merge joins (MapReduce), [JOIN and GROUP BY](/en/ch11#sec_batch_join)
+- Sorted String Tables (see SSTables)
+- sorting
+ - sort order in column storage, [Sort Order in Column Storage](/en/ch4#sort-order-in-column-storage)
+- source of truth (see systems of record)
+- Spanner (database)
+ - consistency model, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+ - data locality, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
+ - in the cloud, [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native)
+ - snapshot isolation using clocks, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+ - transactions, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
+ - TrueTime API, [Clock readings with a confidence interval](/en/ch9#clock-readings-with-a-confidence-interval)
+- Spark (processing framework), [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native), [Batch Processing](/en/ch11#ch_batch), [Dataflow Engines](/en/ch11#sec_batch_dataflow)
+ - cost efficiency, [Query languages](/en/ch11#sec_batch_query_lanauges)
+ - DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes), [DataFrames](/en/ch11#id287)
+ - fault tolerance, [Handling Faults](/en/ch11#id281)
+ - for data warehouses, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+ - high availability using ZooKeeper, [Coordination Services](/en/ch10#sec_consistency_coordination)
+ - MLlib, [Machine Learning](/en/ch11#id290)
+ - query optimizer, [Query languages](/en/ch11#sec_batch_query_lanauges)
+ - shuffling data, [Shuffling Data](/en/ch11#sec_shuffle)
+ - Spark Streaming, [Stream analytics](/en/ch12#id318)
+ - microbatching, [Microbatching and checkpointing](/en/ch12#id329)
+ - streaming SQL support, [Complex event processing](/en/ch12#id317)
+ - use for ETL, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
+- SPARQL (query language), [The SPARQL query language](/en/ch3#the-sparql-query-language)
+- sparse index, [The SSTable file format](/en/ch4#the-sstable-file-format)
+- sparse matrices, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- split brain, [Leader failure: Failover](/en/ch6#leader-failure-failover), [Request Routing](/en/ch7#sec_sharding_routing), [Glossary](/en/glossary)
+ - enforcing constraints, [Uniqueness constraints require consensus](/en/ch13#id452)
+ - in consensus algorithms, [Consensus](/en/ch10#sec_consistency_consensus), [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
+ - preventing, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
+ - using fencing tokens to avoid, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)-[Fencing with multiple replicas](/en/ch9#fencing-with-multiple-replicas)
+- spot instances, [Handling Faults](/en/ch11#id281)
+- spreadsheets, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+ - dataflow programming, [Designing Applications Around Dataflow](/en/ch13#sec_future_dataflow)
+ - pivot table, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- SQL (Structured Query Language), [Simplicity: Managing Complexity](/en/ch2#id38), [Relational Model versus Document Model](/en/ch3#sec_datamodels_history), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+ - for analytics, [Data Warehousing](/en/ch1#sec_introduction_dwh), [Column-Oriented Storage](/en/ch4#sec_storage_column)
+ - graph queries in, [Graph Queries in SQL](/en/ch3#id58)
+ - isolation levels standard, issues with, [Snapshot isolation, repeatable read, and naming confusion](/en/ch8#snapshot-isolation-repeatable-read-and-naming-confusion)
+ - joins, [Normalization, Denormalization, and Joins](/en/ch3#sec_datamodels_normalization)
+ - résumé (example), [The document data model for one-to-many relationships](/en/ch3#the-document-data-model-for-one-to-many-relationships)
+ - social network home timelines (example), [Representing Users, Posts, and Follows](/en/ch2#id20)
+ - SQL injection vulnerability, [Byzantine Faults](/en/ch9#sec_distributed_byzantine)
+ - statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication)
+ - stored procedures, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
+ - support in batch processing frameworks, [Batch Processing](/en/ch11#ch_batch)
+ - views, [Datalog: Recursive Relational Queries](/en/ch3#id62)
+- SQL Server (database)
+ - archiving WAL to object stores, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+ - change data capture, [Implementing change data capture](/en/ch12#id307)
+ - data warehousing support, [Data Storage for Analytics](/en/ch4#sec_storage_analytics)
+ - distributed transaction support, [XA transactions](/en/ch8#xa-transactions)
+ - leader-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+ - multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
+ - preventing lost updates, [Automatically detecting lost updates](/en/ch8#automatically-detecting-lost-updates)
+ - preventing write skew, [Characterizing write skew](/en/ch8#characterizing-write-skew), [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+ - read committed isolation, [Implementing read committed](/en/ch8#sec_transactions_read_committed_impl)
+ - serializable isolation, [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+ - snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
+ - T-SQL language, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
+- SQLite (database), [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems), [Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+ - archiving WAL to object stores, [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+- SRE (site reliability engineer), [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+- SSDs (see solid state drives)
+- SSTables (storage format), [The SSTable file format](/en/ch4#the-sstable-file-format)-[Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+ - constructing and maintaining, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+ - making LSM-Tree from, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+- staged rollout (see rolling upgrades)
+- staleness (old data), [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
+ - cross-channel timing dependencies, [Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
+ - in leaderless databases, [Writing to the Database When a Node Is Down](/en/ch6#id287)
+ - in multi-version concurrency control, [Detecting stale MVCC reads](/en/ch8#detecting-stale-mvcc-reads)
+ - monitoring for, [Monitoring staleness](/en/ch6#monitoring-staleness)
+ - of client state, [Pushing state changes to clients](/en/ch13#id348)
+ - versus linearizability, [Linearizability](/en/ch10#sec_consistency_linearizability)
+ - versus timeliness, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
+- standbys (see leader-based replication)
+- star replication topologies, [Multi-leader replication topologies](/en/ch6#sec_replication_topologies)
+- star schemas, [Stars and Snowflakes: Schemas for Analytics](/en/ch3#sec_datamodels_analytics)
+- Star Wars analogy (event time versus processing time), [Event time versus processing time](/en/ch12#id322)
+- starvation (scheduling), [Resource Allocation](/en/ch11#id279)
+- state
+ - derived from log of immutable events, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)
+ - interplay between state changes and application code, [Dataflow: Interplay between state changes and application code](/en/ch13#id450)
+ - maintaining derived state, [Maintaining derived state](/en/ch13#id446)
+ - maintenance by stream processor in stream-stream joins, [Stream-stream join (window join)](/en/ch12#id440)
+ - observing derived state, [Observing Derived State](/en/ch13#sec_future_observing)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
+ - rebuilding after stream processor failure, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
+ - separation of application code and, [Separation of application code and state](/en/ch13#id344)
+- state machine replication, [Statement-based replication](/en/ch6#statement-based-replication), [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs), [Using shared logs](/en/ch10#sec_consistency_smr), [Databases and Streams](/en/ch12#sec_stream_databases)
+ - event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - reliance on determinism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+- stateless systems, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)
+- statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication)
+ - reliance on determinism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+- statically typed languages
+ - analogy to schema-on-write, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
+- statistical and numerical algorithms, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- StatsD (metrics aggregator), [Direct messaging from producers to consumers](/en/ch12#id296)
+- stock market feeds, [Direct messaging from producers to consumers](/en/ch12#id296)
+- STONITH (Shoot The Other Node In The Head), [Leader failure: Failover](/en/ch6#leader-failure-failover)
+ - problems with, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
+- stop-the-world (see garbage collection)
+- storage
+ - composing data storage technologies, [Composing Data Storage Technologies](/en/ch13#id447)-[Unbundled versus integrated systems](/en/ch13#id448)
+- Storage Area Network (SAN), [Shared-Memory, Shared-Disk, and Shared-Nothing Architecture](/en/ch2#sec_introduction_shared_nothing), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- storage engines, [Storage and Retrieval](/en/ch4#ch_storage)-[Summary](/en/ch4#summary)
+ - column-oriented, [Column-Oriented Storage](/en/ch4#sec_storage_column)-[Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+ - column compression, [Column Compression](/en/ch4#sec_storage_column_compression)
+ - defined, [Column-Oriented Storage](/en/ch4#sec_storage_column)
+ - Parquet, [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses), [Column-Oriented Storage](/en/ch4#sec_storage_column), [Archival storage](/en/ch5#archival-storage)
+ - sort order in, [Sort Order in Column Storage](/en/ch4#sort-order-in-column-storage)
+ - versus wide-column model, [Column Compression](/en/ch4#sec_storage_column_compression)
+ - writing to, [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
+ - in-memory storage, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+ - durability, [Durability](/en/ch8#durability)
+ - row-oriented, [Storage and Indexing for OLTP](/en/ch4#sec_storage_oltp)-[Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+ - B-trees, [B-Trees](/en/ch4#sec_storage_b_trees)-[B-tree variants](/en/ch4#b-tree-variants)
+ - comparing B-trees and LSM-trees, [Comparing B-Trees and LSM-Trees](/en/ch4#sec_storage_btree_lsm_comparison)-[Disk space usage](/en/ch4#disk-space-usage)
+ - defined, [Column-Oriented Storage](/en/ch4#sec_storage_column)
+ - log-structured, [Log-Structured Storage](/en/ch4#sec_storage_log_structured)-[Compaction strategies](/en/ch4#sec_storage_lsm_compaction)
+- stored procedures, [Encapsulating transactions in stored procedures](/en/ch8#encapsulating-transactions-in-stored-procedures)-[Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs), [Glossary](/en/glossary)
+ - and shared logs, [Using shared logs](/en/ch10#sec_consistency_smr)
+ - pros and cons of, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
+ - similarity to stream processors, [Application code as a derivation function](/en/ch13#sec_future_dataflow_derivation)
+- Storm (stream processor), [Stream analytics](/en/ch12#id318)
+ - distributed RPC, [Event-Driven Architectures and RPC](/en/ch12#sec_stream_actors_drpc), [Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
+ - Trident state handling, [Idempotence](/en/ch12#sec_stream_idempotence)
+- straggler events, [Handling straggler events](/en/ch12#id323)
+- Stream Control Transmission Protocol (SCTP), [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
+- stream processing, [Processing Streams](/en/ch12#sec_stream_processing)-[Summary](/en/ch12#id332), [Glossary](/en/glossary)
+ - accessing external services within job, [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins), [Microbatching and checkpointing](/en/ch12#id329), [Idempotence](/en/ch12#sec_stream_idempotence), [Exactly-once execution of an operation](/en/ch13#id353)
+ - combining with batch processing, [Unifying batch and stream processing](/en/ch13#id338)
+ - comparison to batch processing, [Processing Streams](/en/ch12#sec_stream_processing)
+ - complex event processing (CEP), [Complex event processing](/en/ch12#id317)
+ - fault tolerance, [Fault Tolerance](/en/ch12#sec_stream_fault_tolerance)-[Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
+ - atomic commit, [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit)
+ - idempotence, [Idempotence](/en/ch12#sec_stream_idempotence)
+ - microbatching and checkpointing, [Microbatching and checkpointing](/en/ch12#id329)
+ - rebuilding state after a failure, [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
+ - for data integration, [Batch and Stream Processing](/en/ch13#sec_future_batch_streaming)-[Unifying batch and stream processing](/en/ch13#id338)
+ - for event sourcing, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - maintaining derived state, [Maintaining derived state](/en/ch13#id446)
+ - maintenance of materialized views, [Maintaining materialized views](/en/ch12#sec_stream_mat_view)
+ - messaging systems (see messaging systems)
+ - reasoning about time, [Reasoning About Time](/en/ch12#sec_stream_time)-[Types of windows](/en/ch12#id324)
+ - event time versus processing time, [Event time versus processing time](/en/ch12#id322), [Microbatching and checkpointing](/en/ch12#id329), [Unifying batch and stream processing](/en/ch13#id338)
+ - knowing when window is ready, [Handling straggler events](/en/ch12#id323)
+ - types of windows, [Types of windows](/en/ch12#id324)
+ - relation to databases (see streams)
+ - relation to services, [Stream processors and services](/en/ch13#id345)
+ - relationship to batch processing, [Batch Processing](/en/ch11#ch_batch)
+ - search on streams, [Search on streams](/en/ch12#id320)
+ - single-threaded execution, [Logs compared to traditional messaging](/en/ch12#sec_stream_logs_vs_messaging), [Concurrency control](/en/ch12#sec_stream_concurrency)
+ - stream analytics, [Stream analytics](/en/ch12#id318)
+ - stream joins, [Stream Joins](/en/ch12#sec_stream_joins)-[Time-dependence of joins](/en/ch12#sec_stream_join_time)
+ - stream-stream join, [Stream-stream join (window join)](/en/ch12#id440)
+ - stream-table join, [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins)
+ - table-table join, [Table-table join (materialized view maintenance)](/en/ch12#id326)
+ - time-dependence of, [Time-dependence of joins](/en/ch12#sec_stream_join_time)
+- streams, [Stream Processing](/en/ch12#ch_stream)-[Replaying old messages](/en/ch12#sec_stream_replay)
+ - end-to-end, pushing events to clients, [End-to-end event streams](/en/ch13#id349)
+ - messaging systems (see messaging systems)
+ - processing (see stream processing)
+ - relation to databases, [Databases and Streams](/en/ch12#sec_stream_databases)-[Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+ - (see also changelogs)
+ - API support for change streams, [API support for change streams](/en/ch12#sec_stream_change_api)
+ - change data capture, [Change Data Capture](/en/ch12#sec_stream_cdc)-[API support for change streams](/en/ch12#sec_stream_change_api)
+ - derivative of state by time, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)
+ - event sourcing, [Change data capture versus event sourcing](/en/ch12#sec_stream_event_sourcing)
+ - keeping systems in sync, [Keeping Systems in Sync](/en/ch12#sec_stream_sync)
+ - philosophy of immutable events, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)-[Limitations of immutability](/en/ch12#sec_stream_immutability_limitations)
+ - topics, [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
+- strict serializability, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+ - timeliness vs. integrity, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
+- striping (in columnar encoding), [Column-Oriented Storage](/en/ch4#sec_storage_column)
+- strong consistency (see linearizability)
+- strong eventual consistency, [Automatic conflict resolution](/en/ch6#automatic-conflict-resolution)
+- strong one-copy serializability, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+- subjects, predicates, and objects (in triple-stores), [Triple-Stores and SPARQL](/en/ch3#id59)
+- subscribers (message streams), [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
+ - (see also consumers)
+- supercomputers, [Cloud Computing Versus Supercomputing](/en/ch1#id17)
+- Superset (data visualization software), [Analytics](/en/ch11#sec_batch_olap)
+- surveillance, [Surveillance](/en/ch14#id374)
+ - (see also privacy)
+- sushi principle, [From data warehouse to data lake](/en/ch1#from-data-warehouse-to-data-lake)
+- sustainability, [Distributed Versus Single-Node Systems](/en/ch1#sec_introduction_distributed)
+- Swagger (service definition format), [Web services](/en/ch5#sec_web_services)
+- swapping to disk (see virtual memory)
+- Swift (programming language)
+ - memory management, [Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
+- sync engines, [Sync Engines and Local-First Software](/en/ch6#sec_replication_offline_clients)-[Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+ - examples of, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+ - for local-first software, [Real-time collaboration, offline-first, and local-first apps](/en/ch6#real-time-collaboration-offline-first-and-local-first-apps)
+- synchronous networks, [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks), [Glossary](/en/glossary)
+ - comparison to asynchronous networks, [Synchronous Versus Asynchronous Networks](/en/ch9#sec_distributed_sync_networks)
+ - system model, [System Model and Reality](/en/ch9#sec_distributed_system_model)
+- synchronous replication, [Synchronous Versus Asynchronous Replication](/en/ch6#sec_replication_sync_async), [Glossary](/en/glossary)
+ - with multiple leaders, [Multi-Leader Replication](/en/ch6#sec_replication_multi_leader)
+- system administrator, [Operations in the Cloud Era](/en/ch1#sec_introduction_operations)
+- system models, [Knowledge, Truth, and Lies](/en/ch9#sec_distributed_truth), [System Model and Reality](/en/ch9#sec_distributed_system_model)-[Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+ - assumptions in, [Trust, but Verify](/en/ch13#sec_future_verification)
+ - correctness of algorithms, [Defining the correctness of an algorithm](/en/ch9#defining-the-correctness-of-an-algorithm)
+ - mapping to the real world, [Mapping system models to the real world](/en/ch9#mapping-system-models-to-the-real-world)
+ - safety and liveness, [Safety and liveness](/en/ch9#sec_distributed_safety_liveness)
+- systems of record, [Systems of Record and Derived Data](/en/ch1#sec_introduction_derived), [Glossary](/en/glossary)
+ - change data capture, [Implementing change data capture](/en/ch12#id307), [Reasoning about dataflows](/en/ch13#id443)
+ - event logs, [Event Sourcing and CQRS](/en/ch3#sec_datamodels_events)
+ - treating event log as, [State, Streams, and Immutability](/en/ch12#sec_stream_immutability)
+- systems thinking, [Feedback Loops](/en/ch14#id372)
+
+### T
+
+- t-digest (algorithm), [Use of Response Time Metrics](/en/ch2#sec_introduction_slo_sla)
+- table-table joins, [Table-table join (materialized view maintenance)](/en/ch12#id326)
+- Tableau (data visualization software), [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp), [Analytics](/en/ch11#sec_batch_olap)
+- tail (Unix tool), [Using logs for message storage](/en/ch12#id300)
+- tail latency (see latency)
+- tail vertex (property graphs), [Property Graphs](/en/ch3#id56)
+- task (workflows) (see workflow engines)
+- TCP (Transmission Control Protocol), [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
+ - comparison to circuit switching, [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
+ - comparison to UDP, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+ - connection failures, [Detecting Faults](/en/ch9#id307)
+ - flow control, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing), [Messaging Systems](/en/ch12#sec_stream_messaging)
+ - packet checksums, [Weak forms of lying](/en/ch9#weak-forms-of-lying), [The end-to-end argument](/en/ch13#sec_future_e2e_argument), [Trust, but Verify](/en/ch13#sec_future_verification)
+ - reliability and duplicate suppression, [Duplicate suppression](/en/ch13#id354)
+ - retransmission timeouts, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+ - use for transaction sessions, [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)
+- Temporal (workflow engine), [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
+- TensorFlow (machine learning library), [Machine Learning](/en/ch11#id290)
+- Teradata (database), [Cloud-Native System Architecture](/en/ch1#sec_introduction_cloud_native), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+- term-partitioned indexes (see global secondary indexes)
+- termination (consensus), [Single-value consensus](/en/ch10#single-value-consensus), [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
+- testing, [Humans and Reliability](/en/ch2#id31)
+- thrashing (out of memory), [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+- threads (concurrency)
+ - actor model, [Distributed actor frameworks](/en/ch5#distributed-actor-frameworks), [Event-Driven Architectures and RPC](/en/ch12#sec_stream_actors_drpc)
+ - (see also event-driven architecture)
+ - atomic operations, [Atomicity](/en/ch8#sec_transactions_acid_atomicity)
+ - background threads, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables)
+ - execution pauses, [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable), [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+ - memory barriers, [Linearizability and network delays](/en/ch10#linearizability-and-network-delays)
+ - preemption, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+ - single (see single-threaded execution)
+- three-phase commit, [Three-phase commit](/en/ch8#three-phase-commit)
+- three-way relationships, [Property Graphs](/en/ch3#id56)
+- Thrift (data format), [Protocol Buffers](/en/ch5#sec_encoding_protobuf)
+- throughput, [Describing Performance](/en/ch2#sec_introduction_percentiles), [Describing Load](/en/ch2#id33), [Batch Processing](/en/ch11#ch_batch)
+- TIBCO, [Message brokers](/en/ch5#message-brokers)
+ - Enterprise Message Service, [Message brokers compared to databases](/en/ch12#id297)
+ - StreamBase (stream analytics), [Complex event processing](/en/ch12#id317)
+- TiDB (database)
+ - consensus-based replication, [Single-Leader Replication](/en/ch6#sec_replication_leader)
+ - regions (sharding), [Sharding](/en/ch7#ch_sharding)
+ - request routing, [Request Routing](/en/ch7#sec_sharding_routing)
+ - serving derived data, [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
+ - sharded secondary indexes, [Global Secondary Indexes](/en/ch7#id167)
+ - snapshot isolation support, [Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
+ - timestamp oracle, [Implementing a linearizable ID generator](/en/ch10#implementing-a-linearizable-id-generator)
+ - transactions, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
+ - use of model checking, [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
+- tiered storage, [Setting Up New Followers](/en/ch6#sec_replication_new_replica), [Disk space usage](/en/ch12#sec_stream_disk_usage)
+- TigerBeetle (database), [Summary](/en/ch3#summary)
+ - deterministic simulation testing, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+- TigerGraph (database)
+ - GSQL language, [Graph Queries in SQL](/en/ch3#id58)
+- Tigris (object storage), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- TileDB (database), [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+- time
+ - concurrency and, [The "happens-before" relation and concurrency](/en/ch6#sec_replication_happens_before)
+ - cross-channel timing dependencies, [Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
+ - in distributed systems, [Unreliable Clocks](/en/ch9#sec_distributed_clocks)-[Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
+ - (see also clocks)
+ - clock synchronization and accuracy, [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
+ - relying on synchronized clocks, [Relying on Synchronized Clocks](/en/ch9#sec_distributed_clocks_relying)-[Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+ - process pauses, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)-[Limiting the impact of garbage collection](/en/ch9#sec_distributed_gc_impact)
+ - reasoning about, in stream processors, [Reasoning About Time](/en/ch12#sec_stream_time)-[Types of windows](/en/ch12#id324)
+ - event time versus processing time, [Event time versus processing time](/en/ch12#id322), [Microbatching and checkpointing](/en/ch12#id329), [Unifying batch and stream processing](/en/ch13#id338)
+ - knowing when window is ready, [Handling straggler events](/en/ch12#id323)
+ - timestamp of events, [Whose clock are you using, anyway?](/en/ch12#id438)
+ - types of windows, [Types of windows](/en/ch12#id324)
+ - system models for distributed systems, [System Model and Reality](/en/ch9#sec_distributed_system_model)
+ - time-dependence in stream joins, [Time-dependence of joins](/en/ch12#sec_stream_join_time)
+- time series data
+ - as DataFrames, [DataFrames, Matrices, and Arrays](/en/ch3#sec_datamodels_dataframes)
+ - column-oriented storage, [Column-Oriented Storage](/en/ch4#sec_storage_column)
+- time-of-day clocks, [Time-of-day clocks](/en/ch9#time-of-day-clocks)
+ - hybrid logical clocks, [Hybrid logical clocks](/en/ch10#hybrid-logical-clocks)
+- timeliness, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
+ - coordination-avoiding data systems, [Coordination-avoiding data systems](/en/ch13#id454)
+ - correctness of dataflow systems, [Correctness of dataflow systems](/en/ch13#id453)
+- timeouts, [Unreliable Networks](/en/ch9#sec_distributed_networks), [Glossary](/en/glossary)
+ - dynamic configuration of, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+ - for failover, [Leader failure: Failover](/en/ch6#leader-failure-failover)
+ - length of, [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing)
+- TimescaleDB (database), [Column-Oriented Storage](/en/ch4#sec_storage_column)
+- timestamps, [Logical Clocks](/en/ch10#sec_consistency_timestamps)
+ - assigning to events in stream processing, [Whose clock are you using, anyway?](/en/ch12#id438)
+ - for read-after-write consistency, [Reading Your Own Writes](/en/ch6#sec_replication_ryw)
+ - for transaction ordering, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+ - insufficiency for enforcing constraints, [Enforcing constraints using logical clocks](/en/ch10#enforcing-constraints-using-logical-clocks)
+ - key range sharding by, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
+ - Lamport, [Lamport timestamps](/en/ch10#lamport-timestamps)
+ - logical, [Ordering events to capture causality](/en/ch13#sec_future_capture_causality)
+ - ordering events, [Timestamps for ordering events](/en/ch9#sec_distributed_lww)
+ - timestamp oracle, [Implementing a linearizable ID generator](/en/ch10#implementing-a-linearizable-id-generator)
+- TLA+ (specification language), [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
+- token bucket (limiting retries), [Describing Performance](/en/ch2#sec_introduction_percentiles)
+- tombstones, [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Disk space usage](/en/ch4#disk-space-usage), [Log compaction](/en/ch12#sec_stream_log_compaction)
+- topics (messaging), [Message brokers](/en/ch5#message-brokers), [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
+- torn pages (B-trees), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
+- total order, [Glossary](/en/glossary)
+ - broadcast (see shared logs)
+ - limits of, [The limits of total ordering](/en/ch13#id335)
+ - on logical timestamps, [Logical Clocks](/en/ch10#sec_consistency_timestamps)
+- tracing, [Problems with Distributed Systems](/en/ch1#sec_introduction_dist_sys_problems)
+- tracking behavioral data, [Privacy and Tracking](/en/ch14#id373)
+ - (see also privacy)
+- trade-offs, [Trade-offs in Data Systems Architecture](/en/ch1#ch_tradeoffs)-[Data Systems, Law, and Society](/en/ch1#sec_introduction_compliance)
+- transaction coordinator (see coordinator)
+- transaction manager (see coordinator)
+- transaction processing, [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
+ - comparison to analytics, [Characterizing Transaction Processing and Analytics](/en/ch1#sec_introduction_oltp)
+ - comparison to data warehousing, [Data Storage for Analytics](/en/ch4#sec_storage_analytics)
+- transactions, [Transactions](/en/ch8#ch_transactions)-[Summary](/en/ch8#summary), [Glossary](/en/glossary)
+ - ACID properties of, [The Meaning of ACID](/en/ch8#sec_transactions_acid)
+ - atomicity, [Atomicity](/en/ch8#sec_transactions_acid_atomicity)
+ - consistency, [Consistency](/en/ch8#sec_transactions_acid_consistency)
+ - durability, [Making B-trees reliable](/en/ch4#sec_storage_btree_wal), [Durability](/en/ch8#durability)
+ - isolation, [Isolation](/en/ch8#sec_transactions_acid_isolation)
+ - and derived data integrity, [Timeliness and Integrity](/en/ch13#sec_future_integrity)
+ - and replication, [Solutions for Replication Lag](/en/ch6#id131)
+ - compensating (see compensating transactions)
+ - concept of, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview)
+ - distributed transactions, [Distributed Transactions](/en/ch8#sec_transactions_distributed)-[Exactly-once message processing revisited](/en/ch8#exactly-once-message-processing-revisited)
+ - avoiding, [Derived data versus distributed transactions](/en/ch13#sec_future_derived_vs_transactions), [Making unbundling work](/en/ch13#sec_future_unbundling_favor), [Enforcing Constraints](/en/ch13#sec_future_constraints)-[Coordination-avoiding data systems](/en/ch13#id454)
+ - failure amplification, [Maintaining derived state](/en/ch13#id446)
+ - for sharded systems, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
+ - in doubt/uncertain status, [Coordinator failure](/en/ch8#coordinator-failure), [Holding locks while in doubt](/en/ch8#holding-locks-while-in-doubt)
+ - two-phase commit, [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)-[Three-phase commit](/en/ch8#three-phase-commit)
+ - use of, [Distributed Transactions Across Different Systems](/en/ch8#sec_transactions_xa)-[Exactly-once message processing](/en/ch8#sec_transactions_exactly_once)
+ - XA transactions, [XA transactions](/en/ch8#xa-transactions)-[Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
+ - OLTP versus analytics queries, [Analytics](/en/ch11#sec_batch_olap)
+ - purpose of, [Transactions](/en/ch8#ch_transactions)
+ - serializability, [Serializability](/en/ch8#sec_transactions_serializability)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
+ - actual serial execution, [Actual Serial Execution](/en/ch8#sec_transactions_serial)-[Summary of serial execution](/en/ch8#summary-of-serial-execution)
+ - pessimistic versus optimistic concurrency control, [Pessimistic versus optimistic concurrency control](/en/ch8#pessimistic-versus-optimistic-concurrency-control)
+ - serializable snapshot isolation (SSI), [Serializable Snapshot Isolation (SSI)](/en/ch8#sec_transactions_ssi)-[Performance of serializable snapshot isolation](/en/ch8#performance-of-serializable-snapshot-isolation)
+ - two-phase locking (2PL), [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)-[Index-range locks](/en/ch8#sec_transactions_2pl_range)
+ - single-object and multi-object, [Single-Object and Multi-Object Operations](/en/ch8#sec_transactions_multi_object)-[Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
+ - handling errors and aborts, [Handling errors and aborts](/en/ch8#handling-errors-and-aborts)
+ - need for multi-object transactions, [The need for multi-object transactions](/en/ch8#sec_transactions_need)
+ - single-object writes, [Single-object writes](/en/ch8#sec_transactions_single_object)
+ - snapshot isolation (see snapshots)
+ - strict serializability, [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition)
+ - weak isolation levels, [Weak Isolation Levels](/en/ch8#sec_transactions_isolation_levels)-[Materializing conflicts](/en/ch8#materializing-conflicts)
+ - preventing lost updates, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)-[Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
+ - read committed, [Read Committed](/en/ch8#sec_transactions_read_committed)-[Snapshot Isolation and Repeatable Read](/en/ch8#sec_transactions_snapshot_isolation)
+- traversal (graphs), [Property Graphs](/en/ch3#id56)
+- trie (data structure), [Constructing and merging SSTables](/en/ch4#constructing-and-merging-sstables), [Full-Text Search](/en/ch4#sec_storage_full_text)
+ - as SSTable index, [The SSTable file format](/en/ch4#the-sstable-file-format)
+- triggers (databases), [Transmitting Event Streams](/en/ch12#sec_stream_transmit)
+- Trino (data warehouse), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+ - federated databases, [The meta-database of everything](/en/ch13#id341)
+ - query optimizer, [Query languages](/en/ch11#sec_batch_query_lanauges)
+ - use for ETL, [Extract--Transform--Load (ETL)](/en/ch11#sec_batch_etl_usage)
+ - workflow example, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
+- triple-stores, [Triple-Stores and SPARQL](/en/ch3#id59)-[The SPARQL query language](/en/ch3#the-sparql-query-language)
+ - SPARQL query language, [The SPARQL query language](/en/ch3#the-sparql-query-language)
+- tumbling windows (stream processing), [Types of windows](/en/ch12#id324)
+ - (see also windows)
+ - in microbatching, [Microbatching and checkpointing](/en/ch12#id329)
+- Turbopuffer (vector search), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+- Turtle (RDF data format), [Triple-Stores and SPARQL](/en/ch3#id59)
+- Twitter (see X (social network))
+- two-phase commit (2PC), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)-[Coordinator failure](/en/ch8#coordinator-failure), [Glossary](/en/glossary)
+ - confusion with two-phase locking, [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)
+ - coordinator failure, [Coordinator failure](/en/ch8#coordinator-failure)
+ - coordinator recovery, [Recovering from coordinator failure](/en/ch8#recovering-from-coordinator-failure)
+ - how it works, [A system of promises](/en/ch8#a-system-of-promises)
+ - performance cost, [Distributed Transactions Across Different Systems](/en/ch8#sec_transactions_xa)
+ - problems with XA transactions, [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
+ - transactions holding locks, [Holding locks while in doubt](/en/ch8#holding-locks-while-in-doubt)
+- two-phase locking (2PL), [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)-[Index-range locks](/en/ch8#sec_transactions_2pl_range), [What Makes a System Linearizable?](/en/ch10#sec_consistency_lin_definition), [Glossary](/en/glossary)
+ - confusion with two-phase commit, [Two-Phase Locking (2PL)](/en/ch8#sec_transactions_2pl)
+ - growing and shrinking phases, [Implementation of two-phase locking](/en/ch8#implementation-of-two-phase-locking)
+ - index-range locks, [Index-range locks](/en/ch8#sec_transactions_2pl_range)
+ - performance of, [Performance of two-phase locking](/en/ch8#performance-of-two-phase-locking)
+- type checking, dynamic versus static, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
+
+### U
+
+- UDP (User Datagram Protocol)
+ - comparison to TCP, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+ - multicast, [Direct messaging from producers to consumers](/en/ch12#id296)
+- Ultima Online (game), [Sharding](/en/ch7#ch_sharding)
+- unbounded datasets, [Stream Processing](/en/ch12#ch_stream), [Glossary](/en/glossary)
+ - (see also streams)
+- unbounded delays, [Glossary](/en/glossary)
+ - in networks, [Timeouts and Unbounded Delays](/en/ch9#sec_distributed_queueing)
+ - process pauses, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+- unbundling databases, [Unbundling Databases](/en/ch13#sec_future_unbundling)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
+ - composing data storage technologies, [Composing Data Storage Technologies](/en/ch13#id447)-[Unbundled versus integrated systems](/en/ch13#id448)
+ - federation versus unbundling, [The meta-database of everything](/en/ch13#id341)
+ - designing applications around dataflow, [Designing Applications Around Dataflow](/en/ch13#sec_future_dataflow)-[Stream processors and services](/en/ch13#id345)
+ - observing derived state, [Observing Derived State](/en/ch13#sec_future_observing)-[Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
+ - materialized views and caching, [Materialized views and caching](/en/ch13#id451)
+ - multi-shard data processing, [Multi-shard data processing](/en/ch13#sec_future_unbundled_multi_shard)
+ - pushing state changes to clients, [Pushing state changes to clients](/en/ch13#id348)
+- uncertain (transaction status) (see in doubt)
+- union type (in Avro), [Schema evolution rules](/en/ch5#schema-evolution-rules)
+- uniq (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis), [Distributed Job Orchestration](/en/ch11#id278)
+- uniqueness constraints
+ - asynchronously checked, [Loosely interpreted constraints](/en/ch13#id362)
+ - requiring consensus, [Uniqueness constraints require consensus](/en/ch13#id452)
+ - requiring linearizability, [Constraints and uniqueness guarantees](/en/ch10#sec_consistency_uniqueness)
+ - uniqueness in log-based messaging, [Uniqueness in log-based messaging](/en/ch13#sec_future_uniqueness_log)
+- Unity (data catalog), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+- universally unique identifiers (see UUIDs)
+- Unix philosophy
+ - comparison to relational databases, [Unbundling Databases](/en/ch13#sec_future_unbundling), [The meta-database of everything](/en/ch13#id341)
+ - comparison to stream processing, [Processing Streams](/en/ch12#sec_stream_processing)
+- Unix pipes, [Simple Log Analysis](/en/ch11#sec_batch_log_analysis)
+ - compared to distributed batch processing, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
+- UPDATE statement (SQL), [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
+- updates
+ - preventing lost updates, [Preventing Lost Updates](/en/ch8#sec_transactions_lost_update)-[Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
+ - atomic write operations, [Atomic write operations](/en/ch8#atomic-write-operations)
+ - automatically detecting lost updates, [Automatically detecting lost updates](/en/ch8#automatically-detecting-lost-updates)
+ - compare-and-set (CAS), [Conditional writes (compare-and-set)](/en/ch8#sec_transactions_compare_and_set)
+ - conflict resolution and replication, [Conflict resolution and replication](/en/ch8#conflict-resolution-and-replication)
+ - using explicit locking, [Explicit locking](/en/ch8#explicit-locking)
+ - preventing write skew, [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Materializing conflicts](/en/ch8#materializing-conflicts)
+- utilization
+ - batch process scheduling, [Resource Allocation](/en/ch11#id279)
+ - increasing through preemption, [Handling Faults](/en/ch11#id281)
+ - trade-off with latency, [Can we not simply make network delays predictable?](/en/ch9#can-we-not-simply-make-network-delays-predictable)
+- uTP protocol (BitTorrent), [The Limitations of TCP](/en/ch9#sec_distributed_tcp)
+- UUIDs, [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
+
+### V
+
+- validity (consensus), [Single-value consensus](/en/ch10#single-value-consensus), [Atomic commitment as consensus](/en/ch10#atomic-commitment-as-consensus)
+- vBuckets (sharding), [Sharding](/en/ch7#ch_sharding)
+- vector clocks, [Version vectors](/en/ch6#version-vectors)
+ - (see also version vectors)
+ - and Lamport/hybrid logical clocks, [Lamport/hybrid logical clocks versus vector clocks](/en/ch10#lamporthybrid-logical-clocks-vs-vector-clocks)
+ - and version vectors, [Version vectors](/en/ch6#version-vectors)
+- vector embedding, [Vector Embeddings](/en/ch4#id92)
+- vectorized processing, [Query Execution: Compilation and Vectorization](/en/ch4#sec_storage_vectorized)
+- vendor lock-in, [Pros and Cons of Cloud Services](/en/ch1#sec_introduction_cloud_tradeoffs)
+- Venice (database), [Serving Derived Data](/en/ch11#sec_batch_serving_derived)
+- verification, [Trust, but Verify](/en/ch13#sec_future_verification)-[Tools for auditable data systems](/en/ch13#id366)
+ - avoiding blind trust, [Don't just blindly trust what they promise](/en/ch13#id364)
+ - designing for auditability, [Designing for auditability](/en/ch13#id365)
+ - end-to-end integrity checks, [The end-to-end argument again](/en/ch13#id456)
+ - tools for auditable data systems, [Tools for auditable data systems](/en/ch13#id366)
+- version control systems
+ - merge conflicts, [Manual conflict resolution](/en/ch6#manual-conflict-resolution)
+ - reliance on immutable data, [Concurrency control](/en/ch12#sec_stream_concurrency)
+- version vectors, [Problems with different topologies](/en/ch6#problems-with-different-topologies), [Version vectors](/en/ch6#version-vectors)
+ - dotted, [Version vectors](/en/ch6#version-vectors)
+ - versus vector clocks, [Version vectors](/en/ch6#version-vectors)
+- Vertica (database), [Cloud Data Warehouses](/en/ch4#sec_cloud_data_warehouses)
+ - handling writes, [Writing to Column-Oriented Storage](/en/ch4#writing-to-column-oriented-storage)
+- vertical scaling (see scaling up)
+- vertices (in graphs), [Graph-Like Data Models](/en/ch3#sec_datamodels_graph)
+ - property graph model, [Property Graphs](/en/ch3#id56)
+- video games, [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+- video transcoding (example), [Cross-channel timing dependencies](/en/ch10#cross-channel-timing-dependencies)
+- views (SQL queries), [Datalog: Recursive Relational Queries](/en/ch3#id62)
+ - materialized views (see materialization)
+- Viewstamped Replication (consensus algorithm), [Consensus](/en/ch10#sec_consistency_consensus), [Consensus in Practice](/en/ch10#sec_consistency_total_order)
+ - use of model-checking, [Model checking and specification languages](/en/ch9#model-checking-and-specification-languages)
+ - view number, [From single-leader replication to consensus](/en/ch10#from-single-leader-replication-to-consensus)
+- virtual block device, [Separation of storage and compute](/en/ch1#sec_introduction_storage_compute)
+- virtual file system, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+ - comparison to distributed filesystems, [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- virtual machines, [Layering of cloud services](/en/ch1#layering-of-cloud-services)
+ - context switches, [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+ - network performance, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+ - noisy neighbors, [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+ - virtualized clocks in, [Clock Synchronization and Accuracy](/en/ch9#sec_distributed_clock_accuracy)
+- virtual memory
+ - process pauses due to page faults, [Latency and Response Time](/en/ch2#id23), [Process Pauses](/en/ch9#sec_distributed_clocks_pauses)
+- Virtuoso (database), [The SPARQL query language](/en/ch3#the-sparql-query-language)
+- VisiCalc (spreadsheets), [Designing Applications Around Dataflow](/en/ch13#sec_future_dataflow)
+- Vitess (database)
+ - key-range sharding, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
+- vnodes (sharding), [Sharding](/en/ch7#ch_sharding)
+- vocabularies, [Triple-Stores and SPARQL](/en/ch3#id59)
+- Voice over IP (VoIP), [Network congestion and queueing](/en/ch9#network-congestion-and-queueing)
+- VoltDB (database)
+ - cross-shard serializability, [Sharding](/en/ch8#sharding)
+ - deterministic stored procedures, [Pros and cons of stored procedures](/en/ch8#sec_transactions_stored_proc_tradeoffs)
+ - in-memory storage, [Keeping everything in memory](/en/ch4#sec_storage_inmemory)
+ - process-per-core model, [Pros and Cons of Sharding](/en/ch7#sec_sharding_reasons)
+ - secondary indexes, [Local Secondary Indexes](/en/ch7#id166)
+ - serial execution of transactions, [Actual Serial Execution](/en/ch8#sec_transactions_serial)
+ - statement-based replication, [Statement-based replication](/en/ch6#statement-based-replication), [Rebuilding state after a failure](/en/ch12#sec_stream_state_fault_tolerance)
+ - transactions in stream processing, [Atomic commit revisited](/en/ch12#sec_stream_atomic_commit)
+
+### W
+
+- WAL (write-ahead log), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal)
+- WAL-G (backup tool), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+- WarpStream (messaging), [Disk space usage](/en/ch12#sec_stream_disk_usage)
+- web services (see services)
+- webhooks, [Direct messaging from producers to consumers](/en/ch12#id296)
+- webMethods (messaging), [Message brokers](/en/ch5#message-brokers)
+- WebSocket (protocol), [Pushing state changes to clients](/en/ch13#id348)
+- wide-column data model, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
+ - versus column-oriented storage, [Column Compression](/en/ch4#sec_storage_column_compression)
+- windows (stream processing), [Stream analytics](/en/ch12#id318), [Reasoning About Time](/en/ch12#sec_stream_time)-[Types of windows](/en/ch12#id324)
+ - infinite windows for changelogs, [Maintaining materialized views](/en/ch12#sec_stream_mat_view), [Stream-table join (stream enrichment)](/en/ch12#sec_stream_table_joins)
+ - knowing when all events have arrived, [Handling straggler events](/en/ch12#id323)
+ - stream joins within a window, [Stream-stream join (window join)](/en/ch12#id440)
+ - types of windows, [Types of windows](/en/ch12#id324)
+- WITH RECURSIVE syntax (SQL), [Graph Queries in SQL](/en/ch3#id58)
+- Word2Vec (language model), [Vector Embeddings](/en/ch4#id92)
+- workflow engines, [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
+ - Airflow (see Airflow (workflow scheduler))
+ - batch processing, [Scheduling Workflows](/en/ch11#sec_batch_workflows)
+ - Camunda (see Camunda (workflow engine))
+ - Dagster (see Dagster (workflow scheduler))
+ - durable execution, [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
+ - ETL (see ETL (extract-transform-load))
+ - executor, [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows)
+ - orchestrators, [Durable Execution and Workflows](/en/ch5#sec_encoding_dataflow_workflows), [Batch Processing](/en/ch11#ch_batch)
+ - Orkes (see Orkes (workflow engine))
+ - Prefect (see Prefect (workflow scheduler))
+ - reliance on determinism, [Deterministic simulation testing](/en/ch9#deterministic-simulation-testing)
+ - Restate (see Restate (workflow engine))
+ - Temporal (see Temporal (workflow engine))
+- working set, [Sorting Versus In-memory Aggregation](/en/ch11#id275)
+- write amplification, [Write amplification](/en/ch4#write-amplification)
+- write path (derived data), [Observing Derived State](/en/ch13#sec_future_observing)
+- write skew (transaction isolation), [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Materializing conflicts](/en/ch8#materializing-conflicts)
+ - characterizing, [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew)-[Phantoms causing write skew](/en/ch8#sec_transactions_phantom), [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)
+ - examples of, [Write Skew and Phantoms](/en/ch8#sec_transactions_write_skew), [More examples of write skew](/en/ch8#more-examples-of-write-skew)
+ - materializing conflicts, [Materializing conflicts](/en/ch8#materializing-conflicts)
+ - occurrence in practice, [Maintaining integrity in the face of software bugs](/en/ch13#id455)
+ - phantoms, [Phantoms causing write skew](/en/ch8#sec_transactions_phantom)
+ - preventing
+ - in snapshot isolation, [Decisions based on an outdated premise](/en/ch8#decisions-based-on-an-outdated-premise)-[Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
+ - in two-phase locking, [Predicate locks](/en/ch8#predicate-locks)-[Index-range locks](/en/ch8#sec_transactions_2pl_range)
+ - options for, [Characterizing write skew](/en/ch8#characterizing-write-skew)
+- write-ahead log (WAL), [Making B-trees reliable](/en/ch4#sec_storage_btree_wal), [Write-ahead log (WAL) shipping](/en/ch6#write-ahead-log-wal-shipping)
+ - in durable execution, [Durable execution](/en/ch5#durable-execution)
+- writes (database)
+ - atomic write operations, [Atomic write operations](/en/ch8#atomic-write-operations)
+ - detecting writes affecting prior reads, [Detecting writes that affect prior reads](/en/ch8#sec_detecting_writes_affect_reads)
+ - preventing dirty writes with read committed, [No dirty writes](/en/ch8#sec_transactions_dirty_write)
+- WS-\* framework, [The problems with remote procedure calls (RPCs)](/en/ch5#sec_problems_with_rpc)
+- WS-AtomicTransaction (2PC), [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc)
+
+### X
+
+- X (social network)
+ - constructing home timelines (example), [Case Study: Social Network Home Timelines](/en/ch2#sec_introduction_twitter), [Deriving several views from the same event log](/en/ch12#sec_stream_deriving_views), [Table-table join (materialized view maintenance)](/en/ch12#id326), [Materialized views and caching](/en/ch13#id451)
+ - cost of joins, [Denormalization in the social networking case study](/en/ch3#denormalization-in-the-social-networking-case-study)
+ - describing load, [Describing Load](/en/ch2#id33)
+ - fault tolerance, [Fault Tolerance](/en/ch2#id27)
+ - performance metrics, [Describing Performance](/en/ch2#sec_introduction_percentiles)
+ - DistributedLog (event log), [Using logs for message storage](/en/ch12#id300)
+ - Snowflake (ID generator), [ID Generators and Logical Clocks](/en/ch10#sec_consistency_logical)
+- XA transactions, [Two-Phase Commit (2PC)](/en/ch8#sec_transactions_2pc), [XA transactions](/en/ch8#xa-transactions)-[Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
+ - heuristic decisions, [Recovering from coordinator failure](/en/ch8#recovering-from-coordinator-failure)
+ - problems with, [Problems with XA transactions](/en/ch8#problems-with-xa-transactions)
+- xargs (Unix tool), [Simple Log Analysis](/en/ch11#sec_batch_log_analysis)
+- XFS (file system), [Distributed Filesystems](/en/ch11#sec_batch_dfs)
+- XGBoost (machine learning library), [Machine Learning](/en/ch11#id290)
+- XML
+ - binary variants, [Binary encoding](/en/ch5#binary-encoding)
+ - data locality, [Data locality for reads and writes](/en/ch3#sec_datamodels_document_locality)
+ - encoding RDF data, [The RDF data model](/en/ch3#the-rdf-data-model)
+ - for application data, issues with, [JSON, XML, and Binary Variants](/en/ch5#sec_encoding_json)
+ - in relational databases, [Schema flexibility in the document model](/en/ch3#sec_datamodels_schema_flexibility)
+ - XML databases, [Relational Model versus Document Model](/en/ch3#sec_datamodels_history), [Query languages for documents](/en/ch3#query-languages-for-documents)
+- Xorq (query engine), [The meta-database of everything](/en/ch13#id341)
+- XPath, [Query languages for documents](/en/ch3#query-languages-for-documents)
+- XQuery, [Query languages for documents](/en/ch3#query-languages-for-documents)
+
+### Y
+
+- Yahoo
+ - response time study, [Average, Median, and Percentiles](/en/ch2#id24)
+- YARN (job scheduler), [Distributed Job Orchestration](/en/ch11#id278), [Separation of application code and state](/en/ch13#id344)
+ - ApplicationMaster, [Distributed Job Orchestration](/en/ch11#id278)
+- Yjs (CRDT library), [Pros and cons of sync engines](/en/ch6#pros-and-cons-of-sync-engines)
+- YugabyteDB (database)
+ - hash-range sharding, [Sharding by hash range](/en/ch7#sharding-by-hash-range)
+ - key-range sharding, [Sharding by Key Range](/en/ch7#sec_sharding_key_range)
+ - multi-leader replication, [Geographically Distributed Operation](/en/ch6#sec_replication_multi_dc)
+ - request routing, [Request Routing](/en/ch7#sec_sharding_routing)
+ - sharded secondary indexes, [Global Secondary Indexes](/en/ch7#id167)
+ - tablets (sharding), [Sharding](/en/ch7#ch_sharding)
+ - transactions, [What Exactly Is a Transaction?](/en/ch8#sec_transactions_overview), [Database-internal Distributed Transactions](/en/ch8#sec_transactions_internal)
+ - use of clock synchronization, [Synchronized clocks for global snapshots](/en/ch9#sec_distributed_spanner)
+
+### Z
+
+- Zab (consensus algorithm), [Consensus](/en/ch10#sec_consistency_consensus), [Consensus in Practice](/en/ch10#sec_consistency_total_order)
+ - use in ZooKeeper, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
+- zero-copy, [Formats for Encoding Data](/en/ch5#sec_encoding_formats)
+- zero-disk architecture (ZDA), [Setting Up New Followers](/en/ch6#sec_replication_new_replica)
+- ZeroMQ (messaging library), [Direct messaging from producers to consumers](/en/ch12#id296)
+- zombies (split brain), [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens)
+- zones (cloud computing) (see availability zones)
+- ZooKeeper (coordination service), [Coordination Services](/en/ch10#sec_consistency_coordination)-[Service discovery](/en/ch10#service-discovery)
+ - generating fencing tokens, [Fencing off zombies and delayed requests](/en/ch9#sec_distributed_fencing_tokens), [Using shared logs](/en/ch10#sec_consistency_smr), [Coordination Services](/en/ch10#sec_consistency_coordination)
+ - linearizable operations, [Implementing Linearizable Systems](/en/ch10#sec_consistency_implementing_linearizable)
+ - locks and leader election, [Locking and leader election](/en/ch10#locking-and-leader-election)
+ - observers, [Service discovery](/en/ch10#service-discovery)
+ - use for service discovery, [Load balancers, service discovery, and service meshes](/en/ch5#sec_encoding_service_discovery), [Service discovery](/en/ch10#service-discovery)
+ - use for shard assignment, [Request Routing](/en/ch7#sec_sharding_routing)
+ - use of Zab algorithm, [Consensus](/en/ch10#sec_consistency_consensus)
diff --git a/content/en/part-iii.md b/content/en/part-iii.md
index a1b48ce..0ec1200 100644
--- a/content/en/part-iii.md
+++ b/content/en/part-iii.md
@@ -61,12 +61,13 @@ This point will be a running theme throughout this part of the book.
We will start in [Chapter 11](/en/ch11) by examining batch-oriented dataflow systems such as MapReduce, and see how they give us good tools and principles for building large-scale data systems.
In [Chapter 12](/en/ch12) we will take those ideas and apply them to data streams, which allow us to do the same kinds of things with lower delays.
-[Chapter 13](/en/ch13) concludes the book by exploring ideas about how we might use these tools to build reliable, scalable, and maintainable applications in the future.
+In [Chapter 13](/en/ch13) we explore ideas about how we might use these tools to build reliable, scalable, and maintainable applications in the future.
+[Chapter 14](/en/ch14) concludes the book by discussing ethics, privacy, and the social impact of data systems.
## Index
- [11. Batch Processing](/en/ch11) (WIP)
- [12. Stream Processing](/en/ch12) (WIP)
-- [13. Doing the Right Thing](/en/ch13) (WIP)
-
+- [13. A Philosophy of Streaming Systems](/en/ch13) (WIP)
+- [14. Doing the Right Thing](/en/ch14) (WIP)
diff --git a/content/en/toc.md b/content/en/toc.md
index de5760b..0f474fa 100644
--- a/content/en/toc.md
+++ b/content/en/toc.md
@@ -368,22 +368,26 @@ breadcrumbs: false
## [11. Batch Processing](/en/ch11)
- [……](/en/ch11#)
-- [Summary](/en/ch11#summary)
+- [Summary](/en/ch11#id292)
- [References](/en/ch11#references)
## [12. Stream Processing](/en/ch12)
- [……](/en/ch12#)
-- [Summary](/en/ch12#summary)
+- [Summary](/en/ch12#id332)
- [References](/en/ch12#references)
-## [13. Do the Right Thing](/en/ch13)
+## [13. A Philosophy of Streaming Systems](/en/ch13)
- [……](/en/ch13#)
-- [Summary](/en/ch13#summary)
+- [Summary](/en/ch13#id367)
- [References](/en/ch13#references)
+## [14. Doing the Right Thing](/en/ch14)
+- [……](/en/ch14#)
+- [Summary](/en/ch14#id594)
+- [References](/en/ch14#references)
+
## [Glossary](/en/glossary)
## [Colophon](/en/colophon)
- [About the Author](/en/colophon#about-the-author)
- [Colophon](/en/colophon#colophon)
-
diff --git a/hugo.yaml b/hugo.yaml
index 7f61ea2..ae3b435 100644
--- a/hugo.yaml
+++ b/hugo.yaml
@@ -127,22 +127,26 @@ menu:
name: "PostgreSQL 14 内参 ↗"
url: "https://postgres-internals.cn/"
weight: 9
- - identifier: pigsty
- name: "Pigsty Free PG RDS ↗"
- url: "https://pgsty.com/"
+ - identifier: pigsty-cc
+ name: "Pigsty:开源 PG RDS ↗"
+ url: "https://pigsty.cc/"
weight: 10
+ - identifier: pigsty-io
+ name: "Pigsty: Free PG RDS ↗"
+ url: "https://pigsty.io/"
+ weight: 11
- identifier: pgext
name: "PG 扩展目录 ↗"
url: "https://ext.pgsty.com/zh"
- weight: 11
+ weight: 12
- identifier: ddia1
name: "DDIA O'reilly ↗"
url: "https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/"
- weight: 12
+ weight: 13
- identifier: ddia2
name: "DDIA 2nd O'reilly ↗"
url: "https://www.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/"
- weight: 13
+ weight: 14
params:
diff --git a/static/fig/ddia_1101.png b/static/fig/ddia_1101.png
new file mode 100644
index 0000000..01b5dc0
Binary files /dev/null and b/static/fig/ddia_1101.png differ
diff --git a/static/fig/ddia_1102.png b/static/fig/ddia_1102.png
new file mode 100644
index 0000000..2edbf54
Binary files /dev/null and b/static/fig/ddia_1102.png differ
diff --git a/static/fig/ddia_1103.png b/static/fig/ddia_1103.png
new file mode 100644
index 0000000..d8e3138
Binary files /dev/null and b/static/fig/ddia_1103.png differ
diff --git a/static/fig/ddia_1201.png b/static/fig/ddia_1201.png
new file mode 100644
index 0000000..1b38152
Binary files /dev/null and b/static/fig/ddia_1201.png differ
diff --git a/static/fig/ddia_1202.png b/static/fig/ddia_1202.png
new file mode 100644
index 0000000..90f2883
Binary files /dev/null and b/static/fig/ddia_1202.png differ
diff --git a/static/fig/ddia_1203.png b/static/fig/ddia_1203.png
new file mode 100644
index 0000000..9dab53c
Binary files /dev/null and b/static/fig/ddia_1203.png differ
diff --git a/static/fig/ddia_1204.png b/static/fig/ddia_1204.png
new file mode 100644
index 0000000..840b5b5
Binary files /dev/null and b/static/fig/ddia_1204.png differ
diff --git a/static/fig/ddia_1205.png b/static/fig/ddia_1205.png
new file mode 100644
index 0000000..3e62e83
Binary files /dev/null and b/static/fig/ddia_1205.png differ
diff --git a/static/fig/ddia_1206.png b/static/fig/ddia_1206.png
new file mode 100644
index 0000000..e37801c
Binary files /dev/null and b/static/fig/ddia_1206.png differ
diff --git a/static/fig/ddia_1207.png b/static/fig/ddia_1207.png
new file mode 100644
index 0000000..c7b1dc4
Binary files /dev/null and b/static/fig/ddia_1207.png differ
diff --git a/static/fig/ddia_1208.png b/static/fig/ddia_1208.png
new file mode 100644
index 0000000..09d8af6
Binary files /dev/null and b/static/fig/ddia_1208.png differ
diff --git a/static/fig/ddia_1301.png b/static/fig/ddia_1301.png
new file mode 100644
index 0000000..34f132a
Binary files /dev/null and b/static/fig/ddia_1301.png differ
diff --git a/static/fig/ddia_1302.png b/static/fig/ddia_1302.png
new file mode 100644
index 0000000..7b15d68
Binary files /dev/null and b/static/fig/ddia_1302.png differ