mirror of https://github.com/Vonng/ddia.git synced 2026-06-21 00:47:05 +08:00

add ddia 11-14

Feng Ruohang 2026-02-15 10:03:16 +08:00
parent 181eb7970d
commit a320e1b551
36 changed files with 9599 additions and 946 deletions


@ -31,7 +31,7 @@ jobs:
build:
runs-on: ubuntu-latest
env:
HUGO_VERSION: 0.147.7
HUGO_VERSION: 0.155.3
steps:
- name: Checkout
uses: actions/checkout@v4
@ -41,7 +41,7 @@ jobs:
- name: Setup Go
uses: actions/setup-go@v5
with:
go-version: '1.24'
go-version: '1.26'
- name: Setup Pages
id: pages
uses: actions/configure-pages@v4


@ -67,9 +67,10 @@
- [9. The Trouble with Distributed Systems](https://ddia.vonng.com/ch9)
- [10. Consistency and Consensus](https://ddia.vonng.com/ch10)
* [Part III: Derived Data](https://ddia.vonng.com/part-iii)
- [11. Batch Processing](https://ddia.vonng.com/ch11) (not yet published)
- [12. Stream Processing](https://ddia.vonng.com/ch12) (not yet published)
- [13. Doing the Right Thing](https://ddia.vonng.com/ch13) (not yet published)
- [11. Batch Processing](https://ddia.vonng.com/ch11)
- [12. Stream Processing](https://ddia.vonng.com/ch12)
- [13. A Philosophy of Streaming Systems](https://ddia.vonng.com/ch13)
- [14. Doing the Right Thing](https://ddia.vonng.com/ch14)
* [Glossary](https://ddia.vonng.com/glossary)
* [Colophon](https://ddia.vonng.com/colophon)


@ -49,9 +49,10 @@ breadcrumbs: false
- [10. Consistency and Consensus](/en/ch10)
### [Part III: Derived Data](/en/part-iii)
- [11. Batch Processing](/en/ch11) (WIP)
- [12. Stream Processing](/en/ch12) (WIP)
- [13. Doing the Right Thing](/en/ch13) (WIP)
- [11. Batch Processing](/en/ch11)
- [12. Stream Processing](/en/ch12)
- [13. A Philosophy of Streaming Systems](/en/ch13)
- [14. Doing the Right Thing](/en/ch14)
### [Glossary](/en/glossary)


@ -4,6 +4,8 @@ weight: 101
breadcrumbs: false
---
<a id="ch_tradeoffs"></a>
> *There are no solutions, there are only trade-offs. […] But you try to get the best
> trade-off you can get, and that’s all you can hope for.*
>
@ -156,7 +158,7 @@ the term *transaction* nevertheless stuck, referring to a group of reads and wri
logical unit.
> [!NOTE]
> [Chapter 8](/en/ch8#ch_transactions) explores in detail what we mean with a transaction. This chapter uses the term
> [Chapter 8](/en/ch8#ch_transactions) explores in detail what we mean with a transaction. This chapter uses the term
> loosely to refer to low-latency reads and writes.
Even though databases started being used for many different kinds of data—posts on social media,
@ -179,7 +181,7 @@ answer analytic queries such as:
The reports that result from these types of queries are important for business intelligence, helping
the management decide what to do next. In order to differentiate this pattern of using databases
from transaction processing, it has been called *online analytic processing* (OLAP) [^5].
The difference between OLTP and analytics is not always clear-cut, but some typical characteristics are listed in [Table 1-1](/en/ch1#tab_oltp_vs_olap).
The difference between OLTP and analytics is not always clear-cut, but some typical characteristics are listed in [Table 1-1](/en/ch1#tab_oltp_vs_olap).
{{< figure id="tab_oltp_vs_olap" title="Table 1-1. Comparing characteristics of operational and analytic systems" class="w-full my-4" >}}
@ -241,14 +243,14 @@ systems, for several reasons:
A *data warehouse*, by contrast, is a separate database that analysts can query to their heart’s
content, without affecting OLTP operations [^7].
As we shall see in [Chapter 4](/en/ch4#ch_storage), data warehouses often store data in a way that is very different
As we shall see in [Chapter 4](/en/ch4#ch_storage), data warehouses often store data in a way that is very different
from OLTP databases, in order to optimize for the types of queries that are common in analytics.
The data warehouse contains a read-only copy of the data in all the various OLTP systems in the
company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous
stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into
the data warehouse. This process of getting data into the data warehouse is known as
*Extract–Transform–Load* (ETL) and is illustrated in [Figure 1-1](/en/ch1#fig_dwh_etl). Sometimes the order of the
*Extract–Transform–Load* (ETL) and is illustrated in [Figure 1-1](/en/ch1#fig_dwh_etl). Sometimes the order of the
*transform* and *load* steps is swapped (i.e., the transformation is done in the data warehouse,
after loading), resulting in *ELT*.
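The extract, transform, and load steps described in this hunk can be illustrated with a minimal sketch. It assumes two SQLite databases standing in for an OLTP system and a data warehouse; the `orders` and `fact_orders` tables, their columns, and the `run_etl` function are hypothetical names chosen for illustration, not anything defined in the book.

```python
import sqlite3


def run_etl(oltp: sqlite3.Connection, warehouse: sqlite3.Connection) -> int:
    """Copy orders from an operational database into an analysis-friendly table."""
    # Extract: a periodic dump of the operational table (hypothetical schema).
    rows = oltp.execute("SELECT id, customer, amount_cents FROM orders").fetchall()

    # Transform: clean up and reshape the records for analytic queries.
    cleaned = [
        (order_id, customer.strip().lower(), amount_cents / 100)
        for order_id, customer, amount_cents in rows
        if amount_cents is not None  # drop incomplete records
    ]

    # Load: write the result into the warehouse table.
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_orders "
        "(order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
    )
    warehouse.executemany("INSERT OR REPLACE INTO fact_orders VALUES (?, ?, ?)", cleaned)
    warehouse.commit()
    return len(cleaned)
```

Swapping the last two steps (loading the raw rows first and running the cleanup as a query inside the warehouse) would turn this into the ELT variant mentioned above.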
@ -287,7 +289,7 @@ scale, the more specialized systems tend to become [^11].
#### From data warehouse to data lake {#from-data-warehouse-to-data-lake}
A data warehouse often uses a *relational* data model that is queried through SQL (see
[Chapter 3](/en/ch3#ch_datamodels)), perhaps using specialized business intelligence software. This model works well
[Chapter 3](/en/ch3#ch_datamodels)), perhaps using specialized business intelligence software. This model works well
for the types of queries that business analysts need to make, but it is less well suited to the
needs of data scientists, who might need to perform tasks such as:
@ -313,7 +315,7 @@ data scientists. The answer is a *data lake*: a centralized data repository that
data that might be useful for analysis, obtained from operational systems via ETL processes. The
difference from a data warehouse is that a data lake simply contains files, without imposing any
particular file format or data model. Files in a data lake might be collections of database records,
encoded using a file format such as Avro or Parquet (see [Chapter 5](/en/ch5#ch_encoding)), but they can equally well
encoded using a file format such as Avro or Parquet (see [Chapter 5](/en/ch5#ch_encoding)), but they can equally well
contain text, images, videos, sensor readings, sparse matrices, feature vectors, genome sequences,
or any other kind of data [^15].
Besides being more flexible, this is also often cheaper than relational data storage, since the data
@ -340,10 +342,10 @@ As analytics practices have matured, organizations have been increasingly paying
management and operations of analytics systems and data pipelines, as captured for example in the
DataOps manifesto [^18].
Part of this are issues of governance, privacy, and compliance with regulation such as GDPR and
CCPA, which we discuss in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance) and [Link to Come].
CCPA, which we discuss in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance) and [“Legislation and Self-Regulation”](/en/ch14#sec_future_legislation).
Moreover, analytical data is increasingly made available not only as files and relational tables,
but also as streams of events (see [Link to Come]). With file-based data analysis you can re-run the
but also as streams of events (see [Chapter 12](/en/ch12#ch_stream)). With file-based data analysis you can re-run the
analysis periodically (e.g., daily) in order to respond to changes in the data, but stream processing
allows analytics systems to respond to events much faster, on the order of seconds. Depending on the
application and how time-sensitive it is, a stream processing approach can be valuable, for example
@ -398,7 +400,7 @@ When the data in one system is derived from the data in another, you need a proc
derived data when the original in the system of record changes. Unfortunately, many databases are
designed based on the assumption that your application only ever needs to use that one database, and
they do not make it easy to integrate multiple systems in order to propagate such updates. In
[Link to Come] we will discuss approaches to *data integration*, which allow us to compose multiple
[“Data Integration”](/en/ch13#sec_future_integration) we will discuss approaches to *data integration*, which allow us to compose multiple
data systems to achieve things that one system alone cannot do.
That brings us to the end of our comparison of analytics and transaction processing. In the next
@ -420,7 +422,7 @@ energy company, and leaving aside emergency backup power), since it is cheaper t
With software, two important decisions to be made are who builds the software and who deploys it.
There is a spectrum of possibilities that outsource each decision to various degrees, as illustrated
in [Figure 1-2](/en/ch1#fig_cloud_spectrum). At one extreme is bespoke software that you write and run in-house; at
in [Figure 1-2](/en/ch1#fig_cloud_spectrum). At one extreme is bespoke software that you write and run in-house; at
the other extreme are widely-used cloud services or Software as a Service (SaaS) products that are
implemented and operated by an external vendor, and which you only access through a web interface or API.
@ -519,9 +521,9 @@ and indeed such managed services are now available for many popular data systems
that have been designed from the ground up to be cloud-native have been shown to have several
advantages: better performance on the same hardware, faster recovery from failures, being able to
quickly scale computing resources to match the load, and supporting larger datasets [^25] [^26] [^27].
[Table 1-2](/en/ch1#tab_cloud_native_dbs) lists some examples of both types of systems.
[Table 1-2](/en/ch1#tab_cloud_native_dbs) lists some examples of both types of systems.
{{< figure id="#tab_cloud_native_dbs" title="Table 1-2. Examples of self-hosted and cloud-native database systems" class="w-full my-4" >}}
{{< figure id="tab_cloud_native_dbs" title="Table 1-2. Examples of self-hosted and cloud-native database systems" class="w-full my-4" >}}
| Category | Self-hosted systems | Cloud-native systems |
|------------------|-----------------------------|-----------------------------------------------------------------------|
@ -580,7 +582,7 @@ As an alternative to local disks, cloud services also offer virtual disk storage
detached from one instance and attached to a different one (Amazon EBS, Azure managed disks, and
persistent disks in Google Cloud). Such a virtual disk is not actually a physical disk, but rather a
cloud service provided by a separate set of machines, which emulates the behavior of a disk (a
*block device*, where each block is typically 4 KiB in size). This technology makes it
*block device*, where each block is typically 4 KiB in size). This technology makes it
possible to run traditional disk-based software in the cloud, but the block device emulation
introduces overheads that can be avoided in systems that are designed from the ground up for the cloud [^25]. It also makes the application
very sensitive to network glitches, since every I/O on the virtual block device is actually a network call [^28].
@ -591,7 +593,7 @@ services such as S3 are designed for long-term storage of fairly large files, ra
of kilobytes to several gigabytes in size. The individual rows or values stored in a database are
typically much smaller than this; cloud databases therefore typically manage smaller values in a
separate service, and store larger data blocks (containing many individual values) in an object
store [^26] [^29]. We will see ways of doing this in [Chapter 4](/en/ch4#ch_storage).
store [^26] [^29]. We will see ways of doing this in [Chapter 4](/en/ch4#ch_storage).
In a traditional systems architecture, the same computer is responsible for both storage (disk) and
computation (CPU and RAM), but in cloud-native systems, these two responsibilities have become
@ -691,7 +693,7 @@ Fault tolerance/high availability
: If your application needs to continue working even if one machine (or several machines, or
the network, or an entire datacenter) goes down, you can use multiple machines to give you
redundancy. When one fails, another one can take over. See [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability) and
[Chapter 6](/en/ch6#ch_replication) on replication.
[Chapter 6](/en/ch6#ch_replication) on replication.
Scalability
: If your data volume or computing requirements grow bigger than a single machine can handle,
@ -739,7 +741,7 @@ Distributed systems also have downsides. Every request and API call that goes vi
to deal with the possibility of failure: the network may be interrupted, or the service may be
overloaded or crashed, and therefore any request may time out without receiving a response. In this
case, we don’t know whether the service received the request, and simply retrying it might not be
safe. We will discuss these problems in detail in [Chapter 9](/en/ch9#ch_distributed).
safe. We will discuss these problems in detail in [Chapter 9](/en/ch9#ch_distributed).
Although datacenter networks are fast, making a call to another service is still vastly slower than
calling a function in the same process [^44].
@ -760,9 +762,9 @@ as OpenTelemetry, Zipkin, and Jaeger allow you to track which client called whic
operation, and how long each call took [^49].
Databases provide various mechanisms for ensuring data consistency, as we shall see in
[Chapter 6](/en/ch6#ch_replication) and [Chapter 8](/en/ch8#ch_transactions). However, when each service has its own database,
[Chapter 6](/en/ch6#ch_replication) and [Chapter 8](/en/ch8#ch_transactions). However, when each service has its own database,
maintaining consistency of data across those different services becomes the application’s problem.
Distributed transactions, which we explore in [Chapter 8](/en/ch8#ch_transactions), are a possible technique for
Distributed transactions, which we explore in [Chapter 8](/en/ch8#ch_transactions), are a possible technique for
ensuring consistency, but they are rarely used in a microservices context because they run counter
to the goal of making services independent from each other, and many databases don’t support them [^50].
@ -770,7 +772,7 @@ For all these reasons, if you can do something on a single machine, this is ofte
cheaper compared to setting up a distributed system [^23] [^46] [^51].
CPUs, memory, and disks have grown larger, faster, and more reliable. When combined with single-node
databases such as DuckDB, SQLite, and KùzuDB, many workloads can now run on a single node. We will
explore more on this topic in [Chapter 4](/en/ch4#ch_storage).
explore more on this topic in [Chapter 4](/en/ch4#ch_storage).
### Microservices and Serverless {#sec_introduction_microservices}
@ -807,7 +809,7 @@ certain fields. Developers might wish to add or remove fields to an API as busin
but doing so can cause clients to fail. Worse still, such failures are often not discovered until
late in the development cycle when the updated service API is deployed to a staging or production
environment. API description standards such as OpenAPI and gRPC help manage the relationship between
client and server APIs; we discuss these further in [Chapter 5](/en/ch5#ch_encoding).
client and server APIs; we discuss these further in [Chapter 5](/en/ch5#ch_encoding).
Microservices are primarily a technical solution to a people problem: allowing different teams to
make progress independently without having to coordinate with each other. This is valuable in a large
@ -937,7 +939,7 @@ Service Organization Control (SOC) Type 2 standards. As with PCI compliance, ven
party audits to verify adherence.
Generally, it is important to balance the needs of your business against the needs of the people
whose data you are collecting and processing. There is much more to this topic; in [Link to Come] we
whose data you are collecting and processing. There is much more to this topic; in [Chapter 14](/en/ch14#ch_right_thing) we
will go deeper into the topics of ethics and legal compliance, including the problems of bias and
discrimination.
@ -952,7 +954,7 @@ We started by making a distinction between operational (transaction-processing,
(OLAP) systems, and saw their different characteristics: not only managing different types of data
with different access patterns, but also serving different audiences. We encountered the concept of
a data warehouse and data lake, which receive data feeds from operational systems via ETL. In
[Chapter 4](/en/ch4#ch_storage) we will see that operational and analytical systems often use very different internal
[Chapter 4](/en/ch4#ch_storage) we will see that operational and analytical systems often use very different internal
data layouts because of the different types of queries they need to serve.
We then compared cloud services, a comparatively recent development, to the traditional paradigm of
@ -964,7 +966,7 @@ example in the way they separate storage and compute.
Cloud systems are intrinsically distributed, and we briefly examined some of the trade-offs of
distributed systems compared to using a single machine. There are situations in which you can’t
avoid going distributed, but it’s advisable not to rush into making a system distributed if it’s
possible to keep it on a single machine. In [Chapter 9](/en/ch9#ch_distributed) we will cover the challenges with
possible to keep it on a single machine. In [Chapter 9](/en/ch9#ch_distributed) we will cover the challenges with
distributed systems in more detail.
Finally, we saw that data systems architecture is determined not only by the needs of the business
@ -1038,4 +1040,3 @@ this question in mind as we move through the rest of this book.
[^61]: Supreeth Shastri, Vinay Banakar, Melissa Wasserman, Arun Kumar, and Vijay Chidambaram. [Understanding and Benchmarking the Impact of GDPR on Database Systems](https://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 7, pages 1064–1077, March 2020. [doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354)
[^62]: Martin Fowler. [Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html). *martinfowler.com*, December 2013. Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6)
[^63]: [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016.


@ -4,18 +4,20 @@ weight: 210
breadcrumbs: false
---
<a id="ch_consistency"></a>
![](/map/ch09.png)
> *An ancient adage warns, “Never go to sea with two chronometers; take one or three.”*
>
> Frederick P. Brooks Jr., *The Mythical Man-Month: Essays on Software Engineering* (1995)
Lots of things can go wrong in distributed systems, as discussed in [Chapter 9](/en/ch9#ch_distributed). If we want a
Lots of things can go wrong in distributed systems, as discussed in [Chapter 9](/en/ch9#ch_distributed). If we want a
service to continue working correctly despite those things going wrong, we need to find ways of
tolerating faults.
One of the best tools we have for fault tolerance is *replication*. However, as we saw in
[Chapter 6](/en/ch6#ch_replication), having multiple copies of the data on multiple replicas opens up the risk of
[Chapter 6](/en/ch6#ch_replication), having multiple copies of the data on multiple replicas opens up the risk of
inconsistencies. Reads might be handled by a replica that is not up-to-date, yielding stale results.
If multiple replicas can accept writes, we have to deal with conflicts between values that were
concurrently written on different replicas. At a high level, there are two competing philosophies
@ -87,7 +89,7 @@ guarantee*. To clarify this idea, lets look at an example of a system that is
{{< figure src="/fig/ddia_1001.png" id="fig_consistency_linearizability_0" caption="Figure 10-1. If this database were linearizable, then either Alice's read would return 1 instead of 0, or Bob's read would return 0 instead of 1." class="w-full my-4" >}}
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4].
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4].
Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a
game their favorite team is playing. Just after the final score is announced, Aaliyah refreshes the
page, sees the winner announced, and excitedly tells Bryce about it. Bryce incredulously hits
@ -104,7 +106,7 @@ violation of linearizability.
### What Makes a System Linearizable? {#sec_consistency_lin_definition}
In order to understand linearizability better, let’s look at some more examples.
[Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows three clients concurrently reading and writing the same
[Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows three clients concurrently reading and writing the same
object *x* in a linearizable database. In distributed systems theory, *x* is called a *register*—in
practice, it could be one key in a key-value store, one row in a relational database, or one
document in a document database, for example.
@ -112,7 +114,7 @@ document in a document database, for example.
{{< figure src="/fig/ddia_1002.png" id="fig_consistency_linearizability_1" caption="Figure 10-2. Alice observes that x = 0 and y = 1, while Bob observes that x = 1 and y = 0. It's as if Alice's and Bob's computers disagree on the order in which the writes happened." class="w-full my-4" >}}
For simplicity, [Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows only the requests from the clients’
For simplicity, [Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows only the requests from the clients’
point of view, not the internals of the database. Each bar is a request made by a client, where the
start of a bar is the time when the request was sent, and the end of a bar is when the response was
received by the client. Due to variable network delays, a client doesn’t know exactly when the
@ -121,12 +123,12 @@ client sending the request and receiving the response.
In this example, the register has two types of operations:
* *read*(*x*) ⇒ *v* means the client requested to read the value of register
* *read*(*x*) ⇒ *v* means the client requested to read the value of register
*x*, and the database returned the value *v*.
* *write*(*x*, *v*) ⇒ *r* means the client requested to set the
* *write*(*x*, *v*) ⇒ *r* means the client requested to set the
register *x* to value *v*, and the database returned response *r* (which could be *ok* or *error*).
In [Figure 10-2](/en/ch10#fig_consistency_linearizability_1), the value of *x* is initially 0, and client C performs a
In [Figure 10-2](/en/ch10#fig_consistency_linearizability_1), the value of *x* is initially 0, and client C performs a
write request to set it to 1. While this is happening, clients A and B are repeatedly polling the
database to read the latest value. What are the possible responses that A and B might get for their
read requests?
@ -146,7 +148,7 @@ and forth between the old and the new value several times while a write is going
what we expect of a system that emulates a “single copy of the data.”
To make the system linearizable, we need to add another constraint, illustrated in
[Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
[Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
{{< figure src="/fig/ddia_1003.png" id="fig_consistency_linearizability_2" caption="Figure 10-3. If Alice and Bob had perfect clocks, linearizability would require that x = 1 is returned, since the read of x begins after the write x = 1 completes." class="w-full my-4" >}}
@ -156,25 +158,25 @@ of the write operation) at which the value of *x* atomically flips from 0 to 1.
client’s read returns the new value 1, all subsequent reads must also return the new value, even if
the write operation has not yet completed.
This timing dependency is illustrated with an arrow in [Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
This timing dependency is illustrated with an arrow in [Figure 10-3](/en/ch10#fig_consistency_linearizability_2).
Client A is the first to read the new value, 1. Just after A’s read returns, B begins a new read.
Since B’s read occurs strictly after A’s read, it must also return 1, even though the write by C is
still ongoing. (It’s the same situation as with Aaliyah and Bryce in
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0): after Aaliyah has read the new value, Bryce also expects to
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0): after Aaliyah has read the new value, Bryce also expects to
read the new value.)
We can further refine this timing diagram to visualize each operation taking effect atomically at
some point in time [^5],
like in the more complex example shown in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3). In this example we
like in the more complex example shown in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3). In this example we
add a third type of operation besides *read* and *write*:
* *cas*(*x*, *v*old, *v*new) ⇒ *r* means the client
* *cas*(*x*, *v*old, *v*new) ⇒ *r* means the client
requested an atomic *compare-and-set* operation (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)). If the
current value of the register *x* equals *v*old, it should be atomically set to *v*new. If
the value of *x* is different from *v*old, then the operation should leave the register
unchanged and return an error. *r* is the database’s response (*ok* or *error*).
Each operation in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3) is marked with a vertical line (inside the
Each operation in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3) is marked with a vertical line (inside the
bar for each operation) at the time when we think the operation was executed. Those markers are
joined up in a sequential order, and the result must be a valid sequence of reads and writes for a
register (every read must return the value set by the most recent write).
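The semantics of the three register operations (*read*, *write*, and *cas*) can be sketched for the single-copy case that a linearizable system emulates. This is only an illustration under stated assumptions: the `Register` class is a hypothetical name, it lives in one process, and a lock stands in for "each operation takes effect atomically at one point in time". A real replicated database needs far more machinery to give the same guarantee.

```python
import threading


class Register:
    """A single-copy register; the lock makes each operation take effect atomically."""

    def __init__(self, initial=0):
        self._value = initial
        self._lock = threading.Lock()

    def read(self):
        # Returns the value set by the most recent write or successful cas.
        with self._lock:
            return self._value

    def write(self, value):
        with self._lock:
            self._value = value
            return "ok"

    def cas(self, old, new):
        # Compare-and-set: update only if the current value equals `old`.
        with self._lock:
            if self._value != old:
                return "error"  # register is left unchanged
            self._value = new
            return "ok"
```

For example, after `x = Register(0)` and `x.write(1)`, the call `x.cas(0, 2)` returns `"error"` and leaves the value at 1, while `x.cas(1, 2)` returns `"ok"` and a subsequent `x.read()` returns 2.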
@ -187,7 +189,7 @@ that was written, until it is overwritten again.
{{< figure src="/fig/ddia_1004.png" id="fig_consistency_linearizability_3" caption="Figure 10-4. The read of x is concurrent with the write x = 1. Since we don't know the exact timing of the operations, the read is allowed to return either 0 or 1." class="w-full my-4" >}}
There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3):
There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3):
* First client B sent a request to read *x*, then client D sent a request to set *x* to 0, and then
client A sent a request to set *x* to 1. Nevertheless, the value returned to B’s read is 1 (the
@ -207,7 +209,7 @@ There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_
C’s *cas* write, which updates *x* from 2 to 4. In the absence of other requests, it would be okay for
B’s read to return 2. However, client A has already read the new value 4 before B’s read started,
so B is not allowed to read an older value than A. Again, it’s the same situation as with Aaliyah
and Bryce in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0).
and Bryce in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0).
That is the intuition behind linearizability; the formal definition [^1] describes it more precisely. It is
possible (though computationally expensive) to test whether a system’s behavior is linearizable by
@ -225,6 +227,8 @@ which is the strongest consistency model in common use.
--------
<a id="sidebar_consistency_serializability"></a>
> [!TIP] LINEARIZABILITY VERSUS SERIALIZABILITY
Linearizability is easily confused with serializability (see [“Serializability”](/en/ch8#sec_transactions_serializability)),
@ -325,7 +329,7 @@ nodes agree on.
In real applications, it is sometimes acceptable to treat such constraints loosely (for example, if
a flight is overbooked, you can move customers to a different flight and offer them compensation for
the inconvenience). In such cases, linearizability may not be needed, and we will discuss such
loosely interpreted constraints in [Link to Come].
loosely interpreted constraints in [“Timeliness and Integrity”](/en/ch13#sec_future_integrity).
However, a hard uniqueness constraint, such as the one you typically find in relational databases,
requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints,
@ -333,7 +337,7 @@ can be implemented without linearizability [^20].
#### Cross-channel timing dependencies {#cross-channel-timing-dependencies}
Notice a detail in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0): if Aaliyah hadn’t exclaimed the score,
Notice a detail in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0): if Aaliyah hadn’t exclaimed the score,
Bryce wouldn’t have known that the result of his query was stale. He would have just refreshed the
page again a few seconds later, and eventually seen the final score. The linearizability violation
was only noticed because there was an additional communication channel in the system (Aaliyah’s
@ -342,10 +346,10 @@ voice to Bryces ears).
Similar situations can arise in computer systems. For example, say you have a website where users
can upload a video, and a background process transcodes the video to a lower quality that can be
streamed on slow internet connections. The architecture and dataflow of this system is illustrated
in [Figure 10-5](/en/ch10#fig_consistency_transcoder).
in [Figure 10-5](/en/ch10#fig_consistency_transcoder).
The video transcoder needs to be explicitly instructed to perform a transcoding job, and this
instruction is sent from the web server to the transcoder via a message queue (see [Link to Come]).
instruction is sent from the web server to the transcoder via a message queue (see [“Messaging Systems”](/en/ch12#sec_stream_messaging)).
The web server doesn’t place the entire video on the queue, since most message brokers are designed
for small messages, and a video may be many megabytes in size. Instead, the video is first written
to a file storage service, and once the write is complete, the instruction to the transcoder is
@ -356,7 +360,7 @@ placed on the queue.
If the file storage service is linearizable, then this system should work fine. If it is not
linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in
[Figure 10-5](/en/ch10#fig_consistency_transcoder)) might be faster than the internal replication inside the storage
[Figure 10-5](/en/ch10#fig_consistency_transcoder)) might be faster than the internal replication inside the storage
service. In this case, when the transcoder fetches the original video (step 5), it might see an old
version of the file, or nothing at all. If it processes an old version of the video, the original
and transcoded videos in the file storage become permanently inconsistent with each other.
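The race between the two channels can be reproduced in a small deterministic simulation. The sketch below assumes a toy storage service with one leader and one follower, where replication happens only when `replicate()` is called. `ReplicatedStorage`, its methods, and the `"video-1"` key are hypothetical names for illustration, not the design of any real storage service.

```python
class ReplicatedStorage:
    """Toy file store: writes go to the leader and reach the follower only on replicate()."""

    def __init__(self):
        self.leader = {}
        self.follower = {}

    def write(self, key, data):
        self.leader[key] = data

    def replicate(self):
        # Stands in for asynchronous replication catching up.
        self.follower.update(self.leader)

    def read_from_follower(self, key):
        return self.follower.get(key)


storage = ReplicatedStorage()
queue = []

# Web server: store the video, then enqueue the transcoding instruction.
storage.write("video-1", b"original bytes")
queue.append("video-1")

# Transcoder: the message arrives before replication has caught up,
# so the read from the follower finds nothing.
job = queue.pop(0)
stale = storage.read_from_follower(job)

# Replication finishes afterwards; only a later read sees the file.
storage.replicate()
fresh = storage.read_from_follower(job)
```

Here `stale` is `None` while `fresh` holds the uploaded bytes: the message queue outran the storage replication. A linearizable storage service would rule out the stale read, because the read begins after the write has completed.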
@ -364,7 +368,7 @@ and transcoded videos in the file storage become permanently inconsistent with e
This problem arises because there are two different communication channels between the web server
and the transcoder: the file storage and the message queue. Without the recency guarantee of
linearizability, race conditions between these two channels are possible. This situation is
analogous to [Figure 10-1](/en/ch10#fig_consistency_linearizability_0), where there was also a race condition between
analogous to [Figure 10-1](/en/ch10#fig_consistency_linearizability_0), where there was also a race condition between
two communication channels: the database replication and the real-life audio channel between
Aaliyah’s mouth and Bryce’s ears.
@ -389,7 +393,7 @@ and all operations on it are atomic,” the simplest answer would be to really o
of the data. However, that approach would not be able to tolerate faults: if the node holding that
one copy failed, the data would be lost, or at least inaccessible until the node was brought up again.
Let’s revisit the replication methods from [Chapter 6](/en/ch6#ch_replication), and compare whether they can be made linearizable:
Let’s revisit the replication methods from [Chapter 6](/en/ch6#ch_replication), and compare whether they can be made linearizable:
Single-leader replication (potentially linearizable)
: In a system with single-leader replication, the leader has the primary copy of the data that is
@ -423,7 +427,7 @@ Multi-leader replication (not linearizable)
Leaderless replication (probably not linearizable)
: For systems with leaderless replication (Dynamo-style; see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)), people
sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes
(*w* + *r* > *n*). Depending on the exact algorithm, and depending on how you define
strong consistency, this is not quite true.
“Last write wins” conflict resolution methods based on time-of-day clocks (e.g., in Cassandra and
@ -435,21 +439,21 @@ Leaderless replication (probably not linearizable)
Intuitively, it seems as though quorum reads and writes should be linearizable in a
Dynamo-style model. However, when we have variable network delays, it is possible to have race
conditions, as demonstrated in [Figure 10-6](/en/ch10#fig_consistency_leaderless).
{{< figure src="/fig/ddia_1006.png" id="fig_consistency_leaderless" caption="Figure 10-6. Quorums are not sufficient to ensure linearizability if network delays are variable." class="w-full my-4" >}}
In [Figure 10-6](/en/ch10#fig_consistency_leaderless), the initial value of *x* is 0, and a writer client is updating
*x* to 1 by sending the write to all three replicas (*n* = 3, *w* = 3).
Concurrently, client A reads from a quorum of two nodes (*r* = 2) and sees the new value 1
on one of the nodes. Also concurrently with the write, client B reads from a different quorum of two
nodes, and gets back the old value 0 from both.
The quorum condition is met (*w* + *r* > *n*), but this execution is nevertheless not
linearizable: B's request begins after A's request completes, but B returns the old value while A
returns the new value. (It's once again the Aaliyah and Bryce situation from
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0).)
It is possible to make Dynamo-style quorums linearizable at the cost of reduced
performance: a reader must perform read repair (see [“Catching up on missed writes”](/en/ch6#sec_replication_read_repair)) synchronously,
@ -471,10 +475,10 @@ provide linearizability, even with quorum reads and writes.
As some replication methods can provide linearizability and others cannot, it is interesting to
explore the pros and cons of linearizability in more depth.
We already discussed some use cases for different replication methods in [Chapter 6](/en/ch6#ch_replication); for
example, we saw that multi-leader replication is often a good choice for multi-region
replication (see [“Geographically Distributed Operation”](/en/ch6#sec_replication_multi_dc)). An example of such a deployment is illustrated in
[Figure 10-7](/en/ch10#fig_consistency_cap_availability).
{{< figure src="/fig/ddia_1007.png" id="fig_consistency_cap_availability" caption="Figure 10-7. If clients cannot contact enough replicas due to a network partition, they cannot process writes." class="w-full my-4" >}}
@ -600,7 +604,7 @@ proportional to the uncertainty of delays in the network. In a network with high
like most computer networks (see [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing)), the response time of linearizable
reads and writes is inevitably going to be high. A faster algorithm for linearizability does not
exist, but weaker consistency models can be much faster, so this trade-off is important for
latency-sensitive systems. In [“Timeliness and Integrity”](/en/ch13#sec_future_integrity) we will discuss some approaches for avoiding
linearizability without sacrificing correctness.
@ -613,7 +617,7 @@ stored in only 64 bits (or even 32 bits if you are sure that you will never have
records, but that is risky).
Another advantage of such auto-incrementing IDs is that the order of the IDs tells you the order in
which the records were created. For example, [Figure 10-8](/en/ch10#fig_consistency_id_generator) shows a chat
application that assigns auto-incrementing IDs to chat messages as they are posted. You can then
display the messages in order of increasing ID, and the resulting chat threads will make sense:
Aaliyah posts a question that is assigned ID 1, and Bryce's answer to the question is assigned a
@ -626,7 +630,7 @@ This single-node ID generator is another example of a linearizable system. Each
ID is an operation that atomically increments a counter and returns the old counter value (a
*fetch-and-add* operation); linearizability ensures that if the posting of Aaliyah's message
completes before Bryce's posting begins, then Bryce's ID must be greater than Aaliyah's. The
messages by Aaliyah and Caleb in [Figure 10-8](/en/ch10#fig_consistency_id_generator) are concurrent, so linearizability
doesn't specify how their IDs must be ordered, as long as they are unique.
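As a rough illustration of the fetch-and-add idea, here is a minimal Python sketch of a single-node ID generator. The names `IdGenerator`, `fetch_and_add`, and `allocate_concurrently` are hypothetical and chosen only for this example. Python has no built-in atomic integer, so the sketch assumes a lock is an acceptable stand-in for a hardware atomic increment.

```python
import threading


class IdGenerator:
    """Hypothetical single-node ID generator built on a fetch-and-add counter."""

    def __init__(self, start: int = 1) -> None:
        self._next = start
        self._lock = threading.Lock()

    def fetch_and_add(self) -> int:
        # The lock makes "read the counter, increment it, return the old value"
        # one indivisible step, standing in for an atomic CPU instruction.
        with self._lock:
            value = self._next
            self._next += 1
            return value


def allocate_concurrently(generator, threads=8, per_thread=1000):
    """Request IDs from several threads at once and return all IDs handed out."""
    ids = []
    ids_lock = threading.Lock()

    def worker():
        local = [generator.fetch_and_add() for _ in range(per_thread)]
        with ids_lock:
            ids.extend(local)

    workers = [threading.Thread(target=worker) for _ in range(threads)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return ids
```

If one call to `fetch_and_add` returns before another begins, the later call gets the larger ID. Calls that overlap in time may be ordered either way, but they never receive the same ID.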
An in-memory single-node ID generator is easy to implement: you can use the atomic increment
@ -720,9 +724,9 @@ causality, and which you can use as a distributed ID generator. It is called a *
proposed in 1978 by Leslie Lamport [^54],
in what is now one of the most-cited papers in the field of distributed systems.
[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) shows how a Lamport clock would work in the chat example of
[Figure 10-8](/en/ch10#fig_consistency_id_generator). Each node has a unique identifier, which in
[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) is the name “Aaliyah”, “Bryce”, or “Caleb”, but which in practice
could be a random UUID or something similar. Moreover, each node keeps a counter of the number of
operations it has processed. A Lamport timestamp is then simply a pair of (*counter*, *node ID*).
Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp,
@ -735,7 +739,7 @@ Every time a node generates a timestamp, it increments its counter value and use
Moreover, every time a node sees a timestamp from another node, if the counter value in that
timestamp is greater than its local counter value, it increases its local counter to match the value in the timestamp.
In [Figure 10-9](/en/ch10#fig_consistency_lamport_ts), Aaliyah had not yet seen Caleb's message when posting her own,
and vice versa. Assuming both users start with an initial counter value of 0, both therefore
increment their local counter and attach the new counter value of 1 to their message. When Bryce
receives those messages, he increases his local counter value to 1. Finally, Bryce sends a reply to
@ -743,10 +747,10 @@ Aaliyahs message, for which he increments his local counter and attaches the
message.
To compare two Lamport timestamps, we first compare their counter value: for example,
(2, “Bryce”) is greater than (1, “Aaliyah”) and also greater than (1, “Caleb”). If
two timestamps have the same counter, we compare their node IDs instead, using the usual
lexicographic string comparison. Thus, the timestamp order in this example is
(1, “Aaliyah”) < (1, “Caleb”) < (2, “Bryce”).
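The rules above can be sketched in a few lines of Python. This is only an illustrative sketch: the class name `LamportClock` and its methods `tick` and `observe` are hypothetical, and the node IDs are plain strings as in the figure. Python compares tuples element by element, so a (*counter*, *node ID*) pair sorts by counter first and by node ID second, which matches the comparison rule described here.

```python
from typing import Tuple

Timestamp = Tuple[int, str]  # (counter, node ID)


class LamportClock:
    """Hypothetical per-node Lamport clock."""

    def __init__(self, node_id: str) -> None:
        self.node_id = node_id
        self.counter = 0

    def tick(self) -> Timestamp:
        # Generating a timestamp increments the local counter and uses the new value.
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp: Timestamp) -> None:
        # On seeing a timestamp from another node, catch up if it is ahead of us.
        counter, _ = timestamp
        if counter > self.counter:
            self.counter = counter


# Replaying the chat example: Aaliyah and Caleb post concurrently,
# then Bryce sees both messages and replies.
aaliyah = LamportClock("Aaliyah")
bryce = LamportClock("Bryce")
caleb = LamportClock("Caleb")

ts_aaliyah = aaliyah.tick()
ts_caleb = caleb.tick()
bryce.observe(ts_aaliyah)
bryce.observe(ts_caleb)
ts_bryce = bryce.tick()

ordered = sorted([ts_bryce, ts_caleb, ts_aaliyah])
```

Bryce's reply is ordered after both messages he has seen, while the tie between the two concurrent messages is broken by node ID alone.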
#### Hybrid logical clocks {#hybrid-logical-clocks}
@ -789,7 +793,7 @@ IDs, because they ensure that the snapshot is consistent with causality [^56].
When multiple timestamps are generated concurrently, these algorithms order them arbitrarily. This
means that when you look at two timestamps, you generally can't tell whether they were generated
concurrently or whether one happened before the other. (In the example of
[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) you actually can tell that Aaliyah and Caleb's messages must have
been concurrent, because they have the same counter value, but when the counter values are different
you can't tell whether they were concurrent.)
@ -807,7 +811,7 @@ the higher ID, even if A and B never communicated with each other. On the other
can only ensure that a node generates timestamps that are greater than any other timestamp that node
has seen, but it can't say anything about timestamps that it hasn't seen.
[Figure 10-10](/en/ch10#fig_consistency_permissions) shows how a non-linearizable ID generator could cause problems.
Imagine a social media website where user A wants to share an embarrassing photo privately with
their friends. A's account is initially public, but using their laptop, A first changes their
account settings to private. Then A uses their phone to upload the photo. Since A performed these
@ -917,7 +921,7 @@ It turns out that all of these are instances of the same fundamental distributed
*consensus*. Consensus is one of the most important and fundamental problems in distributed
computing; it is also infamously difficult to get right [^58] [^59],
and many systems have got it wrong in the past. Now that we have discussed replication
([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
linearizability (this chapter), we are finally ready to tackle the consensus problem.
The best-known consensus algorithms are Viewstamped Replication [^60] [^61], Paxos [^58] [^62] [^63] [^64],
@ -1243,7 +1247,7 @@ A shared log is a good fit for database replication: if every log entry represen
database, and every replica processes the same writes in the same order using deterministic logic,
then the replicas will all end up in a consistent state. This idea is known as *state machine replication* [^80],
and it is the principle behind event sourcing, which we saw in [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events). Shared
logs are also useful for stream processing, as we shall see in [Chapter 12](/en/ch12#ch_stream).
Similarly, a shared log can be used to implement serializable transactions: as discussed in
[“Actual Serial Execution”](/en/ch8#sec_transactions_serial), if every log entry represents a deterministic transaction to be
@ -1355,7 +1359,7 @@ fails.
If you drop the requirement for the new leader to be up-to-date, you may improve performance and
availability, but you are on thin ice, since the theory of consensus no longer applies. While things
will work fine as long as there are no faults, the problems discussed in [Chapter 9](/en/ch9#ch_distributed) can
easily cause a lot of data loss or corruption.
--------
@ -1381,7 +1385,7 @@ one location to another (by first adding the new nodes, and then removing the ol
Although they are complex and subtle, consensus algorithms are a huge breakthrough for distributed
systems. Consensus is essentially “single-leader replication done right”, with automatic failover on
leader failure, ensuring that no committed data is lost and no split-brain is possible, even in the
face of all the problems we discussed in [Chapter 9](/en/ch9#ch_distributed).
Since single-leader replication with automatic failover is essentially one of the definitions of
consensus, any system that provides automatic failover but does not use a proven consensus algorithm
@ -1413,7 +1417,7 @@ research problem.
For systems that want to be highly available, but don't want to accept the cost of consensus, the
only real alternative is to use a weaker consistency model instead, such as those offered by
leaderless or multi-leader replication as discussed in [Chapter 6](/en/ch6#ch_replication). These approaches
generally don't offer linearizability, but for applications that don't need it that is fine.
@ -1617,14 +1621,14 @@ a coordination service. It wont guarantee that you will get it right, but it
Consensus algorithms are complicated and subtle, but they are supported by a rich body of theory
that has been developed since the 1980s. This theory makes it possible to build systems that can
tolerate all the faults that we discussed in [Chapter 9](/en/ch9#ch_distributed), and still ensure that your data is
not corrupted. This is an amazing achievement, and the references at the end of this chapter feature
some of the highlights of this work.
Nevertheless, consensus is not always the right tool: in some systems, the strong consistency
properties it provides are not needed, and it is better to have weaker consistency with higher
availability and better performance. In these cases, it is common to use leaderless or multi-leader
replication, which we previously discussed in [Chapter 6](/en/ch6#ch_replication). The logical clocks that we
discussed in this chapter are helpful in that context.
### References


625
content/en/ch14.md Normal file

@ -0,0 +1,625 @@
---
title: "14. Doing the Right Thing"
weight: 314
breadcrumbs: false
---
<a id="ch_right_thing"></a>
![](/map/ch13.png)
> *Feeding AI systems on the world's beauty, ugliness, and cruelty, but expecting it to reflect only
> the beauty is a fantasy.*
>
> Vinay Uday Prabhu and Abeba Birhane, *Large Datasets: A Pyrrhic Win for Computer Vision?* (2020)
> [!TIP] A NOTE FOR EARLY RELEASE READERS
> With Early Release ebooks, you get books in their earliest form---the author's raw and unedited
> content as they write---so you can take advantage of these technologies long before the official
> release of these titles.
>
> This will be the 14th chapter of the final book. The GitHub repo for this book is
> [https://github.com/ept/ddia2-feedback](https://github.com/ept/ddia2-feedback).
>
> If you'd like to be actively involved in reviewing and commenting on this draft, please reach out on GitHub.
In the final chapter of this book, let's take a step back. Throughout this book we have examined a
wide range of different architectures for data systems, evaluated their pros and cons, and explored
techniques for building reliable, scalable, and maintainable applications. However, we have left out
an important and fundamental part of the discussion, which we should now fill in.
Every system is built for a purpose; every action we take has both intended and unintended
consequences. The purpose may be as simple as making money, but the consequences for the world may
reach far beyond that original purpose. We, the engineers building these systems, have a
responsibility to carefully consider those consequences and to consciously decide what kind of world
we want to live in.
We talk about data as an abstract thing, but remember that many datasets are about people: their
behavior, their interests, their identity. We must treat such data with humanity and respect. Users
are humans too, and human dignity is paramount [^1].
Software development increasingly involves making important ethical choices. There are guidelines to
help software engineers navigate these issues, such as the ACM Code of Ethics and Professional
Conduct [^2], but they are rarely discussed, applied, and enforced in practice. As a
result, engineers and product managers sometimes take a very cavalier attitude to privacy and
potential negative consequences of their products [^3], [^4].
A technology is not good or bad in itself---what matters is how it is used and how it affects
people. This is true for a software system like a search engine in much the same way as it is for a
weapon like a gun. It is not sufficient for software engineers to focus exclusively on the technology
and ignore its consequences: the ethical responsibility is ours to bear also. Reasoning about ethics
is difficult, but it is too important to ignore.
However, what makes something "good" or "bad" is not well-defined, and most people in computing
don't even discuss that question [^5]. In contrast to much of computing, the concepts at
the heart of ethics are not fixed or determinate in their precise meaning, and they require
interpretation, which may be subjective [^6]. Ethics is not going through some checklist
to confirm you comply; it's a participatory and iterative process of reflection, in dialog with the
people involved, with accountability for the results [^7].
## Predictive Analytics {#id369}
For example, predictive analytics is a major part of why people are excited about big data and AI.
Using data analysis to predict the weather, or the spread of diseases, is one thing [^8];
it is another matter to predict whether a convict is likely to reoffend, whether an applicant for a
loan is likely to default, or whether an insurance customer is likely to make expensive claims
[^9]. The latter have a direct effect on individual people's lives.
Naturally, payment networks want to prevent fraudulent transactions, banks want to avoid bad loans,
airlines want to avoid hijackings, and companies want to avoid hiring ineffective or untrustworthy
people. From their point of view, the cost of a missed business opportunity is low, but the cost of
a bad loan or a problematic employee is much higher, so it is natural for organizations to want to
be cautious. If in doubt, they are better off saying no.
However, as algorithmic decision-making becomes more widespread, someone who has (accurately or
falsely) been labeled as risky by some algorithm may suffer a large number of those "no" decisions.
Systematically being excluded from jobs, air travel, insurance coverage, property rental, financial
services, and other key aspects of society is such a large constraint of the individual's freedom
that it has been called "algorithmic prison" [^10]. In countries that respect human
rights, the criminal justice system presumes innocence until proven guilty; on the other hand,
automated systems can systematically and arbitrarily exclude a person from participating in society
without any proof of guilt, and with little chance of appeal.
### Bias and Discrimination {#id370}
Decisions made by an algorithm are not necessarily any better or any worse than those made by a
human. Every person is likely to have biases, even if they actively try to counteract them, and
discriminatory practices can become culturally institutionalized. There is hope that basing
decisions on data, rather than subjective and instinctive assessments by people, could be more fair
and give a better chance to people who are often overlooked in the traditional system
[^11].
When we develop predictive analytics and AI systems, we are not merely automating a human's decision
by using software to specify the rules for when to say yes or no; we are even leaving the rules
themselves to be inferred from data. However, the patterns learned by these systems are opaque: even
if there is some correlation in the data, we may not know why. If there is a systematic bias in the
input to an algorithm, the system will most likely learn and amplify that bias in its output
[^12].
In many countries, anti-discrimination laws prohibit treating people differently depending on
protected traits such as ethnicity, age, gender, sexuality, disability, or beliefs. Other features
of a person's data may be analyzed, but what happens if they are correlated with protected traits?
For example, in racially segregated neighborhoods, a person's postal code or even their IP address
is a strong predictor of race. Put like this, it seems ridiculous to believe that an algorithm could
somehow take biased data as input and produce fair and impartial output from it [^13],
[^14]. Yet this belief often seems to be implied by proponents of data-driven decision
making, an attitude that has been satirized as "machine learning is like money laundering for bias"
[^15].
Predictive analytics systems merely extrapolate from the past; if the past is discriminatory, they
codify and amplify that discrimination [^16]. If we want the future to be better than the
past, moral imagination is required, and that's something only humans can provide [^17].
Data and models should be our tools, not our masters.
### Responsibility and Accountability {#id371}
Automated decision making opens the question of responsibility and accountability [^17].
If a human makes a mistake, they can be held accountable, and the person affected by the decision
can appeal. Algorithms make mistakes too, but who is accountable if they go wrong [^18]?
When a self-driving car causes an accident, who is responsible? If an automated credit scoring
algorithm systematically discriminates against people of a particular race or religion, is there any
recourse? If a decision by your machine learning system comes under judicial review, can you explain
to the judge how the algorithm made its decision? People should not be able to evade their
responsibility by blaming an algorithm.
Credit rating agencies are an old example of collecting data to make decisions about people. A bad
credit score makes life difficult, but at least a credit score is normally based on relevant facts
about a person's actual borrowing history, and any errors in the record can be corrected (although
the agencies normally do not make this easy). However, scoring algorithms based on machine learning
typically use a much wider range of inputs and are much more opaque, making it harder to understand
how a particular decision has come about and whether someone is being treated in an unfair or
discriminatory way [^19].
A credit score summarizes "How did you behave in the past?" whereas predictive analytics usually
work on the basis of "Who is similar to you, and how did people like you behave in the past?"
Drawing parallels to others' behavior implies stereotyping people, for example based on where they
live (a close proxy for race and socioeconomic class). What about people who get put in the wrong
bucket? Furthermore, if a decision is incorrect due to erroneous data, recourse is almost impossible
[^17].
Much data is statistical in nature, which means that even if the probability distribution on the
whole is correct, individual cases may well be wrong. For example, if the average life expectancy in
your country is 80 years, that doesn't mean you're expected to drop dead on your 80th birthday. From
the average and the probability distribution, you can't say much about the age to which one
particular person will live. Similarly, the output of a prediction system is probabilistic and may
well be wrong in individual cases.
A blind belief in the supremacy of data for making decisions is not only delusional, it is
positively dangerous. As data-driven decision making becomes more widespread, we will need to figure
out how to make algorithms accountable and transparent, how to avoid reinforcing existing biases,
and how to fix them when they inevitably make mistakes.
We will also need to figure out how to prevent data being used to harm people, and realize its
positive potential instead. For example, analytics can reveal financial and social characteristics
of people's lives. On the one hand, this power could be used to focus aid and support to help those
people who most need it. On the other hand, it is sometimes used by predatory businesses seeking to
identify vulnerable people and sell them risky products such as high-cost loans and worthless
college degrees [^17], [^20].
### Feedback Loops {#id372}
Even with predictive applications that have less immediately far-reaching effects on people, such as
recommendation systems, there are difficult issues that we must confront. When services become good
at predicting what content users want to see, they may end up showing people only opinions they
already agree with, leading to echo chambers in which stereotypes, misinformation, and polarization
can breed. We are already seeing the impact of social media echo chambers on election campaigns.
When predictive analytics affect people's lives, particularly pernicious problems arise due to
self-reinforcing feedback loops. For example, consider the case of employers using credit scores to
evaluate potential hires. You may be a good worker with a good credit score, but suddenly find
yourself in financial difficulties due to a misfortune outside of your control. As you miss payments
on your bills, your credit score suffers, and you will be less likely to find work. Joblessness
pushes you toward poverty, which further worsens your scores, making it even harder to find
employment [^17]. It's a downward spiral due to poisonous assumptions, hidden behind a
camouflage of mathematical rigor and data.
As another example of a feedback loop, economists found that when gas stations in Germany introduced
algorithmic prices, competition was reduced and prices for consumers went up because the algorithms
learned to collude [^21].
We can't always predict when such feedback loops happen. However, many consequences can be predicted
by thinking about the entire system (not just the computerized parts, but also the people
interacting with it)---an approach known as *systems thinking* [^22]. We can try to
understand how a data analysis system responds to different behaviors, structures, or
characteristics. Does the system reinforce and amplify existing differences between people (e.g.,
making the rich richer or the poor poorer), or does it try to combat injustice? And even with the
best intentions, we must beware of unintended consequences.
## Privacy and Tracking {#id373}
Besides the problems of predictive analytics---i.e., using data to make automated decisions about
people---there are ethical problems with data collection itself. What is the relationship between
the organizations collecting data and the people whose data is being collected?
When a system only stores data that a user has explicitly entered, because they want the system to
store and process it in a certain way, the system is performing a service for the user: the user is
the customer. But when a user's activity is tracked and logged as a side effect of other things they
are doing, the relationship is less clear. The service no longer just does what the user tells it to
do, but it takes on interests of its own, which may conflict with the user's interests.
Tracking behavioral data has become increasingly important for user-facing features of many online
services: tracking which search results are clicked helps improve the ranking of search results;
recommending "people who liked X also liked Y" helps users discover interesting and useful things;
A/B tests and user flow analysis can help indicate how a user interface might be improved. Those
features require some amount of tracking of user behavior, and users benefit from them.
However, depending on a company's business model, tracking often doesn't stop there. If the service
is funded through advertising, the advertisers are the actual customers, and the users' interests
take second place. Tracking data becomes more detailed, analyses become further-reaching, and data
is retained for a long time in order to build up detailed profiles of each person for marketing
purposes.
Now the relationship between the company and the user whose data is being collected starts looking
quite different. The user is given a free service and is coaxed into engaging with it as much as
possible. The tracking of the user serves not primarily that individual, but rather the needs of the
advertisers who are funding the service. This relationship can be appropriately described with a
word that has more sinister connotations: *surveillance*.
### Surveillance {#id374}
As a thought experiment, try replacing the word *data* with *surveillance*, and observe if common
phrases still sound so good [^23]. How about this: "In our surveillance-driven
organization we collect real-time surveillance streams and store them in our surveillance warehouse.
Our surveillance scientists use advanced analytics and surveillance processing in order to derive
new insights."
This thought experiment is unusually polemic for this book, *Designing Surveillance-Intensive
Applications*, but strong words are needed to emphasize this point. In our attempts to make software
"eat the world" [^24], we have built the greatest mass surveillance infrastructure the
world has ever seen. We are rapidly approaching a world in which every inhabited space contains at
least one internet-connected microphone, in the form of smartphones, smart TVs, voice-controlled
assistant devices, baby monitors, and even children's toys that use cloud-based speech recognition.
Many of these devices have a terrible security record [^25].
What is new compared to the past is that digitization has made it easy to collect large amounts of
data about people. Surveillance of our location and movements, our social relationships and
communications, our purchases and payments, and data about our health have become almost
unavoidable. A surveillance organization may end up knowing more about a person than that person
knows about themselves---for example, identifying illnesses or economic problems before the person
themselves is aware of them.
Even the most totalitarian and repressive regimes of the past could only dream of putting a
microphone in every room and forcing every person to constantly carry a device capable of tracking
their location and movements. Yet the benefits that we get from digital technology are so great that
we now voluntarily accept this world of total surveillance. The difference is just that the data is
being collected by corporations to provide us with services, rather than government agencies seeking
control [^26].
Not all data collection necessarily qualifies as surveillance, but examining it as such can help us
understand our relationship with the data collector. Why are we seemingly happy to accept
surveillance by corporations? Perhaps you feel you have nothing to hide---in other words, you are
totally in line with existing power structures, you are not a marginalized minority, and you needn't
fear persecution [^27]. Not everyone is so fortunate. Or perhaps it's because the purpose
seems benign---it's not overt coercion and conformance, but merely better recommendations and more
personalized marketing. However, combined with the discussion of predictive analytics from the last
section, that distinction seems less clear.
We are already seeing behavioral data on car driving, tracked by cars without drivers' consent,
affecting their insurance premiums [^28], and health insurance coverage that depends on
people wearing a fitness tracking device. When surveillance is used to determine things that hold
sway over important aspects of life, such as insurance coverage or employment, it starts to appear
less benign. Moreover, data analysis can reveal surprisingly intrusive things: for example, the
movement sensor in a smartwatch or fitness tracker can be used to work out what you are typing (for
example, passwords) with fairly good accuracy [^29]. Sensor accuracy and algorithms for
analysis are only going to get better.
### Consent and Freedom of Choice {#id375}
We might assert that users voluntarily choose to use a service that tracks their activity, and they
have agreed to the terms of service and privacy policy, so they consent to data collection. We might
even claim that users are receiving a valuable service in return for the data they provide, and that
the tracking is necessary in order to provide the service. Undoubtedly, social networks, search
engines, and various other free online services are valuable to users---but there are problems with
this argument.
First, we should ask in what way the tracking is necessary. Some forms of tracking directly feed
into improving features for users: for example, tracking the click-through rate on search results
can help improve a search engine's result ranking and relevance, and tracking which products
customers tend to buy together can help an online shop suggest related products. However, when
tracking user interaction for content recommendations, or to build user profiles for advertising
purposes, it is less clear whether this is genuinely in the user's interest---or is it only
necessary because the ads pay for the service?
Second, users have little knowledge of what data they are feeding into our databases, or how it is
retained and processed---and most privacy policies do more to obscure than to illuminate. Without
understanding what happens to their data, users cannot give any meaningful consent. Often, data from
one user also says things about other people who are not users of the service and who have not
agreed to any terms. The derived datasets that we discussed in this part of the book---in which data
from the entire user base may have been combined with behavioral tracking and external data
sources---are precisely the kinds of data of which users cannot have any meaningful understanding.
Moreover, data is extracted from users through a one-way process, not a relationship with true
reciprocity, and not a fair value exchange. There is no dialog, no option for users to negotiate how
much data they provide and what service they receive in return: the relationship between the service
and the user is very asymmetric and one-sided. The terms are set by the service, not by the user
[^30], [^31].
In the European Union, the *General Data Protection Regulation* (GDPR) requires that consent must be
"freely given, specific, informed, and unambiguous", and that the user must be able to "refuse or
withdraw consent without detriment"---otherwise it is not considered "freely given". Any request for
consent must be written "in an intelligible and easily accessible form, using clear and plain
language". Moreover, "silence, pre-ticked boxes or inactivity \[do not\] constitute consent"
[^32]. There are other bases for lawful processing of personal data besides consent, such
as *legitimate interest*, which permits certain uses of data such as fraud prevention
[^33].
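To make these requirements concrete, here is a minimal sketch of how an application might record consent and check it before processing. It is an illustration only, not legal advice and not a complete GDPR implementation: the `ConsentRecord` schema, the `may_process` function, and the purpose names are all hypothetical, and other lawful bases such as legitimate interest are deliberately left out.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ConsentRecord:
    """One user's consent decision for one specific purpose (hypothetical schema)."""
    user_id: str
    purpose: str                           # e.g. "recommendations"; one record per purpose
    explicit_opt_in: bool                  # True only if the user actively opted in
    granted_at: Optional[datetime] = None
    withdrawn_at: Optional[datetime] = None

    def withdraw(self) -> None:
        # Record the withdrawal time rather than deleting the record,
        # so that the decision remains auditable.
        self.withdrawn_at = datetime.now(timezone.utc)


def may_process(records: list[ConsentRecord], user_id: str, purpose: str) -> bool:
    """Return True only if there is an active, explicit opt-in for this exact purpose."""
    for r in records:
        if r.user_id == user_id and r.purpose == purpose:
            return (
                r.explicit_opt_in
                and r.granted_at is not None
                and r.withdrawn_at is None
            )
    # No record at all means silence or inactivity, which does not count as consent.
    return False
```

The default answer is "no": a missing record, a box that was never actively ticked, or a withdrawn consent all lead to `may_process` returning `False`. What code like this cannot check is whether consent was "freely given" or "informed"; those depend on how the request is presented to the user and on whether refusing has a cost.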
You might argue that a user who does not consent to surveillance can simply choose not to use a
service. But this choice is not free either: if a service is so popular that it is "regarded by most
people as essential for basic social participation" [^30], then it is not reasonable to
expect people to opt out of this service---using it is *de facto* mandatory. For example, in most
Western social communities, it has become the norm to carry a smartphone, to use social networks for
socializing, and to use Google for finding information. Especially when a service has network
effects, there is a social cost to people choosing *not* to use it.
Declining to use a service due to its user tracking policies is easier said than done. These
platforms are designed specifically to engage users. Many use game mechanics and tactics common in
gambling to keep users coming back [^34]. Even if a user gets past this, declining to
engage is only an option for the small number of people who are privileged enough to have the time
and knowledge to understand a service's privacy policy, and who can afford to potentially miss out on social
participation or professional opportunities that may have arisen if they had participated in the
service. For people in a less privileged position, there is no meaningful freedom of choice:
surveillance becomes inescapable.
### Privacy and Use of Data {#id457}
Sometimes people claim that "privacy is dead" on the grounds that some users are willing to post all
sorts of things about their lives to social media, sometimes mundane and sometimes deeply personal.
However, this claim is false and rests on a misunderstanding of the word *privacy*.
Having privacy does not mean keeping everything secret; it means having the freedom to choose which
things to reveal to whom, what to make public, and what to keep secret. The right to privacy is a
decision right: it enables each person to decide where they want to be on the spectrum between
secrecy and transparency in each situation [^30]. It is an important aspect of a person's
freedom and autonomy.
For example, someone who suffers from a rare medical condition might be very happy to provide their
private medical data to researchers if there is a chance that it might help the development of
treatments for their condition. However, the important thing is that this person has a choice over
who may access this data, and for what purpose. If there were a risk that information about their
medical condition would harm their access to medical insurance or employment or other important
things, this person would probably be much more cautious about sharing their data.
When data is extracted from people through surveillance infrastructure, privacy rights are not
necessarily eroded, but rather transferred to the data collector. Companies that acquire data
essentially say "trust us to do the right thing with your data," which means that the right to
decide what to reveal and what to keep secret is transferred from the individual to the company.
The companies in turn choose to keep much of the outcome of this surveillance secret, because to
reveal it would be perceived as creepy, and would harm their business model (which relies on knowing
more about people than other companies do). Intimate information about users is only revealed
indirectly, for example in the form of tools for targeting advertisements to specific groups of
people (such as those suffering from a particular illness).
Even if particular users cannot be personally reidentified from the bucket of people targeted by a
particular ad, they have lost their agency over the disclosure of some intimate information. It is
not the user who decides what is revealed to whom on the basis of their personal preferences---it is
the company that exercises the privacy right with the goal of maximizing its profit.
Many companies have a goal of not being *perceived* as creepy---avoiding the question of how
intrusive their data collection actually is, and instead focusing on managing user perceptions. And
even these perceptions are often managed poorly: for example, something may be factually correct,
but if it triggers painful memories, the user may not want to be reminded about it [^35].
With any kind of data we should expect the possibility that it is wrong, undesirable, or
inappropriate in some way, and we need to build mechanisms for handling those failures. Whether
something is "undesirable" or "inappropriate" is of course down to human judgment; algorithms are
oblivious to such notions unless we explicitly program them to respect human needs. As engineers of
these systems we must be humble, accepting and planning for such failings.
Privacy settings that allow a user of an online service to control which aspects of their data other
users can see are a starting point for handing back some control to users. However, regardless of
the setting, the service itself still has unfettered access to the data, and is free to use it in
any way permitted by the privacy policy. Even if the service promises not to sell the data to third
parties, it usually grants itself unrestricted rights to process and analyze the data internally,
often going much further than what is overtly visible to users.
This kind of large-scale transfer of privacy rights from individuals to corporations is historically
unprecedented [^30]. Surveillance has always existed, but it used to be expensive and
manual, not scalable and automated. Trust relationships have always existed, for example between a
patient and their doctor, or between a defendant and their attorney---but in these cases the use of
data has been strictly governed by ethical, legal, and regulatory constraints. Internet services
have made it much easier to amass huge amounts of sensitive information without meaningful consent,
and to use it at massive scale without users understanding what is happening to their private data.
### Data as Assets and Power {#id376}
Since behavioral data is a byproduct of users interacting with a service, it is sometimes called
"data exhaust"---suggesting that the data is worthless waste material. Viewed this way, behavioral
and predictive analytics can be seen as a form of recycling that extracts value from data that would
have otherwise been thrown away.
More correct would be to view it the other way round: from an economic point of view, if targeted
advertising is what pays for a service, then the user activity that generates behavioral data could
be regarded as a form of labor [^36]. One could go even further and argue that the
application with which the user interacts is merely a means to lure users into feeding more and more
personal information into the surveillance infrastructure [^30]. The delightful human
creativity and social relationships that often find expression in online services are cynically
exploited by the data extraction machine.
Personal data is a valuable asset, as evidenced by the existence of data brokers, a shady industry
operating in secrecy, purchasing, aggregating, analyzing, inferring, and reselling intrusive
personal data about people, mostly for marketing purposes [^20]. Startups are valued by
their user numbers, by "eyeballs"---i.e., by their surveillance capabilities.
Because the data is valuable, many people want it. Of course companies want it---that's why they
collect it in the first place. But governments want to obtain it too: by means of secret deals,
coercion, legal compulsion, or simply stealing it [^37]. When a company goes bankrupt, the
personal data it has collected is one of the assets that gets sold. Moreover, the data is difficult
to secure, so breaches happen disconcertingly often.
These observations have led critics to say that data is not just an asset, but a "toxic asset"
[^37], or at least "hazardous material" [^38]. Maybe data is not the new gold,
nor the new oil, but rather the new uranium [^39]. Even if we think that we are capable of
preventing abuse of data, whenever we collect data, we need to balance the benefits with the risk of
it falling into the wrong hands: computer systems may be compromised by criminals or hostile foreign
intelligence services, data may be leaked by insiders, the company may fall into the hands of
unscrupulous management that does not share our values, or the country may be taken over by a regime
that has no qualms about compelling us to hand over the data.
When collecting data, we need to consider not just today's political environment, but all possible
future governments. There is no guarantee that every government elected in future will respect human
rights and civil liberties, so "it is poor civic hygiene to install technologies that could someday
facilitate a police state" [^40].
"Knowledge is power," as the old adage goes. And furthermore, "to scrutinize others while avoiding
scrutiny oneself is one of the most important forms of power" [^41]. This is why
totalitarian governments want surveillance: it gives them the power to control the population.
Although today's technology companies are not overtly seeking political power, the data and
knowledge they have accumulated nevertheless gives them a lot of power over our lives, much of which
is surreptitious, outside of public oversight [^42].
### Remembering the Industrial Revolution {#id377}
Data is the defining feature of the information age. The internet, data storage, processing, and
software-driven automation are having a major impact on the global economy and human society. As our
daily lives and social organization have been changed by information technology, and will probably
continue to radically change in the coming decades, comparisons to the Industrial Revolution come to
mind [^17], [^26].
The Industrial Revolution came about through major technological and agricultural advances, and it
brought sustained economic growth and significantly improved living standards in the long run. Yet
it also came with major problems: pollution of the air (due to smoke and chemical processes) and the
water (from industrial and human waste) was dreadful. Factory owners lived in splendor, while urban
workers often lived in very poor housing and worked long hours in harsh conditions. Child labor was
common, including dangerous and poorly paid work in mines.
It took a long time before safeguards were established, such as environmental protection
regulations, safety protocols for workplaces, outlawing child labor, and health inspections for
food. Undoubtedly the cost of doing business increased when factories were no longer allowed to dump
their waste into rivers, sell tainted foods, or exploit workers. But society as a whole benefited
hugely from these regulations, and few of us would want to return to a time before [^17].
Just as the Industrial Revolution had a dark side that needed to be managed, our transition to the
information age has major problems that we need to confront and solve [^43], [^44].
The collection and use of data is one of those problems. In the words of Bruce Schneier
[^26]:
> Data is the pollution problem of the information age, and protecting privacy is the environmental
> challenge. Almost all computers produce information. It stays around, festering. How we deal with
> it---how we contain it and how we dispose of it---is central to the health of our information
> economy. Just as we look back today at the early decades of the industrial age and wonder how our
> ancestors could have ignored pollution in their rush to build an industrial world, our
> grandchildren will look back at us during these early decades of the information age and judge us
> on how we addressed the challenge of data collection and misuse.
>
> We should try to make them proud.
### Legislation and Self-Regulation {#sec_future_legislation}
Data protection laws might be able to help preserve individuals' rights. For example, the European
GDPR states that personal data must be "collected for specified, explicit and legitimate purposes
and not further processed in a manner that is incompatible with those purposes", and furthermore
that data must be "adequate, relevant and limited to what is necessary in relation to the purposes
for which they are processed" [^32].
However, this principle of *data minimization* runs directly counter to the philosophy of Big Data,
which is to maximize data collection, to combine it with other datasets, to experiment and to
explore in order to generate new insights. Exploration means using data for unforeseen purposes,
which is the opposite of the "specified and explicit" purposes for which the data must have been
collected. While the GDPR has had some effect on the online advertising industry [^45],
the regulation has been weakly enforced [^46], and it does not seem to have led to much of
a change in culture and practices across the wider tech industry.
Companies that collect lots of data about people oppose regulation as being a burden and a hindrance
to innovation. To some extent that opposition is justified. For example, when sharing medical data,
there are clear risks to privacy, but there are also potential opportunities: how many deaths could
be prevented if data analysis were able to help us achieve better diagnostics or find better
treatments [^47]? Over-regulation may prevent such breakthroughs. It is difficult to
balance such potential opportunities with the risks [^41].
Fundamentally, we need a culture shift in the tech industry with regard to personal data. We should
stop regarding users as metrics to be optimized, and remember that they are humans who deserve
respect, dignity, and agency. We should self-regulate our data collection and processing practices
in order to establish and maintain the trust of the people who depend on our software
[^48]. And we should take it upon ourselves to educate end users about how their data is
used, rather than keeping them in the dark.
We should allow each individual to maintain their privacy---i.e., their control over their own data---and
not steal that control from them through surveillance. Our individual right to control our data is
like the natural environment of a national park: if we don't explicitly protect and care for it, it
will be destroyed. It will be the tragedy of the commons, and we will all be worse off for it.
Ubiquitous surveillance is not inevitable---we are still able to stop it.
As a first step, we should not retain data forever, but purge it as soon as it is no longer needed,
and minimize what we collect in the first place [^48], [^49]. Data you don't have is
data that can't be leaked, stolen, or compelled by governments to be handed over. Overall, culture
and attitude changes will be necessary. As people working in technology, if we don't consider the
societal impact of our work, we're not doing our job [^50].
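As a rough illustration of what minimization and purging could look like in code, the following sketch drops fields that are not needed at ingestion time and removes events once they exceed a retention period. The field names, the 90-day period, and the functions `minimize` and `purge_expired` are assumptions made up for this example, not a recommendation for any particular system.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical policy: the only fields we have a stated need for,
# and how long we are willing to keep an event.
ALLOWED_FIELDS = {"user_id", "item_id", "timestamp"}
RETENTION = timedelta(days=90)


def minimize(event: dict) -> dict:
    """Keep only the allowed fields; everything else is dropped at ingestion."""
    return {k: v for k, v in event.items() if k in ALLOWED_FIELDS}


def purge_expired(events: list[dict], now: datetime) -> list[dict]:
    """Return only the events that are still within the retention period."""
    cutoff = now - RETENTION
    return [e for e in events if e["timestamp"] >= cutoff]


now = datetime(2025, 1, 1, tzinfo=timezone.utc)
raw = {
    "user_id": "u1",
    "item_id": "i9",
    "timestamp": now - timedelta(days=1),
    "ip_address": "203.0.113.7",   # not needed for the stated purpose, so never stored
}
recent = minimize(raw)
old = {"user_id": "u2", "item_id": "i1", "timestamp": now - timedelta(days=91)}
kept = purge_expired([recent, old], now)   # only `recent` remains
```

In a real system the purge would also need to reach every place the data has been copied to, such as derived datasets, caches, and backups, which is considerably harder than filtering a list.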
## Summary {#id594}
This brings us to the end of the book. We have covered a lot of ground:
- In [Chapter 1](/en/ch1#ch_tradeoffs) we contrasted analytical and operational systems, compared
the cloud to self-hosting, weighed up distributed and single-node systems, and discussed balancing
the needs of your business with the needs of your users.
- In [Chapter 2](/en/ch2#ch_nonfunctional) we saw how to define several nonfunctional requirements
such as performance, reliability, scalability, and maintainability.
- In [Chapter 3](/en/ch3#ch_datamodels) we explored a spectrum of data models, including the
relational, document, and graph models, event sourcing, and DataFrames. We also looked at examples
of various query languages, including SQL, Cypher, SPARQL, Datalog, and GraphQL.
- In [Chapter 4](/en/ch4#ch_storage) we discussed storage engines for OLTP (LSM-trees and B-trees),
for analytics (column-oriented storage), and indexes for information retrieval (full-text and
vector search).
- In [Chapter 5](/en/ch5#ch_encoding) we examined different ways of encoding data objects as bytes,
and how to support evolution as requirements change. We also compared several ways in which data flows
between processes: via databases, service calls, workflow engines, or event-driven architectures.
- In [Chapter 6](/en/ch6#ch_replication) we studied the trade-offs between single-leader,
multi-leader, and leaderless replication. We also looked at consistency models such as
read-after-write consistency, and sync engines that allow clients to work offline.
- In [Chapter 7](/en/ch7#ch_sharding) we went into sharding, including strategies for rebalancing,
request routing, and secondary indexing.
- In [Chapter 8](/en/ch8#ch_transactions) we covered transactions: durability, how various isolation
levels (read committed, snapshot isolation, and serializable) can be achieved, and how atomicity
can be ensured in distributed transactions.
- In [Chapter 9](/en/ch9#ch_distributed) we surveyed fundamental problems that occur in distributed
systems (network faults and delays, clock errors, process pauses, crashes), and saw how they make
it difficult to correctly implement even something seemingly simple like a lock.
- In [Chapter 10](/en/ch10#ch_consistency) we went on a deep-dive into various forms of consensus
and the consistency model (linearizability) it enables.
- In [Chapter 11](/en/ch11#ch_batch) we dug into batch processing, building up from simple chains of
Unix tools to large-scale distributed batch processors using distributed filesystems or object
stores.
- In [Chapter 12](/en/ch12#ch_stream) we generalized batch processing to stream processing,
discussed the underlying message brokers, change data capture, fault tolerance, and processing
patterns such as streaming joins.
- In [Chapter 13](/en/ch13#ch_philosophy) we explored a philosophy of streaming systems that allows
disparate data systems to be integrated, systems to be evolved, and applications to be scaled more
easily.
Finally, in this last chapter, we took a step back and examined some ethical aspects of building
data-intensive applications. We saw that although data can be used to do good, it can also do
significant harm: making decisions that seriously affect people's lives and are difficult to appeal
against, leading to discrimination and exploitation, normalizing surveillance, and exposing intimate
information. We also run the risk of data breaches, and we may find that a well-intentioned use of
data has unintended consequences.
As software and data are having such a large impact on the world, we as engineers must remember that
we carry a responsibility to work toward the kind of world that we want to live in: a world that
treats people with humanity and respect. Let's work together towards that goal.
### References {#references}
[^1]: David Schmudde. [What If Data Is a Bad Idea?](https://schmud.de/posts/2024-08-18-data-is-a-bad-idea.html). *schmud.de*, August 2024. Archived at [perma.cc/ZXU5-XMCT](https://perma.cc/ZXU5-XMCT)
[^2]: [ACM Code of Ethics and Professional Conduct](https://www.acm.org/code-of-ethics). Association for Computing Machinery, *acm.org*, 2018. Archived at [perma.cc/SEA8-CMB8](https://perma.cc/SEA8-CMB8)
[^3]: Igor Perisic. [Making Hard Choices: The Quest for Ethics in Machine Learning](https://www.linkedin.com/blog/engineering/archive/making-hard-choices-the-quest-for-ethics-in-machine-learning). *linkedin.com*, November 2016. Archived at [perma.cc/DGF8-KNT7](https://perma.cc/DGF8-KNT7)
[^4]: John Naughton. [Algorithm Writers Need a Code of Conduct](https://www.theguardian.com/commentisfree/2015/dec/06/algorithm-writers-should-have-code-of-conduct). *theguardian.com*, December 2015. Archived at [perma.cc/TBG2-3NG6](https://perma.cc/TBG2-3NG6)
[^5]: Ben Green. ["Good" isn't good enough](https://www.benzevgreen.com/wp-content/uploads/2019/11/19-ai4sg.pdf). At *NeurIPS Joint Workshop on AI for Social Good*, December 2019. Archived at [perma.cc/H4LN-7VY3](https://perma.cc/H4LN-7VY3)
[^6]: Deborah G. Johnson and Mario Verdicchio. [Ethical AI is Not about AI](https://cacm.acm.org/opinion/ethical-ai-is-not-about-ai/). *Communications of the ACM*, volume 66, issue 2, pages 32--34, January 2023. [doi:10.1145/3576932](https://doi.org/10.1145/3576932)
[^7]: Marc Steen. [Ethics as a Participatory and Iterative Process](https://cacm.acm.org/opinion/ethics-as-a-participatory-and-iterative-process/). *Communications of the ACM*, volume 66, issue 5, pages 27--29, April 2023. [doi:10.1145/3550069](https://doi.org/10.1145/3550069)
[^8]: Logan Kugler. [What Happens When Big Data Blunders?](https://cacm.acm.org/news/what-happens-when-big-data-blunders/) *Communications of the ACM*, volume 59, issue 6, pages 15--16, June 2016. [doi:10.1145/2911975](https://doi.org/10.1145/2911975)
[^9]: Miri Zilka. [Algorithms and the criminal justice system: promises and challenges in deployment and research](https://www.cl.cam.ac.uk/research/security/seminars/archive/video/2023-03-07-t196231.html). At *University of Cambridge Security Seminar Series*, March 2023.
[^10]: Bill Davidow. [Welcome to Algorithmic Prison](https://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/). *theatlantic.com*, February 2014. Archived at [archive.org](https://web.archive.org/web/20171019201812/https://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/)
[^11]: Don Peck. [They're Watching You at Work](https://www.theatlantic.com/magazine/archive/2013/12/theyre-watching-you-at-work/354681/). *theatlantic.com*, December 2013. Archived at [perma.cc/YR9T-6M38](https://perma.cc/YR9T-6M38)
[^12]: Leigh Alexander. [Is an Algorithm Any Less Racist Than a Human?](https://www.theguardian.com/technology/2016/aug/03/algorithm-racist-human-employers-work) *theguardian.com*, August 2016. Archived at [perma.cc/XP93-DSVX](https://perma.cc/XP93-DSVX)
[^13]: Jesse Emspak. [How a Machine Learns Prejudice](https://www.scientificamerican.com/article/how-a-machine-learns-prejudice/). *scientificamerican.com*, December 2016. Archived at [perma.cc/R3L5-55E6](https://perma.cc/R3L5-55E6)
[^14]: Rohit Chopra, Kristen Clarke, Charlotte A. Burrows, and Lina M. Khan. [Joint Statement on Enforcement Efforts Against Discrimination and Bias in Automated Systems](https://www.ftc.gov/system/files/ftc_gov/pdf/EEOC-CRT-FTC-CFPB-AI-Joint-Statement%28final%29.pdf). *ftc.gov*, April 2023. Archived at [perma.cc/YY4Y-RCCA](https://perma.cc/YY4Y-RCCA)
[^15]: Maciej Cegłowski. [The Moral Economy of Tech](https://idlewords.com/talks/sase_panel.htm). *idlewords.com*, June 2016. Archived at [perma.cc/L8XV-BKTD](https://perma.cc/L8XV-BKTD)
[^16]: Greg Nichols. [Artificial Intelligence in healthcare is racist](https://www.zdnet.com/article/artificial-intelligence-in-healthcare-is-racist/). *zdnet.com*, November 2020. Archived at [perma.cc/3MKW-YKRS](https://perma.cc/3MKW-YKRS)
[^17]: Cathy O'Neil. *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. ISBN: 978-0-553-41881-1
[^18]: Julia Angwin. [Make Algorithms Accountable](https://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html). *nytimes.com*, August 2016. Archived at [archive.org](https://web.archive.org/web/20230819055242/https://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html)
[^19]: Bryce Goodman and Seth Flaxman. [European Union Regulations on Algorithmic Decision-Making and a 'Right to Explanation'](https://arxiv.org/abs/1606.08813). At *ICML Workshop on Human Interpretability in Machine Learning*, June 2016. Archived at [arxiv.org/abs/1606.08813](https://arxiv.org/abs/1606.08813)
[^20]: [A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes](https://www.commerce.senate.gov/services/files/0d2b3642-6221-4888-a631-08f2f255b577). Staff Report, *United States Senate Committee on Commerce, Science, and Transportation*, *commerce.senate.gov*, December 2013. Archived at [perma.cc/32NV-YWLQ](https://perma.cc/32NV-YWLQ)
[^21]: Stephanie Assad, Robert Clark, Daniel Ershov, and Lei Xu. [Algorithmic Pricing and Competition: Empirical Evidence from the German Retail Gasoline Market](https://economics.yale.edu/sites/default/files/clark_acex_jan_2021.pdf). *Journal of Political Economy*, volume 132, issue 3, pages 723--771, March 2024. [doi:10.1086/726906](https://doi.org/10.1086/726906)
[^22]: Donella H. Meadows and Diana Wright. *Thinking in Systems: A Primer*. Chelsea Green Publishing, 2008. ISBN: 978-1-603-58055-7
[^23]: Daniel J. Bernstein. [Listening to a "big data"/"data science" talk. Mentally translating "data" to "surveillance": "...everything starts with surveillance..."](https://x.com/hashbreaker/status/598076230437568512) *x.com*, May 2015. Archived at [perma.cc/EY3D-WBBJ](https://perma.cc/EY3D-WBBJ)
[^24]: Marc Andreessen. [Why Software Is Eating the World](https://a16z.com/why-software-is-eating-the-world/). *a16z.com*, August 2011. Archived at [perma.cc/3DCC-W3G6](https://perma.cc/3DCC-W3G6)
[^25]: J. M. Porup. ['Internet of Things' Security Is Hilariously Broken and Getting Worse](https://arstechnica.com/information-technology/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/). *arstechnica.com*, January 2016. Archived at [archive.org](https://web.archive.org/web/20250823001716/https://arstechnica.com/information-technology/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/)
[^26]: Bruce Schneier. [*Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World*](https://www.schneier.com/books/data_and_goliath/). W. W. Norton, 2015. ISBN: 978-0-393-35217-7
[^27]: The Grugq. [Nothing to Hide](https://grugq.tumblr.com/post/142799983558/nothing-to-hide). *grugq.tumblr.com*, April 2016. Archived at [perma.cc/BL95-8W5M](https://perma.cc/BL95-8W5M)
[^28]: Federal Trade Commission. [FTC Takes Action Against General Motors for Sharing Drivers' Precise Location and Driving Behavior Data Without Consent](https://www.ftc.gov/news-events/news/press-releases/2025/01/ftc-takes-action-against-general-motors-sharing-drivers-precise-location-driving-behavior-data). *ftc.gov*, January 2025. Archived at [perma.cc/3XGV-3HRD](https://perma.cc/3XGV-3HRD)
[^29]: Tony Beltramelli. [Deep-Spying: Spying Using Smartwatch and Deep Learning](https://arxiv.org/abs/1512.05616). Master's Thesis, IT University of Copenhagen, December 2015. Archived at [arxiv.org/abs/1512.05616](https://arxiv.org/abs/1512.05616)
[^30]: Shoshana Zuboff. [Big Other: Surveillance Capitalism and the Prospects of an Information Civilization](https://papers.ssrn.com/sol3/papers.cfm?abstract_id=2594754). *Journal of Information Technology*, volume 30, issue 1, pages 75--89, April 2015. [doi:10.1057/jit.2015.5](https://doi.org/10.1057/jit.2015.5)
[^31]: Michiel Rhoen. [Beyond Consent: Improving Data Protection Through Consumer Protection Law](https://policyreview.info/articles/analysis/beyond-consent-improving-data-protection-through-consumer-protection-law). *Internet Policy Review*, volume 5, issue 1, March 2016. [doi:10.14763/2016.1.404](https://doi.org/10.14763/2016.1.404)
[^32]: [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016](https://eur-lex.europa.eu/eli/reg/2016/679/oj/eng). *Official Journal of the European Union*, L 119/1, May 2016.
[^33]: UK Information Commissioner's Office. [What is the 'legitimate interests' basis?](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/lawful-basis/legitimate-interests/what-is-the-legitimate-interests-basis/) *ico.org.uk*. Archived at [perma.cc/W8XR-F7ML](https://perma.cc/W8XR-F7ML)
[^34]: Tristan Harris. [How a handful of tech companies control billions of minds every day](https://www.ted.com/talks/tristan_harris_how_a_handful_of_tech_companies_control_billions_of_minds_every_day). At *TED2017*, April 2017.
[^35]: Carina C. Zona. [Consequences of an Insightful Algorithm](https://www.youtube.com/watch?v=YRI40A4tyWU). At *GOTO Berlin*, November 2016.
[^36]: Imanol Arrieta Ibarra, Leonard Goff, Diego Jiménez Hernández, Jaron Lanier, and E. Glen Weyl. [Should We Treat Data as Labor? Moving Beyond 'Free'](https://www.aeaweb.org/conference/2018/preliminary/paper/2Y7N88na). *American Economic Association Papers Proceedings*, volume 1, issue 1, December 2017.
[^37]: Bruce Schneier. [Data Is a Toxic Asset, So Why Not Throw It Out?](https://www.schneier.com/essays/archives/2016/03/data_is_a_toxic_asse.html) *schneier.com*, March 2016. Archived at [perma.cc/4GZH-WR3D](https://perma.cc/4GZH-WR3D)
[^38]: Cory Scott. [Data is not toxic - which implies no benefit - but rather hazardous material, where we must balance need vs. want](https://x.com/cory_scott/status/706586399483437056). *x.com*, March 2016. Archived at [perma.cc/CLV7-JF2E](https://perma.cc/CLV7-JF2E)
[^39]: Mark Pesce. [Data is the new uranium -- incredibly powerful and amazingly dangerous](https://www.theregister.com/2024/11/20/data_is_the_new_uranium/). *theregister.com*, November 2024. Archived at [perma.cc/NV8B-GYGV](https://perma.cc/NV8B-GYGV)
[^40]: Bruce Schneier. [Mission Creep: When Everything Is Terrorism](https://www.schneier.com/essays/archives/2013/07/mission_creep_when_e.html). *schneier.com*, July 2013. Archived at [perma.cc/QB2C-5RCE](https://perma.cc/QB2C-5RCE)
[^41]: Lena Ulbricht and Maximilian von Grafenstein. [Big Data: Big Power Shifts?](https://policyreview.info/articles/analysis/big-data-big-power-shifts) *Internet Policy Review*, volume 5, issue 1, March 2016. [doi:10.14763/2016.1.406](https://doi.org/10.14763/2016.1.406)
[^42]: Ellen P. Goodman and Julia Powles. [Facebook and Google: Most Powerful and Secretive Empires We've Ever Known](https://www.theguardian.com/technology/2016/sep/28/google-facebook-powerful-secretive-empire-transparency). *theguardian.com*, September 2016. Archived at [perma.cc/8UJA-43G6](https://perma.cc/8UJA-43G6)
[^43]: Judy Estrin and Sam Gill. [The World Is Choking on Digital Pollution](https://washingtonmonthly.com/2019/01/13/the-world-is-choking-on-digital-pollution/). *washingtonmonthly.com*, January 2019. Archived at [perma.cc/3VHF-C6UC](https://perma.cc/3VHF-C6UC)
[^44]: A. Michael Froomkin. [Regulating Mass Surveillance as Privacy Pollution: Learning from Environmental Impact Statements](https://repository.law.miami.edu/cgi/viewcontent.cgi?article=1062&context=fac_articles). *University of Illinois Law Review*, volume 2015, issue 5, August 2015. Archived at [perma.cc/24ZL-VK2T](https://perma.cc/24ZL-VK2T)
[^45]: Pengyuan Wang, Li Jiang, and Jian Yang. [The Early Impact of GDPR Compliance on Display Advertising: The Case of an Ad Publisher](https://openreview.net/pdf?id=TUnLHNo19S). *Journal of Marketing Research*, volume 61, issue 1, April 2023. [doi:10.1177/00222437231171848](https://doi.org/10.1177/00222437231171848)
[^46]: Johnny Ryan. [Don't be fooled by Meta's fine for data breaches](https://www.economist.com/by-invitation/2023/05/24/dont-be-fooled-by-metas-fine-for-data-breaches-says-johnny-ryan). *The Economist*, May 2023. Archived at [perma.cc/VCR6-55HR](https://perma.cc/VCR6-55HR)
[^47]: Jessica Leber. [Your Data Footprint Is Affecting Your Life in Ways You Can't Even Imagine](https://www.fastcompany.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine). *fastcompany.com*, March 2016. Archived at [archive.org](https://web.archive.org/web/20161128133016/https://www.fastcoexist.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine)
[^48]: Maciej Cegłowski. [Haunted by Data](https://idlewords.com/talks/haunted_by_data.htm). *idlewords.com*, October 2015. Archived at [archive.org](https://web.archive.org/web/20161130143932/https://idlewords.com/talks/haunted_by_data.htm)
[^49]: Sam Thielman. [You Are Not What You Read: Librarians Purge User Data to Protect Privacy](https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy). *theguardian.com*, January 2016. Archived at [archive.org](https://web.archive.org/web/20250828224851/https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy)
[^50]: Jez Humble. [It's a cliché that people get into tech to "change the world". So then, you have to actually consider what the impact of your work is on the world. The idea that you can or should exclude societal and political discussions in tech is idiotic. It means you're not doing your job](https://x.com/jezhumble/status/1386758340894597122). *x.com*, April 2021. Archived at [perma.cc/3NYS-MHLC](https://perma.cc/3NYS-MHLC)


@ -4,6 +4,8 @@ weight: 102
breadcrumbs: false
---
<a id="ch_nonfunctional"></a>
![](/map/ch01.png)
> *The Internet was done so well that most people think of it as a natural resource like the Pacific
@ -55,7 +57,7 @@ Barack Obama have over 100 million followers).
### Representing Users, Posts, and Follows {#id20}
Imagine we keep all of the data in a relational database as shown in [Figure 2-1](/en/ch2#fig_twitter_relational). We
Imagine we keep all of the data in a relational database as shown in [Figure 2-1](/en/ch2#fig_twitter_relational). We
have one table for users, one table for posts, and one table for follow relationships.
{{< figure src="/fig/ddia_0201.png" id="fig_twitter_relational" caption="Figure 2-1. Simple relational schema for a social network in which users can follow each other." class="w-full my-4" >}}
@ -107,7 +109,7 @@ needs to subscribe to the stream of posts being added to their home timeline.
The downside of this approach is that we now need to do more work every time a user makes a post,
because the home timelines are derived data that needs to be updated. The process is illustrated in
[Figure 2-2](/en/ch2#fig_twitter_timelines). When one initial request results in several downstream requests being
[Figure 2-2](/en/ch2#fig_twitter_timelines). When one initial request results in several downstream requests being
carried out, we use the term *fan-out* to describe the factor by which the number of requests
increases.
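
The fan-out described above can be sketched in a few lines. This is a minimal illustration, assuming in-memory dictionaries stand in for the follows table and the timeline cache; the names `followers`, `timelines`, `follow`, and `publish_post` are hypothetical and do not come from any real system.

```python
from collections import defaultdict

# Hypothetical in-memory stand-ins for the follows table and the timeline cache.
followers = defaultdict(set)    # author_id -> set of follower ids
timelines = defaultdict(list)   # user_id -> list of post ids, newest first


def follow(follower_id, author_id):
    """Record that follower_id follows author_id."""
    followers[author_id].add(follower_id)


def publish_post(author_id, post_id):
    """Insert the post into every follower's home timeline.

    Returns the number of downstream writes caused by this one request,
    which is the fan-out factor for the post.
    """
    writes = 0
    for follower_id in followers[author_id]:
        timelines[follower_id].insert(0, post_id)
        writes += 1
    return writes


follow("alice", "carol")
follow("bob", "carol")
fan_out = publish_post("carol", "post-1")
print(fan_out)                # number of timeline writes for one post
print(timelines["alice"])     # alice's precomputed home timeline
```

The cost of `publish_post` grows with the number of followers, which is why accounts with very many followers need special handling.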
@ -126,7 +128,7 @@ load, since we simply serve them from a cache.
This process of precomputing and updating the results of a query is called *materialization*, and
the timeline cache is an example of a *materialized view* (a concept we will discuss further in
[Link to Come]). The materialized view speeds up reads, but in return we have to do more work on
[“Maintaining materialized views”](/en/ch12#sec_stream_mat_view)). The materialized view speeds up reads, but in return we have to do more work on
write. The cost of writes for most users is modest, but a social network also has to consider some
extreme cases:
@ -163,7 +165,7 @@ metrics, whereas the “time it takes to load the home timeline” or the “tim
delivered to followers” are response time metrics.
There is often a connection between throughput and response time; an example of such a relationship
for an online service is sketched in [Figure 2-3](/en/ch2#fig_throughput). The service has a low response time when
for an online service is sketched in [Figure 2-3](/en/ch2#fig_throughput). The service has a low response time when
request throughput is low, but response time increases as load increases. This is because of
*queueing*: when a request arrives on a highly loaded system, it’s likely that the CPU is already in
the process of handling an earlier request, and therefore the incoming request needs to wait until
@ -175,6 +177,8 @@ handle, queueing delays increase sharply.
--------
<a id="sidebar_metastable"></a>
> [!TIP] WHEN AN OVERLOADED SYSTEM WON'T RECOVER
If a system is close to overload, with throughput pushed close to the limit, it can sometimes enter a
@ -206,7 +210,7 @@ scalability in [“Scalability”](/en/ch2#sec_introduction_scalability).
### Latency and Response Time {#id23}
“Latency” and “response time” are sometimes used interchangeably, but in this book we will use the
terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)):
terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)):
* The *response time* is what the client sees; it includes all delays incurred anywhere in the
system.
@ -221,7 +225,7 @@ terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)
{{< figure src="/fig/ddia_0204.png" id="fig_response_time" caption="Figure 2-4. Response time, service time, network latency, and queueing delay." class="w-full my-4" >}}
In [Figure 2-4](/en/ch2#fig_response_time), time flows from left to right, each communicating node is shown as a
In [Figure 2-4](/en/ch2#fig_response_time), time flows from left to right, each communicating node is shown as a
horizontal line, and a request or response message is shown as a thick diagonal arrow from one node
to another. You will encounter this style of diagram frequently over the course of this book.
@ -242,7 +246,7 @@ it is important to measure response times on the client side.
### Average, Median, and Percentiles {#id24}
Because the response time varies from one request to the next, we need to think of it not as a
single number, but as a *distribution* of values that you can measure. In [Figure 2-5](/en/ch2#fig_lognormal), each
single number, but as a *distribution* of values that you can measure. In [Figure 2-5](/en/ch2#fig_lognormal), each
gray bar represents a request to a service, and its height shows how long that request took. Most
requests are reasonably fast, but there are occasional *outliers* that take much longer.
Variation in network delay is also known as *jitter*.
@ -257,7 +261,7 @@ because it doesn’t tell you how many users actually experienced that delay.
Usually it is better to use *percentiles*. If you take your list of response times and sort it from
fastest to slowest, then the *median* is the halfway point: for example, if your median response
time is 200 ms, that means half your requests return in less than 200 ms, and half your
time is 200 ms, that means half your requests return in less than 200 ms, and half your
requests take longer than that. This makes the median a good metric if you want to know how long
users typically have to wait. The median is also known as the *50th percentile*, and sometimes
abbreviated as *p50*.
@ -267,7 +271,7 @@ In order to figure out how bad your outliers are, you can look at higher percent
response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular
threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of
100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is
illustrated in [Figure 2-5](/en/ch2#fig_lognormal).
illustrated in [Figure 2-5](/en/ch2#fig_lognormal).
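
As a rough sketch of how the median and higher percentiles fall out of a sorted list of response times, the following uses the nearest-rank convention. That convention is an assumption made here (several slightly different definitions of a percentile are in use), and the sample values and the `percentile` helper are made up for illustration.

```python
import math


def percentile(sorted_times, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it."""
    if not sorted_times:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_times) / 100))
    return sorted_times[rank - 1]


# Made-up response times in milliseconds, sorted from fastest to slowest.
response_times_ms = sorted([120, 95, 200, 180, 1500, 210, 130, 160, 2400, 175])

p50 = percentile(response_times_ms, 50)   # the median
p95 = percentile(response_times_ms, 95)   # dominated by the outliers
print(p50, p95)
```

With so few samples the high percentiles are determined entirely by the one or two slowest requests, so a real dashboard would compute them over a much larger window.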
High percentiles of response times, also known as *tail latencies*, are important because they
directly affect users’ experience of the service. For example, Amazon describes response time
@ -291,14 +295,14 @@ However, it is surprisingly difficult to get hold of reliable data to quantify t
latency has on user behavior.
Some often-cited statistics are unreliable. In 2006 Google reported that a slowdown in search
results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue [^21].
However, another Google study from 2009 reported that a 400 ms increase in latency resulted in
results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue [^21].
However, another Google study from 2009 reported that a 400 ms increase in latency resulted in
only 0.6% fewer searches per day [^22],
and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3% [^23].
Newer data from these companies appears not to be publicly available.
A more recent Akamai study [^24]
claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites
claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites
by up to 7%; however, on closer inspection, the same study reveals that very *fast* page load times
are also correlated with lower conversion rates! This seemingly paradoxical result is explained by
the fact that the pages that load fastest are often those that have no useful content (e.g., 404
@ -316,7 +320,7 @@ fast and slow responses is 1.25 seconds or more.
High percentiles are especially important in backend services that are called multiple times as
part of serving a single end-user request. Even if you make the calls in parallel, the end-user
request still needs to wait for the slowest of the parallel calls to complete. It takes just one
slow call to make the entire end-user request slow, as illustrated in [Figure 2-6](/en/ch2#fig_tail_amplification).
slow call to make the entire end-user request slow, as illustrated in [Figure 2-6](/en/ch2#fig_tail_amplification).
Even if only a small percentage of backend calls are slow, the chance of getting a slow call
increases if an end-user request requires multiple backend calls, and so a higher proportion of
end-user requests end up being slow (an effect known as *tail latency amplification* [^26]).
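
To get a feel for the size of this effect, suppose each backend call is slow with probability `p`, independently of the others. Independence is a simplifying assumption made here, not something the text claims. An end-user request that waits for `n` parallel calls is then slow with probability `1 - (1 - p)**n`. The function name below is hypothetical.

```python
def prob_request_slow(p_slow_call, n_calls):
    """Probability that at least one of n independent backend calls is slow."""
    return 1 - (1 - p_slow_call) ** n_calls


# Only 1% of backend calls are slow, but the end-user request waits for all of them.
for n in (1, 10, 100):
    print(n, round(prob_request_slow(0.01, n), 3))
```

Under these assumptions, a request that fans out to 100 backend calls is slow roughly 63% of the time, even though only 1% of individual calls are slow.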
@ -326,13 +330,15 @@ end-user requests end up being slow (an effect known as *tail latency amplificat
Percentiles are often used in *service level objectives* (SLOs) and *service level agreements*
(SLAs) as ways of defining the expected performance and availability of a service [^27].
For example, an SLO may set a target for a service to have a median response time of less than
200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests
200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests
result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not
met (for example, customers may be entitled to a refund). That is the basic idea, at least; in
practice, defining good availability metrics for SLOs and SLAs is not straightforward [^28] [^29].
--------
<a id="sidebar_percentiles"></a>
> [!TIP] COMPUTING PERCENTILES
If you want to add response time percentiles to the monitoring dashboards for your services, you
@ -395,7 +401,7 @@ For example, in the social network case study, a fault that might happen is that
process, a machine involved in updating the materialized timelines crashes or becomes unavailable.
To make this process fault-tolerant, we would need to ensure that another machine can take over this
task without missing any posts that should have been delivered, and without duplicating any posts.
(This idea is known as *exactly-once semantics*, and we will examine it in detail in [Link to Come].)
(This idea is known as *exactly-once semantics*, and we will examine it in detail in [“The End-to-End Argument for Databases”](/en/ch13#sec_future_end_to_end).)
Fault tolerance is always limited to a certain number of certain types of faults. For example, a
system might be able to tolerate a maximum of two hard drives failing at the same time, or a maximum
@ -473,14 +479,14 @@ resources.
The fault-tolerance techniques we discuss in this book are designed to tolerate the loss of entire
machines, racks, or availability zones. They generally work by allowing a machine in one datacenter
to take over when a machine in another datacenter fails or becomes unreachable. We will discuss such
techniques for fault tolerance in [Chapter 6](/en/ch6#ch_replication), [Chapter 10](/en/ch10#ch_consistency), and at various other
techniques for fault tolerance in [Chapter 6](/en/ch6#ch_replication), [Chapter 10](/en/ch10#ch_consistency), and at various other
points in this book.
Systems that can tolerate the loss of entire machines also have operational advantages: a
single-server system requires planned downtime if you need to reboot the machine (to apply operating
system security patches, for example), whereas a multi-node fault-tolerant system can be patched by
restarting one node at a time, without affecting the service for users. This is called a *rolling
upgrade*, and we will discuss it further in [Chapter 5](/en/ch5#ch_encoding).
upgrade*, and we will discuss it further in [Chapter 5](/en/ch5#ch_encoding).
#### Software faults {#software-faults}
@ -559,6 +565,8 @@ work with it every day, and take steps to improve it based on this feedback [^71
--------
<a id="sidebar_reliability_importance"></a>
> [!TIP] HOW IMPORTANT IS RELIABILITY?
Reliability is not just for nuclear power stations and air traffic control—more mundane applications
@ -691,8 +699,8 @@ The advantages of shared-nothing are that it has the potential to scale linearly
whatever hardware offers the best price/performance ratio (especially in the cloud), it can more
easily adjust its hardware resources as load increases or decreases, and it can achieve greater
fault tolerance by distributing the system across multiple data centers and regions. The downsides
are that it requires explicit sharding (see [Chapter 7](/en/ch7#ch_sharding)), and it incurs all the complexity of
distributed systems ([Chapter 9](/en/ch9#ch_distributed)).
are that it requires explicit sharding (see [Chapter 7](/en/ch7#ch_sharding)), and it incurs all the complexity of
distributed systems ([Chapter 9](/en/ch9#ch_distributed)).
Some cloud-native database systems use separate services for storage and transaction execution (see
[“Separation of storage and compute”](/en/ch1#sec_introduction_storage_compute)), with multiple compute nodes sharing access to the same
@ -706,9 +714,9 @@ the database [^83].
The architecture of systems that operate at large scale is usually highly specific to the
application—there is no such thing as a generic, one-size-fits-all scalable architecture
(informally known as *magic scaling sauce*). For example, a system that is designed to handle
100,000 requests per second, each 1 kB in size, looks very different from a system that is
designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same
data throughput (100 MB/sec).
100,000 requests per second, each 1 kB in size, looks very different from a system that is
designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same
data throughput (100 MB/sec).
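
The arithmetic behind that comparison can be checked directly. This sketch assumes decimal units (1 kB = 1,000 bytes, 1 GB = 10^9 bytes); the variable names are illustrative.

```python
KB = 1_000
MB = 1_000_000
GB = 1_000_000_000

# 100,000 requests per second, each 1 kB in size
small_requests = 100_000 * 1 * KB    # bytes per second

# 3 requests per minute, each 2 GB in size
large_requests = 3 * 2 * GB / 60     # bytes per second

print(small_requests / MB, large_requests / MB)   # both in MB/sec
```

Both workloads move the same number of bytes per second, yet the first is dominated by per-request overhead and the second by the handling of very large payloads.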
Moreover, an architecture that is appropriate for one level of load is unlikely to cope with 10
times that load. If you are working on a fast-growing service, it is therefore likely that you will
@ -718,11 +726,11 @@ one order of magnitude in advance.
A good general principle for scalability is to break a system down into smaller components that can
operate largely independently from each other. This is the underlying principle behind microservices
(see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing
([Link to Come]), and shared-nothing architectures. However, the challenge is in knowing where to
(see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing
([Chapter 12](/en/ch12#ch_stream)), and shared-nothing architectures. However, the challenge is in knowing where to
draw the line between things that should be together, and things that should be apart. Design
guidelines for microservices can be found in other books [^84],
and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding).
and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding).
Another good principle is not to make things more complicated than necessary. If a single-machine
database will do the job, it’s probably preferable to a complicated distributed setup. Auto-scaling
@ -997,4 +1005,3 @@ this book will cover a selection of building blocks that have proved to be valua
[^96]: Eric Evans. [*Domain-Driven Design: Tackling Complexity in the Heart of Software*](https://learning.oreilly.com/library/view/domain-driven-design-tackling/0321125215/). Addison-Wesley Professional, August 2003. ISBN: 9780321125217
[^97]: Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson. [Analyzing Software Evolvability](https://www.es.mdh.se/pdf_publications/1251.pdf). at *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008. [doi:10.1109/COMPSAC.2008.50](https://doi.org/10.1109/COMPSAC.2008.50)
[^98]: Enrico Zaninotto. [From X programming to the X organisation](https://martinfowler.com/articles/zaninotto.pdf). At *XP Conference*, May 2002. Archived at [perma.cc/R9AR-QCKZ](https://perma.cc/R9AR-QCKZ)


@ -4,6 +4,8 @@ weight: 103
breadcrumbs: false
---
<a id="ch_datamodels"></a>
![](/map/ch02.png)
> *The limits of my language mean the limits of my world.*
@ -27,7 +29,7 @@ question is: how is it *represented* in terms of the next-lower layer? For examp
3. The engineers who built your database software decided on a way of representing that
document/relational/graph data in terms of bytes in memory, on disk, or on a network. The
representation may allow the data to be queried, searched, manipulated, and processed in various
ways. We will discuss these storage engine designs in [Chapter 4](/en/ch4#ch_storage).
ways. We will discuss these storage engine designs in [Chapter 4](/en/ch4#ch_storage).
4. On yet lower levels, hardware engineers have figured out how to represent bytes in terms of
electrical currents, pulses of light, magnetic fields, and more.
@ -156,7 +158,7 @@ Nevertheless, ORMs also have advantages:
#### The document data model for one-to-many relationships {#the-document-data-model-for-one-to-many-relationships}
Not all data lends itself well to a relational representation; let’s look at an example to explore a
limitation of the relational model. [Figure 3-1](/en/ch3#fig_obama_relational) illustrates how a résumé (a LinkedIn
limitation of the relational model. [Figure 3-1](/en/ch3#fig_obama_relational) illustrates how a résumé (a LinkedIn
profile) could be expressed in a relational schema. The profile as a whole can be identified by a
unique identifier, `user_id`. Fields like `first_name` and `last_name` appear exactly once per user,
so they can be modeled as columns on the `users` table.
@ -165,13 +167,13 @@ Most people have had more than one job in their career (positions), and people m
numbers of periods of education and any number of pieces of contact information. One way of
representing such *one-to-many relationships* is to put positions, education, and contact
information in separate tables, with a foreign key reference to the `users` table, as in
[Figure 3-1](/en/ch3#fig_obama_relational).
[Figure 3-1](/en/ch3#fig_obama_relational).
{{< figure src="/fig/ddia_0301.png" id="fig_obama_relational" caption="Figure 3-1. Representing a LinkedIn profile using a relational schema." class="w-full my-4" >}}
Another way of representing the same information, which is perhaps more natural and maps more
closely to an object structure in application code, is as a JSON document as shown in
[Example 3-1](/en/ch3#fig_obama_json).
[Example 3-1](/en/ch3#fig_obama_json).
{{< figure id="fig_obama_json" title="Example 3-1. Representing a LinkedIn profile as a JSON document" class="w-full my-4" >}}
@ -199,12 +201,12 @@ closely to an object structure in application code, is as a JSON document as sho
```
Some developers feel that the JSON model reduces the impedance mismatch between the application code
and the storage layer. However, as we shall see in [Chapter 5](/en/ch5#ch_encoding), there are also problems with
and the storage layer. However, as we shall see in [Chapter 5](/en/ch5#ch_encoding), there are also problems with
JSON as a data encoding format. The lack of a schema is often cited as an advantage; we will discuss
this in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility).
The JSON representation has better *locality* than the multi-table schema in
[Figure 3-1](/en/ch3#fig_obama_relational) (see [“Data locality for reads and writes”](/en/ch3#sec_datamodels_document_locality)). If you want to fetch a profile
[Figure 3-1](/en/ch3#fig_obama_relational) (see [“Data locality for reads and writes”](/en/ch3#sec_datamodels_document_locality)). If you want to fetch a profile
in the relational example, you need to either perform multiple queries (query each table by
`user_id`) or perform a messy multi-way join between the `users` table and its subordinate tables [^8].
In the JSON representation, all the relevant information is in one place, making the query both
@ -212,7 +214,7 @@ faster and simpler.
The one-to-many relationships from the user profile to the user’s positions, educational history, and
contact information imply a tree structure in the data, and the JSON representation makes this tree
structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
{{< figure src="/fig/ddia_0302.png" id="fig_json_tree" caption="Figure 3-2. One-to-many relationships forming a tree structure." class="w-full my-4" >}}
@ -222,13 +224,13 @@ structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
> This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé typically has a small number of positions [^9] [^10].
> In situations where there may be a genuinely large number of related items—say, comments on a
> celebrity’s social media post, of which there could be many thousands—embedding them all in the same
> document may be too unwieldy, so the relational approach in [Figure 3-1](/en/ch3#fig_obama_relational) is preferable.
> document may be too unwieldy, so the relational approach in [Figure 3-1](/en/ch3#fig_obama_relational) is preferable.
--------
### Normalization, Denormalization, and Joins {#sec_datamodels_normalization}
In [Example 3-1](/en/ch3#fig_obama_json) in the preceding section, `region_id` is given as an ID, not as the plain-text
In [Example 3-1](/en/ch3#fig_obama_json) in the preceding section, `region_id` is given as an ID, not as the plain-text
string `"Washington, DC, United States"`. Why?
If the user interface has a free-text field for entering the region, it makes sense to store it as a
@ -321,7 +323,7 @@ Besides the cost of performing all these updates, you also need to consider the
database if a process crashes halfway through making its updates. Databases that offer atomic
transactions (see [“Atomicity”](/en/ch8#sec_transactions_acid_atomicity)) make it easier to remain consistent, but not
all databases offer atomicity across multiple documents. It is also possible to ensure consistency
through stream processing, which we discuss in [Link to Come].
through stream processing, which we discuss in [“Keeping Systems in Sync”](/en/ch12#sec_stream_sync).
Normalization tends to be better for OLTP systems, where both reads and updates need to be fast;
analytics systems often fare better with denormalized data, since they perform updates in bulk, and
@ -332,7 +334,7 @@ acceptable. However, in very large-scale systems, the cost of joins can become p
#### Denormalization in the social networking case study {#denormalization-in-the-social-networking-case-study}
In [“Case Study: Social Network Home Timelines”](/en/ch2#sec_introduction_twitter) we compared a normalized representation ([Figure 2-1](/en/ch2#fig_twitter_relational))
In [“Case Study: Social Network Home Timelines”](/en/ch2#sec_introduction_twitter) we compared a normalized representation ([Figure 2-1](/en/ch2#fig_twitter_relational))
and a denormalized one (precomputed, materialized timelines): here, the join between `posts` and
`follows` was too expensive, and the materialized timeline is a cache of the result of that join.
The fan-out process that inserts a new post into followers’ timelines was our way of keeping the
@ -380,7 +382,7 @@ of performance of reads and writes, as well as the amount of effort to implement
### Many-to-One and Many-to-Many Relationships {#sec_datamodels_many_to_many}
While `positions` and `education` in [Figure 3-1](/en/ch3#fig_obama_relational) are examples of one-to-many or
While `positions` and `education` in [Figure 3-1](/en/ch3#fig_obama_relational) are examples of one-to-many or
one-to-few relationships (one résumé has several positions, but each position belongs only to one
résumé), the `region_id` field is an example of a *many-to-one* relationship (many people live in
the same region, but we assume that each person lives in only one region at any one time).
@ -389,14 +391,14 @@ If we introduce entities for organizations and schools, and reference them by ID
then we also have *many-to-many* relationships (one person has worked for several organizations, and
an organization has several past or present employees). In a relational model, such a relationship
is usually represented as an *associative table* or *join table*, as shown in
[Figure 3-3](/en/ch3#fig_datamodels_m2m_rel): each position associates one user ID with one organization ID.
[Figure 3-3](/en/ch3#fig_datamodels_m2m_rel): each position associates one user ID with one organization ID.
{{< figure src="/fig/ddia_0303.png" id="fig_datamodels_m2m_rel" caption="Figure 3-3. Many-to-many relationships in the relational model." class="w-full my-4" >}}
Many-to-one and many-to-many relationships do not easily fit within one self-contained JSON
document; they lend themselves more to a normalized representation. In a document model, one
possible representation is given in [Example 3-2](/en/ch3#fig_datamodels_m2m_json) and illustrated in
[Figure 3-4](/en/ch3#fig_datamodels_many_to_many): the data within each dotted rectangle can be grouped into one
possible representation is given in [Example 3-2](/en/ch3#fig_datamodels_m2m_json) and illustrated in
[Figure 3-4](/en/ch3#fig_datamodels_many_to_many): the data within each dotted rectangle can be grouped into one
document, but the links to organizations and schools are best represented as references to other
documents.
@ -426,11 +428,11 @@ representation is denormalized, since the relationship is stored in two places,
inconsistent with each other.
A normalized representation stores the relationship in only one place, and relies on *secondary
indexes* (which we discuss in [Chapter 4](/en/ch4#ch_storage)) to allow the relationship to be efficiently queried in
both directions. In the relational schema of [Figure 3-3](/en/ch3#fig_datamodels_m2m_rel), we would tell the database
indexes* (which we discuss in [Chapter 4](/en/ch4#ch_storage)) to allow the relationship to be efficiently queried in
both directions. In the relational schema of [Figure 3-3](/en/ch3#fig_datamodels_m2m_rel), we would tell the database
to create indexes on both the `user_id` and the `org_id` columns of the `positions` table.
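
As a small sketch of what this looks like in practice, the following uses Python's built-in `sqlite3` module with an in-memory database. Only the `positions` table and its `user_id` and `org_id` columns come from the text; the `title` column, the index names, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE positions (
        user_id INTEGER NOT NULL,
        org_id  INTEGER NOT NULL,
        title   TEXT              -- hypothetical extra column
    );
    -- One index per direction in which the relationship is queried.
    CREATE INDEX idx_positions_user ON positions (user_id);
    CREATE INDEX idx_positions_org  ON positions (org_id);
""")
conn.executemany(
    "INSERT INTO positions VALUES (?, ?, ?)",
    [(1, 10, "Engineer"), (1, 20, "Manager"), (2, 10, "Analyst")],
)

# Which organizations has user 1 worked for?
orgs_of_user_1 = [row[0] for row in conn.execute(
    "SELECT org_id FROM positions WHERE user_id = ? ORDER BY org_id", (1,))]

# Which users have worked for organization 10?
users_of_org_10 = [row[0] for row in conn.execute(
    "SELECT user_id FROM positions WHERE org_id = ? ORDER BY user_id", (10,))]

print(orgs_of_user_1, users_of_org_10)
```

The relationship is stored once, in `positions`, and each index serves one of the two lookup directions.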
In the document model of [Example 3-2](/en/ch3#fig_datamodels_m2m_json), the database needs to index the `org_id` field
In the document model of [Example 3-2](/en/ch3#fig_datamodels_m2m_json), the database needs to index the `org_id` field
of objects inside the `positions` array. Many document databases and relational databases with JSON
support are able to create such indexes on values inside a document.
@ -442,7 +444,7 @@ widely-used conventions for the structure of tables in a data warehouse: a *star
and *one big table* (OBT). These structures are optimized for the needs of business analysts. ETL
processes translate data from operational systems into this schema.
[Figure 3-5](/en/ch3#fig_dwh_schema) shows an example of a star schema that might be found in the data warehouse of a grocery
[Figure 3-5](/en/ch3#fig_dwh_schema) shows an example of a star schema that might be found in the data warehouse of a grocery
retailer. At the center of the schema is a so-called *fact table* (in this example, it is called
`fact_sales`). Each row of the fact table represents an event that occurred at a particular time
(here, each row represents a customer’s purchase of a product). If we were analyzing website traffic
@ -460,7 +462,7 @@ Other columns in the fact table are foreign key references to other tables, call
tables*. As each row in the fact table represents an event, the dimensions represent the *who*,
*what*, *where*, *when*, *how*, and *why* of the event.
For example, in [Figure 3-5](/en/ch3#fig_dwh_schema), one of the dimensions is the product that was sold. Each row in
the `dim_product` table represents one type of product that is for sale, including its stock-keeping
unit (SKU), description, brand name, category, fat content, package size, etc. Each row in the
`fact_sales` table uses a foreign key to indicate which product was sold in that particular
@ -470,7 +472,7 @@ Even date and time are often represented using dimension tables, because this al
information about dates (such as public holidays) to be encoded, allowing queries to differentiate
between sales on holidays and non-holidays.
[Figure 3-5](/en/ch3#fig_dwh_schema) is an example of a star schema. The name comes from the fact that when the table
relationships are visualized, the fact table is in the middle, surrounded by its dimension tables;
the connections to these tables are like the rays of a star.
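To make the shape of a star schema concrete, here is a small sketch using Python's `sqlite3` module. The table names `fact_sales` and `dim_product` come from the text; `dim_date`, all column names, and the rows are assumptions for illustration and are not a copy of Figure 3-5.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Dimension tables: the "what" and "when" of each event (hypothetical columns).
    CREATE TABLE dim_product (product_sk INTEGER PRIMARY KEY, sku TEXT, category TEXT);
    CREATE TABLE dim_date    (date_key INTEGER PRIMARY KEY, day TEXT, is_holiday INTEGER);

    -- Fact table: one row per purchase event, with foreign keys into the dimensions.
    CREATE TABLE fact_sales (
        date_key   INTEGER REFERENCES dim_date (date_key),
        product_sk INTEGER REFERENCES dim_product (product_sk),
        quantity   INTEGER,
        net_price  REAL
    );

    INSERT INTO dim_product VALUES (1, 'OK4012', 'Fresh fruit'), (2, 'KA9511', 'Pet supplies');
    INSERT INTO dim_date    VALUES (20240101, '2024-01-01', 1), (20240102, '2024-01-02', 0);
    INSERT INTO fact_sales  VALUES
        (20240101, 1, 3, 2.49), (20240101, 2, 1, 14.99),
        (20240102, 1, 5, 2.49), (20240102, 1, 2, 2.49);
""")

# A typical analytic query: join the fact table to its dimensions, then aggregate.
sales_by_category = conn.execute("""
    SELECT p.category, d.is_holiday, SUM(f.quantity)
    FROM fact_sales f
    JOIN dim_product p ON f.product_sk = p.product_sk
    JOIN dim_date    d ON f.date_key   = d.date_key
    GROUP BY p.category, d.is_holiday
    ORDER BY p.category, d.is_holiday
""").fetchall()
```

Because the date dimension carries the holiday flag, the query can separate holiday from non-holiday sales without any date arithmetic.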
@ -516,7 +518,7 @@ many-to-many relationships. Lets examine these arguments in more detail.
If the data in your application has a document-like structure (i.e., a tree of one-to-many
relationships, where typically the entire tree is loaded at once), then it's probably a good idea to
use a document model. The relational technique of *shredding* (splitting a document-like structure
into multiple tables, like `positions`, `education`, and `contact_info` in [Figure 3-1](/en/ch3#fig_obama_relational))
can lead to cumbersome schemas and unnecessarily complicated application code.
The document model has limitations: for example, you cannot refer directly to a nested item within a
@ -595,14 +597,14 @@ structure for some reason (i.e., the data is heterogeneous)—for example, becau
In situations like these, a schema may hurt more than it helps, and schemaless documents can be a
much more natural data model. But in cases where all records are expected to have the same
structure, schemas are a useful mechanism for documenting and enforcing that structure. We will
discuss schemas and schema evolution in more detail in [Chapter 5](/en/ch5#ch_encoding).
#### Data locality for reads and writes {#sec_datamodels_document_locality}
A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant
thereof (such as MongoDB's BSON). If your application often needs to access the entire document
(for example, to render it on a web page), there is a performance advantage to this *storage
locality*. If data is split across multiple tables, like in [Figure 3-1](/en/ch3#fig_obama_relational), multiple
index lookups are required to retrieve it all, which may require more disk seeks and take more time.
The locality advantage only applies if you need large parts of the document at the same time. The
@ -755,7 +757,7 @@ as SQL support for querying graphs. Other graph query languages exist, such as G
but these will give us a representative overview.
To illustrate these different languages and models, this section uses the graph shown in
[Figure 3-6](/en/ch3#fig_datamodels_graph) as a running example. It could be taken from a social network or a
genealogical database: it shows two people, Lucy from Idaho and Alain from Saint-Lô, France. They
are married and living in London. Each person and each location is represented as a vertex, and the
relationships between them as edges. This example will help demonstrate some queries that are easy
@ -782,7 +784,7 @@ Each edge consists of:
* A collection of properties (key-value pairs)
You can think of a graph store as consisting of two relational tables, one for vertices and one for
edges, as shown in [Example 3-3](/en/ch3#fig_graph_sql_schema) (this schema uses the PostgreSQL `jsonb` datatype to
store the properties of each vertex or edge). The head and tail vertex are stored for each edge; if
you want the set of incoming or outgoing edges for a vertex, you can query the `edges` table by
`head_vertex` or `tail_vertex`, respectively.
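As a rough sketch of that lookup, the following uses Python's `sqlite3` module with a cut-down `edges` table. The column names `tail_vertex` and `head_vertex` follow the text; the text vertex IDs, the index names, and the two sample edges are simplifying assumptions, and the property columns are omitted.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (
        edge_id     INTEGER PRIMARY KEY,
        tail_vertex TEXT,   -- vertex where the edge starts
        head_vertex TEXT,   -- vertex where the edge ends
        label       TEXT
    );
    CREATE INDEX edges_tails ON edges (tail_vertex);
    CREATE INDEX edges_heads ON edges (head_vertex);

    INSERT INTO edges (tail_vertex, head_vertex, label) VALUES
        ('lucy',  'idaho', 'born_in'),
        ('idaho', 'usa',   'within');
""")

def incoming(vertex):
    """Edges that end at `vertex`: filter on head_vertex."""
    return conn.execute(
        "SELECT tail_vertex, label FROM edges WHERE head_vertex = ? ORDER BY edge_id",
        (vertex,),
    ).fetchall()

def outgoing(vertex):
    """Edges that start at `vertex`: filter on tail_vertex."""
    return conn.execute(
        "SELECT head_vertex, label FROM edges WHERE tail_vertex = ? ORDER BY edge_id",
        (vertex,),
    ).fetchall()
```

With an index on each of the two columns, both directions of traversal are index lookups rather than table scans.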
@ -814,7 +816,7 @@ Some important aspects of this model are:
restricts which kinds of things can or cannot be associated.
2. Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus
*traverse* the graph—i.e., follow a path through a chain of vertices—both forward and backward.
(That's why [Example 3-3](/en/ch3#fig_graph_sql_schema) has indexes on both the `tail_vertex` and `head_vertex`
columns.)
3. By using different labels for different kinds of vertices and relationships, you can store
several different kinds of information in a single graph, while still maintaining a clean data
@ -837,7 +839,7 @@ vertices or edges with certain properties to be found efficiently.
--------
Those features give graphs a great deal of flexibility for data modeling, as illustrated in
[Figure 3-6](/en/ch3#fig_datamodels_graph). The figure shows a few things that would be difficult to express in a
traditional relational schema, such as different kinds of regional structures in different countries
(France has *départements* and *régions*, whereas the US has *counties* and *states*), quirks of
history such as a country within a country (ignoring for now the intricacies of sovereign states and
@ -859,8 +861,8 @@ and later developed into an open standard as *openCypher* [^38]. Besides Neo4j,
Amazon Neptune, Apache AGE (with storage in PostgreSQL), and others. It is named after a character
in the movie *The Matrix* and is not related to ciphers in cryptography [^39].
[Example 3-4](/en/ch3#fig_cypher_create) shows the Cypher query to insert the lefthand portion of
[Figure 3-6](/en/ch3#fig_datamodels_graph) into a graph database. The rest of the graph can be added similarly. Each
vertex is given a symbolic name like `usa` or `idaho`. That name is not stored in the database, but
only used internally within the query to create edges between the vertices, using an arrow notation:
`(idaho) -[:WITHIN]-> (usa)` creates an edge labeled `WITHIN`, with `idaho` as the tail node and
@ -878,13 +880,13 @@ CREATE
(lucy) -[:BORN_IN]-> (idaho)
```
When all the vertices and edges of [Figure 3-6](/en/ch3#fig_datamodels_graph) are added to the database, we can start
asking interesting questions: for example, *find the names of all the people who emigrated from the
United States to Europe*. That is, find all the vertices that have a `BORN_IN` edge to a location
within the US, and also a `LIVES_IN` edge to a location within Europe, and return the `name`
property of each of those vertices.
[Example 3-5](/en/ch3#fig_cypher_query) shows how to express that query in Cypher. The same arrow notation is used in a
`MATCH` clause to find patterns in the graph: `(person) -[:BORN_IN]-> ()` matches any two vertices
that are related by an edge labeled `BORN_IN`. The tail vertex of that edge is bound to the
variable `person`, and the head vertex is left unnamed.
@ -923,7 +925,7 @@ can be found through an incoming `BORN_IN` or `LIVES_IN` edge at one of the loca
### Graph Queries in SQL {#id58}
[Example 3-3](/en/ch3#fig_graph_sql_schema) suggested that graph data can be represented in a relational database. But
if we put graph data in a relational structure, can we also query it using SQL?
The answer is yes, but with some difficulty. Every edge that you traverse in a graph query is
@ -943,7 +945,7 @@ or more times.” It is like the `*` operator in a regular expression.
Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using
something called *recursive common table expressions* (the `WITH RECURSIVE` syntax).
[Example 3-6](/en/ch3#fig_graph_sql_query) shows the same query—finding the names of people who emigrated from the US
to Europe—expressed in SQL using this technique. However, the syntax is very clumsy in comparison to
Cypher.
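To give a feel for the `WITH RECURSIVE` mechanism in isolation, here is a much smaller sketch run through Python's `sqlite3` module. It only computes the set of locations that are transitively `within` the United States, not the full emigration query, and the simplified `edges` table, the text vertex IDs, and the `boise` vertex are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE edges (tail_vertex TEXT, head_vertex TEXT, label TEXT);
    INSERT INTO edges VALUES
        ('boise', 'idaho',    'within'),   -- hypothetical extra vertex
        ('idaho', 'usa',      'within'),
        ('usa',   'namerica', 'within');
""")

# Start from the 'usa' vertex, then repeatedly follow incoming `within` edges
# until no new vertices are found (a variable-length traversal).
in_usa = [row[0] for row in conn.execute("""
    WITH RECURSIVE in_usa(vertex_id) AS (
        SELECT 'usa'
      UNION
        SELECT edges.tail_vertex
        FROM edges
        JOIN in_usa ON edges.head_vertex = in_usa.vertex_id
        WHERE edges.label = 'within'
    )
    SELECT vertex_id FROM in_usa ORDER BY vertex_id
""")]
```

Each pass of the recursive part follows one more hop of `within` edges, which is the SQL counterpart of a variable-length path pattern.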
@ -1035,7 +1037,7 @@ The subject of a triple is equivalent to a vertex in a graph. The object is one
1. A value of a primitive datatype, such as a string or a number. In that case, the predicate and
object of the triple are equivalent to the key and value of a property on the subject vertex.
Using the example from [Figure 3-6](/en/ch3#fig_datamodels_graph), (*lucy*, *birthYear*, *1989*) is like a vertex
`lucy` with properties `{"birthYear": 1989}`.
2. Another vertex in the graph. In that case, the predicate is an edge in the
graph, the subject is the tail vertex, and the object is the head vertex. For example, in
@ -1051,7 +1053,7 @@ The subject of a triple is equivalent to a vertex in a graph. The object is one
> Since these databases retain the basic *subject-predicate-object* structure explained above, this
> book nevertheless calls them triple-stores.
[Example 3-7](/en/ch3#fig_graph_n3_triples) shows the same data as in [Example 3-4](/en/ch3#fig_cypher_create), written as
triples in a format called *Turtle*, a subset of *Notation3* (*N3*) [^48].
{{< figure id="fig_graph_n3_triples" title="Example 3-7. A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as Turtle triples" class="w-full my-4" >}}
@ -1081,7 +1083,7 @@ _:usa`. When the predicate is a property, the object is a string literal, as in
It's quite repetitive to repeat the same subject over and over again, but fortunately you can use
semicolons to say multiple things about the same subject. This makes the Turtle format quite
readable: see [Example 3-8](/en/ch3#fig_graph_n3_shorthand).
{{< figure id="fig_graph_n3_shorthand" title="Example 3-8. A more concise way of writing the data in [Example 3-7](/en/ch3#fig_graph_n3_triples)" class="w-full my-4" >}}
@ -1112,10 +1114,10 @@ case: even if you have no interest in the Semantic Web, triples can be a good in
#### The RDF data model {#the-rdf-data-model}
The Turtle language we used in [Example 3-8](/en/ch3#fig_graph_n3_shorthand) is actually a way of encoding data in the
*Resource Description Framework* (RDF) [^55],
a data model that was designed for the Semantic Web. RDF data can also be encoded in other ways, for
example (more verbosely) in XML, as shown in [Example 3-9](/en/ch3#fig_graph_rdf_xml). Tools like Apache Jena can
automatically convert between different RDF encodings.
{{< figure id="fig_graph_rdf_xml" title="Example 3-9. The data of [Example 3-8](/en/ch3#fig_graph_n3_shorthand), expressed using RDF/XML syntax" class="w-full my-4" >}}
@ -1169,7 +1171,7 @@ It predates Cypher, and since Cyphers pattern matching is borrowed from SPARQ
similar.
The same query as before—finding people who have moved from the US to Europe—is similarly concise in
SPARQL as it is in Cypher (see [Example 3-10](/en/ch3#fig_sparql_query)).
{{< figure id="fig_sparql_query" title="Example 3-10. The same query as [Example 3-5](/en/ch3#fig_cypher_query), expressed in SPARQL" class="w-full my-4" >}}
@ -1224,8 +1226,8 @@ columns: *ID*, *name*, and *type*. The fact that the US is a country could then
`table(val1, val2, …​)` means that `table` contains a row where the first column contains `val1`,
the second column contains `val2`, and so on.
[Example 3-11](/en/ch3#fig_datalog_triples) shows how to write the data from the left-hand side of
[Figure 3-6](/en/ch3#fig_datamodels_graph) in Datalog. The edges of the graph (`within`, `born_in`, and `lives_in`)
are represented as two-column join tables. For example, Lucy has the ID 100 and Idaho has the ID 3,
so the relationship “Lucy was born in Idaho” is represented as `born_in(100, 3)`.
@ -1244,7 +1246,7 @@ born_in(100, 3). /* Lucy was born in Idaho */
```
Now that we have defined the data, we can write the same query as before, as shown in
[Example 3-12](/en/ch3#fig_datalog_query). It looks a bit different from the equivalent in Cypher or SPARQL, but don't
let that put you off. Datalog is a subset of Prolog, a programming language that you might have seen
before if you've studied computer science.
@ -1271,7 +1273,7 @@ define *rules* that derive new virtual tables from the underlying facts. These d
like (virtual) SQL views: they are not stored in the database, but you can query them in the same
way as a table containing stored facts.
In [Example 3-12](/en/ch3#fig_datalog_query) we define three derived tables: `within_recursive`, `migrated`, and
`us_to_europe`. The name and columns of the virtual tables are defined by what appears before the
`:-` symbol of each rule. For example, `migrated(PName, BornIn, LivingIn)` is a virtual table with
three columns: the name of a person, the name of the place where they were born, and the name of the
@ -1284,7 +1286,7 @@ variable `PName` bound to the value `"Lucy"`. A rule applies if the system can f
*all* patterns on the righthand side of the `:-` operator. When the rule applies, it's as though the
lefthand side of the `:-` was added to the database (with variables replaced by the values they matched).
One possible way of applying the rules is thus (and as illustrated in [Figure 3-7](/en/ch3#fig_datalog_naive)):
1. `location(1, "North America", "continent")` exists in the database, so rule 1 applies. It generates `within_recursive(1, "North America")`.
2. `within(2, 1)` exists in the database and the previous step generated `within_recursive(1, "North America")`, so rule 2 applies. It generates `within_recursive(2, "North America")`.
@ -1295,7 +1297,7 @@ locations in North America (or any other location) contained in our database.
{{< figure link="#fig_datalog_query" src="/fig/ddia_0307.png" id="fig_datalog_naive" title="Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from Example 3-12." class="w-full my-4" >}}
> Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](/en/ch3#fig_datalog_query).
Now rule 3 can find people who were born in some location `BornIn` and live in some location
`LivingIn`. Rule 4 invokes rule 3 with `BornIn = 'United States'` and
@ -1307,7 +1309,7 @@ The Datalog approach requires a different kind of thinking compared to the other
discussed in this chapter. It allows complex queries to be built up rule by rule, with one rule
referring to other rules, similarly to the way that you break down code into functions that call
each other. Just like functions can be recursive, Datalog rules can also invoke themselves, like
rule 2 in [Example 3-12](/en/ch3#fig_datalog_query), which enables graph traversals in Datalog queries.
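The repeated rule application described earlier can be sketched in a few lines of Python. This is a naive fixed-point loop written for illustration, not how a real Datalog engine is implemented; the function name and the tuple representation of facts are assumptions.

```python
# Facts, written as sets of tuples (IDs and names as in the Datalog example).
location = {
    (1, "North America", "continent"),
    (2, "United States", "country"),
    (3, "Idaho", "state"),
}
within = {(2, 1), (3, 2)}  # within(inner_id, outer_id)

def derive_within_recursive(location, within):
    """Apply rules 1 and 2 until no new facts can be derived."""
    # Rule 1: every location is within itself (by name).
    derived = {(loc_id, name) for (loc_id, name, _type) in location}
    while True:
        # Rule 2: if loc is within via, and via is within_recursive some place,
        # then loc is within_recursive that place too.
        new_facts = {
            (loc_id, name)
            for (loc_id, via_id) in within
            for (known_id, name) in derived
            if via_id == known_id
        } - derived
        if not new_facts:
            return derived  # fixed point reached
        derived |= new_facts

within_recursive = derive_within_recursive(location, within)
```

The loop terminates because each iteration either adds at least one new fact or stops, and only finitely many facts can be derived from a finite database.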
### GraphQL {#id63}
@ -1319,7 +1321,7 @@ interfaces allow developers to rapidly change queries in client code without cha
GraphQL's flexibility comes at a cost. Organizations that adopt GraphQL often need tooling to
convert GraphQL queries into requests to internal services, which often use REST or gRPC (see
[Chapter 5](/en/ch5#ch_encoding)). Authorization, rate limiting, and performance challenges are additional concerns [^61].
GraphQL's query language is also limited since GraphQL queries come from an untrusted source. The language
does not allow anything that could be expensive to execute, since otherwise users could perform
denial-of-service attacks on a server by running lots of expensive queries. In particular, GraphQL
@ -1327,7 +1329,7 @@ does not allow recursive queries (unlike Cypher, SPARQL, SQL, or Datalog), and i
arbitrary search conditions such as “find people who were born in the US and are now living in
Europe” (unless the service owners specifically choose to offer such search functionality).
Nevertheless, GraphQL is useful. [Example 3-13](/en/ch3#fig_graphql_query) shows how you might implement a group chat
application such as Discord or Slack using GraphQL. The query requests all the channels that the
user has access to, including the channel name and the 50 most recent messages in each channel. For
each message it requests the timestamp, the message content, and the name and profile picture URL
@ -1359,7 +1361,7 @@ query ChatApp {
}
```
[Example 3-14](/en/ch3#fig_graphql_response) shows what a response to the query in [Example 3-13](/en/ch3#fig_graphql_query) might look
like. The response is a JSON document that mirrors the structure of the query: it contains exactly
those attributes that were requested, no more and no less. This approach has the advantage that the
server does not need to know which attributes the client requires in order to render the user
@ -1395,13 +1397,13 @@ were changed to add that profile picture, it would be easy for the client to add
...
```
In [Example 3-14](/en/ch3#fig_graphql_response) the name and image URL of a message sender are embedded directly in the
message object. If the same user sends multiple messages, this information is repeated on each
message. In principle, it would be possible to reduce this duplication, but GraphQL makes the design
choice to accept a larger response size in order to make it simpler to render the user interface
based on the data.
The `replyTo` field is similar: in [Example 3-14](/en/ch3#fig_graphql_response), the second message is a reply to the
first, and the content (“Hey!…”) and sender Aaliyah are duplicated under `replyTo`. It would be
possible to instead return the ID of the message being replied to, but then the client would have to
make an additional request to the server if that ID is not among the 50 most recent messages
@ -1439,7 +1441,7 @@ timestamp, and then append it to a sequence of events. Events in this log are *i
change or delete them, you only ever append more events to the log (which may supersede earlier
events). An event can contain arbitrary properties.
[Figure 3-8](/en/ch3#fig_event_sourcing) shows an example that could be taken from a conference management system. A
conference can be a complex business domain: not only can individual attendees register and pay by
card, but companies can also order seats in bulk, pay by invoice, and then later assign the seats to
individual people. Some number of seats may be reserved for speakers, sponsors, volunteer helpers,
@ -1449,7 +1451,7 @@ calculating the number of available seats becomes a challenging query.
{{< figure src="/fig/ddia_0308.png" id="fig_event_sourcing" title="Figure 3-8. Using a log of immutable events as source of truth, and deriving materialized views from it." class="w-full my-4" >}}
In [Figure 3-8](/en/ch3#fig_event_sourcing), every change to the state of the conference (such as the organizer
opening registrations, or attendees making and cancelling registrations) is first stored as an
event. Whenever an event is appended to the log, several *materialized views* (also known as
*projections* or *read models*) are also updated to reflect the effect of that event. In the
@ -1540,11 +1542,11 @@ You can implement event sourcing on top of any database, but there are also some
specifically designed to support this pattern, such as EventStoreDB, MartenDB (based on PostgreSQL),
and Axon Framework. You can also use message brokers such as Apache Kafka to store the event log,
and stream processors can keep the materialized views up-to-date; we will return to these topics in
[“Change data capture versus event sourcing”](/en/ch12#sec_stream_event_sourcing).
The only important requirement is that the event storage system must guarantee that all materialized
views process the events in exactly the same order as they appear in the log; as we shall see in
[Chapter 10](/en/ch10#ch_consistency), this is not always easy to achieve in a distributed system.
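The ordering requirement is easiest to see in a single-process sketch. The following Python is only an illustration under simplifying assumptions: the class names, the event types (`RegistrationsOpened`, `SeatReserved`, `ReservationCancelled`), and the in-memory list standing in for durable storage are all hypothetical.

```python
class EventLog:
    """Append-only log that feeds every event to its views in log order."""

    def __init__(self):
        self._events = []   # events are only ever appended, never changed
        self._views = []

    def subscribe(self, view):
        # A new view first replays the whole log, in order, to catch up.
        for event in self._events:
            view.apply(event)
        self._views.append(view)

    def append(self, event_type, **properties):
        event = {"seq": len(self._events), "type": event_type, **properties}
        self._events.append(event)
        for view in self._views:
            view.apply(event)


class SeatsAvailableView:
    """Materialized view: number of seats still available."""

    def __init__(self):
        self.available = 0

    def apply(self, event):
        if event["type"] == "RegistrationsOpened":
            self.available = event["capacity"]
        elif event["type"] == "SeatReserved":
            self.available -= 1
        elif event["type"] == "ReservationCancelled":
            self.available += 1


log = EventLog()
view = SeatsAvailableView()
log.subscribe(view)
log.append("RegistrationsOpened", capacity=3)
log.append("SeatReserved", attendee="a")
log.append("SeatReserved", attendee="b")
log.append("ReservationCancelled", attendee="a")

# A view created later can be rebuilt from scratch by replaying the log.
rebuilt = SeatsAvailableView()
log.subscribe(rebuilt)
```

Because both views see the same events in the same order, the rebuilt view ends up in the same state as the one that was updated incrementally.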
## Dataframes, Matrices, and Arrays {#sec_datamodels_dataframes}
@ -1579,7 +1581,7 @@ For example, a common use of dataframes is to transform data from a relational-l
into a matrix or multidimensional array representation, which is the form that many machine learning
algorithms expect of their input.
A simple example of such a transformation is shown in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix). On the left we
have a relational table of how different users have rated various movies (on a scale of 1 to 5), and
on the right the data has been transformed into a matrix where each column is a movie and each row
is a user (similarly to a *pivot table* in a spreadsheet). The matrix is *sparse*, which means there
@ -1592,7 +1594,7 @@ that offer sparse arrays (such as NumPy for Python) can handle such data easily.
A matrix can only contain numbers, and various techniques are used to transform non-numerical data
into numbers in the matrix. For example:
* Dates (which are omitted from the example matrix in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix)) could be scaled
to be floating-point numbers within some suitable range.
* For columns that can only take one of a small, fixed set of values (for example, the genre of a
movie in a database of movies), a *one-hot encoding* is often used: we create a column for each
@ -1603,7 +1605,7 @@ into numbers in the matrix. For example:
Once the data is in the form of a matrix of numbers, it is amenable to linear algebra operations,
which form the basis of many machine learning algorithms. For example, the data in
[Figure 3-9](/en/ch3#fig_dataframe_to_matrix) could be a part of a system for recommending movies that the user may
like. Dataframes are flexible enough to allow data to be gradually evolved from a relational form
into a matrix representation, while giving the data scientist control over the representation that
is most suitable for achieving the goals of the data analysis or model training process.
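The two transformations mentioned above (pivoting a ratings table into a user-by-movie matrix, and one-hot encoding a categorical column) can be sketched in plain Python. A dataframe library would normally do this for you; the sample ratings, the genre list, and the use of `None` for missing entries are assumptions for illustration.

```python
# Relational form: one row per (user, movie, rating).
ratings = [
    ("alice", "Jaws", 5),
    ("alice", "Up", 3),
    ("bob", "Up", 4),
    ("carol", "Jaws", 2),
]

users = sorted({user for user, _movie, _rating in ratings})     # matrix rows
movies = sorted({movie for _user, movie, _rating in ratings})   # matrix columns
by_cell = {(user, movie): rating for user, movie, rating in ratings}

# Matrix form: one row per user, one column per movie. Missing ratings stay
# None, which is what makes the matrix sparse.
matrix = [[by_cell.get((user, movie)) for movie in movies] for user in users]

def one_hot(value, categories):
    """Encode a categorical value as a list with a single 1."""
    return [1 if value == category else 0 for category in categories]

genres = ["comedy", "drama", "horror"]
encoded = one_hot("drama", genres)
```

A real sparse-matrix representation would store only the non-empty cells (much like `by_cell` here) rather than materializing every `None`.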
@ -1648,7 +1650,7 @@ gradually improving.
Another model we discussed is *event sourcing*, which represents data as an append-only log of
immutable events, and which can be advantageous for modeling activities in complex business domains.
An append-only log is good for writing data (as we shall see in [Chapter 4](/en/ch4#ch_storage)); in order to support
efficient queries, the event log is translated into read-optimized materialized views through CQRS.
One thing that non-relational data models have in common is that they typically don't enforce a


@ -4,6 +4,8 @@ weight: 104
breadcrumbs: false
---
<a id="ch_storage"></a>
![](/map/ch03.png)
> *One of the miseries of life is that everybody names things a little bit wrong. And so it makes
@ -17,7 +19,7 @@ breadcrumbs: false
On the most fundamental level, a database needs to do two things: when you give it some data, it
should store the data, and when you ask it again later, it should give the data back to you.
In [Chapter 3](/en/ch3#ch_datamodels) we discussed data models and query languages—i.e., the format in which you give
the database your data, and the interface through which you can ask for it again later. In this
chapter we discuss the same from the database's point of view: how the database can store the data
that you give it, and how it can find the data again when you ask for it.
@ -140,7 +142,7 @@ your application the greatest benefit, without introducing more overhead on writ
To start, let's assume that you want to continue storing data in the append-only file written by
`db_set`, and you just want to speed up reads. One way you could do this is by keeping a hash map in
memory, in which every key is mapped to the byte offset in the file at which the most recent value
for that key can be found, as illustrated in [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
{{< figure src="/fig/ddia_0401.png" id="fig_storage_csv_hash_index" caption="Figure 4-1. Storing a log of key-value pairs in a CSV-like format, indexed with an in-memory hash map." class="w-full my-4" >}}
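The idea can be sketched in Python as follows. This is a toy written for illustration, not the chapter's `db_set` script: the class name, the `key,value` line format, and the temporary file path are assumptions, and keys must not contain commas or newlines.

```python
import os
import tempfile

class HashIndexedLog:
    """Append-only key-value file with an in-memory hash map of byte offsets."""

    def __init__(self, path):
        self.path = path
        self.index = {}                 # key -> byte offset of latest record
        open(path, "ab").close()        # create the file if it does not exist

    def set(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)   # where this record will start
            f.write(record)
        self.index[key] = offset        # later writes overwrite older offsets

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)              # jump straight to the record
            line = f.readline().decode("utf-8").rstrip("\n")
        _key, _sep, value = line.partition(",")
        return value


db = HashIndexedLog(os.path.join(tempfile.mkdtemp(), "database.log"))
db.set("42", "San Francisco")
db.set("12", "London")
db.set("42", "San Francisco, CA")       # newer value appended; index now points here
```

A read costs one hash lookup plus one seek, regardless of how many stale records the file has accumulated. The index lives only in memory, so this sketch would have to rescan the file to rebuild it after a restart.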
@ -167,7 +169,7 @@ This approach is much faster, but it still suffers from several problems:
In practice, hash tables are not used very often for database indexes, and instead it is much more
common to keep data in a structure that is *sorted by key* [^3].
One example of such a structure is a *Sorted String Table*, or *SSTable* for short, as shown in
[Figure 4-2](/en/ch4#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that
they are sorted by key, and each key only appears once in the file.
{{< figure src="/fig/ddia_0402.png" id="fig_storage_sstable_index" caption="Figure 4-2. An SSTable with a sparse index, allowing queries to jump to the right block." class="w-full my-4" >}}
@ -178,7 +180,7 @@ This kind of index, which stores only some of the keys, is called *sparse*. This
a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data
structure that allows queries to quickly look up a particular key [^4].
For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the
first key of the next block is `handsome`. Now say you're looking for the key `handiwork`, which
doesn't appear in the sparse index. Because of the sorting you know that `handiwork` must appear
between `handbag` and `handsome`. This means you can seek to the offset for `handbag` and scan the
@ -186,7 +188,7 @@ file from there until you find `handiwork` (or not, if the key is not present in
of a few kilobytes can be scanned very quickly.
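This lookup can be sketched with Python's `bisect` module. The blocks are modelled as in-memory lists rather than byte ranges in a file, and the keys other than `handbag`, `handiwork`, and `handsome`, as well as all the values, are made up for illustration.

```python
import bisect

# Each block is a sorted run of (key, value) pairs; blocks are sorted too.
blocks = [
    [("handbag", "8786"), ("handcuffs", "2729"), ("handful", "44662"),
     ("handicap", "70836"), ("handiwork", "16324")],
    [("handsome", "86478"), ("handwaving", "65995"), ("handy", "1")],
]

# Sparse index: only the first key of each block is kept.
sparse_keys = [block[0][0] for block in blocks]

def lookup(key):
    # Find the last block whose first key is <= the key we are looking for.
    block_number = bisect.bisect_right(sparse_keys, key) - 1
    if block_number < 0:
        return None                     # key sorts before the first block
    # Scan that one block; stop early once we pass where the key would be.
    for candidate, value in blocks[block_number]:
        if candidate == key:
            return value
        if candidate > key:
            break
    return None
```

Only one block is ever scanned per lookup, so the sparse index can stay small while reads remain fast.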
Moreover, each block of records can be compressed (indicated by the shaded area in
[Figure 4-2](/en/ch4#fig_storage_sstable_index)). Besides saving disk space, compression also reduces the I/O
bandwidth use, at the cost of using a bit more CPU time.
#### Constructing and merging SSTables {#constructing-and-merging-sstables}
@ -217,7 +219,7 @@ log and a sorted file:
and to discard overwritten or deleted values.
Merging segments works similarly to the *mergesort* algorithm [^5]. The process is illustrated in
[Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key
in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If
the same key appears in more than one input file, keep only the more recent value. This produces a
new merged segment file, also sorted by key, with one value per key, and it uses minimal memory
@ -258,7 +260,9 @@ the memtable or while merging segments, the database can just delete the unfinis
start afresh. The log that persists writes to the memtable could contain incomplete records if there
was a crash halfway through writing a record, or if the disk was full; these are typically detected
by including checksums in the log, and discarding corrupted or incomplete log entries. We will talk
more about durability and crash recovery in [Chapter 8](/en/ch8#ch_transactions).
<a id="sec_storage_bloom_filter"></a>
#### Bloom filters {#bloom-filters}
@ -268,7 +272,7 @@ reads, LSM storage engines often include a *Bloom filter* [^13]
in each segment, which provides a fast but approximate way of checking whether a particular key
appears in a particular SSTable.
[Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in
reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash
function, producing a set of numbers that are then interpreted as indexes into the array of bits [^14].
We set the bits corresponding to those indexes to 1, and leave the rest as 0. For example, the key
@ -279,7 +283,7 @@ extra space, but the Bloom filter is generally small compared to the rest of the
{{< figure src="/fig/ddia_0404.png" id="fig_storage_bloom" caption="Figure 4-4. A Bloom filter provides a fast, probabilistic check whether a particular key exists in a particular SSTable." class="w-full my-4" >}}
When we want to know whether a key appears in the SSTable, we compute the same hash of that key as
before, and check the bits at those indexes. For example, in [Figure 4-4](/en/ch4#fig_storage_bloom), we’re querying
the key `handheld`, which hashes to (6, 11, 2). One of those bits is 1 (namely, bit number 2),
while the other two are 0. These checks can be made extremely fast using the bitwise operations that
all CPUs support.
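
A small sketch of such a filter, under stated assumptions: the class name `BloomFilter`, the use of SHA-256 to derive the bit indexes, and the choice of three indexes per key are illustrative, not what any specific storage engine does.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits: int = 16, num_hashes: int = 3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes  # at most 8 with a 32-byte digest
        self.bits = 0  # a Python int used as the array of bits

    def _indexes(self, key: str) -> list[int]:
        # Derive several bit positions from one digest (an illustrative choice).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        return [
            int.from_bytes(digest[4 * i : 4 * i + 4], "big") % self.num_bits
            for i in range(self.num_hashes)
        ]

    def add(self, key: str) -> None:
        for i in self._indexes(key):
            self.bits |= 1 << i

    def might_contain(self, key: str) -> bool:
        # False means "definitely absent"; True may be a false positive.
        return all((self.bits >> i) & 1 for i in self._indexes(key))


bloom = BloomFilter()
bloom.add("handbag")
bloom.add("handsome")
print(bloom.might_contain("handbag"))   # True: the key was added
print(bloom.might_contain("handheld"))  # usually False, but a false positive is possible
```

A key that was added always tests positive, so a negative answer lets the engine skip the SSTable entirely.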
@ -333,6 +337,8 @@ characteristics in more detail in [“Comparing B-Trees and LSM-Trees”](/en/ch
--------
<a id="sidebar_embedded"></a>
> [!TIP] EMBEDDED STORAGE ENGINES
Many databases run as a service that accepts queries over a network, but there are also *embedded*
@ -349,7 +355,7 @@ queries that combine data from multiple tenants), you can potentially use a sepa
database instance per tenant [^20].
The storage and retrieval methods we discuss in this chapter are used in both embedded and in
client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques
for scaling a database across multiple machines.
--------
@ -370,14 +376,14 @@ philosophy.
The log-structured indexes we saw earlier break the database down into variable-size *segments*,
typically several megabytes or more in size, that are written once and are then immutable. By
contrast, B-trees break the database down into fixed-size *blocks* or *pages*, and may overwrite a
page in-place. A page is traditionally 4 KiB in size, but PostgreSQL now uses 8 KiB and
MySQL uses 16 KiB by default.
Each page can be identified using a page number, which allows one page to refer to another—similar
to a pointer, but on disk instead of in memory. If all the pages are stored in the same file,
multiplying the page number by the page size gives us the byte offset in the file where the page is
located. We can use these page references to construct a tree of pages, as illustrated in
[Figure 4-5](/en/ch4#fig_storage_b_tree).
{{< figure src="/fig/ddia_0405.png" id="fig_storage_b_tree" caption="Figure 4-5. Looking up the key 251 using a B-tree index. From the root page we first follow the reference to the page for keys 200–300, then the page for keys 250–270." class="w-full my-4" >}}
@ -388,14 +394,14 @@ where the boundaries between those ranges lie.
(This structure is sometimes called a B+ tree, but we don’t need to distinguish it
from other B-tree variants.)
In the example in [Figure 4-5](/en/ch4#fig_storage_b_tree), we are looking for the key 251, so we know that we need to
follow the page reference between the boundaries 200 and 300. That takes us to a similar-looking
page that further breaks down the 200–300 range into subranges. Eventually we get down to a
page containing individual keys (a *leaf page*), which either contains the value for each key
inline or contains references to the pages where the values can be found.
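
The lookup can be sketched in a few lines. The `PAGES` dictionary below is a hypothetical in-memory stand-in for pages on disk, with a tiny branching factor and made-up values; `page_offset` shows the page-number-to-byte-offset rule for the single-file case.

```python
from bisect import bisect_left, bisect_right

PAGE_SIZE = 4096  # 4 KiB pages


def page_offset(page_number: int) -> int:
    """Byte offset of a page when all pages are stored in the same file."""
    return page_number * PAGE_SIZE


# Interior pages hold boundary keys plus one more child reference than keys;
# leaf pages hold individual keys with their values inline.
PAGES = {
    0: {"keys": [200, 300], "children": [1, 2, 3]},
    1: {"keys": [100, 150], "values": ["v100", "v150"]},
    2: {"keys": [250, 270], "children": [4, 5, 6]},
    3: {"keys": [300, 310], "values": ["v300", "v310"]},
    4: {"keys": [200, 210], "values": ["v200", "v210"]},
    5: {"keys": [250, 251, 260], "values": ["v250", "v251", "v260"]},
    6: {"keys": [270, 290], "values": ["v270", "v290"]},
}


def lookup(key: int, root: int = 0):
    page = PAGES[root]
    while "children" in page:
        # Boundaries [200, 300] split the keys into <200, 200..299, and >=300.
        child = page["children"][bisect_right(page["keys"], key)]
        page = PAGES[child]
    i = bisect_left(page["keys"], key)
    if i < len(page["keys"]) and page["keys"][i] == key:
        return page["values"][i]
    return None  # key not present


print(lookup(251))     # follows page 0 -> page 2 -> page 5
print(page_offset(5))  # where page 5 would start in the file
```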
The number of references to child pages in one page of the B-tree is called the *branching factor*.
For example, in [Figure 4-5](/en/ch4#fig_storage_b_tree) the branching factor is six. In practice, the branching
factor depends on the amount of space required to store the page references and the range
boundaries, but typically it is several hundred.
@ -408,7 +414,7 @@ of key ranges.
{{< figure src="/fig/ddia_0406.png" id="fig_storage_b_tree_split" caption="Figure 4-6. Growing a B-tree by splitting a page on the boundary key 337. The parent page is updated to reference both children." class="w-full my-4" >}}
In the example of [Figure 4-6](/en/ch4#fig_storage_b_tree_split), we want to insert the key 334, but the page for the
range 333–345 is already full. We therefore split it into a page for the range 333–337 (including
the new key), and a page for 337–344. We also have to update the parent page to have references to
both children, with a boundary value of 337 between them. If the parent page doesn’t have enough
@ -417,9 +423,9 @@ to the root of the tree. When the root is split, we make a new root above it. De
may require nodes to be merged) is more complex [^5].
This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth
of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so
you don’t need to follow many page references to find the page you are looking for. (A four-level
tree of 4 KiB pages with a branching factor of 500 can store up to 250 TB.)
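
One way to arrive at a figure of this magnitude is sketched below. It assumes that the estimate counts four levels of page references, each fanning out 500 ways, and multiplies the number of reachable pages by the page size; that reading of the estimate is an assumption, and the constant names are illustrative.

```python
BRANCHING_FACTOR = 500
LEVELS = 4
PAGE_SIZE_BYTES = 4 * 1024  # 4 KiB

# Each level multiplies the number of reachable pages by the branching factor.
reachable_pages = BRANCHING_FACTOR ** LEVELS
capacity_bytes = reachable_pages * PAGE_SIZE_BYTES

print(f"{reachable_pages:,} pages")
print(f"{capacity_bytes / 1e12:.0f} TB")  # same order of magnitude as the 250 TB quoted above
```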
#### Making B-trees reliable {#sec_storage_btree_wal}
@ -530,14 +536,14 @@ flash memory attached to the PCI Express bus) have now overtaken HDDs for many u
are not subject to such mechanical limitations.
Nevertheless, SSDs also have higher throughput for sequential writes than for random writes.
The reason is that flash memory can be read or written one page (typically 4 KiB) at a time,
but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block
may contain valid data, whereas others may contain data that is no longer needed. Before erasing a
block, the controller must first move pages containing valid data into other blocks; this process is
called *garbage collection* (GC) [^33].
A sequential write workload writes larger chunks of data at a time, so it is likely that a whole
512 KiB block belongs to a single file; when that file is later deleted again, the whole block
can be erased without having to perform any GC. On the other hand, with a random write workload, it
is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
to perform more work before a block can be erased [^34] [^35] [^36].
@ -624,7 +630,7 @@ to that row/document/vertex by its primary key (or ID), and the index is used to
It is also very common to have *secondary indexes*. In relational databases, you can create several
secondary indexes on the same table using the `CREATE INDEX` command, allowing you to search by
columns other than the primary key. For example, in [Figure 3-1](/en/ch3#fig_obama_relational) in [Chapter 3](/en/ch3#ch_datamodels)
you would most likely have a secondary index on the `user_id` columns so that you can find all the
rows belonging to the same user in each of the tables.
@ -791,7 +797,7 @@ rows), so in this section we will focus on storage of facts.
Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4
or 5 of them at one time (`"SELECT *"` queries are rarely needed for analytics) [^52]. Take the query in
[Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone
buying fruit or candy during the 2024 calendar year), but it only needs to access three columns of
the `fact_sales` table: `date_key`, `product_sk`,
and `quantity`. The query ignores all other columns.
@ -816,9 +822,9 @@ How can we execute this query efficiently?
In most OLTP databases, storage is laid out in a *row-oriented* fashion: all the values from one row
of a table are stored next to each other. Document databases are similar: an entire document is
typically stored as one contiguous sequence of bytes. You can see this in the CSV example of [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).
In order to process a query like [Example 4-1](/en/ch4#fig_storage_analytics_query), you may have indexes on
`fact_sales.date_key` and/or `fact_sales.product_sk` that tell the storage engine where to find
all the sales for a particular date or for a particular product. But then, a row-oriented storage
engine still needs to load all of those rows (each consisting of over 100 attributes) from disk into
@ -828,8 +834,8 @@ long time.
The idea behind *column-oriented* (or *columnar*) storage is simple: don’t store all the values from
one row together, but store all the values from each *column* together instead [^56].
If each column is stored separately, a query only needs to read and parse those columns that are
used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using
an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema).
--------
@ -864,10 +870,10 @@ Besides only loading those columns from disk that are required for a query, we c
the demands on disk throughput and network bandwidth by compressing data. Fortunately,
column-oriented storage often lends itself very well to compression.
Take a look at the sequences of values for each column in [Figure 4-7](/en/ch4#fig_column_store): they often look quite
repetitive, which is a good sign for compression. Depending on the data in the column, different
compression techniques can be used. One technique that is particularly effective in data warehouses
is *bitmap encoding*, illustrated in [Figure 4-8](/en/ch4#fig_bitmap_index).
{{< figure src="/fig/ddia_0408.png" id="fig_bitmap_index" caption="Figure 4-8. Compressed, bitmap-indexed storage of a single column." class="w-full my-4" >}}
@ -880,7 +886,7 @@ not.
One option is to store those bitmaps using one bit per row. However, these bitmaps typically contain
a lot of zeros (we say that they are *sparse*). In that case, the bitmaps can additionally be
run-length encoded: counting the number of consecutive zeros or ones and storing that number, as
shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the
two bitmap representations, using whichever is the most compact [^73].
This can make the encoding of a column remarkably efficient.
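
The two steps (one bitmap per distinct value, then run-length encoding of each bitmap) can be sketched as below. The column values are illustrative, and the convention that the run lengths alternate between zeros and ones, starting with the leading zeros, is an assumption made for this example.

```python
from itertools import groupby


def bitmaps(column: list[int]) -> dict[int, list[int]]:
    """One bitmap (one bit per row) for each distinct value in the column."""
    return {v: [int(x == v) for x in column] for v in sorted(set(column))}


def run_lengths(bits: list[int]) -> list[int]:
    """Alternating run lengths, starting with the number of leading zeros."""
    runs = [(bit, len(list(group))) for bit, group in groupby(bits)]
    lengths = [n for _, n in runs]
    if runs and runs[0][0] == 1:
        lengths.insert(0, 0)  # the bitmap starts with a one: zero leading zeros
    return lengths


product_sk = [69, 69, 69, 69, 74, 31, 31, 31, 31, 29, 30, 30, 31, 31, 31, 68, 69, 69]
encoded = {value: run_lengths(bits) for value, bits in bitmaps(product_sk).items()}
print(encoded[29])  # 9 zeros, 1 one, 8 zeros
print(encoded[69])  # 0 zeros, 4 ones, 12 zeros, 2 ones
```

Sparse bitmaps collapse to a handful of integers, which is where the space saving comes from.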
@ -928,7 +934,7 @@ last month, it might make sense to make `date_key` the first sort key. Then the
scan only the rows from the last month, which will be much faster than scanning all rows.
A second column can determine the sort order of any rows that have the same value in the first
column. For example, if `date_key` is the first sort key in [Figure 4-7](/en/ch4#fig_column_store), it might make
sense for `product_sk` to be the second sort key so that all sales for the same product on the same
day are grouped together in storage. That will help queries that need to group or filter sales by
product within a certain date range.
@ -936,7 +942,7 @@ product within a certain date range.
Another advantage of sorted order is that it can help with compression of columns. If the primary
sort column does not have many distinct values, then after sorting, it will have long sequences
where the same value is repeated many times in a row. A simple run-length encoding, like we used for
the bitmaps in [Figure 4-8](/en/ch4#fig_bitmap_index), could compress that column down to a few kilobytes—even if
the table has billions of rows.
That compression effect is strongest on the first sort key. The second and third sort keys will be
@ -1004,7 +1010,7 @@ Vectorized processing
and get back a bitmap (one bit per value in the input column, which is 1 if it’s a banana); we could
then pass the `store_sk` column and the ID of the store of interest to the same equality operator,
and get back another bitmap; and then we could pass the two bitmaps to a “bitwise AND” operator, as
shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in
a particular store.
{{< figure src="/fig/ddia_0409.png" id="fig_bitmap_and" caption="Figure 4-9. A bitwise AND between two bitmaps lends itself to vectorization." class="w-full my-4" >}}
@ -1039,18 +1045,18 @@ discussed earlier, data warehouse queries often involve an aggregate function, s
`AVG`, `MIN`, or `MAX` in SQL. If the same aggregates are used by many different queries, it can be
wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that
queries use most often? A *data cube* or *OLAP cube* does this by creating a grid of aggregates grouped by different dimensions [^82].
[Figure 4-10](/en/ch4#fig_data_cube) shows an example.
{{< figure src="/fig/ddia_0410.png" id="fig_data_cube" caption="Figure 4-10. Two dimensions of a data cube, aggregating data by summing." class="w-full my-4" >}}
Imagine for now that each fact has foreign keys to only two dimension tables—in [Figure 4-10](/en/ch4#fig_data_cube),
these are `date_key` and `product_sk`. You can now draw a two-dimensional table, with
dates along one axis and products along the other. Each cell contains the aggregate (e.g., `SUM`) of
an attribute (e.g., `net_price`) of all facts with that date-product combination. Then you can apply
the same aggregate along each row or column and get a summary that has been reduced by one
dimension (the sales by product regardless of date, or the sales by date regardless of product).
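
A two-dimensional cube of this kind can be sketched with plain dictionaries. The fact rows below are made up for illustration, and `net_price` is stored as an integer number of cents to keep the sums exact.

```python
from collections import defaultdict

# (date_key, product_sk, net_price in cents): hypothetical fact rows
facts = [
    (20240101, 31, 500),
    (20240101, 31, 250),
    (20240101, 69, 100),
    (20240102, 31, 300),
    (20240102, 74, 900),
]

# Each cell holds SUM(net_price) for one date-product combination.
cube = defaultdict(int)
for date_key, product_sk, net_price in facts:
    cube[(date_key, product_sk)] += net_price

# Summing along one axis reduces the cube by one dimension.
sales_by_date = defaultdict(int)
sales_by_product = defaultdict(int)
for (date_key, product_sk), total in cube.items():
    sales_by_date[date_key] += total
    sales_by_product[product_sk] += total

print(dict(cube))
print(dict(sales_by_date))
print(dict(sales_by_product))
```

Once the cube exists, the per-date and per-product summaries are computed from the cells rather than from the raw facts.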
In general, facts often have more than two dimensions. In [Figure 3-5](/en/ch3#fig_dwh_schema) there are five
dimensions: date, product, store, promotion, and customer. It’s a lot harder to imagine what a
five-dimensional hypercube would look like, but the principle remains the same: each cell contains
the sales for a particular date-product-store-promotion-customer combination. These values can then
@ -1132,11 +1138,11 @@ value of 0. Searching for documents mentioning “red apples” means a query th
The data structure that many search engines use to answer such queries is called an *inverted
index*. This is a key-value structure where the key is a term, and the value is the list of IDs of
all the documents that contain the term (the *postings list*). If the document IDs are sequential
numbers, the postings list can also be represented as a sparse bitmap, like in [Figure 4-8](/en/ch4#fig_bitmap_index):
the *n*th bit in the bitmap for term *x* is a 1 if the document with ID *n* contains the term *x* [^89].
Finding all the documents that contain both terms *x* and *y* is now similar to a vectorized data
warehouse query that searches for rows matching two conditions ([Figure 4-9](/en/ch4#fig_bitmap_and)): load the two
bitmaps for terms *x* and *y* and compute their bitwise AND. Even if the bitmaps are run-length
encoded, this can be done very efficiently.
@ -1147,7 +1153,7 @@ PostgreSQLs GIN index type also uses postings lists to support full-text sear
JSON documents [^92] [^93].
Instead of breaking text into words, an alternative is to find all the substrings of length *n*,
which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
`"hello"` are `"hel"`, `"ell"`, and `"llo"`. If we build an inverted index of all trigrams, we can
search the documents for arbitrary substrings that are at least three characters long. Trigram
indexes even allow regular expressions in search queries; the downside is that they are quite large [^94].
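
A minimal sketch of a trigram index and a substring search on top of it follows. The sample documents and function names are made up; the final `query in docs[d]` check is there because sharing all trigrams with the query does not guarantee that they appear contiguously.

```python
from collections import defaultdict


def ngrams(text: str, n: int = 3) -> list[str]:
    """All substrings of length n, in order of appearance."""
    return [text[i : i + n] for i in range(len(text) - n + 1)]


def build_trigram_index(docs: dict[int, str]) -> dict[str, set[int]]:
    index: dict[str, set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text):
            index[gram].add(doc_id)
    return index


def substring_search(index, docs: dict[int, str], query: str) -> list[int]:
    grams = ngrams(query)
    if not grams:
        raise ValueError("query must be at least three characters long")
    # Candidates must contain every trigram of the query...
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # ...and are then verified against the actual text.
    return sorted(d for d in candidates if query in docs[d])


docs = {1: "hello world", 2: "yellow", 3: "help"}
trigram_index = build_trigram_index(docs)
print(ngrams("hello"))
print(substring_search(trigram_index, docs, "ello"))
```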
@ -1226,7 +1232,7 @@ Inverted file (IVF) indexes
more vectors must be compared.
Hierarchical Navigable Small World (HNSW)
: HNSW indexes maintain multiple layers of the vector space, as illustrated in [Figure 4-11](/en/ch4#fig_vector_hnsw).
Each layer is represented as a graph, where nodes represent vectors, and edges represent proximity
to nearby vectors. A query starts by locating the nearest vector in the topmost layer, which has a
small number of nodes. The query then moves to the same node in the layer below and follows the
@ -1395,4 +1401,4 @@ documentation for the database of your choice.
[^101]: Matthijs Douze, Maria Lomeli, and Lucas Hosseini. [Faiss indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). *github.com*, August 2024. Archived at [perma.cc/2EWG-FPBS](https://perma.cc/2EWG-FPBS)
[^102]: Varik Matevosyan. [Understanding pgvector’s HNSW Index Storage in Postgres](https://lantern.dev/blog/pgvector-storage). *lantern.dev*, August 2024. Archived at [perma.cc/B2YB-JB59](https://perma.cc/B2YB-JB59)
[^103]: Dmitry Baranchuk, Artem Babenko, and Yury Malkov. [Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors](https://arxiv.org/pdf/1802.02422). At *European Conference on Computer Vision* (ECCV), pages 202–216, September 2018. [doi:10.1007/978-3-030-01258-8\_13](https://doi.org/10.1007/978-3-030-01258-8_13)
[^104]: Yury A. Malkov and Dmitry A. Yashunin. [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/pdf/1603.09320). *IEEE Transactions on Pattern Analysis and Machine Intelligence*, volume 42, issue 4, pages 824–836, April 2020. [doi:10.1109/TPAMI.2018.2889473](https://doi.org/10.1109/TPAMI.2018.2889473)

View file

@ -4,6 +4,8 @@ weight: 105
breadcrumbs: false
---
<a id="ch_encoding"></a>
![](/map/ch04.png)
> *Everything changes and nothing stands still.*
@ -12,14 +14,14 @@ breadcrumbs: false
Applications inevitably change over time. Features are added or modified as new products are
launched, user requirements become better understood, or business circumstances change. In
[Chapter 2](/en/ch2#ch_nonfunctional) we introduced the idea of *evolvability*: we should aim to build systems that
make it easy to adapt to change (see [“Evolvability: Making Change Easy”](/en/ch2#sec_introduction_evolvability)).
In most cases, a change to an application’s features also requires a change to data that it stores:
perhaps a new field or record type needs to be captured, or perhaps existing data needs to be
presented in a new way.
The data models we discussed in [Chapter 3](/en/ch3#ch_datamodels) have different ways of coping with such change.
Relational databases generally assume that all data in the database conforms to one schema: although
that schema can be changed (through schema migrations; i.e., `ALTER` statements), there is exactly
one schema in force at any one point in time. By contrast, schema-on-read (“schemaless”) databases
@ -52,13 +54,13 @@ format of data written by older code, and so you can explicitly handle it (if ne
keeping the old code to read the old data). Forward compatibility can be trickier, because it
requires older code to ignore additions made by a newer version of the code.
Another challenge with forward compatibility is illustrated in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
Say you add a field to a record schema, and the newer code creates a record containing that new
field and stores it in a database. Subsequently, an older version of the code (which doesn’t yet
know about the new field) reads the record, updates it, and writes it back. In this situation, the
desirable behavior is usually for the old code to keep the new field intact, even though it couldn’t
be interpreted. But if the record is decoded into a model object that does not explicitly
preserve unknown fields, data can be lost, like in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
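
The difference between the two behaviors can be sketched as follows. The model class `UserV1`, the `keep_unknown` switch, and the `photoURL` field are hypothetical names chosen for the example, and JSON is used only because it is easy to read.

```python
import json
from dataclasses import dataclass, field


@dataclass
class UserV1:
    """Model object of the older code, which only knows about userName."""

    user_name: str
    unknown: dict = field(default_factory=dict)  # fields this version can't interpret

    @classmethod
    def decode(cls, data: str, keep_unknown: bool) -> "UserV1":
        obj = json.loads(data)
        name = obj.pop("userName")
        return cls(name, obj if keep_unknown else {})

    def encode(self) -> str:
        return json.dumps({"userName": self.user_name, **self.unknown})


# Record written by newer code, including a field the old code doesn't know.
stored = json.dumps({"userName": "Martin", "photoURL": "https://example.com/m.png"})

lossy = UserV1.decode(stored, keep_unknown=False)
lossy.user_name = "Martin K."
print(lossy.encode())       # photoURL has been lost

preserving = UserV1.decode(stored, keep_unknown=True)
preserving.user_name = "Martin K."
print(preserving.encode())  # photoURL survives the read-update-write cycle
```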
{{< figure src="/fig/ddia_0501.png" id="fig_encoding_preserve_field" caption="When an older version of the application updates data previously written by a newer version of the application, data may be lost if you’re not careful." class="w-full my-4" >}}
@ -90,7 +92,7 @@ in-memory representation to a byte sequence is called *encoding* (also known as
> [!TIP] TERMINOLOGY CLASH
*Serialization* is unfortunately also used in the context of transactions (see [Chapter 8](/en/ch8#ch_transactions)),
with a completely different meaning. To avoid overloading the word we’ll stick with *encoding* in
this book, even though *serialization* is perhaps a more common term.
@ -202,7 +204,7 @@ Open content models are powerful, but can be complex. For example, say you want
integers (such as IDs) to strings. JSON does not have a map or dictionary type, only an “object”
type that can contain string keys, and values of any type. You can then constrain this type with
JSON Schema so that keys may only contain digits, and values can only be strings, using
`patternProperties` and `additionalProperties` as shown in [Example 5-1](/en/ch5#fig_encoding_json_schema).
{{< figure id="fig_encoding_json_schema" title="Example 5-1. Example JSON Schema with integer keys and string values. Integer keys are represented as strings containing only integers since JSON Schema requires all keys to be strings." class="w-full my-4" >}}
@ -237,7 +239,7 @@ sometimes faster to parse, but none of them are as widely adopted as the textual
Some of these formats extend the set of datatypes (e.g., distinguishing integers and floating-point numbers,
or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. In
particular, since they don’t prescribe a schema, they need to include all the object field names within
the encoded data. That is, in a binary encoding of the JSON document in [Example 5-2](/en/ch5#fig_encoding_json), they
will need to include the strings `userName`, `favoriteNumber`, and `interests` somewhere.
{{< figure id="fig_encoding_json" title="Example 5-2. Example record which we will encode in several binary formats in this chapter" class="w-full my-4" >}}
@ -250,8 +252,8 @@ will need to include the strings `userName`, `favoriteNumber`, and `interests` s
}
```
Let’s look at an example of MessagePack, a binary encoding for JSON. [Figure 5-2](/en/ch5#fig_encoding_messagepack)
shows the byte sequence that you get if you encode the JSON document in [Example 5-2](/en/ch5#fig_encoding_json) with
MessagePack. The first few bytes are as follows:
1. The first byte, `0x83`, indicates that what follows is an object (top four bits = `0x80`) with three
@ -281,7 +283,7 @@ It is similar to Apache Thrift, which was originally developed by Facebook [^13]
most of what this section says about Protocol Buffers applies also to Thrift.
Protocol Buffers requires a schema for any data that is encoded. To encode the data
in [Example 5-2](/en/ch5#fig_encoding_json) in Protocol Buffers, you would describe the schema in the Protocol Buffers
interface definition language (IDL) like this:
```protobuf
@ -300,17 +302,17 @@ application code can call this generated code to encode or decode records of the
language is very simple compared to JSON Schema: it only defines the fields of records and their
types, but it does not support other restrictions on the possible values of fields.
Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in [Figure 5-3](/en/ch5#fig_encoding_protobuf) [^14].
{{< figure src="/fig/ddia_0503.png" id="fig_encoding_protobuf" caption="Figure 5-3. Example record encoded using Protocol Buffers." class="w-full my-4" >}}
Similarly to [Figure 5-2](/en/ch5#fig_encoding_messagepack), each field has a type annotation (to indicate whether it
is a string, integer, etc.) and, where required, a length indication (such as the length of a
string). The strings that appear in the data (“Martin”, “daydreaming”, “hacking”) are also encoded
as ASCII (to be precise, UTF-8), similar to before.
The big difference compared to [Figure 5-2](/en/ch5#fig_encoding_messagepack) is that there are no field names
(`userName`, `favoriteNumber`, `interests`). Instead, the encoded data contains *field tags*, which
are numbers (`1`, `2`, and `3`). Those are the numbers that appear in the schema definition. Field tags
are like aliases for fields: they are a compact way of saying what field we’re talking about,
@ -344,7 +346,7 @@ You can add new fields to the schema, provided that you give each field a new ta
code (which doesn’t know about the new tag numbers you added) tries to read data written by new
code, including a new field with a tag number it doesn’t recognize, it can simply ignore that field.
The datatype annotation allows the parser to determine how many bytes it needs to skip, and preserve
the unknown fields to avoid the problem in [Figure 5-1](/en/ch5#fig_encoding_preserve_field). This maintains forward
compatibility: old code can read records that were written by new code.
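
The skipping logic can be sketched with a small hand-written decoder for the Protocol Buffers wire format. This is an illustration rather than how the generated code works: it handles only the varint, fixed-width, and length-delimited wire types, and the function names and the extra field with tag 4 are assumptions for the example.

```python
def read_varint(buf: bytes, pos: int) -> tuple[int, int]:
    """Decode a base-128 varint starting at pos; return (value, next position)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            return result, pos
        shift += 7


def parse(buf: bytes, known_tags: set[int]) -> tuple[dict[int, list], bytes]:
    """Split a message into known fields and the raw bytes of unknown fields."""
    known: dict[int, list] = {}
    unknown = b""
    pos = 0
    while pos < len(buf):
        start = pos
        key, pos = read_varint(buf, pos)
        tag, wire_type = key >> 3, key & 7
        if wire_type == 0:    # varint
            value, pos = read_varint(buf, pos)
        elif wire_type == 1:  # fixed 64-bit
            value, pos = buf[pos : pos + 8], pos + 8
        elif wire_type == 2:  # length-delimited (strings, bytes, nested messages)
            length, pos = read_varint(buf, pos)
            value, pos = buf[pos : pos + length], pos + length
        elif wire_type == 5:  # fixed 32-bit
            value, pos = buf[pos : pos + 4], pos + 4
        else:
            raise ValueError(f"unsupported wire type {wire_type}")
        if tag in known_tags:
            known.setdefault(tag, []).append(value)
        else:
            unknown += buf[start:pos]  # keep the bytes so they can be written back
    return known, unknown


# The example record: tag 1 = "Martin", tag 2 = 1337, tag 3 = two strings.
record = (
    b"\x0a\x06Martin" + b"\x10\xb9\x0a" + b"\x1a\x0bdaydreaming" + b"\x1a\x07hacking"
)
newer = record + b"\x20\x01"  # newer code appended a hypothetical field with tag 4

fields, unknown = parse(newer, known_tags={1, 2, 3})
print(fields)
print(unknown)  # the bytes of the unrecognized field, preserved for re-encoding
```

Because the wire type says how long each value is, the old reader can step over tag 4 without knowing what it means.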
What about backward compatibility? As long as each field has a unique tag number, new code can
@ -400,9 +402,9 @@ The equivalent JSON representation of that schema is as follows:
```
First of all, notice that there are no tag numbers in the schema. If we encode our example record
([Example 5-2](/en/ch5#fig_encoding_json)) using this schema, the Avro binary encoding is just 32 bytes long—the
most compact of all the encodings we have seen. The breakdown of the encoded byte sequence is shown
in [Figure 5-4](/en/ch5#fig_encoding_avro).
If you examine the byte sequence, you can see that there is nothing to identify fields or their
datatypes. The encoding simply consists of values concatenated together. A string is just a length
@ -430,7 +432,7 @@ example, that schema may be compiled into the application. This is known as the
When an application wants to decode some data (read it from a file or database, receive it from the
network, etc.), it uses two schemas: the writer’s schema that is identical to the one used for
encoding, and the *reader’s schema*, which may be different. This is illustrated in
[Figure 5-5](/en/ch5#fig_encoding_avro_schemas). The reader’s schema defines the fields of each record that the
application code is expecting, and their types.
{{< figure src="/fig/ddia_0505.png" id="fig_encoding_avro_schemas" caption="Figure 5-5. In Protocol Buffers, encoding and decoding can use different versions of a schema. In Avro, decoding uses two schemas: the writer's schema must be identical to the one used for encoding, but the reader's schema can be an older or newer version." class="w-full my-4" >}}
@ -438,7 +440,7 @@ application code is expecting, and their types.
If the reader’s and writer’s schema are the same, decoding is easy. If they are different, Avro
resolves the differences by looking at the writer’s schema and the reader’s schema side by side and
translating the data from the writer’s schema into the reader’s schema. The Avro specification [^16] [^17]
defines exactly how this resolution works, and it is illustrated in [Figure 5-6](/en/ch5#fig_encoding_avro_resolution).
For example, it's no problem if the writer's schema and the reader's schema have their fields in a
different order, because the schema resolution matches up the fields by field name. If the code
@ -490,7 +492,7 @@ The answer depends on the context in which Avro is being used. To give a few exa
Large file with lots of records
: A common use for Avro is for storing a large file containing millions of records, all encoded with
the same schema. (We will discuss this kind of situation in [Link to Come].) In this case, the
the same schema. (We will discuss this kind of situation in [Chapter 11](/en/ch11#ch_batch).) In this case, the
writer of that file can just include the writer's schema once at the beginning of the file. Avro
specifies a file format (object container files) to do this.
@ -661,7 +663,7 @@ As the data dump is written in one go and is thereafter immutable, formats like
container files are a good fit. This is also a good opportunity to encode the data in an
analytics-friendly column-oriented format such as Parquet (see [“Column Compression”](/en/ch4#sec_storage_column_compression)).
In [Link to Come] we will talk more about using data in archival storage.
In [Chapter 11](/en/ch11#ch_batch) we will talk more about using data in archival storage.
### Dataflow Through Services: REST and RPC {#sec_encoding_dataflow_rpc}
@ -686,7 +688,7 @@ application-specific, and the client and server need to agree on the details of
In some ways, services are similar to databases: they typically allow clients to submit and query
data. However, while databases allow arbitrary queries using the query languages we discussed in
[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
that are predetermined by the business logic (application code) of the service [^29]. This restriction provides a degree of encapsulation: services can impose
fine-grained restrictions on what clients can and cannot do.
@ -728,7 +730,7 @@ service. The two most popular service IDLs are OpenAPI (also known as Swagger [^
and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send
and receive Protocol Buffers.
Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](/en/ch5#fig_open_api_def).
Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](/en/ch5#fig_open_api_def).
The service definition allows developers to define service endpoints, documentation, versions, data
models, and much more. gRPC definitions look similar, but are defined using Protocol Buffers service definitions.
@ -762,8 +764,8 @@ Even if a design philosophy and IDL are adopted, developers must still write the
implements their service's API calls. A service framework is often adopted to simplify this
effort. Service frameworks such as Spring Boot, FastAPI, and gRPC allow developers to write the
business logic for each API endpoint while the framework code handles routing, metrics, caching,
authentication, and so on. [Example 5-4](/en/ch5#fig_fastapi_def) shows an example Python implementation of the service
defined in [Example 5-3](/en/ch5#fig_open_api_def).
authentication, and so on. [Example 5-4](/en/ch5#fig_fastapi_def) shows an example Python implementation of the service
defined in [Example 5-3](/en/ch5#fig_open_api_def).
{{< figure id="fig_fastapi_def" title="Example 5-4. Example FastAPI service implementing the definition from [Example 5-3](/en/ch5#fig_open_api_def)" class="w-full my-4" >}}
@ -815,11 +817,11 @@ A network request is very different from a local function call:
it goes into an infinite loop or the process crashes). A network request has another possible
outcome: it may return without a result, due to a *timeout*. In that case, you simply don't know
what happened: if you don't get a response from the remote service, you have no way of knowing
whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/en/ch9#ch_distributed).)
whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/en/ch9#ch_distributed).)
* If you retry a failed network request, it could happen that the previous request actually got
through, and only the response was lost. In that case, retrying will cause the action to
be performed multiple times, unless you build a mechanism for deduplication (*idempotence*) into the protocol [^40].
Local function calls don't have this problem. (We discuss idempotence in more detail in [Link to Come].)
Local function calls don't have this problem. (We discuss idempotence in more detail in [“Idempotence”](/en/ch12#sec_stream_idempotence).)
* Every time you call a local function, it normally takes about the same time to execute. A network
request is much slower than a function call, and its latency is also wildly variable: at good
times it may complete in less than a millisecond, but when the network is congested or the remote
@ -870,7 +872,7 @@ There are many load balancing and service discovery solutions available:
* *Service discovery systems* use a centralized registry rather than DNS to track which service
endpoints are available. When a new service instance starts up, it registers itself with the
service discovery system by declaring the host and port it's listening on, along with relevant
metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location,
metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location,
and more. The service then periodically sends a heartbeat signal to the discovery system to signal
that the service is still available.
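
The register-then-heartbeat cycle can be sketched with a small in-memory registry. This is a toy model under stated assumptions, not the API of any real discovery system: the class name `ServiceRegistry`, the `ttl_seconds` expiry rule, and the addresses are all hypothetical.

```python
import time


class ServiceRegistry:
    """Toy registry: instances register, then heartbeat to stay listed."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so expiry can be simulated
        self._instances = {}       # (host, port) -> (metadata, last_seen)

    def register(self, host, port, **metadata):
        # A new instance declares its endpoint plus any relevant metadata.
        self._instances[(host, port)] = (metadata, self._clock())

    def heartbeat(self, host, port):
        key = (host, port)
        if key not in self._instances:
            raise KeyError(f"{host}:{port} is not registered")
        metadata, _ = self._instances[key]
        self._instances[key] = (metadata, self._clock())

    def available(self):
        """Return endpoints whose last heartbeat is within the TTL."""
        now = self._clock()
        return sorted(
            key for key, (_, last_seen) in self._instances.items()
            if now - last_seen <= self._ttl
        )


# Simulated time, so the example does not need to sleep.
now = [0.0]
registry = ServiceRegistry(ttl_seconds=10.0, clock=lambda: now[0])
registry.register("10.0.0.1", 8080, datacenter="eu-west")
registry.register("10.0.0.2", 8080, datacenter="eu-west")

now[0] = 8.0
registry.heartbeat("10.0.0.1", 8080)   # only the first instance checks in

now[0] = 15.0
live = registry.available()
print(live)
```

An instance that stops sending heartbeats is never removed explicitly; it simply ages out once its last heartbeat is older than the TTL.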
@ -936,7 +938,7 @@ services responsible for fraud detection, credit card integration, bank integrat
Processing a single payment in our example requires many service calls. A payment processor service
might invoke the fraud detection service to check for fraud, call the credit card service to debit
the credit card, and call the banking service to deposit debited funds, as shown in
[Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*.
[Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*.
Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a
general-purpose programming language, a domain-specific language (DSL), or a markup language such as
Business Process Execution Language (BPEL) [^44].
@ -967,7 +969,7 @@ tasks.
There are many kinds of workflow engines that address a diverse set of use cases. Some, such as
Airflow, Dagster, and Prefect, integrate with data systems and orchestrate ETL tasks. Others, such
as Camunda and Orkes, provide a graphical notation for workflows (such as BPMN, used in
[Figure 5-7](/en/ch5#fig_encoding_workflow)) so that non-engineers can more easily define and execute workflows. Still
[Figure 5-7](/en/ch5#fig_encoding_workflow)) so that non-engineers can more easily define and execute workflows. Still
others, such as Temporal and Restate, provide *durable execution*.
#### Durable execution {#durable-execution}
@ -984,7 +986,7 @@ task fails, the framework will re-execute the task, but will skip any RPC calls
that the task made successfully before failing. Instead, the framework will pretend to make the
call, but will instead return the results from the previous call. This is possible because durable
execution frameworks log all RPCs and state changes to durable storage like a write-ahead log [^45] [^46].
[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
using Temporal.
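
Independently of any particular framework, the replay mechanism described above can be sketched in a few lines. This is a simplified model and not how Temporal is implemented: `DurableContext`, its `journal` list (standing in for durable storage), and the three task functions are hypothetical.

```python
class DurableContext:
    """Toy stand-in for a durable execution framework's call journal."""

    def __init__(self, journal=None):
        # A real framework keeps the journal in durable storage; here it
        # is a plain list that the caller preserves across retries.
        self.journal = journal if journal is not None else []
        self._position = 0

    def call(self, fn, *args):
        if self._position < len(self.journal):
            # Replay: this call succeeded in an earlier attempt, so return
            # the logged result instead of invoking fn again.
            result = self.journal[self._position]
        else:
            result = fn(*args)
            self.journal.append(result)   # log the result of a fresh call
        self._position += 1
        return result


executed = []                # records which calls were really made
deposit_attempts = [0]


def check_fraud(amount):
    executed.append("check_fraud")
    return "ok"


def debit_card(amount):
    executed.append("debit_card")
    return "debited"


def deposit(amount):
    executed.append("deposit")
    deposit_attempts[0] += 1
    if deposit_attempts[0] == 1:
        raise ConnectionError("bank service unavailable")  # simulated fault
    return "deposited"


def payment_workflow(ctx, amount):
    ctx.call(check_fraud, amount)
    ctx.call(debit_card, amount)
    return ctx.call(deposit, amount)


journal = []
try:
    payment_workflow(DurableContext(journal), 100)   # first attempt fails
except ConnectionError:
    pass
outcome = payment_workflow(DurableContext(journal), 100)  # re-execution
print(outcome, executed)
```

On the second attempt the card is not debited again, because the first two calls are answered from the journal. The sketch only works if the workflow issues the same calls in the same order on every execution, which is why such frameworks require workflow code to be deterministic.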
{{< figure id="fig_temporal_workflow" title="Example 5-5. A Temporal workflow definition fragment for the payment workflow in [Figure 5-7](/en/ch5#fig_encoding_workflow)." class="w-full my-4" >}}
@ -1060,7 +1062,7 @@ In the past, the landscape of message brokers was dominated by commercial enterp
companies such as TIBCO, IBM WebSphere, and webMethods, before open source implementations such as
RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka became popular. More recently, cloud services
such as Amazon Kinesis, Azure Service Bus, and Google Cloud Pub/Sub have gained adoption. We will
compare them in more detail in [Link to Come].
compare them in more detail in [“Messaging Systems”](/en/ch12#sec_stream_messaging).
The detailed delivery semantics vary by implementation and configuration, but in general, two
message distribution patterns are most often used:
@ -1084,7 +1086,7 @@ to use event sourcing (see [“Event Sourcing and CQRS”](/en/ch3#sec_datamodel
If a consumer republishes messages to another topic, you may need to be careful to preserve unknown
fields, to prevent the issue described previously in the context of databases
([Figure 5-1](/en/ch5#fig_encoding_preserve_field)).
([Figure 5-1](/en/ch5#fig_encoding_preserve_field)).
#### Distributed actor frameworks {#distributed-actor-frameworks}
@ -1213,4 +1215,4 @@ quite achievable. May your applications evolution be rapid and your deploymen
[^48]: [What is a Temporal Workflow?](https://docs.temporal.io/workflows) *docs.temporal.io*, 2024. Archived at [perma.cc/B5C5-Y396](https://perma.cc/B5C5-Y396)
[^49]: Jack Kleeman. [Solving durable execution's immutability problem](https://restate.dev/blog/solving-durable-executions-immutability-problem/). *restate.dev*, February 2024. Archived at [perma.cc/G55L-EYH5](https://perma.cc/G55L-EYH5)
[^50]: Srinath Perera. [Exploring Event-Driven Architecture: A Beginner's Guide for Cloud Native Developers](https://wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/). *wso2.com*, August 2023. Archived at [archive.org](https://web.archive.org/web/20240716204613/https%3A//wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/)
[^51]: Philip A. Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. [Orleans: Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical Report MSR-TR-2014-41, March 2014. Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF)
[^51]: Philip A. Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. [Orleans: Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical Report MSR-TR-2014-41, March 2014. Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF)


@ -4,6 +4,8 @@ weight: 206
breadcrumbs: false
---
<a id="ch_replication"></a>
![](/map/ch05.png)
> *The major difference between a thing that might go wrong and a thing that cannot possibly go wrong
@ -21,7 +23,7 @@ why you might want to replicate data:
* To scale out the number of machines that can serve read queries (and thus increase read throughput)
In this chapter we will assume that your dataset is small enough that each machine can hold a copy of
the entire dataset. In [Chapter 7](/en/ch7#ch_sharding) we will relax that assumption and discuss *sharding*
the entire dataset. In [Chapter 7](/en/ch7#ch_sharding) we will relax that assumption and discuss *sharding*
(*partitioning*) of datasets that are too big for a single machine. In later chapters we will discuss
various kinds of faults that can occur in a replicated data system, and how to deal with them.
@ -72,7 +74,7 @@ question inevitably arises: how do we ensure that all the data ends up on all th
Every write to the database needs to be processed by every replica; otherwise, the replicas would no
longer contain the same data. The most common solution is called *leader-based replication*,
*primary-backup*, or *active/passive*. It works as follows (see [Figure 6-1](/en/ch6#fig_replication_leader_follower)):
*primary-backup*, or *active/passive*. It works as follows (see [Figure 6-1](/en/ch6#fig_replication_leader_follower)):
1. One of the replicas is designated the *leader* (also known as *primary* or *source* [^2]).
When clients want to write to the database, they must send their requests to the leader, which
@ -88,7 +90,7 @@ longer contain the same data. The most common solution is called *leader-based r
{{< figure src="/fig/ddia_0601.png" id="fig_replication_leader_follower" caption="Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas." class="w-full my-4" >}}
If the database is sharded (see [Chapter 7](/en/ch7#ch_sharding)), each shard has one leader. Different shards may
If the database is sharded (see [Chapter 7](/en/ch7#ch_sharding)), each shard has one leader. Different shards may
have their leaders on different nodes, but each shard must nevertheless have one leader node. In
[“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader) we will discuss an alternative model in which a system may have
multiple leaders for the same shard at the same time.
@ -99,7 +101,7 @@ It is also used in some document databases such as MongoDB and DynamoDB [^5],
message brokers such as Kafka, replicated block devices such as DRBD, and some network filesystems.
Many consensus algorithms such as Raft, which is used for replication in CockroachDB [^6], TiDB [^7],
etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and automatically
elect a new leader if the old one fails (we will discuss consensus in more detail in [Chapter 10](/en/ch10#ch_consistency)).
elect a new leader if the old one fails (we will discuss consensus in more detail in [Chapter 10](/en/ch10#ch_consistency)).
--------
@ -115,15 +117,15 @@ An important detail of a replicated system is whether the replication happens *s
*asynchronously*. (In relational databases, this is often a configurable option; other systems are
often hardcoded to be either one or the other.)
Think about what happens in [Figure 6-1](/en/ch6#fig_replication_leader_follower), where the user of a website updates
Think about what happens in [Figure 6-1](/en/ch6#fig_replication_leader_follower), where the user of a website updates
their profile image. At some point in time, the client sends the update request to the leader;
shortly afterward, it is received by the leader. At some point, the leader forwards the data change
to the followers. Eventually, the leader notifies the client that the update was successful.
[Figure 6-2](/en/ch6#fig_replication_sync_replication) shows one possible way the timings could work out.
[Figure 6-2](/en/ch6#fig_replication_sync_replication) shows one possible way the timings could work out.
{{< figure src="/fig/ddia_0602.png" id="fig_replication_sync_replication" caption="Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower." class="w-full my-4" >}}
In the example of [Figure 6-2](/en/ch6#fig_replication_sync_replication), the replication to follower 1 is
In the example of [Figure 6-2](/en/ch6#fig_replication_sync_replication), the replication to follower 1 is
*synchronous*: the leader waits until follower 1 has confirmed that it received the write before
reporting success to the user, and before making the write visible to other clients. The replication
to follower 2 is *asynchronous*: the leader sends the message, but doesn't wait for a response from
@ -155,7 +157,7 @@ In some systems, a *majority* (e.g., 3 out of 5 replicas, including the leader)
updated synchronously, and the remaining minority is asynchronous. This is an example of a *quorum*,
which we will discuss further in [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition). Majority quorums are often
used in systems that use a consensus protocol for automatic leader election, which we will return to
in [Chapter 10](/en/ch10#ch_consistency).
in [Chapter 10](/en/ch10#ch_consistency).
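
The majority rule is simple arithmetic, sketched below. The function names are hypothetical, and the acknowledgement count is assumed to include the leader's own copy, as in the "3 out of 5 replicas, including the leader" example.

```python
def majority(replica_count):
    """Smallest number of replicas that is more than half of the total."""
    return replica_count // 2 + 1


def write_is_committed(acks, replica_count):
    """True once a majority (counting the leader itself) holds the write."""
    return acks >= majority(replica_count)


print(majority(5), write_is_committed(acks=3, replica_count=5))
```

With five replicas, the leader can report success after two followers have confirmed the write; the remaining two can catch up asynchronously.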
Sometimes, leader-based replication is configured to be completely asynchronous. In this case, if the
leader fails and is not recoverable, any writes that have not yet been replicated to followers are
@ -206,6 +208,8 @@ Litestream does the equivalent for SQLite.
--------
<a id="sec_replication_object_storage"></a>
> [!TIP] DATABASES BACKED BY OBJECT STORAGE
Object storage can be used for more than archiving data. Many databases are beginning to use object
@ -303,7 +307,7 @@ consists of the following steps:
established *controller node* [^13].
The best candidate for leadership is usually the replica with the most up-to-date data changes
from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader
is a consensus problem, discussed in detail in [Chapter 10](/en/ch10#ch_consistency).
is a consensus problem, discussed in detail in [Chapter 10](/en/ch10#ch_consistency).
3. *Reconfiguring the system to use the new leader.* Clients now need to send
their write requests to the new leader (we discuss this
in [“Request Routing”](/en/ch7#sec_sharding_routing)). If the old leader comes back, it might still believe that it is
@ -326,7 +330,7 @@ Failover is fraught with things that can go wrong:
primary keys that were previously assigned by the old leader. These primary keys were also used in
a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis,
which caused some private data to be disclosed to the wrong users.
* In certain fault scenarios (see [Chapter 9](/en/ch9#ch_distributed)), it could happen that two nodes both believe
* In certain fault scenarios (see [Chapter 9](/en/ch9#ch_distributed)), it could happen that two nodes both believe
that they are the leader. This situation is called *split brain*, and it is dangerous: if both
leaders accept writes, and there is no process for resolving conflicts (see
[“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some
@ -362,7 +366,7 @@ behind by several days could be catastrophic.
These issues—node failures; unreliable networks; and trade-offs around replica consistency,
durability, availability, and latency—are in fact fundamental problems in distributed systems.
In [Chapter 9](/en/ch9#ch_distributed) and [Chapter 10](/en/ch10#ch_consistency) we will discuss them in greater depth.
In [Chapter 9](/en/ch9#ch_distributed) and [Chapter 10](/en/ch10#ch_consistency) we will discuss them in greater depth.
### Implementation of Replication Logs {#sec_replication_implementation}
@ -405,7 +409,7 @@ in practice, so many databases prefer other replication methods.
#### Write-ahead log (WAL) shipping {#write-ahead-log-wal-shipping}
In [Chapter 4](/en/ch4#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
In [Chapter 4](/en/ch4#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
every modification is first written to the WAL so that the tree can be restored to a consistent
state after a crash. Since the WAL contains all the information necessary to restore the indexes and
heap into a consistent state, we can use the exact same log to build a replica on another node:
@ -426,6 +430,8 @@ performing a failover to make one of the upgraded nodes the new leader. If the r
does not allow this version mismatch, as is often the case with WAL shipping, such upgrades require
downtime.
<a id="sec_replication_logical"></a>
#### Logical (row-based) log replication {#logical-row-based-log-replication}
An alternative is to use different log formats for replication and for the storage engine, which
@ -456,7 +462,7 @@ software. This in turn enables upgrading to a new version with minimal downtime
A logical log format is also easier for external applications to parse. This aspect is useful if you want
to send the contents of a database to an external system, such as a data warehouse for offline
analysis, or for building custom indexes and caches [^21].
This technique is called *change data capture*, and we will return to it in [Link to Come].
This technique is called *change data capture*, and we will return to it in [“Change Data Capture”](/en/ch12#sec_stream_cdc).
## Problems with Replication Lag {#sec_replication_lag}
@ -513,7 +519,7 @@ be read from a follower. This is especially appropriate if data is frequently vi
occasionally written.
With asynchronous replication, there is a problem, illustrated in
[Figure 6-3](/en/ch6#fig_replication_read_your_writes): if the user views the data shortly after making a write, the
[Figure 6-3](/en/ch6#fig_replication_read_your_writes): if the user views the data shortly after making a write, the
new data may not yet have reached the replica. To the user, it looks as though the data they
submitted was lost, so they will be understandably unhappy.
@ -597,7 +603,7 @@ Our second example of an anomaly that can occur when reading from asynchronous f
possible for a user to see things *moving backward in time*.
This can happen if a user makes several reads from different replicas. For example,
[Figure 6-4](/en/ch6#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower
[Figure 6-4](/en/ch6#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower
with little lag, then to a follower with greater lag. (This scenario is quite likely if the user
refreshes a web page, and each request is routed to a random server.) The first query returns a
comment that was recently added by user 1234, but the second query doesn't return anything because
@ -636,7 +642,7 @@ answered it.
Now, imagine a third person is listening to this conversation through followers. The things said by
Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer
replication lag (see [Figure 6-5](/en/ch6#fig_replication_consistent_prefix)). This observer would hear the following:
replication lag (see [Figure 6-5](/en/ch6#fig_replication_consistent_prefix)). This observer would hear the following:
Mrs. Cake
: About ten seconds usually, Mr. Poons.
@ -654,7 +660,7 @@ This guarantee says that if a sequence of writes happens in a certain order,
then anyone reading those writes will see them appear in the same order.
This is a particular problem in sharded (partitioned) databases, which we will discuss in
[Chapter 7](/en/ch7#ch_sharding). If the database always applies writes in the same order, reads always see a
[Chapter 7](/en/ch7#ch_sharding). If the database always applies writes in the same order, reads always see a
consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different
shards operate independently, so there is no global ordering of writes: when a user reads from the
database, they may see some parts of the database in an older state and some in a newer state.
@ -678,8 +684,8 @@ synchronously updated follower. However, dealing with these issues in applicatio
and easy to get wrong.
The simplest programming model for application developers is to choose a database that provides a
strong consistency guarantee for replicas such as linearizability (see [Chapter 10](/en/ch10#ch_consistency)), and ACID
transactions (see [Chapter 8](/en/ch8#ch_transactions)). This allows you to mostly ignore the challenges that arise
strong consistency guarantee for replicas such as linearizability (see [Chapter 10](/en/ch10#ch_consistency)), and ACID
transactions (see [Chapter 8](/en/ch8#ch_transactions)). This allows you to mostly ignore the challenges that arise
from replication, and treat the database as if it had just a single node. In the early 2010s the
*NoSQL* movement promoted the view that these features limited scalability, and that large-scale
systems would have to embrace eventual consistency.
@ -738,7 +744,7 @@ single-leader replication, the leader has to be in *one* of the regions, and all
through that region.
In a multi-leader configuration, you can have a leader in *each* region.
[Figure 6-6](/en/ch6#fig_replication_multi_dc) shows what this architecture might look like. Within each region,
[Figure 6-6](/en/ch6#fig_replication_multi_dc) shows what this architecture might look like. Within each region,
regular leader-follower replication is used (with followers maybe in a different availability zone
from the leader); between regions, each region's leader replicates its changes to the leaders in
other regions.
@ -774,7 +780,7 @@ Tolerance of network problems
Consistency
: A single-leader system can provide strong consistency guarantees, such as serializable
transactions, which we will discuss in [Chapter 8](/en/ch8#ch_transactions). The biggest downside of multi-leader
transactions, which we will discuss in [Chapter 8](/en/ch8#ch_transactions). The biggest downside of multi-leader
systems is that the consistency they can achieve is much weaker. For example, you can't guarantee
that a bank account won't go negative or that a username is unique: it's always possible for
different leaders to process writes that are individually fine (paying out some of the money in an
@ -798,14 +804,14 @@ multi-leader replication is often considered dangerous territory that should be
#### Multi-leader replication topologies {#sec_replication_topologies}
A *replication topology* describes the communication paths along which writes are propagated from
one node to another. If you have two leaders, like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), there is
one node to another. If you have two leaders, like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), there is
only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With
more than two leaders, various different topologies are possible. Some examples are illustrated in
[Figure 6-7](/en/ch6#fig_replication_topologies).
[Figure 6-7](/en/ch6#fig_replication_topologies).
{{< figure src="/fig/ddia_0607.png" id="fig_replication_topologies" caption="Figure 6-7. Three example topologies in which multi-leader replication can be set up." class="w-full my-4" >}}
The most general topology is *all-to-all*, shown in [Figure 6-7](/en/ch6#fig_replication_topologies)(c),
The most general topology is *all-to-all*, shown in [Figure 6-7](/en/ch6#fig_replication_topologies)(c),
in which every leader sends its writes to every other leader. However, more restricted topologies
are also used: for example a *circular topology* in which each node receives writes from one node
and forwards those writes (plus any writes of its own) to one other node. Another popular topology
@ -839,11 +845,11 @@ along different paths, avoiding a single point of failure.
On the other hand, all-to-all topologies can have issues too. In particular, some network links may
be faster than others (e.g., due to network congestion), with the result that some replication
messages may “overtake” others, as illustrated in [Figure 6-8](/en/ch6#fig_replication_causality).
messages may “overtake” others, as illustrated in [Figure 6-8](/en/ch6#fig_replication_causality).
{{< figure src="/fig/ddia_0608.png" id="fig_replication_causality" caption="Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas." class="w-full my-4" >}}
In [Figure 6-8](/en/ch6#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
In [Figure 6-8](/en/ch6#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may
first receive the update (which, from its point of view, is an update to a row that does not exist
in the database) and only later receive the corresponding insert (which should have preceded the
@ -853,12 +859,12 @@ This is a problem of causality, similar to the one we saw in [“Consistent Pref
the update depends on the prior insert, so we need to make sure that all nodes process the insert
first, and then the update. Simply attaching a timestamp to every write is not sufficient, because
clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see
[Chapter 9](/en/ch9#ch_distributed)).
[Chapter 9](/en/ch9#ch_distributed)).
To order these events correctly, a technique called *version vectors* can be used, which we will
discuss later in this chapter (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). However, many multi-leader
replication systems don't use good techniques for ordering updates, leaving them vulnerable to
issues like the one in [Figure 6-8](/en/ch6#fig_replication_causality). If you are using multi-leader replication, it
issues like the one in [Figure 6-8](/en/ch6#fig_replication_causality). If you are using multi-leader replication, it
is worth being aware of these issues, carefully reading the documentation, and thoroughly testing
your database to ensure that it really does provide the guarantees you believe it to have.
@ -926,8 +932,8 @@ approach has a number of advantages:
* Having the data locally means the user interface can be much faster to respond than if it had to
wait for a service call to fetch some data. Some apps aim to respond to user input in the *next
frame* of the graphics system, which means rendering within 16 ms on a display with a
60 Hz refresh rate.
frame* of the graphics system, which means rendering within 16 ms on a display with a
60 Hz refresh rate.
* Allowing users to continue working while offline is valuable, especially on mobile devices with
intermittent connectivity. With a sync engine, an app doesn't need a separate offline mode: being
offline is the same as having very large network delay.
@ -967,7 +973,7 @@ a local-first sync engine on end user devices—is that concurrent writes on dif
lead to conflicts that need to be resolved.
For example, consider a wiki page that is simultaneously being edited by two users, as shown in
[Figure 6-9](/en/ch6#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2
[Figure 6-9](/en/ch6#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2
independently changes the title from A to C. Each user's change is successfully applied to their
local leader. However, when the changes are asynchronously replicated, a conflict is detected.
This problem does not occur in a single-leader database.
@ -975,7 +981,7 @@ This problem does not occur in a single-leader database.
{{< figure src="/fig/ddia_0609.png" id="fig_replication_write_conflict" caption="Figure 6-9. A write conflict caused by two leaders concurrently updating the same record." class="w-full my-4" >}}
> [!NOTE]
> We say that the two writes in [Figure 6-9](/en/ch6#fig_replication_write_conflict) are *concurrent* because neither
> We say that the two writes in [Figure 6-9](/en/ch6#fig_replication_write_conflict) are *concurrent* because neither
> was “aware” of the other at the time the write was originally made. It doesn't matter whether the
> writes literally happened at the same time; indeed, if the writes were made while offline, they
> might have actually happened some time apart. What matters is whether one write occurred in a state
@ -1017,7 +1023,7 @@ We will discuss other ID assignment schemes in [“ID Generators and Logical Clo
If conflicts can’t be avoided, the simplest way of resolving them is to attach a timestamp to each
write, and to always use the value with the greatest timestamp. For example, in
[Figure 6-9](/en/ch6#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than
[Figure 6-9](/en/ch6#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than
the timestamp of user 2’s write. In that case, both leaders will determine that the new title of the
page should be B, and they discard the write that sets it to C. If the writes coincidentally have
the same timestamp, the winner can be chosen by comparing the values (e.g., in the case of strings,
@ -1025,7 +1031,7 @@ taking the one thats earlier in the alphabet).
This approach is called *last write wins* (LWW) because the write with the greatest timestamp can be
considered the “last” one. The term is misleading though, because when two writes are concurrent
like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), which one is older and which is later is undefined, and
like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), which one is older and which is later is undefined, and
so the timestamp order of concurrent writes is essentially random.
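The rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Write` type and the `lww_merge` function are hypothetical names, not the API of any particular database, and the timestamps are assumed to have been attached by whoever made the write.

```python
from typing import NamedTuple


class Write(NamedTuple):
    timestamp: int  # assumed to be attached by the writer; clocks may disagree
    value: str


def lww_merge(a: Write, b: Write) -> Write:
    """Pick the winner of two conflicting writes under last write wins."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Equal timestamps: break the tie deterministically by comparing values,
    # here keeping the string that sorts earlier.
    return a if a.value <= b.value else b


user1 = Write(timestamp=2, value="B")
user2 = Write(timestamp=1, value="C")
winner = lww_merge(user1, user2)  # user 1's write has the greater timestamp
```

Because the comparison does not depend on argument order, every replica that sees both writes picks the same winner. The losing write is simply dropped.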
Therefore the real meaning of LWW is: when the same record is concurrently written on different
@ -1055,7 +1061,7 @@ merge is complete.
In a database, it would be impractical for a conflict to stop the entire replication process until a
human has resolved it. Instead, databases typically store all the concurrently written values for a
given record—for example, both B and C in [Figure 6-9](/en/ch6#fig_replication_write_conflict). These values are
given record—for example, both B and C in [Figure 6-9](/en/ch6#fig_replication_write_conflict). These values are
sometimes called *siblings*. The next time you query that record, the database returns *all* those
values, rather than just the latest one. You can then resolve those values in whatever way you want,
either automatically in application code (for example, you could concatenate B and C into “B/C”), or
@ -1077,7 +1083,7 @@ suffers from a number of problems:
keeping all the shopping cart items that appeared in any of the siblings (i.e., taking the set
union of the carts). This meant that if the customer had removed an item from their cart in one
sibling, but another sibling still contained that old item, the removed item would unexpectedly
reappear in the customer’s cart [^45]. [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
reappear in the customer’s cart [^45]. [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear.
* If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution
process can itself introduce a new conflict. Those resolutions could even be inconsistent: for
@ -1088,6 +1094,8 @@ suffers from a number of problems:
{{< figure src="/fig/ddia_0610.png" id="fig_replication_amazon_anomaly" caption="Figure 6-10. Example of Amazon's shopping cart anomaly: if conflicts on a shopping cart are merged by taking the union, deleted items may reappear." class="w-full my-4" >}}
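The anomaly in Figure 6-10 can be reproduced with plain Python sets. This is a simplified illustration: the initial cart contents are an assumption chosen to match the example, and the deletion-aware variant only hints at what real merge algorithms do when they track removals explicitly.

```python
# Assumed starting cart, replicated to both devices.
initial = {"Book", "DVD", "Soap"}

device1 = initial - {"Book"}  # Device 1 removes Book
device2 = initial - {"DVD"}   # Device 2 concurrently removes DVD

# Merging the siblings by set union: both removed items come back.
union_merge = device1 | device2

# Illustrative alternative: remember what each device removed and
# apply every removal to the merged result.
removed = (initial - device1) | (initial - device2)
deletion_aware_merge = (device1 | device2) - removed
```

With the union, the cart ends up with all three items again. Tracking the removals leaves only Soap.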
<a id="sec_replication_automatic_resolution"></a>
#### Automatic conflict resolution {#automatic-conflict-resolution}
For many applications, the best way of handling conflicts is to use an algorithm that automatically
@ -1105,8 +1113,8 @@ updates as much as possible, and hence avoiding data loss:
same position, it can be ordered deterministically so that all nodes get the same merged outcome.
* If the data is a collection of items (ordered like a to-do list, or unordered like a shopping
cart), we can merge it similarly to text by tracking insertions and deletions. To avoid the
shopping cart issue in [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly), the algorithms track the fact that Book
and DVD were deleted, so the merged result is Cart = {Soap}.
shopping cart issue in [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly), the algorithms track the fact that Book
and DVD were deleted, so the merged result is Cart = {Soap}.
* If the data is an integer representing a counter that can be incremented or decremented (e.g., the
number of likes on a social media post), the merge algorithm can tell how many increments and
decrements happened on each sibling, and add them together correctly so that the result does not
@ -1129,7 +1137,7 @@ Two families of algorithms are commonly used to implement automatic conflict res
They have different design philosophies and performance characteristics, but both are able to
perform automatic merges for all the aforementioned types of data.
[Figure 6-11](/en/ch6#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
[Figure 6-11](/en/ch6#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
text. Assume you have two replicas that both start off with the text “ice”. One replica prepends the
letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make “ice!”.
@ -1147,7 +1155,7 @@ OT
CRDT
: Most CRDTs give each character a unique, immutable ID and use those to determine the positions of
insertions/deletions, instead of indexes. For example, in [Figure 6-11](/en/ch6#fig_replication_ot_crdt) we assign
insertions/deletions, instead of indexes. For example, in [Figure 6-11](/en/ch6#fig_replication_ot_crdt) we assign
the ID 1A to “i”, the ID 2A to “c”, etc. When inserting the exclamation mark, we generate an
operation containing the ID of the new character (4B) and the ID of the existing character after
which we want to insert (3A). To insert at the beginning of the string we give “nil” as the
@ -1165,7 +1173,7 @@ Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge o
#### What is a conflict? {#what-is-a-conflict}
Some kinds of conflict are obvious. In the example in [Figure 6-9](/en/ch6#fig_replication_write_conflict), two writes
Some kinds of conflict are obvious. In the example in [Figure 6-9](/en/ch6#fig_replication_write_conflict), two writes
concurrently modified the same field in the same record, setting it to two different values. There
is little doubt that this is a conflict.
@ -1179,7 +1187,7 @@ are made on two different leaders.
There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a
good understanding of this problem. We will see some more examples of conflicts in
[Chapter 8](/en/ch8#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and
[Chapter 8](/en/ch8#ch_transactions), and in [“Ordering events to capture causality”](/en/ch13#sec_future_capture_causality) we will discuss scalable approaches for detecting and
resolving conflicts in a replicated system.
@ -1220,7 +1228,7 @@ configuration, if you want to continue processing writes, you may need to perfor
[“Handling Node Outages”](/en/ch6#sec_replication_failover)).
On the other hand, in a leaderless configuration, failover does not exist.
[Figure 6-12](/en/ch6#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to
[Figure 6-12](/en/ch6#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to
all three replicas in parallel, and the two available replicas accept the write but the unavailable
replica misses it. Let’s say that it’s sufficient for two out of three replicas to
acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be
@ -1252,7 +1260,7 @@ mechanisms are used in Dynamo-style datastores:
Read repair
: When a client makes a read from several nodes in parallel, it can detect any stale responses.
For example, in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from
For example, in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from
replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale
value and writes the newer value back to that replica. This approach works well for values that are
frequently read.
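A minimal sketch of the read-repair decision, assuming each replica returns a `(version, value)` pair and that a single version number is enough to order the values (a simplification that ignores concurrent writes). The function name `read_repair` and the replica names are hypothetical.

```python
def read_repair(responses):
    """Return the newest value and the replicas that need it written back.

    `responses` maps a replica name to the (version, value) pair it returned.
    """
    newest_version, newest_value = max(responses.values(), key=lambda r: r[0])
    stale = sorted(
        name
        for name, (version, _value) in responses.items()
        if version < newest_version
    )
    return newest_value, stale


# Replica 3 missed the latest write and still holds version 6.
responses = {
    "replica1": (7, "new"),
    "replica2": (7, "new"),
    "replica3": (6, "old"),
}
value, stale = read_repair(responses)
```

The client would return `value` to the caller and send it to every replica listed in `stale`.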
@ -1272,7 +1280,7 @@ Anti-entropy
#### Quorums for reading and writing {#sec_replication_quorum_condition}
In the example of [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), we considered the write to be successful
In the example of [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), we considered the write to be successful
even though it was only processed on two out of three replicas. What if only one out of three
replicas accepted the write? How far can we push this?
@ -1283,14 +1291,14 @@ respond, reads can nevertheless continue returning an up-to-date value.
More generally, if there are *n* replicas, every write must be confirmed by *w* nodes to be
considered successful, and we must query at least *r* nodes for each read. (In our example,
*n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* > *n*,
*n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* > *n*,
we expect to get an up-to-date value when reading, because at least one of the *r* nodes we’re
reading from must be up to date. Reads and writes that obey these *r* and *w* values are called *quorum* reads and writes [^50].
You can think of *r* and *w* as the minimum number of votes required for the read or write to be valid.
In Dynamo-style databases, the parameters *n*, *w*, and *r* are typically configurable. A common
choice is to make *n* an odd number (typically 3 or 5) and to set *w* = *r* =
(*n* + 1) / 2 (rounded up). However, you can vary the numbers as you see fit.
(*n* + 1) / 2 (rounded up). However, you can vary the numbers as you see fit.
For example, a workload with few writes and many reads may benefit from setting *w* = *n* and
*r* = 1. This makes reads faster, but has the disadvantage that just one failed node causes all
database writes to fail.
@ -1300,19 +1308,19 @@ database writes to fail.
> [!NOTE]
> There may be more than *n* nodes in the cluster, but any given value is stored only on *n*
> nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit
> on one node. We will return to sharding in [Chapter 7](/en/ch7#ch_sharding).
> on one node. We will return to sharding in [Chapter 7](/en/ch7#ch_sharding).
--------
The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes
The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes
as follows:
* If *w* < *n*, we can still process writes if a node is unavailable.
* If *r* < *n*, we can still process reads if a node is unavailable.
* With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable
node, like in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage).
* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes.
This case is illustrated in [Figure 6-13](/en/ch6#fig_replication_quorum_overlap).
* If *w* < *n*, we can still process writes if a node is unavailable.
* If *r* < *n*, we can still process reads if a node is unavailable.
* With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable
node, like in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage).
* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes.
This case is illustrated in [Figure 6-13](/en/ch6#fig_replication_quorum_overlap).
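The arithmetic behind these bullet points is small enough to check directly. The helper names below are made up for illustration and are not part of any database's configuration API.

```python
def quorum_overlaps(n: int, w: int, r: int) -> bool:
    """True if every read set must share at least one node with every write set."""
    return w + r > n


def tolerated_unavailable(n: int, w: int, r: int) -> int:
    """Number of nodes that can be down while reads and writes both still succeed."""
    return min(n - w, n - r)


small = tolerated_unavailable(n=3, w=2, r=2)
large = tolerated_unavailable(n=5, w=3, r=3)
read_heavy = tolerated_unavailable(n=3, w=3, r=1)
```

The last case is the read-optimized setting *w* = *n*, *r* = 1: the quorum condition still holds, but a single unavailable node already blocks writes.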
Normally, reads and writes are always sent to all *n* replicas in parallel. The parameters *w* and *r*
determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success
@ -1329,19 +1337,19 @@ returned a successful response and dont need to distinguish between different
### Limitations of Quorum Consistency {#sec_replication_quorum_limitations}
If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*, you can
If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*, you can
generally expect every read to return the most recent value written for a key. This is the case because the
set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That
is, among the nodes you read there must be at least one node with the latest value (illustrated in
[Figure 6-13](/en/ch6#fig_replication_quorum_overlap)).
[Figure 6-13](/en/ch6#fig_replication_quorum_overlap)).
Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, because that ensures
*w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are
*w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are
not necessarily majorities—it only matters that the sets of nodes used by the read and write
operations overlap in at least one node. Other quorum assignments are possible, which allows some
flexibility in the design of distributed algorithms [^51].
You may also set *w* and *r* to smaller numbers, so that *w* + *r* ≤ *n* (i.e.,
You may also set *w* and *r* to smaller numbers, so that *w* + *r* ≤ *n* (i.e.,
the quorum condition is not satisfied). In this case, reads and writes will still be sent to *n*
nodes, but a smaller number of successful responses is required for the operation to succeed.
@ -1352,14 +1360,14 @@ unreachable, theres a higher chance that you can continue processing reads an
the number of reachable replicas falls below *w* or *r* does the database become unavailable for
writing or reading, respectively.
However, even with *w* + *r* > *n*, there are edge cases in which the consistency
However, even with *w* + *r* > *n*, there are edge cases in which the consistency
properties can be confusing. Some scenarios include:
* If a node carrying a new value fails, and its data is restored from a replica carrying an old
value, the number of replicas storing the new value may fall below *w*, breaking the quorum
condition.
* While a rebalancing is in progress, where some data is moved from one node to another (see
[Chapter 7](/en/ch7#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n*
[Chapter 7](/en/ch7#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n*
replicas for a particular value. This can result in the read and write quorums no longer
overlapping.
* If a read is concurrent with a write operation, the read may or may not see the concurrently
@ -1489,7 +1497,7 @@ resulting in conflicts that need to be resolved. Such conflicts may occur as the
not always: they could also be detected later during read repair, hinted handoff, or anti-entropy.
The problem is that events may arrive in a different order at different nodes, due to variable
network delays and partial failures. For example, [Figure 6-14](/en/ch6#fig_replication_concurrency) shows two clients,
network delays and partial failures. For example, [Figure 6-14](/en/ch6#fig_replication_concurrency) shows two clients,
A and B, simultaneously writing to a key *X* in a three-node datastore:
* Node 1 receives the write from A, but never receives the write from B due to a transient outage.
@ -1501,7 +1509,7 @@ A and B, simultaneously writing to a key *X* in a three-node datastore:
If each node simply overwrote the value for a key whenever it received a write request from a
client, the nodes would become permanently inconsistent, as shown by the final *get* request in
[Figure 6-14](/en/ch6#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other
[Figure 6-14](/en/ch6#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other
nodes think that the value is A.
In order to become eventually consistent, the replicas should converge toward the same value. For
@ -1520,11 +1528,11 @@ take more care to detect concurrent writes.
How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look
at some examples:
* In [Figure 6-8](/en/ch6#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before*
* In [Figure 6-8](/en/ch6#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before*
B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s
operation builds upon A’s operation, so B’s operation must have happened later.
We also say that B is *causally dependent* on A.
* On the other hand, the two writes in [Figure 6-14](/en/ch6#fig_replication_concurrency) are concurrent: when each
* On the other hand, the two writes in [Figure 6-14](/en/ch6#fig_replication_concurrency) are concurrent: when each
client starts the operation, it does not know that another client is also performing an operation
on the same key. Thus, there is no causal dependency between the operations.
@ -1546,7 +1554,7 @@ conflict that needs to be resolved.
It may seem that two operations should be called concurrent if they occur “at the same time”—but
in fact, it is not important whether they literally overlap in time. Because of problems with clocks
in distributed systems, it is actually quite difficult to tell whether two things happened
at exactly the same time—an issue we will discuss in more detail in [Chapter 9](/en/ch9#ch_distributed).
at exactly the same time—an issue we will discuss in more detail in [Chapter 9](/en/ch9#ch_distributed).
For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if
they are both unaware of each other, regardless of the physical time at which they occurred. People
@ -1570,7 +1578,7 @@ happened before another. To keep things simple, lets start with a database th
replica. Once we have worked out how to do this on a single replica, we can generalize the approach
to a leaderless database with multiple replicas.
[Figure 6-15](/en/ch6#fig_replication_causality_single) shows two clients concurrently adding items to the same
[Figure 6-15](/en/ch6#fig_replication_causality_single) shows two clients concurrently adding items to the same
shopping cart. (If that example strikes you as too inane, imagine instead two air traffic
controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is
empty. Between them, the clients make five writes to the database:
@ -1604,8 +1612,8 @@ empty. Between them, the clients make five writes to the database:
{{< figure src="/fig/ddia_0615.png" id="fig_replication_causality_single" caption="Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart." class="w-full my-4" >}}
The dataflow between the operations in [Figure 6-15](/en/ch6#fig_replication_causality_single) is illustrated
graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The arrows indicate which operation
The dataflow between the operations in [Figure 6-15](/en/ch6#fig_replication_causality_single) is illustrated
graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The arrows indicate which operation
*happened before* which other operation, in the sense that the later operation *knew about* or
*depended on* the earlier one. In this example, the clients are never fully up to date with the data
on the server, since there is always another operation going on concurrently. But old versions of
@ -1638,10 +1646,10 @@ on subsequent reads.
#### Version vectors {#version-vectors}
The example in [Figure 6-15](/en/ch6#fig_replication_causality_single) used only a single replica. How does the
The example in [Figure 6-15](/en/ch6#fig_replication_causality_single) used only a single replica. How does the
algorithm change when there are multiple replicas, but no leader?
[Figure 6-15](/en/ch6#fig_replication_causality_single) uses a single version number to capture dependencies between
[Figure 6-15](/en/ch6#fig_replication_causality_single) uses a single version number to capture dependencies between
operations, but that is not sufficient when there are multiple replicas accepting writes
concurrently. Instead, we need to use a version number *per replica* as well as per key. Each
replica increments its own version number when processing a write, and also keeps track of the
@ -1653,7 +1661,7 @@ A few variants of this idea are in use, but the most interesting is probably the
which is used in Riak 2.0 [^61] [^62].
We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.
Like the version numbers in [Figure 6-15](/en/ch6#fig_replication_causality_single), version vectors are sent from the
Like the version numbers in [Figure 6-15](/en/ch6#fig_replication_causality_single), version vectors are sent from the
database replicas to clients when values are read, and need to be sent back to the database when a
value is subsequently written. (Riak encodes the version vector as a string that it calls *causal
context*.) The version vector allows the database to distinguish between overwrites and concurrent
@ -1818,4 +1826,4 @@ machine to store only a subset of the data.
[^61]: Sean Cribbs. [A Brief History of Time in Riak](https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak). At *RICON*, October 2014. Archived at [perma.cc/7U9P-6JFX](https://perma.cc/7U9P-6JFX)
[^62]: Russell Brown. [Vector Clocks Revisited Part 2: Dotted Version Vectors](https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/). *riak.com*, November 2015. Archived at [perma.cc/96QP-W98R](https://perma.cc/96QP-W98R)
[^63]: Carlos Baquero. [Version Vectors Are Not Vector Clocks](https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/). *haslab.wordpress.com*, July 2011. Archived at [perma.cc/7PNU-4AMG](https://perma.cc/7PNU-4AMG)
[^64]: Reinhard Schwarz and Friedemann Mattern. [Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). *Distributed Computing*, volume 7, issue 3, pages 149–174, March 1994. [doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859)
[^64]: Reinhard Schwarz and Friedemann Mattern. [Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). *Distributed Computing*, volume 7, issue 3, pages 149–174, March 1994. [doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859)


@ -4,6 +4,8 @@ weight: 207
breadcrumbs: false
---
<a id="ch_sharding"></a>
![](/map/ch06.png)
> *Clearly, we must break away from the sequential and not limit the computers. We must state
@ -14,7 +16,7 @@ breadcrumbs: false
A distributed database typically distributes data across nodes in two ways:
1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in [Chapter 6](/en/ch6#ch_replication).
1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in [Chapter 6](/en/ch6#ch_replication).
2. If we don’t want every node to store all the data, we can split up a large amount of data into
smaller *shards* or *partitions*, and store different shards on different nodes. We’ll discuss
sharding in this chapter.
@ -29,13 +31,13 @@ nodes. This means that, even though each record belongs to exactly one shard, it
on several different nodes for fault tolerance.
A node may store more than one shard. If a single-leader replication model is used, the combination
of sharding and replication can look like [Figure 7-1](/en/ch7#fig_sharding_replicas), for example. Each shard’s
of sharding and replication can look like [Figure 7-1](/en/ch7#fig_sharding_replicas), for example. Each shard’s
leader is assigned to one node, and its followers are assigned to other nodes. Each node may be the
leader for some shards and a follower for other shards, but each shard still only has one leader.
{{< figure src="/fig/ddia_0701.png" id="fig_sharding_replicas" caption="Figure 7-1. Combining replication and sharding: each node acts as leader for some shards and follower for other shards." class="w-full my-4" >}}
Everything we discussed in [Chapter 6](/en/ch6#ch_replication) about replication of databases applies equally to
Everything we discussed in [Chapter 6](/en/ch6#ch_replication) about replication of databases applies equally to
replication of shards. Since the choice of sharding scheme is mostly independent of the choice of
replication scheme, we will ignore replication in this chapter for the sake of simplicity.
@ -62,7 +64,7 @@ to databases. Another theory is that *shard* was originally an acronym of *Syste
Available Replicated Data*—reportedly a 1980s database, details of which are lost to history.
By the way, partitioning has nothing to do with *network partitions* (netsplits), a type of fault in
the network between nodes. We will discuss such faults in [Chapter 9](/en/ch9#ch_distributed).
the network between nodes. We will discuss such faults in [Chapter 9](/en/ch9#ch_distributed).
--------
@ -71,7 +73,7 @@ the network between nodes. We will discuss such faults in [Chapter 9](/en/ch9#c
The primary reason for sharding a database is *scalability*: it’s a solution if the volume of data
or the write throughput has become too great for a single node to handle, as it allows you to spread
that data and those writes across multiple nodes. (If read throughput is the problem, you don’t
necessarily need sharding—you can use *read scaling* as discussed in [Chapter 6](/en/ch6#ch_replication).)
necessarily need sharding—you can use *read scaling* as discussed in [Chapter 6](/en/ch6#ch_replication).)
In fact, sharding is one of the main tools we have for achieving *horizontal scaling* (a *scale-out*
architecture), as discussed in [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](/en/ch2#sec_introduction_shared_nothing): that is, allowing a system to
@ -98,9 +100,9 @@ may be distributed across different shards. We will discuss this further in
[“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes).
Another problem with sharding is that a write may need to update related records in several
different shards. While transactions on a single node are quite common (see [Chapter 8](/en/ch8#ch_transactions)),
different shards. While transactions on a single node are quite common (see [Chapter 8](/en/ch8#ch_transactions)),
ensuring consistency across multiple shards requires a *distributed transaction*. As we shall see in
[Chapter 8](/en/ch8#ch_transactions), distributed transactions are available in some databases, but they are usually
[Chapter 8](/en/ch8#ch_transactions), distributed transactions are available in some databases, but they are usually
much slower than single-node transactions, may become a bottleneck for the system as a whole, and
some systems don’t support them at all.
@ -201,7 +203,7 @@ hot spots.
One way of sharding is to assign a contiguous range of partition keys (from some minimum to some
maximum) to each shard, like the volumes of a paper encyclopedia, as illustrated in
[Figure 7-2](/en/ch7#fig_sharding_encyclopedia). In this example, an entry’s partition key is its title. If you want
[Figure 7-2](/en/ch7#fig_sharding_encyclopedia). In this example, an entry’s partition key is its title. If you want
to look up the entry for a particular title, you can easily determine which shard contains that
entry by finding the volume whose key range contains the title you’re looking for, and thus pick the
correct book off the shelf.
@ -209,7 +211,7 @@ correct book off the shelf.
{{< figure src="/fig/ddia_0702.png" id="fig_sharding_encyclopedia" caption="Figure 7-2. A print encyclopedia is sharded by key range." class="w-full my-4" >}}
The ranges of keys are not necessarily evenly spaced, because your data may not be evenly
distributed. For example, in [Figure 7-2](/en/ch7#fig_sharding_encyclopedia), volume 1 contains words starting with A
distributed. For example, in [Figure 7-2](/en/ch7#fig_sharding_encyclopedia), volume 1 contains words starting with A
and B, but volume 12 contains words starting with T, U, V, W, X, Y, and Z. Simply having one volume
per two letters of the alphabet would lead to some volumes being much bigger than others. In order
to distribute the data evenly, the shard boundaries need to adapt to the data.
@ -221,7 +223,7 @@ range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB
tablet splitting.
Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in
[Chapter 4](/en/ch4#ch_storage)). This has the advantage that range scans are easy, and you can treat the key as a
[Chapter 4](/en/ch4#ch_storage)). This has the advantage that range scans are easy, and you can treat the key as a
concatenated index in order to fetch several related records in one query (see
[“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional)). For example, consider an application that stores data from a
network of sensors, where the key is the timestamp of the measurement. Range scans are very useful
@ -256,7 +258,7 @@ This process is similar to what happens at the top level of a B-tree (see [“B-
With databases that manage shard boundaries automatically, a shard split is typically triggered by:
* the shard reaching a configured size (for example, on HBase, the default is 10 GB), or
* the shard reaching a configured size (for example, on HBase, the default is 10 GB), or
* in some systems, the write throughput being persistently above some threshold. Thus, a hot shard
may be split even if it is not storing a lot of data, so that its write load can be distributed more uniformly.
@ -278,7 +280,7 @@ application), a common approach is to first hash the partition key before mappin
A good hash function takes skewed data and makes it uniformly distributed. Say you have a 32-bit
hash function that takes a string. Whenever you give it a new string, it returns a seemingly random
number between 0 and 2<sup>32</sup> − 1. Even if the input strings are very similar, their hashes are evenly
number between 0 and 2<sup>32</sup> − 1. Even if the input strings are very similar, their hashes are evenly
distributed across that range of numbers (but the same input always produces the same output).
For sharding purposes, the hash function need not be cryptographically strong: for example, MongoDB
@ -291,12 +293,12 @@ different hash value in different processes, making them unsuitable for sharding
Once you have hashed the key, how do you choose which shard to store it in? Maybe your first thought
is to take the hash value *modulo* the number of nodes in the system (using the `%` operator in many
programming languages). For example, *hash*(*key*) % 10 would return a number between
0 and 9 (if we write the hash as a decimal number, the hash % 10 would be the last digit).
programming languages). For example, *hash*(*key*) % 10 would return a number between
0 and 9 (if we write the hash as a decimal number, the hash % 10 would be the last digit).
If we have 10 nodes, numbered 0 to 9, that seems like an easy way of assigning each key to a node.
The problem with the *mod N* approach is that if the number of nodes *N* changes, most of the keys
have to be moved from one node to another. [Figure 7-3](/en/ch7#fig_sharding_hash_mod_n) shows what happens when you
have to be moved from one node to another. [Figure 7-3](/en/ch7#fig_sharding_hash_mod_n) shows what happens when you
have three nodes and add a fourth. Before the rebalancing, node 0 stored the keys whose hashes are
0, 3, 6, 9, and so on. After adding the fourth node, the key with hash 3 has moved to node 3, the
key with hash 6 has moved to node 2, the key with hash 9 has moved to node 1, and so on.
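The effect of changing *N* can be checked with a few lines of Python. This is only a sketch: small integers stand in for real hash values, and the helper name `node_for` is made up for this illustration.

```python
def node_for(key_hash: int, num_nodes: int) -> int:
    """Assign a key to a node by taking its hash modulo the number of nodes."""
    return key_hash % num_nodes


# Small integers stand in for hash values (an illustrative simplification).
hashes = range(12)
before = {h: node_for(h, 3) for h in hashes}  # three nodes
after = {h: node_for(h, 4) for h in hashes}   # a fourth node is added

moved = [h for h in hashes if before[h] != after[h]]
print(f"{len(moved)} of {len(hashes)} keys changed node: {moved}")
```

Of these twelve hash values, only 0, 1, and 2 stay on the same node; every other key has to be copied somewhere else, even though only one node was added.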
@ -312,12 +314,12 @@ doesn’t move data around more than necessary.
One simple but widely-used solution is to create many more shards than there are nodes, and to
assign several shards to each node. For example, a database running on a cluster of 10 nodes may be
split into 1,000 shards from the outset so that 100 shards are assigned to each node. A key is then
stored in shard number *hash*(*key*) % 1,000, and the system separately keeps track of
stored in shard number *hash*(*key*) % 1,000, and the system separately keeps track of
which shard is stored on which node.
Now, if a node is added to the cluster, the system can reassign some of the shards from existing
nodes to the new node until they are fairly distributed once again. This process is illustrated in
[Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in reverse.
[Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in reverse.
{{< figure src="/fig/ddia_0704.png" id="fig_sharding_rebalance_fixed" caption="Figure 7-4. Adding a new node to a database cluster with multiple shards per node." class="w-full my-4" >}}
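The following sketch shows one way the rebalancing step could work, assuming the 10-node, 1,000-shard example from the text. The function names and the greedy choice of which shards to move are assumptions made for illustration; real systems use their own placement policies.

```python
from collections import Counter

NUM_SHARDS = 1_000  # fixed when the database is first set up


def shard_for(key_hash: int) -> int:
    """The key-to-shard mapping never changes, whatever the number of nodes."""
    return key_hash % NUM_SHARDS


def add_node(assignment: dict[int, int], new_node: int) -> int:
    """Reassign whole shards to new_node until it holds a fair share.

    Returns the number of shards that were moved.
    """
    counts = Counter(assignment.values())
    fair_share = NUM_SHARDS // (len(counts) + 1)
    moved = 0
    for shard in sorted(assignment):
        if moved == fair_share:
            break
        donor = assignment[shard]
        if counts[donor] > fair_share:  # only take from nodes above their share
            assignment[shard] = new_node
            counts[donor] -= 1
            moved += 1
    return moved


# 10 nodes with 100 shards each, as in the example in the text.
assignment = {shard: shard % 10 for shard in range(NUM_SHARDS)}
moved = add_node(assignment, new_node=10)
print(moved, Counter(assignment.values()))
```

Only the shard-to-node assignment changes. `shard_for` is untouched, so no key ever moves from one shard to another, and only whole shards travel over the network.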
@ -360,8 +362,8 @@ has this property, but it has a risk of hot spots when there are a lot of writes
solution is to combine key-range sharding with a hash function so that each shard contains a range
of *hash values* rather than a range of *keys*.
[Figure 7-5](/en/ch7#fig_sharding_hash_range) shows an example using a 16-bit hash function that returns a number
between 0 and 65,535 = 2<sup>16</sup> − 1 (in reality, the hash is usually 32 bits or more).
[Figure 7-5](/en/ch7#fig_sharding_hash_range) shows an example using a 16-bit hash function that returns a number
between 0 and 65,535 = 2<sup>16</sup> − 1 (in reality, the hash is usually 32 bits or more).
Even if the input keys are very similar (e.g., consecutive timestamps), their hashes are uniformly
distributed across that range. We can then assign a range of hash values to each shard: for example,
values between 0 and 16,383 to shard 0, values between 16,384 and 32,767 to shard 1, and so on.
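A lookup over such hash ranges can be sketched with a sorted list of boundaries. The split into four equal ranges and the use of the low 16 bits of CRC-32 as the hash function are assumptions chosen for illustration, not what any particular database does.

```python
import bisect
import zlib

# Inclusive upper bound of each shard's range in a 16-bit hash space.
# Four equal ranges are assumed here purely for illustration.
UPPER_BOUNDS = [16_383, 32_767, 49_151, 65_535]


def hash16(key: str) -> int:
    """Stand-in 16-bit hash: the low 16 bits of CRC-32 (not cryptographically strong)."""
    return zlib.crc32(key.encode("utf-8")) & 0xFFFF


def shard_for_hash(key_hash: int) -> int:
    """Binary-search for the first range whose upper bound is >= key_hash."""
    return bisect.bisect_left(UPPER_BOUNDS, key_hash)


for key in ("2025-01-01T00:00:00", "2025-01-01T00:00:01"):
    print(key, "-> shard", shard_for_hash(hash16(key)))
```

Splitting a shard then amounts to inserting one more boundary into the sorted list; no other range is affected.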
@ -394,8 +396,8 @@ improve compression and filtering performance as well.
Hash-range sharding is used in YugabyteDB and DynamoDB [^17], and is an option in MongoDB.
Cassandra and ScyllaDB use a variant of this approach that is illustrated in
[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between
those ranges. This means some ranges are bigger than others, but by having multiple ranges per node
those imbalances tend to even out [^15] [^18].
@ -404,7 +406,7 @@ those imbalances tend to even out [^15] [^18].
When nodes are added or removed, range boundaries are added and removed, and shards are split or
merged accordingly [^19].
In the example of [Figure 7-6](/en/ch7#fig_sharding_cassandra), when node 3 is added, node 1
In the example of [Figure 7-6](/en/ch7#fig_sharding_cassandra), when node 3 is added, node 1
transfers parts of two of its ranges to node 3, and node 2 transfers part of one of its ranges to
node 3. This has the effect of giving the new node an approximately fair share of the dataset,
without transferring more data than necessary from one node to another.
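The property that data only moves to the new node can be illustrated with a simplified ring of random range boundaries. This is a sketch, not Cassandra's or ScyllaDB's actual algorithm: the 16-bit hash space, the fixed random seed, and the helper names `build_ring` and `owner` are all assumptions made for the example.

```python
import bisect
import random

HASH_SPACE = 2**16  # small hash space, to keep the sketch easy to inspect


def build_ring(tokens_by_node: dict[int, list[int]]) -> list[tuple[int, int]]:
    """Return sorted (token, node) pairs; each token is the upper bound of a range."""
    return sorted(
        (token, node) for node, tokens in tokens_by_node.items() for token in tokens
    )


def owner(ring: list[tuple[int, int]], key_hash: int) -> int:
    """Find the node owning key_hash: the first token >= key_hash, wrapping around."""
    tokens = [token for token, _ in ring]
    return ring[bisect.bisect_left(tokens, key_hash) % len(ring)][1]


rng = random.Random(7)  # fixed seed so that the run is repeatable
tokens = rng.sample(range(HASH_SPACE), 12)  # 12 distinct random boundaries
three_nodes = {0: tokens[0:3], 1: tokens[3:6], 2: tokens[6:9]}
four_nodes = {**three_nodes, 3: tokens[9:12]}  # node 3 joins with 3 ranges

ring_before = build_ring(three_nodes)
ring_after = build_ring(four_nodes)

sample = range(0, HASH_SPACE, 7)
moved = [h for h in sample if owner(ring_before, h) != owner(ring_after, h)]
print(f"{len(moved)} of {len(sample)} sampled hashes moved")
```

Adding boundaries can only split existing ranges, so every hash that changes owner ends up on the new node; nothing is shuffled between the existing nodes.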
@ -417,8 +419,8 @@ in a way that satisfies two properties:
1. the number of keys mapped to each shard is roughly equal, and
2. when the number of shards changes, as few keys as possible are moved from one shard to another.
Note that *consistent* here has nothing to do with replica consistency (see [Chapter 6](/en/ch6#ch_replication)) or
ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describes the tendency of a key to stay in
Note that *consistent* here has nothing to do with replica consistency (see [Chapter 6](/en/ch6#ch_replication)) or
ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describes the tendency of a key to stay in
the same shard as much as possible.
The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of consistent hashing [^20],
@ -516,7 +518,7 @@ only be handled by a node that is a replica for the shard containing that key.
This means that request routing has to be aware of the assignment from keys to shards, and from
shards to nodes. On a high level, there are a few different approaches to this problem
(illustrated in [Figure 7-7](/en/ch7#fig_sharding_routing)):
(illustrated in [Figure 7-7](/en/ch7#fig_sharding_routing)):
1. Allow clients to contact any node (e.g., via a round-robin load balancer). If that node
coincidentally owns the shard to which the request applies, it can handle the request directly;
@ -544,8 +546,8 @@ In all cases, there are some key problems:
those?
Many distributed data systems rely on a separate coordination service such as ZooKeeper or etcd to
keep track of shard assignments, as illustrated in [Figure 7-8](/en/ch7#fig_sharding_zookeeper). They use consensus
algorithms (see [Chapter 10](/en/ch10#ch_consistency)) to provide fault tolerance and protection against split-brain.
keep track of shard assignments, as illustrated in [Figure 7-8](/en/ch7#fig_sharding_zookeeper). They use consensus
algorithms (see [Chapter 10](/en/ch10#ch_consistency)) to provide fault tolerance and protection against split-brain.
Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of shards
to nodes. Other actors, such as the routing tier or the sharding-aware client, can subscribe to this
information in ZooKeeper. Whenever a shard changes ownership, or a node is added or removed,
@ -573,7 +575,7 @@ This discussion of request routing has focused on finding the shard for an indiv
most relevant for sharded OLTP databases. Analytic databases often use sharding as well, but they
typically have a very different kind of query execution: rather than executing in a single shard, a
query typically needs to aggregate and join data from many different shards in parallel. We will
discuss techniques for such parallel query execution in [Link to Come].
discuss techniques for such parallel query execution in [“JOIN and GROUP BY”](/en/ch11#sec_batch_join).
## Sharding and Secondary Indexes {#sec_sharding_secondary_indexes}
@ -597,7 +599,7 @@ local and global indexes.
### Local Secondary Indexes {#id166}
For example, imagine you are operating a website for selling used cars (illustrated in
[Figure 7-9](/en/ch7#fig_sharding_local_secondary)). Each listing has a unique ID, and you use that ID as partition
[Figure 7-9](/en/ch7#fig_sharding_local_secondary)). Each listing has a unique ID, and you use that ID as partition
key for sharding (for example, IDs 0 to 499 in shard 0, IDs 500 to 999 in shard 1, etc.).
If you want to let users search for cars, allowing them to filter by color and by make, you need a
@ -605,7 +607,7 @@ secondary index on `color` and `make` (in a document database these would be fie
database they would be columns). If you have declared the index, the database can perform the
indexing automatically. For example, whenever a red car is added to the database, the database shard
automatically adds its ID to the list of IDs for the index entry `color:red`. As discussed in
[Chapter 4](/en/ch4#ch_storage), that list of IDs is also called a *postings list*.
[Chapter 4](/en/ch4#ch_storage), that list of IDs is also called a *postings list*.
{{< figure src="/fig/ddia_0709.png" id="fig_sharding_local_secondary" caption="Figure 7-9. Local secondary indexes: each shard indexes only the records within its own shard." class="w-full my-4" >}}
@ -632,7 +634,7 @@ want *some* results, and you don’t need all, you can send the request to any s
However, if you want all the results and don’t know their partition key in advance, you need to send
the query to all shards, and combine the results you get back, because the matching records might be
scattered across all the shards. In [Figure 7-9](/en/ch7#fig_sharding_local_secondary), red cars appear in both shard
scattered across all the shards. In [Figure 7-9](/en/ch7#fig_sharding_local_secondary), red cars appear in both shard
0 and shard 1.
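A minimal in-memory sketch of this scatter/gather pattern follows. The `Shard` class, the `search` function, and the sample listings are invented for illustration; only the ID ranges (0 to 499 and 500 to 999) come from the example in the text.

```python
from collections import defaultdict


class Shard:
    """One shard holding its own records and a local index over them only."""

    def __init__(self) -> None:
        self.records: dict[int, dict[str, str]] = {}
        # term such as "color:red" -> postings list of IDs within this shard
        self.index: defaultdict[str, list[int]] = defaultdict(list)

    def put(self, car_id: int, car: dict[str, str]) -> None:
        self.records[car_id] = car
        for field in ("color", "make"):
            self.index[f"{field}:{car[field]}"].append(car_id)


shards = [Shard(), Shard()]  # IDs 0 to 499 in shard 0, IDs 500 to 999 in shard 1


def put(car_id: int, car: dict[str, str]) -> None:
    """A write touches exactly one shard: the one that owns the ID."""
    shards[car_id // 500].put(car_id, car)


def search(term: str) -> list[int]:
    """Scatter the query to every shard, then gather and merge the postings."""
    results: list[int] = []
    for shard in shards:
        results.extend(shard.index.get(term, []))
    return sorted(results)


# Illustrative data, not taken from a real dataset.
put(191, {"color": "red", "make": "Honda"})
put(306, {"color": "red", "make": "Ford"})
put(515, {"color": "silver", "make": "Ford"})
put(768, {"color": "red", "make": "Volvo"})
print(search("color:red"))
```

Writes stay cheap because each one updates a single shard, while `search` has to visit every shard, which is the cost described above.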
This approach to querying a sharded database can make read queries on secondary indexes quite
@ -651,7 +653,7 @@ covers data in all shards. However, we can’t just store that index on one node
likely become a bottleneck and defeat the purpose of sharding. A global index must also be sharded,
but it can be sharded differently from the primary key index.
[Figure 7-10](/en/ch7#fig_sharding_global_secondary) illustrates what this could look like: the IDs of red cars from
[Figure 7-10](/en/ch7#fig_sharding_global_secondary) illustrates what this could look like: the IDs of red cars from
all shards appear under `color:red` in the index, but the index is sharded so that colors starting
with the letters *a* to *r* appear in shard 0 and colors starting with *s* to *z* appear in shard 1.
The index on the make of car is partitioned similarly (with the shard boundary being between *f* and *h*).
@ -664,7 +666,7 @@ you can search for. Here we generalise it to mean any value that you can search
The global index uses the term as partition key, so that when you’re looking for a particular term
or value, you can figure out which shard you need to query. As before, a shard can contain a
contiguous range of terms (as in [Figure 7-10](/en/ch7#fig_sharding_global_secondary)), or you can assign terms to
contiguous range of terms (as in [Figure 7-10](/en/ch7#fig_sharding_global_secondary)), or you can assign terms to
shards based on a hash of the term.
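The term-based routing from the figure can be sketched as follows. The function names are hypothetical, and because the text only says the boundary for makes lies between *f* and *h*, the placement of makes starting with *g* is a guess.

```python
def index_shard_for(term: str) -> int:
    """Route an index term such as 'color:red' to the index shard that stores it."""
    field, value = term.split(":", 1)
    first_letter = value[0].lower()
    if field == "color":
        return 0 if first_letter <= "r" else 1  # a to r in shard 0, s to z in shard 1
    if field == "make":
        return 0 if first_letter <= "f" else 1  # where "g" falls is an assumption
    raise ValueError(f"no global index on field {field!r}")


def index_shards_for_write(car: dict[str, str]) -> set[int]:
    """Index shards that must be updated when one record is written."""
    return {index_shard_for(f"{field}:{value}") for field, value in car.items()}


print(index_shard_for("color:red"))
print(index_shards_for_write({"color": "silver", "make": "Audi"}))
```

A lookup for a single term goes to exactly one index shard, but writing a single silver Audi touches both index shards, in addition to the shard holding the record itself.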
Global indexes have the advantage that a query with a single condition (such as *color = red*) only
@ -682,7 +684,7 @@ Another challenge with global secondary indexes is that writes are more complica
indexes, because writing a single record might affect multiple shards of the index (every term in
the document might be on a different shard). This makes it harder to keep the secondary index in
sync with the underlying data. One option is to use a distributed transaction to atomically update
the shards storing the primary record and its secondary indexes (see [Chapter 8](/en/ch8#ch_transactions)).
the shards storing the primary record and its secondary indexes (see [Chapter 8](/en/ch8#ch_transactions)).
Global secondary indexes are used by CockroachDB, TiDB, and YugabyteDB; DynamoDB supports both local
and global secondary indexes. In the case of DynamoDB, writes are asynchronously reflected in global
@ -781,4 +783,4 @@ that question in the following chapters.
[^31]: Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. [Earlybird: Real-Time Search at Twitter](https://cs.uwaterloo.ca/~jimmylin/publications/Busch_etal_ICDE2012.pdf). At *28th IEEE International Conference on Data Engineering* (ICDE), April 2012. [doi:10.1109/ICDE.2012.149](https://doi.org/10.1109/ICDE.2012.149)
[^32]: Nadav Har’El. [Indexing in Cassandra 3](https://github.com/scylladb/scylladb/wiki/Indexing-in-Cassandra-3). *github.com*, April 2017. Archived at [perma.cc/3ENV-8T9P](https://perma.cc/3ENV-8T9P)
[^33]: Zachary Tong. [Customizing Your Document Routing](https://www.elastic.co/blog/customizing-your-document-routing/). *elastic.co*, June 2013. Archived at [perma.cc/97VM-MREN](https://perma.cc/97VM-MREN)
[^34]: Andrew Pavlo. [H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/). *hstore.cs.brown.edu*, October 2013. Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z)
[^34]: Andrew Pavlo. [H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/). *hstore.cs.brown.edu*, October 2013. Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z)

View file

@ -4,6 +4,8 @@ weight: 208
breadcrumbs: false
---
<a id="ch_transactions"></a>
![](/map/ch07.png)
> *Some authors have claimed that general two-phase commit is too expensive to support, because of the
@ -75,8 +77,8 @@ similar to that of System R.
In the late 2000s, nonrelational (NoSQL) databases started gaining popularity. They aimed to
improve upon the relational status quo by offering a choice of new data models (see
[Chapter 3](/en/ch3#ch_datamodels)), and by including replication ([Chapter 6](/en/ch6#ch_replication)) and sharding
([Chapter 7](/en/ch7#ch_sharding)) by default. Transactions were the main casualty of this movement: many of this
[Chapter 3](/en/ch3#ch_datamodels)), and by including replication ([Chapter 6](/en/ch6#ch_replication)) and sharding
([Chapter 7](/en/ch7#ch_sharding)) by default. Transactions were the main casualty of this movement: many of this
generation of databases abandoned transactions entirely, or redefined the word to describe a
much weaker set of guarantees than had previously been understood.
@ -85,7 +87,7 @@ fundamentally unscalable, and that any large-scale system would have to abandon
order to maintain good performance and high availability. More recently, that belief has turned out
to be wrong. So-called “NewSQL” databases such as CockroachDB [^5], TiDB [^6], Spanner [^7], FoundationDB [^8],
and Yugabyte have shown that transactional systems can scale to large data volumes and high
throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide
throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide
strong ACID guarantees at scale.
However, that doesn’t mean that every system must be transactional either: like every other
@ -146,7 +148,7 @@ the defining feature of ACID atomicity. Perhaps *abortability* would have been a
The word *consistency* is terribly overloaded:
* In [Chapter 6](/en/ch6#ch_replication) we discussed *replica consistency* and the issue of *eventual consistency*
* In [Chapter 6](/en/ch6#ch_replication) we discussed *replica consistency* and the issue of *eventual consistency*
that arises in asynchronously replicated systems (see [“Problems with Replication Lag”](/en/ch6#sec_replication_lag)).
* A *consistent snapshot* of a database, e.g. for a backup, is a snapshot of the entire database as
it existed at one moment in time. More precisely, it is consistent with the happens-before
@ -155,7 +157,7 @@ The word *consistency* is terribly overloaded:
value was written.
* *Consistent hashing* is an approach to sharding that some systems use for rebalancing (see
[“Consistent hashing”](/en/ch7#sec_sharding_consistent_hashing)).
* In the CAP theorem (see [Chapter 10](/en/ch10#ch_consistency)), the word *consistency* is used to mean
* In the CAP theorem (see [Chapter 10](/en/ch10#ch_consistency)), the word *consistency* is used to mean
*linearizability* (see [“Linearizability”](/en/ch10#sec_consistency_linearizability)).
* In the context of ACID, *consistency* refers to an application-specific notion of the database
being in a “good state.”
@ -188,10 +190,10 @@ Most databases are accessed by several clients at the same time. That is no prob
reading and writing different parts of the database, but if they are accessing the same database
records, you can run into concurrency problems (race conditions).
[Figure 8-1](/en/ch8#fig_transactions_increment) is a simple example of this kind of problem. Say you have two clients
[Figure 8-1](/en/ch8#fig_transactions_increment) is a simple example of this kind of problem. Say you have two clients
simultaneously incrementing a counter that is stored in a database. Each client needs to read the
current value, add 1, and write the new value back (assuming there is no increment operation built
into the database). In [Figure 8-1](/en/ch8#fig_transactions_increment) the counter should have increased from 42 to
into the database). In [Figure 8-1](/en/ch8#fig_transactions_increment) the counter should have increased from 42 to
44, because two increments happened, but it actually only went to 43 because of the race condition.
{{< figure src="/fig/ddia_0801.png" id="fig_transactions_increment" caption="Figure 8-1. A race condition between two clients concurrently incrementing a counter." class="w-full my-4" >}}
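The interleaving in the figure can be replayed deterministically in a few lines. This is a sketch in plain Python, with a dictionary standing in for the database; `LockedCounter` is a hypothetical stand-in for a database that offers a built-in atomic increment.

```python
import threading


def racy_double_increment(db: dict) -> int:
    """Replay the interleaving of the figure: both clients read before either writes."""
    read_1 = db["counter"]       # client 1 reads 42
    read_2 = db["counter"]       # client 2 also reads 42
    db["counter"] = read_1 + 1   # client 1 writes 43
    db["counter"] = read_2 + 1   # client 2 overwrites it with 43
    return db["counter"]


class LockedCounter:
    """Makes read-modify-write a single step by holding a lock throughout."""

    def __init__(self, value: int) -> None:
        self.value = value
        self._lock = threading.Lock()

    def increment(self) -> int:
        with self._lock:
            self.value += 1
            return self.value


lost = racy_double_increment({"counter": 42})
counter = LockedCounter(42)
counter.increment()
counter.increment()
print(lost, counter.value)
```

The racy version ends at 43 because the second write is based on a stale read; the locked version reaches 44 because no other client can read between the read and the write.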
@ -234,6 +236,8 @@ database can do to save you.
--------
<a id="sidebar_transactions_durability"></a>
> [!TIP] REPLICATION AND DURABILITY
Historically, durability meant writing to an archive tape. Then it was understood as writing to a disk
@ -291,7 +295,7 @@ Isolation
These definitions assume that you want to modify several objects (rows, documents, records) at once.
Such *multi-object transactions* are often needed if several pieces of data need to be kept in sync.
[Figure 8-2](/en/ch8#fig_transactions_read_uncommitted) shows an example from an email application. To display the
[Figure 8-2](/en/ch8#fig_transactions_read_uncommitted) shows an example from an email application. To display the
number of unread messages for a user, you could query something like:
```
@ -307,14 +311,14 @@ number of unread messages in a separate field (a kind of denormalization, which
unread counter as well, and whenever a message is marked as read, you also have to decrement the
unread counter.
In [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), user 2 experiences an anomaly: the mailbox listing shows
In [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), user 2 experiences an anomaly: the mailbox listing shows
an unread message, but the counter shows zero unread messages because the counter increment has not
yet happened. (If an incorrect counter in an email application seems too insignificant, think of a
customer account balance instead of an unread counter, and a payment transaction instead of an
email.) Isolation would have prevented this issue by ensuring that user 2 sees either both the
inserted email and the updated counter, or neither, but not an inconsistent halfway point.
[Figure 8-3](/en/ch8#fig_transactions_atomicity) illustrates the need for atomicity: if an error occurs somewhere
[Figure 8-3](/en/ch8#fig_transactions_atomicity) illustrates the need for atomicity: if an error occurs somewhere
over the course of the transaction, the contents of the mailbox and the unread counter might become out
of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted
and the inserted email is rolled back.
@ -337,10 +341,10 @@ database in a partially updated state.
#### Single-object writes {#sec_transactions_single_object}
Atomicity and isolation also apply when a single object is being changed. For example, imagine you
are writing a 20 KB JSON document to a database:
are writing a 20 KB JSON document to a database:
* If the network connection is interrupted after the first 10 KB have been sent, does the
database store that unparseable 10 KB fragment of JSON?
* If the network connection is interrupted after the first 10 KB have been sent, does the
database store that unparseable 10 KB fragment of JSON?
* If the power fails while the database is in the middle of overwriting the previous value on disk,
do you end up with the old and new values spliced together?
* If another client reads that document while the write is in progress, will it see a partially
@ -353,7 +357,7 @@ isolation can be implemented using a lock on each object (allowing only one thre
object at any one time).
Some databases also provide more complex atomic operations, such as an increment operation, which
removes the need for a read-modify-write cycle like that in [Figure 8-1](/en/ch8#fig_transactions_increment).
removes the need for a read-modify-write cycle like that in [Figure 8-1](/en/ch8#fig_transactions_increment).
Similarly popular is a *conditional write* operation, which allows a write to happen only if the value
has not been concurrently changed by someone else (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)),
similarly to a compare-and-set or compare-and-swap (CAS) operation in shared-memory concurrency.
@ -391,7 +395,7 @@ However, in many other cases writes to several different objects need to be coor
document, which is treated as a single object—no multi-object transactions are needed when
updating a single document. However, document databases lacking join functionality also encourage
denormalization (see [“When to Use Which Model”](/en/ch3#sec_datamodels_document_summary)). When denormalized information needs to
be updated, like in the example of [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), you need to update
be updated, like in the example of [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), you need to update
several documents in one go. Transactions are very useful in this situation to prevent
denormalized data from going out of sync.
* In databases with secondary indexes (almost everything except pure key-value stores), the indexes
@ -403,7 +407,7 @@ However, in many other cases writes to several different objects need to be coor
Such applications can still be implemented without transactions. However, error handling becomes
much more complicated without atomicity, and the lack of isolation can cause concurrency problems.
We will discuss those in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels), and explore alternative approaches
in [Link to Come].
in [“Derived data versus distributed transactions”](/en/ch13#sec_future_derived_vs_transactions).
#### Handling errors and aborts {#handling-errors-and-aborts}
@ -521,7 +525,7 @@ Can another transaction see that uncommitted data? If yes, that is called a
Transactions running at the read committed isolation level must prevent dirty reads. This means that
any writes by a transaction only become visible to others when that transaction commits (and then
all of its writes become visible at once). This is illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still
all of its writes become visible at once). This is illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still
returns the old value, 2, while user 1 has not yet committed.
{{< figure src="/fig/ddia_0804.png" id="fig_transactions_read_committed" caption="Figure 8-4. No dirty reads: user 2 sees the new value for x only after user 1's transaction has committed." class="w-full my-4" >}}
@ -529,12 +533,12 @@ returns the old value, 2, while user 1 has not yet committed.
There are a few reasons why it’s useful to prevent dirty reads:
* If a transaction needs to update several rows, a dirty read means that another transaction may
see some of the updates but not others. For example, in [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), the
see some of the updates but not others. For example, in [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), the
user sees the new unread email but not the updated counter. This is a dirty read of the email.
Seeing the database in a partially updated state is confusing to users and may cause other
transactions to take incorrect decisions.
* If a transaction aborts, any writes it has made need to be rolled back (like in
[Figure 8-3](/en/ch8#fig_transactions_atomicity)). If the database allows dirty reads, that means a transaction may
[Figure 8-3](/en/ch8#fig_transactions_atomicity)). If the database allows dirty reads, that means a transaction may
see data that is later rolled back—i.e., which is never actually committed to the database. Any
transaction that read uncommitted data would also need to be aborted, leading to a problem called
*cascading aborts*.
@ -553,15 +557,15 @@ first write’s transaction has committed or aborted.
By preventing dirty writes, this isolation level avoids some kinds of concurrency problems:
* If transactions update multiple rows, dirty writes can lead to a bad outcome. For example,
consider [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), which illustrates a used car sales website on which
consider [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), which illustrates a used car sales website on which
two people, Aaliyah and Bryce, are simultaneously trying to buy the same car. Buying a car requires
two database writes: the listing on the website needs to be updated to reflect the buyer, and the
sales invoice needs to be sent to the buyer. In the case of [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), the
sales invoice needs to be sent to the buyer. In the case of [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), the
sale is awarded to Bryce (because he performs the winning update to the `listings` table), but the
invoice is sent to Aaliyah (because she performs the winning update to the `invoices` table). Read
committed prevents such mishaps.
* However, read committed does *not* prevent the race condition between two counter increments in
[Figure 8-1](/en/ch8#fig_transactions_increment). In this case, the second write happens after the first transaction
[Figure 8-1](/en/ch8#fig_transactions_increment). In this case, the second write happens after the first transaction
has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason; in
[“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update) we will discuss how to make such counter increments safe.
@ -597,7 +601,7 @@ different part of the application, due to waiting for locks.
Nevertheless, locks are used to prevent dirty reads in some databases, such as IBM
Db2 and Microsoft SQL Server in the `read_committed_snapshot=off` setting [^29].
A more commonly used approach to preventing dirty reads is the one illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed): for every
A more commonly used approach to preventing dirty reads is the one illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed): for every
row that is written, the database remembers both the old committed value and the new value
set by the transaction that currently holds the write lock. While the transaction is ongoing, any
other transactions that read the row are simply given the old value. Only when the new value is
@ -613,7 +617,7 @@ getting intermingled. Indeed, those are useful features, and much stronger guara
get from a system that has no transactions.
However, there are still plenty of ways in which you can have concurrency bugs when using this
isolation level. For example, [Figure 8-6](/en/ch8#fig_transactions_item_many_preceders) illustrates a problem that
isolation level. For example, [Figure 8-6](/en/ch8#fig_transactions_item_many_preceders) illustrates a problem that
can occur with read committed.
{{< figure src="/fig/ddia_0806.png" id="fig_transactions_item_many_preceders" caption="Figure 8-6. Read skew: Aaliyah observes the database in an inconsistent state." class="w-full my-4" >}}
@ -685,14 +689,14 @@ database to handle long-running read queries on a consistent snapshot at the sam
writes normally, without any lock contention between the two.
To implement snapshot isolation, databases use a generalization of the mechanism we saw for
preventing dirty reads in [Figure 8-4](/en/ch8#fig_transactions_read_committed). Instead of two versions of each row
preventing dirty reads in [Figure 8-4](/en/ch8#fig_transactions_read_committed). Instead of two versions of each row
(the committed version and the overwritten-but-not-yet-committed version), the database must
potentially keep several different committed versions of a row, because various in-progress
transactions may need to see the state of the database at different points in time. Because it
maintains several versions of a row side by side, this technique is known as *multi-version
concurrency control* (MVCC).
[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL
[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL
[^40] [^42] [^43] (other implementations are similar).
When a transaction is started, it is given a unique, always-increasing transaction ID (`txid`).
Whenever a transaction writes anything to the database, the data it writes is tagged with the
@ -712,7 +716,7 @@ garbage collection process in the database removes any rows marked for deletion
space.
An update is internally translated into a delete and an insert [^44].
For example, in [Figure 8-7](/en/ch8#fig_transactions_mvcc), transaction 13 deducts $100 from account 2, changing the
For example, in [Figure 8-7](/en/ch8#fig_transactions_mvcc), transaction 13 deducts $100 from account 2, changing the
balance from $500 to $400. The `accounts` table now actually contains two rows for account 2: a row
with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of
$400 which was inserted by transaction 13.
@ -741,7 +745,7 @@ consistent snapshot of the database to the application. This works roughly as fo
process can remove them later.
4. All other writes are visible to the application’s queries.
These rules apply to both insertion and deletion of rows. In [Figure 8-7](/en/ch8#fig_transactions_mvcc), when
These rules apply to both insertion and deletion of rows. In [Figure 8-7](/en/ch8#fig_transactions_mvcc), when
transaction 12 reads from account 2, it sees a balance of $500 because the deletion of the $500
balance was made by transaction 13 (according to rule 2, transaction 12 cannot see a deletion made
by transaction 13), and the insertion of the $400 balance is not yet visible (by the same rule).
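The visibility rules can be expressed as a small predicate. This is a simplified sketch rather than PostgreSQL's actual implementation: the class and function names are invented, the transaction ID 5 for the original insert is a placeholder, and the clause letting a transaction see its own writes is an added assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    balance: int
    inserted_by: int                  # txid that created this version
    deleted_by: Optional[int] = None  # txid that marked it deleted, if any


def write_visible(writer: int, reader: int, in_progress: set, aborted: set) -> bool:
    """Decide whether a write by txid `writer` is visible to txid `reader`."""
    if writer == reader:
        return True   # assumption: a transaction sees its own writes
    if writer in in_progress:
        return False  # was still running when the reader started
    if writer > reader:
        return False  # started after the reader
    if writer in aborted:
        return False  # never committed
    return True       # all other writes are visible


def row_visible(row: RowVersion, reader: int, in_progress: set, aborted: set) -> bool:
    """A version is visible if its insertion is visible and its deletion is not."""
    inserted = write_visible(row.inserted_by, reader, in_progress, aborted)
    deleted = row.deleted_by is not None and write_visible(
        row.deleted_by, reader, in_progress, aborted
    )
    return inserted and not deleted


# Account 2 after transaction 13 deducted $100 (txid 5 is a placeholder).
account_2 = [
    RowVersion(balance=500, inserted_by=5, deleted_by=13),
    RowVersion(balance=400, inserted_by=13),
]
seen_by_12 = [r.balance for r in account_2 if row_visible(r, 12, set(), set())]
seen_by_14 = [r.balance for r in account_2 if row_visible(r, 14, {12}, set())]
print(seen_by_12, seen_by_14)
```

Transaction 12 sees only the $500 version, while a later transaction 14, started after transaction 13 committed, sees only the $400 version.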
@ -758,6 +762,8 @@ that (from other transactions’ point of view) have long been overwritten or de
updating values in place but instead inserting a new version every time a value is changed, the
database can provide a consistent snapshot while incurring only a small overhead.
<a id="sec_transactions_snapshot_indexes"></a>
#### Indexes and snapshot isolation {#indexes-and-snapshot-isolation}
How do indexes work in a multi-version database? The most common approach is that each index entry
@ -819,7 +825,7 @@ the issue of two transactions writing concurrently—we have only discussed dirt
There are several other interesting kinds of conflicts that can occur between concurrently writing
transactions. The best known of these is the *lost update* problem, illustrated in
[Figure 8-1](/en/ch8#fig_transactions_increment) with the example of two concurrent counter increments.
[Figure 8-1](/en/ch8#fig_transactions_increment) with the example of two concurrent counter increments.
The lost update problem can occur if an application reads some value from the database, modifies it,
and writes back the modified value (a *read-modify-write cycle*). If two transactions do this
@ -875,7 +881,7 @@ For example, consider a multiplayer game in which several players can move the s
concurrently. In this case, an atomic operation may not be sufficient, because the application also
needs to ensure that a player’s move abides by the rules of the game, which involves some logic that
you cannot sensibly implement as a database query. Instead, you may use a lock to prevent two
players from concurrently moving the same piece, as illustrated in [Example 8-1](/en/ch8#fig_transactions_select_for_update).
players from concurrently moving the same piece, as illustrated in [Example 8-1](/en/ch8#fig_transactions_select_for_update).
{{< figure id="fig_transactions_select_for_update" title="Example 8-1. Explicitly locking rows to prevent lost updates" class="w-full my-4" >}}
@ -956,7 +962,7 @@ written by other transactions are visible to the evaluation of the `WHERE` claus
#### Conflict resolution and replication {#conflict-resolution-and-replication}
In replicated databases (see [Chapter 6](/en/ch6#ch_replication)), preventing lost updates takes on another
In replicated databases (see [Chapter 6](/en/ch6#ch_replication)), preventing lost updates takes on another
dimension: since they have copies of the data on multiple nodes, and the data can potentially be
modified concurrently on different nodes, some additional steps need to be taken to prevent lost
updates.
@ -1000,7 +1006,7 @@ they are sick themselves), provided that at least one colleague remains on call
Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are
feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button
to go off call at approximately the same time. What happens next is illustrated in
[Figure 8-8](/en/ch8#fig_transactions_write_skew).
[Figure 8-8](/en/ch8#fig_transactions_write_skew).
{{< figure src="/fig/ddia_0808.png" id="fig_transactions_write_skew" caption="Figure 8-8. Example of write skew causing an application bug." class="w-full my-4" >}}
@ -1070,7 +1076,7 @@ Meeting room booking system
: Say you want to enforce that there cannot be two bookings for the same meeting room at the same time [^55].
When someone wants to make a booking, you first check for any conflicting bookings (i.e.,
bookings for the same room with an overlapping time range), and if none are found, you create the
meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)).
meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)).
{{< figure id="fig_transactions_meeting_rooms" title="Example 8-2. A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation)" class="w-full my-4" >}}
@ -1094,7 +1100,7 @@ Meeting room booking system
isolation.
Multiplayer game
: In [Example 8-1](/en/ch8#fig_transactions_select_for_update), we used a lock to prevent lost updates (that is, making
: In [Example 8-1](/en/ch8#fig_transactions_select_for_update), we used a lock to prevent lost updates (that is, making
sure that two players can't move the same figure at the same time). However, the lock doesn't
prevent players from moving two different figures to the same position on the board or potentially
making some other move that violates the rules of the game. Depending on the kind of rule you are
@ -1278,7 +1284,7 @@ containing a single statement, or submit the entire transaction code to the data
as a *stored procedure* [^61].
The difference between interactive transactions and stored procedures is illustrated in
[Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the
[Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the
stored procedure can execute very quickly, without waiting for any network or disk I/O.
{{< figure src="/fig/ddia_0809.png" id="fig_transactions_stored_proc" caption="Figure 8-9. The difference between an interactive transaction and a stored procedure (using the example transaction of [Figure 8-8](/en/ch8#fig_transactions_write_skew))." class="w-full my-4" >}}
@ -1322,7 +1328,7 @@ requires that stored procedures are *deterministic* (when run on different nodes
the same result). If a transaction needs to use the current date and time, for example, it must do
so through special deterministic APIs (see [“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows) for more details on
deterministic operations). This approach is called *state machine replication*, and we will return
to it in [Chapter 10](/en/ch10#ch_consistency).
to it in [Chapter 10](/en/ch10#ch_consistency).
#### Sharding {#sharding}
@ -1332,7 +1338,7 @@ Read-only transactions may execute elsewhere, using snapshot isolation, but for
high write throughput, the single-threaded transaction processor can become a serious bottleneck.
In order to scale to multiple CPU cores, and multiple nodes, you can shard your data
(see [Chapter 7](/en/ch7#ch_sharding)), which is supported in VoltDB. If you can find a way of sharding your dataset
(see [Chapter 7](/en/ch7#ch_sharding)), which is supported in VoltDB. If you can find a way of sharding your dataset
so that each transaction only needs to read and write data within a single shard, then each shard
can have its own transaction processing thread running independently from the others. In this case,
you can give each CPU core its own shard, which allows your transaction throughput to scale linearly
@ -1398,7 +1404,7 @@ anyone wants to write (modify or delete) an object, exclusive access is required
unexpectedly behind A's back.)
* If transaction A has written an object and transaction B wants to read that object, B must wait
until A commits or aborts before it can continue. (Reading an old version of the object, like in
[Figure 8-4](/en/ch8#fig_transactions_read_committed), is not acceptable under 2PL.)
[Figure 8-4](/en/ch8#fig_transactions_read_committed), is not acceptable under 2PL.)
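The blocking rules above boil down to a small compatibility check between lock modes. The following is a minimal sketch, not the implementation of any particular database; the names `Mode`, `compatible`, and `must_wait` are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Mode(Enum):
    SHARED = "shared"        # taken by readers
    EXCLUSIVE = "exclusive"  # taken by writers


def compatible(held: Mode, requested: Mode) -> bool:
    """Two locks on the same object can coexist only if both are shared."""
    return held is Mode.SHARED and requested is Mode.SHARED


def must_wait(holders: dict[str, Mode], txn: str, requested: Mode) -> bool:
    """Return True if `txn` must wait because another transaction holds
    an incompatible lock on the object."""
    return any(
        other != txn and not compatible(mode, requested)
        for other, mode in holders.items()
    )
```

With this check, a reader waits for a writer (`must_wait({"A": Mode.EXCLUSIVE}, "B", Mode.SHARED)` is true) and a writer waits for a reader, whereas two readers can proceed together.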
In 2PL, writers don't just block other writers; they also block readers and vice
versa. Snapshot isolation has the mantra *readers never block writers, and writers never block
@ -1470,7 +1476,7 @@ changing the results of another transactions search query. A database with se
must prevent phantoms.
In the meeting room booking example this means that if one transaction has searched for existing
bookings for a room within a certain time window (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)), another
bookings for a room within a certain time window (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)), another
transaction is not allowed to concurrently insert or update another booking for the same room and
time range. (It's okay to concurrently insert bookings for other rooms, or for the same room at a
different time that doesn't affect the proposed booking.)
@ -1623,7 +1629,7 @@ see [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_sn
MVCC database, it ignores writes that were made by any other transactions that hadn't yet committed
at the time when the snapshot was taken.
In [Figure 8-10](/en/ch8#fig_transactions_detect_mvcc), transaction 43 sees
In [Figure 8-10](/en/ch8#fig_transactions_detect_mvcc), transaction 43 sees
Aaliyah as having `on_call = true`, because transaction 42 (which modified Aaliyah's on-call status) is
uncommitted. However, by the time transaction 43 wants to commit, transaction 42 has already
committed. This means that the write that was ignored when reading from the consistent snapshot has
@ -1650,7 +1656,7 @@ isolations support for long-running reads from a consistent snapshot.
#### Detecting writes that affect prior reads {#sec_detecting_writes_affect_reads}
The second case to consider is when another transaction modifies data after it has been read. This
case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range).
case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range).
{{< figure src="/fig/ddia_0811.png" id="fig_transactions_detect_index_range" caption="Figure 8-11. In serializable snapshot isolation, detecting when one transaction modifies another transaction's reads." class="w-full my-4" >}}
@ -1660,7 +1666,7 @@ In the context of two-phase locking we discussed index-range locks (see
search query, such as `WHERE shift_id = 1234`. We can use a similar technique here, except that SSI
locks don't block other transactions.
In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transactions 42 and 43 both search for on-call doctors
In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transactions 42 and 43 both search for on-call doctors
during shift `1234`. If there is an index on `shift_id`, the database can use the index entry 1234 to
record the fact that transactions 42 and 43 read this data. (If there is no index, this information
can be tracked at the table level.) This information only needs to be kept for a while: after a
@ -1672,7 +1678,7 @@ that have recently read the affected data. This process is similar to acquiring
key range, but rather than blocking until the readers have committed, the lock acts as a tripwire:
it simply notifies the transactions that the data they read may no longer be up to date.
In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transaction 43 notifies transaction 42 that its prior
In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transaction 43 notifies transaction 42 that its prior
read is outdated, and vice versa. Transaction 42 is first to commit, and it is successful: although
transaction 43's write affected 42, 43 hasn't yet committed, so the write has not yet taken effect.
However, when transaction 43 wants to commit, the conflicting write from 42 has already been
@ -1750,7 +1756,7 @@ distributed transactions, but various distributed relational databases do.
In these cases, it is not sufficient to simply send a commit request to all of the nodes and
independently commit the transaction on each one. It could easily happen that the commit succeeds on
some nodes and fails on other nodes, as shown in [Figure 8-12](/en/ch8#fig_transactions_non_atomic):
some nodes and fails on other nodes, as shown in [Figure 8-12](/en/ch8#fig_transactions_non_atomic):
* Some nodes may detect a constraint violation or conflict, making an abort necessary, while other
nodes are successfully able to commit.
@ -1766,7 +1772,7 @@ If some nodes commit the transaction but others abort it, the nodes become incon
other. And once a transaction has been committed on one node, it cannot be retracted again if it
later turns out that it was aborted on another node. This is because once data has been committed,
it becomes visible to other transactions under *read committed* or stronger isolation. For example,
in [Figure 8-12](/en/ch8#fig_transactions_non_atomic), by the time user 1 notices that its commit failed on database 1,
in [Figure 8-12](/en/ch8#fig_transactions_non_atomic), by the time user 1 notices that its commit failed on database 1,
user 2 has already read the data from the same transaction on database 2. If user 1's transaction
was later aborted, user 2's transaction would have to be reverted as well, since it was based on
data that was retroactively declared not to have existed.
@ -1782,7 +1788,7 @@ internally in some databases and also made available to applications in the form
(which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP
web services [^74] [^75].
The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two
phases (hence the name).
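To make the two phases concrete, here is a minimal in-memory sketch. It is an illustration under simplifying assumptions rather than a production protocol: the `Participant` class and `two_phase_commit` function are hypothetical names, there is no network, and the durable logging that a real coordinator and participants need is only indicated in comments.

```python
class Participant:
    """Hypothetical in-memory participant; a real one would use durable storage."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote. Voting yes is a promise to commit if asked later.
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    # Phase 1: ask every participant whether it is able to commit.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # A real coordinator must durably log the decision here, before phase 2.
    # Phase 2: send the decision to all participants.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision
```

A single "no" vote in phase 1 is enough to abort the transaction everywhere; only a unanimous "yes" leads to a commit.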
@ -1877,7 +1883,7 @@ was committed or aborted. If the coordinator crashes or the network fails at thi
participant can do nothing but wait. A participant's transaction in this state is called *in doubt*
or *uncertain*.
The situation is illustrated in [Figure 8-14](/en/ch8#fig_transactions_2pc_crash). In this particular example, the
The situation is illustrated in [Figure 8-14](/en/ch8#fig_transactions_2pc_crash). In this particular example, the
coordinator actually decided to commit, and database 2 received the commit request. However, the
coordinator crashed before it could send the commit request to database 1, and so database 1 does
not know whether to commit or abort. Even a timeout does not help here: if database 1 unilaterally
@ -1907,11 +1913,11 @@ is not so straightforward.
As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed [^13] [^77].
However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most
practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
cannot guarantee atomicity.
A better solution in practice is to replace the single-node coordinator with a fault-tolerant
consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_consistency).
consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_consistency).
### Distributed Transactions Across Different Systems {#sec_transactions_xa}
@ -2018,7 +2024,7 @@ writes. In addition, if you want serializable isolation, a database using two-ph
also have to take a shared lock on any rows *read* by the transaction.
The database cannot release those locks until the transaction commits or aborts (illustrated as a
shaded area in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit)). Therefore, when using two-phase commit, a
shaded area in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit)). Therefore, when using two-phase commit, a
transaction must hold onto the locks throughout the time it is in doubt. If the coordinator has
crashed and takes 20 minutes to start up again, those locks will be held for 20 minutes. If the
coordinator's log is entirely lost for some reason, those locks will be held forever—or at least
@ -2086,7 +2092,7 @@ different systems.
These problems are somewhat inherent in performing transactions across heterogeneous technologies.
However, keeping several heterogeneous data systems consistent with each other is still a real and
important problem, so we need to find a different solution to it. This can be done, as we will see
in the next section and in [Link to Come].
in the next section and in [“Derived data versus distributed transactions”](/en/ch13#sec_future_derived_vs_transactions).
### Database-internal Distributed Transactions {#sec_transactions_internal}
@ -2111,7 +2117,7 @@ The biggest problems with XA can be fixed by:
* Coupling the atomic commitment protocol with a distributed concurrency control protocol that supports deadlock detection and consistent reads across shards.
Consensus algorithms are commonly used to replicate the coordinator and the database shards. We will
see in [Chapter 10](/en/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented
see in [Chapter 10](/en/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented
using a consensus algorithm. These algorithms tolerate faults by automatically failing over from one
node to another without any human intervention, and while continuing to guarantee strong consistency
properties.
@ -2159,7 +2165,7 @@ Thus, achieving exactly-once processing only requires transactions within the da
across database and message broker is not necessary for this use case. Recording the message ID in
the database makes the message processing *idempotent*, so that message processing can be safely
retried without duplicating its side-effects. A similar approach is used in stream processing
frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in [Link to Come].
frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in [“Fault Tolerance”](/en/ch12#sec_stream_fault_tolerance).
However, internal distributed transactions within the database are still useful for the scalability
of patterns such as these: for example, they would allow the message IDs to be stored on one shard
@ -2189,7 +2195,7 @@ can have on the database.
In this chapter, we went particularly deep into the topic of concurrency control. We discussed
several widely used isolation levels, in particular *read committed*, *snapshot isolation*
(sometimes called *repeatable read*), and *serializable*. We characterized those isolation levels by
discussing various examples of race conditions, summarized in [Table 8-1](/en/ch8#ch_transactions_isolation_levels):
discussing various examples of race conditions, summarized in [Table 8-1](/en/ch8#ch_transactions_isolation_levels):
{{< figure id="ch_transactions_isolation_levels" title="Table 8-1. Summary of anomalies that can occur at various isolation levels" class="w-full my-4" >}}

View file

@ -4,6 +4,8 @@ weight: 209
breadcrumbs: false
---
<a id="ch_distributed"></a>
![](/map/ch08.png)
> *They're funny things, Accidents. You never have them till you're having them.*
@ -33,7 +35,7 @@ explore the things that may go wrong in a distributed system. We will look into
networks ([“Unreliable Networks”](/en/ch9#sec_distributed_networks)) as well as clocks and timing issues
([“Unreliable Clocks”](/en/ch9#sec_distributed_clocks)). The consequences of all these issues are disorienting, so we'll
explore how to think about the state of a distributed system and how to reason about things that
have happened ([“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth)). Later, in [Chapter 10](/en/ch10#ch_consistency), we will look at some
have happened ([“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth)). Later, in [Chapter 10](/en/ch10#ch_consistency), we will look at some
examples of how we can achieve fault tolerance in the face of those faults.
## Faults and Partial Failures {#sec_distributed_partial_failure}
@ -104,7 +106,7 @@ The internet and most internal networks in datacenters (often Ethernet) are *asy
networks*. In this kind of network, one node can send a message (a packet) to another node, but the
network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send
a request and expect a response, many things could go wrong (some of which are illustrated in
[Figure 9-1](/en/ch9#fig_distributed_network)):
[Figure 9-1](/en/ch9#fig_distributed_network)):
1. Your request may have been lost (perhaps someone unplugged a network cable).
2. Your request may be waiting in a queue and will be delivered later (perhaps the network or the
@ -219,7 +221,7 @@ even in controlled environments like a datacenter operated by one company [^8]:
When one part of the network is cut off from the rest due to a network fault, that is sometimes
called a *network partition* or *netsplit*, but it is not fundamentally different from other kinds
of network interruption. Network partitions are not related to sharding of a storage system, which
is sometimes also called *partitioning* (see [Chapter 7](/en/ch7#ch_sharding)).
is sometimes also called *partitioning* (see [Chapter 7](/en/ch7#ch_sharding)).
--------
@ -286,7 +288,7 @@ to a load spike on the node or the network).
Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of
performing some action (for example, sending an email), and another node takes over, the action may
end up being performed twice. We will discuss this issue in more detail in
[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), and in Chapters [^10] and [Link to Come].
[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), [Chapter 10](/en/ch10#ch_consistency), and [“The End-to-End Argument for Databases”](/en/ch13#sec_future_end_to_end).
When a node is declared dead, its responsibilities need to be transferred to other nodes, which
places additional load on other nodes and the network. If the system is already struggling with high
@ -299,9 +301,9 @@ Imagine a fictitious system with a network that guaranteed a maximum delay for p
is either delivered within some time *d*, or it is lost, but delivery never takes longer than *d*.
Furthermore, assume that you can guarantee that a non-failed node always handles a request within
some time *r*. In this case, you could guarantee that every successful request receives a response
within time 2*d* + *r*, and if you don't receive a response within that time, you know
within time 2*d* + *r*, and if you don't receive a response within that time, you know
that either the network or the remote node is not working. If this was true,
2*d* + *r* would be a reasonable timeout to use.
2*d* + *r* would be a reasonable timeout to use.
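As a worked example of this bound, the sketch below computes the timeout for illustrative values of *d* and *r*. The numbers are made up for this example, and the function name `request_timeout_ms` is hypothetical.

```python
def request_timeout_ms(d_ms: int, r_ms: int) -> int:
    """Upper bound on the response time in the fictitious bounded-delay system:
    one network delay each way (2 * d) plus the request handling time (r)."""
    return 2 * d_ms + r_ms


# Assumed values: at most 10 ms per network hop, at most 50 ms to handle a request.
timeout = request_timeout_ms(d_ms=10, r_ms=50)  # 70 ms
```

In such a system, a response that has not arrived after 70 ms would prove that the network or the remote node has failed.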
Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks
have *unbounded delays* (that is, they try to deliver packets as quickly as possible, but there is
@ -311,6 +313,8 @@ cannot guarantee that they can handle requests within some maximum time (see
be fast most of the time: if your timeout is low, it only takes a transient spike in round-trip
times to throw the system off-balance.
<a id="sec_distributed_congestion"></a>
#### Network congestion and queueing {#network-congestion-and-queueing}
When driving a car, travel times on road networks often vary most due to traffic congestion.
@ -318,7 +322,7 @@ Similarly, the variability of packet delays on computer networks is most often d
* If several different nodes simultaneously try to send packets to the same destination, the network
switch must queue them up and feed them into the destination network link one by one (as illustrated
in [Figure 9-2](/en/ch9#fig_distributed_switch_queueing)). On a busy network link, a packet may have to wait a while
in [Figure 9-2](/en/ch9#fig_distributed_switch_queueing)). On a busy network link, a packet may have to wait a while
until it can get a slot (this is called *network congestion*). If there is so much incoming data
that the switch queue fills up, the packet is dropped, so it needs to be resent—even though
the network is functioning fine.
@ -340,6 +344,8 @@ expire, and then waiting for the retransmitted packet to be acknowledged).
--------
<a id="sidebar_distributed_tcp_udp"></a>
> [!TIP] TCP VERSUS UDP
Some latency-sensitive applications, such as videoconferencing and Voice over IP (VoIP), use UDP
@ -445,6 +451,8 @@ applications to reprioritize packets for QoS purposes.
--------
<a id="sidebar_distributed_latency_utilization"></a>
> [!TIP] LATENCY AND RESOURCE UTILIZATION
More generally, you can think of variable delays as a consequence of dynamic resource partitioning.
@ -548,7 +556,7 @@ unsuitable for measuring elapsed time [^40].
Time-of-day clocks can experience jumps due to the start and end of Daylight Saving Time (DST);
these can be avoided by always using UTC as time zone, which does not have DST.
Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward
in steps of 10 ms on older Windows systems [^41].
in steps of 10 ms on older Windows systems [^41].
On recent systems, this is less of a problem.
#### Monotonic clocks {#monotonic-clocks}
@ -591,8 +599,8 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples
* The quartz clock in a computer is not very accurate: it *drifts* (runs faster or slower than it
should). Clock drift varies depending on the temperature of the machine. Google assumes a clock
drift of up to 200 ppm (parts per million) for its servers [^45],
which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30
drift of up to 200 ppm (parts per million) for its servers [^45],
which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30
seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best
possible accuracy you can achieve, even if everything is working correctly.
* If a computer's clock differs too much from an NTP server, it may refuse to synchronize, or the
@ -602,7 +610,7 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples
different nodes' clocks. Anecdotal evidence suggests that this does happen in practice.
* NTP synchronization can only be as good as the network delay, so there is a limit to its
accuracy when you're on a congested network with variable packet delays. One experiment showed
that a minimum error of 35 ms is achievable when synchronizing over the internet [^46],
that a minimum error of 35 ms is achievable when synchronizing over the internet [^46],
though occasional spikes in network delay lead to errors of around a second. Depending on the
configuration, large network delays can cause the NTP client to give up entirely.
* Some NTP servers are wrong or misconfigured, reporting time that is off by hours [^47] [^48].
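The drift figures quoted in the first item of this list follow from simple arithmetic, which can be checked with a short sketch. The function name `max_drift_seconds` is hypothetical; the 200 ppm rate and the two resynchronization intervals are the ones given in the text.

```python
def max_drift_seconds(drift_ppm: float, interval_seconds: float) -> float:
    """Worst-case clock error accumulated between two synchronizations."""
    return drift_ppm * 1e-6 * interval_seconds


every_30_seconds = max_drift_seconds(200, 30)        # about 0.006 s, i.e. 6 ms
once_a_day = max_drift_seconds(200, 24 * 60 * 60)    # about 17.3 s
```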
@ -673,29 +681,29 @@ ordering of events across multiple nodes [^64].
For example, if two clients write to a distributed database, who got there first? Which write is the
more recent one?
[Figure 9-3](/en/ch9#fig_distributed_timestamps) illustrates a dangerous use of time-of-day clocks in a database with
multi-leader replication (the example is similar to [Figure 6-8](/en/ch6#fig_replication_causality)). Client A writes
*x* = 1 on node 1; the write is replicated to node 3; client B increments *x* on node
3 (we now have *x* = 2); and finally, both writes are replicated to node 2.
[Figure 9-3](/en/ch9#fig_distributed_timestamps) illustrates a dangerous use of time-of-day clocks in a database with
multi-leader replication (the example is similar to [Figure 6-8](/en/ch6#fig_replication_causality)). Client A writes
*x* = 1 on node 1; the write is replicated to node 3; client B increments *x* on node
3 (we now have *x* = 2); and finally, both writes are replicated to node 2.
{{< figure src="/fig/ddia_0903.png" id="fig_distributed_timestamps" caption="Figure 9-3. The write by client B is causally later than the write by client A, but B's write has an earlier timestamp." class="w-full my-4" >}}
In [Figure 9-3](/en/ch9#fig_distributed_timestamps), when a write is replicated to other nodes, it is tagged with a
In [Figure 9-3](/en/ch9#fig_distributed_timestamps), when a write is replicated to other nodes, it is tagged with a
timestamp according to the time-of-day clock on the node where the write originated. The clock
synchronization is very good in this example: the skew between node 1 and node 3 is less than
3 ms, which is probably better than you can expect in practice.
3 ms, which is probably better than you can expect in practice.
Since the increment builds upon the earlier write of *x* = 1, we might expect that the
write of *x* = 2 should have the greater timestamp of the two. Unfortunately, that is
not what happens in [Figure 9-3](/en/ch9#fig_distributed_timestamps): the write *x* = 1 has a timestamp of
42.004 seconds, but the write *x* = 2 has a timestamp of 42.003 seconds.
Since the increment builds upon the earlier write of *x* = 1, we might expect that the
write of *x* = 2 should have the greater timestamp of the two. Unfortunately, that is
not what happens in [Figure 9-3](/en/ch9#fig_distributed_timestamps): the write *x* = 1 has a timestamp of
42.004 seconds, but the write *x* = 2 has a timestamp of 42.003 seconds.
As discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww), one way of resolving conflicts between concurrently written
values on different nodes is *last write wins* (LWW), which means keeping the write with the
greatest timestamp for a given key and discarding all writes with older timestamps. In the example
of [Figure 9-3](/en/ch9#fig_distributed_timestamps), when node 2 receives these two events, it will incorrectly
conclude that *x* = 1 is the more recent value and drop the write *x* = 2,
of [Figure 9-3](/en/ch9#fig_distributed_timestamps), when node 2 receives these two events, it will incorrectly
conclude that *x* = 1 is the more recent value and drop the write *x* = 2,
so the increment is lost.
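The way node 2 reaches the wrong conclusion can be reproduced with a minimal sketch of a last-write-wins merge. The `Write` class and `lww_merge` function are hypothetical names for illustration; the timestamps are the ones from the figure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: int
    timestamp: float  # time-of-day clock of the originating node, in seconds


def lww_merge(writes):
    """Keep, per key, only the write with the greatest timestamp."""
    winners = {}
    for w in writes:
        current = winners.get(w.key)
        if current is None or w.timestamp > current.timestamp:
            winners[w.key] = w
    return winners


write_a = Write("x", 1, 42.004)  # client A's write on node 1
write_b = Write("x", 2, 42.003)  # client B's increment on node 3, causally later
result = lww_merge([write_a, write_b])
```

Here `result["x"].value` is 1: the causally later write loses because the clock on its node was slightly behind, and the outcome is the same regardless of the order in which node 2 receives the two writes.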
This problem can be prevented by ensuring that when a value is overwritten, the new value always has
@ -710,7 +718,7 @@ policy [^62]. This approach has some serious problems:
This scenario can cause arbitrary amounts of data to be silently dropped without any error being
reported to the application.
* LWW cannot distinguish between writes that occurred sequentially in quick succession (in
[Figure 9-3](/en/ch9#fig_distributed_timestamps), client B's increment definitely occurs *after* client A's write)
[Figure 9-3](/en/ch9#fig_distributed_timestamps), client B's increment definitely occurs *after* client A's write)
and writes that were truly concurrent (neither writer was aware of the other). Additional
causality tracking mechanisms, such as version vectors, are needed in order to prevent violations
of causality (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)).
@ -722,8 +730,8 @@ policy [^62]. This approach has some serious problems:
Thus, even though it is tempting to resolve conflicts by keeping the most “recent” value and
discarding others, it's important to be aware that the definition of “recent” depends on a local
time-of-day clock, which may well be incorrect. Even with tightly NTP-synchronized clocks, you could
send a packet at timestamp 100 ms (according to the sender's clock) and have it arrive at
timestamp 99 ms (according to the recipient's clock), so it appears as though the packet
send a packet at timestamp 100 ms (according to the sender's clock) and have it arrive at
timestamp 99 ms (according to the recipient's clock), so it appears as though the packet
arrived before it was sent, which is impossible.
Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur?
@ -746,12 +754,12 @@ actually accurate to such precision. In fact, it most likely is not—as mention
drift in an imprecise quartz clock can easily be several milliseconds, even if you synchronize with
an NTP server on the local network every minute. With an NTP server on the public internet, the best
possible accuracy is probably to the tens of milliseconds, and the error may easily spike to over
100 ms when there is network congestion.
100 ms when there is network congestion.
Thus, it doesn't make sense to think of a clock reading as a point in time; it is more like a
range of times, within a confidence interval: for example, a system may be 95% confident that the
time now is between 10.3 and 10.5 seconds past the minute, but it doesn't know any more precisely than that [^67].
If we only know the time ±100 ms, the microsecond digits in the timestamp are essentially meaningless.
If we only know the time ±100 ms, the microsecond digits in the timestamp are essentially meaningless.
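One way to picture a clock reading as a range is a small type with an earliest and a latest possible time. This is a sketch under assumptions: `ClockReading` and `definitely_before` are hypothetical names, not part of any standard clock API, and most operating systems do not expose the uncertainty in this form.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClockReading:
    earliest: float  # lower bound of the confidence interval, in seconds
    latest: float    # upper bound of the confidence interval, in seconds


def definitely_before(a: ClockReading, b: ClockReading) -> bool:
    """True only if the intervals do not overlap and `a` ends before `b` begins."""
    return a.latest < b.earliest


a = ClockReading(10.3, 10.5)
b = ClockReading(10.45, 10.65)  # overlaps with a: the order is unknown
c = ClockReading(10.6, 10.8)    # starts after a ends: a happened first
```

For the overlapping readings `a` and `b`, neither `definitely_before(a, b)` nor `definitely_before(b, a)` holds, so the timestamps alone cannot say which event came first.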
The uncertainty bound can be calculated based on your time source. If you have a GPS receiver or
atomic clock directly attached to your computer, the expected error range is determined by
@ -808,7 +816,7 @@ length of the confidence interval before committing a read-write transaction. By
ensures that any transaction that may read the data is at a sufficiently later time, so their
confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner
needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS
receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [^45].
receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about 7 ms [^45].
The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to
have a confidence interval, and the accurate clock sources only help keep that interval small. Other
@ -943,7 +951,7 @@ failure of the entire system. These are so-called *hard real-time* systems.
> In embedded systems, *real-time* means that a system is carefully designed and tested to meet
> specified timing guarantees in all circumstances. This meaning is in contrast to the more vague use of the
> term *real-time* on the web, where it describes servers pushing data to clients and stream
> processing without hard response time constraints (see [Link to Come]).
> processing without hard response time constraints (see [Chapter 12](/en/ch12#ch_stream)).
--------
@ -997,7 +1005,7 @@ A variant of this idea is to use the garbage collector only for short-lived obje
to collect) and to restart processes periodically, before they accumulate enough long-lived objects
to require a full GC of long-lived objects [^79] [^82].
One node can be restarted at a time, and traffic can be shifted away from the node before the
planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding)).
These measures cannot fully prevent garbage collection pauses, but they can usefully reduce their
impact on the application.
@ -1031,7 +1039,7 @@ even if the underlying system model provides very few guarantees.
However, although it is possible to make software well behaved in an unreliable system model, it
is not straightforward to do so. In the rest of this chapter we will further explore the notions of
knowledge and truth in distributed systems, which will help us think about the kinds of assumptions
we can make and the guarantees we may want to provide. In [Chapter 10](/en/ch10#ch_consistency) we will proceed to
look at some examples of distributed algorithms that provide particular guarantees under particular
assumptions.
@ -1075,7 +1083,7 @@ of quorums are possible). A majority quorum allows the system to continue workin
are faulty (with three nodes, one faulty node can be tolerated; with five nodes, two faulty nodes can be
tolerated). However, it is still safe, because there can be only one majority in the
system—there cannot be two majorities with conflicting decisions at the same time. We will discuss
the use of quorums in more detail when we get to *consensus algorithms* in [Chapter 10](/en/ch10#ch_consistency).
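The arithmetic behind majority quorums is short enough to write down. The sketch below is only an illustration; the function names `majority` and `tolerated_faults` are made up for this example.

```python
def majority(n: int) -> int:
    """Smallest number of nodes that forms a majority of n nodes."""
    return n // 2 + 1


def tolerated_faults(n: int) -> int:
    """Nodes that may be faulty while a majority can still be formed."""
    return n - majority(n)


def majorities_must_overlap(n: int) -> bool:
    """Two majorities together exceed n nodes, so they share a node."""
    return 2 * majority(n) > n


three_node_faults = tolerated_faults(3)
five_node_faults = tolerated_faults(5)
```

For three nodes this gives one tolerated fault, and for five nodes two, matching the figures in the text. The overlap check holds for every cluster size, which is why two conflicting majority decisions cannot exist at the same time.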
### Distributed Locks and Leases {#sec_distributed_lock_fencing}
@ -1099,13 +1107,13 @@ hold the lease, perhaps due to a process pause. In the third example, the conseq
wasted computational resources, which is not a big deal. But in the first two cases, the consequence
could be lost or corrupted data, which is much more serious.
For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect
implementation of locking. (The bug is not theoretical: HBase used to have this problem [^85] [^86].)
Say you want to ensure that a file in a storage service can only be
accessed by one client at a time, because if multiple clients tried to write to it, the file would
become corrupted. You try to implement this by requiring a client to obtain a lease from a lock
service before accessing the file. Such a lock service is often implemented using a consensus
algorithm; we will discuss this further in [Chapter 10](/en/ch10#ch_consistency).
{{< figure src="/fig/ddia_0904.png" id="fig_distributed_lease_pause" caption="Figure 9-4. Incorrect implementation of a distributed lock: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage." class="w-full my-4" >}}
@ -1116,13 +1124,13 @@ the same file, and start writing to the file. When the paused client comes back,
(incorrectly) that it still has a valid lease and proceeds to also write to the file. We now have a
split brain situation: the clients' writes clash and corrupt the file.
[Figure 9-5](/en/ch9#fig_distributed_lease_delay) shows a different problem that has similar consequences. In this
example there is no process pause, only a crash by client 1. Just before client 1 crashes it sends a
write request to the storage service, but this request is delayed for a long time in the network.
(Remember from [“Network Faults in Practice”](/en/ch9#sec_distributed_network_faults) that packets can sometimes be delayed by a minute
or more.) By the time the write request arrives at the storage service, the lease has already timed
out, allowing client 2 to acquire it and issue a write of its own. The result is corruption similar
to [Figure 9-4](/en/ch9#fig_distributed_lease_pause).
{{< figure src="/fig/ddia_0905.png" id="fig_distributed_lease_delay" caption="Figure 9-5. A message from a former leaseholder might be delayed for a long time, and arrive after another node has taken over the lease." class="w-full my-4" >}}
@ -1139,11 +1147,11 @@ from the network [^9], shutting down the VM via
the cloud provider's management interface, or even physically powering down the machine [^87].
This approach is known as *Shoot The Other Node In The Head* or STONITH. Unfortunately, it suffers
from some problems: it does not protect against large network delays like in
[Figure 9-5](/en/ch9#fig_distributed_lease_delay); it can happen that all of the nodes shut each other down [^19]; and by the time the zombie has been
detected and shut down, it may already be too late and data may already have been corrupted.
A more robust fencing solution, which protects against both zombies and delayed requests, is
illustrated in [Figure 9-6](/en/ch9#fig_distributed_fencing).
{{< figure src="/fig/ddia_0906.png" id="fig_distributed_fencing" caption="Figure 9-6. Making access to storage safe by allowing writes only in the order of increasing fencing tokens." class="w-full my-4" >}}
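The rule in the caption of Figure 9-6 can be sketched as a check inside the storage service. This is a simplified, single-process illustration: the `FencedStorage` class and its `write` method are hypothetical names, and a real storage service would also need to persist the highest token it has seen.

```python
class FencedStorage:
    """Toy storage service that rejects writes with a stale fencing token."""

    def __init__(self) -> None:
        self.highest_token = 0  # highest fencing token seen so far
        self.data: dict[str, str] = {}

    def write(self, token: int, key: str, value: str) -> bool:
        # A token lower than one already seen belongs to a former leaseholder.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.data[key] = value
        return True


storage = FencedStorage()

# Client 2 holds the newer lease (token 34) and writes first.
client2_ok = storage.write(34, "file", "written by client 2")

# Client 1 wakes up from its pause and writes with its old token 33.
client1_ok = storage.write(33, "file", "written by client 1")
```

The write from client 2 is accepted and the later write from client 1 is rejected, so the file keeps the value written under the current lease. Writes that repeat the current token are still accepted in this sketch, since a leaseholder may issue several writes while its lease is valid.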
@ -1158,12 +1166,12 @@ it must include its current fencing token.
> [!NOTE]
> There are several alternative names for fencing tokens. In Chubby, Google's lock service, they are
> called *sequencers* [^88], and in Kafka they are called *epoch numbers*.
> In consensus algorithms, which we will discuss in [Chapter 10](/en/ch10#ch_consistency), the *ballot number* (Paxos) or
> *term number* (Raft) serves a similar purpose.
--------
In [Figure 9-6](/en/ch9#fig_distributed_fencing), client 1 acquires the lease with a token of 33, but then
it goes into a long pause and the lease expires. Client 2 acquires the lease with a token of 34 (the
number always increases) and then sends its write request to the storage service, including the
token of 34. Later, client 1 comes back to life and sends its write to the storage service,
@ -1196,7 +1204,7 @@ last-write-wins conflict resolution (see [“Leaderless Replication”](/en/ch6#
client sends writes directly to each replica, and each replica independently decides whether to
accept a write based on a timestamp assigned by the client.
As illustrated in [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), you can put the writer's fencing token in
the most significant bits or digits of the timestamp. You can then be sure that any timestamp
generated by the new leaseholder will be greater than any timestamp from the old leaseholder, even
if the old leaseholder's writes happened later.
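A minimal sketch of this encoding, assuming integer timestamps, might look as follows. The 48-bit width reserved for the wall-clock part and the name `fenced_timestamp` are assumptions made for the example, not something prescribed by any particular database.

```python
TIMESTAMP_BITS = 48  # assumed width of the wall-clock part, in bits


def fenced_timestamp(token: int, wall_clock_ms: int) -> int:
    """Put the fencing token in the most significant bits of the timestamp."""
    if not 0 <= wall_clock_ms < (1 << TIMESTAMP_BITS):
        raise ValueError("wall-clock value does not fit in the reserved bits")
    return (token << TIMESTAMP_BITS) | wall_clock_ms


# The old leaseholder (token 33) writes later in wall-clock time than
# the new leaseholder (token 34), yet its timestamp still compares lower.
old_write = fenced_timestamp(33, 2_000_000)
new_write = fenced_timestamp(34, 1_000_000)
new_wins = new_write > old_write
```

Because the token occupies the high-order bits, comparing two such timestamps compares the tokens first and falls back to the wall-clock value only when the tokens are equal.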
@ -1204,7 +1212,7 @@ if the old leaseholders writes happened later.
{{< figure src="/fig/ddia_0907.png" id="fig_distributed_fencing_leaderless" caption="Figure 9-7. Using fencing tokens to protect writes to a leaderless replicated database." class="w-full my-4" >}}
In [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), Client 2 has a fencing token of 34, so all of its
timestamps starting with 34… are greater than any timestamps starting with 33… that are
generated by Client 1. Client 2 writes to a quorum of replicas but it can't reach Replica 3. This
means that when the zombie Client 1 later tries to write, its write may succeed at Replica 3 even
@ -1239,7 +1247,7 @@ The Byzantine Generals Problem is a generalization of the so-called *Two General
which imagines a situation in which two army generals need to agree on a battle plan. As they
have set up camp on two different sites, they can only communicate by messenger, and the messengers
sometimes get delayed or lost (like packets in a network). We will discuss this problem of
*consensus* in [Chapter 10](/en/ch10#ch_consistency).
In the Byzantine version of the problem, there are *n* generals who need to agree, and their
endeavor is hampered by the fact that there are some traitors in their midst. Most of the generals
@ -1301,6 +1309,8 @@ an attacker can compromise one node, they can probably compromise all of them, b
probably running the same software. Thus, traditional mechanisms (authentication, access control,
encryption, firewalls, and so on) continue to be the main protection against attackers.
<a id="sec_distributed_weak_lying"></a>
#### Weak forms of lying {#weak-forms-of-lying}
Although we assume that nodes are generally honest, it can be worth adding mechanisms to software
@ -1327,7 +1337,7 @@ pragmatic steps toward better reliability. For example:
### System Model and Reality {#sec_distributed_system_model}
Many algorithms have been designed to solve distributed systems problems—for example, we will
examine solutions for the consensus problem in [Chapter 10](/en/ch10#ch_consistency). In order to be useful, these
algorithms need to tolerate the various faults of distributed systems that we discussed in this
chapter.
@ -1409,7 +1419,7 @@ Uniqueness
Monotonic sequence
: If request *x* returned token *t*<sub>*x*</sub>, and request *y* returned token *t*<sub>*y*</sub>, and
  *x* completed before *y* began, then *t*<sub>*x*</sub> < *t*<sub>*y*</sub>.
Availability
: A node that requests a fencing token and does not crash eventually receives a response.
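The two safety properties in this list can be checked mechanically against a recorded history of token requests. The sketch below is an illustration only: the `Request` record, its field names, and the example histories are invented, and uniqueness is taken to mean that no two requests return the same token.

```python
from typing import NamedTuple


class Request(NamedTuple):
    began: float      # time at which the request was sent
    completed: float  # time at which the response was received
    token: int        # fencing token returned


def is_unique(history: list[Request]) -> bool:
    """Uniqueness: no two requests returned the same token."""
    tokens = [r.token for r in history]
    return len(tokens) == len(set(tokens))


def is_monotonic(history: list[Request]) -> bool:
    """Monotonic sequence: if x completed before y began, then t_x < t_y."""
    return all(
        x.token < y.token
        for x in history
        for y in history
        if x.completed < y.began
    )


good = [Request(0.0, 1.0, 33), Request(2.0, 3.0, 34), Request(2.5, 4.0, 35)]
reordered = [Request(0.0, 1.0, 34), Request(2.0, 3.0, 33)]
duplicated = [Request(0.0, 1.0, 33), Request(0.5, 1.5, 33)]
```

Here `good` satisfies both properties, `reordered` violates the monotonic sequence property, and `duplicated` violates uniqueness. Availability is a liveness property, so it cannot be refuted by inspecting a finite history in this way.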
@ -1615,7 +1625,7 @@ TigerBeetles time abstraction allows simulations to simulate network latency
actually taking the full length of time to trigger the timeout. Such techniques allow the simulator
to explore more code paths faster.
#### The Power of Determinism {#sidebar_distributed_determinism}
Nondeterminism is at the core of all of the distributed systems challenges we discussed in this
chapter: concurrency, network delay, process pauses, clock jumps, and crashes all happen in
@ -1839,4 +1849,4 @@ problems in distributed systems.
[^131]: Rupak Majumdar and Filip Niksic. [Why is random testing effective for partition tolerance bugs?](https://dl.acm.org/doi/pdf/10.1145/3158134) *Proceedings of the ACM on Programming Languages* (PACMPL), volume 2, issue POPL, article no. 46, December 2017. [doi:10.1145/3158134](https://doi.org/10.1145/3158134)
[^132]: FoundationDB project authors. [Simulation and Testing](https://apple.github.io/foundationdb/testing.html). *apple.github.io*. Archived at [perma.cc/NQ3L-PM4C](https://perma.cc/NQ3L-PM4C)
[^133]: Alex Kladov. [Simulation Testing For Liveness](https://tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness/). *tigerbeetle.com*, July 2023. Archived at [perma.cc/RKD4-HGCR](https://perma.cc/RKD4-HGCR)
[^134]: Alfonso Subiotto Marqués. [(Mostly) Deterministic Simulation Testing in Go](https://www.polarsignals.com/blog/posts/2024/05/28/mostly-dst-in-go). *polarsignals.com*, May 2024. Archived at [perma.cc/ULD6-TSA4](https://perma.cc/ULD6-TSA4)


@ -4,23 +4,20 @@ weight: 600
breadcrumbs: false
---
## About the Author
**Martin Kleppmann** is a researcher in distributed systems at the University of Cambridge, UK.
Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure.
In the process he learned a few things the hard way, and he hopes this book will save you from repeating the same mistakes.
Martin is a regular conference speaker, blogger, and open source contributor. He believes that profound technical ideas should be accessible to everyone, and that deeper understanding will help us develop better software.
**Martin Kleppmann** is an Associate Professor at the University of Cambridge, UK, where he teaches on distributed systems and cryptographic protocols.
The first edition of *Designing Data-Intensive Applications* in 2017 established him as an authority on data systems,
and through his research on distributed systems he helped start the local-first software movement.
Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive,
where he worked on large-scale data infrastructure.
![](http://martin.kleppmann.com/2017/03/ddia-poster.jpg)
**Chris Riccomini** is a software engineer, startup investor, and author with 15+ years of experience at PayPal,
LinkedIn, and WePay. He runs Materialized View Capital, where he invests in infrastructure startups.
He is also the co-creator of Apache Samza and SlateDB,
and co-author of The Missing README: A Guide for the New Software Engineer.
## Colophon


@ -4,38 +4,33 @@ weight: 500
breadcrumbs: false
---
> Please note that the definitions in this glossary are short and simple, intended to convey the core idea but not the full subtleties of a term. For more detail, please follow the references into the main text.
### asynchronous
Not waiting for something to complete (e.g., sending data over the network to another node), and not making any assumptions about how long it is going to take. See [“Synchronous Versus Asynchronous Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_sync_async), [“Synchronous Versus Asynchronous Networks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_sync_networks), and [“System Model and Reality”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_system_model).
### atomic
1. In the context of concurrency: describing an operation that appears to take effect at a single point in time, so another concurrent process can never encounter the operation in a “half-finished” state. See also *isolation*.
2. In the context of transactions: grouping together a set of writes that must either all be committed or all be rolled back, even if faults occur. See [“Atomicity”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_acid_atomicity) and [“Two-Phase Commit (2PC)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pc).
### backpressure
Forcing the sender of some data to slow down when the recipient cannot keep up with it. Also known as *flow control*. See [“When an Overloaded System Won't Recover”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sidebar_metastable).
### batch process
A computation that takes some fixed (and usually large) set of data as input and produces some other data as output, without modifying the input. See [Chapter 11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch11.html#ch_batch).
### bounded
Having some known upper limit or size. Used for example in the context of network delay (see [“Timeouts and Unbounded Delays”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_queueing)) and datasets (see the introduction to [Chapter 12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#ch_stream)).
### Byzantine fault
A node that behaves incorrectly in some arbitrary way, for example by sending contradictory or malicious messages to other nodes. See [“Byzantine Faults”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_byzantine).
### cache
@ -43,55 +38,55 @@ A component that remembers recently used data in order to speed up future reads
### CAP theorem
A widely misunderstood theoretical result that is not useful in practice. See [“The CAP theorem”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_cap).
### causality
The dependency between events that arises when one thing “happens before” another thing in a system. For example, a later event that is in response to an earlier event, or builds upon an earlier event, or should be understood in the light of an earlier event. See [“The “happens-before” relation and concurrency”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_happens_before).
### consensus
A fundamental problem in distributed computing, concerning getting several nodes to agree on something (for example, which node should be the leader for a database cluster). The problem is much harder than it seems at first glance. See [“Consensus”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_consensus).
### data warehouse
A database in which data from several different OLTP systems has been combined and prepared to be used for analytics purposes. See [“Data Warehousing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_dwh).
### declarative
Describing the properties that something should have, but not the exact steps for how to achieve it. In the context of database queries, a query optimizer takes a declarative query and decides how it should best be executed. See [“Terminology: Declarative Query Languages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sidebar_declarative).
### denormalize
To introduce some amount of redundancy or duplication in a *normalized* dataset, typically in the form of a *cache* or *index*, in order to speed up reads. A denormalized value is a kind of precomputed query result, similar to a materialized view. See [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization).
### derived data
A dataset that is created from some other data through a repeatable process, which you could run again if necessary. Usually, derived data is needed to speed up a particular kind of read access to the data. Indexes, caches, and materialized views are examples of derived data. See [“Systems of Record and Derived Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_derived).
### deterministic
Describing a function that always produces the same output if you give it the same input. This means it cannot depend on random numbers, the time of day, network communication, or other unpredictable things. See [“The Power of Determinism”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sidebar_distributed_determinism).
### distributed
Running on several nodes connected by a network. Characterized by *partial failures*: some part of the system may be broken while other parts are still working, and it is often impossible for the software to know what exactly is broken. See [“Faults and Partial Failures”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_partial_failure).
### durable
Storing data in a way such that you believe it will not be lost, even if various faults occur. See [“Durability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_acid_durability).
### ETL
Extract–Transform–Load. The process of extracting data from a source database, transforming it into a form that is more suitable for analytic queries, and loading it into a data warehouse or batch processing system. See [“Data Warehousing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_dwh).
### failover
In systems that have a single leader, failover is the process of moving the leadership role from one node to another. See [“Handling Node Outages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_failover).
### fault-tolerant
Able to recover automatically if something goes wrong (e.g., if a machine crashes or a network link fails). See [“Reliability and Fault Tolerance”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_reliability).
### flow control
@ -99,150 +94,164 @@ See *backpressure*.
### follower
A replica that does not directly accept any writes from clients, but only processes data changes that it receives from a leader. Also known as a *secondary*, *read replica*, or *hot standby*. See [“Single-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_leader).
### full-text search
Searching text by arbitrary keywords, often with additional features such as matching similarly spelled words or synonyms. A full-text index is a kind of *secondary index* that supports such queries. See [“Full-Text Search”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_full_text).
### graph
A data structure consisting of *vertices* (things that you can refer to, also known as *nodes* or *entities*) and *edges* (connections from one vertex to another, also known as *relationships* or *arcs*). See [“Graph-Like Data Models”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_graph).
### hash
A function that turns an input into a random-looking number. The same input always returns the same number as output. Two different inputs are very likely to have two different numbers as output, although it is possible that two different inputs produce the same output (this is called a *collision*). See [“Sharding by Hash of Key”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_hash).
### idempotent
Describing an operation that can be safely retried; if it is executed more than once, it has the same effect as if it was only executed once. See [“Idempotence”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#sec_stream_idempotence).
### index
A data structure that lets you efficiently search for all records that have a particular value in a particular field. See [“Storage and Indexing for OLTP”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_oltp).
### isolation
In the context of transactions, describing the degree to which concurrently executing transactions can interfere with each other. *Serializable* isolation provides the strongest guarantees, but weaker isolation levels are also used. See [“Isolation”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_acid_isolation).
### join
To bring together records that have something in common. Most commonly used in the case where one record has a reference to another (a foreign key, a document reference, an edge in a graph) and a query needs to get the record that the reference points to. See [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization) and [“JOIN and GROUP BY”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch11.html#sec_batch_join).
### leader
When data or a service is replicated across several nodes, the leader is the designated replica that is allowed to make changes. A leader may be elected through some pro tocol, or manually chosen by an adminis trator. Also known as the *primary* or *master*. See “Leaders and Followers” on page 152.
When data or a service is replicated across several nodes, the leader is the designated replica that is allowed to make changes. A leader may be elected through some protocol, or manually chosen by an administrator. Also known as the *primary* or *source*. See [“Single-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_leader).
### linearizable
Behaving as if there was only a single copy of data in the system, which is updated by atomic operations. See “Linearizability” on page 324.
Behaving as if there was only a single copy of data in the system, which is updated by atomic operations. See [“Linearizability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_linearizability).
### locality
A performance optimization: putting sev eral pieces of data in the same place if they are frequently needed at the same time. See “Data locality for queries” on page 41.
A performance optimization: putting several pieces of data in the same place if they are frequently needed at the same time. See [“Data locality for reads and writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_document_locality).
### lock
A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See “Two-Phase Locking (2PL)” on page 257 and “The leader and the lock” on page 301.
A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See [“Two-Phase Locking (2PL)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pl) and [“Distributed Locks and Leases”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_lock_fencing).
### log
A mechanism to ensure that only one thread, node, or transaction can access something, and anyone else who wants to access the same thing must wait until the lock is released. See “Two-Phase Locking (2PL)” on page 257 and “The leader and the lock” on page 301.
An append-only file for storing data. A *write-ahead log* is used to make a storage engine resilient against crashes (see [“Making B-trees reliable”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_btree_wal)), a *log-structured* storage engine uses logs as its primary storage format (see [“Log-Structured Storage”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_log_structured)), a *replication log* is used to copy writes from a leader to followers (see [“Single-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_leader)), and an *event log* can represent a data stream (see [“Log-based Message Brokers”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#sec_stream_log)).
### materialize
To perform a computation eagerly and write out its result, as opposed to calculat ing it on demand when requested. See “Aggregation: Data Cubes and Material ized Views” on page 101 and “Materialization of Intermediate State” on page 419.
To perform a computation eagerly and write out its result, as opposed to calculating it on demand when requested. See [“Event Sourcing and CQRS”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_events).
### node
An instance of some software running on a computer, which communicates with other nodes via a network in order to accomplish some task.
### normalized
An instance of some software running on a computer, which communicates with other nodes via a network in order to accomplish some task.
Structured in such a way that there is no redundancy or duplication. In a normal ized database, when some piece of data changes, you only need to change it in one place, not many copies in many different places. See “Many-to-One and Many-to- Many Relationships” on page 33.
Structured in such a way that there is no redundancy or duplication. In a normalized database, when some piece of data changes, you only need to change it in one place, not many copies in many different places. See [“Normalization, Denormalization, and Joins”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_normalization).
### OLAP
Online analytic processing. Access pattern characterized by aggregating (e.g., count, sum, average) over a large number of records. See “Transaction Processing or Analytics?” on page 90.
Online analytic processing. Access pattern characterized by aggregating (e.g., count, sum, average) over a large number of records. See [“Operational Versus Analytical Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_analytics).
### OLTP
Online transaction processing. Access pattern characterized by fast queries that read or write a small number of records, usually indexed by key. See “Transaction Processing or Analytics?” on page 90.
Online transaction processing. Access pattern characterized by fast queries that read or write a small number of records, usually indexed by key. See [“Operational Versus Analytical Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_analytics).
### partitioning
### sharding
Splitting up a large dataset or computa tion that is too big for a single machine into smaller parts and spreading them across several machines. Also known as sharding. See Chapter 6.
Splitting up a large dataset or computation that is too big for a single machine into smaller parts and spreading them across several machines. Also known as *partitioning*. See [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding).
### percentile
A way of measuring the distribution of values by counting how many values are above or below some threshold. For example, the 95th percentile response time during some period is the time t such that 95% of requests in that period com plete in less than t, and 5% take longer than t. See “Describing Performance” on page 13.
A way of measuring the distribution of values by counting how many values are above or below some threshold. For example, the 95th percentile response time during some period is the time *t* such that 95% of requests in that period complete in less than *t*, and 5% take longer than *t*. See [“Describing Performance”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_percentiles).
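As a rough illustration of the definition above, the sketch below computes percentiles with the nearest-rank method. The function name `percentile` and the sample response times are made up for this example, and nearest-rank is only one of several common conventions: it returns the smallest observed value *t* such that at least *p*% of values are less than or equal to *t*, which is slightly looser than the strict "less than *t*" wording of the definition.

```python
import math


def percentile(values, p):
    """Nearest-rank percentile: the smallest observed value t such that
    at least p% of the values are <= t. (One convention among several.)"""
    if not values:
        raise ValueError("need at least one value")
    if not 0 < p <= 100:
        raise ValueError("p must be in the range (0, 100]")
    ordered = sorted(values)
    # 1-based rank of the value that sits at the p-th percentile.
    rank = math.ceil(p * len(ordered) / 100)
    return ordered[rank - 1]


# Hypothetical response times in milliseconds, with two slow outliers.
response_times_ms = [12, 15, 11, 240, 14, 13, 18, 16, 17, 19,
                     20, 22, 21, 25, 23, 24, 30, 28, 35, 900]

p50 = percentile(response_times_ms, 50)  # the median
p95 = percentile(response_times_ms, 95)  # dominated by the slow tail
print(p50, p95)
```

With this made-up sample the median stays small while the 95th percentile lands on one of the outliers, which is why high percentiles are often used to describe tail latency.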
### primary key
A value (typically a number or a string) that uniquely identifies a record. In many applications, primary keys are generated by the system when a record is created (e.g., sequentially or randomly); they are not usually set by users. See also secondary index.
A value (typically a number or a string) that uniquely identifies a record. In many applications, primary keys are generated by the system when a record is created (e.g., sequentially or randomly); they are not usually set by users. See also *secondary index*.
### quorum
The minimum number of nodes that need to vote on an operation before it can be considered successful. See “Quorums for reading and writing” on page 179.
The minimum number of nodes that need to vote on an operation before it can be considered successful. See [“Quorums for reading and writing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_quorum_condition).
### rebalance
To move data or services from one node to another in order to spread the load fairly. See “Rebalancing Partitions” on page 209.
To move data or services from one node to another in order to spread the load fairly. See [“Sharding of Key-Value Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_key_value).
### replication
Keeping a copy of the same data on sev eral nodes (replicas) so that it remains accessible if a node becomes unreachable. See Chapter 5.
Keeping a copy of the same data on several nodes (*replicas*) so that it remains accessible if a node becomes unreachable. See [Chapter 6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#ch_replication).
### schema
A description of the structure of some data, including its fields and datatypes. Whether some data conforms to a schema can be checked at various points in the data's lifetime (see “Schema flexibility in the document model” on page 39), and a schema can change over time (see Chap ter 4).
A description of the structure of some data, including its fields and datatypes. Whether some data conforms to a schema can be checked at various points in the data's lifetime (see [“Schema flexibility in the document model”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_schema_flexibility)), and a schema can change over time (see [Chapter 5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch05.html#ch_encoding)).
### secondary index
An additional data structure that is main tained alongside the primary data storage and which allows you to efficiently search for records that match a certain kind of condition. See “Other Indexing Struc tures” on page 85 and “Partitioning and Secondary Indexes” on page 206.
An additional data structure that is maintained alongside the primary data storage and which allows you to efficiently search for records that match a certain kind of condition. See [“Multi-Column and Secondary Indexes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#sec_storage_index_multicolumn) and [“Sharding and Secondary Indexes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_secondary_indexes).
### serializable
A guarantee that if several transactions execute concurrently, they behave the same as if they had executed one at a time, in some serial order. See “Serializability” on page 251.
An *isolation* guarantee that if several transactions execute concurrently, they behave the same as if they had executed one at a time, in some serial order. See [“Serializability”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_serializability).
### shared-nothing
An architecture in which independent nodes—each with their own CPUs, mem ory, and disks—are connected via a con ventional network, in contrast to shared- memory or shared-disk architectures. See the introduction to Part II.
An architecture in which independent nodes—each with their own CPUs, memory, and disks—are connected via a conventional network, in contrast to shared-memory or shared-disk architectures. See [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_shared_nothing).
### skew
1. Imbalanced load across partitions, such that some partitions have lots of requests or data, and others have much less. Also known as hot spots. See “Skewed Work loads and Relieving Hot Spots” on page 205 and “Handling skew” on page 407.
2. A timing anomaly that causes events to appear in an unexpected, nonsequential order. See the discussions of read skew in “Snapshot Isolation and Repeatable Read” on page 237, write skew in “Write Skew and Phantoms” on page 246, and clock skew in “Timestamps for ordering events” on page 291.
1. Imbalanced load across shards, such that some shards have lots of requests or data, and others have much less. Also known as *hot spots*. See [“Skewed Workloads and Relieving Hot Spots”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_skew).
2. A timing anomaly that causes events to appear in an unexpected, nonsequential order. See the discussions of *read skew* in [“Snapshot Isolation and Repeatable Read”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_snapshot_isolation), *write skew* in [“Write Skew and Phantoms”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_write_skew), and *clock skew* in [“Timestamps for ordering events”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_lww).
### split brain
A scenario in which two nodes simultane ously believe themselves to be the leader, and which may cause system guarantees to be violated. See “Handling Node Out ages” on page 156 and “The Truth Is Defined by the Majority” on page 300.
A scenario in which two nodes simultaneously believe themselves to be the leader, and which may cause system guarantees to be violated. See [“Handling Node Outages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_failover) and [“The Majority Rules”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_majority).
### stored procedure
A way of encoding the logic of a transac tion such that it can be entirely executed on a database server, without communi cating back and forth with a client during the transaction. See “Actual Serial Execu tion” on page 252.
A way of encoding the logic of a transaction such that it can be entirely executed on a database server, without communicating back and forth with a client during the transaction. See [“Actual Serial Execution”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_serial).
### stream process
A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See Chapter 11.
A continually running computation that consumes a never-ending stream of events as input, and derives some output from it. See [Chapter 12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch12.html#ch_stream).
### synchronous
The opposite of asynchronous.
The opposite of *asynchronous*.
### system of record
A system that holds the primary, authori tative version of some data, also known as the source of truth. Changes are first writ ten here, and other datasets may be derived from the system of record. See the introduction to Part III.
A system that holds the primary, authoritative version of some data, also known as the *source of truth*. Changes are first written here, and other datasets may be derived from the system of record. See [“Systems of Record and Derived Data”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_derived).
### timeout
One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See “Timeouts and Unbounded Delays” on page 281.
One of the simplest ways of detecting a fault, namely by observing the lack of a response within some amount of time. However, it is impossible to know whether a timeout is due to a problem with the remote node, or an issue in the network. See [“Timeouts and Unbounded Delays”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_queueing).
### total order
A way of comparing things (e.g., time stamps) that allows you to always say which one of two things is greater and which one is lesser. An ordering in which some things are incomparable (you can not say which is greater or smaller) is called a partial order. See “The causal order is not a total order” on page 341.
A way of comparing things (e.g., timestamps) that allows you to always say which one of two things is greater and which one is lesser. An ordering in which some things are incomparable (you cannot say which is greater or smaller) is called a *partial order*.
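The difference between a total and a partial order can be sketched in a few lines. The helper `comparable` below is a hypothetical name introduced for this example. It relies on the fact that Python's `<=` is a total order on integers, while on sets it means "is a subset of", which is only a partial order.

```python
import operator


def comparable(a, b, leq=operator.le):
    """Return True if a and b can be ordered relative to each other
    under the given 'less than or equal' relation."""
    return leq(a, b) or leq(b, a)


# Integer timestamps: any two values can be compared (total order).
print(comparable(3, 7))                # True

# Sets ordered by inclusion: {"x"} and {"y"} are incomparable because
# neither is a subset of the other (partial order).
print(comparable({"x"}, {"y"}))        # False
print(comparable({"x"}, {"x", "y"}))   # True
```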
### transaction
Grouping together several reads and writes into a logical unit, in order to sim plify error handling and concurrency issues. See Chapter 7.
Grouping together several reads and writes into a logical unit, in order to simplify error handling and concurrency issues. See [Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions).
### two-phase commit (2PC)
An algorithm to ensure that several data base nodes either all commit or all abort a transaction. See “Atomic Commit and Two-Phase Commit (2PC)” on page 354.
An algorithm to ensure that several database nodes either all *atomically* commit or all abort a transaction. See [“Two-Phase Commit (2PC)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pc).
### two-phase locking (2PL)
An algorithm for achieving serializable isolation that works by a transaction acquiring a lock on all data it reads or writes, and holding the lock until the end of the transaction. See “Two-Phase Lock ing (2PL)” on page 257.
An algorithm for achieving *serializable isolation* that works by a transaction acquiring a lock on all data it reads or writes, and holding the lock until the end of the transaction. See [“Two-Phase Locking (2PL)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#sec_transactions_2pl).
### unbounded
Not having any known upper limit or size. The opposite of bounded.
Not having any known upper limit or size. The opposite of *bounded*.
……

3542
content/en/indexes.md Normal file

File diff suppressed because it is too large.

View file

@ -61,12 +61,13 @@ This point will be a running theme throughout this part of the book.
We will start in [Chapter 11](/en/ch11) by examining batch-oriented dataflow systems such as MapReduce, and see how they give us good tools and principles for building large-scale data systems.
In [Chapter 12](/en/ch12) we will take those ideas and apply them to data streams, which allow us to do the same kinds of things with lower delays.
[Chapter 13](/en/ch13) concludes the book by exploring ideas about how we might use these tools to build reliable, scalable, and maintainable applications in the future.
In [Chapter 13](/en/ch13) we explore ideas about how we might use these tools to build reliable, scalable, and maintainable applications in the future.
[Chapter 14](/en/ch14) concludes the book with ethics, privacy, and the social impact of data systems.
## Index
- [11. Batch Processing](/en/ch11) (WIP)
- [12. Stream Processing](/en/ch12) (WIP)
- [13. Doing the Right Thing](/en/ch13) (WIP)
- [13. A Philosophy of Streaming Systems](/en/ch13) (WIP)
- [14. Doing the Right Thing](/en/ch14) (WIP)

View file

@ -368,22 +368,26 @@ breadcrumbs: false
## [11. Batch Processing](/en/ch11)
- [……](/en/ch11#)
- [Summary](/en/ch11#summary)
- [Summary](/en/ch11#id292)
- [References](/en/ch11#references)
## [12. Stream Processing](/en/ch12)
- [……](/en/ch12#)
- [Summary](/en/ch12#summary)
- [Summary](/en/ch12#id332)
- [References](/en/ch12#references)
## [13. Do the Right Thing](/en/ch13)
## [13. A Philosophy of Streaming Systems](/en/ch13)
- [……](/en/ch13#)
- [Summary](/en/ch13#summary)
- [Summary](/en/ch13#id367)
- [References](/en/ch13#references)
## [14. Doing the Right Thing](/en/ch14)
- [……](/en/ch14#)
- [Summary](/en/ch14#id594)
- [References](/en/ch14#references)
## [Glossary](/en/glossary)
## [Colophon](/en/colophon)
- [About the Author](/en/colophon#about-the-author)
- [Colophon](/en/colophon#colophon)

View file

@ -127,22 +127,26 @@ menu:
name: "PostgreSQL 14 内参 ↗"
url: "https://postgres-internals.cn/"
weight: 9
- identifier: pigsty
name: "Pigsty Free PG RDS ↗"
url: "https://pgsty.com/"
- identifier: pigsty-cc
name: "Pigsty:开源 PG RDS ↗"
url: "https://pigsty.cc/"
weight: 10
- identifier: pigsty-io
name: "Pigsty: Free PG RDS ↗"
url: "https://pigsty.io/"
weight: 11
- identifier: pgext
name: "PG 扩展目录 ↗"
url: "https://ext.pgsty.com/zh"
weight: 11
weight: 12
- identifier: ddia1
name: "DDIA O'reilly ↗"
url: "https://www.oreilly.com/library/view/designing-data-intensive-applications/9781491903063/"
weight: 12
weight: 13
- identifier: ddia2
name: "DDIA 2nd O'reilly ↗"
url: "https://www.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/"
weight: 13
weight: 14
params:

Binary files added (contents not shown):

static/fig/ddia_1101.png (50 KiB)
static/fig/ddia_1102.png (30 KiB)
static/fig/ddia_1103.png (33 KiB)
static/fig/ddia_1201.png (22 KiB)
static/fig/ddia_1202.png (25 KiB)
static/fig/ddia_1203.png (31 KiB)
static/fig/ddia_1204.png (16 KiB)
static/fig/ddia_1205.png (21 KiB)
static/fig/ddia_1206.png (31 KiB)
static/fig/ddia_1207.png (7.9 KiB)
static/fig/ddia_1208.png (21 KiB)
static/fig/ddia_1301.png (84 KiB)
static/fig/ddia_1302.png (102 KiB)