mirror of https://github.com/Vonng/ddia.git synced 2026-06-25 10:56:50 +08:00

fix reference summary

This commit is contained in:
Feng Ruohang 2025-08-09 16:09:53 +08:00
parent 752c2f58c7
commit 4ec385f161
14 changed files with 2811 additions and 3255 deletions


@ -252,9 +252,7 @@ the data warehouse. This process of getting data into the data warehouse is know
*transform* and *load* steps is swapped (i.e., the transformation is done in the data warehouse, *transform* and *load* steps is swapped (i.e., the transformation is done in the data warehouse,
after loading), resulting in *ELT*. after loading), resulting in *ELT*.
![ddia 0101](/fig/ddia_0101.png) {{< figure src="/fig/ddia_0101.png" id="fig_dwh_etl" title="Figure 1-1. Simplified outline of ETL into a data warehouse." class="w-full my-4" >}}
###### Figure 1-1. Simplified outline of ETL into a data warehouse.
In some cases the data sources of the ETL processes are external SaaS products such as customer In some cases the data sources of the ETL processes are external SaaS products such as customer
relationship management (CRM), email marketing, or credit card processing systems. In those cases, relationship management (CRM), email marketing, or credit card processing systems. In those cases,
@ -428,9 +426,10 @@ the other extreme are widely-used cloud services or Software as a Service (SaaS)
implemented and operated by an external vendor, and which you only access through a web interface or implemented and operated by an external vendor, and which you only access through a web interface or
API. API.
![ddia 0102](/fig/ddia_0102.png)
###### Figure 1-2. A spectrum of types of software and its operations. {{< figure src="/fig/ddia_0102.png" id="fig_cloud_spectrum" title="Figure 1-2. A spectrum of types of software and its operations." class="w-full my-4" >}}
The middle ground is off-the-shelf software (open source or commercial) that you *self-host*, i.e., The middle ground is off-the-shelf software (open source or commercial) that you *self-host*, i.e.,
deploy yourself—for example, if you download MySQL and install it on a server you control. This deploy yourself—for example, if you download MySQL and install it on a server you control. This
@ -672,7 +671,7 @@ processes you can run concurrently), which you need to know about and plan for b
Adopting a cloud service can be easier and quicker than running your own infrastructure, although Adopting a cloud service can be easier and quicker than running your own infrastructure, although
even here there is a cost in learning how to use it, and perhaps working around its limitations. even here there is a cost in learning how to use it, and perhaps working around its limitations.
Integration between different services becomes a particular challenge as a growing number of vendors Integration between different services becomes a particular challenge as a growing number of vendors
offers an ever broader range of cloud services targeting different use cases [^39][^40]. offers an ever broader range of cloud services targeting different use cases [^39] [^40].
ETL (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) is only part of the story; operational cloud services also need ETL (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) is only part of the story; operational cloud services also need
to be integrated with each other. At present, there is a lack of standards that would facilitate to be integrated with each other. At present, there is a lack of standards that would facilitate
@ -740,7 +739,7 @@ Sustainability
: If you have flexibility on where and when to run your jobs, you might be able to run them in a : If you have flexibility on where and when to run your jobs, you might be able to run them in a
time and place where plenty of renewable electricity is available, and avoid running them when the time and place where plenty of renewable electricity is available, and avoid running them when the
power grid is under strain. This can reduce your carbon emissions and allow you to take advantage power grid is under strain. This can reduce your carbon emissions and allow you to take advantage
of cheap power when it is available [^42][^43]. of cheap power when it is available [^42] [^43].
These reasons apply both to services that you write yourself (application code) and services These reasons apply both to services that you write yourself (application code) and services
consisting of off-the-shelf software (such as databases). consisting of off-the-shelf software (such as databases).
@ -962,7 +961,7 @@ whose data you are collecting and processing. There is much more to this topic;
will go deeper into the topics of ethics and legal compliance, including the problems of bias and will go deeper into the topics of ethics and legal compliance, including the problems of bias and
discrimination. discrimination.
# Summary ## Summary
The theme of this chapter has been to understand trade-offs: that is, to recognize that for many The theme of this chapter has been to understand trade-offs: that is, to recognize that for many
questions there is not one right answer, but several different approaches that each have various questions there is not one right answer, but several different approaches that each have various
@ -994,9 +993,7 @@ data is being processed—an aspect that many engineers are prone to ignoring. H
requirements into technical implementations is not yet well understood, but it’s important to keep requirements into technical implementations is not yet well understood, but it’s important to keep
this question in mind as we move through the rest of this book. this question in mind as we move through the rest of this book.
## Footnotes ### References
## References
[^1]: Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. [The Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26) [^1]: Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. [The Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26)
[^2]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737) [^2]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)

File diff suppressed because it is too large


@ -35,7 +35,7 @@ Stream processing is somewhere between online and offline/batch processing (so i
As we shall see in this chapter, batch processing is an important building block in our quest to build reliable, scalable, and maintainable applications. For example, MapReduce, a batch processing algorithm published in 2004 [1], was (perhaps over-enthusiastically) called “the algorithm that makes Google so massively scalable” [2]. It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB. As we shall see in this chapter, batch processing is an important building block in our quest to build reliable, scalable, and maintainable applications. For example, MapReduce, a batch processing algorithm published in 2004 [1], was (perhaps over-enthusiastically) called “the algorithm that makes Google so massively scalable” [2]. It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB.
MapReduce is a fairly low-level programming model compared to the parallel processing systems that were developed for data warehouses many years previously [3, 4], but it was a major step forward in terms of the scale of processing that could be achieved on commodity hardware. Although the importance of MapReduce is now declining [5], it is still worth understanding, because it provides a clear picture of why and how batch processing is useful. MapReduce is a fairly low-level programming model compared to the parallel processing systems that were developed for data warehouses many years previously [^3] [^4], but it was a major step forward in terms of the scale of processing that could be achieved on commodity hardware. Although the importance of MapReduce is now declining [5], it is still worth understanding, because it provides a clear picture of why and how batch processing is useful.
In fact, batch processing is a very old form of computing. Long before programmable digital computers were invented, punch card tabulating machines—such as the Hollerith machines used in the 1890 US Census [6]—implemented a semi-mechanized form of batch processing to compute aggregate statistics from large inputs. And MapReduce bears an uncanny resemblance to the electromechanical IBM card-sorting machines that were widely used for business data processing in the 1940s and 1950s [7]. As usual, history has a tendency of repeating itself. In fact, batch processing is a very old form of computing. Long before programmable digital computers were invented, punch card tabulating machines—such as the Hollerith machines used in the 1890 US Census [6]—implemented a semi-mechanized form of batch processing to compute aggregate statistics from large inputs. And MapReduce bears an uncanny resemblance to the electromechanical IBM card-sorting machines that were widely used for business data processing in the 1940s and 1950s [7]. As usual, history has a tendency of repeating itself.
@ -94,7 +94,7 @@ In the next chapter, we will turn to stream processing, in which the input is *u
## References ### References
1. Jeffrey Dean and Sanjay Ghemawat: “[MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/pub62/),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004. 1. Jeffrey Dean and Sanjay Ghemawat: “[MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/pub62/),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004.
1. Joel Spolsky: “[The Perils of JavaSchools](https://www.joelonsoftware.com/2005/12/29/the-perils-of-javaschools-2/),” *joelonsoftware.com*, December 29, 2005. 1. Joel Spolsky: “[The Perils of JavaSchools](https://www.joelonsoftware.com/2005/12/29/the-perils-of-javaschools-2/),” *joelonsoftware.com*, December 29, 2005.


@ -75,7 +75,7 @@ Finally, we discussed techniques for achieving fault tolerance and exactly-once
## References ### References
1. Tyler Akidau, Robert Bradshaw, Craig Chambers, et al.: “[The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 12, pages 1792–1803, August 2015. [doi:10.14778/2824032.2824076](http://dx.doi.org/10.14778/2824032.2824076) 1. Tyler Akidau, Robert Bradshaw, Craig Chambers, et al.: “[The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 12, pages 1792–1803, August 2015. [doi:10.14778/2824032.2824076](http://dx.doi.org/10.14778/2824032.2824076)
1. Harold Abelson, Gerald Jay Sussman, and Julie Sussman: [*Structure and Interpretation of Computer Programs*](https://web.archive.org/web/20220807043536/https://mitpress.mit.edu/sites/default/files/sicp/index.html), 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, available online at *mitpress.mit.edu* 1. Harold Abelson, Gerald Jay Sussman, and Julie Sussman: [*Structure and Interpretation of Computer Programs*](https://web.archive.org/web/20220807043536/https://mitpress.mit.edu/sites/default/files/sicp/index.html), 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, available online at *mitpress.mit.edu*


@ -48,7 +48,7 @@ Finally, we took a step back and examined some ethical aspects of building data-
As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal. As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal.
## References ### References
1. Rachid Belaid: “[Postgres Full-Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/),” *rachbelaid.com*, July 13, 2015. 1. Rachid Belaid: “[Postgres Full-Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/),” *rachbelaid.com*, July 13, 2015.
1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015. 1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.


@ -30,9 +30,9 @@ articulate them for your own systems:
* How to define and measure the *performance* of a system (see [“Describing Performance”](/en/ch2#sec_introduction_percentiles)); * How to define and measure the *performance* of a system (see [“Describing Performance”](/en/ch2#sec_introduction_percentiles));
* What it means for a service to be *reliable*—namely, continuing to work correctly, even when * What it means for a service to be *reliable*—namely, continuing to work correctly, even when
things go wrong (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability)); things go wrong (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability));
* Allowing a system to be *scalable* by having efficient ways of adding computing * Allowing a system to be *scalable* by having efficient ways of adding computing
capacity as the load on the system grows (see [“Scalability”](/en/ch2#sec_introduction_scalability)); and capacity as the load on the system grows (see [“Scalability”](/en/ch2#sec_introduction_scalability)); and
* Making it easier to maintain a system in the long term (see [“Maintainability”](/en/ch2#sec_introduction_maintainability)). * Making it easier to maintain a system in the long term (see [“Maintainability”](/en/ch2#sec_introduction_maintainability)).
The terminology introduced in this chapter will also be useful in the following chapters, when we go The terminology introduced in this chapter will also be useful in the following chapters, when we go
@ -70,11 +70,11 @@ query to get the home timeline for a particular user:
``` ```
SELECT posts.*, users.* FROM posts SELECT posts.*, users.* FROM posts
JOIN follows ON posts.sender_id = follows.followee_id JOIN follows ON posts.sender_id = follows.followee_id
JOIN users ON posts.sender_id = users.id JOIN users ON posts.sender_id = users.id
WHERE follows.follower_id = current_user WHERE follows.follower_id = current_user
ORDER BY posts.timestamp DESC ORDER BY posts.timestamp DESC
LIMIT 1000 LIMIT 1000
``` ```
To execute this query, the database will use the `follows` table to find everybody who To execute this query, the database will use the `follows` table to find everybody who
@ -135,32 +135,32 @@ write. The cost of writes for most users is modest, but a social network also ha
extreme cases: extreme cases:
* If a user is following a very large number of accounts, and those accounts post a lot, that user * If a user is following a very large number of accounts, and those accounts post a lot, that user
will have a high rate of writes to their materialized timeline. However, in this case it’s will have a high rate of writes to their materialized timeline. However, in this case it’s
unlikely that the user is actually reading all of the posts in their timeline, and therefore it’s unlikely that the user is actually reading all of the posts in their timeline, and therefore it’s
okay to simply drop some of their timeline writes and show the user only a sample of the posts okay to simply drop some of their timeline writes and show the user only a sample of the posts
from the accounts they’re following from the accounts they’re following
[^5]. [^5].
* When a celebrity account with a very large number of followers makes a post, we have to do a large * When a celebrity account with a very large number of followers makes a post, we have to do a large
amount of work to insert that post into the home timelines of each of their millions of followers. amount of work to insert that post into the home timelines of each of their millions of followers.
In this case it’s not okay to drop some of those writes. One way of solving this problem is to In this case it’s not okay to drop some of those writes. One way of solving this problem is to
handle celebrity posts separately from everyone else’s posts: we can save ourselves the effort of handle celebrity posts separately from everyone else’s posts: we can save ourselves the effort of
adding them to millions of timelines by storing the celebrity posts separately and merging them adding them to millions of timelines by storing the celebrity posts separately and merging them
with the materialized timeline when it is read. Despite such optimizations, handling celebrities with the materialized timeline when it is read. Despite such optimizations, handling celebrities
on a social network can require a lot of infrastructure on a social network can require a lot of infrastructure
[^6]. [^6].
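The celebrity case in the second bullet can be sketched as a merge at read time. This is a minimal illustration under stated assumptions, not how any particular social network implements it: the `Post` type, the in-memory dictionaries standing in for storage, and the function name `read_home_timeline` are all hypothetical, and every list is assumed to be sorted newest first.

```python
import heapq
from dataclasses import dataclass
from itertools import islice


@dataclass(frozen=True)
class Post:
    sender_id: int
    timestamp: int
    text: str


def read_home_timeline(user_id, materialized, celebrity_posts, follows, limit=1000):
    """Merge the precomputed timeline with celebrity posts when it is read.

    materialized:    user_id -> posts fanned out at write time (newest first)
    celebrity_posts: celebrity_id -> that celebrity's posts (newest first)
    follows:         user_id -> set of followee ids
    """
    # Posts from ordinary accounts were already inserted at write time.
    streams = [materialized.get(user_id, [])]
    # Celebrity posts were never fanned out, so fetch them per followed celebrity.
    for followee in follows.get(user_id, ()):
        if followee in celebrity_posts:
            streams.append(celebrity_posts[followee])
    # heapq.merge lazily merges inputs that are already sorted in the same order.
    merged = heapq.merge(*streams, key=lambda post: post.timestamp, reverse=True)
    return list(islice(merged, limit))
```

The design choice here is to pay a small extra cost on every read (one more sorted stream per followed celebrity) in exchange for skipping millions of timeline writes per celebrity post.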
# Describing Performance # Describing Performance
Most discussions of software performance consider two main types of metric: Most discussions of software performance consider two main types of metric:
Response time Response time
: The elapsed time from the moment when a user makes a request until they receive the requested : The elapsed time from the moment when a user makes a request until they receive the requested
answer. The unit of measurement is seconds (or milliseconds, or microseconds). answer. The unit of measurement is seconds (or milliseconds, or microseconds).
Throughput Throughput
: The number of requests per second, or the data volume per second, that the system is processing. : The number of requests per second, or the data volume per second, that the system is processing.
For a given allocation of hardware resources, there is a *maximum throughput* that can be handled. For a given allocation of hardware resources, there is a *maximum throughput* that can be handled.
The unit of measurement is “somethings per second”. The unit of measurement is “somethings per second”.
In the social network case study, “posts per second” and “timeline writes per second” are throughput In the social network case study, “posts per second” and “timeline writes per second” are throughput
metrics, whereas the “time it takes to load the home timeline” or the “time until a post is metrics, whereas the “time it takes to load the home timeline” or the “time until a post is
@ -187,24 +187,19 @@ time out and resend their request. This causes the rate of requests to increase
the problem worse—a *retry storm*. Even when the load is reduced again, such a system may remain in the problem worse—a *retry storm*. Even when the load is reduced again, such a system may remain in
an overloaded state until it is rebooted or otherwise reset. This phenomenon is called a *metastable an overloaded state until it is rebooted or otherwise reset. This phenomenon is called a *metastable
failure*, and it can cause serious outages in production systems failure*, and it can cause serious outages in production systems
[[7](/en/ch2#Bronson2021), [[^7], [^8]].
[8](/en/ch2#Brooker2021)].
To avoid retries overloading a service, you can increase and randomize the time between successive To avoid retries overloading a service, you can increase and randomize the time between successive
retries on the client side (*exponential backoff* retries on the client side (*exponential backoff*
[[9](/en/ch2#Brooker2015), [[^9], [^10]]),
[10](/en/ch2#Brooker2022backoff)]),
and temporarily stop sending requests to a service that has returned errors or timed out recently and temporarily stop sending requests to a service that has returned errors or timed out recently
(using a *circuit breaker* [[11](/en/ch2#Nygard2018), (using a *circuit breaker* [[^11], [^12]]
[12](/en/ch2#Chen2022)]
or *token bucket* algorithm [^13]). or *token bucket* algorithm [^13]).
The server can also detect when it is approaching overload and start proactively rejecting requests The server can also detect when it is approaching overload and start proactively rejecting requests
(*load shedding* [^14]), and send back (*load shedding* [^14]), and send back
responses asking clients to slow down (*backpressure* responses asking clients to slow down (*backpressure*
[[1](/en/ch2#Cvet2016), [[^1], [^15]]).
[15](/en/ch2#Sackman2016_ch2)]). The choice of queueing and load-balancing algorithms can also make a difference [^16].
The choice of queueing and load-balancing algorithms can also make a difference
[^16].
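The client-side part of this advice, increasing and randomizing the time between retries, can be sketched as follows. It is a minimal illustration and not a production client: the function name `call_with_retries`, the parameter defaults, and the choice of picking each delay uniformly between zero and a doubling upper bound are assumptions made for the example.

```python
import random
import time


def call_with_retries(request, retries=4, base=0.1, cap=10.0, sleep=time.sleep):
    """Call request(); on failure, wait a randomized, growing delay and retry.

    request: zero-argument callable that raises an exception on failure
    retries: number of retries after the first attempt
    base:    upper bound of the first delay, in seconds
    cap:     largest upper bound any delay may have, in seconds
    """
    for attempt in range(retries + 1):
        try:
            return request()
        except Exception:
            # A real client would retry only errors that are safe to retry.
            if attempt == retries:
                raise
            # The upper bound doubles with every failed attempt, up to the cap.
            upper = min(cap, base * 2 ** attempt)
            # Randomizing the delay keeps many clients from retrying in lockstep.
            sleep(random.uniform(0, upper))
```

Passing `sleep` as a parameter is only there to make the function testable without real waiting. A circuit breaker or token bucket would sit around this loop and is not shown.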
In terms of performance metrics, the response time is usually what users care about the most, In terms of performance metrics, the response time is usually what users care about the most,
whereas the throughput determines the required computing resources (e.g., how many servers you need), whereas the throughput determines the required computing resources (e.g., how many servers you need),
@ -221,15 +216,15 @@ scalability in [“Scalability”](/en/ch2#sec_introduction_scalability).
terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)): terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)):
* The *response time* is what the client sees; it includes all delays incurred anywhere in the * The *response time* is what the client sees; it includes all delays incurred anywhere in the
system. system.
* The *service time* is the duration for which the service is actively processing the user request. * The *service time* is the duration for which the service is actively processing the user request.
* *Queueing delays* can occur at several points in the flow: for example, after a request is * *Queueing delays* can occur at several points in the flow: for example, after a request is
received, it might need to wait until a CPU is available before it can be processed; a response received, it might need to wait until a CPU is available before it can be processed; a response
packet might need to be buffered before it is sent over the network if other tasks on the same packet might need to be buffered before it is sent over the network if other tasks on the same
machine are sending a lot of data via the outbound network interface. machine are sending a lot of data via the outbound network interface.
* *Latency* is a catch-all term for time during which a request is not being actively processed, * *Latency* is a catch-all term for time during which a request is not being actively processed,
i.e., during which it is *latent*. In particular, *network latency* or *network delay* refers to i.e., during which it is *latent*. In particular, *network latency* or *network delay* refers to
the time that request and response spend traveling through the network. the time that request and response spend traveling through the network.
![ddia 0204](/fig/ddia_0204.png) ![ddia 0204](/fig/ddia_0204.png)
@ -242,8 +237,7 @@ to another. You will encounter this style of diagram frequently over the course
The response time can vary significantly from one request to the next, even if you keep making the The response time can vary significantly from one request to the next, even if you keep making the
same request over and over again. Many factors can add random delays: for example, a context switch same request over and over again. Many factors can add random delays: for example, a context switch
to a background process, the loss of a network packet and TCP retransmission, a garbage collection to a background process, the loss of a network packet and TCP retransmission, a garbage collection
pause, a page fault forcing a read from disk, mechanical vibrations in the server rack pause, a page fault forcing a read from disk, mechanical vibrations in the server rack [^17],
[^17],
or many other causes. We will discuss this topic in more detail in [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing). or many other causes. We will discuss this topic in more detail in [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing).
Queueing delays often account for a large part of the variability in response times. As a server Queueing delays often account for a large part of the variability in response times. As a server
@ -291,8 +285,7 @@ directly affect users experience of the service. For example, Amazon describe
requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 requirements for internal services in terms of the 99.9th percentile, even though it only affects 1
in 1,000 requests. This is because the customers with the slowest requests are often those who have in 1,000 requests. This is because the customers with the slowest requests are often those who have
the most data on their accounts because they have made many purchases—that is, they’re the most the most data on their accounts because they have made many purchases—that is, they’re the most
valuable customers valuable customers [^19].
[^19].
It’s important to keep those customers happy by ensuring the website is fast for them. It’s important to keep those customers happy by ensuring the website is fast for them.
On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed
@ -302,23 +295,19 @@ control, and the benefits are diminishing.
# The user impact of response times # The user impact of response times
It seems intuitively obvious that a fast service is better for users than a slow service It seems intuitively obvious that a fast service is better for users than a slow service [^20].
[^20].
However, it is surprisingly difficult to get hold of reliable data to quantify the effect that However, it is surprisingly difficult to get hold of reliable data to quantify the effect that
latency has on user behavior. latency has on user behavior.
Some often-cited statistics are unreliable. In 2006 Google reported that a slowdown in search Some often-cited statistics are unreliable. In 2006 Google reported that a slowdown in search
results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue [^21].
[^21].
However, another Google study from 2009 reported that a 400 ms increase in latency resulted in However, another Google study from 2009 reported that a 400 ms increase in latency resulted in
only 0.6% fewer searches per day only 0.6% fewer searches per day [^22],
[^22],
and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3% and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3%
[^23]. [^23].
Newer data from these companies appears not to be publicly available. Newer data from these companies appears not to be publicly available.
A more recent Akamai study A more recent Akamai study [^24]
[^24]
claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites
by up to 7%; however, on closer inspection, the same study reveals that very *fast* page load times by up to 7%; however, on closer inspection, the same study reveals that very *fast* page load times
are also correlated with lower conversion rates! This seemingly paradoxical result is explained by are also correlated with lower conversion rates! This seemingly paradoxical result is explained by
@ -326,8 +315,7 @@ the fact that the pages that load fastest are often those that have no useful co
error pages). However, since the study makes no effort to separate the effects of page content from error pages). However, since the study makes no effort to separate the effects of page content from
the effects of load time, its results are probably not meaningful. the effects of load time, its results are probably not meaningful.
A study by Yahoo A study by Yahoo [^25]
[^25]
compares click-through rates on fast-loading versus slow-loading search results, controlling for compares click-through rates on fast-loading versus slow-loading search results, controlling for
quality of search results. It finds 20–30% more clicks on fast searches when the difference between quality of search results. It finds 20–30% more clicks on fast searches when the difference between
fast and slow responses is 1.25 seconds or more. fast and slow responses is 1.25 seconds or more.
@ -348,15 +336,13 @@ end-user requests end up being slow (an effect known as *tail latency amplificat
###### Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request. ###### Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.
Percentiles are often used in *service level objectives* (SLOs) and *service level agreements* Percentiles are often used in *service level objectives* (SLOs) and *service level agreements*
(SLAs) as ways of defining the expected performance and availability of a service (SLAs) as ways of defining the expected performance and availability of a service [^27].
[^27].
For example, an SLO may set a target for a service to have a median response time of less than For example, an SLO may set a target for a service to have a median response time of less than
200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests 200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests
result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not
met (for example, customers may be entitled to a refund). That is the basic idea, at least; in met (for example, customers may be entitled to a refund). That is the basic idea, at least; in
practice, defining good availability metrics for SLOs and SLAs is not straightforward practice, defining good availability metrics for SLOs and SLAs is not straightforward
[[28](/en/ch2#Mogul2019), [[^28], [^29]].
[29](/en/ch2#Hauer2020)].
# Computing percentiles # Computing percentiles
@ -369,10 +355,8 @@ The simplest implementation is to keep a list of response times for all requests
window and to sort that list every minute. If that is too inefficient for you, there are algorithms window and to sort that list every minute. If that is too inefficient for you, there are algorithms
that can calculate a good approximation of percentiles at minimal CPU and memory cost. that can calculate a good approximation of percentiles at minimal CPU and memory cost.
Open source percentile estimation libraries include HdrHistogram, Open source percentile estimation libraries include HdrHistogram,
t-digest [[30](/en/ch2#Dunning2021), t-digest [[^30], [^31]],
[31](/en/ch2#Kohn2021)], OpenHistogram [^32], and DDSketch [^33].
OpenHistogram [^32], and DDSketch
[^33].
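The "keep a list and sort it" approach described above can be sketched as follows. This is a minimal illustration, not how any of the named libraries work: the class name `RollingPercentiles`, the `window_seconds` parameter and its default, and the choice of the nearest-rank definition of a percentile are all assumptions made for this example.

```python
import math
from collections import deque


class RollingPercentiles:
    """Naive percentile calculation over a sliding time window (hypothetical helper)."""

    def __init__(self, window_seconds=600):
        # Window length is an assumed default, chosen only for illustration.
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, response_time_ms), oldest first

    def record(self, timestamp, response_time_ms):
        # Timestamps are assumed to arrive in non-decreasing order.
        self.samples.append((timestamp, response_time_ms))

    def percentile(self, now, p):
        # Discard samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_seconds:
            self.samples.popleft()
        if not self.samples:
            return None
        # Sort the remaining response times and pick the nearest-rank value.
        ordered = sorted(t for _, t in self.samples)
        rank = max(1, math.ceil(p * len(ordered) / 100))
        return ordered[rank - 1]
```

Calling `percentile(now, 50)` gives the median and `percentile(now, 99)` the 99th percentile for the current window. Sorting on every call costs O(n log n) time and keeps every sample in memory, which is the inefficiency that approximation algorithms are meant to avoid.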
Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from
several machines, is mathematically meaningless—the right way of aggregating response time data several machines, is mathematically meaningless—the right way of aggregating response time data
@ -391,18 +375,16 @@ software, typical expectations include:
If all those things together mean “working correctly,” then we can understand *reliability* as If all those things together mean “working correctly,” then we can understand *reliability* as
meaning, roughly, “continuing to work correctly, even when things go wrong.” To be more precise meaning, roughly, “continuing to work correctly, even when things go wrong.” To be more precise
about things going wrong, we will distinguish between *faults* and *failures* about things going wrong, we will distinguish between *faults* and *failures*
[[35](/en/ch2#Heimerdinger1992), [[^35], [^36], [^37]]:
[36](/en/ch2#Gaertner1999),
[37](/en/ch2#Avizienis2004)]:
Fault Fault
: A fault is when a particular *part* of a system stops working correctly: for example, if a : A fault is when a particular *part* of a system stops working correctly: for example, if a
single hard drive malfunctions, or a single machine crashes, or an external service (that the single hard drive malfunctions, or a single machine crashes, or an external service (that the
system depends on) has an outage. system depends on) has an outage.
Failure Failure
: A failure is when the system *as a whole* stops providing the required service to the user; in : A failure is when the system *as a whole* stops providing the required service to the user; in
other words, when it does not meet the service level objective (SLO). other words, when it does not meet the service level objective (SLO).
The distinction between fault and failure can be confusing because they are the same thing, just at The distinction between fault and failure can be confusing because they are the same thing, just at
different levels. For example, if a hard drive stops working, we say that the hard drive has failed: different levels. For example, if a hard drive stops working, we say that the hard drive has failed:
@ -438,8 +420,7 @@ handling [^38]; by deliberately inducing faults, you ensure
that the fault-tolerance machinery is continually exercised and tested, which can increase your that the fault-tolerance machinery is continually exercised and tested, which can increase your
confidence that faults will be handled correctly when they occur naturally. *Chaos engineering* is confidence that faults will be handled correctly when they occur naturally. *Chaos engineering* is
a discipline that aims to improve confidence in fault-tolerance mechanisms through experiments such a discipline that aims to improve confidence in fault-tolerance mechanisms through experiments such
as deliberately injecting faults as deliberately injecting faults [^39].
[^39].
Although we generally prefer tolerating faults over preventing faults, there are cases where Although we generally prefer tolerating faults over preventing faults, there are cases where
prevention is better than cure (e.g., because no cure exists). This is the case with security prevention is better than cure (e.g., because no cure exists). This is the case with security
@ -452,48 +433,34 @@ cured, as described in the following sections.
When we think of causes of system failure, hardware faults quickly come to mind: When we think of causes of system failure, hardware faults quickly come to mind:
* Approximately 2–5% of magnetic hard drives fail per year * Approximately 2–5% of magnetic hard drives fail per year
[[40](/en/ch2#Pinheiro2007), [[^40],
[41](/en/ch2#Schroeder2007)]; [^41]];
in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day. in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
Recent data suggests that disks are getting more reliable, but failure rates remain significant Recent data suggests that disks are getting more reliable, but failure rates remain significant
[^42]. [^42].
* Approximately 0.5–1% of solid state drives (SSDs) fail per year * Approximately 0.5–1% of solid state drives (SSDs) fail per year
[^43]. [^43].
Small numbers of bit errors are corrected automatically Small numbers of bit errors are corrected automatically
[^44], [^44],
but uncorrectable errors occur approximately once per year per drive, even in drives that are but uncorrectable errors occur approximately once per year per drive, even in drives that are
fairly new (i.e., that have experienced little wear); this error rate is higher than that of fairly new (i.e., that have experienced little wear); this error rate is higher than that of
magnetic hard drives magnetic hard drives
[[45](/en/ch2#Schroeder2016_ch2), [[^45],
[46](/en/ch2#Alter2019)]. [^46]].
* Other hardware components such as power supplies, RAID controllers, and memory modules also fail, * Other hardware components such as power supplies, RAID controllers, and memory modules also fail,
although less frequently than hard drives although less frequently than hard drives [^47] [^48].
[[47](/en/ch2#Ford2010),
[48](/en/ch2#Vishwanath2010)].
* Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result, * Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result,
likely due to manufacturing defects likely due to manufacturing defects [^49] [^50] [^51]. In some cases, an erroneous computation leads to a crash, but in other cases it leads to a program simply returning the wrong result.
[[49](/en/ch2#Hochschild2021),
[50](/en/ch2#Dixit2021),
[51](/en/ch2#Behrens2015)].
In some cases, an erroneous computation leads to a crash, but in other cases it leads to a program
simply returning the wrong result.
* Data in RAM can also be corrupted, either due to random events such as cosmic rays, or due to * Data in RAM can also be corrupted, either due to random events such as cosmic rays, or due to
permanent physical defects. Even when memory with error-correcting codes (ECC) is used, more than permanent physical defects. Even when memory with error-correcting codes (ECC) is used, more than
1% of machines encounter an uncorrectable error in a given year, which typically leads to a crash 1% of machines encounter an uncorrectable error in a given year, which typically leads to a crash
of the machine and the affected memory module needing to be replaced of the machine and the affected memory module needing to be replaced [^52].
[^52]. Moreover, certain pathological memory access patterns can flip bits with high probability [^53].
Moreover, certain pathological memory access patterns can flip bits with high probability
[^53].
* An entire datacenter might become unavailable (for example, due to power outage or network * An entire datacenter might become unavailable (for example, due to power outage or network
misconfiguration) or even be permanently destroyed (for example by fire, flood, or earthquake misconfiguration) or even be permanently destroyed (for example by fire, flood, or earthquake [^54]).
[^54]). A solar storm, which induces large electrical currents in long-distance wires when the sun ejects
A solar storm, which induces large electrical currents in long-distance wires when the sun ejects a large mass of charged particles, could damage power grids and undersea network cables [^55].
a large mass of charged particles, could damage power grids and undersea network cables Although such large-scale failures are rare, their impact can be catastrophic if a service cannot tolerate the loss of a datacenter [^56].
[^55].
Although such large-scale failures are rare, their impact can be catastrophic if a service cannot
tolerate the loss of a datacenter
[^56].
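The "one disk failure per day" estimate in the first bullet can be checked with a short calculation. The sketch below assumes an annual failure rate of 2 to 5% and that failures are spread evenly over the year; the function name `expected_failures_per_day` is made up for this example.

```python
def expected_failures_per_day(num_disks, annual_failure_rate):
    """Average number of disk failures per day, assuming failures are evenly spread."""
    return num_disks * annual_failure_rate / 365


low = expected_failures_per_day(10_000, 0.02)   # 2% of disks fail per year
high = expected_failures_per_day(10_000, 0.05)  # 5% of disks fail per year
print(f"{low:.2f} to {high:.2f} disk failures per day")
```

With 10,000 disks this works out to roughly 0.5 to 1.4 failures per day, which is consistent with expecting about one failure per day on average.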
These events are rare enough that you often don't need to worry about them when working on a small These events are rare enough that you often don't need to worry about them when working on a small
system, as long as you can easily replace hardware that becomes faulty. However, in a large-scale system, as long as you can easily replace hardware that becomes faulty. However, in a large-scale
@ -510,10 +477,7 @@ running uninterrupted for years.
Redundancy is most effective when component faults are independent, that is, the occurrence of one Redundancy is most effective when component faults are independent, that is, the occurrence of one
fault does not change how likely it is that another fault will occur. However, experience has shown fault does not change how likely it is that another fault will occur. However, experience has shown
that there are often significant correlations between component failures that there are often significant correlations between component failures [^41] [^57] [^58];
[[41](/en/ch2#Schroeder2007),
[57](/en/ch2#Han2021),
[58](/en/ch2#Nightingale2011)];
unavailability of an entire server rack or an entire datacenter still happens more often than we unavailability of an entire server rack or an entire datacenter still happens more often than we
would like. would like.
@ -543,40 +507,30 @@ upgrade*, and we will discuss it further in [Chapter 5](/en/ch5#ch_encoding).
Although hardware failures can be weakly correlated, they are still mostly independent: for Although hardware failures can be weakly correlated, they are still mostly independent: for
example, if one disk fails, it's likely that other disks in the same machine will be fine for example, if one disk fails, it's likely that other disks in the same machine will be fine for
another while. On the other hand, software faults are often very highly correlated, because it is another while. On the other hand, software faults are often very highly correlated, because it is
common for many nodes to run the same software and thus have the same bugs common for many nodes to run the same software and thus have the same bugs [^59] [^60].
[[59](/en/ch2#Gunawi2014),
[60](/en/ch2#Kreps2012_ch1)].
Such faults are harder to anticipate, and they tend to cause many more system failures than Such faults are harder to anticipate, and they tend to cause many more system failures than
uncorrelated hardware faults [^47]. For example: uncorrelated hardware faults [^47]. For example:
* A software bug that causes every node to fail at the same time in particular circumstances. For * A software bug that causes every node to fail at the same time in particular circumstances. For
example, on June 30, 2012, a leap second caused many Java applications to hang simultaneously due example, on June 30, 2012, a leap second caused many Java applications to hang simultaneously due
to a bug in the Linux kernel, bringing down many Internet services to a bug in the Linux kernel, bringing down many Internet services [^61].
[^61]. Due to a firmware bug, all SSDs of certain models suddenly fail after precisely 32,768 hours of
Due to a firmware bug, all SSDs of certain models suddenly fail after precisely 32,768 hours of operation (less than 4 years), rendering the data on them unrecoverable [^62].
operation (less than 4 years), rendering the data on them unrecoverable
[^62].
* A runaway process that uses up some shared, limited resource, such as CPU time, memory, disk * A runaway process that uses up some shared, limited resource, such as CPU time, memory, disk
space, network bandwidth, or threads space, network bandwidth, or threads [^63]. For example, a process that consumes too much memory while processing a large request may be
[^63]. killed by the operating system. A bug in a client library could cause a much higher request
For example, a process that consumes too much memory while processing a large request may be volume than anticipated [^64].
killed by the operating system. A bug in a client library could cause a much higher request
volume than anticipated [^64].
* A service that the system depends on slows down, becomes unresponsive, or starts returning * A service that the system depends on slows down, becomes unresponsive, or starts returning
corrupted responses. corrupted responses.
* An interaction between different systems results in emergent behavior that does not occur when * An interaction between different systems results in emergent behavior that does not occur when
each system was tested in isolation [^65]. each system was tested in isolation [^65].
* Cascading failures, where a problem in one component causes another component to become overloaded * Cascading failures, where a problem in one component causes another component to become overloaded
and slow down, which in turn brings down another component and slow down, which in turn brings down another component [^66] [^67].
[[66](/en/ch2#Ulrich2016),
[67](/en/ch2#Fassbender2022)].
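The SSD firmware example in the list above says that 32,768 hours is "less than 4 years". The short calculation below confirms this. It also shows that the figure equals 2^15; reading this as a sign of a 16-bit counter overflow is an assumption on our part, not something the text states.

```python
HOURS_PER_YEAR = 24 * 365  # ignoring leap years

failure_hours = 32_768
years = failure_hours / HOURS_PER_YEAR
print(f"{years:.2f} years of continuous operation")

# 32,768 is exactly 2**15, one more than the largest signed 16-bit integer.
is_power_of_two = failure_hours == 2 ** 15
print(is_power_of_two)
```

Drives of the same model that were installed together reach this threshold at the same moment, so they all fail together. This is what makes the fault highly correlated.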
The bugs that cause these kinds of software faults often lie dormant for a long time until they are The bugs that cause these kinds of software faults often lie dormant for a long time until they are
triggered by an unusual set of circumstances. In those circumstances, it is revealed that the triggered by an unusual set of circumstances. In those circumstances, it is revealed that the
software is making some kind of assumption about its environment—and while that assumption is software is making some kind of assumption about its environment—and while that assumption is
usually true, it eventually stops being true for some reason usually true, it eventually stops being true for some reason [^68] [^69].
[[68](/en/ch2#Cook2000),
[69](/en/ch2#Woods2017)].
There is no quick solution to the problem of systematic faults in software. Lots of small things can There is no quick solution to the problem of systematic faults in software. Lots of small things can
help: carefully thinking about assumptions and interactions in the system; thorough testing; process help: carefully thinking about assumptions and interactions in the system; thorough testing; process
@ -590,8 +544,7 @@ human. Unlike machines, humans don't just follow rules; their strength is bein
adaptive in getting their job done. However, this characteristic also leads to unpredictability, and adaptive in getting their job done. However, this characteristic also leads to unpredictability, and
sometimes mistakes that can lead to failures, despite best intentions. For example, one study of sometimes mistakes that can lead to failures, despite best intentions. For example, one study of
large internet services found that configuration changes by operators were the leading cause of large internet services found that configuration changes by operators were the leading cause of
outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages [^70].
[^70].
It is tempting to label such problems as “human error” and to wish that they could be solved by It is tempting to label such problems as “human error” and to wish that they could be solved by
better controlling human behavior through tighter procedures and compliance with rules. However, better controlling human behavior through tighter procedures and compliance with rules. However,
@ -602,8 +555,7 @@ Often complex systems have emergent behavior, in which unexpected interactions b
may also lead to failures [^72]. may also lead to failures [^72].
Various technical measures can help minimize the impact of human mistakes, including thorough Various technical measures can help minimize the impact of human mistakes, including thorough
testing (both hand-written tests and *property testing* on lots of random inputs) testing (both hand-written tests and *property testing* on lots of random inputs) [^38], rollback mechanisms for quickly
[^38], rollback mechanisms for quickly
reverting configuration changes, gradual roll-outs of new code, detailed and clear monitoring, reverting configuration changes, gradual roll-outs of new code, detailed and clear monitoring,
observability tools for diagnosing production issues (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)), observability tools for diagnosing production issues (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)),
and well-designed interfaces that encourage “the right thing” and discourage “the wrong thing”. and well-designed interfaces that encourage “the right thing” and discourage “the wrong thing”.
@ -627,8 +579,7 @@ As a general principle, when investigating an incident, you should be suspicious
answers. “Bob should have been more careful when deploying that change” is not productive, but answers. “Bob should have been more careful when deploying that change” is not productive, but
neither is “We must rewrite the backend in Haskell.” Instead, management should take the opportunity neither is “We must rewrite the backend in Haskell.” Instead, management should take the opportunity
to learn the details of how the sociotechnical system works from the point of view of the people who to learn the details of how the sociotechnical system works from the point of view of the people who
work with it every day, and take steps to improve it based on this feedback work with it every day, and take steps to improve it based on this feedback [^71].
[^71].
# How Important Is Reliability? # How Important Is Reliability?
@ -637,11 +588,9 @@ are also expected to work reliably. Bugs in business applications cause lost pro
risks if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in risks if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in
terms of lost revenue and damage to reputation. terms of lost revenue and damage to reputation.
In many applications, a temporary outage of a few minutes or even a few hours is tolerable In many applications, a temporary outage of a few minutes or even a few hours is tolerable [^74],
[^74],
but permanent data loss or corruption would be catastrophic. Consider a parent who stores all their but permanent data loss or corruption would be catastrophic. Consider a parent who stores all their
pictures and videos of their children in your photo application pictures and videos of their children in your photo application [^75]. How would they
[^75]. How would they
feel if that database was suddenly corrupted? Would they know how to restore it from a backup? feel if that database was suddenly corrupted? Would they know how to restore it from a backup?
As another example of how unreliable software can harm people, consider the Post Office Horizon As another example of how unreliable software can harm people, consider the Post Office Horizon
@ -651,8 +600,7 @@ Eventually it became clear that many of these shortfalls were due to bugs in the
convictions have since been overturned [^76]. convictions have since been overturned [^76].
What led to this, probably the largest miscarriage of justice in British history, is the fact that What led to this, probably the largest miscarriage of justice in British history, is the fact that
English law assumes that computers operate correctly (and hence, evidence produced by computers is English law assumes that computers operate correctly (and hence, evidence produced by computers is
reliable) unless there is evidence to the contrary reliable) unless there is evidence to the contrary [^77].
[^77].
Software engineers may laugh at the idea that software could ever be bug-free, but this is little Software engineers may laugh at the idea that software could ever be bug-free, but this is little
solace to the people who were wrongfully imprisoned, declared bankrupt, or even committed suicide as solace to the people who were wrongfully imprisoned, declared bankrupt, or even committed suicide as
a result of a wrongful conviction due to an unreliable computer system. a result of a wrongful conviction due to an unreliable computer system.
@ -714,9 +662,9 @@ Once you have described the load on your system, you can investigate what happen
increases. You can look at it in two ways: increases. You can look at it in two ways:
* When you increase the load in a certain way and keep the system resources (CPUs, memory, network * When you increase the load in a certain way and keep the system resources (CPUs, memory, network
bandwidth, etc.) unchanged, how is the performance of your system affected? bandwidth, etc.) unchanged, how is the performance of your system affected?
* When you increase the load in a certain way, how much do you need to increase the resources if you * When you increase the load in a certain way, how much do you need to increase the resources if you
want to keep performance unchanged? want to keep performance unchanged?
Usually our goal is to keep the performance of the system within the requirements of the SLA Usually our goal is to keep the performance of the system within the requirements of the SLA
(see [“Use of Response Time Metrics”](/en/ch2#sec_introduction_slo_sla)) while also minimizing the cost of running the system. The greater (see [“Use of Response Time Metrics”](/en/ch2#sec_introduction_slo_sla)) while also minimizing the cost of running the system. The greater
@ -728,8 +676,7 @@ If you can double the resources in order to handle twice the load, while keeping
same, we say that you have *linear scalability*, and this is considered a good thing. Occasionally same, we say that you have *linear scalability*, and this is considered a good thing. Occasionally
it is possible to handle twice the load with less than double the resources, due to economies of it is possible to handle twice the load with less than double the resources, due to economies of
scale or a better distribution of peak load scale or a better distribution of peak load
[[79](/en/ch2#Warfield2023_ch2), [[^79], [^80]].
[80](/en/ch2#Brooker2023multitenancy)].
Much more likely is that the cost grows faster than linearly, and there may be many reasons for the Much more likely is that the cost grows faster than linearly, and there may be many reasons for the
inefficiency. For example, if you have a lot of data, then processing a single write request may inefficiency. For example, if you have a lot of data, then processing a single write request may
involve more work than if you have a small amount of data, even if the size of the request is the involve more work than if you have a small amount of data, even if the size of the request is the
@ -753,8 +700,7 @@ Another approach is the *shared-disk architecture*, which uses several machines
CPUs and RAM, but which stores data on an array of disks that is shared between the machines, which CPUs and RAM, but which stores data on an array of disks that is shared between the machines, which
are connected via a fast network: *Network-Attached Storage* (NAS) or *Storage Area Network* (SAN). are connected via a fast network: *Network-Attached Storage* (NAS) or *Storage Area Network* (SAN).
This architecture has traditionally been used for on-premises data warehousing workloads, but This architecture has traditionally been used for on-premises data warehousing workloads, but
contention and the overhead of locking limit the scalability of the shared-disk approach contention and the overhead of locking limit the scalability of the shared-disk approach [^81].
[^81].
By contrast, the *shared-nothing architecture* By contrast, the *shared-nothing architecture*
[^82] [^82]
@ -796,8 +742,7 @@ operate largely independently from each other. This is the underlying principle
(see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing (see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing
([Link to Come]), and shared-nothing architectures. However, the challenge is in knowing where to ([Link to Come]), and shared-nothing architectures. However, the challenge is in knowing where to
draw the line between things that should be together, and things that should be apart. Design draw the line between things that should be together, and things that should be apart. Design
guidelines for microservices can be found in other books guidelines for microservices can be found in other books [^84],
[^84],
and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding). and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding).
Another good principle is not to make things more complicated than necessary. If a single-machine Another good principle is not to make things more complicated than necessary. If a single-machine
@ -817,8 +762,7 @@ bugs that need fixing.
It is widely recognized that the majority of the cost of software is not in its initial development, It is widely recognized that the majority of the cost of software is not in its initial development,
but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures, but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures,
adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding
new features [[85](/en/ch2#Ensmenger2016), new features [[^85], [^86]].
[86](/en/ch2#Glass2002)].
However, maintenance is also difficult. If a system has been successfully running for a long time, However, maintenance is also difficult. If a system has been successfully running for a long time,
it may well use outdated technologies that not many engineers understand today (such as mainframes it may well use outdated technologies that not many engineers understand today (such as mainframes
@ -835,15 +779,15 @@ which decisions might create maintenance headaches in the future, in this book w
to several principles that are widely applicable: to several principles that are widely applicable:
Operability Operability
: Make it easy for the organization to keep the system running smoothly. : Make it easy for the organization to keep the system running smoothly.
Simplicity Simplicity
: Make it easy for new engineers to understand the system, by implementing it using well-understood, : Make it easy for new engineers to understand the system, by implementing it using well-understood,
consistent patterns and structures, and avoiding unnecessary complexity. consistent patterns and structures, and avoiding unnecessary complexity.
Evolvability Evolvability
: Make it easy for engineers to make changes to the system in the future, adapting it and extending : Make it easy for engineers to make changes to the system in the future, adapting it and extending
it for unanticipated use cases as requirements change. it for unanticipated use cases as requirements change.
## Operability: Making Life Easy for Operations ## Operability: Making Life Easy for Operations
@ -857,8 +801,7 @@ In large-scale systems consisting of many thousands of machines, manual maintena
unreasonably expensive, and automation is essential. However, automation can be a two-edged sword: unreasonably expensive, and automation is essential. However, automation can be a two-edged sword:
there will always be edge cases (such as rare failure scenarios) that require manual intervention there will always be edge cases (such as rare failure scenarios) that require manual intervention
from the operations team. Since the cases that cannot be handled automatically are the most complex from the operations team. Since the cases that cannot be handled automatically are the most complex
issues, greater automation requires a *more* skilled operations team that can resolve those issues issues, greater automation requires a *more* skilled operations team that can resolve those issues [^88].
[^88].
Moreover, if an automated system goes wrong, it is often harder to troubleshoot than a system that Moreover, if an automated system goes wrong, it is often harder to troubleshoot than a system that
relies on an operator to perform some actions manually. For that reason, it is not the case that relies on an operator to perform some actions manually. For that reason, it is not the case that
@ -866,15 +809,14 @@ more automation is always better for operability. However, some amount of automa
and the sweet spot will depend on the specifics of your particular application and organization. and the sweet spot will depend on the specifics of your particular application and organization.
Good operability means making routine tasks easy, allowing the operations team to focus their efforts Good operability means making routine tasks easy, allowing the operations team to focus their efforts
on high-value activities. Data systems can do various things to make routine tasks easy, including on high-value activities. Data systems can do various things to make routine tasks easy, including [^89]:
[^89]:
* Allowing monitoring tools to check the system's key metrics, and supporting observability tools * Allowing monitoring tools to check the system's key metrics, and supporting observability tools
(see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)) to give insights into the system's runtime behavior. (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)) to give insights into the system's runtime behavior.
A variety of commercial and open source tools can help here A variety of commercial and open source tools can help here
[^90]. [^90].
* Avoiding dependency on individual machines (allowing machines to be taken down for maintenance * Avoiding dependency on individual machines (allowing machines to be taken down for maintenance
while the system as a whole continues running uninterrupted) while the system as a whole continues running uninterrupted)
* Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”) * Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
* Providing good default behavior, but also giving administrators the freedom to override defaults when needed * Providing good default behavior, but also giving administrators the freedom to override defaults when needed
* Self-healing where appropriate, but also giving administrators manual control over the system state when needed * Self-healing where appropriate, but also giving administrators manual control over the system state when needed
@ -891,15 +833,13 @@ project mired in complexity is sometimes described as a *big ball of mud*
When complexity makes maintenance hard, budgets and schedules are often overrun. In complex When complexity makes maintenance hard, budgets and schedules are often overrun. In complex
software, there is also a greater risk of introducing bugs when making a change: when the system is software, there is also a greater risk of introducing bugs when making a change: when the system is
harder for developers to understand and reason about, hidden assumptions, unintended consequences, harder for developers to understand and reason about, hidden assumptions, unintended consequences,
and unexpected interactions are more easily overlooked and unexpected interactions are more easily overlooked [^69].
[^69].
Conversely, reducing complexity greatly improves the maintainability of software, and thus Conversely, reducing complexity greatly improves the maintainability of software, and thus
simplicity should be a key goal for the systems we build. simplicity should be a key goal for the systems we build.
Simple systems are easier to understand, and therefore we should try to solve a given problem in the Simple systems are easier to understand, and therefore we should try to solve a given problem in the
simplest way possible. Unfortunately, this is easier said than done. Whether something is simple or simplest way possible. Unfortunately, this is easier said than done. Whether something is simple or
not is often a subjective matter of taste, as there is no objective standard of simplicity not is often a subjective matter of taste, as there is no objective standard of simplicity [^92].
[^92].
For example, one system may hide a complex implementation behind a simple interface, whereas another For example, one system may hide a complex implementation behind a simple interface, whereas another
may have a simple implementation that exposes more internal detail to its users—which one is may have a simple implementation that exposes more internal detail to its users—which one is
simpler? simpler?
@ -952,13 +892,12 @@ different word to refer to agility on a data system level: *evolvability*
[^97]. [^97].
One major factor that makes change difficult in large systems is when some action is irreversible, One major factor that makes change difficult in large systems is when some action is irreversible,
and therefore that action needs to be taken very carefully and therefore that action needs to be taken very carefully [^98].
[^98].
For example, say you are migrating from one database to another: if you cannot switch back to the For example, say you are migrating from one database to another: if you cannot switch back to the
old system in case of problems with the new one, the stakes are much higher than if you can easily go old system in case of problems with the new one, the stakes are much higher than if you can easily go
back. Minimizing irreversibility improves flexibility. back. Minimizing irreversibility improves flexibility.
# Summary ## Summary
In this chapter we examined several examples of nonfunctional requirements: performance, In this chapter we examined several examples of nonfunctional requirements: performance,
reliability, scalability, and maintainability. Through these topics we have also encountered reliability, scalability, and maintainability. Through these topics we have also encountered
@ -986,8 +925,7 @@ There are no easy answers on how to achieve these things, but one thing that can
applications using well-understood building blocks that provide useful abstractions. The rest of applications using well-understood building blocks that provide useful abstractions. The rest of
this book will cover a selection of building blocks that have proved to be valuable in practice. this book will cover a selection of building blocks that have proved to be valuable in practice.
##### References ### Summary
[^1]: Mike Cvet. [How We Learned to Stop Worrying and Love Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016. [^1]: Mike Cvet. [How We Learned to Stop Worrying and Love Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016.
[^2]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK) [^2]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK)

File diff suppressed because it is too large


@ -45,11 +45,11 @@ Consider the world’s simplest database, implemented as two Bash functions:
#!/bin/bash #!/bin/bash
db_set () { db_set () {
echo "$1,$2" >> database echo "$1,$2" >> database
} }
db_get () { db_get () {
grep "^$1," database | sed -e "s/^$1,//" | tail -n 1 grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
} }
``` ```
@ -123,8 +123,7 @@ possible write operation. Any kind of index usually slows down writes, because t
to be updated every time data is written. to be updated every time data is written.
This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but
every index consumes additional disk space and slows down writes, sometimes substantially every index consumes additional disk space and slows down writes, sometimes substantially [^1].
[^1].
For this reason, databases don’t usually index everything by default, but require you—the person For this reason, databases don’t usually index everything by default, but require you—the person
writing the application or administering the database—to choose indexes manually, using your writing the application or administering the database—to choose indexes manually, using your
knowledge of the application’s typical query patterns. You can then choose the indexes that give knowledge of the application’s typical query patterns. You can then choose the indexes that give
@ -149,16 +148,16 @@ is already in the filesystem cache, a read doesn’t require any disk I/O at all
This approach is much faster, but it still suffers from several problems: This approach is much faster, but it still suffers from several problems:
* You never free up disk space occupied by old log entries that have been overwritten; if you keep * You never free up disk space occupied by old log entries that have been overwritten; if you keep
writing to the database you might run out of disk space. writing to the database you might run out of disk space.
* The hash map is not persisted, so you have to rebuild it when you restart the database—for * The hash map is not persisted, so you have to rebuild it when you restart the database—for
example, by scanning the whole log file to find the latest byte offset for each key. This makes example, by scanning the whole log file to find the latest byte offset for each key. This makes
restarts slow if you have a lot of data. restarts slow if you have a lot of data.
* The hash table must fit in memory. In principle, you could maintain a hash table on disk, but * The hash table must fit in memory. In principle, you could maintain a hash table on disk, but
unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of
random access I/O, it is expensive to grow when it becomes full, and hash collisions require random access I/O, it is expensive to grow when it becomes full, and hash collisions require
fiddly logic [^2]. fiddly logic [^2].
* Range queries are not efficient. For example, you cannot easily scan over all keys between `10000` * Range queries are not efficient. For example, you cannot easily scan over all keys between `10000`
and `19999`—you’d have to look up each key individually in the hash map. and `19999`—you’d have to look up each key individually in the hash map.
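The hash-indexed log discussed in this hunk can be sketched as follows. This is a minimal illustration rather than the book's implementation: the class name `HashIndexedLog` and its methods are hypothetical, records use the same `key,value` line format as the Bash example, and keys are assumed to contain no commas or newlines.

```python
import os
import tempfile


class HashIndexedLog:
    """Append-only log file plus an in-memory map from key to byte offset."""

    def __init__(self, path):
        self.path = path
        self.index = {}  # key -> byte offset of the latest record for that key
        open(path, "ab").close()  # create the log file if it does not exist
        self._rebuild_index()

    def _rebuild_index(self):
        # The hash map is not persisted, so a restart scans the whole log;
        # later records overwrite the offsets of earlier ones.
        offset = 0
        with open(self.path, "rb") as f:
            for line in f:
                key = line.split(b",", 1)[0].decode()
                self.index[key] = offset
                offset += len(line)

    def set(self, key, value):
        record = f"{key},{value}\n".encode()
        with open(self.path, "ab") as f:
            offset = f.seek(0, os.SEEK_END)  # the record starts at end of file
            f.write(record)
        self.index[key] = offset

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)  # one seek instead of scanning the whole file
            line = f.readline()
        return line.split(b",", 1)[1].rstrip(b"\n").decode()


if __name__ == "__main__":
    db = HashIndexedLog(os.path.join(tempfile.mkdtemp(), "database"))
    db.set("42", "San Francisco")
    db.set("42", "Berlin")
    print(db.get("42"))
```

The sketch shows the first two problems from the list directly: overwritten records stay in the file forever, and `_rebuild_index` has to read every record on startup.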
### The SSTable file format ### The SSTable file format
@ -177,8 +176,7 @@ Now you do not need to keep all the keys in memory: you can group the key-value
SSTable into *blocks* of a few kilobytes, and then store the first key of each block in the index. SSTable into *blocks* of a few kilobytes, and then store the first key of each block in the index.
This kind of index, which stores only some of the keys, is called *sparse*. This index is stored in This kind of index, which stores only some of the keys, is called *sparse*. This index is stored in
a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data
structure that allows queries to quickly look up a particular key structure that allows queries to quickly look up a particular key [^4].
[^4].
For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the
first key of the next block is `handsome`. Now say you’re looking for the key `handiwork`, which first key of the next block is `handsome`. Now say you’re looking for the key `handiwork`, which
@ -202,25 +200,24 @@ We can solve this problem with a *log-structured* approach, which is a hybrid be
log and a sorted file: log and a sorted file:
1. When a write comes in, add it to an in-memory ordered map data structure, such as a red-black 1. When a write comes in, add it to an in-memory ordered map data structure, such as a red-black
tree, skip list [^5], or trie tree, skip list [^5], or trie
[^6]. [^6].
With these data structures, you can insert keys in any order, look them up efficiently, and read With these data structures, you can insert keys in any order, look them up efficiently, and read
them back in sorted order. This in-memory data structure is called the *memtable*. them back in sorted order. This in-memory data structure is called the *memtable*.
2. When the memtable gets bigger than some threshold—typically a few megabytes—write it out to 2. When the memtable gets bigger than some threshold—typically a few megabytes—write it out to
disk in sorted order as an SSTable file. We call this new SSTable file the most recent *segment* disk in sorted order as an SSTable file. We call this new SSTable file the most recent *segment*
of the database, and it is stored as a separate file alongside the older segments. Each segment of the database, and it is stored as a separate file alongside the older segments. Each segment
has a separate index of its contents. While the new segment is being written out to disk, the has a separate index of its contents. While the new segment is being written out to disk, the
database can continue writing to a new memtable instance, and the old memtable’s memory is freed database can continue writing to a new memtable instance, and the old memtable’s memory is freed
when the writing of the SSTable is complete. when the writing of the SSTable is complete.
3. In order to read the value for some key, first try to find the key in the memtable and the most 3. In order to read the value for some key, first try to find the key in the memtable and the most
recent on-disk segment. If it’s not there, look in the next-older segment, etc. until you either recent on-disk segment. If it’s not there, look in the next-older segment, etc. until you either
find the key or reach the oldest segment. If the key does not appear in any of the segments, it find the key or reach the oldest segment. If the key does not appear in any of the segments, it
does not exist in the database. does not exist in the database.
4. From time to time, run a merging and compaction process in the background to combine segment files 4. From time to time, run a merging and compaction process in the background to combine segment files
and to discard overwritten or deleted values. and to discard overwritten or deleted values.
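Steps 1 to 3 of the list above can be sketched in a few lines. This is a simplified illustration under loud assumptions: the class `TinyLSM` is hypothetical, a plain `dict` that is sorted at flush time stands in for the ordered memtable, in-memory sorted lists stand in for on-disk SSTable files, and the threshold counts keys rather than bytes. Compaction (step 4) is left out.

```python
import bisect


class TinyLSM:
    def __init__(self, memtable_limit=4):
        self.memtable = {}    # stand-in for a red-black tree or skip list
        self.segments = []    # sorted lists of (key, value); newest is last
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        # Step 1: every write goes to the memtable first.
        self.memtable[key] = value
        # Step 2: once the memtable is too big, write it out as a segment.
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def _flush(self):
        self.segments.append(sorted(self.memtable.items()))
        self.memtable = {}

    def get(self, key):
        # Step 3: check the memtable, then segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            # (key,) sorts just before (key, value), so bisect_left lands
            # on the entry for this key if the segment contains it.
            i = bisect.bisect_left(segment, (key,))
            if i < len(segment) and segment[i][0] == key:
                return segment[i][1]
        return None  # not in any segment: the key does not exist


if __name__ == "__main__":
    db = TinyLSM(memtable_limit=2)
    db.put("b", 1)
    db.put("a", 2)   # triggers a flush
    db.put("a", 3)   # newer value shadows the flushed one
    print(db.get("a"), db.get("b"))
```

Because reads stop at the first segment that contains the key, a newer value always shadows older ones without the older segments being modified.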
Merging segments works similarly to the *mergesort* algorithm Merging segments works similarly to the *mergesort* algorithm [^5]. The process is illustrated in
[^5]. The process is illustrated in
[Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key [Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key
in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If
the same key appears in more than one input file, keep only the more recent value. This produces a the same key appears in more than one input file, keep only the more recent value. This produces a
@ -242,18 +239,14 @@ called a *tombstone* to the data file. When log segments are merged, the tombsto
process to discard any previous values for the deleted key. Once the tombstone is merged into the process to discard any previous values for the deleted key. Once the tombstone is merged into the
oldest segment, it can be dropped. oldest segment, it can be dropped.
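The merging process, including the handling of tombstones, might look roughly like this. It is a sketch, not any particular engine's code: `merge_segments` and the `TOMBSTONE` marker are hypothetical names, and segments are in-memory sorted lists of `(key, value)` pairs passed oldest first.

```python
import heapq

TOMBSTONE = object()  # hypothetical marker meaning "this key was deleted"


def merge_segments(segments, drop_tombstones=False):
    """Merge sorted segments (oldest first) into one sorted segment.

    Set drop_tombstones=True only when the merge includes the oldest
    segment, since otherwise an older value could reappear.
    """
    def tagged(segment, age):
        # Negate the age so that, for equal keys, the record from the
        # newest segment sorts first.
        for key, value in segment:
            yield key, -age, value

    streams = [tagged(seg, age) for age, seg in enumerate(segments)]
    merged = []
    last_key = object()  # sentinel that equals no real key
    for key, _, value in heapq.merge(*streams, key=lambda r: r[:2]):
        if key == last_key:
            continue  # an older value for a key that was already handled
        last_key = key
        if value is TOMBSTONE and drop_tombstones:
            continue  # the key is deleted and no older copy can remain
        merged.append((key, value))
    return merged


if __name__ == "__main__":
    older = [("handbag", 1), ("handful", 2), ("handsome", 3)]
    newer = [("handbag", TOMBSTONE), ("handicap", 4), ("handsome", 5)]
    print(merge_segments([older, newer], drop_tombstones=True))
```

`heapq.merge` reads the inputs side by side and never holds more than one record per input in memory, which mirrors why merging works on files much larger than RAM.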
The algorithm described here is essentially what is used in RocksDB The algorithm described here is essentially what is used in RocksDB [^7],
[^7], Cassandra, Scylla, and HBase [^8],
Cassandra, Scylla, and HBase all of which were inspired by Google’s Bigtable paper [^9]
[^8],
all of which were inspired by Google’s Bigtable paper
[^9]
(which introduced the terms *SSTable* and *memtable*). (which introduced the terms *SSTable* and *memtable*).
The algorithm was originally published in 1996 under the name *Log-Structured Merge-Tree* or *LSM-Tree* The algorithm was originally published in 1996 under the name *Log-Structured Merge-Tree* or *LSM-Tree*
[^10], [^10],
building on earlier work on log-structured filesystems building on earlier work on log-structured filesystems [^11].
[^11].
For this reason, storage engines that are based on the principle of merging and compacting sorted For this reason, storage engines that are based on the principle of merging and compacting sorted
files are often called *LSM storage engines*. files are often called *LSM storage engines*.
@ -265,8 +258,7 @@ requests to using the new merged segment instead of the old segments, and then t
can be deleted. can be deleted.
The segment files don’t necessarily have to be stored on local disk: they are also well suited for The segment files don’t necessarily have to be stored on local disk: they are also well suited for
writing to object storage. SlateDB and Delta Lake writing to object storage. SlateDB and Delta Lake [^12].
[^12].
take this approach, for example. take this approach, for example.
Having immutable segment files also simplifies crash recovery: if a crash happens while writing out Having immutable segment files also simplifies crash recovery: if a crash happens while writing out
@ -287,8 +279,7 @@ appears in a particular SSTable.
[Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in [Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in
reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash
function, producing a set of numbers that are then interpreted as indexes into the array of bits function, producing a set of numbers that are then interpreted as indexes into the array of bits [^14].
[^14].
We set the bits corresponding to those indexes to 1, and leave the rest as 0. For example, the key We set the bits corresponding to those indexes to 1, and leave the rest as 0. For example, the key
`handbag` hashes to the numbers (2, 9, 4), so we set the 2nd, 9th, and 4th bits to 1. The bitmap `handbag` hashes to the numbers (2, 9, 4), so we set the 2nd, 9th, and 4th bits to 1. The bitmap
is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of
@ -311,8 +302,7 @@ as if a key is present, even though it isn’t, is called a *false positive*.
The probability of false positives depends on the number of keys, the number of bits set per key, The probability of false positives depends on the number of keys, the number of bits set per key,
and the total number of bits in the Bloom filter. You can use an online calculator tool to work out and the total number of bits in the Bloom filter. You can use an online calculator tool to work out
the right parameters for your application the right parameters for your application [^15].
[^15].
As a rule of thumb, you need to allocate 10 bits of Bloom filter space for every key in the SSTable As a rule of thumb, you need to allocate 10 bits of Bloom filter space for every key in the SSTable
to get a false positive probability of 1%, and the probability is reduced tenfold for every 5 to get a false positive probability of 1%, and the probability is reduced tenfold for every 5
additional bits you allocate per key. additional bits you allocate per key.
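The rule of thumb can be checked against the standard Bloom filter approximation. The sketch below assumes ideal, independent hash functions and the optimal (possibly fractional) number of hash functions, k = (bits per key) × ln 2, for which the false positive rate is (1/2)^k. The function name is hypothetical.

```python
import math


def bloom_false_positive_rate(bits_per_key):
    # Optimal number of hash functions for this many bits per key.
    k = bits_per_key * math.log(2)
    # With the optimal k, each probed bit is set with probability 1/2.
    return 0.5 ** k


if __name__ == "__main__":
    for bits in (10, 15, 20):
        print(bits, bloom_false_positive_rate(bits))
```

Under these assumptions, 10 bits per key gives a rate of roughly 0.8%, and each additional 5 bits reduces it by a factor of about 11, which is consistent with the figures quoted above.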
@ -320,30 +310,29 @@ additional bits you allocate per key.
In the context of LSM storage engines, false positives are no problem: In the context of LSM storage engines, false positives are no problem:
* If the Bloom filter says that a key *is not* present, we can safely skip that SSTable, since we * If the Bloom filter says that a key *is not* present, we can safely skip that SSTable, since we
can be sure that it doesn’t contain the key. can be sure that it doesn’t contain the key.
* If the Bloom filter says the key *is* present, we have to consult the sparse index and decode the * If the Bloom filter says the key *is* present, we have to consult the sparse index and decode the
block of key-value pairs to check whether the key really is there. If it was a false positive, we block of key-value pairs to check whether the key really is there. If it was a false positive, we
have done a bit of unnecessary work, but otherwise no harm is done—we just continue the search have done a bit of unnecessary work, but otherwise no harm is done—we just continue the search
with the next-oldest segment. with the next-oldest segment.
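A Bloom filter with the two behaviors described in the list above can be sketched as follows. The class `BloomFilter` and the choice of salted SHA-256 as the hash function are assumptions made for illustration; real storage engines use faster non-cryptographic hashes.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python integer used as an array of bits

    def _positions(self, key):
        # Derive num_hashes bit positions by salting the key with a counter.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits |= 1 << pos

    def might_contain(self, key):
        # False means "definitely absent"; True means "possibly present".
        return all((self.bits >> pos) & 1 for pos in self._positions(key))


if __name__ == "__main__":
    bf = BloomFilter(num_bits=1024, num_hashes=3)
    bf.add("handbag")
    bf.add("handheld")
    print(bf.might_contain("handbag"), bf.might_contain("handsome"))
```

A `False` result lets the read path skip the SSTable entirely, while a `True` result still requires consulting the sparse index and the block, exactly as in the two cases above.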
### Compaction strategies ### Compaction strategies
An important detail is how the LSM storage engine chooses when to perform compaction, and which SSTables to An important detail is how the LSM storage engine chooses when to perform compaction, and which SSTables to
include in a compaction. Many LSM-based storage systems allow you to configure which compaction include in a compaction. Many LSM-based storage systems allow you to configure which compaction
strategy to use, and some of the common choices are strategy to use, and some of the common choices are
[[16](/en/ch4#Luo2019), [[^16], [^17]]:
[17](/en/ch4#Sarkar2022)]:
Size-tiered compaction Size-tiered compaction
: Newer and smaller SSTables are successively merged into older and larger SSTables. The SSTables : Newer and smaller SSTables are successively merged into older and larger SSTables. The SSTables
containing older data can get very large, and merging them requires a lot of temporary disk space. containing older data can get very large, and merging them requires a lot of temporary disk space.
The advantage of this strategy is that it can handle very high write throughput. The advantage of this strategy is that it can handle very high write throughput.
Leveled compaction Leveled compaction
: The key range is split up into smaller SSTables and older data is moved into separate “levels,” : The key range is split up into smaller SSTables and older data is moved into separate “levels,”
which allows the compaction to proceed more incrementally and use less disk space than the which allows the compaction to proceed more incrementally and use less disk space than the
size-tiered strategy. This strategy is more efficient for reads than size-tiered compaction size-tiered strategy. This strategy is more efficient for reads than size-tiered compaction
because the storage engine needs to read fewer SSTables to check whether they contain the key. because the storage engine needs to read fewer SSTables to check whether they contain the key.
As a rule of thumb, size-tiered compaction performs better if you have mostly writes and few reads, As a rule of thumb, size-tiered compaction performs better if you have mostly writes and few reads,
whereas leveled compaction performs better if your workload is dominated by reads. If you write a whereas leveled compaction performs better if your workload is dominated by reads. If you write a
@ -360,16 +349,14 @@ Many databases run as a service that accepts queries over a network, but there a
databases that don’t expose a network API. Instead, they are libraries that run in the same process databases that don’t expose a network API. Instead, they are libraries that run in the same process
as your application code, typically reading and writing files on the local disk, and you interact as your application code, typically reading and writing files on the local disk, and you interact
with them through normal function calls. Examples of embedded storage engines include RocksDB, with them through normal function calls. Examples of embedded storage engines include RocksDB,
SQLite, LMDB, DuckDB, and KùzuDB SQLite, LMDB, DuckDB, and KùzuDB [^19].
[^19].
Embedded databases are very commonly used in mobile apps to store the local user’s data. On the Embedded databases are very commonly used in mobile apps to store the local user’s data. On the
backend, they can be an appropriate choice if the data is small enough to fit on a single machine, backend, they can be an appropriate choice if the data is small enough to fit on a single machine,
and if there are not many concurrent transactions. For example, in a multitenant system in which and if there are not many concurrent transactions. For example, in a multitenant system in which
each tenant is small enough and completely separate from others (i.e., you do not need to run each tenant is small enough and completely separate from others (i.e., you do not need to run
queries that combine data from multiple tenants), you can potentially use a separate embedded queries that combine data from multiple tenants), you can potentially use a separate embedded
database instance per tenant database instance per tenant [^20].
[^20].
The storage and retrieval methods we discuss in this chapter are used in both embedded and in The storage and retrieval methods we discuss in this chapter are used in both embedded and in
client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques
@ -381,8 +368,7 @@ The log-structured approach is popular, but it is not the only form of key-value
widely used structure for reading and writing database records by key is the *B-tree*. widely used structure for reading and writing database records by key is the *B-tree*.
Introduced in 1970 [^21] Introduced in 1970 [^21]
and called “ubiquitous” less than 10 years later and called “ubiquitous” less than 10 years later [^22],
[^22],
B-trees have stood the test of time very well. They remain the standard index implementation in B-trees have stood the test of time very well. They remain the standard index implementation in
almost all relational databases, and many nonrelational databases use them too. almost all relational databases, and many nonrelational databases use them too.
@ -441,8 +427,7 @@ the new key), and a page for 337–344. We also have to update the parent page t
both children, with a boundary value of 337 between them. If the parent page doesn’t have enough both children, with a boundary value of 337 between them. If the parent page doesn’t have enough
space for the new reference, it may also need to be split, and the splits can continue all the way space for the new reference, it may also need to be split, and the splits can continue all the way
to the root of the tree. When the root is split, we make a new root above it. Deleting keys (which to the root of the tree. When the root is split, we make a new root above it. Deleting keys (which
may require nodes to be merged) is more complex may require nodes to be merged) is more complex [^5].
[^5].
This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth
of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so
@ -467,8 +452,7 @@ In order to make the database resilient to crashes, it is common for B-tree impl
include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file
to which every B-tree modification must be written before it can be applied to the pages of the tree to which every B-tree modification must be written before it can be applied to the pages of the tree
itself. When the database comes back up after a crash, this log is used to restore the B-tree back itself. When the database comes back up after a crash, this log is used to restore the B-tree back
to a consistent state [[2](/en/ch4#Graefe2011), to a consistent state [[^2], [^24]].
[24](/en/ch4#Mohan1992)].
In filesystems, the equivalent mechanism is known as *journaling*. In filesystems, the equivalent mechanism is known as *journaling*.
To improve performance, B-tree implementations typically don’t immediately write every modified page To improve performance, B-tree implementations typically don’t immediately write every modified page
@ -483,26 +467,25 @@ As B-trees have been around for so long, many variants have been developed over
mention just a few: mention just a few:
* Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB) * Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB)
use a copy-on-write scheme [^26]. use a copy-on-write scheme [^26].
A modified page is written to a different location, and a new version of the parent pages in the tree A modified page is written to a different location, and a new version of the parent pages in the tree
is created, pointing at the new location. This approach is also useful for concurrency control, as we shall is created, pointing at the new location. This approach is also useful for concurrency control, as we shall
see in [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation). see in [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation).
* We can save space in pages by not storing the entire key, but abbreviating it. Especially in pages * We can save space in pages by not storing the entire key, but abbreviating it. Especially in pages
on the interior of the tree, keys only need to provide enough information to act as boundaries on the interior of the tree, keys only need to provide enough information to act as boundaries
between key ranges. Packing more keys into a page allows the tree to have a higher branching between key ranges. Packing more keys into a page allows the tree to have a higher branching
factor, and thus fewer levels. factor, and thus fewer levels.
* To speed up scans over the key range in sorted order, some B-tree implementations try to lay out * To speed up scans over the key range in sorted order, some B-tree implementations try to lay out
the tree so that leaf pages appear in sequential order on disk, reducing the number of disk seeks. the tree so that leaf pages appear in sequential order on disk, reducing the number of disk seeks.
However, it’s difficult to maintain that order as the tree grows. However, it’s difficult to maintain that order as the tree grows.
* Additional pointers have been added to the tree. For example, each leaf page may have references to * Additional pointers have been added to the tree. For example, each leaf page may have references to
its sibling pages to the left and right, which allows scanning keys in order without jumping back its sibling pages to the left and right, which allows scanning keys in order without jumping back
to parent pages. to parent pages.
## Comparing B-Trees and LSM-Trees ## Comparing B-Trees and LSM-Trees
As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads
[[27](/en/ch4#Athanassoulis2016), [[^27], [^28]].
[28](/en/ch4#Stopford2015)].
However, benchmarks are often sensitive to details of the workload. You need to test systems with However, benchmarks are often sensitive to details of the workload. You need to test systems with
your particular workload in order to make a valid comparison. Moreover, it’s not a strict either/or your particular workload in order to make a valid comparison. Moreover, it’s not a strict either/or
choice between LSM and B-trees: storage engines sometimes blend characteristics of both approaches, choice between LSM and B-trees: storage engines sometimes blend characteristics of both approaches,
@ -522,21 +505,18 @@ Range queries are simple and fast on B-trees, as they can use the sorted structu
LSM storage, range queries can also take advantage of the SSTable sorting, but they need to scan all LSM storage, range queries can also take advantage of the SSTable sorting, but they need to scan all
the segments in parallel and combine the results. Bloom filters don’t help for range queries (since the segments in parallel and combine the results. Bloom filters don’t help for range queries (since
you would need to compute the hash of every possible key within the range, which is impractical), you would need to compute the hash of every possible key within the range, which is impractical),
making range queries more expensive than point queries in the LSM approach making range queries more expensive than point queries in the LSM approach [^29].
[^29].
High write throughput can cause latency spikes in a log-structured storage engine if the High write throughput can cause latency spikes in a log-structured storage engine if the
memtable fills up. This happens if data can’t be written out to disk fast enough, perhaps because memtable fills up. This happens if data can’t be written out to disk fast enough, perhaps because
the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB, the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB,
perform *backpressure* in this situation: they suspend all reads and writes until the memtable has perform *backpressure* in this situation: they suspend all reads and writes until the memtable has
been written out to disk been written out to disk
[[30](/en/ch4#Balmau2019), [[^30], [^31]].
[31](/en/ch4#RocksDBTuning)].
Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read
requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but
storage engines need to be carefully designed to take advantage of this parallelism storage engines need to be carefully designed to take advantage of this parallelism [^32].
[^32].
### Sequential vs. random writes ### Sequential vs. random writes
@ -568,17 +548,14 @@ The reason is that flash memory can be read or written one page (typically 4 Ki
but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block
may contain valid data, whereas others may contain data that is no longer needed. Before erasing a
block, the controller must first move pages containing valid data into other blocks; this process is
called *garbage collection* (GC) [^33].

A sequential write workload writes larger chunks of data at a time, so it is likely that a whole
512 KiB block belongs to a single file; when that file is later deleted again, the whole block
can be erased without having to perform any GC. On the other hand, with a random write workload, it
is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
to perform more work before a block can be erased [[^34], [^35], [^36]].
The write bandwidth consumed by GC is then not available for the application. Moreover, the
additional writes performed by GC contribute to wear on the flash memory; therefore, random writes
@ -591,14 +568,12 @@ operations on the underlying disk. With LSM-trees, a value is first written to t
durability, then again when the memtable is written to disk, and again every time the key-value pair
is part of a compaction. (If the values are significantly larger than the keys, this overhead can be
reduced by storing values separately from keys, and performing compaction only on SSTables
containing keys and references to values [^37].)

A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once
to the tree page itself. In addition, it sometimes needs to write out an entire page, even if only
a few bytes in that page changed, to ensure the B-tree can be correctly recovered after a crash or
power failure [[^38], [^39]].

If you take the total number of bytes written to disk in some workload, and divide by the number of
bytes you would have to write if you simply wrote an append-only log with no index, you get the
@ -610,8 +585,7 @@ handle within the available disk bandwidth.
Write amplification is a problem in both LSM-trees and B-trees. Which one is better depends on
various factors, such as the length of your keys and values, and how often you overwrite existing
keys versus insert new ones. For typical workloads, LSM-trees tend to have lower write amplification
because they don't have to write entire pages and they can compress chunks of the SSTable [^40].
This is another factor that makes LSM storage engines well suited for write-heavy workloads.
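
As a rough illustration of how such a ratio might be computed, the sketch below divides the bytes an engine writes by the bytes a plain append-only log would need for the same workload. The function name, the number of compaction passes, and the page and row sizes are all made-up assumptions for the sake of the arithmetic, not measurements of any real engine.

```python
def write_amplification(bytes_written_to_disk: int, bytes_of_plain_log: int) -> float:
    """Bytes the engine wrote, divided by what an index-free append-only log would write."""
    if bytes_of_plain_log <= 0:
        raise ValueError("bytes_of_plain_log must be positive")
    return bytes_written_to_disk / bytes_of_plain_log


GIB = 1024 ** 3
app_bytes = 1 * GIB  # hypothetical workload: the application writes 1 GiB of key-value pairs

# Assumed LSM behaviour: one write to the log, one when the memtable is flushed,
# and three more because each pair takes part in three compactions.
lsm_bytes = app_bytes * (1 + 1 + 3)

# Assumed B-tree behaviour: one write to the write-ahead log, plus a whole
# 8 KiB page written for every 1 KiB of data that actually changed.
btree_bytes = app_bytes * 1 + app_bytes * 8

lsm_wa = write_amplification(lsm_bytes, app_bytes)
btree_wa = write_amplification(btree_bytes, app_bytes)
print(lsm_wa, btree_wa)
```

With different assumptions (larger values, fewer compaction passes, partial page writes) the comparison can easily flip, which is why the result depends so heavily on the workload.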

Besides affecting throughput, write amplification is also relevant for the wear on SSDs: a storage
@ -636,8 +610,7 @@ the data files anyway, and SSTables dont have pages with unused space. Moreov
key-value pairs can better be compressed in SSTables, and thus often produce smaller files on disk
than B-trees. Keys and values that have been overwritten continue to consume space until they are
removed by a compaction, but this overhead is quite low when using leveled compaction [[^40], [^41]].
Size-tiered compaction (see [“Compaction strategies”](/en/ch4#sec_storage_lsm_compaction)) uses more disk space, especially
temporarily during compaction.
@ -682,22 +655,22 @@ to implement an index.
The key in an index is the thing that queries search by, but the value can be one of several things:

* If the actual data (row, document, vertex) is stored directly within the index structure, it is
  called a *clustered index*. For example, in MySQL's InnoDB storage engine, the primary key of a
  table is always a clustered index, and in SQL Server, you can specify one clustered index per
  table [^43].
* Alternatively, the value can be a reference to the actual data: either the primary key of the row
  in question (InnoDB does this for secondary indexes), or a direct reference to a location on disk.
  In the latter case, the place where rows are stored is known as a *heap file*, and it stores data
  in no particular order (it may be append-only, or it may keep track of deleted rows in order to
  overwrite them with new data later). For example, Postgres uses the heap file approach
  [^44].
* A middle ground between the two is a *covering index* or *index with included columns*, which
  stores *some* of a table's columns within the index, in addition to storing the full row on the
  heap or in the primary key clustered index [^45].
  This allows some queries to be answered by using the index alone, without having to resolve the
  primary key or look in the heap file (in which case, the index is said to *cover* the query).
  This can make some queries faster, but the duplication of data means the index uses more disk space and slows down
  writes.
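
The three options in the list above can be mimicked with plain Python dictionaries. This is only a toy model: the table, its columns, and all variable names are invented for illustration, and a list position stands in for a location on disk.

```python
# Heap file: rows in no particular order; the list position plays the role of a disk location.
heap_file = [
    {"id": 3, "name": "Carol", "city": "Oslo"},
    {"id": 1, "name": "Alice", "city": "Lima"},
    {"id": 2, "name": "Bob", "city": "Oslo"},
]

# Clustered index: the full row is stored directly in the index, under the primary key.
clustered = {row["id"]: row for row in heap_file}

# Secondary index whose values are primary keys (a second lookup in `clustered` is needed).
city_to_ids = {}
for row in heap_file:
    city_to_ids.setdefault(row["city"], []).append(row["id"])

# Secondary index whose values are heap-file locations.
city_to_positions = {}
for position, row in enumerate(heap_file):
    city_to_positions.setdefault(row["city"], []).append(position)

# Covering index on `city` that also includes the `name` column.
city_covering = {}
for row in heap_file:
    city_covering.setdefault(row["city"], []).append({"id": row["id"], "name": row["name"]})

# "Names of everyone in Oslo", answered three ways:
names_via_pk = sorted(clustered[i]["name"] for i in city_to_ids["Oslo"])
names_via_heap = sorted(heap_file[p]["name"] for p in city_to_positions["Oslo"])
names_covered = sorted(entry["name"] for entry in city_covering["Oslo"])  # no second lookup
print(names_via_pk, names_via_heap, names_covered)
```

The covering index answers the query without touching the rows themselves, at the cost of storing `name` twice and having to update it in two places on every write.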

The indexes discussed so far only map a single key to a value. If you need to query multiple columns
of a table (or multiple fields in a document) simultaneously, see [“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional).
@ -737,11 +710,9 @@ easily be backed up, inspected, and analyzed by external utilities.
Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model,
and the vendors claim that they can offer big performance improvements by removing all the overheads
associated with managing on-disk data structures [[^46], [^47]].
RAMCloud is an open source, in-memory key-value store with durability (using a log-structured
approach for the data in memory as well as the data on disk) [^48].
Redis and Couchbase provide weak durability by writing to disk asynchronously.
@ -749,8 +720,7 @@ Counterintuitively, the performance advantage of in-memory databases is not due
they don't need to read from disk. Even a disk-based storage engine may never need to read from disk
if you have enough memory, because the operating system caches recently used disk blocks in memory
anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data
structures in a form that can be written to disk [^49].

Besides performance, another interesting area for in-memory databases is providing data models that
are difficult to implement with disk-based indexes. For example, Redis offers a database-like
@ -774,10 +744,7 @@ transaction processing and data warehousing in the same product. However, these
and analytical processing (HTAP) databases (introduced in [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are increasingly
becoming two separate storage and query engines, which happen to be accessible through a common SQL
interface [[^50], [^51], [^52], [^53]].

## Cloud Data Warehouses
@ -790,50 +757,48 @@ of scalable cloud infrastructure like object storage and serverless computation
Cloud data warehouses tend to integrate better with other cloud services and to be more elastic.
For example, many cloud warehouses support automatic log ingestion, and offer easy integration with
data processing frameworks such as Google Cloud's Dataflow or Amazon Web Services' Kinesis. These
warehouses are also more elastic because they decouple query computation from the storage layer [^54].
Data is persisted on object storage rather than local disks, which makes it easy to adjust storage
capacity and compute resources for queries independently, as we previously saw in
[“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native).

Open source data warehouses such as Apache Hive, Trino, and Apache Spark have also evolved with the
cloud. As data storage for analytics has moved to data lakes on object storage, open source warehouses
have begun to break apart [^55]. The following
components, which were previously integrated in a single system such as Apache Hive, are now often
implemented as separate components:

Query engine
: Query engines such as Trino, Apache DataFusion, and Presto parse SQL queries, optimize them into
  execution plans, and execute them against the data. Execution usually requires parallel,
  distributed data processing tasks. Some query engines provide built-in task execution, while
  others choose to use third party execution frameworks such as Apache Spark or Apache Flink.

Storage format
: The storage format determines how the rows of a table are encoded as bytes in a file, which is
  then typically stored in object storage or a distributed filesystem [^12].
  This data can then be accessed by the query engine, but also by other applications using the data
  lake. Examples of such storage formats are Parquet, ORC, Lance, or Nimble, and we will see more
  about them in the next section.

Table format
: Files written in Apache Parquet and similar storage formats are typically immutable once written.
  To support row inserts and deletions, a table format such as Apache Iceberg or Databricks's Delta
  format is used. Table formats specify a file format that defines which files constitute a table
  along with the table's schema. Such formats also offer advanced features such as time travel (the
  ability to query a table as it was at a previous point in time), garbage collection, and even
  transactions.

Data catalog
: Much like a table format defines which files make up a table, a data catalog defines which tables
  comprise a database. Catalogs are used to create, rename, and drop tables. Unlike storage and table
  formats, data catalogs such as Snowflake's Polaris and Databricks's Unity Catalog usually run as a
  standalone service that can be queried using a REST interface. Apache Iceberg also offers a
  catalog, which can be run inside a client or as a separate process. Query engines use catalog
  information when reading and writing tables. Traditionally, catalogs and query engines have been
  integrated, but decoupling them has enabled data discovery and data governance systems
  (discussed in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance)) to access a catalog's metadata as well.

## Column-Oriented Storage
@ -844,8 +809,7 @@ efficiently becomes a challenging problem. Dimension tables are usually much sma
rows), so in this section we will focus on storage of facts.

Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4
or 5 of them at one time (`"SELECT *"` queries are rarely needed for analytics) [^52]. Take the query in
[Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone
buying fruit or candy during the 2024 calendar year), but it only needs to access three columns of
the `fact_sales` table: `date_key`, `product_sk`,
@ -855,16 +819,16 @@ and `quantity`. The query ignores all other columns.
```
SELECT
  dim_date.weekday, dim_product.category,
  SUM(fact_sales.quantity) AS quantity_sold
FROM fact_sales
  JOIN dim_date ON fact_sales.date_key = dim_date.date_key
  JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
WHERE
  dim_date.year = 2024 AND
  dim_product.category IN ('Fresh fruit', 'Candy')
GROUP BY
  dim_date.weekday, dim_product.category;
```

How can we execute this query efficiently?
@ -882,8 +846,7 @@ memory, parse them, and filter out those that dont meet the required conditio
long time.

The idea behind *column-oriented* (or *columnar*) storage is simple: don't store all the values from
one row together, but store all the values from each *column* together instead [^56].
If each column is stored separately, a query only needs to read and parse those columns that are
used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using
an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema).
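
A minimal sketch of the difference between the two layouts, using ordinary Python lists and dictionaries rather than any real file format. The sample rows and the helper names `to_columns` and `row_at` are invented for illustration.

```python
# Row-oriented layout: all the values of one row are stored together.
rows = [
    {"date_key": 20240102, "product_sk": 69, "store_sk": 4, "quantity": 1},
    {"date_key": 20240102, "product_sk": 30, "store_sk": 3, "quantity": 3},
    {"date_key": 20240103, "product_sk": 31, "store_sk": 2, "quantity": 5},
]


def to_columns(rows):
    """Column-oriented layout: one list per column, with rows kept in the same order."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def row_at(columns, k):
    """Reassemble row k by taking the k-th entry of every column."""
    return {name: values[k] for name, values in columns.items()}


columns = to_columns(rows)

# An aggregate over `quantity` only needs to touch that one list.
total_quantity = sum(columns["quantity"])
print(total_quantity, row_at(columns, 1))
```

Because every column keeps the rows in the same order, the k-th entries of all the lists still belong to the same row, which is what makes reassembly possible.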
@ -907,33 +870,24 @@ individual columns and put them together to form the 23rd row of the table.
In fact, columnar storage engines don't actually store an entire column (containing perhaps
trillions of rows) in one go. Instead, they break the table into blocks of thousands or millions of
rows, and within each block they store the values from each column separately [^60].
Since many queries are restricted to a particular date range, it is common to make each block
contain the rows for a particular timestamp range. A query then only needs to load the columns it
needs in those blocks that overlap with the required date range.
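
The block-skipping idea can be sketched as follows. The `Block` class, its `min_date`/`max_date` fields, and the integer `yyyymmdd` date keys are assumptions made for this example; real formats keep comparable per-block metadata but differ in the details.

```python
from dataclasses import dataclass


@dataclass
class Block:
    min_date: int   # smallest date_key in the block (yyyymmdd, hypothetical encoding)
    max_date: int   # largest date_key in the block
    columns: dict   # column name -> list of values, one list per column


def blocks_overlapping(blocks, start, end):
    """Keep only the blocks whose date range intersects [start, end]."""
    return [b for b in blocks if b.max_date >= start and b.min_date <= end]


def sum_quantity(blocks, start, end):
    """Sum `quantity` for rows in [start, end], reading just two columns of the relevant blocks."""
    total = 0
    for block in blocks_overlapping(blocks, start, end):
        for date, quantity in zip(block.columns["date_key"], block.columns["quantity"]):
            if start <= date <= end:
                total += quantity
    return total


blocks = [
    Block(20240101, 20240131, {"date_key": [20240105, 20240120], "quantity": [2, 3]}),
    Block(20240201, 20240229, {"date_key": [20240210, 20240225], "quantity": [5, 7]}),
    Block(20240301, 20240331, {"date_key": [20240302], "quantity": [11]}),
]

relevant = blocks_overlapping(blocks, 20240215, 20240310)
result = sum_quantity(blocks, 20240215, 20240310)
print(len(relevant), result)
```

The January block is never opened, and within the other two blocks only the `date_key` and `quantity` lists are read.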

Columnar storage is used in almost all analytic databases nowadays [^60],
ranging from large-scale cloud data warehouses such as Snowflake [^61]
to single-node embedded databases such as DuckDB [^62],
and product analytics systems such as Pinot [^63]
and Druid [^64].
It is used in storage formats such as Parquet, ORC
[[^65], [^66]],
Lance [^67],
and Nimble [^68],
and in-memory analytics formats like Apache Arrow
[[^65], [^69]]
and Pandas/NumPy [^70].
Some time-series databases, such as InfluxDB IOx [^71] and TimescaleDB [^72],
are also based on column-oriented storage.

### Column Compression
@ -961,21 +915,20 @@ One option is to store those bitmaps using one bit per row. However, these bitma
a lot of zeros (we say that they are *sparse*). In that case, the bitmaps can additionally be
run-length encoded: counting the number of consecutive zeros or ones and storing that number, as
shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the
two bitmap representations, using whichever is the most compact [^73].
This can make the encoding of a column remarkably efficient.
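
Run-length encoding of a bitmap can be sketched in a few lines. The convention used here (run lengths alternate between zeros and ones, and the first number always counts zeros) is an assumption chosen for simplicity, and `rle_encode`/`rle_decode` are hypothetical helper names; real formats differ in the details.

```python
from itertools import groupby


def rle_encode(bits):
    """Encode a list of 0/1 values as run lengths, alternating zeros, ones, zeros, ..."""
    runs = []
    expected = 0  # by convention the first run counts zeros
    for bit, group in groupby(bits):
        if bit != expected:
            runs.append(0)  # only happens when the bitmap starts with a 1
        runs.append(len(list(group)))
        expected = 1 - bit
    return runs


def rle_decode(runs):
    """Inverse of rle_encode."""
    bits = []
    bit = 0
    for length in runs:
        bits.extend([bit] * length)
        bit = 1 - bit
    return bits


sparse = [0] * 9 + [1] + [0] * 8     # 18 rows, only row 9 matches
encoded = rle_encode(sparse)          # three numbers instead of 18 bits
print(encoded, rle_decode(encoded) == sparse)
```

For a dense or rapidly alternating bitmap the list of run lengths can be longer than the plain bitmap, which is the motivation for schemes that choose between representations.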

Bitmap indexes such as these are very well suited for the kinds of queries that are common in a data
warehouse. For example:

`WHERE product_sk IN (31, 68, 69):`
: Load the three bitmaps for `product_sk = 31`, `product_sk = 68`, and `product_sk = 69`, and
  calculate the bitwise *OR* of the three bitmaps, which can be done very efficiently.

`WHERE product_sk = 30 AND store_sk = 3:`
: Load the bitmaps for `product_sk = 30` and `store_sk = 3`, and calculate the bitwise *AND*. This
  works because the columns contain the rows in the same order, so the *k*th bit in one column's
  bitmap corresponds to the same row as the *k*th bit in another column's bitmap.
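
The two queries above can be sketched with Python integers standing in for bitmaps (bit *k* is set if row *k* has the value). The column contents and the helper names `build_bitmaps` and `rows_of` are made up for illustration.

```python
def build_bitmaps(column):
    """Map each distinct value to an int whose k-th bit is 1 if row k holds that value."""
    bitmaps = {}
    for k, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << k)
    return bitmaps


def rows_of(bitmap):
    """List the row numbers whose bit is set."""
    return [k for k in range(bitmap.bit_length()) if (bitmap >> k) & 1]


# Two columns of a hypothetical fact table, with rows in the same order in both.
product_sk = [69, 69, 69, 30, 31, 31, 30, 68, 69, 30]
store_sk = [4, 5, 5, 3, 2, 3, 3, 8, 1, 4]

product_bitmaps = build_bitmaps(product_sk)
store_bitmaps = build_bitmaps(store_sk)

# WHERE product_sk IN (31, 68, 69): bitwise OR of three bitmaps.
in_query = product_bitmaps.get(31, 0) | product_bitmaps.get(68, 0) | product_bitmaps.get(69, 0)

# WHERE product_sk = 30 AND store_sk = 3: bitwise AND of two bitmaps.
and_query = product_bitmaps.get(30, 0) & store_bitmaps.get(3, 0)

print(rows_of(in_query), rows_of(and_query))
```

A value that never occurs in the column simply contributes an all-zero bitmap, so the same expressions work without special cases.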

Bitmaps can also be used to answer graph queries, such as finding all users of a social network who
are followed by user *X* and who also follow user *Y*
@ -1046,9 +999,7 @@ Queries need to examine both the column data on disk and the recent writes in me
the two. The query execution engine hides this distinction from the user. From an analyst's point
of view, data that has been modified with inserts, updates, or deletes is immediately reflected in
subsequent queries. Snowflake, Vertica, Apache Pinot, Apache Druid, and many others do this
[[^61], [^63], [^64], [^76]].

## Query Execution: Compilation and Vectorization
@ -1068,30 +1019,29 @@ the amount of data they need to read off disk, but also the CPU time required to
operators. The simplest kind of operator is like an interpreter for a programming language: while operators. The simplest kind of operator is like an interpreter for a programming language: while
iterating over each row, it checks a data structure representing the query to find out which iterating over each row, it checks a data structure representing the query to find out which
comparisons or calculations it needs to perform on which columns. Unfortunately, this is too slow comparisons or calculations it needs to perform on which columns. Unfortunately, this is too slow
for many analytics purposes. Two alternative approaches for efficient query execution have emerged for many analytics purposes. Two alternative approaches for efficient query execution have emerged [^77]:
[^77]:
Query compilation Query compilation
: The query engine takes the SQL query and generates code for executing it. The code iterates over : The query engine takes the SQL query and generates code for executing it. The code iterates over
the rows one by one, looks at the values in the columns of interest, performs whatever comparisons the rows one by one, looks at the values in the columns of interest, performs whatever comparisons
or calculations are needed, and copies the necessary values to an output buffer if the required or calculations are needed, and copies the necessary values to an output buffer if the required
  conditions are satisfied. The query engine compiles the generated code to machine code (often
  using an existing compiler such as LLVM), and then runs it on the column-encoded data that has
  been loaded into memory. This approach to code generation is similar to the just-in-time (JIT)
  compilation approach that is used in the Java Virtual Machine (JVM) and similar runtimes.

Vectorized processing
: The query is interpreted, not compiled, but it is made fast by processing many values from a
  column in a batch, instead of iterating over rows one by one. A fixed set of predefined operators
  are built into the database; we can pass arguments to them and get back a batch of results
  [[^50], [^75]].

For example, we could pass the `product_sk` column and the ID of “bananas” to an equality operator,
and get back a bitmap (one bit per value in the input column, which is 1 if it's a banana); we could
then pass the `store_sk` column and the ID of the store of interest to the same equality operator,
and get back another bitmap; and then we could pass the two bitmaps to a “bitwise AND” operator, as
shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in
a particular store.

![ddia 0409](/fig/ddia_0409.png)
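
The banana example can be sketched in a few lines of Python. This is only an illustration of the idea, not how any particular engine implements it: the column contents, the IDs `69` and `4`, and the helper names `equals` and `matching_rows` are all assumptions made up for this sketch, and a real engine would use SIMD instructions over packed arrays rather than a Python loop.

```python
from typing import List


def equals(column: List[int], value: int) -> int:
    """Predefined equality operator: scans a whole column in one call and
    returns a bitmap (held in a Python int) with bit i set if column[i] == value."""
    bitmap = 0
    for i, v in enumerate(column):
        if v == value:
            bitmap |= 1 << i
    return bitmap


def matching_rows(bitmap: int) -> List[int]:
    """Lists the row numbers whose bit is set in the bitmap."""
    return [i for i in range(bitmap.bit_length()) if (bitmap >> i) & 1]


# Hypothetical column data: one entry per row of the fact table.
product_sk = [69, 69, 74, 31, 69, 31]
store_sk = [4, 5, 4, 4, 4, 5]

BANANAS = 69  # assumed product ID, for illustration only
STORE = 4     # assumed store ID, for illustration only

banana_rows = equals(product_sk, BANANAS)
store_rows = equals(store_sk, STORE)
both = banana_rows & store_rows  # the "bitwise AND" operator

print(matching_rows(both))
```

Each operator call processes an entire column, so the per-row interpretation overhead is paid once per batch rather than once per row.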
@ -1102,15 +1052,15 @@ practice [^77]. Both can achieve very good
performance by taking advantage of the characteristics of modern CPUs:

* preferring sequential memory access over random access to reduce cache misses [^78],
* doing most of the work in tight inner loops (that is, with a small number of instructions and no
  function calls) to keep the CPU instruction processing pipeline busy and avoid branch
  mispredictions,
* making use of parallelism such as multiple threads and single-instruction-multi-data (SIMD)
  instructions [[^79], [^80]], and
* operating directly on compressed data without decoding it into a separate in-memory
  representation, which saves memory allocation and copying costs.

## Materialized Views and Data Cubes
@ -1123,8 +1073,7 @@ expanded query.
When the underlying data changes, a materialized view needs to be updated accordingly. Some
databases can do that automatically, and there are also systems such as Materialize that specialize
in materialized view maintenance [^81].
Performing such updates means more work on writes, but materialized views can improve read
performance in workloads that repeatedly need to perform the same queries.
@ -1133,8 +1082,7 @@ discussed earlier, data warehouse queries often involve an aggregate function, s
`AVG`, `MIN`, or `MAX` in SQL. If the same aggregates are used by many different queries, it can be
wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that
queries use most often? A *data cube* or *OLAP cube* does this by creating a grid of aggregates
grouped by different dimensions [^82].
[Figure 4-10](/en/ch4#fig_data_cube) shows an example.

![ddia 0410](/fig/ddia_0410.png)
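
As a rough sketch of such a grid, the following Python builds a two-dimensional cube of `SUM` aggregates keyed by date and product, plus the totals along each dimension. The fact rows, the column names, and the `build_cube` helper are hypothetical and chosen only for illustration.

```python
from collections import defaultdict

# Hypothetical fact rows: (date_key, product_sk, net_price).
sales = [
    (140101, 32, 10.0),
    (140101, 33, 5.0),
    (140102, 32, 7.5),
    (140102, 32, 2.5),
    (140103, 33, 4.0),
]


def build_cube(rows):
    """Precomputes SUM(net_price) for every (date, product) cell, and the
    totals along each dimension, in a single pass over the raw facts."""
    cells = defaultdict(float)       # one aggregate per grid cell
    by_date = defaultdict(float)     # totals collapsed over products
    by_product = defaultdict(float)  # totals collapsed over dates
    for date_key, product_sk, net_price in rows:
        cells[(date_key, product_sk)] += net_price
        by_date[date_key] += net_price
        by_product[product_sk] += net_price
    return cells, by_date, by_product


cells, by_date, by_product = build_cube(sales)

# These lookups no longer need to scan the raw facts.
print(cells[(140102, 32)])
print(by_date[140101])
print(by_product[33])
```

The trade-off is visible even in this toy version: the lookups are cheap, but any question about a dimension that was not included in the grid still needs the raw data.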
@ -1187,8 +1135,8 @@ rectangular map area that the user is currently viewing. This requires a two-dim
like the following:

```
SELECT * FROM restaurants WHERE latitude > 51.4946 AND latitude < 51.5079
                            AND longitude > -0.1162 AND longitude < -0.1004;
```

A concatenated index over the latitude and longitude columns is not able to answer that kind of
@ -1197,16 +1145,12 @@ longitude), or all the restaurants in a range of longitudes (but anywhere betwee
South poles), but not both simultaneously.

One option is to translate a two-dimensional location into a single number using a space-filling
curve, and then to use a regular B-tree index [^83].
More commonly, specialized spatial indexes such as R-trees or Bkd-trees [^84]
are used; they divide up the space so that nearby data points tend to be grouped in the same
subtree. For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL's
Generalized Search Tree indexing facility [^85].
It is also possible to use regularly spaced grids of triangles, squares, or hexagons [^86].
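
To make the space-filling curve option concrete, here is a minimal sketch of one such curve, the Z-order (Morton) curve, which interleaves the bits of the two coordinates. The choice of Z-order, the 16-bit resolution, and the helper names are assumptions for this illustration; the text above does not prescribe a particular curve.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Z-order key: bit i of x goes to position 2*i, bit i of y to 2*i + 1."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def quantize(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Maps a coordinate in [lo, hi] onto an integer grid cell."""
    cells = (1 << bits) - 1
    return int((value - lo) / (hi - lo) * cells)


def location_key(latitude: float, longitude: float, bits: int = 16) -> int:
    """Single number that could be stored in an ordinary B-tree index."""
    x = quantize(longitude, -180.0, 180.0, bits)
    y = quantize(latitude, -90.0, 90.0, bits)
    return interleave_bits(x, y, bits)


print(location_key(51.5, -0.11))
```

Points that are close together tend to receive numerically close keys, although not always. A rectangle query still has to be broken up into several key ranges, which this sketch does not attempt.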
Multi-dimensional indexes are not just for geographic locations. For example, on an ecommerce
website you could use a three-dimensional index on the dimensions (*red*, *green*, *blue*) to search
@ -1215,14 +1159,12 @@ two-dimensional index on (*date*, *temperature*) in order to efficiently search
observations during the year 2013 where the temperature was between 25 and 30℃. With a
one-dimensional index, you would have to either scan over all the records from 2013 (regardless of
temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by
timestamp and temperature simultaneously [^87].

## Full-Text Search

Full-text search allows you to search a collection of text documents (web pages, product
descriptions, etc.) by keywords that might appear anywhere in the text [^88].
Information retrieval is a big, specialist topic that often involves language-specific processing:
for example, several Asian languages are written without spaces or punctuation between words, and
therefore splitting text into words requires a model that indicates which character sequences
@ -1249,26 +1191,21 @@ warehouse query that searches for rows matching two conditions ([Figure 4-9](/e
bitmaps for terms *x* and *y* and compute their bitwise AND. Even if the bitmaps are run-length
encoded, this can be done very efficiently.

For example, Lucene, the full-text indexing engine used by Elasticsearch and Solr, works like this [^90].
It stores the mapping from term to postings list in SSTable-like sorted files, which are merged in
the background using the same log-structured approach we saw earlier in this chapter [^91].
PostgreSQL's GIN index type also uses postings lists to support full-text search and indexing inside
JSON documents
[[^92], [^93]].

Instead of breaking text into words, an alternative is to find all the substrings of length *n*,
which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
`"hello"` are `"hel"`, `"ell"`, and `"llo"`. If we build an inverted index of all trigrams, we can
search the documents for arbitrary substrings that are at least three characters long. Trigram
indexes even allow regular expressions in search queries; the downside is that they are quite large [^94].
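
The trigram idea can be sketched as follows. The three documents and the function names are made up for illustration, and the index is a plain in-memory dictionary rather than the on-disk postings lists a real engine would use.

```python
from collections import defaultdict


def ngrams(text: str, n: int = 3) -> list:
    """All substrings of length n, in order of appearance."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def build_index(docs: dict) -> dict:
    """Inverted index: trigram -> set of IDs of documents containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in ngrams(text):
            index[gram].add(doc_id)
    return index


def search(docs: dict, index: dict, query: str) -> list:
    """Substring search for queries of at least three characters."""
    grams = ngrams(query)
    if not grams:
        raise ValueError("query must be at least three characters long")
    # Candidate documents must contain every trigram of the query.
    candidates = set(index.get(grams[0], set()))
    for gram in grams[1:]:
        candidates &= index.get(gram, set())
    # The trigrams may occur in separate places, so verify each candidate.
    return sorted(d for d in candidates if query in docs[d])


# Hypothetical documents, for illustration only.
docs = {1: "hello world", 2: "yellow submarine", 3: "help wanted"}
index = build_index(docs)

print(ngrams("hello"))
print(search(docs, index, "ello"))
```

The final verification step matters: a document can contain all of a query's trigrams without containing the query itself, so the index only narrows down the set of candidates.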
To cope with typos in documents or queries, Lucene is able to search text for words within a certain
edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced) [^95].
It does this by storing the set of terms as a finite state automaton over the characters in the
keys, similar to a *trie*
[^96],
@ -1309,12 +1246,9 @@ measure the distance between vectors. Cosine similarity measures the cosine of t
vectors to determine how close they are, while Euclidean distance measures the straight-line
distance between two points in space.
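
Both measures are short to write down. The sketch below uses plain Python lists as vectors; the function names are chosen for this illustration, and a real system would use an optimized numerical library instead.

```python
import math


def dot(a, b):
    """Sum of the element-wise products of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a, b):
    """Cosine of the angle between a and b: 1 means the same direction,
    0 means orthogonal. The lengths of the vectors do not matter."""
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def euclidean_distance(a, b):
    """Straight-line distance between the points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Same direction but different lengths: cosine similarity treats these as
# very similar, while Euclidean distance does not.
print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(euclidean_distance([1.0, 2.0], [2.0, 4.0]))
```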
Many early embedding models such as Word2Vec [^98],
BERT [^99],
and GPT [^100]
worked with text data. Such models are usually implemented as neural networks. Researchers went on to
create embedding models for video, audio, and images as well. More recently, model
architecture has become *multimodal*: a single model can generate vector embeddings for multiple
@ -1331,42 +1265,39 @@ closest to the query vector. Since the R-trees we saw previously dont work we
many dimensions, specialized vector indexes are used, such as:

Flat indexes
: Vectors are stored in the index as they are. A query must read every vector and measure its
  distance to the query vector. Flat indexes are accurate, but measuring the distance between the
  query and each vector is slow.

Inverted file (IVF) indexes
: The vector space is clustered into partitions (called *centroids*) of vectors to reduce the number
  of vectors that must be compared. IVF indexes are faster than flat indexes, but can give only
  approximate results: the query and a document may fall into different partitions, even though they
  are close to each other. A query on an IVF index first defines *probes*, which are simply the number
  of partitions to check. Queries that use more probes will be more accurate, but will be slower, as
  more vectors must be compared.

Hierarchical Navigable Small World (HNSW)
: HNSW indexes maintain multiple layers of the vector space, as illustrated in [Figure 4-11](/en/ch4#fig_vector_hnsw).
  Each layer is represented as a graph, where nodes represent vectors, and edges represent proximity
  to nearby vectors. A query starts by locating the nearest vector in the topmost layer, which has a
  small number of nodes. The query then moves to the same node in the layer below and follows the
  edges in that layer, which is more densely connected, looking for a vector that is closer to the
  query vector. The process continues until the last layer is reached. As with IVF indexes, HNSW
  indexes are approximate.

![ddia 0411](/fig/ddia_0411.png)

###### Figure 4-11. Searching for the database entry that is closest to a given query vector in a HNSW index.
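
The difference between a flat index and an IVF index can be sketched as follows. The vectors, the two centroid positions, and the function names are all hypothetical; in particular, the centroids are simply given here, whereas a real IVF index would learn them by clustering the data. HNSW is not sketched because it needs considerably more machinery.

```python
import math


def distance(a, b):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(vectors, query):
    """Flat index: compare the query against every stored vector."""
    return min(vectors, key=lambda vid: distance(vectors[vid], query))


def build_ivf(vectors, centroids):
    """Assign every vector to the partition of its nearest centroid."""
    partitions = [[] for _ in centroids]
    for vid, vec in vectors.items():
        nearest = min(range(len(centroids)),
                      key=lambda c: distance(centroids[c], vec))
        partitions[nearest].append(vid)
    return partitions


def ivf_search(vectors, centroids, partitions, query, probes=1):
    """IVF index: only look inside the `probes` partitions whose centroids
    are closest to the query."""
    order = sorted(range(len(centroids)),
                   key=lambda c: distance(centroids[c], query))
    candidates = [vid for c in order[:probes] for vid in partitions[c]]
    if not candidates:
        return None
    return min(candidates, key=lambda vid: distance(vectors[vid], query))


# Hypothetical data, picked so that the query sits near a partition boundary.
vectors = {"a": (0.0, 0.0), "b": (1.0, 0.0), "c": (4.4, 0.0), "d": (10.0, 0.0)}
centroids = [(0.5, 0.0), (8.0, 0.0)]
partitions = build_ivf(vectors, centroids)
query = (4.0, 0.0)

print(flat_search(vectors, query))
print(ivf_search(vectors, centroids, partitions, query, probes=1))
print(ivf_search(vectors, centroids, partitions, query, probes=2))
```

With one probe, the search only visits the partition whose centroid is nearest to the query and misses the true nearest neighbor, which was assigned to the other partition. With two probes it visits both partitions and returns the same answer as the flat index, at the cost of comparing more vectors.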
Many popular vector databases implement IVF and HNSW indexes. Facebook's Faiss library has many
variations of each [^101],
and PostgreSQL's pgvector supports both as well [^102].
The full details of the IVF and HNSW algorithms are beyond the scope of this book, but their papers
are an excellent resource
[[^103], [^104]].
## Summary

In this chapter we tried to get to the bottom of how databases perform storage and retrieval. What
happens when you store data in a database, and what does the database do when you query for the
@ -1377,25 +1308,25 @@ analytics (OLAP). In this chapter we saw that storage engines optimized for OLTP
from those optimized for analytics:

* OLTP systems are optimized for a high volume of requests, each of which reads and writes a small
  number of records, and which need fast responses. The records are typically accessed via a primary
  key or a secondary index, and these indexes are typically ordered mappings from key to record,
  which also support range queries.
* Data warehouses and similar analytic systems are optimized for complex read queries that scan over
  a large number of records. They generally use a column-oriented storage layout with compression
  that minimizes the amount of data that such a query needs to read off disk, and just-in-time
  compilation of queries or vectorization to minimize the amount of CPU time spent processing the
  data.

On the OLTP side, we saw storage engines from two main schools of thought:

* The log-structured approach, which only permits appending to files and deleting obsolete files,
  but never updates a file that has been written. SSTables, LSM-trees, RocksDB, Cassandra, HBase,
  Scylla, Lucene, and others belong to this group. In general, log-structured storage engines tend
  to provide high write throughput.
* The update-in-place approach, which treats the disk as a set of fixed-size pages that can be
  overwritten. B-trees, the biggest example of this philosophy, are used in all major relational
  OLTP databases and also many nonrelational ones. As a rule of thumb, B-trees tend to be better for
  reads, providing higher read throughput and lower response times than log-structured storage.

We then looked at indexes that can search for multiple conditions at the same time: multidimensional
indexes such as R-trees that can search for points on a map by latitude and longitude at the same
@ -1413,10 +1344,11 @@ Although this chapter couldnt make you an expert in tuning any one particular
has hopefully equipped you with enough vocabulary and ideas that you can make sense of the
documentation for the database of your choice.
##### Footnotes
##### References
### Summary


@ -31,22 +31,22 @@ and writing that field). However, in a large application, code changes often can
instantaneously:

* With server-side applications you may want to perform a *rolling upgrade*
  (also known as a *staged rollout*), deploying the new version to a few nodes at a time, checking
  whether the new version is running smoothly, and gradually working your way through all the nodes.
  This allows new versions to be deployed without service downtime, and thus encourages more
  frequent releases and better evolvability.
* With client-side applications you're at the mercy of the user, who may not install the update for
  some time.

This means that old and new versions of the code, and old and new data formats, may potentially all
coexist in the system at the same time. In order for the system to continue running smoothly, we
need to maintain compatibility in both directions:

Backward compatibility
: Newer code can read data that was written by older code.

Forward compatibility
: Older code can read data that was written by newer code.
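
The two directions can be illustrated with a small sketch. The record layout, the `email` field that the newer version adds, and the two decoder functions are hypothetical and stand in for two versions of the same application.

```python
import json


def decode_v1(data: bytes) -> dict:
    """Older code: only knows the `name` field. It ignores any field it does
    not recognize, which is what lets it read data written by newer code."""
    record = json.loads(data)
    return {"name": record["name"]}


def decode_v2(data: bytes) -> dict:
    """Newer code: also knows the (hypothetical) `email` field. It falls back
    to None when the field is missing, so it can still read older data."""
    record = json.loads(data)
    return {"name": record["name"], "email": record.get("email")}


old_data = json.dumps({"name": "Alice"}).encode("utf-8")
new_data = json.dumps({"name": "Bob", "email": "bob@example.com"}).encode("utf-8")

print(decode_v2(old_data))  # backward compatibility: new code, old data
print(decode_v1(new_data))  # forward compatibility: old code, new data
```

Note that `decode_v1` silently drops the unknown field. That is harmless when only reading, but if the older code were to write the record back, the new field would be lost unless it is preserved explicitly.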
Backward compatibility is normally not hard to achieve: as author of the newer code, you know the
format of data written by older code, and so you can explicitly handle it (if necessary by simply
@ -77,12 +77,12 @@ message queues.
Programs usually work with data in (at least) two different representations:

1. In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These
   data structures are optimized for efficient access and manipulation by the CPU (typically using
   pointers).
2. When you want to write data to a file or send it over the network, you have to encode it as some
   kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn't
   make sense to any other process, this sequence-of-bytes representation often looks quite
   different from the data structures that are normally used in memory.
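
A minimal sketch of the two representations, using JSON as the byte-sequence format. The `User` class and its fields are invented for this illustration.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class User:
    """In-memory representation: an object that holds references to a
    string and to a list of strings."""
    name: str
    interests: List[str]


def encode(user: User) -> bytes:
    """In-memory object -> self-contained sequence of bytes."""
    return json.dumps(asdict(user)).encode("utf-8")


def decode(data: bytes) -> User:
    """Sequence of bytes -> a fresh in-memory object."""
    fields = json.loads(data.decode("utf-8"))
    return User(name=fields["name"], interests=fields["interests"])


user = User("Martin", ["daydreaming", "hacking"])
encoded = encode(user)
restored = decode(encoded)

print(encoded)
print(restored == user, restored is user)
```

The restored object compares equal to the original but is a different object in memory: the bytes carry the values, not the pointers.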
Thus, we need some kind of translation between the two representations. The translation from the
in-memory representation to a byte sequence is called *encoding* (also known as *serialization* or
@ -114,22 +114,20 @@ These encoding libraries are very convenient, because they allow in-memory objec
restored with minimal additional code. However, they also have a number of deep problems:

* The encoding is often tied to a particular programming language, and reading the data in another
  language is very difficult. If you store or transmit data in such an encoding, you are committing
  yourself to your current programming language for potentially a very long time, and precluding
  integrating your systems with those of other organizations (which may use different languages).
* In order to restore data in the same object types, the decoding process needs to be able to
  instantiate arbitrary classes. This is frequently a source of security problems [^1]:
  if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate
  arbitrary classes, which in turn often allows them to do terrible things such as remotely
  executing arbitrary code [^2] [^3].
* Versioning data is often an afterthought in these libraries: as they are intended for quick and
  easy encoding of data, they often neglect the inconvenient problems of forward and backward
  compatibility [^4].
* Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also
  often an afterthought. For example, Java's built-in serialization is notorious for its bad
  performance and bloated encoding [^5].

For these reasons it's generally a bad idea to use your language's built-in encoding for anything
other than very transient purposes.
@ -138,8 +136,7 @@ other than very transient purposes.
When moving to standardized encodings that can be written and read by many programming languages, JSON
and XML are the obvious contenders. They are widely known, widely supported, and almost as widely
disliked. XML is often criticized for being too verbose and unnecessarily complicated [^6].
JSON's popularity is mainly due to its built-in support in web browsers and simplicity relative to
XML. CSV is another popular language-independent format, but it only supports tabular data without
nesting.
@ -149,33 +146,31 @@ popular topic of debate). Besides the superficial syntactic issues, they also ha
problems: problems:
* There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish * There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish
between a number and a string that happens to consist of digits (except by referring to an external between a number and a string that happens to consist of digits (except by referring to an external
schema). JSON distinguishes strings and numbers, but it doesnt distinguish integers and schema). JSON distinguishes strings and numbers, but it doesnt distinguish integers and
floating-point numbers, and it doesnt specify a precision. floating-point numbers, and it doesnt specify a precision.
This is a problem when dealing with large numbers; for example, integers greater than 253 cannot This is a problem when dealing with large numbers; for example, integers greater than 253 cannot
be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become
inaccurate when parsed in a language that uses floating-point numbers, such as JavaScript inaccurate when parsed in a language that uses floating-point numbers, such as JavaScript [^7].
[^7]. An example of numbers larger than 253 occurs on X (formerly Twitter), which uses a 64-bit number to
An example of numbers larger than 253 occurs on X (formerly Twitter), which uses a 64-bit number to identify each post. The JSON returned by the API includes post IDs twice, once as a JSON number and
identify each post. The JSON returned by the API includes post IDs twice, once as a JSON number and once as a decimal string, to work around the fact that the numbers are not correctly parsed by
once as a decimal string, to work around the fact that the numbers are not correctly parsed by JavaScript applications [^8].
JavaScript applications [^8].
* JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they
  don't support binary strings (sequences of bytes without a character encoding). Binary strings are a
  useful feature, so people get around this limitation by encoding the binary data as text using
  Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded.
  This works, but it's somewhat hacky and increases the data size by 33%.
* XML Schema and JSON Schema are powerful, and thus quite
  complicated to learn and implement. Since the correct interpretation of data (such as numbers and
  binary strings) depends on information in the schema, applications that don't use XML/JSON schemas
  need to potentially hard-code the appropriate encoding/decoding logic instead.
* CSV does not have any schema, so it is up to the application to define the meaning of each row and
  column. If an application change adds a new row or column, you have to handle that change manually.
  CSV is also a quite vague format (what happens if a value contains a comma or a newline character?).
  Although its escaping rules have been formally specified [^9],
  not all parsers implement them correctly.
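
Two of the problems listed above can be reproduced in a few lines. The following Python sketch is illustrative only: the sample values are made up, and converting to `float` is used as a stand-in for a parser that reads every JSON number as a double-precision float, as JavaScript does.

```python
import base64

# Binary data has to be wrapped in a text encoding such as Base64 before it
# can be placed in a JSON or XML document.
raw = bytes(range(256)) * 3          # 768 arbitrary bytes (made-up sample)
encoded = base64.b64encode(raw)      # every 3 input bytes become 4 characters
overhead = len(encoded) / len(raw) - 1
print(f"{len(raw)} bytes -> {len(encoded)} characters (+{overhead:.0%})")

# A double has a 53-bit significand, so not every integer above 2**53 can be
# represented. float() stands in for a JavaScript-style number parser here.
post_id = 2**53 + 1                  # hypothetical 64-bit post ID
as_double = float(post_id)
print(int(as_double) == post_id)     # False: the lowest bit was rounded away
print(str(post_id))                  # a decimal string keeps every digit
```

This is why an API might reasonably return large identifiers as decimal strings in addition to JSON numbers.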
Despite these flaws, JSON, XML, and CSV are good enough for many purposes. It's likely that they will
remain popular, especially as data interchange formats (i.e., for sending data from one organization to
@ -211,16 +206,16 @@ JSON Schema so that keys may only contain digits, and values can only be strings
##### Example 5-1. Example JSON Schema with integer keys and string values. Integer keys are represented as strings containing only integers since JSON Schema requires all keys to be strings. ##### Example 5-1. Example JSON Schema with integer keys and string values. Integer keys are represented as strings containing only integers since JSON Schema requires all keys to be strings.
``` ```json
{ {
"$schema": "http://json-schema.org/draft-07/schema#", "$schema": "http://json-schema.org/draft-07/schema#",
"type": "object", "type": "object",
"patternProperties": { "patternProperties": {
"^[0-9]+$": { "^[0-9]+$": {
"type": "string" "type": "string"
} }
}, },
"additionalProperties": false "additionalProperties": false
} }
``` ```
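
To make the constraints in Example 5-1 concrete, here is a rough hand-written equivalent in Python. It is only a sketch of what the schema expresses, not how a JSON Schema validator is implemented; the function name `matches_example_schema` is made up for this illustration.

```python
import json
import re

# Same intent as the schema's pattern "^[0-9]+$"; fullmatch() anchors both ends.
DIGITS_ONLY = re.compile(r"[0-9]+")

def matches_example_schema(document: str) -> bool:
    """Check the rules of Example 5-1 by hand (hypothetical helper)."""
    value = json.loads(document)
    if not isinstance(value, dict):              # "type": "object"
        return False
    for key, item in value.items():
        if not DIGITS_ONLY.fullmatch(key):       # "additionalProperties": false
            return False
        if not isinstance(item, str):            # values under matching keys are strings
            return False
    return True

print(matches_example_schema('{"1": "one", "42": "forty-two"}'))  # True
print(matches_example_schema('{"1": 1}'))                         # False
print(matches_example_schema('{"one": "1"}'))                     # False
```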
@ -229,8 +224,7 @@ if/else schema logic, named types, references to remote schemas, and much more.
for a very powerful schema language. Such features also make for unwieldy definitions. It can be for a very powerful schema language. Such features also make for unwieldy definitions. It can be
challenging to resolve remote schemas, reason about conditional rules, or evolve schemas in a challenging to resolve remote schemas, reason about conditional rules, or evolve schemas in a
forwards or backwards compatible way [^10]. forwards or backwards compatible way [^10].
Similar concerns apply to XML Schema Similar concerns apply to XML Schema [^11].
[^11].
### Binary encoding ### Binary encoding
@ -251,9 +245,9 @@ will need to include the strings `userName`, `favoriteNumber`, and `interests` s
``` ```
{ {
"userName": "Martin", "userName": "Martin",
"favoriteNumber": 1337, "favoriteNumber": 1337,
"interests": ["daydreaming", "hacking"] "interests": ["daydreaming", "hacking"]
} }
``` ```
@ -262,13 +256,13 @@ shows the byte sequence that you get if you encode the JSON document in [Example
MessagePack. The first few bytes are as follows: MessagePack. The first few bytes are as follows:
1. The first byte, `0x83`, indicates that what follows is an object (top four bits = `0x80`) with three
   fields (bottom four bits = `0x03`). (In case you're wondering what happens if an object has more
   than 15 fields, so that the number of fields doesn't fit in four bits, it then gets a different type
   indicator, and the number of fields is encoded in two or four bytes.)
2. The second byte, `0xa8`, indicates that what follows is a string (top four bits = `0xa0`) that is eight
   bytes long (bottom four bits = `0x08`).
3. The next eight bytes are the field name `userName` in ASCII. Since the length was indicated
   previously, there's no need for any marker to tell us where the string ends (or any escaping).
4. The next seven bytes encode the six-letter string value `Martin` with a prefix `0xa6`, and so on.
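
The steps above can be reproduced with a deliberately tiny encoder. This is a sketch covering only the cases needed for the example document (short maps, short arrays, short strings, and small non-negative integers), not a full MessagePack implementation. The `0x80` and `0xa0` prefixes come from the walkthrough; the array prefix `0x90` and the integer type bytes `0xcc` and `0xcd` are taken from the MessagePack format and are not discussed in the text.

```python
import json

def encode(value) -> bytes:
    """Encode a small subset of MessagePack (illustrative only)."""
    if isinstance(value, bool):
        raise TypeError("booleans are outside the scope of this sketch")
    if isinstance(value, dict) and len(value) < 16:
        out = bytes([0x80 | len(value)])          # object header: type + field count
        for key, item in value.items():
            out += encode(key) + encode(item)
        return out
    if isinstance(value, list) and len(value) < 16:
        return bytes([0x90 | len(value)]) + b"".join(encode(v) for v in value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        if len(data) < 32:
            return bytes([0xA0 | len(data)]) + data   # length prefix, no terminator
    if isinstance(value, int):
        if 0 <= value < 0x80:
            return bytes([value])                     # fits in the type byte itself
        if 0x80 <= value <= 0xFF:
            return b"\xcc" + bytes([value])           # 8-bit unsigned integer
        if 0x100 <= value <= 0xFFFF:
            return b"\xcd" + value.to_bytes(2, "big") # 16-bit unsigned integer
    raise TypeError(f"value not supported by this sketch: {value!r}")

doc = {"userName": "Martin", "favoriteNumber": 1337,
       "interests": ["daydreaming", "hacking"]}
encoded = encode(doc)
compact_json = json.dumps(doc, separators=(",", ":"))

print(encoded[:11].hex(" "))             # 0x83, 0xa8, "userName", then 0xa6
print(len(encoded), len(compact_json))   # binary size vs. whitespace-free JSON size
```

Because every field name is still written out in full, the saving over the textual JSON is modest.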
The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the
@ -286,8 +280,7 @@ In the following sections we will see how we can do much better, and encode the
## Protocol Buffers ## Protocol Buffers
Protocol Buffers (protobuf) is a binary encoding library developed at Google. Protocol Buffers (protobuf) is a binary encoding library developed at Google.
It is similar to Apache Thrift, which was originally developed by Facebook It is similar to Apache Thrift, which was originally developed by Facebook [^13];
[^13];
most of what this section says about Protocol Buffers applies also to Thrift. most of what this section says about Protocol Buffers applies also to Thrift.
Protocol Buffers requires a schema for any data that is encoded. To encode the data Protocol Buffers requires a schema for any data that is encoded. To encode the data
@ -298,9 +291,9 @@ interface definition language (IDL) like this:
syntax = "proto3"; syntax = "proto3";
message Person { message Person {
string user_name = 1; string user_name = 1;
int64 favorite_number = 2; int64 favorite_number = 2;
repeated string interests = 3; repeated string interests = 3;
} }
``` ```
@ -381,8 +374,7 @@ value wont fit in 32 bits, it will be truncated.
Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers.
It was started in 2009 as a subproject of Hadoop, as a result of Protocol Buffers not being a good
fit for Hadoop's use cases [^15].
Avro also uses a schema to specify the structure of the data being encoded. It has two schema Avro also uses a schema to specify the structure of the data being encoded. It has two schema
languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily
@ -393,9 +385,9 @@ Our example schema, written in Avro IDL, might look like this:
``` ```
record Person { record Person {
string userName; string userName;
union { null, long } favoriteNumber = null; union { null, long } favoriteNumber = null;
array<string> interests; array<string> interests;
} }
``` ```
@ -403,13 +395,13 @@ The equivalent JSON representation of that schema is as follows:
``` ```
{ {
"type": "record", "type": "record",
"name": "Person", "name": "Person",
"fields": [ "fields": [
{"name": "userName", "type": "string"}, {"name": "userName", "type": "string"},
{"name": "favoriteNumber", "type": ["null", "long"], "default": null}, {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
{"name": "interests", "type": {"type": "array", "items": "string"}} {"name": "interests", "type": {"type": "array", "items": "string"}}
] ]
} }
``` ```
@ -455,8 +447,7 @@ application code is expecting, and their types.
If the reader's and writer's schema are the same, decoding is easy. If they are different, Avro
resolves the differences by looking at the writer's schema and the reader's schema side by side and
translating the data from the writer's schema into the reader's schema. The Avro specification
[[16](/en/ch5#AvroSpec), [[^16], [^17]]
[17](/en/ch5#AvroParsing)]
defines exactly how this resolution works, and it is illustrated in defines exactly how this resolution works, and it is illustrated in
[Figure 5-6](/en/ch5#fig_encoding_avro_resolution). [Figure 5-6](/en/ch5#fig_encoding_avro_resolution).
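
As a rough illustration of the idea, the sketch below translates a record from a writer's schema into a reader's schema. It makes several simplifying assumptions: it works on values that have already been decoded into a Python dict, it only handles flat records, and it applies just three rules (fields are matched by name, fields unknown to the reader are dropped, and fields missing from the writer are filled from the reader's default). The function name `resolve` and the `photoURL` field are invented for this example; the real resolution rules in the Avro specification cover many more cases.

```python
def resolve(record: dict, writer_fields: list, reader_fields: list) -> dict:
    """Translate a decoded record from the writer's to the reader's schema."""
    writer_names = {field["name"] for field in writer_fields}
    result = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_names:
            result[name] = record[name]           # present in both: copy by name
        elif "default" in field:
            result[name] = field["default"]       # only in reader: use its default
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return result                                 # writer-only fields are ignored

writer_fields = [
    {"name": "userName", "type": "string"},
    {"name": "favoriteNumber", "type": ["null", "long"], "default": None},
    {"name": "interests", "type": {"type": "array", "items": "string"}},
]
# Hypothetical newer reader: field order differs, one field removed, one added.
reader_fields = [
    {"name": "interests", "type": {"type": "array", "items": "string"}},
    {"name": "userName", "type": "string"},
    {"name": "photoURL", "type": ["null", "string"], "default": None},
]

written = {"userName": "Martin", "favoriteNumber": 1337,
           "interests": ["daydreaming", "hacking"]}
resolved = resolve(written, writer_fields, reader_fields)
print(resolved)
```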
@ -511,33 +502,32 @@ the space savings from the binary encoding futile.
The answer depends on the context in which Avro is being used. To give a few examples: The answer depends on the context in which Avro is being used. To give a few examples:
Large file with lots of records
: A common use for Avro is for storing a large file containing millions of records, all encoded with
  the same schema. (We will discuss this kind of situation in [Link to Come].) In this case, the
  writer of that file can just include the writer's schema once at the beginning of the file. Avro
  specifies a file format (object container files) to do this.

Database with individually written records
: In a database, different records may be written at different points in time using different
  writer's schemas; you cannot assume that all the records will have the same schema. The simplest
  solution is to include a version number at the beginning of every encoded record, and to keep a
  list of schema versions in your database. A reader can fetch a record, extract the version number,
  and then fetch the writer's schema for that version number from the database. Using that writer's
  schema, it can decode the rest of the record.

  Confluent's schema registry for Apache Kafka [^19] and LinkedIn's Espresso [^20] work this way, for example.

Sending records over a network connection
: When two processes are communicating over a bidirectional network connection, they can negotiate
  the schema version on connection setup and then use that schema for the lifetime of the
  connection. The Avro RPC protocol (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) works like this.
A database of schema versions is a useful thing to have in any case, since it acts as documentation
and gives you a chance to check schema compatibility [^21].
As the version number, you could use a simple incrementing integer, or you could use a hash of the
schema.
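
A minimal sketch of the version-prefix approach is shown below. Everything here is an assumption made for illustration: the 4-byte big-endian prefix, the in-memory dict standing in for the database's list of schema versions, and the helper names. It is not the wire format of any particular schema registry, and the record payload is treated as opaque bytes.

```python
import hashlib
import json
import struct

schema_versions = {}   # version number -> writer's schema (stand-in for a DB table)

def register(schema: dict) -> int:
    """Store a schema under a simple incrementing version number."""
    version = len(schema_versions) + 1
    schema_versions[version] = schema
    return version

def fingerprint(schema: dict) -> str:
    """Alternative identifier: a hash of a canonical form of the schema."""
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()

def frame(version: int, payload: bytes) -> bytes:
    """Prefix an encoded record with the version of its writer's schema."""
    return struct.pack(">I", version) + payload

def unframe(data: bytes):
    """Return (writer's schema, remaining encoded bytes) for a stored record."""
    (version,) = struct.unpack_from(">I", data)
    return schema_versions[version], data[4:]

v1 = register({"type": "record", "name": "Person",
               "fields": [{"name": "userName", "type": "string"}]})
stored = frame(v1, b"\x0cMartin")       # payload bytes are just a placeholder
schema, payload = unframe(stored)
print(v1, schema["name"], payload)
```

An incrementing integer is compact, whereas a hash lets independent writers derive the same identifier for the same schema without coordinating.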
@ -581,13 +571,10 @@ languages.
The ideas on which these encodings are based are by no means new. For example, they have a lot in The ideas on which these encodings are based are by no means new. For example, they have a lot in
common with ASN.1, a schema definition language that was first standardized in 1984 common with ASN.1, a schema definition language that was first standardized in 1984
[[23](/en/ch5#Larmouth1999), [[^23], [^24]].
[24](/en/ch5#Kaliski1993)].
It was used to define various network protocols, and its binary encoding (DER) is still used to encode It was used to define various network protocols, and its binary encoding (DER) is still used to encode
SSL certificates (X.509), for example SSL certificates (X.509), for example [^25].
[^25]. ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers [^26].
ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers
[^26].
However, it's also very complex and badly documented, so ASN.1
is probably not a good choice for new applications.
@ -601,14 +588,14 @@ So, we can see that although textual data formats such as JSON, XML, and CSV are
encodings based on schemas are also a viable option. They have a number of nice properties: encodings based on schemas are also a viable option. They have a number of nice properties:
* They can be much more compact than the various “binary JSON” variants, since they can omit field * They can be much more compact than the various “binary JSON” variants, since they can omit field
names from the encoded data. names from the encoded data.
* The schema is a valuable form of documentation, and because the schema is required for decoding, * The schema is a valuable form of documentation, and because the schema is required for decoding,
you can be sure that it is up to date (whereas manually maintained documentation may easily you can be sure that it is up to date (whereas manually maintained documentation may easily
diverge from reality). diverge from reality).
* Keeping a database of schemas allows you to check forward and backward compatibility of schema * Keeping a database of schemas allows you to check forward and backward compatibility of schema
changes, before anything is deployed. changes, before anything is deployed.
* For users of statically typed programming languages, the ability to generate code from the schema * For users of statically typed programming languages, the ability to generate code from the schema
is useful, since it enables type-checking at compile time. is useful, since it enables type-checking at compile time.
In summary, schema evolution allows the same kind of flexibility as schemaless/schema-on-read JSON In summary, schema evolution allows the same kind of flexibility as schemaless/schema-on-read JSON
databases provide (see [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)), while also providing better databases provide (see [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)), while also providing better
@ -681,8 +668,7 @@ versions of the schema.
More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or
moving some data into a separate table—still require data to be rewritten, often at the application moving some data into a separate table—still require data to be rewritten, often at the application
level [^27]. level [^27].
Maintaining forward and backward compatibility across such migrations is still a research problem Maintaining forward and backward compatibility across such migrations is still a research problem [^28].
[^28].
### Archival storage ### Archival storage
@ -722,8 +708,7 @@ application-specific, and the client and server need to agree on the details of
In some ways, services are similar to databases: they typically allow clients to submit and query In some ways, services are similar to databases: they typically allow clients to submit and query
data. However, while databases allow arbitrary queries using the query languages we discussed in data. However, while databases allow arbitrary queries using the query languages we discussed in
[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs [Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
that are predetermined by the business logic (application code) of the service that are predetermined by the business logic (application code) of the service [^29]. This restriction provides a degree of encapsulation: services can impose
[^29]. This restriction provides a degree of encapsulation: services can impose
fine-grained restrictions on what clients can and cannot do. fine-grained restrictions on what clients can and cannot do.
A key design goal of a service-oriented/microservices architecture is to make the application easier A key design goal of a service-oriented/microservices architecture is to make the application easier
@ -742,18 +727,17 @@ perhaps a slight misnomer, because web services are not only used on the web, bu
different contexts. For example: different contexts. For example:
1. A client application running on a user's device (e.g., a native app on a mobile device, or a
   JavaScript web app in a browser) making requests to a service over HTTP. These requests typically
   go over the public internet.
2. One service making requests to another service owned by the same organization, often located
   within the same datacenter, as part of a service-oriented/microservices architecture.
3. One service making requests to a service owned by a different organization, usually via the
   internet. This is used for data exchange between different organizations' backend systems. This
   category includes public APIs provided by online services, such as credit card processing
   systems, or OAuth for shared access to user data.
The most popular service design philosophy is REST, which builds upon the principles of HTTP The most popular service design philosophy is REST, which builds upon the principles of HTTP
[[30](/en/ch5#Fielding2000), [[^30], [^31]].
[31](/en/ch5#Fielding2008)].
It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for
cache control, authentication, and content type negotiation. An API designed according to the cache control, authentication, and content type negotiation. An API designed according to the
principles of REST is called *RESTful*. principles of REST is called *RESTful*.
@ -763,8 +747,7 @@ format to send and expect in response. Even if a service adopts RESTful design p
need to somehow find out these details. Service developers often use an interface definition
language (IDL) to define and document their service's API endpoints and data models, and to evolve
them over time. Other developers can then use the service definition to determine how to query the
service. The two most popular service IDLs are OpenAPI (also known as Swagger [^32])
and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send
and receive Protocol Buffers.
@ -778,25 +761,25 @@ definitions.
``` ```
openapi: 3.0.0 openapi: 3.0.0
info: info:
title: Ping, Pong title: Ping, Pong
version: 1.0.0 version: 1.0.0
servers: servers:
- url: http://localhost:8080 - url: http://localhost:8080
paths: paths:
/ping: /ping:
get: get:
summary: Given a ping, returns a pong message summary: Given a ping, returns a pong message
responses: responses:
'200': '200':
description: A pong description: A pong
content: content:
application/json: application/json:
schema: schema:
type: object type: object
properties: properties:
message: message:
type: string type: string
example: Pong! example: Pong!
``` ```
Even if a design philosophy and IDL are adopted, developers must still write the code that Even if a design philosophy and IDL are adopted, developers must still write the code that
@ -815,12 +798,12 @@ from pydantic import BaseModel
app = FastAPI(title="Ping, Pong", version="1.0.0") app = FastAPI(title="Ping, Pong", version="1.0.0")
class PongResponse(BaseModel): class PongResponse(BaseModel):
message: str = "Pong!" message: str = "Pong!"
@app.get("/ping", response_model=PongResponse, @app.get("/ping", response_model=PongResponse,
summary="Given a ping, returns a pong message") summary="Given a ping, returns a pong message")
async def ping(): async def ping():
return PongResponse() return PongResponse()
``` ```
Many frameworks couple service definitions and server code together. In some cases, such as with the Many frameworks couple service definitions and server code together. In some cases, such as with the
@ -841,50 +824,47 @@ Architecture (CORBA) is excessively complex, and does not provide backward or fo
compatibility [^33]. compatibility [^33].
SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are
also plagued by complexity and compatibility problems also plagued by complexity and compatibility problems
[[34](/en/ch5#Lacey2006), [[^34], [^35], [^36]].
[35](/en/ch5#Tilkov2006),
[36](/en/ch5#Bray2004)].
All of these are based on the idea of a *remote procedure call* (RPC), which has been around since All of these are based on the idea of a *remote procedure call* (RPC), which has been around since
the 1970s [^37]. the 1970s [^37].
The RPC model tries to make a request to a remote network service look the same as calling a function or The RPC model tries to make a request to a remote network service look the same as calling a function or
method in your programming language, within the same process (this abstraction is called *location method in your programming language, within the same process (this abstraction is called *location
transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed
[[38](/en/ch5#Waldo1994), [[^38], [^39]].
[39](/en/ch5#Vinoski2008)].
A network request is very different from a local function call: A network request is very different from a local function call:
* A local function call is predictable and either succeeds or fails, depending only on parameters
  that are under your control. A network request is unpredictable: the request or response may be
  lost due to a network problem, or the remote machine may be slow or unavailable, and such problems
  are entirely outside of your control. Network problems are common, so you have to anticipate them,
  for example by retrying a failed request.
* A local function call either returns a result, or throws an exception, or never returns (because
  it goes into an infinite loop or the process crashes). A network request has another possible
  outcome: it may return without a result, due to a *timeout*. In that case, you simply don't know
  what happened: if you don't get a response from the remote service, you have no way of knowing
  whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/en/ch9#ch_distributed).)
* If you retry a failed network request, it could happen that the previous request actually got
  through, and only the response was lost.
  In that case, retrying will cause the action to
  be performed multiple times, unless you build a mechanism for deduplication (*idempotence*) into
  the protocol [^40].
  Local function calls don't have this problem. (We discuss idempotence in more detail
  in [Link to Come].)
* Every time you call a local function, it normally takes about the same time to execute. A network
  request is much slower than a function call, and its latency is also wildly variable: at good
  times it may complete in less than a millisecond, but when the network is congested or the remote
  service is overloaded it may take many seconds to do exactly the same thing.
* When you call a local function, you can efficiently pass it references (pointers) to objects in
  local memory. When you make a network request, all those parameters need to be encoded into a
  sequence of bytes that can be sent over the network. That's okay if the parameters are immutable
  primitives like numbers or short strings, but it quickly becomes problematic with larger amounts
  of data and mutable objects.
* The client and the service may be implemented in different programming languages, so the RPC
  framework must translate datatypes from one language into another. This can end up ugly, since not
  all languages have the same types; recall JavaScript's problems with numbers greater than 2^53,
  for example (see [“JSON, XML, and Binary Variants”](/en/ch5#sec_encoding_json)). This problem doesn't exist in a single process written in
  a single language.
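
The retry problem in the third point can be simulated without any networking. In the sketch below, a stand-in service applies a deposit but "loses" its first response; the client retries with the same request ID, and the service uses that ID to avoid applying the deposit twice. The class `PaymentService`, the helper `deposit_with_retry`, and the use of a UUID as the request ID are all assumptions made for this illustration, not part of any particular RPC framework.

```python
import uuid

class PaymentService:
    """Stand-in for a remote service (no real network involved)."""

    def __init__(self):
        self.balance = 0
        self.responses = {}              # request ID -> response already produced
        self.drop_next_response = True   # simulate one lost response

    def deposit(self, request_id: str, amount: int) -> dict:
        if request_id in self.responses:
            response = self.responses[request_id]   # duplicate: do not apply again
        else:
            self.balance += amount
            response = {"balance": self.balance}
            self.responses[request_id] = response
        if self.drop_next_response:
            self.drop_next_response = False
            raise TimeoutError("response lost")      # the request itself got through
        return response

def deposit_with_retry(service: PaymentService, amount: int, attempts: int = 3) -> dict:
    request_id = str(uuid.uuid4())       # generated once, reused on every retry
    for _ in range(attempts):
        try:
            return service.deposit(request_id, amount)
        except TimeoutError:
            continue                     # the client cannot tell what happened
    raise TimeoutError("no response after retries")

service = PaymentService()
result = deposit_with_retry(service, 100)
print(result, service.balance)
```

Without the request ID check, the retry would credit the account twice even though the client asked for a single deposit.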
All of these factors mean that there's no point trying to make a remote service look too much like a
local object in your programming language, because it's a fundamentally different thing. Part of the
@ -906,43 +886,43 @@ across these instances is called *load balancing*
There are many load balancing and service discovery solutions available: There are many load balancing and service discovery solutions available:
* *Hardware load balancers* are specialized pieces of equipment that are installed in data centers. * *Hardware load balancers* are specialized pieces of equipment that are installed in data centers.
They allow clients to connect to a single host and port, and incoming connections are routed to They allow clients to connect to a single host and port, and incoming connections are routed to
one of the servers running the service. Such load balancers detect network failures when one of the servers running the service. Such load balancers detect network failures when
connecting to a downstream server and shift the traffic to other servers. connecting to a downstream server and shift the traffic to other servers.
* *Software load balancers* behave in much the same way as hardware load balancers. But rather than * *Software load balancers* behave in much the same way as hardware load balancers. But rather than
requiring a special appliance, software load balancers such as Nginx and HAProxy are applications requiring a special appliance, software load balancers such as Nginx and HAProxy are applications
that can be installed on a standard machine. that can be installed on a standard machine.
* The *domain name service (DNS)* is how domain names are resolved on the Internet when you open a * The *domain name service (DNS)* is how domain names are resolved on the Internet when you open a
webpage. It supports load balancing by allowing multiple IP addresses to be associated with a webpage. It supports load balancing by allowing multiple IP addresses to be associated with a
single domain name. Clients can then be configured to connect to a service using a domain name single domain name. Clients can then be configured to connect to a service using a domain name
rather than IP address, and the clients network layer picks which IP address to use when making a rather than IP address, and the clients network layer picks which IP address to use when making a
connection. One drawback of this approach is that DNS is designed to propagate changes over longer connection. One drawback of this approach is that DNS is designed to propagate changes over longer
periods of time, and to cache DNS entries. If servers are started, stopped, or moved frequently, periods of time, and to cache DNS entries. If servers are started, stopped, or moved frequently,
clients might see stale IP addresses that no longer have a server running on them. clients might see stale IP addresses that no longer have a server running on them.
* *Service discovery systems* use a centralized registry rather than DNS to track which service * *Service discovery systems* use a centralized registry rather than DNS to track which service
endpoints are available. When a new service instance starts up, it registers itself with the endpoints are available. When a new service instance starts up, it registers itself with the
service discovery system by declaring the host and port it’s listening on, along with relevant service discovery system by declaring the host and port it’s listening on, along with relevant
metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location, metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location,
and more. The service then periodically sends a heartbeat signal to the discovery system to signal and more. The service then periodically sends a heartbeat signal to the discovery system to signal
that the service is still available. that the service is still available.
When a client wishes to connect to a service, it first queries the discovery system to get a list of When a client wishes to connect to a service, it first queries the discovery system to get a list of
available endpoints, and then connects directly to the endpoint. Compared to DNS, service discovery available endpoints, and then connects directly to the endpoint. Compared to DNS, service discovery
supports a much more dynamic environment where service instances change frequently. Discovery supports a much more dynamic environment where service instances change frequently. Discovery
systems also give clients more metadata about the service they’re connecting to, which enables systems also give clients more metadata about the service they’re connecting to, which enables
clients to make smarter load balancing decisions. clients to make smarter load balancing decisions.
* *Service meshes* are a sophisticated form of load balancing that combine software load balancers * *Service meshes* are a sophisticated form of load balancing that combine software load balancers
and service discovery. Unlike traditional software load balancers, which run on a separate and service discovery. Unlike traditional software load balancers, which run on a separate
machine, service mesh load balancers are typically deployed as an in-process client library or as machine, service mesh load balancers are typically deployed as an in-process client library or as
a process or “sidecar” container on both the client and server. Client applications connect a process or “sidecar” container on both the client and server. Client applications connect
to their own local service load balancer, which connects to the server’s load balancer. From to their own local service load balancer, which connects to the server’s load balancer. From
there, the connection is routed to the local server process. there, the connection is routed to the local server process.
Though complicated, this topology offers a number of advantages. Because the clients and servers are Though complicated, this topology offers a number of advantages. Because the clients and servers are
routed entirely through local connections, connection encryption can be handled entirely at the load routed entirely through local connections, connection encryption can be handled entirely at the load
balancer level. This shields clients and servers from having to deal with the complexities of SSL balancer level. This shields clients and servers from having to deal with the complexities of SSL
certificates and TLS. Mesh systems also provide sophisticated observability. They can track which certificates and TLS. Mesh systems also provide sophisticated observability. They can track which
services are calling each other in realtime, detect failures, track traffic load, and more. services are calling each other in realtime, detect failures, track traffic load, and more.
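
The registration, heartbeat, and lookup steps of a service discovery system can be sketched roughly as follows. This is a toy in-memory registry, not the API of any real discovery system; the class name `ServiceRegistry`, the `ttl_seconds` parameter, and the injectable `clock` are assumptions made for illustration.

```python
import time


class ServiceRegistry:
    """Toy in-memory registry: instances register, heartbeat, and expire."""

    def __init__(self, ttl_seconds=10.0, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock  # injectable so expiry can be exercised without sleeping
        self.instances = {}  # (service, host, port) -> (metadata, last_heartbeat)

    def register(self, service, host, port, metadata=None):
        # A new instance declares its host, port, and any metadata.
        self.instances[(service, host, port)] = (metadata or {}, self.clock())

    def heartbeat(self, service, host, port):
        # Refresh the timestamp of an instance that is already registered.
        key = (service, host, port)
        if key in self.instances:
            metadata, _ = self.instances[key]
            self.instances[key] = (metadata, self.clock())

    def lookup(self, service):
        """Return (host, port, metadata) for instances with a recent heartbeat."""
        now = self.clock()
        return [
            (host, port, metadata)
            for (name, host, port), (metadata, seen) in sorted(self.instances.items())
            if name == service and now - seen <= self.ttl_seconds
        ]
```

A client would call `lookup` and then connect directly to one of the returned endpoints; an instance that stops sending heartbeats simply drops out of the results once its time-to-live has passed.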
Which solution is appropriate depends on an organization’s needs. Those running in a very dynamic Which solution is appropriate depends on an organization’s needs. Those running in a very dynamic
service environment with an orchestrator such as Kubernetes often choose to run a service mesh such service environment with an orchestrator such as Kubernetes often choose to run a service mesh such
@ -962,10 +942,10 @@ The backward and forward compatibility properties of an RPC scheme are inherited
encoding it uses: encoding it uses:
* gRPC (Protocol Buffers) and Avro RPC can be evolved according to the compatibility rules of the * gRPC (Protocol Buffers) and Avro RPC can be evolved according to the compatibility rules of the
respective encoding format. respective encoding format.
* RESTful APIs most commonly use JSON for responses, and JSON or URI-encoded/form-encoded request * RESTful APIs most commonly use JSON for responses, and JSON or URI-encoded/form-encoded request
parameters for requests. Adding optional request parameters and adding new fields to response parameters for requests. Adding optional request parameters and adding new fields to response
objects are usually considered changes that maintain compatibility. objects are usually considered changes that maintain compatibility.
Service compatibility is made harder by the fact that RPC is often used for communication across Service compatibility is made harder by the fact that RPC is often used for communication across
organizational boundaries, so the provider of a service often has no control over its clients and organizational boundaries, so the provider of a service often has no control over its clients and
@ -978,8 +958,7 @@ version of the API it wants to use [^42]).
For RESTful APIs, common approaches are to use a version For RESTful APIs, common approaches are to use a version
number in the URL or in the HTTP `Accept` header. For services that use API keys to identify a number in the URL or in the HTTP `Accept` header. For services that use API keys to identify a
particular client, another option is to store a client’s requested API version on the server and to particular client, another option is to store a client’s requested API version on the server and to
allow this version selection to be updated through a separate administrative interface allow this version selection to be updated through a separate administrative interface [^43].
[^43].
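
As a rough illustration of the first two approaches, a server might resolve the requested version along these lines. The `/vN/` path prefix, the `version=` parameter in the `Accept` header, and the default of 1 are assumed conventions for this sketch; real APIs vary (many use vendor-specific media types instead).

```python
import re

DEFAULT_VERSION = 1  # hypothetical fallback when the client states no preference


def requested_version(path, headers):
    """Pick an API version from the URL path, else the Accept header, else the default."""
    match = re.match(r"^/v(\d+)/", path)
    if match:
        return int(match.group(1))
    # Assumed convention: a media type parameter such as "application/json; version=3".
    match = re.search(r"version=(\d+)", headers.get("Accept", ""))
    if match:
        return int(match.group(1))
    return DEFAULT_VERSION
```

For the API-key approach, the fallback would instead be a lookup of the version stored for that client on the server.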
## Durable Execution and Workflows ## Durable Execution and Workflows
@ -994,8 +973,7 @@ the credit card, and call the banking service to deposit debited funds, as shown
[Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*. [Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*.
Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a
general-purpose programming language, a domain specific language (DSL), or a markup language such as general-purpose programming language, a domain specific language (DSL), or a markup language such as
Business Process Execution Language (BPEL) Business Process Execution Language (BPEL) [^44].
[^44].
# Tasks, Activities, and Functions # Tasks, Activities, and Functions
@ -1038,8 +1016,7 @@ task fails, the framework will re-execute the task, but will skip any RPC calls
that the task made successfully before failing. Instead, the framework will pretend to make the that the task made successfully before failing. Instead, the framework will pretend to make the
call, but will instead return the results from the previous call. This is possible because durable call, but will instead return the results from the previous call. This is possible because durable
execution frameworks log all RPCs and state changes to durable storage like a write-ahead log execution frameworks log all RPCs and state changes to durable storage like a write-ahead log
[[45](/en/ch5#TemporalService), [[^45], [^46]].
[46](/en/ch5#Ewen2023)].
[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution [Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
using Temporal. using Temporal.
@ -1048,35 +1025,32 @@ using Temporal.
``` ```
@workflow.defn @workflow.defn
class PaymentWorkflow: class PaymentWorkflow:
@workflow.run @workflow.run
async def run(self, payment: PaymentRequest) -> PaymentResult: async def run(self, payment: PaymentRequest) -> PaymentResult:
is_fraud = await workflow.execute_activity( is_fraud = await workflow.execute_activity(
check_fraud, check_fraud,
payment, payment,
start_to_close_timeout=timedelta(seconds=15), start_to_close_timeout=timedelta(seconds=15),
) )
if is_fraud: if is_fraud:
return PaymentResultFraudulent return PaymentResultFraudulent
credit_card_response = await workflow.execute_activity( credit_card_response = await workflow.execute_activity(
debit_credit_card, debit_credit_card,
payment, payment,
start_to_close_timeout=timedelta(seconds=15), start_to_close_timeout=timedelta(seconds=15),
) )
# ... # ...
``` ```
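
The replay behaviour described above (skip calls that already succeeded and return their logged results) can be sketched with a small journal. This is a simplified illustration and not how Temporal is implemented; `ReplayLog` and its methods are hypothetical names, and a Python list stands in for durable storage.

```python
class ReplayLog:
    """Toy durable-execution journal: records activity results in call order."""

    def __init__(self):
        self.entries = []  # stand-in for durable storage such as a write-ahead log
        self.position = 0  # index of the next call within the current execution

    def start_execution(self):
        # Every (re-)execution of the workflow starts reading the log from the top.
        self.position = 0

    def execute(self, name, func, *args):
        if self.position < len(self.entries):
            # Replay: return the logged result instead of calling func again.
            logged_name, result = self.entries[self.position]
            if logged_name != name:
                raise RuntimeError(
                    f"nondeterministic replay: expected {logged_name!r}, got {name!r}"
                )
        else:
            # First attempt: make the call for real and log the result on success.
            result = func(*args)
            self.entries.append((name, result))
        self.position += 1
        return result
```

The name check also shows why reordering calls in workflow code is risky: a re-execution that asks for a different call than the one logged at that position cannot be replayed safely.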
Frameworks like Temporal are not without their challenges. External services, such as the Frameworks like Temporal are not without their challenges. External services, such as the
third-party payment gateway in our example, must still provide an idempotent API. Developers must third-party payment gateway in our example, must still provide an idempotent API. Developers must
remember to use unique IDs for these APIs to prevent duplicate execution remember to use unique IDs for these APIs to prevent duplicate execution [^47].
[^47].
And because durable execution frameworks log each RPC call in order, they expect a subsequent And because durable execution frameworks log each RPC call in order, they expect a subsequent
execution to make the same RPC calls in the same order. This makes code changes brittle: you execution to make the same RPC calls in the same order. This makes code changes brittle: you
might introduce undefined behavior simply by re-ordering function calls might introduce undefined behavior simply by re-ordering function calls [^48].
[^48].
Instead of modifying the code of an existing workflow, it is safer to deploy a new version of the Instead of modifying the code of an existing workflow, it is safer to deploy a new version of the
code separately, so that re-executions of existing workflow invocations continue to use the old code separately, so that re-executions of existing workflow invocations continue to use the old
version, and only new invocations use the new code version, and only new invocations use the new code [^49].
[^49].
Similarly, because durable execution frameworks expect to replay all code deterministically (the Similarly, because durable execution frameworks expect to replay all code deterministically (the
same inputs produce the same outputs), nondeterministic code such as random number generators or same inputs produce the same outputs), nondeterministic code such as random number generators or
@ -1097,20 +1071,19 @@ how encoded data can flow from one process to another. A request is called an *e
unlike RPC, the sender usually does not wait for the recipient to process the event. Moreover, unlike RPC, the sender usually does not wait for the recipient to process the event. Moreover,
events are typically not sent to the recipient via a direct network connection, but go via an events are typically not sent to the recipient via a direct network connection, but go via an
intermediary called a *message broker* (also called an *event broker*, *message queue*, or intermediary called a *message broker* (also called an *event broker*, *message queue*, or
*message-oriented middleware*), which stores the message temporarily. *message-oriented middleware*), which stores the message temporarily. [^50].
[^50].
Using a message broker has several advantages compared to direct RPC: Using a message broker has several advantages compared to direct RPC:
* It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system * It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system
reliability. reliability.
* It can automatically redeliver messages to a process that has crashed, and thus prevent messages from * It can automatically redeliver messages to a process that has crashed, and thus prevent messages from
being lost. being lost.
* It avoids the need for service discovery, since senders do not need to directly connect to the IP * It avoids the need for service discovery, since senders do not need to directly connect to the IP
address of the recipient. address of the recipient.
* It allows the same message to be sent to several recipients. * It allows the same message to be sent to several recipients.
* It logically decouples the sender from the recipient (the sender just publishes messages and * It logically decouples the sender from the recipient (the sender just publishes messages and
doesn’t care who consumes them). doesn’t care who consumes them).
The communication via a message broker is *asynchronous*: the sender doesn’t wait for the message to The communication via a message broker is *asynchronous*: the sender doesn’t wait for the message to
be delivered, but simply sends it and then forgets about it. It’s possible to implement a be delivered, but simply sends it and then forgets about it. It’s possible to implement a
@ -1128,15 +1101,15 @@ The detailed delivery semantics vary by implementation and configuration, but in
message distribution patterns are most often used: message distribution patterns are most often used:
* One process adds a message to a named *queue*, and the broker delivers that message to a * One process adds a message to a named *queue*, and the broker delivers that message to a
*consumer* of that queue. If there are multiple consumers, one of them receives the message. *consumer* of that queue. If there are multiple consumers, one of them receives the message.
* One process publishes a message to a named *topic*, and the broker delivers that message to all * One process publishes a message to a named *topic*, and the broker delivers that message to all
*subscribers* of that topic. If there are multiple subscribers, they all receive the message. *subscribers* of that topic. If there are multiple subscribers, they all receive the message.
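
The difference between the two patterns can be sketched with a toy in-memory broker. This is an illustration only: `ToyBroker` is a hypothetical class, round-robin is just one possible way of choosing a queue consumer, and real brokers add buffering, persistence, and acknowledgements.

```python
from collections import defaultdict


class ToyBroker:
    """In-memory stand-in for a message broker, showing queue vs. topic delivery."""

    def __init__(self):
        self.queue_consumers = defaultdict(list)
        self.next_consumer = defaultdict(int)
        self.topic_subscribers = defaultdict(list)

    def consume(self, queue, handler):
        self.queue_consumers[queue].append(handler)

    def subscribe(self, topic, handler):
        self.topic_subscribers[topic].append(handler)

    def send(self, queue, message):
        """Queue semantics: exactly one consumer receives the message."""
        consumers = self.queue_consumers[queue]
        if not consumers:
            # A real broker would buffer the message until a consumer appears.
            raise LookupError(f"no consumer for queue {queue!r}")
        index = self.next_consumer[queue] % len(consumers)
        self.next_consumer[queue] += 1
        consumers[index](message)

    def publish(self, topic, message):
        """Topic semantics: every subscriber receives the message."""
        for handler in self.topic_subscribers[topic]:
            handler(message)
```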
Message brokers typically don’t enforce any particular data model: a message is just a sequence of Message brokers typically don’t enforce any particular data model: a message is just a sequence of
bytes with some metadata, so you can use any encoding format. A common approach is to use Protocol bytes with some metadata, so you can use any encoding format. A common approach is to use Protocol
Buffers, Avro, or JSON, and to deploy a schema registry alongside the message broker to store all Buffers, Avro, or JSON, and to deploy a schema registry alongside the message broker to store all
the valid schema versions and check their compatibility the valid schema versions and check their compatibility
[[19](/en/ch5#ConfluentSchemaReg), [21](/en/ch5#Kreps2015)]. [[^19], [^21]].
AsyncAPI, a messaging-based equivalent of OpenAPI, can also be used to specify the schema of AsyncAPI, a messaging-based equivalent of OpenAPI, can also be used to specify the schema of
messages. messages.
@ -1160,8 +1133,7 @@ sending and receiving asynchronous messages. Message delivery is not guaranteed:
scenarios, messages will be lost. Since each actor processes only one message at a time, it doesn’t scenarios, messages will be lost. Since each actor processes only one message at a time, it doesn’t
need to worry about threads, and each actor can be scheduled independently by the framework. need to worry about threads, and each actor can be scheduled independently by the framework.
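
A minimal sketch of this model, assuming a hypothetical scheduler that calls `process_next` on each actor in turn: the actor keeps private state, senders only enqueue messages, and messages are handled strictly one at a time. `CounterActor` and its message names are invented for illustration and do not correspond to any particular actor framework.

```python
from collections import deque


class CounterActor:
    """Toy actor: private state, a mailbox, and one-message-at-a-time processing."""

    def __init__(self):
        self.mailbox = deque()
        self.count = 0  # local state, never touched directly by other actors

    def send(self, message):
        # Asynchronous: the sender only enqueues and does not wait for processing.
        self.mailbox.append(message)

    def process_next(self):
        """Handle a single message; returns False when the mailbox is empty."""
        if not self.mailbox:
            return False
        message = self.mailbox.popleft()
        if message == "increment":
            self.count += 1
        # Unknown messages are silently dropped in this sketch.
        return True
```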
In *distributed actor frameworks* such as Akka, Orleans In *distributed actor frameworks* such as Akka, Orleans [^51],
[^51],
and Erlang/OTP, this programming model is used to scale an application across and Erlang/OTP, this programming model is used to scale an application across
multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient
are on the same node or different nodes. If they are on different nodes, the message is are on the same node or different nodes. If they are on different nodes, the message is
@ -1178,7 +1150,7 @@ application, you still have to worry about forward and backward compatibility, a
sent from a node running the new version to a node running the old version, and vice versa. This can sent from a node running the new version to a node running the old version, and vice versa. This can
be achieved by using one of the encodings discussed in this chapter. be achieved by using one of the encodings discussed in this chapter.
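
As a small illustration of what this means for a node receiving messages during a rolling upgrade, the decoder below ignores fields it does not know (sent by a newer node) and fills in defaults for fields that are missing (sent by an older node). The field names and defaults are hypothetical, and JSON is used only to keep the sketch short.

```python
import json

# Fields this node understands, with defaults for fields a sender may omit.
KNOWN_FIELDS = {"user_id": None, "action": "noop"}


def decode_message(raw):
    """Decode a JSON message, ignoring unknown fields and defaulting missing ones."""
    data = json.loads(raw)
    return {name: data.get(name, default) for name, default in KNOWN_FIELDS.items()}
```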
# Summary ## Summary
In this chapter we looked at several ways of turning data structures into bytes on the network or In this chapter we looked at several ways of turning data structures into bytes on the network or
bytes on disk. We saw how the details of these encodings affect not only their efficiency, but more bytes on disk. We saw how the details of these encodings affect not only their efficiency, but more
@ -1199,33 +1171,34 @@ read old data) and forward compatibility (old code can read new data).
We discussed several data encoding formats and their compatibility properties: We discussed several data encoding formats and their compatibility properties:
* Programming language-specific encodings are restricted to a single programming language and often * Programming language-specific encodings are restricted to a single programming language and often
fail to provide forward and backward compatibility. fail to provide forward and backward compatibility.
* Textual formats like JSON, XML, and CSV are widespread, and their compatibility depends on how you * Textual formats like JSON, XML, and CSV are widespread, and their compatibility depends on how you
use them. They have optional schema languages, which are sometimes helpful and sometimes a use them. They have optional schema languages, which are sometimes helpful and sometimes a
hindrance. These formats are somewhat vague about datatypes, so you have to be careful with things hindrance. These formats are somewhat vague about datatypes, so you have to be careful with things
like numbers and binary strings. like numbers and binary strings.
* Binary schema-driven formats like Protocol Buffers and Avro allow compact, efficient encoding with * Binary schema-driven formats like Protocol Buffers and Avro allow compact, efficient encoding with
clearly defined forward and backward compatibility semantics. The schemas can be useful for clearly defined forward and backward compatibility semantics. The schemas can be useful for
documentation and code generation in statically typed languages. However, these formats have the documentation and code generation in statically typed languages. However, these formats have the
downside that data needs to be decoded before it is human-readable. downside that data needs to be decoded before it is human-readable.
We also discussed several modes of dataflow, illustrating different scenarios in which data We also discussed several modes of dataflow, illustrating different scenarios in which data
encodings are important: encodings are important:
* Databases, where the process writing to the database encodes the data and the process reading * Databases, where the process writing to the database encodes the data and the process reading
from the database decodes it from the database decodes it
* RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes * RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes
a response, and the client finally decodes the response a response, and the client finally decodes the response
* Event-driven architectures (using message brokers or actors), where nodes communicate by sending * Event-driven architectures (using message brokers or actors), where nodes communicate by sending
each other messages that are encoded by the sender and decoded by the recipient each other messages that are encoded by the sender and decoded by the recipient
We can conclude that with a bit of care, backward/forward compatibility and rolling upgrades are We can conclude that with a bit of care, backward/forward compatibility and rolling upgrades are
quite achievable. May your application’s evolution be rapid and your deployments be frequent. quite achievable. May your application’s evolution be rapid and your deployments be frequent.
##### Footnotes
##### References
### Summary
[^1]: [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*, July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y) [^1]: [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*, July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y)


@ -11,7 +11,7 @@ breadcrumbs: false
> Douglas Adams, *Mostly Harmless* (1992) > Douglas Adams, *Mostly Harmless* (1992)
*Replication* means keeping a copy of the same data on multiple machines that are connected via a *Replication* means keeping a copy of the same data on multiple machines that are connected via a
network. As discussed in [“Distributed versus Single-Node Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_distributed), there are several reasons network. As discussed in [“Distributed versus Single-Node Systems”](/ch01.html#sec_introduction_distributed), there are several reasons
why you might want to replicate data: why you might want to replicate data:
* To keep data geographically close to your users (and thus reduce access latency) * To keep data geographically close to your users (and thus reduce access latency)
@ -19,7 +19,7 @@ why you might want to replicate data:
* To scale out the number of machines that can serve read queries (and thus increase read throughput) * To scale out the number of machines that can serve read queries (and thus increase read throughput)
In this chapter we will assume that your dataset is small enough that each machine can hold a copy of In this chapter we will assume that your dataset is small enough that each machine can hold a copy of
the entire dataset. In [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding) we will relax that assumption and discuss *sharding* the entire dataset. In [Chapter 7](/ch07.html#ch_sharding) we will relax that assumption and discuss *sharding*
(*partitioning*) of datasets that are too big for a single machine. In later chapters we will discuss (*partitioning*) of datasets that are too big for a single machine. In later chapters we will discuss
various kinds of faults that can occur in a replicated data system, and how to deal with them. various kinds of faults that can occur in a replicated data system, and how to deal with them.
@ -36,10 +36,8 @@ in databases, and although the details vary by database, the general principles
many different implementations. We will discuss the consequences of such choices in this chapter. many different implementations. We will discuss the consequences of such choices in this chapter.
Replication of databases is an old topic: the principles haven’t changed much since they were Replication of databases is an old topic: the principles haven’t changed much since they were
studied in the 1970s studied in the 1970s [^1], because the fundamental constraints of networks have remained the same. Despite being so old,
[^1], concepts such as *eventual consistency* still cause confusion. In [“Problems with Replication Lag”](/ch06.html#sec_replication_lag) we will
because the fundamental constraints of networks have remained the same. Despite being so old,
concepts such as *eventual consistency* still cause confusion. In [“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag) we will
get more precise about eventual consistency and discuss things like the *read-your-writes* and get more precise about eventual consistency and discuss things like the *read-your-writes* and
*monotonic reads* guarantees. *monotonic reads* guarantees.
@ -52,7 +50,7 @@ delete some data, replication doesnt help since the deletion will have also b
replicas, so you need a backup if you want to restore the deleted data. replicas, so you need a backup if you want to restore the deleted data.
In fact, replication and backups are often complementary to each other. Backups are sometimes part In fact, replication and backups are often complementary to each other. Backups are sometimes part
of the process of setting up replication, as we shall see in [“Setting Up New Followers”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_new_replica). of the process of setting up replication, as we shall see in [“Setting Up New Followers”](/ch06.html#sec_replication_new_replica).
Conversely, archiving replication logs can be part of a backup process. Conversely, archiving replication logs can be part of a backup process.
Some databases internally maintain immutable snapshots of past states, which serve as a kind of Some databases internally maintain immutable snapshots of past states, which serve as a kind of
@ -69,7 +67,7 @@ question inevitably arises: how do we ensure that all the data ends up on all th
Every write to the database needs to be processed by every replica; otherwise, the replicas would no Every write to the database needs to be processed by every replica; otherwise, the replicas would no
longer contain the same data. The most common solution is called *leader-based replication*, longer contain the same data. The most common solution is called *leader-based replication*,
*primary-backup*, or *active/passive*. It works as follows (see *primary-backup*, or *active/passive*. It works as follows (see
[Figure 6-1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_leader_follower)): [Figure 6-1](/ch06.html#fig_replication_leader_follower)):
1. One of the replicas is designated the *leader* (also known as *primary* or *source* 1. One of the replicas is designated the *leader* (also known as *primary* or *source*
[^2]). [^2]).
@ -88,9 +86,9 @@ longer contain the same data. The most common solution is called *leader-based r
###### Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas. ###### Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas.
If the database is sharded (see [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding)), each shard has one leader. Different shards may If the database is sharded (see [Chapter 7](/ch07.html#ch_sharding)), each shard has one leader. Different shards may
have their leaders on different nodes, but each shard must nevertheless have one leader node. In have their leaders on different nodes, but each shard must nevertheless have one leader node. In
[“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader) we will discuss an alternative model in which a system may have [“Multi-Leader Replication”](/ch06.html#sec_replication_multi_leader) we will discuss an alternative model in which a system may have
multiple leaders for the same shard at the same time. multiple leaders for the same shard at the same time.
Single-leader replication is very widely used. It’s a built-in feature of many relational databases, Single-leader replication is very widely used. It’s a built-in feature of many relational databases,
@ -106,7 +104,7 @@ Many consensus algorithms such as Raft, which is used for replication in Cockroa
TiDB [^7], TiDB [^7],
etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and
automatically elect a new leader if the old one fails (we will discuss consensus in more detail in automatically elect a new leader if the old one fails (we will discuss consensus in more detail in
[Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency)). [Chapter 10](/ch10.html#ch_consistency)).
> [!NOTE] > [!NOTE]
> In older documents you may see the term *masterslave replication*. It means the same as > In older documents you may see the term *masterslave replication*. It means the same as
@ -119,17 +117,17 @@ An important detail of a replicated system is whether the replication happens *s
*asynchronously*. (In relational databases, this is often a configurable option; other systems are *asynchronously*. (In relational databases, this is often a configurable option; other systems are
often hardcoded to be either one or the other.) often hardcoded to be either one or the other.)
Think about what happens in [Figure 6-1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_leader_follower), where the user of a website updates Think about what happens in [Figure 6-1](/ch06.html#fig_replication_leader_follower), where the user of a website updates
their profile image. At some point in time, the client sends the update request to the leader; their profile image. At some point in time, the client sends the update request to the leader;
shortly afterward, it is received by the leader. At some point, the leader forwards the data change shortly afterward, it is received by the leader. At some point, the leader forwards the data change
to the followers. Eventually, the leader notifies the client that the update was successful. to the followers. Eventually, the leader notifies the client that the update was successful.
[Figure 6-2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_sync_replication) shows one possible way how the timings could work out. [Figure 6-2](/ch06.html#fig_replication_sync_replication) shows one possible way how the timings could work out.
![ddia 0602](/fig/ddia_0602.png) ![ddia 0602](/fig/ddia_0602.png)
###### Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower. ###### Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower.
In the example of [Figure 6-2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_sync_replication), the replication to follower 1 is In the example of [Figure 6-2](/ch06.html#fig_replication_sync_replication), the replication to follower 1 is
*synchronous*: the leader waits until follower 1 has confirmed that it received the write before *synchronous*: the leader waits until follower 1 has confirmed that it received the write before
reporting success to the user, and before making the write visible to other clients. The replication reporting success to the user, and before making the write visible to other clients. The replication
to follower 2 is *asynchronous*: the leader sends the message, but doesn’t wait for a response from to follower 2 is *asynchronous*: the leader sends the message, but doesn’t wait for a response from
@ -159,9 +157,9 @@ called *semi-synchronous*.
In some systems, a *majority* (e.g., 3 out of 5 replicas, including the leader) of replicas is In some systems, a *majority* (e.g., 3 out of 5 replicas, including the leader) of replicas is
updated synchronously, and the remaining minority is asynchronous. This is an example of a *quorum*, updated synchronously, and the remaining minority is asynchronous. This is an example of a *quorum*,
which we will discuss further in [“Quorums for reading and writing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_quorum_condition). Majority quorums are often which we will discuss further in [“Quorums for reading and writing”](/ch06.html#sec_replication_quorum_condition). Majority quorums are often
used in systems that use a consensus protocol for automatic leader election, which we will return to used in systems that use a consensus protocol for automatic leader election, which we will return to
in [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency). in [Chapter 10](/ch10.html#ch_consistency).
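As a rough illustration of the majority rule described above, the following sketch computes how many acknowledgements (counting the leader's own write) are needed before a write can be reported as successful. The function names are hypothetical and are not taken from any particular database.

```python
def majority(n_replicas: int) -> int:
    """Smallest number of replicas that forms a majority of n_replicas."""
    if n_replicas < 1:
        raise ValueError("need at least one replica")
    return n_replicas // 2 + 1


def write_is_committed(acks: int, n_replicas: int) -> bool:
    """True once a majority of replicas (leader included) confirmed the write."""
    return acks >= majority(n_replicas)
```

With five replicas, `majority(5)` is 3, which matches the "3 out of 5" example in the text; the remaining two replicas can catch up asynchronously.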
Sometimes, leader-based replication is configured to be completely asynchronous. In this case, if the Sometimes, leader-based replication is configured to be completely asynchronous. In this case, if the
leader fails and is not recoverable, any writes that have not yet been replicated to followers are leader fails and is not recoverable, any writes that have not yet been replicated to followers are
@ -172,7 +170,7 @@ processing writes, even if all of its followers have fallen behind.
Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless
widely used, especially if there are many followers or if they are geographically distributed widely used, especially if there are many followers or if they are geographically distributed
[^9]. [^9].
We will return to this issue in [“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag). We will return to this issue in [“Problems with Replication Lag”](/ch06.html#sec_replication_lag).
## Setting Up New Followers ## Setting Up New Followers
@ -224,8 +222,8 @@ for live queries. Storing database data in object storage has many benefits:
durability guarantees. This also allows databases to bypass inter-zone network fees. durability guarantees. This also allows databases to bypass inter-zone network fees.
* Databases can use an object store’s *conditional write* feature—essentially, a *compare-and-set* * Databases can use an object store’s *conditional write* feature—essentially, a *compare-and-set*
(CAS) operation—to implement transactions and leadership election (CAS) operation—to implement transactions and leadership election
[[10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Morling2024_ch6), [[10](/ch06.html#Morling2024_ch6),
[11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Chandramohan2024)]). [11](/ch06.html#Chandramohan2024)]).
* Storing data from multiple databases in the same object store can simplify data integration, * Storing data from multiple databases in the same object store can simplify data integration,
particularly when open formats such as Apache Parquet and Apache Iceberg are used. particularly when open formats such as Apache Parquet and Apache Iceberg are used.
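To make the conditional-write idea more concrete, here is a minimal sketch of leadership election built on a compare-and-set operation. `ToyObjectStore`, `put_if_version`, and `try_become_leader` are hypothetical names for an in-memory stand-in; real object stores expose conditional writes through their own APIs, and a production design would also need lease expiry so that a crashed leader can be replaced.

```python
import threading
from typing import Dict, Optional, Tuple


class ToyObjectStore:
    """In-memory stand-in for an object store with conditional writes."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        # key -> (version, value)
        self._objects: Dict[str, Tuple[int, bytes]] = {}

    def get(self, key: str) -> Optional[Tuple[int, bytes]]:
        with self._lock:
            return self._objects.get(key)

    def put_if_version(
        self, key: str, value: bytes, expected_version: Optional[int]
    ) -> bool:
        """Write only if the stored version equals expected_version.

        expected_version=None means "the key must not exist yet".
        """
        with self._lock:
            current = self._objects.get(key)
            current_version = current[0] if current else None
            if current_version != expected_version:
                return False  # somebody else wrote first; caller lost the race
            new_version = 1 if current_version is None else current_version + 1
            self._objects[key] = (new_version, value)
            return True


def try_become_leader(store: ToyObjectStore, node_id: str) -> bool:
    """Claim leadership by creating the leader object if nobody has yet."""
    return store.put_if_version("leader", node_id.encode(), expected_version=None)
```

Because the check and the write happen atomically, at most one node can succeed in creating the `leader` object, no matter how many nodes race for it.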
@ -312,10 +310,10 @@ consists of the following steps:
[^13]. [^13].
The best candidate for leadership is usually the replica with the most up-to-date data changes The best candidate for leadership is usually the replica with the most up-to-date data changes
from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader
is a consensus problem, discussed in detail in [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency). is a consensus problem, discussed in detail in [Chapter 10](/ch10.html#ch_consistency).
3. *Reconfiguring the system to use the new leader.* Clients now need to send 3. *Reconfiguring the system to use the new leader.* Clients now need to send
their write requests to the new leader (we discuss this their write requests to the new leader (we discuss this
in [“Request Routing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_routing)). If the old leader comes back, it might still believe that it is in [“Request Routing”](/ch07.html#sec_sharding_routing)). If the old leader comes back, it might still believe that it is
the leader, not realizing that the other replicas have the leader, not realizing that the other replicas have
forced it to step down. The system needs to ensure that the old leader becomes a follower and forced it to step down. The system needs to ensure that the old leader becomes a follower and
recognizes the new leader. recognizes the new leader.
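The "most up-to-date replica" rule from step 2 can be sketched as follows. This is only an illustration under the assumption that each follower reports the last log position it has received from the old leader; `pick_new_leader` is a hypothetical helper, and real systems reach agreement on the result through a consensus protocol rather than a single function call.

```python
from typing import Dict


def pick_new_leader(replicated_positions: Dict[str, int]) -> str:
    """Return the follower with the highest replicated log position.

    Ties are broken by node name so that every caller picks the same node.
    """
    if not replicated_positions:
        raise ValueError("no followers available for promotion")
    return max(
        replicated_positions,
        key=lambda node: (replicated_positions[node], node),
    )
```

Choosing the follower with the highest position minimizes, but does not eliminate, the number of writes lost when the old leader was replicating asynchronously.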
@ -337,10 +335,10 @@ Failover is fraught with things that can go wrong:
primary keys that were previously assigned by the old leader. These primary keys were also used in primary keys that were previously assigned by the old leader. These primary keys were also used in
a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis, a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis,
which caused some private data to be disclosed to the wrong users. which caused some private data to be disclosed to the wrong users.
* In certain fault scenarios (see [Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed)), it could happen that two nodes both believe * In certain fault scenarios (see [Chapter 9](/ch09.html#ch_distributed)), it could happen that two nodes both believe
that they are the leader. This situation is called *split brain*, and it is dangerous: if both that they are the leader. This situation is called *split brain*, and it is dangerous: if both
leaders accept writes, and there is no process for resolving conflicts (see leaders accept writes, and there is no process for resolving conflicts (see
[“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some [“Multi-Leader Replication”](/ch06.html#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some
systems have a mechanism to shut down one node if two leaders are detected. However, if this systems have a mechanism to shut down one node if two leaders are detected. However, if this
mechanism is not carefully designed, you can end up with both nodes being shut down mechanism is not carefully designed, you can end up with both nodes being shut down
[^15]. [^15].
@ -356,7 +354,7 @@ Failover is fraught with things that can go wrong:
> [!NOTE] > [!NOTE]
> Guarding against split brain by limiting or shutting down old leaders is known as *fencing* or, more > Guarding against split brain by limiting or shutting down old leaders is known as *fencing* or, more
> emphatically, *Shoot The Other Node In The Head* (STONITH). We will discuss fencing in more detail > emphatically, *Shoot The Other Node In The Head* (STONITH). We will discuss fencing in more detail
> in [“Distributed Locks and Leases”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_lock_fencing). > in [“Distributed Locks and Leases”](/ch09.html#sec_distributed_lock_fencing).
There are no easy solutions to these problems. For this reason, some operations teams prefer to There are no easy solutions to these problems. For this reason, some operations teams prefer to
perform failovers manually, even if the software supports automatic failover. perform failovers manually, even if the software supports automatic failover.
@ -370,7 +368,7 @@ behind by several days could be catastrophic.
These issues—node failures; unreliable networks; and trade-offs around replica consistency, These issues—node failures; unreliable networks; and trade-offs around replica consistency,
durability, availability, and latency—are in fact fundamental problems in distributed systems. durability, availability, and latency—are in fact fundamental problems in distributed systems.
In [Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed) and [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency) we will discuss them in greater depth. In [Chapter 9](/ch09.html#ch_distributed) and [Chapter 10](/ch10.html#ch_consistency) we will discuss them in greater depth.
## Implementation of Replication Logs ## Implementation of Replication Logs
@ -401,9 +399,9 @@ break down:
It is possible to work around those issues—for example, the leader can replace any nondeterministic It is possible to work around those issues—for example, the leader can replace any nondeterministic
function calls with a fixed return value when the statement is logged so that the followers all get function calls with a fixed return value when the statement is logged so that the followers all get
the same value. The idea of executing deterministic statements in a fixed order is similar to the the same value. The idea of executing deterministic statements in a fixed order is similar to the
event sourcing model that we previously discussed in [“Event Sourcing and CQRS”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_events). This approach is event sourcing model that we previously discussed in [“Event Sourcing and CQRS”](/ch03.html#sec_datamodels_events). This approach is
also known as *state machine replication*, and we will discuss the theory behind it in also known as *state machine replication*, and we will discuss the theory behind it in
[“Using shared logs”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_smr). [“Using shared logs”](/ch10.html#sec_consistency_smr).
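The workaround of replacing a nondeterministic call with a fixed value can be sketched like this. The example assumes the only nondeterministic function is `NOW()` and uses a naive regular expression, which would also rewrite `NOW()` inside a string literal; `make_deterministic` is a hypothetical helper, not how any particular database implements it.

```python
import re
from datetime import datetime

# Matches NOW() as a standalone function call, case-insensitively.
_NOW_CALL = re.compile(r"\bNOW\(\)", re.IGNORECASE)


def make_deterministic(statement: str, leader_time: datetime) -> str:
    """Replace each NOW() call with the literal timestamp the leader used."""
    literal = "'" + leader_time.strftime("%Y-%m-%d %H:%M:%S") + "'"
    return _NOW_CALL.sub(lambda _match: literal, statement)
```

The leader would log the rewritten statement, so every follower that replays it stores the same timestamp regardless of when the replay happens.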
Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today, Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today,
as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if
@ -415,7 +413,7 @@ replication methods.
### Write-ahead log (WAL) shipping ### Write-ahead log (WAL) shipping
In [Chapter 4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust: In [Chapter 4](/ch04.html#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
every modification is first written to the WAL so that the tree can be restored to a consistent every modification is first written to the WAL so that the tree can be restored to a consistent
state after a crash. Since the WAL contains all the information necessary to restore the indexes and state after a crash. Since the WAL contains all the information necessary to restore the indexes and
heap into a consistent state, we can use the exact same log to build a replica on another node: heap into a consistent state, we can use the exact same log to build a replica on another node:
@ -423,8 +421,8 @@ besides writing the log to disk, the leader also sends it across the network to
the follower processes this log, it builds a copy of the exact same files as found on the leader. the follower processes this log, it builds a copy of the exact same files as found on the leader.
This method of replication is used in PostgreSQL and Oracle, among others This method of replication is used in PostgreSQL and Oracle, among others
[[17](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Suzuki2017_ch6), [[17](/ch06.html#Suzuki2017_ch6),
[18](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2012)]. [18](/ch06.html#Kapila2012)].
The main disadvantage is that the log describes the data on a very low level: a WAL contains details The main disadvantage is that the log describes the data on a very low level: a WAL contains details
of which bytes were changed in which disk blocks. This makes replication tightly coupled to the of which bytes were changed in which disk blocks. This makes replication tightly coupled to the
storage engine. If the database changes its storage format from one version to another, it is storage engine. If the database changes its storage format from one version to another, it is
@ -476,7 +474,7 @@ This technique is called *change data capture*, and we will return to it in [Lin
# Problems with Replication Lag # Problems with Replication Lag
Being able to tolerate node failures is just one reason for wanting replication. As mentioned Being able to tolerate node failures is just one reason for wanting replication. As mentioned
in [“Distributed versus Single-Node Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_distributed), other reasons are scalability (processing more in [“Distributed versus Single-Node Systems”](/ch01.html#sec_introduction_distributed), other reasons are scalability (processing more
requests than a single machine can handle) and latency (placing replicas geographically closer to requests than a single machine can handle) and latency (placing replicas geographically closer to
users). users).
@ -528,7 +526,7 @@ be read from a follower. This is especially appropriate if data is frequently vi
occasionally written. occasionally written.
With asynchronous replication, there is a problem, illustrated in With asynchronous replication, there is a problem, illustrated in
[Figure 6-3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_read_your_writes): if the user views the data shortly after making a write, the [Figure 6-3](/ch06.html#fig_replication_read_your_writes): if the user views the data shortly after making a write, the
new data may not yet have reached the replica. To the user, it looks as though the data they new data may not yet have reached the replica. To the user, it looks as though the data they
submitted was lost, so they will be understandably unhappy. submitted was lost, so they will be understandably unhappy.
@ -568,7 +566,7 @@ are various possible techniques. To mention a few:
[^26]. [^26].
The timestamp could be a *logical timestamp* (something that indicates ordering of writes, such as The timestamp could be a *logical timestamp* (something that indicates ordering of writes, such as
the log sequence number) or the actual system clock (in which case clock synchronization becomes the log sequence number) or the actual system clock (in which case clock synchronization becomes
critical; see [“Unreliable Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_clocks)). critical; see [“Unreliable Clocks”](/ch09.html#sec_distributed_clocks)).
* If your replicas are distributed across regions (for geographical proximity to users or for * If your replicas are distributed across regions (for geographical proximity to users or for
availability), there is additional complexity. Any request that needs to be served by the leader availability), there is additional complexity. Any request that needs to be served by the leader
must be routed to the region that contains the leader. must be routed to the region that contains the leader.
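The timestamp-based technique above can be sketched as follows, assuming the client remembers the log sequence number (LSN) of its most recent write and each replica tracks the LSN it has applied. `Replica` and `choose_replica` are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Replica:
    name: str
    applied_lsn: int = 0  # last log position applied on this replica
    data: Dict[str, str] = field(default_factory=dict)


def choose_replica(
    replicas: List[Replica], last_write_lsn: int
) -> Optional[Replica]:
    """Return a replica that already reflects the client's latest write.

    Returns None if every replica is still behind; the caller can then
    wait for a replica to catch up or read from the leader instead.
    """
    for replica in replicas:
        if replica.applied_lsn >= last_write_lsn:
            return replica
    return None
```

Using a logical timestamp such as the LSN avoids the dependence on synchronized clocks that a wall-clock timestamp would introduce.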
@ -604,7 +602,7 @@ zonal outages where one zone goes offline, but they do not protect against regio
all zones in a region are unavailable. To survive a regional outage, a distributed system must be all zones in a region are unavailable. To survive a regional outage, a distributed system must be
deployed across multiple regions, which can result in higher latencies, lower throughput, and deployed across multiple regions, which can result in higher latencies, lower throughput, and
increased cloud networking bills. We will discuss these tradeoffs more in increased cloud networking bills. We will discuss these tradeoffs more in
[“Multi-leader replication topologies”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_topologies). For now, just know that when we say region, we mean a collection of [“Multi-leader replication topologies”](/ch06.html#sec_replication_topologies). For now, just know that when we say region, we mean a collection of
zones/datacenters in a single geographic location. zones/datacenters in a single geographic location.
## Monotonic Reads ## Monotonic Reads
@ -613,7 +611,7 @@ Our second example of an anomaly that can occur when reading from asynchronous f
possible for a user to see things *moving backward in time*. possible for a user to see things *moving backward in time*.
This can happen if a user makes several reads from different replicas. For example, This can happen if a user makes several reads from different replicas. For example,
[Figure 6-4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower [Figure 6-4](/ch06.html#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower
with little lag, then to a follower with greater lag. (This scenario is quite likely if the user with little lag, then to a follower with greater lag. (This scenario is quite likely if the user
refreshes a web page, and each request is routed to a random server.) The first query returns a refreshes a web page, and each request is routed to a random server.) The first query returns a
comment that was recently added by user 1234, but the second query doesn’t return anything because comment that was recently added by user 1234, but the second query doesn’t return anything because
@ -654,7 +652,7 @@ answered it.
Now, imagine a third person is listening to this conversation through followers. The things said by Now, imagine a third person is listening to this conversation through followers. The things said by
Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer
replication lag (see [Figure 6-5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_consistent_prefix)). This observer would hear the following: replication lag (see [Figure 6-5](/ch06.html#fig_replication_consistent_prefix)). This observer would hear the following:
Mrs. Cake Mrs. Cake
: About ten seconds usually, Mr. Poons. : About ten seconds usually, Mr. Poons.
@ -676,7 +674,7 @@ writes happens in a certain order, then anyone reading those writes will see the
order. order.
This is a particular problem in sharded (partitioned) databases, which we will discuss in This is a particular problem in sharded (partitioned) databases, which we will discuss in
[Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding). If the database always applies writes in the same order, reads always see a [Chapter 7](/ch07.html#ch_sharding). If the database always applies writes in the same order, reads always see a
consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different
shards operate independently, so there is no global ordering of writes: when a user reads from the shards operate independently, so there is no global ordering of writes: when a user reads from the
database, they may see some parts of the database in an older state and some in a newer state. database, they may see some parts of the database in an older state and some in a newer state.
@ -684,7 +682,7 @@ database, they may see some parts of the database in an older state and some in
One solution is to make sure that any writes that are causally related to each other are written to One solution is to make sure that any writes that are causally related to each other are written to
the same shard—but in some applications that cannot be done efficiently. There are also algorithms the same shard—but in some applications that cannot be done efficiently. There are also algorithms
that explicitly keep track of causal dependencies, a topic that we will return to in that explicitly keep track of causal dependencies, a topic that we will return to in
[“The “happens-before” relation and concurrency”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_happens_before). [“The “happens-before” relation and concurrency”](/ch06.html#sec_replication_happens_before).
## Solutions for Replication Lag ## Solutions for Replication Lag
@ -700,15 +698,15 @@ synchronously updated follower. However, dealing with these issues in applicatio
and easy to get wrong. and easy to get wrong.
The simplest programming model for application developers is to choose a database that provides a The simplest programming model for application developers is to choose a database that provides a
strong consistency guarantee for replicas such as linearizability (see [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency)), and ACID strong consistency guarantee for replicas such as linearizability (see [Chapter 10](/ch10.html#ch_consistency)), and ACID
transactions (see [Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions)). This allows you to mostly ignore the challenges that arise transactions (see [Chapter 8](/ch08.html#ch_transactions)). This allows you to mostly ignore the challenges that arise
from replication, and treat the database as if it had just a single node. In the early 2010s the from replication, and treat the database as if it had just a single node. In the early 2010s the
*NoSQL* movement promoted the view that these features limited scalability, and that large-scale *NoSQL* movement promoted the view that these features limited scalability, and that large-scale
systems would have to embrace eventual consistency. systems would have to embrace eventual consistency.
However, since then, a number of databases started providing strong consistency and transactions However, since then, a number of databases started providing strong consistency and transactions
while also offering the fault tolerance, high availability, and scalability advantages of a while also offering the fault tolerance, high availability, and scalability advantages of a
distributed database. As mentioned in [“Relational Model versus Document Model”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_history), this trend is known as *NewSQL* to distributed database. As mentioned in [“Relational Model versus Document Model”](/ch03.html#sec_datamodels_history), this trend is known as *NewSQL* to
contrast with NoSQL (although it’s less about SQL specifically, and more about new approaches to contrast with NoSQL (although it’s less about SQL specifically, and more about new approaches to
scalable transaction management). scalable transaction management).
@ -758,7 +756,7 @@ single-leader replication, the leader has to be in *one* of the regions, and all
through that region. through that region.
In a multi-leader configuration, you can have a leader in *each* region. In a multi-leader configuration, you can have a leader in *each* region.
[Figure 6-6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_multi_dc) shows what this architecture might look like. Within each region, [Figure 6-6](/ch06.html#fig_replication_multi_dc) shows what this architecture might look like. Within each region,
regular leader-follower replication is used (with followers maybe in a different availability zone regular leader-follower replication is used (with followers maybe in a different availability zone
from the leader); between regions, each region’s leader replicates its changes to the leaders in from the leader); between regions, each region’s leader replicates its changes to the leaders in
other regions. other regions.
@ -798,7 +796,7 @@ Tolerance of network problems
Consistency Consistency
: A single-leader system can provide strong consistency guarantees, such as serializable : A single-leader system can provide strong consistency guarantees, such as serializable
transactions, which we will discuss in [Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions). The biggest downside of multi-leader transactions, which we will discuss in [Chapter 8](/ch08.html#ch_transactions). The biggest downside of multi-leader
systems is that the consistency they can achieve is much weaker. For example, you can’t guarantee systems is that the consistency they can achieve is much weaker. For example, you can’t guarantee
that a bank account won’t go negative or that a username is unique: it’s always possible for that a bank account won’t go negative or that a username is unique: it’s always possible for
different leaders to process writes that are individually fine (paying out some of the money in an different leaders to process writes that are individually fine (paying out some of the money in an
@ -808,7 +806,7 @@ Consistency
This is simply a fundamental limitation of distributed systems This is simply a fundamental limitation of distributed systems
[^28]. [^28].
If you need to enforce such constraints, you’re therefore better off with a single-leader system. If you need to enforce such constraints, you’re therefore better off with a single-leader system.
However, as we will see in [“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts), multi-leader systems can still However, as we will see in [“Dealing with Conflicting Writes”](/ch06.html#sec_replication_write_conflicts), multi-leader systems can still
achieve consistency properties that are useful in a wide range of apps that don’t need such achieve consistency properties that are useful in a wide range of apps that don’t need such
constraints. constraints.
@ -826,17 +824,17 @@ multi-leader replication is often considered dangerous territory that should be
### Multi-leader replication topologies ### Multi-leader replication topologies
A *replication topology* describes the communication paths along which writes are propagated from A *replication topology* describes the communication paths along which writes are propagated from
one node to another. If you have two leaders, like in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), there is one node to another. If you have two leaders, like in [Figure 6-9](/ch06.html#fig_replication_write_conflict), there is
only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With
more than two leaders, various different topologies are possible. Some examples are illustrated in more than two leaders, various different topologies are possible. Some examples are illustrated in
[Figure 6-7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_topologies). [Figure 6-7](/ch06.html#fig_replication_topologies).
![ddia 0607](/fig/ddia_0607.png) ![ddia 0607](/fig/ddia_0607.png)
###### Figure 6-7. Three example topologies in which multi-leader replication can be set up. ###### Figure 6-7. Three example topologies in which multi-leader replication can be set up.
The most general topology is *all-to-all*, shown in The most general topology is *all-to-all*, shown in
[Figure 6-7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_topologies)(c), [Figure 6-7](/ch06.html#fig_replication_topologies)(c),
in which every leader sends its writes to every other leader. However, more restricted topologies in which every leader sends its writes to every other leader. However, more restricted topologies
are also used: for example a *circular topology* in which each node receives writes from one node are also used: for example a *circular topology* in which each node receives writes from one node
and forwards those writes (plus any writes of its own) to one other node. Another popular topology and forwards those writes (plus any writes of its own) to one other node. Another popular topology
@ -845,7 +843,7 @@ star topology can be generalized to a tree.
> [!NOTE] > [!NOTE]
> Don’t confuse a star-shaped network topology with a *star schema* (see > Don’t confuse a star-shaped network topology with a *star schema* (see
> [“Stars and Snowflakes: Schemas for Analytics”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_analytics)), which describes the structure of a data model. > [“Stars and Snowflakes: Schemas for Analytics”](/ch03.html#sec_datamodels_analytics)), which describes the structure of a data model.
In circular and star topologies, a write may need to pass through several nodes before it reaches In circular and star topologies, a write may need to pass through several nodes before it reaches
all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To
@ -866,28 +864,28 @@ along different paths, avoiding a single point of failure.
On the other hand, all-to-all topologies can have issues too. In particular, some network links may On the other hand, all-to-all topologies can have issues too. In particular, some network links may
be faster than others (e.g., due to network congestion), with the result that some replication be faster than others (e.g., due to network congestion), with the result that some replication
messages may “overtake” others, as illustrated in [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality). messages may “overtake” others, as illustrated in [Figure 6-8](/ch06.html#fig_replication_causality).
![ddia 0608](/fig/ddia_0608.png) ![ddia 0608](/fig/ddia_0608.png)
###### Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas. ###### Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas.
In [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality), client A inserts a row into a table on leader 1, and client B In [Figure 6-8](/ch06.html#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may
first receive the update (which, from its point of view, is an update to a row that does not exist first receive the update (which, from its point of view, is an update to a row that does not exist
in the database) and only later receive the corresponding insert (which should have preceded the in the database) and only later receive the corresponding insert (which should have preceded the
update). update).
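The anomaly described above can be reproduced with a small sketch. It assumes a replica is just an in-memory dictionary and that each replicated write is a hypothetical `(kind, key, value)` tuple; neither the `apply_write` helper nor the row names come from the original text.

```python
def apply_write(table: dict, op: tuple) -> bool:
    """Apply one replicated write; return False if it cannot be applied."""
    kind, key, value = op
    if kind == "insert":
        table[key] = value
        return True
    if kind == "update":
        if key not in table:
            # From this replica's point of view the row does not exist yet.
            return False
        table[key] = value
        return True
    raise ValueError(f"unknown operation: {kind}")


insert = ("insert", "row1", "v1")  # client A, via leader 1
update = ("update", "row1", "v2")  # client B, via leader 3

# Leader 3 sees the writes in causal order.
leader3 = {}
in_order = [apply_write(leader3, op) for op in (insert, update)]

# Leader 2 receives the update first because it overtook the insert.
leader2 = {}
out_of_order = [apply_write(leader2, op) for op in (update, insert)]
```

In this sketch leader 2 drops the update and ends up with the inserted value, while leader 3 holds the updated value, so the two replicas have diverged.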
This is a problem of causality, similar to the one we saw in [“Consistent Prefix Reads”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_consistent_prefix): This is a problem of causality, similar to the one we saw in [“Consistent Prefix Reads”](/ch06.html#sec_replication_consistent_prefix):
the update depends on the prior insert, so we need to make sure that all nodes process the insert the update depends on the prior insert, so we need to make sure that all nodes process the insert
first, and then the update. Simply attaching a timestamp to every write is not sufficient, because first, and then the update. Simply attaching a timestamp to every write is not sufficient, because
clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see
[Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed)). [Chapter 9](/ch09.html#ch_distributed)).
To order these events correctly, a technique called *version vectors* can be used, which we will To order these events correctly, a technique called *version vectors* can be used, which we will
discuss later in this chapter (see [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent)). However, many multi-leader discuss later in this chapter (see [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent)). However, many multi-leader
replication systems don’t use good techniques for ordering updates, leaving them vulnerable to replication systems don’t use good techniques for ordering updates, leaving them vulnerable to
issues like the one in [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality). If you are using multi-leader replication, it issues like the one in [Figure 6-8](/ch06.html#fig_replication_causality). If you are using multi-leader replication, it
is worth being aware of these issues, carefully reading the documentation, and thoroughly testing is worth being aware of these issues, carefully reading the documentation, and thoroughly testing
your database to ensure that it really does provide the guarantees you believe it to have. your database to ensure that it really does provide the guarantees you believe it to have.
@ -918,9 +916,9 @@ Sheets for text documents and spreadsheets, Figma for graphics, and Linear for p
What makes these apps so responsive is that user input is immediately reflected in the user What makes these apps so responsive is that user input is immediately reflected in the user
interface, without waiting for a network round-trip to the server, and edits by one user are shown interface, without waiting for a network round-trip to the server, and edits by one user are shown
to their collaborators with low latency to their collaborators with low latency
[[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DayRichter2010), [[32](/ch06.html#DayRichter2010),
[33](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Wallace2019), [33](/ch06.html#Wallace2019),
[34](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Artman2023)]. [34](/ch06.html#Artman2023)].
This again results in a multi-leader architecture: each web browser tab that has opened the shared This again results in a multi-leader architecture: each web browser tab that has opened the shared
file is a replica, and any updates that you make to the file are asynchronously replicated to the file is a replica, and any updates that you make to the file are asynchronously replicated to the
@ -938,9 +936,9 @@ those changes.
A software library that supports this process is called a *sync engine*. Although the idea has A software library that supports this process is called a *sync engine*. Although the idea has
existed for a long time, the term has recently gained attention existed for a long time, the term has recently gained attention
[[35](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Saafan2024), [[35](/ch06.html#Saafan2024),
[36](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hagoel2024), [36](/ch06.html#Hagoel2024),
[37](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Jayakar2024)]. [37](/ch06.html#Jayakar2024)].
An application that allows a user to continue editing a file while offline (which may be implemented An application that allows a user to continue editing a file while offline (which may be implemented
using a sync engine) is called *offline-first* using a sync engine) is called *offline-first*
[^38]. [^38].
@ -970,7 +968,7 @@ approach has a number of advantages:
offline is the same as having very large network delay. offline is the same as having very large network delay.
* A sync engine simplifies the programming model for frontend apps, compared to performing explicit * A sync engine simplifies the programming model for frontend apps, compared to performing explicit
service calls in application code. Every service call requires error handling, as discussed in service calls in application code. Every service call requires error handling, as discussed in
[“The problems with remote procedure calls (RPCs)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch05.html#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user [“The problems with remote procedure calls (RPCs)”](/ch05.html#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user
interface needs to somehow reflect that error. A sync engine allows the app to perform reads and interface needs to somehow reflect that error. A sync engine allows the app to perform reads and
writes on local data, which almost never fails, leading to a more declarative programming style writes on local data, which almost never fails, leading to a more declarative programming style
[^41]. [^41].
@ -1007,7 +1005,7 @@ a local-first sync engine on end user devices—is that concurrent writes on dif
lead to conflicts that need to be resolved. lead to conflicts that need to be resolved.
For example, consider a wiki page that is simultaneously being edited by two users, as shown in For example, consider a wiki page that is simultaneously being edited by two users, as shown in
[Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2 [Figure 6-9](/ch06.html#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2
independently changes the title from A to C. Each user’s change is successfully applied to their independently changes the title from A to C. Each user’s change is successfully applied to their
local leader. However, when the changes are asynchronously replicated, a conflict is detected. local leader. However, when the changes are asynchronously replicated, a conflict is detected.
This problem does not occur in a single-leader database. This problem does not occur in a single-leader database.
@ -1017,13 +1015,13 @@ This problem does not occur in a single-leader database.
###### Figure 6-9. A write conflict caused by two leaders concurrently updating the same record. ###### Figure 6-9. A write conflict caused by two leaders concurrently updating the same record.
> [!NOTE] > [!NOTE]
> We say that the two writes in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict) are *concurrent* because neither > We say that the two writes in [Figure 6-9](/ch06.html#fig_replication_write_conflict) are *concurrent* because neither
> was “aware” of the other at the time the write was originally made. It doesn’t matter whether the > was “aware” of the other at the time the write was originally made. It doesn’t matter whether the
> writes literally happened at the same time; indeed, if the writes were made while offline, they > writes literally happened at the same time; indeed, if the writes were made while offline, they
> might have actually happened some time apart. What matters is whether one write occurred in a state > might have actually happened some time apart. What matters is whether one write occurred in a state
> where the other write has already taken effect. > where the other write has already taken effect.
In [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent) we will tackle the question of how a database can determine In [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent) we will tackle the question of how a database can determine
whether two writes are concurrent. For now we will assume that we can detect conflicts, and we want whether two writes are concurrent. For now we will assume that we can detect conflicts, and we want
to figure out the best way of resolving them. to figure out the best way of resolving them.
@ -1052,13 +1050,13 @@ Another example of conflict avoidance: imagine you want to insert new records an
IDs for them based on an auto-incrementing counter. If you have two leaders, you could set them up IDs for them based on an auto-incrementing counter. If you have two leaders, you could set them up
so that one leader only generates odd numbers and the other only generates even numbers. That way so that one leader only generates odd numbers and the other only generates even numbers. That way
you can be sure that the two leaders won’t concurrently assign the same ID to different records. you can be sure that the two leaders won’t concurrently assign the same ID to different records.
We will discuss other ID assignment schemes in [“ID Generators and Logical Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_logical). We will discuss other ID assignment schemes in [“ID Generators and Logical Clocks”](/ch10.html#sec_consistency_logical).
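As a minimal sketch of the odd/even scheme, the generator below gives each leader its own residue class of IDs. The helper name `id_generator` and its parameters are assumptions for illustration, not part of any particular database.

```python
from itertools import count, islice


def id_generator(leader_index: int, num_leaders: int):
    """Return an iterator of IDs reserved for one leader (hypothetical helper).

    Leader 0 of 2 gets 1, 3, 5, ...; leader 1 of 2 gets 2, 4, 6, ...
    """
    if not 0 <= leader_index < num_leaders:
        raise ValueError("leader_index must be in range(num_leaders)")
    return count(start=leader_index + 1, step=num_leaders)


leader_a = id_generator(0, 2)  # odd IDs
leader_b = id_generator(1, 2)  # even IDs

ids_a = list(islice(leader_a, 5))
ids_b = list(islice(leader_b, 5))
```

Because the step equals the number of leaders, the same idea extends to more than two leaders, as long as the number of leaders is fixed in advance.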
### Last write wins (discarding concurrent writes) ### Last write wins (discarding concurrent writes)
If conflicts can’t be avoided, the simplest way of resolving them is to attach a timestamp to each If conflicts can’t be avoided, the simplest way of resolving them is to attach a timestamp to each
write, and to always use the value with the greatest timestamp. For example, in write, and to always use the value with the greatest timestamp. For example, in
[Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than [Figure 6-9](/ch06.html#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than
the timestamp of user 2’s write. In that case, both leaders will determine that the new title of the the timestamp of user 2’s write. In that case, both leaders will determine that the new title of the
page should be B, and they discard the write that sets it to C. If the writes coincidentally have page should be B, and they discard the write that sets it to C. If the writes coincidentally have
the same timestamp, the winner can be chosen by comparing the values (e.g., in the case of strings, the same timestamp, the winner can be chosen by comparing the values (e.g., in the case of strings,
@ -1066,7 +1064,7 @@ taking the one thats earlier in the alphabet).
This approach is called *last write wins* (LWW) because the write with the greatest timestamp can be This approach is called *last write wins* (LWW) because the write with the greatest timestamp can be
considered the “last” one. The term is misleading though, because when two writes are concurrent considered the “last” one. The term is misleading though, because when two writes are concurrent
like in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), which one is older and which is later is undefined, and like in [Figure 6-9](/ch06.html#fig_replication_write_conflict), which one is older and which is later is undefined, and
so the timestamp order of concurrent writes is essentially random. so the timestamp order of concurrent writes is essentially random.
Therefore the real meaning of LWW is: when the same record is concurrently written on different Therefore the real meaning of LWW is: when the same record is concurrently written on different
@ -1084,7 +1082,7 @@ Another problem with LWW is that if a real-time clock (e.g. a Unix timestamp) is
for the writes, the system becomes very sensitive to clock synchronization. If one node has a clock for the writes, the system becomes very sensitive to clock synchronization. If one node has a clock
that is ahead of the others, and you try to overwrite a value written by that node, your write may that is ahead of the others, and you try to overwrite a value written by that node, your write may
be ignored as it may have a lower timestamp, even though it clearly occurred later. This problem can be ignored as it may have a lower timestamp, even though it clearly occurred later. This problem can
be solved by using a *logical clock*, which we will discuss in [“ID Generators and Logical Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_logical). be solved by using a *logical clock*, which we will discuss in [“ID Generators and Logical Clocks”](/ch10.html#sec_consistency_logical).
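The LWW rule and its sensitivity to clock skew can be sketched in a few lines. The `Write` class, the `lww_merge` function, and the timestamp values are assumptions made for illustration; real systems attach more metadata to each write.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp: float  # wall-clock time assigned where the write was accepted


def lww_merge(a: Write, b: Write) -> Write:
    """Keep the write with the greatest timestamp; break ties by value."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Same timestamp: pick the value that is earlier in the alphabet.
    return a if a.value <= b.value else b


# The title conflict: user 1 writes B, user 2 writes C.
w1 = Write("B", timestamp=100.0)
w2 = Write("C", timestamp=99.0)
winner = lww_merge(w1, w2)

# Clock skew: a node whose clock runs ahead stamped its write with 200.0.
# A write made later in real time, stamped 150.0, is silently discarded.
from_fast_clock = Write("old", timestamp=200.0)
made_later = Write("new", timestamp=150.0)
skewed = lww_merge(from_fast_clock, made_later)
```

The merge gives the same result regardless of argument order, which is what lets every leader converge on the same value without coordination.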
### Manual conflict resolution ### Manual conflict resolution
@ -1096,7 +1094,7 @@ merge is complete.
In a database, it would be impractical for a conflict to stop the entire replication process until a In a database, it would be impractical for a conflict to stop the entire replication process until a
human has resolved it. Instead, databases typically store all the concurrently written values for a human has resolved it. Instead, databases typically store all the concurrently written values for a
given record—for example, both B and C in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict). These values are given record—for example, both B and C in [Figure 6-9](/ch06.html#fig_replication_write_conflict). These values are
sometimes called *siblings*. The next time you query that record, the database returns *all* those sometimes called *siblings*. The next time you query that record, the database returns *all* those
values, rather than just the latest one. You can then resolve those values in whatever way you want, values, rather than just the latest one. You can then resolve those values in whatever way you want,
either automatically in application code (for example, you could concatenate B and C into “B/C”), or either automatically in application code (for example, you could concatenate B and C into “B/C”), or
@ -1120,7 +1118,7 @@ suffers from a number of problems:
sibling, but another sibling still contained that old item, the removed item would unexpectedly sibling, but another sibling still contained that old item, the removed item would unexpectedly
reappear in the customer’s cart reappear in the customer’s cart
[^45]. [^45].
[Figure 6-10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping [Figure 6-10](/ch06.html#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear. cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear.
* If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution * If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution
process can itself introduce a new conflict. Those resolutions could even be inconsistent: for process can itself introduce a new conflict. Those resolutions could even be inconsistent: for
@ -1149,7 +1147,7 @@ updates as much as possible, and hence avoiding data loss:
same position, it can be ordered deterministically so that all nodes get the same merged outcome. same position, it can be ordered deterministically so that all nodes get the same merged outcome.
* If the data is a collection of items (ordered like a to-do list, or unordered like a shopping * If the data is a collection of items (ordered like a to-do list, or unordered like a shopping
cart), we can merge it similarly to text by tracking insertions and deletions. To avoid the cart), we can merge it similarly to text by tracking insertions and deletions. To avoid the
shopping cart issue in [Figure 6-10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_amazon_anomaly), the algorithms track the fact that Book shopping cart issue in [Figure 6-10](/ch06.html#fig_replication_amazon_anomaly), the algorithms track the fact that Book
and DVD were deleted, so the merged result is Cart = {Soap}. and DVD were deleted, so the merged result is Cart = {Soap}.
* If the data is an integer representing a counter that can be incremented or decremented (e.g., the * If the data is an integer representing a counter that can be incremented or decremented (e.g., the
number of likes on a social media post), the merge algorithm can tell how many increments and number of likes on a social media post), the merge algorithm can tell how many increments and
@ -1175,7 +1173,7 @@ Two families of algorithms are commonly used to implement automatic conflict res
They have different design philosophies and performance characteristics, but both are able to They have different design philosophies and performance characteristics, but both are able to
perform automatic merges for all the aforementioned types of data. perform automatic merges for all the aforementioned types of data.
[Figure 6-11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a [Figure 6-11](/ch06.html#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
text. Assume you have two replicas that both start off with the text “ice”. One replica prepends the text. Assume you have two replicas that both start off with the text “ice”. One replica prepends the
letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make
“ice!”. “ice!”.
@ -1196,7 +1194,7 @@ OT
CRDT CRDT
: Most CRDTs give each character a unique, immutable ID and use those to determine the positions of : Most CRDTs give each character a unique, immutable ID and use those to determine the positions of
insertions/deletions, instead of indexes. For example, in [Figure 6-11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_ot_crdt) we assign insertions/deletions, instead of indexes. For example, in [Figure 6-11](/ch06.html#fig_replication_ot_crdt) we assign
the ID 1A to “i”, the ID 2A to “c”, etc. When inserting the exclamation mark, we generate an the ID 1A to “i”, the ID 2A to “c”, etc. When inserting the exclamation mark, we generate an
operation containing the ID of the new character (4B) and the ID of the existing character after operation containing the ID of the new character (4B) and the ID of the existing character after
which we want to insert (3A). To insert at the beginning of the string we give “nil” as the which we want to insert (3A). To insert at the beginning of the string we give “nil” as the
@ -1218,7 +1216,7 @@ Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge o
### What is a conflict? ### What is a conflict?
Some kinds of conflict are obvious. In the example in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), two writes Some kinds of conflict are obvious. In the example in [Figure 6-9](/ch06.html#fig_replication_write_conflict), two writes
concurrently modified the same field in the same record, setting it to two different values. There concurrently modified the same field in the same record, setting it to two different values. There
is little doubt that this is a conflict. is little doubt that this is a conflict.
@ -1232,7 +1230,7 @@ are made on two different leaders.
There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a
good understanding of this problem. We will see some more examples of conflicts in good understanding of this problem. We will see some more examples of conflicts in
[Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and [Chapter 8](/ch08.html#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and
resolving conflicts in a replicated system. resolving conflicts in a replicated system.
# Leaderless Replication # Leaderless Replication
@ -1245,8 +1243,8 @@ writes in the same order.
Some data storage systems take a different approach, abandoning the concept of a leader and Some data storage systems take a different approach, abandoning the concept of a leader and
allowing any replica to directly accept writes from clients. Some of the earliest replicated data allowing any replica to directly accept writes from clients. Some of the earliest replicated data
systems were leaderless [[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lindsay1979_ch6), systems were leaderless [[1](/ch06.html#Lindsay1979_ch6),
[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gifford1979)], but the [50](/ch06.html#Gifford1979)], but the
idea was mostly forgotten during the era of dominance of relational databases. It once again became idea was mostly forgotten during the era of dominance of relational databases. It once again became
a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in
2007 [^45]. 2007 [^45].
@ -1270,10 +1268,10 @@ profound consequences for the way the database is used.
Imagine you have a database with three replicas, and one of the replicas is currently Imagine you have a database with three replicas, and one of the replicas is currently
unavailable—perhaps it is being rebooted to install a system update. In a single-leader unavailable—perhaps it is being rebooted to install a system update. In a single-leader
configuration, if you want to continue processing writes, you may need to perform a failover (see configuration, if you want to continue processing writes, you may need to perform a failover (see
[“Handling Node Outages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_failover)). [“Handling Node Outages”](/ch06.html#sec_replication_failover)).
On the other hand, in a leaderless configuration, failover does not exist. On the other hand, in a leaderless configuration, failover does not exist.
[Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to
all three replicas in parallel, and the two available replicas accept the write but the unavailable all three replicas in parallel, and the two available replicas accept the write but the unavailable
replica misses it. Let’s say that it’s sufficient for two out of three replicas to replica misses it. Let’s say that it’s sufficient for two out of three replicas to
acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be
@ -1294,9 +1292,9 @@ stale value from another.
In order to tell which responses are up-to-date and which are outdated, every value that is written In order to tell which responses are up-to-date and which are outdated, every value that is written
needs to be tagged with a version number or timestamp, similarly to what we saw in needs to be tagged with a version number or timestamp, similarly to what we saw in
[“Last write wins (discarding concurrent writes)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lww). When a client receives multiple values in response to a read, it uses the [“Last write wins (discarding concurrent writes)”](/ch06.html#sec_replication_lww). When a client receives multiple values in response to a read, it uses the
one with the greatest timestamp (even if that value was only returned by one replica, and several one with the greatest timestamp (even if that value was only returned by one replica, and several
other replicas returned older values). See [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent) for more details. other replicas returned older values). See [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent) for more details.
### Catching up on missed writes ### Catching up on missed writes
@ -1306,7 +1304,7 @@ mechanisms are used in Dynamo-style datastores:
Read repair Read repair
: When a client makes a read from several nodes in parallel, it can detect any stale responses. : When a client makes a read from several nodes in parallel, it can detect any stale responses.
For example, in [Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from For example, in [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from
replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale
value and writes the newer value back to that replica. This approach works well for values that are value and writes the newer value back to that replica. This approach works well for values that are
frequently read. frequently read.
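A rough sketch of read repair follows, assuming each replica is modeled as an entry in an in-memory dictionary holding a `(version, value)` pair. The function name, replica names, and stored values are hypothetical; a real client would issue network requests instead.

```python
from typing import Dict, Tuple

Versioned = Tuple[int, str]  # (version number, value)


def read_with_repair(replicas: Dict[str, Versioned]) -> Versioned:
    """Read from all replicas, return the newest value, repair stale ones."""
    newest = max(replicas.values(), key=lambda pair: pair[0])
    for name, stored in list(replicas.items()):
        if stored[0] < newest[0]:
            # Stale response detected: write the newer value back.
            replicas[name] = newest
    return newest


# Replica 3 missed the latest write and still holds version 6.
replicas = {
    "replica1": (7, "new value"),
    "replica2": (7, "new value"),
    "replica3": (6, "old value"),
}
result = read_with_repair(replicas)
```

After the call, all three replicas hold version 7. A value that is never read is never repaired this way, which is why this mechanism suits frequently read values.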
@ -1326,7 +1324,7 @@ Anti-entropy
### Quorums for reading and writing ### Quorums for reading and writing
In the example of [Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage), we considered the write to be successful In the example of [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage), we considered the write to be successful
even though it was only processed on two out of three replicas. What if only one out of three even though it was only processed on two out of three replicas. What if only one out of three
replicas accepted the write? How far can we push this? replicas accepted the write? How far can we push this?
@ -1354,7 +1352,7 @@ database writes to fail.
> [!NOTE] > [!NOTE]
> There may be more than *n* nodes in the cluster, but any given value is stored only on *n* > There may be more than *n* nodes in the cluster, but any given value is stored only on *n*
> nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit > nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit
> on one node. We will return to sharding in [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding). > on one node. We will return to sharding in [Chapter 7](/ch07.html#ch_sharding).
The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes
as follows: as follows:
@ -1362,9 +1360,9 @@ as follows:
* If *w* < *n*, we can still process writes if a node is unavailable. * If *w* < *n*, we can still process writes if a node is unavailable.
* If *r* < *n*, we can still process reads if a node is unavailable. * If *r* < *n*, we can still process reads if a node is unavailable.
* With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable * With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable
node, like in [Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage). node, like in [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage).
* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes. * With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes.
This case is illustrated in [Figure 6-13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_overlap). This case is illustrated in [Figure 6-13](/ch06.html#fig_replication_quorum_overlap).
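
The arithmetic behind the bullets above can be sketched in a few lines. This is only an illustration: the function name `quorum_tolerance` and the keys of the returned dictionary are hypothetical and do not come from any particular database.

```python
def quorum_tolerance(n: int, w: int, r: int) -> dict:
    """Summarize what an (n, w, r) configuration can tolerate.

    n: number of replicas storing each value
    w: number of replicas that must confirm a write
    r: number of replicas that must answer a read
    """
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    return {
        # The quorum condition from the text: w + r > n
        "quorum_condition_holds": w + r > n,
        # Writes still succeed as long as w replicas are reachable
        "writes_tolerate_unavailable": n - w,
        # Reads still succeed as long as r replicas are reachable
        "reads_tolerate_unavailable": n - r,
        # Both reads and writes keep working with this many nodes down
        "both_tolerate_unavailable": n - max(w, r),
    }


for config in [(3, 2, 2), (5, 3, 3)]:
    print(config, quorum_tolerance(*config))
```

For the two configurations named in the list, the sketch reports one and two tolerable unavailable nodes respectively, matching the text.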
Normally, reads and writes are always sent to all *n* replicas in parallel. The parameters *w* and Normally, reads and writes are always sent to all *n* replicas in parallel. The parameters *w* and
*r* determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success *r* determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success
@ -1386,7 +1384,7 @@ If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*
generally expect every read to return the most recent value written for a key. This is the case because the generally expect every read to return the most recent value written for a key. This is the case because the
set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That
is, among the nodes you read there must be at least one node with the latest value (illustrated in is, among the nodes you read there must be at least one node with the latest value (illustrated in
[Figure 6-13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_overlap)). [Figure 6-13](/ch06.html#fig_replication_quorum_overlap)).
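
The overlap argument can be checked by brute force for small clusters. The sketch below enumerates every possible write set of size *w* and read set of size *r* and tests whether they always share a node. The function name `quorums_always_overlap` is made up for this illustration, and the check assumes a fixed set of *n* replicas with no failures or rebalancing.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """Return True if every write set of size w intersects every read set of size r."""
    nodes = range(n)
    return all(
        # An empty intersection is falsy, so all() fails on the first disjoint pair
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


print(quorums_always_overlap(3, 2, 2))  # w + r > n
print(quorums_always_overlap(5, 2, 3))  # w + r == n, so disjoint sets exist
```

Whenever *w* + *r* > *n*, no disjoint pair can exist, because the two sets together would need more than *n* distinct nodes.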
Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, because that ensures Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, because that ensures
*w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are *w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are
@ -1413,12 +1411,12 @@ properties can be confusing. Some scenarios include:
value, the number of replicas storing the new value may fall below *w*, breaking the quorum value, the number of replicas storing the new value may fall below *w*, breaking the quorum
condition. condition.
* While a rebalancing is in progress, where some data is moved from one node to another (see * While a rebalancing is in progress, where some data is moved from one node to another (see
[Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n* [Chapter 7](/ch07.html#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n*
replicas for a particular value. This can result in the read and write quorums no longer replicas for a particular value. This can result in the read and write quorums no longer
overlapping. overlapping.
* If a read is concurrent with a write operation, the read may or may not see the concurrently * If a read is concurrent with a write operation, the read may or may not see the concurrently
written value. In particular, it’s possible for one read to see the new value, and a subsequent written value. In particular, it’s possible for one read to see the new value, and a subsequent
read to see the old value, as we shall see in [“Linearizability and quorums”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_quorum_linearizable). read to see the old value, as we shall see in [“Linearizability and quorums”](/ch10.html#sec_consistency_quorum_linearizable).
* If a write succeeded on some replicas but failed on others (for example because the disks on some * If a write succeeded on some replicas but failed on others (for example because the disks on some
nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the
replicas where it succeeded. This means that if a write was reported as failed, subsequent reads replicas where it succeeded. This means that if a write was reported as failed, subsequent reads
@ -1426,12 +1424,12 @@ properties can be confusing. Some scenarios include:
[^52]. [^52].
* If the database uses timestamps from a real-time clock to determine which write is newer (as * If the database uses timestamps from a real-time clock to determine which write is newer (as
Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a
faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lww). faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](/ch06.html#sec_replication_lww).
We will discuss this in more detail in [“Relying on Synchronized Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_clocks_relying). We will discuss this in more detail in [“Relying on Synchronized Clocks”](/ch09.html#sec_distributed_clocks_relying).
* If two writes occur concurrently, one of them might be processed first on one replica, and the * If two writes occur concurrently, one of them might be processed first on one replica, and the
other might be processed first on another replica. This leads to a conflict, similarly to what we other might be processed first on another replica. This leads to a conflict, similarly to what we
saw for multi-leader replication (see [“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts)). We will return to this saw for multi-leader replication (see [“Dealing with Conflicting Writes”](/ch06.html#sec_replication_write_conflicts)). We will return to this
topic in [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent). topic in [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent).
Thus, although quorums appear to guarantee that a read returns the latest written value, in practice Thus, although quorums appear to guarantee that a read returns the latest written value, in practice
it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate
@ -1463,7 +1461,7 @@ able to quantify “eventual.”
A replication system based on a single leader can provide strong consistency guarantees that are A replication system based on a single leader can provide strong consistency guarantees that are
difficult or impossible to achieve in a leaderless system. However, as we have seen in difficult or impossible to achieve in a leaderless system. However, as we have seen in
[“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag), reads in a leader-based replicated system can also return stale values if [“Problems with Replication Lag”](/ch06.html#sec_replication_lag), reads in a leader-based replicated system can also return stale values if
you make them on an asynchronously updated follower. you make them on an asynchronously updated follower.
Reading from the leader ensures up-to-date responses, but it suffers from performance problems: Reading from the leader ensures up-to-date responses, but it suffers from performance problems:
@ -1507,7 +1505,7 @@ That said, leaderless systems can have performance problems as well:
to wait for before a request can complete. Even if you wait only for the fastest *r* or *w* to wait for before a request can complete. Even if you wait only for the fastest *r* or *w*
replicas to respond, and even if you make the requests in parallel, a bigger *r* or *w* increases replicas to respond, and even if you make the requests in parallel, a bigger *r* or *w* increases
the chance that you hit a slow replica, increasing the overall response time (see the chance that you hit a slow replica, increasing the overall response time (see
[“Use of Response Time Metrics”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_slo_sla)). [“Use of Response Time Metrics”](/ch02.html#sec_introduction_slo_sla)).
* A large-scale network interruption that disconnects a client from a large number of replicas can * A large-scale network interruption that disconnects a client from a large number of replicas can
make it impossible to form a quorum. Some leaderless databases offer a configuration option that make it impossible to form a quorum. Some leaderless databases offer a configuration option that
allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that
@ -1526,7 +1524,7 @@ fault tolerance while also having a high likelihood of reading up-to-date data.
### Multi-region operation ### Multi-region operation
We previously discussed cross-region replication as a use case for multi-leader replication (see We previously discussed cross-region replication as a use case for multi-leader replication (see
[“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader)). Leaderless replication is also suitable for [“Multi-Leader Replication”](/ch06.html#sec_replication_multi_leader)). Leaderless replication is also suitable for
multi-region operation, since it is designed to tolerate conflicting concurrent writes, network multi-region operation, since it is designed to tolerate conflicting concurrent writes, network
interruptions, and latency spikes. interruptions, and latency spikes.
@ -1549,7 +1547,7 @@ resulting in conflicts that need to be resolved. Such conflicts may occur as the
not always: they could also be detected later during read repair, hinted handoff, or anti-entropy. not always: they could also be detected later during read repair, hinted handoff, or anti-entropy.
The problem is that events may arrive in a different order at different nodes, due to variable The problem is that events may arrive in a different order at different nodes, due to variable
network delays and partial failures. For example, [Figure 6-14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_concurrency) shows two clients, network delays and partial failures. For example, [Figure 6-14](/ch06.html#fig_replication_concurrency) shows two clients,
A and B, simultaneously writing to a key *X* in a three-node datastore: A and B, simultaneously writing to a key *X* in a three-node datastore:
* Node 1 receives the write from A, but never receives the write from B due to a transient * Node 1 receives the write from A, but never receives the write from B due to a transient
@ -1563,13 +1561,13 @@ A and B, simultaneously writing to a key *X* in a three-node datastore:
If each node simply overwrote the value for a key whenever it received a write request from a If each node simply overwrote the value for a key whenever it received a write request from a
client, the nodes would become permanently inconsistent, as shown by the final *get* request in client, the nodes would become permanently inconsistent, as shown by the final *get* request in
[Figure 6-14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other [Figure 6-14](/ch06.html#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other
nodes think that the value is A. nodes think that the value is A.
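
A tiny simulation makes the divergence concrete. The delivery orders below are an assumption chosen to match the outcome described above (node 1 sees only A's write, node 2 ends on B, node 3 ends on A). The names `apply_naive_overwrite` and `deliveries_per_node` are hypothetical.

```python
def apply_naive_overwrite(deliveries):
    """Apply writes in arrival order, each one blindly replacing the previous value."""
    value = None
    for write in deliveries:
        value = write
    return value


# Assumed arrival order of the two concurrent writes to key X at each node
deliveries_per_node = {
    "node1": ["A"],        # B's write never arrives (transient outage)
    "node2": ["A", "B"],   # A first, then B
    "node3": ["B", "A"],   # B first, then A
}

final_values = {
    node: apply_naive_overwrite(writes)
    for node, writes in deliveries_per_node.items()
}
print(final_values)
```

With plain overwriting, nothing ever brings the replicas back into agreement, which is why a conflict resolution rule is needed.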
In order to become eventually consistent, the replicas should converge toward the same value. For In order to become eventually consistent, the replicas should converge toward the same value. For
this, we can use any of the conflict resolution mechanisms we previously discussed in this, we can use any of the conflict resolution mechanisms we previously discussed in
[“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts), such as last-write-wins (used by Cassandra and ScyllaDB), [“Dealing with Conflicting Writes”](/ch06.html#sec_replication_write_conflicts), such as last-write-wins (used by Cassandra and ScyllaDB),
manual resolution, or CRDTs (described in [“CRDTs and Operational Transformation”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_crdts), and used by Riak). manual resolution, or CRDTs (described in [“CRDTs and Operational Transformation”](/ch06.html#sec_replication_crdts), and used by Riak).
Last-write-wins is easy to implement: each write is tagged with a timestamp, and a value with a Last-write-wins is easy to implement: each write is tagged with a timestamp, and a value with a
higher timestamp always overwrites a value with a lower timestamp. However, a timestamp doesn’t tell higher timestamp always overwrites a value with a lower timestamp. However, a timestamp doesn’t tell
@ -1582,11 +1580,11 @@ take more care to detect concurrent writes.
How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look
at some examples: at some examples:
* In [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before* * In [Figure 6-8](/ch06.html#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before*
B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s
operation builds upon A’s operation, so B’s operation must have happened later. operation builds upon A’s operation, so B’s operation must have happened later.
We also say that B is *causally dependent* on A. We also say that B is *causally dependent* on A.
* On the other hand, the two writes in [Figure 6-14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_concurrency) are concurrent: when each * On the other hand, the two writes in [Figure 6-14](/ch06.html#fig_replication_concurrency) are concurrent: when each
client starts the operation, it does not know that another client is also performing an operation client starts the operation, it does not know that another client is also performing an operation
on the same key. Thus, there is no causal dependency between the operations. on the same key. Thus, there is no causal dependency between the operations.
@ -1607,7 +1605,7 @@ conflict that needs to be resolved.
It may seem that two operations should be called concurrent if they occur “at the same time”—but It may seem that two operations should be called concurrent if they occur “at the same time”—but
in fact, it is not important whether they literally overlap in time. Because of problems with clocks in fact, it is not important whether they literally overlap in time. Because of problems with clocks
in distributed systems, it is actually quite difficult to tell whether two things happened in distributed systems, it is actually quite difficult to tell whether two things happened
at exactly the same time—an issue we will discuss in more detail in [Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed). at exactly the same time—an issue we will discuss in more detail in [Chapter 9](/ch09.html#ch_distributed).
For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if
they are both unaware of each other, regardless of the physical time at which they occurred. People they are both unaware of each other, regardless of the physical time at which they occurred. People
@ -1629,7 +1627,7 @@ happened before another. To keep things simple, lets start with a database th
replica. Once we have worked out how to do this on a single replica, we can generalize the approach replica. Once we have worked out how to do this on a single replica, we can generalize the approach
to a leaderless database with multiple replicas. to a leaderless database with multiple replicas.
[Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) shows two clients concurrently adding items to the same [Figure 6-15](/ch06.html#fig_replication_causality_single) shows two clients concurrently adding items to the same
shopping cart. (If that example strikes you as too inane, imagine instead two air traffic shopping cart. (If that example strikes you as too inane, imagine instead two air traffic
controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is
empty. Between them, the clients make five writes to the database: empty. Between them, the clients make five writes to the database:
@ -1664,8 +1662,8 @@ empty. Between them, the clients make five writes to the database:
###### Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart. ###### Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart.
The dataflow between the operations in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) is illustrated The dataflow between the operations in [Figure 6-15](/ch06.html#fig_replication_causality_single) is illustrated
graphically in [Figure 6-16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causal_dependencies). The arrows indicate which operation graphically in [Figure 6-16](/ch06.html#fig_replication_causal_dependencies). The arrows indicate which operation
*happened before* which other operation, in the sense that the later operation *knew about* or *happened before* which other operation, in the sense that the later operation *knew about* or
*depended on* the earlier one. In this example, the clients are never fully up to date with the data *depended on* the earlier one. In this example, the clients are never fully up to date with the data
on the server, since there is always another operation going on concurrently. But old versions of on the server, since there is always another operation going on concurrently. But old versions of
@ -1673,7 +1671,7 @@ the value do get overwritten eventually, and no writes are lost.
![ddia 0616](/fig/ddia_0616.png) ![ddia 0616](/fig/ddia_0616.png)
###### Figure 6-16. Graph of causal dependencies in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single). ###### Figure 6-16. Graph of causal dependencies in [Figure 6-15](/ch06.html#fig_replication_causality_single).
Note that the server can determine whether two operations are concurrent by looking at the version Note that the server can determine whether two operations are concurrent by looking at the version
numbers—it does not need to interpret the value itself (so the value could be any data numbers—it does not need to interpret the value itself (so the value could be any data
@ -1699,10 +1697,10 @@ on subsequent reads.
### Version vectors ### Version vectors
The example in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) used only a single replica. How does the The example in [Figure 6-15](/ch06.html#fig_replication_causality_single) used only a single replica. How does the
algorithm change when there are multiple replicas, but no leader? algorithm change when there are multiple replicas, but no leader?
[Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) uses a single version number to capture dependencies between [Figure 6-15](/ch06.html#fig_replication_causality_single) uses a single version number to capture dependencies between
operations, but that is not sufficient when there are multiple replicas accepting writes operations, but that is not sufficient when there are multiple replicas accepting writes
concurrently. Instead, we need to use a version number *per replica* as well as per key. Each concurrently. Instead, we need to use a version number *per replica* as well as per key. Each
replica increments its own version number when processing a write, and also keeps track of the replica increments its own version number when processing a write, and also keeps track of the
@ -1713,14 +1711,14 @@ The collection of version numbers from all the replicas is called a *version vec
[^58]. [^58].
A few variants of this idea are in use, but the most interesting is probably the *dotted version A few variants of this idea are in use, but the most interesting is probably the *dotted version
vector* vector*
[[59](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Preguica2010), [[59](/ch06.html#Preguica2010),
[60](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Manepalli2022)], [60](/ch06.html#Manepalli2022)],
which is used in Riak 2.0 which is used in Riak 2.0
[[61](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Cribbs2014), [[61](/ch06.html#Cribbs2014),
[62](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Brown2015)]. [62](/ch06.html#Brown2015)].
We won’t go into the details, but the way it works is quite similar to what we saw in our cart example. We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.
Like the version numbers in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single), version vectors are sent from the Like the version numbers in [Figure 6-15](/ch06.html#fig_replication_causality_single), version vectors are sent from the
database replicas to clients when values are read, and need to be sent back to the database when a database replicas to clients when values are read, and need to be sent back to the database when a
value is subsequently written. (Riak encodes the version vector as a string that it calls *causal value is subsequently written. (Riak encodes the version vector as a string that it calls *causal
context*.) The version vector allows the database to distinguish between overwrites and concurrent context*.) The version vector allows the database to distinguish between overwrites and concurrent
@ -1734,12 +1732,12 @@ siblings are merged correctly.
A *version vector* is sometimes also called a *vector clock*, even though they are not quite the A *version vector* is sometimes also called a *vector clock*, even though they are not quite the
same. The difference is subtle—please see the references for details same. The difference is subtle—please see the references for details
[[60](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Manepalli2022), [[60](/ch06.html#Manepalli2022),
[63](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Baquero2011), [63](/ch06.html#Baquero2011),
[64](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Schwarz1994)]. In brief, when [64](/ch06.html#Schwarz1994)]. In brief, when
comparing the state of replicas, version vectors are the right data structure to use. comparing the state of replicas, version vectors are the right data structure to use.
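
To illustrate how a version vector distinguishes an overwrite from a concurrent write, here is a minimal sketch of the comparison. It models a plain version vector as a dictionary from replica ID to counter. It is not the dotted version vector variant and not Riak's causal context encoding, and the function name and return labels are assumptions made for this example.

```python
def compare_version_vectors(a: dict, b: dict) -> str:
    """Compare two version vectors (replica ID -> counter).

    A replica missing from a vector is treated as having counter 0.
    """
    replicas = set(a) | set(b)
    a_covers_b = all(a.get(rep, 0) >= b.get(rep, 0) for rep in replicas)
    b_covers_a = all(b.get(rep, 0) >= a.get(rep, 0) for rep in replicas)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "a_after_b"   # b happened before a: a may overwrite b
    if b_covers_a:
        return "b_after_a"   # a happened before b: b may overwrite a
    return "concurrent"      # neither knew about the other: keep both as siblings


print(compare_version_vectors({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1}))
print(compare_version_vectors({"r1": 2}, {"r2": 1}))
```

The "concurrent" outcome corresponds to the case where the database must keep sibling values rather than silently discarding one of them.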
# Summary ## Summary
In this chapter we looked at the issue of replication. Replication can serve several purposes: In this chapter we looked at the issue of replication. Replication can serve several purposes:
@ -1816,10 +1814,10 @@ This chapter has assumed that every replica stores a full copy of the whole data
unrealistic for large datasets. In the next chapter we will look at *sharding*, which allows each unrealistic for large datasets. In the next chapter we will look at *sharding*, which allows each
machine to store only a subset of the data. machine to store only a subset of the data.
##### Footnotes
##### References
### Summary
[^1]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD) [^1]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)


@ -13,10 +13,10 @@ breadcrumbs: false
A distributed database typically distributes data across nodes in two ways: A distributed database typically distributes data across nodes in two ways:
1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in 1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in
[Chapter 6](/en/ch6#ch_replication). [Chapter 6](/en/ch6#ch_replication).
2. If we don’t want every node to store all the data, we can split up a large amount of data into 2. If we don’t want every node to store all the data, we can split up a large amount of data into
smaller *shards* or *partitions*, and store different shards on different nodes. We’ll discuss smaller *shards* or *partitions*, and store different shards on different nodes. We’ll discuss
sharding in this chapter. sharding in this chapter.
Normally, shards are defined in such a way that each piece of data (each record, row, or document) Normally, shards are defined in such a way that each piece of data (each record, row, or document)
belongs to exactly one shard. There are various ways of achieving this, which we discuss in depth in belongs to exactly one shard. There are various ways of achieving this, which we discuss in depth in
@ -51,14 +51,12 @@ Some databases treat partitions and shards as two distinct concepts. For example
partitioning is a way of splitting a large table into several files that are stored on the same partitioning is a way of splitting a large table into several files that are stored on the same
machine (which has several advantages, such as making it very fast to delete an entire partition), machine (which has several advantages, such as making it very fast to delete an entire partition),
whereas sharding splits a dataset across multiple machines whereas sharding splits a dataset across multiple machines
[[1](/en/ch7#Giordano2023), [[^1], [^2]].
[2](/en/ch7#Leach2022)].
In many other systems, partitioning is just another word for sharding. In many other systems, partitioning is just another word for sharding.
While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to
one theory, the term arose from the online role-play game *Ultima Online*, in which a magic crystal one theory, the term arose from the online role-play game *Ultima Online*, in which a magic crystal
was shattered into pieces, and each of those shards refracted a copy of the game world was shattered into pieces, and each of those shards refracted a copy of the game world [^3].
[^3].
The term *shard* thus came to mean one of a set of parallel game servers, and later was carried over The term *shard* thus came to mean one of a set of parallel game servers, and later was carried over
to databases. Another theory is that *shard* was originally an acronym of *System for Highly to databases. Another theory is that *shard* was originally an acronym of *System for Highly
Available Replicated Data*—reportedly a 1980s database, details of which are lost to history. Available Replicated Data*—reportedly a 1980s database, details of which are lost to history.
@ -87,8 +85,7 @@ single-shard database.
The reason for this recommendation is that sharding often adds complexity: you typically have to The reason for this recommendation is that sharding often adds complexity: you typically have to
decide which records to put in which shard by choosing a *partition key*; all records with the decide which records to put in which shard by choosing a *partition key*; all records with the
same partition key are placed in the same shard same partition key are placed in the same shard [^4].
[^4].
This choice matters because accessing a record is fast if you know which shard it’s in, but if you This choice matters because accessing a record is fast if you know which shard it’s in, but if you
don’t know the shard you have to do an inefficient search across all shards, and the sharding scheme don’t know the shard you have to do an inefficient search across all shards, and the sharding scheme
is difficult to change. is difficult to change.
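
As a rough illustration of how a partition key determines a shard, the sketch below hashes the key and takes the result modulo the number of shards. Hash-modulo placement is only one possible scheme and is an assumption here, not something this passage prescribes. The function name `shard_for` and the example key are hypothetical.

```python
import hashlib


def shard_for(partition_key: str, num_shards: int) -> int:
    """Map a partition key to a shard number in the range [0, num_shards)."""
    if num_shards < 1:
        raise ValueError("num_shards must be at least 1")
    # Use a stable hash: Python's built-in hash() for str varies between processes
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# All records sharing a partition key land in the same shard
print(shard_for("tenant-42", 8), shard_for("tenant-42", 8))
```

Under this simple scheme, changing `num_shards` moves most keys to a different shard, which is one reason a sharding scheme is hard to change once data has been placed.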
@ -107,11 +104,9 @@ some systems dont support them at all.
Some systems use sharding even on a single machine, typically running one single-threaded process Some systems use sharding even on a single machine, typically running one single-threaded process
per CPU core to make use of the parallelism in the CPU, or to take advantage of a *nonuniform memory per CPU core to make use of the parallelism in the CPU, or to take advantage of a *nonuniform memory
access* (NUMA) architecture in which some banks of memory are closer to one CPU than to others access* (NUMA) architecture in which some banks of memory are closer to one CPU than to others [^5].
[^5].
For example, Redis, VoltDB, and FoundationDB use one process per core, and rely on sharding to For example, Redis, VoltDB, and FoundationDB use one process per core, and rely on sharding to
spread load across CPU cores in the same machine spread load across CPU cores in the same machine [^6].
[^6].
## Sharding for Multitenancy ## Sharding for Multitenancy
@ -124,61 +119,60 @@ signups, delivery data etc. are separate from those of other businesses.
Sometimes sharding is used to implement multitenant systems: either each tenant is given a separate Sometimes sharding is used to implement multitenant systems: either each tenant is given a separate
shard, or multiple small tenants may be grouped together into a larger shard. These shards might be shard, or multiple small tenants may be grouped together into a larger shard. These shards might be
physically separate databases (which we previously touched on in [“Embedded storage engines”](/en/ch4#sidebar_embedded)), or physically separate databases (which we previously touched on in [“Embedded storage engines”](/en/ch4#sidebar_embedded)), or
separately manageable portions of a larger logical database separately manageable portions of a larger logical database [^7].
[^7].
Using sharding for multitenancy has several advantages: Using sharding for multitenancy has several advantages:
Resource isolation Resource isolation
: If one tenant performs a computationally expensive operation, it is less likely that other : If one tenant performs a computationally expensive operation, it is less likely that other
tenants’ performance will be affected if they are running on different shards. tenants’ performance will be affected if they are running on different shards.
Permission isolation Permission isolation
: If there is a bug in your access control logic, it’s less likely that you will accidentally give : If there is a bug in your access control logic, it’s less likely that you will accidentally give
one tenant access to another tenant’s data if those tenants’ datasets are stored physically one tenant access to another tenant’s data if those tenants’ datasets are stored physically
separately from each other. separately from each other.
Cell-based architecture Cell-based architecture
: You can apply sharding not only at the data storage level, but also for the services running your : You can apply sharding not only at the data storage level, but also for the services running your
application code. In a *cell-based architecture*, the services and storage for a particular set of application code. In a *cell-based architecture*, the services and storage for a particular set of
tenants are grouped into a self-contained *cell*, and different cells are set up such that they tenants are grouped into a self-contained *cell*, and different cells are set up such that they
can run largely independently from each other. This approach provides *fault isolation*: that is, can run largely independently from each other. This approach provides *fault isolation*: that is,
a fault in one cell remains limited to that cell, and tenants in other cells are not affected a fault in one cell remains limited to that cell, and tenants in other cells are not affected
[^8]. [^8].
Per-tenant backup and restore Per-tenant backup and restore
: Backing up each tenant’s shard separately makes it possible to restore a tenant’s state from a : Backing up each tenant’s shard separately makes it possible to restore a tenant’s state from a
backup without affecting other tenants, which can be useful in case the tenant accidentally backup without affecting other tenants, which can be useful in case the tenant accidentally
deletes or overwrites important data deletes or overwrites important data
[^9]. [^9].
Regulatory compliance Regulatory compliance
: Data privacy regulation such as the GDPR gives individuals the right to access and delete all data : Data privacy regulation such as the GDPR gives individuals the right to access and delete all data
stored about them. If each person’s data is stored in a separate shard, this translates into stored about them. If each person’s data is stored in a separate shard, this translates into
simple data export and deletion operations on their shard simple data export and deletion operations on their shard
[^10]. [^10].
Data residence Data residence
: If a particular tenant’s data needs to be stored in a particular jurisdiction in order to comply : If a particular tenant’s data needs to be stored in a particular jurisdiction in order to comply
with data residency laws, a region-aware database can allow you to assign that tenant’s shard to a with data residency laws, a region-aware database can allow you to assign that tenant’s shard to a
particular region. particular region.
Gradual schema rollout Gradual schema rollout
: Schema migrations (previously discussed in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)) can be rolled : Schema migrations (previously discussed in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)) can be rolled
out gradually, one tenant at a time. This reduces risk, as you can detect problems before they out gradually, one tenant at a time. This reduces risk, as you can detect problems before they
affect all tenants, but it can be difficult to do transactionally affect all tenants, but it can be difficult to do transactionally
[^11]. [^11].
The main challenges around using sharding for multitenancy are: The main challenges around using sharding for multitenancy are:
* It assumes that each individual tenant is small enough to fit on a single node. If that is not the * It assumes that each individual tenant is small enough to fit on a single node. If that is not the
case, and you have a single tenant that’s too big for one machine, you would need to additionally case, and you have a single tenant that’s too big for one machine, you would need to additionally
perform sharding within a single tenant, which brings us back to the topic of sharding for perform sharding within a single tenant, which brings us back to the topic of sharding for
scalability [^12]. scalability [^12].
* If you have many small tenants, then creating a separate shard for each one may incur too much * If you have many small tenants, then creating a separate shard for each one may incur too much
overhead. You could group several small tenants together into a bigger shard, but then you have overhead. You could group several small tenants together into a bigger shard, but then you have
the problem of how you move tenants from one shard to another as they grow. the problem of how you move tenants from one shard to another as they grow.
* If you ever need to support features that connect data across multiple tenants, these become * If you ever need to support features that connect data across multiple tenants, these become
harder to implement if you need to join data across multiple shards. harder to implement if you need to join data across multiple shards.
# Sharding of Key-Value Data # Sharding of Key-Value Data
@ -226,8 +220,7 @@ to distribute the data evenly, the shard boundaries need to adapt to the data.
The shard boundaries might be chosen manually by an administrator, or the database can choose them The shard boundaries might be chosen manually by an administrator, or the database can choose them
automatically. Manual key-range sharding is used by Vitess (a sharding layer for MySQL), for automatically. Manual key-range sharding is used by Vitess (a sharding layer for MySQL), for
example; the automatic variant is used by Bigtable, its open source equivalent HBase, the example; the automatic variant is used by Bigtable, its open source equivalent HBase, the
range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB [^6]. YugabyteDB offers both manual and automatic
[^6]. YugabyteDB offers both manual and automatic
tablet splitting. tablet splitting.
Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in
@ -241,8 +234,7 @@ A downside of key range sharding is that you can easily get a hot shard if there
lot of writes to nearby keys. For example, if the key is a timestamp, then the shards correspond to lot of writes to nearby keys. For example, if the key is a timestamp, then the shards correspond to
ranges of time—e.g., one shard per month. Unfortunately, if you write data from the sensors to the ranges of time—e.g., one shard per month. Unfortunately, if you write data from the sensors to the
database as the measurements happen, all the writes end up going to the same shard (the one for database as the measurements happen, all the writes end up going to the same shard (the one for
this month), so that shard can be overloaded with writes while others sit idle this month), so that shard can be overloaded with writes while others sit idle [^13].
[^13].
To avoid this problem in the sensor database, you need to use something other than the timestamp as To avoid this problem in the sensor database, you need to use something other than the timestamp as
the first element of the key. For example, you could prefix each timestamp with the sensor ID so the first element of the key. For example, you could prefix each timestamp with the sensor ID so
@ -256,8 +248,7 @@ need to perform a separate range query for each sensor.
When you first set up your database, there are no key ranges to split into shards. Some databases, When you first set up your database, there are no key ranges to split into shards. Some databases,
such as HBase and MongoDB, allow you to configure an initial set of shards on an empty database, such as HBase and MongoDB, allow you to configure an initial set of shards on an empty database,
which is called *pre-splitting*. This requires that you already have some idea of what the key which is called *pre-splitting*. This requires that you already have some idea of what the key
distribution is going to look like, so that you can choose appropriate key range boundaries distribution is going to look like, so that you can choose appropriate key range boundaries [^14].
[^14].
Later on, as your data volume and write throughput grow, a system with key-range sharding grows by Later on, as your data volume and write throughput grow, a system with key-range sharding grows by
splitting an existing shard into two or more smaller shards, each of which holds a contiguous splitting an existing shard into two or more smaller shards, each of which holds a contiguous
@ -270,8 +261,8 @@ With databases that manage shard boundaries automatically, a shard split is typi
* the shard reaching a configured size (for example, on HBase, the default is 10 GB), or * the shard reaching a configured size (for example, on HBase, the default is 10 GB), or
* in some systems, the write throughput being persistently above some threshold. Thus, a hot shard * in some systems, the write throughput being persistently above some threshold. Thus, a hot shard
may be split even if it is not storing a lot of data, so that its write load can be distributed may be split even if it is not storing a lot of data, so that its write load can be distributed
more uniformly. more uniformly.
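A size-triggered split can be sketched as follows. This is only an illustration under stated assumptions: the shard is modelled as a sorted list of keys, its size is counted in keys rather than bytes, and the split point is the median key. The function name `maybe_split` and the parameter `max_keys` are hypothetical and do not correspond to any particular database.

```python
def maybe_split(shard_keys, max_keys):
    """Split a shard into two contiguous key ranges once it grows too big.

    shard_keys: sorted list of the keys stored in one shard.
    max_keys:   hypothetical size threshold, counted in keys.
    Returns a list containing either the original shard or its two halves.
    """
    if len(shard_keys) <= max_keys:
        return [shard_keys]          # still small enough, no split
    mid = len(shard_keys) // 2       # assumption: split at the median key
    return [shard_keys[:mid], shard_keys[mid:]]
```

Because each half is a contiguous slice of the sorted keys, range queries remain efficient after the split; a throughput-triggered split would use the same mechanics with a different condition.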
An advantage of key-range sharding is that the number of shards adapts to the data volume. If there An advantage of key-range sharding is that the number of shards adapts to the data volume. If there
is only a small amount of data, a small number of shards is sufficient, so overheads are small; if is only a small amount of data, a small number of shards is sufficient, so overheads are small; if
@ -300,8 +291,7 @@ For sharding purposes, the hash function need not be cryptographically strong: f
uses MD5, whereas Cassandra and ScyllaDB use Murmur3. Many programming languages have simple hash uses MD5, whereas Cassandra and ScyllaDB use Murmur3. Many programming languages have simple hash
functions built in (as they are used for hash tables), but they may not be suitable for sharding: functions built in (as they are used for hash tables), but they may not be suitable for sharding:
for example, in Java’s `Object.hashCode()` and Ruby’s `Object#hash`, the same key may have a for example, in Java’s `Object.hashCode()` and Ruby’s `Object#hash`, the same key may have a
different hash value in different processes, making them unsuitable for sharding different hash value in different processes, making them unsuitable for sharding [^16].
[^16].
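Python's built-in `hash()` has the same pitfall for strings, since it is randomized per process unless `PYTHONHASHSEED` is fixed. The following is a minimal sketch of a hash that stays the same in every process, assuming MD5 from the standard library and a 64-bit truncation. The function name `stable_hash` is hypothetical.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Map a key to a 64-bit integer that is identical in every process.

    MD5 is used here only for its even distribution, not for security.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")  # keep the first 8 bytes
```

Every node and client that computes `stable_hash` on the same key gets the same value, which is the property a sharding scheme relies on.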
### Hash modulo number of nodes ### Hash modulo number of nodes
@ -411,16 +401,14 @@ cluster keys for a table. Delta Lake supports both manual and automatic partitio
supports cluster keys. Clustering data not only improves range scan performance, but can supports cluster keys. Clustering data not only improves range scan performance, but can
improve compression and filtering performance as well. improve compression and filtering performance as well.
Hash-range sharding is used in YugabyteDB and DynamoDB Hash-range sharding is used in YugabyteDB and DynamoDB [^17], and is an option in MongoDB.
[^17], and is an option in MongoDB.
Cassandra and ScyllaDB use a variant of this approach that is illustrated in Cassandra and ScyllaDB use a variant of this approach that is illustrated in
[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional [Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8 to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between
those ranges. This means some ranges are bigger than others, but by having multiple ranges per node those ranges. This means some ranges are bigger than others, but by having multiple ranges per node
those imbalances tend to even out those imbalances tend to even out
[[15](/en/ch7#Evans2013), [[^15], [^18]].
[18](/en/ch7#Williams2012)].
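The idea of several randomly placed ranges per node can be sketched as below. This is a simplified illustration, not the actual Cassandra or ScyllaDB implementation: the 32-bit hash space, the use of MD5, the seeded random generator, and the names `build_ring` and `node_for` are all assumptions made for the example.

```python
import bisect
import hashlib
import random

HASH_SPACE = 2**32  # assumption: a small hash space for illustration


def key_hash(key: str) -> int:
    """Stable hash of a key into the range [0, HASH_SPACE)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def build_ring(nodes, ranges_per_node, seed=0):
    """Give each node several random range boundaries in the hash space."""
    rng = random.Random(seed)
    return sorted(
        (rng.randrange(HASH_SPACE), node)
        for node in nodes
        for _ in range(ranges_per_node)
    )


def node_for(ring, key: str) -> str:
    """Find the node whose range contains the key's hash.

    Each boundary closes the range that ends at it, so the owner is the
    first boundary >= the hash, wrapping around past the last boundary.
    """
    boundaries = [boundary for boundary, _ in ring]
    i = bisect.bisect_left(boundaries, key_hash(key)) % len(ring)
    return ring[i][1]
```

With a single range per node, one node could easily end up with a much larger share of the hash space than another; with several ranges per node the shares tend to average out. Adding a node only inserts new boundaries, so keys either stay where they are or move to the new node.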
![ddia 0706](/fig/ddia_0706.png) ![ddia 0706](/fig/ddia_0706.png)
@ -446,10 +434,8 @@ ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describ
the same shard as much as possible. the same shard as much as possible.
The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of
consistent hashing consistent hashing [^20],
[^20], but several other consistent hashing algorithms have also been proposed [^21],
but several other consistent hashing algorithms have also been proposed
[^21],
such as *highest random weight*, also known as *rendezvous hashing* such as *highest random weight*, also known as *rendezvous hashing*
[^22], [^22],
and *jump consistent hash* and *jump consistent hash*
@ -473,11 +459,9 @@ This event can result in a large volume of reads and writes to the same key (whe
is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on). is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on).
In such situations, a more flexible sharding policy is required In such situations, a more flexible sharding policy is required
[[25](/en/ch7#Guo2020), [[^25], [^26]].
[26](/en/ch7#Lee2021)].
A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put
an individual hot key in a shard of its own, and perhaps even to assign it a dedicated machine an individual hot key in a shard of its own, and perhaps even to assign it a dedicated machine [^27].
[^27].
It’s also possible to compensate for skew at the application level. For example, if one key is known It’s also possible to compensate for skew at the application level. For example, if one key is known
to be very hot, a simple technique is to add a random number to the beginning or end of the key. to be very hot, a simple technique is to add a random number to the beginning or end of the key.
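A minimal sketch of this technique follows, assuming a two-digit suffix separated by `#`. The constant `NUM_SALTS` and both function names are hypothetical choices made for the example.

```python
import random

NUM_SALTS = 100  # assumption: a two-digit suffix gives 100 sub-keys


def salted_write_key(hot_key: str, rng=random) -> str:
    """Pick one of the sub-keys at random for each write."""
    return f"{hot_key}#{rng.randrange(NUM_SALTS):02d}"


def all_read_keys(hot_key: str) -> list:
    """A read has to fetch every sub-key and combine the results."""
    return [f"{hot_key}#{i:02d}" for i in range(NUM_SALTS)]
```

The writes are now spread over `NUM_SALTS` different keys, which can land on different shards. The cost moves to the read path, which must query all of the sub-keys, so this is only worthwhile for the small number of keys that are known to be hot.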
@ -518,16 +502,14 @@ Fully automated rebalancing can be convenient, because there is less operational
normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud
databases such as DynamoDB are promoted as being able to automatically add and remove shards to databases such as DynamoDB are promoted as being able to automatically add and remove shards to
adapt to big increases or decreases of load within a matter of minutes adapt to big increases or decreases of load within a matter of minutes
[[17](/en/ch7#Elhemali2022_ch7), [[^17], [^29]].
[29](/en/ch7#Houlihan2017)].
However, automatic shard management can also be unpredictable. Rebalancing is an expensive However, automatic shard management can also be unpredictable. Rebalancing is an expensive
operation, because it requires rerouting requests and moving a large amount of data from one node to operation, because it requires rerouting requests and moving a large amount of data from one node to
another. If it is not done carefully, this process can overload the network or the nodes, and it another. If it is not done carefully, this process can overload the network or the nodes, and it
might harm the performance of other requests. The system must continue processing writes while the might harm the performance of other requests. The system must continue processing writes while the
rebalancing is in progress; if a system is near its maximum write throughput, the shard-splitting rebalancing is in progress; if a system is near its maximum write throughput, the shard-splitting
process might not even be able to keep up with the rate of incoming writes process might not even be able to keep up with the rate of incoming writes [^29].
[^29].
Such automation can be dangerous in combination with automatic failure detection. For example, say Such automation can be dangerous in combination with automatic failure detection. For example, say
one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that
@ -557,14 +539,14 @@ shards to nodes. On a high level, there are a few different approaches to this p
in [Figure 7-7](/en/ch7#fig_sharding_routing)): in [Figure 7-7](/en/ch7#fig_sharding_routing)):
1. Allow clients to contact any node (e.g., via a round-robin load balancer). If that node 1. Allow clients to contact any node (e.g., via a round-robin load balancer). If that node
coincidentally owns the shard to which the request applies, it can handle the request directly; coincidentally owns the shard to which the request applies, it can handle the request directly;
otherwise, it forwards the request to the appropriate node, receives the reply, and passes the otherwise, it forwards the request to the appropriate node, receives the reply, and passes the
reply along to the client. reply along to the client.
2. Send all requests from clients to a routing tier first, which determines the node that should 2. Send all requests from clients to a routing tier first, which determines the node that should
handle each request and forwards it accordingly. This routing tier does not itself handle any handle each request and forwards it accordingly. This routing tier does not itself handle any
requests; it only acts as a shard-aware load balancer. requests; it only acts as a shard-aware load balancer.
3. Require that clients be aware of the sharding and the assignment of shards to nodes. In this 3. Require that clients be aware of the sharding and the assignment of shards to nodes. In this
case, a client can connect directly to the appropriate node, without any intermediary. case, a client can connect directly to the appropriate node, without any intermediary.
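The second approach can be sketched as a small shard-aware router. This is an illustration only, assuming key-range sharding with string keys; the class `RoutingTier`, its methods, and the node names in the usage example are hypothetical.

```python
import bisect


class RoutingTier:
    """Shard-aware router: maps a key to the node that owns its shard."""

    def __init__(self, boundaries, shard_to_node):
        # boundaries[i] is the smallest key that belongs to shard i + 1,
        # so n boundaries define n + 1 key-range shards.
        self.boundaries = list(boundaries)
        self.shard_to_node = dict(shard_to_node)

    def shard_for(self, key: str) -> int:
        # The number of boundaries <= key is the shard index.
        return bisect.bisect_right(self.boundaries, key)

    def route(self, key: str) -> str:
        return self.shard_to_node[self.shard_for(key)]

    def reassign(self, shard: int, node: str) -> None:
        """Record that a shard has moved to a different node."""
        self.shard_to_node[shard] = node
```

For example, `RoutingTier(["h", "q"], {0: "node-a", 1: "node-b", 2: "node-a"})` sends keys below `"h"` to `node-a` and keys from `"h"` up to (but not including) `"q"` to `node-b`. The same lookup could run inside a node (approach 1) or inside the client (approach 3); what differs is which component has to be told when `reassign` happens.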
![ddia 0707](/fig/ddia_0707.png) ![ddia 0707](/fig/ddia_0707.png)
@ -573,15 +555,15 @@ in [Figure 7-7](/en/ch7#fig_sharding_routing)):
In all cases, there are some key problems: In all cases, there are some key problems:
* Who decides which shard should live on which node? It’s simplest to have a single coordinator * Who decides which shard should live on which node? It’s simplest to have a single coordinator
making that decision, but in that case how do you make it fault-tolerant in case the node running making that decision, but in that case how do you make it fault-tolerant in case the node running
the coordinator goes down? And if the coordinator role can fail over to another node, how do you the coordinator goes down? And if the coordinator role can fail over to another node, how do you
prevent a split-brain situation (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)) where two different prevent a split-brain situation (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)) where two different
coordinators make contradictory shard assignments? coordinators make contradictory shard assignments?
* How does the component performing the routing (which may be one of the nodes, or the routing tier, * How does the component performing the routing (which may be one of the nodes, or the routing tier,
or the client) learn about changes in the assignment of shards to nodes? or the client) learn about changes in the assignment of shards to nodes?
* While a shard is being moved from one node to another, there is a cutover period during which the * While a shard is being moved from one node to another, there is a cutover period during which the
new node has taken over, but requests to the old node may still be in flight. How do you handle new node has taken over, but requests to the old node may still be in flight. How do you handle
those? those?
Many distributed data systems rely on a separate coordination service such as ZooKeeper or etcd to Many distributed data systems rely on a separate coordination service such as ZooKeeper or etcd to
keep track of shard assignments, as illustrated in [Figure 7-8](/en/ch7#fig_sharding_zookeeper). They use consensus keep track of shard assignments, as illustrated in [Figure 7-8](/en/ch7#fig_sharding_zookeeper). They use consensus
@ -684,8 +666,7 @@ expensive. Even if you query the shards in parallel, it is prone to tail latency
shards lets you store more data, but it doesn’t increase your query throughput if every shard has to shards lets you store more data, but it doesn’t increase your query throughput if every shard has to
process every query anyway. process every query anyway.
Nevertheless, local secondary indexes are widely used Nevertheless, local secondary indexes are widely used [^31]:
[^31]:
for example, MongoDB, Riak, Cassandra [^32], for example, MongoDB, Riak, Cassandra [^32],
Elasticsearch [^33], SolrCloud, Elasticsearch [^33], SolrCloud,
and VoltDB [^34] and VoltDB [^34]
@ -742,7 +723,7 @@ indexes, so reads from a global index may be stale (similarly to replication lag
Nevertheless, global indexes are useful if read throughput is higher than write throughput, and if Nevertheless, global indexes are useful if read throughput is higher than write throughput, and if
the postings lists are not too long. the postings lists are not too long.
# Summary ## Summary
In this chapter we explored different ways of sharding a large dataset into smaller subsets. In this chapter we explored different ways of sharding a large dataset into smaller subsets.
Sharding is necessary when you have so much data that storing and processing it on a single machine Sharding is necessary when you have so much data that storing and processing it on a single machine
@ -756,20 +737,20 @@ cluster.
We discussed two main approaches to sharding: We discussed two main approaches to sharding:
* *Key range sharding*, where keys are sorted, and a shard owns all the keys from some minimum up to * *Key range sharding*, where keys are sorted, and a shard owns all the keys from some minimum up to
some maximum. Sorting has the advantage that efficient range queries are possible, but there is a some maximum. Sorting has the advantage that efficient range queries are possible, but there is a
risk of hot spots if the application often accesses keys that are close together in the sorted risk of hot spots if the application often accesses keys that are close together in the sorted
order. order.
In this approach, shards are typically rebalanced by splitting the range into two subranges when a In this approach, shards are typically rebalanced by splitting the range into two subranges when a
shard gets too big. shard gets too big.
* *Hash sharding*, where a hash function is applied to each key, and a shard owns a range of hash * *Hash sharding*, where a hash function is applied to each key, and a shard owns a range of hash
values (or another consistent hashing algorithm may be used to map hashes to shards). This method values (or another consistent hashing algorithm may be used to map hashes to shards). This method
destroys the ordering of keys, making range queries inefficient, but it may distribute load more destroys the ordering of keys, making range queries inefficient, but it may distribute load more
evenly. evenly.
When sharding by hash, it is common to create a fixed number of shards in advance, to assign several When sharding by hash, it is common to create a fixed number of shards in advance, to assign several
shards to each node, and to move entire shards from one node to another when nodes are added or shards to each node, and to move entire shards from one node to another when nodes are added or
removed. Splitting shards, like with key ranges, is also possible. removed. Splitting shards, like with key ranges, is also possible.
It is common to use the first part of the key as the partition key (i.e., to identify the shard), It is common to use the first part of the key as the partition key (i.e., to identify the shard),
and to sort records within that shard by the rest of the key. That way you can still have efficient and to sort records within that shard by the rest of the key. That way you can still have efficient
@ -779,13 +760,13 @@ We also discussed the interaction between sharding and secondary indexes. A seco
needs to be sharded, and there are two methods: needs to be sharded, and there are two methods:
* *Local secondary indexes*, where the secondary indexes are stored * *Local secondary indexes*, where the secondary indexes are stored
in the same shard as the primary key and value. This means that only a single shard needs to be in the same shard as the primary key and value. This means that only a single shard needs to be
updated on write, but a lookup of the secondary index requires reading from all shards. updated on write, but a lookup of the secondary index requires reading from all shards.
* *Global secondary indexes*, which are sharded separately based on * *Global secondary indexes*, which are sharded separately based on
the indexed values. An entry in the secondary index may refer to records from all shards of the the indexed values. An entry in the secondary index may refer to records from all shards of the
primary key. When a record is written, several secondary index shards may need to be updated; primary key. When a record is written, several secondary index shards may need to be updated;
however, a read of the postings list can be served from a single shard (fetching the actual however, a read of the postings list can be served from a single shard (fetching the actual
records still requires reading from multiple shards). records still requires reading from multiple shards).
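The difference between the two methods can be seen in a toy sketch. Everything here is an assumption made for illustration: three in-memory shards, a byte-sum placement function, and an index on a hypothetical `color` field.

```python
from collections import defaultdict

NUM_SHARDS = 3  # assumption: a tiny cluster for illustration


def shard_of(value) -> int:
    """Toy placement function that is stable across processes."""
    return sum(str(value).encode("utf-8")) % NUM_SHARDS


# One index per shard in both cases; what differs is how entries are placed.
local_index = [defaultdict(set) for _ in range(NUM_SHARDS)]
global_index = [defaultdict(set) for _ in range(NUM_SHARDS)]


def write(record_id, color):
    # Local index: the entry lives in the same shard as the record itself.
    local_index[shard_of(record_id)][color].add(record_id)
    # Global index: the entry lives in the shard chosen by the indexed value.
    global_index[shard_of(color)][color].add(record_id)


def query_local(color):
    """Scatter/gather: every shard has to be asked."""
    result = set()
    for shard in local_index:
        result |= shard.get(color, set())
    return result


def query_global(color):
    """Only the index shard responsible for this value is asked."""
    return set(global_index[shard_of(color)].get(color, set()))
```

Both queries return the same record IDs, but `query_local` touches every shard while `query_global` touches one. In a real system the write to the global index may go to a different node than the record, which is where the extra write cost comes from.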
Finally, we discussed techniques for routing queries to the appropriate shard, and how a Finally, we discussed techniques for routing queries to the appropriate shard, and how a
coordination service is often used to keep track of the assignment of shards to nodes. coordination service is often used to keep track of the assignment of shards to nodes.
@ -795,10 +776,10 @@ to multiple machines. However, operations that need to write to several shards c
for example, what happens if the write to one shard succeeds, but another fails? We will address for example, what happens if the write to one shard succeeds, but another fails? We will address
that question in the following chapters. that question in the following chapters.
##### Footnotes
##### References
### Summary
[^1]: Claire Giordano. [Understanding partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023. Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959) [^1]: Claire Giordano. [Understanding partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023. Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959)

File diff suppressed because it is too large

File diff suppressed because it is too large

View file

@ -105,7 +105,7 @@ Later, in Part III of this book, we will discuss how you can take several (poten
- [9. The Trouble with Distributed Systems](/en/ch9) - [9. The Trouble with Distributed Systems](/en/ch9)
- [10. Consistency and Consensus](/en/ch10) - [10. Consistency and Consensus](/en/ch10)
## References ### References
1. Ulrich Drepper: “[What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf),” akkadia.org, November 21, 2007. 1. Ulrich Drepper: “[What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf),” akkadia.org, November 21, 2007.
1. Ben Stopford: “[Shared Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/),” benstopford.com, November 24, 2009. 1. Ben Stopford: “[Shared Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/),” benstopford.com, November 24, 2009.