mirror of https://github.com/Vonng/ddia.git
synced 2026-06-25 02:46:51 +08:00
fix reference summary
parent 752c2f58c7
commit 4ec385f161
14 changed files with 2811 additions and 3255 deletions
@ -252,9 +252,7 @@ the data warehouse. This process of getting data into the data warehouse is know
|
||||||
*transform* and *load* steps is swapped (i.e., the transformation is done in the data warehouse,
|
*transform* and *load* steps is swapped (i.e., the transformation is done in the data warehouse,
|
||||||
after loading), resulting in *ELT*.
|
after loading), resulting in *ELT*.
|
||||||
|
|
||||||

|
{{< figure src="/fig/ddia_0101.png" id="fig_dwh_etl" title="Figure 1-1. Simplified outline of ETL into a data warehouse." class="w-full my-4" >}}
|
||||||
|
|
||||||
###### Figure 1-1. Simplified outline of ETL into a data warehouse.
|
|
||||||
|
|
||||||
In some cases the data sources of the ETL processes are external SaaS products such as customer
|
In some cases the data sources of the ETL processes are external SaaS products such as customer
|
||||||
relationship management (CRM), email marketing, or credit card processing systems. In those cases,
|
relationship management (CRM), email marketing, or credit card processing systems. In those cases,
|
||||||
|
|
@ -428,9 +426,10 @@ the other extreme are widely-used cloud services or Software as a Service (SaaS)
|
||||||
implemented and operated by an external vendor, and which you only access through a web interface or
|
implemented and operated by an external vendor, and which you only access through a web interface or
|
||||||
API.
|
API.
|
||||||
|
|
||||||

|
|
||||||
|
|
||||||
###### Figure 1-2. A spectrum of types of software and its operations.
|
{{< figure src="/fig/ddia_0102.png" id="fig_cloud_spectrum" title="Figure 1-2. A spectrum of types of software and its operations." class="w-full my-4" >}}
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
The middle ground is off-the-shelf software (open source or commercial) that you *self-host*, i.e.,
|
The middle ground is off-the-shelf software (open source or commercial) that you *self-host*, i.e.,
|
||||||
deploy yourself—for example, if you download MySQL and install it on a server you control. This
|
deploy yourself—for example, if you download MySQL and install it on a server you control. This
|
||||||
|
|
@ -962,7 +961,7 @@ whose data you are collecting and processing. There is much more to this topic;
|
||||||
will go deeper into the topics of ethics and legal compliance, including the problems of bias and
|
will go deeper into the topics of ethics and legal compliance, including the problems of bias and
|
||||||
discrimination.
|
discrimination.
|
||||||
|
|
||||||
# Summary
|
## Summary
|
||||||
|
|
||||||
The theme of this chapter has been to understand trade-offs: that is, to recognize that for many
|
The theme of this chapter has been to understand trade-offs: that is, to recognize that for many
|
||||||
questions there is not one right answer, but several different approaches that each have various
|
questions there is not one right answer, but several different approaches that each have various
|
||||||
|
|
@ -994,9 +993,7 @@ data is being processed—an aspect that many engineers are prone to ignoring. H
|
||||||
requirements into technical implementations is not yet well understood, but it’s important to keep
|
requirements into technical implementations is not yet well understood, but it’s important to keep
|
||||||
this question in mind as we move through the rest of this book.
|
this question in mind as we move through the rest of this book.
|
||||||
|
|
||||||
## Footnotes
|
### References
|
||||||
|
|
||||||
## References
|
|
||||||
|
|
||||||
[^1]: Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. [The Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26)
|
[^1]: Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. [The Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26)
|
||||||
[^2]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
|
[^2]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
|
||||||
|
|
|
||||||
|
|
@ -90,8 +90,7 @@ guarantee*. To clarify this idea, let’s look at an example of a system that is
|
||||||
|
|
||||||
###### Figure 10-1. This system is not linearizable, causing sports fans to be confused.
|
###### Figure 10-1. This system is not linearizable, causing sports fans to be confused.
|
||||||
|
|
||||||
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website
|
[Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website [^4].
|
||||||
[^4].
|
|
||||||
Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a
|
Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a
|
||||||
game their favorite team is playing. Just after the final score is announced, Aaliyah refreshes the
|
game their favorite team is playing. Just after the final score is announced, Aaliyah refreshes the
|
||||||
page, sees the winner announced, and excitedly tells Bryce about it. Bryce incredulously hits
|
page, sees the winner announced, and excitedly tells Bryce about it. Bryce incredulously hits
|
||||||
|
|
@ -216,17 +215,14 @@ There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_
|
||||||
so B is not allowed to read an older value than A. Again, it’s the same situation as with Aaliyah
|
so B is not allowed to read an older value than A. Again, it’s the same situation as with Aaliyah
|
||||||
and Bryce in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0).
|
and Bryce in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0).
|
||||||
|
|
||||||
That is the intuition behind linearizability; the formal definition
|
That is the intuition behind linearizability; the formal definition [^1] describes it more precisely. It is
|
||||||
[^1] describes it more precisely. It is
|
|
||||||
possible (though computationally expensive) to test whether a system’s behavior is linearizable by
|
possible (though computationally expensive) to test whether a system’s behavior is linearizable by
|
||||||
recording the timings of all requests and responses, and checking whether they can be arranged into
|
recording the timings of all requests and responses, and checking whether they can be arranged into
|
||||||
a valid sequential order [[6](/en/ch10#Kingsbury2014knossos),
|
a valid sequential order [[^6], [^7]].
|
||||||
[7](/en/ch10#Kingsbury2020elle)].
|
|
||||||
|
|
||||||
Just as there are various weak isolation levels for transactions besides serializability (see
|
Just as there are various weak isolation levels for transactions besides serializability (see
|
||||||
[“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels)), there are also various weaker consistency models for
|
[“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels)), there are also various weaker consistency models for
|
||||||
replicated systems besides linearizability
|
replicated systems besides linearizability [^8].
|
||||||
[^8].
|
|
||||||
In fact, the *read-after-write*, *monotonic reads*, and *consistent prefix reads* properties we saw
|
In fact, the *read-after-write*, *monotonic reads*, and *consistent prefix reads* properties we saw
|
||||||
in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag) are examples of such weaker consistency models. Linearizability
|
in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag) are examples of such weaker consistency models. Linearizability
|
||||||
guarantees all these weaker properties, and more. In this chapter we will focus on linearizability,
|
guarantees all these weaker properties, and more. In this chapter we will focus on linearizability,
|
||||||
|
|
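The hunk above mentions that linearizability can be tested by recording the timings of all requests and responses and checking whether they can be arranged into a valid sequential order. As a rough illustration only, here is a brute-force sketch for a single register. The names `Op` and `is_linearizable` are hypothetical, the search is exponential in the number of operations, and this is not how the cited checkers are implemented.

```python
from itertools import permutations
from typing import NamedTuple, Optional


class Op(NamedTuple):
    start: float          # time the request was sent
    end: float            # time the response was received
    kind: str             # "read" or "write"
    value: Optional[int]  # value written, or value the read returned


def is_linearizable(history: list[Op], initial: Optional[int] = None) -> bool:
    """Try every ordering of the operations (only feasible for tiny histories)."""
    for order in permutations(history):
        # Real-time constraint: an operation that finished before another one
        # started must not be placed after it in the sequential order.
        respects_time = all(
            not (order[j].end < order[i].start)
            for i in range(len(order))
            for j in range(i + 1, len(order))
        )
        if not respects_time:
            continue
        # Replay the order against a single register and check every read.
        register = initial
        valid = True
        for op in order:
            if op.kind == "write":
                register = op.value
            elif op.value != register:
                valid = False
                break
        if valid:
            return True
    return False
```

With a write of `1` spanning times 0 to 10, a read returning `1` during 2 to 4, and a later read returning the initial `0` during 5 to 7, no valid order exists. That is the Aaliyah and Bryce situation from Figure 10-1.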
@ -255,24 +251,20 @@ Linearizability
|
||||||
Serializability does not have that requirement: for example, stale reads are allowed by
|
Serializability does not have that requirement: for example, stale reads are allowed by
|
||||||
serializability [^10].
|
serializability [^10].
|
||||||
|
|
||||||
(*Sequential consistency* is something else again
|
(*Sequential consistency* is something else again [^8], but we won’t discuss it here.)
|
||||||
[^8], but we won’t discuss it here.)
|
|
||||||
|
|
||||||
A database may provide both serializability and linearizability, and this combination is known as
|
A database may provide both serializability and linearizability, and this combination is known as
|
||||||
*strict serializability* or *strong one-copy serializability* (*strong-1SR*)
|
*strict serializability* or *strong one-copy serializability* (*strong-1SR*)
|
||||||
[[11](/en/ch10#Bailis2014virtues_ch10),
|
[[^11], [^12]].
|
||||||
[12](/en/ch10#Bernstein1987_ch10)].
|
|
||||||
Single-node databases are typically linearizable. With distributed databases using optimistic
|
Single-node databases are typically linearizable. With distributed databases using optimistic
|
||||||
methods like serializable snapshot isolation (see [“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)) the situation is more
|
methods like serializable snapshot isolation (see [“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)) the situation is more
|
||||||
complicated: for example, CockroachDB provides serializability, and some recency guarantees on
|
complicated: for example, CockroachDB provides serializability, and some recency guarantees on
|
||||||
reads, but not strict serializability [^13]
|
reads, but not strict serializability [^13]
|
||||||
because this would require expensive coordination between transactions
|
because this would require expensive coordination between transactions [^14].
|
||||||
[^14].
|
|
||||||
|
|
||||||
It is also possible to combine a weaker isolation level with linearizability, or a weaker
|
It is also possible to combine a weaker isolation level with linearizability, or a weaker
|
||||||
consistency model with serializability; in fact, consistency model and isolation level can be chosen
|
consistency model with serializability; in fact, consistency model and isolation level can be chosen
|
||||||
largely independently from each other [[15](/en/ch10#Darnell2022),
|
largely independently from each other [[^15], [^16]].
|
||||||
[16](/en/ch10#Abadi2019consistency)].
|
|
||||||
|
|
||||||
## Relying on Linearizability
|
## Relying on Linearizability
|
||||||
|
|
||||||
|
|
@ -285,13 +277,11 @@ requirement for making a system work correctly.
|
||||||
|
|
||||||
A system that uses single-leader replication needs to ensure that there is indeed only one leader,
|
A system that uses single-leader replication needs to ensure that there is indeed only one leader,
|
||||||
not several (split brain). One way of electing a leader is to use a lease: every node that starts up
|
not several (split brain). One way of electing a leader is to use a lease: every node that starts up
|
||||||
tries to acquire the lease, and the one that succeeds becomes the leader
|
tries to acquire the lease, and the one that succeeds becomes the leader [^17].
|
||||||
[^17].
|
|
||||||
No matter how this mechanism is implemented, it must be linearizable: it should not be possible for
|
No matter how this mechanism is implemented, it must be linearizable: it should not be possible for
|
||||||
two different nodes to acquire the lease at the same time.
|
two different nodes to acquire the lease at the same time.
|
||||||
|
|
||||||
Coordination services like Apache ZooKeeper
|
Coordination services like Apache ZooKeeper [^18]
|
||||||
[^18]
|
|
||||||
and etcd are often used to implement distributed leases and leader election. They use consensus
|
and etcd are often used to implement distributed leases and leader election. They use consensus
|
||||||
algorithms to implement linearizable operations in a fault-tolerant way (we discuss such algorithms
|
algorithms to implement linearizable operations in a fault-tolerant way (we discuss such algorithms
|
||||||
later in this chapter). There are still many subtle details to implementing leases and leader
|
later in this chapter). There are still many subtle details to implementing leases and leader
|
||||||
|
|
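To make the lease requirement in the hunk above concrete, here is a minimal single-process sketch. `LeaseRegister` and `try_acquire` are hypothetical names, and a lock-protected in-memory object merely stands in for the linearizable storage that a coordination service would provide through consensus. It ignores the subtle details the text alludes to, such as clock issues and process pauses.

```python
import threading
from typing import Optional


class LeaseRegister:
    """Single-node stand-in for a linearizable register holding the lease owner."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._owner: Optional[str] = None
        self._expires_at = 0.0

    def try_acquire(self, node: str, now: float, ttl: float) -> bool:
        # The check and the update happen atomically under one lock, so the
        # operation behaves like a compare-and-set on a single copy of the data:
        # two nodes can never both see "no owner" and both succeed.
        with self._lock:
            free = self._owner is None or now >= self._expires_at
            if free or self._owner == node:
                self._owner = node
                self._expires_at = now + ttl
                return True
            return False
```

The caller passes `now` explicitly so the behaviour is deterministic in this sketch. If node `a` acquires the lease at time 0 with a TTL of 10, an attempt by node `b` at time 1 fails, and an attempt at time 10 succeeds because the lease has expired.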
@ -305,8 +295,7 @@ linearizable storage service is the basic foundation for these coordination task
|
||||||
> etcd since version 3 provides linearizable reads by default.
|
> etcd since version 3 provides linearizable reads by default.
|
||||||
|
|
||||||
Distributed locking is also used at a much more granular level in some distributed databases, such as
|
Distributed locking is also used at a much more granular level in some distributed databases, such as
|
||||||
Oracle Real Application Clusters (RAC)
|
Oracle Real Application Clusters (RAC) [^19].
|
||||||
[^19].
|
|
||||||
RAC uses a lock per disk page, with multiple nodes sharing access
|
RAC uses a lock per disk page, with multiple nodes sharing access
|
||||||
to the same disk storage system. Since these linearizable locks are on the critical path of
|
to the same disk storage system. Since these linearizable locks are on the critical path of
|
||||||
transaction execution, RAC deployments usually have a dedicated cluster interconnect network for
|
transaction execution, RAC deployments usually have a dedicated cluster interconnect network for
|
||||||
|
|
@ -338,8 +327,7 @@ loosely interpreted constraints in [Link to Come].
|
||||||
|
|
||||||
However, a hard uniqueness constraint, such as the one you typically find in relational databases,
|
However, a hard uniqueness constraint, such as the one you typically find in relational databases,
|
||||||
requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints,
|
requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints,
|
||||||
can be implemented without linearizability
|
can be implemented without linearizability [^20].
|
||||||
[^20].
|
|
||||||
|
|
||||||
### Cross-channel timing dependencies
|
### Cross-channel timing dependencies
|
||||||
|
|
||||||
|
|
@ -469,20 +457,16 @@ returns the new value. (It’s once again the Aaliyah and Bryce situation from
|
||||||
|
|
||||||
It is possible to make Dynamo-style quorums linearizable at the cost of reduced
|
It is possible to make Dynamo-style quorums linearizable at the cost of reduced
|
||||||
performance: a reader must perform read repair (see [“Catching up on missed writes”](/en/ch6#sec_replication_read_repair)) synchronously,
|
performance: a reader must perform read repair (see [“Catching up on missed writes”](/en/ch6#sec_replication_read_repair)) synchronously,
|
||||||
before returning results to the application
|
before returning results to the application [^24].
|
||||||
[^24].
|
|
||||||
Moreover, before writing, a writer must read the latest state of a quorum of nodes to fetch the
|
Moreover, before writing, a writer must read the latest state of a quorum of nodes to fetch the
|
||||||
latest timestamp of any prior write, and ensure that the new write has a greater timestamp
|
latest timestamp of any prior write, and ensure that the new write has a greater timestamp
|
||||||
[[25](/en/ch10#Lynch1997),
|
[[^25], [^26]].
|
||||||
[26](/en/ch10#Cachin2011)].
|
|
||||||
However, Riak does not perform synchronous read repair due to the performance penalty.
|
However, Riak does not perform synchronous read repair due to the performance penalty.
|
||||||
Cassandra does wait for read repair to complete on quorum reads
|
Cassandra does wait for read repair to complete on quorum reads [^27],
|
||||||
[^27],
|
|
||||||
but it loses linearizability due to its use of time-of-day clocks for timestamps.
|
but it loses linearizability due to its use of time-of-day clocks for timestamps.
|
||||||
|
|
||||||
Moreover, only linearizable read and write operations can be implemented in this way; a
|
Moreover, only linearizable read and write operations can be implemented in this way; a
|
||||||
linearizable compare-and-set operation cannot, because it requires a consensus algorithm
|
linearizable compare-and-set operation cannot, because it requires a consensus algorithm [^28].
|
||||||
[^28].
|
|
||||||
|
|
||||||
In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not
|
In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not
|
||||||
provide linearizability, even with quorum reads and writes.
|
provide linearizability, even with quorum reads and writes.
|
||||||
|
|
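The two rules in the hunk above (a reader repairs synchronously before returning, and a writer first fetches the latest timestamp from a quorum) can be sketched as follows. This is an in-memory, single-threaded illustration under stated assumptions: `Replica`, `latest`, `write`, and `read` are hypothetical names, each quorum is passed in as a plain list of replicas, and timestamps are `(counter, writer id)` pairs rather than time-of-day clocks.

```python
from typing import Optional

Timestamp = tuple[int, str]                # (counter, writer id); writer id breaks ties
Versioned = tuple[Timestamp, Optional[str]]


class Replica:
    def __init__(self) -> None:
        self.stored: Versioned = ((0, ""), None)

    def get(self) -> Versioned:
        return self.stored

    def put(self, versioned: Versioned) -> None:
        # Keep only the value with the greatest timestamp.
        if versioned[0] > self.stored[0]:
            self.stored = versioned


def latest(quorum: list[Replica]) -> Versioned:
    return max((r.get() for r in quorum), key=lambda v: v[0])


def write(read_quorum: list[Replica], write_quorum: list[Replica],
          writer_id: str, value: str) -> None:
    # Phase 1: learn the greatest timestamp of any prior write.
    (counter, _), _ = latest(read_quorum)
    # Phase 2: store the value under a strictly greater timestamp.
    new: Versioned = ((counter + 1, writer_id), value)
    for replica in write_quorum:
        replica.put(new)


def read(read_quorum: list[Replica], repair_quorum: list[Replica]) -> Optional[str]:
    newest = latest(read_quorum)
    # Synchronous read repair: push the newest value to a quorum *before*
    # returning, so a later read cannot observe an older value.
    for replica in repair_quorum:
        replica.put(newest)
    return newest[1]
```

The extra round trip in both `write` and `read` is where the reduced performance comes from. As the text notes, this only gives linearizable reads and writes; it does not provide compare-and-set.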
@ -545,31 +529,23 @@ The trade-off is as follows:
|
||||||
|
|
||||||
Thus, applications that don’t require linearizability can be more tolerant of network problems. This
|
Thus, applications that don’t require linearizability can be more tolerant of network problems. This
|
||||||
insight is popularly known as the *CAP theorem*
|
insight is popularly known as the *CAP theorem*
|
||||||
[[29](/en/ch10#Fox1999),
|
[[^29], [^30], [^31], [^32]],
|
||||||
[30](/en/ch10#Gilbert2002),
|
|
||||||
[31](/en/ch10#Gilbert2012),
|
|
||||||
[32](/en/ch10#Brewer2012rules)],
|
|
||||||
named by Eric Brewer in 2000, although the trade-off had been known to designers of
|
named by Eric Brewer in 2000, although the trade-off had been known to designers of
|
||||||
distributed databases since the 1970s
|
distributed databases since the 1970s
|
||||||
[[33](/en/ch10#Davidson1985),
|
[[^33], [^34], [^35]].
|
||||||
[34](/en/ch10#Johnson1975),
|
|
||||||
[35](/en/ch10#Fischer1982)].
|
|
||||||
|
|
||||||
CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of
|
CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of
|
||||||
starting a discussion about trade-offs in databases. At the time, many distributed databases
|
starting a discussion about trade-offs in databases. At the time, many distributed databases
|
||||||
focused on providing linearizable semantics on a cluster of machines with shared storage
|
focused on providing linearizable semantics on a cluster of machines with shared storage [^19], and CAP encouraged database engineers
|
||||||
[^19], and CAP encouraged database engineers
|
|
||||||
to explore a wider design space of distributed shared-nothing systems, which were more suitable for
|
to explore a wider design space of distributed shared-nothing systems, which were more suitable for
|
||||||
implementing large-scale web services
|
implementing large-scale web services [^36].
|
||||||
[^36].
|
|
||||||
CAP deserves credit for this culture shift—it helped trigger the NoSQL movement, a burst of new
|
CAP deserves credit for this culture shift—it helped trigger the NoSQL movement, a burst of new
|
||||||
database technologies around the mid-2000s.
|
database technologies around the mid-2000s.
|
||||||
|
|
||||||
# The Unhelpful CAP Theorem
|
# The Unhelpful CAP Theorem
|
||||||
|
|
||||||
CAP is sometimes presented as *Consistency, Availability, Partition tolerance: pick 2 out of 3*.
|
CAP is sometimes presented as *Consistency, Availability, Partition tolerance: pick 2 out of 3*.
|
||||||
Unfortunately, putting it this way is misleading
|
Unfortunately, putting it this way is misleading [^32] because network partitions are a kind of
|
||||||
[^32] because network partitions are a kind of
|
|
||||||
fault, so they aren’t something about which you have a choice: they will happen whether you like it
|
fault, so they aren’t something about which you have a choice: they will happen whether you like it
|
||||||
or not.
|
or not.
|
||||||
|
|
||||||
|
|
@ -581,16 +557,13 @@ either linearizability or total availability. Thus, a better way of phrasing CAP
|
||||||
A more reliable network needs to make this choice less often, but at some point the choice is
|
A more reliable network needs to make this choice less often, but at some point the choice is
|
||||||
inevitable.
|
inevitable.
|
||||||
|
|
||||||
The CP/AP classification scheme has several further flaws
|
The CP/AP classification scheme has several further flaws [^4]. *Consistency* is formalized as
|
||||||
[^4]. *Consistency* is formalized as
|
|
||||||
linearizability (the theorem doesn’t say anything about weaker consistency models), and the
|
linearizability (the theorem doesn’t say anything about weaker consistency models), and the
|
||||||
formalization of *availability* [^30] does not
|
formalization of *availability* [^30] does not
|
||||||
match the usual meaning of the term
|
match the usual meaning of the term [^38]. Many highly available (fault-tolerant) systems actually do not meet CAP’s
|
||||||
[^38]. Many highly available (fault-tolerant) systems actually do not meet CAP’s
|
|
||||||
idiosyncratic definition of availability. Moreover, some system designers choose (with good reason)
|
idiosyncratic definition of availability. Moreover, some system designers choose (with good reason)
|
||||||
to provide neither linearizability nor the form of availability that the CAP theorem assumes, so
|
to provide neither linearizability nor the form of availability that the CAP theorem assumes, so
|
||||||
those systems are neither CP nor AP [[39](/en/ch10#Abadi2010),
|
those systems are neither CP nor AP [[^39], [^40]].
|
||||||
[40](/en/ch10#Abadi2017)].
|
|
||||||
|
|
||||||
All in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us
|
All in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us
|
||||||
understand systems better, so CAP is best avoided.
|
understand systems better, so CAP is best avoided.
|
||||||
|
|
@ -601,31 +574,25 @@ fault (network partitions, which according to data from Google are the cause of
|
||||||
incidents [^41]).
|
incidents [^41]).
|
||||||
It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP
|
It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP
|
||||||
has been historically influential, it has little practical value for designing systems
|
has been historically influential, it has little practical value for designing systems
|
||||||
[[4](/en/ch10#Kleppmann2015stop),
|
[[^4], [^38]].
|
||||||
[38](/en/ch10#Kleppmann2015critique)].
|
|
||||||
|
|
||||||
There have been efforts to generalize CAP. For example, the *PACELC principle* observes that system
|
There have been efforts to generalize CAP. For example, the *PACELC principle* observes that system
|
||||||
designers might also choose to weaken consistency at times when the network is working fine in order
|
designers might also choose to weaken consistency at times when the network is working fine in order
|
||||||
to reduce latency [[39](/en/ch10#Abadi2010),
|
to reduce latency [[^39], [^40], [^42]].
|
||||||
[40](/en/ch10#Abadi2017),
|
|
||||||
[42](/en/ch10#Abadi2012)].
|
|
||||||
Thus, during a network partition (P), we need to choose between availability (A) and consistency
|
Thus, during a network partition (P), we need to choose between availability (A) and consistency
|
||||||
(C); else (E), when there is no partition, we may choose between low latency (L) and
|
(C); else (E), when there is no partition, we may choose between low latency (L) and
|
||||||
consistency (C). However, this definition inherits several problems with CAP, such as the
|
consistency (C). However, this definition inherits several problems with CAP, such as the
|
||||||
counterintuitive definitions of consistency and availability.
|
counterintuitive definitions of consistency and availability.
|
||||||
|
|
||||||
There are many more interesting impossibility results in distributed systems
|
There are many more interesting impossibility results in distributed systems [^43],
|
||||||
[^43],
|
|
||||||
and CAP has now been superseded by more precise results
|
and CAP has now been superseded by more precise results
|
||||||
[[44](/en/ch10#Mahajan2011),
|
[[^44], [^45]],
|
||||||
[45](/en/ch10#Attiya2015)],
|
|
||||||
so it is of mostly historical interest today.
|
so it is of mostly historical interest today.
|
||||||
|
|
||||||
### Linearizability and network delays
|
### Linearizability and network delays
|
||||||
|
|
||||||
Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable
|
Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable
|
||||||
in practice. For example, even RAM on a modern multi-core CPU is not linearizable
|
in practice. For example, even RAM on a modern multi-core CPU is not linearizable [^46]:
|
||||||
[^46]:
|
|
||||||
if a thread running on one CPU core writes to a memory address, and a thread on another CPU core
|
if a thread running on one CPU core writes to a memory address, and a thread on another CPU core
|
||||||
reads the same address shortly afterward, it is not guaranteed to read the value written by the
|
reads the same address shortly afterward, it is not guaranteed to read the value written by the
|
||||||
first thread (unless a *memory barrier* or *fence*
|
first thread (unless a *memory barrier* or *fence*
|
||||||
|
|
@ -633,8 +600,7 @@ first thread (unless a *memory barrier* or *fence*
|
||||||
|
|
||||||
The reason for this behavior is that every CPU core has its own memory cache and store buffer.
|
The reason for this behavior is that every CPU core has its own memory cache and store buffer.
|
||||||
Memory access first goes to the cache by default, and any changes are asynchronously written out to
|
Memory access first goes to the cache by default, and any changes are asynchronously written out to
|
||||||
main memory. Since accessing data in the cache is much faster than going to main memory
|
main memory. Since accessing data in the cache is much faster than going to main memory [^48], this feature is essential for
|
||||||
[^48], this feature is essential for
|
|
||||||
good performance on modern CPUs. However, there are now several copies of the data (one in main
|
good performance on modern CPUs. However, there are now several copies of the data (one in main
|
||||||
memory, and perhaps several more in various caches), and these copies are asynchronously updated, so
|
memory, and perhaps several more in various caches), and these copies are asynchronously updated, so
|
||||||
linearizability is lost.
|
linearizability is lost.
|
||||||
|
|
@ -642,12 +608,10 @@ linearizability is lost.
|
||||||
Why make this trade-off? It makes no sense to use the CAP theorem to justify the multi-core memory
|
Why make this trade-off? It makes no sense to use the CAP theorem to justify the multi-core memory
|
||||||
consistency model: within one computer we usually assume reliable communication, and we don’t expect
|
consistency model: within one computer we usually assume reliable communication, and we don’t expect
|
||||||
one CPU core to be able to continue operating normally if it is disconnected from the rest of the
|
one CPU core to be able to continue operating normally if it is disconnected from the rest of the
|
||||||
computer. The reason for dropping linearizability is *performance*, not fault tolerance
|
computer. The reason for dropping linearizability is *performance*, not fault tolerance [^39].
|
||||||
[^39].
|
|
||||||
|
|
||||||
The same is true of many distributed databases that choose not to provide linearizable guarantees:
|
The same is true of many distributed databases that choose not to provide linearizable guarantees:
|
||||||
they do so primarily to increase performance, not so much for fault tolerance
|
they do so primarily to increase performance, not so much for fault tolerance [^42].
|
||||||
[^42].
|
|
||||||
Linearizability is slow—and this is true all the time, not only during a network fault.
|
Linearizability is slow—and this is true all the time, not only during a network fault.
|
||||||
|
|
||||||
Can’t we maybe find a more efficient implementation of linearizable storage? It seems the answer is
|
Can’t we maybe find a more efficient implementation of linearizable storage? It seems the answer is
|
||||||
|
|
@ -826,8 +790,7 @@ limitations:
|
||||||
different nodes have wildly different counter values.
|
different nodes have wildly different counter values.
|
||||||
|
|
||||||
A *hybrid logical clock* combines the advantages of physical time-of-day clocks with the ordering
|
A *hybrid logical clock* combines the advantages of physical time-of-day clocks with the ordering
|
||||||
guarantees of Lamport clocks
|
guarantees of Lamport clocks [^55].
|
||||||
[^55].
|
|
||||||
Like a physical clock, it counts seconds or microseconds. Like a Lamport clock, when one node sees a
|
Like a physical clock, it counts seconds or microseconds. Like a Lamport clock, when one node sees a
|
||||||
timestamp from another node that is greater than its local clock value, it moves its own local value
|
timestamp from another node that is greater than its local clock value, it moves its own local value
|
||||||
forward to match the other node’s timestamp. As a result, if one node’s clock is running fast, the
|
forward to match the other node’s timestamp. As a result, if one node’s clock is running fast, the
|
||||||
|
|
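The behaviour described in the hunk above (count physical time, but jump forward when a greater remote timestamp is seen) can be sketched roughly as below. The class and method names are hypothetical, the physical clock is injected so the example is deterministic, and the update rules follow the general shape of published hybrid logical clock descriptions rather than any particular database's implementation.

```python
import time
from typing import Callable


class HybridLogicalClock:
    def __init__(self, physical_clock: Callable[[], int] = time.time_ns) -> None:
        self._physical_clock = physical_clock
        self.wall = 0      # greatest physical time seen so far, local or remote
        self.counter = 0   # logical counter ordering events that share a wall value

    def now(self) -> tuple[int, int]:
        """Timestamp for a local event or an outgoing message."""
        physical = self._physical_clock()
        if physical > self.wall:
            self.wall, self.counter = physical, 0
        else:
            # Physical clock has not advanced past what we have seen: tick logically.
            self.counter += 1
        return (self.wall, self.counter)

    def observe(self, remote: tuple[int, int]) -> tuple[int, int]:
        """Merge a timestamp received from another node."""
        remote_wall, remote_counter = remote
        physical = self._physical_clock()
        new_wall = max(self.wall, remote_wall, physical)
        if new_wall == self.wall and new_wall == remote_wall:
            self.counter = max(self.counter, remote_counter) + 1
        elif new_wall == self.wall:
            self.counter += 1
        elif new_wall == remote_wall:
            # Remote clock is ahead: move forward to match it.
            self.counter = remote_counter + 1
        else:
            self.counter = 0
        self.wall = new_wall
        return (self.wall, self.counter)
```

Timestamps are compared as `(wall, counter)` tuples. A node whose physical clock is behind still produces timestamps greater than any it has observed, which is the Lamport-style ordering guarantee.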
@ -850,8 +813,7 @@ In [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_sna
|
||||||
essentially, by giving each transaction a transaction ID, and allowing each transaction to see
|
essentially, by giving each transaction a transaction ID, and allowing each transaction to see
|
||||||
writes made by transactions with a lower ID, but to make writes by transactions with higher IDs
|
writes made by transactions with a lower ID, but to make writes by transactions with higher IDs
|
||||||
invisible. Lamport clocks and hybrid logical clocks are a good way of generating these transaction
|
invisible. Lamport clocks and hybrid logical clocks are a good way of generating these transaction
|
||||||
IDs, because they ensure that the snapshot is consistent with causality
|
IDs, because they ensure that the snapshot is consistent with causality [^56].
|
||||||
[^56].
|
|
||||||
|
|
||||||
When multiple timestamps are generated concurrently, these algorithms order them arbitrarily. This
|
When multiple timestamps are generated concurrently, these algorithms order them arbitrarily. This
|
||||||
means that when you look at two timestamps, you generally can’t tell whether they were generated
|
means that when you look at two timestamps, you generally can’t tell whether they were generated
|
||||||
|
|
@ -983,28 +945,18 @@ node, but which get a lot harder if you want fault tolerance:
|
||||||
It turns out that all of these are instances of the same fundamental distributed systems problem:
|
It turns out that all of these are instances of the same fundamental distributed systems problem:
|
||||||
*consensus*. Consensus is one of the most important and fundamental problems in distributed
|
*consensus*. Consensus is one of the most important and fundamental problems in distributed
|
||||||
computing; it is also infamously difficult to get right
|
computing; it is also infamously difficult to get right
|
||||||
[[58](/en/ch10#Chandra2007),
|
[[^58], [^59]],
|
||||||
[59](/en/ch10#Portnoy2012)],
|
|
||||||
and many systems have got it wrong in the past. Now that we have discussed replication
|
and many systems have got it wrong in the past. Now that we have discussed replication
|
||||||
([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
|
([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
|
||||||
linearizability (this chapter), we are finally ready to tackle the consensus problem.
|
linearizability (this chapter), we are finally ready to tackle the consensus problem.
|
||||||
|
|
||||||
The best-known consensus algorithms are Viewstamped Replication
|
The best-known consensus algorithms are Viewstamped Replication
|
||||||
[[60](/en/ch10#Oki1988),
|
[[^60], [^61]],
|
||||||
[61](/en/ch10#Liskov2012)],
|
Paxos [[^58], [^62], [^63], [^64]],
|
||||||
Paxos [[58](/en/ch10#Chandra2007),
|
Raft [[^23], [^65], [^66]],
|
||||||
[62](/en/ch10#Lamport1998),
|
and Zab [[^18], [^22], [^67]].
|
||||||
[63](/en/ch10#Lamport2001),
|
|
||||||
[64](/en/ch10#vanRenesse2011)],
|
|
||||||
Raft [[23](/en/ch10#Ongaro2014atc),
|
|
||||||
[65](/en/ch10#Ongaro2014thesis),
|
|
||||||
[66](/en/ch10#Howard2015refloated)],
|
|
||||||
and Zab [[18](/en/ch10#Junqueira2013_ch10),
|
|
||||||
[22](/en/ch10#Junqueira2011),
|
|
||||||
[67](/en/ch10#Medeiros2012)].
|
|
||||||
There are quite a few similarities between these algorithms, but they are not the same
|
There are quite a few similarities between these algorithms, but they are not the same
|
||||||
[[68](/en/ch10#vanRenesse2014),
|
[[^68], [^69]].
|
||||||
[69](/en/ch10#Howard2020)].
|
|
||||||
These algorithms work in a non-Byzantine system model: that is, network communication may be
|
These algorithms work in a non-Byzantine system model: that is, network communication may be
|
||||||
arbitrarily delayed or dropped, and nodes may crash, restart, and become disconnected, but the
|
arbitrarily delayed or dropped, and nodes may crash, restart, and become disconnected, but the
|
||||||
algorithms assume that nodes otherwise follow the protocol correctly and do not behave maliciously.
|
algorithms assume that nodes otherwise follow the protocol correctly and do not behave maliciously.
|
||||||
|
|
@ -1012,17 +964,14 @@ algorithms assume that nodes otherwise follow the protocol correctly and do not
|
||||||
There are also consensus algorithms that can tolerate some Byzantine nodes, i.e., nodes that don’t
|
There are also consensus algorithms that can tolerate some Byzantine nodes, i.e., nodes that don’t
|
||||||
correctly follow the protocol (for example, by sending contradictory messages to other nodes). A
|
correctly follow the protocol (for example, by sending contradictory messages to other nodes). A
|
||||||
common assumption is that fewer than one-third of the nodes are Byzantine-faulty
|
common assumption is that fewer than one-third of the nodes are Byzantine-faulty
|
||||||
[[26](/en/ch10#Cachin2011),
|
[[^26], [^70]].
|
||||||
[70](/en/ch10#Castro2002)].
|
Such *Byzantine fault tolerant* (BFT) consensus algorithms are used in blockchains [^71].
|
||||||
Such *Byzantine fault tolerant* (BFT) consensus algorithms are used in blockchains
|
|
||||||
[^71].
|
|
||||||
However, as explained in [“Byzantine Faults”](/en/ch9#sec_distributed_byzantine), BFT algorithms are beyond the scope of this
|
However, as explained in [“Byzantine Faults”](/en/ch9#sec_distributed_byzantine), BFT algorithms are beyond the scope of this
|
||||||
book.
|
book.
|
||||||
|
|
||||||
# The Impossibility of Consensus
|
# The Impossibility of Consensus
|
||||||
|
|
||||||
You may have heard about the FLP result
|
You may have heard about the FLP result [^72]—named after the
|
||||||
[^72]—named after the
|
|
||||||
authors Fischer, Lynch, and Paterson—which proves that there is no algorithm that is always able to
|
authors Fischer, Lynch, and Paterson—which proves that there is no algorithm that is always able to
|
||||||
reach consensus if there is a risk that a node may crash. In a distributed system, we must assume
|
reach consensus if there is a risk that a node may crash. In a distributed system, we must assume
|
||||||
that nodes may crash, so reliable consensus is impossible. Yet, here we are, discussing algorithms
|
that nodes may crash, so reliable consensus is impossible. Yet, here we are, discussing algorithms
|
||||||
|
|
@ -1118,15 +1067,13 @@ and is never going to come back online.)
|
||||||
Of course, if *all* nodes crash and none of them are running, then it is not possible for any
|
Of course, if *all* nodes crash and none of them are running, then it is not possible for any
|
||||||
algorithm to decide anything. There is a limit to the number of failures that an algorithm can
|
algorithm to decide anything. There is a limit to the number of failures that an algorithm can
|
||||||
tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of
|
tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of
|
||||||
nodes to be functioning correctly in order to assure termination
|
nodes to be functioning correctly in order to assure termination [^73]. That majority can safely form a quorum
|
||||||
[^73]. That majority can safely form a quorum
|
|
||||||
(see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)).
|
(see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)).
|
||||||
|
|
||||||
Thus, the termination property is subject to the assumption that fewer than half of the nodes are
|
Thus, the termination property is subject to the assumption that fewer than half of the nodes are
|
||||||
crashed or unreachable. However, most consensus algorithms ensure that the safety
|
crashed or unreachable. However, most consensus algorithms ensure that the safety
|
||||||
properties—agreement, integrity, and validity—are always met, even if a majority of nodes fail or
|
properties—agreement, integrity, and validity—are always met, even if a majority of nodes fail or
|
||||||
there is a severe network problem
|
there is a severe network problem [^75].
|
||||||
[^75].
|
|
||||||
Thus, a large-scale outage can stop the system from being able to process requests, but it cannot
|
Thus, a large-scale outage can stop the system from being able to process requests, but it cannot
|
||||||
corrupt the consensus system by causing it to make inconsistent decisions.
|
corrupt the consensus system by causing it to make inconsistent decisions.
|
||||||
|
|
||||||
|
|
@ -1148,8 +1095,7 @@ consensus. Any CAS invocations whose new value was not decided return an error.
|
||||||
different expected values use separate runs of the consensus protocol.
|
different expected values use separate runs of the consensus protocol.
|
||||||
|
|
||||||
This shows that CAS and consensus are equivalent to each other
|
This shows that CAS and consensus are equivalent to each other
|
||||||
[[28](/en/ch10#Herlihy1991),
|
[[^28], [^73]].
|
||||||
[73](/en/ch10#Chandra1996)].
|
|
||||||
Again, both are straightforward on a single node, but challenging to make fault-tolerant. As an
|
Again, both are straightforward on a single node, but challenging to make fault-tolerant. As an
|
||||||
example of CAS in a distributed setting, we saw conditional write operations for object stores in
|
example of CAS in a distributed setting, we saw conditional write operations for object stores in
|
||||||
[“Databases backed by object storage”](/en/ch6#sec_replication_object_storage), which allow a write to happen only if an object with the same
|
[“Databases backed by object storage”](/en/ch6#sec_replication_object_storage), which allow a write to happen only if an object with the same
|
||||||
|
|
@ -1159,8 +1105,7 @@ However, a linearizable read-write register is not sufficient to solve consensus
|
||||||
tells us that consensus cannot be solved by a deterministic algorithm in the asynchronous crash-stop
|
tells us that consensus cannot be solved by a deterministic algorithm in the asynchronous crash-stop
|
||||||
model [^72], but we saw in
|
model [^72], but we saw in
|
||||||
[“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum
|
[“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum
|
||||||
reads/writes in this model [[24](/en/ch10#Attiya1995),
|
reads/writes in this model [[^24], [^25], [^26]].
|
||||||
[25](/en/ch10#Lynch1997), [26](/en/ch10#Cachin2011)].
|
|
||||||
From this it follows that a linearizable register cannot solve consensus.
|
From this it follows that a linearizable register cannot solve consensus.
|
||||||
|
|
||||||
### Shared logs as consensus
|
### Shared logs as consensus
|
||||||
|
|
@ -1198,21 +1143,19 @@ Validity
|
||||||
|
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> A shared log is formally known as a *total order broadcast*, *atomic broadcast*, or *total order
|
> A shared log is formally known as a *total order broadcast*, *atomic broadcast*, or *total order
|
||||||
> multicast* protocol [[26](/en/ch10#Cachin2011),
|
> multicast* protocol [[^26],
|
||||||
> [76](/en/ch10#Defago2004),
|
> [^76],
|
||||||
> [77](/en/ch10#Attiya2004)].
|
> [^77]].
|
||||||
> It’s the same thing described in different words: requesting a value to be added to the log is then
|
> It’s the same thing described in different words: requesting a value to be added to the log is then
|
||||||
> called “broadcasting” it, and reading a log entry is called “delivering” it.
|
> called “broadcasting” it, and reading a log entry is called “delivering” it.
|
||||||
|
|
||||||
If you have an implementation of a shared log, it is easy to solve the consensus problem: every node
|
If you have an implementation of a shared log, it is easy to solve the consensus problem: every node
|
||||||
that wants to propose a value requests for it to be added to the log, and whichever value is read
|
that wants to propose a value requests for it to be added to the log, and whichever value is read
|
||||||
back in the first log entry is the value that is decided. Since all nodes read log entries in the
|
back in the first log entry is the value that is decided. Since all nodes read log entries in the
|
||||||
same order, they are guaranteed to agree on which value is delivered first
|
same order, they are guaranteed to agree on which value is delivered first [^28].
|
||||||
[^28].
|
|
||||||
|
|
||||||
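As a minimal sketch of this reduction, assume a shared log with an append operation and an indexed read. The `SharedLog` class below is a hypothetical single-process stand-in; in a real system the append would be implemented by a fault-tolerant total order broadcast protocol.

```python
import threading


class SharedLog:
    """Toy in-process stand-in for a fault-tolerant shared log (hypothetical)."""

    def __init__(self):
        self._entries = []
        self._lock = threading.Lock()

    def append(self, value):
        # A real implementation would replicate and order the entry here.
        with self._lock:
            self._entries.append(value)

    def read(self, index):
        with self._lock:
            return self._entries[index]


def decide(log, proposal):
    """Propose a value, then decide whatever ended up in the first log entry."""
    log.append(proposal)
    return log.read(0)


log = SharedLog()
decisions = [decide(log, value) for value in ["A", "B", "C"]]
```

Every proposer reads the same first entry, so all of them decide the same value, and that value is one that some node proposed.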
Conversely, if you have a solution for consensus, you can implement a shared log. The details are a
|
Conversely, if you have a solution for consensus, you can implement a shared log. The details are a
|
||||||
bit more complicated, but the basic idea is this
|
bit more complicated, but the basic idea is this [^73]:
|
||||||
[^73]:
|
|
||||||
|
|
||||||
1. You have a slot in the log for every future log entry, and you run a separate instance of the
|
1. You have a slot in the log for every future log entry, and you run a separate instance of the
|
||||||
consensus algorithm for every such slot to decide what value should go in that entry.
|
consensus algorithm for every such slot to decide what value should go in that entry.
|
||||||
|
|
@ -1260,8 +1203,7 @@ An exception is if we know for sure that no more than two nodes will propose a v
|
||||||
the nodes can send each other the values they want to propose, and then each perform the
|
the nodes can send each other the values they want to propose, and then each perform the
|
||||||
fetch-and-add operation. The node that reads zero decides its own value, and the node that reads one
|
fetch-and-add operation. The node that reads zero decides its own value, and the node that reads one
|
||||||
decides the other node’s value. This solves the consensus problem among two nodes, which is why we
|
decides the other node’s value. This solves the consensus problem among two nodes, which is why we
|
||||||
can say that fetch-and-add has a *consensus number* of two
|
can say that fetch-and-add has a *consensus number* of two [^28].
|
||||||
[^28].
|
|
||||||
In contrast, CAS and shared logs solve consensus for any number of nodes that may propose values, so
|
In contrast, CAS and shared logs solve consensus for any number of nodes that may propose values, so
|
||||||
they have a consensus number of ∞ (infinity).
|
they have a consensus number of ∞ (infinity).
|
||||||
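The two-node case can be sketched as follows. The `FetchAndAdd` class is a hypothetical single-process stand-in for a linearizable counter, and the sketch assumes the two nodes have already exchanged their proposals.

```python
import threading


class FetchAndAdd:
    """Toy linearizable counter (hypothetical, single-process)."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_and_add(self, delta=1):
        # Atomically return the old value and increment the counter.
        with self._lock:
            old = self._value
            self._value += delta
            return old


def decide_two_node(counter, my_value, other_value):
    # Whoever reads zero was first and wins; the other node adopts that value.
    return my_value if counter.fetch_and_add() == 0 else other_value


counter = FetchAndAdd()
decision_1 = decide_two_node(counter, my_value="x", other_value="y")
decision_2 = decide_two_node(counter, my_value="y", other_value="x")
```

With a third proposer this breaks down: a node that reads two knows it was not first, but it cannot tell which of the other two nodes read zero.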
|
|
||||||
|
|
@ -1276,8 +1218,7 @@ What is the relationship between consensus and atomic commitment? At first glanc
|
||||||
similar—both require nodes to come to some form of agreement. However, there is one important
|
similar—both require nodes to come to some form of agreement. However, there is one important
|
||||||
difference: with consensus it’s okay to decide any value that was proposed, whereas with atomic
|
difference: with consensus it’s okay to decide any value that was proposed, whereas with atomic
|
||||||
commitment the algorithm *must* abort if *any* of the participants voted to abort. More precisely,
|
commitment the algorithm *must* abort if *any* of the participants voted to abort. More precisely,
|
||||||
atomic commitment requires the following properties
|
atomic commitment requires the following properties [^78]:
|
||||||
[^78]:
|
|
||||||
|
|
||||||
Uniform agreement
|
Uniform agreement
|
||||||
: No two nodes decide on different outcomes.
|
: No two nodes decide on different outcomes.
|
||||||
|
|
@ -1302,8 +1243,7 @@ any of the communication among the nodes times out). The other three properties
|
||||||
same as for consensus.
|
same as for consensus.
|
||||||
|
|
||||||
If you have a solution for consensus, there are multiple ways you could solve atomic commitment
|
If you have a solution for consensus, there are multiple ways you could solve atomic commitment
|
||||||
[[78](/en/ch10#Guerraoui1995),
|
[[^78], [^79]].
|
||||||
[79](/en/ch10#Gray2006)].
|
|
||||||
One works like this: when you want to commit the transaction, every node sends its vote to commit or
|
One works like this: when you want to commit the transaction, every node sends its vote to commit or
|
||||||
abort to every other node. Nodes that receive a vote to commit from themselves and every other node
|
abort to every other node. Nodes that receive a vote to commit from themselves and every other node
|
||||||
propose “commit” using the consensus algorithm; nodes that receive a vote to abort, or which
|
propose “commit” using the consensus algorithm; nodes that receive a vote to abort, or which
|
||||||
|
|
@ -1350,8 +1290,7 @@ Similarly, a shared log can be used to implement serializable transactions: as d
|
||||||
[“Actual Serial Execution”](/en/ch8#sec_transactions_serial), if every log entry represents a deterministic transaction to be
|
[“Actual Serial Execution”](/en/ch8#sec_transactions_serial), if every log entry represents a deterministic transaction to be
|
||||||
executed as a stored procedure, and if every node executes those transactions in the same order,
|
executed as a stored procedure, and if every node executes those transactions in the same order,
|
||||||
then the transactions will be serializable
|
then the transactions will be serializable
|
||||||
[[81](/en/ch10#Thomson2012),
|
[[^81], [^82]].
|
||||||
[82](/en/ch10#Balakrishnan2013)].
|
|
||||||
|
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> Sharded databases with a strong consistency model often maintain a separate log per shard, which
|
> Sharded databases with a strong consistency model often maintain a separate log per shard, which
|
||||||
|
|
@ -1411,12 +1350,10 @@ A node votes yes only if it is not aware of any other leader with a higher epoch
|
||||||
Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s
|
Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s
|
||||||
proposal for the next entry to append to the log. The quorums for those two votes must overlap: if
|
proposal for the next entry to append to the log. The quorums for those two votes must overlap: if
|
||||||
a vote on a proposal succeeds, at least one of the nodes that voted for it must have also
|
a vote on a proposal succeeds, at least one of the nodes that voted for it must have also
|
||||||
participated in the most recent successful leader election
|
participated in the most recent successful leader election [^85]. Thus, if the vote on a proposal
|
||||||
[^85]. Thus, if the vote on a proposal
|
|
||||||
passes without revealing any higher-numbered epoch, the current leader can conclude that no leader
|
passes without revealing any higher-numbered epoch, the current leader can conclude that no leader
|
||||||
with a higher epoch number has been elected, and therefore it can safely append the proposed entry
|
with a higher epoch number has been elected, and therefore it can safely append the proposed entry
|
||||||
to the log [[26](/en/ch10#Cachin2011),
|
to the log [[^26], [^86]].
|
||||||
[86](/en/ch10#Kleppmann2024distsys)].
|
|
||||||
|
|
||||||
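The overlap requirement holds automatically when both quorums are majorities, which a brute-force check over a small cluster illustrates. The node names and the `majorities` helper are made up for this sketch.

```python
from itertools import combinations


def majorities(nodes):
    """All smallest-possible majority quorums of the given nodes."""
    quorum_size = len(nodes) // 2 + 1
    return [set(quorum) for quorum in combinations(nodes, quorum_size)]


nodes = ["n1", "n2", "n3", "n4", "n5"]
quorums = majorities(nodes)

# Any election quorum and any proposal quorum share at least one node,
# so a newer leader's epoch number cannot go unnoticed by a proposal vote.
all_overlap = all(a & b for a in quorums for b in quorums)
```

Larger quorums are supersets of these minimal ones, so they overlap as well.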
These two rounds of voting look superficially similar to two-phase commit, but they are very
|
These two rounds of voting look superficially similar to two-phase commit, but they are very
|
||||||
different protocols. In consensus algorithms, any node can start an election and it requires only a
|
different protocols. In consensus algorithms, any node can start an election and it requires only a
|
||||||
|
|
@ -1427,8 +1364,7 @@ vote from *every* participant before it can commit.
|
||||||
|
|
||||||
This basic structure is common to all of Raft, Multi-Paxos, Zab, and Viewstamped Replication: a vote
|
This basic structure is common to all of Raft, Multi-Paxos, Zab, and Viewstamped Replication: a vote
|
||||||
by a quorum of nodes elects a leader, and then another quorum vote is required for every entry that
|
by a quorum of nodes elects a leader, and then another quorum vote is required for every entry that
|
||||||
the leader wants to append to the log [[68](/en/ch10#vanRenesse2014),
|
the leader wants to append to the log [[^68], [^69]]. Every new log entry is synchronously replicated
|
||||||
[69](/en/ch10#Howard2020)]. Every new log entry is synchronously replicated
|
|
||||||
to a quorum of nodes before it is confirmed to the client that requested the write. This ensures
|
to a quorum of nodes before it is confirmed to the client that requested the write. This ensures
|
||||||
that the log entry won’t be lost if the current leader fails.
|
that the log entry won’t be lost if the current leader fails.
|
||||||
|
|
||||||
|
|
@ -1436,8 +1372,7 @@ However, the devil is in the details, and that’s also where these algorithms t
|
||||||
approaches. For example, when the old leader fails and a new one is elected, the algorithm needs to
|
approaches. For example, when the old leader fails and a new one is elected, the algorithm needs to
|
||||||
ensure that the new leader honors any log entries that had already been appended by the old leader
|
ensure that the new leader honors any log entries that had already been appended by the old leader
|
||||||
before it failed. Raft does this by only allowing a node to become the new leader if its log is at
|
before it failed. Raft does this by only allowing a node to become the new leader if its log is at
|
||||||
least as up-to-date as a majority of its followers
|
least as up-to-date as a majority of its followers [^69].
|
||||||
[^69].
|
|
||||||
In contrast, Paxos allows any node to become the new leader, but requires it to bring its log
|
In contrast, Paxos allows any node to become the new leader, but requires it to bring its log
|
||||||
up-to-date with other nodes before it can start appending new entries of its own.
|
up-to-date with other nodes before it can start appending new entries of its own.
|
||||||
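Raft's check can be sketched as the comparison each voter makes before granting its vote: as described in the Raft paper, the logs are compared by the term of their last entry first and by the index of that entry second. The function and parameter names here are hypothetical.

```python
def candidate_log_ok(candidate_last_term, candidate_last_index,
                     voter_last_term, voter_last_index):
    """Return True if the candidate's log is at least as up-to-date as the voter's.

    A later last term wins; if the last terms are equal, the longer log wins.
    Tuple comparison in Python is lexicographic, which matches that rule.
    """
    return (candidate_last_term, candidate_last_index) >= (
        voter_last_term,
        voter_last_index,
    )


# A shorter log with a newer last term still counts as more up-to-date.
newer_term_wins = candidate_log_ok(3, 5, 2, 9)
```

Because a candidate needs votes from a majority, and every voter applies this check, the winner's log is at least as up-to-date as that of each node in the majority that voted for it.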
|
|
||||||
|
|
@ -1463,9 +1398,7 @@ easily cause a lot of data loss or corruption.
|
||||||
Another subtlety is in how the algorithms deal with log entries that had been proposed by the old
|
Another subtlety is in how the algorithms deal with log entries that had been proposed by the old
|
||||||
leader before it failed, but for which the vote on appending to the log had not yet completed. You
|
leader before it failed, but for which the vote on appending to the log had not yet completed. You
|
||||||
can find discussions of these details in the references for this chapter
|
can find discussions of these details in the references for this chapter
|
||||||
[[23](/en/ch10#Ongaro2014atc),
|
[[^23], [^69], [^86]].
|
||||||
[69](/en/ch10#Howard2020),
|
|
||||||
[86](/en/ch10#Kleppmann2024distsys)].
|
|
||||||
|
|
||||||
For databases that use a consensus algorithm for replication, not only do writes need to be turned
|
For databases that use a consensus algorithm for replication, not only do writes need to be turned
|
||||||
into log entries and replicated to a quorum. If you want to guarantee linearizable reads, they also
|
into log entries and replicated to a quorum. If you want to guarantee linearizable reads, they also
|
||||||
|
|
@ -1508,8 +1441,7 @@ work.
|
||||||
|
|
||||||
Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft
|
Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft
|
||||||
has been shown to have unpleasant edge cases
|
has been shown to have unpleasant edge cases
|
||||||
[[88](/en/ch10#Howard2015coracle),
|
[[^88], [^89]]:
|
||||||
[89](/en/ch10#Lianza2020_ch10)]:
|
|
||||||
if the entire network is working correctly except for one particular network link that is
|
if the entire network is working correctly except for one particular network link that is
|
||||||
consistently unreliable, Raft can get into situations where leadership continually bounces between
|
consistently unreliable, Raft can get into situations where leadership continually bounces between
|
||||||
two nodes, or the current leader is continually forced to resign, so the system effectively never
|
two nodes, or the current leader is continually forced to resign, so the system effectively never
|
||||||
|
|
@ -1536,8 +1468,7 @@ entirely in memory (although they still write to disk for durability), which is
|
||||||
multiple nodes using a fault-tolerant consensus algorithm.
|
multiple nodes using a fault-tolerant consensus algorithm.
|
||||||
|
|
||||||
Coordination services are modeled after Google’s Chubby lock service
|
Coordination services are modeled after Google’s Chubby lock service
|
||||||
[[17](/en/ch10#Burrows2006_ch10),
|
[[^17], [^58]].
|
||||||
[58](/en/ch10#Chandra2007)].
|
|
||||||
They combine a consensus algorithm with several other features that turn out to be particularly
|
They combine a consensus algorithm with several other features that turn out to be particularly
|
||||||
useful when building distributed systems:
|
useful when building distributed systems:
|
||||||
|
|
||||||
|
|
@ -1614,8 +1545,7 @@ information like “the node running on IP address 10.1.1.23 is the leader for s
|
||||||
assignments usually change on a timescale of minutes or hours. Coordination services are not
|
assignments usually change on a timescale of minutes or hours. Coordination services are not
|
||||||
intended for storing data that may change thousands of times per second. For that, it is better to
|
intended for storing data that may change thousands of times per second. For that, it is better to
|
||||||
use a conventional database; alternatively, tools like Apache BookKeeper
|
use a conventional database; alternatively, tools like Apache BookKeeper
|
||||||
[[90](/en/ch10#Kelly2014),
|
[[^90], [^91]]
|
||||||
[91](/en/ch10#Vanlightly2021)]
|
|
||||||
can be used to replicate fast-changing internal state of a service.
|
can be used to replicate fast-changing internal state of a service.
|
||||||
|
|
||||||
### Service discovery
|
### Service discovery
|
||||||
|
|
@ -1645,7 +1575,7 @@ algorithm’s voting process. Reads from an observer are not linearizable as the
|
||||||
they remain available even if the network is interrupted, and they increase the read throughput that
|
they remain available even if the network is interrupted, and they increase the read throughput that
|
||||||
the system can support by caching.
|
the system can support by caching.
|
||||||
|
|
||||||
# Summary
|
## Summary
|
||||||
|
|
||||||
In this chapter we examined the topic of strong consistency in fault-tolerant systems: what it is,
|
In this chapter we examined the topic of strong consistency in fault-tolerant systems: what it is,
|
||||||
and how to achieve it. We looked in depth at linearizability, a popular formalization of strong
|
and how to achieve it. We looked in depth at linearizability, a popular formalization of strong
|
||||||
|
|
@ -1731,8 +1661,6 @@ availability and better performance. In these cases, it is common to use leaderl
|
||||||
replication, which we previously discussed in [Chapter 6](/en/ch6#ch_replication). The logical clocks that we
|
replication, which we previously discussed in [Chapter 6](/en/ch6#ch_replication). The logical clocks that we
|
||||||
discussed in this chapter are helpful in that context.
|
discussed in this chapter are helpful in that context.
|
||||||
|
|
||||||
### Footnotes
|
|
||||||
|
|
||||||
### References
|
### References
|
||||||
|
|
||||||
[^1]: Maurice P. Herlihy and Jeannette M. Wing. [Linearizability: A Correctness Condition for Concurrent Objects](https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 12, issue 3, pages 463–492, July 1990. [doi:10.1145/78969.78972](https://doi.org/10.1145/78969.78972)
|
[^1]: Maurice P. Herlihy and Jeannette M. Wing. [Linearizability: A Correctness Condition for Concurrent Objects](https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 12, issue 3, pages 463–492, July 1990. [doi:10.1145/78969.78972](https://doi.org/10.1145/78969.78972)
|
||||||
|
|
|
||||||
|
|
@ -35,7 +35,7 @@ Stream processing is somewhere between online and offline/batch processing (so i
|
||||||
|
|
||||||
As we shall see in this chapter, batch processing is an important building block in our quest to build reliable, scalable, and maintainable applications. For example, MapReduce, a batch processing algorithm published in 2004 [1], was (perhaps over-enthusiastically) called “the algorithm that makes Google so massively scalable” [2]. It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB.
|
As we shall see in this chapter, batch processing is an important building block in our quest to build reliable, scalable, and maintainable applications. For example, MapReduce, a batch processing algorithm published in 2004 [1], was (perhaps over-enthusiastically) called “the algorithm that makes Google so massively scalable” [2]. It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB.
|
||||||
|
|
||||||
MapReduce is a fairly low-level programming model compared to the parallel processing systems that were developed for data warehouses many years previously [3, 4], but it was a major step forward in terms of the scale of processing that could be achieved on commodity hardware. Although the importance of MapReduce is now declining [5], it is still worth understanding, because it provides a clear picture of why and how batch processing is useful.
|
MapReduce is a fairly low-level programming model compared to the parallel processing systems that were developed for data warehouses many years previously [^3] [^4], but it was a major step forward in terms of the scale of processing that could be achieved on commodity hardware. Although the importance of MapReduce is now declining [5], it is still worth understanding, because it provides a clear picture of why and how batch processing is useful.
|
||||||
|
|
||||||
In fact, batch processing is a very old form of computing. Long before programmable digital computers were invented, punch card tabulating machines—such as the Hollerith machines used in the 1890 US Census [6]—implemented a semi-mechanized form of batch processing to compute aggregate statistics from large inputs. And MapReduce bears an uncanny resemblance to the electromechanical IBM card-sorting machines that were widely used for business data processing in the 1940s and 1950s [7]. As usual, history has a tendency of repeating itself.
|
In fact, batch processing is a very old form of computing. Long before programmable digital computers were invented, punch card tabulating machines—such as the Hollerith machines used in the 1890 US Census [6]—implemented a semi-mechanized form of batch processing to compute aggregate statistics from large inputs. And MapReduce bears an uncanny resemblance to the electromechanical IBM card-sorting machines that were widely used for business data processing in the 1940s and 1950s [7]. As usual, history has a tendency of repeating itself.
|
||||||
|
|
||||||
|
|
@ -94,7 +94,7 @@ In the next chapter, we will turn to stream processing, in which the input is *u
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## References
|
### References
|
||||||
|
|
||||||
1. Jeffrey Dean and Sanjay Ghemawat: “[MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/pub62/),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004.
|
1. Jeffrey Dean and Sanjay Ghemawat: “[MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/pub62/),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004.
|
||||||
1. Joel Spolsky: “[The Perils of JavaSchools](https://www.joelonsoftware.com/2005/12/29/the-perils-of-javaschools-2/),” *joelonsoftware.com*, December 29, 2005.
|
1. Joel Spolsky: “[The Perils of JavaSchools](https://www.joelonsoftware.com/2005/12/29/the-perils-of-javaschools-2/),” *joelonsoftware.com*, December 29, 2005.
|
||||||
|
|
|
||||||
|
|
@ -75,7 +75,7 @@ Finally, we discussed techniques for achieving fault tolerance and exactly-once
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
## References
|
### References
|
||||||
|
|
||||||
1. Tyler Akidau, Robert Bradshaw, Craig Chambers, et al.: “[The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 12, pages 1792–1803, August 2015. [doi:10.14778/2824032.2824076](http://dx.doi.org/10.14778/2824032.2824076)
|
1. Tyler Akidau, Robert Bradshaw, Craig Chambers, et al.: “[The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 12, pages 1792–1803, August 2015. [doi:10.14778/2824032.2824076](http://dx.doi.org/10.14778/2824032.2824076)
|
||||||
1. Harold Abelson, Gerald Jay Sussman, and Julie Sussman: [*Structure and Interpretation of Computer Programs*](https://web.archive.org/web/20220807043536/https://mitpress.mit.edu/sites/default/files/sicp/index.html), 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, available online at *mitpress.mit.edu*
|
1. Harold Abelson, Gerald Jay Sussman, and Julie Sussman: [*Structure and Interpretation of Computer Programs*](https://web.archive.org/web/20220807043536/https://mitpress.mit.edu/sites/default/files/sicp/index.html), 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, available online at *mitpress.mit.edu*
|
||||||
|
|
|
||||||
|
|
@ -48,7 +48,7 @@ Finally, we took a step back and examined some ethical aspects of building data-
|
||||||
|
|
||||||
As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal.
|
As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal.
|
||||||
|
|
||||||
## References
|
### References
|
||||||
|
|
||||||
1. Rachid Belaid: “[Postgres Full-Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/),” *rachbelaid.com*, July 13, 2015.
|
1. Rachid Belaid: “[Postgres Full-Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/),” *rachbelaid.com*, July 13, 2015.
|
||||||
1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
|
1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
|
||||||
|
|
|
||||||
|
|
@ -187,24 +187,19 @@ time out and resend their request. This causes the rate of requests to increase
|
||||||
the problem worse—a *retry storm*. Even when the load is reduced again, such a system may remain in
|
the problem worse—a *retry storm*. Even when the load is reduced again, such a system may remain in
|
||||||
an overloaded state until it is rebooted or otherwise reset. This phenomenon is called a *metastable
|
an overloaded state until it is rebooted or otherwise reset. This phenomenon is called a *metastable
|
||||||
failure*, and it can cause serious outages in production systems
|
failure*, and it can cause serious outages in production systems
|
||||||
[[7](/en/ch2#Bronson2021),
|
[[^7], [^8]].
|
||||||
[8](/en/ch2#Brooker2021)].
|
|
||||||
|
|
||||||
To avoid retries overloading a service, you can increase and randomize the time between successive
|
To avoid retries overloading a service, you can increase and randomize the time between successive
|
||||||
retries on the client side (*exponential backoff*
|
retries on the client side (*exponential backoff*
|
||||||
[[9](/en/ch2#Brooker2015),
|
[[^9], [^10]]),
|
||||||
[10](/en/ch2#Brooker2022backoff)]),
|
|
||||||
and temporarily stop sending requests to a service that has returned errors or timed out recently
|
and temporarily stop sending requests to a service that has returned errors or timed out recently
|
||||||
(using a *circuit breaker* [[11](/en/ch2#Nygard2018),
|
(using a *circuit breaker* [[^11], [^12]]
|
||||||
[12](/en/ch2#Chen2022)]
|
|
||||||
or *token bucket* algorithm [^13]).
|
or *token bucket* algorithm [^13]).
|
||||||
The server can also detect when it is approaching overload and start proactively rejecting requests
|
The server can also detect when it is approaching overload and start proactively rejecting requests
|
||||||
(*load shedding* [^14]), and send back
|
(*load shedding* [^14]), and send back
|
||||||
responses asking clients to slow down (*backpressure*
|
responses asking clients to slow down (*backpressure*
|
||||||
[[1](/en/ch2#Cvet2016),
|
[[^1], [^15]]).
|
||||||
[15](/en/ch2#Sackman2016_ch2)]).
|
The choice of queueing and load-balancing algorithms can also make a difference [^16].
|
||||||
The choice of queueing and load-balancing algorithms can also make a difference
|
|
||||||
[^16].
|
|
||||||
|
|
||||||
In terms of performance metrics, the response time is usually what users care about the most,
|
In terms of performance metrics, the response time is usually what users care about the most,
|
||||||
whereas the throughput determines the required computing resources (e.g., how many servers you need),
|
whereas the throughput determines the required computing resources (e.g., how many servers you need),
|
||||||
|
|
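The client-side *exponential backoff* mentioned in the hunk above can be sketched as follows. This is a minimal illustration, not code from the book: the function names, the `base` and `cap` parameters, and the "full jitter" choice (a uniformly random delay below a doubling bound) are all assumptions made for the example.

```python
import random
import time


def backoff_delay(attempt, base=0.1, cap=10.0, rng=random):
    """Return a randomized delay in seconds for the given retry attempt (0-based)."""
    # The upper bound doubles with every attempt, but never exceeds the cap.
    upper = min(cap, base * (2 ** attempt))
    # Choosing a random point below the bound spreads retrying clients out in time.
    return rng.uniform(0, upper)


def call_with_retries(operation, sleep=time.sleep, max_retries=5, base=0.1, cap=10.0):
    """Call operation(); on failure, wait for a backoff delay and try again."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            # Give up and re-raise once the retry budget is used up.
            if attempt == max_retries:
                raise
            sleep(backoff_delay(attempt, base, cap))
```

Passing `sleep` as a parameter is only a convenience for testing; a real client would probably also limit retries to errors that are known to be transient.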
@ -242,8 +237,7 @@ to another. You will encounter this style of diagram frequently over the course
|
||||||
The response time can vary significantly from one request to the next, even if you keep making the
|
The response time can vary significantly from one request to the next, even if you keep making the
|
||||||
same request over and over again. Many factors can add random delays: for example, a context switch
|
same request over and over again. Many factors can add random delays: for example, a context switch
|
||||||
to a background process, the loss of a network packet and TCP retransmission, a garbage collection
|
to a background process, the loss of a network packet and TCP retransmission, a garbage collection
|
||||||
pause, a page fault forcing a read from disk, mechanical vibrations in the server rack
|
pause, a page fault forcing a read from disk, mechanical vibrations in the server rack [^17],
|
||||||
[^17],
|
|
||||||
or many other causes. We will discuss this topic in more detail in [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing).
|
or many other causes. We will discuss this topic in more detail in [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing).
|
||||||
|
|
||||||
Queueing delays often account for a large part of the variability in response times. As a server
|
Queueing delays often account for a large part of the variability in response times. As a server
|
||||||
|
|
@ -291,8 +285,7 @@ directly affect users’ experience of the service. For example, Amazon describe
|
||||||
requirements for internal services in terms of the 99.9th percentile, even though it only affects 1
|
requirements for internal services in terms of the 99.9th percentile, even though it only affects 1
|
||||||
in 1,000 requests. This is because the customers with the slowest requests are often those who have
|
in 1,000 requests. This is because the customers with the slowest requests are often those who have
|
||||||
the most data on their accounts because they have made many purchases—that is, they’re the most
|
the most data on their accounts because they have made many purchases—that is, they’re the most
|
||||||
valuable customers
|
valuable customers [^19].
|
||||||
[^19].
|
|
||||||
It’s important to keep those customers happy by ensuring the website is fast for them.
|
It’s important to keep those customers happy by ensuring the website is fast for them.
|
||||||
|
|
||||||
On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed
|
On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed
|
||||||
|
|
@ -302,23 +295,19 @@ control, and the benefits are diminishing.
|
||||||
|
|
||||||
# The user impact of response times
|
# The user impact of response times
|
||||||
|
|
||||||
It seems intuitively obvious that a fast service is better for users than a slow service
|
It seems intuitively obvious that a fast service is better for users than a slow service [^20].
|
||||||
[^20].
|
|
||||||
However, it is surprisingly difficult to get hold of reliable data to quantify the effect that
|
However, it is surprisingly difficult to get hold of reliable data to quantify the effect that
|
||||||
latency has on user behavior.
|
latency has on user behavior.
|
||||||
|
|
||||||
Some often-cited statistics are unreliable. In 2006 Google reported that a slowdown in search
|
Some often-cited statistics are unreliable. In 2006 Google reported that a slowdown in search
|
||||||
results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue
|
results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue [^21].
|
||||||
[^21].
|
|
||||||
However, another Google study from 2009 reported that a 400 ms increase in latency resulted in
|
However, another Google study from 2009 reported that a 400 ms increase in latency resulted in
|
||||||
only 0.6% fewer searches per day
|
only 0.6% fewer searches per day [^22],
|
||||||
[^22],
|
|
||||||
and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3%
|
and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3%
|
||||||
[^23].
|
[^23].
|
||||||
Newer data from these companies appears not to be publicly available.
|
Newer data from these companies appears not to be publicly available.
|
||||||
|
|
||||||
A more recent Akamai study
|
A more recent Akamai study [^24]
|
||||||
[^24]
|
|
||||||
claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites
|
claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites
|
||||||
by up to 7%; however, on closer inspection, the same study reveals that very *fast* page load times
|
by up to 7%; however, on closer inspection, the same study reveals that very *fast* page load times
|
||||||
are also correlated with lower conversion rates! This seemingly paradoxical result is explained by
|
are also correlated with lower conversion rates! This seemingly paradoxical result is explained by
|
||||||
|
|
@ -326,8 +315,7 @@ the fact that the pages that load fastest are often those that have no useful co
|
||||||
error pages). However, since the study makes no effort to separate the effects of page content from
|
error pages). However, since the study makes no effort to separate the effects of page content from
|
||||||
the effects of load time, its results are probably not meaningful.
|
the effects of load time, its results are probably not meaningful.
|
||||||
|
|
||||||
A study by Yahoo
|
A study by Yahoo [^25]
|
||||||
[^25]
|
|
||||||
compares click-through rates on fast-loading versus slow-loading search results, controlling for
|
compares click-through rates on fast-loading versus slow-loading search results, controlling for
|
||||||
quality of search results. It finds 20–30% more clicks on fast searches when the difference between
|
quality of search results. It finds 20–30% more clicks on fast searches when the difference between
|
||||||
fast and slow responses is 1.25 seconds or more.
|
fast and slow responses is 1.25 seconds or more.
|
||||||
|
|
@ -348,15 +336,13 @@ end-user requests end up being slow (an effect known as *tail latency amplificat
|
||||||
###### Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.
|
###### Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.
|
||||||
|
|
||||||
Percentiles are often used in *service level objectives* (SLOs) and *service level agreements*
|
Percentiles are often used in *service level objectives* (SLOs) and *service level agreements*
|
||||||
(SLAs) as ways of defining the expected performance and availability of a service
|
(SLAs) as ways of defining the expected performance and availability of a service [^27].
|
||||||
[^27].
|
|
||||||
For example, an SLO may set a target for a service to have a median response time of less than
|
For example, an SLO may set a target for a service to have a median response time of less than
|
||||||
200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests
|
200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests
|
||||||
result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not
|
result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not
|
||||||
met (for example, customers may be entitled to a refund). That is the basic idea, at least; in
|
met (for example, customers may be entitled to a refund). That is the basic idea, at least; in
|
||||||
practice, defining good availability metrics for SLOs and SLAs is not straightforward
|
practice, defining good availability metrics for SLOs and SLAs is not straightforward
|
||||||
[[28](/en/ch2#Mogul2019),
|
[[^28], [^29]].
|
||||||
[29](/en/ch2#Hauer2020)].
|
|
||||||
|
|
||||||
# Computing percentiles
|
# Computing percentiles
|
||||||
|
|
||||||
|
|
@ -369,10 +355,8 @@ The simplest implementation is to keep a list of response times for all requests
|
||||||
window and to sort that list every minute. If that is too inefficient for you, there are algorithms
|
window and to sort that list every minute. If that is too inefficient for you, there are algorithms
|
||||||
that can calculate a good approximation of percentiles at minimal CPU and memory cost.
|
that can calculate a good approximation of percentiles at minimal CPU and memory cost.
|
||||||
Open source percentile estimation libraries include HdrHistogram,
|
Open source percentile estimation libraries include HdrHistogram,
|
||||||
t-digest [[30](/en/ch2#Dunning2021),
|
t-digest [[^30], [^31]],
|
||||||
[31](/en/ch2#Kohn2021)],
|
OpenHistogram [^32], and DDSketch [^33].
|
||||||
OpenHistogram [^32], and DDSketch
|
|
||||||
[^33].
|
|
||||||
|
|
||||||
Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from
|
Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from
|
||||||
several machines, is mathematically meaningless—the right way of aggregating response time data
|
several machines, is mathematically meaningless—the right way of aggregating response time data
|
||||||
|
|
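The warning about averaging percentiles in the hunk above can be illustrated with a small sketch. The sample data and the nearest-rank `percentile` helper are hypothetical, chosen only to make the effect visible; merging the raw measurements here stands in for whatever aggregation a real monitoring system performs.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Hypothetical response times in milliseconds from two machines.
machine_a = [10] * 1000   # busy machine, consistently fast
machine_b = [1000] * 10   # nearly idle machine, consistently slow

# Averaging the per-machine p99 values gives both machines equal weight,
# although machine_b served only about 1% of the requests.
averaged_p99 = (percentile(machine_a, 99) + percentile(machine_b, 99)) / 2

# Computing the percentile over the combined measurements weights
# every request equally.
combined_p99 = percentile(machine_a + machine_b, 99)
```

With this data the averaged value is 505 ms, while the 99th percentile over all requests is 10 ms, so the average describes neither machine nor the service as a whole.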
@ -391,9 +375,7 @@ software, typical expectations include:
|
||||||
If all those things together mean “working correctly,” then we can understand *reliability* as
|
If all those things together mean “working correctly,” then we can understand *reliability* as
|
||||||
meaning, roughly, “continuing to work correctly, even when things go wrong.” To be more precise
|
meaning, roughly, “continuing to work correctly, even when things go wrong.” To be more precise
|
||||||
about things going wrong, we will distinguish between *faults* and *failures*
|
about things going wrong, we will distinguish between *faults* and *failures*
|
||||||
[[35](/en/ch2#Heimerdinger1992),
|
[[^35], [^36], [^37]]:
|
||||||
[36](/en/ch2#Gaertner1999),
|
|
||||||
[37](/en/ch2#Avizienis2004)]:
|
|
||||||
|
|
||||||
Fault
|
Fault
|
||||||
: A fault is when a particular *part* of a system stops working correctly: for example, if a
|
: A fault is when a particular *part* of a system stops working correctly: for example, if a
|
||||||
|
|
@ -438,8 +420,7 @@ handling [^38]; by deliberately inducing faults, you ensure
|
||||||
that the fault-tolerance machinery is continually exercised and tested, which can increase your
|
that the fault-tolerance machinery is continually exercised and tested, which can increase your
|
||||||
confidence that faults will be handled correctly when they occur naturally. *Chaos engineering* is
|
confidence that faults will be handled correctly when they occur naturally. *Chaos engineering* is
|
||||||
a discipline that aims to improve confidence in fault-tolerance mechanisms through experiments such
|
a discipline that aims to improve confidence in fault-tolerance mechanisms through experiments such
|
||||||
as deliberately injecting faults
|
as deliberately injecting faults [^39].
|
||||||
[^39].
|
|
||||||
|
|
||||||
Although we generally prefer tolerating faults over preventing faults, there are cases where
|
Although we generally prefer tolerating faults over preventing faults, there are cases where
|
||||||
prevention is better than cure (e.g., because no cure exists). This is the case with security
|
prevention is better than cure (e.g., because no cure exists). This is the case with security
|
||||||
|
|
@ -452,8 +433,8 @@ cured, as described in the following sections.
|
||||||
When we think of causes of system failure, hardware faults quickly come to mind:
|
When we think of causes of system failure, hardware faults quickly come to mind:
|
||||||
|
|
||||||
* Approximately 2–5% of magnetic hard drives fail per year
|
* Approximately 2–5% of magnetic hard drives fail per year
|
||||||
[[40](/en/ch2#Pinheiro2007),
|
[[^40],
|
||||||
[41](/en/ch2#Schroeder2007)];
|
[^41]];
|
||||||
in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
|
in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
|
||||||
Recent data suggests that disks are getting more reliable, but failure rates remain significant
|
Recent data suggests that disks are getting more reliable, but failure rates remain significant
|
||||||
[^42].
|
[^42].
|
||||||
|
|
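The "one disk failure per day" estimate in the hunk above follows from simple arithmetic, sketched below. The function name is made up for this example, and the calculation assumes that failures are spread evenly over the year.

```python
def expected_failures_per_day(num_disks, annual_failure_rate):
    """Expected number of disk failures per day, assuming failures are evenly spread over a year."""
    return num_disks * annual_failure_rate / 365


# 10,000 disks with an annual failure rate of 2-5%.
low = expected_failures_per_day(10_000, 0.02)   # about 0.55 failures per day
high = expected_failures_per_day(10_000, 0.05)  # about 1.37 failures per day
```

Both ends of the range are on the order of one failure per day, which is why replacing failed disks becomes a routine operational task at this scale.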
@ -464,36 +445,22 @@ When we think of causes of system failure, hardware faults quickly come to mind:
|
||||||
but uncorrectable errors occur approximately once per year per drive, even in drives that are
|
but uncorrectable errors occur approximately once per year per drive, even in drives that are
|
||||||
fairly new (i.e., that have experienced little wear); this error rate is higher than that of
|
fairly new (i.e., that have experienced little wear); this error rate is higher than that of
|
||||||
magnetic hard drives
|
magnetic hard drives
|
||||||
[[45](/en/ch2#Schroeder2016_ch2),
|
[[^45],
|
||||||
[46](/en/ch2#Alter2019)].
|
[^46]].
|
||||||
* Other hardware components such as power supplies, RAID controllers, and memory modules also fail,
|
* Other hardware components such as power supplies, RAID controllers, and memory modules also fail,
|
||||||
although less frequently than hard drives
|
although less frequently than hard drives [^47] [^48].
|
||||||
[[47](/en/ch2#Ford2010),
|
|
||||||
[48](/en/ch2#Vishwanath2010)].
|
|
||||||
* Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result,
|
* Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result,
|
||||||
likely due to manufacturing defects
|
likely due to manufacturing defects [^49] [^50] [^51]. In some cases, an erroneous computation leads to a crash, but in other cases it leads to a program simply returning the wrong result.
|
||||||
[[49](/en/ch2#Hochschild2021),
|
|
||||||
[50](/en/ch2#Dixit2021),
|
|
||||||
[51](/en/ch2#Behrens2015)].
|
|
||||||
In some cases, an erroneous computation leads to a crash, but in other cases it leads to a program
|
|
||||||
simply returning the wrong result.
|
|
||||||
* Data in RAM can also be corrupted, either due to random events such as cosmic rays, or due to
|
* Data in RAM can also be corrupted, either due to random events such as cosmic rays, or due to
|
||||||
permanent physical defects. Even when memory with error-correcting codes (ECC) is used, more than
|
permanent physical defects. Even when memory with error-correcting codes (ECC) is used, more than
|
||||||
1% of machines encounter an uncorrectable error in a given year, which typically leads to a crash
|
1% of machines encounter an uncorrectable error in a given year, which typically leads to a crash
|
||||||
of the machine and the affected memory module needing to be replaced
|
of the machine and the affected memory module needing to be replaced [^52].
|
||||||
[^52].
|
Moreover, certain pathological memory access patterns can flip bits with high probability [^53].
|
||||||
|
|
||||||
Moreover, certain pathological memory access patterns can flip bits with high probability
|
|
||||||
[^53].
|
|
||||||
* An entire datacenter might become unavailable (for example, due to power outage or network
|
* An entire datacenter might become unavailable (for example, due to power outage or network
|
||||||
misconfiguration) or even be permanently destroyed (for example by fire, flood, or earthquake
|
misconfiguration) or even be permanently destroyed (for example by fire, flood, or earthquake [^54]).
|
||||||
[^54]).
|
|
||||||
A solar storm, which induces large electrical currents in long-distance wires when the sun ejects
|
A solar storm, which induces large electrical currents in long-distance wires when the sun ejects
|
||||||
a large mass of charged particles, could damage power grids and undersea network cables
|
a large mass of charged particles, could damage power grids and undersea network cables [^55].
|
||||||
[^55].
|
Although such large-scale failures are rare, their impact can be catastrophic if a service cannot tolerate the loss of a datacenter [^56].
|
||||||
Although such large-scale failures are rare, their impact can be catastrophic if a service cannot
|
|
||||||
tolerate the loss of a datacenter
|
|
||||||
[^56].
|
|
||||||
|
|
||||||
These events are rare enough that you often don’t need to worry about them when working on a small
|
These events are rare enough that you often don’t need to worry about them when working on a small
|
||||||
system, as long as you can easily replace hardware that becomes faulty. However, in a large-scale
|
system, as long as you can easily replace hardware that becomes faulty. However, in a large-scale
|
||||||
|
|
@ -510,10 +477,7 @@ running uninterrupted for years.
|
||||||
|
|
||||||
Redundancy is most effective when component faults are independent, that is, the occurrence of one
|
Redundancy is most effective when component faults are independent, that is, the occurrence of one
|
||||||
fault does not change how likely it is that another fault will occur. However, experience has shown
|
fault does not change how likely it is that another fault will occur. However, experience has shown
|
||||||
that there are often significant correlations between component failures
|
that there are often significant correlations between component failures [^41] [^57] [^58];
|
||||||
[[41](/en/ch2#Schroeder2007),
|
|
||||||
[57](/en/ch2#Han2021),
|
|
||||||
[58](/en/ch2#Nightingale2011)];
|
|
||||||
unavailability of an entire server rack or an entire datacenter still happens more often than we
|
unavailability of an entire server rack or an entire datacenter still happens more often than we
|
||||||
would like.
|
would like.
|
||||||
|
|
||||||
|
|
@ -543,23 +507,17 @@ upgrade*, and we will discuss it further in [Chapter 5](/en/ch5#ch_encoding).
|
||||||
Although hardware failures can be weakly correlated, they are still mostly independent: for
|
Although hardware failures can be weakly correlated, they are still mostly independent: for
|
||||||
example, if one disk fails, it’s likely that other disks in the same machine will be fine for
|
example, if one disk fails, it’s likely that other disks in the same machine will be fine for
|
||||||
another while. On the other hand, software faults are often very highly correlated, because it is
|
another while. On the other hand, software faults are often very highly correlated, because it is
|
||||||
common for many nodes to run the same software and thus have the same bugs
|
common for many nodes to run the same software and thus have the same bugs [^59] [^60].
|
||||||
[[59](/en/ch2#Gunawi2014),
|
|
||||||
[60](/en/ch2#Kreps2012_ch1)].
|
|
||||||
Such faults are harder to anticipate, and they tend to cause many more system failures than
|
Such faults are harder to anticipate, and they tend to cause many more system failures than
|
||||||
uncorrelated hardware faults [^47]. For example:
|
uncorrelated hardware faults [^47]. For example:
|
||||||
|
|
||||||
* A software bug that causes every node to fail at the same time in particular circumstances. For
|
* A software bug that causes every node to fail at the same time in particular circumstances. For
|
||||||
example, on June 30, 2012, a leap second caused many Java applications to hang simultaneously due
|
example, on June 30, 2012, a leap second caused many Java applications to hang simultaneously due
|
||||||
to a bug in the Linux kernel, bringing down many Internet services
|
to a bug in the Linux kernel, bringing down many Internet services [^61].
|
||||||
[^61].
|
|
||||||
Due to a firmware bug, all SSDs of certain models suddenly fail after precisely 32,768 hours of
|
Due to a firmware bug, all SSDs of certain models suddenly fail after precisely 32,768 hours of
|
||||||
operation (less than 4 years), rendering the data on them unrecoverable
|
operation (less than 4 years), rendering the data on them unrecoverable [^62].
|
||||||
[^62].
|
|
||||||
* A runaway process that uses up some shared, limited resource, such as CPU time, memory, disk
|
* A runaway process that uses up some shared, limited resource, such as CPU time, memory, disk
|
||||||
space, network bandwidth, or threads
|
space, network bandwidth, or threads [^63]. For example, a process that consumes too much memory while processing a large request may be
|
||||||
[^63].
|
|
||||||
For example, a process that consumes too much memory while processing a large request may be
|
|
||||||
killed by the operating system. A bug in a client library could cause a much higher request
|
killed by the operating system. A bug in a client library could cause a much higher request
|
||||||
volume than anticipated [^64].
|
volume than anticipated [^64].
|
||||||
* A service that the system depends on slows down, becomes unresponsive, or starts returning
|
* A service that the system depends on slows down, becomes unresponsive, or starts returning
|
||||||
|
|
@ -567,16 +525,12 @@ uncorrelated hardware faults [^47]. For example:
|
||||||
* An interaction between different systems results in emergent behavior that does not occur when
|
* An interaction between different systems results in emergent behavior that does not occur when
|
||||||
each system was tested in isolation [^65].
|
each system was tested in isolation [^65].
|
||||||
* Cascading failures, where a problem in one component causes another component to become overloaded
|
* Cascading failures, where a problem in one component causes another component to become overloaded
|
||||||
and slow down, which in turn brings down another component
|
and slow down, which in turn brings down another component [^66] [^67].
|
||||||
[[66](/en/ch2#Ulrich2016),
|
|
||||||
[67](/en/ch2#Fassbender2022)].
|
|
||||||
|
|
||||||
The bugs that cause these kinds of software faults often lie dormant for a long time until they are
|
The bugs that cause these kinds of software faults often lie dormant for a long time until they are
|
||||||
triggered by an unusual set of circumstances. In those circumstances, it is revealed that the
|
triggered by an unusual set of circumstances. In those circumstances, it is revealed that the
|
||||||
software is making some kind of assumption about its environment—and while that assumption is
|
software is making some kind of assumption about its environment—and while that assumption is
|
||||||
usually true, it eventually stops being true for some reason
|
usually true, it eventually stops being true for some reason [^68] [^69].
|
||||||
[[68](/en/ch2#Cook2000),
|
|
||||||
[69](/en/ch2#Woods2017)].
|
|
||||||
|
|
||||||
There is no quick solution to the problem of systematic faults in software. Lots of small things can
|
There is no quick solution to the problem of systematic faults in software. Lots of small things can
|
||||||
help: carefully thinking about assumptions and interactions in the system; thorough testing; process
|
help: carefully thinking about assumptions and interactions in the system; thorough testing; process
|
||||||
|
|
@ -590,8 +544,7 @@ human. Unlike machines, humans don’t just follow rules; their strength is bein
|
||||||
adaptive in getting their job done. However, this characteristic also leads to unpredictability, and
|
adaptive in getting their job done. However, this characteristic also leads to unpredictability, and
|
||||||
sometimes mistakes that can lead to failures, despite best intentions. For example, one study of
|
sometimes mistakes that can lead to failures, despite best intentions. For example, one study of
|
||||||
large internet services found that configuration changes by operators were the leading cause of
|
large internet services found that configuration changes by operators were the leading cause of
|
||||||
outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages
|
outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages [^70].
|
||||||
[^70].
|
|
||||||
|
|
||||||
It is tempting to label such problems as “human error” and to wish that they could be solved by
|
It is tempting to label such problems as “human error” and to wish that they could be solved by
|
||||||
better controlling human behavior through tighter procedures and compliance with rules. However,
|
better controlling human behavior through tighter procedures and compliance with rules. However,
|
||||||
|
|
@ -602,8 +555,7 @@ Often complex systems have emergent behavior, in which unexpected interactions b
|
||||||
may also lead to failures [^72].
|
may also lead to failures [^72].
|
||||||
|
|
||||||
Various technical measures can help minimize the impact of human mistakes, including thorough
|
Various technical measures can help minimize the impact of human mistakes, including thorough
|
||||||
testing (both hand-written tests and *property testing* on lots of random inputs)
|
testing (both hand-written tests and *property testing* on lots of random inputs) [^38], rollback mechanisms for quickly
|
||||||
[^38], rollback mechanisms for quickly
|
|
||||||
reverting configuration changes, gradual roll-outs of new code, detailed and clear monitoring,
|
reverting configuration changes, gradual roll-outs of new code, detailed and clear monitoring,
|
||||||
observability tools for diagnosing production issues (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)),
|
observability tools for diagnosing production issues (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)),
|
||||||
and well-designed interfaces that encourage “the right thing” and discourage “the wrong thing”.
|
and well-designed interfaces that encourage “the right thing” and discourage “the wrong thing”.
|
||||||
|
|
@ -627,8 +579,7 @@ As a general principle, when investigating an incident, you should be suspicious
|
||||||
answers. “Bob should have been more careful when deploying that change” is not productive, but
|
answers. “Bob should have been more careful when deploying that change” is not productive, but
|
||||||
neither is “We must rewrite the backend in Haskell.” Instead, management should take the opportunity
|
neither is “We must rewrite the backend in Haskell.” Instead, management should take the opportunity
|
||||||
to learn the details of how the sociotechnical system works from the point of view of the people who
|
to learn the details of how the sociotechnical system works from the point of view of the people who
|
||||||
work with it every day, and take steps to improve it based on this feedback
|
work with it every day, and take steps to improve it based on this feedback [^71].
|
||||||
[^71].
|
|
||||||
|
|
||||||
# How Important Is Reliability?
|
# How Important Is Reliability?
|
||||||
|
|
||||||
|
|
@ -637,11 +588,9 @@ are also expected to work reliably. Bugs in business applications cause lost pro
|
||||||
risks if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in
|
risks if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in
|
||||||
terms of lost revenue and damage to reputation.
|
terms of lost revenue and damage to reputation.
|
||||||
|
|
||||||
In many applications, a temporary outage of a few minutes or even a few hours is tolerable
|
In many applications, a temporary outage of a few minutes or even a few hours is tolerable [^74],
|
||||||
[^74],
|
|
||||||
but permanent data loss or corruption would be catastrophic. Consider a parent who stores all their
|
but permanent data loss or corruption would be catastrophic. Consider a parent who stores all their
|
||||||
pictures and videos of their children in your photo application
|
pictures and videos of their children in your photo application [^75]. How would they
|
||||||
[^75]. How would they
|
|
||||||
feel if that database was suddenly corrupted? Would they know how to restore it from a backup?
|
feel if that database was suddenly corrupted? Would they know how to restore it from a backup?
|
||||||
|
|
||||||
As another example of how unreliable software can harm people, consider the Post Office Horizon
|
As another example of how unreliable software can harm people, consider the Post Office Horizon
|
||||||
|
|
@ -651,8 +600,7 @@ Eventually it became clear that many of these shortfalls were due to bugs in the
|
||||||
convictions have since been overturned [^76].
|
convictions have since been overturned [^76].
|
||||||
What led to this, probably the largest miscarriage of justice in British history, is the fact that
|
What led to this, probably the largest miscarriage of justice in British history, is the fact that
|
||||||
English law assumes that computers operate correctly (and hence, evidence produced by computers is
|
English law assumes that computers operate correctly (and hence, evidence produced by computers is
|
||||||
reliable) unless there is evidence to the contrary
|
reliable) unless there is evidence to the contrary [^77].
|
||||||
[^77].
|
|
||||||
Software engineers may laugh at the idea that software could ever be bug-free, but this is little
|
Software engineers may laugh at the idea that software could ever be bug-free, but this is little
|
||||||
solace to the people who were wrongfully imprisoned, declared bankrupt, or even committed suicide as
|
solace to the people who were wrongfully imprisoned, declared bankrupt, or even committed suicide as
|
||||||
a result of a wrongful conviction due to an unreliable computer system.
|
a result of a wrongful conviction due to an unreliable computer system.
|
||||||
|
|
@ -728,8 +676,7 @@ If you can double the resources in order to handle twice the load, while keeping
|
||||||
same, we say that you have *linear scalability*, and this is considered a good thing. Occasionally
|
same, we say that you have *linear scalability*, and this is considered a good thing. Occasionally
|
||||||
it is possible to handle twice the load with less than double the resources, due to economies of
|
it is possible to handle twice the load with less than double the resources, due to economies of
|
||||||
scale or a better distribution of peak load
|
scale or a better distribution of peak load
|
||||||
[[79](/en/ch2#Warfield2023_ch2),
|
[[^79], [^80]].
|
||||||
[80](/en/ch2#Brooker2023multitenancy)].
|
|
||||||
Much more likely is that the cost grows faster than linearly, and there may be many reasons for the
|
Much more likely is that the cost grows faster than linearly, and there may be many reasons for the
|
||||||
inefficiency. For example, if you have a lot of data, then processing a single write request may
|
inefficiency. For example, if you have a lot of data, then processing a single write request may
|
||||||
involve more work than if you have a small amount of data, even if the size of the request is the
|
involve more work than if you have a small amount of data, even if the size of the request is the
|
||||||
|
|
@ -753,8 +700,7 @@ Another approach is the *shared-disk architecture*, which uses several machines
|
||||||
CPUs and RAM, but which stores data on an array of disks that is shared between the machines, which
|
CPUs and RAM, but which stores data on an array of disks that is shared between the machines, which
|
||||||
are connected via a fast network: *Network-Attached Storage* (NAS) or *Storage Area Network* (SAN).
|
are connected via a fast network: *Network-Attached Storage* (NAS) or *Storage Area Network* (SAN).
|
||||||
This architecture has traditionally been used for on-premises data warehousing workloads, but
|
This architecture has traditionally been used for on-premises data warehousing workloads, but
|
||||||
contention and the overhead of locking limit the scalability of the shared-disk approach
|
contention and the overhead of locking limit the scalability of the shared-disk approach [^81].
|
||||||
[^81].
|
|
||||||
|
|
||||||
By contrast, the *shared-nothing architecture*
|
By contrast, the *shared-nothing architecture*
|
||||||
[^82]
|
[^82]
|
||||||
|
|
@ -796,8 +742,7 @@ operate largely independently from each other. This is the underlying principle
|
||||||
(see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing
|
(see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing
|
||||||
([Link to Come]), and shared-nothing architectures. However, the challenge is in knowing where to
|
([Link to Come]), and shared-nothing architectures. However, the challenge is in knowing where to
|
||||||
draw the line between things that should be together, and things that should be apart. Design
|
draw the line between things that should be together, and things that should be apart. Design
|
||||||
guidelines for microservices can be found in other books
|
guidelines for microservices can be found in other books [^84],
|
||||||
[^84],
|
|
||||||
and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding).
|
and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding).
|
||||||
|
|
||||||
Another good principle is not to make things more complicated than necessary. If a single-machine
|
Another good principle is not to make things more complicated than necessary. If a single-machine
|
||||||
|
|
@ -817,8 +762,7 @@ bugs that need fixing.
|
||||||
It is widely recognized that the majority of the cost of software is not in its initial development,
|
It is widely recognized that the majority of the cost of software is not in its initial development,
|
||||||
but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures,
|
but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures,
|
||||||
adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding
|
adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding
|
||||||
new features [[85](/en/ch2#Ensmenger2016),
|
new features [[^85], [^86]].
|
||||||
[86](/en/ch2#Glass2002)].
|
|
||||||
|
|
||||||
However, maintenance is also difficult. If a system has been successfully running for a long time,
|
However, maintenance is also difficult. If a system has been successfully running for a long time,
|
||||||
it may well use outdated technologies that not many engineers understand today (such as mainframes
|
it may well use outdated technologies that not many engineers understand today (such as mainframes
|
||||||
|
|
@ -857,8 +801,7 @@ In large-scale systems consisting of many thousands of machines, manual maintena
|
||||||
unreasonably expensive, and automation is essential. However, automation can be a two-edged sword:
|
unreasonably expensive, and automation is essential. However, automation can be a two-edged sword:
|
||||||
there will always be edge cases (such as rare failure scenarios) that require manual intervention
|
there will always be edge cases (such as rare failure scenarios) that require manual intervention
|
||||||
from the operations team. Since the cases that cannot be handled automatically are the most complex
|
from the operations team. Since the cases that cannot be handled automatically are the most complex
|
||||||
issues, greater automation requires a *more* skilled operations team that can resolve those issues
|
issues, greater automation requires a *more* skilled operations team that can resolve those issues [^88].
|
||||||
[^88].
|
|
||||||
|
|
||||||
Moreover, if an automated system goes wrong, it is often harder to troubleshoot than a system that
|
Moreover, if an automated system goes wrong, it is often harder to troubleshoot than a system that
|
||||||
relies on an operator to perform some actions manually. For that reason, it is not the case that
|
relies on an operator to perform some actions manually. For that reason, it is not the case that
|
||||||
|
|
@ -866,8 +809,7 @@ more automation is always better for operability. However, some amount of automa
|
||||||
and the sweet spot will depend on the specifics of your particular application and organization.
|
and the sweet spot will depend on the specifics of your particular application and organization.
|
||||||
|
|
||||||
Good operability means making routine tasks easy, allowing the operations team to focus their efforts
|
Good operability means making routine tasks easy, allowing the operations team to focus their efforts
|
||||||
on high-value activities. Data systems can do various things to make routine tasks easy, including
|
on high-value activities. Data systems can do various things to make routine tasks easy, including [^89]:
|
||||||
[^89]:
|
|
||||||
|
|
||||||
* Allowing monitoring tools to check the system’s key metrics, and supporting observability tools
|
* Allowing monitoring tools to check the system’s key metrics, and supporting observability tools
|
||||||
(see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)) to give insights into the system’s runtime behavior.
|
(see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)) to give insights into the system’s runtime behavior.
|
||||||
|
|
@ -891,15 +833,13 @@ project mired in complexity is sometimes described as a *big ball of mud*
|
||||||
When complexity makes maintenance hard, budgets and schedules are often overrun. In complex
|
When complexity makes maintenance hard, budgets and schedules are often overrun. In complex
|
||||||
software, there is also a greater risk of introducing bugs when making a change: when the system is
|
software, there is also a greater risk of introducing bugs when making a change: when the system is
|
||||||
harder for developers to understand and reason about, hidden assumptions, unintended consequences,
|
harder for developers to understand and reason about, hidden assumptions, unintended consequences,
|
||||||
and unexpected interactions are more easily overlooked
|
and unexpected interactions are more easily overlooked [^69].
|
||||||
[^69].
|
|
||||||
Conversely, reducing complexity greatly improves the maintainability of software, and thus
|
Conversely, reducing complexity greatly improves the maintainability of software, and thus
|
||||||
simplicity should be a key goal for the systems we build.
|
simplicity should be a key goal for the systems we build.
|
||||||
|
|
||||||
Simple systems are easier to understand, and therefore we should try to solve a given problem in the
|
Simple systems are easier to understand, and therefore we should try to solve a given problem in the
|
||||||
simplest way possible. Unfortunately, this is easier said than done. Whether something is simple or
|
simplest way possible. Unfortunately, this is easier said than done. Whether something is simple or
|
||||||
not is often a subjective matter of taste, as there is no objective standard of simplicity
|
not is often a subjective matter of taste, as there is no objective standard of simplicity [^92].
|
||||||
[^92].
|
|
||||||
For example, one system may hide a complex implementation behind a simple interface, whereas another
|
For example, one system may hide a complex implementation behind a simple interface, whereas another
|
||||||
may have a simple implementation that exposes more internal detail to its users—which one is
|
may have a simple implementation that exposes more internal detail to its users—which one is
|
||||||
simpler?
|
simpler?
|
||||||
|
|
@ -952,13 +892,12 @@ different word to refer to agility on a data system level: *evolvability*
|
||||||
[^97].
|
[^97].
|
||||||
|
|
||||||
One major factor that makes change difficult in large systems is when some action is irreversible,
|
One major factor that makes change difficult in large systems is when some action is irreversible,
|
||||||
and therefore that action needs to be taken very carefully
|
and therefore that action needs to be taken very carefully [^98].
|
||||||
[^98].
|
|
||||||
For example, say you are migrating from one database to another: if you cannot switch back to the
|
For example, say you are migrating from one database to another: if you cannot switch back to the
|
||||||
old system in case of problems with the new one, the stakes are much higher than if you can easily go
|
old system in case of problems with the new one, the stakes are much higher than if you can easily go
|
||||||
back. Minimizing irreversibility improves flexibility.
|
back. Minimizing irreversibility improves flexibility.
|
||||||
|
|
||||||
# Summary
|
## Summary
|
||||||
|
|
||||||
In this chapter we examined several examples of nonfunctional requirements: performance,
|
In this chapter we examined several examples of nonfunctional requirements: performance,
|
||||||
reliability, scalability, and maintainability. Through these topics we have also encountered
|
reliability, scalability, and maintainability. Through these topics we have also encountered
|
||||||
|
|
@ -986,8 +925,7 @@ There are no easy answers on how to achieve these things, but one thing that can
|
||||||
applications using well-understood building blocks that provide useful abstractions. The rest of
|
applications using well-understood building blocks that provide useful abstractions. The rest of
|
||||||
this book will cover a selection of building blocks that have proved to be valuable in practice.
|
this book will cover a selection of building blocks that have proved to be valuable in practice.
|
||||||
|
|
||||||
##### References
|
### Summary
|
||||||
|
|
||||||
|
|
||||||
[^1]: Mike Cvet. [How We Learned to Stop Worrying and Love Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016.
|
[^1]: Mike Cvet. [How We Learned to Stop Worrying and Love Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016.
|
||||||
[^2]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK)
|
[^2]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK)
|
||||||
|
|
|
||||||
|
|
@ -54,12 +54,10 @@ In contrast, with most programming languages you would have to write an *algorit
|
||||||
the computer which operations to perform in which order. A declarative query language is attractive
|
the computer which operations to perform in which order. A declarative query language is attractive
|
||||||
because it is typically more concise and easier to write than an explicit algorithm. But more
|
because it is typically more concise and easier to write than an explicit algorithm. But more
|
||||||
importantly, it also hides implementation details of the query engine, which makes it possible for
|
importantly, it also hides implementation details of the query engine, which makes it possible for
|
||||||
the database system to introduce performance improvements without requiring any changes to queries
|
the database system to introduce performance improvements without requiring any changes to queries [^1].
|
||||||
[^1].
|
|
||||||
|
|
||||||
For example, a database might be able to execute a declarative query in parallel across multiple CPU
|
For example, a database might be able to execute a declarative query in parallel across multiple CPU
|
||||||
cores and machines, without you having to worry about how to implement that parallelism
|
cores and machines, without you having to worry about how to implement that parallelism [^2].
|
||||||
[^2].
|
|
||||||
In a hand-coded algorithm it would be a lot of work to implement such parallel execution yourself.
|
In a hand-coded algorithm it would be a lot of work to implement such parallel execution yourself.
|
||||||
|
|
||||||
# Relational Model versus Document Model
|
# Relational Model versus Document Model
|
||||||
|
|
@ -79,11 +77,9 @@ Over the years, there have been many competing approaches to data storage and qu
|
||||||
and early 1980s, the *network model* and the *hierarchical model* were the main alternatives, but
|
and early 1980s, the *network model* and the *hierarchical model* were the main alternatives, but
|
||||||
the relational model came to dominate them. Object databases came and went again in the late 1980s
|
the relational model came to dominate them. Object databases came and went again in the late 1980s
|
||||||
and early 1990s. XML databases appeared in the early 2000s, but have only seen niche adoption. Each
|
and early 1990s. XML databases appeared in the early 2000s, but have only seen niche adoption. Each
|
||||||
competitor to the relational model generated a lot of hype in its time, but it never lasted
|
competitor to the relational model generated a lot of hype in its time, but it never lasted [^4].
|
||||||
[^4].
|
|
||||||
Instead, SQL has grown to incorporate other data types besides its relational core—for example,
|
Instead, SQL has grown to incorporate other data types besides its relational core—for example,
|
||||||
adding support for XML, JSON, and graph data
|
adding support for XML, JSON, and graph data [^5].
|
||||||
[^5].
|
|
||||||
|
|
||||||
In the 2010s, *NoSQL* was the latest buzzword that tried to overthrow the dominance of relational
|
In the 2010s, *NoSQL* was the latest buzzword that tried to overthrow the dominance of relational
|
||||||
databases. NoSQL refers not to a single technology, but a loose set of ideas around new data models,
|
databases. NoSQL refers not to a single technology, but a loose set of ideas around new data models,
|
||||||
|
|
@ -120,8 +116,7 @@ mismatch*.
|
||||||
### Object-relational mapping (ORM)
|
### Object-relational mapping (ORM)
|
||||||
|
|
||||||
Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of
|
Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of
|
||||||
boilerplate code required for this translation layer, but they are often criticized
|
boilerplate code required for this translation layer, but they are often criticized [^6].
|
||||||
[^6].
|
|
||||||
Some commonly cited problems are:
|
Some commonly cited problems are:
|
||||||
|
|
||||||
* ORMs are complex and can’t completely hide the differences between the two models, so developers
|
* ORMs are complex and can’t completely hide the differences between the two models, so developers
|
||||||
|
|
@ -211,8 +206,7 @@ this in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_
|
||||||
The JSON representation has better *locality* than the multi-table schema in
|
The JSON representation has better *locality* than the multi-table schema in
|
||||||
[Figure 3-1](/en/ch3#fig_obama_relational) (see [“Data locality for reads and writes”](/en/ch3#sec_datamodels_document_locality)). If you want to fetch a profile
|
[Figure 3-1](/en/ch3#fig_obama_relational) (see [“Data locality for reads and writes”](/en/ch3#sec_datamodels_document_locality)). If you want to fetch a profile
|
||||||
in the relational example, you need to either perform multiple queries (query each table by
|
in the relational example, you need to either perform multiple queries (query each table by
|
||||||
`user_id`) or perform a messy multi-way join between the `users` table and its subordinate tables
|
`user_id`) or perform a messy multi-way join between the `users` table and its subordinate tables [^8].
|
||||||
[^8].
|
|
||||||
In the JSON representation, all the relevant information is in one place, making the query both
|
In the JSON representation, all the relevant information is in one place, making the query both
|
||||||
faster and simpler.
|
faster and simpler.
|
||||||
|
|
||||||
|
|
@ -227,8 +221,8 @@ structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé
|
> This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé
|
||||||
> typically has a small number of positions
|
> typically has a small number of positions
|
||||||
> [[9](/en/ch3#Zola2014),
|
> [[^9],
|
||||||
> [10](/en/ch3#Andrews2023)].
|
> [^10]].
|
||||||
> In situations where there may be a genuinely large number of related items—say, comments on a
|
> In situations where there may be a genuinely large number of related items—say, comments on a
|
||||||
> celebrity’s social media post, of which there could be many thousands—embedding them all in the same
|
> celebrity’s social media post, of which there could be many thousands—embedding them all in the same
|
||||||
> document may be too unwieldy, so the relational approach in [Figure 3-1](/en/ch3#fig_obama_relational) is preferable.
|
> document may be too unwieldy, so the relational approach in [Figure 3-1](/en/ch3#fig_obama_relational) is preferable.
|
||||||
|
|
@ -347,8 +341,7 @@ denormalized representation consistent.
|
||||||
|
|
||||||
However, the implementation of materialized timelines at X (formerly Twitter) does not store the
|
However, the implementation of materialized timelines at X (formerly Twitter) does not store the
|
||||||
actual text of each post: each entry actually only stores the post ID, the ID of the user who posted
|
actual text of each post: each entry actually only stores the post ID, the ID of the user who posted
|
||||||
it, and a little bit of extra information to identify reposts and replies
|
it, and a little bit of extra information to identify reposts and replies [^11].
|
||||||
[^11].
|
|
||||||
In other words, it is a precomputed result of (approximately) the following query:
|
In other words, it is a precomputed result of (approximately) the following query:
|
||||||
|
|
||||||
```
|
```
|
||||||
|
|
@ -363,8 +356,7 @@ This means that whenever the timeline is read, the service still needs to perfor
|
||||||
the post ID to fetch the actual post content (as well as statistics such as the number of likes
|
the post ID to fetch the actual post content (as well as statistics such as the number of likes
|
||||||
and replies), and look up the sender’s profile by ID (to get their username, profile picture, and
|
and replies), and look up the sender’s profile by ID (to get their username, profile picture, and
|
||||||
other details). This process of looking up the human-readable information by ID is called
|
other details). This process of looking up the human-readable information by ID is called
|
||||||
*hydrating* the IDs, and it is essentially a join performed in application code
|
*hydrating* the IDs, and it is essentially a join performed in application code [^11].
|
||||||
[^11].
|
|
||||||
|
|
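The hydration step described above can be sketched as a join in application code. This is a minimal Python illustration under stated assumptions: the in-memory dictionaries stand in for whatever post and user lookup services a real system would call, and every name here (`TimelineEntry`, `hydrate`, the field names) is hypothetical rather than taken from X's implementation.

```python
from dataclasses import dataclass


@dataclass
class TimelineEntry:
    """One entry of a precomputed timeline: IDs only, no post text."""
    post_id: int
    sender_id: int


# Hypothetical stand-ins for the post store and the user-profile store.
posts = {
    101: {"text": "Hello", "likes": 3},
    102: {"text": "Hi there", "likes": 0},
}
users = {
    1: {"username": "alice"},
    2: {"username": "bob"},
}
timeline = [TimelineEntry(101, 1), TimelineEntry(102, 2), TimelineEntry(999, 1)]


def hydrate(entries, posts, users):
    """Look up the human-readable data for each ID at read time."""
    hydrated = []
    for entry in entries:
        post = posts.get(entry.post_id)
        user = users.get(entry.sender_id)
        if post is None or user is None:
            continue  # the post or account no longer exists, so skip the entry
        hydrated.append({
            "post_id": entry.post_id,
            "text": post["text"],
            "likes": post["likes"],        # read fresh on every timeline read
            "username": user["username"],
        })
    return hydrated


result = hydrate(timeline, posts, users)
```

Because the like count and username are fetched on each read, the stored timeline never needs rewriting when they change.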
||||||
The reason for storing only IDs in the precomputed timeline is that the data they refer to is
|
The reason for storing only IDs in the precomputed timeline is that the data they refer to is
|
||||||
fast-changing: the number of likes and replies may change multiple times per second on a popular
|
fast-changing: the number of likes and replies may change multiple times per second on a popular
|
||||||
|
|
@ -495,8 +487,7 @@ down into subdimensions. For example, there could be separate tables for brands
|
||||||
product categories, and each row in the `dim_product` table could reference the brand and category
|
product categories, and each row in the `dim_product` table could reference the brand and category
|
||||||
as foreign keys, rather than storing them as strings in the `dim_product` table. Snowflake schemas
|
as foreign keys, rather than storing them as strings in the `dim_product` table. Snowflake schemas
|
||||||
are more normalized than star schemas, but star schemas are often preferred because
|
are more normalized than star schemas, but star schemas are often preferred because
|
||||||
they are simpler for analysts to work with
|
they are simpler for analysts to work with [^12].
|
||||||
[^12].
|
|
||||||
|
|
||||||
In a typical data warehouse, tables are often quite wide: fact tables often have over 100 columns,
|
In a typical data warehouse, tables are often quite wide: fact tables often have over 100 columns,
|
||||||
sometimes several hundred. Dimension tables can also be wide, as they include all the metadata that
|
sometimes several hundred. Dimension tables can also be wide, as they include all the metadata that
|
||||||
|
|
@ -549,9 +540,7 @@ such applications well, because the items (or their IDs) can simply be stored in
|
||||||
determine their order. In relational databases there isn’t a standard way of representing such
|
determine their order. In relational databases there isn’t a standard way of representing such
|
||||||
reorderable lists, and various tricks are used: sorting by an integer column (requiring renumbering
|
reorderable lists, and various tricks are used: sorting by an integer column (requiring renumbering
|
||||||
when you insert into the middle), a linked list of IDs, or fractional indexing
|
when you insert into the middle), a linked list of IDs, or fractional indexing
|
||||||
[[14](/en/ch3#Nelson2018),
|
[[^14], [^15], [^16]].
|
||||||
[15](/en/ch3#Wallace2017),
|
|
||||||
[16](/en/ch3#Greenspan2020)].
|
|
||||||
|
|
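To make the fractional indexing trick concrete, here is a minimal Python sketch. It assumes sort keys are exact rational numbers. Practical schemes often use strings or other compact encodings instead, and the function name `key_between` is hypothetical.

```python
from fractions import Fraction
from typing import Optional


def key_between(before: Optional[Fraction], after: Optional[Fraction]) -> Fraction:
    """Return a sort key strictly between two neighbouring keys.

    None means "no neighbour on that side" (inserting at the start or end).
    """
    if before is None and after is None:
        return Fraction(0)          # first item in an empty list
    if before is None:
        return after - 1            # insert before the first item
    if after is None:
        return before + 1           # insert after the last item
    if not before < after:
        raise ValueError("before must sort strictly before after")
    return (before + after) / 2     # midpoint: no other row needs renumbering


# Usage: insert "b" between "a" and "c" without touching their keys.
items = {"a": Fraction(0), "c": Fraction(1)}
items["b"] = key_between(items["a"], items["c"])
ordered = sorted(items, key=items.get)
```

One consequence of this simple midpoint rule is that repeatedly inserting at the same position doubles the denominator each time, so keys grow without bound unless they are occasionally rebalanced.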
||||||
### Schema flexibility in the document model
|
### Schema flexibility in the document model
|
||||||
|
|
||||||
|
|
@ -570,15 +559,13 @@ when the data is written) [^18].
|
||||||
|
|
||||||
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas
|
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas
|
||||||
schema-on-write is similar to static (compile-time) type checking. Just as the advocates of static
|
schema-on-write is similar to static (compile-time) type checking. Just as the advocates of static
|
||||||
and dynamic type checking have big debates about their relative merits
|
and dynamic type checking have big debates about their relative merits [^19],
|
||||||
[^19],
|
|
||||||
enforcement of schemas in databases is a contentious topic, and in general there’s no right or wrong
|
enforcement of schemas in databases is a contentious topic, and in general there’s no right or wrong
|
||||||
answer.
|
answer.
|
||||||
|
|
||||||
The difference between the approaches is particularly noticeable in situations where an application
|
The difference between the approaches is particularly noticeable in situations where an application
|
||||||
wants to change the format of its data. For example, say you are currently storing each user’s full
|
wants to change the format of its data. For example, say you are currently storing each user’s full
|
||||||
name in one field, and you instead want to store the first name and last name separately
|
name in one field, and you instead want to store the first name and last name separately [^20].
|
||||||
[^20].
|
|
||||||
In a document database, you would just start writing new documents with the new fields and have
|
In a document database, you would just start writing new documents with the new fields and have
|
||||||
code in the application that handles the case when old documents are read. For example:
|
code in the application that handles the case when old documents are read. For example:
|
||||||
|
|
||||||
|
|
@ -606,10 +593,7 @@ since every row needs to be rewritten, and other schema operations (such as chan
|
||||||
of a column) also typically require the entire table to be copied.
|
of a column) also typically require the entire table to be copied.
|
||||||
|
|
||||||
Various tools exist to allow this type of schema change to be performed in the background without downtime
|
Various tools exist to allow this type of schema change to be performed in the background without downtime
|
||||||
[[21](/en/ch3#Percona2023),
|
[[^21], [^22], [^23], [^24]],
|
||||||
[22](/en/ch3#Noach2016),
|
|
||||||
[23](/en/ch3#Mukherjee2022),
|
|
||||||
[24](/en/ch3#PerezAradros2023)],
|
|
||||||
but performing such migrations on large databases remains operationally challenging. Complicated
|
but performing such migrations on large databases remains operationally challenging. Complicated
|
||||||
migrations can be avoided by only adding the `first_name` column with a default value of `NULL`
|
migrations can be avoided by only adding the `first_name` column with a default value of `NULL`
|
||||||
(which is fast), and filling it in at read time, like you would with a document database.
|
(which is fast), and filling it in at read time, like you would with a document database.
|
||||||
|
|
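The read-time approach from the paragraph above can be sketched with Python's built-in `sqlite3` module. SQLite is used here only because it ships with Python; the cost of adding a column may differ in other databases. The `users` table layout, the sample rows, and the helper `first_name` are assumptions for illustration, and splitting on the first space is a deliberate oversimplification of how names work.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (id, name) VALUES (1, 'Ada Lovelace')")

# Add the new column with a NULL default; the existing row is not given a value.
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT DEFAULT NULL")

# Rows written after the change populate the new column directly.
conn.execute(
    "INSERT INTO users (id, name, first_name) VALUES (2, 'Grace Hopper', 'Grace')"
)


def first_name(conn: sqlite3.Connection, user_id: int):
    """Return the user's first name, deriving it at read time for old rows."""
    row = conn.execute(
        "SELECT name, first_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None                  # no such user
    name, first = row
    if first is None and name:
        first = name.split(" ")[0]   # old row: fall back to the full-name column
    return first
```

The application code carries the same kind of fallback that it would need with a document database, in exchange for not rewriting every existing row.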
@ -644,13 +628,11 @@ and avoid frequent small updates to a document.
|
||||||
However, the idea of storing related data together for locality is not limited to the document
|
However, the idea of storing related data together for locality is not limited to the document
|
||||||
model. For example, Google’s Spanner database offers the same locality properties in a relational
|
model. For example, Google’s Spanner database offers the same locality properties in a relational
|
||||||
data model, by allowing the schema to declare that a table’s rows should be interleaved (nested)
|
data model, by allowing the schema to declare that a table’s rows should be interleaved (nested)
|
||||||
within a parent table
|
within a parent table [^25].
|
||||||
[^25].
|
|
||||||
Oracle allows the same, using a feature called *multi-table index cluster tables*
|
Oracle allows the same, using a feature called *multi-table index cluster tables*
|
||||||
[^26].
|
[^26].
|
||||||
The *wide-column* data model popularized by Google’s Bigtable, and used e.g. in HBase and Accumulo,
|
The *wide-column* data model popularized by Google’s Bigtable, and used e.g. in HBase and Accumulo,
|
||||||
has a concept of *column families*, which have a similar purpose of managing locality
|
has a concept of *column families*, which have a similar purpose of managing locality [^27].
|
||||||
[^27].
|
|
||||||
|
|
||||||
### Query languages for documents
|
### Query languages for documents
|
||||||
|
|
||||||
|
|
@ -660,10 +642,7 @@ varied. Some allow only key-value access by primary key, while others also offer
|
||||||
to query for values inside documents, and some provide rich query languages.
|
to query for values inside documents, and some provide rich query languages.
|
||||||
|
|
||||||
XML databases are often queried using XQuery and XPath, which are designed to allow complex queries,
|
XML databases are often queried using XQuery and XPath, which are designed to allow complex queries,
|
||||||
including joins across multiple documents, and also format their results as XML
|
including joins across multiple documents, and also format their results as XML [^28]. JSON Pointer [^29] and JSONPath [^30] provide an equivalent to XPath for JSON.
|
||||||
[^28]. JSON Pointer
|
|
||||||
[^29] and JSONPath
|
|
||||||
[^30] provide an equivalent to XPath for JSON.
|
|
||||||
|
|
||||||
MongoDB’s aggregation pipeline, whose `$lookup` operator for joins we saw in
|
MongoDB’s aggregation pipeline, whose `$lookup` operator for joins we saw in
|
||||||
[“Normalization, Denormalization, and Joins”](/en/ch3#sec_datamodels_normalization), is an example of a query language for collections of JSON
|
[“Normalization, Denormalization, and Joins”](/en/ch3#sec_datamodels_normalization), is an example of a query language for collections of JSON
|
||||||
|
|
@ -713,8 +692,7 @@ matter of taste.
|
||||||
### Convergence of document and relational databases
|
### Convergence of document and relational databases
|
||||||
|
|
||||||
Document databases and relational databases started out as very different approaches to data
|
Document databases and relational databases started out as very different approaches to data
|
||||||
management, but they have grown more similar over time
|
management, but they have grown more similar over time [^31].
|
||||||
[^31].
|
|
||||||
Relational databases added support for JSON types and query operators, and the ability to index
|
Relational databases added support for JSON types and query operators, and the ability to index
|
||||||
properties inside documents. Some document databases (such as MongoDB, Couchbase, and RethinkDB)
|
properties inside documents. Some document databases (such as MongoDB, Couchbase, and RethinkDB)
|
||||||
added support for joins, secondary indexes, and declarative query languages.
|
added support for joins, secondary indexes, and declarative query languages.
|
||||||
|
|
@ -759,8 +737,7 @@ Road or rail networks
|
||||||
Well-known algorithms can operate on these graphs: for example, map navigation apps search for
|
Well-known algorithms can operate on these graphs: for example, map navigation apps search for
|
||||||
the shortest path between two points in a road network, and
|
the shortest path between two points in a road network, and
|
||||||
PageRank can be used on the web graph to determine the
|
PageRank can be used on the web graph to determine the
|
||||||
popularity of a web page and thus its ranking in search results
|
popularity of a web page and thus its ranking in search results [^32].
|
||||||
[^32].
|
|
||||||
|
|
||||||
Graphs can be represented in several different ways. In the *adjacency list* model, each vertex
|
Graphs can be represented in several different ways. In the *adjacency list* model, each vertex
|
||||||
stores the IDs of its neighbor vertices that are one edge away. Alternatively, you can use an
|
stores the IDs of its neighbor vertices that are one edge away. Alternatively, you can use an
|
||||||
|
|
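To make the adjacency list model concrete, here is a minimal sketch that stores each vertex's neighbor IDs in a dictionary and finds a path with the fewest edges using breadth-first search. The vertex names and the `shortest_path` function are hypothetical. Real navigation apps work with weighted edges and use algorithms such as Dijkstra's, so this shows only the unweighted case.

```python
from collections import deque


def shortest_path(adjacency, start, goal):
    """Breadth-first search over an adjacency list; returns a list of vertex IDs or None."""
    previous = {start: None}  # vertex -> vertex we reached it from
    queue = deque([start])
    while queue:
        vertex = queue.popleft()
        if vertex == goal:
            # Walk the predecessor links back to the start.
            path = []
            while vertex is not None:
                path.append(vertex)
                vertex = previous[vertex]
            return path[::-1]
        for neighbor in adjacency.get(vertex, []):
            if neighbor not in previous:
                previous[neighbor] = vertex
                queue.append(neighbor)
    return None


# Each vertex stores the IDs of the vertices that are one edge away.
roads = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": ["E"], "E": []}
print(shortest_path(roads, "A", "E"))
```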
@ -786,16 +763,14 @@ types of objects in a single database. For example:
|
||||||
as Wikidata, also publish graph data in a structured form.
|
as Wikidata, also publish graph data in a structured form.
|
||||||
|
|
||||||
There are several different, but related, ways of structuring and querying data in graphs. In this
|
There are several different, but related, ways of structuring and querying data in graphs. In this
|
||||||
section we will discuss the *property graph* model (implemented by Neo4j, Memgraph, KùzuDB
|
section we will discuss the *property graph* model (implemented by Neo4j, Memgraph, KùzuDB [^35],
|
||||||
[^35],
|
|
||||||
and others [^36])
|
and others [^36])
|
||||||
and the *triple-store* model (implemented by Datomic, AllegroGraph, Blazegraph, and others). These
|
and the *triple-store* model (implemented by Datomic, AllegroGraph, Blazegraph, and others). These
|
||||||
models are fairly similar in what they can express, and some graph databases (such as Amazon
|
models are fairly similar in what they can express, and some graph databases (such as Amazon
|
||||||
Neptune) support both models.
|
Neptune) support both models.
|
||||||
|
|
||||||
We will also look at four query languages for graphs (Cypher, SPARQL, Datalog, and GraphQL), as well
|
We will also look at four query languages for graphs (Cypher, SPARQL, Datalog, and GraphQL), as well
|
||||||
as SQL support for querying graphs. Other graph query languages exist, such as Gremlin
|
as SQL support for querying graphs. Other graph query languages exist, such as Gremlin [^37],
|
||||||
[^37],
|
|
||||||
but these will give us a representative overview.
|
but these will give us a representative overview.
|
||||||
|
|
||||||
To illustrate these different languages and models, this section uses the graph shown in
|
To illustrate these different languages and models, this section uses the graph shown in
|
||||||
|
|
@ -899,11 +874,9 @@ extended to accommodate changes in your application’s data structures.
|
||||||
*Cypher* is a query language for property graphs, originally created for the Neo4j graph database,
|
*Cypher* is a query language for property graphs, originally created for the Neo4j graph database,
|
||||||
and later developed into an open standard as *openCypher*
|
and later developed into an open standard as *openCypher*
|
||||||
[^38].
|
[^38].
|
||||||
Besides Neo4j, Cypher is supported by Memgraph, KùzuDB
|
Besides Neo4j, Cypher is supported by Memgraph, KùzuDB [^35],
|
||||||
[^35],
|
|
||||||
Amazon Neptune, Apache AGE (with storage in PostgreSQL), and others. It is named after a character
|
Amazon Neptune, Apache AGE (with storage in PostgreSQL), and others. It is named after a character
|
||||||
in the movie *The Matrix* and is not related to ciphers in cryptography
|
in the movie *The Matrix* and is not related to ciphers in cryptography [^39].
|
||||||
[^39].
|
|
||||||
|
|
||||||
[Example 3-4](/en/ch3#fig_cypher_create) shows the Cypher query to insert the lefthand portion of
|
[Example 3-4](/en/ch3#fig_cypher_create) shows the Cypher query to insert the lefthand portion of
|
||||||
[Figure 3-6](/en/ch3#fig_datamodels_graph) into a graph database. The rest of the graph can be added similarly. Each
|
[Figure 3-6](/en/ch3#fig_datamodels_graph) into a graph database. The rest of the graph can be added similarly. Each
|
||||||
|
|
@ -1071,11 +1044,8 @@ Oracle has a different SQL extension for recursive queries, which it calls *hier
|
||||||
[^41].
|
[^41].
|
||||||
|
|
||||||
However, the situation may be improving: at the time of writing, there are plans to add a graph
|
However, the situation may be improving: at the time of writing, there are plans to add a graph
|
||||||
query language called GQL to the SQL standard [[42](/en/ch3#Deutsch2022),
|
query language called GQL to the SQL standard [[^42], [^43]],
|
||||||
[43](/en/ch3#Green2019)],
|
which will provide a syntax inspired by Cypher, GSQL [^44], and PGQL [^45].
|
||||||
which will provide a syntax inspired by Cypher, GSQL
|
|
||||||
[^44], and PGQL
|
|
||||||
[^45].
|
|
||||||
|
|
||||||
## Triple-Stores and SPARQL
|
## Triple-Stores and SPARQL
|
||||||
|
|
||||||
|
|
@ -1109,8 +1079,7 @@ The subject of a triple is equivalent to a vertex in a graph. The object is one
|
||||||
> book nevertheless calls them triple-stores.
|
> book nevertheless calls them triple-stores.
|
||||||
|
|
||||||
[Example 3-7](/en/ch3#fig_graph_n3_triples) shows the same data as in [Example 3-4](/en/ch3#fig_cypher_create), written as
|
[Example 3-7](/en/ch3#fig_graph_n3_triples) shows the same data as in [Example 3-4](/en/ch3#fig_cypher_create), written as
|
||||||
triples in a format called *Turtle*, a subset of *Notation3* (*N3*)
|
triples in a format called *Turtle*, a subset of *Notation3* (*N3*) [^48].
|
||||||
[^48].
|
|
||||||
|
|
||||||
##### Example 3-7. A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as Turtle triples
|
##### Example 3-7. A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as Turtle triples
|
||||||
|
|
||||||
|
|
@ -1158,16 +1127,12 @@ Some of the research and development effort on triple stores was motivated by th
|
||||||
early-2000s effort to facilitate internet-wide data exchange by publishing data not only as
|
early-2000s effort to facilitate internet-wide data exchange by publishing data not only as
|
||||||
human-readable web pages, but also in a standardized, machine-readable format. Although the Semantic
|
human-readable web pages, but also in a standardized, machine-readable format. Although the Semantic
|
||||||
Web as originally envisioned did not succeed
|
Web as originally envisioned did not succeed
|
||||||
[[49](/en/ch3#Target2018),
|
[[^49], [^50]],
|
||||||
[50](/en/ch3#MendelGleason2022)],
|
|
||||||
the legacy of the Semantic Web project lives on in a couple of specific technologies: *linked data*
|
the legacy of the Semantic Web project lives on in a couple of specific technologies: *linked data*
|
||||||
standards such as JSON-LD [^51],
|
standards such as JSON-LD [^51],
|
||||||
*ontologies* used in biomedical science
|
*ontologies* used in biomedical science [^52],
|
||||||
[^52],
|
Facebook’s Open Graph protocol [^53]
|
||||||
Facebook’s Open Graph protocol
|
(which is used for link unfurling [^54]),
|
||||||
[^53]
|
|
||||||
(which is used for link unfurling
|
|
||||||
[^54]),
|
|
||||||
knowledge graphs such as Wikidata, and standardized vocabularies for structured data maintained by
|
knowledge graphs such as Wikidata, and standardized vocabularies for structured data maintained by
|
||||||
[`schema.org`](https://schema.org/).
|
[`schema.org`](https://schema.org/).
|
||||||
|
|
||||||
|
|
@ -1178,8 +1143,7 @@ for applications.
|
||||||
### The RDF data model
|
### The RDF data model
|
||||||
|
|
||||||
The Turtle language we used in [Example 3-8](/en/ch3#fig_graph_n3_shorthand) is actually a way of encoding data in the
|
The Turtle language we used in [Example 3-8](/en/ch3#fig_graph_n3_shorthand) is actually a way of encoding data in the
|
||||||
*Resource Description Framework* (RDF)
|
*Resource Description Framework* (RDF) [^55],
|
||||||
[^55],
|
|
||||||
a data model that was designed for the Semantic Web. RDF data can also be encoded in other ways, for
|
a data model that was designed for the Semantic Web. RDF data can also be encoded in other ways, for
|
||||||
example (more verbosely) in XML, as shown in [Example 3-9](/en/ch3#fig_graph_rdf_xml). Tools like Apache Jena can
|
example (more verbosely) in XML, as shown in [Example 3-9](/en/ch3#fig_graph_rdf_xml). Tools like Apache Jena can
|
||||||
automatically convert between different RDF encodings.
|
automatically convert between different RDF encodings.
|
||||||
|
|
@ -1229,8 +1193,7 @@ just specify this prefix once at the top of the file, and then forget about it.
|
||||||
|
|
||||||
### The SPARQL query language
|
### The SPARQL query language
|
||||||
|
|
||||||
*SPARQL* is a query language for triple-stores using the RDF data model
|
*SPARQL* is a query language for triple-stores using the RDF data model [^56].
|
||||||
[^56].
|
|
||||||
(It is an acronym for *SPARQL Protocol and RDF Query Language*, pronounced “sparkle.”)
|
(It is an acronym for *SPARQL Protocol and RDF Query Language*, pronounced “sparkle.”)
|
||||||
It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite
|
It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite
|
||||||
similar.
|
similar.
|
||||||
|
|
@ -1275,9 +1238,7 @@ various other triple stores [^36].
|
||||||
## Datalog: Recursive Relational Queries
|
## Datalog: Recursive Relational Queries
|
||||||
|
|
||||||
Datalog is a much older language than SPARQL or Cypher: it arose from academic research in the 1980s
|
Datalog is a much older language than SPARQL or Cypher: it arose from academic research in the 1980s
|
||||||
[[57](/en/ch3#Green2013),
|
[[^57], [^58], [^59]].
|
||||||
[58](/en/ch3#Ceri1989),
|
|
||||||
[59](/en/ch3#Abiteboul1995)].
|
|
||||||
It is less well known among software engineers and not widely supported in mainstream databases, but
|
It is less well known among software engineers and not widely supported in mainstream databases, but
|
||||||
it ought to be better known since it is a very expressive language that is particularly powerful for
|
it ought to be better known since it is a very expressive language that is particularly powerful for
|
||||||
complex queries. Several niche databases, including Datomic, LogicBlox, CozoDB, and LinkedIn’s
|
complex queries. Several niche databases, including Datomic, LogicBlox, CozoDB, and LinkedIn’s
|
||||||
|
|
@ -1397,8 +1358,7 @@ APIs.
|
||||||
|
|
||||||
GraphQL’s flexibility comes at a cost. Organizations that adopt GraphQL often need tooling to
|
GraphQL’s flexibility comes at a cost. Organizations that adopt GraphQL often need tooling to
|
||||||
convert GraphQL queries into requests to internal services, which often use REST or gRPC (see
|
convert GraphQL queries into requests to internal services, which often use REST or gRPC (see
|
||||||
[Chapter 5](/en/ch5#ch_encoding)). Authorization, rate limiting, and performance challenges are additional concerns
|
[Chapter 5](/en/ch5#ch_encoding)). Authorization, rate limiting, and performance challenges are additional concerns [^61].
|
||||||
[^61].
|
|
||||||
GraphQL’s query language is also limited since GraphQL queries come from an untrusted source. The language
|
GraphQL’s query language is also limited since GraphQL queries come from an untrusted source. The language
|
||||||
does not allow anything that could be expensive to execute, since otherwise users could perform
|
does not allow anything that could be expensive to execute, since otherwise users could perform
|
||||||
denial-of-service attacks on a server by running lots of expensive queries. In particular, GraphQL
|
denial-of-service attacks on a server by running lots of expensive queries. In particular, GraphQL
|
||||||
|
|
@ -1538,8 +1498,7 @@ the status of each booking, another that computes charts for the conference orga
|
||||||
and a third that generates files for the printer that produces the attendees’ badges.
|
and a third that generates files for the printer that produces the attendees’ badges.
|
||||||
|
|
||||||
The idea of using events as the source of truth, and expressing every state change as an event, is
|
The idea of using events as the source of truth, and expressing every state change as an event, is
|
||||||
known as *event sourcing* [[62](/en/ch3#Betts2012),
|
known as *event sourcing* [[^62], [^63]].
|
||||||
[63](/en/ch3#Young2014)].
|
|
||||||
The principle of maintaining separate read-optimized representations and deriving them from the
|
The principle of maintaining separate read-optimized representations and deriving them from the
|
||||||
write-optimized representation is called *command query responsibility segregation (CQRS)*
|
write-optimized representation is called *command query responsibility segregation (CQRS)*
|
||||||
[^64].
|
[^64].
|
||||||
|
|
@ -1692,17 +1651,15 @@ like. Dataframes are flexible enough to allow data to be gradually evolved from
|
||||||
into a matrix representation, while giving the data scientist control over the representation that
|
into a matrix representation, while giving the data scientist control over the representation that
|
||||||
is most suitable for achieving the goals of the data analysis or model training process.
|
is most suitable for achieving the goals of the data analysis or model training process.
|
||||||
|
|
||||||
There are also databases such as TileDB
|
There are also databases such as TileDB [^66]
|
||||||
[^66]
|
|
||||||
that specialize in storing large multidimensional arrays of numbers; they are called *array
|
that specialize in storing large multidimensional arrays of numbers; they are called *array
|
||||||
databases* and are most commonly used for scientific datasets such as geospatial measurements
|
databases* and are most commonly used for scientific datasets such as geospatial measurements
|
||||||
(raster data on a regularly spaced grid), medical imaging, or observations from astronomical
|
(raster data on a regularly spaced grid), medical imaging, or observations from astronomical
|
||||||
telescopes [^67].
|
telescopes [^67].
|
||||||
Dataframes are also used in the financial industry for representing *time series data*, such as the
|
Dataframes are also used in the financial industry for representing *time series data*, such as the
|
||||||
prices of assets and trades over time
|
prices of assets and trades over time [^68].
|
||||||
[^68].
|
|
||||||
|
|
||||||
# Summary
|
## Summary
|
||||||
|
|
||||||
Data models are a huge subject, and in this chapter we have taken a quick look at a broad variety of
|
Data models are a huge subject, and in this chapter we have taken a quick look at a broad variety of
|
||||||
different models. We didn’t have space to go into all the details of each model, but hopefully the
|
different models. We didn’t have space to go into all the details of each model, but hopefully the
|
||||||
|
|
@ -1764,10 +1721,11 @@ a few brief examples:
|
||||||
We have to leave it there for now. In the next chapter we will discuss some of the trade-offs that
|
We have to leave it there for now. In the next chapter we will discuss some of the trade-offs that
|
||||||
come into play when *implementing* the data models described in this chapter.
|
come into play when *implementing* the data models described in this chapter.
|
||||||
|
|
||||||
##### Footnotes
|
|
||||||
|
|
||||||
|
|
||||||
##### References
|
|
||||||
|
### Summary
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -123,8 +123,7 @@ possible write operation. Any kind of index usually slows down writes, because t
|
||||||
to be updated every time data is written.
|
to be updated every time data is written.
|
||||||
|
|
||||||
This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but
|
This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but
|
||||||
every index consumes additional disk space and slows down writes, sometimes substantially
|
every index consumes additional disk space and slows down writes, sometimes substantially [^1].
|
||||||
[^1].
|
|
||||||
For this reason, databases don’t usually index everything by default, but require you—the person
|
For this reason, databases don’t usually index everything by default, but require you—the person
|
||||||
writing the application or administering the database—to choose indexes manually, using your
|
writing the application or administering the database—to choose indexes manually, using your
|
||||||
knowledge of the application’s typical query patterns. You can then choose the indexes that give
|
knowledge of the application’s typical query patterns. You can then choose the indexes that give
|
||||||
|
|
@ -177,8 +176,7 @@ Now you do not need to keep all the keys in memory: you can group the key-value
|
||||||
SSTable into *blocks* of a few kilobytes, and then store the first key of each block in the index.
|
SSTable into *blocks* of a few kilobytes, and then store the first key of each block in the index.
|
||||||
This kind of index, which stores only some of the keys, is called *sparse*. This index is stored in
|
This kind of index, which stores only some of the keys, is called *sparse*. This index is stored in
|
||||||
a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data
|
a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data
|
||||||
structure that allows queries to quickly look up a particular key
|
structure that allows queries to quickly look up a particular key [^4].
|
||||||
[^4].
|
|
||||||
|
|
||||||
For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the
|
For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the
|
||||||
first key of the next block is `handsome`. Now say you’re looking for the key `handiwork`, which
|
first key of the next block is `handsome`. Now say you’re looking for the key `handiwork`, which
|
||||||
|
|
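The sparse index lookup can be sketched as a binary search over the first key of each block. The key list, the byte offsets, and the `find_block` helper below are made up for illustration and do not reflect the on-disk format of any particular storage engine.

```python
import bisect

# Hypothetical sparse index: the first key of each block and the block's byte offset.
block_first_keys = ["apple", "handbag", "handsome", "zebra"]
block_offsets = [0, 4096, 8192, 12288]


def find_block(key):
    """Return the offset of the only block that could contain `key`, or None."""
    # Index of the last block whose first key is <= key.
    i = bisect.bisect_right(block_first_keys, key) - 1
    if i < 0:
        return None  # key sorts before the first key in the SSTable
    return block_offsets[i]


# `handiwork` is not in the index, but it sorts between `handbag` and `handsome`,
# so if it exists at all it must be in the block that starts with `handbag`.
print(find_block("handiwork"))
```

After locating the block, a reader would still need to scan that block to find out whether the key is actually present.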
@ -219,8 +217,7 @@ log and a sorted file:
|
||||||
4. From time to time, run a merging and compaction process in the background to combine segment files
|
4. From time to time, run a merging and compaction process in the background to combine segment files
|
||||||
and to discard overwritten or deleted values.
|
and to discard overwritten or deleted values.
|
||||||
|
|
||||||
Merging segments works similarly to the *mergesort* algorithm
|
Merging segments works similarly to the *mergesort* algorithm [^5]. The process is illustrated in
|
||||||
[^5]. The process is illustrated in
|
|
||||||
[Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key
|
[Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key
|
||||||
in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If
|
in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If
|
||||||
the same key appears in more than one input file, keep only the more recent value. This produces a
|
the same key appears in more than one input file, keep only the more recent value. This produces a
|
||||||
|
|
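As a rough sketch of the merge step, the code below combines several sorted segments and keeps only the most recent value for each key. It assumes that each segment is an in-memory list of `(key, value)` pairs with unique keys, and that the segments are passed in order from oldest to newest. A real engine would stream the segments from disk and would also handle tombstones, which are omitted here.

```python
import heapq


def merge_segments(segments):
    """Merge sorted segments (ordered oldest to newest) into one sorted list."""
    tagged = []
    for age, segment in enumerate(segments):
        # Negate the age so that, for equal keys, the newest segment sorts first.
        tagged.append([(key, -age, value) for key, value in segment])

    merged = []
    for key, _neg_age, value in heapq.merge(*tagged, key=lambda e: (e[0], e[1])):
        if merged and merged[-1][0] == key:
            continue  # an older value for a key we have already emitted
        merged.append((key, value))
    return merged


older = [("handbag", 1), ("handful", 2)]
newer = [("handbag", 9), ("handicap", 3)]
print(merge_segments([older, newer]))
```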
@ -242,18 +239,14 @@ called a *tombstone* to the data file. When log segments are merged, the tombsto
|
||||||
process to discard any previous values for the deleted key. Once the tombstone is merged into the
|
process to discard any previous values for the deleted key. Once the tombstone is merged into the
|
||||||
oldest segment, it can be dropped.
|
oldest segment, it can be dropped.
|
||||||
|
|
||||||
The algorithm described here is essentially what is used in RocksDB
|
The algorithm described here is essentially what is used in RocksDB [^7],
|
||||||
[^7],
|
Cassandra, Scylla, and HBase [^8],
|
||||||
Cassandra, Scylla, and HBase
|
all of which were inspired by Google’s Bigtable paper [^9]
|
||||||
[^8],
|
|
||||||
all of which were inspired by Google’s Bigtable paper
|
|
||||||
[^9]
|
|
||||||
(which introduced the terms *SSTable* and *memtable*).
|
(which introduced the terms *SSTable* and *memtable*).
|
||||||
|
|
||||||
The algorithm was originally published in 1996 under the name *Log-Structured Merge-Tree* or *LSM-Tree*
|
The algorithm was originally published in 1996 under the name *Log-Structured Merge-Tree* or *LSM-Tree*
|
||||||
[^10],
|
[^10],
|
||||||
building on earlier work on log-structured filesystems
|
building on earlier work on log-structured filesystems [^11].
|
||||||
[^11].
|
|
||||||
For this reason, storage engines that are based on the principle of merging and compacting sorted
|
For this reason, storage engines that are based on the principle of merging and compacting sorted
|
||||||
files are often called *LSM storage engines*.
|
files are often called *LSM storage engines*.
|
||||||
|
|
||||||
|
|
@ -265,8 +258,7 @@ requests to using the new merged segment instead of the old segments, and then t
|
||||||
can be deleted.
|
can be deleted.
|
||||||
|
|
||||||
The segment files don’t necessarily have to be stored on local disk: they are also well suited for
|
The segment files don’t necessarily have to be stored on local disk: they are also well suited for
|
||||||
writing to object storage. SlateDB and Delta Lake
|
writing to object storage. SlateDB and Delta Lake [^12]
|
||||||
[^12]
|
|
||||||
take this approach, for example.
|
take this approach, for example.
|
||||||
|
|
||||||
Having immutable segment files also simplifies crash recovery: if a crash happens while writing out
|
Having immutable segment files also simplifies crash recovery: if a crash happens while writing out
|
||||||
|
|
@ -287,8 +279,7 @@ appears in a particular SSTable.
|
||||||
|
|
||||||
[Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in
|
[Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in
|
||||||
reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash
|
reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash
|
||||||
function, producing a set of numbers that are then interpreted as indexes into the array of bits
|
function, producing a set of numbers that are then interpreted as indexes into the array of bits [^14].
|
||||||
[^14].
|
|
||||||
We set the bits corresponding to those indexes to 1, and leave the rest as 0. For example, the key
|
We set the bits corresponding to those indexes to 1, and leave the rest as 0. For example, the key
|
||||||
`handbag` hashes to the numbers (2, 9, 4), so we set the 2nd, 9th, and 4th bits to 1. The bitmap
|
`handbag` hashes to the numbers (2, 9, 4), so we set the 2nd, 9th, and 4th bits to 1. The bitmap
|
||||||
is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of
|
is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of
|
||||||
|
|
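A minimal Bloom filter along these lines might look as follows. The choice of SHA-256 with a seed prefix to derive the bit positions is an assumption made for this sketch, so the positions it produces will not match the (2, 9, 4) shown in the figure. Production implementations typically use faster non-cryptographic hashes and a packed bit array.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _indexes(self, key):
        # Derive num_hashes bit positions from the key (hypothetical hashing scheme).
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for i in self._indexes(key):
            self.bits[i] = 1

    def might_contain(self, key):
        # False means definitely absent; True means present or a false positive.
        return all(self.bits[i] for i in self._indexes(key))


bloom = BloomFilter()
bloom.add("handbag")
bloom.add("handheld")
print(bloom.might_contain("handbag"))
```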
@ -311,8 +302,7 @@ as if a key is present, even though it isn’t, is called a *false positive*.
|
||||||
|
|
||||||
The probability of false positives depends on the number of keys, the number of bits set per key,
|
The probability of false positives depends on the number of keys, the number of bits set per key,
|
||||||
and the total number of bits in the Bloom filter. You can use an online calculator tool to work out
|
and the total number of bits in the Bloom filter. You can use an online calculator tool to work out
|
||||||
the right parameters for your application
|
the right parameters for your application [^15].
|
||||||
[^15].
|
|
||||||
As a rule of thumb, you need to allocate 10 bits of Bloom filter space for every key in the SSTable
|
As a rule of thumb, you need to allocate 10 bits of Bloom filter space for every key in the SSTable
|
||||||
to get a false positive probability of 1%, and the probability is reduced tenfold for every 5
|
to get a false positive probability of 1%, and the probability is reduced tenfold for every 5
|
||||||
additional bits you allocate per key.
|
additional bits you allocate per key.
|
||||||
|
|
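The rule of thumb can be checked against the standard approximation for Bloom filters. Assuming the optimal number of hash functions, k = (m/n) ln 2 for m bits and n keys, the false positive probability is roughly exp(-(m/n) (ln 2)^2). The function name below is made up for this sketch, and the formula is the usual textbook approximation rather than something stated in this chapter.

```python
import math


def false_positive_rate(bits_per_key):
    """Approximate false positive probability, assuming the optimal number of hashes."""
    return math.exp(-bits_per_key * math.log(2) ** 2)


for bits in (10, 15, 20):
    print(bits, false_positive_rate(bits))
```

Under this approximation, 10 bits per key gives a rate a little under 1%, and each additional 5 bits reduces it by a factor of about 11, which is consistent with the "tenfold" rule of thumb.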
@ -331,8 +321,7 @@ In the context of an LSM storage engines, false positives are no problem:
|
||||||
An important detail is how the LSM storage engine chooses when to perform compaction, and which SSTables to
|
An important detail is how the LSM storage engine chooses when to perform compaction, and which SSTables to
|
||||||
include in a compaction. Many LSM-based storage systems allow you to configure which compaction
|
include in a compaction. Many LSM-based storage systems allow you to configure which compaction
|
||||||
strategy to use, and some of the common choices are
|
strategy to use, and some of the common choices are
|
||||||
[[16](/en/ch4#Luo2019),
|
[[^16], [^17]]:
|
||||||
[17](/en/ch4#Sarkar2022)]:
|
|
||||||
|
|
||||||
Size-tiered compaction
|
Size-tiered compaction
|
||||||
: Newer and smaller SSTables are successively merged into older and larger SSTables. The SSTables
|
: Newer and smaller SSTables are successively merged into older and larger SSTables. The SSTables
|
||||||
|
|
@ -360,16 +349,14 @@ Many databases run as a service that accepts queries over a network, but there a
|
||||||
databases that don’t expose a network API. Instead, they are libraries that run in the same process
|
databases that don’t expose a network API. Instead, they are libraries that run in the same process
|
||||||
as your application code, typically reading and writing files on the local disk, and you interact
|
as your application code, typically reading and writing files on the local disk, and you interact
|
||||||
with them through normal function calls. Examples of embedded storage engines include RocksDB,
|
with them through normal function calls. Examples of embedded storage engines include RocksDB,
|
||||||
SQLite, LMDB, DuckDB, and KùzuDB
|
SQLite, LMDB, DuckDB, and KùzuDB [^19].
|
||||||
[^19].
|
|
||||||
|
|
||||||
Embedded databases are very commonly used in mobile apps to store the local user’s data. On the
|
Embedded databases are very commonly used in mobile apps to store the local user’s data. On the
|
||||||
backend, they can be an appropriate choice if the data is small enough to fit on a single machine,
|
backend, they can be an appropriate choice if the data is small enough to fit on a single machine,
|
||||||
and if there are not many concurrent transactions. For example, in a multitenant system in which
|
and if there are not many concurrent transactions. For example, in a multitenant system in which
|
||||||
each tenant is small enough and completely separate from others (i.e., you do not need to run
|
each tenant is small enough and completely separate from others (i.e., you do not need to run
|
||||||
queries that combine data from multiple tenants), you can potentially use a separate embedded
|
queries that combine data from multiple tenants), you can potentially use a separate embedded
|
||||||
database instance per tenant
|
database instance per tenant [^20].
|
||||||
[^20].
|
|
||||||
|
|
||||||
The storage and retrieval methods we discuss in this chapter are used in both embedded and in
|
The storage and retrieval methods we discuss in this chapter are used in both embedded and in
|
||||||
client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques
|
client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques
|
||||||
|
|
@ -381,8 +368,7 @@ The log-structured approach is popular, but it is not the only form of key-value
|
||||||
widely used structure for reading and writing database records by key is the *B-tree*.
|
widely used structure for reading and writing database records by key is the *B-tree*.
|
||||||
|
|
||||||
Introduced in 1970 [^21]
|
Introduced in 1970 [^21]
|
||||||
and called “ubiquitous” less than 10 years later
|
and called “ubiquitous” less than 10 years later [^22],
|
||||||
[^22],
|
|
||||||
B-trees have stood the test of time very well. They remain the standard index implementation in
|
B-trees have stood the test of time very well. They remain the standard index implementation in
|
||||||
almost all relational databases, and many nonrelational databases use them too.
|
almost all relational databases, and many nonrelational databases use them too.
|
||||||
|
|
||||||
|
|
@ -441,8 +427,7 @@ the new key), and a page for 337–344. We also have to update the parent page t
|
||||||
both children, with a boundary value of 337 between them. If the parent page doesn’t have enough
|
both children, with a boundary value of 337 between them. If the parent page doesn’t have enough
|
||||||
space for the new reference, it may also need to be split, and the splits can continue all the way
|
space for the new reference, it may also need to be split, and the splits can continue all the way
|
||||||
to the root of the tree. When the root is split, we make a new root above it. Deleting keys (which
|
to the root of the tree. When the root is split, we make a new root above it. Deleting keys (which
|
||||||
may require nodes to be merged) is more complex
|
may require nodes to be merged) is more complex [^5].
|
||||||
[^5].
|
|
||||||
|
|
||||||
This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth
|
This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth
|
||||||
of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so
|
of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so
|
||||||
|
|
@ -467,8 +452,7 @@ In order to make the database resilient to crashes, it is common for B-tree impl
|
||||||
include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file
|
include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file
|
||||||
to which every B-tree modification must be written before it can be applied to the pages of the tree
|
to which every B-tree modification must be written before it can be applied to the pages of the tree
|
||||||
itself. When the database comes back up after a crash, this log is used to restore the B-tree back
|
itself. When the database comes back up after a crash, this log is used to restore the B-tree back
|
||||||
to a consistent state [[2](/en/ch4#Graefe2011),
|
to a consistent state [[^2], [^24]].
|
||||||
[24](/en/ch4#Mohan1992)].
|
|
||||||
In filesystems, the equivalent mechanism is known as *journaling*.
|
In filesystems, the equivalent mechanism is known as *journaling*.
|
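The write-ahead rule can be illustrated with a minimal sketch. It assumes a toy store whose "pages" are just a Python dict, and a log of JSON lines; the `TinyWAL` class, its record format, and the `demo` function are hypothetical names chosen for illustration, not the design of any particular database.

```python
import json
import os
import tempfile


class TinyWAL:
    """Toy store: every change is appended to a log before it is applied."""

    def __init__(self, log_path):
        self.log_path = log_path
        self.pages = {}  # stands in for the B-tree pages (assumption)
        self._replay()   # rebuild state from the log after a restart
        self.log = open(log_path, "a", encoding="utf-8")

    def _replay(self):
        if not os.path.exists(self.log_path):
            return
        with open(self.log_path, encoding="utf-8") as f:
            for line in f:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    break  # incomplete final record left by a crash: stop here
                self.pages[record["key"]] = record["value"]

    def put(self, key, value):
        # 1. Append the change to the log and force it to stable storage.
        self.log.write(json.dumps({"key": key, "value": value}) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Only now apply the change to the pages.
        self.pages[key] = value

    def close(self):
        self.log.close()


def demo():
    path = os.path.join(tempfile.mkdtemp(), "wal.log")
    db = TinyWAL(path)
    db.put("a", 1)
    db.put("b", 2)
    db.close()  # pretend the process crashed here and lost db.pages
    recovered = TinyWAL(path)  # replaying the log restores both writes
    result = dict(recovered.pages)
    recovered.close()
    return result
```

A real implementation logs page-level changes rather than whole key-value pairs, and it truncates the log once the modified pages have been safely written; the sketch omits both.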
||||||
|
|
||||||
To improve performance, B-tree implementations typically don’t immediately write every modified page
|
To improve performance, B-tree implementations typically don’t immediately write every modified page
|
||||||
|
|
@ -501,8 +485,7 @@ mention just a few:
|
||||||
## Comparing B-Trees and LSM-Trees
|
## Comparing B-Trees and LSM-Trees
|
||||||
|
|
||||||
As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads
|
As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads
|
||||||
[[27](/en/ch4#Athanassoulis2016),
|
[[^27], [^28]].
|
||||||
[28](/en/ch4#Stopford2015)].
|
|
||||||
However, benchmarks are often sensitive to details of the workload. You need to test systems with
|
However, benchmarks are often sensitive to details of the workload. You need to test systems with
|
||||||
your particular workload in order to make a valid comparison. Moreover, it’s not a strict either/or
|
your particular workload in order to make a valid comparison. Moreover, it’s not a strict either/or
|
||||||
choice between LSM and B-trees: storage engines sometimes blend characteristics of both approaches,
|
choice between LSM and B-trees: storage engines sometimes blend characteristics of both approaches,
|
||||||
|
|
@ -522,21 +505,18 @@ Range queries are simple and fast on B-trees, as they can use the sorted structu
|
||||||
LSM storage, range queries can also take advantage of the SSTable sorting, but they need to scan all
|
LSM storage, range queries can also take advantage of the SSTable sorting, but they need to scan all
|
||||||
the segments in parallel and combine the results. Bloom filters don’t help for range queries (since
|
the segments in parallel and combine the results. Bloom filters don’t help for range queries (since
|
||||||
you would need to compute the hash of every possible key within the range, which is impractical),
|
you would need to compute the hash of every possible key within the range, which is impractical),
|
||||||
making range queries more expensive than point queries in the LSM approach
|
making range queries more expensive than point queries in the LSM approach [^29].
|
||||||
[^29].
|
|
||||||
|
|
||||||
High write throughput can cause latency spikes in a log-structured storage engine if the
|
High write throughput can cause latency spikes in a log-structured storage engine if the
|
||||||
memtable fills up. This happens if data can’t be written out to disk fast enough, perhaps because
|
memtable fills up. This happens if data can’t be written out to disk fast enough, perhaps because
|
||||||
the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB,
|
the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB,
|
||||||
perform *backpressure* in this situation: they suspend all reads and writes until the memtable has
|
perform *backpressure* in this situation: they suspend all reads and writes until the memtable has
|
||||||
been written out to disk
|
been written out to disk
|
||||||
[[30](/en/ch4#Balmau2019),
|
[[^30], [^31]].
|
||||||
[31](/en/ch4#RocksDBTuning)].
|
|
||||||
|
|
||||||
Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read
|
Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read
|
||||||
requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but
|
requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but
|
||||||
storage engines need to be carefully designed to take advantage of this parallelism
|
storage engines need to be carefully designed to take advantage of this parallelism [^32].
|
||||||
[^32].
|
|
||||||
|
|
||||||
### Sequential vs. random writes
|
### Sequential vs. random writes
|
||||||
|
|
||||||
|
|
@ -568,17 +548,14 @@ The reason is that flash memory can be read or written one page (typically 4 Ki
|
||||||
but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block
|
but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block
|
||||||
may contain valid data, whereas others may contain data that is no longer needed. Before erasing a
|
may contain valid data, whereas others may contain data that is no longer needed. Before erasing a
|
||||||
block, the controller must first move pages containing valid data into other blocks; this process is
|
block, the controller must first move pages containing valid data into other blocks; this process is
|
||||||
called *garbage collection* (GC)
|
called *garbage collection* (GC) [^33].
|
||||||
[^33].
|
|
||||||
|
|
||||||
A sequential write workload writes larger chunks of data at a time, so it is likely that a whole
|
A sequential write workload writes larger chunks of data at a time, so it is likely that a whole
|
||||||
512 KiB block belongs to a single file; when that file is later deleted again, the whole block
|
512 KiB block belongs to a single file; when that file is later deleted again, the whole block
|
||||||
can be erased without having to perform any GC. On the other hand, with a random write workload, it
|
can be erased without having to perform any GC. On the other hand, with a random write workload, it
|
||||||
is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
|
is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
|
||||||
to perform more work before a block can be erased
|
to perform more work before a block can be erased
|
||||||
[[34](/en/ch4#Vanlightly2023nvme),
|
[[^34], [^35], [^36]].
|
||||||
[35](/en/ch4#Alibaba2019_ch4),
|
|
||||||
[36](/en/ch4#Hu2010)].
|
|
||||||
|
|
||||||
The write bandwidth consumed by GC is then not available for the application. Moreover, the
|
The write bandwidth consumed by GC is then not available for the application. Moreover, the
|
||||||
additional writes performed by GC contribute to wear on the flash memory; therefore, random writes
|
additional writes performed by GC contribute to wear on the flash memory; therefore, random writes
|
||||||
|
|
@ -591,14 +568,12 @@ operations on the underlying disk. With LSM-trees, a value is first written to t
|
||||||
durability, then again when the memtable is written to disk, and again every time the key-value pair
|
durability, then again when the memtable is written to disk, and again every time the key-value pair
|
||||||
is part of a compaction. (If the values are significantly larger than the keys, this overhead can be
|
is part of a compaction. (If the values are significantly larger than the keys, this overhead can be
|
||||||
reduced by storing values separately from keys, and performing compaction only on SSTables
|
reduced by storing values separately from keys, and performing compaction only on SSTables
|
||||||
containing keys and references to values
|
containing keys and references to values [^37].)
|
||||||
[^37].)
|
|
||||||
|
|
||||||
A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once
|
A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once
|
||||||
to the tree page itself. In addition, they sometimes need to write out an entire page, even if only
|
to the tree page itself. In addition, they sometimes need to write out an entire page, even if only
|
||||||
a few bytes in that page changed, to ensure the B-tree can be correctly recovered after a crash or
|
a few bytes in that page changed, to ensure the B-tree can be correctly recovered after a crash or
|
||||||
power failure [[38](/en/ch4#Zaitsev2006),
|
power failure [[^38], [^39]].
|
||||||
[39](/en/ch4#Vondra2016)].
|
|
||||||
|
|
||||||
If you take the total number of bytes written to disk in some workload, and divide by the number of
|
If you take the total number of bytes written to disk in some workload, and divide by the number of
|
||||||
bytes you would have to write if you simply wrote an append-only log with no index, you get the
|
bytes you would have to write if you simply wrote an append-only log with no index, you get the
|
||||||
|
|
@ -610,8 +585,7 @@ handle within the available disk bandwidth.
|
||||||
Write amplification is a problem in both LSM-trees and B-trees. Which one is better depends on
|
Write amplification is a problem in both LSM-trees and B-trees. Which one is better depends on
|
||||||
various factors, such as the length of your keys and values, and how often you overwrite existing
|
various factors, such as the length of your keys and values, and how often you overwrite existing
|
||||||
keys versus insert new ones. For typical workloads, LSM-trees tend to have lower write amplification
|
keys versus insert new ones. For typical workloads, LSM-trees tend to have lower write amplification
|
||||||
because they don’t have to write entire pages and they can compress chunks of the SSTable
|
because they don’t have to write entire pages and they can compress chunks of the SSTable [^40].
|
||||||
[^40].
|
|
||||||
This is another factor that makes LSM storage engines well suited for write-heavy workloads.
|
This is another factor that makes LSM storage engines well suited for write-heavy workloads.
|
||||||
|
|
||||||
Besides affecting throughput, write amplification is also relevant for the wear on SSDs: a storage
|
Besides affecting throughput, write amplification is also relevant for the wear on SSDs: a storage
|
||||||
|
|
@ -636,8 +610,7 @@ the data files anyway, and SSTables don’t have pages with unused space. Moreov
|
||||||
key-value pairs can better be compressed in SSTables, and thus often produce smaller files on disk
|
key-value pairs can better be compressed in SSTables, and thus often produce smaller files on disk
|
||||||
than B-trees. Keys and values that have been overwritten continue to consume space until they are
|
than B-trees. Keys and values that have been overwritten continue to consume space until they are
|
||||||
removed by a compaction, but this overhead is quite low when using leveled compaction
|
removed by a compaction, but this overhead is quite low when using leveled compaction
|
||||||
[[40](/en/ch4#Callaghan2015),
|
[[^40], [^41]].
|
||||||
[41](/en/ch4#Callaghan2016rocksdb)].
|
|
||||||
Size-tiered compaction (see [“Compaction strategies”](/en/ch4#sec_storage_lsm_compaction)) uses more disk space, especially
|
Size-tiered compaction (see [“Compaction strategies”](/en/ch4#sec_storage_lsm_compaction)) uses more disk space, especially
|
||||||
temporarily during compaction.
|
temporarily during compaction.
|
||||||
|
|
||||||
|
|
@ -737,11 +710,9 @@ easily be backed up, inspected, and analyzed by external utilities.
|
||||||
Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model,
|
Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model,
|
||||||
and the vendors claim that they can offer big performance improvements by removing all the overheads
|
and the vendors claim that they can offer big performance improvements by removing all the overheads
|
||||||
associated with managing on-disk data structures
|
associated with managing on-disk data structures
|
||||||
[[46](/en/ch4#Stonebraker2007),
|
[[^46], [^47]].
|
||||||
[47](/en/ch4#VoltDB2014uj)].
|
|
||||||
RAMCloud is an open source, in-memory key-value store with durability (using a log-structured
|
RAMCloud is an open source, in-memory key-value store with durability (using a log-structured
|
||||||
approach for the data in memory as well as the data on disk)
|
approach for the data in memory as well as the data on disk) [^48].
|
||||||
[^48].
|
|
||||||
|
|
||||||
Redis and Couchbase provide weak durability by writing to disk asynchronously.
|
Redis and Couchbase provide weak durability by writing to disk asynchronously.
|
||||||
|
|
||||||
|
|
@ -749,8 +720,7 @@ Counterintuitively, the performance advantage of in-memory databases is not due
|
||||||
they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk
|
they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk
|
||||||
if you have enough memory, because the operating system caches recently used disk blocks in memory
|
if you have enough memory, because the operating system caches recently used disk blocks in memory
|
||||||
anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data
|
anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data
|
||||||
structures in a form that can be written to disk
|
structures in a form that can be written to disk [^49].
|
||||||
[^49].
|
|
||||||
|
|
||||||
Besides performance, another interesting area for in-memory databases is providing data models that
|
Besides performance, another interesting area for in-memory databases is providing data models that
|
||||||
are difficult to implement with disk-based indexes. For example, Redis offers a database-like
|
are difficult to implement with disk-based indexes. For example, Redis offers a database-like
|
||||||
|
|
@ -774,10 +744,7 @@ transaction processing and data warehousing in the same product. However, these
|
||||||
and analytical processing (HTAP) databases (introduced in [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are increasingly
|
and analytical processing (HTAP) databases (introduced in [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are increasingly
|
||||||
becoming two separate storage and query engines, which happen to be accessible through a common SQL
|
becoming two separate storage and query engines, which happen to be accessible through a common SQL
|
||||||
interface
|
interface
|
||||||
[[50](/en/ch4#Larson2013),
|
[[^50], [^51], [^52], [^53]].
|
||||||
[51](/en/ch4#Farber2012),
|
|
||||||
[52](/en/ch4#Stonebraker2013),
|
|
||||||
[53](/en/ch4#Prout2022_ch4)].
|
|
||||||
|
|
||||||
## Cloud Data Warehouses
|
## Cloud Data Warehouses
|
||||||
|
|
||||||
|
|
@ -790,16 +757,14 @@ of scalable cloud infrastructure like object storage and serverless computation
|
||||||
Cloud data warehouses tend to integrate better with other cloud services and to be more elastic.
|
Cloud data warehouses tend to integrate better with other cloud services and to be more elastic.
|
||||||
For example, many cloud warehouses support automatic log ingestion, and offer easy integration with
|
For example, many cloud warehouses support automatic log ingestion, and offer easy integration with
|
||||||
data processing frameworks such as Google Cloud’s Dataflow or Amazon Web Services’ Kinesis. These
|
data processing frameworks such as Google Cloud’s Dataflow or Amazon Web Services’ Kinesis. These
|
||||||
warehouses are also more elastic because they decouple query computation from the storage layer
|
warehouses are also more elastic because they decouple query computation from the storage layer [^54].
|
||||||
[^54].
|
|
||||||
Data is persisted on object storage rather than local disks, which makes it easy to adjust storage
|
Data is persisted on object storage rather than local disks, which makes it easy to adjust storage
|
||||||
capacity and compute resources for queries independently, as we previously saw in
|
capacity and compute resources for queries independently, as we previously saw in
|
||||||
[“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native).
|
[“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native).
|
||||||
|
|
||||||
Open source data warehouses such as Apache Hive, Trino, and Apache Spark have also evolved with the
|
Open source data warehouses such as Apache Hive, Trino, and Apache Spark have also evolved with the
|
||||||
cloud. As data storage for analytics has moved to data lakes on object storage, open source warehouses
|
cloud. As data storage for analytics has moved to data lakes on object storage, open source warehouses
|
||||||
have begun to break apart
|
have begun to break apart [^55]. The following
|
||||||
[^55]. The following
|
|
||||||
components, which were previously integrated in a single system such as Apache Hive, are now often
|
components, which were previously integrated in a single system such as Apache Hive, are now often
|
||||||
implemented as separate components:
|
implemented as separate components:
|
||||||
|
|
||||||
|
|
@ -844,8 +809,7 @@ efficiently becomes a challenging problem. Dimension tables are usually much sma
|
||||||
rows), so in this section we will focus on storage of facts.
|
rows), so in this section we will focus on storage of facts.
|
||||||
|
|
||||||
Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4
|
Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4
|
||||||
or 5 of them at one time (`"SELECT *"` queries are rarely needed for analytics)
|
or 5 of them at one time (`"SELECT *"` queries are rarely needed for analytics) [^52]. Take the query in
|
||||||
[^52]. Take the query in
|
|
||||||
[Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone
|
[Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone
|
||||||
buying fruit or candy during the 2024 calendar year), but it only needs to access three columns of
|
buying fruit or candy during the 2024 calendar year), but it only needs to access three columns of
|
||||||
the `fact_sales` table: `date_key`, `product_sk`,
|
the `fact_sales` table: `date_key`, `product_sk`,
|
||||||
|
|
@ -882,8 +846,7 @@ memory, parse them, and filter out those that don’t meet the required conditio
|
||||||
long time.
|
long time.
|
||||||
|
|
||||||
The idea behind *column-oriented* (or *columnar*) storage is simple: don’t store all the values from
|
The idea behind *column-oriented* (or *columnar*) storage is simple: don’t store all the values from
|
||||||
one row together, but store all the values from each *column* together instead
|
one row together, but store all the values from each *column* together instead [^56].
|
||||||
[^56].
|
|
||||||
If each column is stored separately, a query only needs to read and parse those columns that are
|
If each column is stored separately, a query only needs to read and parse those columns that are
|
||||||
used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using
|
used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using
|
||||||
an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema).
|
an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema).
|
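As a rough sketch of the difference between the two layouts, the following code pivots a handful of rows into one list per column and then answers a query by reading only two of those lists. The sample values, the `store_sk` and `quantity` columns, and the function names are assumptions made for illustration.

```python
# Row-oriented layout: all values of one row are stored together.
ROWS = [
    {"date_key": 240102, "product_sk": 69, "store_sk": 4, "quantity": 1},
    {"date_key": 240102, "product_sk": 69, "store_sk": 5, "quantity": 3},
    {"date_key": 240102, "product_sk": 74, "store_sk": 3, "quantity": 1},
    {"date_key": 240103, "product_sk": 31, "store_sk": 2, "quantity": 5},
]


def to_columns(rows):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {name: [] for name in rows[0]}
    for row in rows:
        for name, value in row.items():
            columns[name].append(value)
    return columns


def total_quantity(columns, product_sk):
    """Sum the quantity for one product, touching only two columns."""
    return sum(
        quantity
        for product, quantity in zip(columns["product_sk"], columns["quantity"])
        if product == product_sk
    )


def row_at(columns, index):
    """Reassemble one row by taking the same position from every column."""
    return {name: values[index] for name, values in columns.items()}
```

Because every column list keeps the rows in the same order, `row_at` can rebuild a complete row from its position, while `total_quantity` never looks at `date_key` or `store_sk`.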
||||||
|
|
@ -907,33 +870,24 @@ individual columns and put them together to form the 23rd row of the table.
|
||||||
|
|
||||||
In fact, columnar storage engines don’t actually store an entire column (containing perhaps
|
In fact, columnar storage engines don’t actually store an entire column (containing perhaps
|
||||||
trillions of rows) in one go. Instead, they break the table into blocks of thousands or millions of
|
trillions of rows) in one go. Instead, they break the table into blocks of thousands or millions of
|
||||||
rows, and within each block they store the values from each column separately
|
rows, and within each block they store the values from each column separately [^60].
|
||||||
[^60].
|
|
||||||
Since many queries are restricted to a particular date range, it is common to make each block
|
Since many queries are restricted to a particular date range, it is common to make each block
|
||||||
contain the rows for a particular timestamp range. A query then only needs to load the columns it
|
contain the rows for a particular timestamp range. A query then only needs to load the columns it
|
||||||
needs in those blocks that overlap with the required date range.
|
needs in those blocks that overlap with the required date range.
|
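The block-skipping idea can be sketched as follows, assuming each block records the smallest and largest `date_key` it contains. The `Block` class, its field names, and the `quantity` column are hypothetical; real file formats store comparable per-block statistics in their own metadata structures.

```python
from dataclasses import dataclass


@dataclass
class Block:
    min_date: int  # smallest date_key among the rows in this block
    max_date: int  # largest date_key among the rows in this block
    columns: dict  # column name -> list of values for this block's rows


def blocks_to_scan(blocks, start, end):
    """Keep only blocks whose date range overlaps [start, end]."""
    return [b for b in blocks if b.max_date >= start and b.min_date <= end]


def sum_column(blocks, column, start, end):
    """Sum one column over rows in [start, end], skipping other blocks."""
    total = 0
    for block in blocks_to_scan(blocks, start, end):
        dates = block.columns["date_key"]
        values = block.columns[column]
        for date, value in zip(dates, values):
            if start <= date <= end:  # a block may only partly overlap
                total += value
    return total


BLOCKS = [
    Block(20240101, 20240131,
          {"date_key": [20240105, 20240120], "quantity": [2, 3]}),
    Block(20240201, 20240229,
          {"date_key": [20240210], "quantity": [7]}),
    Block(20240301, 20240331,
          {"date_key": [20240315], "quantity": [11]}),
]
```

With these sample blocks, a query for mid-January to mid-February only opens the first two blocks, and within them only the `date_key` and `quantity` columns.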
||||||
|
|
||||||
Columnar storage is used in almost all analytic databases nowadays
|
Columnar storage is used in almost all analytic databases nowadays [^60],
|
||||||
[^60],
|
ranging from large-scale cloud data warehouses such as Snowflake [^61]
|
||||||
ranging from large-scale cloud data warehouses such as Snowflake
|
to single-node embedded databases such as DuckDB [^62],
|
||||||
[^61]
|
and product analytics systems such as Pinot [^63]
|
||||||
to single-node embedded databases such as DuckDB
|
|
||||||
[^62],
|
|
||||||
and product analytics systems such as Pinot
|
|
||||||
[^63]
|
|
||||||
and Druid [^64].
|
and Druid [^64].
|
||||||
It is used in storage formats such as Parquet, ORC
|
It is used in storage formats such as Parquet, ORC
|
||||||
[[65](/en/ch4#Liu2023),
|
[[^65], [^66]],
|
||||||
[66](/en/ch4#Zeng2023)],
|
|
||||||
Lance [^67],
|
Lance [^67],
|
||||||
and Nimble [^68],
|
and Nimble [^68],
|
||||||
and in-memory analytics formats like Apache Arrow
|
and in-memory analytics formats like Apache Arrow
|
||||||
[[65](/en/ch4#Liu2023),
|
[[^65], [^69]]
|
||||||
[69](/en/ch4#McKinney2021)]
|
|
||||||
and Pandas/NumPy [^70].
|
and Pandas/NumPy [^70].
|
||||||
Some time-series databases, such as InfluxDB IOx
|
Some time-series databases, such as InfluxDB IOx [^71] and TimescaleDB [^72],
|
||||||
[^71] and TimescaleDB
|
|
||||||
[^72],
|
|
||||||
are also based on column-oriented storage.
|
are also based on column-oriented storage.
|
||||||
|
|
||||||
### Column Compression
|
### Column Compression
|
||||||
|
|
@ -961,8 +915,7 @@ One option is to store those bitmaps using one bit per row. However, these bitma
|
||||||
a lot of zeros (we say that they are *sparse*). In that case, the bitmaps can additionally be
|
a lot of zeros (we say that they are *sparse*). In that case, the bitmaps can additionally be
|
||||||
run-length encoded: counting the number of consecutive zeros or ones and storing that number, as
|
run-length encoded: counting the number of consecutive zeros or ones and storing that number, as
|
||||||
shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the
|
shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the
|
||||||
two bitmap representations, using whichever is the most compact
|
two bitmap representations, using whichever is the most compact [^73].
|
||||||
[^73].
|
|
||||||
This can make the encoding of a column remarkably efficient.
|
This can make the encoding of a column remarkably efficient.
|
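Run-length encoding of a sparse bitmap can be sketched in a few lines. The code below assumes one common convention, in which the first number counts leading zeros and the runs then alternate between ones and zeros; the function names are illustrative only.

```python
def run_length_encode(bits):
    """Encode a sequence of 0/1 values as alternating run lengths.

    The first run always counts zeros, so a bitmap that starts with a one
    begins with a run of length 0.
    """
    runs = []
    current = 0  # value of the run being counted
    count = 0
    for bit in bits:
        if bit == current:
            count += 1
        else:
            runs.append(count)
            current = bit
            count = 1
    runs.append(count)
    return runs


def run_length_decode(runs):
    """Expand alternating run lengths back into the original bits."""
    bits = []
    value = 0
    for count in runs:
        bits.extend([value] * count)
        value = 1 - value
    return bits
```

For a column with long stretches of identical values, the list of runs is far shorter than the bitmap itself, which is where the space saving comes from.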
||||||
|
|
||||||
Bitmap indexes such as these are very well suited for the kinds of queries that are common in a data
|
Bitmap indexes such as these are very well suited for the kinds of queries that are common in a data
|
||||||
|
|
@ -1046,9 +999,7 @@ Queries need to examine both the column data on disk and the recent writes in me
|
||||||
the two. The query execution engine hides this distinction from the user. From an analyst’s point
|
the two. The query execution engine hides this distinction from the user. From an analyst’s point
|
||||||
of view, data that has been modified with inserts, updates, or deletes is immediately reflected in
|
of view, data that has been modified with inserts, updates, or deletes is immediately reflected in
|
||||||
subsequent queries. Snowflake, Vertica, Apache Pinot, Apache Druid, and many others do this
|
subsequent queries. Snowflake, Vertica, Apache Pinot, Apache Druid, and many others do this
|
||||||
[[61](/en/ch4#Dageville2016), [63](/en/ch4#Im2018),
|
[[^61], [^63], [^64], [^76]].
|
||||||
[64](/en/ch4#Yang2014),
|
|
||||||
[76](/en/ch4#Lamb2012)].
|
|
||||||
|
|
||||||
## Query Execution: Compilation and Vectorization
|
## Query Execution: Compilation and Vectorization
|
||||||
|
|
||||||
|
|
@ -1068,8 +1019,7 @@ the amount of data they need to read off disk, but also the CPU time required to
|
||||||
operators. The simplest kind of operator is like an interpreter for a programming language: while
|
operators. The simplest kind of operator is like an interpreter for a programming language: while
|
||||||
iterating over each row, it checks a data structure representing the query to find out which
|
iterating over each row, it checks a data structure representing the query to find out which
|
||||||
comparisons or calculations it needs to perform on which columns. Unfortunately, this is too slow
|
comparisons or calculations it needs to perform on which columns. Unfortunately, this is too slow
|
||||||
for many analytics purposes. Two alternative approaches for efficient query execution have emerged
|
for many analytics purposes. Two alternative approaches for efficient query execution have emerged [^77]:
|
||||||
[^77]:
|
|
||||||
|
|
||||||
Query compilation
|
Query compilation
|
||||||
: The query engine takes the SQL query and generates code for executing it. The code iterates over
|
: The query engine takes the SQL query and generates code for executing it. The code iterates over
|
||||||
|
|
@ -1084,7 +1034,7 @@ Vectorized processing
|
||||||
: The query is interpreted, not compiled, but it is made fast by processing many values from a
|
: The query is interpreted, not compiled, but it is made fast by processing many values from a
|
||||||
column in a batch, instead of iterating over rows one by one. A fixed set of predefined operators
|
column in a batch, instead of iterating over rows one by one. A fixed set of predefined operators
|
||||||
are built into the database; we can pass arguments to them and get back a batch of results
|
are built into the database; we can pass arguments to them and get back a batch of results
|
||||||
[[50](/en/ch4#Larson2013), [75](/en/ch4#Abadi2013)].
|
[[^50], [^75]].
|
||||||
|
|
||||||
For example, we could pass the `product_sk` column and the ID of “bananas” to an equality operator,
|
For example, we could pass the `product_sk` column and the ID of “bananas” to an equality operator,
|
||||||
and get back a bitmap (one bit per value in the input column, which is 1 if it’s a banana); we could
|
and get back a bitmap (one bit per value in the input column, which is 1 if it’s a banana); we could
|
||||||
|
|
@ -1107,8 +1057,8 @@ performance by taking advantages of the characteristics of modern CPUs:
|
||||||
function calls) to keep the CPU instruction processing pipeline busy and avoid branch
|
function calls) to keep the CPU instruction processing pipeline busy and avoid branch
|
||||||
mispredictions,
|
mispredictions,
|
||||||
* making use of parallelism such as multiple threads and single-instruction-multi-data (SIMD)
|
* making use of parallelism such as multiple threads and single-instruction-multi-data (SIMD)
|
||||||
instructions [[79](/en/ch4#Boncz2005),
|
instructions [[^79],
|
||||||
[80](/en/ch4#Zhou2002)], and
|
[^80]], and
|
||||||
* operating directly on compressed data without decoding it into a separate in-memory
|
* operating directly on compressed data without decoding it into a separate in-memory
|
||||||
representation, which saves memory allocation and copying costs.
|
representation, which saves memory allocation and copying costs.
|
||||||
|
|
||||||
|
|
@ -1123,8 +1073,7 @@ expanded query.
|
||||||
|
|
||||||
When the underlying data changes, a materialized view needs to be updated accordingly. Some
|
When the underlying data changes, a materialized view needs to be updated accordingly. Some
|
||||||
databases can do that automatically, and there are also systems such as Materialize that specialize
|
databases can do that automatically, and there are also systems such as Materialize that specialize
|
||||||
in materialized view maintenance
|
in materialized view maintenance [^81].
|
||||||
[^81].
|
|
||||||
Performing such updates means more work on writes, but materialized views can improve read
|
Performing such updates means more work on writes, but materialized views can improve read
|
||||||
performance in workloads that repeatedly need to perform the same queries.
|
performance in workloads that repeatedly need to perform the same queries.
|
||||||
|
|
||||||
|
|
@ -1133,8 +1082,7 @@ discussed earlier, data warehouse queries often involve an aggregate function, s
|
||||||
`AVG`, `MIN`, or `MAX` in SQL. If the same aggregates are used by many different queries, it can be
|
`AVG`, `MIN`, or `MAX` in SQL. If the same aggregates are used by many different queries, it can be
|
||||||
wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that
|
wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that
|
||||||
queries use most often? A *data cube* or *OLAP cube* does this by creating a grid of aggregates
|
queries use most often? A *data cube* or *OLAP cube* does this by creating a grid of aggregates
|
||||||
grouped by different dimensions
|
grouped by different dimensions [^82].
|
||||||
[^82].
|
|
||||||
[Figure 4-10](/en/ch4#fig_data_cube) shows an example.
|
[Figure 4-10](/en/ch4#fig_data_cube) shows an example.
|
||||||
|
|
||||||

|

|
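A two-dimensional cube of this kind can be sketched as a dictionary keyed by the two dimensions. The fact tuples, the use of integer amounts, and the function names are assumptions for illustration, not how any particular warehouse stores its cubes.

```python
from collections import defaultdict


def build_cube(facts):
    """Precompute SUM(amount) for every (date_key, product_sk) cell."""
    cube = defaultdict(int)
    for date_key, product_sk, amount in facts:
        cube[(date_key, product_sk)] += amount
    return dict(cube)


def total_by_date(cube, date_key):
    """Collapse the product dimension: total for one date."""
    return sum(v for (d, _), v in cube.items() if d == date_key)


def total_by_product(cube, product_sk):
    """Collapse the date dimension: total for one product."""
    return sum(v for (_, p), v in cube.items() if p == product_sk)


FACTS = [
    (240101, 31, 10),
    (240101, 32, 5),
    (240102, 31, 7),
    (240101, 31, 3),
]
```

Once the cube exists, totals per date or per product are computed from the precomputed cells rather than from the raw facts, which is much less work when each cell summarizes many rows.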
||||||
|
|
@ -1197,16 +1145,12 @@ longitude), or all the restaurants in a range of longitudes (but anywhere betwee
|
||||||
South poles), but not both simultaneously.
|
South poles), but not both simultaneously.
|
||||||
|
|
||||||
One option is to translate a two-dimensional location into a single number using a space-filling
|
One option is to translate a two-dimensional location into a single number using a space-filling
|
||||||
curve, and then to use a regular B-tree index
|
curve, and then to use a regular B-tree index [^83].
|
||||||
[^83].
|
More commonly, specialized spatial indexes such as R-trees or Bkd-trees [^84]
|
||||||
More commonly, specialized spatial indexes such as R-trees or Bkd-trees
|
|
||||||
[^84]
|
|
||||||
are used; they divide up the space so that nearby data points tend to be grouped in the same
|
are used; they divide up the space so that nearby data points tend to be grouped in the same
|
||||||
subtree. For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL’s
|
subtree. For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL’s
|
||||||
Generalized Search Tree indexing facility
|
Generalized Search Tree indexing facility [^85].
|
||||||
[^85].
|
It is also possible to use regularly spaced grids of triangles, squares, or hexagons [^86].
|
||||||
It is also possible to use regularly spaced grids of triangles, squares, or hexagons
|
|
||||||
[^86].
|
|
||||||
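One way to picture the space-filling-curve option is the sketch below. It assumes a Z-order (Morton) curve, which is only one of several possible curves and is not named in the text; the function names and the 16-bit grid resolution are illustrative choices. The resulting integer could be stored in an ordinary B-tree, where nearby locations tend to (but do not always) receive nearby keys.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into a single Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bits of x land on even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # bits of y land on odd positions
    return key


def location_key(lat: float, lon: float, bits: int = 16) -> int:
    """Quantize latitude/longitude onto a grid, then interleave the cell coordinates."""
    cells = (1 << bits) - 1
    x = int((lon + 180.0) / 360.0 * cells)  # longitude mapped to 0..cells
    y = int((lat + 90.0) / 180.0 * cells)   # latitude mapped to 0..cells
    return interleave_bits(x, y, bits)
```

A range scan over such keys still returns some false positives, so a real query would filter the candidates by their actual coordinates afterwards.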
|
|
||||||
Multi-dimensional indexes are not just for geographic locations. For example, on an ecommerce
|
Multi-dimensional indexes are not just for geographic locations. For example, on an ecommerce
|
||||||
website you could use a three-dimensional index on the dimensions (*red*, *green*, *blue*) to search
|
website you could use a three-dimensional index on the dimensions (*red*, *green*, *blue*) to search
|
||||||
|
|
@ -1215,14 +1159,12 @@ two-dimensional index on (*date*, *temperature*) in order to efficiently search
|
||||||
observations during the year 2013 where the temperature was between 25 and 30℃. With a
|
observations during the year 2013 where the temperature was between 25 and 30℃. With a
|
||||||
one-dimensional index, you would have to either scan over all the records from 2013 (regardless of
|
one-dimensional index, you would have to either scan over all the records from 2013 (regardless of
|
||||||
temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by
|
temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by
|
||||||
timestamp and temperature simultaneously
|
timestamp and temperature simultaneously [^87].
|
||||||
[^87].
|
|
||||||
|
|
||||||
## Full-Text Search
|
## Full-Text Search
|
||||||
|
|
||||||
Full-text search allows you to search a collection of text documents (web pages, product
|
Full-text search allows you to search a collection of text documents (web pages, product
|
||||||
descriptions, etc.) by keywords that might appear anywhere in the text
|
descriptions, etc.) by keywords that might appear anywhere in the text [^88].
|
||||||
[^88].
|
|
||||||
Information retrieval is a big, specialist topic that often involves language-specific processing:
|
Information retrieval is a big, specialist topic that often involves language-specific processing:
|
||||||
for example, several Asian languages are written without spaces or punctuation between words, and
|
for example, several Asian languages are written without spaces or punctuation between words, and
|
||||||
therefore splitting text into words requires a model that indicates which character sequences
|
therefore splitting text into words requires a model that indicates which character sequences
|
||||||
|
|
@ -1249,26 +1191,21 @@ warehouse query that searches for rows matching two conditions ([Figure 4-9](/e
|
||||||
bitmaps for terms *x* and *y* and compute their bitwise AND. Even if the bitmaps are run-length
|
bitmaps for terms *x* and *y* and compute their bitwise AND. Even if the bitmaps are run-length
|
||||||
encoded, this can be done very efficiently.
|
encoded, this can be done very efficiently.
|
||||||
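As a rough illustration of the bitwise AND on postings lists, the sketch below uses Python's arbitrary-precision integers as uncompressed bitmaps, where bit *n* is set if document *n* contains the term. The document IDs are made up, and a real engine would use a compressed representation rather than a plain integer.

```python
def to_bitmap(doc_ids):
    """Build a bitmap with bit n set for every document ID n."""
    bitmap = 0
    for doc_id in doc_ids:
        bitmap |= 1 << doc_id
    return bitmap


def from_bitmap(bitmap):
    """Return the sorted document IDs whose bits are set."""
    doc_ids = []
    doc_id = 0
    while bitmap:
        if bitmap & 1:
            doc_ids.append(doc_id)
        bitmap >>= 1
        doc_id += 1
    return doc_ids


postings_x = to_bitmap([1, 4, 7, 9])   # documents containing term x (hypothetical)
postings_y = to_bitmap([2, 4, 9, 12])  # documents containing term y (hypothetical)
both = from_bitmap(postings_x & postings_y)  # documents containing x AND y
```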
|
|
||||||
For example, Lucene, the full-text indexing engine used by Elasticsearch and Solr, works like this
|
For example, Lucene, the full-text indexing engine used by Elasticsearch and Solr, works like this [^90].
|
||||||
[^90].
|
|
||||||
It stores the mapping from term to postings list in SSTable-like sorted files, which are merged in
|
It stores the mapping from term to postings list in SSTable-like sorted files, which are merged in
|
||||||
the background using the same log-structured approach we saw earlier in this chapter
|
the background using the same log-structured approach we saw earlier in this chapter [^91].
|
||||||
[^91].
|
|
||||||
PostgreSQL’s GIN index type also uses postings lists to support full-text search and indexing inside
|
PostgreSQL’s GIN index type also uses postings lists to support full-text search and indexing inside
|
||||||
JSON documents
|
JSON documents
|
||||||
[[92](/en/ch4#Fittl2021),
|
[[^92], [^93]].
|
||||||
[93](/en/ch4#Angelakos2020)].
|
|
||||||
|
|
||||||
Instead of breaking text into words, an alternative is to find all the substrings of length *n*,
|
Instead of breaking text into words, an alternative is to find all the substrings of length *n*,
|
||||||
which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
|
which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
|
||||||
`"hello"` are `"hel"`, `"ell"`, and `"llo"`. If we build an inverted index of all trigrams, we can
|
`"hello"` are `"hel"`, `"ell"`, and `"llo"`. If we build an inverted index of all trigrams, we can
|
||||||
search the documents for arbitrary substrings that are at least three characters long. Trigram
|
search the documents for arbitrary substrings that are at least three characters long. Trigram
|
||||||
indexes even allow regular expressions in search queries; the downside is that they are quite large
|
indexes even allow regular expressions in search queries; the downside is that they are quite large [^94].
|
||||||
[^94].
|
|
||||||
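The trigram idea can be sketched in a few lines. This is a minimal in-memory illustration with made-up documents and hypothetical function names, not how any particular search engine stores its index. Note the final filtering step: a document may contain every trigram of the query without containing the query as one contiguous substring.

```python
from collections import defaultdict


def trigrams(text):
    """Return the set of all substrings of length 3."""
    return {text[i:i + 3] for i in range(len(text) - 2)}


def build_index(docs):
    """Map each trigram to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, docs, query):
    """Find documents containing `query` (at least three characters) as a substring."""
    grams = trigrams(query)
    if not grams:
        raise ValueError("query must be at least three characters long")
    # Candidates contain every trigram of the query...
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # ...but must still be checked for the exact substring.
    return sorted(d for d in candidates if query in docs[d])


docs = {1: "hello world", 2: "yellow", 3: "help"}
index = build_index(docs)
```

With these sample documents, `search(index, docs, "ello")` looks up the trigrams `"ell"` and `"llo"` and intersects their postings before verifying the match.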
|
|
||||||
To cope with typos in documents or queries, Lucene is able to search text for words within a certain
|
To cope with typos in documents or queries, Lucene is able to search text for words within a certain
|
||||||
edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced)
|
edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced) [^95].
|
||||||
[^95].
|
|
||||||
It does this by storing the set of terms as a finite state automaton over the characters in the
|
It does this by storing the set of terms as a finite state automaton over the characters in the
|
||||||
keys, similar to a *trie*
|
keys, similar to a *trie*
|
||||||
[^96],
|
[^96],
|
||||||
|
|
@ -1309,12 +1246,9 @@ measure the distance between vectors. Cosine similarity measures the cosine of t
|
||||||
vectors to determine how close they are, while Euclidean distance measures the straight-line
|
vectors to determine how close they are, while Euclidean distance measures the straight-line
|
||||||
distance between two points in space.
|
distance between two points in space.
|
||||||
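The two distance measures can be written out directly. The sketch below uses plain Python lists as vectors for readability; the function names are illustrative, and a real system would typically rely on an optimized numeric library.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))
```

The two measures can disagree: the vectors `[1, 0]` and `[2, 0]` point in exactly the same direction, so their cosine similarity is 1, yet their Euclidean distance is 1 rather than 0.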
|
|
||||||
Many early embedding models such as Word2Vec
|
Many early embedding models such as Word2Vec [^98],
|
||||||
[^98],
|
BERT [^99],
|
||||||
BERT
|
and GPT [^100]
|
||||||
[^99],
|
|
||||||
and GPT
|
|
||||||
[^100]
|
|
||||||
worked with text data. Such models are usually implemented as neural networks. Researchers went on to
|
worked with text data. Such models are usually implemented as neural networks. Researchers went on to
|
||||||
create embedding models for video, audio, and images as well. More recently, model
|
create embedding models for video, audio, and images as well. More recently, model
|
||||||
architecture has become *multimodal*: a single model can generate vector embeddings for multiple
|
architecture has become *multimodal*: a single model can generate vector embeddings for multiple
|
||||||
|
|
@ -1357,16 +1291,13 @@ Hierarchical Navigable Small World (HNSW)
|
||||||
###### Figure 4-11. Searching for the database entry that is closest to a given query vector in a HNSW index.
|
###### Figure 4-11. Searching for the database entry that is closest to a given query vector in a HNSW index.
|
||||||
|
|
||||||
Many popular vector databases implement IVF and HNSW indexes. Facebook’s Faiss library has many
|
Many popular vector databases implement IVF and HNSW indexes. Facebook’s Faiss library has many
|
||||||
variations of each
|
variations of each [^101],
|
||||||
[^101],
|
and PostgreSQL’s pgvector supports both as well [^102].
|
||||||
and PostgreSQL’s pgvector supports both as well
|
|
||||||
[^102].
|
|
||||||
The full details of the IVF and HNSW algorithms are beyond the scope of this book, but their papers
|
The full details of the IVF and HNSW algorithms are beyond the scope of this book, but their papers
|
||||||
are an excellent resource
|
are an excellent resource
|
||||||
[[103](/en/ch4#Baranchuk2018),
|
[[^103], [^104]].
|
||||||
[104](/en/ch4#Malkov2020)].
|
|
||||||
|
|
||||||
# Summary
|
## Summary
|
||||||
|
|
||||||
In this chapter we tried to get to the bottom of how databases perform storage and retrieval. What
|
In this chapter we tried to get to the bottom of how databases perform storage and retrieval. What
|
||||||
happens when you store data in a database, and what does the database do when you query for the
|
happens when you store data in a database, and what does the database do when you query for the
|
||||||
|
|
@ -1413,10 +1344,11 @@ Although this chapter couldn’t make you an expert in tuning any one particular
|
||||||
has hopefully equipped you with enough vocabulary and ideas that you can make sense of the
|
has hopefully equipped you with enough vocabulary and ideas that you can make sense of the
|
||||||
documentation for the database of your choice.
|
documentation for the database of your choice.
|
||||||
|
|
||||||
##### Footnotes
|
|
||||||
|
|
||||||
|
|
||||||
##### References
|
|
||||||
|
### Summary
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -118,12 +118,10 @@ restored with minimal additional code. However, they also have a number of deep
|
||||||
yourself to your current programming language for potentially a very long time, and precluding
|
yourself to your current programming language for potentially a very long time, and precluding
|
||||||
integrating your systems with those of other organizations (which may use different languages).
|
integrating your systems with those of other organizations (which may use different languages).
|
||||||
* In order to restore data in the same object types, the decoding process needs to be able to
|
* In order to restore data in the same object types, the decoding process needs to be able to
|
||||||
instantiate arbitrary classes. This is frequently a source of security problems
|
instantiate arbitrary classes. This is frequently a source of security problems [^1]:
|
||||||
[^1]:
|
|
||||||
if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate
|
if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate
|
||||||
arbitrary classes, which in turn often allows them to do terrible things such as remotely
|
arbitrary classes, which in turn often allows them to do terrible things such as remotely
|
||||||
executing arbitrary code [[2](/en/ch5#Breen2015),
|
executing arbitrary code [^2] [^3].
|
||||||
[3](/en/ch5#McKenzie2013)].
|
|
||||||
* Versioning data is often an afterthought in these libraries: as they are intended for quick and
|
* Versioning data is often an afterthought in these libraries: as they are intended for quick and
|
||||||
easy encoding of data, they often neglect the inconvenient problems of forward and backward
|
easy encoding of data, they often neglect the inconvenient problems of forward and backward
|
||||||
compatibility [^4].
|
compatibility [^4].
|
||||||
|
|
@ -138,8 +136,7 @@ other than very transient purposes.
|
||||||
|
|
||||||
When moving to standardized encodings that can be written and read by many programming languages, JSON
|
When moving to standardized encodings that can be written and read by many programming languages, JSON
|
||||||
and XML are the obvious contenders. They are widely known, widely supported, and almost as widely
|
and XML are the obvious contenders. They are widely known, widely supported, and almost as widely
|
||||||
disliked. XML is often criticized for being too verbose and unnecessarily complicated
|
disliked. XML is often criticized for being too verbose and unnecessarily complicated [^6].
|
||||||
[^6].
|
|
||||||
JSON’s popularity is mainly due to its built-in support in web browsers and simplicity relative to
|
JSON’s popularity is mainly due to its built-in support in web browsers and simplicity relative to
|
||||||
XML. CSV is another popular language-independent format, but it only supports tabular data without
|
XML. CSV is another popular language-independent format, but it only supports tabular data without
|
||||||
nesting.
|
nesting.
|
||||||
|
|
@ -155,8 +152,7 @@ problems:
|
||||||
|
|
||||||
This is a problem when dealing with large numbers; for example, integers greater than 2^53 cannot
|
This is a problem when dealing with large numbers; for example, integers greater than 2^53 cannot
|
||||||
be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become
|
be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become
|
||||||
inaccurate when parsed in a language that uses floating-point numbers, such as JavaScript
|
inaccurate when parsed in a language that uses floating-point numbers, such as JavaScript [^7].
|
||||||
[^7].
|
|
||||||
An example of numbers larger than 2^53 occurs on X (formerly Twitter), which uses a 64-bit number to
|
An example of numbers larger than 2^53 occurs on X (formerly Twitter), which uses a 64-bit number to
|
||||||
identify each post. The JSON returned by the API includes post IDs twice, once as a JSON number and
|
identify each post. The JSON returned by the API includes post IDs twice, once as a JSON number and
|
||||||
once as a decimal string, to work around the fact that the numbers are not correctly parsed by
|
once as a decimal string, to work around the fact that the numbers are not correctly parsed by
|
||||||
|
|
@ -173,8 +169,7 @@ problems:
|
||||||
* CSV does not have any schema, so it is up to the application to define the meaning of each row and
|
* CSV does not have any schema, so it is up to the application to define the meaning of each row and
|
||||||
column. If an application change adds a new row or column, you have to handle that change manually.
|
column. If an application change adds a new row or column, you have to handle that change manually.
|
||||||
CSV is also quite a vague format (what happens if a value contains a comma or a newline character?).
|
CSV is also quite a vague format (what happens if a value contains a comma or a newline character?).
|
||||||
Although its escaping rules have been formally specified
|
Although its escaping rules have been formally specified [^9],
|
||||||
[^9],
|
|
||||||
not all parsers implement them correctly.
|
not all parsers implement them correctly.
|
||||||
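The large-number problem from the list above can be reproduced in a few lines. Python's own JSON parser keeps integers exact, so this sketch converts to `float` explicitly to imitate a parser that stores every number as an IEEE 754 double. The field names `id` and `id_str` are illustrative.

```python
import json

LIMIT = 2 ** 53
post_id = LIMIT + 1  # first integer a double cannot represent exactly

# Imitate a parser that holds all numbers as double-precision floats.
as_double = float(post_id)
lost_precision = int(as_double) != post_id

# Workaround: send the ID as a decimal string alongside the number.
payload = json.dumps({"id": post_id, "id_str": str(post_id)})
decoded = json.loads(payload)
recovered = int(decoded["id_str"])
```

Here `lost_precision` is true because the double rounds to 2^53, while `recovered` equals the original ID regardless of how the consumer represents JSON numbers.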
|
|
||||||
Despite these flaws, JSON, XML, and CSV are good enough for many purposes. It’s likely that they will
|
Despite these flaws, JSON, XML, and CSV are good enough for many purposes. It’s likely that they will
|
||||||
|
|
@ -211,7 +206,7 @@ JSON Schema so that keys may only contain digits, and values can only be strings
|
||||||
|
|
||||||
##### Example 5-1. Example JSON Schema with integer keys and string values. Integer keys are represented as strings containing only integers since JSON Schema requires all keys to be strings.
|
##### Example 5-1. Example JSON Schema with integer keys and string values. Integer keys are represented as strings containing only integers since JSON Schema requires all keys to be strings.
|
||||||
|
|
||||||
```
|
```json
|
||||||
{
|
{
|
||||||
"$schema": "http://json-schema.org/draft-07/schema#",
|
"$schema": "http://json-schema.org/draft-07/schema#",
|
||||||
"type": "object",
|
"type": "object",
|
||||||
|
|
@ -229,8 +224,7 @@ if/else schema logic, named types, references to remote schemas, and much more.
|
||||||
for a very powerful schema language. Such features also make for unwieldy definitions. It can be
|
for a very powerful schema language. Such features also make for unwieldy definitions. It can be
|
||||||
challenging to resolve remote schemas, reason about conditional rules, or evolve schemas in a
|
challenging to resolve remote schemas, reason about conditional rules, or evolve schemas in a
|
||||||
forwards or backwards compatible way [^10].
|
forwards or backwards compatible way [^10].
|
||||||
Similar concerns apply to XML Schema
|
Similar concerns apply to XML Schema [^11].
|
||||||
[^11].
|
|
||||||
|
|
||||||
### Binary encoding
|
### Binary encoding
|
||||||
|
|
||||||
|
|
@ -286,8 +280,7 @@ In the following sections we will see how we can do much better, and encode the
|
||||||
## Protocol Buffers
|
## Protocol Buffers
|
||||||
|
|
||||||
Protocol Buffers (protobuf) is a binary encoding library developed at Google.
|
Protocol Buffers (protobuf) is a binary encoding library developed at Google.
|
||||||
It is similar to Apache Thrift, which was originally developed by Facebook
|
It is similar to Apache Thrift, which was originally developed by Facebook [^13];
|
||||||
[^13];
|
|
||||||
most of what this section says about Protocol Buffers applies also to Thrift.
|
most of what this section says about Protocol Buffers applies also to Thrift.
|
||||||
|
|
||||||
Protocol Buffers requires a schema for any data that is encoded. To encode the data
|
Protocol Buffers requires a schema for any data that is encoded. To encode the data
|
||||||
|
|
@ -381,8 +374,7 @@ value won’t fit in 32 bits, it will be truncated.
|
||||||
|
|
||||||
Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers.
|
Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers.
|
||||||
It was started in 2009 as a subproject of Hadoop, as a result of Protocol Buffers not being a good
|
It was started in 2009 as a subproject of Hadoop, as a result of Protocol Buffers not being a good
|
||||||
fit for Hadoop’s use cases
|
fit for Hadoop’s use cases [^15].
|
||||||
[^15].
|
|
||||||
|
|
||||||
Avro also uses a schema to specify the structure of the data being encoded. It has two schema
|
Avro also uses a schema to specify the structure of the data being encoded. It has two schema
|
||||||
languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily
|
languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily
|
||||||
|
|
@ -455,8 +447,7 @@ application code is expecting, and their types.
|
||||||
If the reader’s and writer’s schema are the same, decoding is easy. If they are different, Avro
|
If the reader’s and writer’s schema are the same, decoding is easy. If they are different, Avro
|
||||||
resolves the differences by looking at the writer’s schema and the reader’s schema side by side and
|
resolves the differences by looking at the writer’s schema and the reader’s schema side by side and
|
||||||
translating the data from the writer’s schema into the reader’s schema. The Avro specification
|
translating the data from the writer’s schema into the reader’s schema. The Avro specification
|
||||||
[[16](/en/ch5#AvroSpec),
|
[[^16], [^17]]
|
||||||
[17](/en/ch5#AvroParsing)]
|
|
||||||
defines exactly how this resolution works, and it is illustrated in
|
defines exactly how this resolution works, and it is illustrated in
|
||||||
[Figure 5-6](/en/ch5#fig_encoding_avro_resolution).
|
[Figure 5-6](/en/ch5#fig_encoding_avro_resolution).
|
||||||
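A heavily simplified sketch of this resolution step is shown below. It is not the Avro library's API: it assumes the record has already been decoded into a dictionary using the writer's schema, handles only top-level fields matched by name, and ignores type conversions. The `MISSING` sentinel and the field names are made up for illustration.

```python
MISSING = object()  # marks a reader field that has no default value


def resolve(record, writer_fields, reader_fields):
    """Translate a record from the writer's fields to the reader's fields.

    record        -- dict decoded using the writer's schema
    writer_fields -- list of field names in the writer's schema
    reader_fields -- dict of field name -> default value (or MISSING)
    """
    result = {}
    for name, default in reader_fields.items():
        if name in writer_fields:
            result[name] = record[name]  # field known to both sides
        elif default is not MISSING:
            result[name] = default       # reader-only field: use its default
        else:
            raise ValueError(f"no value or default for field {name!r}")
    # Fields that appear only in the writer's schema are dropped.
    return result


writer_fields = ["userName", "favoriteNumber", "photoURL"]
reader_fields = {"userID": None, "userName": MISSING, "favoriteNumber": None}
record = {"userName": "Martin", "favoriteNumber": 1337, "photoURL": "http://example.com/p.png"}
resolved = resolve(record, writer_fields, reader_fields)
```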
|
|
||||||
|
|
@ -536,8 +527,7 @@ Sending records over a network connection
|
||||||
connection. The Avro RPC protocol (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) works like this.
|
connection. The Avro RPC protocol (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) works like this.
|
||||||
|
|
||||||
A database of schema versions is a useful thing to have in any case, since it acts as documentation
|
A database of schema versions is a useful thing to have in any case, since it acts as documentation
|
||||||
and gives you a chance to check schema compatibility
|
and gives you a chance to check schema compatibility [^21].
|
||||||
[^21].
|
|
||||||
As the version number, you could use a simple incrementing integer, or you could use a hash of the
|
As the version number, you could use a simple incrementing integer, or you could use a hash of the
|
||||||
schema.
|
schema.
|
||||||
|
|
||||||
|
|
@ -581,13 +571,10 @@ languages.
|
||||||
|
|
||||||
The ideas on which these encodings are based are by no means new. For example, they have a lot in
|
The ideas on which these encodings are based are by no means new. For example, they have a lot in
|
||||||
common with ASN.1, a schema definition language that was first standardized in 1984
|
common with ASN.1, a schema definition language that was first standardized in 1984
|
||||||
[[23](/en/ch5#Larmouth1999),
|
[[^23], [^24]].
|
||||||
[24](/en/ch5#Kaliski1993)].
|
|
||||||
It was used to define various network protocols, and its binary encoding (DER) is still used to encode
|
It was used to define various network protocols, and its binary encoding (DER) is still used to encode
|
||||||
SSL certificates (X.509), for example
|
SSL certificates (X.509), for example [^25].
|
||||||
[^25].
|
ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers [^26].
|
||||||
ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers
|
|
||||||
[^26].
|
|
||||||
However, it’s also very complex and badly documented, so ASN.1
|
However, it’s also very complex and badly documented, so ASN.1
|
||||||
is probably not a good choice for new applications.
|
is probably not a good choice for new applications.
|
||||||
|
|
||||||
|
|
@ -681,8 +668,7 @@ versions of the schema.
|
||||||
More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or
|
More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or
|
||||||
moving some data into a separate table—still require data to be rewritten, often at the application
|
moving some data into a separate table—still require data to be rewritten, often at the application
|
||||||
level [^27].
|
level [^27].
|
||||||
Maintaining forward and backward compatibility across such migrations is still a research problem
|
Maintaining forward and backward compatibility across such migrations is still a research problem [^28].
|
||||||
[^28].
|
|
||||||
|
|
||||||
### Archival storage
|
### Archival storage
|
||||||
|
|
||||||
|
|
@ -722,8 +708,7 @@ application-specific, and the client and server need to agree on the details of
|
||||||
In some ways, services are similar to databases: they typically allow clients to submit and query
|
In some ways, services are similar to databases: they typically allow clients to submit and query
|
||||||
data. However, while databases allow arbitrary queries using the query languages we discussed in
|
data. However, while databases allow arbitrary queries using the query languages we discussed in
|
||||||
[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
|
[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
|
||||||
that are predetermined by the business logic (application code) of the service
|
that are predetermined by the business logic (application code) of the service [^29]. This restriction provides a degree of encapsulation: services can impose
|
||||||
[^29]. This restriction provides a degree of encapsulation: services can impose
|
|
||||||
fine-grained restrictions on what clients can and cannot do.
|
fine-grained restrictions on what clients can and cannot do.
|
||||||
|
|
||||||
A key design goal of a service-oriented/microservices architecture is to make the application easier
|
A key design goal of a service-oriented/microservices architecture is to make the application easier
|
||||||
|
|
@ -752,8 +737,7 @@ different contexts. For example:
|
||||||
systems, or OAuth for shared access to user data.
|
systems, or OAuth for shared access to user data.
|
||||||
|
|
||||||
The most popular service design philosophy is REST, which builds upon the principles of HTTP
|
The most popular service design philosophy is REST, which builds upon the principles of HTTP
|
||||||
[[30](/en/ch5#Fielding2000),
|
[[^30], [^31]].
|
||||||
[31](/en/ch5#Fielding2008)].
|
|
||||||
It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for
|
It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for
|
||||||
cache control, authentication, and content type negotiation. An API designed according to the
|
cache control, authentication, and content type negotiation. An API designed according to the
|
||||||
principles of REST is called *RESTful*.
|
principles of REST is called *RESTful*.
|
||||||
|
|
@ -763,8 +747,7 @@ format to send and expect in response. Even if a service adopts RESTful design p
|
||||||
need to somehow find out these details. Service developers often use an interface definition
|
need to somehow find out these details. Service developers often use an interface definition
|
||||||
language (IDL) to define and document their service’s API endpoints and data models, and to evolve
|
language (IDL) to define and document their service’s API endpoints and data models, and to evolve
|
||||||
them over time. Other developers can then use the service definition to determine how to query the
|
them over time. Other developers can then use the service definition to determine how to query the
|
||||||
service. The two most popular service IDLs are OpenAPI (also known as Swagger
|
service. The two most popular service IDLs are OpenAPI (also known as Swagger [^32])
|
||||||
[^32])
|
|
||||||
and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send
|
and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send
|
||||||
and receive Protocol Buffers.
|
and receive Protocol Buffers.
|
||||||
|
|
||||||
|
|
@ -841,17 +824,14 @@ Architecture (CORBA) is excessively complex, and does not provide backward or fo
|
||||||
compatibility [^33].
|
compatibility [^33].
|
||||||
SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are
|
SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are
|
||||||
also plagued by complexity and compatibility problems
|
also plagued by complexity and compatibility problems
|
||||||
[[34](/en/ch5#Lacey2006),
|
[[^34], [^35], [^36]].
|
||||||
[35](/en/ch5#Tilkov2006),
|
|
||||||
[36](/en/ch5#Bray2004)].
|
|
||||||
|
|
||||||
All of these are based on the idea of a *remote procedure call* (RPC), which has been around since
|
All of these are based on the idea of a *remote procedure call* (RPC), which has been around since
|
||||||
the 1970s [^37].
|
the 1970s [^37].
|
||||||
The RPC model tries to make a request to a remote network service look the same as calling a function or
|
The RPC model tries to make a request to a remote network service look the same as calling a function or
|
||||||
method in your programming language, within the same process (this abstraction is called *location
|
method in your programming language, within the same process (this abstraction is called *location
|
||||||
transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed
|
transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed
|
||||||
[[38](/en/ch5#Waldo1994),
|
[[^38], [^39]].
|
||||||
[39](/en/ch5#Vinoski2008)].
|
|
||||||
A network request is very different from a local function call:
|
A network request is very different from a local function call:
|
||||||
|
|
||||||
* A local function call is predictable and either succeeds or fails, depending only on parameters
|
* A local function call is predictable and either succeeds or fails, depending only on parameters
|
||||||
|
|
@ -978,8 +958,7 @@ version of the API it wants to use [^42]).
|
||||||
For RESTful APIs, common approaches are to use a version
|
For RESTful APIs, common approaches are to use a version
|
||||||
number in the URL or in the HTTP `Accept` header. For services that use API keys to identify a
|
number in the URL or in the HTTP `Accept` header. For services that use API keys to identify a
|
||||||
particular client, another option is to store a client’s requested API version on the server and to
|
particular client, another option is to store a client’s requested API version on the server and to
|
||||||
allow this version selection to be updated through a separate administrative interface
|
allow this version selection to be updated through a separate administrative interface [^43].
|
||||||
[^43].
|
|
||||||
|
|
||||||
## Durable Execution and Workflows
|
## Durable Execution and Workflows
|
||||||
|
|
||||||
|
|
@ -994,8 +973,7 @@ the credit card, and call the banking service to deposit debited funds, as shown
|
||||||
[Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*.
|
[Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*.
|
||||||
Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a
|
Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a
|
||||||
general-purpose programming language, a domain specific language (DSL), or a markup language such as
|
general-purpose programming language, a domain specific language (DSL), or a markup language such as
|
||||||
Business Process Execution Language (BPEL)
|
Business Process Execution Language (BPEL) [^44].
|
||||||
[^44].
|
|
||||||
|
|
||||||
# Tasks, Activities, and Functions
|
# Tasks, Activities, and Functions
|
||||||
|
|
||||||
|
|
@ -1038,8 +1016,7 @@ task fails, the framework will re-execute the task, but will skip any RPC calls
|
||||||
that the task made successfully before failing. Instead, the framework will pretend to make the
|
that the task made successfully before failing. Instead, the framework will pretend to make the
|
||||||
call, but will instead return the results from the previous call. This is possible because durable
|
call, but will instead return the results from the previous call. This is possible because durable
|
||||||
execution frameworks log all RPCs and state changes to durable storage like a write-ahead log
|
execution frameworks log all RPCs and state changes to durable storage like a write-ahead log
|
||||||
[[45](/en/ch5#TemporalService),
|
[[^45], [^46]].
|
||||||
[46](/en/ch5#Ewen2023)].
|
|
||||||
[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
|
[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
|
||||||
using Temporal.
|
using Temporal.
|
||||||
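To make the replay behavior concrete, here is a toy sketch that does not use Temporal's API. A plain list stands in for the durable log, and the class and function names are hypothetical. On re-execution, calls that already appear in the journal return their logged result instead of running again.

```python
class Execution:
    """One (re-)execution of a workflow against a shared journal."""

    def __init__(self, journal):
        self.journal = journal  # stands in for durable storage
        self.position = 0       # index of the next call in this execution

    def call(self, name, fn, *args):
        if self.position < len(self.journal):
            # Replay: reuse the logged result instead of calling fn again.
            logged_name, result = self.journal[self.position]
            if logged_name != name:
                raise RuntimeError("workflow made calls in a different order")
        else:
            result = fn(*args)
            self.journal.append((name, result))
        self.position += 1
        return result


calls = []
attempts = {"debit": 0}


def check_fraud(amount):
    calls.append("check_fraud")
    return "ok"


def debit_card(amount):
    attempts["debit"] += 1
    if attempts["debit"] == 1:
        raise ConnectionError("simulated crash")  # first attempt fails
    return "debited"


def workflow(run, amount):
    fraud = run.call("check_fraud", check_fraud, amount)
    debit = run.call("debit_card", debit_card, amount)
    return fraud, debit


journal = []
try:
    workflow(Execution(journal), 100)
except ConnectionError:
    pass  # the framework would now schedule a re-execution
outcome = workflow(Execution(journal), 100)
```

After the retry, `check_fraud` has run only once, because its result was served from the journal. The order check in `call` also hints at why reordering calls in workflow code is risky for invocations that are already in flight.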
|
|
||||||
|
|
@ -1067,16 +1044,13 @@ class PaymentWorkflow:
|
||||||
|
|
||||||
Frameworks like Temporal are not without their challenges. External services, such as the
|
Frameworks like Temporal are not without their challenges. External services, such as the
|
||||||
third-party payment gateway in our example, must still provide an idempotent API. Developers must
|
third-party payment gateway in our example, must still provide an idempotent API. Developers must
|
||||||
remember to use unique IDs for these APIs to prevent duplicate execution
|
remember to use unique IDs for these APIs to prevent duplicate execution [^47].
|
||||||
[^47].
|
|
||||||
And because durable execution frameworks log each RPC call in order, they expect a subsequent
|
And because durable execution frameworks log each RPC call in order, it expects a subsequent
|
||||||
execution to make the same RPC calls in the same order. This makes code changes brittle: you
|
execution to make the same RPC calls in the same order. This makes code changes brittle: you
|
||||||
might introduce undefined behavior simply by re-ordering function calls
|
might introduce undefined behavior simply by re-ordering function calls [^48].
|
||||||
[^48].
|
|
||||||
Instead of modifying the code of an existing workflow, it is safer to deploy a new version of the
|
Instead of modifying the code of an existing workflow, it is safer to deploy a new version of the
|
||||||
code separately, so that re-executions of existing workflow invocations continue to use the old
|
code separately, so that re-executions of existing workflow invocations continue to use the old
|
||||||
version, and only new invocations use the new code
|
version, and only new invocations use the new code [^49].
|
||||||
[^49].
|
|
||||||
|
|
||||||
Similarly, because durable execution frameworks expect to replay all code deterministically (the
|
Similarly, because durable execution frameworks expect to replay all code deterministically (the
|
||||||
same inputs produce the same outputs), nondeterministic code such as random number generators or
|
same inputs produce the same outputs), nondeterministic code such as random number generators or
|
||||||
|
|
@ -1097,8 +1071,7 @@ how encoded data can flow from one process to another. A request is called an *e
|
||||||
unlike RPC, the sender usually does not wait for the recipient to process the event. Moreover,
|
unlike RPC, the sender usually does not wait for the recipient to process the event. Moreover,
|
||||||
events are typically not sent to the recipient via a direct network connection, but go via an
|
events are typically not sent to the recipient via a direct network connection, but go via an
|
||||||
intermediary called a *message broker* (also called an *event broker*, *message queue*, or
|
intermediary called a *message broker* (also called an *event broker*, *message queue*, or
|
||||||
*message-oriented middleware*), which stores the message temporarily.
|
*message-oriented middleware*), which stores the message temporarily [^50].
|
||||||
[^50].
|
|
||||||
|
|
||||||
Using a message broker has several advantages compared to direct RPC:
|
Using a message broker has several advantages compared to direct RPC:
|
||||||
|
|
||||||
|
|
@ -1136,7 +1109,7 @@ Message brokers typically don’t enforce any particular data model—a message
|
||||||
bytes with some metadata, so you can use any encoding format. A common approach is to use Protocol
|
bytes with some metadata, so you can use any encoding format. A common approach is to use Protocol
|
||||||
Buffers, Avro, or JSON, and to deploy a schema registry alongside the message broker to store all
|
Buffers, Avro, or JSON, and to deploy a schema registry alongside the message broker to store all
|
||||||
the valid schema versions and check their compatibility
|
the valid schema versions and check their compatibility
|
||||||
[[19](/en/ch5#ConfluentSchemaReg), [21](/en/ch5#Kreps2015)].
|
[[^19], [^21]].
|
||||||
AsyncAPI, a messaging-based equivalent of OpenAPI, can also be used to specify the schema of
|
AsyncAPI, a messaging-based equivalent of OpenAPI, can also be used to specify the schema of
|
||||||
messages.
|
messages.
|
||||||
|
|
||||||
|
|
@ -1160,8 +1133,7 @@ sending and receiving asynchronous messages. Message delivery is not guaranteed:
|
||||||
scenarios, messages will be lost. Since each actor processes only one message at a time, it doesn’t
|
scenarios, messages will be lost. Since each actor processes only one message at a time, it doesn’t
|
||||||
need to worry about threads, and each actor can be scheduled independently by the framework.
|
need to worry about threads, and each actor can be scheduled independently by the framework.
|
||||||
|
|
||||||
In *distributed actor frameworks* such as Akka, Orleans
|
In *distributed actor frameworks* such as Akka, Orleans [^51],
|
||||||
[^51],
|
|
||||||
and Erlang/OTP, this programming model is used to scale an application across
|
and Erlang/OTP, this programming model is used to scale an application across
|
||||||
multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient
|
multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient
|
||||||
are on the same node or different nodes. If they are on different nodes, the message is
|
are on the same node or different nodes. If they are on different nodes, the message is
|
||||||
|
|
@ -1178,7 +1150,7 @@ application, you still have to worry about forward and backward compatibility, a
|
||||||
sent from a node running the new version to a node running the old version, and vice versa. This can
|
sent from a node running the new version to a node running the old version, and vice versa. This can
|
||||||
be achieved by using one of the encodings discussed in this chapter.
|
be achieved by using one of the encodings discussed in this chapter.
|
||||||
|
|
||||||
# Summary
|
## Summary
|
||||||
|
|
||||||
In this chapter we looked at several ways of turning data structures into bytes on the network or
|
In this chapter we looked at several ways of turning data structures into bytes on the network or
|
||||||
bytes on disk. We saw how the details of these encodings affect not only their efficiency, but more
|
bytes on disk. We saw how the details of these encodings affect not only their efficiency, but more
|
||||||
|
|
@ -1222,10 +1194,11 @@ encodings are important:
|
||||||
We can conclude that with a bit of care, backward/forward compatibility and rolling upgrades are
|
We can conclude that with a bit of care, backward/forward compatibility and rolling upgrades are
|
||||||
quite achievable. May your application’s evolution be rapid and your deployments be frequent.
|
quite achievable. May your application’s evolution be rapid and your deployments be frequent.
|
||||||
|
|
||||||
##### Footnotes
|
|
||||||
|
|
||||||
|
|
||||||
##### References
|
|
||||||
|
### Summary
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
[^1]: [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*, July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y)
|
[^1]: [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*, July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y)
|
||||||
|
|
|
||||||
|
|
@ -11,7 +11,7 @@ breadcrumbs: false
|
||||||
> Douglas Adams, *Mostly Harmless* (1992)
|
> Douglas Adams, *Mostly Harmless* (1992)
|
||||||
|
|
||||||
*Replication* means keeping a copy of the same data on multiple machines that are connected via a
|
*Replication* means keeping a copy of the same data on multiple machines that are connected via a
|
||||||
network. As discussed in [“Distributed versus Single-Node Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_distributed), there are several reasons
|
network. As discussed in [“Distributed versus Single-Node Systems”](/ch01.html#sec_introduction_distributed), there are several reasons
|
||||||
why you might want to replicate data:
|
why you might want to replicate data:
|
||||||
|
|
||||||
* To keep data geographically close to your users (and thus reduce access latency)
|
* To keep data geographically close to your users (and thus reduce access latency)
|
||||||
|
|
@ -19,7 +19,7 @@ why you might want to replicate data:
|
||||||
* To scale out the number of machines that can serve read queries (and thus increase read throughput)
|
* To scale out the number of machines that can serve read queries (and thus increase read throughput)
|
||||||
|
|
||||||
In this chapter we will assume that your dataset is small enough that each machine can hold a copy of
|
In this chapter we will assume that your dataset is small enough that each machine can hold a copy of
|
||||||
the entire dataset. In [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding) we will relax that assumption and discuss *sharding*
|
the entire dataset. In [Chapter 7](/ch07.html#ch_sharding) we will relax that assumption and discuss *sharding*
|
||||||
(*partitioning*) of datasets that are too big for a single machine. In later chapters we will discuss
|
(*partitioning*) of datasets that are too big for a single machine. In later chapters we will discuss
|
||||||
various kinds of faults that can occur in a replicated data system, and how to deal with them.
|
various kinds of faults that can occur in a replicated data system, and how to deal with them.
|
||||||
|
|
||||||
|
|
@ -36,10 +36,8 @@ in databases, and although the details vary by database, the general principles
|
||||||
many different implementations. We will discuss the consequences of such choices in this chapter.
|
many different implementations. We will discuss the consequences of such choices in this chapter.
|
||||||
|
|
||||||
Replication of databases is an old topic—the principles haven’t changed much since they were
|
Replication of databases is an old topic—the principles haven’t changed much since they were
|
||||||
studied in the 1970s
|
studied in the 1970s [^1], because the fundamental constraints of networks have remained the same. Despite being so old,
|
||||||
[^1],
|
concepts such as *eventual consistency* still cause confusion. In [“Problems with Replication Lag”](/ch06.html#sec_replication_lag) we will
|
||||||
because the fundamental constraints of networks have remained the same. Despite being so old,
|
|
||||||
concepts such as *eventual consistency* still cause confusion. In [“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag) we will
|
|
||||||
get more precise about eventual consistency and discuss things like the *read-your-writes* and
|
get more precise about eventual consistency and discuss things like the *read-your-writes* and
|
||||||
*monotonic reads* guarantees.
|
*monotonic reads* guarantees.
|
||||||
|
|
||||||
|
|
@ -52,7 +50,7 @@ delete some data, replication doesn’t help since the deletion will have also b
|
||||||
replicas, so you need a backup if you want to restore the deleted data.
|
replicas, so you need a backup if you want to restore the deleted data.
|
||||||
|
|
||||||
In fact, replication and backups are often complementary to each other. Backups are sometimes part
|
In fact, replication and backups are often complementary to each other. Backups are sometimes part
|
||||||
of the process of setting up replication, as we shall see in [“Setting Up New Followers”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_new_replica).
|
of the process of setting up replication, as we shall see in [“Setting Up New Followers”](/ch06.html#sec_replication_new_replica).
|
||||||
Conversely, archiving replication logs can be part of a backup process.
|
Conversely, archiving replication logs can be part of a backup process.
|
||||||
|
|
||||||
Some databases internally maintain immutable snapshots of past states, which serve as a kind of
|
Some databases internally maintain immutable snapshots of past states, which serve as a kind of
|
||||||
|
|
@ -69,7 +67,7 @@ question inevitably arises: how do we ensure that all the data ends up on all th
|
||||||
Every write to the database needs to be processed by every replica; otherwise, the replicas would no
|
Every write to the database needs to be processed by every replica; otherwise, the replicas would no
|
||||||
longer contain the same data. The most common solution is called *leader-based replication*,
|
longer contain the same data. The most common solution is called *leader-based replication*,
|
||||||
*primary-backup*, or *active/passive*. It works as follows (see
|
*primary-backup*, or *active/passive*. It works as follows (see
|
||||||
[Figure 6-1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_leader_follower)):
|
[Figure 6-1](/ch06.html#fig_replication_leader_follower)):
|
||||||
|
|
||||||
1. One of the replicas is designated the *leader* (also known as *primary* or *source*
|
1. One of the replicas is designated the *leader* (also known as *primary* or *source*
|
||||||
[^2]).
|
[^2]).
|
||||||
|
|
@ -88,9 +86,9 @@ longer contain the same data. The most common solution is called *leader-based r
|
||||||
|
|
||||||
###### Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas.
|
###### Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas.
|
||||||
|
|
||||||
If the database is sharded (see [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding)), each shard has one leader. Different shards may
|
If the database is sharded (see [Chapter 7](/ch07.html#ch_sharding)), each shard has one leader. Different shards may
|
||||||
have their leaders on different nodes, but each shard must nevertheless have one leader node. In
|
have their leaders on different nodes, but each shard must nevertheless have one leader node. In
|
||||||
[“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader) we will discuss an alternative model in which a system may have
|
[“Multi-Leader Replication”](/ch06.html#sec_replication_multi_leader) we will discuss an alternative model in which a system may have
|
||||||
multiple leaders for the same shard at the same time.
|
multiple leaders for the same shard at the same time.
|
||||||
|
|
||||||
Single-leader replication is very widely used. It’s a built-in feature of many relational databases,
|
Single-leader replication is very widely used. It’s a built-in feature of many relational databases,
|
||||||
|
|
@ -106,7 +104,7 @@ Many consensus algorithms such as Raft, which is used for replication in Cockroa
|
||||||
TiDB [^7],
|
TiDB [^7],
|
||||||
etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and
|
etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and
|
||||||
automatically elect a new leader if the old one fails (we will discuss consensus in more detail in
|
automatically elect a new leader if the old one fails (we will discuss consensus in more detail in
|
||||||
[Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency)).
|
[Chapter 10](/ch10.html#ch_consistency)).
|
||||||
|
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> In older documents you may see the term *master–slave replication*. It means the same as
|
> In older documents you may see the term *master–slave replication*. It means the same as
|
||||||
|
|
@ -119,17 +117,17 @@ An important detail of a replicated system is whether the replication happens *s
|
||||||
*asynchronously*. (In relational databases, this is often a configurable option; other systems are
|
*asynchronously*. (In relational databases, this is often a configurable option; other systems are
|
||||||
often hardcoded to be either one or the other.)
|
often hardcoded to be either one or the other.)
|
||||||
|
|
||||||
Think about what happens in [Figure 6-1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_leader_follower), where the user of a website updates
|
Think about what happens in [Figure 6-1](/ch06.html#fig_replication_leader_follower), where the user of a website updates
|
||||||
their profile image. At some point in time, the client sends the update request to the leader;
|
their profile image. At some point in time, the client sends the update request to the leader;
|
||||||
shortly afterward, it is received by the leader. At some point, the leader forwards the data change
|
shortly afterward, it is received by the leader. At some point, the leader forwards the data change
|
||||||
to the followers. Eventually, the leader notifies the client that the update was successful.
|
to the followers. Eventually, the leader notifies the client that the update was successful.
|
||||||
[Figure 6-2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_sync_replication) shows one possible way the timings could work out.
|
[Figure 6-2](/ch06.html#fig_replication_sync_replication) shows one possible way the timings could work out.
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
###### Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower.
|
###### Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower.
|
||||||
|
|
||||||
In the example of [Figure 6-2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_sync_replication), the replication to follower 1 is
|
In the example of [Figure 6-2](/ch06.html#fig_replication_sync_replication), the replication to follower 1 is
|
||||||
*synchronous*: the leader waits until follower 1 has confirmed that it received the write before
|
*synchronous*: the leader waits until follower 1 has confirmed that it received the write before
|
||||||
reporting success to the user, and before making the write visible to other clients. The replication
|
reporting success to the user, and before making the write visible to other clients. The replication
|
||||||
to follower 2 is *asynchronous*: the leader sends the message, but doesn’t wait for a response from
|
to follower 2 is *asynchronous*: the leader sends the message, but doesn’t wait for a response from
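
As a rough illustration of the difference, the sketch below models a leader with one synchronous and one asynchronous follower. All class and method names are hypothetical, everything runs in one process, and visibility to other clients is not modelled; the only point is where the leader waits before reporting success.

```python
class Follower:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledgement back to the leader


class Leader:
    def __init__(self, sync_followers, async_followers):
        self.data = {}
        self.sync_followers = sync_followers
        self.async_followers = async_followers
        self.pending = []  # changes not yet delivered to async followers

    def write(self, key, value):
        self.data[key] = value
        # Synchronous followers: wait for each confirmation before success.
        for follower in self.sync_followers:
            if not follower.apply(key, value):
                raise RuntimeError("synchronous follower did not confirm")
        # Asynchronous followers: queue the change and do not wait.
        self.pending.append((key, value))
        return "ok"

    def flush_async(self):
        # Stands in for the background stream to asynchronous followers.
        for key, value in self.pending:
            for follower in self.async_followers:
                follower.apply(key, value)
        self.pending.clear()
```

After `write` returns, the synchronous follower is guaranteed to hold the change, while the asynchronous follower only catches up once `flush_async` has run.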
|
||||||
|
|
@ -159,9 +157,9 @@ called *semi-synchronous*.
|
||||||
|
|
||||||
In some systems, a *majority* (e.g., 3 out of 5 replicas, including the leader) of replicas is
|
In some systems, a *majority* (e.g., 3 out of 5 replicas, including the leader) of replicas is
|
||||||
updated synchronously, and the remaining minority is asynchronous. This is an example of a *quorum*,
|
updated synchronously, and the remaining minority is asynchronous. This is an example of a *quorum*,
|
||||||
which we will discuss further in [“Quorums for reading and writing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_quorum_condition). Majority quorums are often
|
which we will discuss further in [“Quorums for reading and writing”](/ch06.html#sec_replication_quorum_condition). Majority quorums are often
|
||||||
used in systems that use a consensus protocol for automatic leader election, which we will return to
|
used in systems that use a consensus protocol for automatic leader election, which we will return to
|
||||||
in [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency).
|
in [Chapter 10](/ch10.html#ch_consistency).
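
The majority arithmetic can be written out as a small sketch; the function names are illustrative only, and the acknowledgement count is assumed to include the leader's own copy, as in the 3-out-of-5 example.

```python
def majority(n):
    """Smallest number of replicas that is more than half of n."""
    return n // 2 + 1


def write_committed(acks, n):
    """True once a majority of the n replicas has confirmed the write."""
    return acks >= majority(n)
```

With five replicas the threshold is three, so two replicas can lag or fail without blocking writes.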
|
||||||
|
|
||||||
Sometimes, leader-based replication is configured to be completely asynchronous. In this case, if the
|
Sometimes, leader-based replication is configured to be completely asynchronous. In this case, if the
|
||||||
leader fails and is not recoverable, any writes that have not yet been replicated to followers are
|
leader fails and is not recoverable, any writes that have not yet been replicated to followers are
|
||||||
|
|
@ -172,7 +170,7 @@ processing writes, even if all of its followers have fallen behind.
|
||||||
Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless
|
Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless
|
||||||
widely used, especially if there are many followers or if they are geographically distributed
|
widely used, especially if there are many followers or if they are geographically distributed
|
||||||
[^9].
|
[^9].
|
||||||
We will return to this issue in [“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag).
|
We will return to this issue in [“Problems with Replication Lag”](/ch06.html#sec_replication_lag).
|
||||||
|
|
||||||
## Setting Up New Followers
|
## Setting Up New Followers
|
||||||
|
|
||||||
|
|
@ -224,8 +222,8 @@ for live queries. Storing database data in object storage has many benefits:
|
||||||
durability guarantees. This also allows databases to bypass inter-zone network fees.
|
durability guarantees. This also allows databases to bypass inter-zone network fees.
|
||||||
* Databases can use an object store’s *conditional write* feature—essentially, a *compare-and-set*
|
* Databases can use an object store’s *conditional write* feature—essentially, a *compare-and-set*
|
||||||
(CAS) operation—to implement transactions and leadership election
|
(CAS) operation—to implement transactions and leadership election
|
||||||
[[10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Morling2024_ch6),
|
[[10](/ch06.html#Morling2024_ch6),
|
||||||
[11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Chandramohan2024)].
|
[11](/ch06.html#Chandramohan2024)].
|
||||||
* Storing data from multiple databases in the same object store can simplify data integration,
|
* Storing data from multiple databases in the same object store can simplify data integration,
|
||||||
particularly when open formats such as Apache Parquet and Apache Iceberg are used.
|
particularly when open formats such as Apache Parquet and Apache Iceberg are used.
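
To make the conditional-write point from the list above concrete, here is a minimal sketch of leader election on top of a compare-and-set primitive. `InMemoryObjectStore` and `put_if_version` are hypothetical stand-ins rather than any real object store's API, and lease expiry (which a real system would need so that a crashed leader can be replaced) is left out.

```python
class InMemoryObjectStore:
    """Hypothetical store with a version-checked (compare-and-set) write."""

    def __init__(self):
        self._objects = {}  # key -> (version, value)

    def get(self, key):
        # Version 0 with value None means "object does not exist yet".
        return self._objects.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        current_version, _ = self.get(key)
        if current_version != expected_version:
            return False  # someone else wrote first; the write is rejected
        self._objects[key] = (current_version + 1, value)
        return True


def try_become_leader(store, node_id, key="leader"):
    version, current = store.get(key)
    if current is not None:
        return False  # a leader is already recorded
    # Only succeeds if nobody has written the key since we read it.
    return store.put_if_version(key, version, node_id)
```

If two nodes race, only the one whose conditional write arrives first succeeds; the other sees a version mismatch and stays a follower.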
|
||||||
|
|
||||||
|
|
@ -312,10 +310,10 @@ consists of the following steps:
|
||||||
[^13].
|
[^13].
|
||||||
The best candidate for leadership is usually the replica with the most up-to-date data changes
|
The best candidate for leadership is usually the replica with the most up-to-date data changes
|
||||||
from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader
|
from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader
|
||||||
is a consensus problem, discussed in detail in [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency).
|
is a consensus problem, discussed in detail in [Chapter 10](/ch10.html#ch_consistency).
|
||||||
3. *Reconfiguring the system to use the new leader.* Clients now need to send
|
3. *Reconfiguring the system to use the new leader.* Clients now need to send
|
||||||
their write requests to the new leader (we discuss this
|
their write requests to the new leader (we discuss this
|
||||||
in [“Request Routing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_routing)). If the old leader comes back, it might still believe that it is
|
in [“Request Routing”](/ch07.html#sec_sharding_routing)). If the old leader comes back, it might still believe that it is
|
||||||
the leader, not realizing that the other replicas have
|
the leader, not realizing that the other replicas have
|
||||||
forced it to step down. The system needs to ensure that the old leader becomes a follower and
|
forced it to step down. The system needs to ensure that the old leader becomes a follower and
|
||||||
recognizes the new leader.
|
recognizes the new leader.
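
The candidate-selection rule in step 2 can be sketched as follows. This is only the "most up-to-date replica" heuristic: the function name and the input shape (a mapping from node name to last replicated log position) are assumptions for illustration, and the hard part, getting all nodes to agree on the result, is not shown.

```python
def choose_new_leader(replicas):
    """Pick the follower with the highest replicated log position.

    replicas: dict mapping node name -> last log position received
    from the old leader. Ties are broken by node name so that every
    caller with the same input picks the same node.
    """
    if not replicas:
        raise ValueError("no replicas available for failover")
    return max(sorted(replicas), key=lambda name: replicas[name])
```

For example, given positions `{"b": 120, "a": 120, "c": 90}`, the function picks `"a"`: it is tied with `"b"` for the most recent position and comes first by name.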
|
||||||
|
|
@ -337,10 +335,10 @@ Failover is fraught with things that can go wrong:
|
||||||
primary keys that were previously assigned by the old leader. These primary keys were also used in
|
primary keys that were previously assigned by the old leader. These primary keys were also used in
|
||||||
a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis,
|
a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis,
|
||||||
which caused some private data to be disclosed to the wrong users.
|
which caused some private data to be disclosed to the wrong users.
|
||||||
* In certain fault scenarios (see [Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed)), it could happen that two nodes both believe
|
* In certain fault scenarios (see [Chapter 9](/ch09.html#ch_distributed)), it could happen that two nodes both believe
|
||||||
that they are the leader. This situation is called *split brain*, and it is dangerous: if both
|
that they are the leader. This situation is called *split brain*, and it is dangerous: if both
|
||||||
leaders accept writes, and there is no process for resolving conflicts (see
|
leaders accept writes, and there is no process for resolving conflicts (see
|
||||||
[“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some
|
[“Multi-Leader Replication”](/ch06.html#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some
|
||||||
systems have a mechanism to shut down one node if two leaders are detected. However, if this
|
systems have a mechanism to shut down one node if two leaders are detected. However, if this
|
||||||
mechanism is not carefully designed, you can end up with both nodes being shut down
|
mechanism is not carefully designed, you can end up with both nodes being shut down
|
||||||
[^15].
|
[^15].
|
||||||
|
|
@ -356,7 +354,7 @@ Failover is fraught with things that can go wrong:
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> Guarding against split brain by limiting or shutting down old leaders is known as *fencing* or, more
|
> Guarding against split brain by limiting or shutting down old leaders is known as *fencing* or, more
|
||||||
> emphatically, *Shoot The Other Node In The Head* (STONITH). We will discuss fencing in more detail
|
> emphatically, *Shoot The Other Node In The Head* (STONITH). We will discuss fencing in more detail
|
||||||
> in [“Distributed Locks and Leases”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_lock_fencing).
|
> in [“Distributed Locks and Leases”](/ch09.html#sec_distributed_lock_fencing).
|
||||||
|
|
||||||
There are no easy solutions to these problems. For this reason, some operations teams prefer to
|
There are no easy solutions to these problems. For this reason, some operations teams prefer to
|
||||||
perform failovers manually, even if the software supports automatic failover.
|
perform failovers manually, even if the software supports automatic failover.
|
||||||
|
|
@ -370,7 +368,7 @@ behind by several days could be catastrophic.
|
||||||
|
|
||||||
These issues—node failures; unreliable networks; and trade-offs around replica consistency,
|
These issues—node failures; unreliable networks; and trade-offs around replica consistency,
|
||||||
durability, availability, and latency—are in fact fundamental problems in distributed systems.
|
durability, availability, and latency—are in fact fundamental problems in distributed systems.
|
||||||
In [Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed) and [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency) we will discuss them in greater depth.
|
In [Chapter 9](/ch09.html#ch_distributed) and [Chapter 10](/ch10.html#ch_consistency) we will discuss them in greater depth.
|
||||||
|
|
||||||
## Implementation of Replication Logs
|
## Implementation of Replication Logs
|
||||||
|
|
||||||
|
|
@ -401,9 +399,9 @@ break down:
|
||||||
It is possible to work around those issues—for example, the leader can replace any nondeterministic
|
It is possible to work around those issues—for example, the leader can replace any nondeterministic
|
||||||
function calls with a fixed return value when the statement is logged so that the followers all get
|
function calls with a fixed return value when the statement is logged so that the followers all get
|
||||||
the same value. The idea of executing deterministic statements in a fixed order is similar to the
|
the same value. The idea of executing deterministic statements in a fixed order is similar to the
|
||||||
event sourcing model that we previously discussed in [“Event Sourcing and CQRS”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_events). This approach is
|
event sourcing model that we previously discussed in [“Event Sourcing and CQRS”](/ch03.html#sec_datamodels_events). This approach is
|
||||||
also known as *state machine replication*, and we will discuss the theory behind it in
|
also known as *state machine replication*, and we will discuss the theory behind it in
|
||||||
[“Using shared logs”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_smr).
|
[“Using shared logs”](/ch10.html#sec_consistency_smr).
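
The workaround of substituting a fixed value can be sketched as below, assuming the nondeterministic function is a SQL `NOW()` call. `make_deterministic` is a hypothetical helper, and the regular expression is deliberately naive: it would also rewrite `NOW()` inside a string literal, which a real database avoids by rewriting the parsed statement.

```python
import re
from datetime import datetime, timezone


def make_deterministic(statement, now=None):
    """Replace NOW() with a literal timestamp before the statement is logged."""
    now = now or datetime.now(timezone.utc)
    literal = "'" + now.strftime("%Y-%m-%d %H:%M:%S") + "'"
    return re.sub(r"\bNOW\(\)", literal, statement, flags=re.IGNORECASE)
```

Because the leader evaluates the timestamp once and ships the rewritten statement, every follower applies the same value regardless of when it executes the statement.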
|
||||||
|
|
||||||
Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today,
|
Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today,
|
||||||
as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if
|
as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if
|
||||||
|
|
@ -415,7 +413,7 @@ replication methods.
|
||||||
|
|
||||||
### Write-ahead log (WAL) shipping
|
### Write-ahead log (WAL) shipping
|
||||||
|
|
||||||
In [Chapter 4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
|
In [Chapter 4](/ch04.html#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
|
||||||
every modification is first written to the WAL so that the tree can be restored to a consistent
|
every modification is first written to the WAL so that the tree can be restored to a consistent
|
||||||
state after a crash. Since the WAL contains all the information necessary to restore the indexes and
|
state after a crash. Since the WAL contains all the information necessary to restore the indexes and
|
||||||
heap into a consistent state, we can use the exact same log to build a replica on another node:
|
heap into a consistent state, we can use the exact same log to build a replica on another node:
|
||||||
|
|
@ -423,8 +421,8 @@ besides writing the log to disk, the leader also sends it across the network to
|
||||||
the follower processes this log, it builds a copy of the exact same files as found on the leader.
|
the follower processes this log, it builds a copy of the exact same files as found on the leader.
|
||||||
|
|
||||||
This method of replication is used in PostgreSQL and Oracle, among others
|
This method of replication is used in PostgreSQL and Oracle, among others
|
||||||
[[17](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Suzuki2017_ch6),
|
[[17](/ch06.html#Suzuki2017_ch6),
|
||||||
[18](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2012)].
|
[18](/ch06.html#Kapila2012)].
|
||||||
The main disadvantage is that the log describes the data on a very low level: a WAL contains details
|
The main disadvantage is that the log describes the data on a very low level: a WAL contains details
|
||||||
of which bytes were changed in which disk blocks. This makes replication tightly coupled to the
|
of which bytes were changed in which disk blocks. This makes replication tightly coupled to the
|
||||||
storage engine. If the database changes its storage format from one version to another, it is
|
storage engine. If the database changes its storage format from one version to another, it is
|
||||||
|
|
@ -476,7 +474,7 @@ This technique is called *change data capture*, and we will return to it in [Lin
|
||||||
# Problems with Replication Lag
|
# Problems with Replication Lag
|
||||||
|
|
||||||
Being able to tolerate node failures is just one reason for wanting replication. As mentioned
|
Being able to tolerate node failures is just one reason for wanting replication. As mentioned
|
||||||
in [“Distributed versus Single-Node Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_distributed), other reasons are scalability (processing more
|
in [“Distributed versus Single-Node Systems”](/ch01.html#sec_introduction_distributed), other reasons are scalability (processing more
|
||||||
requests than a single machine can handle) and latency (placing replicas geographically closer to
|
requests than a single machine can handle) and latency (placing replicas geographically closer to
|
||||||
users).
|
users).
|
||||||
|
|
||||||
|
|
@ -528,7 +526,7 @@ be read from a follower. This is especially appropriate if data is frequently vi
|
||||||
occasionally written.
|
occasionally written.
|
||||||
|
|
||||||
With asynchronous replication, there is a problem, illustrated in
|
With asynchronous replication, there is a problem, illustrated in
|
||||||
[Figure 6-3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_read_your_writes): if the user views the data shortly after making a write, the
|
[Figure 6-3](/ch06.html#fig_replication_read_your_writes): if the user views the data shortly after making a write, the
|
||||||
new data may not yet have reached the replica. To the user, it looks as though the data they
|
new data may not yet have reached the replica. To the user, it looks as though the data they
|
||||||
submitted was lost, so they will be understandably unhappy.
|
submitted was lost, so they will be understandably unhappy.
|
||||||
|
|
||||||
|
|
@ -568,7 +566,7 @@ are various possible techniques. To mention a few:
|
||||||
[^26].
|
[^26].
|
||||||
The timestamp could be a *logical timestamp* (something that indicates ordering of writes, such as
|
The timestamp could be a *logical timestamp* (something that indicates ordering of writes, such as
|
||||||
the log sequence number) or the actual system clock (in which case clock synchronization becomes
|
the log sequence number) or the actual system clock (in which case clock synchronization becomes
|
||||||
critical; see [“Unreliable Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_clocks)).
|
critical; see [“Unreliable Clocks”](/ch09.html#sec_distributed_clocks)).
|
||||||
* If your replicas are distributed across regions (for geographical proximity to users or for
|
* If your replicas are distributed across regions (for geographical proximity to users or for
|
||||||
availability), there is additional complexity. Any request that needs to be served by the leader
|
availability), there is additional complexity. Any request that needs to be served by the leader
|
||||||
must be routed to the region that contains the leader.
|
must be routed to the region that contains the leader.
|
||||||
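The logical-timestamp technique from the list above can be sketched as follows. This is a minimal illustration under stated assumptions: the `Replica` class, its `applied_lsn` field, and the `choose_replica` function are hypothetical names, and the sketch assumes the client remembers the log sequence number of its most recent write and that each replica reports how far it has applied the log.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    applied_lsn: int        # log position this replica has applied so far
    is_leader: bool = False


def choose_replica(replicas: list, last_write_lsn: int) -> Replica:
    """Return a follower that already reflects the client's last write,
    falling back to the leader if no follower has caught up yet."""
    for replica in replicas:
        if not replica.is_leader and replica.applied_lsn >= last_write_lsn:
            return replica
    # No follower is fresh enough, so the read goes to the leader.
    return next(r for r in replicas if r.is_leader)
```

A request router could call `choose_replica` with the sequence number carried in the user's session. In a multi-region deployment the fallback branch is where the extra routing cost appears, because the leader may be in a different region from the user.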
|
|
@ -604,7 +602,7 @@ zonal outages where one zone goes offline, but they do not protect against regio
|
||||||
all zones in a region are unavailable. To survive a regional outage, a distributed system must be
|
all zones in a region are unavailable. To survive a regional outage, a distributed system must be
|
||||||
deployed across multiple regions, which can result in higher latencies, lower throughput, and
|
deployed across multiple regions, which can result in higher latencies, lower throughput, and
|
||||||
increased cloud networking bills. We will discuss these tradeoffs more in
|
increased cloud networking bills. We will discuss these tradeoffs more in
|
||||||
[“Multi-leader replication topologies”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_topologies). For now, just know that when we say region, we mean a collection of
|
[“Multi-leader replication topologies”](/ch06.html#sec_replication_topologies). For now, just know that when we say region, we mean a collection of
|
||||||
zones/datacenters in a single geographic location.
|
zones/datacenters in a single geographic location.
|
||||||
|
|
||||||
## Monotonic Reads
|
## Monotonic Reads
|
||||||
|
|
@ -613,7 +611,7 @@ Our second example of an anomaly that can occur when reading from asynchronous f
|
||||||
possible for a user to see things *moving backward in time*.
|
possible for a user to see things *moving backward in time*.
|
||||||
|
|
||||||
This can happen if a user makes several reads from different replicas. For example,
|
This can happen if a user makes several reads from different replicas. For example,
|
||||||
[Figure 6-4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower
|
[Figure 6-4](/ch06.html#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower
|
||||||
with little lag, then to a follower with greater lag. (This scenario is quite likely if the user
|
with little lag, then to a follower with greater lag. (This scenario is quite likely if the user
|
||||||
refreshes a web page, and each request is routed to a random server.) The first query returns a
|
refreshes a web page, and each request is routed to a random server.) The first query returns a
|
||||||
comment that was recently added by user 1234, but the second query doesn’t return anything because
|
comment that was recently added by user 1234, but the second query doesn’t return anything because
|
||||||
|
|
@ -654,7 +652,7 @@ answered it.
|
||||||
|
|
||||||
Now, imagine a third person is listening to this conversation through followers. The things said by
|
Now, imagine a third person is listening to this conversation through followers. The things said by
|
||||||
Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer
|
Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer
|
||||||
replication lag (see [Figure 6-5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_consistent_prefix)). This observer would hear the following:
|
replication lag (see [Figure 6-5](/ch06.html#fig_replication_consistent_prefix)). This observer would hear the following:
|
||||||
|
|
||||||
Mrs. Cake
|
Mrs. Cake
|
||||||
: About ten seconds usually, Mr. Poons.
|
: About ten seconds usually, Mr. Poons.
|
||||||
|
|
@ -676,7 +674,7 @@ writes happens in a certain order, then anyone reading those writes will see the
|
||||||
order.
|
order.
|
||||||
|
|
||||||
This is a particular problem in sharded (partitioned) databases, which we will discuss in
|
This is a particular problem in sharded (partitioned) databases, which we will discuss in
|
||||||
[Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding). If the database always applies writes in the same order, reads always see a
|
[Chapter 7](/ch07.html#ch_sharding). If the database always applies writes in the same order, reads always see a
|
||||||
consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different
|
consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different
|
||||||
shards operate independently, so there is no global ordering of writes: when a user reads from the
|
shards operate independently, so there is no global ordering of writes: when a user reads from the
|
||||||
database, they may see some parts of the database in an older state and some in a newer state.
|
database, they may see some parts of the database in an older state and some in a newer state.
|
||||||
|
|
@ -684,7 +682,7 @@ database, they may see some parts of the database in an older state and some in
|
||||||
One solution is to make sure that any writes that are causally related to each other are written to
|
One solution is to make sure that any writes that are causally related to each other are written to
|
||||||
the same shard—but in some applications that cannot be done efficiently. There are also algorithms
|
the same shard—but in some applications that cannot be done efficiently. There are also algorithms
|
||||||
that explicitly keep track of causal dependencies, a topic that we will return to in
|
that explicitly keep track of causal dependencies, a topic that we will return to in
|
||||||
[“The “happens-before” relation and concurrency”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_happens_before).
|
[“The “happens-before” relation and concurrency”](/ch06.html#sec_replication_happens_before).
|
||||||
|
|
||||||
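One way to keep causally related writes on the same shard is to route them by a shared key. The sketch below assumes, purely for illustration, that every message in a conversation carries a `conversation_id` and that shards are chosen by hashing that ID; the function name and the choice of SHA-256 are assumptions of this example, not something prescribed by the text.

```python
import hashlib


def shard_for(routing_key: str, num_shards: int) -> int:
    """Map a routing key to a shard index in the range [0, num_shards)."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    # A stable hash is used so that every process picks the same shard.
    digest = hashlib.sha256(routing_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Both the question and the answer use the conversation ID as the routing
# key, so they are written to the same shard and keep their relative order.
question_shard = shard_for("conversation-42", num_shards=8)
answer_shard = shard_for("conversation-42", num_shards=8)
```

This only helps when the related writes share an obvious key. Writes whose causal relationship crosses keys still need one of the dependency-tracking approaches mentioned above.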
## Solutions for Replication Lag
|
## Solutions for Replication Lag
|
||||||
|
|
||||||
|
|
@ -700,15 +698,15 @@ synchronously updated follower. However, dealing with these issues in applicatio
|
||||||
and easy to get wrong.
|
and easy to get wrong.
|
||||||
|
|
||||||
The simplest programming model for application developers is to choose a database that provides a
|
The simplest programming model for application developers is to choose a database that provides a
|
||||||
strong consistency guarantee for replicas such as linearizability (see [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency)), and ACID
|
strong consistency guarantee for replicas such as linearizability (see [Chapter 10](/ch10.html#ch_consistency)), and ACID
|
||||||
transactions (see [Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions)). This allows you to mostly ignore the challenges that arise
|
transactions (see [Chapter 8](/ch08.html#ch_transactions)). This allows you to mostly ignore the challenges that arise
|
||||||
from replication, and treat the database as if it had just a single node. In the early 2010s the
|
from replication, and treat the database as if it had just a single node. In the early 2010s the
|
||||||
*NoSQL* movement promoted the view that these features limited scalability, and that large-scale
|
*NoSQL* movement promoted the view that these features limited scalability, and that large-scale
|
||||||
systems would have to embrace eventual consistency.
|
systems would have to embrace eventual consistency.
|
||||||
|
|
||||||
However, since then, a number of databases have started providing strong consistency and transactions
|
However, since then, a number of databases have started providing strong consistency and transactions
|
||||||
while also offering the fault tolerance, high availability, and scalability advantages of a
|
while also offering the fault tolerance, high availability, and scalability advantages of a
|
||||||
distributed database. As mentioned in [“Relational Model versus Document Model”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_history), this trend is known as *NewSQL* to
|
distributed database. As mentioned in [“Relational Model versus Document Model”](/ch03.html#sec_datamodels_history), this trend is known as *NewSQL* to
|
||||||
contrast with NoSQL (although it’s less about SQL specifically, and more about new approaches to
|
contrast with NoSQL (although it’s less about SQL specifically, and more about new approaches to
|
||||||
scalable transaction management).
|
scalable transaction management).
|
||||||
|
|
||||||
|
|
@ -758,7 +756,7 @@ single-leader replication, the leader has to be in *one* of the regions, and all
|
||||||
through that region.
|
through that region.
|
||||||
|
|
||||||
In a multi-leader configuration, you can have a leader in *each* region.
|
In a multi-leader configuration, you can have a leader in *each* region.
|
||||||
[Figure 6-6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_multi_dc) shows what this architecture might look like. Within each region,
|
[Figure 6-6](/ch06.html#fig_replication_multi_dc) shows what this architecture might look like. Within each region,
|
||||||
regular leader–follower replication is used (with followers maybe in a different availability zone
|
regular leader–follower replication is used (with followers maybe in a different availability zone
|
||||||
from the leader); between regions, each region’s leader replicates its changes to the leaders in
|
from the leader); between regions, each region’s leader replicates its changes to the leaders in
|
||||||
other regions.
|
other regions.
|
||||||
|
|
@ -798,7 +796,7 @@ Tolerance of network problems
|
||||||
|
|
||||||
Consistency
|
Consistency
|
||||||
: A single-leader system can provide strong consistency guarantees, such as serializable
|
: A single-leader system can provide strong consistency guarantees, such as serializable
|
||||||
transactions, which we will discuss in [Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions). The biggest downside of multi-leader
|
transactions, which we will discuss in [Chapter 8](/ch08.html#ch_transactions). The biggest downside of multi-leader
|
||||||
systems is that the consistency they can achieve is much weaker. For example, you can’t guarantee
|
systems is that the consistency they can achieve is much weaker. For example, you can’t guarantee
|
||||||
that a bank account won’t go negative or that a username is unique: it’s always possible for
|
that a bank account won’t go negative or that a username is unique: it’s always possible for
|
||||||
different leaders to process writes that are individually fine (paying out some of the money in an
|
different leaders to process writes that are individually fine (paying out some of the money in an
|
||||||
|
|
@ -808,7 +806,7 @@ Consistency
|
||||||
This is simply a fundamental limitation of distributed systems
|
This is simply a fundamental limitation of distributed systems
|
||||||
[^28].
|
[^28].
|
||||||
If you need to enforce such constraints, you’re therefore better off with a single-leader system.
|
If you need to enforce such constraints, you’re therefore better off with a single-leader system.
|
||||||
However, as we will see in [“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts), multi-leader systems can still
|
However, as we will see in [“Dealing with Conflicting Writes”](/ch06.html#sec_replication_write_conflicts), multi-leader systems can still
|
||||||
achieve consistency properties that are useful in a wide range of apps that don’t need such
|
achieve consistency properties that are useful in a wide range of apps that don’t need such
|
||||||
constraints.
|
constraints.
|
||||||
|
|
||||||
|
|
@ -826,17 +824,17 @@ multi-leader replication is often considered dangerous territory that should be
|
||||||
### Multi-leader replication topologies
|
### Multi-leader replication topologies
|
||||||
|
|
||||||
A *replication topology* describes the communication paths along which writes are propagated from
|
A *replication topology* describes the communication paths along which writes are propagated from
|
||||||
one node to another. If you have two leaders, like in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), there is
|
one node to another. If you have two leaders, like in [Figure 6-9](/ch06.html#fig_replication_write_conflict), there is
|
||||||
only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With
|
only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With
|
||||||
more than two leaders, various different topologies are possible. Some examples are illustrated in
|
more than two leaders, various different topologies are possible. Some examples are illustrated in
|
||||||
[Figure 6-7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_topologies).
|
[Figure 6-7](/ch06.html#fig_replication_topologies).
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
###### Figure 6-7. Three example topologies in which multi-leader replication can be set up.
|
###### Figure 6-7. Three example topologies in which multi-leader replication can be set up.
|
||||||
|
|
||||||
The most general topology is *all-to-all*, shown in
|
The most general topology is *all-to-all*, shown in
|
||||||
[Figure 6-7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_topologies)(c),
|
[Figure 6-7](/ch06.html#fig_replication_topologies)(c),
|
||||||
in which every leader sends its writes to every other leader. However, more restricted topologies
|
in which every leader sends its writes to every other leader. However, more restricted topologies
|
||||||
are also used: for example, a *circular topology* in which each node receives writes from one node
|
are also used: for example, a *circular topology* in which each node receives writes from one node
|
||||||
and forwards those writes (plus any writes of its own) to one other node. Another popular topology
|
and forwards those writes (plus any writes of its own) to one other node. Another popular topology
|
||||||
|
|
@ -845,7 +843,7 @@ star topology can be generalized to a tree.
|
||||||
|
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> Don’t confuse a star-shaped network topology with a *star schema* (see
|
> Don’t confuse a star-shaped network topology with a *star schema* (see
|
||||||
> [“Stars and Snowflakes: Schemas for Analytics”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_analytics)), which describes the structure of a data model.
|
> [“Stars and Snowflakes: Schemas for Analytics”](/ch03.html#sec_datamodels_analytics)), which describes the structure of a data model.
|
||||||
|
|
||||||
In circular and star topologies, a write may need to pass through several nodes before it reaches
|
In circular and star topologies, a write may need to pass through several nodes before it reaches
|
||||||
all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To
|
all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To
|
||||||
|
|
@ -866,28 +864,28 @@ along different paths, avoiding a single point of failure.
|
||||||
|
|
||||||
On the other hand, all-to-all topologies can have issues too. In particular, some network links may
|
On the other hand, all-to-all topologies can have issues too. In particular, some network links may
|
||||||
be faster than others (e.g., due to network congestion), with the result that some replication
|
be faster than others (e.g., due to network congestion), with the result that some replication
|
||||||
messages may “overtake” others, as illustrated in [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality).
|
messages may “overtake” others, as illustrated in [Figure 6-8](/ch06.html#fig_replication_causality).
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
###### Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas.
|
###### Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas.
|
||||||
|
|
||||||
In [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
|
In [Figure 6-8](/ch06.html#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
|
||||||
updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may
|
updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may
|
||||||
first receive the update (which, from its point of view, is an update to a row that does not exist
|
first receive the update (which, from its point of view, is an update to a row that does not exist
|
||||||
in the database) and only later receive the corresponding insert (which should have preceded the
|
in the database) and only later receive the corresponding insert (which should have preceded the
|
||||||
update).
|
update).
|
||||||
|
|
||||||
This is a problem of causality, similar to the one we saw in [“Consistent Prefix Reads”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_consistent_prefix):
|
This is a problem of causality, similar to the one we saw in [“Consistent Prefix Reads”](/ch06.html#sec_replication_consistent_prefix):
|
||||||
the update depends on the prior insert, so we need to make sure that all nodes process the insert
|
the update depends on the prior insert, so we need to make sure that all nodes process the insert
|
||||||
first, and then the update. Simply attaching a timestamp to every write is not sufficient, because
|
first, and then the update. Simply attaching a timestamp to every write is not sufficient, because
|
||||||
clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see
|
clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see
|
||||||
[Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed)).
|
[Chapter 9](/ch09.html#ch_distributed)).
|
||||||
|
|
||||||
To order these events correctly, a technique called *version vectors* can be used, which we will
|
To order these events correctly, a technique called *version vectors* can be used, which we will
|
||||||
discuss later in this chapter (see [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent)). However, many multi-leader
|
discuss later in this chapter (see [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent)). However, many multi-leader
|
||||||
replication systems don’t use good techniques for ordering updates, leaving them vulnerable to
|
replication systems don’t use good techniques for ordering updates, leaving them vulnerable to
|
||||||
issues like the one in [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality). If you are using multi-leader replication, it
|
issues like the one in [Figure 6-8](/ch06.html#fig_replication_causality). If you are using multi-leader replication, it
|
||||||
is worth being aware of these issues, carefully reading the documentation, and thoroughly testing
|
is worth being aware of these issues, carefully reading the documentation, and thoroughly testing
|
||||||
your database to ensure that it really does provide the guarantees you believe it to have.
|
your database to ensure that it really does provide the guarantees you believe it to have.
|
||||||
|
|
||||||
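To give a feel for how version vectors order events, here is a minimal sketch of comparing two vectors. It is a generic illustration rather than the specific algorithm described later in the chapter; the representation (a dict from node name to counter) and the names `Order` and `compare` are assumptions of this example.

```python
from enum import Enum


class Order(Enum):
    BEFORE = "before"          # a happened before b
    AFTER = "after"            # a happened after b
    EQUAL = "equal"            # same version
    CONCURRENT = "concurrent"  # neither knew about the other


def compare(a: dict, b: dict) -> Order:
    """Compare two version vectors given as {node_name: counter} dicts.
    A node missing from a vector is treated as having counter 0."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return Order.EQUAL
    if a_le_b:
        return Order.BEFORE
    if b_le_a:
        return Order.AFTER
    return Order.CONCURRENT
```

In the scenario of Figure 6-8, the insert might carry `{"leader1": 1}` and the update `{"leader1": 1, "leader3": 1}`. Comparing them yields `Order.BEFORE`, so a replica that receives the update first can tell that it depends on a write it has not yet seen, without relying on synchronized clocks.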
|
|
@ -918,9 +916,9 @@ Sheets for text documents and spreadsheets, Figma for graphics, and Linear for p
|
||||||
What makes these apps so responsive is that user input is immediately reflected in the user
|
What makes these apps so responsive is that user input is immediately reflected in the user
|
||||||
interface, without waiting for a network round-trip to the server, and edits by one user are shown
|
interface, without waiting for a network round-trip to the server, and edits by one user are shown
|
||||||
to their collaborators with low latency
|
to their collaborators with low latency
|
||||||
[[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DayRichter2010),
|
[[32](/ch06.html#DayRichter2010),
|
||||||
[33](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Wallace2019),
|
[33](/ch06.html#Wallace2019),
|
||||||
[34](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Artman2023)].
|
[34](/ch06.html#Artman2023)].
|
||||||
|
|
||||||
This again results in a multi-leader architecture: each web browser tab that has opened the shared
|
This again results in a multi-leader architecture: each web browser tab that has opened the shared
|
||||||
file is a replica, and any updates that you make to the file are asynchronously replicated to the
|
file is a replica, and any updates that you make to the file are asynchronously replicated to the
|
||||||
|
|
@ -938,9 +936,9 @@ those changes.
|
||||||
|
|
||||||
A software library that supports this process is called a *sync engine*. Although the idea has
|
A software library that supports this process is called a *sync engine*. Although the idea has
|
||||||
existed for a long time, the term has recently gained attention
|
existed for a long time, the term has recently gained attention
|
||||||
[[35](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Saafan2024),
|
[[35](/ch06.html#Saafan2024),
|
||||||
[36](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hagoel2024),
|
[36](/ch06.html#Hagoel2024),
|
||||||
[37](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Jayakar2024)].
|
[37](/ch06.html#Jayakar2024)].
|
||||||
An application that allows a user to continue editing a file while offline (which may be implemented
|
An application that allows a user to continue editing a file while offline (which may be implemented
|
||||||
using a sync engine) is called *offline-first*
|
using a sync engine) is called *offline-first*
|
||||||
[^38].
|
[^38].
|
||||||
|
|
@ -970,7 +968,7 @@ approach has a number of advantages:
|
||||||
offline is the same as having very large network delay.
|
offline is the same as having very large network delay.
|
||||||
* A sync engine simplifies the programming model for frontend apps, compared to performing explicit
|
* A sync engine simplifies the programming model for frontend apps, compared to performing explicit
|
||||||
service calls in application code. Every service call requires error handling, as discussed in
|
service calls in application code. Every service call requires error handling, as discussed in
|
||||||
[“The problems with remote procedure calls (RPCs)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch05.html#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user
|
[“The problems with remote procedure calls (RPCs)”](/ch05.html#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user
|
||||||
interface needs to somehow reflect that error. A sync engine allows the app to perform reads and
|
interface needs to somehow reflect that error. A sync engine allows the app to perform reads and
|
||||||
writes on local data, which almost never fails, leading to a more declarative programming style
|
writes on local data, which almost never fails, leading to a more declarative programming style
|
||||||
[^41].
|
[^41].
|
||||||
|
|
@ -1007,7 +1005,7 @@ a local-first sync engine on end user devices—is that concurrent writes on dif
|
||||||
lead to conflicts that need to be resolved.
|
lead to conflicts that need to be resolved.
|
||||||
|
|
||||||
For example, consider a wiki page that is simultaneously being edited by two users, as shown in
|
For example, consider a wiki page that is simultaneously being edited by two users, as shown in
|
||||||
[Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2
|
[Figure 6-9](/ch06.html#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2
|
||||||
independently changes the title from A to C. Each user’s change is successfully applied to their
|
independently changes the title from A to C. Each user’s change is successfully applied to their
|
||||||
local leader. However, when the changes are asynchronously replicated, a conflict is detected.
|
local leader. However, when the changes are asynchronously replicated, a conflict is detected.
|
||||||
This problem does not occur in a single-leader database.
|
This problem does not occur in a single-leader database.
|
||||||
|
|
@ -1017,13 +1015,13 @@ This problem does not occur in a single-leader database.
|
||||||
###### Figure 6-9. A write conflict caused by two leaders concurrently updating the same record.
|
###### Figure 6-9. A write conflict caused by two leaders concurrently updating the same record.
|
||||||
|
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> We say that the two writes in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict) are *concurrent* because neither
|
> We say that the two writes in [Figure 6-9](/ch06.html#fig_replication_write_conflict) are *concurrent* because neither
|
||||||
> was “aware” of the other at the time the write was originally made. It doesn’t matter whether the
|
> was “aware” of the other at the time the write was originally made. It doesn’t matter whether the
|
||||||
> writes literally happened at the same time; indeed, if the writes were made while offline, they
|
> writes literally happened at the same time; indeed, if the writes were made while offline, they
|
||||||
> might have actually happened some time apart. What matters is whether one write occurred in a state
|
> might have actually happened some time apart. What matters is whether one write occurred in a state
|
||||||
> where the other write has already taken effect.
|
> where the other write has already taken effect.
|
||||||
|
|
||||||
In [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent) we will tackle the question of how a database can determine
|
In [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent) we will tackle the question of how a database can determine
|
||||||
whether two writes are concurrent. For now we will assume that we can detect conflicts, and we want
|
whether two writes are concurrent. For now we will assume that we can detect conflicts, and we want
|
||||||
to figure out the best way of resolving them.
|
to figure out the best way of resolving them.
|
||||||
|
|
||||||
|
|
@ -1052,13 +1050,13 @@ Another example of conflict avoidance: imagine you want to insert new records an
|
||||||
IDs for them based on an auto-incrementing counter. If you have two leaders, you could set them up
|
IDs for them based on an auto-incrementing counter. If you have two leaders, you could set them up
|
||||||
so that one leader only generates odd numbers and the other only generates even numbers. That way
|
so that one leader only generates odd numbers and the other only generates even numbers. That way
|
||||||
you can be sure that the two leaders won’t concurrently assign the same ID to different records.
|
you can be sure that the two leaders won’t concurrently assign the same ID to different records.
|
||||||
We will discuss other ID assignment schemes in [“ID Generators and Logical Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_logical).
|
We will discuss other ID assignment schemes in [“ID Generators and Logical Clocks”](/ch10.html#sec_consistency_logical).
|
||||||
|
|
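The odd/even scheme described above can be sketched in a few lines. This is a minimal illustration rather than any particular database's implementation; the function name `id_generator` and its parameters are hypothetical, and it assumes the number of leaders is fixed and known in advance.

```python
import itertools


def id_generator(leader_index: int, num_leaders: int):
    """Return an iterator of IDs that only this leader will ever produce.

    Hypothetical helper: leader 0 of 2 yields 1, 3, 5, ... and
    leader 1 of 2 yields 2, 4, 6, ...
    """
    return itertools.count(start=leader_index + 1, step=num_leaders)


leader_a = id_generator(0, 2)  # odd numbers only
leader_b = id_generator(1, 2)  # even numbers only

ids_a = [next(leader_a) for _ in range(3)]
ids_b = [next(leader_b) for _ in range(3)]
```

Because the two sequences never intersect, the leaders need no coordination when assigning IDs. The sketch generalizes to more leaders by using the leader count as the step, though adding a leader later would require renumbering.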
||||||
### Last write wins (discarding concurrent writes)
|
### Last write wins (discarding concurrent writes)
|
||||||
|
|
||||||
If conflicts can’t be avoided, the simplest way of resolving them is to attach a timestamp to each
|
If conflicts can’t be avoided, the simplest way of resolving them is to attach a timestamp to each
|
||||||
write, and to always use the value with the greatest timestamp. For example, in
|
write, and to always use the value with the greatest timestamp. For example, in
|
||||||
[Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than
|
[Figure 6-9](/ch06.html#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than
|
||||||
the timestamp of user 2’s write. In that case, both leaders will determine that the new title of the
|
the timestamp of user 2’s write. In that case, both leaders will determine that the new title of the
|
||||||
page should be B, and they discard the write that sets it to C. If the writes coincidentally have
|
page should be B, and they discard the write that sets it to C. If the writes coincidentally have
|
||||||
the same timestamp, the winner can be chosen by comparing the values (e.g., in the case of strings,
|
the same timestamp, the winner can be chosen by comparing the values (e.g., in the case of strings,
|
||||||
|
|
@ -1066,7 +1064,7 @@ taking the one that’s earlier in the alphabet).
|
||||||
|
|
||||||
This approach is called *last write wins* (LWW) because the write with the greatest timestamp can be
|
This approach is called *last write wins* (LWW) because the write with the greatest timestamp can be
|
||||||
considered the “last” one. The term is misleading though, because when two writes are concurrent
|
considered the “last” one. The term is misleading though, because when two writes are concurrent
|
||||||
like in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), which one is older and which is later is undefined, and
|
like in [Figure 6-9](/ch06.html#fig_replication_write_conflict), which one is older and which is later is undefined, and
|
||||||
so the timestamp order of concurrent writes is essentially random.
|
so the timestamp order of concurrent writes is essentially random.
|
||||||
|
|
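As a rough sketch of the rule described above, the following function picks a winner between two conflicting writes by timestamp and breaks ties by comparing the values. The `Write` type, the `lww_merge` name, and the concrete timestamps are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    timestamp: int  # assumed to be an integer attached by the writer
    value: str


def lww_merge(a: Write, b: Write) -> Write:
    """Keep the write with the greater timestamp and discard the other."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Equal timestamps: choose deterministically by comparing the values,
    # here taking the string that sorts earlier.
    return a if a.value <= b.value else b


user1 = Write(timestamp=20, value="B")
user2 = Write(timestamp=10, value="C")
winner = lww_merge(user1, user2)
```

The function returns the same result regardless of argument order, which is what lets every leader converge on the same value. It also makes the data loss explicit: the losing write is simply dropped.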
||||||
Therefore the real meaning of LWW is: when the same record is concurrently written on different
|
Therefore the real meaning of LWW is: when the same record is concurrently written on different
|
||||||
|
|
@ -1084,7 +1082,7 @@ Another problem with LWW is that if a real-time clock (e.g. a Unix timestamp) is
|
||||||
for the writes, the system becomes very sensitive to clock synchronization. If one node has a clock
|
for the writes, the system becomes very sensitive to clock synchronization. If one node has a clock
|
||||||
that is ahead of the others, and you try to overwrite a value written by that node, your write may
|
that is ahead of the others, and you try to overwrite a value written by that node, your write may
|
||||||
be ignored as it may have a lower timestamp, even though it clearly occurred later. This problem can
|
be ignored as it may have a lower timestamp, even though it clearly occurred later. This problem can
|
||||||
be solved by using a *logical clock*, which we will discuss in [“ID Generators and Logical Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_logical).
|
be solved by using a *logical clock*, which we will discuss in [“ID Generators and Logical Clocks”](/ch10.html#sec_consistency_logical).
|
||||||
|
|
||||||
### Manual conflict resolution
|
### Manual conflict resolution
|
||||||
|
|
||||||
|
|
@ -1096,7 +1094,7 @@ merge is complete.
|
||||||
|
|
||||||
In a database, it would be impractical for a conflict to stop the entire replication process until a
|
In a database, it would be impractical for a conflict to stop the entire replication process until a
|
||||||
human has resolved it. Instead, databases typically store all the concurrently written values for a
|
human has resolved it. Instead, databases typically store all the concurrently written values for a
|
||||||
given record—for example, both B and C in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict). These values are
|
given record—for example, both B and C in [Figure 6-9](/ch06.html#fig_replication_write_conflict). These values are
|
||||||
sometimes called *siblings*. The next time you query that record, the database returns *all* those
|
sometimes called *siblings*. The next time you query that record, the database returns *all* those
|
||||||
values, rather than just the latest one. You can then resolve those values in whatever way you want,
|
values, rather than just the latest one. You can then resolve those values in whatever way you want,
|
||||||
either automatically in application code (for example, you could concatenate B and C into “B/C”), or
|
either automatically in application code (for example, you could concatenate B and C into “B/C”), or
|
||||||
|
|
@ -1120,7 +1118,7 @@ suffers from a number of problems:
|
||||||
sibling, but another sibling still contained that old item, the removed item would unexpectedly
|
sibling, but another sibling still contained that old item, the removed item would unexpectedly
|
||||||
reappear in the customer’s cart
|
reappear in the customer’s cart
|
||||||
[^45].
|
[^45].
|
||||||
[Figure 6-10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
|
[Figure 6-10](/ch06.html#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
|
||||||
cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear.
|
cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear.
|
||||||
* If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution
|
* If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution
|
||||||
process can itself introduce a new conflict. Those resolutions could even be inconsistent: for
|
process can itself introduce a new conflict. Those resolutions could even be inconsistent: for
|
||||||
|
|
@ -1149,7 +1147,7 @@ updates as much as possible, and hence avoiding data loss:
|
||||||
same position, it can be ordered deterministically so that all nodes get the same merged outcome.
|
same position, it can be ordered deterministically so that all nodes get the same merged outcome.
|
||||||
* If the data is a collection of items (ordered like a to-do list, or unordered like a shopping
|
* If the data is a collection of items (ordered like a to-do list, or unordered like a shopping
|
||||||
cart), we can merge it similarly to text by tracking insertions and deletions. To avoid the
|
cart), we can merge it similarly to text by tracking insertions and deletions. To avoid the
|
||||||
shopping cart issue in [Figure 6-10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_amazon_anomaly), the algorithms track the fact that Book
|
shopping cart issue in [Figure 6-10](/ch06.html#fig_replication_amazon_anomaly), the algorithms track the fact that Book
|
||||||
and DVD were deleted, so the merged result is Cart = {Soap}.
|
and DVD were deleted, so the merged result is Cart = {Soap}.
|
||||||
* If the data is an integer representing a counter that can be incremented or decremented (e.g., the
|
* If the data is an integer representing a counter that can be incremented or decremented (e.g., the
|
||||||
number of likes on a social media post), the merge algorithm can tell how many increments and
|
number of likes on a social media post), the merge algorithm can tell how many increments and
|
||||||
|
|
@ -1175,7 +1173,7 @@ Two families of algorithms are commonly used to implement automatic conflict res
|
||||||
They have different design philosophies and performance characteristics, but both are able to
|
They have different design philosophies and performance characteristics, but both are able to
|
||||||
perform automatic merges for all the aforementioned types of data.
|
perform automatic merges for all the aforementioned types of data.
|
||||||
|
|
||||||
[Figure 6-11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
|
[Figure 6-11](/ch06.html#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
|
||||||
text. Assume you have two replicas that both start off with the text “ice”. One replica prepends the
|
text. Assume you have two replicas that both start off with the text “ice”. One replica prepends the
|
||||||
letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make
|
letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make
|
||||||
“ice!”.
|
“ice!”.
|
||||||
|
|
@ -1196,7 +1194,7 @@ OT
|
||||||
|
|
||||||
CRDT
|
CRDT
|
||||||
: Most CRDTs give each character a unique, immutable ID and use those to determine the positions of
|
: Most CRDTs give each character a unique, immutable ID and use those to determine the positions of
|
||||||
insertions/deletions, instead of indexes. For example, in [Figure 6-11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_ot_crdt) we assign
|
insertions/deletions, instead of indexes. For example, in [Figure 6-11](/ch06.html#fig_replication_ot_crdt) we assign
|
||||||
the ID 1A to “i”, the ID 2A to “c”, etc. When inserting the exclamation mark, we generate an
|
the ID 1A to “i”, the ID 2A to “c”, etc. When inserting the exclamation mark, we generate an
|
||||||
operation containing the ID of the new character (4B) and the ID of the existing character after
|
operation containing the ID of the new character (4B) and the ID of the existing character after
|
||||||
which we want to insert (3A). To insert at the beginning of the string we give “nil” as the
|
which we want to insert (3A). To insert at the beginning of the string we give “nil” as the
|
||||||
|
|
@ -1218,7 +1216,7 @@ Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge o
|
||||||
|
|
||||||
### What is a conflict?
|
### What is a conflict?
|
||||||
|
|
||||||
Some kinds of conflict are obvious. In the example in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), two writes
|
Some kinds of conflict are obvious. In the example in [Figure 6-9](/ch06.html#fig_replication_write_conflict), two writes
|
||||||
concurrently modified the same field in the same record, setting it to two different values. There
|
concurrently modified the same field in the same record, setting it to two different values. There
|
||||||
is little doubt that this is a conflict.
|
is little doubt that this is a conflict.
|
||||||
|
|
||||||
|
|
@ -1232,7 +1230,7 @@ are made on two different leaders.
|
||||||
|
|
||||||
There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a
|
There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a
|
||||||
good understanding of this problem. We will see some more examples of conflicts in
|
good understanding of this problem. We will see some more examples of conflicts in
|
||||||
[Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and
|
[Chapter 8](/ch08.html#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and
|
||||||
resolving conflicts in a replicated system.
|
resolving conflicts in a replicated system.
|
||||||
|
|
||||||
# Leaderless Replication
|
# Leaderless Replication
|
||||||
|
|
@ -1245,8 +1243,8 @@ writes in the same order.
|
||||||
|
|
||||||
Some data storage systems take a different approach, abandoning the concept of a leader and
|
Some data storage systems take a different approach, abandoning the concept of a leader and
|
||||||
allowing any replica to directly accept writes from clients. Some of the earliest replicated data
|
allowing any replica to directly accept writes from clients. Some of the earliest replicated data
|
||||||
systems were leaderless [[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lindsay1979_ch6),
|
systems were leaderless [[1](/ch06.html#Lindsay1979_ch6),
|
||||||
[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gifford1979)], but the
|
[50](/ch06.html#Gifford1979)], but the
|
||||||
idea was mostly forgotten during the era of dominance of relational databases. It once again became
|
idea was mostly forgotten during the era of dominance of relational databases. It once again became
|
||||||
a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in
|
a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in
|
||||||
2007 [^45].
|
2007 [^45].
|
||||||
|
|
@ -1270,10 +1268,10 @@ profound consequences for the way the database is used.
|
||||||
Imagine you have a database with three replicas, and one of the replicas is currently
|
Imagine you have a database with three replicas, and one of the replicas is currently
|
||||||
unavailable—perhaps it is being rebooted to install a system update. In a single-leader
|
unavailable—perhaps it is being rebooted to install a system update. In a single-leader
|
||||||
configuration, if you want to continue processing writes, you may need to perform a failover (see
|
configuration, if you want to continue processing writes, you may need to perform a failover (see
|
||||||
[“Handling Node Outages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_failover)).
|
[“Handling Node Outages”](/ch06.html#sec_replication_failover)).
|
||||||
|
|
||||||
On the other hand, in a leaderless configuration, failover does not exist.
|
On the other hand, in a leaderless configuration, failover does not exist.
|
||||||
[Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to
|
[Figure 6-12](/ch06.html#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to
|
||||||
all three replicas in parallel, and the two available replicas accept the write but the unavailable
|
all three replicas in parallel, and the two available replicas accept the write but the unavailable
|
||||||
replica misses it. Let’s say that it’s sufficient for two out of three replicas to
|
replica misses it. Let’s say that it’s sufficient for two out of three replicas to
|
||||||
acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be
|
acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be
|
||||||
|
|
@ -1294,9 +1292,9 @@ stale value from another.
|
||||||
|
|
||||||
In order to tell which responses are up-to-date and which are outdated, every value that is written
|
In order to tell which responses are up-to-date and which are outdated, every value that is written
|
||||||
needs to be tagged with a version number or timestamp, similarly to what we saw in
|
needs to be tagged with a version number or timestamp, similarly to what we saw in
|
||||||
[“Last write wins (discarding concurrent writes)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lww). When a client receives multiple values in response to a read, it uses the
|
[“Last write wins (discarding concurrent writes)”](/ch06.html#sec_replication_lww). When a client receives multiple values in response to a read, it uses the
|
||||||
one with the greatest timestamp (even if that value was only returned by one replica, and several
|
one with the greatest timestamp (even if that value was only returned by one replica, and several
|
||||||
other replicas returned older values). See [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent) for more details.
|
other replicas returned older values). See [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent) for more details.
|
||||||
|
|
||||||
### Catching up on missed writes
|
### Catching up on missed writes
|
||||||
|
|
||||||
|
|
@ -1306,7 +1304,7 @@ mechanisms are used in Dynamo-style datastores:
|
||||||
|
|
||||||
Read repair
|
Read repair
|
||||||
: When a client makes a read from several nodes in parallel, it can detect any stale responses.
|
: When a client makes a read from several nodes in parallel, it can detect any stale responses.
|
||||||
For example, in [Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from
|
For example, in [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from
|
||||||
replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale
|
replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale
|
||||||
value and writes the newer value back to that replica. This approach works well for values that are
|
value and writes the newer value back to that replica. This approach works well for values that are
|
||||||
frequently read.
|
frequently read.
|
||||||
|
|
@ -1326,7 +1324,7 @@ Anti-entropy
|
||||||
|
|
||||||
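Read repair, as described above, can be sketched with in-memory dictionaries standing in for replicas. This is a simplified model under stated assumptions: each replica stores a `(version, value)` pair per key, all replicas are reachable, and at least one of them holds the key. The names `read_with_repair`, `user:1234`, and the stored values are hypothetical.

```python
from typing import Dict, Tuple

Versioned = Tuple[int, str]  # (version number, value)


def read_with_repair(replicas: Dict[str, Dict[str, Versioned]], key: str) -> Versioned:
    """Read a key from every replica, return the newest value,
    and write that value back to any replica that returned a stale one."""
    responses = {name: store[key] for name, store in replicas.items() if key in store}
    newest = max(responses.values(), key=lambda versioned: versioned[0])
    for name, (version, _) in responses.items():
        if version < newest[0]:
            replicas[name][key] = newest  # repair the stale replica
    return newest


replicas = {
    "replica1": {"user:1234": (7, "new.png")},
    "replica2": {"user:1234": (7, "new.png")},
    "replica3": {"user:1234": (6, "old.png")},  # missed the latest write
}
result = read_with_repair(replicas, "user:1234")
```

After the call, replica 3 holds version 7 as well. Note that a key that is never read is never repaired by this mechanism alone.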
### Quorums for reading and writing
|
### Quorums for reading and writing
|
||||||
|
|
||||||
In the example of [Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage), we considered the write to be successful
|
In the example of [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage), we considered the write to be successful
|
||||||
even though it was only processed on two out of three replicas. What if only one out of three
|
even though it was only processed on two out of three replicas. What if only one out of three
|
||||||
replicas accepted the write? How far can we push this?
|
replicas accepted the write? How far can we push this?
|
||||||
|
|
||||||
|
|
@ -1354,7 +1352,7 @@ database writes to fail.
|
||||||
> [!NOTE]
|
> [!NOTE]
|
||||||
> There may be more than *n* nodes in the cluster, but any given value is stored only on *n*
|
> There may be more than *n* nodes in the cluster, but any given value is stored only on *n*
|
||||||
> nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit
|
> nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit
|
||||||
> on one node. We will return to sharding in [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding).
|
> on one node. We will return to sharding in [Chapter 7](/ch07.html#ch_sharding).
|
||||||
|
|
||||||
The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes
|
The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes
|
||||||
as follows:
|
as follows:
|
||||||
|
|
@ -1362,9 +1360,9 @@ as follows:
|
||||||
* If *w* < *n*, we can still process writes if a node is unavailable.
|
* If *w* < *n*, we can still process writes if a node is unavailable.
|
||||||
* If *r* < *n*, we can still process reads if a node is unavailable.
|
* If *r* < *n*, we can still process reads if a node is unavailable.
|
||||||
* With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable
|
* With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable
|
||||||
node, like in [Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage).
|
node, like in [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage).
|
||||||
* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes.
|
* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes.
|
||||||
This case is illustrated in [Figure 6-13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_overlap).
|
This case is illustrated in [Figure 6-13](/ch06.html#fig_replication_quorum_overlap).
|
||||||
|
|
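The arithmetic behind these cases is simple enough to check directly. In this sketch (the function names are illustrative), writes can proceed while at least *w* nodes are up and reads while at least *r* are up, so the number of unavailable nodes tolerated for both is *n* minus the larger of the two.

```python
def satisfies_quorum(n: int, w: int, r: int) -> bool:
    """True if the quorum condition w + r > n holds."""
    return w + r > n


def tolerated_failures(n: int, w: int, r: int) -> int:
    """Number of unavailable nodes with which both reads and writes
    can still gather enough responses."""
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return n - max(w, r)


small = tolerated_failures(n=3, w=2, r=2)
large = tolerated_failures(n=5, w=3, r=3)
```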
||||||
Normally, reads and writes are always sent to all *n* replicas in parallel. The parameters *w* and
|
Normally, reads and writes are always sent to all *n* replicas in parallel. The parameters *w* and
|
||||||
*r* determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success
|
*r* determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success
|
||||||
|
|
@ -1386,7 +1384,7 @@ If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*
|
||||||
generally expect every read to return the most recent value written for a key. This is the case because the
|
generally expect every read to return the most recent value written for a key. This is the case because the
|
||||||
set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That
|
set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That
|
||||||
is, among the nodes you read there must be at least one node with the latest value (illustrated in
|
is, among the nodes you read there must be at least one node with the latest value (illustrated in
|
||||||
[Figure 6-13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_overlap)).
|
[Figure 6-13](/ch06.html#fig_replication_quorum_overlap)).
|
||||||
|
|
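The overlap argument can be verified by brute force for small clusters. The sketch below, with a hypothetical function name, enumerates every possible write set of size *w* and read set of size *r* and checks that each pair shares at least one node.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """Check exhaustively that every write set of w nodes intersects
    every read set of r nodes, out of n nodes in total."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


overlap_with_quorum = quorums_always_overlap(n=5, w=3, r=3)     # w + r > n
overlap_without_quorum = quorums_always_overlap(n=5, w=2, r=3)  # w + r = n
```

When *w* + *r* equals *n*, a write set and a read set can partition the nodes exactly, so a read may miss the latest write entirely. This check covers only set membership; it says nothing about the timing and failure scenarios that can still produce stale reads.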
||||||
Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, because that ensures
|
Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, because that ensures
|
||||||
*w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are
|
*w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are
|
||||||
|
|
@ -1413,12 +1411,12 @@ properties can be confusing. Some scenarios include:
|
||||||
value, the number of replicas storing the new value may fall below *w*, breaking the quorum
|
value, the number of replicas storing the new value may fall below *w*, breaking the quorum
|
||||||
condition.
|
condition.
|
||||||
* While a rebalancing is in progress, where some data is moved from one node to another (see
|
* While a rebalancing is in progress, where some data is moved from one node to another (see
|
||||||
[Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n*
|
[Chapter 7](/ch07.html#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n*
|
||||||
replicas for a particular value. This can result in the read and write quorums no longer
|
replicas for a particular value. This can result in the read and write quorums no longer
|
||||||
overlapping.
|
overlapping.
|
||||||
* If a read is concurrent with a write operation, the read may or may not see the concurrently
|
* If a read is concurrent with a write operation, the read may or may not see the concurrently
|
||||||
written value. In particular, it’s possible for one read to see the new value, and a subsequent
|
written value. In particular, it’s possible for one read to see the new value, and a subsequent
|
||||||
read to see the old value, as we shall see in [“Linearizability and quorums”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_quorum_linearizable).
|
read to see the old value, as we shall see in [“Linearizability and quorums”](/ch10.html#sec_consistency_quorum_linearizable).
|
||||||
* If a write succeeded on some replicas but failed on others (for example because the disks on some
|
* If a write succeeded on some replicas but failed on others (for example because the disks on some
|
||||||
nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the
|
nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the
|
||||||
replicas where it succeeded. This means that if a write was reported as failed, subsequent reads
|
replicas where it succeeded. This means that if a write was reported as failed, subsequent reads
|
||||||
|
|
@ -1426,12 +1424,12 @@ properties can be confusing. Some scenarios include:
|
||||||
[^52].
|
[^52].
|
||||||
* If the database uses timestamps from a real-time clock to determine which write is newer (as
|
* If the database uses timestamps from a real-time clock to determine which write is newer (as
|
||||||
Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a
|
Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a
|
||||||
faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lww).
|
faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](/ch06.html#sec_replication_lww).
|
||||||
We will discuss this in more detail in [“Relying on Synchronized Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_clocks_relying).
|
We will discuss this in more detail in [“Relying on Synchronized Clocks”](/ch09.html#sec_distributed_clocks_relying).
|
||||||
* If two writes occur concurrently, one of them might be processed first on one replica, and the
|
* If two writes occur concurrently, one of them might be processed first on one replica, and the
|
||||||
other might be processed first on another replica. This leads to a conflict, similarly to what we
|
other might be processed first on another replica. This leads to a conflict, similarly to what we
|
||||||
saw for multi-leader replication (see [“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts)). We will return to this
|
saw for multi-leader replication (see [“Dealing with Conflicting Writes”](/ch06.html#sec_replication_write_conflicts)). We will return to this
|
||||||
topic in [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent).
|
topic in [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent).
|
||||||
|
|
||||||
Thus, although quorums appear to guarantee that a read returns the latest written value, in practice
|
Thus, although quorums appear to guarantee that a read returns the latest written value, in practice
|
||||||
it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate
|
it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate
|
||||||
|
|
@ -1463,7 +1461,7 @@ able to quantify “eventual.”
|
||||||
|
|
||||||
A replication system based on a single leader can provide strong consistency guarantees that are
|
A replication system based on a single leader can provide strong consistency guarantees that are
|
||||||
difficult or impossible to achieve in a leaderless system. However, as we have seen in
|
difficult or impossible to achieve in a leaderless system. However, as we have seen in
|
||||||
[“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag), reads in a leader-based replicated system can also return stale values if
|
[“Problems with Replication Lag”](/ch06.html#sec_replication_lag), reads in a leader-based replicated system can also return stale values if
|
||||||
you make them on an asynchronously updated follower.
|
you make them on an asynchronously updated follower.
|
||||||
|
|
||||||
Reading from the leader ensures up-to-date responses, but it suffers from performance problems:
|
Reading from the leader ensures up-to-date responses, but it suffers from performance problems:
|
||||||
|
|
@ -1507,7 +1505,7 @@ That said, leaderless systems can have performance problems as well:
|
||||||
to wait for before a request can complete. Even if you wait only for the fastest *r* or *w*
|
to wait for before a request can complete. Even if you wait only for the fastest *r* or *w*
|
||||||
replicas to respond, and even if you make the requests in parallel, a bigger *r* or *w* increases
|
replicas to respond, and even if you make the requests in parallel, a bigger *r* or *w* increases
|
||||||
the chance that you hit a slow replica, increasing the overall response time (see
|
the chance that you hit a slow replica, increasing the overall response time (see
|
||||||
[“Use of Response Time Metrics”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_slo_sla)).
|
[“Use of Response Time Metrics”](/ch02.html#sec_introduction_slo_sla)).
|
||||||
* A large-scale network interruption that disconnects a client from a large number of replicas can
|
* A large-scale network interruption that disconnects a client from a large number of replicas can
|
||||||
make it impossible to form a quorum. Some leaderless databases offer a configuration option that
|
make it impossible to form a quorum. Some leaderless databases offer a configuration option that
|
||||||
allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that
|
allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that
|
||||||
|
|
@ -1526,7 +1524,7 @@ fault tolerance while also having a high likelihood of reading up-to-date data.
|
||||||
### Multi-region operation
|
### Multi-region operation
|
||||||
|
|
||||||
We previously discussed cross-region replication as a use case for multi-leader replication (see
|
We previously discussed cross-region replication as a use case for multi-leader replication (see
|
||||||
[“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader)). Leaderless replication is also suitable for
|
[“Multi-Leader Replication”](/ch06.html#sec_replication_multi_leader)). Leaderless replication is also suitable for
|
||||||
multi-region operation, since it is designed to tolerate conflicting concurrent writes, network
|
multi-region operation, since it is designed to tolerate conflicting concurrent writes, network
|
||||||
interruptions, and latency spikes.
|
interruptions, and latency spikes.
|
||||||
|
|
||||||
|
|
@ -1549,7 +1547,7 @@ resulting in conflicts that need to be resolved. Such conflicts may occur as the
|
||||||
not always: they could also be detected later during read repair, hinted handoff, or anti-entropy.
|
not always: they could also be detected later during read repair, hinted handoff, or anti-entropy.
|
||||||
|
|
||||||
The problem is that events may arrive in a different order at different nodes, due to variable
|
The problem is that events may arrive in a different order at different nodes, due to variable
|
||||||
network delays and partial failures. For example, [Figure 6-14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_concurrency) shows two clients,
|
network delays and partial failures. For example, [Figure 6-14](/ch06.html#fig_replication_concurrency) shows two clients,
|
||||||
A and B, simultaneously writing to a key *X* in a three-node datastore:
|
A and B, simultaneously writing to a key *X* in a three-node datastore:
|
||||||
|
|
||||||
* Node 1 receives the write from A, but never receives the write from B due to a transient
|
* Node 1 receives the write from A, but never receives the write from B due to a transient
|
||||||
|
|
@ -1563,13 +1561,13 @@ A and B, simultaneously writing to a key *X* in a three-node datastore:
|
||||||
|
|
||||||
If each node simply overwrote the value for a key whenever it received a write request from a
|
If each node simply overwrote the value for a key whenever it received a write request from a
|
||||||
client, the nodes would become permanently inconsistent, as shown by the final *get* request in
|
client, the nodes would become permanently inconsistent, as shown by the final *get* request in
|
||||||
[Figure 6-14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other
|
[Figure 6-14](/ch06.html#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other
|
||||||
nodes think that the value is A.
|
nodes think that the value is A.
|
||||||
|
|
||||||
In order to become eventually consistent, the replicas should converge toward the same value. For
|
In order to become eventually consistent, the replicas should converge toward the same value. For
|
||||||
this, we can use any of the conflict resolution mechanisms we previously discussed in
|
this, we can use any of the conflict resolution mechanisms we previously discussed in
|
||||||
[“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts), such as last-write-wins (used by Cassandra and ScyllaDB),
|
[“Dealing with Conflicting Writes”](/ch06.html#sec_replication_write_conflicts), such as last-write-wins (used by Cassandra and ScyllaDB),
|
||||||
manual resolution, or CRDTs (described in [“CRDTs and Operational Transformation”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_crdts), and used by Riak).
|
manual resolution, or CRDTs (described in [“CRDTs and Operational Transformation”](/ch06.html#sec_replication_crdts), and used by Riak).
|
||||||
|
|
||||||
Last-write-wins is easy to implement: each write is tagged with a timestamp, and a value with a
|
Last-write-wins is easy to implement: each write is tagged with a timestamp, and a value with a
|
||||||
higher timestamp always overwrites a value with a lower timestamp. However, a timestamp doesn’t tell
|
higher timestamp always overwrites a value with a lower timestamp. However, a timestamp doesn’t tell
|
||||||
|
|
@ -1582,11 +1580,11 @@ take more care to detect concurrent writes.
|
||||||
How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look
|
How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look
|
||||||
at some examples:
|
at some examples:
|
||||||
|
|
||||||
* In [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before*
|
* In [Figure 6-8](/ch06.html#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before*
|
||||||
B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s
|
B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s
|
||||||
operation builds upon A’s operation, so B’s operation must have happened later.
|
operation builds upon A’s operation, so B’s operation must have happened later.
|
||||||
We also say that B is *causally dependent* on A.
|
We also say that B is *causally dependent* on A.
|
||||||
* On the other hand, the two writes in [Figure 6-14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_concurrency) are concurrent: when each
|
* On the other hand, the two writes in [Figure 6-14](/ch06.html#fig_replication_concurrency) are concurrent: when each
|
||||||
client starts the operation, it does not know that another client is also performing an operation
|
client starts the operation, it does not know that another client is also performing an operation
|
||||||
on the same key. Thus, there is no causal dependency between the operations.
|
on the same key. Thus, there is no causal dependency between the operations.
|
||||||
|
|
||||||
|
|
@ -1607,7 +1605,7 @@ conflict that needs to be resolved.
|
||||||
It may seem that two operations should be called concurrent if they occur “at the same time”—but
|
It may seem that two operations should be called concurrent if they occur “at the same time”—but
|
||||||
in fact, it is not important whether they literally overlap in time. Because of problems with clocks
|
in fact, it is not important whether they literally overlap in time. Because of problems with clocks
|
||||||
in distributed systems, it is actually quite difficult to tell whether two things happened
|
in distributed systems, it is actually quite difficult to tell whether two things happened
|
||||||
at exactly the same time—an issue we will discuss in more detail in [Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed).
|
at exactly the same time—an issue we will discuss in more detail in [Chapter 9](/ch09.html#ch_distributed).
|
||||||
|
|
||||||
For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if
|
For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if
|
||||||
they are both unaware of each other, regardless of the physical time at which they occurred. People
|
they are both unaware of each other, regardless of the physical time at which they occurred. People
|
||||||
|
|
@ -1629,7 +1627,7 @@ happened before another. To keep things simple, let’s start with a database th
|
||||||
replica. Once we have worked out how to do this on a single replica, we can generalize the approach
|
replica. Once we have worked out how to do this on a single replica, we can generalize the approach
|
||||||
to a leaderless database with multiple replicas.
|
to a leaderless database with multiple replicas.
|
||||||
|
|
||||||
[Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) shows two clients concurrently adding items to the same
|
[Figure 6-15](/ch06.html#fig_replication_causality_single) shows two clients concurrently adding items to the same
|
||||||
shopping cart. (If that example strikes you as too inane, imagine instead two air traffic
|
shopping cart. (If that example strikes you as too inane, imagine instead two air traffic
|
||||||
controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is
|
controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is
|
||||||
empty. Between them, the clients make five writes to the database:
|
empty. Between them, the clients make five writes to the database:
|
||||||
|
|
@ -1664,8 +1662,8 @@ empty. Between them, the clients make five writes to the database:
|
||||||
|
|
||||||
###### Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart.
|
###### Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart.
|
||||||
|
|
||||||
The dataflow between the operations in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) is illustrated
|
The dataflow between the operations in [Figure 6-15](/ch06.html#fig_replication_causality_single) is illustrated
|
||||||
graphically in [Figure 6-16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causal_dependencies). The arrows indicate which operation
|
graphically in [Figure 6-16](/ch06.html#fig_replication_causal_dependencies). The arrows indicate which operation
|
||||||
*happened before* which other operation, in the sense that the later operation *knew about* or
|
*happened before* which other operation, in the sense that the later operation *knew about* or
|
||||||
*depended on* the earlier one. In this example, the clients are never fully up to date with the data
|
*depended on* the earlier one. In this example, the clients are never fully up to date with the data
|
||||||
on the server, since there is always another operation going on concurrently. But old versions of
|
on the server, since there is always another operation going on concurrently. But old versions of
|
||||||
|
|
@ -1673,7 +1671,7 @@ the value do get overwritten eventually, and no writes are lost.
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
###### Figure 6-16. Graph of causal dependencies in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single).
|
###### Figure 6-16. Graph of causal dependencies in [Figure 6-15](/ch06.html#fig_replication_causality_single).
|
||||||
|
|
||||||
Note that the server can determine whether two operations are concurrent by looking at the version
|
Note that the server can determine whether two operations are concurrent by looking at the version
|
||||||
numbers—it does not need to interpret the value itself (so the value could be any data
|
numbers—it does not need to interpret the value itself (so the value could be any data
|
||||||
|
|
@ -1699,10 +1697,10 @@ on subsequent reads.
|
||||||
|
|
||||||
### Version vectors
|
### Version vectors
|
||||||
|
|
||||||
The example in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) used only a single replica. How does the
|
The example in [Figure 6-15](/ch06.html#fig_replication_causality_single) used only a single replica. How does the
|
||||||
algorithm change when there are multiple replicas, but no leader?
|
algorithm change when there are multiple replicas, but no leader?
|
||||||
|
|
||||||
[Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) uses a single version number to capture dependencies between
|
[Figure 6-15](/ch06.html#fig_replication_causality_single) uses a single version number to capture dependencies between
|
||||||
operations, but that is not sufficient when there are multiple replicas accepting writes
|
operations, but that is not sufficient when there are multiple replicas accepting writes
|
||||||
concurrently. Instead, we need to use a version number *per replica* as well as per key. Each
|
concurrently. Instead, we need to use a version number *per replica* as well as per key. Each
|
||||||
replica increments its own version number when processing a write, and also keeps track of the
|
replica increments its own version number when processing a write, and also keeps track of the
|
||||||
|
|
@ -1713,14 +1711,14 @@ The collection of version numbers from all the replicas is called a *version vec
|
||||||
[^58].
|
[^58].
|
||||||
A few variants of this idea are in use, but the most interesting is probably the *dotted version
|
A few variants of this idea are in use, but the most interesting is probably the *dotted version
|
||||||
vector*
|
vector*
|
||||||
[[59](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Preguica2010),
|
[[59](/ch06.html#Preguica2010),
|
||||||
[60](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Manepalli2022)],
|
[60](/ch06.html#Manepalli2022)],
|
||||||
which is used in Riak 2.0
|
which is used in Riak 2.0
|
||||||
[[61](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Cribbs2014),
|
[[61](/ch06.html#Cribbs2014),
|
||||||
[62](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Brown2015)].
|
[62](/ch06.html#Brown2015)].
|
||||||
We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.
|
We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.
|
||||||
|
|
||||||
Like the version numbers in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single), version vectors are sent from the
|
Like the version numbers in [Figure 6-15](/ch06.html#fig_replication_causality_single), version vectors are sent from the
|
||||||
database replicas to clients when values are read, and need to be sent back to the database when a
|
database replicas to clients when values are read, and need to be sent back to the database when a
|
||||||
value is subsequently written. (Riak encodes the version vector as a string that it calls *causal
|
value is subsequently written. (Riak encodes the version vector as a string that it calls *causal
|
||||||
context*.) The version vector allows the database to distinguish between overwrites and concurrent
|
context*.) The version vector allows the database to distinguish between overwrites and concurrent
|
||||||
|
|
@ -1734,12 +1732,12 @@ siblings are merged correctly.
|
||||||
|
|
||||||
A *version vector* is sometimes also called a *vector clock*, even though they are not quite the
|
A *version vector* is sometimes also called a *vector clock*, even though they are not quite the
|
||||||
same. The difference is subtle—please see the references for details
|
same. The difference is subtle—please see the references for details
|
||||||
[[60](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Manepalli2022),
|
[[60](/ch06.html#Manepalli2022),
|
||||||
[63](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Baquero2011),
|
[63](/ch06.html#Baquero2011),
|
||||||
[64](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Schwarz1994)]. In brief, when
|
[64](/ch06.html#Schwarz1994)]. In brief, when
|
||||||
comparing the state of replicas, version vectors are the right data structure to use.
|
comparing the state of replicas, version vectors are the right data structure to use.
|
||||||
|
|
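As a rough illustration of how a version vector lets a database tell an overwrite from a concurrent write, here is a minimal sketch. It assumes a plain version vector represented as a mapping from replica ID to counter; the names `compare` and `merge` are hypothetical, and this is not the dotted version vector variant that Riak uses.

```python
from typing import Dict

# Hypothetical representation: replica ID -> number of writes seen from that replica.
VersionVector = Dict[str, int]


def compare(a: VersionVector, b: VersionVector) -> str:
    """Classify a relative to b as 'equal', 'before', 'after', or 'concurrent'."""
    replicas = set(a) | set(b)
    # A missing entry means no writes from that replica have been seen yet.
    a_le_b = all(a.get(r, 0) <= b.get(r, 0) for r in replicas)
    b_le_a = all(b.get(r, 0) <= a.get(r, 0) for r in replicas)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # b has seen everything in a: b may overwrite a
    if b_le_a:
        return "after"       # a has seen everything in b: a may overwrite b
    return "concurrent"      # neither has seen the other: keep both as siblings


def merge(a: VersionVector, b: VersionVector) -> VersionVector:
    """Element-wise maximum, used once concurrent siblings have been merged."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}
```

For example, `compare({"r1": 1}, {"r1": 2, "r2": 1})` classifies the first value as an ancestor that can safely be overwritten, whereas `{"r1": 2}` and `{"r2": 1}` are concurrent and would both be kept until a client merges them.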
||||||
# Summary
|
## Summary
|
||||||
|
|
||||||
In this chapter we looked at the issue of replication. Replication can serve several purposes:
|
In this chapter we looked at the issue of replication. Replication can serve several purposes:
|
||||||
|
|
||||||
|
|
@ -1816,10 +1814,10 @@ This chapter has assumed that every replica stores a full copy of the whole data
|
||||||
unrealistic for large datasets. In the next chapter we will look at *sharding*, which allows each
|
unrealistic for large datasets. In the next chapter we will look at *sharding*, which allows each
|
||||||
machine to store only a subset of the data.
|
machine to store only a subset of the data.
|
||||||
|
|
||||||
##### Footnotes
|
|
||||||
|
|
||||||
|
|
||||||
##### References
|
|
||||||
|
### Summary
|
||||||
|
|
||||||
|
|
||||||
[^1]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)
|
[^1]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)
|
||||||
|
|
|
||||||
|
|
@ -51,14 +51,12 @@ Some databases treat partitions and shards as two distinct concepts. For example
|
||||||
partitioning is a way of splitting a large table into several files that are stored on the same
|
partitioning is a way of splitting a large table into several files that are stored on the same
|
||||||
machine (which has several advantages, such as making it very fast to delete an entire partition),
|
machine (which has several advantages, such as making it very fast to delete an entire partition),
|
||||||
whereas sharding splits a dataset across multiple machines
|
whereas sharding splits a dataset across multiple machines
|
||||||
[[1](/en/ch7#Giordano2023),
|
[[^1], [^2]].
|
||||||
[2](/en/ch7#Leach2022)].
|
|
||||||
In many other systems, partitioning is just another word for sharding.
|
In many other systems, partitioning is just another word for sharding.
|
||||||
|
|
||||||
While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to
|
While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to
|
||||||
one theory, the term arose from the online role-play game *Ultima Online*, in which a magic crystal
|
one theory, the term arose from the online role-play game *Ultima Online*, in which a magic crystal
|
||||||
was shattered into pieces, and each of those shards refracted a copy of the game world
|
was shattered into pieces, and each of those shards refracted a copy of the game world [^3].
|
||||||
[^3].
|
|
||||||
The term *shard* thus came to mean one of a set of parallel game servers, and later was carried over
|
The term *shard* thus came to mean one of a set of parallel game servers, and later was carried over
|
||||||
to databases. Another theory is that *shard* was originally an acronym of *System for Highly
|
to databases. Another theory is that *shard* was originally an acronym of *System for Highly
|
||||||
Available Replicated Data*—reportedly a 1980s database, details of which are lost to history.
|
Available Replicated Data*—reportedly a 1980s database, details of which are lost to history.
|
||||||
|
|
@ -87,8 +85,7 @@ single-shard database.
|
||||||
|
|
||||||
The reason for this recommendation is that sharding often adds complexity: you typically have to
|
The reason for this recommendation is that sharding often adds complexity: you typically have to
|
||||||
decide which records to put in which shard by choosing a *partition key*; all records with the
|
decide which records to put in which shard by choosing a *partition key*; all records with the
|
||||||
same partition key are placed in the same shard
|
same partition key are placed in the same shard [^4].
|
||||||
[^4].
|
|
||||||
This choice matters because accessing a record is fast if you know which shard it’s in, but if you
|
This choice matters because accessing a record is fast if you know which shard it’s in, but if you
|
||||||
don’t know the shard you have to do an inefficient search across all shards, and the sharding scheme
|
don’t know the shard you have to do an inefficient search across all shards, and the sharding scheme
|
||||||
is difficult to change.
|
is difficult to change.
|
||||||
|
|
@ -107,11 +104,9 @@ some systems don’t support them at all.
|
||||||
|
|
||||||
Some systems use sharding even on a single machine, typically running one single-threaded process
|
Some systems use sharding even on a single machine, typically running one single-threaded process
|
||||||
per CPU core to make use of the parallelism in the CPU, or to take advantage of a *nonuniform memory
|
per CPU core to make use of the parallelism in the CPU, or to take advantage of a *nonuniform memory
|
||||||
access* (NUMA) architecture in which some banks of memory are closer to one CPU than to others
|
access* (NUMA) architecture in which some banks of memory are closer to one CPU than to others [^5].
|
||||||
[^5].
|
|
||||||
For example, Redis, VoltDB, and FoundationDB use one process per core, and rely on sharding to
|
For example, Redis, VoltDB, and FoundationDB use one process per core, and rely on sharding to
|
||||||
spread load across CPU cores in the same machine
|
spread load across CPU cores in the same machine [^6].
|
||||||
[^6].
|
|
||||||
|
|
||||||
## Sharding for Multitenancy
|
## Sharding for Multitenancy
|
||||||
|
|
||||||
|
|
@ -124,8 +119,7 @@ signups, delivery data etc. are separate from those of other businesses.
|
||||||
Sometimes sharding is used to implement multitenant systems: either each tenant is given a separate
|
Sometimes sharding is used to implement multitenant systems: either each tenant is given a separate
|
||||||
shard, or multiple small tenants may be grouped together into a larger shard. These shards might be
|
shard, or multiple small tenants may be grouped together into a larger shard. These shards might be
|
||||||
physically separate databases (which we previously touched on in [“Embedded storage engines”](/en/ch4#sidebar_embedded)), or
|
physically separate databases (which we previously touched on in [“Embedded storage engines”](/en/ch4#sidebar_embedded)), or
|
||||||
separately manageable portions of a larger logical database
|
separately manageable portions of a larger logical database [^7].
|
||||||
[^7].
|
|
||||||
Using sharding for multitenancy has several advantages:
|
Using sharding for multitenancy has several advantages:
|
||||||
|
|
||||||
Resource isolation
|
Resource isolation
|
||||||
|
|
@ -226,8 +220,7 @@ to distribute the data evenly, the shard boundaries need to adapt to the data.
|
||||||
The shard boundaries might be chosen manually by an administrator, or the database can choose them
|
The shard boundaries might be chosen manually by an administrator, or the database can choose them
|
||||||
automatically. Manual key-range sharding is used by Vitess (a sharding layer for MySQL), for
|
automatically. Manual key-range sharding is used by Vitess (a sharding layer for MySQL), for
|
||||||
example; the automatic variant is used by Bigtable, its open source equivalent HBase, the
|
example; the automatic variant is used by Bigtable, its open source equivalent HBase, the
|
||||||
range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB
|
range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB [^6]. YugabyteDB offers both manual and automatic
|
||||||
[^6]. YugabyteDB offers both manual and automatic
|
|
||||||
tablet splitting.
|
tablet splitting.
|
||||||
|
|
||||||
Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in
|
Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in
|
||||||
|
|
@ -241,8 +234,7 @@ A downside of key range sharding is that you can easily get a hot shard if there
|
||||||
lot of writes to nearby keys. For example, if the key is a timestamp, then the shards correspond to
|
lot of writes to nearby keys. For example, if the key is a timestamp, then the shards correspond to
|
||||||
ranges of time—e.g., one shard per month. Unfortunately, if you write data from the sensors to the
|
ranges of time—e.g., one shard per month. Unfortunately, if you write data from the sensors to the
|
||||||
database as the measurements happen, all the writes end up going to the same shard (the one for
|
database as the measurements happen, all the writes end up going to the same shard (the one for
|
||||||
this month), so that shard can be overloaded with writes while others sit idle
|
this month), so that shard can be overloaded with writes while others sit idle [^13].
|
||||||
[^13].
|
|
||||||
|
|
||||||
To avoid this problem in the sensor database, you need to use something other than the timestamp as
|
To avoid this problem in the sensor database, you need to use something other than the timestamp as
|
||||||
the first element of the key. For example, you could prefix each timestamp with the sensor ID so
|
the first element of the key. For example, you could prefix each timestamp with the sensor ID so
|
||||||
|
|
@ -256,8 +248,7 @@ need to perform a separate range query for each sensor.
|
||||||
When you first set up your database, there are no key ranges to split into shards. Some databases,
|
When you first set up your database, there are no key ranges to split into shards. Some databases,
|
||||||
such as HBase and MongoDB, allow you to configure an initial set of shards on an empty database,
|
such as HBase and MongoDB, allow you to configure an initial set of shards on an empty database,
|
||||||
which is called *pre-splitting*. This requires that you already have some idea of what the key
|
which is called *pre-splitting*. This requires that you already have some idea of what the key
|
||||||
distribution is going to look like, so that you can choose appropriate key range boundaries
|
distribution is going to look like, so that you can choose appropriate key range boundaries [^14].
|
||||||
[^14].
|
|
||||||
|
|
||||||
Later on, as your data volume and write throughput grow, a system with key-range sharding grows by
|
Later on, as your data volume and write throughput grow, a system with key-range sharding grows by
|
||||||
splitting an existing shard into two or more smaller shards, each of which holds a contiguous
|
splitting an existing shard into two or more smaller shards, each of which holds a contiguous
|
||||||
|
|
@ -300,8 +291,7 @@ For sharding purposes, the hash function need not be cryptographically strong: f
|
||||||
uses MD5, whereas Cassandra and ScyllaDB use Murmur3. Many programming languages have simple hash
|
uses MD5, whereas Cassandra and ScyllaDB use Murmur3. Many programming languages have simple hash
|
||||||
functions built in (as they are used for hash tables), but they may not be suitable for sharding:
|
functions built in (as they are used for hash tables), but they may not be suitable for sharding:
|
||||||
for example, in Java’s `Object.hashCode()` and Ruby’s `Object#hash`, the same key may have a
|
for example, in Java’s `Object.hashCode()` and Ruby’s `Object#hash`, the same key may have a
|
||||||
different hash value in different processes, making them unsuitable for sharding
|
different hash value in different processes, making them unsuitable for sharding [^16].
|
||||||
[^16].
|
|
||||||
|
|
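Python has the same pitfall as the Java and Ruby examples above: the built-in `hash()` of a string is randomized per process unless `PYTHONHASHSEED` is fixed, so it cannot be used to route keys. The sketch below shows one way to get a process-independent hash instead. It assumes MD5 purely as a stable (not security-relevant) hash, and the function name `stable_hash` is hypothetical. How the resulting number is mapped to a shard is a separate design decision.

```python
import hashlib


def stable_hash(key: str) -> int:
    """Return a 64-bit hash that is identical in every process and on every machine."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned big-endian integer.
    return int.from_bytes(digest[:8], "big")


# By contrast, hash("user:42") may differ between two runs of the same program,
# because Python salts string hashes per process by default.
```

Because the value depends only on the bytes of the key, every node and client that uses the same function agrees on where a key belongs.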
||||||
### Hash modulo number of nodes
|
### Hash modulo number of nodes
|
||||||
|
|
||||||
|
|
@ -411,16 +401,14 @@ cluster keys for a table. Delta Lake supports both manual and automatic partitio
|
||||||
supports cluster keys. Clustering data not only improves range scan performance, but can
|
supports cluster keys. Clustering data not only improves range scan performance, but can
|
||||||
improve compression and filtering performance as well.
|
improve compression and filtering performance as well.
|
||||||
|
|
||||||
Hash-range sharding is used in YugabyteDB and DynamoDB
|
Hash-range sharding is used in YugabyteDB and DynamoDB [^17], and is an option in MongoDB.
|
||||||
[^17], and is an option in MongoDB.
|
|
||||||
Cassandra and ScyllaDB use a variant of this approach that is illustrated in
|
Cassandra and ScyllaDB use a variant of this approach that is illustrated in
|
||||||
[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
|
[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
|
||||||
to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
|
to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
|
||||||
per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between
|
per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between
|
||||||
those ranges. This means some ranges are bigger than others, but by having multiple ranges per node
|
those ranges. This means some ranges are bigger than others, but by having multiple ranges per node
|
||||||
those imbalances tend to even out
|
those imbalances tend to even out
|
||||||
[[15](/en/ch7#Evans2013),
|
[[^15], [^18]].
|
||||||
[18](/en/ch7#Williams2012)].
|
|
||||||
|
|
||||||

|

|
||||||
|
|
||||||
|
|
@ -446,10 +434,8 @@ ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describ
|
||||||
the same shard as much as possible.
|
the same shard as much as possible.
|
||||||
|
|
||||||
The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of
|
The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of
|
||||||
consistent hashing
|
consistent hashing [^20],
|
||||||
[^20],
|
but several other consistent hashing algorithms have also been proposed [^21],
|
||||||
but several other consistent hashing algorithms have also been proposed
|
|
||||||
[^21],
|
|
||||||
such as *highest random weight*, also known as *rendezvous hashing*
|
such as *highest random weight*, also known as *rendezvous hashing*
|
||||||
[^22],
|
[^22],
|
||||||
and *jump consistent hash*
|
and *jump consistent hash*
|
||||||
|
|
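As a rough illustration of *highest random weight* (rendezvous) hashing mentioned in this hunk: every node gets a pseudo-random weight for a given key, and the key is assigned to the node with the highest weight. The function names and the use of SHA-256 below are assumptions made for this sketch, not part of any particular database.

```python
import hashlib


def weight(node, key):
    """Pseudo-random weight of a (node, key) pair; SHA-256 is an illustrative choice."""
    digest = hashlib.sha256(f"{node}:{key}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def rendezvous_owner(nodes, key):
    """Assign the key to the node with the highest weight for that key."""
    return max(nodes, key=lambda node: weight(node, key))


nodes = ["node-a", "node-b", "node-c", "node-d"]
owner = rendezvous_owner(nodes, "user:42")
```

A useful property of this approach is that removing a node only reassigns the keys that node owned: for every other key, the highest-weight node is unchanged.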
@ -473,11 +459,9 @@ This event can result in a large volume of reads and writes to the same key (whe
|
||||||
is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on).
|
is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on).
|
||||||
|
|
||||||
In such situations, a more flexible sharding policy is required
|
In such situations, a more flexible sharding policy is required
|
||||||
[[25](/en/ch7#Guo2020),
|
[[^25], [^26]].
|
||||||
[26](/en/ch7#Lee2021)].
|
|
||||||
A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put
|
A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put
|
||||||
an individual hot key in a shard of its own, and perhaps even assign it a dedicated machine
|
an individual hot key in a shard of its own, and perhaps even assign it a dedicated machine [^27].
|
||||||
[^27].
|
|
||||||
|
|
||||||
It’s also possible to compensate for skew at the application level. For example, if one key is known
|
It’s also possible to compensate for skew at the application level. For example, if one key is known
|
||||||
to be very hot, a simple technique is to add a random number to the beginning or end of the key.
|
to be very hot, a simple technique is to add a random number to the beginning or end of the key.
|
||||||
|
|
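The application-level technique above, adding a random number to a hot key, might look like the following minimal sketch. The helper names, the underscore separator, and the fan-out of 100 are assumptions chosen for illustration. Note the cost on the read side: a reader has to fetch every sub-key and combine the results.

```python
import random


def write_key(hot_key, fanout=100, rng=random):
    """Pick one of `fanout` sub-keys at random so writes spread across shards."""
    return f"{hot_key}_{rng.randrange(fanout):02d}"


def read_keys(hot_key, fanout=100):
    """List every sub-key a reader must fetch and combine for this hot key."""
    return [f"{hot_key}_{i:02d}" for i in range(fanout)]


key_for_this_write = write_key("celebrity_123")
keys_for_a_read = read_keys("celebrity_123")
```

This only makes sense for the small number of keys known to be hot; applying it to every key would add read overhead without any benefit.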
@ -518,16 +502,14 @@ Fully automated rebalancing can be convenient, because there is less operational
|
||||||
normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud
|
normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud
|
||||||
databases such as DynamoDB are promoted as being able to automatically add and remove shards to
|
databases such as DynamoDB are promoted as being able to automatically add and remove shards to
|
||||||
adapt to big increases or decreases of load within a matter of minutes
|
adapt to big increases or decreases of load within a matter of minutes
|
||||||
[[17](/en/ch7#Elhemali2022_ch7),
|
[[^17], [^29]].
|
||||||
[29](/en/ch7#Houlihan2017)].
|
|
||||||
|
|
||||||
However, automatic shard management can also be unpredictable. Rebalancing is an expensive
|
However, automatic shard management can also be unpredictable. Rebalancing is an expensive
|
||||||
operation, because it requires rerouting requests and moving a large amount of data from one node to
|
operation, because it requires rerouting requests and moving a large amount of data from one node to
|
||||||
another. If it is not done carefully, this process can overload the network or the nodes, and it
|
another. If it is not done carefully, this process can overload the network or the nodes, and it
|
||||||
might harm the performance of other requests. The system must continue processing writes while the
|
might harm the performance of other requests. The system must continue processing writes while the
|
||||||
rebalancing is in progress; if a system is near its maximum write throughput, the shard-splitting
|
rebalancing is in progress; if a system is near its maximum write throughput, the shard-splitting
|
||||||
process might not even be able to keep up with the rate of incoming writes
|
process might not even be able to keep up with the rate of incoming writes [^29].
|
||||||
[^29].
|
|
||||||
|
|
||||||
Such automation can be dangerous in combination with automatic failure detection. For example, say
|
Such automation can be dangerous in combination with automatic failure detection. For example, say
|
||||||
one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that
|
one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that
|
||||||
|
|
@ -684,8 +666,7 @@ expensive. Even if you query the shards in parallel, it is prone to tail latency
|
||||||
shards lets you store more data, but it doesn’t increase your query throughput if every shard has to
|
shards lets you store more data, but it doesn’t increase your query throughput if every shard has to
|
||||||
process every query anyway.
|
process every query anyway.
|
||||||
|
|
||||||
Nevertheless, local secondary indexes are widely used
|
Nevertheless, local secondary indexes are widely used [^31]:
|
||||||
[^31]:
|
|
||||||
for example, MongoDB, Riak, Cassandra [^32],
|
for example, MongoDB, Riak, Cassandra [^32],
|
||||||
Elasticsearch [^33], SolrCloud,
|
Elasticsearch [^33], SolrCloud,
|
||||||
and VoltDB [^34]
|
and VoltDB [^34]
|
||||||
|
|
@ -742,7 +723,7 @@ indexes, so reads from a global index may be stale (similarly to replication lag
|
||||||
Nevertheless, global indexes are useful if read throughput is higher than write throughput, and if
|
Nevertheless, global indexes are useful if read throughput is higher than write throughput, and if
|
||||||
the postings lists are not too long.
|
the postings lists are not too long.
|
||||||
|
|
||||||
# Summary
|
## Summary
|
||||||
|
|
||||||
In this chapter we explored different ways of sharding a large dataset into smaller subsets.
|
In this chapter we explored different ways of sharding a large dataset into smaller subsets.
|
||||||
Sharding is necessary when you have so much data that storing and processing it on a single machine
|
Sharding is necessary when you have so much data that storing and processing it on a single machine
|
||||||
|
|
@ -795,10 +776,10 @@ to multiple machines. However, operations that need to write to several shards c
|
||||||
for example, what happens if the write to one shard succeeds, but another fails? We will address
|
for example, what happens if the write to one shard succeeds, but another fails? We will address
|
||||||
that question in the following chapters.
|
that question in the following chapters.
|
||||||
|
|
||||||
##### Footnotes
|
|
||||||
|
|
||||||
|
|
||||||
##### References
|
|
||||||
|
### Summary
|
||||||
|
|
||||||
|
|
||||||
[^1]: Claire Giordano. [Understanding partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023. Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959)
|
[^1]: Claire Giordano. [Understanding partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023. Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959)
|
||||||
|
|
|
||||||
|
|
@ -46,8 +46,7 @@ transactional guarantees or abandoning them entirely (for example, to achieve hi
|
||||||
higher availability). Some safety properties can be achieved without transactions. On the other
|
higher availability). Some safety properties can be achieved without transactions. On the other
|
||||||
hand, transactions can prevent a lot of grief: for example, the technical cause behind the Post
|
hand, transactions can prevent a lot of grief: for example, the technical cause behind the Post
|
||||||
Office Horizon scandal (see [“How Important Is Reliability?”](/en/ch2#sidebar_reliability_importance)) was probably a lack of ACID
|
Office Horizon scandal (see [“How Important Is Reliability?”](/en/ch2#sidebar_reliability_importance)) was probably a lack of ACID
|
||||||
transactions in the underlying accounting system
|
transactions in the underlying accounting system [^1].
|
||||||
[^1].
|
|
||||||
|
|
||||||
How do you figure out whether you need transactions? In order to answer that question, we first need
|
How do you figure out whether you need transactions? In order to answer that question, we first need
|
||||||
to understand exactly what safety guarantees transactions can provide, and what costs are associated
|
to understand exactly what safety guarantees transactions can provide, and what costs are associated
|
||||||
|
|
@ -68,9 +67,7 @@ the challenge of achieving atomicity in a distributed transaction.
|
||||||
|
|
||||||
Almost all relational databases today, and some nonrelational databases, support transactions. Most
|
Almost all relational databases today, and some nonrelational databases, support transactions. Most
|
||||||
of them follow the style that was introduced in 1975 by IBM System R, the first SQL database
|
of them follow the style that was introduced in 1975 by IBM System R, the first SQL database
|
||||||
[[2](/en/ch8#Chamberlin1981),
|
[[^2], [^3], [^4]].
|
||||||
[3](/en/ch8#Gray1976),
|
|
||||||
[4](/en/ch8#Eswaran1976)].
|
|
||||||
Although some implementation details have changed, the general idea has remained virtually the same
|
Although some implementation details have changed, the general idea has remained virtually the same
|
||||||
for 50 years: the transaction support in MySQL, PostgreSQL, Oracle, SQL Server, etc., is uncannily
|
for 50 years: the transaction support in MySQL, PostgreSQL, Oracle, SQL Server, etc., is uncannily
|
||||||
similar to that of System R.
|
similar to that of System R.
|
||||||
|
|
@ -85,8 +82,7 @@ much weaker set of guarantees than had previously been understood.
|
||||||
The hype around NoSQL distributed databases led to a popular belief that transactions were
|
The hype around NoSQL distributed databases led to a popular belief that transactions were
|
||||||
fundamentally unscalable, and that any large-scale system would have to abandon transactions in
|
fundamentally unscalable, and that any large-scale system would have to abandon transactions in
|
||||||
order to maintain good performance and high availability. More recently, that belief has turned out
|
order to maintain good performance and high availability. More recently, that belief has turned out
|
||||||
to be wrong. So-called “NewSQL” databases such as CockroachDB
|
to be wrong. So-called “NewSQL” databases such as CockroachDB [^5],
|
||||||
[^5],
|
|
||||||
TiDB [^6],
|
TiDB [^6],
|
||||||
Spanner [^7],
|
Spanner [^7],
|
||||||
FoundationDB [^8],
|
FoundationDB [^8],
|
||||||
|
|
@ -103,8 +99,7 @@ operation and in various extreme (but realistic) circumstances.
|
||||||
|
|
||||||
The safety guarantees provided by transactions are often described by the well-known acronym *ACID*,
|
The safety guarantees provided by transactions are often described by the well-known acronym *ACID*,
|
||||||
which stands for *Atomicity*, *Consistency*, *Isolation*, and *Durability*. It was coined in 1983 by
|
which stands for *Atomicity*, *Consistency*, *Isolation*, and *Durability*. It was coined in 1983 by
|
||||||
Theo Härder and Andreas Reuter
|
Theo Härder and Andreas Reuter [^9]
|
||||||
[^9]
|
|
||||||
in an effort to establish precise terminology for fault-tolerance mechanisms in databases.
|
in an effort to establish precise terminology for fault-tolerance mechanisms in databases.
|
||||||
|
|
||||||
However, in practice, one database’s implementation of ACID does not equal another’s implementation.
|
However, in practice, one database’s implementation of ACID does not equal another’s implementation.
|
||||||
|
|
@ -213,15 +208,13 @@ each other: they cannot step on each other’s toes. The classic database textbo
|
||||||
isolation as *serializability*, which means that each transaction can pretend that it is the only
|
isolation as *serializability*, which means that each transaction can pretend that it is the only
|
||||||
transaction running on the entire database. The database ensures that when the transactions have
|
transaction running on the entire database. The database ensures that when the transactions have
|
||||||
committed, the result is the same as if they had run *serially* (one after another), even though in
|
committed, the result is the same as if they had run *serially* (one after another), even though in
|
||||||
reality they may have run concurrently
|
reality they may have run concurrently [^13].
|
||||||
[^13].
|
|
||||||
|
|
||||||
However, serializability has a performance cost. In practice, many databases use forms of isolation
|
However, serializability has a performance cost. In practice, many databases use forms of isolation
|
||||||
that are weaker than serializability: that is, they allow concurrent transactions to interfere with
|
that are weaker than serializability: that is, they allow concurrent transactions to interfere with
|
||||||
each other in limited ways. Some popular databases, such as Oracle, don’t even implement it (Oracle
|
each other in limited ways. Some popular databases, such as Oracle, don’t even implement it (Oracle
|
||||||
has an isolation level called “serializable,” but it actually implements *snapshot isolation*, which
|
has an isolation level called “serializable,” but it actually implements *snapshot isolation*, which
|
||||||
is a weaker guarantee than serializability [[10](/en/ch8#Bailis2013HAT),
|
is a weaker guarantee than serializability [[^10], [^14]]).
|
||||||
[14](/en/ch8#Fekete2005)]).
|
|
||||||
This means that some kinds of race conditions can still occur. We will explore snapshot isolation
|
This means that some kinds of race conditions can still occur. We will explore snapshot isolation
|
||||||
and other forms of isolation in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels).
|
and other forms of isolation in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels).
|
||||||
|
|
||||||
|
|
@ -264,18 +257,18 @@ The truth is, nothing is perfect:
|
||||||
guarantees they are supposed to provide: even `fsync` isn’t guaranteed to work correctly
|
guarantees they are supposed to provide: even `fsync` isn’t guaranteed to work correctly
|
||||||
[^15].
|
[^15].
|
||||||
Disk firmware can have bugs, just like any other kind of software
|
Disk firmware can have bugs, just like any other kind of software
|
||||||
[[16](/en/ch8#Denness2015),
|
[[^16],
|
||||||
[17](/en/ch8#Surak2015)],
|
[^17]],
|
||||||
e.g. causing drives to fail after exactly 32,768 hours of operation
|
e.g. causing drives to fail after exactly 32,768 hours of operation
|
||||||
[^18].
|
[^18].
|
||||||
And `fsync` is hard to use; even PostgreSQL used it incorrectly for over 20 years
|
And `fsync` is hard to use; even PostgreSQL used it incorrectly for over 20 years
|
||||||
[[19](/en/ch8#Ringer2018),
|
[[^19],
|
||||||
[20](/en/ch8#Rebello2020),
|
[^20],
|
||||||
[21](/en/ch8#Pillai2015)].
|
[^21]].
|
||||||
* Subtle interactions between the storage engine and the filesystem implementation can lead to bugs
|
* Subtle interactions between the storage engine and the filesystem implementation can lead to bugs
|
||||||
that are hard to track down, and may cause files on disk to be corrupted after a crash
|
that are hard to track down, and may cause files on disk to be corrupted after a crash
|
||||||
[[22](/en/ch8#Pillai2014),
|
[[^22],
|
||||||
[23](/en/ch8#Siebenmann2016)].
|
[^23]].
|
||||||
Filesystem errors on one replica can sometimes spread to other replicas as well
|
Filesystem errors on one replica can sometimes spread to other replicas as well
|
||||||
[^24].
|
[^24].
|
||||||
* Data on disk can gradually become corrupted without this being detected
|
* Data on disk can gradually become corrupted without this being detected
|
||||||
|
|
@ -489,20 +482,15 @@ guarantees that transactions have the same effect as if they ran *serially* (i.e
|
||||||
without any concurrency).
|
without any concurrency).
|
||||||
|
|
||||||
In practice, isolation is unfortunately not that simple. Serializable isolation has a performance
|
In practice, isolation is unfortunately not that simple. Serializable isolation has a performance
|
||||||
cost, and many databases don’t want to pay that price
|
cost, and many databases don’t want to pay that price [^10]. It’s therefore common for systems to use
|
||||||
[^10]. It’s therefore common for systems to use
|
|
||||||
weaker levels of isolation, which protect against *some* concurrency issues, but not all. Those
|
weaker levels of isolation, which protect against *some* concurrency issues, but not all. Those
|
||||||
levels of isolation are much harder to understand, and they can lead to subtle bugs, but they are
|
levels of isolation are much harder to understand, and they can lead to subtle bugs, but they are
|
||||||
nevertheless used in practice
|
nevertheless used in practice [^29].
|
||||||
[^29].
|
|
||||||
|
|
||||||
Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have
|
Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have
|
||||||
caused substantial loss of money
|
caused substantial loss of money
|
||||||
[[30](/en/ch8#Warszawski2017),
|
[[^30], [^31], [^32]],
|
||||||
[31](/en/ch8#DAgosta2014),
|
led to investigation by financial auditors [^33],
|
||||||
[32](/en/ch8#bitcointhief2014)],
|
|
||||||
led to investigation by financial auditors
|
|
||||||
[^33],
|
|
||||||
and caused customer data to be corrupted [^34].
|
and caused customer data to be corrupted [^34].
|
||||||
A popular comment on revelations of such problems is “Use an ACID database if you’re handling
|
A popular comment on revelations of such problems is “Use an ACID database if you’re handling
|
||||||
financial data!”—but that misses the point. Even many popular relational database systems (which
|
financial data!”—but that misses the point. Even many popular relational database systems (which
|
||||||
|
|
@ -517,8 +505,7 @@ bugs from occurring.
|
||||||
|
|
||||||
Those examples also highlight an important point: even if concurrency issues are rare in normal
|
Those examples also highlight an important point: even if concurrency issues are rare in normal
|
||||||
operation, you have to consider the possibility that an attacker deliberately sends a burst of
|
operation, you have to consider the possibility that an attacker deliberately sends a burst of
|
||||||
highly concurrent requests to your API in an attempt to exploit concurrency bugs
|
highly concurrent requests to your API in an attempt to exploit concurrency bugs [^30]. Therefore, in order to build
|
||||||
[^30]. Therefore, in order to build
|
|
||||||
applications that are reliable and secure, you have to ensure that such bugs are systematically
|
applications that are reliable and secure, you have to ensure that such bugs are systematically
|
||||||
prevented.
|
prevented.
|
||||||
|
|
||||||
|
|
@ -528,10 +515,7 @@ decide what level is appropriate to your application. Once we’ve done that, we
|
||||||
serializability in detail (see [“Serializability”](/en/ch8#sec_transactions_serializability)). Our discussion of isolation
|
serializability in detail (see [“Serializability”](/en/ch8#sec_transactions_serializability)). Our discussion of isolation
|
||||||
levels will be informal, using examples. If you want rigorous definitions and analyses of their
|
levels will be informal, using examples. If you want rigorous definitions and analyses of their
|
||||||
properties, you can find them in the academic literature
|
properties, you can find them in the academic literature
|
||||||
[[36](/en/ch8#Berenson1995),
|
[[^36], [^37], [^38], [^39]].
|
||||||
[37](/en/ch8#Adya1999),
|
|
||||||
[38](/en/ch8#Bailis2014virtues_ch8),
|
|
||||||
[39](/en/ch8#Crooks2017)].
|
|
||||||
|
|
||||||
## Read Committed
|
## Read Committed
|
||||||
|
|
||||||
|
|
@ -608,8 +592,7 @@ By preventing dirty writes, this isolation level avoids some kinds of concurrenc
|
||||||
### Implementing read committed
|
### Implementing read committed
|
||||||
|
|
||||||
Read committed is a very popular isolation level. It is the default setting in Oracle Database,
|
Read committed is a very popular isolation level. It is the default setting in Oracle Database,
|
||||||
PostgreSQL, SQL Server, and many other databases
|
PostgreSQL, SQL Server, and many other databases [^10].
|
||||||
[^10].
|
|
||||||
|
|
||||||
Most commonly, databases prevent dirty writes by using row-level locks: when a transaction wants to
|
Most commonly, databases prevent dirty writes by using row-level locks: when a transaction wants to
|
||||||
modify a particular row (or document or some other object), it must first acquire a lock on that
|
modify a particular row (or document or some other object), it must first acquire a lock on that
|
||||||
|
|
@ -633,8 +616,7 @@ operability: a slowdown in one part of an application can have a knock-on effect
|
||||||
different part of the application, due to waiting for locks.
|
different part of the application, due to waiting for locks.
|
||||||
|
|
||||||
Nevertheless, locks are used to prevent dirty reads in some databases, such as IBM
|
Nevertheless, locks are used to prevent dirty reads in some databases, such as IBM
|
||||||
Db2 and Microsoft SQL Server in the `read_committed_snapshot=off` setting
|
Db2 and Microsoft SQL Server in the `read_committed_snapshot=off` setting [^29].
|
||||||
[^29].
|
|
||||||
|
|
||||||
A more commonly used approach to preventing dirty reads is the one illustrated in
|
A more commonly used approach to preventing dirty reads is the one illustrated in
|
||||||
[Figure 8-4](/en/ch8#fig_transactions_read_committed): for every
|
[Figure 8-4](/en/ch8#fig_transactions_read_committed): for every
|
||||||
|
|
@ -708,9 +690,7 @@ database, frozen at a particular point in time, it is much easier to understand.
|
||||||
|
|
||||||
Snapshot isolation is a popular feature: variants of it are supported by PostgreSQL, MySQL with the
|
Snapshot isolation is a popular feature: variants of it are supported by PostgreSQL, MySQL with the
|
||||||
InnoDB storage engine, Oracle, SQL Server, and others, although the detailed behavior varies from
|
InnoDB storage engine, Oracle, SQL Server, and others, although the detailed behavior varies from
|
||||||
one system to the next [[29](/en/ch8#Kleppmann2014),
|
one system to the next [[^29], [^40], [^41]].
|
||||||
[40](/en/ch8#Momjian2014),
|
|
||||||
[41](/en/ch8#Alvaro2023)].
|
|
||||||
Some databases, such as Oracle, TiDB, and Aurora DSQL, even choose snapshot isolation as their
|
Some databases, such as Oracle, TiDB, and Aurora DSQL, even choose snapshot isolation as their
|
||||||
highest isolation level.
|
highest isolation level.
|
||||||
|
|
||||||
|
|
@ -733,9 +713,7 @@ maintains several versions of a row side by side, this technique is known as *mu
|
||||||
concurrency control* (MVCC).
|
concurrency control* (MVCC).
|
||||||
|
|
||||||
[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL
|
[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL
|
||||||
[[40](/en/ch8#Momjian2014),
|
[[^40], [^42], [^43]] (other implementations are similar).
|
||||||
[42](/en/ch8#Rogov2023),
|
|
||||||
[43](/en/ch8#Suzuki2017_ch8)] (other implementations are similar).
|
|
||||||
When a transaction is started, it is given a unique, always-increasing transaction ID (`txid`).
|
When a transaction is started, it is given a unique, always-increasing transaction ID (`txid`).
|
||||||
Whenever a transaction writes anything to the database, the data it writes is tagged with the
|
Whenever a transaction writes anything to the database, the data it writes is tagged with the
|
||||||
transaction ID of the writer. (To be precise, transaction IDs in PostgreSQL are 32-bit integers, so
|
transaction ID of the writer. (To be precise, transaction IDs in PostgreSQL are 32-bit integers, so
|
||||||
|
|
@ -754,8 +732,7 @@ At some later time, when it is certain that no transaction can any longer access
|
||||||
garbage collection process in the database removes any rows marked for deletion and frees their
|
garbage collection process in the database removes any rows marked for deletion and frees their
|
||||||
space.
|
space.
|
||||||
|
|
||||||
An update is internally translated into a delete and an insert
|
An update is internally translated into a delete and an insert [^44].
|
||||||
[^44].
|
|
||||||
For example, in [Figure 8-7](/en/ch8#fig_transactions_mvcc), transaction 13 deducts $100 from account 2, changing the
|
For example, in [Figure 8-7](/en/ch8#fig_transactions_mvcc), transaction 13 deducts $100 from account 2, changing the
|
||||||
balance from $500 to $400. The `accounts` table now actually contains two rows for account 2: a row
|
balance from $500 to $400. The `accounts` table now actually contains two rows for account 2: a row
|
||||||
with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of
|
with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of
|
||||||
|
|
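The delete-plus-insert behaviour described in this hunk can be modelled with a small sketch. It is a toy model under stated assumptions, not PostgreSQL's actual storage format: the `RowVersion` class, its field names `inserted_by` and `deleted_by`, and the transaction ID 3 for the original row are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RowVersion:
    account_id: int
    balance: int
    inserted_by: int                  # txid of the transaction that wrote this version
    deleted_by: Optional[int] = None  # txid that marked it deleted, if any


def update_balance(heap, txid, account_id, new_balance):
    """Model an update as: mark the live version deleted, then append a new one."""
    for version in heap:
        if version.account_id == account_id and version.deleted_by is None:
            version.deleted_by = txid
    heap.append(RowVersion(account_id, new_balance, inserted_by=txid))


# Account 2 starts at $500 (written by a hypothetical transaction 3).
heap = [RowVersion(account_id=2, balance=500, inserted_by=3)]
# Transaction 13 deducts $100: the heap now holds two versions of the row.
update_balance(heap, txid=13, account_id=2, new_balance=400)
```

After the update, both versions remain in the heap until garbage collection decides that no transaction can still see the old one.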
@ -765,15 +742,13 @@ All of the versions of a row are stored within the same database heap (see
|
||||||
[“Storing values within the index”](/en/ch4#sec_storage_index_heap)), regardless of whether the transactions that wrote them have committed
|
[“Storing values within the index”](/en/ch4#sec_storage_index_heap)), regardless of whether the transactions that wrote them have committed
|
||||||
or not. The versions of the same row form a linked list, going either from newest version to oldest
|
or not. The versions of the same row form a linked list, going either from newest version to oldest
|
||||||
version or the other way round, so that queries can internally iterate over all versions of a row
|
version or the other way round, so that queries can internally iterate over all versions of a row
|
||||||
[[45](/en/ch8#Pavlo2023),
|
[[^45], [^46]].
|
||||||
[46](/en/ch8#Wu2017)].
|
|
||||||
|
|
||||||
### Visibility rules for observing a consistent snapshot
|
### Visibility rules for observing a consistent snapshot
|
||||||
|
|
||||||
When a transaction reads from the database, transaction IDs are used to decide which row versions it
|
When a transaction reads from the database, transaction IDs are used to decide which row versions it
|
||||||
can see and which are invisible. By carefully defining visibility rules, the database can present a
|
can see and which are invisible. By carefully defining visibility rules, the database can present a
|
||||||
consistent snapshot of the database to the application. This works roughly as follows
|
consistent snapshot of the database to the application. This works roughly as follows [^43]:
|
||||||
[^43]:
|
|
||||||
|
|
||||||
1. At the start of each transaction, the database makes a list of all the other transactions that
|
1. At the start of each transaction, the database makes a list of all the other transactions that
|
||||||
are in progress (not yet committed or aborted) at that time. Any writes that those
|
are in progress (not yet committed or aborted) at that time. Any writes that those
|
||||||
|
|
@ -815,7 +790,7 @@ value matches what the query is looking for. When garbage collection removes old
|
||||||
are no longer visible to any transaction, the corresponding index entries can also be removed.
|
are no longer visible to any transaction, the corresponding index entries can also be removed.
|
||||||
|
|
||||||
Many implementation details affect the performance of multi-version concurrency control
|
Many implementation details affect the performance of multi-version concurrency control
|
||||||
[[45](/en/ch8#Pavlo2023), [46](/en/ch8#Wu2017)].
|
[[^45], [^46]].
|
||||||
For example, PostgreSQL has optimizations for avoiding index updates if different versions of the
|
For example, PostgreSQL has optimizations for avoiding index updates if different versions of the
|
||||||
same row can fit on the same page [^40].
|
same row can fit on the same page [^40].
|
||||||
Some other databases avoid storing full copies of modified rows, and only store differences between
|
Some other databases avoid storing full copies of modified rows, and only store differences between
|
||||||
|
|
@ -845,22 +820,17 @@ snapshot isolation, in MySQL it means an implementation of MVCC with weaker cons
|
||||||
snapshot isolation [^41].
|
snapshot isolation [^41].
|
||||||
|
|
||||||
The reason for this naming confusion is that the SQL standard doesn’t have the concept of snapshot
|
The reason for this naming confusion is that the SQL standard doesn’t have the concept of snapshot
|
||||||
isolation, because the standard is based on System R’s 1975 definition of isolation levels
|
isolation, because the standard is based on System R’s 1975 definition of isolation levels [^3] and snapshot isolation hadn’t yet been
|
||||||
[^3] and snapshot isolation hadn’t yet been
|
|
||||||
invented then. Instead, it defines repeatable read, which looks superficially similar to snapshot
|
invented then. Instead, it defines repeatable read, which looks superficially similar to snapshot
|
||||||
isolation. PostgreSQL calls its snapshot isolation level “repeatable read” because it meets the
|
isolation. PostgreSQL calls its snapshot isolation level “repeatable read” because it meets the
|
||||||
requirements of the standard, and so they can claim standards compliance.
|
requirements of the standard, and so they can claim standards compliance.
|
||||||
|
|
||||||
Unfortunately, the SQL standard’s definition of isolation levels is flawed—it is ambiguous,
|
Unfortunately, the SQL standard’s definition of isolation levels is flawed—it is ambiguous,
|
||||||
imprecise, and not as implementation-independent as a standard should be
|
imprecise, and not as implementation-independent as a standard should be [^36]. Even though several databases
|
||||||
[^36]. Even though several databases
|
|
||||||
implement repeatable read, there are big differences in the guarantees they actually provide,
|
implement repeatable read, there are big differences in the guarantees they actually provide,
|
||||||
despite being ostensibly standardized
|
despite being ostensibly standardized [^29]. There has been a formal definition of
|
||||||
[^29]. There has been a formal definition of
|
repeatable read in the research literature [[^37], [^38]], but most implementations don’t satisfy that
|
||||||
repeatable read in the research literature [[37](/en/ch8#Adya1999),
|
formal definition. And to top it off, IBM Db2 uses “repeatable read” to refer to serializability [^10].
|
||||||
[38](/en/ch8#Bailis2014virtues_ch8)], but most implementations don’t satisfy that
|
|
||||||
formal definition. And to top it off, IBM Db2 uses “repeatable read” to refer to serializability
|
|
||||||
[^10].
|
|
||||||
|
|
||||||
As a result, nobody really knows what repeatable read means.
|
As a result, nobody really knows what repeatable read means.
|
||||||
|
|
||||||
|
|
@ -888,8 +858,7 @@ pattern occurs in various different scenarios:
|
||||||
* Two users editing a wiki page at the same time, where each user saves their changes by sending the
|
* Two users editing a wiki page at the same time, where each user saves their changes by sending the
|
||||||
entire page contents to the server, overwriting whatever is currently in the database
|
entire page contents to the server, overwriting whatever is currently in the database
|
||||||
|
|
||||||
Because this is such a common problem, a variety of solutions have been developed
|
Because this is such a common problem, a variety of solutions have been developed [^48].
|
||||||
[^48].
|
|
||||||
|
|
||||||
### Atomic write operations
|
### Atomic write operations
|
||||||
|
|
||||||
|
|
@ -915,9 +884,7 @@ Another option is to simply force all atomic operations to be executed on a sing
|
||||||
|
|
||||||
Unfortunately, object-relational mapping (ORM) frameworks make it easy to accidentally write code
|
Unfortunately, object-relational mapping (ORM) frameworks make it easy to accidentally write code
|
||||||
that performs unsafe read-modify-write cycles instead of using atomic operations provided by the
|
that performs unsafe read-modify-write cycles instead of using atomic operations provided by the
|
||||||
database [[49](/en/ch8#Wiger2010),
|
database [[^49], [^50], [^51]].
|
||||||
[50](/en/ch8#Coglan2020),
|
|
||||||
[51](/en/ch8#Bailis2015_ch8)].
|
|
||||||
This can be a source of subtle bugs that are difficult to find by testing.
|
This can be a source of subtle bugs that are difficult to find by testing.
|
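The difference between an unsafe read-modify-write cycle and an atomic write operation can be sketched as follows. This is an illustration under stated assumptions, not code from any ORM: it uses Python's built-in `sqlite3` module, and the `counters` table and all function names are hypothetical. The two "clients" are simulated on one connection by interleaving their reads and writes by hand.

```python
import sqlite3


def make_db():
    """Create an in-memory database with one counter set to 0 (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE counters (name TEXT PRIMARY KEY, value INTEGER NOT NULL)")
    conn.execute("INSERT INTO counters VALUES ('foo', 0)")
    conn.commit()
    return conn


def current_value(conn):
    return conn.execute("SELECT value FROM counters WHERE name = 'foo'").fetchone()[0]


def unsafe_increment_twice(conn):
    """Two clients each read the value, then each write back 'what I read + 1'."""
    read_by_a = current_value(conn)
    read_by_b = current_value(conn)  # B reads before A has written
    conn.execute("UPDATE counters SET value = ? WHERE name = 'foo'", (read_by_a + 1,))
    conn.execute("UPDATE counters SET value = ? WHERE name = 'foo'", (read_by_b + 1,))
    conn.commit()


def atomic_increment_twice(conn):
    """Each client lets the database compute the new value inside one statement."""
    for _ in range(2):
        conn.execute("UPDATE counters SET value = value + 1 WHERE name = 'foo'")
    conn.commit()
```

With the unsafe version the counter ends at 1 after two increments, because the second write clobbers the first. With the atomic version it ends at 2. An ORM that loads an object, changes a field in application code, and saves the whole object back behaves like the first function.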
||||||
|
|
||||||
### Explicit locking
|
### Explicit locking
|
||||||
|
|
@ -973,10 +940,8 @@ An advantage of this approach is that databases can perform this check efficient
|
||||||
with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL
|
with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL
|
||||||
Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort
|
Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort
|
||||||
the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates
|
the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates
|
||||||
[[29](/en/ch8#Kleppmann2014),
|
[[^29], [^41]].
|
||||||
[41](/en/ch8#Alvaro2023)].
|
Some authors [[^36], [^38]] argue that a database must prevent lost
|
||||||
Some authors [[36](/en/ch8#Berenson1995),
|
|
||||||
[38](/en/ch8#Bailis2014virtues_ch8)] argue that a database must prevent lost
|
|
||||||
updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot
|
updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot
|
||||||
isolation under this definition.
|
isolation under this definition.
|
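The idea of detecting a lost update and aborting the offending transaction can be sketched with a toy in-memory store. This is only an analogy built on a per-key version counter. It does not describe how PostgreSQL, Oracle, or SQL Server implement the check internally, and the class and method names are hypothetical.

```python
class LostUpdateDetected(Exception):
    """Raised when a key was modified after the writer read it."""


class VersionedStore:
    def __init__(self):
        self._data = {}  # key -> (version, value)

    def read(self, key):
        """Return (version, value); a missing key has version 0 and value None."""
        return self._data.get(key, (0, None))

    def write_if_unchanged(self, key, version_read, new_value):
        """Apply the write only if nobody else has written since version_read."""
        current_version, _ = self._data.get(key, (0, None))
        if current_version != version_read:
            raise LostUpdateDetected(key)  # abort: caller must re-read and retry
        self._data[key] = (current_version + 1, new_value)
```

If two writers both read version 1 of a key, the first `write_if_unchanged` call succeeds and bumps the version to 2, and the second raises instead of silently overwriting the first writer's change.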
||||||
|
|
||||||
|
|
@ -1058,8 +1023,7 @@ To begin, imagine this example: you are writing an application for doctors to ma
|
||||||
shifts at a hospital. The hospital usually tries to have several doctors on call at any one time,
|
shifts at a hospital. The hospital usually tries to have several doctors on call at any one time,
|
||||||
but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if
|
but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if
|
||||||
they are sick themselves), provided that at least one colleague remains on call in that shift
|
they are sick themselves), provided that at least one colleague remains on call in that shift
|
||||||
[[53](/en/ch8#Cahill2008),
|
[[^53], [^54]].
|
||||||
[54](/en/ch8#Ports2012)].
|
|
||||||
|
|
||||||
Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are
|
Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are
|
||||||
feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button
|
feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button
|
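The on-call scenario can be simulated in a few lines. This is a toy model, not database code: each transaction's snapshot is a copy of a dictionary, and the function names are hypothetical. It shows why checking the requirement against a snapshot and then writing only your own row is not enough when two such transactions overlap.

```python
def request_leave(snapshot, database, doctor):
    """Go off call if the snapshot shows at least two doctors on call."""
    currently_on_call = sum(1 for on_call in snapshot.values() if on_call)
    if currently_on_call >= 2:
        database[doctor] = False  # the write touches only this doctor's own row
        return True
    return False


def simulate(concurrent):
    database = {"Aaliyah": True, "Bryce": True}
    snapshot_a = dict(database)
    # Concurrent case: Bryce's snapshot is taken before Aaliyah's write.
    snapshot_b = dict(database) if concurrent else None
    request_leave(snapshot_a, database, "Aaliyah")
    if snapshot_b is None:
        # Serial case: Bryce's transaction starts after Aaliyah's has finished.
        snapshot_b = dict(database)
    request_leave(snapshot_b, database, "Bryce")
    return database
```

In the concurrent run both requests are accepted and nobody is left on call, even though each transaction saw two doctors on call when it checked. In the serial run Bryce's request is refused. The two transactions update different rows, so a lost-update check does not catch this.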
||||||
|
|
@ -1220,8 +1184,7 @@ transaction, is called a *phantom* [^4].
|
||||||
Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the
|
Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the
|
||||||
examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL
|
examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL
|
||||||
generated by ORMs is also prone to write skew
|
generated by ORMs is also prone to write skew
|
||||||
[[50](/en/ch8#Coglan2020),
|
[[^50], [^51]].
|
||||||
[51](/en/ch8#Bailis2015_ch8)].
|
|
||||||
|
|
||||||
### Materializing conflicts
|
### Materializing conflicts
|
||||||
|
|
||||||
|
|
@ -1240,8 +1203,7 @@ isn’t used to store information about the booking—it’s purely a collection
|
||||||
to prevent bookings on the same room and time range from being modified concurrently.
|
to prevent bookings on the same room and time range from being modified concurrently.
|
||||||
|
|
||||||
This approach is called *materializing conflicts*, because it takes a phantom and turns it into a
|
This approach is called *materializing conflicts*, because it takes a phantom and turns it into a
|
||||||
lock conflict on a concrete set of rows that exist in the database
|
lock conflict on a concrete set of rows that exist in the database [^14]. Unfortunately, it can be hard and
|
||||||
[^14]. Unfortunately, it can be hard and
|
|
||||||
error-prone to figure out how to materialize conflicts, and it’s ugly to let a concurrency control
|
error-prone to figure out how to materialize conflicts, and it’s ugly to let a concurrency control
|
||||||
mechanism leak into the application data model. For those reasons, materializing conflicts should be
|
mechanism leak into the application data model. For those reasons, materializing conflicts should be
|
||||||
considered a last resort if no alternative is possible. A serializable isolation level is much
|
considered a last resort if no alternative is possible. A serializable isolation level is much
|
||||||
|
|
@ -1293,8 +1255,7 @@ sidestep the problem of detecting and preventing conflicts between transactions:
|
||||||
isolation is by definition serializable.
|
isolation is by definition serializable.
|
||||||
|
|
||||||
Even though this seems like an obvious idea, it was only in the 2000s that database designers
|
Even though this seems like an obvious idea, it was only in the 2000s that database designers
|
||||||
decided that a single-threaded loop for executing transactions was feasible
|
decided that a single-threaded loop for executing transactions was feasible [^57].
|
||||||
[^57].
|
|
||||||
If multi-threaded concurrency was considered essential for getting good performance during the
|
If multi-threaded concurrency was considered essential for getting good performance during the
|
||||||
previous 30 years, what changed to make single-threaded execution possible?
|
previous 30 years, what changed to make single-threaded execution possible?
|
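A single-threaded execution loop can be sketched as follows. This is a minimal illustration using Python's standard `queue` and `threading` modules. The `SerialExecutor` class is hypothetical and leaves out everything a real system needs, such as durability, abort handling, and stored procedures.

```python
import queue
import threading


class SerialExecutor:
    """Runs submitted transactions one at a time on a single worker thread."""

    def __init__(self, initial_state):
        self.state = dict(initial_state)
        self._inbox = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        while True:
            txn, done = self._inbox.get()
            if txn is None:  # shutdown marker
                done.set()
                return
            txn(self.state)  # only this thread ever touches self.state
            done.set()

    def submit(self, txn):
        """Enqueue a transaction; the returned event is set once it has run."""
        done = threading.Event()
        self._inbox.put((txn, done))
        return done

    def shutdown(self):
        done = threading.Event()
        self._inbox.put((None, done))
        done.wait()


def increment(state):
    state["counter"] = state["counter"] + 1
```

Many client threads can call `submit(increment)` at once, yet the read-modify-write inside `increment` needs no lock, because transactions never overlap. The queue is the only point of coordination.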
||||||
|
|
||||||
|
|
@ -1310,9 +1271,7 @@ Two developments caused this rethink:
|
||||||
outside of the serial execution loop.
|
outside of the serial execution loop.
|
||||||
|
|
||||||
The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic,
|
The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic,
|
||||||
for example [[58](/en/ch8#Hugg2014streaming),
|
for example [[^58], [^59], [^60]].
|
||||||
[59](/en/ch8#Kallman2008),
|
|
||||||
[60](/en/ch8#Hickey2012)].
|
|
||||||
A system designed for single-threaded execution can sometimes perform better than a system that
|
A system designed for single-threaded execution can sometimes perform better than a system that
|
||||||
supports concurrency, because it can avoid the coordination overhead of locking. However, its
|
supports concurrency, because it can avoid the coordination overhead of locking. However, its
|
||||||
throughput is limited to that of a single CPU core. In order to make the most of that single thread,
|
throughput is limited to that of a single CPU core. In order to make the most of that single thread,
|
||||||
|
|
@ -1425,8 +1384,7 @@ Since cross-shard transactions have additional coordination overhead, they are v
|
||||||
single-shard transactions. VoltDB reports a throughput of about 1,000 cross-shard writes per second,
|
single-shard transactions. VoltDB reports a throughput of about 1,000 cross-shard writes per second,
|
||||||
which is orders of magnitude below its single-shard throughput and cannot be increased by adding
|
which is orders of magnitude below its single-shard throughput and cannot be increased by adding
|
||||||
more machines [^61]. More recent research
|
more machines [^61]. More recent research
|
||||||
has explored ways of making multi-shard transactions more scalable
|
has explored ways of making multi-shard transactions more scalable [^63].
|
||||||
[^63].
|
|
||||||
|
|
||||||
Whether transactions can be single-shard depends very much on the structure of the data used by the
|
Whether transactions can be single-shard depends very much on the structure of the data used by the
|
||||||
application. Simple key-value data can often be sharded very easily, but data with multiple
|
application. Simple key-value data can often be sharded very easily, but data with multiple
|
||||||
|
|
@ -1485,8 +1443,7 @@ it protects against all the race conditions discussed earlier, including lost up
|
||||||
### Implementation of two-phase locking
|
### Implementation of two-phase locking
|
||||||
|
|
||||||
2PL is used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and the
|
2PL is used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and the
|
||||||
repeatable read isolation level in Db2
|
repeatable read isolation level in Db2 [^29].
|
||||||
[^29].
|
|
||||||
|
|
||||||
The blocking of readers and writers is implemented by having a lock on each object in the
|
The blocking of readers and writers is implemented by having a lock on each object in the
|
||||||
database. The lock can either be in *shared mode* or in *exclusive mode* (also known as a
|
database. The lock can either be in *shared mode* or in *exclusive mode* (also known as a
|
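The two lock modes can be sketched as a small lock table. This is an illustration of the usual compatibility rule only (any number of shared holders, or exactly one exclusive holder). It is not any database's lock manager: the class is hypothetical, and it returns `False` where a real system would make the transaction wait and check for deadlocks.

```python
class LockTable:
    def __init__(self):
        self._holders = {}  # object id -> {transaction id: "shared" | "exclusive"}

    def try_acquire(self, obj, txn, mode):
        """Return True if txn now holds obj in at least the requested mode."""
        if mode not in ("shared", "exclusive"):
            raise ValueError(mode)
        holders = self._holders.setdefault(obj, {})
        others = [m for t, m in holders.items() if t != txn]
        if mode == "shared":
            compatible = all(m == "shared" for m in others)
        else:
            compatible = not others  # exclusive tolerates no other holder
        if not compatible:
            return False
        if holders.get(txn) != "exclusive":  # never downgrade an exclusive lock
            holders[txn] = mode
        return True

    def release_all(self, txn):
        """Drop every lock held by txn (called at commit or abort)."""
        for holders in self._holders.values():
            holders.pop(txn, None)
```

Two readers can hold the same object in shared mode at once, while a writer asking for exclusive mode is refused until both have released their locks.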
||||||
|
|
@ -1584,8 +1541,7 @@ becomes serializable.
|
||||||
Unfortunately, predicate locks do not perform well: if there are many locks by active transactions,
|
Unfortunately, predicate locks do not perform well: if there are many locks by active transactions,
|
||||||
checking for matching locks becomes time-consuming. For that reason, most databases with 2PL
|
checking for matching locks becomes time-consuming. For that reason, most databases with 2PL
|
||||||
actually implement *index-range locking* (also known as *next-key locking*), which is a simplified
|
actually implement *index-range locking* (also known as *next-key locking*), which is a simplified
|
||||||
approximation of predicate locking [[54](/en/ch8#Ports2012),
|
approximation of predicate locking [[^54], [^64]].
|
||||||
[64](/en/ch8#Hellerstein2007_ch8)].
|
|
||||||
|
|
||||||
It’s safe to simplify a predicate by making it match a greater set of objects. For example, if you
|
It’s safe to simplify a predicate by making it match a greater set of objects. For example, if you
|
||||||
have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by
|
have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by
|
||||||
|
|
@ -1629,13 +1585,11 @@ serializable isolation and good performance fundamentally at odds with each othe
|
||||||
It seems not: an algorithm called *serializable snapshot isolation* (SSI) provides full
|
It seems not: an algorithm called *serializable snapshot isolation* (SSI) provides full
|
||||||
serializability with only a small performance penalty compared to snapshot isolation. SSI is
|
serializability with only a small performance penalty compared to snapshot isolation. SSI is
|
||||||
comparatively new: it was first described in 2008
|
comparatively new: it was first described in 2008
|
||||||
[[53](/en/ch8#Cahill2008),
|
[[^53], [^65]].
|
||||||
[65](/en/ch8#Cahill2009)].
|
|
||||||
|
|
||||||
Today SSI and similar algorithms are used in single-node databases (the serializable isolation level
|
Today SSI and similar algorithms are used in single-node databases (the serializable isolation level
|
||||||
in PostgreSQL [^54], SQL Server’s In-Memory
|
in PostgreSQL [^54], SQL Server’s In-Memory
|
||||||
OLTP/Hekaton [^66], and HyPer
|
OLTP/Hekaton [^66], and HyPer [^67]),
|
||||||
[^67]),
|
|
||||||
distributed databases (CockroachDB [^5] and
|
distributed databases (CockroachDB [^5] and
|
||||||
FoundationDB [^8]), and embedded storage
|
FoundationDB [^8]), and embedded storage
|
||||||
engines such as BadgerDB.
|
engines such as BadgerDB.
|
||||||
|
|
@ -1659,10 +1613,8 @@ transaction wants to commit, the database checks whether anything bad happened (
|
||||||
isolation was violated); if so, the transaction is aborted and has to be retried. Only transactions
|
isolation was violated); if so, the transaction is aborted and has to be retried. Only transactions
|
||||||
that executed serializably are allowed to commit.
|
that executed serializably are allowed to commit.
|
||||||
|
|
||||||
Optimistic concurrency control is an old idea
|
Optimistic concurrency control is an old idea [^68],
|
||||||
[^68],
|
and its advantages and disadvantages have been debated for a long time [^69].
|
||||||
and its advantages and disadvantages have been debated for a long time
|
|
||||||
[^69].
|
|
||||||
It performs badly if there is high contention (many transactions trying to access the same objects),
|
It performs badly if there is high contention (many transactions trying to access the same objects),
|
||||||
as this leads to a high proportion of transactions needing to abort. If the system is already close
|
as this leads to a high proportion of transactions needing to abort. If the system is already close
|
||||||
to its maximum throughput, the additional transaction load from retried transactions can make
|
to its maximum throughput, the additional transaction load from retried transactions can make
|
||||||
|
|
@ -1781,8 +1733,7 @@ tracking is faster, but may lead to more transactions being aborted than strictl
|
||||||
In some cases, it’s okay for a transaction to read information that was overwritten by another
|
In some cases, it’s okay for a transaction to read information that was overwritten by another
|
||||||
transaction: depending on what else happened, it’s sometimes possible to prove that the result of
|
transaction: depending on what else happened, it’s sometimes possible to prove that the result of
|
||||||
the execution is nevertheless serializable. PostgreSQL uses this theory to reduce the number of
|
the execution is nevertheless serializable. PostgreSQL uses this theory to reduce the number of
|
||||||
unnecessary aborts [[14](/en/ch8#Fekete2005),
|
unnecessary aborts [[^14], [^54]].
|
||||||
[54](/en/ch8#Ports2012)].
|
|
||||||
|
|
||||||
Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one
|
Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one
|
||||||
transaction doesn’t need to block waiting for locks held by another transaction. Like under snapshot
|
transaction doesn’t need to block waiting for locks held by another transaction. Like under snapshot
|
||||||
|
|
@ -1798,8 +1749,7 @@ serializable isolation.
|
||||||
|
|
||||||
Compared to non-serializable snapshot isolation, the need to check for serializability violations
|
Compared to non-serializable snapshot isolation, the need to check for serializability violations
|
||||||
introduces some performance overheads. How significant these overheads are is a matter of debate:
|
introduces some performance overheads. How significant these overheads are is a matter of debate:
|
||||||
some believe that serializability checking is not worth it
|
some believe that serializability checking is not worth it [^70],
|
||||||
[^70],
|
|
||||||
while others believe that the performance of serializability is now so good that there is no need to
|
while others believe that the performance of serializability is now so good that there is no need to
|
||||||
use the weaker snapshot isolation any more [^67].
|
use the weaker snapshot isolation any more [^67].
|
||||||
|
|
||||||
|
|
@ -1815,8 +1765,7 @@ The last few sections have focused on concurrency control for isolation, the I i
|
||||||
algorithms we have seen apply to both single-node and distributed databases: although there are
|
algorithms we have seen apply to both single-node and distributed databases: although there are
|
||||||
challenges in making concurrency control algorithms scalable (for example, performing distributed
|
challenges in making concurrency control algorithms scalable (for example, performing distributed
|
||||||
serializability checking for SSI), the high-level ideas for distributed concurrency control are
|
serializability checking for SSI), the high-level ideas for distributed concurrency control are
|
||||||
similar to single-node concurrency control
|
similar to single-node concurrency control [^8].
|
||||||
[^8].
|
|
||||||
|
|
||||||
Consistency and durability also don’t change much when we move to distributed transactions. However,
|
Consistency and durability also don’t change much when we move to distributed transactions. However,
|
||||||
atomicity requires more care.
|
atomicity requires more care.
|
||||||
|
|
@ -1830,8 +1779,7 @@ successfully written to disk before the crash, the transaction is considered com
|
||||||
writes from that transaction are rolled back.
|
writes from that transaction are rolled back.
|
||||||
|
|
||||||
Thus, on a single node, transaction commitment crucially depends on the *order* in which data is
|
Thus, on a single node, transaction commitment crucially depends on the *order* in which data is
|
||||||
durably written to disk: first the data, then the commit record
|
durably written to disk: first the data, then the commit record [^22].
|
||||||
[^22].
|
|
||||||
The key deciding moment for whether the transaction commits or aborts is the moment at which the
|
The key deciding moment for whether the transaction commits or aborts is the moment at which the
|
||||||
disk finishes writing the commit record: before that moment, it is still possible to abort (due to a
|
disk finishes writing the commit record: before that moment, it is still possible to abort (due to a
|
||||||
crash), but after that moment, the transaction is committed (even if the database crashes). Thus, it
|
crash), but after that moment, the transaction is committed (even if the database crashes). Thus, it
|
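The ordering rule (data first, then the commit record) can be sketched with an append-only log file. This is a simplified illustration and not a real storage engine. The JSON record format and the function names are assumptions, and the sketch ignores details such as a partially written final line.

```python
import json
import os
import tempfile


def append_record(log_file, record):
    """Append one record and force it to disk before returning."""
    log_file.write(json.dumps(record) + "\n")
    log_file.flush()
    os.fsync(log_file.fileno())


def run_transaction(path, txn_id, writes, crash_before_commit=False):
    with open(path, "a", encoding="utf-8") as log_file:
        for key, value in writes.items():
            append_record(log_file, {"txn": txn_id, "type": "data",
                                     "key": key, "value": value})
        if crash_before_commit:
            return  # simulate a crash: the data is on disk, the commit record is not
        append_record(log_file, {"txn": txn_id, "type": "commit"})


def recover(path):
    """Rebuild state, keeping only writes whose transaction has a commit record."""
    with open(path, encoding="utf-8") as log_file:
        records = [json.loads(line) for line in log_file if line.strip()]
    committed = {r["txn"] for r in records if r["type"] == "commit"}
    state = {}
    for r in records:
        if r["type"] == "data" and r["txn"] in committed:
            state[r["key"]] = r["value"]
    return state
```

A transaction that crashes after writing its data but before its commit record leaves no trace in the recovered state. Once the commit record is durable, the transaction survives a crash.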
||||||
|
|
@ -1876,15 +1824,12 @@ problem.
|
||||||
|
|
||||||
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It
|
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It
|
||||||
is a classic algorithm in distributed databases
|
is a classic algorithm in distributed databases
|
||||||
[[13](/en/ch8#Bernstein1987_ch8),
|
[[^13], [^71], [^72]]. 2PC is used
|
||||||
[71](/en/ch8#Lindsay1979_ch8),
|
|
||||||
[72](/en/ch8#Mohan1986)]. 2PC is used
|
|
||||||
internally in some databases and also made available to applications in the form of *XA transactions*
|
internally in some databases and also made available to applications in the form of *XA transactions*
|
||||||
[^73]
|
[^73]
|
||||||
(which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP
|
(which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP
|
||||||
web services
|
web services
|
||||||
[[74](/en/ch8#Neto2008),
|
[[^74], [^75]].
|
||||||
[75](/en/ch8#Johnson2004)].
|
|
||||||
|
|
||||||
The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
|
The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
|
||||||
commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two
|
commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two
|
||||||
|
|
@ -1916,8 +1861,7 @@ This process is somewhat like the traditional marriage ceremony in Western cultu
|
||||||
asks the bride and groom individually whether each wants to marry the other, and typically receives
|
asks the bride and groom individually whether each wants to marry the other, and typically receives
|
||||||
the answer “I do” from both. After receiving both acknowledgments, the minister pronounces the
|
the answer “I do” from both. After receiving both acknowledgments, the minister pronounces the
|
||||||
couple husband and wife: the transaction is committed, and the happy fact is broadcast to all
|
couple husband and wife: the transaction is committed, and the happy fact is broadcast to all
|
||||||
attendees. If either bride or groom does not say “yes,” the ceremony is aborted
|
attendees. If either bride or groom does not say “yes,” the ceremony is aborted [^76].
|
||||||
[^76].
|
|
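In the terms of the analogy, the minister is the coordinator and the bride and groom are the participants. The two phases can be sketched as follows. This is a toy model with hypothetical class and function names. It omits timeouts, crashes, and recovery, which are the parts that make 2PC hard in practice.

```python
class Participant:
    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        """Phase 1: vote yes (and promise to commit if asked) or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants, decision_log):
    # Phase 1: ask every participant whether it is able to commit.
    votes = [p.prepare() for p in participants]
    decision = "commit" if all(votes) else "abort"
    # The coordinator records its decision before telling anyone.
    decision_log.append(decision)
    # Phase 2: send the same decision to every participant.
    for p in participants:
        if decision == "commit":
            p.commit()
        else:
            p.abort()
    return decision
```

A single "no" vote in the first phase aborts the transaction on every participant. Unanimous "yes" votes commit it everywhere.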
||||||
|
|
||||||
### A system of promises
|
### A system of promises
|
||||||
|
|
||||||
|
|
@ -2014,8 +1958,7 @@ stuck waiting for the coordinator to recover. It is possible to make an atomic c
|
||||||
is not so straightforward.
|
is not so straightforward.
|
||||||
|
|
||||||
As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed
|
As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed
|
||||||
[[13](/en/ch8#Bernstein1987_ch8),
|
[[^13], [^77]].
|
||||||
[77](/en/ch8#Skeen1981)].
|
|
||||||
However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most
|
However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most
|
||||||
practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
|
practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
|
||||||
cannot guarantee atomicity.
|
cannot guarantee atomicity.
|
||||||
|
|
@ -2028,10 +1971,7 @@ consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_cons
|
||||||
Distributed transactions and two-phase commit have a mixed reputation. On the one hand, they are
|
Distributed transactions and two-phase commit have a mixed reputation. On the one hand, they are
|
||||||
seen as providing an important safety guarantee that would be hard to achieve otherwise; on the
|
seen as providing an important safety guarantee that would be hard to achieve otherwise; on the
|
||||||
other hand, they are criticized for causing operational problems, killing performance, and promising
|
other hand, they are criticized for causing operational problems, killing performance, and promising
|
||||||
more than they can deliver [[78](/en/ch8#Hohpe2005),
|
more than they can deliver [[^78], [^79], [^80], [^81]].
|
||||||
[79](/en/ch8#Helland2007_ch8),
|
|
||||||
[80](/en/ch8#Oliver2011),
|
|
||||||
[81](/en/ch8#Rahien2014)].
|
|
||||||
Many cloud services choose not to implement distributed transactions due to the operational
|
Many cloud services choose not to implement distributed transactions due to the operational
|
||||||
problems they engender [^82].
|
problems they engender [^82].
|
||||||
|
|
||||||
|
|
@ -2149,8 +2089,7 @@ transaction is resolved.
|
||||||
|
|
||||||
In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the
|
In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the
|
||||||
log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do
|
log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do
|
||||||
occur [[83](/en/ch8#Dhariwal2008),
|
occur [[^83], [^84]]—that is,
|
||||||
[84](/en/ch8#Randal2013)]—that is,
|
|
||||||
transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because
|
transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because
|
||||||
the transaction log has been lost or corrupted due to a software bug). These transactions cannot be
|
the transaction log has been lost or corrupted due to a software bug). These transactions cannot be
|
||||||
resolved automatically, so they sit forever in the database, holding locks and blocking other
|
resolved automatically, so they sit forever in the database, holding locks and blocking other
|
||||||
|
|
@ -2215,8 +2154,7 @@ CockroachDB [^5],
|
||||||
TiDB [^6],
|
TiDB [^6],
|
||||||
Spanner [^7],
|
Spanner [^7],
|
||||||
FoundationDB [^8], and YugabyteDB, for
|
FoundationDB [^8], and YugabyteDB, for
|
||||||
example. Some message brokers such as Kafka also support internal distributed transactions
|
example. Some message brokers such as Kafka also support internal distributed transactions [^85].
|
||||||
[^85].
|
|
||||||
|
|
||||||
Many of these systems use 2-phase commit to ensure atomicity of transactions that write to multiple
|
Many of these systems use 2-phase commit to ensure atomicity of transactions that write to multiple
|
||||||
shards, and yet they don’t suffer the same problems as XA transactions. The reason is that because
|
shards, and yet they don’t suffer the same problems as XA transactions. The reason is that because
|
||||||
|
|
@ -2292,7 +2230,7 @@ of patterns such as these: for example, they would allow the message IDs to be s
|
||||||
and the main data updated by the message processing to be stored on other shards, and to ensure
|
and the main data updated by the message processing to be stored on other shards, and to ensure
|
||||||
atomicity of the transaction commit across those shards.
|
atomicity of the transaction commit across those shards.
|
||||||
|
|
||||||
# Summary
|
## Summary
|
||||||
|
|
||||||
Transactions are an abstraction layer that allows an application to pretend that certain concurrency
|
Transactions are an abstraction layer that allows an application to pretend that certain concurrency
|
||||||
problems and certain kinds of hardware and software faults don’t exist. A large class of errors is
|
problems and certain kinds of hardware and software faults don’t exist. A large class of errors is
|
||||||
|
|
@ -2385,10 +2323,11 @@ The examples in this chapter used a relational data model. However, as discussed
|
||||||
[“The need for multi-object transactions”](/en/ch8#sec_transactions_need), transactions are a valuable database feature, no matter which data model
|
[“The need for multi-object transactions”](/en/ch8#sec_transactions_need), transactions are a valuable database feature, no matter which data model
|
||||||
is used.
|
is used.
|
||||||
|
|
||||||
##### Footnotes
|
|
||||||
|
|
||||||
|
|
||||||
##### References
|
|
||||||
|
### Summary
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
|
|
||||||
|
|
@ -22,8 +22,7 @@ anything that *can* go wrong *will* go wrong.
|
||||||
|
|
||||||
Moreover, working with distributed systems is fundamentally different from writing software on a
|
Moreover, working with distributed systems is fundamentally different from writing software on a
|
||||||
single computer—and the main difference is that there are lots of new and exciting ways for things
|
single computer—and the main difference is that there are lots of new and exciting ways for things
|
||||||
to go wrong [[1](/en/ch9#Cavage2013),
|
to go wrong [[^1], [^2]].
|
||||||
[2](/en/ch9#Kreps2012_ch9)].
|
|
||||||
In this chapter, you will get a taste of the problems that arise in practice, and an understanding
|
In this chapter, you will get a taste of the problems that arise in practice, and an understanding
|
||||||
of the things you can and cannot rely on.
|
of the things you can and cannot rely on.
|
||||||
|
|
||||||
|
|
@ -157,8 +156,7 @@ algorithm decides that it has capacity to send a packet, it takes the next packe
|
||||||
that buffer and passes it to the network interface. The packet passes through several switches and
|
that buffer and passes it to the network interface. The packet passes through several switches and
|
||||||
routers, and eventually the receiving node’s operating system places the packet’s data in a receive
|
routers, and eventually the receiving node’s operating system places the packet’s data in a receive
|
||||||
buffer and sends an acknowledgment packet back to the sender. Only then does the receiving operating
|
buffer and sends an acknowledgment packet back to the sender. Only then does the receiving operating
|
||||||
system notify the application that some more data has arrived
|
system notify the application that some more data has arrived [^6].
|
||||||
[^6].
|
|
||||||
|
|
||||||
So, if TCP provides “reliability”, does that mean we no longer need to worry about networks being
|
So, if TCP provides “reliability”, does that mean we no longer need to worry about networks being
|
||||||
unreliable? Unfortunately not. It decides that a packet must have been lost if no acknowledgment
|
unreliable? Unfortunately not. It decides that a packet must have been lost if no acknowledgment
|
||||||
|
|
@ -173,8 +171,7 @@ actually processed by the remote node [^6].
|
||||||
Even if TCP acknowledged that a packet was delivered, this only means that the operating system
|
Even if TCP acknowledged that a packet was delivered, this only means that the operating system
|
||||||
kernel on the remote node received it, but the application may have crashed before it handled that
|
kernel on the remote node received it, but the application may have crashed before it handled that
|
||||||
data. If you want to be sure that a request was successful, you need a positive response from the
|
data. If you want to be sure that a request was successful, you need a positive response from the
|
||||||
application itself
|
application itself [^7].
|
||||||
[^7].
|
|
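The point about needing a positive response from the application can be sketched with a pair of connected sockets. This is an illustration under assumptions: `socket.socketpair()` stands in for a TCP connection, the `OK:` reply format is made up, and the messages are small enough that a single `recv` call returns them whole.

```python
import socket
import threading


def server(conn, crash_before_handling):
    with conn:
        request = conn.recv(1024)  # the bytes have reached the receiving side
        if crash_before_handling:
            return  # simulate the application dying: no reply is ever sent
        conn.sendall(b"OK:" + request)  # application-level acknowledgment


def send_request(payload, crash_before_handling=False, timeout=0.5):
    """Return True only if the remote application confirmed the request."""
    client, remote = socket.socketpair()
    worker = threading.Thread(target=server, args=(remote, crash_before_handling))
    worker.start()
    try:
        client.settimeout(timeout)
        client.sendall(payload)  # success here says nothing about processing
        try:
            reply = client.recv(1024)
        except socket.timeout:
            return False
        return reply == b"OK:" + payload
    finally:
        client.close()
        worker.join()
```

In both cases `sendall` returns without error, so the sender cannot tell them apart at the transport level. Only the reply written by the application itself distinguishes a handled request from one that was received and then lost.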
||||||
|
|
||||||
Nevertheless, TCP is very useful, because it provides a convenient way of sending and receiving
|
Nevertheless, TCP is very useful, because it provides a convenient way of sending and receiving
|
||||||
messages that are too big to fit in one packet. Once a TCP connection is established, you can also
|
messages that are too big to fit in one packet. Once a TCP connection is established, you can also
|
||||||
|
|
@ -187,47 +184,32 @@ many RPC protocols (see [“Dataflow Through Services: REST and RPC”](/en/ch5#
|
||||||
We have been building computer networks for decades—one might hope that by now we would have figured
|
We have been building computer networks for decades—one might hope that by now we would have figured
|
||||||
out how to make them reliable. Unfortunately, we have not yet succeeded. There are some systematic
|
out how to make them reliable. Unfortunately, we have not yet succeeded. There are some systematic
|
||||||
studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common,
|
studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common,
|
||||||
even in controlled environments like a datacenter operated by one company
|
even in controlled environments like a datacenter operated by one company [^8]:
|
||||||
[^8]:
|
|
||||||
|
|
||||||
* One study in a medium-sized datacenter found about 12 network faults per month, of which half
|
* One study in a medium-sized datacenter found about 12 network faults per month, of which half
|
||||||
disconnected a single machine, and half disconnected an entire rack
|
disconnected a single machine, and half disconnected an entire rack [^9].
|
||||||
[^9].
|
|
||||||
* Another study measured the failure rates of components like top-of-rack switches, aggregation
|
* Another study measured the failure rates of components like top-of-rack switches, aggregation
|
||||||
switches, and load balancers
|
switches, and load balancers [^10].
|
||||||
[^10].
|
|
||||||
It found that adding redundant networking gear doesn’t reduce faults as much as you might hope,
|
It found that adding redundant networking gear doesn’t reduce faults as much as you might hope,
|
||||||
since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause
|
since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause
|
||||||
of outages.
|
of outages.
|
||||||
* Interruptions of wide-area fiber links have been blamed on cows
|
* Interruptions of wide-area fiber links have been blamed on cows [^11], beavers [^12], and sharks [^13]
|
||||||
[^11],
|
(though shark bites have become rarer due to better shielding of submarine cables [^14]).
|
||||||
beavers [^12],
|
Humans are also at fault, be it due to accidental misconfiguration [^15], scavenging [^16], or sabotage [^17].
|
||||||
and sharks [^13]
|
|
||||||
(though shark bites have become rarer due to better shielding of submarine cables
|
|
||||||
[^14]).
|
|
||||||
Humans are also at fault, be it due to accidental misconfiguration
|
|
||||||
[^15],
|
|
||||||
scavenging [^16],
|
|
||||||
or sabotage
|
|
||||||
[^17].
|
|
||||||
* Across different cloud regions, round-trip times of up to several *minutes* have been observed at
|
* Across different cloud regions, round-trip times of up to several *minutes* have been observed at
|
||||||
high percentiles [[18](/en/ch9#Liu2016), Table 3].
|
high percentiles [[^18], Table 3].
|
||||||
Even within a single datacenter, packet delay of more than a minute can occur during a network
|
Even within a single datacenter, packet delay of more than a minute can occur during a network
|
||||||
topology reconfiguration, triggered by a problem during a software upgrade for a switch
|
topology reconfiguration, triggered by a problem during a software upgrade for a switch
|
||||||
[^19].
|
[^19].
|
||||||
Thus, we have to assume that messages might be delayed arbitrarily.
|
Thus, we have to assume that messages might be delayed arbitrarily.
|
||||||
* Sometimes communications are partially interrupted, depending on who you’re talking to: for
|
* Sometimes communications are partially interrupted, depending on who you’re talking to: for
|
||||||
example, A and B can communicate, B and C can communicate, but A and C cannot
|
example, A and B can communicate, B and C can communicate, but A and C cannot [^20] [^21].
|
||||||
[[20](/en/ch9#Lianza2020_ch9),
|
|
||||||
[21](/en/ch9#Alfatafta2020)].
|
|
||||||
Other surprising faults include a network interface that sometimes drops all inbound packets but
|
Other surprising faults include a network interface that sometimes drops all inbound packets but
|
||||||
sends outbound packets successfully [^22]:
|
sends outbound packets successfully [^22]:
|
||||||
just because a network link works in one direction doesn’t guarantee it’s also working in the
|
just because a network link works in one direction doesn’t guarantee it’s also working in the
|
||||||
opposite direction.
|
opposite direction.
|
||||||
* Even a brief network interruption can have repercussions that last for much longer than the
|
* Even a brief network interruption can have repercussions that last for much longer than the
|
||||||
original issue [[8](/en/ch9#Bailis2014reliable),
|
original issue [^8] [^20] [^23].
|
||||||
[20](/en/ch9#Lianza2020_ch9),
|
|
||||||
[23](/en/ch9#Toman2020)].
|
|
||||||
|
|
||||||
# Network partitions
|
# Network partitions
|
||||||
|
|
||||||
|
|
@ -243,8 +225,7 @@ may fail—there is no way around it.
|
||||||
If the error handling of network faults is not defined and tested, arbitrarily bad things could
|
If the error handling of network faults is not defined and tested, arbitrarily bad things could
|
||||||
happen: for example, the cluster could become deadlocked and permanently unable to serve requests,
|
happen: for example, the cluster could become deadlocked and permanently unable to serve requests,
|
||||||
even when the network recovers [^24],
|
even when the network recovers [^24],
|
||||||
or it could even delete all of your data
|
or it could even delete all of your data [^25].
|
||||||
[^25].
|
|
||||||
If software is put in an unanticipated situation, it may do arbitrary unexpected things.
|
If software is put in an unanticipated situation, it may do arbitrary unexpected things.
|
||||||
|
|
||||||
Handling network faults doesn’t necessarily mean *tolerating* them: if your network is normally
|
Handling network faults doesn’t necessarily mean *tolerating* them: if your network is normally
|
||||||
|
|
@ -302,7 +283,7 @@ Prematurely declaring a node dead is problematic: if the node is actually alive
|
||||||
performing some action (for example, sending an email), and another node takes over, the action may
|
performing some action (for example, sending an email), and another node takes over, the action may
|
||||||
end up being performed twice. We will discuss this issue in more detail in
|
end up being performed twice. We will discuss this issue in more detail in
|
||||||
[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), and in
|
[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), and in
|
||||||
Chapters [10](/en/ch10#ch_consistency)
|
Chapters [^10]
|
||||||
and [Link to Come].
|
and [Link to Come].
|
||||||
|
|
||||||
When a node is declared dead, its responsibilities need to be transferred to other nodes, which
|
When a node is declared dead, its responsibilities need to be transferred to other nodes, which
|
||||||
|
|
@ -331,8 +312,7 @@ times to throw the system off-balance.
|
||||||
### Network congestion and queueing
|
### Network congestion and queueing
|
||||||
|
|
||||||
When driving a car, travel times on road networks often vary most due to traffic congestion.
|
When driving a car, travel times on road networks often vary most due to traffic congestion.
|
||||||
Similarly, the variability of packet delays on computer networks is most often due to queueing
|
Similarly, the variability of packet delays on computer networks is most often due to queueing [^27]:
|
||||||
[^27]:
|
|
||||||
|
|
||||||
* If several different nodes simultaneously try to send packets to the same destination, the network
|
* If several different nodes simultaneously try to send packets to the same destination, the network
|
||||||
switch must queue them up and feed them into the destination network link one by one (as illustrated
|
switch must queue them up and feed them into the destination network link one by one (as illustrated
|
||||||
|
|
@ -384,8 +364,7 @@ network links and switches, and even each machine’s network interface and CPUs
|
||||||
virtual machines), are shared. Processing large amounts of data can use the entire capacity of
|
virtual machines), are shared. Processing large amounts of data can use the entire capacity of
|
||||||
network links (*saturate* them). As you have no control over or insight into other customers’ usage of the shared
|
network links (*saturate* them). As you have no control over or insight into other customers’ usage of the shared
|
||||||
resources, network delays can be highly variable if someone near you (a *noisy neighbor*) is
|
resources, network delays can be highly variable if someone near you (a *noisy neighbor*) is
|
||||||
using a lot of resources [[30](/en/ch9#Philips2014),
|
using a lot of resources [[^30], [^31]].
|
||||||
[31](/en/ch9#Newman2012)].
|
|
||||||
|
|
||||||
In such environments, you can only choose timeouts experimentally: measure the distribution of
|
In such environments, you can only choose timeouts experimentally: measure the distribution of
|
||||||
network round-trip times over an extended period, and over many machines, to determine the expected
|
network round-trip times over an extended period, and over many machines, to determine the expected
|
||||||
|
|
@ -394,12 +373,9 @@ determine an appropriate trade-off between failure detection delay and risk of p
|
||||||
|
|
||||||
Even better, rather than using configured constant timeouts, systems can continually measure
|
Even better, rather than using configured constant timeouts, systems can continually measure
|
||||||
response times and their variability (*jitter*), and automatically adjust timeouts according to the
|
response times and their variability (*jitter*), and automatically adjust timeouts according to the
|
||||||
observed response time distribution. The Phi Accrual failure detector
|
observed response time distribution. The Phi Accrual failure detector [^32],
|
||||||
[^32],
|
which is used for example in Akka and Cassandra [^33]
|
||||||
which is used for example in Akka and Cassandra
|
is one way of doing this. TCP retransmission timeouts also work similarly [^5].
|
||||||
[^33]
|
|
||||||
is one way of doing this. TCP retransmission timeouts also work similarly
|
|
||||||
[^5].
|
|
||||||
|
|
||||||
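As a rough sketch of adjusting a timeout from observed response times, the estimator below keeps an exponentially weighted average of the response time and of its variability, in the style of TCP's retransmission timeout calculation. It is not the Phi Accrual detector. The class name `AdaptiveTimeout` and the default weights are assumptions chosen for illustration, and a real implementation would also apply lower and upper bounds to the result.

```python
class AdaptiveTimeout:
    """Timeout estimate driven by observed response times (simplified sketch)."""

    def __init__(self, alpha=0.125, beta=0.25, k=4.0):
        self.alpha = alpha    # weight of a new sample in the smoothed mean
        self.beta = beta      # weight of a new sample in the smoothed deviation
        self.k = k            # how many deviations of headroom to allow
        self.srtt = None      # smoothed response time, in seconds
        self.rttvar = None    # smoothed deviation (jitter), in seconds

    def observe(self, rtt):
        """Record one measured response time, in seconds."""
        if self.srtt is None:
            # First sample: no history yet, so start with a generous deviation.
            self.srtt = rtt
            self.rttvar = rtt / 2
        else:
            # Update the deviation first, using the previous smoothed mean.
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt

    def timeout(self):
        """Current timeout: smoothed mean plus k times the smoothed deviation."""
        if self.srtt is None:
            raise ValueError("no response times observed yet")
        return self.srtt + self.k * self.rttvar
```

With this design, a run of steady response times gradually tightens the timeout, while a single slow response widens it again, so the trade-off between detection delay and premature timeouts follows the measured jitter rather than a fixed constant.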
## Synchronous Versus Asynchronous Networks
|
## Synchronous Versus Asynchronous Networks
|
||||||
|
|
||||||
|
|
@ -415,13 +391,11 @@ similar reliability and predictability in computer networks?
|
||||||
|
|
||||||
When you make a call over the telephone network, it establishes a *circuit*: a fixed, guaranteed
|
When you make a call over the telephone network, it establishes a *circuit*: a fixed, guaranteed
|
||||||
amount of bandwidth is allocated for the call, along the entire route between the two callers. This
|
amount of bandwidth is allocated for the call, along the entire route between the two callers. This
|
||||||
circuit remains in place until the call ends
|
circuit remains in place until the call ends [^34].
|
||||||
[^34].
|
|
||||||
For example, an ISDN network runs at a fixed rate of 4,000 frames per second. When a call is
|
For example, an ISDN network runs at a fixed rate of 4,000 frames per second. When a call is
|
||||||
established, it is allocated 16 bits of space within each frame (in each direction). Thus, for the
|
established, it is allocated 16 bits of space within each frame (in each direction). Thus, for the
|
||||||
duration of the call, each side is guaranteed to be able to send exactly 16 bits of audio data every
|
duration of the call, each side is guaranteed to be able to send exactly 16 bits of audio data every
|
||||||
250 microseconds
|
250 microseconds [^35].
|
||||||
[^35].
|
|
||||||
|
|
||||||
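The figures in the ISDN example can be checked with a little arithmetic. The sketch below only restates the numbers given in the text; the constant names are chosen for illustration.

```python
FRAMES_PER_SECOND = 4000   # fixed frame rate stated in the text
BITS_PER_FRAME = 16        # space reserved for one call, per direction

# Guaranteed bandwidth of one call in each direction, in bits per second.
bits_per_second = FRAMES_PER_SECOND * BITS_PER_FRAME

# Time between consecutive frames, in microseconds.
frame_interval_us = 1_000_000 / FRAMES_PER_SECOND
```

So each direction of the call has a fixed 64,000 bits per second available, delivered as 16 bits every 250 microseconds, regardless of what other calls are doing.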
This kind of network is *synchronous*: even as data passes through several routers, it does not
|
This kind of network is *synchronous*: even as data passes through several routers, it does not
|
||||||
suffer from queueing, because the 16 bits of space for the call have already been reserved in the
|
suffer from queueing, because the 16 bits of space for the call have already been reserved in the
|
||||||
|
|
@ -457,15 +431,12 @@ the rate of data transfer to the available network capacity.
|
||||||
|
|
||||||
There have been some attempts to build hybrid networks that support both circuit switching and
|
There have been some attempts to build hybrid networks that support both circuit switching and
|
||||||
packet switching. *Asynchronous Transfer Mode* (ATM) was a competitor to Ethernet in the 1980s, but
|
packet switching. *Asynchronous Transfer Mode* (ATM) was a competitor to Ethernet in the 1980s, but
|
||||||
it didn’t gain much adoption outside of telephone network core switches. InfiniBand has some similarities
|
it didn’t gain much adoption outside of telephone network core switches. InfiniBand has some similarities [^36]:
|
||||||
[^36]:
|
|
||||||
it implements end-to-end flow control at the link layer, which reduces the need for queueing in the
|
it implements end-to-end flow control at the link layer, which reduces the need for queueing in the
|
||||||
network, although it can still suffer from delays due to link congestion
|
network, although it can still suffer from delays due to link congestion [^37].
|
||||||
[^37].
|
|
||||||
With careful use of *quality of service* (QoS, prioritization and scheduling of packets) and *admission
|
With careful use of *quality of service* (QoS, prioritization and scheduling of packets) and *admission
|
||||||
control* (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or
|
control* (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or
|
||||||
provide statistically bounded delay [[27](/en/ch9#Grosvenor2015),
|
provide statistically bounded delay [^27] [^34]. New network algorithms like Low Latency, Low
|
||||||
[34](/en/ch9#Keshav1997)]. New network algorithms like Low Latency, Low
|
|
||||||
Loss, and Scalable Throughput (L4S) attempt to mitigate some of the queuing and congestion control
|
Loss, and Scalable Throughput (L4S) attempt to mitigate some of the queuing and congestion control
|
||||||
problems both at the client and router level. Linux’s traffic controller (TC) also allows
|
problems both at the client and router level. Linux’s traffic controller (TC) also allows
|
||||||
applications to reprioritize packets for QoS purposes.
|
applications to reprioritize packets for QoS purposes.
|
||||||
|
|
@ -489,8 +460,7 @@ fixed cost, so if you utilize it better, each byte you send over the wire is che
|
||||||
|
|
||||||
A similar situation arises with CPUs: if you share each CPU core dynamically between several
|
A similar situation arises with CPUs: if you share each CPU core dynamically between several
|
||||||
threads, one thread sometimes has to wait in the operating system’s run queue while another thread
|
threads, one thread sometimes has to wait in the operating system’s run queue while another thread
|
||||||
is running, so a thread can be paused for varying lengths of time
|
is running, so a thread can be paused for varying lengths of time [^38].
|
||||||
[^38].
|
|
||||||
However, this utilizes the hardware better than if you allocated a static number of CPU cycles to
|
However, this utilizes the hardware better than if you allocated a static number of CPU cycles to
|
||||||
each thread (see [“Response time guarantees”](/en/ch9#sec_distributed_clocks_realtime)). Better hardware utilization is also why cloud
|
each thread (see [“Response time guarantees”](/en/ch9#sec_distributed_clocks_realtime)). Better hardware utilization is also why cloud
|
||||||
platforms run several virtual machines from different customers on the same physical machine.
|
platforms run several virtual machines from different customers on the same physical machine.
|
||||||
|
|
@ -544,8 +514,7 @@ Moreover, each machine on the network has its own clock, which is an actual hard
|
||||||
a quartz crystal oscillator. These devices are not perfectly accurate, so each machine has its own
|
a quartz crystal oscillator. These devices are not perfectly accurate, so each machine has its own
|
||||||
notion of time, which may be slightly faster or slower than on other machines. It is possible to
|
notion of time, which may be slightly faster or slower than on other machines. It is possible to
|
||||||
synchronize clocks to some degree: the most commonly used mechanism is the Network Time Protocol (NTP), which
|
synchronize clocks to some degree: the most commonly used mechanism is the Network Time Protocol (NTP), which
|
||||||
allows the computer clock to be adjusted according to the time reported by a group of servers
|
allows the computer clock to be adjusted according to the time reported by a group of servers [^39].
|
||||||
[^39].
|
|
||||||
The servers in turn get their time from a more accurate time source, such as a GPS receiver.
|
The servers in turn get their time from a more accurate time source, such as a GPS receiver.
|
||||||
|
|
||||||
## Monotonic Versus Time-of-Day Clocks
|
## Monotonic Versus Time-of-Day Clocks
|
||||||
|
|
@ -570,14 +539,12 @@ Time-of-day clocks are usually synchronized with NTP, which means that a timesta
|
||||||
various oddities, as described in the next section. In particular, if the local clock is too far
|
various oddities, as described in the next section. In particular, if the local clock is too far
|
||||||
ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in
|
ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in
|
||||||
time. These jumps, as well as similar jumps caused by leap seconds, make time-of-day clocks
|
time. These jumps, as well as similar jumps caused by leap seconds, make time-of-day clocks
|
||||||
unsuitable for measuring elapsed time
|
unsuitable for measuring elapsed time [^40].
|
||||||
[^40].
|
|
||||||
|
|
||||||
Time-of-day clocks can experience jumps due to the start and end of Daylight Saving Time (DST);
|
Time-of-day clocks can experience jumps due to the start and end of Daylight Saving Time (DST);
|
||||||
these can be avoided by always using UTC as time zone, which does not have DST.
|
these can be avoided by always using UTC as time zone, which does not have DST.
|
||||||
Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward
|
Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward
|
||||||
in steps of 10 ms on older Windows systems
|
in steps of 10 ms on older Windows systems [^41].
|
||||||
[^41].
|
|
||||||
On recent systems, this is less of a problem.
|
On recent systems, this is less of a problem.
|
||||||
|
|
||||||
### Monotonic clocks
|
### Monotonic clocks
|
||||||
|
|
@ -596,12 +563,10 @@ booted up, or something similarly arbitrary. In particular, it makes no sense to
|
||||||
clock values from two different computers, because they don’t mean the same thing.
|
clock values from two different computers, because they don’t mean the same thing.
|
||||||
|
|
||||||
On a server with multiple CPU sockets, there may be a separate timer per CPU, which is not
|
On a server with multiple CPU sockets, there may be a separate timer per CPU, which is not
|
||||||
necessarily synchronized with other CPUs
|
necessarily synchronized with other CPUs [^43].
|
||||||
[^43].
|
|
||||||
Operating systems compensate for any discrepancy and try
|
Operating systems compensate for any discrepancy and try
|
||||||
to present a monotonic view of the clock to application threads, even as they are scheduled across
|
to present a monotonic view of the clock to application threads, even as they are scheduled across
|
||||||
different CPUs. However, it is wise to take this guarantee of monotonicity with a pinch of salt
|
different CPUs. However, it is wise to take this guarantee of monotonicity with a pinch of salt [^44].
|
||||||
[^44].
|
|
||||||
|
|
||||||
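A minimal sketch of measuring elapsed time with a monotonic clock, here using Python's `time.monotonic()`. The helper name `timed` is hypothetical. Only the difference between the two readings is meaningful; the absolute values are not, and they should not be compared across machines.

```python
import time


def timed(fn):
    """Run fn() and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.monotonic()
    result = fn()
    # The difference of two monotonic readings is not affected by the
    # time-of-day clock being reset, so it should not come out negative.
    elapsed = time.monotonic() - start
    return result, elapsed
```

Using `time.time()` in the same place would measure against the time-of-day clock, so a clock reset between the two readings could produce a negative or wildly wrong duration.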
NTP may adjust the frequency at which the monotonic clock moves forward (this is known as *slewing*
|
NTP may adjust the frequency at which the monotonic clock moves forward (this is known as *slewing*
|
||||||
the clock) if it detects that the computer’s local quartz is moving faster or slower than the NTP
|
the clock) if it detects that the computer’s local quartz is moving faster or slower than the NTP
|
||||||
|
|
@ -642,24 +607,17 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples
|
||||||
though occasional spikes in network delay lead to errors of around a second. Depending on the
|
though occasional spikes in network delay lead to errors of around a second. Depending on the
|
||||||
configuration, large network delays can cause the NTP client to give up entirely.
|
configuration, large network delays can cause the NTP client to give up entirely.
|
||||||
* Some NTP servers are wrong or misconfigured, reporting time that is off by hours
|
* Some NTP servers are wrong or misconfigured, reporting time that is off by hours
|
||||||
[[47](/en/ch9#Minar1999),
|
[^47] [^48].
|
||||||
[48](/en/ch9#Holub2014)].
|
|
||||||
NTP clients mitigate such errors by querying several servers and ignoring outliers.
|
NTP clients mitigate such errors by querying several servers and ignoring outliers.
|
||||||
Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you
|
Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you
|
||||||
were told by a stranger on the internet.
|
were told by a stranger on the internet.
|
||||||
* Leap seconds result in a minute that is 59 seconds or 61 seconds long, which messes up timing
|
* Leap seconds result in a minute that is 59 seconds or 61 seconds long, which messes up timing
|
||||||
assumptions in systems that are not designed with leap seconds in mind
|
assumptions in systems that are not designed with leap seconds in mind [^49].
|
||||||
[^49].
|
The fact that leap seconds have crashed many large systems [^40] [^50]
|
||||||
The fact that leap seconds have crashed many large systems
|
|
||||||
[[40](/en/ch9#GrahamCumming2017),
|
|
||||||
[50](/en/ch9#Minar2012_ch9)]
|
|
||||||
shows how easy it is for incorrect assumptions about clocks to sneak into a system. The best
|
shows how easy it is for incorrect assumptions about clocks to sneak into a system. The best
|
||||||
way of handling leap seconds may be to make NTP servers “lie,” by performing the leap second
|
way of handling leap seconds may be to make NTP servers “lie,” by performing the leap second
|
||||||
adjustment gradually over the course of a day (this is known as *smearing*)
|
adjustment gradually over the course of a day (this is known as *smearing*) [^51] [^52],
|
||||||
[[51](/en/ch9#Pascoe2011),
|
although actual NTP server behavior varies in practice [^53].
|
||||||
[52](/en/ch9#Zhao2015)],
|
|
||||||
although actual NTP server behavior varies in practice
|
|
||||||
[^53].
|
|
||||||
Leap seconds will no longer be used from 2035 onwards, so this problem will fortunately go away.
|
Leap seconds will no longer be used from 2035 onwards, so this problem will fortunately go away.
|
||||||
* In virtual machines, the hardware clock is virtualized, which raises additional challenges for
|
* In virtual machines, the hardware clock is virtualized, which raises additional challenges for
|
||||||
applications that need accurate timekeeping
|
applications that need accurate timekeeping
|
||||||
|
|
@ -668,31 +626,24 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples
|
||||||
while another VM is running. From an application’s point of view, this pause manifests itself as
|
while another VM is running. From an application’s point of view, this pause manifests itself as
|
||||||
the clock suddenly jumping forward [^29].
|
the clock suddenly jumping forward [^29].
|
||||||
If a VM pauses for several seconds, the clock may then be several seconds behind the actual time,
|
If a VM pauses for several seconds, the clock may then be several seconds behind the actual time,
|
||||||
but NTP may continue to report that the clock is almost perfectly in sync
|
but NTP may continue to report that the clock is almost perfectly in sync [^55].
|
||||||
[^55].
|
|
||||||
* If you run software on devices that you don’t fully control (e.g., mobile or embedded devices), you
|
* If you run software on devices that you don’t fully control (e.g., mobile or embedded devices), you
|
||||||
probably cannot trust the device’s hardware clock at all. Some users deliberately set their
|
probably cannot trust the device’s hardware clock at all. Some users deliberately set their
|
||||||
hardware clock to an incorrect date and time, for example to cheat in games
|
hardware clock to an incorrect date and time, for example to cheat in games [^56].
|
||||||
[^56].
|
|
||||||
As a result, the clock might be set to a time wildly in the past or the future.
|
As a result, the clock might be set to a time wildly in the past or the future.
|
||||||
|
|
||||||
It is possible to achieve very good clock accuracy if you care about it sufficiently to invest
|
It is possible to achieve very good clock accuracy if you care about it sufficiently to invest
|
||||||
significant resources. For example, the MiFID II European regulation for financial
|
significant resources. For example, the MiFID II European regulation for financial
|
||||||
institutions requires all high-frequency trading funds to synchronize their clocks to within 100
|
institutions requires all high-frequency trading funds to synchronize their clocks to within 100
|
||||||
microseconds of UTC, in order to help debug market anomalies such as “flash crashes” and to help
|
microseconds of UTC, in order to help debug market anomalies such as “flash crashes” and to help
|
||||||
detect market manipulation
|
detect market manipulation [^57].
|
||||||
[^57].
|
|
||||||
|
|
||||||
Such accuracy can be achieved with some special hardware (GPS receivers and/or atomic clocks), the
|
Such accuracy can be achieved with some special hardware (GPS receivers and/or atomic clocks), the
|
||||||
Precision Time Protocol (PTP) and careful deployment and monitoring
|
Precision Time Protocol (PTP) and careful deployment and monitoring [^58] [^59].
|
||||||
[[58](/en/ch9#Bigum2015),
|
|
||||||
[59](/en/ch9#Obleukhov2022)].
|
|
||||||
Relying on GPS alone can be risky because GPS signals can easily be jammed. In some locations this
|
Relying on GPS alone can be risky because GPS signals can easily be jammed. In some locations this
|
||||||
happens frequently, e.g. close to military facilities
|
happens frequently, e.g. close to military facilities [^60].
|
||||||
[^60].
|
|
||||||
Some cloud providers have begun offering high-accuracy clock synchronization for their virtual
|
Some cloud providers have begun offering high-accuracy clock synchronization for their virtual
|
||||||
machines
|
machines [^61].
|
||||||
[^61].
|
|
||||||
However, clock synchronization still requires a lot of care. If your NTP daemon is misconfigured, or
|
However, clock synchronization still requires a lot of care. If your NTP daemon is misconfigured, or
|
||||||
a firewall is blocking NTP traffic, the clock error due to drift can quickly become large.
|
a firewall is blocking NTP traffic, the clock error due to drift can quickly become large.
|
||||||
|
|
||||||
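To get a feel for how quickly drift accumulates once synchronization stops, the sketch below multiplies a drift rate by the time since the last successful sync. The 200 ppm rate used in the comments is an assumed value for illustration only, not a measurement, and the function name is hypothetical.

```python
def worst_case_drift_seconds(drift_ppm, seconds_since_sync):
    """Upper bound on clock error from drift alone.

    drift_ppm: assumed drift rate in parts per million (microseconds of
        error per second of elapsed time).
    seconds_since_sync: time since the clock was last corrected.
    """
    return drift_ppm * 1e-6 * seconds_since_sync


# Assuming 200 ppm: about 6 ms of error after 30 seconds without a sync,
# and about 17 seconds of error after a full day.
error_after_30s = worst_case_drift_seconds(200, 30)
error_after_day = worst_case_drift_seconds(200, 24 * 60 * 60)
```

Under that assumption, a blocked NTP port that goes unnoticed for a day is enough to put a node many seconds out of step with its peers.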
|
|
@ -714,8 +665,7 @@ fixed. On the other hand, if its quartz clock is defective or its NTP client is
|
||||||
things will seem to work fine, even though its clock gradually drifts further and further away from
|
things will seem to work fine, even though its clock gradually drifts further and further away from
|
||||||
reality. If some piece of software is relying on an accurately synchronized clock, the result is
|
reality. If some piece of software is relying on an accurately synchronized clock, the result is
|
||||||
more likely to be silent and subtle data loss than a dramatic crash
|
more likely to be silent and subtle data loss than a dramatic crash
|
||||||
[[62](/en/ch9#Kingsbury2013cassandra),
|
[[^62], [^63]].
|
||||||
[63](/en/ch9#Daily2013_ch9)].
|
|
||||||
|
|
||||||
Thus, if you use software that requires synchronized clocks, it is essential that you also carefully
|
Thus, if you use software that requires synchronized clocks, it is essential that you also carefully
|
||||||
monitor the clock offsets between all the machines. Any node whose clock drifts too far from the
|
monitor the clock offsets between all the machines. Any node whose clock drifts too far from the
|
||||||
|
|
@ -725,8 +675,7 @@ the broken clocks before they can cause too much damage.
|
||||||
### Timestamps for ordering events
|
### Timestamps for ordering events
|
||||||
|
|
||||||
Let’s consider one particular situation in which it is tempting, but dangerous, to rely on clocks:
|
Let’s consider one particular situation in which it is tempting, but dangerous, to rely on clocks:
|
||||||
ordering of events across multiple nodes
|
ordering of events across multiple nodes [^64].
|
||||||
[^64].
|
|
||||||
For example, if two clients write to a distributed database, who got there first? Which write is the
|
For example, if two clients write to a distributed database, who got there first? Which write is the
|
||||||
more recent one?
|
more recent one?
|
||||||
|
|
||||||
|
|
@ -766,8 +715,8 @@ serious problems:
|
||||||
|
|
||||||
* Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite
|
* Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite
|
||||||
values previously written by a node with a fast clock until the clock skew between the nodes has
|
values previously written by a node with a fast clock until the clock skew between the nodes has
|
||||||
elapsed [[63](/en/ch9#Daily2013_ch9),
|
elapsed [[^63],
|
||||||
[65](/en/ch9#Kingsbury2013timestamps)].
|
[^65]].
|
||||||
This scenario can cause arbitrary amounts of data to be silently dropped without any error being
|
This scenario can cause arbitrary amounts of data to be silently dropped without any error being
|
||||||
reported to the application.
|
reported to the application.
|
||||||
* LWW cannot distinguish between writes that occurred sequentially in quick succession (in
|
* LWW cannot distinguish between writes that occurred sequentially in quick succession (in
|
||||||
|
|
@ -830,8 +779,7 @@ Unfortunately, most systems don’t expose this uncertainty: for example, when y
|
||||||
`clock_gettime()`, the return value doesn’t tell you the expected error of the timestamp, so you
|
`clock_gettime()`, the return value doesn’t tell you the expected error of the timestamp, so you
|
||||||
don’t know if its confidence interval is five milliseconds or five years.
|
don’t know if its confidence interval is five milliseconds or five years.
|
||||||
|
|
||||||
There are exceptions: the *TrueTime* API in Google’s Spanner
|
There are exceptions: the *TrueTime* API in Google’s Spanner [^45] and Amazon’s ClockBound explicitly report the
|
||||||
[^45] and Amazon’s ClockBound explicitly report the
|
|
||||||
confidence interval on the local clock. When you ask it for the current time, you get back two
|
confidence interval on the local clock. When you ask it for the current time, you get back two
|
||||||
values: `[earliest, latest]`, which are the *earliest possible* and the *latest possible*
|
values: `[earliest, latest]`, which are the *earliest possible* and the *latest possible*
|
||||||
timestamp. Based on its uncertainty calculations, the clock knows that the actual current time is
|
timestamp. Based on its uncertainty calculations, the clock knows that the actual current time is
|
||||||
|
|
@ -864,8 +812,7 @@ the synchronization good enough, they would have the right properties: later tra
|
||||||
higher timestamp. The problem, of course, is the uncertainty about clock accuracy.
|
higher timestamp. The problem, of course, is the uncertainty about clock accuracy.
|
||||||
|
|
||||||
Spanner implements snapshot isolation across datacenters in this way
|
Spanner implements snapshot isolation across datacenters in this way
|
||||||
[[68](/en/ch9#Demirbas2013),
|
[[^68], [^69]].
|
||||||
[69](/en/ch9#Malkhi2013)].
|
|
||||||
It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the
|
It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the
|
||||||
following observation: if you have two confidence intervals, each consisting of an earliest and
|
following observation: if you have two confidence intervals, each consisting of an earliest and
|
||||||
latest possible timestamp (*A* = [*Aearliest*, *Alatest*] and
|
latest possible timestamp (*A* = [*Aearliest*, *Alatest*] and
|
||||||
|
|
@ -884,10 +831,7 @@ receiver or atomic clock in each datacenter, allowing clocks to be synchronized
|
||||||
The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to
|
The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to
|
||||||
have a confidence interval, and the accurate clock sources only help keep that interval small. Other
|
have a confidence interval, and the accurate clock sources only help keep that interval small. Other
|
||||||
systems are beginning to adopt similar approaches: for example, YugabyteDB can leverage ClockBound
|
systems are beginning to adopt similar approaches: for example, YugabyteDB can leverage ClockBound
|
||||||
when running on AWS [^70],
|
when running on AWS [^70], and several other systems now also rely on clock synchronization to various degrees [^71] [^72].
|
||||||
and several other systems now also rely on clock synchronization to various degrees
|
|
||||||
[[71](/en/ch9#Kimball2022),
|
|
||||||
[72](/en/ch9#Demirbas2025)].
|
|
||||||
|
|
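The observation about confidence intervals can be sketched in a few lines of JavaScript. This is only an illustration under stated assumptions, not the TrueTime or ClockBound API: `interval`, `definitelyBefore`, and `compare` are hypothetical names, and the timestamps are made-up millisecond values.

```js
// Hypothetical helper: a clock reading is an interval, not a single point in time.
function interval(earliest, latest) {
  if (earliest > latest) throw new RangeError("earliest must not exceed latest");
  return { earliest, latest };
}

// A definitely happened before B only if the two intervals do not overlap.
function definitelyBefore(a, b) {
  return a.latest < b.earliest;
}

// Returns "before", "after", or "uncertain" when the intervals overlap.
function compare(a, b) {
  if (definitelyBefore(a, b)) return "before";
  if (definitelyBefore(b, a)) return "after";
  return "uncertain";
}

const a = interval(100, 107);
const b = interval(110, 115);
const c = interval(105, 112);

console.log(compare(a, b)); // "before"
console.log(compare(b, a)); // "after"
console.log(compare(a, c)); // "uncertain"
```

The narrower the intervals, the less often `compare` has to answer "uncertain", which is why accurate clock sources help even though they are not strictly required.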
||||||
## Process Pauses
|
## Process Pauses
|
||||||
|
|
||||||
|
|
@ -905,7 +849,7 @@ lease, so another node can take over when it expires.
|
||||||
|
|
||||||
You can imagine the request-handling loop looking something like this:
|
You can imagine the request-handling loop looking something like this:
|
||||||
|
|
||||||
```
|
```js
|
||||||
while (true) {
|
while (true) {
|
||||||
request = getIncomingRequest();
|
request = getIncomingRequest();
|
||||||
|
|
||||||
|
|
@ -1048,8 +992,7 @@ operating in a non-real-time environment.
|
||||||
|
|
||||||
### Limiting the impact of garbage collection
|
### Limiting the impact of garbage collection
|
||||||
|
|
||||||
Garbage collection used to be one of the biggest reasons for process pauses
|
Garbage collection used to be one of the biggest reasons for process pauses [^79],
|
||||||
[^79],
|
|
||||||
but fortunately GC algorithms have improved a lot: a properly tuned collector will now usually pause
|
but fortunately GC algorithms have improved a lot: a properly tuned collector will now usually pause
|
||||||
for no more than a few milliseconds. The Java runtime offers collectors such as concurrent mark
|
for no more than a few milliseconds. The Java runtime offers collectors such as concurrent mark
|
||||||
sweep (CMS), garbage-first (G1), the Z garbage collector (ZGC), Epsilon, and Shenandoah. Each of
|
sweep (CMS), garbage-first (G1), the Z garbage collector (ZGC), Epsilon, and Shenandoah. Each of
|
||||||
|
|
@ -1068,13 +1011,11 @@ handle requests from clients while one node is collecting its garbage. If the ru
|
||||||
application that a node soon requires a GC pause, the application can stop sending new requests to
|
application that a node soon requires a GC pause, the application can stop sending new requests to
|
||||||
that node, wait for it to finish processing outstanding requests, and then perform the GC while no
|
that node, wait for it to finish processing outstanding requests, and then perform the GC while no
|
||||||
requests are in progress. This trick hides GC pauses from clients and reduces the high percentiles
|
requests are in progress. This trick hides GC pauses from clients and reduces the high percentiles
|
||||||
of the response time [[80](/en/ch9#Terei2015),
|
of the response time [[^80], [^81]].
|
||||||
[81](/en/ch9#Maas2015)].
|
|
||||||
|
|
||||||
A variant of this idea is to use the garbage collector only for short-lived objects (which are fast
|
A variant of this idea is to use the garbage collector only for short-lived objects (which are fast
|
||||||
to collect) and to restart processes periodically, before they accumulate enough long-lived objects
|
to collect) and to restart processes periodically, before they accumulate enough long-lived objects
|
||||||
to require a full GC of long-lived objects [[79](/en/ch9#Thompson2013),
|
to require a full GC of long-lived objects [[^79], [^82]].
|
||||||
[82](/en/ch9#Fowler2011_ch9)].
|
|
||||||
One node can be restarted at a time, and traffic can be shifted away from the node before the
|
One node can be restarted at a time, and traffic can be shifted away from the node before the
|
||||||
planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding)).
|
planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding)).
|
||||||
|
|
||||||
|
|
@ -1116,8 +1057,7 @@ assumptions.
|
||||||
## The Majority Rules
|
## The Majority Rules
|
||||||
|
|
||||||
Imagine a network with an asymmetric fault: a node is able to receive all messages sent to it, but
|
Imagine a network with an asymmetric fault: a node is able to receive all messages sent to it, but
|
||||||
any outgoing messages from that node are dropped or delayed
|
any outgoing messages from that node are dropped or delayed [^22]. Even though that node is working
|
||||||
[^22]. Even though that node is working
|
|
||||||
perfectly well, and is receiving requests from other nodes, the other nodes cannot hear its
|
perfectly well, and is receiving requests from other nodes, the other nodes cannot hear its
|
||||||
responses. After some timeout, the other nodes declare it dead, because they haven’t heard from the
|
responses. After some timeout, the other nodes declare it dead, because they haven’t heard from the
|
||||||
node. The situation unfolds like a nightmare: the semi-disconnected node is dragged to the
|
node. The situation unfolds like a nightmare: the semi-disconnected node is dragged to the
|
||||||
|
|
@ -1158,8 +1098,7 @@ the use of quorums in more detail when we get to *consensus algorithms* in [Chap
|
||||||
|
|
||||||
## Distributed Locks and Leases
|
## Distributed Locks and Leases
|
||||||
|
|
||||||
Locks and leases in distributed applications are prone to misuse, and are a common source of bugs
|
Locks and leases in distributed applications are prone to misuse, and are a common source of bugs [^84].
|
||||||
[^84].
|
|
||||||
Let’s look at one particular case of how they can go wrong.
|
Let’s look at one particular case of how they can go wrong.
|
||||||
|
|
||||||
In [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses) we saw that a lease is a kind of lock that times out and can be
|
In [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses) we saw that a lease is a kind of lock that times out and can be
|
||||||
|
|
@ -1181,8 +1120,7 @@ could be lost or corrupted data, which is much more serious.
|
||||||
|
|
||||||
For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect
|
For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect
|
||||||
implementation of locking. (The bug is not theoretical: HBase used to have this problem
|
implementation of locking. (The bug is not theoretical: HBase used to have this problem
|
||||||
[[85](/en/ch9#Junqueira2013_ch9),
|
[[^85], [^86]].)
|
||||||
[86](/en/ch9#Soztutar2013hdfs)].)
|
|
||||||
Say you want to ensure that a file in a storage service can only be
|
Say you want to ensure that a file in a storage service can only be
|
||||||
accessed by one client at a time, because if multiple clients tried to write to it, the file would
|
accessed by one client at a time, because if multiple clients tried to write to it, the file would
|
||||||
become corrupted. You try to implement this by requiring a client to obtain a lease from a lock
|
become corrupted. You try to implement this by requiring a client to obtain a lease from a lock
|
||||||
|
|
@ -1220,12 +1158,10 @@ split brain. This is called *fencing off* the zombie.
|
||||||
|
|
||||||
Some systems attempt to fence off zombies by shutting them down, for example by disconnecting them
|
Some systems attempt to fence off zombies by shutting them down, for example by disconnecting them
|
||||||
from the network [^9], shutting down the VM via
|
from the network [^9], shutting down the VM via
|
||||||
the cloud provider’s management interface, or even physically powering down the machine
|
the cloud provider’s management interface, or even physically powering down the machine [^87].
|
||||||
[^87].
|
|
||||||
This approach is known as *Shoot The Other Node In The Head* or STONITH. Unfortunately, it suffers
|
This approach is known as *Shoot The Other Node In The Head* or STONITH. Unfortunately, it suffers
|
||||||
from some problems: it does not protect against large network delays like in
|
from some problems: it does not protect against large network delays like in
|
||||||
[Figure 9-5](/en/ch9#fig_distributed_lease_delay); it can happen that all of the nodes shut each other down
|
[Figure 9-5](/en/ch9#fig_distributed_lease_delay); it can happen that all of the nodes shut each other down [^19]; and by the time the zombie has been
|
||||||
[^19]; and by the time the zombie has been
|
|
||||||
detected and shut down, it may already be too late and data may already have been corrupted.
|
detected and shut down, it may already be too late and data may already have been corrupted.
|
||||||
|
|
||||||
A more robust fencing solution, which protects against both zombies and delayed requests, is
|
A more robust fencing solution, which protects against both zombies and delayed requests, is
|
||||||
|
|
@ -1257,10 +1193,8 @@ write has completed, any zombies are fenced off.
|
||||||
|
|
||||||
If ZooKeeper is your lock service, you can use the transaction ID `zxid` or the node version
|
If ZooKeeper is your lock service, you can use the transaction ID `zxid` or the node version
|
||||||
`cversion` as a fencing token [^85].
|
`cversion` as a fencing token [^85].
|
||||||
With etcd, the revision number along with the lease ID serves a similar purpose
|
With etcd, the revision number along with the lease ID serves a similar purpose [^89].
|
||||||
[^89].
|
The FencedLock API in Hazelcast explicitly generates a fencing token [^90].
|
||||||
The FencedLock API in Hazelcast explicitly generates a fencing token
|
|
||||||
[^90].
|
|
||||||
|
|
||||||
This mechanism requires that the storage service has some way of checking whether a write is based
|
This mechanism requires that the storage service has some way of checking whether a write is based
|
||||||
on an outdated token. Alternatively, it’s sufficient for the service to support a write that
|
on an outdated token. Alternatively, it’s sufficient for the service to support a write that
|
||||||
|
|
@ -1273,10 +1207,8 @@ services support such a check: Amazon S3 calls it *conditional writes*, Azure Bl
|
||||||
|
|
||||||
If your clients need to write only to one storage service that supports such conditional writes, the
|
If your clients need to write only to one storage service that supports such conditional writes, the
|
||||||
lock service is somewhat redundant
|
lock service is somewhat redundant
|
||||||
[[91](/en/ch9#Kleppmann2016),
|
[[^91], [^92]],
|
||||||
[92](/en/ch9#Sanfilippo2016)],
|
since the lease assignment could have been implemented directly based on that storage service [^93].
|
||||||
since the lease assignment could have been implemented directly based on that storage service
|
|
||||||
[^93].
|
|
||||||
However, once you have a fencing token you can also use it with multiple services or replicas, and
|
However, once you have a fencing token you can also use it with multiple services or replicas, and
|
||||||
ensure that the old leaseholder is fenced off on all of those services.
|
ensure that the old leaseholder is fenced off on all of those services.
|
||||||
|
|
||||||
|
|
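As a rough sketch of the check described above, the following JavaScript models a storage service that remembers the highest fencing token it has seen for each key and rejects writes carrying a lower one. The `FencedStore` class, its method names, and the token values 33 and 34 are hypothetical, chosen only for illustration; a real service would persist this state and expose the check through its own API.

```js
// Hypothetical in-memory storage service that rejects writes with a stale fencing token.
class FencedStore {
  constructor() {
    this.highestToken = new Map(); // key -> highest token accepted so far
    this.values = new Map();       // key -> last accepted value
  }

  // Accepts the write only if its token is not lower than any token already seen for the key.
  write(key, value, token) {
    const highest = this.highestToken.get(key) ?? -Infinity;
    if (token < highest) {
      return { ok: false, reason: `stale token ${token}, already saw ${highest}` };
    }
    this.highestToken.set(key, token);
    this.values.set(key, value);
    return { ok: true };
  }

  read(key) {
    return this.values.get(key);
  }
}

// The same token can be presented to several services or replicas.
const replicas = [new FencedStore(), new FencedStore()];

// The new leaseholder (token 34) writes first, while the old one (token 33) is paused.
const accepted = replicas.map((r) => r.write("file", "from client 2", 34));

// The old leaseholder wakes up and tries to write with its outdated token.
const rejected = replicas.map((r) => r.write("file", "from client 1", 33));

console.log(accepted.every((res) => res.ok));  // true
console.log(rejected.every((res) => !res.ok)); // true
console.log(replicas[0].read("file"));         // "from client 2"
```

Writes with an equal token are accepted in this sketch so that the current leaseholder can write more than once under the same lease.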
@ -1344,8 +1276,7 @@ prone to intrigue and conspiracy than those elsewhere. Rather, the name is deriv
|
||||||
in the sense of *excessively complicated, bureaucratic, devious*, which was used in politics long
|
in the sense of *excessively complicated, bureaucratic, devious*, which was used in politics long
|
||||||
before computers [^96].
|
before computers [^96].
|
||||||
Lamport wanted to choose a nationality that would not offend any readers, and he was advised that
|
Lamport wanted to choose a nationality that would not offend any readers, and he was advised that
|
||||||
calling it *The Albanian Generals Problem* was not such a good idea
|
calling it *The Albanian Generals Problem* was not such a good idea [^97].
|
||||||
[^97].
|
|
||||||
|
|
||||||
A system is *Byzantine fault-tolerant* if it continues to operate correctly even if some of the
|
A system is *Byzantine fault-tolerant* if it continues to operate correctly even if some of the
|
||||||
nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering
|
nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering
|
||||||
|
|
@ -1355,8 +1286,8 @@ with the network. This concern is relevant in certain specific circumstances. Fo
|
||||||
by radiation, leading it to respond to other nodes in arbitrarily unpredictable ways. Since a
|
by radiation, leading it to respond to other nodes in arbitrarily unpredictable ways. Since a
|
||||||
system failure would be very expensive (e.g., an aircraft crashing and killing everyone on board,
|
system failure would be very expensive (e.g., an aircraft crashing and killing everyone on board,
|
||||||
or a rocket colliding with the International Space Station), flight control systems must tolerate
|
or a rocket colliding with the International Space Station), flight control systems must tolerate
|
||||||
Byzantine faults [[98](/en/ch9#Rushby2001),
|
Byzantine faults [[^98],
|
||||||
[99](/en/ch9#Edge2013)].
|
[^99]].
|
||||||
* In a system with multiple participating parties, some participants may attempt to cheat or
|
* In a system with multiple participating parties, some participants may attempt to cheat or
|
||||||
defraud others. In such circumstances, it is not safe for a node to simply trust another node’s
|
defraud others. In such circumstances, it is not safe for a node to simply trust another node’s
|
||||||
messages, since they may be sent with malicious intent. For example, cryptocurrencies like
|
messages, since they may be sent with malicious intent. For example, cryptocurrencies like
|
||||||
|
|
@ -1367,14 +1298,11 @@ with the network. This concern is relevant in certain specific circumstances. Fo
|
||||||
However, in the kinds of systems we discuss in this book, we can usually safely assume that there
|
However, in the kinds of systems we discuss in this book, we can usually safely assume that there
|
||||||
are no Byzantine faults. In a datacenter, all the nodes are controlled by your organization (so
|
are no Byzantine faults. In a datacenter, all the nodes are controlled by your organization (so
|
||||||
they can hopefully be trusted) and radiation levels are low enough that memory corruption is not a
|
they can hopefully be trusted) and radiation levels are low enough that memory corruption is not a
|
||||||
major problem (although datacenters in orbit are being considered
|
major problem (although datacenters in orbit are being considered [^101]).
|
||||||
[^101]).
|
|
||||||
Multitenant systems have mutually untrusting tenants, but they are isolated from each
|
Multitenant systems have mutually untrusting tenants, but they are isolated from each
|
||||||
other using firewalls, virtualization, and access control policies, not using Byzantine fault
|
other using firewalls, virtualization, and access control policies, not using Byzantine fault
|
||||||
tolerance. Protocols for making systems Byzantine fault-tolerant are quite expensive
|
tolerance. Protocols for making systems Byzantine fault-tolerant are quite expensive [^102],
|
||||||
[^102],
|
and fault-tolerant embedded systems rely on support from the hardware level [^98]. In most server-side data systems, the
|
||||||
and fault-tolerant embedded systems rely on support from the hardware level
|
|
||||||
[^98]. In most server-side data systems, the
|
|
||||||
cost of deploying Byzantine fault-tolerant solutions makes them impracticable.
|
cost of deploying Byzantine fault-tolerant solutions makes them impracticable.
|
||||||
|
|
||||||
Web applications do need to expect arbitrary and malicious behavior of clients that are under
|
Web applications do need to expect arbitrary and malicious behavior of clients that are under
|
||||||
|
|
@ -1383,8 +1311,7 @@ escaping are so important: to prevent SQL injection and cross-site scripting, fo
|
||||||
we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the
|
we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the
|
||||||
authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where
|
authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where
|
||||||
there is no such central authority, Byzantine fault tolerance is more relevant
|
there is no such central authority, Byzantine fault tolerance is more relevant
|
||||||
[[103](/en/ch9#Kleppmann2020),
|
[[^103], [^104]].
|
||||||
[104](/en/ch9#Kleppmann2022)].
|
|
||||||
|
|
||||||
A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to
|
A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to
|
||||||
all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant
|
all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant
|
||||||
|
|
@ -1409,9 +1336,9 @@ pragmatic steps toward better reliability. For example:
|
||||||
|
|
||||||
* Network packets do sometimes get corrupted due to hardware issues or bugs in operating systems,
|
* Network packets do sometimes get corrupted due to hardware issues or bugs in operating systems,
|
||||||
drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and
|
drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and
|
||||||
UDP, but sometimes they evade detection [[105](/en/ch9#Gilman2015),
|
UDP, but sometimes they evade detection [[^105],
|
||||||
[106](/en/ch9#Stone2000),
|
[^106],
|
||||||
[107](/en/ch9#Jones2015)].
|
[^107]].
|
||||||
Simple measures are usually sufficient protection against such corruption, such as checksums in
|
Simple measures are usually sufficient protection against such corruption, such as checksums in
|
||||||
the application-level protocol. TLS-encrypted connections also offer protection against
|
the application-level protocol. TLS-encrypted connections also offer protection against
|
||||||
corruption.
|
corruption.
|
||||||
|
|
@ -1543,8 +1470,7 @@ liveness property [^115].)
|
||||||
Safety is often informally defined as *nothing bad happens*, and liveness as *something good
|
Safety is often informally defined as *nothing bad happens*, and liveness as *something good
|
||||||
eventually happens*. However, it’s best to not read too much into those informal definitions,
|
eventually happens*. However, it’s best to not read too much into those informal definitions,
|
||||||
because “good” and “bad” are value judgements that don’t apply well to algorithms. The actual
|
because “good” and “bad” are value judgements that don’t apply well to algorithms. The actual
|
||||||
definitions of safety and liveness are more precise
|
definitions of safety and liveness are more precise [^116]:
|
||||||
[^116]:
|
|
||||||
|
|
||||||
* If a safety property is violated, we can point at a particular point in time at which it was
|
* If a safety property is violated, we can point at a particular point in time at which it was
|
||||||
broken (for example, if the uniqueness property was violated, we can identify the particular
|
broken (for example, if the uniqueness property was violated, we can identify the particular
|
||||||
|
|
@ -1556,8 +1482,7 @@ definitions of safety and liveness are more precise
|
||||||
|
|
||||||
An advantage of distinguishing between safety and liveness properties is that it helps us deal with
|
An advantage of distinguishing between safety and liveness properties is that it helps us deal with
|
||||||
difficult system models. For distributed algorithms, it is common to require that safety properties
|
difficult system models. For distributed algorithms, it is common to require that safety properties
|
||||||
*always* hold, in all possible situations of a system model
|
*always* hold, in all possible situations of a system model [^108]. That is, even if all nodes crash, or
|
||||||
[^108]. That is, even if all nodes crash, or
|
|
||||||
the entire network fails, the algorithm must nevertheless ensure that it does not return a wrong
|
the entire network fails, the algorithm must nevertheless ensure that it does not return a wrong
|
||||||
result (i.e., that the safety properties remain satisfied).
|
result (i.e., that the safety properties remain satisfied).
|
||||||
|
|
||||||
|
|
@ -1576,11 +1501,9 @@ abstraction of reality.
|
||||||
|
|
||||||
For example, algorithms in the crash-recovery model generally assume that data in stable storage
|
For example, algorithms in the crash-recovery model generally assume that data in stable storage
|
||||||
survives crashes. However, what happens if the data on disk is corrupted, or the data is wiped out
|
survives crashes. However, what happens if the data on disk is corrupted, or the data is wiped out
|
||||||
due to hardware error or misconfiguration
|
due to hardware error or misconfiguration [^117]?
|
||||||
[^117]?
|
|
||||||
What happens if a server has a firmware bug and fails to recognize
|
What happens if a server has a firmware bug and fails to recognize
|
||||||
its hard drives on reboot, even though the drives are correctly attached to the server
|
its hard drives on reboot, even though the drives are correctly attached to the server [^118]?
|
||||||
[^118]?
|
|
||||||
|
|
||||||
Quorum algorithms (see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)) rely on a node remembering the data
|
Quorum algorithms (see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)) rely on a node remembering the data
|
||||||
that it claims to have stored. If a node may suffer from amnesia and forget previously stored data,
|
that it claims to have stored. If a node may suffer from amnesia and forget previously stored data,
|
||||||
|
|
@ -1592,8 +1515,7 @@ The theoretical description of an algorithm can declare that certain things are
|
||||||
to happen—and in non-Byzantine systems, we do have to make some assumptions about faults that can
|
to happen—and in non-Byzantine systems, we do have to make some assumptions about faults that can
|
||||||
and cannot happen. However, a real implementation may still have to include code to handle the
|
and cannot happen. However, a real implementation may still have to include code to handle the
|
||||||
case where something happens that was assumed to be impossible, even if that handling boils down to
|
case where something happens that was assumed to be impossible, even if that handling boils down to
|
||||||
`printf("Sucks to be you")` and `exit(666)`—i.e., letting a human operator clean up the mess
|
`printf("Sucks to be you")` and `exit(666)`—i.e., letting a human operator clean up the mess [^119].
|
||||||
[^119].
|
|
||||||
(This is one difference between computer science and software engineering.)
|
(This is one difference between computer science and software engineering.)
|
||||||
|
|
||||||
That is not to say that theoretical, abstract system models are worthless—quite the opposite.
|
That is not to say that theoretical, abstract system models are worthless—quite the opposite.
|
||||||
|
|
@ -1620,8 +1542,7 @@ It is prudent to combine theoretical analysis with empirical testing to verify t
|
||||||
behave as expected. Techniques such as property-based testing, fuzzing, and deterministic simulation
|
behave as expected. Techniques such as property-based testing, fuzzing, and deterministic simulation
|
||||||
testing (DST) use randomization to test a system in a wide range of situations. Companies such as
|
testing (DST) use randomization to test a system in a wide range of situations. Companies such as
|
||||||
Amazon Web Services have successfully used a combination of these techniques on many of their
|
Amazon Web Services have successfully used a combination of these techniques on many of their
|
||||||
products [[120](/en/ch9#Brooker2024correctness),
|
products [[^120], [^121]].
|
||||||
[121](/en/ch9#SatarinTesting)].
|
|
||||||
|
|
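To make the idea of randomized testing concrete, here is a minimal hand-rolled sketch of a property-based test in JavaScript. It does not use any particular testing library; `makeRng`, `makeStore`, and `checkProperty` are hypothetical names, and the system under test is a toy store that accepts a write only if its token is not lower than the highest token seen so far. The random generator is seeded, so any failing scenario can be replayed from its seed.

```js
// Small seeded pseudorandom generator, so that a failing run can be replayed from its seed.
function makeRng(seed) {
  let state = seed >>> 0;
  return function next() {
    // Linear congruential step modulo 2^32; returns a float in [0, 1).
    state = (Math.imul(state, 1664525) + 1013904223) >>> 0;
    return state / 4294967296;
  };
}

// Toy system under test: accepts a write only if its token is not lower than the highest seen.
function makeStore() {
  let highest = -Infinity;
  return {
    write(token) {
      if (token < highest) return false;
      highest = token;
      return true;
    },
  };
}

// Property: the tokens of accepted writes never decrease, whatever order writes arrive in.
function checkProperty(seed, numWrites = 50) {
  const rng = makeRng(seed);
  const store = makeStore();
  let lastAccepted = -Infinity;
  for (let i = 0; i < numWrites; i++) {
    const token = Math.floor(rng() * 20);
    if (store.write(token)) {
      if (token < lastAccepted) return { ok: false, seed, step: i };
      lastAccepted = token;
    }
  }
  return { ok: true, seed };
}

// Run many seeds; each seed deterministically reproduces one random scenario.
const failures = [];
for (let seed = 1; seed <= 200; seed++) {
  const result = checkProperty(seed);
  if (!result.ok) failures.push(result);
}
console.log(failures.length === 0 ? "property held for all seeds" : failures);
```

Real property-based testing and fuzzing tools add features such as input shrinking and coverage guidance, but the core loop of generating random scenarios and checking an invariant is the same.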
||||||
### Model checking and specification languages
|
### Model checking and specification languages
|
||||||
|
|
||||||
|
|
@ -1642,20 +1563,16 @@ longer executions would then not be found.
|
||||||
Still, model checkers strike a nice balance between ease of use and the ability to find non-obvious
|
Still, model checkers strike a nice balance between ease of use and the ability to find non-obvious
|
||||||
bugs. CockroachDB, TiDB, Kafka, and many other distributed systems use model specifications to find
|
bugs. CockroachDB, TiDB, Kafka, and many other distributed systems use model specifications to find
|
||||||
and fix bugs
|
and fix bugs
|
||||||
[[122](/en/ch9#Vanlightly2024),
|
[[^122], [^123], [^124]]. For example,
|
||||||
[123](/en/ch9#Tang2018),
|
|
||||||
[124](/en/ch9#VanBenschoten2019)]. For example,
|
|
||||||
using TLA+, researchers were able to demonstrate the potential for data loss in viewstamped
|
using TLA+, researchers were able to demonstrate the potential for data loss in viewstamped
|
||||||
replication (VR) caused by ambiguity in the prose description of the algorithm
|
replication (VR) caused by ambiguity in the prose description of the algorithm [^125].
|
||||||
[^125].
|
|
||||||
|
|
||||||
By design, model checkers don’t run your actual code, but rather a simplified model that specifies
|
By design, model checkers don’t run your actual code, but rather a simplified model that specifies
|
||||||
only the core ideas of your protocol. This makes it more tractable to systematically explore the
|
only the core ideas of your protocol. This makes it more tractable to systematically explore the
|
||||||
state space, but it risks that your specification and your implementation go out of sync with each
|
state space, but it risks that your specification and your implementation go out of sync with each
|
||||||
other [^126].
|
other [^126].
|
||||||
It is possible to check whether the model and the real implementation have equivalent behavior, but
|
It is possible to check whether the model and the real implementation have equivalent behavior, but
|
||||||
this requires instrumentation in the real implementation
|
this requires instrumentation in the real implementation [^127].
|
||||||
[^127].
|
|
||||||
|
|
||||||
### Fault injection
|
### Fault injection
|
||||||
|
|
||||||
|
|
@ -1667,8 +1584,7 @@ processes—anything you can imagine going wrong with a computer.
|
||||||
|
|
||||||
Fault injection tests are typically run in an environment that closely resembles the production
|
Fault injection tests are typically run in an environment that closely resembles the production
|
||||||
environment where the system will run. Some even inject faults directly into their production
|
environment where the system will run. Some even inject faults directly into their production
|
||||||
environment. Netflix popularized this approach with their Chaos Monkey tool
|
environment. Netflix popularized this approach with their Chaos Monkey tool [^128]. Production fault
|
||||||
[^128]. Production fault
|
|
||||||
injection is often referred to as *chaos engineering*, which we discussed in
|
injection is often referred to as *chaos engineering*, which we discussed in
|
||||||
[“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability).
|
[“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability).
|
||||||
|
|
||||||
|
|
@ -1683,11 +1599,9 @@ during and after faults are injected to make sure things work as expected.
|
||||||
The myriad of tools required to trigger failures make fault injection tests cumbersome to write.
|
The myriad of tools required to trigger failures make fault injection tests cumbersome to write.
|
||||||
To simplify the process, it’s common to adopt a fault injection framework such as Jepsen.
|
To simplify the process, it’s common to adopt a fault injection framework such as Jepsen.
|
||||||
Such frameworks come with integrations for various operating systems and many
|
Such frameworks come with integrations for various operating systems and many
|
||||||
pre-built fault injectors
|
pre-built fault injectors [^129].
|
||||||
[^129].
|
|
||||||
Jepsen has been remarkably effective at finding critical bugs in many widely-used systems
|
Jepsen has been remarkably effective at finding critical bugs in many widely-used systems
|
||||||
[[130](/en/ch9#Kingsbury2024),
|
[[^130], [^131]].
|
||||||
[131](/en/ch9#Majumdar2017)].
|
|
||||||
|
|
||||||
### Deterministic simulation testing
|
### Deterministic simulation testing
|
||||||
|
|
||||||
|
|
@ -1772,7 +1686,7 @@ simulations, elements of nondeterminism may remain. For example, in some program
|
||||||
order in which you iterate over the elements of a hash table may be nondeterministic. Whether you
|
order in which you iterate over the elements of a hash table may be nondeterministic. Whether you
|
||||||
run into a resource limit (memory allocation failure, stack overflow) is also nondeterministic.
|
run into a resource limit (memory allocation failure, stack overflow) is also nondeterministic.
|
||||||
|
|
||||||
# Summary
|
## Summary
|
||||||
|
|
||||||
In this chapter we have discussed a wide range of problems that can occur in distributed systems,
|
In this chapter we have discussed a wide range of problems that can occur in distributed systems,
|
||||||
including:
|
including:
|
||||||
|
|
@ -1810,8 +1724,7 @@ other nodes and try to get a quorum to agree.
|
||||||
If you’re used to writing software in the idealized mathematical perfection of a single computer,
|
If you’re used to writing software in the idealized mathematical perfection of a single computer,
|
||||||
where the same operation always deterministically returns the same result, then moving to the messy
|
where the same operation always deterministically returns the same result, then moving to the messy
|
||||||
physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems
|
physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems
|
||||||
engineers will often regard a problem as trivial if it can be solved on a single computer
|
engineers will often regard a problem as trivial if it can be solved on a single computer [^4],
|
||||||
[^4],
|
|
||||||
and indeed a single computer can do a lot nowadays. If you can avoid opening Pandora’s box and
|
and indeed a single computer can do a lot nowadays. If you can avoid opening Pandora’s box and
|
||||||
simply keep things on a single machine, for example by using an embedded storage engine (see
|
simply keep things on a single machine, for example by using an embedded storage engine (see
|
||||||
[“Embedded storage engines”](/en/ch4#sidebar_embedded)), it is generally worth doing so.
|
[“Embedded storage engines”](/en/ch4#sidebar_embedded)), it is generally worth doing so.
|
||||||
|
|
@ -1834,11 +1747,10 @@ This chapter has been all about problems, and has given us a bleak outlook. In t
|
||||||
will move on to solutions, and discuss some algorithms that have been designed to cope with the
|
will move on to solutions, and discuss some algorithms that have been designed to cope with the
|
||||||
problems in distributed systems.
|
problems in distributed systems.
|
||||||
|
|
||||||
##### Footnotes
|
|
||||||
|
|
||||||
|
|
||||||
##### References
|
|
||||||
|
|
||||||
|
### Summary
|
||||||
|
|
||||||
[^1]: Mark Cavage. [There’s Just No Getting Around It: You’re Building a Distributed System](https://queue.acm.org/detail.cfm?id=2482856). *ACM Queue*, volume 11, issue 4, pages 80-89, April 2013. [doi:10.1145/2466486.2482856](https://doi.org/10.1145/2466486.2482856)
|
[^1]: Mark Cavage. [There’s Just No Getting Around It: You’re Building a Distributed System](https://queue.acm.org/detail.cfm?id=2482856). *ACM Queue*, volume 11, issue 4, pages 80-89, April 2013. [doi:10.1145/2466486.2482856](https://doi.org/10.1145/2466486.2482856)
|
||||||
[^2]: Jay Kreps. [Getting Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW)
|
[^2]: Jay Kreps. [Getting Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW)
|
||||||
|
|
|
||||||
|
|
@ -105,7 +105,7 @@ Later, in Part III of this book, we will discuss how you can take several (poten
|
||||||
- [9. The Trouble with Distributed Systems](/en/ch9)
|
- [9. The Trouble with Distributed Systems](/en/ch9)
|
||||||
- [10. Consistency and Consensus](/en/ch10)
|
- [10. Consistency and Consensus](/en/ch10)
|
||||||
|
|
||||||
## References
|
### References
|
||||||
|
|
||||||
1. Ulrich Drepper: “[What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf),” akka‐dia.org, November 21, 2007.
|
1. Ulrich Drepper: “[What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf),” akka‐dia.org, November 21, 2007.
|
||||||
1. Ben Stopford: “[Shared Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/),” benstopford.com, November 24, 2009.
|
1. Ben Stopford: “[Shared Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/),” benstopford.com, November 24, 2009.
|
||||||
|
|
|
||||||