2
0
Fork 0
mirror of https://github.com/Vonng/ddia.git synced 2026-06-21 00:47:05 +08:00

fix ref links

This commit is contained in:
Feng Ruohang 2025-08-09 16:22:19 +08:00
parent 4ec385f161
commit 515bb3a093
9 changed files with 150 additions and 223 deletions

View file

@ -218,7 +218,7 @@ There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_
That is the intuition behind linearizability; the formal definition [^1] describes it more precisely. It is
possible (though computationally expensive) to test whether a systems behavior is linearizable by
recording the timings of all requests and responses, and checking whether they can be arranged into
a valid sequential order [[^6], [^7]].
a valid sequential order [^6] [^7].
Just as there are various weak isolation levels for transactions besides serializability (see
[“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels)), there are also various weaker consistency models for
@ -255,7 +255,7 @@ Linearizability
A database may provide both serializability and linearizability, and this combination is known as
*strict serializability* or *strong one-copy serializability* (*strong-1SR*)
[[^11], [^12]].
[^11] [^12].
Single-node databases are typically linearizable. With distributed databases using optimistic
methods like serializable snapshot isolation (see [“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)) the situation is more
complicated: for example, CockroachDB provides serializability, and some recency guarantees on
@ -264,7 +264,7 @@ because this would require expensive coordination between transactions [^14].
It is also possible to combine a weaker isolation level with linearizability, or a weaker
consistency model with serializability; in fact, consistency model and isolation level can be chosen
largely independently from each other [[^15], [^16]].
largely independently from each other [^15] [^16].
## Relying on Linearizability
@ -460,7 +460,7 @@ performance: a reader must perform read repair (see [“Catching up on missed wr
before returning results to the application [^24].
Moreover, before writing, a writer must read the latest state of a quorum of nodes to fetch the
latest timestamp of any prior write, and ensure that the new write has a greater timestamp
[[^25], [^26]].
[^25] [^26].
However, Riak does not perform synchronous read repair due to the performance penalty.
Cassandra does wait for read repair to complete on quorum reads [^27],
but it loses linearizability due to its use of time-of-day clocks for timestamps.
@ -529,10 +529,10 @@ The trade-off is as follows:
Thus, applications that dont require linearizability can be more tolerant of network problems. This
insight is popularly known as the *CAP theorem*
[[^29], [^30], [^31], [^32]],
[^29] [^30] [^31] [^32],
named by Eric Brewer in 2000, although the trade-off had been known to designers of
distributed databases since the 1970s
[[^33], [^34], [^35]].
[^33] [^34] [^35].
CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of
starting a discussion about trade-offs in databases. At the time, many distributed databases
@ -563,7 +563,7 @@ formalization of *availability* [^30] does not
match the usual meaning of the term [^38]. Many highly available (fault-tolerant) systems actually do not meet CAPs
idiosyncratic definition of availability. Moreover, some system designers choose (with good reason)
to provide neither linearizability nor the form of availability that the CAP theorem assumes, so
those systems are neither CP nor AP [[^39], [^40]].
those systems are neither CP nor AP [^39] [^40].
All in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us
understand systems better, so CAP is best avoided.
@ -574,11 +574,11 @@ fault (network partitions, which according to data from Google are the cause of
incidents [^41]).
It doesnt say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP
has been historically influential, it has little practical value for designing systems
[[^4], [^38]].
[^4] [^38].
There have been efforts to generalize CAP. For example, the *PACELC principle* observes that system
designers might also choose to weaken consistency at times when the network is working fine in order
to reduce latency [[^39], [^40], [^42]].
to reduce latency [^39] [^40] [^42].
Thus, during a network partition (P), we need to choose between availability (A) and consistency
(C); else (E), when there is no partition, we may choose between low latency (L) and
consistency (C). However, this definition inherits several problems with CAP, such as the
@ -586,7 +586,7 @@ counterintuitive definitions of consistency and availability.
There are many more interesting impossibility results in distributed systems [^43],
and CAP has now been superseded by more precise results
[[^44], [^45]],
[^44] [^45],
so it is of mostly historical interest today.
### Linearizability and network delays
@ -945,18 +945,18 @@ node, but which get a lot harder if you want fault tolerance:
It turns out that all of these are instances of the same fundamental distributed systems problem:
*consensus*. Consensus is one of the most important and fundamental problems in distributed
computing; it is also infamously difficult to get right
[[^58], [^59]],
[^58] [^59],
and many systems have got it wrong in the past. Now that we have discussed replication
([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
linearizability (this chapter), we are finally ready to tackle the consensus problem.
The best-known consensus algorithms are Viewstamped Replication
[[^60], [^61]],
Paxos [[^58], [^62], [^63], [^64]],
Raft [[^23], [^65], [^66]],
and Zab [[^18], [^22], [^67]].
[^60] [^61],
Paxos [^58] [^62] [^63] [^64],
Raft [^23] [^65] [^66],
and Zab [^18] [^22] [^67].
There are quite a few similarities between these algorithms, but they are not the same
[[^68], [^69]].
[^68] [^69].
These algorithms work in a non-Byzantine system model: that is, network communication may be
arbitrarily delayed or dropped, and nodes may crash, restart, and become disconnected, but the
algorithms assume that nodes otherwise follow the protocol correctly and do not behave maliciously.
@ -964,7 +964,7 @@ algorithms assume that nodes otherwise follow the protocol correctly and do not
There are also consensus algorithms that can tolerate some Byzantine nodes, i.e., nodes that dont
correctly follow the protocol (for example, by sending contradictory messages to other nodes). A
common assumption is that fewer than one-third of the nodes are Byzantine-faulty
[[^26], [^70]].
[^26] [^70].
Such *Byzantine fault tolerant* (BFT) consensus algorithms are used in blockchains [^71].
However, as explained in [“Byzantine Faults”](/en/ch9#sec_distributed_byzantine), BFT algorithms are beyond the scope of this
book.
@ -1095,7 +1095,7 @@ consensus. Any CAS invocations whose new value was not decided return an error.
different expected values use separate runs of the consensus protocol.
This shows that CAS and consensus are equivalent to each other
[[^28], [^73]].
[^28] [^73].
Again, both are straightforward on a single node, but challenging to make fault-tolerant. As an
example of CAS in a distributed setting, we saw conditional write operations for object stores in
[“Databases backed by object storage”](/en/ch6#sec_replication_object_storage), which allow a write to happen only if an object with the same
@ -1105,7 +1105,7 @@ However, a linearizable read-write register is not sufficient to solve consensus
tells us that consensus cannot be solved by a deterministic algorithm in the asynchronous crash-stop
model [^72], but we saw in
[“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum
reads/writes in this model [[^24], [^25], [^26]].
reads/writes in this model [^24] [^25] [^26].
From this it follows that a linearizable register cannot solve consensus.
### Shared logs as consensus
@ -1142,12 +1142,8 @@ Validity
value to be added to the log.
> [!NOTE]
> A shared log is formally known as a *total order broadcast*, *atomic broadcast*, or *total order
> multicast* protocol [[^26],
> [^76],
> [^77]].
> Its the same thing described in different words: requesting a value to be added to the log is then
> called “broadcasting” it, and reading a log entry is called “delivering” it.
> A shared log is formally known as a *total order broadcast*, *atomic broadcast*, or *total order multicast* protocol [^26] [^76] [^77]
> Its the same thing described in different words: requesting a value to be added to the log is then called “broadcasting” it, and reading a log entry is called “delivering” it.
If you have an implementation of a shared log, it is easy to solve the consensus problem: every node
that wants to propose a value requests for it to be added to the log, and whichever value is read
@ -1243,7 +1239,7 @@ any of the communication among the nodes times out). The other three properties
same as for consensus.
If you have a solution for consensus, there are multiple ways you could solve atomic commitment
[[^78], [^79]].
[^78] [^79].
One works like this: when you want to commit the transaction, every node sends its vote to commit or
abort to every other node. Nodes that receive a vote to commit from itself and every other node
propose “commit” using the consensus algorithm; nodes that receive a vote to abort, or which
@ -1290,7 +1286,7 @@ Similarly, a shared log can be used to implement serializable transactions: as d
[“Actual Serial Execution”](/en/ch8#sec_transactions_serial), if every log entry represents a deterministic transaction to be
executed as a stored procedure, and if every node executes those transactions in the same order,
then the transactions will be serializable
[[^81], [^82]].
[^81] [^82].
> [!NOTE]
> Sharded databases with a strong consistency model often maintain a separate log per shard, which
@ -1353,7 +1349,7 @@ a vote on a proposal succeeds, at least one of the nodes that voted for it must
participated in the most recent successful leader election [^85]. Thus, if the vote on a proposal
passes without revealing any higher-numbered epoch, the current leader can conclude that no leader
with a higher epoch number has been elected, and therefore it can safely append the proposed entry
to the log [[^26], [^86]].
to the log [^26] [^86].
These two rounds of voting look superficially similar to two-phase commit, but they are very
different protocols. In consensus algorithms, any node can start an election and it requires only a
@ -1364,7 +1360,7 @@ vote from *every* participant before it can commit.
This basic structure is common to all of Raft, Multi-Paxos, Zab, and Viewstamped Replication: a vote
by a quorum of nodes elects a leader, and then another quorum vote is required for every entry that
the leader wants to append to the log [[^68], [^69]]. Every new log entry is synchronously replicated
the leader wants to append to the log [^68] [^69]. Every new log entry is synchronously replicated
to a quorum of nodes before it is confirmed to the client that requested the write. This ensures
that the log entry wont be lost if the current leader fails.
@ -1398,7 +1394,7 @@ easily cause a lot of data loss or corruption.
Another subtlety is in how the algorithms deal with log entries that had been proposed by the old
leader before it failed, but for which the vote on appending to the log had not yet completed. You
can find discussions of these details in the references for this chapter
[[^23], [^69], [^86]].
[^23] [^69] [^86].
For databases that use a consensus algorithm for replication, not only do writes need to be turned
into log entries and replicated to a quorum. If you want to guarantee linearizable reads, they also
@ -1441,7 +1437,7 @@ work.
Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft
has been shown to have unpleasant edge cases
[[^88], [^89]]:
[^88] [^89]:
if the entire network is working correctly except for one particular network link that is
consistently unreliable, Raft can get into situations where leadership continually bounces between
two nodes, or the current leader is continually forced to resign, so the system effectively never
@ -1468,7 +1464,7 @@ entirely in memory (although they still write to disk for durability), which is
multiple nodes using a fault-tolerant consensus algorithm.
Coordination services are modeled after Googles Chubby lock service
[[^17], [^58]].
[^17] [^58].
They combine a consensus algorithm with several other features that turn out to be particularly
useful when building distributed systems:
@ -1545,7 +1541,7 @@ information like “the node running on IP address 10.1.1.23 is the leader for s
assignments usually change on a timescale of minutes or hours. Coordination services are not
intended for storing data that may change thousands of times per second. For that, it is better to
use a conventional database; alternatively, tools like Apache BookKeeper
[[^90], [^91]]
[^90] [^91]
can be used to replicate fast-changing internal state of a service.
### Service discovery

View file

@ -186,19 +186,13 @@ is a long queue of requests waiting to be handled, response times may increase s
time out and resend their request. This causes the rate of requests to increase even further, making
the problem worse—a *retry storm*. Even when the load is reduced again, such a system may remain in
an overloaded state until it is rebooted or otherwise reset. This phenomenon is called a *metastable
failure*, and it can cause serious outages in production systems
[[^7], [^8]].
failure*, and it can cause serious outages in production systems [^7] [^8].
To avoid retries overloading a service, you can increase and randomize the time between successive
retries on the client side (*exponential backoff*
[[^9], [^10]]),
and temporarily stop sending requests to a service that has returned errors or timed out recently
(using a *circuit breaker* [[^11], [^12]]
or *token bucket* algorithm [^13]).
retries on the client side (*exponential backoff* [^9] [^10]), and temporarily stop sending requests to a service that has returned errors or timed out recently
(using a *circuit breaker* [^11] [^12] or *token bucket* algorithm [^13]).
The server can also detect when it is approaching overload and start proactively rejecting requests
(*load shedding* [^14]), and send back
responses asking clients to slow down (*backpressure*
[[^1], [^15]]).
(*load shedding* [^14]), and send back responses asking clients to slow down (*backpressure* [^1] [^15]).
The choice of queueing and load-balancing algorithms can also make a difference [^16].
In terms of performance metrics, the response time is usually what users care about the most,
@ -342,7 +336,7 @@ For example, an SLO may set a target for a service to have a median response tim
result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not
met (for example, customers may be entitled to a refund). That is the basic idea, at least; in
practice, defining good availability metrics for SLOs and SLAs is not straightforward
[[^28], [^29]].
[^28] [^29].
# Computing percentiles
@ -355,7 +349,7 @@ The simplest implementation is to keep a list of response times for all requests
window and to sort that list every minute. If that is too inefficient for you, there are algorithms
that can calculate a good approximation of percentiles at minimal CPU and memory cost.
Open source percentile estimation libraries include HdrHistogram,
t-digest [[^30], [^31]],
t-digest [^30] [^31],
OpenHistogram [^32], and DDSketch [^33].
Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from
@ -375,7 +369,7 @@ software, typical expectations include:
If all those things together mean “working correctly,” then we can understand *reliability* as
meaning, roughly, “continuing to work correctly, even when things go wrong.” To be more precise
about things going wrong, we will distinguish between *faults* and *failures*
[[^35], [^36], [^37]]:
[^35] [^36] [^37]:
Fault
: A fault is when a particular *part* of a system stops working correctly: for example, if a
@ -432,21 +426,14 @@ cured, as described in the following sections.
When we think of causes of system failure, hardware faults quickly come to mind:
* Approximately 25% of magnetic hard drives fail per year
[[^40],
[^41]];
* Approximately 25% of magnetic hard drives fail per year [^40] [^41];
in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
Recent data suggests that disks are getting more reliable, but failure rates remain significant
[^42].
* Approximately 0.51% of solid state drives (SSDs) fail per year
[^43].
Small numbers of bit errors are corrected automatically
[^44],
Recent data suggests that disks are getting more reliable, but failure rates remain significant [^42].
* Approximately 0.51% of solid state drives (SSDs) fail per year [^43].
Small numbers of bit errors are corrected automatically [^44],
but uncorrectable errors occur approximately once per year per drive, even in drives that are
fairly new (i.e., that have experienced little wear); this error rate is higher than that of
magnetic hard drives
[[^45],
[^46]].
magnetic hard drives [^45], [^46].
* Other hardware components such as power supplies, RAID controllers, and memory modules also fail,
although less frequently than hard drives [^47] [^48].
* Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result,
@ -676,7 +663,7 @@ If you can double the resources in order to handle twice the load, while keeping
same, we say that you have *linear scalability*, and this is considered a good thing. Occasionally
it is possible to handle twice the load with less than double the resources, due to economies of
scale or a better distribution of peak load
[[^79], [^80]].
[^79] [^80].
Much more likely is that the cost grows faster than linearly, and there may be many reasons for the
inefficiency. For example, if you have a lot of data, then processing a single write request may
involve more work than if you have a small amount of data, even if the size of the request is the
@ -762,7 +749,7 @@ bugs that need fixing.
It is widely recognized that the majority of the cost of software is not in its initial development,
but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures,
adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding
new features [[^85], [^86]].
new features [^85] [^86].
However, maintenance is also difficult. If a system has been successfully running for a long time,
it may well use outdated technologies that not many engineers understand today (such as mainframes
@ -925,7 +912,7 @@ There are no easy answers on how to achieve these things, but one thing that can
applications using well-understood building blocks that provide useful abstractions. The rest of
this book will cover a selection of building blocks that have proved to be valuable in practice.
### Summary
### References
[^1]: Mike Cvet. [How We Learned to Stop Worrying and Love Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016.
[^2]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK)

View file

@ -219,10 +219,7 @@ structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
###### Figure 3-2. One-to-many relationships forming a tree structure.
> [!NOTE]
> This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé
> typically has a small number of positions
> [[^9],
> [^10]].
> This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé typically has a small number of positions [^9] [^10].
> In situations where there may be a genuinely large number of related items—say, comments on a
> celebritys social media post, of which there could be many thousands—embedding them all in the same
> document may be too unwieldy, so the relational approach in [Figure 3-1](/en/ch3#fig_obama_relational) is preferable.
@ -540,7 +537,7 @@ such applications well, because the items (or their IDs) can simply be stored in
determine their order. In relational databases there isnt a standard way of representing such
reorderable lists, and various tricks are used: sorting by an integer column (requiring renumbering
when you insert into the middle), a linked list of IDs, or fractional indexing
[[^14], [^15], [^16]].
[^14] [^15] [^16].
### Schema flexibility in the document model
@ -593,7 +590,7 @@ since every row needs to be rewritten, and other schema operations (such as chan
of a column) also typically require the entire table to be copied.
Various tools exist to allow this type of schema changes to be performed in the background without downtime
[[^21], [^22], [^23], [^24]],
[^21] [^22] [^23] [^24],
but performing such migrations on large databases remains operationally challenging. Complicated
migrations can be avoided by only adding the `first_name` column with a default value of `NULL`
(which is fast), and filling it in at read time, like you would with a document database.
@ -1044,7 +1041,7 @@ Oracle has a different SQL extension for recursive queries, which it calls *hier
[^41].
However, the situation may be improving: at the time of writing, there are plans to add a graph
query language called GQL to the SQL standard [[^42], [^43]],
query language called GQL to the SQL standard [^42] [^43],
which will provide a syntax inspired by Cypher, GSQL [^44], and PGQL [^45].
## Triple-Stores and SPARQL
@ -1127,7 +1124,7 @@ Some of the research and development effort on triple stores was motivated by th
early-2000s effort to facilitate internet-wide data exchange by publishing data not only as
human-readable web pages, but also in a standardized, machine-readable format. Although the Semantic
Web as originally envisioned did not succeed
[[^49], [^50]],
[^49] [^50],
the legacy of the Semantic Web project lives on in a couple of specific technologies: *linked data*
standards such as JSON-LD [^51],
*ontologies* used in biomedical science [^52],
@ -1238,7 +1235,7 @@ various other triple stores [^36].
## Datalog: Recursive Relational Queries
Datalog is a much older language than SPARQL or Cypher: it arose from academic research in the 1980s
[[^57], [^58], [^59]].
[^57] [^58] [^59].
It is less well known among software engineers and not widely supported in mainstream databases, but
it ought to be better-known since it is a very expressive language that is particularly powerful for
complex queries. Several niche databases, including Datomic, LogicBlox, CozoDB, and LinkedIns
@ -1498,7 +1495,7 @@ the status of each booking, another that computes charts for the conference orga
and a third that generates files for the printer that produces the attendees badges.
The idea of using events as the source of truth, and expressing every state change as an event, is
known as *event sourcing* [[^62], [^63]].
known as *event sourcing* [^62] [^63].
The principle of maintaining separate read-optimized representations and deriving them from the
write-optimized representation is called *command query responsibility segregation (CQRS)*
[^64].
@ -1724,11 +1721,7 @@ come into play when *implementing* the data models described in this chapter.
### Summary
### References
[^1]: Jamie Brandon. [Unexplanations: query optimization works because sql is declarative](https://www.scattered-thoughts.net/writing/unexplanations-sql-declarative/). *scattered-thoughts.net*, February 2024. Archived at [perma.cc/P6W2-WMFZ](https://perma.cc/P6W2-WMFZ)
[^2]: Joseph M. Hellerstein. [The Declarative Imperative: Experiences and Conjectures in Distributed Logic](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf). Tech report UCB/EECS-2010-90, Electrical Engineering and Computer Sciences, University of California at Berkeley, June 2010. Archived at [perma.cc/K56R-VVQM](https://perma.cc/K56R-VVQM)

View file

@ -320,8 +320,7 @@ In the context of an LSM storage engines, false positives are no problem:
An important detail is how the LSM storage chooses when to perform compaction, and which SSTables to
include in a compaction. Many LSM-based storage systems allow you to configure which compaction
strategy to use, and some of the common choices are
[[^16], [^17]]:
strategy to use, and some of the common choices are [^16] [^17]:
Size-tiered compaction
: Newer and smaller SSTables are successively merged into older and larger SSTables. The SSTables
@ -452,7 +451,7 @@ In order to make the database resilient to crashes, it is common for B-tree impl
include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file
to which every B-tree modification must be written before it can be applied to the pages of the tree
itself. When the database comes back up after a crash, this log is used to restore the B-tree back
to a consistent state [[^2], [^24]].
to a consistent state [^2] [^24].
In filesystems, the equivalent mechanism is known as *journaling*.
To improve performance, B-tree implementations typically dont immediately write every modified page
@ -484,8 +483,7 @@ mention just a few:
## Comparing B-Trees and LSM-Trees
As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads
[[^27], [^28]].
As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads [^27] [^28].
However, benchmarks are often sensitive to details of the workload. You need to test systems with
your particular workload in order to make a valid comparison. Moreover, its not a strict either/or
choice between LSM and B-trees: storage engines sometimes blend characteristics of both approaches,
@ -512,7 +510,7 @@ memtable fills up. This happens if data cant be written out to disk fast enou
the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB,
perform *backpressure* in this situation: they suspend all reads and writes until the memtable has
been written out to disk
[[^30], [^31]].
[^30] [^31].
Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read
requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but
@ -555,7 +553,7 @@ A sequential write workload writes larger chunks of data at a time, so it is lik
can be erased without having to perform any GC. On the other hand, with a random write workload, it
is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
to perform more work before a block can be erased
[[^34], [^35], [^36]].
[^34] [^35] [^36].
The write bandwidth consumed by GC is then not available for the application. Moreover, the
additional writes performed by GC contribute to wear on the flash memory; therefore, random writes
@ -573,7 +571,7 @@ containing keys and references to values [^37].)
A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once
to the tree page itself. In addition, they sometimes need to write out an entire page, even if only
a few bytes in that page changed, to ensure the B-tree can be correctly recovered after a crash or
power failure [[^38], [^39]].
power failure [^38] [^39].
If you take the total number of bytes written to disk in some workload, and divide by the number of
bytes you would have to write if you simply wrote an append-only log with no index, you get the
@ -610,7 +608,7 @@ the data files anyway, and SSTables dont have pages with unused space. Moreov
key-value pairs can better be compressed in SSTables, and thus often produce smaller files on disk
than B-trees. Keys and values that have been overwritten continue to consume space until they are
removed by a compaction, but this overhead is quite low when using leveled compaction
[[^40], [^41]].
[^40] [^41].
Size-tiered compaction (see [“Compaction strategies”](/en/ch4#sec_storage_lsm_compaction)) uses more disk space, especially
temporarily during compaction.
@ -710,7 +708,7 @@ easily be backed up, inspected, and analyzed by external utilities.
Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model,
and the vendors claim that they can offer big performance improvements by removing all the overheads
associated with managing on-disk data structures
[[^46], [^47]].
[^46] [^47].
RAMCloud is an open source, in-memory key-value store with durability (using a log-structured
approach for the data in memory as well as the data on disk) [^48].
@ -744,7 +742,7 @@ transaction processing and data warehousing in the same product. However, these
and analytical processing (HTAP) databases (introduced in [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are increasingly
becoming two separate storage and query engines, which happen to be accessible through a common SQL
interface
[[^50], [^51], [^52], [^53]].
[^50] [^51] [^52] [^53].
## Cloud Data Warehouses
@ -881,11 +879,11 @@ to single-node embedded databases such as DuckDB [^62],
and product analytics systems such as Pinot [^63]
and Druid [^64].
It is used in storage formats such as Parquet, ORC
[[^65], [^66]],
[^65] [^66],
Lance [^67],
and Nimble [^68],
and in-memory analytics formats like Apache Arrow
[[^65], [^69]]
[^65] [^69]
and Pandas/NumPy [^70].
Some time-series databases, such as InfluxDB IOx [^71] and TimescaleDB [^72],
are also based on column-oriented storage.
@ -999,7 +997,7 @@ Queries need to examine both the column data on disk and the recent writes in me
the two. The query execution engine hides this distinction from the user. From an analysts point
of view, data that has been modified with inserts, updates, or deletes is immediately reflected in
subsequent queries. Snowflake, Vertica, Apache Pinot, Apache Druid, and many others do this
[[^61], [^63], [^64], [^76]].
[^61] [^63] [^64] [^76].
## Query Execution: Compilation and Vectorization
@ -1034,7 +1032,7 @@ Vectorized processing
: The query is interpreted, not compiled, but it is made fast by processing many values from a
column in a batch, instead of iterating over rows one by one. A fixed set of predefined operators
are built into the database; we can pass arguments to them and get back a batch of results
[[^50], [^75]].
[^50] [^75].
For example, we could pass the `product_sk` column and the ID of “bananas” to an equality operator,
and get back a bitmap (one bit per value in the input column, which is 1 if its a banana); we could
@ -1056,9 +1054,7 @@ performance by taking advantages of the characteristics of modern CPUs:
* doing most of the work in tight inner loops (that is, with a small number of instructions and no
function calls) to keep the CPU instruction processing pipeline busy and avoid branch
mispredictions,
* making use of parallelism such as multiple threads and single-instruction-multi-data (SIMD)
instructions [[^79],
[^80]], and
* making use of parallelism such as multiple threads and single-instruction-multi-data (SIMD) instructions [^79] [^80], and
* operating directly on compressed data without decoding it into a separate in-memory
representation, which saves memory allocation and copying costs.
@ -1196,7 +1192,7 @@ It stores the mapping from term to postings list in SSTable-like sorted files, w
the background using the same log-structured approach we saw earlier in this chapter [^91].
PostgreSQLs GIN index type also uses postings lists to support full-text search and indexing inside
JSON documents
[[^92], [^93]].
[^92] [^93].
Instead of breaking text into words, an alternative is to find all the substrings of length *n*,
which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
@ -1295,7 +1291,7 @@ variations of each [^101],
and PostgreSQLs pgvector supports both as well [^102].
The full details of the IVF and HNSW algorithms are beyond the scope of this book, but their papers
are an excellent resource
[[^103], [^104]].
[^103] [^104].
## Summary
@ -1347,7 +1343,7 @@ documentation for the database of your choice.
### Summary
### References

View file

@ -447,7 +447,7 @@ application code is expecting, and their types.
If the readers and writers schema are the same, decoding is easy. If they are different, Avro
resolves the differences by looking at the writers schema and the readers schema side by side and
translating the data from the writers schema into the readers schema. The Avro specification
[[^16], [^17]]
[^16] [^17]
defines exactly how this resolution works, and it is illustrated in
[Figure 5-6](/en/ch5#fig_encoding_avro_resolution).
@ -571,7 +571,7 @@ languages.
The ideas on which these encodings are based are by no means new. For example, they have a lot in
common with ASN.1, a schema definition language that was first standardized in 1984
[[^23], [^24]].
[^23] [^24].
It was used to define various network protocols, and its binary encoding (DER) is still used to encode
SSL certificates (X.509), for example [^25].
ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers [^26].
@ -737,7 +737,7 @@ different contexts. For example:
systems, or OAuth for shared access to user data.
The most popular service design philosophy is REST, which builds upon the principles of HTTP
[[^30], [^31]].
[^30] [^31].
It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for
cache control, authentication, and content type negotiation. An API designed according to the
principles of REST is called *RESTful*.
@ -824,14 +824,14 @@ Architecture (CORBA) is excessively complex, and does not provide backward or fo
compatibility [^33].
SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are
also plagued by complexity and compatibility problems
[[^34], [^35], [^36]].
[^34] [^35] [^36].
All of these are based on the idea of a *remote procedure call* (RPC), which has been around since
the 1970s [^37].
The RPC model tries to make a request to a remote network service look the same as calling a function or
method in your programming language, within the same process (this abstraction is called *location
transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed
[[^38], [^39]].
[^38] [^39].
A network request is very different from a local function call:
* A local function call is predictable and either succeeds or fails, depending only on parameters
@ -1016,7 +1016,7 @@ task fails, the framework will re-execute the task, but will skip any RPC calls
that the task made successfully before failing. Instead, the framework will pretend to make the
call, but will instead return the results from the previous call. This is possible because durable
execution frameworks log all RPCs and state changes to durable storage like a write-ahead log
[[^45], [^46]].
[^45] [^46].
[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
using Temporal.
@ -1109,7 +1109,7 @@ Message brokers typically dont enforce any particular data model—a message
bytes with some metadata, so you can use any encoding format. A common approach is to use Protocol
Buffers, Avro, or JSON, and to deploy a schema registry alongside the message broker to store all
the valid schema versions and check their compatibility
[[^19], [^21]].
[^19] [^21].
AsyncAPI, a messaging-based equivalent of OpenAPI, can also be used to specify the schema of
messages.
@ -1197,7 +1197,7 @@ quite achievable. May your applications evolution be rapid and your deploymen
### Summary
### References

View file

@ -221,9 +221,7 @@ for live queries. Storing database data in object storage has many benefits:
* Object stores also provide multi-zone, dual-region, or multi-region replication with very high
durability guarantees. This also allows databases to bypass inter-zone network fees.
* Databases can use an object stores *conditional write* feature—essentially, a *compare-and-set*
(CAS) operation—to implement transactions and leadership election
[[10](/ch06.html#Morling2024_ch6),
[11](/ch06.html#Chandramohan2024)]).
(CAS) operation—to implement transactions and leadership election [^10] [^11]
* Storing data from multiple databases in the same object store can simplify data integration,
particularly when open formats such as Apache Parquet and Apache Iceberg are used.
@ -420,9 +418,7 @@ heap into a consistent state, we can use the exact same log to build a replica o
besides writing the log to disk, the leader also sends it across the network to its followers. When
the follower processes this log, it builds a copy of the exact same files as found on the leader.
This method of replication is used in PostgreSQL and Oracle, among others
[[17](/ch06.html#Suzuki2017_ch6),
[18](/ch06.html#Kapila2012)].
This method of replication is used in PostgreSQL and Oracle, among others [^17] [^18]
The main disadvantage is that the log describes the data on a very low level: a WAL contains details
of which bytes were changed in which disk blocks. This makes replication tightly coupled to the
storage engine. If the database changes its storage format from one version to another, it is
@ -915,10 +911,7 @@ Moreover, many modern web apps offer *real-time collaboration* features, such as
Sheets for text documents and spreadsheets, Figma for graphics, and Linear for project management.
What makes these apps so responsive is that user input is immediately reflected in the user
interface, without waiting for a network round-trip to the server, and edits by one user are shown
to their collaborators with low latency
[[32](/ch06.html#DayRichter2010),
[33](/ch06.html#Wallace2019),
[34](/ch06.html#Artman2023)].
to their collaborators with low latency [^32] [^33] [^34]
This again results in a multi-leader architecture: each web browser tab that has opened the shared
file is a replica, and any updates that you make to the file are asynchronously replicated to the
@ -935,19 +928,14 @@ multiple users have changed the file concurrently, conflict resolution logic may
those changes.
A software library that supports this process is called a *sync engine*. Although the idea has
existed for a long time, the term has recently gained attention
[[35](/ch06.html#Saafan2024),
[36](/ch06.html#Hagoel2024),
[37](/ch06.html#Jayakar2024)].
existed for a long time, the term has recently gained attention [^35] [^36] [^37].
An application that allows a user to continue editing a file while offline (which may be implemented
using a sync engine) is called *offline-first*
[^38].
using a sync engine) is called *offline-first* [^38].
The term *local-first software* refers to collaborative apps that are not only offline-first, but
are also designed to continue working even if the developer who made the software shuts down all of
their online services [^39].
This can be achieved by using a sync engine with an open standard sync protocol for which multiple
service providers are available
[^40].
service providers are available [^40].
For example, Git is a local-first collaboration system (albeit one that doesnt support real-time
collaboration) since you can sync via GitHub, GitLab, or any other repository hosting service.
@ -1243,20 +1231,16 @@ writes in the same order.
Some data storage systems take a different approach, abandoning the concept of a leader and
allowing any replica to directly accept writes from clients. Some of the earliest replicated data
systems were leaderless [[1](/ch06.html#Lindsay1979_ch6),
[50](/ch06.html#Gifford1979)], but the
idea was mostly forgotten during the era of dominance of relational databases. It once again became
systems were leaderless [^1] [^50], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became
a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in
2007 [^45].
Riak, Cassandra, and ScyllaDB are open source datastores with leaderless replication models inspired
by Dynamo, so this kind of database is also known as *Dynamo-style*.
> [!NOTE]
> The original *Dynamo* system was only described in a paper
> [^45], but never released outside of
> Amazon. The similarly-named *DynamoDB* is a more recent cloud database from AWS, but it has a
> completely different architecture: it uses single-leader replication based on the Multi-Paxos
> consensus algorithm [^5].
> The original *Dynamo* system was only described in a paper [^45], but never released outside of Amazon.
> The similarly-named *DynamoDB* is a more recent cloud database from AWS, but it has a completely different architecture:
> it uses single-leader replication based on the Multi-Paxos consensus algorithm [^5].
In some leaderless implementations, the client directly sends its writes to several replicas, while
in others, a coordinator node does this on behalf of the client. However, unlike a leader database,
@ -1707,15 +1691,9 @@ replica increments its own version number when processing a write, and also keep
version numbers it has seen from each of the other replicas. This information indicates which values
to overwrite and which values to keep as siblings.
The collection of version numbers from all the replicas is called a *version vector*
[^58].
A few variants of this idea are in use, but the most interesting is probably the *dotted version
vector*
[[59](/ch06.html#Preguica2010),
[60](/ch06.html#Manepalli2022)],
which is used in Riak 2.0
[[61](/ch06.html#Cribbs2014),
[62](/ch06.html#Brown2015)].
The collection of version numbers from all the replicas is called a *version vector* [^58].
A few variants of this idea are in use, but the most interesting is probably the *dotted version vector* [^59] [^60],
which is used in Riak 2.0 [^61] [^62].
We wont go into the details, but the way it works is quite similar to what we saw in our cart example.
Like the version numbers in [Figure 6-15](/ch06.html#fig_replication_causality_single), version vectors are sent from the
@ -1731,10 +1709,7 @@ siblings are merged correctly.
# Version vectors and vector clocks
A *version vector* is sometimes also called a *vector clock*, even though they are not quite the
same. The difference is subtle—please see the references for details
[[60](/ch06.html#Manepalli2022),
[63](/ch06.html#Baquero2011),
[64](/ch06.html#Schwarz1994)]. In brief, when
same. The difference is subtle—please see the references for details [^60] [^63] [^64]. In brief, when
comparing the state of replicas, version vectors are the right data structure to use.
## Summary
@ -1760,8 +1735,7 @@ Despite being a simple goal—keeping a copy of the same data on several machine
to be a remarkably tricky problem. It requires carefully thinking about concurrency and about all
the things that can go wrong, and dealing with the consequences of those faults. At a minimum, we
need to deal with unavailable nodes and network interruptions (and thats not even considering the
more insidious kinds of fault, such as silent data corruption due to software bugs or hardware
errors).
more insidious kinds of fault, such as silent data corruption due to software bugs or hardware errors).
We discussed three main approaches to replication:
@ -1817,7 +1791,7 @@ machine to store only a subset of the data.
### Summary
### References
[^1]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)

View file

@ -51,7 +51,7 @@ Some databases treat partitions and shards as two distinct concepts. For example
partitioning is a way of splitting a large table into several files that are stored on the same
machine (which has several advantages, such as making it very fast to delete an entire partition),
whereas sharding splits a dataset across multiple machines
[[^1], [^2]].
[^1] [^2].
In many other systems, partitioning is just another word for sharding.
While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to
@ -408,7 +408,7 @@ to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_
per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between
those ranges. This means some ranges are bigger than others, but by having multiple ranges per node
those imbalances tend to even out
[[^15], [^18]].
[^15] [^18].
![ddia 0706](/fig/ddia_0706.png)
@ -459,7 +459,7 @@ This event can result in a large volume of reads and writes to the same key (whe
is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on).
In such situations, a more flexible sharding policy is required
[[^25], [^26]].
[^25] [^26].
A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put
an individual hot key in a shard by its own, and perhaps even assigning it a dedicated machine [^27].
@ -502,7 +502,7 @@ Fully automated rebalancing can be convenient, because there is less operational
normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud
databases such as DynamoDB are promoted as being able to automatically add and remove shards to
adapt to big increases or decreases of load within a matter of minutes
[[^17], [^29]].
[^17] [^29].
However, automatic shard management can also be unpredictable. Rebalancing is an expensive
operation, because it requires rerouting requests and moving a large amount of data from one node to
@ -779,7 +779,7 @@ that question in the following chapters.
### Summary
### References
[^1]: Claire Giordano. [Understanding partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023. Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959)

View file

@ -67,7 +67,7 @@ the challenge of achieving atomicity in a distributed transaction.
Almost all relational databases today, and some nonrelational databases, support transactions. Most
of them follow the style that was introduced in 1975 by IBM System R, the first SQL database
[[^2], [^3], [^4]].
[^2] [^3] [^4].
Although some implementation details have changed, the general idea has remained virtually the same
for 50 years: the transaction support in MySQL, PostgreSQL, Oracle, SQL Server, etc., is uncannily
similar to that of System R.
@ -214,7 +214,7 @@ However, serializability has a performance cost. In practice, many databases use
that are weaker than serializability: that is, they allow concurrent transactions to interfere with
each other in limited ways. Some popular databases, such as Oracle, dont even implement it (Oracle
has an isolation level called “serializable,” but it actually implements *snapshot isolation*, which
is a weaker guarantee than serializability [[^10], [^14]]).
is a weaker guarantee than serializability [^10] [^14]).
This means that some kinds of race conditions can still occur. We will explore snapshot isolation
and other forms of isolation in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels).
@ -254,37 +254,22 @@ The truth is, nothing is perfect:
* In an asynchronously replicated system, recent writes may be lost when the leader becomes
unavailable (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)).
* When the power is suddenly cut, SSDs in particular have been shown to sometimes violate the
guarantees they are supposed to provide: even `fsync` isnt guaranteed to work correctly
[^15].
Disk firmware can have bugs, just like any other kind of software
[[^16],
[^17]],
e.g. causing drives to fail after exactly 32,768 hours of operation
[^18].
And `fsync` is hard to use; even PostgreSQL used it incorrectly for over 20 years
[[^19],
[^20],
[^21]].
guarantees they are supposed to provide: even `fsync` isnt guaranteed to work correctly [^15].
Disk firmware can have bugs, just like any other kind of software [^16] [^17],
e.g. causing drives to fail after exactly 32,768 hours of operation [^18].
And `fsync` is hard to use; even PostgreSQL used it incorrectly for over 20 years [^19] [^20] [^21].
* Subtle interactions between the storage engine and the filesystem implementation can lead to bugs
that are hard to track down, and may cause files on disk to be corrupted after a crash
[[^22],
[^23]].
Filesystem errors on one replica can sometimes spread to other replicas as well
[^24].
* Data on disk can gradually become corrupted without this being detected
[^25].
that are hard to track down, and may cause files on disk to be corrupted after a crash [^22] [^23].
Filesystem errors on one replica can sometimes spread to other replicas as well [^24].
* Data on disk can gradually become corrupted without this being detected [^25].
If data has been corrupted for some time, replicas and recent backups may also be corrupted. In
this case, you will need to try to restore the data from a historical backup.
* One study of SSDs found that between 30% and 80% of drives develop at least one bad block during
the first four years of operation, and only some of these can be corrected by the firmware
[^26].
Magnetic hard drives have a lower rate of bad sectors, but a higher rate of complete failure than
SSDs.
the first four years of operation, and only some of these can be corrected by the firmware [^26].
Magnetic hard drives have a lower rate of bad sectors, but a higher rate of complete failure than SSDs.
* When a worn-out SSD (that has gone through many write/erase cycles) is disconnected from power,
it can start losing data within a timescale of weeks to months, depending on the temperature
[^27].
This is less of a problem for drives with lower wear levels
[^28].
it can start losing data within a timescale of weeks to months, depending on the temperature [^27].
This is less of a problem for drives with lower wear levels [^28].
In practice, there is no one technique that can provide absolute guarantees. There are only various
risk-reduction techniques, including writing to disk, replicating to remote machines, and
@ -489,7 +474,7 @@ nevertheless used in practice [^29].
Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have
caused substantial loss of money
[[^30], [^31], [^32]],
[^30] [^31] [^32],
led to investigation by financial auditors [^33],
and caused customer data to be corrupted [^34].
A popular comment on revelations of such problems is “Use an ACID database if youre handling
@ -515,7 +500,7 @@ decide what level is appropriate to your application. Once weve done that, we
serializability in detail (see [“Serializability”](/en/ch8#sec_transactions_serializability)). Our discussion of isolation
levels will be informal, using examples. If you want rigorous definitions and analyses of their
properties, you can find them in the academic literature
[[^36], [^37], [^38], [^39]].
[^36] [^37] [^38] [^39].
## Read Committed
@ -690,7 +675,7 @@ database, frozen at a particular point in time, it is much easier to understand.
Snapshot isolation is a popular feature: variants of it are supported by PostgreSQL, MySQL with the
InnoDB storage engine, Oracle, SQL Server, and others, although the detailed behavior varies from
one system to the next [[^29], [^40], [^41]].
one system to the next [^29] [^40] [^41].
Some databases, such as Oracle, TiDB, and Aurora DSQL, even choose snapshot isolation as their
highest isolation level.
@ -713,7 +698,7 @@ maintains several versions of a row side by side, this technique is known as *mu
concurrency control* (MVCC).
[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL
[[^40], [^42], [^43]] (other implementations are similar).
[^40] [^42] [^43] (other implementations are similar).
When a transaction is started, it is given a unique, always-increasing transaction ID (`txid`).
Whenever a transaction writes anything to the database, the data it writes is tagged with the
transaction ID of the writer. (To be precise, transaction IDs in PostgreSQL are 32-bit integers, so
@ -742,7 +727,7 @@ All of the versions of a row are stored within the same database heap (see
[“Storing values within the index”](/en/ch4#sec_storage_index_heap)), regardless of whether the transactions that wrote them have committed
or not. The versions of the same row form a linked list, going either from newest version to oldest
version or the other way round, so that queries can internally iterate over all versions of a row
[[^45], [^46]].
[^45] [^46].
### Visibility rules for observing a consistent snapshot
@ -790,7 +775,7 @@ value matches what the query is looking for. When garbage collection removes old
are no longer visible to any transaction, the corresponding index entries can also be removed.
Many implementation details affect the performance of multi-version concurrency control
[[^45], [^46]].
[^45] [^46].
For example, PostgreSQL has optimizations for avoiding index updates if different versions of the
same row can fit on the same page [^40].
Some other databases avoid storing full copies of modified rows, and only store differences between
@ -829,7 +814,7 @@ Unfortunately, the SQL standards definition of isolation levels is flawed—i
imprecise, and not as implementation-independent as a standard should be [^36]. Even though several databases
implement repeatable read, there are big differences in the guarantees they actually provide,
despite being ostensibly standardized [^29]. There has been a formal definition of
repeatable read in the research literature [[^37], [^38]], but most implementations dont satisfy that
repeatable read in the research literature [^37] [^38], but most implementations dont satisfy that
formal definition. And to top it off, IBM Db2 uses “repeatable read” to refer to serializability [^10].
As a result, nobody really knows what repeatable read means.
@ -884,7 +869,7 @@ Another option is to simply force all atomic operations to be executed on a sing
Unfortunately, object-relational mapping (ORM) frameworks make it easy to accidentally write code
that performs unsafe read-modify-write cycles instead of using atomic operations provided by the
database [[^49], [^50], [^51]].
database [^49] [^50] [^51].
This can be a source of subtle bugs that are difficult to find by testing.
### Explicit locking
@ -940,8 +925,8 @@ An advantage of this approach is that databases can perform this check efficient
with snapshot isolation. Indeed, PostgreSQLs repeatable read, Oracles serializable, and SQL
Servers snapshot isolation levels automatically detect when a lost update has occurred and abort
the offending transaction. However, MySQL/InnoDBs repeatable read does not detect lost updates
[[^29], [^41]].
Some authors [[^36], [^38]] argue that a database must prevent lost
[^29] [^41].
Some authors [^36] [^38] argue that a database must prevent lost
updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot
isolation under this definition.
@ -1023,7 +1008,7 @@ To begin, imagine this example: you are writing an application for doctors to ma
shifts at a hospital. The hospital usually tries to have several doctors on call at any one time,
but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if
they are sick themselves), provided that at least one colleague remains on call in that shift
[[^53], [^54]].
[^53] [^54].
Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are
feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button
@ -1184,7 +1169,7 @@ transaction, is called a *phantom* [^4].
Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the
examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL
generated by ORMs is also prone to write skew
[[^50], [^51]].
[^50] [^51].
### Materializing conflicts
@ -1271,7 +1256,7 @@ Two developments caused this rethink:
outside of the serial execution loop.
The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic,
for example [[^58], [^59], [^60]].
for example [^58] [^59] [^60].
A system designed for single-threaded execution can sometimes perform better than a system that
supports concurrency, because it can avoid the coordination overhead of locking. However, its
throughput is limited to that of a single CPU core. In order to make the most of that single thread,
@ -1541,7 +1526,7 @@ becomes serializable.
Unfortunately, predicate locks do not perform well: if there are many locks by active transactions,
checking for matching locks becomes time-consuming. For that reason, most databases with 2PL
actually implement *index-range locking* (also known as *next-key locking*), which is a simplified
approximation of predicate locking [[^54], [^64]].
approximation of predicate locking [^54] [^64].
Its safe to simplify a predicate by making it match a greater set of objects. For example, if you
have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by
@ -1585,7 +1570,7 @@ serializable isolation and good performance fundamentally at odds with each othe
It seems not: an algorithm called *serializable snapshot isolation* (SSI) provides full
serializability with only a small performance penalty compared to snapshot isolation. SSI is
comparatively new: it was first described in 2008
[[^53], [^65]].
[^53] [^65].
Today SSI and similar algorithms are used in single-node databases (the serializable isolation level
in PostgreSQL [^54], SQL Servers In-Memory
@ -1733,7 +1718,7 @@ tracking is faster, but may lead to more transactions being aborted than strictl
In some cases, its okay for a transaction to read information that was overwritten by another
transaction: depending on what else happened, its sometimes possible to prove that the result of
the execution is nevertheless serializable. PostgreSQL uses this theory to reduce the number of
unnecessary aborts [[^14], [^54]].
unnecessary aborts [^14] [^54].
Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one
transaction doesnt need to block waiting for locks held by another transaction. Like under snapshot
@ -1824,12 +1809,12 @@ problem.
Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It
is a classic algorithm in distributed databases
[[^13], [^71], [^72]]. 2PC is used
[^13] [^71] [^72]. 2PC is used
internally in some databases and also made available to applications in the form of *XA transactions*
[^73]
(which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP
web services
[[^74], [^75]].
[^74] [^75].
The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two
@ -1958,7 +1943,7 @@ stuck waiting for the coordinator to recover. It is possible to make an atomic c
is not so straightforward.
As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed
[[^13], [^77]].
[^13] [^77].
However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most
practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
cannot guarantee atomicity.
@ -1971,7 +1956,7 @@ consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_cons
Distributed transactions and two-phase commit have a mixed reputation. On the one hand, they are
seen as providing an important safety guarantee that would be hard to achieve otherwise; on the
other hand, they are criticized for causing operational problems, killing performance, and promising
more than they can deliver [[^78], [^79], [^80], [^81]].
more than they can deliver [^78] [^79] [^80] [^81].
Many cloud services choose not to implement distributed transactions due to the operational
problems they engender [^82].
@ -2089,7 +2074,7 @@ transaction is resolved.
In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the
log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do
occur [[^83], [^84]]—that is,
occur [^83] [^84]—that is,
transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because
the transaction log has been lost or corrupted due to a software bug). These transactions cannot be
resolved automatically, so they sit forever in the database, holding locks and blocking other
@ -2326,7 +2311,7 @@ is used.
### Summary
### References

View file

@ -22,7 +22,7 @@ anything that *can* go wrong *will* go wrong.
Moreover, working with distributed systems is fundamentally different from writing software on a
single computer—and the main difference is that there are lots of new and exciting ways for things
to go wrong [[^1], [^2]].
to go wrong [^1] [^2].
In this chapter, you will get a taste of the problems that arise in practice, and an understanding
of the things you can and cannot rely on.
@ -197,7 +197,7 @@ even in controlled environments like a datacenter operated by one company [^8]:
(though shark bites have become rarer due to better shielding of submarine cables [^14]).
Humans are also at fault, be it due to accidental misconfiguration [^15], scavenging [^16], or sabotage [^17].
* Across different cloud regions, round-trip times of up to several *minutes* have been observed at
high percentiles [[^18], Table 3].
high percentiles [^18].
Even within a single datacenter, packet delay of more than a minute can occur during a network
topology reconfiguration, triggered by a problem during a software upgrade for a switch
[^19].
@ -364,7 +364,7 @@ network links and switches, and even each machines network interface and CPUs
virtual machines), are shared. Processing large amounts of data can use the entire capacity of
network links (*saturate* them). As you have no control over or insight into other customers usage of the shared
resources, network delays can be highly variable if someone near you (a *noisy neighbor*) is
using a lot of resources [[^30], [^31]].
using a lot of resources [^30] [^31].
In such environments, you can only choose timeouts experimentally: measure the distribution of
network round-trip times over an extended period, and over many machines, to determine the expected
@ -665,7 +665,7 @@ fixed. On the other hand, if its quartz clock is defective or its NTP client is
things will seem to work fine, even though its clock gradually drifts further and further away from
reality. If some piece of software is relying on an accurately synchronized clock, the result is
more likely to be silent and subtle data loss than a dramatic crash
[[^62], [^63]].
[^62] [^63].
Thus, if you use software that requires synchronized clocks, it is essential that you also carefully
monitor the clock offsets between all the machines. Any node whose clock drifts too far from the
@ -715,8 +715,7 @@ serious problems:
* Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite
values previously written by a node with a fast clock until the clock skew between the nodes has
elapsed [[^63],
[^65]].
elapsed [^63] [^65].
This scenario can cause arbitrary amounts of data to be silently dropped without any error being
reported to the application.
* LWW cannot distinguish between writes that occurred sequentially in quick succession (in
@ -812,7 +811,7 @@ the synchronization good enough, they would have the right properties: later tra
higher timestamp. The problem, of course, is the uncertainty about clock accuracy.
Spanner implements snapshot isolation across datacenters in this way
[[^68], [^69]].
[^68] [^69].
It uses the clocks confidence interval as reported by the TrueTime API, and is based on the
following observation: if you have two confidence intervals, each consisting of an earliest and
latest possible timestamp (*A* = [*Aearliest*, *Alatest*] and
@ -1011,11 +1010,11 @@ handle requests from clients while one node is collecting its garbage. If the ru
application that a node soon requires a GC pause, the application can stop sending new requests to
that node, wait for it to finish processing outstanding requests, and then perform the GC while no
requests are in progress. This trick hides GC pauses from clients and reduces the high percentiles
of the response time [[^80], [^81]].
of the response time [^80] [^81].
A variant of this idea is to use the garbage collector only for short-lived objects (which are fast
to collect) and to restart processes periodically, before they accumulate enough long-lived objects
to require a full GC of long-lived objects [[^79], [^82]].
to require a full GC of long-lived objects [^79] [^82].
One node can be restarted at a time, and traffic can be shifted away from the node before the
planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding)).
@ -1120,7 +1119,7 @@ could be lost or corrupted data, which is much more serious.
For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect
implementation of locking. (The bug is not theoretical: HBase used to have this problem
[[^85], [^86]].)
[^85] [^86].)
Say you want to ensure that a file in a storage service can only be
accessed by one client at a time, because if multiple clients tried to write to it, the file would
become corrupted. You try to implement this by requiring a client to obtain a lease from a lock
@ -1207,7 +1206,7 @@ services support such a check: Amazon S3 calls it *conditional writes*, Azure Bl
If your clients need to write only to one storage service that supports such conditional writes, the
lock service is somewhat redundant
[[^91], [^92]],
[^91] [^92],
since the lease assignment could have been implemented directly based on that storage service [^93].
However, once you have a fencing token you can also use it with multiple services or replicas, and
ensure that the old leaseholder is fenced off on all of those services.
@ -1286,8 +1285,7 @@ with the network. This concern is relevant in certain specific circumstances. Fo
by radiation, leading it to respond to other nodes in arbitrarily unpredictable ways. Since a
system failure would be very expensive (e.g., an aircraft crashing and killing everyone on board,
or a rocket colliding with the International Space Station), flight control systems must tolerate
Byzantine faults [[^98],
[^99]].
Byzantine faults [^98] [^99].
* In a system with multiple participating parties, some participants may attempt to cheat or
defraud others. In such circumstances, it is not safe for a node to simply trust another nodes
messages, since they may be sent with malicious intent. For example, cryptocurrencies like
@ -1311,7 +1309,7 @@ escaping are so important: to prevent SQL injection and cross-site scripting, fo
we typically dont use Byzantine fault-tolerant protocols here, but simply make the server the
authority on deciding what client behavior is and isnt allowed. In peer-to-peer networks, where
there is no such central authority, Byzantine fault tolerance is more relevant
[[^103], [^104]].
[^103] [^104].
A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to
all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant
@ -1336,9 +1334,7 @@ pragmatic steps toward better reliability. For example:
* Network packets do sometimes get corrupted due to hardware issues or bugs in operating systems,
drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and
UDP, but sometimes they evade detection [[^105],
[^106],
[^107]].
UDP, but sometimes they evade detection [^105] [^106] [^107].
Simple measures are usually sufficient protection against such corruption, such as checksums in
the application-level protocol. TLS-encrypted connections also offer protection against
corruption.
@ -1542,7 +1538,7 @@ It is prudent to combine theoretical analysis with empirical testing to verify t
behave as expected. Techniques such as property-based testing, fuzzing, and deterministic simulation
testing (DST) use randomization to test a system in a wide range of situations. Companies such as
Amazon Web Services have successfully used a combination of these techniques on many of their
products [[^120], [^121]].
products [^120] [^121].
### Model checking and specification languages
@ -1563,7 +1559,7 @@ longer executions would then not be found.
Still, model checkers strike a nice balance between ease of use and the ability to find non-obvious
bugs. CockroachDB, TiDB, Kafka, and many other distributed systems use model specifications to find
and fix bugs
[[^122], [^123], [^124]]. For example,
[^122] [^123] [^124]. For example,
using TLA+, researchers were able to demonstrate the potential for data loss in viewstamped
replication (VR) caused by ambiguity in the prose description of the algorithm [^125].
@ -1601,7 +1597,7 @@ Its common to adopt a fault injection framework like Jepsen to run fault inje
simplify the process. Such frameworks come with integrations for various operating systems and many
pre-built fault injectors [^129].
Jepsen has been remarkably effective at finding critical bugs in many widely-used systems
[[^130], [^131]].
[^130] [^131].
### Deterministic simulation testing
@ -1750,7 +1746,7 @@ problems in distributed systems.
### Summary
### References
[^1]: Mark Cavage. [Theres Just No Getting Around It: Youre Building a Distributed System](https://queue.acm.org/detail.cfm?id=2482856). *ACM Queue*, volume 11, issue 4, pages 80-89, April 2013. [doi:10.1145/2466486.2482856](https://doi.org/10.1145/2466486.2482856)
[^2]: Jay Kreps. [Getting Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW)