mirror of https://github.com/Vonng/ddia.git
synced 2026-06-21 00:47:05 +08:00

fix ref links

parent 4ec385f161
commit 515bb3a093
9 changed files with 150 additions and 223 deletions

@@ -218,7 +218,7 @@ There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_
 That is the intuition behind linearizability; the formal definition [^1] describes it more precisely. It is
 possible (though computationally expensive) to test whether a system’s behavior is linearizable by
 recording the timings of all requests and responses, and checking whether they can be arranged into
-a valid sequential order [[^6], [^7]].
+a valid sequential order [^6] [^7].
 
 Just as there are various weak isolation levels for transactions besides serializability (see
 [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels)), there are also various weaker consistency models for
@@ -255,7 +255,7 @@ Linearizability
 
 A database may provide both serializability and linearizability, and this combination is known as
 *strict serializability* or *strong one-copy serializability* (*strong-1SR*)
-[[^11], [^12]].
+[^11] [^12].
 Single-node databases are typically linearizable. With distributed databases using optimistic
 methods like serializable snapshot isolation (see [“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)) the situation is more
 complicated: for example, CockroachDB provides serializability, and some recency guarantees on
@@ -264,7 +264,7 @@ because this would require expensive coordination between transactions [^14].
 
 It is also possible to combine a weaker isolation level with linearizability, or a weaker
 consistency model with serializability; in fact, consistency model and isolation level can be chosen
-largely independently from each other [[^15], [^16]].
+largely independently from each other [^15] [^16].
 
 ## Relying on Linearizability
 
@@ -460,7 +460,7 @@ performance: a reader must perform read repair (see [“Catching up on missed wr
 before returning results to the application [^24].
 Moreover, before writing, a writer must read the latest state of a quorum of nodes to fetch the
 latest timestamp of any prior write, and ensure that the new write has a greater timestamp
-[[^25], [^26]].
+[^25] [^26].
 However, Riak does not perform synchronous read repair due to the performance penalty.
 Cassandra does wait for read repair to complete on quorum reads [^27],
 but it loses linearizability due to its use of time-of-day clocks for timestamps.
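The two rules in the hunk above (a reader repairs stale replicas before returning, and a writer first reads a quorum to pick a greater timestamp) can be sketched as a single-process simulation. This is an illustration under stated assumptions, not code from the book or from any real database: `Replica`, `quorum_write`, and `quorum_read` are hypothetical names, timestamps are plain logical integers, and there is no concurrency or networking. A real system would also need a tie-breaker, such as a writer ID, so that two concurrent writers never choose the same timestamp.

```python
class Replica:
    """Hypothetical in-memory replica holding one timestamped value."""

    def __init__(self):
        self.timestamp = 0
        self.value = None

    def read(self):
        return self.timestamp, self.value

    def write(self, timestamp, value):
        # Keep only the newest version; older writes are ignored.
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value


def quorum_write(quorum, value):
    # Step 1: read the quorum to learn the latest timestamp of any prior write.
    latest = max(replica.read()[0] for replica in quorum)
    # Step 2: write with a strictly greater timestamp.
    new_timestamp = latest + 1
    for replica in quorum:
        replica.write(new_timestamp, value)
    return new_timestamp


def quorum_read(quorum):
    # Pick the newest version seen by any replica in the quorum.
    timestamp, value = max((replica.read() for replica in quorum), key=lambda tv: tv[0])
    # Synchronous read repair: update stale replicas before returning.
    for replica in quorum:
        replica.write(timestamp, value)
    return value
```

With three replicas, a write to replicas 0 and 1 followed by a write to replicas 1 and 2 gets a higher timestamp because the two quorums overlap in replica 1; a later read from replicas 0 and 2 returns the second value and repairs replica 0.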
@@ -529,10 +529,10 @@ The trade-off is as follows:
 
 Thus, applications that don’t require linearizability can be more tolerant of network problems. This
 insight is popularly known as the *CAP theorem*
-[[^29], [^30], [^31], [^32]],
+[^29] [^30] [^31] [^32],
 named by Eric Brewer in 2000, although the trade-off had been known to designers of
 distributed databases since the 1970s
-[[^33], [^34], [^35]].
+[^33] [^34] [^35].
 
 CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of
 starting a discussion about trade-offs in databases. At the time, many distributed databases
@@ -563,7 +563,7 @@ formalization of *availability* [^30] does not
 match the usual meaning of the term [^38]. Many highly available (fault-tolerant) systems actually do not meet CAP’s
 idiosyncratic definition of availability. Moreover, some system designers choose (with good reason)
 to provide neither linearizability nor the form of availability that the CAP theorem assumes, so
-those systems are neither CP nor AP [[^39], [^40]].
+those systems are neither CP nor AP [^39] [^40].
 
 All in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us
 understand systems better, so CAP is best avoided.
@@ -574,11 +574,11 @@ fault (network partitions, which according to data from Google are the cause of
 incidents [^41]).
 It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP
 has been historically influential, it has little practical value for designing systems
-[[^4], [^38]].
+[^4] [^38].
 
 There have been efforts to generalize CAP. For example, the *PACELC principle* observes that system
 designers might also choose to weaken consistency at times when the network is working fine in order
-to reduce latency [[^39], [^40], [^42]].
+to reduce latency [^39] [^40] [^42].
 Thus, during a network partition (P), we need to choose between availability (A) and consistency
 (C); else (E), when there is no partition, we may choose between low latency (L) and
 consistency (C). However, this definition inherits several problems with CAP, such as the
@@ -586,7 +586,7 @@ counterintuitive definitions of consistency and availability.
 
 There are many more interesting impossibility results in distributed systems [^43],
 and CAP has now been superseded by more precise results
-[[^44], [^45]],
+[^44] [^45],
 so it is of mostly historical interest today.
 
 ### Linearizability and network delays
@@ -945,18 +945,18 @@ node, but which get a lot harder if you want fault tolerance:
 It turns out that all of these are instances of the same fundamental distributed systems problem:
 *consensus*. Consensus is one of the most important and fundamental problems in distributed
 computing; it is also infamously difficult to get right
-[[^58], [^59]],
+[^58] [^59],
 and many systems have got it wrong in the past. Now that we have discussed replication
 ([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and
 linearizability (this chapter), we are finally ready to tackle the consensus problem.
 
 The best-known consensus algorithms are Viewstamped Replication
-[[^60], [^61]],
-Paxos [[^58], [^62], [^63], [^64]],
-Raft [[^23], [^65], [^66]],
-and Zab [[^18], [^22], [^67]].
+[^60] [^61],
+Paxos [^58] [^62] [^63] [^64],
+Raft [^23] [^65] [^66],
+and Zab [^18] [^22] [^67].
 There are quite a few similarities between these algorithms, but they are not the same
-[[^68], [^69]].
+[^68] [^69].
 These algorithms work in a non-Byzantine system model: that is, network communication may be
 arbitrarily delayed or dropped, and nodes may crash, restart, and become disconnected, but the
 algorithms assume that nodes otherwise follow the protocol correctly and do not behave maliciously.
@@ -964,7 +964,7 @@ algorithms assume that nodes otherwise follow the protocol correctly and do not
 There are also consensus algorithms that can tolerate some Byzantine nodes, i.e., nodes that don’t
 correctly follow the protocol (for example, by sending contradictory messages to other nodes). A
 common assumption is that fewer than one-third of the nodes are Byzantine-faulty
-[[^26], [^70]].
+[^26] [^70].
 Such *Byzantine fault tolerant* (BFT) consensus algorithms are used in blockchains [^71].
 However, as explained in [“Byzantine Faults”](/en/ch9#sec_distributed_byzantine), BFT algorithms are beyond the scope of this
 book.
@@ -1095,7 +1095,7 @@ consensus. Any CAS invocations whose new value was not decided return an error.
 different expected values use separate runs of the consensus protocol.
 
 This shows that CAS and consensus are equivalent to each other
-[[^28], [^73]].
+[^28] [^73].
 Again, both are straightforward on a single node, but challenging to make fault-tolerant. As an
 example of CAS in a distributed setting, we saw conditional write operations for object stores in
 [“Databases backed by object storage”](/en/ch6#sec_replication_object_storage), which allow a write to happen only if an object with the same
@@ -1105,7 +1105,7 @@ However, a linearizable read-write register is not sufficient to solve consensus
 tells us that consensus cannot be solved by a deterministic algorithm in the asynchronous crash-stop
 model [^72], but we saw in
 [“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum
-reads/writes in this model [[^24], [^25], [^26]].
+reads/writes in this model [^24] [^25] [^26].
 From this it follows that a linearizable register cannot solve consensus.
 
 ### Shared logs as consensus
@@ -1142,12 +1142,8 @@ Validity
 value to be added to the log.
 
 > [!NOTE]
-> A shared log is formally known as a *total order broadcast*, *atomic broadcast*, or *total order
-> multicast* protocol [[^26],
-> [^76],
-> [^77]].
-> It’s the same thing described in different words: requesting a value to be added to the log is then
-> called “broadcasting” it, and reading a log entry is called “delivering” it.
+> A shared log is formally known as a *total order broadcast*, *atomic broadcast*, or *total order multicast* protocol [^26] [^76] [^77]
+> It’s the same thing described in different words: requesting a value to be added to the log is then called “broadcasting” it, and reading a log entry is called “delivering” it.
 
 If you have an implementation of a shared log, it is easy to solve the consensus problem: every node
 that wants to propose a value requests for it to be added to the log, and whichever value is read
@@ -1243,7 +1239,7 @@ any of the communication among the nodes times out). The other three properties
 same as for consensus.
 
 If you have a solution for consensus, there are multiple ways you could solve atomic commitment
-[[^78], [^79]].
+[^78] [^79].
 One works like this: when you want to commit the transaction, every node sends its vote to commit or
 abort to every other node. Nodes that receive a vote to commit from itself and every other node
 propose “commit” using the consensus algorithm; nodes that receive a vote to abort, or which
@@ -1290,7 +1286,7 @@ Similarly, a shared log can be used to implement serializable transactions: as d
 [“Actual Serial Execution”](/en/ch8#sec_transactions_serial), if every log entry represents a deterministic transaction to be
 executed as a stored procedure, and if every node executes those transactions in the same order,
 then the transactions will be serializable
-[[^81], [^82]].
+[^81] [^82].
 
 > [!NOTE]
 > Sharded databases with a strong consistency model often maintain a separate log per shard, which
@@ -1353,7 +1349,7 @@ a vote on a proposal succeeds, at least one of the nodes that voted for it must
 participated in the most recent successful leader election [^85]. Thus, if the vote on a proposal
 passes without revealing any higher-numbered epoch, the current leader can conclude that no leader
 with a higher epoch number has been elected, and therefore it can safely append the proposed entry
-to the log [[^26], [^86]].
+to the log [^26] [^86].
 
 These two rounds of voting look superficially similar to two-phase commit, but they are very
 different protocols. In consensus algorithms, any node can start an election and it requires only a
@@ -1364,7 +1360,7 @@ vote from *every* participant before it can commit.
 
 This basic structure is common to all of Raft, Multi-Paxos, Zab, and Viewstamped Replication: a vote
 by a quorum of nodes elects a leader, and then another quorum vote is required for every entry that
-the leader wants to append to the log [[^68], [^69]]. Every new log entry is synchronously replicated
+the leader wants to append to the log [^68] [^69]. Every new log entry is synchronously replicated
 to a quorum of nodes before it is confirmed to the client that requested the write. This ensures
 that the log entry won’t be lost if the current leader fails.
 
@@ -1398,7 +1394,7 @@ easily cause a lot of data loss or corruption.
 Another subtlety is in how the algorithms deal with log entries that had been proposed by the old
 leader before it failed, but for which the vote on appending to the log had not yet completed. You
 can find discussions of these details in the references for this chapter
-[[^23], [^69], [^86]].
+[^23] [^69] [^86].
 
 For databases that use a consensus algorithm for replication, not only do writes need to be turned
 into log entries and replicated to a quorum. If you want to guarantee linearizable reads, they also
@@ -1441,7 +1437,7 @@ work.
 
 Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft
 has been shown to have unpleasant edge cases
-[[^88], [^89]]:
+[^88] [^89]:
 if the entire network is working correctly except for one particular network link that is
 consistently unreliable, Raft can get into situations where leadership continually bounces between
 two nodes, or the current leader is continually forced to resign, so the system effectively never
@@ -1468,7 +1464,7 @@ entirely in memory (although they still write to disk for durability), which is
 multiple nodes using a fault-tolerant consensus algorithm.
 
 Coordination services are modeled after Google’s Chubby lock service
-[[^17], [^58]].
+[^17] [^58].
 They combine a consensus algorithm with several other features that turn out to be particularly
 useful when building distributed systems:
 
@@ -1545,7 +1541,7 @@ information like “the node running on IP address 10.1.1.23 is the leader for s
 assignments usually change on a timescale of minutes or hours. Coordination services are not
 intended for storing data that may change thousands of times per second. For that, it is better to
 use a conventional database; alternatively, tools like Apache BookKeeper
-[[^90], [^91]]
+[^90] [^91]
 can be used to replicate fast-changing internal state of a service.
 
 ### Service discovery
|
|||
|
|
@@ -186,19 +186,13 @@ is a long queue of requests waiting to be handled, response times may increase s
 time out and resend their request. This causes the rate of requests to increase even further, making
 the problem worse—a *retry storm*. Even when the load is reduced again, such a system may remain in
 an overloaded state until it is rebooted or otherwise reset. This phenomenon is called a *metastable
-failure*, and it can cause serious outages in production systems
-[[^7], [^8]].
+failure*, and it can cause serious outages in production systems [^7] [^8].
 
 To avoid retries overloading a service, you can increase and randomize the time between successive
-retries on the client side (*exponential backoff*
-[[^9], [^10]]),
-and temporarily stop sending requests to a service that has returned errors or timed out recently
-(using a *circuit breaker* [[^11], [^12]]
-or *token bucket* algorithm [^13]).
+retries on the client side (*exponential backoff* [^9] [^10]), and temporarily stop sending requests to a service that has returned errors or timed out recently
+(using a *circuit breaker* [^11] [^12] or *token bucket* algorithm [^13]).
 The server can also detect when it is approaching overload and start proactively rejecting requests
-(*load shedding* [^14]), and send back
-responses asking clients to slow down (*backpressure*
-[[^1], [^15]]).
+(*load shedding* [^14]), and send back responses asking clients to slow down (*backpressure* [^1] [^15]).
 The choice of queueing and load-balancing algorithms can also make a difference [^16].
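The client-side strategy described above (increasing and randomizing the time between successive retries) can be sketched as follows. This is a minimal illustration, not code from the book: the names `backoff_delays` and `call_with_retries`, the base delay, the cap, and the choice of drawing each delay uniformly between zero and the current bound are all assumptions made for the example.

```python
import random
import time


def backoff_delays(base=0.1, cap=10.0, attempts=5, rng=None):
    """Return one randomized delay (in seconds) per retry attempt."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        # The upper bound doubles with every attempt, up to a fixed cap.
        upper = min(cap, base * (2 ** attempt))
        # Randomize within the bound so that clients do not retry in lockstep.
        delays.append(rng.uniform(0, upper))
    return delays


def call_with_retries(operation, delays, sleep=time.sleep):
    """Call operation(); after each failure, wait for the next delay and retry."""
    for delay in delays:
        try:
            return operation()
        except Exception:
            sleep(delay)
    # Final attempt: any exception now propagates to the caller.
    return operation()
```

Passing `sleep` as a parameter makes the retry loop easy to test without real waiting. A circuit breaker or token bucket would sit alongside this on the client; neither is shown here.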
 
 In terms of performance metrics, the response time is usually what users care about the most,
@@ -342,7 +336,7 @@ For example, an SLO may set a target for a service to have a median response tim
 result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not
 met (for example, customers may be entitled to a refund). That is the basic idea, at least; in
 practice, defining good availability metrics for SLOs and SLAs is not straightforward
-[[^28], [^29]].
+[^28] [^29].
 
 # Computing percentiles
 
@@ -355,7 +349,7 @@ The simplest implementation is to keep a list of response times for all requests
 window and to sort that list every minute. If that is too inefficient for you, there are algorithms
 that can calculate a good approximation of percentiles at minimal CPU and memory cost.
 Open source percentile estimation libraries include HdrHistogram,
-t-digest [[^30], [^31]],
+t-digest [^30] [^31],
 OpenHistogram [^32], and DDSketch [^33].
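The "simplest implementation" mentioned above, keeping the response times for a time window and sorting them, might look like the sketch below. The class name `WindowPercentiles`, the default window length, and the nearest-rank definition of a percentile are assumptions made for illustration; the libraries named above use approximation algorithms that are not shown here.

```python
import math
from collections import deque


class WindowPercentiles:
    """Exact percentiles over a sliding time window, computed by sorting."""

    def __init__(self, window_seconds=600):
        self.window_seconds = window_seconds
        self.samples = deque()  # (timestamp, response_time_ms), oldest first

    def record(self, timestamp, response_time_ms):
        self.samples.append((timestamp, response_time_ms))

    def percentile(self, p, now):
        # Drop samples that have fallen out of the window.
        while self.samples and self.samples[0][0] <= now - self.window_seconds:
            self.samples.popleft()
        values = sorted(value for _, value in self.samples)
        if not values:
            return None
        # Nearest rank: the smallest value with at least p% of samples at or below it.
        rank = max(1, math.ceil(p * len(values) / 100))
        return values[rank - 1]
```

Sorting on every query costs O(n log n) in the number of samples in the window, which is the inefficiency that motivates the approximate algorithms.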
 
 Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from
@@ -375,7 +369,7 @@ software, typical expectations include:
 If all those things together mean “working correctly,” then we can understand *reliability* as
 meaning, roughly, “continuing to work correctly, even when things go wrong.” To be more precise
 about things going wrong, we will distinguish between *faults* and *failures*
-[[^35], [^36], [^37]]:
+[^35] [^36] [^37]:
 
 Fault
 : A fault is when a particular *part* of a system stops working correctly: for example, if a
@@ -432,21 +426,14 @@ cured, as described in the following sections.
 
 When we think of causes of system failure, hardware faults quickly come to mind:
 
-* Approximately 2–5% of magnetic hard drives fail per year
-[[^40],
-[^41]];
+* Approximately 2–5% of magnetic hard drives fail per year [^40] [^41];
 in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
-Recent data suggests that disks are getting more reliable, but failure rates remain significant
-[^42].
-* Approximately 0.5–1% of solid state drives (SSDs) fail per year
-[^43].
-Small numbers of bit errors are corrected automatically
-[^44],
+Recent data suggests that disks are getting more reliable, but failure rates remain significant [^42].
+* Approximately 0.5–1% of solid state drives (SSDs) fail per year [^43].
+Small numbers of bit errors are corrected automatically [^44],
 but uncorrectable errors occur approximately once per year per drive, even in drives that are
 fairly new (i.e., that have experienced little wear); this error rate is higher than that of
-magnetic hard drives
-[[^45],
-[^46]].
+magnetic hard drives [^45], [^46].
 * Other hardware components such as power supplies, RAID controllers, and memory modules also fail,
 although less frequently than hard drives [^47] [^48].
 * Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result,
@@ -676,7 +663,7 @@ If you can double the resources in order to handle twice the load, while keeping
 same, we say that you have *linear scalability*, and this is considered a good thing. Occasionally
 it is possible to handle twice the load with less than double the resources, due to economies of
 scale or a better distribution of peak load
-[[^79], [^80]].
+[^79] [^80].
 Much more likely is that the cost grows faster than linearly, and there may be many reasons for the
 inefficiency. For example, if you have a lot of data, then processing a single write request may
 involve more work than if you have a small amount of data, even if the size of the request is the
@@ -762,7 +749,7 @@ bugs that need fixing.
 It is widely recognized that the majority of the cost of software is not in its initial development,
 but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures,
 adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding
-new features [[^85], [^86]].
+new features [^85] [^86].
 
 However, maintenance is also difficult. If a system has been successfully running for a long time,
 it may well use outdated technologies that not many engineers understand today (such as mainframes
@@ -925,7 +912,7 @@ There are no easy answers on how to achieve these things, but one thing that can
 applications using well-understood building blocks that provide useful abstractions. The rest of
 this book will cover a selection of building blocks that have proved to be valuable in practice.
 
-### Summary
+### References
 
 [^1]: Mike Cvet. [How We Learned to Stop Worrying and Love Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016.
 [^2]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK)
|
|||
|
|
@@ -219,10 +219,7 @@ structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)).
 ###### Figure 3-2. One-to-many relationships forming a tree structure.
 
 > [!NOTE]
-> This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé
-> typically has a small number of positions
-> [[^9],
-> [^10]].
+> This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé typically has a small number of positions [^9] [^10].
 > In situations where there may be a genuinely large number of related items—say, comments on a
 > celebrity’s social media post, of which there could be many thousands—embedding them all in the same
 > document may be too unwieldy, so the relational approach in [Figure 3-1](/en/ch3#fig_obama_relational) is preferable.
@@ -540,7 +537,7 @@ such applications well, because the items (or their IDs) can simply be stored in
 determine their order. In relational databases there isn’t a standard way of representing such
 reorderable lists, and various tricks are used: sorting by an integer column (requiring renumbering
 when you insert into the middle), a linked list of IDs, or fractional indexing
-[[^14], [^15], [^16]].
+[^14] [^15] [^16].
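Of the tricks listed above, fractional indexing is perhaps the least self-explanatory, so here is a small sketch of the idea: each item carries a sort key, and moving an item only requires choosing a new key strictly between its two new neighbours. The helper `position_between` and the use of exact rationals from Python's `fractions` module are assumptions for illustration; production schemes often use variable-length strings instead.

```python
from fractions import Fraction


def position_between(before, after):
    """Return a sort key strictly between two neighbours (None means list edge)."""
    if before is None and after is None:
        return Fraction(0)
    if before is None:
        return after - 1
    if after is None:
        return before + 1
    return (before + after) / 2


# Three items with their current sort keys.
items = {"a": Fraction(0), "b": Fraction(1), "c": Fraction(2)}

# Move "c" between "a" and "b" by changing only its own key; no renumbering needed.
items["c"] = position_between(items["a"], items["b"])
ordered = sorted(items, key=items.get)
```

In a relational table the key would be stored in a column and rows fetched with `ORDER BY` on that column; only the moved row is updated.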
 
 ### Schema flexibility in the document model
 
@@ -593,7 +590,7 @@ since every row needs to be rewritten, and other schema operations (such as chan
 of a column) also typically require the entire table to be copied.
 
 Various tools exist to allow this type of schema changes to be performed in the background without downtime
-[[^21], [^22], [^23], [^24]],
+[^21] [^22] [^23] [^24],
 but performing such migrations on large databases remains operationally challenging. Complicated
 migrations can be avoided by only adding the `first_name` column with a default value of `NULL`
 (which is fast), and filling it in at read time, like you would with a document database.
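The approach in the last sentence of the hunk above (add a nullable `first_name` column, then fill it in at read time) can be sketched with Python's built-in `sqlite3` module. The `users` table, its `name` column, the `load_user` helper, and the rule that the first name is the text before the first space are all hypothetical and chosen only for this example; SQLite is used because it runs in memory, not because the text refers to it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES ('Ada Lovelace')")

# Adding a nullable column with a NULL default does not rewrite existing rows.
conn.execute("ALTER TABLE users ADD COLUMN first_name TEXT DEFAULT NULL")


def load_user(conn, user_id):
    row = conn.execute(
        "SELECT id, name, first_name FROM users WHERE id = ?", (user_id,)
    ).fetchone()
    if row is None:
        return None
    user = {"id": row[0], "name": row[1], "first_name": row[2]}
    if user["first_name"] is None and user["name"]:
        # Old row written before the schema change: derive the value on read.
        user["first_name"] = user["name"].split(" ")[0]
    return user
```

The application code now has to handle both old and new rows, which is the same trade-off a document database makes with schema-on-read.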
@@ -1044,7 +1041,7 @@ Oracle has a different SQL extension for recursive queries, which it calls *hier
 [^41].
 
 However, the situation may be improving: at the time of writing, there are plans to add a graph
-query language called GQL to the SQL standard [[^42], [^43]],
+query language called GQL to the SQL standard [^42] [^43],
 which will provide a syntax inspired by Cypher, GSQL [^44], and PGQL [^45].
 
 ## Triple-Stores and SPARQL
@@ -1127,7 +1124,7 @@ Some of the research and development effort on triple stores was motivated by th
 early-2000s effort to facilitate internet-wide data exchange by publishing data not only as
 human-readable web pages, but also in a standardized, machine-readable format. Although the Semantic
 Web as originally envisioned did not succeed
-[[^49], [^50]],
+[^49] [^50],
 the legacy of the Semantic Web project lives on in a couple of specific technologies: *linked data*
 standards such as JSON-LD [^51],
 *ontologies* used in biomedical science [^52],
@@ -1238,7 +1235,7 @@ various other triple stores [^36].
 ## Datalog: Recursive Relational Queries
 
 Datalog is a much older language than SPARQL or Cypher: it arose from academic research in the 1980s
-[[^57], [^58], [^59]].
+[^57] [^58] [^59].
 It is less well known among software engineers and not widely supported in mainstream databases, but
 it ought to be better-known since it is a very expressive language that is particularly powerful for
 complex queries. Several niche databases, including Datomic, LogicBlox, CozoDB, and LinkedIn’s
@@ -1498,7 +1495,7 @@ the status of each booking, another that computes charts for the conference orga
 and a third that generates files for the printer that produces the attendees’ badges.
 
 The idea of using events as the source of truth, and expressing every state change as an event, is
-known as *event sourcing* [[^62], [^63]].
+known as *event sourcing* [^62] [^63].
 The principle of maintaining separate read-optimized representations and deriving them from the
 write-optimized representation is called *command query responsibility segregation (CQRS)*
 [^64].
@@ -1724,11 +1721,7 @@ come into play when *implementing* the data models described in this chapter.
 
 
 
-### Summary
-
-
-
-
+### References
 
 [^1]: Jamie Brandon. [Unexplanations: query optimization works because sql is declarative](https://www.scattered-thoughts.net/writing/unexplanations-sql-declarative/). *scattered-thoughts.net*, February 2024. Archived at [perma.cc/P6W2-WMFZ](https://perma.cc/P6W2-WMFZ)
 [^2]: Joseph M. Hellerstein. [The Declarative Imperative: Experiences and Conjectures in Distributed Logic](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf). Tech report UCB/EECS-2010-90, Electrical Engineering and Computer Sciences, University of California at Berkeley, June 2010. Archived at [perma.cc/K56R-VVQM](https://perma.cc/K56R-VVQM)
|
|||
|
|
@@ -320,8 +320,7 @@ In the context of an LSM storage engines, false positives are no problem:
 
 An important detail is how the LSM storage chooses when to perform compaction, and which SSTables to
 include in a compaction. Many LSM-based storage systems allow you to configure which compaction
-strategy to use, and some of the common choices are
-[[^16], [^17]]:
+strategy to use, and some of the common choices are [^16] [^17]:
 
 Size-tiered compaction
 : Newer and smaller SSTables are successively merged into older and larger SSTables. The SSTables
@@ -452,7 +451,7 @@ In order to make the database resilient to crashes, it is common for B-tree impl
 include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file
 to which every B-tree modification must be written before it can be applied to the pages of the tree
 itself. When the database comes back up after a crash, this log is used to restore the B-tree back
-to a consistent state [[^2], [^24]].
+to a consistent state [^2] [^24].
 In filesystems, the equivalent mechanism is known as *journaling*.
 
 To improve performance, B-tree implementations typically don’t immediately write every modified page
@@ -484,8 +483,7 @@ mention just a few:
 
 ## Comparing B-Trees and LSM-Trees
 
-As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads
-[[^27], [^28]].
+As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads [^27] [^28].
 However, benchmarks are often sensitive to details of the workload. You need to test systems with
 your particular workload in order to make a valid comparison. Moreover, it’s not a strict either/or
 choice between LSM and B-trees: storage engines sometimes blend characteristics of both approaches,
@@ -512,7 +510,7 @@ memtable fills up. This happens if data can’t be written out to disk fast enou
 the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB,
 perform *backpressure* in this situation: they suspend all reads and writes until the memtable has
 been written out to disk
-[[^30], [^31]].
+[^30] [^31].
 
 Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read
 requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but
@@ -555,7 +553,7 @@ A sequential write workload writes larger chunks of data at a time, so it is lik
 can be erased without having to perform any GC. On the other hand, with a random write workload, it
 is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
 to perform more work before a block can be erased
-[[^34], [^35], [^36]].
+[^34] [^35] [^36].
 
 The write bandwidth consumed by GC is then not available for the application. Moreover, the
 additional writes performed by GC contribute to wear on the flash memory; therefore, random writes
@@ -573,7 +571,7 @@ containing keys and references to values [^37].)
 A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once
 to the tree page itself. In addition, they sometimes need to write out an entire page, even if only
 a few bytes in that page changed, to ensure the B-tree can be correctly recovered after a crash or
-power failure [[^38], [^39]].
+power failure [^38] [^39].
 
 If you take the total number of bytes written to disk in some workload, and divide by the number of
 bytes you would have to write if you simply wrote an append-only log with no index, you get the
@@ -610,7 +608,7 @@ the data files anyway, and SSTables don’t have pages with unused space. Moreov
 key-value pairs can better be compressed in SSTables, and thus often produce smaller files on disk
 than B-trees. Keys and values that have been overwritten continue to consume space until they are
 removed by a compaction, but this overhead is quite low when using leveled compaction
-[[^40], [^41]].
+[^40] [^41].
 Size-tiered compaction (see [“Compaction strategies”](/en/ch4#sec_storage_lsm_compaction)) uses more disk space, especially
 temporarily during compaction.
 
@@ -710,7 +708,7 @@ easily be backed up, inspected, and analyzed by external utilities.
 Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model,
 and the vendors claim that they can offer big performance improvements by removing all the overheads
 associated with managing on-disk data structures
-[[^46], [^47]].
+[^46] [^47].
 RAMCloud is an open source, in-memory key-value store with durability (using a log-structured
 approach for the data in memory as well as the data on disk) [^48].
 
@ -744,7 +742,7 @@ transaction processing and data warehousing in the same product. However, these
|
|||
and analytical processing (HTAP) databases (introduced in [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are increasingly
|
||||
becoming two separate storage and query engines, which happen to be accessible through a common SQL
|
||||
interface
|
||||
[[^50], [^51], [^52], [^53]].
|
||||
[^50] [^51] [^52] [^53].
|
||||
|
||||
## Cloud Data Warehouses
|
||||
|
||||
|
|
@ -881,11 +879,11 @@ to single-node embedded databases such as DuckDB [^62],
|
|||
and product analytics systems such as Pinot [^63]
|
||||
and Druid [^64].
|
||||
It is used in storage formats such as Parquet, ORC
|
||||
[[^65], [^66]],
|
||||
[^65] [^66],
|
||||
Lance [^67],
|
||||
and Nimble [^68],
|
||||
and in-memory analytics formats like Apache Arrow
|
||||
[[^65], [^69]]
|
||||
[^65] [^69]
|
||||
and Pandas/NumPy [^70].
|
||||
Some time-series databases, such as InfluxDB IOx [^71] and TimescaleDB [^72],
|
||||
are also based on column-oriented storage.
|
||||
|
|
@ -999,7 +997,7 @@ Queries need to examine both the column data on disk and the recent writes in me
|
|||
the two. The query execution engine hides this distinction from the user. From an analyst’s point
|
||||
of view, data that has been modified with inserts, updates, or deletes is immediately reflected in
|
||||
subsequent queries. Snowflake, Vertica, Apache Pinot, Apache Druid, and many others do this
|
||||
[[^61], [^63], [^64], [^76]].
|
||||
[^61] [^63] [^64] [^76].
|
||||
|
||||
## Query Execution: Compilation and Vectorization
|
||||
|
||||
|
|
@ -1034,7 +1032,7 @@ Vectorized processing
|
|||
: The query is interpreted, not compiled, but it is made fast by processing many values from a
|
||||
column in a batch, instead of iterating over rows one by one. A fixed set of predefined operators
|
||||
are built into the database; we can pass arguments to them and get back a batch of results
|
||||
[[^50], [^75]].
|
||||
[^50] [^75].
|
||||
|
||||
For example, we could pass the `product_sk` column and the ID of “bananas” to an equality operator,
|
||||
and get back a bitmap (one bit per value in the input column, which is 1 if it’s a banana); we could
|
||||
|
|
@ -1056,9 +1054,7 @@ performance by taking advantage of the characteristics of modern CPUs:
|
|||
* doing most of the work in tight inner loops (that is, with a small number of instructions and no
|
||||
function calls) to keep the CPU instruction processing pipeline busy and avoid branch
|
||||
mispredictions,
|
||||
* making use of parallelism such as multiple threads and single-instruction-multi-data (SIMD)
|
||||
instructions [[^79],
|
||||
[^80]], and
|
||||
* making use of parallelism such as multiple threads and single-instruction-multi-data (SIMD) instructions [^79] [^80], and
|
||||
* operating directly on compressed data without decoding it into a separate in-memory
|
||||
representation, which saves memory allocation and copying costs.
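
As a toy illustration of the vectorized style, the sketch below implements two batch operators: an equality operator that takes a whole column and returns a bitmap, and a bitwise AND that combines two bitmaps. The column names, the IDs, and the function names `equals` and `bitmap_and` are assumptions made up for this example. Plain Python lists stand in for the packed, SIMD-friendly buffers a real engine would use.

```python
def equals(column, value):
    """Equality operator: one bit per input value, 1 where the value matches."""
    return [1 if v == value else 0 for v in column]


def bitmap_and(left, right):
    """Combine two bitmaps of equal length with a bitwise AND."""
    return [a & b for a, b in zip(left, right)]


# Hypothetical column batches; 31 is assumed to be the ID of "bananas".
product_sk = [69, 31, 69, 74, 31]
store_sk = [3, 3, 5, 3, 3]

# Rows where the product is bananas AND the store is store 3.
mask = bitmap_and(equals(product_sk, 31), equals(store_sk, 3))
matching_rows = sum(mask)
```

Each operator makes one pass over a batch of values rather than being invoked once per row, which is the property that lets a real implementation keep its inner loop small.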
|
||||
|
||||
|
|
@ -1196,7 +1192,7 @@ It stores the mapping from term to postings list in SSTable-like sorted files, w
|
|||
the background using the same log-structured approach we saw earlier in this chapter [^91].
|
||||
PostgreSQL’s GIN index type also uses postings lists to support full-text search and indexing inside
|
||||
JSON documents
|
||||
[[^92], [^93]].
|
||||
[^92] [^93].
|
||||
|
||||
Instead of breaking text into words, an alternative is to find all the substrings of length *n*,
|
||||
which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
|
||||
|
|
@ -1295,7 +1291,7 @@ variations of each [^101],
|
|||
and PostgreSQL’s pgvector supports both as well [^102].
|
||||
The full details of the IVF and HNSW algorithms are beyond the scope of this book, but their papers
|
||||
are an excellent resource
|
||||
[[^103], [^104]].
|
||||
[^103] [^104].
|
||||
|
||||
## Summary
|
||||
|
||||
|
|
@ -1347,7 +1343,7 @@ documentation for the database of your choice.
|
|||
|
||||
|
||||
|
||||
### Summary
|
||||
### References
|
||||
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -447,7 +447,7 @@ application code is expecting, and their types.
|
|||
If the reader’s and writer’s schema are the same, decoding is easy. If they are different, Avro
|
||||
resolves the differences by looking at the writer’s schema and the reader’s schema side by side and
|
||||
translating the data from the writer’s schema into the reader’s schema. The Avro specification
|
||||
[[^16], [^17]]
|
||||
[^16] [^17]
|
||||
defines exactly how this resolution works, and it is illustrated in
|
||||
[Figure 5-6](/en/ch5#fig_encoding_avro_resolution).
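
The core of the resolution rules can be sketched in a few lines. This is a simplified illustration, not the Avro library's API: it only covers matching fields by name, ignoring fields the reader does not know, and filling in reader defaults. Type promotion, aliases, and unions are left out, and the names `resolve_record` and `NO_DEFAULT` are made up for the example.

```python
NO_DEFAULT = object()  # sentinel: the reader's schema declares no default


def resolve_record(writer_record, reader_fields):
    """Translate a record decoded with the writer's schema into the reader's schema.

    writer_record: dict of field name -> value, as written
    reader_fields: dict of field name -> default value (or NO_DEFAULT)
    """
    resolved = {}
    for name, default in reader_fields.items():
        if name in writer_record:
            # Field exists in both schemas: matched by name, not by position.
            resolved[name] = writer_record[name]
        elif default is not NO_DEFAULT:
            # Field only in the reader's schema: fill in its default value.
            resolved[name] = default
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    # Fields that appear only in the writer's schema are simply ignored.
    return resolved


writer_record = {"userName": "Martin", "favoriteNumber": 1337, "interests": ["hacking"]}
reader_fields = {"userName": NO_DEFAULT, "favoriteNumber": None, "userID": None}
resolved = resolve_record(writer_record, reader_fields)
```

Here `interests` is dropped because the reader does not expect it, and `userID` is filled with its default because the writer never wrote it.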
|
||||
|
||||
|
|
@ -571,7 +571,7 @@ languages.
|
|||
|
||||
The ideas on which these encodings are based are by no means new. For example, they have a lot in
|
||||
common with ASN.1, a schema definition language that was first standardized in 1984
|
||||
[[^23], [^24]].
|
||||
[^23] [^24].
|
||||
It was used to define various network protocols, and its binary encoding (DER) is still used to encode
|
||||
SSL certificates (X.509), for example [^25].
|
||||
ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers [^26].
|
||||
|
|
@ -737,7 +737,7 @@ different contexts. For example:
|
|||
systems, or OAuth for shared access to user data.
|
||||
|
||||
The most popular service design philosophy is REST, which builds upon the principles of HTTP
|
||||
[[^30], [^31]].
|
||||
[^30] [^31].
|
||||
It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for
|
||||
cache control, authentication, and content type negotiation. An API designed according to the
|
||||
principles of REST is called *RESTful*.
|
||||
|
|
@ -824,14 +824,14 @@ Architecture (CORBA) is excessively complex, and does not provide backward or fo
|
|||
compatibility [^33].
|
||||
SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are
|
||||
also plagued by complexity and compatibility problems
|
||||
[[^34], [^35], [^36]].
|
||||
[^34] [^35] [^36].
|
||||
|
||||
All of these are based on the idea of a *remote procedure call* (RPC), which has been around since
|
||||
the 1970s [^37].
|
||||
The RPC model tries to make a request to a remote network service look the same as calling a function or
|
||||
method in your programming language, within the same process (this abstraction is called *location
|
||||
transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed
|
||||
[[^38], [^39]].
|
||||
[^38] [^39].
|
||||
A network request is very different from a local function call:
|
||||
|
||||
* A local function call is predictable and either succeeds or fails, depending only on parameters
|
||||
|
|
@ -1016,7 +1016,7 @@ task fails, the framework will re-execute the task, but will skip any RPC calls
|
|||
that the task made successfully before failing. Instead, the framework will pretend to make the
|
||||
call, but will instead return the results from the previous call. This is possible because durable
|
||||
execution frameworks log all RPCs and state changes to durable storage like a write-ahead log
|
||||
[[^45], [^46]].
|
||||
[^45] [^46].
|
||||
[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
|
||||
using Temporal.
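
To make the replay idea concrete, here is a minimal sketch that is independent of any particular framework. It is not Temporal's API: the class `DurableContext` and the functions `charge_card` and `send_receipt` are hypothetical, and the log is kept in an in-memory list, whereas a real framework would persist it to durable storage.

```python
class DurableContext:
    """Records the result of each call so that a retry can replay it."""

    def __init__(self, log=None):
        self.log = list(log or [])  # results of calls completed in earlier attempts
        self._position = 0

    def call(self, fn, *args):
        if self._position < len(self.log):
            # This call already succeeded earlier: return the logged result.
            result = self.log[self._position]
        else:
            result = fn(*args)
            self.log.append(result)
        self._position += 1
        return result


calls_made = []  # tracks which side effects actually happened


def charge_card(amount):
    calls_made.append(("charge", amount))
    return "charge-ok"


def send_receipt(charge_result):
    calls_made.append(("receipt", charge_result))
    return "receipt-sent"


def workflow(ctx, crash_after_charge=False):
    charge_result = ctx.call(charge_card, 100)
    if crash_after_charge:
        raise RuntimeError("simulated crash")
    return ctx.call(send_receipt, charge_result)


first_attempt = DurableContext()
try:
    workflow(first_attempt, crash_after_charge=True)
except RuntimeError:
    pass

# Re-execute from the start with the saved log: the charge is not repeated.
retry = DurableContext(log=first_attempt.log)
outcome = workflow(retry)
```

Replaying by position only works if the workflow makes the same calls in the same order on every execution, which is why such frameworks require workflow code to be deterministic.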
|
||||
|
||||
|
|
@ -1109,7 +1109,7 @@ Message brokers typically don’t enforce any particular data model—a message
|
|||
bytes with some metadata, so you can use any encoding format. A common approach is to use Protocol
|
||||
Buffers, Avro, or JSON, and to deploy a schema registry alongside the message broker to store all
|
||||
the valid schema versions and check their compatibility
|
||||
[[^19], [^21]].
|
||||
[^19] [^21].
|
||||
AsyncAPI, a messaging-based equivalent of OpenAPI, can also be used to specify the schema of
|
||||
messages.
|
||||
|
||||
|
|
@ -1197,7 +1197,7 @@ quite achievable. May your application’s evolution be rapid and your deploymen
|
|||
|
||||
|
||||
|
||||
### Summary
|
||||
### References
|
||||
|
||||
|
||||
|
||||
|
|
|
|||
|
|
@ -221,9 +221,7 @@ for live queries. Storing database data in object storage has many benefits:
|
|||
* Object stores also provide multi-zone, dual-region, or multi-region replication with very high
|
||||
durability guarantees. This also allows databases to bypass inter-zone network fees.
|
||||
* Databases can use an object store’s *conditional write* feature—essentially, a *compare-and-set*
|
||||
(CAS) operation—to implement transactions and leadership election
|
||||
[[10](/ch06.html#Morling2024_ch6),
|
||||
[11](/ch06.html#Chandramohan2024)]).
|
||||
(CAS) operation—to implement transactions and leadership election [^10] [^11].
|
||||
* Storing data from multiple databases in the same object store can simplify data integration,
|
||||
particularly when open formats such as Apache Parquet and Apache Iceberg are used.
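
The conditional-write bullet above can be illustrated with a small sketch of leader election. The class `InMemoryObjectStore` and its `put_if_version` method are hypothetical stand-ins for an object store's conditional write; real services expose this through their own mechanisms, and a practical election would also need lease expiry, which is omitted here.

```python
import threading


class InMemoryObjectStore:
    """Hypothetical object store whose writes can be made conditional on a version."""

    def __init__(self):
        self._objects = {}  # key -> (version, value)
        self._lock = threading.Lock()

    def get(self, key):
        with self._lock:
            return self._objects.get(key, (0, None))

    def put_if_version(self, key, expected_version, value):
        """Compare-and-set: write only if the object is still at expected_version."""
        with self._lock:
            current_version, _ = self._objects.get(key, (0, None))
            if current_version != expected_version:
                return False  # someone else wrote first
            self._objects[key] = (current_version + 1, value)
            return True


store = InMemoryObjectStore()

# Two nodes both observe that there is no leader yet (version 0)...
version_seen_by_a, _ = store.get("leader")
version_seen_by_b, _ = store.get("leader")

# ...and both try to claim leadership. Only the first conditional write succeeds.
a_won = store.put_if_version("leader", version_seen_by_a, "node-a")
b_won = store.put_if_version("leader", version_seen_by_b, "node-b")
```

Because the second write is rejected rather than silently overwriting the first, at most one node can believe it won the election.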
|
||||
|
||||
|
|
@ -420,9 +418,7 @@ heap into a consistent state, we can use the exact same log to build a replica o
|
|||
besides writing the log to disk, the leader also sends it across the network to its followers. When
|
||||
the follower processes this log, it builds a copy of the exact same files as found on the leader.
|
||||
|
||||
This method of replication is used in PostgreSQL and Oracle, among others
|
||||
[[17](/ch06.html#Suzuki2017_ch6),
|
||||
[18](/ch06.html#Kapila2012)].
|
||||
This method of replication is used in PostgreSQL and Oracle, among others [^17] [^18].
|
||||
The main disadvantage is that the log describes the data on a very low level: a WAL contains details
|
||||
of which bytes were changed in which disk blocks. This makes replication tightly coupled to the
|
||||
storage engine. If the database changes its storage format from one version to another, it is
|
||||
|
|
@ -915,10 +911,7 @@ Moreover, many modern web apps offer *real-time collaboration* features, such as
|
|||
Sheets for text documents and spreadsheets, Figma for graphics, and Linear for project management.
|
||||
What makes these apps so responsive is that user input is immediately reflected in the user
|
||||
interface, without waiting for a network round-trip to the server, and edits by one user are shown
|
||||
to their collaborators with low latency
|
||||
[[32](/ch06.html#DayRichter2010),
|
||||
[33](/ch06.html#Wallace2019),
|
||||
[34](/ch06.html#Artman2023)].
|
||||
to their collaborators with low latency [^32] [^33] [^34].
|
||||
|
||||
This again results in a multi-leader architecture: each web browser tab that has opened the shared
|
||||
file is a replica, and any updates that you make to the file are asynchronously replicated to the
|
||||
|
|
@ -935,19 +928,14 @@ multiple users have changed the file concurrently, conflict resolution logic may
|
|||
those changes.
|
||||
|
||||
A software library that supports this process is called a *sync engine*. Although the idea has
|
||||
existed for a long time, the term has recently gained attention
|
||||
[[35](/ch06.html#Saafan2024),
|
||||
[36](/ch06.html#Hagoel2024),
|
||||
[37](/ch06.html#Jayakar2024)].
|
||||
existed for a long time, the term has recently gained attention [^35] [^36] [^37].
|
||||
An application that allows a user to continue editing a file while offline (which may be implemented
|
||||
using a sync engine) is called *offline-first*
|
||||
[^38].
|
||||
using a sync engine) is called *offline-first* [^38].
|
||||
The term *local-first software* refers to collaborative apps that are not only offline-first, but
|
||||
are also designed to continue working even if the developer who made the software shuts down all of
|
||||
their online services [^39].
|
||||
This can be achieved by using a sync engine with an open standard sync protocol for which multiple
|
||||
service providers are available
|
||||
[^40].
|
||||
service providers are available [^40].
|
||||
For example, Git is a local-first collaboration system (albeit one that doesn’t support real-time
|
||||
collaboration) since you can sync via GitHub, GitLab, or any other repository hosting service.
|
||||
|
||||
|
|
@ -1243,20 +1231,16 @@ writes in the same order.
|
|||
|
||||
Some data storage systems take a different approach, abandoning the concept of a leader and
|
||||
allowing any replica to directly accept writes from clients. Some of the earliest replicated data
|
||||
systems were leaderless [[1](/ch06.html#Lindsay1979_ch6),
|
||||
[50](/ch06.html#Gifford1979)], but the
|
||||
idea was mostly forgotten during the era of dominance of relational databases. It once again became
|
||||
systems were leaderless [^1] [^50], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became
|
||||
a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in
|
||||
2007 [^45].
|
||||
Riak, Cassandra, and ScyllaDB are open source datastores with leaderless replication models inspired
|
||||
by Dynamo, so this kind of database is also known as *Dynamo-style*.
|
||||
|
||||
> [!NOTE]
|
||||
> The original *Dynamo* system was only described in a paper
|
||||
> [^45], but never released outside of
|
||||
> Amazon. The similarly-named *DynamoDB* is a more recent cloud database from AWS, but it has a
|
||||
> completely different architecture: it uses single-leader replication based on the Multi-Paxos
|
||||
> consensus algorithm [^5].
|
||||
> The original *Dynamo* system was only described in a paper [^45], but never released outside of Amazon.
|
||||
> The similarly-named *DynamoDB* is a more recent cloud database from AWS, but it has a completely different architecture:
|
||||
> it uses single-leader replication based on the Multi-Paxos consensus algorithm [^5].
|
||||
|
||||
In some leaderless implementations, the client directly sends its writes to several replicas, while
|
||||
in others, a coordinator node does this on behalf of the client. However, unlike a leader database,
|
||||
|
|
@ -1707,15 +1691,9 @@ replica increments its own version number when processing a write, and also keep
|
|||
version numbers it has seen from each of the other replicas. This information indicates which values
|
||||
to overwrite and which values to keep as siblings.
|
||||
|
||||
The collection of version numbers from all the replicas is called a *version vector*
|
||||
[^58].
|
||||
A few variants of this idea are in use, but the most interesting is probably the *dotted version
|
||||
vector*
|
||||
[[59](/ch06.html#Preguica2010),
|
||||
[60](/ch06.html#Manepalli2022)],
|
||||
which is used in Riak 2.0
|
||||
[[61](/ch06.html#Cribbs2014),
|
||||
[62](/ch06.html#Brown2015)].
|
||||
The collection of version numbers from all the replicas is called a *version vector* [^58].
|
||||
A few variants of this idea are in use, but the most interesting is probably the *dotted version vector* [^59] [^60],
|
||||
which is used in Riak 2.0 [^61] [^62].
|
||||
We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.
|
||||
|
||||
Like the version numbers in [Figure 6-15](/ch06.html#fig_replication_causality_single), version vectors are sent from the
|
||||
|
|
@ -1731,10 +1709,7 @@ siblings are merged correctly.
|
|||
# Version vectors and vector clocks
|
||||
|
||||
A *version vector* is sometimes also called a *vector clock*, even though they are not quite the
|
||||
same. The difference is subtle—please see the references for details
|
||||
[[60](/ch06.html#Manepalli2022),
|
||||
[63](/ch06.html#Baquero2011),
|
||||
[64](/ch06.html#Schwarz1994)]. In brief, when
|
||||
same. The difference is subtle—please see the references for details [^60] [^63] [^64]. In brief, when
|
||||
comparing the state of replicas, version vectors are the right data structure to use.
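
A minimal sketch of how two version vectors might be compared and merged is shown below. It represents a vector as a dictionary from replica ID to version number, which is an assumption of this example, and it covers plain version vectors only, not the dotted variant.

```python
def compare(a, b):
    """Return how version vector a relates to version vector b."""
    replicas = set(a) | set(b)
    a_covers_b = all(a.get(r, 0) >= b.get(r, 0) for r in replicas)
    b_covers_a = all(b.get(r, 0) >= a.get(r, 0) for r in replicas)
    if a_covers_b and b_covers_a:
        return "equal"
    if a_covers_b:
        return "a_newer"  # a has seen everything b has: b can be overwritten
    if b_covers_a:
        return "b_newer"
    return "concurrent"  # neither covers the other: keep both as siblings


def merge(a, b):
    """Element-wise maximum: the vector after both states have been merged."""
    return {r: max(a.get(r, 0), b.get(r, 0)) for r in set(a) | set(b)}


newer = compare({"A": 2, "B": 1}, {"A": 1, "B": 1})
siblings = compare({"A": 2}, {"B": 1})
merged = merge({"A": 2}, {"B": 1})
```

A replica that is missing from a vector is treated as version 0, so vectors from replicas that have never heard of each other compare as concurrent.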
|
||||
|
||||
## Summary
|
||||
|
|
@ -1760,8 +1735,7 @@ Despite being a simple goal—keeping a copy of the same data on several machine
|
|||
to be a remarkably tricky problem. It requires carefully thinking about concurrency and about all
|
||||
the things that can go wrong, and dealing with the consequences of those faults. At a minimum, we
|
||||
need to deal with unavailable nodes and network interruptions (and that’s not even considering the
|
||||
more insidious kinds of fault, such as silent data corruption due to software bugs or hardware
|
||||
errors).
|
||||
more insidious kinds of fault, such as silent data corruption due to software bugs or hardware errors).
|
||||
|
||||
We discussed three main approaches to replication:
|
||||
|
||||
|
|
@ -1817,7 +1791,7 @@ machine to store only a subset of the data.
|
|||
|
||||
|
||||
|
||||
### Summary
|
||||
### References
|
||||
|
||||
|
||||
[^1]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)
|
||||
|
|
|
|||
|
|
@ -51,7 +51,7 @@ Some databases treat partitions and shards as two distinct concepts. For example
|
|||
partitioning is a way of splitting a large table into several files that are stored on the same
|
||||
machine (which has several advantages, such as making it very fast to delete an entire partition),
|
||||
whereas sharding splits a dataset across multiple machines
|
||||
[[^1], [^2]].
|
||||
[^1] [^2].
|
||||
In many other systems, partitioning is just another word for sharding.
|
||||
|
||||
While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to
|
||||
|
|
@ -408,7 +408,7 @@ to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_
|
|||
per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between
|
||||
those ranges. This means some ranges are bigger than others, but by having multiple ranges per node
|
||||
those imbalances tend to even out
|
||||
[[^15], [^18]].
|
||||
[^15] [^18].
|
||||
|
||||

|
||||
|
||||
|
|
@ -459,7 +459,7 @@ This event can result in a large volume of reads and writes to the same key (whe
|
|||
is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on).
|
||||
|
||||
In such situations, a more flexible sharding policy is required
|
||||
[[^25], [^26]].
|
||||
[^25] [^26].
|
||||
A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put
|
||||
an individual hot key in a shard of its own, and perhaps even to assign it a dedicated machine [^27].
|
||||
|
||||
|
|
@ -502,7 +502,7 @@ Fully automated rebalancing can be convenient, because there is less operational
|
|||
normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud
|
||||
databases such as DynamoDB are promoted as being able to automatically add and remove shards to
|
||||
adapt to big increases or decreases of load within a matter of minutes
|
||||
[[^17], [^29]].
|
||||
[^17] [^29].
|
||||
|
||||
However, automatic shard management can also be unpredictable. Rebalancing is an expensive
|
||||
operation, because it requires rerouting requests and moving a large amount of data from one node to
|
||||
|
|
@ -779,7 +779,7 @@ that question in the following chapters.
|
|||
|
||||
|
||||
|
||||
### Summary
|
||||
### References
|
||||
|
||||
|
||||
[^1]: Claire Giordano. [Understanding partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023. Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959)
|
||||
|
|
|
|||
|
|
@ -67,7 +67,7 @@ the challenge of achieving atomicity in a distributed transaction.
|
|||
|
||||
Almost all relational databases today, and some nonrelational databases, support transactions. Most
|
||||
of them follow the style that was introduced in 1975 by IBM System R, the first SQL database
|
||||
[[^2], [^3], [^4]].
|
||||
[^2] [^3] [^4].
|
||||
Although some implementation details have changed, the general idea has remained virtually the same
|
||||
for 50 years: the transaction support in MySQL, PostgreSQL, Oracle, SQL Server, etc., is uncannily
|
||||
similar to that of System R.
|
||||
|
|
@ -214,7 +214,7 @@ However, serializability has a performance cost. In practice, many databases use
|
|||
that are weaker than serializability: that is, they allow concurrent transactions to interfere with
|
||||
each other in limited ways. Some popular databases, such as Oracle, don’t even implement it (Oracle
|
||||
has an isolation level called “serializable,” but it actually implements *snapshot isolation*, which
|
||||
is a weaker guarantee than serializability [[^10], [^14]]).
|
||||
is a weaker guarantee than serializability [^10] [^14]).
|
||||
This means that some kinds of race conditions can still occur. We will explore snapshot isolation
|
||||
and other forms of isolation in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels).
|
||||
|
||||
|
|
@ -254,37 +254,22 @@ The truth is, nothing is perfect:
|
|||
* In an asynchronously replicated system, recent writes may be lost when the leader becomes
|
||||
unavailable (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)).
|
||||
* When the power is suddenly cut, SSDs in particular have been shown to sometimes violate the
|
||||
guarantees they are supposed to provide: even `fsync` isn’t guaranteed to work correctly
|
||||
[^15].
|
||||
Disk firmware can have bugs, just like any other kind of software
|
||||
[[^16],
|
||||
[^17]],
|
||||
e.g. causing drives to fail after exactly 32,768 hours of operation
|
||||
[^18].
|
||||
And `fsync` is hard to use; even PostgreSQL used it incorrectly for over 20 years
|
||||
[[^19],
|
||||
[^20],
|
||||
[^21]].
|
||||
guarantees they are supposed to provide: even `fsync` isn’t guaranteed to work correctly [^15].
|
||||
Disk firmware can have bugs, just like any other kind of software [^16] [^17],
|
||||
e.g. causing drives to fail after exactly 32,768 hours of operation [^18].
|
||||
And `fsync` is hard to use; even PostgreSQL used it incorrectly for over 20 years [^19] [^20] [^21].
|
||||
* Subtle interactions between the storage engine and the filesystem implementation can lead to bugs
|
||||
that are hard to track down, and may cause files on disk to be corrupted after a crash
|
||||
[[^22],
|
||||
[^23]].
|
||||
Filesystem errors on one replica can sometimes spread to other replicas as well
|
||||
[^24].
|
||||
* Data on disk can gradually become corrupted without this being detected
|
||||
[^25].
|
||||
that are hard to track down, and may cause files on disk to be corrupted after a crash [^22] [^23].
|
||||
Filesystem errors on one replica can sometimes spread to other replicas as well [^24].
|
||||
* Data on disk can gradually become corrupted without this being detected [^25].
|
||||
If data has been corrupted for some time, replicas and recent backups may also be corrupted. In
|
||||
this case, you will need to try to restore the data from a historical backup.
|
||||
* One study of SSDs found that between 30% and 80% of drives develop at least one bad block during
|
||||
the first four years of operation, and only some of these can be corrected by the firmware
|
||||
[^26].
|
||||
Magnetic hard drives have a lower rate of bad sectors, but a higher rate of complete failure than
|
||||
SSDs.
|
||||
the first four years of operation, and only some of these can be corrected by the firmware [^26].
|
||||
Magnetic hard drives have a lower rate of bad sectors, but a higher rate of complete failure than SSDs.
|
||||
* When a worn-out SSD (that has gone through many write/erase cycles) is disconnected from power,
|
||||
it can start losing data within a timescale of weeks to months, depending on the temperature
|
||||
[^27].
|
||||
This is less of a problem for drives with lower wear levels
|
||||
[^28].
|
||||
it can start losing data within a timescale of weeks to months, depending on the temperature [^27].
|
||||
This is less of a problem for drives with lower wear levels [^28].
|
||||
|
||||
In practice, there is no one technique that can provide absolute guarantees. There are only various
|
||||
risk-reduction techniques, including writing to disk, replicating to remote machines, and
|
||||
|
|
@ -489,7 +474,7 @@ nevertheless used in practice [^29].
|
|||
|
||||
Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have
|
||||
caused substantial loss of money
|
||||
[[^30], [^31], [^32]],
|
||||
[^30] [^31] [^32],
|
||||
led to investigation by financial auditors [^33],
|
||||
and caused customer data to be corrupted [^34].
|
||||
A popular comment on revelations of such problems is “Use an ACID database if you’re handling
|
||||
|
|
@ -515,7 +500,7 @@ decide what level is appropriate to your application. Once we’ve done that, we
|
|||
serializability in detail (see [“Serializability”](/en/ch8#sec_transactions_serializability)). Our discussion of isolation
|
||||
levels will be informal, using examples. If you want rigorous definitions and analyses of their
|
||||
properties, you can find them in the academic literature
|
||||
[[^36], [^37], [^38], [^39]].
|
||||
[^36] [^37] [^38] [^39].
|
||||
|
||||
## Read Committed
|
||||
|
||||
|
|
@ -690,7 +675,7 @@ database, frozen at a particular point in time, it is much easier to understand.
|
|||
|
||||
Snapshot isolation is a popular feature: variants of it are supported by PostgreSQL, MySQL with the
|
||||
InnoDB storage engine, Oracle, SQL Server, and others, although the detailed behavior varies from
|
||||
one system to the next [[^29], [^40], [^41]].
|
||||
one system to the next [^29] [^40] [^41].
|
||||
Some databases, such as Oracle, TiDB, and Aurora DSQL, even choose snapshot isolation as their
|
||||
highest isolation level.
|
||||
|
||||
|
|
@ -713,7 +698,7 @@ maintains several versions of a row side by side, this technique is known as *mu
|
|||
concurrency control* (MVCC).
|
||||
|
||||
[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL
|
||||
[[^40], [^42], [^43]] (other implementations are similar).
|
||||
[^40] [^42] [^43] (other implementations are similar).
|
||||
When a transaction is started, it is given a unique, always-increasing transaction ID (`txid`).
|
||||
Whenever a transaction writes anything to the database, the data it writes is tagged with the
|
||||
transaction ID of the writer. (To be precise, transaction IDs in PostgreSQL are 32-bit integers, so
|
||||
|
|
@ -742,7 +727,7 @@ All of the versions of a row are stored within the same database heap (see
|
|||
[“Storing values within the index”](/en/ch4#sec_storage_index_heap)), regardless of whether the transactions that wrote them have committed
|
||||
or not. The versions of the same row form a linked list, going either from newest version to oldest
|
||||
version or the other way round, so that queries can internally iterate over all versions of a row
|
||||
[[^45], [^46]].
|
||||
[^45] [^46].
|
||||
|
||||
### Visibility rules for observing a consistent snapshot
|
||||
|
||||
|
|
@ -790,7 +775,7 @@ value matches what the query is looking for. When garbage collection removes old
|
|||
are no longer visible to any transaction, the corresponding index entries can also be removed.
|
||||
|
||||
Many implementation details affect the performance of multi-version concurrency control
|
||||
[[^45], [^46]].
|
||||
[^45] [^46].
|
||||
For example, PostgreSQL has optimizations for avoiding index updates if different versions of the
|
||||
same row can fit on the same page [^40].
|
||||
Some other databases avoid storing full copies of modified rows, and only store differences between
|
||||
|
|
@ -829,7 +814,7 @@ Unfortunately, the SQL standard’s definition of isolation levels is flawed—i
|
|||
imprecise, and not as implementation-independent as a standard should be [^36]. Even though several databases
|
||||
implement repeatable read, there are big differences in the guarantees they actually provide,
|
||||
despite being ostensibly standardized [^29]. There has been a formal definition of
|
||||
repeatable read in the research literature [[^37], [^38]], but most implementations don’t satisfy that
|
||||
repeatable read in the research literature [^37] [^38], but most implementations don’t satisfy that
|
||||
formal definition. And to top it off, IBM Db2 uses “repeatable read” to refer to serializability [^10].
|
||||
|
||||
As a result, nobody really knows what repeatable read means.
|
||||
|
|
@ -884,7 +869,7 @@ Another option is to simply force all atomic operations to be executed on a sing
|
|||
|
||||
Unfortunately, object-relational mapping (ORM) frameworks make it easy to accidentally write code
|
||||
that performs unsafe read-modify-write cycles instead of using atomic operations provided by the
|
||||
database [[^49], [^50], [^51]].
|
||||
database [^49] [^50] [^51].
|
||||
This can be a source of subtle bugs that are difficult to find by testing.
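
The difference between the two patterns can be seen in a small sketch using Python's built-in `sqlite3` module. The table and key names are made up, and the two "clients" are simulated by interleaving statements on a single connection rather than by real concurrent transactions, so this only illustrates the shape of the bug.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE counters (key TEXT PRIMARY KEY, value INTEGER)")
conn.execute("INSERT INTO counters VALUES ('likes', 42)")


def read_value():
    row = conn.execute("SELECT value FROM counters WHERE key = 'likes'").fetchone()
    return row[0]


# Unsafe read-modify-write: both clients read 42 before either one writes.
seen_by_a = read_value()
seen_by_b = read_value()
conn.execute("UPDATE counters SET value = ? WHERE key = 'likes'", (seen_by_a + 1,))
conn.execute("UPDATE counters SET value = ? WHERE key = 'likes'", (seen_by_b + 1,))
after_unsafe = read_value()  # one of the two increments has been lost

# Atomic operation: the database computes the new value from the current one.
conn.execute("UPDATE counters SET value = 42 WHERE key = 'likes'")  # reset
conn.execute("UPDATE counters SET value = value + 1 WHERE key = 'likes'")
conn.execute("UPDATE counters SET value = value + 1 WHERE key = 'likes'")
after_atomic = read_value()
```

An ORM that loads an object, changes a field in application code, and saves it back tends to generate the first pattern unless it is told to issue the increment as an expression in the `UPDATE` statement.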
|
||||
|
||||
### Explicit locking
|
||||
|
|
@ -940,8 +925,8 @@ An advantage of this approach is that databases can perform this check efficient
|
|||
with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL
|
||||
Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort
|
||||
the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates
|
||||
[[^29], [^41]].
|
||||
Some authors [[^36], [^38]] argue that a database must prevent lost
|
||||
[^29] [^41].
|
||||
Some authors [^36] [^38] argue that a database must prevent lost
|
||||
updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot
|
||||
isolation under this definition.
|
||||
|
||||
|
|
@ -1023,7 +1008,7 @@ To begin, imagine this example: you are writing an application for doctors to ma
|
|||
shifts at a hospital. The hospital usually tries to have several doctors on call at any one time,
|
||||
but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if
|
||||
they are sick themselves), provided that at least one colleague remains on call in that shift
|
||||
[[^53], [^54]].
|
||||
[^53] [^54].
|
||||
|
||||
Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are
|
||||
feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button
|
||||
|
|
@ -1184,7 +1169,7 @@ transaction, is called a *phantom* [^4].
|
|||
Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the
|
||||
examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL
|
||||
generated by ORMs is also prone to write skew
|
||||
[[^50], [^51]].
|
||||
[^50] [^51].
|
||||
|
||||
### Materializing conflicts
|
||||
|
||||
|
|
@ -1271,7 +1256,7 @@ Two developments caused this rethink:
 outside of the serial execution loop.

 The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic,
-for example [[^58], [^59], [^60]].
+for example [^58] [^59] [^60].
 A system designed for single-threaded execution can sometimes perform better than a system that
 supports concurrency, because it can avoid the coordination overhead of locking. However, its
 throughput is limited to that of a single CPU core. In order to make the most of that single thread,
@ -1541,7 +1526,7 @@ becomes serializable.
 Unfortunately, predicate locks do not perform well: if there are many locks by active transactions,
 checking for matching locks becomes time-consuming. For that reason, most databases with 2PL
 actually implement *index-range locking* (also known as *next-key locking*), which is a simplified
-approximation of predicate locking [[^54], [^64]].
+approximation of predicate locking [^54] [^64].

 It’s safe to simplify a predicate by making it match a greater set of objects. For example, if you
 have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by
@ -1585,7 +1570,7 @@ serializable isolation and good performance fundamentally at odds with each othe
 It seems not: an algorithm called *serializable snapshot isolation* (SSI) provides full
 serializability with only a small performance penalty compared to snapshot isolation. SSI is
 comparatively new: it was first described in 2008
-[[^53], [^65]].
+[^53] [^65].

 Today SSI and similar algorithms are used in single-node databases (the serializable isolation level
 in PostgreSQL [^54], SQL Server’s In-Memory
@ -1733,7 +1718,7 @@ tracking is faster, but may lead to more transactions being aborted than strictl
 In some cases, it’s okay for a transaction to read information that was overwritten by another
 transaction: depending on what else happened, it’s sometimes possible to prove that the result of
 the execution is nevertheless serializable. PostgreSQL uses this theory to reduce the number of
-unnecessary aborts [[^14], [^54]].
+unnecessary aborts [^14] [^54].

 Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one
 transaction doesn’t need to block waiting for locks held by another transaction. Like under snapshot
@ -1824,12 +1809,12 @@ problem.

 Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It
 is a classic algorithm in distributed databases
-[[^13], [^71], [^72]]. 2PC is used
+[^13] [^71] [^72]. 2PC is used
 internally in some databases and also made available to applications in the form of *XA transactions*
 [^73]
 (which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP
 web services
-[[^74], [^75]].
+[^74] [^75].

 The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). Instead of a single
 commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two
@ -1958,7 +1943,7 @@ stuck waiting for the coordinator to recover. It is possible to make an atomic c
 is not so straightforward.

 As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed
-[[^13], [^77]].
+[^13] [^77].
 However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most
 practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it
 cannot guarantee atomicity.
@ -1971,7 +1956,7 @@ consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_cons
 Distributed transactions and two-phase commit have a mixed reputation. On the one hand, they are
 seen as providing an important safety guarantee that would be hard to achieve otherwise; on the
 other hand, they are criticized for causing operational problems, killing performance, and promising
-more than they can deliver [[^78], [^79], [^80], [^81]].
+more than they can deliver [^78] [^79] [^80] [^81].
 Many cloud services choose not to implement distributed transactions due to the operational
 problems they engender [^82].

@ -2089,7 +2074,7 @@ transaction is resolved.

 In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the
 log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do
-occur [[^83], [^84]]—that is,
+occur [^83] [^84]—that is,
 transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because
 the transaction log has been lost or corrupted due to a software bug). These transactions cannot be
 resolved automatically, so they sit forever in the database, holding locks and blocking other
@ -2326,7 +2311,7 @@ is used.



-### Summary
+### References



@ -22,7 +22,7 @@ anything that *can* go wrong *will* go wrong.

 Moreover, working with distributed systems is fundamentally different from writing software on a
 single computer—and the main difference is that there are lots of new and exciting ways for things
-to go wrong [[^1], [^2]].
+to go wrong [^1] [^2].
 In this chapter, you will get a taste of the problems that arise in practice, and an understanding
 of the things you can and cannot rely on.

@ -197,7 +197,7 @@ even in controlled environments like a datacenter operated by one company [^8]:
 (though shark bites have become rarer due to better shielding of submarine cables [^14]).
 Humans are also at fault, be it due to accidental misconfiguration [^15], scavenging [^16], or sabotage [^17].
 * Across different cloud regions, round-trip times of up to several *minutes* have been observed at
-high percentiles [[^18], Table 3].
+high percentiles [^18].
 Even within a single datacenter, packet delay of more than a minute can occur during a network
 topology reconfiguration, triggered by a problem during a software upgrade for a switch
 [^19].
@ -364,7 +364,7 @@ network links and switches, and even each machine’s network interface and CPUs
 virtual machines), are shared. Processing large amounts of data can use the entire capacity of
 network links (*saturate* them). As you have no control over or insight into other customers’ usage of the shared
 resources, network delays can be highly variable if someone near you (a *noisy neighbor*) is
-using a lot of resources [[^30], [^31]].
+using a lot of resources [^30] [^31].

 In such environments, you can only choose timeouts experimentally: measure the distribution of
 network round-trip times over an extended period, and over many machines, to determine the expected
@ -665,7 +665,7 @@ fixed. On the other hand, if its quartz clock is defective or its NTP client is
 things will seem to work fine, even though its clock gradually drifts further and further away from
 reality. If some piece of software is relying on an accurately synchronized clock, the result is
 more likely to be silent and subtle data loss than a dramatic crash
-[[^62], [^63]].
+[^62] [^63].

 Thus, if you use software that requires synchronized clocks, it is essential that you also carefully
 monitor the clock offsets between all the machines. Any node whose clock drifts too far from the
@ -715,8 +715,7 @@ serious problems:

 * Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite
 values previously written by a node with a fast clock until the clock skew between the nodes has
-elapsed [[^63],
-[^65]].
+elapsed [^63] [^65].
 This scenario can cause arbitrary amounts of data to be silently dropped without any error being
 reported to the application.
 * LWW cannot distinguish between writes that occurred sequentially in quick succession (in
@ -812,7 +811,7 @@ the synchronization good enough, they would have the right properties: later tra
 higher timestamp. The problem, of course, is the uncertainty about clock accuracy.

 Spanner implements snapshot isolation across datacenters in this way
-[[^68], [^69]].
+[^68] [^69].
 It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the
 following observation: if you have two confidence intervals, each consisting of an earliest and
 latest possible timestamp (*A* = [*Aearliest*, *Alatest*] and
@ -1011,11 +1010,11 @@ handle requests from clients while one node is collecting its garbage. If the ru
 application that a node soon requires a GC pause, the application can stop sending new requests to
 that node, wait for it to finish processing outstanding requests, and then perform the GC while no
 requests are in progress. This trick hides GC pauses from clients and reduces the high percentiles
-of the response time [[^80], [^81]].
+of the response time [^80] [^81].

 A variant of this idea is to use the garbage collector only for short-lived objects (which are fast
 to collect) and to restart processes periodically, before they accumulate enough long-lived objects
-to require a full GC of long-lived objects [[^79], [^82]].
+to require a full GC of long-lived objects [^79] [^82].
 One node can be restarted at a time, and traffic can be shifted away from the node before the
 planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding)).

@ -1120,7 +1119,7 @@ could be lost or corrupted data, which is much more serious.

 For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect
 implementation of locking. (The bug is not theoretical: HBase used to have this problem
-[[^85], [^86]].)
+[^85] [^86].)
 Say you want to ensure that a file in a storage service can only be
 accessed by one client at a time, because if multiple clients tried to write to it, the file would
 become corrupted. You try to implement this by requiring a client to obtain a lease from a lock
@ -1207,7 +1206,7 @@ services support such a check: Amazon S3 calls it *conditional writes*, Azure Bl

 If your clients need to write only to one storage service that supports such conditional writes, the
 lock service is somewhat redundant
-[[^91], [^92]],
+[^91] [^92],
 since the lease assignment could have been implemented directly based on that storage service [^93].
 However, once you have a fencing token you can also use it with multiple services or replicas, and
 ensure that the old leaseholder is fenced off on all of those services.
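The idea of fencing off an old leaseholder on several services can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the API of any real storage service: `FencedStorage`, `StaleTokenError`, and `write_everywhere` are hypothetical names, tokens are assumed to be integers that increase each time the lease is granted, and each service simply remembers the highest token it has seen.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class FencedStorage:
    """Hypothetical storage service that checks a fencing token on every write."""

    def __init__(self):
        self._highest_token = 0  # highest fencing token accepted so far
        self._files = {}

    def write(self, filename, data, token):
        # Reject writes from a leaseholder whose lease has been superseded.
        if token < self._highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self._highest_token}"
            )
        self._highest_token = token
        self._files[filename] = data

    def read(self, filename):
        return self._files[filename]


def write_everywhere(services, filename, data, token):
    """Send the same write, with the same token, to every service or replica."""
    for service in services:
        service.write(filename, data, token)
```

Once a client holding token 34 has written to every service, a paused client that wakes up still holding token 33 is rejected by each of them independently, without the services having to coordinate with one another.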
@ -1286,8 +1285,7 @@ with the network. This concern is relevant in certain specific circumstances. Fo
 by radiation, leading it to respond to other nodes in arbitrarily unpredictable ways. Since a
 system failure would be very expensive (e.g., an aircraft crashing and killing everyone on board,
 or a rocket colliding with the International Space Station), flight control systems must tolerate
-Byzantine faults [[^98],
-[^99]].
+Byzantine faults [^98] [^99].
 * In a system with multiple participating parties, some participants may attempt to cheat or
 defraud others. In such circumstances, it is not safe for a node to simply trust another node’s
 messages, since they may be sent with malicious intent. For example, cryptocurrencies like
@ -1311,7 +1309,7 @@ escaping are so important: to prevent SQL injection and cross-site scripting, fo
 we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the
 authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where
 there is no such central authority, Byzantine fault tolerance is more relevant
-[[^103], [^104]].
+[^103] [^104].

 A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to
 all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant
@ -1336,9 +1334,7 @@ pragmatic steps toward better reliability. For example:

 * Network packets do sometimes get corrupted due to hardware issues or bugs in operating systems,
 drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and
-UDP, but sometimes they evade detection [[^105],
-[^106],
-[^107]].
+UDP, but sometimes they evade detection [^105] [^106] [^107].
 Simple measures are usually sufficient protection against such corruption, such as checksums in
 the application-level protocol. TLS-encrypted connections also offer protection against
 corruption.
@ -1542,7 +1538,7 @@ It is prudent to combine theoretical analysis with empirical testing to verify t
 behave as expected. Techniques such as property-based testing, fuzzing, and deterministic simulation
 testing (DST) use randomization to test a system in a wide range of situations. Companies such as
 Amazon Web Services have successfully used a combination of these techniques on many of their
-products [[^120], [^121]].
+products [^120] [^121].

 ### Model checking and specification languages

@ -1563,7 +1559,7 @@ longer executions would then not be found.
 Still, model checkers strike a nice balance between ease of use and the ability to find non-obvious
 bugs. CockroachDB, TiDB, Kafka, and many other distributed systems use model specifications to find
 and fix bugs
-[[^122], [^123], [^124]]. For example,
+[^122] [^123] [^124]. For example,
 using TLA+, researchers were able to demonstrate the potential for data loss in viewstamped
 replication (VR) caused by ambiguity in the prose description of the algorithm [^125].

@ -1601,7 +1597,7 @@ It’s common to adopt a fault injection framework like Jepsen to run fault inje
 simplify the process. Such frameworks come with integrations for various operating systems and many
 pre-built fault injectors [^129].
 Jepsen has been remarkably effective at finding critical bugs in many widely-used systems
-[[^130], [^131]].
+[^130] [^131].

 ### Deterministic simulation testing

@ -1750,7 +1746,7 @@ problems in distributed systems.



-### Summary
+### References

 [^1]: Mark Cavage. [There’s Just No Getting Around It: You’re Building a Distributed System](https://queue.acm.org/detail.cfm?id=2482856). *ACM Queue*, volume 11, issue 4, pages 80-89, April 2013. [doi:10.1145/2466486.2482856](https://doi.org/10.1145/2466486.2482856)
 [^2]: Jay Kreps. [Getting Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW)