From 5c1bef6008d22a8845521e627d5c963c5951277f Mon Sep 17 00:00:00 2001 From: Feng Ruohang Date: Sat, 9 Aug 2025 16:38:18 +0800 Subject: [PATCH] update v2 missing warning --- content/en/author.md | 16 ---- content/en/ch11.md | 5 +- content/en/ch12.md | 5 +- content/en/ch13.md | 5 +- content/en/ch6.md | 190 ++++++++++++++++++++--------------------- content/en/colophon.md | 11 ++- content/en/glossary.md | 4 + content/en/part-i.md | 5 +- content/en/part-ii.md | 31 ++++--- content/en/part-iii.md | 5 +- content/en/preface.md | 5 +- content/en/toc.md | 4 - 12 files changed, 144 insertions(+), 142 deletions(-) delete mode 100644 content/en/author.md diff --git a/content/en/author.md b/content/en/author.md deleted file mode 100644 index baa73b2..0000000 --- a/content/en/author.md +++ /dev/null @@ -1,16 +0,0 @@ ---- -title: "About the Authors" -linkTitle: "About the Authors" -weight: 10 -breadcrumbs: false ---- - -**Martin Kleppmann** is a researcher in distributed systems at the University of Cambridge, UK. -Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure. -In the process he learned a few things the hard way, and he hopes this book will save you from repeating the same mistakes. - -Martin is a regular conference speaker, blogger, and open source contributor. He believes that profound technical ideas should be accessible to everyone, and that deeper understanding will help us develop better software. - -**Chris Riccomini** is a software engineer, startup investor, and author with 15+ years of experience at PayPal, LinkedIn, and WePay. -He runs Materialized View Capital, where he invests in infrastructure startups. He is also the cocreator of Apache Samza and SlateDB, -and coauthor of The Missing README: A Guide for the New Software Engineer. \ No newline at end of file diff --git a/content/en/ch11.md b/content/en/ch11.md index 8cd6bd0..b7dc78f 100644 --- a/content/en/ch11.md +++ b/content/en/ch11.md @@ -4,8 +4,9 @@ weight: 311 breadcrumbs: false --- -> [!IMPORTANT] -> This chapter is from the 1st edition, the 2nd edition is not available yet +{{< callout type="warning" >}} +This page is from the 1st edition, 2nd edition is not available yet. +{{< /callout >}} ![](/map/ch10.png) diff --git a/content/en/ch12.md b/content/en/ch12.md index 98b62b4..81a9447 100644 --- a/content/en/ch12.md +++ b/content/en/ch12.md @@ -4,8 +4,9 @@ weight: 312 breadcrumbs: false --- -> [!IMPORTANT] -> This chapter is from the 1st edition, the 2nd edition is not available yet +{{< callout type="warning" >}} +This page is from the 1st edition, 2nd edition is not available yet. +{{< /callout >}} ![](/map/ch11.png) diff --git a/content/en/ch13.md b/content/en/ch13.md index cfa1005..05f4a4d 100644 --- a/content/en/ch13.md +++ b/content/en/ch13.md @@ -4,8 +4,9 @@ weight: 313 breadcrumbs: false --- -> [!IMPORTANT] -> This chapter is from the 1st edition, the 2nd edition is not available yet +{{< callout type="warning" >}} +This page is from the 1st edition, 2nd edition is not available yet. +{{< /callout >}} ![](/map/ch12.png) diff --git a/content/en/ch6.md b/content/en/ch6.md index 3c33c90..045e792 100644 --- a/content/en/ch6.md +++ b/content/en/ch6.md @@ -11,7 +11,7 @@ breadcrumbs: false > Douglas Adams, *Mostly Harmless* (1992) *Replication* means keeping a copy of the same data on multiple machines that are connected via a -network. As discussed in [“Distributed versus Single-Node Systems”](/ch01.html#sec_introduction_distributed), there are several reasons +network. As discussed in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), there are several reasons why you might want to replicate data: * To keep data geographically close to your users (and thus reduce access latency) @@ -19,7 +19,7 @@ why you might want to replicate data: * To scale out the number of machines that can serve read queries (and thus increase read throughput) In this chapter we will assume that your dataset is small enough that each machine can hold a copy of -the entire dataset. In [Chapter 7](/ch07.html#ch_sharding) we will relax that assumption and discuss *sharding* +the entire dataset. In [Chapter 7](/en/ch7#ch_sharding) we will relax that assumption and discuss *sharding* (*partitioning*) of datasets that are too big for a single machine. In later chapters we will discuss various kinds of faults that can occur in a replicated data system, and how to deal with them. @@ -37,7 +37,7 @@ many different implementations. We will discuss the consequences of such choices Replication of databases is an old topic—the principles haven’t changed much since they were studied in the 1970s [^1], because the fundamental constraints of networks have remained the same. Despite being so old, -concepts such as *eventual consistency* still cause confusion. In [“Problems with Replication Lag”](/ch06.html#sec_replication_lag) we will +concepts such as *eventual consistency* still cause confusion. In [“Problems with Replication Lag”](/en/ch6#sec_replication_lag) we will get more precise about eventual consistency and discuss things like the *read-your-writes* and *monotonic reads* guarantees. @@ -50,7 +50,7 @@ delete some data, replication doesn’t help since the deletion will have also b replicas, so you need a backup if you want to restore the deleted data. In fact, replication and backups are often complementary to each other. Backups are sometimes part -of the process of setting up replication, as we shall see in [“Setting Up New Followers”](/ch06.html#sec_replication_new_replica). +of the process of setting up replication, as we shall see in [“Setting Up New Followers”](/en/ch6#sec_replication_new_replica). Conversely, archiving replication logs can be part of a backup process. Some databases internally maintain immutable snapshots of past states, which serve as a kind of @@ -67,7 +67,7 @@ question inevitably arises: how do we ensure that all the data ends up on all th Every write to the database needs to be processed by every replica; otherwise, the replicas would no longer contain the same data. The most common solution is called *leader-based replication*, *primary-backup*, or *active/passive*. It works as follows (see -[Figure 6-1](/ch06.html#fig_replication_leader_follower)): +[Figure 6-1](/en/ch6#fig_replication_leader_follower)): 1. One of the replicas is designated the *leader* (also known as *primary* or *source* [^2]). @@ -86,9 +86,9 @@ longer contain the same data. The most common solution is called *leader-based r ###### Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas. -If the database is sharded (see [Chapter 7](/ch07.html#ch_sharding)), each shard has one leader. Different shards may +If the database is sharded (see [Chapter 7](/en/ch7#ch_sharding)), each shard has one leader. Different shards may have their leaders on different nodes, but each shard must nevertheless have one leader node. In -[“Multi-Leader Replication”](/ch06.html#sec_replication_multi_leader) we will discuss an alternative model in which a system may have +[“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader) we will discuss an alternative model in which a system may have multiple leaders for the same shard at the same time. Single-leader replication is very widely used. It’s a built-in feature of many relational databases, @@ -104,7 +104,7 @@ Many consensus algorithms such as Raft, which is used for replication in Cockroa TiDB [^7], etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and automatically elect a new leader if the old one fails (we will discuss consensus in more detail in -[Chapter 10](/ch10.html#ch_consistency)). +[Chapter 10](/en/ch10#ch_consistency)). > [!NOTE] > In older documents you may see the term *master–slave replication*. It means the same as @@ -117,17 +117,17 @@ An important detail of a replicated system is whether the replication happens *s *asynchronously*. (In relational databases, this is often a configurable option; other systems are often hardcoded to be either one or the other.) -Think about what happens in [Figure 6-1](/ch06.html#fig_replication_leader_follower), where the user of a website updates +Think about what happens in [Figure 6-1](/en/ch6#fig_replication_leader_follower), where the user of a website updates their profile image. At some point in time, the client sends the update request to the leader; shortly afterward, it is received by the leader. At some point, the leader forwards the data change to the followers. Eventually, the leader notifies the client that the update was successful. -[Figure 6-2](/ch06.html#fig_replication_sync_replication) shows one possible way how the timings could work out. +[Figure 6-2](/en/ch6#fig_replication_sync_replication) shows one possible way how the timings could work out. ![ddia 0602](/fig/ddia_0602.png) ###### Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower. -In the example of [Figure 6-2](/ch06.html#fig_replication_sync_replication), the replication to follower 1 is +In the example of [Figure 6-2](/en/ch6#fig_replication_sync_replication), the replication to follower 1 is *synchronous*: the leader waits until follower 1 has confirmed that it received the write before reporting success to the user, and before making the write visible to other clients. The replication to follower 2 is *asynchronous*: the leader sends the message, but doesn’t wait for a response from @@ -157,9 +157,9 @@ called *semi-synchronous*. In some systems, a *majority* (e.g., 3 out of 5 replicas, including the leader) of replicas is updated synchronously, and the remaining minority is asynchronous. This is an example of a *quorum*, -which we will discuss further in [“Quorums for reading and writing”](/ch06.html#sec_replication_quorum_condition). Majority quorums are often +which we will discuss further in [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition). Majority quorums are often used in systems that use a consensus protocol for automatic leader election, which we will return to -in [Chapter 10](/ch10.html#ch_consistency). +in [Chapter 10](/en/ch10#ch_consistency). Sometimes, leader-based replication is configured to be completely asynchronous. In this case, if the leader fails and is not recoverable, any writes that have not yet been replicated to followers are @@ -170,7 +170,7 @@ processing writes, even if all of its followers have fallen behind. Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless widely used, especially if there are many followers or if they are geographically distributed [^9]. -We will return to this issue in [“Problems with Replication Lag”](/ch06.html#sec_replication_lag). +We will return to this issue in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag). ## Setting Up New Followers @@ -308,10 +308,10 @@ consists of the following steps: [^13]. The best candidate for leadership is usually the replica with the most up-to-date data changes from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader - is a consensus problem, discussed in detail in [Chapter 10](/ch10.html#ch_consistency). + is a consensus problem, discussed in detail in [Chapter 10](/en/ch10#ch_consistency). 3. *Reconfiguring the system to use the new leader.* Clients now need to send their write requests to the new leader (we discuss this - in [“Request Routing”](/ch07.html#sec_sharding_routing)). If the old leader comes back, it might still believe that it is + in [“Request Routing”](/en/ch7#sec_sharding_routing)). If the old leader comes back, it might still believe that it is the leader, not realizing that the other replicas have forced it to step down. The system needs to ensure that the old leader becomes a follower and recognizes the new leader. @@ -333,10 +333,10 @@ Failover is fraught with things that can go wrong: primary keys that were previously assigned by the old leader. These primary keys were also used in a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis, which caused some private data to be disclosed to the wrong users. -* In certain fault scenarios (see [Chapter 9](/ch09.html#ch_distributed)), it could happen that two nodes both believe +* In certain fault scenarios (see [Chapter 9](/en/ch9#ch_distributed)), it could happen that two nodes both believe that they are the leader. This situation is called *split brain*, and it is dangerous: if both leaders accept writes, and there is no process for resolving conflicts (see - [“Multi-Leader Replication”](/ch06.html#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some + [“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some systems have a mechanism to shut down one node if two leaders are detected. However, if this mechanism is not carefully designed, you can end up with both nodes being shut down [^15]. @@ -352,7 +352,7 @@ Failover is fraught with things that can go wrong: > [!NOTE] > Guarding against split brain by limiting or shutting down old leaders is known as *fencing* or, more > emphatically, *Shoot The Other Node In The Head* (STONITH). We will discuss fencing in more detail -> in [“Distributed Locks and Leases”](/ch09.html#sec_distributed_lock_fencing). +> in [“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing). There are no easy solutions to these problems. For this reason, some operations teams prefer to perform failovers manually, even if the software supports automatic failover. @@ -366,7 +366,7 @@ behind by several days could be catastrophic. These issues—node failures; unreliable networks; and trade-offs around replica consistency, durability, availability, and latency—are in fact fundamental problems in distributed systems. -In [Chapter 9](/ch09.html#ch_distributed) and [Chapter 10](/ch10.html#ch_consistency) we will discuss them in greater depth. +In [Chapter 9](/en/ch9#ch_distributed) and [Chapter 10](/en/ch10#ch_consistency) we will discuss them in greater depth. ## Implementation of Replication Logs @@ -397,9 +397,9 @@ break down: It is possible to work around those issues—for example, the leader can replace any nondeterministic function calls with a fixed return value when the statement is logged so that the followers all get the same value. The idea of executing deterministic statements in a fixed order is similar to the -event sourcing model that we previously discussed in [“Event Sourcing and CQRS”](/ch03.html#sec_datamodels_events). This approach is +event sourcing model that we previously discussed in [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events). This approach is also known as *state machine replication*, and we will discuss the theory behind it in -[“Using shared logs”](/ch10.html#sec_consistency_smr). +[“Using shared logs”](/en/ch10#sec_consistency_smr). Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today, as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if @@ -411,7 +411,7 @@ replication methods. ### Write-ahead log (WAL) shipping -In [Chapter 4](/ch04.html#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust: +In [Chapter 4](/en/ch4#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust: every modification is first written to the WAL so that the tree can be restored to a consistent state after a crash. Since the WAL contains all the information necessary to restore the indexes and heap into a consistent state, we can use the exact same log to build a replica on another node: @@ -470,7 +470,7 @@ This technique is called *change data capture*, and we will return to it in [Lin # Problems with Replication Lag Being able to tolerate node failures is just one reason for wanting replication. As mentioned -in [“Distributed versus Single-Node Systems”](/ch01.html#sec_introduction_distributed), other reasons are scalability (processing more +in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), other reasons are scalability (processing more requests than a single machine can handle) and latency (placing replicas geographically closer to users). @@ -522,7 +522,7 @@ be read from a follower. This is especially appropriate if data is frequently vi occasionally written. With asynchronous replication, there is a problem, illustrated in -[Figure 6-3](/ch06.html#fig_replication_read_your_writes): if the user views the data shortly after making a write, the +[Figure 6-3](/en/ch6#fig_replication_read_your_writes): if the user views the data shortly after making a write, the new data may not yet have reached the replica. To the user, it looks as though the data they submitted was lost, so they will be understandably unhappy. @@ -562,7 +562,7 @@ are various possible techniques. To mention a few: [^26]. The timestamp could be a *logical timestamp* (something that indicates ordering of writes, such as the log sequence number) or the actual system clock (in which case clock synchronization becomes - critical; see [“Unreliable Clocks”](/ch09.html#sec_distributed_clocks)). + critical; see [“Unreliable Clocks”](/en/ch9#sec_distributed_clocks)). * If your replicas are distributed across regions (for geographical proximity to users or for availability), there is additional complexity. Any request that needs to be served by the leader must be routed to the region that contains the leader. @@ -598,7 +598,7 @@ zonal outages where one zone goes offline, but they do not protect against regio all zones in a region are unavailable. To survive a regional outage, a distributed system must be deployed across multiple regions, which can result in higher latencies, lower throughput, and increased cloud networking bills. We will discuss these tradeoffs more in -[“Multi-leader replication topologies”](/ch06.html#sec_replication_topologies). For now, just know that when we say region, we mean a collection of +[“Multi-leader replication topologies”](/en/ch6#sec_replication_topologies). For now, just know that when we say region, we mean a collection of zones/datacenters in a single geographic location. ## Monotonic Reads @@ -607,7 +607,7 @@ Our second example of an anomaly that can occur when reading from asynchronous f possible for a user to see things *moving backward in time*. This can happen if a user makes several reads from different replicas. For example, -[Figure 6-4](/ch06.html#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower +[Figure 6-4](/en/ch6#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower with little lag, then to a follower with greater lag. (This scenario is quite likely if the user refreshes a web page, and each request is routed to a random server.) The first query returns a comment that was recently added by user 1234, but the second query doesn’t return anything because @@ -648,7 +648,7 @@ answered it. Now, imagine a third person is listening to this conversation through followers. The things said by Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer -replication lag (see [Figure 6-5](/ch06.html#fig_replication_consistent_prefix)). This observer would hear the following: +replication lag (see [Figure 6-5](/en/ch6#fig_replication_consistent_prefix)). This observer would hear the following: Mrs. Cake : About ten seconds usually, Mr. Poons. @@ -670,7 +670,7 @@ writes happens in a certain order, then anyone reading those writes will see the order. This is a particular problem in sharded (partitioned) databases, which we will discuss in -[Chapter 7](/ch07.html#ch_sharding). If the database always applies writes in the same order, reads always see a +[Chapter 7](/en/ch7#ch_sharding). If the database always applies writes in the same order, reads always see a consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different shards operate independently, so there is no global ordering of writes: when a user reads from the database, they may see some parts of the database in an older state and some in a newer state. @@ -678,7 +678,7 @@ database, they may see some parts of the database in an older state and some in One solution is to make sure that any writes that are causally related to each other are written to the same shard—but in some applications that cannot be done efficiently. There are also algorithms that explicitly keep track of causal dependencies, a topic that we will return to in -[“The “happens-before” relation and concurrency”](/ch06.html#sec_replication_happens_before). +[“The “happens-before” relation and concurrency”](/en/ch6#sec_replication_happens_before). ## Solutions for Replication Lag @@ -694,15 +694,15 @@ synchronously updated follower. However, dealing with these issues in applicatio and easy to get wrong. The simplest programming model for application developers is to choose a database that provides a -strong consistency guarantee for replicas such as linearizability (see [Chapter 10](/ch10.html#ch_consistency)), and ACID -transactions (see [Chapter 8](/ch08.html#ch_transactions)). This allows you to mostly ignore the challenges that arise +strong consistency guarantee for replicas such as linearizability (see [Chapter 10](/en/ch10#ch_consistency)), and ACID +transactions (see [Chapter 8](/en/ch8#ch_transactions)). This allows you to mostly ignore the challenges that arise from replication, and treat the database as if it had just a single node. In the early 2010s the *NoSQL* movement promoted the view that these features limited scalability, and that large-scale systems would have to embrace eventual consistency. However, since then, a number of databases started providing strong consistency and transactions while also offering the fault tolerance, high availability, and scalability advantages of a -distributed database. As mentioned in [“Relational Model versus Document Model”](/ch03.html#sec_datamodels_history), this trend is known as *NewSQL* to +distributed database. As mentioned in [“Relational Model versus Document Model”](/en/ch3#sec_datamodels_history), this trend is known as *NewSQL* to contrast with NoSQL (although it’s less about SQL specifically, and more about new approaches to scalable transaction management). @@ -752,7 +752,7 @@ single-leader replication, the leader has to be in *one* of the regions, and all through that region. In a multi-leader configuration, you can have a leader in *each* region. -[Figure 6-6](/ch06.html#fig_replication_multi_dc) shows what this architecture might look like. Within each region, +[Figure 6-6](/en/ch6#fig_replication_multi_dc) shows what this architecture might look like. Within each region, regular leader–follower replication is used (with followers maybe in a different availability zone from the leader); between regions, each region’s leader replicates its changes to the leaders in other regions. @@ -792,7 +792,7 @@ Tolerance of network problems Consistency : A single-leader system can provide strong consistency guarantees, such as serializable - transactions, which we will discuss in [Chapter 8](/ch08.html#ch_transactions). The biggest downside of multi-leader + transactions, which we will discuss in [Chapter 8](/en/ch8#ch_transactions). The biggest downside of multi-leader systems is that the consistency they can achieve is much weaker. For example, you can’t guarantee that a bank account won’t go negative or that a username is unique: it’s always possible for different leaders to process writes that are individually fine (paying out some of the money in an @@ -802,7 +802,7 @@ Consistency This is simply a fundamental limitation of distributed systems [^28]. If you need to enforce such constraints, you’re therefore better off with a single-leader system. - However, as we will see in [“Dealing with Conflicting Writes”](/ch06.html#sec_replication_write_conflicts), multi-leader systems can still + However, as we will see in [“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts), multi-leader systems can still achieve consistency properties that are useful in a wide range of apps that don’t need such constraints. @@ -820,17 +820,17 @@ multi-leader replication is often considered dangerous territory that should be ### Multi-leader replication topologies A *replication topology* describes the communication paths along which writes are propagated from -one node to another. If you have two leaders, like in [Figure 6-9](/ch06.html#fig_replication_write_conflict), there is +one node to another. If you have two leaders, like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), there is only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With more than two leaders, various different topologies are possible. Some examples are illustrated in -[Figure 6-7](/ch06.html#fig_replication_topologies). +[Figure 6-7](/en/ch6#fig_replication_topologies). ![ddia 0607](/fig/ddia_0607.png) ###### Figure 6-7. Three example topologies in which multi-leader replication can be set up. The most general topology is *all-to-all*, shown in -[Figure 6-7](/ch06.html#fig_replication_topologies)(c), +[Figure 6-7](/en/ch6#fig_replication_topologies)(c), in which every leader sends its writes to every other leader. However, more restricted topologies are also used: for example a *circular topology* in which each node receives writes from one node and forwards those writes (plus any writes of its own) to one other node. Another popular topology @@ -839,7 +839,7 @@ star topology can be generalized to a tree. > [!NOTE] > Don’t confuse a star-shaped network topology with a *star schema* (see -> [“Stars and Snowflakes: Schemas for Analytics”](/ch03.html#sec_datamodels_analytics)), which describes the structure of a data model. +> [“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics)), which describes the structure of a data model. In circular and star topologies, a write may need to pass through several nodes before it reaches all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To @@ -860,28 +860,28 @@ along different paths, avoiding a single point of failure. On the other hand, all-to-all topologies can have issues too. In particular, some network links may be faster than others (e.g., due to network congestion), with the result that some replication -messages may “overtake” others, as illustrated in [Figure 6-8](/ch06.html#fig_replication_causality). +messages may “overtake” others, as illustrated in [Figure 6-8](/en/ch6#fig_replication_causality). ![ddia 0608](/fig/ddia_0608.png) ###### Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas. -In [Figure 6-8](/ch06.html#fig_replication_causality), client A inserts a row into a table on leader 1, and client B +In [Figure 6-8](/en/ch6#fig_replication_causality), client A inserts a row into a table on leader 1, and client B updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may first receive the update (which, from its point of view, is an update to a row that does not exist in the database) and only later receive the corresponding insert (which should have preceded the update). -This is a problem of causality, similar to the one we saw in [“Consistent Prefix Reads”](/ch06.html#sec_replication_consistent_prefix): +This is a problem of causality, similar to the one we saw in [“Consistent Prefix Reads”](/en/ch6#sec_replication_consistent_prefix): the update depends on the prior insert, so we need to make sure that all nodes process the insert first, and then the update. Simply attaching a timestamp to every write is not sufficient, because clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see -[Chapter 9](/ch09.html#ch_distributed)). +[Chapter 9](/en/ch9#ch_distributed)). To order these events correctly, a technique called *version vectors* can be used, which we will -discuss later in this chapter (see [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent)). However, many multi-leader +discuss later in this chapter (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). However, many multi-leader replication systems don’t use good techniques for ordering updates, leaving them vulnerable to -issues like the one in [Figure 6-8](/ch06.html#fig_replication_causality). If you are using multi-leader replication, it +issues like the one in [Figure 6-8](/en/ch6#fig_replication_causality). If you are using multi-leader replication, it is worth being aware of these issues, carefully reading the documentation, and thoroughly testing your database to ensure that it really does provide the guarantees you believe it to have. @@ -956,7 +956,7 @@ approach has a number of advantages: offline is the same as having very large network delay. * A sync engine simplifies the programming model for frontend apps, compared to performing explicit service calls in application code. Every service call requires error handling, as discussed in - [“The problems with remote procedure calls (RPCs)”](/ch05.html#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user + [“The problems with remote procedure calls (RPCs)”](/en/ch5#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user interface needs to somehow reflect that error. A sync engine allows the app to perform reads and writes on local data, which almost never fails, leading to a more declarative programming style [^41]. @@ -993,7 +993,7 @@ a local-first sync engine on end user devices—is that concurrent writes on dif lead to conflicts that need to be resolved. For example, consider a wiki page that is simultaneously being edited by two users, as shown in -[Figure 6-9](/ch06.html#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2 +[Figure 6-9](/en/ch6#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2 independently changes the title from A to C. Each user’s change is successfully applied to their local leader. However, when the changes are asynchronously replicated, a conflict is detected. This problem does not occur in a single-leader database. @@ -1003,13 +1003,13 @@ This problem does not occur in a single-leader database. ###### Figure 6-9. A write conflict caused by two leaders concurrently updating the same record. > [!NOTE] -> We say that the two writes in [Figure 6-9](/ch06.html#fig_replication_write_conflict) are *concurrent* because neither +> We say that the two writes in [Figure 6-9](/en/ch6#fig_replication_write_conflict) are *concurrent* because neither > was “aware” of the other at the time the write was originally made. It doesn’t matter whether the > writes literally happened at the same time; indeed, if the writes were made while offline, they > might have actually happened some time apart. What matters is whether one write occurred in a state > where the other write has already taken effect. -In [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent) we will tackle the question of how a database can determine +In [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent) we will tackle the question of how a database can determine whether two writes are concurrent. For now we will assume that we can detect conflicts, and we want to figure out the best way of resolving them. @@ -1038,13 +1038,13 @@ Another example of conflict avoidance: imagine you want to insert new records an IDs for them based on an auto-incrementing counter. If you have two leaders, you could set them up so that one leader only generates odd numbers and the other only generates even numbers. That way you can be sure that the two leaders won’t concurrently assign the same ID to different records. -We will discuss other ID assignment schemes in [“ID Generators and Logical Clocks”](/ch10.html#sec_consistency_logical). +We will discuss other ID assignment schemes in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical). ### Last write wins (discarding concurrent writes) If conflicts can’t be avoided, the simplest way of resolving them is to attach a timestamp to each write, and to always use the value with the greatest timestamp. For example, in -[Figure 6-9](/ch06.html#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than +[Figure 6-9](/en/ch6#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than the timestamp of user 2’s write. In that case, both leaders will determine that the new title of the page should be B, and they discard the write that sets it to C. If the writes coincidentally have the same timestamp, the winner can be chosen by comparing the values (e.g., in the case of strings, @@ -1052,7 +1052,7 @@ taking the one that’s earlier in the alphabet). This approach is called *last write wins* (LWW) because the write with the greatest timestamp can be considered the “last” one. The term is misleading though, because when two writes are concurrent -like in [Figure 6-9](/ch06.html#fig_replication_write_conflict), which one is older and which is later is undefined, and +like in [Figure 6-9](/en/ch6#fig_replication_write_conflict), which one is older and which is later is undefined, and so the timestamp order of concurrent writes is essentially random. Therefore the real meaning of LWW is: when the same record is concurrently written on different @@ -1070,7 +1070,7 @@ Another problem with LWW is that if a real-time clock (e.g. a Unix timestamp) is for the writes, the system becomes very sensitive to clock synchronization. If one node has a clock that is ahead of the others, and you try to overwrite a value written by that node, your write may be ignored as it may have a lower timestamp, even though it clearly occurred later. This problem can -be solved by using a *logical clock*, which we will discuss in [“ID Generators and Logical Clocks”](/ch10.html#sec_consistency_logical). +be solved by using a *logical clock*, which we will discuss in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical). ### Manual conflict resolution @@ -1082,7 +1082,7 @@ merge is complete. In a database, it would be impractical for a conflict to stop the entire replication process until a human has resolved it. Instead, databases typically store all the concurrently written values for a -given record—for example, both B and C in [Figure 6-9](/ch06.html#fig_replication_write_conflict). These values are +given record—for example, both B and C in [Figure 6-9](/en/ch6#fig_replication_write_conflict). These values are sometimes called *siblings*. The next time you query that record, the database returns *all* those values, rather than just the latest one. You can then resolve those values in whatever way you want, either automatically in application code (for example, you could concatenate B and C into “B/C”), or @@ -1106,7 +1106,7 @@ suffers from a number of problems: sibling, but another sibling still contained that old item, the removed item would unexpectedly reappear in the customer’s cart [^45]. - [Figure 6-10](/ch06.html#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping + [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear. * If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution process can itself introduce a new conflict. Those resolutions could even be inconsistent: for @@ -1135,7 +1135,7 @@ updates as much as possible, and hence avoiding data loss: same position, it can be ordered deterministically so that all nodes get the same merged outcome. * If the data is a collection of items (ordered like a to-do list, or unordered like a shopping cart), we can merge it similarly to text by tracking insertions and deletions. To avoid the - shopping cart issue in [Figure 6-10](/ch06.html#fig_replication_amazon_anomaly), the algorithms track the fact that Book + shopping cart issue in [Figure 6-10](/en/ch6#fig_replication_amazon_anomaly), the algorithms track the fact that Book and DVD were deleted, so the merged result is Cart = {Soap}. * If the data is an integer representing a counter that can be incremented or decremented (e.g., the number of likes on a social media post), the merge algorithm can tell how many increments and @@ -1161,7 +1161,7 @@ Two families of algorithms are commonly used to implement automatic conflict res They have different design philosophies and performance characteristics, but both are able to perform automatic merges for all the aforementioned types of data. -[Figure 6-11](/ch06.html#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a +[Figure 6-11](/en/ch6#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a text. Assume you have two replicas that both start off with the text “ice”. One replica prepends the letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make “ice!”. @@ -1182,7 +1182,7 @@ OT CRDT : Most CRDTs give each character a unique, immutable ID and use those to determine the positions of - insertions/deletions, instead of indexes. For example, in [Figure 6-11](/ch06.html#fig_replication_ot_crdt) we assign + insertions/deletions, instead of indexes. For example, in [Figure 6-11](/en/ch6#fig_replication_ot_crdt) we assign the ID 1A to “i”, the ID 2A to “c”, etc. When inserting the exclamation mark, we generate an operation containing the ID of the new character (4B) and the ID of the existing character after which we want to insert (3A). To insert at the beginning of the string we give “nil” as the @@ -1204,7 +1204,7 @@ Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge o ### What is a conflict? -Some kinds of conflict are obvious. In the example in [Figure 6-9](/ch06.html#fig_replication_write_conflict), two writes +Some kinds of conflict are obvious. In the example in [Figure 6-9](/en/ch6#fig_replication_write_conflict), two writes concurrently modified the same field in the same record, setting it to two different values. There is little doubt that this is a conflict. @@ -1218,7 +1218,7 @@ are made on two different leaders. There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a good understanding of this problem. We will see some more examples of conflicts in -[Chapter 8](/ch08.html#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and +[Chapter 8](/en/ch8#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and resolving conflicts in a replicated system. # Leaderless Replication @@ -1252,10 +1252,10 @@ profound consequences for the way the database is used. Imagine you have a database with three replicas, and one of the replicas is currently unavailable—​perhaps it is being rebooted to install a system update. In a single-leader configuration, if you want to continue processing writes, you may need to perform a failover (see -[“Handling Node Outages”](/ch06.html#sec_replication_failover)). +[“Handling Node Outages”](/en/ch6#sec_replication_failover)). On the other hand, in a leaderless configuration, failover does not exist. -[Figure 6-12](/ch06.html#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to +[Figure 6-12](/en/ch6#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to all three replicas in parallel, and the two available replicas accept the write but the unavailable replica misses it. Let’s say that it’s sufficient for two out of three replicas to acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be @@ -1276,9 +1276,9 @@ stale value from another. In order to tell which responses are up-to-date and which are outdated, every value that is written needs to be tagged with a version number or timestamp, similarly to what we saw in -[“Last write wins (discarding concurrent writes)”](/ch06.html#sec_replication_lww). When a client receives multiple values in response to a read, it uses the +[“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww). When a client receives multiple values in response to a read, it uses the one with the greatest timestamp (even if that value was only returned by one replica, and several -other replicas returned older values). See [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent) for more details. +other replicas returned older values). See [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent) for more details. ### Catching up on missed writes @@ -1288,7 +1288,7 @@ mechanisms are used in Dynamo-style datastores: Read repair : When a client makes a read from several nodes in parallel, it can detect any stale responses. - For example, in [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from + For example, in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale value and writes the newer value back to that replica. This approach works well for values that are frequently read. @@ -1308,7 +1308,7 @@ Anti-entropy ### Quorums for reading and writing -In the example of [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage), we considered the write to be successful +In the example of [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage), we considered the write to be successful even though it was only processed on two out of three replicas. What if only one out of three replicas accepted the write? How far can we push this? @@ -1336,7 +1336,7 @@ database writes to fail. > [!NOTE] > There may be more than *n* nodes in the cluster, but any given value is stored only on *n* > nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit -> on one node. We will return to sharding in [Chapter 7](/ch07.html#ch_sharding). +> on one node. We will return to sharding in [Chapter 7](/en/ch7#ch_sharding). The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes as follows: @@ -1344,9 +1344,9 @@ as follows: * If *w* < *n*, we can still process writes if a node is unavailable. * If *r* < *n*, we can still process reads if a node is unavailable. * With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable - node, like in [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage). + node, like in [Figure 6-12](/en/ch6#fig_replication_quorum_node_outage). * With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes. - This case is illustrated in [Figure 6-13](/ch06.html#fig_replication_quorum_overlap). + This case is illustrated in [Figure 6-13](/en/ch6#fig_replication_quorum_overlap). Normally, reads and writes are always sent to all *n* replicas in parallel. The parameters *w* and *r* determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success @@ -1368,7 +1368,7 @@ If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n* generally expect every read to return the most recent value written for a key. This is the case because the set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That is, among the nodes you read there must be at least one node with the latest value (illustrated in -[Figure 6-13](/ch06.html#fig_replication_quorum_overlap)). +[Figure 6-13](/en/ch6#fig_replication_quorum_overlap)). Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, because that ensures *w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are @@ -1395,12 +1395,12 @@ properties can be confusing. Some scenarios include: value, the number of replicas storing the new value may fall below *w*, breaking the quorum condition. * While a rebalancing is in progress, where some data is moved from one node to another (see - [Chapter 7](/ch07.html#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n* + [Chapter 7](/en/ch7#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n* replicas for a particular value. This can result in the read and write quorums no longer overlapping. * If a read is concurrent with a write operation, the read may or may not see the concurrently written value. In particular, it’s possible for one read to see the new value, and a subsequent - read to see the old value, as we shall see in [“Linearizability and quorums”](/ch10.html#sec_consistency_quorum_linearizable). + read to see the old value, as we shall see in [“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable). * If a write succeeded on some replicas but failed on others (for example because the disks on some nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the replicas where it succeeded. This means that if a write was reported as failed, subsequent reads @@ -1408,12 +1408,12 @@ properties can be confusing. Some scenarios include: [^52]. * If the database uses timestamps from a real-time clock to determine which write is newer (as Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a - faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](/ch06.html#sec_replication_lww). - We will discuss this in more detail in [“Relying on Synchronized Clocks”](/ch09.html#sec_distributed_clocks_relying). + faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww). + We will discuss this in more detail in [“Relying on Synchronized Clocks”](/en/ch9#sec_distributed_clocks_relying). * If two writes occur concurrently, one of them might be processed first on one replica, and the other might be processed first on another replica. This leads to a conflict, similarly to what we - saw for multi-leader replication (see [“Dealing with Conflicting Writes”](/ch06.html#sec_replication_write_conflicts)). We will return to this - topic in [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent). + saw for multi-leader replication (see [“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts)). We will return to this + topic in [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent). Thus, although quorums appear to guarantee that a read returns the latest written value, in practice it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate @@ -1445,7 +1445,7 @@ able to quantify “eventual.” A replication system based on a single leader can provide strong consistency guarantees that are difficult or impossible to achieve in a leaderless system. However, as we have seen in -[“Problems with Replication Lag”](/ch06.html#sec_replication_lag), reads in a leader-based replicated system can also return stale values if +[“Problems with Replication Lag”](/en/ch6#sec_replication_lag), reads in a leader-based replicated system can also return stale values if you make them on an asynchronously updated follower. Reading from the leader ensures up-to-date responses, but it suffers from performance problems: @@ -1489,7 +1489,7 @@ That said, leaderless systems can have performance problems as well: to wait for before a request can complete. Even if you wait only for the fastest *r* or *w* replicas to respond, and even if you make the requests in parallel, a bigger *r* or *w* increases the chance that you hit a slow replica, increasing the overall response time (see - [“Use of Response Time Metrics”](/ch02.html#sec_introduction_slo_sla)). + [“Use of Response Time Metrics”](/en/ch2#sec_introduction_slo_sla)). * A large-scale network interruption that disconnects a client from a large number of replicas can make it impossible to form a quorum. Some leaderless databases offer a configuration option that allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that @@ -1508,7 +1508,7 @@ fault tolerance while also having a high likelihood of reading up-to-date data. ### Multi-region operation We previously discussed cross-region replication as a use case for multi-leader replication (see -[“Multi-Leader Replication”](/ch06.html#sec_replication_multi_leader)). Leaderless replication is also suitable for +[“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)). Leaderless replication is also suitable for multi-region operation, since it is designed to tolerate conflicting concurrent writes, network interruptions, and latency spikes. @@ -1531,7 +1531,7 @@ resulting in conflicts that need to be resolved. Such conflicts may occur as the not always: they could also be detected later during read repair, hinted handoff, or anti-entropy. The problem is that events may arrive in a different order at different nodes, due to variable -network delays and partial failures. For example, [Figure 6-14](/ch06.html#fig_replication_concurrency) shows two clients, +network delays and partial failures. For example, [Figure 6-14](/en/ch6#fig_replication_concurrency) shows two clients, A and B, simultaneously writing to a key *X* in a three-node datastore: * Node 1 receives the write from A, but never receives the write from B due to a transient @@ -1545,13 +1545,13 @@ A and B, simultaneously writing to a key *X* in a three-node datastore: If each node simply overwrote the value for a key whenever it received a write request from a client, the nodes would become permanently inconsistent, as shown by the final *get* request in -[Figure 6-14](/ch06.html#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other +[Figure 6-14](/en/ch6#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other nodes think that the value is A. In order to become eventually consistent, the replicas should converge toward the same value. For this, we can use any of the conflict resolution mechanisms we previously discussed in -[“Dealing with Conflicting Writes”](/ch06.html#sec_replication_write_conflicts), such as last-write-wins (used by Cassandra and ScyllaDB), -manual resolution, or CRDTs (described in [“CRDTs and Operational Transformation”](/ch06.html#sec_replication_crdts), and used by Riak). +[“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts), such as last-write-wins (used by Cassandra and ScyllaDB), +manual resolution, or CRDTs (described in [“CRDTs and Operational Transformation”](/en/ch6#sec_replication_crdts), and used by Riak). Last-write-wins is easy to implement: each write is tagged with a timestamp, and a value with a higher timestamp always overwrites a value with a lower timestamp. However, a timestamp doesn’t tell @@ -1564,11 +1564,11 @@ take more care to detect concurrent writes. How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look at some examples: -* In [Figure 6-8](/ch06.html#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before* +* In [Figure 6-8](/en/ch6#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before* B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s operation builds upon A’s operation, so B’s operation must have happened later. We also say that B is *causally dependent* on A. -* On the other hand, the two writes in [Figure 6-14](/ch06.html#fig_replication_concurrency) are concurrent: when each +* On the other hand, the two writes in [Figure 6-14](/en/ch6#fig_replication_concurrency) are concurrent: when each client starts the operation, it does not know that another client is also performing an operation on the same key. Thus, there is no causal dependency between the operations. @@ -1589,7 +1589,7 @@ conflict that needs to be resolved. It may seem that two operations should be called concurrent if they occur “at the same time”—but in fact, it is not important whether they literally overlap in time. Because of problems with clocks in distributed systems, it is actually quite difficult to tell whether two things happened -at exactly the same time—an issue we will discuss in more detail in [Chapter 9](/ch09.html#ch_distributed). +at exactly the same time—an issue we will discuss in more detail in [Chapter 9](/en/ch9#ch_distributed). For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred. People @@ -1611,7 +1611,7 @@ happened before another. To keep things simple, let’s start with a database th replica. Once we have worked out how to do this on a single replica, we can generalize the approach to a leaderless database with multiple replicas. -[Figure 6-15](/ch06.html#fig_replication_causality_single) shows two clients concurrently adding items to the same +[Figure 6-15](/en/ch6#fig_replication_causality_single) shows two clients concurrently adding items to the same shopping cart. (If that example strikes you as too inane, imagine instead two air traffic controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is empty. Between them, the clients make five writes to the database: @@ -1646,8 +1646,8 @@ empty. Between them, the clients make five writes to the database: ###### Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart. -The dataflow between the operations in [Figure 6-15](/ch06.html#fig_replication_causality_single) is illustrated -graphically in [Figure 6-16](/ch06.html#fig_replication_causal_dependencies). The arrows indicate which operation +The dataflow between the operations in [Figure 6-15](/en/ch6#fig_replication_causality_single) is illustrated +graphically in [Figure 6-16](/en/ch6#fig_replication_causal_dependencies). The arrows indicate which operation *happened before* which other operation, in the sense that the later operation *knew about* or *depended on* the earlier one. In this example, the clients are never fully up to date with the data on the server, since there is always another operation going on concurrently. But old versions of @@ -1655,7 +1655,7 @@ the value do get overwritten eventually, and no writes are lost. ![ddia 0616](/fig/ddia_0616.png) -###### Figure 6-16. Graph of causal dependencies in [Figure 6-15](/ch06.html#fig_replication_causality_single). +###### Figure 6-16. Graph of causal dependencies in [Figure 6-15](/en/ch6#fig_replication_causality_single). Note that the server can determine whether two operations are concurrent by looking at the version numbers—it does not need to interpret the value itself (so the value could be any data @@ -1681,10 +1681,10 @@ on subsequent reads. ### Version vectors -The example in [Figure 6-15](/ch06.html#fig_replication_causality_single) used only a single replica. How does the +The example in [Figure 6-15](/en/ch6#fig_replication_causality_single) used only a single replica. How does the algorithm change when there are multiple replicas, but no leader? -[Figure 6-15](/ch06.html#fig_replication_causality_single) uses a single version number to capture dependencies between +[Figure 6-15](/en/ch6#fig_replication_causality_single) uses a single version number to capture dependencies between operations, but that is not sufficient when there are multiple replicas accepting writes concurrently. Instead, we need to use a version number *per replica* as well as per key. Each replica increments its own version number when processing a write, and also keeps track of the @@ -1696,7 +1696,7 @@ A few variants of this idea are in use, but the most interesting is probably the which is used in Riak 2.0 [^61] [^62]. We won’t go into the details, but the way it works is quite similar to what we saw in our cart example. -Like the version numbers in [Figure 6-15](/ch06.html#fig_replication_causality_single), version vectors are sent from the +Like the version numbers in [Figure 6-15](/en/ch6#fig_replication_causality_single), version vectors are sent from the database replicas to clients when values are read, and need to be sent back to the database when a value is subsequently written. (Riak encodes the version vector as a string that it calls *causal context*.) The version vector allows the database to distinguish between overwrites and concurrent diff --git a/content/en/colophon.md b/content/en/colophon.md index 5ca9612..cac70d9 100644 --- a/content/en/colophon.md +++ b/content/en/colophon.md @@ -4,14 +4,23 @@ weight: 600 breadcrumbs: false --- +{{< callout type="warning" >}} +This page is from the 1st edition, 2nd edition is not available yet. +{{< /callout >}} + ## About the Author -**Martin Kleppmann** is a researcher in distributed systems at the University of Cam‐ bridge, UK. Previously he was a software engineer and entrepreneur at internet com‐ panies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure. In the process he learned a few things the hard way, and he hopes this book will save you from repeating the same mistakes. +**Martin Kleppmann** is a researcher in distributed systems at the University of Cambridge, UK. +Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure. +In the process he learned a few things the hard way, and he hopes this book will save you from repeating the same mistakes. Martin is a regular conference speaker, blogger, and open source contributor. He believes that profound technical ideas should be accessible to everyone, and that deeper understanding will help us develop better software. ![](http://martin.kleppmann.com/2017/03/ddia-poster.jpg) +**Chris Riccomini** is a software engineer, startup investor, and author with 15+ years of experience at PayPal, LinkedIn, and WePay. +He runs Materialized View Capital, where he invests in infrastructure startups. He is also the cocreator of Apache Samza and SlateDB, +and coauthor of The Missing README: A Guide for the New Software Engineer. ## Colophon diff --git a/content/en/glossary.md b/content/en/glossary.md index b023327..33476b1 100644 --- a/content/en/glossary.md +++ b/content/en/glossary.md @@ -4,6 +4,10 @@ weight: 500 breadcrumbs: false --- +{{< callout type="warning" >}} +This page is from the 1st edition, 2nd edition is not available yet. +{{< /callout >}} + > Please note that the definitions in this glossary are short and simple, intended to convey the core idea but not the full subtleties of a term. For more detail, please follow the references into the main text. ### asynchronous diff --git a/content/en/part-i.md b/content/en/part-i.md index 2a50b5a..bd17fb6 100644 --- a/content/en/part-i.md +++ b/content/en/part-i.md @@ -4,8 +4,9 @@ weight: 100 breadcrumbs: false --- -> [!IMPORTANT] -> This page is from the 1st edition +{{< callout type="warning" >}} +This page is from the 1st edition, 2nd edition is not available yet. +{{< /callout >}} The first four chapters go through the fundamental ideas that apply to all data sys‐ tems, whether running on a single machine or distributed across a cluster of machines: diff --git a/content/en/part-ii.md b/content/en/part-ii.md index 7e8fe2a..ec0f160 100644 --- a/content/en/part-ii.md +++ b/content/en/part-ii.md @@ -4,8 +4,9 @@ weight: 200 breadcrumbs: false --- -> [!IMPORTANT] -> This page is from the 1st edition +{{< callout type="warning" >}} +This page is from the 1st edition, 2nd edition is not available yet. +{{< /callout >}} > *For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.* > @@ -37,26 +38,28 @@ That avoids the users having to wait for network packets to travel halfway aroun ## Scaling to Higher Load If all you need is to scale to higher load, the simplest approach is to buy a more powerful machine (sometimes called *vertical scaling* or *scaling up*). Many CPUs, many RAM chips, and many disks can be joined together under one operating system, -and a fast interconnect allows any CPU to access any part of the memory or disk. In this kind of *shared-memory architecture*, all the components can be treated as a single machine [1].[^ii] +and a fast interconnect allows any CPU to access any part of the memory or disk. In this kind of *shared-memory architecture*, all the components can be treated as a single machine [^1]. -[^i]: In a large machine, although any CPU can access any part of memory, some banks of memory are closer to one CPU than to others (this is called nonuniform memory access, or NUMA [1]). -To make efficient use of this architecture, the processing needs to be broken down so that each CPU mostly accesses memory that is nearby—which means that partitioning is still required, even when ostensibly running on one machine. +> [!NOTE] +> In a large machine, although any CPU can access any part of memory, some banks of memory are closer to one CPU than to others (this is called nonuniform memory access, or NUMA [^1]). +> To make efficient use of this architecture, the processing needs to be broken down so that each CPU mostly accesses memory that is nearby—which means that partitioning is still required, even when ostensibly running on one machine. The problem with a shared-memory approach is that the cost grows faster than linearly: a machine with twice as many CPUs, twice as much RAM, and twice as much disk capacity as another typically costs significantly more than twice as much. And due to bottlenecks, a machine twice the size cannot necessarily handle twice the load. A shared-memory architecture may offer limited fault tolerance—high-end machines have hot-swappable components (you can replace disks, memory modules, and even CPUs without shutting down the machines) — but it is definitely limited to a single geographic location. -Another approach is the *shared-disk architecture*, which uses several machines with independent CPUs and RAM, but stores data on an array of disks that is shared between the machines, which are connected via a fast network.[^ii] -This architecture is used for some data warehousing workloads, but contention and the overhead of locking limit the scalability of the shared-disk approach [2]. +Another approach is the *shared-disk architecture*, which uses several machines with independent CPUs and RAM, but stores data on an array of disks that is shared between the machines, which are connected via a fast network. +This architecture is used for some data warehousing workloads, but contention and the overhead of locking limit the scalability of the shared-disk approach [^2]. -[^ii]: Network Attached Storage (NAS) or Storage Area Network (SAN). +> [!NOTE] +> Network Attached Storage (NAS) or Storage Area Network (SAN). ### Shared-Nothing Architectures -By contrast, *shared-nothing architectures* [3] (sometimes called *horizontal scaling* or *scaling out*) have gained a lot of popularity. +By contrast, *shared-nothing architectures* [^3] (sometimes called *horizontal scaling* or *scaling out*) have gained a lot of popularity. In this approach, each machine or virtual machine running the database software is called a *node*. Each node uses its CPUs, RAM, and disks independently. Any coordination between nodes is done at the software level, using a conventional network. @@ -68,7 +71,7 @@ In this part of the book, we focus on shared-nothing architectures—not because If your data is distributed across multiple nodes, you need to be aware of the constraints and trade-offs that occur in such a distributed system—the database cannot magically hide these from you. While a distributed shared-nothing architecture has many advantages, it usually also incurs additional complexity for applications and sometimes limits the expressiveness of the data models you can use. -In some cases, a simple single-threaded program can perform significantly better than a cluster with over 100 CPU cores [4]. On the other hand, shared-nothing systems can be very powerful. +In some cases, a simple single-threaded program can perform significantly better than a cluster with over 100 CPU cores [^4]. On the other hand, shared-nothing systems can be very powerful. The next few chapters go into details on the issues that arise when data is distributed. ### Replication Versus Partitioning @@ -107,7 +110,7 @@ Later, in Part III of this book, we will discuss how you can take several (poten ### References -1. Ulrich Drepper: “[What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf),” akka‐dia.org, November 21, 2007. -1. Ben Stopford: “[Shared Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/),” benstopford.com, November 24, 2009. -1. Michael Stonebraker: “[The Case for Shared Nothing](http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf),” IEEE Database EngineeringBulletin, volume 9, number 1, pages 4–9, March 1986. -1. Frank McSherry, Michael Isard, and Derek G. Murray: “[Scalability! But at What COST?](http://www.frankmcsherry.org/assets/COST.pdf),” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS),May 2015. +[^1]: Ulrich Drepper: “[What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf),” akka‐dia.org, November 21, 2007. +[^2]: Ben Stopford: “[Shared Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/),” benstopford.com, November 24, 2009. +[^3]: Michael Stonebraker: “[The Case for Shared Nothing](http://db.cs.berkeley.edu/papers/hpts85-nothing.pdf),” IEEE Database EngineeringBulletin, volume 9, number 1, pages 4–9, March 1986. +[^4]: Frank McSherry, Michael Isard, and Derek G. Murray: “[Scalability! But at What COST?](http://www.frankmcsherry.org/assets/COST.pdf),” at 15th USENIX Workshop on Hot Topics in Operating Systems (HotOS),May 2015. diff --git a/content/en/part-iii.md b/content/en/part-iii.md index 79220d4..73b0034 100644 --- a/content/en/part-iii.md +++ b/content/en/part-iii.md @@ -4,8 +4,9 @@ weight: 300 breadcrumbs: false --- -> [!IMPORTANT] -> This page is from the 1st edition +{{< callout type="warning" >}} +This page is from the 1st edition, 2nd edition is not available yet. +{{< /callout >}} In Parts [I](/en/part-i) and [II](/en/part-ii) of this book, we assembled from the ground up all the major considerations that go into a distributed database, from the layout of data on disk all the way to the limits of distributed consistency in the presence of faults. However, this discussion assumed that there was only one database in the application. diff --git a/content/en/preface.md b/content/en/preface.md index 8b9d90e..38c6517 100644 --- a/content/en/preface.md +++ b/content/en/preface.md @@ -4,8 +4,9 @@ weight: 50 breadcrumbs: false --- -> [!IMPORTANT] -> This page is from the 1st edition +{{< callout type="warning" >}} +This page is from the 1st edition, 2nd edition is not available yet. +{{< /callout >}} If you have worked in software engineering in recent years, especially in server-side and backend systems, you have probably been bombarded with a plethora of buzz‐ words relating to storage and processing of data. NoSQL! Big Data! Web-scale! Sharding! Eventual consistency! ACID! CAP theorem! Cloud services! MapReduce! Real-time! diff --git a/content/en/toc.md b/content/en/toc.md index c81f983..28f9906 100644 --- a/content/en/toc.md +++ b/content/en/toc.md @@ -6,10 +6,6 @@ breadcrumbs: false --- - - -## Table of Contents - ### [Preface](/en/preface) ### [Part I: Foundations of Data Systems](/en/part-i)