---
title: "9. The Trouble with Distributed Systems"
weight: 209
breadcrumbs: false
---

> *They’re funny things, Accidents. You never have them till you’re having them.*
>
> A.A. Milne, *The House at Pooh Corner* (1928)

As discussed in [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability), making a system reliable means ensuring that the
system as a whole continues working, even when things go wrong (i.e., when there is a fault).
However, anticipating all the possible faults and handling them is not that easy. As a developer, it
is very tempting to focus mostly on the happy path (after all, most of the time things work fine!)
and to neglect faults, since they introduce a lot of edge cases.

If you want your system to be reliable in the presence of faults, you have to radically change your
mindset and focus on the things that could go wrong, even though they may be unlikely. It doesn’t
matter whether there is only a one-in-a-million chance of a thing going wrong: in a large enough
system, one-in-a-million events happen every day. Experienced systems operators will tell you that
anything that *can* go wrong *will* go wrong.

Moreover, working with distributed systems is fundamentally different from writing software on a
single computer—and the main difference is that there are lots of new and exciting ways for things
to go wrong [[1](/en/ch9#Cavage2013),
[2](/en/ch9#Kreps2012_ch9)].
In this chapter, you will get a taste of the problems that arise in practice, and an understanding
of the things you can and cannot rely on.

To understand what challenges we are up against, we will now turn our pessimism to the maximum and
explore the things that may go wrong in a distributed system. We will look into problems with
networks ([“Unreliable Networks”](/en/ch9#sec_distributed_networks)) as well as clocks and timing issues
([“Unreliable Clocks”](/en/ch9#sec_distributed_clocks)). The consequences of all these issues are disorienting, so we’ll
explore how to think about the state of a distributed system and how to reason about things that
have happened ([“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth)). Later, in [Chapter 10](/en/ch10#ch_consistency), we will look at some
examples of how we can achieve fault tolerance in the face of those faults.

# Faults and Partial Failures

When you are writing a program on a single computer, it normally behaves in a fairly predictable
way: either it works or it doesn’t. Buggy software may give the appearance that the computer is
sometimes “having a bad day” (a problem that is often fixed by a reboot), but that is mostly just
a consequence of badly written software.

There is no fundamental reason why software on a single computer should be flaky: when the hardware
is working correctly, the same operation always produces the same result (it is *deterministic*). If
there is a hardware problem (e.g., memory corruption or a loose connector), the consequence is usually a
total system failure (e.g., kernel panic, “blue screen of death,” failure to start up). An individual
computer with good software is usually either fully functional or entirely broken, but not something
in between.

This is a deliberate choice in the design of computers: if an internal fault occurs, we prefer a
computer to crash completely rather than returning a wrong result, because wrong results are difficult
and confusing to deal with. Thus, computers hide the fuzzy physical reality on which they are
implemented and present an idealized system model that operates with mathematical perfection. A CPU
instruction always does the same thing; if you write some data to memory or disk, that data remains
intact and doesn’t get randomly corrupted. As discussed in [“Hardware and Software Faults”](/en/ch2#sec_introduction_hardware_faults),
this is not actually true—in reality, data does get silently corrupted and CPUs do sometimes
silently return the wrong result—but it happens rarely enough that we can get away with ignoring it.

When you are writing software that runs on several computers, connected by a network, the situation
is fundamentally different. In distributed systems, faults occur much more frequently, and so we can
no longer ignore them—we have no choice but to confront the messy reality of the physical world. And
in the physical world, a remarkably wide range of things can go wrong, as illustrated by this
anecdote [[3](/en/ch9#Hale2010)]:

> In my limited experience I’ve dealt with long-lived network partitions in a single data center (DC),
> PDU [power distribution unit] failures, switch failures, accidental power cycles of whole racks,
> whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford
> pickup truck into a DC’s HVAC [heating, ventilation, and air conditioning] system. And I’m not even
> an ops guy.
>
> Coda Hale

In a distributed system, there may well be some parts of the system that are broken in some
unpredictable way, even though other parts of the system are working fine. This is known as a
*partial failure*. The difficulty is that partial failures are *nondeterministic*: if you try to do
anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably
fail. As we shall see, you may not even *know* whether something succeeded or not!

This nondeterminism and possibility of partial failures is what makes distributed systems hard to
work with [[4](/en/ch9#Hodges2013)].
On the other hand, if a distributed system can tolerate partial failures, that opens up powerful
possibilities: for example, it allows you to perform a rolling upgrade, rebooting one node at a time
to install software updates while the system as a whole keeps working without interruption.
Fault tolerance therefore allows us to make distributed systems more reliable than single-node
systems: we can build a reliable system from unreliable components.

But before we can implement fault tolerance, we need to know more about the faults that we’re
supposed to tolerate. It is important to consider a wide range of possible faults—even fairly
unlikely ones—and to artificially create such situations in your testing environment to see what
happens. In distributed systems, suspicion, pessimism, and paranoia pay off.

# Unreliable Networks

As discussed in [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](/en/ch2#sec_introduction_shared_nothing), the distributed systems we focus on
in this book are mostly *shared-nothing systems*, i.e., a bunch of machines connected by a network.
The network is the only way those machines can communicate—we assume that each machine has its
own memory and disk, and one machine cannot access another machine’s memory or disk (except by
making requests to a service over the network). Even when storage is shared, such as with Amazon’s
S3, machines communicate with shared storage services over the network.

The internet and most internal networks in datacenters (often Ethernet) are *asynchronous packet
networks*. In this kind of network, one node can send a message (a packet) to another node, but the
network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send
a request and expect a response, many things could go wrong (some of which are illustrated in
[Figure 9-1](/en/ch9#fig_distributed_network)):

1. Your request may have been lost (perhaps someone unplugged a network cable).
2. Your request may be waiting in a queue and will be delivered later (perhaps the network or the
   recipient is overloaded).
3. The remote node may have failed (perhaps it crashed or it was powered down).
4. The remote node may have temporarily stopped responding (perhaps it is experiencing a long
   garbage collection pause; see [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses)), but it will start responding
   again later.
5. The remote node may have processed your request, but the response has been lost on the network
   (perhaps a network switch has been misconfigured).
6. The remote node may have processed your request, but the response has been delayed and will be
   delivered later (perhaps the network or your own machine is overloaded).

![ddia 0901](/fig/ddia_0901.png)

###### Figure 9-1. If you send a request and don’t get a response, it’s not possible to distinguish whether (a) the request was lost, (b) the remote node is down, or (c) the response was lost.

The sender can’t even tell whether the packet was delivered: the only option is for the recipient to
send a response message, which may in turn be lost or delayed. These issues are indistinguishable in
an asynchronous network: the only information you have is that you haven’t received a response yet.
If you send a request to another node and don’t receive a response, it is *impossible* to tell why.

The usual way of handling this issue is a *timeout*: after some time you give up waiting and assume that
the response is not going to arrive. However, when a timeout occurs, you still don’t know whether
the remote node got your request or not (and if the request is still queued somewhere, it may still
be delivered to the recipient, even if the sender has given up on it).
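The ambiguity left behind by a timeout can be reproduced on a single machine. The following is a minimal sketch, not a recommended client design: the helper names `start_silent_server` and `send_request`, the loopback address, and the 1,024-byte read size are all assumptions made for illustration. The server reads the request but never answers, so the client times out even though the request was in fact received.

```python
import socket
import threading

def start_silent_server(received):
    """Start a local TCP server that reads one request but never replies.

    Returns the port it listens on. The server appends the request bytes
    to the `received` list so the caller can inspect what arrived.
    """
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)

    def serve():
        conn, _ = listener.accept()
        received.append(conn.recv(1024))  # the request arrives at the server...
        threading.Event().wait(1.0)       # ...but no response is ever sent
        conn.close()
        listener.close()

    threading.Thread(target=serve, daemon=True).start()
    return listener.getsockname()[1]

def send_request(port, payload, timeout_s):
    """Send a request and wait up to timeout_s seconds for a response."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout_s) as sock:
        sock.sendall(payload)
        try:
            reply = sock.recv(1024)
        except socket.timeout:
            # We gave up waiting. The request may or may not have been handled.
            return "unknown"
        # An empty read means the peer closed the connection without replying,
        # which tells us just as little about whether the request was handled.
        return "ok" if reply else "unknown"
```

From the client's point of view, the `"unknown"` outcome looks exactly the same whether the request was lost, the server crashed, or (as here) the server handled the request and simply never answered.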

## The Limitations of TCP

Network packets have a maximum size (generally a few kilobytes), but many applications need to send
messages (requests, responses) that are too big to fit in one packet. These applications most often
use TCP, the Transmission Control Protocol, to establish a *connection* that breaks up large data
streams into individual packets, and puts them back together again on the receiving side.

###### Note

Most of what we say about TCP applies also to its more recent alternative QUIC, as well as the
Stream Control Transmission Protocol (SCTP) used in WebRTC, the BitTorrent uTP protocol, and
other transport protocols. For a comparison to UDP, see [“TCP Versus UDP”](/en/ch9#sidebar_distributed_tcp_udp).

TCP is often described as providing “reliable” delivery, in the sense that it detects and
retransmits dropped packets, it detects reordered packets and puts them back in the correct order,
and it detects packet corruption using a simple checksum. It also figures out how fast it can send
data so that it is transferred as quickly as possible, but without overloading the network or the
receiving node; this is known as *congestion control*, *flow control*, or *backpressure*
[[5](/en/ch9#Jacobson1988)].

When you “send” some data by writing it to a socket, it isn’t actually sent immediately;
it is only placed in a buffer managed by your operating system. When the congestion control
algorithm decides that it has capacity to send a packet, it takes the next packet-worth of data from
that buffer and passes it to the network interface. The packet passes through several switches and
routers, and eventually the receiving node’s operating system places the packet’s data in a receive
buffer and sends an acknowledgment packet back to the sender. Only then does the receiving operating
system notify the application that some more data has arrived
[[6](/en/ch9#Hubert2009)].

So, if TCP provides “reliability”, does that mean we no longer need to worry about networks being
unreliable? Unfortunately not. TCP decides that a packet must have been lost if no acknowledgment
arrives within some timeout, but it can’t tell whether it was the outbound packet or the
acknowledgment that was lost. Although TCP can resend the packet, it can’t guarantee that the new
packet will get through either. If the network cable is unplugged, TCP can’t plug it back in for
you. Eventually, after a configurable timeout, TCP gives up and signals an error to the application.

If a TCP connection is closed with an error—perhaps because the remote node crashed, or perhaps
because the network was interrupted—you unfortunately have no way of knowing how much data was
actually processed by the remote node [[6](/en/ch9#Hubert2009)].
Even if TCP acknowledged that a packet was delivered, this only means that the operating system
kernel on the remote node received it, but the application may have crashed before it handled that
data. If you want to be sure that a request was successful, you need a positive response from the
application itself
[[7](/en/ch9#Saltzer1984_ch9)].

Nevertheless, TCP is very useful, because it provides a convenient way of sending and receiving
messages that are too big to fit in one packet. Once a TCP connection is established, you can also
use it to send multiple requests and responses. This is usually done by first sending a header that
indicates the length of the following message in bytes, followed by the actual message. HTTP and
many RPC protocols (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) work like this.
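The length-prefix technique can be sketched in a few lines. This is an illustrative framing scheme only: the 4-byte big-endian header and the function names `encode_frame` and `decode_frames` are assumptions for this example, and real protocols encode message lengths in their own ways (HTTP, for instance, uses a textual header rather than a fixed binary prefix).

```python
import struct

HEADER = struct.Struct("!I")  # 4-byte unsigned length, network byte order (assumed format)

def encode_frame(message: bytes) -> bytes:
    """Prefix a message with its length so the receiver knows where it ends."""
    return HEADER.pack(len(message)) + message

def decode_frames(buffer: bytes):
    """Split a received byte stream into complete messages.

    Returns (messages, remainder), where remainder holds the bytes of a
    not-yet-complete frame that must be kept until more data arrives.
    """
    messages = []
    offset = 0
    while len(buffer) - offset >= HEADER.size:
        (length,) = HEADER.unpack_from(buffer, offset)
        start = offset + HEADER.size
        end = start + length
        if end > len(buffer):
            break  # the body has not been fully received yet
        messages.append(buffer[start:end])
        offset = end
    return messages, buffer[offset:]
```

Because TCP delivers a byte stream rather than discrete messages, a single read may return half a frame or several frames at once. The receiver therefore keeps the remainder and prepends it to the next chunk of data it reads.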

## Network Faults in Practice

We have been building computer networks for decades—one might hope that by now we would have figured
out how to make them reliable. Unfortunately, we have not yet succeeded. There are some systematic
studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common,
even in controlled environments like a datacenter operated by one company
[[8](/en/ch9#Bailis2014reliable)]:

* One study in a medium-sized datacenter found about 12 network faults per month, of which half
  disconnected a single machine, and half disconnected an entire rack
  [[9](/en/ch9#Leners2015)].
* Another study measured the failure rates of components like top-of-rack switches, aggregation
  switches, and load balancers
  [[10](/en/ch9#Gill2011)].
  It found that adding redundant networking gear doesn’t reduce faults as much as you might hope,
  since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause
  of outages.
* Interruptions of wide-area fiber links have been blamed on cows
  [[11](/en/ch9#Hoelzle2020)],
  beavers [[12](/en/ch9#CBCNews2021)],
  and sharks [[13](/en/ch9#Oremus2014)]
  (though shark bites have become rarer due to better shielding of submarine cables
  [[14](/en/ch9#AuerbachJahajeeah2023)]).
  Humans are also at fault, be it due to accidental misconfiguration
  [[15](/en/ch9#Janardhan2021)],
  scavenging [[16](/en/ch9#Parfitt2011)],
  or sabotage
  [[17](/en/ch9#Voce2025)].
* Across different cloud regions, round-trip times of up to several *minutes* have been observed at
  high percentiles [[18](/en/ch9#Liu2016), Table 3].
  Even within a single datacenter, packet delay of more than a minute can occur during a network
  topology reconfiguration, triggered by a problem during a software upgrade for a switch
  [[19](/en/ch9#Imbriaco2012_ch9)].
  Thus, we have to assume that messages might be delayed arbitrarily.
* Sometimes communications are partially interrupted, depending on who you’re talking to: for
  example, A and B can communicate, B and C can communicate, but A and C cannot
  [[20](/en/ch9#Lianza2020_ch9),
  [21](/en/ch9#Alfatafta2020)].
  Other surprising faults include a network interface that sometimes drops all inbound packets but
  sends outbound packets successfully [[22](/en/ch9#Donges2012)]:
  just because a network link works in one direction doesn’t guarantee it’s also working in the
  opposite direction.
* Even a brief network interruption can have repercussions that last for much longer than the
  original issue [[8](/en/ch9#Bailis2014reliable),
  [20](/en/ch9#Lianza2020_ch9),
  [23](/en/ch9#Toman2020)].

# Network partitions

When one part of the network is cut off from the rest due to a network fault, that is sometimes
called a *network partition* or *netsplit*, but it is not fundamentally different from other kinds
of network interruption. Network partitions are not related to sharding of a storage system, which
is sometimes also called *partitioning* (see [Chapter 7](/en/ch7#ch_sharding)).

Even if network faults are rare in your environment, the fact that faults *can* occur means that
your software needs to be able to handle them. Whenever any communication happens over a network, it
may fail—there is no way around it.

If the error handling of network faults is not defined and tested, arbitrarily bad things could
happen: for example, the cluster could become deadlocked and permanently unable to serve requests,
even when the network recovers [[24](/en/ch9#Kingsbury2014elastic)],
or it could even delete all of your data
[[25](/en/ch9#Sanfilippo2014)].
If software is put in an unanticipated situation, it may do arbitrary unexpected things.

Handling network faults doesn’t necessarily mean *tolerating* them: if your network is normally
fairly reliable, a valid approach may be to simply show an error message to users while your network
is experiencing problems. However, you do need to know how your software reacts to network problems
and ensure that the system can recover from them.
It may make sense to deliberately trigger network problems and test the system’s response (this is
known as *fault injection*; see [“Fault injection”](/en/ch9#sec_fault_injection)).

## Detecting Faults

Many systems need to automatically detect faulty nodes. For example:

* A load balancer needs to stop sending requests to a node that is dead (i.e., take it *out of rotation*).
* In a distributed database with single-leader replication, if the leader fails, one of the
  followers needs to be promoted to be the new leader (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)).

Unfortunately, the uncertainty about the network makes it difficult to tell whether a node is
working or not. In some specific circumstances you might get some feedback to explicitly tell you
that something is not working:

* If you can reach the machine on which the node should be running, but no process is listening on
  the destination port (e.g., because the process crashed), the operating system will helpfully close
  or refuse TCP connections by sending a `RST` or `FIN` packet in reply.
* If a node process crashed (or was killed by an administrator) but the node’s operating system is
  still running, a script can notify other nodes about the crash so that another node can take over
  quickly without having to wait for a timeout to expire. For example, HBase does this
  [[26](/en/ch9#Liochon2015)].
* If you have access to the management interface of the network switches in your datacenter, you can
  query them to detect link failures at a hardware level (e.g., if the remote machine is powered
  down). This option is ruled out if you’re connecting via the internet, or if you’re in a shared
  datacenter with no access to the switches themselves, or if you can’t reach the management
  interface due to a network problem.
* If a router is sure that the IP address you’re trying to connect to is unreachable, it may reply
  to you with an ICMP Destination Unreachable packet. However, the router doesn’t have a magic
  failure detection capability either—it is subject to the same limitations as other participants
  of the network.

Rapid feedback about a remote node being down is useful, but you can’t count on it. If something has
gone wrong, you may get an error response at some level of the stack, but in general you have to
assume that you will get no response at all. You can retry a few times, wait for a timeout to
elapse, and eventually declare the node dead if you don’t hear back within the timeout.
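As a rough sketch of this retry-then-give-up approach, the hypothetical `probe` function below tries to open a TCP connection a few times and reports one of three outcomes. The function name, the outcome labels, and the default timeout and attempt count are assumptions chosen for illustration, not recommendations.

```python
import socket

def probe(host, port, timeout_s=0.5, attempts=3):
    """Classify a TCP endpoint as 'reachable', 'refused', or 'no_response'."""
    for _ in range(attempts):
        try:
            with socket.create_connection((host, port), timeout=timeout_s):
                # The remote kernel accepted the connection. This does not
                # prove that the application behind it is healthy.
                return "reachable"
        except ConnectionRefusedError:
            # Explicit feedback: the machine is up, but nothing is listening.
            return "refused"
        except OSError:
            # Timeout or another network error: no usable information, retry.
            continue
    return "no_response"
```

Only the `"refused"` outcome is the kind of rapid, explicit feedback described above. A `"no_response"` result does not reveal whether the node crashed, is merely slow, or is cut off by the network; the caller can only decide to treat it as dead.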

## Timeouts and Unbounded Delays

If a timeout is the only sure way of detecting a fault, then how long should the timeout be? There
is unfortunately no simple answer.

A long timeout means a long wait until a node is declared dead (and during this time, users may have
to wait or see error messages). A short timeout detects faults faster, but carries a higher risk of
incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown (e.g., due
to a load spike on the node or the network).

Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of
performing some action (for example, sending an email), and another node takes over, the action may
end up being performed twice. We will discuss this issue in more detail in
[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), and in
Chapters [10](/en/ch10#ch_consistency)
and [Link to Come].

When a node is declared dead, its responsibilities need to be transferred to other nodes, which
places additional load on other nodes and the network. If the system is already struggling with high
load, declaring nodes dead prematurely can make the problem worse. In particular, it could happen
that the node actually wasn’t dead but only slow to respond due to overload; transferring its load
to other nodes can cause a cascading failure (in the extreme case, all nodes declare each other
dead, and everything stops working—see [“When an overloaded system won’t recover”](/en/ch2#sidebar_metastable)).

Imagine a fictitious system with a network that guaranteed a maximum delay for packets—every packet
is either delivered within some time *d*, or it is lost, but delivery never takes longer than *d*.
Furthermore, assume that you can guarantee that a non-failed node always handles a request within
some time *r*. In this case, you could guarantee that every successful request receives a response
within time 2*d* + *r*—and if you don’t receive a response within that time, you know
that either the network or the remote node is not working. If this were true,
2*d* + *r* would be a reasonable timeout to use.
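The bound is simple enough to write down directly. In this small sketch the function name and the use of integer milliseconds are assumptions; the formula is the 2*d* + *r* from the text, covering one trip for the request, the handling time, and one trip for the response.

```python
def timeout_bound_ms(max_one_way_delay_ms: int, max_handling_time_ms: int) -> int:
    """Upper bound on response time: request trip + handling + response trip."""
    return 2 * max_one_way_delay_ms + max_handling_time_ms
```

For example, with *d* = 5 ms and *r* = 20 ms, any response that has not arrived after 30 ms would indicate a fault in this fictitious system.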

Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks
have *unbounded delays* (that is, they try to deliver packets as quickly as possible, but there is
no upper limit on the time it may take for a packet to arrive), and most server implementations
cannot guarantee that they can handle requests within some maximum time (see
[“Response time guarantees”](/en/ch9#sec_distributed_clocks_realtime)). For failure detection, it’s not sufficient for the system to
be fast most of the time: if your timeout is low, it only takes a transient spike in round-trip
times to throw the system off-balance.

### Network congestion and queueing

When you drive a car, travel times on road networks vary mostly because of traffic congestion.
Similarly, the variability of packet delays on computer networks is most often due to queueing
[[27](/en/ch9#Grosvenor2015)]:

* If several different nodes simultaneously try to send packets to the same destination, the network
  switch must queue them up and feed them into the destination network link one by one (as illustrated
  in [Figure 9-2](/en/ch9#fig_distributed_switch_queueing)). On a busy network link, a packet may have to wait a while
  until it can get a slot (this is called *network congestion*). If there is so much incoming data
  that the switch queue fills up, the packet is dropped, so it needs to be resent—even though
  the network is functioning fine.
* When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming
  request from the network is queued by the operating system until the application is ready to
  handle it. Depending on the load on the machine, this may take an arbitrary length of time
  [[28](/en/ch9#Julienne2019)].
* In virtualized environments, a running operating system is often paused for tens of milliseconds
  while another virtual machine uses a CPU core. During this time, the VM cannot consume any data
  from the network, so the incoming data is queued (buffered) by the virtual machine monitor
  [[29](/en/ch9#Wang2010)],
  further increasing the variability of network delays.
* As mentioned earlier, in order to avoid overloading the network, TCP limits the rate at which it
  sends data. This means additional queueing at the sender before the data even enters the network.

![ddia 0902](/fig/ddia_0902.png)

###### Figure 9-2. If several machines send network traffic to the same destination, its switch queue can fill up. Here, ports 1, 2, and 4 are all trying to send packets to port 3.

Moreover, when TCP detects and automatically retransmits a lost packet, although the application
does not see the packet loss directly, it does see the resulting delay (waiting for the timeout to
expire, and then waiting for the retransmitted packet to be acknowledged).
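The switch queueing described above can be approximated with a toy simulation. This is a deliberately simplified model and not how any real switch is implemented: time advances in discrete ticks, the output port forwards exactly one packet per tick, and the function name `simulate_switch_port` and its parameters are assumptions for this example.

```python
from collections import deque

def simulate_switch_port(arrivals_per_tick, queue_capacity):
    """Simulate one output port that forwards one packet per tick.

    arrivals_per_tick: number of packets arriving for this port at each tick.
    Returns (delays, dropped): the queueing delay in ticks of each forwarded
    packet, and the number of packets dropped because the queue was full.
    """
    queue = deque()  # holds the arrival tick of each waiting packet
    delays = []
    dropped = 0
    tick = 0
    while tick < len(arrivals_per_tick) or queue:
        arriving = arrivals_per_tick[tick] if tick < len(arrivals_per_tick) else 0
        for _ in range(arriving):
            if len(queue) < queue_capacity:
                queue.append(tick)
            else:
                dropped += 1  # queue full: the packet is lost and must be resent
        if queue:
            delays.append(tick - queue.popleft())
        tick += 1
    return delays, dropped
```

When three packets arrive in each of two consecutive ticks, `simulate_switch_port([3, 3], 10)` yields delays of 0, 1, 2, 2, 3, and 4 ticks. With a queue capacity of only 2, `simulate_switch_port([3, 0, 0], 2)` drops one packet, even though nothing in this model is faulty.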

# TCP Versus UDP

Some latency-sensitive applications, such as videoconferencing and Voice over IP (VoIP), use UDP
rather than TCP. It’s a trade-off between reliability and variability of delays: as UDP does not
perform flow control and does not retransmit lost packets, it avoids some of the reasons for
variable network delays (although it is still susceptible to switch queues and scheduling delays).

UDP is a good choice in situations where delayed data is worthless. For example, in a VoIP phone
call, there probably isn’t enough time to retransmit a lost packet before its data is due to be
played over the loudspeakers. In this case, there’s no point in retransmitting the packet—the
application must instead fill the missing packet’s time slot with silence (causing a brief
interruption in the sound) and move on in the stream. The retry happens at the human layer instead.
(“Could you repeat that please? The sound just cut out for a moment.”)

All of these factors contribute to the variability of network delays. Queueing delays have an
especially wide range when a system is close to its maximum capacity: a system with plenty of spare
capacity can easily drain queues, whereas in a highly utilized system, long queues can build up very
quickly.

In public clouds and multitenant datacenters, resources are shared among many customers: the
network links and switches, and even each machine’s network interface and CPUs (when running on
virtual machines), are shared. Processing large amounts of data can use the entire capacity of
network links (*saturate* them). As you have no control over or insight into other customers’ usage of the shared
resources, network delays can be highly variable if someone near you (a *noisy neighbor*) is
using a lot of resources [[30](/en/ch9#Philips2014),
[31](/en/ch9#Newman2012)].

In such environments, you can only choose timeouts experimentally: measure the distribution of
network round-trip times over an extended period, and over many machines, to determine the expected
variability of delays. Then, taking into account your application’s characteristics, you can
determine an appropriate trade-off between failure detection delay and risk of premature timeouts.
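One possible way to turn such measurements into a number is to take a high percentile of the observed round-trip times and add some headroom. The sketch below assumes a nearest-rank percentile and a multiplicative safety factor; the function name `choose_timeout` and both default values are illustrative assumptions, and the right choices depend on your application.

```python
import math

def choose_timeout(rtt_samples_ms, percentile=99, safety_factor=2.0):
    """Pick a timeout from measured round-trip times.

    Takes the nearest-rank percentile of the samples and multiplies it by a
    safety factor to leave headroom for delays longer than any yet observed.
    """
    if not rtt_samples_ms:
        raise ValueError("need at least one RTT sample")
    ordered = sorted(rtt_samples_ms)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))  # nearest-rank method
    return ordered[rank - 1] * safety_factor
```

A higher percentile or a larger safety factor reduces the risk of premature timeouts, at the cost of slower failure detection.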

Even better, rather than using configured constant timeouts, systems can continually measure
response times and their variability (*jitter*), and automatically adjust timeouts according to the
observed response time distribution. The Phi Accrual failure detector
[[32](/en/ch9#Hayashibara2004)],
which is used, for example, in Akka and Cassandra
[[33](/en/ch9#Wang2013)],
is one way of doing this. TCP retransmission timeouts also work similarly
[[5](/en/ch9#Jacobson1988)].
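To make the idea of an adaptive timeout concrete, here is a simplified estimator in the style used for TCP retransmission timeouts: it tracks a smoothed round-trip time and a smoothed deviation, and sets the timeout to the former plus a multiple of the latter. This is not the Phi Accrual detector, and it is only a sketch. The class name `AdaptiveTimeout` is hypothetical, the smoothing constants are the conventional TCP values, and the minimum and maximum bounds that real TCP stacks apply are omitted.

```python
class AdaptiveTimeout:
    """Timeout that follows a smoothed RTT and its variability (jitter)."""

    def __init__(self, initial_timeout_s=1.0, alpha=0.125, beta=0.25, k=4.0):
        self.srtt = None     # smoothed round-trip time
        self.rttvar = None   # smoothed deviation of the round-trip time
        self.timeout_s = initial_timeout_s  # used until the first measurement
        self.alpha, self.beta, self.k = alpha, beta, k

    def observe(self, rtt_s):
        """Feed in one measured round-trip time; returns the updated timeout."""
        if self.srtt is None:
            # First measurement: no history yet, so assume high variability.
            self.srtt = rtt_s
            self.rttvar = rtt_s / 2
        else:
            # Update the deviation first, using the previous smoothed RTT.
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt_s)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt_s
        self.timeout_s = self.srtt + self.k * self.rttvar
        return self.timeout_s
```

With steady response times the deviation term shrinks and the timeout approaches the typical round-trip time; a sudden spike raises the deviation and pushes the timeout up again.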

## Synchronous Versus Asynchronous Networks

Distributed systems would be a lot simpler if we could rely on the network to deliver packets with
some fixed maximum delay, and not to drop packets. Why can’t we solve this at the hardware level
and make the network reliable so that the software doesn’t need to worry about it?

To answer this question, it’s interesting to compare datacenter networks to the traditional fixed-line
telephone network (non-cellular, non-VoIP), which is extremely reliable: delayed audio
frames and dropped calls are very rare. A phone call requires a constantly low end-to-end latency
and enough bandwidth to transfer the audio samples of your voice. Wouldn’t it be nice to have
similar reliability and predictability in computer networks?

When you make a call over the telephone network, it establishes a *circuit*: a fixed, guaranteed
amount of bandwidth is allocated for the call, along the entire route between the two callers. This
circuit remains in place until the call ends
[[34](/en/ch9#Keshav1997)].
For example, an ISDN network runs at a fixed rate of 4,000 frames per second. When a call is
established, it is allocated 16 bits of space within each frame (in each direction). Thus, for the
duration of the call, each side is guaranteed to be able to send exactly 16 bits of audio data every
250 microseconds
[[35](/en/ch9#Kyas1995)].
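The arithmetic behind these figures can be checked with a short calculation. The constants come from the text; the function names are made up for this example.

```python
FRAMES_PER_SECOND = 4_000  # fixed frame rate, from the text
BITS_PER_FRAME = 16        # space reserved for one call in each frame, per direction

def circuit_bandwidth_bps(frames_per_second, bits_per_frame):
    """Bandwidth guaranteed to one call in one direction, in bits per second."""
    return frames_per_second * bits_per_frame

def frame_interval_us(frames_per_second):
    """Time between consecutive frames, in microseconds."""
    return 1_000_000 / frames_per_second
```

At 4,000 frames per second, one frame is sent every 250 microseconds, and 16 bits per frame amounts to a guaranteed 64,000 bits per second in each direction.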

This kind of network is *synchronous*: even as data passes through several routers, it does not
suffer from queueing, because the 16 bits of space for the call have already been reserved in the
next hop of the network. And because there is no queueing, the maximum end-to-end latency of the
network is fixed. We call this a *bounded delay*.

### Can we not simply make network delays predictable?

Note that a circuit in a telephone network is very different from a TCP connection: a circuit is a
fixed amount of reserved bandwidth which nobody else can use while the circuit is established,
whereas the packets of a TCP connection opportunistically use whatever network bandwidth is
available. You can give TCP a variable-sized block of data (e.g., an email or a web page), and it
will try to transfer it in the shortest time possible. While a TCP connection is idle, it doesn’t
use any bandwidth (except perhaps for an occasional keepalive packet).

If datacenter networks and the internet were circuit-switched networks, it would be possible to
establish a guaranteed maximum round-trip time when a circuit was set up. However, they are not:
Ethernet and IP are packet-switched protocols, which suffer from queueing and thus unbounded delays
in the network. These protocols do not have the concept of a circuit.

Why do datacenter networks and the internet use packet switching? The answer is that they are
optimized for *bursty traffic*. A circuit is good for an audio or video call, which needs to
transfer a fairly constant number of bits per second for the duration of the call. On the other
hand, requesting a web page, sending an email, or transferring a file doesn’t have any particular
bandwidth requirement—we just want it to complete as quickly as possible.

If you wanted to transfer a file over a circuit, you would have to guess a bandwidth allocation. If
you guess too low, the transfer is unnecessarily slow, leaving network capacity unused. If you guess
too high, the circuit cannot be set up (because the network cannot allow a circuit to be created if
its bandwidth allocation cannot be guaranteed). Thus, using circuits for bursty data transfers
wastes network capacity and makes transfers unnecessarily slow. By contrast, TCP dynamically adapts
the rate of data transfer to the available network capacity.

There have been some attempts to build hybrid networks that support both circuit switching and
packet switching. *Asynchronous Transfer Mode* (ATM) was a competitor to Ethernet in the 1980s, but
it didn’t gain much adoption outside of telephone network core switches. InfiniBand has some similarities
[[36](/en/ch9#Mellanox2014)]:
it implements end-to-end flow control at the link layer, which reduces the need for queueing in the
network, although it can still suffer from delays due to link congestion
[[37](/en/ch9#Santos2003)].
With careful use of *quality of service* (QoS, prioritization and scheduling of packets) and *admission
control* (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or
provide statistically bounded delay [[27](/en/ch9#Grosvenor2015),
[34](/en/ch9#Keshav1997)]. New network algorithms like Low Latency, Low
Loss, and Scalable Throughput (L4S) attempt to mitigate some of the queueing and congestion control
problems both at the client and router level. Linux’s traffic controller (TC) also allows
applications to reprioritize packets for QoS purposes.

# Latency and Resource Utilization

More generally, you can think of variable delays as a consequence of dynamic resource partitioning.

Say you have a wire between two telephone switches that can carry up to 10,000 simultaneous calls.
Each circuit that is switched over this wire occupies one of those call slots. Thus, you can think of
the wire as a resource that can be shared by up to 10,000 simultaneous users. The resource is
divided up in a *static* way: even if you’re the only call on the wire right now, and all other
9,999 slots are unused, your circuit is still allocated the same fixed amount of bandwidth as when
the wire is fully utilized.

By contrast, the internet shares network bandwidth *dynamically*. Senders push and jostle with each
other to get their packets over the wire as quickly as possible, and the network switches decide
which packet to send (i.e., the bandwidth allocation) from one moment to the next. This approach has the
downside of queueing, but the advantage is that it maximizes utilization of the wire. The wire has a
fixed cost, so if you utilize it better, each byte you send over the wire is cheaper.
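
To make the contrast concrete, here is a minimal sketch comparing the bandwidth a single sender gets under the two schemes. The class name, the method names, and the 640 Mbit/s wire capacity are assumptions chosen for illustration, and the dynamic case is idealized: it ignores queueing and assumes that active senders split the wire evenly.

```java
// Illustrative sketch: the capacity figure used in main() is an assumption.
class WireSharing {
    static final int CALL_SLOTS = 10_000;

    // Static partitioning: each circuit gets one fixed slot, however many are in use.
    static double staticShare(double wireCapacityBitsPerSec) {
        return wireCapacityBitsPerSec / CALL_SLOTS;
    }

    // Idealized dynamic sharing: the active senders split the whole wire between them.
    static double dynamicShare(double wireCapacityBitsPerSec, int activeSenders) {
        return wireCapacityBitsPerSec / Math.max(1, activeSenders);
    }

    public static void main(String[] args) {
        double capacity = 640_000_000.0; // hypothetical wire: 10,000 slots of 64 kbit/s
        System.out.println("static, 1 user:  " + staticShare(capacity));
        System.out.println("dynamic, 1 user: " + dynamicShare(capacity, 1));
    }
}
```

With a single active sender, the static scheme leaves almost the entire wire idle, whereas the dynamic scheme hands that sender the full capacity. With all 10,000 senders active, the two shares coincide.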

A similar situation arises with CPUs: if you share each CPU core dynamically between several
threads, one thread sometimes has to wait in the operating system’s run queue while another thread
is running, so a thread can be paused for varying lengths of time
[[38](/en/ch9#Li2014)].
However, this utilizes the hardware better than if you allocated a static number of CPU cycles to
each thread (see [“Response time guarantees”](/en/ch9#sec_distributed_clocks_realtime)). Better hardware utilization is also why cloud
platforms run several virtual machines from different customers on the same physical machine.

Latency guarantees are achievable in certain environments, if resources are statically partitioned
(e.g., dedicated hardware and exclusive bandwidth allocations). However, it comes at the cost of
reduced utilization—in other words, it is more expensive. On the other hand, multitenancy with
dynamic resource partitioning provides better utilization, so it is cheaper, but it has the downside
of variable delays.

Variable delays in networks are not a law of nature, but simply the result of a cost/benefit
trade-off.

However, quality-of-service guarantees of the kind described earlier are currently not enabled in
multitenant datacenters and public clouds, or when communicating via the internet.
Currently deployed technology does not allow us to make any guarantees about delays or reliability
of the network: we have to assume that network congestion, queueing, and unbounded delays will
happen. Consequently, there’s no “correct” value for timeouts—they need to be determined
experimentally.
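
One way of determining a timeout experimentally is to record round-trip times in production and derive the timeout from a high percentile of those measurements. The following is only a sketch of that idea: the class name, the nearest-rank percentile method, and the safety factor are assumptions for illustration, not a recommendation from any particular system.

```java
class TimeoutEstimator {
    // Returns the p-th percentile (0 < p <= 100) of a non-empty sample set,
    // using the nearest-rank method.
    static long percentile(long[] samplesMillis, double p) {
        long[] sorted = samplesMillis.clone();
        java.util.Arrays.sort(sorted);
        int rank = (int) Math.ceil(p / 100.0 * sorted.length);
        return sorted[Math.max(0, rank - 1)];
    }

    // Hypothetical heuristic: a chosen percentile of observed round-trip times,
    // multiplied by a safety factor to leave headroom for congestion spikes.
    static long suggestTimeoutMillis(long[] samplesMillis, double p, double safetyFactor) {
        return Math.round(percentile(samplesMillis, p) * safetyFactor);
    }
}
```

Because the distribution of delays changes with load, such a calculation would need to be repeated continually rather than performed once.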

Peering agreements between internet service providers and the establishment of routes through the
Border Gateway Protocol (BGP) bear a closer resemblance to circuit switching than IP itself. At this
level, it is possible to buy dedicated bandwidth. However, internet routing operates at the level of
networks, not individual connections between hosts, and at a much longer timescale.

# Unreliable Clocks

Clocks and time are important. Applications depend on clocks in various ways to answer questions
like the following:

1. Has this request timed out yet?
2. What’s the 99th percentile response time of this service?
3. How many queries per second did this service handle on average in the last five minutes?
4. How long did the user spend on our site?
5. When was this article published?
6. At what date and time should the reminder email be sent?
7. When does this cache entry expire?
8. What is the timestamp on this error message in the log file?

Examples 1–4 measure *durations* (e.g., the time interval between a request being sent and a
response being received), whereas examples 5–8 describe *points in time* (events that occur on a
particular date, at a particular time).

In a distributed system, time is a tricky business, because communication is not instantaneous: it
takes time for a message to travel across the network from one machine to another. The time when a
message is received is always later than the time when it is sent, but due to variable delays in the
network, we don’t know how much later. This fact sometimes makes it difficult to determine the order
in which things happened when multiple machines are involved.

Moreover, each machine on the network has its own clock, which is an actual hardware device: usually
a quartz crystal oscillator. These devices are not perfectly accurate, so each machine has its own
notion of time, which may be slightly faster or slower than on other machines. It is possible to
synchronize clocks to some degree: the most commonly used mechanism is the Network Time Protocol (NTP), which
allows the computer clock to be adjusted according to the time reported by a group of servers
[[39](/en/ch9#Windl2006)].
The servers in turn get their time from a more accurate time source, such as a GPS receiver.

## Monotonic Versus Time-of-Day Clocks

Modern computers have at least two different kinds of clocks: a *time-of-day clock* and a *monotonic
clock*. Although they both measure time, it is important to distinguish the two, since they serve
different purposes.

### Time-of-day clocks

A time-of-day clock does what you intuitively expect of a clock: it returns the current date and
time according to some calendar (also known as *wall-clock time*). For example,
`clock_gettime(CLOCK_REALTIME)` on Linux and
`System.currentTimeMillis()` in Java return the number of seconds (or milliseconds) since the
*epoch*: midnight UTC on January 1, 1970, according to the Gregorian calendar, not counting leap
seconds. Some systems use other dates as their reference point.
(Although the Linux clock is called *real-time*, it has nothing to do with real-time operating
systems, as discussed in [“Response time guarantees”](/en/ch9#sec_distributed_clocks_realtime).)

Time-of-day clocks are usually synchronized with NTP, which means that a timestamp from one machine
(ideally) means the same as a timestamp on another machine. However, time-of-day clocks also have
various oddities, as described in the next section. In particular, if the local clock is too far
ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in
time. These jumps, as well as similar jumps caused by leap seconds, make time-of-day clocks
unsuitable for measuring elapsed time
[[40](/en/ch9#GrahamCumming2017)].

Time-of-day clocks can experience jumps due to the start and end of Daylight Saving Time (DST);
these can be avoided by always using UTC as the time zone, which does not have DST.
Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward
in steps of 10 ms on older Windows systems
[[41](/en/ch9#Holmes2006)].
On recent systems, this is less of a problem.

### Monotonic clocks

A monotonic clock is suitable for measuring a duration (time interval), such as a timeout or a
service’s response time: `clock_gettime(CLOCK_MONOTONIC)` or `clock_gettime(CLOCK_BOOTTIME)` on
Linux [[42](/en/ch9#Greef2021)]
and `System.nanoTime()` in Java are monotonic clocks, for example. The name comes from the fact that
they are guaranteed to always move forward (whereas a time-of-day clock may jump back in time).

You can check the value of the monotonic clock at one point in time, do something, and then check
the clock again at a later time. The *difference* between the two values tells you how much time
elapsed between the two checks, more like a stopwatch than a wall clock. However, the *absolute*
value of the clock is meaningless: it might be the number of nanoseconds since the computer was
booted up, or something similarly arbitrary. In particular, it makes no sense to compare monotonic
clock values from two different computers, because they don’t mean the same thing.
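
The stopwatch pattern can be sketched in Java as follows. The `Stopwatch` class and its `elapsedNanos` method are hypothetical helpers; only `System.nanoTime()` is a real API.

```java
class Stopwatch {
    // Measures how long a task takes, using the monotonic clock.
    static long elapsedNanos(Runnable task) {
        long start = System.nanoTime();   // the absolute value is arbitrary
        task.run();
        return System.nanoTime() - start; // only the difference is meaningful
    }

    public static void main(String[] args) {
        long nanos = elapsedNanos(() -> {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        System.out.println("elapsed ms: " + nanos / 1_000_000);
    }
}
```

Had the same measurement been made by subtracting two `System.currentTimeMillis()` readings, a clock reset between the two calls could yield a negative or wildly inflated duration.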

On a server with multiple CPU sockets, there may be a separate timer per CPU, which is not
necessarily synchronized with other CPUs
[[43](/en/ch9#Yang2015)].
Operating systems compensate for any discrepancy and try
to present a monotonic view of the clock to application threads, even as they are scheduled across
different CPUs. However, it is wise to take this guarantee of monotonicity with a pinch of salt
[[44](/en/ch9#Loughran2015)].

NTP may adjust the frequency at which the monotonic clock moves forward (this is known as *slewing*
the clock) if it detects that the computer’s local quartz is moving faster or slower than the NTP
server. By default, NTP allows the clock rate to be sped up or slowed down by up to 0.05%, but
NTP cannot cause the monotonic clock to jump forward or backward. The resolution of monotonic
clocks is usually quite good: on most systems they can measure time intervals in microseconds or
less.
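
The slew limit implies that correcting an offset gradually takes a long time. As a rough worked example, assuming the clock is slewed at the maximum rate of 0.05% for the whole period and that no further drift occurs, the minimum correction time is the offset divided by the slew rate (the class and method names here are illustrative):

```java
class SlewTime {
    static final double MAX_SLEW_RATE = 0.0005; // 0.05%, the default limit quoted above

    // Minimum number of seconds needed to absorb an offset purely by slewing.
    static double minSecondsToCorrect(double offsetSeconds) {
        return Math.abs(offsetSeconds) / MAX_SLEW_RATE;
    }
}
```

For example, removing an offset of 100 ms in this way takes at least 200 seconds.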

In a distributed system, using a monotonic clock for measuring elapsed time (e.g., timeouts) is
usually fine, because it doesn’t assume any synchronization between different nodes’ clocks and is
not sensitive to slight inaccuracies of measurement.

## Clock Synchronization and Accuracy

Monotonic clocks don’t need synchronization, but time-of-day clocks need to be set according to an
NTP server or other external time source in order to be useful. Unfortunately, our methods for
getting a clock to tell the correct time aren’t nearly as reliable or accurate as you might
hope—hardware clocks and NTP can be fickle beasts. To give just a few examples:

* The quartz clock in a computer is not very accurate: it *drifts* (runs faster or slower than it
  should). Clock drift varies depending on the temperature of the machine. Google assumes a clock
  drift of up to 200 ppm (parts per million) for its servers
  [[45](/en/ch9#Corbett2012_ch9)],
  which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30
  seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best
  possible accuracy you can achieve, even if everything is working correctly.
* If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the
  local clock will be forcibly reset [[39](/en/ch9#Windl2006)]. Any
  applications observing the time before and after this reset may see time go backward or suddenly
  jump forward.
* If a node is accidentally firewalled off from NTP servers, the misconfiguration may go
  unnoticed for some time, during which the drift may add up to large discrepancies between
  different nodes’ clocks. Anecdotal evidence suggests that this does happen in practice.
* NTP synchronization can only be as good as the network delay, so there is a limit to its
  accuracy when you’re on a congested network with variable packet delays. One experiment showed
  that a minimum error of 35 ms is achievable when synchronizing over the internet
  [[46](/en/ch9#Caporaloni2012)],
  though occasional spikes in network delay lead to errors of around a second. Depending on the
  configuration, large network delays can cause the NTP client to give up entirely.
* Some NTP servers are wrong or misconfigured, reporting time that is off by hours
  [[47](/en/ch9#Minar1999),
  [48](/en/ch9#Holub2014)].
  NTP clients mitigate such errors by querying several servers and ignoring outliers.
  Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you
  were told by a stranger on the internet.
* Leap seconds result in a minute that is 59 seconds or 61 seconds long, which messes up timing
  assumptions in systems that are not designed with leap seconds in mind
  [[49](/en/ch9#Kamp2011)].
  The fact that leap seconds have crashed many large systems
  [[40](/en/ch9#GrahamCumming2017),
  [50](/en/ch9#Minar2012_ch9)]
  shows how easy it is for incorrect assumptions about clocks to sneak into a system. The best
  way of handling leap seconds may be to make NTP servers “lie,” by performing the leap second
  adjustment gradually over the course of a day (this is known as *smearing*)
  [[51](/en/ch9#Pascoe2011),
  [52](/en/ch9#Zhao2015)],
  although actual NTP server behavior varies in practice
  [[53](/en/ch9#Veitch2016)].
  Leap seconds will no longer be used from 2035 onwards, so this problem will fortunately go away.
* In virtual machines, the hardware clock is virtualized, which raises additional challenges for
  applications that need accurate timekeeping
  [[54](/en/ch9#VMware2011)].
  When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds
  while another VM is running. From an application’s point of view, this pause manifests itself as
  the clock suddenly jumping forward [[29](/en/ch9#Wang2010)].
  If a VM pauses for several seconds, the clock may then be several seconds behind the actual time,
  but NTP may continue to report that the clock is almost perfectly in sync
  [[55](/en/ch9#Yodaiken2017)].
* If you run software on devices that you don’t fully control (e.g., mobile or embedded devices), you
  probably cannot trust the device’s hardware clock at all. Some users deliberately set their
  hardware clock to an incorrect date and time, for example to cheat in games
  [[56](/en/ch9#EmreAcer2017)].
  As a result, the clock might be set to a time wildly in the past or the future.
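
The drift figures in the first item of this list follow from simple arithmetic: the worst-case drift is the drift rate multiplied by the time since the last synchronization. A small sketch (the class and method names are illustrative):

```java
class DriftBudget {
    // Worst-case drift, in seconds, accumulated since the last synchronization.
    static double maxDriftSeconds(double driftPpm, double secondsSinceSync) {
        return driftPpm / 1_000_000.0 * secondsSinceSync;
    }
}
```

At 200 ppm, `maxDriftSeconds(200, 30)` is about 0.006 (6 ms), and `maxDriftSeconds(200, 86_400)` is about 17.3 seconds for a clock that is resynchronized once a day.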

It is possible to achieve very good clock accuracy if you care about it sufficiently to invest
significant resources. For example, the MiFID II European regulation for financial
institutions requires all high-frequency trading funds to synchronize their clocks to within 100
microseconds of UTC, in order to help debug market anomalies such as “flash crashes” and to help
detect market manipulation
[[57](/en/ch9#MiFID2015)].

Such accuracy can be achieved with some special hardware (GPS receivers and/or atomic clocks), the
Precision Time Protocol (PTP) and careful deployment and monitoring
[[58](/en/ch9#Bigum2015),
[59](/en/ch9#Obleukhov2022)].
Relying on GPS alone can be risky because GPS signals can easily be jammed. In some locations this
happens frequently, e.g., close to military facilities
[[60](/en/ch9#Wiseman2022)].
Some cloud providers have begun offering high-accuracy clock synchronization for their virtual
machines
[[61](/en/ch9#Levinson2023)].
However, clock synchronization still requires a lot of care. If your NTP daemon is misconfigured, or
a firewall is blocking NTP traffic, the clock error due to drift can quickly become large.

## Relying on Synchronized Clocks

The problem with clocks is that while they seem simple and easy to use, they have a surprising
number of pitfalls: a day may not have exactly 86,400 seconds, time-of-day clocks may move backward
in time, and the time according to one node’s clock may be quite different from another node’s clock.

Earlier in this chapter we discussed networks dropping and arbitrarily delaying packets. Even though
networks are well behaved most of the time, software must be designed on the assumption that the
network will occasionally be faulty, and the software must handle such faults gracefully. The same
is true with clocks: although they work quite well most of the time, robust software needs to be
prepared to deal with incorrect clocks.

Part of the problem is that incorrect clocks easily go unnoticed. If a machine’s CPU is defective or
its network is misconfigured, it most likely won’t work at all, so it will quickly be noticed and
fixed. On the other hand, if its quartz clock is defective or its NTP client is misconfigured, most
things will seem to work fine, even though its clock gradually drifts further and further away from
reality. If some piece of software is relying on an accurately synchronized clock, the result is
more likely to be silent and subtle data loss than a dramatic crash
[[62](/en/ch9#Kingsbury2013cassandra),
[63](/en/ch9#Daily2013_ch9)].

Thus, if you use software that requires synchronized clocks, it is essential that you also carefully
monitor the clock offsets between all the machines. Any node whose clock drifts too far from the
others should be declared dead and removed from the cluster. Such monitoring ensures that you notice
the broken clocks before they can cause too much damage.
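
A minimal sketch of such a check is shown below. It assumes that some other component (not shown) has already measured each node’s clock offset against a common reference, and it flags any node whose offset is too far from the cluster median. The class name, the use of the median, and the threshold are all assumptions for illustration.

```java
class ClockOffsetMonitor {
    // Median of a non-empty array of values.
    static double median(double[] values) {
        double[] sorted = values.clone();
        java.util.Arrays.sort(sorted);
        int n = sorted.length;
        return n % 2 == 1 ? sorted[n / 2] : (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0;
    }

    // Element i of the result is true if node i's offset is too far from the median.
    static boolean[] findOutliers(double[] offsetsMillis, double maxDeviationMillis) {
        double mid = median(offsetsMillis);
        boolean[] outlier = new boolean[offsetsMillis.length];
        for (int i = 0; i < offsetsMillis.length; i++) {
            outlier[i] = Math.abs(offsetsMillis[i] - mid) > maxDeviationMillis;
        }
        return outlier;
    }
}
```

Comparing against the median rather than the mean means that a single wildly wrong clock does not shift the baseline against which the other nodes are judged.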

### Timestamps for ordering events

Let’s consider one particular situation in which it is tempting, but dangerous, to rely on clocks:
ordering of events across multiple nodes
[[64](/en/ch9#Brooker2023time)].
For example, if two clients write to a distributed database, who got there first? Which write is the
more recent one?

[Figure 9-3](/en/ch9#fig_distributed_timestamps) illustrates a dangerous use of time-of-day clocks in a database with
multi-leader replication (the example is similar to [Figure 6-8](/en/ch6#fig_replication_causality)). Client A writes
*x* = 1 on node 1; the write is replicated to node 3; client B increments *x* on node
3 (we now have *x* = 2); and finally, both writes are replicated to node 2.

![ddia 0903](/fig/ddia_0903.png)

###### Figure 9-3. The write by client B is causally later than the write by client A, but B’s write has an earlier timestamp.

In [Figure 9-3](/en/ch9#fig_distributed_timestamps), when a write is replicated to other nodes, it is tagged with a
timestamp according to the time-of-day clock on the node where the write originated. The clock
synchronization is very good in this example: the skew between node 1 and node 3 is less than
3 ms, which is probably better than you can expect in practice.

Since the increment builds upon the earlier write of *x* = 1, we might expect that the
write of *x* = 2 should have the greater timestamp of the two. Unfortunately, that is
not what happens in [Figure 9-3](/en/ch9#fig_distributed_timestamps): the write *x* = 1 has a timestamp of
42.004 seconds, but the write *x* = 2 has a timestamp of 42.003 seconds.

As discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww), one way of resolving conflicts between concurrently written
values on different nodes is *last write wins* (LWW), which means keeping the write with the
greatest timestamp for a given key and discarding all writes with older timestamps. In the example
of [Figure 9-3](/en/ch9#fig_distributed_timestamps), when node 2 receives these two events, it will incorrectly
conclude that *x* = 1 is the more recent value and drop the write *x* = 2,
so the increment is lost.
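
The lost increment can be reproduced with a few lines of code. The following sketch models node 2 as a single last-write-wins register; the `LwwRegister` class is hypothetical, and the timestamps are those of Figure 9-3 expressed in milliseconds.

```java
class LwwRegister {
    private long timestampMillis = Long.MIN_VALUE;
    private int value;

    // Keeps the incoming write only if its timestamp is greater than the stored one.
    void apply(int newValue, long newTimestampMillis) {
        if (newTimestampMillis > timestampMillis) {
            value = newValue;
            timestampMillis = newTimestampMillis;
        }
    }

    int get() {
        return value;
    }

    public static void main(String[] args) {
        LwwRegister x = new LwwRegister();
        x.apply(1, 42_004); // client A's write, timestamped by node 1
        x.apply(2, 42_003); // client B's increment, timestamped by node 3
        System.out.println(x.get()); // prints 1: the increment has been discarded
    }
}
```

The outcome is the same regardless of the order in which the two writes arrive at node 2, because only the timestamps are compared.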

This problem can be prevented by ensuring that when a value is overwritten, the new value always has
a higher timestamp than the overwritten value, even if that timestamp is ahead of the writer’s local
clock. However, that incurs the cost of an additional read to find the greatest existing timestamp.
Some systems, including Cassandra and ScyllaDB, want to write to all replicas in a single round
trip, and therefore they simply use the client clock’s timestamp along with a last-write-wins
policy [[62](/en/ch9#Kingsbury2013cassandra)]. This approach has some
serious problems:

* Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite
  values previously written by a node with a fast clock until the clock skew between the nodes has
  elapsed [[63](/en/ch9#Daily2013_ch9),
  [65](/en/ch9#Kingsbury2013timestamps)].
  This scenario can cause arbitrary amounts of data to be silently dropped without any error being
  reported to the application.
* LWW cannot distinguish between writes that occurred sequentially in quick succession (in
  [Figure 9-3](/en/ch9#fig_distributed_timestamps), client B’s increment definitely occurs *after* client A’s write)
  and writes that were truly concurrent (neither writer was aware of the other). Additional
  causality tracking mechanisms, such as version vectors, are needed in order to prevent violations
  of causality (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)).
* It is possible for two nodes to independently generate writes with the same timestamp, especially
  when the clock only has millisecond resolution. An additional tiebreaker value (which can simply
  be a large random number) is required to resolve such conflicts, but this approach can also lead to
  violations of causality [[62](/en/ch9#Kingsbury2013cassandra)].

Thus, even though it is tempting to resolve conflicts by keeping the most “recent” value and
discarding others, it’s important to be aware that the definition of “recent” depends on a local
time-of-day clock, which may well be incorrect. Even with tightly NTP-synchronized clocks, you could
send a packet at timestamp 100 ms (according to the sender’s clock) and have it arrive at
timestamp 99 ms (according to the recipient’s clock)—so it appears as though the packet
arrived before it was sent, which is impossible.

Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur?
Probably not, because NTP’s synchronization accuracy is itself limited by the network round-trip
time, in addition to other sources of error such as quartz drift. To guarantee a correct ordering,
you would need the clock error to be significantly lower than the network delay, which is not
possible.

So-called *logical clocks*
[[66](/en/ch9#Lamport1978_ch9)],
which are based on incrementing counters rather than an oscillating quartz crystal, are a safer
alternative for ordering events (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). Logical clocks do not measure
the time of day or the number of seconds elapsed, only the relative ordering of events (whether one
event happened before or after another). In contrast, time-of-day and monotonic clocks, which
measure actual elapsed time, are also known as *physical clocks*. We’ll look at logical clocks in
more detail in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical).
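
As a preview, one well-known form of logical clock is the Lamport timestamp: each node keeps a counter, increments it for every local event, and on receiving a message moves its counter past the timestamp carried by that message. The sketch below is a minimal illustration with hypothetical class and method names; it is not taken from any particular database.

```java
class LamportClock {
    private long counter = 0;

    // Called for a local event or before sending a message; returns the event's timestamp.
    synchronized long tick() {
        return ++counter;
    }

    // Called when a message arrives carrying the sender's timestamp.
    synchronized long receive(long senderTimestamp) {
        counter = Math.max(counter, senderTimestamp) + 1;
        return counter;
    }

    public static void main(String[] args) {
        LamportClock node1 = new LamportClock();
        LamportClock node3 = new LamportClock();
        long writeA = node1.tick();      // client A writes x = 1 on node 1
        node3.receive(writeA);           // the write is replicated to node 3
        long incrementB = node3.tick();  // client B increments x on node 3
        System.out.println(writeA + " < " + incrementB); // prints 1 < 3
    }
}
```

In the scenario of Figure 9-3, the increment now receives a greater timestamp than the write it builds upon, no matter how the nodes’ quartz clocks are set.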

### Clock readings with a confidence interval

You may be able to read a machine’s time-of-day clock with microsecond or even nanosecond
resolution. But even if you can get such a fine-grained measurement, that doesn’t mean the value is
actually accurate to such precision. In fact, it most likely is not—as mentioned previously, the
drift in an imprecise quartz clock can easily be several milliseconds, even if you synchronize with
an NTP server on the local network every minute. With an NTP server on the public internet, the best
possible accuracy is probably to the tens of milliseconds, and the error may easily spike to over
100 ms when there is network congestion.

Thus, it doesn’t make sense to think of a clock reading as a point in time—it is more like a
range of times, within a confidence interval: for example, a system may be 95% confident that the
time now is between 10.3 and 10.5 seconds past the minute, but it doesn’t know any more precisely
than that [[67](/en/ch9#Sheehy2015)].
If we only know the time to within ±100 ms, the microsecond digits in the timestamp are
essentially meaningless.

The uncertainty bound can be calculated based on your time source. If you have a GPS receiver or
atomic clock directly attached to your computer, the expected error range is determined by
the device and, in the case of GPS, by the quality of the signal from the satellites. If you’re
getting the time from a server, the uncertainty is based on the expected quartz drift since your
last sync with the server, plus the NTP server’s uncertainty, plus the network round-trip time to
the server (to a first approximation, and assuming you trust the server).
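
Written as code, this first-order calculation is just a sum of three terms. The class name, parameter names, and units below are assumptions for illustration; a real implementation would need to account for further sources of error.

```java
class ClockUncertainty {
    // First-order bound, in milliseconds: drift since the last sync,
    // plus the server's own uncertainty, plus the network round-trip time.
    static double boundMillis(double driftPpm, double secondsSinceSync,
                              double serverUncertaintyMillis, double roundTripMillis) {
        double driftMillis = driftPpm / 1_000_000.0 * secondsSinceSync * 1000.0;
        return driftMillis + serverUncertaintyMillis + roundTripMillis;
    }
}
```

For example, with 200 ppm drift, 30 seconds since the last sync, a server uncertainty of 1 ms, and a 2 ms round trip, the bound comes to about 9 ms. It keeps growing until the next successful synchronization.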

Unfortunately, most systems don’t expose this uncertainty: for example, when you call
`clock_gettime()`, the return value doesn’t tell you the expected error of the timestamp, so you
don’t know if its confidence interval is five milliseconds or five years.

There are exceptions: the *TrueTime* API in Google’s Spanner
[[45](/en/ch9#Corbett2012_ch9)] and Amazon’s ClockBound explicitly report the
confidence interval on the local clock. When you ask such an API for the current time, you get back two
values: `[earliest, latest]`, which are the *earliest possible* and the *latest possible*
timestamp. Based on its uncertainty calculations, the clock knows that the actual current time is
somewhere within that interval. The width of the interval depends, among other things, on how long
it has been since the local quartz clock was last synchronized with a more accurate clock source.

### Synchronized clocks for global snapshots

In [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation) we discussed *multi-version concurrency control* (MVCC),
which is a very useful feature in databases that need to support both small, fast read-write
transactions and large, long-running read-only transactions (e.g., for backups or analytics). It
allows read-only transactions to see a *snapshot* of the database, a consistent state at a
particular point in time, without locking and interfering with read-write transactions.

Generally, MVCC requires a monotonically increasing transaction ID. If a write happened later than
the snapshot (i.e., the write has a greater transaction ID than the snapshot), that write is
invisible to the snapshot transaction. On a single-node database, a simple counter is sufficient for
generating transaction IDs.

However, when a database is distributed across many machines, potentially in multiple datacenters, a
global, monotonically increasing transaction ID (across all shards) is difficult to generate,
because it requires coordination. The transaction ID must reflect causality: if transaction B reads
or overwrites a value that was previously written by transaction A, then B must have a higher
transaction ID than A—otherwise, the snapshot would not be consistent. With lots of small, rapid
transactions, creating transaction IDs in a distributed system becomes an untenable
bottleneck. (We will discuss such ID generators in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical).)

Can we use the timestamps from synchronized time-of-day clocks as transaction IDs? If we could get
the synchronization good enough, they would have the right properties: later transactions have a
higher timestamp. The problem, of course, is the uncertainty about clock accuracy.

Spanner implements snapshot isolation across datacenters in this way
[[68](/en/ch9#Demirbas2013),
[69](/en/ch9#Malkhi2013)].
It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the
following observation: if you have two confidence intervals, each consisting of an earliest and
latest possible timestamp ($A = [A_\text{earliest}, A_\text{latest}]$ and
$B = [B_\text{earliest}, B_\text{latest}]$), and those two intervals do not overlap (i.e.,
$A_\text{earliest} < A_\text{latest} < B_\text{earliest} < B_\text{latest}$), then B definitely happened after A: there
can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened.
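
The comparison of two confidence intervals can be sketched as follows. The `TimeInterval` class is hypothetical and is not the TrueTime API; it only illustrates the non-overlap condition.

```java
final class TimeInterval {
    final long earliest;
    final long latest;

    TimeInterval(long earliest, long latest) {
        if (earliest > latest) {
            throw new IllegalArgumentException("earliest must not exceed latest");
        }
        this.earliest = earliest;
        this.latest = latest;
    }

    // True only if this interval ends strictly before the other one begins.
    boolean definitelyBefore(TimeInterval other) {
        return this.latest < other.earliest;
    }

    // If neither interval is definitely before the other, the order is unknown.
    boolean overlaps(TimeInterval other) {
        return !this.definitelyBefore(other) && !other.definitelyBefore(this);
    }
}
```

Note that the ordering is only partial: for overlapping intervals, both `a.definitelyBefore(b)` and `b.definitelyBefore(a)` return false.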

In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the
length of the confidence interval before committing a read-write transaction. By doing so, it
ensures that any transaction that may read the data is at a sufficiently later time, so their
confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner
needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS
receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about
7 ms [[45](/en/ch9#Corbett2012_ch9)].
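
The waiting step can be sketched as follows. This is a simplified illustration of the idea and not Spanner’s actual implementation: the `UncertainClock` interface is hypothetical, and the sketch assumes that the true time always lies between `earliest()` and `latest()`.

```java
class CommitWait {
    // Hypothetical clock reporting bounds on the current time, in milliseconds.
    interface UncertainClock {
        long earliest();
        long latest();
    }

    // Chooses a commit timestamp, then blocks until that timestamp is certainly in the past.
    static long commit(UncertainClock clock) {
        long commitTimestamp = clock.latest();
        while (clock.earliest() <= commitTimestamp) {
            try {
                Thread.sleep(1); // wait out the clock uncertainty
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                throw new IllegalStateException("interrupted during commit wait", e);
            }
        }
        return commitTimestamp;
    }
}
```

If the clock’s interval is the current reading plus or minus some error, the loop runs for roughly the width of the interval, which is why a smaller uncertainty translates directly into lower commit latency.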

The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to
have a confidence interval, and the accurate clock sources only help keep that interval small. Other
systems are beginning to adopt similar approaches: for example, YugabyteDB can leverage ClockBound
when running on AWS [[70](/en/ch9#Pachot2024)],
and several other systems now also rely on clock synchronization to various degrees
[[71](/en/ch9#Kimball2022),
[72](/en/ch9#Demirbas2025)].

## Process Pauses

Let’s consider another example of dangerous clock use in a distributed system. Say you have a
database with a single leader per shard. Only the leader is allowed to accept writes. How does a
node know that it is still leader (that it hasn’t been declared dead by the others), and that it may
safely accept writes?

One option is for the leader to obtain a *lease* from the other nodes, which is similar to a lock
with a timeout [[73](/en/ch9#Gray1989)].
Only one node can hold the lease at any one time—thus, when a node obtains a lease, it knows that
it is the leader for some amount of time, until the lease expires. In order to remain leader, the
node must periodically renew the lease before it expires. If the node fails, it stops renewing the
lease, so another node can take over when it expires.

You can imagine the request-handling loop looking something like this:

```
while (true) {
    request = getIncomingRequest();

    // Ensure that the lease always has at least 10 seconds remaining
    if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) {
        lease = lease.renew();
    }

    if (lease.isValid()) {
        process(request);
    }
}
```

What’s wrong with this code? Firstly, it’s relying on synchronized clocks: the expiry time on the
lease is set by a different machine (where the expiry may be calculated as the current time plus 30
seconds, for example), and it’s being compared to the local system clock. If the clocks are out of
sync by more than a few seconds, this code will start doing strange things.

Secondly, even if we change the protocol to only use the local monotonic clock, there is another
problem: the code assumes that very little time passes between the point that it checks the time
(`System.currentTimeMillis()`) and the time when the request is processed (`process(request)`).
Normally this code runs very quickly, so the 10 second buffer is more than enough to ensure that the
lease doesn’t expire in the middle of processing a request.

However, what if there is an unexpected pause in the execution of the program? For example, imagine
the thread stops for 15 seconds around the line `lease.isValid()` before finally continuing. In
that case, it’s likely that the lease will have expired by the time the request is processed, and
another node has already taken over as leader. However, there is nothing to tell this thread that it
was paused for so long, so this code won’t notice that the lease has expired until the next
iteration of the loop—by which time it may have already done something unsafe by processing the
request.
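
For reference, here is a sketch of what tracking the lease on the local monotonic clock might look like. It assumes a protocol (not specified in the text) in which the lease is granted as a duration rather than as an absolute expiry time; the `LocalLease` class and its methods are hypothetical.

```java
class LocalLease {
    private final long expiryNanos; // measured on this machine's monotonic clock

    // requestSentNanos must be read from System.nanoTime() BEFORE the renewal request
    // is sent, so the local view of the lease never outlives the grantor's view.
    LocalLease(long requestSentNanos, long durationNanos) {
        this.expiryNanos = requestSentNanos + durationNanos;
    }

    // Uses subtraction, which stays correct even if nanoTime values wrap around.
    long remainingNanos(long nowNanos) {
        return expiryNanos - nowNanos;
    }

    boolean isValid(long nowNanos, long safetyMarginNanos) {
        return remainingNanos(nowNanos) > safetyMarginNanos;
    }
}
```

This removes the dependence on synchronized time-of-day clocks, but it does nothing about the second problem: if the thread is paused between the call to `isValid(System.nanoTime(), margin)` and the processing of the request, the check is already stale by the time its result is used.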

Is it reasonable to assume that a thread might be paused for so long? Unfortunately yes. There are
various reasons why this could happen:

* Contention among threads accessing a shared resource, such as a lock or queue, can cause threads
  to spend a lot of their time waiting. Moving to a machine with more CPU cores can make such
  problems worse, and contention problems can be difficult to diagnose
  [[74](/en/ch9#Sturman2022)].
* Many programming language runtimes (such as the Java Virtual Machine) have a *garbage collector*
  (GC) that occasionally needs to stop all running threads. In the past, such *“stop-the-world” GC
  pauses* would sometimes last for several minutes
  [[75](/en/ch9#Lipcon2011)]!
  With modern GC algorithms this is less of a problem, but GC pauses can still be noticeable (see
  [“Limiting the impact of garbage collection”](/en/ch9#sec_distributed_gc_impact)).
* In virtualized environments, a virtual machine can be *suspended* (pausing the execution of all
  processes and saving the contents of memory to disk) and *resumed* (restoring the contents of
  memory and continuing execution). This pause can occur at any time in a process’s execution and can
  last for an arbitrary length of time. This feature is sometimes used for *live migration* of
  virtual machines from one host to another without a reboot, in which case the length of the pause
  depends on the rate at which processes are writing to memory
  [[76](/en/ch9#Clark2005)].
* On end-user devices such as laptops and phones, execution may also be suspended and resumed
  arbitrarily, e.g., when the user closes the lid of their laptop.
* When the operating system context-switches to another thread, or when the hypervisor switches to a
  different virtual machine (when running in a virtual machine), the currently running thread can be
  paused at any arbitrary point in the code. In the case of a virtual machine, the CPU time spent in
  other virtual machines is known as *steal time*. If the machine is under heavy load—i.e., if
  there is a long queue of threads waiting to run—it may take some time before the paused thread
  gets to run again.
* If the application performs synchronous disk access, a thread may be paused waiting for a slow
  disk I/O operation to complete [[77](/en/ch9#Shaver2008)]. In many languages, disk access can happen
  surprisingly, even if the code doesn’t explicitly mention file access—for example, the Java
  classloader lazily loads class files when they are first used, which could happen at any time in
  the program execution. I/O pauses and GC pauses may even conspire to combine their delays
  [[78](/en/ch9#Zhuang2016)].
  If the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the
  I/O latency is further subject to the variability of network delays
  [[31](/en/ch9#Newman2012)].
* If the operating system is configured to allow *swapping to disk* (*paging*), a simple memory
  access may result in a page fault that requires a page from disk to be loaded into memory. The
  thread is paused while this slow I/O operation takes place. If memory pressure is high, this may
  in turn require a different page to be swapped out to disk. In extreme circumstances, the
  operating system may spend most of its time swapping pages in and out of memory and getting little
  actual work done (this is known as *thrashing*). To avoid this problem, paging is often disabled
  on server machines (if you would rather kill a process to free up memory than risk thrashing).
* A Unix process can be paused by sending it the `SIGSTOP` signal, for example by pressing Ctrl-Z in
  a shell. This signal immediately stops the process from getting any more CPU cycles until it is
  resumed with `SIGCONT`, at which point it continues running where it left off. Even if your
  environment does not normally use `SIGSTOP`, it might be sent accidentally by an operations
  engineer.

All of these occurrences can *preempt* the running thread at any point and resume it at some later time,
without the thread even noticing. The problem is similar to making multi-threaded code on a single
machine thread-safe: you can’t assume anything about timing, because arbitrary context switches and
parallelism may occur.

When writing multi-threaded code on a single machine, we have fairly good tools for making it
thread-safe: mutexes, semaphores, atomic counters, lock-free data structures, blocking queues, and
so on. Unfortunately, these tools don’t directly translate to distributed systems, because a
distributed system has no shared memory—only messages sent over an unreliable network.

A node in a distributed system must assume that its execution can be paused for a significant length
of time at any point, even in the middle of a function. During the pause, the rest of the world
keeps moving and may even declare the paused node dead because it’s not responding. Eventually,
the paused node may continue running, without even noticing that it was asleep until it checks its
clock sometime later.
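
To make the hazard concrete, here is a minimal Python sketch of the check-then-act pattern from the lease example. The `FakeClock` and `Lease` classes are hypothetical stand-ins (not part of any real library) that let us simulate a pause deterministically; a real pause would come from GC, the hypervisor, or the scheduler rather than from an explicit call.

```python
class FakeClock:
    """Hypothetical stand-in for a monotonic clock, so a pause can be simulated."""
    def __init__(self):
        self.now = 0.0

    def advance(self, seconds):
        self.now += seconds


class Lease:
    """Hypothetical lease that is valid until a fixed expiry time."""
    def __init__(self, clock, duration):
        self.clock = clock
        self.expiry = clock.now + duration

    def is_valid(self):
        return self.clock.now < self.expiry


def handle_request(lease, clock, pause_seconds):
    """Check the lease, then act.

    Returns (processed, lease_was_still_valid_when_acting).
    """
    if not lease.is_valid():
        return (False, False)
    # A pause can strike right here, after the check but before the action.
    # The thread receives no notification that time has passed.
    clock.advance(pause_seconds)
    # The request is processed on the strength of the earlier check.
    return (True, lease.is_valid())


clock = FakeClock()
lease = Lease(clock, duration=10.0)
print(handle_request(lease, clock, pause_seconds=15.0))
```

With a 15-second pause and a 10-second lease, the request is processed even though the lease has expired by the time the action happens. Re-checking the lease immediately before acting only narrows the window; it cannot close it, because a pause can occur between any two instructions.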

### Response time guarantees

In many programming languages and operating systems, threads and processes may pause for an
unbounded amount of time, as discussed. Those reasons for pausing *can* be eliminated if you try
hard enough.

Some software runs in environments where a failure to respond within a specified time can cause
serious damage: computers that control aircraft, rockets, robots, cars, and other physical objects
must respond quickly and predictably to their sensor inputs. In these systems, there is a specified
*deadline* by which the software must respond; if it doesn’t meet the deadline, that may cause a
failure of the entire system. These are so-called *hard real-time* systems.

###### Note

In embedded systems, *real-time* means that a system is carefully designed and tested to meet
specified timing guarantees in all circumstances. This meaning is in contrast to the more vague use of the
term *real-time* on the web, where it describes servers pushing data to clients and stream
processing without hard response time constraints (see [Link to Come]).

For example, if your car’s onboard sensors detect that you are currently experiencing a crash, you
wouldn’t want the release of the airbag to be delayed due to an inopportune GC pause in the airbag
release system.

Providing real-time guarantees in a system requires support from all levels of the software stack: a
*real-time operating system* (RTOS) that allows processes to be scheduled with a guaranteed
allocation of CPU time in specified intervals is needed; library functions must document their
worst-case execution times; dynamic memory allocation may be restricted or disallowed entirely
(real-time garbage collectors exist, but the application must still ensure that it doesn’t give the
GC too much work to do); and an enormous amount of testing and measurement must be done to ensure
that guarantees are being met.

All of this requires a large amount of additional work and severely restricts the range of
programming languages, libraries, and tools that can be used (since most languages and tools do not
provide real-time guarantees). For these reasons, developing real-time systems is very expensive,
and they are most commonly used in safety-critical embedded devices. Moreover, “real-time” is not the
same as “high-performance”—in fact, real-time systems may have lower throughput, since they have to
prioritize timely responses above all else (see also [“Latency and Resource Utilization”](/en/ch9#sidebar_distributed_latency_utilization)).

For most server-side data processing systems, real-time guarantees are simply not economical or
appropriate. Consequently, these systems must suffer the pauses and clock instability that come from
operating in a non-real-time environment.

### Limiting the impact of garbage collection

Garbage collection used to be one of the biggest reasons for process pauses
[[79](/en/ch9#Thompson2013)],
but fortunately GC algorithms have improved a lot: a properly tuned collector will now usually pause
for no more than a few milliseconds. The Java runtime offers collectors such as concurrent mark
sweep (CMS), garbage-first (G1), the Z garbage collector (ZGC), Epsilon, and Shenandoah. Each of
these is optimized for different memory profiles such as high-frequency object creation, large
heaps, and so on. By contrast, Go offers a simpler concurrent mark sweep garbage collector that
attempts to optimize itself.

If you need to avoid GC pauses entirely, one option is to use a language that doesn’t have a garbage
collector at all. For example, Swift uses automatic reference counting to determine when memory can
be freed; Rust and Mojo track lifetimes of objects using the type system so the compiler can
determine how long memory must be allocated for.

It’s also possible to use a garbage-collected language while mitigating the impact of pauses.
One approach is to treat GC pauses like brief planned outages of a node, and to let other nodes
handle requests from clients while one node is collecting its garbage. If the runtime can warn the
application that a node soon requires a GC pause, the application can stop sending new requests to
that node, wait for it to finish processing outstanding requests, and then perform the GC while no
requests are in progress. This trick hides GC pauses from clients and reduces the high percentiles
of the response time [[80](/en/ch9#Terei2015),
[81](/en/ch9#Maas2015)].
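
The drain-then-collect sequence described above can be sketched as follows. This is only an illustration under simplifying assumptions: `Node`, `Balancer`, and their methods are hypothetical names, the "GC" is an arbitrary callback, and everything runs in a single thread.

```python
class Node:
    """Hypothetical application node tracked by the balancer."""
    def __init__(self, name):
        self.name = name
        self.in_flight = 0      # requests currently being processed
        self.draining = False   # True while the node is out of rotation


class Balancer:
    """Hypothetical round-robin balancer that can drain a node before a GC."""
    def __init__(self, nodes):
        self.nodes = nodes
        self._next = 0

    def route(self):
        """Pick the next node that is not draining and count the request."""
        for _ in range(len(self.nodes)):
            node = self.nodes[self._next % len(self.nodes)]
            self._next += 1
            if not node.draining:
                node.in_flight += 1
                return node
        raise RuntimeError("no node available")

    def complete(self, node):
        """Record that one request on this node has finished."""
        node.in_flight -= 1

    def begin_drain(self, node):
        """Stop sending new requests to the node (it has warned of a GC)."""
        node.draining = True

    def collect_if_idle(self, node, collect):
        """Run collect() only once no requests are in progress on the node."""
        if node.in_flight > 0:
            return False
        collect()
        node.draining = False   # put the node back into rotation
        return True


a, b = Node("a"), Node("b")
balancer = Balancer([a, b])
request_node = balancer.route()                    # first request goes to a
balancer.begin_drain(a)                            # a announces an upcoming GC
print(balancer.collect_if_idle(a, lambda: None))   # a still has work in flight
print(balancer.route().name)                       # new traffic is sent elsewhere
balancer.complete(request_node)
print(balancer.collect_if_idle(a, lambda: None))   # a is idle and can collect
```

The sketch assumes that enough other nodes remain in rotation to absorb the traffic. If every node is draining at once, `route` has nowhere to send a request.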

A variant of this idea is to use the garbage collector only for short-lived objects (which are fast
to collect) and to restart processes periodically, before they accumulate enough long-lived objects
to require a full GC of long-lived objects [[79](/en/ch9#Thompson2013),
[82](/en/ch9#Fowler2011_ch9)].
One node can be restarted at a time, and traffic can be shifted away from the node before the
planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding)).

These measures cannot fully prevent garbage collection pauses, but they can usefully reduce their
impact on the application.

# Knowledge, Truth, and Lies

So far in this chapter we have explored the ways in which distributed systems are different from
programs running on a single computer: there is no shared memory, only message passing via an
unreliable network with variable delays, and the systems may suffer from partial failures, unreliable clocks,
and processing pauses.

The consequences of these issues are profoundly disorienting if you’re not used to distributed
systems. A node in the network cannot *know* anything for sure about other nodes—it can only make
guesses based on the messages it receives (or doesn’t receive). A node can only find out what state
another node is in (what data it has stored, whether it is correctly functioning, etc.) by
exchanging messages with it. If a remote node doesn’t respond, there is no way of knowing what state
it is in, because problems in the network cannot reliably be distinguished from problems at a node.

Discussions of these systems border on the philosophical: What do we know to be true or false in our
system? How sure can we be of that knowledge, if the mechanisms for perception and measurement are
unreliable [[83](/en/ch9#Halpern1990)]?
Should software systems obey the laws that we expect of the physical world, such as cause and effect?

Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed
system, we can state the assumptions we are making about the behavior (the *system model*) and
design the actual system in such a way that it meets those assumptions. Algorithms can be proved to
function correctly within a certain system model. This means that reliable behavior is achievable,
even if the underlying system model provides very few guarantees.

However, although it is possible to make software well behaved in an unreliable system model, it
is not straightforward to do so. In the rest of this chapter we will further explore the notions of
knowledge and truth in distributed systems, which will help us think about the kinds of assumptions
we can make and the guarantees we may want to provide. In [Chapter 10](/en/ch10#ch_consistency) we will proceed to
look at some examples of distributed algorithms that provide particular guarantees under particular
assumptions.

## The Majority Rules

Imagine a network with an asymmetric fault: a node is able to receive all messages sent to it, but
any outgoing messages from that node are dropped or delayed
[[22](/en/ch9#Donges2012)]. Even though that node is working
perfectly well, and is receiving requests from other nodes, the other nodes cannot hear its
responses. After some timeout, the other nodes declare it dead, because they haven’t heard from the
node. The situation unfolds like a nightmare: the semi-disconnected node is dragged to the
graveyard, kicking and screaming “I’m not dead!”—but since nobody can hear its screaming, the
funeral procession continues with stoic determination.

In a slightly less nightmarish scenario, the semi-disconnected node may notice that the messages it
is sending are not being acknowledged by other nodes, and so realize that there must be a fault
in the network. Nevertheless, the node is wrongly declared dead by the other nodes, and the
semi-disconnected node cannot do anything about it.

As a third scenario, imagine a node that pauses execution for one minute. During that time, no
requests are processed and no responses are sent. The other nodes wait, retry, grow impatient, and
eventually declare the node dead and load it onto the hearse. Finally, the pause finishes and the
node’s threads continue as if nothing had happened. The other nodes are surprised as the supposedly
dead node suddenly raises its head out of the coffin, in full health, and starts cheerfully chatting
with bystanders. At first, the paused node doesn’t even realize that an entire minute has passed and
that it was declared dead—from its perspective, hardly any time has passed since it was last talking
to the other nodes.

The moral of these stories is that a node cannot necessarily trust its own judgment of a situation.
A distributed system cannot exclusively rely on a single node, because a node may fail at any time,
potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms
rely on a *quorum*, that is, voting among the nodes (see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)):
decisions require some minimum number of votes from several nodes in order to reduce the dependence
on any one particular node.

That includes decisions about declaring nodes dead. If a quorum of nodes declares another node
dead, then it must be considered dead, even if that node still very much feels alive. The individual
node must abide by the quorum decision and step down.

Most commonly, the quorum is an absolute majority of more than half the nodes (although other kinds
of quorums are possible). A majority quorum allows the system to continue working if a minority of nodes
are faulty (with three nodes, one faulty node can be tolerated; with five nodes, two faulty nodes can be
tolerated). However, it is still safe, because there can be only one majority in the
system—there cannot be two majorities with conflicting decisions at the same time. We will discuss
the use of quorums in more detail when we get to *consensus algorithms* in [Chapter 10](/en/ch10#ch_consistency).
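
The arithmetic behind majority quorums is simple enough to check directly. The following Python sketch (the function names are our own, chosen for illustration) computes the quorum size and the number of tolerated faults, and verifies by brute force that any two majorities of a small cluster share at least one node:

```python
from itertools import combinations


def majority(n):
    """Smallest number of nodes that is more than half of n."""
    return n // 2 + 1


def tolerated_faults(n):
    """Number of nodes that can be faulty while a majority still remains."""
    return n - majority(n)


def majorities_always_overlap(n):
    """Brute-force check that every pair of majority-sized sets shares a node."""
    quorums = [set(q) for q in combinations(range(n), majority(n))]
    return all(a & b for a in quorums for b in quorums)


for n in (3, 4, 5):
    print(n, majority(n), tolerated_faults(n), majorities_always_overlap(n))
```

Note that going from three nodes to four does not increase the number of tolerated faults, which is one reason why clusters with an odd number of nodes are common.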

## Distributed Locks and Leases

Locks and leases in distributed applications are prone to misuse, and are a common source of bugs
[[84](/en/ch9#Tang2022)].
Let’s look at one particular case of how they can go wrong.

In [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses) we saw that a lease is a kind of lock that times out and can be
assigned to a new owner if the old owner stops responding (perhaps because it crashed, it paused for
too long, or it was disconnected from the network). You can use leases in situations where a system
requires there to be only one of some thing. For example:

* Only one node is allowed to be the leader for a database shard, to avoid split brain (see
  [“Handling Node Outages”](/en/ch6#sec_replication_failover)).
* Only one transaction or client is allowed to update a particular resource or object, to prevent
  it being corrupted by concurrent writes.
* Only one node should process a given input file to a big processing job, to avoid wasted effort
  due to multiple nodes redundantly doing the same work.

It is worth thinking carefully about what happens if several nodes simultaneously believe that they
hold the lease, perhaps due to a process pause. In the third example, the consequence is only some
wasted computational resources, which is not a big deal. But in the first two cases, the consequence
could be lost or corrupted data, which is much more serious.

For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect
implementation of locking. (The bug is not theoretical: HBase used to have this problem
[[85](/en/ch9#Junqueira2013_ch9),
[86](/en/ch9#Soztutar2013hdfs)].)
Say you want to ensure that a file in a storage service can only be
accessed by one client at a time, because if multiple clients tried to write to it, the file would
become corrupted. You try to implement this by requiring a client to obtain a lease from a lock
service before accessing the file. Such a lock service is often implemented using a consensus
algorithm; we will discuss this further in [Chapter 10](/en/ch10#ch_consistency).

![ddia 0904](/fig/ddia_0904.png)

###### Figure 9-4. Incorrect implementation of a distributed lock: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage.

The problem is an example of what we discussed in [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses): if the client
holding the lease is paused for too long, its lease expires. Another client can obtain a lease for
the same file, and start writing to the file. When the paused client comes back, it believes
(incorrectly) that it still has a valid lease and proceeds to also write to the file. We now have a
split brain situation: the clients’ writes clash and corrupt the file.

[Figure 9-5](/en/ch9#fig_distributed_lease_delay) shows a different problem that has similar consequences. In this
example there is no process pause, only a crash by client 1. Just before client 1 crashes it sends a
write request to the storage service, but this request is delayed for a long time in the network.
(Remember from [“Network Faults in Practice”](/en/ch9#sec_distributed_network_faults) that packets can sometimes be delayed by a minute
or more.) By the time the write request arrives at the storage service, the lease has already timed
out, allowing client 2 to acquire it and issue a write of its own. The result is corruption similar
to [Figure 9-4](/en/ch9#fig_distributed_lease_pause).

![ddia 0905](/fig/ddia_0905.png)

###### Figure 9-5. A message from a former leaseholder might be delayed for a long time, and arrive after another node has taken over the lease.

### Fencing off zombies and delayed requests

The term *zombie* is sometimes used to describe a former leaseholder who has not yet found out that
it lost the lease, and who is still acting as if it was the current leaseholder. Since we cannot
rule out zombies entirely, we have to instead ensure that they can’t do any damage in the form of
split brain. This is called *fencing off* the zombie.

Some systems attempt to fence off zombies by shutting them down, for example by disconnecting them
from the network [[9](/en/ch9#Leners2015)], shutting down the VM via
the cloud provider’s management interface, or even physically powering down the machine
[[87](/en/ch9#SUSE2025)].
This approach is known as *Shoot The Other Node In The Head* or STONITH. Unfortunately, it suffers
from some problems: it does not protect against large network delays like in
[Figure 9-5](/en/ch9#fig_distributed_lease_delay); it can happen that all of the nodes shut each other down
[[19](/en/ch9#Imbriaco2012_ch9)]; and by the time the zombie has been
detected and shut down, it may already be too late and data may already have been corrupted.

A more robust fencing solution, which protects against both zombies and delayed requests, is
illustrated in [Figure 9-6](/en/ch9#fig_distributed_fencing).

![ddia 0906](/fig/ddia_0906.png)

###### Figure 9-6. Making access to storage safe by allowing writes only in the order of increasing fencing tokens.

Let’s assume that every time the lock service grants a lock or lease, it also returns a *fencing
token*, which is a number that increases every time a lock is granted (e.g., incremented by the lock
service). We can then require that every time a client sends a write request to the storage service,
it must include its current fencing token.

###### Note

There are several alternative names for fencing tokens. In Chubby, Google’s lock service, they are
called *sequencers* [[88](/en/ch9#Burrows2006_ch9)], and in Kafka they are called *epoch numbers*.
In consensus algorithms, which we will discuss in [Chapter 10](/en/ch10#ch_consistency), the *ballot number* (Paxos) or
*term number* (Raft) serves a similar purpose.

In [Figure 9-6](/en/ch9#fig_distributed_fencing), client 1 acquires the lease with a token of 33, but then
it goes into a long pause and the lease expires. Client 2 acquires the lease with a token of 34 (the
number always increases) and then sends its write request to the storage service, including the
token of 34. Later, client 1 comes back to life and sends its write to the storage service,
including its token value 33. However, the storage service remembers that it has already processed a
write with a higher token number (34), and so it rejects the request with token 33. A client that
has just acquired the lease must immediately make a write to the storage service, and once that
write has completed, any zombies are fenced off.
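
A minimal sketch of the storage-side check might look as follows. The `LockService` and `FencedStorage` classes are hypothetical and heavily simplified: the lock service here only hands out increasing numbers and does not model lease expiry, and the storage holds a single object.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already seen."""


class LockService:
    """Hypothetical lock service: returns a larger token with every grant."""
    def __init__(self, first_token=33):
        self._next_token = first_token

    def acquire(self):
        token = self._next_token
        self._next_token += 1
        return token


class FencedStorage:
    """Hypothetical storage service that remembers the highest token seen."""
    def __init__(self):
        self.highest_token = None
        self.contents = None

    def write(self, token, data):
        # Reject any write whose token is lower than one already processed.
        if self.highest_token is not None and token < self.highest_token:
            raise StaleTokenError(
                f"token {token} is older than {self.highest_token}")
        self.highest_token = token
        self.contents = data


locks = LockService()
storage = FencedStorage()
token1 = locks.acquire()   # client 1 is granted the lease, then pauses
token2 = locks.acquire()   # the lease expires and client 2 is granted it
storage.write(token2, "written by client 2")
try:
    storage.write(token1, "written by client 1")   # the zombie wakes up
except StaleTokenError as err:
    print("rejected:", err)
```

Writes carrying the same token as the highest one seen are accepted, so the current leaseholder can write repeatedly. Only writes with a strictly lower token are rejected.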

If ZooKeeper is your lock service, you can use the transaction ID `zxid` or the node version
`cversion` as a fencing token [[85](/en/ch9#Junqueira2013_ch9)].
With etcd, the revision number along with the lease ID serves a similar purpose
[[89](/en/ch9#Kingsbury2020etcd)].
The FencedLock API in Hazelcast explicitly generates a fencing token
[[90](/en/ch9#BasriKahveci2019)].

This mechanism requires that the storage service has some way of checking whether a write is based
on an outdated token. Alternatively, it’s sufficient for the service to support a write that
succeeds only if the object has not been written by another client since the current client last
read it, similarly to an atomic compare-and-set (CAS) operation. For example, object storage
services support such a check: Amazon S3 calls it *conditional writes*, Azure Blob Storage calls it
*conditional headers*, and Google Cloud Storage calls it *request preconditions*.
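
The idea behind such a conditional write can be sketched with a version number per object. The class below is a hypothetical in-memory model, not the API of any of the services named above; in a real service the comparison and the update must happen atomically, which this single-threaded sketch gets for free.

```python
class VersionedObjectStore:
    """Hypothetical in-memory store supporting a compare-and-set style write."""
    def __init__(self):
        self._objects = {}   # key -> (version, value)

    def read(self, key):
        """Return (version, value); an object never written has version 0."""
        return self._objects.get(key, (0, None))

    def write_if_version(self, key, expected_version, value):
        """Write only if nobody has written since expected_version was read."""
        current_version, _ = self._objects.get(key, (0, None))
        if current_version != expected_version:
            return False
        self._objects[key] = (current_version + 1, value)
        return True


store = VersionedObjectStore()
v1, _ = store.read("file")   # client 1 reads, then pauses
v2, _ = store.read("file")   # client 2 reads the same version
print(store.write_if_version("file", v2, "from client 2"))
print(store.write_if_version("file", v1, "from client 1"))
```

Client 2's write succeeds and bumps the version, so the later write from the paused client 1 fails its precondition and leaves the object untouched.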

### Fencing with multiple replicas

If your clients need to write only to one storage service that supports such conditional writes, the
lock service is somewhat redundant
[[91](/en/ch9#Kleppmann2016),
[92](/en/ch9#Sanfilippo2016)],
since the lease assignment could have been implemented directly based on that storage service
[[93](/en/ch9#Morling2024_ch9)].
However, once you have a fencing token you can also use it with multiple services or replicas, and
ensure that the old leaseholder is fenced off on all of those services.

For example, imagine the storage service is a leaderless replicated key-value store with
last-write-wins conflict resolution (see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)). In such a system, the
client sends writes directly to each replica, and each replica independently decides whether to
accept a write based on a timestamp assigned by the client.

As illustrated in [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), you can put the writer’s fencing token in
the most significant bits or digits of the timestamp. You can then be sure that any timestamp
generated by the new leaseholder will be greater than any timestamp from the old leaseholder, even
if the old leaseholder’s writes happened later.

![ddia 0907](/fig/ddia_0907.png)

###### Figure 9-7. Using fencing tokens to protect writes to a leaderless replicated database.

In [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), Client 2 has a fencing token of 34, so all of its
timestamps starting with 34… are greater than any timestamps starting with 33… that are
generated by Client 1. Client 2 writes to a quorum of replicas but it can’t reach Replica 3. This
means that when the zombie Client 1 later tries to write, its write may succeed at Replica 3 even
though it is ignored by replicas 1 and 2. This is not a problem, since a subsequent quorum read will
prefer the write from Client 2 with the greater timestamp, and read repair or anti-entropy will
eventually overwrite the value written by Client 1.
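
The scenario above can be sketched in a few lines of Python. The bit width, the `LwwReplica` class, and the helper functions are assumptions made for illustration; real databases have their own timestamp formats and read paths.

```python
CLOCK_BITS = 48   # assumed width of the clock part of the timestamp


def fenced_timestamp(token, clock_millis):
    """Put the fencing token in the most significant bits of the timestamp."""
    if not 0 <= clock_millis < (1 << CLOCK_BITS):
        raise ValueError("clock value does not fit in CLOCK_BITS")
    return (token << CLOCK_BITS) | clock_millis


class LwwReplica:
    """Hypothetical replica with last-write-wins conflict resolution."""
    def __init__(self):
        self.timestamp = -1
        self.value = None

    def write(self, timestamp, value):
        # Accept the write only if its timestamp is greater than the stored one.
        if timestamp > self.timestamp:
            self.timestamp, self.value = timestamp, value
            return True
        return False


def quorum_read(replicas):
    """Return the value with the greatest timestamp among the replicas read."""
    return max(replicas, key=lambda r: r.timestamp).value


r1, r2, r3 = LwwReplica(), LwwReplica(), LwwReplica()

# Client 2 (token 34) reaches replicas 1 and 2, but not replica 3.
ts2 = fenced_timestamp(34, clock_millis=1000)
for replica in (r1, r2):
    replica.write(ts2, "from client 2")

# The zombie client 1 (token 33) writes later, according to its clock.
ts1 = fenced_timestamp(33, clock_millis=2000)
print([replica.write(ts1, "from client 1") for replica in (r1, r2, r3)])

# Any quorum of two replicas includes at least one that has client 2's write.
print(quorum_read([r2, r3]))
```

Because the token occupies the high-order bits, the comparison is decided by the token first and by the clock only among writes from the same leaseholder.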

As you can see from these examples, it is not safe to assume that there is only one node holding a
lease at any one time. Fortunately, with a bit of care you can use fencing tokens to prevent zombies
and delayed requests from doing any damage.

## Byzantine Faults

Fencing tokens can detect and block a node that is *inadvertently* acting in error (e.g., because it
hasn’t yet found out that its lease has expired). However, if the node deliberately wanted to
subvert the system’s guarantees, it could easily do so by sending messages with a fake fencing
token.

In this book we assume that nodes are unreliable but honest: they may be slow or never respond (due
to a fault), and their state may be outdated (due to a GC pause or network delays), but we assume
that if a node *does* respond, it is telling the “truth”: to the best of its knowledge, it is
playing by the rules of the protocol.

Distributed systems problems become much harder if there is a risk that nodes may “lie” (send
arbitrary faulty or corrupted responses)—for example, it might cast multiple contradictory votes in
the same election. Such behavior is known as a *Byzantine fault*, and the problem of reaching
consensus in this untrusting environment is known as the *Byzantine Generals Problem*
[[94](/en/ch9#Lamport1982)].

# The Byzantine Generals Problem

The Byzantine Generals Problem is a generalization of the so-called *Two Generals Problem*
[[95](/en/ch9#Gray1978)],
which imagines a situation in which two army generals need to agree on a battle plan. As they
have set up camp on two different sites, they can only communicate by messenger, and the messengers
sometimes get delayed or lost (like packets in a network). We will discuss this problem of
*consensus* in [Chapter 10](/en/ch10#ch_consistency).

In the Byzantine version of the problem, there are *n* generals who need to agree, and their
endeavor is hampered by the fact that there are some traitors in their midst. Most of the generals
are loyal, and thus send truthful messages, but the traitors may try to deceive and confuse the
others by sending fake or untrue messages. It is not known in advance who the traitors are.

Byzantium was an ancient Greek city that later became Constantinople, in the place which is now
Istanbul in Turkey. There isn’t any historic evidence that the generals of Byzantium were any more
prone to intrigue and conspiracy than those elsewhere. Rather, the name is derived from *Byzantine*
in the sense of *excessively complicated, bureaucratic, devious*, which was used in politics long
before computers [[96](/en/ch9#Palmer2011)].
Lamport wanted to choose a nationality that would not offend any readers, and he was advised that
calling it *The Albanian Generals Problem* was not such a good idea
[[97](/en/ch9#LamportPubs)].

A system is *Byzantine fault-tolerant* if it continues to operate correctly even if some of the
nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering
with the network. This concern is relevant in certain specific circumstances. For example:

* In aerospace environments, the data in a computer’s memory or CPU register could become corrupted
  by radiation, leading it to respond to other nodes in arbitrarily unpredictable ways. Since a
  system failure would be very expensive (e.g., an aircraft crashing and killing everyone on board,
  or a rocket colliding with the International Space Station), flight control systems must tolerate
  Byzantine faults [[98](/en/ch9#Rushby2001),
  [99](/en/ch9#Edge2013)].
* In a system with multiple participating parties, some participants may attempt to cheat or
  defraud others. In such circumstances, it is not safe for a node to simply trust another node’s
  messages, since they may be sent with malicious intent. For example, cryptocurrencies like
  Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties
  to agree whether a transaction happened or not, without relying on a central authority
  [[100](/en/ch9#Bano2019_ch9)].

However, in the kinds of systems we discuss in this book, we can usually safely assume that there
are no Byzantine faults. In a datacenter, all the nodes are controlled by your organization (so
they can hopefully be trusted) and radiation levels are low enough that memory corruption is not a
major problem (although datacenters in orbit are being considered
[[101](/en/ch9#Feilden2024)]).
Multitenant systems have mutually untrusting tenants, but they are isolated from each
other using firewalls, virtualization, and access control policies, not using Byzantine fault
tolerance. Protocols for making systems Byzantine fault-tolerant are quite expensive
[[102](/en/ch9#Mickens2013)],
and fault-tolerant embedded systems rely on support from the hardware level
[[98](/en/ch9#Rushby2001)]. In most server-side data systems, the
cost of deploying Byzantine fault-tolerant solutions makes them impracticable.

Web applications do need to expect arbitrary and malicious behavior of clients that are under
end-user control, such as web browsers. This is why input validation, sanitization, and output
escaping are so important: to prevent SQL injection and cross-site scripting, for example. However,
we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the
authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where
there is no such central authority, Byzantine fault tolerance is more relevant
[[103](/en/ch9#Kleppmann2020),
[104](/en/ch9#Kleppmann2022)].

A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to
all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant
algorithms require a supermajority of more than two-thirds of the nodes to be functioning correctly
(for example, if you have four nodes, at most one may malfunction). To use this approach against bugs, you
would have to have four independent implementations of the same software and hope that a bug only
appears in one of the four implementations.
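
The two-thirds requirement is often written as *n* ≥ 3*f* + 1, where *n* is the number of nodes and *f* the number of Byzantine nodes to be tolerated. A small Python sketch (the function names are ours) compares this with the majority quorums discussed earlier, which only need to tolerate nodes that crash or become unreachable:

```python
def max_byzantine_faults(n):
    """Largest f such that more than two-thirds of n nodes remain correct."""
    return (n - 1) // 3


def min_nodes_for_byzantine_faults(f):
    """Smallest n satisfying n >= 3f + 1."""
    return 3 * f + 1


def max_crash_faults(n):
    """Largest f such that a majority of n nodes remains available."""
    return (n - 1) // 2


for n in (3, 4, 5, 7):
    print(n, max_crash_faults(n), max_byzantine_faults(n))
```

With three nodes, a majority quorum tolerates one crashed node, whereas no Byzantine node can be tolerated at all. This is part of the reason why Byzantine fault tolerance is more expensive to deploy.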

Similarly, it would be appealing if a protocol could protect us from vulnerabilities, security
compromises, and malicious attacks. Unfortunately, this is not realistic either: in most systems, if
an attacker can compromise one node, they can probably compromise all of them, because they are
probably running the same software. Thus, traditional mechanisms (authentication, access control,
encryption, firewalls, and so on) continue to be the main protection against attackers.

### Weak forms of lying

Although we assume that nodes are generally honest, it can be worth adding mechanisms to software
that guard against weak forms of “lying”—for example, invalid messages due to hardware issues,
software bugs, and misconfiguration. Such protection mechanisms are not full-blown Byzantine fault
tolerance, as they would not withstand a determined adversary, but they are nevertheless simple and
pragmatic steps toward better reliability. For example:

* Network packets do sometimes get corrupted due to hardware issues or bugs in operating systems,
  drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and
  UDP, but sometimes they evade detection [[105](/en/ch9#Gilman2015),
  [106](/en/ch9#Stone2000),
  [107](/en/ch9#Jones2015)].
  Simple measures are usually sufficient protection against such corruption, such as checksums in
  the application-level protocol. TLS-encrypted connections also offer protection against
  corruption.
* A publicly accessible application must carefully sanitize any inputs from users, for example
  checking that a value is within a reasonable range and limiting the size of strings to prevent
  denial of service through large memory allocations. An internal service behind a firewall may be
  able to get away with less strict checks on inputs, but basic checks in protocol parsers are still
  a good idea [[105](/en/ch9#Gilman2015)].
* NTP clients can be configured with multiple server addresses. When synchronizing, the client
  contacts all of them, estimates their errors, and checks that a majority of servers agree on some
  time range. As long as most of the servers are okay, a misconfigured NTP server that is reporting an
  incorrect time is detected as an outlier and is excluded from synchronization
  [[39](/en/ch9#Windl2006)]. Using multiple servers makes NTP
  more robust than relying on a single server.
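
As a minimal sketch of the application-level checksum idea from the first bullet, the following wraps each message in a frame carrying its length and a CRC-32. The frame layout and the function names `encode_frame` and `decode_frame` are assumptions made for illustration, not part of any particular protocol. Like the other measures in this section, a CRC guards against accidental corruption only, not against a determined adversary.

```python
import struct
import zlib


def encode_frame(payload: bytes) -> bytes:
    """Prefix the payload with its length and CRC-32 (both big-endian uint32)."""
    return struct.pack(">II", len(payload), zlib.crc32(payload)) + payload


def decode_frame(frame: bytes) -> bytes:
    """Return the payload, or raise ValueError if the frame looks corrupted."""
    if len(frame) < 8:
        raise ValueError("frame too short")
    length, expected_crc = struct.unpack(">II", frame[:8])
    payload = frame[8:]
    if len(payload) != length:
        raise ValueError("length mismatch")
    if zlib.crc32(payload) != expected_crc:
        raise ValueError("checksum mismatch")
    return payload
```

A receiver that gets a `ValueError` can discard the message and treat it like a lost packet, which the protocol has to handle anyway.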

## System Model and Reality

Many algorithms have been designed to solve distributed systems problems—for example, we will
examine solutions for the consensus problem in [Chapter 10](/en/ch10#ch_consistency). In order to be useful, these
algorithms need to tolerate the various faults of distributed systems that we discussed in this
chapter.

Algorithms need to be written in a way that does not depend too heavily on the details of the
hardware and software configuration on which they are run. This in turn requires that we somehow
formalize the kinds of faults that we expect to happen in a system. We do this by defining a *system
model*, which is an abstraction that describes what things an algorithm may assume.

With regard to timing assumptions, three system models are in common use:

Synchronous model
:   The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock
    error. This does not imply exactly synchronized clocks or zero network delay; it just means you
    know that network delay, pauses, and clock drift will never exceed some fixed upper bound
    [[108](/en/ch9#Dwork1988_ch9)].
    The synchronous model is not a realistic model of most practical
    systems, because (as discussed in this chapter) unbounded delays and pauses do occur.

Partially synchronous model
:   Partial synchrony means that a system behaves like a synchronous system *most of the time*, but it
    sometimes exceeds the bounds for network delay, process pauses, and clock drift
    [[108](/en/ch9#Dwork1988_ch9)]. This is a realistic model of many
    systems: most of the time, networks and processes are quite well behaved—otherwise we would never
    be able to get anything done—but we have to reckon with the fact that any timing assumptions
    may be shattered occasionally. When this happens, network delay, pauses, and clock error may become
    arbitrarily large.

Asynchronous model
:   In this model, an algorithm is not allowed to make any timing assumptions—in fact, it does not
    even have a clock (so it cannot use timeouts). Some algorithms can be designed for the
    asynchronous model, but it is very restrictive.

Moreover, besides timing issues, we have to consider node failures. Some common system models for
nodes are:

Crash-stop faults
:   In the *crash-stop* (or *fail-stop*) model, an algorithm may assume that a node can fail in only
    one way, namely by crashing
    [[109](/en/ch9#Schlichting1983)].
    This means that the node may suddenly stop responding at any moment, and thereafter that node is
    gone forever—it never comes back.

Crash-recovery faults
:   We assume that nodes may crash at any moment, and perhaps start responding again after some
    unknown time. In the crash-recovery model, nodes are assumed to have stable storage (i.e.,
    nonvolatile disk storage) that is preserved across crashes, while the in-memory state is assumed
    to be lost.

Degraded performance and partial functionality
:   In addition to crashing and restarting, nodes may go slow: they may still be able to respond to
    health check requests, while being too slow to get any real work done. For example, a Gigabit
    network interface could suddenly drop to 1 Kb/s throughput due to a driver bug
    [[110](/en/ch9#Do2013)];
    a process that is under memory pressure may spend most of its time performing garbage collection
    [[111](/en/ch9#Snyder2019)];
    worn-out SSDs can have erratic performance; and hardware can be affected by high temperature,
    loose connectors, mechanical vibration, power supply problems, firmware bugs, and more
    [[112](/en/ch9#Gunawi2018_ch9)].
    Such a situation is called a *limping node*, *gray failure*, or *fail-slow*
    [[113](/en/ch9#Huang2017_ch9)],
    and it can be even more difficult to deal with than a cleanly failed node. A related problem is
    when a process stops doing some of the things it is supposed to do while other aspects continue
    working, for example because a background thread is crashed or deadlocked
    [[114](/en/ch9#Lou2020)].

Byzantine (arbitrary) faults
:   Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described
    in the last section.

For modeling real systems, the partially synchronous model with crash-recovery faults is generally
the most useful model. It allows for unbounded network delay, process pauses, and slow nodes. But
how do distributed algorithms cope with that model?

### Defining the correctness of an algorithm

To define what it means for an algorithm to be *correct*, we can describe its *properties*. For
example, the output of a sorting algorithm has the property that for any two distinct elements of
the output list, the element further to the left is smaller than the element further to the right.
That is simply a formal way of defining what it means for a list to be sorted.

Similarly, we can write down the properties we want of a distributed algorithm to define what it
means to be correct. For example, if we are generating fencing tokens for a lock (see
[“Fencing off zombies and delayed requests”](/en/ch9#sec_distributed_fencing_tokens)), we may require the algorithm to have the following properties:

Uniqueness
:   No two requests for a fencing token return the same value.

Monotonic sequence
:   If request *x* returned token *t*<sub>*x*</sub>, and request *y* returned token *t*<sub>*y*</sub>, and
    *x* completed before *y* began, then *t*<sub>*x*</sub> < *t*<sub>*y*</sub>.

Availability
:   A node that requests a fencing token and does not crash eventually receives a response.
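
To make the first two properties concrete, here is a sketch that checks them against a recorded history of token requests. The `TokenRequest` record and its logical `started` and `completed` timestamps are hypothetical, introduced only for this example. Availability is deliberately not checked: a finite history can only show that a response has not arrived *yet*.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRequest:
    started: int    # logical time at which the request began
    completed: int  # logical time at which the response was received
    token: int      # fencing token that was returned


def check_uniqueness(history: list[TokenRequest]) -> bool:
    """No two requests returned the same token."""
    tokens = [r.token for r in history]
    return len(tokens) == len(set(tokens))


def check_monotonic_sequence(history: list[TokenRequest]) -> bool:
    """If x completed before y began, then x's token is less than y's."""
    for x in history:
        for y in history:
            if x.completed < y.started and not x.token < y.token:
                return False
    return True
```

Note that two requests that overlap in time may receive their tokens in either order without violating the monotonic sequence property.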

An algorithm is correct in some system model if it always satisfies its properties in all situations
that we assume may occur in that system model. However, if all nodes crash, or all network delays
suddenly become infinitely long, then no algorithm will be able to get anything done. How can we
still make useful guarantees even in a system model that allows complete failures?

### Safety and liveness

To clarify the situation, it is worth distinguishing between two different kinds of properties:
*safety* and *liveness* properties. In the example just given, *uniqueness* and *monotonic sequence* are
safety properties, but *availability* is a liveness property.

What distinguishes the two kinds of properties? A giveaway is that liveness properties often include
the word “eventually” in their definition. (And yes, you guessed it—*eventual consistency* is a
liveness property [[115](/en/ch9#Bailis2013_ch9)].)

Safety is often informally defined as *nothing bad happens*, and liveness as *something good
eventually happens*. However, it’s best to not read too much into those informal definitions,
because “good” and “bad” are value judgements that don’t apply well to algorithms. The actual
definitions of safety and liveness are more precise
[[116](/en/ch9#Alpern1985)]:

* If a safety property is violated, we can point at a particular point in time at which it was
  broken (for example, if the uniqueness property was violated, we can identify the particular
  operation in which a duplicate fencing token was returned). After a safety property has been
  violated, the violation cannot be undone—the damage is already done.
* A liveness property works the other way round: it may not hold at some point in time (for example,
  a node may have sent a request but not yet received a response), but there is always hope that it
  may be satisfied in the future (namely by receiving a response).

An advantage of distinguishing between safety and liveness properties is that it helps us deal with
difficult system models. For distributed algorithms, it is common to require that safety properties
*always* hold, in all possible situations of a system model
[[108](/en/ch9#Dwork1988_ch9)]. That is, even if all nodes crash, or
the entire network fails, the algorithm must nevertheless ensure that it does not return a wrong
result (i.e., that the safety properties remain satisfied).

However, with liveness properties we are allowed to make caveats: for example, we could say that a
request needs to receive a response only if a majority of nodes have not crashed, and only if the
network eventually recovers from an outage. The definition of the partially synchronous model
requires that eventually the system returns to a synchronous state—that is, any period of network
interruption lasts only for a finite duration and is then repaired.

### Mapping system models to the real world

Safety and liveness properties and system models are very useful for reasoning about the correctness
of a distributed algorithm. However, when implementing an algorithm in practice, the messy facts of
reality come back to bite you again, and it becomes clear that the system model is a simplified
abstraction of reality.

For example, algorithms in the crash-recovery model generally assume that data in stable storage
survives crashes. However, what happens if the data on disk is corrupted, or the data is wiped out
due to hardware error or misconfiguration
[[117](/en/ch9#Junqueira2015)]?
What happens if a server has a firmware bug and fails to recognize
its hard drives on reboot, even though the drives are correctly attached to the server
[[118](/en/ch9#Sanders2016)]?

Quorum algorithms (see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)) rely on a node remembering the data
that it claims to have stored. If a node may suffer from amnesia and forget previously stored data,
that breaks the quorum condition, and thus breaks the correctness of the algorithm. Perhaps a new
system model is needed, in which we assume that stable storage mostly survives crashes, but may
sometimes be lost. But that model then becomes harder to reason about.

The theoretical description of an algorithm can declare that certain things are simply assumed not
to happen—and in non-Byzantine systems, we do have to make some assumptions about faults that can
and cannot happen. However, a real implementation may still have to include code to handle the
case where something happens that was assumed to be impossible, even if that handling boils down to
`printf("Sucks to be you")` and `exit(666)`—i.e., letting a human operator clean up the mess
[[119](/en/ch9#Kreps2013)].
(This is one difference between computer science and software engineering.)

That is not to say that theoretical, abstract system models are worthless—quite the opposite.
They are incredibly helpful for distilling down the complexity of real systems to a manageable set
of faults that we can reason about, so that we can understand the problem and try to solve it
systematically.

## Formal Methods and Randomized Testing

How do we know that an algorithm satisfies the required properties? Due to concurrency, partial
failures, and network delays there are a huge number of potential states. We need to guarantee
that the properties hold in every possible state, and ensure that we haven’t forgotten about any
edge cases.

One approach is to formally verify an algorithm by describing it mathematically, and using proof
techniques to show that it satisfies the required properties in all situations that the system model
allows. Proving an algorithm correct does not mean its *implementation* on a real system will
necessarily always behave correctly. But it’s a very good first step, because the theoretical
analysis can uncover problems in an algorithm that might remain hidden for a long time in a real
system, and that only come to bite you when your assumptions (e.g., about timing) are defeated due
to unusual circumstances.

It is prudent to combine theoretical analysis with empirical testing to verify that implementations
behave as expected. Techniques such as property-based testing, fuzzing, and deterministic simulation
testing (DST) use randomization to test a system in a wide range of situations. Companies such as
Amazon Web Services have successfully used a combination of these techniques on many of their
products [[120](/en/ch9#Brooker2024correctness),
[121](/en/ch9#SatarinTesting)].
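
The following is a hand-rolled sketch of the property-based testing idea, using only the standard library. The function under test, `merge_sorted`, and the harness `run_property_test` are made up for this example; real projects typically use a dedicated property-based testing library instead. The important detail is that all randomness comes from a seeded generator, so a failing case can be reproduced from its seed.

```python
import random


def merge_sorted(a: list[int], b: list[int]) -> list[int]:
    """Function under test: merge two sorted lists into one sorted list."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            out.append(a[i])
            i += 1
        else:
            out.append(b[j])
            j += 1
    return out + a[i:] + b[j:]


def run_property_test(seed: int, cases: int = 1000) -> None:
    rng = random.Random(seed)  # seeded, so a failing case can be reproduced
    for case in range(cases):
        a = sorted(rng.randint(-50, 50) for _ in range(rng.randint(0, 20)))
        b = sorted(rng.randint(-50, 50) for _ in range(rng.randint(0, 20)))
        result = merge_sorted(a, b)
        # Property: the output is the sorted combination of both inputs
        assert result == sorted(a + b), f"seed={seed} case={case} a={a} b={b}"
```

Rather than asserting the output for a few hand-picked inputs, the test states a property that must hold for every input and lets the generator search for a counterexample.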

### Model checking and specification languages

*Model checkers* are tools that help verify that an algorithm or system behaves as expected. An algorithm
specification is written in a purpose-built language such as TLA+, Gallina, or FizzBee. These
languages make it easier to focus on an algorithm’s behavior without worrying about code
implementation details. Model checkers then use these models to verify that invariants hold across
all of an algorithm’s states by systematically trying all the things that could happen.

Model checking can’t actually prove that an algorithm’s invariants hold for every possible state
since most real-world algorithms have an infinite state space. A true verification of all states
would require a formal proof, which can be done, but which is typically more difficult than running
a model checker. Instead, model checkers encourage you to reduce the algorithm’s model to an
approximation that can be fully verified, or to limit the execution to some upper bound (for
example, by setting a maximum number of messages that can be sent). Any bugs that only occur with
longer executions would then not be found.
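
To give a feel for what "systematically trying all the things that could happen" means, here is a toy explicit-state checker written directly in Python rather than in a specification language. The model is an assumption chosen for illustration: two processes each read a shared counter and then write back the value plus one, without any locking. The state space is small and finite, so it can be explored exhaustively; this is a sketch of the general idea, not a description of how any particular model checker is implemented.

```python
from collections import deque

# State: (counter, pcs, tmps) where pcs[i] is process i's program counter
# (0 = about to read, 1 = about to write, 2 = done) and tmps[i] is the value it read.
INITIAL = (0, (0, 0), (0, 0))


def next_states(state):
    """Yield (action, successor) for every step any process could take next."""
    counter, pcs, tmps = state
    for p in range(len(pcs)):
        if pcs[p] == 0:  # read the shared counter into a local variable
            new_tmps = tmps[:p] + (counter,) + tmps[p + 1:]
            new_pcs = pcs[:p] + (1,) + pcs[p + 1:]
            yield f"p{p} reads {counter}", (counter, new_pcs, new_tmps)
        elif pcs[p] == 1:  # write back the local value plus one
            new_pcs = pcs[:p] + (2,) + pcs[p + 1:]
            yield f"p{p} writes {tmps[p] + 1}", (tmps[p] + 1, new_pcs, tmps)


def invariant(state) -> bool:
    """Once both processes are done, the counter must equal 2."""
    counter, pcs, _ = state
    return not all(pc == 2 for pc in pcs) or counter == 2


def check(initial=INITIAL):
    """Breadth-first search over all reachable states; return a shortest
    trace of actions leading to an invariant violation, or None."""
    queue = deque([(initial, [])])
    seen = {initial}
    while queue:
        state, trace = queue.popleft()
        if not invariant(state):
            return trace
        for action, succ in next_states(state):
            if succ not in seen:
                seen.add(succ)
                queue.append((succ, trace + [action]))
    return None
```

Calling `check()` returns a four-step trace in which both processes read the counter before either writes it, so one increment is lost. A handful of hand-written tests could easily miss that interleaving, whereas exhaustive exploration cannot.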

Still, model checkers strike a nice balance between ease of use and the ability to find non-obvious
bugs. CockroachDB, TiDB, Kafka, and many other distributed systems use model specifications to find
and fix bugs
[[122](/en/ch9#Vanlightly2024),
[123](/en/ch9#Tang2018),
[124](/en/ch9#VanBenschoten2019)]. For example,
using TLA+, researchers were able to demonstrate the potential for data loss in viewstamped
replication (VR) caused by ambiguity in the prose description of the algorithm
[[125](/en/ch9#Vanlightly2022)].

By design, model checkers don’t run your actual code, but rather a simplified model that specifies
only the core ideas of your protocol. This makes it more tractable to systematically explore the
state space, but it risks that your specification and your implementation go out of sync with each
other [[126](/en/ch9#Wayne2024)].
It is possible to check whether the model and the real implementation have equivalent behavior, but
this requires instrumentation in the real implementation
[[127](/en/ch9#Ouyang2025)].

### Fault injection

Many bugs are triggered when machine and network failures occur. Fault injection is an effective
(and sometimes scary) technique that verifies whether a system’s implementation works as expected when things
go wrong. The idea is simple: inject faults into a running system’s environment and see how it
behaves. Faults can be network failures, machine crashes, disk corruption, paused
processes—anything you can imagine going wrong with a computer.

Fault injection tests are typically run in an environment that closely resembles the production
environment where the system will run. Some even inject faults directly into their production
environment. Netflix popularized this approach with their Chaos Monkey tool
[[128](/en/ch9#Izrailevsky2011)]. Production fault
injection is often referred to as *chaos engineering*, which we discussed in
[“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability).

To run fault injection tests, the system under test is first deployed along with fault injection
coordinators and scripts. Coordinators are responsible for deciding what faults to execute and when
to execute them. Local or remote scripts are responsible for injecting failures into individual
nodes or processes. Injection scripts use many different tools to trigger faults. A Linux process
can be paused or killed using the `kill` command, a disk can be unmounted with `umount`, and
network connections can be disrupted through firewall settings. You can inspect system behavior
during and after faults are injected to make sure things work as expected.
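
As a sketch of what a local injection script might do, the following uses the same signals that the `kill` command sends (`SIGSTOP`, `SIGCONT`, and `SIGKILL`) to pause and crash a process. It assumes a POSIX system, and the "system under test" is just a stand-in child process that sleeps; the helper names are hypothetical.

```python
import signal
import subprocess
import sys
import time


def start_victim() -> subprocess.Popen:
    """Start a stand-in for the process under test (it just sleeps)."""
    return subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])


def inject_pause(proc: subprocess.Popen, seconds: float) -> None:
    """Simulate a process pause: stop the process, wait, then let it continue."""
    proc.send_signal(signal.SIGSTOP)
    time.sleep(seconds)
    proc.send_signal(signal.SIGCONT)


def inject_crash(proc: subprocess.Popen) -> int:
    """Simulate a crash with SIGKILL and return the process's exit status."""
    proc.send_signal(signal.SIGKILL)
    return proc.wait(timeout=5)
```

In a real test, a coordinator would decide when to call helpers like these while a separate client issues requests and records what it observes.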

The myriad tools required to trigger failures make fault injection tests cumbersome to write.
To simplify the process, it’s common to adopt a fault injection framework such as Jepsen.
Such frameworks come with integrations for various operating systems and many
pre-built fault injectors
[[129](/en/ch9#Kingsbury2013jepsen)].
Jepsen has been remarkably effective at finding critical bugs in many widely-used systems
[[130](/en/ch9#Kingsbury2024),
[131](/en/ch9#Majumdar2017)].

### Deterministic simulation testing

Deterministic simulation testing (DST) has also become a popular complement to model checking and
fault injection. It explores the state space in a similar way to a model checker, but it tests
your actual code, not a model.

In DST, a simulation automatically runs through a large number of randomized executions of the
system. Network communication, I/O, and clock timing during the simulation are all replaced with
mocks that allow the simulator to control the exact order in which things happen, including various
timings and failure scenarios. This allows the simulator to explore many more situations than
hand-written tests or fault injection could. If a test fails, it can be re-run since the simulator
knows the exact order of operations that triggered the failure—in contrast to fault injection, which
does not have such fine-grained control over the system.
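
The following toy simulation illustrates the replay property. All names are hypothetical, and the "system" is only a server that acknowledges requests. The simulated network uses a seeded random number generator as its only source of randomness to decide which in-flight message is handled next and whether it is dropped, so running the simulation twice with the same seed produces exactly the same sequence of events.

```python
import random


class SimulatedNetwork:
    """Holds in-flight messages; the simulator decides what is delivered when."""

    def __init__(self, seed: int, drop_probability: float = 0.1):
        self.rng = random.Random(seed)  # the only source of randomness
        self.drop_probability = drop_probability
        self.in_flight: list[tuple[str, str]] = []  # (recipient, payload)
        self.trace: list[str] = []

    def send(self, recipient: str, payload: str) -> None:
        self.in_flight.append((recipient, payload))

    def step(self, handlers) -> bool:
        """Deliver or drop one randomly chosen message; False when idle."""
        if not self.in_flight:
            return False
        index = self.rng.randrange(len(self.in_flight))
        recipient, payload = self.in_flight.pop(index)
        if self.rng.random() < self.drop_probability:
            self.trace.append(f"drop {payload} -> {recipient}")
        else:
            self.trace.append(f"deliver {payload} -> {recipient}")
            handlers[recipient](self, payload)
        return True


def run_simulation(seed: int) -> list[str]:
    net = SimulatedNetwork(seed)

    def server(net, payload):
        net.send("client", f"ack:{payload}")

    def client(net, payload):
        pass  # acknowledgements need no further action in this toy example

    for i in range(5):
        net.send("server", f"req{i}")
    while net.step({"server": server, "client": client}):
        pass
    return net.trace
```

If some seed exposes a bug, passing the same seed to `run_simulation` again reproduces the identical reordering and drops, which makes the failure straightforward to debug.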

DST requires the simulator to be able to control all sources of nondeterminism, such as network
delays. One of three strategies is generally adopted to make code deterministic:

Application-level
:   Some systems are built from the ground up to make it easy to execute code deterministically. For
    example, FoundationDB, one of the pioneers in the DST space, is built using an asynchronous
    communication library called Flow. Flow provides a point for developers to inject a deterministic
    network simulation into the system
    [[132](/en/ch9#FoundationDB_ch9)].
    Similarly, TigerBeetle is an online transaction processing (OLTP) database with first-class DST
    support. The system’s state is modeled as a state machine, with all mutations occurring within a
    single event loop. When combined with mock deterministic primitives such as clocks, such an
    architecture is able to run deterministically
    [[133](/en/ch9#Kladov2023)].

Runtime-level
:   Languages with asynchronous runtimes and commonly used libraries provide an insertion point
    to introduce determinism. A single-threaded runtime is used to force all asynchronous code to run
    sequentially. FrostDB, for example, patches Go’s runtime to execute goroutines sequentially
    [[134](/en/ch9#Marques2024)].
    Rust’s madsim library works in a similar manner. Madsim provides deterministic implementations of
    Tokio’s asynchronous runtime API, AWS’s S3 library, Kafka’s Rust library, and many others.
    Applications can swap in deterministic libraries and runtimes to get deterministic test executions
    without changing their code.

Machine-level
:   Rather than patching code at runtime, an entire machine can be made deterministic. This is a
    delicate process that requires a machine to respond to all normally nondeterministic calls with
    deterministic responses. Tools such as Antithesis do this by building a custom hypervisor that
    replaces normally nondeterministic operations with deterministic ones. Everything from clocks
    to network and storage needs to be accounted for. Once done, though, developers can run their
    entire distributed system in a collection of containers within the hypervisor and get a completely
    deterministic distributed system.

DST provides several advantages beyond replayability. Tools such as Antithesis attempt to explore
many different code paths in application code by branching a test execution into multiple
sub-executions when it discovers less common behavior. And because deterministic tests often use
mocked clocks and network calls, such tests can run faster than wall-clock time. For example,
TigerBeetle’s time abstraction allows simulations to model network latency and timeouts without
actually waiting for the full timeout to elapse. Such techniques allow the simulator
to explore more code paths faster.
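
A simulated clock of this kind can be sketched in a few lines. The `VirtualClock` class below is a generic illustration and is not taken from TigerBeetle or any other system: instead of sleeping, it jumps straight to the deadline of the next pending timer.

```python
import heapq


class VirtualClock:
    """A simulated clock: time advances only when the simulator says so."""

    def __init__(self):
        self.now = 0.0
        self._timers: list[tuple[float, int, object]] = []
        self._next_id = 0  # tie-breaker so timers with equal deadlines fire in order

    def call_later(self, delay: float, callback) -> None:
        heapq.heappush(self._timers, (self.now + delay, self._next_id, callback))
        self._next_id += 1

    def run_until_idle(self) -> None:
        """Jump straight to each pending deadline instead of sleeping."""
        while self._timers:
            deadline, _, callback = heapq.heappop(self._timers)
            self.now = deadline
            callback()


clock = VirtualClock()
events = []
clock.call_later(30.0, lambda: events.append(("timeout", clock.now)))
clock.call_later(0.2, lambda: events.append(("reply", clock.now)))
clock.run_until_idle()
# events is now [("reply", 0.2), ("timeout", 30.0)]
```

The 30-second timeout fires after the reply, as it would in real time, but the whole run completes without any actual waiting.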

### The Power of Determinism

Nondeterminism is at the core of all of the distributed systems challenges we discussed in this
chapter: concurrency, network delay, process pauses, clock jumps, and crashes all happen in
unpredictable ways that vary from one run of a system to the next. Conversely, if you can make a
system deterministic, that can hugely simplify things.

In fact, making things deterministic is a simple but powerful idea that arises again and again in
distributed system design. Besides deterministic simulation testing, we have seen several ways of
using determinism over the past chapters:

* A key advantage of event sourcing (see [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events)) is that you can
  deterministically replay a log of events to reconstruct derived materialized views.
* Workflow engines (see [“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows)) rely on workflow definitions being
  deterministic to provide durable execution semantics.
* *State machine replication*, which we will discuss in [“Using shared logs”](/en/ch10#sec_consistency_smr), replicates data by
  independently executing the same sequence of deterministic transactions on each replica. We have
  already seen two variants of that idea: statement-based replication (see
  [“Implementation of Replication Logs”](/en/ch6#sec_replication_implementation)) and serial transaction execution using stored procedures
  (see [“Pros and cons of stored procedures”](/en/ch8#sec_transactions_stored_proc_tradeoffs)).

However, making code fully deterministic requires care. Even once you have removed all concurrency
and replaced I/O, network communication, clocks, and random number generators with deterministic
simulations, elements of nondeterminism may remain. For example, in some programming languages, the
order in which you iterate over the elements of a hash table may be nondeterministic. Whether you
run into a resource limit (memory allocation failure, stack overflow) is also nondeterministic.
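
Python is one such language: the iteration order of a set of strings depends on hash values that are randomized per process by default. A small sketch of the problem and one common remedy, with hypothetical function names:

```python
def pick_first_nondeterministic(nodes: set[str]) -> str:
    # Iteration order of a set of strings depends on per-process hash randomization
    return next(iter(nodes))


def pick_first_deterministic(nodes: set[str]) -> str:
    # Sorting imposes an order that does not depend on hash values
    return sorted(nodes)[0]
```

With three or more node names, the first function may return a different element from one run of the program to the next, while the second always returns the lexicographically smallest name.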

## Summary

In this chapter we have discussed a wide range of problems that can occur in distributed systems,
including:

* Whenever you try to send a packet over the network, it may be lost or arbitrarily delayed.
  Likewise, the reply may be lost or delayed, so if you don’t get a reply, you have no idea whether
  the message got through.
* A node’s clock may be significantly out of sync with other nodes (despite your best efforts to set
  up NTP), it may suddenly jump forward or back in time, and relying on it is dangerous because you
  most likely don’t have a good measure of your clock’s confidence interval.
* A process may pause for a substantial amount of time at any point in its execution, be declared
  dead by other nodes, and then come back to life again without realizing that it was paused.

The fact that such *partial failures* can occur is the defining characteristic of distributed
systems. Whenever software tries to do anything involving other nodes, there is the possibility that
it may occasionally fail, or randomly go slow, or not respond at all (and eventually time out). In
distributed systems, we try to build tolerance of partial failures into software, so that the system
as a whole may continue functioning even when some of its constituent parts are broken.

To tolerate faults, the first step is to *detect* them, but even that is hard. Most systems
don’t have an accurate mechanism for detecting whether a node has failed, so most distributed
algorithms rely on timeouts to determine whether a remote node is still available. However, timeouts
can’t distinguish between network and node failures, and variable network delay sometimes causes a
node to be falsely suspected of crashing. Handling limping nodes, which are responding but are too
slow to do anything useful, is even harder.

Once a fault is detected, making a system tolerate it is not easy either: there is no global
variable, no shared memory, no common knowledge or any other kind of shared state between the
machines [[83](/en/ch9#Halpern1990)].
Nodes can’t even agree on what time it is, let alone on anything more profound. The only way
information can flow from one node to another is by sending it over the unreliable network. Major
decisions cannot be safely made by a single node, so we require protocols that enlist help from
other nodes and try to get a quorum to agree.

If you’re used to writing software in the idealized mathematical perfection of a single computer,
where the same operation always deterministically returns the same result, then moving to the messy
physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems
engineers will often regard a problem as trivial if it can be solved on a single computer
[[4](/en/ch9#Hodges2013)],
and indeed a single computer can do a lot nowadays. If you can avoid opening Pandora’s box and
simply keep things on a single machine, for example by using an embedded storage engine (see
[“Embedded storage engines”](/en/ch4#sidebar_embedded)), it is generally worth doing so.

However, as discussed in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), scalability is not the only reason for
wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically
close to users) are equally important goals, and those things cannot be achieved with a single node.
The power of distributed systems is that in principle, they can run forever without being
interrupted at the service level, because all faults and maintenance can be handled at the node
level. (In practice, if a bad configuration change is rolled out to all nodes, that will still bring
a distributed system to its knees.)

In this chapter we also went on some tangents to explore whether the unreliability of networks,
clocks, and processes is an inevitable law of nature. We saw that it isn’t: it is possible to give
hard real-time response guarantees and bounded delays in networks, but doing so is very expensive and
results in lower utilization of hardware resources. Most non-safety-critical systems choose cheap
and unreliable over expensive and reliable.

This chapter has been all about problems, and has given us a bleak outlook. In the next chapter we
will move on to solutions, and discuss some algorithms that have been designed to cope with the
problems in distributed systems.

##### Footnotes

##### References

[[1](/en/ch9#Cavage2013-marker)] Mark Cavage.
[There’s Just No Getting Around It: You’re
Building a Distributed System](https://queue.acm.org/detail.cfm?id=2482856). *ACM Queue*, volume 11, issue 4, pages 80–89, April 2013.
[doi:10.1145/2466486.2482856](https://doi.org/10.1145/2466486.2482856)

[[2](/en/ch9#Kreps2012_ch9-marker)] Jay Kreps.
[Getting
Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012.
Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW)

[[3](/en/ch9#Hale2010-marker)] Coda Hale.
[You Can’t Sacrifice
Partition Tolerance](https://codahale.com/you-cant-sacrifice-partition-tolerance/). *codahale.com*, October 2010.
Archived at [perma.cc/6GJU-X4G5](https://perma.cc/6GJU-X4G5)

[[4](/en/ch9#Hodges2013-marker)] Jeff Hodges.
[Notes
on Distributed Systems for Young Bloods](https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/). *somethingsimilar.com*, January 2013.
Archived at [perma.cc/B636-62CE](https://perma.cc/B636-62CE)

[[5](/en/ch9#Jacobson1988-marker)] Van Jacobson.
[Congestion
Avoidance and Control](https://www.cs.usask.ca/ftp/pub/discus/seminars2002-2003/p314-jacobson.pdf). At *ACM Symposium on Communications Architectures and
Protocols* (SIGCOMM), August 1988.
[doi:10.1145/52324.52356](https://doi.org/10.1145/52324.52356)

[[6](/en/ch9#Hubert2009-marker)] Bert Hubert.
[The
Ultimate SO\_LINGER Page, or: Why Is My TCP Not Reliable](https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable). *blog.netherlabs.nl*, January 2009.
Archived at [perma.cc/6HDX-L2RR](https://perma.cc/6HDX-L2RR)

[[7](/en/ch9#Saltzer1984_ch9-marker)] Jerome H. Saltzer, David P. Reed, and David D. Clark.
[End-To-End
Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf). *ACM Transactions on Computer Systems*, volume 2, issue 4,
pages 277–288, November 1984.
[doi:10.1145/357401.357402](https://doi.org/10.1145/357401.357402)

[[8](/en/ch9#Bailis2014reliable-marker)] Peter Bailis and Kyle Kingsbury.
[The Network Is Reliable](https://queue.acm.org/detail.cfm?id=2655736).
*ACM Queue*, volume 12, issue 7, pages 48–55, July 2014.
[doi:10.1145/2639988.2639988](https://doi.org/10.1145/2639988.2639988)

[[9](/en/ch9#Leners2015-marker)] Joshua B. Leners, Trinabh Gupta, Marcos K.
Aguilera, and Michael Walfish.
[Taming Uncertainty in
Distributed Systems with Help from the Network](https://cs.nyu.edu/~mwalfish/papers/albatross-eurosys15.pdf). At *10th European Conference on Computer
Systems* (EuroSys), April 2015.
[doi:10.1145/2741948.2741976](https://doi.org/10.1145/2741948.2741976)

[[10](/en/ch9#Gill2011-marker)] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan.
[Understanding
Network Failures in Data Centers: Measurement, Analysis, and Implications](https://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p350.pdf). At
*ACM SIGCOMM Conference*, August 2011.
[doi:10.1145/2018436.2018477](https://doi.org/10.1145/2018436.2018477)

[[11](/en/ch9#Hoelzle2020-marker)] Urs Hölzle.
[But recently a farmer had started
grazing a herd of cows nearby. And whenever they stepped on the fiber link, they bent it enough
to cause a blip](https://x.com/uhoelzle/status/1263333283107991558). *x.com*, May 2020.
Archived at [perma.cc/WX8X-ZZA5](https://perma.cc/WX8X-ZZA5)

[[12](/en/ch9#CBCNews2021-marker)] CBC News.
[Hundreds
lose internet service in northern B.C. after beaver chews through cable](https://www.cbc.ca/news/canada/british-columbia/beaver-internet-down-tumbler-ridge-1.6001594). *cbc.ca*,
April 2021. Archived at [perma.cc/UW8C-H2MY](https://perma.cc/UW8C-H2MY)

[[13](/en/ch9#Oremus2014-marker)] Will Oremus.
[The
Global Internet Is Being Attacked by Sharks, Google Confirms](https://slate.com/technology/2014/08/shark-attacks-threaten-google-s-undersea-internet-cables-video.html). *slate.com*, August 2014.
Archived at [perma.cc/P6F3-C6YG](https://perma.cc/P6F3-C6YG)

[[14](/en/ch9#AuerbachJahajeeah2023-marker)] Jess Auerbach Jahajeeah.
[Down to the wire: The
ship fixing our internet](https://continent.substack.com/p/down-to-the-wire-the-ship-fixing). *continent.substack.com*, November 2023.
Archived at [perma.cc/DP7B-EQ7S](https://perma.cc/DP7B-EQ7S)

[[15](/en/ch9#Janardhan2021-marker)] Santosh Janardhan.
[More details
about the October 4 outage](https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/). *engineering.fb.com*, October 2021.
Archived at [perma.cc/WW89-VSXH](https://perma.cc/WW89-VSXH)

[[16](/en/ch9#Parfitt2011-marker)] Tom Parfitt.
[Georgian
woman cuts off web access to whole of Armenia](https://www.theguardian.com/world/2011/apr/06/georgian-woman-cuts-web-access). *theguardian.com*, April 2011.
Archived at [perma.cc/KMC3-N3NZ](https://perma.cc/KMC3-N3NZ)

[[17](/en/ch9#Voce2025-marker)] Antonio Voce, Tural Ahmedzade, and Ashley Kirk.
[‘Shadow
fleets’ and subaquatic sabotage: are Europe’s undersea internet cables under attack?](https://www.theguardian.com/world/ng-interactive/2025/mar/05/shadow-fleets-subaquatic-sabotage-europe-undersea-internet-cables-under-attack)
*theguardian.com*, March 2025.
Archived at [perma.cc/HA7S-ZDBV](https://perma.cc/HA7S-ZDBV)

[[18](/en/ch9#Liu2016-marker)] Shengyun Liu, Paolo Viotti,
Christian Cachin, Vivien Quéma, and Marko Vukolić.
[XFT: Practical
Fault Tolerance beyond Crashes](https://www.usenix.org/system/files/conference/osdi16/osdi16-liu.pdf). At *12th USENIX Symposium on Operating Systems Design and
Implementation* (OSDI), November 2016.

[[19](/en/ch9#Imbriaco2012_ch9-marker)] Mark Imbriaco.
[Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/).
*github.blog*, December 2012.
Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ)

[[20](/en/ch9#Lianza2020_ch9-marker)] Tom Lianza and Chris Snook.
[A Byzantine failure
in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/). *blog.cloudflare.com*, November 2020.
Archived at [perma.cc/83EZ-ALCY](https://perma.cc/83EZ-ALCY)

[[21](/en/ch9#Alfatafta2020-marker)] Mohammed Alfatafta, Basil Alkhatib, Ahmed Alquraan,
and Samer Al-Kiswany.
[Toward a Generic Fault
Tolerance Technique for Partial Network Partitioning](https://www.usenix.org/conference/osdi20/presentation/alfatafta). At *14th USENIX Symposium on
Operating Systems Design and Implementation* (OSDI), November 2020.

[[22](/en/ch9#Donges2012-marker)] Marc A. Donges.
[Re: bnx2 cards Intermittantly Going
Offline](https://www.spinics.net/lists/netdev/msg210485.html). Message to Linux *netdev* mailing list, *spinics.net*, September 2012.
Archived at [perma.cc/TXP6-H8R3](https://perma.cc/TXP6-H8R3)

[[23](/en/ch9#Toman2020-marker)] Troy Toman.
[Inside a CODE RED:
Network Edition](https://signalvnoise.com/svn3/inside-a-code-red-network-edition/). *signalvnoise.com*, September 2020.
Archived at [perma.cc/BET6-FY25](https://perma.cc/BET6-FY25)

[[24](/en/ch9#Kingsbury2014elastic-marker)] Kyle Kingsbury.
[Call Me Maybe:
Elasticsearch](https://aphyr.com/posts/317-call-me-maybe-elasticsearch). *aphyr.com*, June 2014.
Archived at [perma.cc/JK47-S89J](https://perma.cc/JK47-S89J)

[[25](/en/ch9#Sanfilippo2014-marker)] Salvatore Sanfilippo.
[A Few Arguments About Redis Sentinel Properties and Fail
Scenarios](https://antirez.com/news/80). *antirez.com*, October 2014.
Archived at [perma.cc/8XEU-CLM8](https://perma.cc/8XEU-CLM8)

[[26](/en/ch9#Liochon2015-marker)] Nicolas Liochon.
[CAP:
If All You Have Is a Timeout, Everything Looks Like a Partition](http://blog.thislongrun.com/2015/05/CAP-theorem-partition-timeout-zookeeper.html). *blog.thislongrun.com*,
May 2015. Archived at [perma.cc/FS57-V2PZ](https://perma.cc/FS57-V2PZ)

[[27](/en/ch9#Grosvenor2015-marker)] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel
Gog, Robert N. M. Watson, Andrew W. Moore, Steven Hand, and Jon Crowcroft.
[Queues
Don’t Matter When You Can JUMP Them!](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-grosvenor_update.pdf) At *12th USENIX Symposium on Networked
Systems Design and Implementation* (NSDI), May 2015.

[[28](/en/ch9#Julienne2019-marker)] Theo Julienne.
[Debugging
network stalls on Kubernetes](https://github.blog/engineering/debugging-network-stalls-on-kubernetes/). *github.blog*, November 2019.
Archived at [perma.cc/K9M8-XVGL](https://perma.cc/K9M8-XVGL)

[[29](/en/ch9#Wang2010-marker)] Guohui Wang and T. S. Eugene Ng.
[The Impact of
Virtualization on Network Performance of Amazon EC2 Data Center](https://www.cs.rice.edu/~eugeneng/papers/INFOCOM10-ec2.pdf). At *29th IEEE
International Conference on Computer Communications* (INFOCOM), March 2010.
[doi:10.1109/INFCOM.2010.5461931](https://doi.org/10.1109/INFCOM.2010.5461931)

[[30](/en/ch9#Philips2014-marker)] Brandon Philips.
[etcd: Distributed Locking and Service
Discovery](https://www.youtube.com/watch?v=HJIjTTHWYnE). At *Strange Loop*, September 2014.

[[31](/en/ch9#Newman2012-marker)] Steve Newman.
[A Systematic Look at EC2 I/O](https://www.sentinelone.com/blog/a-systematic-look-at-ec2-i-o/).
*blog.scalyr.com*, October 2012.
Archived at [perma.cc/FL4R-H2VE](https://perma.cc/FL4R-H2VE)

[[32](/en/ch9#Hayashibara2004-marker)] Naohiro Hayashibara, Xavier Défago, Rami Yared, and
Takuya Katayama. [The ϕ Accrual Failure
Detector](https://hdl.handle.net/10119/4784). Japan Advanced Institute of Science and Technology, School of Information
Science, Technical Report IS-RR-2004-010, May 2004.
Archived at [perma.cc/NSM2-TRYA](https://perma.cc/NSM2-TRYA)

[[33](/en/ch9#Wang2013-marker)] Jeffrey Wang.
[Phi
Accrual Failure Detector](https://ternarysearch.blogspot.com/2013/08/phi-accrual-failure-detector.html). *ternarysearch.blogspot.co.uk*, August 2013.
Archived at [perma.cc/L452-AMLV](https://perma.cc/L452-AMLV)

[[34](/en/ch9#Keshav1997-marker)] Srinivasan Keshav. *An Engineering Approach
to Computer Networking: ATM Networks, the Internet, and the Telephone Network*.
Addison-Wesley Professional, May 1997. ISBN: 978-0-201-63442-6

[[35](/en/ch9#Kyas1995-marker)] Othmar Kyas. *ATM Networks*.
International Thomson Publishing, 1995. ISBN: 978-1-850-32128-6

[[36](/en/ch9#Mellanox2014-marker)] Mellanox Technologies.
[InfiniBand
FAQ, Rev 1.3](https://network.nvidia.com/related-docs/whitepapers/InfiniBandFAQ_FQ_100.pdf). *network.nvidia.com*, December 2014.
Archived at [perma.cc/LQJ4-QZVK](https://perma.cc/LQJ4-QZVK)

[[37](/en/ch9#Santos2003-marker)] Jose Renato Santos, Yoshio Turner, and G. (John) Janakiraman.
[End-to-End Congestion Control
for InfiniBand](https://infocom2003.ieee-infocom.org/papers/28_01.PDF). At *22nd Annual Joint Conference of the IEEE Computer and
Communications Societies* (INFOCOM), April 2003. Also published by HP Laboratories Palo
Alto, Tech Report HPL-2002-359.
[doi:10.1109/INFCOM.2003.1208949](https://doi.org/10.1109/INFCOM.2003.1208949)

[[38](/en/ch9#Li2014-marker)] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and
Steven D. Gribble.
[Tales of the Tail: Hardware,
OS, and Application-level Sources of Tail Latency](https://syslab.cs.washington.edu/papers/latency-socc14.pdf). At *ACM Symposium on Cloud Computing*
(SoCC), November 2014.
[doi:10.1145/2670979.2670988](https://doi.org/10.1145/2670979.2670988)

[[39](/en/ch9#Windl2006-marker)] Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley.
[The NTP FAQ and HOWTO](https://www.ntp.org/ntpfaq/). *ntp.org*, November 2006.

[[40](/en/ch9#GrahamCumming2017-marker)] John Graham-Cumming.
[How and
why the leap second affected Cloudflare DNS](https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/). *blog.cloudflare.com*, January 2017.
Archived at [archive.org](https://web.archive.org/web/20250202041444/https%3A//blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/)

[[41](/en/ch9#Holmes2006-marker)] David Holmes.
[Inside
the Hotspot VM: Clocks, Timers and Scheduling Events – Part I – Windows](https://web.archive.org/web/20160308031939/https%3A//blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks). *blogs.oracle.com*,
October 2006. Archived at [archive.org](https://web.archive.org/web/20160308031939/https%3A//blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks)

[[42](/en/ch9#Greef2021-marker)] Joran Dirk Greef.
[Three Clocks are
Better than One](https://tigerbeetle.com/blog/2021-08-30-three-clocks-are-better-than-one/). *tigerbeetle.com*, August 2021.
Archived at [perma.cc/5RXG-EU6B](https://perma.cc/5RXG-EU6B)

[[43](/en/ch9#Yang2015-marker)] Oliver Yang.
[Pitfalls of TSC usage](https://oliveryang.net/2015/09/pitfalls-of-TSC-usage/).
*oliveryang.net*, September 2015.
Archived at [perma.cc/Z2QY-5FRA](https://perma.cc/Z2QY-5FRA)

[[44](/en/ch9#Loughran2015-marker)] Steve Loughran.
[Time
on Multi-Core, Multi-Socket Servers](https://steveloughran.blogspot.com/2015/09/time-on-multi-core-multi-socket-servers.html). *steveloughran.blogspot.co.uk*, September 2015.
Archived at [perma.cc/7M4S-D4U6](https://perma.cc/7M4S-D4U6)

[[45](/en/ch9#Corbett2012_ch9-marker)] James C. Corbett, Jeffrey Dean, Michael
Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher
Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd,
Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford,
Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang.
[Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/).
At *10th USENIX Symposium on Operating Systems Design and Implementation* (OSDI),
October 2012.

[[46](/en/ch9#Caporaloni2012-marker)] M. Caporaloni and R. Ambrosini.
[How Closely Can a Personal Computer
Clock Track the UTC Timescale Via the Internet?](https://iopscience.iop.org/0143-0807/23/4/103/) *European Journal of Physics*,
volume 23, issue 4, pages L17–L21, June 2002.
[doi:10.1088/0143-0807/23/4/103](https://doi.org/10.1088/0143-0807/23/4/103)

[[47](/en/ch9#Minar1999-marker)] Nelson Minar.
[A Survey of the NTP Network](https://alumni.media.mit.edu/~nelson/research/ntp-survey99/).
*alumni.media.mit.edu*, December 1999.
Archived at [perma.cc/EV76-7ZV3](https://perma.cc/EV76-7ZV3)

[[48](/en/ch9#Holub2014-marker)] Viliam Holub.
[Synchronizing
Clocks in a Cassandra Cluster Pt. 1 – The Problem](https://blog.rapid7.com/2014/03/14/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/). *blog.rapid7.com*, March 2014.
Archived at [perma.cc/N3RV-5LNL](https://perma.cc/N3RV-5LNL)

[[49](/en/ch9#Kamp2011-marker)] Poul-Henning Kamp.
[The One-Second War (What Time Will You Die?)](https://queue.acm.org/detail.cfm?id=1967009)
*ACM Queue*, volume 9, issue 4, pages 44–48, April 2011.
[doi:10.1145/1966989.1967009](https://doi.org/10.1145/1966989.1967009)

[[50](/en/ch9#Minar2012_ch9-marker)] Nelson Minar.
[Leap Second Crashes Half
the Internet](https://www.somebits.com/weblog/tech/bad/leap-second-2012.html). *somebits.com*, July 2012.
Archived at [perma.cc/2WB8-D6EU](https://perma.cc/2WB8-D6EU)

[[51](/en/ch9#Pascoe2011-marker)] Christopher Pascoe.
[Time,
Technology and Leaping Seconds](https://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html). *googleblog.blogspot.co.uk*, September 2011.
Archived at [perma.cc/U2JL-7E74](https://perma.cc/U2JL-7E74)

[[52](/en/ch9#Zhao2015-marker)] Mingxue Zhao and Jeff Barr.
[Look
Before You Leap – The Coming Leap Second and AWS](https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/). *aws.amazon.com*, May 2015.
Archived at [perma.cc/KPE9-XMFM](https://perma.cc/KPE9-XMFM)

[[53](/en/ch9#Veitch2016-marker)] Darryl Veitch and Kanthaiah Vijayalayan.
[Network Timing
and the 2015 Leap Second](https://opus.lib.uts.edu.au/bitstream/10453/43923/1/LeapSecond_camera.pdf). At *17th International Conference on Passive and Active
Measurement* (PAM), April 2016.
[doi:10.1007/978-3-319-30505-9\_29](https://doi.org/10.1007/978-3-319-30505-9_29)

[[54](/en/ch9#VMware2011-marker)] VMware, Inc.
[Timekeeping in VMware Virtual
Machines](https://www.vmware.com/docs/vmware_timekeeping). *vmware.com*, October 2008.
Archived at [perma.cc/HM5R-T5NF](https://perma.cc/HM5R-T5NF)

[[55](/en/ch9#Yodaiken2017-marker)] Victor Yodaiken.
[Clock
Synchronization in Finance and Beyond](https://www.yodaiken.com/wp-content/uploads/2018/05/financeandbeyond.pdf). *yodaiken.com*, November 2017.
Archived at [perma.cc/9XZD-8ZZN](https://perma.cc/9XZD-8ZZN)

[[56](/en/ch9#EmreAcer2017-marker)] Mustafa Emre Acer, Emily Stark, Adrienne Porter
Felt, Sascha Fahl, Radhika Bhargava, Bhanu Dev, Matt Braithwaite, Ryan Sleevi, and Parisa Tabriz.
[Where the Wild Warnings Are: Root Causes
of Chrome HTTPS Certificate Errors](https://acmccs.github.io/papers/p1407-acerA.pdf). At *ACM SIGSAC Conference on Computer and
Communications Security* (CCS), pages 1407–1420, October 2017.
[doi:10.1145/3133956.3134007](https://doi.org/10.1145/3133956.3134007)

[[57](/en/ch9#MiFID2015-marker)] European Securities and Markets Authority.
[MiFID
II / MiFIR: Regulatory Technical and Implementing Standards – Annex I](https://www.esma.europa.eu/sites/default/files/library/2015/11/2015-esma-1464_annex_i_-_draft_rts_and_its_on_mifid_ii_and_mifir.pdf).
*esma.europa.eu*, Report ESMA/2015/1464, September 2015.
Archived at [perma.cc/ZLX9-FGQ3](https://perma.cc/ZLX9-FGQ3)

[[58](/en/ch9#Bigum2015-marker)] Luke Bigum.
[Solving
MiFID II Clock Synchronisation With Minimum Spend (Part 1)](https://catach.blogspot.com/2015/11/solving-mifid-ii-clock-synchronisation.html). *catach.blogspot.com*,
November 2015. Archived at [perma.cc/4J5W-FNM4](https://perma.cc/4J5W-FNM4)

[[59](/en/ch9#Obleukhov2022-marker)] Oleg Obleukhov and Ahmad Byagowi.
[How
Precision Time Protocol is being deployed at Meta](https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/). *engineering.fb.com*, November 2022.
Archived at [perma.cc/29G6-UJNW](https://perma.cc/29G6-UJNW)

[[60](/en/ch9#Wiseman2022-marker)] John Wiseman.
[gpsjam.org](https://gpsjam.org/), July 2022.

[[61](/en/ch9#Levinson2023-marker)] Josh Levinson, Julien Ridoux, and Chris Munns.
[It’s
About Time: Microsecond-Accurate Clocks on Amazon EC2 Instances](https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/). *aws.amazon.com*, November 2023.
Archived at [perma.cc/56M6-5VMZ](https://perma.cc/56M6-5VMZ)

[[62](/en/ch9#Kingsbury2013cassandra-marker)] Kyle Kingsbury.
[Call Me Maybe: Cassandra](https://aphyr.com/posts/294-call-me-maybe-cassandra/).
*aphyr.com*, September 2013.
Archived at [perma.cc/4MBR-J96V](https://perma.cc/4MBR-J96V)

[[63](/en/ch9#Daily2013_ch9-marker)] John Daily.
[Clocks Are Bad, or,
Welcome to the Wonderful World of Distributed Systems](https://riak.com/clocks-are-bad-or-welcome-to-distributed-systems/). *riak.com*, November 2013.
Archived at [perma.cc/4XB5-UCXY](https://perma.cc/4XB5-UCXY)

[[64](/en/ch9#Brooker2023time-marker)] Marc Brooker.
[It’s About Time!](https://brooker.co.za/blog/2023/11/27/about-time.html)
*brooker.co.za*, November 2023.
Archived at [perma.cc/N6YK-DRPA](https://perma.cc/N6YK-DRPA)

[[65](/en/ch9#Kingsbury2013timestamps-marker)] Kyle Kingsbury.
[The Trouble with Timestamps](https://aphyr.com/posts/299-the-trouble-with-timestamps).
*aphyr.com*, October 2013.
Archived at [perma.cc/W3AM-5VAV](https://perma.cc/W3AM-5VAV)

[[66](/en/ch9#Lamport1978_ch9-marker)] Leslie Lamport.
[Time,
Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*,
volume 21, issue 7, pages 558–565, July 1978.
[doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563)

[[67](/en/ch9#Sheehy2015-marker)] Justin Sheehy.
[There Is No Now: Problems With Simultaneity
in Distributed Systems](https://queue.acm.org/detail.cfm?id=2745385). *ACM Queue*, volume 13, issue 3, pages 36–41, March 2015.
[doi:10.1145/2733108](https://doi.org/10.1145/2733108)

[[68](/en/ch9#Demirbas2013-marker)] Murat Demirbas.
[Spanner:
Google’s Globally-Distributed Database](https://muratbuffalo.blogspot.com/2013/07/spanner-googles-globally-distributed_4.html). *muratbuffalo.blogspot.co.uk*, July 2013.
Archived at [perma.cc/6VWR-C9WB](https://perma.cc/6VWR-C9WB)

[[69](/en/ch9#Malkhi2013-marker)] Dahlia Malkhi and Jean-Philippe Martin.
[Spanner’s Concurrency
Control](https://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf). *ACM SIGACT News*, volume 44, issue 3, pages 73–77, September 2013.
[doi:10.1145/2527748.2527767](https://doi.org/10.1145/2527748.2527767)

[[70](/en/ch9#Pachot2024-marker)] Franck Pachot.
[Achieving Precise Clock
Synchronization on AWS](https://www.yugabyte.com/blog/aws-clock-synchronization/). *yugabyte.com*, December 2024.
Archived at [perma.cc/UYM6-RNBS](https://perma.cc/UYM6-RNBS)

[[71](/en/ch9#Kimball2022-marker)] Spencer Kimball.
[Living Without Atomic
Clocks: Where CockroachDB and Spanner diverge](https://www.cockroachlabs.com/blog/living-without-atomic-clocks/). *cockroachlabs.com*, January 2022.
Archived at [perma.cc/AWZ7-RXFT](https://perma.cc/AWZ7-RXFT)

[[72](/en/ch9#Demirbas2025-marker)] Murat Demirbas.
[Use of
Time in Distributed Databases (part 4): Synchronized clocks in production databases](https://muratbuffalo.blogspot.com/2025/01/use-of-time-in-distributed-databases.html).
*muratbuffalo.blogspot.com*, January 2025.
Archived at [perma.cc/9WNX-Q9U3](https://perma.cc/9WNX-Q9U3)

[[73](/en/ch9#Gray1989-marker)] Cary G. Gray and David R. Cheriton.
[Leases: An Efficient
Fault-Tolerant Mechanism for Distributed File Cache Consistency](https://courses.cs.duke.edu/spring11/cps210/papers/p202-gray.pdf). At
*12th ACM Symposium on Operating Systems Principles* (SOSP), December 1989.
[doi:10.1145/74850.74870](https://doi.org/10.1145/74850.74870)

[[74](/en/ch9#Sturman2022-marker)] Daniel Sturman, Scott Delap, Max Ross, et al.
[Roblox
Return to Service](https://corp.roblox.com/newsroom/2022/01/roblox-return-to-service-10-28-10-31-2021). *corp.roblox.com*, January 2022.
Archived at [perma.cc/8ALT-WAS4](https://perma.cc/8ALT-WAS4)

[[75](/en/ch9#Lipcon2011-marker)] Todd Lipcon.
[Avoiding Full GCs
with MemStore-Local Allocation Buffers](https://www.slideshare.net/slideshow/hbase-hug-presentation/7038178). *slideshare.net*, February 2011.
Archived at [perma.cc/CH62-2EWJ](https://perma.cc/CH62-2EWJ)

[[76](/en/ch9#Clark2005-marker)] Christopher Clark, Keir Fraser, Steven Hand,
Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield.
[Live
Migration of Virtual Machines](https://www.usenix.org/legacy/publications/library/proceedings/nsdi05/tech/full_papers/clark/clark.pdf). At *2nd USENIX Symposium on
Networked Systems Design & Implementation* (NSDI), May 2005.

[[77](/en/ch9#Shaver2008-marker)] Mike Shaver.
[fsyncers and
Curveballs](https://web.archive.org/web/20220107141023/http%3A//shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/). *shaver.off.net*, May 2008. Archived at
[archive.org](https://web.archive.org/web/20220107141023/http%3A//shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/)

[[78](/en/ch9#Zhuang2016-marker)] Zhenyun Zhuang and Cuong Tran.
[Eliminating
Large JVM GC Pauses Caused by Background IO Traffic](https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic). *engineering.linkedin.com*, February 2016.
Archived at [perma.cc/ML2M-X9XT](https://perma.cc/ML2M-X9XT)

[[79](/en/ch9#Thompson2013-marker)] Martin Thompson.
[Java
Garbage Collection Distilled](https://mechanical-sympathy.blogspot.com/2013/07/java-garbage-collection-distilled.html). *mechanical-sympathy.blogspot.co.uk*, July 2013.
Archived at [perma.cc/DJT3-NQLQ](https://perma.cc/DJT3-NQLQ)

[[80](/en/ch9#Terei2015-marker)] David Terei and Amit Levy.
[Blade: A Data Center Garbage Collector](https://arxiv.org/pdf/1504.02578).
arXiv:1504.02578, April 2015.

[[81](/en/ch9#Maas2015-marker)] Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz.
[Trash Day: Coordinating Garbage Collection in
Distributed Systems](https://timharris.uk/papers/2015-hotos.pdf). At *15th USENIX Workshop on Hot Topics in Operating Systems*
(HotOS), May 2015.

[[82](/en/ch9#Fowler2011_ch9-marker)] Martin Fowler.
[The LMAX Architecture](https://martinfowler.com/articles/lmax.html).
*martinfowler.com*, July 2011.
Archived at [perma.cc/5AV4-N6RJ](https://perma.cc/5AV4-N6RJ)

[[83](/en/ch9#Halpern1990-marker)] Joseph Y. Halpern and Yoram Moses.
[Knowledge and common knowledge
in a distributed environment](https://groups.csail.mit.edu/tds/papers/Halpern/JACM90.pdf). *Journal of the ACM* (JACM), volume 37, issue 3, pages
549–587, July 1990.
[doi:10.1145/79147.79161](https://doi.org/10.1145/79147.79161)

[[84](/en/ch9#Tang2022-marker)] Chuzhe Tang, Zhaoguo Wang, Xiaodong Zhang, Qianmian
Yu, Binyu Zang, Haibing Guan, and Haibo Chen.
[Ad Hoc Transactions
in Web Applications: The Good, the Bad, and the Ugly](https://ipads.se.sjtu.edu.cn/_media/publications/concerto-sigmod22.pdf). At *ACM International Conference on
Management of Data* (SIGMOD), June 2022.
[doi:10.1145/3514221.3526120](https://doi.org/10.1145/3514221.3526120)

[[85](/en/ch9#Junqueira2013_ch9-marker)] Flavio P. Junqueira and Benjamin Reed.
[*ZooKeeper: Distributed
Process Coordination*](https://www.oreilly.com/library/view/zookeeper/9781449361297/). O’Reilly Media, 2013. ISBN: 978-1-449-36130-3

[[86](/en/ch9#Soztutar2013hdfs-marker)] Enis Söztutar.
[HBase
and HDFS: Understanding Filesystem Usage in HBase](https://www.slideshare.net/slideshow/hbase-and-hdfs-understanding-filesystem-usage/22990858). At *HBaseCon*, June 2013.
Archived at [perma.cc/4DXR-9P88](https://perma.cc/4DXR-9P88)

[[87](/en/ch9#SUSE2025-marker)] SUSE LLC.
[SUSE
Linux Enterprise High Availability 15 SP6 Administration Guide, Section 12: Fencing and STONITH](https://documentation.suse.com/sle-ha/15-SP6/html/SLE-HA-all/cha-ha-fencing.html).
*documentation.suse.com*, March 2025.
Archived at [perma.cc/8LAR-EL9D](https://perma.cc/8LAR-EL9D)

[[88](/en/ch9#Burrows2006_ch9-marker)] Mike Burrows.
[The Chubby Lock Service for Loosely-Coupled
Distributed Systems](https://research.google/pubs/pub27897/). At *7th USENIX Symposium on Operating Systems Design and
Implementation* (OSDI), November 2006.

[[89](/en/ch9#Kingsbury2020etcd-marker)] Kyle Kingsbury.
[etcd 3.4.3](https://jepsen.io/analyses/etcd-3.4.3). *jepsen.io*, January 2020.
Archived at [perma.cc/2P3Y-MPWU](https://perma.cc/2P3Y-MPWU)

[[90](/en/ch9#BasriKahveci2019-marker)] Ensar Basri Kahveci.
[Distributed Locks are Dead; Long
Live Distributed Locks!](https://hazelcast.com/blog/long-live-distributed-locks/) *hazelcast.com*, April 2019.
Archived at [perma.cc/7FS5-LDXE](https://perma.cc/7FS5-LDXE)

[[91](/en/ch9#Kleppmann2016-marker)] Martin Kleppmann.
[How to do
distributed locking](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html). *martin.kleppmann.com*, February 2016.
Archived at [perma.cc/Y24W-YQ5L](https://perma.cc/Y24W-YQ5L)

[[92](/en/ch9#Sanfilippo2016-marker)] Salvatore Sanfilippo.
[Is Redlock safe?](https://antirez.com/news/101) *antirez.com*, February 2016.
Archived at [perma.cc/B6GA-9Q6A](https://perma.cc/B6GA-9Q6A)

[[93](/en/ch9#Morling2024_ch9-marker)] Gunnar Morling.
[Leader
Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024.
Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y)

[[94](/en/ch9#Lamport1982-marker)] Leslie Lamport, Robert Shostak, and Marshall Pease.
[The
Byzantine Generals Problem](https://www.microsoft.com/en-us/research/publication/byzantine-generals-problem/). *ACM Transactions on Programming Languages and Systems*
(TOPLAS), volume 4, issue 3, pages 382–401, July 1982.
[doi:10.1145/357172.357176](https://doi.org/10.1145/357172.357176)

[[95](/en/ch9#Gray1978-marker)] Jim N. Gray.
[Notes on Data Base
Operating Systems](https://jimgray.azurewebsites.net/papers/dbos.pdf). In *Operating Systems: An Advanced Course*, Lecture
Notes in Computer Science, volume 60, edited by R. Bayer, R. M. Graham, and G. Seegmüller,
pages 393–481, Springer-Verlag, 1978. ISBN: 978-3-540-08755-7.
Archived at [perma.cc/7S9M-2LZU](https://perma.cc/7S9M-2LZU)

[[96](/en/ch9#Palmer2011-marker)] Brian Palmer.
[How
Complicated Was the Byzantine Empire?](https://slate.com/news-and-politics/2011/10/the-byzantine-tax-code-how-complicated-was-byzantium-anyway.html) *slate.com*, October 2011.
Archived at [perma.cc/AN7X-FL3N](https://perma.cc/AN7X-FL3N)

[[97](/en/ch9#LamportPubs-marker)] Leslie Lamport.
[My Writings](https://lamport.azurewebsites.net/pubs/pubs.html).
*lamport.azurewebsites.net*, December 2014.
Archived at [perma.cc/5NNM-SQGR](https://perma.cc/5NNM-SQGR)

[[98](/en/ch9#Rushby2001-marker)] John Rushby.
[Bus Architectures for
Safety-Critical Embedded Systems](https://www.csl.sri.com/papers/emsoft01/emsoft01.pdf). At *1st International Workshop on Embedded Software*
(EMSOFT), October 2001.
[doi:10.1007/3-540-45449-7\_22](https://doi.org/10.1007/3-540-45449-7_22)

[[99](/en/ch9#Edge2013-marker)] Jake Edge.
[ELC: SpaceX Lessons Learned](https://lwn.net/Articles/540368/). *lwn.net*,
March 2013. Archived at [perma.cc/AYX8-QP5X](https://perma.cc/AYX8-QP5X)

[[100](/en/ch9#Bano2019_ch9-marker)] Shehar Bano, Alberto Sonnino, Mustafa
Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis.
[SoK: Consensus in the Age of Blockchains](https://smeiklej.com/files/aft19a.pdf). At
*1st ACM Conference on Advances in Financial Technologies* (AFT), October 2019.
[doi:10.1145/3318041.3355458](https://doi.org/10.1145/3318041.3355458)

[[101](/en/ch9#Feilden2024-marker)] Ezra Feilden, Adi Oltean, and Philip Johnston.
[Why we should train AI in space](https://www.starcloud.com/wp).
White Paper, *starcloud.com*, September 2024.
Archived at [perma.cc/7Y3S-8UB6](https://perma.cc/7Y3S-8UB6)

[[102](/en/ch9#Mickens2013-marker)] James Mickens.
[The Saddest
Moment](https://www.usenix.org/system/files/login-logout_1305_mickens.pdf). *USENIX ;login:*, May 2013.
Archived at [perma.cc/T7BZ-XCFR](https://perma.cc/T7BZ-XCFR)

[[103](/en/ch9#Kleppmann2020-marker)] Martin Kleppmann and Heidi Howard.
[Byzantine Eventual Consistency and the Fundamental Limits
of Peer-to-Peer Databases](https://arxiv.org/abs/2012.00472). *arxiv.org*, December 2020.
[doi:10.48550/arXiv.2012.00472](https://doi.org/10.48550/arXiv.2012.00472)

[[104](/en/ch9#Kleppmann2022-marker)] Martin Kleppmann.
[Making CRDTs Byzantine Fault
Tolerant](https://martin.kleppmann.com/papers/bft-crdt-papoc22.pdf). At *9th Workshop on Principles and Practice of Consistency for Distributed
Data* (PaPoC), April 2022.
[doi:10.1145/3517209.3524042](https://doi.org/10.1145/3517209.3524042)

[[105](/en/ch9#Gilman2015-marker)] Evan Gilman.
[The
Discovery of Apache ZooKeeper’s Poison Packet](https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/). *pagerduty.com*, May 2015.
Archived at [perma.cc/RV6L-Y5CQ](https://perma.cc/RV6L-Y5CQ)

[[106](/en/ch9#Stone2000-marker)] Jonathan Stone and Craig Partridge.
[When
the CRC and TCP Checksum Disagree](https://conferences2.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf). At *ACM Conference on Applications,
Technologies, Architectures, and Protocols for Computer Communication* (SIGCOMM), August 2000.
[doi:10.1145/347059.347561](https://doi.org/10.1145/347059.347561)

[[107](/en/ch9#Jones2015-marker)] Evan Jones.
[How Both TCP and Ethernet
Checksums Fail](https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html). *evanjones.ca*, October 2015.
Archived at [perma.cc/9T5V-B8X5](https://perma.cc/9T5V-B8X5)

[[108](/en/ch9#Dwork1988_ch9-marker)] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer.
[Consensus in the
Presence of Partial Synchrony](https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf). *Journal of the ACM*, volume 35, issue 2, pages 288–323,
April 1988. [doi:10.1145/42282.42283](https://doi.org/10.1145/42282.42283)

[[109](/en/ch9#Schlichting1983-marker)] Richard D. Schlichting and Fred B. Schneider.
[Fail-stop processors: an
approach to designing fault-tolerant computing systems](https://www.cs.cornell.edu/fbs/publications/Fail_Stop.pdf). *ACM Transactions on Computer
Systems* (TOCS), volume 1, issue 3, pages 222–238, August 1983.
[doi:10.1145/357369.357371](https://doi.org/10.1145/357369.357371)

[[110](/en/ch9#Do2013-marker)] Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa,
Tiratat Patana-anake, and Haryadi S. Gunawi.
[Limplock: Understanding the Impact
of Limpware on Scale-out Cloud Systems](https://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf). At *4th ACM Symposium on Cloud Computing*
(SoCC), October 2013.
[doi:10.1145/2523616.2523627](https://doi.org/10.1145/2523616.2523627)

[[111](/en/ch9#Snyder2019-marker)] Josh Snyder and Joseph Lynch.
[Garbage collecting
unhealthy JVMs, a proactive approach](https://netflixtechblog.medium.com/introducing-jvmquake-ec944c60ba70). Netflix Technology Blog,
*netflixtechblog.medium.com*, November 2019.
Archived at [perma.cc/8BTA-N3YB](https://perma.cc/8BTA-N3YB)

[[112](/en/ch9#Gunawi2018_ch9-marker)] Haryadi S. Gunawi, Riza O. Suminto, Russell
Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah
Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree
Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li.
[Fail-Slow at
Scale: Evidence of Hardware Performance Faults in Large Production Systems](https://www.usenix.org/system/files/conference/fast18/fast18-gunawi.pdf).
At *16th USENIX Conference on File and Storage Technologies* (FAST), February 2018.

[[113](/en/ch9#Huang2017_ch9-marker)] Peng Huang, Chuanxiong Guo, Lidong Zhou,
Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao.
[Gray
Failure: The Achilles’ Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in
Operating Systems* (HotOS), May 2017.
[doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005)

[[114](/en/ch9#Lou2020-marker)] Chang Lou, Peng Huang, and Scott Smith.
[Understanding, Detecting and
Localizing Partial Failures in Large System Software](https://www.usenix.org/conference/nsdi20/presentation/lou). At *17th USENIX Symposium on
Networked Systems Design and Implementation* (NSDI), February 2020.

[[115](/en/ch9#Bailis2013_ch9-marker)] Peter Bailis and Ali Ghodsi.
[Eventual Consistency Today: Limitations,
Extensions, and Beyond](https://queue.acm.org/detail.cfm?id=2462076). *ACM Queue*, volume 11, issue 3, pages 55–63, March 2013.
[doi:10.1145/2460276.2462076](https://doi.org/10.1145/2460276.2462076)

[[116](/en/ch9#Alpern1985-marker)] Bowen Alpern and Fred B. Schneider.
[Defining Liveness](https://www.cs.cornell.edu/fbs/publications/DefLiveness.pdf).
*Information Processing Letters*, volume 21, issue 4, pages 181–185, October 1985.
[doi:10.1016/0020-0190(85)90056-0](https://doi.org/10.1016/0020-0190%2885%2990056-0)

[[117](/en/ch9#Junqueira2015-marker)] Flavio P. Junqueira.
[Dude, Where’s My Metadata?](https://fpj.me/2015/05/28/dude-wheres-my-metadata/)
*fpj.me*, May 2015.
Archived at [perma.cc/D2EU-Y9S5](https://perma.cc/D2EU-Y9S5)

[[118](/en/ch9#Sanders2016-marker)] Scott Sanders.
[January 28th Incident
Report](https://github.com/blog/2106-january-28th-incident-report). *github.com*, February 2016.
Archived at [perma.cc/5GZR-88TV](https://perma.cc/5GZR-88TV)

[[119](/en/ch9#Kreps2013-marker)] Jay Kreps.
[A Few Notes
on Kafka and Jepsen](https://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen). *blog.empathybox.com*, September 2013.
Archived at [perma.cc/XJ5C-F583](https://perma.cc/XJ5C-F583)

[[120](/en/ch9#Brooker2024correctness-marker)] Marc Brooker and Ankush Desai.
[Systems Correctness Practices at AWS](https://dl.acm.org/doi/pdf/10.1145/3712057).
*ACM Queue*, volume 22, issue 6, November/December 2024.
[doi:10.1145/3712057](https://doi.org/10.1145/3712057)

[[121](/en/ch9#SatarinTesting-marker)] Andrey Satarin.
[Testing Distributed Systems:
Curated list of resources on testing distributed systems](https://asatarin.github.io/testing-distributed-systems/). *asatarin.github.io*.
Archived at [perma.cc/U5V8-XP24](https://perma.cc/U5V8-XP24)

[[122](/en/ch9#Vanlightly2024-marker)] Jack Vanlightly.
[Verifying Kafka transactions - Diary entry 2 - Writing an initial TLA+ spec](https://jack-vanlightly.com/analyses/2024/12/3/verifying-kafka-transactions-diary-entry-2-writing-an-initial-tla-spec).
*jack-vanlightly.com*, December 2024.
Archived at [perma.cc/NSQ8-MQ5N](https://perma.cc/NSQ8-MQ5N)

[[123](/en/ch9#Tang2018-marker)] Siddon Tang.
[From Chaos to Order — Tools and
Techniques for Testing TiDB, A Distributed NewSQL Database](https://www.pingcap.com/blog/chaos-practice-in-tidb/). *pingcap.com*, April 2018.
Archived at [perma.cc/5EJB-R29F](https://perma.cc/5EJB-R29F)

[[124](/en/ch9#VanBenschoten2019-marker)] Nathan VanBenschoten.
[Parallel Commits: An atomic commit
protocol for globally distributed transactions](https://www.cockroachlabs.com/blog/parallel-commits/). *cockroachlabs.com*, November 2019.
Archived at [perma.cc/5FZ7-QK6J](https://perma.cc/5FZ7-QK6J)

[[125](/en/ch9#Vanlightly2022-marker)] Jack Vanlightly.
[Paper: VR Revisited - State Transfer (part 3)](https://jack-vanlightly.com/analyses/2022/12/28/paper-vr-revisited-state-transfer-part-3).
*jack-vanlightly.com*, December 2022.
Archived at [perma.cc/KNK3-K6WS](https://perma.cc/KNK3-K6WS)

[[126](/en/ch9#Wayne2024-marker)] Hillel Wayne.
[What if
the spec doesn’t match the code?](https://buttondown.com/hillelwayne/archive/what-if-the-spec-doesnt-match-the-code/) *buttondown.com*, March 2024.
Archived at [perma.cc/8HEZ-KHER](https://perma.cc/8HEZ-KHER)

[[127](/en/ch9#Ouyang2025-marker)] Lingzhi Ouyang, Xudong Sun, Ruize Tang, Yu Huang,
Madhav Jivrajani, Xiaoxing Ma, Tianyin Xu.
[Multi-Grained Specifications for Distributed System Model
Checking and Verification](https://arxiv.org/abs/2409.14301). At *20th European Conference on Computer Systems* (EuroSys),
March 2025. [doi:10.1145/3689031.3696069](https://doi.org/10.1145/3689031.3696069)

[[128](/en/ch9#Izrailevsky2011-marker)] Yury Izrailevsky and Ariel Tseitlin.
[The Netflix Simian Army](https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116).
*netflixtechblog.com*, July 2011.
Archived at [perma.cc/M3NY-FJW6](https://perma.cc/M3NY-FJW6)

[[129](/en/ch9#Kingsbury2013jepsen-marker)] Kyle Kingsbury.
[Jepsen: On the perils of network partitions](https://aphyr.com/posts/281-jepsen-on-the-perils-of-network-partitions).
*aphyr.com*, May 2013.
Archived at [perma.cc/W98G-6HQP](https://perma.cc/W98G-6HQP)

[[130](/en/ch9#Kingsbury2024-marker)] Kyle Kingsbury.
[Jepsen Analyses](https://jepsen.io/analyses). *jepsen.io*, 2024.
Archived at [perma.cc/8LDN-D2T8](https://perma.cc/8LDN-D2T8)

[[131](/en/ch9#Majumdar2017-marker)] Rupak Majumdar and Filip Niksic.
[Why is random testing effective for partition
tolerance bugs?](https://dl.acm.org/doi/pdf/10.1145/3158134) *Proceedings of the ACM on Programming Languages* (PACMPL), volume 2,
issue POPL, article no. 46, December 2017.
[doi:10.1145/3158134](https://doi.org/10.1145/3158134)

[[132](/en/ch9#FoundationDB_ch9-marker)] FoundationDB project authors.
[Simulation and Testing](https://apple.github.io/foundationdb/testing.html).
*apple.github.io*.
Archived at [perma.cc/NQ3L-PM4C](https://perma.cc/NQ3L-PM4C)

[[133](/en/ch9#Kladov2023-marker)] Alex Kladov.
[Simulation
Testing For Liveness](https://tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness/). *tigerbeetle.com*, July 2023.
Archived at [perma.cc/RKD4-HGCR](https://perma.cc/RKD4-HGCR)

[[134](/en/ch9#Marques2024-marker)] Alfonso Subiotto Marqués.
[(Mostly)
Deterministic Simulation Testing in Go](https://www.polarsignals.com/blog/posts/2024/05/28/mostly-dst-in-go). *polarsignals.com*, May 2024.
Archived at [perma.cc/ULD6-TSA4](https://perma.cc/ULD6-TSA4)