---
title: "2. Defining Nonfunctional Requirements"
weight: 102
breadcrumbs: false
---

# Chapter 2. Defining Nonfunctional Requirements

> *The Internet was done so well that most people think of it as a natural resource like the Pacific
> Ocean, rather than something that was man-made. When was the last time a technology with a scale
> like that was so error-free?*
>
> [Alan Kay](https://www.drdobbs.com/architecture-and-design/interview-with-alan-kay/240003442),
> in interview with *Dr Dobb’s Journal* (2012)

If you are building an application, you will be driven by a list of requirements. At the top of your
list is most likely the functionality that the application must offer: what screens and what buttons
you need, and what each operation is supposed to do in order to fulfill the purpose of your
software. These are your *functional requirements*.

In addition, you probably also have some *nonfunctional requirements*: for example, the app should
be fast, reliable, secure, legally compliant, and easy to maintain. These requirements might not be
explicitly written down, because they may seem somewhat obvious, but they are just as important as
the app’s functionality: an app that is unbearably slow or unreliable might as well not exist.

Many nonfunctional requirements, such as security, fall outside the scope of this book. But there
are a few nonfunctional requirements that we will consider, and this chapter will help you
articulate them for your own systems:

* How to define and measure the *performance* of a system (see [“Describing Performance”](/en/ch2#sec_introduction_percentiles));
* What it means for a service to be *reliable*, namely continuing to work correctly even when
  things go wrong (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability));
* How to make a system *scalable* by having efficient ways of adding computing
  capacity as the load on the system grows (see [“Scalability”](/en/ch2#sec_introduction_scalability)); and
* How to make a system easier to maintain in the long term (see [“Maintainability”](/en/ch2#sec_introduction_maintainability)).

The terminology introduced in this chapter will also be useful in the following chapters, when we go
into the details of how data-intensive systems are implemented. However, abstract definitions can be
quite dry; to make the ideas more concrete, we will start this chapter with a case study of how a
social networking service might work, which will provide practical examples of performance and
scalability.

# Case Study: Social Network Home Timelines

Imagine you are given the task of implementing a social network in the style of X (formerly
Twitter), in which users can post messages and follow other users. This will be a huge
simplification of how such a service actually works
[[1](/en/ch2#Cvet2016),
[2](/en/ch2#Krikorian2012_ch2),
[3](/en/ch2#Twitter2023)],
but it will help illustrate some of the issues that arise in large-scale systems.

Let’s assume that users make 500 million posts per day, or 5,700 posts per second on average.
Occasionally, the rate can spike as high as 150,000 posts/second
[[4](/en/ch2#Krikorian2013)].
Let’s also assume that the average user follows 200 people and has 200 followers (although there is
a very wide range: most people have only a handful of followers, and a few celebrities such as
Barack Obama have over 100 million followers).

## Representing Users, Posts, and Follows

Imagine we keep all of the data in a relational database as shown in [Figure 2-1](/en/ch2#fig_twitter_relational). We
have one table for users, one table for posts, and one table for follow relationships.

###### Figure 2-1. Simple relational schema for a social network in which users can follow each other.

Let’s say the main read operation that our social network must support is the *home timeline*, which
displays recent posts by people you are following (for simplicity we will ignore ads, suggested
posts from people you are not following, and other extensions). We could write the following SQL
query to get the home timeline for a particular user:

```
-- Home timeline: the 1,000 most recent posts by accounts that current_user follows.
-- current_user is a placeholder for the ID of the user requesting the timeline.
SELECT posts.*, users.* FROM posts
  JOIN follows ON posts.sender_id = follows.followee_id
  JOIN users   ON posts.sender_id = users.id
  WHERE follows.follower_id = current_user
  ORDER BY posts.timestamp DESC
  LIMIT 1000
```

To execute this query, the database will use the `follows` table to find everybody who
`current_user` is following, look up recent posts by those users, and sort them by timestamp to get
the most recent 1,000 posts by any of the followed users.

Posts are supposed to be timely, so let’s assume that after somebody makes a post, we want their
followers to be able to see it within 5 seconds. One way of doing that would be for the user’s
client to repeat the query above every 5 seconds while the user is online (this is known as
*polling*). If we assume that 10 million users are online and logged in at the same time, that would
mean running the query 2 million times per second. Even if you increase the polling interval, this
is a lot.

Moreover, the query above is quite expensive: if you are following 200 people, it needs to fetch a
list of recent posts by each of those 200 people, and merge those lists. 2 million timeline queries
per second then means that the database needs to look up the recent posts from some sender 400
million times per second, which is a huge number. And that is the average case. Some users follow tens of
thousands of accounts; for them, this query is very expensive to execute, and difficult to make
fast.

## Materializing and Updating Timelines

How can we do better? Firstly, instead of polling, it would be better if the server actively pushed
new posts to any followers who are currently online. Secondly, we should precompute the results of
the query above so that a user’s request for their home timeline can be served from a cache.

Imagine that for each user we store a data structure containing their home timeline, i.e., the
recent posts by people they are following. Every time a user makes a post, we look up all of their
followers, and insert that post into the home timeline of each follower, like delivering a message to
a mailbox. Now when a user logs in, we can simply give them this home timeline that we precomputed.
Moreover, to receive a notification about any new posts on their timeline, the user’s client simply
needs to subscribe to the stream of posts being added to their home timeline.

The downside of this approach is that we now need to do more work every time a user makes a post,
because the home timelines are derived data that needs to be updated. The process is illustrated in
[Figure 2-2](/en/ch2#fig_twitter_timelines). When one initial request results in several downstream requests being
carried out, we use the term *fan-out* to describe the factor by which the number of requests
increases.

###### Figure 2-2. Fan-out: delivering new posts to every follower of the user who made the post.

At a rate of 5,700 posts per second, if the average post reaches 200 followers (i.e., a
fan-out factor of 200), we will need to do just over 1 million home timeline writes per second. This
is a lot, but it’s still a significant saving compared to the 400 million per-sender post lookups
per second that we would otherwise have to do.

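The load estimates in this case study can be checked with a few lines of arithmetic. The following is a rough sketch rather than a capacity plan: the variable names are made up for illustration, and the inputs are simply the assumed figures from the text (500 million posts per day, 200 follows and 200 followers on average, 10 million online users polling every 5 seconds).

```python
# Back-of-envelope load estimates for the case study. All inputs are the
# chapter's assumed figures, not measurements.
SECONDS_PER_DAY = 24 * 60 * 60

posts_per_day = 500_000_000
avg_follows = 200        # accounts followed by the average user
avg_followers = 200      # followers of the average user
online_users = 10_000_000
poll_interval_s = 5

posts_per_second = posts_per_day / SECONDS_PER_DAY

# Polling approach: every online user re-runs the timeline query,
# and each query looks up recent posts from every followed account.
timeline_queries_per_s = online_users / poll_interval_s
sender_lookups_per_s = timeline_queries_per_s * avg_follows

# Materialized approach: every post is written to each follower's timeline.
timeline_writes_per_s = posts_per_second * avg_followers

print(f"posts/s:            {posts_per_second:,.0f}")
print(f"timeline queries/s: {timeline_queries_per_s:,.0f}")
print(f"sender lookups/s:   {sender_lookups_per_s:,.0f}")
print(f"timeline writes/s:  {timeline_writes_per_s:,.0f}")
```

The exact quotient is about 5,787 posts per second, which the text gives as 5,700. Either way the fan-out comes to a little over 1 million timeline writes per second, compared with 400 million per-sender lookups per second for polling.
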
If the rate of posts spikes due to some special event, we don’t have to do the timeline
deliveries immediately: we can enqueue them and accept that it will temporarily take a bit longer for
posts to show up in followers’ timelines. Even during such load spikes, timelines remain fast to
load, since we simply serve them from a cache.

This process of precomputing and updating the results of a query is called *materialization*, and
the timeline cache is an example of a *materialized view* (a concept we will discuss further in
[Link to Come]). The materialized view speeds up reads, but in return we have to do more work on
write. The cost of writes for most users is modest, but a social network also has to consider some
extreme cases:

* If a user is following a very large number of accounts, and those accounts post a lot, that user
  will have a high rate of writes to their materialized timeline. However, in this case it’s
  unlikely that the user is actually reading all of the posts in their timeline, and therefore it’s
  okay to simply drop some of their timeline writes and show the user only a sample of the posts
  from the accounts they’re following
  [[5](/en/ch2#Volpert2025)].
* When a celebrity account with a very large number of followers makes a post, we have to do a large
  amount of work to insert that post into the home timelines of each of their millions of followers.
  In this case it’s not okay to drop some of those writes. One way of solving this problem is to
  handle celebrity posts separately from everyone else’s posts: we can save ourselves the effort of
  adding them to millions of timelines by storing the celebrity posts separately and merging them
  with the materialized timeline when it is read. Despite such optimizations, handling celebrities
  on a social network can require a lot of infrastructure
  [[6](/en/ch2#Axon2010_ch2)].

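To make the fan-out and the separate handling of celebrity posts more concrete, here is a minimal in-memory sketch. Everything in it is hypothetical: the names (`publish`, `home_timeline`, `CELEBRITY_THRESHOLD`), the tiny threshold, and the use of Python dictionaries in place of a database and cache. It also assumes that posts are published in timestamp order. It is meant only to show where the work happens on write versus on read, not how any real social network is built.

```python
from collections import defaultdict, deque
from dataclasses import dataclass
from itertools import islice
import heapq

CELEBRITY_THRESHOLD = 3   # hypothetical cutoff; a real system would use a far larger value
TIMELINE_LENGTH = 1000    # keep only the most recent entries per timeline


@dataclass(frozen=True)
class Post:
    timestamp: int
    sender: str
    text: str


followers = defaultdict(set)         # sender -> users who follow them
following = defaultdict(set)         # user -> senders they follow
timelines = defaultdict(lambda: deque(maxlen=TIMELINE_LENGTH))  # materialized view
celebrity_posts = defaultdict(list)  # sender -> posts, stored separately


def follow(follower: str, followee: str) -> None:
    followers[followee].add(follower)
    following[follower].add(followee)


def is_celebrity(user: str) -> bool:
    return len(followers[user]) >= CELEBRITY_THRESHOLD


def publish(post: Post) -> int:
    """Deliver a post; returns the number of timeline writes (the fan-out)."""
    if is_celebrity(post.sender):
        celebrity_posts[post.sender].append(post)   # no fan-out on write
        return 0
    for follower in followers[post.sender]:
        timelines[follower].appendleft(post)        # newest first
    return len(followers[post.sender])


def home_timeline(user: str, limit: int = 10) -> list:
    """Merge the precomputed timeline with celebrity posts at read time."""
    sources = [timelines[user]]
    for sender in following[user]:
        if is_celebrity(sender):
            sources.append(reversed(celebrity_posts[sender]))
    merged = heapq.merge(*sources, key=lambda p: p.timestamp, reverse=True)
    return list(islice(merged, limit))


for fan in ("bob", "carol", "dave"):
    follow(fan, "alice")                # alice reaches the celebrity threshold
follow("bob", "erin")

publish(Post(1, "erin", "hello"))       # one timeline write (bob)
publish(Post(2, "alice", "big news"))   # stored separately, no timeline writes
print([p.text for p in home_timeline("bob")])
```

The trade-off is visible in `publish` and `home_timeline`: ordinary posts cost one write per follower but make reads a simple lookup, whereas celebrity posts cost a single write and push a small merge onto every read.
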
# Describing Performance

Most discussions of software performance consider two main types of metric:

Response time
: The elapsed time from the moment when a user makes a request until they receive the requested
  answer. The unit of measurement is seconds (or milliseconds, or microseconds).

Throughput
: The number of requests per second, or the data volume per second, that the system is processing.
  For a given allocation of hardware resources, there is a *maximum throughput* that can be handled.
  The unit of measurement is “somethings per second”.

In the social network case study, “posts per second” and “timeline writes per second” are throughput
metrics, whereas the “time it takes to load the home timeline” or the “time until a post is
delivered to followers” are response time metrics.

There is often a connection between throughput and response time; an example of such a relationship
for an online service is sketched in [Figure 2-3](/en/ch2#fig_throughput). The service has a low response time when
request throughput is low, but response time increases as load increases. This is because of
*queueing*: when a request arrives on a highly loaded system, it’s likely that the CPU is already in
the process of handling an earlier request, and therefore the incoming request needs to wait until
the earlier request has been completed. As throughput approaches the maximum that the hardware can
handle, queueing delays increase sharply.

###### Figure 2-3. As the throughput of a service approaches its capacity, the response time increases dramatically due to queueing.

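One way to get a feel for the shape of this curve is the textbook M/M/1 queueing model (a single server with randomly timed arrivals and random service times). That model is an assumption brought in here purely for illustration; it does not come from this chapter, and real services rarely match it exactly. Under it, the mean response time is $S / (1 - \rho)$, where $S$ is the mean service time and $\rho$ is the utilization (throughput divided by maximum throughput). The function name and the 10 ms service time below are hypothetical.

```python
def mm1_response_time(service_time_s: float, throughput: float) -> float:
    """Mean response time of an M/M/1 queue (illustrative model only)."""
    max_throughput = 1.0 / service_time_s   # requests/s that one server can sustain
    utilization = throughput / max_throughput
    if not 0 <= utilization < 1:
        raise ValueError("throughput must be below the maximum throughput")
    return service_time_s / (1.0 - utilization)


service_time = 0.010  # hypothetical 10 ms mean service time, i.e. capacity of 100 requests/s
for load in (10, 50, 90, 99):
    response_ms = mm1_response_time(service_time, load) * 1000
    print(f"{load:>3} requests/s -> mean response time {response_ms:7.1f} ms")
```

In this model the response time merely doubles at 50% utilization, but grows tenfold at 90% and a hundredfold at 99%, which is the sharp rise near capacity that the figure describes.
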
# When an overloaded system won’t recover

If a system is close to overload, with throughput pushed close to the limit, it can sometimes enter a
vicious cycle where it becomes less efficient and hence even more overloaded. For example, if there
is a long queue of requests waiting to be handled, response times may increase so much that clients
time out and resend their request. This causes the rate of requests to increase even further, making
the problem worse (a *retry storm*). Even when the load is reduced again, such a system may remain in
an overloaded state until it is rebooted or otherwise reset. This phenomenon is called a *metastable
failure*, and it can cause serious outages in production systems
[[7](/en/ch2#Bronson2021),
[8](/en/ch2#Brooker2021)].

To avoid retries overloading a service, you can increase and randomize the time between successive
retries on the client side (*exponential backoff*
[[9](/en/ch2#Brooker2015),
[10](/en/ch2#Brooker2022backoff)]),
and temporarily stop sending requests to a service that has returned errors or timed out recently
(using a *circuit breaker* [[11](/en/ch2#Nygard2018),
[12](/en/ch2#Chen2022)]
or *token bucket* algorithm [[13](/en/ch2#Brooker2022retries)]).
The server can also detect when it is approaching overload and start proactively rejecting requests
(*load shedding* [[14](/en/ch2#YanacekLoadShedding)]), and send back
responses asking clients to slow down (*backpressure*
[[1](/en/ch2#Cvet2016),
[15](/en/ch2#Sackman2016_ch2)]).
The choice of queueing and load-balancing algorithms can also make a difference
[[16](/en/ch2#Kopytkov2018)].

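As an illustration of the client-side part, the sketch below shows one common form of exponential backoff with randomization (the "full jitter" variant, where each delay is drawn uniformly between zero and an exponentially growing bound). The function names, the default delays, and the retry limit are assumptions chosen for the example, not recommendations from the text.

```python
import random
import time


def backoff_delays(base_s=0.1, cap_s=10.0, max_retries=5, rng=random):
    """Yield randomized delays; the upper bound doubles after every attempt."""
    for attempt in range(max_retries):
        upper = min(cap_s, base_s * 2 ** attempt)
        yield rng.uniform(0, upper)   # full jitter: anywhere between 0 and the bound


def call_with_retries(request, max_retries=5, sleep=time.sleep):
    """Call request(); on failure wait with backoff and retry, then give up."""
    last_error = None
    # The first attempt happens immediately; each retry waits for a jittered delay.
    for delay in [0.0, *backoff_delays(max_retries=max_retries)]:
        sleep(delay)
        try:
            return request()
        except Exception as err:   # a real client would retry only retryable errors
            last_error = err
    raise last_error
```

Randomizing the delay matters as much as growing it: if every client waited exactly the same time, their retries would arrive in synchronized bursts and could keep the service overloaded.
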
In terms of performance metrics, the response time is usually what users care about the most,
whereas the throughput determines the required computing resources (e.g., how many servers you need),
and hence the cost of serving a particular workload. If throughput is likely to increase beyond what
the current hardware can handle, the capacity needs to be expanded; a system is said to be
*scalable* if its maximum throughput can be significantly increased by adding computing resources.

In this section we will focus primarily on response times, and we will return to throughput and
scalability in [“Scalability”](/en/ch2#sec_introduction_scalability).

## Latency and Response Time

“Latency” and “response time” are sometimes used interchangeably, but in this book we will use the
terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)):

* The *response time* is what the client sees; it includes all delays incurred anywhere in the
  system.
* The *service time* is the duration for which the service is actively processing the user request.
* *Queueing delays* can occur at several points in the flow: for example, after a request is
  received, it might need to wait until a CPU is available before it can be processed; a response
  packet might need to be buffered before it is sent over the network if other tasks on the same
  machine are sending a lot of data via the outbound network interface.
* *Latency* is a catch-all term for time during which a request is not being actively processed,
  i.e., during which it is *latent*. In particular, *network latency* or *network delay* refers to
  the time that request and response spend traveling through the network.

###### Figure 2-4. Response time, service time, network latency, and queueing delay.

In [Figure 2-4](/en/ch2#fig_response_time), time flows from left to right, each communicating node is shown as a
horizontal line, and a request or response message is shown as a thick diagonal arrow from one node
to another. You will encounter this style of diagram frequently over the course of this book.

The response time can vary significantly from one request to the next, even if you keep making the
same request over and over again. Many factors can add random delays: for example, a context switch
to a background process, the loss of a network packet and TCP retransmission, a garbage collection
pause, a page fault forcing a read from disk, mechanical vibrations in the server rack
[[17](/en/ch2#Gunawi2018_ch2)],
or many other causes. We will discuss this topic in more detail in [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing).

Queueing delays often account for a large part of the variability in response times. As a server
can only process a small number of things in parallel (limited, for example, by its number of CPU
cores), it only takes a small number of slow requests to hold up the processing of subsequent
requests, an effect known as *head-of-line blocking*. Even if those subsequent requests have fast
service times, the client will see a slow overall response time due to the time waiting for the
prior request to complete. The queueing delay is not part of the service time, and for this reason
it is important to measure response times on the client side.

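Head-of-line blocking is easy to reproduce in a toy model. The sketch below assumes a single worker that handles requests strictly in arrival order and ignores network latency; the function name and the example timings are invented for illustration.

```python
def fifo_response_times(arrivals, service_times):
    """Response time of each request on one worker that serves requests in order."""
    server_free_at = 0.0
    response_times = []
    for arrival, service in zip(arrivals, service_times):
        start = max(arrival, server_free_at)   # wait if the worker is still busy
        server_free_at = start + service
        response_times.append(server_free_at - arrival)  # queueing delay + service time
    return response_times


# One slow request (1 s) followed by four fast ones (10 ms each), arriving 50 ms apart.
arrivals = [0.00, 0.05, 0.10, 0.15, 0.20]
service_times = [1.00, 0.01, 0.01, 0.01, 0.01]
for service, response in zip(service_times, fifo_response_times(arrivals, service_times)):
    print(f"service time {service * 1000:6.0f} ms   response time {response * 1000:6.0f} ms")
```

Although the four fast requests each need only 10 ms of service time, they see response times of roughly 960, 920, 880, and 840 ms, because they queue behind the slow request. A server-side measurement of service time alone would miss this entirely.
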
## Average, Median, and Percentiles

Because the response time varies from one request to the next, we need to think of it not as a
single number, but as a *distribution* of values that you can measure. In [Figure 2-5](/en/ch2#fig_lognormal), each
gray bar represents a request to a service, and its height shows how long that request took. Most
requests are reasonably fast, but there are occasional *outliers* that take much longer.
Variation in network delay is also known as *jitter*.

###### Figure 2-5. Illustrating mean and percentiles: response times for a sample of 100 requests to a service.

It’s common to report the *average* response time of a service (technically, the *arithmetic mean*:
that is, sum all the response times, and divide by the number of requests). The mean response time
is useful for estimating throughput limits [[18](/en/ch2#Brooker2017)].
However, the mean is not a very good metric if you want to know your “typical” response time,
because it doesn’t tell you how many users actually experienced that delay.

Usually it is better to use *percentiles*. If you take your list of response times and sort it from
fastest to slowest, then the *median* is the halfway point: for example, if your median response
time is 200 ms, that means half your requests return in less than 200 ms, and half your
requests take longer than that. This makes the median a good metric if you want to know how long
users typically have to wait. The median is also known as the *50th percentile*, and sometimes
abbreviated as *p50*.

In order to figure out how bad your outliers are, you can look at higher percentiles: the *95th*,
*99th*, and *99.9th* percentiles are common (abbreviated *p95*, *p99*, and *p999*). They are the
response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular
threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of
100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is
illustrated in [Figure 2-5](/en/ch2#fig_lognormal).

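The difference between the mean and the percentiles is easiest to see on a small sample. The sketch below uses the nearest-rank definition of a percentile, which is only one of several definitions in common use, and the sample data is invented: 95 fast requests plus five outliers.

```python
import math
import statistics


def percentile(sorted_times, p):
    """Nearest-rank percentile: smallest value with at least p% of samples at or below it."""
    if not sorted_times:
        raise ValueError("no samples")
    rank = max(1, math.ceil(p * len(sorted_times) / 100))
    return sorted_times[rank - 1]


# Hypothetical sample: 100 response times in milliseconds, mostly fast with a few outliers.
response_times_ms = [20 + i for i in range(95)] + [400, 650, 900, 1500, 3000]
sorted_times = sorted(response_times_ms)

print("mean:", statistics.mean(sorted_times))
print("p50: ", percentile(sorted_times, 50))
print("p95: ", percentile(sorted_times, 95))
print("p99: ", percentile(sorted_times, 99))
```

For this sample the mean is 128.15 ms, nearly twice the median of 69 ms, because five outliers pull it upward. The p99 of 1,500 ms shows how slow the outliers are, which neither the mean nor the median reveals.
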
High percentiles of response times, also known as *tail latencies*, are important because they
directly affect users’ experience of the service. For example, Amazon describes response time
requirements for internal services in terms of the 99.9th percentile, even though it only affects 1
in 1,000 requests. This is because the customers with the slowest requests are often those who have
the most data on their accounts because they have made many purchases; that is, they’re the most
valuable customers
[[19](/en/ch2#DeCandia2007_ch1)].
It’s important to keep those customers happy by ensuring the website is fast for them.

On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed
too expensive, with too little benefit for Amazon’s purposes. Reducing response times at very
high percentiles is difficult because they are easily affected by random events outside of your
control, and the benefits are diminishing.

# The user impact of response times

It seems intuitively obvious that a fast service is better for users than a slow service
[[20](/en/ch2#Whitenton2020)].
However, it is surprisingly difficult to get hold of reliable data to quantify the effect that
latency has on user behavior.

Some often-cited statistics are unreliable. In 2006 Google reported that a slowdown in search
results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue
[[21](/en/ch2#Linden2006)].
However, another Google study from 2009 reported that a 400 ms increase in latency resulted in
only 0.6% fewer searches per day
[[22](/en/ch2#Brutlag2009)],
and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3%
[[23](/en/ch2#Schurman2009)].
Newer data from these companies appears not to be publicly available.

A more recent Akamai study
[[24](/en/ch2#Akamai2017)]
claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites
by up to 7%; however, on closer inspection, the same study reveals that very *fast* page load times
are also correlated with lower conversion rates! This seemingly paradoxical result is explained by
the fact that the pages that load fastest are often those that have no useful content (e.g., 404
error pages). However, since the study makes no effort to separate the effects of page content from
the effects of load time, its results are probably not meaningful.

A study by Yahoo
[[25](/en/ch2#Bai2017)]
compares click-through rates on fast-loading versus slow-loading search results, controlling for
quality of search results. It finds 20–30% more clicks on fast searches when the difference between
fast and slow responses is 1.25 seconds or more.

## Use of Response Time Metrics

High percentiles are especially important in backend services that are called multiple times as
part of serving a single end-user request. Even if you make the calls in parallel, the end-user
request still needs to wait for the slowest of the parallel calls to complete. It takes just one
slow call to make the entire end-user request slow, as illustrated in [Figure 2-6](/en/ch2#fig_tail_amplification).
Even if only a small percentage of backend calls are slow, the chance of getting a slow call
increases if an end-user request requires multiple backend calls, and so a higher proportion of
end-user requests end up being slow (an effect known as *tail latency amplification*
[[26](/en/ch2#Dean2013_ch2)]).

###### Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.

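The size of this effect can be estimated with a short calculation. The sketch below assumes that backend calls are slow independently of one another with the same probability, which is a simplification (in practice slow calls are often correlated); the function name and the 1% figure are chosen for illustration.

```python
def chance_of_slow_request(p_slow_call: float, backend_calls: int) -> float:
    """Probability that at least one backend call is slow (calls assumed independent)."""
    return 1.0 - (1.0 - p_slow_call) ** backend_calls


# Suppose 1% of backend calls are slow, i.e. slower than the backend's 99th percentile.
for n in (1, 10, 100):
    share = chance_of_slow_request(0.01, n)
    print(f"{n:>3} backend calls -> {share:.1%} of end-user requests hit a slow call")
```

With a single backend call, 1% of end-user requests are slow; with 10 calls it is about 9.6%, and with 100 calls about 63%. Under these assumptions, the backend's 99th percentile becomes the typical experience for a request that fans out widely.
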
Percentiles are often used in *service level objectives* (SLOs) and *service level agreements*
(SLAs) as ways of defining the expected performance and availability of a service
[[27](/en/ch2#Hidalgo2020)].
For example, an SLO may set a target for a service to have a median response time of less than
200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests
result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not
met (for example, customers may be entitled to a refund). That is the basic idea, at least; in
practice, defining good availability metrics for SLOs and SLAs is not straightforward
[[28](/en/ch2#Mogul2019),
[29](/en/ch2#Hauer2020)].

# Computing percentiles

If you want to add response time percentiles to the monitoring dashboards for your services, you
need to efficiently calculate them on an ongoing basis. For example, you may want to keep a rolling
window of response times of requests in the last 10 minutes. Every minute, you calculate the median
and various percentiles over the values in that window and plot those metrics on a graph.

The simplest implementation is to keep a list of response times for all requests within the time
window and to sort that list every minute. If that is too inefficient for you, there are algorithms
that can calculate a good approximation of percentiles at minimal CPU and memory cost.
Open source percentile estimation libraries include HdrHistogram,
t-digest [[30](/en/ch2#Dunning2021),
[31](/en/ch2#Kohn2021)],
OpenHistogram [[32](/en/ch2#Hartmann2020)], and DDSketch
[[33](/en/ch2#Masson2019)].

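The "simplest implementation" might look like the following sketch: keep timestamped samples in a queue, discard those older than the window, and sort what remains whenever percentiles are requested. The class name, the injectable `clock` parameter (used here so the behavior can be tested without waiting), and the nearest-rank percentile definition are all assumptions made for the example.

```python
import math
import time
from collections import deque


class RollingPercentiles:
    """Keeps (timestamp, response_time) samples from the last window_s seconds."""

    def __init__(self, window_s: float = 600.0, clock=time.monotonic):
        self.window_s = window_s
        self.clock = clock
        self.samples = deque()   # oldest sample on the left

    def record(self, response_time: float) -> None:
        self.samples.append((self.clock(), response_time))

    def _evict_old(self) -> None:
        cutoff = self.clock() - self.window_s
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def percentiles(self, ps=(50, 95, 99)) -> dict:
        """Sort the current window and read off nearest-rank percentiles."""
        self._evict_old()
        values = sorted(rt for _, rt in self.samples)
        if not values:
            return {}
        return {p: values[max(1, math.ceil(p * len(values) / 100)) - 1] for p in ps}
```

Memory use and sorting cost grow with the number of requests in the window, which is why high-traffic services tend to prefer the approximate, fixed-size data structures offered by libraries such as those listed above.
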
Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from
several machines, is mathematically meaningless; the right way of aggregating response time data
is to add the histograms [[34](/en/ch2#Schwartz2015)].

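A small example shows how far off an averaged percentile can be. The sketch below uses simple fixed-width histogram buckets and invented data for two machines with very different traffic; the function names and the 10 ms bucket width are assumptions for illustration, and production libraries use more sophisticated bucketing.

```python
import math
from collections import Counter


def histogram(response_times_ms, bucket_ms=10):
    """Count samples per bucket; each bucket is identified by its upper bound."""
    return Counter(math.ceil(t / bucket_ms) * bucket_ms for t in response_times_ms)


def percentile_from_histogram(hist, p):
    """Upper bound of the bucket that contains the p-th percentile (nearest rank)."""
    total = sum(hist.values())
    rank = max(1, math.ceil(p * total / 100))
    seen = 0
    for upper_bound in sorted(hist):
        seen += hist[upper_bound]
        if seen >= rank:
            return upper_bound
    raise ValueError("empty histogram")


# Hypothetical data: machine A serves 900 fast requests, machine B serves 100 slow ones.
machine_a = histogram([10] * 900)
machine_b = histogram([500] * 100)

averaged_p95 = (percentile_from_histogram(machine_a, 95) +
                percentile_from_histogram(machine_b, 95)) / 2
combined_p95 = percentile_from_histogram(machine_a + machine_b, 95)  # add the histograms

print("average of per-machine p95:", averaged_p95)
print("p95 of the added histograms:", combined_p95)
```

Averaging the two per-machine values gives 255 ms, a response time that no request actually experienced. Adding the histograms gives 500 ms, which matches the true 95th percentile of the 1,000 requests taken together.
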
# Reliability and Fault Tolerance

Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For
software, typical expectations include:

* The application performs the function that the user expected.
* It can tolerate the user making mistakes or using the software in unexpected ways.
* Its performance is good enough for the required use case, under the expected load and data volume.
* The system prevents any unauthorized access and abuse.

If all those things together mean “working correctly,” then we can understand *reliability* as
meaning, roughly, “continuing to work correctly, even when things go wrong.” To be more precise
about things going wrong, we will distinguish between *faults* and *failures*
[[35](/en/ch2#Heimerdinger1992),
[36](/en/ch2#Gaertner1999),
[37](/en/ch2#Avizienis2004)]:

Fault
: A fault is when a particular *part* of a system stops working correctly: for example, if a
  single hard drive malfunctions, or a single machine crashes, or an external service (that the
  system depends on) has an outage.

Failure
: A failure is when the system *as a whole* stops providing the required service to the user; in
  other words, when it does not meet the service level objective (SLO).

The distinction between fault and failure can be confusing because they are the same thing, just at
different levels. For example, if a hard drive stops working, we say that the hard drive has failed:
if the system consists only of that one hard drive, it has stopped providing the required service.
However, if the system you’re talking about contains many hard drives, then the failure of a single
hard drive is only a fault from the point of view of the bigger system, and the bigger system might
be able to tolerate that fault by having a copy of the data on another hard drive.

## Fault Tolerance

We call a system *fault-tolerant* if it continues providing the required service to the user in
spite of certain faults occurring. If a system cannot tolerate a certain part becoming faulty, we
call that part a *single point of failure* (SPOF), because a fault in that part escalates to cause
the failure of the whole system.

For example, in the social network case study, a fault that might happen is that during the fan-out
process, a machine involved in updating the materialized timelines crashes or becomes unavailable.
To make this process fault-tolerant, we would need to ensure that another machine can take over this
task without missing any posts that should have been delivered, and without duplicating any posts.
(This idea is known as *exactly-once semantics*, and we will examine it in detail in [Link to Come].)

Fault tolerance is always limited to a certain number of certain types of faults. For example, a
system might be able to tolerate a maximum of two hard drives failing at the same time, or a maximum
of one out of three nodes crashing. It would not make sense to tolerate any number of faults: if all
nodes crash, there is nothing that can be done. If the entire planet Earth (and all servers on it)
were swallowed by a black hole, tolerance of that fault would require web hosting in space. Good luck
getting that budget item approved.

Counter-intuitively, in such fault-tolerant systems, it can make sense to *increase* the rate of
faults by triggering them deliberately, for example by randomly killing individual processes
without warning. This is called *fault injection*. Many critical bugs are actually due to poor error
handling [[38](/en/ch2#Yuan2014)]; by deliberately inducing faults, you ensure
that the fault-tolerance machinery is continually exercised and tested, which can increase your
confidence that faults will be handled correctly when they occur naturally. *Chaos engineering* is
a discipline that aims to improve confidence in fault-tolerance mechanisms through experiments such
as deliberately injecting faults
[[39](/en/ch2#Rosenthal2020)].

Although we generally prefer tolerating faults over preventing faults, there are cases where
prevention is better than cure (e.g., because no cure exists). This is the case with security
matters, for example: if an attacker has compromised a system and gained access to sensitive data,
that event cannot be undone. However, this book mostly deals with the kinds of faults that can be
cured, as described in the following sections.

## Hardware and Software Faults

When we think of causes of system failure, hardware faults quickly come to mind:

* Approximately 2–5% of magnetic hard drives fail per year
  [[40](/en/ch2#Pinheiro2007),
  [41](/en/ch2#Schroeder2007)];
  in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
  Recent data suggests that disks are getting more reliable, but failure rates remain significant
  [[42](/en/ch2#Klein2021)].
* Approximately 0.5–1% of solid state drives (SSDs) fail per year
  [[43](/en/ch2#Narayanan2016)].
  Small numbers of bit errors are corrected automatically
  [[44](/en/ch2#Alibaba2019_ch2)],
  but uncorrectable errors occur approximately once per year per drive, even in drives that are
  fairly new (i.e., that have experienced little wear); this error rate is higher than that of
  magnetic hard drives
  [[45](/en/ch2#Schroeder2016_ch2),
  [46](/en/ch2#Alter2019)].
* Other hardware components such as power supplies, RAID controllers, and memory modules also fail,
  although less frequently than hard drives
  [[47](/en/ch2#Ford2010),
  [48](/en/ch2#Vishwanath2010)].
* Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result,
  likely due to manufacturing defects
  [[49](/en/ch2#Hochschild2021),
  [50](/en/ch2#Dixit2021),
  [51](/en/ch2#Behrens2015)].
  In some cases, an erroneous computation leads to a crash, but in other cases it leads to a program
  simply returning the wrong result.
* Data in RAM can also be corrupted, either due to random events such as cosmic rays, or due to
  permanent physical defects. Even when memory with error-correcting codes (ECC) is used, more than
  1% of machines encounter an uncorrectable error in a given year, which typically leads to a crash
  of the machine and the affected memory module needing to be replaced
  [[52](/en/ch2#Schroeder2009)].
  Moreover, certain pathological memory access patterns can flip bits with high probability
  [[53](/en/ch2#Kim2014)].
* An entire datacenter might become unavailable (for example, due to power outage or network
  misconfiguration) or even be permanently destroyed (for example by fire, flood, or earthquake
  [[54](/en/ch2#Bray2021)]).
  A solar storm, which induces large electrical currents in long-distance wires when the sun ejects
  a large mass of charged particles, could damage power grids and undersea network cables
  [[55](/en/ch2#AbduJyothi2021)].
  Although such large-scale failures are rare, their impact can be catastrophic if a service cannot
  tolerate the loss of a datacenter
  [[56](/en/ch2#Cockcroft2019)].

These events are rare enough that you often don’t need to worry about them when working on a small
system, as long as you can easily replace hardware that becomes faulty. However, in a large-scale
system, hardware faults happen often enough that they become part of the normal system operation.

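
The disk-failure estimate in the list above can be checked with a little arithmetic. The following is a minimal sketch, assuming failures are spread evenly over the year; the function name is illustrative and not taken from any library.

```python
def expected_failures_per_day(num_disks: int, annual_failure_rate: float) -> float:
    """Expected disk failures per day, assuming failures are spread evenly over a year."""
    return num_disks * annual_failure_rate / 365


# The 2-5% annual failure rate range quoted for magnetic hard drives.
for afr in (0.02, 0.05):
    per_day = expected_failures_per_day(10_000, afr)
    print(f"AFR {afr:.0%}: {per_day:.2f} failures/day")
```

For 10,000 disks this gives roughly 0.5 to 1.4 failures per day, which is consistent with "about one per day."
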
### Tolerating hardware faults through redundancy

Our first response to unreliable hardware is usually to add redundancy to the individual hardware
components in order to reduce the failure rate of the system. Disks may be set up in a RAID
configuration (spreading data across multiple disks in the same machine so that a failed disk does
not cause data loss), servers may have dual power supplies and hot-swappable CPUs, and datacenters
may have batteries and diesel generators for backup power. Such redundancy can often keep a machine
running uninterrupted for years.

Redundancy is most effective when component faults are independent, that is, the occurrence of one
fault does not change how likely it is that another fault will occur. However, experience has shown
that there are often significant correlations between component failures
[[41](/en/ch2#Schroeder2007),
[57](/en/ch2#Han2021),
[58](/en/ch2#Nightingale2011)];
unavailability of an entire server rack or an entire datacenter still happens more often than we
would like.

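
To see why independence matters so much, consider a toy calculation. It assumes each of `n` redundant components fails within some period with probability `p`, and it models correlation crudely as a shared-cause event with probability `c` that takes out every component at once. Both the model and the numbers are illustrative assumptions, not measurements from the cited studies.

```python
def p_all_fail_independent(p: float, n: int) -> float:
    """Probability that all n redundant components fail, if faults are independent."""
    return p ** n


def p_all_fail_with_common_cause(p: float, n: int, c: float) -> float:
    """Toy model: with probability c a shared cause (such as a rack power loss)
    takes out every component; otherwise components fail independently."""
    return c + (1 - c) * p ** n


p, n = 0.01, 2  # hypothetical: 1% failure probability, two mirrored disks
print(p_all_fail_independent(p, n))
print(p_all_fail_with_common_cause(p, n, c=0.001))
```

With these made-up numbers the independent case is about 0.0001, while the shared cause pushes the combined probability to about 0.0011, so even a rare correlated fault dominates the risk.
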
Hardware redundancy increases the uptime of a single machine; however, as discussed in
[“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), there are advantages to using a distributed system, such as being
able to tolerate a complete outage of one datacenter.
For this reason, cloud systems tend to focus less on the reliability of individual machines, and
instead aim to make services highly available by tolerating faulty nodes at the software level.
Cloud providers use *availability zones* to identify which resources are physically co-located;
resources in the same place are more likely to fail at the same time than geographically separated
resources.

The fault-tolerance techniques we discuss in this book are designed to tolerate the loss of entire
machines, racks, or availability zones. They generally work by allowing a machine in one datacenter
to take over when a machine in another datacenter fails or becomes unreachable. We will discuss such
techniques for fault tolerance in [Chapter 6](/en/ch6#ch_replication), [Chapter 10](/en/ch10#ch_consistency), and at various other
points in this book.

Systems that can tolerate the loss of entire machines also have operational advantages: a
single-server system requires planned downtime if you need to reboot the machine (to apply operating
system security patches, for example), whereas a multi-node fault-tolerant system can be patched by
restarting one node at a time, without affecting the service for users. This is called a *rolling
upgrade*, and we will discuss it further in [Chapter 5](/en/ch5#ch_encoding).

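
The idea of a rolling upgrade can be sketched in a few lines. This is only an outline under stated assumptions: `restart` and `is_healthy` are hypothetical hooks supplied by the caller, and a real rollout would also drain traffic, wait for the node to catch up, and poll the health check with a timeout.

```python
from typing import Callable, Iterable


def rolling_upgrade(
    nodes: Iterable[str],
    restart: Callable[[str], None],
    is_healthy: Callable[[str], bool],
) -> list[str]:
    """Restart nodes one at a time and return the nodes that were upgraded.

    Raises RuntimeError at the first node that fails its health check, so the
    nodes that have not been touched yet keep serving traffic.
    """
    upgraded = []
    for node in nodes:
        restart(node)             # hypothetical hook: patch and reboot one node
        if not is_healthy(node):  # hypothetical hook: probe the node afterwards
            raise RuntimeError(f"{node} unhealthy after restart; halting rollout")
        upgraded.append(node)
    return upgraded
```

Halting at the first unhealthy node limits the damage of a bad patch to a single machine, which is the property that makes this approach safe for a fault-tolerant system.
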
### Software faults

Although hardware failures can be weakly correlated, they are still mostly independent: for
example, if one disk fails, it’s likely that other disks in the same machine will be fine for
a while longer. On the other hand, software faults are often very highly correlated, because it is
common for many nodes to run the same software and thus have the same bugs
[[59](/en/ch2#Gunawi2014),
[60](/en/ch2#Kreps2012_ch1)].
Such faults are harder to anticipate, and they tend to cause many more system failures than
uncorrelated hardware faults [[47](/en/ch2#Ford2010)]. For example:

* A software bug that causes every node to fail at the same time in particular circumstances. For
  example, on June 30, 2012, a leap second caused many Java applications to hang simultaneously due
  to a bug in the Linux kernel, bringing down many Internet services
  [[61](/en/ch2#Minar2012_ch1)].
  Due to a firmware bug, all SSDs of certain models suddenly fail after precisely 32,768 hours of
  operation (less than 4 years), rendering the data on them unrecoverable
  [[62](/en/ch2#HPE2019_ch2)].
* A runaway process that uses up some shared, limited resource, such as CPU time, memory, disk
  space, network bandwidth, or threads
  [[63](/en/ch2#Hochstein2020)].
  For example, a process that consumes too much memory while processing a large request may be
  killed by the operating system. A bug in a client library could cause a much higher request
  volume than anticipated [[64](/en/ch2#McCaffrey2015)].
* A service that the system depends on slows down, becomes unresponsive, or starts returning
  corrupted responses.
* An interaction between different systems results in emergent behavior that does not occur when
  each system is tested in isolation [[65](/en/ch2#Tang2023)].
* Cascading failures, where a problem in one component causes another component to become overloaded
  and slow down, which in turn brings down another component
  [[66](/en/ch2#Ulrich2016),
  [67](/en/ch2#Fassbender2022)].

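
The figure of 32,768 hours in the SSD example is exactly 2^15, which is what one would expect if a power-on-hours counter were held in a signed 16-bit integer. The text above does not state the mechanism, so treat that reading as an assumption. The sketch below only checks the arithmetic and shows how such a counter would wrap around.

```python
import ctypes

HOURS_PER_YEAR = 24 * 365

failure_hours = 32_768
print(failure_hours == 2 ** 15)                   # True
print(round(failure_hours / HOURS_PER_YEAR, 2))   # 3.74 (years)


def as_int16(value: int) -> int:
    """Interpret an integer as a signed 16-bit value, wrapping on overflow."""
    return ctypes.c_int16(value).value


print(as_int16(32_767))   # 32767: the largest value a signed 16-bit counter can hold
print(as_int16(32_768))   # -32768: one more hour wraps to a negative number
```
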
The bugs that cause these kinds of software faults often lie dormant for a long time until they are
triggered by an unusual set of circumstances. In those circumstances, it is revealed that the
software is making some kind of assumption about its environment, and while that assumption is
usually true, it eventually stops being true for some reason
[[68](/en/ch2#Cook2000),
[69](/en/ch2#Woods2017)].

There is no quick solution to the problem of systematic faults in software. Lots of small things can
help: carefully thinking about assumptions and interactions in the system; thorough testing; process
isolation; allowing processes to crash and restart; avoiding feedback loops such as retry storms
(see [“When an overloaded system won’t recover”](/en/ch2#sidebar_metastable)); measuring, monitoring, and analyzing system behavior in production.

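
One common way to avoid retry storms is for clients to retry with capped exponential backoff and random jitter, so that many clients do not all retry at the same instant. The following is a minimal sketch of that technique; the function name, parameters, and default values are illustrative assumptions rather than the API of any particular library.

```python
import random
import time


def call_with_backoff(operation, max_attempts=5, base_delay=0.1, max_delay=5.0,
                      sleep=time.sleep, rng=random.random):
    """Call operation(), retrying failures with capped exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up rather than retrying forever
            cap = min(max_delay, base_delay * 2 ** attempt)
            sleep(rng() * cap)  # random delay in [0, cap) spreads retries out
```

The `sleep` and `rng` parameters exist only so that the behavior can be tested without real waiting. Bounding the number of attempts matters as much as the delay: unbounded retries are exactly the kind of feedback loop that keeps an overloaded system from recovering.
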
## Humans and Reliability

Humans design and build software systems, and the operators who keep the systems running are also
human. Unlike machines, humans don’t just follow rules; their strength is being creative and
adaptive in getting their job done. However, this characteristic also leads to unpredictability, and
sometimes mistakes that can lead to failures, despite best intentions. For example, one study of
large internet services found that configuration changes by operators were the leading cause of
outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages
[[70](/en/ch2#Oppenheimer2003)].

It is tempting to label such problems as “human error” and to wish that they could be solved by
better controlling human behavior through tighter procedures and compliance with rules. However,
blaming people for mistakes is counterproductive. What we call “human error” is not really the cause
of an incident, but rather a symptom of a problem with the sociotechnical system in which people are
trying their best to do their jobs [[71](/en/ch2#Dekker2017)].
Complex systems often have emergent behavior, in which unexpected interactions between components
may also lead to failures [[72](/en/ch2#Dekker2011)].

Various technical measures can help minimize the impact of human mistakes, including thorough
testing (both hand-written tests and *property testing* on lots of random inputs)
[[38](/en/ch2#Yuan2014)], rollback mechanisms for quickly
reverting configuration changes, gradual roll-outs of new code, detailed and clear monitoring,
observability tools for diagnosing production issues (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)),
and well-designed interfaces that encourage “the right thing” and discourage “the wrong thing”.

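
As an illustration of property testing, the sketch below checks a round-trip property on many random inputs using only the standard library. The run-length encoder is a toy function invented for this example; dedicated libraries (for example, Hypothesis in Python) additionally shrink failing inputs to minimal cases.

```python
import random


def run_length_encode(s: str) -> list[tuple[str, int]]:
    """Toy function under test: collapse runs of equal characters."""
    out: list[tuple[str, int]] = []
    for ch in s:
        if out and out[-1][0] == ch:
            out[-1] = (ch, out[-1][1] + 1)
        else:
            out.append((ch, 1))
    return out


def run_length_decode(pairs: list[tuple[str, int]]) -> str:
    """Inverse of run_length_encode."""
    return "".join(ch * count for ch, count in pairs)


def check_round_trip(trials: int = 1000, seed: int = 0) -> None:
    """Property: decoding an encoded string returns the original string."""
    rng = random.Random(seed)  # fixed seed so any failure is reproducible
    for _ in range(trials):
        s = "".join(rng.choice("ab") for _ in range(rng.randint(0, 20)))
        assert run_length_decode(run_length_encode(s)) == s, s


check_round_trip()
```

A small alphabet is used deliberately: it makes long runs of repeated characters likely, which is where a run-length encoder is most likely to go wrong.
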
However, these things require an investment of time and money, and in the pragmatic reality of
everyday business, organizations often prioritize revenue-generating activities over measures that
increase their resilience against mistakes. If there is a choice between more features and more
testing, many organizations understandably choose features. Given this choice, when a preventable
mistake inevitably occurs, it does not make sense to blame the person who made the mistake: the
problem is the organization’s priorities.

Increasingly, organizations are adopting a culture of *blameless postmortems*: after an incident,
the people involved are encouraged to share full details about what happened, without fear of
punishment, since this allows others in the organization to learn how to prevent similar problems in
the future [[73](/en/ch2#Allspaw2012)].
This process may uncover a need to change business priorities, a need to invest in areas that have
been neglected, a need to change the incentives for the people involved, or some other systemic
issue that needs to be brought to the management’s attention.

As a general principle, when investigating an incident, you should be suspicious of simplistic
answers. “Bob should have been more careful when deploying that change” is not productive, but
neither is “We must rewrite the backend in Haskell.” Instead, management should take the opportunity
to learn the details of how the sociotechnical system works from the point of view of the people who
work with it every day, and take steps to improve it based on this feedback
[[71](/en/ch2#Dekker2017)].

# How Important Is Reliability?

Reliability is not just for nuclear power stations and air traffic control; more mundane applications
are also expected to work reliably. Bugs in business applications cause lost productivity (and legal
risks if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in
terms of lost revenue and damage to reputation.

In many applications, a temporary outage of a few minutes or even a few hours is tolerable
[[74](/en/ch2#Sabo2023)],
but permanent data loss or corruption would be catastrophic. Consider a parent who stores all their
pictures and videos of their children in your photo application
[[75](/en/ch2#Jurewitz2013)]. How would they
feel if that database was suddenly corrupted? Would they know how to restore it from a backup?

As another example of how unreliable software can harm people, consider the Post Office Horizon
scandal. Between 1999 and 2019, hundreds of people managing Post Office branches in Britain were
convicted of theft or fraud because the accounting software showed a shortfall in their accounts.
Eventually it became clear that many of these shortfalls were due to bugs in the software, and many
convictions have since been overturned [[76](/en/ch2#Halper2025)].
What led to this, probably the largest miscarriage of justice in British history, is the fact that
English law assumes that computers operate correctly (and hence, evidence produced by computers is
reliable) unless there is evidence to the contrary
[[77](/en/ch2#Bohm2022)].
Software engineers may laugh at the idea that software could ever be bug-free, but this is little
solace to the people who were wrongfully imprisoned, declared bankrupt, or even committed suicide as
a result of a wrongful conviction due to an unreliable computer system.

There are situations in which we may choose to sacrifice reliability in order to reduce development
cost (e.g., when developing a prototype product for an unproven market), but we should be very
conscious of when we are cutting corners and keep in mind the potential consequences.

# Scalability

Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in
the future. One common reason for degradation is increased load: perhaps the system has grown from
10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is
processing much larger volumes of data than it did before.

*Scalability* is the term we use to describe a system’s ability to cope with increased load.
Sometimes, when discussing scalability, people make comments along the lines of, “You’re not Google
or Amazon. Stop worrying about scale and just use a relational database.” Whether this maxim applies
to you depends on the type of application you are building.

If you are building a new product that currently only has a small number of users, perhaps at a
startup, the overriding engineering goal is usually to keep the system as simple and flexible as
possible, so that you can easily modify and adapt the features of your product as you learn more
about customers’ needs [[78](/en/ch2#McKinley2015)].
In such an environment, it is counterproductive to worry about hypothetical scale that might be
needed in the future: in the best case, investments in scalability are wasted effort and premature
optimization; in the worst case, they lock you into an inflexible design and make it harder to
evolve your application.

The reason is that scalability is not a one-dimensional label: it is meaningless to say “X is
scalable” or “Y doesn’t scale.” Rather, discussing scalability means considering questions like:

* “If the system grows in a particular way, what are our options for coping with the growth?”
* “How can we add computing resources to handle the additional load?”
* “Based on current growth projections, when will we hit the limits of our current architecture?”

If you succeed in making your application popular, and therefore handling a growing amount of load,
you will learn where your performance bottlenecks lie, and therefore you will know along which
dimensions you need to scale. At that point it’s time to start worrying about techniques for
scalability.

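
The question of when growth will exhaust the current architecture can be estimated with a simple projection. This is a rough sketch that assumes steady compound monthly growth and a single known capacity limit; both are simplifications, and the function name and example numbers are hypothetical.

```python
def months_until_limit(current_load: float, capacity: float, monthly_growth: float) -> int:
    """Whole months until load first exceeds capacity, assuming compound monthly growth."""
    if current_load <= 0 or monthly_growth <= 0:
        raise ValueError("current_load and monthly_growth must be positive")
    months = 0
    load = current_load
    while load <= capacity:
        load *= 1 + monthly_growth  # apply one month of growth
        months += 1
    return months


# Hypothetical: 2,000 requests/sec today, the architecture copes with 10,000, growing 10% per month.
print(months_until_limit(2_000, 10_000, 0.10))  # 17
```

Real growth is rarely this smooth, so the result is best read as an order-of-magnitude planning aid rather than a forecast.
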
## Describing Load

First, we need to succinctly describe the current load on the system; only then can we discuss
growth questions (what happens if our load doubles?). Often this will be a measure of throughput:
for example, the number of requests per second to a service, how many gigabytes of new data arrive
per day, or the number of shopping cart checkouts per hour. Sometimes you care about the peak of
some variable quantity, such as the number of simultaneously online users in
[“Case Study: Social Network Home Timelines”](/en/ch2#sec_introduction_twitter).

Often there are other statistical characteristics of the load that also affect the access patterns
and hence the scalability requirements. For example, you may need to know the ratio of reads to
writes in a database, the hit rate on a cache, or the number of data items per user (for example,
the number of followers in the social network case study). Perhaps the average case is what matters
for you, or perhaps your bottleneck is dominated by a small number of extreme cases. It all depends
on the details of your particular application.

Once you have described the load on your system, you can investigate what happens when the load
increases. You can look at it in two ways:

* When you increase the load in a certain way and keep the system resources (CPUs, memory, network
  bandwidth, etc.) unchanged, how is the performance of your system affected?
* When you increase the load in a certain way, how much do you need to increase the resources if you
  want to keep performance unchanged?

Usually our goal is to keep the performance of the system within the requirements of the SLA
(see [“Use of Response Time Metrics”](/en/ch2#sec_introduction_slo_sla)) while also minimizing the cost of running the system. The greater
the required computing resources, the higher the cost. It might be that some types of hardware are
more cost-effective than others, and these factors may change over time as new types of hardware
become available.

If you can double the resources in order to handle twice the load, while keeping performance the
same, we say that you have *linear scalability*, and this is considered a good thing. Occasionally
it is possible to handle twice the load with less than double the resources, due to economies of
scale or a better distribution of peak load
[[79](/en/ch2#Warfield2023_ch2),
[80](/en/ch2#Brooker2023multitenancy)].
Much more likely is that the cost grows faster than linearly, and there may be many reasons for the
inefficiency. For example, if you have a lot of data, then processing a single write request may
involve more work than if you have a small amount of data, even if the size of the request is the
same.

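
One way to make the notion of linear scalability concrete is to compare how much the load grew with how much the resources had to grow to keep performance unchanged. The ratio below is an illustrative metric chosen for this sketch, not a standard definition, and the example numbers are hypothetical.

```python
def scaling_efficiency(load_factor: float, resource_factor: float) -> float:
    """Ratio of load growth to resource growth at unchanged performance.

    1.0 means linear scalability, below 1.0 means cost grows faster than load,
    and above 1.0 means the system benefits from economies of scale.
    """
    return load_factor / resource_factor


print(scaling_efficiency(2.0, 2.0))  # twice the load on twice the resources: linear
print(scaling_efficiency(2.0, 2.5))  # twice the load needed 2.5x the resources: worse than linear
print(scaling_efficiency(2.0, 1.6))  # twice the load on 1.6x the resources: better than linear
```
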
## Shared-Memory, Shared-Disk, and Shared-Nothing Architecture

The simplest way of increasing the hardware resources of a service is to move it to a more powerful
machine. Individual CPU cores are no longer getting significantly faster, but you can buy a machine
(or rent a cloud instance) with more CPU cores, more RAM, and more disk space. This approach is
called *vertical scaling* or *scaling up*.

You can get parallelism on a single machine by using multiple processes or threads. All the threads
belonging to the same process can access the same RAM, and hence this approach is also called a
*shared-memory architecture*. The problem with a shared-memory approach is that the cost grows
faster than linearly: a high-end machine with twice the hardware resources typically costs
significantly more than twice as much. And due to bottlenecks, a machine twice the size can often
handle less than twice the load.

Another approach is the *shared-disk architecture*, which uses several machines with independent
CPUs and RAM, but which stores data on an array of disks that is shared between the machines, which
are connected via a fast network: *Network-Attached Storage* (NAS) or *Storage Area Network* (SAN).
This architecture has traditionally been used for on-premises data warehousing workloads, but
contention and the overhead of locking limit the scalability of the shared-disk approach
[[81](/en/ch2#Stopford2009)].

By contrast, the *shared-nothing architecture*
[[82](/en/ch2#Stonebraker1986)]
(also called *horizontal scaling* or *scaling out*) has gained a lot of popularity. In this
approach, we use a distributed system with multiple nodes, each of which has its own CPUs, RAM, and
disks. Any coordination between nodes is done at the software level, via a conventional network.

The advantages of shared-nothing are that it has the potential to scale linearly, it can use
whatever hardware offers the best price/performance ratio (especially in the cloud), it can more
easily adjust its hardware resources as load increases or decreases, and it can achieve greater
fault tolerance by distributing the system across multiple data centers and regions. The downsides
are that it requires explicit sharding (see [Chapter 7](/en/ch7#ch_sharding)), and it incurs all the complexity of
distributed systems ([Chapter 9](/en/ch9#ch_distributed)).

Some cloud-native database systems use separate services for storage and transaction execution (see
[“Separation of storage and compute”](/en/ch1#sec_introduction_storage_compute)), with multiple compute nodes sharing access to the same
storage service. This model has some similarity to a shared-disk architecture, but it avoids the
scalability problems of older systems: instead of providing a filesystem (NAS) or block device (SAN)
abstraction, the storage service offers a specialized API that is designed for the specific needs of
the database [[83](/en/ch2#Antonopoulos2019_ch2)].

## Principles for Scalability

The architecture of systems that operate at large scale is usually highly specific to the
application: there is no such thing as a generic, one-size-fits-all scalable architecture
(informally known as *magic scaling sauce*). For example, a system that is designed to handle
100,000 requests per second, each 1 kB in size, looks very different from a system that is
designed for 3 requests per minute, each 2 GB in size, even though the two systems have the same
data throughput (100 MB/sec).

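
The claim that both workloads have the same data throughput is easy to verify. The sketch below assumes decimal units (1 kB = 1,000 bytes, 1 GB = 1,000,000,000 bytes); the helper function is illustrative.

```python
KB = 1_000          # assumption: decimal kilobyte
MB = 1_000_000
GB = 1_000_000_000  # assumption: decimal gigabyte


def throughput_bytes_per_sec(requests: int, per_seconds: int, request_size: int) -> float:
    """Data throughput for a workload of `requests` every `per_seconds` seconds."""
    return requests * request_size / per_seconds


small_requests = throughput_bytes_per_sec(100_000, 1, 1 * KB)  # 100,000 requests/sec, 1 kB each
large_requests = throughput_bytes_per_sec(3, 60, 2 * GB)       # 3 requests/min, 2 GB each

print(small_requests / MB, large_requests / MB)  # 100.0 100.0
```

Both workloads move 100 MB/sec, yet the first is dominated by per-request overhead and the second by the handling of very large payloads, which is why their architectures differ.
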
Moreover, an architecture that is appropriate for one level of load is unlikely to cope with 10
times that load. If you are working on a fast-growing service, it is therefore likely that you will
need to rethink your architecture on every order of magnitude load increase. As the needs of the
application are likely to evolve, it is usually not worth planning future scaling needs more than
one order of magnitude in advance.

A good general principle for scalability is to break a system down into smaller components that can
operate largely independently from each other. This is the underlying principle behind microservices
(see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing
([Link to Come]), and shared-nothing architectures. However, the challenge is in knowing where to
draw the line between things that should be together, and things that should be apart. Design
guidelines for microservices can be found in other books
[[84](/en/ch2#Newman2021_ch2)],
and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding).

Another good principle is not to make things more complicated than necessary. If a single-machine
database will do the job, it’s probably preferable to a complicated distributed setup. Auto-scaling
systems (which automatically add or remove resources in response to demand) are cool, but if your
load is fairly predictable, a manually scaled system may have fewer operational surprises (see
[“Operations: Automatic or Manual Rebalancing”](/en/ch7#sec_sharding_operations)). A system with five services is simpler than one with fifty. Good
architectures usually involve a pragmatic mixture of approaches.

# Maintainability

Software does not wear out or suffer material fatigue, so it does not break in the same ways as
mechanical objects do. But the requirements for an application frequently change, the environment
that the software runs in changes (such as its dependencies and the underlying platform), and it has
bugs that need fixing.

It is widely recognized that the majority of the cost of software is not in its initial development,
but in its ongoing maintenance: fixing bugs, keeping its systems operational, investigating failures,
adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding
new features [[85](/en/ch2#Ensmenger2016),
[86](/en/ch2#Glass2002)].

However, maintenance is also difficult. If a system has been successfully running for a long time,
it may well use outdated technologies that not many engineers understand today (such as mainframes
and COBOL code); institutional knowledge of how and why a system was designed in a certain way may
have been lost as people have left the organization; it might be necessary to fix other people’s
mistakes. Moreover, the computer system is often intertwined with the human organization that it
supports, which means that maintenance of such *legacy* systems is as much a people problem as a
technical one [[87](/en/ch2#Bellotti2021)].

Every system we create today will one day become a legacy system if it is valuable enough to survive
for a long time. In order to minimize the pain for future generations who need to maintain our
software, we should design it with maintenance concerns in mind. Although we cannot always predict
which decisions might create maintenance headaches in the future, in this book we will pay attention
to several principles that are widely applicable:

Operability
: Make it easy for the organization to keep the system running smoothly.

Simplicity
: Make it easy for new engineers to understand the system, by implementing it using well-understood,
  consistent patterns and structures, and avoiding unnecessary complexity.

Evolvability
: Make it easy for engineers to make changes to the system in the future, adapting it and extending
  it for unanticipated use cases as requirements change.

## Operability: Making Life Easy for Operations

We previously discussed the role of operations in [“Operations in the Cloud Era”](/en/ch1#sec_introduction_operations), and we saw that
human processes are at least as important for reliable operations as software tools. In fact, it has
been suggested that “good operations can often work around the limitations of bad (or incomplete)
software, but good software cannot run reliably with bad operations”
[[60](/en/ch2#Kreps2012_ch1)].

In large-scale systems consisting of many thousands of machines, manual maintenance would be
unreasonably expensive, and automation is essential. However, automation can be a double-edged sword:
there will always be edge cases (such as rare failure scenarios) that require manual intervention
from the operations team. Since the cases that cannot be handled automatically are the most complex
issues, greater automation requires a *more* skilled operations team that can resolve those issues
[[88](/en/ch2#Bainbridge1983)].

Moreover, if an automated system goes wrong, it is often harder to troubleshoot than a system that
relies on an operator to perform some actions manually. For that reason, it is not the case that
more automation is always better for operability. However, some amount of automation is important,
and the sweet spot will depend on the specifics of your particular application and organization.

Good operability means making routine tasks easy, allowing the operations team to focus their efforts
on high-value activities. Data systems can do various things to make routine tasks easy, including
[[89](/en/ch2#Hamilton2007)]:

* Allowing monitoring tools to check the system’s key metrics, and supporting observability tools
  (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)) to give insights into the system’s runtime behavior.
  A variety of commercial and open source tools can help here
  [[90](/en/ch2#Horovits2021)].
* Avoiding dependency on individual machines (allowing machines to be taken down for maintenance
  while the system as a whole continues running uninterrupted)
* Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
* Providing good default behavior, but also giving administrators the freedom to override defaults when needed
* Self-healing where appropriate, but also giving administrators manual control over the system state when needed
* Exhibiting predictable behavior, minimizing surprises

## Simplicity: Managing Complexity

Small software projects can have delightfully simple and expressive code, but as projects get
larger, they often become very complex and difficult to understand. This complexity slows down
everyone who needs to work on the system, further increasing the cost of maintenance. A software
project mired in complexity is sometimes described as a *big ball of mud*
[[91](/en/ch2#Foote1997)].

When complexity makes maintenance hard, budgets and schedules are often overrun. In complex
software, there is also a greater risk of introducing bugs when making a change: when the system is
harder for developers to understand and reason about, hidden assumptions, unintended consequences,
and unexpected interactions are more easily overlooked
[[69](/en/ch2#Woods2017)].
Conversely, reducing complexity greatly improves the maintainability of software, and thus
simplicity should be a key goal for the systems we build.

Simple systems are easier to understand, and therefore we should try to solve a given problem in the
simplest way possible. Unfortunately, this is easier said than done. Whether something is simple or
not is often a subjective matter of taste, as there is no objective standard of simplicity
[[92](/en/ch2#Brooker2022)].
For example, one system may hide a complex implementation behind a simple interface, whereas another
may have a simple implementation that exposes more internal detail to its users. Which one is
simpler?

One attempt at reasoning about complexity has been to break it down into two categories, *essential*
and *accidental* complexity [[93](/en/ch2#Brooks1995)].
The idea is that essential complexity is inherent in the problem domain of the application, while
accidental complexity arises only because of limitations of our tooling. Unfortunately, this
distinction is also flawed, because boundaries between the essential and the accidental shift as our
tooling evolves [[94](/en/ch2#Luu2020)].

One of the best tools we have for managing complexity is *abstraction*. A good abstraction can hide
a great deal of implementation detail behind a clean, simple-to-understand façade. A good
abstraction can also be used for a wide range of different applications. Not only is this reuse more
efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality
software, as quality improvements in the abstracted component benefit all applications that use it.

For example, high-level programming languages are abstractions that hide machine code, CPU registers,
and syscalls. SQL is an abstraction that hides complex on-disk and in-memory data structures,
concurrent requests from other clients, and inconsistencies after crashes. Of course, when
programming in a high-level language, we are still using machine code; we are just not using it
*directly*, because the programming language abstraction saves us from having to think about it.

Abstractions for application code, which aim to reduce its complexity, can be created using
methodologies such as *design patterns*
[[95](/en/ch2#Gamma1994)]
and *domain-driven design* (DDD) [[96](/en/ch2#Evans2003)].
This book is not about such application-specific abstractions, but rather about general-purpose
abstractions on top of which you can build your applications, such as database transactions,
indexes, and event logs. If you want to use techniques such as DDD, you can implement them on top of
the foundations described in this book.

## Evolvability: Making Change Easy

It’s extremely unlikely that your system’s requirements will remain unchanged forever. They are much more
likely to be in constant flux: you learn new facts, previously unanticipated use cases emerge,
business priorities change, users request new features, new platforms replace old platforms, legal
or regulatory requirements change, growth of the system forces architectural changes, etc.

In terms of organizational processes, *Agile* working patterns provide a framework for adapting to
change. The Agile community has also developed technical tools and processes that are helpful when
developing software in a frequently changing environment, such as test-driven development (TDD) and
refactoring. In this book, we search for ways of increasing agility at the level of a system
consisting of several different applications or services with different characteristics.

The ease with which you can modify a data system, and adapt it to changing requirements, is closely
linked to its simplicity and its abstractions: loosely-coupled, simple systems are usually easier to
modify than tightly-coupled, complex ones. Since this is such an important idea, we will use a
different word to refer to agility on a data system level: *evolvability*
[[97](/en/ch2#Breivold2008)].

One major factor that makes change difficult in large systems is when some action is irreversible,
and therefore that action needs to be taken very carefully
[[98](/en/ch2#Zaninotto2002)].
For example, say you are migrating from one database to another: if you cannot switch back to the
old system in case of problems with the new one, the stakes are much higher than if you can easily go
back. Minimizing irreversibility improves flexibility.

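
To make the migration example concrete, here is a minimal sketch of one way to keep a database
switch reversible: write to both systems during the transition and choose the read source with a
flag. This is an illustrative assumption, not a technique prescribed by this chapter. The classes
`DictStore` and `MigratingStore` are hypothetical stand-ins for real database clients, and the
sketch ignores partial write failures and concurrency, which a real migration would have to handle.

```python
class DictStore:
    """Hypothetical stand-in for a database client, backed by a dict."""

    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)


class MigratingStore:
    """Writes go to both stores; reads come from whichever one is primary."""

    def __init__(self, old, new):
        self.old = old
        self.new = new
        self.read_from_new = False  # flip to cut over, flip back to roll back

    def put(self, key, value):
        # Keeping both stores up to date is what makes the switch reversible.
        self.old.put(key, value)
        self.new.put(key, value)

    def get(self, key):
        primary = self.new if self.read_from_new else self.old
        return primary.get(key)


store = MigratingStore(DictStore(), DictStore())
store.put("user:1", "Ada")

store.read_from_new = True      # cut over to the new database
store.put("user:2", "Grace")    # still written to both

store.read_from_new = False     # problems found: switch back to the old one
rolled_back = store.get("user:2")
print(rolled_back)
```

Because the old store kept receiving writes after the cutover, switching back loses no data, which
is what lowers the stakes of the migration.
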
# Summary

In this chapter we examined several examples of nonfunctional requirements: performance,
reliability, scalability, and maintainability. Through these topics we have also encountered
principles and terminology that we will need throughout the rest of the book. We started with a case
study of how one might implement home timelines in a social network, which illustrated some of the
challenges that arise at scale.

We discussed how to measure performance (e.g., using response time percentiles), the load on a
system (e.g., using throughput metrics), and how they are used in SLAs. Scalability is a closely
related concept: that is, ensuring performance stays the same when the load grows. We saw some
general principles for scalability, such as breaking a task down into smaller parts that can operate
independently, and we will dive into deep technical detail on scalability techniques in the
following chapters.

To achieve reliability, you can use fault tolerance techniques, which allow a system to continue
providing its service even if some component (e.g., a disk, a machine, or another service) is
faulty. We saw examples of hardware faults that can occur, and distinguished them from software
faults, which can be harder to deal with because they are often strongly correlated. Another aspect
of achieving reliability is to build resilience against humans making mistakes, and we saw blameless
postmortems as a technique for learning from incidents.

Finally, we examined several facets of maintainability, including supporting the work of operations
teams, managing complexity, and making it easy to evolve an application’s functionality over time.
There are no easy answers on how to achieve these things, but one thing that can help is to build
applications using well-understood building blocks that provide useful abstractions. The rest of
this book will cover a selection of building blocks that have proved to be valuable in practice.

##### Footnotes

##### References

[[1](/en/ch2#Cvet2016-marker)] Mike Cvet.
[How We Learned to Stop Worrying and Love
Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016.

[[2](/en/ch2#Krikorian2012_ch2-marker)] Raffi Krikorian.
[Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/).
At *QCon San Francisco*, November 2012.
Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK)

[[3](/en/ch2#Twitter2023-marker)] Twitter.
[Twitter’s
Recommendation Algorithm](https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm). *blog.twitter.com*, March 2023.
Archived at [perma.cc/L5GT-229T](https://perma.cc/L5GT-229T)

[[4](/en/ch2#Krikorian2013-marker)] Raffi Krikorian.
[New
Tweets per second record, and how!](https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how) *blog.twitter.com*, August 2013.
Archived at [perma.cc/6JZN-XJYN](https://perma.cc/6JZN-XJYN)

[[5](/en/ch2#Volpert2025-marker)] Jaz Volpert.
[When Imperfect Systems are Good, Actually:
Bluesky’s Lossy Timelines](https://jazco.dev/2025/02/19/imperfection/). *jazco.dev*, February 2025.
Archived at [perma.cc/2PVE-L2MX](https://perma.cc/2PVE-L2MX)

[[6](/en/ch2#Axon2010_ch2-marker)] Samuel Axon.
[3% of Twitter’s Servers
Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010.
Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX)

[[7](/en/ch2#Bronson2021-marker)] Nathan Bronson, Abutalib Aghayev, Aleksey
Charapko, and Timothy Zhu.
[Metastable
Failures in Distributed Systems](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf).
At *Workshop on Hot Topics in Operating Systems* (HotOS), May 2021.
[doi:10.1145/3458336.3465286](https://doi.org/10.1145/3458336.3465286)

[[8](/en/ch2#Brooker2021-marker)] Marc Brooker.
[Metastability and Distributed
Systems](https://brooker.co.za/blog/2021/05/24/metastable.html). *brooker.co.za*, May 2021.
Archived at [perma.cc/7FGJ-7XRK](https://perma.cc/7FGJ-7XRK)

[[9](/en/ch2#Brooker2015-marker)] Marc Brooker.
[Exponential
Backoff And Jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/). *aws.amazon.com*, March 2015.
Archived at [perma.cc/R6MS-AZKH](https://perma.cc/R6MS-AZKH)

[[10](/en/ch2#Brooker2022backoff-marker)] Marc Brooker.
[What is Backoff For?](https://brooker.co.za/blog/2022/08/11/backoff.html)
*brooker.co.za*, August 2022.
Archived at [perma.cc/PW9N-55Q5](https://perma.cc/PW9N-55Q5)

[[11](/en/ch2#Nygard2018-marker)] Michael T. Nygard.
[*Release It!*](https://learning.oreilly.com/library/view/release-it-2nd/9781680504552/),
2nd Edition. Pragmatic Bookshelf, January 2018. ISBN: 9781680502398

[[12](/en/ch2#Chen2022-marker)] Frank Chen.
[Slowing Down to Speed Up – Circuit Breakers
for Slack’s CI/CD](https://slack.engineering/circuit-breakers/). *slack.engineering*, August 2022.
Archived at [perma.cc/5FGS-ZPH3](https://perma.cc/5FGS-ZPH3)

[[13](/en/ch2#Brooker2022retries-marker)] Marc Brooker.
[Fixing retries with token buckets and
circuit breakers](https://brooker.co.za/blog/2022/02/28/retries.html). *brooker.co.za*, February 2022.
Archived at [perma.cc/MD6N-GW26](https://perma.cc/MD6N-GW26)

[[14](/en/ch2#YanacekLoadShedding-marker)] David Yanacek.
[Using load
shedding to avoid overload](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/). Amazon Builders’ Library, *aws.amazon.com*.
Archived at [perma.cc/9SAW-68MP](https://perma.cc/9SAW-68MP)

[[15](/en/ch2#Sackman2016_ch2-marker)] Matthew Sackman.
[Pushing Back](https://wellquite.org/posts/lshift/pushing_back/).
*wellquite.org*, May 2016.
Archived at [perma.cc/3KCZ-RUFY](https://perma.cc/3KCZ-RUFY)

[[16](/en/ch2#Kopytkov2018-marker)] Dmitry Kopytkov and Patrick Lee.
[Meet Bandaid,
the Dropbox service proxy](https://dropbox.tech/infrastructure/meet-bandaid-the-dropbox-service-proxy). *dropbox.tech*, March 2018.
Archived at [perma.cc/KUU6-YG4S](https://perma.cc/KUU6-YG4S)

[[17](/en/ch2#Gunawi2018_ch2-marker)] Haryadi S. Gunawi, Riza O. Suminto, Russell Sears,
Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti,
Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert
Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li.
[Fail-Slow at
Scale: Evidence of Hardware Performance Faults in Large Production Systems](https://www.usenix.org/system/files/conference/fast18/fast18-gunawi.pdf).
At *16th USENIX Conference on File and Storage Technologies*, February 2018.

[[18](/en/ch2#Brooker2017-marker)] Marc Brooker.
[Is the Mean Really Useless?](https://brooker.co.za/blog/2017/12/28/mean.html)
*brooker.co.za*, December 2017.
Archived at [perma.cc/U5AE-CVEM](https://perma.cc/U5AE-CVEM)

[[19](/en/ch2#DeCandia2007_ch1-marker)] Giuseppe DeCandia, Deniz Hastorun, Madan
Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter
Vosshall, and Werner Vogels.
[Dynamo:
Amazon’s Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating
Systems Principles* (SOSP), October 2007.
[doi:10.1145/1294261.1294281](https://doi.org/10.1145/1294261.1294281)

[[20](/en/ch2#Whitenton2020-marker)] Kathryn Whitenton.
[The Need for Speed, 23 Years Later](https://www.nngroup.com/articles/the-need-for-speed/).
*nngroup.com*, May 2020.
Archived at [perma.cc/C4ER-LZYA](https://perma.cc/C4ER-LZYA)

[[21](/en/ch2#Linden2006-marker)] Greg Linden.
[Marissa Mayer at Web 2.0](https://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html).
*glinden.blogspot.com*, November 2006.
Archived at [perma.cc/V7EA-3VXB](https://perma.cc/V7EA-3VXB)

[[22](/en/ch2#Brutlag2009-marker)] Jake Brutlag.
[Speed Matters for Google
Web Search](https://services.google.com/fh/files/blogs/google_delayexp.pdf). *services.google.com*, June 2009.
Archived at [perma.cc/BK7R-X7M2](https://perma.cc/BK7R-X7M2)

[[23](/en/ch2#Schurman2009-marker)] Eric Schurman and Jake Brutlag.
[Performance Related Changes and their User Impact](https://www.youtube.com/watch?v=bQSE51-gr2s).
Talk at *Velocity 2009*.

[[24](/en/ch2#Akamai2017-marker)] Akamai Technologies, Inc.
[The
State of Online Retail Performance](https://web.archive.org/web/20210729180749/https%3A//www.akamai.com/us/en/multimedia/documents/report/akamai-state-of-online-retail-performance-spring-2017.pdf). *akamai.com*, April 2017.
Archived at [perma.cc/UEK2-HYCS](https://perma.cc/UEK2-HYCS)

[[25](/en/ch2#Bai2017-marker)] Xiao Bai, Ioannis Arapakis, B. Barla Cambazoglu, and Ana Freire.
[Understanding and Leveraging the Impact of
Response Latency on User Behaviour in Web Search](https://iarapakis.github.io/papers/TOIS17.pdf). *ACM Transactions on Information Systems*,
volume 36, issue 2, article 21, April 2018.
[doi:10.1145/3106372](https://doi.org/10.1145/3106372)

[[26](/en/ch2#Dean2013_ch2-marker)] Jeffrey Dean and Luiz André Barroso.
[The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/).
*Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013.
[doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794)

[[27](/en/ch2#Hidalgo2020-marker)] Alex Hidalgo.
[*Implementing
Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets*](https://www.oreilly.com/library/view/implementing-service-level/9781492076803/). O’Reilly
Media, September 2020. ISBN: 1492076813

[[28](/en/ch2#Mogul2019-marker)] Jeffrey C. Mogul and John Wilkes.
[Nines are Not Enough: Meaningful Metrics for
Clouds](https://research.google/pubs/pub48033/). At *17th Workshop on Hot Topics in Operating Systems* (HotOS), May 2019.
[doi:10.1145/3317550.3321432](https://doi.org/10.1145/3317550.3321432)

[[29](/en/ch2#Hauer2020-marker)] Tamás Hauer, Philipp Hoffmann, John Lunney, Dan Ardelean, and Amer Diwan.
[Meaningful Availability](https://www.usenix.org/conference/nsdi20/presentation/hauer).
At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020.

[[30](/en/ch2#Dunning2021-marker)] Ted Dunning.
[The t-digest:
Efficient estimates of distributions](https://www.sciencedirect.com/science/article/pii/S2665963820300403). *Software Impacts*, volume 7, article 100049, February 2021.
[doi:10.1016/j.simpa.2020.100049](https://doi.org/10.1016/j.simpa.2020.100049)

[[31](/en/ch2#Kohn2021-marker)] David Kohn.
[How
percentile approximation works (and why it’s more useful than averages)](https://www.timescale.com/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/). *timescale.com*,
September 2021. Archived at [perma.cc/3PDP-NR8B](https://perma.cc/3PDP-NR8B)

[[32](/en/ch2#Hartmann2020-marker)] Heinrich Hartmann and Theo Schlossnagle.
[Circllhist — A Log-Linear Histogram Data Structure
for IT Infrastructure Monitoring](https://arxiv.org/pdf/2001.06561.pdf). *arxiv.org*, January 2020.

[[33](/en/ch2#Masson2019-marker)] Charles Masson, Jee E. Rim, and Homin K. Lee.
[DDSketch: A Fast and Fully-Mergeable
Quantile Sketch with Relative-Error Guarantees](https://www.vldb.org/pvldb/vol12/p2195-masson.pdf). *Proceedings of the VLDB Endowment*,
volume 12, issue 12, pages 2195–2205, August 2019.
[doi:10.14778/3352063.3352135](https://doi.org/10.14778/3352063.3352135)

[[34](/en/ch2#Schwartz2015-marker)] Baron Schwartz.
[Why
Percentiles Don’t Work the Way You Think](https://orangematter.solarwinds.com/2016/11/18/why-percentiles-dont-work-the-way-you-think/). *solarwinds.com*, November 2016.
Archived at [perma.cc/469T-6UGB](https://perma.cc/469T-6UGB)

[[35](/en/ch2#Heimerdinger1992-marker)] Walter L. Heimerdinger and Charles B. Weinstock.
[A Conceptual
Framework for System Fault Tolerance](https://resources.sei.cmu.edu/asset_files/TechnicalReport/1992_005_001_16112.pdf). Technical Report CMU/SEI-92-TR-033, Software Engineering
Institute, Carnegie Mellon University, October 1992.
Archived at [perma.cc/GD2V-DMJW](https://perma.cc/GD2V-DMJW)

[[36](/en/ch2#Gaertner1999-marker)] Felix C. Gärtner.
[Fundamentals of fault-tolerant
distributed computing in asynchronous environments](https://dl.acm.org/doi/pdf/10.1145/311531.311532). *ACM Computing Surveys*, volume 31,
issue 1, pages 1–26, March 1999.
[doi:10.1145/311531.311532](https://doi.org/10.1145/311531.311532)

[[37](/en/ch2#Avizienis2004-marker)] Algirdas Avižienis, Jean-Claude Laprie, Brian Randell,
and Carl Landwehr.
[Basic Concepts and Taxonomy of Dependable and Secure
Computing](https://hdl.handle.net/1903/6459). *IEEE Transactions on Dependable and Secure Computing*, volume 1, issue 1,
January 2004. [doi:10.1109/TDSC.2004.2](https://doi.org/10.1109/TDSC.2004.2)

[[38](/en/ch2#Yuan2014-marker)] Ding Yuan, Yu Luo, Xin Zhuang, Guilherme
Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm.
[Simple
Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed
Data-Intensive Systems](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf). At *11th USENIX Symposium on Operating Systems Design
and Implementation* (OSDI), October 2014.

[[39](/en/ch2#Rosenthal2020-marker)] Casey Rosenthal and Nora Jones.
[*Chaos
Engineering*](https://learning.oreilly.com/library/view/chaos-engineering/9781492043850/). O’Reilly Media, April 2020. ISBN: 9781492043867

[[40](/en/ch2#Pinheiro2007-marker)] Eduardo Pinheiro, Wolf-Dietrich Weber, and
Luiz André Barroso.
[Failure
Trends in a Large Disk Drive Population](https://www.usenix.org/legacy/events/fast07/tech/full_papers/pinheiro/pinheiro_old.pdf). At *5th USENIX Conference on File and Storage
Technologies* (FAST), February 2007.

[[41](/en/ch2#Schroeder2007-marker)] Bianca Schroeder and Garth A. Gibson.
[Disk failures
in the real world: What does an MTTF of 1,000,000 hours mean to you?](https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder.pdf) At *5th USENIX
Conference on File and Storage Technologies* (FAST), February 2007.

[[42](/en/ch2#Klein2021-marker)] Andy Klein.
[Backblaze Drive Stats
for Q2 2021](https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2021/). *backblaze.com*, August 2021.
Archived at [perma.cc/2943-UD5E](https://perma.cc/2943-UD5E)

[[43](/en/ch2#Narayanan2016-marker)] Iyswarya Narayanan, Di Wang, Myeongjae Jeon,
Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, and
Kushagra Vaid.
[SSD
Failures in Datacenters: What? When? and Why?](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/a7-narayanan.pdf) At *9th ACM International on Systems and
Storage Conference* (SYSTOR), June 2016.
[doi:10.1145/2928275.2928278](https://doi.org/10.1145/2928275.2928278)

[[44](/en/ch2#Alibaba2019_ch2-marker)] Alibaba Cloud Storage Team.
[Storage System Design Analysis: Factors
Affecting NVMe SSD Performance (1)](https://www.alibabacloud.com/blog/594375). *alibabacloud.com*, January 2019. Archived at
[archive.org](https://web.archive.org/web/20230522005034/https%3A//www.alibabacloud.com/blog/594375)

[[45](/en/ch2#Schroeder2016_ch2-marker)] Bianca Schroeder, Raghav Lagisetty, and Arif Merchant.
[Flash
Reliability in Production: The Expected and the Unexpected](https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf). At *14th USENIX Conference on
File and Storage Technologies* (FAST), February 2016.

[[46](/en/ch2#Alter2019-marker)] Jacob Alter, Ji Xue, Alma Dimnaku, and Evgenia Smirni.
[SSD failures in the field: symptoms,
causes, and prediction models](https://dl.acm.org/doi/pdf/10.1145/3295500.3356172). At *International Conference for High Performance Computing,
Networking, Storage and Analysis* (SC), November 2019.
[doi:10.1145/3295500.3356172](https://doi.org/10.1145/3295500.3356172)

[[47](/en/ch2#Ford2010-marker)] Daniel Ford, François Labelle, Florentina I.
Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan.
[Availability in
Globally Distributed Storage Systems](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Ford.pdf). At *9th USENIX Symposium on Operating Systems Design
and Implementation* (OSDI), October 2010.

[[48](/en/ch2#Vishwanath2010-marker)] Kashi Venkatesh Vishwanath and Nachiappan Nagappan.
[Characterizing
Cloud Computing Hardware Reliability](https://www.microsoft.com/en-us/research/wp-content/uploads/2010/06/socc088-vishwanath.pdf). At *1st ACM Symposium on Cloud Computing* (SoCC),
June 2010. [doi:10.1145/1807128.1807161](https://doi.org/10.1145/1807128.1807161)

[[49](/en/ch2#Hochschild2021-marker)] Peter H. Hochschild, Paul Turner, Jeffrey C.
Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, and Amin Vahdat.
[Cores that
don’t count](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf). At *Workshop on Hot Topics in Operating Systems* (HotOS), June 2021.
[doi:10.1145/3458336.3465297](https://doi.org/10.1145/3458336.3465297)

[[50](/en/ch2#Dixit2021-marker)] Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon,
Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, and Sriram Sankar.
[Silent Data Corruptions at Scale](https://arxiv.org/abs/2102.11245).
*arXiv:2102.11245*, February 2021.

[[51](/en/ch2#Behrens2015-marker)] Diogo Behrens, Marco Serafini, Sergei Arnautov, Flavio P.
Junqueira, and Christof Fetzer.
[Scalable
Error Isolation for Distributed Systems](https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/behrens). At *12th USENIX Symposium on Networked Systems
Design and Implementation* (NSDI), May 2015.

[[52](/en/ch2#Schroeder2009-marker)] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber.
[DRAM
Errors in the Wild: A Large-Scale Field Study](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf). At *11th International Joint Conference on
Measurement and Modeling of Computer Systems* (SIGMETRICS), June 2009.
[doi:10.1145/1555349.1555372](https://doi.org/10.1145/1555349.1555372)

[[53](/en/ch2#Kim2014-marker)] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin,
Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu.
[Flipping Bits in Memory Without
Accessing Them: An Experimental Study of DRAM Disturbance Errors](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf). At *41st Annual
International Symposium on Computer Architecture* (ISCA), June 2014.
[doi:10.5555/2665671.2665726](https://doi.org/10.5555/2665671.2665726)

[[54](/en/ch2#Bray2021-marker)] Tim Bray.
[Worst Case](https://www.tbray.org/ongoing/When/202x/2021/10/08/The-WOrst-Case).
*tbray.org*, October 2021.
Archived at [perma.cc/4QQM-RTHN](https://perma.cc/4QQM-RTHN)

[[55](/en/ch2#AbduJyothi2021-marker)] Sangeetha Abdu Jyothi.
[Solar Superstorms: Planning for
an Internet Apocalypse](https://ics.uci.edu/~sabdujyo/papers/sigcomm21-cme.pdf). At *ACM SIGCOMM Conference*, August 2021.
[doi:10.1145/3452296.3472916](https://doi.org/10.1145/3452296.3472916)

[[56](/en/ch2#Cockcroft2019-marker)] Adrian Cockcroft.
[Failure
Modes and Continuous Resilience](https://adrianco.medium.com/failure-modes-and-continuous-resilience-6553078caad5). *adrianco.medium.com*, November 2019.
Archived at [perma.cc/7SYS-BVJP](https://perma.cc/7SYS-BVJP)

[[57](/en/ch2#Han2021-marker)] Shujie Han, Patrick P. C. Lee, Fan Xu, Yi Liu, Cheng He, and Jiongzhou Liu.
[An In-Depth Study of Correlated
Failures in Production SSD-Based Data Centers](https://www.usenix.org/conference/fast21/presentation/han). At *19th USENIX Conference on File and Storage
Technologies* (FAST), February 2021.

[[58](/en/ch2#Nightingale2011-marker)] Edmund B. Nightingale, John R. Douceur, and Vince Orgovan.
[Cycles, Cells and
Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs](https://eurosys2011.cs.uni-salzburg.at/pdf/eurosys2011-nightingale.pdf).
At *6th European Conference on Computer Systems* (EuroSys), April 2011.
[doi:10.1145/1966445.1966477](https://doi.org/10.1145/1966445.1966477)

[[59](/en/ch2#Gunawi2014-marker)] Haryadi S. Gunawi, Mingzhe Hao, Tanakorn
Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar,
Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria.
[What Bugs Live in the Cloud?](https://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf)
At *5th ACM Symposium on Cloud Computing* (SoCC), November 2014.
[doi:10.1145/2670979.2670986](https://doi.org/10.1145/2670979.2670986)

[[60](/en/ch2#Kreps2012_ch1-marker)] Jay Kreps.
[Getting
Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012.
Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW)

[[61](/en/ch2#Minar2012_ch1-marker)] Nelson Minar.
[Leap Second Crashes Half
the Internet](https://www.somebits.com/weblog/tech/bad/leap-second-2012.html). *somebits.com*, July 2012.
Archived at [perma.cc/2WB8-D6EU](https://perma.cc/2WB8-D6EU)

[[62](/en/ch2#HPE2019_ch2-marker)] Hewlett Packard Enterprise.
[Support
Alerts – Customer Bulletin a00092491en\_us](https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us). *support.hpe.com*, November 2019.
Archived at [perma.cc/S5F6-7ZAC](https://perma.cc/S5F6-7ZAC)

[[63](/en/ch2#Hochstein2020-marker)] Lorin Hochstein.
[awesome limits](https://github.com/lorin/awesome-limits). *github.com*,
November 2020. Archived at [perma.cc/3R5M-E5Q4](https://perma.cc/3R5M-E5Q4)

[[64](/en/ch2#McCaffrey2015-marker)] Caitie McCaffrey.
[Clients
Are Jerks: AKA How Halo 4 DoSed the Services at Launch & How We Survived](https://www.caitiem.com/2015/06/23/clients-are-jerks-aka-how-halo-4-dosed-the-services-at-launch-how-we-survived/). *caitiem.com*,
June 2015. Archived at [perma.cc/MXX4-W373](https://perma.cc/MXX4-W373)

[[65](/en/ch2#Tang2023-marker)] Lilia Tang,
Chaitanya Bhandari, Yongle Zhang, Anna Karanika, Shuyang Ji, Indranil Gupta, and Tianyin Xu.
[Fail through the Cracks: Cross-System
Interaction Failures in Modern Cloud Systems](https://tianyin.github.io/pub/csi-failures.pdf). At *18th European Conference on Computer
Systems* (EuroSys), May 2023.
[doi:10.1145/3552326.3587448](https://doi.org/10.1145/3552326.3587448)

[[66](/en/ch2#Ulrich2016-marker)] Mike Ulrich.
[Addressing Cascading Failures](https://sre.google/sre-book/addressing-cascading-failures/).
In Betsy Beyer, Jennifer Petoff, Chris Jones, and Niall Richard Murphy (ed).
[*Site
Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/).
O’Reilly Media, 2016. ISBN: 9781491929124

[[67](/en/ch2#Fassbender2022-marker)] Harri Faßbender.
[Cascading
failures in large-scale distributed systems](https://blog.mi.hdm-stuttgart.de/index.php/2022/03/03/cascading-failures-in-large-scale-distributed-systems/). *blog.mi.hdm-stuttgart.de*, March 2022.
Archived at [perma.cc/K7VY-YJRX](https://perma.cc/K7VY-YJRX)

[[68](/en/ch2#Cook2000-marker)] Richard I. Cook.
[How Complex
Systems Fail](https://www.adaptivecapacitylabs.com/HowComplexSystemsFail.pdf). Cognitive Technologies Laboratory, April 2000.
Archived at [perma.cc/RDS6-2YVA](https://perma.cc/RDS6-2YVA)

[[69](/en/ch2#Woods2017-marker)] David D. Woods.
[STELLA: Report from the SNAFUcatchers Workshop on Coping
With Complexity](https://snafucatchers.github.io/). *snafucatchers.github.io*, March 2017. Archived at
[archive.org](https://web.archive.org/web/20230306130131/https%3A//snafucatchers.github.io/)

[[70](/en/ch2#Oppenheimer2003-marker)] David Oppenheimer, Archana Ganapathi, and David A. Patterson.
[Why
Do Internet Services Fail, and What Can Be Done About It?](https://static.usenix.org/events/usits03/tech/full_papers/oppenheimer/oppenheimer.pdf) At *4th USENIX Symposium on
Internet Technologies and Systems* (USITS), March 2003.

[[71](/en/ch2#Dekker2017-marker)] Sidney Dekker.
[*The Field
Guide to Understanding ‘Human Error’, 3rd Edition*](https://learning.oreilly.com/library/view/the-field-guide/9781317031833/). CRC Press, November 2017.
ISBN: 9781472439055

[[72](/en/ch2#Dekker2011-marker)] Sidney Dekker.
[*Drift
into Failure: From Hunting Broken Components to Understanding Complex Systems*](https://www.taylorfrancis.com/books/mono/10.1201/9781315257396/drift-failure-sidney-dekker).
CRC Press, 2011. ISBN: 9781315257396

[[73](/en/ch2#Allspaw2012-marker)] John Allspaw.
[Blameless PostMortems and a Just
Culture](https://www.etsy.com/codeascraft/blameless-postmortems/). *etsy.com*, May 2012.
Archived at [perma.cc/YMJ7-NTAP](https://perma.cc/YMJ7-NTAP)

[[74](/en/ch2#Sabo2023-marker)] Itzy Sabo.
[Uptime
Guarantees — A Pragmatic Perspective](https://world.hey.com/itzy/uptime-guarantees-a-pragmatic-perspective-736d7ea4). *world.hey.com*, March 2023.
Archived at [perma.cc/F7TU-78JB](https://perma.cc/F7TU-78JB)

[[75](/en/ch2#Jurewitz2013-marker)] Michael Jurewitz.
[The Human Impact of Bugs](http://jury.me/blog/2013/3/14/the-human-impact-of-bugs).
*jury.me*, March 2013.
Archived at [perma.cc/5KQ4-VDYL](https://perma.cc/5KQ4-VDYL)

[[76](/en/ch2#Halper2025-marker)] Mark Halper.
[How
Software Bugs led to ‘One of the Greatest Miscarriages of Justice’ in British History](https://cacm.acm.org/news/how-software-bugs-led-to-one-of-the-greatest-miscarriages-of-justice-in-british-history/).
*Communications of the ACM*, January 2025.
[doi:10.1145/3703779](https://doi.org/10.1145/3703779)

[[77](/en/ch2#Bohm2022-marker)] Nicholas Bohm, James Christie, Peter Bernard Ladkin,
Bev Littlewood, Paul Marshall, Stephen Mason, Martin Newby, Steven J. Murdoch, Harold Thimbleby, and Martyn Thomas.
[The
legal rule that computers are presumed to be operating correctly – unforeseen and unjust
consequences](https://www.benthamsgaze.org/wp-content/uploads/2022/06/briefing-presumption-that-computers-are-reliable.pdf). Briefing note, *benthamsgaze.org*, June 2022.
Archived at [perma.cc/WQ6X-TMW4](https://perma.cc/WQ6X-TMW4)

[[78](/en/ch2#McKinley2015-marker)] Dan McKinley.
|
||
[Choose Boring Technology](https://mcfunley.com/choose-boring-technology).
|
||
*mcfunley.com*, March 2015.
|
||
Archived at [perma.cc/7QW7-J4YP](https://perma.cc/7QW7-J4YP)
|
||
|
||
[[79](/en/ch2#Warfield2023_ch2-marker)] Andy Warfield.
|
||
[Building
|
||
and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023.
|
||
Archived at [perma.cc/7LPK-TP7V](https://perma.cc/7LPK-TP7V)
|
||
|
||
[[80](/en/ch2#Brooker2023multitenancy-marker)] Marc Brooker.
|
||
[Surprising Scalability of
|
||
Multitenancy](https://brooker.co.za/blog/2023/03/23/economics.html). *brooker.co.za*, March 2023.
|
||
Archived at [perma.cc/ZZD9-VV8T](https://perma.cc/ZZD9-VV8T)
|
||
|
||
[[81](/en/ch2#Stopford2009-marker)] Ben Stopford.
|
||
[Shared
|
||
Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/). *benstopford.com*,
|
||
November 2009. Archived at [perma.cc/7BXH-EDUR](https://perma.cc/7BXH-EDUR)
|
||
|
||
[[82](/en/ch2#Stonebraker1986-marker)] Michael Stonebraker.
|
||
[The Case for Shared Nothing](https://dsf.berkeley.edu/papers/hpts85-nothing.pdf).
|
||
*IEEE Database Engineering Bulletin*, volume 9, issue 1, pages 4–9, March 1986.
|
||
|
||
[[83](/en/ch2#Antonopoulos2019_ch2-marker)] Panagiotis Antonopoulos,
Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann,
Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella,
Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade.
[Socrates: The New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data*
(SIGMOD), pages 1743–1756, June 2019.
[doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047)

[[84](/en/ch2#Newman2021_ch2-marker)] Sam Newman.
[*Building Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O’Reilly Media, 2021. ISBN: 9781492034025

[[85](/en/ch2#Ensmenger2016-marker)] Nathan Ensmenger.
[When Good Software Goes Bad: The Surprising Durability of an Ephemeral Technology](https://themaintainers.wpengine.com/wp-content/uploads/2021/04/ensmenger-maintainers-v2.pdf).
At *The Maintainers Conference*, April 2016.
Archived at [perma.cc/ZXT4-HGZB](https://perma.cc/ZXT4-HGZB)

[[86](/en/ch2#Glass2002-marker)] Robert L. Glass.
[*Facts and Fallacies of Software Engineering*](https://learning.oreilly.com/library/view/facts-and-fallacies/0321117425/).
Addison-Wesley Professional, October 2002. ISBN: 9780321117427

[[87](/en/ch2#Bellotti2021-marker)] Marianne Bellotti.
[*Kill It with Fire*](https://learning.oreilly.com/library/view/kill-it-with/9781098128883/). No Starch Press, April 2021. ISBN: 9781718501188

[[88](/en/ch2#Bainbridge1983-marker)] Lisanne Bainbridge.
[Ironies of automation](https://www.adaptivecapacitylabs.com/IroniesOfAutomation-Bainbridge83.pdf). *Automatica*, volume 19, issue 6, pages 775–779, November 1983.
[doi:10.1016/0005-1098(83)90046-8](https://doi.org/10.1016/0005-1098%2883%2990046-8)

[[89](/en/ch2#Hamilton2007-marker)] James Hamilton.
[On Designing and Deploying Internet-Scale Services](https://www.usenix.org/legacy/events/lisa07/tech/full_papers/hamilton/hamilton.pdf). At *21st Large Installation
System Administration Conference* (LISA), November 2007.

[[90](/en/ch2#Horovits2021-marker)] Dotan Horovits.
[Open Source for Better Observability](https://horovits.medium.com/open-source-for-better-observability-8c65b5630561). *horovits.medium.com*, October 2021.
Archived at [perma.cc/R2HD-U2ZT](https://perma.cc/R2HD-U2ZT)

[[91](/en/ch2#Foote1997-marker)] Brian Foote and Joseph Yoder.
[Big Ball of Mud](http://www.laputan.org/pub/foote/mud.pdf).
At *4th Conference on Pattern Languages of Programs* (PLoP), September 1997.
Archived at [perma.cc/4GUP-2PBV](https://perma.cc/4GUP-2PBV)

[[92](/en/ch2#Brooker2022-marker)] Marc Brooker.
[What is a simple system?](https://brooker.co.za/blog/2022/05/03/simplicity.html)
*brooker.co.za*, May 2022.
Archived at [perma.cc/U72T-BFVE](https://perma.cc/U72T-BFVE)

[[93](/en/ch2#Brooks1995-marker)] Frederick P. Brooks.
[No Silver Bullet – Essence and Accident in Software Engineering](https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf). In
[*The Mythical Man-Month*](https://www.oreilly.com/library/view/mythical-man-month-the/0201835959/), Anniversary edition, Addison-Wesley, 1995. ISBN: 9780201835953

[[94](/en/ch2#Luu2020-marker)] Dan Luu.
[Against essential and accidental complexity](https://danluu.com/essential-complexity/).
*danluu.com*, December 2020.
Archived at [perma.cc/H5ES-69KC](https://perma.cc/H5ES-69KC)

[[95](/en/ch2#Gamma1994-marker)] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides.
[*Design Patterns: Elements of Reusable Object-Oriented Software*](https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/). Addison-Wesley Professional, October 1994.
ISBN: 9780201633610

[[96](/en/ch2#Evans2003-marker)] Eric Evans.
[*Domain-Driven Design: Tackling Complexity in the Heart of Software*](https://learning.oreilly.com/library/view/domain-driven-design-tackling/0321125215/). Addison-Wesley Professional, August 2003.
ISBN: 9780321125217

[[97](/en/ch2#Breivold2008-marker)] Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson.
[Analyzing Software Evolvability](https://www.es.mdh.se/pdf_publications/1251.pdf).
At *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008.
[doi:10.1109/COMPSAC.2008.50](https://doi.org/10.1109/COMPSAC.2008.50)

[[98](/en/ch2#Zaninotto2002-marker)] Enrico Zaninotto.
[From X programming to the X organisation](https://martinfowler.com/articles/zaninotto.pdf).
At *XP Conference*, May 2002.
Archived at [perma.cc/R9AR-QCKZ](https://perma.cc/R9AR-QCKZ)
