diff --git a/content/en/_index.md b/content/en/_index.md index 93706ba..c4aced7 100644 --- a/content/en/_index.md +++ b/content/en/_index.md @@ -1,5 +1,5 @@ --- -title: "Designing Data-Intensive Applications" +title: "Designing Data-Intensive Applications 2nd Edition" linkTitle: DDIA cascade: type: docs @@ -13,6 +13,7 @@ breadcrumbs: false > The en-us version only includes **intro**, **summary**, **references** of all chapters to protect the intellectual property of author and publisher. +![](/title.jpg) -------- @@ -34,24 +35,24 @@ breadcrumbs: false ### [Preface](/en/preface) ### [Part I: Foundations of Data Systems](/en/part-i) - - [1. Reliable, Scalable, and Maintainable Applications](/en/ch1) - - [2. Data Models and Query Languages](/en/ch2) - - [3. Storage and Retrieval](/en/ch3) - - [4. Encoding and Evolution](/en/ch4) + - [1. Trade-offs in Data Systems Architecture](/en/ch1) + - [2. Defining Nonfunctional Requirements](/en/ch2) + - [3. Data Models and Query Languages](/en/ch3) + - [4. Storage and Retrieval](/en/ch4) + - [5. Encoding and Evolution](/en/ch5) ### [Part II: Distributed Data](/en/part-ii) - - [5. Replication](/en/ch5) - - [6. Partitioning](/en/ch6) - - [7. Transactions](/en/ch7) - - [8. The Trouble with Distributed Systems](/en/ch8) - - [9. Consistency and Consensus](/en/ch9) + - [6. Replication](/en/ch6) + - [7. Partitioning](/en/ch7) + - [8. Transactions](/en/ch8) + - [9. The Trouble with Distributed Systems](/en/ch9) + - [10. Consistency and Consensus](/en/ch10) ### [Part III: Derived Data](/en/part-iii) - - [10. Batch Processing](/en/ch10) - - [11. Stream Processing](/en/ch11) - - [12. The Future of Data Systems](/en/ch12) + - [11. Batch Processing](/en/ch11) (WIP) + - [12. Stream Processing](/en/ch12) (WIP) + - [13. 
Doing the Right Thing](/en/ch13) (WIP) ### [Glossary](/en/glossary) ### [Colophon](/en/colophon) - diff --git a/content/en/author.md b/content/en/author.md new file mode 100644 index 0000000..baa73b2 --- /dev/null +++ b/content/en/author.md @@ -0,0 +1,16 @@ +--- +title: "About the Authors" +linkTitle: "About the Authors" +weight: 10 +breadcrumbs: false +--- + +**Martin Kleppmann** is a researcher in distributed systems at the University of Cambridge, UK. +Previously he was a software engineer and entrepreneur at internet companies including LinkedIn and Rapportive, where he worked on large-scale data infrastructure. +In the process he learned a few things the hard way, and he hopes this book will save you from repeating the same mistakes. + +Martin is a regular conference speaker, blogger, and open source contributor. He believes that profound technical ideas should be accessible to everyone, and that deeper understanding will help us develop better software. + +**Chris Riccomini** is a software engineer, startup investor, and author with 15+ years of experience at PayPal, LinkedIn, and WePay. +He runs Materialized View Capital, where he invests in infrastructure startups. He is also the cocreator of Apache Samza and SlateDB, +and coauthor of The Missing README: A Guide for the New Software Engineer. \ No newline at end of file diff --git a/content/en/ch1.md b/content/en/ch1.md index b188ad9..a8c0910 100644 --- a/content/en/ch1.md +++ b/content/en/ch1.md @@ -1,93 +1,1378 @@ --- -title: "1. Reliable, Scalable, and Maintainable Applications" -linkTitle: "1. Reliable, Scalable, and Maintainable Applications" +title: "1. Trade-offs in Data Systems Architecture" weight: 101 breadcrumbs: false --- - -![](/img/ch1.png) - -> *The Internet was done so well that most people think of it as a natural resource like the Pacific Ocean, rather than something that was man-made. 
When was the last time a technology with a scale like that was so error-free?* +> *There are no solutions, there are only trade-offs. […] But you try to get the best +> trade-off you can get, and that’s all you can hope for.* > -> [Alan Kay](http://www.drdobbs.com/architecture-and-design/interview-with-alan-kay/240003442), in interview with *Dr Dobb’s Journal* (2012) +> [Thomas Sowell](https://www.youtube.com/watch?v=2YUtKr8-_Fg), +> Interview with Fred Barnes (2005) +Data is central to much application development today. With web and mobile apps, software as a +service (SaaS), and cloud services, it has become normal to store data from many different users in +a shared server-based data infrastructure. Data from user activity, business transactions, devices +and sensors needs to be stored and made available for analysis. As users interact with an +application, they both read the data that is stored, and also generate more data. -Many applications today are *data-intensive*, as opposed to *compute-intensive*. Raw CPU power is rarely a limiting factor for these applications; bigger problems are usually the amount of data, the complexity of data, and the speed at which it is changing. +Small amounts of data, which can be stored and processed on a single machine, are often fairly easy +to deal with. However, as the data volume or the rate of queries grows, it needs to be distributed +across multiple machines, which introduces many challenges. As the needs of the application become +more complex, it is no longer sufficient to store everything in one system, but it might be +necessary to combine multiple storage or processing systems that provide different capabilities. -A data-intensive application is typically built from standard building blocks that provide commonly needed functionality. 
For example, many applications need to: +We call an application *data-intensive* if data management is one of the primary challenges in +developing the application [[1](/en/ch1#Kouzes2009)]. +While in *compute-intensive* systems the challenge is parallelizing some very large computation, in +data-intensive applications we usually worry more about things like storing and processing large +data volumes, managing changes to data, ensuring consistency in the face of failures and +concurrency, and making sure services are highly available. -- Store data so that they, or another application, can find it again later (*databases*) -- Remember the result of an expensive operation, to speed up reads (*caches*) -- Allow users to search data by keyword or filter it in various ways (*search indexes*) -- Send a message to another process, to be handled asynchronously (*stream processing*) -- Periodically crunch a large amount of accumulated data (*batch processing*) +Such applications are typically built from standard building blocks that provide commonly needed +functionality. For example, many applications need to: -If that sounds painfully obvious, that’s just because these *data systems* are such a successful abstraction: we use them all the time without thinking too much. When building an application, most engineers wouldn’t dream of writing a new data storage engine from scratch, because databases are a perfectly good tool for the job. +* Store data so that they, or another application, can find it again later (*databases*) +* Remember the result of an expensive operation, to speed up reads (*caches*) +* Allow users to search data by keyword or filter it in various ways (*search indexes*) +* Handle events and data changes as soon as they occur (*stream processing*) +* Periodically crunch a large amount of accumulated data (*batch processing*) -But reality is not that simple. 
There are many database systems with different characteristics, because different applications have different requirements. There are various approaches to caching, several ways of building search indexes, and so on. When building an application, we still need to figure out which tools and which approaches are the most appropriate for the task at hand. And it can be hard to combine tools when you need to do something that a single tool cannot do alone. +In building an application we typically take several software systems or services, such as databases +or APIs, and glue them together with some application code. If you are doing exactly what the data +systems were designed for, then this process can be quite easy. -This book is a journey through both the principles and the practicalities of data systems, and how you can use them to build data-intensive applications. We will explore what different tools have in common, what distinguishes them, and how they achieve their characteristics. +However, as your application becomes more ambitious, challenges arise. There are many database +systems with different characteristics, suitable for different purposes. How do you choose which one +to use? There are various approaches to caching, several ways of building search indexes, and so +on. How do you reason about their trade-offs? You need to figure out which tools and which approaches +are the most appropriate for the task at hand, and it can be difficult to combine tools when you +need to do something that a single tool cannot do alone. -In this chapter, we will start by exploring the fundamentals of what we are trying to achieve: reliable, scalable, and maintainable data systems. We’ll clarify what those things mean, outline some ways of thinking about them, and go over the basics that we will need for later chapters. 
In the following chapters we will continue layer by layer, looking at different design decisions that need to be considered when working on a data-intensive application. +This book is a guide to help you make decisions about which technologies to use and how to combine +them. As you will see, there is no one approach that is fundamentally better than others; everything +has pros and cons. With this book, you will learn to ask the right questions to evaluate and compare +data systems, so that you can figure out which approach will best serve the needs of your particular +application. +We will start our journey by looking at some of the ways that data is typically used in +organizations today. Many of the ideas here have their origin in *enterprise software* (i.e., the +software needs and engineering practices of large organizations, such as big corporations and +governments), since historically, only large organizations had the large data volumes that required +sophisticated technical solutions. If your data volume is small enough, you can simply keep it in a +spreadsheet! However, more recently it has also become common for smaller companies and startups to +manage large data volumes and build data-intensive systems. +One of the key challenges with data systems is that different people need to do very different +things with data. If you are working at a company, you and your team will have one set of +priorities, while another team may have entirely different goals, even though you might be working +with the same dataset! Moreover, those goals might not be explicitly articulated, which can lead to +misunderstandings and disagreement about the right approach. 
-## …… +To help you understand what choices you can make, this chapter compares several contrasting +concepts, and explores their trade-offs: +* the difference between operational and analytical systems ([“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics)); +* pros and cons of cloud services and self-hosted systems ([“Cloud versus Self-Hosting”](/en/ch1#sec_introduction_cloud)); +* when to move from single-node systems to distributed systems ([“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed)); and +* balancing the needs of the business and the rights of the user ([“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance)). +Moreover, this chapter will provide you with terminology that we will need for the rest of the book. -## Summary +# Terminology: Frontends and Backends -In this chapter, we have explored some fundamental ways of thinking about data-intensive applications. These principles will guide us through the rest of the book, where we dive into deep technical detail. +Much of what we will discuss in this book relates to *backend development*. To explain that term: +for web applications, the client-side code (which runs in a web browser) is called the *frontend*, +and the server-side code that handles user requests is known as the *backend*. Mobile apps are +similar to frontends in that they provide user interfaces, which often communicate over the Internet +with a server-side backend. Frontends sometimes manage data locally on the user’s device +[[2](/en/ch1#Kleppmann2019_ch1)], +but the greatest data infrastructure challenges often lie in the backend: a frontend only needs to +handle one user’s data, whereas the backend manages data on behalf of *all* of the users. -An application has to meet various requirements in order to be useful. 
There are *functional requirements* (what it should do, such as allowing data to be stored, retrieved, searched, and processed in various ways), and some *nonfunctional requirements* (general properties like security, reliability, compliance, scalability, compatibility, and maintainability). In this chapter we discussed reliability, scalability, and maintainability in detail. +A backend service is often reachable via HTTP (sometimes WebSocket); it usually consists of some +application code that reads and writes data in one or more databases, and sometimes interfaces with +additional data systems such as caches or message queues (which we might collectively call *data +infrastructure*). The application code is often *stateless* (i.e., when it finishes handling one +HTTP request, it forgets everything about that request), and any information that needs to persist +from one request to another needs to be stored either on the client, or in the server-side data +infrastructure. -*Reliability* means making systems work correctly, even when faults occur. Faults can be in hardware (typically random and uncorrelated), software (bugs are typically systematic and hard to deal with), and humans (who inevitably make mistakes from time to time). Fault-tolerance techniques can hide certain types of faults from the end user. +# Analytical versus Operational Systems -*Scalability* means having strategies for keeping performance good, even when load increases. In order to discuss scalability, we first need ways of describing load and performance quantitatively. We briefly looked at Twitter’s home timelines as an example of describing load, and response time percentiles as a way of measuring performance. In a scalable system, you can add processing capacity in order to remain reliable under high load. +If you are working on data systems in an enterprise, you are likely to encounter several different +types of people who work with data. 
The first type are *backend engineers* who build services that +handle requests for reading and updating data; these services often serve external users, either +directly or indirectly via other services (see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)). Sometimes +services are for internal use by other parts of the organization. -*Maintainability* has many facets, but in essence it’s about making life better for the engineering and operations teams who need to work with the system. Good abstractions can help reduce complexity and make the system easier to modify and adapt for new use cases. Good operability means having good visibility into the system’s health, and having effective ways of managing it. +In addition to the teams managing backend services, two other groups of people typically require +access to an organization’s data: *business analysts*, who generate reports about the activities of +the organization in order to help the management make better decisions (*business intelligence* or +*BI*), and *data scientists*, who look for novel insights in data or who create user-facing product +features that are enabled by data analysis and machine learning/AI (for example, “people who bought +X also bought Y” recommendations on an e-commerce website, predictive analytics such as risk scoring +or spam filtering, and ranking of search results). -There is unfortunately no easy fix for making applications reliable, scalable, or maintainable. However, there are certain patterns and techniques that keep reappearing in different kinds of applications. In the next few chapters we will take a look at some examples of data systems and analyze how they work toward those goals. 
+Although business analysts and data scientists tend to use different tools and operate in different +ways, they have some things in common: both perform *analytics*, which means they look at the data +that the users and backend services have generated, but they generally do not modify this data +(except perhaps for fixing mistakes). They might create derived datasets in which the original data +has been processed in some way. This has led to a split between two types of systems, a distinction +that we will use throughout this book: -Later in the book, in [Part III](/en/part-iii), we will look at patterns for systems that consist of several components working together, such as the one in [Figure 1-1](/img/fig1-1.png). +* *Operational systems* consist of the backend services and data infrastructure where data is + created, for example by serving external users. Here, the application code both reads and modifies + the data in its databases, based on the actions performed by the users. +* *Analytical systems* serve the needs of business analysts and data scientists. They contain a + read-only copy of the data from the operational systems, and they are optimized for the types of + data processing that are needed for analytics. +As we shall see in the next section, operational and analytical systems are often kept separate, for +good reasons. As these systems have matured, two new specialized roles have emerged: *data +engineers* and *analytics engineers*. Data engineers are the people who know how to integrate the +operational and the analytical systems, and who take responsibility for the organization’s data +infrastructure more widely [[3](/en/ch1#Reis2022)]. +Analytics engineers model and transform data to make it more useful for the business analysts and +data scientists in an organization +[[4](/en/ch1#Machado2023)]. +Many engineers specialize in either the operational or the analytical side. 
However, this book +covers both operational and analytical data systems, since both play an important role in the +lifecycle of data within an organization. We will explore in depth the data infrastructure that is +used to deliver services both to internal and external users, so that you can work better with your +colleagues on the other side of this divide. -## References + +## Characterizing Transaction Processing and Analytics + +In the early days of business data processing, a write to the database typically corresponded to a +*commercial transaction* taking place: making a sale, placing an order with a supplier, paying an +employee’s salary, etc. As databases expanded into areas that didn’t involve money changing hands, +the term *transaction* nevertheless stuck, referring to a group of reads and writes that form a +logical unit. + +###### Note + +[Chapter 8](/en/ch8#ch_transactions) explores in detail what we mean by a transaction. This chapter uses the term +loosely to refer to low-latency reads and writes. + +Even though databases started being used for many different kinds of data (posts on social media, +moves in a game, contacts in an address book, and many others), the basic access pattern +remained similar to processing business transactions. An operational system typically looks up a +small number of records by some key (this is called a *point query*). Records are inserted, updated, +or deleted based on the user’s input. Because these applications are interactive, this access +pattern became known as *online transaction processing* (OLTP). + +However, databases were also increasingly used for analytics, which has very different +access patterns compared to OLTP. Usually an analytic query scans over a huge number of records, and +calculates aggregate statistics (such as count, sum, or average) rather than returning the +individual records to the user. 
For example, a business analyst at a supermarket chain may want to +answer analytic queries such as: + +* What was the total revenue of each of our stores in January? +* How many more bananas than usual did we sell during our latest promotion? +* Which brand of baby food is most often purchased together with brand X diapers? + +The reports that result from these types of queries are important for business intelligence, helping +the management decide what to do next. In order to differentiate this pattern of using databases +from transaction processing, it has been called *online analytic processing* (OLAP) +[[5](/en/ch1#Codd1993)]. +The difference between OLTP and analytics is not always clear-cut, but some typical characteristics +are listed in [Table 1-1](/en/ch1#tab_oltp_vs_olap). + +Table 1-1. Comparing characteristics of operational and analytic systems + +| Property | Operational systems (OLTP) | Analytical systems (OLAP) | +| --- | --- | --- | +| Main read pattern | Point queries (fetch individual records by key) | Aggregate over large number of records | +| Main write pattern | Create, update, and delete individual records | Bulk import (ETL) or event stream | +| Human user example | End user of web/mobile application | Internal analyst, for decision support | +| Machine use example | Checking if an action is authorized | Detecting fraud/abuse patterns | +| Type of queries | Fixed set of queries, predefined by application | Analyst can make arbitrary queries | +| Data represents | Latest state of data (current point in time) | History of events that happened over time | +| Dataset size | Gigabytes to terabytes | Terabytes to petabytes | + +###### Note + +The meaning of *online* in *OLAP* is unclear; it probably refers to the fact that queries are not +just for predefined reports, but that analysts use the OLAP system interactively for explorative +queries. 
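The contrast in Table 1-1 between point queries and aggregate queries can be sketched with Python's built-in `sqlite3` module. This is only a minimal illustration: the `sales` table, its columns, and its rows are invented for the example and do not come from the book, and a real OLTP or OLAP system would use very different storage underneath.

```python
import sqlite3

# Hypothetical sales table, invented purely for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sales (id INTEGER PRIMARY KEY, store TEXT, product TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO sales (id, store, product, amount) VALUES (?, ?, ?, ?)",
    [
        (1, "north", "bananas", 3.0),
        (2, "north", "diapers", 12.0),
        (3, "south", "bananas", 4.5),
        (4, "south", "baby food", 6.0),
    ],
)


def point_query(sale_id):
    """OLTP-style access: fetch a single record by its key."""
    return conn.execute(
        "SELECT id, store, product, amount FROM sales WHERE id = ?", (sale_id,)
    ).fetchone()


def revenue_per_store():
    """OLAP-style access: scan every record and return only aggregates."""
    rows = conn.execute(
        "SELECT store, SUM(amount) FROM sales GROUP BY store ORDER BY store"
    )
    return dict(rows.fetchall())
```

The point query touches one row and returns it as-is, whereas the aggregate query reads the whole table and returns one summary value per store, which mirrors the "Main read pattern" row of the table.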
+ +With operational systems, users are generally not allowed to construct custom SQL queries and run +them on the database, since that would potentially allow them to read or modify data that they do +not have permission to access. Moreover, they might write queries that are expensive to execute, and +hence affect the database performance for other users. For these reasons, OLTP systems mostly run a +fixed set of queries that are baked into the application code, and use one-off custom queries only +occasionally for maintenance or troubleshooting. On the other hand, analytic databases usually give +their users the freedom to write arbitrary SQL queries by hand, or to generate queries automatically +using a data visualization or dashboard tool such as Tableau, Looker, or Microsoft Power BI. + +There are also systems that are designed for analytical workloads (queries that aggregate +over many records) but that are embedded into user-facing products. This category is known as +*product analytics* or *real-time analytics*, and systems designed for this type of use include +Pinot, Druid, and ClickHouse +[[6](/en/ch1#Soman2023)]. + +## Data Warehousing + +At first, the same databases were used for both transaction processing and analytic queries. SQL +turned out to be quite flexible in this regard: it works well for both types of queries. +Nevertheless, in the late 1980s and early 1990s, there was a trend for companies to stop using their +OLTP systems for analytics purposes, and to run the analytics on a separate database system instead. +This separate database was called a *data warehouse*. + +A large enterprise may have dozens, even hundreds, of online transaction processing systems: +systems powering the customer-facing website, controlling point of sale (checkout) systems in +physical stores, tracking inventory in warehouses, planning routes for vehicles, managing suppliers, +administering employees, and performing many other tasks. 
Each of these systems is complex and needs +a team of people to maintain it, so these systems end up operating mostly independently from each +other. + +It is usually undesirable for business analysts and data scientists to directly query these OLTP +systems, for several reasons: + +* the data of interest may be spread across multiple operational systems, making it difficult to + combine those datasets in a single query (a problem known as *data silos*); +* the kinds of schemas and data layouts that are good for OLTP are less well suited for analytics + (see [“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics)); +* analytic queries can be quite expensive, and running them on an OLTP database would impact the + performance for other users; and +* the OLTP systems might reside in a separate network that users are not allowed direct access to + for security or compliance reasons. + +A *data warehouse*, by contrast, is a separate database that analysts can query to their hearts’ +content, without affecting OLTP operations +[[7](/en/ch1#Chaudhuri1997)]. +As we shall see in [Chapter 4](/en/ch4#ch_storage), data warehouses often store data in a way that is very different +from OLTP databases, in order to optimize for the types of queries that are common in analytics. + +The data warehouse contains a read-only copy of the data in all the various OLTP systems in the +company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous +stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into +the data warehouse. This process of getting data into the data warehouse is known as +*Extract–Transform–Load* (ETL) and is illustrated in [Figure 1-1](/en/ch1#fig_dwh_etl). Sometimes the order of the +*transform* and *load* steps is swapped (i.e., the transformation is done in the data warehouse, +after loading), resulting in *ELT*. 
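The extract, transform, and load steps described above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions: the two source lists, their field names, and the `fact_sales` table are hypothetical, and an in-memory SQLite database stands in for the data warehouse.

```python
import sqlite3

# Two hypothetical operational sources with different schemas (invented data).
WEB_ORDERS = [
    {"order_id": 1, "customer": " Alice ", "total_cents": 1250},
    {"order_id": 2, "customer": "bob", "total_cents": 399},
]
STORE_SALES = [
    {"receipt": "S-7", "buyer": "ALICE", "amount": "20.00"},
]


def extract():
    """Extract: take a snapshot (a periodic dump) of each source."""
    return list(WEB_ORDERS), list(STORE_SALES)


def transform(web_orders, store_sales):
    """Transform: clean up values and map both sources onto one schema."""
    rows = []
    for order in web_orders:
        customer = order["customer"].strip().lower()
        rows.append(("web", customer, order["total_cents"] / 100))
    for sale in store_sales:
        customer = sale["buyer"].strip().lower()
        rows.append(("store", customer, float(sale["amount"])))
    return rows


def load(warehouse, rows):
    """Load: write the cleaned rows into the warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS fact_sales (channel TEXT, customer TEXT, amount REAL)"
    )
    warehouse.executemany("INSERT INTO fact_sales VALUES (?, ?, ?)", rows)
    warehouse.commit()


def run_etl():
    """Run the three steps in ETL order and return the warehouse connection."""
    warehouse = sqlite3.connect(":memory:")
    load(warehouse, transform(*extract()))
    return warehouse
```

Once loaded, an analyst can combine data from both sources in a single query, which is the point of the warehouse. In an ELT variant, `load` would store the raw records first and the cleanup in `transform` would run inside the warehouse afterwards.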
+ +![ddia 0101](/fig/ddia_0101.png) + +###### Figure 1-1. Simplified outline of ETL into a data warehouse. + +In some cases the data sources of the ETL processes are external SaaS products such as customer +relationship management (CRM), email marketing, or credit card processing systems. In those cases, +you do not have direct access to the original database, since it is accessible only via the software +vendor’s API. Bringing the data from these external systems into your own data warehouse can enable +analyses that are not possible via the SaaS API. ETL for SaaS APIs is often implemented by +specialist data connector services such as Fivetran, Singer, or Airbyte. + +Some database systems offer *hybrid transactional/analytic processing* (HTAP), which aims to enable +OLTP and analytics in a single system without requiring ETL from one system into another +[[8](/en/ch1#Ozcan2017), +[9](/en/ch1#Prout2022_ch1)]. +However, many HTAP systems internally consist of an OLTP system coupled with a separate analytical +system, hidden behind a common interface, so the distinction between the two remains important for +understanding how these systems work. + +Moreover, even though HTAP exists, it is common to have a separation between transactional and +analytic systems due to their different goals and requirements. In particular, it is considered good +practice for each operational system to have its own database (see +[“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), leading to hundreds of separate operational databases; on the +other hand, an enterprise usually has a single data warehouse, so that business analysts can combine +data from several operational systems in a single query. + +HTAP therefore does not replace data warehouses. Rather, it is useful in scenarios where the same +application needs to both perform analytic queries that scan a large number of rows, and also +read and update individual records with low latency. 
Fraud detection can involve such workloads, for +example [[10](/en/ch1#Zhang2024)]. + +The separation between operational and analytical systems is part of a wider trend: as workloads +have become more demanding, systems have become more specialized and optimized for particular +workloads. General-purpose systems can handle small data volumes comfortably, but the greater the +scale, the more specialized systems tend to become +[[11](/en/ch1#Stonebraker2005fitsall)]. + +### From data warehouse to data lake + +A data warehouse often uses a *relational* data model that is queried through SQL (see +[Chapter 3](/en/ch3#ch_datamodels)), perhaps using specialized business intelligence software. This model works well +for the types of queries that business analysts need to make, but it is less well suited to the +needs of data scientists, who might need to perform tasks such as: + +* Transform data into a form that is suitable for training a machine learning model; often this + requires turning the rows and columns of a database table into a vector or matrix of numerical + values called *features*. The process of performing this transformation in a way that maximizes + the performance of the trained model is called *feature engineering*, and it often requires custom + code that is difficult to express using SQL. +* Take textual data (e.g., reviews of a product) and use natural language processing techniques to + try to extract structured information from it (e.g., the sentiment of the author, or which topics + they mention). Similarly, they might need to extract structured information from photos using + computer vision techniques. + +Although there have been efforts to add machine learning operators to a SQL data model +[[12](/en/ch1#Cohen2009)] +and to build efficient machine learning systems on top of a relational foundation +[[13](/en/ch1#Olteanu2020)], +many data scientists prefer not to work in a relational database such as a data warehouse. 
Instead, +many prefer to use Python data analysis libraries such as pandas and scikit-learn, statistical +analysis languages such as R, and distributed analytics frameworks such as Spark +[[14](/en/ch1#Bornstein2020)]. +We discuss these further in [“Dataframes, Matrices, and Arrays”](/en/ch3#sec_datamodels_dataframes). + +Consequently, organizations face a need to make data available in a form that is suitable for use by +data scientists. The answer is a *data lake*: a centralized data repository that holds a copy of any +data that might be useful for analysis, obtained from operational systems via ETL processes. The +difference from a data warehouse is that a data lake simply contains files, without imposing any +particular file format or data model. Files in a data lake might be collections of database records, +encoded using a file format such as Avro or Parquet (see [Chapter 5](/en/ch5#ch_encoding)), but they can equally well +contain text, images, videos, sensor readings, sparse matrices, feature vectors, genome sequences, +or any other kind of data [[15](/en/ch1#Fowler2015)]. +Besides being more flexible, this is also often cheaper than relational data storage, since the data +lake can use commoditized file storage such as object stores (see [“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native)). + +ETL processes have been generalized to *data pipelines*, and in some cases the data lake has become +an intermediate stop on the path from the operational systems to the data warehouse. The data lake +contains data in a “raw” form produced by the operational systems, without the transformation into a +relational data warehouse schema. This approach has the advantage that each consumer of the data can +transform the raw data into a form that best suits their needs. It has been dubbed the *sushi +principle*: “raw data is better” [[16](/en/ch1#Johnson2015)]. 
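The idea that each consumer transforms the same raw data into the form it needs can be sketched as follows. The review records and both functions are hypothetical examples, not taken from the book: one consumer derives numeric features for a machine learning model, the other computes a per-product aggregate of the kind a business analyst might want.

```python
# Hypothetical raw review records, as they might sit in a data lake file.
RAW_REVIEWS = [
    {"product": "kettle", "stars": 5, "text": "Great kettle, boils fast"},
    {"product": "kettle", "stars": 2, "text": "Stopped working"},
    {"product": "toaster", "stars": 4, "text": "Good toaster"},
]


def to_features(record):
    """Data science consumer: turn one raw record into a numeric feature vector.

    The chosen features (stars, word count, character count) are arbitrary
    and only illustrate the shape of the transformation.
    """
    words = record["text"].split()
    return [float(record["stars"]), float(len(words)), float(len(record["text"]))]


def average_stars(records):
    """Analytics consumer: aggregate the same raw records per product."""
    totals = {}
    for record in records:
        count, total = totals.get(record["product"], (0, 0))
        totals[record["product"]] = (count + 1, total + record["stars"])
    return {product: total / count for product, (count, total) in totals.items()}
```

Neither consumer needs the other's output, and neither forces a schema on the raw records; that independence is what keeping the data "raw" buys.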
+ +Besides loading data from a data lake into a separate data warehouse, it is also possible to run +typical data warehousing workloads (SQL queries and business analytics) directly on the files in the +data lake, alongside data science/machine learning workloads. This architecture is known as a *data +lakehouse*, and it requires a query execution engine and a metadata (e.g., schema management) layer +that extend the data lake’s file storage +[[17](/en/ch1#Armbrust2021)]. + +Apache Hive, Spark SQL, Presto, and Trino are examples of this approach. + +### Beyond the data lake + +As analytics practices have matured, organizations have been increasingly paying attention to the +management and operations of analytics systems and data pipelines, as captured for example in the +DataOps manifesto [[18](/en/ch1#DataOps)]. +This includes issues of governance, privacy, and compliance with regulations such as GDPR and +CCPA, which we discuss in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance) and [Link to Come]. + +Moreover, analytical data is increasingly made available not only as files and relational tables, +but also as streams of events (see [Link to Come]). With file-based data analysis you can re-run the +analysis periodically (e.g., daily) in order to respond to changes in the data, but stream processing +allows analytics systems to respond to events much faster, on the order of seconds. Depending on the +application and how time-sensitive it is, a stream processing approach can be valuable, for example +to identify and block potentially fraudulent or abusive activity. + +In some cases the outputs of analytics systems are made available to operational systems (a process +sometimes known as *reverse ETL* [[19](/en/ch1#Manohar2021)]). 
For example, a +machine-learning model that was trained on data in an analytics system may be deployed to +production, so that it can generate recommendations for end-users, such as “people who bought X also +bought Y”. Such deployed outputs of analytics systems are also known as *data products* +[[20](/en/ch1#ORegan2018)]. +Machine learning models can be deployed to operational systems using specialized tools such as +TFX, Kubeflow, or MLflow. + +## Systems of Record and Derived Data + +Related to the distinction between operational and analytical systems, this book also distinguishes +between *systems of record* and *derived data systems*. These terms are useful because they can help +you clarify the flow of data through a system: + +Systems of record +: A system of record, also known as *source of truth*, holds the authoritative or *canonical* + version of some data. When new data comes in, e.g., as user input, it is first written here. Each + fact is represented exactly once (the representation is typically *normalized*; see + [“Normalization, Denormalization, and Joins”](/en/ch3#sec_datamodels_normalization)). If there is any discrepancy between another system and the + system of record, then the value in the system of record is (by definition) the correct one. + +Derived data systems +: Data in a derived system is the result of taking some existing data from another system and + transforming or processing it in some way. If you lose derived data, you can recreate it from the + original source. A classic example is a cache: data can be served from the cache if present, but + if the cache doesn’t contain what you need, you can fall back to the underlying database. + Denormalized values, indexes, materialized views, transformed data representations, and models + trained on a dataset also fall into this category. + +Technically speaking, derived data is *redundant*, in the sense that it duplicates existing +information. 
However, it is often essential for getting good performance on read queries. You can +derive several different datasets from a single source, enabling you to look at the data from +different “points of view.” + +Analytical systems are usually derived data systems, because they are consumers of data created +elsewhere. Operational services may contain a mixture of systems of record and derived data systems. +The systems of record are the primary databases to which data is first written, whereas the derived +data systems are the indexes and caches that speed up common read operations, especially for queries +that the system of record cannot answer efficiently. + +Most databases, storage engines, and query languages are not inherently a system of record or a +derived system. A database is just a tool: how you use it is up to you. The distinction between +system of record and derived data system depends not on the tool, but on how you use it in your +application. By being clear about which data is derived from which other data, you can bring clarity +to an otherwise confusing system architecture. + +When the data in one system is derived from the data in another, you need a process for updating the +derived data when the original in the system of record changes. Unfortunately, many databases are +designed based on the assumption that your application only ever needs to use that one database, and +they do not make it easy to integrate multiple systems in order to propagate such updates. In +[Link to Come] we will discuss approaches to *data integration*, which allow us to compose multiple +data systems to achieve things that one system alone cannot do. + +That brings us to the end of our comparison of analytics and transaction processing. In the next +section, we will examine another trade-off that you might have already seen debated multiple times. 
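Before moving on, the relationship between a system of record and derived data can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SystemOfRecord` and `DerivedEmailIndex` are hypothetical in-memory classes invented for this example, not part of any real database. The sketch shows two properties described above: derived data can be rebuilt from the source after it is lost, and it goes stale until some process propagates changes from the system of record.

```python
class SystemOfRecord:
    """Hypothetical authoritative store: new data is written here first."""

    def __init__(self):
        self._emails = {}  # user_id -> email, each fact stored once

    def write(self, user_id, email):
        self._emails[user_id] = email

    def scan(self):
        return list(self._emails.items())


class DerivedEmailIndex:
    """Hypothetical derived dataset: email -> user_id, for fast lookups."""

    def __init__(self):
        self._user_by_email = {}

    def rebuild(self, source):
        # Derived data is redundant, so it can be recreated from the source.
        self._user_by_email = {email: uid for uid, email in source.scan()}

    def lookup(self, email):
        return self._user_by_email.get(email)


def demo():
    source = SystemOfRecord()
    source.write(1, "ada@example.com")
    source.write(2, "alan@example.com")

    index = DerivedEmailIndex()
    index.rebuild(source)
    before_loss = index.lookup("alan@example.com")

    index = DerivedEmailIndex()  # simulate losing the derived data
    lost = index.lookup("alan@example.com")
    index.rebuild(source)
    after_rebuild = index.lookup("alan@example.com")

    # A change in the system of record is invisible to the derived index
    # until an update process propagates it.
    source.write(2, "turing@example.com")
    stale = index.lookup("turing@example.com")
    index.rebuild(source)
    fresh = index.lookup("turing@example.com")
    return before_loss, lost, after_rebuild, stale, fresh
```

A full rebuild on every change is the simplest possible update process; it is used here only to keep the sketch short.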
+ +# Cloud versus Self-Hosting + +With anything that an organization needs to do, one of the first questions is: should it be done +in-house, or should it be outsourced? Should you build or should you buy? + +Ultimately, this is a question about business priorities. The received management wisdom is that +things that are a core competency or a competitive advantage of your organization should be done +in-house, whereas things that are non-core, routine, or commonplace should be left to a vendor +[[21](/en/ch1#Fournier2021)]. +To give an extreme example, most companies do not generate their own electricity (unless they are an +energy company, and leaving aside emergency backup power), since it is cheaper to buy electricity +from the grid. + +With software, two important decisions to be made are who builds the software and who deploys it. +There is a spectrum of possibilities that outsource each decision to various degrees, as illustrated +in [Figure 1-2](/en/ch1#fig_cloud_spectrum). At one extreme is bespoke software that you write and run in-house; at +the other extreme are widely-used cloud services or Software as a Service (SaaS) products that are +implemented and operated by an external vendor, and which you only access through a web interface or +API. + +![ddia 0102](/fig/ddia_0102.png) + +###### Figure 1-2. A spectrum of types of software and its operations. + +The middle ground is off-the-shelf software (open source or commercial) that you *self-host*, i.e., +deploy yourself—for example, if you download MySQL and install it on a server you control. This +could be on your own hardware (often called *on-premises*, even if the server is actually in a +rented datacenter rack and not literally on your own premises), or on a virtual machine in the cloud +(*Infrastructure as a Service* or IaaS). There are still more points along this spectrum, e.g., +taking open source software and running a modified version of it. 
+ +Separately from this spectrum there is also the question of *how* you deploy services, either in the +cloud or on-premises—for example, whether you use an orchestration framework such as Kubernetes. +However, choice of deployment tooling is out of scope of this book, since other factors have a +greater influence on the architecture of data systems. + +## Pros and Cons of Cloud Services + +Using a cloud service, rather than running comparable software yourself, essentially outsources the +operation of that software to the cloud provider. There are good arguments for and against cloud +services. Cloud providers claim that using their services saves you time and money, and allows you +to move faster compared to setting up your own infrastructure. + +Whether a cloud service is actually cheaper and easier than self-hosting depends very much on your +skills and the workload on your systems. If you already have experience setting up and operating the +systems you need, and if your load is quite predictable (i.e., the number of machines you need does +not fluctuate wildly), then it’s often cheaper to buy your own machines and run the software on them +yourself [[22](/en/ch1#HeinemeierHansson2022), +[23](/en/ch1#Badizadegan2022)]. + +On the other hand, if you need a system that you don’t already know how to deploy and operate, then +adopting a cloud service is often easier and quicker than learning to manage the system yourself. If +you have to hire and train staff specifically to maintain and operate the system, that can get very +expensive. You still need an operations team when you’re using the cloud (see +[“Operations in the Cloud Era”](/en/ch1#sec_introduction_operations)), but outsourcing the basic system administration can free up your +team to focus on higher-level concerns. 
+ +When you outsource the operation of a system to a company that specializes in running that service, +that can potentially result in a better service, since the provider gains operational expertise from +providing the service to many customers. On the other hand, if you run the service yourself, you can +configure and tune it to perform well on your particular workload; it is unlikely that a cloud +service would be willing to make such customizations on your behalf. + +Cloud services are particularly valuable if the load on your systems varies a lot over time. If you +provision your machines to be able to handle peak load, but those computing resources are idle most +of the time, the system becomes less cost-effective. In this situation, cloud services have the +advantage that they can make it easier to scale your computing resources up or down in response to +changes in demand. + +For example, analytics systems often have extremely variable load: running a large analytical query +quickly requires a lot of computing resources in parallel, but once the query completes, those +resources sit idle until the user makes the next query. Predefined queries (e.g., for daily reports) +can be enqueued and scheduled to smooth out the load, but for interactive queries, the faster you +want them to complete, the more variable the workload becomes. If your dataset is so large that +querying it quickly requires significant computing resources, using the cloud can save money, since +you can return unused resources to the provider rather than leaving them idle. For smaller datasets, +this difference is less significant. + +The biggest downside of a cloud service is that you have no control over it: + +* If it is lacking a feature you need, all you can do is to politely ask the vendor whether they + will add it; you generally cannot implement it yourself. +* If the service goes down, all you can do is to wait for it to recover. 
 +* If you are using the service in a way that triggers a bug or causes performance problems, it will + be difficult for you to diagnose the issue. With software that you run yourself, you can get + performance metrics and debugging information from the operating system to help you understand its + behavior, and you can look at the server logs, but with a service hosted by a vendor you usually + do not have access to these internals. +* Moreover, if the service shuts down or becomes unacceptably expensive, or if the vendor decides to + change their product in a way you don’t like, you are at their mercy: continuing to run an old + version of the software is usually not an option, so you will be forced to migrate to an + alternative service [[24](/en/ch1#Yegge2020)]. + This risk is mitigated if there are alternative services that expose a compatible API, but for + many cloud services there are no standard APIs, which raises the cost of switching, making vendor + lock-in a problem. +* The cloud provider needs to be trusted to keep the data secure, which can complicate the process + of complying with privacy and security regulations. + +Despite all these risks, it has become increasingly popular for organizations to build new +applications on top of cloud services, or to adopt a hybrid approach in which cloud services are +used for some aspects of a system. However, cloud services will not subsume all in-house data +systems: many older systems predate the cloud, and for any services that have specialist +requirements that existing cloud services cannot meet, in-house systems remain necessary. For +example, very latency-sensitive applications such as high-frequency trading require full control of +the hardware.
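The cost argument made earlier in this section, about provisioning for peak load versus scaling with demand, can be made concrete with some back-of-the-envelope arithmetic. The sketch below is illustrative only: the load profiles and the prices (5 cents per machine-hour for your own hardware, 10 cents for an elastic cloud instance) are made-up assumptions, and real pricing involves many more factors such as staffing, storage, and network transfer.

```python
def peak_provisioned_cost(hourly_demand, cents_per_machine_hour):
    """Cost of running enough machines for peak load around the clock."""
    return max(hourly_demand) * len(hourly_demand) * cents_per_machine_hour


def elastic_cost(hourly_demand, cents_per_machine_hour):
    """Cost when you pay only for the machine-hours actually needed."""
    return sum(hourly_demand) * cents_per_machine_hour


def utilization(hourly_demand):
    """Fraction of peak-provisioned capacity that is actually used."""
    return sum(hourly_demand) / (max(hourly_demand) * len(hourly_demand))


# Assumed prices in cents per machine-hour (hypothetical numbers).
OWN_HARDWARE = 5
CLOUD_ELASTIC = 10

# Machines needed in each hour of one day (hypothetical load profiles).
steady = [10] * 24             # predictable load
spiky = [2] * 22 + [50] * 2    # mostly idle, with a short burst

for name, demand in [("steady", steady), ("spiky", spiky)]:
    print(
        name,
        "own hardware:", peak_provisioned_cost(demand, OWN_HARDWARE),
        "elastic cloud:", elastic_cost(demand, CLOUD_ELASTIC),
        "utilization:", round(utilization(demand), 2),
    )
```

Under these assumptions the steady workload is cheaper on your own hardware even though the cloud price per machine-hour is twice as high, while the spiky workload is much cheaper in the cloud because peak-provisioned machines would sit idle most of the day.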
+ +## Cloud-Native System Architecture + +Besides having a different economic model (subscribing to a service instead of buying hardware and +licensing software to run on it), the rise of the cloud has also had a profound effect on how data +systems are implemented on a technical level. The term *cloud-native* is used to describe an +architecture that is designed to take advantage of cloud services. + +In principle, almost any software that you can self-host could also be provided as a cloud service, +and indeed such managed services are now available for many popular data systems. However, systems +that have been designed from the ground up to be cloud-native have been shown to have several +advantages: better performance on the same hardware, faster recovery from failures, being able to +quickly scale computing resources to match the load, and supporting larger datasets +[[25](/en/ch1#Verbitski2017), +[26](/en/ch1#Antonopoulos2019_ch1), +[27](/en/ch1#Vuppalapati2020)]. +[Table 1-2](/en/ch1#tab_cloud_native_dbs) lists some examples of both types of systems. + +Table 1-2. Examples of self-hosted and cloud-native database systems + +| Category | Self-hosted systems | Cloud-native systems | +| --- | --- | --- | +| Operational/OLTP | MySQL, PostgreSQL, MongoDB | AWS Aurora [[25](/en/ch1#Verbitski2017)], Azure SQL DB Hyperscale [[26](/en/ch1#Antonopoulos2019_ch1)], Google Cloud Spanner | +| Analytical/OLAP | Teradata, ClickHouse, Spark | Snowflake [[27](/en/ch1#Vuppalapati2020)], Google BigQuery, Azure Synapse Analytics | + +### Layering of cloud services + +Many self-hosted data systems have very simple system requirements: they run on a conventional +operating system such as Linux or Windows, they store their data as files on the filesystem, and +they communicate via standard network protocols such as TCP/IP. 
A few systems depend on special +hardware such as GPUs (for machine learning) or RDMA network interfaces, but on the whole, +self-hosted software tends to use very generic computing resources: CPU, RAM, a filesystem, and an +IP network. + +In a cloud, this type of software can be run on an Infrastructure-as-a-Service environment, using +one or more virtual machines (or *instances*) with a certain allocation of CPUs, memory, disk, and +network bandwidth. Compared to physical machines, cloud instances can be provisioned faster and they +come in a greater variety of sizes, but otherwise they are similar to a traditional computer: you +can run any software you like on it, but you are responsible for administering it yourself. + +In contrast, the key idea of cloud-native services is to use not only the computing resources +managed by your operating system, but also to build upon lower-level cloud services to create +higher-level services. For example: + +* *Object storage* services such as Amazon S3, Azure Blob Storage, and Cloudflare R2 store large + files. They provide more limited APIs than a typical filesystem (basic file reads and writes), but + they have the advantage that they hide the underlying physical machines: the service automatically + distributes the data across many machines, so that you don’t have to worry about running out of + disk space on any one machine. Even if some machines or their disks fail entirely, no data is + lost. +* Many other services are in turn built upon object storage and other cloud services: for example, + Snowflake is a cloud-based analytic database (data warehouse) that relies on S3 for data storage + [[27](/en/ch1#Vuppalapati2020)], and some other services in turn + build upon Snowflake. + +As always with abstractions in computing, there is no one right answer to what you should use. As a +general rule, higher-level abstractions tend to be more oriented towards particular use cases. 
If +your needs match the situations for which a higher-level system is designed, using the existing +higher-level system will probably provide what you need with much less hassle than building it +yourself from lower-level systems. On the other hand, if there is no high-level system that meets +your needs, then building it yourself from lower-level components is the only option. + +### Separation of storage and compute + +In traditional computing, disk storage is regarded as durable (we assume that once something is +written to disk, it will not be lost). To tolerate the failure of an individual hard disk, RAID +(Redundant Array of Independent Disks) is often used to maintain copies of the data on several +disks attached to the same machine. RAID can be performed either in hardware or in software by the +operating system, and it is transparent to the applications accessing the filesystem. + +In the cloud, compute instances (virtual machines) may also have local disks attached, but +cloud-native systems typically treat these disks more like an ephemeral cache, and less like +long-term storage. This is because the local disk becomes inaccessible if the associated instance +fails, or if the instance is replaced with a bigger or a smaller one (on a different physical +machine) in order to adapt to changes in load. + +As an alternative to local disks, cloud services also offer virtual disk storage that can be +detached from one instance and attached to a different one (Amazon EBS, Azure managed disks, and +persistent disks in Google Cloud). Such a virtual disk is not actually a physical disk, but rather a +cloud service provided by a separate set of machines, which emulates the behavior of a disk (a +*block device*, where each block is typically 4 KiB in size). 
This technology makes it +possible to run traditional disk-based software in the cloud, but the block device emulation +introduces overheads that can be avoided in systems that are designed from the ground up for the +cloud [[25](/en/ch1#Verbitski2017)]. It also makes the application +very sensitive to network glitches, since every I/O on the virtual block device is actually a +network call [[28](/en/ch1#NickVanWiggeren2025)]. + +To address this problem, cloud-native services generally avoid using virtual disks, and instead +build on dedicated storage services that are optimized for particular workloads. Object storage +services such as S3 are designed for long-term storage of fairly large files, ranging from hundreds +of kilobytes to several gigabytes in size. The individual rows or values stored in a database are +typically much smaller than this; cloud databases therefore typically manage smaller values in a +separate service, and store larger data blocks (containing many individual values) in an object +store [[26](/en/ch1#Antonopoulos2019_ch1), +[29](/en/ch1#Breck2024)]. +We will see ways of doing this in [Chapter 4](/en/ch4#ch_storage). + +In a traditional systems architecture, the same computer is responsible for both storage (disk) and +computation (CPU and RAM), but in cloud-native systems, these two responsibilities have become +somewhat separated or *disaggregated* [[9](/en/ch1#Prout2022_ch1), +[27](/en/ch1#Vuppalapati2020), +[30](/en/ch1#Shapira2023separation), +[31](/en/ch1#Murthy2022)]: +for example, S3 only stores files, and if you want to analyze that data, you will have to run the +analysis code somewhere outside of S3. This implies transferring the data over the network, which we +will discuss further in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed). 
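The separation of storage and compute described above can be sketched as follows. This is a toy model under stated assumptions: `ObjectStore` is an in-memory stand-in for a durable storage service (it is not the S3 API), and `ComputeNode` is a hypothetical compute instance whose local disk is treated purely as a cache. The point of the sketch is that replacing the compute instance discards the cache but loses no data, at the price of another transfer over the network.

```python
class ObjectStore:
    """In-memory stand-in for a durable, shared object storage service."""

    def __init__(self):
        self._objects = {}
        self.get_calls = 0  # counts simulated network transfers

    def put(self, key, data):
        self._objects[key] = bytes(data)

    def get(self, key):
        self.get_calls += 1
        return self._objects[key]


class ComputeNode:
    """Hypothetical compute instance; its local state is only a cache."""

    def __init__(self, store):
        self._store = store
        self._local_cache = {}  # ephemeral: gone if the instance is replaced

    def read(self, key):
        if key not in self._local_cache:
            # Cache miss: fetch the object over the (simulated) network.
            self._local_cache[key] = self._store.get(key)
        return self._local_cache[key]

    def count_lines(self, key):
        # The analysis runs on the compute node, not in the storage service.
        return self.read(key).count(b"\n")


store = ObjectStore()
store.put("logs/part-0", b"GET /\nGET /about\nPOST /login\n")

node_a = ComputeNode(store)
first = node_a.count_lines("logs/part-0")   # fetched from the store
second = node_a.count_lines("logs/part-0")  # served from the local cache
calls_after_node_a = store.get_calls

# Replace the instance (e.g., to resize it): the cache is lost, the data is not.
node_b = ComputeNode(store)
third = node_b.count_lines("logs/part-0")
calls_after_node_b = store.get_calls
```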
+ +Moreover, cloud-native systems are often *multitenant*, which means that rather than having a +separate machine for each customer, data and computation from several different customers are +handled on the same shared hardware by the same service +[[32](/en/ch1#Vanlightly2023serverless)]. +Multitenancy can enable better hardware utilization, easier scalability, and easier management by +the cloud provider, but it also requires careful engineering to ensure that one customer’s activity +does not affect the performance or security of the system for other customers +[[33](/en/ch1#Jonas2019)]. + +## Operations in the Cloud Era + +Traditionally, the people managing an organization’s server-side data infrastructure were known as +*database administrators* (DBAs) or *system administrators* (sysadmins). More recently, many +organizations have tried to integrate the roles of software development and operations into teams +with a shared responsibility for both backend services and data infrastructure; the *DevOps* +philosophy has guided this trend. *Site Reliability Engineers* (SREs) are Google’s implementation of +this idea [[34](/en/ch1#Beyer2016)]. + +The role of operations is to ensure services are reliably delivered to users (including configuring +infrastructure and deploying applications), and to ensure a stable production environment (including +monitoring and diagnosing any problems that may affect reliability). For self-hosted systems, +operations traditionally involves a significant amount of work at the level of individual machines, +such as capacity planning (e.g., monitoring available disk space and adding more disks before you +run out of space), provisioning new machines, moving services from one machine to another, and +installing operating system patches. + +Many cloud services present an API that hides the individual machines that actually implement the +service. 
For example, cloud storage replaces fixed-size disks with *metered billing*, where you can +store data without planning your capacity needs in advance, and you are then charged based on the +space actually used. Moreover, many cloud services remain highly available, even when individual +machines have failed (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability)). + +This shift in emphasis from individual machines to services has been accompanied by a change in the +role of operations. The high-level goal of providing a reliable service remains the same, but the +processes and tools have evolved. The DevOps/SRE philosophy places greater emphasis on: + +* automation—preferring repeatable processes over manual one-off jobs, +* preferring ephemeral virtual machines and services over long running servers, +* enabling frequent application updates, +* learning from incidents, and +* preserving the organization’s knowledge about the system, even as individual people come and go + [[35](/en/ch1#Limoncelli2020)]. + +With the rise of cloud services, there has been a bifurcation of roles: operations teams at +infrastructure companies specialize in the details of providing a reliable service to a large number +of customers, while the customers of the service spend as little time and effort as possible on +infrastructure [[36](/en/ch1#Majors2020)]. + +Customers of cloud services still require operations, but they focus on different aspects, such as +choosing the most appropriate service for a given task, integrating different services with each +other, and migrating from one service to another. 
Even though metered billing removes the need for +capacity planning in the traditional sense, it’s still important to know what resources you are +using for which purpose, so that you don’t waste money on cloud resources that are not needed: +capacity planning becomes financial planning, and performance optimization becomes cost optimization +[[37](/en/ch1#Cherkasky2021)]. +Moreover, cloud services do have resource limits or *quotas* (such as the maximum number of +processes you can run concurrently), which you need to know about and plan for before you run into +them [[38](/en/ch1#Kushchi2023)]. + +Adopting a cloud service can be easier and quicker than running your own infrastructure, although +even here there is a cost in learning how to use it, and perhaps working around its limitations. +Integration between different services becomes a particular challenge as a growing number of vendors +offers an ever broader range of cloud services targeting different use cases +[[39](/en/ch1#Bernhardsson2021), +[40](/en/ch1#Stancil2021)]. +ETL (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) is only part of the story; operational cloud services also need +to be integrated with each other. At present, there is a lack of standards that would facilitate +this sort of integration, so it often involves significant manual effort. + +Other operational aspects that cannot fully be outsourced to cloud services include maintaining the +security of an application and the libraries it uses, managing the interactions between your own +services, monitoring the load on your services, and tracking down the cause of problems such as +performance degradations or outages. While the cloud is changing the role of operations, the need +for operations is as great as ever. + +# Distributed versus Single-Node Systems + +A system that involves several machines communicating via a network is called a *distributed +system*. 
Each of the processes participating in a distributed system is called a *node*. There are +various reasons why you might want a system to be distributed: + +Inherently distributed systems +: If an application involves two or more interacting users, each using their own device, then the + system is unavoidably distributed: the communication between the devices will have to go via a + network. + +Requests between cloud services +: If data is stored in one service but processed in another, it must be transferred over the network + from one service to the other. + +Fault tolerance/high availability +: If your application needs to continue working even if one machine (or several machines, or + the network, or an entire datacenter) goes down, you can use multiple machines to give you + redundancy. When one fails, another one can take over. See [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability) and + [Chapter 6](/en/ch6#ch_replication) on replication. + +Scalability +: If your data volume or computing requirements grow bigger than a single machine can handle, + you can potentially spread the load across multiple machines. See + [“Scalability”](/en/ch2#sec_introduction_scalability). + +Latency +: If you have users around the world, you might want to have servers in various regions + worldwide so that each user can be served from a server that is geographically close to + them. That avoids the users having to wait for network packets to travel halfway around the + world to answer their requests. See [“Describing Performance”](/en/ch2#sec_introduction_percentiles). + +Elasticity +: If your application is busy at some times and idle at other times, a cloud deployment can scale up + or down to meet the demand, so that you pay only for resources you are actively using. This is more + difficult on a single machine, which needs to be provisioned to handle the maximum load, even at + times when it is barely used. 
+ +Using specialized hardware +: Different parts of the system can take advantage of different types of hardware to match their + workload. For example, an object store may use machines with many disks but few CPUs, whereas a + data analysis system may use machines with lots of CPU and memory but no disks, and a machine + learning system may use machines with GPUs (which are much more efficient than CPUs for training + deep neural networks and other machine learning tasks). + +Legal compliance +: Some countries have data residency laws that require data about people in their jurisdiction to be + stored and processed geographically within that country + [[41](/en/ch1#Korolov2022)]. + The scope of these rules varies—for example, in some cases it applies only to medical or financial + data, while other cases are broader. A service with users in several such jurisdictions will + therefore have to distribute their data across servers in several locations. + +Sustainability +: If you have flexibility on where and when to run your jobs, you might be able to run them in a + time and place where plenty of renewable electricity is available, and avoid running them when the + power grid is under strain. This can reduce your carbon emissions and allow you to take advantage + of cheap power when it is available + [[42](/en/ch1#Borenstein2025), + [43](/en/ch1#Acun2023)]. + +These reasons apply both to services that you write yourself (application code) and services +consisting of off-the-shelf software (such as databases). + +## Problems with Distributed Systems + +Distributed systems also have downsides. Every request and API call that goes via the network needs +to deal with the possibility of failure: the network may be interrupted, or the service may be +overloaded or crashed, and therefore any request may time out without receiving a response. In this +case, we don’t know whether the service received the request, and simply retrying it might not be +safe. 
 We will discuss these problems in detail in [Chapter 9](/en/ch9#ch_distributed). + +Although datacenter networks are fast, making a call to another service is still vastly slower than +calling a function in the same process +[[44](/en/ch1#Nath2019)]. +When operating on large volumes of data, rather than transferring the data from storage to a +separate machine that processes it, it can be faster to bring the computation to the machine that +already has the data +[[45](/en/ch1#Hellerstein2019)]. +More nodes are not always faster: in some cases, a simple single-threaded program on one computer +can perform significantly better than a cluster with over 100 CPU cores +[[46](/en/ch1#McSherry2015_ch1)]. + +Troubleshooting a distributed system is often difficult: if the system is slow to respond, how do +you figure out where the problem lies? Techniques for diagnosing problems in distributed systems are +developed under the heading of *observability* [[47](/en/ch1#Sridharan2018), +[48](/en/ch1#Majors2019)], +which involves collecting data about the execution of a system, and allowing it to be queried in +ways that allow both high-level metrics and individual events to be analyzed. *Tracing* tools such +as OpenTelemetry, Zipkin, and Jaeger allow you to track which client called which server for which +operation, and how long each call took +[[49](/en/ch1#Sigelman2010)]. + +Databases provide various mechanisms for ensuring data consistency, as we shall see in +[Chapter 6](/en/ch6#ch_replication) and [Chapter 8](/en/ch8#ch_transactions). However, when each service has its own database, +maintaining consistency of data across those different services becomes the application’s problem.
+Distributed transactions, which we explore in [Chapter 8](/en/ch8#ch_transactions), are a possible technique for +ensuring consistency, but they are rarely used in a microservices context because they run counter +to the goal of making services independent from each other, and many databases don’t support them +[[50](/en/ch1#Laigner2021)]. + +For all these reasons, if you can do something on a single machine, this is often much simpler and +cheaper compared to setting up a distributed system +[[23](/en/ch1#Badizadegan2022), +[46](/en/ch1#McSherry2015_ch1), +[51](/en/ch1#Tigani2023)]. +CPUs, memory, and disks have grown larger, faster, and more reliable. When combined with single-node +databases such as DuckDB, SQLite, and KùzuDB, many workloads can now run on a single node. We will +explore more on this topic in [Chapter 4](/en/ch4#ch_storage). + +## Microservices and Serverless + +The most common way of distributing a system across multiple machines is to divide them into clients +and servers, and let the clients make requests to the servers. Most commonly HTTP is used for this +communication, as we will discuss in [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc). The same process may be both a +server (handling incoming requests) and a client (making outbound requests to other services). + +This way of building applications has traditionally been called a *service-oriented architecture* +(SOA); more recently the idea has been refined into a *microservices* architecture +[[52](/en/ch1#Newman2021_ch1), +[53](/en/ch1#Richardson2014)]. +In this architecture, a service has one well-defined purpose (for example, in the case of S3, this +would be file storage); each service exposes an API that can be called by clients via the network, +and each service has one team that is responsible for its maintenance. A complex application can +thus be decomposed into multiple interacting services, each managed by a separate team. 
+ +There are several advantages to breaking down a complex piece of software into multiple services: +each service can be updated independently, reducing coordination effort among teams; each service +can be assigned the hardware resources it needs; and by hiding the implementation details behind an +API, the service owners are free to change the implementation without affecting clients. In terms of +data storage, it is common for each service to have its own databases, and not to share databases +between services: sharing a database would effectively make the entire database structure a part of +the service’s API, and then that structure would be difficult to change. Shared databases could also +cause one service’s queries to negatively impact the performance of other services. + +On the other hand, having many services can itself breed complexity: each service requires +infrastructure for deploying new releases, adjusting the allocated hardware resources to match the +load, collecting logs, monitoring service health, and alerting an on-call engineer in the case of a +problem. *Orchestration* frameworks such as Kubernetes have become a popular way of deploying +services, since they provide a foundation for this infrastructure. Testing a service during +development can be complicated, since you also need to run all the other services that it depends +on. + +Microservice APIs can be challenging to evolve. Clients that call an API expect the API to have +certain fields. Developers might wish to add fields to an API or remove fields from it as business needs change, +but doing so can cause clients to fail. Worse still, such failures are often not discovered until +late in the development cycle when the updated service API is deployed to a staging or production +environment. API description standards such as OpenAPI and gRPC help manage the relationship between +client and server APIs; we discuss these further in [Chapter 5](/en/ch5#ch_encoding).
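The API evolution problem described above can be illustrated with a small hand-rolled sketch. It does not use OpenAPI or gRPC; it simply parses JSON responses with Python's standard `json` module. The payloads, the field names, and the two client functions are hypothetical examples invented for this illustration.

```python
import json


def strict_client(payload):
    """Client that assumes every expected field is always present."""
    user = json.loads(payload)
    # Raises KeyError if the server has stopped sending "email".
    return f"{user['name']} <{user['email']}>"


def tolerant_client(payload):
    """Client that ignores unknown fields and tolerates missing ones."""
    user = json.loads(payload)
    name = user.get("name", "unknown")
    email = user.get("email")
    return f"{name} <{email}>" if email else name


# Three hypothetical versions of the same API response.
V1 = json.dumps({"name": "Ada", "email": "ada@example.com"})
V2_FIELD_ADDED = json.dumps(
    {"name": "Ada", "email": "ada@example.com", "locale": "en-GB"}
)
V3_FIELD_REMOVED = json.dumps({"name": "Ada", "locale": "en-GB"})


def strict_client_breaks_on(payload):
    """Return True if the strict client fails on this response."""
    try:
        strict_client(payload)
    except KeyError:
        return True
    return False
```

In this sketch, adding the `locale` field is harmless to both clients because neither looks at it, whereas removing `email` breaks the strict client at runtime. That is the kind of failure that tends to surface only once the updated API reaches a staging or production environment.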
+ +Microservices are primarily a technical solution to a people problem: allowing different teams to +make progress independently without having to coordinate with each other. This is valuable in a large +company, but in a small company where there are not many teams, using microservices is likely to be +unnecessary overhead, and it is preferable to implement the application in the simplest way possible +[[52](/en/ch1#Newman2021_ch1)]. + +*Serverless*, or *function-as-a-service* (FaaS), is another approach to deploying services, in which +the management of the infrastructure is outsourced to a cloud vendor +[[33](/en/ch1#Jonas2019)]. +When using virtual machines, you have to explicitly choose when to start up or shut down an +instance; in contrast, with the serverless model, the cloud provider automatically allocates and +frees hardware resources as needed, based on the incoming requests to your service +[[54](/en/ch1#Shahrad2020)]. Serverless deployment +shifts more of the operational burden to cloud providers and enables flexible billing by usage +rather than machine instances. To offer such benefits, many serverless infrastructure providers +impose a time limit on function execution, limit runtime environments, and might suffer from slow +start times when a function is first invoked. The term “serverless” can also be misleading: each +serverless function execution still runs on a server, but subsequent executions might run on a +different one. Moreover, infrastructure such as BigQuery and various Kafka offerings have adopted +“serverless” terminology to signal that their services auto-scale and that they bill by usage rather +than machine instances. 
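
The difference between billing by usage and billing by machine instance can be illustrated with a small calculation. The sketch below uses made-up prices (`PRICE_PER_INSTANCE_HOUR` and `PRICE_PER_GB_SECOND`) and a hypothetical workload; these numbers are assumptions chosen for illustration and do not reflect any provider's actual pricing model.

```python
HOURS_PER_MONTH = 730  # approximate number of hours in a month

# Hypothetical prices, chosen only for illustration.
PRICE_PER_INSTANCE_HOUR = 0.10   # a provisioned virtual machine, billed while it is up
PRICE_PER_GB_SECOND = 0.00002    # a serverless function, billed while code is running


def instance_cost(instances: int) -> float:
    """Monthly cost of machines that are billed whether or not they are busy."""
    return instances * HOURS_PER_MONTH * PRICE_PER_INSTANCE_HOUR


def usage_cost(requests: int, seconds_per_request: float, memory_gb: float) -> float:
    """Monthly cost when only the execution time of each request is billed."""
    return requests * seconds_per_request * memory_gb * PRICE_PER_GB_SECOND


# Compare one always-on instance with two hypothetical traffic levels.
provisioned = instance_cost(instances=1)
light_traffic = usage_cost(requests=1_000_000, seconds_per_request=0.2, memory_gb=0.5)
heavy_traffic = usage_cost(requests=100_000_000, seconds_per_request=0.2, memory_gb=0.5)
```

Under these assumed prices, the lightly used service costs less when billed by usage, while the heavily used one costs more than a single always-on instance. The comparison ignores whether one instance could actually serve that load, so it should be read as a sketch of the billing models rather than as a sizing guide.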
+ +Just like cloud storage replaced capacity planning (deciding in advance how many disks to buy) with +a metered billing model, the serverless approach is bringing metered billing to code execution: you +only pay for the time that your application code is actually running, rather than having to +provision resources in advance. + +## Cloud Computing versus Supercomputing + +Cloud computing is not the only way of building large-scale computing systems; an alternative is +*high-performance computing* (HPC), also known as *supercomputing*. Although there are overlaps, HPC +often has different priorities and uses different techniques compared to cloud computing and +enterprise datacenter systems. Some of those differences are: + +* Supercomputers are typically used for computationally intensive scientific computing tasks, such + as weather forecasting, climate modeling, molecular dynamics (simulating the movement of atoms and + molecules), complex optimization problems, and solving partial differential equations. On the + other hand, cloud computing tends to be used for online services, business data systems, and + similar systems that need to serve user requests with high availability. +* A supercomputer typically runs large batch jobs that checkpoint the state of their computation to + disk from time to time. If a node fails, a common solution is to simply stop the entire cluster + workload, repair the faulty node, and then restart the computation from the last checkpoint + [[55](/en/ch1#Barroso2018), + [56](/en/ch1#Fiala2012)]. + With cloud services, it is usually not desirable to stop the entire cluster, since the services + need to continually serve users with minimal interruptions. +* Supercomputer nodes typically communicate through shared memory and remote direct memory access + (RDMA), which support high bandwidth and low latency, but assume a high level of trust among the + users of the system [[57](/en/ch1#KornfeldSimpson2020)]. 
+ In cloud computing, the network and the machines are often shared by mutually untrusting + organizations, requiring stronger security mechanisms such as resource isolation (e.g., virtual + machines), encryption and authentication. +* Cloud datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to + provide high bisection bandwidth, a commonly used measure of a network’s overall performance + [[55](/en/ch1#Barroso2018), + [58](/en/ch1#Singh2015)]. + Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses + [[59](/en/ch1#Lockwood2014)], + which yield better performance for HPC workloads with known communication patterns. +* Cloud computing allows nodes to be distributed across multiple geographic regions, whereas + supercomputers generally assume that all of their nodes are close together. + +Large-scale analytics systems sometimes share some characteristics with supercomputing, which is why +it can be worth knowing about these techniques if you are working in this area. However, this book +is mostly concerned with services that need to be continually available, as discussed in +[“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability). + +# Data Systems, Law, and Society + +So far you’ve seen in this chapter that the architecture of data systems is influenced not only by +technical goals and requirements, but also by the human needs of the organizations that they +support. Increasingly, data systems engineers are realizing that serving the needs of their own +business is not enough: we also have a responsibility towards society at large. + +One particular concern is systems that store data about people and their behavior.
Since 2018 the +*General Data Protection Regulation* (GDPR) has given residents of many European countries greater +control and legal rights over their personal data, and similar privacy regulation has been adopted +in various other countries and states around the world, including for example the California +Consumer Privacy Act (CCPA). Regulations around AI, such as the *EU AI Act*, place further +restrictions on how personal data can be used. + +Moreover, even in areas that are not directly subject to regulation, there is increasing recognition +of the effects that computer systems have on people and society. Social media has changed how +individuals consume news, which influences their political opinions and hence may affect the outcome +of elections. Automated systems increasingly make decisions that have profound consequences for +individuals, such as deciding who should be given a loan or insurance coverage, who should be +invited to a job interview, or who should be suspected of a crime +[[60](/en/ch1#ONeil2016_ch1)]. + +Everyone who works on such systems shares a responsibility for considering the ethical impact and +ensuring that they comply with relevant law. It is not necessary for everybody to become an expert +in law and ethics, but a basic awareness of legal and ethical principles is just as important as, +say, some foundational knowledge in distributed systems. + +Legal considerations are influencing the very foundations of how data systems are being designed +[[61](/en/ch1#Shastri2020)]. +For example, the GDPR grants individuals the right to have their data erased on request (sometimes +known as the *right to be forgotten*). However, as we shall see in this book, many data systems rely +on immutable constructs such as append-only logs as part of their design; how can we ensure deletion +of some data in the middle of a file that is supposed to be immutable? 
How do we handle deletion of +data that has been incorporated into derived datasets (see [“Systems of Record and Derived Data”](/en/ch1#sec_introduction_derived)), such as +training data for machine learning models? Answering these questions creates new engineering +challenges. + +At present we don’t have clear guidelines on which particular technologies or system architectures +should be considered “GDPR-compliant” or not. The regulation deliberately does not mandate +particular technologies, because these may quickly change as technology progresses. Instead, the +legal texts set out high-level principles that are subject to interpretation. This means that there +are no simple answers to the question of how to comply with privacy regulation, but we will look at +some of the technologies in this book through this lens. + +In general, we store data because we think that its value is greater than the costs of storing it. +However, it is worth remembering that the costs of storage are not just the bill you pay for Amazon +S3 or another service: the cost-benefit calculation should also take into account the risks of +liability and reputational damage if the data were to be leaked or compromised by adversaries, and +the risk of legal costs and fines if the storage and processing of the data is found not to be +compliant with the law [[51](/en/ch1#Tigani2023)]. + +Governments or police forces might also compel companies to hand over data. When there is a risk +that the data may reveal criminalized behaviors (for example, homosexuality in several Middle +Eastern and African countries, or seeking an abortion in several US states), storing that data +creates real safety risks for users. Travel to an abortion clinic, for example, could easily be +revealed by location data, perhaps even by a log of the user’s IP addresses over time (which +indicate approximate location). 
+ +Once all the risks are taken into account, it might be reasonable to decide that some data is simply +not worth storing, and that it should therefore be deleted. This principle of *data minimization* +(sometimes known by the German term *Datensparsamkeit*) runs counter to the “big data” philosophy of +storing lots of data speculatively in case it turns out to be useful in the future +[[62](/en/ch1#Datensparsamkeit)]. +But it fits with the GDPR, which mandates that personal data may only be collected for a specified, +explicit purpose, that this data may not later be used for any other purpose, and that the data must +not be kept for longer than necessary for the purposes for which it was collected +[[63](/en/ch1#GDPR)]. + +Businesses have also taken notice of privacy and safety concerns. Credit card companies require +payment processing businesses to adhere to strict payment card industry (PCI) standards. Processors +undergo frequent evaluations from independent auditors to verify continued compliance. Software +vendors have also seen increased scrutiny. Many buyers now require their vendors to comply with +Service Organization Control (SOC) Type 2 standards. As with PCI compliance, vendors undergo third +party audits to verify adherence. + +Generally, it is important to balance the needs of your business against the needs of the people +whose data you are collecting and processing. There is much more to this topic; in [Link to Come] we +will go deeper into the topics of ethics and legal compliance, including the problems of bias and +discrimination. + +# Summary + +The theme of this chapter has been to understand trade-offs: that is, to recognize that for many +questions there is not one right answer, but several different approaches that each have various +pros and cons. We explored some of the most important choices that affect the architecture of data +systems, and introduced terminology that will be needed throughout the rest of this book. 
+ +We started by making a distinction between operational (transaction-processing, OLTP) and analytical +(OLAP) systems, and saw their different characteristics: not only managing different types of data +with different access patterns, but also serving different audiences. We encountered the concept of +a data warehouse and data lake, which receive data feeds from operational systems via ETL. In +[Chapter 4](/en/ch4#ch_storage) we will see that operational and analytical systems often use very different internal +data layouts because of the different types of queries they need to serve. + +We then compared cloud services, a comparatively recent development, to the traditional paradigm of +self-hosted software that has previously dominated data systems architecture. Which of these +approaches is more cost-effective depends a lot on your particular situation, but it’s undeniable +that cloud-native approaches are bringing big changes to the way data systems are architected, for +example in the way they separate storage and compute. + +Cloud systems are intrinsically distributed, and we briefly examined some of the trade-offs of +distributed systems compared to using a single machine. There are situations in which you can’t +avoid going distributed, but it’s advisable not to rush into making a system distributed if it’s +possible to keep it on a single machine. In [Chapter 9](/en/ch9#ch_distributed) we will cover the challenges with +distributed systems in more detail. + +Finally, we saw that data systems architecture is determined not only by the needs of the business +deploying the system, but also by privacy regulation that protects the rights of the people whose +data is being processed—an aspect that many engineers are prone to ignoring. How we translate legal +requirements into technical implementations is not yet well understood, but it’s important to keep +this question in mind as we move through the rest of this book. 
+ +##### Footnotes + +##### References + +[[1](/en/ch1#Kouzes2009-marker)] Richard T. Kouzes, +Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. +[The +Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, +January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26) + +[[2](/en/ch1#Kleppmann2019_ch1-marker)] Martin Kleppmann, Adam Wiggins, Peter van +Hardenberg, and Mark McGranaghan. [Local-first +software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International +Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), +October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737) + +[[3](/en/ch1#Reis2022-marker)] Joe Reis and Matt Housley. +[*Fundamentals +of Data Engineering*](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/). O’Reilly Media, 2022. ISBN: 9781098108304 + +[[4](/en/ch1#Machado2023-marker)] Rui Pedro Machado and Helder Russa. +[*Analytics +Engineering with SQL and dbt*](https://www.oreilly.com/library/view/analytics-engineering-with/9781098142377/). O’Reilly Media, 2023. ISBN: 9781098142384 + +[[5](/en/ch1#Codd1993-marker)] Edgar F. Codd, S. B. Codd, and C. T. Salley. +[Providing +OLAP to User-Analysts: An IT Mandate](https://www.estgv.ipv.pt/PaginasPessoais/jloureiro/ESI_AID2007_2008/fichas/codd.pdf). E. F. Codd Associates, 1993. +Archived at [perma.cc/RKX8-2GEE](https://perma.cc/RKX8-2GEE) + +[[6](/en/ch1#Soman2023-marker)] Chinmay Soman and Neha Pawar. +[Comparing Three +Real-Time OLAP Databases: Apache Pinot, Apache Druid, and ClickHouse](https://startree.ai/blog/a-tale-of-three-real-time-olap-databases). *startree.ai*, +April 2023. 
Archived at [perma.cc/8BZP-VWPA](https://perma.cc/8BZP-VWPA) + +[[7](/en/ch1#Chaudhuri1997-marker)] Surajit Chaudhuri and Umeshwar Dayal. +[An Overview of Data +Warehousing and OLAP Technology](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigrecord.pdf). *ACM SIGMOD Record*, volume 26, issue 1, pages 65–74, +March 1997. [doi:10.1145/248603.248616](https://doi.org/10.1145/248603.248616) + +[[8](/en/ch1#Ozcan2017-marker)] Fatma Özcan, Yuanyuan Tian, and Pinar Tözün. +[Hybrid Transactional/Analytical +Processing: A Survey](https://humming80.github.io/papers/sigmod-htaptut.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017. +[doi:10.1145/3035918.3054784](https://doi.org/10.1145/3035918.3054784) + +[[9](/en/ch1#Prout2022_ch1-marker)] Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu +Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. +[Cloud-Native Transactions and Analytics +in SingleStore](https://dl.acm.org/doi/abs/10.1145/3514221.3526055). At *International Conference on Management of Data* (SIGMOD), June 2022. +[doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055) + +[[10](/en/ch1#Zhang2024-marker)] Chao Zhang, Guoliang Li, Jintao Zhang, +Xinning Zhang, and Jianhua Feng. +[HTAP Databases: A Survey](https://arxiv.org/pdf/2404.15670). +*IEEE Transactions on Knowledge and Data Engineering*, April 2024. +[doi:10.1109/TKDE.2024.3389693](https://doi.org/10.1109/TKDE.2024.3389693) + +[[11](/en/ch1#Stonebraker2005fitsall-marker)] Michael Stonebraker and Uğur Çetintemel. +[‘One Size Fits All’: An +Idea Whose Time Has Come and Gone](https://pages.cs.wisc.edu/~shivaram/cs744-readings/fits_all.pdf). At *21st International Conference on Data Engineering* +(ICDE), April 2005. [doi:10.1109/ICDE.2005.1](https://doi.org/10.1109/ICDE.2005.1) + +[[12](/en/ch1#Cohen2009-marker)] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. +Hellerstein, and Caleb Welton. 
[MAD Skills: +New Analysis Practices for Big Data](https://www.vldb.org/pvldb/vol2/vldb09-219.pdf). *Proceedings of the VLDB Endowment*, volume 2, +issue 2, pages 1481–1492, August 2009. +[doi:10.14778/1687553.1687576](https://doi.org/10.14778/1687553.1687576) + +[[13](/en/ch1#Olteanu2020-marker)] Dan Olteanu. +[The Relational Data Borg is Learning](https://www.vldb.org/pvldb/vol13/p3502-olteanu.pdf). +*Proceedings of the VLDB Endowment*, volume 13, issue 12, August 2020. +[doi:10.14778/3415478.3415572](https://doi.org/10.14778/3415478.3415572) + +[[14](/en/ch1#Bornstein2020-marker)] Matt Bornstein, Martin Casado, and Jennifer Li. +[Emerging +Architectures for Modern Data Infrastructure: 2020](https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/). *future.a16z.com*, October 2020. +Archived at [perma.cc/LF8W-KDCC](https://perma.cc/LF8W-KDCC) + +[[15](/en/ch1#Fowler2015-marker)] Martin Fowler. +[DataLake](https://www.martinfowler.com/bliki/DataLake.html). +*martinfowler.com*, February 2015. +Archived at [perma.cc/4WKN-CZUK](https://perma.cc/4WKN-CZUK) + +[[16](/en/ch1#Johnson2015-marker)] Bobby Johnson and Joseph Adler. +[The +Sushi Principle: Raw Data Is Better](https://learning.oreilly.com/videos/strata-hadoop/9781491924143/9781491924143-video210840/). At *Strata+Hadoop World*, February 2015. + +[[17](/en/ch1#Armbrust2021-marker)] Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. +[Lakehouse: A New Generation of +Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference +on Innovative Data Systems Research* (CIDR), January 2021. + +[[18](/en/ch1#DataOps-marker)] DataKitchen, Inc. +[The DataOps Manifesto](https://dataopsmanifesto.org/en/). *dataopsmanifesto.org*, 2017. +Archived at [perma.cc/3F5N-FUQ4](https://perma.cc/3F5N-FUQ4) + +[[19](/en/ch1#Manohar2021-marker)] Tejas Manohar. 
+[What is Reverse ETL: A Definition & Why It’s +Taking Off](https://hightouch.io/blog/reverse-etl/). *hightouch.io*, November 2021. +Archived at [perma.cc/A7TN-GLYJ](https://perma.cc/A7TN-GLYJ) + +[[20](/en/ch1#ORegan2018-marker)] Simon O’Regan. +[Designing Data +Products](https://towardsdatascience.com/designing-data-products-b6b93edf3d23). *towardsdatascience.com*, August 2018. +Archived at [perma.cc/HU67-3RV8](https://perma.cc/HU67-3RV8) + +[[21](/en/ch1#Fournier2021-marker)] Camille Fournier. +[Why is it so +hard to decide to buy?](https://skamille.medium.com/why-is-it-so-hard-to-decide-to-buy-d86fee98e88e) *skamille.medium.com*, July 2021. +Archived at [perma.cc/6VSG-HQ5X](https://perma.cc/6VSG-HQ5X) + +[[22](/en/ch1#HeinemeierHansson2022-marker)] David Heinemeier Hansson. +[Why we’re leaving the cloud](https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0). +*world.hey.com*, October 2022. +Archived at [perma.cc/82E6-UJ65](https://perma.cc/82E6-UJ65) + +[[23](/en/ch1#Badizadegan2022-marker)] Nima Badizadegan. +[Use One Big Server](https://specbranch.com/posts/one-big-server/). +*specbranch.com*, August 2022. +Archived at [perma.cc/M8NB-95UK](https://perma.cc/M8NB-95UK) + +[[24](/en/ch1#Yegge2020-marker)] Steve Yegge. +[Dear +Google Cloud: Your Deprecation Policy is Killing You](https://steve-yegge.medium.com/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc). *steve-yegge.medium.com*, August 2020. +Archived at [perma.cc/KQP9-SPGU](https://perma.cc/KQP9-SPGU) + +[[25](/en/ch1#Verbitski2017-marker)] Alexandre Verbitski, Anurag Gupta, Debanjan +Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz +Kharatishvili, and Xiaofeng Bao. +[Amazon +Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases](https://media.amazonwebservices.com/blog/2017/aurora-design-considerations-paper.pdf). 
+At *ACM International Conference on Management of Data* (SIGMOD), pages 1041–1052, May 2017. +[doi:10.1145/3035918.3056101](https://doi.org/10.1145/3035918.3056101) + +[[26](/en/ch1#Antonopoulos2019_ch1-marker)] Panagiotis Antonopoulos, Alex Budovski, Cristian +Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar +Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna +Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. +[Socrates: The +New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data* +(SIGMOD), pages 1743–1756, June 2019. +[doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047) + +[[27](/en/ch1#Vuppalapati2020-marker)] Midhul Vuppalapati, Justin Miron, Rachit Agarwal, +Dan Truong, Ashish Motivala, and Thierry Cruanes. +[Building An Elastic Query +Engine on Disaggregated Storage](https://www.usenix.org/system/files/nsdi20-paper-vuppalapati.pdf). At *17th USENIX Symposium on Networked Systems Design and +Implementation* (NSDI), February 2020. + +[[28](/en/ch1#NickVanWiggeren2025-marker)] Nick Van Wiggeren. +[The Real Failure Rate of EBS](https://planetscale.com/blog/the-real-fail-rate-of-ebs). +*planetscale.com*, March 2025. +Archived at [perma.cc/43CR-SAH5](https://perma.cc/43CR-SAH5) + +[[29](/en/ch1#Breck2024-marker)] Colin Breck. +[Predicting the +Future of Distributed Systems](https://blog.colinbreck.com/predicting-the-future-of-distributed-systems/). *blog.colinbreck.com*, August 2024. +Archived at [perma.cc/K5FC-4XX2](https://perma.cc/K5FC-4XX2) + +[[30](/en/ch1#Shapira2023separation-marker)] Gwen Shapira. +[Compute-Storage Separation Explained](https://www.thenile.dev/blog/storage-compute). +*thenile.dev*, January 2023. 
Archived at +[perma.cc/QCV3-XJNZ](https://perma.cc/QCV3-XJNZ) + +[[31](/en/ch1#Murthy2022-marker)] Ravi Murthy and Gurmeet Goindi. +[AlloyDB +for PostgreSQL under the hood: Intelligent, database-aware storage](https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage). *cloud.google.com*, +May 2022. Archived at +[archive.org](https://web.archive.org/web/20220514021120/https%3A//cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage) + +[[32](/en/ch1#Vanlightly2023serverless-marker)] Jack Vanlightly. +[The +Architecture of Serverless Data Systems](https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems). *jack-vanlightly.com*, November 2023. +Archived at [perma.cc/UDV4-TNJ5](https://perma.cc/UDV4-TNJ5) + +[[33](/en/ch1#Jonas2019-marker)] Eric Jonas, Johann Schleier-Smith, Vikram +Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, +Neeraja Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, David A. Patterson. +[Cloud Programming Simplified: A Berkeley View on +Serverless Computing](https://arxiv.org/abs/1902.03383). *arxiv.org*, February 2019. + +[[34](/en/ch1#Beyer2016-marker)] Betsy Beyer, Jennifer Petoff, Chris +Jones, and Niall Richard Murphy. +[*Site +Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/). +O’Reilly Media, 2016. ISBN: 9781491929124 + +[[35](/en/ch1#Limoncelli2020-marker)] Thomas Limoncelli. +[The Time I Stole $10,000 from Bell Labs](https://queue.acm.org/detail.cfm?id=3434773). +*ACM Queue*, volume 18, issue 5, November 2020. +[doi:10.1145/3434571.3434773](https://doi.org/10.1145/3434571.3434773) + +[[36](/en/ch1#Majors2020-marker)] Charity Majors. +[The Future of Ops Jobs](https://acloudguru.com/blog/engineering/the-future-of-ops-jobs). +*acloudguru.com*, August 2020. 
+Archived at [perma.cc/GRU2-CZG3](https://perma.cc/GRU2-CZG3) + +[[37](/en/ch1#Cherkasky2021-marker)] Boris Cherkasky. +[(Over)Pay +As You Go for Your Datastore](https://medium.com/riskified-technology/over-pay-as-you-go-for-your-datastore-11a29ae49a8b). *medium.com*, September 2021. +Archived at [perma.cc/Q8TV-2AM2](https://perma.cc/Q8TV-2AM2) + +[[38](/en/ch1#Kushchi2023-marker)] Shlomi Kushchi. +[Serverless Doesn’t Mean +DevOpsLess or NoOps](https://thenewstack.io/serverless-doesnt-mean-devopsless-or-noops/). *thenewstack.io*, February 2023. +Archived at [perma.cc/3NJR-AYYU](https://perma.cc/3NJR-AYYU) + +[[39](/en/ch1#Bernhardsson2021-marker)] Erik Bernhardsson. +[Storm +in the stratosphere: how the cloud will be reshuffled](https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html). *erikbern.com*, November 2021. +Archived at [perma.cc/SYB2-99P3](https://perma.cc/SYB2-99P3) + +[[40](/en/ch1#Stancil2021-marker)] Benn Stancil. +[The data OS](https://benn.substack.com/p/the-data-os). *benn.substack.com*, +September 2021. Archived at [perma.cc/WQ43-FHS6](https://perma.cc/WQ43-FHS6) + +[[41](/en/ch1#Korolov2022-marker)] Maria Korolov. +[Data +residency laws pushing companies toward residency as a service](https://www.csoonline.com/article/3647761/data-residency-laws-pushing-companies-toward-residency-as-a-service.html). *csoonline.com*, +January 2022. Archived at [perma.cc/CHE4-XZZ2](https://perma.cc/CHE4-XZZ2) + +[[42](/en/ch1#Borenstein2025-marker)] Severin Borenstein. +[Can +Data Centers Flex Their Power Demand?](https://energyathaas.wordpress.com/2025/04/14/can-data-centers-flex-their-power-demand/) *energyathaas.wordpress.com*, April 2025. +Archived at + +[[43](/en/ch1#Acun2023-marker)] Bilge Acun, Benjamin Lee, Fiodar Kazhamiaka, Aditya +Sundarrajan, Kiwan Maeng, Manoj Chakkaravarthy, David Brooks, and Carole-Jean Wu. 
+[Carbon Dependencies in +Datacenter Design and Management](https://hotcarbon.org/assets/2022/pdf/hotcarbon22-acun.pdf). +*ACM SIGENERGY Energy Informatics Review*, volume 3, issue 3, pages 21–26. +[doi:10.1145/3630614.3630619](https://doi.org/10.1145/3630614.3630619) + +[[44](/en/ch1#Nath2019-marker)] Kousik Nath. +[These are +the numbers every computer engineer should know](https://www.freecodecamp.org/news/must-know-numbers-for-every-computer-engineer/). *freecodecamp.org*, September 2019. +Archived at [perma.cc/RW73-36RL](https://perma.cc/RW73-36RL) + +[[45](/en/ch1#Hellerstein2019-marker)] Joseph M. Hellerstein, Jose Faleiro, Joseph E. +Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu. +[Serverless Computing: One Step Forward, Two Steps Back](https://arxiv.org/abs/1812.03651). +At *Conference on Innovative Data Systems Research* (CIDR), January 2019. + +[[46](/en/ch1#McSherry2015_ch1-marker)] Frank McSherry, Michael Isard, and Derek G. Murray. +[Scalability! +But at What COST?](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf) At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), +May 2015. + +[[47](/en/ch1#Sridharan2018-marker)] Cindy Sridharan. +*[Distributed +Systems Observability: A Guide to Building Robust Systems](https://unlimited.humio.com/rs/756-LMY-106/images/Distributed-Systems-Observability-eBook.pdf)*. Report, O’Reilly Media, May 2018. +Archived at [perma.cc/M6JL-XKCM](https://perma.cc/M6JL-XKCM) + +[[48](/en/ch1#Majors2019-marker)] Charity Majors. +[Observability — A 3-Year +Retrospective](https://thenewstack.io/observability-a-3-year-retrospective/). *thenewstack.io*, August 2019. +Archived at [perma.cc/CG62-TJWL](https://perma.cc/CG62-TJWL) + +[[49](/en/ch1#Sigelman2010-marker)] Benjamin H. Sigelman, Luiz André Barroso, Mike +Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. 
+[Dapper, a Large-Scale Distributed Systems Tracing +Infrastructure](https://research.google/pubs/pub36356/). Google Technical Report dapper-2010-1, April 2010. +Archived at [perma.cc/K7KU-2TMH](https://perma.cc/K7KU-2TMH) + +[[50](/en/ch1#Laigner2021-marker)] Rodrigo Laigner, Yongluan Zhou, Marcos Antonio +Vaz Salles, Yijian Liu, and Marcos Kalinowski. +[Data management in microservices: State +of the practice, challenges, and research directions](https://www.vldb.org/pvldb/vol14/p3348-laigner.pdf). *Proceedings of the VLDB Endowment*, +volume 14, issue 13, pages 3348–3361, September 2021. +[doi:10.14778/3484224.3484232](https://doi.org/10.14778/3484224.3484232) + +[[51](/en/ch1#Tigani2023-marker)] Jordan Tigani. +[Big Data is Dead](https://motherduck.com/blog/big-data-is-dead/). +*motherduck.com*, February 2023. +Archived at [perma.cc/HT4Q-K77U](https://perma.cc/HT4Q-K77U) + +[[52](/en/ch1#Newman2021_ch1-marker)] Sam Newman. +[*Building +Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O’Reilly Media, 2021. ISBN: 9781492034025 + +[[53](/en/ch1#Richardson2014-marker)] Chris Richardson. +[Microservices: Decomposing +Applications for Deployability and Scalability](https://www.infoq.com/articles/microservices-intro/). *infoq.com*, May 2014. +Archived at [perma.cc/CKN4-YEQ2](https://perma.cc/CKN4-YEQ2) + +[[54](/en/ch1#Shahrad2020-marker)] Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, +Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, Ricardo Bianchini. +[Serverless in the Wild: +Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider](https://www.usenix.org/system/files/atc20-shahrad.pdf). +At *USENIX Annual Technical Conference* (ATC), July 2020. + +[[55](/en/ch1#Barroso2018-marker)] Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. 
+[The Datacenter as a +Computer: Designing Warehouse-Scale Machines](https://www.morganclaypool.com/doi/10.2200/S00874ED3V01Y201809CAC046), third edition. +Morgan & Claypool Synthesis Lectures on Computer Architecture, October 2018. +[doi:10.2200/S00874ED3V01Y201809CAC046](https://doi.org/10.2200/S00874ED3V01Y201809CAC046) + +[[56](/en/ch1#Fiala2012-marker)] David Fiala, Frank Mueller, Christian Engelmann, Rolf +Riesen, Kurt Ferreira, and Ron Brightwell. +[Detection and +Correction of Silent Data Corruption for Large-Scale High-Performance Computing](https://arcb.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/sc12.pdf). At +*International Conference for High Performance Computing, Networking, Storage and +Analysis* (SC), November 2012. +[doi:10.1109/SC.2012.49](https://doi.org/10.1109/SC.2012.49) + +[[57](/en/ch1#KornfeldSimpson2020-marker)] Anna Kornfeld +Simpson, Adriana Szekeres, Jacob Nelson, and Irene Zhang. +[Securing RDMA +for High-Performance Datacenter Storage Systems](https://www.usenix.org/conference/hotcloud20/presentation/kornfeld-simpson). At *12th USENIX Workshop on Hot Topics in +Cloud Computing* (HotCloud), July 2020. + +[[58](/en/ch1#Singh2015-marker)] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, +Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, +Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat. +[Jupiter Rising: A +Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf). At +*Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. +[doi:10.1145/2785956.2787508](https://doi.org/10.1145/2785956.2787508) + +[[59](/en/ch1#Lockwood2014-marker)] Glenn K. Lockwood. +[Hadoop’s +Uncomfortable Fit in HPC](https://blog.glennklockwood.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html).
*glennklockwood.blogspot.co.uk*, May 2014. +Archived at [perma.cc/S8XX-Y67B](https://perma.cc/S8XX-Y67B) + +[[60](/en/ch1#ONeil2016_ch1-marker)] Cathy O’Neil. *Weapons of Math Destruction: +How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. +ISBN: 9780553418811 + +[[61](/en/ch1#Shastri2020-marker)] Supreeth Shastri, Vinay Banakar, Melissa +Wasserman, Arun Kumar, and Vijay Chidambaram. +[Understanding and Benchmarking the +Impact of GDPR on Database Systems](https://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue +7, pages 1064–1077, March 2020. +[doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354) + +[[62](/en/ch1#Datensparsamkeit-marker)] Martin Fowler. +[Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html). +*martinfowler.com*, December 2013. +Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6) + +[[63](/en/ch1#GDPR-marker)] [Regulation +(EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data +Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016. -1. Michael Stonebraker and Uğur Çetintemel: “['One Size Fits All': An Idea Whose Time Has Come and Gone](https://cs.brown.edu/~ugur/fits_all.pdf),” at *21st International Conference on Data Engineering* (ICDE), April 2005. -1. Walter L. Heimerdinger and Charles B. Weinstock: “[A Conceptual Framework for System Fault Tolerance](https://resources.sei.cmu.edu/asset_files/TechnicalReport/1992_005_001_16112.pdf),” Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992. -1. 
Ding Yuan, Yu Luo, Xin Zhuang, et al.: “[Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf),” at *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014. -1. Yury Izrailevsky and Ariel Tseitlin: “[The Netflix Simian Army](https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116),” *netflixtechblog.com*, July 19, 2011. -1. Daniel Ford, François Labelle, Florentina I. Popovici, et al.: “[Availability in Globally Distributed Storage Systems](http://research.google.com/pubs/archive/36737.pdf),” at *9th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2010. -1. Brian Beach: “[Hard Drive Reliability Update – Sep 2014](https://www.backblaze.com/blog/hard-drive-reliability-update-september-2014/),” *backblaze.com*, September 23, 2014. -1. Laurie Voss: “[AWS: The Good, the Bad and the Ugly](https://web.archive.org/web/20160429075023/http://blog.awe.sm/2012/12/18/aws-the-good-the-bad-and-the-ugly/),” *blog.awe.sm*, December 18, 2012. -1. Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “[What Bugs Live in the Cloud?](http://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf),” at *5th ACM Symposium on Cloud Computing* (SoCC), November 2014. [doi:10.1145/2670979.2670986](http://dx.doi.org/10.1145/2670979.2670986) -1. Nelson Minar: “[Leap Second Crashes Half the Internet](http://www.somebits.com/weblog/tech/bad/leap-second-2012.html),” *somebits.com*, July 3, 2012. -1. Amazon Web Services: “[Summary of the Amazon EC2 and Amazon RDS Service Disruption in the US East Region](http://aws.amazon.com/message/65648/),” *aws.amazon.com*, April 29, 2011. -1. Richard I. Cook: “[How Complex Systems Fail](https://www.adaptivecapacitylabs.com/HowComplexSystemsFail.pdf),” Cognitive Technologies Laboratory, April 2000. -1. 
Jay Kreps: “[Getting Real About Distributed System Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability),” *blog.empathybox.com*, March 19, 2012. -1. David Oppenheimer, Archana Ganapathi, and David A. Patterson: “[Why Do Internet Services Fail, and What Can Be Done About It?](http://static.usenix.org/legacy/events/usits03/tech/full_papers/oppenheimer/oppenheimer.pdf),” at *4th USENIX Symposium on Internet Technologies and Systems* (USITS), March 2003. -1. Nathan Marz: “[Principles of Software Engineering, Part 1](http://nathanmarz.com/blog/principles-of-software-engineering-part-1.html),” *nathanmarz.com*, April 2, 2013. -1. Michael Jurewitz: “[The Human Impact of Bugs](http://jury.me/blog/2013/3/14/the-human-impact-of-bugs),” *jury.me*, March 15, 2013. -1. Raffi Krikorian: “[Timelines at Scale](http://www.infoq.com/presentations/Twitter-Timeline-Scalability),” at *QCon San Francisco*, November 2012. -1. Martin Fowler: *Patterns of Enterprise Application Architecture*. Addison Wesley, 2002. ISBN: 978-0-321-12742-6 -1. Kelly Sommers: “[After all that run around, what caused 500ms disk latency even when we replaced physical server?](https://twitter.com/kellabyte/status/532930540777635840)” *twitter.com*, November 13, 2014. -1. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “[Dynamo: Amazon's Highly Available Key-Value Store](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf),” at *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007. -1. Greg Linden: “[Make Data Useful](http://glinden.blogspot.co.uk/2006/12/slides-from-my-talk-at-stanford.html),” slides from presentation at Stanford University Data Mining class (CS345), December 2006. -1. Tammy Everts: “[The Real Cost of Slow Time vs Downtime](https://www.slideshare.net/Radware/radware-cmg2014-tammyevertsslowtimevsdowntime),” *slideshare.net*, November 5, 2014. -1. 
Jake Brutlag: “[Speed Matters](https://ai.googleblog.com/2009/06/speed-matters.html),” *ai.googleblog.com*, June 23, 2009. -1. Tyler Treat: “[Everything You Know About Latency Is Wrong](http://bravenewgeek.com/everything-you-know-about-latency-is-wrong/),” *bravenewgeek.com*, December 12, 2015. -1. Jeffrey Dean and Luiz André Barroso: “[The Tail at Scale](http://cacm.acm.org/magazines/2013/2/160173-the-tail-at-scale/fulltext),” *Communications of the ACM*, volume 56, number 2, pages 74–80, February 2013. [doi:10.1145/2408776.2408794](http://dx.doi.org/10.1145/2408776.2408794) -1. Graham Cormode, Vladislav Shkapenyuk, Divesh Srivastava, and Bojian Xu: “[Forward Decay: A Practical Time Decay Model for Streaming Systems](http://dimacs.rutgers.edu/~graham/pubs/papers/fwddecay.pdf),” at *25th IEEE International Conference on Data Engineering* (ICDE), March 2009. -1. Ted Dunning and Otmar Ertl: “[Computing Extremely Accurate Quantiles Using t-Digests](https://github.com/tdunning/t-digest),” *github.com*, March 2014. -1. Gil Tene: “[HdrHistogram](http://www.hdrhistogram.org/),” *hdrhistogram.org*. -1. Baron Schwartz: “[Why Percentiles Don’t Work the Way You Think](https://orangematter.solarwinds.com/2016/11/18/why-percentiles-dont-work-the-way-you-think/),” *solarwinds.com*, November 18, 2016. -1. James Hamilton: “[On Designing and Deploying Internet-Scale Services](https://www.usenix.org/legacy/events/lisa07/tech/full_papers/hamilton/hamilton.pdf),” at *21st Large Installation System Administration Conference* (LISA), November 2007. -1. Brian Foote and Joseph Yoder: “[Big Ball of Mud](http://www.laputan.org/pub/foote/mud.pdf),” at *4th Conference on Pattern Languages of Programs* (PLoP), September 1997. -1. Frederick P Brooks: “No Silver Bullet – Essence and Accident in Software Engineering,” in *The Mythical Man-Month*, Anniversary edition, Addison-Wesley, 1995. ISBN: 978-0-201-83595-3 -1. 
Ben Moseley and Peter Marks: “[Out of the Tar Pit](https://curtclifton.net/papers/MoseleyMarks06a.pdf),” at *BCS Software Practice Advancement* (SPA), 2006. -1. Rich Hickey: “[Simple Made Easy](http://www.infoq.com/presentations/Simple-Made-Easy),” at *Strange Loop*, September 2011. -1. Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson: “[Analyzing Software Evolvability](http://www.es.mdh.se/pdf_publications/1251.pdf),” at *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008. [doi:10.1109/COMPSAC.2008.50](http://dx.doi.org/10.1109/COMPSAC.2008.50) \ No newline at end of file diff --git a/content/en/ch10.md b/content/en/ch10.md index edda26c..64d21a4 100644 --- a/content/en/ch10.md +++ b/content/en/ch10.md @@ -1,178 +1,2232 @@ --- -title: "10. Batch Processing" -linkTitle: "10. Batch Processing" -weight: 310 +title: "10. Consistency and Consensus" +weight: 210 breadcrumbs: false ---- +--- -![](/img/ch10.png) - -> *A system cannot be successful if it is too strongly influenced by a single person. Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments.* +> *An ancient adage warns, “Never go to sea with two chronometers; take one or three.”* > -> ​ — Donald Knuth +> Frederick P. Brooks Jr., *The Mythical Man-Month: Essays on Software Engineering* (1995) ---------------- +Lots of things can go wrong in distributed systems, as discussed in [Chapter 9](/en/ch9#ch_distributed). If we want a +service to continue working correctly despite those things going wrong, we need to find ways of +tolerating faults. -In the first two parts of this book we talked a lot about *requests* and *queries*, and the corresponding *responses* or *results*. This style of data processing is assumed in many modern data systems: you ask for something, or you send an instruction, and some time later the system (hopefully) gives you an answer. 
Databases, caches, search indexes, web servers, and many other systems work this way. +One of the best tools we have for fault tolerance is *replication*. However, as we saw in +[Chapter 6](/en/ch6#ch_replication), having multiple copies of the data on multiple replicas opens up the risk of +inconsistencies. Reads might be handled by a replica that is not up-to-date, yielding stale results. +If multiple replicas can accept writes, we have to deal with conflicts between values that were +concurrently written on different replicas. At a high level, there are two competing philosophies +for dealing with such issues: -In such *online* systems, whether it’s a web browser requesting a page or a service call‐ ing a remote API, we generally assume that the request is triggered by a human user, and that the user is waiting for the response. They shouldn’t have to wait too long, so we pay a lot of attention to the *response time* of these systems (see “[Describing Performance](/en/ch1#describing-performance)”). +Eventual consistency +: In this philosophy, the fact that a system is replicated is made visible to the application, and + you as application developer are expected to deal with the inconsistencies and conflicts that may + arise. This approach is often used in systems with multi-leader (see + [“Multi-Leader Replication”](/en/ch6#sec_replication_multi_leader)) and leaderless replication (see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)). -The web, and increasing numbers of HTTP/REST-based APIs, has made the request/ response style of interaction so common that it’s easy to take it for granted. But we should remember that it’s not the only way of building systems, and that other approaches have their merits too. Let’s distinguish three different types of systems: +Strong consistency +: This philosophy says that applications should not have to worry about internal details of + replication, and that the system should behave as if it was single-node. 
The advantage of this + approach is that it’s simpler for you, the application developer. The disadvantage is that + stronger consistency has a performance cost, and some kinds of fault that an eventually consistent + system can tolerate cause outages in strongly consistent systems. -***Services (online systems)*** +As always, which approach is better depends on your application. If you have an app where users can +make changes to data while offline, then eventual consistency is inevitable, as discussed in +[“Sync Engines and Local-First Software”](/en/ch6#sec_replication_offline_clients). However, eventual consistency can also be difficult for +applications to deal with. If your replicas are located in datacenters with fast, reliable +communication, then strong consistency is often appropriate because its cost is acceptable. -A service waits for a request or instruction from a client to arrive. When one is received, the service tries to handle it as quickly as possible and sends a response back. Response time is usually the primary measure of performance of a service, and availability is often very important (if the client can’t reach the service, the user will probably get an error message). +In this chapter we will dive deeper into the strongly consistent approach, looking at three areas: -***Batch processing systems (offline systems)*** +1. One challenge is that “strong consistency” is quite vague, so we will develop a more precise + definition of what we want to achieve: *linearizability*. +2. We will look at the problem of generating IDs and timestamps. This may sound unrelated to + consistency but is actually closely connected. +3. We will explore how distributed systems can achieve linearizability while still remaining + fault-tolerant; the answer is *consensus* algorithms. -A batch processing system takes a large amount of input data, runs a *job* to pro‐ cess it, and produces some output data. 
Jobs often take a while (from a few minutes to several days), so there normally isn’t a user waiting for the job to fin‐ ish. Instead, batch jobs are often scheduled to run periodically (for example, once a day). The primary performance measure of a batch job is usually *throughput* (the time it takes to crunch through an input dataset of a certain size). We dis‐ cuss batch processing in this chapter. +Along the way, we will see that there are some fundamental limits on what is possible and what is +not in a distributed system. -***Stream processing systems (near-real-time systems)*** +The topics of this chapter are notorious for being hard to implement correctly; it’s very easy to +build systems that behave fine when there are no faults, but which completely fall apart when faced +with an unlucky combination of faults that the designer of the system hadn’t considered. A lot of +theory has been developed to help us think through those edge cases, which enables us to build +systems that can robustly tolerate faults. -Stream processing is somewhere between online and offline/batch processing (so it is sometimes called *near-real-time* or *nearline* processing). Like a batch pro‐ cessing system, a stream processor consumes inputs and produces outputs (rather than responding to requests). However, a stream job operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data. This difference allows stream processing systems to have lower latency than the equivalent batch systems. As stream processing builds upon batch process‐ ing, we discuss it in [Chapter 11](/en/ch11). +This chapter will only scratch the surface: we will stick with informal intuitions, and avoid the +algorithmic nitty-gritty, formal models, and proofs. If you want to do serious work on consensus +systems and similar infrastructure, you will need to go much deeper into the theory if you want any +chance of your systems being robust. 
As usual, the literature references in this chapter provide +some initial pointers. -As we shall see in this chapter, batch processing is an important building block in our quest to build reliable, scalable, and maintainable applications. For example, Map‐ Reduce, a batch processing algorithm published in 2004 [1], was (perhaps over- enthusiastically) called “the algorithm that makes Google so massively scalable” [2]. It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB. +# Linearizability -MapReduce is a fairly low-level programming model compared to the parallel pro‐ cessing systems that were developed for data warehouses many years previously [3, 4], but it was a major step forward in terms of the scale of processing that could be achieved on commodity hardware. Although the importance of MapReduce is now declining [5], it is still worth understanding, because it provides a clear picture of why and how batch processing is useful. +If you want a replicated database to be as simple as possible to use, you should make it behave as +if it wasn’t replicated at all. Then users don’t have to worry about replication lag, conflicts, and +other inconsistencies. That would give us the advantage of fault tolerance, but without the +complexity arising from having to think about multiple replicas. -In fact, batch processing is a very old form of computing. Long before programmable digital computers were invented, punch card tabulating machines—such as the Hol‐ lerith machines used in the 1890 US Census [6]—implemented a semi-mechanized form of batch processing to compute aggregate statistics from large inputs. And Map‐ Reduce bears an uncanny resemblance to the electromechanical IBM card-sorting machines that were widely used for business data processing in the 1940s and 1950s [7]. As usual, history has a tendency of repeating itself. 
+This is the idea behind *linearizability* +[[1](/en/ch10#Herlihy1990)] +(also known as *atomic consistency* +[[2](/en/ch10#Lamport1986)], +*strong consistency*, *immediate consistency*, or *external consistency* +[[3](/en/ch10#Gifford1981)]). +The exact definition of linearizability is quite subtle, and we will explore it in the rest of this +section. But the basic idea is to make a system appear as if there were only one copy of the data, +and all operations on it are atomic. With this guarantee, even though there may be multiple replicas +in reality, the application does not need to worry about them. -In this chapter, we will look at MapReduce and several other batch processing algo‐ rithms and frameworks, and explore how they are used in modern data systems. But first, to get started, we will look at data processing using standard Unix tools. Even if you are already familiar with them, a reminder about the Unix philosophy is worthwhile because the ideas and lessons from Unix carry over to large-scale, heterogene‐ ous distributed data systems. +In a linearizable system, as soon as one client successfully completes a write, all clients reading +from the database must be able to see the value just written. Maintaining the illusion of a single +copy of the data means guaranteeing that the value read is the most recent, up-to-date value, and +doesn’t come from a stale cache or replica. In other words, linearizability is a *recency +guarantee*. To clarify this idea, let’s look at an example of a system that is not linearizable. +![ddia 1001](/fig/ddia_1001.png) +###### Figure 10-1. This system is not linearizable, causing sports fans to be confused. -## …… +[Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website +[[4](/en/ch10#Kleppmann2015stop)]. +Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a +game their favorite team is playing. 
Just after the final score is announced, Aaliyah refreshes the +page, sees the winner announced, and excitedly tells Bryce about it. Bryce incredulously hits +*reload* on his own phone, but his request goes to a database replica that is lagging, and so his +phone shows that the game is still ongoing. +If Aaliyah and Bryce had hit reload at the same time, it would have been less surprising if they had +gotten two different query results, because they wouldn’t know at exactly what time their respective +requests were processed by the server. However, Bryce knows that he hit the reload button (initiated +his query) *after* he heard Aaliyah exclaim the final score, and therefore he expects his query +result to be at least as recent as Aaliyah’s. The fact that his query returned a stale result is a +violation of linearizability. +## What Makes a System Linearizable? -## Summary +In order to understand linearizability better, let’s look at some more examples. +[Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows three clients concurrently reading and writing the same +object *x* in a linearizable database. In distributed systems theory, *x* is called a *register*—in +practice, it could be one key in a key-value store, one row in a relational database, or one +document in a document database, for example. +![ddia 1002](/fig/ddia_1002.png) -In this chapter we explored the topic of batch processing. We started by looking at Unix tools such as awk, grep, and sort, and we saw how the design philosophy of those tools is carried forward into MapReduce and more recent dataflow engines. Some of those design principles are that inputs are immutable, outputs are intended to become the input to another (as yet unknown) program, and complex problems are solved by composing small tools that “do one thing well.” +###### Figure 10-2. If a read request is concurrent with a write request, it may return either the old or the new value. 
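The rule in the Figure 10-2 caption can be sketched in a few lines of Python. This is only an illustration under assumed names: `Interval` and `allowed_read_values` are hypothetical helpers, not part of any database API, and the timestamps are made up. Each request is modeled as the interval between the client sending it and receiving the response.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Interval:
    """A request as seen by the client (hypothetical helper type)."""
    start: float  # time the client sent the request
    end: float    # time the client received the response


def allowed_read_values(read: Interval, write: Interval, old, new) -> set:
    """Values a single read may return, given one write that changes old to new."""
    if read.end < write.start:
        # The read completed before the write began: only the old value is possible.
        return {old}
    if read.start > write.end:
        # The read began after the write completed: only the new value is possible.
        return {new}
    # The read overlaps the write in time, so the two are concurrent.
    return {old, new}


# Made-up timings: a write of 1 over a register that initially holds 0.
write = Interval(start=2.0, end=8.0)
before = allowed_read_values(Interval(0.0, 1.0), write, old=0, new=1)
during = allowed_read_values(Interval(3.0, 5.0), write, old=0, new=1)
after = allowed_read_values(Interval(9.0, 10.0), write, old=0, new=1)
```

This per-read rule is necessary but not sufficient for linearizability, because it looks at each read in isolation and says nothing about how consecutive reads relate to each other.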
-In the Unix world, the uniform interface that allows one program to be composed with another is files and pipes; in MapReduce, that interface is a distributed filesys‐ tem. We saw that dataflow engines add their own pipe-like data transport mecha‐ nisms to avoid materializing intermediate state to the distributed filesystem, but the initial input and final output of a job is still usually HDFS. +For simplicity, [Figure 10-2](/en/ch10#fig_consistency_linearizability_1) shows only the requests from the clients’ +point of view, not the internals of the database. Each bar is a request made by a client, where the +start of a bar is the time when the request was sent, and the end of a bar is when the response was +received by the client. Due to variable network delays, a client doesn’t know exactly when the +database processed its request—it only knows that it must have happened sometime between the +client sending the request and receiving the response. -The two main problems that distributed batch processing frameworks need to solve are: +In this example, the register has two types of operations: -***Partitioning*** +* *read*(*x*) ⇒ *v* means the client requested to read the value of register + *x*, and the database returned the value *v*. +* *write*(*x*, *v*) ⇒ *r* means the client requested to set the + register *x* to value *v*, and the database returned response *r* (which could be *ok* or *error*). -In MapReduce, mappers are partitioned according to input file blocks. The out‐ put of mappers is repartitioned, sorted, and merged into a configurable number of reducer partitions. The purpose of this process is to bring all the related data— e.g., all the records with the same key—together in the same place. +In [Figure 10-2](/en/ch10#fig_consistency_linearizability_1), the value of *x* is initially 0, and client C performs a +write request to set it to 1. While this is happening, clients A and B are repeatedly polling the +database to read the latest value. 
What are the possible responses that A and B might get for their +read requests? -Post-MapReduce dataflow engines try to avoid sorting unless it is required, but they otherwise take a broadly similar approach to partitioning. +* The first read operation by client A completes before the write begins, so it must definitely + return the old value 0. +* The last read by client A begins after the write has completed, so it must definitely return the + new value 1 if the database is linearizable, because the read must have been processed after the + write. +* Any read operations that overlap in time with the write operation might return either 0 or 1, + because we don’t know whether or not the write has taken effect at the time when the read + operation is processed. These operations are *concurrent* with the write. -***Fault tolerance*** +However, that is not yet sufficient to fully describe linearizability: if reads that are concurrent +with a write can return either the old or the new value, then readers could see a value flip back +and forth between the old and the new value several times while a write is going on. That is not +what we expect of a system that emulates a “single copy of the data.” -MapReduce frequently writes to disk, which makes it easy to recover from an individual failed task without restarting the entire job but slows down execution in the failure-free case. Dataflow engines perform less materialization of inter‐ mediate state and keep more in memory, which means that they need to recom‐ pute more data if a node fails. Deterministic operators reduce the amount of data that needs to be recomputed. +To make the system linearizable, we need to add another constraint, illustrated in +[Figure 10-3](/en/ch10#fig_consistency_linearizability_2). +![ddia 1003](/fig/ddia_1003.png) +###### Figure 10-3. After any one read has returned the new value, all following reads (on the same or other clients) must also return the new value. 
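The constraint in the Figure 10-3 caption can be made concrete with a brute-force check, sketched below for a single register. Everything here is an assumption for illustration: the `Op` type, the function names, and the timestamps are hypothetical and do not reproduce the exact timings in the figure. The check searches for an ordering of all operations that respects real time (an operation that finished before another started must come first) and in which every read returns the most recently written value.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Op:
    """One completed request on a single register (hypothetical helper type)."""
    client: str
    kind: str     # "read" or "write"
    value: int    # value written, or value returned by the read
    start: float  # time the request was sent
    end: float    # time the response was received


def respects_real_time(order) -> bool:
    """An op that finished before another started must not be placed after it."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            if later.end < earlier.start:
                return False
    return True


def is_valid_register_sequence(order, initial) -> bool:
    """Every read must return the value set by the most recent write."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False
    return True


def is_linearizable(history, initial=0) -> bool:
    """Try every ordering; factorial cost, so only usable on tiny histories."""
    return any(
        respects_real_time(order) and is_valid_register_sequence(order, initial)
        for order in permutations(history)
    )


# C's write of 1 is still in flight while A and B read one after the other.
write_c = Op("C", "write", 1, start=2.0, end=10.0)
read_a_new = Op("A", "read", 1, start=3.0, end=5.0)

# B starts after A has finished; returning the old value 0 flips the register back.
stale = [write_c, read_a_new, Op("B", "read", 0, start=6.0, end=8.0)]
# The same history, but B also sees the new value.
fresh = [write_c, read_a_new, Op("B", "read", 1, start=6.0, end=8.0)]
```

Under these assumptions `is_linearizable(stale)` is false and `is_linearizable(fresh)` is true: once A has read 1, no valid ordering lets B's later read return 0. Trying every permutation keeps the sketch short at the price of factorial running time, so a practical checker would need a far more efficient search.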
-We discussed several join algorithms for MapReduce, most of which are also inter‐ nally used in MPP databases and dataflow engines. They also provide a good illustra‐ tion of how partitioned algorithms work: +In a linearizable system we imagine that there must be some point in time (between the start and end +of the write operation) at which the value of *x* atomically flips from 0 to 1. Thus, if one +client’s read returns the new value 1, all subsequent reads must also return the new value, even if +the write operation has not yet completed. -***Sort-merge joins*** +This timing dependency is illustrated with an arrow in [Figure 10-3](/en/ch10#fig_consistency_linearizability_2). +Client A is the first to read the new value, 1. Just after A’s read returns, B begins a new read. +Since B’s read occurs strictly after A’s read, it must also return 1, even though the write by C is +still ongoing. (It’s the same situation as with Aaliyah and Bryce in +[Figure 10-1](/en/ch10#fig_consistency_linearizability_0): after Aaliyah has read the new value, Bryce also expects to +read the new value.) -Each of the inputs being joined goes through a mapper that extracts the join key. By partitioning, sorting, and merging, all the records with the same key end up going to the same call of the reducer. This function can then output the joined records. +We can further refine this timing diagram to visualize each operation taking effect atomically at +some point in time [[5](/en/ch10#Kingsbury2015mongodb)], +like in the more complex example shown in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3). In this example we +add a third type of operation besides *read* and *write*: -***Broadcast hash joins*** +* *cas*(*x*, *v*old, *v*new) ⇒ *r* means the client + requested an atomic *compare-and-set* operation (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)). 
If the + current value of the register *x* equals *v*old, it should be atomically set to *v*new. If + the value of *x* is different from *v*old, then the operation should leave the register + unchanged and return an error. *r* is the database’s response (*ok* or *error*). -One of the two join inputs is small, so it is not partitioned and it can be entirely loaded into a hash table. Thus, you can start a mapper for each partition of the large join input, load the hash table for the small input into each mapper, and then scan over the large input one record at a time, querying the hash table for each record. +Each operation in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3) is marked with a vertical line (inside the +bar for each operation) at the time when we think the operation was executed. Those markers are +joined up in a sequential order, and the result must be a valid sequence of reads and writes for a +register (every read must return the value set by the most recent write). -***Partitioned hash joins*** +The requirement of linearizability is that the lines joining up the operation markers always move +forward in time (from left to right), never backward. This requirement ensures the recency guarantee we +discussed earlier: once a new value has been written or read, all subsequent reads see the value +that was written, until it is overwritten again. -If the two join inputs are partitioned in the same way (using the same key, same hash function, and same number of partitions), then the hash table approach can be used independently for each partition. +![ddia 1004](/fig/ddia_1004.png) +###### Figure 10-4. Visualizing the points in time at which the reads and writes appear to have taken effect. The final read by B is not linearizable. 
+There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3): -Distributed batch processing engines have a deliberately restricted programming model: callback functions (such as mappers and reducers) are assumed to be stateless and to have no externally visible side effects besides their designated output. This restriction allows the framework to hide some of the hard distributed systems prob‐ lems behind its abstraction: in the face of crashes and network issues, tasks can be retried safely, and the output from any failed tasks is discarded. If several tasks for a partition succeed, only one of them actually makes its output visible. +* First client B sent a request to read *x*, then client D sent a request to set *x* to 0, and then + client A sent a request to set *x* to 1. Nevertheless, the value returned to B’s read is 1 (the + value written by A). This is okay: it means that the database first processed D’s write, then A’s + write, and finally B’s read. Although this is not the order in which the requests were sent, it’s + an acceptable order, because the three requests are concurrent. Perhaps B’s read request was + slightly delayed in the network, so it only reached the database after the two writes. +* Client B’s read returned 1 before client A received its response from the database, saying that + the write of the value 1 was successful. This is also okay: it just means the *ok* response from + the database to client A was slightly delayed in the network. +* This model doesn’t assume any transaction isolation: another client may change a value at any + time. For example, C first reads 1 and then reads 2, because the value was changed by B between + the two reads. 
An atomic compare-and-set (*cas*) operation can be used to check the value hasn’t + been concurrently changed by another client: B and C’s *cas* requests succeed, but D’s *cas* + request fails (by the time the database processes it, the value of *x* is no longer 0). +* The final read by client B (in a shaded bar) is not linearizable. The operation is concurrent with + C’s *cas* write, which updates *x* from 2 to 4. In the absence of other requests, it would be okay for + B’s read to return 2. However, client A has already read the new value 4 before B’s read started, + so B is not allowed to read an older value than A. Again, it’s the same situation as with Aaliyah + and Bryce in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0). -Thanks to the framework, your code in a batch processing job does not need to worry about implementing fault-tolerance mechanisms: the framework can guarantee that the final output of a job is the same as if no faults had occurred, even though in real‐ ity various tasks perhaps had to be retried. These reliable semantics are much stron‐ ger than what you usually have in online services that handle user requests and that write to databases as a side effect of processing a request. +That is the intuition behind linearizability; the formal definition +[[1](/en/ch10#Herlihy1990)] describes it more precisely. It is +possible (though computationally expensive) to test whether a system’s behavior is linearizable by +recording the timings of all requests and responses, and checking whether they can be arranged into +a valid sequential order [[6](/en/ch10#Kingsbury2014knossos), +[7](/en/ch10#Kingsbury2020elle)]. -The distinguishing feature of a batch processing job is that it reads some input data and produces some output data, without modifying the input—in other words, the output is derived from the input. 
Crucially, the input data is *bounded*: it has a known, fixed size (for example, it consists of a set of log files at some point in time, or a snap‐ shot of a database’s contents). Because it is bounded, a job knows when it has finished reading the entire input, and so a job eventually completes when it is done. +Just as there are various weak isolation levels for transactions besides serializability (see +[“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels)), there are also various weaker consistency models for +replicated systems besides linearizability +[[8](/en/ch10#Viotti2016)]. +In fact, the *read-after-write*, *monotonic reads*, and *consistent prefix reads* properties we saw +in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag) are examples of such weaker consistency models. Linearizability +guarantees all these weaker properties, and more. In this chapter we will focus on linearizability, +which is the strongest consistency model in common use. -In the next chapter, we will turn to stream processing, in which the input is *unboun‐ ded*—that is, you still have a job, but its inputs are never-ending streams of data. In this case, a job is never complete, because at any time there may still be more work coming in. We shall see that stream and batch processing are similar in some respects, but the assumption of unbounded streams also changes a lot about how we build systems. +# Linearizability Versus Serializability +Linearizability is easily confused with serializability (see [“Serializability”](/en/ch8#sec_transactions_serializability)), +as both words seem to mean something like “can be arranged in a sequential order.” However, they are +quite different guarantees, and it is important to distinguish between them: +Serializability +: Serializability is an isolation property of transactions, where every transaction may read and + write *multiple objects* (rows, documents, records). 
It guarantees that transactions behave the + same as if they had executed in *some* serial order: that is, as if you first performed all of one + transaction’s operations, then all of another transaction’s operations, and so on, without + interleaving them. It is okay for that serial order to be different from the order in which the + transactions were actually run [[9](/en/ch10#Bailis2014linear)]. -## References +Linearizability +: Linearizability is a guarantee on reads and writes of a register (an *individual object*). It + doesn’t group operations together into transactions, so it does not prevent problems such as write + skew that involve multiple objects (see [“Write Skew and Phantoms”](/en/ch8#sec_transactions_write_skew)). However, linearizability + is a *recency* guarantee: it requires that if one operation finishes before another one starts, + then the later operation must observe a state that is at least as new as the earlier operation. + Serializability does not have that requirement: for example, stale reads are allowed by + serializability [[10](/en/ch10#Abadi2019serializable)]. -1. Jeffrey Dean and Sanjay Ghemawat: “[MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/pub62/),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004. -1. Joel Spolsky: “[The Perils of JavaSchools](https://www.joelonsoftware.com/2005/12/29/the-perils-of-javaschools-2/),” *joelonsoftware.com*, December 29, 2005. -1. Shivnath Babu and Herodotos Herodotou: “[Massively Parallel Databases and MapReduce Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/db-mr-survey-final.pdf),” *Foundations and Trends in Databases*, volume 5, number 1, pages 1–104, November 2013. [doi:10.1561/1900000036](http://dx.doi.org/10.1561/1900000036) -1. David J. 
DeWitt and Michael Stonebraker: “[MapReduce: A Major Step Backwards](https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html),” originally published at *databasecolumn.vertica.com*, January 17, 2008. -1. Henry Robinson: “[The Elephant Was a Trojan Horse: On the Death of Map-Reduce at Google](https://www.the-paper-trail.org/post/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/),” *the-paper-trail.org*, June 25, 2014. -1. “[The Hollerith Machine](https://www.census.gov/history/www/innovations/technology/the_hollerith_tabulator.html),” United States Census Bureau, *census.gov*. -1. “[IBM 82, 83, and 84 Sorters Reference Manual](https://bitsavers.org/pdf/ibm/punchedCard/Sorter/A24-1034-1_82-83-84_sorters.pdf),” Edition A24-1034-1, International Business Machines Corporation, July 1962. -1. Adam Drake: “[Command-Line Tools Can Be 235x Faster than Your Hadoop Cluster](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html),” *aadrake.com*, January 25, 2014. -1. “[GNU Coreutils 8.23 Documentation](http://www.gnu.org/software/coreutils/manual/html_node/index.html),” Free Software Foundation, Inc., 2014. -1. Martin Kleppmann: “[Kafka, Samza, and the Unix Philosophy of Distributed Data](http://martin.kleppmann.com/2015/08/05/kafka-samza-unix-philosophy-distributed-data.html),” *martin.kleppmann.com*, August 5, 2015. -1. Doug McIlroy: [Internal Bell Labs memo](https://swtch.com/~rsc/thread/mdmpipe.pdf), October 1964. Cited in: Dennis M. Richie: “[Advice from Doug McIlroy](https://www.bell-labs.com/usr/dmr/www/mdmpipe.html),” *bell-labs.com*. -1. M. D. McIlroy, E. N. Pinson, and B. A. Tague: “[UNIX Time-Sharing System: Foreword](https://archive.org/details/bstj57-6-1899),” *The Bell System Technical Journal*, volume 57, number 6, pages 1899–1904, July 1978. -1. Eric S. Raymond: [*The Art of UNIX Programming*](http://www.catb.org/~esr/writings/taoup/html/). Addison-Wesley, 2003. 
ISBN: 978-0-13-142901-7 -1. Ronald Duncan: “[Text File Formats – ASCII Delimited Text – Not CSV or TAB Delimited Text](https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/),” *ronaldduncan.wordpress.com*, October 31, 2009. -1. Alan Kay: “[Is 'Software Engineering' an Oxymoron?](http://tinlizzie.org/~takashi/IsSoftwareEngineeringAnOxymoron.pdf),” *tinlizzie.org*. -1. Martin Fowler: “[InversionOfControl](http://martinfowler.com/bliki/InversionOfControl.html),” *martinfowler.com*, June 26, 2005. -1. Daniel J. Bernstein: “[Two File Descriptors for Sockets](http://cr.yp.to/tcpip/twofd.html),” *cr.yp.to*. -1. Rob Pike and Dennis M. Ritchie: “[The Styx Architecture for Distributed Systems](http://doc.cat-v.org/inferno/4th_edition/styx),” *Bell Labs Technical Journal*, volume 4, number 2, pages 146–152, April 1999. -1. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: “[The Google File System](http://research.google.com/archive/gfs-sosp2003.pdf),” at *19th ACM Symposium on Operating Systems Principles* (SOSP), October 2003. [doi:10.1145/945445.945450](http://dx.doi.org/10.1145/945445.945450) -1. Michael Ovsiannikov, Silvius Rus, Damian Reeves, et al.: “[The Quantcast File System](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p808-ovsiannikov.pdf),” *Proceedings of the VLDB Endowment*, volume 6, number 11, pages 1092–1101, August 2013. [doi:10.14778/2536222.2536234](http://dx.doi.org/10.14778/2536222.2536234) -1. “[OpenStack Swift 2.6.1 Developer Documentation](http://docs.openstack.org/developer/swift/),” OpenStack Foundation, *docs.openstack.org*, March 2016. -1. Zhe Zhang, Andrew Wang, Kai Zheng, et al.: “[Introduction to HDFS Erasure Coding in Apache Hadoop](https://blog.cloudera.com/introduction-to-hdfs-erasure-coding-in-apache-hadoop/),” *blog.cloudera.com*, September 23, 2015. -1. 
Peter Cnudde: “[Hadoop Turns 10](https://web.archive.org/web/20190119112713/https://yahoohadoop.tumblr.com/post/138739227316/hadoop-turns-10),” *yahoohadoop.tumblr.com*, February 5, 2016. -1. Eric Baldeschwieler: “[Thinking About the HDFS vs. Other Storage Technologies](https://web.archive.org/web/20190529215115/http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/),” *hortonworks.com*, July 25, 2012. -1. Brendan Gregg: “[Manta: Unix Meets Map Reduce](https://web.archive.org/web/20220125052545/http://dtrace.org/blogs/brendan/2013/06/25/manta-unix-meets-map-reduce/),” *dtrace.org*, June 25, 2013. -1. Tom White: *Hadoop: The Definitive Guide*, 4th edition. O'Reilly Media, 2015. ISBN: 978-1-491-90163-2 -1. Jim N. Gray: “[Distributed Computing Economics](http://arxiv.org/pdf/cs/0403019.pdf),” Microsoft Research Tech Report MSR-TR-2003-24, March 2003. -1. Márton Trencséni: “[Luigi vs Airflow vs Pinball](http://bytepawn.com/luigi-airflow-pinball.html),” *bytepawn.com*, February 6, 2016. -1. Roshan Sumbaly, Jay Kreps, and Sam Shah: “[The 'Big Data' Ecosystem at LinkedIn](http://www.slideshare.net/s_shah/the-big-data-ecosystem-at-linkedin-23512853),” at *ACM International Conference on Management of Data* (SIGMOD), July 2013. [doi:10.1145/2463676.2463707](http://dx.doi.org/10.1145/2463676.2463707) -1. Alan F. Gates, Olga Natkovich, Shubham Chopra, et al.: “[Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience](http://www.vldb.org/pvldb/vol2/vldb09-1074.pdf),” at *35th International Conference on Very Large Data Bases* (VLDB), August 2009. -1. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, et al.: “[Hive – A Petabyte Scale Data Warehouse Using Hadoop](http://i.stanford.edu/~ragho/hive-icde2010.pdf),” at *26th IEEE International Conference on Data Engineering* (ICDE), March 2010. [doi:10.1109/ICDE.2010.5447738](http://dx.doi.org/10.1109/ICDE.2010.5447738) -1. 
“[Cascading 3.0 User Guide](https://web.archive.org/web/20231206195311/http://docs.cascading.org/cascading/3.0/userguide/),” Concurrent, Inc., *docs.cascading.org*, January 2016. -1. “[Apache Crunch User Guide](https://crunch.apache.org/user-guide.html),” Apache Software Foundation, *crunch.apache.org*. -1. Craig Chambers, Ashish Raniwala, Frances Perry, et al.: “[FlumeJava: Easy, Efficient Data-Parallel Pipelines](https://research.google.com/pubs/archive/35650.pdf),” at *31st ACM SIGPLAN Conference on Programming Language Design and Implementation* (PLDI), June 2010. [doi:10.1145/1806596.1806638](http://dx.doi.org/10.1145/1806596.1806638) -1. Jay Kreps: “[Why Local State is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing),” *oreilly.com*, July 31, 2014. -1. Martin Kleppmann: “[Rethinking Caching in Web Apps](http://martin.kleppmann.com/2012/10/01/rethinking-caching-in-web-apps.html),” *martin.kleppmann.com*, October 1, 2012. -1. Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira: *[Hadoop Application Architectures](http://shop.oreilly.com/product/0636920033196.do)*. O'Reilly Media, 2015. ISBN: 978-1-491-90004-8 -1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015. -1. Sriranjan Manjunath: “[Skewed Join](https://web.archive.org/web/20151228114742/https://wiki.apache.org/pig/PigSkewedJoinSpec),” *wiki.apache.org*, 2009. -1. David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri: “[Practical Skew Handling in Parallel Joins](http://www.vldb.org/conf/1992/P027.PDF),” at *18th International Conference on Very Large Data Bases* (VLDB), August 1992. -1. 
Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: “[Impala: A Modern, Open-Source SQL Engine for Hadoop](http://pandis.net/resources/cidr15impala.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015. -1. Matthieu Monsch: “[Open-Sourcing PalDB, a Lightweight Companion for Storing Side Data](https://engineering.linkedin.com/blog/2015/10/open-sourcing-paldb--a-lightweight-companion-for-storing-side-da),” *engineering.linkedin.com*, October 26, 2015. -1. Daniel Peng and Frank Dabek: “[Large-Scale Incremental Processing Using Distributed Transactions and Notifications](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf),” at *9th USENIX conference on Operating Systems Design and Implementation* (OSDI), October 2010. -1. “["Cloudera Search User Guide,"](http://www.cloudera.com/documentation/cdh/5-1-x/Search/Cloudera-Search-User-Guide/Cloudera-Search-User-Guide.html) Cloudera, Inc., September 2015. -1. Lili Wu, Sam Shah, Sean Choi, et al.: “[The Browsemaps: Collaborative Filtering at LinkedIn](http://ceur-ws.org/Vol-1271/Paper3.pdf),” at *6th Workshop on Recommender Systems and the Social Web* (RSWeb), October 2014. -1. Roshan Sumbaly, Jay Kreps, Lei Gao, et al.: “[Serving Large-Scale Batch Computed Data with Project Voldemort](http://static.usenix.org/events/fast12/tech/full_papers/Sumbaly.pdf),” at *10th USENIX Conference on File and Storage Technologies* (FAST), February 2012. -1. Varun Sharma: “[Open-Sourcing Terrapin: A Serving System for Batch Generated Data](https://web.archive.org/web/20170215032514/https://engineering.pinterest.com/blog/open-sourcing-terrapin-serving-system-batch-generated-data-0),” *engineering.pinterest.com*, September 14, 2015. -1. Nathan Marz: “[ElephantDB](http://www.slideshare.net/nathanmarz/elephantdb),” *slideshare.net*, May 30, 2011. -1. 
Jean-Daniel (JD) Cryans: “[How-to: Use HBase Bulk Loading, and Why](https://blog.cloudera.com/how-to-use-hbase-bulk-loading-and-why/),” *blog.cloudera.com*, September 27, 2013. -1. Nathan Marz: “[How to Beat the CAP Theorem](http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html),” *nathanmarz.com*, October 13, 2011. -1. Molly Bartlett Dishman and Martin Fowler: “[Agile Architecture](https://web.archive.org/web/20161130034721/http://conferences.oreilly.com/software-architecture/sa2015/public/schedule/detail/40388),” at *O'Reilly Software Architecture Conference*, March 2015. -1. David J. DeWitt and Jim N. Gray: “[Parallel Database Systems: The Future of High Performance Database Systems](http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/dewittgray92.pdf),” *Communications of the ACM*, volume 35, number 6, pages 85–98, June 1992. [doi:10.1145/129888.129894](http://dx.doi.org/10.1145/129888.129894) -1. Jay Kreps: “[But the multi-tenancy thing is actually really really hard](https://twitter.com/jaykreps/status/528235702480142336),” tweetstorm, *twitter.com*, October 31, 2014. -1. Jeffrey Cohen, Brian Dolan, Mark Dunlap, et al.: “[MAD Skills: New Analysis Practices for Big Data](http://www.vldb.org/pvldb/vol2/vldb09-219.pdf),” *Proceedings of the VLDB Endowment*, volume 2, number 2, pages 1481–1492, August 2009. [doi:10.14778/1687553.1687576](http://dx.doi.org/10.14778/1687553.1687576) -1. Ignacio Terrizzano, Peter Schwarz, Mary Roth, and John E. Colino: “[Data Wrangling: The Challenging Journey from the Wild to the Lake](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper2.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015. -1. 
Paige Roberts: “[To Schema on Read or to Schema on Write, That Is the Hadoop Data Lake Question](https://web.archive.org/web/20171105001306/http://adaptivesystemsinc.com/blog/to-schema-on-read-or-to-schema-on-write-that-is-the-hadoop-data-lake-question/),” *adaptivesystemsinc.com*, July 2, 2015. -1. Bobby Johnson and Joseph Adler: “[The Sushi Principle: Raw Data Is Better](https://web.archive.org/web/20161126104941/https://conferences.oreilly.com/strata/big-data-conference-ca-2015/public/schedule/detail/38737),” at *Strata+Hadoop World*, February 2015. -1. Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, et al.: “[Apache Hadoop YARN: Yet Another Resource Negotiator](https://www.cs.cmu.edu/~garth/15719/papers/yarn.pdf),” at *4th ACM Symposium on Cloud Computing* (SoCC), October 2013. [doi:10.1145/2523616.2523633](http://dx.doi.org/10.1145/2523616.2523633) -1. Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, et al.: “[Large-Scale Cluster Management at Google with Borg](http://research.google.com/pubs/pub43438.html),” at *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741964](http://dx.doi.org/10.1145/2741948.2741964) -1. Malte Schwarzkopf: “[The Evolution of Cluster Scheduler Architectures](https://web.archive.org/web/20201109052657/http://www.firmament.io/blog/scheduler-architectures.html),” *firmament.io*, March 9, 2016. -1. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al.: “[Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf),” at *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012. -1. Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia: *Learning Spark*. O'Reilly Media, 2015. ISBN: 978-1-449-35904-1 -1. 
Bikas Saha and Hitesh Shah: “[Apache Tez: Accelerating Hadoop Query Processing](http://www.slideshare.net/Hadoop_Summit/w-1205phall1saha),” at *Hadoop Summit*, June 2014. -1. Bikas Saha, Hitesh Shah, Siddharth Seth, et al.: “[Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications](http://home.cse.ust.hk/~weiwa/teaching/Fall15-COMP6611B/reading_list/Tez.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2742790](http://dx.doi.org/10.1145/2723372.2742790) -1. Kostas Tzoumas: “[Apache Flink: API, Runtime, and Project Roadmap](http://www.slideshare.net/KostasTzoumas/apache-flink-api-runtime-and-project-roadmap),” *slideshare.net*, January 14, 2015. -1. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al.: “[The Stratosphere Platform for Big Data Analytics](https://ssc.io/pdf/2014-VLDBJ_Stratosphere_Overview.pdf),” *The VLDB Journal*, volume 23, number 6, pages 939–964, May 2014. [doi:10.1007/s00778-014-0357-y](http://dx.doi.org/10.1007/s00778-014-0357-y) -1. Michael Isard, Mihai Budiu, Yuan Yu, et al.: “[Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks](https://www.microsoft.com/en-us/research/publication/dryad-distributed-data-parallel-programs-from-sequential-building-blocks/),” at *European Conference on Computer Systems* (EuroSys), March 2007. [doi:10.1145/1272996.1273005](http://dx.doi.org/10.1145/1272996.1273005) -1. Daniel Warneke and Odej Kao: “[Nephele: Efficient Parallel Data Processing in the Cloud](https://stratosphere2.dima.tu-berlin.de/assets/papers/Nephele_09.pdf),” at *2nd Workshop on Many-Task Computing on Grids and Supercomputers* (MTAGS), November 2009. [doi:10.1145/1646468.1646476](http://dx.doi.org/10.1145/1646468.1646476) -1. 
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd: “[The PageRank Citation Ranking: Bringing Order to the Web](https://web.archive.org/web/20230219170930/http://ilpubs.stanford.edu:8090/422/),” Stanford InfoLab Technical Report 422, 1999. -1. Leslie G. Valiant: “[A Bridging Model for Parallel Computation](http://dl.acm.org/citation.cfm?id=79181),” *Communications of the ACM*, volume 33, number 8, pages 103–111, August 1990. [doi:10.1145/79173.79181](http://dx.doi.org/10.1145/79173.79181) -1. Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl: “[Spinning Fast Iterative Data Flows](http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf),” *Proceedings of the VLDB Endowment*, volume 5, number 11, pages 1268-1279, July 2012. [doi:10.14778/2350229.2350245](http://dx.doi.org/10.14778/2350229.2350245) -1. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, et al.: “[Pregel: A System for Large-Scale Graph Processing](https://kowshik.github.io/JPregel/pregel_paper.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2010. [doi:10.1145/1807167.1807184](http://dx.doi.org/10.1145/1807167.1807184) -1. Frank McSherry, Michael Isard, and Derek G. Murray: “[Scalability! But at What COST?](http://www.frankmcsherry.org/assets/COST.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015. -1. Ionel Gog, Malte Schwarzkopf, Natacha Crooks, et al.: “[Musketeer: All for One, One for All in Data Processing Systems](http://www.cl.cam.ac.uk/research/srg/netos/camsas/pubs/eurosys15-musketeer.pdf),” at *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741968](http://dx.doi.org/10.1145/2741948.2741968) -1. 
Aapo Kyrola, Guy Blelloch, and Carlos Guestrin: “[GraphChi: Large-Scale Graph Computation on Just a PC](https://www.usenix.org/system/files/conference/osdi12/osdi12-final-126.pdf),” at *10th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2012. -1. Andrew Lenharth, Donald Nguyen, and Keshav Pingali: “[Parallel Graph Analytics](http://cacm.acm.org/magazines/2016/5/201591-parallel-graph-analytics/fulltext),” *Communications of the ACM*, volume 59, number 5, pages 78–87, May 2016. [doi:10.1145/2901919](http://dx.doi.org/10.1145/2901919) -1. Fabian Hüske: “[Peeking into Apache Flink's Engine Room](http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html),” *flink.apache.org*, March 13, 2015. -1. Mostafa Mokhtar: “[Hive 0.14 Cost Based Optimizer (CBO) Technical Overview](https://web.archive.org/web/20170607112708/http://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/),” *hortonworks.com*, March 2, 2015. -1. Michael Armbrust, Reynold S Xin, Cheng Lian, et al.: “[Spark SQL: Relational Data Processing in Spark](http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2742797](http://dx.doi.org/10.1145/2723372.2742797) -1. Daniel Blazevski: “[Planting Quadtrees for Apache Flink](https://blog.insightdatascience.com/planting-quadtrees-for-apache-flink-b396ebc80d35),” *insightdataengineering.com*, March 25, 2016. -1. Tom White: “[Genome Analysis Toolkit: Now Using Apache Spark for Data Processing](https://web.archive.org/web/20190215132904/http://blog.cloudera.com/blog/2016/04/genome-analysis-toolkit-now-using-apache-spark-for-data-processing/),” *blog.cloudera.com*, April 6, 2016. +(*Sequential consistency* is something else again +[[8](/en/ch10#Viotti2016)], but we won’t discuss it here.) 
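To make the recency requirement in the definition of linearizability concrete, here is a minimal sketch of a brute-force check for a single register. It assumes each operation is recorded with the time its request was sent and the time its response arrived, and it simply tries every ordering, so it is only usable on tiny histories. The `Op` record and the function names are illustrative, not taken from any real checker.

```python
from dataclasses import dataclass
from itertools import permutations


@dataclass(frozen=True)
class Op:
    client: str
    kind: str      # "read" or "write"
    value: int     # value written, or value returned by the read
    start: float   # time the request was sent
    end: float     # time the response was received


def respects_real_time(order):
    """If one op finished before another started, it must be ordered first."""
    for i, later in enumerate(order):
        for earlier in order[:i]:
            # 'earlier' is placed before 'later'; that is illegal if
            # 'later' had already completed before 'earlier' even began.
            if later.end < earlier.start:
                return False
    return True


def is_valid_register_history(order, initial):
    """Replay the ops one at a time against a single register."""
    current = initial
    for op in order:
        if op.kind == "write":
            current = op.value
        elif op.value != current:
            return False  # a read returned something other than the latest write
    return True


def is_linearizable(history, initial=0):
    """Try every ordering (factorial cost, so tiny histories only)."""
    return any(
        respects_real_time(order) and is_valid_register_history(order, initial)
        for order in permutations(history)
    )


# Three overlapping requests: the read may legitimately see the later write.
concurrent_ok = [
    Op("B", "read", 1, start=0, end=10),
    Op("D", "write", 0, start=1, end=5),
    Op("A", "write", 1, start=2, end=12),
]

# A reads the new value and finishes; B starts afterward but reads the old one.
stale_read = [
    Op("writer", "write", 1, start=0, end=10),
    Op("A", "read", 1, start=1, end=3),
    Op("B", "read", 0, start=4, end=6),
]

print(is_linearizable(concurrent_ok))  # True
print(is_linearizable(stale_read))     # False
```

The second history fails because no ordering can both place B's read before the write (to explain the old value) and place A's read before B's read (to respect real time). Serializability alone would not rule it out, since it has no recency requirement.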
+A database may provide both serializability and linearizability, and this combination is known as +*strict serializability* or *strong one-copy serializability* (*strong-1SR*) +[[11](/en/ch10#Bailis2014virtues_ch10), +[12](/en/ch10#Bernstein1987_ch10)]. +Single-node databases are typically linearizable. With distributed databases using optimistic +methods like serializable snapshot isolation (see [“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)) the situation is more +complicated: for example, CockroachDB provides serializability, and some recency guarantees on +reads, but not strict serializability [[13](/en/ch10#Matei2021)] +because this would require expensive coordination between transactions +[[14](/en/ch10#Demirbas2022)]. + +It is also possible to combine a weaker isolation level with linearizability, or a weaker +consistency model with serializability; in fact, consistency model and isolation level can be chosen +largely independently from each other [[15](/en/ch10#Darnell2022), +[16](/en/ch10#Abadi2019consistency)]. + +## Relying on Linearizability + +In what circumstances is linearizability useful? Viewing the final score of a sporting match is +perhaps a frivolous example: a result that is outdated by a few seconds is unlikely to cause any +real harm in this situation. However, there are a few areas in which linearizability is an important +requirement for making a system work correctly. + +### Locking and leader election + +A system that uses single-leader replication needs to ensure that there is indeed only one leader, +not several (split brain). One way of electing a leader is to use a lease: every node that starts up +tries to acquire the lease, and the one that succeeds becomes the leader +[[17](/en/ch10#Burrows2006_ch10)]. +No matter how this mechanism is implemented, it must be linearizable: it should not be possible for +two different nodes to acquire the lease at the same time. 
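The property that matters here can be sketched in a few lines, under heavy simplifying assumptions: a single process holds the only copy of the lease, a lock makes the check-and-set atomic, and the caller passes in the current time. The `LeaseRegister` class and `elect` helper are hypothetical names for illustration. This is not how a fault-tolerant lease service is built, and it ignores clock and process-pause problems entirely.

```python
import threading
from typing import List, Optional


class LeaseRegister:
    """Single-copy lease; the lock makes try_acquire() one atomic step."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._holder: Optional[str] = None
        self._expires_at = 0.0

    def try_acquire(self, node: str, now: float, ttl: float) -> bool:
        with self._lock:
            free = self._holder is None or now >= self._expires_at
            if free or self._holder == node:  # the holder may renew its own lease
                self._holder = node
                self._expires_at = now + ttl
                return True
            return False


def elect(nodes: List[str], lease: LeaseRegister,
          now: float = 0.0, ttl: float = 10.0) -> List[str]:
    """Let every node race for the lease; return the nodes that won."""
    winners: List[str] = []

    def run(node: str) -> None:
        if lease.try_acquire(node, now, ttl):
            winners.append(node)  # list.append is atomic in CPython

    threads = [threading.Thread(target=run, args=(n,)) for n in nodes]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return winners


lease = LeaseRegister()
winners = elect([f"node-{i}" for i in range(8)], lease)
print(len(winners))  # 1
```

Because the check and the update happen as one atomic step on a single copy of the data, exactly one node wins no matter how the threads interleave. If the check and the update could be served by different, lagging replicas, two nodes might each see the lease as free.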
+ +Coordination services like Apache ZooKeeper +[[18](/en/ch10#Junqueira2013_ch10)] +and etcd are often used to implement distributed leases and leader election. They use consensus +algorithms to implement linearizable operations in a fault-tolerant way (we discuss such algorithms +later in this chapter). There are still many subtle details to implementing leases and leader +election correctly (see for example the fencing issue in [“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing)), and +libraries like Apache Curator help by providing higher-level recipes on top of ZooKeeper. However, a +linearizable storage service is the basic foundation for these coordination tasks. + +###### Note + +Strictly speaking, ZooKeeper provides linearizable writes, but reads may be stale, since there is no +guarantee that they are served from the current leader +[[18](/en/ch10#Junqueira2013_ch10)]. +etcd since version 3 provides linearizable reads by default. + +Distributed locking is also used at a much more granular level in some distributed databases, such as +Oracle Real Application Clusters (RAC) +[[19](/en/ch10#Vallath2006)]. +RAC uses a lock per disk page, with multiple nodes sharing access +to the same disk storage system. Since these linearizable locks are on the critical path of +transaction execution, RAC deployments usually have a dedicated cluster interconnect network for +communication between database nodes. + +### Constraints and uniqueness guarantees + +Uniqueness constraints are common in databases: for example, a username or email address must +uniquely identify one user, and in a file storage service there cannot be two files with the same +path and filename. If you want to enforce this constraint as the data is written (such that if two people +try to concurrently create a user or a file with the same name, one of them will be returned an +error), you need linearizability. 
+ +This situation is actually similar to a lock: when a user registers for your service, you can think +of them acquiring a “lock” on their chosen username. The operation is also very similar to an atomic +compare-and-set, setting the username to the ID of the user who claimed it, provided that the +username is not already taken. + +Similar issues arise if you want to ensure that a bank account balance never goes negative, or that +you don’t sell more items than you have in stock in the warehouse, or that two people don’t +concurrently book the same seat on a flight or in a theater. These constraints all require there to +be a single up-to-date value (the account balance, the stock level, the seat occupancy) that all +nodes agree on. + +In real applications, it is sometimes acceptable to treat such constraints loosely (for example, if +a flight is overbooked, you can move customers to a different flight and offer them compensation for +the inconvenience). In such cases, linearizability may not be needed, and we will discuss such +loosely interpreted constraints in [Link to Come]. + +However, a hard uniqueness constraint, such as the one you typically find in relational databases, +requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints, +can be implemented without linearizability +[[20](/en/ch10#Bailis2014coord_ch10)]. + +### Cross-channel timing dependencies + +Notice a detail in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0): if Aaliyah hadn’t exclaimed the score, +Bryce wouldn’t have known that the result of his query was stale. He would have just refreshed the +page again a few seconds later, and eventually seen the final score. The linearizability violation +was only noticed because there was an additional communication channel in the system (Aaliyah’s +voice to Bryce’s ears). + +Similar situations can arise in computer systems. 
For example, say you have a website where users +can upload a video, and a background process transcodes the video to a lower quality that can be +streamed on slow internet connections. The architecture and dataflow of this system is illustrated +in [Figure 10-5](/en/ch10#fig_consistency_transcoder). + +The video transcoder needs to be explicitly instructed to perform a transcoding job, and this +instruction is sent from the web server to the transcoder via a message queue (see [Link to Come]). +The web server doesn’t place the entire video on the queue, since most message brokers are designed +for small messages, and a video may be many megabytes in size. Instead, the video is first written +to a file storage service, and once the write is complete, the instruction to the transcoder is +placed on the queue. + +![ddia 1005](/fig/ddia_1005.png) + +###### Figure 10-5. The web server and video transcoder communicate both through file storage and a message queue, opening the potential for race conditions. + +If the file storage service is linearizable, then this system should work fine. If it is not +linearizable, there is the risk of a race condition: the message queue (steps 3 and 4 in +[Figure 10-5](/en/ch10#fig_consistency_transcoder)) might be faster than the internal replication inside the storage +service. In this case, when the transcoder fetches the original video (step 5), it might see an old +version of the file, or nothing at all. If it processes an old version of the video, the original +and transcoded videos in the file storage become permanently inconsistent with each other. + +This problem arises because there are two different communication channels between the web server +and the transcoder: the file storage and the message queue. Without the recency guarantee of +linearizability, race conditions between these two channels are possible. 
This situation is +analogous to [Figure 10-1](/en/ch10#fig_consistency_linearizability_0), where there was also a race condition between +two communication channels: the database replication and the real-life audio channel between +Aaliyah’s mouth and Bryce’s ears. + +A similar race condition occurs if you have a mobile app that can receive push notifications, and +the app fetches some data from a server when it receives a push notification. If the data fetch +might go to a lagging replica, it could happen that the push notification goes through quickly, but +the subsequent fetch doesn’t see the data that the push notification was about. + +Linearizability is not the only way of avoiding this race condition, but it’s the simplest to +understand. If you control the additional communication channel (like in the case of the message +queue, but not in the case of Aaliyah and Bryce), you can use alternative approaches similar to what +we discussed in [“Reading Your Own Writes”](/en/ch6#sec_replication_ryw), at the cost of additional complexity. + +## Implementing Linearizable Systems + +Now that we’ve looked at a few examples in which linearizability is useful, let’s think about how we +might implement a system that offers linearizable semantics. + +Since linearizability essentially means “behave as though there is only a single copy of the data, +and all operations on it are atomic,” the simplest answer would be to really only use a single copy +of the data. However, that approach would not be able to tolerate faults: if the node holding that +one copy failed, the data would be lost, or at least inaccessible until the node was brought up +again. 
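As a baseline, the single-copy approach might look like the following sketch: one in-memory register whose read, write, and compare-and-set operations each run under the same lock, so every operation takes effect at one point in time. The `Register` class is a hypothetical illustration, and the sequence of values is made up for the example.

```python
import threading


class Register:
    """One copy of the data; a lock makes every operation atomic."""

    def __init__(self, initial: int = 0) -> None:
        self._lock = threading.Lock()
        self._value = initial

    def read(self) -> int:
        with self._lock:
            return self._value

    def write(self, value: int) -> None:
        with self._lock:
            self._value = value

    def cas(self, expected: int, new: int) -> bool:
        """Set the value to `new` only if it currently equals `expected`."""
        with self._lock:
            if self._value != expected:
                return False
            self._value = new
            return True


x = Register()
x.write(0)
x.write(1)
print(x.cas(1, 2))  # True: the value was still 1
print(x.cas(2, 4))  # True: the value was still 2
print(x.cas(0, 3))  # False: the value is no longer 0
print(x.read())     # 4
```

This is trivially linearizable, and it shows the drawback just as plainly: the value lives in one process's memory, so when that process fails the register is gone.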
+ +Let’s revisit the replication methods from [Chapter 6](/en/ch6#ch_replication), and compare whether they can be made +linearizable: + +Single-leader replication (potentially linearizable) +: In a system with single-leader replication, the leader has the primary copy of the data that is + used for writes, and the followers maintain backup copies of the data on other nodes. As long as + you perform all reads and writes on the leader, they are likely to be linearizable. However, this + assumes that you know for sure who the leader is. As discussed in + [“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing), it is quite possible for a node to think that it is the leader, + when in fact it is not—and if the delusional leader continues to serve requests, it is likely to + violate linearizability [[21](/en/ch10#Kingsbury2014etcd)]. + With asynchronous replication, failover may even lose committed writes, which violates both + durability and linearizability. + + Sharding a single-leader database, with a separate leader per shard, does not affect + linearizability, since it is only a single-object guarantee. Cross-shard transactions are a + different matter (see [“Distributed Transactions”](/en/ch8#sec_transactions_distributed)). + +Consensus algorithms (likely linearizable) +: Some consensus algorithms are essentially single-leader replication with automatic leader election + and failover. They are carefully designed to prevent split brain, allowing them to implement + linearizable storage safely. ZooKeeper uses the Zab consensus algorithm + [[22](/en/ch10#Junqueira2011)] + and etcd uses Raft + [[23](/en/ch10#Ongaro2014atc)], for example. + However, just because a system uses consensus does not guarantee that all operations on it are + linearizable: if it allows reads on a node without checking that it is still the leader, the + results of the read may be stale if a new leader has just been elected. 
+ +Multi-leader replication (not linearizable) +: Systems with multi-leader replication are generally not linearizable, because they concurrently + process writes on multiple nodes and asynchronously replicate them to other nodes. For this + reason, they can produce conflicting writes that require resolution (see + [“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts)). + +Leaderless replication (probably not linearizable) +: For systems with leaderless replication (Dynamo-style; see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)), people + sometimes claim that you can obtain “strong consistency” by requiring quorum reads and writes + (*w* + *r* > *n*). Depending on the exact algorithm, and depending on how you define + strong consistency, this is not quite true. + + “Last write wins” conflict resolution methods based on time-of-day clocks (e.g., in Cassandra and + ScyllaDB) are almost certainly nonlinearizable, because clock timestamps cannot be guaranteed to be + consistent with actual event ordering due to clock skew (see [“Relying on Synchronized Clocks”](/en/ch9#sec_distributed_clocks_relying)). + Even with quorums, nonlinearizable behavior is possible, as demonstrated in the next section. + +### Linearizability and quorums + +Intuitively, it seems as though quorum reads and writes should be linearizable in a +Dynamo-style model. However, when we have variable network delays, it is possible to have race +conditions, as demonstrated in [Figure 10-6](/en/ch10#fig_consistency_leaderless). + +![ddia 1006](/fig/ddia_1006.png) + +###### Figure 10-6. A nonlinearizable execution, despite using a quorum. + +In [Figure 10-6](/en/ch10#fig_consistency_leaderless), the initial value of *x* is 0, and a writer client is updating +*x* to 1 by sending the write to all three replicas (*n* = 3, *w* = 3). +Concurrently, client A reads from a quorum of two nodes (*r* = 2) and sees the new value 1 +on one of the nodes. 
Also concurrently with the write, client B reads from a different quorum of two +nodes, and gets back the old value 0 from both. + +The quorum condition is met (*w* + *r* > *n*), but this execution is nevertheless not +linearizable: B’s request begins after A’s request completes, but B returns the old value while A +returns the new value. (It’s once again the Aaliyah and Bryce situation from +[Figure 10-1](/en/ch10#fig_consistency_linearizability_0).) + +It is possible to make Dynamo-style quorums linearizable at the cost of reduced +performance: a reader must perform read repair (see [“Catching up on missed writes”](/en/ch6#sec_replication_read_repair)) synchronously, +before returning results to the application +[[24](/en/ch10#Attiya1995)]. +Moreover, before writing, a writer must read the latest state of a quorum of nodes to fetch the +latest timestamp of any prior write, and ensure that the new write has a greater timestamp +[[25](/en/ch10#Lynch1997), +[26](/en/ch10#Cachin2011)]. +However, Riak does not perform synchronous read repair due to the performance penalty. +Cassandra does wait for read repair to complete on quorum reads +[[27](/en/ch10#Ekstrom2012)], +but it loses linearizability due to its use of time-of-day clocks for timestamps. + +Moreover, only linearizable read and write operations can be implemented in this way; a +linearizable compare-and-set operation cannot, because it requires a consensus algorithm +[[28](/en/ch10#Herlihy1991)]. + +In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not +provide linearizability, even with quorum reads and writes. + +## The Cost of Linearizability + +As some replication methods can provide linearizability and others cannot, it is interesting to +explore the pros and cons of linearizability in more depth. 
+ +We already discussed some use cases for different replication methods in [Chapter 6](/en/ch6#ch_replication); for +example, we saw that multi-leader replication is often a good choice for multi-region +replication (see [“Geographically Distributed Operation”](/en/ch6#sec_replication_multi_dc)). An example of such a deployment is illustrated in +[Figure 10-7](/en/ch10#fig_consistency_cap_availability). + +![ddia 1007](/fig/ddia_1007.png) + +###### Figure 10-7. A network interruption forcing a choice between linearizability and availability. + +Consider what happens if there is a network interruption between the two regions. Let’s assume +that the network within each region is working, and clients can reach their local region, but the +regions cannot connect to each other. This is known as a *network partition*. + +With a multi-leader database, each region can continue operating normally: since writes from one +region are asynchronously replicated to the other, the writes are simply queued up and exchanged +when network connectivity is restored. + +On the other hand, if single-leader replication is used, then the leader must be in one of the +regions. Any writes and any linearizable reads must be sent to the leader—thus, for any +clients connected to a follower region, those read and write requests must be sent synchronously +over the network to the leader region. + +If the network between regions is interrupted in a single-leader setup, clients connected to +follower regions cannot contact the leader, so they cannot make any writes to the database, nor +any linearizable reads. They can still make reads from the follower, but they might be stale +(nonlinearizable). If the application requires linearizable reads and writes, the network +interruption causes the application to become unavailable in the regions that cannot contact the +leader. 
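The single-leader behavior during a network interruption can be illustrated with a toy model. This is only a minimal sketch under simplifying assumptions: one leader region, one follower region, and a boolean flag standing in for the inter-region link. The names `Leader`, `FollowerRegion`, `read_linearizable`, `read_stale`, and `Unavailable` are hypothetical and chosen for illustration.

```python
class Unavailable(Exception):
    """Raised when a request needs the leader but the leader is unreachable."""


class Leader:
    def __init__(self):
        self.value = 0

    def write(self, value):
        self.value = value


class FollowerRegion:
    """Toy follower region holding a possibly stale copy of the leader's value."""

    def __init__(self, leader):
        self.leader = leader
        self.local_copy = leader.value
        self.link_up = True  # assumption: stands in for the inter-region network

    def replicate(self):
        # Asynchronous replication only makes progress while the link is up.
        if self.link_up:
            self.local_copy = self.leader.value

    def write(self, value):
        # Writes must be forwarded to the leader.
        if not self.link_up:
            raise Unavailable("cannot reach leader")
        self.leader.write(value)

    def read_linearizable(self):
        # Linearizable reads must also go to the leader.
        if not self.link_up:
            raise Unavailable("cannot reach leader")
        return self.leader.value

    def read_stale(self):
        # Always answers, but may return an old value.
        return self.local_copy


leader = Leader()
follower = FollowerRegion(leader)
leader.write(1)
follower.replicate()
follower.link_up = False  # network interruption between the regions
leader.write(2)           # clients in the leader region keep working
follower.replicate()      # has no effect while the regions are disconnected
```

In this model, `follower.read_stale()` still answers with the old value, whereas `follower.read_linearizable()` and `follower.write(...)` raise `Unavailable` until the link is restored.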
+ +If clients can connect directly to the leader region, this is not a problem, since the +application continues to work normally there. But clients that can only reach a follower region +will experience an outage until the network link is repaired. + +### The CAP theorem + +This issue is not just a consequence of single-leader and multi-leader replication: any linearizable +database has this problem, no matter how it is implemented. The issue also isn’t specific to +multi-region deployments, but can occur on any unreliable network, even within one region. +The trade-off is as follows: + +* If your application *requires* linearizability, and some replicas are disconnected from the other + replicas due to a network problem, then some replicas cannot process requests while they are + disconnected: they must either wait until the network problem is fixed, or return an error (either + way, they become *unavailable*). This choice is sometimes known as *CP* (consistent under network + partitions). +* If your application *does not require* linearizability, then it can be written in a way that each + replica can process requests independently, even if it is disconnected from other replicas (e.g., + multi-leader). In this case, the application can remain *available* in the face of a network + problem, but its behavior is not linearizable. This choice is known as *AP* (available under + network partitions). + +Thus, applications that don’t require linearizability can be more tolerant of network problems. This +insight is popularly known as the *CAP theorem* +[[29](/en/ch10#Fox1999), +[30](/en/ch10#Gilbert2002), +[31](/en/ch10#Gilbert2012), +[32](/en/ch10#Brewer2012rules)], +named by Eric Brewer in 2000, although the trade-off had been known to designers of +distributed databases since the 1970s +[[33](/en/ch10#Davidson1985), +[34](/en/ch10#Johnson1975), +[35](/en/ch10#Fischer1982)]. 
+ +CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of +starting a discussion about trade-offs in databases. At the time, many distributed databases +focused on providing linearizable semantics on a cluster of machines with shared storage +[[19](/en/ch10#Vallath2006)], and CAP encouraged database engineers +to explore a wider design space of distributed shared-nothing systems, which were more suitable for +implementing large-scale web services +[[36](/en/ch10#Brewer2012nosql)]. +CAP deserves credit for this culture shift—it helped trigger the NoSQL movement, a burst of new +database technologies around the mid-2000s. + +# The Unhelpful CAP Theorem + +CAP is sometimes presented as *Consistency, Availability, Partition tolerance: pick 2 out of 3*. +Unfortunately, putting it this way is misleading +[[32](/en/ch10#Brewer2012rules)] because network partitions are a kind of +fault, so they aren’t something about which you have a choice: they will happen whether you like it +or not. + +At times when the network is working correctly, a system can provide both consistency +(linearizability) and total availability. When a network fault occurs, you have to choose between +either linearizability or total availability. Thus, a better way of phrasing CAP would be +*either Consistent or Available when Partitioned* +[[37](/en/ch10#Cockcroft2014)]. +A more reliable network needs to make this choice less often, but at some point the choice is +inevitable. + +The CP/AP classification scheme has several further flaws +[[4](/en/ch10#Kleppmann2015stop)]. *Consistency* is formalized as +linearizability (the theorem doesn’t say anything about weaker consistency models), and the +formalization of *availability* [[30](/en/ch10#Gilbert2002)] does not +match the usual meaning of the term +[[38](/en/ch10#Kleppmann2015critique)]. Many highly available (fault-tolerant) systems actually do not meet CAP’s +idiosyncratic definition of availability. 
Moreover, some system designers choose (with good reason) +to provide neither linearizability nor the form of availability that the CAP theorem assumes, so +those systems are neither CP nor AP [[39](/en/ch10#Abadi2010), +[40](/en/ch10#Abadi2017)]. + +All in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us +understand systems better, so CAP is best avoided. + +The CAP theorem as formally defined [[30](/en/ch10#Gilbert2002)] is of +very narrow scope: it only considers one consistency model (namely linearizability) and one kind of +fault (network partitions, which according to data from Google are the cause of less than 8% of +incidents [[41](/en/ch10#Brewer2017)]). +It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP +has been historically influential, it has little practical value for designing systems +[[4](/en/ch10#Kleppmann2015stop), +[38](/en/ch10#Kleppmann2015critique)]. + +There have been efforts to generalize CAP. For example, the *PACELC principle* observes that system +designers might also choose to weaken consistency at times when the network is working fine in order +to reduce latency [[39](/en/ch10#Abadi2010), +[40](/en/ch10#Abadi2017), +[42](/en/ch10#Abadi2012)]. +Thus, during a network partition (P), we need to choose between availability (A) and consistency +(C); else (E), when there is no partition, we may choose between low latency (L) and +consistency (C). However, this definition inherits several problems with CAP, such as the +counterintuitive definitions of consistency and availability. + +There are many more interesting impossibility results in distributed systems +[[43](/en/ch10#Lynch1989)], +and CAP has now been superseded by more precise results +[[44](/en/ch10#Mahajan2011), +[45](/en/ch10#Attiya2015)], +so it is of mostly historical interest today. 
+ +### Linearizability and network delays + +Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable +in practice. For example, even RAM on a modern multi-core CPU is not linearizable +[[46](/en/ch10#Sewell2010)]: +if a thread running on one CPU core writes to a memory address, and a thread on another CPU core +reads the same address shortly afterward, it is not guaranteed to read the value written by the +first thread (unless a *memory barrier* or *fence* +[[47](/en/ch10#Thompson2011)] is used). + +The reason for this behavior is that every CPU core has its own memory cache and store buffer. +Memory access first goes to the cache by default, and any changes are asynchronously written out to +main memory. Since accessing data in the cache is much faster than going to main memory +[[48](/en/ch10#Drepper2007_ch10)], this feature is essential for +good performance on modern CPUs. However, there are now several copies of the data (one in main +memory, and perhaps several more in various caches), and these copies are asynchronously updated, so +linearizability is lost. + +Why make this trade-off? It makes no sense to use the CAP theorem to justify the multi-core memory +consistency model: within one computer we usually assume reliable communication, and we don’t expect +one CPU core to be able to continue operating normally if it is disconnected from the rest of the +computer. The reason for dropping linearizability is *performance*, not fault tolerance +[[39](/en/ch10#Abadi2010)]. + +The same is true of many distributed databases that choose not to provide linearizable guarantees: +they do so primarily to increase performance, not so much for fault tolerance +[[42](/en/ch10#Abadi2012)]. +Linearizability is slow—and this is true all the time, not only during a network fault. + +Can’t we maybe find a more efficient implementation of linearizable storage? 
It seems the answer is +no: Attiya and Welch [[49](/en/ch10#Attiya1994)] +prove that if you want linearizability, the response time of read and write requests is at least +proportional to the uncertainty of delays in the network. In a network with highly variable delays, +like most computer networks (see [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing)), the response time of linearizable +reads and writes is inevitably going to be high. A faster algorithm for linearizability does not +exist, but weaker consistency models can be much faster, so this trade-off is important for +latency-sensitive systems. In [Link to Come] we will discuss some approaches for avoiding +linearizability without sacrificing correctness. + +# ID Generators and Logical Clocks + +In many applications you need to assign some sort of unique ID to database records when they are +created, which gives you a primary key by which you can refer to those records. In single-node +databases it is common to use an auto-incrementing integer, which has the advantage that it can be +stored in only 64 bits (or even 32 bits if you are sure that you will never have more than 4 billion +records, but that is risky). + +Another advantage of such auto-incrementing IDs is that the order of the IDs tells you the order in +which the records were created. For example, [Figure 10-8](/en/ch10#fig_consistency_id_generator) shows a chat +application that assigns auto-incrementing IDs to chat messages as they are posted. You can then +display the messages in order of increasing ID, and the resulting chat threads will make sense: +Aaliyah posts a question that is assigned ID 1, and Bryce’s answer to the question is assigned a +greater ID, namely 3. + +![ddia 1008](/fig/ddia_1008.png) + +###### Figure 10-8. An ID generator that assigns auto-incrementing integer IDs to messages in a chat application. + +This single-node ID generator is another example of a linearizable system. 
Each request to fetch the +ID is an operation that atomically increments a counter and returns the old counter value (a +*fetch-and-add* operation); linearizability ensures that if the posting of Aaliyah’s message +completes before Bryce’s posting begins, then Bryce’s ID must be greater than Aaliyah’s. The +messages by Aaliyah and Caleb in [Figure 10-8](/en/ch10#fig_consistency_id_generator) are concurrent, so linearizability +doesn’t specify how their IDs must be ordered, as long as they are unique. + +An in-memory single-node ID generator is easy to implement: you can use the atomic increment +instruction provided by your CPU, which allows multiple threads to safely increment the same +counter. It’s a bit more effort to make the counter persistent, so that the node can crash and +restart without resetting the counter value, which would result in duplicate IDs. But the real +problems are: + +* A single-node ID generator is not fault-tolerant because that node is a single point of failure. +* It’s slow if you want to create a record in another region, as you potentially have to make a + round-trip to the other side of the planet just to get an ID. +* That single node could become a bottleneck if you have high write throughput. + +There are various alternative options for ID generators that you can consider: + +Sharded ID assignment +: You could have multiple nodes that assign IDs—for example, one that generates only even numbers, + and one that generates only odd numbers. In general, you can reserve some bits in the ID to + contain a shard number. Those IDs are still compact, but you lose the ordering property: for + example, if you have chat messages with IDs 16 and 17, you don’t know whether message 16 was + actually sent first, because the IDs were assigned by different nodes, and one node might have + been ahead of the other. 
+ +Preallocated blocks of IDs +: Instead of handing out individual IDs, the single-node ID generator could hand out blocks + of IDs. For example, node A might claim the block of IDs from 1 to 1,000, and node B might claim + the block from 1,001 to 2,000. Then each node can independently hand out IDs from its block, and + request a new block from the single-node ID generator when its supply of sequence numbers begins + to run low. However, this scheme doesn’t ensure correct ordering either: it could happen that one + message is given an ID in the range from 1,001 to 2,000, and a later message is given an ID in the + range from 1 to 1,000 if the ID was assigned by a different node. + +Random UUIDs +: You can use *universally unique identifiers* (UUIDs), also known as *globally unique identifiers* + (GUIDs). These have the big advantage that they can be generated locally on any node without + requiring communication, but they require more space (128 bits). There are several different + versions of UUIDs; the simplest is version 4, which is essentially a random number that is so long + that it is very unlikely that two nodes would ever pick the same one. Unfortunately, the order of + such IDs is also random, so comparing two IDs tells you nothing about which one is newer. + +Wall-clock timestamp made unique +: If your nodes’ time-of-day clocks are kept approximately correct using NTP, you can generate IDs by + putting a timestamp from that clock in the most significant bits, and filling the remaining bits + with extra information that ensures the ID is unique even if the timestamp is not (for example, a + shard number and a per-shard incrementing sequence number, or a long random value). This approach + is used in version 7 UUIDs + [[50](/en/ch10#Davis2024)], + Twitter’s Snowflake [[51](/en/ch10#King2010)], + ULIDs [[52](/en/ch10#Feerasta2016)], + Hazelcast’s Flake ID generator, MongoDB ObjectIDs, and many similar schemes + [[50](/en/ch10#Davis2024)].
+ You can implement these ID generators in application code or within a database + [[53](/en/ch10#Conery2014)]. + +All these schemes generate IDs that are unique (at least with high enough probability that +collisions are vanishingly rare), but they have much weaker ordering guarantees for IDs than the +single-node auto-incrementing scheme. + +As discussed in [“Timestamps for ordering events”](/en/ch9#sec_distributed_lww), wall-clock timestamps can provide at best an approximate +ordering: if an earlier write gets a timestamp from a slightly fast clock, and a later write’s +timestamp is from a slightly slow clock, the timestamp order may be inconsistent with the order in +which the events actually happened. With clock jumps due to using a non-monotonic clock, even the +timestamps generated by a single node might be ordered incorrectly. ID generators based on +wall-clock time are therefore unlikely to be linearizable. + +You can reduce such ordering inconsistencies by relying on high-precision clock synchronization, +using atomic clocks or GPS receivers. But it would also be nice to be able to generate IDs that are +unique and correctly ordered without relying on special hardware. That’s what *logical clocks* are +about. + +## Logical Clocks + +In [“Unreliable Clocks”](/en/ch9#sec_distributed_clocks) we discussed time-of-day clocks and monotonic clocks. Both of these +are *physical clocks*: they measure the passing of seconds (or milliseconds, microseconds, etc.). + +In distributed systems it is common to also use another kind of clock, called a *logical clock*. +While a physical clock is a hardware device that counts the seconds that have elapsed, a logical +clock is an algorithm that counts the events that have occurred. A timestamp from a logical clock +therefore doesn’t tell you what time it is, but you *can* compare two timestamps from a logical +clock to tell which one is earlier and which one is later. 
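One way to make "an algorithm that counts events" concrete is the counter-plus-node-ID scheme that this chapter covers under Lamport timestamps. The following is only an illustrative sketch: the class name `LamportClock` and its methods `tick` and `observe` are made-up names, not taken from any library, and the three nodes mirror the chat example used in this chapter.

```python
class LamportClock:
    """Toy logical clock: a timestamp is the pair (counter, node_id)."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.counter = 0

    def tick(self):
        # Generating a timestamp: increment the counter and use the new value.
        self.counter += 1
        return (self.counter, self.node_id)

    def observe(self, timestamp):
        # Seeing another node's timestamp: catch up if its counter is greater.
        counter, _node_id = timestamp
        self.counter = max(self.counter, counter)


aaliyah = LamportClock("Aaliyah")
bryce = LamportClock("Bryce")
caleb = LamportClock("Caleb")

t_aaliyah = aaliyah.tick()  # Aaliyah posts without having seen Caleb's message
t_caleb = caleb.tick()      # Caleb posts without having seen Aaliyah's message

bryce.observe(t_aaliyah)    # Bryce receives both messages...
bryce.observe(t_caleb)
t_bryce = bryce.tick()      # ...and then replies

# Tuples compare by counter first, then by node ID (lexicographic strings).
ordered = sorted([t_bryce, t_caleb, t_aaliyah])
```

Because Python compares tuples element by element, sorting the `(counter, node_id)` pairs directly yields a total order in which Bryce's reply comes after both messages he had seen.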
+ +The requirements for a logical clock are typically: + +* that its timestamps are compact (a few bytes in size) and unique; +* that you can compare any two timestamps (i.e. they are *totally ordered*); and +* that the order of timestamps is *consistent with causality*: if operation A happened before B, + then A’s timestamp is less than B’s timestamp. (We discussed causality previously in + [“The “happens-before” relation and concurrency”](/en/ch6#sec_replication_happens_before).) + +A single-node ID generator meets these requirements, but the distributed ID generators we just +discussed do not meet the causal ordering requirement. + +### Lamport timestamps + +Fortunately, there is a simple method for generating logical timestamps that *is* consistent with +causality, and which you can use as a distributed ID generator. It is called a *Lamport clock*, +proposed in 1978 by Leslie Lamport [[54](/en/ch10#Lamport1978_ch10)], +in what is now one of the most-cited papers in the field of distributed systems. + +[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) shows how a Lamport clock would work in the chat example of +[Figure 10-8](/en/ch10#fig_consistency_id_generator). Each node has a unique identifier, which in +[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) is the name “Aaliyah”, “Bryce”, or “Caleb”, but which in practice +could be a random UUID or something similar. Moreover, each node keeps a counter of the number of +operations it has processed. A Lamport timestamp is then simply a pair of (*counter*, *node ID*). +Two nodes may sometimes have the same counter value, but by including the node ID in the timestamp, +each timestamp is made unique. + +![ddia 1009](/fig/ddia_1009.png) + +###### Figure 10-9. Lamport timestamps provide a total ordering consistent with causality. + +Every time a node generates a timestamp, it increments its counter value and uses the new value. 
+Moreover, every time a node sees a timestamp from another node, if the counter value in that +timestamp is greater than its local counter value, it increases its local counter to match the value +in the timestamp. + +In [Figure 10-9](/en/ch10#fig_consistency_lamport_ts), Aaliyah had not yet seen Caleb’s message when posting her own, +and vice versa. Assuming both users start with an initial counter value of 0, both therefore +increment their local counter and attach the new counter value of 1 to their message. When Bryce +receives those messages, he increases his local counter value to 1. Finally, Bryce sends a reply to +Aaliyah’s message, for which he increments his local counter and attaches the new value of 2 to the +message. + +To compare two Lamport timestamps, we first compare their counter value: for example, +(2, “Bryce”) is greater than (1, “Aaliyah”) and also greater than (1, “Caleb”). If +two timestamps have the same counter, we compare their node IDs instead, using the usual +lexicographic string comparison. Thus, the timestamp order in this example is +(1, “Aaliyah”) < (1, “Caleb”) < (2, “Bryce”). + +### Hybrid logical clocks + +Lamport timestamps are good at capturing the order in which things happened, but they have some +limitations: + +* Since they have no direct relation to physical time, you can’t use them to find, say, all the + messages that were posted on a particular date—you would need to store the physical time + separately. +* If two nodes never communicate, one node’s counter increments will never be reflected in the other + one’s counter. As a result, it could happen that events generated around the same time on + different nodes have wildly different counter values. + +A *hybrid logical clock* combines the advantages of physical time-of-day clocks with the ordering +guarantees of Lamport clocks +[[55](/en/ch10#Kulkarni2014)]. +Like a physical clock, it counts seconds or microseconds. 
Like a Lamport clock, when one node sees a +timestamp from another node that is greater than its local clock value, it moves its own local value +forward to match the other node’s timestamp. As a result, if one node’s clock is running fast, the +other nodes will similarly move their clocks forward when they communicate. + +Every time a hybrid logical clock generates a timestamp, the clock is also incremented, which +ensures that the clock monotonically moves forward, even if the underlying physical clock jumps +backwards, for example due to NTP adjustments. Thus, the hybrid logical clock might be slightly +ahead of the underlying physical clock. Details of the algorithm ensure that this discrepancy +remains as small as possible. + +As a result, you can treat a timestamp from a hybrid logical clock almost like a timestamp from a +conventional time-of-day clock, with the added property that its ordering is consistent with the +happens-before relation. It doesn’t depend on any special hardware, and requires only roughly +synchronized clocks. Hybrid logical clocks are used by CockroachDB, for example. + +### Lamport/hybrid logical clocks vs. vector clocks + +In [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl) we discussed how snapshot isolation is often implemented: +essentially, by giving each transaction a transaction ID, and allowing each transaction to see +writes made by transactions with a lower ID, while hiding writes made by transactions with higher +IDs. Lamport clocks and hybrid logical clocks are a good way of generating these transaction +IDs, because they ensure that the snapshot is consistent with causality +[[56](/en/ch10#Bravo2015)]. + +When multiple timestamps are generated concurrently, these algorithms order them arbitrarily. This +means that when you look at two timestamps, you generally can’t tell whether they were generated +concurrently or whether one happened before the other.
(In the example of +[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) you actually can tell that Aaliyah and Caleb’s messages must have +been concurrent, because they have the same counter value, but when the counter values are different +you can’t tell whether they were concurrent.) + +If you want to be able to determine when records were created concurrently, you need a different +algorithm, such as a *vector clock*. The downside is that the timestamps from a vector clock are +much bigger—potentially one integer for every node in the system. See [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent) +for more details on detecting concurrency. + +## Linearizable ID Generators + +Although Lamport clocks and hybrid logical clocks provide useful ordering guarantees, that ordering +is still weaker than the linearizable single-node ID generator we talked about previously. Recall +that linearizability requires that if request A completed before request B began, then B must have +the higher ID, even if A and B never communicated with each other. On the other hand, Lamport clocks +can only ensure that a node generates timestamps that are greater than any other timestamp that node +has seen, but it can’t say anything about timestamps that it hasn’t seen. + +[Figure 10-10](/en/ch10#fig_consistency_permissions) shows how a non-linearizable ID generator could cause problems. +Imagine a social media website where user A wants to share an embarrassing photo privately with +their friends. A’s account is initially public, but using their laptop, A first changes their +account settings to private. Then A uses their phone to upload the photo. Since A performed these +updates in sequence, they might reasonably expect the photo upload to be subject to the new, +restricted account permissions. + +![ddia 1010](/fig/ddia_1010.png) + +###### Figure 10-10. User A first sets their account to private, then shares a photo. 
With a non-linearizable ID generator, an unauthorized viewer may see the photo. + +The account permission and the photo are stored in two separate databases (or separate shards of the +same database), and let’s assume they use a Lamport clock or hybrid logical clock to assign a +timestamp to every write. Since the photos database didn’t read from the accounts database, it’s +possible that the local counter in the photos database is slightly behind, and therefore the photo +upload is assigned a lower timestamp than the update of the account settings. + +Next, let’s say that a viewer (who is not friends with A) is looking at A’s profile, and their read +uses an MVCC implementation of snapshot isolation. It could happen that the viewer’s read has a +timestamp that is greater than that of the photo upload, but less than that of the account settings +update. As a result, the system will determine that the account is still public at the time of the +read, and therefore show the viewer the embarrassing photo that they were not supposed to see. + +You can imagine several possible ways of fixing this problem. Maybe the photos database should have +read the user’s account status before performing the write, but it’s easy to forget such a check. +If A’s actions had been performed on the same device, maybe the app on their device could have +tracked the latest timestamp of that user’s writes—but if the user uses a laptop and a phone, as in +the example, that’s not so easy. + +The simplest solution in this case would be to use a linearizable ID generator, which would ensure +that the photo upload is assigned a greater ID than the account permissions change. + +### Implementing a linearizable ID generator + +The simplest way of ensuring that ID assignment is linearizable is by actually using a single node +for this purpose. 
That node only needs to atomically increment a counter and return its value when +requested, persist the counter value (so that it doesn’t generate duplicate IDs if the node crashes +and restarts), and replicate it for fault tolerance (using single-leader replication). This approach +is used in practice: for example, TiDB/TiKV calls it a *timestamp oracle*, inspired by Google’s +Percolator [[57](/en/ch10#Peng2010_ch10)]. + +As an optimization, you can avoid performing a disk write and replication on every single request. +Instead, the ID generator can write a record describing a batch of IDs; once that record is +persisted and replicated, the node can start handing out those IDs to clients in sequence. Before it +runs out of IDs in that batch, it can persist and replicate the record for the next batch. That way, +some IDs will be skipped if the node crashes and restarts or if you fail over to a follower, but you +won’t issue any duplicate or out-of-order IDs. + +You can’t easily shard the ID generator, since if you have multiple shards independently handing out +IDs, you can no longer guarantee that their order is linearizable. You also can’t easily distribute +the ID generator across multiple regions; thus, in a geographically distributed database, all +requests for IDs will have to go to a node in a single region. On the upside, the ID generator’s job +is very simple, so a single node can handle a large request throughput. + +If you don’t want to use a single-node ID generator, an alternative is possible: you can do what +Google’s Spanner does, as discussed in [“Synchronized clocks for global snapshots”](/en/ch9#sec_distributed_spanner). It relies on a physical clock +that returns not just a single timestamp, but a range of timestamps indicating the uncertainty in +the clock reading. It then waits for the duration of that uncertainty interval to elapse before +returning. 
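The wait-out-the-uncertainty idea can be sketched in a few lines. This is a toy model under stated assumptions, not Spanner's actual API: `UncertainClock` and `next_id` are hypothetical names, `time.monotonic()` stands in for the true physical time, and each node's clock error (`offset`) is assumed to stay within its advertised bound `epsilon`.

```python
import time


class UncertainClock:
    """Toy clock whose reading may be off by at most `epsilon` seconds."""

    def __init__(self, offset, epsilon):
        # Assumption: the real error never exceeds the advertised bound.
        assert abs(offset) <= epsilon
        self.offset = offset
        self.epsilon = epsilon

    def now(self):
        local = time.monotonic() + self.offset
        # Return (earliest, latest); the true time lies within this interval.
        return (local - self.epsilon, local + self.epsilon)


def next_id(clock):
    """Pick a timestamp, then wait until it is definitely in the past."""
    _earliest, latest = clock.now()
    timestamp = latest
    # Wait out the uncertainty: only return once even the earliest possible
    # current time is later than the timestamp we picked.
    while clock.now()[0] <= timestamp:
        time.sleep(clock.epsilon / 10)
    return timestamp


# Two nodes whose clocks are skewed in opposite directions.
fast_node = UncertainClock(offset=+0.004, epsilon=0.005)
slow_node = UncertainClock(offset=-0.004, epsilon=0.005)

first = next_id(fast_node)   # completes before the second request begins
second = next_id(slow_node)  # issued by a node whose clock is behind
```

Even though `slow_node`'s clock lags `fast_node`'s by 8 ms in this model, `second` ends up greater than `first`, because the first call does not return until its timestamp has passed on every clock that respects the error bound. The price is a wait of roughly twice `epsilon` per request.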
+ +Assuming that the uncertainty interval is correct (i.e., that the true current physical time always +lies within that interval), this process also ensures that if one request completes before another +begins, the later request will have a greater timestamp. This approach achieves linearizable ID +assignment without any communication: even requests in different regions will be ordered correctly, +without waiting for cross-region requests. The downside is that you need hardware and software +support to keep clocks tightly synchronized and to compute the necessary uncertainty interval. + +### Enforcing constraints using logical clocks + +In [“Constraints and uniqueness guarantees”](/en/ch10#sec_consistency_uniqueness) we saw that a linearizable compare-and-set operation can be used +to implement locks, uniqueness constraints, and similar constructs in a distributed system. This +raises the question: is a logical clock or a linearizable ID generator also sufficient to implement +these things? + +The answer is: not quite. When you have several nodes that are all trying to acquire the +same lock or register the same username, you could use a logical clock to assign timestamps to those +requests, and pick the one with the lowest timestamp as the winner. If the clock is linearizable, +you know that any future requests will always generate greater timestamps, and therefore you can be +sure that no future request will receive an even lower timestamp than the winner. + +Unfortunately, part of the problem is still unsolved: how does a node know whether its own timestamp +is the lowest? To be sure, it needs to hear from *every* other node that might have generated a +timestamp [[54](/en/ch10#Lamport1978_ch10)]. If one of the other nodes +has failed in the meantime, or cannot be reached due to a network problem, this system would grind +to a halt, because we can’t be sure whether that node might have the lowest timestamp.
This is not +the kind of fault-tolerant system that we need. + +To implement locks, leases, and similar constructs in a fault-tolerant way, we need something +stronger than logical clocks or ID generators: we need consensus. + +# Consensus + +In this chapter we have seen several examples of things that are easy when you have only a single +node, but which get a lot harder if you want fault tolerance: + +* A database can be linearizable if you have only a single leader, and you make all reads and writes + on that leader. But how do you fail over if that leader fails, while avoiding split brain? How do + you ensure that a node that believes itself to be the leader hasn’t actually been voted out in the + meantime? +* A linearizable ID generator on a single node is just a counter with an atomic fetch-and-add + instruction, but what if it crashes? +* An atomic compare-and-set (CAS) operation is useful for many things, such as deciding who gets a + lock or lease when several processes are racing to acquire it, or ensuring the uniqueness of a + file or user with a given name. On a single node, CAS may be as simple as one CPU instruction, but + how do you make it fault-tolerant? + +It turns out that all of these are instances of the same fundamental distributed systems problem: +*consensus*. Consensus is one of the most important and fundamental problems in distributed +computing; it is also infamously difficult to get right +[[58](/en/ch10#Chandra2007), +[59](/en/ch10#Portnoy2012)], +and many systems have got it wrong in the past. Now that we have discussed replication +([Chapter 6](/en/ch6#ch_replication)), transactions ([Chapter 8](/en/ch8#ch_transactions)), system models ([Chapter 9](/en/ch9#ch_distributed)), and +linearizability (this chapter), we are finally ready to tackle the consensus problem. 
+ +The best-known consensus algorithms are Viewstamped Replication +[[60](/en/ch10#Oki1988), +[61](/en/ch10#Liskov2012)], +Paxos [[58](/en/ch10#Chandra2007), +[62](/en/ch10#Lamport1998), +[63](/en/ch10#Lamport2001), +[64](/en/ch10#vanRenesse2011)], +Raft [[23](/en/ch10#Ongaro2014atc), +[65](/en/ch10#Ongaro2014thesis), +[66](/en/ch10#Howard2015refloated)], +and Zab [[18](/en/ch10#Junqueira2013_ch10), +[22](/en/ch10#Junqueira2011), +[67](/en/ch10#Medeiros2012)]. +There are quite a few similarities between these algorithms, but they are not the same +[[68](/en/ch10#vanRenesse2014), +[69](/en/ch10#Howard2020)]. +These algorithms work in a non-Byzantine system model: that is, network communication may be +arbitrarily delayed or dropped, and nodes may crash, restart, and become disconnected, but the +algorithms assume that nodes otherwise follow the protocol correctly and do not behave maliciously. + +There are also consensus algorithms that can tolerate some Byzantine nodes, i.e., nodes that don’t +correctly follow the protocol (for example, by sending contradictory messages to other nodes). A +common assumption is that fewer than one-third of the nodes are Byzantine-faulty +[[26](/en/ch10#Cachin2011), +[70](/en/ch10#Castro2002)]. +Such *Byzantine fault tolerant* (BFT) consensus algorithms are used in blockchains +[[71](/en/ch10#Bano2019_ch10)]. +However, as explained in [“Byzantine Faults”](/en/ch9#sec_distributed_byzantine), BFT algorithms are beyond the scope of this +book. + +# The Impossibility of Consensus + +You may have heard about the FLP result +[[72](/en/ch10#Fischer1985)]—named after the +authors Fischer, Lynch, and Paterson—which proves that there is no algorithm that is always able to +reach consensus if there is a risk that a node may crash. In a distributed system, we must assume +that nodes may crash, so reliable consensus is impossible. Yet, here we are, discussing algorithms +for achieving consensus. What is going on here? 
+ +Firstly, FLP doesn’t say that we can never reach consensus—it only says that we can’t guarantee that +a consensus algorithm will *always* terminate. Moreover, the FLP result is proved assuming a +deterministic algorithm in the asynchronous system model (see [“System Model and Reality”](/en/ch9#sec_distributed_system_model)), +which means the algorithm cannot use any clocks or timeouts. If it can use timeouts to suspect that +another node may have crashed (even if the suspicion is sometimes wrong), then consensus becomes +solvable [[73](/en/ch10#Chandra1996)]. +Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility +result [[74](/en/ch10#BenOr1983)]. + +Thus, although the FLP result about the impossibility of consensus is of great theoretical +importance, distributed systems can usually achieve consensus in practice. + +## The Many Faces of Consensus + +Consensus can be expressed in several different ways: + +* *Single-value consensus* is very similar to an atomic *compare-and-set* operation, and it can be + used to implement locks, leases, and uniqueness constraints. +* Constructing an *append-only log* also requires consensus; it is usually formalized as *total + order broadcast*. With a log you can build *state machine replication*, leader-based replication, + event sourcing, and other useful things. +* *Atomic commitment* of a multi-database or multi-shard transaction requires that all participants + agree on whether to commit or abort the transaction. + +We will explore all of these shortly. In fact, these problems are all equivalent to each other: if +you have an algorithm that solves one of these problems, you can convert it into a solution for any +of the others. This is quite a profound and perhaps surprising insight! And that’s why we can lump +all of these things together under “consensus”, even though they look quite different on the +surface. 
+ +### Single-value consensus + +The standard formulation of consensus involves getting multiple nodes to agree on a single value. +For example: + +* When a database with single-leader replication first starts up, or when the existing leader fails, + several nodes may concurrently try to become the leader. Similarly, multiple nodes may race to + acquire a lock or lease. Consensus allows them to decide which one wins. +* If several people concurrently try to book the last seat on an airplane, or the same seat in a + theater, or try to register an account with the same username, then a consensus algorithm could + determine which one should succeed. + +More generally, one or more nodes may *propose* values, and the consensus algorithm *decides* on one +of those values. In the examples above, each node could propose its own ID, and the algorithm +decides which node ID should become the new leader, the holder of the lease, or the buyer of the +airplane/theater seat. In this formalism, a consensus algorithm must satisfy the following +properties [[26](/en/ch10#Cachin2011)]: + +Uniform agreement +: No two nodes decide differently. + +Integrity +: Once a node has decided one value, it cannot change its mind by deciding another value. + +Validity +: If a node decides value *v*, then *v* was proposed by some node. + +Termination +: Every node that does not crash eventually decides some value. + +If you want to decide multiple values, you can run a separate instance of the consensus algorithm +for each. For example, you could have a separate consensus run for each bookable seat in the +theater, so that you get one decision (one buyer) for each seat. + +The uniform agreement and integrity properties define the core idea of consensus: everyone decides +on the same outcome, and once you have decided, you cannot change your mind. 
The validity property +rules out trivial solutions: for example, you could have an algorithm that always decides `null`, no +matter what was proposed; this algorithm would satisfy the agreement and integrity properties, but +not the validity property. + +If you don’t care about fault tolerance, then satisfying the first three properties is easy: you can +just hardcode one node to be the “dictator,” and let that node make all of the decisions. However, +if that one node fails, then the system can no longer make any decisions—just like single-leader +replication without failover. All the difficulty arises from the need for fault tolerance. + +The termination property formalizes the idea of fault tolerance. It essentially says that a +consensus algorithm cannot simply sit around and do nothing forever—in other words, it must make +progress. Even if some nodes fail, the other nodes must still reach a decision. (Termination is a +liveness property, whereas the other three are safety properties—see +[“Safety and liveness”](/en/ch9#sec_distributed_safety_liveness).) + +If a crashed node may recover, you could just wait for it to come back. However, consensus must +ensure that it makes a decision even if a crashed node suddenly disappears and never comes back. +(Instead of a software crash, imagine that there is an earthquake, and the datacenter containing +your node is destroyed by a landslide. You must assume that your node is buried under 30 feet of mud +and is never going to come back online.) + +Of course, if *all* nodes crash and none of them are running, then it is not possible for any +algorithm to decide anything. There is a limit to the number of failures that an algorithm can +tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of +nodes to be functioning correctly in order to assure termination +[[73](/en/ch10#Chandra1996)]. 
That majority can safely form a quorum +(see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)). + +Thus, the termination property is subject to the assumption that fewer than half of the nodes are +crashed or unreachable. However, most consensus algorithms ensure that the safety +properties—agreement, integrity, and validity—are always met, even if a majority of nodes fail or +there is a severe network problem +[[75](/en/ch10#Dwork1988_ch10)]. +Thus, a large-scale outage can stop the system from being able to process requests, but it cannot +corrupt the consensus system by causing it to make inconsistent decisions. + +### Compare-and-set as consensus + +A compare-and-set (CAS) operation checks whether the current value of some object equals some +expected value; if yes, it atomically updates the object to some new value; if no, it leaves the +object unchanged and returns an error. + +If you have a fault-tolerant, linearizable CAS operation, it is easy to solve the consensus problem: +initially set the object to a null value; each node that wants to propose a value invokes CAS with +the expected value being null, and the new value being the value it wants to propose (assuming it is +non-null). The decided value is then whatever value the object is set to. + +Likewise, if you have a solution for consensus, you can implement CAS: whenever one or more nodes +want to perform CAS with the same expected value, you use the consensus protocol to propose the new +values in the CAS invocation, and then set the object to whatever value was decided by the +consensus. Any CAS invocations whose new value was not decided return an error. CAS invocations with +different expected values use separate runs of the consensus protocol. + +This shows that CAS and consensus are equivalent to each other +[[28](/en/ch10#Herlihy1991), +[73](/en/ch10#Chandra1996)]. +Again, both are straightforward on a single node, but challenging to make fault-tolerant. 
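
The first direction of this equivalence (consensus built from CAS) can be sketched as follows. This is a minimal illustration, not a fault-tolerant implementation: `CasRegister` is a hypothetical in-process stand-in for a linearizable CAS object, protected by a lock on a single machine, and `propose` is an assumed helper name.

```python
import threading
from typing import List, Optional


class CasRegister:
    """In-process stand-in for a fault-tolerant, linearizable CAS object."""

    def __init__(self) -> None:
        self._value: Optional[str] = None
        self._lock = threading.Lock()

    def compare_and_set(self, expected: Optional[str], new: str) -> bool:
        with self._lock:
            if self._value != expected:
                return False          # object left unchanged
            self._value = new
            return True

    def read(self) -> Optional[str]:
        with self._lock:
            return self._value


def propose(register: CasRegister, value: str) -> str:
    """Propose a non-null value and return whichever value was decided."""
    register.compare_and_set(None, value)   # succeeds only for the first proposer
    decided = register.read()
    assert decided is not None               # some proposal has been applied by now
    return decided


register = CasRegister()
proposals = ["node-a", "node-b", "node-c"]
decisions: List[str] = []
threads = [
    threading.Thread(target=lambda v=v: decisions.append(propose(register, v)))
    for v in proposals
]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(decisions)
```

Whichever thread wins the race, every proposer reads back the same value, and that value is one of the proposals, which corresponds to the agreement and validity properties.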
As an +example of CAS in a distributed setting, we saw conditional write operations for object stores in +[“Databases backed by object storage”](/en/ch6#sec_replication_object_storage), which allow a write to happen only if an object with the same +name has not been created or modified by another client since the current client last read it. + +However, a linearizable read-write register is not sufficient to solve consensus. The FLP result +tells us that consensus cannot be solved by a deterministic algorithm in the asynchronous crash-stop +model [[72](/en/ch10#Fischer1985)], but we saw in +[“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum +reads/writes in this model [[24](/en/ch10#Attiya1995), +[25](/en/ch10#Lynch1997), [26](/en/ch10#Cachin2011)]. +From this it follows that a linearizable register cannot solve consensus. + +### Shared logs as consensus + +We have seen several examples of logs, such as replication logs, transaction logs, and write-ahead +logs. A log stores a sequence of *log entries*, and anyone who reads it sees the same entries in the +same order. Sometimes a log has a single writer that is allowed to append new entries, but a *shared +log* is one where multiple nodes can request entries to be appended. An example is single-leader +replication: any client can ask the leader to make a write, which the leader appends to the +replication log, and then all followers apply the writes in the same order as the leader. + +More formally, a shared log supports two operations: you can request that a value be added to the +log, and you can read the entries in the log. It must satisfy the following properties: + +Eventual append +: If a node requests that some value be added to the log, and the node does not crash, then that node + must eventually read that value in a log entry.
+ +Reliable delivery +: No log entries are lost: if one node reads some log entry, then eventually every node that does + not crash must also read that log entry. + +Append-only +: Once a node has read some log entry, it is immutable, and new log entries can only be added after + it, but not before. A node may re-read the log, in which case it sees the same log entries in the + same order as it read them initially (even if the node crashes and restarts). + +Agreement +: If two nodes both read some log entry *e*, then prior to *e* they must have read exactly the same + sequence of log entries in the same order. + +Validity +: If a node reads a log entry containing some value, then some node previously requested for that + value to be added to the log. + +###### Note + +A shared log is formally known as a *total order broadcast*, *atomic broadcast*, or *total order +multicast* protocol [[26](/en/ch10#Cachin2011), +[76](/en/ch10#Defago2004), +[77](/en/ch10#Attiya2004)]. +It’s the same thing described in different words: requesting a value to be added to the log is then +called “broadcasting” it, and reading a log entry is called “delivering” it. + +If you have an implementation of a shared log, it is easy to solve the consensus problem: every node +that wants to propose a value requests for it to be added to the log, and whichever value is read +back in the first log entry is the value that is decided. Since all nodes read log entries in the +same order, they are guaranteed to agree on which value is delivered first +[[28](/en/ch10#Herlihy1991)]. + +Conversely, if you have a solution for consensus, you can implement a shared log. The details are a +bit more complicated, but the basic idea is this +[[73](/en/ch10#Chandra1996)]: + +1. You have a slot in the log for every future log entry, and you run a separate instance of the + consensus algorithm for every such slot to decide what value should go in that entry. +2. 
When a node wants to add a value to the log, it proposes that value for one of the slots that has + not yet been decided. +3. When the consensus algorithm decides for one of the slots, and all the previous slots have + already been decided, then the decided value is appended as a new log entry, and any consecutive + slots that have been decided also have their decided value appended to the log. +4. If a proposed value was not chosen for some slot, the node that wanted to add it retries by + proposing it for a later slot. + +This shows that consensus is equivalent to total order broadcast and shared logs. Single-leader +replication without failover does not meet the liveness requirements, since it stops delivering +messages if the leader crashes. As usual, the challenge is in performing failover safely and +automatically. + +### Fetch-and-add as consensus + +The linearizable ID generator we saw in [“Linearizable ID Generators”](/en/ch10#sec_consistency_linearizable_id) comes close to solving +consensus, but it falls slightly short. We can implement such an ID generator using a fetch-and-add +operation, which atomically increments a counter and returns the old counter value. + +If you have a CAS operation, it’s easy to implement fetch-and-add: first read the counter value, +then perform a CAS where the expected value is the value you read, and the new value is that value +plus one. If the CAS fails, you retry the whole process until the CAS succeeds. This is less +efficient than a native fetch-and-add operation when there is contention, but it is functionally +equivalent. Since you can implement CAS using consensus, you can also implement fetch-and-add using +consensus. + +Conversely, if you have a fault-tolerant fetch-and-add operation, can you solve the consensus +problem? Let’s say you initialize the counter to zero, and every node that wants to propose a value +invokes the fetch-and-add operation to increment the counter. 
Since the fetch-and-add operation is +atomic, one of the nodes will read the initial value of zero, and the others will all read a value +that has been incremented at least once. + +Now let’s say that the node that reads zero is the winner, and its value is decided. That works for +the node that read zero, but the other nodes have a problem: they know that they are not the winner, +but they don’t know which of the other nodes has won. The winner could send a message to the other +nodes to let them know it has won, but what if the winner crashes before it has a chance to send +this message? In that case the other nodes are left hanging, unable to decide any value, and thus +the consensus does not terminate. And the other nodes can’t fall back to another node, because the +node that read zero may yet come back and rightly decide the value it proposed. + +An exception is if we know for sure that no more than two nodes will propose a value. In that case, +the nodes can send each other the values they want to propose, and then each perform the +fetch-and-add operation. The node that reads zero decides its own value, and the node that reads one +decides the other node’s value. This solves the consensus problem among two nodes, which is why we +can say that fetch-and-add has a *consensus number* of two +[[28](/en/ch10#Herlihy1991)]. +In contrast, CAS and shared logs solve consensus for any number of nodes that may propose values, so +they have a consensus number of ∞ (infinity). + +### Atomic commitment as consensus + +In [“Distributed Transactions”](/en/ch8#sec_transactions_distributed) we saw the *atomic commitment* problem, which is to ensure that +the databases or shards involved in a distributed transaction all either commit or abort a +transaction. We also saw the *two-phase commit* algorithm, which relies on a coordinator that is a +single point of failure. + +What is the relationship between consensus and atomic commitment? 
At first glance, they seem very +similar: both require nodes to come to some form of agreement. However, there is one important +difference: with consensus it’s okay to decide any value that was proposed, whereas with atomic +commitment the algorithm *must* abort if *any* of the participants voted to abort. More precisely, +atomic commitment requires the following properties +[[78](/en/ch10#Guerraoui1995)]: + +Uniform agreement +: No two nodes decide on different outcomes. + +Integrity +: Once a node has decided one outcome, it cannot change its mind by deciding another outcome. + +Validity +: If a node decides to commit, then all nodes must have previously voted to commit. If any node + voted to abort, the nodes must abort. + +Non-triviality +: If all nodes vote to commit, and no communication timeouts occur, then all nodes must decide to + commit. + +Termination +: Every node that does not crash eventually decides to either commit or abort. + +The validity property ensures that a transaction can only commit if all nodes agree; and the +non-triviality property ensures the algorithm can’t simply always abort (but it allows an abort if +any of the communication among the nodes times out). The other three properties are basically the +same as for consensus. + +If you have a solution for consensus, there are multiple ways you could solve atomic commitment +[[78](/en/ch10#Guerraoui1995), +[79](/en/ch10#Gray2006)]. +One works like this: when you want to commit the transaction, every node sends its vote to commit or +abort to every other node. A node that receives a vote to commit from itself and every other node +proposes “commit” using the consensus algorithm; a node that receives a vote to abort, or that +experiences a timeout, proposes “abort” using the consensus algorithm. When a node finds out what the +consensus algorithm decided, it commits or aborts accordingly. + +In this algorithm, “commit” will only be proposed if all nodes voted to commit.
If any node voted to +abort, all proposals in the consensus algorithm will be “abort”. It could happen that some nodes +propose “abort” while others propose “commit” if all nodes voted to commit but some communication +timed out; in this case it doesn’t matter whether the nodes commit or abort, as long as they all do +the same. + +If you have a fault-tolerant atomic commitment protocol, you can also solve consensus. Every node +that wants to propose a value starts a transaction on a quorum of nodes, and at each node it +performs a single-node CAS to set a register to the proposed value if its value has not already been +set by another transaction. If the CAS succeeds, the node votes to commit, otherwise it votes to +abort. If the atomic commit protocol decides to commit a transaction, its value is decided for +consensus; if atomic commit aborts, the proposing node retries with a new transaction. + +This shows that atomic commit and consensus are also equivalent to each other. + +## Consensus in Practice + +We have seen that single-value consensus, CAS, shared logs, and atomic commitment are all equivalent +to each other: you can convert a solution to one of them into a solution to any of the others. That +is a valuable theoretical insight, but it doesn’t answer the question: which of these many +formulations of consensus is the most useful in practice? + +The answer is that most consensus systems provide shared logs, also known as total order broadcast. +Raft, Viewstamped Replication, and Zab provide shared logs right out of the box. Paxos provides +single-value consensus, but in practice most systems using Paxos actually use the extension called +Multi-Paxos, which also provides a shared log. 
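
To make one of these conversions concrete, here is a minimal sketch of the slot-based construction from “Shared logs as consensus”, which builds a shared log out of single-value consensus instances. It rests on several assumptions: `SlotConsensus` is a hypothetical single-process stand-in for a real fault-tolerant consensus instance, entries are assumed to be unique (tagged with a node ID and sequence number), and all names are illustrative rather than taken from any real system.

```python
from typing import Dict, List, Optional, Tuple

Entry = Tuple[str, int, str]  # (node id, per-node sequence number, payload)


class SlotConsensus:
    """Stand-in for one single-value consensus instance (not fault-tolerant)."""

    def __init__(self) -> None:
        self.decided: Optional[Entry] = None

    def propose(self, entry: Entry) -> Entry:
        if self.decided is None:
            self.decided = entry      # the first proposal for this slot wins
        return self.decided


class SharedLog:
    """Shared log with one consensus instance per log slot."""

    def __init__(self) -> None:
        self._slots: Dict[int, SlotConsensus] = {}

    def append(self, entry: Entry) -> int:
        """Propose `entry` for successive slots until one decides it."""
        index = 0
        while self._slots.setdefault(index, SlotConsensus()).propose(entry) != entry:
            index += 1                # lost this slot, retry with a later one
        return index

    def read(self) -> List[Entry]:
        """Return the prefix of consecutive slots that have been decided."""
        entries: List[Entry] = []
        index = 0
        while index in self._slots and self._slots[index].decided is not None:
            entries.append(self._slots[index].decided)
            index += 1
        return entries


log = SharedLog()
slot_a = log.append(("node-a", 1, "set x = 1"))
slot_b = log.append(("node-b", 1, "set x = 2"))
print(slot_a, slot_b, log.read())
```

Because each slot is decided exactly once, every reader sees the same entries in the same order. Starting every append at slot 0 is wasteful; a practical implementation would remember the first slot it has not yet seen decided.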
+ +### Using shared logs + +A shared log is a good fit for database replication: if every log entry represents a write to the +database, and every replica processes the same writes in the same order using deterministic logic, +then the replicas will all end up in a consistent state. This idea is known as *state machine +replication* [[80](/en/ch10#Schneider1990)], +and it is the principle behind event sourcing, which we saw in [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events). Shared +logs are also useful for stream processing, as we shall see in [Link to Come]. + +Similarly, a shared log can be used to implement serializable transactions: as discussed in +[“Actual Serial Execution”](/en/ch8#sec_transactions_serial), if every log entry represents a deterministic transaction to be +executed as a stored procedure, and if every node executes those transactions in the same order, +then the transactions will be serializable +[[81](/en/ch10#Thomson2012), +[82](/en/ch10#Balakrishnan2013)]. + +###### Note + +Sharded databases with a strong consistency model often maintain a separate log per shard, which +improves scalability, but limits the consistency guarantees (e.g., consistent snapshots, foreign key +references) they can offer across shards. Serializable transactions across shards are possible, but +require additional coordination [[83](/en/ch10#Balakrishnan2012)]. + +A shared log is also powerful because it can easily be adapted to other forms of consensus: + +* We saw previously how to use it to implement single-value consensus and CAS: simply decide the + value that appears first in the log. +* If you want many instances of single-value consensus (e.g. one per seat in a theater that several + people are trying to book), include the seat number in the log entries, and decide the first log + entry that contains a given seat number. 
+* If you want an atomic fetch-and-add, put the number to add to the counter in a log entry, and the + current counter value is the sum of all of the log entries so far. A simple counter on log entries + can be used to generate fencing tokens (see [“Fencing off zombies and delayed requests”](/en/ch9#sec_distributed_fencing_tokens)); for example, in + ZooKeeper, this sequence number is called `zxid` + [[18](/en/ch10#Junqueira2013_ch10)]. + +### From single-leader replication to consensus + +We saw previously that single-value consensus is easy if you have a single “dictator” node that +makes the decision, and likewise a shared log is easy if a single leader is the only node that is +allowed to append entries to it. The question is how to provide fault tolerance if that node fails. + +Traditionally, databases with single-leader replication didn’t solve this problem: they left leader +failover as an action that a human administrator had to perform manually. Unfortunately, this means +a significant amount of downtime, since there is a limit to how fast humans can react, and it +doesn’t satisfy the termination property of consensus. For consensus, we require that the algorithm +can automatically choose a new leader. (Not all consensus algorithms have a leader, but the commonly +used algorithms do [[84](/en/ch10#Gavrielatos2021)].) + +However, there is a problem. We previously discussed the problem of split brain, and said that all +nodes need to agree who the leader is—otherwise two different nodes could each believe themselves to +be the leader, and consequently make inconsistent decisions. Thus, it seems like we need consensus +in order to elect a leader, and we need a leader in order to solve consensus. How do we break out of +this conundrum? + +In fact, consensus algorithms don’t require that there is only one leader at any one time. 
Instead, +they make a weaker guarantee: they define an *epoch number* (called the *ballot number* in Paxos, +*view number* in Viewstamped Replication, and *term number* in Raft) and guarantee that within each +epoch, the leader is unique. + +When a node believes that the current leader is dead because it hasn’t heard from the leader for +some timeout, it may start a vote to elect a new leader. This election is given a new epoch number +that is greater than any previous epoch. If there is a conflict between two different leaders in two +different epochs (perhaps because the previous leader actually wasn’t dead after all), then the +leader with the higher epoch number prevails. + +Before a leader is allowed to append the next entry to the shared log, it must first check that +there isn’t some other leader with a higher epoch number which might append a different entry. It +can do this by collecting votes from a quorum of nodes—typically, but not always, a majority of +nodes [[85](/en/ch10#Howard2016_ch10)]. +A node votes yes only if it is not aware of any other leader with a higher epoch. + +Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s +proposal for the next entry to append to the log. The quorums for those two votes must overlap: if +a vote on a proposal succeeds, at least one of the nodes that voted for it must have also +participated in the most recent successful leader election +[[85](/en/ch10#Howard2016_ch10)]. Thus, if the vote on a proposal +passes without revealing any higher-numbered epoch, the current leader can conclude that no leader +with a higher epoch number has been elected, and therefore it can safely append the proposed entry +to the log [[26](/en/ch10#Cachin2011), +[86](/en/ch10#Kleppmann2024distsys)]. + +These two rounds of voting look superficially similar to two-phase commit, but they are very +different protocols. 
In consensus algorithms, any node can start an election and it requires only a +quorum of nodes to respond; in 2PC, only the coordinator can request votes, and it requires a “yes” +vote from *every* participant before it can commit. + +### Subtleties of consensus + +This basic structure is common to all of Raft, Multi-Paxos, Zab, and Viewstamped Replication: a vote +by a quorum of nodes elects a leader, and then another quorum vote is required for every entry that +the leader wants to append to the log [[68](/en/ch10#vanRenesse2014), +[69](/en/ch10#Howard2020)]. Every new log entry is synchronously replicated +to a quorum of nodes before it is confirmed to the client that requested the write. This ensures +that the log entry won’t be lost if the current leader fails. + +However, the devil is in the details, and that’s also where these algorithms take different +approaches. For example, when the old leader fails and a new one is elected, the algorithm needs to +ensure that the new leader honors any log entries that had already been appended by the old leader +before it failed. Raft does this by only allowing a node to become the new leader if its log is at +least as up-to-date as a majority of its followers +[[69](/en/ch10#Howard2020)]. +In contrast, Paxos allows any node to become the new leader, but requires it to bring its log +up-to-date with other nodes before it can start appending new entries of its own. + +# Consistency vs. Availability in Leader Election + +If you want the consensus algorithm to strictly guarantee the properties laid out in +[“Shared logs as consensus”](/en/ch10#sec_consistency_shared_logs), it’s essential that the new leader is up-to-date with any confirmed +log entries before it can process any writes or linearizable reads. If a node with stale data were +to become the new leader, it may write a new value to log entries that were already written by the +old leader, violating the shared log’s append-only property. 
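
One way to enforce this requirement is the voting rule that Raft uses: a node grants its vote only to a candidate whose log is at least as up-to-date as its own, comparing the epoch of the last entry first and the log length second. The following is a simplified sketch; the names `LogEntry`, `last_position`, `grants_vote`, and `wins_election` are hypothetical, and the sketch leaves out the rest of the election protocol (epoch checks and the rule that a node votes at most once per epoch).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class LogEntry:
    epoch: int   # epoch (called "term" in Raft) in which the entry was appended
    value: str


def last_position(log: List[LogEntry]) -> Tuple[int, int]:
    """(epoch of last entry, log length); tuples compare epoch first, then length."""
    return (log[-1].epoch, len(log)) if log else (0, 0)


def grants_vote(voter_log: List[LogEntry], candidate_pos: Tuple[int, int]) -> bool:
    """Vote yes only if the candidate's log is at least as up-to-date as ours."""
    return candidate_pos >= last_position(voter_log)


def wins_election(candidate: str, logs: Dict[str, List[LogEntry]]) -> bool:
    """The candidate needs votes from a majority of all nodes (itself included)."""
    position = last_position(logs[candidate])
    votes = sum(grants_vote(log, position) for log in logs.values())
    return votes > len(logs) // 2


# Entry "y" was replicated to a quorum (nodes a and b) before the leader failed.
logs = {
    "a": [LogEntry(1, "x"), LogEntry(1, "y")],
    "b": [LogEntry(1, "x"), LogEntry(1, "y")],
    "c": [LogEntry(1, "x")],                      # stale: missing "y"
}
print(wins_election("a", logs), wins_election("c", logs))
```

In this example the stale node `c` receives only its own vote, because any majority contains at least one node that holds the confirmed entry `y` and therefore refuses to vote for `c`.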
+ +In some cases, you might choose to weaken the consensus properties in order to recover more quickly +from a leader failure. For example, Kafka offers the option of enabling *unclean leader election*, +which allows any replica to become leader, even if it is not up-to-date. Also, in databases with +asynchronous replication, you cannot guarantee that any follower is up-to-date when the leader +fails. + +If you drop the requirement for the new leader to be up-to-date, you may improve performance and +availability, but you are on thin ice, since the theory of consensus no longer applies. While things +will work fine as long as there are no faults, the problems discussed in [Chapter 9](/en/ch9#ch_distributed) can +easily cause a lot of data loss or corruption. + +Another subtlety is in how the algorithms deal with log entries that had been proposed by the old +leader before it failed, but for which the vote on appending to the log had not yet completed. You +can find discussions of these details in the references for this chapter +[[23](/en/ch10#Ongaro2014atc), +[69](/en/ch10#Howard2020), +[86](/en/ch10#Kleppmann2024distsys)]. + +For databases that use a consensus algorithm for replication, it is not only writes that need to be +turned into log entries and replicated to a quorum. If you want to guarantee linearizable reads, those +reads also have to go through a quorum vote, much like a write, to confirm that the node that believes +itself to be the leader really is still up-to-date. Linearizable reads in etcd work like this, for example. + +In their standard form, most consensus algorithms assume a fixed set of nodes: nodes may go +down and come back up again, but the set of nodes that is allowed to vote is fixed when the cluster +is created. In practice, it’s often necessary to add new nodes or remove old nodes in a system +configuration. Consensus algorithms have been extended with *reconfiguration* features that make +this possible.
This is especially useful when adding new regions to a system, or when migrating from +one location to another (by first adding the new nodes, and then removing the old nodes). + +### Pros and cons of consensus + +Although they are complex and subtle, consensus algorithms are a huge breakthrough for distributed +systems. Consensus is essentially “single-leader replication done right”, with automatic failover on +leader failure, ensuring that no committed data is lost and no split-brain is possible, even in the +face of all the problems we discussed in [Chapter 9](/en/ch9#ch_distributed). + +Since single-leader replication with automatic failover is essentially one of the definitions of +consensus, any system that provides automatic failover but does not use a proven consensus algorithm +is likely to be unsafe [[87](/en/ch10#Kingsbury2015elastic)]. +Using a proven consensus algorithm is not a guarantee of correctness of the whole system—there are +still plenty of other places where bugs can lurk—but it’s a good start. + +Nevertheless, consensus is not used everywhere, because the benefits come at a cost. Consensus +systems always require a strict majority to operate—three nodes to tolerate one failure, or five +nodes to tolerate two failures. Every operation needs to communicate with a quorum, so you can’t +increase throughput by adding more nodes (in fact, every node you add makes the algorithm slower). +If a network partition cuts off some nodes from the rest, only the majority portion of the network +can make progress, and the rest are blocked. + +Consensus systems generally rely on timeouts to detect failed nodes. 
In environments with highly +variable network delays, especially systems distributed across multiple geographic regions, it can +be difficult to tune these timeouts: if they are too large it takes a long time to recover from a +failure; if they are too small there can be lots of unnecessary leader elections, resulting in +terrible performance as the system can end up spending more time choosing leaders than doing useful +work. + +Sometimes, consensus algorithms are particularly sensitive to network problems. For example, Raft +has been shown to have unpleasant edge cases +[[88](/en/ch10#Howard2015coracle), +[89](/en/ch10#Lianza2020_ch10)]: +if the entire network is working correctly except for one particular network link that is +consistently unreliable, Raft can get into situations where leadership continually bounces between +two nodes, or the current leader is continually forced to resign, so the system effectively never +makes progress. Designing algorithms that are more robust to unreliable networks is still an open +research problem. + +For systems that want to be highly available, but don’t want to accept the cost of consensus, the +only real alternative is to use a weaker consistency model instead, such as those offered by +leaderless or multi-leader replication as discussed in [Chapter 6](/en/ch6#ch_replication). These approaches +generally don’t offer linearizability, but for applications that don’t need it that is fine. + +## Coordination Services + +Consensus algorithms are useful in any distributed database that wants to offer linearizable +operations, and many modern distributed databases use consensus algorithms for replication. But one +family of systems is a particularly prominent user of consensus: *coordination services* such as +ZooKeeper, etcd, or Consul. Although these systems look superficially like any other key-value +store, they are not designed for general-purpose data storage like most databases. 
+ +Instead, they are designed to coordinate between nodes of another distributed system. For example, +Kubernetes relies on etcd, while Spark and Flink in high availability mode rely on ZooKeeper running +in the background. Coordination services are designed to hold small amounts of data that can fit +entirely in memory (although they still write to disk for durability), which is replicated across +multiple nodes using a fault-tolerant consensus algorithm. + +Coordination services are modeled after Google’s Chubby lock service +[[17](/en/ch10#Burrows2006_ch10), +[58](/en/ch10#Chandra2007)]. +They combine a consensus algorithm with several other features that turn out to be particularly +useful when building distributed systems: + +Locks and leases +: We saw previously how consensus systems can implement an atomic, fault-tolerant compare-and-set + (CAS) operation. Coordination services rely on this approach to implement locks and leases: if + several nodes concurrently try to acquire the same lease, only one of them will succeed. + +Support for fencing +: As discussed in [“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing), when a resource is protected by a lease, you + need *fencing* to prevent clients from interfering with each other in the case of a process pause + or large network delay. Consensus systems can generate fencing tokens by giving each log entry a + monotonically increasing ID (`zxid` and `cversion` in ZooKeeper, revision number in etcd). + +Failure detection +: Clients maintain a long-lived session on the coordination service, and periodically exchange + heartbeats to check if the other node is still alive. Even if the connection is temporarily + interrupted, or a server fails, any leases held by the client remain active. However, if there is + no heartbeat for longer than the timeout of the lease, the coordination service assumes the client + is dead and releases the lease (ZooKeeper calls these *ephemeral nodes*). 
+ +Change notifications +: A client can request that the coordination service sends it a notification whenever certain keys + change. This allows a client to find out when another client joins the cluster (based on the value + it writes to the coordination service), or if another client fails (because its session times out + and its ephemeral nodes disappear), for example. These notifications save the client from having + to frequently poll the service to find out about changes. + +Failure detection and change notifications do not require consensus, but they are useful for +distributed coordination alongside the atomic operations and fencing support that do require +consensus. + +# Managing configuration with coordination services + +Applications and infrastructure often have configuration parameters such as timeouts, thread pool +sizes, and so on. Coordination services are sometimes used to store such configuration data, +represented as key-value pairs. Processes load the latest settings upon startup, and subscribe to +receive notifications of any changes. When a configuration changes, the process can begin using the +new setting immediately or restart itself to load the latest changes. + +Configuration management doesn’t need the consensus aspect of a coordination service, but it’s +convenient to use a coordination service and rely on its notification feature if you are already +running the coordination service anyway. Alternatively, a process could periodically poll for +configuration updates from a file or URL, which avoids the need for a specialized service. + +### Allocating work to nodes + +A coordination service is useful if you have several instances of a process or service, and one +of them needs to be chosen as leader or primary. If the leader fails, one of the other nodes should +take over. This is necessary for single-leader databases, but it’s also appropriate for job +schedulers and similar stateful systems. 
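As a rough sketch of how leases and fencing tokens combine to choose a leader, the following toy model keeps everything in one process. `ToyLeaseService` and `FencedResource` are hypothetical names, not the API of ZooKeeper, etcd, or Consul, and time is passed in explicitly so that the behavior is deterministic. A real coordination service would make `try_acquire` atomic and fault-tolerant by running it through a consensus algorithm.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: str
    expires_at: float
    token: int  # fencing token, increases with every new acquisition


class ToyLeaseService:
    """In-memory stand-in for a coordination service (hypothetical API)."""

    def __init__(self) -> None:
        self._leases: dict[str, Lease] = {}
        self._next_token = 0  # plays the role of a monotonically increasing log ID

    def try_acquire(self, key: str, client: str, now: float, ttl: float) -> Optional[int]:
        """Return a fencing token if `client` holds the lease afterwards, else None."""
        lease = self._leases.get(key)
        if lease is not None and lease.expires_at > now:
            if lease.holder != client:
                return None  # another client holds an unexpired lease
            lease.expires_at = now + ttl  # heartbeat: renew and keep the token
            return lease.token
        # No lease, or the old one timed out: hand out a new, larger token.
        self._next_token += 1
        self._leases[key] = Lease(client, now + ttl, self._next_token)
        return self._next_token


class FencedResource:
    """Rejects writes that carry a token older than one it has already seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.value: Optional[str] = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False  # request from a client whose lease has been superseded
        self.highest_token = token
        self.value = value
        return True
```

If `node-a` pauses for longer than its lease timeout, `node-b` can acquire the lease with a larger token, and any delayed write from `node-a` carrying the old token is then rejected by the resource.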
+ +Another use case is when you have some sharded resource (database, message streams, file storage, +distributed actor system, etc.) and need to decide which shard to assign to which node. As new nodes +join the cluster, some of the shards need to be moved from existing nodes to the new nodes in order +to rebalance the load. As nodes are removed or fail, other nodes need to take over the failed nodes’ +work. + +These kinds of tasks can be achieved by judicious use of atomic operations, ephemeral nodes, and +notifications in a coordination service. If done correctly, this approach allows the application to +automatically recover from faults without human intervention. It’s not easy, despite the appearance +of libraries such as Apache Curator that have sprung up to provide higher-level tools on top of the +ZooKeeper client API—but it is still much better than attempting to implement the necessary +consensus algorithms from scratch, which would be very prone to bugs. + +A dedicated coordination service also has the advantage that it can run on a fixed set of nodes +(usually three or five), regardless of how many nodes there are in the distributed system that +relies on it for coordination. For example, in a storage system with thousands of shards, it would +be terribly inefficient to run a consensus algorithm over thousands of nodes; it’s much better to +“outsource” the consensus to a small number of nodes running a coordination service. + +Normally, the kind of data managed by a coordination service is quite slow-changing: it represents +information like “the node running on IP address 10.1.1.23 is the leader for shard 7,” and such +assignments usually change on a timescale of minutes or hours. Coordination services are not +intended for storing data that may change thousands of times per second. 
For that, it is better to +use a conventional database; alternatively, tools like Apache BookKeeper +[[90](/en/ch10#Kelly2014), +[91](/en/ch10#Vanlightly2021)] +can be used to replicate fast-changing internal state of a service. + +### Service discovery + +ZooKeeper, etcd, and Consul are also often used for *service discovery*—that is, to find out which +IP address you need to connect to in order to reach a particular service (see +[“Load balancers, service discovery, and service meshes”](/en/ch5#sec_encoding_service_discovery)). In cloud environments, where it is common for +virtual machines to continually come and go, you often don’t know the IP addresses of your services +ahead of time. Instead, you can configure your services such that when they start up they register +their network endpoints in a service registry, where they can then be found by other services. + +Using a coordination service for service discovery can be convenient, as its failure detection and +change notification features make it easy for clients to keep track of service instances as they +come and go. And if you are already using a coordination service for leases, locking, or leader +election, it makes sense to also use it for service discovery, since it already knows which node +should receive requests for your service. + +However, using consensus for service discovery is often overkill: this use case often doesn’t +require linearizability, and it’s more important that service discovery is highly available and +fast, since without it everything would grind to a halt. It’s therefore often preferable to cache +service discovery information and accept that it might be slightly stale. For example, DNS-based +service discovery uses multiple layers of caching to achieve good performance and availability. 
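The idea of caching discovery results and tolerating slight staleness might look like the sketch below. `CachedDiscovery` is a hypothetical client-side helper, and the `lookup` function stands in for whatever registry or resolver is actually queried; it is assumed to raise `ConnectionError` when the registry cannot be reached.

```python
from typing import Callable


class CachedDiscovery:
    """Client-side cache that prefers availability over freshness (hypothetical)."""

    def __init__(self, lookup: Callable[[str], list[str]], ttl: float) -> None:
        self._lookup = lookup  # e.g. a call to a service registry or resolver
        self._ttl = ttl
        self._cache: dict[str, tuple[float, list[str]]] = {}

    def resolve(self, service: str, now: float) -> list[str]:
        cached = self._cache.get(service)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]  # fresh enough: no network round trip
        try:
            endpoints = self._lookup(service)
        except ConnectionError:
            if cached is not None:
                return cached[1]  # registry unreachable: serve the stale entry
            raise
        self._cache[service] = (now, endpoints)
        return endpoints
```

The result is deliberately not linearizable: a caller may see an endpoint list that is up to `ttl` seconds old, or older still while the registry is unreachable, in exchange for lookups that keep working during an outage.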
+ +To support this use case, ZooKeeper provides *observers*, which are replicas that receive the log +and maintain a copy of the data stored in ZooKeeper, but which do not participate in the consensus +algorithm’s voting process. Reads from an observer are not linearizable as they might be stale, but +they remain available even if the network is interrupted, and they increase the read throughput that +the system can support by caching. + +# Summary + +In this chapter we examined the topic of strong consistency in fault-tolerant systems: what it is, +and how to achieve it. We looked in depth at linearizability, a popular formalization of strong +consistency: it means that replicated data appears as though there were only a single copy, and all +operations act on it atomically. We saw that linearizability is useful when you need some data to be +up-to-date when you read it, or if you need to resolve a race condition (e.g., if multiple nodes are +concurrently trying to do the same thing, such as creating files with the same name). + +Although linearizability is appealing because it is easy to understand (it makes a database behave +like a variable in a single-threaded program), it has the downside of being slow, especially in +environments with large network delays. Many replication algorithms don’t guarantee linearizability, +even though they might superficially appear to provide strong consistency. + +Next, we applied the concept of linearizability in the context of ID generators. A single-node +auto-incrementing counter is linearizable, but not fault-tolerant. Many distributed ID generation +schemes don’t guarantee that the IDs are ordered consistently with the order in which the events +actually happened. Logical clocks such as Lamport clocks and hybrid logical clocks provide ordering +that is consistent with causality, but not linearizability. + +This led us to the concept of consensus.
We saw that achieving consensus means deciding something in +such a way that all nodes agree on what was decided, and such that they can’t change their mind. A +wide range of problems are actually reducible to consensus and are equivalent to each other (i.e., +if you have a solution for one of them, you can transform it into a solution for all of the others). +Such equivalent problems include: + +Linearizable compare-and-set operation +: The register needs to atomically *decide* whether to set its value, based on whether its current + value equals the parameter given in the operation. + +Locks and leases +: When several clients are concurrently trying to grab a lock or lease, the lock *decides* which one + successfully acquired it. + +Uniqueness constraints +: When several transactions concurrently try to create conflicting records with the same key, the + constraint must *decide* which one to allow and which should fail with a constraint violation. + +Shared logs +: When several nodes concurrently want to append entries to a log, the log *decides* in which order + they are appended. Total order broadcast is also equivalent. + +Atomic transaction commit +: The database nodes involved in a distributed transaction must all *decide* the same way whether to + commit or abort the transaction. + +Linearizable fetch-and-add operation +: This operation can be used to implement an ID generator. Several nodes can concurrently invoke the + operation, and it *decides* the order in which they increment the counter. This case actually + solves consensus only between two nodes, while the others work for any number of nodes. + +All of these are straightforward if you only have a single node, or if you are willing to assign the +decision-making capability to a single node. 
This is what happens in a single-leader database: all +the power to make decisions is vested in the leader, which is why such databases are able to provide +linearizable operations, uniqueness constraints, a replication log, and more. + +However, if that single leader fails, or if a network interruption makes the leader unreachable, +such a system becomes unable to make any progress until a human performs a manual failover. +Widely-used consensus algorithms like Raft and Paxos are essentially single-leader replication with +built-in automatic leader election and failover if the current leader fails. + +Consensus algorithms are carefully designed to ensure that no committed writes are lost during a +failover, and that the system cannot get into a split brain state in which multiple nodes are +accepting writes. This requires that every write, and every linearizable read, is confirmed by a +quorum (typically a majority) of nodes. This can be expensive, especially across geographic regions, +but it is unavoidable if you want the strong consistency and fault tolerance that consensus +provides. + +Coordination services like ZooKeeper and etcd are also built on top of consensus algorithms. They +provide locks, leases, failure detection, and change notification features that are useful for +managing the state of distributed applications. If you find yourself wanting to do one of those +things that is reducible to consensus, and you want it to be fault-tolerant, it is advisable to use +a coordination service. It won’t guarantee that you will get it right, but it will probably help. + +Consensus algorithms are complicated and subtle, but they are supported by a rich body of theory +that has been developed since the 1980s. This theory makes it possible to build systems that can +tolerate all the faults that we discussed in [Chapter 9](/en/ch9#ch_distributed), and still ensure that your data is +not corrupted. 
This is an amazing achievement, and the references at the end of this chapter feature +some of the highlights of this work. + +Nevertheless, consensus is not always the right tool: in some systems, the strong consistency +properties it provides are not needed, and it is better to have weaker consistency with higher +availability and better performance. In these cases, it is common to use leaderless or multi-leader +replication, which we previously discussed in [Chapter 6](/en/ch6#ch_replication). The logical clocks that we +discussed in this chapter are helpful in that context. + +##### Footnotes + +##### References + +[[1](/en/ch10#Herlihy1990-marker)] Maurice P. Herlihy and Jeannette M. Wing. +[Linearizability: A Correctness +Condition for Concurrent Objects](https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf). *ACM Transactions on Programming Languages and Systems* +(TOPLAS), volume 12, issue 3, pages 463–492, July 1990. +[doi:10.1145/78969.78972](https://doi.org/10.1145/78969.78972) + +[[2](/en/ch10#Lamport1986-marker)] Leslie Lamport. +[On +interprocess communication](https://www.microsoft.com/en-us/research/publication/interprocess-communication-part-basic-formalism-part-ii-algorithms/). *Distributed Computing*, volume 1, issue 2, pages 77–101, +June 1986. [doi:10.1007/BF01786228](https://doi.org/10.1007/BF01786228) + +[[3](/en/ch10#Gifford1981-marker)] David K. Gifford. +[Information +Storage in a Decentralized Computer System](https://bitsavers.org/pdf/xerox/parc/techReports/CSL-81-8_Information_Storage_in_a_Decentralized_Computer_System.pdf). Xerox Palo Alto Research Centers, CSL-81-8, June 1981. +Archived at [perma.cc/2XXP-3JPB](https://perma.cc/2XXP-3JPB) + +[[4](/en/ch10#Kleppmann2015stop-marker)] Martin Kleppmann. +[Please +Stop Calling Databases CP or AP](https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html). *martin.kleppmann.com*, May 2015. 
+Archived at [perma.cc/MJ5G-75GL](https://perma.cc/MJ5G-75GL) + +[[5](/en/ch10#Kingsbury2015mongodb-marker)] Kyle Kingsbury. +[Call Me Maybe: MongoDB +Stale Reads](https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads). *aphyr.com*, April 2015. +Archived at [perma.cc/DXB4-J4JC](https://perma.cc/DXB4-J4JC) + +[[6](/en/ch10#Kingsbury2014knossos-marker)] Kyle Kingsbury. +[Computational Techniques +in Knossos](https://aphyr.com/posts/314-computational-techniques-in-knossos). *aphyr.com*, May 2014. +Archived at [perma.cc/2X5M-EHTU](https://perma.cc/2X5M-EHTU) + +[[7](/en/ch10#Kingsbury2020elle-marker)] Kyle Kingsbury and Peter Alvaro. +[Elle: Inferring Isolation Anomalies from +Experimental Observations](https://www.vldb.org/pvldb/vol14/p268-alvaro.pdf). *Proceedings of the VLDB Endowment*, volume 14, issue 3, pages +268–280, November 2020. +[doi:10.14778/3430915.3430918](https://doi.org/10.14778/3430915.3430918) + +[[8](/en/ch10#Viotti2016-marker)] Paolo Viotti and Marko Vukolić. +[Consistency in Non-Transactional Distributed Storage +Systems](https://arxiv.org/abs/1512.00168). *ACM Computing Surveys* (CSUR), volume 49, issue 1, article no. 19, June 2016. +[doi:10.1145/2926965](https://doi.org/10.1145/2926965) + +[[9](/en/ch10#Bailis2014linear-marker)] Peter Bailis. +[Linearizability +Versus Serializability](http://www.bailis.org/blog/linearizability-versus-serializability/). *bailis.org*, September 2014. +Archived at [perma.cc/386B-KAC3](https://perma.cc/386B-KAC3) + +[[10](/en/ch10#Abadi2019serializable-marker)] Daniel Abadi. +[Correctness +Anomalies Under Serializable Isolation](https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html). *dbmsmusings.blogspot.com*, June 2019. +Archived at [perma.cc/JGS7-BZFY](https://perma.cc/JGS7-BZFY) + +[[11](/en/ch10#Bailis2014virtues_ch10-marker)] Peter Bailis, Aaron Davidson, Alan +Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. 
+[Highly Available Transactions: Virtues and +Limitations](https://www.vldb.org/pvldb/vol7/p181-bailis.pdf). *Proceedings of the VLDB Endowment*, volume 7, issue 3, pages 181–192, +November 2013. [doi:10.14778/2732232.2732237](https://doi.org/10.14778/2732232.2732237), +extended version published as [arXiv:1302.0309](https://arxiv.org/abs/1302.0309) + +[[12](/en/ch10#Bernstein1987_ch10-marker)] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. +[*Concurrency Control and +Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at +[*microsoft.com*](https://www.microsoft.com/en-us/research/people/philbe/book/). + +[[13](/en/ch10#Matei2021-marker)] Andrei Matei. +[CockroachDB’s consistency model](https://www.cockroachlabs.com/blog/consistency-model/). +*cockroachlabs.com*, February 2021. +Archived at [perma.cc/MR38-883B](https://perma.cc/MR38-883B) + +[[14](/en/ch10#Demirbas2022-marker)] Murat Demirbas. +[Strict-serializability, +but at what cost, for what purpose?](https://muratbuffalo.blogspot.com/2022/08/strict-serializability-but-at-what-cost.html) *muratbuffalo.blogspot.com*, August 2022. +Archived at [perma.cc/T8AY-N3U9](https://perma.cc/T8AY-N3U9) + +[[15](/en/ch10#Darnell2022-marker)] Ben Darnell. +[How to talk about +consistency and isolation in distributed DBs](https://www.cockroachlabs.com/blog/db-consistency-isolation-terminology/). *cockroachlabs.com*, February 2022. +Archived at [perma.cc/53SV-JBGK](https://perma.cc/53SV-JBGK) + +[[16](/en/ch10#Abadi2019consistency-marker)] Daniel Abadi. +[An +explanation of the difference between Isolation levels vs. Consistency levels](https://dbmsmusings.blogspot.com/2019/08/an-explanation-of-difference-between.html). +*dbmsmusings.blogspot.com*, August 2019. +Archived at [perma.cc/QSF2-CD4P](https://perma.cc/QSF2-CD4P) + +[[17](/en/ch10#Burrows2006_ch10-marker)] Mike Burrows. 
+[The Chubby Lock Service for Loosely-Coupled +Distributed Systems](https://research.google/pubs/pub27897/). At *7th USENIX Symposium on Operating System Design and +Implementation* (OSDI), November 2006. + +[[18](/en/ch10#Junqueira2013_ch10-marker)] Flavio P. Junqueira and Benjamin Reed. +[*ZooKeeper: Distributed +Process Coordination*](https://www.oreilly.com/library/view/zookeeper/9781449361297/). O’Reilly Media, 2013. ISBN: 978-1-449-36130-3 + +[[19](/en/ch10#Vallath2006-marker)] Murali Vallath. +[*Oracle 10g RAC +Grid, Services & Clustering*](https://www.oreilly.com/library/view/oracle-10g-rac/9781555583217/). Elsevier Digital Press, 2006. ISBN: 978-1-555-58321-7 + +[[20](/en/ch10#Bailis2014coord_ch10-marker)] Peter Bailis, Alan Fekete, Michael J. +Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. +[Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). +*Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014. +[doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509) + +[[21](/en/ch10#Kingsbury2014etcd-marker)] Kyle Kingsbury. +[Call Me Maybe: etcd and +Consul](https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul). *aphyr.com*, June 2014. +Archived at [perma.cc/XL7U-378K](https://perma.cc/XL7U-378K) + +[[22](/en/ch10#Junqueira2011-marker)] Flavio P. Junqueira, Benjamin C. Reed, and +Marco Serafini. [Zab: High-Performance +Broadcast for Primary-Backup Systems](https://marcoserafini.github.io/assets/pdf/zab.pdf). At *41st IEEE International Conference on Dependable +Systems and Networks* (DSN), June 2011. +[doi:10.1109/DSN.2011.5958223](https://doi.org/10.1109/DSN.2011.5958223) + +[[23](/en/ch10#Ongaro2014atc-marker)] Diego Ongaro and John K. Ousterhout. +[In Search +of an Understandable Consensus Algorithm](https://www.usenix.org/system/files/conference/atc14/atc14-paper-ongaro.pdf). At *USENIX Annual Technical Conference* +(ATC), June 2014. 
+ +[[24](/en/ch10#Attiya1995-marker)] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. +[Sharing Memory Robustly in +Message-Passing Systems](https://www.cs.huji.ac.il/course/2004/dist/p124-attiya.pdf). *Journal of the ACM*, volume 42, issue 1, pages 124–142, January 1995. +[doi:10.1145/200836.200869](https://doi.org/10.1145/200836.200869) + +[[25](/en/ch10#Lynch1997-marker)] Nancy Lynch and Alex Shvartsman. +[Robust Emulation of Shared Memory +Using Dynamic Quorum-Acknowledged Broadcasts](https://groups.csail.mit.edu/tds/papers/Lynch/FTCS97.pdf). At *27th Annual International Symposium on +Fault-Tolerant Computing* (FTCS), June 1997. +[doi:10.1109/FTCS.1997.614100](https://doi.org/10.1109/FTCS.1997.614100) + +[[26](/en/ch10#Cachin2011-marker)] Christian Cachin, Rachid Guerraoui, and Luís Rodrigues. +[*Introduction to Reliable and Secure Distributed +Programming*](https://www.distributedprogramming.net/), 2nd edition. Springer, 2011. ISBN: 978-3-642-15259-7, +[doi:10.1007/978-3-642-15260-3](https://doi.org/10.1007/978-3-642-15260-3) + +[[27](/en/ch10#Ekstrom2012-marker)] Niklas Ekström, Mikhail Panchenko, and Jonathan Ellis. +[Possible +Issue with Read Repair?](https://lists.apache.org/thread/wwsjnnc93mdlpw8nb0d5gn4q1bmpzbon) Email thread on *cassandra-dev* mailing list, October 2012. + +[[28](/en/ch10#Herlihy1991-marker)] Maurice P. Herlihy. +[Wait-Free Synchronization](https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf). +*ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 13, issue 1, +pages 124–149, January 1991. +[doi:10.1145/114005.102808](https://doi.org/10.1145/114005.102808) + +[[29](/en/ch10#Fox1999-marker)] Armando Fox and Eric A. Brewer. +[Harvest, Yield, and +Scalable Tolerant Systems](https://radlab.cs.berkeley.edu/people/fox/static/pubs/pdf/c18.pdf). At *7th Workshop on Hot Topics in Operating Systems* (HotOS), +March 1999. 
[doi:10.1109/HOTOS.1999.798396](https://doi.org/10.1109/HOTOS.1999.798396) + +[[30](/en/ch10#Gilbert2002-marker)] Seth Gilbert and Nancy Lynch. +[Brewer’s Conjecture +and the Feasibility of Consistent, Available, Partition-Tolerant Web Services](https://www.comp.nus.edu.sg/~gilbert/pubs/BrewersConjecture-SigAct.pdf). +*ACM SIGACT News*, volume 33, issue 2, pages 51–59, June 2002. +[doi:10.1145/564585.564601](https://doi.org/10.1145/564585.564601) + +[[31](/en/ch10#Gilbert2012-marker)] Seth Gilbert and Nancy Lynch. +[Perspectives on the CAP +Theorem](https://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 30–36, February 2012. +[doi:10.1109/MC.2011.389](https://doi.org/10.1109/MC.2011.389) + +[[32](/en/ch10#Brewer2012rules-marker)] Eric A. Brewer. +[CAP Twelve Years +Later: How the ‘Rules’ Have Changed](https://sites.cs.ucsb.edu/~rich/class/cs293-cloud/papers/brewer-cap.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages +23–29, February 2012. [doi:10.1109/MC.2012.37](https://doi.org/10.1109/MC.2012.37) + +[[33](/en/ch10#Davidson1985-marker)] Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen. +[Consistency in Partitioned +Networks](https://www.cs.rice.edu/~alc/old/comp520/papers/DGS85.pdf). *ACM Computing Surveys*, volume 17, issue 3, pages 341–370, September 1985. +[doi:10.1145/5505.5508](https://doi.org/10.1145/5505.5508) + +[[34](/en/ch10#Johnson1975-marker)] Paul R. Johnson and Robert H. Thomas. +[RFC 677: The Maintenance of Duplicate +Databases](https://tools.ietf.org/html/rfc677). Network Working Group, January 1975. + +[[35](/en/ch10#Fischer1982-marker)] Michael J. Fischer and Alan Michael. +[Sacrificing +Serializability to Attain High Availability of Data in an Unreliable Network](https://sites.cs.ucsb.edu/~agrawal/spring2011/ugrad/p70-fischer.pdf). At +*1st ACM Symposium on Principles of Database Systems* (PODS), March 1982. 
+[doi:10.1145/588111.588124](https://doi.org/10.1145/588111.588124) + +[[36](/en/ch10#Brewer2012nosql-marker)] Eric A. Brewer. +[NoSQL: Past, Present, Future](https://www.infoq.com/presentations/NoSQL-History/). +At *QCon San Francisco*, November 2012. + +[[37](/en/ch10#Cockcroft2014-marker)] Adrian Cockcroft. +[Migrating to Microservices](https://www.infoq.com/presentations/migration-cloud-native/). +At *QCon London*, March 2014. + +[[38](/en/ch10#Kleppmann2015critique-marker)] Martin Kleppmann. +[A Critique of the CAP Theorem](https://arxiv.org/abs/1509.05393). arXiv:1509.05393, +September 2015. + +[[39](/en/ch10#Abadi2010-marker)] Daniel Abadi. +[Problems +with CAP, and Yahoo’s little known NoSQL system](https://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html). *dbmsmusings.blogspot.com*, April 2010. +Archived at [perma.cc/4NTZ-CLM9](https://perma.cc/4NTZ-CLM9) + +[[40](/en/ch10#Abadi2017-marker)] Daniel Abadi. +[Hazelcast +and the Mythical PA/EC System](https://dbmsmusings.blogspot.com/2017/10/hazelcast-and-mythical-paec-system.html). *dbmsmusings.blogspot.com*, October 2017. +Archived at [perma.cc/J5XM-U5C2](https://perma.cc/J5XM-U5C2) + +[[41](/en/ch10#Brewer2017-marker)] Eric Brewer. +[Spanner, TrueTime & The CAP +Theorem](https://research.google.com/pubs/archive/45855.pdf). *research.google.com*, February 2017. +Archived at [perma.cc/59UW-RH7N](https://perma.cc/59UW-RH7N) + +[[42](/en/ch10#Abadi2012-marker)] Daniel J. Abadi. +[Consistency Tradeoffs in +Modern Distributed Database System Design](https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf). *IEEE Computer Magazine*, +volume 45, issue 2, pages 37–42, February 2012. +[doi:10.1109/MC.2012.33](https://doi.org/10.1109/MC.2012.33) + +[[43](/en/ch10#Lynch1989-marker)] Nancy A. Lynch. +[A Hundred Impossibility Proofs +for Distributed Computing](https://groups.csail.mit.edu/tds/papers/Lynch/podc89.pdf). 
At *8th ACM Symposium on Principles of Distributed +Computing* (PODC), August 1989. +[doi:10.1145/72981.72982](https://doi.org/10.1145/72981.72982) + +[[44](/en/ch10#Mahajan2011-marker)] Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin. +[Consistency, Availability, +and Convergence](https://apps.cs.utexas.edu/tech_reports/reports/tr/TR-2036.pdf). University of Texas at Austin, Department of Computer Science, Tech Report UTCS +TR-11-22, May 2011. Archived at [perma.cc/SAV8-9JAJ](https://perma.cc/SAV8-9JAJ) + +[[45](/en/ch10#Attiya2015-marker)] Hagit Attiya, Faith Ellen, and Adam Morrison. +[Limitations +of Highly-Available Eventually-Consistent Data Stores](https://www.cs.tau.ac.il/~mad/publications/podc2015-replds.pdf). At *ACM Symposium on Principles of +Distributed Computing* (PODC), July 2015. +[doi:10.1145/2767386.2767419](https://doi.org/10.1145/2767386.2767419) + +[[46](/en/ch10#Sewell2010-marker)] Peter Sewell, Susmit Sarkar, Scott Owens, +Francesco Zappa Nardelli, and Magnus O. Myreen. +[x86-TSO: A Rigorous and Usable +Programmer’s Model for x86 Multiprocessors](https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf). *Communications of the ACM*, +volume 53, issue 7, pages 89–97, July 2010. +[doi:10.1145/1785414.1785443](https://doi.org/10.1145/1785414.1785443) + +[[47](/en/ch10#Thompson2011-marker)] Martin Thompson. +[Memory +Barriers/Fences](https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html). *mechanical-sympathy.blogspot.co.uk*, July 2011. +Archived at [perma.cc/7NXM-GC5U](https://perma.cc/7NXM-GC5U) + +[[48](/en/ch10#Drepper2007_ch10-marker)] Ulrich Drepper. +[What Every Programmer Should Know About +Memory](https://www.akkadia.org/drepper/cpumemory.pdf). *akkadia.org*, November 2007. Archived at +[perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ) + +[[49](/en/ch10#Attiya1994-marker)] Hagit Attiya and Jennifer L. Welch. +[Sequential Consistency +Versus Linearizability](https://courses.csail.mit.edu/6.852/01/papers/p91-attiya.pdf). 
*ACM Transactions on Computer Systems* (TOCS), +volume 12, issue 2, pages 91–122, May 1994. +[doi:10.1145/176575.176576](https://doi.org/10.1145/176575.176576) + +[[50](/en/ch10#Davis2024-marker)] Kyzer R. Davis, Brad G. Peabody, and Paul J. Leach. +[Universally Unique IDentifiers (UUIDs)](https://www.rfc-editor.org/rfc/rfc9562). +RFC 9562, IETF, May 2024. + +[[51](/en/ch10#King2010-marker)] Ryan King. +[Announcing Snowflake](https://blog.x.com/engineering/en_us/a/2010/announcing-snowflake). +*blog.x.com*, June 2010. Archived at +[archive.org](https://web.archive.org/web/20241128214604/https%3A//blog.x.com/engineering/en_us/a/2010/announcing-snowflake) + +[[52](/en/ch10#Feerasta2016-marker)] Alizain Feerasta. +[Universally Unique Lexicographically Sortable Identifier](https://github.com/ulid/spec). +*github.com*, 2016. +Archived at [perma.cc/NV2Y-ZP8U](https://perma.cc/NV2Y-ZP8U) + +[[53](/en/ch10#Conery2014-marker)] Rob Conery. +[A Better ID +Generator for PostgreSQL](https://bigmachine.io/2014/05/29/a-better-id-generator-for-postgresql/). *bigmachine.io*, May 2014. +Archived at [perma.cc/K7QV-3KFC](https://perma.cc/K7QV-3KFC) + +[[54](/en/ch10#Lamport1978_ch10-marker)] Leslie Lamport. +[Time, +Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, +volume 21, issue 7, pages 558–565, July 1978. +[doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563) + +[[55](/en/ch10#Kulkarni2014-marker)] Sandeep S. Kulkarni, Murat Demirbas, Deepak +Madeppa, Bharadwaj Avva, and Marcelo Leone. +[Logical Physical Clocks](https://cse.buffalo.edu/~demirbas/publications/hlc.pdf). +*18th International Conference on Principles of Distributed Systems* (OPODIS), December 2014. 
+[doi:10.1007/978-3-319-14472-6\_2](https://doi.org/10.1007/978-3-319-14472-6_2) + +[[56](/en/ch10#Bravo2015-marker)] Manuel Bravo, Nuno Diegues, Jingna Zeng, Paolo +Romano, and Luís Rodrigues. +[On the use of Clocks to Enforce +Consistency in the Cloud](http://sites.computer.org/debull/A15mar/p18.pdf). *IEEE Data Engineering Bulletin*, volume 38, issue 1, +pages 18–31, March 2015. +Archived at [perma.cc/68ZU-45SH](https://perma.cc/68ZU-45SH) + +[[57](/en/ch10#Peng2010_ch10-marker)] Daniel Peng and Frank Dabek. +[Large-Scale +Incremental Processing Using Distributed Transactions and Notifications](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf). At *9th USENIX +Conference on Operating Systems Design and Implementation* (OSDI), October 2010. + +[[58](/en/ch10#Chandra2007-marker)] Tushar Deepak Chandra, Robert Griesemer, and Joshua +Redstone. [Paxos +Made Live – An Engineering Perspective](https://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf). At *26th ACM Symposium on Principles of Distributed +Computing* (PODC), June 2007. +[doi:10.1145/1281100.1281103](https://doi.org/10.1145/1281100.1281103) + +[[59](/en/ch10#Portnoy2012-marker)] Will Portnoy. +[Lessons Learned from +Implementing Paxos](https://blog.willportnoy.com/2012/06/lessons-learned-from-paxos.html). *blog.willportnoy.com*, June 2012. +Archived at [perma.cc/QHD9-FDD2](https://perma.cc/QHD9-FDD2) + +[[60](/en/ch10#Oki1988-marker)] Brian M. Oki and Barbara H. Liskov. +[Viewstamped Replication: A New Primary Copy Method +to Support Highly-Available Distributed Systems](https://pmg.csail.mit.edu/papers/vr.pdf). At *7th ACM Symposium on Principles of +Distributed Computing* (PODC), August 1988. +[doi:10.1145/62546.62549](https://doi.org/10.1145/62546.62549) + +[[61](/en/ch10#Liskov2012-marker)] Barbara H. Liskov and James Cowling. +[Viewstamped Replication Revisited](https://pmg.csail.mit.edu/papers/vr-revisited.pdf). 
+Massachusetts Institute of Technology, Tech Report MIT-CSAIL-TR-2012-021, July 2012. +Archived at [perma.cc/56SJ-WENQ](https://perma.cc/56SJ-WENQ) + +[[62](/en/ch10#Lamport1998-marker)] Leslie Lamport. +[The +Part-Time Parliament](https://www.microsoft.com/en-us/research/publication/part-time-parliament/). *ACM Transactions on Computer Systems*, volume 16, issue 2, +pages 133–169, May 1998. +[doi:10.1145/279227.279229](https://doi.org/10.1145/279227.279229) + +[[63](/en/ch10#Lamport2001-marker)] Leslie Lamport. +[Paxos Made +Simple](https://www.microsoft.com/en-us/research/publication/paxos-made-simple/). *ACM SIGACT News*, volume 32, issue 4, pages 51–58, December 2001. +Archived at [perma.cc/82HP-MNKE](https://perma.cc/82HP-MNKE) + +[[64](/en/ch10#vanRenesse2011-marker)] Robbert van Renesse and Deniz Altinbuken. +[Paxos Made +Moderately Complex](https://people.cs.umass.edu/~arun/590CC/papers/paxos-moderately-complex.pdf). *ACM Computing Surveys* (CSUR), volume 47, issue 3, article no. 42, +February 2015. [doi:10.1145/2673577](https://doi.org/10.1145/2673577) + +[[65](/en/ch10#Ongaro2014thesis-marker)] Diego Ongaro. +[Consensus: Bridging Theory and Practice](https://github.com/ongardie/dissertation). +PhD Thesis, Stanford University, August 2014. +Archived at [perma.cc/5VTZ-2ADH](https://perma.cc/5VTZ-2ADH) + +[[66](/en/ch10#Howard2015refloated-marker)] Heidi Howard, Malte Schwarzkopf, Anil +Madhavapeddy, and Jon Crowcroft. +[Raft +Refloated: Do We Have Consensus?](https://www.cl.cam.ac.uk/research/srg/netos/papers/2015-raftrefloated-osr.pdf) *ACM SIGOPS Operating Systems Review*, volume 49, issue +1, pages 12–21, January 2015. +[doi:10.1145/2723872.2723876](https://doi.org/10.1145/2723872.2723876) + +[[67](/en/ch10#Medeiros2012-marker)] André Medeiros. +[ZooKeeper’s Atomic +Broadcast Protocol: Theory and Practice](http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf). Aalto University School of Science, March 2012. 
+Archived at [perma.cc/FVL4-JMVA](https://perma.cc/FVL4-JMVA) + +[[68](/en/ch10#vanRenesse2014-marker)] Robbert van Renesse, Nicolas Schiper, and +Fred B. Schneider. [Vive La Différence: Paxos vs. +Viewstamped Replication vs. Zab](https://arxiv.org/abs/1309.5671). *IEEE Transactions on Dependable and Secure Computing*, +volume 12, issue 4, pages 472–484, September 2014. +[doi:10.1109/TDSC.2014.2355848](https://doi.org/10.1109/TDSC.2014.2355848) + +[[69](/en/ch10#Howard2020-marker)] Heidi Howard and Richard Mortier. +[Paxos vs Raft: Have we reached consensus on distributed +consensus?](https://arxiv.org/abs/2004.05074). At *7th Workshop on Principles and Practice of Consistency for Distributed +Data* (PaPoC), April 2020. +[doi:10.1145/3380787.3393681](https://doi.org/10.1145/3380787.3393681) + +[[70](/en/ch10#Castro2002-marker)] Miguel Castro and Barbara H. Liskov. +[Practical +Byzantine Fault Tolerance and Proactive Recovery](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/p398-castro-bft-tocs.pdf). *ACM Transactions on Computer Systems*, +volume 20, issue 4, pages 396–461, November 2002. +[doi:10.1145/571637.571640](https://doi.org/10.1145/571637.571640) + +[[71](/en/ch10#Bano2019_ch10-marker)] Shehar Bano, Alberto Sonnino, Mustafa +Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis. +[SoK: Consensus in the Age of Blockchains](https://smeiklej.com/files/aft19a.pdf). At +*1st ACM Conference on Advances in Financial Technologies* (AFT), October 2019. +[doi:10.1145/3318041.3355458](https://doi.org/10.1145/3318041.3355458) + +[[72](/en/ch10#Fischer1985-marker)] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. +[Impossibility of Distributed Consensus with +One Faulty Process](https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf). *Journal of the ACM*, volume 32, issue 2, pages 374–382, April 1985. 
+[doi:10.1145/3149.214121](https://doi.org/10.1145/3149.214121) + +[[73](/en/ch10#Chandra1996-marker)] Tushar Deepak Chandra and Sam Toueg. +[Unreliable Failure Detectors +for Reliable Distributed Systems](https://courses.csail.mit.edu/6.852/08/papers/CT96-JACM.pdf). *Journal of the ACM*, volume 43, issue 2, pages +225–267, March 1996. +[doi:10.1145/226643.226647](https://doi.org/10.1145/226643.226647) + +[[74](/en/ch10#BenOr1983-marker)] Michael Ben-Or. +[Another Advantage of Free Choice: +Completely Asynchronous Agreement Protocols](https://homepage.cs.uiowa.edu/~ghosh/BenOr.pdf). At *2nd ACM Symposium on Principles of +Distributed Computing* (PODC), August 1983. +[doi:10.1145/800221.806707](https://doi.org/10.1145/800221.806707) + +[[75](/en/ch10#Dwork1988_ch10-marker)] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. +[Consensus in the Presence of +Partial Synchrony](https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf). *Journal of the ACM*, volume 35, issue 2, pages 288–323, April 1988. +[doi:10.1145/42282.42283](https://doi.org/10.1145/42282.42283) + +[[76](/en/ch10#Defago2004-marker)] Xavier Défago, André Schiper, and Péter Urbán. +[Total Order +Broadcast and Multicast Algorithms: Taxonomy and Survey](https://dspace.jaist.ac.jp/dspace/bitstream/10119/4883/1/defago_et_al.pdf). *ACM Computing Surveys*, volume +36, issue 4, pages 372–421, December 2004. +[doi:10.1145/1041680.1041682](https://doi.org/10.1145/1041680.1041682) + +[[77](/en/ch10#Attiya2004-marker)] Hagit Attiya and Jennifer Welch. *Distributed +Computing: Fundamentals, Simulations and Advanced Topics*, 2nd edition. +John Wiley & Sons, 2004. ISBN: 978-0-471-45324-6, +[doi:10.1002/0471478210](https://doi.org/10.1002/0471478210) + +[[78](/en/ch10#Guerraoui1995-marker)] Rachid Guerraoui. +[Revisiting +the Relationship Between Non-Blocking Atomic Commitment and Consensus](https://citeseerx.ist.psu.edu/pdf/5d06489503b6f791aa56d2d7942359c2592e44b0). 
At *9th International +Workshop on Distributed Algorithms* (WDAG), September 1995. +[doi:10.1007/BFb0022140](https://doi.org/10.1007/BFb0022140) + +[[79](/en/ch10#Gray2006-marker)] Jim N. Gray and Leslie Lamport. +[Consensus on Transaction +Commit](https://dsf.berkeley.edu/cs286/papers/paxoscommit-tods2006.pdf). *ACM Transactions on Database Systems* (TODS), volume 31, issue 1, pages 133–160, +March 2006. [doi:10.1145/1132863.1132867](https://doi.org/10.1145/1132863.1132867) + +[[80](/en/ch10#Schneider1990-marker)] Fred B. Schneider. +[Implementing Fault-Tolerant +Services Using the State Machine Approach: A Tutorial](https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf). *ACM Computing Surveys*, volume +22, issue 4, pages 299–319, December 1990. +[doi:10.1145/98163.98167](https://doi.org/10.1145/98163.98167) + +[[81](/en/ch10#Thomson2012-marker)] Alexander Thomson, Thaddeus Diamond, Shu-Chun +Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. +[Calvin: Fast +Distributed Transactions for Partitioned Database Systems](https://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf). At *ACM International Conference +on Management of Data* (SIGMOD), May 2012. +[doi:10.1145/2213836.2213838](https://doi.org/10.1145/2213836.2213838) + +[[82](/en/ch10#Balakrishnan2013-marker)] Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, +Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and Aviad Zuck. +[Tango: +Distributed Data Structures over a Shared Log](https://www.microsoft.com/en-us/research/publication/tango-distributed-data-structures-over-a-shared-log/). At *24th ACM Symposium on Operating Systems +Principles* (SOSP), November 2013. +[doi:10.1145/2517349.2522732](https://doi.org/10.1145/2517349.2522732) + +[[83](/en/ch10#Balakrishnan2012-marker)] Mahesh +Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, and John D. Davis. 
+[CORFU: A Shared +Log Design for Flash Clusters](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final30.pdf). At *9th USENIX Symposium on Networked Systems Design and +Implementation* (NSDI), April 2012. + +[[84](/en/ch10#Gavrielatos2021-marker)] Vasilis Gavrielatos, +Antonios Katsarakis, and Vijay Nagarajan. +[Odyssey: the impact of modern +hardware on strongly-consistent replication protocols](https://vasigavr1.github.io/files/Odyssey_Eurosys_2021.pdf). At *16th European Conference on +Computer Systems* (EuroSys), April 2021. +[doi:10.1145/3447786.3456240](https://doi.org/10.1145/3447786.3456240) + +[[85](/en/ch10#Howard2016_ch10-marker)] Heidi Howard, Dahlia Malkhi, and +Alexander Spiegelman. +[Flexible +Paxos: Quorum Intersection Revisited](https://drops.dagstuhl.de/opus/volltexte/2017/7094/pdf/LIPIcs-OPODIS-2016-25.pdf). At *20th International Conference on Principles of +Distributed Systems* (OPODIS), December 2016. +[doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25) + +[[86](/en/ch10#Kleppmann2024distsys-marker)] Martin Kleppmann. +[Distributed Systems +lecture notes](https://www.cl.cam.ac.uk/teaching/2425/ConcDisSys/dist-sys-notes.pdf). *University of Cambridge*, October 2024. +Archived at [perma.cc/SS3Q-FNS5](https://perma.cc/SS3Q-FNS5) + +[[87](/en/ch10#Kingsbury2015elastic-marker)] Kyle Kingsbury. +[Call Me Maybe: +Elasticsearch 1.5.0](https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0). *aphyr.com*, April 2015. +Archived at [perma.cc/37MZ-JT7H](https://perma.cc/37MZ-JT7H) + +[[88](/en/ch10#Howard2015coracle-marker)] Heidi Howard and Jon Crowcroft. +[Coracle: Evaluating +Consensus at the Internet Edge](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p85.pdf). At *Annual Conference of the ACM Special Interest Group on +Data Communication* (SIGCOMM), August 2015. 
+[doi:10.1145/2829988.2790010](https://doi.org/10.1145/2829988.2790010) + +[[89](/en/ch10#Lianza2020_ch10-marker)] Tom Lianza and Chris Snook. +[A Byzantine failure +in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/). *blog.cloudflare.com*, November 2020. +Archived at [perma.cc/83EZ-ALCY](https://perma.cc/83EZ-ALCY) + +[[90](/en/ch10#Kelly2014-marker)] Ivan Kelly. +[BookKeeper Tutorial](https://github.com/ivankelly/bookkeeper-tutorial). +*github.com*, October 2014. +Archived at [perma.cc/37Y6-VZWU](https://perma.cc/37Y6-VZWU) + +[[91](/en/ch10#Vanlightly2021-marker)] Jack Vanlightly. +[Apache +BookKeeper Insights Part 1 — External Consensus and Dynamic Membership](https://medium.com/splunk-maas/apache-bookkeeper-insights-part-1-external-consensus-and-dynamic-membership-c259f388da21). *medium.com*, November 2021. +Archived at [perma.cc/3MDB-8GFB](https://perma.cc/3MDB-8GFB) \ No newline at end of file diff --git a/content/en/ch11.md b/content/en/ch11.md index 2396b69..b406c70 100644 --- a/content/en/ch11.md +++ b/content/en/ch11.md @@ -1,30 +1,46 @@ --- -title: "Stream Processing" -linkTitle: "11. Stream Processing" +title: "11. Batch Processing" weight: 311 breadcrumbs: false --- +> [!IMPORTANT] +> This chapter is from the 1st edition, the 2nd edition is not available yet -![](/img/ch11.png) +![](/map/ch10.png) -> *A complex system that works is invariably found to have evolved from a simple system that works. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work.* +> *A system cannot be successful if it is too strongly influenced by a single person. 
Once the initial design is complete and fairly robust, the real test begins as people with many different viewpoints undertake their own experiments.* > -> ​ — John Gall, *Systemantics* (1975) +> ​ — Donald Knuth --------------- -In [Chapter 10](/en/ch10) we discussed batch processing—techniques that read a set of files as input and produce a new set of output files. The output is a form of *derived data*; that is, a dataset that can be recreated by running the batch process again if necessary. We saw how this simple but powerful idea can be used to create search indexes, recom‐ mendation systems, analytics, and more. +In the first two parts of this book we talked a lot about *requests* and *queries*, and the corresponding *responses* or *results*. This style of data processing is assumed in many modern data systems: you ask for something, or you send an instruction, and some time later the system (hopefully) gives you an answer. Databases, caches, search indexes, web servers, and many other systems work this way. -However, one big assumption remained throughout [Chapter 10](/en/ch10): namely, that the input is bounded—i.e., of a known and finite size—so the batch process knows when it has finished reading its input. For example, the sorting operation that is central to MapReduce must read its entire input before it can start producing output: it could happen that the very last input record is the one with the lowest key, and thus needs to be the very first output record, so starting the output early is not an option. +In such *online* systems, whether it’s a web browser requesting a page or a service call‐ ing a remote API, we generally assume that the request is triggered by a human user, and that the user is waiting for the response. They shouldn’t have to wait too long, so we pay a lot of attention to the *response time* of these systems (see “[Describing Performance](/en/ch1#describing-performance)”). 
-In reality, a lot of data is unbounded because it arrives gradually over time: your users produced data yesterday and today, and they will continue to produce more data tomorrow. Unless you go out of business, this process never ends, and so the dataset is never “complete” in any meaningful way [1]. Thus, batch processors must artifi‐ cially divide the data into chunks of fixed duration: for example, processing a day’s worth of data at the end of every day, or processing an hour’s worth of data at the end of every hour. +The web, and increasing numbers of HTTP/REST-based APIs, has made the request/ response style of interaction so common that it’s easy to take it for granted. But we should remember that it’s not the only way of building systems, and that other approaches have their merits too. Let’s distinguish three different types of systems: -The problem with daily batch processes is that changes in the input are only reflected in the output a day later, which is too slow for many impatient users. To reduce the delay, we can run the processing more frequently—say, processing a second’s worth of data at the end of every second—or even continuously, abandoning the fixed time slices entirely and simply processing every event as it happens. That is the idea behind *stream processing*. +***Services (online systems)*** -In general, a “stream” refers to data that is incrementally made available over time. The concept appears in many places: in the stdin and stdout of Unix, programming languages (lazy lists) [2], filesystem APIs (such as Java’s `FileInputStream`), TCP con‐ nections, delivering audio and video over the internet, and so on. +A service waits for a request or instruction from a client to arrive. When one is received, the service tries to handle it as quickly as possible and sends a response back. 
Response time is usually the primary measure of performance of a service, and availability is often very important (if the client can’t reach the service, the user will probably get an error message). + +***Batch processing systems (offline systems)*** + +A batch processing system takes a large amount of input data, runs a *job* to pro‐ cess it, and produces some output data. Jobs often take a while (from a few minutes to several days), so there normally isn’t a user waiting for the job to fin‐ ish. Instead, batch jobs are often scheduled to run periodically (for example, once a day). The primary performance measure of a batch job is usually *throughput* (the time it takes to crunch through an input dataset of a certain size). We dis‐ cuss batch processing in this chapter. + +***Stream processing systems (near-real-time systems)*** + +Stream processing is somewhere between online and offline/batch processing (so it is sometimes called *near-real-time* or *nearline* processing). Like a batch pro‐ cessing system, a stream processor consumes inputs and produces outputs (rather than responding to requests). However, a stream job operates on events shortly after they happen, whereas a batch job operates on a fixed set of input data. This difference allows stream processing systems to have lower latency than the equivalent batch systems. As stream processing builds upon batch process‐ ing, we discuss it in [Chapter 11](/en/ch11). + +As we shall see in this chapter, batch processing is an important building block in our quest to build reliable, scalable, and maintainable applications. For example, Map‐ Reduce, a batch processing algorithm published in 2004 [1], was (perhaps over- enthusiastically) called “the algorithm that makes Google so massively scalable” [2]. It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB. 
+ +MapReduce is a fairly low-level programming model compared to the parallel pro‐ cessing systems that were developed for data warehouses many years previously [3, 4], but it was a major step forward in terms of the scale of processing that could be achieved on commodity hardware. Although the importance of MapReduce is now declining [5], it is still worth understanding, because it provides a clear picture of why and how batch processing is useful. + +In fact, batch processing is a very old form of computing. Long before programmable digital computers were invented, punch card tabulating machines—such as the Hol‐ lerith machines used in the 1890 US Census [6]—implemented a semi-mechanized form of batch processing to compute aggregate statistics from large inputs. And Map‐ Reduce bears an uncanny resemblance to the electromechanical IBM card-sorting machines that were widely used for business data processing in the 1940s and 1950s [7]. As usual, history has a tendency of repeating itself. + +In this chapter, we will look at MapReduce and several other batch processing algo‐ rithms and frameworks, and explore how they are used in modern data systems. But first, to get started, we will look at data processing using standard Unix tools. Even if you are already familiar with them, a reminder about the Unix philosophy is worthwhile because the ideas and lessons from Unix carry over to large-scale, heterogene‐ ous distributed data systems. -In this chapter we will look at *event streams* as a data management mechanism: the unbounded, incrementally processed counterpart to the batch data we saw in the last chapter. We will first discuss how streams are represented, stored, and transmit‐ ted over a network. In “[Databases and Streams](#databases-and-streams)” we will investigate the relationship between streams and databases. 
And finally, in “[Processing Streams](#processing-streams)” we will explore approaches and tools for processing those streams continually, and ways that they can be used to build applications. ## …… @@ -33,146 +49,131 @@ In this chapter we will look at *event streams* as a data management mechanism: ## Summary -In this chapter we have discussed event streams, what purposes they serve, and how to process them. In some ways, stream processing is very much like the batch pro‐ cessing we discussed in [Chapter 10](/en/ch10), but done continuously on unbounded (neverending) streams rather than on a fixed-size input. From this perspective, message brokers and event logs serve as the streaming equivalent of a filesystem. -We spent some time comparing two types of message brokers: +In this chapter we explored the topic of batch processing. We started by looking at Unix tools such as awk, grep, and sort, and we saw how the design philosophy of those tools is carried forward into MapReduce and more recent dataflow engines. Some of those design principles are that inputs are immutable, outputs are intended to become the input to another (as yet unknown) program, and complex problems are solved by composing small tools that “do one thing well.” -***AMQP/JMS-style message broker*** +In the Unix world, the uniform interface that allows one program to be composed with another is files and pipes; in MapReduce, that interface is a distributed filesys‐ tem. We saw that dataflow engines add their own pipe-like data transport mecha‐ nisms to avoid materializing intermediate state to the distributed filesystem, but the initial input and final output of a job is still usually HDFS. -The broker assigns individual messages to consumers, and consumers acknowl‐ edge individual messages when they have been successfully processed. Messages are deleted from the broker once they have been acknowledged. 
This approach is appropriate as an asynchronous form of RPC (see also “[Message-Passing Data‐ flow]()”), for example in a task queue, where the exact order of mes‐ sage processing is not important and where there is no need to go back and read old messages again after they have been processed. +The two main problems that distributed batch processing frameworks need to solve are: -***Log-based message broker*** +***Partitioning*** -The broker assigns all messages in a partition to the same consumer node, and always delivers messages in the same order. Parallelism is achieved through par‐ titioning, and consumers track their progress by checkpointing the offset of the last message they have processed. The broker retains messages on disk, so it is possible to jump back and reread old messages if necessary. +In MapReduce, mappers are partitioned according to input file blocks. The out‐ put of mappers is repartitioned, sorted, and merged into a configurable number of reducer partitions. The purpose of this process is to bring all the related data— e.g., all the records with the same key—together in the same place. -The log-based approach has similarities to the replication logs found in databases (see [Chapter 5](/en/ch5)) and log-structured storage engines (see [Chapter 3](/en/ch3)). We saw that this approach is especially appropriate for stream processing systems that consume input streams and generate derived state or derived output streams. +Post-MapReduce dataflow engines try to avoid sorting unless it is required, but they otherwise take a broadly similar approach to partitioning. -In terms of where streams come from, we discussed several possibilities: user activity events, sensors providing periodic readings, and data feeds (e.g., market data in finance) are naturally represented as streams. 
We saw that it can also be useful to think of the writes to a database as a stream: we can capture the changelog—i.e., the history of all changes made to a database—either implicitly through change data cap‐ ture or explicitly through event sourcing. Log compaction allows the stream to retain a full copy of the contents of a database. +***Fault tolerance*** -Representing databases as streams opens up powerful opportunities for integrating systems. You can keep derived data systems such as search indexes, caches, and analytics systems continually up to date by consuming the log of changes and applying them to the derived system. You can even build fresh views onto existing data by starting from scratch and consuming the log of changes from the beginning all the way to the present. +MapReduce frequently writes to disk, which makes it easy to recover from an individual failed task without restarting the entire job but slows down execution in the failure-free case. Dataflow engines perform less materialization of inter‐ mediate state and keep more in memory, which means that they need to recom‐ pute more data if a node fails. Deterministic operators reduce the amount of data that needs to be recomputed. -The facilities for maintaining state as streams and replaying messages are also the basis for the techniques that enable stream joins and fault tolerance in various stream processing frameworks. We discussed several purposes of stream processing, including searching for event patterns (complex event processing), computing windowed aggregations (stream analytics), and keeping derived data systems up to date (materialized views). -We then discussed the difficulties of reasoning about time in a stream processor, including the distinction between processing time and event timestamps, and the problem of dealing with straggler events that arrive after you thought your window was complete. 
-We distinguished three types of joins that may appear in stream processes: +We discussed several join algorithms for MapReduce, most of which are also inter‐ nally used in MPP databases and dataflow engines. They also provide a good illustra‐ tion of how partitioned algorithms work: -***Stream-stream joins*** +***Sort-merge joins*** -Both input streams consist of activity events, and the join operator searches for related events that occur within some window of time. For example, it may match two actions taken by the same user within 30 minutes of each other. The two join inputs may in fact be the same stream (a *self-join*) if you want to find related events within that one stream. +Each of the inputs being joined goes through a mapper that extracts the join key. By partitioning, sorting, and merging, all the records with the same key end up going to the same call of the reducer. This function can then output the joined records. -***Stream-table joins*** +***Broadcast hash joins*** -One input stream consists of activity events, while the other is a database change‐ log. The changelog keeps a local copy of the database up to date. For each activity event, the join operator queries the database and outputs an enriched activity event. +One of the two join inputs is small, so it is not partitioned and it can be entirely loaded into a hash table. Thus, you can start a mapper for each partition of the large join input, load the hash table for the small input into each mapper, and then scan over the large input one record at a time, querying the hash table for each record. -***Table-table joins*** +***Partitioned hash joins*** -Both input streams are database changelogs. In this case, every change on one side is joined with the latest state of the other side. The result is a stream of changes to the materialized view of the join between the two tables. 
+If the two join inputs are partitioned in the same way (using the same key, same hash function, and same number of partitions), then the hash table approach can be used independently for each partition. -Finally, we discussed techniques for achieving fault tolerance and exactly-once semantics in a stream processor. As with batch processing, we need to discard the partial output of any failed tasks. However, since a stream process is long-running and produces output continuously, we can’t simply discard all output. Instead, a finer-grained recovery mechanism can be used, based on microbatching, checkpointing, transactions, or idempotent writes. +Distributed batch processing engines have a deliberately restricted programming model: callback functions (such as mappers and reducers) are assumed to be stateless and to have no externally visible side effects besides their designated output. This restriction allows the framework to hide some of the hard distributed systems problems behind its abstraction: in the face of crashes and network issues, tasks can be retried safely, and the output from any failed tasks is discarded. If several tasks for a partition succeed, only one of them actually makes its output visible. + +Thanks to the framework, your code in a batch processing job does not need to worry about implementing fault-tolerance mechanisms: the framework can guarantee that the final output of a job is the same as if no faults had occurred, even though in reality various tasks perhaps had to be retried. These reliable semantics are much stronger than what you usually have in online services that handle user requests and that write to databases as a side effect of processing a request. + +The distinguishing feature of a batch processing job is that it reads some input data and produces some output data, without modifying the input—in other words, the output is derived from the input.
Crucially, the input data is *bounded*: it has a known, fixed size (for example, it consists of a set of log files at some point in time, or a snapshot of a database’s contents). Because it is bounded, a job knows when it has finished reading the entire input, and so a job eventually completes when it is done. + +In the next chapter, we will turn to stream processing, in which the input is *unbounded*—that is, you still have a job, but its inputs are never-ending streams of data. In this case, a job is never complete, because at any time there may still be more work coming in. We shall see that stream and batch processing are similar in some respects, but the assumption of unbounded streams also changes a lot about how we build systems. + ## References -1. Tyler Akidau, Robert Bradshaw, Craig Chambers, et al.: “[The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 12, pages 1792–1803, August 2015. [doi:10.14778/2824032.2824076](http://dx.doi.org/10.14778/2824032.2824076) -1. Harold Abelson, Gerald Jay Sussman, and Julie Sussman: [*Structure and Interpretation of Computer Programs*](https://web.archive.org/web/20220807043536/https://mitpress.mit.edu/sites/default/files/sicp/index.html), 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, available online at *mitpress.mit.edu* -1. Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “[The Many Faces of Publish/Subscribe](http://www.cs.ru.nl/~pieter/oss/manyfaces.pdf),” *ACM Computing Surveys*, volume 35, number 2, pages 114–131, June 2003. [doi:10.1145/857076.857078](http://dx.doi.org/10.1145/857076.857078) -1. Joseph M. Hellerstein and Michael Stonebraker: [*Readings in Database Systems*](http://redbook.cs.berkeley.edu/), 4th edition. MIT Press, 2005.
ISBN: 978-0-262-69314-1, available online at *redbook.cs.berkeley.edu* -1. Don Carney, Uğur Çetintemel, Mitch Cherniack, et al.: “[Monitoring Streams – A New Class of Data Management Applications](http://www.vldb.org/conf/2002/S07P02.pdf),” at *28th International Conference on Very Large Data Bases* (VLDB), August 2002. -1. Matthew Sackman: “[Pushing Back](https://wellquite.org/posts/lshift/pushing_back/),” *lshift.net*, May 5, 2016. -1. Vicent Martí: “[Brubeck, a statsd-Compatible Metrics Aggregator](http://githubengineering.com/brubeck/),” *githubengineering.com*, June 15, 2015. -1. Seth Lowenberger: “[MoldUDP64 Protocol Specification V 1.00](http://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf),” *nasdaqtrader.com*, July 2009. -1. Pieter Hintjens: [*ZeroMQ – The Guide*](http://zguide.zeromq.org/page:all). O'Reilly Media, 2013. ISBN: 978-1-449-33404-8 -1. Ian Malpass: “[Measure Anything, Measure Everything](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/),” *codeascraft.com*, February 15, 2011. -1. Dieter Plaetinck: “[25 Graphite, Grafana and statsd Gotchas](https://grafana.com/blog/2016/03/03/25-graphite-grafana-and-statsd-gotchas/),” *grafana.com*, March 3, 2016. -1. Jeff Lindsay: “[Web Hooks to Revolutionize the Web](https://web.archive.org/web/20180928201955/http://progrium.com/blog/2007/05/03/web-hooks-to-revolutionize-the-web/),” *progrium.com*, May 3, 2007. -1. Jim N. Gray: “[Queues Are Databases](https://arxiv.org/pdf/cs/0701158.pdf),” Microsoft Research Technical Report MSR-TR-95-56, December 1995. -1. Mark Hapner, Rich Burridge, Rahul Sharma, et al.: “[JSR-343 Java Message Service (JMS) 2.0 Specification](https://jcp.org/en/jsr/detail?id=343),” *jms-spec.java.net*, March 2013. -1. Sanjay Aiyagari, Matthew Arrott, Mark Atwell, et al.: “[AMQP: Advanced Message Queuing Protocol Specification](http://www.rabbitmq.com/resources/specs/amqp0-9-1.pdf),” Version 0-9-1, November 2008. -1. 
“[Google Cloud Pub/Sub: A Google-Scale Messaging Service](https://cloud.google.com/pubsub/architecture),” *cloud.google.com*, 2016. -1. “[Apache Kafka 0.9 Documentation](http://kafka.apache.org/documentation.html),” *kafka.apache.org*, November 2015. -1. Jay Kreps, Neha Narkhede, and Jun Rao: “[Kafka: A Distributed Messaging System for Log Processing](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf),” at *6th International Workshop on Networking Meets Databases* (NetDB), June 2011. -1. “[Amazon Kinesis Streams Developer Guide](http://docs.aws.amazon.com/streams/latest/dev/introduction.html),” *docs.aws.amazon.com*, April 2016. -1. Leigh Stewart and Sijie Guo: “[Building DistributedLog: Twitter’s High-Performance Replicated Log Service](https://blog.twitter.com/2015/building-distributedlog-twitter-s-high-performance-replicated-log-service),” *blog.twitter.com*, September 16, 2015. -1. “[DistributedLog Documentation](https://web.archive.org/web/20210517201308/https://bookkeeper.apache.org/distributedlog/docs/latest/),” Apache Software Foundation, *distributedlog.io*. -1. Jay Kreps: “[Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)](https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines),” *engineering.linkedin.com*, April 27, 2014. -1. Kartik Paramasivam: “[How We’re Improving and Advancing Kafka at LinkedIn](https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin),” *engineering.linkedin.com*, September 2, 2015. -1. Jay Kreps: “[The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying),” *engineering.linkedin.com*, December 16, 2013. -1. 
Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “[All Aboard the Databus!](http://www.socc2012.org/s18-das.pdf),” at *3rd ACM Symposium on Cloud Computing* (SoCC), October 2012. -1. Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, et al.: “[Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf),” at *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015. -1. P. P. S. Narayan: “[Sherpa Update](http://web.archive.org/web/20160801221400/https://developer.yahoo.com/blogs/ydn/sherpa-7992.html),” *developer.yahoo.com*, June 8, . -1. Martin Kleppmann: “[Bottled Water: Real-Time Integration of PostgreSQL and Kafka](http://martin.kleppmann.com/2015/04/23/bottled-water-real-time-postgresql-kafka.html),” *martin.kleppmann.com*, April 23, 2015. -1. Ben Osheroff: “[Introducing Maxwell, a mysql-to-kafka Binlog Processor](https://web.archive.org/web/20170208100334/https://developer.zendesk.com/blog/introducing-maxwell-a-mysql-to-kafka-binlog-processor),” *developer.zendesk.com*, August 20, 2015. -1. Randall Hauch: “[Debezium 0.2.1 Released](https://debezium.io/blog/2016/06/10/Debezium-0.2.1-Released/),” *debezium.io*, June 10, 2016. -1. Prem Santosh Udaya Shankar: “[Streaming MySQL Tables in Real-Time to Kafka](https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html),” *engineeringblog.yelp.com*, August 1, 2016. -1. “[Mongoriver](https://github.com/stripe/mongoriver),” Stripe, Inc., *github.com*, September 2014. -1. Dan Harvey: “[Change Data Capture with Mongo + Kafka](http://www.slideshare.net/danharvey/change-data-capture-with-mongodb-and-kafka),” at *Hadoop Users Group UK*, August 2015. -1. 
“[Oracle GoldenGate 12c: Real-Time Access to Real-Time Information](https://web.archive.org/web/20160923105841/http://www.oracle.com/us/products/middleware/data-integration/oracle-goldengate-realtime-access-2031152.pdf),” Oracle White Paper, March 2015. -1. “[Oracle GoldenGate Fundamentals: How Oracle GoldenGate Works](https://www.youtube.com/watch?v=6H9NibIiPQE),” Oracle Corporation, *youtube.com*, November 2012. -1. Slava Akhmechet: “[Advancing the Realtime Web](http://rethinkdb.com/blog/realtime-web/),” *rethinkdb.com*, January 27, 2015. -1. “[Firebase Realtime Database Documentation](https://firebase.google.com/docs/database/),” Google, Inc., *firebase.google.com*, May 2016. -1. “[Apache CouchDB 1.6 Documentation](http://docs.couchdb.org/en/latest/),” *docs.couchdb.org*, 2014. -1. Matt DeBergalis: “[Meteor 0.7.0: Scalable Database Queries Using MongoDB Oplog Instead of Poll-and-Diff](https://web.archive.org/web/20160324055429/http://info.meteor.com/blog/meteor-070-scalable-database-queries-using-mongodb-oplog-instead-of-poll-and-diff),” *info.meteor.com*, December 17, 2013. -1. “[Chapter 15. Importing and Exporting Live Data](https://docs.voltdb.com/UsingVoltDB/ChapExport.php),” VoltDB 6.4 User Manual, *docs.voltdb.com*, June 2016. -1. Neha Narkhede: “[Announcing Kafka Connect: Building Large-Scale Low-Latency Data Pipelines](http://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines),” *confluent.io*, February 18, 2016. -1. Greg Young: “[CQRS and Event Sourcing](https://www.youtube.com/watch?v=JHGkaShoyNs),” at *Code on the Beach*, August 2014. -1. Martin Fowler: “[Event Sourcing](http://martinfowler.com/eaaDev/EventSourcing.html),” *martinfowler.com*, December 12, 2005. -1. Vaughn Vernon: [*Implementing Domain-Driven Design*](https://www.informit.com/store/implementing-domain-driven-design-9780321834577). Addison-Wesley Professional, 2013. ISBN: 978-0-321-83457-7 -1. H. V. 
Jagadish, Inderpal Singh Mumick, and Abraham Silberschatz: “[View Maintenance Issues for the Chronicle Data Model](https://dl.acm.org/doi/10.1145/212433.220201),” at *14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems* (PODS), May 1995. [doi:10.1145/212433.220201](http://dx.doi.org/10.1145/212433.220201) -1. “[Event Store 3.5.0 Documentation](http://docs.geteventstore.com/),” Event Store LLP, *docs.geteventstore.com*, February 2016. -1. Martin Kleppmann: [*Making Sense of Stream Processing*](http://www.oreilly.com/data/free/stream-processing.csp). Report, O'Reilly Media, May 2016. -1. Sander Mak: “[Event-Sourced Architectures with Akka](http://www.slideshare.net/SanderMak/eventsourced-architectures-with-akka),” at *JavaOne*, September 2014. -1. Julian Hyde: [personal communication](https://twitter.com/julianhyde/status/743374145006641153), June 2016. -1. Ashish Gupta and Inderpal Singh Mumick: *Materialized Views: Techniques, Implementations, and Applications*. MIT Press, 1999. ISBN: 978-0-262-57122-7 -1. Timothy Griffin and Leonid Libkin: “[Incremental Maintenance of Views with Duplicates](http://homepages.inf.ed.ac.uk/libkin/papers/sigmod95.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), May 1995. [doi:10.1145/223784.223849](http://dx.doi.org/10.1145/223784.223849) -1. Pat Helland: “[Immutability Changes Everything](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015. -1. Martin Kleppmann: “[Accounting for Computer Scientists](http://martin.kleppmann.com/2011/03/07/accounting-for-computer-scientists.html),” *martin.kleppmann.com*, March 7, 2011. -1. Pat Helland: “[Accountants Don't Use Erasers](https://web.archive.org/web/20200220161036/https://blogs.msdn.microsoft.com/pathelland/2007/06/14/accountants-dont-use-erasers/),” *blogs.msdn.com*, June 14, 2007. -1. 
Fangjin Yang: “[Dogfooding with Druid, Samza, and Kafka: Metametrics at Metamarkets](https://metamarkets.com/2015/dogfooding-with-druid-samza-and-kafka-metametrics-at-metamarkets/),” *metamarkets.com*, June 3, 2015. -1. Gavin Li, Jianqiu Lv, and Hang Qi: “[Pistachio: Co-Locate the Data and Compute for Fastest Cloud Compute](https://web.archive.org/web/20181214032620/https://yahoohadoop.tumblr.com/post/116365275781/pistachio-co-locate-the-data-and-compute-for),” *yahoohadoop.tumblr.com*, April 13, 2015. -1. Kartik Paramasivam: “[Stream Processing Hard Problems – Part 1: Killing Lambda](https://engineering.linkedin.com/blog/2016/06/stream-processing-hard-problems-part-1-killing-lambda),” *engineering.linkedin.com*, June 27, 2016. -1. Martin Fowler: “[CQRS](http://martinfowler.com/bliki/CQRS.html),” *martinfowler.com*, July 14, 2011. -1. Greg Young: “[CQRS Documents](https://cqrs.files.wordpress.com/2010/11/cqrs_documents.pdf),” *cqrs.files.wordpress.com*, November 2010. -1. Baron Schwartz: “[Immutability, MVCC, and Garbage Collection](https://web.archive.org/web/20161110094746/http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/),” *xaprb.com*, December 28, 2013. -1. Daniel Eloff, Slava Akhmechet, Jay Kreps, et al.: ["Re: Turning the Database Inside-out with Apache Samza](https://news.ycombinator.com/item?id=9145197)," Hacker News discussion, *news.ycombinator.com*, March 4, 2015. -1. “[Datomic Development Resources: Excision](http://docs.datomic.com/excision.html),” Cognitect, Inc., *docs.datomic.com*. -1. “[Fossil Documentation: Deleting Content from Fossil](http://fossil-scm.org/index.html/doc/trunk/www/shunning.wiki),” *fossil-scm.org*, 2016. -1. Jay Kreps: “[The irony of distributed systems is that data loss is really easy but deleting data is surprisingly hard,](https://twitter.com/jaykreps/status/582580836425330688)” *twitter.com*, March 30, 2015. -1. David C. 
Luckham: “[What’s the Difference Between ESP and CEP?](http://www.complexevents.com/2006/08/01/what%E2%80%99s-the-difference-between-esp-and-cep/),” *complexevents.com*, August 1, 2006. -1. Srinath Perera: “[How Is Stream Processing and Complex Event Processing (CEP) Different?](https://www.quora.com/How-is-stream-processing-and-complex-event-processing-CEP-different),” *quora.com*, December 3, 2015. -1. Arvind Arasu, Shivnath Babu, and Jennifer Widom: “[The CQL Continuous Query Language: Semantic Foundations and Query Execution](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cql.pdf),” *The VLDB Journal*, volume 15, number 2, pages 121–142, June 2006. [doi:10.1007/s00778-004-0147-z](http://dx.doi.org/10.1007/s00778-004-0147-z) -1. Julian Hyde: “[Data in Flight: How Streaming SQL Technology Can Help Solve the Web 2.0 Data Crunch](http://queue.acm.org/detail.cfm?id=1667562),” *ACM Queue*, volume 7, number 11, December 2009. [doi:10.1145/1661785.1667562](http://dx.doi.org/10.1145/1661785.1667562) -1. “[Esper Reference, Version 5.4.0](http://esper.espertech.com/release-5.4.0/esper-reference/html_single/index.html),” EsperTech, Inc., *espertech.com*, April 2016. -1. Zubair Nabi, Eric Bouillet, Andrew Bainbridge, and Chris Thomas: “[Of Streams and Storms](https://web.archive.org/web/20170711081434/https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2014/04/Streams-and-Storm-April-2014-Final.pdf),” IBM technical report, *developer.ibm.com*, April 2014. -1. Milinda Pathirage, Julian Hyde, Yi Pan, and Beth Plale: “[SamzaSQL: Scalable Fast Data Management with Streaming SQL](https://github.com/milinda/samzasql-hpbdc2016/blob/master/samzasql-hpbdc2016.pdf),” at *IEEE International Workshop on High-Performance Big Data Computing* (HPBDC), May 2016. [doi:10.1109/IPDPSW.2016.141](http://dx.doi.org/10.1109/IPDPSW.2016.141) -1. 
Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier: “[HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf),” at *Conference on Analysis of Algorithms* (AofA), June 2007. -1. Jay Kreps: “[Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture),” *oreilly.com*, July 2, 2014. -1. Ian Hellström: “[An Overview of Apache Streaming Technologies](https://databaseline.bitbucket.io/an-overview-of-apache-streaming-technologies/),” *databaseline.bitbucket.io*, March 12, 2016. -1. Jay Kreps: “[Why Local State Is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing),” *oreilly.com*, July 31, 2014. -1. Shay Banon: “[Percolator](https://www.elastic.co/blog/percolator),” *elastic.co*, February 8, 2011. -1. Alan Woodward and Martin Kleppmann: “[Real-Time Full-Text Search with Luwak and Samza](http://martin.kleppmann.com/2015/04/13/real-time-full-text-search-luwak-samza.html),” *martin.kleppmann.com*, April 13, 2015. -1. “[Apache Storm 2.1.0 Documentation](https://storm.apache.org/releases/2.1.0/index.html),” *storm.apache.org*, October 2019. -1. Tyler Akidau: “[The World Beyond Batch: Streaming 102](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102),” *oreilly.com*, January 20, 2016. -1. Stephan Ewen: “[Streaming Analytics with Apache Flink](https://www.confluent.io/resources/kafka-summit-2016/advanced-streaming-analytics-apache-flink-apache-kafka/),” at *Kafka Summit*, April 2016. -1. Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, et al.: “[MillWheel: Fault-Tolerant Stream Processing at Internet Scale](http://research.google.com/pubs/pub41378.html),” at *39th International Conference on Very Large Data Bases* (VLDB), August 2013. -1. 
Alex Dean: “[Improving Snowplow's Understanding of Time](https://snowplow.io/blog/improving-snowplows-understanding-of-time/),” *snowplowanalytics.com*, September 15, 2015. -1. “[Windowing (Azure Stream Analytics)](https://msdn.microsoft.com/en-us/library/azure/dn835019.aspx),” Microsoft Azure Reference, *msdn.microsoft.com*, April 2016. -1. “[State Management](http://samza.apache.org/learn/documentation/0.10/container/state-management.html),” Apache Samza 0.10 Documentation, *samza.apache.org*, December 2015. -1. Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, et al.: “[Photon: Fault-Tolerant and Scalable Joining of Continuous Data Streams](http://research.google.com/pubs/pub41318.html),” at *ACM International Conference on Management of Data* (SIGMOD), June 2013. [doi:10.1145/2463676.2465272](http://dx.doi.org/10.1145/2463676.2465272) -1. Martin Kleppmann: “[Samza Newsfeed Demo](https://github.com/ept/newsfeed),” *github.com*, September 2014. -1. Ben Kirwin: “[Doing the Impossible: Exactly-Once Messaging Patterns in Kafka](http://ben.kirw.in/2014/11/28/kafka-patterns/),” *ben.kirw.in*, November 28, 2014. -1. Pat Helland: “[Data on the Outside Versus Data on the Inside](http://cidrdb.org/cidr2005/papers/P12.pdf),” at *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005. -1. Ralph Kimball and Margy Ross: *The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*, 3rd edition. John Wiley & Sons, 2013. ISBN: 978-1-118-53080-1 -1. Viktor Klang: “[I'm coining the phrase 'effectively-once' for message processing with at-least-once + idempotent operations](https://twitter.com/viktorklang/status/789036133434978304),” *twitter.com*, October 20, 2016. -1. 
Matei Zaharia, Tathagata Das, Haoyuan Li, et al.: “[Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters](https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final28.pdf),” at *4th USENIX Conference in Hot Topics in Cloud Computing* (HotCloud), June 2012. -1. Kostas Tzoumas, Stephan Ewen, and Robert Metzger: “[High-Throughput, Low-Latency, and Exactly-Once Stream Processing with Apache Flink](https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink),” *ververica.com*, August 5, 2015. -1. Paris Carbone, Gyula Fóra, Stephan Ewen, et al.: “[Lightweight Asynchronous Snapshots for Distributed Dataflows](http://arxiv.org/abs/1506.08603),” arXiv:1506.08603 [cs.DC], June 29, 2015. -1. Ryan Betts and John Hugg: [*Fast Data: Smart and at Scale*](http://www.oreilly.com/data/free/fast-data-smart-and-at-scale.csp). Report, O'Reilly Media, October 2015. -1. Flavio Junqueira: “[Making Sense of Exactly-Once Semantics](https://web.archive.org/web/20160812172900/http://conferences.oreilly.com/strata/hadoop-big-data-eu/public/schedule/detail/49690),” at *Strata+Hadoop World London*, June 2016. -1. Jason Gustafson, Flavio Junqueira, Apurva Mehta, Sriram Subramanian, and Guozhang Wang: “[KIP-98 – Exactly Once Delivery and Transactional Messaging](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging),” *cwiki.apache.org*, November 2016. -1. Pat Helland: “[Idempotence Is Not a Medical Condition](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=4b6dda7fe75b51e1c543a87ca7b3b322fbf55614),” *Communications of the ACM*, volume 55, number 5, page 56, May 2012. [doi:10.1145/2160718.2160734](http://dx.doi.org/10.1145/2160718.2160734) -1. 
Jay Kreps: “[Re: Trying to Achieve Deterministic Behavior on Recovery/Rewind](http://mail-archives.apache.org/mod_mbox/samza-dev/201409.mbox/%3CCAOeJiJg%2Bc7Ei%3DgzCuOz30DD3G5Hm9yFY%3DUJ6SafdNUFbvRgorg%40mail.gmail.com%3E),” email to *samza-dev* mailing list, September 9, 2014. -1. E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson: “[A Survey of Rollback-Recovery Protocols in Message-Passing Systems](http://www.cs.utexas.edu/~lorenzo/papers/SurveyFinal.pdf),” *ACM Computing Surveys*, volume 34, number 3, pages 375–408, September 2002. [doi:10.1145/568522.568525](http://dx.doi.org/10.1145/568522.568525) -1. Adam Warski: “[Kafka Streams – How Does It Fit the Stream Processing Landscape?](https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/),” *softwaremill.com*, June 1, 2016. +1. Jeffrey Dean and Sanjay Ghemawat: “[MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/pub62/),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004. +1. Joel Spolsky: “[The Perils of JavaSchools](https://www.joelonsoftware.com/2005/12/29/the-perils-of-javaschools-2/),” *joelonsoftware.com*, December 29, 2005. +1. Shivnath Babu and Herodotos Herodotou: “[Massively Parallel Databases and MapReduce Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/db-mr-survey-final.pdf),” *Foundations and Trends in Databases*, volume 5, number 1, pages 1–104, November 2013. [doi:10.1561/1900000036](http://dx.doi.org/10.1561/1900000036) +1. David J. DeWitt and Michael Stonebraker: “[MapReduce: A Major Step Backwards](https://homes.cs.washington.edu/~billhowe/mapreduce_a_major_step_backwards.html),” originally published at *databasecolumn.vertica.com*, January 17, 2008. +1. 
Henry Robinson: “[The Elephant Was a Trojan Horse: On the Death of Map-Reduce at Google](https://www.the-paper-trail.org/post/2014-06-25-the-elephant-was-a-trojan-horse-on-the-death-of-map-reduce-at-google/),” *the-paper-trail.org*, June 25, 2014. +1. “[The Hollerith Machine](https://www.census.gov/history/www/innovations/technology/the_hollerith_tabulator.html),” United States Census Bureau, *census.gov*. +1. “[IBM 82, 83, and 84 Sorters Reference Manual](https://bitsavers.org/pdf/ibm/punchedCard/Sorter/A24-1034-1_82-83-84_sorters.pdf),” Edition A24-1034-1, International Business Machines Corporation, July 1962. +1. Adam Drake: “[Command-Line Tools Can Be 235x Faster than Your Hadoop Cluster](https://adamdrake.com/command-line-tools-can-be-235x-faster-than-your-hadoop-cluster.html),” *aadrake.com*, January 25, 2014. +1. “[GNU Coreutils 8.23 Documentation](http://www.gnu.org/software/coreutils/manual/html_node/index.html),” Free Software Foundation, Inc., 2014. +1. Martin Kleppmann: “[Kafka, Samza, and the Unix Philosophy of Distributed Data](http://martin.kleppmann.com/2015/08/05/kafka-samza-unix-philosophy-distributed-data.html),” *martin.kleppmann.com*, August 5, 2015. +1. Doug McIlroy: [Internal Bell Labs memo](https://swtch.com/~rsc/thread/mdmpipe.pdf), October 1964. Cited in: Dennis M. Ritchie: “[Advice from Doug McIlroy](https://www.bell-labs.com/usr/dmr/www/mdmpipe.html),” *bell-labs.com*. +1. M. D. McIlroy, E. N. Pinson, and B. A. Tague: “[UNIX Time-Sharing System: Foreword](https://archive.org/details/bstj57-6-1899),” *The Bell System Technical Journal*, volume 57, number 6, pages 1899–1904, July 1978. +1. Eric S. Raymond: [*The Art of UNIX Programming*](http://www.catb.org/~esr/writings/taoup/html/). Addison-Wesley, 2003. ISBN: 978-0-13-142901-7 +1.
Ronald Duncan: “[Text File Formats – ASCII Delimited Text – Not CSV or TAB Delimited Text](https://ronaldduncan.wordpress.com/2009/10/31/text-file-formats-ascii-delimited-text-not-csv-or-tab-delimited-text/),” *ronaldduncan.wordpress.com*, October 31, 2009. +1. Alan Kay: “[Is 'Software Engineering' an Oxymoron?](http://tinlizzie.org/~takashi/IsSoftwareEngineeringAnOxymoron.pdf),” *tinlizzie.org*. +1. Martin Fowler: “[InversionOfControl](http://martinfowler.com/bliki/InversionOfControl.html),” *martinfowler.com*, June 26, 2005. +1. Daniel J. Bernstein: “[Two File Descriptors for Sockets](http://cr.yp.to/tcpip/twofd.html),” *cr.yp.to*. +1. Rob Pike and Dennis M. Ritchie: “[The Styx Architecture for Distributed Systems](http://doc.cat-v.org/inferno/4th_edition/styx),” *Bell Labs Technical Journal*, volume 4, number 2, pages 146–152, April 1999. +1. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung: “[The Google File System](http://research.google.com/archive/gfs-sosp2003.pdf),” at *19th ACM Symposium on Operating Systems Principles* (SOSP), October 2003. [doi:10.1145/945445.945450](http://dx.doi.org/10.1145/945445.945450) +1. Michael Ovsiannikov, Silvius Rus, Damian Reeves, et al.: “[The Quantcast File System](http://db.disi.unitn.eu/pages/VLDBProgram/pdf/industry/p808-ovsiannikov.pdf),” *Proceedings of the VLDB Endowment*, volume 6, number 11, pages 1092–1101, August 2013. [doi:10.14778/2536222.2536234](http://dx.doi.org/10.14778/2536222.2536234) +1. “[OpenStack Swift 2.6.1 Developer Documentation](http://docs.openstack.org/developer/swift/),” OpenStack Foundation, *docs.openstack.org*, March 2016. +1. Zhe Zhang, Andrew Wang, Kai Zheng, et al.: “[Introduction to HDFS Erasure Coding in Apache Hadoop](https://blog.cloudera.com/introduction-to-hdfs-erasure-coding-in-apache-hadoop/),” *blog.cloudera.com*, September 23, 2015. +1. 
Peter Cnudde: “[Hadoop Turns 10](https://web.archive.org/web/20190119112713/https://yahoohadoop.tumblr.com/post/138739227316/hadoop-turns-10),” *yahoohadoop.tumblr.com*, February 5, 2016. +1. Eric Baldeschwieler: “[Thinking About the HDFS vs. Other Storage Technologies](https://web.archive.org/web/20190529215115/http://hortonworks.com/blog/thinking-about-the-hdfs-vs-other-storage-technologies/),” *hortonworks.com*, July 25, 2012. +1. Brendan Gregg: “[Manta: Unix Meets Map Reduce](https://web.archive.org/web/20220125052545/http://dtrace.org/blogs/brendan/2013/06/25/manta-unix-meets-map-reduce/),” *dtrace.org*, June 25, 2013. +1. Tom White: *Hadoop: The Definitive Guide*, 4th edition. O'Reilly Media, 2015. ISBN: 978-1-491-90163-2 +1. Jim N. Gray: “[Distributed Computing Economics](http://arxiv.org/pdf/cs/0403019.pdf),” Microsoft Research Tech Report MSR-TR-2003-24, March 2003. +1. Márton Trencséni: “[Luigi vs Airflow vs Pinball](http://bytepawn.com/luigi-airflow-pinball.html),” *bytepawn.com*, February 6, 2016. +1. Roshan Sumbaly, Jay Kreps, and Sam Shah: “[The 'Big Data' Ecosystem at LinkedIn](http://www.slideshare.net/s_shah/the-big-data-ecosystem-at-linkedin-23512853),” at *ACM International Conference on Management of Data* (SIGMOD), July 2013. [doi:10.1145/2463676.2463707](http://dx.doi.org/10.1145/2463676.2463707) +1. Alan F. Gates, Olga Natkovich, Shubham Chopra, et al.: “[Building a High-Level Dataflow System on Top of Map-Reduce: The Pig Experience](http://www.vldb.org/pvldb/vol2/vldb09-1074.pdf),” at *35th International Conference on Very Large Data Bases* (VLDB), August 2009. +1. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, et al.: “[Hive – A Petabyte Scale Data Warehouse Using Hadoop](http://i.stanford.edu/~ragho/hive-icde2010.pdf),” at *26th IEEE International Conference on Data Engineering* (ICDE), March 2010. [doi:10.1109/ICDE.2010.5447738](http://dx.doi.org/10.1109/ICDE.2010.5447738) +1. 
“[Cascading 3.0 User Guide](https://web.archive.org/web/20231206195311/http://docs.cascading.org/cascading/3.0/userguide/),” Concurrent, Inc., *docs.cascading.org*, January 2016. +1. “[Apache Crunch User Guide](https://crunch.apache.org/user-guide.html),” Apache Software Foundation, *crunch.apache.org*. +1. Craig Chambers, Ashish Raniwala, Frances Perry, et al.: “[FlumeJava: Easy, Efficient Data-Parallel Pipelines](https://research.google.com/pubs/archive/35650.pdf),” at *31st ACM SIGPLAN Conference on Programming Language Design and Implementation* (PLDI), June 2010. [doi:10.1145/1806596.1806638](http://dx.doi.org/10.1145/1806596.1806638) +1. Jay Kreps: “[Why Local State is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing),” *oreilly.com*, July 31, 2014. +1. Martin Kleppmann: “[Rethinking Caching in Web Apps](http://martin.kleppmann.com/2012/10/01/rethinking-caching-in-web-apps.html),” *martin.kleppmann.com*, October 1, 2012. +1. Mark Grover, Ted Malaska, Jonathan Seidman, and Gwen Shapira: *[Hadoop Application Architectures](http://shop.oreilly.com/product/0636920033196.do)*. O'Reilly Media, 2015. ISBN: 978-1-491-90004-8 +1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015. +1. Sriranjan Manjunath: “[Skewed Join](https://web.archive.org/web/20151228114742/https://wiki.apache.org/pig/PigSkewedJoinSpec),” *wiki.apache.org*, 2009. +1. David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri: “[Practical Skew Handling in Parallel Joins](http://www.vldb.org/conf/1992/P027.PDF),” at *18th International Conference on Very Large Data Bases* (VLDB), August 1992. +1. 
Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: “[Impala: A Modern, Open-Source SQL Engine for Hadoop](http://pandis.net/resources/cidr15impala.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015. +1. Matthieu Monsch: “[Open-Sourcing PalDB, a Lightweight Companion for Storing Side Data](https://engineering.linkedin.com/blog/2015/10/open-sourcing-paldb--a-lightweight-companion-for-storing-side-da),” *engineering.linkedin.com*, October 26, 2015. +1. Daniel Peng and Frank Dabek: “[Large-Scale Incremental Processing Using Distributed Transactions and Notifications](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf),” at *9th USENIX conference on Operating Systems Design and Implementation* (OSDI), October 2010. +1. “[Cloudera Search User Guide](http://www.cloudera.com/documentation/cdh/5-1-x/Search/Cloudera-Search-User-Guide/Cloudera-Search-User-Guide.html),” Cloudera, Inc., September 2015. +1. Lili Wu, Sam Shah, Sean Choi, et al.: “[The Browsemaps: Collaborative Filtering at LinkedIn](http://ceur-ws.org/Vol-1271/Paper3.pdf),” at *6th Workshop on Recommender Systems and the Social Web* (RSWeb), October 2014. +1. Roshan Sumbaly, Jay Kreps, Lei Gao, et al.: “[Serving Large-Scale Batch Computed Data with Project Voldemort](http://static.usenix.org/events/fast12/tech/full_papers/Sumbaly.pdf),” at *10th USENIX Conference on File and Storage Technologies* (FAST), February 2012. +1. Varun Sharma: “[Open-Sourcing Terrapin: A Serving System for Batch Generated Data](https://web.archive.org/web/20170215032514/https://engineering.pinterest.com/blog/open-sourcing-terrapin-serving-system-batch-generated-data-0),” *engineering.pinterest.com*, September 14, 2015. +1. Nathan Marz: “[ElephantDB](http://www.slideshare.net/nathanmarz/elephantdb),” *slideshare.net*, May 30, 2011. +1.
Jean-Daniel (JD) Cryans: “[How-to: Use HBase Bulk Loading, and Why](https://blog.cloudera.com/how-to-use-hbase-bulk-loading-and-why/),” *blog.cloudera.com*, September 27, 2013. +1. Nathan Marz: “[How to Beat the CAP Theorem](http://nathanmarz.com/blog/how-to-beat-the-cap-theorem.html),” *nathanmarz.com*, October 13, 2011. +1. Molly Bartlett Dishman and Martin Fowler: “[Agile Architecture](https://web.archive.org/web/20161130034721/http://conferences.oreilly.com/software-architecture/sa2015/public/schedule/detail/40388),” at *O'Reilly Software Architecture Conference*, March 2015. +1. David J. DeWitt and Jim N. Gray: “[Parallel Database Systems: The Future of High Performance Database Systems](http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/dewittgray92.pdf),” *Communications of the ACM*, volume 35, number 6, pages 85–98, June 1992. [doi:10.1145/129888.129894](http://dx.doi.org/10.1145/129888.129894) +1. Jay Kreps: “[But the multi-tenancy thing is actually really really hard](https://twitter.com/jaykreps/status/528235702480142336),” tweetstorm, *twitter.com*, October 31, 2014. +1. Jeffrey Cohen, Brian Dolan, Mark Dunlap, et al.: “[MAD Skills: New Analysis Practices for Big Data](http://www.vldb.org/pvldb/vol2/vldb09-219.pdf),” *Proceedings of the VLDB Endowment*, volume 2, number 2, pages 1481–1492, August 2009. [doi:10.14778/1687553.1687576](http://dx.doi.org/10.14778/1687553.1687576) +1. Ignacio Terrizzano, Peter Schwarz, Mary Roth, and John E. Colino: “[Data Wrangling: The Challenging Journey from the Wild to the Lake](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper2.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015. +1. 
Paige Roberts: “[To Schema on Read or to Schema on Write, That Is the Hadoop Data Lake Question](https://web.archive.org/web/20171105001306/http://adaptivesystemsinc.com/blog/to-schema-on-read-or-to-schema-on-write-that-is-the-hadoop-data-lake-question/),” *adaptivesystemsinc.com*, July 2, 2015. +1. Bobby Johnson and Joseph Adler: “[The Sushi Principle: Raw Data Is Better](https://web.archive.org/web/20161126104941/https://conferences.oreilly.com/strata/big-data-conference-ca-2015/public/schedule/detail/38737),” at *Strata+Hadoop World*, February 2015. +1. Vinod Kumar Vavilapalli, Arun C. Murthy, Chris Douglas, et al.: “[Apache Hadoop YARN: Yet Another Resource Negotiator](https://www.cs.cmu.edu/~garth/15719/papers/yarn.pdf),” at *4th ACM Symposium on Cloud Computing* (SoCC), October 2013. [doi:10.1145/2523616.2523633](http://dx.doi.org/10.1145/2523616.2523633) +1. Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, et al.: “[Large-Scale Cluster Management at Google with Borg](http://research.google.com/pubs/pub43438.html),” at *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741964](http://dx.doi.org/10.1145/2741948.2741964) +1. Malte Schwarzkopf: “[The Evolution of Cluster Scheduler Architectures](https://web.archive.org/web/20201109052657/http://www.firmament.io/blog/scheduler-architectures.html),” *firmament.io*, March 9, 2016. +1. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, et al.: “[Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final138.pdf),” at *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012. +1. Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia: *Learning Spark*. O'Reilly Media, 2015. ISBN: 978-1-449-35904-1 +1. 
Bikas Saha and Hitesh Shah: “[Apache Tez: Accelerating Hadoop Query Processing](http://www.slideshare.net/Hadoop_Summit/w-1205phall1saha),” at *Hadoop Summit*, June 2014. +1. Bikas Saha, Hitesh Shah, Siddharth Seth, et al.: “[Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications](http://home.cse.ust.hk/~weiwa/teaching/Fall15-COMP6611B/reading_list/Tez.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2742790](http://dx.doi.org/10.1145/2723372.2742790) +1. Kostas Tzoumas: “[Apache Flink: API, Runtime, and Project Roadmap](http://www.slideshare.net/KostasTzoumas/apache-flink-api-runtime-and-project-roadmap),” *slideshare.net*, January 14, 2015. +1. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, et al.: “[The Stratosphere Platform for Big Data Analytics](https://ssc.io/pdf/2014-VLDBJ_Stratosphere_Overview.pdf),” *The VLDB Journal*, volume 23, number 6, pages 939–964, May 2014. [doi:10.1007/s00778-014-0357-y](http://dx.doi.org/10.1007/s00778-014-0357-y) +1. Michael Isard, Mihai Budiu, Yuan Yu, et al.: “[Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks](https://www.microsoft.com/en-us/research/publication/dryad-distributed-data-parallel-programs-from-sequential-building-blocks/),” at *European Conference on Computer Systems* (EuroSys), March 2007. [doi:10.1145/1272996.1273005](http://dx.doi.org/10.1145/1272996.1273005) +1. Daniel Warneke and Odej Kao: “[Nephele: Efficient Parallel Data Processing in the Cloud](https://stratosphere2.dima.tu-berlin.de/assets/papers/Nephele_09.pdf),” at *2nd Workshop on Many-Task Computing on Grids and Supercomputers* (MTAGS), November 2009. [doi:10.1145/1646468.1646476](http://dx.doi.org/10.1145/1646468.1646476) +1. 
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd: “[The PageRank Citation Ranking: Bringing Order to the Web](https://web.archive.org/web/20230219170930/http://ilpubs.stanford.edu:8090/422/),” Stanford InfoLab Technical Report 422, 1999. +1. Leslie G. Valiant: “[A Bridging Model for Parallel Computation](http://dl.acm.org/citation.cfm?id=79181),” *Communications of the ACM*, volume 33, number 8, pages 103–111, August 1990. [doi:10.1145/79173.79181](http://dx.doi.org/10.1145/79173.79181) +1. Stephan Ewen, Kostas Tzoumas, Moritz Kaufmann, and Volker Markl: “[Spinning Fast Iterative Data Flows](http://vldb.org/pvldb/vol5/p1268_stephanewen_vldb2012.pdf),” *Proceedings of the VLDB Endowment*, volume 5, number 11, pages 1268–1279, July 2012. [doi:10.14778/2350229.2350245](http://dx.doi.org/10.14778/2350229.2350245) +1. Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, et al.: “[Pregel: A System for Large-Scale Graph Processing](https://kowshik.github.io/JPregel/pregel_paper.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2010. [doi:10.1145/1807167.1807184](http://dx.doi.org/10.1145/1807167.1807184) +1. Frank McSherry, Michael Isard, and Derek G. Murray: “[Scalability! But at What COST?](http://www.frankmcsherry.org/assets/COST.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015. +1. Ionel Gog, Malte Schwarzkopf, Natacha Crooks, et al.: “[Musketeer: All for One, One for All in Data Processing Systems](http://www.cl.cam.ac.uk/research/srg/netos/camsas/pubs/eurosys15-musketeer.pdf),” at *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741968](http://dx.doi.org/10.1145/2741948.2741968) +1.
Aapo Kyrola, Guy Blelloch, and Carlos Guestrin: “[GraphChi: Large-Scale Graph Computation on Just a PC](https://www.usenix.org/system/files/conference/osdi12/osdi12-final-126.pdf),” at *10th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2012. +1. Andrew Lenharth, Donald Nguyen, and Keshav Pingali: “[Parallel Graph Analytics](http://cacm.acm.org/magazines/2016/5/201591-parallel-graph-analytics/fulltext),” *Communications of the ACM*, volume 59, number 5, pages 78–87, May 2016. [doi:10.1145/2901919](http://dx.doi.org/10.1145/2901919) +1. Fabian Hüske: “[Peeking into Apache Flink's Engine Room](http://flink.apache.org/news/2015/03/13/peeking-into-Apache-Flinks-Engine-Room.html),” *flink.apache.org*, March 13, 2015. +1. Mostafa Mokhtar: “[Hive 0.14 Cost Based Optimizer (CBO) Technical Overview](https://web.archive.org/web/20170607112708/http://hortonworks.com/blog/hive-0-14-cost-based-optimizer-cbo-technical-overview/),” *hortonworks.com*, March 2, 2015. +1. Michael Armbrust, Reynold S Xin, Cheng Lian, et al.: “[Spark SQL: Relational Data Processing in Spark](http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2742797](http://dx.doi.org/10.1145/2723372.2742797) +1. Daniel Blazevski: “[Planting Quadtrees for Apache Flink](https://blog.insightdatascience.com/planting-quadtrees-for-apache-flink-b396ebc80d35),” *insightdataengineering.com*, March 25, 2016. +1. Tom White: “[Genome Analysis Toolkit: Now Using Apache Spark for Data Processing](https://web.archive.org/web/20190215132904/http://blog.cloudera.com/blog/2016/04/genome-analysis-toolkit-now-using-apache-spark-for-data-processing/),” *blog.cloudera.com*, April 6, 2016. diff --git a/content/en/ch12.md b/content/en/ch12.md index fb4a084..f147525 100644 --- a/content/en/ch12.md +++ b/content/en/ch12.md @@ -1,26 +1,31 @@ --- -title: "12. 
The Future of Data Systems" -linkTitle: "12. The Future of Data Systems" +title: "12. Stream Processing" weight: 312 breadcrumbs: false --- +> [!IMPORTANT] +> This chapter is from the 1st edition; the 2nd edition is not yet available. -![](/img/ch12.png) +![](/map/ch11.png) -> *If a thing be ordained to another as to its end, its last end cannot consist in the preservation of its being. Hence a captain does not intend as a last end, the preservation of the ship entrusted to him, since a ship is ordained to something else as its end, viz. to navigation.* +> *A complex system that works is invariably found to have evolved from a simple system that works. The inverse proposition also appears to be true: A complex system designed from scratch never works and cannot be made to work.* > -> *(Often quoted as: If the highest aim of a captain was the preserve his ship, he would keep it in port forever.)* -> -> ​ — St. Thomas Aquinas, *Summa Theologica* (1265–1274) +> ​ — John Gall, *Systemantics* (1975) --------------- -So far, this book has been mostly about describing things as they *are* at present. In this final chapter, we will shift our perspective toward the future and discuss how things *should be*: I will propose some ideas and approaches that, I believe, may funda‐ mentally improve the ways we design and build applications. +In [Chapter 10](/en/ch10) we discussed batch processing—techniques that read a set of files as input and produce a new set of output files. The output is a form of *derived data*; that is, a dataset that can be recreated by running the batch process again if necessary. We saw how this simple but powerful idea can be used to create search indexes, recommendation systems, analytics, and more. -Opinions and speculation about the future are of course subjective, and so I will use the first person in this chapter when writing about my personal opinions.
You are welcome to disagree with them and form your own opinions, but I hope that the ideas in this chapter will at least be a starting point for a productive discussion and bring some clarity to concepts that are often confused. +However, one big assumption remained throughout [Chapter 10](/en/ch10): namely, that the input is bounded—i.e., of a known and finite size—so the batch process knows when it has finished reading its input. For example, the sorting operation that is central to MapReduce must read its entire input before it can start producing output: it could happen that the very last input record is the one with the lowest key, and thus needs to be the very first output record, so starting the output early is not an option. -The goal of this book was outlined in [Chapter 1](/en/ch1): to explore how to create applications and systems that are *reliable*, *scalable*, and *maintainable*. These themes have run through all of the chapters: for example, we discussed many fault-tolerance algo‐ rithms that help improve reliability, partitioning to improve scalability, and mecha‐ nisms for evolution and abstraction that improve maintainability. In this chapter we will bring all of these ideas together, and build on them to envisage the future. Our goal is to discover how to design applications that are better than the ones of today— robust, correct, evolvable, and ultimately beneficial to humanity. +In reality, a lot of data is unbounded because it arrives gradually over time: your users produced data yesterday and today, and they will continue to produce more data tomorrow. Unless you go out of business, this process never ends, and so the dataset is never “complete” in any meaningful way [1]. Thus, batch processors must artificially divide the data into chunks of fixed duration: for example, processing a day’s worth of data at the end of every day, or processing an hour’s worth of data at the end of every hour.
+ +The problem with daily batch processes is that changes in the input are only reflected in the output a day later, which is too slow for many impatient users. To reduce the delay, we can run the processing more frequently—say, processing a second’s worth of data at the end of every second—or even continuously, abandoning the fixed time slices entirely and simply processing every event as it happens. That is the idea behind *stream processing*. + +In general, a “stream” refers to data that is incrementally made available over time. The concept appears in many places: in the stdin and stdout of Unix, programming languages (lazy lists) [2], filesystem APIs (such as Java’s `FileInputStream`), TCP connections, delivering audio and video over the internet, and so on. + +In this chapter we will look at *event streams* as a data management mechanism: the unbounded, incrementally processed counterpart to the batch data we saw in the last chapter. We will first discuss how streams are represented, stored, and transmitted over a network. In “[Databases and Streams](#databases-and-streams)” we will investigate the relationship between streams and databases. And finally, in “[Processing Streams](#processing-streams)” we will explore approaches and tools for processing those streams continually, and ways that they can be used to build applications. ## …… @@ -29,137 +34,146 @@ The goal of this book was outlined in [Chapter 1](/en/ch1): to explore how to cr ## Summary -In this chapter we discussed new approaches to designing data systems, and I included my personal opinions and speculations about the future. We started with the observation that there is no one single tool that can efficiently serve all possible use cases, and so applications necessarily need to compose several different pieces of software to accomplish their goals.
We discussed how to solve this *data integration* problem by using batch processing and event streams to let data changes flow between different systems. +In this chapter we have discussed event streams, what purposes they serve, and how to process them. In some ways, stream processing is very much like the batch processing we discussed in [Chapter 10](/en/ch10), but done continuously on unbounded (never-ending) streams rather than on a fixed-size input. From this perspective, message brokers and event logs serve as the streaming equivalent of a filesystem. -In this approach, certain systems are designated as systems of record, and other data is derived from them through transformations. In this way we can maintain indexes, materialized views, machine learning models, statistical summaries, and more. By making these derivations and transformations asynchronous and loosely coupled, a problem in one area is prevented from spreading to unrelated parts of the system, increasing the robustness and fault-tolerance of the system as a whole. +We spent some time comparing two types of message brokers: -Expressing dataflows as transformations from one dataset to another also helps evolve applications: if you want to change one of the processing steps, for example to change the structure of an index or cache, you can just rerun the new transformation code on the whole input dataset in order to rederive the output. Similarly, if some‐ thing goes wrong, you can fix the code and reprocess the data in order to recover. +***AMQP/JMS-style message broker*** -These processes are quite similar to what databases already do internally, so we recast the idea of dataflow applications as *unbundling* the components of a database, and building an application by composing these loosely coupled components. +The broker assigns individual messages to consumers, and consumers acknowledge individual messages when they have been successfully processed.
Messages are deleted from the broker once they have been acknowledged. This approach is appropriate as an asynchronous form of RPC (see also “[Message-Passing Dataflow]()”), for example in a task queue, where the exact order of message processing is not important and where there is no need to go back and read old messages again after they have been processed. -Derived state can be updated by observing changes in the underlying data. Moreover, the derived state itself can further be observed by downstream consumers. We can even take this dataflow all the way through to the end-user device that is displaying the data, and thus build user interfaces that dynamically update to reflect data changes and continue to work offline. +***Log-based message broker*** -Next, we discussed how to ensure that all of this processing remains correct in the presence of faults. We saw that strong integrity guarantees can be implemented scala‐ bly with asynchronous event processing, by using end-to-end operation identifiers to make operations idempotent and by checking constraints asynchronously. Clients can either wait until the check has passed, or go ahead without waiting but risk hav‐ ing to apologize about a constraint violation. This approach is much more scalable and robust than the traditional approach of using distributed transactions, and fits with how many business processes work in practice. +The broker assigns all messages in a partition to the same consumer node, and always delivers messages in the same order. Parallelism is achieved through partitioning, and consumers track their progress by checkpointing the offset of the last message they have processed. The broker retains messages on disk, so it is possible to jump back and reread old messages if necessary. + +The log-based approach has similarities to the replication logs found in databases (see [Chapter 5](/en/ch5)) and log-structured storage engines (see [Chapter 3](/en/ch3)).
We saw that this approach is especially appropriate for stream processing systems that consume input streams and generate derived state or derived output streams. + +In terms of where streams come from, we discussed several possibilities: user activity events, sensors providing periodic readings, and data feeds (e.g., market data in finance) are naturally represented as streams. We saw that it can also be useful to think of the writes to a database as a stream: we can capture the changelog—i.e., the history of all changes made to a database—either implicitly through change data capture or explicitly through event sourcing. Log compaction allows the stream to retain a full copy of the contents of a database. + +Representing databases as streams opens up powerful opportunities for integrating systems. You can keep derived data systems such as search indexes, caches, and analytics systems continually up to date by consuming the log of changes and applying them to the derived system. You can even build fresh views onto existing data by starting from scratch and consuming the log of changes from the beginning all the way to the present. + +The facilities for maintaining state as streams and replaying messages are also the basis for the techniques that enable stream joins and fault tolerance in various stream processing frameworks. We discussed several purposes of stream processing, including searching for event patterns (complex event processing), computing windowed aggregations (stream analytics), and keeping derived data systems up to date (materialized views). + +We then discussed the difficulties of reasoning about time in a stream processor, including the distinction between processing time and event timestamps, and the problem of dealing with straggler events that arrive after you thought your window was complete.
+ +We distinguished three types of joins that may appear in stream processes: + +***Stream-stream joins*** + +Both input streams consist of activity events, and the join operator searches for related events that occur within some window of time. For example, it may match two actions taken by the same user within 30 minutes of each other. The two join inputs may in fact be the same stream (a *self-join*) if you want to find related events within that one stream. + +***Stream-table joins*** + +One input stream consists of activity events, while the other is a database changelog. The changelog keeps a local copy of the database up to date. For each activity event, the join operator queries the database and outputs an enriched activity event. + +***Table-table joins*** + +Both input streams are database changelogs. In this case, every change on one side is joined with the latest state of the other side. The result is a stream of changes to the materialized view of the join between the two tables. + +Finally, we discussed techniques for achieving fault tolerance and exactly-once semantics in a stream processor. As with batch processing, we need to discard the partial output of any failed tasks. However, since a stream process is long-running and produces output continuously, we can’t simply discard all output. Instead, a finer-grained recovery mechanism can be used, based on microbatching, checkpointing, transactions, or idempotent writes. -By structuring applications around dataflow and checking constraints asynchro‐ nously, we can avoid most coordination and create systems that maintain integrity but still perform well, even in geographically distributed scenarios and in the pres‐ ence of faults. We then talked a little about using audits to verify the integrity of data and detect corruption. -Finally, we took a step back and examined some ethical aspects of building data- intensive applications.
We saw that although data can be used to do good, it can also do significant harm: making justifying decisions that seriously affect people’s lives and are difficult to appeal against, leading to discrimination and exploitation, nor‐ malizing surveillance, and exposing intimate information. We also run the risk of data breaches, and we may find that a well-intentioned use of data has unintended consequences. -As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal. ## References -1. Rachid Belaid: “[Postgres Full-Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/),” *rachbelaid.com*, July 13, 2015. -1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015. -1. Pat Helland and Dave Campbell: “[Building on Quicksand](https://web.archive.org/web/20220606172817/https://database.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf),” at *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009. -1. Jessica Kerr: “[Provenance and Causality in Distributed Systems](https://web.archive.org/web/20190425150540/http://blog.jessitron.com/2016/09/provenance-and-causality-in-distributed.html),” *blog.jessitron.com*, September 25, 2016. -1. Kostas Tzoumas: “[Batch Is a Special Case of Streaming](http://data-artisans.com/blog/batch-is-a-special-case-of-streaming/),” *data-artisans.com*, September 15, 2015. -1. 
Shinji Kim and Robert Blafford: “[Stream Windowing Performance Analysis: Concord and Spark Streaming](https://web.archive.org/web/20180125074821/http://concord.io/posts/windowing_performance_analysis_w_spark_streaming),” *concord.io*, July 6, 2016. -1. Jay Kreps: “[The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying),” *engineering.linkedin.com*, December 16, 2013. -1. Pat Helland: “[Life Beyond Distributed Transactions: An Apostate’s Opinion](https://web.archive.org/web/20200730171311/http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p15.pdf),” at *3rd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2007. -1. “[Great Western Railway (1835–1948)](https://web.archive.org/web/20160122155425/https://www.networkrail.co.uk/VirtualArchive/great-western/),” Network Rail Virtual Archive, *networkrail.co.uk*. -1. Jacqueline Xu: “[Online Migrations at Scale](https://stripe.com/blog/online-migrations),” *stripe.com*, February 2, 2017. -1. Molly Bartlett Dishman and Martin Fowler: “[Agile Architecture](https://web.archive.org/web/20161130034721/http://conferences.oreilly.com/software-architecture/sa2015/public/schedule/detail/40388),” at *O'Reilly Software Architecture Conference*, March 2015. -1. Nathan Marz and James Warren: [*Big Data: Principles and Best Practices of Scalable Real-Time Data Systems*](https://www.manning.com/books/big-data). Manning, 2015. ISBN: 978-1-617-29034-3 -1. Oscar Boykin, Sam Ritchie, Ian O'Connell, and Jimmy Lin: “[Summingbird: A Framework for Integrating Batch and Online MapReduce Computations](http://www.vldb.org/pvldb/vol7/p1441-boykin.pdf),” at *40th International Conference on Very Large Data Bases* (VLDB), September 2014. -1. 
Jay Kreps: “[Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture),” *oreilly.com*, July 2, 2014. -1. Raul Castro Fernandez, Peter Pietzuch, Jay Kreps, et al.: “[Liquid: Unifying Nearline and Offline Big Data Integration](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper25u.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015. -1. Dennis M. Ritchie and Ken Thompson: “[The UNIX Time-Sharing System](http://web.eecs.utk.edu/~qcao1/cs560/papers/paper-unix.pdf),” *Communications of the ACM*, volume 17, number 7, pages 365–375, July 1974. [doi:10.1145/361011.361061](http://dx.doi.org/10.1145/361011.361061) -1. Eric A. Brewer and Joseph M. Hellerstein: “[CS262a: Advanced Topics in Computer Systems](http://people.eecs.berkeley.edu/~brewer/cs262/systemr.html),” lecture notes, University of California, Berkeley, *cs.berkeley.edu*, August 2011. -1. Michael Stonebraker: “[The Case for Polystores](http://wp.sigmod.org/?p=1629),” *wp.sigmod.org*, July 13, 2015. -1. Jennie Duggan, Aaron J. Elmore, Michael Stonebraker, et al.: “[The BigDAWG Polystore System](https://dspace.mit.edu/handle/1721.1/100936),” *ACM SIGMOD Record*, volume 44, number 2, pages 11–16, June 2015. [doi:10.1145/2814710.2814713](http://dx.doi.org/10.1145/2814710.2814713) -1. Patrycja Dybka: “[Foreign Data Wrappers for PostgreSQL](https://web.archive.org/web/20221003115732/https://www.vertabelo.com/blog/foreign-data-wrappers-for-postgresql/),” *vertabelo.com*, March 24, 2015. -1. David B. Lomet, Alan Fekete, Gerhard Weikum, and Mike Zwilling: “[Unbundling Transaction Services in the Cloud](https://www.microsoft.com/en-us/research/publication/unbundling-transaction-services-in-the-cloud/),” at *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009. -1. 
Martin Kleppmann and Jay Kreps: “[Kafka, Samza and the Unix Philosophy of Distributed Data](http://martin.kleppmann.com/papers/kafka-debull15.pdf),” *IEEE Data Engineering Bulletin*, volume 38, number 4, pages 4–14, December 2015. -1. John Hugg: “[Winning Now and in the Future: Where VoltDB Shines](https://voltdb.com/blog/winning-now-and-future-where-voltdb-shines),” *voltdb.com*, March 23, 2016. -1. Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard: “[Differential Dataflow](http://cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf),” at *6th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2013. -1. Derek G Murray, Frank McSherry, Rebecca Isaacs, et al.: “[Naiad: A Timely Dataflow System](http://sigops.org/s/conferences/sosp/2013/papers/p439-murray.pdf),” at *24th ACM Symposium on Operating Systems Principles* (SOSP), pages 439–455, November 2013. [doi:10.1145/2517349.2522738](http://dx.doi.org/10.1145/2517349.2522738) -1. Gwen Shapira: “[We have a bunch of customers who are implementing ‘database inside-out’ concept and they all ask ‘is anyone else doing it? are we crazy?’](https://twitter.com/gwenshap/status/758800071110430720)” *twitter.com*, July 28, 2016. -1. Martin Kleppmann: “[Turning the Database Inside-out with Apache Samza,](http://martin.kleppmann.com/2015/03/04/turning-the-database-inside-out.html)” at *Strange Loop*, September 2014. -1. Peter Van Roy and Seif Haridi: [*Concepts, Techniques, and Models of Computer Programming*](https://www.info.ucl.ac.be/~pvr/book.html). MIT Press, 2004. ISBN: 978-0-262-22069-9 -1. “[Juttle Documentation](http://juttle.github.io/juttle/),” *juttle.github.io*, 2016. -1. Evan Czaplicki and Stephen Chong: “[Asynchronous Functional Reactive Programming for GUIs](http://people.seas.harvard.edu/~chong/pubs/pldi13-elm.pdf),” at *34th ACM SIGPLAN Conference on Programming Language Design and Implementation* (PLDI), June 2013. 
[doi:10.1145/2491956.2462161](http://dx.doi.org/10.1145/2491956.2462161) -1. Engineer Bainomugisha, Andoni Lombide Carreton, Tom van Cutsem, Stijn Mostinckx, and Wolfgang de Meuter: “[A Survey on Reactive Programming](http://soft.vub.ac.be/Publications/2012/vub-soft-tr-12-13.pdf),” *ACM Computing Surveys*, volume 45, number 4, pages 1–34, August 2013. [doi:10.1145/2501654.2501666](http://dx.doi.org/10.1145/2501654.2501666) -1. Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and William R. Marczak: “[Consistency Analysis in Bloom: A CALM and Collected Approach](https://dsf.berkeley.edu/cs286/papers/calm-cidr2011.pdf),” at *5th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2011. -1. Felienne Hermans: “[Spreadsheets Are Code](https://vimeo.com/145492419),” at *Code Mesh*, November 2015. -1. Dan Bricklin and Bob Frankston: “[VisiCalc: Information from Its Creators](http://danbricklin.com/visicalc.htm),” *danbricklin.com*. -1. D. Sculley, Gary Holt, Daniel Golovin, et al.: “[Machine Learning: The High-Interest Credit Card of Technical Debt](http://research.google.com/pubs/pub43146.html),” at *NIPS Workshop on Software Engineering for Machine Learning* (SE4ML), December 2014. -1. Peter Bailis, Alan Fekete, Michael J Franklin, et al.: “[Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2737784](http://dx.doi.org/10.1145/2723372.2737784) -1. Guy Steele: “[Re: Need for Macros (Was Re: Icon)](https://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01134.html),” email to *ll1-discuss* mailing list, *people.csail.mit.edu*, December 24, 2001. -1. 
David Gelernter: “[Generative Communication in Linda](http://cseweb.ucsd.edu/groups/csag/html/teaching/cse291s03/Readings/p80-gelernter.pdf),” *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 7, number 1, pages 80–112, January 1985. [doi:10.1145/2363.2433](http://dx.doi.org/10.1145/2363.2433) +1. Tyler Akidau, Robert Bradshaw, Craig Chambers, et al.: “[The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 12, pages 1792–1803, August 2015. [doi:10.14778/2824032.2824076](http://dx.doi.org/10.14778/2824032.2824076) +1. Harold Abelson, Gerald Jay Sussman, and Julie Sussman: [*Structure and Interpretation of Computer Programs*](https://web.archive.org/web/20220807043536/https://mitpress.mit.edu/sites/default/files/sicp/index.html), 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, available online at *mitpress.mit.edu* 1. Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “[The Many Faces of Publish/Subscribe](http://www.cs.ru.nl/~pieter/oss/manyfaces.pdf),” *ACM Computing Surveys*, volume 35, number 2, pages 114–131, June 2003. [doi:10.1145/857076.857078](http://dx.doi.org/10.1145/857076.857078) -1. Ben Stopford: “[Microservices in a Streaming World](https://www.infoq.com/presentations/microservices-streaming),” at *QCon London*, March 2016. -1. Christian Posta: “[Why Microservices Should Be Event Driven: Autonomy vs Authority](http://blog.christianposta.com/microservices/why-microservices-should-be-event-driven-autonomy-vs-authority/),” *blog.christianposta.com*, May 27, 2016. -1. Alex Feyerke: “[Say Hello to Offline First](https://web.archive.org/web/20210420014747/http://hood.ie/blog/say-hello-to-offline-first.html),” *hood.ie*, November 5, 2013. -1. 
Sebastian Burckhardt, Daan Leijen, Jonathan Protzenko, and Manuel Fähndrich: “[Global Sequence Protocol: A Robust Abstraction for Replicated Shared State](http://drops.dagstuhl.de/opus/volltexte/2015/5238/),” at *29th European Conference on Object-Oriented Programming* (ECOOP), July 2015. [doi:10.4230/LIPIcs.ECOOP.2015.568](http://dx.doi.org/10.4230/LIPIcs.ECOOP.2015.568) -1. Mark Soper: “[Clearing Up React Data Management Confusion with Flux, Redux, and Relay](https://medium.com/@marksoper/clearing-up-react-data-management-confusion-with-flux-redux-and-relay-aad504e63cae),” *medium.com*, December 3, 2015. -1. Eno Thereska, Damian Guy, Michael Noll, and Neha Narkhede: “[Unifying Stream Processing and Interactive Queries in Apache Kafka](http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/),” *confluent.io*, October 26, 2016. -1. Frank McSherry: “[Dataflow as Database](https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-17.md),” *github.com*, July 17, 2016. -1. Peter Alvaro: “[I See What You Mean](https://www.youtube.com/watch?v=R2Aa4PivG0g),” at *Strange Loop*, September 2015. -1. Nathan Marz: “[Trident: A High-Level Abstraction for Realtime Computation](https://blog.twitter.com/2012/trident-a-high-level-abstraction-for-realtime-computation),” *blog.twitter.com*, August 2, 2012. -1. Edi Bice: “[Low Latency Web Scale Fraud Prevention with Apache Samza, Kafka and Friends](http://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends),” at *Merchant Risk Council MRC Vegas Conference*, March 2016. -1. Charity Majors: “[The Accidental DBA](https://charity.wtf/2016/10/02/the-accidental-dba/),” *charity.wtf*, October 2, 2016. -1. Arthur J. Bernstein, Philip M. 
Lewis, and Shiyong Lu: “[Semantic Conditions for Correctness at Different Isolation Levels](http://db.cs.berkeley.edu/cs286/papers/isolation-icde2000.pdf),” at *16th International Conference on Data Engineering* (ICDE), February 2000. [doi:10.1109/ICDE.2000.839387](http://dx.doi.org/10.1109/ICDE.2000.839387) -1. Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: “[Automating the Detection of Snapshot Isolation Anomalies](http://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf),” at *33rd International Conference on Very Large Data Bases* (VLDB), September 2007. -1. Kyle Kingsbury: [Jepsen blog post series](https://aphyr.com/tags/jepsen), *aphyr.com*, 2013–2016. -1. Michael Jouravlev: “[Redirect After Post](http://www.theserverside.com/news/1365146/Redirect-After-Post),” *theserverside.com*, August 1, 2004. -1. Jerome H. Saltzer, David P. Reed, and David D. Clark: “[End-to-End Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf),” *ACM Transactions on Computer Systems*, volume 2, number 4, pages 277–288, November 1984. [doi:10.1145/357401.357402](http://dx.doi.org/10.1145/357401.357402) -1. Peter Bailis, Alan Fekete, Michael J. Franklin, et al.: “[Coordination-Avoiding Database Systems](http://arxiv.org/pdf/1402.2237.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 3, pages 185–196, November 2014. -1. Alex Yarmula: “[Strong Consistency in Manhattan](https://blog.twitter.com/2016/strong-consistency-in-manhattan),” *blog.twitter.com*, March 17, 2016. -1. Douglas B Terry, Marvin M Theimer, Karin Petersen, et al.: “[Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](http://css.csail.mit.edu/6.824/2014/papers/bayou-conflicts.pdf),” at *15th ACM Symposium on Operating Systems Principles* (SOSP), pages 172–182, December 1995. [doi:10.1145/224056.224070](http://dx.doi.org/10.1145/224056.224070) -1. 
Jim Gray: “[The Transaction Concept: Virtues and Limitations](http://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf),” at *7th International Conference on Very Large Data Bases* (VLDB), September 1981. -1. Hector Garcia-Molina and Kenneth Salem: “[Sagas](http://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), May 1987. [doi:10.1145/38713.38742](http://dx.doi.org/10.1145/38713.38742) -1. Pat Helland: “[Memories, Guesses, and Apologies](https://web.archive.org/web/20160304020907/http://blogs.msdn.com/b/pathelland/archive/2007/05/15/memories-guesses-and-apologies.aspx),” *blogs.msdn.com*, May 15, 2007. -1. Yoongu Kim, Ross Daly, Jeremie Kim, et al.: “[Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf),” at *41st Annual International Symposium on Computer Architecture* (ISCA), June 2014. [doi:10.1145/2678373.2665726](http://dx.doi.org/10.1145/2678373.2665726) -1. Mark Seaborn and Thomas Dullien: “[Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges](https://googleprojectzero.blogspot.co.uk/2015/03/exploiting-dram-rowhammer-bug-to-gain.html),” *googleprojectzero.blogspot.co.uk*, March 9, 2015. -1. Jim N. Gray and Catharine van Ingen: “[Empirical Measurements of Disk Failure Rates and Error Rates](https://www.microsoft.com/en-us/research/publication/empirical-measurements-of-disk-failure-rates-and-error-rates/),” Microsoft Research, MSR-TR-2005-166, December 2005. -1. Annamalai Gurusami and Daniel Price: “[Bug #73170: Duplicates in Unique Secondary Index Because of Fix of Bug#68021](http://bugs.mysql.com/bug.php?id=73170),” *bugs.mysql.com*, July 2014. -1. Gary Fredericks: “[Postgres Serializability Bug](https://github.com/gfredericks/pg-serializability-bug),” *github.com*, September 2015. -1. 
Xiao Chen: “[HDFS DataNode Scanners and Disk Checker Explained](http://blog.cloudera.com/blog/2016/12/hdfs-datanode-scanners-and-disk-checker-explained/),” *blog.cloudera.com*, December 20, 2016. -1. Jay Kreps: “[Getting Real About Distributed System Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability),” *blog.empathybox.com*, March 19, 2012. -1. Martin Fowler: “[The LMAX Architecture](http://martinfowler.com/articles/lmax.html),” *martinfowler.com*, July 12, 2011. -1. Sam Stokes: “[Move Fast with Confidence](http://blog.samstokes.co.uk/blog/2016/07/11/move-fast-with-confidence/),” *blog.samstokes.co.uk*, July 11, 2016. -1. “[Hyperledger Sawtooth documentation](https://web.archive.org/web/20220120211548/https://sawtooth.hyperledger.org/docs/core/releases/latest/introduction.html),” Intel Corporation, *sawtooth.hyperledger.org*, 2017. -1. Richard Gendal Brown: “[Introducing R3 Corda™: A Distributed Ledger Designed for Financial Services](https://gendal.me/2016/04/05/introducing-r3-corda-a-distributed-ledger-designed-for-financial-services/),” *gendal.me*, April 5, 2016. -1. Trent McConaghy, Rodolphe Marques, Andreas Müller, et al.: “[BigchainDB: A Scalable Blockchain Database](https://www.bigchaindb.com/whitepaper/bigchaindb-whitepaper.pdf),” *bigchaindb.com*, June 8, 2016. -1. Ralph C. Merkle: “[A Digital Signature Based on a Conventional Encryption Function](https://people.eecs.berkeley.edu/~raluca/cs261-f15/readings/merkle.pdf),” at *CRYPTO '87*, August 1987. [doi:10.1007/3-540-48184-2_32](http://dx.doi.org/10.1007/3-540-48184-2_32) -1. Ben Laurie: “[Certificate Transparency](http://queue.acm.org/detail.cfm?id=2668154),” *ACM Queue*, volume 12, number 8, pages 10-19, August 2014. [doi:10.1145/2668152.2668154](http://dx.doi.org/10.1145/2668152.2668154) -1. Mark D. 
Ryan: “[Enhanced Certificate Transparency and End-to-End Encrypted Mail](https://www.ndss-symposium.org/wp-content/uploads/2017/09/12_2_1.pdf),” at *Network and Distributed System Security Symposium* (NDSS), February 2014. [doi:10.14722/ndss.2014.23379](http://dx.doi.org/10.14722/ndss.2014.23379) -1. “[ACM Code of Ethics and Professional Conduct](https://www.acm.org/code-of-ethics),” Association for Computing Machinery, *acm.org*, 2018. -1. François Chollet: “[Software development is starting to involve important ethical choices](https://twitter.com/fchollet/status/792958695722201088),” *twitter.com*, October 30, 2016. -1. Igor Perisic: “[Making Hard Choices: The Quest for Ethics in Machine Learning](https://engineering.linkedin.com/blog/2016/11/making-hard-choices--the-quest-for-ethics-in-machine-learning),” *engineering.linkedin.com*, November 2016. -1. John Naughton: “[Algorithm Writers Need a Code of Conduct](https://www.theguardian.com/commentisfree/2015/dec/06/algorithm-writers-should-have-code-of-conduct),” *theguardian.com*, December 6, 2015. -1. Logan Kugler: “[What Happens When Big Data Blunders?](http://cacm.acm.org/magazines/2016/6/202655-what-happens-when-big-data-blunders/fulltext),” *Communications of the ACM*, volume 59, number 6, pages 15–16, June 2016. [doi:10.1145/2911975](http://dx.doi.org/10.1145/2911975) -1. Bill Davidow: “[Welcome to Algorithmic Prison](http://www.theatlantic.com/technology/archive/2014/02/welcome-to-algorithmic-prison/283985/),” *theatlantic.com*, February 20, 2014. -1. Don Peck: “[They're Watching You at Work](http://www.theatlantic.com/magazine/archive/2013/12/theyre-watching-you-at-work/354681/),” *theatlantic.com*, December 2013. -1. Leigh Alexander: “[Is an Algorithm Any Less Racist Than a Human?](https://www.theguardian.com/technology/2016/aug/03/algorithm-racist-human-employers-work)” *theguardian.com*, August 3, 2016. -1. 
Jesse Emspak: “[How a Machine Learns Prejudice](https://www.scientificamerican.com/article/how-a-machine-learns-prejudice/),” *scientificamerican.com*, December 29, 2016. -1. Maciej Cegłowski: “[The Moral Economy of Tech](http://idlewords.com/talks/sase_panel.htm),” *idlewords.com*, June 2016. -1. Cathy O'Neil: [*Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*](https://web.archive.org/web/20210621234447/https://weaponsofmathdestructionbook.com/). Crown Publishing, 2016. ISBN: 978-0-553-41881-1 -1. Julia Angwin: “[Make Algorithms Accountable](http://www.nytimes.com/2016/08/01/opinion/make-algorithms-accountable.html),” *nytimes.com*, August 1, 2016. -1. Bryce Goodman and Seth Flaxman: “[European Union Regulations on Algorithmic Decision-Making and a ‘Right to Explanation’](https://arxiv.org/abs/1606.08813),” *arXiv:1606.08813*, August 31, 2016. -1. “[A Review of the Data Broker Industry: Collection, Use, and Sale of Consumer Data for Marketing Purposes](https://web.archive.org/web/20240619042302/http://educationnewyork.com/files/rockefeller_databroker.pdf),” Staff Report, *United States Senate Committee on Commerce, Science, and Transportation*, *commerce.senate.gov*, December 2013. -1. Olivia Solon: “[Facebook’s Failure: Did Fake News and Polarized Politics Get Trump Elected?](https://www.theguardian.com/technology/2016/nov/10/facebook-fake-news-election-conspiracy-theories)” *theguardian.com*, November 10, 2016. -1. Donella H. Meadows and Diana Wright: *Thinking in Systems: A Primer*. Chelsea Green Publishing, 2008. ISBN: 978-1-603-58055-7 -1. Daniel J. Bernstein: “[Listening to a ‘big data’/‘data science’ talk](https://twitter.com/hashbreaker/status/598076230437568512),” *twitter.com*, May 12, 2015. -1. Marc Andreessen: “[Why Software Is Eating the World](http://genius.com/Marc-andreessen-why-software-is-eating-the-world-annotated),” *The Wall Street Journal*, 20 August 2011. -1. J. M. 
Porup: “[‘Internet of Things’ Security Is Hilariously Broken and Getting Worse](http://arstechnica.com/security/2016/01/how-to-search-the-internet-of-things-for-photos-of-sleeping-babies/),” *arstechnica.com*, January 23, 2016. -1. Bruce Schneier: [*Data and Goliath: The Hidden Battles to Collect Your Data and Control Your World*](https://www.schneier.com/books/data_and_goliath/). W. W. Norton, 2015. ISBN: 978-0-393-35217-7 -1. The Grugq: “[Nothing to Hide](https://grugq.tumblr.com/post/142799983558/nothing-to-hide),” *grugq.tumblr.com*, April 15, 2016. -1. Tony Beltramelli: “[Deep-Spying: Spying Using Smartwatch and Deep Learning](https://arxiv.org/abs/1512.05616),” Masters Thesis, IT University of Copenhagen, December 2015. Available at *arxiv.org/abs/1512.05616* -1. Shoshana Zuboff: “[Big Other: Surveillance Capitalism and the Prospects of an Information Civilization](http://papers.ssrn.com/sol3/papers.cfm?abstract_id=2594754),” *Journal of Information Technology*, volume 30, number 1, pages 75–89, April 2015. [doi:10.1057/jit.2015.5](http://dx.doi.org/10.1057/jit.2015.5) -1. Carina C. Zona: “[Consequences of an Insightful Algorithm](https://www.youtube.com/watch?v=YRI40A4tyWU),” at *GOTO Berlin*, November 2016. -1. Bruce Schneier: “[Data Is a Toxic Asset, So Why Not Throw It Out?](https://www.schneier.com/essays/archives/2016/03/data_is_a_toxic_asse.html),” *schneier.com*, March 1, 2016. -1. John E. Dunn: “[The UK’s 15 Most Infamous Data Breaches](https://web.archive.org/web/20161120070058/http://www.techworld.com/security/uks-most-infamous-data-breaches-2016-3604586/),” *techworld.com*, November 18, 2016. -1. Cory Scott: “[Data is not toxic - which implies no benefit - but rather hazardous material, where we must balance need vs. want](https://twitter.com/cory_scott/status/706586399483437056),” *twitter.com*, March 6, 2016. -1. 
Bruce Schneier: “[Mission Creep: When Everything Is Terrorism](https://www.schneier.com/essays/archives/2013/07/mission_creep_when_e.html),” *schneier.com*, July 16, 2013. -1. Lena Ulbricht and Maximilian von Grafenstein: “[Big Data: Big Power Shifts?](http://policyreview.info/articles/analysis/big-data-big-power-shifts),” *Internet Policy Review*, volume 5, number 1, March 2016. [doi:10.14763/2016.1.406](http://dx.doi.org/10.14763/2016.1.406) -1. Ellen P. Goodman and Julia Powles: “[Facebook and Google: Most Powerful and Secretive Empires We've Ever Known](https://www.theguardian.com/technology/2016/sep/28/google-facebook-powerful-secretive-empire-transparency),” *theguardian.com*, September 28, 2016. -1. [Directive 95/46/EC on the protection of individuals with regard to the processing of personal data and on the free movement of such data](http://eur-lex.europa.eu/legal-content/EN/TXT/?uri=CELEX:31995L0046), Official Journal of the European Communities No. L 281/31, *eur-lex.europa.eu*, November 1995. -1. Brendan Van Alsenoy: “[Regulating Data Protection: The Allocation of Responsibility and Risk Among Actors Involved in Personal Data Processing](https://lirias.kuleuven.be/handle/123456789/545027),” Thesis, KU Leuven Centre for IT and IP Law, August 2016. -1. Michiel Rhoen: “[Beyond Consent: Improving Data Protection Through Consumer Protection Law](http://policyreview.info/articles/analysis/beyond-consent-improving-data-protection-through-consumer-protection-law),” *Internet Policy Review*, volume 5, number 1, March 2016. [doi:10.14763/2016.1.404](http://dx.doi.org/10.14763/2016.1.404) -1. Jessica Leber: “[Your Data Footprint Is Affecting Your Life in Ways You Can’t Even Imagine](https://www.fastcoexist.com/3057514/your-data-footprint-is-affecting-your-life-in-ways-you-cant-even-imagine),” *fastcoexist.com*, March 15, 2016. -1. Maciej Cegłowski: “[Haunted by Data](http://idlewords.com/talks/haunted_by_data.htm),” *idlewords.com*, October 2015. -1. 
Sam Thielman: “[You Are Not What You Read: Librarians Purge User Data to Protect Privacy](https://www.theguardian.com/us-news/2016/jan/13/us-library-records-purged-data-privacy),” *theguardian.com*, January 13, 2016. -1. Conor Friedersdorf: “[Edward Snowden’s Other Motive for Leaking](http://www.theatlantic.com/politics/archive/2014/05/edward-snowdens-other-motive-for-leaking/370068/),” *theatlantic.com*, May 13, 2014. -1. Phillip Rogaway: “[The Moral Character of Cryptographic Work](http://web.cs.ucdavis.edu/~rogaway/papers/moral-fn.pdf),” Cryptology ePrint 2015/1162, December 2015. +1. Joseph M. Hellerstein and Michael Stonebraker: [*Readings in Database Systems*](http://redbook.cs.berkeley.edu/), 4th edition. MIT Press, 2005. ISBN: 978-0-262-69314-1, available online at *redbook.cs.berkeley.edu* +1. Don Carney, Uğur Çetintemel, Mitch Cherniack, et al.: “[Monitoring Streams – A New Class of Data Management Applications](http://www.vldb.org/conf/2002/S07P02.pdf),” at *28th International Conference on Very Large Data Bases* (VLDB), August 2002. +1. Matthew Sackman: “[Pushing Back](https://wellquite.org/posts/lshift/pushing_back/),” *lshift.net*, May 5, 2016. +1. Vicent Martí: “[Brubeck, a statsd-Compatible Metrics Aggregator](http://githubengineering.com/brubeck/),” *githubengineering.com*, June 15, 2015. +1. Seth Lowenberger: “[MoldUDP64 Protocol Specification V 1.00](http://www.nasdaqtrader.com/content/technicalsupport/specifications/dataproducts/moldudp64.pdf),” *nasdaqtrader.com*, July 2009. +1. Pieter Hintjens: [*ZeroMQ – The Guide*](http://zguide.zeromq.org/page:all). O'Reilly Media, 2013. ISBN: 978-1-449-33404-8 +1. Ian Malpass: “[Measure Anything, Measure Everything](https://codeascraft.com/2011/02/15/measure-anything-measure-everything/),” *codeascraft.com*, February 15, 2011. +1. Dieter Plaetinck: “[25 Graphite, Grafana and statsd Gotchas](https://grafana.com/blog/2016/03/03/25-graphite-grafana-and-statsd-gotchas/),” *grafana.com*, March 3, 2016. +1. 
Jeff Lindsay: “[Web Hooks to Revolutionize the Web](https://web.archive.org/web/20180928201955/http://progrium.com/blog/2007/05/03/web-hooks-to-revolutionize-the-web/),” *progrium.com*, May 3, 2007. +1. Jim N. Gray: “[Queues Are Databases](https://arxiv.org/pdf/cs/0701158.pdf),” Microsoft Research Technical Report MSR-TR-95-56, December 1995. +1. Mark Hapner, Rich Burridge, Rahul Sharma, et al.: “[JSR-343 Java Message Service (JMS) 2.0 Specification](https://jcp.org/en/jsr/detail?id=343),” *jms-spec.java.net*, March 2013. +1. Sanjay Aiyagari, Matthew Arrott, Mark Atwell, et al.: “[AMQP: Advanced Message Queuing Protocol Specification](http://www.rabbitmq.com/resources/specs/amqp0-9-1.pdf),” Version 0-9-1, November 2008. +1. “[Google Cloud Pub/Sub: A Google-Scale Messaging Service](https://cloud.google.com/pubsub/architecture),” *cloud.google.com*, 2016. +1. “[Apache Kafka 0.9 Documentation](http://kafka.apache.org/documentation.html),” *kafka.apache.org*, November 2015. +1. Jay Kreps, Neha Narkhede, and Jun Rao: “[Kafka: A Distributed Messaging System for Log Processing](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/09/Kafka.pdf),” at *6th International Workshop on Networking Meets Databases* (NetDB), June 2011. +1. “[Amazon Kinesis Streams Developer Guide](http://docs.aws.amazon.com/streams/latest/dev/introduction.html),” *docs.aws.amazon.com*, April 2016. +1. Leigh Stewart and Sijie Guo: “[Building DistributedLog: Twitter’s High-Performance Replicated Log Service](https://blog.twitter.com/2015/building-distributedlog-twitter-s-high-performance-replicated-log-service),” *blog.twitter.com*, September 16, 2015. +1. “[DistributedLog Documentation](https://web.archive.org/web/20210517201308/https://bookkeeper.apache.org/distributedlog/docs/latest/),” Apache Software Foundation, *distributedlog.io*. +1. 
Jay Kreps: “[Benchmarking Apache Kafka: 2 Million Writes Per Second (On Three Cheap Machines)](https://engineering.linkedin.com/kafka/benchmarking-apache-kafka-2-million-writes-second-three-cheap-machines),” *engineering.linkedin.com*, April 27, 2014. +1. Kartik Paramasivam: “[How We’re Improving and Advancing Kafka at LinkedIn](https://engineering.linkedin.com/apache-kafka/how-we_re-improving-and-advancing-kafka-linkedin),” *engineering.linkedin.com*, September 2, 2015. +1. Jay Kreps: “[The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying),” *engineering.linkedin.com*, December 16, 2013. +1. Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “[All Aboard the Databus!](http://www.socc2012.org/s18-das.pdf),” at *3rd ACM Symposium on Cloud Computing* (SoCC), October 2012. +1. Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, et al.: “[Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf),” at *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015. +1. P. P. S. Narayan: “[Sherpa Update](http://web.archive.org/web/20160801221400/https://developer.yahoo.com/blogs/ydn/sherpa-7992.html),” *developer.yahoo.com*, June 8, . +1. Martin Kleppmann: “[Bottled Water: Real-Time Integration of PostgreSQL and Kafka](http://martin.kleppmann.com/2015/04/23/bottled-water-real-time-postgresql-kafka.html),” *martin.kleppmann.com*, April 23, 2015. +1. Ben Osheroff: “[Introducing Maxwell, a mysql-to-kafka Binlog Processor](https://web.archive.org/web/20170208100334/https://developer.zendesk.com/blog/introducing-maxwell-a-mysql-to-kafka-binlog-processor),” *developer.zendesk.com*, August 20, 2015. +1. 
Randall Hauch: “[Debezium 0.2.1 Released](https://debezium.io/blog/2016/06/10/Debezium-0.2.1-Released/),” *debezium.io*, June 10, 2016. +1. Prem Santosh Udaya Shankar: “[Streaming MySQL Tables in Real-Time to Kafka](https://engineeringblog.yelp.com/2016/08/streaming-mysql-tables-in-real-time-to-kafka.html),” *engineeringblog.yelp.com*, August 1, 2016. +1. “[Mongoriver](https://github.com/stripe/mongoriver),” Stripe, Inc., *github.com*, September 2014. +1. Dan Harvey: “[Change Data Capture with Mongo + Kafka](http://www.slideshare.net/danharvey/change-data-capture-with-mongodb-and-kafka),” at *Hadoop Users Group UK*, August 2015. +1. “[Oracle GoldenGate 12c: Real-Time Access to Real-Time Information](https://web.archive.org/web/20160923105841/http://www.oracle.com/us/products/middleware/data-integration/oracle-goldengate-realtime-access-2031152.pdf),” Oracle White Paper, March 2015. +1. “[Oracle GoldenGate Fundamentals: How Oracle GoldenGate Works](https://www.youtube.com/watch?v=6H9NibIiPQE),” Oracle Corporation, *youtube.com*, November 2012. +1. Slava Akhmechet: “[Advancing the Realtime Web](http://rethinkdb.com/blog/realtime-web/),” *rethinkdb.com*, January 27, 2015. +1. “[Firebase Realtime Database Documentation](https://firebase.google.com/docs/database/),” Google, Inc., *firebase.google.com*, May 2016. +1. “[Apache CouchDB 1.6 Documentation](http://docs.couchdb.org/en/latest/),” *docs.couchdb.org*, 2014. +1. Matt DeBergalis: “[Meteor 0.7.0: Scalable Database Queries Using MongoDB Oplog Instead of Poll-and-Diff](https://web.archive.org/web/20160324055429/http://info.meteor.com/blog/meteor-070-scalable-database-queries-using-mongodb-oplog-instead-of-poll-and-diff),” *info.meteor.com*, December 17, 2013. +1. “[Chapter 15. Importing and Exporting Live Data](https://docs.voltdb.com/UsingVoltDB/ChapExport.php),” VoltDB 6.4 User Manual, *docs.voltdb.com*, June 2016. +1. 
Neha Narkhede: “[Announcing Kafka Connect: Building Large-Scale Low-Latency Data Pipelines](http://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines),” *confluent.io*, February 18, 2016. +1. Greg Young: “[CQRS and Event Sourcing](https://www.youtube.com/watch?v=JHGkaShoyNs),” at *Code on the Beach*, August 2014. +1. Martin Fowler: “[Event Sourcing](http://martinfowler.com/eaaDev/EventSourcing.html),” *martinfowler.com*, December 12, 2005. +1. Vaughn Vernon: [*Implementing Domain-Driven Design*](https://www.informit.com/store/implementing-domain-driven-design-9780321834577). Addison-Wesley Professional, 2013. ISBN: 978-0-321-83457-7 +1. H. V. Jagadish, Inderpal Singh Mumick, and Abraham Silberschatz: “[View Maintenance Issues for the Chronicle Data Model](https://dl.acm.org/doi/10.1145/212433.220201),” at *14th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems* (PODS), May 1995. [doi:10.1145/212433.220201](http://dx.doi.org/10.1145/212433.220201) +1. “[Event Store 3.5.0 Documentation](http://docs.geteventstore.com/),” Event Store LLP, *docs.geteventstore.com*, February 2016. +1. Martin Kleppmann: [*Making Sense of Stream Processing*](http://www.oreilly.com/data/free/stream-processing.csp). Report, O'Reilly Media, May 2016. +1. Sander Mak: “[Event-Sourced Architectures with Akka](http://www.slideshare.net/SanderMak/eventsourced-architectures-with-akka),” at *JavaOne*, September 2014. +1. Julian Hyde: [personal communication](https://twitter.com/julianhyde/status/743374145006641153), June 2016. +1. Ashish Gupta and Inderpal Singh Mumick: *Materialized Views: Techniques, Implementations, and Applications*. MIT Press, 1999. ISBN: 978-0-262-57122-7 +1. Timothy Griffin and Leonid Libkin: “[Incremental Maintenance of Views with Duplicates](http://homepages.inf.ed.ac.uk/libkin/papers/sigmod95.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), May 1995. 
[doi:10.1145/223784.223849](http://dx.doi.org/10.1145/223784.223849) +1. Pat Helland: “[Immutability Changes Everything](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper16.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015. +1. Martin Kleppmann: “[Accounting for Computer Scientists](http://martin.kleppmann.com/2011/03/07/accounting-for-computer-scientists.html),” *martin.kleppmann.com*, March 7, 2011. +1. Pat Helland: “[Accountants Don't Use Erasers](https://web.archive.org/web/20200220161036/https://blogs.msdn.microsoft.com/pathelland/2007/06/14/accountants-dont-use-erasers/),” *blogs.msdn.com*, June 14, 2007. +1. Fangjin Yang: “[Dogfooding with Druid, Samza, and Kafka: Metametrics at Metamarkets](https://metamarkets.com/2015/dogfooding-with-druid-samza-and-kafka-metametrics-at-metamarkets/),” *metamarkets.com*, June 3, 2015. +1. Gavin Li, Jianqiu Lv, and Hang Qi: “[Pistachio: Co-Locate the Data and Compute for Fastest Cloud Compute](https://web.archive.org/web/20181214032620/https://yahoohadoop.tumblr.com/post/116365275781/pistachio-co-locate-the-data-and-compute-for),” *yahoohadoop.tumblr.com*, April 13, 2015. +1. Kartik Paramasivam: “[Stream Processing Hard Problems – Part 1: Killing Lambda](https://engineering.linkedin.com/blog/2016/06/stream-processing-hard-problems-part-1-killing-lambda),” *engineering.linkedin.com*, June 27, 2016. +1. Martin Fowler: “[CQRS](http://martinfowler.com/bliki/CQRS.html),” *martinfowler.com*, July 14, 2011. +1. Greg Young: “[CQRS Documents](https://cqrs.files.wordpress.com/2010/11/cqrs_documents.pdf),” *cqrs.files.wordpress.com*, November 2010. +1. Baron Schwartz: “[Immutability, MVCC, and Garbage Collection](https://web.archive.org/web/20161110094746/http://www.xaprb.com/blog/2013/12/28/immutability-mvcc-and-garbage-collection/),” *xaprb.com*, December 28, 2013. +1. 
Daniel Eloff, Slava Akhmechet, Jay Kreps, et al.: “[Re: Turning the Database Inside-out with Apache Samza](https://news.ycombinator.com/item?id=9145197),” Hacker News discussion, *news.ycombinator.com*, March 4, 2015. +1. “[Datomic Development Resources: Excision](http://docs.datomic.com/excision.html),” Cognitect, Inc., *docs.datomic.com*. +1. “[Fossil Documentation: Deleting Content from Fossil](http://fossil-scm.org/index.html/doc/trunk/www/shunning.wiki),” *fossil-scm.org*, 2016. +1. Jay Kreps: “[The irony of distributed systems is that data loss is really easy but deleting data is surprisingly hard,](https://twitter.com/jaykreps/status/582580836425330688)” *twitter.com*, March 30, 2015. +1. David C. Luckham: “[What’s the Difference Between ESP and CEP?](http://www.complexevents.com/2006/08/01/what%E2%80%99s-the-difference-between-esp-and-cep/),” *complexevents.com*, August 1, 2006. +1. Srinath Perera: “[How Is Stream Processing and Complex Event Processing (CEP) Different?](https://www.quora.com/How-is-stream-processing-and-complex-event-processing-CEP-different),” *quora.com*, December 3, 2015. +1. Arvind Arasu, Shivnath Babu, and Jennifer Widom: “[The CQL Continuous Query Language: Semantic Foundations and Query Execution](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/cql.pdf),” *The VLDB Journal*, volume 15, number 2, pages 121–142, June 2006. [doi:10.1007/s00778-004-0147-z](http://dx.doi.org/10.1007/s00778-004-0147-z) +1. Julian Hyde: “[Data in Flight: How Streaming SQL Technology Can Help Solve the Web 2.0 Data Crunch](http://queue.acm.org/detail.cfm?id=1667562),” *ACM Queue*, volume 7, number 11, December 2009. [doi:10.1145/1661785.1667562](http://dx.doi.org/10.1145/1661785.1667562) +1. “[Esper Reference, Version 5.4.0](http://esper.espertech.com/release-5.4.0/esper-reference/html_single/index.html),” EsperTech, Inc., *espertech.com*, April 2016. +1.
Zubair Nabi, Eric Bouillet, Andrew Bainbridge, and Chris Thomas: “[Of Streams and Storms](https://web.archive.org/web/20170711081434/https://developer.ibm.com/streamsdev/wp-content/uploads/sites/15/2014/04/Streams-and-Storm-April-2014-Final.pdf),” IBM technical report, *developer.ibm.com*, April 2014. +1. Milinda Pathirage, Julian Hyde, Yi Pan, and Beth Plale: “[SamzaSQL: Scalable Fast Data Management with Streaming SQL](https://github.com/milinda/samzasql-hpbdc2016/blob/master/samzasql-hpbdc2016.pdf),” at *IEEE International Workshop on High-Performance Big Data Computing* (HPBDC), May 2016. [doi:10.1109/IPDPSW.2016.141](http://dx.doi.org/10.1109/IPDPSW.2016.141) +1. Philippe Flajolet, Éric Fusy, Olivier Gandouet, and Frédéric Meunier: “[HyperLogLog: The Analysis of a Near-Optimal Cardinality Estimation Algorithm](http://algo.inria.fr/flajolet/Publications/FlFuGaMe07.pdf),” at *Conference on Analysis of Algorithms* (AofA), June 2007. +1. Jay Kreps: “[Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture),” *oreilly.com*, July 2, 2014. +1. Ian Hellström: “[An Overview of Apache Streaming Technologies](https://databaseline.bitbucket.io/an-overview-of-apache-streaming-technologies/),” *databaseline.bitbucket.io*, March 12, 2016. +1. Jay Kreps: “[Why Local State Is a Fundamental Primitive in Stream Processing](https://www.oreilly.com/ideas/why-local-state-is-a-fundamental-primitive-in-stream-processing),” *oreilly.com*, July 31, 2014. +1. Shay Banon: “[Percolator](https://www.elastic.co/blog/percolator),” *elastic.co*, February 8, 2011. +1. Alan Woodward and Martin Kleppmann: “[Real-Time Full-Text Search with Luwak and Samza](http://martin.kleppmann.com/2015/04/13/real-time-full-text-search-luwak-samza.html),” *martin.kleppmann.com*, April 13, 2015. +1. “[Apache Storm 2.1.0 Documentation](https://storm.apache.org/releases/2.1.0/index.html),” *storm.apache.org*, October 2019. +1. 
Tyler Akidau: “[The World Beyond Batch: Streaming 102](https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102),” *oreilly.com*, January 20, 2016. +1. Stephan Ewen: “[Streaming Analytics with Apache Flink](https://www.confluent.io/resources/kafka-summit-2016/advanced-streaming-analytics-apache-flink-apache-kafka/),” at *Kafka Summit*, April 2016. +1. Tyler Akidau, Alex Balikov, Kaya Bekiroğlu, et al.: “[MillWheel: Fault-Tolerant Stream Processing at Internet Scale](http://research.google.com/pubs/pub41378.html),” at *39th International Conference on Very Large Data Bases* (VLDB), August 2013. +1. Alex Dean: “[Improving Snowplow's Understanding of Time](https://snowplow.io/blog/improving-snowplows-understanding-of-time/),” *snowplowanalytics.com*, September 15, 2015. +1. “[Windowing (Azure Stream Analytics)](https://msdn.microsoft.com/en-us/library/azure/dn835019.aspx),” Microsoft Azure Reference, *msdn.microsoft.com*, April 2016. +1. “[State Management](http://samza.apache.org/learn/documentation/0.10/container/state-management.html),” Apache Samza 0.10 Documentation, *samza.apache.org*, December 2015. +1. Rajagopal Ananthanarayanan, Venkatesh Basker, Sumit Das, et al.: “[Photon: Fault-Tolerant and Scalable Joining of Continuous Data Streams](http://research.google.com/pubs/pub41318.html),” at *ACM International Conference on Management of Data* (SIGMOD), June 2013. [doi:10.1145/2463676.2465272](http://dx.doi.org/10.1145/2463676.2465272) +1. Martin Kleppmann: “[Samza Newsfeed Demo](https://github.com/ept/newsfeed),” *github.com*, September 2014. +1. Ben Kirwin: “[Doing the Impossible: Exactly-Once Messaging Patterns in Kafka](http://ben.kirw.in/2014/11/28/kafka-patterns/),” *ben.kirw.in*, November 28, 2014. +1. Pat Helland: “[Data on the Outside Versus Data on the Inside](http://cidrdb.org/cidr2005/papers/P12.pdf),” at *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005. +1. 
Ralph Kimball and Margy Ross: *The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*, 3rd edition. John Wiley & Sons, 2013. ISBN: 978-1-118-53080-1 +1. Viktor Klang: “[I'm coining the phrase 'effectively-once' for message processing with at-least-once + idempotent operations](https://twitter.com/viktorklang/status/789036133434978304),” *twitter.com*, October 20, 2016. +1. Matei Zaharia, Tathagata Das, Haoyuan Li, et al.: “[Discretized Streams: An Efficient and Fault-Tolerant Model for Stream Processing on Large Clusters](https://www.usenix.org/system/files/conference/hotcloud12/hotcloud12-final28.pdf),” at *4th USENIX Conference in Hot Topics in Cloud Computing* (HotCloud), June 2012. +1. Kostas Tzoumas, Stephan Ewen, and Robert Metzger: “[High-Throughput, Low-Latency, and Exactly-Once Stream Processing with Apache Flink](https://www.ververica.com/blog/high-throughput-low-latency-and-exactly-once-stream-processing-with-apache-flink),” *ververica.com*, August 5, 2015. +1. Paris Carbone, Gyula Fóra, Stephan Ewen, et al.: “[Lightweight Asynchronous Snapshots for Distributed Dataflows](http://arxiv.org/abs/1506.08603),” arXiv:1506.08603 [cs.DC], June 29, 2015. +1. Ryan Betts and John Hugg: [*Fast Data: Smart and at Scale*](http://www.oreilly.com/data/free/fast-data-smart-and-at-scale.csp). Report, O'Reilly Media, October 2015. +1. Flavio Junqueira: “[Making Sense of Exactly-Once Semantics](https://web.archive.org/web/20160812172900/http://conferences.oreilly.com/strata/hadoop-big-data-eu/public/schedule/detail/49690),” at *Strata+Hadoop World London*, June 2016. +1. Jason Gustafson, Flavio Junqueira, Apurva Mehta, Sriram Subramanian, and Guozhang Wang: “[KIP-98 – Exactly Once Delivery and Transactional Messaging](https://cwiki.apache.org/confluence/display/KAFKA/KIP-98+-+Exactly+Once+Delivery+and+Transactional+Messaging),” *cwiki.apache.org*, November 2016. +1. 
Pat Helland: “[Idempotence Is Not a Medical Condition](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=4b6dda7fe75b51e1c543a87ca7b3b322fbf55614),” *Communications of the ACM*, volume 55, number 5, page 56, May 2012. [doi:10.1145/2160718.2160734](http://dx.doi.org/10.1145/2160718.2160734) +1. Jay Kreps: “[Re: Trying to Achieve Deterministic Behavior on Recovery/Rewind](http://mail-archives.apache.org/mod_mbox/samza-dev/201409.mbox/%3CCAOeJiJg%2Bc7Ei%3DgzCuOz30DD3G5Hm9yFY%3DUJ6SafdNUFbvRgorg%40mail.gmail.com%3E),” email to *samza-dev* mailing list, September 9, 2014. +1. E. N. (Mootaz) Elnozahy, Lorenzo Alvisi, Yi-Min Wang, and David B. Johnson: “[A Survey of Rollback-Recovery Protocols in Message-Passing Systems](http://www.cs.utexas.edu/~lorenzo/papers/SurveyFinal.pdf),” *ACM Computing Surveys*, volume 34, number 3, pages 375–408, September 2002. [doi:10.1145/568522.568525](http://dx.doi.org/10.1145/568522.568525) +1. Adam Warski: “[Kafka Streams – How Does It Fit the Stream Processing Landscape?](https://softwaremill.com/kafka-streams-how-does-it-fit-stream-landscape/),” *softwaremill.com*, June 1, 2016. diff --git a/content/en/ch13.md b/content/en/ch13.md new file mode 100644 index 0000000..60c1cf8 --- /dev/null +++ b/content/en/ch13.md @@ -0,0 +1,166 @@ +--- +title: "13. Do the Right Thing" +weight: 313 +breadcrumbs: false +--- + +> [!IMPORTANT] +> This chapter is from the 1st edition; the 2nd edition is not available yet. + +![](/map/ch12.png) + +> *If a thing be ordained to another as to its end, its last end cannot consist in the preservation of its being. Hence a captain does not intend as a last end, the preservation of the ship entrusted to him, since a ship is ordained to something else as its end, viz. to navigation.* +> +> *(Often quoted as: If the highest aim of a captain was to preserve his ship, he would keep it in port forever.)* +> +> — St.
Thomas Aquinas, *Summa Theologica* (1265–1274) + +--------------- + +So far, this book has been mostly about describing things as they *are* at present. In this final chapter, we will shift our perspective toward the future and discuss how things *should be*: I will propose some ideas and approaches that, I believe, may fundamentally improve the ways we design and build applications. + +Opinions and speculation about the future are of course subjective, and so I will use the first person in this chapter when writing about my personal opinions. You are welcome to disagree with them and form your own opinions, but I hope that the ideas in this chapter will at least be a starting point for a productive discussion and bring some clarity to concepts that are often confused. + +The goal of this book was outlined in [Chapter 1](/en/ch1): to explore how to create applications and systems that are *reliable*, *scalable*, and *maintainable*. These themes have run through all of the chapters: for example, we discussed many fault-tolerance algorithms that help improve reliability, partitioning to improve scalability, and mechanisms for evolution and abstraction that improve maintainability. In this chapter we will bring all of these ideas together, and build on them to envisage the future. Our goal is to discover how to design applications that are better than the ones of today: robust, correct, evolvable, and ultimately beneficial to humanity. + + +## …… + + + +## Summary + +In this chapter we discussed new approaches to designing data systems, and I included my personal opinions and speculations about the future. We started with the observation that there is no one single tool that can efficiently serve all possible use cases, and so applications necessarily need to compose several different pieces of software to accomplish their goals.
We discussed how to solve this *data integration* problem by using batch processing and event streams to let data changes flow between different systems. + +In this approach, certain systems are designated as systems of record, and other data is derived from them through transformations. In this way we can maintain indexes, materialized views, machine learning models, statistical summaries, and more. By making these derivations and transformations asynchronous and loosely coupled, a problem in one area is prevented from spreading to unrelated parts of the system, increasing the robustness and fault-tolerance of the system as a whole. + +Expressing dataflows as transformations from one dataset to another also helps evolve applications: if you want to change one of the processing steps, for example to change the structure of an index or cache, you can just rerun the new transformation code on the whole input dataset in order to rederive the output. Similarly, if something goes wrong, you can fix the code and reprocess the data in order to recover. + +These processes are quite similar to what databases already do internally, so we recast the idea of dataflow applications as *unbundling* the components of a database, and building an application by composing these loosely coupled components. + +Derived state can be updated by observing changes in the underlying data. Moreover, the derived state itself can further be observed by downstream consumers. We can even take this dataflow all the way through to the end-user device that is displaying the data, and thus build user interfaces that dynamically update to reflect data changes and continue to work offline. + +Next, we discussed how to ensure that all of this processing remains correct in the presence of faults.
We saw that strong integrity guarantees can be implemented scalably with asynchronous event processing, by using end-to-end operation identifiers to make operations idempotent and by checking constraints asynchronously. Clients can either wait until the check has passed, or go ahead without waiting but risk having to apologize about a constraint violation. This approach is much more scalable and robust than the traditional approach of using distributed transactions, and fits with how many business processes work in practice. + +By structuring applications around dataflow and checking constraints asynchronously, we can avoid most coordination and create systems that maintain integrity but still perform well, even in geographically distributed scenarios and in the presence of faults. We then talked a little about using audits to verify the integrity of data and detect corruption. + +Finally, we took a step back and examined some ethical aspects of building data-intensive applications. We saw that although data can be used to do good, it can also do significant harm: justifying decisions that seriously affect people’s lives and are difficult to appeal against, leading to discrimination and exploitation, normalizing surveillance, and exposing intimate information. We also run the risk of data breaches, and we may find that a well-intentioned use of data has unintended consequences. + +As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal. + +## References + +1. Rachid Belaid: “[Postgres Full-Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/),” *rachbelaid.com*, July 13, 2015. +1.
Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015. +1. Pat Helland and Dave Campbell: “[Building on Quicksand](https://web.archive.org/web/20220606172817/https://database.cs.wisc.edu/cidr/cidr2009/Paper_133.pdf),” at *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009. +1. Jessica Kerr: “[Provenance and Causality in Distributed Systems](https://web.archive.org/web/20190425150540/http://blog.jessitron.com/2016/09/provenance-and-causality-in-distributed.html),” *blog.jessitron.com*, September 25, 2016. +1. Kostas Tzoumas: “[Batch Is a Special Case of Streaming](http://data-artisans.com/blog/batch-is-a-special-case-of-streaming/),” *data-artisans.com*, September 15, 2015. +1. Shinji Kim and Robert Blafford: “[Stream Windowing Performance Analysis: Concord and Spark Streaming](https://web.archive.org/web/20180125074821/http://concord.io/posts/windowing_performance_analysis_w_spark_streaming),” *concord.io*, July 6, 2016. +1. Jay Kreps: “[The Log: What Every Software Engineer Should Know About Real-Time Data's Unifying Abstraction](http://engineering.linkedin.com/distributed-systems/log-what-every-software-engineer-should-know-about-real-time-datas-unifying),” *engineering.linkedin.com*, December 16, 2013. +1. Pat Helland: “[Life Beyond Distributed Transactions: An Apostate’s Opinion](https://web.archive.org/web/20200730171311/http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p15.pdf),” at *3rd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2007. +1. “[Great Western Railway (1835–1948)](https://web.archive.org/web/20160122155425/https://www.networkrail.co.uk/VirtualArchive/great-western/),” Network Rail Virtual Archive, *networkrail.co.uk*. +1. 
Jacqueline Xu: “[Online Migrations at Scale](https://stripe.com/blog/online-migrations),” *stripe.com*, February 2, 2017. +1. Molly Bartlett Dishman and Martin Fowler: “[Agile Architecture](https://web.archive.org/web/20161130034721/http://conferences.oreilly.com/software-architecture/sa2015/public/schedule/detail/40388),” at *O'Reilly Software Architecture Conference*, March 2015. +1. Nathan Marz and James Warren: [*Big Data: Principles and Best Practices of Scalable Real-Time Data Systems*](https://www.manning.com/books/big-data). Manning, 2015. ISBN: 978-1-617-29034-3 +1. Oscar Boykin, Sam Ritchie, Ian O'Connell, and Jimmy Lin: “[Summingbird: A Framework for Integrating Batch and Online MapReduce Computations](http://www.vldb.org/pvldb/vol7/p1441-boykin.pdf),” at *40th International Conference on Very Large Data Bases* (VLDB), September 2014. +1. Jay Kreps: “[Questioning the Lambda Architecture](https://www.oreilly.com/ideas/questioning-the-lambda-architecture),” *oreilly.com*, July 2, 2014. +1. Raul Castro Fernandez, Peter Pietzuch, Jay Kreps, et al.: “[Liquid: Unifying Nearline and Offline Big Data Integration](http://cidrdb.org/cidr2015/Papers/CIDR15_Paper25u.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015. +1. Dennis M. Ritchie and Ken Thompson: “[The UNIX Time-Sharing System](http://web.eecs.utk.edu/~qcao1/cs560/papers/paper-unix.pdf),” *Communications of the ACM*, volume 17, number 7, pages 365–375, July 1974. [doi:10.1145/361011.361061](http://dx.doi.org/10.1145/361011.361061) +1. Eric A. Brewer and Joseph M. Hellerstein: “[CS262a: Advanced Topics in Computer Systems](http://people.eecs.berkeley.edu/~brewer/cs262/systemr.html),” lecture notes, University of California, Berkeley, *cs.berkeley.edu*, August 2011. +1. Michael Stonebraker: “[The Case for Polystores](http://wp.sigmod.org/?p=1629),” *wp.sigmod.org*, July 13, 2015. +1. Jennie Duggan, Aaron J. 
Elmore, Michael Stonebraker, et al.: “[The BigDAWG Polystore System](https://dspace.mit.edu/handle/1721.1/100936),” *ACM SIGMOD Record*, volume 44, number 2, pages 11–16, June 2015. [doi:10.1145/2814710.2814713](http://dx.doi.org/10.1145/2814710.2814713) +1. Patrycja Dybka: “[Foreign Data Wrappers for PostgreSQL](https://web.archive.org/web/20221003115732/https://www.vertabelo.com/blog/foreign-data-wrappers-for-postgresql/),” *vertabelo.com*, March 24, 2015. +1. David B. Lomet, Alan Fekete, Gerhard Weikum, and Mike Zwilling: “[Unbundling Transaction Services in the Cloud](https://www.microsoft.com/en-us/research/publication/unbundling-transaction-services-in-the-cloud/),” at *4th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2009. +1. Martin Kleppmann and Jay Kreps: “[Kafka, Samza and the Unix Philosophy of Distributed Data](http://martin.kleppmann.com/papers/kafka-debull15.pdf),” *IEEE Data Engineering Bulletin*, volume 38, number 4, pages 4–14, December 2015. +1. John Hugg: “[Winning Now and in the Future: Where VoltDB Shines](https://voltdb.com/blog/winning-now-and-future-where-voltdb-shines),” *voltdb.com*, March 23, 2016. +1. Frank McSherry, Derek G. Murray, Rebecca Isaacs, and Michael Isard: “[Differential Dataflow](http://cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf),” at *6th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2013. +1. Derek G Murray, Frank McSherry, Rebecca Isaacs, et al.: “[Naiad: A Timely Dataflow System](http://sigops.org/s/conferences/sosp/2013/papers/p439-murray.pdf),” at *24th ACM Symposium on Operating Systems Principles* (SOSP), pages 439–455, November 2013. [doi:10.1145/2517349.2522738](http://dx.doi.org/10.1145/2517349.2522738) +1. Gwen Shapira: “[We have a bunch of customers who are implementing ‘database inside-out’ concept and they all ask ‘is anyone else doing it? are we crazy?’](https://twitter.com/gwenshap/status/758800071110430720)” *twitter.com*, July 28, 2016. +1. 
Martin Kleppmann: “[Turning the Database Inside-out with Apache Samza,](http://martin.kleppmann.com/2015/03/04/turning-the-database-inside-out.html)” at *Strange Loop*, September 2014. +1. Peter Van Roy and Seif Haridi: [*Concepts, Techniques, and Models of Computer Programming*](https://www.info.ucl.ac.be/~pvr/book.html). MIT Press, 2004. ISBN: 978-0-262-22069-9 +1. “[Juttle Documentation](http://juttle.github.io/juttle/),” *juttle.github.io*, 2016. +1. Evan Czaplicki and Stephen Chong: “[Asynchronous Functional Reactive Programming for GUIs](http://people.seas.harvard.edu/~chong/pubs/pldi13-elm.pdf),” at *34th ACM SIGPLAN Conference on Programming Language Design and Implementation* (PLDI), June 2013. [doi:10.1145/2491956.2462161](http://dx.doi.org/10.1145/2491956.2462161) +1. Engineer Bainomugisha, Andoni Lombide Carreton, Tom van Cutsem, Stijn Mostinckx, and Wolfgang de Meuter: “[A Survey on Reactive Programming](http://soft.vub.ac.be/Publications/2012/vub-soft-tr-12-13.pdf),” *ACM Computing Surveys*, volume 45, number 4, pages 1–34, August 2013. [doi:10.1145/2501654.2501666](http://dx.doi.org/10.1145/2501654.2501666) +1. Peter Alvaro, Neil Conway, Joseph M. Hellerstein, and William R. Marczak: “[Consistency Analysis in Bloom: A CALM and Collected Approach](https://dsf.berkeley.edu/cs286/papers/calm-cidr2011.pdf),” at *5th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2011. +1. Felienne Hermans: “[Spreadsheets Are Code](https://vimeo.com/145492419),” at *Code Mesh*, November 2015. +1. Dan Bricklin and Bob Frankston: “[VisiCalc: Information from Its Creators](http://danbricklin.com/visicalc.htm),” *danbricklin.com*. +1. D. Sculley, Gary Holt, Daniel Golovin, et al.: “[Machine Learning: The High-Interest Credit Card of Technical Debt](http://research.google.com/pubs/pub43146.html),” at *NIPS Workshop on Software Engineering for Machine Learning* (SE4ML), December 2014. +1. 
Peter Bailis, Alan Fekete, Michael J Franklin, et al.: “[Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2737784](http://dx.doi.org/10.1145/2723372.2737784) +1. Guy Steele: “[Re: Need for Macros (Was Re: Icon)](https://people.csail.mit.edu/gregs/ll1-discuss-archive-html/msg01134.html),” email to *ll1-discuss* mailing list, *people.csail.mit.edu*, December 24, 2001. +1. David Gelernter: “[Generative Communication in Linda](http://cseweb.ucsd.edu/groups/csag/html/teaching/cse291s03/Readings/p80-gelernter.pdf),” *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 7, number 1, pages 80–112, January 1985. [doi:10.1145/2363.2433](http://dx.doi.org/10.1145/2363.2433) +1. Patrick Th. Eugster, Pascal A. Felber, Rachid Guerraoui, and Anne-Marie Kermarrec: “[The Many Faces of Publish/Subscribe](http://www.cs.ru.nl/~pieter/oss/manyfaces.pdf),” *ACM Computing Surveys*, volume 35, number 2, pages 114–131, June 2003. [doi:10.1145/857076.857078](http://dx.doi.org/10.1145/857076.857078) +1. Ben Stopford: “[Microservices in a Streaming World](https://www.infoq.com/presentations/microservices-streaming),” at *QCon London*, March 2016. +1. Christian Posta: “[Why Microservices Should Be Event Driven: Autonomy vs Authority](http://blog.christianposta.com/microservices/why-microservices-should-be-event-driven-autonomy-vs-authority/),” *blog.christianposta.com*, May 27, 2016. +1. Alex Feyerke: “[Say Hello to Offline First](https://web.archive.org/web/20210420014747/http://hood.ie/blog/say-hello-to-offline-first.html),” *hood.ie*, November 5, 2013. +1. 
Sebastian Burckhardt, Daan Leijen, Jonathan Protzenko, and Manuel Fähndrich: “[Global Sequence Protocol: A Robust Abstraction for Replicated Shared State](http://drops.dagstuhl.de/opus/volltexte/2015/5238/),” at *29th European Conference on Object-Oriented Programming* (ECOOP), July 2015. [doi:10.4230/LIPIcs.ECOOP.2015.568](http://dx.doi.org/10.4230/LIPIcs.ECOOP.2015.568) +1. Mark Soper: “[Clearing Up React Data Management Confusion with Flux, Redux, and Relay](https://medium.com/@marksoper/clearing-up-react-data-management-confusion-with-flux-redux-and-relay-aad504e63cae),” *medium.com*, December 3, 2015. +1. Eno Thereska, Damian Guy, Michael Noll, and Neha Narkhede: “[Unifying Stream Processing and Interactive Queries in Apache Kafka](http://www.confluent.io/blog/unifying-stream-processing-and-interactive-queries-in-apache-kafka/),” *confluent.io*, October 26, 2016. +1. Frank McSherry: “[Dataflow as Database](https://github.com/frankmcsherry/blog/blob/master/posts/2016-07-17.md),” *github.com*, July 17, 2016. +1. Peter Alvaro: “[I See What You Mean](https://www.youtube.com/watch?v=R2Aa4PivG0g),” at *Strange Loop*, September 2015. +1. Nathan Marz: “[Trident: A High-Level Abstraction for Realtime Computation](https://blog.twitter.com/2012/trident-a-high-level-abstraction-for-realtime-computation),” *blog.twitter.com*, August 2, 2012. +1. Edi Bice: “[Low Latency Web Scale Fraud Prevention with Apache Samza, Kafka and Friends](http://www.slideshare.net/edibice/extremely-low-latency-web-scale-fraud-prevention-with-apache-samza-kafka-and-friends),” at *Merchant Risk Council MRC Vegas Conference*, March 2016. +1. Charity Majors: “[The Accidental DBA](https://charity.wtf/2016/10/02/the-accidental-dba/),” *charity.wtf*, October 2, 2016. +1. Arthur J. Bernstein, Philip M. 
Lewis, and Shiyong Lu: “[Semantic Conditions for Correctness at Different Isolation Levels](http://db.cs.berkeley.edu/cs286/papers/isolation-icde2000.pdf),” at *16th International Conference on Data Engineering* (ICDE), February 2000. [doi:10.1109/ICDE.2000.839387](http://dx.doi.org/10.1109/ICDE.2000.839387) +1. Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan: “[Automating the Detection of Snapshot Isolation Anomalies](http://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf),” at *33rd International Conference on Very Large Data Bases* (VLDB), September 2007. +1. Kyle Kingsbury: [Jepsen blog post series](https://aphyr.com/tags/jepsen), *aphyr.com*, 2013–2016. +1. Michael Jouravlev: “[Redirect After Post](http://www.theserverside.com/news/1365146/Redirect-After-Post),” *theserverside.com*, August 1, 2004. +1. Jerome H. Saltzer, David P. Reed, and David D. Clark: “[End-to-End Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf),” *ACM Transactions on Computer Systems*, volume 2, number 4, pages 277–288, November 1984. [doi:10.1145/357401.357402](http://dx.doi.org/10.1145/357401.357402) +1. Peter Bailis, Alan Fekete, Michael J. Franklin, et al.: “[Coordination-Avoiding Database Systems](http://arxiv.org/pdf/1402.2237.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 3, pages 185–196, November 2014. +1. Alex Yarmula: “[Strong Consistency in Manhattan](https://blog.twitter.com/2016/strong-consistency-in-manhattan),” *blog.twitter.com*, March 17, 2016. +1. Douglas B Terry, Marvin M Theimer, Karin Petersen, et al.: “[Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](http://css.csail.mit.edu/6.824/2014/papers/bayou-conflicts.pdf),” at *15th ACM Symposium on Operating Systems Principles* (SOSP), pages 172–182, December 1995. [doi:10.1145/224056.224070](http://dx.doi.org/10.1145/224056.224070) +1. 
Jim Gray: “[The Transaction Concept: Virtues and Limitations](http://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf),” at *7th International Conference on Very Large Data Bases* (VLDB), September 1981. +1. Hector Garcia-Molina and Kenneth Salem: “[Sagas](http://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), May 1987. [doi:10.1145/38713.38742](http://dx.doi.org/10.1145/38713.38742) +1. Pat Helland: “[Memories, Guesses, and Apologies](https://web.archive.org/web/20160304020907/http://blogs.msdn.com/b/pathelland/archive/2007/05/15/memories-guesses-and-apologies.aspx),” *blogs.msdn.com*, May 15, 2007. +1. Yoongu Kim, Ross Daly, Jeremie Kim, et al.: “[Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf),” at *41st Annual International Symposium on Computer Architecture* (ISCA), June 2014. [doi:10.1145/2678373.2665726](http://dx.doi.org/10.1145/2678373.2665726) +1. Mark Seaborn and Thomas Dullien: “[Exploiting the DRAM Rowhammer Bug to Gain Kernel Privileges](https://googleprojectzero.blogspot.co.uk/2015/03/exploiting-dram-rowhammer-bug-to-gain.html),” *googleprojectzero.blogspot.co.uk*, March 9, 2015. +1. Jim N. Gray and Catharine van Ingen: “[Empirical Measurements of Disk Failure Rates and Error Rates](https://www.microsoft.com/en-us/research/publication/empirical-measurements-of-disk-failure-rates-and-error-rates/),” Microsoft Research, MSR-TR-2005-166, December 2005. +1. Annamalai Gurusami and Daniel Price: “[Bug #73170: Duplicates in Unique Secondary Index Because of Fix of Bug#68021](http://bugs.mysql.com/bug.php?id=73170),” *bugs.mysql.com*, July 2014. +1. Gary Fredericks: “[Postgres Serializability Bug](https://github.com/gfredericks/pg-serializability-bug),” *github.com*, September 2015. +1. 
Xiao Chen: “[HDFS DataNode Scanners and Disk Checker Explained](http://blog.cloudera.com/blog/2016/12/hdfs-datanode-scanners-and-disk-checker-explained/),” *blog.cloudera.com*, December 20, 2016. +1. Jay Kreps: “[Getting Real About Distributed System Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability),” *blog.empathybox.com*, March 19, 2012. +1. Martin Fowler: “[The LMAX Architecture](http://martinfowler.com/articles/lmax.html),” *martinfowler.com*, July 12, 2011. +1. Sam Stokes: “[Move Fast with Confidence](http://blog.samstokes.co.uk/blog/2016/07/11/move-fast-with-confidence/),” *blog.samstokes.co.uk*, July 11, 2016. +1. “[Hyperledger Sawtooth documentation](https://web.archive.org/web/20220120211548/https://sawtooth.hyperledger.org/docs/core/releases/latest/introduction.html),” Intel Corporation, *sawtooth.hyperledger.org*, 2017. +1. Richard Gendal Brown: “[Introducing R3 Corda™: A Distributed Ledger Designed for Financial Services](https://gendal.me/2016/04/05/introducing-r3-corda-a-distributed-ledger-designed-for-financial-services/),” *gendal.me*, April 5, 2016. +1. Trent McConaghy, Rodolphe Marques, Andreas Müller, et al.: “[BigchainDB: A Scalable Blockchain Database](https://www.bigchaindb.com/whitepaper/bigchaindb-whitepaper.pdf),” *bigchaindb.com*, June 8, 2016. +1. Ralph C. Merkle: “[A Digital Signature Based on a Conventional Encryption Function](https://people.eecs.berkeley.edu/~raluca/cs261-f15/readings/merkle.pdf),” at *CRYPTO '87*, August 1987. [doi:10.1007/3-540-48184-2_32](http://dx.doi.org/10.1007/3-540-48184-2_32) +1. Ben Laurie: “[Certificate Transparency](http://queue.acm.org/detail.cfm?id=2668154),” *ACM Queue*, volume 12, number 8, pages 10-19, August 2014. [doi:10.1145/2668152.2668154](http://dx.doi.org/10.1145/2668152.2668154) +1. Mark D. 
diff --git a/content/en/ch2.md b/content/en/ch2.md index 72c61a3..584a31c 100644 --- a/content/en/ch2.md +++ b/content/en/ch2.md @@ -1,118 +1,1518 @@ --- -title: "2. Data Models and Query Languages" -linkTitle: "2. Data Models and Query Languages" +title: "2. Defining Nonfunctional Requirements" weight: 102 breadcrumbs: false --- -![](/img/ch2.png) +# Chapter 2. Defining Nonfunctional Requirements -> *The limits of my language mean the limits of my world.* +> *The Internet was done so well that most people think of it as a natural resource like the Pacific +> Ocean, rather than something that was man-made. When was the last time a technology with a scale +> like that was so error-free?* > -> ​ — Ludwig Wittgenstein, *Tractatus Logico-Philosophicus* (1922) +> [Alan Kay](https://www.drdobbs.com/architecture-and-design/interview-with-alan-kay/240003442), +> in interview with *Dr Dobb’s Journal* (2012) -------------------- +If you are building an application, you will be driven by a list of requirements. At the top of your +list is most likely the functionality that the application must offer: what screens and what buttons +you need, and what each operation is supposed to do in order to fulfill the purpose of your +software. These are your *functional requirements*.
-Data models are perhaps the most important part of developing software, because they have such a profound effect: not only on how the software is written, but also on how we *think about the problem* that we are solving. +In addition, you probably also have some *nonfunctional requirements*: for example, the app should +be fast, reliable, secure, legally compliant, and easy to maintain. These requirements might not be +explicitly written down, because they may seem somewhat obvious, but they are just as important as +the app’s functionality: an app that is unbearably slow or unreliable might as well not exist. -Most applications are built by layering one data model on top of another. For each layer, the key question is: how is it *represented* in terms of the next-lower layer? For example: +Many nonfunctional requirements, such as security, fall outside the scope of this book. But there +are a few nonfunctional requirements that we will consider, and this chapter will help you +articulate them for your own systems: -1. As an application developer, you look at the real world (in which there are peo‐ ple, organizations, goods, actions, money flows, sensors, etc.) and model it in terms of objects or data structures, and APIs that manipulate those data struc‐ tures. Those structures are often specific to your application. -2. When you want to store those data structures, you express them in terms of a general-purpose data model, such as JSON or XML documents, tables in a rela‐ tional database, or a graph model. -3. The engineers who built your database software decided on a way of representing that JSON/XML/relational/graph data in terms of bytes in memory, on disk, or on a network. The representation may allow the data to be queried, searched, manipulated, and processed in various ways. -4. On yet lower levels, hardware engineers have figured out how to represent bytes in terms of electrical currents, pulses of light, magnetic fields, and more. 
+* How to define and measure the *performance* of a system (see [“Describing Performance”](/en/ch2#sec_introduction_percentiles)); +* What it means for a service to be *reliable*—namely, continuing to work correctly, even when + things go wrong (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability)); +* Allowing a system to be *scalable* by having efficient ways of adding computing + capacity as the load on the system grows (see [“Scalability”](/en/ch2#sec_introduction_scalability)); and +* Making it easier to maintain a system in the long term (see [“Maintainability”](/en/ch2#sec_introduction_maintainability)). -In a complex application there may be more intermediary levels, such as APIs built upon APIs, but the basic idea is still the same: each layer hides the complexity of the layers below it by providing a clean data model. These abstractions allow different groups of people—for example, the engineers at the database vendor and the applica‐ tion developers using their database—to work together effectively. +The terminology introduced in this chapter will also be useful in the following chapters, when we go +into the details of how data-intensive systems are implemented. However, abstract definitions can be +quite dry; to make the ideas more concrete, we will start this chapter with a case study of how a +social networking service might work, which will provide practical examples of performance and +scalability. -There are many different kinds of data models, and every data model embodies assumptions about how it is going to be used. Some kinds of usage are easy and some are not supported; some operations are fast and some perform badly; some data transformations feel natural and some are awkward. +# Case Study: Social Network Home Timelines -It can take a lot of effort to master just one data model (think how many books there are on relational data modeling). 
Building software is hard enough, even when work‐ ing with just one data model and without worrying about its inner workings. But since the data model has such a profound effect on what the software above it can and can’t do, it’s important to choose one that is appropriate to the application. +Imagine you are given the task of implementing a social network in the style of X (formerly +Twitter), in which users can post messages and follow other users. This will be a huge +simplification of how such a service actually works +[[1](/en/ch2#Cvet2016), +[2](/en/ch2#Krikorian2012_ch2), +[3](/en/ch2#Twitter2023)], +but it will help illustrate some of the issues that arise in large-scale systems. -In this chapter we will look at a range of general-purpose data models for data stor‐ age and querying (point 2 in the preceding list). In particular, we will compare the relational model, the document model, and a few graph-based data models. We will also look at various query languages and compare their use cases. In [Chapter 3](/en/ch3) we will discuss how storage engines work; that is, how these data models are actually implemented (point 3 in the list). +Let’s assume that users make 500 million posts per day, or 5,700 posts per second on average. +Occasionally, the rate can spike as high as 150,000 posts/second +[[4](/en/ch2#Krikorian2013)]. +Let’s also assume that the average user follows 200 people and has 200 followers (although there is +a very wide range: most people have only a handful of followers, and a few celebrities such as +Barack Obama have over 100 million followers). +## Representing Users, Posts, and Follows +Imagine we keep all of the data in a relational database as shown in [Figure 2-1](/en/ch2#fig_twitter_relational). We +have one table for users, one table for posts, and one table for follow relationships. -## …… +![ddia 0201](/fig/ddia_0201.png) +###### Figure 2-1. Simple relational schema for a social network in which users can follow each other. 
+Let’s say the main read operation that our social network must support is the *home timeline*, which +displays recent posts by people you are following (for simplicity we will ignore ads, suggested +posts from people you are not following, and other extensions). We could write the following SQL +query to get the home timeline for a particular user: -## Summary +``` +SELECT posts.*, users.* FROM posts + JOIN follows ON posts.sender_id = follows.followee_id + JOIN users ON posts.sender_id = users.id + WHERE follows.follower_id = current_user + ORDER BY posts.timestamp DESC + LIMIT 1000 +``` +To execute this query, the database will use the `follows` table to find everybody who +`current_user` is following, look up recent posts by those users, and sort them by timestamp to get +the most recent 1,000 posts by any of the followed users. -Data models are a huge subject, and in this chapter we have taken a quick look at a broad variety of different models. We didn’t have space to go into all the details of each model, but hopefully the overview has been enough to whet your appetite to find out more about the model that best fits your application’s requirements. +Posts are supposed to be timely, so let’s assume that after somebody makes a post, we want their +followers to be able to see it within 5 seconds. One way of doing that would be for the user’s +client to repeat the query above every 5 seconds while the user is online (this is known as +*polling*). If we assume that 10 million users are online and logged in at the same time, that would +mean running the query 2 million times per second. Even if you increase the polling interval, this +is a lot. -Historically, data started out being represented as one big tree (the hierarchical model), but that wasn’t good for representing many-to-many relationships, so the relational model was invented to solve that problem. More recently, developers found that some applications don’t fit well in the relational model either. 
New nonrelational “NoSQL” datastores have diverged in two main directions: +Moreover, the query above is quite expensive: if you are following 200 people, it needs to fetch a +list of recent posts by each of those 200 people, and merge those lists. 2 million timeline queries +per second then means that the database needs to look up the recent posts from some sender 400 +million times per second—a huge number. And that is the average case. Some users follow tens of +thousands of accounts; for them, this query is very expensive to execute, and difficult to make +fast. -1. *Document databases* target use cases where data comes in self-contained docu‐ ments and relationships between one document and another are rare. +## Materializing and Updating Timelines -2. *Graph databases* go in the opposite direction, targeting use cases where anything is potentially related to everything. +How can we do better? Firstly, instead of polling, it would be better if the server actively pushed +new posts to any followers who are currently online. Secondly, we should precompute the results of +the query above so that a user’s request for their home timeline can be served from a cache. -All three models (document, relational, and graph) are widely used today, and each is good in its respective domain. One model can be emulated in terms of another model —for example, graph data can be represented in a relational database—but the result is often awkward. That’s why we have different systems for different purposes, not a single one-size-fits-all solution. +Imagine that for each user we store a data structure containing their home timeline, i.e., the +recent posts by people they are following. Every time a user makes a post, we look up all of their +followers, and insert that post into the home timeline of each follower—like delivering a message to +a mailbox. Now when a user logs in, we can simply give them this home timeline that we precomputed. 
+Moreover, to receive a notification about any new posts on their timeline, the user’s client simply +needs to subscribe to the stream of posts being added to their home timeline. -One thing that document and graph databases have in common is that they typically don’t enforce a schema for the data they store, which can make it easier to adapt applications to changing requirements. However, your application most likely still assumes that data has a certain structure; it’s just a question of whether the schema is explicit (enforced on write) or implicit (handled on read). +The downside of this approach is that we now need to do more work every time a user makes a post, +because the home timelines are derived data that needs to be updated. The process is illustrated in +[Figure 2-2](/en/ch2#fig_twitter_timelines). When one initial request results in several downstream requests being +carried out, we use the term *fan-out* to describe the factor by which the number of requests +increases. -Each data model comes with its own query language or framework, and we discussed several examples: SQL, MapReduce, MongoDB’s aggregation pipeline, Cypher, SPARQL, and Datalog. We also touched on CSS and XSL/XPath, which aren’t data‐ base query languages but have interesting parallels. +![ddia 0202](/fig/ddia_0202.png) -Although we have covered a lot of ground, there are still many data models left unmentioned. To give just a few brief examples: +###### Figure 2-2. Fan-out: delivering new posts to every follower of the user who made the post. -* Researchers working with genome data often need to perform *sequence- similarity searches*, which means taking one very long string (representing a DNA molecule) and matching it against a large database of strings that are simi‐ lar, but not identical. None of the databases described here can handle this kind of usage, which is why researchers have written specialized genome database software like GenBank [48]. 
+At a rate of 5,700 posts posted per second, if the average post reaches 200 followers (i.e., a +fan-out factor of 200), we will need to do just over 1 million home timeline writes per second. This +is a lot, but it’s still a significant saving compared to the 400 million per-sender post lookups +per second that we would otherwise have to do. -- Particle physicists have been doing Big Data–style large-scale data analysis for decades, and projects like the Large Hadron Collider (LHC) now work with hun‐ dreds of petabytes! At such a scale custom solutions are required to stop the hardware cost from spiraling out of control [49]. -- *Full-text search* is arguably a kind of data model that is frequently used alongside databases. Information retrieval is a large specialist subject that we won’t cover in great detail in this book, but we’ll touch on search indexes in [Chapter 3](/en/ch3) and [Part III](/en/part-iii). +If the rate of posts spikes due to some special event, we don’t have to do the timeline +deliveries immediately—we can enqueue them and accept that it will temporarily take a bit longer for +posts to show up in followers’ timelines. Even during such load spikes, timelines remain fast to +load, since we simply serve them from a cache. -We have to leave it there for now. In the next chapter we will discuss some of the trade-offs that come into play when *implementing* the data models described in this chapter. +This process of precomputing and updating the results of a query is called *materialization*, and +the timeline cache is an example of a *materialized view* (a concept we will discuss further in +[Link to Come]). The materialized view speeds up reads, but in return we have to do more work on +write. 
The cost of writes for most users is modest, but a social network also has to consider some +extreme cases: +* If a user is following a very large number of accounts, and those accounts post a lot, that user + will have a high rate of writes to their materialized timeline. However, in this case it’s + unlikely that the user is actually reading all of the posts in their timeline, and therefore it’s + okay to simply drop some of their timeline writes and show the user only a sample of the posts + from the accounts they’re following + [[5](/en/ch2#Volpert2025)]. +* When a celebrity account with a very large number of followers makes a post, we have to do a large + amount of work to insert that post into the home timelines of each of their millions of followers. + In this case it’s not okay to drop some of those writes. One way of solving this problem is to + handle celebrity posts separately from everyone else’s posts: we can save ourselves the effort of + adding them to millions of timelines by storing the celebrity posts separately and merging them + with the materialized timeline when it is read. Despite such optimizations, handling celebrities + on a social network can require a lot of infrastructure + [[6](/en/ch2#Axon2010_ch2)]. +# Describing Performance -## References +Most discussions of software performance consider two main types of metric: + +Response time +: The elapsed time from the moment when a user makes a request until they receive the requested + answer. The unit of measurement is seconds (or milliseconds, or microseconds). + +Throughput +: The number of requests per second, or the data volume per second, that the system is processing. + For a given allocation of hardware resources, there is a *maximum throughput* that can be handled. + The unit of measurement is “somethings per second”. 
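
The throughput figures quoted in the social network case study follow from simple arithmetic. The sketch below reproduces them in Python, using only the assumptions stated in the text (10 million online users polling every 5 seconds, 200 follows per user, 5,700 posts per second on average). The constant and function names are illustrative and do not come from the book.

```python
# Back-of-the-envelope throughput figures for the social network case study.
# Inputs are the assumptions stated in the text; all names are illustrative.
SECONDS_PER_DAY = 24 * 60 * 60
POSTS_PER_DAY = 500_000_000
AVG_POSTS_PER_SECOND = 5_700   # rounded average used in the text
ONLINE_USERS = 10_000_000
POLL_INTERVAL_SECONDS = 5
AVG_FOLLOWS = 200              # accounts followed, and followers, per user


def polling_load():
    """Return (timeline queries/s, per-sender post lookups/s) under polling."""
    queries_per_second = ONLINE_USERS // POLL_INTERVAL_SECONDS
    return queries_per_second, queries_per_second * AVG_FOLLOWS


def fanout_load():
    """Return home timeline writes/s when each post is fanned out on write."""
    return AVG_POSTS_PER_SECOND * AVG_FOLLOWS


if __name__ == "__main__":
    print(f"average posts/s: {POSTS_PER_DAY / SECONDS_PER_DAY:,.0f}")
    queries, lookups = polling_load()
    print(f"polling: {queries:,} queries/s, {lookups:,} post lookups/s")
    print(f"fan-out: {fanout_load():,} timeline writes/s")
```

Polling works out to 2 million timeline queries and 400 million per-sender lookups per second, whereas fan-out on write needs about 1.14 million timeline writes per second, which is the saving the case study describes.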
+ +In the social network case study, “posts per second” and “timeline writes per second” are throughput +metrics, whereas the “time it takes to load the home timeline” or the “time until a post is +delivered to followers” are response time metrics. + +There is often a connection between throughput and response time; an example of such a relationship +for an online service is sketched in [Figure 2-3](/en/ch2#fig_throughput). The service has a low response time when +request throughput is low, but response time increases as load increases. This is because of +*queueing*: when a request arrives on a highly loaded system, it’s likely that the CPU is already in +the process of handling an earlier request, and therefore the incoming request needs to wait until +the earlier request has been completed. As throughput approaches the maximum that the hardware can +handle, queueing delays increase sharply. + +![ddia 0203](/fig/ddia_0203.png) + +###### Figure 2-3. As the throughput of a service approaches its capacity, the response time increases dramatically due to queueing. + +# When an overloaded system won’t recover + +If a system is close to overload, with throughput pushed close to the limit, it can sometimes enter a +vicious cycle where it becomes less efficient and hence even more overloaded. For example, if there +is a long queue of requests waiting to be handled, response times may increase so much that clients +time out and resend their request. This causes the rate of requests to increase even further, making +the problem worse—a *retry storm*. Even when the load is reduced again, such a system may remain in +an overloaded state until it is rebooted or otherwise reset. This phenomenon is called a *metastable +failure*, and it can cause serious outages in production systems +[[7](/en/ch2#Bronson2021), +[8](/en/ch2#Brooker2021)]. 
+ +To avoid retries overloading a service, you can increase and randomize the time between successive +retries on the client side (*exponential backoff* +[[9](/en/ch2#Brooker2015), +[10](/en/ch2#Brooker2022backoff)]), +and temporarily stop sending requests to a service that has returned errors or timed out recently +(using a *circuit breaker* [[11](/en/ch2#Nygard2018), +[12](/en/ch2#Chen2022)] +or *token bucket* algorithm [[13](/en/ch2#Brooker2022retries)]). +The server can also detect when it is approaching overload and start proactively rejecting requests +(*load shedding* [[14](/en/ch2#YanacekLoadShedding)]), and send back +responses asking clients to slow down (*backpressure* +[[1](/en/ch2#Cvet2016), +[15](/en/ch2#Sackman2016_ch2)]). +The choice of queueing and load-balancing algorithms can also make a difference +[[16](/en/ch2#Kopytkov2018)]. + +In terms of performance metrics, the response time is usually what users care about the most, +whereas the throughput determines the required computing resources (e.g., how many servers you need), +and hence the cost of serving a particular workload. If throughput is likely to increase beyond what +the current hardware can handle, the capacity needs to be expanded; a system is said to be +*scalable* if its maximum throughput can be significantly increased by adding computing resources. + +In this section we will focus primarily on response times, and we will return to throughput and +scalability in [“Scalability”](/en/ch2#sec_introduction_scalability). + +## Latency and Response Time + +“Latency” and “response time” are sometimes used interchangeably, but in this book we will use the +terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)): + +* The *response time* is what the client sees; it includes all delays incurred anywhere in the + system. +* The *service time* is the duration for which the service is actively processing the user request. 
+* *Queueing delays* can occur at several points in the flow: for example, after a request is + received, it might need to wait until a CPU is available before it can be processed; a response + packet might need to be buffered before it is sent over the network if other tasks on the same + machine are sending a lot of data via the outbound network interface. +* *Latency* is a catch-all term for time during which a request is not being actively processed, + i.e., during which it is *latent*. In particular, *network latency* or *network delay* refers to + the time that request and response spend traveling through the network. + +![ddia 0204](/fig/ddia_0204.png) + +###### Figure 2-4. Response time, service time, network latency, and queueing delay. + +In [Figure 2-4](/en/ch2#fig_response_time), time flows from left to right, each communicating node is shown as a +horizontal line, and a request or response message is shown as a thick diagonal arrow from one node +to another. You will encounter this style of diagram frequently over the course of this book. + +The response time can vary significantly from one request to the next, even if you keep making the +same request over and over again. Many factors can add random delays: for example, a context switch +to a background process, the loss of a network packet and TCP retransmission, a garbage collection +pause, a page fault forcing a read from disk, mechanical vibrations in the server rack +[[17](/en/ch2#Gunawi2018_ch2)], +or many other causes. We will discuss this topic in more detail in [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing). + +Queueing delays often account for a large part of the variability in response times. As a server +can only process a small number of things in parallel (limited, for example, by its number of CPU +cores), it only takes a small number of slow requests to hold up the processing of subsequent +requests—an effect known as *head-of-line blocking*. 
Even if those subsequent requests have fast +service times, the client will see a slow overall response time due to the time waiting for the +prior request to complete. The queueing delay is not part of the service time, and for this reason +it is important to measure response times on the client side. + +## Average, Median, and Percentiles + +Because the response time varies from one request to the next, we need to think of it not as a +single number, but as a *distribution* of values that you can measure. In [Figure 2-5](/en/ch2#fig_lognormal), each +gray bar represents a request to a service, and its height shows how long that request took. Most +requests are reasonably fast, but there are occasional *outliers* that take much longer. +Variation in network delay is also known as *jitter*. + +![ddia 0205](/fig/ddia_0205.png) + +###### Figure 2-5. Illustrating mean and percentiles: response times for a sample of 100 requests to a service. + +It’s common to report the *average* response time of a service (technically, the *arithmetic mean*: +that is, sum all the response times, and divide by the number of requests). The mean response time +is useful for estimating throughput limits [[18](/en/ch2#Brooker2017)]. +However, the mean is not a very good metric if you want to know your “typical” response time, +because it doesn’t tell you how many users actually experienced that delay. + +Usually it is better to use *percentiles*. If you take your list of response times and sort it from +fastest to slowest, then the *median* is the halfway point: for example, if your median response +time is 200 ms, that means half your requests return in less than 200 ms, and half your +requests take longer than that. This makes the median a good metric if you want to know how long +users typically have to wait. The median is also known as the *50th percentile*, and sometimes +abbreviated as *p50*. 
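
To make the difference between the mean and the median concrete, here is a minimal sketch that computes both from a list of measured response times. It assumes the nearest-rank definition of a percentile, which is only one of several conventions in use, and the sample values are invented for illustration.

```python
import math


def percentile(response_times_ms, p):
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are less than or equal to it (0 < p <= 100)."""
    if not response_times_ms:
        raise ValueError("need at least one sample")
    ordered = sorted(response_times_ms)
    rank = math.ceil(p * len(ordered) / 100)  # 1-based rank in sorted order
    return ordered[rank - 1]


def mean(response_times_ms):
    """Arithmetic mean: sum of all samples divided by their count."""
    return sum(response_times_ms) / len(response_times_ms)


# Hypothetical sample: nine ordinary requests and one slow outlier.
samples_ms = [120, 95, 210, 180, 150, 3000, 130, 160, 140, 170]

if __name__ == "__main__":
    print("mean:", mean(samples_ms))            # pulled up by the outlier
    print("p50: ", percentile(samples_ms, 50))  # the median
    print("p95: ", percentile(samples_ms, 95))  # dominated by the outlier
```

With this sample the single 3,000 ms outlier pulls the mean up to 435.5 ms, while the median stays at 150 ms, much closer to what most of these requests experienced.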
+ +In order to figure out how bad your outliers are, you can look at higher percentiles: the *95th*, +*99th*, and *99.9th* percentiles are common (abbreviated *p95*, *p99*, and *p999*). They are the +response time thresholds at which 95%, 99%, or 99.9% of requests are faster than that particular +threshold. For example, if the 95th percentile response time is 1.5 seconds, that means 95 out of +100 requests take less than 1.5 seconds, and 5 out of 100 requests take 1.5 seconds or more. This is +illustrated in [Figure 2-5](/en/ch2#fig_lognormal). + +High percentiles of response times, also known as *tail latencies*, are important because they +directly affect users’ experience of the service. For example, Amazon describes response time +requirements for internal services in terms of the 99.9th percentile, even though it only affects 1 +in 1,000 requests. This is because the customers with the slowest requests are often those who have +the most data on their accounts because they have made many purchases—that is, they’re the most +valuable customers +[[19](/en/ch2#DeCandia2007_ch1)]. +It’s important to keep those customers happy by ensuring the website is fast for them. + +On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed +too expensive and to not yield enough benefit for Amazon’s purposes. Reducing response times at very +high percentiles is difficult because they are easily affected by random events outside of your +control, and the benefits are diminishing. + +# The user impact of response times + +It seems intuitively obvious that a fast service is better for users than a slow service +[[20](/en/ch2#Whitenton2020)]. +However, it is surprisingly difficult to get hold of reliable data to quantify the effect that +latency has on user behavior. + +Some often-cited statistics are unreliable. 
In 2006 Google reported that a slowdown in search +results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue +[[21](/en/ch2#Linden2006)]. +However, another Google study from 2009 reported that a 400 ms increase in latency resulted in +only 0.6% fewer searches per day +[[22](/en/ch2#Brutlag2009)], +and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3% +[[23](/en/ch2#Schurman2009)]. +Newer data from these companies appears not to be publicly available. + +A more recent Akamai study +[[24](/en/ch2#Akamai2017)] +claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites +by up to 7%; however, on closer inspection, the same study reveals that very *fast* page load times +are also correlated with lower conversion rates! This seemingly paradoxical result is explained by +the fact that the pages that load fastest are often those that have no useful content (e.g., 404 +error pages). However, since the study makes no effort to separate the effects of page content from +the effects of load time, its results are probably not meaningful. + +A study by Yahoo +[[25](/en/ch2#Bai2017)] +compares click-through rates on fast-loading versus slow-loading search results, controlling for +quality of search results. It finds 20–30% more clicks on fast searches when the difference between +fast and slow responses is 1.25 seconds or more. + +## Use of Response Time Metrics + +High percentiles are especially important in backend services that are called multiple times as +part of serving a single end-user request. Even if you make the calls in parallel, the end-user +request still needs to wait for the slowest of the parallel calls to complete. It takes just one +slow call to make the entire end-user request slow, as illustrated in [Figure 2-6](/en/ch2#fig_tail_amplification). 
+Even if only a small percentage of backend calls are slow, the chance of getting a slow call +increases if an end-user request requires multiple backend calls, and so a higher proportion of +end-user requests end up being slow (an effect known as *tail latency amplification* +[[26](/en/ch2#Dean2013_ch2)]). + +![ddia 0206](/fig/ddia_0206.png) + +###### Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request. + +Percentiles are often used in *service level objectives* (SLOs) and *service level agreements* +(SLAs) as ways of defining the expected performance and availability of a service +[[27](/en/ch2#Hidalgo2020)]. +For example, an SLO may set a target for a service to have a median response time of less than +200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests +result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not +met (for example, customers may be entitled to a refund). That is the basic idea, at least; in +practice, defining good availability metrics for SLOs and SLAs is not straightforward +[[28](/en/ch2#Mogul2019), +[29](/en/ch2#Hauer2020)]. + +# Computing percentiles + +If you want to add response time percentiles to the monitoring dashboards for your services, you +need to efficiently calculate them on an ongoing basis. For example, you may want to keep a rolling +window of response times of requests in the last 10 minutes. Every minute, you calculate the median +and various percentiles over the values in that window and plot those metrics on a graph. + +The simplest implementation is to keep a list of response times for all requests within the time +window and to sort that list every minute. If that is too inefficient for you, there are algorithms +that can calculate a good approximation of percentiles at minimal CPU and memory cost. 
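
A minimal sketch of the simplest approach described above (keep every response time in the window and sort on demand), assuming a 10-minute window and the nearest-rank percentile convention. The class name `RollingPercentiles` and its parameters are hypothetical, chosen only for illustration; a production system would more likely use one of the approximation libraries.

```python
import math
import time
from collections import deque


class RollingPercentiles:
    """Naive rolling-window percentile tracker (illustrative, not optimized)."""

    def __init__(self, window_seconds=600.0, clock=time.monotonic):
        self.window_seconds = window_seconds
        self.clock = clock  # injectable so the window can be tested
        self.samples = deque()  # (timestamp, response_time_ms), oldest first

    def record(self, response_time_ms):
        """Store one observed response time with the current timestamp."""
        self.samples.append((self.clock(), response_time_ms))

    def _evict_expired(self):
        """Drop samples that are older than the window."""
        cutoff = self.clock() - self.window_seconds
        while self.samples and self.samples[0][0] < cutoff:
            self.samples.popleft()

    def percentile(self, p):
        """Nearest-rank percentile: the smallest recorded value such that
        at least p percent of the samples are less than or equal to it."""
        self._evict_expired()
        if not self.samples:
            return None
        values = sorted(value for _, value in self.samples)
        rank = max(1, math.ceil(p * len(values) / 100))
        return values[rank - 1]
```

Calling `percentile(50)`, `percentile(95)`, and `percentile(99)` once a minute would give the values to plot. Sorting the whole window on every call costs O(n log n), which is the inefficiency that the approximation algorithms avoid.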
+Open source percentile estimation libraries include HdrHistogram, +t-digest [[30](/en/ch2#Dunning2021), +[31](/en/ch2#Kohn2021)], +OpenHistogram [[32](/en/ch2#Hartmann2020)], and DDSketch +[[33](/en/ch2#Masson2019)]. + +Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from +several machines, is mathematically meaningless—the right way of aggregating response time data +is to add the histograms [[34](/en/ch2#Schwartz2015)]. + +# Reliability and Fault Tolerance + +Everybody has an intuitive idea of what it means for something to be reliable or unreliable. For +software, typical expectations include: + +* The application performs the function that the user expected. +* It can tolerate the user making mistakes or using the software in unexpected ways. +* Its performance is good enough for the required use case, under the expected load and data volume. +* The system prevents any unauthorized access and abuse. + +If all those things together mean “working correctly,” then we can understand *reliability* as +meaning, roughly, “continuing to work correctly, even when things go wrong.” To be more precise +about things going wrong, we will distinguish between *faults* and *failures* +[[35](/en/ch2#Heimerdinger1992), +[36](/en/ch2#Gaertner1999), +[37](/en/ch2#Avizienis2004)]: + +Fault +: A fault is when a particular *part* of a system stops working correctly: for example, if a + single hard drive malfunctions, or a single machine crashes, or an external service (that the + system depends on) has an outage. + +Failure +: A failure is when the system *as a whole* stops providing the required service to the user; in + other words, when it does not meet the service level objective (SLO). + +The distinction between fault and failure can be confusing because they are the same thing, just at +different levels. 
For example, if a hard drive stops working, we say that the hard drive has failed: +if the system consists only of that one hard drive, it has stopped providing the required service. +However, if the system you’re talking about contains many hard drives, then the failure of a single +hard drive is only a fault from the point of view of the bigger system, and the bigger system might +be able to tolerate that fault by having a copy of the data on another hard drive. + +## Fault Tolerance + +We call a system *fault-tolerant* if it continues providing the required service to the user in +spite of certain faults occurring. If a system cannot tolerate a certain part becoming faulty, we +call that part a *single point of failure* (SPOF), because a fault in that part escalates to cause +the failure of the whole system. + +For example, in the social network case study, a fault that might happen is that during the fan-out +process, a machine involved in updating the materialized timelines crashes or becomes unavailable. +To make this process fault-tolerant, we would need to ensure that another machine can take over this +task without missing any posts that should have been delivered, and without duplicating any posts. +(This idea is known as *exactly-once semantics*, and we will examine it in detail in [Link to Come].) + +Fault tolerance is always limited to a certain number of certain types of faults. For example, a +system might be able to tolerate a maximum of two hard drives failing at the same time, or a maximum +of one out of three nodes crashing. It would not make sense to tolerate any number of faults: if all +nodes crash, there is nothing that can be done. If the entire planet Earth (and all servers on it) +were swallowed by a black hole, tolerance of that fault would require web hosting in space—good luck +getting that budget item approved.
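
To put a rough number on a limit such as "one out of three nodes", the sketch below computes the probability that the number of simultaneous faults stays within what the system tolerates. It assumes that every node is faulty independently with the same probability. That is a strong simplification, the function name `prob_tolerated` is hypothetical, and the 1% figure is an arbitrary illustration rather than a measured rate.

```python
from math import comb


def prob_tolerated(n, max_faults, p_fault):
    """Probability that at most `max_faults` of `n` nodes are faulty at once.

    Assumes independent faults, each occurring with probability `p_fault`
    (a binomial model; real faults are often correlated).
    """
    return sum(
        comb(n, k) * p_fault**k * (1 - p_fault) ** (n - k)
        for k in range(max_faults + 1)
    )


# One node with no redundancy versus three nodes tolerating one fault.
single = prob_tolerated(n=1, max_faults=0, p_fault=0.01)
triple = prob_tolerated(n=3, max_faults=1, p_fault=0.01)
```

Under these assumptions the single node is healthy 99% of the time, while the three-node system stays within its tolerance about 99.97% of the time. Any correlation between faults makes the real figure worse than this model suggests.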
+ +Counter-intuitively, in such fault-tolerant systems, it can make sense to *increase* the rate of +faults by triggering them deliberately—for example, by randomly killing individual processes +without warning. This is called *fault injection*. Many critical bugs are actually due to poor error +handling [[38](/en/ch2#Yuan2014)]; by deliberately inducing faults, you ensure +that the fault-tolerance machinery is continually exercised and tested, which can increase your +confidence that faults will be handled correctly when they occur naturally. *Chaos engineering* is +a discipline that aims to improve confidence in fault-tolerance mechanisms through experiments such +as deliberately injecting faults +[[39](/en/ch2#Rosenthal2020)]. + +Although we generally prefer tolerating faults over preventing faults, there are cases where +prevention is better than cure (e.g., because no cure exists). This is the case with security +matters, for example: if an attacker has compromised a system and gained access to sensitive data, +that event cannot be undone. However, this book mostly deals with the kinds of faults that can be +cured, as described in the following sections. + +## Hardware and Software Faults + +When we think of causes of system failure, hardware faults quickly come to mind: + +* Approximately 2–5% of magnetic hard drives fail per year + [[40](/en/ch2#Pinheiro2007), + [41](/en/ch2#Schroeder2007)]; + in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day. + Recent data suggests that disks are getting more reliable, but failure rates remain significant + [[42](/en/ch2#Klein2021)]. +* Approximately 0.5–1% of solid state drives (SSDs) fail per year + [[43](/en/ch2#Narayanan2016)]. 
+ Small numbers of bit errors are corrected automatically + [[44](/en/ch2#Alibaba2019_ch2)], + but uncorrectable errors occur approximately once per year per drive, even in drives that are + fairly new (i.e., that have experienced little wear); this error rate is higher than that of + magnetic hard drives + [[45](/en/ch2#Schroeder2016_ch2), + [46](/en/ch2#Alter2019)]. +* Other hardware components such as power supplies, RAID controllers, and memory modules also fail, + although less frequently than hard drives + [[47](/en/ch2#Ford2010), + [48](/en/ch2#Vishwanath2010)]. +* Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result, + likely due to manufacturing defects + [[49](/en/ch2#Hochschild2021), + [50](/en/ch2#Dixit2021), + [51](/en/ch2#Behrens2015)]. + In some cases, an erroneous computation leads to a crash, but in other cases it leads to a program + simply returning the wrong result. +* Data in RAM can also be corrupted, either due to random events such as cosmic rays, or due to + permanent physical defects. Even when memory with error-correcting codes (ECC) is used, more than + 1% of machines encounter an uncorrectable error in a given year, which typically leads to a crash + of the machine and the affected memory module needing to be replaced + [[52](/en/ch2#Schroeder2009)]. + + Moreover, certain pathological memory access patterns can flip bits with high probability + [[53](/en/ch2#Kim2014)]. +* An entire datacenter might become unavailable (for example, due to power outage or network + misconfiguration) or even be permanently destroyed (for example by fire, flood, or earthquake + [[54](/en/ch2#Bray2021)]). + A solar storm, which induces large electrical currents in long-distance wires when the sun ejects + a large mass of charged particles, could damage power grids and undersea network cables + [[55](/en/ch2#AbduJyothi2021)]. 
+ Although such large-scale failures are rare, their impact can be catastrophic if a service cannot + tolerate the loss of a datacenter + [[56](/en/ch2#Cockcroft2019)]. + +These events are rare enough that you often don’t need to worry about them when working on a small +system, as long as you can easily replace hardware that becomes faulty. However, in a large-scale +system, hardware faults happen often enough that they become part of the normal system operation. + +### Tolerating hardware faults through redundancy + +Our first response to unreliable hardware is usually to add redundancy to the individual hardware +components in order to reduce the failure rate of the system. Disks may be set up in a RAID +configuration (spreading data across multiple disks in the same machine so that a failed disk does +not cause data loss), servers may have dual power supplies and hot-swappable CPUs, and datacenters +may have batteries and diesel generators for backup power. Such redundancy can often keep a machine +running uninterrupted for years. + +Redundancy is most effective when component faults are independent, that is, the occurrence of one +fault does not change how likely it is that another fault will occur. However, experience has shown +that there are often significant correlations between component failures +[[41](/en/ch2#Schroeder2007), +[57](/en/ch2#Han2021), +[58](/en/ch2#Nightingale2011)]; +unavailability of an entire server rack or an entire datacenter still happens more often than we +would like. + +Hardware redundancy increases the uptime of a single machine; however, as discussed in +[“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), there are advantages to using a distributed system, such as being +able to tolerate a complete outage of one datacenter. 
+For this reason, cloud systems tend to focus less on the reliability of individual machines, and +instead aim to make services highly available by tolerating faulty nodes at the software level. +Cloud providers use *availability zones* to identify which resources are physically co-located; +resources in the same place are more likely to fail at the same time than geographically separated +resources. + +The fault-tolerance techniques we discuss in this book are designed to tolerate the loss of entire +machines, racks, or availability zones. They generally work by allowing a machine in one datacenter +to take over when a machine in another datacenter fails or becomes unreachable. We will discuss such +techniques for fault tolerance in [Chapter 6](/en/ch6#ch_replication), [Chapter 10](/en/ch10#ch_consistency), and at various other +points in this book. + +Systems that can tolerate the loss of entire machines also have operational advantages: a +single-server system requires planned downtime if you need to reboot the machine (to apply operating +system security patches, for example), whereas a multi-node fault-tolerant system can be patched by +restarting one node at a time, without affecting the service for users. This is called a *rolling +upgrade*, and we will discuss it further in [Chapter 5](/en/ch5#ch_encoding). + +### Software faults + +Although hardware failures can be weakly correlated, they are still mostly independent: for +example, if one disk fails, it’s likely that other disks in the same machine will be fine for a +while longer. On the other hand, software faults are often very highly correlated, because it is +common for many nodes to run the same software and thus have the same bugs +[[59](/en/ch2#Gunawi2014), +[60](/en/ch2#Kreps2012_ch1)]. +Such faults are harder to anticipate, and they tend to cause many more system failures than +uncorrelated hardware faults [[47](/en/ch2#Ford2010)].
For example: + +* A software bug that causes every node to fail at the same time in particular circumstances. For + example, on June 30, 2012, a leap second caused many Java applications to hang simultaneously due + to a bug in the Linux kernel, bringing down many Internet services + [[61](/en/ch2#Minar2012_ch1)]. + Due to a firmware bug, all SSDs of certain models suddenly fail after precisely 32,768 hours of + operation (less than 4 years), rendering the data on them unrecoverable + [[62](/en/ch2#HPE2019_ch2)]. +* A runaway process that uses up some shared, limited resource, such as CPU time, memory, disk + space, network bandwidth, or threads + [[63](/en/ch2#Hochstein2020)]. + For example, a process that consumes too much memory while processing a large request may be + killed by the operating system. A bug in a client library could cause a much higher request + volume than anticipated [[64](/en/ch2#McCaffrey2015)]. +* A service that the system depends on slows down, becomes unresponsive, or starts returning + corrupted responses. +* An interaction between different systems results in emergent behavior that does not occur when + each system was tested in isolation [[65](/en/ch2#Tang2023)]. +* Cascading failures, where a problem in one component causes another component to become overloaded + and slow down, which in turn brings down another component + [[66](/en/ch2#Ulrich2016), + [67](/en/ch2#Fassbender2022)]. + +The bugs that cause these kinds of software faults often lie dormant for a long time until they are +triggered by an unusual set of circumstances. In those circumstances, it is revealed that the +software is making some kind of assumption about its environment—and while that assumption is +usually true, it eventually stops being true for some reason +[[68](/en/ch2#Cook2000), +[69](/en/ch2#Woods2017)]. + +There is no quick solution to the problem of systematic faults in software. 
Lots of small things can +help: carefully thinking about assumptions and interactions in the system; thorough testing; process +isolation; allowing processes to crash and restart; avoiding feedback loops such as retry storms +(see [“When an overloaded system won’t recover”](/en/ch2#sidebar_metastable)); measuring, monitoring, and analyzing system behavior in production. + +## Humans and Reliability + +Humans design and build software systems, and the operators who keep the systems running are also +human. Unlike machines, humans don’t just follow rules; their strength is being creative and +adaptive in getting their job done. However, this characteristic also leads to unpredictability, and +sometimes mistakes that can lead to failures, despite best intentions. For example, one study of +large internet services found that configuration changes by operators were the leading cause of +outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages +[[70](/en/ch2#Oppenheimer2003)]. + +It is tempting to label such problems as “human error” and to wish that they could be solved by +better controlling human behavior through tighter procedures and compliance with rules. However, +blaming people for mistakes is counterproductive. What we call “human error” is not really the cause +of an incident, but rather a symptom of a problem with the sociotechnical system in which people are +trying their best to do their jobs [[71](/en/ch2#Dekker2017)]. +Often complex systems have emergent behavior, in which unexpected interactions between components +may also lead to failures [[72](/en/ch2#Dekker2011)]. 
+ +Various technical measures can help minimize the impact of human mistakes, including thorough +testing (both hand-written tests and *property testing* on lots of random inputs) +[[38](/en/ch2#Yuan2014)], rollback mechanisms for quickly +reverting configuration changes, gradual roll-outs of new code, detailed and clear monitoring, +observability tools for diagnosing production issues (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)), +and well-designed interfaces that encourage “the right thing” and discourage “the wrong thing”. + +However, these things require an investment of time and money, and in the pragmatic reality of +everyday business, organizations often prioritize revenue-generating activities over measures that +increase their resilience against mistakes. If there is a choice between more features and more +testing, many organizations understandably choose features. Given this choice, when a preventable +mistake inevitably occurs, it does not make sense to blame the person who made the mistake—the +problem is the organization’s priorities. + +Increasingly, organizations are adopting a culture of *blameless postmortems*: after an incident, +the people involved are encouraged to share full details about what happened, without fear of +punishment, since this allows others in the organization to learn how to prevent similar problems in +the future [[73](/en/ch2#Allspaw2012)]. +This process may uncover a need to change business priorities, a need to invest in areas that have +been neglected, a need to change the incentives for the people involved, or some other systemic +issue that needs to be brought to the management’s attention. + +As a general principle, when investigating an incident, you should be suspicious of simplistic +answers. 
“Bob should have been more careful when deploying that change” is not productive, but +neither is “We must rewrite the backend in Haskell.” Instead, management should take the opportunity +to learn the details of how the sociotechnical system works from the point of view of the people who +work with it every day, and take steps to improve it based on this feedback +[[71](/en/ch2#Dekker2017)]. + +# How Important Is Reliability? + +Reliability is not just for nuclear power stations and air traffic control—more mundane applications +are also expected to work reliably. Bugs in business applications cause lost productivity (and legal +risks if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in +terms of lost revenue and damage to reputation. + +In many applications, a temporary outage of a few minutes or even a few hours is tolerable +[[74](/en/ch2#Sabo2023)], +but permanent data loss or corruption would be catastrophic. Consider a parent who stores all their +pictures and videos of their children in your photo application +[[75](/en/ch2#Jurewitz2013)]. How would they +feel if that database was suddenly corrupted? Would they know how to restore it from a backup? + +As another example of how unreliable software can harm people, consider the Post Office Horizon +scandal. Between 1999 and 2019, hundreds of people managing Post Office branches in Britain were +convicted of theft or fraud because the accounting software showed a shortfall in their accounts. +Eventually it became clear that many of these shortfalls were due to bugs in the software, and many +convictions have since been overturned [[76](/en/ch2#Halper2025)]. +What led to this, probably the largest miscarriage of justice in British history, is the fact that +English law assumes that computers operate correctly (and hence, evidence produced by computers is +reliable) unless there is evidence to the contrary +[[77](/en/ch2#Bohm2022)]. 
+Software engineers may laugh at the idea that software could ever be bug-free, but this is little +solace to the people who were wrongfully imprisoned, declared bankrupt, or even committed suicide as +a result of a wrongful conviction due to an unreliable computer system. + +There are situations in which we may choose to sacrifice reliability in order to reduce development +cost (e.g., when developing a prototype product for an unproven market)—but we should be very +conscious of when we are cutting corners and keep in mind the potential consequences. + +# Scalability + +Even if a system is working reliably today, that doesn’t mean it will necessarily work reliably in +the future. One common reason for degradation is increased load: perhaps the system has grown from +10,000 concurrent users to 100,000 concurrent users, or from 1 million to 10 million. Perhaps it is +processing much larger volumes of data than it did before. + +*Scalability* is the term we use to describe a system’s ability to cope with increased load. +Sometimes, when discussing scalability, people make comments along the lines of, “You’re not Google +or Amazon. Stop worrying about scale and just use a relational database.” Whether this maxim applies +to you depends on the type of application you are building. + +If you are building a new product that currently only has a small number of users, perhaps at a +startup, the overriding engineering goal is usually to keep the system as simple and flexible as +possible, so that you can easily modify and adapt the features of your product as you learn more +about customers’ needs [[78](/en/ch2#McKinley2015)]. +In such an environment, it is counterproductive to worry about hypothetical scale that might be +needed in the future: in the best case, investments in scalability are wasted effort and premature +optimization; in the worst case, they lock you into an inflexible design and make it harder to +evolve your application. 
+ +The reason is that scalability is not a one-dimensional label: it is meaningless to say “X is +scalable” or “Y doesn’t scale.” Rather, discussing scalability means considering questions like: + +* “If the system grows in a particular way, what are our options for coping with the growth?” +* “How can we add computing resources to handle the additional load?” +* “Based on current growth projections, when will we hit the limits of our current architecture?” + +If you succeed in making your application popular, and therefore handling a growing amount of load, +you will learn where your performance bottlenecks lie, and therefore you will know along which +dimensions you need to scale. At that point it’s time to start worrying about techniques for +scalability. + +## Describing Load + +First, we need to succinctly describe the current load on the system; only then can we discuss +growth questions (what happens if our load doubles?). Often this will be a measure of throughput: +for example, the number of requests per second to a service, how many gigabytes of new data arrive +per day, or the number of shopping cart checkouts per hour. Sometimes you care about the peak of +some variable quantity, such as the number of simultaneously online users in +[“Case Study: Social Network Home Timelines”](/en/ch2#sec_introduction_twitter). + +Often there are other statistical characteristics of the load that also affect the access patterns +and hence the scalability requirements. For example, you may need to know the ratio of reads to +writes in a database, the hit rate on a cache, or the number of data items per user (for example, +the number of followers in the social network case study). Perhaps the average case is what matters +for you, or perhaps your bottleneck is dominated by a small number of extreme cases. It all depends +on the details of your particular application. 
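
As a rough illustration of the load parameters mentioned above, the following sketch summarizes a request log into average throughput, peak throughput, and the read/write ratio. The `Request` record and the `describe_load` function are hypothetical; which parameters matter depends on the application.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Request:
    timestamp: float  # seconds since the start of the observation period
    is_write: bool


def describe_load(requests, duration_seconds):
    """Summarize a list of requests into a few simple load parameters."""
    # Count requests in each one-second bucket to find the busiest second.
    per_second = Counter(int(r.timestamp) for r in requests)
    writes = sum(1 for r in requests if r.is_write)
    reads = len(requests) - writes
    return {
        "avg_requests_per_second": len(requests) / duration_seconds,
        "peak_requests_per_second": max(per_second.values(), default=0),
        "read_write_ratio": reads / writes if writes else float("inf"),
    }


log = [
    Request(0.1, False), Request(0.2, False), Request(0.3, False),
    Request(0.4, False), Request(0.5, True), Request(1.2, False),
]
summary = describe_load(log, duration_seconds=2)
```

In this toy log the average is 3 requests per second, but the peak second carries 5, which is the kind of gap between average and extreme cases that the text warns about.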
+ +Once you have described the load on your system, you can investigate what happens when the load +increases. You can look at it in two ways: + +* When you increase the load in a certain way and keep the system resources (CPUs, memory, network + bandwidth, etc.) unchanged, how is the performance of your system affected? +* When you increase the load in a certain way, how much do you need to increase the resources if you + want to keep performance unchanged? + +Usually our goal is to keep the performance of the system within the requirements of the SLA +(see [“Use of Response Time Metrics”](/en/ch2#sec_introduction_slo_sla)) while also minimizing the cost of running the system. The greater +the required computing resources, the higher the cost. It might be that some types of hardware are +more cost-effective than others, and these factors may change over time as new types of hardware +become available. + +If you can double the resources in order to handle twice the load, while keeping performance the +same, we say that you have *linear scalability*, and this is considered a good thing. Occasionally +it is possible to handle twice the load with less than double the resources, due to economies of +scale or a better distribution of peak load +[[79](/en/ch2#Warfield2023_ch2), +[80](/en/ch2#Brooker2023multitenancy)]. +Much more likely is that the cost grows faster than linearly, and there may be many reasons for the +inefficiency. For example, if you have a lot of data, then processing a single write request may +involve more work than if you have a small amount of data, even if the size of the request is the +same. + +## Shared-Memory, Shared-Disk, and Shared-Nothing Architecture + +The simplest way of increasing the hardware resources of a service is to move it to a more powerful +machine. Individual CPU cores are no longer getting significantly faster, but you can buy a machine +(or rent a cloud instance) with more CPU cores, more RAM, and more disk space. 
This approach is +called *vertical scaling* or *scaling up*. + +You can get parallelism on a single machine by using multiple processes or threads. All the threads +belonging to the same process can access the same RAM, and hence this approach is also called a +*shared-memory architecture*. The problem with a shared-memory approach is that the cost grows +faster than linearly: a high-end machine with twice the hardware resources typically costs +significantly more than twice as much. And due to bottlenecks, a machine twice the size can often +handle less than twice the load. + +Another approach is the *shared-disk architecture*, which uses several machines with independent +CPUs and RAM, but which stores data on an array of disks that is shared between the machines, which +are connected via a fast network: *Network-Attached Storage* (NAS) or *Storage Area Network* (SAN). +This architecture has traditionally been used for on-premises data warehousing workloads, but +contention and the overhead of locking limit the scalability of the shared-disk approach +[[81](/en/ch2#Stopford2009)]. + +By contrast, the *shared-nothing architecture* +[[82](/en/ch2#Stonebraker1986)] +(also called *horizontal scaling* or *scaling out*) has gained a lot of popularity. In this +approach, we use a distributed system with multiple nodes, each of which has its own CPUs, RAM, and +disks. Any coordination between nodes is done at the software level, via a conventional network. + +The advantages of shared-nothing are that it has the potential to scale linearly, it can use +whatever hardware offers the best price/performance ratio (especially in the cloud), it can more +easily adjust its hardware resources as load increases or decreases, and it can achieve greater +fault tolerance by distributing the system across multiple data centers and regions. 
The downsides +are that it requires explicit sharding (see [Chapter 7](/en/ch7#ch_sharding)), and it incurs all the complexity of +distributed systems ([Chapter 9](/en/ch9#ch_distributed)). + +Some cloud-native database systems use separate services for storage and transaction execution (see +[“Separation of storage and compute”](/en/ch1#sec_introduction_storage_compute)), with multiple compute nodes sharing access to the same +storage service. This model has some similarity to a shared-disk architecture, but it avoids the +scalability problems of older systems: instead of providing a filesystem (NAS) or block device (SAN) +abstraction, the storage service offers a specialized API that is designed for the specific needs of +the database [[83](/en/ch2#Antonopoulos2019_ch2)]. + +## Principles for Scalability + +The architecture of systems that operate at large scale is usually highly specific to the +application—there is no such thing as a generic, one-size-fits-all scalable architecture +(informally known as *magic scaling sauce*). For example, a system that is designed to handle +100,000 requests per second, each 1 kB in size, looks very different from a system that is +designed for 3 requests per minute, each 2 GB in size—even though the two systems have the same +data throughput (100 MB/sec). + +Moreover, an architecture that is appropriate for one level of load is unlikely to cope with 10 +times that load. If you are working on a fast-growing service, it is therefore likely that you will +need to rethink your architecture on every order of magnitude load increase. As the needs of the +application are likely to evolve, it is usually not worth planning future scaling needs more than +one order of magnitude in advance. + +A good general principle for scalability is to break a system down into smaller components that can +operate largely independently from each other. 
This is the underlying principle behind microservices +(see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing +([Link to Come]), and shared-nothing architectures. However, the challenge is in knowing where to +draw the line between things that should be together, and things that should be apart. Design +guidelines for microservices can be found in other books +[[84](/en/ch2#Newman2021_ch2)], +and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding). + +Another good principle is not to make things more complicated than necessary. If a single-machine +database will do the job, it’s probably preferable to a complicated distributed setup. Auto-scaling +systems (which automatically add or remove resources in response to demand) are cool, but if your +load is fairly predictable, a manually scaled system may have fewer operational surprises (see +[“Operations: Automatic or Manual Rebalancing”](/en/ch7#sec_sharding_operations)). A system with five services is simpler than one with fifty. Good +architectures usually involve a pragmatic mixture of approaches. + +# Maintainability + +Software does not wear out or suffer material fatigue, so it does not break in the same ways as +mechanical objects do. But the requirements for an application frequently change, the environment +that the software runs in changes (such as its dependencies and the underlying platform), and it has +bugs that need fixing. + +It is widely recognized that the majority of the cost of software is not in its initial development, +but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures, +adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding +new features [[85](/en/ch2#Ensmenger2016), +[86](/en/ch2#Glass2002)]. + +However, maintenance is also difficult. 
If a system has been successfully running for a long time, +it may well use outdated technologies that not many engineers understand today (such as mainframes +and COBOL code); institutional knowledge of how and why a system was designed in a certain way may +have been lost as people have left the organization; it might be necessary to fix other people’s +mistakes. Moreover, the computer system is often intertwined with the human organization that it +supports, which means that maintenance of such *legacy* systems is as much a people problem as a +technical one [[87](/en/ch2#Bellotti2021)]. + +Every system we create today will one day become a legacy system if it is valuable enough to survive +for a long time. In order to minimize the pain for future generations who need to maintain our +software, we should design it with maintenance concerns in mind. Although we cannot always predict +which decisions might create maintenance headaches in the future, in this book we will pay attention +to several principles that are widely applicable: + +Operability +: Make it easy for the organization to keep the system running smoothly. + +Simplicity +: Make it easy for new engineers to understand the system, by implementing it using well-understood, + consistent patterns and structures, and avoiding unnecessary complexity. + +Evolvability +: Make it easy for engineers to make changes to the system in the future, adapting it and extending + it for unanticipated use cases as requirements change. + +## Operability: Making Life Easy for Operations + +We previously discussed the role of operations in [“Operations in the Cloud Era”](/en/ch1#sec_introduction_operations), and we saw that +human processes are at least as important for reliable operations as software tools. In fact, it has +been suggested that “good operations can often work around the limitations of bad (or incomplete) +software, but good software cannot run reliably with bad operations” +[[60](/en/ch2#Kreps2012_ch1)]. 
+ +In large-scale systems consisting of many thousands of machines, manual maintenance would be +unreasonably expensive, and automation is essential. However, automation can be a two-edged sword: +there will always be edge cases (such as rare failure scenarios) that require manual intervention +from the operations team. Since the cases that cannot be handled automatically are the most complex +issues, greater automation requires a *more* skilled operations team that can resolve those issues +[[88](/en/ch2#Bainbridge1983)]. + +Moreover, if an automated system goes wrong, it is often harder to troubleshoot than a system that +relies on an operator to perform some actions manually. For that reason, it is not the case that +more automation is always better for operability. However, some amount of automation is important, +and the sweet spot will depend on the specifics of your particular application and organization. + +Good operability means making routine tasks easy, allowing the operations team to focus their efforts +on high-value activities. Data systems can do various things to make routine tasks easy, including +[[89](/en/ch2#Hamilton2007)]: + +* Allowing monitoring tools to check the system’s key metrics, and supporting observability tools + (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)) to give insights into the system’s runtime behavior. + A variety of commercial and open source tools can help here + [[90](/en/ch2#Horovits2021)]. 
+* Avoiding dependency on individual machines (allowing machines to be taken down for maintenance + while the system as a whole continues running uninterrupted) +* Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”) +* Providing good default behavior, but also giving administrators the freedom to override defaults when needed +* Self-healing where appropriate, but also giving administrators manual control over the system state when needed +* Exhibiting predictable behavior, minimizing surprises + +## Simplicity: Managing Complexity + +Small software projects can have delightfully simple and expressive code, but as projects get +larger, they often become very complex and difficult to understand. This complexity slows down +everyone who needs to work on the system, further increasing the cost of maintenance. A software +project mired in complexity is sometimes described as a *big ball of mud* +[[91](/en/ch2#Foote1997)]. + +When complexity makes maintenance hard, budgets and schedules are often overrun. In complex +software, there is also a greater risk of introducing bugs when making a change: when the system is +harder for developers to understand and reason about, hidden assumptions, unintended consequences, +and unexpected interactions are more easily overlooked +[[69](/en/ch2#Woods2017)]. +Conversely, reducing complexity greatly improves the maintainability of software, and thus +simplicity should be a key goal for the systems we build. + +Simple systems are easier to understand, and therefore we should try to solve a given problem in the +simplest way possible. Unfortunately, this is easier said than done. Whether something is simple or +not is often a subjective matter of taste, as there is no objective standard of simplicity +[[92](/en/ch2#Brooker2022)]. 
+For example, one system may hide a complex implementation behind a simple interface, whereas another +may have a simple implementation that exposes more internal detail to its users—which one is +simpler? + +One attempt at reasoning about complexity has been to break it down into two categories, *essential* +and *accidental* complexity [[93](/en/ch2#Brooks1995)]. +The idea is that essential complexity is inherent in the problem domain of the application, while +accidental complexity arises only because of limitations of our tooling. Unfortunately, this +distinction is also flawed, because boundaries between the essential and the accidental shift as our +tooling evolves [[94](/en/ch2#Luu2020)]. + +One of the best tools we have for managing complexity is *abstraction*. A good abstraction can hide +a great deal of implementation detail behind a clean, simple-to-understand façade. A good +abstraction can also be used for a wide range of different applications. Not only is this reuse more +efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality +software, as quality improvements in the abstracted component benefit all applications that use it. + +For example, high-level programming languages are abstractions that hide machine code, CPU registers, +and syscalls. SQL is an abstraction that hides complex on-disk and in-memory data structures, +concurrent requests from other clients, and inconsistencies after crashes. Of course, when +programming in a high-level language, we are still using machine code; we are just not using it +*directly*, because the programming language abstraction saves us from having to think about it. + +Abstractions for application code, which aim to reduce its complexity, can be created using +methodologies such as *design patterns* +[[95](/en/ch2#Gamma1994)] +and *domain-driven design* (DDD) [[96](/en/ch2#Evans2003)]. 
+This book is not about such application-specific abstractions, but rather about general-purpose +abstractions on top of which you can build your applications, such as database transactions, +indexes, and event logs. If you want to use techniques such as DDD, you can implement them on top of +the foundations described in this book. + +## Evolvability: Making Change Easy + +It’s extremely unlikely that your system’s requirements will remain unchanged forever. They are much more +likely to be in constant flux: you learn new facts, previously unanticipated use cases emerge, +business priorities change, users request new features, new platforms replace old platforms, legal +or regulatory requirements change, growth of the system forces architectural changes, etc. + +In terms of organizational processes, *Agile* working patterns provide a framework for adapting to +change. The Agile community has also developed technical tools and processes that are helpful when +developing software in a frequently changing environment, such as test-driven development (TDD) and +refactoring. In this book, we search for ways of increasing agility at the level of a system +consisting of several different applications or services with different characteristics. + +The ease with which you can modify a data system, and adapt it to changing requirements, is closely +linked to its simplicity and its abstractions: loosely coupled, simple systems are usually easier to +modify than tightly coupled, complex ones. Since this is such an important idea, we will use a +different word to refer to agility on a data system level: *evolvability* +[[97](/en/ch2#Breivold2008)]. + +One major factor that makes change difficult in large systems is when some action is irreversible, +and therefore that action needs to be taken very carefully +[[98](/en/ch2#Zaninotto2002)]. 
+For example, say you are migrating from one database to another: if you cannot switch back to the +old system in case of problems with the new one, the stakes are much higher than if you can easily go +back. Minimizing irreversibility improves flexibility. + +# Summary + +In this chapter we examined several examples of nonfunctional requirements: performance, +reliability, scalability, and maintainability. Through these topics we have also encountered +principles and terminology that we will need throughout the rest of the book. We started with a case +study of how one might implement home timelines in a social network, which illustrated some of the +challenges that arise at scale. + +We discussed how to measure performance (e.g., using response time percentiles), the load on a +system (e.g., using throughput metrics), and how they are used in SLAs. Scalability is a closely +related concept: that is, ensuring performance stays the same when the load grows. We saw some +general principles for scalability, such as breaking a task down into smaller parts that can operate +independently, and we will dive into deep technical detail on scalability techniques in the +following chapters. + +To achieve reliability, you can use fault tolerance techniques, which allow a system to continue +providing its service even if some component (e.g., a disk, a machine, or another service) is +faulty. We saw examples of hardware faults that can occur, and distinguished them from software +faults, which can be harder to deal with because they are often strongly correlated. Another aspect +of achieving reliability is to build resilience against humans making mistakes, and we saw blameless +postmortems as a technique for learning from incidents. + +Finally, we examined several facets of maintainability, including supporting the work of operations +teams, managing complexity, and making it easy to evolve an application’s functionality over time. 
+There are no easy answers on how to achieve these things, but one thing that can help is to build +applications using well-understood building blocks that provide useful abstractions. The rest of +this book will cover a selection of building blocks that have proved to be valuable in practice. + +##### Footnotes + +##### References + +[[1](/en/ch2#Cvet2016-marker)] Mike Cvet. +[How We Learned to Stop Worrying and Love +Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016. + +[[2](/en/ch2#Krikorian2012_ch2-marker)] Raffi Krikorian. +[Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). +At *QCon San Francisco*, November 2012. +Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK) + +[[3](/en/ch2#Twitter2023-marker)] Twitter. +[Twitter’s +Recommendation Algorithm](https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm). *blog.twitter.com*, March 2023. +Archived at [perma.cc/L5GT-229T](https://perma.cc/L5GT-229T) + +[[4](/en/ch2#Krikorian2013-marker)] Raffi Krikorian. +[New +Tweets per second record, and how!](https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how) *blog.twitter.com*, August 2013. +Archived at [perma.cc/6JZN-XJYN](https://perma.cc/6JZN-XJYN) + +[[5](/en/ch2#Volpert2025-marker)] Jaz Volpert. +[When Imperfect Systems are Good, Actually: +Bluesky’s Lossy Timelines](https://jazco.dev/2025/02/19/imperfection/). *jazco.dev*, February 2025. +Archived at [perma.cc/2PVE-L2MX](https://perma.cc/2PVE-L2MX) + +[[6](/en/ch2#Axon2010_ch2-marker)] Samuel Axon. +[3% of Twitter’s Servers +Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010. +Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX) + +[[7](/en/ch2#Bronson2021-marker)] Nathan Bronson, Abutalib Aghayev, Aleksey +Charapko, and Timothy Zhu. 
+[Metastable +Failures in Distributed Systems](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf). +At *Workshop on Hot Topics in Operating Systems* (HotOS), May 2021. +[doi:10.1145/3458336.3465286](https://doi.org/10.1145/3458336.3465286) + +[[8](/en/ch2#Brooker2021-marker)] Marc Brooker. +[Metastability and Distributed +Systems](https://brooker.co.za/blog/2021/05/24/metastable.html). *brooker.co.za*, May 2021. +Archived at [perma.cc/7FGJ-7XRK](https://perma.cc/7FGJ-7XRK) + +[[9](/en/ch2#Brooker2015-marker)] Marc Brooker. +[Exponential +Backoff And Jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/). *aws.amazon.com*, March 2015. +Archived at [perma.cc/R6MS-AZKH](https://perma.cc/R6MS-AZKH) + +[[10](/en/ch2#Brooker2022backoff-marker)] Marc Brooker. +[What is Backoff For?](https://brooker.co.za/blog/2022/08/11/backoff.html) +*brooker.co.za*, August 2022. +Archived at [perma.cc/PW9N-55Q5](https://perma.cc/PW9N-55Q5) + +[[11](/en/ch2#Nygard2018-marker)] Michael T. Nygard. +[*Release It!*](https://learning.oreilly.com/library/view/release-it-2nd/9781680504552/), +2nd Edition. Pragmatic Bookshelf, January 2018. ISBN: 9781680502398 + +[[12](/en/ch2#Chen2022-marker)] Frank Chen. +[Slowing Down to Speed Up – Circuit Breakers +for Slack’s CI/CD](https://slack.engineering/circuit-breakers/). *slack.engineering*, August 2022. +Archived at [perma.cc/5FGS-ZPH3](https://perma.cc/5FGS-ZPH3) + +[[13](/en/ch2#Brooker2022retries-marker)] Marc Brooker. +[Fixing retries with token buckets and +circuit breakers](https://brooker.co.za/blog/2022/02/28/retries.html). *brooker.co.za*, February 2022. +Archived at [perma.cc/MD6N-GW26](https://perma.cc/MD6N-GW26) + +[[14](/en/ch2#YanacekLoadShedding-marker)] David Yanacek. +[Using load +shedding to avoid overload](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/). Amazon Builders’ Library, *aws.amazon.com*. 
+Archived at [perma.cc/9SAW-68MP](https://perma.cc/9SAW-68MP) + +[[15](/en/ch2#Sackman2016_ch2-marker)] Matthew Sackman. +[Pushing Back](https://wellquite.org/posts/lshift/pushing_back/). +*wellquite.org*, May 2016. +Archived at [perma.cc/3KCZ-RUFY](https://perma.cc/3KCZ-RUFY) + +[[16](/en/ch2#Kopytkov2018-marker)] Dmitry Kopytkov and Patrick Lee. +[Meet Bandaid, +the Dropbox service proxy](https://dropbox.tech/infrastructure/meet-bandaid-the-dropbox-service-proxy). *dropbox.tech*, March 2018. +Archived at [perma.cc/KUU6-YG4S](https://perma.cc/KUU6-YG4S) + +[[17](/en/ch2#Gunawi2018_ch2-marker)] Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, +Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, +Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert +Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. +[Fail-Slow at +Scale: Evidence of Hardware Performance Faults in Large Production Systems](https://www.usenix.org/system/files/conference/fast18/fast18-gunawi.pdf). +At *16th USENIX Conference on File and Storage Technologies*, February 2018. + +[[18](/en/ch2#Brooker2017-marker)] Marc Brooker. +[Is the Mean Really Useless?](https://brooker.co.za/blog/2017/12/28/mean.html) +*brooker.co.za*, December 2017. +Archived at [perma.cc/U5AE-CVEM](https://perma.cc/U5AE-CVEM) + +[[19](/en/ch2#DeCandia2007_ch1-marker)] Giuseppe DeCandia, Deniz Hastorun, Madan +Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter +Vosshall, and Werner Vogels. +[Dynamo: +Amazon’s Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating +Systems Principles* (SOSP), October 2007. +[doi:10.1145/1294261.1294281](https://doi.org/10.1145/1294261.1294281) + +[[20](/en/ch2#Whitenton2020-marker)] Kathryn Whitenton. 
+[The Need for Speed, 23 Years Later](https://www.nngroup.com/articles/the-need-for-speed/). +*nngroup.com*, May 2020. +Archived at [perma.cc/C4ER-LZYA](https://perma.cc/C4ER-LZYA) + +[[21](/en/ch2#Linden2006-marker)] Greg Linden. +[Marissa Mayer at Web 2.0](https://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html). +*glinden.blogspot.com*, November 2006. +Archived at [perma.cc/V7EA-3VXB](https://perma.cc/V7EA-3VXB) + +[[22](/en/ch2#Brutlag2009-marker)] Jake Brutlag. +[Speed Matters for Google +Web Search](https://services.google.com/fh/files/blogs/google_delayexp.pdf). *services.google.com*, June 2009. +Archived at [perma.cc/BK7R-X7M2](https://perma.cc/BK7R-X7M2) + +[[23](/en/ch2#Schurman2009-marker)] Eric Schurman and Jake Brutlag. +[Performance Related Changes and their User Impact](https://www.youtube.com/watch?v=bQSE51-gr2s). +Talk at *Velocity 2009*. + +[[24](/en/ch2#Akamai2017-marker)] Akamai Technologies, Inc. +[The +State of Online Retail Performance](https://web.archive.org/web/20210729180749/https%3A//www.akamai.com/us/en/multimedia/documents/report/akamai-state-of-online-retail-performance-spring-2017.pdf). *akamai.com*, April 2017. +Archived at [perma.cc/UEK2-HYCS](https://perma.cc/UEK2-HYCS) + +[[25](/en/ch2#Bai2017-marker)] Xiao Bai, Ioannis Arapakis, B. Barla Cambazoglu, and Ana Freire. +[Understanding and Leveraging the Impact of +Response Latency on User Behaviour in Web Search](https://iarapakis.github.io/papers/TOIS17.pdf). *ACM Transactions on Information Systems*, +volume 36, issue 2, article 21, April 2018. +[doi:10.1145/3106372](https://doi.org/10.1145/3106372) + +[[26](/en/ch2#Dean2013_ch2-marker)] Jeffrey Dean and Luiz André Barroso. +[The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/). +*Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013. +[doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794) + +[[27](/en/ch2#Hidalgo2020-marker)] Alex Hidalgo. 
+[*Implementing +Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets*](https://www.oreilly.com/library/view/implementing-service-level/9781492076803/). O’Reilly +Media, September 2020. ISBN: 1492076813 + +[[28](/en/ch2#Mogul2019-marker)] Jeffrey C. Mogul and John Wilkes. +[Nines are Not Enough: Meaningful Metrics for +Clouds](https://research.google/pubs/pub48033/). At *17th Workshop on Hot Topics in Operating Systems* (HotOS), May 2019. +[doi:10.1145/3317550.3321432](https://doi.org/10.1145/3317550.3321432) + +[[29](/en/ch2#Hauer2020-marker)] Tamás Hauer, Philipp Hoffmann, John Lunney, Dan Ardelean, and Amer Diwan. +[Meaningful Availability](https://www.usenix.org/conference/nsdi20/presentation/hauer). +At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020. + +[[30](/en/ch2#Dunning2021-marker)] Ted Dunning. +[The t-digest: +Efficient estimates of distributions](https://www.sciencedirect.com/science/article/pii/S2665963820300403). *Software Impacts*, volume 7, article 100049, February 2021. +[doi:10.1016/j.simpa.2020.100049](https://doi.org/10.1016/j.simpa.2020.100049) + +[[31](/en/ch2#Kohn2021-marker)] David Kohn. +[How +percentile approximation works (and why it’s more useful than averages)](https://www.timescale.com/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/). *timescale.com*, +September 2021. Archived at [perma.cc/3PDP-NR8B](https://perma.cc/3PDP-NR8B) + +[[32](/en/ch2#Hartmann2020-marker)] Heinrich Hartmann and Theo Schlossnagle. +[Circllhist — A Log-Linear Histogram Data Structure +for IT Infrastructure Monitoring](https://arxiv.org/pdf/2001.06561.pdf). *arxiv.org*, January 2020. + +[[33](/en/ch2#Masson2019-marker)] Charles Masson, Jee E. Rim, and Homin K. Lee. +[DDSketch: A Fast and Fully-Mergeable +Quantile Sketch with Relative-Error Guarantees](https://www.vldb.org/pvldb/vol12/p2195-masson.pdf). 
*Proceedings of the VLDB Endowment*, +volume 12, issue 12, pages 2195–2205, August 2019. +[doi:10.14778/3352063.3352135](https://doi.org/10.14778/3352063.3352135) + +[[34](/en/ch2#Schwartz2015-marker)] Baron Schwartz. +[Why +Percentiles Don’t Work the Way You Think](https://orangematter.solarwinds.com/2016/11/18/why-percentiles-dont-work-the-way-you-think/). *solarwinds.com*, November 2016. +Archived at [perma.cc/469T-6UGB](https://perma.cc/469T-6UGB) + +[[35](/en/ch2#Heimerdinger1992-marker)] Walter L. Heimerdinger and Charles B. Weinstock. +[A Conceptual +Framework for System Fault Tolerance](https://resources.sei.cmu.edu/asset_files/TechnicalReport/1992_005_001_16112.pdf). Technical Report CMU/SEI-92-TR-033, Software Engineering +Institute, Carnegie Mellon University, October 1992. +Archived at [perma.cc/GD2V-DMJW](https://perma.cc/GD2V-DMJW) + +[[36](/en/ch2#Gaertner1999-marker)] Felix C. Gärtner. +[Fundamentals of fault-tolerant +distributed computing in asynchronous environments](https://dl.acm.org/doi/pdf/10.1145/311531.311532). *ACM Computing Surveys*, volume 31, +issue 1, pages 1–26, March 1999. +[doi:10.1145/311531.311532](https://doi.org/10.1145/311531.311532) + +[[37](/en/ch2#Avizienis2004-marker)] Algirdas Avižienis, Jean-Claude Laprie, Brian Randell, +and Carl Landwehr. +[Basic Concepts and Taxonomy of Dependable and Secure +Computing](https://hdl.handle.net/1903/6459). *IEEE Transactions on Dependable and Secure Computing*, volume 1, issue 1, +January 2004. [doi:10.1109/TDSC.2004.2](https://doi.org/10.1109/TDSC.2004.2) + +[[38](/en/ch2#Yuan2014-marker)] Ding Yuan, Yu Luo, Xin Zhuang, Guilherme +Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. +[Simple +Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed +Data-Intensive Systems](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf). 
At *11th USENIX Symposium on Operating Systems Design +and Implementation* (OSDI), October 2014. + +[[39](/en/ch2#Rosenthal2020-marker)] Casey Rosenthal and Nora Jones. +[*Chaos +Engineering*](https://learning.oreilly.com/library/view/chaos-engineering/9781492043850/). O’Reilly Media, April 2020. ISBN: 9781492043867 + +[[40](/en/ch2#Pinheiro2007-marker)] Eduardo Pinheiro, Wolf-Dietrich Weber, and +Luiz André Barroso. +[Failure +Trends in a Large Disk Drive Population](https://www.usenix.org/legacy/events/fast07/tech/full_papers/pinheiro/pinheiro_old.pdf). At *5th USENIX Conference on File and Storage +Technologies* (FAST), February 2007. + +[[41](/en/ch2#Schroeder2007-marker)] Bianca Schroeder and Garth A. Gibson. +[Disk failures +in the real world: What does an MTTF of 1,000,000 hours mean to you?](https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder.pdf) At *5th USENIX +Conference on File and Storage Technologies* (FAST), February 2007. + +[[42](/en/ch2#Klein2021-marker)] Andy Klein. +[Backblaze Drive Stats +for Q2 2021](https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2021/). *backblaze.com*, August 2021. +Archived at [perma.cc/2943-UD5E](https://perma.cc/2943-UD5E) + +[[43](/en/ch2#Narayanan2016-marker)] Iyswarya Narayanan, Di Wang, Myeongjae Jeon, +Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, and +Kushagra Vaid. +[SSD +Failures in Datacenters: What? When? and Why?](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/a7-narayanan.pdf) At *9th ACM International on Systems and +Storage Conference* (SYSTOR), June 2016. +[doi:10.1145/2928275.2928278](https://doi.org/10.1145/2928275.2928278) + +[[44](/en/ch2#Alibaba2019_ch2-marker)] Alibaba Cloud Storage Team. +[Storage System Design Analysis: Factors +Affecting NVMe SSD Performance (1)](https://www.alibabacloud.com/blog/594375). *alibabacloud.com*, January 2019. 
Archived at +[archive.org](https://web.archive.org/web/20230522005034/https%3A//www.alibabacloud.com/blog/594375) + +[[45](/en/ch2#Schroeder2016_ch2-marker)] Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. +[Flash +Reliability in Production: The Expected and the Unexpected](https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf). At *14th USENIX Conference on +File and Storage Technologies* (FAST), February 2016. + +[[46](/en/ch2#Alter2019-marker)] Jacob Alter, Ji Xue, Alma Dimnaku, and Evgenia Smirni. +[SSD failures in the field: symptoms, +causes, and prediction models](https://dl.acm.org/doi/pdf/10.1145/3295500.3356172). At *International Conference for High Performance Computing, +Networking, Storage and Analysis* (SC), November 2019. +[doi:10.1145/3295500.3356172](https://doi.org/10.1145/3295500.3356172) + +[[47](/en/ch2#Ford2010-marker)] Daniel Ford, François Labelle, Florentina I. +Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. +[Availability in +Globally Distributed Storage Systems](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Ford.pdf). At *9th USENIX Symposium on Operating Systems Design +and Implementation* (OSDI), October 2010. + +[[48](/en/ch2#Vishwanath2010-marker)] Kashi Venkatesh Vishwanath and Nachiappan Nagappan. +[Characterizing +Cloud Computing Hardware Reliability](https://www.microsoft.com/en-us/research/wp-content/uploads/2010/06/socc088-vishwanath.pdf). At *1st ACM Symposium on Cloud Computing* (SoCC), +June 2010. [doi:10.1145/1807128.1807161](https://doi.org/10.1145/1807128.1807161) + +[[49](/en/ch2#Hochschild2021-marker)] Peter H. Hochschild, Paul Turner, Jeffrey C. +Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, and Amin Vahdat. +[Cores that +don’t count](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf). At *Workshop on Hot Topics in Operating Systems* (HotOS), June 2021. 
+[doi:10.1145/3458336.3465297](https://doi.org/10.1145/3458336.3465297) + +[[50](/en/ch2#Dixit2021-marker)] Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, +Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, and Sriram Sankar. +[Silent Data Corruptions at Scale](https://arxiv.org/abs/2102.11245). +*arXiv:2102.11245*, February 2021. + +[[51](/en/ch2#Behrens2015-marker)] Diogo Behrens, Marco Serafini, Sergei Arnautov, Flavio P. +Junqueira, and Christof Fetzer. +[Scalable +Error Isolation for Distributed Systems](https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/behrens). At *12th USENIX Symposium on Networked Systems +Design and Implementation* (NSDI), May 2015. + +[[52](/en/ch2#Schroeder2009-marker)] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. +[DRAM +Errors in the Wild: A Large-Scale Field Study](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf). At *11th International Joint Conference on +Measurement and Modeling of Computer Systems* (SIGMETRICS), June 2009. +[doi:10.1145/1555349.1555372](https://doi.org/10.1145/1555349.1555372) + +[[53](/en/ch2#Kim2014-marker)] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, +Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu. +[Flipping Bits in Memory Without +Accessing Them: An Experimental Study of DRAM Disturbance Errors](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf). At *41st Annual +International Symposium on Computer Architecture* (ISCA), June 2014. +[doi:10.5555/2665671.2665726](https://doi.org/10.5555/2665671.2665726) + +[[54](/en/ch2#Bray2021-marker)] Tim Bray. +[Worst Case](https://www.tbray.org/ongoing/When/202x/2021/10/08/The-WOrst-Case). +*tbray.org*, October 2021. +Archived at [perma.cc/4QQM-RTHN](https://perma.cc/4QQM-RTHN) + +[[55](/en/ch2#AbduJyothi2021-marker)] Sangeetha Abdu Jyothi. 
+[Solar Superstorms: Planning for +an Internet Apocalypse](https://ics.uci.edu/~sabdujyo/papers/sigcomm21-cme.pdf). At *ACM SIGCOMM Conference*, August 2021. +[doi:10.1145/3452296.3472916](https://doi.org/10.1145/3452296.3472916) + +[[56](/en/ch2#Cockcroft2019-marker)] Adrian Cockcroft. +[Failure +Modes and Continuous Resilience](https://adrianco.medium.com/failure-modes-and-continuous-resilience-6553078caad5). *adrianco.medium.com*, November 2019. +Archived at [perma.cc/7SYS-BVJP](https://perma.cc/7SYS-BVJP) + +[[57](/en/ch2#Han2021-marker)] Shujie Han, Patrick P. C. Lee, Fan Xu, Yi Liu, Cheng He, and Jiongzhou Liu. +[An In-Depth Study of Correlated +Failures in Production SSD-Based Data Centers](https://www.usenix.org/conference/fast21/presentation/han). At *19th USENIX Conference on File and Storage +Technologies* (FAST), February 2021. + +[[58](/en/ch2#Nightingale2011-marker)] Edmund B. Nightingale, John R. Douceur, and Vince Orgovan. +[Cycles, Cells and +Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs](https://eurosys2011.cs.uni-salzburg.at/pdf/eurosys2011-nightingale.pdf). +At *6th European Conference on Computer Systems* (EuroSys), April 2011. +[doi:10.1145/1966445.1966477](https://doi.org/10.1145/1966445.1966477) + +[[59](/en/ch2#Gunawi2014-marker)] Haryadi S. Gunawi, Mingzhe Hao, Tanakorn +Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, +Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. +[What Bugs Live in the Cloud?](https://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf) +At *5th ACM Symposium on Cloud Computing* (SoCC), November 2014. +[doi:10.1145/2670979.2670986](https://doi.org/10.1145/2670979.2670986) + +[[60](/en/ch2#Kreps2012_ch1-marker)] Jay Kreps. +[Getting +Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. 
+Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW) + +[[61](/en/ch2#Minar2012_ch1-marker)] Nelson Minar. +[Leap Second Crashes Half +the Internet](https://www.somebits.com/weblog/tech/bad/leap-second-2012.html). *somebits.com*, July 2012. +Archived at [perma.cc/2WB8-D6EU](https://perma.cc/2WB8-D6EU) + +[[62](/en/ch2#HPE2019_ch2-marker)] Hewlett Packard Enterprise. +[Support +Alerts – Customer Bulletin a00092491en\_us](https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us). *support.hpe.com*, November 2019. +Archived at [perma.cc/S5F6-7ZAC](https://perma.cc/S5F6-7ZAC) + +[[63](/en/ch2#Hochstein2020-marker)] Lorin Hochstein. +[awesome limits](https://github.com/lorin/awesome-limits). *github.com*, +November 2020. Archived at [perma.cc/3R5M-E5Q4](https://perma.cc/3R5M-E5Q4) + +[[64](/en/ch2#McCaffrey2015-marker)] Caitie McCaffrey. +[Clients +Are Jerks: AKA How Halo 4 DoSed the Services at Launch & How We Survived](https://www.caitiem.com/2015/06/23/clients-are-jerks-aka-how-halo-4-dosed-the-services-at-launch-how-we-survived/). *caitiem.com*, +June 2015. Archived at [perma.cc/MXX4-W373](https://perma.cc/MXX4-W373) + +[[65](/en/ch2#Tang2023-marker)] Lilia Tang, +Chaitanya Bhandari, Yongle Zhang, Anna Karanika, Shuyang Ji, Indranil Gupta, and Tianyin Xu. +[Fail through the Cracks: Cross-System +Interaction Failures in Modern Cloud Systems](https://tianyin.github.io/pub/csi-failures.pdf). At *18th European Conference on Computer +Systems* (EuroSys), May 2023. +[doi:10.1145/3552326.3587448](https://doi.org/10.1145/3552326.3587448) + +[[66](/en/ch2#Ulrich2016-marker)] Mike Ulrich. +[Addressing Cascading Failures](https://sre.google/sre-book/addressing-cascading-failures/). +In Betsy Beyer, Jennifer Petoff, Chris Jones, and Niall Richard Murphy (ed). +[*Site +Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/). +O’Reilly Media, 2016. 
ISBN: 9781491929124 + +[[67](/en/ch2#Fassbender2022-marker)] Harri Faßbender. +[Cascading +failures in large-scale distributed systems](https://blog.mi.hdm-stuttgart.de/index.php/2022/03/03/cascading-failures-in-large-scale-distributed-systems/). *blog.mi.hdm-stuttgart.de*, March 2022. +Archived at [perma.cc/K7VY-YJRX](https://perma.cc/K7VY-YJRX) + +[[68](/en/ch2#Cook2000-marker)] Richard I. Cook. +[How Complex +Systems Fail](https://www.adaptivecapacitylabs.com/HowComplexSystemsFail.pdf). Cognitive Technologies Laboratory, April 2000. +Archived at [perma.cc/RDS6-2YVA](https://perma.cc/RDS6-2YVA) + +[[69](/en/ch2#Woods2017-marker)] David D. Woods. +[STELLA: Report from the SNAFUcatchers Workshop on Coping +With Complexity](https://snafucatchers.github.io/). *snafucatchers.github.io*, March 2017. Archived at +[archive.org](https://web.archive.org/web/20230306130131/https%3A//snafucatchers.github.io/) + +[[70](/en/ch2#Oppenheimer2003-marker)] David Oppenheimer, Archana Ganapathi, and David A. Patterson. +[Why +Do Internet Services Fail, and What Can Be Done About It?](https://static.usenix.org/events/usits03/tech/full_papers/oppenheimer/oppenheimer.pdf) At *4th USENIX Symposium on +Internet Technologies and Systems* (USITS), March 2003. + +[[71](/en/ch2#Dekker2017-marker)] Sidney Dekker. +[*The Field +Guide to Understanding ‘Human Error’, 3rd Edition*](https://learning.oreilly.com/library/view/the-field-guide/9781317031833/). CRC Press, November 2017. +ISBN: 9781472439055 + +[[72](/en/ch2#Dekker2011-marker)] Sidney Dekker. +[*Drift +into Failure: From Hunting Broken Components to Understanding Complex Systems*](https://www.taylorfrancis.com/books/mono/10.1201/9781315257396/drift-failure-sidney-dekker). +CRC Press, 2011. ISBN: 9781315257396 + +[[73](/en/ch2#Allspaw2012-marker)] John Allspaw. +[Blameless PostMortems and a Just +Culture](https://www.etsy.com/codeascraft/blameless-postmortems/). *etsy.com*, May 2012. 
+Archived at [perma.cc/YMJ7-NTAP](https://perma.cc/YMJ7-NTAP) + +[[74](/en/ch2#Sabo2023-marker)] Itzy Sabo. +[Uptime +Guarantees — A Pragmatic Perspective](https://world.hey.com/itzy/uptime-guarantees-a-pragmatic-perspective-736d7ea4). *world.hey.com*, March 2023. +Archived at [perma.cc/F7TU-78JB](https://perma.cc/F7TU-78JB) + +[[75](/en/ch2#Jurewitz2013-marker)] Michael Jurewitz. +[The Human Impact of Bugs](http://jury.me/blog/2013/3/14/the-human-impact-of-bugs). +*jury.me*, March 2013. +Archived at [perma.cc/5KQ4-VDYL](https://perma.cc/5KQ4-VDYL) + +[[76](/en/ch2#Halper2025-marker)] Mark Halper. +[How +Software Bugs led to ‘One of the Greatest Miscarriages of Justice’ in British History](https://cacm.acm.org/news/how-software-bugs-led-to-one-of-the-greatest-miscarriages-of-justice-in-british-history/). +*Communications of the ACM*, January 2025. +[doi:10.1145/3703779](https://doi.org/10.1145/3703779) + +[[77](/en/ch2#Bohm2022-marker)] Nicholas Bohm, James Christie, Peter Bernard Ladkin, +Bev Littlewood, Paul Marshall, Stephen Mason, Martin Newby, Steven J. Murdoch, Harold Thimbleby, and Martyn Thomas. +[The +legal rule that computers are presumed to be operating correctly – unforeseen and unjust +consequences](https://www.benthamsgaze.org/wp-content/uploads/2022/06/briefing-presumption-that-computers-are-reliable.pdf). Briefing note, *benthamsgaze.org*, June 2022. +Archived at [perma.cc/WQ6X-TMW4](https://perma.cc/WQ6X-TMW4) + +[[78](/en/ch2#McKinley2015-marker)] Dan McKinley. +[Choose Boring Technology](https://mcfunley.com/choose-boring-technology). +*mcfunley.com*, March 2015. +Archived at [perma.cc/7QW7-J4YP](https://perma.cc/7QW7-J4YP) + +[[79](/en/ch2#Warfield2023_ch2-marker)] Andy Warfield. +[Building +and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. 
+Archived at [perma.cc/7LPK-TP7V](https://perma.cc/7LPK-TP7V) + +[[80](/en/ch2#Brooker2023multitenancy-marker)] Marc Brooker. +[Surprising Scalability of +Multitenancy](https://brooker.co.za/blog/2023/03/23/economics.html). *brooker.co.za*, March 2023. +Archived at [perma.cc/ZZD9-VV8T](https://perma.cc/ZZD9-VV8T) + +[[81](/en/ch2#Stopford2009-marker)] Ben Stopford. +[Shared +Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/). *benstopford.com*, +November 2009. Archived at [perma.cc/7BXH-EDUR](https://perma.cc/7BXH-EDUR) + +[[82](/en/ch2#Stonebraker1986-marker)] Michael Stonebraker. +[The Case for Shared Nothing](https://dsf.berkeley.edu/papers/hpts85-nothing.pdf). +*IEEE Database Engineering Bulletin*, volume 9, issue 1, pages 4–9, March 1986. + +[[83](/en/ch2#Antonopoulos2019_ch2-marker)] Panagiotis Antonopoulos, +Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald +Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya +Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. +[Socrates: The +New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data* +(SIGMOD), pages 1743–1756, June 2019. +[doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047) + +[[84](/en/ch2#Newman2021_ch2-marker)] Sam Newman. +[*Building +Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O’Reilly Media, 2021. ISBN: 9781492034025 + +[[85](/en/ch2#Ensmenger2016-marker)] Nathan Ensmenger. +[When +Good Software Goes Bad: The Surprising Durability of an Ephemeral Technology](https://themaintainers.wpengine.com/wp-content/uploads/2021/04/ensmenger-maintainers-v2.pdf). +At *The Maintainers Conference*, April 2016. 
+Archived at [perma.cc/ZXT4-HGZB](https://perma.cc/ZXT4-HGZB) + +[[86](/en/ch2#Glass2002-marker)] Robert L. Glass. +[*Facts and +Fallacies of Software Engineering*](https://learning.oreilly.com/library/view/facts-and-fallacies/0321117425/). +Addison-Wesley Professional, October 2002. ISBN: 9780321117427 + +[[87](/en/ch2#Bellotti2021-marker)] Marianne Bellotti. +[*Kill It with +Fire*](https://learning.oreilly.com/library/view/kill-it-with/9781098128883/). No Starch Press, April 2021. ISBN: 9781718501188 + +[[88](/en/ch2#Bainbridge1983-marker)] Lisanne Bainbridge. +[Ironies of +automation](https://www.adaptivecapacitylabs.com/IroniesOfAutomation-Bainbridge83.pdf). *Automatica*, volume 19, issue 6, pages 775–779, November 1983. +[doi:10.1016/0005-1098(83)90046-8](https://doi.org/10.1016/0005-1098%2883%2990046-8) + +[[89](/en/ch2#Hamilton2007-marker)] James Hamilton. +[On +Designing and Deploying Internet-Scale Services](https://www.usenix.org/legacy/events/lisa07/tech/full_papers/hamilton/hamilton.pdf). At *21st Large Installation +System Administration Conference* (LISA), November 2007. + +[[90](/en/ch2#Horovits2021-marker)] Dotan Horovits. +[Open Source +for Better Observability](https://horovits.medium.com/open-source-for-better-observability-8c65b5630561). *horovits.medium.com*, October 2021. +Archived at [perma.cc/R2HD-U2ZT](https://perma.cc/R2HD-U2ZT) + +[[91](/en/ch2#Foote1997-marker)] Brian Foote and Joseph Yoder. +[Big Ball of Mud](http://www.laputan.org/pub/foote/mud.pdf). At +*4th Conference on Pattern Languages of Programs* (PLoP), September 1997. +Archived at [perma.cc/4GUP-2PBV](https://perma.cc/4GUP-2PBV) + +[[92](/en/ch2#Brooker2022-marker)] Marc Brooker. +[What is a simple system?](https://brooker.co.za/blog/2022/05/03/simplicity.html) +*brooker.co.za*, May 2022. +Archived at [perma.cc/U72T-BFVE](https://perma.cc/U72T-BFVE) + +[[93](/en/ch2#Brooks1995-marker)] Frederick P. Brooks. 
+[No Silver Bullet – Essence and +Accident in Software Engineering](https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf). In +[*The Mythical +Man-Month*](https://www.oreilly.com/library/view/mythical-man-month-the/0201835959/), Anniversary edition, Addison-Wesley, 1995. ISBN: 9780201835953 + +[[94](/en/ch2#Luu2020-marker)] Dan Luu. +[Against essential and accidental complexity](https://danluu.com/essential-complexity/). +*danluu.com*, December 2020. +Archived at [perma.cc/H5ES-69KC](https://perma.cc/H5ES-69KC) + +[[95](/en/ch2#Gamma1994-marker)] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. +[*Design Patterns: +Elements of Reusable Object-Oriented Software*](https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/). Addison-Wesley Professional, October 1994. +ISBN: 9780201633610 + +[[96](/en/ch2#Evans2003-marker)] Eric Evans. +[*Domain-Driven +Design: Tackling Complexity in the Heart of Software*](https://learning.oreilly.com/library/view/domain-driven-design-tackling/0321125215/). Addison-Wesley Professional, August 2003. +ISBN: 9780321125217 + +[[97](/en/ch2#Breivold2008-marker)] Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson. +[Analyzing Software Evolvability](https://www.es.mdh.se/pdf_publications/1251.pdf). +At *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008. +[doi:10.1109/COMPSAC.2008.50](https://doi.org/10.1109/COMPSAC.2008.50) + +[[98](/en/ch2#Zaninotto2002-marker)] Enrico Zaninotto. +[From X programming to the X organisation](https://martinfowler.com/articles/zaninotto.pdf). +At *XP Conference*, May 2002. +Archived at [perma.cc/R9AR-QCKZ](https://perma.cc/R9AR-QCKZ) -1. Edgar F. Codd: “[A Relational Model of Data for Large Shared Data Banks](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf),” *Communications of the ACM*, volume 13, number 6, pages 377–387, June 1970. [doi:10.1145/362384.362685](http://dx.doi.org/10.1145/362384.362685) -1.
Michael Stonebraker and Joseph M. Hellerstein: “[What Goes Around Comes Around](http://mitpress2.mit.edu/books/chapters/0262693143chapm1.pdf),” in *Readings in Database Systems*, 4th edition, MIT Press, pages 2–41, 2005. ISBN: 978-0-262-69314-1 -1. Pramod J. Sadalage and Martin Fowler: *NoSQL Distilled*. Addison-Wesley, August 2012. ISBN: 978-0-321-82662-6 -1. Eric Evans: “[NoSQL: What's in a Name?](https://web.archive.org/web/20190623045155/http://blog.sym-link.com/2009/10/30/nosql_whats_in_a_name.html),” *blog.sym-link.com*, October 30, 2009. -1. James Phillips: “[Surprises in Our NoSQL Adoption Survey](http://blog.couchbase.com/nosql-adoption-survey-surprises),” *blog.couchbase.com*, February 8, 2012. -1. Michael Wagner: *SQL/XML:2006 – Evaluierung der Standardkonformität ausgewählter Datenbanksysteme*. Diplomica Verlag, Hamburg, 2010. ISBN: 978-3-836-64609-3 -1. “[XML Data (SQL Server)](https://docs.microsoft.com/en-us/sql/relational-databases/xml/xml-data-sql-server?view=sql-server-ver15),” SQL Server documentation, *docs.microsoft.com*, 2013. -1. “[PostgreSQL 9.3.1 Documentation](http://www.postgresql.org/docs/9.3/static/index.html),” The PostgreSQL Global Development Group, 2013. -1. “[The MongoDB 2.4 Manual](http://docs.mongodb.org/manual/),” MongoDB, Inc., 2013. -1. “[RethinkDB 1.11 Documentation](http://www.rethinkdb.com/docs/),” *rethinkdb.com*, 2013. -1. “[Apache CouchDB 1.6 Documentation](http://docs.couchdb.org/en/latest/),” *docs.couchdb.org*, 2014. -1. Lin Qiao, Kapil Surlaker, Shirshanka Das, et al.: “[On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform](http://www.slideshare.net/amywtang/espresso-20952131),” at *ACM International Conference on Management of Data* (SIGMOD), June 2013. -1. Rick Long, Mark Harrington, Robert Hain, and Geoff Nicholls: [*IMS Primer*](http://www.redbooks.ibm.com/redbooks/pdfs/sg245352.pdf). IBM Redbook SG24-5352-00, IBM International Technical Support Organization, January 2000. -1. Stephen D. 
Bartlett: “[IBM’s IMS—Myths, Realities, and Opportunities](https://public.dhe.ibm.com/software/data/ims/pdf/TCG2013015LI.pdf),” The Clipper Group Navigator, TCG2013015LI, July 2013. -1. Sarah Mei: “[Why You Should Never Use MongoDB](http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/),” *sarahmei.com*, November 11, 2013. -1. J. S. Knowles and D. M. R. Bell: “The CODASYL Model,” in *Databases—Role and Structure: An Advanced Course*, edited by P. M. Stocker, P. M. D. Gray, and M. P. Atkinson, pages 19–56, Cambridge University Press, 1984. ISBN: 978-0-521-25430-4 -1. Charles W. Bachman: “[The Programmer as Navigator](http://dl.acm.org/citation.cfm?id=362534),” *Communications of the ACM*, volume 16, number 11, pages 653–658, November 1973. [doi:10.1145/355611.362534](http://dx.doi.org/10.1145/355611.362534) -1. Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton: “[Architecture of a Database System](http://db.cs.berkeley.edu/papers/fntdb07-architecture.pdf),” *Foundations and Trends in Databases*, volume 1, number 2, pages 141–259, November 2007. [doi:10.1561/1900000002](http://dx.doi.org/10.1561/1900000002) -1. Sandeep Parikh and Kelly Stirman: “[Schema Design for Time Series Data in MongoDB](http://blog.mongodb.org/post/65517193370/schema-design-for-time-series-data-in-mongodb),” *blog.mongodb.org*, October 30, 2013. -1. Martin Fowler: “[Schemaless Data Structures](http://martinfowler.com/articles/schemaless/),” *martinfowler.com*, January 7, 2013. -1. Amr Awadallah: “[Schema-on-Read vs. Schema-on-Write](http://www.slideshare.net/awadallah/schemaonread-vs-schemaonwrite),” at *Berkeley EECS RAD Lab Retreat*, Santa Cruz, CA, May 2009. -1. Martin Odersky: “[The Trouble with Types](http://www.infoq.com/presentations/data-types-issues),” at *Strange Loop*, September 2013. -1. 
Conrad Irwin: “[MongoDB—Confessions of a PostgreSQL Lover](https://speakerdeck.com/conradirwin/mongodb-confessions-of-a-postgresql-lover),” at *HTML5DevConf*, October 2013. -1. “[Percona Toolkit Documentation: pt-online-schema-change](http://www.percona.com/doc/percona-toolkit/2.2/pt-online-schema-change.html),” Percona Ireland Ltd., 2013. -1. Rany Keddo, Tobias Bielohlawek, and Tobias Schmidt: “[Large Hadron Migrator](https://github.com/soundcloud/lhm),” SoundCloud, 2013. -1. Shlomi Noach: “[gh-ost: GitHub's Online Schema Migration Tool for MySQL](http://githubengineering.com/gh-ost-github-s-online-migration-tool-for-mysql/),” *githubengineering.com*, August 1, 2016. -1. James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: “[Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/),” at *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), October 2012. -1. Donald K. Burleson: “[Reduce I/O with Oracle Cluster Tables](https://web.archive.org/web/20231207233228/http://www.dba-oracle.com/oracle_tip_hash_index_cluster_table.htm),” *dba-oracle.com*. -1. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.: “[Bigtable: A Distributed Storage System for Structured Data](https://research.google/pubs/pub27898/),” at *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006. -1. Bobbie J. Cochrane and Kathy A. McKnight: “[DB2 JSON Capabilities, Part 1: Introduction to DB2 JSON](https://web.archive.org/web/20180516203043/https://www.ibm.com/developerworks/data/library/techarticle/dm-1306nosqlforjson1/),” IBM developerWorks, June 20, 2013. -1. Herb Sutter: “[The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software](http://www.gotw.ca/publications/concurrency-ddj.htm),” *Dr. Dobb's Journal*, volume 30, number 3, pages 202-210, March 2005. -1. Joseph M. 
Hellerstein: “[The Declarative Imperative: Experiences and Conjectures in Distributed Logic](http://www.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf),” Electrical Engineering and Computer Sciences, University of California at Berkeley, Tech report UCB/EECS-2010-90, June 2010. -1. Jeffrey Dean and Sanjay Ghemawat: “[MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/pub62/),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004. -1. Craig Kerstiens: “[JavaScript in Your Postgres](https://blog.heroku.com/javascript_in_your_postgres),” *blog.heroku.com*, June 5, 2013. -1. Nathan Bronson, Zach Amsden, George Cabrera, et al.: “[TAO: Facebook’s Distributed Data Store for the Social Graph](https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson),” at *USENIX Annual Technical Conference* (USENIX ATC), June 2013. -1. “[Apache TinkerPop3.2.3 Documentation](http://tinkerpop.apache.org/docs/3.2.3/reference/),” *tinkerpop.apache.org*, October 2016. -1. “[The Neo4j Manual v2.0.0](http://docs.neo4j.org/chunked/2.0.0/index.html),” Neo Technology, 2013. -1. Emil Eifrem: [Twitter correspondence](https://twitter.com/emileifrem/status/419107961512804352), January 3, 2014. -1. David Beckett and Tim Berners-Lee: “[Turtle – Terse RDF Triple Language](http://www.w3.org/TeamSubmission/turtle/),” W3C Team Submission, March 28, 2011. -1. “[Datomic Development Resources](http://docs.datomic.com/),” Metadata Partners, LLC, 2013. -1. W3C RDF Working Group: “[Resource Description Framework (RDF)](http://www.w3.org/RDF/),” *w3.org*, 10 February 2004. -1. “[Apache Jena](http://jena.apache.org/),” Apache Software Foundation. -1. Steve Harris, Andy Seaborne, and Eric Prud'hommeaux: “[SPARQL 1.1 Query Language](http://www.w3.org/TR/sparql11-query/),” W3C Recommendation, March 2013. -1. Todd J. 
Green, Shan Shan Huang, Boon Thau Loo, and Wenchao Zhou: “[Datalog and Recursive Query Processing](http://blogs.evergreen.edu/sosw/files/2014/04/Green-Vol5-DBS-017.pdf),” *Foundations and Trends in Databases*, volume 5, number 2, pages 105–195, November 2013. [doi:10.1561/1900000017](http://dx.doi.org/10.1561/1900000017) -1. Stefano Ceri, Georg Gottlob, and Letizia Tanca: “[What You Always Wanted to Know About Datalog (And Never Dared to Ask)](https://www.researchgate.net/profile/Letizia_Tanca/publication/3296132_What_you_always_wanted_to_know_about_Datalog_and_never_dared_to_ask/links/0fcfd50ca2d20473ca000000.pdf),” *IEEE Transactions on Knowledge and Data Engineering*, volume 1, number 1, pages 146–166, March 1989. [doi:10.1109/69.43410](http://dx.doi.org/10.1109/69.43410) -1. Serge Abiteboul, Richard Hull, and Victor Vianu: [*Foundations of Databases*](http://webdam.inria.fr/Alice/). Addison-Wesley, 1995. ISBN: 978-0-201-53771-0, available online at *webdam.inria.fr/Alice* -1. Nathan Marz: “[Cascalog](https://github.com/nathanmarz/cascalog)," *github.com*. -1. Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, et al.: “[GenBank](https://academic.oup.com/nar/article/36/suppl_1/D25/2507746),” *Nucleic Acids Research*, volume 36, Database issue, pages D25–D30, December 2007. [doi:10.1093/nar/gkm929](http://dx.doi.org/10.1093/nar/gkm929) -1. Fons Rademakers: “[ROOT for Big Data Analysis](https://indico.cern.ch/event/246453/contributions/1566610/attachments/423154/587535/ROOT-BigData-Analysis-London-2013.pdf),” at *Workshop on the Future of Big Data Management*, London, UK, June 2013. diff --git a/content/en/ch3.md b/content/en/ch3.md index 82d0f92..c0cc6d5 100644 --- a/content/en/ch3.md +++ b/content/en/ch3.md @@ -1,125 +1,2131 @@ --- -title: "3. Storage and Retrieval" -linkTitle: "3. Storage and Retrieval" +title: "3. 
Data Models and Query Languages" weight: 103 breadcrumbs: false --- -![](/img/ch3.png) +> *The limits of my language mean the limits of my world.* +> +> Ludwig Wittgenstein, *Tractatus Logico-Philosophicus* (1922) -> *Wer Ordnung hält, ist nur zu faul zum Suchen. -> (If you keep things tidily ordered, you’re just too lazy to go searching.)* -> ->​ — German proverb +Data models are perhaps the most important part of developing software, because they have such a +profound effect: not only on how the software is written, but also on how we *think about the problem* +that we are solving. -------------------- +Most applications are built by layering one data model on top of another. For each layer, the key +question is: how is it *represented* in terms of the next-lower layer? For example: -On the most fundamental level, a database needs to do two things: when you give it some data, it should store the data, and when you ask it again later, it should give the data back to you. +1. As an application developer, you look at the real world (in which there are people, + organizations, goods, actions, money flows, sensors, etc.) and model it in terms of objects or + data structures, and APIs that manipulate those data structures. Those structures are often + specific to your application. +2. When you want to store those data structures, you express them in terms of a general-purpose + data model, such as JSON or XML documents, tables in a relational database, or vertices and + edges in a graph. Those data models are the topic of this chapter. +3. The engineers who built your database software decided on a way of representing that + document/relational/graph data in terms of bytes in memory, on disk, or on a network. The + representation may allow the data to be queried, searched, manipulated, and processed in various + ways. We will discuss these storage engine designs in [Chapter 4](/en/ch4#ch_storage). +4. 
On yet lower levels, hardware engineers have figured out how to represent bytes in terms of + electrical currents, pulses of light, magnetic fields, and more. -In [Chapter 2](/en/ch2) we discussed data models and query languages—i.e., the format in which you (the application developer) give the database your data, and the mecha‐ nism by which you can ask for it again later. In this chapter we discuss the same from the database’s point of view: how we can store the data that we’re given, and how we can find it again when we’re asked for it. +In a complex application there may be more intermediary levels, such as APIs built upon APIs, but +the basic idea is still the same: each layer hides the complexity of the layers below it by +providing a clean data model. These abstractions allow different groups of people—for example, +the engineers at the database vendor and the application developers using their database—to work +together effectively. -Why should you, as an application developer, care how the database handles storage and retrieval internally? You’re probably not going to implement your own storage engine from scratch, but you *do* need to select a storage engine that is appropriate for your application, from the many that are available. In order to tune a storage engine to perform well on your kind of workload, you need to have a rough idea of what the storage engine is doing under the hood. +Several different data models are widely used in practice, often for different purposes. Some types +of data and some queries are easy to express in one model, and awkward in another. In this chapter +we will explore those trade-offs by comparing the relational model, the document model, graph-based +data models, event sourcing, and dataframes. We will also briefly look at query languages that allow +you to work with these models. This comparison will help you decide when to use which model. 
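The layering described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `Sensor` type and its fields are hypothetical, JSON stands in for the general-purpose data model, and UTF-8 text stands in for the byte representation (a real database would choose its own on-disk format).

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Sensor:
    """Layer 1: an application-specific object (hypothetical example type)."""
    sensor_id: int
    location: str
    reading_celsius: float


sensor = Sensor(sensor_id=7, location="roof", reading_celsius=21.5)

# Layer 2: express the object in a general-purpose data model (a JSON document).
document = json.dumps(asdict(sensor), sort_keys=True)

# Layer 3: represent the document as bytes, as it might be stored or transmitted.
# UTF-8 text is an assumption here; storage engines typically use other encodings.
encoded = document.encode("utf-8")

# Reading the data back walks up the layers in the opposite direction.
restored = Sensor(**json.loads(encoded.decode("utf-8")))
```

Each step only needs to know about the layer directly beneath it: the application code never touches the bytes, and the byte encoding knows nothing about sensors.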
-In particular, there is a big difference between storage engines that are optimized for transactional workloads and those that are optimized for analytics. We will explore that distinction later in “[Transaction Processing or Analytics?](#transaction-processing-or-analytics)”, and in “[Column-Oriented Storage](#column-oriented-storage)” we’ll discuss a family of storage engines that is optimized for analytics. +# Terminology: Declarative Query Languages -However, first we’ll start this chapter by talking about storage engines that are used in the kinds of databases that you’re probably familiar with: traditional relational data‐ bases, and also most so-called NoSQL databases. We will examine two families of storage engines: *log-structured* storage engines, and *page-oriented* storage engines such as B-trees. +Many of the query languages in this chapter (such as SQL, Cypher, SPARQL, or Datalog) are +*declarative*, which means that you specify the pattern of the data you want—what conditions the +results must meet, and how you want the data to be transformed (e.g., sorted, grouped, and +aggregated)—but not *how* to achieve that goal. The database system’s query optimizer can decide +which indexes and which join algorithms to use, and in which order to execute various parts of the +query. + +In contrast, with most programming languages you would have to write an *algorithm*—i.e., telling +the computer which operations to perform in which order. A declarative query language is attractive +because it is typically more concise and easier to write than an explicit algorithm. But more +importantly, it also hides implementation details of the query engine, which makes it possible for +the database system to introduce performance improvements without requiring any changes to queries
+ +For example, a database might be able to execute a declarative query in parallel across multiple CPU +cores and machines, without you having to worry about how to implement that parallelism +[[2](/en/ch3#Hellerstein2010)]. +In a hand-coded algorithm it would be a lot of work to implement such parallel execution yourself. + +# Relational Model versus Document Model + +The best-known data model today is probably that of SQL, based on the relational model proposed by +Edgar Codd in 1970 [[3](/en/ch3#Codd1970)]: +data is organized into *relations* (called *tables* in SQL), where each relation is an unordered collection +of *tuples* (*rows* in SQL). + +The relational model was originally a theoretical proposal, and many people at the time doubted whether it +could be implemented efficiently. However, by the mid-1980s, relational database management systems +(RDBMS) and SQL had become the tools of choice for most people who needed to store and query data +with some kind of regular structure. Many data management use cases are still dominated by +relational data decades later—for example, business analytics (see [“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics)). + +Over the years, there have been many competing approaches to data storage and querying. In the 1970s +and early 1980s, the *network model* and the *hierarchical model* were the main alternatives, but +the relational model came to dominate them. Object databases came and went again in the late 1980s +and early 1990s. XML databases appeared in the early 2000s, but have only seen niche adoption. Each +competitor to the relational model generated a lot of hype in its time, but it never lasted +[[4](/en/ch3#Stonebraker2005around)]. +Instead, SQL has grown to incorporate other data types besides its relational core—for example, +adding support for XML, JSON, and graph data +[[5](/en/ch3#Winand2015)]. 
+ +In the 2010s, *NoSQL* was the latest buzzword that tried to overthrow the dominance of relational +databases. NoSQL refers not to a single technology, but a loose set of ideas around new data models, +schema flexibility, scalability, and a move towards open source licensing models. Some databases +branded themselves as *NewSQL*, as they aim to provide the scalability of NoSQL systems along with +the data model and transactional guarantees of traditional relational databases. The NoSQL and +NewSQL ideas have been very influential in the design of data systems, but as the principles have +become widely adopted, use of those terms has faded. + +One lasting effect of the NoSQL movement is the popularity of the *document model*, which usually +represents data as JSON. This model was originally popularized by specialized document databases +such as MongoDB and Couchbase, although most relational databases have now also added JSON support. +Compared to relational tables, which are often seen as having a rigid and inflexible schema, JSON +documents are thought to be more flexible. + +The pros and cons of document and relational data have been debated extensively; let’s examine some +of the key points of that debate. + +## The Object-Relational Mismatch + +Much application development today is done in object-oriented programming languages, which leads to +a common criticism of the SQL data model: if data is stored in relational tables, an awkward +translation layer is required between the objects in the application code and the database model of +tables, rows, and columns. The disconnect between the models is sometimes called an *impedance +mismatch*. + +###### Note + +The term *impedance mismatch* is borrowed from electronics. Every electric circuit has a certain +impedance (resistance to alternating current) on its inputs and outputs. 
When you connect one +circuit’s output to another one’s input, the power transfer across the connection is maximized if +the output and input impedances of the two circuits match. An impedance mismatch can lead to signal +reflections and other troubles. + +### Object-relational mapping (ORM) + +Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of +boilerplate code required for this translation layer, but they are often criticized +[[6](/en/ch3#Fowler2012)]. +Some commonly cited problems are: + +* ORMs are complex and can’t completely hide the differences between the two models, so developers + still end up having to think about both the relational and the object representations of the data. +* ORMs are generally only used for OLTP app development (see [“Characterizing Transaction Processing and Analytics”](/en/ch1#sec_introduction_oltp)); data + engineers making the data available for analytics purposes still need to work with the underlying + relational representation, so the design of the relational schema still matters when using an ORM. +* Many ORMs work only with relational OLTP databases. Organizations with diverse data systems such + as search engines, graph databases, and NoSQL systems might find ORM support lacking. +* Some ORMs generate relational schemas automatically, but these might be awkward for the users who + are accessing the relational data directly, and they might be inefficient on the underlying + database. Customizing the ORM’s schema and query generation can be complex and negate the benefit + of using the ORM in the first place. +* ORMs make it easy to accidentally write inefficient queries, such as the *N+1 query problem* + [[7](/en/ch3#Mihalcea2023)]. + For example, say you want to display a list of user comments on a page, so you perform one query + that returns *N* comments, each containing the ID of its author. 
To show the name of the comment + author you need to look up the ID in the users table. In hand-written SQL you would probably + perform this join in the query and return the author name along with each comment, but with an ORM + you might end up making a separate query on the users table for each of the *N* comments to look + up its author, resulting in *N*+1 database queries in total, which is slower than performing the + join in the database. To avoid this problem, you may need to tell the ORM to fetch the author + information at the same time as fetching the comments. + +Nevertheless, ORMs also have advantages: + +* For data that is well suited to a relational model, some kind of translation between the + persistent relational and the in-memory object representation is inevitable, and ORMs reduce the + amount of boilerplate code required for this translation. Complicated queries may still need to be + handled outside of the ORM, but the ORM can help with the simple and repetitive cases. +* Some ORMs help with caching the results of database queries, which can help reduce the load on the + database. +* ORMs can also help with managing schema migrations and other administrative activities. + +### The document data model for one-to-many relationships + +Not all data lends itself well to a relational representation; let’s look at an example to explore a +limitation of the relational model. [Figure 3-1](/en/ch3#fig_obama_relational) illustrates how a résumé (a LinkedIn +profile) could be expressed in a relational schema. The profile as a whole can be identified by a +unique identifier, `user_id`. Fields like `first_name` and `last_name` appear exactly once per user, +so they can be modeled as columns on the `users` table. + +Most people have had more than one job in their career (positions), and people may have varying +numbers of periods of education and any number of pieces of contact information. 
One way of +representing such *one-to-many relationships* is to put positions, education, and contact +information in separate tables, with a foreign key reference to the `users` table, as in +[Figure 3-1](/en/ch3#fig_obama_relational). + +![ddia 0301](/fig/ddia_0301.png) + +###### Figure 3-1. Representing a LinkedIn profile using a relational schema. + +Another way of representing the same information, which is perhaps more natural and maps more +closely to an object structure in application code, is as a JSON document as shown in +[Example 3-1](/en/ch3#fig_obama_json). + +##### Example 3-1. Representing a LinkedIn profile as a JSON document + +``` +{ + "user_id": 251, + "first_name": "Barack", + "last_name": "Obama", + "headline": "Former President of the United States of America", + "region_id": "us:91", + "photo_url": "/p/7/000/253/05b/308dd6e.jpg", + "positions": [ + {"job_title": "President", "organization": "United States of America"}, + {"job_title": "US Senator (D-IL)", "organization": "United States Senate"} + ], + "education": [ + {"school_name": "Harvard University", "start": 1988, "end": 1991}, + {"school_name": "Columbia University", "start": 1981, "end": 1983} + ], + "contact_info": { + "website": "https://barackobama.com", + "twitter": "https://twitter.com/barackobama" + } +} +``` + +Some developers feel that the JSON model reduces the impedance mismatch between the application code +and the storage layer. However, as we shall see in [Chapter 5](/en/ch5#ch_encoding), there are also problems with +JSON as a data encoding format. The lack of a schema is often cited as an advantage; we will discuss +this in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility). + +The JSON representation has better *locality* than the multi-table schema in +[Figure 3-1](/en/ch3#fig_obama_relational) (see [“Data locality for reads and writes”](/en/ch3#sec_datamodels_document_locality)). 
If you want to fetch a profile +in the relational example, you need to either perform multiple queries (query each table by +`user_id`) or perform a messy multi-way join between the `users` table and its subordinate tables +[[8](/en/ch3#Schauder2023)]. +In the JSON representation, all the relevant information is in one place, making the query both +faster and simpler. + +The one-to-many relationships from the user profile to the user’s positions, educational history, and +contact information imply a tree structure in the data, and the JSON representation makes this tree +structure explicit (see [Figure 3-2](/en/ch3#fig_json_tree)). + +![ddia 0302](/fig/ddia_0302.png) + +###### Figure 3-2. One-to-many relationships forming a tree structure. + +###### Note + +This type of relationship is sometimes called *one-to-few* rather than *one-to-many*, since a résumé +typically has a small number of positions +[[9](/en/ch3#Zola2014), +[10](/en/ch3#Andrews2023)]. +In situations where there may be a genuinely large number of related items—say, comments on a +celebrity’s social media post, of which there could be many thousands—embedding them all in the same +document may be too unwieldy, so the relational approach in [Figure 3-1](/en/ch3#fig_obama_relational) is preferable. + +## Normalization, Denormalization, and Joins + +In [Example 3-1](/en/ch3#fig_obama_json) in the preceding section, `region_id` is given as an ID, not as the plain-text +string `"Washington, DC, United States"`. Why? + +If the user interface has a free-text field for entering the region, it makes sense to store it as a +plain-text string. But there are advantages to having standardized lists of geographic regions, and +letting users choose from a drop-down list or autocompleter: + +* Consistent style and spelling across profiles +* Avoiding ambiguity if there are several places with the same name (if the string were just + “Washington”, would it refer to DC or to the state?) 
+* Ease of updating—the name is stored in only one place, so it is easy to update across the board if + it ever needs to be changed (e.g., change of a city name due to political events) +* Localization support—when the site is translated into other languages, the standardized lists can + be localized, so the region can be displayed in the viewer’s language +* Better search—e.g., a search for people on the US East Coast can match this profile, because the + list of regions can encode the fact that Washington is located on the East Coast (which is not + apparent from the string `"Washington, DC"`) + +Whether you store an ID or a text string is a question of *normalization*. When you use an ID, your +data is more normalized: the information that is meaningful to humans (such as the text *Washington, +DC*) is stored in only one place, and everything that refers to it uses an ID (which only has +meaning within the database). When you store the text directly, you are duplicating the +human-meaningful information in every record that uses it; this representation is *denormalized*. + +The advantage of using an ID is that because it has no meaning to humans, it never needs to change: +the ID can remain the same, even if the information it identifies changes. Anything that is +meaningful to humans may need to change sometime in the future—and if that information is +duplicated, all the redundant copies need to be updated. That requires more code, more write +operations, more disk space, and risks inconsistencies (where some copies of the information are +updated but others aren’t). + +The downside of a normalized representation is that every time you want to display a record +containing an ID, you have to do an additional lookup to resolve the ID into something +human-readable. 
In a relational data model, this is done using a *join*, for example: + +``` +SELECT users.*, regions.region_name +FROM users +JOIN regions ON users.region_id = regions.id +WHERE users.id = 251; +``` + +Document databases can store both normalized and denormalized data, but they are often associated +with denormalization—partly because the JSON data model makes it easy to store additional, +denormalized fields, and partly because the weak support for joins in many document databases makes +normalization inconvenient. Some document databases don’t support joins at all, so you have to +perform them in application code—that is, you first fetch a document containing an ID, and then +perform a second query to resolve that ID into another document. In MongoDB, it is also possible to +perform a join using the `$lookup` operator in an aggregation pipeline: + +``` +db.users.aggregate([ + { $match: { _id: 251 } }, + { $lookup: { + from: "regions", + localField: "region_id", + foreignField: "_id", + as: "region" + } } +]) +``` + +### Trade-offs of normalization + +In the résumé example, while the `region_id` field is a reference into a standardized set of +regions, the name of the `organization` (the company or government where the person worked) and +`school_name` (where they studied) are just strings. This representation is denormalized: many +people may have worked at the same company, but there is no ID linking them. + +Perhaps the organization and school should be entities instead, and the profile should reference +their IDs instead of their names? The same arguments for referencing the ID of a region also apply +here. 
For example, say we wanted to include the logo of the school or company in addition to their +name: + +* In a denormalized representation, we would include the image URL of the logo on every individual + person’s profile; this makes the JSON document self-contained, but it creates a headache if we + ever need to change the logo, because we now need to find all of the occurrences of the old URL + and update them [[9](/en/ch3#Zola2014)]. +* In a normalized representation, we would create an entity representing an organization or school, + and store its name, logo URL, and perhaps other attributes (description, news feed, etc.) once on + that entity. Every résumé that mentions the organization would then simply reference its ID, and + updating the logo is easy. + +As a general principle, normalized data is usually faster to write (since there is only one copy), +but slower to query (since it requires joins); denormalized data is usually faster to read (fewer +joins), but more expensive to write (more copies to update, more disk space used). You might find it +helpful to view denormalization as a form of derived data ([“Systems of Record and Derived Data”](/en/ch1#sec_introduction_derived)), since you +need to set up a process for updating the redundant copies of the data. + +Besides the cost of performing all these updates, you also need to consider the consistency of the +database if a process crashes halfway through making its updates. Databases that offer atomic +transactions (see [“Atomicity”](/en/ch8#sec_transactions_acid_atomicity)) make it easier to remain consistent, but not +all databases offer atomicity across multiple documents. It is also possible to ensure consistency +through stream processing, which we discuss in [Link to Come]. 
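
The logo example above can be sketched with plain in-memory objects. This is only an illustration of the write-cost trade-off, not how any particular database works; the organization name `Acme`, the logo URLs, and all function names are hypothetical.

```javascript
// Hypothetical data: the same two résumés, stored denormalized and normalized.
const denormalizedProfiles = [
  { user_id: 1, positions: [{ organization: "Acme", logo_url: "/logos/acme-v1.png" }] },
  { user_id: 2, positions: [{ organization: "Acme", logo_url: "/logos/acme-v1.png" }] },
];

const organizations = new Map([
  [513, { name: "Acme", logo_url: "/logos/acme-v1.png" }],
]);
const normalizedProfiles = [
  { user_id: 1, positions: [{ org_id: 513 }] },
  { user_id: 2, positions: [{ org_id: 513 }] },
];

// Denormalized: every copy of the logo URL has to be found and rewritten.
// Returns the number of writes that were needed.
function updateLogoDenormalized(profiles, orgName, newUrl) {
  let writes = 0;
  for (const profile of profiles) {
    for (const position of profile.positions) {
      if (position.organization === orgName) {
        position.logo_url = newUrl;
        writes++;
      }
    }
  }
  return writes;
}

// Normalized: a single write on the organization entity.
function updateLogoNormalized(orgs, orgId, newUrl) {
  orgs.get(orgId).logo_url = newUrl;
  return 1;
}

// The price of normalization: reading a profile has to resolve each org_id,
// which is a join performed here in application code.
function readPositions(profile, orgs) {
  return profile.positions.map((p) => ({ ...p, ...orgs.get(p.org_id) }));
}
```

If the denormalized update were interrupted after the first profile, the two résumés would show different logos until the remaining copies were fixed; the normalized update has no such intermediate state.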
+ +Normalization tends to be better for OLTP systems, where both reads and updates need to be fast; +analytics systems often fare better with denormalized data, since they perform updates in bulk, and +the performance of read-only queries is the dominant concern. Moreover, in systems of small to +moderate scale, a normalized data model is often best, because you don’t have to worry about keeping +multiple copies of the data consistent with each other, and the cost of performing joins is +acceptable. However, in very large-scale systems, the cost of joins can become problematic. + +### Denormalization in the social networking case study + +In [“Case Study: Social Network Home Timelines”](/en/ch2#sec_introduction_twitter) we compared a normalized representation ([Figure 2-1](/en/ch2#fig_twitter_relational)) +and a denormalized one (precomputed, materialized timelines): here, the join between `posts` and +`follows` was too expensive, and the materialized timeline is a cache of the result of that join. +The fan-out process that inserts a new post into followers’ timelines was our way of keeping the +denormalized representation consistent. + +However, the implementation of materialized timelines at X (formerly Twitter) does not store the +actual text of each post: each entry actually only stores the post ID, the ID of the user who posted +it, and a little bit of extra information to identify reposts and replies +[[11](/en/ch3#Krikorian2012_ch3)]. 
+In other words, it is a precomputed result of (approximately) the following query: + +``` +SELECT posts.id, posts.sender_id FROM posts + JOIN follows ON posts.sender_id = follows.followee_id + WHERE follows.follower_id = current_user + ORDER BY posts.timestamp DESC + LIMIT 1000 +``` + +This means that whenever the timeline is read, the service still needs to perform two joins: look up +the post ID to fetch the actual post content (as well as statistics such as the number of likes +and replies), and look up the sender’s profile by ID (to get their username, profile picture, and +other details). This process of looking up the human-readable information by ID is called +*hydrating* the IDs, and it is essentially a join performed in application code +[[11](/en/ch3#Krikorian2012_ch3)]. + +The reason for storing only IDs in the precomputed timeline is that the data they refer to is +fast-changing: the number of likes and replies may change multiple times per second on a popular +post, and some users regularly change their username or profile photo. Since the timeline should +show the latest like count and profile picture when it is viewed, it would not make sense to +denormalize this information into the materialized timeline. Moreover, the storage cost would be +increased significantly by such denormalization. + +This example shows that having to perform joins when reading data is not, as sometimes claimed, an +impediment to creating high-performance, scalable services. Hydrating post ID and user ID is +actually a fairly easy operation to scale, since it parallelizes well, and the cost doesn’t depend +on the number of accounts you are following or the number of followers you have. + +If you need to decide whether to denormalize something in your application, the social network case +study shows that the choice is not immediately obvious: the most scalable approach may involve +denormalizing some things and leaving other things normalized. 
You will have to carefully consider +how often the information changes, and the cost of reads and writes (which might be dominated by +outliers, such as users with many follows/followers in the case of a typical social network). +Normalization and denormalization are not inherently good or bad—they are just a trade-off in terms +of performance of reads and writes, as well as the amount of effort to implement. + +## Many-to-One and Many-to-Many Relationships + +While `positions` and `education` in [Figure 3-1](/en/ch3#fig_obama_relational) are examples of one-to-many or +one-to-few relationships (one résumé has several positions, but each position belongs only to one +résumé), the `region_id` field is an example of a *many-to-one* relationship (many people live in +the same region, but we assume that each person lives in only one region at any one time). + +If we introduce entities for organizations and schools, and reference them by ID from the résumé, +then we also have *many-to-many* relationships (one person has worked for several organizations, and +an organization has several past or present employees). In a relational model, such a relationship +is usually represented as an *associative table* or *join table*, as shown in +[Figure 3-3](/en/ch3#fig_datamodels_m2m_rel): each position associates one user ID with one organization ID. + +![ddia 0303](/fig/ddia_0303.png) + +###### Figure 3-3. Many-to-many relationships in the relational model. + +Many-to-one and many-to-many relationships do not easily fit within one self-contained JSON +document; they lend themselves more to a normalized representation. In a document model, one +possible representation is given in [Example 3-2](/en/ch3#fig_datamodels_m2m_json) and illustrated in +[Figure 3-4](/en/ch3#fig_datamodels_many_to_many): the data within each dotted rectangle can be grouped into one +document, but the links to organizations and schools are best represented as references to other +documents. 
+ +##### Example 3-2. A résumé that references organizations by ID. + +``` +{ + "user_id": 251, + "first_name": "Barack", + "last_name": "Obama", + "positions": [ + {"start": 2009, "end": 2017, "job_title": "President", "org_id": 513}, + {"start": 2005, "end": 2008, "job_title": "US Senator (D-IL)", "org_id": 514} + ], + ... +} +``` + +![ddia 0304](/fig/ddia_0304.png) + +###### Figure 3-4. Many-to-many relationships in the document model: the data within each dotted box can be grouped into one document. + +Many-to-many relationships often need to be queried in “both directions”: for example, finding all +of the organizations that a particular person has worked for, and finding all of the people who have +worked at a particular organization. One way of enabling such queries is to store ID references on +both sides, i.e., a résumé includes the ID of each organization where the person has worked, and the +organization document includes the IDs of the résumés that mention that organization. This +representation is denormalized, since the relationship is stored in two places, which could become +inconsistent with each other. + +A normalized representation stores the relationship in only one place, and relies on *secondary +indexes* (which we discuss in [Chapter 4](/en/ch4#ch_storage)) to allow the relationship to be efficiently queried in +both directions. In the relational schema of [Figure 3-3](/en/ch3#fig_datamodels_m2m_rel), we would tell the database +to create indexes on both the `user_id` and the `org_id` columns of the `positions` table. + +In the document model of [Example 3-2](/en/ch3#fig_datamodels_m2m_json), the database needs to index the `org_id` field +of objects inside the `positions` array. Many document databases and relational databases with JSON +support are able to create such indexes on values inside a document. 
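
To make the idea of querying a relationship in both directions concrete, here is a minimal in-memory sketch. It stores each position once and builds two lookup structures over it, which play the role that secondary indexes on `user_id` and `org_id` would play in a database. Real indexes use on-disk structures rather than a `Map`; the rows, user 300, and the helper names are assumptions made for illustration.

```javascript
// Hypothetical rows of a `positions` join table: the relationship is stored once.
const positions = [
  { user_id: 251, org_id: 513, job_title: "President" },
  { user_id: 251, org_id: 514, job_title: "US Senator (D-IL)" },
  { user_id: 300, org_id: 514, job_title: "US Senator" },
];

// Build a lookup from the value of one column to all rows having that value.
function buildIndex(rows, column) {
  const index = new Map();
  for (const row of rows) {
    if (!index.has(row[column])) index.set(row[column], []);
    index.get(row[column]).push(row);
  }
  return index;
}

const byUser = buildIndex(positions, "user_id");
const byOrg = buildIndex(positions, "org_id");

// Direction 1: organizations a given person has worked for.
const orgsForUser = (userId) => (byUser.get(userId) ?? []).map((row) => row.org_id);

// Direction 2: people who have worked at a given organization.
const usersForOrg = (orgId) => (byOrg.get(orgId) ?? []).map((row) => row.user_id);
```

Because both lookups are derived from the same rows, there is no second copy of the relationship that could drift out of sync, unlike the approach of storing ID references on both sides.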
+ +## Stars and Snowflakes: Schemas for Analytics + +Data warehouses (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are usually relational, and there are a few +widely-used conventions for the structure of tables in a data warehouse: a *star schema*, +*snowflake schema*, *dimensional modeling* +[[12](/en/ch3#Kimball2013_ch3)], +and *one big table* (OBT). These structures are optimized for the needs of business analysts. ETL +processes translate data from operational systems into this schema. + +[Figure 3-5](/en/ch3#fig_dwh_schema) shows an example of a star schema that might be found in the data warehouse of a grocery +retailer. At the center of the schema is a so-called *fact table* (in this example, it is called +`fact_sales`). Each row of the fact table represents an event that occurred at a particular time +(here, each row represents a customer’s purchase of a product). If we were analyzing website traffic +rather than retail sales, each row might represent a page view or a click by a user. + +![ddia 0305](/fig/ddia_0305.png) + +###### Figure 3-5. Example of a star schema for use in a data warehouse. + +Usually, facts are captured as individual events, because this allows maximum flexibility of +analysis later. However, this means that the fact table can become extremely large. A big enterprise +may have many petabytes of transaction history in its data warehouse, mostly represented as fact +tables. + +Some of the columns in the fact table are attributes, such as the price at which the product was +sold and the cost of buying it from the supplier (allowing the profit margin to be calculated). +Other columns in the fact table are foreign key references to other tables, called *dimension +tables*. As each row in the fact table represents an event, the dimensions represent the *who*, +*what*, *where*, *when*, *how*, and *why* of the event. + +For example, in [Figure 3-5](/en/ch3#fig_dwh_schema), one of the dimensions is the product that was sold. 
Each row in +the `dim_product` table represents one type of product that is for sale, including its stock-keeping +unit (SKU), description, brand name, category, fat content, package size, etc. Each row in the +`fact_sales` table uses a foreign key to indicate which product was sold in that particular +transaction. Queries often involve multiple joins to multiple dimension tables. + +Even date and time are often represented using dimension tables, because this allows additional +information about dates (such as public holidays) to be encoded, allowing queries to differentiate +between sales on holidays and non-holidays. + +[Figure 3-5](/en/ch3#fig_dwh_schema) is an example of a star schema. The name comes from the fact that when the table +relationships are visualized, the fact table is in the middle, surrounded by its dimension tables; +the connections to these tables are like the rays of a star. + +A variation of this template is known as the *snowflake schema*, where dimensions are further broken +down into subdimensions. For example, there could be separate tables for brands and +product categories, and each row in the `dim_product` table could reference the brand and category +as foreign keys, rather than storing them as strings in the `dim_product` table. Snowflake schemas +are more normalized than star schemas, but star schemas are often preferred because +they are simpler for analysts to work with +[[12](/en/ch3#Kimball2013_ch3)]. + +In a typical data warehouse, tables are often quite wide: fact tables often have over 100 columns, +sometimes several hundred. Dimension tables can also be wide, as they include all the metadata that +may be relevant for analysis—for example, the `dim_store` table may include details of which +services are offered at each store, whether it has an in-store bakery, the square footage, the date +when the store was first opened, when it was last remodeled, how far it is from the nearest highway, +etc. 
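
The fact-and-dimension layout described above can be illustrated with a small in-memory sketch. The table names follow the text, but the column names (`date_key`, `product_key`, `net_price`, `is_holiday`) and all values are assumptions, since the figure's exact columns are not reproduced here. Each fact row is resolved against two dimensions, much as a warehouse query would join `fact_sales` to its dimension tables.

```javascript
// Hypothetical dimension tables, keyed by surrogate key.
const dimProduct = new Map([
  [1, { sku: "FRUIT-01", category: "Fresh fruit" }],
  [2, { sku: "PET-07", category: "Pet supplies" }],
]);
const dimDate = new Map([
  [20241224, { is_holiday: false }],
  [20241225, { is_holiday: true }],
]);

// Hypothetical fact table: one row per product purchased.
const factSales = [
  { date_key: 20241224, product_key: 1, quantity: 3, net_price: 2.5 },
  { date_key: 20241225, product_key: 1, quantity: 1, net_price: 2.5 },
  { date_key: 20241225, product_key: 2, quantity: 2, net_price: 10 },
];

// Total revenue per product category, optionally restricted to holidays.
function revenueByCategory(facts, { holidaysOnly = false } = {}) {
  const totals = new Map();
  for (const fact of facts) {
    const date = dimDate.get(fact.date_key); // join to the date dimension
    if (holidaysOnly && !date.is_holiday) continue;
    const product = dimProduct.get(fact.product_key); // join to the product dimension
    const revenue = fact.quantity * fact.net_price;
    totals.set(product.category, (totals.get(product.category) ?? 0) + revenue);
  }
  return totals;
}
```

Calling `revenueByCategory(factSales, { holidaysOnly: true })` uses the date dimension to separate holiday sales from the rest, which is the kind of question the date dimension exists to answer.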
+ +A star or snowflake schema consists mostly of many-to-one relationships (e.g., many sales occur for +one particular product, in one particular store), represented as the fact table having foreign keys +into dimension tables, or dimensions into sub-dimensions. In principle, other types of relationship +could exist, but they are often denormalized in order to simplify queries. For example, if a +customer buys several different products at once, that multi-item transaction is not represented +explicitly; instead, there is a separate row in the fact table for each product purchased, and those +facts all just happen to have the same customer ID, store ID, and timestamp. + +Some data warehouse schemas take denormalization even further and leave out the dimension tables +entirely, folding the information in the dimensions into denormalized columns on the fact table +instead (essentially, precomputing the join between the fact table and the dimension tables). This +approach is known as *one big table* (OBT), and while it requires more storage space, it sometimes +enables faster queries [[13](/en/ch3#Kaminsky2022)]. + +In the context of analytics, such denormalization is unproblematic, since the data typically +represents a log of historical data that is not going to change (except maybe for occasionally +correcting an error). The issues of data consistency and write overheads that occur with +denormalization in OLTP systems are not as pressing in analytics. + +## When to Use Which Model + +The main arguments in favor of the document data model are schema flexibility, better performance +due to locality, and that for some applications it is closer to the object model used by the +application. The relational model counters by providing better support for joins, many-to-one, and +many-to-many relationships. Let’s examine these arguments in more detail. 
+ +If the data in your application has a document-like structure (i.e., a tree of one-to-many +relationships, where typically the entire tree is loaded at once), then it’s probably a good idea to +use a document model. The relational technique of *shredding*—splitting a document-like structure +into multiple tables (like `positions`, `education`, and `contact_info` in +[Figure 3-1](/en/ch3#fig_obama_relational))—can lead to cumbersome schemas and unnecessarily complicated application +code. + +The document model has limitations: for example, you cannot refer directly to a nested item within a +document, but instead you need to say something like “the second item in the list of positions for +user 251”. If you do need to reference nested items, a relational approach works better, since you +can refer to any item directly by its ID. + +Some applications allow the user to choose the order of items: for example, imagine a to-do list or +issue tracker where the user can drag and drop tasks to reorder them. The document model supports +such applications well, because the items (or their IDs) can simply be stored in a JSON array to +determine their order. In relational databases there isn’t a standard way of representing such +reorderable lists, and various tricks are used: sorting by an integer column (requiring renumbering +when you insert into the middle), a linked list of IDs, or fractional indexing +[[14](/en/ch3#Nelson2018), +[15](/en/ch3#Wallace2017), +[16](/en/ch3#Greenspan2020)]. + +### Schema flexibility in the document model + +Most document databases, and the JSON support in relational databases, do not enforce any schema on +the data in documents. XML support in relational databases usually comes with optional schema +validation. No schema means that arbitrary keys and values can be added to a document, and when +reading, clients have no guarantees as to what fields the documents may contain. 
+ +Document databases are sometimes called *schemaless*, but that’s misleading, as the code that reads +the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not +enforced by the database [[17](/en/ch3#Schemaless)]. +A more accurate term is *schema-on-read* (the structure of the data is implicit, and only +interpreted when the data is read), in contrast with *schema-on-write* (the traditional approach of +relational databases, where the schema is explicit and the database ensures all data conforms to it +when the data is written) [[18](/en/ch3#Awadallah2009)]. + +Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas +schema-on-write is similar to static (compile-time) type checking. Just as the advocates of static +and dynamic type checking have big debates about their relative merits +[[19](/en/ch3#Odersky2013)], +enforcement of schemas in databases is a contentious topic, and in general there’s no right or wrong +answer. + +The difference between the approaches is particularly noticeable in situations where an application +wants to change the format of its data. For example, say you are currently storing each user’s full +name in one field, and you instead want to store the first name and last name separately +[[20](/en/ch3#Irwin2013)]. +In a document database, you would just start writing new documents with the new fields and have +code in the application that handles the case when old documents are read. For example: + +``` +if (user && user.name && !user.first_name) { +    // Documents written before Dec 8, 2023 don't have first_name +    user.first_name = user.name.split(" ")[0]; +} +``` + +The downside of this approach is that every part of your application that reads from the database +now needs to deal with documents in old formats that may have been written a long time in the past.
+On the other hand, in a schema-on-write database, you would typically perform a *migration* along +the lines of: + +``` +ALTER TABLE users ADD COLUMN first_name text DEFAULT NULL; +UPDATE users SET first_name = split_part(name, ' ', 1); -- PostgreSQL +UPDATE users SET first_name = substring_index(name, ' ', 1); -- MySQL +``` + +In most relational databases, adding a column with a default value is fast and unproblematic, even +on large tables. However, running the `UPDATE` statement is likely to be slow on a large table, +since every row needs to be rewritten, and other schema operations (such as changing the data type +of a column) also typically require the entire table to be copied. + +Various tools exist to allow this type of schema change to be performed in the background without downtime +[[21](/en/ch3#Percona2023), +[22](/en/ch3#Noach2016), +[23](/en/ch3#Mukherjee2022), +[24](/en/ch3#PerezAradros2023)], +but performing such migrations on large databases remains operationally challenging. Complicated +migrations can be avoided by only adding the `first_name` column with a default value of `NULL` +(which is fast), and filling it in at read time, like you would with a document database. + +The schema-on-read approach is advantageous if the items in the collection don’t all have the same +structure for some reason (i.e., the data is heterogeneous)—for example, because: + +* There are many different types of objects, and it is not practicable to put each type of object in +  its own table. +* The structure of the data is determined by external systems over which you have no control and +  which may change at any time. + +In situations like these, a schema may hurt more than it helps, and schemaless documents can be a +much more natural data model. But in cases where all records are expected to have the same +structure, schemas are a useful mechanism for documenting and enforcing that structure.
We will +discuss schemas and schema evolution in more detail in [Chapter 5](/en/ch5#ch_encoding). + +### Data locality for reads and writes + +A document is usually stored as a single continuous string, encoded as JSON, XML, or a binary variant +thereof (such as MongoDB’s BSON). If your application often needs to access the entire document +(for example, to render it on a web page), there is a performance advantage to this *storage +locality*. If data is split across multiple tables, like in [Figure 3-1](/en/ch3#fig_obama_relational), multiple +index lookups are required to retrieve it all, which may require more disk seeks and take more time. + +The locality advantage only applies if you need large parts of the document at the same time. The +database typically needs to load the entire document, which can be wasteful if you only need to +access a small part of a large document. On updates to a document, the entire document usually needs +to be rewritten. For these reasons, it is generally recommended that you keep documents fairly small +and avoid frequent small updates to a document. + +However, the idea of storing related data together for locality is not limited to the document +model. For example, Google’s Spanner database offers the same locality properties in a relational +data model, by allowing the schema to declare that a table’s rows should be interleaved (nested) +within a parent table +[[25](/en/ch3#Corbett2012_ch2)]. +Oracle allows the same, using a feature called *multi-table index cluster tables* +[[26](/en/ch3#BurlesonCluster)]. +The *wide-column* data model popularized by Google’s Bigtable, and used e.g. in HBase and Accumulo, +has a concept of *column families*, which have a similar purpose of managing locality +[[27](/en/ch3#Chang2006_ch3)]. + +### Query languages for documents + +Another difference between a relational and a document database is the language or API that you use +to query it. 
Most relational databases are queried using SQL, but document databases are more +varied. Some allow only key-value access by primary key, while others also offer secondary indexes +to query for values inside documents, and some provide rich query languages. + +XML databases are often queried using XQuery and XPath, which are designed to allow complex queries, +including joins across multiple documents, and also format their results as XML +[[28](/en/ch3#Walmsley2015)]. JSON Pointer +[[29](/en/ch3#Bryan2013)] and JSONPath +[[30](/en/ch3#Goessner2024)] provide an equivalent to XPath for JSON. + +MongoDB’s aggregation pipeline, whose `$lookup` operator for joins we saw in +[“Normalization, Denormalization, and Joins”](/en/ch3#sec_datamodels_normalization), is an example of a query language for collections of JSON +documents. + +Let’s look at another example to get a feel for this language—this time an aggregation, which is +especially needed for analytics. Imagine you are a marine biologist, and you add an observation +record to your database every time you see animals in the ocean. Now you want to generate a report +saying how many sharks you have sighted per month. In PostgreSQL you might express that query like +this: + +``` +SELECT date_trunc('month', observation_timestamp) AS observation_month, -- (1) +       sum(num_animals) AS total_animals +FROM observations +WHERE family = 'Sharks' +GROUP BY observation_month; +``` + +[![1](/fig/1.png)](/en/ch3#co_data_models_and_query_languages_CO1-1) +: The `date_trunc('month', timestamp)` function determines the calendar month +  containing `timestamp`, and returns another timestamp representing the beginning of that month. In +  other words, it rounds a timestamp down to the start of its month.
+ +This query first filters the observations to only show species in the `Sharks` family, then groups +the observations by the calendar month in which they occurred, and finally adds up the number of +animals seen in all observations in that month. The same query can be expressed using MongoDB’s +aggregation pipeline as follows: + +``` +db.observations.aggregate([ + { $match: { family: "Sharks" } }, + { $group: { + _id: { + year: { $year: "$observationTimestamp" }, + month: { $month: "$observationTimestamp" } + }, + totalAnimals: { $sum: "$numAnimals" } + } } +]); +``` + +The aggregation pipeline language is similar in expressiveness to a subset of SQL, but it uses a +JSON-based syntax rather than SQL’s English-sentence-style syntax; the difference is perhaps a +matter of taste. + +### Convergence of document and relational databases + +Document databases and relational databases started out as very different approaches to data +management, but they have grown more similar over time +[[31](/en/ch3#Stonebraker2024)]. +Relational databases added support for JSON types and query operators, and the ability to index +properties inside documents. Some document databases (such as MongoDB, Couchbase, and RethinkDB) +added support for joins, secondary indexes, and declarative query languages. + +This convergence of the models is good news for application developers, because the relational model +and the document model work best when you can combine both in the same database. Many document +databases need relational-style references to other documents, and many relational databases have +sections where schema flexibility is beneficial. Relational-document hybrids are a powerful +combination. + +###### Note + +Codd’s original description of the relational model +[[3](/en/ch3#Codd1970)] actually allowed something similar to JSON +within a relational schema. He called it *nonsimple domains*. 
The idea was that a value in a row +doesn’t have to just be a primitive datatype like a number or a string, but it could also be a +nested relation (table)—so you can have an arbitrarily nested tree structure as a value, much like +the JSON or XML support that was added to SQL over 30 years later. + +# Graph-Like Data Models + +We saw earlier that the type of relationships is an important distinguishing feature between +different data models. If your application has mostly one-to-many relationships (tree-structured +data) and few other relationships between records, the document model is appropriate. + +But what if many-to-many relationships are very common in your data? The relational model can handle +simple cases of many-to-many relationships, but as the connections within your data become more +complex, it becomes more natural to start modeling your data as a graph. + +A graph consists of two kinds of objects: *vertices* (also known as *nodes* or *entities*) and +*edges* (also known as *relationships* or *arcs*). Many kinds of data can be modeled as a graph. +Typical examples include: + +Social graphs +: Vertices are people, and edges indicate which people know each other. + +The web graph +: Vertices are web pages, and edges indicate HTML links to other pages. + +Road or rail networks +: Vertices are junctions, and edges represent the roads or railway lines between them. + +Well-known algorithms can operate on these graphs: for example, map navigation apps search for +the shortest path between two points in a road network, and +PageRank can be used on the web graph to determine the +popularity of a web page and thus its ranking in search results +[[32](/en/ch3#Page1999)]. + +Graphs can be represented in several different ways. In the *adjacency list* model, each vertex +stores the IDs of its neighbor vertices that are one edge away. 
Alternatively, you can use an +*adjacency matrix*, a two-dimensional array where each row and each column corresponds to a vertex, +where the value is zero when there is no edge between the row vertex and the column vertex, and +where the value is one if there is an edge. The adjacency list is good for graph traversals, and the +matrix is good for machine learning (see [“Dataframes, Matrices, and Arrays”](/en/ch3#sec_datamodels_dataframes)). + +In the examples just given, all the vertices in a graph represent the same kind of thing (people, web +pages, or road junctions, respectively). However, graphs are not limited to such *homogeneous* data: +an equally powerful use of graphs is to provide a consistent way of storing completely different +types of objects in a single database. For example: + +* Facebook maintains a single graph with many different types of vertices and edges: vertices + represent people, locations, events, checkins, and comments made by users; edges indicate which + people are friends with each other, which checkin happened in which location, who commented on + which post, who attended which event, and so on + [[33](/en/ch3#Bronson2013)]. +* Knowledge graphs are used by search engines to record facts about entities that often occur in + search queries, such as organizations, people, and places + [[34](/en/ch3#Noy2019)]. + This information is obtained by crawling and analyzing the text on websites; some websites, such + as Wikidata, also publish graph data in a structured form. + +There are several different, but related, ways of structuring and querying data in graphs. In this +section we will discuss the *property graph* model (implemented by Neo4j, Memgraph, KùzuDB +[[35](/en/ch3#Feng2023)], +and others [[36](/en/ch3#Besta2019)]) +and the *triple-store* model (implemented by Datomic, AllegroGraph, Blazegraph, and others). 
These +models are fairly similar in what they can express, and some graph databases (such as Amazon +Neptune) support both models. + +We will also look at four query languages for graphs (Cypher, SPARQL, Datalog, and GraphQL), as well +as SQL support for querying graphs. Other graph query languages exist, such as Gremlin +[[37](/en/ch3#TinkerPop2023)], +but these will give us a representative overview. + +To illustrate these different languages and models, this section uses the graph shown in +[Figure 3-6](/en/ch3#fig_datamodels_graph) as a running example. It could be taken from a social network or a +genealogical database: it shows two people, Lucy from Idaho and Alain from Saint-Lô, France. They +are married and living in London. Each person and each location is represented as a vertex, and the +relationships between them as edges. This example will help demonstrate some queries that are easy +in graph databases, but difficult in other models. + +![ddia 0306](/fig/ddia_0306.png) + +###### Figure 3-6. Example of graph-structured data (boxes represent vertices, arrows represent edges). + +## Property Graphs + +In the *property graph* (also known as *labeled property graph*) model, each vertex consists of: + +* A unique identifier +* A label (string) to describe what type of object this vertex represents +* A set of outgoing edges +* A set of incoming edges +* A collection of properties (key-value pairs) + +Each edge consists of: + +* A unique identifier +* The vertex at which the edge starts (the *tail vertex*) +* The vertex at which the edge ends (the *head vertex*) +* A label to describe the kind of relationship between the two vertices +* A collection of properties (key-value pairs) + +You can think of a graph store as consisting of two relational tables, one for vertices and one for +edges, as shown in [Example 3-3](/en/ch3#fig_graph_sql_schema) (this schema uses the PostgreSQL `jsonb` datatype to +store the properties of each vertex or edge).
The head and tail vertex are stored for each edge; if +you want the set of incoming or outgoing edges for a vertex, you can query the `edges` table by +`head_vertex` or `tail_vertex`, respectively. + +##### Example 3-3. Representing a property graph using a relational schema + +``` +CREATE TABLE vertices ( + vertex_id integer PRIMARY KEY, + label text, + properties jsonb +); + +CREATE TABLE edges ( + edge_id integer PRIMARY KEY, + tail_vertex integer REFERENCES vertices (vertex_id), + head_vertex integer REFERENCES vertices (vertex_id), + label text, + properties jsonb +); + +CREATE INDEX edges_tails ON edges (tail_vertex); +CREATE INDEX edges_heads ON edges (head_vertex); +``` + +Some important aspects of this model are: + +1. Any vertex can have an edge connecting it with any other vertex. There is no schema that + restricts which kinds of things can or cannot be associated. +2. Given any vertex, you can efficiently find both its incoming and its outgoing edges, and thus + *traverse* the graph—i.e., follow a path through a chain of vertices—both forward and backward. + (That’s why [Example 3-3](/en/ch3#fig_graph_sql_schema) has indexes on both the `tail_vertex` and `head_vertex` + columns.) +3. By using different labels for different kinds of vertices and relationships, you can store + several different kinds of information in a single graph, while still maintaining a clean data + model. + +The edges table is like the many-to-many associative table/join table we saw in +[“Many-to-One and Many-to-Many Relationships”](/en/ch3#sec_datamodels_many_to_many), generalized to allow many different types of relationship to be +stored in the same table. There may also be indexes on the labels and the properties, allowing +vertices or edges with certain properties to be found efficiently. 
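As a minimal sketch of how the two-table layout supports traversal in both directions, the following uses Python's built-in `sqlite3` module. It is an adaptation, not the schema of Example 3-3 itself: properties are stored as JSON text instead of PostgreSQL's `jsonb`, and the vertex IDs, edge IDs, sample rows, and the helper functions `outgoing` and `incoming` are made up for illustration.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE vertices (
        vertex_id  INTEGER PRIMARY KEY,
        label      TEXT,
        properties TEXT   -- JSON text; stands in for PostgreSQL's jsonb
    );
    CREATE TABLE edges (
        edge_id     INTEGER PRIMARY KEY,
        tail_vertex INTEGER REFERENCES vertices (vertex_id),
        head_vertex INTEGER REFERENCES vertices (vertex_id),
        label       TEXT,
        properties  TEXT
    );
    CREATE INDEX edges_tails ON edges (tail_vertex);
    CREATE INDEX edges_heads ON edges (head_vertex);
""")

# Illustrative sample data (IDs are arbitrary).
conn.executemany("INSERT INTO vertices VALUES (?, ?, ?)", [
    (1, "Location", json.dumps({"name": "United States", "type": "country"})),
    (2, "Location", json.dumps({"name": "Idaho", "type": "state"})),
    (3, "Person", json.dumps({"name": "Lucy"})),
])
conn.executemany("INSERT INTO edges VALUES (?, ?, ?, ?, ?)", [
    (10, 2, 1, "WITHIN", "{}"),   # Idaho -> United States
    (11, 3, 2, "BORN_IN", "{}"),  # Lucy  -> Idaho
])

def outgoing(vertex_id):
    """Edges that start at this vertex (the vertex is the tail)."""
    return conn.execute(
        "SELECT label, head_vertex FROM edges "
        "WHERE tail_vertex = ? ORDER BY edge_id", (vertex_id,)).fetchall()

def incoming(vertex_id):
    """Edges that end at this vertex (the vertex is the head)."""
    return conn.execute(
        "SELECT label, tail_vertex FROM edges "
        "WHERE head_vertex = ? ORDER BY edge_id", (vertex_id,)).fetchall()
```

Here `outgoing(2)` follows Idaho's `WITHIN` edge forward, while `incoming(2)` walks backward to find who was born there; each lookup can use one of the two indexes.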
+ +###### Note + +A limitation of graph models is that an edge can only associate two vertices with each other, +whereas a relational join table can represent three-way or even higher-degree relationships by +having multiple foreign key references on a single row. Such relationships can be represented in a +graph by creating an additional vertex corresponding to each row of the join table, and edges +to/from that vertex, or by using a *hypergraph*. + +Those features give graphs a great deal of flexibility for data modeling, as illustrated in +[Figure 3-6](/en/ch3#fig_datamodels_graph). The figure shows a few things that would be difficult to express in a +traditional relational schema, such as different kinds of regional structures in different countries +(France has *départements* and *régions*, whereas the US has *counties* and *states*), quirks of +history such as a country within a country (ignoring for now the intricacies of sovereign states and +nations), and varying granularity of data (Lucy’s current residence is specified as a city, whereas +her place of birth is specified only at the level of a state). + +You could imagine extending the graph to also include many other facts about Lucy and Alain, or +other people. For instance, you could use it to indicate any food allergies they have (by +introducing a vertex for each allergen, and an edge between a person and an allergen to indicate an +allergy), and link the allergens with a set of vertices that show which foods contain which +substances. Then you could write a query to find out what is safe for each person to eat. +Graphs are good for evolvability: as you add features to your application, a graph can easily be +extended to accommodate changes in your application’s data structures. + +## The Cypher Query Language + +*Cypher* is a query language for property graphs, originally created for the Neo4j graph database, +and later developed into an open standard as *openCypher* +[[38](/en/ch3#Francis2018)]. 
+Besides Neo4j, Cypher is supported by Memgraph, KùzuDB +[[35](/en/ch3#Feng2023)], +Amazon Neptune, Apache AGE (with storage in PostgreSQL), and others. It is named after a character +in the movie *The Matrix* and is not related to ciphers in cryptography +[[39](/en/ch3#EifremTweet)]. + +[Example 3-4](/en/ch3#fig_cypher_create) shows the Cypher query to insert the lefthand portion of +[Figure 3-6](/en/ch3#fig_datamodels_graph) into a graph database. The rest of the graph can be added similarly. Each +vertex is given a symbolic name like `usa` or `idaho`. That name is not stored in the database, but +only used internally within the query to create edges between the vertices, using an arrow notation: +`(idaho) -[:WITHIN]-> (usa)` creates an edge labeled `WITHIN`, with `idaho` as the tail vertex and +`usa` as the head vertex. + +##### Example 3-4. A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as a Cypher query + +``` +CREATE + (namerica :Location {name:'North America', type:'continent'}), + (usa :Location {name:'United States', type:'country' }), + (idaho :Location {name:'Idaho', type:'state' }), + (lucy :Person {name:'Lucy' }), + (idaho) -[:WITHIN ]-> (usa) -[:WITHIN]-> (namerica), + (lucy) -[:BORN_IN]-> (idaho) +``` + +When all the vertices and edges of [Figure 3-6](/en/ch3#fig_datamodels_graph) are added to the database, we can start +asking interesting questions: for example, *find the names of all the people who emigrated from the +United States to Europe*. That is, find all the vertices that have a `BORN_IN` edge to a location +within the US, and also a `LIVES_IN` edge to a location within Europe, and return the `name` +property of each of those vertices. + +[Example 3-5](/en/ch3#fig_cypher_query) shows how to express that query in Cypher. The same arrow notation is used in a +`MATCH` clause to find patterns in the graph: `(person) -[:BORN_IN]-> ()` matches any two vertices +that are related by an edge labeled `BORN_IN`.
The tail vertex of that edge is bound to the +variable `person`, and the head vertex is left unnamed. + +##### Example 3-5. Cypher query to find people who emigrated from the US to Europe + +``` +MATCH + (person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (:Location {name:'United States'}), + (person) -[:LIVES_IN]-> () -[:WITHIN*0..]-> (:Location {name:'Europe'}) +RETURN person.name +``` + +The query can be read as follows: + +> Find any vertex (call it `person`) that meets *both* of the following conditions: +> +> 1. `person` has an outgoing `BORN_IN` edge to some vertex. From that vertex, you can follow a chain +> of outgoing `WITHIN` edges until eventually you reach a vertex of type `Location`, whose `name` +> property is equal to `"United States"`. +> 2. That same `person` vertex also has an outgoing `LIVES_IN` edge. Following that edge, and then a +> chain of outgoing `WITHIN` edges, you eventually reach a vertex of type `Location`, whose `name` +> property is equal to `"Europe"`. +> +> For each such `person` vertex, return the `name` property. + +There are several possible ways of executing the query. The description given here suggests that you +start by scanning all the people in the database, examine each person’s birthplace and residence, +and return only those people who meet the criteria. + +But equivalently, you could start with the two `Location` vertices and work backward. If there is an +index on the `name` property, you can efficiently find the two vertices representing the US and +Europe. Then you can proceed to find all locations (states, regions, cities, etc.) in the US and +Europe respectively by following all incoming `WITHIN` edges. Finally, you can look for people who +can be found through an incoming `BORN_IN` or `LIVES_IN` edge at one of the location vertices. + +## Graph Queries in SQL + +[Example 3-3](/en/ch3#fig_graph_sql_schema) suggested that graph data can be represented in a relational database. 
But +if we put graph data in a relational structure, can we also query it using SQL? + +The answer is yes, but with some difficulty. Every edge that you traverse in a graph query is +effectively a join with the `edges` table. In a relational database, you usually know in advance +which joins you need in your query. On the other hand, in a graph query, you may need to traverse a +variable number of edges before you find the vertex you’re looking for—that is, the number of joins +is not fixed in advance. + +In our example, that happens in the `() -[:WITHIN*0..]-> ()` pattern in the Cypher query. A person’s +`LIVES_IN` edge may point at any kind of location: a street, a city, a district, a region, a state, +etc. A city may be `WITHIN` a region, a region `WITHIN` a state, a state `WITHIN` a country, etc. +The `LIVES_IN` edge may point directly at the location vertex you’re looking for, or it may be +several levels away in the location hierarchy. + +In Cypher, `:WITHIN*0..` expresses that fact very concisely: it means “follow a `WITHIN` edge, zero +or more times.” It is like the `*` operator in a regular expression. + +Since SQL:1999, this idea of variable-length traversal paths in a query can be expressed using +something called *recursive common table expressions* (the `WITH RECURSIVE` syntax). +[Example 3-6](/en/ch3#fig_graph_sql_query) shows the same query—finding the names of people who emigrated from the US +to Europe—expressed in SQL using this technique. However, the syntax is very clumsy in comparison to +Cypher. + +##### Example 3-6. 
The same query as [Example 3-5](/en/ch3#fig_cypher_query), written in SQL using recursive common table expressions + +``` +WITH RECURSIVE + + -- in_usa is the set of vertex IDs of all locations within the United States + in_usa(vertex_id) AS ( + SELECT vertex_id FROM vertices + WHERE label = 'Location' AND properties->>'name' = 'United States' ![1](/fig/1.png) + UNION + SELECT edges.tail_vertex FROM edges ![2](/fig/2.png) + JOIN in_usa ON edges.head_vertex = in_usa.vertex_id + WHERE edges.label = 'within' + ), + + -- in_europe is the set of vertex IDs of all locations within Europe + in_europe(vertex_id) AS ( + SELECT vertex_id FROM vertices + WHERE label = 'Location' AND properties->>'name' = 'Europe' ![3](/fig/3.png) + UNION + SELECT edges.tail_vertex FROM edges + JOIN in_europe ON edges.head_vertex = in_europe.vertex_id + WHERE edges.label = 'within' + ), + + -- born_in_usa is the set of vertex IDs of all people born in the US + born_in_usa(vertex_id) AS ( ![4](/fig/4.png) + SELECT edges.tail_vertex FROM edges + JOIN in_usa ON edges.head_vertex = in_usa.vertex_id + WHERE edges.label = 'born_in' + ), + + -- lives_in_europe is the set of vertex IDs of all people living in Europe + lives_in_europe(vertex_id) AS ( ![5](/fig/5.png) + SELECT edges.tail_vertex FROM edges + JOIN in_europe ON edges.head_vertex = in_europe.vertex_id + WHERE edges.label = 'lives_in' + ) + +SELECT vertices.properties->>'name' +FROM vertices +-- join to find those people who were both born in the US *and* live in Europe +JOIN born_in_usa ON vertices.vertex_id = born_in_usa.vertex_id ![6](/fig/6.png) +JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id; +``` + +[![1](/fig/1.png)](/en/ch3#co_data_models_and_query_languages_CO2-1) +: First find the vertex whose `name` property has the value `"United States"`, and make it the first element of the set + of vertices `in_usa`.
+ +[![2](/fig/2.png)](/en/ch3#co_data_models_and_query_languages_CO2-2) +: Follow all incoming `within` edges from vertices in the set `in_usa`, and add them to the same + set, until all incoming `within` edges have been visited. + +[![3](/fig/3.png)](/en/ch3#co_data_models_and_query_languages_CO2-3) +: Do the same starting with the vertex whose `name` property has the value `"Europe"`, and build up + the set of vertices `in_europe`. + +[![4](/fig/4.png)](/en/ch3#co_data_models_and_query_languages_CO2-4) +: For each of the vertices in the set `in_usa`, follow incoming `born_in` edges to find people + who were born in some place within the United States. + +[![5](/fig/5.png)](/en/ch3#co_data_models_and_query_languages_CO2-5) +: Similarly, for each of the vertices in the set `in_europe`, follow incoming `lives_in` edges to find people who live in Europe. + +[![6](/fig/6.png)](/en/ch3#co_data_models_and_query_languages_CO2-6) +: Finally, intersect the set of people born in the USA with the set of people living in Europe, by + joining them. + +The fact that a 4-line Cypher query requires 31 lines in SQL shows how much of a difference the +right choice of data model and query language can make. And this is just the beginning; there are +more details to consider, e.g., around handling cycles, and choosing between breadth-first or +depth-first traversal [[40](/en/ch3#Tisiot2021)]. + +Oracle has a different SQL extension for recursive queries, which it calls *hierarchical* +[[41](/en/ch3#Goel2020)]. + +However, the situation may be improving: at the time of writing, there are plans to add a graph +query language called GQL to the SQL standard [[42](/en/ch3#Deutsch2022), +[43](/en/ch3#Green2019)], +which will provide a syntax inspired by Cypher, GSQL +[[44](/en/ch3#Deutsch2018)], and PGQL +[[45](/en/ch3#vanRest2016)]. + +## Triple-Stores and SPARQL + +The triple-store model is mostly equivalent to the property graph model, using different words to +describe the same ideas. 
It is nevertheless worth discussing, because there are various tools and +languages for triple-stores that can be valuable additions to your toolbox for building +applications. + +In a triple-store, all information is stored in the form of very simple three-part statements: +(*subject*, *predicate*, *object*). For example, in the triple (*Jim*, *likes*, *bananas*), *Jim* is +the subject, *likes* is the predicate (verb), and *bananas* is the object. + +The subject of a triple is equivalent to a vertex in a graph. The object is one of two things: + +1. A value of a primitive datatype, such as a string or a number. In that case, the predicate and + object of the triple are equivalent to the key and value of a property on the subject vertex. + Using the example from [Figure 3-6](/en/ch3#fig_datamodels_graph), (*lucy*, *birthYear*, *1989*) is like a vertex + `lucy` with properties `{"birthYear": 1989}`. +2. Another vertex in the graph. In that case, the predicate is an edge in the + graph, the subject is the tail vertex, and the object is the head vertex. For example, in + (*lucy*, *marriedTo*, *alain*) the subject and object *lucy* and *alain* are both vertices, and + the predicate *marriedTo* is the label of the edge that connects them. + +###### Note + +To be precise, databases that offer a triple-like data model often need to store some additional +metadata on each tuple. For example, AWS Neptune uses quads (4-tuples) by adding a graph ID to each +triple [[46](/en/ch3#NeptuneDataModel)]; +Datomic uses 5-tuples, extending each triple with a transaction ID and a boolean to indicate +deletion [[47](/en/ch3#DatomicDataModel)]. +Since these databases retain the basic *subject-predicate-object* structure explained above, this +book nevertheless calls them triple-stores. 
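The distinction between property-like and edge-like triples can be sketched with plain Python tuples. This only illustrates the idea and is not how any particular triple-store is implemented; the `Vertex` wrapper, the `match` helper, and the use of `None` as a wildcard are assumptions made for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vertex:
    """Marks a value as a vertex rather than a primitive such as a string."""
    name: str

lucy, alain = Vertex("lucy"), Vertex("alain")

# Each fact is a (subject, predicate, object) tuple.
triples = {
    (lucy, "birthYear", 1989),   # object is a primitive: acts like a property
    (lucy, "marriedTo", alain),  # object is a vertex: acts like an edge
}

def match(subject=None, predicate=None, obj=None):
    """Return all triples matching the pattern; None matches anything."""
    return {
        (s, p, o) for (s, p, o) in triples
        if subject in (None, s) and predicate in (None, p) and obj in (None, o)
    }

def properties(vertex):
    """Triples whose object is a primitive value, as a key-value mapping."""
    return {p: o for (_, p, o) in match(subject=vertex)
            if not isinstance(o, Vertex)}

def edges(vertex):
    """Triples whose object is another vertex, as (label, head vertex) pairs."""
    return {(p, o) for (_, p, o) in match(subject=vertex)
            if isinstance(o, Vertex)}
```

With this data, `properties(lucy)` yields the `birthYear` entry and `edges(lucy)` yields the `marriedTo` edge, even though both are stored in the same set of triples.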
+ +[Example 3-7](/en/ch3#fig_graph_n3_triples) shows the same data as in [Example 3-4](/en/ch3#fig_cypher_create), written as +triples in a format called *Turtle*, a subset of *Notation3* (*N3*) +[[48](/en/ch3#Beckett2011)]. + +##### Example 3-7. A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as Turtle triples + +``` +@prefix : <urn:example:>. +_:lucy a :Person. +_:lucy :name "Lucy". +_:lucy :bornIn _:idaho. +_:idaho a :Location. +_:idaho :name "Idaho". +_:idaho :type "state". +_:idaho :within _:usa. +_:usa a :Location. +_:usa :name "United States". +_:usa :type "country". +_:usa :within _:namerica. +_:namerica a :Location. +_:namerica :name "North America". +_:namerica :type "continent". +``` + +In this example, vertices of the graph are written as `_:someName`. The name doesn’t mean anything +outside of this file; it exists only because we otherwise wouldn’t know which triples refer to the +same vertex. When the predicate represents an edge, the object is a vertex, as in `_:idaho :within +_:usa`. When the predicate is a property, the object is a string literal, as in `_:usa :name +"United States"`. + +It’s tedious to repeat the same subject over and over again, but fortunately you can use +semicolons to say multiple things about the same subject. This makes the Turtle format quite +readable: see [Example 3-8](/en/ch3#fig_graph_n3_shorthand). + +##### Example 3-8. A more concise way of writing the data in [Example 3-7](/en/ch3#fig_graph_n3_triples) + +``` +@prefix : <urn:example:>. +_:lucy a :Person; :name "Lucy"; :bornIn _:idaho. +_:idaho a :Location; :name "Idaho"; :type "state"; :within _:usa. +_:usa a :Location; :name "United States"; :type "country"; :within _:namerica. +_:namerica a :Location; :name "North America"; :type "continent".
+``` + +# The Semantic Web + +Some of the research and development effort on triple stores was motivated by the *Semantic Web*, an +early-2000s effort to facilitate internet-wide data exchange by publishing data not only as +human-readable web pages, but also in a standardized, machine-readable format. Although the Semantic +Web as originally envisioned did not succeed +[[49](/en/ch3#Target2018), +[50](/en/ch3#MendelGleason2022)], +the legacy of the Semantic Web project lives on in a number of specific technologies: *linked data* +standards such as JSON-LD [[51](/en/ch3#Sporny2014)], +*ontologies* used in biomedical science +[[52](/en/ch3#MichiganOntologies)], +Facebook’s Open Graph protocol +[[53](/en/ch3#OpenGraph)] +(which is used for link unfurling +[[54](/en/ch3#Haughey2015)]), +knowledge graphs such as Wikidata, and standardized vocabularies for structured data maintained by +[`schema.org`](https://schema.org/). + +Triple-stores are another Semantic Web technology that has found use outside of its original use +case: even if you have no interest in the Semantic Web, triples can be a good internal data model +for applications. + +### The RDF data model + +The Turtle language we used in [Example 3-8](/en/ch3#fig_graph_n3_shorthand) is actually a way of encoding data in the +*Resource Description Framework* (RDF) +[[55](/en/ch3#W3CRDF)], +a data model that was designed for the Semantic Web. RDF data can also be encoded in other ways, for +example (more verbosely) in XML, as shown in [Example 3-9](/en/ch3#fig_graph_rdf_xml). Tools like Apache Jena can +automatically convert between different RDF encodings. + +##### Example 3-9. The data of [Example 3-8](/en/ch3#fig_graph_n3_shorthand), expressed using RDF/XML syntax + +``` +<rdf:RDF xmlns="urn:example:" + xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> + + <Location rdf:nodeID="idaho"> + <name>Idaho</name> + <type>state</type> + <within> + <Location rdf:nodeID="usa"> + <name>United States</name> + <type>country</type> + <within> + <Location rdf:nodeID="namerica"> + <name>North America</name> + <type>continent</type> + </Location> + </within> + </Location> + </within> + </Location> + + <Person rdf:nodeID="lucy"> + <name>Lucy</name> + <bornIn rdf:nodeID="idaho"/> + </Person> +</rdf:RDF> +``` + +RDF has a few quirks due to the fact that it is designed for internet-wide data exchange.
The +subject, predicate, and object of a triple are often URIs. For example, a predicate might be a full +`http://` URI under a namespace that you control, +rather than just `WITHIN` or `LIVES_IN`. The reasoning behind this design is that you should be able +to combine your data with someone else’s data, and if they attach a different meaning to the word +`within` or `lives_in`, you won’t get a conflict because their predicates are actually +URIs under a different namespace. + +Such a namespace URL doesn’t necessarily need to resolve to anything; from +RDF’s point of view, it is simply a namespace. To avoid potential confusion with `http://` URLs, the +examples in this section use non-resolvable URIs such as `urn:example:within`. Fortunately, you can +just specify this prefix once at the top of the file, and then forget about it. + +### The SPARQL query language + +*SPARQL* is a query language for triple-stores using the RDF data model +[[56](/en/ch3#Harris2013)]. +(It is an acronym for *SPARQL Protocol and RDF Query Language*, pronounced “sparkle.”) +It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite +similar. + +The same query as before (finding people who have moved from the US to Europe) is similarly concise in +SPARQL as it is in Cypher (see [Example 3-10](/en/ch3#fig_sparql_query)). + +##### Example 3-10. The same query as [Example 3-5](/en/ch3#fig_cypher_query), expressed in SPARQL + +``` +PREFIX : <urn:example:> + +SELECT ?personName WHERE { + ?person :name ?personName. + ?person :bornIn / :within* / :name "United States". + ?person :livesIn / :within* / :name "Europe". +} +``` + +The structure is very similar. The following two expressions are equivalent (variables start with a +question mark in SPARQL): + +``` +(person) -[:BORN_IN]-> () -[:WITHIN*0..]-> (location) # Cypher + +?person :bornIn / :within* ?location. # SPARQL +``` + +Because RDF doesn’t distinguish between properties and edges but just uses predicates for both, you +can use the same syntax for matching properties.
In the following expression, the variable `usa` is +bound to any vertex that has a `name` property whose value is the string `"United States"`: + +``` +(usa {name:'United States'}) # Cypher + +?usa :name "United States". # SPARQL +``` + +SPARQL is supported by Amazon Neptune, AllegroGraph, Blazegraph, OpenLink Virtuoso, Apache Jena, and +various other triple stores [[36](/en/ch3#Besta2019)]. + +## Datalog: Recursive Relational Queries + +Datalog is a much older language than SPARQL or Cypher: it arose from academic research in the 1980s +[[57](/en/ch3#Green2013), +[58](/en/ch3#Ceri1989), +[59](/en/ch3#Abiteboul1995)]. +It is less well known among software engineers and not widely supported in mainstream databases, but +it ought to be better known since it is a very expressive language that is particularly powerful for +complex queries. Several niche databases, including Datomic, LogicBlox, CozoDB, and LinkedIn’s +LIquid [[60](/en/ch3#Meyer2020)] use Datalog as +their query language. + +Datalog is actually based on a relational data model, not a graph, but it appears in the graph +databases section of this book because recursive queries on graphs are a particular strength of +Datalog. + +The contents of a Datalog database consist of *facts*, and each fact corresponds to a row in a +relational table. For example, say we have a table *location* containing locations, and it has three +columns: *ID*, *name*, and *type*. The fact that the US is a country could then be written as +`location(2, "United States", "country")`, where `2` is the ID of the US. In general, the statement +`table(val1, val2, …)` means that `table` contains a row where the first column contains `val1`, +the second column contains `val2`, and so on. + +[Example 3-11](/en/ch3#fig_datalog_triples) shows how to write the data from the left-hand side of +[Figure 3-6](/en/ch3#fig_datamodels_graph) in Datalog.
The edges of the graph (`within`, `born_in`, and `lives_in`) +are represented as two-column join tables. For example, Lucy has the ID 100 and Idaho has the ID 3, +so the relationship “Lucy was born in Idaho” is represented as `born_in(100, 3)`. + +##### Example 3-11. A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as Datalog facts + +``` +location(1, "North America", "continent"). +location(2, "United States", "country"). +location(3, "Idaho", "state"). + +within(2, 1). /* US is in North America */ +within(3, 2). /* Idaho is in the US */ + +person(100, "Lucy"). +born_in(100, 3). /* Lucy was born in Idaho */ +``` + +Now that we have defined the data, we can write the same query as before, as shown in +[Example 3-12](/en/ch3#fig_datalog_query). It looks a bit different from the equivalent in Cypher or SPARQL, but don’t +let that put you off. Datalog is a subset of Prolog, a programming language that you might have seen +before if you’ve studied computer science. + +##### Example 3-12. The same query as [Example 3-5](/en/ch3#fig_cypher_query), expressed in Datalog + +``` +within_recursive(LocID, PlaceName) :- location(LocID, PlaceName, _). /* Rule 1 */ + +within_recursive(LocID, PlaceName) :- within(LocID, ViaID), /* Rule 2 */ + within_recursive(ViaID, PlaceName). + +migrated(PName, BornIn, LivingIn) :- person(PersonID, PName), /* Rule 3 */ + born_in(PersonID, BornID), + within_recursive(BornID, BornIn), + lives_in(PersonID, LivingID), + within_recursive(LivingID, LivingIn). + +us_to_europe(Person) :- migrated(Person, "United States", "Europe"). /* Rule 4 */ +/* us_to_europe contains the row "Lucy". */ +``` + +Cypher and SPARQL jump in right away with `MATCH` or `SELECT`, but Datalog takes a small step at a time. We +define *rules* that derive new virtual tables from the underlying facts.
These derived tables are +like (virtual) SQL views: they are not stored in the database, but you can query them in the same +way as a table containing stored facts. + +In [Example 3-12](/en/ch3#fig_datalog_query) we define three derived tables: `within_recursive`, `migrated`, and +`us_to_europe`. The name and columns of the virtual tables are defined by what appears before the +`:-` symbol of each rule. For example, `migrated(PName, BornIn, LivingIn)` is a virtual table with +three columns: the name of a person, the name of the place where they were born, and the name of the +place where they are living. + +The content of a virtual table is defined by the part of the rule after the `:-` symbol, where we +try to find rows that match a certain pattern in the tables. For example, `person(PersonID, PName)` +matches the row `person(100, "Lucy")`, with the variable `PersonID` bound to the value `100` and the +variable `PName` bound to the value `"Lucy"`. A rule applies if the system can find a match for +*all* patterns on the righthand side of the `:-` operator. When the rule applies, it’s as though the +lefthand side of the `:-` was added to the database (with variables replaced by the values they +matched). + +One possible way of applying the rules is thus (and as illustrated in [Figure 3-7](/en/ch3#fig_datalog_naive)): + +1. `location(1, "North America", "continent")` exists in the database, so rule 1 + applies. It generates `within_recursive(1, "North America")`. +2. `within(2, 1)` exists in the database and the previous step generated + `within_recursive(1, "North America")`, so rule 2 applies. It generates + `within_recursive(2, "North America")`. +3. `within(3, 2)` exists in the database and the previous step generated + `within_recursive(2, "North America")`, so rule 2 applies. It generates + `within_recursive(3, "North America")`. 
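The rule applications listed above can be sketched as a naive bottom-up loop in Python. This illustrates only rules 1 and 2 over the facts of Example 3-11 and makes no claim about how any Datalog engine evaluates them; real systems are free to use more efficient strategies. The function name `within_recursive` mirrors the rule name, and the set-based representation is an assumption made for this sketch.

```python
# Facts from Example 3-11, written as Python tuples.
location = {
    (1, "North America", "continent"),
    (2, "United States", "country"),
    (3, "Idaho", "state"),
}
within = {(2, 1), (3, 2)}

def within_recursive():
    # Rule 1: every location is within the place that bears its own name.
    derived = {(loc_id, name) for (loc_id, name, _type) in location}
    while True:
        # Rule 2: if loc is within via, and via is within place,
        # then loc is also within place.
        new = {
            (loc_id, place)
            for (loc_id, via_id) in within
            for (via, place) in derived
            if via == via_id
        }
        if new <= derived:  # fixpoint: applying the rules adds no new rows
            return derived
        derived |= new

rows = within_recursive()
```

The loop stops once a pass derives nothing new, at which point `rows` contains `(3, "North America")` alongside the directly stated facts.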
+ +By repeated application of rules 1 and 2, the `within_recursive` virtual table can tell us all the +locations in North America (or any other location) contained in our database. + +![ddia 0307](/fig/ddia_0307.png) + +###### Figure 3-7. Determining that Idaho is in North America, using the Datalog rules from [Example 3-12](/en/ch3#fig_datalog_query). + +Now rule 3 can find people who were born in some location `BornIn` and live in some location +`LivingIn`. Rule 4 invokes rule 3 with `BornIn = 'United States'` and +`LivingIn = 'Europe'`, and returns only the names of the people who match the +search. By querying the contents of the virtual `us_to_europe` table, the Datalog system finally +gets the same answer as in the earlier Cypher and SPARQL queries. + +The Datalog approach requires a different kind of thinking compared to the other query languages +discussed in this chapter. It allows complex queries to be built up rule by rule, with one rule +referring to other rules, similarly to the way that you break down code into functions that call +each other. Just like functions can be recursive, Datalog rules can also invoke themselves, like +rule 2 in [Example 3-12](/en/ch3#fig_datalog_query), which enables graph traversals in Datalog queries. + +## GraphQL + +GraphQL is a query language that, by design, is much more restrictive than the other query languages +we have seen in this chapter. The purpose of GraphQL is to allow client software running on a user’s +device (such as a mobile app or a JavaScript web app frontend) to request a JSON document with a +particular structure, containing the fields necessary for rendering its user interface. GraphQL +interfaces allow developers to rapidly change queries in client code without changing server-side +APIs. + +GraphQL’s flexibility comes at a cost. 
Organizations that adopt GraphQL often need tooling to +convert GraphQL queries into requests to internal services, which often use REST or gRPC (see +[Chapter 5](/en/ch5#ch_encoding)). Authorization, rate limiting, and performance challenges are additional concerns +[[61](/en/ch3#Bessey2024)]. +GraphQL’s query language is also limited since GraphQL queries come from an untrusted source. The language +does not allow anything that could be expensive to execute, since otherwise users could perform +denial-of-service attacks on a server by running lots of expensive queries. In particular, GraphQL +does not allow recursive queries (unlike Cypher, SPARQL, SQL, or Datalog), and it does not allow +arbitrary search conditions such as “find people who were born in the US and are now living in +Europe” (unless the service owners specifically choose to offer such search functionality). + +Nevertheless, GraphQL is useful. [Example 3-13](/en/ch3#fig_graphql_query) shows how you might implement a group chat +application such as Discord or Slack using GraphQL. The query requests all the channels that the +user has access to, including the channel name and the 50 most recent messages in each channel. For +each message it requests the timestamp, the message content, and the name and profile picture URL +for the sender of the message. Moreover, if a message is a reply to another message, the query also +requests the sender name and the content of the message it is replying to (which might be rendered +in a smaller font above the reply, in order to provide some context). + +##### Example 3-13. Example GraphQL query for a group chat application + +``` +query ChatApp { + channels { + name + recentMessages(latest: 50) { + timestamp + content + sender { + fullName + imageUrl + } + replyTo { + content + sender { + fullName + } + } + } + } +} +``` + +[Example 3-14](/en/ch3#fig_graphql_response) shows what a response to the query in [Example 3-13](/en/ch3#fig_graphql_query) might look +like.
The response is a JSON document that mirrors the structure of the query: it contains exactly +those attributes that were requested, no more and no less. This approach has the advantage that the +server does not need to know which attributes the client requires in order to render the user +interface; instead, the client can simply request what it needs. For example, this query does not +request a profile picture URL for the sender of the `replyTo` message, but if the user interface +were changed to add that profile picture, it would be easy for the client to add the required +`imageUrl` attribute to the query without changing the server. + +##### Example 3-14. A possible response to the query in [Example 3-13](/en/ch3#fig_graphql_query) + +``` +{ + "data": { + "channels": [ + { + "name": "#general", + "recentMessages": [ + { + "timestamp": 1693143014, + "content": "Hey! How are y'all doing?", + "sender": {"fullName": "Aaliyah", "imageUrl": "https://..."}, + "replyTo": null + }, + { + "timestamp": 1693143024, + "content": "Great! And you?", + "sender": {"fullName": "Caleb", "imageUrl": "https://..."}, + "replyTo": { + "content": "Hey! How are y'all doing?", + "sender": {"fullName": "Aaliyah"} + } + }, + ... +``` + +In [Example 3-14](/en/ch3#fig_graphql_response) the name and image URL of a message sender are embedded directly in the +message object. If the same user sends multiple messages, this information is repeated on each +message. In principle, it would be possible to reduce this duplication, but GraphQL makes the design +choice to accept a larger response size in order to make it simpler to render the user interface +based on the data. + +The `replyTo` field is similar: in [Example 3-14](/en/ch3#fig_graphql_response), the second message is a reply to the +first, and the content (“Hey!…”) and sender Aaliyah are duplicated under `replyTo`.
It would be +possible to instead return the ID of the message being replied to, but then the client would have to +make an additional request to the server if that ID is not among the 50 most recent messages +returned. Duplicating the content makes it much simpler to work with the data. + +The server’s database can store the data in a more normalized form, and perform the necessary joins +to process a query. For example, the server might store a message along with the user ID of the +sender and the ID of the message it is replying to; when it receives a query like the one above, the +server would then resolve those IDs to find the records they refer to. However, the client can only +ask the server to perform joins that are explicitly offered in the GraphQL schema. + +Even though the response to a GraphQL query looks similar to a response from a document database, +and even though it has “graph” in the name, GraphQL can be implemented on top of any type of +database—relational, document, or graph. + +# Event Sourcing and CQRS + +In all the data models we have discussed so far, the data is queried in the same form as it is +written—be it JSON documents, rows in tables, or vertices and edges in a graph. However, in complex +applications it can sometimes be difficult to find a single data representation that is able to +satisfy all the different ways that the data needs to be queried and presented. In such situations, +it can be beneficial to write data in one form, and then to derive from it several representations +that are optimized for different types of reads. + +We previously saw this idea in [“Systems of Record and Derived Data”](/en/ch1#sec_introduction_derived), and ETL (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) +is one example of such a derivation process. Now we will take the idea further. 
If we are going to +derive one data representation from another anyway, we can choose different representations that are +optimized for writing and for reading, respectively. How would you model your data if you only +wanted to optimize it for writing, and if efficient queries were of no concern? + +Perhaps the simplest, fastest, and most expressive way of writing data is an *event log*: every time +you want to write some data, you encode it as a self-contained string (perhaps as JSON), including a +timestamp, and then append it to a sequence of events. Events in this log are *immutable*: you never +change or delete them, you only ever append more events to the log (which may supersede earlier +events). An event can contain arbitrary properties. + +[Figure 3-8](/en/ch3#fig_event_sourcing) shows an example that could be taken from a conference management system. A +conference can be a complex business domain: not only can individual attendees register and pay by +card, but companies can also order seats in bulk, pay by invoice, and then later assign the seats to +individual people. Some number of seats may be reserved for speakers, sponsors, volunteer helpers, +and so on. Reservations may also be cancelled, and meanwhile, the conference organizer might change +the capacity of the event by moving it to a different room. With all of this going on, simply +calculating the number of available seats becomes a challenging query. + +![ddia 0308](/fig/ddia_0308.png) + +###### Figure 3-8. Using a log of immutable events as source of truth, and deriving materialized views from it. + +In [Figure 3-8](/en/ch3#fig_event_sourcing), every change to the state of the conference (such as the organizer +opening registrations, or attendees making and cancelling registrations) is first stored as an +event. Whenever an event is appended to the log, several *materialized views* (also known as +*projections* or *read models*) are also updated to reflect the effect of that event. 
In the +conference example, there might be one materialized view that collects all information related to +the status of each booking, another that computes charts for the conference organizer’s dashboard, +and a third that generates files for the printer that produces the attendees’ badges. + +The idea of using events as the source of truth, and expressing every state change as an event, is +known as *event sourcing* [[62](/en/ch3#Betts2012), +[63](/en/ch3#Young2014)]. +The principle of maintaining separate read-optimized representations and deriving them from the +write-optimized representation is called *command query responsibility segregation (CQRS)* +[[64](/en/ch3#Young2010)]. +These terms originated in the domain-driven design (DDD) community, although similar ideas have been +around for a long time, for example in *state machine replication* (see [“Using shared logs”](/en/ch10#sec_consistency_smr)). + +When a request from a user comes in, it is called a *command*, and it first needs to be validated. +Only once the command has been executed and it has been determined to be valid (e.g., there were +enough available seats for a requested reservation), it becomes a fact, and the corresponding event +is added to the log. Consequently, the event log should contain only valid events, and a consumer +of the event log that builds a materialized view is not allowed to reject an event. + +When modelling your data in an event sourcing style, it is recommended that you name your events in +the past tense (e.g., “the seats were booked”), because an event is a record of the fact that +something has happened in the past. Even if the user later decides to change or cancel, the fact +remains true that they formerly held a booking, and the change or cancellation is a separate event +that is added later. 
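
The flow from command to event can be illustrated with a small sketch. The Python below is a minimal example under stated assumptions, not a reference implementation: the event types (`CapacityChanged`, `SeatsBooked`, `BookingCancelled`), the function names, and the in-memory list used as a log are all hypothetical.

```python
import json
import time


def append_event(log, event_type, **properties):
    """Encode an event as a self-contained JSON string and append it to the log."""
    event = {"type": event_type, "timestamp": time.time(), **properties}
    log.append(json.dumps(event))


def available_seats(log):
    """Derive the number of free seats by reading the whole log in order."""
    capacity, booked = 0, 0
    for raw in log:
        event = json.loads(raw)
        if event["type"] == "CapacityChanged":
            capacity = event["capacity"]
        elif event["type"] == "SeatsBooked":
            booked += event["seats"]
        elif event["type"] == "BookingCancelled":
            booked -= event["seats"]
    return capacity - booked


def handle_book_seats(log, attendee, seats):
    """Command handler: validate first, and only then record a fact."""
    if seats <= 0 or seats > available_seats(log):
        return False  # a rejected command never reaches the log
    append_event(log, "SeatsBooked", attendee=attendee, seats=seats)
    return True


log = []
append_event(log, "CapacityChanged", capacity=2)
accepted_first = handle_book_seats(log, "alice", 2)   # enough seats: accepted
accepted_second = handle_book_seats(log, "bob", 1)    # sold out: rejected
append_event(log, "BookingCancelled", attendee="alice", seats=2)
accepted_third = handle_book_seats(log, "bob", 1)     # seats are free again

print(accepted_first, accepted_second, accepted_third, len(log))
```

Note that the cancellation does not remove the earlier `SeatsBooked` event; it is appended as a separate fact, and the rejected command leaves no trace in the log.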
+ +A similarity between event sourcing and a star schema fact table, as discussed in +[“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics), is that both are collections of events that happened in the past. +However, rows in a fact table all have the same set of columns, whereas in event sourcing there may +be many different event types, each with different properties. Moreover, a fact table is an +unordered collection, while in event sourcing the order of events is important: if a booking is +first made and then cancelled, processing those events in the wrong order would not make sense. + +Event sourcing and CQRS have several advantages: + +* For the people developing the system, events better communicate the intent of *why* something + happened. For example, it’s easier to understand the event “the booking was cancelled” than “the + `active` column on row 4001 of the `bookings` table was set to `false`, three rows associated with + that booking were deleted from the `seat_assignments` table, and a row representing the refund was + inserted into the `payments` table”. Those row modifications may still happen when a materialized + view processes the cancellation event, but when they are driven by an event, the reason for the + updates becomes much clearer. +* A key principle of event sourcing is that the materialized views are derived from the event log in + a reproducible way: you should always be able to delete the materialized views and recompute them + by processing the same events in the same order, using the same code. If there was a bug in the + view maintenance code, you can just delete the view and recompute it with the new code. It’s also + easier to find the bug because you can re-run the view maintenance code as often as you like and + inspect its behavior. +* You can have multiple materialized views that are optimized for the particular queries that your + application requires.
They can be stored either in the same database as the events or a different + one, depending on your needs. They can use any data model, and they can be denormalized for fast + reads. You can even keep a view only in memory and avoid persisting it, as long as it’s okay to + recompute the view from the event log whenever the service restarts. +* If you decide you want to present the existing information in a new way, it is easy to build a new + materialized view from the existing event log. You can also evolve the system to support new + features by adding new types of events, or new properties to existing event types (any older + events remain unmodified). You can also chain new behaviors off existing events (for example, when + a conference attendee cancels, their seat could be offered to the next person on the waiting + list). +* If an event was written in error you can delete it again, and then you can rebuild the views + without the deleted event. On the other hand, in a database where you update and delete data + directly, a committed transaction is often difficult to reverse. Event sourcing can therefore + reduce the number of irreversible actions in the system, making it easier to change + (see [“Evolvability: Making Change Easy”](/en/ch2#sec_introduction_evolvability)). +* The event log can also serve as an audit log of everything that happened in the system, which is + valuable in regulated industries that require such auditability. + +However, event sourcing and CQRS also have downsides: + +* You need to be careful if external information is involved. For example, say an event contains a + price given in one currency, and for one of the views it needs to be converted into another + currency. Since the exchange rate may fluctuate, it would be problematic to fetch the exchange + rate from an external source when processing the event, since you would get a different result if + you recompute the materialized view on another date. 
To make the event processing logic + deterministic, you either need to include the exchange rate in the event itself, or have a way of + querying the historical exchange rate at the timestamp indicated in the event, ensuring that this + query always returns the same result for the same timestamp. +* The requirement that events are immutable creates problems if events contain personal data from + users, since users may exercise their right (e.g., under the GDPR) to request deletion of their + data. If the event log is on a per-user basis, you can just delete the whole log for that user, + but that doesn’t work if your event log contains events relating to multiple users. You can try + storing the personal data outside of the actual event, or encrypting it with a key that you can + later choose to delete, but that also makes it harder to recompute derived state when needed. +* Reprocessing events requires care if there are externally visible side-effects—for example, you + probably don’t want to resend confirmation emails every time you rebuild a materialized view. + +You can implement event sourcing on top of any database, but there are also some systems that are +specifically designed to support this pattern, such as EventStoreDB, MartenDB (based on PostgreSQL), +and Axon Framework. You can also use message brokers such as Apache Kafka to store the event log, +and stream processors can keep the materialized views up-to-date; we will return to these topics in +[Link to Come]. + +The only important requirement is that the event storage system must guarantee that all materialized +views process the events in exactly the same order as they appear in the log; as we shall see in +[Chapter 10](/en/ch10#ch_consistency), this is not always easy to achieve in a distributed system. 
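
To make the determinism and ordering requirements discussed in this section more concrete, here is a minimal Python sketch of rebuilding a materialized view from scratch. It is an illustration only: the event shapes, the `rate_to_usd` property, and the function `build_bookings_view` are hypothetical, and the log is a plain in-memory list rather than a real event store.

```python
# A hypothetical event log. Each booking event carries the exchange rate that
# applied when it happened, so processing never consults an external source.
events = [
    {"type": "SeatsBooked", "booking": "b1", "price": 100.0,
     "currency": "EUR", "rate_to_usd": 1.10},
    {"type": "SeatsBooked", "booking": "b2", "price": 50.0,
     "currency": "GBP", "rate_to_usd": 1.25},
    {"type": "BookingCancelled", "booking": "b1"},
]


def build_bookings_view(log):
    """Rebuild a booking-status view by processing every event in log order."""
    view = {}
    for event in log:
        if event["type"] == "SeatsBooked":
            view[event["booking"]] = {
                "status": "active",
                # Deterministic: uses only data stored in the event itself.
                "price_usd": event["price"] * event["rate_to_usd"],
            }
        elif event["type"] == "BookingCancelled":
            # Relies on the booking event having been processed earlier.
            view[event["booking"]]["status"] = "cancelled"
    return view


first_build = build_bookings_view(events)
second_build = build_bookings_view(events)  # e.g. after deleting the old view
print(first_build == second_build, first_build["b1"]["status"])
```

Because the view is a pure function of the ordered log, it can be discarded and recomputed at any time. If the cancellation were processed before the booking it refers to, this particular sketch would fail with a `KeyError`, which is one way of seeing why every view must observe the events in log order.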
+ +# Dataframes, Matrices, and Arrays + +The data models we have seen so far in this chapter are generally used for both transaction +processing and analytics purposes (see [“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics)). There are also some data +models that you are likely to encounter in an analytical or scientific context, but that rarely +feature in OLTP systems: dataframes and multidimensional arrays of numbers such as matrices. + +Dataframes are a data model supported by the R language, the Pandas library for Python, Apache +Spark, ArcticDB, Dask, and other systems. They are a popular tool for data scientists preparing data +for training machine learning models, but they are also widely used for data exploration, +statistical data analysis, data visualization, and similar purposes. + +At first glance, a dataframe is similar to a table in a relational database or a spreadsheet. It +supports relational-like operators that perform bulk operations on the contents of the dataframe: +for example, applying a function to all of the rows, filtering the rows based on some condition, +grouping rows by some columns and aggregating other columns, and joining the rows in one dataframe +with another dataframe based on some key (what a relational database calls *join* is typically +called *merge* on dataframes). + +Instead of a declarative query such as SQL, a dataframe is typically manipulated through a series of +commands that modify its structure and content. This matches the typical workflow of data +scientists, who incrementally “wrangle” the data into a form that allows them to find answers to the +questions they are asking. These manipulations usually take place on the data scientist’s private +copy of the dataset, often on their local machine, although the end result may be shared with other +users. 
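
As a rough illustration of this step-by-step style, the following sketch uses the Pandas library to filter, merge, and aggregate a small dataset. The column names and the sample data are made up for the example, and it assumes Pandas is installed.

```python
import pandas as pd

# Hypothetical sample data: movie ratings and a lookup table of movie titles.
ratings = pd.DataFrame({
    "user_id": [1, 1, 2, 3, 3],
    "movie_id": [10, 20, 10, 20, 30],
    "rating": [5, 3, 4, 2, 5],
})
movies = pd.DataFrame({
    "movie_id": [10, 20, 30],
    "title": ["Movie A", "Movie B", "Movie C"],
})

# Step 1: filter the rows based on a condition.
good = ratings[ratings["rating"] >= 4]

# Step 2: join with another dataframe on a key (called "merge" in Pandas).
good = good.merge(movies, on="movie_id")

# Step 3: group rows by one column and aggregate another.
summary = good.groupby("title")["rating"].agg(["count", "mean"])
print(summary)
```

Each statement produces a new intermediate dataframe that can be inspected before moving on, which is what makes this style convenient for interactive exploration.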
+ +Dataframe APIs also offer a wide variety of operations that go far beyond what relational databases +offer, and the data model is often used in ways that are very different from typical relational data +modelling [[65](/en/ch3#Petersohn2020)]. +For example, a common use of dataframes is to transform data from a relational-like representation +into a matrix or multidimensional array representation, which is the form that many machine learning +algorithms expect of their input. + +A simple example of such a transformation is shown in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix). On the left we +have a relational table of how different users have rated various movies (on a scale of 1 to 5), and +on the right the data has been transformed into a matrix where each column is a movie and each row +is a user (similarly to a *pivot table* in a spreadsheet). The matrix is *sparse*, which means there +is no data for many user-movie combinations, but this is fine. This matrix may have many thousands +of columns and would therefore not fit well in a relational database, but dataframes and libraries +that offer sparse arrays (such as NumPy for Python) can handle such data easily. + +![ddia 0309](/fig/ddia_0309.png) + +###### Figure 3-9. Transforming a relational database of movie ratings into a matrix representation. + +A matrix can only contain numbers, and various techniques are used to transform non-numerical data +into numbers in the matrix. For example: + +* Dates (which are omitted from the example matrix in [Figure 3-9](/en/ch3#fig_dataframe_to_matrix)) could be scaled + to be floating-point numbers within some suitable range. 
+* For columns that can only take one of a small, fixed set of values (for example, the genre of a + movie in a database of movies), a *one-hot encoding* is often used: we create a column for each + possible value (one for “comedy”, one for “drama”, one for “horror”, etc.), and for each row + representing a movie, we put a 1 in the column corresponding to the genre of that movie, and a 0 + in all the other columns. This representation also easily generalizes to movies that fit within + several genres. + +Once the data is in the form of a matrix of numbers, it is amenable to linear algebra operations, +which form the basis of many machine learning algorithms. For example, the data in +[Figure 3-9](/en/ch3#fig_dataframe_to_matrix) could be a part of a system for recommending movies that the user may +like. Dataframes are flexible enough to allow data to be gradually evolved from a relational form +into a matrix representation, while giving the data scientist control over the representation that +is most suitable for achieving the goals of the data analysis or model training process. + +There are also databases such as TileDB +[[66](/en/ch3#Papadopoulos2016)] +that specialize in storing large multidimensional arrays of numbers; they are called *array +databases* and are most commonly used for scientific datasets such as geospatial measurements +(raster data on a regularly spaced grid), medical imaging, or observations from astronomical +telescopes [[67](/en/ch3#Rusu2022)]. +Dataframes are also used in the financial industry for representing *time series data*, such as the +prices of assets and trades over time +[[68](/en/ch3#Targett2023)]. + +# Summary + +Data models are a huge subject, and in this chapter we have taken a quick look at a broad variety of +different models. 
We didn’t have space to go into all the details of each model, but hopefully the +overview has been enough to whet your appetite to find out more about the model that best fits your +application’s requirements. + +The *relational model*, despite being more than half a century old, remains an important data model +for many applications—especially in data warehousing and business analytics, where relational star +or snowflake schemas and SQL queries are ubiquitous. However, several alternatives to relational +data have also become popular in other domains: + +* The *document model* targets use cases where data comes in self-contained JSON documents, and + where relationships between one document and another are rare. +* *Graph data models* go in the opposite direction, targeting use cases where anything is potentially + related to everything, and where queries potentially need to traverse multiple hops to find the + data of interest (which can be expressed using recursive queries in Cypher, SPARQL, or Datalog). +* *Dataframes* generalize relational data to large numbers of columns, and thereby provide a bridge + between databases and the multidimensional arrays that form the basis of much machine learning, + statistical data analysis, and scientific computing. + +To some degree, one model can be emulated in terms of another model—for example, graph data can be +represented in a relational database—but the result can be awkward, as we saw with the support for +recursive queries in SQL. + +Various specialist databases have therefore been developed for each data model, providing query +languages and storage engines that are optimized for a particular model. 
However, there is also a +trend for databases to expand into neighboring niches by adding support for other data models: for +example, relational databases have added support for document data in the form of JSON columns, +document databases have added relational-like joins, and support for graph data within SQL is +gradually improving. + +Another model we discussed is *event sourcing*, which represents data as an append-only log of +immutable events, and which can be advantageous for modeling activities in complex business domains. +An append-only log is good for writing data (as we shall see in [Chapter 4](/en/ch4#ch_storage)); in order to support +efficient queries, the event log is translated into read-optimized materialized views through CQRS. + +One thing that non-relational data models have in common is that they typically don’t enforce a +schema for the data they store, which can make it easier to adapt applications to changing +requirements. However, your application most likely still assumes that data has a certain structure; +it’s just a question of whether the schema is explicit (enforced on write) or implicit (assumed on +read). + +Although we have covered a lot of ground, there are still data models left unmentioned. To give just +a few brief examples: + +* Researchers working with genome data often need to perform *sequence-similarity searches*, which + means taking one very long string (representing a DNA molecule) and matching it against a large + database of strings that are similar, but not identical. None of the databases described here can + handle this kind of usage, which is why researchers have written specialized genome database + software like GenBank [[69](/en/ch3#Benson2007)]. +* Many financial systems use *ledgers* with double-entry accounting as their data model. This type + of data can be represented in relational databases, but there are also databases such as + TigerBeetle that specialize in this data model. 
Cryptocurrencies and blockchains are typically + based on distributed ledgers, which also have value transfer built into their data model. +* *Full-text search* is arguably a kind of data model that is frequently used alongside databases. + Information retrieval is a large specialist subject that we won’t cover in great detail in this + book, but we’ll touch on search indexes and vector search in [“Full-Text Search”](/en/ch4#sec_storage_full_text). + +We have to leave it there for now. In the next chapter we will discuss some of the trade-offs that +come into play when *implementing* the data models described in this chapter. + +##### Footnotes + +##### References + +[[1](/en/ch3#Brandon2024-marker)] Jamie Brandon. +[Unexplanations: +query optimization works because sql is declarative](https://www.scattered-thoughts.net/writing/unexplanations-sql-declarative/). *scattered-thoughts.net*, February 2024. +Archived at [perma.cc/P6W2-WMFZ](https://perma.cc/P6W2-WMFZ) + +[[2](/en/ch3#Hellerstein2010-marker)] Joseph M. Hellerstein. +[The Declarative +Imperative: Experiences and Conjectures in Distributed Logic](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf). Tech report UCB/EECS-2010-90, +Electrical Engineering and Computer Sciences, University of California at Berkeley, June 2010. +Archived at [perma.cc/K56R-VVQM](https://perma.cc/K56R-VVQM) + +[[3](/en/ch3#Codd1970-marker)] Edgar F. Codd. +[A Relational Model of Data for Large +Shared Data Banks](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf). *Communications of the ACM*, volume 13, issue 6, pages 377–387, June 1970. +[doi:10.1145/362384.362685](https://doi.org/10.1145/362384.362685) + +[[4](/en/ch3#Stonebraker2005around-marker)] Michael Stonebraker and Joseph M. Hellerstein. +[What Goes Around Comes Around](http://mitpress2.mit.edu/books/chapters/0262693143chapm1.pdf). +In *Readings in Database Systems*, 4th edition, MIT Press, pages 2–41, 2005. 
+ISBN: 9780262693141 + +[[5](/en/ch3#Winand2015-marker)] Markus Winand. +[Modern SQL: Beyond Relational](https://modern-sql.com/). *modern-sql.com*, 2015. +Archived at [perma.cc/D63V-WAPN](https://perma.cc/D63V-WAPN) + +[[6](/en/ch3#Fowler2012-marker)] Martin Fowler. +[OrmHate](https://martinfowler.com/bliki/OrmHate.html). *martinfowler.com*, May +2012. Archived at [perma.cc/VCM8-PKNG](https://perma.cc/VCM8-PKNG) + +[[7](/en/ch3#Mihalcea2023-marker)] Vlad Mihalcea. +[N+1 query problem with JPA and Hibernate](https://vladmihalcea.com/n-plus-1-query-problem/). +*vladmihalcea.com*, January 2023. +Archived at [perma.cc/79EV-TZKB](https://perma.cc/79EV-TZKB) + +[[8](/en/ch3#Schauder2023-marker)] Jens Schauder. +[This +is the Beginning of the End of the N+1 Problem: Introducing Single Query Loading](https://spring.io/blog/2023/08/31/this-is-the-beginning-of-the-end-of-the-n-1-problem-introducing-single-query). *spring.io*, August 2023. +Archived at [perma.cc/6V96-R333](https://perma.cc/6V96-R333) + +[[9](/en/ch3#Zola2014-marker)] William Zola. +[6 Rules of +Thumb for MongoDB Schema Design](https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design). *mongodb.com*, June 2014. +Archived at [perma.cc/T2BZ-PPJB](https://perma.cc/T2BZ-PPJB) + +[[10](/en/ch3#Andrews2023-marker)] Sidney Andrews and Christopher McClister. +[Data modeling in +Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data). *learn.microsoft.com*, February 2023. Archived at +[archive.org](https://web.archive.org/web/20230207193233/https%3A//learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data) + +[[11](/en/ch3#Krikorian2012_ch3-marker)] Raffi Krikorian. +[Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). +At *QCon San Francisco*, November 2012. +Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK) + +[[12](/en/ch3#Kimball2013_ch3-marker)] Ralph Kimball and Margy Ross. 
+[*The Data +Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*](https://learning.oreilly.com/library/view/the-data-warehouse/9781118530801/), +3rd edition. John Wiley & Sons, July 2013. ISBN: 9781118530801 + +[[13](/en/ch3#Kaminsky2022-marker)] Michael Kaminsky. +[Data warehouse modeling: Star schema vs. +OBT](https://www.fivetran.com/blog/star-schema-vs-obt). *fivetran.com*, August 2022. +Archived at [perma.cc/2PZK-BFFP](https://perma.cc/2PZK-BFFP) + +[[14](/en/ch3#Nelson2018-marker)] Joe Nelson. +[User-defined Order in +SQL](https://begriffs.com/posts/2018-03-20-user-defined-order.html). *begriffs.com*, March 2018. +Archived at [perma.cc/GS3W-F7AD](https://perma.cc/GS3W-F7AD) + +[[15](/en/ch3#Wallace2017-marker)] Evan Wallace. +[Realtime Editing of +Ordered Sequences](https://www.figma.com/blog/realtime-editing-of-ordered-sequences/). *figma.com*, March 2017. +Archived at [perma.cc/K6ER-CQZW](https://perma.cc/K6ER-CQZW) + +[[16](/en/ch3#Greenspan2020-marker)] David Greenspan. +[Implementing +Fractional Indexing](https://observablehq.com/%40dgreensp/implementing-fractional-indexing). *observablehq.com*, October 2020. +Archived at [perma.cc/5N4R-MREN](https://perma.cc/5N4R-MREN) + +[[17](/en/ch3#Schemaless-marker)] Martin Fowler. +[Schemaless Data Structures](https://martinfowler.com/articles/schemaless/). +*martinfowler.com*, January 2013. + +[[18](/en/ch3#Awadallah2009-marker)] Amr Awadallah. +[Schema-on-Read vs. +Schema-on-Write](https://www.slideshare.net/awadallah/schemaonread-vs-schemaonwrite). At *Berkeley EECS RAD Lab Retreat*, Santa Cruz, CA, May 2009. +Archived at [perma.cc/DTB2-JCFR](https://perma.cc/DTB2-JCFR) + +[[19](/en/ch3#Odersky2013-marker)] Martin Odersky. +[The Trouble with Types](https://www.infoq.com/presentations/data-types-issues/). +At *Strange Loop*, September 2013. +Archived at [perma.cc/85QE-PVEP](https://perma.cc/85QE-PVEP) + +[[20](/en/ch3#Irwin2013-marker)] Conrad Irwin. 
+[MongoDB—Confessions +of a PostgreSQL Lover](https://speakerdeck.com/conradirwin/mongodb-confessions-of-a-postgresql-lover). At *HTML5DevConf*, October 2013. +Archived at [perma.cc/C2J6-3AL5](https://perma.cc/C2J6-3AL5) + +[[21](/en/ch3#Percona2023-marker)] [Percona +Toolkit Documentation: pt-online-schema-change](https://docs.percona.com/percona-toolkit/pt-online-schema-change.html). *docs.percona.com*, 2023. +Archived at [perma.cc/9K8R-E5UH](https://perma.cc/9K8R-E5UH) + +[[22](/en/ch3#Noach2016-marker)] Shlomi Noach. +[gh-ost: +GitHub’s Online Schema Migration Tool for MySQL](https://github.blog/2016-08-01-gh-ost-github-s-online-migration-tool-for-mysql/). *github.blog*, August 2016. +Archived at [perma.cc/7XAG-XB72](https://perma.cc/7XAG-XB72) + +[[23](/en/ch3#Mukherjee2022-marker)] Shayon Mukherjee. +[pg-osc: +Zero downtime schema changes in PostgreSQL](https://www.shayon.dev/post/2022/47/pg-osc-zero-downtime-schema-changes-in-postgresql/). *shayon.dev*, February 2022. +Archived at [perma.cc/35WN-7WMY](https://perma.cc/35WN-7WMY) + +[[24](/en/ch3#PerezAradros2023-marker)] Carlos Pérez-Aradros Herce. +[Introducing pgroll: zero-downtime, +reversible, schema migrations for Postgres](https://xata.io/blog/pgroll-schema-migrations-postgres). *xata.io*, October 2023. Archived at +[archive.org](https://web.archive.org/web/20231008161750/https%3A//xata.io/blog/pgroll-schema-migrations-postgres) + +[[25](/en/ch3#Corbett2012_ch2-marker)] James C. Corbett, Jeffrey Dean, Michael +Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher +Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, +Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, +Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. +[Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). 
+At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), +October 2012. + +[[26](/en/ch3#BurlesonCluster-marker)] Donald K. Burleson. +[Reduce I/O with Oracle +Cluster Tables](http://www.dba-oracle.com/oracle_tip_hash_index_cluster_table.htm). *dba-oracle.com*. +Archived at [perma.cc/7LBJ-9X2C](https://perma.cc/7LBJ-9X2C) + +[[27](/en/ch3#Chang2006_ch3-marker)] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, +Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. +[Bigtable: A Distributed Storage System for +Structured Data](https://research.google/pubs/pub27898/). At *7th USENIX Symposium on Operating System Design and Implementation* +(OSDI), November 2006. + +[[28](/en/ch3#Walmsley2015-marker)] Priscilla Walmsley. +[*XQuery, +2nd Edition*](https://learning.oreilly.com/library/view/xquery-2nd-edition/9781491915080/). O’Reilly Media, December 2015. ISBN: 9781491915080 + +[[29](/en/ch3#Bryan2013-marker)] Paul C. Bryan, Kris Zyp, and Mark Nottingham. +[JavaScript Object Notation (JSON) Pointer](https://www.rfc-editor.org/rfc/rfc6901). +RFC 6901, IETF, April 2013. + +[[30](/en/ch3#Goessner2024-marker)] Stefan Gössner, Glyn Normington, and Carsten Bormann. +[JSONPath: Query Expressions for JSON](https://www.rfc-editor.org/rfc/rfc9535.html). +RFC 9535, IETF, February 2024. + +[[31](/en/ch3#Stonebraker2024-marker)] Michael Stonebraker and Andrew Pavlo. +[What Goes Around Comes +Around… And Around…](https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf). *ACM SIGMOD Record*, volume 53, issue 2, pages 21–37. +[doi:10.1145/3685980.3685984](https://doi.org/10.1145/3685980.3685984) + +[[32](/en/ch3#Page1999-marker)] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. +[The PageRank Citation Ranking: Bringing Order to the Web](http://ilpubs.stanford.edu:8090/422/). +Technical Report 1999-66, Stanford University InfoLab, November 1999. 
+Archived at [perma.cc/UML9-UZHW](https://perma.cc/UML9-UZHW) + +[[33](/en/ch3#Bronson2013-marker)] Nathan Bronson, Zach Amsden, George Cabrera, +Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, +Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. +[TAO: +Facebook’s Distributed Data Store for the Social Graph](https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson). At *USENIX Annual Technical +Conference* (ATC), June 2013. + +[[34](/en/ch3#Noy2019-marker)] Natasha Noy, Yuqing Gao, Anshu Jain, Anant Narayanan, +Alan Patterson, and Jamie Taylor. +[Industry-Scale +Knowledge Graphs: Lessons and Challenges](https://cacm.acm.org/magazines/2019/8/238342-industry-scale-knowledge-graphs/fulltext). *Communications of the ACM*, volume 62, issue +8, pages 36–43, August 2019. +[doi:10.1145/3331166](https://doi.org/10.1145/3331166) + +[[35](/en/ch3#Feng2023-marker)] Xiyang Feng, Guodong Jin, Ziyi Chen, Chang Liu, and Semih Salihoğlu. +[KÙZU Graph Database Management System](https://www.cidrdb.org/cidr2023/papers/p48-jin.pdf). +At *13th Annual Conference on Innovative Data Systems Research* (CIDR 2023), January 2023. + +[[36](/en/ch3#Besta2019-marker)] Maciej Besta, Emanuel Peter, Robert +Gerstenberger, Marc Fischer, Michał Podstawski, Claude Barthels, Gustavo Alonso, and Torsten Hoefler. +[Demystifying Graph Databases: Analysis and Taxonomy +of Data Organization, System Designs, and Graph Queries](https://arxiv.org/pdf/1910.09017.pdf). *arxiv.org*, October 2019. + +[[37](/en/ch3#TinkerPop2023-marker)] [Apache +TinkerPop 3.6.3 Documentation](https://tinkerpop.apache.org/docs/3.6.3/reference/). *tinkerpop.apache.org*, May 2023.
+Archived at [perma.cc/KM7W-7PAT](https://perma.cc/KM7W-7PAT) + +[[38](/en/ch3#Francis2018-marker)] Nadime Francis, Alastair Green, Paolo Guagliardo, +Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and +Andrés Taylor. [Cypher: An Evolving Query +Language for Property Graphs](https://core.ac.uk/download/pdf/158372754.pdf). At *International Conference on Management of Data* +(SIGMOD), pages 1433–1445, May 2018. +[doi:10.1145/3183713.3190657](https://doi.org/10.1145/3183713.3190657) + +[[39](/en/ch3#EifremTweet-marker)] Emil Eifrem. +[Twitter correspondence](https://twitter.com/emileifrem/status/419107961512804352), +January 2014. Archived at [perma.cc/WM4S-BW64](https://perma.cc/WM4S-BW64) + +[[40](/en/ch3#Tisiot2021-marker)] Francesco Tisiot. +[Explore +the new SEARCH and CYCLE features in PostgreSQL® 14](https://aiven.io/blog/explore-the-new-search-and-cycle-features-in-postgresql-14). *aiven.io*, December 2021. +Archived at [perma.cc/J6BT-83UZ](https://perma.cc/J6BT-83UZ) + +[[41](/en/ch3#Goel2020-marker)] Gaurav Goel. +[Understanding +Hierarchies in Oracle](https://towardsdatascience.com/understanding-hierarchies-in-oracle-43f85561f3d9). *towardsdatascience.com*, May 2020. +Archived at [perma.cc/5ZLR-Q7EW](https://perma.cc/5ZLR-Q7EW) + +[[42](/en/ch3#Deutsch2022-marker)] Alin +Deutsch, Nadime Francis, Alastair Green, Keith Hare, Bei Li, Leonid Libkin, Tobias Lindaaker, Victor +Marsault, Wim Martens, Jan Michels, Filip Murlak, Stefan Plantikow, Petra Selmer, Oskar van Rest, +Hannes Voigt, Domagoj Vrgoč, Mingxi Wu, and Fred Zemke. +[Graph Pattern Matching in GQL and SQL/PGQ](https://arxiv.org/abs/2112.06217). +At *International Conference on Management of Data* (SIGMOD), pages 2246–2258, June 2022. +[doi:10.1145/3514221.3526057](https://doi.org/10.1145/3514221.3526057) + +[[43](/en/ch3#Green2019-marker)] Alastair Green. +[SQL... and now GQL](https://opencypher.org/articles/2019/09/12/SQL-and-now-GQL/). 
+*opencypher.org*, September 2019. +Archived at [perma.cc/AFB2-3SY7](https://perma.cc/AFB2-3SY7) + +[[44](/en/ch3#Deutsch2018-marker)] Alin Deutsch, Yu Xu, and Mingxi Wu. +[Seamless +Syntactic and Semantic Integration of Query Primitives over Relational and Graph Data in GSQL](https://cdn2.hubspot.net/hubfs/4114546/IntegrationQuery%20PrimitivesGSQL.pdf). +*tigergraph.com*, November 2018. +Archived at [perma.cc/JG7J-Y35X](https://perma.cc/JG7J-Y35X) + +[[45](/en/ch3#vanRest2016-marker)] Oskar van Rest, Sungpack Hong, Jinha Kim, Xuming +Meng, and Hassan Chafi. [PGQL: a property +graph query language](https://event.cwi.nl/grades/2016/07-VanRest.pdf). At *4th International Workshop on Graph Data Management Experiences and +Systems* (GRADES), June 2016. +[doi:10.1145/2960414.2960421](https://doi.org/10.1145/2960414.2960421) + +[[46](/en/ch3#NeptuneDataModel-marker)] Amazon Web Services. +[Neptune +Graph Data Model](https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html). Amazon Neptune User Guide, *docs.aws.amazon.com*. +Archived at [perma.cc/CX3T-EZU9](https://perma.cc/CX3T-EZU9) + +[[47](/en/ch3#DatomicDataModel-marker)] Cognitect. +[Datomic Data Model](https://docs.datomic.com/cloud/whatis/data-model.html). +Datomic Cloud Documentation, *docs.datomic.com*. +Archived at [perma.cc/LGM9-LEUT](https://perma.cc/LGM9-LEUT) + +[[48](/en/ch3#Beckett2011-marker)] David Beckett and Tim Berners-Lee. +[Turtle – Terse RDF Triple Language](https://www.w3.org/TeamSubmission/turtle/). +W3C Team Submission, March 2011. + +[[49](/en/ch3#Target2018-marker)] Sinclair Target. +[Whatever Happened to the Semantic +Web?](https://twobithistory.org/2018/05/27/semantic-web.html) *twobithistory.org*, May 2018. +Archived at [perma.cc/M8GL-9KHS](https://perma.cc/M8GL-9KHS) + +[[50](/en/ch3#MendelGleason2022-marker)] Gavin Mendel-Gleason. 
+[The Semantic Web is Dead – Long Live +the Semantic Web!](https://terminusdb.com/blog/the-semantic-web-is-dead/) *terminusdb.com*, August 2022. +Archived at [perma.cc/G2MZ-DSS3](https://perma.cc/G2MZ-DSS3) + +[[51](/en/ch3#Sporny2014-marker)] Manu Sporny. +[JSON-LD and Why I Hate the Semantic Web](http://manu.sporny.org/2014/json-ld-origins-2/). +*manu.sporny.org*, January 2014. +Archived at [perma.cc/7PT4-PJKF](https://perma.cc/7PT4-PJKF) + +[[52](/en/ch3#MichiganOntologies-marker)] University of Michigan Library. +[Biomedical Ontologies and Controlled Vocabularies](https://guides.lib.umich.edu/ontology), +*guides.lib.umich.edu/ontology*. +Archived at [perma.cc/Q5GA-F2N8](https://perma.cc/Q5GA-F2N8) + +[[53](/en/ch3#OpenGraph-marker)] Facebook. +[The Open Graph protocol](https://ogp.me/), *ogp.me*. +Archived at [perma.cc/C49A-GUSY](https://perma.cc/C49A-GUSY) + +[[54](/en/ch3#Haughey2015-marker)] Matt Haughey. +[Everything +you ever wanted to know about unfurling but were afraid to ask /or/ How to make your site previews +look amazing in Slack](https://medium.com/slack-developer-blog/everything-you-ever-wanted-to-know-about-unfurling-but-were-afraid-to-ask-or-how-to-make-your-e64b4bb9254). *medium.com*, November 2015. +Archived at [perma.cc/C7S8-4PZN](https://perma.cc/C7S8-4PZN) + +[[55](/en/ch3#W3CRDF-marker)] W3C RDF Working Group. +[Resource Description Framework (RDF)](https://www.w3.org/RDF/). +*w3.org*, February 2004. + +[[56](/en/ch3#Harris2013-marker)] Steve Harris, Andy Seaborne, and Eric +Prud’hommeaux. [SPARQL 1.1 Query Language](https://www.w3.org/TR/sparql11-query/). +W3C Recommendation, March 2013. + +[[57](/en/ch3#Green2013-marker)] Todd J. Green, Shan Shan Huang, Boon Thau Loo, and Wenchao Zhou. +[Datalog and Recursive +Query Processing](http://blogs.evergreen.edu/sosw/files/2014/04/Green-Vol5-DBS-017.pdf). *Foundations and Trends in Databases*, volume 5, issue 2, pages 105–195, +November 2013. 
[doi:10.1561/1900000017](https://doi.org/10.1561/1900000017) + +[[58](/en/ch3#Ceri1989-marker)] Stefano Ceri, Georg Gottlob, and Letizia Tanca. +[What +You Always Wanted to Know About Datalog (And Never Dared to Ask)](https://www.researchgate.net/profile/Letizia_Tanca/publication/3296132_What_you_always_wanted_to_know_about_Datalog_and_never_dared_to_ask/links/0fcfd50ca2d20473ca000000.pdf). *IEEE Transactions on +Knowledge and Data Engineering*, volume 1, issue 1, pages 146–166, March 1989. +[doi:10.1109/69.43410](https://doi.org/10.1109/69.43410) + +[[59](/en/ch3#Abiteboul1995-marker)] Serge Abiteboul, Richard Hull, and Victor Vianu. +[*Foundations of Databases*](http://webdam.inria.fr/Alice/). Addison-Wesley, 1995. +ISBN: 9780201537710, available online at +[*webdam.inria.fr/Alice*](http://webdam.inria.fr/Alice/) + +[[60](/en/ch3#Meyer2020-marker)] Scott Meyer, Andrew Carter, and Andrew Rodriguez. +[LIquid: +The soul of a new graph database, Part 2](https://engineering.linkedin.com/blog/2020/liquid--the-soul-of-a-new-graph-database--part-2). *engineering.linkedin.com*, September 2020. +Archived at [perma.cc/K9M4-PD6Q](https://perma.cc/K9M4-PD6Q) + +[[61](/en/ch3#Bessey2024-marker)] Matt Bessey. +[Why, after 6 years, I’m over +GraphQL](https://bessey.dev/blog/2024/05/24/why-im-over-graphql/). *bessey.dev*, May 2024. Archived at +[perma.cc/2PAU-JYRA](https://perma.cc/2PAU-JYRA) + +[[62](/en/ch3#Betts2012-marker)] Dominic Betts, Julián +Domínguez, Grigori Melnik, Fernando Simonazzi, and Mani Subramanian. +[*Exploring +CQRS and Event Sourcing*](https://learn.microsoft.com/en-us/previous-versions/msp-n-p/jj554200%28v%3Dpandp.10%29). Microsoft Patterns & Practices, July 2012. +ISBN: 1621140164, archived at [perma.cc/7A39-3NM8](https://perma.cc/7A39-3NM8) + +[[63](/en/ch3#Young2014-marker)] Greg Young. +[CQRS and Event Sourcing](https://www.youtube.com/watch?v=JHGkaShoyNs). At *Code on +the Beach*, August 2014. + +[[64](/en/ch3#Young2010-marker)] Greg Young. 
+[CQRS Documents](https://cqrs.files.wordpress.com/2010/11/cqrs_documents.pdf). +*cqrs.wordpress.com*, November 2010. +Archived at [perma.cc/X5R6-R47F](https://perma.cc/X5R6-R47F) + +[[65](/en/ch3#Petersohn2020-marker)] Devin Petersohn, Stephen Macke, Doris +Xin, William Ma, Doris Lee, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. +Joseph, and Aditya Parameswaran. +[Towards Scalable Dataframe Systems](https://www.vldb.org/pvldb/vol13/p2033-petersohn.pdf). +*Proceedings of the VLDB Endowment*, volume 13, issue 11, pages 2033–2046. +[doi:10.14778/3407790.3407807](https://doi.org/10.14778/3407790.3407807) + +[[66](/en/ch3#Papadopoulos2016-marker)] Stavros Papadopoulos, Kushal Datta, Samuel +Madden, and Timothy Mattson. +[The TileDB Array Data Storage Manager](https://www.vldb.org/pvldb/vol10/p349-papadopoulos.pdf). +*Proceedings of the VLDB Endowment*, volume 10, issue 4, pages 349–360, November 2016. +[doi:10.14778/3025111.3025117](https://doi.org/10.14778/3025111.3025117) + +[[67](/en/ch3#Rusu2022-marker)] Florin Rusu. +[Multidimensional +Array Data Management](https://faculty.ucmerced.edu/frusu/Papers/Report/2022-09-fntdb-arrays.pdf). *Foundations and Trends in Databases*, volume 12, numbers 2–3, +pages 69–220, February 2023. +[doi:10.1561/1900000069](https://doi.org/10.1561/1900000069) + +[[68](/en/ch3#Targett2023-marker)] Ed Targett. +[Bloomberg, +Man Group team up to develop open source “ArcticDB” database](https://www.thestack.technology/bloomberg-man-group-arcticdb-database-dataframe/). *thestack.technology*, +March 2023. Archived at [perma.cc/M5YD-QQYV](https://perma.cc/M5YD-QQYV) + +[[69](/en/ch3#Benson2007-marker)] Dennis A. Benson, Ilene +Karsch-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. +[GenBank](https://academic.oup.com/nar/article/36/suppl_1/D25/2507746). +*Nucleic Acids Research*, volume 36, database issue, pages D25–D30, December 2007. 
+[doi:10.1093/nar/gkm929](https://doi.org/10.1093/nar/gkm929) - -## …… - - - -## Summary - - -In this chapter we tried to get to the bottom of how databases handle storage and retrieval. What happens when you store data in a database, and what does the data‐ base do when you query for the data again later? - -On a high level, we saw that storage engines fall into two broad categories: those opti‐ mized for transaction processing (OLTP), and those optimized for analytics (OLAP). There are big differences between the access patterns in those use cases: - -- OLTP systems are typically user-facing, which means that they may see a huge volume of requests. In order to handle the load, applications usually only touch a small number of records in each query. The application requests records using some kind of key, and the storage engine uses an index to find the data for the requested key. Disk seek time is often the bottleneck here. - -- Data warehouses and similar analytic systems are less well known, because they are primarily used by business analysts, not by end users. They handle a much lower volume of queries than OLTP systems, but each query is typically very demanding, requiring many millions of records to be scanned in a short time. Disk bandwidth (not seek time) is often the bottleneck here, and column- oriented storage is an increasingly popular solution for this kind of workload. - - On the OLTP side, we saw storage engines from two main schools of thought: - -- The log-structured school, which only permits appending to files and deleting obsolete files, but never updates a file that has been written. Bitcask, SSTables, LSM-trees, LevelDB, Cassandra, HBase, Lucene, and others belong to this group. - -- The update-in-place school, which treats the disk as a set of fixed-size pages that can be overwritten. B-trees are the biggest example of this philosophy, being used in all major relational databases and also many nonrelational ones. 
- - Log-structured storage engines are a comparatively recent development. Their key idea is that they systematically turn random-access writes into sequential writes on disk, which enables higher write throughput due to the performance characteristics of hard drives and SSDs. - -Finishing off the OLTP side, we did a brief tour through some more complicated indexing structures, and databases that are optimized for keeping all data in memory. - -We then took a detour from the internals of storage engines to look at the high-level architecture of a typical data warehouse. This background illustrated why analytic workloads are so different from OLTP: when your queries require sequentially scan‐ ning across a large number of rows, indexes are much less relevant. Instead it becomes important to encode data very compactly, to minimize the amount of data that the query needs to read from disk. We discussed how column-oriented storage helps achieve this goal. - -As an application developer, if you’re armed with this knowledge about the internals of storage engines, you are in a much better position to know which tool is best suited for your particular application. If you need to adjust a database’s tuning parameters, this understanding allows you to imagine what effect a higher or a lower value may have. - -Although this chapter couldn’t make you an expert in tuning any one particular stor‐ age engine, it has hopefully equipped you with enough vocabulary and ideas that you can make sense of the documentation for the database of your choice. - -## References - -1. Alfred V. Aho, John E. Hopcroft, and Jeffrey D. Ullman: *Data Structures and Algorithms*. Addison-Wesley, 1983. ISBN: 978-0-201-00023-8 -1. Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein: *Introduction to Algorithms*, 3rd edition. MIT Press, 2009. ISBN: 978-0-262-53305-8 -1. 
Justin Sheehy and David Smith: “[Bitcask: A Log-Structured Hash Table for Fast Key/Value Data](https://riak.com/assets/bitcask-intro.pdf),” Basho Technologies, April 2010. -1. Yinan Li, Bingsheng He, Robin Jun Yang, et al.: “[Tree Indexing on Solid State Drives](http://pages.cs.wisc.edu/~yinan/paper/fdtree_pvldb.pdf),” *Proceedings of the VLDB Endowment*, volume 3, number 1, pages 1195–1206, September 2010. -1. Goetz Graefe: “[Modern B-Tree Techniques](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=0b19f413ffb5bc68b43f3bd05a97c282a7c6d6ab),” *Foundations and Trends in Databases*, volume 3, number 4, pages 203–402, August 2011. [doi:10.1561/1900000028](http://dx.doi.org/10.1561/1900000028) -1. Jeffrey Dean and Sanjay Ghemawat: “[LevelDB Implementation Notes](https://github.com/google/leveldb/blob/master/doc/impl.md),” *github.com*. -1. Dhruba Borthakur: “[The History of RocksDB](https://rocksdb.blogspot.com/2013/11/the-history-of-rocksdb.html),” *rocksdb.blogspot.com*, November 24, 2013. -1. Matteo Bertozzi: “[Apache HBase I/O – HFile](https://blog.cloudera.com/apache-hbase-i-o-hfile/),” *blog.cloudera.com*, June 29, 2012. -1. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, et al.: “[Bigtable: A Distributed Storage System for Structured Data](https://research.google/pubs/pub27898/),” at *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006. -1. Patrick O'Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O'Neil: “[The Log-Structured Merge-Tree (LSM-Tree)](http://www.cs.umb.edu/~poneil/lsmtree.pdf),” *Acta Informatica*, volume 33, number 4, pages 351–385, June 1996. [doi:10.1007/s002360050048](http://dx.doi.org/10.1007/s002360050048) -1. Mendel Rosenblum and John K. Ousterhout: “[The Design and Implementation of a Log-Structured File System](http://research.cs.wisc.edu/areas/os/Qual/papers/lfs.pdf),” *ACM Transactions on Computer Systems*, volume 10, number 1, pages 26–52, February 1992. 
[doi:10.1145/146941.146943](http://dx.doi.org/10.1145/146941.146943) -1. Adrien Grand: “[What Is in a Lucene Index?](http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal),” at *Lucene/Solr Revolution*, November 14, 2013. -1. Deepak Kandepet: “[Hacking Lucene—The Index Format](https://web.archive.org/web/20160316190830/http://hackerlabs.github.io/blog/2011/10/01/hacking-lucene-the-index-format/index.html),” *hackerlabs.github.io*, October 1, 2011. -1. Michael McCandless: “[Visualizing Lucene's Segment Merges](http://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html),” *blog.mikemccandless.com*, February 11, 2011. -1. Burton H. Bloom: “[Space/Time Trade-offs in Hash Coding with Allowable Errors](https://people.cs.umass.edu/~emery/classes/cmpsci691st/readings/Misc/p422-bloom.pdf),” *Communications of the ACM*, volume 13, number 7, pages 422–426, July 1970. [doi:10.1145/362686.362692](http://dx.doi.org/10.1145/362686.362692) -1. “[Operating Cassandra: Compaction](https://cassandra.apache.org/doc/latest/operating/compaction/index.html),” Apache Cassandra Documentation v4.0, 2016. -1. Rudolf Bayer and Edward M. McCreight: “[Organization and Maintenance of Large Ordered Indices](https://apps.dtic.mil/sti/citations/AD0712079),” Boeing Scientific Research Laboratories, Mathematical and Information Sciences Laboratory, report no. 20, July 1970. -1. Douglas Comer: “[The Ubiquitous B-Tree](https://carlosproal.com/ir/papers/p121-comer.pdf),” *ACM Computing Surveys*, volume 11, number 2, pages 121–137, June 1979. [doi:10.1145/356770.356776](http://dx.doi.org/10.1145/356770.356776) -1. Emmanuel Goossaert: “[Coding for SSDs](http://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/),” *codecapsule.com*, February 12, 2014. -1. C. 
Mohan and Frank Levine: “[ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging](http://www.ics.uci.edu/~cs223/papers/p371-mohan.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 1992. [doi:10.1145/130283.130338](http://dx.doi.org/10.1145/130283.130338) -1. Howard Chu: “[LDAP at Lightning Speed](https://buildstuff14.sched.com/event/08a1a368e272eb599a52e08b4c3c779d),” at *Build Stuff '14*, November 2014. -1. Bradley C. Kuszmaul: “[A Comparison of Fractal Trees to Log-Structured Merge (LSM) Trees](http://www.pandademo.com/wp-content/uploads/2017/12/A-Comparison-of-Fractal-Trees-to-Log-Structured-Merge-LSM-Trees.pdf),” *tokutek.com*, April 22, 2014. -1. Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, et al.: “[Designing Access Methods: The RUM Conjecture](http://openproceedings.org/2016/conf/edbt/paper-12.pdf),” at *19th International Conference on Extending Database Technology* (EDBT), March 2016. [doi:10.5441/002/edbt.2016.42](http://dx.doi.org/10.5441/002/edbt.2016.42) -1. Peter Zaitsev: “[Innodb Double Write](https://www.percona.com/blog/2006/08/04/innodb-double-write/),” *percona.com*, August 4, 2006. -1. Tomas Vondra: “[On the Impact of Full-Page Writes](https://www.enterprisedb.com/blog/impact-full-page-writes),” *blog.2ndquadrant.com*, November 23, 2016. -1. Mark Callaghan: “[The Advantages of an LSM vs a B-Tree](http://smalldatum.blogspot.co.uk/2016/01/summary-of-advantages-of-lsm-vs-b-tree.html),” *smalldatum.blogspot.co.uk*, January 19, 2016. -1. Mark Callaghan: “[Choosing Between Efficiency and Performance with RocksDB](https://codemesh.io/codemesh2016/mark-callaghan),” at *Code Mesh*, November 4, 2016. -1. Michi Mutsuzaki: “[MySQL vs. LevelDB](https://github.com/m1ch1/mapkeeper/wiki/MySQL-vs.-LevelDB),” *github.com*, August 2011. -1. 
Benjamin Coverston, Jonathan Ellis, et al.: “[CASSANDRA-1608: Redesigned Compaction](https://issues.apache.org/jira/browse/CASSANDRA-1608), *issues.apache.org*, July 2011. -1. Igor Canadi, Siying Dong, and Mark Callaghan: “[RocksDB Tuning Guide](https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide),” *github.com*, 2016. -1. [*MySQL 5.7 Reference Manual*](http://dev.mysql.com/doc/refman/5.7/en/index.html). Oracle, 2014. -1. [*Books Online for SQL Server 2012*](https://learn.microsoft.com/en-us/previous-versions/sql/sql-server-2012/ms130214(v=sql.110)). Microsoft, 2012. -1. Joe Webb: “[Using Covering Indexes to Improve Query Performance](https://www.simple-talk.com/sql/learn-sql-server/using-covering-indexes-to-improve-query-performance/),” *simple-talk.com*, 29 September 2008. -1. Frank Ramsak, Volker Markl, Robert Fenk, et al.: “[Integrating the UB-Tree into a Database System Kernel](http://www.vldb.org/conf/2000/P263.pdf),” at *26th International Conference on Very Large Data Bases* (VLDB), September 2000. -1. The PostGIS Development Group: “[PostGIS 2.1.2dev Manual](http://postgis.net/docs/manual-2.1/),” *postgis.net*, 2014. -1. Robert Escriva, Bernard Wong, and Emin Gün Sirer: “[HyperDex: A Distributed, Searchable Key-Value Store](http://www.cs.princeton.edu/courses/archive/fall13/cos518/papers/hyperdex.pdf),” at *ACM SIGCOMM Conference*, August 2012. [doi:10.1145/2377677.2377681](http://dx.doi.org/10.1145/2377677.2377681) -1. Michael McCandless: “[Lucene's FuzzyQuery Is 100 Times Faster in 4.0](http://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html),” *blog.mikemccandless.com*, March 24, 2011. -1. Steffen Heinz, Justin Zobel, and Hugh E. Williams: “[Burst Tries: A Fast, Efficient Data Structure for String Keys](http://citeseer.ist.psu.edu/viewdoc/summary?doi=10.1.1.18.3499),” *ACM Transactions on Information Systems*, volume 20, number 2, pages 192–223, April 2002. 
[doi:10.1145/506309.506312](http://dx.doi.org/10.1145/506309.506312) -1. Klaus U. Schulz and Stoyan Mihov: “[Fast String Correction with Levenshtein Automata](http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.652),” *International Journal on Document Analysis and Recognition*, volume 5, number 1, pages 67–85, November 2002. [doi:10.1007/s10032-002-0082-8](http://dx.doi.org/10.1007/s10032-002-0082-8) -1. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze: [*Introduction to Information Retrieval*](http://nlp.stanford.edu/IR-book/). Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at *nlp.stanford.edu/IR-book* -1. Michael Stonebraker, Samuel Madden, Daniel J. Abadi, et al.: “[The End of an Architectural Era (It’s Time for a Complete Rewrite)](http://nms.csail.mit.edu/~stavros/pubs/hstore.pdf),” at *33rd International Conference on Very Large Data Bases* (VLDB), September 2007. -1. “[VoltDB Technical Overview White Paper](https://www.voltdb.com/files/voltdb-technical-overview/),” VoltDB, 2014. -1. Stephen M. Rumble, Ankita Kejriwal, and John K. Ousterhout: “[Log-Structured Memory for DRAM-Based Storage](https://www.usenix.org/system/files/conference/fast14/fast14-paper_rumble.pdf),” at *12th USENIX Conference on File and Storage Technologies* (FAST), February 2014. -1. Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker: “[OLTP Through the Looking Glass, and What We Found There](http://hstore.cs.brown.edu/papers/hstore-lookingglass.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2008. [doi:10.1145/1376616.1376713](http://dx.doi.org/10.1145/1376616.1376713) -1. Justin DeBrabant, Andrew Pavlo, Stephen Tu, et al.: “[Anti-Caching: A New Approach to Database Management System Architecture](http://www.vldb.org/pvldb/vol6/p1942-debrabant.pdf),” *Proceedings of the VLDB Endowment*, volume 6, number 14, pages 1942–1953, September 2013. -1. 
Joy Arulraj, Andrew Pavlo, and Subramanya R. Dulloor: “[Let's Talk About Storage & Recovery Methods for Non-Volatile Memory Database Systems](http://www.pdl.cmu.edu/PDL-FTP/NVM/storage.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2749441](http://dx.doi.org/10.1145/2723372.2749441) -1. Edgar F. Codd, S. B. Codd, and C. T. Salley: “[Providing OLAP to User-Analysts: An IT Mandate](https://pdfs.semanticscholar.org/a0bd/1491a54a4de428c5eef9b836ef6ee2915fe7.pdf),” E. F. Codd Associates, 1993. -1. Surajit Chaudhuri and Umeshwar Dayal: “[An Overview of Data Warehousing and OLAP Technology](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigrecord.pdf),” *ACM SIGMOD Record*, volume 26, number 1, pages 65–74, March 1997. [doi:10.1145/248603.248616](http://dx.doi.org/10.1145/248603.248616) -1. Per-Åke Larson, Cipri Clinciu, Campbell Fraser, et al.: “[Enhancements to SQL Server Column Stores](http://research.microsoft.com/pubs/193599/Apollo3%20-%20Sigmod%202013%20-%20final.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2013. -1. Franz Färber, Norman May, Wolfgang Lehner, et al.: “[The SAP HANA Database – An Architecture Overview](http://sites.computer.org/debull/A12mar/hana.pdf),” *IEEE Data Engineering Bulletin*, volume 35, number 1, pages 28–33, March 2012. -1. Michael Stonebraker: “[The Traditional RDBMS Wisdom Is (Almost Certainly) All Wrong](http://slideshot.epfl.ch/talks/166),” presentation at *EPFL*, May 2013. -1. Daniel J. Abadi: “[Classifying the SQL-on-Hadoop Solutions](https://web.archive.org/web/20150622074951/http://hadapt.com/blog/2013/10/02/classifying-the-sql-on-hadoop-solutions/),” *hadapt.com*, October 2, 2013. -1. 
Marcel Kornacker, Alexander Behm, Victor Bittorf, et al.: “[Impala: A Modern, Open-Source SQL Engine for Hadoop](http://pandis.net/resources/cidr15impala.pdf),” at *7th Biennial Conference on Innovative Data Systems Research* (CIDR), January 2015. -1. Sergey Melnik, Andrey Gubarev, Jing Jing Long, et al.: “[Dremel: Interactive Analysis of Web-Scale Datasets](https://research.google/pubs/pub36632/),” at *36th International Conference on Very Large Data Bases* (VLDB), pages 330–339, September 2010. -1. Ralph Kimball and Margy Ross: *The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*, 3rd edition. John Wiley & Sons, July 2013. ISBN: 978-1-118-53080-1 -1. Derrick Harris: “[Why Apple, eBay, and Walmart Have Some of the Biggest Data Warehouses You’ve Ever Seen](https://web.archive.org/web/20221129085658/https://old.gigaom.com/2013/03/27/why-apple-ebay-and-walmart-have-some-of-the-biggest-data-warehouses-youve-ever-seen/),” *gigaom.com*, March 27, 2013. -1. Julien Le Dem: “[Dremel Made Simple with Parquet](https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html),” *blog.twitter.com*, September 11, 2013. -1. Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, et al.: “[The Design and Implementation of Modern Column-Oriented Database Systems](http://cs-www.cs.yale.edu/homes/dna/papers/abadi-column-stores.pdf),” *Foundations and Trends in Databases*, volume 5, number 3, pages 197–280, December 2013. [doi:10.1561/1900000024](http://dx.doi.org/10.1561/1900000024) -1. Peter Boncz, Marcin Zukowski, and Niels Nes: “[MonetDB/X100: Hyper-Pipelining Query Execution](http://cidrdb.org/cidr2005/papers/P19.pdf),” at *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005. -1. Jingren Zhou and Kenneth A. 
Ross: “[Implementing Database Operations Using SIMD Instructions](http://www1.cs.columbia.edu/~kar/pubsk/simd.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), pages 145–156, June 2002. [doi:10.1145/564691.564709](http://dx.doi.org/10.1145/564691.564709) -1. Michael Stonebraker, Daniel J. Abadi, Adam Batkin, et al.: “[C-Store: A Column-oriented DBMS](http://www.cs.umd.edu/~abadi/vldb.pdf),” at *31st International Conference on Very Large Data Bases* (VLDB), pages 553–564, September 2005. -1. Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, et al.: “[The Vertica Analytic Database: C-Store 7 Years Later](http://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf),” *Proceedings of the VLDB Endowment*, volume 5, number 12, pages 1790–1801, August 2012. -1. Julien Le Dem and Nong Li: “[Efficient Data Storage for Analytics with Apache Parquet 2.0](http://www.slideshare.net/julienledem/th-210pledem),” at *Hadoop Summit*, San Jose, June 2014. -1. Jim Gray, Surajit Chaudhuri, Adam Bosworth, et al.: “[Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals](http://arxiv.org/pdf/cs/0701155.pdf),” *Data Mining and Knowledge Discovery*, volume 1, number 1, pages 29–53, March 2007. [doi:10.1023/A:1009726021843](http://dx.doi.org/10.1023/A:1009726021843) diff --git a/content/en/ch4.md b/content/en/ch4.md index 0958a42..55cca43 100644 --- a/content/en/ch4.md +++ b/content/en/ch4.md @@ -1,124 +1,1985 @@ --- -title: "4. Encoding and Evolution" -linkTitle: "4. Encoding and Evolution" +title: "4. Storage and Retrieval" weight: 104 breadcrumbs: false --- -![](/img/ch4.png) - -> *Everything changes and nothing stands still.* +> *One of the miseries of life is that everybody names things a little bit wrong. And so it makes +> everything a little harder to understand in the world than it would be if it were named +> differently. A computer does not primarily compute in the sense of doing arithmetic. 
[…] They +> primarily are filing systems.* > -> ​ — Heraclitus of Ephesus, as quoted by Plato in *Cratylus* (360 BCE) +> [Richard Feynman](https://www.youtube.com/watch?v=EKWGGDXe5MA&t=296s), +> *Idiosyncratic Thinking* seminar (1985) -------------------- +On the most fundamental level, a database needs to do two things: when you give it some data, it +should store the data, and when you ask it again later, it should give the data back to you. -Applications inevitably change over time. Features are added or modified as new products are launched, user requirements become better understood, or business cir‐ cumstances change. In [Chapter 1](/en/ch1j) we introduced the idea of *evolvability*: we should aim to build systems that make it easy to adapt to change (see “[Evolvability: Making Change Easy](/en/ch1#evolvability-making-change-easy)”). +In [Chapter 3](/en/ch3#ch_datamodels) we discussed data models and query languages—i.e., the format in which you give +the database your data, and the interface through which you can ask for it again later. In this +chapter we discuss the same from the database’s point of view: how the database can store the data +that you give it, and how it can find the data again when you ask for it. -In most cases, a change to an application’s features also requires a change to data that it stores: perhaps a new field or record type needs to be captured, or perhaps existing data needs to be presented in a new way. +Why should you, as an application developer, care how the database handles storage and retrieval +internally? You’re probably not going to implement your own storage engine from scratch, but you +*do* need to select a storage engine that is appropriate for your application, from the many that +are available. In order to configure a storage engine to perform well on your kind of workload, you +need to have a rough idea of what the storage engine is doing under the hood. 
-The data models we discussed in [Chapter 2](/en/ch2) have different ways of coping with such change. Relational databases generally assume that all data in the database conforms to one schema: although that schema can be changed (through schema migrations; i.e., ALTER statements), there is exactly one schema in force at any one point in time. By contrast, schema-on-read (“schemaless”) databases don’t enforce a schema, so the database can contain a mixture of older and newer data formats written at different times (see “[Schema flexibility in the document model](/en/ch3#schema-flexibility-in-the-document-model)”). +In particular, there is a big difference between storage engines that are optimized for +transactional workloads (OLTP) and those that are optimized for analytics (we introduced this +distinction in [“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics)). This chapter starts by examining two families of +storage engines for OLTP: *log-structured* storage engines that write out immutable data files, and +storage engines such as *B-trees* that update data in-place. These structures are used for both +key-value storage as well as secondary indexes. -When a data format or schema changes, a corresponding change to application code often needs to happen (for example, you add a new field to a record, and the applica‐ tion code starts reading and writing that field). However, in a large application, code changes often cannot happen instantaneously: +Later in [“Data Storage for Analytics”](/en/ch4#sec_storage_analytics) we’ll discuss a family of storage engines that is optimized for +analytics, and in [“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional) we’ll briefly look at indexes for more advanced +queries, such as text retrieval. 
-- With server-side applications you may want to perform a *rolling upgrade* (also known as a *staged rollout*), deploying the new version to a few nodes at a time, checking whether the new version is running smoothly, and gradually working your way through all the nodes. This allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better evolva‐ bility. -- With client-side applications you’re at the mercy of the user, who may not install the update for some time. +# Storage and Indexing for OLTP -This means that old and new versions of the code, and old and new data formats, may potentially all coexist in the system at the same time. In order for the system to continue running smoothly, we need to maintain compatibility in both directions: +Consider the world’s simplest database, implemented as two Bash functions: -***Backward compatibility*** +``` +#!/bin/bash -Newer code can read data that was written by older code. +db_set () { + echo "$1,$2" >> database +} -***Forward compatibility*** +db_get () { + grep "^$1," database | sed -e "s/^$1,//" | tail -n 1 +} +``` -Older code can read data that was written by newer code. +These two functions implement a key-value store. You can call `db_set key value`, which will store +`key` and `value` in the database. The key and value can be (almost) anything you like—for +example, the value could be a JSON document. You can then call `db_get key`, which looks up the most +recent value associated with that particular key and returns it. -Backward compatibility is normally not hard to achieve: as author of the newer code, you know the format of data written by older code, and so you can explicitly handle it (if necessary by simply keeping the old code to read the old data). Forward compati‐ bility can be trickier, because it requires older code to ignore additions made by a newer version of the code. 
+And it works: -In this chapter we will look at several formats for encoding data, including JSON, XML, Protocol Buffers, Thrift, and Avro. In particular, we will look at how they han‐ dle schema changes and how they support systems where old and new data and code need to coexist. We will then discuss how those formats are used for data storage and for communication: in web services, Representational State Transfer (REST), and remote procedure calls (RPC), as well as message-passing systems such as actors and message queues. +``` +$ db_set 12 '{"name":"London","attractions":["Big Ben","London Eye"]}' +$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}' +$ db_get 42 +{"name":"San Francisco","attractions":["Golden Gate Bridge"]} -## …… +``` +The storage format is very simple: a text file where each line contains a key-value pair, separated +by a comma (roughly like a CSV file, ignoring escaping issues). Every call to `db_set` appends to +the end of the file. If you update a key several times, old versions of the value are not +overwritten—you need to look at the last occurrence of a key in a file to find the latest value +(hence the `tail -n 1` in `db_get`): +``` +$ db_set 42 '{"name":"San Francisco","attractions":["Exploratorium"]}' -## Summary +$ db_get 42 +{"name":"San Francisco","attractions":["Exploratorium"]} -In this chapter we looked at several ways of turning data structures into bytes on the network or bytes on disk. We saw how the details of these encodings affect not only their efficiency, but more importantly also the architecture of applications and your options for deploying them. 
+$ cat database +12,{"name":"London","attractions":["Big Ben","London Eye"]} +42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]} +42,{"name":"San Francisco","attractions":["Exploratorium"]} -In particular, many services need to support rolling upgrades, where a new version of a service is gradually deployed to a few nodes at a time, rather than deploying to all nodes simultaneously. Rolling upgrades allow new versions of a service to be released without downtime (thus encouraging frequent small releases over rare big releases) and make deployments less risky (allowing faulty releases to be detected and rolled back before they affect a large number of users). These properties are hugely benefi‐ cial for *evolvability*, the ease of making changes to an application. +``` -During rolling upgrades, or for various other reasons, we must assume that different nodes are running the different versions of our application’s code. Thus, it is impor‐ tant that all data flowing around the system is encoded in a way that provides back‐ ward compatibility (new code can read old data) and forward compatibility (old code can read new data). +The `db_set` function actually has pretty good performance for something that is so simple, because +appending to a file is generally very efficient. Similarly to what `db_set` does, many databases +internally use a *log*, which is an append-only data file. Real databases have more issues to deal +with (such as handling concurrent writes, reclaiming disk space so that the log doesn’t grow +forever, and handling partially written records when recovering from a crash), but the basic +principle is the same. Logs are incredibly useful, and we will encounter them several times in this +book. -We discussed several data encoding formats and their compatibility properties: +###### Note -- Programming language–specific encodings are restricted to a single program‐ ming language and often fail to provide forward and backward compatibility. 
-- Textual formats like JSON, XML, and CSV are widespread, and their compatibil‐ ity depends on how you use them. They have optional schema languages, which are sometimes helpful and sometimes a hindrance. These formats are somewhat vague about datatypes, so you have to be careful with things like numbers and binary strings. -- Binary schema–driven formats like Thrift, Protocol Buffers, and Avro allow compact, efficient encoding with clearly defined forward and backward compati‐ bility semantics. The schemas can be useful for documentation and code genera‐ tion in statically typed languages. However, they have the downside that data needs to be decoded before it is human-readable. +The word *log* is often used to refer to application logs, where an application outputs text that +describes what’s happening. In this book, *log* is used in the more general sense: an append-only +sequence of records on disk. It doesn’t have to be human-readable; it might be binary and intended +only for internal use by the database system. -We also discussed several modes of dataflow, illustrating different scenarios in which data encodings are important: +On the other hand, the `db_get` function has terrible performance if you have a large number of +records in your database. Every time you want to look up a key, `db_get` has to scan the entire +database file from beginning to end, looking for occurrences of the key. In algorithmic terms, the +cost of a lookup is *O*(*n*): if you double the number of records *n* in your database, a lookup +takes twice as long. That’s not good. 
-- Databases, where the process writing to the database encodes the data and the process reading from the database decodes it -- RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes a response, and the client finally decodes the response -- Asynchronous message passing (using message brokers or actors), where nodes communicate by sending each other messages that are encoded by the sender and decoded by the recipient +In order to efficiently find the value for a particular key in the database, we need a different +data structure: an *index*. In this chapter we will look at a range of indexing structures and see +how they compare; the general idea is to structure the data in a particular way (e.g., sorted by +some key) that makes it faster to locate the data you want. If you want to search the same data in +several different ways, you may need several different indexes on different parts of the data. -We can conclude that with a bit of care, backward/forward compatibility and rolling upgrades are quite achievable. May your application’s evolution be rapid and your deployments be frequent. +An index is an *additional* structure that is derived from the primary data. Many databases allow +you to add and remove indexes, and this doesn’t affect the contents of the database; it only affects +the performance of queries. Maintaining additional structures incurs overhead, especially on writes. For +writes, it’s hard to beat the performance of simply appending to a file, because that’s the simplest +possible write operation. Any kind of index usually slows down writes, because the index also needs +to be updated every time data is written. -## References +This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but +every index consumes additional disk space and slows down writes, sometimes substantially +[[1](/en/ch4#Samokhvalov2021)]. 
+For this reason, databases don’t usually index everything by default, but require you—the person +writing the application or administering the database—to choose indexes manually, using your +knowledge of the application’s typical query patterns. You can then choose the indexes that give +your application the greatest benefit, without introducing more overhead on writes than necessary. -1. “[Java Object Serialization Specification](http://docs.oracle.com/javase/7/docs/platform/serialization/spec/serialTOC.html),” *docs.oracle.com*, 2010. -1. “[Ruby 2.2.0 API Documentation](http://ruby-doc.org/core-2.2.0/),” *ruby-doc.org*, Dec 2014. -1. “[The Python 3.4.3 Standard Library Reference Manual](https://docs.python.org/3/library/pickle.html),” *docs.python.org*, February 2015. -1. “[EsotericSoftware/kryo](https://github.com/EsotericSoftware/kryo),” *github.com*, October 2014. -1. “[CWE-502: Deserialization of Untrusted Data](http://cwe.mitre.org/data/definitions/502.html),” Common Weakness Enumeration, *cwe.mitre.org*, July 30, 2014. -1. Steve Breen: “[What Do WebLogic, WebSphere, JBoss, Jenkins, OpenNMS, and Your Application Have in Common? This Vulnerability](http://foxglovesecurity.com/2015/11/06/what-do-weblogic-websphere-jboss-jenkins-opennms-and-your-application-have-in-common-this-vulnerability/),” *foxglovesecurity.com*, November 6, 2015. -1. Patrick McKenzie: “[What the Rails Security Issue Means for Your Startup](http://www.kalzumeus.com/2013/01/31/what-the-rails-security-issue-means-for-your-startup/),” *kalzumeus.com*, January 31, 2013. -1. Eishay Smith: “[jvm-serializers wiki](https://github.com/eishay/jvm-serializers/wiki),” *github.com*, November 2014. -1. “[XML Is a Poor Copy of S-Expressions](http://c2.com/cgi/wiki?XmlIsaPoorCopyOfEssExpressions),” *c2.com* wiki. -1. 
Matt Harris: “[Snowflake: An Update and Some Very Important Information](https://groups.google.com/forum/#!topic/twitter-development-talk/ahbvo3VTIYI),” email to *Twitter Development Talk* mailing list, October 19, 2010. -1. Shudi (Sandy) Gao, C. M. Sperberg-McQueen, and Henry S. Thompson: “[XML Schema 1.1](http://www.w3.org/XML/Schema),” W3C Recommendation, May 2001. -1. Francis Galiegue, Kris Zyp, and Gary Court: “[JSON Schema](http://json-schema.org/),” IETF Internet-Draft, February 2013. -1. Yakov Shafranovich: “[RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files](https://tools.ietf.org/html/rfc4180),” October 2005. -1. “[MessagePack Specification](http://msgpack.org/),” *msgpack.org*. -1. Mark Slee, Aditya Agarwal, and Marc Kwiatkowski: “[Thrift: Scalable Cross-Language Services Implementation](http://thrift.apache.org/static/files/thrift-20070401.pdf),” Facebook technical report, April 2007. -1. “[Protocol Buffers Developer Guide](https://developers.google.com/protocol-buffers/docs/overview),” Google, Inc., *developers.google.com*. -1. Igor Anishchenko: “[Thrift vs Protocol Buffers vs Avro - Biased Comparison](http://www.slideshare.net/IgorAnishchenko/pb-vs-thrift-vs-avro),” *slideshare.net*, September 17, 2012. -1. “[A Matrix of the Features Each Individual Language Library Supports](http://wiki.apache.org/thrift/LibraryFeatures),” *wiki.apache.org*. -1. Martin Kleppmann: “[Schema Evolution in Avro, Protocol Buffers and Thrift](http://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html),” *martin.kleppmann.com*, December 5, 2012. -1. “[Apache Avro 1.7.7 Documentation](http://avro.apache.org/docs/1.7.7/),” *avro.apache.org*, July 2014. -1. 
Doug Cutting, Chad Walters, Jim Kellerman, et al.: “[[PROPOSAL] New Subproject: Avro](http://mail-archives.apache.org/mod_mbox/hadoop-general/200904.mbox/%3C49D53694.1050906@apache.org%3E),” email thread on *hadoop-general* mailing list, *mail-archives.apache.org*, April 2009. -1. Tony Hoare: “[Null References: The Billion Dollar Mistake](http://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare),” at *QCon London*, March 2009. -1. Aditya Auradkar and Tom Quiggle: “[Introducing Espresso—LinkedIn's Hot New Distributed Document Store](https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store),” *engineering.linkedin.com*, January 21, 2015. -1. Jay Kreps: “[Putting Apache Kafka to Use: A Practical Guide to Building a Stream Data Platform (Part 2)](http://blog.confluent.io/2015/02/25/stream-data-platform-2/),” *blog.confluent.io*, February 25, 2015. -1. Gwen Shapira: “[The Problem of Managing Schemas](http://radar.oreilly.com/2014/11/the-problem-of-managing-schemas.html),” *radar.oreilly.com*, November 4, 2014. -1. “[Apache Pig 0.14.0 Documentation](http://pig.apache.org/docs/r0.14.0/),” *pig.apache.org*, November 2014. -1. John Larmouth: [*ASN.1 Complete*](http://www.oss.com/asn1/resources/books-whitepapers-pubs/larmouth-asn1-book.pdf). Morgan Kaufmann, 1999. ISBN: 978-0-122-33435-1 -1. Russell Housley, Warwick Ford, Tim Polk, and David Solo: “[RFC 2459: Internet X.509 Public Key Infrastructure: Certificate and CRL Profile](https://www.ietf.org/rfc/rfc2459.txt),” IETF Network Working Group, Standards Track, January 1999. -1. Lev Walkin: “[Question: Extensibility and Dropping Fields](http://lionet.info/asn1c/blog/2010/09/21/question-extensibility-removing-fields/),” *lionet.info*, September 21, 2010. -1. 
Jesse James Garrett: “[Ajax: A New Approach to Web Applications](https://web.archive.org/web/20181231094556/https://www.adaptivepath.com/ideas/ajax-new-approach-web-applications/),” *adaptivepath.com*, February 18, 2005. -1. Sam Newman: *Building Microservices*. O'Reilly Media, 2015. ISBN: 978-1-491-95035-7 -1. Chris Richardson: “[Microservices: Decomposing Applications for Deployability and Scalability](http://www.infoq.com/articles/microservices-intro),” *infoq.com*, May 25, 2014. -1. Pat Helland: “[Data on the Outside Versus Data on the Inside](http://cidrdb.org/cidr2005/papers/P12.pdf),” at *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005. -1. Roy Thomas Fielding: “[Architectural Styles and the Design of Network-Based Software Architectures](https://www.ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf),” PhD Thesis, University of California, Irvine, 2000. -1. Roy Thomas Fielding: “[REST APIs Must Be Hypertext-Driven](http://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven),” *roy.gbiv.com*, October 20 2008. -1. “[REST in Peace, SOAP](https://royal.pingdom.com/rest-in-peace-soap/),” *royal.pingdom.com*, October 15, 2010. -1. “[Web Services Standards as of Q1 2007](https://www.innoq.com/resources/ws-standards-poster/),” *innoq.com*, February 2007. -1. Pete Lacey: “[The S Stands for Simple](http://harmful.cat-v.org/software/xml/soap/simple),” *harmful.cat-v.org*, November 15, 2006. -1. Stefan Tilkov: “[Interview: Pete Lacey Criticizes Web Services](http://www.infoq.com/articles/pete-lacey-ws-criticism),” *infoq.com*, December 12, 2006. -1. “[OpenAPI Specification (fka Swagger RESTful API Documentation Specification) Version 2.0](http://swagger.io/specification/),” *swagger.io*, September 8, 2014. -1. Michi Henning: “[The Rise and Fall of CORBA](https://cacm.acm.org/magazines/2008/8/5336-the-rise-and-fall-of-corba/fulltext),” *Communications of the ACM*, volume 51, number 8, pages 52–57, August 2008. 
[doi:10.1145/1378704.1378718](http://dx.doi.org/10.1145/1378704.1378718) -1. Andrew D. Birrell and Bruce Jay Nelson: “[Implementing Remote Procedure Calls](http://www.cs.princeton.edu/courses/archive/fall03/cs518/papers/rpc.pdf),” *ACM Transactions on Computer Systems* (TOCS), volume 2, number 1, pages 39–59, February 1984. [doi:10.1145/2080.357392](http://dx.doi.org/10.1145/2080.357392) -1. Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall: “[A Note on Distributed Computing](http://m.mirror.facebook.net/kde/devel/smli_tr-94-29.pdf),” Sun Microsystems Laboratories, Inc., Technical Report TR-94-29, November 1994. -1. Steve Vinoski: “[Convenience over Correctness](http://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf),” *IEEE Internet Computing*, volume 12, number 4, pages 89–92, July 2008. [doi:10.1109/MIC.2008.75](http://dx.doi.org/10.1109/MIC.2008.75) -1. Marius Eriksen: “[Your Server as a Function](http://monkey.org/~marius/funsrv.pdf),” at *7th Workshop on Programming Languages and Operating Systems* (PLOS), November 2013. [doi:10.1145/2525528.2525538](http://dx.doi.org/10.1145/2525528.2525538) -1. “[gRPC concepts](https://grpc.io/docs/guides/concepts/),” The Linux Foundation, *grpc.io*. -1. Aditya Narayan and Irina Singh: “[Designing and Versioning Compatible Web Services](https://web.archive.org/web/20141016000136/http://www.ibm.com/developerworks/websphere/library/techarticles/0705_narayan/0705_narayan.html),” *ibm.com*, March 28, 2007. -1. Troy Hunt: “[Your API Versioning Is Wrong, Which Is Why I Decided to Do It 3 Different Wrong Ways](http://www.troyhunt.com/2014/02/your-api-versioning-is-wrong-which-is.html),” *troyhunt.com*, February 10, 2014. -1. “[API Upgrades](https://stripe.com/docs/upgrades),” Stripe, Inc., April 2015. -1. Jonas Bonér: “[Upgrade in an Akka Cluster](http://grokbase.com/t/gg/akka-user/138wd8j9e3/upgrade-in-an-akka-cluster),” email to *akka-user* mailing list, *grokbase.com*, August 28, 2013. -1. Philip A. 
Bernstein, Sergey Bykov, Alan Geller, et al.: “[Orleans: Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/),” Microsoft Research Technical Report MSR-TR-2014-41, March 2014. -1. “[Microsoft Project Orleans Documentation](http://dotnet.github.io/orleans/),” Microsoft Research, *dotnet.github.io*, 2015. -1. David Mercer, Sean Hinde, Yinso Chen, and Richard A O'Keefe: “[beginner: Updating Data Structures](http://erlang.org/pipermail/erlang-questions/2007-October/030318.html),” email thread on *erlang-questions* mailing list, *erlang.com*, October 29, 2007. -1. Fred Hebert: “[Postscript: Maps](http://learnyousomeerlang.com/maps),” *learnyousomeerlang.com*, April 9, 2014. +## Log-Structured Storage + +To start, let’s assume that you want to continue storing data in the append-only file written by +`db_set`, and you just want to speed up reads. One way you could do this is by keeping a hash map in +memory, in which every key is mapped to the byte offset in the file at which the most recent value +for that key can be found, as illustrated in [Figure 4-1](/en/ch4#fig_storage_csv_hash_index). + +![ddia 0401](/fig/ddia_0401.png) + +###### Figure 4-1. Storing a log of key-value pairs in a CSV-like format, indexed with an in-memory hash map. + +Whenever you append a new key-value pair to the file, you also update the hash map to reflect the +offset of the data you just wrote. When you want to look up a value, you use the hash map to find +the offset in the log file, seek to that location, and read the value. If that part of the data file +is already in the filesystem cache, a read doesn’t require any disk I/O at all. 
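
As a rough sketch of the hash-map idea just described, the two Bash functions from the start of the chapter could be extended as follows. This is an illustration under several assumptions, not how a real storage engine does it: it needs Bash 4 or later (for associative arrays), keys must not contain commas, only one process may write, and the names `db_file` and `db_index` are made up for this example.

```shell
#!/bin/bash
# Hypothetical names for this sketch: db_file, db_index.
db_file=database
declare -A db_index   # key -> byte offset at which the latest record for that key starts

db_set () {
    local offset=0
    # The current file size is the byte offset where the new record will begin.
    if [ -f "$db_file" ]; then
        offset=$(wc -c < "$db_file")
    fi
    echo "$1,$2" >> "$db_file"
    # Arithmetic expansion also strips any padding that wc may print.
    db_index["$1"]=$((offset))
}

db_get () {
    local offset=${db_index["$1"]-}
    # A key that is missing from the index was never written.
    if [ -z "$offset" ]; then
        return 1
    fi
    # tail -c +N counts from 1, so start at offset+1 and read a single line,
    # then drop the key and the first comma, leaving only the value.
    tail -c "+$((offset + 1))" "$db_file" | head -n 1 | cut -d, -f2-
}
```

With this version, `db_get` reads one record at a known offset instead of scanning the whole file. The index lives only in the memory of the running shell, so it disappears when the shell exits.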
+ +This approach is much faster, but it still suffers from several problems: + +* You never free up disk space occupied by old log entries that have been overwritten; if you keep + writing to the database you might run out of disk space. +* The hash map is not persisted, so you have to rebuild it when you restart the database—for + example, by scanning the whole log file to find the latest byte offset for each key. This makes + restarts slow if you have a lot of data. +* The hash table must fit in memory. In principle, you could maintain a hash table on disk, but + unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of + random access I/O, it is expensive to grow when it becomes full, and hash collisions require + fiddly logic [[2](/en/ch4#Graefe2011)]. +* Range queries are not efficient. For example, you cannot easily scan over all keys between `10000` + and `19999`—you’d have to look up each key individually in the hash map. + +### The SSTable file format + +In practice, hash tables are not used very often for database indexes, and instead it is much more +common to keep data in a structure that is *sorted by key* +[[3](/en/ch4#Jones2019)]. +One example of such a structure is a *Sorted String Table*, or *SSTable* for short, as shown in +[Figure 4-2](/en/ch4#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that +they are sorted by key, and each key only appears once in the file. + +![ddia 0402](/fig/ddia_0402.png) + +###### Figure 4-2. An SSTable with a sparse index, allowing queries to jump to the right block. + +Now you do not need to keep all the keys in memory: you can group the key-value pairs within an +SSTable into *blocks* of a few kilobytes, and then store the first key of each block in the index. +This kind of index, which stores only some of the keys, is called *sparse*. 
This index is stored in +a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data +structure that allows queries to quickly look up a particular key +[[4](/en/ch4#Lambov2022a)]. + +For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the +first key of the next block is `handsome`. Now say you’re looking for the key `handiwork`, which +doesn’t appear in the sparse index. Because of the sorting you know that `handiwork` must appear +between `handbag` and `handsome`. This means you can seek to the offset for `handbag` and scan the +file from there until you find `handiwork` (or not, if the key is not present in the file). A block +of a few kilobytes can be scanned very quickly. + +Moreover, each block of records can be compressed (indicated by the shaded area in +[Figure 4-2](/en/ch4#fig_storage_sstable_index)). Besides saving disk space, compression also reduces the I/O +bandwidth use, at the cost of using a bit more CPU time. + +### Constructing and merging SSTables + +The SSTable file format is better for reading than an append-only log, but it makes writes more +difficult. We can’t simply append at the end, because then the file would no longer be sorted +(unless the keys happen to be written in ascending order). If we had to rewrite the whole SSTable +every time a key is inserted somewhere in the middle, writes would become far too expensive. + +We can solve this problem with a *log-structured* approach, which is a hybrid between an append-only +log and a sorted file: + +1. When a write comes in, add it to an in-memory ordered map data structure, such as a red-black + tree, skip list [[5](/en/ch4#Cormen2009)], or trie + [[6](/en/ch4#Lambov2022b)]. + With these data structures, you can insert keys in any order, look them up efficiently, and read + them back in sorted order. This in-memory data structure is called the *memtable*. +2. 
When the memtable gets bigger than some threshold—typically a few megabytes—write it out to + disk in sorted order as an SSTable file. We call this new SSTable file the most recent *segment* + of the database, and it is stored as a separate file alongside the older segments. Each segment + has a separate index of its contents. While the new segment is being written out to disk, the + database can continue writing to a new memtable instance, and the old memtable’s memory is freed + when the writing of the SSTable is complete. +3. In order to read the value for some key, first try to find the key in the memtable and the most + recent on-disk segment. If it’s not there, look in the next-older segment, etc. until you either + find the key or reach the oldest segment. If the key does not appear in any of the segments, it + does not exist in the database. +4. From time to time, run a merging and compaction process in the background to combine segment files + and to discard overwritten or deleted values. + +Merging segments works similarly to the *mergesort* algorithm +[[5](/en/ch4#Cormen2009)]. The process is illustrated in +[Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key +in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If +the same key appears in more than one input file, keep only the more recent value. This produces a +new merged segment file, also sorted by key, with one value per key, and it uses minimal memory +because we can iterate over the SSTables one key at a time. + +![ddia 0403](/fig/ddia_0403.png) + +###### Figure 4-3. Merging several SSTable segments, retaining only the most recent value for each key. + +To ensure that the data in the memtable is not lost if the database crashes, the storage engine +keeps a separate log on disk to which every write is immediately appended. 
This log is not sorted by +key, but that doesn’t matter, because its only purpose is to restore the memtable after a crash. +Every time the memtable has been written out to an SSTable, the corresponding part of the log can be +discarded. + +If you want to delete a key and its associated value, you have to append a special deletion record +called a *tombstone* to the data file. When log segments are merged, the tombstone tells the merging +process to discard any previous values for the deleted key. Once the tombstone is merged into the +oldest segment, it can be dropped. + +The algorithm described here is essentially what is used in RocksDB +[[7](/en/ch4#Borthakur2013)], +Cassandra, Scylla, and HBase +[[8](/en/ch4#Bertozzi2012)], +all of which were inspired by Google’s Bigtable paper +[[9](/en/ch4#Chang2006_ch4)] +(which introduced the terms *SSTable* and *memtable*). + +The algorithm was originally published in 1996 under the name *Log-Structured Merge-Tree* or *LSM-Tree* +[[10](/en/ch4#ONeil1996)], +building on earlier work on log-structured filesystems +[[11](/en/ch4#Rosenblum1992)]. +For this reason, storage engines that are based on the principle of merging and compacting sorted +files are often called *LSM storage engines*. + +In LSM storage engines, a segment file is written in one pass (either by writing out the memtable or +by merging some existing segments), and thereafter it is immutable. The merging and compaction of +segments can be done in a background thread, and while it is going on, we can still continue to +serve reads using the old segment files. When the merging process is complete, we switch read +requests to using the new merged segment instead of the old segments, and then the old segment files +can be deleted. + +The segment files don’t necessarily have to be stored on local disk: they are also well suited for +writing to object storage. SlateDB and Delta Lake +[[12](/en/ch4#Armbrust2020)] +take this approach, for example.
+ +Having immutable segment files also simplifies crash recovery: if a crash happens while writing out +the memtable or while merging segments, the database can just delete the unfinished SSTable and +start afresh. The log that persists writes to the memtable could contain incomplete records if there +was a crash halfway through writing a record, or if the disk was full; these are typically detected +by including checksums in the log, and discarding corrupted or incomplete log entries. We will talk +more about durability and crash recovery in [Chapter 8](/en/ch8#ch_transactions). + +### Bloom filters + +With LSM storage it can be slow to read a key that was last updated a long time ago, or that does +not exist, since the storage engine needs to check several segment files. In order to speed up such +reads, LSM storage engines often include a *Bloom filter* +[[13](/en/ch4#Bloom1970)] +in each segment, which provides a fast but approximate way of checking whether a particular key +appears in a particular SSTable. + +[Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in +reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash +function, producing a set of numbers that are then interpreted as indexes into the array of bits +[[14](/en/ch4#Kirsch2008)]. +We set the bits corresponding to those indexes to 1, and leave the rest as 0. For example, the key +`handbag` hashes to the numbers (2, 9, 4), so we set the 2nd, 9th, and 4th bits to 1. The bitmap +is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of +extra space, but the Bloom filter is generally small compared to the rest of the SSTable. + +![ddia 0404](/fig/ddia_0404.png) + +###### Figure 4-4. A Bloom filter provides a fast, probabilistic check whether a particular key exists in a particular SSTable. 
+ +When we want to know whether a key appears in the SSTable, we compute the same hash of that key as +before, and check the bits at those indexes. For example, in [Figure 4-4](/en/ch4#fig_storage_bloom), we’re querying +the key `handheld`, which hashes to (6, 11, 2). One of those bits is 1 (namely, bit number 2), +while the other two are 0. These checks can be made extremely fast using the bitwise operations that +all CPUs support. + +If at least one of the bits is 0, we know that the key definitely does not appear in the SSTable. +If the bits in the query are all 1, it’s likely that the key is in the SSTable, but it’s also +possible that by coincidence all of those bits were set to 1 by other keys. This case, in which it looks +as if a key is present even though it isn’t, is called a *false positive*. + +The probability of false positives depends on the number of keys, the number of bits set per key, +and the total number of bits in the Bloom filter. You can use an online calculator tool to work out +the right parameters for your application +[[15](/en/ch4#Hurst2023)]. +As a rule of thumb, you need to allocate 10 bits of Bloom filter space for every key in the SSTable +to get a false positive probability of 1%, and the probability is reduced tenfold for every 5 +additional bits you allocate per key. + +In the context of an LSM storage engine, false positives are no problem: + +* If the Bloom filter says that a key *is not* present, we can safely skip that SSTable, since we + can be sure that it doesn’t contain the key. +* If the Bloom filter says the key *is* present, we have to consult the sparse index and decode the + block of key-value pairs to check whether the key really is there. If it was a false positive, we + have done a bit of unnecessary work, but otherwise no harm is done; we just continue the search + with the next-oldest segment.
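The mechanics of setting and checking bits can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the filter used by any particular storage engine: the `BloomFilter` class and its method names are invented for this example, and the bit indexes are derived from a SHA-256 digest by double hashing, so they will not match the numbers shown in Figure 4-4.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits, num_hashes):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bitmap; bit i is (bits >> i) & 1

    def _indexes(self, key):
        # Derive num_hashes indexes from two base hashes (double hashing).
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force h2 to be nonzero
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, key):
        for index in self._indexes(key):
            self.bits |= 1 << index

    def might_contain(self, key):
        # False means "definitely absent"; True means "present or false positive".
        return all((self.bits >> index) & 1 for index in self._indexes(key))


sstable_filter = BloomFilter(num_bits=16, num_hashes=3)
for stored_key in ("handbag", "handsome"):
    sstable_filter.add(stored_key)
```

A read path would call `might_contain` for each SSTable from newest to oldest and consult the sparse index only for those where it returns `True`. The filter never returns `False` for a key that was added, which is what makes skipping an SSTable safe.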
+ +### Compaction strategies + +An important detail is how the LSM storage chooses when to perform compaction, and which SSTables to +include in a compaction. Many LSM-based storage systems allow you to configure which compaction +strategy to use, and some of the common choices are +[[16](/en/ch4#Luo2019), +[17](/en/ch4#Sarkar2022)]: + +Size-tiered compaction +: Newer and smaller SSTables are successively merged into older and larger SSTables. The SSTables + containing older data can get very large, and merging them requires a lot of temporary disk space. + The advantage of this strategy is that it can handle very high write throughput. + +Leveled compaction +: The key range is split up into smaller SSTables and older data is moved into separate “levels,” + which allows the compaction to proceed more incrementally and use less disk space than the + size-tiered strategy. This strategy is more efficient for reads than size-tiered compaction + because the storage engine needs to read fewer SSTables to check whether they contain the key. + +As a rule of thumb, size-tiered compaction performs better if you have mostly writes and few reads, +whereas leveled compaction performs better if your workload is dominated by reads. If you write a +small number of keys frequently and a large number of keys rarely, then leveled compaction can also +be advantageous [[18](/en/ch4#Callaghan2018)]. + +Even though there are many subtleties, the basic idea of LSM-trees—keeping a cascade of SSTables +that are merged in the background—is simple and effective. We discuss their performance +characteristics in more detail in [“Comparing B-Trees and LSM-Trees”](/en/ch4#sec_storage_btree_lsm_comparison). + +# Embedded storage engines + +Many databases run as a service that accepts queries over a network, but there are also *embedded* +databases that don’t expose a network API. 
Instead, they are libraries that run in the same process +as your application code, typically reading and writing files on the local disk, and you interact +with them through normal function calls. Examples of embedded storage engines include RocksDB, +SQLite, LMDB, DuckDB, and KùzuDB +[[19](/en/ch4#Rao2023)]. + +Embedded databases are very commonly used in mobile apps to store the local user’s data. On the +backend, they can be an appropriate choice if the data is small enough to fit on a single machine, +and if there are not many concurrent transactions. For example, in a multitenant system in which +each tenant is small enough and completely separate from others (i.e., you do not need to run +queries that combine data from multiple tenants), you can potentially use a separate embedded +database instance per tenant +[[20](/en/ch4#BlueskySQLite)]. + +The storage and retrieval methods we discuss in this chapter are used in both embedded and in +client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques +for scaling a database across multiple machines. + +## B-Trees + +The log-structured approach is popular, but it is not the only form of key-value storage. The most +widely used structure for reading and writing database records by key is the *B-tree*. + +Introduced in 1970 [[21](/en/ch4#Bayer1970)] +and called “ubiquitous” less than 10 years later +[[22](/en/ch4#Comer1979)], +B-trees have stood the test of time very well. They remain the standard index implementation in +almost all relational databases, and many nonrelational databases use them too. + +Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups +and range queries. But that’s where the similarity ends: B-trees have a very different design +philosophy. 
+ +The log-structured indexes we saw earlier break the database down into variable-size *segments*, +typically several megabytes or more in size, that are written once and are then immutable. By +contrast, B-trees break the database down into fixed-size *blocks* or *pages*, and may overwrite a +page in-place. A page is traditionally 4 KiB in size, but PostgreSQL now uses 8 KiB and +MySQL uses 16 KiB by default. + +Each page can be identified using a page number, which allows one page to refer to another—​similar +to a pointer, but on disk instead of in memory. If all the pages are stored in the same file, +multiplying the page number by the page size gives us the byte offset in the file where the page is +located. We can use these page references to construct a tree of pages, as illustrated in +[Figure 4-5](/en/ch4#fig_storage_b_tree). + +![ddia 0405](/fig/ddia_0405.png) + +###### Figure 4-5. Looking up the key 251 using a B-tree index. From the root page we first follow the reference to the page for keys 200–300, then the page for keys 250–270. + +One page is designated as the *root* of the B-tree; whenever you want to look up a key in the index, +you start here. The page contains several keys and references to child pages. +Each child is responsible for a continuous range of keys, and the keys between the references indicate +where the boundaries between those ranges lie. +(This structure is sometimes called a B+ tree, but we don’t need to distinguish it +from other B-tree variants.) + +In the example in [Figure 4-5](/en/ch4#fig_storage_b_tree), we are looking for the key 251, so we know that we need to +follow the page reference between the boundaries 200 and 300. That takes us to a similar-looking +page that further breaks down the 200–300 range into subranges. Eventually we get down to a +page containing individual keys (a *leaf page*), which either contains the value for each key +inline or contains references to the pages where the values can be found. 
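The lookup procedure above can be sketched in Python. This is an illustrative in-memory model, not the page layout of any real database: the `Page` class, the dictionary that stands in for the database file, the helper names `page_offset` and `lookup`, and the keys in the tiny example tree are all assumptions made for this example.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # 4 KiB, the traditional page size mentioned above


def page_offset(page_number, page_size=PAGE_SIZE):
    """Byte offset of a page when all pages are stored in the same file."""
    return page_number * page_size


@dataclass
class Page:
    keys: list                                    # boundaries (interior) or stored keys (leaf)
    children: list = field(default_factory=list)  # child page numbers; empty on a leaf page
    values: list = field(default_factory=list)    # parallel to keys; used on leaf pages only


def lookup(pages, root_page_number, key):
    """Follow page references from the root down to a leaf, then search the leaf."""
    page = pages[root_page_number]
    while page.children:
        # With boundaries [100, 200, 300], child 2 covers keys from 200 up to (not including) 300.
        child_slot = bisect_right(page.keys, key)
        page = pages[page.children[child_slot]]
    slot = bisect_left(page.keys, key)
    if slot < len(page.keys) and page.keys[slot] == key:
        return page.values[slot]
    return None


# A two-level tree: page 0 is the root, pages 1 to 4 are leaf pages.
example_pages = {
    0: Page(keys=[100, 200, 300], children=[1, 2, 3, 4]),
    1: Page(keys=[50], values=["a"]),
    2: Page(keys=[100, 150], values=["b", "c"]),
    3: Page(keys=[200, 251], values=["d", "e"]),
    4: Page(keys=[300], values=["f"]),
}
found = lookup(example_pages, 0, 251)
```

In a disk-based implementation, `pages[n]` would be replaced by reading `PAGE_SIZE` bytes at `page_offset(n)` and decoding them, but the descent from root to leaf would follow the same steps.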
+ +The number of references to child pages in one page of the B-tree is called the *branching factor*. +For example, in [Figure 4-5](/en/ch4#fig_storage_b_tree) the branching factor is six. In practice, the branching +factor depends on the amount of space required to store the page references and the range +boundaries, but typically it is several hundred. + +If you want to update the value for an existing key in a B-tree, you search for the leaf page +containing that key, and overwrite that page on disk with a version that contains the new value. +If you want to add a new key, you need to find the page whose range encompasses the new key and add +it to that page. If there isn’t enough free space in the page to accommodate the new key, the page +is split into two half-full pages, and the parent page is updated to account for the new subdivision +of key ranges. + +![ddia 0406](/fig/ddia_0406.png) + +###### Figure 4-6. Growing a B-tree by splitting a page on the boundary key 337. The parent page is updated to reference both children. + +In the example of [Figure 4-6](/en/ch4#fig_storage_b_tree_split), we want to insert the key 334, but the page for the +range 333–345 is already full. We therefore split it into a page for the range 333–337 (including +the new key), and a page for 337–344. We also have to update the parent page to have references to +both children, with a boundary value of 337 between them. If the parent page doesn’t have enough +space for the new reference, it may also need to be split, and the splits can continue all the way +to the root of the tree. When the root is split, we make a new root above it. Deleting keys (which +may require nodes to be merged) is more complex +[[5](/en/ch4#Cormen2009)]. + +This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth +of *O*(log *n*). 
Most databases can fit into a B-tree that is three or four levels deep, so +you don’t need to follow many page references to find the page you are looking for. (A four-level +tree of 4 KiB pages with a branching factor of 500 can store up to 250 TB.) + +### Making B-trees reliable + +The basic underlying write operation of a B-tree is to overwrite a page on disk with new data. It is +assumed that the overwrite does not change the location of the page; i.e., all references to that +page remain intact when the page is overwritten. This is in stark contrast to log-structured indexes +such as LSM-trees, which only append to files (and eventually delete obsolete files) but never +modify files in place. + +Overwriting several pages at once, like in a page split, is a dangerous operation: if the database +crashes after only some of the pages have been written, you end up with a corrupted tree (e.g., +there may be an *orphan* page that is not a child of any parent). If the hardware can’t atomically +write an entire page, you can also end up with a partially written page (this is known as a *torn +page* [[23](/en/ch4#Miller2025)]). + +In order to make the database resilient to crashes, it is common for B-tree implementations to +include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file +to which every B-tree modification must be written before it can be applied to the pages of the tree +itself. When the database comes back up after a crash, this log is used to restore the B-tree back +to a consistent state [[2](/en/ch4#Graefe2011), +[24](/en/ch4#Mohan1992)]. +In filesystems, the equivalent mechanism is known as *journaling*. + +To improve performance, B-tree implementations typically don’t immediately write every modified page +to disk, but buffer the B-tree pages in memory for a while first. 
The write-ahead log then also +ensures that data is not lost in the case of a crash: as long as data has been written to the WAL, +and flushed to disk using the `fsync()` system call, the data will be durable as the database will +be able to recover it after a crash [[25](/en/ch4#Suzuki2017_ch4)]. + +### B-tree variants + +As B-trees have been around for so long, many variants have been developed over the years. To +mention just a few: + +* Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB) + use a copy-on-write scheme [[26](/en/ch4#Chu2014)]. + A modified page is written to a different location, and a new version of the parent pages in the tree + is created, pointing at the new location. This approach is also useful for concurrency control, as we shall + see in [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation). +* We can save space in pages by not storing the entire key, but abbreviating it. Especially in pages + on the interior of the tree, keys only need to provide enough information to act as boundaries + between key ranges. Packing more keys into a page allows the tree to have a higher branching + factor, and thus fewer levels. +* To speed up scans over the key range in sorted order, some B-tree implementations try to lay out + the tree so that leaf pages appear in sequential order on disk, reducing the number of disk seeks. + However, it’s difficult to maintain that order as the tree grows. +* Additional pointers have been added to the tree. For example, each leaf page may have references to + its sibling pages to the left and right, which allows scanning keys in order without jumping back + to parent pages. + +## Comparing B-Trees and LSM-Trees + +As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads +[[27](/en/ch4#Athanassoulis2016), +[28](/en/ch4#Stopford2015)]. 
+However, benchmarks are often sensitive to details of the workload. You need to test systems with +your particular workload in order to make a valid comparison. Moreover, it’s not a strict either/or +choice between LSM and B-trees: storage engines sometimes blend characteristics of both approaches, +for example by having multiple B-trees and merging them LSM-style. In this section we will briefly +discuss a few things that are worth considering when measuring the performance of a storage engine. + +### Read performance + +In a B-tree, looking up a key involves reading one page at each level of the B-tree. Since the +number of levels is usually quite small, this means that reads from a B-tree are generally fast and +have predictable performance. In an LSM storage engine, reads often have to check several different +SSTables at different stages of compaction, but Bloom filters help reduce the number of actual disk +I/O operations required. Both approaches can perform well, and which is faster depends on the +details of the storage engine and the workload. + +Range queries are simple and fast on B-trees, as they can use the sorted structure of the tree. On +LSM storage, range queries can also take advantage of the SSTable sorting, but they need to scan all +the segments in parallel and combine the results. Bloom filters don’t help for range queries (since +you would need to compute the hash of every possible key within the range, which is impractical), +making range queries more expensive than point queries in the LSM approach +[[29](/en/ch4#Callaghan2016lsm)]. + +High write throughput can cause latency spikes in a log-structured storage engine if the +memtable fills up. This happens if data can’t be written out to disk fast enough, perhaps because +the compaction process cannot keep up with incoming writes. 
Many storage engines, including RocksDB, +perform *backpressure* in this situation: they suspend all reads and writes until the memtable has +been written out to disk +[[30](/en/ch4#Balmau2019), +[31](/en/ch4#RocksDBTuning)]. + +Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read +requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but +storage engines need to be carefully designed to take advantage of this parallelism +[[32](/en/ch4#Haas2023)]. + +### Sequential vs. random writes + +With a B-tree, if the application writes keys that are scattered all over the key space, the +resulting disk operations are also scattered randomly, since the pages that the storage engine needs +to overwrite could be located anywhere on disk. On the other hand, a log-structured storage engine +writes entire segment files at a time (either writing out the memtable or while compacting existing +segments), which are much bigger than a page in a B-tree. + +The pattern of many small, scattered writes (as found in B-trees) is called *random writes*, while +the pattern of fewer large writes (as found in LSM-trees) is called *sequential writes*. Disks +generally have higher sequential write throughput than random write throughput, which means that a +log-structured storage engine can generally handle higher write throughput on the same hardware than +a B-tree. This difference is particularly big on spinning-disk hard drives (HDDs); on the solid +state drives (SSDs) that most databases use today, the difference is smaller, but still noticeable +(see [“Sequential vs. Random Writes on SSDs”](/en/ch4#sidebar_sequential)). + +# Sequential vs. 
Random Writes on SSDs + +On spinning-disk hard drives (HDDs), sequential writes are much faster than random writes: a random +write has to mechanically move the disk head to a new position and wait for the right part of the +platter to pass underneath the disk head, which takes several milliseconds, an eternity in computing +timescales. However, SSDs (solid-state drives) including NVMe (Non-Volatile Memory Express, i.e. +flash memory attached to the PCI Express bus) have now overtaken HDDs for many use cases, and they +are not subject to such mechanical limitations. + +Nevertheless, SSDs also have higher throughput for sequential writes than for random writes. +The reason is that flash memory can be read or written one page (typically 4 KiB) at a time, +but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block +may contain valid data, whereas others may contain data that is no longer needed. Before erasing a +block, the controller must first move pages containing valid data into other blocks; this process is +called *garbage collection* (GC) +[[33](/en/ch4#Goossaert2014)]. + +A sequential write workload writes larger chunks of data at a time, so it is likely that a whole +512 KiB block belongs to a single file; when that file is later deleted again, the whole block +can be erased without having to perform any GC. On the other hand, with a random write workload, it +is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has +to perform more work before a block can be erased +[[34](/en/ch4#Vanlightly2023nvme), +[35](/en/ch4#Alibaba2019_ch4), +[36](/en/ch4#Hu2010)]. + +The write bandwidth consumed by GC is then not available for the application. Moreover, the +additional writes performed by GC contribute to wear on the flash memory; therefore, random writes +wear out the drive faster than sequential writes.
+ +### Write amplification + +With any type of storage engine, one write request from the application turns into multiple I/O +operations on the underlying disk. With LSM-trees, a value is first written to the log for +durability, then again when the memtable is written to disk, and again every time the key-value pair +is part of a compaction. (If the values are significantly larger than the keys, this overhead can be +reduced by storing values separately from keys, and performing compaction only on SSTables +containing keys and references to values +[[37](/en/ch4#Lu2016)].) + +A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once +to the tree page itself. In addition, they sometimes need to write out an entire page, even if only +a few bytes in that page changed, to ensure the B-tree can be correctly recovered after a crash or +power failure [[38](/en/ch4#Zaitsev2006), +[39](/en/ch4#Vondra2016)]. + +If you take the total number of bytes written to disk in some workload, and divide by the number of +bytes you would have to write if you simply wrote an append-only log with no index, you get the +*write amplification*. (Sometimes write amplification is defined in terms of I/O operations rather +than bytes.) In write-heavy applications, the bottleneck might be the rate at which the database can +write to disk. In this case, the higher the write amplification, the fewer writes per second it can +handle within the available disk bandwidth. + +Write amplification is a problem in both LSM-trees and B-trees. Which one is better depends on +various factors, such as the length of your keys and values, and how often you overwrite existing +keys versus insert new ones. For typical workloads, LSM-trees tend to have lower write amplification +because they don’t have to write entire pages and they can compress chunks of the SSTable +[[40](/en/ch4#Callaghan2015)]. 
+This is another factor that makes LSM storage engines well suited for write-heavy workloads. + +Besides affecting throughput, write amplification is also relevant for the wear on SSDs: a storage +engine with lower write amplification will wear out the SSD less quickly. + +When measuring the write throughput of a storage engine, it is important to run the experiment for +long enough that the effects of write amplification become clear. When writing to an empty LSM-tree, +there are no compactions going on yet, so all of the disk bandwidth is available for new writes. As +the database grows, new writes need to share the disk bandwidth with compaction. + +### Disk space usage + +B-trees can become *fragmented* over time: for example, if a large number of keys are deleted, the +database file may contain a lot of pages that are no longer used by the B-tree. Subsequent additions +to the B-tree can use those free pages, but they can’t easily be returned to the operating system +because they are in the middle of the file, so they still take up space on the filesystem. Databases +therefore need a background process that moves pages around to place them better, such as the vacuum +process in PostgreSQL [[25](/en/ch4#Suzuki2017_ch4)]. + +Fragmentation is less of a problem in LSM-trees, since the compaction process periodically rewrites +the data files anyway, and SSTables don’t have pages with unused space. Moreover, blocks of +key-value pairs can better be compressed in SSTables, and thus often produce smaller files on disk +than B-trees. Keys and values that have been overwritten continue to consume space until they are +removed by a compaction, but this overhead is quite low when using leveled compaction +[[40](/en/ch4#Callaghan2015), +[41](/en/ch4#Callaghan2016rocksdb)]. +Size-tiered compaction (see [“Compaction strategies”](/en/ch4#sec_storage_lsm_compaction)) uses more disk space, especially +temporarily during compaction. 
+ +Having multiple copies of some data on disk can also be a problem when you need to delete some data, +and be confident that it really has been deleted (perhaps to comply with data protection +regulations). For example, in most LSM storage engines a deleted record may still exist in the higher +levels until the tombstone representing the deletion has been propagated through all of the +compaction levels, which may take a long time. Specialist storage engine designs can propagate +deletions faster [[42](/en/ch4#Sarkar2023)]. + +On the other hand, the immutable nature of SSTable segment files is useful if you want to take a +snapshot of a database at some point in time (e.g. for a backup or to create a copy of the database +for testing): you can write out the memtable and record which segment files existed at that point in +time. As long as you don’t delete the files that are part of the snapshot, you don’t need to +actually copy them. In a B-tree whose pages are overwritten, taking such a snapshot efficiently is +more difficult. + +## Multi-Column and Secondary Indexes + +So far we have only discussed key-value indexes, which are like a *primary key* index in the +relational model. A primary key uniquely identifies one row in a relational table, or one document +in a document database, or one vertex in a graph database. Other records in the database can refer +to that row/document/vertex by its primary key (or ID), and the index is used to resolve such +references. + +It is also very common to have *secondary indexes*. In relational databases, you can create several +secondary indexes on the same table using the `CREATE INDEX` command, allowing you to search by +columns other than the primary key. For example, in [Figure 3-1](/en/ch3#fig_obama_relational) in [Chapter 3](/en/ch3#ch_datamodels) +you would most likely have a secondary index on the `user_id` columns so that you can find all the +rows belonging to the same user in each of the tables. 
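As a rough illustration of how a secondary index on `user_id` sits alongside the primary key index, here is a small Python sketch. It is an in-memory toy, not how a relational database implements `CREATE INDEX`: the `Table` class, its method names, and the row contents are assumptions made for this example, and updates and deletes are deliberately left out.

```python
from collections import defaultdict


class Table:
    def __init__(self):
        self.rows = {}                    # primary key index: row ID -> row
        self.by_user = defaultdict(list)  # secondary index: user_id -> list of row IDs

    def insert(self, row_id, row):
        # Every write has to maintain both indexes (no update or delete handling here).
        self.rows[row_id] = row
        self.by_user[row["user_id"]].append(row_id)

    def find_by_user(self, user_id):
        # Resolve each matching row ID through the primary key index.
        return [self.rows[row_id] for row_id in self.by_user.get(user_id, [])]


positions = Table()
positions.insert(1, {"user_id": 251, "job_title": "Co-chair"})
positions.insert(2, {"user_id": 251, "job_title": "Co-founder"})
positions.insert(3, {"user_id": 7, "job_title": "Engineer"})
rows_for_user = positions.find_by_user(251)
```

The extra dictionary speeds up lookups by `user_id` at the cost of additional work on every insert, which is the usual trade-off of adding an index.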
+ +A secondary index can easily be constructed from a key-value index. The main difference is that +in a secondary index, the indexed values are not necessarily unique; that is, +there might be many rows (documents, vertices) under the same index entry. This can be +solved in two ways: either by making each value in the index a list of matching row identifiers (like a +postings list in a full-text index) or by making each entry unique by appending a row identifier to +it. Storage engines with in-place updates, like B-trees, and log-structured storage can both be used +to implement an index. + +### Storing values within the index + +The key in an index is the thing that queries search by, but the value can be one of several things: + +* If the actual data (row, document, vertex) is stored directly within the index structure, it is + called a *clustered index*. For example, in MySQL’s InnoDB storage engine, the primary key of a + table is always a clustered index, and in SQL Server, you can specify one clustered index per + table [[43](/en/ch4#Fittl2025)]. +* Alternatively, the value can be a reference to the actual data: either the primary key of the row + in question (InnoDB does this for secondary indexes), or a direct reference to a location on disk. + In the latter case, the place where rows are stored is known as a *heap file*, and it stores data + in no particular order (it may be append-only, or it may keep track of deleted rows in order to + overwrite them with new data later). For example, Postgres uses the heap file approach + [[44](/en/ch4#Silcock2024)]. +* A middle ground between the two is a *covering index* or *index with included columns*, which + stores *some* of a table’s columns within the index, in addition to storing the full row on the + heap or in the primary key clustered index [[45](/en/ch4#Webb2008)]. 
+ This allows some queries to be answered by using the index alone, without having to resolve the + primary key or look in the heap file (in which case, the index is said to *cover* the query). + This can make some queries faster, but the duplication of data means the index uses more disk space and slows down + writes. + +The indexes discussed so far only map a single key to a value. If you need to query multiple columns +of a table (or multiple fields in a document) simultaneously, see [“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional). + +When updating a value without changing the key, the heap file approach can allow the record to be +overwritten in place, provided that the new value is not larger than the old value. The situation is +more complicated if the new value is larger, as it probably needs to be moved to a new location in +the heap where there is enough space. In that case, either all indexes need to be updated to point +at the new heap location of the record, or a forwarding pointer is left behind in the old heap +location [[2](/en/ch4#Graefe2011)]. + +## Keeping everything in memory + +The data structures discussed so far in this chapter have all been answers to the limitations of +disks. Compared to main memory, disks are awkward to deal with. With both magnetic disks and SSDs, +data on disk needs to be laid out carefully if you want good performance on reads and writes. +However, we tolerate this awkwardness because disks have two significant advantages: they are +durable (their contents are not lost if the power is turned off), and they have a lower cost per +gigabyte than RAM. + +As RAM becomes cheaper, the cost-per-gigabyte argument is eroded. Many datasets are simply not that +big, so it’s quite feasible to keep them entirely in memory, potentially distributed across several +machines. This has led to the development of *in-memory databases*. 
+ +Some in-memory key-value stores, such as Memcached, are intended for caching use only, where it’s +acceptable for data to be lost if a machine is restarted. But other in-memory databases aim for +durability, which can be achieved with special hardware (such as battery-powered RAM), by writing a +log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state +to other machines. + +When an in-memory database is restarted, it needs to reload its state, either from disk or over the +network from a replica (unless special hardware is used). Despite writing to disk, it’s still an +in-memory database, because the disk is merely used as an append-only log for durability, and reads +are served entirely from memory. Writing to disk also has operational advantages: files on disk can +easily be backed up, inspected, and analyzed by external utilities. + +Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model, +and the vendors claim that they can offer big performance improvements by removing all the overheads +associated with managing on-disk data structures +[[46](/en/ch4#Stonebraker2007), +[47](/en/ch4#VoltDB2014uj)]. +RAMCloud is an open source, in-memory key-value store with durability (using a log-structured +approach for the data in memory as well as the data on disk) +[[48](/en/ch4#Rumble2014)]. + +Redis and Couchbase provide weak durability by writing to disk asynchronously. + +Counterintuitively, the performance advantage of in-memory databases is not due to the fact that +they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk +if you have enough memory, because the operating system caches recently used disk blocks in memory +anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data +structures in a form that can be written to disk +[[49](/en/ch4#Harizopoulos2008)]. 
+ +Besides performance, another interesting area for in-memory databases is providing data models that +are difficult to implement with disk-based indexes. For example, Redis offers a database-like +interface to various data structures such as priority queues and sets. Because it keeps all data in +memory, its implementation is comparatively simple. + +# Data Storage for Analytics + +The data model of a data warehouse is most commonly relational, because SQL is generally a good fit +for analytic queries. There are many graphical data analysis tools that generate SQL queries, +visualize the results, and allow analysts to explore the data (through operations such as +*drill-down* and *slicing and dicing*). + +On the surface, a data warehouse and a relational OLTP database look similar, because they both have +a SQL query interface. However, the internals of the systems can look quite different, because they +are optimized for very different query patterns. Many database vendors now focus on supporting +either transaction processing or analytics workloads, but not both. + +Some databases, such as Microsoft SQL Server, SAP HANA, and SingleStore, have support for +transaction processing and data warehousing in the same product. However, these hybrid transactional +and analytical processing (HTAP) databases (introduced in [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are increasingly +becoming two separate storage and query engines, which happen to be accessible through a common SQL +interface +[[50](/en/ch4#Larson2013), +[51](/en/ch4#Farber2012), +[52](/en/ch4#Stonebraker2013), +[53](/en/ch4#Prout2022_ch4)]. + +## Cloud Data Warehouses + +Data warehouse vendors such as Teradata, Vertica, and SAP HANA sell both on-premises warehouses +under commercial licenses and cloud-based solutions. But as many of their customers move to the +cloud, new cloud data warehouses such as Google Cloud BigQuery, Amazon Redshift, and Snowflake have +also become widely adopted. 
Unlike traditional data warehouses, cloud data warehouses take advantage +of scalable cloud infrastructure like object storage and serverless computation platforms. + +Cloud data warehouses tend to integrate better with other cloud services and to be more elastic. +For example, many cloud warehouses support automatic log ingestion, and offer easy integration with +data processing frameworks such as Google Cloud’s Dataflow or Amazon Web Services’ Kinesis. These +warehouses are also more elastic because they decouple query computation from the storage layer +[[54](/en/ch4#Tereshko2016)]. +Data is persisted on object storage rather than local disks, which makes it easy to adjust storage +capacity and compute resources for queries independently, as we previously saw in +[“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native). + +Open source data warehouses such as Apache Hive, Trino, and Apache Spark have also evolved with the +cloud. As data storage for analytics has moved to data lakes on object storage, open source warehouses +have begun to break apart +[[55](/en/ch4#McKinney2023)]. The following +components, which were previously integrated in a single system such as Apache Hive, are now often +implemented as separate components: + +Query engine +: Query engines such as Trino, Apache DataFusion, and Presto parse SQL queries, optimize them into + execution plans, and execute them against the data. Execution usually requires parallel, + distributed data processing tasks. Some query engines provide built-in task execution, while + others choose to use third party execution frameworks such as Apache Spark or Apache Flink. + +Storage format +: The storage format determines how the rows of a table are encoded as bytes in a file, which is + then typically stored in object storage or a distributed filesystem + [[12](/en/ch4#Armbrust2020)]. + This data can then be accessed by the query engine, but also by other applications using the data + lake. 
Examples of such storage formats are Parquet, ORC, Lance, and Nimble, and we will see more + about them in the next section. + +Table format +: Files written in Apache Parquet and similar storage formats are typically immutable once written. + To support row inserts and deletions, a table format such as Apache Iceberg or Databricks’s Delta + format is used. Table formats specify a file format that defines which files constitute a table + along with the table’s schema. Such formats also offer advanced features such as time travel (the + ability to query a table as it was at a previous point in time), garbage collection, and even + transactions. + +Data catalog +: Much like a table format defines which files make up a table, a data catalog defines which tables + comprise a database. Catalogs are used to create, rename, and drop tables. Unlike storage and table + formats, data catalogs such as Snowflake’s Polaris and Databricks’s Unity Catalog usually run as a + standalone service that can be queried using a REST interface. Apache Iceberg also offers a + catalog, which can be run inside a client or as a separate process. Query engines use catalog + information when reading and writing tables. Traditionally, catalogs and query engines have been + integrated, but decoupling them has enabled data discovery and data governance systems + (discussed in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance)) to access a catalog’s metadata as well. + +## Column-Oriented Storage + +As discussed in [“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics), data warehouses by convention often use a relational +schema with a big fact table that contains foreign key references into dimension tables. +If you have trillions of rows and petabytes of data in your fact tables, storing and querying them +efficiently becomes a challenging problem.
Dimension tables are usually much smaller (millions of +rows), so in this section we will focus on storage of facts. + +Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4 +or 5 of them at one time (`"SELECT *"` queries are rarely needed for analytics) +[[52](/en/ch4#Stonebraker2013)]. Take the query in +[Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone +buying fruit or candy during the 2024 calendar year), but it only needs to access three columns of +the `fact_sales` table: `date_key`, `product_sk`, +and `quantity`. The query ignores all other columns. + +##### Example 4-1. Analyzing whether people are more inclined to buy fresh fruit or candy, depending on the day of the week + +``` +SELECT + dim_date.weekday, dim_product.category, + SUM(fact_sales.quantity) AS quantity_sold +FROM fact_sales + JOIN dim_date ON fact_sales.date_key = dim_date.date_key + JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk +WHERE + dim_date.year = 2024 AND + dim_product.category IN ('Fresh fruit', 'Candy') +GROUP BY + dim_date.weekday, dim_product.category; +``` + +How can we execute this query efficiently? + +In most OLTP databases, storage is laid out in a *row-oriented* fashion: all the values from one row +of a table are stored next to each other. Document databases are similar: an entire document is +typically stored as one contiguous sequence of bytes. You can see this in the CSV example of +[Figure 4-1](/en/ch4#fig_storage_csv_hash_index). + +In order to process a query like [Example 4-1](/en/ch4#fig_storage_analytics_query), you may have indexes on +`fact_sales.date_key` and/or `fact_sales.product_sk` that tell the storage engine where to find +all the sales for a particular date or for a particular product. 
But then, a row-oriented storage +engine still needs to load all of those rows (each consisting of over 100 attributes) from disk into +memory, parse them, and filter out those that don’t meet the required conditions. That can take a +long time. + +The idea behind *column-oriented* (or *columnar*) storage is simple: don’t store all the values from +one row together, but store all the values from each *column* together instead +[[56](/en/ch4#Stonebraker2005)]. +If each column is stored separately, a query only needs to read and parse those columns that are +used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using +an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema). + +###### Note + +Column storage is easiest to understand in a relational data model, but it applies equally to +nonrelational data. For example, Parquet +[[57](/en/ch4#LeDem2013)] +is a columnar storage format that supports a document data model, based on Google’s Dremel +[[58](/en/ch4#Melnik2010)], +using a technique known as *shredding* or *striping* +[[59](/en/ch4#Kearney2016)]. + +![ddia 0407](/fig/ddia_0407.png) + +###### Figure 4-7. Storing relational data by column, rather than by row. + +The column-oriented storage layout relies on each column storing the rows in the same order. +Thus, if you need to reassemble an entire row, you can take the 23rd entry from each of the +individual columns and put them together to form the 23rd row of the table. + +In fact, columnar storage engines don’t actually store an entire column (containing perhaps +trillions of rows) in one go. Instead, they break the table into blocks of thousands or millions of +rows, and within each block they store the values from each column separately +[[60](/en/ch4#Brandon2023)]. +Since many queries are restricted to a particular date range, it is common to make each block +contain the rows for a particular timestamp range. 
A query then only needs to load the columns it +needs in those blocks that overlap with the required date range. + +Columnar storage is used in almost all analytic databases nowadays +[[60](/en/ch4#Brandon2023)], +ranging from large-scale cloud data warehouses such as Snowflake +[[61](/en/ch4#Dageville2016)] +to single-node embedded databases such as DuckDB +[[62](/en/ch4#Raasveldt2020)], +and product analytics systems such as Pinot +[[63](/en/ch4#Im2018)] +and Druid [[64](/en/ch4#Yang2014)]. +It is used in storage formats such as Parquet, ORC +[[65](/en/ch4#Liu2023), +[66](/en/ch4#Zeng2023)], +Lance [[67](/en/ch4#Pace2024)], +and Nimble [[68](/en/ch4#Helfman2024)], +and in-memory analytics formats like Apache Arrow +[[65](/en/ch4#Liu2023), +[69](/en/ch4#McKinney2021)] +and Pandas/NumPy [[70](/en/ch4#McKinney2022)]. +Some time-series databases, such as InfluxDB IOx +[[71](/en/ch4#Dix2021)] and TimescaleDB +[[72](/en/ch4#Soto2024)], +are also based on column-oriented storage. + +### Column Compression + +Besides only loading those columns from disk that are required for a query, we can further reduce +the demands on disk throughput and network bandwidth by compressing data. Fortunately, +column-oriented storage often lends itself very well to compression. + +Take a look at the sequences of values for each column in [Figure 4-7](/en/ch4#fig_column_store): they often look quite +repetitive, which is a good sign for compression. Depending on the data in the column, different +compression techniques can be used. One technique that is particularly effective in data warehouses +is *bitmap encoding*, illustrated in [Figure 4-8](/en/ch4#fig_bitmap_index). + +![ddia 0408](/fig/ddia_0408.png) + +###### Figure 4-8. Compressed, bitmap-indexed storage of a single column. + +Often, the number of distinct values in a column is small compared to the number of rows (for +example, a retailer may have billions of sales transactions, but only 100,000 distinct products). 
+We can now take a column with *n* distinct values and turn it into *n* separate bitmaps: one bitmap +for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if +not. + +One option is to store those bitmaps using one bit per row. However, these bitmaps typically contain +a lot of zeros (we say that they are *sparse*). In that case, the bitmaps can additionally be +run-length encoded: counting the number of consecutive zeros or ones and storing that number, as +shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the +two bitmap representations, using whichever is the most compact +[[73](/en/ch4#Lemire2016)]. +This can make the encoding of a column remarkably efficient. + +Bitmap indexes such as these are very well suited for the kinds of queries that are common in a data +warehouse. For example: + +`WHERE product_sk IN (31, 68, 69):` +: Load the three bitmaps for `product_sk = 31`, `product_sk = 68`, and `product_sk = 69`, and + calculate the bitwise *OR* of the three bitmaps, which can be done very efficiently. + +`WHERE product_sk = 30 AND store_sk = 3:` +: Load the bitmaps for `product_sk = 30` and `store_sk = 3`, and calculate the bitwise *AND*. This + works because the columns contain the rows in the same order, so the *k*th bit in one column’s + bitmap corresponds to the same row as the *k*th bit in another column’s bitmap. + +Bitmaps can also be used to answer graph queries, such as finding all users of a social network who +are followed by user *X* and who also follow user *Y* +[[74](/en/ch4#Volpert2024)]. +There are also various other compression schemes for columnar databases, which you can find in the +references [[75](/en/ch4#Abadi2013)]. 
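The bitmap encoding and the two example queries above can be sketched in a few lines. This is a minimal illustration, not how any particular database implements it: the column values are made up rather than taken from Figure 4-8, the function names are hypothetical, bitmaps are plain Python lists instead of packed bits, and the run-length encoding stores `(bit, run_length)` pairs, which is a simplified variant of the scheme in the figure.

```python
from itertools import groupby


def bitmap_encode(column):
    """Turn a column into one bitmap per distinct value (one bit per row)."""
    return {v: [1 if x == v else 0 for x in column] for v in sorted(set(column))}


def run_length_encode(bits):
    """Encode a bitmap as (bit, run_length) pairs."""
    return [(bit, len(list(run))) for bit, run in groupby(bits)]


def bitwise_or(*bitmaps):
    """Row-wise OR of several bitmaps of equal length."""
    return [int(any(bits)) for bits in zip(*bitmaps)]


def bitwise_and(*bitmaps):
    """Row-wise AND of several bitmaps of equal length."""
    return [int(all(bits)) for bits in zip(*bitmaps)]


# Hypothetical column contents; both columns store the rows in the same order.
product_sk = [69, 69, 74, 31, 31, 30, 30, 31, 68, 69]
store_sk = [4, 5, 3, 2, 3, 3, 8, 3, 3, 3]

product_bitmaps = bitmap_encode(product_sk)
store_bitmaps = bitmap_encode(store_sk)

# The bitmap for product_sk = 31 is [0, 0, 0, 1, 1, 0, 0, 1, 0, 0].
rle_31 = run_length_encode(product_bitmaps[31])
# rle_31 == [(0, 3), (1, 2), (0, 2), (1, 1), (0, 2)]

# WHERE product_sk IN (31, 68, 69)
in_query = bitwise_or(product_bitmaps[31], product_bitmaps[68], product_bitmaps[69])

# WHERE product_sk = 30 AND store_sk = 3
and_query = bitwise_and(product_bitmaps[30], store_bitmaps[3])

# Row numbers (0-based) that satisfy each query.
in_rows = [i for i, bit in enumerate(in_query) if bit]
and_rows = [i for i, bit in enumerate(and_query) if bit]
```

The AND query only works because the *k*th bit of every bitmap refers to the same row, which is why both columns must be stored in the same row order.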
+ +###### Note + +Don’t confuse column-oriented databases with the *wide-column* (also known as *column-family*) data +model, in which a row can have thousands of columns, and there is no need for all the rows to have +the same columns [[9](/en/ch4#Chang2006_ch4)]. Despite the similarity +in name, wide-column databases are row-oriented, since they store all values from a row together. +Google’s Bigtable, Apache Accumulo, and HBase are examples of the wide-column model. + +### Sort Order in Column Storage + +In a column store, it doesn’t necessarily matter in which order the rows are stored. It’s easiest to +store them in the order in which they were inserted, since then inserting a new row just means +appending to each of the columns. However, we can choose to impose an order, like we did with +SSTables previously, and use that as an indexing mechanism. + +Note that it wouldn’t make sense to sort each column independently, because then we would no longer +know which items in the columns belong to the same row. We can only reconstruct a row because we +know that the *k*th item in one column belongs to the same row as the *k*th item in another +column. + +Rather, the data needs to be sorted an entire row at a time, even though it is stored by column. +The administrator of the database can choose the columns by which the table should be sorted, using +their knowledge of common queries. For example, if queries often target date ranges, such as the +last month, it might make sense to make `date_key` the first sort key. Then the query can +scan only the rows from the last month, which will be much faster than scanning all rows. + +A second column can determine the sort order of any rows that have the same value in the first +column. 
For example, if `date_key` is the first sort key in [Figure 4-7](/en/ch4#fig_column_store), it might make +sense for `product_sk` to be the second sort key so that all sales for the same product on the same +day are grouped together in storage. That will help queries that need to group or filter sales by +product within a certain date range. + +Another advantage of sorted order is that it can help with compression of columns. If the primary +sort column does not have many distinct values, then after sorting, it will have long sequences +where the same value is repeated many times in a row. A simple run-length encoding, like we used for +the bitmaps in [Figure 4-8](/en/ch4#fig_bitmap_index), could compress that column down to a few kilobytes—even if +the table has billions of rows. + +That compression effect is strongest on the first sort key. The second and third sort keys will be +more jumbled up, and thus not have such long runs of repeated values. Columns further down the +sorting priority appear in essentially random order, so they probably won’t compress as well. But +having the first few columns sorted is still a win overall. + +### Writing to Column-Oriented Storage + +We saw in [“Characterizing Transaction Processing and Analytics”](/en/ch1#sec_introduction_oltp) that reads in data warehouses tend to consist of aggregations +over a large number of rows; column-oriented storage, compression, and sorting all help to make +those read queries faster. Writes in a data warehouse tend to be a bulk import of data, often via an +ETL process. + +With columnar storage, writing an individual row somewhere in the middle of a sorted table would be +very inefficient, as you would have to rewrite all the compressed columns from the insertion +position onwards. However, a bulk write of many rows at once amortizes the cost of rewriting those +columns, making it efficient. + +A log-structured approach is often used to perform writes in batches. 
All writes first go to a +row-oriented, sorted, in-memory store. When enough writes have accumulated, they are merged with the +column-encoded files on disk and written to new files in bulk. As old files remain immutable, and +new files are written in one go, object storage is well suited for storing these files. + +Queries need to examine both the column data on disk and the recent writes in memory, and combine +the two. The query execution engine hides this distinction from the user. From an analyst’s point +of view, data that has been modified with inserts, updates, or deletes is immediately reflected in +subsequent queries. Snowflake, Vertica, Apache Pinot, Apache Druid, and many others do this +[[61](/en/ch4#Dageville2016), [63](/en/ch4#Im2018), +[64](/en/ch4#Yang2014), +[76](/en/ch4#Lamb2012)]. + +## Query Execution: Compilation and Vectorization + +A complex SQL query for analytics is broken down into a *query plan* consisting of multiple stages, +called *operators*, which may be distributed across multiple machines for parallel execution. Query +planners can perform a lot of optimizations by choosing which operators to use, in which order to +perform them, and where to run each operator. + +Within each operator, the query engine needs to do various things with the values in a column, such +as finding all the rows where the value is among a particular set of values (perhaps as part of a +join), or checking whether the value is greater than 15. It also needs to look at several columns +for the same row, for example to find all sales transactions where the product is bananas and the +store is a particular store of interest. + +For data warehouse queries that need to scan over millions of rows, we need to worry not only about +the amount of data they need to read off disk, but also the CPU time required to execute complex +operators. 
The simplest kind of operator is like an interpreter for a programming language: while +iterating over each row, it checks a data structure representing the query to find out which +comparisons or calculations it needs to perform on which columns. Unfortunately, this is too slow +for many analytics purposes. Two alternative approaches for efficient query execution have emerged +[[77](/en/ch4#Kersten2018)]: + +Query compilation +: The query engine takes the SQL query and generates code for executing it. The code iterates over + the rows one by one, looks at the values in the columns of interest, performs whatever comparisons + or calculations are needed, and copies the necessary values to an output buffer if the required + conditions are satisfied. The query engine compiles the generated code to machine code (often + using an existing compiler such as LLVM), and then runs it on the column-encoded data that has + been loaded into memory. This approach to code generation is similar to the just-in-time (JIT) + compilation approach that is used in the Java Virtual Machine (JVM) and similar runtimes. + +Vectorized processing +: The query is interpreted, not compiled, but it is made fast by processing many values from a + column in a batch, instead of iterating over rows one by one. A fixed set of predefined operators + are built into the database; we can pass arguments to them and get back a batch of results + [[50](/en/ch4#Larson2013), [75](/en/ch4#Abadi2013)]. + + For example, we could pass the `product_sk` column and the ID of “bananas” to an equality operator, + and get back a bitmap (one bit per value in the input column, which is 1 if it’s a banana); we could + then pass the `store_sk` column and the ID of the store of interest to the same equality operator, + and get back another bitmap; and then we could pass the two bitmaps to a “bitwise AND” operator, as + shown in [Figure 4-9](/en/ch4#fig_bitmap_and). 
The result would be a bitmap containing a 1 for all sales of bananas in + a particular store. + +![ddia 0409](/fig/ddia_0409.png) + +###### Figure 4-9. A bitwise AND between two bitmaps lends itself to vectorization. + +The two approaches are very different in terms of their implementation, but both are used in +practice [[77](/en/ch4#Kersten2018)]. Both can achieve very good +performance by taking advantage of the characteristics of modern CPUs: + +* preferring sequential memory access over random access to reduce cache misses + [[78](/en/ch4#Smith2020)], +* doing most of the work in tight inner loops (that is, with a small number of instructions and no + function calls) to keep the CPU instruction processing pipeline busy and avoid branch + mispredictions, +* making use of parallelism such as multiple threads and single-instruction-multi-data (SIMD) + instructions [[79](/en/ch4#Boncz2005), + [80](/en/ch4#Zhou2002)], and +* operating directly on compressed data without decoding it into a separate in-memory + representation, which saves memory allocation and copying costs. + +## Materialized Views and Data Cubes + +We previously encountered *materialized views* in [“Materializing and Updating Timelines”](/en/ch2#sec_introduction_materializing): +in a relational data model, they are table-like objects whose contents are the results of some +query. The difference is that a materialized view is an actual copy of the query results, written to +disk, whereas a virtual view is just a shortcut for writing queries. When you read from a virtual +view, the SQL engine expands it into the view’s underlying query on the fly and then processes the +expanded query. + +When the underlying data changes, a materialized view needs to be updated accordingly. Some +databases can do that automatically, and there are also systems such as Materialize that specialize +in materialized view maintenance +[[81](/en/ch4#Bartley2024)].
+Performing such updates means more work on writes, but materialized views can improve read +performance in workloads that repeatedly need to perform the same queries. + +*Materialized aggregates* are a type of materialized view that can be useful in data warehouses. As +discussed earlier, data warehouse queries often involve an aggregate function, such as `COUNT`, `SUM`, +`AVG`, `MIN`, or `MAX` in SQL. If the same aggregates are used by many different queries, it can be +wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that +queries use most often? A *data cube* or *OLAP cube* does this by creating a grid of aggregates +grouped by different dimensions +[[82](/en/ch4#Gray2007)]. +[Figure 4-10](/en/ch4#fig_data_cube) shows an example. + +![ddia 0410](/fig/ddia_0410.png) + +###### Figure 4-10. Two dimensions of a data cube, aggregating data by summing. + +Imagine for now that each fact has foreign keys to only two dimension tables; in [Figure 4-10](/en/ch4#fig_data_cube), +these are `date_key` and `product_sk`. You can now draw a two-dimensional table, with +dates along one axis and products along the other. Each cell contains the aggregate (e.g., `SUM`) of +an attribute (e.g., `net_price`) of all facts with that date-product combination. Then you can apply +the same aggregate along each row or column and get a summary that has been reduced by one +dimension (the sales by product regardless of date, or the sales by date regardless of product). + +In general, facts often have more than two dimensions. In [Figure 3-5](/en/ch3#fig_dwh_schema) there are five +dimensions: date, product, store, promotion, and customer. It’s a lot harder to imagine what a +five-dimensional hypercube would look like, but the principle remains the same: each cell contains +the sales for a particular date-product-store-promotion-customer combination. These values can then +repeatedly be summarized along each of the dimensions.
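The two-dimensional case can be sketched as follows. The fact rows and key values are invented for illustration, and the dictionaries stand in for whatever storage a real data warehouse would use for its precomputed aggregates.

```python
from collections import defaultdict

# Hypothetical facts: (date_key, product_sk, net_price).
facts = [
    (240101, 31, 2.50),
    (240101, 31, 1.50),
    (240101, 69, 4.00),
    (240102, 31, 3.00),
    (240102, 68, 5.00),
]

# Each cell of the cube holds SUM(net_price) for one date-product combination.
cube = defaultdict(float)
for date_key, product_sk, net_price in facts:
    cube[(date_key, product_sk)] += net_price

# Summing along one axis reduces the cube by one dimension.
sales_by_date = defaultdict(float)     # regardless of product
sales_by_product = defaultdict(float)  # regardless of date
for (date_key, product_sk), total in cube.items():
    sales_by_date[date_key] += total
    sales_by_product[product_sk] += total

# Summing once more collapses the remaining dimension into a grand total.
grand_total = sum(sales_by_date.values())
```

Once `cube` has been computed, a query such as "total sales on 240101" reads a single precomputed entry of `sales_by_date` instead of scanning the fact rows again.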
+ +The advantage of a materialized data cube is that certain queries become very fast because they +have effectively been precomputed. For example, if you want to know the total sales per store +yesterday, you just need to look at the totals along the appropriate dimension—no need to scan +millions of rows. + +The disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data. For example, +there is no way of calculating which proportion of sales comes from items that cost more than $100, +because the price isn’t one of the dimensions. Most data warehouses therefore try to keep as much +raw data as possible, and use aggregates such as data cubes only as a performance boost for certain +queries. + +# Multidimensional and Full-Text Indexes + +The B-trees and LSM-trees we saw in the first half of this chapter allow range queries over a single +attribute: for example, if the key is a username, you can use them as an index to efficiently find +all names starting with an L. But sometimes, searching by a single attribute is not enough. + +The most common type of multi-column index is called a *concatenated index*, which simply combines +several fields into one key by appending one column to another (the index definition specifies in +which order the fields are concatenated). This is like an old-fashioned paper phone book, which +provides an index from (*lastname*, *firstname*) to phone number. Due to the sort order, the index +can be used to find all the people with a particular last name, or all the people with a particular +*lastname-firstname* combination. However, the index is useless if you want to find all the people +with a particular first name. + +On the other hand, *multi-dimensional indexes* allow you to query several columns at once. +One case where this is particularly important is geospatial data. For example, a restaurant-search +website may have a database containing the latitude and longitude of each restaurant. 
When a user is +looking at the restaurants on a map, the website needs to search for all the restaurants within the +rectangular map area that the user is currently viewing. This requires a two-dimensional range query +like the following: + +``` +SELECT * FROM restaurants WHERE latitude > 51.4946 AND latitude < 51.5079 + AND longitude > -0.1162 AND longitude < -0.1004; +``` + +A concatenated index over the latitude and longitude columns is not able to answer that kind of +query efficiently: it can give you either all the restaurants in a range of latitudes (but at any +longitude), or all the restaurants in a range of longitudes (but anywhere between the North and +South poles), but not both simultaneously. + +One option is to translate a two-dimensional location into a single number using a space-filling +curve, and then to use a regular B-tree index +[[83](/en/ch4#Ramsak2000)]. +More commonly, specialized spatial indexes such as R-trees or Bkd-trees +[[84](/en/ch4#Procopiuc2003)] +are used; they divide up the space so that nearby data points tend to be grouped in the same +subtree. For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL’s +Generalized Search Tree indexing facility +[[85](/en/ch4#Hellerstein1995)]. +It is also possible to use regularly spaced grids of triangles, squares, or hexagons +[[86](/en/ch4#Brodsky2018)]. + +Multi-dimensional indexes are not just for geographic locations. For example, on an ecommerce +website you could use a three-dimensional index on the dimensions (*red*, *green*, *blue*) to search +for products in a certain range of colors, or in a database of weather observations you could have a +two-dimensional index on (*date*, *temperature*) in order to efficiently search for all the +observations during the year 2013 where the temperature was between 25 and 30℃. 
With a +one-dimensional index, you would have to either scan over all the records from 2013 (regardless of +temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by +timestamp and temperature simultaneously +[[87](/en/ch4#Escriva2012)]. + +## Full-Text Search + +Full-text search allows you to search a collection of text documents (web pages, product +descriptions, etc.) by keywords that might appear anywhere in the text +[[88](/en/ch4#Manning2008_ch4)]. +Information retrieval is a big, specialist topic that often involves language-specific processing: +for example, several Asian languages are written without spaces or punctuation between words, and +therefore splitting text into words requires a model that indicates which character sequences +constitute a word. Full-text search also often involves matching words that are similar but not +identical (such as typos or different grammatical forms of words) and synonyms. Those problems go +beyond the scope of this book. + +However, at its core, you can think of full-text search as another kind of multidimensional query: +in this case, each word that might appear in a text (a *term*) is a dimension. A document that +contains term *x* has a value of 1 in dimension *x*, and a document that doesn’t contain *x* has a +value of 0. Searching for documents mentioning “red apples” means a query that looks for a 1 in the +*red* dimension, and simultaneously a 1 in the *apples* dimension. The number of dimensions may thus +be very large. + +The data structure that many search engines use to answer such queries is called an *inverted +index*. This is a key-value structure where the key is a term, and the value is the list of IDs of +all the documents that contain the term (the *postings list*). 
If the document IDs are sequential +numbers, the postings list can also be represented as a sparse bitmap, as in [Figure 4-8](/en/ch4#fig_bitmap_index): +the *n*th bit in the bitmap for term *x* is a 1 if the document with ID *n* contains the term *x* +[[89](/en/ch4#Wang2017)]. + +Finding all the documents that contain both terms *x* and *y* is now similar to a vectorized data +warehouse query that searches for rows matching two conditions ([Figure 4-9](/en/ch4#fig_bitmap_and)): load the two +bitmaps for terms *x* and *y* and compute their bitwise AND. Even if the bitmaps are run-length +encoded, this can be done very efficiently. + +For example, Lucene, the full-text indexing engine used by Elasticsearch and Solr, works like this +[[90](/en/ch4#Grand2013)]. +It stores the mapping from term to postings list in SSTable-like sorted files, which are merged in +the background using the same log-structured approach we saw earlier in this chapter +[[91](/en/ch4#McCandless2011merges)]. +PostgreSQL’s GIN index type also uses postings lists to support full-text search and indexing inside +JSON documents +[[92](/en/ch4#Fittl2021), +[93](/en/ch4#Angelakos2020)]. + +Instead of breaking text into words, an alternative is to find all the substrings of length *n*, +which are called *n*-grams. For example, the trigrams (*n* = 3) of the string +`"hello"` are `"hel"`, `"ell"`, and `"llo"`. If we build an inverted index of all trigrams, we can +search the documents for arbitrary substrings that are at least three characters long. Trigram +indexes even allow regular expressions in search queries; the downside is that they are quite large +[[94](/en/ch4#Korotkov2012)]. + +To cope with typos in documents or queries, Lucene is able to search text for words within a certain +edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced) +[[95](/en/ch4#McCandless2011fuzzy)].
+It does this by storing the set of terms as a finite state automaton over the characters in the +keys, similar to a *trie* +[[96](/en/ch4#Heinz2002)], +and transforming it into a *Levenshtein automaton*, which supports efficient search for words within +a given edit distance [[97](/en/ch4#Schulz2002)]. + +## Vector Embeddings + +Semantic search goes beyond synonyms and typos to try and understand document concepts +and user intentions. For example, if your help pages contain a page titled “cancelling your +subscription”, users should still be able to find that page when searching for “how to close my +account” or “terminate contract”, which are close in terms of meaning even though they use +completely different words. + +To understand a document’s semantics—​its meaning—​semantic search indexes use embedding models to +translate a document into a vector of floating-point values, called a *vector embedding*. The vector +represents a point in a multi-dimensional space, and each floating-point value represents the document’s +location along one dimension’s axis. Embedding models generate vector embeddings that are near +each other (in this multi-dimensional space) when the embedding’s input documents are semantically +similar. + +###### Note + +We saw the term *vectorized processing* in [“Query Execution: Compilation and Vectorization”](/en/ch4#sec_storage_vectorized). +Vectors in semantic search have a different meaning. In vectorized processing, the vector refers to +a batch of bits that can be processed with specially optimized code. In embedding models, vectors are a list of +floating point numbers that represent a location in multi-dimensional space. + +For example, a three-dimensional vector embedding for a Wikipedia page about agriculture might be +[0.1, 0.22, 0.11]. A Wikipedia page about vegetables would be quite near, perhaps with an embedding +of [0.13, 0.19, 0.24]. 
A page about star schemas might have an embedding of [0.82, 0.39, -0.74], +comparatively far away. We can tell by looking that the first two vectors are closer to each other than either is to the third. + +Embedding models use much larger vectors (often over 1,000 numbers), but the principles are the +same. We don’t try to understand what the individual numbers mean; +they’re simply a way for embedding models to point to a location in an abstract multi-dimensional +space. Search engines use distance functions such as cosine similarity or Euclidean distance to +measure the distance between vectors. Cosine similarity measures the cosine of the angle between two +vectors to determine how close they are, while Euclidean distance measures the straight-line +distance between two points in space. + +Many early embedding models such as Word2Vec +[[98](/en/ch4#Mikolov2013)], +BERT +[[99](/en/ch4#Devlin2018)], +and GPT +[[100](/en/ch4#Radford2018)] +worked with text data. Such models are usually implemented as neural networks. Researchers went on to +create embedding models for video, audio, and images as well. More recently, model +architectures have become *multimodal*: a single model can generate vector embeddings for multiple +modalities such as text and images. + +Semantic search engines use an embedding model to generate a vector embedding when a user enters a +query. The user’s query and related context (such as a user’s location) are fed into the embedding +model. After the embedding model generates the query’s vector embedding, the search engine must find +documents with similar vector embeddings using a vector index. + +Vector indexes store the vector embeddings of a collection of documents. To query the index, you +pass in the vector embedding of the query, and the index returns the documents whose vectors are +closest to the query vector. 
Since the R-trees we saw previously don’t work well for vectors with +many dimensions, specialized vector indexes are used, such as: + +Flat indexes +: Vectors are stored in the index as they are. A query must read every vector and measure its + distance to the query vector. Flat indexes are accurate, but measuring the distance between the + query and each vector is slow. + +Inverted file (IVF) indexes +: The vector space is clustered into partitions (called *centroids*) of vectors to reduce the number + of vectors that must be compared. IVF indexes are faster than flat indexes, but can give only + approximate results: the query and a document may fall into different partitions, even though they + are close to each other. A query on an IVF index first defines *probes*, which are simply the number + of partitions to check. Queries that use more probes will be more accurate, but will be slower, as + more vectors must be compared. + +Hierarchical Navigable Small World (HNSW) +: HNSW indexes maintain multiple layers of the vector space, as illustrated in [Figure 4-11](/en/ch4#fig_vector_hnsw). + Each layer is represented as a graph, where nodes represent vectors, and edges represent proximity + to nearby vectors. A query starts by locating the nearest vector in the topmost layer, which has a + small number of nodes. The query then moves to the same node in the layer below and follows the + edges in that layer, which is more densely connected, looking for a vector that is closer to the + query vector. The process continues until the last layer is reached. As with IVF indexes, HNSW + indexes are approximate. + +![ddia 0411](/fig/ddia_0411.png) + +###### Figure 4-11. Searching for the database entry that is closest to a given query vector in an HNSW index. + +Many popular vector databases implement IVF and HNSW indexes. 
Facebook’s Faiss library has many +variations of each +[[101](/en/ch4#Faiis2023)], +and PostgreSQL’s pgvector supports both as well +[[102](/en/ch4#Matevosyan2024)]. +The full details of the IVF and HNSW algorithms are beyond the scope of this book, but their papers +are an excellent resource +[[103](/en/ch4#Baranchuk2018), +[104](/en/ch4#Malkov2020)]. + +# Summary + +In this chapter we tried to get to the bottom of how databases perform storage and retrieval. What +happens when you store data in a database, and what does the database do when you query for the +data again later? + +[“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics) introduced the distinction between transaction processing (OLTP) and +analytics (OLAP). In this chapter we saw that storage engines optimized for OLTP look very different +from those optimized for analytics: + +* OLTP systems are optimized for a high volume of requests, each of which reads and writes a small + number of records, and which need fast responses. The records are typically accessed via a primary + key or a secondary index, and these indexes are typically ordered mappings from key to record, + which also support range queries. +* Data warehouses and similar analytic systems are optimized for complex read queries that scan over + a large number of records. They generally use a column-oriented storage layout with compression + that minimizes the amount of data that such a query needs to read off disk, and just-in-time + compilation of queries or vectorization to minimize the amount of CPU time spent processing the + data. + +On the OLTP side, we saw storage engines from two main schools of thought: + +* The log-structured approach, which only permits appending to files and deleting obsolete files, + but never updates a file that has been written. SSTables, LSM-trees, RocksDB, Cassandra, HBase, + Scylla, Lucene, and others belong to this group. 
In general, log-structured storage engines tend + to provide high write throughput. +* The update-in-place approach, which treats the disk as a set of fixed-size pages that can be + overwritten. B-trees, the biggest example of this philosophy, are used in all major relational + OLTP databases and also many nonrelational ones. As a rule of thumb, B-trees tend to be better for + reads, providing higher read throughput and lower response times than log-structured storage. + +We then looked at indexes that can search for multiple conditions at the same time: multidimensional +indexes such as R-trees that can search for points on a map by latitude and longitude at the same +time, and full-text search indexes that can search for multiple keywords appearing in the same text. +Finally, vector databases are used for semantic search on text documents and other media; they use +vectors with a larger number of dimensions and find similar documents by comparing vector +similarity. + +As an application developer, if you’re armed with this knowledge about the internals of storage +engines, you are in a much better position to know which tool is best suited for your particular +application. If you need to adjust a database’s tuning parameters, this understanding allows you to +imagine what effect a higher or a lower value may have. + +Although this chapter couldn’t make you an expert in tuning any one particular storage engine, it +has hopefully equipped you with enough vocabulary and ideas that you can make sense of the +documentation for the database of your choice. + +##### Footnotes + +##### References + +[[1](/en/ch4#Samokhvalov2021-marker)] Nikolay Samokhvalov. +[How +partial, covering, and multicolumn indexes may slow down UPDATEs in PostgreSQL](https://postgres.ai/blog/20211029-how-partial-and-covering-indexes-affect-update-performance-in-postgresql). +*postgres.ai*, October 2021. 
+Archived at [perma.cc/PBK3-F4G9](https://perma.cc/PBK3-F4G9) + +[[2](/en/ch4#Graefe2011-marker)] Goetz Graefe. +[Modern B-Tree Techniques](https://w6113.github.io/files/papers/btreesurvey-graefe.pdf). +*Foundations and Trends in Databases*, volume 3, issue 4, pages 203–402, August 2011. +[doi:10.1561/1900000028](https://doi.org/10.1561/1900000028) + +[[3](/en/ch4#Jones2019-marker)] Evan Jones. +[Why databases use ordered +indexes but programming uses hash tables](https://www.evanjones.ca/ordered-vs-unordered-indexes.html). *evanjones.ca*, December 2019. +Archived at [perma.cc/NJX8-3ZZD](https://perma.cc/NJX8-3ZZD) + +[[4](/en/ch4#Lambov2022a-marker)] Branimir Lambov. +[CEP-25: +Trie-indexed SSTable format](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A%2BTrie-indexed%2BSSTable%2Bformat). *cwiki.apache.org*, November 2022. +Archived at [perma.cc/HD7W-PW8U](https://perma.cc/HD7W-PW8U). +Linked Google Doc archived at [perma.cc/UL6C-AAAE](https://perma.cc/UL6C-AAAE) + +[[5](/en/ch4#Cormen2009-marker)] Thomas H. Cormen, Charles E. +Leiserson, Ronald L. Rivest, and Clifford Stein: *Introduction to Algorithms*, 3rd edition. +MIT Press, 2009. ISBN: 978-0-262-53305-8 + +[[6](/en/ch4#Lambov2022b-marker)] Branimir Lambov. +[Trie Memtables in Cassandra](https://www.vldb.org/pvldb/vol15/p3359-lambov.pdf). +*Proceedings of the VLDB Endowment*, volume 15, issue 12, pages 3359–3371, August 2022. +[doi:10.14778/3554821.3554828](https://doi.org/10.14778/3554821.3554828) + +[[7](/en/ch4#Borthakur2013-marker)] Dhruba Borthakur. +[The History of RocksDB](https://rocksdb.blogspot.com/2013/11/the-history-of-rocksdb.html). +*rocksdb.blogspot.com*, November 2013. +Archived at [perma.cc/Z7C5-JPSP](https://perma.cc/Z7C5-JPSP) + +[[8](/en/ch4#Bertozzi2012-marker)] Matteo Bertozzi. +[Apache HBase I/O – HFile](https://blog.cloudera.com/apache-hbase-i-o-hfile/). +*blog.cloudera.com*, June 2012. 
+Archived at [perma.cc/U9XH-L2KL](https://perma.cc/U9XH-L2KL) + +[[9](/en/ch4#Chang2006_ch4-marker)] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, +Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. +[Bigtable: A Distributed Storage System +for Structured Data](https://research.google/pubs/pub27898/). At *7th USENIX Symposium on Operating System Design and +Implementation* (OSDI), November 2006. + +[[10](/en/ch4#ONeil1996-marker)] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and +Elizabeth O’Neil. +[The Log-Structured Merge-Tree (LSM-Tree)](https://www.cs.umb.edu/~poneil/lsmtree.pdf). +*Acta Informatica*, volume 33, issue 4, pages 351–385, June 1996. +[doi:10.1007/s002360050048](https://doi.org/10.1007/s002360050048) + +[[11](/en/ch4#Rosenblum1992-marker)] Mendel Rosenblum and John K. Ousterhout. +[The Design and Implementation of +a Log-Structured File System](https://research.cs.wisc.edu/areas/os/Qual/papers/lfs.pdf). +*ACM Transactions on Computer Systems*, volume 10, issue 1, pages 26–52, February 1992. +[doi:10.1145/146941.146943](https://doi.org/10.1145/146941.146943) + +[[12](/en/ch4#Armbrust2020-marker)] Michael Armbrust, Tathagata Das, Liwen Sun, +Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja +Łuszczak, Michał Świtakowski, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter +Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, and Matei Zaharia. +[Delta Lake: High-Performance ACID Table +Storage over Cloud Object Stores](https://vldb.org/pvldb/vol13/p3411-armbrust.pdf). *Proceedings of the VLDB Endowment*, volume 13, +issue 12, pages 3411–3424, August 2020. +[doi:10.14778/3415478.3415560](https://doi.org/10.14778/3415478.3415560) + +[[13](/en/ch4#Bloom1970-marker)] Burton H. Bloom. +[Space/Time +Trade-offs in Hash Coding with Allowable Errors](https://people.cs.umass.edu/~emery/classes/cmpsci691st/readings/Misc/p422-bloom.pdf). 
*Communications of the ACM*, +volume 13, issue 7, pages 422–426, July 1970. +[doi:10.1145/362686.362692](https://doi.org/10.1145/362686.362692) + +[[14](/en/ch4#Kirsch2008-marker)] Adam Kirsch and Michael Mitzenmacher. +[Less Hashing, Same +Performance: Building a Better Bloom Filter](https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf). *Random Structures & Algorithms*, +volume 33, issue 2, pages 187–218, September 2008. +[doi:10.1002/rsa.20208](https://doi.org/10.1002/rsa.20208) + +[[15](/en/ch4#Hurst2023-marker)] Thomas Hurst. +[Bloom Filter Calculator](https://hur.st/bloomfilter/). *hur.st*, September 2023. +Archived at [perma.cc/L3AV-6VC2](https://perma.cc/L3AV-6VC2) + +[[16](/en/ch4#Luo2019-marker)] Chen Luo and Michael J. Carey. +[LSM-based storage techniques: a survey](https://arxiv.org/abs/1812.07527). +*The VLDB Journal*, volume 29, pages 393–418, July 2019. +[doi:10.1007/s00778-019-00555-y](https://doi.org/10.1007/s00778-019-00555-y) + +[[17](/en/ch4#Sarkar2022-marker)] Subhadeep Sarkar and Manos Athanassoulis. +[Dissecting, Designing, and Optimizing +LSM-based Data Stores](https://www.youtube.com/watch?v=hkMkBZn2mGs). Tutorial at *ACM International Conference on Management of Data* +(SIGMOD), June 2022. Slides archived at +[perma.cc/93B3-E827](https://perma.cc/93B3-E827) + +[[18](/en/ch4#Callaghan2018-marker)] Mark Callaghan. +[Name that +compaction algorithm](https://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html). *smalldatum.blogspot.com*, August 2018. +Archived at [perma.cc/CN4M-82DY](https://perma.cc/CN4M-82DY) + +[[19](/en/ch4#Rao2023-marker)] Prashanth Rao. +[Embedded databases (1): The harmony of +DuckDB, KùzuDB and LanceDB](https://thedataquarry.com/posts/embedded-db-1/). *thedataquarry.com*, August 2023. +Archived at [perma.cc/PA28-2R35](https://perma.cc/PA28-2R35) + +[[20](/en/ch4#BlueskySQLite-marker)] Hacker News discussion. 
+[Bluesky migrates to single-tenant SQLite](https://news.ycombinator.com/item?id=38171322). +*news.ycombinator.com*, October 2023. +Archived at [perma.cc/69LM-5P6X](https://perma.cc/69LM-5P6X) + +[[21](/en/ch4#Bayer1970-marker)] Rudolf Bayer and Edward M. McCreight. +[Organization and Maintenance of Large +Ordered Indices](https://dl.acm.org/doi/pdf/10.1145/1734663.1734671). Boeing Scientific Research Laboratories, Mathematical and Information Sciences +Laboratory, report no. 20, July 1970. +[doi:10.1145/1734663.1734671](https://doi.org/10.1145/1734663.1734671) + +[[22](/en/ch4#Comer1979-marker)] Douglas Comer. +[The +Ubiquitous B-Tree](https://web.archive.org/web/20170809145513id_/http%3A//sites.fas.harvard.edu/~cs165/papers/comer.pdf). *ACM Computing Surveys*, volume 11, issue 2, pages 121–137, June 1979. +[doi:10.1145/356770.356776](https://doi.org/10.1145/356770.356776) + +[[23](/en/ch4#Miller2025-marker)] Alex Miller. +[Torn Write Detection and Protection](https://transactional.blog/blog/2025-torn-writes). +*transactional.blog*, April 2025. +Archived at [perma.cc/G7EB-33EW](https://perma.cc/G7EB-33EW) + +[[24](/en/ch4#Mohan1992-marker)] C. Mohan and Frank Levine. +[ARIES/IM: An Efficient and High +Concurrency Index Management Method Using Write-Ahead Logging](https://ics.uci.edu/~cs223/papers/p371-mohan.pdf). At *ACM +International Conference on Management of Data* (SIGMOD), June 1992. +[doi:10.1145/130283.130338](https://doi.org/10.1145/130283.130338) + +[[25](/en/ch4#Suzuki2017_ch4-marker)] Hironobu Suzuki. +[The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017. + +[[26](/en/ch4#Chu2014-marker)] Howard Chu. +[LDAP at Lightning Speed](https://buildstuff14.sched.com/event/08a1a368e272eb599a52e08b4c3c779d). +At *Build Stuff ’14*, November 2014. +Archived at [perma.cc/GB6Z-P8YH](https://perma.cc/GB6Z-P8YH) + +[[27](/en/ch4#Athanassoulis2016-marker)] Manos Athanassoulis, Michael S. Kester, +Lukas M. 
Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, and Mark Callaghan. +[Designing Access Methods: The RUM +Conjecture](https://openproceedings.org/2016/conf/edbt/paper-12.pdf). At *19th International Conference on Extending Database Technology* (EDBT), March 2016. +[doi:10.5441/002/edbt.2016.42](https://doi.org/10.5441/002/edbt.2016.42) + +[[28](/en/ch4#Stopford2015-marker)] Ben Stopford. +[Log Structured Merge Trees](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/). +*benstopford.com*, February 2015. +Archived at [perma.cc/E5BV-KUJ6](https://perma.cc/E5BV-KUJ6) + +[[29](/en/ch4#Callaghan2016lsm-marker)] Mark Callaghan. +[The +Advantages of an LSM vs a B-Tree](https://smalldatum.blogspot.com/2016/01/summary-of-advantages-of-lsm-vs-b-tree.html). *smalldatum.blogspot.co.uk*, January 2016. +Archived at [perma.cc/3TYZ-EFUD](https://perma.cc/3TYZ-EFUD) + +[[30](/en/ch4#Balmau2019-marker)] Oana Balmau, Florin Dinu, Willy Zwaenepoel, Karan +Gupta, Ravishankar Chandhiramoorthi, and Diego Didona. +[SILK: Preventing Latency +Spikes in Log-Structured Merge Key-Value Stores](https://www.usenix.org/conference/atc19/presentation/balmau). At *USENIX Annual Technical Conference*, +July 2019. + +[[31](/en/ch4#RocksDBTuning-marker)] Igor Canadi, Siying Dong, Mark Callaghan, et al. +[RocksDB Tuning Guide](https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide). +*github.com*, 2023. +Archived at [perma.cc/UNY4-MK6C](https://perma.cc/UNY4-MK6C) + +[[32](/en/ch4#Haas2023-marker)] Gabriel Haas and Viktor Leis. +[What Modern NVMe Storage Can Do, and How +to Exploit it: High-Performance I/O for High-Performance Storage Engines](https://www.vldb.org/pvldb/vol16/p2090-haas.pdf). *Proceedings of the +VLDB Endowment*, volume 16, issue 9, pages 2090–2102. +[doi:10.14778/3598581.3598584](https://doi.org/10.14778/3598581.3598584) + +[[33](/en/ch4#Goossaert2014-marker)] Emmanuel Goossaert. 
+[Coding +for SSDs](https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/). *codecapsule.com*, February 2014. + +[[34](/en/ch4#Vanlightly2023nvme-marker)] Jack Vanlightly. +[Is +sequential IO dead in the era of the NVMe drive?](https://jack-vanlightly.com/blog/2023/5/9/is-sequential-io-dead-in-the-era-of-the-nvme-drive) *jack-vanlightly.com*, May 2023. +Archived at [perma.cc/7TMZ-TAPU](https://perma.cc/7TMZ-TAPU) + +[[35](/en/ch4#Alibaba2019_ch4-marker)] Alibaba Cloud Storage Team. +[Storage System Design Analysis: Factors Affecting +NVMe SSD Performance (2)](https://www.alibabacloud.com/blog/594376). *alibabacloud.com*, January 2019. Archived at +[archive.org](https://web.archive.org/web/20230510065132/https%3A//www.alibabacloud.com/blog/594376) + +[[36](/en/ch4#Hu2010-marker)] Xiao-Yu Hu and Robert Haas. +[The Fundamental Limit of Flash +Random Write Performance: Understanding, Analysis and Performance Modelling](https://dominoweb.draco.res.ibm.com/reports/rz3771.pdf). +*dominoweb.draco.res.ibm.com*, March 2010. +Archived at [perma.cc/8JUL-4ZDS](https://perma.cc/8JUL-4ZDS) + +[[37](/en/ch4#Lu2016-marker)] Lanyue Lu, Thanumalayan Sankaranarayana Pillai, +Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. +[WiscKey: +Separating Keys from Values in SSD-conscious Storage](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf). At *4th USENIX Conference on File and +Storage Technologies* (FAST), February 2016. + +[[38](/en/ch4#Zaitsev2006-marker)] Peter Zaitsev. +[Innodb Double Write](https://www.percona.com/blog/innodb-double-write/). +*percona.com*, August 2006. +Archived at [perma.cc/NT4S-DK7T](https://perma.cc/NT4S-DK7T) + +[[39](/en/ch4#Vondra2016-marker)] Tomas Vondra. +[On the Impact of +Full-Page Writes](https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-page-writes/). *2ndquadrant.com*, November 2016. 
+Archived at [perma.cc/7N6B-CVL3](https://perma.cc/7N6B-CVL3) + +[[40](/en/ch4#Callaghan2015-marker)] Mark Callaghan. +[Read, +write & space amplification - B-Tree vs LSM](https://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html). *smalldatum.blogspot.com*, November 2015. +Archived at [perma.cc/S487-WK5P](https://perma.cc/S487-WK5P) + +[[41](/en/ch4#Callaghan2016rocksdb-marker)] Mark Callaghan. +[Choosing Between Efficiency and +Performance with RocksDB](https://codemesh.io/codemesh2016/mark-callaghan). At *Code Mesh*, November 2016. +Video at [youtube.com/watch?v=tgzkgZVXKB4](https://www.youtube.com/watch?v=tgzkgZVXKB4) + +[[42](/en/ch4#Sarkar2023-marker)] Subhadeep Sarkar, Tarikul Islam +Papon, Dimitris Staratzis, Zichen Zhu, and Manos Athanassoulis. +[Enabling +Timely and Persistent Deletion in LSM-Engines](https://subhadeep.net/assets/fulltext/Enabling_Timely_and_Persistent_Deletion_in_LSM-Engines.pdf). *ACM Transactions on Database Systems*, +volume 48, issue 3, article no. 8, August 2023. +[doi:10.1145/3599724](https://doi.org/10.1145/3599724) + +[[43](/en/ch4#Fittl2025-marker)] Lukas Fittl. +[Postgres +vs. SQL Server: B-Tree Index Differences & the Benefit of Deduplication](https://pganalyze.com/blog/postgresql-vs-sql-server-btree-index-deduplication). +*pganalyze.com*, April 2025. +Archived at [perma.cc/XY6T-LTPX](https://perma.cc/XY6T-LTPX) + +[[44](/en/ch4#Silcock2024-marker)] Drew Silcock. +[How Postgres stores data +on disk – this one’s a page turner](https://drew.silcock.dev/blog/how-postgres-stores-data-on-disk/). *drew.silcock.dev*, August 2024. +Archived at [perma.cc/8K7K-7VJ2](https://perma.cc/8K7K-7VJ2) + +[[45](/en/ch4#Webb2008-marker)] Joe Webb. +[Using +Covering Indexes to Improve Query Performance](https://www.red-gate.com/simple-talk/databases/sql-server/learn/using-covering-indexes-to-improve-query-performance/). *simple-talk.com*, September 2008. 
+Archived at [perma.cc/6MEZ-R5VR](https://perma.cc/6MEZ-R5VR) + +[[46](/en/ch4#Stonebraker2007-marker)] Michael Stonebraker, Samuel Madden, Daniel J. +Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. +[The End of an +Architectural Era (It’s Time for a Complete Rewrite)](https://vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf). At *33rd International Conference on +Very Large Data Bases* (VLDB), September 2007. + +[[47](/en/ch4#VoltDB2014uj-marker)] [VoltDB +Technical Overview White Paper](https://www.voltactivedata.com/wp-content/uploads/2017/03/hv-white-paper-voltdb-technical-overview.pdf). VoltDB, 2017. +Archived at [perma.cc/B9SF-SK5G](https://perma.cc/B9SF-SK5G) + +[[48](/en/ch4#Rumble2014-marker)] Stephen M. Rumble, Ankita Kejriwal, and John K. Ousterhout. +[Log-Structured +Memory for DRAM-Based Storage](https://www.usenix.org/system/files/conference/fast14/fast14-paper_rumble.pdf). At *12th USENIX Conference on File and Storage +Technologies* (FAST), February 2014. + +[[49](/en/ch4#Harizopoulos2008-marker)] Stavros Harizopoulos, Daniel J. Abadi, +Samuel Madden, and Michael Stonebraker. +[OLTP Through the Looking Glass, +and What We Found There](https://hstore.cs.brown.edu/papers/hstore-lookingglass.pdf). At *ACM International Conference on Management of Data* +(SIGMOD), June 2008. +[doi:10.1145/1376616.1376713](https://doi.org/10.1145/1376616.1376713) + +[[50](/en/ch4#Larson2013-marker)] Per-Åke Larson, Cipri Clinciu, Campbell Fraser, +Eric N. Hanson, Mostafa Mokhtar, Michal Nowakiewicz, Vassilis Papadimos, Susan L. Price, Srikumar +Rangarajan, Remus Rusanu, and Mayukh Saubhasik. +[Enhancements +to SQL Server Column Stores](https://web.archive.org/web/20131203001153id_/http%3A//research.microsoft.com/pubs/193599/Apollo3%20-%20Sigmod%202013%20-%20final.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2013. 
+[doi:10.1145/2463676.2463708](https://doi.org/10.1145/2463676.2463708) + +[[51](/en/ch4#Farber2012-marker)] Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, +Ingo Müller, Hannes Rauhe, and Jonathan Dees. +[The +SAP HANA Database – An Architecture Overview](https://web.archive.org/web/20220208081111id_/http%3A//sites.computer.org/debull/A12mar/hana.pdf). +*IEEE Data Engineering Bulletin*, volume 35, issue 1, pages 28–33, March 2012. + +[[52](/en/ch4#Stonebraker2013-marker)] Michael Stonebraker. +[The Traditional RDBMS Wisdom Is (Almost Certainly) All +Wrong](https://slideshot.epfl.ch/talks/166). Presentation at *EPFL*, May 2013. + +[[53](/en/ch4#Prout2022_ch4-marker)] Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu +Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. +[Cloud-Native Transactions and Analytics +in SingleStore](https://dl.acm.org/doi/pdf/10.1145/3514221.3526055). At *ACM International Conference on Management of Data* (SIGMOD), June 2022. +[doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055) + +[[54](/en/ch4#Tereshko2016-marker)] Tino Tereshko and Jordan Tigani. +[BigQuery under the +hood](https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood). *cloud.google.com*, January 2016. +Archived at [perma.cc/WP2Y-FUCF](https://perma.cc/WP2Y-FUCF) + +[[55](/en/ch4#McKinney2023-marker)] Wes McKinney. +[The Road to Composable Data Systems: +Thoughts on the Last 15 Years and the Future](https://wesmckinney.com/blog/looking-back-15-years/). *wesmckinney.com*, September 2023. +Archived at [perma.cc/6L2M-GTJX](https://perma.cc/6L2M-GTJX) + +[[56](/en/ch4#Stonebraker2005-marker)] Michael Stonebraker, Daniel +J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam +Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. 
+[C-Store: +A Column-oriented DBMS](https://www.vldb.org/archives/website/2005/program/paper/thu/p553-stonebraker.pdf). At *31st International Conference on Very Large Data Bases* +(VLDB), pages 553–564, September 2005. + +[[57](/en/ch4#LeDem2013-marker)] Julien Le Dem. +[Dremel +Made Simple with Parquet](https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html). *blog.twitter.com*, September 2013. + +[[58](/en/ch4#Melnik2010-marker)] Sergey Melnik, Andrey Gubarev, Jing Jing Long, +Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. +[Dremel: Interactive Analysis of Web-Scale +Datasets](https://vldb.org/pvldb/vol3/R29.pdf). At *36th International Conference on Very Large Data Bases* (VLDB), pages +330–339, September 2010. +[doi:10.14778/1920841.1920886](https://doi.org/10.14778/1920841.1920886) + +[[59](/en/ch4#Kearney2016-marker)] Joe Kearney. +[Understanding Record +Shredding: storing nested data in columns](https://www.joekearney.co.uk/posts/understanding-record-shredding). *joekearney.co.uk*, December 2016. +Archived at [perma.cc/ZD5N-AX5D](https://perma.cc/ZD5N-AX5D) + +[[60](/en/ch4#Brandon2023-marker)] Jamie Brandon. +[A +shallow survey of OLAP and HTAP query engines](https://www.scattered-thoughts.net/writing/a-shallow-survey-of-olap-and-htap-query-engines). *scattered-thoughts.net*, September 2023. +Archived at [perma.cc/L3KH-J4JF](https://perma.cc/L3KH-J4JF) + +[[61](/en/ch4#Dageville2016-marker)] Benoit Dageville, Thierry Cruanes, Marcin +Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin +Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter +Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. +[The Snowflake Elastic Data Warehouse](https://dl.acm.org/doi/pdf/10.1145/2882903.2903741). +At *ACM International Conference on Management of Data* (SIGMOD), pages 215–226, June 2016. 
+[doi:10.1145/2882903.2903741](https://doi.org/10.1145/2882903.2903741) + +[[62](/en/ch4#Raasveldt2020-marker)] Mark Raasveldt and Hannes Mühleisen. +[Data Management for Data +Science Towards Embedded Analytics](https://duckdb.org/pdf/CIDR2020-raasveldt-muehleisen-duckdb.pdf). At *10th Conference on Innovative Data Systems +Research* (CIDR), January 2020. + +[[63](/en/ch4#Im2018-marker)] Jean-François Im, Kishore Gopalakrishna, Subbu +Subramaniam, Mayank Shrivastava, Adwait Tumbde, Xiaotian Jiang, Jennifer Dai, Seunghyun Lee, Neha +Pawar, Jialiang Li, and Ravi Aringunram. +[Pinot: +Realtime OLAP for 530 Million Users](https://cwiki.apache.org/confluence/download/attachments/103092375/Pinot.pdf). At *ACM International Conference on Management of +Data* (SIGMOD), pages 583–594, May 2018. +[doi:10.1145/3183713.3190661](https://doi.org/10.1145/3183713.3190661) + +[[64](/en/ch4#Yang2014-marker)] Fangjin Yang, Eric Tschetter, Xavier +Léauté, Nelson Ray, Gian Merlino, and Deep Ganguli. +[Druid: A Real-time Analytical Data Store](https://static.druid.io/docs/druid.pdf). +At *ACM International Conference on Management of Data* (SIGMOD), June 2014. +[doi:10.1145/2588555.2595631](https://doi.org/10.1145/2588555.2595631) + +[[65](/en/ch4#Liu2023-marker)] Chunwei Liu, Anna Pavlenko, Matteo Interlandi, and Brandon Haynes. +[Deep Dive into Common Open Formats for Analytical DBMSs](https://www.vldb.org/pvldb/vol16/p3044-liu.pdf). +*Proceedings of the VLDB Endowment*, volume 16, issue 11, pages 3044–3056, July 2023. +[doi:10.14778/3611479.3611507](https://doi.org/10.14778/3611479.3611507) + +[[66](/en/ch4#Zeng2023-marker)] Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes +McKinney, and Huanchen Zhang. [An Empirical +Evaluation of Columnar Storage Formats](https://www.vldb.org/pvldb/vol17/p148-zeng.pdf). *Proceedings of the VLDB Endowment*, volume 17, +issue 2, pages 148–161. 
+[doi:10.14778/3626292.3626298](https://doi.org/10.14778/3626292.3626298) + +[[67](/en/ch4#Pace2024-marker)] Weston Pace. +[Lance v2: A columnar container format for modern data](https://blog.lancedb.com/lance-v2/). +*blog.lancedb.com*, April 2024. +Archived at [perma.cc/ZK3Q-S9VJ](https://perma.cc/ZK3Q-S9VJ) + +[[68](/en/ch4#Helfman2024-marker)] Yoav Helfman. +[Nimble, A New Columnar File Format](https://www.youtube.com/watch?v=bISBNVtXZ6M). +At *VeloxCon*, April 2024. + +[[69](/en/ch4#McKinney2021-marker)] Wes McKinney. +[Apache Arrow: High-Performance Columnar Data +Framework](https://www.youtube.com/watch?v=YhF8YR0OEFk). At *CMU Database Group – Vaccination Database Tech Talks*, December 2021. + +[[70](/en/ch4#McKinney2022-marker)] Wes McKinney. +[Python for Data +Analysis, 3rd Edition](https://learning.oreilly.com/library/view/python-for-data/9781098104023/). O’Reilly Media, August 2022. ISBN: 9781098104023 + +[[71](/en/ch4#Dix2021-marker)] Paul Dix. +[The Design of InfluxDB IOx: An In-Memory +Columnar Database Written in Rust with Apache Arrow](https://www.youtube.com/watch?v=_zbwz-4RDXg). At *CMU Database Group – Vaccination +Database Tech Talks*, May 2021. + +[[72](/en/ch4#Soto2024-marker)] Carlota Soto and Mike Freedman. +[Building +Columnar Compression for Large PostgreSQL Databases](https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database/). *timescale.com*, March 2024. +Archived at [perma.cc/7KTF-V3EH](https://perma.cc/7KTF-V3EH) + +[[73](/en/ch4#Lemire2016-marker)] Daniel Lemire, Gregory Ssi‐Yan‐Kai, and Owen Kaser. +[Consistently faster and smaller compressed bitmaps with Roaring](https://arxiv.org/pdf/1603.06549). +*Software: Practice and Experience*, volume 46, issue 11, pages 1547–1569, November 2016. +[doi:10.1002/spe.2402](https://doi.org/10.1002/spe.2402) + +[[74](/en/ch4#Volpert2024-marker)] Jaz Volpert. +[An entire Social Network in 1.6GB (GraphD +Part 2)](https://jazco.dev/2024/04/20/roaring-bitmaps/). 
*jazco.dev*, April 2024. +Archived at [perma.cc/L27Z-QVMG](https://perma.cc/L27Z-QVMG) + +[[75](/en/ch4#Abadi2013-marker)] Daniel J. Abadi, Peter Boncz, Stavros +Harizopoulos, Stratos Idreos, and Samuel Madden. +[The Design and +Implementation of Modern Column-Oriented Database Systems](https://www.cs.umd.edu/~abadi/papers/abadi-column-stores.pdf). *Foundations and Trends in +Databases*, volume 5, issue 3, pages 197–280, December 2013. +[doi:10.1561/1900000024](https://doi.org/10.1561/1900000024) + +[[76](/en/ch4#Lamb2012-marker)] Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, +Nga Tran, Ben Vandiver, Lyric Doshi, and Chuck Bear. +[The Vertica Analytic Database: C-Store 7 Years Later](https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf). +*Proceedings of the VLDB Endowment*, volume 5, issue 12, pages 1790–1801, August 2012. +[doi:10.14778/2367502.2367518](https://doi.org/10.14778/2367502.2367518) + +[[77](/en/ch4#Kersten2018-marker)] Timo Kersten, Viktor Leis, Alfons Kemper, Thomas +Neumann, Andrew Pavlo, and Peter Boncz. +[Everything You Always Wanted to Know +About Compiled and Vectorized Queries But Were Afraid to Ask](https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf). *Proceedings of the VLDB +Endowment*, volume 11, issue 13, pages 2209–2222, September 2018. +[doi:10.14778/3275366.3284966](https://doi.org/10.14778/3275366.3284966) + +[[78](/en/ch4#Smith2020-marker)] Forrest Smith. +[Memory Bandwidth Napkin +Math](https://www.forrestthewoods.com/blog/memory-bandwidth-napkin-math/). *forrestthewoods.com*, February 2020. +Archived at [perma.cc/Y8U4-PS7N](https://perma.cc/Y8U4-PS7N) + +[[79](/en/ch4#Boncz2005-marker)] Peter Boncz, Marcin Zukowski, and Niels Nes. +[MonetDB/X100: Hyper-Pipelining Query Execution](https://www.cidrdb.org/cidr2005/papers/P19.pdf). +At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005. + +[[80](/en/ch4#Zhou2002-marker)] Jingren Zhou and Kenneth A. Ross. 
+[Implementing Database Operations Using SIMD Instructions](https://www1.cs.columbia.edu/~kar/pubsk/simd.pdf). +At *ACM International Conference on Management of Data* (SIGMOD), pages 145–156, June 2002. +[doi:10.1145/564691.564709](https://doi.org/10.1145/564691.564709) + +[[81](/en/ch4#Bartley2024-marker)] Kevin Bartley. +[OLTP Queries: Transfer Expensive Workloads to +Materialize](https://materialize.com/blog/oltp-queries/). *materialize.com*, August 2024. +Archived at [perma.cc/4TYM-TYD8](https://perma.cc/4TYM-TYD8) + +[[82](/en/ch4#Gray2007-marker)] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew +Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. +[Data Cube: A Relational Aggregation Operator +Generalizing Group-By, Cross-Tab, and Sub-Totals](https://arxiv.org/pdf/cs/0701155). *Data Mining and Knowledge +Discovery*, volume 1, issue 1, pages 29–53, March 2007. +[doi:10.1023/A:1009726021843](https://doi.org/10.1023/A%3A1009726021843) + +[[83](/en/ch4#Ramsak2000-marker)] Frank Ramsak, Volker Markl, Robert Fenk, Martin +Zirkel, Klaus Elhardt, and Rudolf Bayer. +[Integrating the UB-Tree into a Database System Kernel](https://www.vldb.org/conf/2000/P263.pdf). +At *26th International Conference on Very Large Data Bases* (VLDB), September 2000. + +[[84](/en/ch4#Procopiuc2003-marker)] Octavian Procopiuc, Pankaj K. Agarwal, Lars +Arge, and Jeffrey Scott Vitter. +[Bkd-Tree: A Dynamic +Scalable kd-Tree](https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf). At *8th International Symposium on Spatial and Temporal Databases* +(SSTD), pages 46–65, July 2003. +[doi:10.1007/978-3-540-45072-6\_4](https://doi.org/10.1007/978-3-540-45072-6_4) + +[[85](/en/ch4#Hellerstein1995-marker)] Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer. +[Generalized Search Trees for Database Systems](https://dsf.berkeley.edu/papers/vldb95-gist.pdf). +At *21st International Conference on Very Large Data Bases* (VLDB), September 1995. 
+ +[[86](/en/ch4#Brodsky2018-marker)] Isaac Brodsky. +[H3: Uber’s Hexagonal Hierarchical Spatial Index](https://eng.uber.com/h3/). +*eng.uber.com*, June 2018. +Archived at [archive.org](https://web.archive.org/web/20240722003854/https%3A//www.uber.com/blog/h3/) + +[[87](/en/ch4#Escriva2012-marker)] Robert Escriva, Bernard Wong, and Emin Gün Sirer. +[HyperDex: +A Distributed, Searchable Key-Value Store](https://www.cs.princeton.edu/courses/archive/fall13/cos518/papers/hyperdex.pdf). At *ACM SIGCOMM Conference*, August 2012. +[doi:10.1145/2377677.2377681](https://doi.org/10.1145/2377677.2377681) + +[[88](/en/ch4#Manning2008_ch4-marker)] Christopher D. Manning, Prabhakar Raghavan, +and Hinrich Schütze. +[*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/). +Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at +[nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/) + +[[89](/en/ch4#Wang2017-marker)] Jianguo Wang, Chunbin Lin, Yannis Papakonstantinou, +and Steven Swanson. +[An Experimental +Study of Bitmap Compression vs. Inverted List Compression](https://cseweb.ucsd.edu/~swanson/papers/SIGMOD2017-ListCompression.pdf). At *ACM International Conference +on Management of Data* (SIGMOD), pages 993–1008, May 2017. +[doi:10.1145/3035918.3064007](https://doi.org/10.1145/3035918.3064007) + +[[90](/en/ch4#Grand2013-marker)] Adrien Grand. +[What is in a Lucene +Index?](https://speakerdeck.com/elasticsearch/what-is-in-a-lucene-index) At *Lucene/Solr Revolution*, November 2013. +Archived at [perma.cc/Z7QN-GBYY](https://perma.cc/Z7QN-GBYY) + +[[91](/en/ch4#McCandless2011merges-marker)] Michael McCandless. +[Visualizing +Lucene’s Segment Merges](https://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html). *blog.mikemccandless.com*, February 2011. +Archived at [perma.cc/3ZV8-72W6](https://perma.cc/3ZV8-72W6) + +[[92](/en/ch4#Fittl2021-marker)] Lukas Fittl. 
+[Understanding Postgres GIN Indexes: The Good and the +Bad](https://pganalyze.com/blog/gin-index). *pganalyze.com*, December 2021. +Archived at [perma.cc/V3MW-26H6](https://perma.cc/V3MW-26H6) + +[[93](/en/ch4#Angelakos2020-marker)] Jimmy Angelakos. +[The State of (Full) Text Search in PostgreSQL +12](https://www.youtube.com/watch?v=c8IrUHV70KQ). At *FOSDEM*, February 2020. +Archived at [perma.cc/J6US-3WZS](https://perma.cc/J6US-3WZS) + +[[94](/en/ch4#Korotkov2012-marker)] Alexander Korotkov. +[Index +support for regular expression search](https://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf). At *PGConf.EU Prague*, October 2012. +Archived at [perma.cc/5RFZ-ZKDQ](https://perma.cc/5RFZ-ZKDQ) + +[[95](/en/ch4#McCandless2011fuzzy-marker)] Michael McCandless. +[Lucene’s +FuzzyQuery Is 100 Times Faster in 4.0](https://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html). *blog.mikemccandless.com*, March 2011. +Archived at [perma.cc/E2WC-GHTW](https://perma.cc/E2WC-GHTW) + +[[96](/en/ch4#Heinz2002-marker)] Steffen Heinz, Justin Zobel, and Hugh E. Williams. +[Burst +Tries: A Fast, Efficient Data Structure for String Keys](https://web.archive.org/web/20130903070248id_/http%3A//ww2.cs.mu.oz.au%3A80/~jz/fulltext/acmtois02.pdf). +*ACM Transactions on Information Systems*, volume 20, issue 2, pages 192–223, April 2002. +[doi:10.1145/506309.506312](https://doi.org/10.1145/506309.506312) + +[[97](/en/ch4#Schulz2002-marker)] Klaus U. Schulz and Stoyan Mihov. +[Fast String +Correction with Levenshtein Automata](https://dmice.ohsu.edu/bedricks/courses/cs655/pdf/readings/2002_Schulz.pdf). *International Journal on Document Analysis and +Recognition*, volume 5, issue 1, pages 67–85, November 2002. +[doi:10.1007/s10032-002-0082-8](https://doi.org/10.1007/s10032-002-0082-8) + +[[98](/en/ch4#Mikolov2013-marker)] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 
+[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781). +At *International Conference on Learning Representations* (ICLR), May 2013. +[doi:10.48550/arXiv.1301.3781](https://doi.org/10.48550/arXiv.1301.3781) + +[[99](/en/ch4#Devlin2018-marker)] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. +[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805). +At *Conference of the North American Chapter of the Association for Computational +Linguistics: Human Language Technologies*, volume 1, pages 4171–4186, June 2019. +[doi:10.18653/v1/N19-1423](https://doi.org/10.18653/v1/N19-1423) + +[[100](/en/ch4#Radford2018-marker)] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. +[Improving +Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf). *openai.com*, June 2018. +Archived at [perma.cc/5N3C-DJ4C](https://perma.cc/5N3C-DJ4C) + +[[101](/en/ch4#Faiis2023-marker)] Matthijs Douze, Maria Lomeli, and Lucas Hosseini. +[Faiss indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). +*github.com*, August 2024. +Archived at [perma.cc/2EWG-FPBS](https://perma.cc/2EWG-FPBS) + +[[102](/en/ch4#Matevosyan2024-marker)] Varik Matevosyan. +[Understanding pgvector’s HNSW Index Storage in Postgres](https://lantern.dev/blog/pgvector-storage). +*lantern.dev*, August 2024. +Archived at [perma.cc/B2YB-JB59](https://perma.cc/B2YB-JB59) + +[[103](/en/ch4#Baranchuk2018-marker)] Dmitry Baranchuk, Artem Babenko, and Yury Malkov. +[Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors](https://arxiv.org/pdf/1802.02422). +At *European Conference on Computer Vision* (ECCV), pages 202–216, September 2018. +[doi:10.1007/978-3-030-01258-8\_13](https://doi.org/10.1007/978-3-030-01258-8_13) + +[[104](/en/ch4#Malkov2020-marker)] Yury A. 
Malkov and Dmitry A. Yashunin. +[Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/pdf/1603.09320). +*IEEE Transactions on Pattern Analysis and Machine Intelligence*, volume 42, issue 4, pages 824–836, April 2020. +[doi:10.1109/TPAMI.2018.2889473](https://doi.org/10.1109/TPAMI.2018.2889473) diff --git a/content/en/ch5.md b/content/en/ch5.md index b423ff1..1c137c5 100644 --- a/content/en/ch5.md +++ b/content/en/ch5.md @@ -1,162 +1,1481 @@ --- -title: "5. Replication" -linkTitle: "5. Replication" -weight: 205 +title: "5. Encoding and Evolution" +weight: 105 breadcrumbs: false --- -![](/img/ch5.png) -> *The major difference between a thing that might go wrong and a thing that cannot possibly go wrong is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible to get at or repair.* +> *Everything changes and nothing stands still.* > -> ​ — Douglas Adams, *Mostly Harmless* (1992) +> Heraclitus of Ephesus, as quoted by Plato in *Cratylus* (360 BCE) ------- +Applications inevitably change over time. Features are added or modified as new products are +launched, user requirements become better understood, or business circumstances change. In +[Chapter 2](/en/ch2#ch_nonfunctional) we introduced the idea of *evolvability*: we should aim to build systems that +make it easy to adapt to change (see [“Evolvability: Making Change Easy”](/en/ch2#sec_introduction_evolvability)). -In [Part I](/en/part-i) of this book, we discussed aspects of data systems that apply when data is stored on a single machine. Now, in [Part II](/en/part-ii), we move up a level and ask: what happens if multiple machines are involved in storage and retrieval of data? +In most cases, a change to an application’s features also requires a change to data that it stores: +perhaps a new field or record type needs to be captured, or perhaps existing data needs to be +presented in a new way. 
-There are various reasons why you might want to distribute a database across multi‐ ple machines: +The data models we discussed in [Chapter 3](/en/ch3#ch_datamodels) have different ways of coping with such change. +Relational databases generally assume that all data in the database conforms to one schema: although +that schema can be changed (through schema migrations; i.e., `ALTER` statements), there is exactly +one schema in force at any one point in time. By contrast, schema-on-read (“schemaless”) databases +don’t enforce a schema, so the database can contain a mixture of older and newer data formats +written at different times (see [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)). -***Scalability*** +When a data format or schema changes, a corresponding change to application code often needs to +happen (for example, you add a new field to a record, and the application code starts reading +and writing that field). However, in a large application, code changes often cannot happen +instantaneously: -If your data volume, read load, or write load grows bigger than a single machine can handle, you can potentially spread the load across multiple machines. +* With server-side applications you may want to perform a *rolling upgrade* + (also known as a *staged rollout*), deploying the new version to a few nodes at a time, checking + whether the new version is running smoothly, and gradually working your way through all the nodes. + This allows new versions to be deployed without service downtime, and thus encourages more + frequent releases and better evolvability. +* With client-side applications you’re at the mercy of the user, who may not install the update for + some time. -***Fault tolerance/high availability*** +This means that old and new versions of the code, and old and new data formats, may potentially all +coexist in the system at the same time. 
In order for the system to continue running smoothly, we +need to maintain compatibility in both directions: -If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multi‐ ple machines to give you redundancy. When one fails, another one can take over. +Backward compatibility +: Newer code can read data that was written by older code. -***Latency*** +Forward compatibility +: Older code can read data that was written by newer code. -If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geo‐ graphically close to them. That avoids the users having to wait for network pack‐ ets to travel halfway around the world. +Backward compatibility is normally not hard to achieve: as author of the newer code, you know the +format of data written by older code, and so you can explicitly handle it (if necessary by simply +keeping the old code to read the old data). Forward compatibility can be trickier, because it +requires older code to ignore additions made by a newer version of the code. +Another challenge with forward compatibility is illustrated in [Figure 5-1](/en/ch5#fig_encoding_preserve_field). +Say you add a field to a record schema, and the newer code creates a record containing that new +field and stores it in a database. Subsequently, an older version of the code (which doesn’t yet +know about the new field) reads the record, updates it, and writes it back. In this situation, the +desirable behavior is usually for the old code to keep the new field intact, even though it couldn’t +be interpreted. But if the record is decoded into a model object that does not explicitly +preserve unknown fields, data can be lost, like in [Figure 5-1](/en/ch5#fig_encoding_preserve_field). +![ddia 0501](/fig/ddia_0501.png) -## …… +###### Figure 5-1. 
When an older version of the application updates data previously written by a newer version of the application, data may be lost if you’re not careful. +In this chapter we will look at several formats for encoding data, including JSON, XML, Protocol +Buffers, and Avro. In particular, we will look at how they handle schema changes and how they +support systems where old and new data and code need to coexist. We will then discuss how those +formats are used for data storage and for communication: in databases, web services, REST APIs, +remote procedure calls (RPC), workflow engines, and event-driven systems such as actors and +message queues. +# Formats for Encoding Data -## Summary +Programs usually work with data in (at least) two different representations: -In this chapter we looked at the issue of replication. Replication can serve several purposes: +1. In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These + data structures are optimized for efficient access and manipulation by the CPU (typically using + pointers). +2. When you want to write data to a file or send it over the network, you have to encode it as some + kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn’t + make sense to any other process, this sequence-of-bytes representation often looks quite + different from the data structures that are normally used in memory. -***High availability*** +Thus, we need some kind of translation between the two representations. The translation from the +in-memory representation to a byte sequence is called *encoding* (also known as *serialization* or +*marshalling*), and the reverse is called *decoding* (*parsing*, *deserialization*, +*unmarshalling*). 
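As a minimal sketch of the round trip described above, assuming Python's standard `json` module as the encoder (the `record`, `encoded`, and `decoded` names are illustrative and not part of the original text):

```python
import json

# In-memory representation: a dict holding a string, an integer, and a list.
record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# Encoding: in-memory object -> self-contained sequence of bytes.
# separators=(",", ":") drops the optional whitespace between tokens.
encoded = json.dumps(record, separators=(",", ":")).encode("utf-8")

# Decoding: sequence of bytes -> a new, independent in-memory object.
decoded = json.loads(encoded.decode("utf-8"))

print(type(encoded).__name__, len(encoded))
print(decoded == record, decoded is record)
```

The decoded value compares equal to the original but is a separate object: the byte sequence carries no pointers, only the data needed to rebuild the structure.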
-Keeping the system running, even when one machine (or several machines, or an entire datacenter) goes down +# Terminology clash -***Disconnected operation*** +*Serialization* is unfortunately also used in the context of transactions (see [Chapter 8](/en/ch8#ch_transactions)), +with a completely different meaning. To avoid overloading the word we’ll stick with *encoding* in +this book, even though *serialization* is perhaps a more common term. -Allowing an application to continue working when there is a network interrup‐ tion +There are exceptions in which encoding/decoding is not needed—for example, when a database operates +directly on compressed data loaded from disk, as discussed in [“Query Execution: Compilation and Vectorization”](/en/ch4#sec_storage_vectorized). There are +also *zero-copy* data formats that are designed to be used both at runtime and on disk/on the +network, without an explicit conversion step, such as Cap’n Proto and FlatBuffers. -***Latency*** +However, most systems need to convert between in-memory objects and flat byte sequences. As this is +such a common problem, there are a myriad different libraries and encoding formats to choose from. +Let’s do a brief overview. -Placing data geographically close to users, so that users can interact with it faster +## Language-Specific Formats -***Scalability*** +Many programming languages come with built-in support for encoding in-memory objects into byte +sequences. For example, Java has `java.io.Serializable`, Python has `pickle`, Ruby has `Marshal`, +and so on. Many third-party libraries also exist, such as Kryo for Java. -Being able to handle a higher volume of reads than a single machine could han‐ dle, by performing reads on replicas +These encoding libraries are very convenient, because they allow in-memory objects to be saved and +restored with minimal additional code. 
However, they also have a number of deep problems: +* The encoding is often tied to a particular programming language, and reading the data in another + language is very difficult. If you store or transmit data in such an encoding, you are committing + yourself to your current programming language for potentially a very long time, and precluding + integrating your systems with those of other organizations (which may use different languages). +* In order to restore data in the same object types, the decoding process needs to be able to + instantiate arbitrary classes. This is frequently a source of security problems + [[1](/en/ch5#CWE502)]: + if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate + arbitrary classes, which in turn often allows them to do terrible things such as remotely + executing arbitrary code [[2](/en/ch5#Breen2015), + [3](/en/ch5#McKenzie2013)]. +* Versioning data is often an afterthought in these libraries: as they are intended for quick and + easy encoding of data, they often neglect the inconvenient problems of forward and backward + compatibility [[4](/en/ch5#Goetz2019)]. +* Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also + often an afterthought. For example, Java’s built-in serialization is notorious for its bad + performance and bloated encoding [[5](/en/ch5#JvmSerializers)]. +For these reasons it’s generally a bad idea to use your language’s built-in encoding for anything +other than very transient purposes. -Despite being a simple goal—keeping a copy of the same data on several machines— replication turns out to be a remarkably tricky problem. It requires carefully thinking about concurrency and about all the things that can go wrong, and dealing with the consequences of those faults. 
At a minimum, we need to deal with unavailable nodes and network interruptions (and that’s not even considering the more insidious kinds of fault, such as silent data corruption due to software bugs). +## JSON, XML, and Binary Variants -We discussed three main approaches to replication: +When moving to standardized encodings that can be written and read by many programming languages, JSON +and XML are the obvious contenders. They are widely known, widely supported, and almost as widely +disliked. XML is often criticized for being too verbose and unnecessarily complicated +[[6](/en/ch5#XMLSExp)]. +JSON’s popularity is mainly due to its built-in support in web browsers and simplicity relative to +XML. CSV is another popular language-independent format, but it only supports tabular data without +nesting. -***Single-leader replication*** +JSON, XML, and CSV are textual formats, and thus somewhat human-readable (although the syntax is a +popular topic of debate). Besides the superficial syntactic issues, they also have some subtle +problems: -Clients send all writes to a single node (the leader), which sends a stream of data change events to the other replicas (followers). Reads can be performed on any replica, but reads from followers might be stale. +* There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish + between a number and a string that happens to consist of digits (except by referring to an external + schema). JSON distinguishes strings and numbers, but it doesn’t distinguish integers and + floating-point numbers, and it doesn’t specify a precision. -***Multi-leader replication*** + This is a problem when dealing with large numbers; for example, integers greater than 2<sup>53</sup> cannot + be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become + inaccurate when parsed in a language that uses floating-point numbers, such as JavaScript + [[7](/en/ch5#Evans2023)].
+ An example of numbers larger than 2<sup>53</sup> occurs on X (formerly Twitter), which uses a 64-bit number to + identify each post. The JSON returned by the API includes post IDs twice, once as a JSON number and + once as a decimal string, to work around the fact that the numbers are not correctly parsed by + JavaScript applications [[8](/en/ch5#Harris2010)]. +* JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they + don’t support binary strings (sequences of bytes without a character encoding). Binary strings are a + useful feature, so people get around this limitation by encoding the binary data as text using + Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded. + This works, but it’s somewhat hacky and increases the data size by 33%. +* XML Schema and JSON Schema are powerful, and thus quite + complicated to learn and implement. Since the correct interpretation of data (such as numbers and + binary strings) depends on information in the schema, applications that don’t use XML/JSON schemas + need to potentially hard-code the appropriate encoding/decoding logic instead. +* CSV does not have any schema, so it is up to the application to define the meaning of each row and + column. If an application change adds a new row or column, you have to handle that change manually. + CSV is also a quite vague format (what happens if a value contains a comma or a newline character?). + Although its escaping rules have been formally specified + [[9](/en/ch5#Shafranovich2005)], + not all parsers implement them correctly. +Despite these flaws, JSON, XML, and CSV are good enough for many purposes.
It’s likely that they will +remain popular, especially as data interchange formats (i.e., for sending data from one organization to +another). In these situations, as long as people agree on what the format is, it often doesn’t +matter how pretty or efficient the format is. The difficulty of getting different organizations to +agree on *anything* outweighs most other concerns. -***Leaderless replication*** +### JSON Schema -Clients send each write to several nodes, and read from several nodes in parallel in order to detect and correct nodes with stale data. +JSON Schema has become widely adopted as a way to model data whenever it’s exchanged between systems +or written to storage. You’ll find JSON schemas in web services (see [“Web services”](/en/ch5#sec_web_services)) as part +of the OpenAPI web service specification, schema registries such as Confluent’s Schema Registry and +Red Hat’s Apicurio Registry, and in databases such as PostgreSQL’s pg\_jsonschema validator extension +and MongoDB’s `$jsonSchema` validator syntax. -Each approach has advantages and disadvantages. Single-leader replication is popular because it is fairly easy to understand and there is no conflict resolution to worry about. Multi-leader and leaderless replication can be more robust in the presence of faulty nodes, network interruptions, and latency spikes—at the cost of being harder to reason about and providing only very weak consistency guarantees. +The JSON Schema specification offers a number of features. Schemas include standard primitive types +including strings, numbers, integers, objects, arrays, booleans, or nulls. But JSON Schema also +offers a separate validation specification that allows developers to overlay constraints on fields. +For example, a `port` field might have a minimum of 1 and a maximum of 65535. -Replication can be synchronous or asynchronous, which has a profound effect on the system behavior when there is a fault. 
Although asynchronous replication can be fast when the system is running smoothly, it’s important to figure out what happens when replication lag increases and servers fail. If a leader fails and you promote an asynchronously updated follower to be the new leader, recently committed data may be lost. +JSON Schemas can have either open or closed content models. An open content model permits any field +not defined in the schema to exist with any data type, whereas a closed content model only allows +fields that are explicitly defined. The open content model in JSON Schema is enabled when +`additionalProperties` is set to `true`, which is the default. Thus, JSON Schemas are usually a +definition of what *isn’t* permitted (namely, invalid values on any of the defined fields), rather +than what *is* permitted in a schema. -We looked at some strange effects that can be caused by replication lag, and we dis‐ cussed a few consistency models which are helpful for deciding how an application should behave under replication lag: +Open content models are powerful, but can be complex. For example, say you want to define a map from +integers (such as IDs) to strings. JSON does not have a map or dictionary type, only an “object” +type that can contain string keys, and values of any type. You can then constrain this type with +JSON Schema so that keys may only contain digits, and values can only be strings, using +`patternProperties` and `additionalProperties` as shown in [Example 5-1](/en/ch5#fig_encoding_json_schema). -***Read-after-write consistency*** +##### Example 5-1. Example JSON Schema with integer keys and string values. Integer keys are represented as strings containing only integers since JSON Schema requires all keys to be strings. -Users should always see data that they submitted themselves. 
+``` +{ + "$schema": "http://json-schema.org/draft-07/schema#", + "type": "object", + "patternProperties": { + "^[0-9]+$": { + "type": "string" + } + }, + "additionalProperties": false +} +``` -***Monotonic reads*** +In addition to open and closed content models and validators, JSON Schema supports conditional +if/else schema logic, named types, references to remote schemas, and much more. All of this makes +for a very powerful schema language. Such features also make for unwieldy definitions. It can be +challenging to resolve remote schemas, reason about conditional rules, or evolve schemas in a +forwards or backwards compatible way [[10](/en/ch5#Coates2024)]. +Similar concerns apply to XML Schema +[[11](/en/ch5#Geneves2008)]. -After users have seen the data at one point in time, they shouldn’t later see the data from some earlier point in time. +### Binary encoding -***Consistent prefix reads*** +JSON is less verbose than XML, but both still use a lot of space compared to binary formats. This +observation led to the development of a profusion of binary encodings for JSON (MessagePack, CBOR, +BSON, BJSON, UBJSON, BISON, Hessian, and Smile, to name a few) and for XML (WBXML and Fast Infoset, +for example). These formats have been adopted in various niches, as they are more compact and +sometimes faster to parse, but none of them are as widely adopted as the textual versions of JSON +and XML [[12](/en/ch5#Bray2019)]. -Users should see the data in a state that makes causal sense: for example, seeing a question and its reply in the correct order. +Some of these formats extend the set of datatypes (e.g., distinguishing integers and floating-point numbers, +or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. In +particular, since they don’t prescribe a schema, they need to include all the object field names within +the encoded data. 
That is, in a binary encoding of the JSON document in [Example 5-2](/en/ch5#fig_encoding_json), they +will need to include the strings `userName`, `favoriteNumber`, and `interests` somewhere. -Finally, we discussed the concurrency issues that are inherent in multi-leader and leaderless replication approaches: because they allow multiple writes to happen con‐ currently, conflicts may occur. We examined an algorithm that a database might use to determine whether one operation happened before another, or whether they hap‐ pened concurrently. We also touched on methods for resolving conflicts by merging together concurrent updates. +##### Example 5-2. Example record which we will encode in several binary formats in this chapter -In the next chapter we will continue looking at data that is distributed across multiple machines, through the counterpart of replication: splitting a large dataset into *partitions*. +``` +{ + "userName": "Martin", + "favoriteNumber": 1337, + "interests": ["daydreaming", "hacking"] +} +``` +Let’s look at an example of MessagePack, a binary encoding for JSON. [Figure 5-2](/en/ch5#fig_encoding_messagepack) +shows the byte sequence that you get if you encode the JSON document in [Example 5-2](/en/ch5#fig_encoding_json) with +MessagePack. The first few bytes are as follows: +1. The first byte, `0x83`, indicates that what follows is an object (top four bits = `0x80`) with three + fields (bottom four bits = `0x03`). (In case you’re wondering what happens if an object has more + than 15 fields, so that the number of fields doesn’t fit in four bits, it then gets a different type + indicator, and the number of fields is encoded in two or four bytes.) +2. The second byte, `0xa8`, indicates that what follows is a string (top four bits = `0xa0`) that is eight + bytes long (bottom four bits = `0x08`). +3. The next eight bytes are the field name `userName` in ASCII. 
Since the length was indicated + previously, there’s no need for any marker to tell us where the string ends (or any escaping). +4. The next seven bytes encode the six-letter string value `Martin` with a prefix `0xa6`, and so on. -## References +The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the +textual JSON encoding (with whitespace removed). All the binary encodings of JSON are similar in +this regard. It’s not clear whether such a small space reduction (and perhaps a speedup in parsing) +is worth the loss of human-readability. -1. Bruce G. Lindsay, Patricia Griffiths Selinger, C. Galtieri, et al.: “[Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf),” IBM Research, Research Report RJ2571(33471), July 1979. -1. “[Oracle Active Data Guard Real-Time Data Protection and Availability](http://www.oracle.com/technetwork/database/availability/active-data-guard-wp-12c-1896127.pdf),” Oracle White Paper, June 2013. -1. “[AlwaysOn Availability Groups](http://msdn.microsoft.com/en-us/library/hh510230.aspx),” in *SQL Server Books Online*, Microsoft, 2012. -1. Lin Qiao, Kapil Surlaker, Shirshanka Das, et al.: “[On Brewing Fresh Espresso: LinkedIn’s Distributed Data Serving Platform](http://www.slideshare.net/amywtang/espresso-20952131),” at *ACM International Conference on Management of Data* (SIGMOD), June 2013. -1. Jun Rao: “[Intra-Cluster Replication for Apache Kafka](http://www.slideshare.net/junrao/kafka-replication-apachecon2013),” at *ApacheCon North America*, February 2013. -1. “[Highly Available Queues](https://www.rabbitmq.com/ha.html),” in *RabbitMQ Server Documentation*, Pivotal Software, Inc., 2014. -1. Yoshinori Matsunobu: “[Semi-Synchronous Replication at Facebook](http://yoshinorimatsunobu.blogspot.co.uk/2014/04/semi-synchronous-replication-at-facebook.html),” *yoshinorimatsunobu.blogspot.co.uk*, April 1, 2014. -1. Robbert van Renesse and Fred B. 
Schneider: “[Chain Replication for Supporting High Throughput and Availability](http://static.usenix.org/legacy/events/osdi04/tech/full_papers/renesse/renesse.pdf),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004. -1. Jeff Terrace and Michael J. Freedman: “[Object Storage on CRAQ: High-Throughput Chain Replication for Read-Mostly Workloads](https://www.usenix.org/legacy/event/usenix09/tech/full_papers/terrace/terrace.pdf),” at *USENIX Annual Technical Conference* (ATC), June 2009. -1. Brad Calder, Ju Wang, Aaron Ogus, et al.: “[Windows Azure Storage: A Highly Available Cloud Storage Service with Strong Consistency](http://sigops.org/sosp/sosp11/current/2011-Cascais/printable/11-calder.pdf),” at *23rd ACM Symposium on Operating Systems Principles* (SOSP), October 2011. -1. Andrew Wang: “[Windows Azure Storage](https://www.umbrant.com/2016/02/04/windows-azure-storage/),” *umbrant.com*, February 4, 2016. -1. “[Percona Xtrabackup - Documentation](https://www.percona.com/doc/percona-xtrabackup/2.1/index.html),” Percona LLC, 2014. -1. Jesse Newland: “[GitHub Availability This Week](https://github.com/blog/1261-github-availability-this-week),” *github.com*, September 14, 2012. -1. Mark Imbriaco: “[Downtime Last Saturday](https://github.com/blog/1364-downtime-last-saturday),” *github.com*, December 26, 2012. -1. John Hugg: “[‘All in’ with Determinism for Performance and Testing in Distributed Systems](https://www.youtube.com/watch?v=gJRj3vJL4wE),” at *Strange Loop*, September 2015. -1. Amit Kapila: “[WAL Internals of PostgreSQL](http://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf),” at *PostgreSQL Conference* (PGCon), May 2012. -1. [*MySQL Documentation*](https://dev.mysql.com/doc/refman/en/binary-log.html). Oracle, 2025. -1. 
Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, et al.: “[Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf),” at *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015. -1. “[Oracle GoldenGate 12c: Real-Time Access to Real-Time Information](https://web.archive.org/web/20200110231516/http://www.oracle.com/us/products/middleware/data-integration/oracle-goldengate-realtime-access-2031152.pdf),” Oracle White Paper, October 2013. -1. Shirshanka Das, Chavdar Botev, Kapil Surlaker, et al.: “[All Aboard the Databus!](http://www.socc2012.org/s18-das.pdf),” at *ACM Symposium on Cloud Computing* (SoCC), October 2012. -1. Greg Sabino Mullane: “[Version 5 of Bucardo Database Replication System](https://www.endpointdev.com/blog/2014/06/bucardo-5-multimaster-postgres-released/),” *blog.endpoint.com*, June 23, 2014. -1. Werner Vogels: “[Eventually Consistent](http://queue.acm.org/detail.cfm?id=1466448),” *ACM Queue*, volume 6, number 6, pages 14–19, October 2008. [doi:10.1145/1466443.1466448](http://dx.doi.org/10.1145/1466443.1466448) -1. Douglas B. Terry: “[Replicated Data Consistency Explained Through Baseball](https://www.microsoft.com/en-us/research/publication/replicated-data-consistency-explained-through-baseball/),” Microsoft Research, Technical Report MSR-TR-2011-137, October 2011. -1. Douglas B. Terry, Alan J. Demers, Karin Petersen, et al.: “[Session Guarantees for Weakly Consistent Replicated Data](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.71.2269&rep=rep1&type=pdf),” at *3rd International Conference on Parallel and Distributed Information Systems* (PDIS), September 1994. [doi:10.1109/PDIS.1994.331722](http://dx.doi.org/10.1109/PDIS.1994.331722) -1. Terry Pratchett: *Reaper Man: A Discworld Novel*. Victor Gollancz, 1991. ISBN: 978-0-575-04979-6 -1. 
“[Tungsten Replicator](https://github.com/holys/tungsten-replicator),” *github.com*. -1. “[BDR 0.10.0 Documentation](https://web.archive.org/web/20160728020040/http://bdr-project.org/docs/next/index.html),” The PostgreSQL Global Development Group, *bdr-project.org*, 2015. -1. Robert Hodges: “[If You *Must* Deploy Multi-Master Replication, Read This First](http://scale-out-blog.blogspot.co.uk/2012/04/if-you-must-deploy-multi-master.html),” *scale-out-blog.blogspot.co.uk*, March 30, 2012. -1. J. Chris Anderson, Jan Lehnardt, and Noah Slater: *CouchDB: The Definitive Guide*. O'Reilly Media, 2010. ISBN: 978-0-596-15589-6 -1. AppJet, Inc.: “[Etherpad and EasySync Technical Manual](https://github.com/ether/etherpad-lite/blob/e2ce9dc/doc/easysync/easysync-full-description.pdf),” *github.com*, March 26, 2011. -1. John Day-Richter: “[What’s Different About the New Google Docs: Making Collaboration Fast](https://drive.googleblog.com/2010/09/whats-different-about-new-google-docs.html),” *drive.googleblog.com*, September 23, 2010. -1. Martin Kleppmann and Alastair R. Beresford: “[A Conflict-Free Replicated JSON Datatype](http://arxiv.org/abs/1608.03960),” arXiv:1608.03960, August 13, 2016. -1. Frazer Clement: “[Eventual Consistency – Detecting Conflicts](http://messagepassing.blogspot.co.uk/2011/10/eventual-consistency-detecting.html),” *messagepassing.blogspot.co.uk*, October 20, 2011. -1. Robert Hodges: “[State of the Art for MySQL Multi-Master Replication](https://web.archive.org/web/20161010052017/https://www.percona.com/live/mysql-conference-2013/sites/default/files/slides/mysql-multi-master-state-of-art-2013-04-24_0.pdf),” at *Percona Live: MySQL Conference & Expo*, April 2013. -1. John Daily: “[Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems](https://riak.com/clocks-are-bad-or-welcome-to-distributed-systems/),” *riak.com*, November 12, 2013. -1. 
Riley Berton: “[Is Bi-Directional Replication (BDR) in Postgres Transactional?](https://web.archive.org/web/20211204170610/http://sdf.org/~riley/blog/2016/01/04/is-bi-directional-replication-bdr-in-postgres-transactional/),” *sdf.org*, January 4, 2016. -1. Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, et al.: “[Dynamo: Amazon's Highly Available Key-Value Store](http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf),” at *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007. -1. Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski: “[A Comprehensive Study of Convergent and Commutative Replicated Data Types](http://hal.inria.fr/inria-00555588/),” INRIA Research Report no. 7506, January 2011. -1. Sam Elliott: “[CRDTs: An UPDATE (or Maybe Just a PUT)](https://speakerdeck.com/lenary/crdts-an-update-or-just-a-put),” at *RICON West*, October 2013. -1. Russell Brown: “[A Bluffers Guide to CRDTs in Riak](https://gist.github.com/russelldb/f92f44bdfb619e089a4d),” *gist.github.com*, October 28, 2013. -1. Benjamin Farinier, Thomas Gazagnaire, and Anil Madhavapeddy: “[Mergeable Persistent Data Structures](http://gazagnaire.org/pub/FGM15.pdf),” at *26es Journées Francophones des Langages Applicatifs* (JFLA), January 2015. -1. Chengzheng Sun and Clarence Ellis: “[Operational Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements](http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.53.933&rep=rep1&type=pdf),” at *ACM Conference on Computer Supported Cooperative Work* (CSCW), November 1998. -1. Lars Hofhansl: “[HBASE-7709: Infinite Loop Possible in Master/Master Replication](https://issues.apache.org/jira/browse/HBASE-7709),” *issues.apache.org*, January 29, 2013. -1. David K. Gifford: “[Weighted Voting for Replicated Data](https://www.cs.cmu.edu/~15-749/READINGS/required/availability/gifford79.pdf),” at *7th ACM Symposium on Operating Systems Principles* (SOSP), December 1979. 
[doi:10.1145/800215.806583](http://dx.doi.org/10.1145/800215.806583) -1. Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman: “[Flexible Paxos: Quorum Intersection Revisited](https://arxiv.org/abs/1608.06696),” *arXiv:1608.06696*, August 24, 2016. -1. Joseph Blomstedt: “[Re: Absolute Consistency](https://web.archive.org/web/20190919171316/http://lists.basho.com:80/pipermail/riak-users_lists.basho.com/2012-January/007157.html),” email to *riak-users* mailing list, *lists.basho.com*, January 11, 2012. -1. Joseph Blomstedt: “[Bringing Consistency to Riak](https://vimeo.com/51973001),” at *RICON West*, October 2012. -1. Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, et al.: “[Quantifying Eventual Consistency with PBS](http://www.bailis.org/papers/pbs-cacm2014.pdf),” *Communications of the ACM*, volume 57, number 8, pages 93–102, August 2014. [doi:10.1145/2632792](http://dx.doi.org/10.1145/2632792) -1. Jonathan Ellis: “[Modern Hinted Handoff](http://www.datastax.com/dev/blog/modern-hinted-handoff),” *datastax.com*, December 11, 2012. -1. “[Project Voldemort Wiki](https://github.com/voldemort/voldemort/wiki),” *github.com*, 2013. -1. “[Apache Cassandra Documentation](https://cassandra.apache.org/doc/latest/),” Apache Software Foundation, *cassandra.apache.org*. -1. “[Riak Enterprise: Multi-Datacenter Replication](https://web.archive.org/web/20150513041837/http://basho.com/assets/MultiDatacenter_Replication.pdf).” Technical whitepaper, Basho Technologies, Inc., September 2014. -1. Jonathan Ellis: “[Why Cassandra Doesn't Need Vector Clocks](http://www.datastax.com/dev/blog/why-cassandra-doesnt-need-vector-clocks),” *datastax.com*, September 2, 2013. -1. Leslie Lamport: “[Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/),” *Communications of the ACM*, volume 21, number 7, pages 558–565, July 1978. 
[doi:10.1145/359545.359563](http://dx.doi.org/10.1145/359545.359563) -1. Joel Jacobson: “[Riak 2.0: Data Types](https://web.archive.org/web/20160327135816/http://blog.joeljacobson.com/riak-2-0-data-types/),” *blog.joeljacobson.com*, March 23, 2014. -1. D. Stott Parker Jr., Gerald J. Popek, Gerard Rudisin, et al.: “[Detection of Mutual Inconsistency in Distributed Systems](https://web.archive.org/web/20170808212704/https://zoo.cs.yale.edu/classes/cs426/2013/bib/parker83detection.pdf),” *IEEE Transactions on Software Engineering*, volume 9, number 3, pages 240–247, May 1983. [doi:10.1109/TSE.1983.236733](http://dx.doi.org/10.1109/TSE.1983.236733) -1. Nuno Preguiça, Carlos Baquero, Paulo Sérgio Almeida, et al.: “[Dotted Version Vectors: Logical Clocks for Optimistic Replication](http://arxiv.org/pdf/1011.5808v1.pdf),” arXiv:1011.5808, November 26, 2010. -1. Sean Cribbs: “[A Brief History of Time in Riak](https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak),” at *RICON*, October 2014. -1. Russell Brown: “[Vector Clocks Revisited Part 2: Dotted Version Vectors](https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/),” *basho.com*, November 10, 2015. -1. Carlos Baquero: “[Version Vectors Are Not Vector Clocks](https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/),” *haslab.wordpress.com*, July 8, 2011. -1. Reinhard Schwarz and Friedemann Mattern: “[Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail](http://dcg.ethz.ch/lectures/hs08/seminar/papers/mattern4.pdf),” *Distributed Computing*, volume 7, number 3, pages 149–174, March 1994. [doi:10.1007/BF02277859](http://dx.doi.org/10.1007/BF02277859) +In the following sections we will see how we can do much better, and encode the same record in just +32 bytes. + +![ddia 0502](/fig/ddia_0502.png) + +###### Figure 5-2. Example record ([Example 5-2](/en/ch5#fig_encoding_json)) encoded using MessagePack. 
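The byte counts quoted above can be checked with a few lines of Python. This is a minimal sketch, not the MessagePack library: `pack` is a hypothetical helper that covers only the cases this record needs (small maps and arrays, short strings, and unsigned integers up to 16 bits), using the type prefixes described in the walkthrough.

```python
import json
import struct

def pack(value):
    """Encode a small subset of MessagePack (illustrative helper, not a full encoder)."""
    if isinstance(value, dict):
        assert len(value) <= 15              # fixmap: 0x80 | number of fields
        out = bytes([0x80 | len(value)])
        for key, item in value.items():
            out += pack(key) + pack(item)
        return out
    if isinstance(value, list):
        assert len(value) <= 15              # fixarray: 0x90 | number of elements
        return bytes([0x90 | len(value)]) + b"".join(pack(v) for v in value)
    if isinstance(value, str):
        data = value.encode("utf-8")
        assert len(data) <= 31               # fixstr: 0xa0 | length in bytes
        return bytes([0xA0 | len(data)]) + data
    if isinstance(value, int) and 0 <= value <= 0x7F:
        return bytes([value])                # positive fixint: the value itself
    if isinstance(value, int) and 0 <= value <= 0xFF:
        return b"\xcc" + struct.pack(">B", value)   # uint 8
    if isinstance(value, int) and 0 <= value <= 0xFFFF:
        return b"\xcd" + struct.pack(">H", value)   # uint 16, big-endian
    raise TypeError(f"type not handled in this sketch: {value!r}")

record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

encoded = pack(record)
text = json.dumps(record, separators=(",", ":"))  # JSON with whitespace removed

print(hex(encoded[0]), hex(encoded[1]), encoded[2:10])
print(len(encoded), len(text))
```

Most of the encoded bytes are still the field names and string values themselves, which is why the saving over textual JSON is so small.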
+ +## Protocol Buffers + +Protocol Buffers (protobuf) is a binary encoding library developed at Google. +It is similar to Apache Thrift, which was originally developed by Facebook +[[13](/en/ch5#Slee2007)]; +most of what this section says about Protocol Buffers applies also to Thrift. + +Protocol Buffers requires a schema for any data that is encoded. To encode the data +in [Example 5-2](/en/ch5#fig_encoding_json) in Protocol Buffers, you would describe the schema in the Protocol Buffers +interface definition language (IDL) like this: + +``` +syntax = "proto3"; + +message Person { + string user_name = 1; + int64 favorite_number = 2; + repeated string interests = 3; +} +``` + +Protocol Buffers comes with a code generation tool that takes a schema definition like the one +shown here, and produces classes that implement the schema in various programming languages. Your +application code can call this generated code to encode or decode records of the schema. The schema +language is very simple compared to JSON Schema: it only defines the fields of records and their +types, but it does not support other restrictions on the possible values of fields. + +Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in +[Figure 5-3](/en/ch5#fig_encoding_protobuf) [[14](/en/ch5#Kleppmann2012evolution)]. + +![ddia 0503](/fig/ddia_0503.png) + +###### Figure 5-3. Example record encoded using Protocol Buffers. + +Similarly to [Figure 5-2](/en/ch5#fig_encoding_messagepack), each field has a type annotation (to indicate whether it +is a string, integer, etc.) and, where required, a length indication (such as the length of a +string). The strings that appear in the data (“Martin”, “daydreaming”, “hacking”) are also encoded +as ASCII (to be precise, UTF-8), similar to before. + +The big difference compared to [Figure 5-2](/en/ch5#fig_encoding_messagepack) is that there are no field names +(`userName`, `favoriteNumber`, `interests`). 
Instead, the encoded data contains *field tags*, which +are numbers (`1`, `2`, and `3`). Those are the numbers that appear in the schema definition. Field tags +are like aliases for fields: they are a compact way of saying what field we’re talking about, +without having to spell out the field name. + +As you can see, Protocol Buffers saves even more space by packing the field type and tag number into +a single byte. It uses variable-length integers: the number 1337 is encoded in two bytes, with the +top bit of each byte used to indicate whether there are still more bytes to come. This means that, with the +`int64` type used here, numbers between 0 and 127 are encoded in one byte, numbers between 128 and 16,383 are encoded in two bytes, +etc. Bigger numbers use more bytes. + +Protocol Buffers doesn’t have an explicit list or array datatype. Instead, the `repeated` modifier +on the `interests` field indicates that the field contains a list of values, rather than a single +value. In the binary encoding, the list elements are represented simply as repeated occurrences of +the same field tag within the same record. + +### Field tags and schema evolution + +We said previously that schemas inevitably need to change over time. We call this *schema +evolution*. How does Protocol Buffers handle schema changes while keeping backward and forward +compatibility? + +As you can see from the examples, an encoded record is just the concatenation of its encoded fields. +Each field is identified by its tag number (the numbers `1`, `2`, `3` in the sample schema) and +annotated with a datatype (e.g., string or integer). If a field value is not set, it is simply +omitted from the encoded record. From this you can see that field tags are critical to the meaning +of the encoded data. You can change the name of a field in the schema, since the encoded data never +refers to field names, but you cannot change a field’s tag, since that would make all existing +encoded data invalid.
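To make the role of field tags concrete, here is a hand-rolled sketch of the encoding described above. It does not use the Protocol Buffers library; the helper names (`encode_varint`, `tag`, `string_field`, `int_field`) are made up for illustration, and the sketch handles only strings and non-negative integers.

```python
VARINT, LEN = 0, 2  # wire types: 0 = variable-length integer, 2 = length-delimited

def encode_varint(n):
    """Encode a non-negative integer seven bits at a time, lowest bits first."""
    out = bytearray()
    while True:
        low7 = n & 0x7F
        n >>= 7
        if n:
            out.append(low7 | 0x80)  # top bit set: more bytes follow
        else:
            out.append(low7)         # top bit clear: last byte
            return bytes(out)

def tag(field_number, wire_type):
    # Tag number and type are packed into one value (a single byte for small tags).
    return encode_varint((field_number << 3) | wire_type)

def string_field(field_number, text):
    data = text.encode("utf-8")
    return tag(field_number, LEN) + encode_varint(len(data)) + data

def int_field(field_number, value):
    return tag(field_number, VARINT) + encode_varint(value)

# A record is the concatenation of its fields; the repeated field
# simply appears twice with the same tag number.
encoded = (
    string_field(1, "Martin")
    + int_field(2, 1337)
    + string_field(3, "daydreaming")
    + string_field(3, "hacking")
)

print(encode_varint(1337).hex())
print(len(encoded))
```

Nothing in `encoded` mentions `user_name` or `interests`, so renaming a field changes nothing on the wire, whereas changing a tag number changes the bytes and therefore the meaning of existing data.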
+ +You can add new fields to the schema, provided that you give each field a new tag number. If old +code (which doesn’t know about the new tag numbers you added) tries to read data written by new +code, including a new field with a tag number it doesn’t recognize, it can simply ignore that field. +The datatype annotation allows the parser to determine how many bytes it needs to skip, and preserve +the unknown fields to avoid the problem in [Figure 5-1](/en/ch5#fig_encoding_preserve_field). This maintains forward +compatibility: old code can read records that were written by new code. + +What about backward compatibility? As long as each field has a unique tag number, new code can +always read old data, because the tag numbers still have the same meaning. If a field was added in +the new schema, and you read old data that does not yet contain that field, it is filled in with a +default value (for example, the empty string if the field type is string, or zero if it’s a number). + +Removing a field is just like adding a field, with backward and forward compatibility concerns +reversed. You can never use the same tag number again, because you may still have data written +somewhere that includes the old tag number, and that field must be ignored by new code. Tag numbers +used in the past can be reserved in the schema definition to ensure they are not forgotten. + +What about changing the datatype of a field? That is possible with some types—check the +documentation for details—but there is a risk that values will get truncated. For example, say you +change a 32-bit integer into a 64-bit integer. New code can easily read data written by old code, +because the parser can fill in any missing bits with zeros. However, if old code reads data written +by new code, the old code is still using a 32-bit variable to hold the value. If the decoded 64-bit +value won’t fit in 32 bits, it will be truncated. 
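The forward-compatibility behavior described in this section can be sketched as a small decoder. This is an illustration under simplifying assumptions, not the Protocol Buffers library: `decode`, `read_varint`, and `to_int32` are hypothetical helpers, only two wire types are handled, and the tag-4 field in the sample message is invented for the example.

```python
def read_varint(buf, pos):
    """Read a variable-length integer starting at pos; return (value, new_pos)."""
    result = shift = 0
    while True:
        byte = buf[pos]
        pos += 1
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:          # top bit clear: this was the last byte
            return result, pos
        shift += 7

def decode(buf, known_fields):
    """known_fields maps tag number -> field name, as the reader's schema knows them."""
    record, unknown, pos = {}, [], 0
    while pos < len(buf):
        start = pos
        key, pos = read_varint(buf, pos)
        field_number, wire_type = key >> 3, key & 0x07
        if wire_type == 0:           # variable-length integer
            value, pos = read_varint(buf, pos)
        elif wire_type == 2:         # length-delimited (e.g., a string)
            length, pos = read_varint(buf, pos)
            value = bytes(buf[pos:pos + length])
            pos += length
        else:
            raise ValueError(f"wire type {wire_type} not handled in this sketch")
        if field_number in known_fields:
            # Values are collected in lists so that repeated fields work too.
            record.setdefault(known_fields[field_number], []).append(value)
        else:
            # The type annotation told us how far to skip; keep the raw bytes
            # so they can be written back unchanged.
            unknown.append(bytes(buf[start:pos]))
    return record, unknown

def to_int32(value):
    """What a 32-bit signed variable keeps of a wider integer."""
    value &= 0xFFFFFFFF
    return value - (1 << 32) if value >= (1 << 31) else value

# Written by "new" code: tags 1 and 2, plus a hypothetical new tag 4.
message = bytes.fromhex("0a064d617274696e" "10b90a" "2202656e")
old_schema = {1: "user_name", 2: "favorite_number", 3: "interests"}

record, unknown = decode(message, old_schema)
print(record)
print(unknown)
print(to_int32(2**32 + 7))   # a value that does not fit in 32 bits loses its high bits
```

Keeping the unrecognized bytes in `unknown`, rather than discarding them, is what lets old code write the record back without losing the new field.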
+ +## Avro + +Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers. +It was started in 2009 as a subproject of Hadoop, as a result of Protocol Buffers not being a good +fit for Hadoop’s use cases +[[15](/en/ch5#Cutting2009)]. + +Avro also uses a schema to specify the structure of the data being encoded. It has two schema +languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily +machine-readable. Like Protocol Buffers, this schema language specifies only fields and their types, +and not complex validation rules like in JSON Schema. + +Our example schema, written in Avro IDL, might look like this: + +``` +record Person { + string userName; + union { null, long } favoriteNumber = null; + array<string> interests; +} +``` + +The equivalent JSON representation of that schema is as follows: + +``` +{ + "type": "record", + "name": "Person", + "fields": [ + {"name": "userName", "type": "string"}, + {"name": "favoriteNumber", "type": ["null", "long"], "default": null}, + {"name": "interests", "type": {"type": "array", "items": "string"}} + ] +} +``` + +First of all, notice that there are no tag numbers in the schema. If we encode our example record +([Example 5-2](/en/ch5#fig_encoding_json)) using this schema, the Avro binary encoding is just 32 bytes long, the +most compact of all the encodings we have seen. The breakdown of the encoded byte sequence is shown +in [Figure 5-4](/en/ch5#fig_encoding_avro). + +If you examine the byte sequence, you can see that there is nothing to identify fields or their +datatypes. The encoding simply consists of values concatenated together. A string is just a length +prefix followed by UTF-8 bytes, but there’s nothing in the encoded data that tells you that it is a +string. It could just as well be an integer, or something else entirely. An integer is encoded using +a variable-length encoding. + +![ddia 0504](/fig/ddia_0504.png) + +###### Figure 5-4.
Example record encoded using Avro. + +To parse the binary data, you go through the fields in the order that they appear in the schema and +use the schema to tell you the datatype of each field. This means that the binary data can only be +decoded correctly if the code reading the data is using the *exact same schema* as the code that +wrote the data. Any mismatch in the schema between the reader and the writer would mean incorrectly +decoded data. + +So, how does Avro support schema evolution? + +### The writer’s schema and the reader’s schema + +When an application wants to encode some data (to write it to a file or database, to send it over +the network, etc.), it encodes the data using whatever version of the schema it knows about—for +example, that schema may be compiled into the application. This is known as the *writer’s schema*. + +When an application wants to decode some data (read it from a file or database, receive it from the +network, etc.), it uses two schemas: the writer’s schema that is identical to the one used for +encoding, and the *reader’s schema*, which may be different. This is illustrated in +[Figure 5-5](/en/ch5#fig_encoding_avro_schemas). The reader’s schema defines the fields of each record that the +application code is expecting, and their types. + +![ddia 0505](/fig/ddia_0505.png) + +###### Figure 5-5. In Protocol Buffers, encoding and decoding can use different versions of a schema. In Avro, decoding uses two schemas: the writer’s schema must be identical to the one used for encoding, but the reader’s schema can be an older or newer version. + +If the reader’s and writer’s schema are the same, decoding is easy. If they are different, Avro +resolves the differences by looking at the writer’s schema and the reader’s schema side by side and +translating the data from the writer’s schema into the reader’s schema. 
The Avro specification +[[16](/en/ch5#AvroSpec), +[17](/en/ch5#AvroParsing)] +defines exactly how this resolution works, and it is illustrated in +[Figure 5-6](/en/ch5#fig_encoding_avro_resolution). + +For example, it’s no problem if the writer’s schema and the reader’s schema have their fields in a +different order, because the schema resolution matches up the fields by field name. If the code +reading the data encounters a field that appears in the writer’s schema but not in the reader’s +schema, it is ignored. If the code reading the data expects some field, but the writer’s schema does +not contain a field of that name, it is filled in with a default value declared in the reader’s +schema. + +![ddia 0506](/fig/ddia_0506.png) + +###### Figure 5-6. An Avro reader resolves differences between the writer’s schema and the reader’s schema. + +### Schema evolution rules + +With Avro, forward compatibility means that you can have a new version of the schema as writer and +an old version of the schema as reader. Conversely, backward compatibility means that you can have a +new version of the schema as reader and an old version as writer. + +To maintain compatibility, you may only add or remove a field that has a default value. (The field +`favoriteNumber` in our Avro schema has a default value of `null`.) For example, say you add a +field with a default value, so this new field exists in the new schema but not the old one. When a +reader using the new schema reads a record written with the old schema, the default value is filled +in for the missing field. + +If you were to add a field that has no default value, new readers wouldn’t be able to read data +written by old writers, so you would break backward compatibility. If you were to remove a field +that has no default value, old readers wouldn’t be able to read data written by new writers, so you +would break forward compatibility. 
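The resolution rules above can be sketched in a few lines. This is a simplified illustration, not the Avro library: the hypothetical `resolve` function works on an already-decoded record, whereas a real Avro reader applies the writer's and reader's schemas while decoding the binary data and also handles type conversion, aliases, and unions. The `photoURL` and `email` fields are invented for the example.

```python
NO_DEFAULT = object()  # sentinel: the reader's schema declares no default

def resolve(writer_record, reader_fields):
    """Translate a record from the writer's schema into the reader's schema.

    writer_record: dict of field name -> value, as written with the writer's schema.
    reader_fields: list of (name, default) pairs from the reader's schema.
    """
    result = {}
    for name, default in reader_fields:
        if name in writer_record:
            result[name] = writer_record[name]   # fields are matched by name, not position
        elif default is not NO_DEFAULT:
            result[name] = default               # missing in the writer: use reader's default
        else:
            raise ValueError(f"writer has no field {name!r} and reader declares no default")
    return result                                # writer-only fields are ignored

reader = [("userName", NO_DEFAULT), ("favoriteNumber", None), ("interests", NO_DEFAULT)]

# Backward compatibility: new reader, old writer that predates favoriteNumber.
old_data = {"interests": ["hacking"], "userName": "Martin"}
print(resolve(old_data, reader))

# Forward compatibility: this reader is now the old one; the writer added photoURL.
new_data = {"userName": "Martin", "favoriteNumber": 1337,
            "interests": ["hacking"], "photoURL": "/img/martin.png"}
print(resolve(new_data, reader))

# Adding a field without a default breaks backward compatibility.
strict_reader = reader + [("email", NO_DEFAULT)]
try:
    resolve(old_data, strict_reader)
except ValueError as err:
    print("cannot read old data:", err)
```

The last case is the reason for the rule that only fields with a default value may be added or removed.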
+ +In some programming languages, `null` is an acceptable default for any variable, but this is not the +case in Avro: if you want to allow a field to be null, you have to use a *union type*. For example, +`union { null, long, string } field;` indicates that `field` can be a number, or a string, or null. +You can only use `null` as a default value if it is the first branch of the union. This is a little +more verbose than having everything nullable by default, but it helps prevent bugs by being explicit +about what can and cannot be null [[18](/en/ch5#Hoare2009)]. + +Changing the datatype of a field is possible, provided that Avro can convert the type. Changing the +name of a field is possible but a little tricky: the reader’s schema can contain aliases for field +names, so it can match an old writer’s schema field names against the aliases. This means that +changing a field name is backward compatible but not forward compatible. Similarly, adding a branch +to a union type is backward compatible but not forward compatible. + +### But what is the writer’s schema? + +There is an important question that we’ve glossed over so far: how does the reader know the writer’s +schema with which a particular piece of data was encoded? We can’t just include the entire schema +with every record, because the schema would likely be much bigger than the encoded data, making all +the space savings from the binary encoding futile. + +The answer depends on the context in which Avro is being used. To give a few examples: + +Large file with lots of records +: A common use for Avro is for storing a large file containing millions of records, all encoded with + the same schema. (We will discuss this kind of situation in [Link to Come].) In this case, the + writer of that file can just include the writer’s schema once at the beginning of the file. Avro + specifies a file format (object container files) to do this. 
+ +Database with individually written records +: In a database, different records may be written at different points in time using different + writer’s schemas—you cannot assume that all the records will have the same schema. The simplest + solution is to include a version number at the beginning of every encoded record, and to keep a + list of schema versions in your database. A reader can fetch a record, extract the version number, + and then fetch the writer’s schema for that version number from the database. Using that writer’s + schema, it can decode the rest of the record. + + Confluent’s schema registry for Apache Kafka + [[19](/en/ch5#ConfluentSchemaReg)] + and LinkedIn’s Espresso + [[20](/en/ch5#Auradkar2015)] + work this way, for example. + +Sending records over a network connection +: When two processes are communicating over a bidirectional network connection, they can negotiate + the schema version on connection setup and then use that schema for the lifetime of the + connection. The Avro RPC protocol (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) works like this. + +A database of schema versions is a useful thing to have in any case, since it acts as documentation +and gives you a chance to check schema compatibility +[[21](/en/ch5#Kreps2015)]. +As the version number, you could use a simple incrementing integer, or you could use a hash of the +schema. + +### Dynamically generated schemas + +One advantage of Avro’s approach, compared to Protocol Buffers, is that the schema doesn’t contain +any tag numbers. But why is this important? What’s the problem with keeping a couple of numbers in +the schema? + +The difference is that Avro is friendlier to *dynamically generated* schemas. For example, say +you have a relational database whose contents you want to dump to a file, and you want to use a +binary format to avoid the aforementioned problems with textual formats (JSON, CSV, XML). 
If you use +Avro, you can fairly easily generate an Avro schema (in the JSON representation we saw earlier) from the +relational schema and encode the database contents using that schema, dumping it all to an Avro +object container file [[22](/en/ch5#Shapira2014)]. +You can generate a record schema for each database table, and each column becomes a field in that +record. The column name in the database maps to the field name in Avro. + +Now, if the database schema changes (for example, a table has one column added and one column +removed), you can just generate a new Avro schema from the updated database schema and export data in +the new Avro schema. The data export process does not need to pay any attention to the schema +change—it can simply do the schema conversion every time it runs. Anyone who reads the new data +files will see that the fields of the record have changed, but since the fields are identified by +name, the updated writer’s schema can still be matched up with the old reader’s schema. + +By contrast, if you were using Protocol Buffers for this purpose, the field tags would likely have +to be assigned by hand: every time the database schema changes, an administrator would have to +manually update the mapping from database column names to field tags. (It might be possible to +automate this, but the schema generator would have to be very careful to not assign previously used +field tags.) This kind of dynamically generated schema simply wasn’t a design goal of Protocol +Buffers, whereas it was for Avro. + +## The Merits of Schemas + +As we saw, Protocol Buffers and Avro both use a schema to describe a binary encoding format. Their +schema languages are much simpler than XML Schema or JSON Schema, which support much more detailed +validation rules (e.g., “the string value of this field must match this regular expression” or “the +integer value of this field must be between 0 and 100”). 
As Protocol Buffers and Avro are simpler to +implement and simpler to use, they have grown to support a fairly wide range of programming +languages. + +The ideas on which these encodings are based are by no means new. For example, they have a lot in +common with ASN.1, a schema definition language that was first standardized in 1984 +[[23](/en/ch5#Larmouth1999), +[24](/en/ch5#Kaliski1993)]. +It was used to define various network protocols, and its binary encoding (DER) is still used to encode +SSL certificates (X.509), for example +[[25](/en/ch5#HoffmanAndrews2020)]. +ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers +[[26](/en/ch5#Walkin2010)]. +However, it’s also very complex and badly documented, so ASN.1 +is probably not a good choice for new applications. + +Many data systems also implement some kind of proprietary binary encoding for their data. For +example, most relational databases have a network protocol over which you can send queries to the +database and get back responses. Those protocols are generally specific to a particular database, +and the database vendor provides a driver (e.g., using the ODBC or JDBC APIs) that decodes responses +from the database’s network protocol into in-memory data structures. + +So, we can see that although textual data formats such as JSON, XML, and CSV are widespread, binary +encodings based on schemas are also a viable option. They have a number of nice properties: + +* They can be much more compact than the various “binary JSON” variants, since they can omit field + names from the encoded data. +* The schema is a valuable form of documentation, and because the schema is required for decoding, + you can be sure that it is up to date (whereas manually maintained documentation may easily + diverge from reality). +* Keeping a database of schemas allows you to check forward and backward compatibility of schema + changes, before anything is deployed. 
+* For users of statically typed programming languages, the ability to generate code from the schema + is useful, since it enables type-checking at compile time. + +In summary, schema evolution allows the same kind of flexibility as schemaless/schema-on-read JSON +databases provide (see [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)), while also providing better +guarantees about your data and better tooling. + +# Modes of Dataflow + +At the beginning of this chapter we said that whenever you want to send some data to another process +with which you don’t share memory—for example, whenever you want to send data over the network or +write it to a file—you need to encode it as a sequence of bytes. We then discussed a variety of +different encodings for doing this. + +We talked about forward and backward compatibility, which are important for evolvability (making +change easy by allowing you to upgrade different parts of your system independently, and not having +to change everything at once). Compatibility is a relationship between one process that encodes the +data, and another process that decodes it. + +That’s a fairly abstract idea—there are many ways data can flow from one process to another. +Who encodes the data, and who decodes it? In the rest of this chapter we will explore some of the +most common ways how data flows between processes: + +* Via databases (see [“Dataflow Through Databases”](/en/ch5#sec_encoding_dataflow_db)) +* Via service calls (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) +* Via workflow engines (see [“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows)) +* Via asynchronous messages (see [“Event-Driven Architectures”](/en/ch5#sec_encoding_dataflow_msg)) + +## Dataflow Through Databases + +In a database, the process that writes to the database encodes the data, and the process that reads +from the database decodes it. 
There may just be a single process accessing the database, in which +case the reader is simply a later version of the same process—in that case you can think of +storing something in the database as *sending a message to your future self*. + +Backward compatibility is clearly necessary here; otherwise your future self won’t be able to decode +what you previously wrote. + +In general, it’s common for several different processes to be accessing a database at the same time. +Those processes might be several different applications or services, or they may simply be several +instances of the same service (running in parallel for scalability or fault tolerance). Either way, +in an environment where the application is changing, it is likely that some processes accessing the +database will be running newer code and some will be running older code—for example because a new +version is currently being deployed in a rolling upgrade, so some instances have been updated while +others haven’t yet. + +This means that a value in the database may be written by a *newer* version of the code, and +subsequently read by an *older* version of the code that is still running. Thus, forward +compatibility is also often required for databases. + +### Different values written at different times + +A database generally allows any value to be updated at any time. This means that within a single +database you may have some values that were written five milliseconds ago, and some values that were +written five years ago. + +When you deploy a new version of your application (of a server-side application, at least), you may +entirely replace the old version with the new version within a few minutes. The same is not true of +database contents: the five-year-old data will still be there, in the original encoding, unless you +have explicitly rewritten it since then. This observation is sometimes summed up as *data outlives +code*. 
+ +Rewriting (*migrating*) data into a new schema is certainly possible, but it’s an expensive thing to +do on a large dataset, so most databases avoid it if possible. Most relational databases allow +simple schema changes, such as adding a new column with a `null` default value, without rewriting +existing data. When an old row is read, the database fills in `null`s for any columns that are +missing from the encoded data on disk. +Schema evolution thus allows the entire database to appear as if it was encoded with a single +schema, even though the underlying storage may contain records encoded with various historical +versions of the schema. + +More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or +moving some data into a separate table—still require data to be rewritten, often at the application +level [[27](/en/ch5#Xu2017)]. +Maintaining forward and backward compatibility across such migrations is still a research problem +[[28](/en/ch5#Litt2020)]. + +### Archival storage + +Perhaps you take a snapshot of your database from time to time, say for backup purposes or for +loading into a data warehouse (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)). In this case, the data dump will typically +be encoded using the latest schema, even if the original encoding in the source database contained a +mixture of schema versions from different eras. Since you’re copying the data anyway, you might as +well encode the copy of the data consistently. + +As the data dump is written in one go and is thereafter immutable, formats like Avro object +container files are a good fit. This is also a good opportunity to encode the data in an +analytics-friendly column-oriented format such as Parquet (see [“Column Compression”](/en/ch4#sec_storage_column_compression)). + +In [Link to Come] we will talk more about using data in archival storage. 
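The way a reader can present records written under different historical schema versions as if they shared one schema, and the way a data dump can then be encoded consistently, can be sketched roughly as follows. This is only an illustration under stated assumptions: `LATEST_SCHEMA`, `upgrade_record`, `dump_snapshot`, and the field names are hypothetical, and JSON lines stand in for a real dump format such as Avro object container files or Parquet.

```python
import json

# Hypothetical "latest schema": field name -> default value used when an
# older record was written before that field existed.
LATEST_SCHEMA = {"id": None, "name": None, "email": None, "phone": None}


def upgrade_record(stored: dict) -> dict:
    """Return the record as it would appear under the latest schema.

    Fields missing from the stored encoding are filled with the default
    (None, analogous to a null column); stored values are kept as-is.
    """
    return {field: stored.get(field, default)
            for field, default in LATEST_SCHEMA.items()}


def dump_snapshot(rows: list[dict]) -> str:
    """Encode every row with the same (latest) schema, one JSON object per line."""
    return "\n".join(json.dumps(upgrade_record(r), sort_keys=True) for r in rows)


# Rows written at different times, under different schema versions.
rows = [
    {"id": 1, "name": "Ada"},                                # oldest version
    {"id": 2, "name": "Grace", "email": "g@example.com"},    # later version
    {"id": 3, "name": "Lin", "email": "l@example.com", "phone": "555-0100"},
]
snapshot = dump_snapshot(rows)
```

The stored rows are never rewritten here; only the view handed to the reader, and the copy written to the dump, use the latest schema.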
+ +## Dataflow Through Services: REST and RPC + +When you have processes that need to communicate over a network, there are a few different ways of +arranging that communication. The most common arrangement is to have two roles: *clients* and +*servers*. The servers expose an API over the network, and the clients can connect to the servers +to make requests to that API. The API exposed by the server is known as a *service*. + +The web works this way: clients (web browsers) make requests to web servers, making `GET` requests +to download HTML, CSS, JavaScript, images, etc., and making `POST` requests to submit data to the +server. The API consists of a standardized set of protocols and data formats (HTTP, URLs, SSL/TLS, +HTML, etc.). Because web browsers, web servers, and website authors mostly agree on these standards, +you can use any web browser to access any website (at least in theory!). + +Web browsers are not the only type of client. For example, native apps running on mobile devices and +desktop computers often talk to servers, and client-side JavaScript applications running inside web +browsers can also make HTTP requests. +In this case, the server’s response is typically not HTML for displaying to a human, but rather data +in an encoding that is convenient for further processing by the client-side application code (most +often JSON). Although HTTP may be used as the transport protocol, the API implemented on top is +application-specific, and the client and server need to agree on the details of that API. + +In some ways, services are similar to databases: they typically allow clients to submit and query +data. However, while databases allow arbitrary queries using the query languages we discussed in +[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs +that are predetermined by the business logic (application code) of the service +[[29](/en/ch5#Helland2005_ch5)]. 
This restriction provides a degree of encapsulation: services can impose +fine-grained restrictions on what clients can and cannot do. + +A key design goal of a service-oriented/microservices architecture is to make the application easier +to change and maintain by making services independently deployable and evolvable. A common principle +is that each service should be owned by one team, and that team should be able to release new +versions of the service frequently, without having to coordinate with other teams. We should +therefore expect old and new versions of servers and clients to be running at the same time, and so +the data encoding used by servers and clients must be compatible across versions of the service API. + +### Web services + +When HTTP is used as the underlying protocol for talking to the service, it is called a *web +service*. Web services are commonly used when building a service-oriented or microservices +architecture (discussed earlier in [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)). The term “web service” is +perhaps a slight misnomer, because web services are not only used on the web, but in several +different contexts. For example: + +1. A client application running on a user’s device (e.g., a native app on a mobile device, or a + JavaScript web app in a browser) making requests to a service over HTTP. These requests typically + go over the public internet. +2. One service making requests to another service owned by the same organization, often located + within the same datacenter, as part of a service-oriented/microservices architecture. +3. One service making requests to a service owned by a different organization, usually via the + internet. This is used for data exchange between different organizations’ backend systems. This + category includes public APIs provided by online services, such as credit card processing + systems, or OAuth for shared access to user data. 
+ +The most popular service design philosophy is REST, which builds upon the principles of HTTP +[[30](/en/ch5#Fielding2000), +[31](/en/ch5#Fielding2008)]. +It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for +cache control, authentication, and content type negotiation. An API designed according to the +principles of REST is called *RESTful*. + +Code that needs to invoke a web service API must know which HTTP endpoint to query, and what data +format to send and expect in response. Even if a service adopts RESTful design principles, clients +need to somehow find out these details. Service developers often use an interface definition +language (IDL) to define and document their service’s API endpoints and data models, and to evolve +them over time. Other developers can then use the service definition to determine how to query the +service. The two most popular service IDLs are OpenAPI (also known as Swagger +[[32](/en/ch5#Swagger2014)]) +and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send +and receive Protocol Buffers. + +Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](/en/ch5#fig_open_api_def). +The service definition allows developers to define service endpoints, documentation, versions, data +models, and much more. gRPC definitions look similar, but are defined using Protocol Buffers service +definitions. + +##### Example 5-3. Example OpenAPI service definition in YAML + +``` +openapi: 3.0.0 +info: + title: Ping, Pong + version: 1.0.0 +servers: + - url: http://localhost:8080 +paths: + /ping: + get: + summary: Given a ping, returns a pong message + responses: + '200': + description: A pong + content: + application/json: + schema: + type: object + properties: + message: + type: string + example: Pong! 
+``` + +Even if a design philosophy and IDL are adopted, developers must still write the code that +implements their service’s API calls. A service framework is often adopted to simplify this +effort. Service frameworks such as Spring Boot, FastAPI, and gRPC allow developers to write the +business logic for each API endpoint while the framework code handles routing, metrics, caching, +authentication, and so on. [Example 5-4](/en/ch5#fig_fastapi_def) shows an example Python implementation of the service +defined in [Example 5-3](/en/ch5#fig_open_api_def). + +##### Example 5-4. Example FastAPI service implementing the definition from [Example 5-3](/en/ch5#fig_open_api_def) + +``` +from fastapi import FastAPI +from pydantic import BaseModel + +app = FastAPI(title="Ping, Pong", version="1.0.0") + +class PongResponse(BaseModel): + message: str = "Pong!" + +@app.get("/ping", response_model=PongResponse, + summary="Given a ping, returns a pong message") +async def ping(): + return PongResponse() +``` + +Many frameworks couple service definitions and server code together. In some cases, such as with the +popular Python FastAPI framework, servers are written in code and an IDL is generated automatically. +In other cases, such as with gRPC, the service definition is written first, and server code +scaffolding is generated. Both approaches allow developers to generate client libraries and SDKs +in a variety of languages from the service definition. In addition to code generation, IDL tools +such as Swagger’s can generate documentation, verify schema change compatibility, and provide +graphical user interfaces for developers to query and test services. + +### The problems with remote procedure calls (RPCs) + +Web services are merely the latest incarnation of a long line of technologies for making API +requests over a network, many of which received a lot of hype but have serious problems. 
Enterprise +JavaBeans (EJB) and Java’s Remote Method Invocation (RMI) are limited to Java. The Distributed +Component Object Model (DCOM) is limited to Microsoft platforms. The Common Object Request Broker +Architecture (CORBA) is excessively complex, and does not provide backward or forward +compatibility [[33](/en/ch5#Henning2006)]. +SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are +also plagued by complexity and compatibility problems +[[34](/en/ch5#Lacey2006), +[35](/en/ch5#Tilkov2006), +[36](/en/ch5#Bray2004)]. + +All of these are based on the idea of a *remote procedure call* (RPC), which has been around since +the 1970s [[37](/en/ch5#Birrell1984)]. +The RPC model tries to make a request to a remote network service look the same as calling a function or +method in your programming language, within the same process (this abstraction is called *location +transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed +[[38](/en/ch5#Waldo1994), +[39](/en/ch5#Vinoski2008)]. +A network request is very different from a local function call: + +* A local function call is predictable and either succeeds or fails, depending only on parameters + that are under your control. A network request is unpredictable: the request or response may be + lost due to a network problem, or the remote machine may be slow or unavailable, and such problems + are entirely outside of your control. Network problems are common, so you have to anticipate them, + for example by retrying a failed request. +* A local function call either returns a result, or throws an exception, or never returns (because + it goes into an infinite loop or the process crashes). A network request has another possible + outcome: it may return without a result, due to a *timeout*. 
In that case, you simply don’t know + what happened: if you don’t get a response from the remote service, you have no way of knowing + whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/en/ch9#ch_distributed).) +* If you retry a failed network request, it could happen that the previous request actually got + through, and only the response was lost. + In that case, retrying will cause the action to + be performed multiple times, unless you build a mechanism for deduplication (*idempotence*) into + the protocol [[40](/en/ch5#Leach2017idemptence)]. + Local function calls don’t have this problem. (We discuss idempotence in more detail + in [Link to Come].) +* Every time you call a local function, it normally takes about the same time to execute. A network + request is much slower than a function call, and its latency is also wildly variable: at good + times it may complete in less than a millisecond, but when the network is congested or the remote + service is overloaded it may take many seconds to do exactly the same thing. +* When you call a local function, you can efficiently pass it references (pointers) to objects in + local memory. When you make a network request, all those parameters need to be encoded into a + sequence of bytes that can be sent over the network. That’s okay if the parameters are immutable + primitives like numbers or short strings, but it quickly becomes problematic with larger amounts + of data and mutable objects. +* The client and the service may be implemented in different programming languages, so the RPC + framework must translate datatypes from one language into another. This can end up ugly, since not + all languages have the same types; recall JavaScript’s problems with numbers greater than 2^53, + for example (see [“JSON, XML, and Binary Variants”](/en/ch5#sec_encoding_json)). This problem doesn’t exist in a single process written in + a single language. 
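The timeout and retry problems in the list above can be made concrete with a small simulation. The following is a minimal sketch, not a real RPC framework: `PaymentServer` is a hypothetical in-process stand-in for a remote service, and the idempotency-key scheme shown is one common way of deduplicating retried requests, assumed here for illustration.

```python
import uuid


class PaymentServer:
    """Toy in-process stand-in for a remote service (hypothetical)."""

    def __init__(self):
        self.balance = 0
        self._seen = {}                  # idempotency key -> earlier result
        self.drop_next_response = False  # set to simulate a lost response

    def deposit(self, key: str, amount: int) -> int:
        if key not in self._seen:        # first time this request is seen
            self.balance += amount
            self._seen[key] = self.balance
        result = self._seen[key]         # a retry gets the earlier result
        if self.drop_next_response:      # the request got through, but the
            self.drop_next_response = False  # response is lost on the way back
            raise TimeoutError("no response received")
        return result


def deposit_with_retry(server: PaymentServer, amount: int, attempts: int = 3) -> int:
    key = str(uuid.uuid4())  # one key per logical request, reused on every retry
    for _ in range(attempts):
        try:
            return server.deposit(key, amount)
        except TimeoutError:
            continue  # the client cannot tell whether the request got through
    raise TimeoutError("gave up after retries")


server = PaymentServer()
server.drop_next_response = True  # first attempt succeeds, its response is lost
new_balance = deposit_with_retry(server, 100)
```

Without the key, the retry would deposit the amount a second time; with it, the server recognizes the repeated request and returns the earlier result.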
+ +All of these factors mean that there’s no point trying to make a remote service look too much like a +local object in your programming language, because it’s a fundamentally different thing. Part of the +appeal of REST is that it treats state transfer over a network as a process that is distinct from a +function call. + +### Load balancers, service discovery, and service meshes + +All services communicate over the network. For this reason, a client must know the address of the +service it’s connecting to—a problem known as *service discovery*. The simplest approach is to +configure a client to connect to the IP address and port where the service is running. This +configuration will work, but if the server goes offline, is transferred to a new machine, or becomes +overloaded, the client has to be manually reconfigured. + +To provide higher availability and scalability, there are usually multiple instances of a service +running on different machines, any of which can handle an incoming request. Spreading requests +across these instances is called *load balancing* +[[41](/en/ch5#Rose2023)]. +There are many load balancing and service discovery solutions available: + +* *Hardware load balancers* are specialized pieces of equipment that are installed in data centers. + They allow clients to connect to a single host and port, and incoming connections are routed to + one of the servers running the service. Such load balancers detect network failures when + connecting to a downstream server and shift the traffic to other servers. +* *Software load balancers* behave in much the same way as hardware load balancers. But rather than + requiring a special appliance, software load balancers such as Nginx and HAProxy are applications + that can be installed on a standard machine. +* The *domain name service (DNS)* is how domain names are resolved on the Internet when you open a + webpage. 
It supports load balancing by allowing multiple IP addresses to be associated with a + single domain name. Clients can then be configured to connect to a service using a domain name + rather than an IP address, and the client’s network layer picks which IP address to use when making a + connection. One drawback of this approach is that DNS is designed to propagate changes over longer + periods of time, and to cache DNS entries. If servers are started, stopped, or moved frequently, + clients might see stale IP addresses that no longer have a server running on them. +* *Service discovery systems* use a centralized registry rather than DNS to track which service + endpoints are available. When a new service instance starts up, it registers itself with the + service discovery system by declaring the host and port it’s listening on, along with relevant + metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location, + and more. The service then periodically sends a heartbeat to the discovery system to indicate + that the service is still available. + + When a client wishes to connect to a service, it first queries the discovery system to get a list of + available endpoints, and then connects directly to one of them. Compared to DNS, service discovery + supports a much more dynamic environment where service instances change frequently. Discovery + systems also give clients more metadata about the service they’re connecting to, which enables + clients to make smarter load balancing decisions. +* *Service meshes* are a sophisticated form of load balancing that combines software load balancers + and service discovery. Unlike traditional software load balancers, which run on a separate + machine, service mesh load balancers are typically deployed as an in-process client library or as + a process or “sidecar” container on both the client and server. 
Client applications connect + to their own local service load balancer, which connects to the server’s load balancer. From + there, the connection is routed to the local server process. + + Though complicated, this topology offers a number of advantages. Because the clients and servers are + routed entirely through local connections, connection encryption can be handled entirely at the load + balancer level. This shields clients and servers from having to deal with the complexities of SSL + certificates and TLS. Mesh systems also provide sophisticated observability. They can track which + services are calling each other in real time, detect failures, track traffic load, and more. + +Which solution is appropriate depends on an organization’s needs. Those running in a very dynamic +service environment with an orchestrator such as Kubernetes often choose to run a service mesh such +as Istio or Linkerd. Specialized infrastructure such as databases or messaging systems might require +their own purpose-built load balancer. Simpler deployments are best served with software load +balancers. + +### Data encoding and evolution for RPC + +For evolvability, it is important that RPC clients and servers can be changed and deployed +independently. Compared to data flowing through databases (as described in the last section), we can make a +simplifying assumption in the case of dataflow through services: it is reasonable to assume that +all the servers will be updated first, and all the clients second. Thus, you only need backward +compatibility on requests, and forward compatibility on responses. + +The backward and forward compatibility properties of an RPC scheme are inherited from whatever +encoding it uses: + +* gRPC (Protocol Buffers) and Avro RPC can be evolved according to the compatibility rules of the + respective encoding format. +* RESTful APIs most commonly use JSON for responses, and JSON or URI-encoded/form-encoded request + parameters for requests. 
Adding optional request parameters and adding new fields to response + objects are usually considered changes that maintain compatibility. + +Service compatibility is made harder by the fact that RPC is often used for communication across +organizational boundaries, so the provider of a service often has no control over its clients and +cannot force them to upgrade. Thus, compatibility needs to be maintained for a long time, perhaps +indefinitely. If a compatibility-breaking change is required, the service provider often ends up +maintaining multiple versions of the service API side by side. + +There is no agreement on how API versioning should work (i.e., how a client can indicate which +version of the API it wants to use [[42](/en/ch5#Hunt2014wn)]). +For RESTful APIs, common approaches are to use a version +number in the URL or in the HTTP `Accept` header. For services that use API keys to identify a +particular client, another option is to store a client’s requested API version on the server and to +allow this version selection to be updated through a separate administrative interface +[[43](/en/ch5#Leach2017versioning)]. + +## Durable Execution and Workflows + +By definition, service-based architectures have multiple services that are all responsible for +different portions of an application. Consider a payment processing application that charges a +credit card and deposits the funds into a bank account. This system would likely have different +services responsible for fraud detection, credit card integration, bank integration, and so on. + +Processing a single payment in our example requires many service calls. A payment processor service +might invoke the fraud detection service to check for fraud, call the credit card service to debit +the credit card, and call the banking service to deposit debited funds, as shown in +[Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*. 
+Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a +general-purpose programming language, a domain-specific language (DSL), or a markup language such as +Business Process Execution Language (BPEL) +[[44](/en/ch5#BPEL2007)]. + +###### Tasks, Activities, and Functions + +Different workflow engines use different names for tasks. Temporal, for example, uses the term +*activity*. Others refer to tasks as *durable functions*. Though the names differ, the concepts are +the same. + +![ddia 0507](/fig/ddia_0507.png) + +###### Figure 5-7. Example of a workflow expressed using Business Process Model and Notation (BPMN), a graphical notation. + +Workflows are run, or executed, by a *workflow engine*. Workflow engines determine when to run each +task, on which machine a task must be run, what to do if a task fails (e.g., if the machine crashes +while the task is running), how many tasks are allowed to execute in parallel, and more. + +Workflow engines are typically composed of an orchestrator and an executor. The orchestrator is +responsible for scheduling tasks to be executed and the executor is responsible for executing tasks. +Execution begins when a workflow is triggered. The orchestrator triggers the workflow itself if +users define a time-based schedule, such as hourly execution. External sources such as a web service +or even a human can also trigger workflow executions. Once triggered, executors are invoked to run +tasks. + +There are many kinds of workflow engines that address a diverse set of use cases. Some, such as +Airflow, Dagster, and Prefect, integrate with data systems and orchestrate ETL tasks. Others, such +as Camunda and Orkes, provide a graphical notation for workflows (such as BPMN, used in +[Figure 5-7](/en/ch5#fig_encoding_workflow)) so that non-engineers can more easily define and execute workflows. Still +others, such as Temporal and Restate, provide *durable execution*. 
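To make the idea of a workflow as a graph of tasks more tangible, here is a toy sketch of the payment workflow run in dependency order. It is an assumption-laden illustration and not how any of the engines named above work: the task functions, the `WORKFLOW` mapping, and `run_workflow` are hypothetical, and a real engine would add scheduling, retries, and parallel execution on separate machines.

```python
from graphlib import TopologicalSorter


# Hypothetical tasks; each one reads and updates the shared workflow state.
def check_fraud(state):
    state["is_fraud"] = state["amount"] > 10_000


def debit_credit_card(state):
    state["debited"] = not state["is_fraud"]


def deposit_to_bank(state):
    state["deposited"] = state["debited"]


# The workflow as a graph: each task maps to the set of tasks it depends on.
WORKFLOW = {
    check_fraud: set(),
    debit_credit_card: {check_fraud},
    deposit_to_bank: {debit_credit_card},
}


def run_workflow(graph, state):
    """Toy orchestrator and executor: run each task after its dependencies."""
    executed = []
    for task in TopologicalSorter(graph).static_order():
        task(state)  # a real engine would hand the task to an executor
        executed.append(task.__name__)
    return executed


state = {"amount": 250}
order = run_workflow(WORKFLOW, state)
```

Declaring dependencies rather than a fixed sequence is what lets an engine decide when each task may run, and which tasks could run in parallel.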
+ +### Durable execution + +Durable execution frameworks have become a popular way to build service-based architectures that +require transactionality. In our payment example, we would like to process each payment exactly +once. A failure while the workflow is executing could result in a credit card charge, but no +corresponding bank account deposit. In a service-based architecture, we can’t simply wrap the two +tasks in a database transaction. Moreover, we might be interacting with third-party payment gateways +that we have limited control over. + +Durable execution frameworks are a way to provide *exactly-once semantics* for workflows. If a +task fails, the framework will re-execute the task, but will skip any RPCs or state changes +that the task made successfully before failing. Instead, the framework pretends to make the +call and returns the results from the previous call. This is possible because durable +execution frameworks log all RPCs and state changes to durable storage, much like a write-ahead log +[[45](/en/ch5#TemporalService), +[46](/en/ch5#Ewen2023)]. +[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution +using Temporal. + +##### Example 5-5. A Temporal workflow definition fragment for the payment workflow in [Figure 5-7](/en/ch5#fig_encoding_workflow). + +``` +from datetime import timedelta +from temporalio import workflow + +@workflow.defn +class PaymentWorkflow: + @workflow.run + async def run(self, payment: PaymentRequest) -> PaymentResult: + is_fraud = await workflow.execute_activity( + check_fraud, + payment, + start_to_close_timeout=timedelta(seconds=15), + ) + if is_fraud: + return PaymentResultFraudulent + credit_card_response = await workflow.execute_activity( + debit_credit_card, + payment, + start_to_close_timeout=timedelta(seconds=15), + ) + # ... +``` + +Frameworks like Temporal are not without their challenges. 
External services, such as the +third-party payment gateway in our example, must still provide an idempotent API. Developers must +remember to use unique IDs for these APIs to prevent duplicate execution +[[47](/en/ch5#Tenzer2024)]. +And because durable execution frameworks log each RPC in order, they expect a subsequent +execution to make the same RPCs in the same order. This makes code changes brittle: you +might introduce undefined behavior simply by re-ordering function calls +[[48](/en/ch5#TemporalWorkflow)]. +Instead of modifying the code of an existing workflow, it is safer to deploy a new version of the +code separately, so that re-executions of existing workflow invocations continue to use the old +version, and only new invocations use the new code +[[49](/en/ch5#Kleeman2024)]. + +Similarly, because durable execution frameworks expect to replay all code deterministically (the +same inputs produce the same outputs), nondeterministic code that uses random number generators or +system clocks is problematic [[48](/en/ch5#TemporalWorkflow)]. +Frameworks often provide their own deterministic implementations of such library functions, but +you have to remember to use them. In some cases, such as with Temporal’s workflowcheck tool, +frameworks provide static analysis tools to determine if nondeterministic behavior has been +introduced. + +###### Note + +Making code deterministic is a powerful idea, but tricky to do robustly. In +[“The Power of Determinism”](/en/ch9#sidebar_distributed_determinism) we will return to this topic. + +## Event-Driven Architectures + +In this final section, we will briefly look at *event-driven architectures*, which are another way +that encoded data can flow from one process to another. A request is called an *event* or *message*; +unlike RPC, the sender usually does not wait for the recipient to process the event. 
Moreover, +events are typically not sent to the recipient via a direct network connection, but go via an +intermediary called a *message broker* (also called an *event broker*, *message queue*, or +*message-oriented middleware*), which stores the message temporarily +[[50](/en/ch5#Perera2023)]. + +Using a message broker has several advantages compared to direct RPC: + +* It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system + reliability. +* It can automatically redeliver messages to a process that has crashed, and thus prevent messages from + being lost. +* It avoids the need for service discovery, since senders do not need to directly connect to the IP + address of the recipient. +* It allows the same message to be sent to several recipients. +* It logically decouples the sender from the recipient (the sender just publishes messages and + doesn’t care who consumes them). + +The communication via a message broker is *asynchronous*: the sender doesn’t wait for the message to +be delivered, but simply sends it and then forgets about it. It’s possible to implement a +synchronous RPC-like model by having the sender wait for a response on a separate channel. + +### Message brokers + +In the past, the landscape of message brokers was dominated by commercial enterprise software from +companies such as TIBCO, IBM WebSphere, and webMethods, before open source implementations such as +RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka became popular. More recently, cloud services +such as Amazon Kinesis, Azure Service Bus, and Google Cloud Pub/Sub have gained adoption. We will +compare them in more detail in [Link to Come]. + +The detailed delivery semantics vary by implementation and configuration, but in general, two +message distribution patterns are most often used: + +* One process adds a message to a named *queue*, and the broker delivers that message to a + *consumer* of that queue. 
If there are multiple consumers, one of them receives the message. +* One process publishes a message to a named *topic*, and the broker delivers that message to all + *subscribers* of that topic. If there are multiple subscribers, they all receive the message. + +Message brokers typically don’t enforce any particular data model—a message is just a sequence of +bytes with some metadata, so you can use any encoding format. A common approach is to use Protocol +Buffers, Avro, or JSON, and to deploy a schema registry alongside the message broker to store all +the valid schema versions and check their compatibility +[[19](/en/ch5#ConfluentSchemaReg), [21](/en/ch5#Kreps2015)]. +AsyncAPI, a messaging-based equivalent of OpenAPI, can also be used to specify the schema of +messages. + +Message brokers differ in terms of how durable their messages are. Many write messages to disk, so +that they are not lost in case the message broker crashes or needs to be restarted. Unlike +databases, many message brokers automatically delete messages again after they have been consumed. +Some brokers can be configured to store messages indefinitely, which you would require if you want +to use event sourcing (see [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events)). + +If a consumer republishes messages to another topic, you may need to be careful to preserve unknown +fields, to prevent the issue described previously in the context of databases +([Figure 5-1](/en/ch5#fig_encoding_preserve_field)). + +### Distributed actor frameworks + +The *actor model* is a programming model for concurrency in a single process. Rather than dealing +directly with threads (and the associated problems of race conditions, locking, and deadlock), logic +is encapsulated in *actors*. Each actor typically represents one client or entity, it may have some +local state (which is not shared with any other actor), and it communicates with other actors by +sending and receiving asynchronous messages. 
Message delivery is not guaranteed: in certain error +scenarios, messages will be lost. Since each actor processes only one message at a time, it doesn’t +need to worry about threads, and each actor can be scheduled independently by the framework. + +In *distributed actor frameworks* such as Akka, Orleans +[[51](/en/ch5#Bernstein2014)], +and Erlang/OTP, this programming model is used to scale an application across +multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient +are on the same node or different nodes. If they are on different nodes, the message is +transparently encoded into a byte sequence, sent over the network, and decoded on the other side. + +Location transparency works better in the actor model than in RPC, because the actor model already +assumes that messages may be lost, even within a single process. Although latency over the network +is likely higher than within the same process, there is less of a fundamental mismatch between local +and remote communication when using the actor model. + +A distributed actor framework essentially integrates a message broker and the actor programming +model into a single framework. However, if you want to perform rolling upgrades of your actor-based +application, you still have to worry about forward and backward compatibility, as messages may be +sent from a node running the new version to a node running the old version, and vice versa. This can +be achieved by using one of the encodings discussed in this chapter. + +# Summary + +In this chapter we looked at several ways of turning data structures into bytes on the network or +bytes on disk. We saw how the details of these encodings affect not only their efficiency, but more +importantly also the architecture of applications and your options for evolving them. 
+ +In particular, many services need to support rolling upgrades, where a new version of a service is +gradually deployed to a few nodes at a time, rather than deploying to all nodes simultaneously. +Rolling upgrades allow new versions of a service to be released without downtime (thus encouraging +frequent small releases over rare big releases) and make deployments less risky (allowing faulty +releases to be detected and rolled back before they affect a large number of users). These +properties are hugely beneficial for *evolvability*, the ease of making changes to an application. + +During rolling upgrades, or for various other reasons, we must assume that different nodes are +running different versions of our application’s code. Thus, it is important that all data +flowing around the system is encoded in a way that provides backward compatibility (new code can +read old data) and forward compatibility (old code can read new data). + +We discussed several data encoding formats and their compatibility properties: + +* Programming language–specific encodings are restricted to a single programming language and often + fail to provide forward and backward compatibility. +* Textual formats like JSON, XML, and CSV are widespread, and their compatibility depends on how you + use them. They have optional schema languages, which are sometimes helpful and sometimes a + hindrance. These formats are somewhat vague about datatypes, so you have to be careful with things + like numbers and binary strings. +* Binary schema–driven formats like Protocol Buffers and Avro allow compact, efficient encoding with + clearly defined forward and backward compatibility semantics. The schemas can be useful for + documentation and code generation in statically typed languages. However, these formats have the + downside that data needs to be decoded before it is human-readable.
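
As a rough illustration of the two compatibility directions, here is a minimal sketch using JSON. The record layout (`user_id`, `name`, and a later-added optional `email`) and the function names `read_v1` and `read_v2` are hypothetical, chosen only for this example; a schema-driven format such as Protocol Buffers or Avro would give you similar behavior through its own schema rules rather than hand-written code.

```python
import json

# Hypothetical record layouts, for illustration only:
#   version 1 writers emit {"user_id": ..., "name": ...}
#   version 2 writers add an optional "email" field


def read_v1(raw: bytes) -> dict:
    """Old reader: uses the fields it knows and sets aside the rest unchanged."""
    record = json.loads(raw)
    known = {"user_id": record["user_id"], "name": record["name"]}
    unknown = {k: v for k, v in record.items() if k not in known}
    return {"known": known, "unknown": unknown}


def read_v2(raw: bytes) -> dict:
    """New reader: falls back to a default when the newer field is absent."""
    record = json.loads(raw)
    return {
        "user_id": record["user_id"],
        "name": record["name"],
        "email": record.get("email"),  # None for data written by version 1
    }


old_data = json.dumps({"user_id": 1, "name": "Ada"}).encode()
new_data = json.dumps(
    {"user_id": 2, "name": "Grace", "email": "grace@example.com"}
).encode()

# Backward compatibility: new code reads old data.
print(read_v2(old_data))
# Forward compatibility: old code reads new data without failing.
print(read_v1(new_data))
```

Keeping the unrecognized fields in a separate dictionary, rather than discarding them, is one way an old reader could write a record back or republish it without silently dropping data added by newer code.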
+ +We also discussed several modes of dataflow, illustrating different scenarios in which data +encodings are important: + +* Databases, where the process writing to the database encodes the data and the process reading + from the database decodes it +* RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes + a response, and the client finally decodes the response +* Event-driven architectures (using message brokers or actors), where nodes communicate by sending + each other messages that are encoded by the sender and decoded by the recipient + +We can conclude that with a bit of care, backward/forward compatibility and rolling upgrades are +quite achievable. May your application’s evolution be rapid and your deployments be frequent. + +##### Footnotes + +##### References + +[[1](/en/ch5#CWE502-marker)] [CWE-502: +Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*, +July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y) + +[[2](/en/ch5#Breen2015-marker)] Steve Breen. +[What +Do WebLogic, WebSphere, JBoss, Jenkins, OpenNMS, and Your Application Have in Common? This +Vulnerability](https://foxglovesecurity.com/2015/11/06/what-do-weblogic-websphere-jboss-jenkins-opennms-and-your-application-have-in-common-this-vulnerability/). *foxglovesecurity.com*, November 2015. +Archived at [perma.cc/9U97-UVVD](https://perma.cc/9U97-UVVD) + +[[3](/en/ch5#McKenzie2013-marker)] Patrick McKenzie. +[What +the Rails Security Issue Means for Your Startup](https://www.kalzumeus.com/2013/01/31/what-the-rails-security-issue-means-for-your-startup/). *kalzumeus.com*, January 2013. +Archived at [perma.cc/2MBJ-7PZ6](https://perma.cc/2MBJ-7PZ6) + +[[4](/en/ch5#Goetz2019-marker)] Brian Goetz. +[Towards +Better Serialization](https://openjdk.org/projects/amber/design-notes/towards-better-serialization). *openjdk.org*, June 2019. 
+Archived at [perma.cc/UK6U-GQDE](https://perma.cc/UK6U-GQDE) + +[[5](/en/ch5#JvmSerializers-marker)] Eishay Smith. +[jvm-serializers wiki](https://github.com/eishay/jvm-serializers/wiki). +*github.com*, October 2023. +Archived at [perma.cc/PJP7-WCNG](https://perma.cc/PJP7-WCNG) + +[[6](/en/ch5#XMLSExp-marker)] [XML +Is a Poor Copy of S-Expressions](https://wiki.c2.com/?XmlIsaPoorCopyOfEssExpressions). *wiki.c2.com*, May 2013. +Archived at [perma.cc/7FAN-YBKL](https://perma.cc/7FAN-YBKL) + +[[7](/en/ch5#Evans2023-marker)] Julia Evans. +[Examples of floating +point problems](https://jvns.ca/blog/2023/01/13/examples-of-floating-point-problems/). *jvns.ca*, January 2023. +Archived at [perma.cc/M57L-QKKW](https://perma.cc/M57L-QKKW) + +[[8](/en/ch5#Harris2010-marker)] Matt Harris. +[Snowflake: +An Update and Some Very Important Information](https://groups.google.com/g/twitter-development-talk/c/ahbvo3VTIYI). Email to *Twitter Development +Talk* mailing list, October 2010. +Archived at [perma.cc/8UBV-MZ3D](https://perma.cc/8UBV-MZ3D) + +[[9](/en/ch5#Shafranovich2005-marker)] Yakov Shafranovich. +[RFC 4180: Common Format and MIME Type for +Comma-Separated Values (CSV) Files](https://tools.ietf.org/html/rfc4180). IETF, October 2005. + +[[10](/en/ch5#Coates2024-marker)] Andy Coates. +[Evolving JSON Schemas - Part I](https://www.creekservice.org/articles/2024/01/08/json-schema-evolution-part-1.html) and +[Part II](https://www.creekservice.org/articles/2024/01/09/json-schema-evolution-part-2.html). +*creekservice.org*, January 2024. Archived at +[perma.cc/MZW3-UA54](https://perma.cc/MZW3-UA54) and +[perma.cc/GT5H-WKZ5](https://perma.cc/GT5H-WKZ5) + +[[11](/en/ch5#Geneves2008-marker)] Pierre Genevès, Nabil Layaïda, and Vincent Quint. +[Ensuring Query Compatibility with Evolving XML Schemas](https://arxiv.org/abs/0811.4324). +INRIA Technical Report 6711, November 2008. + +[[12](/en/ch5#Bray2019-marker)] Tim Bray. 
+[Bits On the Wire](https://www.tbray.org/ongoing/When/201x/2019/11/17/Bits-On-the-Wire). +*tbray.org*, November 2019. +Archived at [perma.cc/3BT3-BQU3](https://perma.cc/3BT3-BQU3) + +[[13](/en/ch5#Slee2007-marker)] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. +[Thrift: Scalable +Cross-Language Services Implementation](https://thrift.apache.org/static/files/thrift-20070401.pdf). Facebook technical report, April 2007. +Archived at [perma.cc/22BS-TUFB](https://perma.cc/22BS-TUFB) + +[[14](/en/ch5#Kleppmann2012evolution-marker)] Martin Kleppmann. +[Schema +Evolution in Avro, Protocol Buffers and Thrift](https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html). *martin.kleppmann.com*, December 2012. +Archived at [perma.cc/E4R2-9RJT](https://perma.cc/E4R2-9RJT) + +[[15](/en/ch5#Cutting2009-marker)] Doug Cutting, Chad Walters, Jim Kellerman, et al. +[[PROPOSAL] +New Subproject: Avro](https://lists.apache.org/thread/z571w0r5jmfsjvnl0fq4fgg0vh28d3bk). Email thread on *hadoop-general* mailing list, +*lists.apache.org*, April 2009. +Archived at [perma.cc/4A79-BMEB](https://perma.cc/4A79-BMEB) + +[[16](/en/ch5#AvroSpec-marker)] Apache Software Foundation. +[Apache Avro 1.12.0 Specification](https://avro.apache.org/docs/1.12.0/specification/). +*avro.apache.org*, August 2024. +Archived at [perma.cc/C36P-5EBQ](https://perma.cc/C36P-5EBQ) + +[[17](/en/ch5#AvroParsing-marker)] Apache Software Foundation. +[Avro +schemas as LL(1) CFG definitions](https://avro.apache.org/docs/1.12.0/api/java/org/apache/avro/io/parsing/doc-files/parsing.html). *avro.apache.org*, August 2024. +Archived at [perma.cc/JB44-EM9Q](https://perma.cc/JB44-EM9Q) + +[[18](/en/ch5#Hoare2009-marker)] Tony Hoare. +[Null +References: The Billion Dollar Mistake](https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/). Talk at *QCon London*, March 2009. + +[[19](/en/ch5#ConfluentSchemaReg-marker)] Confluent, Inc. 
+[Schema Registry +Overview](https://docs.confluent.io/platform/current/schema-registry/index.html). *docs.confluent.io*, 2024. +Archived at [perma.cc/92C3-A9JA](https://perma.cc/92C3-A9JA) + +[[20](/en/ch5#Auradkar2015-marker)] Aditya Auradkar and Tom Quiggle. +[Introducing +Espresso—LinkedIn’s Hot New Distributed Document Store](https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store). *engineering.linkedin.com*, January 2015. +Archived at [perma.cc/FX4P-VW9T](https://perma.cc/FX4P-VW9T) + +[[21](/en/ch5#Kreps2015-marker)] Jay Kreps. +[Putting Apache Kafka to +Use: A Practical Guide to Building a Stream Data Platform (Part 2)](https://www.confluent.io/blog/event-streaming-platform-2/). *confluent.io*, +February 2015. Archived at [perma.cc/8UA4-ZS5S](https://perma.cc/8UA4-ZS5S) + +[[22](/en/ch5#Shapira2014-marker)] Gwen Shapira. +[The Problem of Managing +Schemas](https://www.oreilly.com/content/the-problem-of-managing-schemas/). *oreilly.com*, November 2014. +Archived at [perma.cc/BY8Q-RYV3](https://perma.cc/BY8Q-RYV3) + +[[23](/en/ch5#Larmouth1999-marker)] John Larmouth. +[*ASN.1 +Complete*](https://www.oss.com/asn1/resources/books-whitepapers-pubs/larmouth-asn1-book.pdf). Morgan Kaufmann, 1999. ISBN: 978-0-122-33435-1. +Archived at [perma.cc/GB7Y-XSXQ](https://perma.cc/GB7Y-XSXQ) + +[[24](/en/ch5#Kaliski1993-marker)] Burton S. Kaliski Jr. +[A Layman’s Guide to a Subset of ASN.1, +BER, and DER](https://luca.ntop.org/Teaching/Appunti/asn1.html). Technical Note, RSA Data Security, Inc., November 1993. +Archived at [perma.cc/2LMN-W9U8](https://perma.cc/2LMN-W9U8) + +[[25](/en/ch5#HoffmanAndrews2020-marker)] Jacob Hoffman-Andrews. +[A Warm Welcome to ASN.1 and DER](https://letsencrypt.org/docs/a-warm-welcome-to-asn1-and-der/). +*letsencrypt.org*, April 2020. +Archived at [perma.cc/CYT2-GPQ8](https://perma.cc/CYT2-GPQ8) + +[[26](/en/ch5#Walkin2010-marker)] Lev Walkin. 
+[Question: +Extensibility and Dropping Fields](https://lionet.info/asn1c/blog/2010/09/21/question-extensibility-removing-fields/). *lionet.info*, September 2010. +Archived at [perma.cc/VX8E-NLH3](https://perma.cc/VX8E-NLH3) + +[[27](/en/ch5#Xu2017-marker)] Jacqueline Xu. +[Online migrations at scale](https://stripe.com/blog/online-migrations). +*stripe.com*, February 2017. +Archived at [perma.cc/X59W-DK7Y](https://perma.cc/X59W-DK7Y) + +[[28](/en/ch5#Litt2020-marker)] Geoffrey Litt, Peter van Hardenberg, and Orion Henry. +[Project Cambria: Translate your data with lenses](https://www.inkandswitch.com/cambria/). +Technical Report, *Ink & Switch*, October 2020. +Archived at [perma.cc/WA4V-VKDB](https://perma.cc/WA4V-VKDB) + +[[29](/en/ch5#Helland2005_ch5-marker)] Pat Helland. +[Data on the Outside Versus Data on the +Inside](https://www.cidrdb.org/cidr2005/papers/P12.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), +January 2005. + +[[30](/en/ch5#Fielding2000-marker)] Roy Thomas Fielding. +[Architectural +Styles and the Design of Network-Based Software Architectures](https://ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf). PhD Thesis, University of +California, Irvine, 2000. Archived at [perma.cc/LWY9-7BPE](https://perma.cc/LWY9-7BPE) + +[[31](/en/ch5#Fielding2008-marker)] Roy Thomas Fielding. +[REST APIs must +be hypertext-driven](https://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven). *roy.gbiv.com*, October 2008. +Archived at [perma.cc/M2ZW-8ATG](https://perma.cc/M2ZW-8ATG) + +[[32](/en/ch5#Swagger2014-marker)] [OpenAPI +Specification Version 3.1.0](https://swagger.io/specification/). *swagger.io*, February 2021. +Archived at [perma.cc/3S6S-K5M4](https://perma.cc/3S6S-K5M4) + +[[33](/en/ch5#Henning2006-marker)] Michi Henning. +[The Rise and Fall of CORBA](https://cacm.acm.org/practice/the-rise-and-fall-of-corba/). +*Communications of the ACM*, volume 51, issue 8, pages 52–57, August 2008.
+[doi:10.1145/1378704.1378718](https://doi.org/10.1145/1378704.1378718) + +[[34](/en/ch5#Lacey2006-marker)] Pete Lacey. +[The S Stands for Simple](https://harmful.cat-v.org/software/xml/soap/simple). +*harmful.cat-v.org*, November 2006. +Archived at [perma.cc/4PMK-Z9X7](https://perma.cc/4PMK-Z9X7) + +[[35](/en/ch5#Tilkov2006-marker)] Stefan Tilkov. +[Interview: Pete Lacey Criticizes +Web Services](https://www.infoq.com/articles/pete-lacey-ws-criticism/). *infoq.com*, December 2006. +Archived at [perma.cc/JWF4-XY3P](https://perma.cc/JWF4-XY3P) + +[[36](/en/ch5#Bray2004-marker)] Tim Bray. +[The Loyal WS-Opposition](https://www.tbray.org/ongoing/When/200x/2004/09/18/WS-Oppo). +*tbray.org*, September 2004. +Archived at [perma.cc/J5Q8-69Q2](https://perma.cc/J5Q8-69Q2) + +[[37](/en/ch5#Birrell1984-marker)] Andrew D. Birrell and Bruce Jay Nelson. +[Implementing +Remote Procedure Calls](https://www.cs.princeton.edu/courses/archive/fall03/cs518/papers/rpc.pdf). *ACM Transactions on Computer Systems* (TOCS), +volume 2, issue 1, pages 39–59, February 1984. +[doi:10.1145/2080.357392](https://doi.org/10.1145/2080.357392) + +[[38](/en/ch5#Waldo1994-marker)] Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall. +[A Note on Distributed Computing](https://m.mirror.facebook.net/kde/devel/smli_tr-94-29.pdf). +Sun Microsystems Laboratories, Inc., Technical Report TR-94-29, November 1994. +Archived at [perma.cc/8LRZ-BSZR](https://perma.cc/8LRZ-BSZR) + +[[39](/en/ch5#Vinoski2008-marker)] Steve Vinoski. +[Convenience over +Correctness](https://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf). *IEEE Internet Computing*, volume 12, issue 4, pages 89–92, July 2008. +[doi:10.1109/MIC.2008.75](https://doi.org/10.1109/MIC.2008.75) + +[[40](/en/ch5#Leach2017idemptence-marker)] Brandur Leach. +[Designing robust and predictable APIs with +idempotency](https://stripe.com/blog/idempotency). *stripe.com*, February 2017. 
+Archived at [perma.cc/JD22-XZQT](https://perma.cc/JD22-XZQT) + +[[41](/en/ch5#Rose2023-marker)] Sam Rose. +[Load Balancing](https://samwho.dev/load-balancing/). *samwho.dev*, April 2023. +Archived at [perma.cc/Q7BA-9AE2](https://perma.cc/Q7BA-9AE2) + +[[42](/en/ch5#Hunt2014wn-marker)] Troy Hunt. +[Your API versioning is +wrong, which is why I decided to do it 3 different wrong ways](https://www.troyhunt.com/your-api-versioning-is-wrong-which-is/). *troyhunt.com*, +February 2014. Archived at [perma.cc/9DSW-DGR5](https://perma.cc/9DSW-DGR5) + +[[43](/en/ch5#Leach2017versioning-marker)] Brandur Leach. +[APIs as infrastructure: future-proofing Stripe with +versioning](https://stripe.com/blog/api-versioning). *stripe.com*, August 2017. +Archived at [perma.cc/L63K-USFW](https://perma.cc/L63K-USFW) + +[[44](/en/ch5#BPEL2007-marker)] Alexandre Alves, Assaf Arkin, Sid Askary, et al. +[Web Services Business Process +Execution Language Version 2.0](https://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html). *docs.oasis-open.org*, April 2007. + +[[45](/en/ch5#TemporalService-marker)] [What +is a Temporal Service?](https://docs.temporal.io/clusters) *docs.temporal.io*, 2024. +Archived at [perma.cc/32P3-CJ9V](https://perma.cc/32P3-CJ9V) + +[[46](/en/ch5#Ewen2023-marker)] Stephan Ewen. +[Why we built Restate](https://restate.dev/blog/why-we-built-restate/). *restate.dev*, +August 2023. Archived at [perma.cc/BJJ2-X75K](https://perma.cc/BJJ2-X75K) + +[[47](/en/ch5#Tenzer2024-marker)] Keith Tenzer and Joshua Smith. +[Idempotency and Durable +Execution](https://temporal.io/blog/idempotency-and-durable-execution). *temporal.io*, February 2024. +Archived at [perma.cc/9LGW-PCLU](https://perma.cc/9LGW-PCLU) + +[[48](/en/ch5#TemporalWorkflow-marker)] [What +is a Temporal Workflow?](https://docs.temporal.io/workflows) *docs.temporal.io*, 2024. +Archived at [perma.cc/B5C5-Y396](https://perma.cc/B5C5-Y396) + +[[49](/en/ch5#Kleeman2024-marker)] Jack Kleeman. 
+[Solving durable +execution’s immutability problem](https://restate.dev/blog/solving-durable-executions-immutability-problem/). *restate.dev*, February 2024. +Archived at [perma.cc/G55L-EYH5](https://perma.cc/G55L-EYH5) + +[[50](/en/ch5#Perera2023-marker)] Srinath Perera. +[Exploring +Event-Driven Architecture: A Beginner’s Guide for Cloud Native Developers](https://wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/). *wso2.com*, +August 2023. Archived at +[archive.org](https://web.archive.org/web/20240716204613/https%3A//wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/) + +[[51](/en/ch5#Bernstein2014-marker)] Philip A. Bernstein, Sergey Bykov, Alan +Geller, Gabriel Kliot, and Jorgen Thelin. +[Orleans: +Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical +Report MSR-TR-2014-41, March 2014. +Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF) diff --git a/content/en/ch6.md b/content/en/ch6.md index 687f6dc..bf1a2f0 100644 --- a/content/en/ch6.md +++ b/content/en/ch6.md @@ -1,107 +1,2176 @@ --- -linktitle: "6. Partitioning" -linkTitle: "6. Partitioning" +title: "6. Replication" weight: 206 breadcrumbs: false --- - -![](/img/ch6.png) - -> *Clearly, we must break away from the sequential and not limit the computers. We must state definitions and provide for priorities and descriptions of data. 
We must state relation‐ ships, not procedures.* +> *The major difference between a thing that might go wrong and a thing that cannot possibly go wrong +> is that when a thing that cannot possibly go wrong goes wrong it usually turns out to be impossible +> to get at or repair.* > -> ​ — Grace Murray Hopper, *Management and the Computer of the Future* (1962) +> Douglas Adams, *Mostly Harmless* (1992) -------------- +*Replication* means keeping a copy of the same data on multiple machines that are connected via a +network. As discussed in [“Distributed versus Single-Node Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_distributed), there are several reasons +why you might want to replicate data: +* To keep data geographically close to your users (and thus reduce access latency) +* To allow the system to continue working even if some of its parts have failed (and thus + increase availability) +* To scale out the number of machines that can serve read queries (and thus increase read + throughput) +In this chapter we will assume that your dataset is small enough that each machine can hold a copy of +the entire dataset. In [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding) we will relax that assumption and discuss *sharding* +(*partitioning*) of datasets that are too big for a single machine. In later chapters we will discuss +various kinds of faults that can occur in a replicated data system, and how to deal with them. -In [Chapter 5](/en/ch5) we discussed replication—that is, having multiple copies of the same data on different nodes. 
For very large datasets, or very high query throughput, that is not sufficient: we need to break the data up into *partitions*, also known as *sharding*.[^i] +If the data that you’re replicating does not change over time, then replication is easy: you just +need to copy the data to every node once, and you’re done. All of the difficulty in replication lies +in handling *changes* to replicated data, and that’s what this chapter is about. We will discuss +three families of algorithms for replicating changes between nodes: *single-leader*, *multi-leader*, +and *leaderless* replication. Almost all distributed databases use one of these three approaches. +They all have various pros and cons, which we will examine in detail. -[^i]: Partitioning, as discussed in this chapter, is a way of intentionally breaking a large database down into smaller ones. It has nothing to do with *network partitions* (netsplits), a type of fault in the network between nodes. We will discuss such faults in [Chapter 8](/en/ch8). +There are many trade-offs to consider with replication: for example, whether to use synchronous or +asynchronous replication, and how to handle failed replicas. Those are often configuration options +in databases, and although the details vary by database, the general principles are similar across +many different implementations. We will discuss the consequences of such choices in this chapter. -> #### Terminological confusion -> -> What we call a ***partition*** here is called a ***shard*** in MongoDB, Elasticsearch, and SolrCloud; it’s known as a ***region*** in HBase, a ***tablet*** in Bigtable, a ***vnode*** in Cassandra and Riak, and a ***vBucket*** in Couchbase. However, ***partitioning*** is the most established term, so we’ll stick with that. 
-> +Replication of databases is an old topic—the principles haven’t changed much since they were +studied in the 1970s +[[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lindsay1979_ch6)], +because the fundamental constraints of networks have remained the same. Despite being so old, +concepts such as *eventual consistency* still cause confusion. In [“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag) we will +get more precise about eventual consistency and discuss things like the *read-your-writes* and +*monotonic reads* guarantees. -Normally, partitions are defined in such a way that each piece of data (each record, row, or document) belongs to exactly one partition. There are various ways of achiev‐ ing this, which we discuss in depth in this chapter. In effect, each partition is a small database of its own, although the database may support operations that touch multi‐ ple partitions at the same time. +# Backups and replication -The main reason for wanting to partition data is *scalability*. Different partitions can be placed on different nodes in a shared-nothing cluster (see the introduction to [Part II](/en/part-ii) for a definition of *shared nothing*). Thus, a large dataset can be distributed across many disks, and the query load can be distributed across many processors. +You might be wondering whether you still need backups if you have replication. The answer is yes, +because they have different purposes: replicas quickly reflect writes from one node on other nodes, +but backups store old snapshots of the data so that you can go back in time. If you accidentally +delete some data, replication doesn’t help since the deletion will have also been propagated to the +replicas, so you need a backup if you want to restore the deleted data. 
-For queries that operate on a single partition, each node can independently execute the queries for its own partition, so query throughput can be scaled by adding more nodes. Large, complex queries can potentially be parallelized across many nodes, although this gets significantly harder. +In fact, replication and backups are often complementary to each other. Backups are sometimes part +of the process of setting up replication, as we shall see in [“Setting Up New Followers”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_new_replica). +Conversely, archiving replication logs can be part of a backup process. -Partitioned databases were pioneered in the 1980s by products such as Teradata and Tandem NonStop SQL [1], and more recently rediscovered by NoSQL databases and Hadoop-based data warehouses. Some systems are designed for transactional work‐ loads, and others for analytics (see “[Transaction Processing or Analytics?](/en/ch3#transaction-processing-or-analytics?)”): this difference affects how the system is tuned, but the fundamentals of partitioning apply to both kinds of workloads. +Some databases internally maintain immutable snapshots of past states, which serve as a kind of +internal backup. However, this means keeping old versions of the data on the same storage media as +the current state. If you have a large amount of data, it can be cheaper to keep the backups of old +data in an object store that is optimized for infrequently-accessed data, and to store only the +current state of the database in primary storage. -In this chapter we will first look at different approaches for partitioning large datasets and observe how the indexing of data interacts with partitioning. We’ll then talk about rebalancing, which is necessary if you want to add or remove nodes in your cluster. Finally, we’ll get an overview of how databases route requests to the right partitions and execute queries. 
+# Single-Leader Replication +Each node that stores a copy of the database is called a *replica*. With multiple replicas, a +question inevitably arises: how do we ensure that all the data ends up on all the replicas? -## …… +Every write to the database needs to be processed by every replica; otherwise, the replicas would no +longer contain the same data. The most common solution is called *leader-based replication*, +*primary-backup*, or *active/passive*. It works as follows (see +[Figure 6-1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_leader_follower)): +1. One of the replicas is designated the *leader* (also known as *primary* or *source* + [[2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gryp2020)]). + When clients want to write to the database, they must send their requests to the leader, which + first writes the new data to its local storage. +2. The other replicas are known as *followers* (*read replicas*, *secondaries*, or *hot standbys*). + Whenever the leader writes new data to its local storage, it also sends the data change to all of + its followers as part of a *replication log* or *change stream*. Each follower takes the log + from the leader and updates its local copy of the database accordingly, by applying all writes in + the same order as they were processed on the leader. +3. When a client wants to read from the database, it can query either the leader or any of the + followers. However, writes are only accepted on the leader (the followers are read-only from the + client’s point of view). +![ddia 0601](/fig/ddia_0601.png) -## Summary +###### Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas. -In this chapter we explored different ways of partitioning a large dataset into smaller subsets. 
Partitioning is necessary when you have so much data that storing and pro‐ cessing it on a single machine is no longer feasible. +If the database is sharded (see [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding)), each shard has one leader. Different shards may +have their leaders on different nodes, but each shard must nevertheless have one leader node. In +[“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader) we will discuss an alternative model in which a system may have +multiple leaders for the same shard at the same time. -The goal of partitioning is to spread the data and query load evenly across multiple machines, avoiding hot spots (nodes with disproportionately high load). This requires choosing a partitioning scheme that is appropriate to your data, and reba‐ lancing the partitions when nodes are added to or removed from the cluster. +Single-leader replication is very widely used. It’s a built-in feature of many relational databases, +such as PostgreSQL, MySQL, Oracle Data Guard +[[3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Oracle2019)], +and SQL Server’s Always On Availability Groups +[[4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#AlwaysOn2012)]. +It is also used in some document databases such as MongoDB and DynamoDB +[[5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Elhemali2022_ch6)], +message brokers such as Kafka, replicated block devices such as DRBD, and some network filesystems. 
+Many consensus algorithms such as Raft, which is used for replication in CockroachDB +[[6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Taft2020_ch6)], +TiDB [[7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Huang2020_ch6)], +etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and +automatically elect a new leader if the old one fails (we will discuss consensus in more detail in +[Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency)). -We discussed two main approaches to partitioning: +###### Note -* ***Key range partitioning***, where keys are sorted, and a partition owns all the keys from some minimum up to some maximum. Sorting has the advantage that effi‐ cient range queries are possible, but there is a risk of hot spots if the application often accesses keys that are close together in the sorted order. +In older documents you may see the term *master–slave replication*. It means the same as +leader-based replication, but the term should be avoided as it is widely considered offensive +[[8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Knodel2023)]. - In this approach, partitions are typically rebalanced dynamically by splitting the range into two subranges when a partition gets too big. +## Synchronous Versus Asynchronous Replication -* ***Hash partitioning***, where a hash function is applied to each key, and a partition owns a range of hashes. This method destroys the ordering of keys, making range queries inefficient, but may distribute load more evenly. +An important detail of a replicated system is whether the replication happens *synchronously* or +*asynchronously*. 
(In relational databases, this is often a configurable option; other systems are +often hardcoded to be either one or the other.) - When partitioning by hash, it is common to create a fixed number of partitions in advance, to assign several partitions to each node, and to move entire parti‐ tions from one node to another when nodes are added or removed. Dynamic partitioning can also be used. +Think about what happens in [Figure 6-1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_leader_follower), where the user of a website updates +their profile image. At some point in time, the client sends the update request to the leader; +shortly afterward, it is received by the leader. At some point, the leader forwards the data change +to the followers. Eventually, the leader notifies the client that the update was successful. +[Figure 6-2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_sync_replication) shows one possible way how the timings could work out. -Hybrid approaches are also possible, for example with a compound key: using one part of the key to identify the partition and another part for the sort order. +![ddia 0602](/fig/ddia_0602.png) -We also discussed the interaction between partitioning and secondary indexes. A sec‐ ondary index also needs to be partitioned, and there are two methods: +###### Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower. -* ***Document-partitioned indexes*** (local indexes), where the secondary indexes are stored in the same partition as the primary key and value. This means that only a single partition needs to be updated on write, but a read of the secondary index requires a scatter/gather across all partitions. 
+In the example of [Figure 6-2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_sync_replication), the replication to follower 1 is +*synchronous*: the leader waits until follower 1 has confirmed that it received the write before +reporting success to the user, and before making the write visible to other clients. The replication +to follower 2 is *asynchronous*: the leader sends the message, but doesn’t wait for a response from +the follower. -* ***Term-partitioned indexes*** (global indexes), where the secondary indexes are partitioned separately, using the indexed values. An entry in the secondary index may include records from all partitions of the primary key. When a document is writ‐ ten, several partitions of the secondary index need to be updated; however, a read can be served from a single partition. +The diagram shows that there is a substantial delay before follower 2 processes the message. +Normally, replication is quite fast: most database systems apply changes to followers in less than a +second. However, there is no guarantee of how long it might take. There are circumstances when +followers might fall behind the leader by several minutes or more; for example, if a follower is +recovering from a failure, if the system is operating near maximum capacity, or if there are network +problems between the nodes. -Finally, we discussed techniques for routing queries to the appropriate partition, which range from simple partition-aware load balancing to sophisticated parallel query execution engines. +The advantage of synchronous replication is that the follower is guaranteed to have an up-to-date +copy of the data that is consistent with the leader. If the leader suddenly fails, we can be sure +that the data is still available on the follower. 
The disadvantage is that if the synchronous +follower doesn’t respond (because it has crashed, or there is a network fault, or for any other +reason), the write cannot be processed. The leader must block all writes and wait until the +synchronous replica is available again. -By design, every partition operates mostly independently—that’s what allows a parti‐ tioned database to scale to multiple machines. However, operations that need to write to several partitions can be difficult to reason about: for example, what happens if the write to one partition succeeds, but another fails? We will address that question in the following chapters. +For that reason, it is impracticable for all followers to be synchronous: any one node outage would +cause the whole system to grind to a halt. In practice, if a database offers synchronous +replication, it often means that *one* of the followers is synchronous, and the others are +asynchronous. If the synchronous follower becomes unavailable or slow, one of the asynchronous +followers is made synchronous. This guarantees that you have an up-to-date copy of the data on at +least two nodes: the leader and one synchronous follower. This configuration is sometimes also +called *semi-synchronous*. +In some systems, a *majority* (e.g., 3 out of 5 replicas, including the leader) of replicas is +updated synchronously, and the remaining minority is asynchronous. This is an example of a *quorum*, +which we will discuss further in [“Quorums for reading and writing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_quorum_condition). Majority quorums are often +used in systems that use a consensus protocol for automatic leader election, which we will return to +in [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency). 
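The difference between fully asynchronous, semi-synchronous, and majority-quorum acknowledgment can be illustrated with a toy in-process model. This is only a sketch under simplifying assumptions: the `Leader` and `Follower` classes and the `required_acks` parameter are hypothetical names, followers are contacted in a simple loop rather than concurrently, and a write that cannot be confirmed returns `False` instead of blocking.

```python
from dataclasses import dataclass, field


@dataclass
class Follower:
    name: str
    reachable: bool = True
    log: list = field(default_factory=list)

    def replicate(self, record) -> bool:
        """Append the record and acknowledge it, unless the node is unreachable."""
        if not self.reachable:
            return False
        self.log.append(record)
        return True


@dataclass
class Leader:
    followers: list
    # Hypothetical knob: 0 = fully asynchronous, 1 = semi-synchronous,
    # majority_follower_acks(n) = majority quorum.
    required_acks: int
    log: list = field(default_factory=list)

    def write(self, record) -> bool:
        """Return True only if enough followers confirmed the write."""
        self.log.append(record)
        acks = sum(1 for f in self.followers if f.replicate(record))
        return acks >= self.required_acks


def majority_follower_acks(n_replicas: int) -> int:
    """Follower acks needed for a majority when the leader counts as one replica."""
    return n_replicas // 2
```

With five replicas including the leader, `majority_follower_acks(5)` is 2, which matches the "3 out of 5" example: the leader plus two followers. Setting `required_acks=0` models the fully asynchronous case, in which the leader never waits.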
+Sometimes, leader-based replication is configured to be completely asynchronous. In this case, if the +leader fails and is not recoverable, any writes that have not yet been replicated to followers are +lost. This means that a write is not guaranteed to be durable, even if it has been confirmed to the +client. However, a fully asynchronous configuration has the advantage that the leader can continue +processing writes, even if all of its followers have fallen behind. -## References +Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless +widely used, especially if there are many followers or if they are geographically distributed +[[9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2018)]. +We will return to this issue in [“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag). -1. David J. DeWitt and Jim N. Gray: “[Parallel Database Systems: The Future of High Performance Database Systems](http://www.cs.cmu.edu/~pavlo/courses/fall2013/static/papers/dewittgray92.pdf),” *Communications of the ACM*, volume 35, number 6, pages 85–98, June 1992. [doi:10.1145/129888.129894](http://dx.doi.org/10.1145/129888.129894) -1. Lars George: “[HBase vs. BigTable Comparison](http://www.larsgeorge.com/2009/11/hbase-vs-bigtable-comparison.html),” *larsgeorge.com*, November 2009. -1. “[The Apache HBase Reference Guide](https://hbase.apache.org/book/book.html),” Apache Software Foundation, *hbase.apache.org*, 2014. -1. MongoDB, Inc.: “[New Hash-Based Sharding Feature in MongoDB 2.4](https://web.archive.org/web/20230610080235/https://www.mongodb.com/blog/post/new-hash-based-sharding-feature-in-mongodb-24),” *blog.mongodb.org*, April 10, 2013. -1. 
Ikai Lan: “[App Engine Datastore Tip: Monotonically Increasing Values Are Bad](http://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/),” *ikaisays.com*, January 25, 2011. -1. Martin Kleppmann: “[Java's hashCode Is Not Safe for Distributed Systems](http://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html),” *martin.kleppmann.com*, June 18, 2012. -1. David Karger, Eric Lehman, Tom Leighton, et al.: “[Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web](https://www.akamai.com/site/en/documents/research-paper/consistent-hashing-and-random-trees-distributed-caching-protocols-for-relieving-hot-spots-on-the-world-wide-web-technical-publication.pdf),” at *29th Annual ACM Symposium on Theory of Computing* (STOC), pages 654–663, 1997. [doi:10.1145/258533.258660](http://dx.doi.org/10.1145/258533.258660) -1. John Lamping and Eric Veach: “[A Fast, Minimal Memory, Consistent Hash Algorithm](http://arxiv.org/pdf/1406.2294.pdf),” *arxiv.org*, June 2014. -1. Eric Redmond: “[A Little Riak Book](https://web.archive.org/web/20160807123307/http://www.littleriakbook.com/),” Version 1.4.0, Basho Technologies, September 2013. -1. “[Couchbase 2.5 Administrator Guide](http://docs.couchbase.com/couchbase-manual-2.5/cb-admin/),” Couchbase, Inc., 2014. -1. Avinash Lakshman and Prashant Malik: “[Cassandra – A Decentralized Structured Storage System](http://www.cs.cornell.edu/Projects/ladis2009/papers/Lakshman-ladis2009.PDF),” at *3rd ACM SIGOPS International Workshop on Large Scale Distributed Systems and Middleware* (LADIS), October 2009. -1. Jonathan Ellis: “[Facebook’s Cassandra Paper, Annotated and Compared to Apache Cassandra 2.0](https://docs.datastax.com/en/articles/cassandra/cassandrathenandnow.html),” *docs.datastax.com*, September 12, 2013. -1. 
“[Introduction to Cassandra Query Language](https://docs.datastax.com/en/cql-oss/3.1/cql/cql_intro_c.html),” DataStax, Inc., 2014. -1. Samuel Axon: “[3% of Twitter's Servers Dedicated to Justin Bieber](https://web.archive.org/web/20201109041636/https://mashable.com/2010/09/07/justin-bieber-twitter/?europe=true),” *mashable.com*, September 7, 2010. -1. “[Riak KV Docs](https://docs.riak.com/riak/kv/latest/index.html),” *docs.riak.com*. -1. Richard Low: “[The Sweet Spot for Cassandra Secondary Indexing](https://web.archive.org/web/20190831132955/http://www.wentnet.com/blog/?p=77),” *wentnet.com*, October 21, 2013. -1. Zachary Tong: “[Customizing Your Document Routing](https://www.elastic.co/blog/customizing-your-document-routing/),” *elastic.co*, June 3, 2013. -1. “[Apache Solr Reference Guide](https://cwiki.apache.org/confluence/display/solr/Apache+Solr+Reference+Guide),” Apache Software Foundation, 2014. -1. Andrew Pavlo: “[H-Store Frequently Asked Questions](http://hstore.cs.brown.edu/documentation/faq/),” *hstore.cs.brown.edu*, October 2013. -1. “[Amazon DynamoDB Developer Guide](http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/),” Amazon Web Services, Inc., 2014. -1. Rusty Klophaus: “[Difference Between 2I and Search](https://web.archive.org/web/20150926053350/http://lists.basho.com/pipermail/riak-users_lists.basho.com/2011-October/006220.html),” email to *riak-users* mailing list, *lists.basho.com*, October 25, 2011. -1. Donald K. Burleson: “[Object Partitioning in Oracle](http://www.dba-oracle.com/art_partit.htm),”*dba-oracle.com*, November 8, 2000. -1. Eric Evans: “[Rethinking Topology in Cassandra](http://www.slideshare.net/jericevans/virtual-nodes-rethinking-topology-in-cassandra),” at *ApacheCon Europe*, November 2012. -1. Rafał Kuć: “[Reroute API Explained](https://web.archive.org/web/20190706215750/http://elasticsearchserverbook.com/reroute-api-explained/),” *elasticsearchserverbook.com*, September 30, 2013. -1. 
“[Project Voldemort Documentation](https://web.archive.org/web/20250107145644/http://www.project-voldemort.com/voldemort/),” *project-voldemort.com*. -1. Enis Soztutar: “[Apache HBase Region Splitting and Merging](http://hortonworks.com/blog/apache-hbase-region-splitting-and-merging/),” *hortonworks.com*, February 1, 2013. -1. Brandon Williams: “[Virtual Nodes in Cassandra 1.2](http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2),” *datastax.com*, December 4, 2012. -1. Richard Jones: “[libketama: Consistent Hashing Library for Memcached Clients](https://www.metabrew.com/article/libketama-consistent-hashing-algo-memcached-clients),” *metabrew.com*, April 10, 2007. -1. Branimir Lambov: “[New Token Allocation Algorithm in Cassandra 3.0](http://www.datastax.com/dev/blog/token-allocation-algorithm),” *datastax.com*, January 28, 2016. -1. Jason Wilder: “[Open-Source Service Discovery](http://jasonwilder.com/blog/2014/02/04/service-discovery-in-the-cloud/),” *jasonwilder.com*, February 2014. -1. Kishore Gopalakrishna, Shi Lu, Zhen Zhang, et al.: “[Untangling Cluster Management with Helix](http://www.socc2012.org/helix_onecol.pdf?attredirects=0),” at *ACM Symposium on Cloud Computing* (SoCC), October 2012. [doi:10.1145/2391229.2391248](http://dx.doi.org/10.1145/2391229.2391248) -1. “[Moxi 1.8 Manual](http://docs.couchbase.com/moxi-manual-1.8/),” Couchbase, Inc., 2014. -1. Shivnath Babu and Herodotos Herodotou: “[Massively Parallel Databases and MapReduce Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/11/db-mr-survey-final.pdf),” *Foundations and Trends in Databases*, volume 5, number 1, pages 1–104, November 2013. [doi:10.1561/1900000036](http://dx.doi.org/10.1561/1900000036) +## Setting Up New Followers + +From time to time, you need to set up new followers—perhaps to increase the number of replicas, +or to replace failed nodes. How do you ensure that the new follower has an accurate copy of the +leader’s data? 
+ +Simply copying data files from one node to another is typically not sufficient: clients are +constantly writing to the database, and the data is always in flux, so a standard file copy would +see different parts of the database at different points in time. The result might not make any +sense. + +You could make the files on disk consistent by locking the database (making it unavailable for +writes), but that would go against our goal of high availability. Fortunately, setting up a +follower can usually be done without downtime. Conceptually, the process looks like this: + +1. Take a consistent snapshot of the leader’s database at some point in time—if possible, without + taking a lock on the entire database. Most databases have this feature, as it is also required + for backups. In some cases, third-party tools are needed, such as Percona XtraBackup for MySQL. +2. Copy the snapshot to the new follower node. +3. The follower connects to the leader and requests all the data changes that have happened since + the snapshot was taken. This requires that the snapshot is associated with an exact position in + the leader’s replication log. That position has various names: for example, PostgreSQL calls it + the *log sequence number*; MySQL has two mechanisms, *binlog coordinates* and *global transaction + identifiers* (GTIDs). +4. When the follower has processed the backlog of data changes since the snapshot, we say it has + *caught up*. It can now continue to process data changes from the leader as they happen. + +The practical steps of setting up a follower vary significantly by database. In some systems the +process is fully automated, whereas in others it can be a somewhat arcane multi-step workflow that +needs to be manually performed by an administrator. + +You can also archive the replication log to an object store; along with periodic snapshots of the +whole database in the object store this is a good way of implementing database backups and disaster +recovery. 
You can also perform steps 1 and 2 of setting up a new follower by downloading those files +from the object store. For example, WAL-G does this for PostgreSQL, MySQL, and SQL Server, and +Litestream does the equivalent for SQLite. + +# Databases backed by object storage + +Object storage can be used for more than archiving data. Many databases are beginning to use object +stores such as Amazon Web Services S3, Google Cloud Storage, and Azure Blob Storage to serve data +for live queries. Storing database data in object storage has many benefits: + +* Object storage is inexpensive compared to other cloud storage options, which allows cloud databases + to store less-often queried data on cheaper, higher-latency storage while serving the working set + from memory, SSDs, and NVMe. +* Object stores also provide multi-zone, dual-region, or multi-region replication with very high + durability guarantees. This also allows databases to bypass inter-zone network fees. +* Databases can use an object store’s *conditional write* feature, essentially a *compare-and-set* + (CAS) operation, to implement transactions and leadership election + [[10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Morling2024_ch6), + [11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Chandramohan2024)]. +* Storing data from multiple databases in the same object store can simplify data integration, + particularly when open formats such as Apache Parquet and Apache Iceberg are used. + +These benefits dramatically simplify the database architecture by shifting the responsibility of +transactions, leadership election, and replication to object storage. + +Systems that adopt object storage for replication must grapple with some trade-offs. Notably, object +stores have much higher read and write latencies than local disks or virtual block devices such as +EBS.
Many cloud providers also charge a per-API call fee, which forces systems to batch reads and +writes to reduce cost. Such batching further increases latency. Moreover, many object stores do not +offer standard filesystem interfaces. This prevents systems that lack object storage integration +from leveraging object storage. Interfaces such as *filesystem in userspace* (FUSE) allow operators +to mount object store buckets as filesystems that applications can use without knowing their data is +stored on object storage. Still, many FUSE interfaces to object stores lack POSIX features such as +non-sequential writes or symlinks, which systems might depend on. + +Different systems deal with these trade-offs in various ways. Some introduce a *tiered storage* +architecture that places less frequently accessed data on object storage while new or frequently +accessed data is kept on faster storage devices such as SSDs, NVMe, or even in memory. Other systems +use object storage as their primary storage tier, but use a separate low-latency storage system such +as Amazon’s EBS or Neon’s Safekeepers +[[12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kelvich2022)] +to store their WAL. Recently, some systems have gone even farther by adopting a +*zero-disk architecture* (ZDA). ZDA-based systems persist all data to object storage and use disks +and memory strictly for caching. This allows nodes to have no persistent state, which dramatically +simplifies operations. WarpStream, Confluent Freight, Buf’s Bufstream, and Redpanda Serverless are +all Kafka-compatible systems built using a zero-disk architecture. Nearly every modern cloud data +warehouse also adopts such an architecture, as does Turbopuffer (a vector search engine), and +SlateDB (a cloud-native LSM storage engine).
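To make the conditional-write idea from the list of benefits more concrete, here is a minimal sketch of leadership election built on a compare-and-set operation. It uses an in-memory stand-in rather than any real object store: `InMemoryObjectStore`, `put_if_version`, and `try_become_leader` are hypothetical names for illustration only, and a real system would also need leases or expiry so that a crashed leader can be replaced.

```python
class InMemoryObjectStore:
    """Toy stand-in for an object store that supports conditional writes."""

    def __init__(self):
        self._objects = {}  # key -> (version, value)

    def get(self, key):
        """Return (version, value), or None if the key does not exist."""
        return self._objects.get(key)

    def put_if_version(self, key, value, expected_version):
        """Write only if the stored version equals expected_version.

        expected_version=None means "only if the key does not exist yet".
        """
        current = self._objects.get(key)
        current_version = current[0] if current else None
        if current_version != expected_version:
            return False  # someone else wrote first; the CAS fails
        new_version = 1 if current_version is None else current_version + 1
        self._objects[key] = (new_version, value)
        return True


def try_become_leader(store, node_id, key="leader"):
    """Claim leadership by creating the leader object if nobody holds it."""
    if store.get(key) is not None:
        return False
    # The conditional write resolves races: at most one creator succeeds.
    return store.put_if_version(key, node_id, expected_version=None)
```

If two nodes call `try_become_leader` against the same store, only the first call succeeds, because the second conditional write finds that the object already exists.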
+ +## Handling Node Outages + +Any node in the system can go down, perhaps unexpectedly due to a fault, but just as likely due to +planned maintenance (for example, rebooting a machine to install a kernel security patch). Being +able to reboot individual nodes without downtime is a big advantage for operations and maintenance. +Thus, our goal is to keep the system as a whole running despite individual node failures, and to keep +the impact of a node outage as small as possible. + +How do you achieve high availability with leader-based replication? + +### Follower failure: Catch-up recovery + +On its local disk, each follower keeps a log of the data changes it has received from the leader. If +a follower crashes and is restarted, or if the network between the leader and the follower is +temporarily interrupted, the follower can recover quite easily: from its log, it knows the last +transaction that was processed before the fault occurred. Thus, the follower can connect to the +leader and request all the data changes that occurred during the time when the follower was +disconnected. When it has applied these changes, it has caught up to the leader and can continue +receiving a stream of data changes as before. + +Although follower recovery is conceptually simple, it can be challenging in terms of performance: if +the database has a high write throughput or if the follower has been offline for a long time, there +might be a lot of writes to catch up on. There will be high load on both the recovering follower and +the leader (which needs to send the backlog of writes to the follower) while this catch-up is +ongoing. 
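Catch-up recovery can be sketched as follows. This is an illustrative model only: `LeaderLog`, `FollowerState`, and the integer log positions are assumptions made for the example, not how any particular database represents its replication log.

```python
class LeaderLog:
    """Append-only log of data changes, each tagged with a position."""

    def __init__(self):
        self.entries = []  # list of (position, (key, value))

    def append(self, change):
        position = len(self.entries) + 1
        self.entries.append((position, change))
        return position

    def changes_since(self, position):
        """Return every entry recorded after the given position."""
        return [entry for entry in self.entries if entry[0] > position]


class FollowerState:
    def __init__(self):
        self.last_applied = 0  # position of the last change processed
        self.data = {}

    def apply(self, position, change):
        key, value = change
        self.data[key] = value
        self.last_applied = position

    def catch_up(self, leader_log):
        """Request and apply the backlog; return how many changes were applied."""
        backlog = leader_log.changes_since(self.last_applied)
        for position, change in backlog:
            self.apply(position, change)
        return len(backlog)
```

Because the follower remembers `last_applied`, it can ask for exactly the changes it missed while it was disconnected. The size of the backlog returned by `catch_up` is what drives the load on both nodes during recovery.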
+ +The leader can delete its log of writes once all followers have confirmed that they have processed +it, but if a follower is unavailable for a long time, the leader faces a choice: either it retains +the log until the follower recovers and catches up (at the risk of running out of disk space on the +leader), or it deletes the log that the unavailable follower has not yet acknowledged (in which case +the follower won’t be able to recover from the log, and will have to be restored from a backup when +it comes back). + +### Leader failure: Failover + +Handling a failure of the leader is trickier: one of the followers needs to be promoted to be the +new leader, clients need to be reconfigured to send their writes to the new leader, and the other +followers need to start consuming data changes from the new leader. This process is called +*failover*. + +Failover can happen manually (an administrator is notified that the leader has failed and takes the +necessary steps to make a new leader) or automatically. An automatic failover process usually +consists of the following steps: + +1. *Determining that the leader has failed.* There are many things that could potentially go wrong: + crashes, power outages, network issues, and more. There is no foolproof way of detecting what + has gone wrong, so most systems simply use a timeout: nodes frequently bounce messages back and + forth between each other, and if a node doesn’t respond for some period of time—say, 30 + seconds—it is assumed to be dead. (If the leader is deliberately taken down for planned + maintenance, this doesn’t apply.) +2. *Choosing a new leader.* This could be done through an election process (where the leader is chosen by + a majority of the remaining replicas), or a new leader could be appointed by a previously + established *controller node* + [[13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Fontaine2021)]. 
+ The best candidate for leadership is usually the replica with the most up-to-date data changes + from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader + is a consensus problem, discussed in detail in [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency). +3. *Reconfiguring the system to use the new leader.* Clients now need to send + their write requests to the new leader (we discuss this + in [“Request Routing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_routing)). If the old leader comes back, it might still believe that it is + the leader, not realizing that the other replicas have + forced it to step down. The system needs to ensure that the old leader becomes a follower and + recognizes the new leader. + +Failover is fraught with things that can go wrong: + +* If asynchronous replication is used, the new leader may not have received all the writes from the old + leader before it failed. If the former leader rejoins the cluster after a new leader has been + chosen, what should happen to those writes? The new leader may have received conflicting writes + in the meantime. The most common solution is for the old leader’s unreplicated writes to simply be + discarded, which means that writes you believed to be committed actually weren’t durable after all. +* Discarding writes is especially dangerous if other storage systems outside of the database need to + be coordinated with the database contents. + For example, in one incident at GitHub + [[14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Newland2012)], + an out-of-date MySQL follower + was promoted to leader. 
The database used an autoincrementing counter to assign primary keys to + new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some + primary keys that were previously assigned by the old leader. These primary keys were also used in + a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis, + which caused some private data to be disclosed to the wrong users. +* In certain fault scenarios (see [Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed)), it could happen that two nodes both believe + that they are the leader. This situation is called *split brain*, and it is dangerous: if both + leaders accept writes, and there is no process for resolving conflicts (see + [“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some + systems have a mechanism to shut down one node if two leaders are detected. However, if this + mechanism is not carefully designed, you can end up with both nodes being shut down + [[15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Imbriaco2012_ch6)]. + Moreover, there is a risk that by the time the split brain is detected and the old node is shut + down, it is already too late and data has already been corrupted. +* What is the right timeout before the leader is declared dead? A longer timeout means a longer + time to recovery in the case where the leader fails. However, if the timeout is too short, there + could be unnecessary failovers. For example, a temporary load spike could cause a node’s response + time to increase above the timeout, or a network glitch could cause delayed packets. 
If the system + is already struggling with high load or network problems, an unnecessary failover is likely to + make the situation worse, not better. + +###### Note + +Guarding against split brain by limiting or shutting down old leaders is known as *fencing* or, more +emphatically, *Shoot The Other Node In The Head* (STONITH). We will discuss fencing in more detail +in [“Distributed Locks and Leases”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_lock_fencing). + +There are no easy solutions to these problems. For this reason, some operations teams prefer to +perform failovers manually, even if the software supports automatic failover. + +The most important thing with failover is to pick an up-to-date follower as the new leader—if +synchronous or semi-synchronous replication is used, this would be the follower that the old leader +waited for before acknowledging writes. With asynchronous replication, you can pick the follower +with the greatest log sequence number. This minimizes the amount of data that is lost during +failover: losing a fraction of a second of writes may be tolerable, but picking a follower that is +behind by several days could be catastrophic. + +These issues—node failures; unreliable networks; and trade-offs around replica consistency, +durability, availability, and latency—are in fact fundamental problems in distributed systems. +In [Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed) and [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency) we will discuss them in greater depth. + +## Implementation of Replication Logs + +How does leader-based replication work under the hood? Several different replication methods are +used in practice, so let’s look at each one briefly. 
+ +### Statement-based replication + +In the simplest case, the leader logs every write request (*statement*) that it executes and sends +that statement log to its followers. For a relational database, this means that every `INSERT`, +`UPDATE`, or `DELETE` statement is forwarded to followers, and each follower parses and executes +that SQL statement as if it had been received from a client. + +Although this may sound reasonable, there are various ways in which this approach to replication can +break down: + +* Any statement that calls a nondeterministic function, such as `NOW()` to get the current date + and time or `RAND()` to get a random number, is likely to generate a different value on each + replica. +* If statements use an autoincrementing column, or if they depend on the existing data in the + database (e.g., `UPDATE … WHERE <some condition>`), they must be executed in exactly the same + order on each replica, or else they may have a different effect. This can be limiting when there + are multiple concurrently executing transactions. +* Statements that have side effects (e.g., triggers, stored procedures, user-defined functions) may + result in different side effects occurring on each replica, unless the side effects are absolutely + deterministic. + +It is possible to work around those issues: for example, the leader can replace any nondeterministic +function calls with a fixed return value when the statement is logged so that the followers all get +the same value. The idea of executing deterministic statements in a fixed order is similar to the +event sourcing model that we previously discussed in [“Event Sourcing and CQRS”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_events).
This approach is +also known as *state machine replication*, and we will discuss the theory behind it in +[“Using shared logs”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_smr). + +Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today, +as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if +there is any nondeterminism in a statement. VoltDB uses statement-based replication, and makes it +safe by requiring transactions to be deterministic +[[16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hugg2015)]. +However, determinism can be hard to guarantee in practice, so many databases prefer other +replication methods. + +### Write-ahead log (WAL) shipping + +In [Chapter 4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust: +every modification is first written to the WAL so that the tree can be restored to a consistent +state after a crash. Since the WAL contains all the information necessary to restore the indexes and +heap into a consistent state, we can use the exact same log to build a replica on another node: +besides writing the log to disk, the leader also sends it across the network to its followers. When +the follower processes this log, it builds a copy of the exact same files as found on the leader. + +This method of replication is used in PostgreSQL and Oracle, among others +[[17](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Suzuki2017_ch6), +[18](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2012)]. 
+The main disadvantage is that the log describes the data on a very low level: a WAL contains details +of which bytes were changed in which disk blocks. This makes replication tightly coupled to the +storage engine. If the database changes its storage format from one version to another, it is +typically not possible to run different versions of the database software on the leader and the +followers. + +That may seem like a minor implementation detail, but it can have a big operational impact. If the +replication protocol allows the follower to use a newer software version than the leader, you can +perform a zero-downtime upgrade of the database software by first upgrading the followers and then +performing a failover to make one of the upgraded nodes the new leader. If the replication protocol +does not allow this version mismatch, as is often the case with WAL shipping, such upgrades require +downtime. + +### Logical (row-based) log replication + +An alternative is to use different log formats for replication and for the storage engine, which +allows the replication log to be decoupled from the storage engine internals. This kind of +replication log is called a *logical log*, to distinguish it from the storage engine’s (*physical*) +data representation. + +A logical log for a relational database is usually a sequence of records describing writes to +database tables at the granularity of a row: + +* For an inserted row, the log contains the new values of all columns. +* For a deleted row, the log contains enough information to uniquely identify the row that was + deleted. Typically this would be the primary key, but if there is no primary key on the table, the + old values of all columns need to be logged. +* For an updated row, the log contains enough information to uniquely identify the updated row, and + the new values of all columns (or at least the new values of all columns that changed). 
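
As a rough illustration of the row-level records listed above, here is a minimal sketch of a logical log and of a follower applying it. The record types (`Insert`, `Update`, `Delete`, `Commit`), the `apply_log` function, and the in-memory table layout are hypothetical names chosen for this example; they are not the format used by MySQL, PostgreSQL, or any other real database. The sketch also assumes that every table has a primary key column named `id` and that each transaction's records are followed by a commit marker.

```python
from dataclasses import dataclass
from typing import Any, Union


@dataclass
class Insert:
    table: str
    row: dict[str, Any]        # new values of all columns


@dataclass
class Update:
    table: str
    key: Any                   # primary key identifying the updated row
    changes: dict[str, Any]    # new values of the columns that changed


@dataclass
class Delete:
    table: str
    key: Any                   # primary key identifying the deleted row


@dataclass
class Commit:
    txid: int                  # marks the end of a committed transaction


LogRecord = Union[Insert, Update, Delete, Commit]


def apply_log(tables: dict, log: list, pk: str = "id") -> dict:
    """Apply logical log records to a follower's tables (hypothetical layout:
    {table_name: {primary_key: row_dict}}). Records are buffered and only
    applied once the commit marker of their transaction is seen."""
    pending = []
    for record in log:
        if not isinstance(record, Commit):
            pending.append(record)
            continue
        for r in pending:
            rows = tables.setdefault(r.table, {})
            if isinstance(r, Insert):
                rows[r.row[pk]] = dict(r.row)
            elif isinstance(r, Update):
                rows[r.key].update(r.changes)
            elif isinstance(r, Delete):
                del rows[r.key]
        pending.clear()
    return tables
```

Nothing in these records mentions disk blocks or byte offsets, which is why a log of this kind can stay compatible across storage engine versions.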
+ +A transaction that modifies several rows generates several such log records, followed by a record +indicating that the transaction was committed. MySQL keeps a separate logical replication log, +called the *binlog*, in addition to the WAL (when configured to use row-based replication). +PostgreSQL implements logical replication by decoding the physical WAL into row +insertion/update/delete events +[[19](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2023)]. + +Since a logical log is decoupled from the storage engine internals, it can more easily be kept +backward compatible, allowing the leader and the follower to run different versions of the database +software. This in turn enables upgrading to a new version with minimal downtime +[[20](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Petchimuthu2021)]. + +A logical log format is also easier for external applications to parse. This aspect is useful if you want +to send the contents of a database to an external system, such as a data warehouse for offline +analysis, or for building custom indexes and caches +[[21](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Sharma2015te_ch6)]. +This technique is called *change data capture*, and we will return to it in [Link to Come]. + +# Problems with Replication Lag + +Being able to tolerate node failures is just one reason for wanting replication. As mentioned +in [“Distributed versus Single-Node Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_distributed), other reasons are scalability (processing more +requests than a single machine can handle) and latency (placing replicas geographically closer to +users). 
+ +Leader-based replication requires all writes to go through a single node, but read-only queries can +go to any replica. For workloads that consist of mostly reads and only a small percentage of writes +(which is often the case with online services), there is an attractive option: create many followers, and distribute +the read requests across those followers. This removes load from the leader and allows read requests to be +served by nearby replicas. + +In this *read-scaling* architecture, you can increase the capacity for serving read-only requests +simply by adding more followers. However, this approach only realistically works with asynchronous +replication—if you tried to synchronously replicate to all followers, a single node failure or +network outage would make the entire system unavailable for writing. And the more nodes you have, +the likelier it is that one will be down, so a fully synchronous configuration would be very unreliable. + +Unfortunately, if an application reads from an *asynchronous* follower, it may see outdated +information if the follower has fallen behind. This leads to apparent inconsistencies in the +database: if you run the same query on the leader and a follower at the same time, you may get +different results, because not all writes have been reflected in the follower. This inconsistency is +just a temporary state—if you stop writing to the database and wait a while, the followers will +eventually catch up and become consistent with the leader. For that reason, this effect is known +as *eventual consistency* [[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry2011)]. + +###### Note + +The term *eventual consistency* was coined by Douglas Terry et al. 
+[[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry1994)], +popularized by Werner Vogels +[[24](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Vogels2008)], +and became the battle cry of many NoSQL projects. However, not only NoSQL databases are eventually +consistent: followers in an asynchronously replicated relational database have the same +characteristics. + +The term “eventually” is deliberately vague: in general, there is no limit to how far a replica can +fall behind. In normal operation, the delay between a write happening on the leader and being +reflected on a follower—the *replication lag*—may be only a fraction of a second, and not +noticeable in practice. However, if the system is operating near capacity or if there is a problem +in the network, the lag can easily increase to several seconds or even minutes. + +When the lag is so large, the inconsistencies it introduces are not just a theoretical issue but a +real problem for applications. In this section we will highlight three examples of problems that are +likely to occur when there is replication lag. We’ll also outline some approaches to solving them. + +## Reading Your Own Writes + +Many applications let the user submit some data and then view what they have submitted. This might +be a record in a customer database, or a comment on a discussion thread, or something else of that sort. +When new data is submitted, it must be sent to the leader, but when the user views the data, it can +be read from a follower. This is especially appropriate if data is frequently viewed but only +occasionally written. 
+ +With asynchronous replication, there is a problem, illustrated in +[Figure 6-3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_read_your_writes): if the user views the data shortly after making a write, the +new data may not yet have reached the replica. To the user, it looks as though the data they +submitted was lost, so they will be understandably unhappy. + +![ddia 0603](/fig/ddia_0603.png) + +###### Figure 6-3. A user makes a write, followed by a read from a stale replica. To prevent this anomaly, we need read-after-write consistency. + +In this situation, we need *read-after-write consistency*, also known as *read-your-writes consistency* +[[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry1994)]. +This is a guarantee that if the user reloads the page, they will always see any updates they +submitted themselves. It makes no promises about other users: other users’ updates may not be +visible until some later time. However, it reassures the user that their own input has been saved +correctly. + +How can we implement read-after-write consistency in a system with leader-based replication? There +are various possible techniques. To mention a few: + +* When reading something that the user may have modified, read it from the leader or a synchronously + updated follower; otherwise, read it from an asynchronously updated follower. + This requires that you have some way of knowing whether something might have been + modified, without actually querying it. For example, user profile information on a social network + is normally only editable by the owner of the profile, not by anybody else. Thus, a simple + rule is: always read the user’s own profile from the leader, and any other users’ profiles from a + follower. 
+* If most things in the application are potentially editable by the user, that approach won’t be + effective, as most things would have to be read from the leader (negating the benefit of read + scaling). In that case, other criteria may be used to decide whether to read from the leader. For + example, you could track the time of the last update and, for one minute after the last update, make all + reads from the leader + [[25](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Willison2022)]. + You could also monitor the replication lag on followers and prevent queries on any follower that + is more than one minute behind the leader. +* The client can remember the timestamp of its most recent write—then the system can ensure that the + replica serving any reads for that user reflects updates at least until that timestamp. If a + replica is not sufficiently up to date, either the read can be handled by another replica or the + query can wait until the replica has caught up + [[26](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Tharakan2020)]. + The timestamp could be a *logical timestamp* (something that indicates ordering of writes, such as + the log sequence number) or the actual system clock (in which case clock synchronization becomes + critical; see [“Unreliable Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_clocks)). +* If your replicas are distributed across regions (for geographical proximity to users or for + availability), there is additional complexity. Any request that needs to be served by the leader + must be routed to the region that contains the leader. + +Another complication arises when the same user is accessing your service from multiple devices, for +example a desktop web browser and a mobile app. 
In this case you may want to provide *cross-device* +read-after-write consistency: if the user enters some information on one device and then views it +on another device, they should see the information they just entered. + +In this case, there are some additional issues to consider: + +* Approaches that require remembering the timestamp of the user’s last update become more difficult, +  because the code running on one device doesn’t know what updates have happened on the other +  device. This metadata will need to be centralized. +* If your replicas are distributed across different regions, there is no guarantee that connections +  from different devices will be routed to the same region. (For example, if the user’s desktop +  computer uses the home broadband connection and their mobile device uses the cellular data network, +  the devices’ network routes may be completely different.) If your approach requires reading from the +  leader, you may first need to route requests from all of a user’s devices to the same region. + +# Regions and Availability Zones + +We use the term *region* to refer to one or more datacenters in a single geographic location. Cloud +providers locate multiple datacenters in the same geographic region. Each datacenter is referred to +as an *availability zone* or simply *zone*. Thus, a single cloud region is made up of multiple +zones. Each zone is a separate datacenter located in a separate physical facility with its own +power, cooling, and so on. + +Zones in the same region are connected by very high-speed network connections. Latency is low enough +that most distributed systems can run with nodes spread across multiple zones in the same region as +though they were in a single zone. Multi-zone configurations allow distributed systems to survive +zonal outages where one zone goes offline, but they do not protect against regional outages where +all zones in a region are unavailable.
To survive a regional outage, a distributed system must be +deployed across multiple regions, which can result in higher latencies, lower throughput, and +increased cloud networking bills. We will discuss these tradeoffs more in +[“Multi-leader replication topologies”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_topologies). For now, just know that when we say region, we mean a collection of +zones/datacenters in a single geographic location. + +## Monotonic Reads + +Our second example of an anomaly that can occur when reading from asynchronous followers is that it’s +possible for a user to see things *moving backward in time*. + +This can happen if a user makes several reads from different replicas. For example, +[Figure 6-4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower +with little lag, then to a follower with greater lag. (This scenario is quite likely if the user +refreshes a web page, and each request is routed to a random server.) The first query returns a +comment that was recently added by user 1234, but the second query doesn’t return anything because +the lagging follower has not yet picked up that write. In effect, the second query observes the +system state at an earlier point in time than the first query. This wouldn’t be so bad if the first query +hadn’t returned anything, because user 2345 probably wouldn’t know that user 1234 had recently added +a comment. However, it’s very confusing for user 2345 if they first see user 1234’s comment appear, +and then see it disappear again. + +![ddia 0604](/fig/ddia_0604.png) + +###### Figure 6-4. A user first reads from a fresh replica, then from a stale replica. Time appears to go backward. To prevent this anomaly, we need monotonic reads. 
+ +*Monotonic reads* [[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry2011)] is a guarantee that this +kind of anomaly does not happen. It’s a lesser guarantee than strong consistency, but a stronger +guarantee than eventual consistency. When you read data, you may see an old value; monotonic reads +only means that if one user makes several reads in sequence, they will not see time go +backward—i.e., they will not read older data after having previously read newer data. + +One way of achieving monotonic reads is to make sure that each user always makes their reads from +the same replica (different users can read from different replicas). For example, the replica can be +chosen based on a hash of the user ID, rather than randomly. However, if that replica fails, the +user’s queries will need to be rerouted to another replica. + +## Consistent Prefix Reads + +Our third example of replication lag anomalies concerns violation of causality. Imagine the +following short dialog between Mr. Poons and Mrs. Cake: + +Mr. Poons +: How far into the future can you see, Mrs. Cake? + +Mrs. Cake +: About ten seconds usually, Mr. Poons. + +There is a causal dependency between those two sentences: Mrs. Cake heard Mr. Poons’s question and +answered it. + +Now, imagine a third person is listening to this conversation through followers. The things said by +Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer +replication lag (see [Figure 6-5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_consistent_prefix)). This observer would hear the following: + +Mrs. Cake +: About ten seconds usually, Mr. Poons. + +Mr. Poons +: How far into the future can you see, Mrs. Cake? + +To the observer it looks as though Mrs. Cake is answering the question before Mr. Poons has even asked +it. 
Such psychic powers are impressive, but very confusing +[[27](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Pratchett1991)]. + +![ddia 0605](/fig/ddia_0605.png) + +###### Figure 6-5. If some shards are replicated slower than others, an observer may see the answer before they see the question. + +Preventing this kind of anomaly requires another type of guarantee: *consistent prefix reads* +[[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry2011)]. This guarantee says that if a sequence of +writes happens in a certain order, then anyone reading those writes will see them appear in the same +order. + +This is a particular problem in sharded (partitioned) databases, which we will discuss in +[Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding). If the database always applies writes in the same order, reads always see a +consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different +shards operate independently, so there is no global ordering of writes: when a user reads from the +database, they may see some parts of the database in an older state and some in a newer state. + +One solution is to make sure that any writes that are causally related to each other are written to +the same shard—but in some applications that cannot be done efficiently. There are also algorithms +that explicitly keep track of causal dependencies, a topic that we will return to in +[“The “happens-before” relation and concurrency”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_happens_before). 
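
To make the "same shard" approach concrete, here is a minimal sketch that routes all messages of one conversation to a single shard, so that a reader of that shard sees them in the order they were written. The `ShardedStore` class, the `shard_for` function, and the idea of keying on a `conversation_id` are assumptions made for this example, not part of any particular database.

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard using a stable hash. Python's built-in hash()
    is randomized per process for strings, so hashlib is used instead."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedStore:
    """Hypothetical store in which each shard is an append-only list."""

    def __init__(self, num_shards: int):
        self.num_shards = num_shards
        self.shards = [[] for _ in range(num_shards)]

    def write(self, conversation_id: str, author: str, text: str) -> int:
        # Causally related writes share a conversation_id, so they always
        # land on the same shard and keep their relative order there.
        shard = shard_for(conversation_id, self.num_shards)
        self.shards[shard].append((conversation_id, author, text))
        return shard

    def read(self, conversation_id: str) -> list:
        shard = shard_for(conversation_id, self.num_shards)
        return [(a, t) for (c, a, t) in self.shards[shard] if c == conversation_id]
```

The same kind of stable hash, applied to a user ID and a list of replicas instead of shards, is one way to pick "the same replica for each user" when aiming for monotonic reads.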
+ +## Solutions for Replication Lag + +When working with an eventually consistent system, it is worth thinking about how the application +behaves if the replication lag increases to several minutes or even hours. If the answer is “no +problem,” that’s great. However, if the result is a bad experience for users, it’s important to +design the system to provide a stronger guarantee, such as read-after-write. Pretending that +replication is synchronous when in fact it is asynchronous is a recipe for problems down the line. + +As discussed earlier, there are ways in which an application can provide a stronger guarantee than +the underlying database, for example by performing certain kinds of reads on the leader or a +synchronously updated follower. However, dealing with these issues in application code is complex +and easy to get wrong. + +The simplest programming model for application developers is to choose a database that provides a +strong consistency guarantee for replicas, such as linearizability (see [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency)), along with ACID +transactions (see [Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions)). This allows you to mostly ignore the challenges that arise +from replication, and treat the database as if it had just a single node. In the early 2010s the +*NoSQL* movement promoted the view that these features limited scalability, and that large-scale +systems would have to embrace eventual consistency. + +However, since then, a number of databases have started providing strong consistency and transactions +while also offering the fault tolerance, high availability, and scalability advantages of a +distributed database.
As mentioned in [“Relational Model versus Document Model”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_history), this trend is known as *NewSQL* to +contrast with NoSQL (although it’s less about SQL specifically, and more about new approaches to +scalable transaction management). + +Even though scalable, strongly consistent distributed databases are now available, there are still +good reasons why some applications choose to use different forms of replication that offer weaker +consistency guarantees: they can offer stronger resilience in the face of network interruptions, and +have lower overheads compared to transactional systems. We will explore such approaches in the rest +of this chapter. + +# Multi-Leader Replication + +So far in this chapter we have only considered replication architectures using a single leader. +Although that is a common approach, there are interesting alternatives. + +Single-leader replication has one major downside: all writes must go through the one leader. If you +can’t connect to the leader for any reason, for example due to a network interruption between you +and the leader, you can’t write to the database. + +A natural extension of the single-leader replication model is to allow more than one node to accept +writes. Replication still happens in the same way: each node that processes a write must forward +that data change to all the other nodes. We call this a *multi-leader* configuration (also known as +*active/active* or *bidirectional* replication). In this setup, each leader simultaneously acts as a +follower to the other leaders. + +As with single-leader replication, there is a choice between making it synchronous or asynchronous. +Let’s say you have two leaders, *A* and *B*, and you’re trying to write to *A*. 
If writes are +synchronously replicated from *A* to *B*, and the network between the two nodes is interrupted, you +can’t write to *A* until the network comes back. Synchronous multi-leader replication thus gives you +a model that is very similar to single-leader replication: it is as if you had made *B* the leader and +*A* simply forwarded any write requests to *B* to be executed. + +For that reason, we won’t go further into synchronous multi-leader replication, and simply treat it +as equivalent to single-leader replication. The rest of this section focuses on asynchronous +multi-leader replication, in which any leader can process writes even when its connection to the +other leaders is interrupted. + +## Geographically Distributed Operation + +It rarely makes sense to use a multi-leader setup within a single region, because the benefits +rarely outweigh the added complexity. However, there are some situations in which this configuration +is reasonable. + +Imagine you have a database with replicas in several different regions (perhaps so that you can +tolerate the failure of an entire region, or perhaps in order to be closer to your users). This is +known as a *geographically distributed*, *geo-distributed* or *geo-replicated* setup. With +single-leader replication, the leader has to be in *one* of the regions, and all writes must go +through that region. + +In a multi-leader configuration, you can have a leader in *each* region. +[Figure 6-6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_multi_dc) shows what this architecture might look like. Within each region, +regular leader–follower replication is used (with followers possibly in a different availability zone +from the leader); between regions, each region’s leader replicates its changes to the leaders in +other regions. + +![ddia 0606](/fig/ddia_0606.png) + +###### Figure 6-6. Multi-leader replication across multiple regions.
+ +Let’s compare how the single-leader and multi-leader configurations fare in a multi-region +deployment: + +Performance +: In a single-leader configuration, every write must go over the internet to the region with the +  leader. This can add significant latency to +  writes and might defeat the purpose of having multiple regions in the first place. In a +  multi-leader configuration, every write can be processed in the local region and is replicated +  asynchronously to the other regions. Thus, the inter-region network delay is hidden from +  users, which means the perceived performance may be better. + +Tolerance of regional outages +: In a single-leader configuration, if the region with the leader becomes unavailable, failover can +  promote a follower in another region to be leader. In a multi-leader configuration, each region +  can continue operating independently of the others, and replication catches up when the offline +  region comes back online. + +Tolerance of network problems +: Even with dedicated connections, traffic between regions +  can be less reliable than traffic between zones in the same region or within a single zone. A +  single-leader configuration is very sensitive to problems in this inter-region link, because when +  a client in one region wants to write to a leader in another region, it has to send its request +  over that link and wait for the response before it can complete. + +  A multi-leader configuration with asynchronous replication can tolerate network problems better: +  during a temporary network interruption, each region’s leader can continue independently processing +  writes. + +Consistency +: A single-leader system can provide strong consistency guarantees, such as serializable +  transactions, which we will discuss in [Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions).
The biggest downside of multi-leader + systems is that the consistency they can achieve is much weaker. For example, you can’t guarantee + that a bank account won’t go negative or that a username is unique: it’s always possible for + different leaders to process writes that are individually fine (paying out some of the money in an + account, registering a particular username), but which violate the constraint when taken together + with another write on another leader. + + This is simply a fundamental limitation of distributed systems + [[28](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Bailis2014coord_ch6)]. + If you need to enforce such constraints, you’re therefore better off with a single-leader system. + However, as we will see in [“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts), multi-leader systems can still + achieve consistency properties that are useful in a wide range of apps that don’t need such + constraints. + +Multi-leader replication is less common than single-leader replication, but it is still supported by +many databases, including MySQL, Oracle, SQL Server, and YugabyteDB. In some cases it is an external +add-on feature, for example in Redis Enterprise, EDB Postgres Distributed, and pglogical +[[29](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Raja2022)]. + +As multi-leader replication is a somewhat retrofitted feature in many databases, there are often +subtle configuration pitfalls and surprising interactions with other database features. For example, +autoincrementing keys, triggers, and integrity constraints can be problematic. 
For this reason, +multi-leader replication is often considered dangerous territory that should be avoided if possible +[[30](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2012)]. + +### Multi-leader replication topologies + +A *replication topology* describes the communication paths along which writes are propagated from +one node to another. If you have two leaders, like in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), there is +only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With +more than two leaders, various different topologies are possible. Some examples are illustrated in +[Figure 6-7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_topologies). + +![ddia 0607](/fig/ddia_0607.png) + +###### Figure 6-7. Three example topologies in which multi-leader replication can be set up. + +The most general topology is *all-to-all*, shown in +[Figure 6-7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_topologies)(c), +in which every leader sends its writes to every other leader. However, more restricted topologies +are also used: for example a *circular topology* in which each node receives writes from one node +and forwards those writes (plus any writes of its own) to one other node. Another popular topology +has the shape of a *star*: one designated root node forwards writes to all of the other nodes. The +star topology can be generalized to a tree. 
+ +###### Note + +Don’t confuse a star-shaped network topology with a *star schema* (see +[“Stars and Snowflakes: Schemas for Analytics”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_analytics)), which describes the structure of a data model. + +In circular and star topologies, a write may need to pass through several nodes before it reaches +all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To +prevent infinite replication loops, each node is given a unique identifier, and in the replication +log, each write is tagged with the identifiers of all the nodes it has passed through +[[31](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#HBase7709)]. +When a node receives a data change that is tagged with its own identifier, that data change is +ignored, because the node knows that it has already been processed. + +### Problems with different topologies + +A problem with circular and star topologies is that if just one node fails, it can interrupt the +flow of replication messages between other nodes, leaving them unable to communicate until the +node is fixed. The topology could be reconfigured to work around the failed node, but in most +deployments such reconfiguration would have to be done manually. The fault tolerance of a more +densely connected topology (such as all-to-all) is better because it allows messages to travel +along different paths, avoiding a single point of failure. + +On the other hand, all-to-all topologies can have issues too. In particular, some network links may +be faster than others (e.g., due to network congestion), with the result that some replication +messages may “overtake” others, as illustrated in [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality). 
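
Returning briefly to the loop-prevention scheme for circular and star topologies described earlier, in which each write is tagged with the identifiers of the nodes it has passed through: the following is a minimal sketch for a circular topology. The `Node` and `Write` classes and the synchronous, in-process forwarding are simplifications invented for this example; a real system would forward changes asynchronously over the network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    key: str
    value: str
    seen_by: frozenset = frozenset()  # IDs of nodes this write has passed through


class Node:
    def __init__(self, node_id: str):
        self.node_id = node_id
        self.data = {}
        self.next_node = None   # the one node this node forwards to (circular topology)
        self.applied = 0        # number of writes applied, for illustration only

    def local_write(self, key: str, value: str) -> None:
        """A client writes to this leader; the change then travels around the ring."""
        self.receive(Write(key, value))

    def receive(self, write: Write) -> None:
        if self.node_id in write.seen_by:
            return  # tagged with our own ID: already processed, so stop the loop
        self.data[write.key] = write.value
        self.applied += 1
        tagged = Write(write.key, write.value, write.seen_by | {self.node_id})
        if self.next_node is not None:
            self.next_node.receive(tagged)
```

With three nodes wired as a ring (`a.next_node = b`, `b.next_node = c`, `c.next_node = a`), a call to `a.local_write(...)` is applied once on each node and is dropped when it arrives back at `a`.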
+ +![ddia 0608](/fig/ddia_0608.png) + +###### Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas. + +In [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality), client A inserts a row into a table on leader 1, and client B +updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may +first receive the update (which, from its point of view, is an update to a row that does not exist +in the database) and only later receive the corresponding insert (which should have preceded the +update). + +This is a problem of causality, similar to the one we saw in [“Consistent Prefix Reads”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_consistent_prefix): +the update depends on the prior insert, so we need to make sure that all nodes process the insert +first, and then the update. Simply attaching a timestamp to every write is not sufficient, because +clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see +[Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed)). + +To order these events correctly, a technique called *version vectors* can be used, which we will +discuss later in this chapter (see [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent)). However, many multi-leader +replication systems don’t use good techniques for ordering updates, leaving them vulnerable to +issues like the one in [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality). 
If you are using multi-leader replication, it +is worth being aware of these issues, carefully reading the documentation, and thoroughly testing +your database to ensure that it really does provide the guarantees you believe it to have. + +## Sync Engines and Local-First Software + +Another situation in which multi-leader replication is appropriate is if you have an application +that needs to continue to work while it is disconnected from the internet. + +For example, consider the calendar apps on your mobile phone, your laptop, and other devices. You +need to be able to see your meetings (make read requests) and enter new meetings (make write +requests) at any time, regardless of whether your device currently has an internet connection. If +you make any changes while you are offline, they need to be synced with a server and your other +devices when the device is next online. + +In this case, every device has a local database replica that acts as a leader (it accepts write +requests), and there is an asynchronous multi-leader replication process (sync) between the replicas +of your calendar on all of your devices. The replication lag may be hours or even days, depending on +when you have internet access available. + +From an architectural point of view, this setup is very similar to multi-leader replication between +regions, taken to the extreme: each device is a “region,” and the network connection between them is +extremely unreliable. + +### Real-time collaboration, offline-first, and local-first apps + +Moreover, many modern web apps offer *real-time collaboration* features, such as Google Docs and +Sheets for text documents and spreadsheets, Figma for graphics, and Linear for project management. 
+What makes these apps so responsive is that user input is immediately reflected in the user +interface, without waiting for a network round-trip to the server, and edits by one user are shown +to their collaborators with low latency +[[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DayRichter2010), +[33](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Wallace2019), +[34](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Artman2023)]. + +This again results in a multi-leader architecture: each web browser tab that has opened the shared +file is a replica, and any updates that you make to the file are asynchronously replicated to the +devices of the other users who have opened the same file. Even if the app does not allow you to +continue editing a file while offline, the fact that multiple users can make edits without waiting +for a response from the server already makes it multi-leader. + +Both offline editing and real-time collaboration require a similar replication infrastructure: the +application needs to capture any changes that the user makes to a file, and either send them to +collaborators immediately (if online), or store them locally for sending later (if offline). +Additionally, the application needs to receive changes from collaborators, merge them into the +user’s local copy of the file, and update the user interface to reflect the latest version. If +multiple users have changed the file concurrently, conflict resolution logic may be needed to merge +those changes. + +A software library that supports this process is called a *sync engine*. 
Although the idea has +existed for a long time, the term has recently gained attention +[[35](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Saafan2024), +[36](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hagoel2024), +[37](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Jayakar2024)]. +An application that allows a user to continue editing a file while offline (which may be implemented +using a sync engine) is called *offline-first* +[[38](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Feyerke2013)]. +The term *local-first software* refers to collaborative apps that are not only offline-first, but +are also designed to continue working even if the developer who made the software shuts down all of +their online services [[39](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kleppmann2019_ch6)]. +This can be achieved by using a sync engine with an open standard sync protocol for which multiple +service providers are available +[[40](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kleppmann2024lofi)]. +For example, Git is a local-first collaboration system (albeit one that doesn’t support real-time +collaboration) since you can sync via GitHub, GitLab, or any other repository hosting service. + +### Pros and cons of sync engines + +The dominant way of building web apps today is to keep very little persistent state on the client, +and to rely on making requests to a server whenever a new piece of data needs to be displayed or +some data needs to be updated. In contrast, when using a sync engine, you have persistent state on +the client, and communication with the server is moved into a background process. 
The sync engine +approach has a number of advantages: + +* Having the data locally means the user interface can be much faster to respond than if it had to + wait for a service call to fetch some data. Some apps aim to respond to user input in the *next + frame* of the graphics system, which means rendering within 16 ms on a display with a + 60 Hz refresh rate. +* Allowing users to continue working while offline is valuable, especially on mobile devices with + intermittent connectivity. With a sync engine, an app doesn’t need a separate offline mode: being + offline is the same as having very large network delay. +* A sync engine simplifies the programming model for frontend apps, compared to performing explicit + service calls in application code. Every service call requires error handling, as discussed in + [“The problems with remote procedure calls (RPCs)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch05.html#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user + interface needs to somehow reflect that error. A sync engine allows the app to perform reads and + writes on local data, which almost never fails, leading to a more declarative programming style + [[41](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hofmeyr2024)]. +* In order to display edits from other users in real-time, you need to receive notifications of + those edits and efficiently update the user interface accordingly. A sync engine combined with a + *reactive programming* model is a good way of implementing this + [[42](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#vanHardenberg2020)]. + +Sync engines work best when all the data that the user may need is downloaded in advance and stored +persistently on the client. 
This means that the data is available for offline access when needed, +but it also means that sync engines are not suitable if the user has access to a very large amount +of data. For example, downloading all the files that the user themselves created is probably fine +(one user generally doesn’t generate that much data), but downloading the entire catalog of an +e-commerce website probably doesn’t make sense. + +The sync engine was pioneered by Lotus Notes in the 1980s +[[43](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kawell1988)] +(without using that term), and sync for specific apps such as calendars has also existed for a long +time. Today there are a number of general-purpose sync engines, some of which use a proprietary +backend service (e.g., Google Firestore, Realm, or Ditto), and some have an open source backend, +making them suitable for creating local-first software (e.g., PouchDB/CouchDB, Automerge, or Yjs). + +Multiplayer video games have a similar need to respond immediately to the user’s local actions, and +reconcile them with other players’ actions received asynchronously over the network. In game +development jargon the equivalent of a sync engine is called *netcode*. The techniques used in +netcode are quite specific to the requirements of games +[[44](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Pusch2019)], and don’t directly +carry over to other types of software, so we won’t consider them further in this book. + +## Dealing with Conflicting Writes + +The biggest problem with multi-leader replication—both in a geo-distributed server-side database and +a local-first sync engine on end user devices—is that concurrent writes on different leaders can +lead to conflicts that need to be resolved. 
+ +For example, consider a wiki page that is simultaneously being edited by two users, as shown in +[Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2 +independently changes the title from A to C. Each user’s change is successfully applied to their +local leader. However, when the changes are asynchronously replicated, a conflict is detected. +This problem does not occur in a single-leader database. + +![ddia 0609](/fig/ddia_0609.png) + +###### Figure 6-9. A write conflict caused by two leaders concurrently updating the same record. + +###### Note + +We say that the two writes in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict) are *concurrent* because neither +was “aware” of the other at the time the write was originally made. It doesn’t matter whether the +writes literally happened at the same time; indeed, if the writes were made while offline, they +might have actually happened some time apart. What matters is whether one write occurred in a state +where the other write has already taken effect. + +In [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent) we will tackle the question of how a database can determine +whether two writes are concurrent. For now we will assume that we can detect conflicts, and we want +to figure out the best way of resolving them. + +### Conflict avoidance + +One strategy for conflicts is to avoid them occurring in the first place. For example, if the +application can ensure that all writes for a particular record go through the same leader, then +conflicts cannot occur, even if the database as a whole is multi-leader. 
This approach is not +possible in the case of a sync engine client being updated offline, but it is sometimes possible in +geo-replicated server systems [[30](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2012)]. + +For example, in an application where a user can only edit their own data, you can ensure that +requests from a particular user are always routed to the same region and use the leader in that +region for reading and writing. Different users may have different “home” regions (perhaps picked +based on geographic proximity to the user), but from any one user’s point of view the configuration +is essentially single-leader. + +However, sometimes you might want to change the designated leader for a record—perhaps because +one region is unavailable and you need to reroute traffic to another region, or perhaps because +a user has moved to a different location and is now closer to a different region. There is now a +risk that the user performs a write while the change of designated leader is in progress, leading to +a conflict that would have to be resolved using one of the methods below. Thus, conflict avoidance +breaks down if you allow the leader to be changed. + +Another example of conflict avoidance: imagine you want to insert new records and generate unique +IDs for them based on an auto-incrementing counter. If you have two leaders, you could set them up +so that one leader only generates odd numbers and the other only generates even numbers. That way +you can be sure that the two leaders won’t concurrently assign the same ID to different records. +We will discuss other ID assignment schemes in [“ID Generators and Logical Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_logical). 
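
As a rough illustration of the odd/even scheme, the following Python sketch generalizes it to any fixed number of leaders: each leader starts its counter at a different offset and steps by the number of leaders, so the sequences never intersect. The class name `LeaderIdGenerator` and its parameters are hypothetical and are not taken from any particular database.

```python
import itertools


class LeaderIdGenerator:
    """Hands out IDs that cannot collide with those of other leaders.

    Hypothetical helper: leader 0 of 2 yields 1, 3, 5, ... (odd numbers),
    leader 1 of 2 yields 2, 4, 6, ... (even numbers).
    """

    def __init__(self, leader_index: int, num_leaders: int = 2) -> None:
        if not 0 <= leader_index < num_leaders:
            raise ValueError("leader_index must be in [0, num_leaders)")
        # Each leader starts at a distinct offset and skips over the
        # IDs that belong to the other leaders.
        self._counter = itertools.count(start=leader_index + 1, step=num_leaders)

    def next_id(self) -> int:
        return next(self._counter)


odd_leader = LeaderIdGenerator(leader_index=0)
even_leader = LeaderIdGenerator(leader_index=1)
odd_ids = [odd_leader.next_id() for _ in range(3)]
even_ids = [even_leader.next_id() for _ in range(3)]
```

The sketch assumes the number of leaders is fixed up front; changing it later would require choosing new offsets that avoid IDs already handed out.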
+ +### Last write wins (discarding concurrent writes) + +If conflicts can’t be avoided, the simplest way of resolving them is to attach a timestamp to each +write, and to always use the value with the greatest timestamp. For example, in +[Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than +the timestamp of user 2’s write. In that case, both leaders will determine that the new title of the +page should be B, and they discard the write that sets it to C. If the writes coincidentally have +the same timestamp, the winner can be chosen by comparing the values (e.g., in the case of strings, +taking the one that’s earlier in the alphabet). + +This approach is called *last write wins* (LWW) because the write with the greatest timestamp can be +considered the “last” one. The term is misleading though, because when two writes are concurrent +like in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), which one is older and which is later is undefined, and +so the timestamp order of concurrent writes is essentially random. + +Therefore the real meaning of LWW is: when the same record is concurrently written on different +leaders, one of those writes is randomly chosen to be the winner, and the other writes are silently +discarded, even though they were successfully processed at their respective leaders. This achieves +the goal that eventually all replicas end up in a consistent state, but at the cost of data loss. + +If you can avoid conflicts—for example, by only inserting records with a unique key such as a UUID, +and never updating them—then LWW is no problem. 
But if you update existing +records, or if different leaders may insert records with the same key, then you have to decide +whether lost updates are a problem for your application. If lost updates are not acceptable, you +need to use one of the conflict resolution approaches described below. + +Another problem with LWW is that if a real-time clock (e.g. a Unix timestamp) is used as timestamp +for the writes, the system becomes very sensitive to clock synchronization. If one node has a clock +that is ahead of the others, and you try to overwrite a value written by that node, your write may +be ignored as it may have a lower timestamp, even though it clearly occurred later. This problem can +be solved by using a *logical clock*, which we will discuss in [“ID Generators and Logical Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_logical). + +### Manual conflict resolution + +If randomly discarding some of your writes is not desirable, the next option is to resolve the +conflict manually. You may be familiar with manual conflict resolution from Git and other version +control systems: if commits on two different branches edit the same lines of the same file, and you +try to merge those branches, you will get a merge conflict that needs to be resolved before the +merge is complete. + +In a database, it would be impractical for a conflict to stop the entire replication process until a +human has resolved it. Instead, databases typically store all the concurrently written values for a +given record—for example, both B and C in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict). These values are +sometimes called *siblings*. The next time you query that record, the database returns *all* those +values, rather than just the latest one. 
You can then resolve those values in whatever way you want, +either automatically in application code (for example, you could concatenate B and C into “B/C”), or +by asking the user. You then write back a new value to the database to resolve the conflict. + +This approach to conflict resolution is used in some systems, such as CouchDB. However, it also +suffers from a number of problems: + +* The API of the database changes: for example, where previously the title of the wiki page was just + a string, it now becomes a set of strings that usually contains one element, but may sometimes + contain multiple elements if there is a conflict. This can make the data awkward to work with in + application code. +* Asking the user to manually merge the siblings is a lot of work, both for the app developer (who + needs to build the user interface for conflict resolution) and for the user (who may be confused + about what they are being asked to do, and why). In many cases, it’s better to merge automatically + than to bother the user. +* Merging siblings automatically can lead to surprising behavior if it is not done carefully. For + example, the shopping cart on Amazon used to allow concurrent updates, which were then merged by + keeping all the shopping cart items that appeared in any of the siblings (i.e., taking the set + union of the carts). This meant that if the customer had removed an item from their cart in one + sibling, but another sibling still contained that old item, the removed item would unexpectedly + reappear in the customer’s cart + [[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6)]. 
+ [Figure 6-10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping + cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear. +* If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution + process can itself introduce a new conflict. Those resolutions could even be inconsistent: for + example, one node may merge B and C into “B/C” and another may merge them into “C/B” if you are + not careful to order them consistently. When the conflict between “B/C” and “C/B” is merged, it + may result in “B/C/C/B” or something similarly surprising. + +![ddia 0610](/fig/ddia_0610.png) + +###### Figure 6-10. Example of Amazon’s shopping cart anomaly: if conflicts on a shopping cart are merged by taking the union, deleted items may reappear. + +### Automatic conflict resolution + +For many applications, the best way of handling conflicts is to use an algorithm that automatically +merges concurrent writes into a consistent state. Automatic conflict resolution ensures that all +replicas *converge* to the same state—i.e., all replicas that have processed the same set of writes +have the same state, regardless of the order in which the writes arrived. + +LWW is a simple example of a conflict resolution algorithm. More sophisticated merge algorithms have +been developed for different types of data, with the goal of preserving the intended effect of all +updates as much as possible, and hence avoiding data loss: + +* If the data is text (e.g., the title or body of a wiki page), we can detect which characters have + been inserted or deleted from one version to the next. The merged result then preserves all the + insertions and deletions made in any of the siblings. 
If users concurrently insert text at the + same position, it can be ordered deterministically so that all nodes get the same merged outcome. +* If the data is a collection of items (ordered like a to-do list, or unordered like a shopping + cart), we can merge it similarly to text by tracking insertions and deletions. To avoid the + shopping cart issue in [Figure 6-10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_amazon_anomaly), the algorithms track the fact that Book + and DVD were deleted, so the merged result is Cart = {Soap}. +* If the data is an integer representing a counter that can be incremented or decremented (e.g., the + number of likes on a social media post), the merge algorithm can tell how many increments and + decrements happened on each sibling, and add them together correctly so that the result does not + double-count and does not drop updates. +* If the data is a key-value mapping, we can merge updates to the same key by applying one of the + other conflict resolution algorithms to the values under that key. Updates to different keys can + be handled independently from each other. + +There are limits to what is possible with conflict resolution. For example, if you want to enforce +that a list contains no more than five items, and multiple users concurrently add items to the list +so that there are more than five in total, your only option is to drop some of the items. +Nevertheless, automatic conflict resolution is sufficient to build many useful apps. And if you +start from the requirement of wanting to build a collaborative offline-first or local-first app, +then conflict resolution is inevitable, and automating it is often the best approach. 
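
To make the counter case more concrete, here is a minimal Python sketch of one well-known way to merge counters without double-counting or dropping updates, often called a PN-counter. The text above does not name a specific algorithm, so treat this as one possible approach under that assumption; the class and field names are illustrative. Each replica records only its own increments and decrements, and merging takes the per-replica maximum.

```python
from dataclasses import dataclass, field


@dataclass
class PNCounter:
    """Illustrative counter that can be merged in any order."""

    replica_id: str
    # Total increments/decrements observed so far, keyed by the replica
    # that originally performed them.
    incs: dict = field(default_factory=dict)
    decs: dict = field(default_factory=dict)

    def increment(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("amount must be non-negative")
        self.incs[self.replica_id] = self.incs.get(self.replica_id, 0) + amount

    def decrement(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("amount must be non-negative")
        self.decs[self.replica_id] = self.decs.get(self.replica_id, 0) + amount

    def value(self) -> int:
        return sum(self.incs.values()) - sum(self.decs.values())

    def merge(self, other: "PNCounter") -> None:
        # Per-replica totals only ever grow, so taking the maximum keeps
        # every update exactly once, even if the same state is merged twice.
        for key, count in other.incs.items():
            self.incs[key] = max(self.incs.get(key, 0), count)
        for key, count in other.decs.items():
            self.decs[key] = max(self.decs.get(key, 0), count)


likes_a = PNCounter("a")
likes_b = PNCounter("b")
likes_a.increment()
likes_a.increment()
likes_b.increment()
likes_b.decrement()
likes_a.merge(likes_b)
likes_b.merge(likes_a)
likes_a.merge(likes_b)  # merging again changes nothing
```

After the merges, both replicas report the same value regardless of the order in which they exchanged state, which is the convergence property described above.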
+ +## CRDTs and Operational Transformation + +Two families of algorithms are commonly used to implement automatic conflict resolution: +*Conflict-free replicated datatypes* (CRDTs) +[[46](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Shapiro2011)] and *Operational Transformation* (OT) +[[47](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Sun1998)]. +They have different design philosophies and performance characteristics, but both are able to +perform automatic merges for all the aforementioned types of data. + +[Figure 6-11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a +text. Assume you have two replicas that both start off with the text “ice”. One replica prepends the +letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make +“ice!”. + +![ddia 0611](/fig/ddia_0611.png) + +###### Figure 6-11. How two concurrent insertions into a string are merged by OT and a CRDT respectively. + +The merged result “nice!” is achieved differently by both types of algorithms: + +OT +: We record the index at which characters are inserted or deleted: “n” is inserted at index 0, and + “!” at index 3. Next, the replicas exchange their operations. The insertion of “n” at 0 can be + applied as-is, but if the insertion of “!” at 3 were applied to the state “nice” we would get + “nic!e”, which is incorrect. We therefore need to transform the index of each operation to account + for concurrent operations that have already been applied; in this case, the insertion of “!” is + transformed to index 4 to account for the insertion of “n” at an earlier index. 
+ +CRDT +: Most CRDTs give each character a unique, immutable ID and use those to determine the positions of + insertions/deletions, instead of indexes. For example, in [Figure 6-11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_ot_crdt) we assign + the ID 1A to “i”, the ID 2A to “c”, etc. When inserting the exclamation mark, we generate an + operation containing the ID of the new character (4B) and the ID of the existing character after + which we want to insert (3A). To insert at the beginning of the string we give “nil” as the + preceding character ID. Concurrent insertions at the same position are ordered by the IDs of the + characters. This ensures that replicas converge without performing any transformation. + +There are many algorithms based on variations of these ideas. Lists/arrays can be supported +similarly, using list elements instead of characters, and other datatypes such as key-value maps can +be added quite easily. There are some performance and functionality trade-offs between OT and CRDTs, +but it’s possible to combine the advantages of CRDTs and OT in one algorithm +[[48](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gentle2025)]. + +OT is most often used for real-time collaborative editing of text, e.g. in Google Docs +[[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DayRichter2010)], whereas CRDTs can be found in +distributed databases such as Redis Enterprise, Riak, and Azure Cosmos DB +[[49](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Shukla2018)]. +Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge or Yjs) and with OT +(e.g., ShareDB). + +### What is a conflict? + +Some kinds of conflict are obvious. 
In the example in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), two writes +concurrently modified the same field in the same record, setting it to two different values. There +is little doubt that this is a conflict. + +Other kinds of conflict can be more subtle to detect. For example, consider a meeting room booking +system: it tracks which room is booked by which group of people at which time. This application +needs to ensure that each room is only booked by one group of people at any one time (i.e., there +must not be any overlapping bookings for the same room). In this case, a conflict may arise if two +different bookings are created for the same room at the same time. Even if the application checks +availability before allowing a user to make a booking, there can be a conflict if the two bookings +are made on two different leaders. + +There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a +good understanding of this problem. We will see some more examples of conflicts in +[Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and +resolving conflicts in a replicated system. + +# Leaderless Replication + +The replication approaches we have discussed so far in this chapter—single-leader and +multi-leader replication—are based on the idea that a client sends a write request to one node +(the leader), and the database system takes care of copying that write to the other replicas. A +leader determines the order in which writes should be processed, and followers apply the leader’s +writes in the same order. + +Some data storage systems take a different approach, abandoning the concept of a leader and +allowing any replica to directly accept writes from clients. 
Some of the earliest replicated data +systems were leaderless [[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lindsay1979_ch6), +[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gifford1979)], but the +idea was mostly forgotten during the era of dominance of relational databases. It once again became +a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in +2007 [[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6)]. +Riak, Cassandra, and ScyllaDB are open source datastores with leaderless replication models inspired +by Dynamo, so this kind of database is also known as *Dynamo-style*. + +###### Note + +The original *Dynamo* system was only described in a paper +[[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6)], but never released outside of +Amazon. The similarly named *DynamoDB* is a more recent cloud database from AWS, but it has a +completely different architecture: it uses single-leader replication based on the Multi-Paxos +consensus algorithm [[5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Elhemali2022_ch6)]. + +In some leaderless implementations, the client directly sends its writes to several replicas, while +in others, a coordinator node does this on behalf of the client. However, unlike a leader database, +that coordinator does not enforce a particular ordering of writes. As we shall see, this difference in design has +profound consequences for the way the database is used. + +## Writing to the Database When a Node Is Down + +Imagine you have a database with three replicas, and one of the replicas is currently +unavailable—perhaps it is being rebooted to install a system update.
In a single-leader +configuration, if you want to continue processing writes, you may need to perform a failover (see +[“Handling Node Outages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_failover)). + +On the other hand, in a leaderless configuration, failover does not exist. +[Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to +all three replicas in parallel, and the two available replicas accept the write but the unavailable +replica misses it. Let’s say that it’s sufficient for two out of three replicas to +acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be +successful. The client simply ignores the fact that one of the replicas missed the write. + +![ddia 0612](/fig/ddia_0612.png) + +###### Figure 6-12. A quorum write, quorum read, and read repair after a node outage. + +Now imagine that the unavailable node comes back online, and clients start reading from it. Any +writes that happened while the node was down are missing from that node. Thus, if you read from that +node, you may get *stale* (outdated) values as responses. + +To solve that problem, when a client reads from the database, it doesn’t just send its request to +one replica: *read requests are also sent to several nodes in parallel*. The client may get +different responses from different nodes; for example, the up-to-date value from one node and a +stale value from another. 
+ +In order to tell which responses are up-to-date and which are outdated, every value that is written +needs to be tagged with a version number or timestamp, similarly to what we saw in +[“Last write wins (discarding concurrent writes)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lww). When a client receives multiple values in response to a read, it uses the +one with the greatest timestamp (even if that value was only returned by one replica, and several +other replicas returned older values). See [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent) for more details. + +### Catching up on missed writes + +The replication system should ensure that eventually all the data is copied to every replica. After +an unavailable node comes back online, how does it catch up on the writes that it missed? Several +mechanisms are used in Dynamo-style datastores: + +Read repair +: When a client makes a read from several nodes in parallel, it can detect any stale responses. + For example, in [Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from + replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale + value and writes the newer value back to that replica. This approach works well for values that are + frequently read. + +Hinted handoff +: If one replica is unavailable, another replica may store writes on its behalf in the form of + *hints*. When the replica that was supposed to receive those writes comes back, the replica + storing the hints sends them to the recovered replica, and then deletes the hints. 
This *handoff* + process helps bring replicas up-to-date even for values that are never read, and therefore not + handled by read repair. + +Anti-entropy +: In addition, there is a background process that periodically looks for differences in + the data between replicas and copies any missing data from one replica to another. Unlike the + replication log in leader-based replication, this *anti-entropy process* does not copy writes in + any particular order, and there may be a significant delay before data is copied. + +### Quorums for reading and writing + +In the example of [Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage), we considered the write to be successful +even though it was only processed on two out of three replicas. What if only one out of three +replicas accepted the write? How far can we push this? + +If we know that every successful write is guaranteed to be present on at least two out of three +replicas, that means at most one replica can be stale. Thus, if we read from at least two replicas, +we can be sure that at least one of the two is up to date. If the third replica is down or slow to +respond, reads can nevertheless continue returning an up-to-date value. + +More generally, if there are *n* replicas, every write must be confirmed by *w* nodes to be +considered successful, and we must query at least *r* nodes for each read. (In our example, +*n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* > +*n*, we expect to get an up-to-date value when reading, because at least one of the *r* nodes we’re +reading from must be up to date. Reads and writes that obey these *r* and *w* values are called +*quorum* reads and writes [[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gifford1979)]. +You can think of *r* and *w* as the minimum number of votes required for the read or write to be +valid. 
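The overlap argument can be checked by brute force for small clusters. The following sketch uses a hypothetical helper, `quorums_always_overlap`, which is not part of any database API. It enumerates every possible set of *w* replicas that might have acknowledged a write and every set of *r* replicas that a read might hear from, and tests whether each pair shares at least one replica.

```python
from itertools import combinations


def quorums_always_overlap(n, w, r):
    """Return True if every w-subset of n replicas intersects every r-subset."""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)   # an empty intersection is falsy
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


# The example from the text: n = 3, w = 2, r = 2.
print(quorums_always_overlap(3, 2, 2))   # w + r > n
print(quorums_always_overlap(3, 2, 1))   # w + r = n, so a read can miss the write

# For small n, the brute-force result agrees with the condition w + r > n.
agrees = all(
    quorums_always_overlap(n, w, r) == (w + r > n)
    for n in range(1, 6)
    for w in range(1, n + 1)
    for r in range(1, n + 1)
)
print(agrees)
```

The enumeration is exponential in *n*, so it is useful only as a sanity check. The condition *w* + *r* > *n* gives the same answer directly.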
+ +In Dynamo-style databases, the parameters *n*, *w*, and *r* are typically configurable. A common +choice is to make *n* an odd number (typically 3 or 5) and to set *w* = *r* = +(*n* + 1) / 2 (rounded up). However, you can vary the numbers as you see fit. +For example, a workload with few writes and many reads may benefit from setting *w* = *n* and +*r* = 1. This makes reads faster, but has the disadvantage that just one failed node causes all +database writes to fail. + +###### Note + +There may be more than *n* nodes in the cluster, but any given value is stored only on *n* +nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit +on one node. We will return to sharding in [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding). + +The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes +as follows: + +* If *w* < *n*, we can still process writes if a node is unavailable. +* If *r* < *n*, we can still process reads if a node is unavailable. +* With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable + node, like in [Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage). +* With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes. + This case is illustrated in [Figure 6-13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_overlap). + +Normally, reads and writes are always sent to all *n* replicas in parallel. The parameters *w* and +*r* determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success +before we consider the read or write to be successful. + +![ddia 0613](/fig/ddia_0613.png) + +###### Figure 6-13. 
If *w* + *r* > *n*, at least one of the *r* replicas you read from must have seen the most recent successful write. + +If fewer than the required *w* or *r* nodes are available, writes or reads return an error. A node +could be unavailable for many reasons: because the node is down (crashed, powered down), due to an +error executing the operation (can’t write because the disk is full), due to a network interruption +between the client and the node, or for any number of other reasons. We only care whether the node +returned a successful response and don’t need to distinguish between different kinds of fault. + +## Limitations of Quorum Consistency + +If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*, you can +generally expect every read to return the most recent value written for a key. This is the case because the +set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That +is, among the nodes you read there must be at least one node with the latest value (illustrated in +[Figure 6-13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_overlap)). + +Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, because that ensures +*w* + *r* > *n* while still tolerating up to *n*/2 (rounded down) node failures. But quorums are +not necessarily majorities—it only matters that the sets of nodes used by the read and write +operations overlap in at least one node. Other quorum assignments are possible, which allows some +flexibility in the design of distributed algorithms +[[51](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Howard2016_ch6)]. + +You may also set *w* and *r* to smaller numbers, so that *w* + *r* ≤ *n* (i.e., +the quorum condition is not satisfied). 
In this case, reads and writes will still be sent to *n* +nodes, but a smaller number of successful responses is required for the operation to succeed. + +With a smaller *w* and *r* you are more likely to read stale values, because it’s more likely that +your read didn’t include the node with the latest value. On the upside, this configuration allows +lower latency and higher availability: if there is a network interruption and many replicas become +unreachable, there’s a higher chance that you can continue processing reads and writes. Only after +the number of reachable replicas falls below *w* or *r* does the database become unavailable for +writing or reading, respectively. + +However, even with *w* + *r* > *n*, there are edge cases in which the consistency +properties can be confusing. Some scenarios include: + +* If a node carrying a new value fails, and its data is restored from a replica carrying an old + value, the number of replicas storing the new value may fall below *w*, breaking the quorum + condition. +* While a rebalancing is in progress, where some data is moved from one node to another (see + [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n* + replicas for a particular value. This can result in the read and write quorums no longer + overlapping. +* If a read is concurrent with a write operation, the read may or may not see the concurrently + written value. In particular, it’s possible for one read to see the new value, and a subsequent + read to see the old value, as we shall see in [“Linearizability and quorums”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_quorum_linearizable). 
+* If a write succeeded on some replicas but failed on others (for example because the disks on some + nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the + replicas where it succeeded. This means that if a write was reported as failed, subsequent reads + may or may not return the value from that write + [[52](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Blomstedt2012ricon)]. +* If the database uses timestamps from a real-time clock to determine which write is newer (as + Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a + faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lww). + We will discuss this in more detail in [“Relying on Synchronized Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_clocks_relying). +* If two writes occur concurrently, one of them might be processed first on one replica, and the + other might be processed first on another replica. This leads to a conflict, similarly to what we + saw for multi-leader replication (see [“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts)). We will return to this + topic in [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent). + +Thus, although quorums appear to guarantee that a read returns the latest written value, in practice +it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate +eventual consistency. 
The parameters *w* and *r* allow you to adjust the probability of stale values +being read [[53](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Bailis2014pbs)], +but it’s wise to not take them as absolute guarantees. + +### Monitoring staleness + +From an operational perspective, it’s important to monitor whether your databases are +returning up-to-date results. Even if your application can tolerate stale reads, you need to be +aware of the health of your replication. If it falls behind significantly, it should alert you so +that you can investigate the cause (for example, a problem in the network or an overloaded node). + +For leader-based replication, the database typically exposes metrics for the replication lag, which +you can feed into a monitoring system. This is possible because writes are applied to the leader and +to followers in the same order, and each node has a position in the replication log (the number of +writes it has applied locally). By subtracting a follower’s current position from the leader’s +current position, you can measure the amount of replication lag. + +However, in systems with leaderless replication, there is no fixed order in which writes are +applied, which makes monitoring more difficult. The number of hints that a replica stores for +handoff can be one measure of system health, but it’s difficult to interpret usefully +[[54](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Breck2019)]. +Eventual consistency is a deliberately vague guarantee, but for operability it’s important to be +able to quantify “eventual.” + +## Single-Leader vs. Leaderless Replication Performance + +A replication system based on a single leader can provide strong consistency guarantees that are +difficult or impossible to achieve in a leaderless system. 
However, as we have seen in +[“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag), reads in a leader-based replicated system can also return stale values if +you make them on an asynchronously updated follower. + +Reading from the leader ensures up-to-date responses, but it suffers from performance problems: + +* Read throughput is limited by the leader’s capacity to handle requests (in contrast with read + scaling, which distributes reads across asynchronously updated replicas that may return stale + values). +* If the leader fails, you have to wait for the fault to be detected, and for the failover to + complete before you can continue handling requests. Even if the failover process is very quick, + users will notice it because of the temporarily increased response times; if failover takes a long + time, the system is unavailable for its duration. +* The system is very sensitive to performance problems on the leader: if the leader is slow to + respond, e.g. due to overload or some resource contention, the increased response times + immediately affect users as well. + +A big advantage of a leaderless architecture is that it is more resilient against such issues. +Because there is no failover, and requests go to multiple replicas in parallel anyway, one replica +becoming slow or unavailable has very little impact on response times: the client simply uses the +responses from the other replicas that are faster to respond. Using the fastest responses is called +*request hedging*, and it can significantly reduce tail latency +[[55](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Dean2013_ch6)]. + +At its core, the resilience of a leaderless system comes from the fact that it doesn’t distinguish +between the normal case and the failure case.
This is especially helpful when handling so-called +*gray failures*, in which a node isn’t completely down, but running in a degraded state where it is +unusually slow to handle requests +[[56](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Huang2017_ch6)], +or when a node is simply overloaded (for example, if a node has been offline for a while, recovery +via hinted handoff can cause a lot of additional load). A leader-based system has to decide whether +the situation is bad enough to warrant a failover (which can itself cause further disruption), +whereas in a leaderless system that question doesn’t even arise. + +That said, leaderless systems can have performance problems as well: + +* Even though the system doesn’t need to perform failover, one replica does need to detect when + another replica is unavailable so that it can store hints about writes that the unavailable + replica missed. When the unavailable replica comes back, the handoff process needs to send it + those hints. This puts additional load on the replicas at a time when the system is already under + strain [[54](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Breck2019)]. +* The more replicas you have, the bigger the size of your quorums, and the more responses you have + to wait for before a request can complete. Even if you wait only for the fastest *r* or *w* + replicas to respond, and even if you make the requests in parallel, a bigger *r* or *w* increases + the chance that you hit a slow replica, increasing the overall response time (see + [“Use of Response Time Metrics”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_slo_sla)). +* A large-scale network interruption that disconnects a client from a large number of replicas can + make it impossible to form a quorum. 
Some leaderless databases offer a configuration option that + allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that + key (Riak and Dynamo call this a *sloppy quorum* + [[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6)]; + Cassandra and ScyllaDB call it *consistency level ANY*). There is no guarantee that subsequent + reads will see the written value, but depending on the application it may still be better than + having the write fail. + +Multi-leader replication can offer even greater resilience against network interruptions than +leaderless replication, since reads and writes only require communication with one leader, which can +be co-located with the client. However, since a write on one leader is propagated asynchronously to +the others, reads can be arbitrarily out-of-date. Quorum reads and writes provide a compromise: good +fault tolerance while also having a high likelihood of reading up-to-date data. + +### Multi-region operation + +We previously discussed cross-region replication as a use case for multi-leader replication (see +[“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader)). Leaderless replication is also suitable for +multi-region operation, since it is designed to tolerate conflicting concurrent writes, network +interruptions, and latency spikes. + +Cassandra and ScyllaDB implement their multi-region support within the normal leaderless model: the +client sends its writes directly to the replicas in all regions, and you can choose from a variety +of consistency levels that determine how many responses are required for a request to be successful. +For example, you can request a quorum across the replicas in all the regions, a separate quorum in +each of the regions, or a quorum only in the client’s local region. 
A local quorum avoids having to +wait for slow requests to other regions, but it is also more likely to return stale results. + +Riak keeps all communication between clients and database nodes local to one region, so *n* +describes the number of replicas within one region. Cross-region replication between +database clusters happens asynchronously in the background, in a style that is similar to +multi-leader replication. + +## Detecting Concurrent Writes + +Like with multi-leader replication, leaderless databases allow concurrent writes to the same key, +resulting in conflicts that need to be resolved. Such conflicts may occur as the writes happen, but +not always: they could also be detected later during read repair, hinted handoff, or anti-entropy. + +The problem is that events may arrive in a different order at different nodes, due to variable +network delays and partial failures. For example, [Figure 6-14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_concurrency) shows two clients, +A and B, simultaneously writing to a key *X* in a three-node datastore: + +* Node 1 receives the write from A, but never receives the write from B due to a transient + outage. +* Node 2 first receives the write from A, then the write from B. +* Node 3 first receives the write from B, then the write from A. + +![ddia 0614](/fig/ddia_0614.png) + +###### Figure 6-14. Concurrent writes in a Dynamo-style datastore: there is no well-defined ordering. + +If each node simply overwrote the value for a key whenever it received a write request from a +client, the nodes would become permanently inconsistent, as shown by the final *get* request in +[Figure 6-14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other +nodes think that the value is A. 
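The divergence in this scenario can be reproduced in a few lines. This is a deliberately naive sketch, not how a real datastore behaves: `apply_in_order` is a hypothetical function modeling a node that overwrites its stored value on every incoming write, and the arrival orders are the ones listed above.

```python
def apply_in_order(writes):
    """Naive replica: each incoming write simply overwrites the stored value."""
    value = None
    for write in writes:
        value = write
    return value


# Order in which each node receives the writes from clients A and B.
arrival_order = {
    "node 1": ["A"],        # never receives the write from B
    "node 2": ["A", "B"],
    "node 3": ["B", "A"],
}

final_values = {node: apply_in_order(writes) for node, writes in arrival_order.items()}
print(final_values)
```

Nodes 1 and 3 end up with A while node 2 ends up with B. Nothing in the overwrite rule will ever bring them back into agreement.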
+ +In order to become eventually consistent, the replicas should converge toward the same value. For +this, we can use any of the conflict resolution mechanisms we previously discussed in +[“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts), such as last-write-wins (used by Cassandra and ScyllaDB), +manual resolution, or CRDTs (described in [“CRDTs and Operational Transformation”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_crdts), and used by Riak). + +Last-write-wins is easy to implement: each write is tagged with a timestamp, and a value with a +higher timestamp always overwrites a value with a lower timestamp. However, a timestamp doesn’t tell +you whether two values are actually conflicting (i.e., they were written concurrently) or not (they +were written one after another). If you want to resolve conflicts explicitly, the system needs to +take more care to detect concurrent writes. + +### The “happens-before” relation and concurrency + +How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look +at some examples: + +* In [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before* + B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s + operation builds upon A’s operation, so B’s operation must have happened later. + We also say that B is *causally dependent* on A. 
+* On the other hand, the two writes in [Figure 6-14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_concurrency) are concurrent: when each + client starts the operation, it does not know that another client is also performing an operation + on the same key. Thus, there is no causal dependency between the operations. + +An operation A *happens before* another operation B if B knows about A, or depends on A, or builds +upon A in some way. Whether one operation happens before another operation is the key to defining +what concurrency means. In fact, we can simply say that two operations are *concurrent* if neither +happens before the other (i.e., neither knows about the other) +[[57](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lamport1978_ch6)]. + +Thus, whenever you have two operations A and B, there are three possibilities: either A happened +before B, or B happened before A, or A and B are concurrent. What we need is an algorithm to tell us +whether two operations are concurrent or not. If one operation happened before another, the later +operation should overwrite the earlier operation, but if the operations are concurrent, we have a +conflict that needs to be resolved. + +# Concurrency, Time, and Relativity + +It may seem that two operations should be called concurrent if they occur “at the same time”—but +in fact, it is not important whether they literally overlap in time. Because of problems with clocks +in distributed systems, it is actually quite difficult to tell whether two things happened +at exactly the same time—an issue we will discuss in more detail in [Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed). 
+ +For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if +they are both unaware of each other, regardless of the physical time at which they occurred. People +sometimes make a connection between this principle and the special theory of relativity in physics +[[57](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lamport1978_ch6)], which introduced the idea that +information cannot travel faster than the speed of light. Consequently, two events that occur some +distance apart cannot possibly affect each other if the time between the events is shorter than the +time it takes light to travel the distance between them. + +In computer systems, two operations might be concurrent even though the speed of light would in +principle have allowed one operation to affect the other. For example, if the network was slow or +interrupted at the time, two operations can occur some time apart and still be concurrent, because +the network problems prevented one operation from being able to know about the other. + +### Capturing the happens-before relationship + +Let’s look at an algorithm that determines whether two operations are concurrent, or whether one +happened before another. To keep things simple, let’s start with a database that has only one +replica. Once we have worked out how to do this on a single replica, we can generalize the approach +to a leaderless database with multiple replicas. + +[Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) shows two clients concurrently adding items to the same +shopping cart. (If that example strikes you as too inane, imagine instead two air traffic +controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is +empty. Between them, the clients make five writes to the database: + +1. 
Client 1 adds `milk` to the cart. This is the first write to that key, so the server successfully + stores it and assigns it version 1. The server also echoes the value back to the client, along + with the version number. +2. Client 2 adds `eggs` to the cart, not knowing that client 1 concurrently added `milk` (client 2 + thought that its `eggs` were the only item in the cart). The server assigns version 2 to this + write, and stores `eggs` and `milk` as two separate values (siblings). It then returns *both* + values to the client, along with the version number of 2. +3. Client 1, oblivious to client 2’s write, wants to add `flour` to the cart, so it thinks the + current cart contents should be `[milk, flour]`. It sends this value to the server, along with + the version number 1 that the server gave client 1 previously. The server can tell from the + version number that the write of `[milk, flour]` supersedes the prior value of `[milk]` but that + it is concurrent with `[eggs]`. Thus, the server assigns version 3 to `[milk, flour]`, overwrites + the version 1 value `[milk]`, but keeps the version 2 value `[eggs]` and returns both remaining + values to the client. +4. Meanwhile, client 2 wants to add `ham` to the cart, unaware that client 1 just added `flour`. + Client 2 received the two values `[milk]` and `[eggs]` from the server in the last response, so + the client now merges those values and adds `ham` to form a new value, `[eggs, milk, ham]`. It + sends that value to the server, along with the previous version number 2. The server detects that + version 2 overwrites `[eggs]` but is concurrent with `[milk, flour]`, so the two remaining + values are `[milk, flour]` with version 3, and `[eggs, milk, ham]` with version 4. +5. Finally, client 1 wants to add `bacon`. 
It previously received `[milk, flour]` and `[eggs]` from + the server at version 3, so it merges those, adds `bacon`, and sends the final value + `[milk, flour, eggs, bacon]` to the server, along with the version number 3. This overwrites + `[milk, flour]` (note that `[eggs]` was already overwritten in the last step) but is concurrent + with `[eggs, milk, ham]`, so the server keeps those two concurrent values. + +![ddia 0615](/fig/ddia_0615.png) + +###### Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart. + +The dataflow between the operations in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) is illustrated +graphically in [Figure 6-16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causal_dependencies). The arrows indicate which operation +*happened before* which other operation, in the sense that the later operation *knew about* or +*depended on* the earlier one. In this example, the clients are never fully up to date with the data +on the server, since there is always another operation going on concurrently. But old versions of +the value do get overwritten eventually, and no writes are lost. + +![ddia 0616](/fig/ddia_0616.png) + +###### Figure 6-16. Graph of causal dependencies in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single). + +Note that the server can determine whether two operations are concurrent by looking at the version +numbers—it does not need to interpret the value itself (so the value could be any data +structure). 
The algorithm works as follows: + +* The server maintains a version number for every key, increments the version number every time that + key is written, and stores the new version number along with the value written. +* When a client reads a key, the server returns all siblings, i.e., all values that have not been + overwritten, as well as the latest version number. A client must read a key before writing. +* When a client writes a key, it must include the version number from the prior read, and it must + merge together all values that it received in the prior read, e.g. using a CRDT or by asking the + user. The response from a write request is like a read, returning all siblings, which allows us to + chain several writes like in the shopping cart example. +* When the server receives a write with a particular version number, it can overwrite all values + with that version number or below (since it knows that they have been merged into the new value), + but it must keep all values with a higher version number (because those values are concurrent with + the incoming write). + +When a write includes the version number from a prior read, that tells us which previous state the +write is based on. If you make a write without including a version number, it is concurrent with all +other writes, so it will not overwrite anything—it will just be returned as one of the values +on subsequent reads. + +### Version vectors + +The example in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) used only a single replica. How does the +algorithm change when there are multiple replicas, but no leader? 
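Before generalizing to multiple replicas, it may help to see the single-replica algorithm as code. The following is a minimal sketch under stated assumptions, not the implementation of any particular database: the `VersionedKey` class and its `based_on` parameter are hypothetical names, and the client-side merging of siblings is done by hand in the replay, using the values from the shopping cart example.

```python
class VersionedKey:
    """One key on a single replica, using per-key version numbers."""

    def __init__(self):
        self.version = 0     # latest version number assigned to this key
        self.siblings = {}   # version number -> value not yet overwritten

    def read(self):
        """Return the latest version number and all sibling values."""
        return self.version, [self.siblings[v] for v in sorted(self.siblings)]

    def write(self, value, based_on=None):
        """Store `value`; `based_on` is the version from the client's prior read."""
        if based_on is not None:
            # Values at or below `based_on` were merged into the new value by
            # the client, so they can be dropped. Higher versions are
            # concurrent with this write and must be kept as siblings.
            self.siblings = {v: x for v, x in self.siblings.items() if v > based_on}
        self.version += 1
        self.siblings[self.version] = value
        return self.read()   # a write response looks like a read


# Replay of the five writes in the shopping cart example.
cart = VersionedKey()
v1, _ = cart.write(["milk"])                                  # client 1
v2, seen_by_2 = cart.write(["eggs"])                          # client 2
v3, seen_by_1 = cart.write(["milk", "flour"], based_on=v1)    # client 1
v4, _ = cart.write(["eggs", "milk", "ham"], based_on=v2)      # client 2
v5, remaining = cart.write(["milk", "flour", "eggs", "bacon"], based_on=v3)  # client 1
print(v5, remaining)
```

After the fifth write the server holds two siblings, `[eggs, milk, ham]` at version 4 and `[milk, flour, eggs, bacon]` at version 5, as in the walkthrough. A write that omits `based_on` overwrites nothing, which corresponds to the case of writing without a version number.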
+ +[Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) uses a single version number to capture dependencies between +operations, but that is not sufficient when there are multiple replicas accepting writes +concurrently. Instead, we need to use a version number *per replica* as well as per key. Each +replica increments its own version number when processing a write, and also keeps track of the +version numbers it has seen from each of the other replicas. This information indicates which values +to overwrite and which values to keep as siblings. + +The collection of version numbers from all the replicas is called a *version vector* +[[58](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#ParkerJr1983)]. +A few variants of this idea are in use, but the most interesting is probably the *dotted version +vector* +[[59](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Preguica2010), +[60](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Manepalli2022)], +which is used in Riak 2.0 +[[61](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Cribbs2014), +[62](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Brown2015)]. +We won’t go into the details, but the way it works is quite similar to what we saw in our cart example. + +Like the version numbers in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single), version vectors are sent from the +database replicas to clients when values are read, and need to be sent back to the database when a +value is subsequently written. 
(Riak encodes the version vector as a string that it calls *causal +context*.) The version vector allows the database to distinguish between overwrites and concurrent +writes. + +The version vector also ensures that it is safe to read from one replica and subsequently write back +to another replica. Doing so may result in siblings being created, but no data is lost as long as +siblings are merged correctly. + +# Version vectors and vector clocks + +A *version vector* is sometimes also called a *vector clock*, even though they are not quite the +same. The difference is subtle—please see the references for details +[[60](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Manepalli2022), +[63](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Baquero2011), +[64](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Schwarz1994)]. In brief, when +comparing the state of replicas, version vectors are the right data structure to use. + +# Summary + +In this chapter we looked at the issue of replication. Replication can serve several purposes: + +*High availability* +: Keeping the system running, even when one machine (or several machines, a + zone, or even an entire region) goes down + +*Disconnected operation* +: Allowing an application to continue working when there is a network + interruption + +*Latency* +: Placing data geographically close to users, so that users can interact with it faster + +*Scalability* +: Being able to handle a higher volume of reads than a single machine could handle, + by performing reads on replicas + +Despite being a simple goal—keeping a copy of the same data on several machines—replication turns out +to be a remarkably tricky problem. It requires carefully thinking about concurrency and about all +the things that can go wrong, and dealing with the consequences of those faults. 
At a minimum, we +need to deal with unavailable nodes and network interruptions (and that’s not even considering the +more insidious kinds of fault, such as silent data corruption due to software bugs or hardware +errors). + +We discussed three main approaches to replication: + +*Single-leader replication* +: Clients send all writes to a single node (the leader), which sends a + stream of data change events to the other replicas (followers). Reads can be performed on any + replica, but reads from followers might be stale. + +*Multi-leader replication* +: Clients send each write to one of several leader nodes, any of which + can accept writes. The leaders send streams of data change events to each other and to any + follower nodes. + +*Leaderless replication* +: Clients send each write to several nodes, and read from several nodes + in parallel in order to detect and correct nodes with stale data. + +Each approach has advantages and disadvantages. Single-leader replication is popular because it is +fairly easy to understand and it offers strong consistency. Multi-leader and leaderless replication +can be more robust in the presence of faulty nodes, network interruptions, and latency spikes—at the +cost of requiring conflict resolution and providing weaker consistency guarantees. + +Replication can be synchronous or asynchronous, which has a profound effect on the system behavior +when there is a fault. Although asynchronous replication can be fast when the system is running +smoothly, it’s important to figure out what happens when replication lag increases and servers fail. +If a leader fails and you promote an asynchronously updated follower to be the new leader, recently +committed data may be lost. 
+ +We looked at some strange effects that can be caused by replication lag, and we discussed a few +consistency models which are helpful for deciding how an application should behave under replication +lag: + +*Read-after-write consistency* +: Users should always see data that they submitted themselves. + +*Monotonic reads* +: After users have seen the data at one point in time, they shouldn’t later see + the data from some earlier point in time. + +*Consistent prefix reads* +: Users should see the data in a state that makes causal sense: + for example, seeing a question and its reply in the correct order. + +Finally, we discussed how multi-leader and leaderless replication ensure that all replicas +eventually converge to a consistent state: by using a version vector or similar algorithm to detect +which writes are concurrent, and by using a conflict resolution algorithm such as a CRDT to merge +the concurrently written values. Last-write-wins and manual conflict resolution are also possible. + +This chapter has assumed that every replica stores a full copy of the whole database, which is +unrealistic for large datasets. In the next chapter we will look at *sharding*, which allows each +machine to store only a subset of the data. + +##### Footnotes + +##### References + +[[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lindsay1979_ch6-marker)] B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. +Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. +[Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). +IBM Research, Research Report RJ2571(33471), July 1979. +Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD) + +[[2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gryp2020-marker)] Kenny Gryp. +[MySQL Terminology +Updates](https://dev.mysql.com/blog-archive/mysql-terminology-updates/). 
*dev.mysql.com*, July 2020. +Archived at [perma.cc/S62G-6RJ2](https://perma.cc/S62G-6RJ2) + +[[3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Oracle2019-marker)] Oracle Corporation. +[Oracle +(Active) Data Guard 19c: Real-Time Data Protection and Availability](https://www.oracle.com/technetwork/database/availability/dg-adg-technical-overview-wp-5347548.pdf). White Paper, *oracle.com*, March 2019. +Archived at [perma.cc/P5ST-RPKE](https://perma.cc/P5ST-RPKE) + +[[4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#AlwaysOn2012-marker)] Microsoft. +[What +is an Always On availability group?](https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server) *learn.microsoft.com*, September 2024. +Archived at [perma.cc/ABH6-3MXF](https://perma.cc/ABH6-3MXF) + +[[5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Elhemali2022_ch6-marker)] Mostafa Elhemali, Niall Gallagher, Nicholas +Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu +Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, +Doug Terry, and Akshat Vig. +[Amazon DynamoDB: A Scalable, +Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical +Conference* (ATC), July 2022. + +[[6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Taft2020_ch6-marker)] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan +VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul +Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. 
+[CockroachDB: The Resilient +Geo-Distributed SQL Database](https://dl.acm.org/doi/abs/10.1145/3318464.3386134). At *ACM SIGMOD International Conference on Management of +Data* (SIGMOD), pages 1493–1509, June 2020. +[doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134) + +[[7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Huang2020_ch6-marker)] Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, +Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, +Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan +Pei, and Xin Tang. +[TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf). +*Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084. +[doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535) + +[[8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Knodel2023-marker)] Mallory Knodel and Niels ten Oever. +[Terminology, Power, and +Inclusive Language in Internet-Drafts and RFCs](https://www.ietf.org/archive/id/draft-knodel-terminology-14.html). *IETF Internet-Draft*, August 2023. +Archived at [perma.cc/5ZY9-725E](https://perma.cc/5ZY9-725E) + +[[9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2018-marker)] Buck Hodges. +[Postmortem: VSTS 4 September 2018](https://devblogs.microsoft.com/devopsservice/?p=17485). +*devblogs.microsoft.com*, September 2018. +Archived at [perma.cc/ZF5R-DYZS](https://perma.cc/ZF5R-DYZS) + +[[10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Morling2024_ch6-marker)] Gunnar Morling. +[Leader +Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024. 
+Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y) + +[[11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Chandramohan2024-marker)] Vignesh Chandramohan, Rohan Desai, and Chris Riccomini. +[SlateDB Manifest +Design](https://github.com/slatedb/slatedb/blob/main/rfcs/0001-manifest.md). *github.com*, May 2024. +Archived at [perma.cc/8EUY-P32Z](https://perma.cc/8EUY-P32Z) + +[[12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kelvich2022-marker)] Stas Kelvich. +[Why does Neon use Paxos instead of Raft, and what’s the +difference?](https://neon.tech/blog/paxos) *neon.tech*, August 2022. +Archived at [perma.cc/SEZ4-2GXU](https://perma.cc/SEZ4-2GXU) + +[[13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Fontaine2021-marker)] Dimitri Fontaine. +[An +introduction to the pg\_auto\_failover project](https://tapoueh.org/blog/2021/11/an-introduction-to-the-pg_auto_failover-project/). *tapoueh.org*, November 2021. +Archived at [perma.cc/3WH5-6BAF](https://perma.cc/3WH5-6BAF) + +[[14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Newland2012-marker)] Jesse Newland. +[GitHub +availability this week](https://github.blog/news-insights/the-library/github-availability-this-week/). *github.blog*, September 2012. +Archived at [perma.cc/3YRF-FTFJ](https://perma.cc/3YRF-FTFJ) + +[[15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Imbriaco2012_ch6-marker)] Mark Imbriaco. +[Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/). +*github.blog*, December 2012. 
+Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ) + +[[16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hugg2015-marker)] John Hugg. +[‘All In’ with Determinism for Performance and +Testing in Distributed Systems](https://www.youtube.com/watch?v=gJRj3vJL4wE). At *Strange Loop*, September 2015. + +[[17](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Suzuki2017_ch6-marker)] Hironobu Suzuki. +[The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017. + +[[18](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2012-marker)] Amit Kapila. +[WAL +Internals of PostgreSQL](https://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf). At *PostgreSQL Conference* (PGCon), May 2012. +Archived at [perma.cc/6225-3SUX](https://perma.cc/6225-3SUX) + +[[19](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2023-marker)] Amit Kapila. +[Evolution +of Logical Replication](https://amitkapila16.blogspot.com/2023/09/evolution-of-logical-replication.html). *amitkapila16.blogspot.com*, September 2023. +Archived at [perma.cc/F9VX-JLER](https://perma.cc/F9VX-JLER) + +[[20](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Petchimuthu2021-marker)] Aru Petchimuthu. +[Upgrade +your Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL database, Part 2: Using the pglogical +extension](https://aws.amazon.com/blogs/database/part-2-upgrade-your-amazon-rds-for-postgresql-database-using-the-pglogical-extension/). *aws.amazon.com*, August 2021. 
+Archived at [perma.cc/RXT8-FS2T](https://perma.cc/RXT8-FS2T) + +[[21](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Sharma2015te_ch6-marker)] Yogeshwer Sharma, Philippe Ajoux, Petchean +Ang, David Callies, Abhishek Choudhary, Laurent Demailly, Thomas Fersch, Liat Atsmon Guz, Andrzej +Kotulski, Sachin Kulkarni, Sanjeev Kumar, Harry Li, Jun Li, Evgeniy Makeev, Kowshik Prakasam, +Robbert van Renesse, Sabyasachi Roy, Pratyush Seth, Yee Jiun Song, Benjamin Wester, Kaushik +Veeraraghavan, and Peter Xie. +[Wormhole: +Reliable Pub-Sub to Support Geo-Replicated Internet Services](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf). At *12th USENIX +Symposium on Networked Systems Design and Implementation* (NSDI), May 2015. + +[[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry2011-marker)] Douglas B. Terry. +[Replicated +Data Consistency Explained Through Baseball](https://www.microsoft.com/en-us/research/publication/replicated-data-consistency-explained-through-baseball/). Microsoft Research, Technical Report +MSR-TR-2011-137, October 2011. +Archived at [perma.cc/F4KZ-AR38](https://perma.cc/F4KZ-AR38) + +[[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry1994-marker)] Douglas B. Terry, Alan J. Demers, Karin Petersen, +Mike J. Spreitzer, Marvin M. Theimer, and Brent B. Welch. +[Session Guarantees +for Weakly Consistent Replicated Data](https://csis.pace.edu/~marchese/CS865/Papers/SessionGuaranteesPDIS.pdf). At *3rd International Conference on Parallel and +Distributed Information Systems* (PDIS), September 1994. +[doi:10.1109/PDIS.1994.331722](https://doi.org/10.1109/PDIS.1994.331722) + +[[24](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Vogels2008-marker)] Werner Vogels. 
+[Eventually Consistent](https://queue.acm.org/detail.cfm?id=1466448). +*ACM Queue*, volume 6, issue 6, pages 14–19, October 2008. +[doi:10.1145/1466443.1466448](https://doi.org/10.1145/1466443.1466448) + +[[25](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Willison2022-marker)] Simon Willison. +[Reply to: “My thoughts about Fly.io (so +far) and other newish technology I’m getting into”](https://news.ycombinator.com/item?id=31434055). *news.ycombinator.com*, May 2022. +Archived at [perma.cc/ZRV4-WWV8](https://perma.cc/ZRV4-WWV8) + +[[26](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Tharakan2020-marker)] Nithin Tharakan. +[Scaling Bitbucket’s +Database](https://www.atlassian.com/blog/bitbucket/scaling-bitbuckets-database). *atlassian.com*, October 2020. +Archived at [perma.cc/JAB7-9FGX](https://perma.cc/JAB7-9FGX) + +[[27](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Pratchett1991-marker)] Terry Pratchett. *Reaper Man: A Discworld +Novel*. Victor Gollancz, 1991. ISBN: 978-0-575-04979-6 + +[[28](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Bailis2014coord_ch6-marker)] Peter Bailis, Alan Fekete, Michael J. +Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. +[Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). +*Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014. +[doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509) + +[[29](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Raja2022-marker)] Yaser Raja and Peter Celentano. +[PostgreSQL +bi-directional replication using pglogical](https://aws.amazon.com/blogs/database/postgresql-bi-directional-replication-using-pglogical/). 
*aws.amazon.com*, January 2022. +Archived at + +[[30](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2012-marker)] Robert Hodges. +[If +You \*Must\* Deploy Multi-Master Replication, Read This First](https://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html). *scale-out-blog.blogspot.com*, +April 2012. Archived at [perma.cc/C2JN-F6Y8](https://perma.cc/C2JN-F6Y8) + +[[31](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#HBase7709-marker)] Lars Hofhansl. +[HBASE-7709: Infinite Loop Possible in +Master/Master Replication](https://issues.apache.org/jira/browse/HBASE-7709). *issues.apache.org*, January 2013. +Archived at [perma.cc/24G2-8NLC](https://perma.cc/24G2-8NLC) + +[[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DayRichter2010-marker)] John Day-Richter. +[What’s +Different About the New Google Docs: Making Collaboration Fast](https://drive.googleblog.com/2010/09/whats-different-about-new-google-docs.html). *drive.googleblog.com*, +September 2010. Archived at [perma.cc/5TL8-TSJ2](https://perma.cc/5TL8-TSJ2) + +[[33](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Wallace2019-marker)] Evan Wallace. +[How Figma’s +multiplayer technology works](https://www.figma.com/blog/how-figmas-multiplayer-technology-works/). *figma.com*, October 2019. +Archived at [perma.cc/L49H-LY4D](https://perma.cc/L49H-LY4D) + +[[34](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Artman2023-marker)] Tuomas Artman. +[Scaling the Linear Sync Engine](https://linear.app/blog/scaling-the-linear-sync-engine). +*linear.app*, June 2023. + +[[35](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Saafan2024-marker)] Amr Saafan. 
+[Why Sync +Engines Might Be the Future of Web Applications](https://www.nilebits.com/blog/2024/09/sync-engines-future-web-applications/). *nilebits.com*, September 2024. +Archived at [perma.cc/5N73-5M3V](https://perma.cc/5N73-5M3V) + +[[36](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hagoel2024-marker)] Isaac Hagoel. +[Are Sync +Engines The Future of Web Applications?](https://dev.to/isaachagoel/are-sync-engines-the-future-of-web-applications-1bbi) *dev.to*, July 2024. +Archived at [perma.cc/R9HF-BKKL](https://perma.cc/R9HF-BKKL) + +[[37](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Jayakar2024-marker)] Sujay Jayakar. +[A Map of Sync](https://stack.convex.dev/a-map-of-sync). *stack.convex.dev*, +October 2024. Archived at [perma.cc/82R3-H42A](https://perma.cc/82R3-H42A) + +[[38](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Feyerke2013-marker)] Alex Feyerke. +[Designing Offline-First Web Apps](https://alistapart.com/article/offline-first/). +*alistapart.com*, December 2013. +Archived at [perma.cc/WH7R-S2DS](https://perma.cc/WH7R-S2DS) + +[[39](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kleppmann2019_ch6-marker)] Martin Kleppmann, +Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. +[Local-first software: You own your data, in +spite of the cloud](https://www.inkandswitch.com/local-first/). At *ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and +Reflections on Programming and Software* (Onward!), October 2019, pages 154–178. +[doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737) + +[[40](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kleppmann2024lofi-marker)] Martin Kleppmann. 
+[The past, present, and +future of local-first](https://martin.kleppmann.com/2024/05/30/local-first-conference.html). At *Local-First Conference*, May 2024. + +[[41](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hofmeyr2024-marker)] Conrad Hofmeyr. +[API +Calling is to Sync Engines as jQuery is to React](https://www.powersync.com/blog/api-calling-is-to-sync-engines-as-jquery-is-to-react). *powersync.com*, November 2024. +Archived at [perma.cc/2FP9-7WJJ](https://perma.cc/2FP9-7WJJ) + +[[42](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#vanHardenberg2020-marker)] Peter van Hardenberg and Martin Kleppmann. +[PushPin: Towards +Production-Quality Peer-to-Peer Collaboration](https://martin.kleppmann.com/papers/pushpin-papoc20.pdf). At *7th Workshop on Principles and Practice +of Consistency for Distributed Data* (PaPoC), April 2020. +[doi:10.1145/3380787.3393683](https://doi.org/10.1145/3380787.3393683) + +[[43](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kawell1988-marker)] Leonard Kawell, Jr., Steven Beckhardt, Timothy +Halvorsen, Raymond Ozzie, and Irene Greif. +[Replicated document management in a group +communication system](https://dl.acm.org/doi/pdf/10.1145/62266.1024798). At *ACM Conference on Computer-Supported Cooperative Work* (CSCW), +September 1988. +[doi:10.1145/62266.1024798](https://doi.org/10.1145/62266.1024798) + +[[44](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Pusch2019-marker)] Ricky Pusch. +[Explaining how fighting games use delay-based and +rollback netcode](https://words.infil.net/w02-netcode.html). *words.infil.net* and *arstechnica.com*, October 2019. 
+Archived at [perma.cc/DE7W-RDJ8](https://perma.cc/DE7W-RDJ8) + +[[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6-marker)] Giuseppe DeCandia, Deniz Hastorun, Madan +Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, +Peter Vosshall, and Werner Vogels. +[Dynamo: Amazon’s +Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating Systems Principles* +(SOSP), October 2007. +[doi:10.1145/1323293.1294281](https://doi.org/10.1145/1323293.1294281) + +[[46](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Shapiro2011-marker)] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and +Marek Zawirski. [A Comprehensive Study +of Convergent and Commutative Replicated Data Types](https://inria.hal.science/inria-00555588v1/document). INRIA Research Report no. 7506, January +2011. + +[[47](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Sun1998-marker)] Chengzheng Sun and Clarence Ellis. +[Operational +Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=aef660812c5a9c4d3f06775f9455eeb090a4ff0f). At +*ACM Conference on Computer Supported Cooperative Work* (CSCW), November 1998. +[doi:10.1145/289444.289469](https://doi.org/10.1145/289444.289469) + +[[48](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gentle2025-marker)] Joseph Gentle and Martin Kleppmann. +[Collaborative Text Editing with Eg-walker: Better, +Faster, Smaller](https://arxiv.org/abs/2409.14252). At *20th European Conference on Computer Systems* (EuroSys), March 2025. 
+[doi:10.1145/3689031.3696076](https://doi.org/10.1145/3689031.3696076) + +[[49](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Shukla2018-marker)] Dharma Shukla. +[Azure +Cosmos DB: Pushing the frontier of globally distributed databases](https://azure.microsoft.com/en-us/blog/azure-cosmos-db-pushing-the-frontier-of-globally-distributed-databases/). *azure.microsoft.com*, September 2018. +Archived at [perma.cc/UT3B-HH6R](https://perma.cc/UT3B-HH6R) + +[[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gifford1979-marker)] David K. Gifford. +[Weighted Voting for +Replicated Data](https://www.cs.cmu.edu/~15-749/READINGS/required/availability/gifford79.pdf). At *7th ACM Symposium on Operating Systems Principles* (SOSP), December 1979. +[doi:10.1145/800215.806583](https://doi.org/10.1145/800215.806583) + +[[51](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Howard2016_ch6-marker)] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. +[Flexible Paxos: +Quorum Intersection Revisited](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.OPODIS.2016.25). At *20th International Conference on Principles of Distributed +Systems* (OPODIS), December 2016. +[doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25) + +[[52](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Blomstedt2012ricon-marker)] Joseph Blomstedt. +[Bringing Consistency to Riak](https://vimeo.com/51973001). At *RICON West*, +October 2012. + +[[53](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Bailis2014pbs-marker)] Peter Bailis, Shivaram Venkataraman, +Michael J. Franklin, Joseph M. Hellerstein, and Ion Stoica. 
+[Quantifying eventual consistency with +PBS](http://www.bailis.org/papers/pbs-vldbj2014.pdf). *The VLDB Journal*, volume 23, pages 279–302, April 2014. +[doi:10.1007/s00778-013-0330-1](https://doi.org/10.1007/s00778-013-0330-1) + +[[54](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Breck2019-marker)] Colin Breck. +[Shared-Nothing +Architectures for Server Replication and Synchronization](https://blog.colinbreck.com/shared-nothing-architectures-for-server-replication-and-synchronization/). *blog.colinbreck.com*, December 2019. +Archived at [perma.cc/48P3-J6CJ](https://perma.cc/48P3-J6CJ) + +[[55](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Dean2013_ch6-marker)] Jeffrey Dean and Luiz André Barroso. +[The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/). +*Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013. +[doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794) + +[[56](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Huang2017_ch6-marker)] Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. +Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. +[Gray +Failure: The Achilles’ Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in +Operating Systems* (HotOS), May 2017. +[doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005) + +[[57](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lamport1978_ch6-marker)] Leslie Lamport. +[Time, +Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, +volume 21, issue 7, pages 558–565, July 1978. 
+[doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563) + +[[58](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#ParkerJr1983-marker)] D. Stott Parker Jr., Gerald J. Popek, Gerard +Rudisin, Allen Stoughton, Bruce J. Walker, Evelyn Walton, Johanna M. Chow, David Edwards, Stephen +Kiser, and Charles Kline. +[Detection of +Mutual Inconsistency in Distributed Systems](https://pages.cs.wisc.edu/~remzi/Classes/739/Papers/parker83detection.pdf). *IEEE Transactions on Software Engineering*, +volume SE-9, issue 3, pages 240–247, May 1983. +[doi:10.1109/TSE.1983.236733](https://doi.org/10.1109/TSE.1983.236733) + +[[59](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Preguica2010-marker)] Nuno Preguiça, Carlos Baquero, Paulo Sérgio +Almeida, Victor Fonte, and Ricardo Gonçalves. [Dotted +Version Vectors: Logical Clocks for Optimistic Replication](https://arxiv.org/abs/1011.5808). arXiv:1011.5808, November 2010. + +[[60](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Manepalli2022-marker)] Giridhar Manepalli. +[Clocks and Causality - Ordering Events +in Distributed Systems](https://www.exhypothesi.com/clocks-and-causality/). *exhypothesi.com*, November 2022. +Archived at [perma.cc/8REU-KVLQ](https://perma.cc/8REU-KVLQ) + +[[61](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Cribbs2014-marker)] Sean Cribbs. +[A Brief History of Time in Riak](https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak). +At *RICON*, October 2014. Archived at [perma.cc/7U9P-6JFX](https://perma.cc/7U9P-6JFX) + +[[62](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Brown2015-marker)] Russell Brown. 
+[Vector +Clocks Revisited Part 2: Dotted Version Vectors](https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/). *riak.com*, November 2015. +Archived at [perma.cc/96QP-W98R](https://perma.cc/96QP-W98R) + +[[63](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Baquero2011-marker)] Carlos Baquero. +[Version +Vectors Are Not Vector Clocks](https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/). *haslab.wordpress.com*, July 2011. +Archived at [perma.cc/7PNU-4AMG](https://perma.cc/7PNU-4AMG) + +[[64](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Schwarz1994-marker)] Reinhard Schwarz and Friedemann Mattern. +[Detecting Causal +Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). *Distributed +Computing*, volume 7, issue 3, pages 149–174, March 1994. +[doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859) diff --git a/content/en/ch7.md b/content/en/ch7.md index f017401..8b0030f 100644 --- a/content/en/ch7.md +++ b/content/en/ch7.md @@ -1,160 +1,987 @@ --- -title: "7. Transactions" -linkTitle: "7. Transactions" +title: "7. Sharding" weight: 207 breadcrumbs: false --- -![](/img/ch7.png) - -> *Some authors have claimed that general two-phase commit is too expensive to support, because of the performance or availability problems that it brings. We believe it is better to have application programmers deal with performance problems due to overuse of transac‐ tions as bottlenecks arise, rather than always coding around the lack of transactions.* +> *Clearly, we must break away from the sequential and not limit the computers. We must state +> definitions and provide for priorities and descriptions of data. 
We must state relationships, not +> procedures.* > -> ​ — James Corbett et al., *Spanner: Google’s Globally-Distributed Database* (2012) +> Grace Murray Hopper, *Management and the Computer of the Future* (1962) ------- +A distributed database typically distributes data across nodes in two ways: -In the harsh reality of data systems, many things can go wrong: +1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in + [Chapter 6](/en/ch6#ch_replication). +2. If we don’t want every node to store all the data, we can split up a large amount of data into + smaller *shards* or *partitions*, and store different shards on different nodes. We’ll discuss + sharding in this chapter. -- The database software or hardware may fail at any time (including in the middle of a write operation). +Normally, shards are defined in such a way that each piece of data (each record, row, or document) +belongs to exactly one shard. There are various ways of achieving this, which we discuss in depth in +this chapter. In effect, each shard is a small database of its own, although some database systems +support operations that touch multiple shards at the same time. -- The application may crash at any time (including halfway through a series of operations). +Sharding is usually combined with replication so that copies of each shard are stored on multiple +nodes. This means that, even though each record belongs to exactly one shard, it may still be stored +on several different nodes for fault tolerance. -- Interruptions in the network can unexpectedly cut off the application from the database, or one database node from another. +A node may store more than one shard. If a single-leader replication model is used, the combination +of sharding and replication can look like [Figure 7-1](/en/ch7#fig_sharding_replicas), for example. Each shard’s +leader is assigned to one node, and its followers are assigned to other nodes. 
Each node may be the +leader for some shards and a follower for other shards, but each shard still only has one leader. -- Several clients may write to the database at the same time, overwriting each other’s changes. +![ddia 0701](/fig/ddia_0701.png) -- A client may read data that doesn’t make sense because it has only partially been updated. +###### Figure 7-1. Combining replication and sharding: each node acts as leader for some shards and follower for other shards. -- Race conditions between clients can cause surprising bugs. +Everything we discussed in [Chapter 6](/en/ch6#ch_replication) about replication of databases applies equally to +replication of shards. Since the choice of sharding scheme is mostly independent of the choice of +replication scheme, we will ignore replication in this chapter for the sake of simplicity. - In order to be reliable, a system has to deal with these faults and ensure that they don’t cause catastrophic failure of the entire system. However, implementing fault- tolerance mechanisms is a lot of work. It requires a lot of careful thinking about all the things that can go wrong, and a lot of testing to ensure that the solution actually works. +# Sharding and Partitioning -For decades, ***transactions*** have been the mechanism of choice for simplifying these issues. A transaction is a way for an application to group several reads and writes together into a logical unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either the entire transaction succeeds (*commit*) or it fails (*abort*, *rollback*). If it fails, the application can safely retry. With transactions, error handling becomes much simpler for an application, because it doesn’t need to worry about partial failure—i.e., the case where some operations succeed and some fail (for whatever reason). 
+What we call a *shard* in this chapter has many different names depending on which software you’re +using: it’s called a *partition* in Kafka, a *range* in CockroachDB, a *region* in HBase and TiDB, a +*tablet* in Bigtable and YugabyteDB, a *vnode* in Cassandra, ScyllaDB, and Riak, and a *vBucket* in +Couchbase, to name just a few. -If you have spent years working with transactions, they may seem obvious, but we shouldn’t take them for granted. Transactions are not a law of nature; they were cre‐ ated with a purpose, namely to *simplify the programming model* for applications accessing a database. By using transactions, the application is free to ignore certain potential error scenarios and concurrency issues, because the database takes care of them instead (we call these *safety guarantees*). +Some databases treat partitions and shards as two distinct concepts. For example, in PostgreSQL, +partitioning is a way of splitting a large table into several files that are stored on the same +machine (which has several advantages, such as making it very fast to delete an entire partition), +whereas sharding splits a dataset across multiple machines +[[1](/en/ch7#Giordano2023), +[2](/en/ch7#Leach2022)]. +In many other systems, partitioning is just another word for sharding. -Not every application needs transactions, and sometimes there are advantages to weakening transactional guarantees or abandoning them entirely (for example, to achieve higher performance or higher availability). Some safety properties can be achieved without transactions. +While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to +one theory, the term arose from the online role-play game *Ultima Online*, in which a magic crystal +was shattered into pieces, and each of those shards refracted a copy of the game world +[[3](/en/ch7#Koster2009)]. +The term *shard* thus came to mean one of a set of parallel game servers, and later was carried over +to databases. 
Another theory is that *shard* was originally an acronym of *System for Highly +Available Replicated Data*—reportedly a 1980s database, details of which are lost to history. -How do you figure out whether you need transactions? In order to answer that ques‐ tion, we first need to understand exactly what safety guarantees transactions can pro‐ vide, and what costs are associated with them. Although transactions seem straightforward at first glance, there are actually many subtle but important details that come into play. +By the way, partitioning has nothing to do with *network partitions* (netsplits), a type of fault in +the network between nodes. We will discuss such faults in [Chapter 9](/en/ch9#ch_distributed). -In this chapter, we will examine many examples of things that can go wrong, and explore the algorithms that databases use to guard against those issues. We will go especially deep in the area of concurrency control, discussing various kinds of race conditions that can occur and how databases implement isolation levels such as *read committed*, *snapshot isolation*, and *serializability*. +# Pros and Cons of Sharding -This chapter applies to both single-node and distributed databases; in Chapter 8 we will focus the discussion on the particular challenges that arise only in distributed systems. +The primary reason for sharding a database is *scalability*: it’s a solution if the volume of data +or the write throughput has become too great for a single node to handle, as it allows you to spread +that data and those writes across multiple nodes. (If read throughput is the problem, you don’t +necessarily need sharding—you can use *read scaling* as discussed in [Chapter 6](/en/ch6#ch_replication).) 
+In fact, sharding is one of the main tools we have for achieving *horizontal scaling* (a *scale-out* +architecture), as discussed in [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](/en/ch2#sec_introduction_shared_nothing): that is, allowing a system to +grow its capacity not by moving to a bigger machine, but by adding more (smaller) machines. If you +can divide the workload such that each shard handles a roughly equal share, you can then assign +those shards to different machines in order to process their data and queries in parallel. -## …… +While replication is useful at both small and large scale, because it enables fault tolerance and +offline operation, sharding is a heavyweight solution that is mostly relevant at large scale. If +your data volume and write throughput are such that you can process them on a single machine (and a +single machine can do a lot nowadays!), it’s often better to avoid sharding and stick with a +single-shard database. +The reason for this recommendation is that sharding often adds complexity: you typically have to +decide which records to put in which shard by choosing a *partition key*; all records with the +same partition key are placed in the same shard +[[4](/en/ch7#Fidalgo2021)]. +This choice matters because accessing a record is fast if you know which shard it’s in, but if you +don’t know the shard you have to do an inefficient search across all shards, and the sharding scheme +is difficult to change. +Thus, sharding often works well for key-value data, where you can easily shard by key, but it’s +harder with relational data where you may want to search by a secondary index, or join records that +may be distributed across different shards. We will discuss this further in +[“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes). -## Summary +Another problem with sharding is that a write may need to update related records in several +different shards. 
While transactions on a single node are quite common (see [Chapter 8](/en/ch8#ch_transactions)), +ensuring consistency across multiple shards requires a *distributed transaction*. As we shall see in +[Chapter 8](/en/ch8#ch_transactions), distributed transactions are available in some databases, but they are usually +much slower than single-node transactions, may become a bottleneck for the system as a whole, and +some systems don’t support them at all. -Transactions are an abstraction layer that allows an application to pretend that cer‐ tain concurrency problems and certain kinds of hardware and software faults don’t exist. A large class of errors is reduced down to a simple *transaction abort*, and the application just needs to try again. +Some systems use sharding even on a single machine, typically running one single-threaded process +per CPU core to make use of the parallelism in the CPU, or to take advantage of a *nonuniform memory +access* (NUMA) architecture in which some banks of memory are closer to one CPU than to others +[[5](/en/ch7#Drepper2007)]. +For example, Redis, VoltDB, and FoundationDB use one process per core, and rely on sharding to +spread load across CPU cores in the same machine +[[6](/en/ch7#Zhou2021_ch7)]. -In this chapter we saw many examples of problems that transactions help prevent. Not all applications are susceptible to all those problems: an application with very simple access patterns, such as reading and writing only a single record, can probably manage without transactions. However, for more complex access patterns, transac‐ tions can hugely reduce the number of potential error cases you need to think about. +## Sharding for Multitenancy -Without transactions, various error scenarios (processes crashing, network interrup‐ tions, power outages, disk full, unexpected concurrency, etc.) mean that data can become inconsistent in various ways. For example, denormalized data can easily go out of sync with the source data. 
Without transactions, it becomes very difficult to reason about the effects that complex interacting accesses can have on the database. +Software as a Service (SaaS) products and cloud services are often *multitenant*, where each tenant +is a customer. Multiple users may have logins on the same tenant, but each tenant has a +self-contained dataset that is separate from other tenants. For example, in an email marketing +service, each business that signs up is typically a separate tenant, since one business’s newsletter +signups, delivery data etc. are separate from those of other businesses. -In this chapter, we went particularly deep into the topic of concurrency control. We discussed several widely used isolation levels, in particular *read committed*, *snapshot isolation* (sometimes called *repeatable read*), and *serializable*. We characterized those isolation levels by discussing various examples of race conditions: +Sometimes sharding is used to implement multitenant systems: either each tenant is given a separate +shard, or multiple small tenants may be grouped together into a larger shard. These shards might be +physically separate databases (which we previously touched on in [“Embedded storage engines”](/en/ch4#sidebar_embedded)), or +separately manageable portions of a larger logical database +[[7](/en/ch7#Slot2023)]. +Using sharding for multitenancy has several advantages: -***Dirty reads*** +Resource isolation +: If one tenant performs a computationally expensive operation, it is less likely that other + tenants’ performance will be affected if they are running on different shards. -One client reads another client’s writes before they have been committed. The read committed isolation level and stronger levels prevent dirty reads. 
+Permission isolation +: If there is a bug in your access control logic, it’s less likely that you will accidentally give + one tenant access to another tenant’s data if those tenants’ datasets are stored physically + separately from each other. -***Dirty writes*** +Cell-based architecture +: You can apply sharding not only at the data storage level, but also for the services running your + application code. In a *cell-based architecture*, the services and storage for a particular set of + tenants are grouped into a self-contained *cell*, and different cells are set up such that they + can run largely independently from each other. This approach provides *fault isolation*: that is, + a fault in one cell remains limited to that cell, and tenants in other cells are not affected + [[8](/en/ch7#Oliveira2023)]. -One client overwrites data that another client has written, but not yet committed. Almost all transaction implementations prevent dirty writes. +Per-tenant backup and restore +: Backing up each tenant’s shard separately makes it possible to restore a tenant’s state from a + backup without affecting other tenants, which can be useful in case the tenant accidentally + deletes or overwrites important data + [[9](/en/ch7#Shapira2023dont)]. -***Read skew (nonrepeatable reads)*** +Regulatory compliance +: Data privacy regulation such as the GDPR gives individuals the right to access and delete all data + stored about them. If each person’s data is stored in a separate shard, this translates into + simple data export and deletion operations on their shard + [[10](/en/ch7#Schwarzkopf2019)]. -A client sees different parts of the database at different points in time. This issue is most commonly prevented with snapshot isolation, which allows a transaction to read from a consistent snapshot at one point in time. It is usually implemented with *multi-version concurrency control* (MVCC). 
+Data residence +: If a particular tenant’s data needs to be stored in a particular jurisdiction in order to comply + with data residency laws, a region-aware database can allow you to assign that tenant’s shard to a + particular region. -***Lost updates*** +Gradual schema rollout +: Schema migrations (previously discussed in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)) can be rolled + out gradually, one tenant at a time. This reduces risk, as you can detect problems before they + affect all tenants, but it can be difficult to do transactionally + [[11](/en/ch7#Shapira2024)]. -Two clients concurrently perform a read-modify-write cycle. One overwrites the other’s write without incorporating its changes, so data is lost. Some implemen‐ tations of snapshot isolation prevent this anomaly automatically, while others require a manual lock (SELECT FOR UPDATE). +The main challenges around using sharding for multitenancy are: -***Write skew*** +* It assumes that each individual tenant is small enough to fit on a single node. If that is not the + case, and you have a single tenant that’s too big for one machine, you would need to additionally + perform sharding within a single tenant, which brings us back to the topic of sharding for + scalability [[12](/en/ch7#Ganguli2020)]. +* If you have many small tenants, then creating a separate shard for each one may incur too much + overhead. You could group several small tenants together into a bigger shard, but then you have + the problem of how you move tenants from one shard to another as they grow. +* If you ever need to support features that connect data across multiple tenants, these become + harder to implement if you need to join data across multiple shards. -A transaction reads something, makes a decision based on the value it saw, and writes the decision to the database. However, by the time the write is made, the premise of the decision is no longer true. 
Only serializable isolation prevents this anomaly. +# Sharding of Key-Value Data -***Phantom reads*** +Say you have a large amount of data, and you want to shard it. How do you decide which records to +store on which nodes? -A transaction reads objects that match some search condition. Another client makes a write that affects the results of that search. Snapshot isolation prevents straightforward phantom reads, but phantoms in the context of write skew require special treatment, such as index-range locks. +Our goal with sharding is to spread the data and the query load evenly across nodes. If every node +takes a fair share, then—in theory—10 nodes should be able to handle 10 times as much data and 10 +times the read and write throughput of a single node (ignoring replication). Moreover, if we add or +remove a node, we want to be able to *rebalance* the load so that it is evenly distributed across +the 11 (when adding) or the remaining 9 (when removing) nodes. +If the sharding is unfair, so that some shards have more data or queries than others, we call it +*skewed*. The presence of skew makes sharding much less effective. In an extreme case, all the load +could end up on one shard, so 9 out of 10 nodes are idle and your bottleneck is the single busy +node. A shard with disproportionately high load is called a *hot shard* or *hot spot*. If there’s +one key with a particularly high load (e.g., a celebrity in a social network), we call it a *hot +key*. +Therefore we need an algorithm that takes as input the partition key of a record, and tells us which +shard that record is in. In a key-value store the partition key is usually the key, or the first +part of the key. In a relational model the partition key might be some column of a table (not +necessarily its primary key). That algorithm needs to be amenable to rebalancing in order to relieve +hot spots. 
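To make skew concrete, here is a minimal sketch that counts how many requests land on each shard and compares the busiest shard with the average. The names `load_per_shard`, `skew_ratio`, and `toy_shard_for` are hypothetical. The modulo rule is only a placeholder for whatever algorithm maps a partition key to a shard, not a recommended scheme.

```python
from collections import Counter
from typing import Callable, Iterable, List


def load_per_shard(
    keys: Iterable[int],
    shard_for: Callable[[int], int],
    num_shards: int,
) -> List[int]:
    """Count how many requests fall on each shard for a stream of partition keys."""
    counts = Counter(shard_for(key) for key in keys)
    return [counts.get(shard, 0) for shard in range(num_shards)]


def skew_ratio(loads: List[int]) -> float:
    """Load on the busiest shard divided by the mean load; 1.0 means perfectly even."""
    mean = sum(loads) / len(loads)
    return max(loads) / mean if mean else 0.0


NUM_SHARDS = 4


def toy_shard_for(user_id: int) -> int:
    # Placeholder mapping (an assumption for this example only).
    return user_id % NUM_SHARDS


# Eight users, one request each: the load is spread evenly.
even_requests = [1, 2, 3, 4, 5, 6, 7, 8]
even_loads = load_per_shard(even_requests, toy_shard_for, NUM_SHARDS)

# User 7 is a hot key (say, a celebrity) and receives 24 extra requests.
hot_requests = even_requests + [7] * 24
hot_loads = load_per_shard(hot_requests, toy_shard_for, NUM_SHARDS)

print(even_loads, skew_ratio(even_loads))  # [2, 2, 2, 2] 1.0
print(hot_loads, skew_ratio(hot_loads))    # [2, 2, 2, 26] 3.25
```

In the second case a single hot key puts more than three times the average load on one shard, and that shard becomes the bottleneck while the other three are mostly idle.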
-Weak isolation levels protect against some of those anomalies but leave you, the application developer, to handle others manually (e.g., using explicit locking). Only serializable isolation protects against all of these issues. We discussed three different approaches to implementing serializable transactions: +## Sharding by Key Range -***Literally executing transactions in a serial order*** +One way of sharding is to assign a contiguous range of partition keys (from some minimum to some +maximum) to each shard, like the volumes of a paper encyclopedia, as illustrated in +[Figure 7-2](/en/ch7#fig_sharding_encyclopedia). In this example, an entry’s partition key is its title. If you want +to look up the entry for a particular title, you can easily determine which shard contains that +entry by finding the volume whose key range contains the title you’re looking for, and thus pick the +correct book off the shelf. -If you can make each transaction very fast to execute, and the transaction throughput is low enough to process on a single CPU core, this is a simple and effective option. +![ddia 0702](/fig/ddia_0702.png) -***Two-phase locking*** +###### Figure 7-2. A print encyclopedia is sharded by key range. -For decades this has been the standard way of implementing serializability, but many applications avoid using it because of its performance characteristics. +The ranges of keys are not necessarily evenly spaced, because your data may not be evenly +distributed. For example, in [Figure 7-2](/en/ch7#fig_sharding_encyclopedia), volume 1 contains words starting with A +and B, but volume 12 contains words starting with T, U, V, W, X, Y, and Z. Simply having one volume +per two letters of the alphabet would lead to some volumes being much bigger than others. In order +to distribute the data evenly, the shard boundaries need to adapt to the data. 
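Finding the shard for a key can be sketched as a binary search over the sorted shard boundaries. This is an illustration under assumptions, not how any particular database implements it: the split points are invented for the example, and `shard_for_key` is a hypothetical helper.

```python
from bisect import bisect_right
from typing import Sequence

# Hypothetical split points, kept in sorted order. Shard 0 holds keys below "c",
# shard 1 holds keys in ["c", "e"), and so on; the last shard holds keys >= "t".
# The ranges are deliberately uneven, as they would be if they followed the data.
SPLIT_POINTS = ["c", "e", "h", "m", "t"]


def shard_for_key(key: str, split_points: Sequence[str] = SPLIT_POINTS) -> int:
    """Return the index of the shard whose key range contains `key`.

    bisect_right counts the split points that are <= key, which is exactly
    the index of the range the key falls into.
    """
    return bisect_right(split_points, key.lower())


print(shard_for_key("apple"))   # 0
print(shard_for_key("cat"))     # 1
print(shard_for_key("Monday"))  # 4
print(shard_for_key("zebra"))   # 5
```

Because the boundaries are just a sorted list, adapting them to the data means changing the split points; the lookup itself stays the same.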
-***Serializable snapshot isolation (SSI)*** +The shard boundaries might be chosen manually by an administrator, or the database can choose them +automatically. Manual key-range sharding is used by Vitess (a sharding layer for MySQL), for +example; the automatic variant is used by Bigtable, its open source equivalent HBase, the +range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB +[[6](/en/ch7#Zhou2021_ch7)]. YugabyteDB offers both manual and automatic +tablet splitting. -A fairly new algorithm that avoids most of the downsides of the previous approaches. It uses an optimistic approach, allowing transactions to proceed without blocking. When a transaction wants to commit, it is checked, and it is aborted if the execution was not serializable. +Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in +[Chapter 4](/en/ch4#ch_storage)). This has the advantage that range scans are easy, and you can treat the key as a +concatenated index in order to fetch several related records in one query (see +[“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional)). For example, consider an application that stores data from a +network of sensors, where the key is the timestamp of the measurement. Range scans are very useful +in this case, because they let you easily fetch, say, all the readings from a particular month. -The examples in this chapter used a relational data model. However, as discussed in “[The need for multi-object transactions](/en/ch7#the-need-for-multi-object-transactions)”, transactions are a valuable database feature, no matter which data model is used. +A downside of key range sharding is that you can easily get a hot shard if there are a +lot of writes to nearby keys. For example, if the key is a timestamp, then the shards correspond to +ranges of time—e.g., one shard per month. 
Unfortunately, if you write data from the sensors to the +database as the measurements happen, all the writes end up going to the same shard (the one for +this month), so that shard can be overloaded with writes while others sit idle +[[13](/en/ch7#Lan2011)]. -In this chapter, we explored ideas and algorithms mostly in the context of a database running on a single machine. Transactions in distributed databases open a new set of difficult challenges, which we’ll discuss in the next two chapters. +To avoid this problem in the sensor database, you need to use something other than the timestamp as +the first element of the key. For example, you could prefix each timestamp with the sensor ID so +that the key ordering is first by sensor ID and then by timestamp. Assuming you have many sensors +active at the same time, the write load will end up more evenly spread across the shards. The +downside is that when you want to fetch the values of multiple sensors within a time range, you now +need to perform a separate range query for each sensor. +### Rebalancing key-range sharded data +When you first set up your database, there are no key ranges to split into shards. Some databases, +such as HBase and MongoDB, allow you to configure an initial set of shards on an empty database, +which is called *pre-splitting*. This requires that you already have some idea of what the key +distribution is going to look like, so that you can choose appropriate key range boundaries +[[14](/en/ch7#Soztutar2013split)]. -## References +Later on, as your data volume and write throughput grow, a system with key-range sharding grows by +splitting an existing shard into two or more smaller shards, each of which holds a contiguous +sub-range of the original shard’s key range. The resulting smaller shards can then be distributed +across multiple nodes. If large amounts of data are deleted, you may also need to merge several +adjacent shards that have become small into one bigger one. 
+This process is similar to what happens at the top level of a B-tree (see [“B-Trees”](/en/ch4#sec_storage_b_trees)). -1. Donald D. Chamberlin, Morton M. Astrahan, Michael W. Blasgen, et al.: “[A History and Evaluation of System R](https://citeseerx.ist.psu.edu/pdf/ebb29a0ca16e04e7eeb6b606b22a9eadb3a9d531),” *Communications of the ACM*, volume 24, number 10, pages 632–646, October 1981. [doi:10.1145/358769.358784](http://dx.doi.org/10.1145/358769.358784) -1. Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger: “[Granularity of Locks and Degrees of Consistency in a Shared Data Base](https://citeseerx.ist.psu.edu/pdf/e127f0a6a912bb9150ecfe03c0ebf7fbc289a023),” in *Modelling in Data Base Management Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management Systems*, edited by G. M. Nijssen, pages 364–394, Elsevier/North Holland Publishing, 1976. Also in *Readings in Database Systems*, 4th edition, edited by Joseph M. Hellerstein and Michael Stonebraker, MIT Press, 2005. ISBN: 978-0-262-69314-1 -1. Kapali P. Eswaran, Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger: “[The Notions of Consistency and Predicate Locks in a Database System](http://research.microsoft.com/en-us/um/people/gray/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf),” *Communications of the ACM*, volume 19, number 11, pages 624–633, November 1976. -1. “[ACID Transactions Are Incredibly Helpful](http://web.archive.org/web/20150320053809/https://foundationdb.com/acid-claims),” FoundationDB, LLC, 2013. -1. John D. Cook: “[ACID Versus BASE for Database Transactions](http://www.johndcook.com/blog/2009/07/06/brewer-cap-theorem-base/),” *johndcook.com*, July 6, 2009. -1. Gavin Clarke: “[NoSQL's CAP Theorem Busters: We Don't Drop ACID](http://www.theregister.co.uk/2012/11/22/foundationdb_fear_of_cap_theorem/),” *theregister.co.uk*, November 22, 2012. -1. 
Theo Härder and Andreas Reuter: “[Principles of Transaction-Oriented Database Recovery](https://citeseerx.ist.psu.edu/pdf/11ef7c142295aeb1a28a0e714c91fc8d610c3047),” *ACM Computing Surveys*, volume 15, number 4, pages 287–317, December 1983. [doi:10.1145/289.291](http://dx.doi.org/10.1145/289.291) -1. Peter Bailis, Alan Fekete, Ali Ghodsi, et al.: “[HAT, not CAP: Towards Highly Available Transactions](http://www.bailis.org/papers/hat-hotos2013.pdf),” at *14th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2013. -1. Armando Fox, Steven D. Gribble, Yatin Chawathe, et al.: “[Cluster-Based Scalable Network Services](https://people.eecs.berkeley.edu/~brewer/cs262b/TACC.pdf),” at *16th ACM Symposium on Operating Systems Principles* (SOSP), October 1997. -1. Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman: [*Concurrency Control and Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at *research.microsoft.com*. -1. Alan Fekete, Dimitrios Liarokapis, Elizabeth O'Neil, et al.: “[Making Snapshot Isolation Serializable](https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2009/Papers/p492-fekete.pdf),” *ACM Transactions on Database Systems*, volume 30, number 2, pages 492–528, June 2005. [doi:10.1145/1071610.1071615](http://dx.doi.org/10.1145/1071610.1071615) -1. Mai Zheng, Joseph Tucek, Feng Qin, and Mark Lillibridge: “[Understanding the Robustness of SSDs Under Power Fault](https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf),” at *11th USENIX Conference on File and Storage Technologies* (FAST), February 2013. -1. Laurie Denness: “[SSDs: A Gift and a Curse](https://laur.ie/blog/2015/06/ssds-a-gift-and-a-curse/),” *laur.ie*, June 2, 2015. -1. Adam Surak: “[When Solid State Drives Are Not That Solid](https://blog.algolia.com/when-solid-state-drives-are-not-that-solid/),” *blog.algolia.com*, June 15, 2015. -1. 
+With databases that manage shard boundaries automatically, a shard split is typically triggered by: + +* the shard reaching a configured size (for example, on HBase, the default is 10 GB), or +* in some systems, the write throughput being persistently above some threshold. Thus, a hot shard + may be split even if it is not storing a lot of data, so that its write load can be distributed + more uniformly. + +An advantage of key-range sharding is that the number of shards adapts to the data volume.
If there +is only a small amount of data, a small number of shards is sufficient, so overheads are small; if +there is a huge amount of data, the size of each individual shard is limited to a configurable +maximum [[15](/en/ch7#Evans2013)]. + +A downside of this approach is that splitting a shard is an expensive operation, since it requires +all of its data to be rewritten into new files, similarly to a compaction in a log-structured +storage engine. A shard that needs splitting is often also one that is under high load, and the cost +of splitting can exacerbate that load, risking it becoming overloaded. + +## Sharding by Hash of Key + +Key-range sharding is useful if you want records with nearby (but different) partition keys to be +grouped into the same shard; for example, this might be the case with timestamps. If you don’t care +whether partition keys are near each other (e.g., if they are tenant IDs in a multitenant +application), a common approach is to first hash the partition key before mapping it to a shard. + +A good hash function takes skewed data and makes it uniformly distributed. Say you have a 32-bit +hash function that takes a string. Whenever you give it a new string, it returns a seemingly random +number between 0 and 2<sup>32</sup> − 1. Even if the input strings are very similar, their +hashes are evenly distributed across that range of numbers (but the same input always produces the +same output). + +For sharding purposes, the hash function need not be cryptographically strong: for example, MongoDB +uses MD5, whereas Cassandra and ScyllaDB use Murmur3. Many programming languages have simple hash +functions built in (as they are used for hash tables), but they may not be suitable for sharding: +for example, in Java’s `Object.hashCode()` and Ruby’s `Object#hash`, the same key may have a +different hash value in different processes, making them unsuitable for sharding +[[16](/en/ch7#Kleppmann2012hash)].
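As a rough illustration of a hash that is stable across processes, here is a minimal Python sketch. The function name `stable_hash32` is hypothetical, and truncating an MD5 digest to four bytes is simply a convenient choice for this example, not a description of what any particular database does. Python's built-in `hash()` for strings has the same drawback as the Java and Ruby methods mentioned above, because it is randomized per process by default.

```python
import hashlib


def stable_hash32(key: str) -> int:
    """Map a string to an integer in [0, 2**32 - 1], identically in every process."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Keep only the first 4 bytes (32 bits) of the 128-bit digest.
    return int.from_bytes(digest[:4], "big")


if __name__ == "__main__":
    # Very similar inputs should still produce unrelated-looking hash values.
    for key in ["user_1000", "user_1001", "user_1002"]:
        print(key, stable_hash32(key))
```

Because the result depends only on the bytes of the key, every client and every node computes the same value, which is the property sharding relies on.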
+ +### Hash modulo number of nodes + +Once you have hashed the key, how do you choose which shard to store it in? Maybe your first thought +is to take the hash value *modulo* the number of nodes in the system (using the `%` operator in many +programming languages). For example, *hash*(*key*) % 10 would return a number between +0 and 9 (if we write the hash as a decimal number, the hash % 10 would be the last digit). +If we have 10 nodes, numbered 0 to 9, that seems like an easy way of assigning each key to a node. + +The problem with the *mod N* approach is that if the number of nodes *N* changes, most of the keys +have to be moved from one node to another. [Figure 7-3](/en/ch7#fig_sharding_hash_mod_n) shows what happens when you +have three nodes and add a fourth. Before the rebalancing, node 0 stored the keys whose hashes are +0, 3, 6, 9, and so on. After adding the fourth node, the key with hash 3 has moved to node 3, the +key with hash 6 has moved to node 2, the key with hash 9 has moved to node 1, and so on. + +![ddia 0703](/fig/ddia_0703.png) + +###### Figure 7-3. Assigning keys to nodes by hashing the key and taking it modulo the number of nodes. Changing the number of nodes results in many keys moving from one node to another. + +The *mod N* function is easy to compute, but it leads to very inefficient rebalancing because there +is a lot of unnecessary movement of records from one node to another. We need an approach that +doesn’t move data around more than necessary. + +### Fixed number of shards + +One simple but widely-used solution is to create many more shards than there are nodes, and to +assign several shards to each node. For example, a database running on a cluster of 10 nodes may be +split into 1,000 shards from the outset so that 100 shards are assigned to each node. A key is then +stored in shard number *hash*(*key*) % 1,000, and the system separately keeps track of +which shard is stored on which node. 
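The difference between the two schemes can be sketched in a few lines of Python. Everything here is illustrative: the function names, the round-robin initial assignment, and the greedy `add_node` policy are assumptions made for the example, not how any specific database rebalances.

```python
from collections import Counter

NUM_SHARDS = 1000  # chosen once, when the database is created


def moved_fraction_mod_n(hashes, old_n, new_n):
    """Fraction of keys whose node changes when hash % N goes from old_n to new_n."""
    hashes = list(hashes)
    moved = sum(1 for h in hashes if h % old_n != h % new_n)
    return moved / len(hashes)


def shard_for(hash_value):
    """Key-to-shard mapping: never changes, because NUM_SHARDS is fixed."""
    return hash_value % NUM_SHARDS


def initial_assignment(num_nodes):
    """Shard-to-node mapping, tracked separately from the key-to-shard mapping."""
    return {shard: shard % num_nodes for shard in range(NUM_SHARDS)}


def add_node(assignment, new_node):
    """Hand whole shards to new_node until it holds roughly a fair share."""
    assignment = dict(assignment)
    load = Counter(assignment.values())
    target = NUM_SHARDS // (len(load) + 1)
    for shard in sorted(assignment):
        if load[new_node] >= target:
            break
        owner = assignment[shard]
        if load[owner] > target:
            # Only the shard-to-node mapping changes; keys stay in their shard.
            assignment[shard] = new_node
            load[owner] -= 1
            load[new_node] += 1
    return assignment


if __name__ == "__main__":
    print(moved_fraction_mod_n(range(1200), 3, 4))   # mod N, going from 3 to 4 nodes
    before = initial_assignment(10)
    after = add_node(before, new_node=10)
    moved = [s for s in before if before[s] != after[s]]
    print(len(moved) / NUM_SHARDS)                   # fixed shards, 10 to 11 nodes
```

With these inputs, *mod N* relocates three quarters of the hash values when a fourth node joins three, whereas the fixed-shard scheme moves 90 of the 1,000 shards when an eleventh node joins ten, and every moved shard goes to the new node.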
+ +Now, if a node is added to the cluster, the system can reassign some of the shards from existing +nodes to the new node until they are fairly distributed once again. This process is illustrated in +[Figure 7-4](/en/ch7#fig_sharding_rebalance_fixed). If a node is removed from the cluster, the same happens in +reverse. + +![ddia 0704](/fig/ddia_0704.png) + +###### Figure 7-4. Adding a new node to a database cluster with multiple shards per node. + +In this model, only entire shards are moved between nodes, which is cheaper than splitting shards. +The number of shards does not change, nor does the assignment of keys to shards. The only thing that +changes is the assignment of shards to nodes. This change of assignment is not immediate—it takes +some time to transfer a large amount of data over the network—so the old assignment of shards is +used for any reads and writes that happen while the transfer is in progress. + +It’s common to choose the number of shards to be a number that is divisible by many factors, so that +the dataset can be evenly split across various different numbers of nodes—not requiring the number +of nodes to be a power of 2, for example [[4](/en/ch7#Fidalgo2021)]. +You can even account for mismatched hardware in your cluster: by assigning more shards to nodes that +are more powerful, you can make those nodes take a greater share of the load. + +This approach to sharding is used in Citus (a sharding layer for PostgreSQL), Riak, Elasticsearch, +and Couchbase, among others. It works well as long as you have a good estimate of how many shards +you will need when you first create the database. You can then add or remove nodes easily, subject +to the limitation that you can’t have more nodes than you have shards. + +If you find the originally configured number of shards to be wrong—for example, if you have reached +a scale where you need more nodes than you have shards—then an expensive resharding operation is +required. 
It needs to split each shard and write it out to new files, using a lot of additional disk +space in the process. Some systems don’t allow resharding while concurrently writing to the +database, which makes it difficult to change the number of shards without downtime. + +Choosing the right number of shards is difficult if the total size of the dataset is highly variable +(for example, if it starts small but may grow much larger over time). Since each shard contains a +fixed fraction of the total data, the size of each shard grows proportionally to the total amount of +data in the cluster. If shards are very large, rebalancing and recovery from node failures become +expensive. But if shards are too small, they incur too much overhead. The best performance is +achieved when the size of shards is “just right,” neither too big nor too small, which can be hard +to achieve if the number of shards is fixed but the dataset size varies. + +### Sharding by hash range + +If the required number of shards can’t be predicted in advance, it’s better to use a scheme in which +the number of shards can adapt easily to the workload. The aforementioned key-range sharding scheme +has this property, but it has a risk of hot spots when there are a lot of writes to nearby keys. One +solution is to combine key-range sharding with a hash function so that each shard contains a range +of *hash values* rather than a range of *keys*. + +[Figure 7-5](/en/ch7#fig_sharding_hash_range) shows an example using a 16-bit hash function that returns a number +between 0 and 65,535 = 2<sup>16</sup> − 1 (in reality, the hash is usually 32 bits or more). +Even if the input keys are very similar (e.g., consecutive timestamps), their hashes are uniformly +distributed across that range. We can then assign a range of hash values to each shard: for example, +values between 0 and 16,383 to shard 0, values between 16,384 and 32,767 to shard 1, and so on. + +![ddia 0705](/fig/ddia_0705.png) + +###### Figure 7-5.
Assigning a contiguous range of hash values to each shard. + +Like with key-range sharding, a shard in hash-range sharding can be split when it becomes too big or +too heavily loaded. This is still an expensive operation, but it can happen as needed, so the number +of shards adapts to the volume of data rather than being fixed in advance. + +The downside compared to key-range sharding is that range queries over the partition key are not +efficient, as keys in the range are now scattered across all the shards. However, if keys consist of +two or more columns, and the partition key is only the first of these columns, you can still perform +efficient range queries over the second and later columns: as long as all records in the range query +have the same partition key, they will be in the same shard. + +# Partitioning and Range Queries in Data Warehouses + +Data warehouses such as BigQuery, Snowflake, and Delta Lake support a similar indexing approach, +though the terminology differs. In BigQuery, for example, the partition key determines which +partition a record resides in while “cluster columns” determine how records are sorted within the +partition. Snowflake assigns records to “micro-partitions” automatically, but allows users to define +cluster keys for a table. Delta Lake supports both manual and automatic partition assignment, and +supports cluster keys. Clustering data not only improves range scan performance, but can +improve compression and filtering performance as well. + +Hash-range sharding is used in YugabyteDB and DynamoDB +[[17](/en/ch7#Elhemali2022_ch7)], and is an option in MongoDB. 
+Cassandra and ScyllaDB use a variant of this approach that is illustrated in +[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional +to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8 +per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between +those ranges. This means some ranges are bigger than others, but by having multiple ranges per node +those imbalances tend to even out +[[15](/en/ch7#Evans2013), +[18](/en/ch7#Williams2012)]. + +![ddia 0706](/fig/ddia_0706.png) + +###### Figure 7-6. Cassandra and ScyllaDB split the range of possible hash values (here 0–1023) into contiguous ranges with random boundaries, and assign several ranges to each node. + +When nodes are added or removed, range boundaries are added and removed, and shards are split or +merged accordingly [[19](/en/ch7#Lambov2016)]. +In the example of [Figure 7-6](/en/ch7#fig_sharding_cassandra), when node 3 is added, node 1 +transfers parts of two of its ranges to node 3, and node 2 transfers part of one of its ranges to +node 3. This has the effect of giving the new node an approximately fair share of the dataset, +without transferring more data than necessary from one node to another. + +### Consistent hashing + +A *consistent hashing* algorithm is a hash function that maps keys to a specified number of shards +in a way that satisfies two properties: + +1. the number of keys mapped to each shard is roughly equal, and +2. when the number of shards changes, as few keys as possible are moved from one shard to another. + +Note that *consistent* here has nothing to do with replica consistency (see [Chapter 6](/en/ch6#ch_replication)) or +ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describes the tendency of a key to stay in +the same shard as much as possible. 
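To make the two properties concrete, here is a simplified Python sketch of a ring of hash ranges with several ranges per node, in the spirit of Figure 7-6. It is not Cassandra's or ScyllaDB's implementation: the class `HashRing`, the helper `hash32`, and the choice of deriving range boundaries from a hash of the node name (rather than picking them at random) are assumptions made so that the example is deterministic.

```python
import bisect
import hashlib


def hash32(value: str) -> int:
    """Stable 32-bit hash: the first 4 bytes of an MD5 digest."""
    digest = hashlib.md5(value.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


class HashRing:
    def __init__(self, ranges_per_node: int = 8):
        self.ranges_per_node = ranges_per_node
        self.boundaries = []   # sorted upper boundaries of the hash ranges
        self.owner = {}        # boundary -> node owning the range that ends there

    def add_node(self, node: str) -> None:
        for i in range(self.ranges_per_node):
            # Pseudo-random boundary; inserting it splits an existing range in two.
            boundary = hash32(f"{node}/range-{i}")
            if boundary not in self.owner:
                bisect.insort(self.boundaries, boundary)
            self.owner[boundary] = node

    def node_for(self, key: str) -> str:
        h = hash32(key)
        # First boundary >= h; wrap around to the start of the ring if there is none.
        idx = bisect.bisect_left(self.boundaries, h)
        if idx == len(self.boundaries):
            idx = 0
        return self.owner[self.boundaries[idx]]


if __name__ == "__main__":
    ring = HashRing()
    for name in ["node0", "node1", "node2"]:
        ring.add_node(name)
    keys = [f"key-{i}" for i in range(10_000)]
    before = {k: ring.node_for(k) for k in keys}
    ring.add_node("node3")
    moved = [k for k in keys if ring.node_for(k) != before[k]]
    print(f"{len(moved) / len(keys):.1%} of keys moved, all of them to node3")
```

Adding a node only inserts new boundaries, so the only keys that change owner are those in the sub-ranges the new node takes over; no key moves between two existing nodes.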
+ +The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of +consistent hashing +[[20](/en/ch7#Karger1997)], +but several other consistent hashing algorithms have also been proposed +[[21](/en/ch7#Gryski2018)], +such as *highest random weight*, also known as *rendezvous hashing* +[[22](/en/ch7#Thaler1998)], +and *jump consistent hash* +[[23](/en/ch7#Lamping2014)]. +With Cassandra’s algorithm, if one node is added, a small number of existing shards are split into +sub-ranges; on the other hand, with rendezvous and jump consistent hashes, the new node is assigned +individual keys that were previously scattered across all of the other nodes. Which one is +preferable depends on the application. + +## Skewed Workloads and Relieving Hot Spots + +Consistent hashing ensures that keys are uniformly distributed across nodes, but that doesn’t mean +that the actual load is uniformly distributed. If the workload is highly skewed (that is, the amount +of data under some partition keys is much greater than under others, or the rate of requests to +some keys is much higher than to others), you can still end up with some servers being overloaded +while others sit almost idle. + +For example, on a social media site, a celebrity user with millions of followers may cause a storm +of activity when they do something [[24](/en/ch7#Axon2010_ch7)]. +This event can result in a large volume of reads and writes to the same key (where the partition key +is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on). + +In such situations, a more flexible sharding policy is required +[[25](/en/ch7#Guo2020), +[26](/en/ch7#Lee2021)]. +A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put +an individual hot key in a shard of its own, and perhaps even to assign it a dedicated machine +[[27](/en/ch7#Fritchie2018)]. + +It’s also possible to compensate for skew at the application level.
For example, if one key is known +to be very hot, a simple technique is to add a random number to the beginning or end of the key. +Just a two-digit decimal random number would split the writes to the key evenly across 100 different +keys, allowing those keys to be distributed to different shards. + +However, having split the writes across different keys, any reads now have to do additional work, as +they have to read the data from all 100 keys and combine it. The volume of reads to each shard of +the hot key is not reduced; only the write load is split. This technique also requires additional +bookkeeping: it only makes sense to append the random number for the small number of hot keys; for +the vast majority of keys with low write throughput this would be unnecessary overhead. Thus, you +also need some way of keeping track of which keys are being split, and a process for converting a +regular key into a specially-managed hot key. + +The problem is further compounded by change of load over time: for example, a particular social +media post that has gone viral may experience high load for a couple of days, but thereafter it’s +likely to calm down again. Moreover, some keys may be hot for writes while others are hot for reads, +necessitating different strategies for handling them. + +Some systems (especially cloud services designed for large scale) have automated approaches for +dealing with hot shards; for example, Amazon calls it *heat management* +[[28](/en/ch7#Warfield2023_ch7)] +or *adaptive capacity* [[17](/en/ch7#Elhemali2022_ch7)]. +The details of how these systems work go beyond the scope of this book. + +## Operations: Automatic or Manual Rebalancing + +There is one important question with regard to rebalancing that we have glossed over: does the +splitting of shards and rebalancing happen automatically or manually? 
+ +Some systems automatically decide when to split shards and when to move them from one node to +another, without any human interaction, while others leave sharding to be explicitly configured by +an administrator. There is also a middle ground: for example, Couchbase and Riak generate a +suggested shard assignment automatically, but require an administrator to commit it before it takes +effect. + +Fully automated rebalancing can be convenient, because there is less operational work to do for +normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud +databases such as DynamoDB are promoted as being able to automatically add and remove shards to +adapt to big increases or decreases of load within a matter of minutes +[[17](/en/ch7#Elhemali2022_ch7), +[29](/en/ch7#Houlihan2017)]. + +However, automatic shard management can also be unpredictable. Rebalancing is an expensive +operation, because it requires rerouting requests and moving a large amount of data from one node to +another. If it is not done carefully, this process can overload the network or the nodes, and it +might harm the performance of other requests. The system must continue processing writes while the +rebalancing is in progress; if a system is near its maximum write throughput, the shard-splitting +process might not even be able to keep up with the rate of incoming writes +[[29](/en/ch7#Houlihan2017)]. + +Such automation can be dangerous in combination with automatic failure detection. For example, say +one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that +the overloaded node is dead, and automatically rebalance the cluster to move load away from it. This +puts additional load on other nodes and the network, making the situation worse. There is a risk of +causing a cascading failure where other nodes become overloaded and are also falsely suspected of +being down. 
+ +For that reason, it can be a good thing to have a human in the loop for rebalancing. It’s slower +than a fully automatic process, but it can help prevent operational surprises. + +# Request Routing + +We have discussed how to shard a dataset across multiple nodes, and how to rebalance those shards as +nodes are added or removed. Now let’s move on to the question: if you want to read or write a +particular key, how do you know which node—i.e., which IP address and port number—you need to +connect to? + +We call this problem *request routing*, and it’s very similar to *service discovery*, which we +previously discussed in [“Load balancers, service discovery, and service meshes”](/en/ch5#sec_encoding_service_discovery). The biggest difference between the two +is that with services running application code, each instance is usually stateless, and a load +balancer can send a request to any of the instances. With sharded databases, a request for a key can +only be handled by a node that is a replica for the shard containing that key. + +This means that request routing has to be aware of the assignment from keys to shards, and from +shards to nodes. On a high level, there are a few different approaches to this problem (illustrated +in [Figure 7-7](/en/ch7#fig_sharding_routing)): + +1. Allow clients to contact any node (e.g., via a round-robin load balancer). If that node + coincidentally owns the shard to which the request applies, it can handle the request directly; + otherwise, it forwards the request to the appropriate node, receives the reply, and passes the + reply along to the client. +2. Send all requests from clients to a routing tier first, which determines the node that should + handle each request and forwards it accordingly. This routing tier does not itself handle any + requests; it only acts as a shard-aware load balancer. +3. Require that clients be aware of the sharding and the assignment of shards to nodes. 
In this + case, a client can connect directly to the appropriate node, without any intermediary. + +![ddia 0707](/fig/ddia_0707.png) + +###### Figure 7-7. Three different ways of routing a request to the right node. + +In all cases, there are some key problems: + +* Who decides which shard should live on which node? It’s simplest to have a single coordinator + making that decision, but in that case how do you make it fault-tolerant in case the node running + the coordinator goes down? And if the coordinator role can fail over to another node, how do you + prevent a split-brain situation (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)) where two different + coordinators make contradictory shard assignments? +* How does the component performing the routing (which may be one of the nodes, or the routing tier, + or the client) learn about changes in the assignment of shards to nodes? +* While a shard is being moved from one node to another, there is a cutover period during which the + new node has taken over, but requests to the old node may still be in flight. How do you handle + those? + +Many distributed data systems rely on a separate coordination service such as ZooKeeper or etcd to +keep track of shard assignments, as illustrated in [Figure 7-8](/en/ch7#fig_sharding_zookeeper). They use consensus +algorithms (see [Chapter 10](/en/ch10#ch_consistency)) to provide fault tolerance and protection against split-brain. +Each node registers itself in ZooKeeper, and ZooKeeper maintains the authoritative mapping of shards +to nodes. Other actors, such as the routing tier or the sharding-aware client, can subscribe to this +information in ZooKeeper. Whenever a shard changes ownership, or a node is added or removed, +ZooKeeper notifies the routing tier so that it can keep its routing information up to date. + +![ddia 0708](/fig/ddia_0708.png) + +###### Figure 7-8. Using ZooKeeper to keep track of assignment of shards to nodes.
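The subscription pattern described above can be sketched as follows. This is a toy, single-process model: `CoordinationService` is a hypothetical in-memory stand-in and does not use the real ZooKeeper or etcd client APIs, the node names are made up, and the sketch ignores the cutover and failure-handling questions raised in the list above.

```python
import hashlib

NUM_SHARDS = 8  # illustrative; real deployments use far more shards


def shard_for(key: str) -> int:
    """Key-to-shard mapping, here a stable hash modulo a fixed shard count."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_SHARDS


class CoordinationService:
    """In-memory stand-in for a service such as ZooKeeper or etcd (not their real API)."""

    def __init__(self):
        self._assignment = {}   # shard -> node: the authoritative mapping
        self._watchers = []

    def watch(self, callback):
        self._watchers.append(callback)
        for shard, node in self._assignment.items():
            callback(shard, node)   # replay current state to the new subscriber

    def assign(self, shard, node):
        self._assignment[shard] = node
        for callback in self._watchers:
            callback(shard, node)   # notify subscribers of the ownership change


class RoutingTier:
    """Shard-aware load balancer that keeps a local copy of the mapping."""

    def __init__(self, coordinator):
        self._table = {}
        coordinator.watch(self._on_change)

    def _on_change(self, shard, node):
        self._table[shard] = node

    def route(self, key):
        return self._table[shard_for(key)]


if __name__ == "__main__":
    coordinator = CoordinationService()
    for shard in range(NUM_SHARDS):
        coordinator.assign(shard, f"node{shard % 2}")
    router = RoutingTier(coordinator)
    print(router.route("user:42"))
    coordinator.assign(shard_for("user:42"), "node9")   # the shard changes owner
    print(router.route("user:42"))
```

The routing tier never decides where a shard lives; it only mirrors the coordinator's mapping, which is why the fault tolerance of the coordinator matters so much.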
+ +For example, HBase and SolrCloud use ZooKeeper to manage shard assignment, and Kubernetes uses etcd +to keep track of which service instance is running where. MongoDB has a similar architecture, but it +relies on its own *config server* implementation and *mongos* daemons as the routing tier. Kafka, +YugabyteDB, and TiDB use built-in implementations of the Raft consensus protocol to perform this +coordination function. + +Cassandra, ScyllaDB, and Riak take a different approach: they use a *gossip protocol* among the +nodes to disseminate any changes in cluster state. This provides much weaker consistency than a +consensus protocol; it is possible to have split brain, in which different parts of the cluster have +different node assignments for the same shard. Leaderless databases can tolerate this because they +generally make weak consistency guarantees anyway (see [“Limitations of Quorum Consistency”](/en/ch6#sec_replication_quorum_limitations)). + +When using a routing tier or when sending requests to a random node, clients still need to find the +IP addresses to connect to. These are not as fast-changing as the assignment of shards to nodes, +so it is often sufficient to use DNS for this purpose. + +This discussion of request routing has focused on finding the shard for an individual key, which is +most relevant for sharded OLTP databases. Analytic databases often use sharding as well, but they +typically have a very different kind of query execution: rather than executing in a single shard, a +query typically needs to aggregate and join data from many different shards in parallel. We will +discuss techniques for such parallel query execution in [Link to Come]. + +# Sharding and Secondary Indexes + +The sharding schemes we have discussed so far rely on the client knowing the partition key for any +record it wants to access. 
This is most easily done in a key-value data model, where the partition +key is the first part of the primary key (or the entire primary key), and so we can use the +partition key to determine the shard, and thus route reads and writes to the node that is +responsible for that key. + +The situation becomes more complicated if secondary indexes are involved (see also +[“Multi-Column and Secondary Indexes”](/en/ch4#sec_storage_index_multicolumn)). A secondary index usually doesn’t identify a record uniquely but +rather is a way of searching for occurrences of a particular value: find all actions by user `123`, +find all articles containing the word `hogwash`, find all cars whose color is `red`, and so on. + +Key-value stores often don’t have secondary indexes, but they are the bread and butter of relational +databases, they are common in document databases too, and they are the *raison d’être* of full-text +search engines such as Solr and Elasticsearch. The problem with secondary indexes is that they don’t +map neatly to shards. There are two main approaches to sharding a database with secondary indexes: +local and global indexes. + +## Local Secondary Indexes + +For example, imagine you are operating a website for selling used cars (illustrated in +[Figure 7-9](/en/ch7#fig_sharding_local_secondary)). Each listing has a unique ID, and you use that ID as partition +key for sharding (for example, IDs 0 to 499 in shard 0, IDs 500 to 999 in shard 1, etc.). + +If you want to let users search for cars, allowing them to filter by color and by make, you need a +secondary index on `color` and `make` (in a document database these would be fields; in a relational +database they would be columns). If you have declared the index, the database can perform the +indexing automatically. For example, whenever a red car is added to the database, the database shard +automatically adds its ID to the list of IDs for the index entry `color:red`. 
As discussed in +[Chapter 4](/en/ch4#ch_storage), that list of IDs is also called a *postings list*. + +![ddia 0709](/fig/ddia_0709.png) + +###### Figure 7-9. Local secondary indexes: each shard indexes only the records within its own shard. + +###### Warning + +If your database only supports a key-value model, you might be tempted to implement a secondary +index yourself by creating a mapping from values to IDs in application code. If you go down this +route, you need to take great care to ensure your indexes remain consistent with the underlying +data. Race conditions and intermittent write failures (where some changes were saved but others +weren’t) can very easily cause the data to go out of sync (see [“The need for multi-object transactions”](/en/ch8#sec_transactions_need)). + +In this indexing approach, each shard is completely separate: each shard maintains its own secondary +indexes, covering only the records in that shard. It doesn’t care what data is stored in other +shards. Whenever you write to the database (to add, remove, or update a record), you only need to +deal with the shard that contains the record that you are writing. For that reason, this type of +secondary index is known as a *local index*. In an information retrieval context it is also known as +a *document-partitioned index* +[[30](/en/ch7#Manning2008_ch7)]. + +When reading from a local secondary index, if you already know the partition key of the record +you’re looking for, you can just perform the search on the appropriate shard. Moreover, if you only +want *some* results, and you don’t need all, you can send the request to any shard. + +However, if you want all the results and don’t know their partition key in advance, you need to send +the query to all shards, and combine the results you get back, because the matching records might be +scattered across all the shards. In [Figure 7-9](/en/ch7#fig_sharding_local_secondary), red cars appear in both shard +0 and shard 1.
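The write path and the scatter/gather read path of a local index can be sketched like this. The classes `Shard` and `ShardedCars` and the sample listings are hypothetical, made up for illustration; the shard boundaries follow the example above (IDs 0 to 499 in shard 0, IDs 500 to 999 in shard 1).

```python
from collections import defaultdict

INDEXED_FIELDS = ("color", "make")


class Shard:
    def __init__(self):
        self.records = {}               # listing ID -> record
        self.index = defaultdict(set)   # term such as "color:red" -> postings list

    def put(self, car_id, car):
        old = self.records.get(car_id)
        if old is not None:
            # On update, drop the index entries of the previous version first.
            for field in INDEXED_FIELDS:
                self.index[f"{field}:{old[field]}"].discard(car_id)
        self.records[car_id] = car
        for field in INDEXED_FIELDS:
            self.index[f"{field}:{car[field]}"].add(car_id)

    def search(self, term):
        return set(self.index.get(term, ()))


class ShardedCars:
    def __init__(self, num_shards=2, ids_per_shard=500):
        self.shards = [Shard() for _ in range(num_shards)]
        self.ids_per_shard = ids_per_shard

    def put(self, car_id, car):
        # A write touches exactly one shard: the one that owns the listing ID.
        self.shards[car_id // self.ids_per_shard].put(car_id, car)

    def search(self, term):
        # A read on the secondary index must ask every shard and merge the results.
        result = set()
        for shard in self.shards:
            result |= shard.search(term)
        return result


if __name__ == "__main__":
    db = ShardedCars()
    db.put(191, {"color": "red", "make": "Honda"})
    db.put(306, {"color": "red", "make": "Ford"})
    db.put(515, {"color": "silver", "make": "Volvo"})
    db.put(768, {"color": "red", "make": "Volvo"})
    print(sorted(db.search("color:red")))
```

In this sketch the shards are queried one after another; a real system would query them in parallel, but every shard still has to do work for every such query.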
+ +This approach to querying a sharded database can make read queries on secondary indexes quite +expensive. Even if you query the shards in parallel, it is prone to tail latency amplification (see +[“Use of Response Time Metrics”](/en/ch2#sec_introduction_slo_sla)). It also limits the scalability of your application: adding more +shards lets you store more data, but it doesn’t increase your query throughput if every shard has to +process every query anyway. + +Nevertheless, local secondary indexes are widely used +[[31](/en/ch7#Busch2012)]: +for example, MongoDB, Riak, Cassandra [[32](/en/ch7#HarEl2017)], +Elasticsearch [[33](/en/ch7#Tong2013)], SolrCloud, +and VoltDB [[34](/en/ch7#Pavlo2013)] +all use local secondary indexes. + +## Global Secondary Indexes + +Rather than each shard having its own, local secondary index, we can construct a *global index* that +covers data in all shards. However, we can’t just store that index on one node, since it would +likely become a bottleneck and defeat the purpose of sharding. A global index must also be sharded, +but it can be sharded differently from the primary key index. + +[Figure 7-10](/en/ch7#fig_sharding_global_secondary) illustrates what this could look like: the IDs of red cars from +all shards appear under `color:red` in the index, but the index is sharded so that colors starting +with the letters *a* to *r* appear in shard 0 and colors starting with *s* to *z* appear in shard 1. +The index on the make of car is partitioned similarly (with the shard boundary being between *f* and +*h*). + +![ddia 0710](/fig/ddia_0710.png) + +###### Figure 7-10. A global secondary index reflects data from all shards, and is itself sharded by the indexed value. + +This kind of index is also called *term-partitioned* +[[30](/en/ch7#Manning2008_ch7)]: +recall from [“Full-Text Search”](/en/ch4#sec_storage_full_text) that in full-text search, a *term* is a keyword in a text that +you can search for. 
Here we generalise it to mean any value that you can search for in the secondary +index. + +The global index uses the term as partition key, so that when you’re looking for a particular term +or value, you can figure out which shard you need to query. As before, a shard can contain a +contiguous range of terms (as in [Figure 7-10](/en/ch7#fig_sharding_global_secondary)), or you can assign terms to +shards based on a hash of the term. + +Global indexes have the advantage that a query with a single condition (such as *color = red*) only +needs to read from a single shard to fetch the postings list. However, if you want to fetch records +and not just IDs, you still have to read from all the shards that are responsible for those IDs. + +If you have multiple search conditions or terms (e.g., searching for cars of a certain color and a +certain make, or searching for multiple words occurring in the same text), it’s likely that those +terms will be assigned to different shards. To compute the logical AND of the two conditions, the +system needs to find all the IDs that occur in both of the postings lists. That’s no problem if the +postings lists are short, but if they are long, it can be slow to send them over the network to +compute their intersection [[30](/en/ch7#Manning2008_ch7)]. + +Another challenge with global secondary indexes is that writes are more complicated than with local +indexes, because writing a single record might affect multiple shards of the index (every term in +the document might be on a different shard). This makes it harder to keep the secondary index in +sync with the underlying data. One option is to use a distributed transaction to atomically update +the shards storing the primary record and its secondary indexes (see [Chapter 8](/en/ch8#ch_transactions)). + +Global secondary indexes are used by CockroachDB, TiDB, and YugabyteDB; DynamoDB supports both local +and global secondary indexes. 
In the case of DynamoDB, writes are asynchronously reflected in global +indexes, so reads from a global index may be stale (similarly to replication lag, as in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag)). +Nevertheless, global indexes are useful if read throughput is higher than write throughput, and if +the postings lists are not too long. + +# Summary + +In this chapter we explored different ways of sharding a large dataset into smaller subsets. +Sharding is necessary when you have so much data that storing and processing it on a single machine +is no longer feasible. + +The goal of sharding is to spread the data and query load evenly across multiple machines, avoiding +hot spots (nodes with disproportionately high load). This requires choosing a sharding scheme that +is appropriate to your data, and rebalancing the shards when nodes are added to or removed from the +cluster. + +We discussed two main approaches to sharding: + +* *Key range sharding*, where keys are sorted, and a shard owns all the keys from some minimum up to + some maximum. Sorting has the advantage that efficient range queries are possible, but there is a + risk of hot spots if the application often accesses keys that are close together in the sorted + order. + + In this approach, shards are typically rebalanced by splitting the range into two subranges when a + shard gets too big. +* *Hash sharding*, where a hash function is applied to each key, and a shard owns a range of hash + values (or another consistent hashing algorithm may be used to map hashes to shards). This method + destroys the ordering of keys, making range queries inefficient, but it may distribute load more + evenly. + + When sharding by hash, it is common to create a fixed number of shards in advance, to assign several + shards to each node, and to move entire shards from one node to another when nodes are added or + removed. Splitting shards, like with key ranges, is also possible. 
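As a rough illustration of the two approaches in the list above, the following Python sketch maps a key to a shard number in each scheme. The function names, the split points, and the choice of SHA-256 truncated to 64 bits are assumptions made for this example; real systems use their own hash functions and boundary management.

```python
import bisect
import hashlib


def key_range_shard(key, boundaries):
    """Key range sharding: `boundaries` is a sorted list of split points.

    Shard 0 owns keys below boundaries[0]; shard i owns keys k with
    boundaries[i-1] <= k < boundaries[i]; the last shard owns the rest.
    """
    return bisect.bisect_right(boundaries, key)


def hash_shard(key, num_shards):
    """Hash sharding: each shard owns an equal range of a 64-bit hash space.

    A stable hash is used on purpose. Python's built-in hash() of a str is
    randomized per process, so it would route the same key differently on
    different nodes.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    h = int.from_bytes(digest[:8], "big")  # value in [0, 2**64)
    return (h * num_shards) >> 64          # index of the range containing h


boundaries = ["g", "n", "t"]               # hypothetical split points, 4 shards
range_shards = [key_range_shard(k, boundaries) for k in ["alice", "harry", "zoe"]]
hash_shards = [hash_shard(k, 4) for k in ["alice", "harry", "zoe"]]
```

With key ranges, keys that sort next to each other land on the same shard, which is what makes range scans cheap and hot spots possible. With hashing, adjacent keys are scattered, so a range query has to consult every shard.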
+ +It is common to use the first part of the key as the partition key (i.e., to identify the shard), +and to sort records within that shard by the rest of the key. That way you can still have efficient +range queries among the records with the same partition key. + +We also discussed the interaction between sharding and secondary indexes. A secondary index also +needs to be sharded, and there are two methods: + +* *Local secondary indexes*, where the secondary indexes are stored + in the same shard as the primary key and value. This means that only a single shard needs to be + updated on write, but a lookup of the secondary index requires reading from all shards. +* *Global secondary indexes*, which are sharded separately based on + the indexed values. An entry in the secondary index may refer to records from all shards of the + primary key. When a record is written, several secondary index shards may need to be updated; + however, a read of the postings list can be served from a single shard (fetching the actual + records still requires reading from multiple shards). + +Finally, we discussed techniques for routing queries to the appropriate shard, and how a +coordination service is often used to keep track of the assignment of shards to nodes. + +By design, every shard operates mostly independently—that’s what allows a sharded database to scale +to multiple machines. However, operations that need to write to several shards can be problematic: +for example, what happens if the write to one shard succeeds, but another fails? We will address +that question in the following chapters. + +##### Footnotes + +##### References + +[[1](/en/ch7#Giordano2023-marker)] Claire Giordano. +[Understanding +partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023.
+Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959) + +[[2](/en/ch7#Leach2022-marker)] Brandur Leach. +[Partitioning in Postgres, 2022 +edition](https://brandur.org/fragments/postgres-partitioning-2022). *brandur.org*, October 2022. +Archived at [perma.cc/Z5LE-6AKX](https://perma.cc/Z5LE-6AKX) + +[[3](/en/ch7#Koster2009-marker)] Raph Koster. +[Database “sharding” +came from UO?](https://www.raphkoster.com/2009/01/08/database-sharding-came-from-uo/) *raphkoster.com*, January 2009. +Archived at [perma.cc/4N9U-5KYF](https://perma.cc/4N9U-5KYF) + +[[4](/en/ch7#Fidalgo2021-marker)] Garrett Fidalgo. +[Herding elephants: Lessons learned +from sharding Postgres at Notion](https://www.notion.com/blog/sharding-postgres-at-notion). *notion.com*, October 2021. +Archived at [perma.cc/5J5V-W2VX](https://perma.cc/5J5V-W2VX) + +[[5](/en/ch7#Drepper2007-marker)] Ulrich Drepper. +[What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf). +*akkadia.org*, November 2007. Archived at +[perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ) + +[[6](/en/ch7#Zhou2021_ch7-marker)] Jingyu Zhou, Meng Xu, Alexander Shraer, Bala +Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, +Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin +Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. +[FoundationDB: A Distributed Unbundled +Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data* +(SIGMOD), June 2021. +[doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559) + +[[7](/en/ch7#Slot2023-marker)] Marco Slot. +[Citus 12: +Schema-based sharding for PostgreSQL](https://www.citusdata.com/blog/2023/07/18/citus-12-schema-based-sharding-for-postgres/). *citusdata.com*, July 2023. 
+Archived at [perma.cc/R874-EC9W](https://perma.cc/R874-EC9W) + +[[8](/en/ch7#Oliveira2023-marker)] Robisson Oliveira. +[Reducing +the Scope of Impact with Cell-Based Architecture](https://docs.aws.amazon.com/pdfs/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/reducing-scope-of-impact-with-cell-based-architecture.pdf). AWS Well-Architected white paper, Amazon Web +Services, September 2023. +Archived at [perma.cc/4KWW-47NR](https://perma.cc/4KWW-47NR) + +[[9](/en/ch7#Shapira2023dont-marker)] Gwen Shapira. +[Things DBs Don’t Do - But Should](https://www.thenile.dev/blog/things-dbs-dont-do). +*thenile.dev*, February 2023. +Archived at [perma.cc/C3J4-JSFW](https://perma.cc/C3J4-JSFW) + +[[10](/en/ch7#Schwarzkopf2019-marker)] Malte Schwarzkopf, Eddie Kohler, M. Frans +Kaashoek, and Robert Morris. +[Position: GDPR +Compliance by Construction](https://cs.brown.edu/people/malte/pub/papers/2019-poly-gdpr.pdf). At *Towards Polystores that manage multiple Databases, Privacy, +Security and/or Policy Issues for Heterogenous Data* (Poly), August 2019. +[doi:10.1007/978-3-030-33752-0\_3](https://doi.org/10.1007/978-3-030-33752-0_3) + +[[11](/en/ch7#Shapira2024-marker)] Gwen Shapira. +[Introducing pg\_karnak: Transactional schema +migration across tenant databases](https://www.thenile.dev/blog/distributed-ddl). *thenile.dev*, November 2024. +Archived at [perma.cc/R5RD-8HR9](https://perma.cc/R5RD-8HR9) + +[[12](/en/ch7#Ganguli2020-marker)] Arka Ganguli, Guido Iaquinti, +Maggie Zhou, and Rafael Chacón. +[Scaling Datastores at +Slack with Vitess](https://slack.engineering/scaling-datastores-at-slack-with-vitess/). *slack.engineering*, December 2020. +Archived at [perma.cc/UW8F-ALJK](https://perma.cc/UW8F-ALJK) + +[[13](/en/ch7#Lan2011-marker)] Ikai Lan. +[App +Engine Datastore Tip: Monotonically Increasing Values Are Bad](https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/). *ikaisays.com*, +January 2011. 
Archived at [perma.cc/BPX8-RPJB](https://perma.cc/BPX8-RPJB) + +[[14](/en/ch7#Soztutar2013split-marker)] Enis Soztutar. +[Apache +HBase Region Splitting and Merging](https://www.cloudera.com/blog/technical/apache-hbase-region-splitting-and-merging.html). *cloudera.com*, February 2013. +Archived at [perma.cc/S9HS-2X2C](https://perma.cc/S9HS-2X2C) + +[[15](/en/ch7#Evans2013-marker)] Eric Evans. +[Rethinking Topology in Cassandra](https://www.youtube.com/watch?v=Qz6ElTdYjjU). At +*Cassandra Summit*, June 2013. +Archived at [perma.cc/2DKM-F438](https://perma.cc/2DKM-F438) + +[[16](/en/ch7#Kleppmann2012hash-marker)] Martin Kleppmann. +[Java’s +hashCode Is Not Safe for Distributed Systems](https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html). *martin.kleppmann.com*, June 2012. +Archived at [perma.cc/LK5U-VZSN](https://perma.cc/LK5U-VZSN) + +[[17](/en/ch7#Elhemali2022_ch7-marker)] Mostafa Elhemali, Niall Gallagher, Nicholas +Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu +Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, +Doug Terry, and Akshat Vig. +[Amazon DynamoDB: A Scalable, +Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical +Conference* (ATC), July 2022. + +[[18](/en/ch7#Williams2012-marker)] Brandon Williams. +[Virtual Nodes in Cassandra +1.2](https://www.datastax.com/blog/virtual-nodes-cassandra-12). *datastax.com*, December 2012. +Archived at [perma.cc/N385-EQXV](https://perma.cc/N385-EQXV) + +[[19](/en/ch7#Lambov2016-marker)] Branimir Lambov. +[New Token +Allocation Algorithm in Cassandra 3.0](https://www.datastax.com/blog/new-token-allocation-algorithm-cassandra-30). *datastax.com*, January 2016. 
+Archived at [perma.cc/2BG7-LDWY](https://perma.cc/2BG7-LDWY) + +[[20](/en/ch7#Karger1997-marker)] David Karger, Eric Lehman, Tom Leighton, Rina +Panigrahy, Matthew Levine, and Daniel Lewin. +[Consistent Hashing and Random Trees: +Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web](https://people.csail.mit.edu/karger/Papers/web.pdf). +At *29th Annual ACM Symposium on Theory of Computing* (STOC), May 1997. +[doi:10.1145/258533.258660](https://doi.org/10.1145/258533.258660) + +[[21](/en/ch7#Gryski2018-marker)] Damian Gryski. +[Consistent +Hashing: Algorithmic Tradeoffs](https://dgryski.medium.com/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8). *dgryski.medium.com*, April 2018. +Archived at [perma.cc/B2WF-TYQ8](https://perma.cc/B2WF-TYQ8) + +[[22](/en/ch7#Thaler1998-marker)] David G. Thaler and Chinya V. Ravishankar. +[Using name-based mappings to increase +hit rates](https://www.cs.kent.edu/~javed/DL/web/p1-thaler.pdf). *IEEE/ACM Transactions on Networking*, volume 6, issue 1, pages 1–14, February 1998. +[doi:10.1109/90.663936](https://doi.org/10.1109/90.663936) + +[[23](/en/ch7#Lamping2014-marker)] John Lamping and Eric Veach. +[A Fast, Minimal Memory, Consistent Hash +Algorithm](https://arxiv.org/abs/1406.2294). *arxiv.org*, June 2014. + +[[24](/en/ch7#Axon2010_ch7-marker)] Samuel Axon. +[3% of Twitter’s Servers +Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010. +Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX) + +[[25](/en/ch7#Guo2020-marker)] Gerald Guo and Thawan Kooburat. +[Scaling +services with Shard Manager](https://engineering.fb.com/2020/08/24/production-engineering/scaling-services-with-shard-manager/). *engineering.fb.com*, August 2020. 
+Archived at [perma.cc/EFS3-XQYT](https://perma.cc/EFS3-XQYT) + +[[26](/en/ch7#Lee2021-marker)] Sangmin Lee, Zhenhua Guo, Omer Sunercan, Jun Ying, Thawan +Kooburat, Suryadeep Biswal, Jun Chen, Kun Huang, Yatpang Cheung, Yiding Zhou, Kaushik Veeraraghavan, +Biren Damani, Pol Mauri Ruiz, Vikas Mehta, and Chunqiang Tang. +[Shard Manager: A Generic Shard +Management Framework for Geo-distributed Applications](https://dl.acm.org/doi/pdf/10.1145/3477132.3483546). *28th ACM SIGOPS Symposium on +Operating Systems Principles* (SOSP), pages 553–569, October 2021. +[doi:10.1145/3477132.3483546](https://doi.org/10.1145/3477132.3483546) + +[[27](/en/ch7#Fritchie2018-marker)] Scott Lystig Fritchie. +[A Critique of Resizable Hash +Tables: Riak Core & Random Slicing](https://www.infoq.com/articles/dynamo-riak-random-slicing/). *infoq.com*, August 2018. +Archived at [perma.cc/RPX7-7BLN](https://perma.cc/RPX7-7BLN) + +[[28](/en/ch7#Warfield2023_ch7-marker)] Andy Warfield. +[Building +and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. +Archived at [perma.cc/6S7P-GLM4](https://perma.cc/6S7P-GLM4) + +[[29](/en/ch7#Houlihan2017-marker)] Rich Houlihan. +[DynamoDB adaptive capacity: smooth performance +for chaotic workloads (DAT327)](https://www.youtube.com/watch?v=kMY0_m29YzU). At *AWS re:Invent*, November 2017. + +[[30](/en/ch7#Manning2008_ch7-marker)] Christopher D. Manning, Prabhakar Raghavan, +and Hinrich Schütze. +[*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/). +Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at +[nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/) + +[[31](/en/ch7#Busch2012-marker)] Michael Busch, Krishna Gade, Brian Larson, Patrick +Lok, Samuel Luckenbill, and Jimmy Lin. 
+[Earlybird: +Real-Time Search at Twitter](https://cs.uwaterloo.ca/~jimmylin/publications/Busch_etal_ICDE2012.pdf). At *28th IEEE International Conference on Data Engineering* +(ICDE), April 2012. +[doi:10.1109/ICDE.2012.149](https://doi.org/10.1109/ICDE.2012.149) + +[[32](/en/ch7#HarEl2017-marker)] Nadav Har’El. +[Indexing in Cassandra 3](https://github.com/scylladb/scylladb/wiki/Indexing-in-Cassandra-3). +*github.com*, April 2017. +Archived at [perma.cc/3ENV-8T9P](https://perma.cc/3ENV-8T9P) + +[[33](/en/ch7#Tong2013-marker)] Zachary Tong. +[Customizing Your +Document Routing](https://www.elastic.co/blog/customizing-your-document-routing/). *elastic.co*, June 2013. +Archived at [perma.cc/97VM-MREN](https://perma.cc/97VM-MREN) + +[[34](/en/ch7#Pavlo2013-marker)] Andrew Pavlo. +[H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/). +*hstore.cs.brown.edu*, October 2013. +Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z) diff --git a/content/en/ch8.md b/content/en/ch8.md index 8f983df..5a6edef 100644 --- a/content/en/ch8.md +++ b/content/en/ch8.md @@ -1,158 +1,2871 @@ --- -title: "8. The Trouble with Distributed Systems" -linkTitle: "8. The Trouble with Distributed Systems" +title: "8. Transactions" weight: 208 breadcrumbs: false --- - -![](/img/ch8.png) - -> *Hey I just met you* -> *The network’s laggy* -> *But here’s my data* -> *So store it maybe* +> *Some authors have claimed that general two-phase commit is too expensive to support, because of the +> performance or availability problems that it brings. 
We believe it is better to have application +> programmers deal with performance problems due to overuse of transactions as bottlenecks arise, +> rather than always coding around the lack of transactions.* > -> ​ — Kyle Kingsbury, *Carly Rae Jepsen and the Perils of Network Partitions* (2013) +> James Corbett et al., *Spanner: Google’s Globally-Distributed Database* (2012) ---------- +In the harsh reality of data systems, many things can go wrong: -A recurring theme in the last few chapters has been how systems handle things going wrong. For example, we discussed replica failover (“[Handling Node Outages](/en/ch5#handing-node-outages)”), replication lag (“[Problems with Replication Lag](/en/ch5#problems-with-replication-lag)”), and con‐ currency control for transactions (“[Weak Isolation Levels](/en/ch7#weak-isolation-levels)”). As we come to understand various edge cases that can occur in real systems, we get better at handling them. +* The database software or hardware may fail at any time (including in the middle of a write + operation). +* The application may crash at any time (including halfway through a series of operations). +* Interruptions in the network can unexpectedly cut off the application from the database, or one + database node from another. +* Several clients may write to the database at the same time, overwriting each other’s changes. +* A client may read data that doesn’t make sense because it has only partially been updated. +* Race conditions between clients can cause surprising bugs. -However, even though we have talked a lot about faults, the last few chapters have still been too optimistic. The reality is even darker. We will now turn our pessimism to the maximum and assume that anything that *can* go wrong *will* go wrong.[^i] (Experienced systems operators will tell you that is a reasonable assumption. If you ask nicely, they might tell you some frightening stories while nursing their scars of past battles.) 
+In order to be reliable, a system has to deal with these faults and ensure that they don’t cause +catastrophic failure of the entire system. However, implementing fault-tolerance mechanisms is a lot +of work. It requires a lot of careful thinking about all the things that can go wrong, and a lot of +testing to ensure that the solution actually works. -[^i]: With one exception: we will assume that faults are *non-Byzantine* (see “[Byzantine Faults](/en/ch8#byzantine-faults)”). +For decades, *transactions* have been the mechanism of choice for simplifying these issues. A +transaction is a way for an application to group several reads and writes together into a logical +unit. Conceptually, all the reads and writes in a transaction are executed as one operation: either +the entire transaction succeeds (*commit*) or it fails (*abort*, *rollback*). If it fails, the +application can safely retry. With transactions, error handling becomes much simpler for an +application, because it doesn’t need to worry about partial failure—i.e., the case where some +operations succeed and some fail (for whatever reason). -Working with distributed systems is fundamentally different from writing software on a single computer—and the main difference is that there are lots of new and excit‐ ing ways for things to go wrong [1, 2]. In this chapter, we will get a taste of the prob‐ lems that arise in practice, and an understanding of the things we can and cannot rely on. +If you have spent years working with transactions, they may seem obvious, but we shouldn’t take them +for granted. Transactions are not a law of nature; they were created with a purpose, namely to +*simplify the programming model* for applications accessing a database. By using transactions, the +application is free to ignore certain potential error scenarios and concurrency issues, because the +database takes care of them instead (we call these *safety guarantees*). 
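The commit-or-abort behavior described above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch, not a recommendation for any particular database: the `accounts` table, the `transfer` and `balances` helpers, and the `CHECK` constraint used to provoke a failure are all made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    " name TEXT PRIMARY KEY,"
    " balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)])
conn.commit()


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts"))


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst as one transaction (hypothetical helper)."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            # If this second write violates the CHECK constraint, the first
            # write is undone as well: the transaction aborts as a whole.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        return False


failed = transfer(conn, "alice", "bob", 500)   # would overdraw alice: aborted
after_failed = balances(conn)                  # no partial credit to bob remains
succeeded = transfer(conn, "alice", "bob", 30)
after_success = balances(conn)
```

Because the failed transfer leaves no trace, the caller does not have to work out which of the two updates took effect before deciding whether to retry.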
-In the end, our task as engineers is to build systems that do their job (i.e., meet the guarantees that users are expecting), in spite of everything going wrong. In [Chapter 9](/en/ch9), we will look at some examples of algorithms that can provide such guarantees in a distributed system. But first, in this chapter, we must understand what challenges we are up against. +Not every application needs transactions, and sometimes there are advantages to weakening +transactional guarantees or abandoning them entirely (for example, to achieve higher performance or +higher availability). Some safety properties can be achieved without transactions. On the other +hand, transactions can prevent a lot of grief: for example, the technical cause behind the Post +Office Horizon scandal (see [“How Important Is Reliability?”](/en/ch2#sidebar_reliability_importance)) was probably a lack of ACID +transactions in the underlying accounting system +[[1](/en/ch8#Murdoch2021)]. -This chapter is a thoroughly pessimistic and depressing overview of things that may go wrong in a distributed system. We will look into problems with networks (“[Unreliable Networks](#unreliable-networks)”); clocks and timing issues (“[Unreliable Clocks](#unreliable-clocks)”); and we’ll discuss to what degree they are avoidable. The consequences of all these issues are disorienting, so we’ll explore how to think about the state of a dis‐ tributed system and how to reason about things that have happened (“[Knowledge, Truth, and Lies](#knowledge-truth-and-lies)”). +How do you figure out whether you need transactions? In order to answer that question, we first need +to understand exactly what safety guarantees transactions can provide, and what costs are associated +with them. Although transactions seem straightforward at first glance, there are actually many +subtle but important details that come into play. 
+In this chapter, we will examine many examples of things that can go wrong, and explore the +algorithms that databases use to guard against those issues. We will go especially deep in the area +of concurrency control, discussing various kinds of race conditions that can occur and how +databases implement isolation levels such as *read committed*, *snapshot isolation*, and +*serializability*. +Concurrency control is relevant for both single-node and distributed databases. Later in this +chapter, in [“Distributed Transactions”](/en/ch8#sec_transactions_distributed), we will examine the *two-phase commit* protocol and +the challenge of achieving atomicity in a distributed transaction. -## …… +# What Exactly Is a Transaction? +Almost all relational databases today, and some nonrelational databases, support transactions. Most +of them follow the style that was introduced in 1975 by IBM System R, the first SQL database +[[2](/en/ch8#Chamberlin1981), +[3](/en/ch8#Gray1976), +[4](/en/ch8#Eswaran1976)]. +Although some implementation details have changed, the general idea has remained virtually the same +for 50 years: the transaction support in MySQL, PostgreSQL, Oracle, SQL Server, etc., is uncannily +similar to that of System R. +In the late 2000s, nonrelational (NoSQL) databases started gaining popularity. They aimed to +improve upon the relational status quo by offering a choice of new data models (see +[Chapter 3](/en/ch3#ch_datamodels)), and by including replication ([Chapter 6](/en/ch6#ch_replication)) and sharding +([Chapter 7](/en/ch7#ch_sharding)) by default. Transactions were the main casualty of this movement: many of this +generation of databases abandoned transactions entirely, or redefined the word to describe a +much weaker set of guarantees than had previously been understood. 
-## Summary +The hype around NoSQL distributed databases led to a popular belief that transactions were +fundamentally unscalable, and that any large-scale system would have to abandon transactions in +order to maintain good performance and high availability. More recently, that belief has turned out +to be wrong. So-called “NewSQL” databases such as CockroachDB +[[5](/en/ch8#Taft2020_ch8)], +TiDB [[6](/en/ch8#Huang2020)], +Spanner [[7](/en/ch8#Corbett2012_ch8)], +FoundationDB [[8](/en/ch8#Zhou2021_ch8)], +and Yugabyte have shown that transactional systems can scale to large data volumes and high +throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide +strong ACID guarantees at scale. -In this chapter we have discussed a wide range of problems that can occur in dis‐ tributed systems, including: +However, that doesn’t mean that every system must be transactional either: like every other +technical design choice, transactions have advantages and limitations. In order to understand those +trade-offs, let’s go into the details of the guarantees that transactions can provide—both in normal +operation and in various extreme (but realistic) circumstances. -- Whenever you try to send a packet over the network, it may be lost or arbitrarily delayed. Likewise, the reply may be lost or delayed, so if you don’t get a reply, you have no idea whether the message got through. -- A node’s clock may be significantly out of sync with other nodes (despite your best efforts to set up NTP), it may suddenly jump forward or back in time, and relying on it is dangerous because you most likely don’t have a good measure of your clock’s error interval. -- A process may pause for a substantial amount of time at any point in its execu‐ tion (perhaps due to a stop-the-world garbage collector), be declared dead by other nodes, and then come back to life again without realizing that it was paused. 
+## The Meaning of ACID -The fact that such *partial failures* can occur is the defining characteristic of dis‐ tributed systems. Whenever software tries to do anything involving other nodes, there is the possibility that it may occasionally fail, or randomly go slow, or not respond at all (and eventually time out). In distributed systems, we try to build tolerance of partial failures into software, so that the system as a whole may continue functioning even when some of its constituent parts are broken. +The safety guarantees provided by transactions are often described by the well-known acronym *ACID*, +which stands for *Atomicity*, *Consistency*, *Isolation*, and *Durability*. It was coined in 1983 by +Theo Härder and Andreas Reuter +[[9](/en/ch8#Harder1983)] +in an effort to establish precise terminology for fault-tolerance mechanisms in databases. -To tolerate faults, the first step is to *detect* them, but even that is hard. Most systems don’t have an accurate mechanism of detecting whether a node has failed, so most distributed algorithms rely on timeouts to determine whether a remote node is still available. However, timeouts can’t distinguish between network and node failures, and variable network delay sometimes causes a node to be falsely suspected of crash‐ ing. Moreover, sometimes a node can be in a degraded state: for example, a Gigabit network interface could suddenly drop to 1 Kb/s throughput due to a driver bug [94]. Such a node that is “limping” but not dead can be even more difficult to deal with than a cleanly failed node. +However, in practice, one database’s implementation of ACID does not equal another’s implementation. +For example, as we shall see, there is a lot of ambiguity around the meaning of *isolation* +[[10](/en/ch8#Bailis2013HAT)]. +The high-level idea is sound, but the devil is in the details. Today, when a system claims to be +“ACID compliant,” it’s unclear what guarantees you can actually expect. 
ACID has unfortunately +become mostly a marketing term. -Once a fault is detected, making a system tolerate it is not easy either: there is no global variable, no shared memory, no common knowledge or any other kind of shared state between the machines. Nodes can’t even agree on what time it is, let alone on anything more profound. The only way information can flow from one node to another is by sending it over the unreliable network. Major decisions cannot be safely made by a single node, so we require protocols that enlist help from other nodes and try to get a quorum to agree. +(Systems that do not meet the ACID criteria are sometimes called *BASE*, which stands for +*Basically Available*, *Soft state*, and *Eventual consistency* +[[11](/en/ch8#Fox1997)]. +This is even more vague than the definition of ACID. It seems that the only sensible definition of +BASE is “not ACID”; i.e., it can mean almost anything you want.) -If you’re used to writing software in the idealized mathematical perfection of a single computer, where the same operation always deterministically returns the same result, then moving to the messy physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems engineers will often regard a problem as triv‐ ial if it can be solved on a single computer [5], and indeed a single computer can do a lot nowadays [95]. If you can avoid opening Pandora’s box and simply keep things on a single machine, it is generally worth doing so. +Let’s dig into the definitions of atomicity, consistency, isolation, and durability, as this will let +us refine our idea of transactions. -However, as discussed in the introduction to [Part II](/en/part-ii), scalability is not the only reason for wanting to use a distributed system. Fault tolerance and low latency (by placing data geographically close to users) are equally important goals, and those things can‐ not be achieved with a single node. 
+### Atomicity -In this chapter we also went on some tangents to explore whether the unreliability of networks, clocks, and processes is an inevitable law of nature. We saw that it isn’t: it is possible to give hard real-time response guarantees and bounded delays in net‐ works, but doing so is very expensive and results in lower utilization of hardware resources. Most non-safety-critical systems choose cheap and unreliable over expen‐ sive and reliable. +In general, *atomic* refers to something that cannot be broken down into smaller parts. The word +means similar but subtly different things in different branches of computing. For example, in +multi-threaded programming, if one thread executes an atomic operation, that means there is no way +that another thread could see the half-finished result of the operation. The system can only be in +the state it was before the operation or after the operation, not something in between. -We also touched on supercomputers, which assume reliable components and thus have to be stopped and restarted entirely when a component does fail. By contrast, distributed systems can run forever without being interrupted at the service level, because all faults and maintenance can be handled at the node level—at least in theory. (In practice, if a bad configuration change is rolled out to all nodes, that will still bring a distributed system to its knees.) +By contrast, in the context of ACID, atomicity is *not* about concurrency. It does not describe +what happens if several processes try to access the same data at the same time, because that is +covered under the letter *I*, for *isolation* (see [“Isolation”](/en/ch8#sec_transactions_acid_isolation)). -This chapter has been all about problems, and has given us a bleak outlook. In the next chapter we will move on to solutions, and discuss some algorithms that have been designed to cope with all the problems in distributed systems. 
+Rather, ACID atomicity describes what happens if a client wants to make several writes, but a fault +occurs after some of the writes have been processed—for example, a process crashes, a network +connection is interrupted, a disk becomes full, or some integrity constraint is violated. +If the writes are grouped together into an atomic transaction, and the transaction cannot be +completed (*committed*) due to a fault, then the transaction is *aborted* and the database must +discard or undo any writes it has made so far in that transaction. -## References +Without atomicity, if an error occurs partway through making multiple changes, it’s difficult to +know which changes have taken effect and which haven’t. The application could try again, but that +risks making the same change twice, leading to duplicate or incorrect data. Atomicity simplifies +this problem: if a transaction was aborted, the application can be sure that it didn’t change +anything, so it can safely be retried. -1. Mark Cavage: “[There’s Just No Getting Around It: You’re Building a Distributed System](http://queue.acm.org/detail.cfm?id=2482856),” *ACM Queue*, volume 11, number 4, pages 80-89, April 2013. [doi:10.1145/2466486.2482856](http://dx.doi.org/10.1145/2466486.2482856) -1. Jay Kreps: “[Getting Real About Distributed System Reliability](http://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability),” *blog.empathybox.com*, March 19, 2012. -1. Sydney Padua: *The Thrilling Adventures of Lovelace and Babbage: The (Mostly) True Story of the First Computer*. Particular Books, April 2015. ISBN: 978-0-141-98151-2 -1. Coda Hale: “[You Can’t Sacrifice Partition Tolerance](http://codahale.com/you-cant-sacrifice-partition-tolerance/),” *codahale.com*, October 7, 2010. -1. 
Jeff Hodges: “[Notes on Distributed Systems for Young Bloods](https://web.archive.org/web/20200218095605/https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/),” *somethingsimilar.com*, January 14, 2013. -1. Antonio Regalado: “[Who Coined 'Cloud Computing'?](https://www.technologyreview.com/2011/10/31/257406/who-coined-cloud-computing/),” *technologyreview.com*, October 31, 2011. -1. Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle: “[The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines, Second Edition](https://web.archive.org/web/20140404113735/http://www.morganclaypool.com/doi/abs/10.2200/S00516ED2V01Y201306CAC024),” *Synthesis Lectures on Computer Architecture*, volume 8, number 3, Morgan & Claypool Publishers, July 2013. [doi:10.2200/S00516ED2V01Y201306CAC024](http://dx.doi.org/10.2200/S00516ED2V01Y201306CAC024), ISBN: 978-1-627-05010-4 -1. David Fiala, Frank Mueller, Christian Engelmann, et al.: “[Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing](http://moss.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/sc12.pdf),” at *International Conference for High Performance Computing, Networking, Storage and Analysis* (SC12), November 2012. -1. Arjun Singh, Joon Ong, Amit Agarwal, et al.: “[Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network](http://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf),” at *Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. [doi:10.1145/2785956.2787508](http://dx.doi.org/10.1145/2785956.2787508) -1. Glenn K. Lockwood: “[Hadoop's Uncomfortable Fit in HPC](http://glennklockwood.blogspot.co.uk/2014/05/hadoops-uncomfortable-fit-in-hpc.html),” *glennklockwood.blogspot.co.uk*, May 16, 2014. -1. 
John von Neumann: “[Probabilistic Logics and the Synthesis of Reliable Organisms from Unreliable Components](https://personalpages.manchester.ac.uk/staff/nikolaos.kyparissas/uploads/VonNeumann1956.pdf),” in *Automata Studies (AM-34)*, edited by Claude E. Shannon and John McCarthy, Princeton University Press, 1956. ISBN: 978-0-691-07916-5 -1. Richard W. Hamming: *The Art of Doing Science and Engineering*. Taylor & Francis, 1997. ISBN: 978-9-056-99500-3 -1. Claude E. Shannon: “[A Mathematical Theory of Communication](http://cs.brynmawr.edu/Courses/cs380/fall2012/shannon1948.pdf),” *The Bell System Technical Journal*, volume 27, number 3, pages 379–423 and 623–656, July 1948. -1. Peter Bailis and Kyle Kingsbury: “[The Network Is Reliable](https://queue.acm.org/detail.cfm?id=2655736),” *ACM Queue*, volume 12, number 7, pages 48-55, July 2014. [doi:10.1145/2639988.2639988](http://dx.doi.org/10.1145/2639988.2639988) -1. Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera, and Michael Walfish: “[Taming Uncertainty in Distributed Systems with Help from the Network](http://www.cs.nyu.edu/~mwalfish/papers/albatross-eurosys15.pdf),” at *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741976](http://dx.doi.org/10.1145/2741948.2741976) -1. Phillipa Gill, Navendu Jain, and Nachiappan Nagappan: “[Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications](http://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p350.pdf),” at *ACM SIGCOMM Conference*, August 2011. [doi:10.1145/2018436.2018477](http://dx.doi.org/10.1145/2018436.2018477) -1. Mark Imbriaco: “[Downtime Last Saturday](https://github.com/blog/1364-downtime-last-saturday),” *github.com*, December 26, 2012. -1. Will Oremus: “[The Global Internet Is Being Attacked by Sharks, Google Confirms](http://www.slate.com/blogs/future_tense/2014/08/15/shark_attacks_threaten_google_s_undersea_internet_cables_video.html),” *slate.com*, August 15, 2014. -1. 
Marc A. Donges: “[Re: bnx2 cards Intermittantly Going Offline](http://www.spinics.net/lists/netdev/msg210485.html),” Message to Linux *netdev* mailing list, *spinics.net*, September 13, 2012. -1. Kyle Kingsbury: “[Call Me Maybe: Elasticsearch](https://aphyr.com/posts/317-call-me-maybe-elasticsearch),” *aphyr.com*, June 15, 2014. -1. Salvatore Sanfilippo: “[A Few Arguments About Redis Sentinel Properties and Fail Scenarios](http://antirez.com/news/80),” *antirez.com*, October 21, 2014. -1. Bert Hubert: “[The Ultimate SO_LINGER Page, or: Why Is My TCP Not Reliable](http://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable),” *blog.netherlabs.nl*, January 18, 2009. -1. Nicolas Liochon: “[CAP: If All You Have Is a Timeout, Everything Looks Like a Partition](http://blog.thislongrun.com/2015/05/CAP-theorem-partition-timeout-zookeeper.html),” *blog.thislongrun.com*, May 25, 2015. -1. Jerome H. Saltzer, David P. Reed, and David D. Clark: “[End-To-End Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf),” *ACM Transactions on Computer Systems*, volume 2, number 4, pages 277–288, November 1984. [doi:10.1145/357401.357402](http://dx.doi.org/10.1145/357401.357402) -1. Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, et al.: “[Queues Don’t Matter When You Can JUMP Them!](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-grosvenor_update.pdf),” at *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015. -1. Guohui Wang and T. S. Eugene Ng: “[The Impact of Virtualization on Network Performance of Amazon EC2 Data Center](http://www.cs.rice.edu/~eugeneng/papers/INFOCOM10-ec2.pdf),” at *29th IEEE International Conference on Computer Communications* (INFOCOM), March 2010. [doi:10.1109/INFCOM.2010.5461931](http://dx.doi.org/10.1109/INFCOM.2010.5461931) -1. 
Van Jacobson: “[Congestion Avoidance and Control](http://www.cs.usask.ca/ftp/pub/discus/seminars2002-2003/p314-jacobson.pdf),” at *ACM Symposium on Communications Architectures and Protocols* (SIGCOMM), August 1988. [doi:10.1145/52324.52356](http://dx.doi.org/10.1145/52324.52356) -1. Brandon Philips: “[etcd: Distributed Locking and Service Discovery](https://www.youtube.com/watch?v=HJIjTTHWYnE),” at *Strange Loop*, September 2014. -1. Steve Newman: “[A Systematic Look at EC2 I/O](https://web.archive.org/web/20141211094156/http://blog.scalyr.com/2012/10/a-systematic-look-at-ec2-io/),” *blog.scalyr.com*, October 16, 2012. -1. Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama: “[The ϕ Accrual Failure Detector](http://hdl.handle.net/10119/4784),” Japan Advanced Institute of Science and Technology, School of Information Science, Technical Report IS-RR-2004-010, May 2004. -1. Jeffrey Wang: “[Phi Accrual Failure Detector](http://ternarysearch.blogspot.co.uk/2013/08/phi-accrual-failure-detector.html),” *ternarysearch.blogspot.co.uk*, August 11, 2013. -1. Srinivasan Keshav: *An Engineering Approach to Computer Networking: ATM Networks, the Internet, and the Telephone Network*. Addison-Wesley Professional, May 1997. ISBN: 978-0-201-63442-6 -1. Cisco, “[Integrated Services Digital Network](https://web.archive.org/web/20181229220921/http://docwiki.cisco.com/wiki/Integrated_Services_Digital_Network),” *docwiki.cisco.com*. -1. Othmar Kyas: *ATM Networks*. International Thomson Publishing, 1995. ISBN: 978-1-850-32128-6 -1. “[InfiniBand FAQ](http://www.mellanox.com/related-docs/whitepapers/InfiniBandFAQ_FQ_100.pdf),” Mellanox Technologies, December 22, 2014. -1. Jose Renato Santos, Yoshio Turner, and G. (John) Janakiraman: “[End-to-End Congestion Control for InfiniBand](http://www.hpl.hp.com/techreports/2002/HPL-2002-359.pdf),” at *22nd Annual Joint Conference of the IEEE Computer and Communications Societies* (INFOCOM), April 2003. 
Also published by HP Laboratories Palo Alto, Tech Report HPL-2002-359. [doi:10.1109/INFCOM.2003.1208949](http://dx.doi.org/10.1109/INFCOM.2003.1208949) -1. Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley: “[The NTP FAQ and HOWTO](http://www.ntp.org/ntpfaq/NTP-a-faq.htm),” *ntp.org*, November 2006. -1. John Graham-Cumming: “[How and why the leap second affected Cloudflare DNS](https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/),” *blog.cloudflare.com*, January 1, 2017. -1. David Holmes: “[Inside the Hotspot VM: Clocks, Timers and Scheduling Events – Part I – Windows](https://web.archive.org/web/20160308031939/https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks),” *blogs.oracle.com*, October 2, 2006. -1. Steve Loughran: “[Time on Multi-Core, Multi-Socket Servers](http://steveloughran.blogspot.co.uk/2015/09/time-on-multi-core-multi-socket-servers.html),” *steveloughran.blogspot.co.uk*, September 17, 2015. -1. James C. Corbett, Jeffrey Dean, Michael Epstein, et al.: “[Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/),” at *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), October 2012. -1. M. Caporaloni and R. Ambrosini: “[How Closely Can a Personal Computer Clock Track the UTC Timescale Via the Internet?](https://iopscience.iop.org/0143-0807/23/4/103/),” *European Journal of Physics*, volume 23, number 4, pages L17–L21, June 2012. [doi:10.1088/0143-0807/23/4/103](http://dx.doi.org/10.1088/0143-0807/23/4/103) -1. Nelson Minar: “[A Survey of the NTP Network](http://alumni.media.mit.edu/~nelson/research/ntp-survey99/),” *alumni.media.mit.edu*, December 1999. -1. Viliam Holub: “[Synchronizing Clocks in a Cassandra Cluster Pt. 1 – The Problem](https://blog.rapid7.com/2014/03/14/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/),” *blog.rapid7.com*, March 14, 2014. -1. 
Poul-Henning Kamp: “[The One-Second War (What Time Will You Die?)](http://queue.acm.org/detail.cfm?id=1967009),” *ACM Queue*, volume 9, number 4, pages 44–48, April 2011. [doi:10.1145/1966989.1967009](http://dx.doi.org/10.1145/1966989.1967009) -1. Nelson Minar: “[Leap Second Crashes Half the Internet](http://www.somebits.com/weblog/tech/bad/leap-second-2012.html),” *somebits.com*, July 3, 2012. -1. Christopher Pascoe: “[Time, Technology and Leaping Seconds](http://googleblog.blogspot.co.uk/2011/09/time-technology-and-leaping-seconds.html),” *googleblog.blogspot.co.uk*, September 15, 2011. -1. Mingxue Zhao and Jeff Barr: “[Look Before You Leap – The Coming Leap Second and AWS](https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/),” *aws.amazon.com*, May 18, 2015. -1. Darryl Veitch and Kanthaiah Vijayalayan: “[Network Timing and the 2015 Leap Second](https://tklab.feit.uts.edu.au/~darryl/Publications/LeapSecond_camera.pdf),” at *17th International Conference on Passive and Active Measurement* (PAM), April 2016. [doi:10.1007/978-3-319-30505-9_29](http://dx.doi.org/10.1007/978-3-319-30505-9_29) -1. “[Timekeeping in VMware Virtual Machines](https://www.vmware.com/content/dam/digitalmarketing/vmware/en/pdf/techpaper/Timekeeping-In-VirtualMachines.pdf),” Information Guide, VMware, Inc., December 2011. -1. “[MiFID II / MiFIR: Regulatory Technical and Implementing Standards – Annex I (Draft)](https://www.esma.europa.eu/sites/default/files/library/2015/11/2015-esma-1464_annex_i_-_draft_rts_and_its_on_mifid_ii_and_mifir.pdf),” European Securities and Markets Authority, Report ESMA/2015/1464, September 2015. -1. Luke Bigum: “[Solving MiFID II Clock Synchronisation With Minimum Spend (Part 1)](https://web.archive.org/web/20170704030310/https://www.lmax.com/blog/staff-blogs/2015/11/27/solving-mifid-ii-clock-synchronisation-minimum-spend-part-1/),” *lmax.com*, November 27, 2015. -1. 
Kyle Kingsbury: “[Call Me Maybe: Cassandra](https://aphyr.com/posts/294-call-me-maybe-cassandra/),” *aphyr.com*, September 24, 2013. -1. John Daily: “[Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems](https://riak.com/clocks-are-bad-or-welcome-to-distributed-systems/),” *riak.com*, November 12, 2013. -1. Kyle Kingsbury: “[The Trouble with Timestamps](https://aphyr.com/posts/299-the-trouble-with-timestamps),” *aphyr.com*, October 12, 2013. -1. Leslie Lamport: “[Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/),” *Communications of the ACM*, volume 21, number 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](http://dx.doi.org/10.1145/359545.359563) -1. Sandeep Kulkarni, Murat Demirbas, Deepak Madeppa, et al.: “[Logical Physical Clocks and Consistent Snapshots in Globally Distributed Databases](http://www.cse.buffalo.edu/tech-reports/2014-04.pdf),” State University of New York at Buffalo, Computer Science and Engineering Technical Report 2014-04, May 2014. -1. Justin Sheehy: “[There Is No Now: Problems With Simultaneity in Distributed Systems](https://queue.acm.org/detail.cfm?id=2745385),” *ACM Queue*, volume 13, number 3, pages 36–41, March 2015. [doi:10.1145/2733108](http://dx.doi.org/10.1145/2733108) -1. Murat Demirbas: “[Spanner: Google's Globally-Distributed Database](http://muratbuffalo.blogspot.co.uk/2013/07/spanner-googles-globally-distributed_4.html),” *muratbuffalo.blogspot.co.uk*, July 4, 2013. -1. Dahlia Malkhi and Jean-Philippe Martin: “[Spanner's Concurrency Control](http://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf),” *ACM SIGACT News*, volume 44, number 3, pages 73–77, September 2013. [doi:10.1145/2527748.2527767](http://dx.doi.org/10.1145/2527748.2527767) -1. 
Manuel Bravo, Nuno Diegues, Jingna Zeng, et al.: “[On the Use of Clocks to Enforce Consistency in the Cloud](http://sites.computer.org/debull/A15mar/p18.pdf),” *IEEE Data Engineering Bulletin*, volume 38, number 1, pages 18–31, March 2015. -1. Spencer Kimball: “[Living Without Atomic Clocks](http://www.cockroachlabs.com/blog/living-without-atomic-clocks/),” *cockroachlabs.com*, February 17, 2016. -1. Cary G. Gray and David R. Cheriton: “[Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency](https://web.archive.org/web/20230325205928/http://web.stanford.edu/class/cs240/readings/89-leases.pdf),” at *12th ACM Symposium on Operating Systems Principles* (SOSP), December 1989. [doi:10.1145/74850.74870](http://dx.doi.org/10.1145/74850.74870) -1. Todd Lipcon: “[Avoiding Full GCs in Apache HBase with MemStore-Local Allocation Buffers: Part 1](https://web.archive.org/web/20121101040711/http://blog.cloudera.com/blog/2011/02/avoiding-full-gcs-in-hbase-with-memstore-local-allocation-buffers-part-1/),” *blog.cloudera.com*, February 24, 2011. -1. Martin Thompson: “[Java Garbage Collection Distilled](http://mechanical-sympathy.blogspot.co.uk/2013/07/java-garbage-collection-distilled.html),” *mechanical-sympathy.blogspot.co.uk*, July 16, 2013. -1. Alexey Ragozin: “[How to Tame Java GC Pauses? Surviving 16GiB Heap and Greater](https://dzone.com/articles/how-tame-java-gc-pauses),” *dzone.com*, June 28, 2011. -1. Christopher Clark, Keir Fraser, Steven Hand, et al.: “[Live Migration of Virtual Machines](http://www.cl.cam.ac.uk/research/srg/netos/papers/2005-nsdi-migration.pdf),” at *2nd USENIX Symposium on Symposium on Networked Systems Design & Implementation* (NSDI), May 2005. -1. Mike Shaver: “[fsyncers and Curveballs](https://web.archive.org/web/20220107141023/http://shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/),” *shaver.off.net*, May 25, 2008. -1. 
Zhenyun Zhuang and Cuong Tran: “[Eliminating Large JVM GC Pauses Caused by Background IO Traffic](https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic),” *engineering.linkedin.com*, February 10, 2016. -1. David Terei and Amit Levy: “[Blade: A Data Center Garbage Collector](http://arxiv.org/pdf/1504.02578.pdf),” arXiv:1504.02578, April 13, 2015. -1. Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz: “[Trash Day: Coordinating Garbage Collection in Distributed Systems](https://timharris.uk/papers/2015-hotos.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015. -1. “[Predictable Low Latency](http://cdn2.hubspot.net/hubfs/1624455/Website_2016/content/White%20papers/Cinnober%20on%20GC%20pause%20free%20Java%20applications.pdf),” Cinnober Financial Technology AB, *cinnober.com*, November 24, 2013. -1. Martin Fowler: “[The LMAX Architecture](http://martinfowler.com/articles/lmax.html),” *martinfowler.com*, July 12, 2011. -1. Flavio P. Junqueira and Benjamin Reed: *ZooKeeper: Distributed Process Coordination*. O'Reilly Media, 2013. ISBN: 978-1-449-36130-3 -1. Enis Söztutar: “[HBase and HDFS: Understanding Filesystem Usage in HBase](http://www.slideshare.net/enissoz/hbase-and-hdfs-understanding-filesystem-usage),” at *HBaseCon*, June 2013. -1. Caitie McCaffrey: “[Clients Are Jerks: AKA How Halo 4 DoSed the Services at Launch & How We Survived](https://web.archive.org/web/20230128065851/http://caitiem.com/2015/06/23/clients-are-jerks-aka-how-halo-4-dosed-the-services-at-launch-how-we-survived/),” *caitiem.com*, June 23, 2015. -1. Leslie Lamport, Robert Shostak, and Marshall Pease: “[The Byzantine Generals Problem](https://www.microsoft.com/en-us/research/publication/byzantine-generals-problem/),” *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 4, number 3, pages 382–401, July 1982. 
[doi:10.1145/357172.357176](http://dx.doi.org/10.1145/357172.357176) -1. Jim N. Gray: “[Notes on Data Base Operating Systems](http://jimgray.azurewebsites.net/papers/dbos.pdf),” in *Operating Systems: An Advanced Course*, Lecture Notes in Computer Science, volume 60, edited by R. Bayer, R. M. Graham, and G. Seegmüller, pages 393–481, Springer-Verlag, 1978. ISBN: 978-3-540-08755-7 -1. Brian Palmer: “[How Complicated Was the Byzantine Empire?](http://www.slate.com/articles/news_and_politics/explainer/2011/10/the_byzantine_tax_code_how_complicated_was_byzantium_anyway_.html),” *slate.com*, October 20, 2011. -1. Leslie Lamport: “[My Writings](http://lamport.azurewebsites.net/pubs/pubs.html),” *lamport.azurewebsites.net*, December 16, 2014. This page can be found by searching the web for the 23-character string obtained by removing the hyphens from the string `allla-mport-spubso-ntheweb`. -1. John Rushby: “[Bus Architectures for Safety-Critical Embedded Systems](http://www.csl.sri.com/papers/emsoft01/emsoft01.pdf),” at *1st International Workshop on Embedded Software* (EMSOFT), October 2001. -1. Jake Edge: “[ELC: SpaceX Lessons Learned](http://lwn.net/Articles/540368/),” *lwn.net*, March 6, 2013. -1. Andrew Miller and Joseph J. LaViola, Jr.: “[Anonymous Byzantine Consensus from Moderately-Hard Puzzles: A Model for Bitcoin](http://nakamotoinstitute.org/static/docs/anonymous-byzantine-consensus.pdf),” University of Central Florida, Technical Report CS-TR-14-01, April 2014. -1. James Mickens: “[The Saddest Moment](https://www.usenix.org/system/files/login-logout_1305_mickens.pdf),” *USENIX ;login: logout*, May 2013. -1. Evan Gilman: “[The Discovery of Apache ZooKeeper’s Poison Packet](http://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/),” *pagerduty.com*, May 7, 2015. -1. 
Jonathan Stone and Craig Partridge: “[When the CRC and TCP Checksum Disagree](https://web.archive.org/web/20220818235232/https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.27.7611&rep=rep1&type=pdf),” at *ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication* (SIGCOMM), August 2000. [doi:10.1145/347059.347561](http://dx.doi.org/10.1145/347059.347561) -1. Evan Jones: “[How Both TCP and Ethernet Checksums Fail](http://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html),” *evanjones.ca*, October 5, 2015. -1. Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: “[Consensus in the Presence of Partial Synchrony](https://dl.acm.org/doi/10.1145/42282.42283),” *Journal of the ACM*, volume 35, number 2, pages 288–323, April 1988. [doi:10.1145/42282.42283](http://dx.doi.org/10.1145/42282.42283) -1. Peter Bailis and Ali Ghodsi: “[Eventual Consistency Today: Limitations, Extensions, and Beyond](http://queue.acm.org/detail.cfm?id=2462076),” *ACM Queue*, volume 11, number 3, pages 55-63, March 2013. [doi:10.1145/2460276.2462076](http://dx.doi.org/10.1145/2460276.2462076) -1. Bowen Alpern and Fred B. Schneider: “[Defining Liveness](https://www.cs.cornell.edu/fbs/publications/DefLiveness.pdf),” *Information Processing Letters*, volume 21, number 4, pages 181–185, October 1985. [doi:10.1016/0020-0190(85)90056-0](http://dx.doi.org/10.1016/0020-0190(85)90056-0) -1. Flavio P. Junqueira: “[Dude, Where’s My Metadata?](https://web.archive.org/web/20230604215314/https://fpj.systems/2015/05/28/dude-wheres-my-metadata/),” *fpj.me*, May 28, 2015. -1. Scott Sanders: “[January 28th Incident Report](https://github.com/blog/2106-january-28th-incident-report),” *github.com*, February 3, 2016. -1. Jay Kreps: “[A Few Notes on Kafka and Jepsen](http://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen),” *blog.empathybox.com*, September 25, 2013. -1. 
Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, et al.: “[Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems](http://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf),” at *4th ACM Symposium on Cloud Computing* (SoCC), October 2013. [doi:10.1145/2523616.2523627](http://dx.doi.org/10.1145/2523616.2523627) -1. Frank McSherry, Michael Isard, and Derek G. Murray: “[Scalability! But at What COST?](http://www.frankmcsherry.org/assets/COST.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015. +The ability to abort a transaction on error and have all writes from that transaction discarded is +the defining feature of ACID atomicity. Perhaps *abortability* would have been a better term than +*atomicity*, but we will stick with *atomicity* since that’s the usual word. + +### Consistency + +The word *consistency* is terribly overloaded: + +* In [Chapter 6](/en/ch6#ch_replication) we discussed *replica consistency* and the issue of *eventual consistency* + that arises in asynchronously replicated systems (see [“Problems with Replication Lag”](/en/ch6#sec_replication_lag)). +* A *consistent snapshot* of a database, e.g. for a backup, is a snapshot of the entire database as + it existed at one moment in time. More precisely, it is consistent with the happens-before + relation (see [“The “happens-before” relation and concurrency”](/en/ch6#sec_replication_happens_before)): that is, if the snapshot contains a value that + was written at a particular time, then it also reflects all the writes that happened before that + value was written. +* *Consistent hashing* is an approach to sharding that some systems use for rebalancing (see + [“Consistent hashing”](/en/ch7#sec_sharding_consistent_hashing)). +* In the CAP theorem (see [Chapter 10](/en/ch10#ch_consistency)), the word *consistency* is used to mean + *linearizability* (see [“Linearizability”](/en/ch10#sec_consistency_linearizability)). 
+* In the context of ACID, *consistency* refers to an application-specific notion of the database + being in a “good state.” + +It’s unfortunate that the same word is used with at least five different meanings. + +The idea of ACID consistency is that you have certain statements about your data (*invariants*) that +must always be true—for example, in an accounting system, credits and debits across all accounts +must always be balanced. If a transaction starts with a database that is valid according to these +invariants, and any writes during the transaction preserve the validity, then you can be sure that +the invariants are always satisfied. (An invariant may be temporarily violated during transaction +execution, but it should be satisfied again at transaction commit.) + +If you want the database to enforce your invariants, you need to declare them as *constraints* as +part of the schema. For example, foreign key constraints, uniqueness constraints, or check +constraints (which restrict the values that can appear in an individual row) are often used to +model specific types of invariants. More complex consistency requirements can sometimes be modeled +using triggers or materialized views [[12](/en/ch8#Andrews2004)]. + +However, complex invariants can be difficult or impossible to model using the constraints that +databases usually provide. In that case, it’s the application’s responsibility to define its +transactions correctly so that they preserve consistency. If you write bad data that violates your +invariants, but you haven’t declared those invariants, the database can’t stop you. As such, the C +in ACID often depends on how the application uses the database, and it’s not a property of the +database alone. + +### Isolation + +Most databases are accessed by several clients at the same time. 
That is no problem if they are +reading and writing different parts of the database, but if they are accessing the same database +records, you can run into concurrency problems (race conditions). + +[Figure 8-1](/en/ch8#fig_transactions_increment) is a simple example of this kind of problem. Say you have two clients +simultaneously incrementing a counter that is stored in a database. Each client needs to read the +current value, add 1, and write the new value back (assuming there is no increment operation built +into the database). In [Figure 8-1](/en/ch8#fig_transactions_increment) the counter should have increased from 42 to +44, because two increments happened, but it actually only went to 43 because of the race condition. + +![ddia 0801](/fig/ddia_0801.png) + +###### Figure 8-1. A race condition between two clients concurrently incrementing a counter. + +*Isolation* in the sense of ACID means that concurrently executing transactions are isolated from +each other: they cannot step on each other’s toes. The classic database textbooks formalize +isolation as *serializability*, which means that each transaction can pretend that it is the only +transaction running on the entire database. The database ensures that when the transactions have +committed, the result is the same as if they had run *serially* (one after another), even though in +reality they may have run concurrently +[[13](/en/ch8#Bernstein1987_ch8)]. + +However, serializability has a performance cost. In practice, many databases use forms of isolation +that are weaker than serializability: that is, they allow concurrent transactions to interfere with +each other in limited ways. Some popular databases, such as Oracle, don’t even implement it (Oracle +has an isolation level called “serializable,” but it actually implements *snapshot isolation*, which +is a weaker guarantee than serializability [[10](/en/ch8#Bailis2013HAT), +[14](/en/ch8#Fekete2005)]). 
+This means that some kinds of race conditions can still occur. We will explore snapshot isolation +and other forms of isolation in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels). + +### Durability + +The purpose of a database system is to provide a safe place where data can be stored without fear of +losing it. *Durability* is the promise that once a transaction has committed successfully, any data it +has written will not be forgotten, even if there is a hardware fault or the database crashes. + +In a single-node database, durability typically means that the data has been written to nonvolatile +storage such as a hard drive or SSD. Regular file writes are usually buffered in memory before being +sent to the disk sometime later, which means they would be lost in a sudden power failure; +many databases therefore use the `fsync()` system call to ensure the data really has been written to +disk. Databases usually also have a write-ahead log or similar (see [“Making B-trees reliable”](/en/ch4#sec_storage_btree_wal)), +which allows them to recover if a crash occurs partway through a write. + +In a replicated database, durability may mean that the data has been successfully copied to some +number of nodes. In order to provide a durability guarantee, a database must wait until these writes +or replications are complete before reporting a transaction as successfully committed. However, +as discussed in [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability), perfect durability does not exist: if all your +hard disks and all your backups are destroyed at the same time, there’s obviously nothing your +database can do to save you. + +#### Replication and Durability + +Historically, durability meant writing to an archive tape. Then it was understood as writing to a disk +or SSD. More recently, it has been adapted to mean replication. Which implementation is better?
+ +The truth is, nothing is perfect: + +* If you write to disk and the machine dies, even though your data isn’t lost, it is inaccessible + until you either fix the machine or transfer the disk to another machine. Replicated systems can + remain available. +* A correlated fault—a power outage or a bug that crashes every node on a particular input—​can + knock out all replicas at once (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability)), losing any data that is + only in memory. Writing to disk is therefore still relevant for replicated databases. +* In an asynchronously replicated system, recent writes may be lost when the leader becomes + unavailable (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)). +* When the power is suddenly cut, SSDs in particular have been shown to sometimes violate the + guarantees they are supposed to provide: even `fsync` isn’t guaranteed to work correctly + [[15](/en/ch8#Zheng2013)]. + Disk firmware can have bugs, just like any other kind of software + [[16](/en/ch8#Denness2015), + [17](/en/ch8#Surak2015)], + e.g. causing drives to fail after exactly 32,768 hours of operation + [[18](/en/ch8#HPE2019_ch8)]. + And `fsync` is hard to use; even PostgreSQL used it incorrectly for over 20 years + [[19](/en/ch8#Ringer2018), + [20](/en/ch8#Rebello2020), + [21](/en/ch8#Pillai2015)]. +* Subtle interactions between the storage engine and the filesystem implementation can lead to bugs + that are hard to track down, and may cause files on disk to be corrupted after a crash + [[22](/en/ch8#Pillai2014), + [23](/en/ch8#Siebenmann2016)]. + Filesystem errors on one replica can sometimes spread to other replicas as well + [[24](/en/ch8#Ganesan2017)]. +* Data on disk can gradually become corrupted without this being detected + [[25](/en/ch8#Bairavasundaram2008)]. + If data has been corrupted for some time, replicas and recent backups may also be corrupted. 
In + this case, you will need to try to restore the data from a historical backup. +* One study of SSDs found that between 30% and 80% of drives develop at least one bad block during + the first four years of operation, and only some of these can be corrected by the firmware + [[26](/en/ch8#Schroeder2016_ch8)]. + Magnetic hard drives have a lower rate of bad sectors, but a higher rate of complete failure than + SSDs. +* When a worn-out SSD (that has gone through many write/erase cycles) is disconnected from power, + it can start losing data within a timescale of weeks to months, depending on the temperature + [[27](/en/ch8#Allison2015)]. + This is less of a problem for drives with lower wear levels + [[28](/en/ch8#MahUng2015)]. + +In practice, there is no one technique that can provide absolute guarantees. There are only various +risk-reduction techniques, including writing to disk, replicating to remote machines, and +backups—​and they can and should be used together. As always, it’s wise to take any theoretical +“guarantees” with a healthy grain of salt. + +## Single-Object and Multi-Object Operations + +To recap, in ACID, atomicity and isolation describe what the database should do if a client makes +several writes within the same transaction: + +Atomicity +: If an error occurs halfway through a sequence of writes, the transaction should be aborted, and + the writes made up to that point should be discarded. In other words, the database saves you from + having to worry about partial failure, by giving an all-or-nothing guarantee. + +Isolation +: Concurrently running transactions shouldn’t interfere with each other. For example, if one + transaction makes several writes, then another transaction should see either all or none of those + writes, but not some subset. + +These definitions assume that you want to modify several objects (rows, documents, records) at once. 
+Such *multi-object transactions* are often needed if several pieces of data need to be kept in sync. +[Figure 8-2](/en/ch8#fig_transactions_read_uncommitted) shows an example from an email application. To display the +number of unread messages for a user, you could query something like: + +``` +SELECT COUNT(*) FROM emails WHERE recipient_id = 2 AND unread_flag = true +``` + +![ddia 0802](/fig/ddia_0802.png) + +###### Figure 8-2. Violating isolation: one transaction reads another transaction’s uncommitted writes (a “dirty read”). + +However, you might find this query to be too slow if there are many emails, and decide to store the +number of unread messages in a separate field (a kind of denormalization, which we discuss in +[“Normalization, Denormalization, and Joins”](/en/ch3#sec_datamodels_normalization)). Now, whenever a new message comes in, you have to increment the +unread counter as well, and whenever a message is marked as read, you also have to decrement the +unread counter. + +In [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), user 2 experiences an anomaly: the mailbox listing shows +an unread message, but the counter shows zero unread messages because the counter increment has not +yet happened. (If an incorrect counter in an email application seems too insignificant, think of a +customer account balance instead of an unread counter, and a payment transaction instead of an +email.) Isolation would have prevented this issue by ensuring that user 2 sees either both the +inserted email and the updated counter, or neither, but not an inconsistent halfway point. + +[Figure 8-3](/en/ch8#fig_transactions_atomicity) illustrates the need for atomicity: if an error occurs somewhere +over the course of the transaction, the contents of the mailbox and the unread counter might become out +of sync. In an atomic transaction, if the update to the counter fails, the transaction is aborted +and the inserted email is rolled back. 
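
The email example can be sketched with a small, self-contained program. This is only an illustration using Python's built-in `sqlite3` module and an in-memory database. The `emails` and `mailboxes` tables, the `deliver_email` helper, and the simulated failure between the two writes are assumptions made for this sketch, not something prescribed by any particular database.

```python
import sqlite3


def deliver_email(conn, recipient_id, body, fail_before_counter=False):
    """Insert an email and increment the unread counter as one atomic unit."""
    try:
        # Used as a context manager, the connection commits if the block
        # succeeds and rolls back if an exception escapes from it.
        with conn:
            conn.execute(
                "INSERT INTO emails (recipient_id, body, unread_flag) VALUES (?, ?, 1)",
                (recipient_id, body),
            )
            if fail_before_counter:
                # Hypothetical fault injected between the two writes.
                raise RuntimeError("simulated error between the two writes")
            conn.execute(
                "UPDATE mailboxes SET unread = unread + 1 WHERE recipient_id = ?",
                (recipient_id,),
            )
        return True
    except RuntimeError:
        return False


def unread_state(conn, recipient_id):
    """Return (number of unread emails, value of the denormalized counter)."""
    emails = conn.execute(
        "SELECT COUNT(*) FROM emails WHERE recipient_id = ? AND unread_flag = 1",
        (recipient_id,),
    ).fetchone()[0]
    counter = conn.execute(
        "SELECT unread FROM mailboxes WHERE recipient_id = ?", (recipient_id,)
    ).fetchone()[0]
    return emails, counter


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails (id INTEGER PRIMARY KEY, recipient_id INTEGER, "
    "body TEXT, unread_flag INTEGER)"
)
conn.execute("CREATE TABLE mailboxes (recipient_id INTEGER PRIMARY KEY, unread INTEGER)")
conn.execute("INSERT INTO mailboxes VALUES (2, 0)")
conn.commit()

deliver_email(conn, 2, "Hello")                            # both writes commit
deliver_email(conn, 2, "Lost?", fail_before_counter=True)  # both writes roll back
print(unread_state(conn, 2))  # prints (1, 1)
```

Because the failed delivery is rolled back as a whole, the mailbox contents and the counter still agree afterward. Without the transaction, the second call would have left an email in the table with no matching counter increment.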
+ +![ddia 0803](/fig/ddia_0803.png) + +###### Figure 8-3. Atomicity ensures that if an error occurs any prior writes from that transaction are undone, to avoid an inconsistent state. + +Multi-object transactions require some way of determining which read and write operations belong to +the same transaction. In relational databases, that is typically done based on the client’s TCP +connection to the database server: on any particular connection, everything between a `BEGIN +TRANSACTION` and a `COMMIT` statement is considered to be part of the same transaction. If the TCP +connection is interrupted, the transaction must be aborted. + +On the other hand, many nonrelational databases don’t have such a way of grouping operations +together. Even if there is a multi-object API (for example, a key-value store may have a *multi-put* +operation that updates several keys in one operation), that doesn’t necessarily mean it has +transaction semantics: the command may succeed for some keys and fail for others, leaving the +database in a partially updated state. + +### Single-object writes + +Atomicity and isolation also apply when a single object is being changed. For example, imagine you +are writing a 20 KB JSON document to a database: + +* If the network connection is interrupted after the first 10 KB have been sent, does the + database store that unparseable 10 KB fragment of JSON? +* If the power fails while the database is in the middle of overwriting the previous value on disk, + do you end up with the old and new values spliced together? +* If another client reads that document while the write is in progress, will it see a partially + updated value? + +Those issues would be incredibly confusing, so storage engines almost universally aim to provide +atomicity and isolation on the level of a single object (such as a key-value pair) on one node. 
+Atomicity can be implemented using a log for crash recovery (see [“Making B-trees reliable”](/en/ch4#sec_storage_btree_wal)), and +isolation can be implemented using a lock on each object (allowing only one thread to access an +object at any one time). + +Some databases also provide more complex atomic operations, such as an increment operation, which +removes the need for a read-modify-write cycle like that in [Figure 8-1](/en/ch8#fig_transactions_increment). +Similarly popular is a *conditional write* operation, which allows a write to happen only if the value +has not been concurrently changed by someone else (see [“Conditional writes (compare-and-set)”](/en/ch8#sec_transactions_compare_and_set)), +similarly to a compare-and-set or compare-and-swap (CAS) operation in shared-memory concurrency. + +###### Note + +Strictly speaking, the term *atomic increment* uses the word *atomic* in the sense of multi-threaded +programming. In the context of ACID, it should actually be called an *isolated* or *serializable* +increment, but that’s not the usual term. + +These single-object operations are useful, as they can prevent lost updates when several clients try +to write to the same object concurrently (see [“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update)). However, they are +not transactions in the usual sense of the word. For example, the “lightweight transactions” feature +of Cassandra and ScyllaDB, and Aerospike’s “strong consistency” mode offer linearizable (see +[“Linearizability”](/en/ch10#sec_consistency_linearizability)) reads and conditional writes on a single object, but no +guarantees across multiple objects. + +### The need for multi-object transactions + +Do we need multi-object transactions at all? Would it be possible to implement any application with +only a key-value data model and single-object operations? + +There are some use cases in which single-object inserts, updates, and deletes are sufficient. 
+However, in many other cases writes to several different objects need to be coordinated: + +* In a relational data model, a row in one table often has a foreign key reference to a row in + another table. Similarly, in a graph-like data model, a vertex has edges to other vertices. + Multi-object transactions allow you to ensure that these references remain valid: when inserting + several records that refer to one another, the foreign keys have to be correct and up to date, + or the data becomes nonsensical. +* In a document data model, the fields that need to be updated together are often within the same + document, which is treated as a single object—no multi-object transactions are needed when + updating a single document. However, document databases lacking join functionality also encourage + denormalization (see [“When to Use Which Model”](/en/ch3#sec_datamodels_document_summary)). When denormalized information needs to + be updated, like in the example of [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), you need to update + several documents in one go. Transactions are very useful in this situation to prevent + denormalized data from going out of sync. +* In databases with secondary indexes (almost everything except pure key-value stores), the indexes + also need to be updated every time you change a value. These indexes are different database + objects from a transaction point of view: for example, without transaction isolation, it’s + possible for a record to appear in one index but not another, because the update to the second + index hasn’t happened yet (see [“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes)). + +Such applications can still be implemented without transactions. However, error handling becomes +much more complicated without atomicity, and the lack of isolation can cause concurrency problems. 
+We will discuss those in [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels), and explore alternative approaches +in [Link to Come]. + +### Handling errors and aborts + +A key feature of a transaction is that it can be aborted and safely retried if an error occurred. +ACID databases are based on this philosophy: if the database is in danger of violating its guarantee +of atomicity, isolation, or durability, it would rather abandon the transaction entirely than allow +it to remain half-finished. + +Not all systems follow that philosophy, though. In particular, datastores with leaderless +replication (see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)) work much more on a “best effort” basis, which +could be summarized as “the database will do as much as it can, and if it runs into an error, it +won’t undo something it has already done”—so it’s the application’s responsibility to recover from +errors. + +Errors will inevitably happen, but many software developers prefer to think only about the happy +path rather than the intricacies of error handling. For example, popular object-relational mapping +(ORM) frameworks such as Rails’s ActiveRecord and Django don’t retry aborted transactions—the +error usually results in an exception bubbling up the stack, so any user input is thrown away and +the user gets an error message. This is a shame, because the whole point of aborts is to enable safe +retries. + +Although retrying an aborted transaction is a simple and effective error handling mechanism, it +isn’t perfect: + +* If the transaction actually succeeded, but the network was interrupted while the server tried to + acknowledge the successful commit to the client (so it timed out from the client’s point of view), + then retrying the transaction causes it to be performed twice—unless you have an additional + application-level deduplication mechanism in place. 
+* If the error is due to overload or high contention between concurrent transactions, retrying the + transaction will make the problem worse, not better. To avoid such feedback cycles, you can limit + the number of retries, use exponential backoff, and handle overload-related errors differently + from other errors (see [“When an overloaded system won’t recover”](/en/ch2#sidebar_metastable)). +* It is only worth retrying after transient errors (for example due to deadlock, isolation + violation, temporary network interruptions, and failover); after a permanent error (e.g., + constraint violation) a retry would be pointless. +* If the transaction also has side effects outside of the database, those side effects may happen + even if the transaction is aborted. For example, if you’re sending an email, you wouldn’t want to + send the email again every time you retry the transaction. If you want to make sure that several + different systems either commit or abort together, two-phase commit can help (we will discuss this + in [“Two-Phase Commit (2PC)”](/en/ch8#sec_transactions_2pc)). +* If the client process crashes while retrying, any data it was trying to write to the database is lost. + +# Weak Isolation Levels + +If two transactions don’t access the same data, or if both are read-only, they can safely be run in +parallel, because neither depends on the other. Concurrency issues (race conditions) only come into +play when one transaction reads data that is concurrently modified by another transaction, or when +the two transactions try to modify the same data. + +Concurrency bugs are hard to find by testing, because such bugs are only triggered when you get +unlucky with the timing. Such timing issues might occur very rarely, and are usually difficult to +reproduce. Concurrency is also very difficult to reason about, especially in a large application +where you don’t necessarily know which other pieces of code are accessing the database. 
Application +development is difficult enough if you just have one user at a time; having many concurrent users +makes it much harder still, because any piece of data could unexpectedly change at any time. + +For that reason, databases have long tried to hide concurrency issues from application developers by +providing *transaction isolation*. In theory, isolation should make your life easier by letting you +pretend that no concurrency is happening: *serializable* isolation means that the database +guarantees that transactions have the same effect as if they ran *serially* (i.e., one at a time, +without any concurrency). + +In practice, isolation is unfortunately not that simple. Serializable isolation has a performance +cost, and many databases don’t want to pay that price +[[10](/en/ch8#Bailis2013HAT)]. It’s therefore common for systems to use +weaker levels of isolation, which protect against *some* concurrency issues, but not all. Those +levels of isolation are much harder to understand, and they can lead to subtle bugs, but they are +nevertheless used in practice +[[29](/en/ch8#Kleppmann2014)]. + +Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have +caused substantial loss of money +[[30](/en/ch8#Warszawski2017), +[31](/en/ch8#DAgosta2014), +[32](/en/ch8#bitcointhief2014)], +led to investigation by financial auditors +[[33](/en/ch8#Jorwekar2007_ch8)], +and caused customer data to be corrupted [[34](/en/ch8#Melanson2014)]. +A popular comment on revelations of such problems is “Use an ACID database if you’re handling +financial data!”—but that misses the point. Even many popular relational database systems (which +are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these +bugs from occurring. + +###### Note + +Incidentally, much of the banking system relies on text files that are exchanged via secure FTP +[[35](/en/ch8#Kim2014ACH)]. 
+In this context, having an audit trail and some human-level fraud prevention measures is actually +more important than ACID properties. + +Those examples also highlight an important point: even if concurrency issues are rare in normal +operation, you have to consider the possibility that an attacker deliberately sends a burst of +highly concurrent requests to your API in an attempt to deliberately exploit concurrency bugs +[[30](/en/ch8#Warszawski2017)]. Therefore, in order to build +applications that are reliable and secure, you have to ensure that such bugs are systematically +prevented. + +In this section we will look at several weak (nonserializable) isolation levels that are used in +practice, and discuss in detail what kinds of race conditions can and cannot occur, so that you can +decide what level is appropriate to your application. Once we’ve done that, we will discuss +serializability in detail (see [“Serializability”](/en/ch8#sec_transactions_serializability)). Our discussion of isolation +levels will be informal, using examples. If you want rigorous definitions and analyses of their +properties, you can find them in the academic literature +[[36](/en/ch8#Berenson1995), +[37](/en/ch8#Adya1999), +[38](/en/ch8#Bailis2014virtues_ch8), +[39](/en/ch8#Crooks2017)]. + +## Read Committed + +The most basic level of transaction isolation is *read committed*. It makes two guarantees: + +1. When reading from the database, you will only see data that has been committed (no *dirty + reads*). +2. When writing to the database, you will only overwrite data that has been committed (no *dirty + writes*). + +Some databases support an even weaker isolation level called *read uncommitted*. It prevents dirty +writes, but does not prevent dirty reads. Let’s discuss these two guarantees in more detail. + +### No dirty reads + +Imagine a transaction has written some data to the database, but the transaction has not yet committed or aborted. 
+Can another transaction see that uncommitted data? If yes, that is called a +*dirty read* [[3](/en/ch8#Gray1976)]. + +Transactions running at the read committed isolation level must prevent dirty reads. This means that +any writes by a transaction only become visible to others when that transaction commits (and then +all of its writes become visible at once). This is illustrated in +[Figure 8-4](/en/ch8#fig_transactions_read_committed), where user 1 has set *x* = 3, but user 2’s *get x* still +returns the old value, 2, while user 1 has not yet committed. + +![ddia 0804](/fig/ddia_0804.png) + +###### Figure 8-4. No dirty reads: user 2 sees the new value for *x* only after user 1’s transaction has committed. + +There are a few reasons why it’s useful to prevent dirty reads: + +* If a transaction needs to update several rows, a dirty read means that another transaction may + see some of the updates but not others. For example, in [Figure 8-2](/en/ch8#fig_transactions_read_uncommitted), the + user sees the new unread email but not the updated counter. This is a dirty read of the email. + Seeing the database in a partially updated state is confusing to users and may cause other + transactions to take incorrect decisions. +* If a transaction aborts, any writes it has made need to be rolled back (like in + [Figure 8-3](/en/ch8#fig_transactions_atomicity)). If the database allows dirty reads, that means a transaction may + see data that is later rolled back—i.e., which is never actually committed to the database. Any + transaction that read uncommitted data would also need to be aborted, leading to a problem called + *cascading aborts*. + +### No dirty writes + +What happens if two transactions concurrently try to update the same row in a database? We don’t +know in which order the writes will happen, but we normally assume that the later write overwrites +the earlier write. 
+ +However, what happens if the earlier write is part of a transaction that has not yet committed, so +the later write overwrites an uncommitted value? This is called a *dirty write* +[[36](/en/ch8#Berenson1995)]. Transactions running at the read +committed isolation level must prevent dirty writes, usually by delaying the second write until the +first write’s transaction has committed or aborted. + +By preventing dirty writes, this isolation level avoids some kinds of concurrency problems: + +* If transactions update multiple rows, dirty writes can lead to a bad outcome. For example, + consider [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), which illustrates a used car sales website on which + two people, Aaliyah and Bryce, are simultaneously trying to buy the same car. Buying a car requires + two database writes: the listing on the website needs to be updated to reflect the buyer, and the + sales invoice needs to be sent to the buyer. In the case of [Figure 8-5](/en/ch8#fig_transactions_dirty_writes), the + sale is awarded to Bryce (because he performs the winning update to the `listings` table), but the + invoice is sent to Aaliyah (because she performs the winning update to the `invoices` table). Read + committed prevents such mishaps. +* However, read committed does *not* prevent the race condition between two counter increments in + [Figure 8-1](/en/ch8#fig_transactions_increment). In this case, the second write happens after the first transaction + has committed, so it’s not a dirty write. It’s still incorrect, but for a different reason—in + [“Preventing Lost Updates”](/en/ch8#sec_transactions_lost_update) we will discuss how to make such counter increments safe. + +![ddia 0805](/fig/ddia_0805.png) + +###### Figure 8-5. With dirty writes, conflicting writes from different transactions can be mixed up. + +### Implementing read committed + +Read committed is a very popular isolation level. 
It is the default setting in Oracle Database, +PostgreSQL, SQL Server, and many other databases +[[10](/en/ch8#Bailis2013HAT)]. + +Most commonly, databases prevent dirty writes by using row-level locks: when a transaction wants to +modify a particular row (or document or some other object), it must first acquire a lock on that +row. It must then hold that lock until the transaction is committed or aborted. Only one transaction +can hold the lock for any given row; if another transaction wants to write to the same row, it must +wait until the first transaction is committed or aborted before it can acquire the lock and +continue. This locking is done automatically by databases in read committed mode (or stronger +isolation levels). + +How do we prevent dirty reads? One option would be to use the same lock, and to require any +transaction that wants to read a row to briefly acquire the lock and then release it again +immediately after reading. This would ensure that a read couldn’t happen while a row has a +dirty, uncommitted value (because during that time the lock would be held by the transaction that +has made the write). + +However, the approach of requiring read locks does not work well in practice, because one +long-running write transaction can force many other transactions to wait until the long-running +transaction has completed, even if the other transactions only read and do not write anything to the +database. This harms the response time of read-only transactions and is bad for +operability: a slowdown in one part of an application can have a knock-on effect in a completely +different part of the application, due to waiting for locks. + +Nevertheless, locks are used to prevent dirty reads in some databases, such as IBM +Db2 and Microsoft SQL Server in the `read_committed_snapshot=off` setting +[[29](/en/ch8#Kleppmann2014)]. 
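
The row-level write locks described above can be sketched as a toy in-memory store. This is a simplified illustration, not how any real database is implemented. The `LockingStore` and `Transaction` classes, the `timeout` parameter, and the key name are all hypothetical. For brevity, uncommitted writes are buffered inside the transaction and reads return the last committed value, instead of taking a read lock.

```python
import threading


class LockingStore:
    """Toy store with one write lock per row, held until commit or abort."""

    def __init__(self):
        self.committed = {}             # row key -> last committed value
        self._locks = {}                # row key -> threading.Lock
        self._guard = threading.Lock()  # protects the _locks dictionary

    def lock_for(self, key):
        with self._guard:
            return self._locks.setdefault(key, threading.Lock())

    def read(self, key):
        # Simplification: readers only ever see committed values.
        return self.committed.get(key)


class Transaction:
    def __init__(self, store):
        self.store = store
        self.pending = {}  # uncommitted writes made by this transaction
        self.held = []     # row locks held until commit or abort

    def write(self, key, value, timeout=-1):
        if key not in self.pending:
            lock = self.store.lock_for(key)
            # Wait for the current lock holder; give up after `timeout` seconds
            # (the default of -1 waits indefinitely).
            if not lock.acquire(timeout=timeout):
                raise TimeoutError(f"row {key!r} is locked by another transaction")
            self.held.append(lock)
        self.pending[key] = value

    def commit(self):
        self.store.committed.update(self.pending)
        self._release()

    def abort(self):
        self._release()  # pending writes are simply discarded

    def _release(self):
        self.pending.clear()
        for lock in self.held:
            lock.release()
        self.held.clear()


store = LockingStore()
a, b = Transaction(store), Transaction(store)

a.write("listing:1", "Aaliyah")
try:
    b.write("listing:1", "Bryce", timeout=0.05)  # a still holds the row lock
    blocked = False
except TimeoutError:
    blocked = True

a.commit()                                       # releases the lock
b.write("listing:1", "Bryce", timeout=0.05)      # now succeeds
b.commit()
print(blocked, store.read("listing:1"))
```

The second writer can proceed only after the first transaction has committed or aborted, so it never overwrites an uncommitted value. A real database would block the second writer rather than raising after a short timeout; the timeout here only keeps the single-threaded demonstration from hanging.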
+ +A more commonly used approach to preventing dirty reads is the one illustrated in +[Figure 8-4](/en/ch8#fig_transactions_read_committed): for every +row that is written, the database remembers both the old committed value and the new value +set by the transaction that currently holds the write lock. While the transaction is ongoing, any +other transactions that read the row are simply given the old value. Only when the new value is +committed do transactions switch over to reading the new value (see +[“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl) for more detail). + +## Snapshot Isolation and Repeatable Read + +If you look superficially at read committed isolation, you could be forgiven for thinking that it +does everything that a transaction needs to do: it allows aborts (required for atomicity), it +prevents reading the incomplete results of transactions, and it prevents concurrent writes from +getting intermingled. Indeed, those are useful features, and much stronger guarantees than you can +get from a system that has no transactions. + +However, there are still plenty of ways in which you can have concurrency bugs when using this +isolation level. For example, [Figure 8-6](/en/ch8#fig_transactions_item_many_preceders) illustrates a problem that +can occur with read committed. + +![ddia 0806](/fig/ddia_0806.png) + +###### Figure 8-6. Read skew: Aaliyah observes the database in an inconsistent state. + +Say Aaliyah has $1,000 of savings at a bank, split across two accounts with $500 each. Now a +transaction transfers $100 from one of her accounts to the other. If she is unlucky enough to look at her +list of account balances in the same moment as that transaction is being processed, she may see one +account balance at a time before the incoming payment has arrived (with a balance of $500), and the +other account after the outgoing transfer has been made (the new balance being $400). 
To Aaliyah it +now appears as though she only has a total of $900 in her accounts—it seems that $100 has +vanished into thin air. + +This anomaly is called *read skew*, and it is an example of a *nonrepeatable read*: +if Aaliyah were to read the balance of +account 1 again at the end of the transaction, she would see a different value ($600) than she saw +in her previous query. Read skew is considered acceptable under read committed isolation: the +account balances that Aaliyah saw were indeed committed at the time when she read them. + +###### Note + +The term *skew* is unfortunately overloaded: we previously used it in the sense of an *unbalanced +workload with hot spots* (see [“Skewed Workloads and Relieving Hot Spots”](/en/ch7#sec_sharding_skew)), whereas here it means *timing anomaly*. + +In Aaliyah’s case, this is not a lasting problem, because she will most likely see consistent account +balances if she reloads the online banking website a few seconds later. However, some situations +cannot tolerate such temporary inconsistency: + +Backups +: Taking a backup requires making a copy of the entire database, which may take hours on a large + database. During the time that the backup process is running, writes will continue to be made to + the database. Thus, you could end up with some parts of the backup containing an older version of + the data, and other parts containing a newer version. If you need to restore from such a backup, + the inconsistencies (such as disappearing money) become permanent. + +Analytic queries and integrity checks +: Sometimes, you may want to run a query that scans over large parts of the database. Such queries + are common in analytics (see [“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics)), or may be part of a periodic integrity + check that everything is in order (monitoring for data corruption). 
These queries are likely to + return nonsensical results if they observe parts of the database at different points in time. + +*Snapshot isolation* [[36](/en/ch8#Berenson1995)] is the most common +solution to this problem. The idea is that each transaction reads from a *consistent snapshot* of +the database—that is, the transaction sees all the data that was committed in the database at the +start of the transaction. Even if the data is subsequently changed by another transaction, each +transaction sees only the old data from that particular point in time. + +Snapshot isolation is a boon for long-running, read-only queries such as backups and analytics. It +is very hard to reason about the meaning of a query if the data on which it operates is changing at +the same time as the query is executing. When a transaction can see a consistent snapshot of the +database, frozen at a particular point in time, it is much easier to understand. + +Snapshot isolation is a popular feature: variants of it are supported by PostgreSQL, MySQL with the +InnoDB storage engine, Oracle, SQL Server, and others, although the detailed behavior varies from +one system to the next [[29](/en/ch8#Kleppmann2014), +[40](/en/ch8#Momjian2014), +[41](/en/ch8#Alvaro2023)]. +Some databases, such as Oracle, TiDB, and Aurora DSQL, even choose snapshot isolation as their +highest isolation level. + +### Multi-version concurrency control (MVCC) + +Like read committed isolation, implementations of snapshot isolation typically use write locks to +prevent dirty writes (see [“Implementing read committed”](/en/ch8#sec_transactions_read_committed_impl)), which means that a transaction +that makes a write can block the progress of another transaction that writes to the same row. +However, reads do not require any locks. From a performance point of view, a key principle of +snapshot isolation is *readers never block writers, and writers never block readers*. 
This allows a +database to handle long-running read queries on a consistent snapshot at the same time as processing +writes normally, without any lock contention between the two. + +To implement snapshot isolation, databases use a generalization of the mechanism we saw for +preventing dirty reads in [Figure 8-4](/en/ch8#fig_transactions_read_committed). Instead of two versions of each row +(the committed version and the overwritten-but-not-yet-committed version), the database must +potentially keep several different committed versions of a row, because various in-progress +transactions may need to see the state of the database at different points in time. Because it +maintains several versions of a row side by side, this technique is known as *multi-version +concurrency control* (MVCC). + +[Figure 8-7](/en/ch8#fig_transactions_mvcc) illustrates how MVCC-based snapshot isolation is implemented in PostgreSQL +[[40](/en/ch8#Momjian2014), +[42](/en/ch8#Rogov2023), +[43](/en/ch8#Suzuki2017_ch8)] (other implementations are similar). +When a transaction is started, it is given a unique, always-increasing transaction ID (`txid`). +Whenever a transaction writes anything to the database, the data it writes is tagged with the +transaction ID of the writer. (To be precise, transaction IDs in PostgreSQL are 32-bit integers, so +they overflow after approximately 4 billion transactions. The vacuum process performs cleanup to +ensure that overflow does not affect the data.) + +![ddia 0807](/fig/ddia_0807.png) + +###### Figure 8-7. Implementing snapshot isolation using multi-version concurrency control. + +Each row in a table has an `inserted_by` field, containing the ID of the transaction that inserted +this row into the table. Moreover, each row has a `deleted_by` field, which is initially empty.
If a +transaction deletes a row, the row isn’t actually removed from the database, but it is marked for +deletion by setting the `deleted_by` field to the ID of the transaction that requested the deletion. +At some later time, when it is certain that no transaction can any longer access the deleted data, a +garbage collection process in the database removes any rows marked for deletion and frees their +space. + +An update is internally translated into a delete and an insert +[[44](/en/ch8#Alleti2025)]. +For example, in [Figure 8-7](/en/ch8#fig_transactions_mvcc), transaction 13 deducts $100 from account 2, changing the +balance from $500 to $400. The `accounts` table now actually contains two rows for account 2: a row +with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of +$400 which was inserted by transaction 13. + +All of the versions of a row are stored within the same database heap (see +[“Storing values within the index”](/en/ch4#sec_storage_index_heap)), regardless of whether the transactions that wrote them have committed +or not. The versions of the same row form a linked list, going either from newest version to oldest +version or the other way round, so that queries can internally iterate over all versions of a row +[[45](/en/ch8#Pavlo2023), +[46](/en/ch8#Wu2017)]. + +### Visibility rules for observing a consistent snapshot + +When a transaction reads from the database, transaction IDs are used to decide which row versions it +can see and which are invisible. By carefully defining visibility rules, the database can present a +consistent snapshot of the database to the application. This works roughly as follows +[[43](/en/ch8#Suzuki2017_ch8)]: + +1. At the start of each transaction, the database makes a list of all the other transactions that + are in progress (not yet committed or aborted) at that time. Any writes that those + transactions have made are ignored, even if the transactions subsequently commit.
This ensures + that we see a consistent snapshot that is not affected by another transaction committing. +2. Any writes made by transactions with a later transaction ID (i.e., which started after the current + transaction started, and which are therefore not included in the list of in-progress + transactions) are ignored, regardless of whether those transactions have committed. +3. Any writes made by aborted transactions are ignored, regardless of when that abort happened. + This has the advantage that when a transaction aborts, we don’t need to immediately remove the + rows it wrote from storage, since the visibility rule filters them out. The garbage collection + process can remove them later. +4. All other writes are visible to the application’s queries. + +These rules apply to both insertion and deletion of rows. In [Figure 8-7](/en/ch8#fig_transactions_mvcc), when +transaction 12 reads from account 2, it sees a balance of $500 because the deletion of the $500 +balance was made by transaction 13 (according to rule 2, transaction 12 cannot see a deletion made +by transaction 13), and the insertion of the $400 balance is not yet visible (by the same rule). + +Put another way, a row is visible if both of the following conditions are true: + +* At the time when the reader’s transaction started, the transaction that inserted the row had + already committed. +* The row is not marked for deletion, or if it is, the transaction that requested deletion had + not yet committed at the time when the reader’s transaction started. + +A long-running transaction may continue using a snapshot for a long time, continuing to read values +that (from other transactions’ point of view) have long been overwritten or deleted. By never +updating values in place but instead inserting a new version every time a value is changed, the +database can provide a consistent snapshot while incurring only a small overhead. 
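
The visibility rules above can be sketched as a small Python model. This is a simplified illustration, not PostgreSQL's actual implementation: the names `RowVersion`, `Snapshot`, `sees`, and `is_visible` are hypothetical, commit status is modeled as a plain set of transaction IDs, and the sketch additionally assumes that a transaction always sees its own writes. The transaction ID 3 used for the original insertion of the $500 row is an arbitrary choice for the example.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional


@dataclass(frozen=True)
class RowVersion:
    balance: int
    inserted_by: int                  # txid that created this version
    deleted_by: Optional[int] = None  # txid that marked it deleted, if any


@dataclass(frozen=True)
class Snapshot:
    txid: int                    # ID of the reading transaction
    in_progress: FrozenSet[int]  # txids in progress when the reader started
    committed: FrozenSet[int]    # txids known to have committed (toy commit log)

    def sees(self, writer: int) -> bool:
        """Decide whether a write made by `writer` is visible to this reader."""
        if writer == self.txid:
            return True   # assumption: a transaction sees its own writes
        if writer in self.in_progress:
            return False  # rule 1: writer was in progress at snapshot time
        if writer > self.txid:
            return False  # rule 2: writer started after the reader
        if writer not in self.committed:
            return False  # rule 3: writer aborted
        return True       # rule 4: everything else is visible


def is_visible(row: RowVersion, snap: Snapshot) -> bool:
    """A row is visible if its insertion is seen and its deletion (if any) is not."""
    if not snap.sees(row.inserted_by):
        return False
    if row.deleted_by is not None and snap.sees(row.deleted_by):
        return False
    return True


# Account 2 after transaction 13 has changed the balance from 500 to 400.
versions = [
    RowVersion(balance=500, inserted_by=3, deleted_by=13),
    RowVersion(balance=400, inserted_by=13),
]

# Transaction 12 started before transaction 13.
txn12 = Snapshot(txid=12, in_progress=frozenset(), committed=frozenset({3}))
# Transaction 14 started after transaction 13 committed.
txn14 = Snapshot(txid=14, in_progress=frozenset(), committed=frozenset({3, 13}))

seen_by_12 = [v.balance for v in versions if is_visible(v, txn12)]
seen_by_14 = [v.balance for v in versions if is_visible(v, txn14)]
```

In this model transaction 12 continues to see only the old balance, while a transaction that starts after transaction 13 has committed sees only the new one.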
+ +### Indexes and snapshot isolation + +How do indexes work in a multi-version database? The most common approach is that each index entry +points at one of the versions of a row that matches the entry (either the oldest or the newest +version). Each row version may contain a reference to the next-oldest or next-newest version. A +query that uses the index must then iterate over the rows to find one that is visible, and where the +value matches what the query is looking for. When garbage collection removes old row versions that +are no longer visible to any transaction, the corresponding index entries can also be removed. + +Many implementation details affect the performance of multi-version concurrency control +[[45](/en/ch8#Pavlo2023), [46](/en/ch8#Wu2017)]. +For example, PostgreSQL has optimizations for avoiding index updates if different versions of the +same row can fit on the same page [[40](/en/ch8#Momjian2014)]. +Some other databases avoid storing full copies of modified rows, and only store differences between +versions to save space. + +Another approach is used in CouchDB, Datomic, and LMDB. Although they also use B-trees (see +[“B-Trees”](/en/ch4#sec_storage_b_trees)), they use an *immutable* (copy-on-write) variant that does not overwrite +pages of the tree when they are updated, but instead creates a new copy of each modified page. +Parent pages, up to the root of the tree, are copied and updated to point to the new versions of +their child pages. Any pages that are not affected by a write do not need to be copied, and can be +shared with the new tree [[47](/en/ch8#Prokopov2014)]. + +With immutable B-trees, every write transaction (or batch of transactions) creates a new B-tree +root, and a particular root is a consistent snapshot of the database at the point in time when it +was created. There is no need to filter out rows based on transaction IDs because subsequent +writes cannot modify an existing B-tree; they can only create new tree roots. 
This approach also +requires a background process for compaction and garbage collection. + +### Snapshot isolation, repeatable read, and naming confusion + +MVCC is a commonly used implementation technique for databases, and often it is used to implement +snapshot isolation. However, different databases sometimes use different terms to refer to the same +thing: for example, snapshot isolation is called “repeatable read” in PostgreSQL, and “serializable” +in Oracle [[29](/en/ch8#Kleppmann2014)]. Sometimes different systems +use the same term to mean different things: for example, while in PostgreSQL “repeatable read” means +snapshot isolation, in MySQL it means an implementation of MVCC with weaker consistency than +snapshot isolation [[41](/en/ch8#Alvaro2023)]. + +The reason for this naming confusion is that the SQL standard doesn’t have the concept of snapshot +isolation, because the standard is based on System R’s 1975 definition of isolation levels +[[3](/en/ch8#Gray1976)] and snapshot isolation hadn’t yet been +invented then. Instead, it defines repeatable read, which looks superficially similar to snapshot +isolation. PostgreSQL calls its snapshot isolation level “repeatable read” because it meets the +requirements of the standard, and so they can claim standards compliance. + +Unfortunately, the SQL standard’s definition of isolation levels is flawed—it is ambiguous, +imprecise, and not as implementation-independent as a standard should be +[[36](/en/ch8#Berenson1995)]. Even though several databases +implement repeatable read, there are big differences in the guarantees they actually provide, +despite being ostensibly standardized +[[29](/en/ch8#Kleppmann2014)]. There has been a formal definition of +repeatable read in the research literature [[37](/en/ch8#Adya1999), +[38](/en/ch8#Bailis2014virtues_ch8)], but most implementations don’t satisfy that +formal definition. 
And to top it off, IBM Db2 uses “repeatable read” to refer to serializability +[[10](/en/ch8#Bailis2013HAT)]. + +As a result, nobody really knows what repeatable read means. + +## Preventing Lost Updates + +The read committed and snapshot isolation levels we’ve discussed so far have been primarily about the guarantees +of what a read-only transaction can see in the presence of concurrent writes. We have mostly ignored +the issue of two transactions writing concurrently—we have only discussed dirty writes (see +[“No dirty writes”](/en/ch8#sec_transactions_dirty_write)), one particular type of write-write conflict that can occur. + +There are several other interesting kinds of conflicts that can occur between concurrently writing +transactions. The best known of these is the *lost update* problem, illustrated in +[Figure 8-1](/en/ch8#fig_transactions_increment) with the example of two concurrent counter increments. + +The lost update problem can occur if an application reads some value from the database, modifies it, +and writes back the modified value (a *read-modify-write cycle*). If two transactions do this +concurrently, one of the modifications can be lost, because the second write does not include the +first modification. (We sometimes say that the later write *clobbers* the earlier write.) 
This +pattern occurs in various different scenarios: + +* Incrementing a counter or updating an account balance (requires reading the current value, + calculating the new value, and writing back the updated value) +* Making a local change to a complex value, e.g., adding an element to a list within a JSON document + (requires parsing the document, making the change, and writing back the modified document) +* Two users editing a wiki page at the same time, where each user saves their changes by sending the + entire page contents to the server, overwriting whatever is currently in the database + +Because this is such a common problem, a variety of solutions have been developed +[[48](/en/ch8#Svetlov2025)]. + +### Atomic write operations + +Many databases provide atomic update operations, which remove the need to implement +read-modify-write cycles in application code. They are usually the best solution if your code can be +expressed in terms of those operations. For example, the following instruction is concurrency-safe +in most relational databases: + +``` +UPDATE counters SET value = value + 1 WHERE key = 'foo'; +``` + +Similarly, document databases such as MongoDB provide atomic operations for making local +modifications to a part of a JSON document, and Redis provides atomic operations for modifying data +structures such as priority queues. Not all writes can easily be expressed in terms of atomic +operations—for example, updates to a wiki page involve arbitrary text editing, which can be handled +using algorithms discussed in [“CRDTs and Operational Transformation”](/en/ch6#sec_replication_crdts)—but in situations where atomic operations +can be used, they are usually the best choice. + +Atomic operations are usually implemented by taking an exclusive lock on the object when it is read +so that no other transaction can read it until the update has been applied. +Another option is to simply force all atomic operations to be executed on a single thread. 
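
As a rough analogy for the lock-based implementation, the following Python sketch protects a read-modify-write cycle with an exclusive lock so that concurrent increments cannot interleave. It is an in-process toy, not how any particular database implements atomic updates; `LockedCounter` and `worker` are hypothetical names.

```python
import threading


class LockedCounter:
    """Toy object whose increment is made atomic by an exclusive lock."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        # The lock is held across the whole read-modify-write cycle, so no
        # other thread can read the value until the update has been applied.
        with self._lock:
            current = self._value
            self._value = current + 1

    @property
    def value(self):
        with self._lock:
            return self._value


counter = LockedCounter()


def worker():
    for _ in range(1000):
        counter.increment()


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

total = counter.value
```

Because every increment holds the lock for its full cycle, none of the 8 × 1000 increments is lost.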
+ +Unfortunately, object-relational mapping (ORM) frameworks make it easy to accidentally write code +that performs unsafe read-modify-write cycles instead of using atomic operations provided by the +database [[49](/en/ch8#Wiger2010), +[50](/en/ch8#Coglan2020), +[51](/en/ch8#Bailis2015_ch8)]. +This can be a source of subtle bugs that are difficult to find by testing. + +### Explicit locking + +Another option for preventing lost updates, if the database’s built-in atomic operations don’t +provide the necessary functionality, is for the application to explicitly lock objects that are +going to be updated. Then the application can perform a read-modify-write cycle, and if any other +transaction tries to concurrently update or lock the same object, it is forced to wait until the +first read-modify-write cycle has completed. + +For example, consider a multiplayer game in which several players can move the same figure +concurrently. In this case, an atomic operation may not be sufficient, because the application also +needs to ensure that a player’s move abides by the rules of the game, which involves some logic that +you cannot sensibly implement as a database query. Instead, you may use a lock to prevent two +players from concurrently moving the same piece, as illustrated in [Example 8-1](/en/ch8#fig_transactions_select_for_update). + +##### Example 8-1. Explicitly locking rows to prevent lost updates + +``` +BEGIN TRANSACTION; + +SELECT * FROM figures + WHERE name = 'robot' AND game_id = 222 + FOR UPDATE; ![1](/fig/1.png) + +-- Check whether move is valid, then update the position +-- of the piece that was returned by the previous SELECT. +UPDATE figures SET position = 'c4' WHERE id = 1234; + +COMMIT; +``` + +[![1](/fig/1.png)](/en/ch8#co_transactions_CO1-1) +: The `FOR UPDATE` clause indicates that the database should take a lock on all rows returned by + this query. + +This works, but to get it right, you need to carefully think about your application logic. 
It’s easy +to forget to add a necessary lock somewhere in the code, and thus introduce a race condition. + +Moreover, if you lock multiple objects there is a risk of deadlock, where two or more transactions +are waiting for each other to release their locks. Many databases automatically detect deadlocks, +and abort one of the involved transactions so that the system can make progress. You can handle this +situation at the application level by retrying the aborted transaction. + +### Automatically detecting lost updates + +Atomic operations and locks are ways of preventing lost updates by forcing the read-modify-write +cycles to happen sequentially. An alternative is to allow them to execute in parallel and, if the +transaction manager detects a lost update, abort the transaction and force it to retry +its read-modify-write cycle. + +An advantage of this approach is that databases can perform this check efficiently in conjunction +with snapshot isolation. Indeed, PostgreSQL’s repeatable read, Oracle’s serializable, and SQL +Server’s snapshot isolation levels automatically detect when a lost update has occurred and abort +the offending transaction. However, MySQL/InnoDB’s repeatable read does not detect lost updates +[[29](/en/ch8#Kleppmann2014), +[41](/en/ch8#Alvaro2023)]. +Some authors [[36](/en/ch8#Berenson1995), +[38](/en/ch8#Bailis2014virtues_ch8)] argue that a database must prevent lost +updates in order to qualify as providing snapshot isolation, so MySQL does not provide snapshot +isolation under this definition. + +Lost update detection is a great feature, because it doesn’t require application code to use any +special database features—you may forget to use a lock or an atomic operation and thus introduce +a bug, but lost update detection happens automatically and is thus less error-prone. However, you +also have to retry aborted transactions at the application level. 
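
A minimal sketch of such an application-level retry loop is shown below. Everything here is an assumption made for illustration: `TransactionAborted` stands in for whatever error a real database driver raises when it aborts a transaction, `run_with_retries` is a hypothetical helper, and the randomized exponential backoff is one possible design choice rather than something prescribed by any database.

```python
import random
import time


class TransactionAborted(Exception):
    """Hypothetical error signalling that the database aborted a transaction."""


def run_with_retries(transaction, max_attempts=5, base_delay=0.01):
    """Call `transaction` until it commits or `max_attempts` is exhausted."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except TransactionAborted:
            if attempt == max_attempts:
                raise  # give up and let the caller handle the failure
            # Back off for a random interval so that competing transactions
            # are less likely to collide again on the next attempt.
            time.sleep(random.uniform(0, base_delay * 2 ** (attempt - 1)))


attempts = []


def increment_counter():
    """Stand-in for a read-modify-write transaction that is aborted twice."""
    attempts.append(1)
    if len(attempts) < 3:
        raise TransactionAborted("lost update detected")
    return "committed"


result = run_with_retries(increment_counter)
```

The retried function must redo its entire read-modify-write cycle, including the read, since the value it read before the abort may be stale.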
+ +### Conditional writes (compare-and-set) + +In databases that don’t provide transactions, you sometimes find a *conditional write* operation +that can prevent lost updates by allowing an update to happen only if the value has not changed +since you last read it (previously mentioned in [“Single-object writes”](/en/ch8#sec_transactions_single_object)). If the current +value does not match what you previously read, the update has no effect, and the read-modify-write +cycle must be retried. It is the database equivalent of an atomic *compare-and-set* or +*compare-and-swap* (CAS) instruction that is supported by many CPUs. + +For example, to prevent two users concurrently updating the same wiki page, you might try something +like this, expecting the update to occur only if the content of the page hasn’t changed since the +user started editing it: + +``` +-- This may or may not be safe, depending on the database implementation +UPDATE wiki_pages SET content = 'new content' + WHERE id = 1234 AND content = 'old content'; +``` + +If the content has changed and no longer matches `'old content'`, this update will have no effect, +so you need to check whether the update took effect and retry if necessary. Instead of comparing the +full content, you could also use a version number column that you increment on every update, and +apply the update only if the current version number hasn’t changed. This approach is sometimes +called *optimistic locking* [[52](/en/ch8#Dogan2020)]. + +Note that if another transaction has concurrently modified `content`, the new content may not be +visible under the MVCC visibility rules (see [“Visibility rules for observing a consistent snapshot”](/en/ch8#sec_transactions_mvcc_visibility)). 
Many +implementations of MVCC have an exception to the visibility rules for this scenario, where values +written by other transactions are visible to the evaluation of the `WHERE` clause of `UPDATE` and +`DELETE` queries, even though those writes are not otherwise visible in the snapshot. + +### Conflict resolution and replication + +In replicated databases (see [Chapter 6](/en/ch6#ch_replication)), preventing lost updates takes on another +dimension: since they have copies of the data on multiple nodes, and the data can potentially be +modified concurrently on different nodes, some additional steps need to be taken to prevent lost +updates. + +Locks and conditional write operations assume that there is a single up-to-date copy of the data. +However, databases with multi-leader or leaderless replication usually allow several writes to +happen concurrently and replicate them asynchronously, so they cannot guarantee that there is a +single up-to-date copy of the data. Thus, techniques based on locks or conditional writes do not apply +in this context. (We will revisit this issue in more detail in [“Linearizability”](/en/ch10#sec_consistency_linearizability).) + +Instead, as discussed in [“Dealing with Conflicting Writes”](/en/ch6#sec_replication_write_conflicts), a common approach in such replicated +databases is to allow concurrent writes to create several conflicting versions of a value (also +known as *siblings*), and to use application code or special data structures to resolve and merge +these versions after the fact. + +Merging conflicting values can prevent lost updates if the updates are commutative (i.e., you can +apply them in a different order on different replicas, and still get the same result). For example, +incrementing a counter or adding an element to a set are commutative operations. That is the idea +behind CRDTs, which we encountered in [“CRDTs and Operational Transformation”](/en/ch6#sec_replication_crdts). 
However, some operations such as +conditional writes cannot be made commutative. + +On the other hand, the *last write wins* (LWW) conflict resolution method is prone to lost updates, +as discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww). Unfortunately, LWW is the default in many replicated +databases. + +## Write Skew and Phantoms + +In the previous sections we saw *dirty writes* and *lost updates*, two kinds of race conditions that +can occur when different transactions concurrently try to write to the same objects. In order to +avoid data corruption, those race conditions need to be prevented—either automatically by the +database, or by manual safeguards such as using locks or atomic write operations. + +However, that is not the end of the list of potential race conditions that can occur between +concurrent writes. In this section we will see some subtler examples of conflicts. + +To begin, imagine this example: you are writing an application for doctors to manage their on-call +shifts at a hospital. The hospital usually tries to have several doctors on call at any one time, +but it absolutely must have at least one doctor on call. Doctors can give up their shifts (e.g., if +they are sick themselves), provided that at least one colleague remains on call in that shift +[[53](/en/ch8#Cahill2008), +[54](/en/ch8#Ports2012)]. + +Now imagine that Aaliyah and Bryce are the two on-call doctors for a particular shift. Both are +feeling unwell, so they both decide to request leave. Unfortunately, they happen to click the button +to go off call at approximately the same time. What happens next is illustrated in +[Figure 8-8](/en/ch8#fig_transactions_write_skew). + +![ddia 0808](/fig/ddia_0808.png) + +###### Figure 8-8. Example of write skew causing an application bug. + +In each transaction, your application first checks that two or more doctors are currently on call; +if yes, it assumes it’s safe for one doctor to go off call. 
Since the database is using snapshot +isolation, both checks return `2`, so both transactions proceed to the next stage. Aaliyah updates her +own record to take herself off call, and Bryce updates his own record likewise. Both transactions +commit, and now no doctor is on call. Your requirement of having at least one doctor on call has +been violated. + +### Characterizing write skew + +This anomaly is called *write skew* [[36](/en/ch8#Berenson1995)]. It +is neither a dirty write nor a lost update, because the two transactions are updating two different +objects (Aaliyah’s and Bryce’s on-call records, respectively). It is less obvious that a conflict occurred +here, but it’s definitely a race condition: if the two transactions had run one after another, the +second doctor would have been prevented from going off call. The anomalous behavior was only +possible because the transactions ran concurrently. + +You can think of write skew as a generalization of the lost update problem. Write skew can occur if two +transactions read the same objects, and then update some of those objects (different transactions +may update different objects). In the special case where different transactions update the same +object, you get a dirty write or lost update anomaly (depending on the timing). + +We saw that there are various different ways of preventing lost updates. With write skew, our +options are more restricted: + +* Atomic single-object operations don’t help, as multiple objects are involved. +* The automatic detection of lost updates that you find in some implementations of snapshot + isolation unfortunately doesn’t help either: write skew is not automatically detected in + PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL + Server’s snapshot isolation level [[29](/en/ch8#Kleppmann2014)]. + Automatically preventing write skew requires true serializable isolation (see + [“Serializability”](/en/ch8#sec_transactions_serializability)). 
+* Some databases allow you to configure constraints, which are then enforced by the database (e.g., + uniqueness, foreign key constraints, or restrictions on a particular value). However, in order to + specify that at least one doctor must be on call, you would need a constraint that involves + multiple objects. Most databases do not have built-in support for such constraints, but you may be + able to implement them with triggers or materialized views, as discussed in + [“Consistency”](/en/ch8#sec_transactions_acid_consistency) [[12](/en/ch8#Andrews2004)]. +* If you can’t use a serializable isolation level, the second-best option in this case is probably + to explicitly lock the rows that the transaction depends on. In the doctors example, you could + write something like the following: + + ``` + BEGIN TRANSACTION; + + SELECT * FROM doctors + WHERE on_call = true + AND shift_id = 1234 FOR UPDATE; ![1](/fig/1.png) + + UPDATE doctors + SET on_call = false + WHERE name = 'Aaliyah' + AND shift_id = 1234; + + COMMIT; + ``` + + [![1](/fig/1.png)](/en/ch8#co_transactions_CO2-1) + : As before, `FOR UPDATE` tells the database to lock all rows returned by this query. + +### More examples of write skew + +Write skew may seem like an esoteric issue at first, but once you’re aware of it, you may notice +more situations in which it can occur. Here are some more examples: + +Meeting room booking system +: Say you want to enforce that there cannot be two bookings for the same meeting room at the same + time [[55](/en/ch8#Terry1995_ch8)]. + When someone wants to make a booking, you first check for any conflicting bookings (i.e., + bookings for the same room with an overlapping time range), and if none are found, you create the + meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)). + + ##### Example 8-2. 
A meeting room booking system tries to avoid double-booking (not safe under snapshot isolation) + + ``` + BEGIN TRANSACTION; + + -- Check for any existing bookings that overlap with the period of noon-1pm + SELECT COUNT(*) FROM bookings + WHERE room_id = 123 AND + end_time > '2025-01-01 12:00' AND start_time < '2025-01-01 13:00'; + + -- If the previous query returned zero: + INSERT INTO bookings + (room_id, start_time, end_time, user_id) + VALUES (123, '2025-01-01 12:00', '2025-01-01 13:00', 666); + + COMMIT; + ``` + + Unfortunately, snapshot isolation does not prevent another user from concurrently inserting a conflicting + meeting. In order to guarantee you won’t get scheduling conflicts, you once again need serializable + isolation. + +Multiplayer game +: In [Example 8-1](/en/ch8#fig_transactions_select_for_update), we used a lock to prevent lost updates (that is, making + sure that two players can’t move the same figure at the same time). However, the lock doesn’t + prevent players from moving two different figures to the same position on the board or potentially + making some other move that violates the rules of the game. Depending on the kind of rule you are + enforcing, you might be able to use a unique constraint, but otherwise you’re vulnerable to write + skew. + +Claiming a username +: On a website where each user has a unique username, two users may try to create accounts with the + same username at the same time. You may use a transaction to check whether a name is taken and, if + not, create an account with that name. However, like in the previous examples, that is not safe + under snapshot isolation. Fortunately, a unique constraint is a simple solution here (the second + transaction that tries to register the username will be aborted due to violating the constraint). + +Preventing double-spending +: A service that allows users to spend money or points needs to check that a user doesn’t spend more + than they have. 
You might implement this by inserting a tentative spending item into a user’s + account, listing all the items in the account, and checking that the sum is positive. + With write skew, it could happen that two spending items are inserted concurrently that together + cause the balance to go negative, but that neither transaction notices the other. + +### Phantoms causing write skew + +All of these examples follow a similar pattern: + +1. A `SELECT` query checks whether some requirement is satisfied by searching for rows that + match some search condition (there are at least two doctors on call, there are no existing + bookings for that room at that time, the position on the board doesn’t already have another + figure on it, the username isn’t already taken, there is still money in the account). +2. Depending on the result of the first query, the application code decides how to continue (perhaps + to go ahead with the operation, or perhaps to report an error to the user and abort). +3. If the application decides to go ahead, it makes a write (`INSERT`, `UPDATE`, or `DELETE`) to the + database and commits the transaction. + + The effect of this write changes the precondition of the decision of step 2. In other words, if you + were to repeat the `SELECT` query from step 1 after committing the write, you would get a different + result, because the write changed the set of rows matching the search condition (there is now one + fewer doctor on call, the meeting room is now booked for that time, the position on the board is now + taken by the figure that was moved, the username is now taken, there is now less money in the + account). + +The steps may occur in a different order. For example, you could first make the write, then the +`SELECT` query, and finally decide whether to abort or commit based on the result of the query. 
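
The pattern above can be reproduced with a toy model of the on-call example. This is only a sketch under simplifying assumptions: `SnapshotTxn` is a hypothetical class that models snapshot isolation by copying a dictionary at transaction start, and conflicts between transactions that write the same key are not modeled because they do not arise here.

```python
class SnapshotTxn:
    """Toy transaction that reads from a snapshot taken when it starts."""

    def __init__(self, db):
        self.db = db
        self.snapshot = dict(db)  # consistent snapshot of name -> on_call
        self.writes = {}

    def go_off_call(self, name):
        # Step 1: check the requirement against the snapshot.
        on_call = sum(1 for flag in self.snapshot.values() if flag)
        # Step 2: decide how to continue based on the result.
        if on_call < 2:
            return False
        # Step 3: write, which changes the premise of the check in step 1.
        self.writes[name] = False
        return True

    def commit(self):
        self.db.update(self.writes)


# Concurrent execution: both transactions start before either commits.
concurrent_db = {"Aaliyah": True, "Bryce": True}
t1 = SnapshotTxn(concurrent_db)
t2 = SnapshotTxn(concurrent_db)
t1.go_off_call("Aaliyah")
t2.go_off_call("Bryce")
t1.commit()
t2.commit()  # the write sets are disjoint, so nothing looks like a conflict
concurrent_on_call = sum(concurrent_db.values())

# Serial execution: the second transaction starts after the first commits.
serial_db = {"Aaliyah": True, "Bryce": True}
s1 = SnapshotTxn(serial_db)
s1.go_off_call("Aaliyah")
s1.commit()
s2 = SnapshotTxn(serial_db)
second_allowed = s2.go_off_call("Bryce")
s2.commit()
serial_on_call = sum(serial_db.values())
```

When the two transactions overlap, both checks pass and nobody remains on call; when they run one after another, the second request is refused.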
+ +In the case of the doctor on call example, the row being modified in step 3 was one of the rows +returned in step 1, so we could make the transaction safe and avoid write skew by locking the rows +in step 1 (`SELECT FOR UPDATE`). However, the other four examples are different: they check for the +*absence* of rows matching some search condition, and the write *adds* a row matching the same +condition. If the query in step 1 doesn’t return any rows, `SELECT FOR UPDATE` can’t attach locks to +anything [[56](/en/ch8#Schoenig2021)]. + +This effect, where a write in one transaction changes the result of a search query in another +transaction, is called a *phantom* [[4](/en/ch8#Eswaran1976)]. +Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the +examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL +generated by ORMs is also prone to write skew +[[50](/en/ch8#Coglan2020), +[51](/en/ch8#Bailis2015_ch8)]. + +### Materializing conflicts + +If the problem of phantoms is that there is no object to which we can attach the locks, perhaps we +can artificially introduce a lock object into the database? + +For example, in the meeting room booking case you could imagine creating a table of time slots and +rooms. Each row in this table corresponds to a particular room for a particular time period (say, 15 +minutes). You create rows for all possible combinations of rooms and time periods ahead of time, +e.g. for the next six months. + +Now a transaction that wants to create a booking can lock (`SELECT FOR UPDATE`) the rows in the +table that correspond to the desired room and time period. After it has acquired the locks, it can +check for overlapping bookings and insert a new booking as before. 
Note that the additional table +isn’t used to store information about the booking—it’s purely a collection of locks which is used +to prevent bookings on the same room and time range from being modified concurrently. + +This approach is called *materializing conflicts*, because it takes a phantom and turns it into a +lock conflict on a concrete set of rows that exist in the database +[[14](/en/ch8#Fekete2005)]. Unfortunately, it can be hard and +error-prone to figure out how to materialize conflicts, and it’s ugly to let a concurrency control +mechanism leak into the application data model. For those reasons, materializing conflicts should be +considered a last resort if no alternative is possible. A serializable isolation level is much +preferable in most cases. + +# Serializability + +In this chapter we have seen several examples of transactions that are prone to race conditions. +Some race conditions are prevented by the read committed and snapshot isolation levels, but +others are not. We encountered some particularly tricky examples with write skew and phantoms. It’s +a sad situation: + +* Isolation levels are hard to understand, and inconsistently implemented in different databases + (e.g., the meaning of “repeatable read” varies significantly). +* If you look at your application code, it’s difficult to tell whether it is safe to run at a + particular isolation level—especially in a large application, where you might not be aware of + all the things that may be happening concurrently. +* There are no good tools to help us detect race conditions. In principle, static analysis may + help [[33](/en/ch8#Jorwekar2007_ch8)], but research techniques have not + yet found their way into practical use. Testing for concurrency issues is hard, because they are + usually nondeterministic—problems only occur if you get unlucky with the timing. 
+ +This is not a new problem—it has been like this since the 1970s, when weak isolation levels were +first introduced [[3](/en/ch8#Gray1976)]. All along, the answer +from researchers has been simple: use *serializable* isolation! + +Serializable isolation is the strongest isolation level. It guarantees that even +though transactions may execute in parallel, the end result is the same as if they had executed one +at a time, *serially*, without any concurrency. Thus, the database guarantees that if the +transactions behave correctly when run individually, they continue to be correct when run +concurrently—in other words, the database prevents *all* possible race conditions. + +But if serializable isolation is so much better than the mess of weak isolation levels, then why +isn’t everyone using it? To answer this question, we need to look at the options for implementing +serializability, and how they perform. Most databases that provide serializability today use one of +three techniques, which we will explore in the rest of this chapter: + +* Literally executing transactions in a serial order (see [“Actual Serial Execution”](/en/ch8#sec_transactions_serial)) +* Two-phase locking (see [“Two-Phase Locking (2PL)”](/en/ch8#sec_transactions_2pl)), which for several decades was the only viable + option +* Optimistic concurrency control techniques such as serializable snapshot isolation (see + [“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)) + +## Actual Serial Execution + +The simplest way of avoiding concurrency problems is to remove the concurrency entirely: to +execute only one transaction at a time, in serial order, on a single thread. By doing so, we completely +sidestep the problem of detecting and preventing conflicts between transactions: the resulting +isolation is by definition serializable. 
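
As a rough illustration of this idea, the sketch below funnels every transaction through a queue that is drained by one worker thread. It is a toy model, not how any particular database is built: the `SerialExecutor` class, its `submit` method, and the choice to represent the database as a flat Python dict are all assumptions made for this example.

```python
import queue
import threading


class SerialExecutor:
    """Runs submitted transactions one at a time on a single worker thread."""

    def __init__(self, initial_state):
        self._state = dict(initial_state)
        self._inbox = queue.Queue()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def submit(self, txn):
        """Enqueue txn (a function taking the state dict) and wait for its result."""
        done = threading.Event()
        result = {}
        self._inbox.put((txn, done, result))
        done.wait()
        if "error" in result:
            raise result["error"]
        return result["value"]

    def _run(self):
        while True:
            txn, done, result = self._inbox.get()
            # Work on a copy so that an aborted transaction leaves no trace.
            # (A shallow copy is enough here because the toy state is flat.)
            working_copy = dict(self._state)
            try:
                result["value"] = txn(working_copy)
                self._state = working_copy   # commit
            except Exception as exc:
                result["error"] = exc        # abort: discard the working copy
            done.set()


def increment(state):
    state["counter"] = state.get("counter", 0) + 1
    return state["counter"]


executor = SerialExecutor({"counter": 0})
clients = [
    threading.Thread(target=lambda: [executor.submit(increment) for _ in range(50)])
    for _ in range(4)
]
for c in clients:
    c.start()
for c in clients:
    c.join()
final_count = executor.submit(lambda state: state["counter"])
```

Although four client threads submit increments concurrently, no increment is lost, because only the worker thread ever touches the state. The cost is equally visible: one slow transaction would hold up everything queued behind it.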
+ +Even though this seems like an obvious idea, it was only in the 2000s that database designers +decided that a single-threaded loop for executing transactions was feasible +[[57](/en/ch8#Stonebraker2007_ch8)]. +If multi-threaded concurrency was considered essential for getting good performance during the +previous 30 years, what changed to make single-threaded execution possible? + +Two developments caused this rethink: + +* RAM became cheap enough that for many use cases it is now feasible to keep the entire + active dataset in memory (see [“Keeping everything in memory”](/en/ch4#sec_storage_inmemory)). When all data that a transaction needs to + access is in memory, transactions can execute much faster than if they have to wait for data to be + loaded from disk. +* Database designers realized that OLTP transactions are usually short and only make a small number + of reads and writes (see [“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics)). By contrast, long-running analytic queries + are typically read-only, so they can be run on a consistent snapshot (using snapshot isolation) + outside of the serial execution loop. + +The approach of executing transactions serially is implemented in VoltDB/H-Store, Redis, and Datomic, +for example [[58](/en/ch8#Hugg2014streaming), +[59](/en/ch8#Kallman2008), +[60](/en/ch8#Hickey2012)]. +A system designed for single-threaded execution can sometimes perform better than a system that +supports concurrency, because it can avoid the coordination overhead of locking. However, its +throughput is limited to that of a single CPU core. In order to make the most of that single thread, +transactions need to be structured differently from their traditional form. + +### Encapsulating transactions in stored procedures + +In the early days of databases, the intention was that a database transaction could encompass an +entire flow of user activity. 
For example, booking an airline ticket is a multi-stage process +(searching for routes, fares, and available seats; deciding on an itinerary; booking seats on +each of the flights of the itinerary; entering passenger details; making payment). Database +designers thought that it would be neat if that entire process was one transaction so that it could +be committed atomically. + +Unfortunately, humans are very slow to make up their minds and respond. If a database transaction +needs to wait for input from a user, the database needs to support a potentially huge number of +concurrent transactions, most of them idle. Most databases cannot do that efficiently, and so almost +all OLTP applications keep transactions short by avoiding interactively waiting for a user within a +transaction. On the web, this means that a transaction is committed within the same HTTP request—​a +transaction does not span multiple requests. A new HTTP request starts a new transaction. + +Even though the human has been taken out of the critical path, transactions have continued to be +executed in an interactive client/server style, one statement at a time. An application makes a +query, reads the result, perhaps makes another query depending on the result of the first query, and +so on. The queries and results are sent back and forth between the application code (running on one +machine) and the database server (on another machine). + +In this interactive style of transaction, a lot of time is spent in network communication between +the application and the database. If you were to disallow concurrency in the database and only +process one transaction at a time, the throughput would be dreadful because the database would +spend most of its time waiting for the application to issue the next query for the current +transaction. In this kind of database, it’s necessary to process multiple transactions concurrently +in order to get reasonable performance. 
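
A back-of-the-envelope calculation shows how severe this would be. The numbers below are assumptions chosen only for illustration (four statements per transaction, 0.5 ms per network round trip, 10 µs of actual work per statement); they are not measurements of any real system.

```python
def serial_throughput(statements, round_trip_s, exec_s):
    """Upper bound on transactions per second if only one transaction runs at a time."""
    seconds_per_txn = statements * (round_trip_s + exec_s)
    return 1.0 / seconds_per_txn


# Interactive style: the database waits one round trip for every statement.
interactive = serial_throughput(statements=4, round_trip_s=500e-6, exec_s=10e-6)

# Same work with no network wait inside the transaction.
no_network_wait = serial_throughput(statements=4, round_trip_s=0.0, exec_s=10e-6)
```

Under these assumptions the interactive style tops out at about 490 transactions per second, whereas the same work without the network waits could reach about 25,000 per second. Almost all of the time in the interactive case is spent idle.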
+ +For this reason, systems with single-threaded serial transaction processing don’t allow interactive +multi-statement transactions. Instead, the application must either limit itself to transactions +containing a single statement, or submit the entire transaction code to the database ahead of time, +as a *stored procedure* [[61](/en/ch8#Hugg2014debunking)]. + +The difference between interactive transactions and stored procedures is illustrated in +[Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the +stored procedure can execute very quickly, without waiting for any network or disk I/O. + +![ddia 0809](/fig/ddia_0809.png) + +###### Figure 8-9. The difference between an interactive transaction and a stored procedure (using the example transaction of [Figure 8-8](/en/ch8#fig_transactions_write_skew)). + +### Pros and cons of stored procedures + +Stored procedures have existed for some time in relational databases, and they have been part of the +SQL standard (SQL/PSM) since 1999. They have gained a somewhat bad reputation, for various reasons: + +* Traditionally, each database vendor had its own language for stored procedures (Oracle has PL/SQL, SQL Server + has T-SQL, PostgreSQL has PL/pgSQL, etc.). These languages haven’t kept up with developments in + general-purpose programming languages, so they look quite ugly and archaic from today’s point of + view, and they lack the ecosystem of libraries that you find with most programming languages. +* Code running in a database is difficult to manage: compared to an application server, it’s harder + to debug, more awkward to keep in version control and deploy, trickier to test, and difficult to + integrate with a metrics collection system for monitoring. +* A database is often much more performance-sensitive than an application server, because a single + database instance is often shared by many application servers. 
A badly written stored procedure + (e.g., using a lot of memory or CPU time) in a database can cause much more trouble than equivalent + badly written code in an application server. +* In a multitenant system that allows tenants to write their own stored procedures, it’s a security + risk to execute untrusted code in the same process as the database kernel + [[62](/en/ch8#Zhou2025)]. + +However, those issues can be overcome. Modern implementations of stored procedures have abandoned +PL/SQL and use existing general-purpose programming languages instead: VoltDB uses Java or Groovy, +Datomic uses Java or Clojure, Redis uses Lua, and MongoDB uses JavaScript. + +Stored procedures are also useful in cases where application logic can’t easily be embedded +elsewhere. Applications that use GraphQL, for example, might directly expose their database through +a GraphQL proxy. If the proxy doesn’t support complex validation logic, you can embed such logic +directly in the database using a stored procedure. If the database doesn’t support stored +procedures, you would have to deploy a validation service between the proxy and the database to do +validation. + +With stored procedures and in-memory data, executing all transactions on a single thread becomes +feasible. When stored procedures don’t need to wait for I/O and avoid the overhead of other +concurrency control mechanisms, they can achieve quite good throughput on a single thread. + +VoltDB also uses stored procedures for replication: instead of copying a transaction’s writes from +one node to another, it executes the same stored procedure on each replica. VoltDB therefore +requires that stored procedures are *deterministic* (when run on different nodes, they must produce +the same result). 
If a transaction needs to use the current date and time, for example, it must do +so through special deterministic APIs (see [“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows) for more details on +deterministic operations). This approach is called *state machine replication*, and we will return +to it in [Chapter 10](/en/ch10#ch_consistency). + +### Sharding + +Executing all transactions serially makes concurrency control much simpler, but limits the +transaction throughput of the database to the speed of a single CPU core on a single machine. +Read-only transactions may execute elsewhere, using snapshot isolation, but for applications with +high write throughput, the single-threaded transaction processor can become a serious bottleneck. + +In order to scale to multiple CPU cores, and multiple nodes, you can shard your data +(see [Chapter 7](/en/ch7#ch_sharding)), which is supported in VoltDB. If you can find a way of sharding your dataset +so that each transaction only needs to read and write data within a single shard, then each shard +can have its own transaction processing thread running independently from the others. In this case, +you can give each CPU core its own shard, which allows your transaction throughput to scale linearly +with the number of CPU cores [[59](/en/ch8#Kallman2008)]. + +However, for any transaction that needs to access multiple shards, the database must coordinate the +transaction across all the shards that it touches. The stored procedure needs to be performed in +lock-step across all shards to ensure serializability across the whole system. + +Since cross-shard transactions have additional coordination overhead, they are vastly slower than +single-shard transactions. VoltDB reports a throughput of about 1,000 cross-shard writes per second, +which is orders of magnitude below its single-shard throughput and cannot be increased by adding +more machines [[61](/en/ch8#Hugg2014debunking)]. 
More recent research +has explored ways of making multi-shard transactions more scalable +[[63](/en/ch8#Zhou2022)]. + +Whether transactions can be single-shard depends very much on the structure of the data used by the +application. Simple key-value data can often be sharded very easily, but data with multiple +secondary indexes is likely to require a lot of cross-shard coordination (see +[“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes)). + +### Summary of serial execution + +Serial execution of transactions has become a viable way of achieving serializable isolation within +certain constraints: + +* Every transaction must be small and fast, because it takes only one slow transaction to stall all + transaction processing. +* It is most appropriate in situations where the active dataset can fit in memory. Rarely accessed + data could potentially be moved to disk, but if it needed to be accessed in a single-threaded + transaction, the system would get very slow. +* Write throughput must be low enough to be handled on a single CPU core, or else transactions need + to be sharded without requiring cross-shard coordination. +* Cross-shard transactions are possible, but their throughput is hard to scale. + +## Two-Phase Locking (2PL) + +For around 30 years, there was only one widely used algorithm for serializability in databases: +*two-phase locking* (2PL), sometimes called *strong strict two-phase locking* (SS2PL) to distinguish +it from other variants of 2PL. + +# 2PL is not 2PC + +Two-phase *locking* (2PL) and two-phase *commit* (2PC) are two very different things. 2PL provides +serializable isolation, whereas 2PC provides atomic commit in a distributed database (see +[“Two-Phase Commit (2PC)”](/en/ch8#sec_transactions_2pc)). To avoid confusion, it’s best to think of them as entirely separate +concepts and to ignore the unfortunate similarity in the names. 
+ +We saw previously that locks are often used to prevent dirty writes (see +[“No dirty writes”](/en/ch8#sec_transactions_dirty_write)): if two transactions concurrently try to write to the same object, +the lock ensures that the second writer must wait until the first one has finished its transaction +(aborted or committed) before it may continue. + +Two-phase locking is similar, but makes the lock requirements much stronger. Several transactions +are allowed to concurrently read the same object as long as nobody is writing to it. But as soon as +anyone wants to write (modify or delete) an object, exclusive access is required: + +* If transaction A has read an object and transaction B wants to write to that object, B must wait + until A commits or aborts before it can continue. (This ensures that B can’t change the object + unexpectedly behind A’s back.) +* If transaction A has written an object and transaction B wants to read that object, B must wait + until A commits or aborts before it can continue. (Reading an old version of the object, like in + [Figure 8-4](/en/ch8#fig_transactions_read_committed), is not acceptable under 2PL.) + +In 2PL, writers don’t just block other writers; they also block readers and vice +versa. Snapshot isolation has the mantra *readers never block writers, and writers never block +readers* (see [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl)), which captures this key difference between +snapshot isolation and two-phase locking. On the other hand, because 2PL provides serializability, +it protects against all the race conditions discussed earlier, including lost updates and write skew. + +### Implementation of two-phase locking + +2PL is used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and the +repeatable read isolation level in Db2 +[[29](/en/ch8#Kleppmann2014)]. + +The blocking of readers and writers is implemented by having a lock on each object in the +database. 
The lock can either be in *shared mode* or in *exclusive mode* (also known as a +*multi-reader single-writer* lock). The lock is used as follows: + +* If a transaction wants to read an object, it must first acquire the lock in shared mode. Several + transactions are allowed to hold the lock in shared mode simultaneously, but if another + transaction already has an exclusive lock on the object, these transactions must wait. +* If a transaction wants to write to an object, it must first acquire the lock in exclusive mode. No + other transaction may hold the lock at the same time (either in shared or in exclusive mode), so + if there is any existing lock on the object, the transaction must wait. +* If a transaction first reads and then writes an object, it may upgrade its shared lock to an + exclusive lock. The upgrade works the same as getting an exclusive lock directly. +* After a transaction has acquired the lock, it must continue to hold the lock until the end of the + transaction (commit or abort). This is where the name “two-phase” comes from: the first phase + (while the transaction is executing) is when the locks are acquired, and the second phase (at the + end of the transaction) is when all the locks are released. + +Since so many locks are in use, it can happen quite easily that transaction A is stuck waiting for +transaction B to release its lock, and vice versa. This situation is called *deadlock*. The database +automatically detects deadlocks between transactions and aborts one of them so that the others can +make progress. The aborted transaction needs to be retried by the application. + +### Performance of two-phase locking + +The big downside of two-phase locking, and the reason why it hasn’t been used by everybody since the +1970s, is performance: transaction throughput and response times of queries are significantly worse +under two-phase locking than under weak isolation. 
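
For reference, the shared/exclusive lock rules and the deadlock detection described in the previous section can be sketched as follows. This is a simplified model under stated assumptions: the `LockTable` class, the `"S"`/`"X"` mode strings, and the non-blocking `try_acquire` method (which reports failure instead of suspending the caller) are all invented for this example, and a real lock manager is considerably more involved. Even so, the amount of bookkeeping hints at where some of the cost comes from.

```python
class LockTable:
    """Toy lock manager with shared ("S") and exclusive ("X") modes."""

    def __init__(self):
        self._holders = {}    # object -> {transaction: "S" or "X"}
        self.waits_for = {}   # transaction -> set of transactions blocking it

    def try_acquire(self, txn, obj, mode):
        """Grant the lock and return True, or record the blockers and return False."""
        holders = self._holders.setdefault(obj, {})
        others = {t: m for t, m in holders.items() if t != txn}
        if mode == "S":
            # A shared request only conflicts with another exclusive holder.
            blockers = {t for t, m in others.items() if m == "X"}
        else:
            # An exclusive request conflicts with any other holder.
            blockers = set(others)
        if blockers:
            self.waits_for[txn] = blockers
            return False
        if holders.get(txn) != "X":   # set the lock, or upgrade S to X
            holders[txn] = mode
        self.waits_for.pop(txn, None)
        return True

    def release_all(self, txn):
        """Called only at commit or abort: the second phase of 2PL."""
        for holders in self._holders.values():
            holders.pop(txn, None)
        self.waits_for.pop(txn, None)

    def deadlocked(self, txn):
        """True if following waits-for edges from txn leads back to txn."""
        stack, seen = list(self.waits_for.get(txn, ())), set()
        while stack:
            t = stack.pop()
            if t == txn:
                return True
            if t not in seen:
                seen.add(t)
                stack.extend(self.waits_for.get(t, ()))
        return False


locks = LockTable()
locks.try_acquire("A", "x", "S")               # A reads x
locks.try_acquire("B", "y", "S")               # B reads y
a_granted = locks.try_acquire("A", "y", "X")   # A wants to write y: blocked by B
b_granted = locks.try_acquire("B", "x", "X")   # B wants to write x: blocked by A
```

Here A waits for B and B waits for A, so `locks.deadlocked("A")` reports a cycle. Aborting B (calling `release_all("B")`) lets A's retried request succeed.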
+ +This is partly due to the overhead of acquiring and releasing all those locks, but more importantly +due to reduced concurrency. By design, if two concurrent transactions try to do anything that may +in any way result in a race condition, one has to wait for the other to complete. + +For example, if you have a transaction that needs to read an entire table (e.g. a backup, analytics +query, or integrity check, as discussed in [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation)), that +transaction has to take a shared lock on the entire table. Therefore, the reading transaction first +has to wait until all in-progress transactions writing to that table have completed; then, while the +whole table is being read (which may take a long time on a large table), all other transactions that +want to write to that table are blocked until the big read-only transaction commits. In effect, the +database becomes unavailable for writes for an extended time. + +For this reason, databases running 2PL can have quite unstable latencies, and they can be very slow at +high percentiles (see [“Describing Performance”](/en/ch2#sec_introduction_percentiles)) if there is contention in the workload. It +may take just one slow transaction, or one transaction that accesses a lot of data and acquires many +locks, to cause the rest of the system to grind to a halt. + +Although deadlocks can happen with the lock-based read committed isolation level, they occur much +more frequently under 2PL serializable isolation (depending on the access patterns of your +transaction). This can be an additional performance problem: when a transaction is aborted due to +deadlock and is retried, it needs to do its work all over again. If deadlocks are frequent, this can +mean significant wasted effort. + +### Predicate locks + +In the preceding description of locks, we glossed over a subtle but important detail. 
In +[“Phantoms causing write skew”](/en/ch8#sec_transactions_phantom) we discussed the problem of *phantoms*—that is, one transaction +changing the results of another transaction’s search query. A database with serializable isolation +must prevent phantoms. + +In the meeting room booking example this means that if one transaction has searched for existing +bookings for a room within a certain time window (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)), another +transaction is not allowed to concurrently insert or update another booking for the same room and +time range. (It’s okay to concurrently insert bookings for other rooms, or for the same room at a +different time that doesn’t affect the proposed booking.) + +How do we implement this? Conceptually, we need a *predicate lock* +[[4](/en/ch8#Eswaran1976)]. It works similarly to the +shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one +row in a table), it belongs to all objects that match some search condition, such as: + +``` +SELECT * FROM bookings + WHERE room_id = 123 AND + end_time > '2025-01-01 12:00' AND + start_time < '2025-01-01 13:00'; +``` + +A predicate lock restricts access as follows: + +* If transaction A wants to read objects matching some condition, like in that `SELECT` query, it + must acquire a shared-mode predicate lock on the conditions of the query. If another transaction B + currently has an exclusive lock on any object matching those conditions, A must wait until B + releases its lock before it is allowed to make its query. +* If transaction A wants to insert, update, or delete any object, it must first check whether either the old + or the new value matches any existing predicate lock. If there is a matching predicate lock held by + transaction B, then A must wait until B has committed or aborted before it can continue. 
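
The second rule can be sketched in a few lines. The model below is an assumption-laden illustration rather than a real implementation: a predicate is represented as a plain Python function over a row (a dict), the `PredicateLockTable` class and its method names are invented for this example, and only shared predicate locks taken by readers are modeled. Timestamps are strings in a fixed format so that they compare correctly as text.

```python
class PredicateLockTable:
    """Toy model: shared predicate locks held by readers, checked by writers."""

    def __init__(self):
        self._locks = []   # list of (owner, predicate function)

    def acquire_shared(self, owner, matches):
        self._locks.append((owner, matches))

    def release_all(self, owner):
        self._locks = [(o, m) for (o, m) in self._locks if o != owner]

    def blockers(self, writer, old_row, new_row):
        """Transactions the writer must wait for before changing old_row to new_row.

        Pass old_row=None for an insert and new_row=None for a delete.
        """
        versions = [row for row in (old_row, new_row) if row is not None]
        return {
            owner
            for (owner, matches) in self._locks
            if owner != writer and any(matches(row) for row in versions)
        }


def room_123_noon_to_one(row):
    # Same condition as the SELECT query above.
    return (
        row["room_id"] == 123
        and row["end_time"] > "2025-01-01 12:00"
        and row["start_time"] < "2025-01-01 13:00"
    )


table = PredicateLockTable()
table.acquire_shared("A", room_123_noon_to_one)

overlapping = {"room_id": 123, "start_time": "2025-01-01 12:30", "end_time": "2025-01-01 13:30"}
other_room = {"room_id": 456, "start_time": "2025-01-01 12:30", "end_time": "2025-01-01 13:30"}

insert_overlapping = table.blockers("B", None, overlapping)   # must wait for A
insert_other_room = table.blockers("B", None, other_room)     # free to proceed
```

Note that the overlapping booking did not exist when A ran its query, yet B's insert is still held back. Note also that every write has to be tested against every active predicate.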
+ +The key idea here is that a predicate lock applies even to objects that do not yet exist in the +database, but which might be added in the future (phantoms). If two-phase locking includes predicate locks, +the database prevents all forms of write skew and other race conditions, and so its isolation +becomes serializable. + +### Index-range locks + +Unfortunately, predicate locks do not perform well: if there are many locks by active transactions, +checking for matching locks becomes time-consuming. For that reason, most databases with 2PL +actually implement *index-range locking* (also known as *next-key locking*), which is a simplified +approximation of predicate locking [[54](/en/ch8#Ports2012), +[64](/en/ch8#Hellerstein2007_ch8)]. + +It’s safe to simplify a predicate by making it match a greater set of objects. For example, if you +have a predicate lock for bookings of room 123 between noon and 1 p.m., you can approximate it by +locking bookings for room 123 at any time, or you can approximate it by locking all rooms (not just +room 123) between noon and 1 p.m. This is safe because any write that matches the original predicate +will definitely also match the approximations. + +In the room bookings database you would probably have an index on the `room_id` column, and/or +indexes on `start_time` and `end_time` (otherwise the preceding query would be very slow on a large +database): + +* Say your index is on `room_id`, and the database uses this index to find existing bookings for + room 123. Now the database can simply attach a shared lock to this index entry, indicating that a + transaction has searched for bookings of room 123. +* Alternatively, if the database uses a time-based index to find existing bookings, it can attach a + shared lock to a range of values in that index, indicating that a transaction has searched for + bookings that overlap with the time period of noon to 1 p.m. on January 1, 2025. 
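
The first of these two options might look roughly like the sketch below. It is a toy model: the `IndexRangeLocks` class and its methods are hypothetical names, and the "index" is reduced to the set of key ranges on which shared locks are held.

```python
class IndexRangeLocks:
    """Toy model of shared locks attached to key ranges of a single index."""

    def __init__(self):
        self._ranges = []   # (owner, low, high), covering low <= key <= high

    def lock_shared(self, owner, low, high):
        self._ranges.append((owner, low, high))

    def release_all(self, owner):
        self._ranges = [r for r in self._ranges if r[0] != owner]

    def blockers(self, writer, key):
        """Owners of shared range locks that a write to `key` would run into."""
        return {
            owner
            for (owner, low, high) in self._ranges
            if owner != writer and low <= key <= high
        }


# Index on room_id: transaction A searched for bookings of room 123,
# so a shared lock is attached to that single index entry.
room_index = IndexRangeLocks()
room_index.lock_shared("A", 123, 123)

# B wants to insert a booking for room 123 from 3 p.m. to 4 p.m.
afternoon_booking = room_index.blockers("B", 123)

# B wants to insert a booking for room 456 at noon.
other_room_booking = room_index.blockers("B", 456)
```

The 3 p.m. booking does not overlap with the noon to 1 p.m. window that A asked about, but B is blocked anyway, because the lock on the `room_id` index entry covers room 123 at any time. In exchange, the conflict check is a simple key comparison rather than an evaluation of arbitrary search conditions.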
+ +Either way, an approximation of the search condition is attached to one of the indexes. Now, if +another transaction wants to insert, update, or delete a booking for the same room and/or an +overlapping time period, it will have to update the same part of the index. In the process of doing +so, it will encounter the shared lock, and it will be forced to wait until the lock is released. + +This provides effective protection against phantoms and write skew. Index-range locks are not as +precise as predicate locks would be (they may lock a bigger range of objects than is strictly +necessary to maintain serializability), but since they have much lower overheads, they are a good +compromise. + +If there is no suitable index where a range lock can be attached, the database can fall back to a +shared lock on the entire table. This will not be good for performance, since it will stop all +other transactions writing to the table, but it’s a safe fallback position. + +## Serializable Snapshot Isolation (SSI) + +This chapter has painted a bleak picture of concurrency control in databases. On the one hand, we +have implementations of serializability that don’t perform well (two-phase locking) or don’t scale +well (serial execution). On the other hand, we have weak isolation levels that have good +performance, but are prone to various race conditions (lost updates, write skew, phantoms, etc.). Are +serializable isolation and good performance fundamentally at odds with each other? + +It seems not: an algorithm called *serializable snapshot isolation* (SSI) provides full +serializability with only a small performance penalty compared to snapshot isolation. SSI is +comparatively new: it was first described in 2008 +[[53](/en/ch8#Cahill2008), +[65](/en/ch8#Cahill2009)]. 
+ +Today SSI and similar algorithms are used in single-node databases (the serializable isolation level +in PostgreSQL [[54](/en/ch8#Ports2012)], SQL Server’s In-Memory +OLTP/Hekaton [[66](/en/ch8#Diaconu2013)], and HyPer +[[67](/en/ch8#Neumann2015)]), +distributed databases (CockroachDB [[5](/en/ch8#Taft2020_ch8)] and +FoundationDB [[8](/en/ch8#Zhou2021_ch8)]), and embedded storage +engines such as BadgerDB. + +### Pessimistic versus optimistic concurrency control + +Two-phase locking is a so-called *pessimistic* concurrency control mechanism: it is based on the +principle that if anything might possibly go wrong (as indicated by a lock held by another +transaction), it’s better to wait until the situation is safe again before doing anything. It is +like *mutual exclusion*, which is used to protect data structures in multi-threaded programming. + +Serial execution is, in a sense, pessimistic to the extreme: it is essentially equivalent to each +transaction having an exclusive lock on the entire database (or one shard of the database) for the +duration of the transaction. We compensate for the pessimism by making each transaction very fast to +execute, so it only needs to hold the “lock” for a short time. + +By contrast, serializable snapshot isolation is an *optimistic* concurrency control technique. +Optimistic in this context means that instead of blocking if something potentially dangerous +happens, transactions continue anyway, in the hope that everything will turn out all right. When a +transaction wants to commit, the database checks whether anything bad happened (i.e., whether +isolation was violated); if so, the transaction is aborted and has to be retried. Only transactions +that executed serializably are allowed to commit. + +Optimistic concurrency control is an old idea +[[68](/en/ch8#Badal1979)], +and its advantages and disadvantages have been debated for a long time +[[69](/en/ch8#Agrawal1987)]. 
+It performs badly if there is high contention (many transactions trying to access the same objects), +as this leads to a high proportion of transactions needing to abort. If the system is already close +to its maximum throughput, the additional transaction load from retried transactions can make +performance worse. + +However, if there is enough spare capacity, and if contention between transactions is not too high, +optimistic concurrency control techniques tend to perform better than pessimistic ones. Contention +can be reduced with commutative atomic operations: for example, if several transactions concurrently +want to increment a counter, it doesn’t matter in which order the increments are applied (as long as +the counter isn’t read in the same transaction), so the concurrent increments can all be applied +without conflicting. + +As the name suggests, SSI is based on snapshot isolation—that is, all reads within a transaction +are made from a consistent snapshot of the database (see [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation)). +On top of snapshot isolation, SSI adds an algorithm for detecting serialization conflicts among +reads and writes, and determining which transactions to abort. + +### Decisions based on an outdated premise + +When we previously discussed write skew in snapshot isolation (see [“Write Skew and Phantoms”](/en/ch8#sec_transactions_write_skew)), +we observed a recurring pattern: a transaction reads some data from the database, examines the +result of the query, and decides to take some action (write to the database) based on the result +that it saw. However, under snapshot isolation, the result from the original query may no longer be +up-to-date by the time the transaction commits, because the data may have been modified in the +meantime. 
+ +Put another way, the transaction is taking an action based on a *premise* (a fact that was true at +the beginning of the transaction, e.g., “There are currently two doctors on call”). Later, when the +transaction wants to commit, the original data may have changed—the premise may no longer be +true. + +When the application makes a query (e.g., “How many doctors are currently on call?”), the database +doesn’t know how the application logic uses the result of that query. To be safe, the database needs +to assume that any change in the query result (the premise) means that writes in that transaction +may be invalid. In other words, there may be a causal dependency between the queries and the writes +in the transaction. In order to provide serializable isolation, the database must detect situations +in which a transaction may have acted on an outdated premise and abort the transaction in that case. + +How does the database know if a query result might have changed? There are two cases to consider: + +* Detecting reads of a stale MVCC object version (uncommitted write occurred before the read) +* Detecting writes that affect prior reads (the write occurs after the read) + +### Detecting stale MVCC reads + +Recall that snapshot isolation is usually implemented by multi-version concurrency control (MVCC; +see [“Multi-version concurrency control (MVCC)”](/en/ch8#sec_transactions_snapshot_impl)). When a transaction reads from a consistent snapshot in an +MVCC database, it ignores writes that were made by any other transactions that hadn’t yet committed +at the time when the snapshot was taken. + +In [Figure 8-10](/en/ch8#fig_transactions_detect_mvcc), transaction 43 sees +Aaliyah as having `on_call = true`, because transaction 42 (which modified Aaliyah’s on-call status) is +uncommitted. However, by the time transaction 43 wants to commit, transaction 42 has already +committed. 
This means that the write that was ignored when reading from the consistent snapshot has +now taken effect, and transaction 43’s premise is no longer true. Things get even more complicated +when a writer inserts data that didn’t exist before (see [“Phantoms causing write skew”](/en/ch8#sec_transactions_phantom)). We’ll +discuss detecting phantom writes for SSI in [“Detecting writes that affect prior reads”](/en/ch8#sec_detecting_writes_affect_reads). + +![ddia 0810](/fig/ddia_0810.png) + +###### Figure 8-10. Detecting when a transaction reads outdated values from an MVCC snapshot. + +In order to prevent this anomaly, the database needs to track when a transaction ignores another +transaction’s writes due to MVCC visibility rules. When the transaction wants to commit, the +database checks whether any of the ignored writes have now been committed. If so, the transaction +must be aborted. + +Why wait until committing? Why not abort transaction 43 immediately when the stale read is detected? +Well, if transaction 43 was a read-only transaction, it wouldn’t need to be aborted, because there +is no risk of write skew. At the time when transaction 43 makes its read, the database doesn’t yet +know whether that transaction is going to later perform a write. Moreover, transaction 42 may yet +abort or may still be uncommitted at the time when transaction 43 is committed, and so the read may +turn out not to have been stale after all. By avoiding unnecessary aborts, SSI preserves snapshot +isolation’s support for long-running reads from a consistent snapshot. + +### Detecting writes that affect prior reads + +The second case to consider is when another transaction modifies data after it has been read. This +case is illustrated in [Figure 8-11](/en/ch8#fig_transactions_detect_index_range). + +![ddia 0811](/fig/ddia_0811.png) + +###### Figure 8-11. In serializable snapshot isolation, detecting when one transaction modifies another transaction’s reads. 
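
The scenario in Figure 8-11 can be sketched with a small tracker that records which transactions have read which index entry, lets a later write notify those readers without blocking, and checks at commit time whether a conflicting writer has already committed. This is a heavily simplified model written for this example only: `SsiTracker`, its methods, and the `SerializationFailure` exception are hypothetical names, the cleanup of old tracking information is omitted, and production implementations use more refined rules to decide which transaction to abort.

```python
class SerializationFailure(Exception):
    pass


class SsiTracker:
    """Toy model of detecting writes that affect prior reads."""

    def __init__(self):
        self._readers = {}      # index key -> transactions that read it
        self._conflicts = {}    # transaction -> writers that touched its reads
        self._committed = set()

    def begin(self, txn):
        self._conflicts[txn] = set()

    def record_read(self, txn, key):
        self._readers.setdefault(key, set()).add(txn)

    def record_write(self, txn, key):
        # Nothing blocks here: the write only notifies earlier readers
        # that the data they saw may no longer be up to date.
        for reader in self._readers.get(key, ()):
            if reader != txn:
                self._conflicts[reader].add(txn)

    def commit(self, txn):
        # Abort only if a conflicting write has actually taken effect.
        if self._conflicts[txn] & self._committed:
            raise SerializationFailure(f"transaction {txn} acted on an outdated premise")
        self._committed.add(txn)


ssi = SsiTracker()
shift = ("shift_id", 1234)
for txn in (42, 43):
    ssi.begin(txn)
    ssi.record_read(txn, shift)   # both look up the on-call doctors for the shift

ssi.record_write(42, shift)       # notifies 43 that its read is outdated
ssi.record_write(43, shift)       # notifies 42 that its read is outdated

ssi.commit(42)                    # succeeds: 43 has not committed yet
try:
    ssi.commit(43)
    outcome_43 = "committed"
except SerializationFailure:
    outcome_43 = "aborted"
```

Transaction 42 commits because the conflicting write from 43 has not yet taken effect, while 43 is aborted because 42's conflicting write has already been committed by then.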
+ +In the context of two-phase locking we discussed index-range locks (see +[“Index-range locks”](/en/ch8#sec_transactions_2pl_range)), which allow the database to lock access to all rows matching some +search query, such as `WHERE shift_id = 1234`. We can use a similar technique here, except that SSI +locks don’t block other transactions. + +In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transactions 42 and 43 both search for on-call doctors +during shift `1234`. If there is an index on `shift_id`, the database can use the index entry 1234 to +record the fact that transactions 42 and 43 read this data. (If there is no index, this information +can be tracked at the table level.) This information only needs to be kept for a while: after a +transaction has finished (committed or aborted), and all concurrent transactions have finished, the +database can forget what data it read. + +When a transaction writes to the database, it must look in the indexes for any other transactions +that have recently read the affected data. This process is similar to acquiring a write lock on the affected +key range, but rather than blocking until the readers have committed, the lock acts as a tripwire: +it simply notifies the transactions that the data they read may no longer be up to date. + +In [Figure 8-11](/en/ch8#fig_transactions_detect_index_range), transaction 43 notifies transaction 42 that its prior +read is outdated, and vice versa. Transaction 42 is first to commit, and it is successful: although +transaction 43’s write affected 42, 43 hasn’t yet committed, so the write has not yet taken effect. +However, when transaction 43 wants to commit, the conflicting write from 42 has already been +committed, so 43 must abort. + +### Performance of serializable snapshot isolation + +As always, many engineering details affect how well an algorithm works in practice. For example, one +trade-off is the granularity at which transactions’ reads and writes are tracked. 
If the database +keeps track of each transaction’s activity in great detail, it can be precise about which +transactions need to abort, but the bookkeeping overhead can become significant. Less detailed +tracking is faster, but may lead to more transactions being aborted than strictly necessary. + +In some cases, it’s okay for a transaction to read information that was overwritten by another +transaction: depending on what else happened, it’s sometimes possible to prove that the result of +the execution is nevertheless serializable. PostgreSQL uses this theory to reduce the number of +unnecessary aborts [[14](/en/ch8#Fekete2005), +[54](/en/ch8#Ports2012)]. + +Compared to two-phase locking, the big advantage of serializable snapshot isolation is that one +transaction doesn’t need to block waiting for locks held by another transaction. Like under snapshot +isolation, writers don’t block readers, and vice versa. This design principle makes query latency +much more predictable and less variable. In particular, read-only queries can run on a consistent +snapshot without requiring any locks, which is very appealing for read-heavy workloads. + +Compared to serial execution, serializable snapshot isolation is not limited to the throughput of a +single CPU core: for example, FoundationDB distributes the detection of serialization conflicts across multiple +machines, allowing it to scale to very high throughput. Even though data may be sharded across +multiple machines, transactions can read and write data in multiple shards while ensuring +serializable isolation. + +Compared to non-serializable snapshot isolation, the need to check for serializability violations +introduces some performance overheads. 
How significant these overheads are is a matter of debate: +some believe that serializability checking is not worth it +[[70](/en/ch8#Brooker2024snapshot)], +while others believe that the performance of serializability is now so good that there is no need to +use the weaker snapshot isolation any more [[67](/en/ch8#Neumann2015)]. + +The rate of aborts significantly affects the overall performance of SSI. For example, a transaction +that reads and writes data over a long period of time is likely to run into conflicts and abort, so +SSI requires that read-write transactions be fairly short (long-running read-only transactions are +okay). However, SSI is less sensitive to slow transactions than two-phase locking or serial +execution. + +# Distributed Transactions + +The last few sections have focused on concurrency control for isolation, the I in ACID. The +algorithms we have seen apply to both single-node and distributed databases: although there are +challenges in making concurrency control algorithms scalable (for example, performing distributed +serializability checking for SSI), the high-level ideas for distributed concurrency control are +similar to single-node concurrency control +[[8](/en/ch8#Zhou2021_ch8)]. + +Consistency and durability also don’t change much when we move to distributed transactions. However, +atomicity requires more care. + +For transactions that execute at a single database node, atomicity is commonly implemented by the +storage engine. When the client asks the database node to commit the transaction, the database makes +the transaction’s writes durable (typically in a write-ahead log; see [“Making B-trees reliable”](/en/ch4#sec_storage_btree_wal)) and +then appends a commit record to the log on disk. 
If the database crashes in the middle of this +process, the transaction is recovered from the log when the node restarts: if the commit record was +successfully written to disk before the crash, the transaction is considered committed; if not, any +writes from that transaction are rolled back. + +Thus, on a single node, transaction commitment crucially depends on the *order* in which data is +durably written to disk: first the data, then the commit record +[[22](/en/ch8#Pillai2014)]. +The key deciding moment for whether the transaction commits or aborts is the moment at which the +disk finishes writing the commit record: before that moment, it is still possible to abort (due to a +crash), but after that moment, the transaction is committed (even if the database crashes). Thus, it +is a single device (the controller of one particular disk drive, attached to one particular node) +that makes the commit atomic. + +However, what if multiple nodes are involved in a transaction? For example, perhaps you have a +multi-object transaction in a sharded database, or a global secondary index (in which the +index entry may be on a different node from the primary data; see +[“Sharding and Secondary Indexes”](/en/ch7#sec_sharding_secondary_indexes)). Most “NoSQL” distributed datastores do not support such +distributed transactions, but various distributed relational databases do. + +In these cases, it is not sufficient to simply send a commit request to all of the nodes and +independently commit the transaction on each one. It could easily happen that the commit succeeds on +some nodes and fails on other nodes, as shown in [Figure 8-12](/en/ch8#fig_transactions_non_atomic): + +* Some nodes may detect a constraint violation or conflict, making an abort necessary, while other + nodes are successfully able to commit. +* Some of the commit requests might be lost in the network, eventually aborting due to a timeout, + while other commit requests get through. 
+* Some nodes may crash before the commit record is fully written and roll back on recovery, while + others successfully commit. + +![ddia 0812](/fig/ddia_0812.png) + +###### Figure 8-12. When a transaction involves multiple database nodes, it may commit on some and fail on others. + +If some nodes commit the transaction but others abort it, the nodes become inconsistent with each +other. And once a transaction has been committed on one node, it cannot be retracted again if it +later turns out that it was aborted on another node. This is because once data has been committed, +it becomes visible to other transactions under *read committed* or stronger isolation. For example, +in [Figure 8-12](/en/ch8#fig_transactions_non_atomic), by the time user 1 notices that its commit failed on database 1, +user 2 has already read the data from the same transaction on database 2. If user 1’s transaction +was later aborted, user 2’s transaction would have to be reverted as well, since it was based on +data that was retroactively declared not to have existed. + +A better approach is to ensure that the nodes involved in a transaction either all commit or all +abort, and to prevent a mixture of the two. Ensuring this is known as the *atomic commitment* +problem. + +## Two-Phase Commit (2PC) + +Two-phase commit is an algorithm for achieving atomic transaction commit across multiple nodes. It +is a classic algorithm in distributed databases +[[13](/en/ch8#Bernstein1987_ch8), +[71](/en/ch8#Lindsay1979_ch8), +[72](/en/ch8#Mohan1986)]. 2PC is used +internally in some databases and also made available to applications in the form of *XA transactions* +[[73](/en/ch8#XASpec1991)] +(which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP +web services +[[74](/en/ch8#Neto2008), +[75](/en/ch8#Johnson2004)]. + +The basic flow of 2PC is illustrated in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit). 
Instead of a single +commit request, as with a single-node transaction, the commit/abort process in 2PC is split into two +phases (hence the name). + +![ddia 0813](/fig/ddia_0813.png) + +###### Figure 8-13. A successful execution of two-phase commit (2PC). + +2PC uses a new component that does not normally appear in single-node transactions: a +*coordinator* (also known as *transaction manager*). The coordinator is often implemented as a +library within the same application process that is requesting the transaction (e.g., embedded in a +Java EE container), but it can also be a separate process or service. Examples of such coordinators +include Narayana, JOTM, BTM, or MSDTC. + +When 2PC is used, a distributed +transaction begins with the application reading and writing data on multiple database nodes, +as normal. We call these database nodes *participants* in the transaction. When the application is +ready to commit, the coordinator begins phase 1: it sends a *prepare* request to each of the nodes, +asking them whether they are able to commit. The coordinator then tracks the responses from the +participants: + +* If all participants reply “yes,” indicating they are ready to commit, then the coordinator sends + out a *commit* request in phase 2, and the commit actually takes place. +* If any of the participants replies “no,” the coordinator sends an *abort* request to all nodes in + phase 2. + +This process is somewhat like the traditional marriage ceremony in Western cultures: the minister +asks the bride and groom individually whether each wants to marry the other, and typically receives +the answer “I do” from both. After receiving both acknowledgments, the minister pronounces the +couple husband and wife: the transaction is committed, and the happy fact is broadcast to all +attendees. If either bride or groom does not say “yes,” the ceremony is aborted +[[76](/en/ch8#Gray1981_ch8)]. 
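The two phases described above can be sketched in a few lines of Python. This is a minimal illustration under simplifying assumptions, not how any particular coordinator is implemented: the `Participant` class, its `can_commit` flag, and the `two_phase_commit` function are hypothetical names, everything runs in one process, and durable logging, timeouts, and retries are left out.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    """Hypothetical in-memory participant; a real one would be a database node."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit
        self.state = "active"

    def prepare(self):
        # Phase 1: vote on whether this node is able to commit.
        self.state = "prepared" if self.can_commit else "aborted"
        return Vote.YES if self.can_commit else Vote.NO

    def commit(self):
        self.state = "committed"

    def abort(self):
        self.state = "aborted"


def two_phase_commit(participants):
    """Run both phases and return the coordinator's decision."""
    # Phase 1: send a prepare request to every participant and collect votes.
    votes = [p.prepare() for p in participants]

    # Phase 2: commit only if every vote was "yes"; otherwise abort everywhere.
    if all(v is Vote.YES for v in votes):
        for p in participants:
            p.commit()
        return "commit"
    for p in participants:
        p.abort()
    return "abort"
```

A single "no" vote is enough to abort the transaction on every node, including the nodes that voted "yes".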
+ +### A system of promises + +From this short description it might not be clear why two-phase commit ensures atomicity, while +one-phase commit across several nodes does not. Surely the prepare and commit requests can just +as easily be lost in the two-phase case. What makes 2PC different? + +To understand why it works, we have to break down the process in a bit more detail: + +1. When the application wants to begin a distributed transaction, it requests a transaction ID from + the coordinator. This transaction ID is globally unique. +2. The application begins a single-node transaction on each of the participants, and attaches the + globally unique transaction ID to the single-node transaction. All reads and writes are done in + one of these single-node transactions. If anything goes wrong at this stage (for example, a node + crashes or a request times out), the coordinator or any of the participants can abort. +3. When the application is ready to commit, the coordinator sends a prepare request to all + participants, tagged with the global transaction ID. If any of these requests fails or times out, + the coordinator sends an abort request for that transaction ID to all participants. +4. When a participant receives the prepare request, it makes sure that it can definitely commit + the transaction under all circumstances. + + This includes writing all transaction data to disk (a crash, a power failure, or running out of + disk space is not an acceptable excuse for refusing to commit later), and checking for any + conflicts or constraint violations. By replying “yes” to the coordinator, the node promises to + commit the transaction without error if requested. In other words, the participant surrenders the + right to abort the transaction, but without actually committing it. +5. 
When the coordinator has received responses to all prepare requests, it makes a definitive + decision on whether to commit or abort the transaction (committing only if all participants voted + “yes”). The coordinator must write that decision to its transaction log on disk so that it knows + which way it decided in case it subsequently crashes. This is called the *commit point*. +6. Once the coordinator’s decision has been written to disk, the commit or abort request is sent + to all participants. If this request fails or times out, the coordinator must retry forever until + it succeeds. There is no more going back: if the decision was to commit, that decision must be + enforced, no matter how many retries it takes. If a participant has crashed in the meantime, the + transaction will be committed when it recovers—since the participant voted “yes,” it cannot + refuse to commit when it recovers. + +Thus, the protocol contains two crucial “points of no return”: when a participant votes “yes,” it +promises that it will definitely be able to commit later (although the coordinator may still choose to +abort); and once the coordinator decides, that decision is irrevocable. Those promises ensure the +atomicity of 2PC. (Single-node atomic commit lumps these two events into one: writing the commit +record to the transaction log.) + +Returning to the marriage analogy, before saying “I do,” you and your bride/groom have the freedom +to abort the transaction by saying “No way!” (or something to that effect). However, after saying “I +do,” you cannot retract that statement. If you faint after saying “I do” and you don’t hear the +minister speak the words “You are now husband and wife,” that doesn’t change the fact that the +transaction was committed. 
When you recover consciousness later, you can find out whether you are +married or not by querying the minister for the status of your global transaction ID, or you can +wait for the minister’s next retry of the commit request (since the retries will have continued +throughout your period of unconsciousness). + +### Coordinator failure + +We have discussed what happens if one of the participants or the network fails during 2PC: if any of +the prepare requests fails or times out, the coordinator aborts the transaction; if any of the +commit or abort requests fails, the coordinator retries them indefinitely. However, it is less +clear what happens if the coordinator crashes. + +If the coordinator fails before sending the prepare requests, a participant can safely abort the +transaction. But once the participant has received a prepare request and voted “yes,” it can no +longer abort unilaterally—it must wait to hear back from the coordinator whether the transaction +was committed or aborted. If the coordinator crashes or the network fails at this point, the +participant can do nothing but wait. A participant’s transaction in this state is called *in doubt* +or *uncertain*. + +The situation is illustrated in [Figure 8-14](/en/ch8#fig_transactions_2pc_crash). In this particular example, the +coordinator actually decided to commit, and database 2 received the commit request. However, the +coordinator crashed before it could send the commit request to database 1, and so database 1 does +not know whether to commit or abort. Even a timeout does not help here: if database 1 unilaterally +aborts after a timeout, it will end up inconsistent with database 2, which has committed. Similarly, +it is not safe to unilaterally commit, because another participant may have aborted. + +![ddia 0814](/fig/ddia_0814.png) + +###### Figure 8-14. The coordinator crashes after participants vote “yes.” Database 1 does not know whether to commit or abort. 
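The participant's side of this rule can be sketched as a small state machine. This is an illustrative sketch only: the `Participant` class, its method names, and the `InDoubt` exception are hypothetical and do not come from any real database or from the XA API.

```python
class InDoubt(Exception):
    """Raised when a prepared participant cannot decide on its own."""


class Participant:
    def __init__(self):
        self.state = "active"  # active -> prepared -> committed/aborted

    def on_prepare(self):
        # Voting "yes" gives up the right to abort unilaterally.
        self.state = "prepared"
        return "yes"

    def on_decision(self, decision):
        # Only the coordinator's decision can resolve a prepared transaction.
        self.state = "committed" if decision == "commit" else "aborted"

    def on_timeout(self):
        if self.state == "active":
            # No promise has been made yet, so aborting is safe.
            self.state = "aborted"
        elif self.state == "prepared":
            # Neither commit nor abort is safe: another participant may
            # already have done the opposite.
            raise InDoubt("must wait for the coordinator")
```

The sketch shows the asymmetry: a timeout before voting leads to a safe abort, while a timeout after voting "yes" leaves the transaction in doubt until the coordinator's decision arrives.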
+ +Without hearing from the coordinator, the participant has no way of knowing whether to commit or +abort. In principle, the participants could communicate among themselves to find out how each +participant voted and come to some agreement, but that is not part of the 2PC protocol. + +The only way 2PC can complete is by waiting for the coordinator to recover. This is why the +coordinator must write its commit or abort decision to a transaction log on disk before sending +commit or abort requests to participants: when the coordinator recovers, it determines the status of +all in-doubt transactions by reading its transaction log. Any transactions that don’t have a commit +record in the coordinator’s log are aborted. Thus, the commit point of 2PC comes down to a regular +single-node atomic commit on the coordinator. + +### Three-phase commit + +Two-phase commit is called a *blocking* atomic commit protocol due to the fact that 2PC can become +stuck waiting for the coordinator to recover. It is possible to make an atomic commit protocol +*nonblocking*, so that it does not get stuck if a node fails. However, making this work in practice +is not so straightforward. + +As an alternative to 2PC, an algorithm called *three-phase commit* (3PC) has been proposed +[[13](/en/ch8#Bernstein1987_ch8), +[77](/en/ch8#Skeen1981)]. +However, 3PC assumes a network with bounded delay and nodes with bounded response times; in most +practical systems with unbounded network delay and process pauses (see [Chapter 9](/en/ch9#ch_distributed)), it +cannot guarantee atomicity. + +A better solution in practice is to replace the single-node coordinator with a fault-tolerant +consensus protocol. We will see how to do this in [Chapter 10](/en/ch10#ch_consistency). + +## Distributed Transactions Across Different Systems + +Distributed transactions and two-phase commit have a mixed reputation. 
On the one hand, they are +seen as providing an important safety guarantee that would be hard to achieve otherwise; on the +other hand, they are criticized for causing operational problems, killing performance, and promising +more than they can deliver [[78](/en/ch8#Hohpe2005), +[79](/en/ch8#Helland2007_ch8), +[80](/en/ch8#Oliver2011), +[81](/en/ch8#Rahien2014)]. +Many cloud services choose not to implement distributed transactions due to the operational +problems they engender [[82](/en/ch8#Vasters2012)]. + +Some implementations of distributed transactions carry a heavy performance penalty. Much of the +performance cost inherent in two-phase commit is due to the additional disk forcing (`fsync`) that +is required for crash recovery, and the additional network round-trips. + +However, rather than dismissing distributed transactions outright, we should examine them in some +more detail, because there are important lessons to be learned from them. To begin, we should be +precise about what we mean by “distributed transactions.” Two quite different types of distributed +transactions are often conflated: + +Database-internal distributed transactions +: Some distributed databases (i.e., databases that use replication and sharding in their standard + configuration) support internal transactions among the nodes of that database. For example, + YugabyteDB, TiDB, FoundationDB, Spanner, VoltDB, and MySQL Cluster’s NDB storage engine have such + internal transaction support. In this case, all the nodes participating in the transaction are + running the same database software. + +Heterogeneous distributed transactions +: In a *heterogeneous* transaction, the participants are two or more different technologies: for + example, two databases from different vendors, or even non-database systems such as message + brokers. A distributed transaction across these systems must ensure atomic commit, even though + the systems may be entirely different under the hood. 
+ +Database-internal transactions do not have to be compatible with any other system, so they can +use any protocol and apply optimizations specific to that particular technology. For that reason, +database-internal distributed transactions can often work quite well. On the other hand, +transactions spanning heterogeneous technologies are a lot more challenging. + +### Exactly-once message processing + +Heterogeneous distributed transactions allow diverse systems to be integrated in powerful ways. For +example, a message from a message queue can be acknowledged as processed if and only if the database +transaction for processing the message was successfully committed. This is implemented by atomically +committing the message acknowledgment and the database writes in a single transaction. With +distributed transaction support, this is possible, even if the message broker and the database are +two unrelated technologies running on different machines. + +If either the message delivery or the database transaction fails, both are aborted, and so the +message broker may safely redeliver the message later. Thus, by atomically committing the message +and the side effects of its processing, we can ensure that the message is *effectively* processed +exactly once, even if it required a few retries before it succeeded. The abort discards any side +effects of the partially completed transaction. This is known as *exactly-once semantics*. + +Such a distributed transaction is only possible if all systems affected by the transaction are able +to use the same atomic commit protocol, however. For example, say a side effect of processing a +message is to send an email, and the email server does not support two-phase commit: it could happen +that the email is sent two or more times if message processing fails and is retried. But if all side +effects of processing a message are rolled back on transaction abort, then the processing step can +safely be retried as if nothing had happened. 
+ +We will return to the topic of exactly-once semantics later in this chapter. Let’s look first at the +atomic commit protocol that allows such heterogeneous distributed transactions. + +### XA transactions + +*X/Open XA* (short for *eXtended Architecture*) is a standard for implementing two-phase commit +across heterogeneous technologies [[73](/en/ch8#XASpec1991)]. +It was introduced in 1991 and has been widely +implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL, +Db2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ). + +XA is not a network protocol—it is merely a C API for interfacing with a transaction coordinator. +Bindings for this API exist in other languages; for example, in the world of Java EE applications, +XA transactions are implemented using the Java Transaction API (JTA), which in turn is supported by +many drivers for databases using Java Database Connectivity (JDBC) and drivers for message brokers +using the Java Message Service (JMS) APIs. + +XA assumes that your application uses a network driver or client library to communicate with the +participant databases or messaging services. If the driver supports XA, that means it calls the XA +API to find out whether an operation should be part of a distributed transaction—and if so, it +sends the necessary information to the database server. The driver also exposes callbacks through +which the coordinator can ask the participant to prepare, commit, or abort. + +The transaction coordinator implements the XA API. The standard does not specify how it should be +implemented, but in practice the coordinator is often simply a library that is loaded into the same +process as the application issuing the transaction (not a separate service). 
It keeps track of the +participants in a transaction, collects participants’ responses after asking them to prepare (via a +callback into the driver), and uses a log on the local disk to keep track of the commit/abort +decision for each transaction. + +If the application process crashes, or the machine on which the application is running dies, the +coordinator goes with it. Any participants with prepared but uncommitted transactions are then stuck +in doubt. Since the coordinator’s log is on the application server’s local disk, that server must be +restarted, and the coordinator library must read the log to recover the commit/abort outcome of each +transaction. Only then can the coordinator use the database driver’s XA callbacks to ask +participants to commit or abort, as appropriate. The database server cannot contact the coordinator +directly, since all communication must go via its client library. + +### Holding locks while in doubt + +Why do we care so much about a transaction being stuck in doubt? Can’t the rest of the system just +get on with its work, and ignore the in-doubt transaction that will be cleaned up eventually? + +The problem is with *locking*. As discussed in [“Read Committed”](/en/ch8#sec_transactions_read_committed), database +transactions usually take a row-level exclusive lock on any rows they modify, to prevent dirty +writes. In addition, if you want serializable isolation, a database using two-phase locking would +also have to take a shared lock on any rows *read* by the transaction. + +The database cannot release those locks until the transaction commits or aborts (illustrated as a +shaded area in [Figure 8-13](/en/ch8#fig_transactions_two_phase_commit)). Therefore, when using two-phase commit, a +transaction must hold onto the locks throughout the time it is in doubt. If the coordinator has +crashed and takes 20 minutes to start up again, those locks will be held for 20 minutes. 
If the +coordinator’s log is entirely lost for some reason, those locks will be held forever—or at least +until the situation is manually resolved by an administrator. + +While those locks are held, no other transaction can modify those rows. Depending on the isolation +level, other transactions may even be blocked from reading those rows. Thus, other transactions +cannot simply continue with their business—if they want to access that same data, they will be +blocked. This can cause large parts of your application to become unavailable until the in-doubt +transaction is resolved. + +### Recovering from coordinator failure + +In theory, if the coordinator crashes and is restarted, it should cleanly recover its state from the +log and resolve any in-doubt transactions. However, in practice, *orphaned* in-doubt transactions do +occur [[83](/en/ch8#Dhariwal2008), +[84](/en/ch8#Randal2013)]—that is, +transactions for which the coordinator cannot decide the outcome for whatever reason (e.g., because +the transaction log has been lost or corrupted due to a software bug). These transactions cannot be +resolved automatically, so they sit forever in the database, holding locks and blocking other +transactions. + +Even rebooting your database servers will not fix this problem, since a correct implementation of +2PC must preserve the locks of an in-doubt transaction even across restarts (otherwise it would risk +violating the atomicity guarantee). It’s a sticky situation. + +The only way out is for an administrator to manually decide whether to commit or roll back the +transactions. The administrator must examine the participants of each in-doubt transaction, +determine whether any participant has committed or aborted already, and then apply the same outcome +to the other participants. 
Resolving the problem potentially requires a lot of manual effort, and +most likely needs to be done under high stress and time pressure during a serious production outage +(otherwise, why would the coordinator be in such a bad state?). + +Many XA implementations have an emergency escape hatch called *heuristic decisions*: allowing a +participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive +decision from the coordinator [[73](/en/ch8#XASpec1991)]. To be clear, +*heuristic* here is a euphemism for *probably breaking atomicity*, since the heuristic decision +violates the system of promises in two-phase commit. Thus, heuristic decisions are intended only for +getting out of catastrophic situations, and not for regular use. + +### Problems with XA transactions + +A single-node coordinator is a single point of failure for the entire system, and making it part of +the application server is also problematic because the coordinator’s logs on its local disk become a +crucial part of the durable system state—as important as the databases themselves. + +In principle, the coordinator of an XA transaction could be highly available and replicated, just +like we would expect of any other important database. Unfortunately, this still doesn’t solve a +fundamental problem with XA, which is that it provides no way for the coordinator and the +participants of a transaction to communicate with each other directly. They can only communicate via +the application code that invoked the transaction, and the database drivers through which it calls +the participants. + +Even if the coordinator were replicated, the application code would therefore be a single point of +failure. Solving this problem would require totally redesigning how application code is run to make +it replicated or restartable, which could perhaps look similar to durable execution (see +[“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows)). 
However, there don’t seem to be any tools that actually take +this approach in practice. + +Another problem is that since XA needs to be compatible with a wide range of data systems, it is +necessarily a lowest common denominator. For example, it cannot detect deadlocks across different +systems (since that would require a standardized protocol for systems to exchange information on the +locks that each transaction is waiting for), and it does not work with SSI (see +[“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)), since that would require a protocol for identifying conflicts across +different systems. + +These problems are somewhat inherent in performing transactions across heterogeneous technologies. +However, keeping several heterogeneous data systems consistent with each other is still a real and +important problem, so we need to find a different solution to it. This can be done, as we will see +in the next section and in [Link to Come]. + +## Database-internal Distributed Transactions + +As explained previously, there is a big difference between distributed transactions that span +multiple heterogeneous storage technologies, and those that are internal to a system—i.e., where all +the participating nodes are shards of the same database running the same software. Such internal +distributed transactions are a defining feature of “NewSQL” databases such as +CockroachDB [[5](/en/ch8#Taft2020_ch8)], +TiDB [[6](/en/ch8#Huang2020)], +Spanner [[7](/en/ch8#Corbett2012_ch8)], +FoundationDB [[8](/en/ch8#Zhou2021_ch8)], and YugabyteDB, for +example. Some message brokers such as Kafka also support internal distributed transactions +[[85](/en/ch8#Wang2021)]. + +Many of these systems use 2-phase commit to ensure atomicity of transactions that write to multiple +shards, and yet they don’t suffer the same problems as XA transactions. 
The reason is that +their distributed transactions don’t need to interface with any other technologies, so they avoid the +lowest-common-denominator trap: the designers of these systems are free to use better protocols that +are more reliable and faster. + +The biggest problems with XA can be fixed by: + +* Replicating the coordinator, with automatic failover to another coordinator node if the primary + one crashes; +* Allowing the coordinator and data shards to communicate directly without going via application + code; +* Replicating the participating shards, so that the risk of having to abort a transaction because of + a fault in one of the shards is reduced; and +* Coupling the atomic commitment protocol with a distributed concurrency control protocol that + supports deadlock detection and consistent reads across shards. + +Consensus algorithms are commonly used to replicate the coordinator and the database shards. We will +see in [Chapter 10](/en/ch10#ch_consistency) how atomic commitment for distributed transactions can be implemented +using a consensus algorithm. These algorithms tolerate faults by automatically failing over from one +node to another without any human intervention, while continuing to guarantee strong consistency +properties. + +The isolation levels offered for distributed transactions depend on the system, but snapshot +isolation and serializable snapshot isolation are both possible across shards. The details of how +this works can be found in the papers referenced at the end of this chapter. + +### Exactly-once message processing revisited + +We saw in [“Exactly-once message processing”](/en/ch8#sec_transactions_exactly_once) that an important use case for distributed transactions +is to ensure that some operation takes effect exactly once, even if a crash occurs while it is being +processed and the processing needs to be retried.
If you can atomically commit a transaction across +a message broker and a database, you can acknowledge the message to the broker if and only if it was +successfully processed and the database writes resulting from the processing were committed. + +However, you don’t actually need such distributed transactions to achieve exactly-once semantics. An +alternative approach, which only requires transactions within the database, is as follows: + +1. Assume every message has a unique ID, and in the database you have a table of message IDs that + have been processed. When you start processing a message from the broker, you begin a new + transaction on the database, and check the message ID. If the same message ID is already present + in the database, you know that it has already been processed, so you can acknowledge the message + to the broker and drop it. +2. If the message ID is not already in the database, you add it to the table. You then process the + message, which may result in additional writes to the database within the same transaction. When + you finish processing the message, you commit the transaction on the database. +3. Once the database transaction is successfully committed, you can acknowledge the message to the + broker. +4. Once the message has successfully been acknowledged to the broker, you know that the broker won’t + retry the same message, so you can delete the message ID from the database (in a + separate transaction). + +If the message processor crashes before committing the database transaction, the transaction is +aborted and the message broker will retry processing. If the processor crashes after committing but before +acknowledging the message to the broker, the broker will also retry processing, but the retry will see the +message ID in the database and drop the message.
If the message processor crashes after acknowledging the message but before +deleting the message ID from the database, you will have an old message ID lying around, which +doesn’t do any harm besides taking a little bit of storage space. If a retry happens before the +database transaction is aborted (which could happen if communication between the message processor +and the database is interrupted), a uniqueness constraint on the table of message IDs should prevent +the same message ID from being inserted by two concurrent transactions. + +Thus, achieving exactly-once processing only requires transactions within the database; atomicity +across database and message broker is not necessary for this use case. Recording the message ID in +the database makes the message processing *idempotent*, so that message processing can be safely +retried without duplicating its side effects. A similar approach is used in stream processing +frameworks such as Kafka Streams to achieve exactly-once semantics, as we shall see in +[Link to Come]. + +However, internal distributed transactions within the database are still useful for the scalability +of patterns such as these: for example, they would allow the message IDs to be stored on one shard +and the main data updated by the message processing to be stored on other shards, while still ensuring +atomicity of the transaction commit across those shards. + +# Summary + +Transactions are an abstraction layer that allows an application to pretend that certain concurrency +problems and certain kinds of hardware and software faults don’t exist. A large class of errors is +reduced to a simple *transaction abort*, and the application just needs to try again. + +In this chapter we saw many examples of problems that transactions help prevent. Not all +applications are susceptible to all those problems: an application with very simple access patterns, +such as reading and writing only a single record, can probably manage without transactions.
However, +for more complex access patterns, transactions can hugely reduce the number of potential error cases +you need to think about. + +Without transactions, various error scenarios (processes crashing, network interruptions, power +outages, disk full, unexpected concurrency, etc.) mean that data can become inconsistent in various +ways. For example, denormalized data can easily go out of sync with the source data. Without +transactions, it becomes very difficult to reason about the effects that complex interacting accesses +can have on the database. + +In this chapter, we went particularly deep into the topic of concurrency control. We discussed +several widely used isolation levels, in particular *read committed*, *snapshot isolation* +(sometimes called *repeatable read*), and *serializable*. We characterized those isolation levels by +discussing various examples of race conditions, summarized in [Table 8-1](/en/ch8#ch_transactions_isolation_levels): + +Table 8-1. Summary of anomalies that can occur at various isolation levels + +| Isolation level | Dirty reads | Read skew | Phantom reads | Lost updates | Write skew | +| --- | --- | --- | --- | --- | --- | +| Read uncommitted | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | +| Read committed | ✓ Prevented | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | +| Snapshot isolation | ✓ Prevented | ✓ Prevented | ✓ Prevented | ? Depends | ✗ Possible | +| Serializable | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | + +Dirty reads +: One client reads another client’s writes before they have been committed. The read committed + isolation level and stronger levels prevent dirty reads. + +Dirty writes +: One client overwrites data that another client has written, but not yet committed. Almost all + transaction implementations prevent dirty writes. + +Read skew +: A client sees different parts of the database at different points in time. 
Some cases of read + skew are also known as *nonrepeatable reads*. This issue is most commonly prevented with snapshot + isolation, which allows a transaction to read from a consistent snapshot corresponding to one + particular point in time. It is usually implemented with *multi-version concurrency control* + (MVCC). + +Lost updates +: Two clients concurrently perform a read-modify-write cycle. One overwrites the other’s write + without incorporating its changes, so data is lost. Some implementations of snapshot isolation + prevent this anomaly automatically, while others require a manual lock (`SELECT FOR UPDATE`). + +Write skew +: A transaction reads something, makes a decision based on the value it saw, and writes the decision + to the database. However, by the time the write is made, the premise of the decision is no longer + true. Only serializable isolation prevents this anomaly. + +Phantom reads +: A transaction reads objects that match some search condition. Another client makes a write that + affects the results of that search. Snapshot isolation prevents straightforward phantom reads, but + phantoms in the context of write skew require special treatment, such as index-range locks. + +Weak isolation levels protect against some of those anomalies but leave you, the application +developer, to handle others manually (e.g., using explicit locking). Only serializable isolation +protects against all of these issues. We discussed three different approaches to implementing +serializable transactions: + +Literally executing transactions in a serial order +: If you can make each transaction very fast to execute (typically by using stored procedures), and + the transaction throughput is low enough to process on a single CPU core or can be sharded, this + is a simple and effective option. + +Two-phase locking +: For decades this has been the standard way of implementing serializability, but many applications + avoid using it because of its poor performance. 
+ +Serializable snapshot isolation (SSI) +: A comparatively new algorithm that avoids most of the downsides of the previous approaches. It + uses an optimistic approach, allowing transactions to proceed without blocking. When a transaction + wants to commit, it is checked, and it is aborted if the execution was not serializable. + +Finally, we examined how to achieve atomicity when a transaction is distributed across multiple +nodes, using two-phase commit. If those nodes are all running the same database software, +distributed transactions can work quite well, but across different storage technologies (using XA +transactions), 2PC is problematic: it is very sensitive to faults in the coordinator and the +application code driving the transaction, and it interacts poorly with concurrency control +mechanisms. Fortunately, idempotence can ensure exactly-once semantics without requiring atomic +commit across different storage technologies, and we will see more on this in later chapters. + +The examples in this chapter used a relational data model. However, as discussed in +[“The need for multi-object transactions”](/en/ch8#sec_transactions_need), transactions are a valuable database feature, no matter which data model +is used. + +##### Footnotes + +##### References + +[[1](/en/ch8#Murdoch2021-marker)] Steven J. Murdoch. +[What +went wrong with Horizon: learning from the Post Office Trial](https://www.benthamsgaze.org/2021/07/15/what-went-wrong-with-horizon-learning-from-the-post-office-trial/). *benthamsgaze.org*, July 2021. +Archived at [perma.cc/CNM4-553F](https://perma.cc/CNM4-553F) + +[[2](/en/ch8#Chamberlin1981-marker)] Donald D. Chamberlin, Morton M. Astrahan, +Michael W. Blasgen, James N. Gray, W. Frank King, Bruce G. Lindsay, Raymond Lorie, James W. Mehl, +Thomas G. Price, Franco Putzolu, Patricia Griffiths Selinger, Mario Schkolnick, Donald R. Slutz, +Irving L. Traiger, Bradford W. Wade, and Robert A. Yost. 
+[A History and Evaluation of System +R](https://dsf.berkeley.edu/cs262/2005/SystemR.pdf). *Communications of the ACM*, volume 24, issue 10, pages 632–646, October 1981. +[doi:10.1145/358769.358784](https://doi.org/10.1145/358769.358784) + +[[3](/en/ch8#Gray1976-marker)] Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger. +[Granularity of +Locks and Degrees of Consistency in a Shared Data Base](https://citeseerx.ist.psu.edu/pdf/e127f0a6a912bb9150ecfe03c0ebf7fbc289a023). In *Modelling in Data Base Management +Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management +Systems*, edited by G. M. Nijssen, pages 364–394, Elsevier/North Holland Publishing, 1976. Also +in *Readings in Database Systems*, 4th edition, edited by Joseph M. Hellerstein and Michael +Stonebraker, MIT Press, 2005. ISBN: 978-0-262-69314-1 + +[[4](/en/ch8#Eswaran1976-marker)] Kapali P. Eswaran, Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger. +[The +Notions of Consistency and Predicate Locks in a Database System](https://jimgray.azurewebsites.net/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf?from=https://research.microsoft.com/en-us/um/people/gray/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf). *Communications of the +ACM*, volume 19, issue 11, pages 624–633, November 1976. +[doi:10.1145/360363.360369](https://doi.org/10.1145/360363.360369) + +[[5](/en/ch8#Taft2020_ch8-marker)] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan +VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul +Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. +[CockroachDB: The Resilient +Geo-Distributed SQL Database](https://dl.acm.org/doi/pdf/10.1145/3318464.3386134).
At *ACM SIGMOD International Conference on Management of +Data* (SIGMOD), pages 1493–1509, June 2020. +[doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134) + +[[6](/en/ch8#Huang2020-marker)] Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, +Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, +Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan +Pei, and Xin Tang. +[TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf). +*Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084. +[doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535) + +[[7](/en/ch8#Corbett2012_ch8-marker)] James C. Corbett, Jeffrey Dean, +Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, +Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, +Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, +Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. +[Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). +At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), +October 2012. + +[[8](/en/ch8#Zhou2021_ch8-marker)] Jingyu Zhou, Meng Xu, Alexander +Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty +Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, +Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. +[FoundationDB: A Distributed Unbundled +Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data* +(SIGMOD), June 2021. 
+[doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559) + +[[9](/en/ch8#Harder1983-marker)] Theo Härder and Andreas Reuter. +[Principles of +Transaction-Oriented Database Recovery](https://citeseerx.ist.psu.edu/pdf/11ef7c142295aeb1a28a0e714c91fc8d610c3047). *ACM Computing Surveys*, volume 15, issue 4, +pages 287–317, December 1983. [doi:10.1145/289.291](https://doi.org/10.1145/289.291) + +[[10](/en/ch8#Bailis2013HAT-marker)] Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph +M. Hellerstein, and Ion Stoica. +[HAT, not CAP: +Towards Highly Available Transactions](https://www.usenix.org/system/files/conference/hotos13/hotos13-final80.pdf). At *14th USENIX Workshop on Hot Topics in Operating +Systems* (HotOS), May 2013. + +[[11](/en/ch8#Fox1997-marker)] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric +A. Brewer, and Paul Gauthier. +[Cluster-Based Scalable Network +Services](https://people.eecs.berkeley.edu/~brewer/cs262b/TACC.pdf). At *16th ACM Symposium on Operating Systems Principles* (SOSP), October 1997. +[doi:10.1145/268998.266662](https://doi.org/10.1145/268998.266662) + +[[12](/en/ch8#Andrews2004-marker)] Tony Andrews. +[Enforcing +Complex Constraints in Oracle](https://tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html). *tonyandrews.blogspot.co.uk*, October 2004. Archived at +[archive.org](https://web.archive.org/web/20220201190625/https%3A//tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html) + +[[13](/en/ch8#Bernstein1987_ch8-marker)] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. +[*Concurrency Control and +Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available +online at [*microsoft.com*](https://www.microsoft.com/en-us/research/people/philbe/book/). + +[[14](/en/ch8#Fekete2005-marker)] Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil, +Patrick O’Neil, and Dennis Shasha. 
+[Making +Snapshot Isolation Serializable](https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2009/Papers/p492-fekete.pdf). *ACM Transactions on Database Systems*, +volume 30, issue 2, pages 492–528, June 2005. +[doi:10.1145/1071610.1071615](https://doi.org/10.1145/1071610.1071615) + +[[15](/en/ch8#Zheng2013-marker)] Mai Zheng, Joseph Tucek, Feng Qin, and Mark Lillibridge. +[Understanding +the Robustness of SSDs Under Power Fault](https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf). At *11th USENIX Conference on File and Storage +Technologies* (FAST), February 2013. + +[[16](/en/ch8#Denness2015-marker)] Laurie Denness. +[SSDs: A Gift and a Curse](https://laur.ie/blog/2015/06/ssds-a-gift-and-a-curse/). +*laur.ie*, June 2015. Archived at [perma.cc/6GLP-BX3T](https://perma.cc/6GLP-BX3T) + +[[17](/en/ch8#Surak2015-marker)] Adam Surak. +[When +Solid State Drives Are Not That Solid](https://www.algolia.com/blog/engineering/when-solid-state-drives-are-not-that-solid). *blog.algolia.com*, June 2015. +Archived at [perma.cc/CBR9-QZEE](https://perma.cc/CBR9-QZEE) + +[[18](/en/ch8#HPE2019_ch8-marker)] Hewlett Packard Enterprise. +[Bulletin: +(Revision) HPE SAS Solid State Drives - Critical Firmware Upgrade Required for Certain HPE SAS +Solid State Drive Models to Prevent Drive Failure at 32,768 Hours of Operation](https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us). +*support.hpe.com*, November 2019. +Archived at [perma.cc/CZR4-AQBS](https://perma.cc/CZR4-AQBS) + +[[19](/en/ch8#Ringer2018-marker)] Craig Ringer et al. +[PostgreSQL’s +handling of fsync() errors is unsafe and risks data loss at least on XFS](https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com). Email thread on +pgsql-hackers mailing list, *postgresql.org*, March 2018. 
+Archived at [perma.cc/5RKU-57FL](https://perma.cc/5RKU-57FL) + +[[20](/en/ch8#Rebello2020-marker)] Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan, +Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. +[Can Applications Recover +from fsync Failures?](https://www.usenix.org/conference/atc20/presentation/rebello) At *USENIX Annual Technical Conference* (ATC), July 2020. + +[[21](/en/ch8#Pillai2015-marker)] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, +Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. +[Crash Consistency: Rethinking the +Fundamental Abstractions of the File System](https://dl.acm.org/doi/pdf/10.1145/2800695.2801719). *ACM Queue*, volume 13, issue 7, pages 20–28, July 2015. +[doi:10.1145/2800695.2801719](https://doi.org/10.1145/2800695.2801719) + +[[22](/en/ch8#Pillai2014-marker)] Thanumalayan Sankaranarayana Pillai, Vijay +Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. +[All File +Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf). +At *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014. + +[[23](/en/ch8#Siebenmann2016-marker)] Chris Siebenmann. +[Unix’s File Durability +Problem](https://utcc.utoronto.ca/~cks/space/blog/unix/FileSyncProblem). *utcc.utoronto.ca*, April 2016. +Archived at [perma.cc/VSS8-5MC4](https://perma.cc/VSS8-5MC4) + +[[24](/en/ch8#Ganesan2017-marker)] Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. +Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. +[Redundancy +Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and +Corruptions](https://www.usenix.org/conference/fast17/technical-sessions/presentation/ganesan). At *15th USENIX Conference on File and Storage Technologies* (FAST), +February 2017. 
+ +[[25](/en/ch8#Bairavasundaram2008-marker)] Lakshmi N. Bairavasundaram, Garth R. +Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. +[An +Analysis of Data Corruption in the Storage Stack](https://www.usenix.org/legacy/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf). At *6th USENIX Conference on File and +Storage Technologies* (FAST), February 2008. + +[[26](/en/ch8#Schroeder2016_ch8-marker)] Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. +[Flash +Reliability in Production: The Expected and the Unexpected](https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder). At *14th USENIX Conference on +File and Storage Technologies* (FAST), February 2016. + +[[27](/en/ch8#Allison2015-marker)] Don Allison. +[SSD Storage – Ignorance of Technology Is No +Excuse](https://blog.korelogic.com/blog/2015/03/24). *blog.korelogic.com*, March 2015. +Archived at [perma.cc/9QN4-9SNJ](https://perma.cc/9QN4-9SNJ) + +[[28](/en/ch8#MahUng2015-marker)] Gordon Mah Ung. +[Debunked: +Your SSD won’t lose data if left unplugged after all](https://www.pcworld.com/article/427602/debunked-your-ssd-wont-lose-data-if-left-unplugged-after-all.html). *pcworld.com*, May 2015. +Archived at [perma.cc/S46H-JUDU](https://perma.cc/S46H-JUDU) + +[[29](/en/ch8#Kleppmann2014-marker)] Martin Kleppmann. +[Hermitage: +Testing the ‘I’ in ACID](https://martin.kleppmann.com/2014/11/25/hermitage-testing-the-i-in-acid.html). *martin.kleppmann.com*, November 2014. +Archived at [perma.cc/KP2Y-AQGK](https://perma.cc/KP2Y-AQGK) + +[[30](/en/ch8#Warszawski2017-marker)] Todd Warszawski and Peter Bailis. +[ACIDRain: Concurrency-Related Attacks +on Database-Backed Web Applications](http://www.bailis.org/papers/acidrain-sigmod2017.pdf). At *ACM International Conference on Management of +Data* (SIGMOD), May 2017. +[doi:10.1145/3035918.3064037](https://doi.org/10.1145/3035918.3064037) + +[[31](/en/ch8#DAgosta2014-marker)] Tristan D’Agosta. 
+[BTC Stolen from Poloniex](https://bitcointalk.org/index.php?topic=499580). +*bitcointalk.org*, March 2014. +Archived at [perma.cc/YHA6-4C5D](https://perma.cc/YHA6-4C5D) + +[[32](/en/ch8#bitcointhief2014-marker)] bitcointhief2. +[How +I Stole Roughly 100 BTC from an Exchange and How I Could Have Stolen More!](https://www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and/) *reddit.com*, +February 2014. Archived at +[archive.org](https://web.archive.org/web/20250118042610/https%3A//www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and/) + +[[33](/en/ch8#Jorwekar2007_ch8-marker)] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan. +[Automating the +Detection of Snapshot Isolation Anomalies](https://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf). At *33rd International Conference on Very Large +Data Bases* (VLDB), September 2007. + +[[34](/en/ch8#Melanson2014-marker)] Michael Melanson. +[Transactions: +The Limits of Isolation](https://www.michaelmelanson.net/posts/transactions-the-limits-of-isolation/). *michaelmelanson.net*, November 2014. +Archived at [perma.cc/RG5R-KMYZ](https://perma.cc/RG5R-KMYZ) + +[[35](/en/ch8#Kim2014ACH-marker)] Edward Kim. +[How +ACH works: A developer perspective — Part 1](https://engineering.gusto.com/how-ach-works-a-developer-perspective-part-1-339d3e7bea1). *engineering.gusto.com*, April 2014. +Archived at [perma.cc/7B2H-PU94](https://perma.cc/7B2H-PU94) + +[[36](/en/ch8#Berenson1995-marker)] Hal Berenson, Philip A. Bernstein, Jim N. Gray, +Jim Melton, Elizabeth O’Neil, and Patrick O’Neil. +[A Critique of +ANSI SQL Isolation Levels](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf). At *ACM International Conference on Management of Data* (SIGMOD), +May 1995. [doi:10.1145/568271.223785](https://doi.org/10.1145/568271.223785) + +[[37](/en/ch8#Adya1999-marker)] Atul Adya. 
[Weak +Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions](https://pmg.csail.mit.edu/papers/adya-phd.pdf). +PhD Thesis, Massachusetts Institute of Technology, March 1999. +Archived at [perma.cc/E97M-HW5Q](https://perma.cc/E97M-HW5Q) + +[[38](/en/ch8#Bailis2014virtues_ch8-marker)] Peter Bailis, Aaron Davidson, Alan Fekete, Ali +Ghodsi, Joseph M. Hellerstein, and Ion Stoica. +[Highly Available Transactions: Virtues and +Limitations](https://www.vldb.org/pvldb/vol7/p181-bailis.pdf). At *40th International Conference on Very Large Data Bases* (VLDB), +September 2014. + +[[39](/en/ch8#Crooks2017-marker)] Natacha Crooks, Youer Pu, Lorenzo Alvisi, and Allen Clement. +[Seeing is Believing: A +Client-Centric Specification of Database Isolation](https://www.cs.cornell.edu/lorenzo/papers/Crooks17Seeing.pdf). At *ACM Symposium on Principles of +Distributed Computing* (PODC), pages 73–82, July 2017. +[doi:10.1145/3087801.3087802](https://doi.org/10.1145/3087801.3087802) + +[[40](/en/ch8#Momjian2014-marker)] Bruce Momjian. +[MVCC Unmasked](https://momjian.us/main/writings/pgsql/mvcc.pdf). *momjian.us*, +July 2014. Archived at [perma.cc/KQ47-9GYB](https://perma.cc/KQ47-9GYB) + +[[41](/en/ch8#Alvaro2023-marker)] Peter Alvaro and Kyle Kingsbury. +[MySQL 8.0.34](https://jepsen.io/analyses/mysql-8.0.34). *jepsen.io*, December 2023. +Archived at [perma.cc/HGE2-Z878](https://perma.cc/HGE2-Z878) + +[[42](/en/ch8#Rogov2023-marker)] Egor Rogov. +[PostgreSQL 14 Internals](https://postgrespro.com/community/books/internals). +*postgrespro.com*, April 2023. +Archived at [perma.cc/FRK2-D7WB](https://perma.cc/FRK2-D7WB) + +[[43](/en/ch8#Suzuki2017_ch8-marker)] Hironobu Suzuki. +[The Internals of PostgreSQL](https://www.interdb.jp/pg/). +*interdb.jp*, 2017. + +[[44](/en/ch8#Alleti2025-marker)] Rohan Reddy Alleti. 
+[Internals +of MVCC in Postgres: Hidden costs of Updates vs Inserts](https://medium.com/%40rohanjnr44/internals-of-mvcc-in-postgres-hidden-costs-of-updates-vs-inserts-381eadd35844). *medium.com*, March 2025. +Archived at [perma.cc/3ACX-DFXT](https://perma.cc/3ACX-DFXT) + +[[45](/en/ch8#Pavlo2023-marker)] Andy Pavlo and Bohan Zhang. +[The +Part of PostgreSQL We Hate the Most](https://www.cs.cmu.edu/~pavlo/blog/2023/04/the-part-of-postgresql-we-hate-the-most.html). *cs.cmu.edu*, April 2023. +Archived at [perma.cc/XSP6-3JBN](https://perma.cc/XSP6-3JBN) + +[[46](/en/ch8#Wu2017-marker)] Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo. +[An empirical evaluation of in-memory +multi-version concurrency control](https://vldb.org/pvldb/vol10/p781-Wu.pdf). *Proceedings of the VLDB Endowment*, volume 10, issue +7, pages 781–792, March 2017. +[doi:10.14778/3067421.3067427](https://doi.org/10.14778/3067421.3067427) + +[[47](/en/ch8#Prokopov2014-marker)] Nikita Prokopov. +[Unofficial Guide to Datomic +Internals](https://tonsky.me/blog/unofficial-guide-to-datomic-internals/). *tonsky.me*, May 2014. + +[[48](/en/ch8#Svetlov2025-marker)] Daniil Svetlov. +[A Practical Guide to Taming Postgres Isolation +Anomalies](https://dansvetlov.me/postgres-anomalies/). *dansvetlov.me*, March 2025. +Archived at [perma.cc/L7LE-TDLS](https://perma.cc/L7LE-TDLS) + +[[49](/en/ch8#Wiger2010-marker)] Nate Wiger. +[An Atomic Rant](https://nateware.com/2010/02/18/an-atomic-rant/). *nateware.com*, +February 2010. Archived at [perma.cc/5ZYB-PE44](https://perma.cc/5ZYB-PE44) + +[[50](/en/ch8#Coglan2020-marker)] James Coglan. +[Reading and writing, +part 3: web applications](https://blog.jcoglan.com/2020/10/12/reading-and-writing-part-3/). *blog.jcoglan.com*, October 2020. +Archived at [perma.cc/A7EK-PJVS](https://perma.cc/A7EK-PJVS) + +[[51](/en/ch8#Bailis2015_ch8-marker)] Peter Bailis, Alan Fekete, Michael J. Franklin, +Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. 
+[Feral Concurrency Control: An +Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf). At *ACM International Conference on +Management of Data* (SIGMOD), June 2015. +[doi:10.1145/2723372.2737784](https://doi.org/10.1145/2723372.2737784) + +[[52](/en/ch8#Dogan2020-marker)] Jaana Dogan. +[Things +I Wished More Developers Knew About Databases](https://rakyll.medium.com/things-i-wished-more-developers-knew-about-databases-2d0178464f78). *rakyll.medium.com*, April 2020. +Archived at [perma.cc/6EFK-P2TD](https://perma.cc/6EFK-P2TD) + +[[53](/en/ch8#Cahill2008-marker)] Michael J. Cahill, Uwe Röhm, and Alan Fekete. +[Serializable +Isolation for Snapshot Databases](https://www.cs.cornell.edu/~sowell/dbpapers/serializable_isolation.pdf). At *ACM International Conference on Management of Data* +(SIGMOD), June 2008. +[doi:10.1145/1376616.1376690](https://doi.org/10.1145/1376616.1376690) + +[[54](/en/ch8#Ports2012-marker)] Dan R. K. Ports and Kevin Grittner. +[Serializable Snapshot Isolation in PostgreSQL](https://drkp.net/papers/ssi-vldb12.pdf). +At *38th International Conference on Very Large Data Bases* (VLDB), August 2012. + +[[55](/en/ch8#Terry1995_ch8-marker)] Douglas B. Terry, Marvin M. Theimer, +Karin Petersen, Alan J. Demers, Mike J. Spreitzer, and Carl H. Hauser. +[Managing +Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](https://pdos.csail.mit.edu/6.824/papers/bayou-conflicts.pdf). At +*15th ACM Symposium on Operating Systems Principles* (SOSP), December 1995. +[doi:10.1145/224056.224070](https://doi.org/10.1145/224056.224070) + +[[56](/en/ch8#Schoenig2021-marker)] Hans-Jürgen Schönig. +[Constraints +over multiple rows in PostgreSQL](https://www.cybertec-postgresql.com/en/postgresql-constraints-over-multiple-rows/). *cybertec-postgresql.com*, June 2021.
+Archived at [perma.cc/2TGH-XUPZ](https://perma.cc/2TGH-XUPZ) + +[[57](/en/ch8#Stonebraker2007_ch8-marker)] Michael Stonebraker, Samuel Madden, +Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. +[The End of an +Architectural Era (It’s Time for a Complete Rewrite)](https://vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf). At *33rd International Conference on +Very Large Data Bases* (VLDB), September 2007. + +[[58](/en/ch8#Hugg2014streaming-marker)] John Hugg. +[H-Store/VoltDB Architecture vs. CEP Systems +and Newer Streaming Architectures](https://www.youtube.com/watch?v=hD5M4a1UVz8). At *Data @Scale Boston*, November 2014. + +[[59](/en/ch8#Kallman2008-marker)] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew +Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang +Zhang, John Hugg, and Daniel J. Abadi. +[H-Store: A High-Performance, Distributed Main +Memory Transaction Processing System](https://www.vldb.org/pvldb/vol1/1454211.pdf). *Proceedings of the VLDB Endowment*, volume 1, +issue 2, pages 1496–1499, August 2008. + +[[60](/en/ch8#Hickey2012-marker)] Rich Hickey. +[The Architecture of Datomic](https://www.infoq.com/articles/Architecture-Datomic/). +*infoq.com*, November 2012. +Archived at [perma.cc/5YWU-8XJK](https://perma.cc/5YWU-8XJK) + +[[61](/en/ch8#Hugg2014debunking-marker)] John Hugg. +[Debunking Myths +About the VoltDB In-Memory Database](https://dzone.com/articles/debunking-myths-about-voltdb). *dzone.com*, May 2014. +Archived at [perma.cc/2Z9N-HPKF](https://perma.cc/2Z9N-HPKF) + +[[62](/en/ch8#Zhou2025-marker)] Xinjing Zhou, Viktor Leis, Xiangyao Yu, and Michael Stonebraker. +[OLTP Through the Looking Glass 16 +Years Later: Communication is the New Bottleneck](https://www.vldb.org/cidrdb/papers/2025/p17-zhou.pdf). At *15th Annual Conference on Innovative +Data Systems Research* (CIDR), January 2025. 
+ +[[63](/en/ch8#Zhou2022-marker)] Xinjing Zhou, Xiangyao Yu, Goetz Graefe, and Michael Stonebraker. +[Lotus: scalable multi-partition +transactions on single-threaded partitioned databases](https://www.vldb.org/pvldb/vol15/p2939-zhou.pdf). *Proceedings of the VLDB +Endowment* (PVLDB), volume 15, issue 11, pages 2939–2952, July 2022. +[doi:10.14778/3551793.3551843](https://doi.org/10.14778/3551793.3551843) + +[[64](/en/ch8#Hellerstein2007_ch8-marker)] Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton. +[Architecture of a Database System](https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf). +*Foundations and Trends in Databases*, volume 1, issue 2, pages 141–259, November 2007. +[doi:10.1561/1900000002](https://doi.org/10.1561/1900000002) + +[[65](/en/ch8#Cahill2009-marker)] Michael J. Cahill. +[Serializable +Isolation for Snapshot Databases](https://ses.library.usyd.edu.au/bitstream/handle/2123/5353/michael-cahill-2009-thesis.pdf). PhD Thesis, University of Sydney, July 2009. +Archived at [perma.cc/727J-NTMP](https://perma.cc/727J-NTMP) + +[[66](/en/ch8#Diaconu2013-marker)] Cristian Diaconu, Craig Freedman, +Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. +[Hekaton: +SQL Server’s Memory-Optimized OLTP Engine](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/06/Hekaton-Sigmod2013-final.pdf). At *ACM SIGMOD International Conference on +Management of Data* (SIGMOD), pages 1243–1254, June 2013. +[doi:10.1145/2463676.2463710](https://doi.org/10.1145/2463676.2463710) + +[[67](/en/ch8#Neumann2015-marker)] Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. +[Fast Serializable Multi-Version Concurrency +Control for Main-Memory Database Systems](https://db.in.tum.de/~muehlbau/papers/mvcc.pdf). At *ACM SIGMOD International Conference on +Management of Data* (SIGMOD), pages 677–689, May 2015. 
+[doi:10.1145/2723372.2749436](https://doi.org/10.1145/2723372.2749436) + +[[68](/en/ch8#Badal1979-marker)] D. Z. Badal. +[Correctness of Concurrency Control and +Implications in Distributed Databases](https://ieeexplore.ieee.org/abstract/document/762563). At *3rd International IEEE Computer Software and +Applications Conference* (COMPSAC), November 1979. +[doi:10.1109/CMPSAC.1979.762563](https://doi.org/10.1109/CMPSAC.1979.762563) + +[[69](/en/ch8#Agrawal1987-marker)] Rakesh Agrawal, Michael J. Carey, and Miron Livny. +[Concurrency Control +Performance Modeling: Alternatives and Implications](https://people.eecs.berkeley.edu/~brewer/cs262/ConcControl.pdf). *ACM Transactions on Database +Systems* (TODS), volume 12, issue 4, pages 609–654, December 1987. +[doi:10.1145/32204.32220](https://doi.org/10.1145/32204.32220) + +[[70](/en/ch8#Brooker2024snapshot-marker)] Marc Brooker. +[Snapshot Isolation vs +Serializability](https://brooker.co.za/blog/2024/12/17/occ-and-isolation.html). *brooker.co.za*, December 2024. +Archived at [perma.cc/5TRC-CR5G](https://perma.cc/5TRC-CR5G) + +[[71](/en/ch8#Lindsay1979_ch8-marker)] B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. +Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. +[Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). +IBM Research, Research Report RJ2571(33471), July 1979. +Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD) + +[[72](/en/ch8#Mohan1986-marker)] C. Mohan, Bruce G. Lindsay, and Ron Obermarck. +[Transaction +Management in the R\* Distributed Database Management System](https://cs.brown.edu/courses/csci2270/archives/2012/papers/dtxn/p378-mohan.pdf). +*ACM Transactions on Database Systems*, volume 11, issue 4, pages 378–396, December 1986. +[doi:10.1145/7239.7266](https://doi.org/10.1145/7239.7266) + +[[73](/en/ch8#XASpec1991-marker)] X/Open Company Ltd. 
+[Distributed Transaction Processing: +The XA Specification](https://pubs.opengroup.org/onlinepubs/009680699/toc.pdf). Technical Standard XO/CAE/91/300, December 1991. ISBN: 978-1-872-63024-3, +archived at [perma.cc/Z96H-29JB](https://perma.cc/Z96H-29JB) + +[[74](/en/ch8#Neto2008-marker)] Ivan Silva Neto and Francisco Reverbel. +[Lessons Learned from Implementing +WS-Coordination and WS-AtomicTransaction](https://www.ime.usp.br/~reverbel/papers/icis2008.pdf). At *7th IEEE/ACIS International Conference on +Computer and Information Science* (ICIS), May 2008. +[doi:10.1109/ICIS.2008.75](https://doi.org/10.1109/ICIS.2008.75) + +[[75](/en/ch8#Johnson2004-marker)] James E. Johnson, David E. Langworthy, Leslie Lamport, +and Friedrich H. Vogt. +[Formal +Specification of a Web Services Protocol](https://www.microsoft.com/en-us/research/publication/formal-specification-of-a-web-services-protocol/). At *1st International Workshop on Web Services and +Formal Methods* (WS-FM), February 2004. +[doi:10.1016/j.entcs.2004.02.022](https://doi.org/10.1016/j.entcs.2004.02.022) + +[[76](/en/ch8#Gray1981_ch8-marker)] Jim Gray. +[The Transaction +Concept: Virtues and Limitations](https://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf). At *7th International Conference on Very Large Data +Bases* (VLDB), September 1981. + +[[77](/en/ch8#Skeen1981-marker)] Dale Skeen. +[Nonblocking Commit +Protocols](https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/Ske81.pdf). At *ACM International Conference on Management of Data* (SIGMOD), April 1981. +[doi:10.1145/582318.582339](https://doi.org/10.1145/582318.582339) + +[[78](/en/ch8#Hohpe2005-marker)] Gregor Hohpe. +[Your Coffee Shop Doesn’t Use +Two-Phase Commit](https://www.martinfowler.com/ieeeSoftware/coffeeShop.pdf). *IEEE Software*, volume 22, issue 2, pages 64–66, March 2005. +[doi:10.1109/MS.2005.52](https://doi.org/10.1109/MS.2005.52) + +[[79](/en/ch8#Helland2007_ch8-marker)] Pat Helland. 
+[Life Beyond Distributed Transactions: +An Apostate’s Opinion](https://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf). At *3rd Biennial Conference on Innovative Data Systems Research* +(CIDR), January 2007. + +[[80](/en/ch8#Oliver2011-marker)] Jonathan Oliver. +[My Beef with +MSDTC and Two-Phase Commits](https://blog.jonathanoliver.com/my-beef-with-msdtc-and-two-phase-commits/). *blog.jonathanoliver.com*, April 2011. +Archived at [perma.cc/K8HF-Z4EN](https://perma.cc/K8HF-Z4EN) + +[[81](/en/ch8#Rahien2014-marker)] Oren Eini (Ayende Rahien). +[The Fallacy of +Distributed Transactions](https://ayende.com/blog/167362/the-fallacy-of-distributed-transactions). *ayende.com*, July 2014. +Archived at [perma.cc/VB87-2JEF](https://perma.cc/VB87-2JEF) + +[[82](/en/ch8#Vasters2012-marker)] Clemens Vasters. +[Transactions +in Windows Azure (with Service Bus) – An Email Discussion](https://learn.microsoft.com/en-gb/archive/blogs/clemensv/transactions-in-windows-azure-with-service-bus-an-email-discussion). *learn.microsoft.com*, July 2012. +Archived at [perma.cc/4EZ9-5SKW](https://perma.cc/4EZ9-5SKW) + +[[83](/en/ch8#Dhariwal2008-marker)] Ajmer Dhariwal. +[Orphaned MSDTC +Transactions (-2 spids)](https://www.eraofdata.com/posts/2008/orphaned-msdtc-transactions-2-spids/). *eraofdata.com*, December 2008. +Archived at [perma.cc/YG6F-U34C](https://perma.cc/YG6F-U34C) + +[[84](/en/ch8#Randal2013-marker)] Paul Randal. +[Real +World Story of DBCC PAGE Saving the Day](https://www.sqlskills.com/blogs/paul/real-world-story-of-dbcc-page-saving-the-day/). *sqlskills.com*, June 2013. +Archived at [perma.cc/2MJN-A5QH](https://perma.cc/2MJN-A5QH) + +[[85](/en/ch8#Wang2021-marker)] Guozhang Wang, Lei Chen, Ayusman Dikshit, Jason +Gustafson, Boyang Chen, Matthias J. Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva +Mehta, Varun Madan, and Jun Rao.
+[Consistency and Completeness: +Rethinking Distributed Stream Processing in Apache Kafka](https://dl.acm.org/doi/pdf/10.1145/3448016.3457556). At *ACM International Conference on +Management of Data* (SIGMOD), June 2021. +[doi:10.1145/3448016.3457556](https://doi.org/10.1145/3448016.3457556) diff --git a/content/en/ch9.md b/content/en/ch9.md index 3b5c35c..b3bc107 100644 --- a/content/en/ch9.md +++ b/content/en/ch9.md @@ -1,205 +1,2549 @@ --- -title: "9. Consistency and Consensus" -linkTitle: "9. Consistency and Consensus" +title: "9. The Trouble with Distributed Systems" weight: 209 breadcrumbs: false --- - -![](/img/ch9.png) - -> *Is it better to be alive and wrong or right and dead?* +> *They’re funny things, Accidents. You never have them till you’re having them.* > -> ​ — Jay Kreps, *A Few Notes on Kafka and Jepsen* (2013) +> A.A. Milne, *The House at Pooh Corner* (1928) ---------------- +As discussed in [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability), making a system reliable means ensuring that the +system as a whole continues working, even when things go wrong (i.e., when there is a fault). +However, anticipating all the possible faults and handling them is not that easy. As a developer, it +is very tempting to focus mostly on the happy path (after all, most of the time things work fine!) +and to neglect faults, since they introduce a lot of edge cases. -Lots of things can go wrong in distributed systems, as discussed in [Chapter 8](/en/ch8). The simplest way of handling such faults is to simply let the entire service fail, and show the user an error message. If that solution is unacceptable, we need to find ways of *tolerating* faults—that is, of keeping the service functioning correctly, even if some internal component is faulty. +If you want your system to be reliable in the presence of faults you have to radically change your +mindset, and focus on the things that could go wrong, even though they may be unlikely. 
It doesn’t +matter whether there is only a one-in-a-million chance of a thing going wrong: in a large enough +system, one-in-a-million events happen every day. Experienced systems operators will tell you that +anything that *can* go wrong *will* go wrong. -In this chapter, we will talk about some examples of algorithms and protocols for building fault-tolerant distributed systems. We will assume that all the problems from [Chapter 8](/en/ch8) can occur: packets can be lost, reordered, duplicated, or arbitrarily delayed in the network; clocks are approximate at best; and nodes can pause (e.g., due to garbage collection) or crash at any time. +Moreover, working with distributed systems is fundamentally different from writing software on a +single computer—and the main difference is that there are lots of new and exciting ways for things +to go wrong [[1](/en/ch9#Cavage2013), +[2](/en/ch9#Kreps2012_ch9)]. +In this chapter, you will get a taste of the problems that arise in practice, and an understanding +of the things you can and cannot rely on. -The best way of building fault-tolerant systems is to find some general-purpose abstractions with useful guarantees, implement them once, and then let applications rely on those guarantees. This is the same approach as we used with transactions in [Chapter 7](/en/ch7): by using a transaction, the application can pretend that there are no crashes (atomicity), that nobody else is concurrently accessing the database (isola‐ tion), and that storage devices are perfectly reliable (durability). Even though crashes, race conditions, and disk failures do occur, the transaction abstraction hides those problems so that the application doesn’t need to worry about them. +To understand what challenges we are up against, we will now turn our pessimism to the maximum and +explore the things that may go wrong in a distributed system. 
We will look into problems with +networks ([“Unreliable Networks”](/en/ch9#sec_distributed_networks)) as well as clocks and timing issues +([“Unreliable Clocks”](/en/ch9#sec_distributed_clocks)). The consequences of all these issues are disorienting, so we’ll +explore how to think about the state of a distributed system and how to reason about things that +have happened ([“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth)). Later, in [Chapter 10](/en/ch10#ch_consistency), we will look at some +examples of how we can achieve fault tolerance in the face of those faults. -We will now continue along the same lines, and seek abstractions that can allow an application to ignore some of the problems with distributed systems. For example, one of the most important abstractions for distributed systems is *consensus*: that is, getting all of the nodes to agree on something. As we shall see in this chapter, reliably reaching consensus in spite of network faults and process failures is a surprisingly tricky problem. +# Faults and Partial Failures -Once you have an implementation of consensus, applications can use it for various purposes. For example, say you have a database with single-leader replication. If the leader dies and you need to fail over to another node, the remaining database nodes can use consensus to elect a new leader. As discussed in “[Handling Node Outages](/en/ch5#handling-onde-outages)” on page 156, it’s important that there is only one leader, and that all nodes agree who the leader is. If two nodes both believe that they are the leader, that situation is called *split brain*, and it often leads to data loss. Correct implementations of consensus help avoid such problems. +When you are writing a program on a single computer, it normally behaves in a fairly predictable +way: either it works or it doesn’t. 
Buggy software may give the appearance that the computer is +sometimes “having a bad day” (a problem that is often fixed by a reboot), but that is mostly just +a consequence of badly written software. -Later in this chapter, in “[Distributed Transactions and Consensus](#distributed-transactions-and-consensus)”, we will look into algorithms to solve consensus and related problems. But first we first need to explore the range of guarantees and abstractions that can be provided in a distributed system. +There is no fundamental reason why software on a single computer should be flaky: when the hardware +is working correctly, the same operation always produces the same result (it is *deterministic*). If +there is a hardware problem (e.g., memory corruption or a loose connector), the consequence is usually a +total system failure (e.g., kernel panic, “blue screen of death,” failure to start up). An individual +computer with good software is usually either fully functional or entirely broken, but not something +in between. -We need to understand the scope of what can and cannot be done: in some situa‐ tions, it’s possible for the system to tolerate faults and continue working; in other sit‐ uations, that is not possible. The limits of what is and isn’t possible have been explored in depth, both in theoretical proofs and in practical implementations. We will get an overview of those fundamental limits in this chapter. +This is a deliberate choice in the design of computers: if an internal fault occurs, we prefer a +computer to crash completely rather than returning a wrong result, because wrong results are difficult +and confusing to deal with. Thus, computers hide the fuzzy physical reality on which they are +implemented and present an idealized system model that operates with mathematical perfection. A CPU +instruction always does the same thing; if you write some data to memory or disk, that data remains +intact and doesn’t get randomly corrupted. 
As discussed in [“Hardware and Software Faults”](/en/ch2#sec_introduction_hardware_faults), +this is not actually true—in reality, data does get silently corrupted and CPUs do sometimes +silently return the wrong result—but it happens rarely enough that we can get away with ignoring it. -Researchers in the field of distributed systems have been studying these topics for decades, so there is a lot of material—we’ll only be able to scratch the surface. In this book we don’t have space to go into details of the formal models and proofs, so we will stick with informal intuitions. The literature references offer plenty of additional depth if you’re interested. +When you are writing software that runs on several computers, connected by a network, the situation +is fundamentally different. In distributed systems, faults occur much more frequently, and so we can +no longer ignore them—we have no choice but to confront the messy reality of the physical world. And +in the physical world, a remarkably wide range of things can go wrong, as illustrated by this +anecdote [[3](/en/ch9#Hale2010)]: +> In my limited experience I’ve dealt with long-lived network partitions in a single data center (DC), +> PDU [power distribution unit] failures, switch failures, accidental power cycles of whole racks, +> whole-DC backbone failures, whole-DC power failures, and a hypoglycemic driver smashing his Ford +> pickup truck into a DC’s HVAC [heating, ventilation, and air conditioning] system. And I’m not even +> an ops guy. +> +> Coda Hale -## …… +In a distributed system, there may well be some parts of the system that are broken in some +unpredictable way, even though other parts of the system are working fine. This is known as a +*partial failure*. The difficulty is that partial failures are *nondeterministic*: if you try to do +anything involving multiple nodes and the network, it may sometimes work and sometimes unpredictably +fail. 
As we shall see, you may not even *know* whether something succeeded or not! +This nondeterminism and possibility of partial failures is what makes distributed systems hard to +work with [[4](/en/ch9#Hodges2013)]. +On the other hand, if a distributed system can tolerate partial failures, that opens up powerful +possibilities: for example, it allows you to perform a rolling upgrade, rebooting one node at a time +to install software updates while the system as a whole continues working uninterrupted all the +time. Fault tolerance therefore allows us to make distributed systems more reliable than single-node +systems: we can build a reliable system from unreliable components. +But before we can implement fault tolerance, we need to know more about the faults that we’re +supposed to tolerate. It is important to consider a wide range of possible faults—even fairly +unlikely ones—and to artificially create such situations in your testing environment to see what +happens. In distributed systems, suspicion, pessimism, and paranoia pay off. -## Summary +# Unreliable Networks -In this chapter we examined the topics of consistency and consensus from several different angles. We looked in depth at linearizability, a popular consistency model: its goal is to make replicated data appear as though there were only a single copy, and to make all operations act on it atomically. Although linearizability is appealing because it is easy to understand—it makes a database behave like a variable in a single-threaded program — it has the downside of being slow, especially in environments with large network delays. +As discussed in [“Shared-Memory, Shared-Disk, and Shared-Nothing Architecture”](/en/ch2#sec_introduction_shared_nothing), the distributed systems we focus on +in this book are mostly *shared-nothing systems*: i.e., a bunch of machines connected by a network. 
+The network is the only way those machines can communicate—we assume that each machine has its +own memory and disk, and one machine cannot access another machine’s memory or disk (except by +making requests to a service over the network). Even when storage is shared, such as with Amazon’s +S3, machines communicate with shared storage services over the network. -We also explored causality, which imposes an ordering on events in a system (what happened before what, based on cause and effect). Unlike linearizability, which puts all operations in a single, totally ordered timeline, causality provides us with a weaker consistency model: some things can be concurrent, so the version history is like a timeline with branching and merging. Causal consistency does not have the coordi‐ nation overhead of linearizability and is much less sensitive to network problems. +The internet and most internal networks in datacenters (often Ethernet) are *asynchronous packet +networks*. In this kind of network, one node can send a message (a packet) to another node, but the +network gives no guarantees as to when it will arrive, or whether it will arrive at all. If you send +a request and expect a response, many things could go wrong (some of which are illustrated in +[Figure 9-1](/en/ch9#fig_distributed_network)): -However, even if we capture the causal ordering (for example using Lamport timestamps), we saw that some things cannot be implemented this way: in “Timestamp ordering is not sufficient” on page 347 we considered the example of ensuring that a username is unique and rejecting concurrent registrations for the same username. If one node is going to accept a registration, it needs to somehow know that another node isn’t concurrently in the process of registering the same name. This problem led us toward *consensus*. +1. Your request may have been lost (perhaps someone unplugged a network cable). +2. 
Your request may be waiting in a queue and will be delivered later (perhaps the network or the + recipient is overloaded). +3. The remote node may have failed (perhaps it crashed or it was powered down). +4. The remote node may have temporarily stopped responding (perhaps it is experiencing a long + garbage collection pause; see [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses)), but it will start responding + again later. +5. The remote node may have processed your request, but the response has been lost on the network + (perhaps a network switch has been misconfigured). +6. The remote node may have processed your request, but the response has been delayed and will be + delivered later (perhaps the network or your own machine is overloaded). -We saw that achieving consensus means deciding something in such a way that all nodes agree on what was decided, and such that the decision is irrevocable. With some digging, it turns out that a wide range of problems are actually reducible to consensus and are equivalent to each other (in the sense that if you have a solution for one of them, you can easily transform it into a solution for one of the others). Such equivalent problems include: +![ddia 0901](/fig/ddia_0901.png) -***Linearizable compare-and-set registers*** +###### Figure 9-1. If you send a request and don’t get a response, it’s not possible to distinguish whether (a) the request was lost, (b) the remote node is down, or (c) the response was lost. -The register needs to atomically *decide* whether to set its value, based on whether its current value equals the parameter given in the operation. +The sender can’t even tell whether the packet was delivered: the only option is for the recipient to +send a response message, which may in turn be lost or delayed. These issues are indistinguishable in +an asynchronous network: the only information you have is that you haven’t received a response yet. 
+If you send a request to another node and don’t receive a response, it is *impossible* to tell why. -***Atomic transaction commit*** +The usual way of handling this issue is a *timeout*: after some time you give up waiting and assume that +the response is not going to arrive. However, when a timeout occurs, you still don’t know whether +the remote node got your request or not (and if the request is still queued somewhere, it may still +be delivered to the recipient, even if the sender has given up on it). -A database must *decide* whether to commit or abort a distributed transaction. +## The Limitations of TCP -***Total order broadcast*** +Network packets have a maximum size (generally a few kilobytes), but many applications need to send +messages (requests, responses) that are too big to fit in one packet. These applications most often +use TCP, the Transmission Control Protocol, to establish a *connection* that breaks up large data +streams into individual packets, and puts them back together again on the receiving side. -The messaging system must *decide* on the order in which to deliver messages. +###### Note -***Locks and leases*** +Most of what we say about TCP applies also to its more recent alternative QUIC, as well as the +Stream Control Transmission Protocol (SCTP) used in WebRTC, the BitTorrent uTP protocol, and +other transport protocols. For a comparison to UDP, see [“TCP Versus UDP”](/en/ch9#sidebar_distributed_tcp_udp). -When several clients are racing to grab a lock or lease, the lock *decides* which one successfully acquired it. +TCP is often described as providing “reliable” delivery, in the sense that it detects and +retransmits dropped packets, it detects reordered packets and puts them back in the correct order, +and it detects packet corruption using a simple checksum. 
It also figures out how fast it can send +data so that it is transferred as quickly as possible, but without overloading the network or the +receiving node; this is known as *congestion control*, *flow control*, or *backpressure* +[[5](/en/ch9#Jacobson1988)]. -***Membership/coordination service*** +When you “send” some data by writing it to a socket, it doesn’t actually get sent immediately; +it is only placed in a buffer managed by your operating system. When the congestion control +algorithm decides that it has capacity to send a packet, it takes the next packet-worth of data from +that buffer and passes it to the network interface. The packet passes through several switches and +routers, and eventually the receiving node’s operating system places the packet’s data in a receive +buffer and sends an acknowledgment packet back to the sender. Only then does the receiving operating +system notify the application that some more data has arrived +[[6](/en/ch9#Hubert2009)]. -Given a failure detector (e.g., timeouts), the system must *decide* which nodes are alive, and which should be considered dead because their sessions timed out. +So, if TCP provides “reliability”, does that mean we no longer need to worry about networks being +unreliable? Unfortunately not. TCP decides that a packet must have been lost if no acknowledgment +arrives within some timeout, but it can’t tell whether it was the outbound packet or the +acknowledgment that was lost. Although TCP can resend the packet, it can’t guarantee that the new +packet will get through either. If the network cable is unplugged, TCP can’t plug it back in for +you. Eventually, after a configurable timeout, TCP gives up and signals an error to the application.
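The timeout behaviour described above can be illustrated with a small sketch. It is not taken from the book: the function names (`slow_server`, `send_with_timeout`, `demo`), the payload, and the delay values are all hypothetical, and the "network" is just the loopback interface. The server handles the request but replies late, so the client gives up and can only report the outcome as unknown, even though the request did take effect.

```python
import socket
import threading
import time


def slow_server(listener, delay, processed):
    """Accept one connection, handle the request, then reply too late."""
    conn, _ = listener.accept()
    with conn:
        request = conn.recv(1024)
        processed.append(request)  # the side effect happens here...
        time.sleep(delay)          # ...but the reply is delayed
        try:
            conn.sendall(b"ok")
        except OSError:
            pass  # the client may already have given up and closed


def send_with_timeout(port, payload, timeout):
    """Return "ok" if a reply arrives in time, otherwise "unknown"."""
    with socket.create_connection(("127.0.0.1", port), timeout=timeout) as sock:
        sock.sendall(payload)
        try:
            reply = sock.recv(1024)
        except socket.timeout:
            # No reply yet: the request may be lost, queued, or already done.
            return "unknown"
        return "ok" if reply == b"ok" else "unknown"


def demo():
    processed = []  # requests the server has actually handled
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
    listener.listen(1)
    port = listener.getsockname()[1]
    server = threading.Thread(target=slow_server, args=(listener, 0.5, processed))
    server.start()
    outcome = send_with_timeout(port, b"debit account 42", timeout=0.1)
    server.join()
    listener.close()
    return outcome, processed
```

Calling `demo()` returns the client's view of the outcome together with the list of requests the server handled. From the client's side, a timeout alone cannot distinguish this case from one where the request never arrived, which is why blindly retrying a non-idempotent request after a timeout is risky.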
-***Uniqueness constraint*** +If a TCP connection is closed with an error—perhaps because the remote node crashed, or perhaps +because the network was interrupted—you unfortunately have no way of knowing how much data was +actually processed by the remote node [[6](/en/ch9#Hubert2009)]. +Even if TCP acknowledged that a packet was delivered, this only means that the operating system +kernel on the remote node received it, but the application may have crashed before it handled that +data. If you want to be sure that a request was successful, you need a positive response from the +application itself +[[7](/en/ch9#Saltzer1984_ch9)]. -When several transactions concurrently try to create conflicting records with the same key, the constraint must *decide* which one to allow and which should fail with a constraint violation. +Nevertheless, TCP is very useful, because it provides a convenient way of sending and receiving +messages that are too big to fit in one packet. Once a TCP connection is established, you can also +use it to send multiple requests and responses. This is usually done by first sending a header that +indicates the length of the following message in bytes, followed by the actual message. HTTP and +many RPC protocols (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) work like this. +## Network Faults in Practice +We have been building computer networks for decades—one might hope that by now we would have figured +out how to make them reliable. Unfortunately, we have not yet succeeded. There are some systematic +studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common, +even in controlled environments like a datacenter operated by one company +[[8](/en/ch9#Bailis2014reliable)]: -All of these are straightforward if you only have a single node, or if you are willing to assign the decision-making capability to a single node. 
This is what happens in a single-leader database: all the power to make decisions is vested in the leader, which is why such databases are able to provide linearizable operations, uniqueness con‐ straints, a totally ordered replication log, and more. +* One study in a medium-sized datacenter found about 12 network faults per month, of which half + disconnected a single machine, and half disconnected an entire rack + [[9](/en/ch9#Leners2015)]. +* Another study measured the failure rates of components like top-of-rack switches, aggregation + switches, and load balancers + [[10](/en/ch9#Gill2011)]. + It found that adding redundant networking gear doesn’t reduce faults as much as you might hope, + since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause + of outages. +* Interruptions of wide-area fiber links have been blamed on cows + [[11](/en/ch9#Hoelzle2020)], + beavers [[12](/en/ch9#CBCNews2021)], + and sharks [[13](/en/ch9#Oremus2014)] + (though shark bites have become rarer due to better shielding of submarine cables + [[14](/en/ch9#AuerbachJahajeeah2023)]). + Humans are also at fault, be it due to accidental misconfiguration + [[15](/en/ch9#Janardhan2021)], + scavenging [[16](/en/ch9#Parfitt2011)], + or sabotage + [[17](/en/ch9#Voce2025)]. +* Across different cloud regions, round-trip times of up to several *minutes* have been observed at + high percentiles [[18](/en/ch9#Liu2016), Table 3]. + Even within a single datacenter, packet delay of more than a minute can occur during a network + topology reconfiguration, triggered by a problem during a software upgrade for a switch + [[19](/en/ch9#Imbriaco2012_ch9)]. + Thus, we have to assume that messages might be delayed arbitrarily. +* Sometimes communications are partially interrupted, depending on who you’re talking to: for + example, A and B can communicate, B and C can communicate, but A and C cannot + [[20](/en/ch9#Lianza2020_ch9), + [21](/en/ch9#Alfatafta2020)]. 
+ Other surprising faults include a network interface that sometimes drops all inbound packets but + sends outbound packets successfully [[22](/en/ch9#Donges2012)]: + just because a network link works in one direction doesn’t guarantee it’s also working in the + opposite direction. +* Even a brief network interruption can have repercussions that last for much longer than the + original issue [[8](/en/ch9#Bailis2014reliable), + [20](/en/ch9#Lianza2020_ch9), + [23](/en/ch9#Toman2020)]. -However, if that single leader fails, or if a network interruption makes the leader unreachable, such a system becomes unable to make any progress. There are three ways of handling that situation: +# Network partitions -1. Wait for the leader to recover, and accept that the system will be blocked in the meantime. Many XA/JTA transaction coordinators choose this option. This approach does not fully solve consensus because it does not satisfy the termina‐ tion property: if the leader does not recover, the system can be blocked forever. +When one part of the network is cut off from the rest due to a network fault, that is sometimes +called a *network partition* or *netsplit*, but it is not fundamentally different from other kinds +of network interruption. Network partitions are not related to sharding of a storage system, which +is sometimes also called *partitioning* (see [Chapter 7](/en/ch7#ch_sharding)). -2. Manually fail over by getting humans to choose a new leader node and reconfig‐ ure the system to use it. Many relational databases take this approach. It is a kind of consensus by “act of God”—the human operator, outside of the computer sys‐ tem, makes the decision. The speed of failover is limited by the speed at which humans can act, which is generally slower than computers. +Even if network faults are rare in your environment, the fact that faults *can* occur means that +your software needs to be able to handle them. 
Whenever any communication happens over a network, it +may fail—there is no way around it. -3. Use an algorithm to automatically choose a new leader. This approach requires a consensus algorithm, and it is advisable to use a proven algorithm that correctly handles adverse network conditions [107]. +If the error handling of network faults is not defined and tested, arbitrarily bad things could +happen: for example, the cluster could become deadlocked and permanently unable to serve requests, +even when the network recovers [[24](/en/ch9#Kingsbury2014elastic)], +or it could even delete all of your data +[[25](/en/ch9#Sanfilippo2014)]. +If software is put in an unanticipated situation, it may do arbitrary unexpected things. -Although a single-leader database can provide linearizability without executing a consensus algorithm on every write, it still requires consensus to maintain its leader‐ ship and for leadership changes. Thus, in some sense, having a leader only “kicks the can down the road”: consensus is still required, only in a different place, and less fre‐ quently. The good news is that fault-tolerant algorithms and systems for consensus exist, and we briefly discussed them in this chapter. +Handling network faults doesn’t necessarily mean *tolerating* them: if your network is normally +fairly reliable, a valid approach may be to simply show an error message to users while your network +is experiencing problems. However, you do need to know how your software reacts to network problems +and ensure that the system can recover from them. +It may make sense to deliberately trigger network problems and test the system’s response (this is +known as *fault injection*; see [“Fault injection”](/en/ch9#sec_fault_injection)). -Tools like ZooKeeper play an important role in providing an “outsourced” consen‐ sus, failure detection, and membership service that applications can use. 
It’s not easy to use, but it is much better than trying to develop your own algorithms that can withstand all the problems discussed in [Chapter 8](/en/ch8). If you find yourself wanting to do one of those things that is reducible to consensus, and you want it to be fault-tolerant, then it is advisable to use something like ZooKeeper. +## Detecting Faults -Nevertheless, not every system necessarily requires consensus: for example, leaderless and multi-leader replication systems typically do not use global consensus. The con‐ flicts that occur in these systems (see “[Handling Write Conflicts](/en/ch5#handling-write-conflicts)”) are a consequence of not having consensus across different leaders, but maybe that’s okay: maybe we simply need to cope without linearizability and learn to work better with data that has branching and merging version histories. +Many systems need to automatically detect faulty nodes. For example: -This chapter referenced a large body of research on the theory of distributed systems. Although the theoretical papers and proofs are not always easy to understand, and sometimes make unrealistic assumptions, they are incredibly valuable for informing practical work in this field: they help us reason about what can and cannot be done, and help us find the counterintuitive ways in which distributed systems are often flawed. If you have the time, the references are well worth exploring. +* A load balancer needs to stop sending requests to a node that is dead (i.e., take it *out of rotation*). +* In a distributed database with single-leader replication, if the leader fails, one of the + followers needs to be promoted to be the new leader (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)). 
-This brings us to the end of [Part II](/en/part-ii) of this book, in which we covered replication ([Chapter 5](/en/ch5)), partitioning ([Chapter 6](/en/ch6)), transactions ([Chapter 7](/en/ch7)), distributed system failure models ([Chapter 8](/en/ch8)), and finally consistency and consensus ([Chapter 9](/en/ch9)). Now that we have laid a firm foundation of theory, in [Part III](/en/part-iii) we will turn once again to more practical systems, and discuss how to build powerful applications from heterogeneous building blocks. +Unfortunately, the uncertainty about the network makes it difficult to tell whether a node is +working or not. In some specific circumstances you might get some feedback to explicitly tell you +that something is not working: -## References +* If you can reach the machine on which the node should be running, but no process is listening on + the destination port (e.g., because the process crashed), the operating system will helpfully close + or refuse TCP connections by sending a `RST` or `FIN` packet in reply. +* If a node process crashed (or was killed by an administrator) but the node’s operating system is + still running, a script can notify other nodes about the crash so that another node can take over + quickly without having to wait for a timeout to expire. For example, HBase does this + [[26](/en/ch9#Liochon2015)]. +* If you have access to the management interface of the network switches in your datacenter, you can + query them to detect link failures at a hardware level (e.g., if the remote machine is powered + down). This option is ruled out if you’re connecting via the internet, or if you’re in a shared + datacenter with no access to the switches themselves, or if you can’t reach the management + interface due to a network problem. +* If a router is sure that the IP address you’re trying to connect to is unreachable, it may reply + to you with an ICMP Destination Unreachable packet. 
However, the router doesn’t have a magic + failure detection capability either—it is subject to the same limitations as other participants + of the network. -1. Peter Bailis and Ali Ghodsi: “[Eventual Consistency Today: Limitations, Extensions, and Beyond](http://queue.acm.org/detail.cfm?id=2462076),” *ACM Queue*, volume 11, number 3, pages 55-63, March 2013. [doi:10.1145/2460276.2462076](http://dx.doi.org/10.1145/2460276.2462076) -1. Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin: “[Consistency, Availability, and Convergence](http://apps.cs.utexas.edu/tech_reports/reports/tr/TR-2036.pdf),” University of Texas at Austin, Department of Computer Science, Tech Report UTCS TR-11-22, May 2011. -1. Alex Scotti: “[Adventures in Building Your Own Database](http://www.slideshare.net/AlexScotti1/allyourbase-55212398),” at *All Your Base*, November 2015. -1. Peter Bailis, Aaron Davidson, Alan Fekete, et al.: “[Highly Available Transactions: Virtues and Limitations](http://arxiv.org/pdf/1302.0309.pdf),” at *40th International Conference on Very Large Data Bases* (VLDB), September 2014. Extended version published as pre-print arXiv:1302.0309 [cs.DB]. -1. Paolo Viotti and Marko Vukolić: “[Consistency in Non-Transactional Distributed Storage Systems](http://arxiv.org/abs/1512.00168),” arXiv:1512.00168, 12 April 2016. -1. Maurice P. Herlihy and Jeannette M. Wing: “[Linearizability: A Correctness Condition for Concurrent Objects](http://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf),” *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 12, number 3, pages 463–492, July 1990. [doi:10.1145/78969.78972](http://dx.doi.org/10.1145/78969.78972) -1. Leslie Lamport: “[On interprocess communication](https://www.microsoft.com/en-us/research/publication/interprocess-communication-part-basic-formalism-part-ii-algorithms/),” *Distributed Computing*, volume 1, number 2, pages 77–101, June 1986. [doi:10.1007/BF01786228](http://dx.doi.org/10.1007/BF01786228) -1. David K. 
Gifford: “[Information Storage in a Decentralized Computer System](http://www.mirrorservice.org/sites/www.bitsavers.org/pdf/xerox/parc/techReports/CSL-81-8_Information_Storage_in_a_Decentralized_Computer_System.pdf),” Xerox Palo Alto Research Centers, CSL-81-8, June 1981. -1. Martin Kleppmann: “[Please Stop Calling Databases CP or AP](http://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html),” *martin.kleppmann.com*, May 11, 2015. -1. Kyle Kingsbury: “[Call Me Maybe: MongoDB Stale Reads](https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads),” *aphyr.com*, April 20, 2015. -1. Kyle Kingsbury: “[Computational Techniques in Knossos](https://aphyr.com/posts/314-computational-techniques-in-knossos),” *aphyr.com*, May 17, 2014. -1. Peter Bailis: “[Linearizability Versus Serializability](http://www.bailis.org/blog/linearizability-versus-serializability/),” *bailis.org*, September 24, 2014. -1. Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman: [*Concurrency Control and Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at *research.microsoft.com*. -1. Mike Burrows: “[The Chubby Lock Service for Loosely-Coupled Distributed Systems](https://research.google/pubs/pub27897/),” at *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006. -1. Flavio P. Junqueira and Benjamin Reed: *ZooKeeper: Distributed Process Coordination*. O'Reilly Media, 2013. ISBN: 978-1-449-36130-3 -1. “[etcd Documentation](https://etcd.io/docs/),” The Linux Foundation, *etcd.io*. -1. “[Apache Curator](http://curator.apache.org/),” Apache Software Foundation, *curator.apache.org*, 2015. -1. Murali Vallath: *Oracle 10g RAC Grid, Services & Clustering*. Elsevier Digital Press, 2006. ISBN: 978-1-555-58321-7 -1. 
Peter Bailis, Alan Fekete, Michael J Franklin, et al.: “[Coordination-Avoiding Database Systems](http://arxiv.org/pdf/1402.2237.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 3, pages 185–196, November 2014. -1. Kyle Kingsbury: “[Call Me Maybe: etcd and Consul](https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul),” *aphyr.com*, June 9, 2014. -1. Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini: “[Zab: High-Performance Broadcast for Primary-Backup Systems](https://web.archive.org/web/20220419064903/https://marcoserafini.github.io/papers/zab.pdf),” at *41st IEEE International Conference on Dependable Systems and Networks* (DSN), June 2011. [doi:10.1109/DSN.2011.5958223](http://dx.doi.org/10.1109/DSN.2011.5958223) -1. Diego Ongaro and John K. Ousterhout: “[In Search of an Understandable Consensus Algorithm](https://www.usenix.org/system/files/conference/atc14/atc14-paper-ongaro.pdf),” at *USENIX Annual Technical Conference* (ATC), June 2014. -1. Hagit Attiya, Amotz Bar-Noy, and Danny Dolev: “[Sharing Memory Robustly in Message-Passing Systems](http://www.cse.huji.ac.il/course/2004/dist/p124-attiya.pdf),” *Journal of the ACM*, volume 42, number 1, pages 124–142, January 1995. [doi:10.1145/200836.200869](http://dx.doi.org/10.1145/200836.200869) -1. Nancy Lynch and Alex Shvartsman: “[Robust Emulation of Shared Memory Using Dynamic Quorum-Acknowledged Broadcasts](http://groups.csail.mit.edu/tds/papers/Lynch/FTCS97.pdf),” at *27th Annual International Symposium on Fault-Tolerant Computing* (FTCS), June 1997. [doi:10.1109/FTCS.1997.614100](http://dx.doi.org/10.1109/FTCS.1997.614100) -1. Christian Cachin, Rachid Guerraoui, and Luís Rodrigues: [*Introduction to Reliable and Secure Distributed Programming*](http://www.distributedprogramming.net/), 2nd edition. Springer, 2011. ISBN: 978-3-642-15259-7, [doi:10.1007/978-3-642-15260-3](http://dx.doi.org/10.1007/978-3-642-15260-3) -1. 
Sam Elliott, Mark Allen, and Martin Kleppmann: [personal communication](https://web.archive.org/web/20230620021338/https://twitter.com/lenary/status/654761711933648896), thread on *twitter.com*, October 15, 2015. -1. Niklas Ekström, Mikhail Panchenko, and Jonathan Ellis: “[Possible Issue with Read Repair?](http://mail-archives.apache.org/mod_mbox/cassandra-dev/201210.mbox/%3CFA480D1DC3964E2C8B0A14E0880094C9%40Robotech%3E),” email thread on *cassandra-dev* mailing list, October 2012. -1. Maurice P. Herlihy: “[Wait-Free Synchronization](https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf),” *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 13, number 1, pages 124–149, January 1991. [doi:10.1145/114005.102808](http://dx.doi.org/10.1145/114005.102808) -1. Armando Fox and Eric A. Brewer: “[Harvest, Yield, and Scalable Tolerant Systems](http://radlab.cs.berkeley.edu/people/fox/static/pubs/pdf/c18.pdf),” at *7th Workshop on Hot Topics in Operating Systems* (HotOS), March 1999. [doi:10.1109/HOTOS.1999.798396](http://dx.doi.org/10.1109/HOTOS.1999.798396) -1. Seth Gilbert and Nancy Lynch: “[Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services](http://www.comp.nus.edu.sg/~gilbert/pubs/BrewersConjecture-SigAct.pdf),” *ACM SIGACT News*, volume 33, number 2, pages 51–59, June 2002. [doi:10.1145/564585.564601](http://dx.doi.org/10.1145/564585.564601) -1. Seth Gilbert and Nancy Lynch: “[Perspectives on the CAP Theorem](http://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf),” *IEEE Computer Magazine*, volume 45, number 2, pages 30–36, February 2012. [doi:10.1109/MC.2011.389](http://dx.doi.org/10.1109/MC.2011.389) -1. Eric A. Brewer: “[CAP Twelve Years Later: How the 'Rules' Have Changed](https://web.archive.org/web/20221222092656/http://cs609.cs.ua.edu/CAP12.pdf),” *IEEE Computer Magazine*, volume 45, number 2, pages 23–29, February 2012. [doi:10.1109/MC.2012.37](http://dx.doi.org/10.1109/MC.2012.37) -1. 
Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen: “[Consistency in Partitioned Networks](http://delab.csd.auth.gr/~dimitris/courses/mpc_fall05/papers/invalidation/acm_csur85_partitioned_network_consistency.pdf),” *ACM Computing Surveys*, volume 17, number 3, pages 341–370, September 1985. [doi:10.1145/5505.5508](http://dx.doi.org/10.1145/5505.5508) -1. Paul R. Johnson and Robert H. Thomas: “[RFC 677: The Maintenance of Duplicate Databases](https://tools.ietf.org/html/rfc677),” Network Working Group, January 27, 1975. -1. Bruce G. Lindsay, Patricia Griffiths Selinger, C. Galtieri, et al.: “[Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf),” IBM Research, Research Report RJ2571(33471), July 1979. -1. Michael J. Fischer and Alan Michael: “[Sacrificing Serializability to Attain High Availability of Data in an Unreliable Network](http://www.cs.ucsb.edu/~agrawal/spring2011/ugrad/p70-fischer.pdf),” at *1st ACM Symposium on Principles of Database Systems* (PODS), March 1982. [doi:10.1145/588111.588124](http://dx.doi.org/10.1145/588111.588124) -1. Eric A. Brewer: “[NoSQL: Past, Present, Future](http://www.infoq.com/presentations/NoSQL-History),” at *QCon San Francisco*, November 2012. -1. Henry Robinson: “[CAP Confusion: Problems with 'Partition Tolerance,'](https://web.archive.org/web/20160304020135/http://blog.cloudera.com/blog/2010/04/cap-confusion-problems-with-partition-tolerance/)” *blog.cloudera.com*, April 26, 2010. -1. Adrian Cockcroft: “[Migrating to Microservices](http://www.infoq.com/presentations/migration-cloud-native),” at *QCon London*, March 2014. -1. Martin Kleppmann: “[A Critique of the CAP Theorem](http://arxiv.org/abs/1509.05393),” arXiv:1509.05393, September 17, 2015. -1. Nancy A. Lynch: “[A Hundred Impossibility Proofs for Distributed Computing](http://groups.csail.mit.edu/tds/papers/Lynch/podc89.pdf),” at *8th ACM Symposium on Principles of Distributed Computing* (PODC), August 1989. 
[doi:10.1145/72981.72982](http://dx.doi.org/10.1145/72981.72982) -1. Hagit Attiya, Faith Ellen, and Adam Morrison: “[Limitations of Highly-Available Eventually-Consistent Data Stores](https://www.cs.tau.ac.il/~mad/publications/podc2015-replds.pdf),” at *ACM Symposium on Principles of Distributed Computing* (PODC), July 2015. [doi:10.1145/2767386.2767419](http://dx.doi.org/10.1145/2767386.2767419) -1. Peter Sewell, Susmit Sarkar, Scott Owens, et al.: “[x86-TSO: A Rigorous and Usable Programmer's Model for x86 Multiprocessors](http://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf),” *Communications of the ACM*, volume 53, number 7, pages 89–97, July 2010. [doi:10.1145/1785414.1785443](http://dx.doi.org/10.1145/1785414.1785443) -1. Martin Thompson: “[Memory Barriers/Fences](http://mechanical-sympathy.blogspot.co.uk/2011/07/memory-barriersfences.html),” *mechanical-sympathy.blogspot.co.uk*, July 24, 2011. -1. Ulrich Drepper: “[What Every Programmer Should Know About Memory](http://www.akkadia.org/drepper/cpumemory.pdf),” *akkadia.org*, November 21, 2007. -1. Daniel J. Abadi: “[Consistency Tradeoffs in Modern Distributed Database System Design](http://cs-www.cs.yale.edu/homes/dna/papers/abadi-pacelc.pdf),” *IEEE Computer Magazine*, volume 45, number 2, pages 37–42, February 2012. [doi:10.1109/MC.2012.33](http://dx.doi.org/10.1109/MC.2012.33) -1. Hagit Attiya and Jennifer L. Welch: “[Sequential Consistency Versus Linearizability](http://courses.csail.mit.edu/6.852/01/papers/p91-attiya.pdf),” *ACM Transactions on Computer Systems* (TOCS), volume 12, number 2, pages 91–122, May 1994. [doi:10.1145/176575.176576](http://dx.doi.org/10.1145/176575.176576) -1. Mustaque Ahamad, Gil Neiger, James E. Burns, et al.: “[Causal Memory: Definitions, Implementation, and Programming](http://www-i2.informatik.rwth-aachen.de/i2/fileadmin/user_upload/documents/Seminar_MCMM11/Causal_memory_1996.pdf),” *Distributed Computing*, volume 9, number 1, pages 37–49, March 1995. 
[doi:10.1007/BF01784241](http://dx.doi.org/10.1007/BF01784241) -1. Wyatt Lloyd, Michael J. Freedman, Michael Kaminsky, and David G. Andersen: “[Stronger Semantics for Low-Latency Geo-Replicated Storage](https://www.usenix.org/system/files/conference/nsdi13/nsdi13-final149.pdf),” at *10th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2013. -1. Marek Zawirski, Annette Bieniusa, Valter Balegas, et al.: “[SwiftCloud: Fault-Tolerant Geo-Replication Integrated All the Way to the Client Machine](http://arxiv.org/abs/1310.3107),” INRIA Research Report 8347, August 2013. -1. Peter Bailis, Ali Ghodsi, Joseph M Hellerstein, and Ion Stoica: “[Bolt-on Causal Consistency](http://db.cs.berkeley.edu/papers/sigmod13-bolton.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), June 2013. -1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015. -1. Peter Bailis: “[Causality Is Expensive (and What to Do About It)](http://www.bailis.org/blog/causality-is-expensive-and-what-to-do-about-it/),” *bailis.org*, February 5, 2014. -1. Ricardo Gonçalves, Paulo Sérgio Almeida, Carlos Baquero, and Victor Fonte: “[Concise Server-Wide Causality Management for Eventually Consistent Data Stores](https://web.archive.org/web/20220810205439/http://haslab.uminho.pt/tome/files/global_logical_clocks.pdf),” at *15th IFIP International Conference on Distributed Applications and Interoperable Systems* (DAIS), June 2015. [doi:10.1007/978-3-319-19129-4_6](http://dx.doi.org/10.1007/978-3-319-19129-4_6) -1. Rob Conery: “[A Better ID Generator for PostgreSQL](https://web.archive.org/web/20220118044729/http://rob.conery.io/2014/05/29/a-better-id-generator-for-postgresql/),” *rob.conery.io*, May 29, 2014. -1. 
Leslie Lamport: “[Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/),” *Communications of the ACM*, volume 21, number 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](http://dx.doi.org/10.1145/359545.359563) -1. Xavier Défago, André Schiper, and Péter Urbán: “[Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey](https://dspace.jaist.ac.jp/dspace/bitstream/10119/4883/1/defago_et_al.pdf),” *ACM Computing Surveys*, volume 36, number 4, pages 372–421, December 2004. [doi:10.1145/1041680.1041682](http://dx.doi.org/10.1145/1041680.1041682) -1. Hagit Attiya and Jennifer Welch: *Distributed Computing: Fundamentals, Simulations and Advanced Topics*, 2nd edition. John Wiley & Sons, 2004. ISBN: 978-0-471-45324-6, [doi:10.1002/0471478210](http://dx.doi.org/10.1002/0471478210) -1. Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, et al.: “[CORFU: A Shared Log Design for Flash Clusters](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final30.pdf),” at *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012. -1. Fred B. Schneider: “[Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial](http://www.cs.cornell.edu/fbs/publications/smsurvey.pdf),” *ACM Computing Surveys*, volume 22, number 4, pages 299–319, December 1990. -1. Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, et al.: “[Calvin: Fast Distributed Transactions for Partitioned Database Systems](http://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), May 2012. -1. 
Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, et al.: “[Tango: Distributed Data Structures over a Shared Log](https://www.microsoft.com/en-us/research/publication/tango-distributed-data-structures-over-a-shared-log/),” at *24th ACM Symposium on Operating Systems Principles* (SOSP), November 2013. [doi:10.1145/2517349.2522732](http://dx.doi.org/10.1145/2517349.2522732) -1. Robbert van Renesse and Fred B. Schneider: “[Chain Replication for Supporting High Throughput and Availability](http://static.usenix.org/legacy/events/osdi04/tech/full_papers/renesse/renesse.pdf),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004. -1. Leslie Lamport: “[How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs](https://lamport.azurewebsites.net/pubs/multi.pdf),” *IEEE Transactions on Computers*, volume 28, number 9, pages 690–691, September 1979. [doi:10.1109/TC.1979.1675439](http://dx.doi.org/10.1109/TC.1979.1675439) -1. Enis Söztutar, Devaraj Das, and Carter Shanklin: “[Apache HBase High Availability at the Next Level](https://web.archive.org/web/20160405122821/http://hortonworks.com/blog/apache-hbase-high-availability-next-level/),” *hortonworks.com*, January 22, 2015. -1. Brian F Cooper, Raghu Ramakrishnan, Utkarsh Srivastava, et al.: “[PNUTS: Yahoo!’s Hosted Data Serving Platform](http://www.mpi-sws.org/~druschel/courses/ds/papers/cooper-pnuts.pdf),” at *34th International Conference on Very Large Data Bases* (VLDB), August 2008. [doi:10.14778/1454159.1454167](http://dx.doi.org/10.14778/1454159.1454167) -1. Tushar Deepak Chandra and Sam Toueg: “[Unreliable Failure Detectors for Reliable Distributed Systems](http://courses.csail.mit.edu/6.852/08/papers/CT96-JACM.pdf),” *Journal of the ACM*, volume 43, number 2, pages 225–267, March 1996. [doi:10.1145/226643.226647](http://dx.doi.org/10.1145/226643.226647) -1. Michael J. Fischer, Nancy Lynch, and Michael S. 
Paterson: “[Impossibility of Distributed Consensus with One Faulty Process](https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf),” *Journal of the ACM*, volume 32, number 2, pages 374–382, April 1985. [doi:10.1145/3149.214121](http://dx.doi.org/10.1145/3149.214121) -1. Michael Ben-Or: “Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols,” at *2nd ACM Symposium on Principles of Distributed Computing* (PODC), August 1983. [doi:10.1145/800221.806707](http://dl.acm.org/citation.cfm?id=806707) -1. Jim N. Gray and Leslie Lamport: “[Consensus on Transaction Commit](http://db.cs.berkeley.edu/cs286/papers/paxoscommit-tods2006.pdf),” *ACM Transactions on Database Systems* (TODS), volume 31, number 1, pages 133–160, March 2006. [doi:10.1145/1132863.1132867](http://dx.doi.org/10.1145/1132863.1132867) -1. Rachid Guerraoui: “[Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus](https://citeseerx.ist.psu.edu/pdf/5d06489503b6f791aa56d2d7942359c2592e44b0),” at *9th International Workshop on Distributed Algorithms* (WDAG), September 1995. [doi:10.1007/BFb0022140](http://dx.doi.org/10.1007/BFb0022140) -1. Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, et al.: “[All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications](http://research.cs.wisc.edu/wind/Publications/alice-osdi14.pdf),” at *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014. -1. Jim Gray: “[The Transaction Concept: Virtues and Limitations](http://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf),” at *7th International Conference on Very Large Data Bases* (VLDB), September 1981. -1. Hector Garcia-Molina and Kenneth Salem: “[Sagas](http://www.cs.cornell.edu/andru/cs711/2002fa/reading/sagas.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), May 1987. [doi:10.1145/38713.38742](http://dx.doi.org/10.1145/38713.38742) -1. C. 
Mohan, Bruce G. Lindsay, and Ron Obermarck: “[Transaction Management in the R* Distributed Database Management System](https://cs.brown.edu/courses/csci2270/archives/2012/papers/dtxn/p378-mohan.pdf),” *ACM Transactions on Database Systems*, volume 11, number 4, pages 378–396, December 1986. [doi:10.1145/7239.7266](http://dx.doi.org/10.1145/7239.7266) -1. “[Distributed Transaction Processing: The XA Specification](http://pubs.opengroup.org/onlinepubs/009680699/toc.pdf),” X/Open Company Ltd., Technical Standard XO/CAE/91/300, December 1991. ISBN: 978-1-872-63024-3 -1. Mike Spille: “[XA Exposed, Part II](http://www.jroller.com/pyrasun/entry/xa_exposed_part_ii_schwartz),” *jroller.com*, April 3, 2004. -1. Ivan Silva Neto and Francisco Reverbel: “[Lessons Learned from Implementing WS-Coordination and WS-AtomicTransaction](http://www.ime.usp.br/~reverbel/papers/icis2008.pdf),” at *7th IEEE/ACIS International Conference on Computer and Information Science* (ICIS), May 2008. [doi:10.1109/ICIS.2008.75](http://dx.doi.org/10.1109/ICIS.2008.75) -1. James E. Johnson, David E. Langworthy, Leslie Lamport, and Friedrich H. Vogt: “[Formal Specification of a Web Services Protocol](https://www.microsoft.com/en-us/research/publication/formal-specification-of-a-web-services-protocol/),” at *1st International Workshop on Web Services and Formal Methods* (WS-FM), February 2004. [doi:10.1016/j.entcs.2004.02.022](http://dx.doi.org/10.1016/j.entcs.2004.02.022) -1. Dale Skeen: “[Nonblocking Commit Protocols](http://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/Ske81.pdf),” at *ACM International Conference on Management of Data* (SIGMOD), April 1981. [doi:10.1145/582318.582339](http://dx.doi.org/10.1145/582318.582339) -1. Gregor Hohpe: “[Your Coffee Shop Doesn’t Use Two-Phase Commit](http://www.martinfowler.com/ieeeSoftware/coffeeShop.pdf),” *IEEE Software*, volume 22, number 2, pages 64–66, March 2005. [doi:10.1109/MS.2005.52](http://dx.doi.org/10.1109/MS.2005.52) -1. 
Pat Helland: “[Life Beyond Distributed Transactions: An Apostate’s Opinion](https://web.archive.org/web/20210303104924/http://www-db.cs.wisc.edu/cidr/cidr2007/papers/cidr07p15.pdf),” at *3rd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2007. -1. Jonathan Oliver: “[My Beef with MSDTC and Two-Phase Commits](http://blog.jonathanoliver.com/my-beef-with-msdtc-and-two-phase-commits/),” *blog.jonathanoliver.com*, April 4, 2011. -1. Oren Eini (Ahende Rahien): “[The Fallacy of Distributed Transactions](http://ayende.com/blog/167362/the-fallacy-of-distributed-transactions),” *ayende.com*, July 17, 2014. -1. Clemens Vasters: “[Transactions in Windows Azure (with Service Bus) – An Email Discussion](https://blogs.msdn.microsoft.com/clemensv/2012/07/30/transactions-in-windows-azure-with-service-bus-an-email-discussion/),” *vasters.com*, July 30, 2012. -1. “[Understanding Transactionality in Azure](https://docs.particular.net/nservicebus/azure/understanding-transactionality-in-azure),” NServiceBus Documentation, Particular Software, 2015. -1. Randy Wigginton, Ryan Lowe, Marcos Albe, and Fernando Ipar: “[Distributed Transactions in MySQL](https://web.archive.org/web/20161010054152/https://www.percona.com/live/mysql-conference-2013/sites/default/files/slides/XA_final.pdf),” at *MySQL Conference and Expo*, April 2013. -1. Mike Spille: “[XA Exposed, Part I](https://web.archive.org/web/20130523064202/http://www.jroller.com/pyrasun/entry/xa_exposed),” *jroller.com*, April 3, 2004. -1. Ajmer Dhariwal: “[Orphaned MSDTC Transactions (-2 spids)](https://www.eraofdata.com/posts/2008/orphaned-msdtc-transactions-2-spids/),” *eraofdata.com*, December 12, 2008. -1. Paul Randal: “[Real World Story of DBCC PAGE Saving the Day](http://www.sqlskills.com/blogs/paul/real-world-story-of-dbcc-page-saving-the-day/),” *sqlskills.com*, June 19, 2013. -1. 
“[in-doubt xact resolution Server Configuration Option](https://msdn.microsoft.com/en-us/library/ms179586.aspx),” SQL Server 2016 documentation, Microsoft, Inc., 2016. -1. Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer: “[Consensus in the Presence of Partial Synchrony](https://web.archive.org/web/20210318133551/https://www.net.t-labs.tu-berlin.de/~petr/ADC-07/papers/DLS88.pdf),” *Journal of the ACM*, volume 35, number 2, pages 288–323, April 1988. [doi:10.1145/42282.42283](http://dx.doi.org/10.1145/42282.42283) -1. Miguel Castro and Barbara H. Liskov: “[Practical Byzantine Fault Tolerance and Proactive Recovery](https://web.archive.org/web/20181123142540/http://zoo.cs.yale.edu/classes/cs426/2012/bib/castro02practical.pdf),” *ACM Transactions on Computer Systems*, volume 20, number 4, pages 396–461, November 2002. [doi:10.1145/571637.571640](http://dx.doi.org/10.1145/571637.571640) -1. Brian M. Oki and Barbara H. Liskov: “[Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems](http://www.cs.princeton.edu/courses/archive/fall11/cos518/papers/viewstamped.pdf),” at *7th ACM Symposium on Principles of Distributed Computing* (PODC), August 1988. [doi:10.1145/62546.62549](http://dx.doi.org/10.1145/62546.62549) -1. Barbara H. Liskov and James Cowling: “[Viewstamped Replication Revisited](http://pmg.csail.mit.edu/papers/vr-revisited.pdf),” Massachusetts Institute of Technology, Tech Report MIT-CSAIL-TR-2012-021, July 2012. -1. Leslie Lamport: “[The Part-Time Parliament](https://www.microsoft.com/en-us/research/publication/part-time-parliament/),” *ACM Transactions on Computer Systems*, volume 16, number 2, pages 133–169, May 1998. [doi:10.1145/279227.279229](http://dx.doi.org/10.1145/279227.279229) -1. Leslie Lamport: “[Paxos Made Simple](https://www.microsoft.com/en-us/research/publication/paxos-made-simple/),” *ACM SIGACT News*, volume 32, number 4, pages 51–58, December 2001. -1. 
Tushar Deepak Chandra, Robert Griesemer, and Joshua Redstone: “[Paxos Made Live – An Engineering Perspective](http://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf),” at *26th ACM Symposium on Principles of Distributed Computing* (PODC), June 2007. -1. Robbert van Renesse: “[Paxos Made Moderately Complex](http://www.cs.cornell.edu/home/rvr/Paxos/paxos.pdf),” *cs.cornell.edu*, March 2011. -1. Diego Ongaro: “[Consensus: Bridging Theory and Practice](https://github.com/ongardie/dissertation),” PhD Thesis, Stanford University, August 2014. -1. Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy, and Jon Crowcroft: “[Raft Refloated: Do We Have Consensus?](https://web.archive.org/web/20230319151303/https://www.cl.cam.ac.uk/~ms705/pub/papers/2015-osr-raft.pdf),” *ACM SIGOPS Operating Systems Review*, volume 49, number 1, pages 12–21, January 2015. [doi:10.1145/2723872.2723876](http://dx.doi.org/10.1145/2723872.2723876) -1. André Medeiros: “[ZooKeeper’s Atomic Broadcast Protocol: Theory and Practice](http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf),” Aalto University School of Science, March 20, 2012. -1. Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider: “[Vive La Différence: Paxos vs. Viewstamped Replication vs. Zab](http://arxiv.org/abs/1309.5671),” *IEEE Transactions on Dependable and Secure Computing*, volume 12, number 4, pages 472–484, September 2014. [doi:10.1109/TDSC.2014.2355848](http://dx.doi.org/10.1109/TDSC.2014.2355848) -1. Will Portnoy: “[Lessons Learned from Implementing Paxos](http://blog.willportnoy.com/2012/06/lessons-learned-from-paxos.html),” *blog.willportnoy.com*, June 14, 2012. -1. Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman: “[Flexible Paxos: Quorum Intersection Revisited](https://drops.dagstuhl.de/opus/volltexte/2017/7094/pdf/LIPIcs-OPODIS-2016-25.pdf),” at *20th International Conference on Principles of Distributed Systems* (OPODIS), December 2016. 
[doi:10.4230/LIPIcs.OPODIS.2016.25](http://dx.doi.org/10.4230/LIPIcs.OPODIS.2016.25) -1. Heidi Howard and Jon Crowcroft: “[Coracle: Evaluating Consensus at the Internet Edge](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p85.pdf),” at *Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. [doi:10.1145/2829988.2790010](http://dx.doi.org/10.1145/2829988.2790010) -1. Kyle Kingsbury: “[Call Me Maybe: Elasticsearch 1.5.0](https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0),” *aphyr.com*, April 27, 2015. -1. Ivan Kelly: “[BookKeeper Tutorial](https://github.com/ivankelly/bookkeeper-tutorial),” *github.com*, October 2014. -1. Camille Fournier: “[Consensus Systems for the Skeptical Architect](https://vimeo.com/102667163),” at *Philly ETE*, Philadelphia, PA, USA, April 2014. -1. Kenneth P. Birman: “[A History of the Virtual Synchrony Replication Model](https://ptolemy.berkeley.edu/projects/truststc/pubs/713/History%20of%20the%20Virtual%20Synchrony%20Replication%20Model%202010.pdf),” in *Replication: Theory and Practice*, Springer LNCS volume 5959, chapter 6, pages 91–120, 2010. ISBN: 978-3-642-11293-5, [doi:10.1007/978-3-642-11294-2_6](http://dx.doi.org/10.1007/978-3-642-11294-2_6) +Rapid feedback about a remote node being down is useful, but you can’t count on it. If something has +gone wrong, you may get an error response at some level of the stack, but in general you have to +assume that you will get no response at all. You can retry a few times, wait for a timeout to +elapse, and eventually declare the node dead if you don’t hear back within the timeout. + +## Timeouts and Unbounded Delays + +If a timeout is the only sure way of detecting a fault, then how long should the timeout be? There +is unfortunately no simple answer. + +A long timeout means a long wait until a node is declared dead (and during this time, users may have +to wait or see error messages). 
A short timeout detects faults faster, but carries a higher risk of +incorrectly declaring a node dead when in fact it has only suffered a temporary slowdown (e.g., due +to a load spike on the node or the network). + +Prematurely declaring a node dead is problematic: if the node is actually alive and in the middle of +performing some action (for example, sending an email), and another node takes over, the action may +end up being performed twice. We will discuss this issue in more detail in +[“Knowledge, Truth, and Lies”](/en/ch9#sec_distributed_truth), and in +Chapters [10](/en/ch10#ch_consistency) +and [Link to Come]. + +When a node is declared dead, its responsibilities need to be transferred to other nodes, which +places additional load on other nodes and the network. If the system is already struggling with high +load, declaring nodes dead prematurely can make the problem worse. In particular, it could happen +that the node actually wasn’t dead but only slow to respond due to overload; transferring its load +to other nodes can cause a cascading failure (in the extreme case, all nodes declare each other +dead, and everything stops working—see [“When an overloaded system won’t recover”](/en/ch2#sidebar_metastable)). + +Imagine a fictitious system with a network that guaranteed a maximum delay for packets—every packet +is either delivered within some time *d*, or it is lost, but delivery never takes longer than *d*. +Furthermore, assume that you can guarantee that a non-failed node always handles a request within +some time *r*. In this case, you could guarantee that every successful request receives a response +within time 2*d* + *r*—and if you don’t receive a response within that time, you know +that either the network or the remote node is not working. If this was true, +2*d* + *r* would be a reasonable timeout to use. 
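
As a minimal sketch of the reasoning above, the following Python snippet computes the 2*d* + *r* bound and uses it as the timeout for a few retries before declaring a node dead. The names `bounded_timeout_ms` and `node_seems_alive`, and the `send_request` callable, are hypothetical and chosen only for illustration. The sketch assumes the fictitious network with a guaranteed maximum delay, which real asynchronous networks do not provide.

```python
from typing import Callable


def bounded_timeout_ms(d_ms: int, r_ms: int) -> int:
    """Worst-case response time in the fictitious bounded-delay system.

    d_ms: assumed maximum one-way network delay, in milliseconds.
    r_ms: assumed maximum request handling time, in milliseconds.
    """
    # One network hop for the request, the handling time, one hop for the reply.
    return 2 * d_ms + r_ms


def node_seems_alive(send_request: Callable[[int], bool],
                     d_ms: int, r_ms: int, attempts: int = 3) -> bool:
    """Retry a few times; report the node as dead only if every attempt times out.

    send_request is a hypothetical callable that sends one request and returns
    True if a response arrived within the given timeout (in milliseconds).
    """
    timeout_ms = bounded_timeout_ms(d_ms, r_ms)
    for _ in range(attempts):
        if send_request(timeout_ms):
            return True
    return False
```

With *d* = 10 ms and *r* = 50 ms, every attempt would wait at most 70 ms. The retries only cover lost packets; in the fictitious system a single missed deadline already shows that either the network or the remote node is not working.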
+ +Unfortunately, most systems we work with have neither of those guarantees: asynchronous networks +have *unbounded delays* (that is, they try to deliver packets as quickly as possible, but there is +no upper limit on the time it may take for a packet to arrive), and most server implementations +cannot guarantee that they can handle requests within some maximum time (see +[“Response time guarantees”](/en/ch9#sec_distributed_clocks_realtime)). For failure detection, it’s not sufficient for the system to +be fast most of the time: if your timeout is low, it only takes a transient spike in round-trip +times to throw the system off-balance. + +### Network congestion and queueing + +When driving a car, travel times on road networks often vary most due to traffic congestion. +Similarly, the variability of packet delays on computer networks is most often due to queueing +[[27](/en/ch9#Grosvenor2015)]: + +* If several different nodes simultaneously try to send packets to the same destination, the network + switch must queue them up and feed them into the destination network link one by one (as illustrated + in [Figure 9-2](/en/ch9#fig_distributed_switch_queueing)). On a busy network link, a packet may have to wait a while + until it can get a slot (this is called *network congestion*). If there is so much incoming data + that the switch queue fills up, the packet is dropped, so it needs to be resent—even though + the network is functioning fine. +* When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming + request from the network is queued by the operating system until the application is ready to + handle it. Depending on the load on the machine, this may take an arbitrary length of time + [[28](/en/ch9#Julienne2019)]. +* In virtualized environments, a running operating system is often paused for tens of milliseconds + while another virtual machine uses a CPU core. 
During this time, the VM cannot consume any data + from the network, so the incoming data is queued (buffered) by the virtual machine monitor + [[29](/en/ch9#Wang2010)], + further increasing the variability of network delays. +* As mentioned earlier, in order to avoid overloading the network, TCP limits the rate at which it + sends data. This means additional queueing at the sender before the data even enters the network. + +![ddia 0902](/fig/ddia_0902.png) + +###### Figure 9-2. If several machines send network traffic to the same destination, its switch queue can fill up. Here, ports 1, 2, and 4 are all trying to send packets to port 3. + +Moreover, when TCP detects and automatically retransmits a lost packet, although the application +does not see the packet loss directly, it does see the resulting delay (waiting for the timeout to +expire, and then waiting for the retransmitted packet to be acknowledged). + +# TCP Versus UDP + +Some latency-sensitive applications, such as videoconferencing and Voice over IP (VoIP), use UDP +rather than TCP. It’s a trade-off between reliability and variability of delays: as UDP does not +perform flow control and does not retransmit lost packets, it avoids some of the reasons for +variable network delays (although it is still susceptible to switch queues and scheduling delays). + +UDP is a good choice in situations where delayed data is worthless. For example, in a VoIP phone +call, there probably isn’t enough time to retransmit a lost packet before its data is due to be +played over the loudspeakers. In this case, there’s no point in retransmitting the packet—the +application must instead fill the missing packet’s time slot with silence (causing a brief +interruption in the sound) and move on in the stream. The retry happens at the human layer instead. +(“Could you repeat that please? The sound just cut out for a moment.”) + +All of these factors contribute to the variability of network delays. 
Queueing delays have an +especially wide range when a system is close to its maximum capacity: a system with plenty of spare +capacity can easily drain queues, whereas in a highly utilized system, long queues can build up very +quickly. + +In public clouds and multitenant datacenters, resources are shared among many customers: the +network links and switches, and even each machine’s network interface and CPUs (when running on +virtual machines), are shared. Processing large amounts of data can use the entire capacity of +network links (*saturate* them). As you have no control over or insight into other customers’ usage of the shared +resources, network delays can be highly variable if someone near you (a *noisy neighbor*) is +using a lot of resources [[30](/en/ch9#Philips2014), +[31](/en/ch9#Newman2012)]. + +In such environments, you can only choose timeouts experimentally: measure the distribution of +network round-trip times over an extended period, and over many machines, to determine the expected +variability of delays. Then, taking into account your application’s characteristics, you can +determine an appropriate trade-off between failure detection delay and risk of premature timeouts. + +Even better, rather than using configured constant timeouts, systems can continually measure +response times and their variability (*jitter*), and automatically adjust timeouts according to the +observed response time distribution. The Phi Accrual failure detector +[[32](/en/ch9#Hayashibara2004)], +which is used, for example, in Akka and Cassandra +[[33](/en/ch9#Wang2013)], +is one way of doing this. TCP retransmission timeouts also work similarly +[[5](/en/ch9#Jacobson1988)]. + +## Synchronous Versus Asynchronous Networks + +Distributed systems would be a lot simpler if we could rely on the network to deliver packets with +some fixed maximum delay, and not to drop packets.
Why can’t we solve this at the hardware level +and make the network reliable so that the software doesn’t need to worry about it? + +To answer this question, it’s interesting to compare datacenter networks to the traditional fixed-line +telephone network (non-cellular, non-VoIP), which is extremely reliable: delayed audio +frames and dropped calls are very rare. A phone call requires a constantly low end-to-end latency +and enough bandwidth to transfer the audio samples of your voice. Wouldn’t it be nice to have +similar reliability and predictability in computer networks? + +When you make a call over the telephone network, it establishes a *circuit*: a fixed, guaranteed +amount of bandwidth is allocated for the call, along the entire route between the two callers. This +circuit remains in place until the call ends +[[34](/en/ch9#Keshav1997)]. +For example, an ISDN network runs at a fixed rate of 4,000 frames per second. When a call is +established, it is allocated 16 bits of space within each frame (in each direction). Thus, for the +duration of the call, each side is guaranteed to be able to send exactly 16 bits of audio data every +250 microseconds +[[35](/en/ch9#Kyas1995)]. + +This kind of network is *synchronous*: even as data passes through several routers, it does not +suffer from queueing, because the 16 bits of space for the call have already been reserved in the +next hop of the network. And because there is no queueing, the maximum end-to-end latency of the +network is fixed. We call this a *bounded delay*. + +### Can we not simply make network delays predictable? + +Note that a circuit in a telephone network is very different from a TCP connection: a circuit is a +fixed amount of reserved bandwidth which nobody else can use while the circuit is established, +whereas the packets of a TCP connection opportunistically use whatever network bandwidth is +available. 
You can give TCP a variable-sized block of data (e.g., an email or a web page), and it +will try to transfer it in the shortest time possible. While a TCP connection is idle, it doesn’t +use any bandwidth (except perhaps for an occasional keepalive packet). + +If datacenter networks and the internet were circuit-switched networks, it would be possible to +establish a guaranteed maximum round-trip time when a circuit was set up. However, they are not: +Ethernet and IP are packet-switched protocols, which suffer from queueing and thus unbounded delays +in the network. These protocols do not have the concept of a circuit. + +Why do datacenter networks and the internet use packet switching? The answer is that they are +optimized for *bursty traffic*. A circuit is good for an audio or video call, which needs to +transfer a fairly constant number of bits per second for the duration of the call. On the other +hand, requesting a web page, sending an email, or transferring a file doesn’t have any particular +bandwidth requirement—we just want it to complete as quickly as possible. + +If you wanted to transfer a file over a circuit, you would have to guess a bandwidth allocation. If +you guess too low, the transfer is unnecessarily slow, leaving network capacity unused. If you guess +too high, the circuit cannot be set up (because the network cannot allow a circuit to be created if +its bandwidth allocation cannot be guaranteed). Thus, using circuits for bursty data transfers +wastes network capacity and makes transfers unnecessarily slow. By contrast, TCP dynamically adapts +the rate of data transfer to the available network capacity. + +There have been some attempts to build hybrid networks that support both circuit switching and +packet switching. *Asynchronous Transfer Mode* (ATM) was a competitor to Ethernet in the 1980s, but +it didn’t gain much adoption outside of telephone network core switches. 
InfiniBand has some similarities +[[36](/en/ch9#Mellanox2014)]: +it implements end-to-end flow control at the link layer, which reduces the need for queueing in the +network, although it can still suffer from delays due to link congestion +[[37](/en/ch9#Santos2003)]. +With careful use of *quality of service* (QoS, prioritization and scheduling of packets) and *admission +control* (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or +provide statistically bounded delay [[27](/en/ch9#Grosvenor2015), +[34](/en/ch9#Keshav1997)]. New network algorithms like Low Latency, Low +Loss, and Scalable Throughput (L4S) attempt to mitigate some of the queuing and congestion control +problems both at the client and router level. Linux’s traffic controller (TC) also allows +applications to reprioritize packets for QoS purposes. + +# Latency and Resource Utilization + +More generally, you can think of variable delays as a consequence of dynamic resource partitioning. + +Say you have a wire between two telephone switches that can carry up to 10,000 simultaneous calls. +Each circuit that is switched over this wire occupies one of those call slots. Thus, you can think of +the wire as a resource that can be shared by up to 10,000 simultaneous users. The resource is +divided up in a *static* way: even if you’re the only call on the wire right now, and all other +9,999 slots are unused, your circuit is still allocated the same fixed amount of bandwidth as when +the wire is fully utilized. + +By contrast, the internet shares network bandwidth *dynamically*. Senders push and jostle with each +other to get their packets over the wire as quickly as possible, and the network switches decide +which packet to send (i.e., the bandwidth allocation) from one moment to the next. This approach has the +downside of queueing, but the advantage is that it maximizes utilization of the wire. 
The wire has a +fixed cost, so if you utilize it better, each byte you send over the wire is cheaper. + +A similar situation arises with CPUs: if you share each CPU core dynamically between several +threads, one thread sometimes has to wait in the operating system’s run queue while another thread +is running, so a thread can be paused for varying lengths of time +[[38](/en/ch9#Li2014)]. +However, this utilizes the hardware better than if you allocated a static number of CPU cycles to +each thread (see [“Response time guarantees”](/en/ch9#sec_distributed_clocks_realtime)). Better hardware utilization is also why cloud +platforms run several virtual machines from different customers on the same physical machine. + +Latency guarantees are achievable in certain environments, if resources are statically partitioned +(e.g., dedicated hardware and exclusive bandwidth allocations). However, this comes at the cost of +reduced utilization; in other words, it is more expensive. On the other hand, multitenancy with +dynamic resource partitioning provides better utilization, so it is cheaper, but it has the downside +of variable delays. + +Variable delays in networks are not a law of nature, but simply the result of a cost/benefit +trade-off. + +However, such quality of service is currently not enabled in multitenant datacenters and public +clouds, or when communicating via the internet. +Currently deployed technology does not allow us to make any guarantees about delays or reliability +of the network: we have to assume that network congestion, queueing, and unbounded delays will +happen. Consequently, there’s no “correct” value for timeouts: they need to be determined +experimentally. + +Peering agreements between internet service providers and the establishment of routes through the +Border Gateway Protocol (BGP) bear a closer resemblance to circuit switching than IP itself does. At this +level, it is possible to buy dedicated bandwidth.
However, internet routing operates at the level of +networks, not individual connections between hosts, and at a much longer timescale. + +# Unreliable Clocks + +Clocks and time are important. Applications depend on clocks in various ways to answer questions +like the following: + +1. Has this request timed out yet? +2. What’s the 99th percentile response time of this service? +3. How many queries per second did this service handle on average in the last five minutes? +4. How long did the user spend on our site? +5. When was this article published? +6. At what date and time should the reminder email be sent? +7. When does this cache entry expire? +8. What is the timestamp on this error message in the log file? + +Examples 1–4 measure *durations* (e.g., the time interval between a request being sent and a +response being received), whereas examples 5–8 describe *points in time* (events that occur on a +particular date, at a particular time). + +In a distributed system, time is a tricky business, because communication is not instantaneous: it +takes time for a message to travel across the network from one machine to another. The time when a +message is received is always later than the time when it is sent, but due to variable delays in the +network, we don’t know how much later. This fact sometimes makes it difficult to determine the order +in which things happened when multiple machines are involved. + +Moreover, each machine on the network has its own clock, which is an actual hardware device: usually +a quartz crystal oscillator. These devices are not perfectly accurate, so each machine has its own +notion of time, which may be slightly faster or slower than on other machines. It is possible to +synchronize clocks to some degree: the most commonly used mechanism is the Network Time Protocol (NTP), which +allows the computer clock to be adjusted according to the time reported by a group of servers +[[39](/en/ch9#Windl2006)]. 
+The servers in turn get their time from a more accurate time source, such as a GPS receiver. + +## Monotonic Versus Time-of-Day Clocks + +Modern computers have at least two different kinds of clocks: a *time-of-day clock* and a *monotonic +clock*. Although they both measure time, it is important to distinguish the two, since they serve +different purposes. + +### Time-of-day clocks + +A time-of-day clock does what you intuitively expect of a clock: it returns the current date and +time according to some calendar (also known as *wall-clock time*). For example, +`clock_gettime(CLOCK_REALTIME)` on Linux and +`System.currentTimeMillis()` in Java return the number of seconds (or milliseconds) since the +*epoch*: midnight UTC on January 1, 1970, according to the Gregorian calendar, not counting leap +seconds. Some systems use other dates as their reference point. +(Although the Linux clock is called *real-time*, it has nothing to do with real-time operating +systems, as discussed in [“Response time guarantees”](/en/ch9#sec_distributed_clocks_realtime).) + +Time-of-day clocks are usually synchronized with NTP, which means that a timestamp from one machine +(ideally) means the same as a timestamp on another machine. However, time-of-day clocks also have +various oddities, as described in the next section. In particular, if the local clock is too far +ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in +time. These jumps, as well as similar jumps caused by leap seconds, make time-of-day clocks +unsuitable for measuring elapsed time +[[40](/en/ch9#GrahamCumming2017)]. + +Time-of-day clocks can experience jumps due to the start and end of Daylight Saving Time (DST); +these can be avoided by always using UTC as the time zone, which does not have DST. +Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward +in steps of 10 ms on older Windows systems +[[41](/en/ch9#Holmes2006)].
+On recent systems, this is less of a problem. + +### Monotonic clocks + +A monotonic clock is suitable for measuring a duration (time interval), such as a timeout or a +service’s response time: `clock_gettime(CLOCK_MONOTONIC)` or `clock_gettime(CLOCK_BOOTTIME)` on +Linux [[42](/en/ch9#Greef2021)] +and `System.nanoTime()` in Java are monotonic clocks, for example. The name comes from the fact that +they are guaranteed to always move forward (whereas a time-of-day clock may jump back in time). + +You can check the value of the monotonic clock at one point in time, do something, and then check +the clock again at a later time. The *difference* between the two values tells you how much time +elapsed between the two checks — more like a stopwatch than a wall clock. However, the *absolute* +value of the clock is meaningless: it might be the number of nanoseconds since the computer was +booted up, or something similarly arbitrary. In particular, it makes no sense to compare monotonic +clock values from two different computers, because they don’t mean the same thing. + +On a server with multiple CPU sockets, there may be a separate timer per CPU, which is not +necessarily synchronized with other CPUs +[[43](/en/ch9#Yang2015)]. +Operating systems compensate for any discrepancy and try +to present a monotonic view of the clock to application threads, even as they are scheduled across +different CPUs. However, it is wise to take this guarantee of monotonicity with a pinch of salt +[[44](/en/ch9#Loughran2015)]. + +NTP may adjust the frequency at which the monotonic clock moves forward (this is known as *slewing* +the clock) if it detects that the computer’s local quartz is moving faster or slower than the NTP +server. By default, NTP allows the clock rate to be speeded up or slowed down by up to 0.05%, but +NTP cannot cause the monotonic clock to jump forward or backward. 
The resolution of monotonic +clocks is usually quite good: on most systems they can measure time intervals in microseconds or +less. + +In a distributed system, using a monotonic clock for measuring elapsed time (e.g., timeouts) is +usually fine, because it doesn’t assume any synchronization between different nodes’ clocks and is +not sensitive to slight inaccuracies of measurement. + +## Clock Synchronization and Accuracy + +Monotonic clocks don’t need synchronization, but time-of-day clocks need to be set according to an +NTP server or other external time source in order to be useful. Unfortunately, our methods for +getting a clock to tell the correct time aren’t nearly as reliable or accurate as you might +hope—hardware clocks and NTP can be fickle beasts. To give just a few examples: + +* The quartz clock in a computer is not very accurate: it *drifts* (runs faster or slower than it + should). Clock drift varies depending on the temperature of the machine. Google assumes a clock + drift of up to 200 ppm (parts per million) for its servers + [[45](/en/ch9#Corbett2012_ch9)], + which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30 + seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best + possible accuracy you can achieve, even if everything is working correctly. +* If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the + local clock will be forcibly reset [[39](/en/ch9#Windl2006)]. Any + applications observing the time before and after this reset may see time go backward or suddenly + jump forward. +* If a node is accidentally firewalled off from NTP servers, the misconfiguration may go + unnoticed for some time, during which the drift may add up to large discrepancies between + different nodes’ clocks. Anecdotal evidence suggests that this does happen in practice. 
+* NTP synchronization can only be as good as the network delay, so there is a limit to its + accuracy when you’re on a congested network with variable packet delays. One experiment showed + that a minimum error of 35 ms is achievable when synchronizing over the internet + [[46](/en/ch9#Caporaloni2012)], + though occasional spikes in network delay lead to errors of around a second. Depending on the + configuration, large network delays can cause the NTP client to give up entirely. +* Some NTP servers are wrong or misconfigured, reporting time that is off by hours + [[47](/en/ch9#Minar1999), + [48](/en/ch9#Holub2014)]. + NTP clients mitigate such errors by querying several servers and ignoring outliers. + Nevertheless, it’s somewhat worrying to bet the correctness of your systems on the time that you + were told by a stranger on the internet. +* Leap seconds result in a minute that is 59 seconds or 61 seconds long, which messes up timing + assumptions in systems that are not designed with leap seconds in mind + [[49](/en/ch9#Kamp2011)]. + The fact that leap seconds have crashed many large systems + [[40](/en/ch9#GrahamCumming2017), + [50](/en/ch9#Minar2012_ch9)] + shows how easy it is for incorrect assumptions about clocks to sneak into a system. The best + way of handling leap seconds may be to make NTP servers “lie,” by performing the leap second + adjustment gradually over the course of a day (this is known as *smearing*) + [[51](/en/ch9#Pascoe2011), + [52](/en/ch9#Zhao2015)], + although actual NTP server behavior varies in practice + [[53](/en/ch9#Veitch2016)]. + Leap seconds will no longer be used from 2035 onwards, so this problem will fortunately go away. +* In virtual machines, the hardware clock is virtualized, which raises additional challenges for + applications that need accurate timekeeping + [[54](/en/ch9#VMware2011)]. + When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds + while another VM is running. 
From an application’s point of view, this pause manifests itself as + the clock suddenly jumping forward [[29](/en/ch9#Wang2010)]. + If a VM pauses for several seconds, the clock may then be several seconds behind the actual time, + but NTP may continue to report that the clock is almost perfectly in sync + [[55](/en/ch9#Yodaiken2017)]. +* If you run software on devices that you don’t fully control (e.g., mobile or embedded devices), you + probably cannot trust the device’s hardware clock at all. Some users deliberately set their + hardware clock to an incorrect date and time, for example to cheat in games + [[56](/en/ch9#EmreAcer2017)]. + As a result, the clock might be set to a time wildly in the past or the future. + +It is possible to achieve very good clock accuracy if you care about it sufficiently to invest +significant resources. For example, the MiFID II European regulation for financial +institutions requires all high-frequency trading funds to synchronize their clocks to within 100 +microseconds of UTC, in order to help debug market anomalies such as “flash crashes” and to help +detect market manipulation +[[57](/en/ch9#MiFID2015)]. + +Such accuracy can be achieved with some special hardware (GPS receivers and/or atomic clocks), the +Precision Time Protocol (PTP) and careful deployment and monitoring +[[58](/en/ch9#Bigum2015), +[59](/en/ch9#Obleukhov2022)]. +Relying on GPS alone can be risky because GPS signals can easily be jammed. In some locations this +happens frequently, e.g. close to military facilities +[[60](/en/ch9#Wiseman2022)]. +Some cloud providers have begun offering high-accuracy clock synchronization for their virtual +machines +[[61](/en/ch9#Levinson2023)]. +However, clock synchronization still requires a lot of care. If your NTP daemon is misconfigured, or +a firewall is blocking NTP traffic, the clock error due to drift can quickly become large. 
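
The drift figures quoted in the list above (6 ms after 30 seconds, 17 seconds after a day, at 200 ppm) follow from simple arithmetic, which the following Python sketch reproduces. The function name `max_drift_seconds` is hypothetical, and the sketch assumes the worst case in which the clock runs at the full drift rate for the whole interval since the last synchronization.

```python
def max_drift_seconds(drift_ppm: float, seconds_since_sync: float) -> float:
    """Worst-case clock error accumulated since the last synchronization.

    drift_ppm: assumed maximum drift rate in parts per million
               (1 ppm = 1 microsecond of error per second of elapsed time).
    """
    return drift_ppm * 1e-6 * seconds_since_sync


# 200 ppm, resynchronized every 30 seconds: roughly 6 milliseconds of drift.
per_30s = max_drift_seconds(200, 30)

# 200 ppm, resynchronized once a day (86,400 seconds): roughly 17 seconds of drift.
per_day = max_drift_seconds(200, 24 * 60 * 60)

print(f"{per_30s * 1000:.1f} ms per 30 s, {per_day:.1f} s per day")
```

The same calculation shows why an unnoticed NTP outage matters: at this drift rate, the worst-case error keeps growing by about 17 seconds for every additional day without synchronization.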
+ +## Relying on Synchronized Clocks + +The problem with clocks is that while they seem simple and easy to use, they have a surprising +number of pitfalls: a day may not have exactly 86,400 seconds, time-of-day clocks may move backward +in time, and the time according to one node’s clock may be quite different from another node’s clock. + +Earlier in this chapter we discussed networks dropping and arbitrarily delaying packets. Even though +networks are well behaved most of the time, software must be designed on the assumption that the +network will occasionally be faulty, and the software must handle such faults gracefully. The same +is true with clocks: although they work quite well most of the time, robust software needs to be +prepared to deal with incorrect clocks. + +Part of the problem is that incorrect clocks easily go unnoticed. If a machine’s CPU is defective or +its network is misconfigured, it most likely won’t work at all, so it will quickly be noticed and +fixed. On the other hand, if its quartz clock is defective or its NTP client is misconfigured, most +things will seem to work fine, even though its clock gradually drifts further and further away from +reality. If some piece of software is relying on an accurately synchronized clock, the result is +more likely to be silent and subtle data loss than a dramatic crash +[[62](/en/ch9#Kingsbury2013cassandra), +[63](/en/ch9#Daily2013_ch9)]. + +Thus, if you use software that requires synchronized clocks, it is essential that you also carefully +monitor the clock offsets between all the machines. Any node whose clock drifts too far from the +others should be declared dead and removed from the cluster. Such monitoring ensures that you notice +the broken clocks before they can cause too much damage. 
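
One way such monitoring could look is sketched below in Python. It is an illustration under stated assumptions rather than a recommended implementation: the function `nodes_with_bad_clocks`, the node names, and the idea of comparing each node against the cluster median are all hypothetical. How the per-node offsets are measured (for example, as reported by each node's time synchronization daemon) is outside the scope of the sketch.

```python
from statistics import median
from typing import Dict, List


def nodes_with_bad_clocks(offsets_ms: Dict[str, float],
                          max_skew_ms: float) -> List[str]:
    """Return the nodes whose clock offset is too far from the rest of the cluster.

    offsets_ms maps a node name to its measured clock offset (in milliseconds)
    relative to some common reference. The median offset is used as the cluster's
    notion of "the others" (an assumption), so one broken clock does not shift it.
    """
    if not offsets_ms:
        return []
    reference = median(offsets_ms.values())
    return sorted(
        node for node, offset in offsets_ms.items()
        if abs(offset - reference) > max_skew_ms
    )


# Hypothetical measurements: node3's clock is a quarter of a second off.
measured = {"node1": 0.4, "node2": -0.9, "node3": 250.0, "node4": 1.1}
suspects = nodes_with_bad_clocks(measured, max_skew_ms=10.0)
print(suspects)
```

In this example only `node3` exceeds the 10 ms threshold, so it would be the candidate for removal from the cluster. The threshold itself depends on how much skew the software relying on the clocks can tolerate.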
+ +### Timestamps for ordering events + +Let’s consider one particular situation in which it is tempting, but dangerous, to rely on clocks: +ordering of events across multiple nodes +[[64](/en/ch9#Brooker2023time)]. +For example, if two clients write to a distributed database, who got there first? Which write is the +more recent one? + +[Figure 9-3](/en/ch9#fig_distributed_timestamps) illustrates a dangerous use of time-of-day clocks in a database with +multi-leader replication (the example is similar to [Figure 6-8](/en/ch6#fig_replication_causality)). Client A writes +*x* = 1 on node 1; the write is replicated to node 3; client B increments *x* on node +3 (we now have *x* = 2); and finally, both writes are replicated to node 2. + +![ddia 0903](/fig/ddia_0903.png) + +###### Figure 9-3. The write by client B is causally later than the write by client A, but B’s write has an earlier timestamp. + +In [Figure 9-3](/en/ch9#fig_distributed_timestamps), when a write is replicated to other nodes, it is tagged with a +timestamp according to the time-of-day clock on the node where the write originated. The clock +synchronization is very good in this example: the skew between node 1 and node 3 is less than +3 ms, which is probably better than you can expect in practice. + +Since the increment builds upon the earlier write of *x* = 1, we might expect that the +write of *x* = 2 should have the greater timestamp of the two. Unfortunately, that is +not what happens in [Figure 9-3](/en/ch9#fig_distributed_timestamps): the write *x* = 1 has a timestamp of +42.004 seconds, but the write *x* = 2 has a timestamp of 42.003 seconds. + +As discussed in [“Last write wins (discarding concurrent writes)”](/en/ch6#sec_replication_lww), one way of resolving conflicts between concurrently written +values on different nodes is *last write wins* (LWW), which means keeping the write with the +greatest timestamp for a given key and discarding all writes with older timestamps. 
In the example +of [Figure 9-3](/en/ch9#fig_distributed_timestamps), when node 2 receives these two events, it will incorrectly +conclude that *x* = 1 is the more recent value and drop the write *x* = 2, +so the increment is lost. + +This problem can be prevented by ensuring that when a value is overwritten, the new value always has +a higher timestamp than the overwritten value, even if that timestamp is ahead of the writer’s local +clock. However, that incurs the cost of an additional read to find the greatest existing timestamp. +Some systems, including Cassandra and ScyllaDB, want to write to all replicas in a single round +trip, and therefore they simply use the client clock’s timestamp along with a last write wins +policy [[62](/en/ch9#Kingsbury2013cassandra)]. This approach has some +serious problems: + +* Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite + values previously written by a node with a fast clock until the clock skew between the nodes has + elapsed [[63](/en/ch9#Daily2013_ch9), + [65](/en/ch9#Kingsbury2013timestamps)]. + This scenario can cause arbitrary amounts of data to be silently dropped without any error being + reported to the application. +* LWW cannot distinguish between writes that occurred sequentially in quick succession (in + [Figure 9-3](/en/ch9#fig_distributed_timestamps), client B’s increment definitely occurs *after* client A’s write) + and writes that were truly concurrent (neither writer was aware of the other). Additional + causality tracking mechanisms, such as version vectors, are needed in order to prevent violations + of causality (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). +* It is possible for two nodes to independently generate writes with the same timestamp, especially + when the clock only has millisecond resolution. 
An additional tiebreaker value (which can simply + be a large random number) is required to resolve such conflicts, but this approach can also lead to + violations of causality [[62](/en/ch9#Kingsbury2013cassandra)]. + +Thus, even though it is tempting to resolve conflicts by keeping the most “recent” value and +discarding others, it’s important to be aware that the definition of “recent” depends on a local +time-of-day clock, which may well be incorrect. Even with tightly NTP-synchronized clocks, you could +send a packet at timestamp 100 ms (according to the sender’s clock) and have it arrive at +timestamp 99 ms (according to the recipient’s clock)—so it appears as though the packet +arrived before it was sent, which is impossible. + +Could NTP synchronization be made accurate enough that such incorrect orderings cannot occur? +Probably not, because NTP’s synchronization accuracy is itself limited by the network round-trip +time, in addition to other sources of error such as quartz drift. To guarantee a correct ordering, +you would need the clock error to be significantly lower than the network delay, which is not +possible. + +So-called *logical clocks* +[[66](/en/ch9#Lamport1978_ch9)], +which are based on incrementing counters rather than an oscillating quartz crystal, are a safer +alternative for ordering events (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). Logical clocks do not measure +the time of day or the number of seconds elapsed, only the relative ordering of events (whether one +event happened before or after another). In contrast, time-of-day and monotonic clocks, which +measure actual elapsed time, are also known as *physical clocks*. We’ll look at logical clocks in +more detail in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical). + +### Clock readings with a confidence interval + +You may be able to read a machine’s time-of-day clock with microsecond or even nanosecond +resolution. 
But even if you can get such a fine-grained measurement, that doesn’t mean the value is +actually accurate to such precision. In fact, it most likely is not—as mentioned previously, the +drift in an imprecise quartz clock can easily be several milliseconds, even if you synchronize with +an NTP server on the local network every minute. With an NTP server on the public internet, the best +possible accuracy is probably to the tens of milliseconds, and the error may easily spike to over +100 ms when there is network congestion. + +Thus, it doesn’t make sense to think of a clock reading as a point in time—it is more like a +range of times, within a confidence interval: for example, a system may be 95% confident that the +time now is between 10.3 and 10.5 seconds past the minute, but it doesn’t know any more precisely +than that [[67](/en/ch9#Sheehy2015)]. +If we only know the time +/– 100 ms, the microsecond digits in the timestamp are +essentially meaningless. + +The uncertainty bound can be calculated based on your time source. If you have a GPS receiver or +atomic clock directly attached to your computer, the expected error range is determined by +the device and, in the case of GPS, by the quality of the signal from the satellites. If you’re +getting the time from a server, the uncertainty is based on the expected quartz drift since your +last sync with the server, plus the NTP server’s uncertainty, plus the network round-trip time to +the server (to a first approximation, and assuming you trust the server). + +Unfortunately, most systems don’t expose this uncertainty: for example, when you call +`clock_gettime()`, the return value doesn’t tell you the expected error of the timestamp, so you +don’t know if its confidence interval is five milliseconds or five years. + +There are exceptions: the *TrueTime* API in Google’s Spanner +[[45](/en/ch9#Corbett2012_ch9)] and Amazon’s ClockBound explicitly report the +confidence interval on the local clock. 
When you ask it for the current time, you get back two +values: `[earliest, latest]`, which are the *earliest possible* and the *latest possible* +timestamp. Based on its uncertainty calculations, the clock knows that the actual current time is +somewhere within that interval. The width of the interval depends, among other things, on how long +it has been since the local quartz clock was last synchronized with a more accurate clock source. + +### Synchronized clocks for global snapshots + +In [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation) we discussed *multi-version concurrency control* (MVCC), +which is a very useful feature in databases that need to support both small, fast read-write +transactions and large, long-running read-only transactions (e.g., for backups or analytics). It +allows read-only transactions to see a *snapshot* of the database, a consistent state at a +particular point in time, without locking and interfering with read-write transactions. + +Generally, MVCC requires a monotonically increasing transaction ID. If a write happened later than +the snapshot (i.e., the write has a greater transaction ID than the snapshot), that write is +invisible to the snapshot transaction. On a single-node database, a simple counter is sufficient for +generating transaction IDs. + +However, when a database is distributed across many machines, potentially in multiple datacenters, a +global, monotonically increasing transaction ID (across all shards) is difficult to generate, +because it requires coordination. The transaction ID must reflect causality: if transaction B reads +or overwrites a value that was previously written by transaction A, then B must have a higher +transaction ID than A—otherwise, the snapshot would not be consistent. With lots of small, rapid +transactions, creating transaction IDs in a distributed system becomes an untenable +bottleneck. 
(We will discuss such ID generators in [“ID Generators and Logical Clocks”](/en/ch10#sec_consistency_logical).) + +Can we use the timestamps from synchronized time-of-day clocks as transaction IDs? If we could get +the synchronization good enough, they would have the right properties: later transactions have a +higher timestamp. The problem, of course, is the uncertainty about clock accuracy. + +Spanner implements snapshot isolation across datacenters in this way +[[68](/en/ch9#Demirbas2013), +[69](/en/ch9#Malkhi2013)]. +It uses the clock’s confidence interval as reported by the TrueTime API, and is based on the +following observation: if you have two confidence intervals, each consisting of an earliest and +latest possible timestamp (*A* = [*A<sub>earliest</sub>*, *A<sub>latest</sub>*] and +*B* = [*B<sub>earliest</sub>*, *B<sub>latest</sub>*]), and those two intervals do not overlap (i.e., +*A<sub>earliest</sub>* < *A<sub>latest</sub>* < *B<sub>earliest</sub>* < *B<sub>latest</sub>*), then B definitely happened after A; there +can be no doubt. Only if the intervals overlap are we unsure in which order A and B happened. + +In order to ensure that transaction timestamps reflect causality, Spanner deliberately waits for the +length of the confidence interval before committing a read-write transaction. By doing so, it +ensures that any transaction that may read the data is at a sufficiently later time, so their +confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner +needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS +receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about +7 ms [[45](/en/ch9#Corbett2012_ch9)]. + +The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to +have a confidence interval, and the accurate clock sources only help keep that interval small.
Other +systems are beginning to adopt similar approaches: for example, YugabyteDB can leverage ClockBound +when running on AWS [[70](/en/ch9#Pachot2024)], +and several other systems now also rely on clock synchronization to various degrees +[[71](/en/ch9#Kimball2022), +[72](/en/ch9#Demirbas2025)]. + +## Process Pauses + +Let’s consider another example of dangerous clock use in a distributed system. Say you have a +database with a single leader per shard. Only the leader is allowed to accept writes. How does a +node know that it is still leader (that it hasn’t been declared dead by the others), and that it may +safely accept writes? + +One option is for the leader to obtain a *lease* from the other nodes, which is similar to a lock +with a timeout [[73](/en/ch9#Gray1989)]. +Only one node can hold the lease at any one time—thus, when a node obtains a lease, it knows that +it is the leader for some amount of time, until the lease expires. In order to remain leader, the +node must periodically renew the lease before it expires. If the node fails, it stops renewing the +lease, so another node can take over when it expires. + +You can imagine the request-handling loop looking something like this: + +``` +while (true) { + request = getIncomingRequest(); + + // Ensure that the lease always has at least 10 seconds remaining + if (lease.expiryTimeMillis - System.currentTimeMillis() < 10000) { + lease = lease.renew(); + } + + if (lease.isValid()) { + process(request); + } +} +``` + +What’s wrong with this code? Firstly, it’s relying on synchronized clocks: the expiry time on the +lease is set by a different machine (where the expiry may be calculated as the current time plus 30 +seconds, for example), and it’s being compared to the local system clock. If the clocks are out of +sync by more than a few seconds, this code will start doing strange things. 
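
One way to avoid comparing a remote wall-clock expiry time with the local clock is for the lock service to grant a *duration* and for the client to track expiry on its own monotonic clock. The sketch below illustrates that idea only. The class `MonotonicLease` and its methods are hypothetical, and the injectable `LongSupplier` clock (which would be `System::nanoTime` in production) is an assumption made so that the logic can be exercised deterministically. It does not protect against the process being paused between checking the lease and acting on it.

```java
import java.util.function.LongSupplier;

/** Hypothetical lease holder that tracks expiry using only the local monotonic clock. */
final class MonotonicLease {
    private final LongSupplier nanoClock;    // e.g. System::nanoTime; injectable for testing
    private final long safetyMarginNanos;    // stop using the lease this long before it expires
    private long expiryNanos;                // a point on the local monotonic clock's timeline

    MonotonicLease(LongSupplier nanoClock, long safetyMarginNanos) {
        this.nanoClock = nanoClock;
        this.safetyMarginNanos = safetyMarginNanos;
        this.expiryNanos = nanoClock.getAsLong(); // no lease held yet
    }

    /**
     * Record a successful grant or renewal.
     *
     * @param requestSentNanos local monotonic reading taken just before the request was sent
     * @param durationNanos    lease duration granted by the lock service (not a wall-clock time)
     */
    void onGranted(long requestSentNanos, long durationNanos) {
        // Counting from the send time is conservative: the service cannot have
        // started its own timer before the request left this node.
        expiryNanos = requestSentNanos + durationNanos;
    }

    /** True if the lease still has more than the safety margin remaining. */
    boolean isValid() {
        // Compare by subtraction, as recommended for System.nanoTime() readings.
        return expiryNanos - nanoClock.getAsLong() > safetyMarginNanos;
    }
}
```

This still assumes that the local clock and the lock service's clock tick at roughly the same rate for the length of the lease, which is a much weaker requirement than having their time-of-day readings agree.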
+ +Secondly, even if we change the protocol to only use the local monotonic clock, there is another +problem: the code assumes that very little time passes between the point that it checks the time +(`System.currentTimeMillis()`) and the time when the request is processed (`process(request)`). +Normally this code runs very quickly, so the 10-second buffer is more than enough to ensure that the +lease doesn’t expire in the middle of processing a request. + +However, what if there is an unexpected pause in the execution of the program? For example, imagine +the thread stops for 15 seconds around the line `lease.isValid()` before finally continuing. In +that case, it’s likely that the lease will have expired by the time the request is processed, and +another node has already taken over as leader. However, there is nothing to tell this thread that it +was paused for so long, so this code won’t notice that the lease has expired until the next +iteration of the loop, by which time it may have already done something unsafe by processing the +request. + +Is it reasonable to assume that a thread might be paused for so long? Unfortunately yes. There are +various reasons why this could happen: + +* Contention among threads accessing a shared resource, such as a lock or queue, can cause threads + to spend a lot of their time waiting. Moving to a machine with more CPU cores can make such + problems worse, and contention problems can be difficult to diagnose + [[74](/en/ch9#Sturman2022)]. +* Many programming language runtimes (such as the Java Virtual Machine) have a *garbage collector* + (GC) that occasionally needs to stop all running threads. In the past, such *“stop-the-world” GC + pauses* would sometimes last for several minutes + [[75](/en/ch9#Lipcon2011)]! + With modern GC algorithms this is less of a problem, but GC pauses can still be noticeable (see + [“Limiting the impact of garbage collection”](/en/ch9#sec_distributed_gc_impact)).
+* In virtualized environments, a virtual machine can be *suspended* (pausing the execution of all + processes and saving the contents of memory to disk) and *resumed* (restoring the contents of + memory and continuing execution). This pause can occur at any time in a process’s execution and can + last for an arbitrary length of time. This feature is sometimes used for *live migration* of + virtual machines from one host to another without a reboot, in which case the length of the pause + depends on the rate at which processes are writing to memory + [[76](/en/ch9#Clark2005)]. +* On end-user devices such as laptops and phones, execution may also be suspended and resumed + arbitrarily, e.g., when the user closes the lid of their laptop. +* When the operating system context-switches to another thread, or when the hypervisor switches to a + different virtual machine (when running in a virtual machine), the currently running thread can be + paused at any arbitrary point in the code. In the case of a virtual machine, the CPU time spent in + other virtual machines is known as *steal time*. If the machine is under heavy load—i.e., if + there is a long queue of threads waiting to run—it may take some time before the paused thread + gets to run again. +* If the application performs synchronous disk access, a thread may be paused waiting for a slow + disk I/O operation to complete [[77](/en/ch9#Shaver2008)]. In many languages, disk access can happen + surprisingly, even if the code doesn’t explicitly mention file access—for example, the Java + classloader lazily loads class files when they are first used, which could happen at any time in + the program execution. I/O pauses and GC pauses may even conspire to combine their delays + [[78](/en/ch9#Zhuang2016)]. + If the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the + I/O latency is further subject to the variability of network delays + [[31](/en/ch9#Newman2012)]. 
+* If the operating system is configured to allow *swapping to disk* (*paging*), a simple memory + access may result in a page fault that requires a page from disk to be loaded into memory. The + thread is paused while this slow I/O operation takes place. If memory pressure is high, this may + in turn require a different page to be swapped out to disk. In extreme circumstances, the + operating system may spend most of its time swapping pages in and out of memory and getting little + actual work done (this is known as *thrashing*). To avoid this problem, paging is often disabled + on server machines (if you would rather kill a process to free up memory than risk thrashing). +* A Unix process can be paused by sending it the `SIGSTOP` signal, for example by pressing Ctrl-Z in + a shell. This signal immediately stops the process from getting any more CPU cycles until it is + resumed with `SIGCONT`, at which point it continues running where it left off. Even if your + environment does not normally use `SIGSTOP`, it might be sent accidentally by an operations + engineer. + +All of these occurrences can *preempt* the running thread at any point and resume it at some later time, +without the thread even noticing. The problem is similar to making multi-threaded code on a single +machine thread-safe: you can’t assume anything about timing, because arbitrary context switches and +parallelism may occur. + +When writing multi-threaded code on a single machine, we have fairly good tools for making it +thread-safe: mutexes, semaphores, atomic counters, lock-free data structures, blocking queues, and +so on. Unfortunately, these tools don’t directly translate to distributed systems, because a +distributed system has no shared memory—only messages sent over an unreliable network. + +A node in a distributed system must assume that its execution can be paused for a significant length +of time at any point, even in the middle of a function. 
During the pause, the rest of the world +keeps moving and may even declare the paused node dead because it’s not responding. Eventually, +the paused node may continue running, without even noticing that it was asleep until it checks its +clock sometime later. + +### Response time guarantees + +In many programming languages and operating systems, threads and processes may pause for an +unbounded amount of time, as discussed. Those reasons for pausing *can* be eliminated if you try +hard enough. + +Some software runs in environments where a failure to respond within a specified time can cause +serious damage: computers that control aircraft, rockets, robots, cars, and other physical objects +must respond quickly and predictably to their sensor inputs. In these systems, there is a specified +*deadline* by which the software must respond; if it doesn’t meet the deadline, that may cause a +failure of the entire system. These are so-called *hard real-time* systems. + +###### Note + +In embedded systems, *real-time* means that a system is carefully designed and tested to meet +specified timing guarantees in all circumstances. This meaning is in contrast to the more vague use of the +term *real-time* on the web, where it describes servers pushing data to clients and stream +processing without hard response time constraints (see [Link to Come]). + +For example, if your car’s onboard sensors detect that you are currently experiencing a crash, you +wouldn’t want the release of the airbag to be delayed due to an inopportune GC pause in the airbag +release system. 
+ +Providing real-time guarantees in a system requires support from all levels of the software stack: a +*real-time operating system* (RTOS) that allows processes to be scheduled with a guaranteed +allocation of CPU time in specified intervals is needed; library functions must document their +worst-case execution times; dynamic memory allocation may be restricted or disallowed entirely +(real-time garbage collectors exist, but the application must still ensure that it doesn’t give the +GC too much work to do); and an enormous amount of testing and measurement must be done to ensure +that guarantees are being met. + +All of this requires a large amount of additional work and severely restricts the range of +programming languages, libraries, and tools that can be used (since most languages and tools do not +provide real-time guarantees). For these reasons, developing real-time systems is very expensive, +and they are most commonly used in safety-critical embedded devices. Moreover, “real-time” is not the +same as “high-performance”—in fact, real-time systems may have lower throughput, since they have to +prioritize timely responses above all else (see also [“Latency and Resource Utilization”](/en/ch9#sidebar_distributed_latency_utilization)). + +For most server-side data processing systems, real-time guarantees are simply not economical or +appropriate. Consequently, these systems must suffer the pauses and clock instability that come from +operating in a non-real-time environment. + +### Limiting the impact of garbage collection + +Garbage collection used to be one of the biggest reasons for process pauses +[[79](/en/ch9#Thompson2013)], +but fortunately GC algorithms have improved a lot: a properly tuned collector will now usually pause +for no more than a few milliseconds. The Java runtime offers collectors such as concurrent mark +sweep (CMS), garbage-first (G1), the Z garbage collector (ZGC), Epsilon, and Shenandoah. 
Each of +these is optimized for different memory profiles such as high-frequency object creation, large +heaps, and so on. By contrast, Go offers a simpler concurrent mark sweep garbage collector that +attempts to optimize itself. + +If you need to avoid GC pauses entirely, one option is to use a language that doesn’t have a garbage +collector at all. For example, Swift uses automatic reference counting to determine when memory can +be freed; Rust and Mojo track lifetimes of objects using the type system so the compiler can +determine how long memory must be allocated for. + +It’s also possible to use a garbage-collected language while mitigating the impact of pauses. +One approach is to treat GC pauses like brief planned outages of a node, and to let other nodes +handle requests from clients while one node is collecting its garbage. If the runtime can warn the +application that a node soon requires a GC pause, the application can stop sending new requests to +that node, wait for it to finish processing outstanding requests, and then perform the GC while no +requests are in progress. This trick hides GC pauses from clients and reduces the high percentiles +of the response time [[80](/en/ch9#Terei2015), +[81](/en/ch9#Maas2015)]. + +A variant of this idea is to use the garbage collector only for short-lived objects (which are fast +to collect) and to restart processes periodically, before they accumulate enough long-lived objects +to require a full GC of long-lived objects [[79](/en/ch9#Thompson2013), +[82](/en/ch9#Fowler2011_ch9)]. +One node can be restarted at a time, and traffic can be shifted away from the node before the +planned restart, like in a rolling upgrade (see [Chapter 5](/en/ch5#ch_encoding)). + +These measures cannot fully prevent garbage collection pauses, but they can usefully reduce their +impact on the application. 
+ +# Knowledge, Truth, and Lies + +So far in this chapter we have explored the ways in which distributed systems are different from +programs running on a single computer: there is no shared memory, only message passing via an +unreliable network with variable delays, and the systems may suffer from partial failures, unreliable clocks, +and processing pauses. + +The consequences of these issues are profoundly disorienting if you’re not used to distributed +systems. A node in the network cannot *know* anything for sure about other nodes—it can only make +guesses based on the messages it receives (or doesn’t receive). A node can only find out what state +another node is in (what data it has stored, whether it is correctly functioning, etc.) by +exchanging messages with it. If a remote node doesn’t respond, there is no way of knowing what state +it is in, because problems in the network cannot reliably be distinguished from problems at a node. + +Discussions of these systems border on the philosophical: What do we know to be true or false in our +system? How sure can we be of that knowledge, if the mechanisms for perception and measurement are +unreliable [[83](/en/ch9#Halpern1990)]? +Should software systems obey the laws that we expect of the physical world, such as cause and effect? + +Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed +system, we can state the assumptions we are making about the behavior (the *system model*) and +design the actual system in such a way that it meets those assumptions. Algorithms can be proved to +function correctly within a certain system model. This means that reliable behavior is achievable, +even if the underlying system model provides very few guarantees. + +However, although it is possible to make software well behaved in an unreliable system model, it +is not straightforward to do so. 
In the rest of this chapter we will further explore the notions of +knowledge and truth in distributed systems, which will help us think about the kinds of assumptions +we can make and the guarantees we may want to provide. In [Chapter 10](/en/ch10#ch_consistency) we will proceed to +look at some examples of distributed algorithms that provide particular guarantees under particular +assumptions. + +## The Majority Rules + +Imagine a network with an asymmetric fault: a node is able to receive all messages sent to it, but +any outgoing messages from that node are dropped or delayed +[[22](/en/ch9#Donges2012)]. Even though that node is working +perfectly well, and is receiving requests from other nodes, the other nodes cannot hear its +responses. After some timeout, the other nodes declare it dead, because they haven’t heard from the +node. The situation unfolds like a nightmare: the semi-disconnected node is dragged to the +graveyard, kicking and screaming “I’m not dead!”—but since nobody can hear its screaming, the +funeral procession continues with stoic determination. + +In a slightly less nightmarish scenario, the semi-disconnected node may notice that the messages it +is sending are not being acknowledged by other nodes, and so realize that there must be a fault +in the network. Nevertheless, the node is wrongly declared dead by the other nodes, and the +semi-disconnected node cannot do anything about it. + +As a third scenario, imagine a node that pauses execution for one minute. During that time, no +requests are processed and no responses are sent. The other nodes wait, retry, grow impatient, and +eventually declare the node dead and load it onto the hearse. Finally, the pause finishes and the +node’s threads continue as if nothing had happened. The other nodes are surprised as the supposedly +dead node suddenly raises its head out of the coffin, in full health, and starts cheerfully chatting +with bystanders. 
At first, the paused node doesn’t even realize that an entire minute has passed and +that it was declared dead: from its perspective, hardly any time has passed since it was last talking +to the other nodes. + +The moral of these stories is that a node cannot necessarily trust its own judgment of a situation. +A distributed system cannot exclusively rely on a single node, because a node may fail at any time, +potentially leaving the system stuck and unable to recover. Instead, many distributed algorithms +rely on a *quorum*, that is, voting among the nodes (see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)): +decisions require some minimum number of votes from several nodes in order to reduce the dependence +on any one particular node. + +That includes decisions about declaring nodes dead. If a quorum of nodes declares another node +dead, then it must be considered dead, even if that node still very much feels alive. The individual +node must abide by the quorum decision and step down. + +Most commonly, the quorum is an absolute majority of more than half the nodes (although other kinds +of quorums are possible). A majority quorum allows the system to continue working if a minority of nodes +are faulty (with three nodes, one faulty node can be tolerated; with five nodes, two faulty nodes can be +tolerated). However, it is still safe, because there can be only one majority in the +system: there cannot be two majorities with conflicting decisions at the same time. We will discuss +the use of quorums in more detail when we get to *consensus algorithms* in [Chapter 10](/en/ch10#ch_consistency). + +## Distributed Locks and Leases + +Locks and leases in distributed applications are prone to misuse, and are a common source of bugs +[[84](/en/ch9#Tang2022)]. +Let’s look at one particular case of how they can go wrong.
+ +In [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses) we saw that a lease is a kind of lock that times out and can be +assigned to a new owner if the old owner stops responding (perhaps because it crashed, it paused for +too long, or it was disconnected from the network). You can use leases in situations where a system +requires there to be only one of some thing. For example: + +* Only one node is allowed to be the leader for a database shard, to avoid split brain (see + [“Handling Node Outages”](/en/ch6#sec_replication_failover)). +* Only one transaction or client is allowed to update a particular resource or object, to prevent + it being corrupted by concurrent writes. +* Only one node should process a given input file to a big processing job, to avoid wasted effort + due to multiple nodes redundantly doing the same work. + +It is worth thinking carefully about what happens if several nodes simultaneously believe that they +hold the lease, perhaps due to a process pause. In the third example, the consequence is only some +wasted computational resources, which is not a big deal. But in the first two cases, the consequence +could be lost or corrupted data, which is much more serious. + +For example, [Figure 9-4](/en/ch9#fig_distributed_lease_pause) shows a data corruption bug due to an incorrect +implementation of locking. (The bug is not theoretical: HBase used to have this problem +[[85](/en/ch9#Junqueira2013_ch9), +[86](/en/ch9#Soztutar2013hdfs)].) +Say you want to ensure that a file in a storage service can only be +accessed by one client at a time, because if multiple clients tried to write to it, the file would +become corrupted. You try to implement this by requiring a client to obtain a lease from a lock +service before accessing the file. Such a lock service is often implemented using a consensus +algorithm; we will discuss this further in [Chapter 10](/en/ch10#ch_consistency). + +![ddia 0904](/fig/ddia_0904.png) + +###### Figure 9-4. 
Incorrect implementation of a distributed lock: client 1 believes that it still has a valid lease, even though it has expired, and thus corrupts a file in storage. + +The problem is an example of what we discussed in [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses): if the client +holding the lease is paused for too long, its lease expires. Another client can obtain a lease for +the same file, and start writing to the file. When the paused client comes back, it believes +(incorrectly) that it still has a valid lease and proceeds to also write to the file. We now have a +split brain situation: the clients’ writes clash and corrupt the file. + +[Figure 9-5](/en/ch9#fig_distributed_lease_delay) shows a different problem that has similar consequences. In this +example there is no process pause, only a crash by client 1. Just before client 1 crashes it sends a +write request to the storage service, but this request is delayed for a long time in the network. +(Remember from [“Network Faults in Practice”](/en/ch9#sec_distributed_network_faults) that packets can sometimes be delayed by a minute +or more.) By the time the write request arrives at the storage service, the lease has already timed +out, allowing client 2 to acquire it and issue a write of its own. The result is corruption similar +to [Figure 9-4](/en/ch9#fig_distributed_lease_pause). + +![ddia 0905](/fig/ddia_0905.png) + +###### Figure 9-5. A message from a former leaseholder might be delayed for a long time, and arrive after another node has taken over the lease. + +### Fencing off zombies and delayed requests + +The term *zombie* is sometimes used to describe a former leaseholder who has not yet found out that +it lost the lease, and who is still acting as if it was the current leaseholder. Since we cannot +rule out zombies entirely, we have to instead ensure that they can’t do any damage in the form of +split brain. This is called *fencing off* the zombie. 
+ +Some systems attempt to fence off zombies by shutting them down, for example by disconnecting them +from the network [[9](/en/ch9#Leners2015)], shutting down the VM via +the cloud provider’s management interface, or even physically powering down the machine +[[87](/en/ch9#SUSE2025)]. +This approach is known as *Shoot The Other Node In The Head* or STONITH. Unfortunately, it suffers +from some problems: it does not protect against large network delays like in +[Figure 9-5](/en/ch9#fig_distributed_lease_delay); it can happen that all of the nodes shut each other down +[[19](/en/ch9#Imbriaco2012_ch9)]; and by the time the zombie has been +detected and shut down, it may already be too late and data may already have been corrupted. + +A more robust fencing solution, which protects against both zombies and delayed requests, is +illustrated in [Figure 9-6](/en/ch9#fig_distributed_fencing). + +![ddia 0906](/fig/ddia_0906.png) + +###### Figure 9-6. Making access to storage safe by allowing writes only in the order of increasing fencing tokens. + +Let’s assume that every time the lock service grants a lock or lease, it also returns a *fencing +token*, which is a number that increases every time a lock is granted (e.g., incremented by the lock +service). We can then require that every time a client sends a write request to the storage service, +it must include its current fencing token. + +###### Note + +There are several alternative names for fencing tokens. In Chubby, Google’s lock service, they are +called *sequencers* [[88](/en/ch9#Burrows2006_ch9)], and in Kafka they are called *epoch numbers*. +In consensus algorithms, which we will discuss in [Chapter 10](/en/ch10#ch_consistency), the *ballot number* (Paxos) or +*term number* (Raft) serves a similar purpose. + +In [Figure 9-6](/en/ch9#fig_distributed_fencing), client 1 acquires the lease with a token of 33, but then +it goes into a long pause and the lease expires. 
Client 2 acquires the lease with a token of 34 (the +number always increases) and then sends its write request to the storage service, including the +token of 34. Later, client 1 comes back to life and sends its write to the storage service, +including its token value 33. However, the storage service remembers that it has already processed a +write with a higher token number (34), and so it rejects the request with token 33. A client that +has just acquired the lease must immediately make a write to the storage service, and once that +write has completed, any zombies are fenced off. + +If ZooKeeper is your lock service, you can use the transaction ID `zxid` or the node version +`cversion` as fencing token [[85](/en/ch9#Junqueira2013_ch9)]. +With etcd, the revision number along with the lease ID serves a similar purpose +[[89](/en/ch9#Kingsbury2020etcd)]. +The FencedLock API in Hazelcast explicitly generates a fencing token +[[90](/en/ch9#BasriKahveci2019)]. + +This mechanism requires that the storage service has some way of checking whether a write is based +on an outdated token. Alternatively, it’s sufficient for the service to support a write that +succeeds only if the object has not been written by another client since the current client last +read it, similarly to an atomic compare-and-set (CAS) operation. For example, object storage +services support such a check: Amazon S3 calls it *conditional writes*, Azure Blob Storage calls it +*conditional headers*, and Google Cloud Storage calls it *request preconditions*. + +### Fencing with multiple replicas + +If your clients need to write only to one storage service that supports such conditional writes, the +lock service is somewhat redundant +[[91](/en/ch9#Kleppmann2016), +[92](/en/ch9#Sanfilippo2016)], +since the lease assignment could have been implemented directly based on that storage service +[[93](/en/ch9#Morling2024_ch9)]. 
+However, once you have a fencing token you can also use it with multiple services or replicas, and +ensure that the old leaseholder is fenced off on all of those services. + +For example, imagine the storage service is a leaderless replicated key-value store with +last-write-wins conflict resolution (see [“Leaderless Replication”](/en/ch6#sec_replication_leaderless)). In such a system, the +client sends writes directly to each replica, and each replica independently decides whether to +accept a write based on a timestamp assigned by the client. + +As illustrated in [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), you can put the writer’s fencing token in +the most significant bits or digits of the timestamp. You can then be sure that any timestamp +generated by the new leaseholder will be greater than any timestamp from the old leaseholder, even +if the old leaseholder’s writes happened later. + +![ddia 0907](/fig/ddia_0907.png) + +###### Figure 9-7. Using fencing tokens to protect writes to a leaderless replicated database. + +In [Figure 9-7](/en/ch9#fig_distributed_fencing_leaderless), Client 2 has a fencing token of 34, so all of its +timestamps starting with 34…​ are greater than any timestamps starting with 33…​ that are +generated by Client 1. Client 2 writes to a quorum of replicas but it can’t reach Replica 3. This +means that when the zombie Client 1 later tries to write, its write may succeed at Replica 3 even +though it is ignored by replicas 1 and 2. This is not a problem, since a subsequent quorum read will +prefer the write from Client 2 with the greater timestamp, and read repair or anti-entropy will +eventually overwrite the value written by Client 1. + +As you can see from these examples, it is not safe to assume that there is only one node holding a +lease at any one time. Fortunately, with a bit of care you can use fencing tokens to prevent zombies +and delayed requests from doing any damage. 
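
The two uses of fencing tokens described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the behavior of any particular product: `FencedStorage`, `StaleTokenError`, `fenced_timestamp`, and the 48-bit clock width are all hypothetical names and values chosen for the example. The first part shows a storage service rejecting a write whose token is lower than one it has already seen; the second shows how placing the token in the most significant bits of a timestamp makes the new leaseholder win under last-write-wins, whatever the clocks say.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a fencing token older than one already seen."""


class FencedStorage:
    """Hypothetical storage service that remembers the highest token it has seen."""

    def __init__(self) -> None:
        self.highest_token = 0
        self.data = {}

    def write(self, token: int, key: str, value: str) -> None:
        # The same leaseholder may write repeatedly, so only lower tokens are stale.
        if token < self.highest_token:
            raise StaleTokenError(f"token {token} < {self.highest_token}")
        self.highest_token = token
        self.data[key] = value


CLOCK_BITS = 48  # assumed width of the clock portion of a timestamp


def fenced_timestamp(token: int, clock_ms: int) -> int:
    """Combine a fencing token (high bits) with a clock reading (low bits)."""
    if not 0 <= clock_ms < (1 << CLOCK_BITS):
        raise ValueError("clock value does not fit in CLOCK_BITS")
    return (token << CLOCK_BITS) | clock_ms


# Single storage service: client 2 (token 34) writes first, zombie client 1 (33) later.
storage = FencedStorage()
storage.write(34, "file", "written by client 2")
try:
    storage.write(33, "file", "written by zombie client 1")
    zombie_rejected = False
except StaleTokenError:
    zombie_rejected = True

# Leaderless store with last-write-wins: the zombie's clock reading is later,
# but its timestamp still compares lower because its token is lower.
old_ts = fenced_timestamp(33, clock_ms=2_000_000)
new_ts = fenced_timestamp(34, clock_ms=1_000_000)

print(zombie_rejected, new_ts > old_ts)
```

Note that the check only takes effect once the new leaseholder has made at least one write, which matches the requirement that a client write to the storage service immediately after acquiring the lease.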
+ +## Byzantine Faults + +Fencing tokens can detect and block a node that is *inadvertently* acting in error (e.g., because it +hasn’t yet found out that its lease has expired). However, if the node deliberately wanted to +subvert the system’s guarantees, it could easily do so by sending messages with a fake fencing +token. + +In this book we assume that nodes are unreliable but honest: they may be slow or never respond (due +to a fault), and their state may be outdated (due to a GC pause or network delays), but we assume +that if a node *does* respond, it is telling the “truth”: to the best of its knowledge, it is +playing by the rules of the protocol. + +Distributed systems problems become much harder if there is a risk that nodes may “lie” (send +arbitrary faulty or corrupted responses)—for example, it might cast multiple contradictory votes in +the same election. Such behavior is known as a *Byzantine fault*, and the problem of reaching +consensus in this untrusting environment is known as the *Byzantine Generals Problem* +[[94](/en/ch9#Lamport1982)]. + +# The Byzantine Generals Problem + +The Byzantine Generals Problem is a generalization of the so-called *Two Generals Problem* +[[95](/en/ch9#Gray1978)], +which imagines a situation in which two army generals need to agree on a battle plan. As they +have set up camp on two different sites, they can only communicate by messenger, and the messengers +sometimes get delayed or lost (like packets in a network). We will discuss this problem of +*consensus* in [Chapter 10](/en/ch10#ch_consistency). + +In the Byzantine version of the problem, there are *n* generals who need to agree, and their +endeavor is hampered by the fact that there are some traitors in their midst. Most of the generals +are loyal, and thus send truthful messages, but the traitors may try to deceive and confuse the +others by sending fake or untrue messages. It is not known in advance who the traitors are. 
+ +Byzantium was an ancient Greek city that later became Constantinople, in the place which is now +Istanbul in Turkey. There isn’t any historic evidence that the generals of Byzantium were any more +prone to intrigue and conspiracy than those elsewhere. Rather, the name is derived from *Byzantine* +in the sense of *excessively complicated, bureaucratic, devious*, which was used in politics long +before computers [[96](/en/ch9#Palmer2011)]. +Lamport wanted to choose a nationality that would not offend any readers, and he was advised that +calling it *The Albanian Generals Problem* was not such a good idea +[[97](/en/ch9#LamportPubs)]. + +A system is *Byzantine fault-tolerant* if it continues to operate correctly even if some of the +nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering +with the network. This concern is relevant in certain specific circumstances. For example: + +* In aerospace environments, the data in a computer’s memory or CPU register could become corrupted + by radiation, leading it to respond to other nodes in arbitrarily unpredictable ways. Since a + system failure would be very expensive (e.g., an aircraft crashing and killing everyone on board, + or a rocket colliding with the International Space Station), flight control systems must tolerate + Byzantine faults [[98](/en/ch9#Rushby2001), + [99](/en/ch9#Edge2013)]. +* In a system with multiple participating parties, some participants may attempt to cheat or + defraud others. In such circumstances, it is not safe for a node to simply trust another node’s + messages, since they may be sent with malicious intent. For example, cryptocurrencies like + Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties + to agree whether a transaction happened or not, without relying on a central authority + [[100](/en/ch9#Bano2019_ch9)]. 
+ +However, in the kinds of systems we discuss in this book, we can usually safely assume that there +are no Byzantine faults. In a datacenter, all the nodes are controlled by your organization (so +they can hopefully be trusted) and radiation levels are low enough that memory corruption is not a +major problem (although datacenters in orbit are being considered +[[101](/en/ch9#Feilden2024)]). +Multitenant systems have mutually untrusting tenants, but they are isolated from each +other using firewalls, virtualization, and access control policies, not using Byzantine fault +tolerance. Protocols for making systems Byzantine fault-tolerant are quite expensive +[[102](/en/ch9#Mickens2013)], +and fault-tolerant embedded systems rely on support from the hardware level +[[98](/en/ch9#Rushby2001)]. In most server-side data systems, the +cost of deploying Byzantine fault-tolerant solutions makes them impracticable. + +Web applications do need to expect arbitrary and malicious behavior of clients that are under +end-user control, such as web browsers. This is why input validation, sanitization, and output +escaping are so important: to prevent SQL injection and cross-site scripting, for example. However, +we typically don’t use Byzantine fault-tolerant protocols here, but simply make the server the +authority on deciding what client behavior is and isn’t allowed. In peer-to-peer networks, where +there is no such central authority, Byzantine fault tolerance is more relevant +[[103](/en/ch9#Kleppmann2020), +[104](/en/ch9#Kleppmann2022)]. + +A bug in the software could be regarded as a Byzantine fault, but if you deploy the same software to +all nodes, then a Byzantine fault-tolerant algorithm cannot save you. Most Byzantine fault-tolerant +algorithms require a supermajority of more than two-thirds of the nodes to be functioning correctly +(for example, if you have four nodes, at most one may malfunction). 
To use this approach against bugs, you +would have to have four independent implementations of the same software and hope that a bug only +appears in one of the four implementations. + +Similarly, it would be appealing if a protocol could protect us from vulnerabilities, security +compromises, and malicious attacks. Unfortunately, this is not realistic either: in most systems, if +an attacker can compromise one node, they can probably compromise all of them, because they are +probably running the same software. Thus, traditional mechanisms (authentication, access control, +encryption, firewalls, and so on) continue to be the main protection against attackers. + +### Weak forms of lying + +Although we assume that nodes are generally honest, it can be worth adding mechanisms to software +that guard against weak forms of “lying”—for example, invalid messages due to hardware issues, +software bugs, and misconfiguration. Such protection mechanisms are not full-blown Byzantine fault +tolerance, as they would not withstand a determined adversary, but they are nevertheless simple and +pragmatic steps toward better reliability. For example: + +* Network packets do sometimes get corrupted due to hardware issues or bugs in operating systems, + drivers, routers, etc. Usually, corrupted packets are caught by the checksums built into TCP and + UDP, but sometimes they evade detection [[105](/en/ch9#Gilman2015), + [106](/en/ch9#Stone2000), + [107](/en/ch9#Jones2015)]. + Simple measures are usually sufficient protection against such corruption, such as checksums in + the application-level protocol. TLS-encrypted connections also offer protection against + corruption. +* A publicly accessible application must carefully sanitize any inputs from users, for example + checking that a value is within a reasonable range and limiting the size of strings to prevent + denial of service through large memory allocations. 
An internal service behind a firewall may be + able to get away with less strict checks on inputs, but basic checks in protocol parsers are still + a good idea [[105](/en/ch9#Gilman2015)]. +* NTP clients can be configured with multiple server addresses. When synchronizing, the client + contacts all of them, estimates their errors, and checks that a majority of servers agree on some + time range. As long as most of the servers are okay, a misconfigured NTP server that is reporting an + incorrect time is detected as an outlier and is excluded from synchronization + [[39](/en/ch9#Windl2006)]. The use of multiple servers makes NTP + more robust than if it only uses a single server. + +## System Model and Reality + +Many algorithms have been designed to solve distributed systems problems—for example, we will +examine solutions for the consensus problem in [Chapter 10](/en/ch10#ch_consistency). In order to be useful, these +algorithms need to tolerate the various faults of distributed systems that we discussed in this +chapter. + +Algorithms need to be written in a way that does not depend too heavily on the details of the +hardware and software configuration on which they are run. This in turn requires that we somehow +formalize the kinds of faults that we expect to happen in a system. We do this by defining a *system +model*, which is an abstraction that describes what things an algorithm may assume. + +With regard to timing assumptions, three system models are in common use: + +Synchronous model +: The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock + error. This does not imply exactly synchronized clocks or zero network delay; it just means you + know that network delay, pauses, and clock drift will never exceed some fixed upper bound + [[108](/en/ch9#Dwork1988_ch9)]. + The synchronous model is not a realistic model of most practical + systems, because (as discussed in this chapter) unbounded delays and pauses do occur. 
+ +Partially synchronous model +: Partial synchrony means that a system behaves like a synchronous system *most of the time*, but it + sometimes exceeds the bounds for network delay, process pauses, and clock drift + [[108](/en/ch9#Dwork1988_ch9)]. This is a realistic model of many + systems: most of the time, networks and processes are quite well behaved—otherwise we would never + be able to get anything done—but we have to reckon with the fact that any timing assumptions + may be shattered occasionally. When this happens, network delay, pauses, and clock error may become + arbitrarily large. + +Asynchronous model +: In this model, an algorithm is not allowed to make any timing assumptions—in fact, it does not + even have a clock (so it cannot use timeouts). Some algorithms can be designed for the + asynchronous model, but it is very restrictive. + +Moreover, besides timing issues, we have to consider node failures. Some common system models for +nodes are: + +Crash-stop faults +: In the *crash-stop* (or *fail-stop*) model, an algorithm may assume that a node can fail in only + one way, namely by crashing + [[109](/en/ch9#Schlichting1983)]. + This means that the node may suddenly stop responding at any moment, and thereafter that node is + gone forever—it never comes back. + +Crash-recovery faults +: We assume that nodes may crash at any moment, and perhaps start responding again after some + unknown time. In the crash-recovery model, nodes are assumed to have stable storage (i.e., + nonvolatile disk storage) that is preserved across crashes, while the in-memory state is assumed + to be lost. + +Degraded performance and partial functionality +: In addition to crashing and restarting, nodes may go slow: they may still be able to respond to + health check requests, while being too slow to get any real work done. 
For example, a Gigabit + network interface could suddenly drop to 1 Kb/s throughput due to a driver bug + [[110](/en/ch9#Do2013)]; + a process that is under memory pressure may spend most of its time performing garbage collection + [[111](/en/ch9#Snyder2019)]; + worn-out SSDs can have erratic performance; and hardware can be affected by high temperature, + loose connectors, mechanical vibration, power supply problems, firmware bugs, and more + [[112](/en/ch9#Gunawi2018_ch9)]. + Such a situation is called a *limping node*, *gray failure*, or *fail-slow* + [[113](/en/ch9#Huang2017_ch9)], + and it can be even more difficult to deal with than a cleanly failed node. A related problem is + when a process stops doing some of the things it is supposed to do while other aspects continue + working, for example because a background thread is crashed or deadlocked + [[114](/en/ch9#Lou2020)]. + +Byzantine (arbitrary) faults +: Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described + in the last section. + +For modeling real systems, the partially synchronous model with crash-recovery faults is generally +the most useful model. It allows for unbounded network delay, process pauses, and slow nodes. But +how do distributed algorithms cope with that model? + +### Defining the correctness of an algorithm + +To define what it means for an algorithm to be *correct*, we can describe its *properties*. For +example, the output of a sorting algorithm has the property that for any two distinct elements of +the output list, the element further to the left is smaller than the element further to the right. +That is simply a formal way of defining what it means for a list to be sorted. + +Similarly, we can write down the properties we want of a distributed algorithm to define what it +means to be correct. 
For example, if we are generating fencing tokens for a lock (see +[“Fencing off zombies and delayed requests”](/en/ch9#sec_distributed_fencing_tokens)), we may require the algorithm to have the following properties: + +Uniqueness +: No two requests for a fencing token return the same value. + +Monotonic sequence +: If request *x* returned token *t*<sub>*x*</sub>, and request *y* returned token *t*<sub>*y*</sub>, and +  *x* completed before *y* began, then *t*<sub>*x*</sub> < *t*<sub>*y*</sub>. + +Availability +: A node that requests a fencing token and does not crash eventually receives a response. + +An algorithm is correct in some system model if it always satisfies its properties in all situations +that we assume may occur in that system model. However, if all nodes crash, or all network delays +suddenly become infinitely long, then no algorithm will be able to get anything done. How can we +still make useful guarantees even in a system model that allows complete failures? + +### Safety and liveness + +To clarify the situation, it is worth distinguishing between two different kinds of properties: +*safety* and *liveness* properties. In the example just given, *uniqueness* and *monotonic sequence* are +safety properties, but *availability* is a liveness property. + +What distinguishes the two kinds of properties? A giveaway is that liveness properties often include +the word “eventually” in their definition. (And yes, you guessed it—*eventual consistency* is a +liveness property [[115](/en/ch9#Bailis2013_ch9)].) + +Safety is often informally defined as *nothing bad happens*, and liveness as *something good +eventually happens*. However, it’s best to not read too much into those informal definitions, +because “good” and “bad” are value judgements that don’t apply well to algorithms.
The actual +definitions of safety and liveness are more precise +[[116](/en/ch9#Alpern1985)]: + +* If a safety property is violated, we can point at a particular point in time at which it was + broken (for example, if the uniqueness property was violated, we can identify the particular + operation in which a duplicate fencing token was returned). After a safety property has been + violated, the violation cannot be undone—the damage is already done. +* A liveness property works the other way round: it may not hold at some point in time (for example, + a node may have sent a request but not yet received a response), but there is always hope that it + may be satisfied in the future (namely by receiving a response). + +An advantage of distinguishing between safety and liveness properties is that it helps us deal with +difficult system models. For distributed algorithms, it is common to require that safety properties +*always* hold, in all possible situations of a system model +[[108](/en/ch9#Dwork1988_ch9)]. That is, even if all nodes crash, or +the entire network fails, the algorithm must nevertheless ensure that it does not return a wrong +result (i.e., that the safety properties remain satisfied). + +However, with liveness properties we are allowed to make caveats: for example, we could say that a +request needs to receive a response only if a majority of nodes have not crashed, and only if the +network eventually recovers from an outage. The definition of the partially synchronous model +requires that eventually the system returns to a synchronous state—that is, any period of network +interruption lasts only for a finite duration and is then repaired. + +### Mapping system models to the real world + +Safety and liveness properties and system models are very useful for reasoning about the correctness +of a distributed algorithm. 
However, when implementing an algorithm in practice, the messy facts of +reality come back to bite you again, and it becomes clear that the system model is a simplified +abstraction of reality. + +For example, algorithms in the crash-recovery model generally assume that data in stable storage +survives crashes. However, what happens if the data on disk is corrupted, or the data is wiped out +due to hardware error or misconfiguration +[[117](/en/ch9#Junqueira2015)]? +What happens if a server has a firmware bug and fails to recognize +its hard drives on reboot, even though the drives are correctly attached to the server +[[118](/en/ch9#Sanders2016)]? + +Quorum algorithms (see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)) rely on a node remembering the data +that it claims to have stored. If a node may suffer from amnesia and forget previously stored data, +that breaks the quorum condition, and thus breaks the correctness of the algorithm. Perhaps a new +system model is needed, in which we assume that stable storage mostly survives crashes, but may +sometimes be lost. But that model then becomes harder to reason about. + +The theoretical description of an algorithm can declare that certain things are simply assumed not +to happen—and in non-Byzantine systems, we do have to make some assumptions about faults that can +and cannot happen. However, a real implementation may still have to include code to handle the +case where something happens that was assumed to be impossible, even if that handling boils down to +`printf("Sucks to be you")` and `exit(666)`—i.e., letting a human operator clean up the mess +[[119](/en/ch9#Kreps2013)]. +(This is one difference between computer science and software engineering.) + +That is not to say that theoretical, abstract system models are worthless—quite the opposite. 
+They are incredibly helpful for distilling down the complexity of real systems to a manageable set +of faults that we can reason about, so that we can understand the problem and try to solve it +systematically. + +## Formal Methods and Randomized Testing + +How do we know that an algorithm satisfies the required properties? Due to concurrency, partial +failures, and network delays there are a huge number of potential states. We need to guarantee +that the properties hold in every possible state, and ensure that we haven’t forgotten about any +edge cases. + +One approach is to formally verify an algorithm by describing it mathematically, and using proof +techniques to show that it satisfies the required properties in all situations that the system model +allows. Proving an algorithm correct does not mean its *implementation* on a real system will +necessarily always behave correctly. But it’s a very good first step, because the theoretical +analysis can uncover problems in an algorithm that might remain hidden for a long time in a real +system, and that only come to bite you when your assumptions (e.g., about timing) are defeated due +to unusual circumstances. + +It is prudent to combine theoretical analysis with empirical testing to verify that implementations +behave as expected. Techniques such as property-based testing, fuzzing, and deterministic simulation +testing (DST) use randomization to test a system in a wide range of situations. Companies such as +Amazon Web Services have successfully used a combination of these techniques on many of their +products [[120](/en/ch9#Brooker2024correctness), +[121](/en/ch9#SatarinTesting)]. + +### Model checking and specification languages + +*Model checkers* are tools that help verify that an algorithm or system behaves as expected. An algorithm +specification is written in a purpose-built language such as TLA+, Gallina, or FizzBee. 
These +languages make it easier to focus on an algorithm’s behavior without worrying about code +implementation details. Model checkers then use these models to verify that invariants hold across +all of an algorithm’s states by systematically trying all the things that could happen. + +Model checking can’t actually prove that an algorithm’s invariants hold for every possible state +since most real-world algorithms have an infinite state space. A true verification of all states +would require a formal proof, which can be done, but which is typically more difficult than running +a model checker. Instead, model checkers encourage you to reduce the algorithm’s model to an +approximation that can be fully verified, or to limit the execution to some upper bound (for +example, by setting a maximum number of messages that can be sent). Any bugs that only occur with +longer executions would then not be found. + +Still, model checkers strike a nice balance between ease of use and the ability to find non-obvious +bugs. CockroachDB, TiDB, Kafka, and many other distributed systems use model specifications to find +and fix bugs +[[122](/en/ch9#Vanlightly2024), +[123](/en/ch9#Tang2018), +[124](/en/ch9#VanBenschoten2019)]. For example, +using TLA+, researchers were able to demonstrate the potential for data loss in viewstamped +replication (VR) caused by ambiguity in the prose description of the algorithm +[[125](/en/ch9#Vanlightly2022)]. + +By design, model checkers don’t run your actual code, but rather a simplified model that specifies +only the core ideas of your protocol. This makes it more tractable to systematically explore the +state space, but it risks that your specification and your implementation go out of sync with each +other [[126](/en/ch9#Wayne2024)]. +It is possible to check whether the model and the real implementation have equivalent behavior, but +this requires instrumentation in the real implementation +[[127](/en/ch9#Ouyang2025)]. 
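
To make the idea of "systematically trying all the things that could happen" concrete, here is a toy explicit-state checker in Python. It is only a sketch of the exploration technique, not a substitute for TLA+ or any real model checker, and every name in it (`next_states`, `check`, `no_lost_update`) is made up for this example. The model is two processes that each increment a shared counter in two non-atomic steps (read, then write back the value read plus one). The checker explores every interleaving breadth-first and returns a trace leading to a state that violates the invariant.

```python
from collections import deque

# A state is (counter, procs), where procs is a tuple of (pc, tmp) pairs:
# pc 0 = about to read, pc 1 = about to write, pc 2 = finished.


def _replace(procs, i, value):
    return procs[:i] + (value,) + procs[i + 1:]


def next_states(state):
    """Yield every state reachable by letting one process take one step."""
    counter, procs = state
    for i, (pc, tmp) in enumerate(procs):
        if pc == 0:    # read the shared counter into a local variable
            yield (counter, _replace(procs, i, (1, counter)))
        elif pc == 1:  # write back the local value plus one
            yield (tmp + 1, _replace(procs, i, (2, tmp)))


def check(initial, invariant):
    """Breadth-first search; return a trace to a violating state, or None."""
    parent = {initial: None}
    queue = deque([initial])
    while queue:
        state = queue.popleft()
        if not invariant(state):
            trace = []
            while state is not None:
                trace.append(state)
                state = parent[state]
            return trace[::-1]
        for nxt in next_states(state):
            if nxt not in parent:  # visit each distinct state only once
                parent[nxt] = state
                queue.append(nxt)
    return None


def no_lost_update(state):
    """Once all processes have finished, the counter must equal their number."""
    counter, procs = state
    finished = all(pc == 2 for pc, _ in procs)
    return not finished or counter == len(procs)


initial = (0, ((0, 0), (0, 0)))
trace = check(initial, no_lost_update)
for step in trace:
    print(step)
```

The returned trace is the interleaving in which both processes read 0 before either writes, so the final counter is 1 rather than 2. This model has a small finite state space, so the search is exhaustive; for models with unbounded state, a checker of this kind would need an explicit bound such as a maximum number of steps.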
+ +### Fault injection + +Many bugs are triggered when machine and network failures occur. Fault injection is an effective +(and sometimes scary) technique that verifies whether a system’s implementation works as expected when things +go wrong. The idea is simple: inject faults into a running system’s environment and see how it +behaves. Faults can be network failures, machine crashes, disk corruption, paused +processes—anything you can imagine going wrong with a computer. + +Fault injection tests are typically run in an environment that closely resembles the production +environment where the system will run. Some even inject faults directly into their production +environment. Netflix popularized this approach with their Chaos Monkey tool +[[128](/en/ch9#Izrailevsky2011)]. Production fault +injection is often referred to as *chaos engineering*, which we discussed in +[“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability). + +To run fault injection tests, the system under test is first deployed along with fault injection +coordinators and scripts. Coordinators are responsible for deciding what faults to execute and when +to execute them. Local or remote scripts are responsible for injecting failures into individual +nodes or processes. Injection scripts use many different tools to trigger faults. A Linux process +can be paused or killed using Linux’s `kill` command, a disk can be unmounted with `umount`, and +network connections can be disrupted through firewall settings. You can inspect system behavior +during and after faults are injected to make sure things work as expected. + +The myriad of tools required to trigger failures make fault injection tests cumbersome to write. +It’s common to adopt a fault injection framework like Jepsen to run fault injection tests to +simplify the process. Such frameworks come with integrations for various operating systems and many +pre-built fault injectors +[[129](/en/ch9#Kingsbury2013jepsen)].
+Jepsen has been remarkably effective at finding critical bugs in many widely used systems +[[130](/en/ch9#Kingsbury2024), +[131](/en/ch9#Majumdar2017)]. + +### Deterministic simulation testing + +Deterministic simulation testing (DST) has also become a popular complement to model checking and +fault injection. It uses a similar state space exploration process as a model checker, but it tests +your actual code, not a model. + +In DST, a simulation automatically runs through a large number of randomized executions of the +system. Network communication, I/O, and clock timing during the simulation are all replaced with +mocks that allow the simulator to control the exact order in which things happen, including various +timings and failure scenarios. This allows the simulator to explore many more situations than +hand-written tests or fault injection could. If a test fails, it can be re-run since the simulator +knows the exact order of operations that triggered the failure—in contrast to fault injection, which +does not have such fine-grained control over the system. + +DST requires the simulator to be able to control all sources of nondeterminism, such as network +delays. One of three strategies is generally adopted to make code deterministic: + +Application-level +: Some systems are built from the ground up to make it easy to execute code deterministically. For +  example, FoundationDB, one of the pioneers in the DST space, is built using an asynchronous +  communication library called Flow. Flow provides a point for developers to inject a deterministic +  network simulation into the system +  [[132](/en/ch9#FoundationDB_ch9)]. +  Similarly, TigerBeetle is an online transaction processing (OLTP) database with first-class DST +  support. The system’s state is modeled as a state machine, with all mutations occurring within a +  single event loop.
When combined with mock deterministic primitives such as clocks, such an + architecture is able to run deterministically + [[133](/en/ch9#Kladov2023)]. + +Runtime-level +: Languages with asynchronous runtimes and commonly used libraries provide an insertion point + to introduce determinism. A single-threaded runtime is used to force all asynchronous code to run + sequentially. FrostDB, for example, patches Go’s runtime to execute goroutines sequentially + [[134](/en/ch9#Marques2024)]. + Rust’s madsim library works in a similar manner. Madsim provides deterministic implementations of + Tokio’s asynchronous runtime API, AWS’s S3 library, Kafka’s Rust library, and many others. + Applications can swap in deterministic libraries and runtimes to get deterministic test executions + without changing their code. + +Machine-level +: Rather than patching code at runtime, an entire machine can be made deterministic. This is a + delicate process that requires a machine to respond to all normally nondeterministic calls with + deterministic responses. Tools such as Antithesis do this by building a custom hypervisor that + replaces normally nondeterministic operations with deterministic ones. Everything from clocks + to network and storage needs to be accounted for. Once done, though, developers can run their + entire distributed system in a collection of containers within the hypervisor and get a completely + deterministic distributed system. + +DST provides several advantages beyond replayability. Tools such as Antithesis attempt to explore +many different code paths in application code by branching a test execution into multiple +sub-executions when it discovers less common behavior. And because deterministic tests often use +mocked clocks and network calls, such tests can run faster than wall-clock time. 
For example, +TigerBeetle’s time abstraction allows simulations to simulate network latency and timeouts without +actually taking the full length of time to trigger the timeout. Such techniques allow the simulator +to explore more code paths faster. + +# The Power of Determinism + +Nondeterminism is at the core of all of the distributed systems challenges we discussed in this +chapter: concurrency, network delay, process pauses, clock jumps, and crashes all happen in +unpredictable ways that vary from one run of a system to the next. Conversely, if you can make a +system deterministic, that can hugely simplify things. + +In fact, making things deterministic is a simple but powerful idea that arises again and again in +distributed system design. Besides deterministic simulation testing, we have seen several ways of +using determinism over the past chapters: + +* A key advantage of event sourcing (see [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events)) is that you can + deterministically replay a log of events to reconstruct derived materialized views. +* Workflow engines (see [“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows)) rely on workflow definitions being + deterministic to provide durable execution semantics. +* *State machine replication*, which we will discuss in [“Using shared logs”](/en/ch10#sec_consistency_smr), replicates data by + independently executing the same sequence of deterministic transactions on each replica. We have + already seen two variants of that idea: statement-based replication (see + [“Implementation of Replication Logs”](/en/ch6#sec_replication_implementation)) and serial transaction execution using stored procedures + (see [“Pros and cons of stored procedures”](/en/ch8#sec_transactions_stored_proc_tradeoffs)). + +However, making code fully deterministic requires care. 
Even once you have removed all concurrency +and replaced I/O, network communication, clocks, and random number generators with deterministic +simulations, elements of nondeterminism may remain. For example, in some programming languages, the +order in which you iterate over the elements of a hash table may be nondeterministic. Whether you +run into a resource limit (memory allocation failure, stack overflow) is also nondeterministic. + +# Summary + +In this chapter we have discussed a wide range of problems that can occur in distributed systems, +including: + +* Whenever you try to send a packet over the network, it may be lost or arbitrarily delayed. + Likewise, the reply may be lost or delayed, so if you don’t get a reply, you have no idea whether + the message got through. +* A node’s clock may be significantly out of sync with other nodes (despite your best efforts to set + up NTP), it may suddenly jump forward or back in time, and relying on it is dangerous because you + most likely don’t have a good measure of your clock’s confidence interval. +* A process may pause for a substantial amount of time at any point in its execution, be declared + dead by other nodes, and then come back to life again without realizing that it was paused. + +The fact that such *partial failures* can occur is the defining characteristic of distributed +systems. Whenever software tries to do anything involving other nodes, there is the possibility that +it may occasionally fail, or randomly go slow, or not respond at all (and eventually time out). In +distributed systems, we try to build tolerance of partial failures into software, so that the system +as a whole may continue functioning even when some of its constituent parts are broken. + +To tolerate faults, the first step is to *detect* them, but even that is hard. 
Most systems +don’t have an accurate mechanism of detecting whether a node has failed, so most distributed +algorithms rely on timeouts to determine whether a remote node is still available. However, timeouts +can’t distinguish between network and node failures, and variable network delay sometimes causes a +node to be falsely suspected of crashing. Handling limping nodes, which are responding but are too +slow to do anything useful, is even harder. + +Once a fault is detected, making a system tolerate it is not easy either: there is no global +variable, no shared memory, no common knowledge or any other kind of shared state between the +machines [[83](/en/ch9#Halpern1990)]. +Nodes can’t even agree on what time it is, let alone on anything more profound. The only way +information can flow from one node to another is by sending it over the unreliable network. Major +decisions cannot be safely made by a single node, so we require protocols that enlist help from +other nodes and try to get a quorum to agree. + +If you’re used to writing software in the idealized mathematical perfection of a single computer, +where the same operation always deterministically returns the same result, then moving to the messy +physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems +engineers will often regard a problem as trivial if it can be solved on a single computer +[[4](/en/ch9#Hodges2013)], +and indeed a single computer can do a lot nowadays. If you can avoid opening Pandora’s box and +simply keep things on a single machine, for example by using an embedded storage engine (see +[“Embedded storage engines”](/en/ch4#sidebar_embedded)), it is generally worth doing so. + +However, as discussed in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed), scalability is not the only reason for +wanting to use a distributed system. 
Fault tolerance and low latency (by placing data geographically +close to users) are equally important goals, and those things cannot be achieved with a single node. +The power of distributed systems is that in principle, they can run forever without being +interrupted at the service level, because all faults and maintenance can be handled at the node +level. (In practice, if a bad configuration change is rolled out to all nodes, that will still bring +a distributed system to its knees.) + +In this chapter we also went on some tangents to explore whether the unreliability of networks, +clocks, and processes is an inevitable law of nature. We saw that it isn’t: it is possible to give +hard real-time response guarantees and bounded delays in networks, but doing so is very expensive and +results in lower utilization of hardware resources. Most non-safety-critical systems choose cheap +and unreliable over expensive and reliable. + +This chapter has been all about problems, and has given us a bleak outlook. In the next chapter we +will move on to solutions, and discuss some algorithms that have been designed to cope with the +problems in distributed systems. + +##### Footnotes + +##### References + +[[1](/en/ch9#Cavage2013-marker)] Mark Cavage. +[There’s Just No Getting Around It: You’re +Building a Distributed System](https://queue.acm.org/detail.cfm?id=2482856). *ACM Queue*, volume 11, issue 4, pages 80-89, April 2013. +[doi:10.1145/2466486.2482856](https://doi.org/10.1145/2466486.2482856) + +[[2](/en/ch9#Kreps2012_ch9-marker)] Jay Kreps. +[Getting +Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. +Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW) + +[[3](/en/ch9#Hale2010-marker)] Coda Hale. +[You Can’t Sacrifice +Partition Tolerance](https://codahale.com/you-cant-sacrifice-partition-tolerance/). *codahale.com*, October 2010. 
+ + +[[4](/en/ch9#Hodges2013-marker)] Jeff Hodges. +[Notes +on Distributed Systems for Young Bloods](https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/). *somethingsimilar.com*, January 2013. +Archived at [perma.cc/B636-62CE](https://perma.cc/B636-62CE) + +[[5](/en/ch9#Jacobson1988-marker)] Van Jacobson. +[Congestion +Avoidance and Control](https://www.cs.usask.ca/ftp/pub/discus/seminars2002-2003/p314-jacobson.pdf). At *ACM Symposium on Communications Architectures and +Protocols* (SIGCOMM), August 1988. +[doi:10.1145/52324.52356](https://doi.org/10.1145/52324.52356) + +[[6](/en/ch9#Hubert2009-marker)] Bert Hubert. +[The +Ultimate SO\_LINGER Page, or: Why Is My TCP Not Reliable](https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable). *blog.netherlabs.nl*, January 2009. +Archived at [perma.cc/6HDX-L2RR](https://perma.cc/6HDX-L2RR) + +[[7](/en/ch9#Saltzer1984_ch9-marker)] Jerome H. Saltzer, David P. Reed, and David D. Clark. +[End-To-End +Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf). *ACM Transactions on Computer Systems*, volume 2, issue 4, +pages 277–288, November 1984. +[doi:10.1145/357401.357402](https://doi.org/10.1145/357401.357402) + +[[8](/en/ch9#Bailis2014reliable-marker)] Peter Bailis and Kyle Kingsbury. +[The Network Is Reliable](https://queue.acm.org/detail.cfm?id=2655736). +*ACM Queue*, volume 12, issue 7, pages 48-55, July 2014. +[doi:10.1145/2639988.2639988](https://doi.org/10.1145/2639988.2639988) + +[[9](/en/ch9#Leners2015-marker)] Joshua B. Leners, Trinabh Gupta, Marcos K. +Aguilera, and Michael Walfish. +[Taming Uncertainty in +Distributed Systems with Help from the Network](https://cs.nyu.edu/~mwalfish/papers/albatross-eurosys15.pdf). At *10th European Conference on Computer +Systems* (EuroSys), April 2015. 
+[doi:10.1145/2741948.2741976](https://doi.org/10.1145/2741948.2741976) + +[[10](/en/ch9#Gill2011-marker)] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. +[Understanding +Network Failures in Data Centers: Measurement, Analysis, and Implications](https://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p350.pdf). At +*ACM SIGCOMM Conference*, August 2011. +[doi:10.1145/2018436.2018477](https://doi.org/10.1145/2018436.2018477) + +[[11](/en/ch9#Hoelzle2020-marker)] Urs Hölzle. +[But recently a farmer had started +grazing a herd of cows nearby. And whenever they stepped on the fiber link, they bent it enough +to cause a blip](https://x.com/uhoelzle/status/1263333283107991558). *x.com*, May 2020. +Archived at [perma.cc/WX8X-ZZA5](https://perma.cc/WX8X-ZZA5) + +[[12](/en/ch9#CBCNews2021-marker)] CBC News. +[Hundreds +lose internet service in northern B.C. after beaver chews through cable](https://www.cbc.ca/news/canada/british-columbia/beaver-internet-down-tumbler-ridge-1.6001594). *cbc.ca*, +April 2021. Archived at [perma.cc/UW8C-H2MY](https://perma.cc/UW8C-H2MY) + +[[13](/en/ch9#Oremus2014-marker)] Will Oremus. +[The +Global Internet Is Being Attacked by Sharks, Google Confirms](https://slate.com/technology/2014/08/shark-attacks-threaten-google-s-undersea-internet-cables-video.html). *slate.com*, August 2014. +Archived at [perma.cc/P6F3-C6YG](https://perma.cc/P6F3-C6YG) + +[[14](/en/ch9#AuerbachJahajeeah2023-marker)] Jess Auerbach Jahajeeah. +[Down to the wire: The +ship fixing our internet](https://continent.substack.com/p/down-to-the-wire-the-ship-fixing). *continent.substack.com*, November 2023. +Archived at [perma.cc/DP7B-EQ7S](https://perma.cc/DP7B-EQ7S) + +[[15](/en/ch9#Janardhan2021-marker)] Santosh Janardhan. +[More details +about the October 4 outage](https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/). *engineering.fb.com*, October 2021. 
+Archived at [perma.cc/WW89-VSXH](https://perma.cc/WW89-VSXH) + +[[16](/en/ch9#Parfitt2011-marker)] Tom Parfitt. +[Georgian +woman cuts off web access to whole of Armenia](https://www.theguardian.com/world/2011/apr/06/georgian-woman-cuts-web-access). *theguardian.com*, April 2011. +Archived at [perma.cc/KMC3-N3NZ](https://perma.cc/KMC3-N3NZ) + +[[17](/en/ch9#Voce2025-marker)] Antonio Voce, Tural Ahmedzade and Ashley Kirk. +[‘Shadow +fleets’ and subaquatic sabotage: are Europe’s undersea internet cables under attack?](https://www.theguardian.com/world/ng-interactive/2025/mar/05/shadow-fleets-subaquatic-sabotage-europe-undersea-internet-cables-under-attack) +*theguardian.com*, March 2025. +Archived at [perma.cc/HA7S-ZDBV](https://perma.cc/HA7S-ZDBV) + +[[18](/en/ch9#Liu2016-marker)] Shengyun Liu, Paolo Viotti, +Christian Cachin, Vivien Quéma, and Marko Vukolić. +[XFT: Practical +Fault Tolerance beyond Crashes](https://www.usenix.org/system/files/conference/osdi16/osdi16-liu.pdf). At *12th USENIX Symposium on Operating Systems Design and +Implementation* (OSDI), November 2016. + +[[19](/en/ch9#Imbriaco2012_ch9-marker)] Mark Imbriaco. +[Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/). +*github.blog*, December 2012. +Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ) + +[[20](/en/ch9#Lianza2020_ch9-marker)] Tom Lianza and Chris Snook. +[A Byzantine failure +in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/). *blog.cloudflare.com*, November 2020. +Archived at [perma.cc/83EZ-ALCY](https://perma.cc/83EZ-ALCY) + +[[21](/en/ch9#Alfatafta2020-marker)] Mohammed Alfatafta, Basil Alkhatib, Ahmed Alquraan, +and Samer Al-Kiswany. +[Toward a Generic Fault +Tolerance Technique for Partial Network Partitioning](https://www.usenix.org/conference/osdi20/presentation/alfatafta). At *14th USENIX Symposium on +Operating Systems Design and Implementation* (OSDI), November 2020. 
+ +[[22](/en/ch9#Donges2012-marker)] Marc A. Donges. +[Re: bnx2 cards Intermittantly Going +Offline](https://www.spinics.net/lists/netdev/msg210485.html). Message to Linux *netdev* mailing list, *spinics.net*, September 2012. +Archived at [perma.cc/TXP6-H8R3](https://perma.cc/TXP6-H8R3) + +[[23](/en/ch9#Toman2020-marker)] Troy Toman. +[Inside a CODE RED: +Network Edition](https://signalvnoise.com/svn3/inside-a-code-red-network-edition/). *signalvnoise.com*, September 2020. +Archived at [perma.cc/BET6-FY25](https://perma.cc/BET6-FY25) + +[[24](/en/ch9#Kingsbury2014elastic-marker)] Kyle Kingsbury. +[Call Me Maybe: +Elasticsearch](https://aphyr.com/posts/317-call-me-maybe-elasticsearch). *aphyr.com*, June 2014. +[perma.cc/JK47-S89J](https://perma.cc/JK47-S89J) + +[[25](/en/ch9#Sanfilippo2014-marker)] Salvatore Sanfilippo. +[A Few Arguments About Redis Sentinel Properties and Fail +Scenarios](https://antirez.com/news/80). *antirez.com*, October 2014. +[perma.cc/8XEU-CLM8](https://perma.cc/8XEU-CLM8) + +[[26](/en/ch9#Liochon2015-marker)] Nicolas Liochon. +[CAP: +If All You Have Is a Timeout, Everything Looks Like a Partition](http://blog.thislongrun.com/2015/05/CAP-theorem-partition-timeout-zookeeper.html). *blog.thislongrun.com*, +May 2015. Archived at [perma.cc/FS57-V2PZ](https://perma.cc/FS57-V2PZ) + +[[27](/en/ch9#Grosvenor2015-marker)] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel +Gog, Robert N. M. Watson, Andrew W. Moore, Steven Hand, and Jon Crowcroft. +[Queues +Don’t Matter When You Can JUMP Them!](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-grosvenor_update.pdf) At *12th USENIX Symposium on Networked +Systems Design and Implementation* (NSDI), May 2015. + +[[28](/en/ch9#Julienne2019-marker)] Theo Julienne. +[Debugging +network stalls on Kubernetes](https://github.blog/engineering/debugging-network-stalls-on-kubernetes/). *github.blog*, November 2019. 
+Archived at [perma.cc/K9M8-XVGL](https://perma.cc/K9M8-XVGL) + +[[29](/en/ch9#Wang2010-marker)] Guohui Wang and T. S. Eugene Ng. +[The Impact of +Virtualization on Network Performance of Amazon EC2 Data Center](https://www.cs.rice.edu/~eugeneng/papers/INFOCOM10-ec2.pdf). At *29th IEEE +International Conference on Computer Communications* (INFOCOM), March 2010. +[doi:10.1109/INFCOM.2010.5461931](https://doi.org/10.1109/INFCOM.2010.5461931) + +[[30](/en/ch9#Philips2014-marker)] Brandon Philips. +[etcd: Distributed Locking and Service +Discovery](https://www.youtube.com/watch?v=HJIjTTHWYnE). At *Strange Loop*, September 2014. + +[[31](/en/ch9#Newman2012-marker)] Steve Newman. +[A Systematic Look at EC2 I/O](https://www.sentinelone.com/blog/a-systematic-look-at-ec2-i-o/). +*blog.scalyr.com*, October 2012. +Archived at [perma.cc/FL4R-H2VE](https://perma.cc/FL4R-H2VE) + +[[32](/en/ch9#Hayashibara2004-marker)] Naohiro Hayashibara, Xavier Défago, Rami Yared, and +Takuya Katayama. [The ϕ Accrual Failure +Detector](https://hdl.handle.net/10119/4784). Japan Advanced Institute of Science and Technology, School of Information +Science, Technical Report IS-RR-2004-010, May 2004. +Archived at [perma.cc/NSM2-TRYA](https://perma.cc/NSM2-TRYA) + +[[33](/en/ch9#Wang2013-marker)] Jeffrey Wang. +[Phi +Accrual Failure Detector](https://ternarysearch.blogspot.com/2013/08/phi-accrual-failure-detector.html). *ternarysearch.blogspot.co.uk*, August 2013. +[perma.cc/L452-AMLV](https://perma.cc/L452-AMLV) + +[[34](/en/ch9#Keshav1997-marker)] Srinivasan Keshav. *An Engineering Approach +to Computer Networking: ATM Networks, the Internet, and the Telephone Network*. +Addison-Wesley Professional, May 1997. ISBN: 978-0-201-63442-6 + +[[35](/en/ch9#Kyas1995-marker)] Othmar Kyas. *ATM Networks*. +International Thomson Publishing, 1995. ISBN: 978-1-850-32128-6 + +[[36](/en/ch9#Mellanox2014-marker)] Mellanox Technologies. 
+[InfiniBand +FAQ, Rev 1.3](https://network.nvidia.com/related-docs/whitepapers/InfiniBandFAQ_FQ_100.pdf). *network.nvidia.com*, December 2014. +Archived at [perma.cc/LQJ4-QZVK](https://perma.cc/LQJ4-QZVK) + +[[37](/en/ch9#Santos2003-marker)] Jose Renato Santos, Yoshio Turner, and G. (John) Janakiraman. +[End-to-End Congestion Control +for InfiniBand](https://infocom2003.ieee-infocom.org/papers/28_01.PDF). At *22nd Annual Joint Conference of the IEEE Computer and +Communications Societies* (INFOCOM), April 2003. Also published by HP Laboratories Palo +Alto, Tech Report HPL-2002-359. +[doi:10.1109/INFCOM.2003.1208949](https://doi.org/10.1109/INFCOM.2003.1208949) + +[[38](/en/ch9#Li2014-marker)] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and +Steven D. Gribble. +[Tales of the Tail: Hardware, +OS, and Application-level Sources of Tail Latency](https://syslab.cs.washington.edu/papers/latency-socc14.pdf). At *ACM Symposium on Cloud Computing* +(SOCC), November 2014. +[doi:10.1145/2670979.2670988](https://doi.org/10.1145/2670979.2670988) + +[[39](/en/ch9#Windl2006-marker)] Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley. +[The NTP FAQ and HOWTO](https://www.ntp.org/ntpfaq/). *ntp.org*, November 2006. + +[[40](/en/ch9#GrahamCumming2017-marker)] John Graham-Cumming. +[How and +why the leap second affected Cloudflare DNS](https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/). *blog.cloudflare.com*, January 2017. +Archived at [archive.org](https://web.archive.org/web/20250202041444/https%3A//blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/) + +[[41](/en/ch9#Holmes2006-marker)] David Holmes. +[Inside +the Hotspot VM: Clocks, Timers and Scheduling Events – Part I – Windows](https://web.archive.org/web/20160308031939/https%3A//blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks). *blogs.oracle.com*, +October 2006. 
Archived at [archive.org](https://web.archive.org/web/20160308031939/https%3A//blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks) + +[[42](/en/ch9#Greef2021-marker)] Joran Dirk Greef. +[Three Clocks are +Better than One](https://tigerbeetle.com/blog/2021-08-30-three-clocks-are-better-than-one/). *tigerbeetle.com*, August 2021. +Archived at [perma.cc/5RXG-EU6B](https://perma.cc/5RXG-EU6B) + +[[43](/en/ch9#Yang2015-marker)] Oliver Yang. +[Pitfalls of TSC usage](https://oliveryang.net/2015/09/pitfalls-of-TSC-usage/). +*oliveryang.net*, September 2015. +Archived at [perma.cc/Z2QY-5FRA](https://perma.cc/Z2QY-5FRA) + +[[44](/en/ch9#Loughran2015-marker)] Steve Loughran. +[Time +on Multi-Core, Multi-Socket Servers](https://steveloughran.blogspot.com/2015/09/time-on-multi-core-multi-socket-servers.html). *steveloughran.blogspot.co.uk*, September 2015. +Archived at [perma.cc/7M4S-D4U6](https://perma.cc/7M4S-D4U6) + +[[45](/en/ch9#Corbett2012_ch9-marker)] James C. Corbett, Jeffrey Dean, Michael +Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher +Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, +Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, +Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. +[Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). +At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), +October 2012. + +[[46](/en/ch9#Caporaloni2012-marker)] M. Caporaloni and R. Ambrosini. +[How Closely Can a Personal Computer +Clock Track the UTC Timescale Via the Internet?](https://iopscience.iop.org/0143-0807/23/4/103/) *European Journal of Physics*, +volume 23, issue 4, pages L17–L21, June 2012. +[doi:10.1088/0143-0807/23/4/103](https://doi.org/10.1088/0143-0807/23/4/103) + +[[47](/en/ch9#Minar1999-marker)] Nelson Minar. 
+[A Survey of the NTP Network](https://alumni.media.mit.edu/~nelson/research/ntp-survey99/). +*alumni.media.mit.edu*, December 1999. +Archived at [perma.cc/EV76-7ZV3](https://perma.cc/EV76-7ZV3) + +[[48](/en/ch9#Holub2014-marker)] Viliam Holub. +[Synchronizing +Clocks in a Cassandra Cluster Pt. 1 – The Problem](https://blog.rapid7.com/2014/03/14/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/). *blog.rapid7.com*, March 2014. +Archived at [perma.cc/N3RV-5LNL](https://perma.cc/N3RV-5LNL) + +[[49](/en/ch9#Kamp2011-marker)] Poul-Henning Kamp. +[The One-Second War (What Time Will You Die?)](https://queue.acm.org/detail.cfm?id=1967009) +*ACM Queue*, volume 9, issue 4, pages 44–48, April 2011. +[doi:10.1145/1966989.1967009](https://doi.org/10.1145/1966989.1967009) + +[[50](/en/ch9#Minar2012_ch9-marker)] Nelson Minar. +[Leap Second Crashes Half +the Internet](https://www.somebits.com/weblog/tech/bad/leap-second-2012.html). *somebits.com*, July 2012. +Archived at [perma.cc/2WB8-D6EU](https://perma.cc/2WB8-D6EU) + +[[51](/en/ch9#Pascoe2011-marker)] Christopher Pascoe. +[Time, +Technology and Leaping Seconds](https://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html). *googleblog.blogspot.co.uk*, September 2011. +Archived at [perma.cc/U2JL-7E74](https://perma.cc/U2JL-7E74) + +[[52](/en/ch9#Zhao2015-marker)] Mingxue Zhao and Jeff Barr. +[Look +Before You Leap – The Coming Leap Second and AWS](https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/). *aws.amazon.com*, May 2015. +Archived at [perma.cc/KPE9-XMFM](https://perma.cc/KPE9-XMFM) + +[[53](/en/ch9#Veitch2016-marker)] Darryl Veitch and Kanthaiah Vijayalayan. +[Network Timing +and the 2015 Leap Second](https://opus.lib.uts.edu.au/bitstream/10453/43923/1/LeapSecond_camera.pdf). At *17th International Conference on Passive and Active +Measurement* (PAM), April 2016. 
+[doi:10.1007/978-3-319-30505-9\_29](https://doi.org/10.1007/978-3-319-30505-9_29) + +[[54](/en/ch9#VMware2011-marker)] VMware, Inc. +[Timekeeping in VMware Virtual +Machines](https://www.vmware.com/docs/vmware_timekeeping). *vmware.com*, October 2008. +Archived at [perma.cc/HM5R-T5NF](https://perma.cc/HM5R-T5NF) + +[[55](/en/ch9#Yodaiken2017-marker)] Victor Yodaiken. +[Clock +Synchronization in Finance and Beyond](https://www.yodaiken.com/wp-content/uploads/2018/05/financeandbeyond.pdf). *yodaiken.com*, November 2017. +Archived at [perma.cc/9XZD-8ZZN](https://perma.cc/9XZD-8ZZN) + +[[56](/en/ch9#EmreAcer2017-marker)] Mustafa Emre Acer, Emily Stark, Adrienne Porter +Felt, Sascha Fahl, Radhika Bhargava, Bhanu Dev, Matt Braithwaite, Ryan Sleevi, and Parisa Tabriz. +[Where the Wild Warnings Are: Root Causes +of Chrome HTTPS Certificate Errors](https://acmccs.github.io/papers/p1407-acerA.pdf). At *ACM SIGSAC Conference on Computer and +Communications Security* (CCS), pages 1407–1420, October 2017. +[doi:10.1145/3133956.3134007](https://doi.org/10.1145/3133956.3134007) + +[[57](/en/ch9#MiFID2015-marker)] European Securities and Markets Authority. +[MiFID +II / MiFIR: Regulatory Technical and Implementing Standards – Annex I](https://www.esma.europa.eu/sites/default/files/library/2015/11/2015-esma-1464_annex_i_-_draft_rts_and_its_on_mifid_ii_and_mifir.pdf). +*esma.europa.eu*, Report ESMA/2015/1464, September 2015. +Archived at [perma.cc/ZLX9-FGQ3](https://perma.cc/ZLX9-FGQ3) + +[[58](/en/ch9#Bigum2015-marker)] Luke Bigum. +[Solving +MiFID II Clock Synchronisation With Minimum Spend (Part 1)](https://catach.blogspot.com/2015/11/solving-mifid-ii-clock-synchronisation.html). *catach.blogspot.com*, +November 2015. Archived at [perma.cc/4J5W-FNM4](https://perma.cc/4J5W-FNM4) + +[[59](/en/ch9#Obleukhov2022-marker)] Oleg Obleukhov and Ahmad Byagowi. 
+[How +Precision Time Protocol is being deployed at Meta](https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/). *engineering.fb.com*, November 2022. +Archived at [perma.cc/29G6-UJNW](https://perma.cc/29G6-UJNW) + +[[60](/en/ch9#Wiseman2022-marker)] John Wiseman. +[gpsjam.org](https://gpsjam.org/), July 2022. + +[[61](/en/ch9#Levinson2023-marker)] Josh Levinson, Julien Ridoux, and Chris Munns. +[It’s +About Time: Microsecond-Accurate Clocks on Amazon EC2 Instances](https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/). *aws.amazon.com*, November 2023. +Archived at [perma.cc/56M6-5VMZ](https://perma.cc/56M6-5VMZ) + +[[62](/en/ch9#Kingsbury2013cassandra-marker)] Kyle Kingsbury. +[Call Me Maybe: Cassandra](https://aphyr.com/posts/294-call-me-maybe-cassandra/). +*aphyr.com*, September 2013. +Archived at [perma.cc/4MBR-J96V](https://perma.cc/4MBR-J96V) + +[[63](/en/ch9#Daily2013_ch9-marker)] John Daily. +[Clocks Are Bad, or, +Welcome to the Wonderful World of Distributed Systems](https://riak.com/clocks-are-bad-or-welcome-to-distributed-systems/). *riak.com*, November 2013. +Archived at [perma.cc/4XB5-UCXY](https://perma.cc/4XB5-UCXY) + +[[64](/en/ch9#Brooker2023time-marker)] Marc Brooker. +[It’s About Time!](https://brooker.co.za/blog/2023/11/27/about-time.html) +*brooker.co.za*, November 2023. +Archived at [perma.cc/N6YK-DRPA](https://perma.cc/N6YK-DRPA) + +[[65](/en/ch9#Kingsbury2013timestamps-marker)] Kyle Kingsbury. +[The Trouble with Timestamps](https://aphyr.com/posts/299-the-trouble-with-timestamps). +*aphyr.com*, October 2013. +Archived at [perma.cc/W3AM-5VAV](https://perma.cc/W3AM-5VAV) + +[[66](/en/ch9#Lamport1978_ch9-marker)] Leslie Lamport. +[Time, +Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). 
*Communications of the ACM*, +volume 21, issue 7, pages 558–565, July 1978. +[doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563) + +[[67](/en/ch9#Sheehy2015-marker)] Justin Sheehy. +[There Is No Now: Problems With Simultaneity +in Distributed Systems](https://queue.acm.org/detail.cfm?id=2745385). *ACM Queue*, volume 13, issue 3, pages 36–41, March 2015. +[doi:10.1145/2733108](https://doi.org/10.1145/2733108) + +[[68](/en/ch9#Demirbas2013-marker)] Murat Demirbas. +[Spanner: +Google’s Globally-Distributed Database](https://muratbuffalo.blogspot.com/2013/07/spanner-googles-globally-distributed_4.html). *muratbuffalo.blogspot.co.uk*, July 2013. +Archived at [perma.cc/6VWR-C9WB](https://perma.cc/6VWR-C9WB) + +[[69](/en/ch9#Malkhi2013-marker)] Dahlia Malkhi and Jean-Philippe Martin. +[Spanner’s Concurrency +Control](https://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf). *ACM SIGACT News*, volume 44, issue 3, pages 73–77, September 2013. +[doi:10.1145/2527748.2527767](https://doi.org/10.1145/2527748.2527767) + +[[70](/en/ch9#Pachot2024-marker)] Franck Pachot. +[Achieving Precise Clock +Synchronization on AWS](https://www.yugabyte.com/blog/aws-clock-synchronization/). *yugabyte.com*, December 2024. +Archived at [perma.cc/UYM6-RNBS](https://perma.cc/UYM6-RNBS) + +[[71](/en/ch9#Kimball2022-marker)] Spencer Kimball. +[Living Without Atomic +Clocks: Where CockroachDB and Spanner diverge](https://www.cockroachlabs.com/blog/living-without-atomic-clocks/). *cockroachlabs.com*, January 2022. +Archived at [perma.cc/AWZ7-RXFT](https://perma.cc/AWZ7-RXFT) + +[[72](/en/ch9#Demirbas2025-marker)] Murat Demirbas. +[Use of +Time in Distributed Databases (part 4): Synchronized clocks in production databases](https://muratbuffalo.blogspot.com/2025/01/use-of-time-in-distributed-databases.html). +*muratbuffalo.blogspot.com*, January 2025. +Archived at [perma.cc/9WNX-Q9U3](https://perma.cc/9WNX-Q9U3) + +[[73](/en/ch9#Gray1989-marker)] Cary G. Gray and David R. 
Cheriton. +[Leases: An Efficient +Fault-Tolerant Mechanism for Distributed File Cache Consistency](https://courses.cs.duke.edu/spring11/cps210/papers/p202-gray.pdf). At +*12th ACM Symposium on Operating Systems Principles* (SOSP), December 1989. +[doi:10.1145/74850.74870](https://doi.org/10.1145/74850.74870) + +[[74](/en/ch9#Sturman2022-marker)] Daniel Sturman, Scott Delap, Max Ross, et al. +[Roblox +Return to Service](https://corp.roblox.com/newsroom/2022/01/roblox-return-to-service-10-28-10-31-2021). *corp.roblox.com*, January 2022. +Archived at [perma.cc/8ALT-WAS4](https://perma.cc/8ALT-WAS4) + +[[75](/en/ch9#Lipcon2011-marker)] Todd Lipcon. +[Avoiding Full GCs +with MemStore-Local Allocation Buffers](https://www.slideshare.net/slideshow/hbase-hug-presentation/7038178). *slideshare.net*, February 2011. +Archived at + +[[76](/en/ch9#Clark2005-marker)] Christopher Clark, Keir Fraser, Steven Hand, +Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. +[Live +Migration of Virtual Machines](https://www.usenix.org/legacy/publications/library/proceedings/nsdi05/tech/full_papers/clark/clark.pdf). At *2nd USENIX Symposium on +Networked Systems Design & Implementation* (NSDI), May 2005. + +[[77](/en/ch9#Shaver2008-marker)] Mike Shaver. +[fsyncers and +Curveballs](https://web.archive.org/web/20220107141023/http%3A//shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/). *shaver.off.net*, May 2008. Archived at +[archive.org](https://web.archive.org/web/20220107141023/http%3A//shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/) + +[[78](/en/ch9#Zhuang2016-marker)] Zhenyun Zhuang and Cuong Tran. +[Eliminating +Large JVM GC Pauses Caused by Background IO Traffic](https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic). *engineering.linkedin.com*, February 2016. 
+Archived at [perma.cc/ML2M-X9XT](https://perma.cc/ML2M-X9XT) + +[[79](/en/ch9#Thompson2013-marker)] Martin Thompson. +[Java +Garbage Collection Distilled](https://mechanical-sympathy.blogspot.com/2013/07/java-garbage-collection-distilled.html). *mechanical-sympathy.blogspot.co.uk*, July 2013. +Archived at [perma.cc/DJT3-NQLQ](https://perma.cc/DJT3-NQLQ) + +[[80](/en/ch9#Terei2015-marker)] David Terei and Amit Levy. +[Blade: A Data Center Garbage Collector](https://arxiv.org/pdf/1504.02578). +arXiv:1504.02578, April 2015. + +[[81](/en/ch9#Maas2015-marker)] Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz. +[Trash Day: Coordinating Garbage Collection in +Distributed Systems](https://timharris.uk/papers/2015-hotos.pdf). At *15th USENIX Workshop on Hot Topics in Operating Systems* +(HotOS), May 2015. + +[[82](/en/ch9#Fowler2011_ch9-marker)] Martin Fowler. +[The LMAX Architecture](https://martinfowler.com/articles/lmax.html). +*martinfowler.com*, July 2011. +Archived at [perma.cc/5AV4-N6RJ](https://perma.cc/5AV4-N6RJ) + +[[83](/en/ch9#Halpern1990-marker)] Joseph Y. Halpern and Yoram Moses. +[Knowledge and common knowledge +in a distributed environment](https://groups.csail.mit.edu/tds/papers/Halpern/JACM90.pdf). *Journal of the ACM* (JACM), volume 37, issue 3, pages +549–587, July 1990. +[doi:10.1145/79147.79161](https://doi.org/10.1145/79147.79161) + +[[84](/en/ch9#Tang2022-marker)] Chuzhe Tang, Zhaoguo Wang, Xiaodong Zhang, Qianmian +Yu, Binyu Zang, Haibing Guan, and Haibo Chen. +[Ad Hoc Transactions +in Web Applications: The Good, the Bad, and the Ugly](https://ipads.se.sjtu.edu.cn/_media/publications/concerto-sigmod22.pdf). At *ACM International Conference on +Management of Data* (SIGMOD), June 2022. +[doi:10.1145/3514221.3526120](https://doi.org/10.1145/3514221.3526120) + +[[85](/en/ch9#Junqueira2013_ch9-marker)] Flavio P. Junqueira and Benjamin Reed. 
+[*ZooKeeper: Distributed +Process Coordination*](https://www.oreilly.com/library/view/zookeeper/9781449361297/). O’Reilly Media, 2013. ISBN: 978-1-449-36130-3 + +[[86](/en/ch9#Soztutar2013hdfs-marker)] Enis Söztutar. +[HBase +and HDFS: Understanding Filesystem Usage in HBase](https://www.slideshare.net/slideshow/hbase-and-hdfs-understanding-filesystem-usage/22990858). At *HBaseCon*, June 2013. +Archived at [perma.cc/4DXR-9P88](https://perma.cc/4DXR-9P88) + +[[87](/en/ch9#SUSE2025-marker)] SUSE LLC. +[SUSE +Linux Enterprise High Availability 15 SP6 Administration Guide, Section 12: Fencing and STONITH](https://documentation.suse.com/sle-ha/15-SP6/html/SLE-HA-all/cha-ha-fencing.html). +*documentation.suse.com*, March 2025. +Archived at [perma.cc/8LAR-EL9D](https://perma.cc/8LAR-EL9D) + +[[88](/en/ch9#Burrows2006_ch9-marker)] Mike Burrows. +[The Chubby Lock Service for Loosely-Coupled +Distributed Systems](https://research.google/pubs/pub27897/). At *7th USENIX Symposium on Operating System Design and +Implementation* (OSDI), November 2006. + +[[89](/en/ch9#Kingsbury2020etcd-marker)] Kyle Kingsbury. +[etcd 3.4.3](https://jepsen.io/analyses/etcd-3.4.3). *jepsen.io*, January 2020. +Archived at [perma.cc/2P3Y-MPWU](https://perma.cc/2P3Y-MPWU) + +[[90](/en/ch9#BasriKahveci2019-marker)] Ensar Basri Kahveci. +[Distributed Locks are Dead; Long +Live Distributed Locks!](https://hazelcast.com/blog/long-live-distributed-locks/) *hazelcast.com*, April 2019. +Archived at [perma.cc/7FS5-LDXE](https://perma.cc/7FS5-LDXE) + +[[91](/en/ch9#Kleppmann2016-marker)] Martin Kleppmann. +[How to do +distributed locking](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html). *martin.kleppmann.com*, February 2016. +Archived at [perma.cc/Y24W-YQ5L](https://perma.cc/Y24W-YQ5L) + +[[92](/en/ch9#Sanfilippo2016-marker)] Salvatore Sanfilippo. +[Is Redlock safe?](https://antirez.com/news/101) *antirez.com*, February 2016. 
+Archived at [perma.cc/B6GA-9Q6A](https://perma.cc/B6GA-9Q6A) + +[[93](/en/ch9#Morling2024_ch9-marker)] Gunnar Morling. +[Leader +Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024. +Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y) + +[[94](/en/ch9#Lamport1982-marker)] Leslie Lamport, Robert Shostak, and Marshall Pease. +[The +Byzantine Generals Problem](https://www.microsoft.com/en-us/research/publication/byzantine-generals-problem/). *ACM Transactions on Programming Languages and Systems* +(TOPLAS), volume 4, issue 3, pages 382–401, July 1982. +[doi:10.1145/357172.357176](https://doi.org/10.1145/357172.357176) + +[[95](/en/ch9#Gray1978-marker)] Jim N. Gray. +[Notes on Data Base +Operating Systems](https://jimgray.azurewebsites.net/papers/dbos.pdf). in *Operating Systems: An Advanced Course*, Lecture +Notes in Computer Science, volume 60, edited by R. Bayer, R. M. Graham, and G. Seegmüller, +pages 393–481, Springer-Verlag, 1978. ISBN: 978-3-540-08755-7. +Archived at [perma.cc/7S9M-2LZU](https://perma.cc/7S9M-2LZU) + +[[96](/en/ch9#Palmer2011-marker)] Brian Palmer. +[How +Complicated Was the Byzantine Empire?](https://slate.com/news-and-politics/2011/10/the-byzantine-tax-code-how-complicated-was-byzantium-anyway.html) *slate.com*, October 2011. +Archived at [perma.cc/AN7X-FL3N](https://perma.cc/AN7X-FL3N) + +[[97](/en/ch9#LamportPubs-marker)] Leslie Lamport. +[My Writings](https://lamport.azurewebsites.net/pubs/pubs.html). +*lamport.azurewebsites.net*, December 2014. +Archived at [perma.cc/5NNM-SQGR](https://perma.cc/5NNM-SQGR) + +[[98](/en/ch9#Rushby2001-marker)] John Rushby. +[Bus Architectures for +Safety-Critical Embedded Systems](https://www.csl.sri.com/papers/emsoft01/emsoft01.pdf). At *1st International Workshop on Embedded Software* +(EMSOFT), October 2001. 
+[doi:10.1007/3-540-45449-7\_22](https://doi.org/10.1007/3-540-45449-7_22) + +[[99](/en/ch9#Edge2013-marker)] Jake Edge. +[ELC: SpaceX Lessons Learned](https://lwn.net/Articles/540368/). *lwn.net*, +March 2013. Archived at [perma.cc/AYX8-QP5X](https://perma.cc/AYX8-QP5X) + +[[100](/en/ch9#Bano2019_ch9-marker)] Shehar Bano, Alberto Sonnino, Mustafa +Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis. +[SoK: Consensus in the Age of Blockchains](https://smeiklej.com/files/aft19a.pdf). At +*1st ACM Conference on Advances in Financial Technologies* (AFT), October 2019. +[doi:10.1145/3318041.3355458](https://doi.org/10.1145/3318041.3355458) + +[[101](/en/ch9#Feilden2024-marker)] Ezra Feilden, Adi Oltean, and Philip Johnston. +[Why we should train AI in space](https://www.starcloud.com/wp). +White Paper, *starcloud.com*, September 2024. +Archived at [perma.cc/7Y3S-8UB6](https://perma.cc/7Y3S-8UB6) + +[[102](/en/ch9#Mickens2013-marker)] James Mickens. +[The Saddest +Moment](https://www.usenix.org/system/files/login-logout_1305_mickens.pdf). *USENIX ;login*, May 2013. +Archived at [perma.cc/T7BZ-XCFR](https://perma.cc/T7BZ-XCFR) + +[[103](/en/ch9#Kleppmann2020-marker)] Martin Kleppmann and Heidi Howard. +[Byzantine Eventual Consistency and the Fundamental Limits +of Peer-to-Peer Databases](https://arxiv.org/abs/2012.00472). *arxiv.org*, December 2020. +[doi:10.48550/arXiv.2012.00472](https://doi.org/10.48550/arXiv.2012.00472) + +[[104](/en/ch9#Kleppmann2022-marker)] Martin Kleppmann. +[Making CRDTs Byzantine Fault +Tolerant](https://martin.kleppmann.com/papers/bft-crdt-papoc22.pdf). At *9th Workshop on Principles and Practice of Consistency for Distributed +Data* (PaPoC), April 2022. +[doi:10.1145/3517209.3524042](https://doi.org/10.1145/3517209.3524042) + +[[105](/en/ch9#Gilman2015-marker)] Evan Gilman. +[The +Discovery of Apache ZooKeeper’s Poison Packet](https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/). 
*pagerduty.com*, May 2015. +Archived at [perma.cc/RV6L-Y5CQ](https://perma.cc/RV6L-Y5CQ) + +[[106](/en/ch9#Stone2000-marker)] Jonathan Stone and Craig Partridge. +[When +the CRC and TCP Checksum Disagree](https://conferences2.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf). At *ACM Conference on Applications, +Technologies, Architectures, and Protocols for Computer Communication* (SIGCOMM), August 2000. +[doi:10.1145/347059.347561](https://doi.org/10.1145/347059.347561) + +[[107](/en/ch9#Jones2015-marker)] Evan Jones. +[How Both TCP and Ethernet +Checksums Fail](https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html). *evanjones.ca*, October 2015. +Archived at [perma.cc/9T5V-B8X5](https://perma.cc/9T5V-B8X5) + +[[108](/en/ch9#Dwork1988_ch9-marker)] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. +[Consensus in the +Presence of Partial Synchrony](https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf). *Journal of the ACM*, volume 35, issue 2, pages 288–323, +April 1988. [doi:10.1145/42282.42283](https://doi.org/10.1145/42282.42283) + +[[109](/en/ch9#Schlichting1983-marker)] Richard D. Schlichting and Fred B. Schneider. +[Fail-stop processors: an +approach to designing fault-tolerant computing systems](https://www.cs.cornell.edu/fbs/publications/Fail_Stop.pdf). *ACM Transactions on Computer +Systems* (TOCS), volume 1, issue 3, pages 222–238, August 1983. +[doi:10.1145/357369.357371](https://doi.org/10.1145/357369.357371) + +[[110](/en/ch9#Do2013-marker)] Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, +Tiratat Patana-anake, and Haryadi S. Gunawi. +[Limplock: Understanding the Impact +of Limpware on Scale-out Cloud Systems](https://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf). At *4th ACM Symposium on Cloud Computing* +(SoCC), October 2013. +[doi:10.1145/2523616.2523627](https://doi.org/10.1145/2523616.2523627) + +[[111](/en/ch9#Snyder2019-marker)] Josh Snyder and Joseph Lynch. 
+[Garbage collecting +unhealthy JVMs, a proactive approach](https://netflixtechblog.medium.com/introducing-jvmquake-ec944c60ba70). Netflix Technology Blog, +*netflixtechblog.medium.com*, November 2019. +Archived at [perma.cc/8BTA-N3YB](https://perma.cc/8BTA-N3YB) + +[[112](/en/ch9#Gunawi2018_ch9-marker)] Haryadi S. Gunawi, Riza O. Suminto, Russell +Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah +Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree +Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. +[Fail-Slow at +Scale: Evidence of Hardware Performance Faults in Large Production Systems](https://www.usenix.org/system/files/conference/fast18/fast18-gunawi.pdf). +At *16th USENIX Conference on File and Storage Technologies*, February 2018. + +[[113](/en/ch9#Huang2017_ch9-marker)] Peng Huang, Chuanxiong Guo, Lidong Zhou, +Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. +[Gray +Failure: The Achilles’ Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in +Operating Systems* (HotOS), May 2017. +[doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005) + +[[114](/en/ch9#Lou2020-marker)] Chang Lou, Peng Huang, and Scott Smith. +[Understanding, Detecting and +Localizing Partial Failures in Large System Software](https://www.usenix.org/conference/nsdi20/presentation/lou). At *17th USENIX Symposium on +Networked Systems Design and Implementation* (NSDI), February 2020. + +[[115](/en/ch9#Bailis2013_ch9-marker)] Peter Bailis and Ali Ghodsi. +[Eventual Consistency Today: Limitations, +Extensions, and Beyond](https://queue.acm.org/detail.cfm?id=2462076). *ACM Queue*, volume 11, issue 3, pages 55–63, March 2013. 
+[doi:10.1145/2460276.2462076](https://doi.org/10.1145/2460276.2462076) + +[[116](/en/ch9#Alpern1985-marker)] Bowen Alpern and Fred B. Schneider. +[Defining Liveness](https://www.cs.cornell.edu/fbs/publications/DefLiveness.pdf). +*Information Processing Letters*, volume 21, issue 4, pages 181–185, October 1985. +[doi:10.1016/0020-0190(85)90056-0](https://doi.org/10.1016/0020-0190%2885%2990056-0) + +[[117](/en/ch9#Junqueira2015-marker)] Flavio P. Junqueira. +[Dude, Where’s My Metadata?](https://fpj.me/2015/05/28/dude-wheres-my-metadata/) +*fpj.me*, May 2015. +Archived at [perma.cc/D2EU-Y9S5](https://perma.cc/D2EU-Y9S5) + +[[118](/en/ch9#Sanders2016-marker)] Scott Sanders. +[January 28th Incident +Report](https://github.com/blog/2106-january-28th-incident-report). *github.com*, February 2016. +Archived at [perma.cc/5GZR-88TV](https://perma.cc/5GZR-88TV) + +[[119](/en/ch9#Kreps2013-marker)] Jay Kreps. +[A Few Notes +on Kafka and Jepsen](https://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen). *blog.empathybox.com*, September 2013. +Archived at [perma.cc/XJ5C-F583](https://perma.cc/XJ5C-F583) + +[[120](/en/ch9#Brooker2024correctness-marker)] Marc Brooker and Ankush Desai. +[Systems Correctness Practices at AWS](https://dl.acm.org/doi/pdf/10.1145/3712057). +*ACM Queue*, volume 22, issue 6, November/December 2024. +[doi:10.1145/3712057](https://doi.org/10.1145/3712057) + +[[121](/en/ch9#SatarinTesting-marker)] Andrey Satarin. +[Testing Distributed Systems: +Curated list of resources on testing distributed systems](https://asatarin.github.io/testing-distributed-systems/). *asatarin.github.io*. +Archived at [perma.cc/U5V8-XP24](https://perma.cc/U5V8-XP24) + +[[122](/en/ch9#Vanlightly2024-marker)] Jack Vanlightly. +[Verifying Kafka transactions - Diary entry 2 - Writing an initial TLA+ spec](https://jack-vanlightly.com/analyses/2024/12/3/verifying-kafka-transactions-diary-entry-2-writing-an-initial-tla-spec). +*jack-vanlightly.com*, December 2024. 
+Archived at [perma.cc/NSQ8-MQ5N](https://perma.cc/NSQ8-MQ5N) + +[[123](/en/ch9#Tang2018-marker)] Siddon Tang. +[From Chaos to Order — Tools and +Techniques for Testing TiDB, A Distributed NewSQL Database](https://www.pingcap.com/blog/chaos-practice-in-tidb/). *pingcap.com*, April 2018. +Archived at [perma.cc/5EJB-R29F](https://perma.cc/5EJB-R29F) + +[[124](/en/ch9#VanBenschoten2019-marker)] Nathan VanBenschoten. +[Parallel Commits: An atomic commit +protocol for globally distributed transactions](https://www.cockroachlabs.com/blog/parallel-commits/). *cockroachlabs.com*, November 2019. +Archived at [perma.cc/5FZ7-QK6J](https://perma.cc/5FZ7-QK6J) + +[[125](/en/ch9#Vanlightly2022-marker)] Jack Vanlightly. +[Paper: VR Revisited - State Transfer (part 3)](https://jack-vanlightly.com/analyses/2022/12/28/paper-vr-revisited-state-transfer-part-3). +*jack-vanlightly.com*, December 2022. +Archived at [perma.cc/KNK3-K6WS](https://perma.cc/KNK3-K6WS) + +[[126](/en/ch9#Wayne2024-marker)] Hillel Wayne. +[What if +the spec doesn’t match the code?](https://buttondown.com/hillelwayne/archive/what-if-the-spec-doesnt-match-the-code/) *buttondown.com*, March 2024. +Archived at [perma.cc/8HEZ-KHER](https://perma.cc/8HEZ-KHER) + +[[127](/en/ch9#Ouyang2025-marker)] Lingzhi Ouyang, Xudong Sun, Ruize Tang, Yu Huang, +Madhav Jivrajani, Xiaoxing Ma, and Tianyin Xu. +[Multi-Grained Specifications for Distributed System Model +Checking and Verification](https://arxiv.org/abs/2409.14301). At *20th European Conference on Computer Systems* (EuroSys), +March 2025. [doi:10.1145/3689031.3696069](https://doi.org/10.1145/3689031.3696069) + +[[128](/en/ch9#Izrailevsky2011-marker)] Yury Izrailevsky and Ariel Tseitlin. +[The Netflix Simian Army](https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116). +*netflixtechblog.com*, July 2011. +Archived at [perma.cc/M3NY-FJW6](https://perma.cc/M3NY-FJW6) + +[[129](/en/ch9#Kingsbury2013jepsen-marker)] Kyle Kingsbury. 
+[Jepsen: On the perils of network partitions](https://aphyr.com/posts/281-jepsen-on-the-perils-of-network-partitions). +*aphyr.com*, May 2013. +Archived at [perma.cc/W98G-6HQP](https://perma.cc/W98G-6HQP) + +[[130](/en/ch9#Kingsbury2024-marker)] Kyle Kingsbury. +[Jepsen Analyses](https://jepsen.io/analyses). *jepsen.io*, 2024. +Archived at [perma.cc/8LDN-D2T8](https://perma.cc/8LDN-D2T8) + +[[131](/en/ch9#Majumdar2017-marker)] Rupak Majumdar and Filip Niksic. +[Why is random testing effective for partition +tolerance bugs?](https://dl.acm.org/doi/pdf/10.1145/3158134) *Proceedings of the ACM on Programming Languages* (PACMPL), volume 2, +issue POPL, article no. 46, December 2017. +[doi:10.1145/3158134](https://doi.org/10.1145/3158134) + +[[132](/en/ch9#FoundationDB_ch9-marker)] FoundationDB project authors. +[Simulation and Testing](https://apple.github.io/foundationdb/testing.html). +*apple.github.io*. +Archived at [perma.cc/NQ3L-PM4C](https://perma.cc/NQ3L-PM4C) + +[[133](/en/ch9#Kladov2023-marker)] Alex Kladov. +[Simulation +Testing For Liveness](https://tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness/). *tigerbeetle.com*, July 2023. +Archived at [perma.cc/RKD4-HGCR](https://perma.cc/RKD4-HGCR) + +[[134](/en/ch9#Marques2024-marker)] Alfonso Subiotto Marqués. +[(Mostly) +Deterministic Simulation Testing in Go](https://www.polarsignals.com/blog/posts/2024/05/28/mostly-dst-in-go). *polarsignals.com*, May 2024. +Archived at [perma.cc/ULD6-TSA4](https://perma.cc/ULD6-TSA4) diff --git a/content/en/part-i.md b/content/en/part-i.md index 304ff3e..2a50b5a 100644 --- a/content/en/part-i.md +++ b/content/en/part-i.md @@ -4,22 +4,29 @@ weight: 100 breadcrumbs: false --- +> [!IMPORTANT] +> This page is from the 1st edition + The first four chapters go through the fundamental ideas that apply to all data sys‐ tems, whether running on a single machine or distributed across a cluster of machines: -1. 
[Chapter 1](/en/ch1) introduces the terminology and approach that we’re going to use throughout this book. It examines what we actually mean by words like *reliabil‐ ity*, *scalability*, and *maintainability*, and how we can try to achieve these goals. +1. [Chapter 1](/en/ch1) introduces the tradeoffs that data systems must make, such as the balance between consistency and availability, and how these tradeoffs affect system design. -2. [Chapter 2](/en/ch2) compares several different data models and query languages—the most visible distinguishing factor between databases from a developer’s point of view. We will see how different models are appropriate to different situations. +2. [Chapter 2](/en/ch2) discusses the nonfunctional requirements of data systems, such as availability, consistency, and latency, and how we can try to achieve these goals. -3. [Chapter 3](/en/ch4) turns to the internals of storage engines and looks at how databases lay out data on disk. Different storage engines are optimized for different workloads, and choosing the right one can have a huge effect on performance. +3. [Chapter 3](/en/ch3) compares several different data models and query languages—the most visible distinguishing factor between databases from a developer’s point of view. We will see how different models are appropriate to different situations. -4. [Chapter 4](/en/ch4) compares various formats for data encoding (serialization) and espe‐ cially examines how they fare in an environment where application requirements change and schemas need to adapt over time. +4. [Chapter 4](/en/ch4) turns to the internals of storage engines and looks at how databases lay out data on disk. Different storage engines are optimized for different workloads, and choosing the right one can have a huge effect on performance. + +5. 
[Chapter 5](/en/ch5) compares various formats for data encoding (serialization) and especially examines how they fare in an environment where application requirements change and schemas need to adapt over time. Later, [Part II](/en/part-ii) will turn to the particular issues of distributed data systems. ## Index -- [1. Reliable, Scalable, and Maintainable Applications](/en/ch1) -- [2. Data Models and Query Languages](/en/ch2) -- [3. Storage and Retrieval](/en/ch3) -- [4. Encoding and Evolution](/en/ch4) \ No newline at end of file +- [1. Tradeoffs in Data Systems Architecture](/en/ch1) +- [2. Defining NonFunctional Requirements](/en/ch2) +- [3. Data Models and Query Languages](/en/ch3) +- [4. Storage and Retrieval](/en/ch4) +- [5. Encoding and Evolution](/en/ch5) + diff --git a/content/en/part-ii.md b/content/en/part-ii.md index 2697334..81514ea 100644 --- a/content/en/part-ii.md +++ b/content/en/part-ii.md @@ -4,15 +4,19 @@ weight: 200 breadcrumbs: false --- +> [!IMPORTANT] +> This page is from the 1st edition + > *For a successful technology, reality must take precedence over public relations, for nature cannot be fooled.* > > —Richard Feynman, *Rogers Commission Report* (1986) ------- -In [Part I](/en/part-i) of this book, we discussed aspects of data systems that apply when data is stored on a single machine. Now, in [Part II](/en/part-ii), we move up a level and ask: what hap‐ pens if multiple machines are involved in storage and retrieval of data? +In [Part I](/en/part-i) of this book, we discussed aspects of data systems that apply when data is stored on a single machine. Now, in [Part II](/en/part-ii), +we move up a level and ask: what happens if multiple machines are involved in storage and retrieval of data? 
-There are various reasons why you might want to distribute a database across multi‐ ple machines: +There are various reasons why you might want to distribute a database across multiple machines: ***Scalability*** @@ -20,25 +24,31 @@ If your data volume, read load, or write load grows bigger than a single machine ***Fault tolerance/high availability*** -If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, you can use multi‐ ple machines to give you redundancy. When one fails, another one can take over. +If your application needs to continue working even if one machine (or several machines, or the network, or an entire datacenter) goes down, +you can use multiple machines to give you redundancy. When one fails, another one can take over. ***Latency*** -If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geo‐ graphically close to them. That avoids the users having to wait for network pack‐ ets to travel halfway around the world. +If you have users around the world, you might want to have servers at various locations worldwide so that each user can be served from a datacenter that is geographically close to them. +That avoids the users having to wait for network packets to travel halfway around the world. ## Scaling to Higher Load -If all you need is to scale to higher load, the simplest approach is to buy a more pow‐ erful machine (sometimes called *vertical scaling* or *scaling up*). Many CPUs, many RAM chips, and many disks can be joined together under one operating system, and a fast interconnect allows any CPU to access any part of the memory or disk. 
In this kind of *shared-memory architecture*, all the components can be treated as a single machine [1].[^ii] +If all you need is to scale to higher load, the simplest approach is to buy a more powerful machine (sometimes called *vertical scaling* or *scaling up*). Many CPUs, many RAM chips, and many disks can be joined together under one operating system, +and a fast interconnect allows any CPU to access any part of the memory or disk. In this kind of *shared-memory architecture*, all the components can be treated as a single machine [1].[^i] -[^i]: In a large machine, although any CPU can access any part of memory, some banks of memory are closer to one CPU than to others (this is called nonuniform memory access, or NUMA [1]). To make efficient use of this architecture, the processing needs to be broken down so that each CPU mostly accesses memory that is nearby—which means that partitioning is still required, even when ostensibly running on one machine. +[^i]: In a large machine, although any CPU can access any part of memory, some banks of memory are closer to one CPU than to others (this is called nonuniform memory access, or NUMA [1]). +To make efficient use of this architecture, the processing needs to be broken down so that each CPU mostly accesses memory that is nearby—which means that partitioning is still required, even when ostensibly running on one machine. -The problem with a shared-memory approach is that the cost grows faster than line‐ arly: a machine with twice as many CPUs, twice as much RAM, and twice as much disk capacity as another typically costs significantly more than twice as much. And due to bottlenecks, a machine twice the size cannot necessarily handle twice the load. +The problem with a shared-memory approach is that the cost grows faster than linearly: a machine with twice as many CPUs, twice as much RAM, and twice as much disk capacity as another typically costs significantly more than twice as much. 
+And due to bottlenecks, a machine twice the size cannot necessarily handle twice the load. -A shared-memory architecture may offer limited fault tolerance—high-end machines have hot-swappable components (you can replace disks, memory modules, and even CPUs without shutting down the machines)—but it is definitely limited to a single geographic location. +A shared-memory architecture may offer limited fault tolerance—high-end machines have hot-swappable components (you can replace disks, memory modules, and even CPUs without shutting down the machines) — but it is definitely limited to a single geographic location. -Another approach is the *shared-disk architecture*, which uses several machines with independent CPUs and RAM, but stores data on an array of disks that is shared between the machines, which are connected via a fast network.[^ii] This architecture is used for some data warehousing workloads, but contention and the overhead of lock‐ ing limit the scalability of the shared-disk approach [2]. +Another approach is the *shared-disk architecture*, which uses several machines with independent CPUs and RAM, but stores data on an array of disks that is shared between the machines, which are connected via a fast network.[^ii] +This architecture is used for some data warehousing workloads, but contention and the overhead of locking limit the scalability of the shared-disk approach [2]. [^ii]: Network Attached Storage (NAS) or Storage Area Network (SAN). @@ -46,13 +56,20 @@ Another approach is the *shared-disk architecture*, which uses several machines ### Shared-Nothing Architectures -By contrast, *shared-nothing architectures* [3] (sometimes called *horizontal scaling* or *scaling out*) have gained a lot of popularity. In this approach, each machine or virtual machine running the database software is called a *node*. Each node uses its CPUs, RAM, and disks independently. Any coordination between nodes is done at the soft‐ ware level, using a conventional network. 
+By contrast, *shared-nothing architectures* [3] (sometimes called *horizontal scaling* or *scaling out*) have gained a lot of popularity. +In this approach, each machine or virtual machine running the database software is called a *node*. +Each node uses its CPUs, RAM, and disks independently. Any coordination between nodes is done at the software level, using a conventional network. -No special hardware is required by a shared-nothing system, so you can use whatever machines have the best price/performance ratio. You can potentially distribute data across multiple geographic regions, and thus reduce latency for users and potentially be able to survive the loss of an entire datacenter. With cloud deployments of virtual machines, you don’t need to be operating at Google scale: even for small companies, a multi-region distributed architecture is now feasible. +No special hardware is required by a shared-nothing system, so you can use whatever machines have the best price/performance ratio. +You can potentially distribute data across multiple geographic regions, and thus reduce latency for users and potentially be able to survive the loss of an entire datacenter. +With cloud deployments of virtual machines, you don’t need to be operating at Google scale: even for small companies, a multi-region distributed architecture is now feasible. -In this part of the book, we focus on shared-nothing architectures—not because they are necessarily the best choice for every use case, but rather because they require the most caution from you, the application developer. If your data is distributed across multiple nodes, you need to be aware of the constraints and trade-offs that occur in such a distributed system—the database cannot magically hide these from you. +In this part of the book, we focus on shared-nothing architectures—not because they are necessarily the best choice for every use case, but rather because they require the most caution from you, the application developer. 
+If your data is distributed across multiple nodes, you need to be aware of the constraints and trade-offs that occur in such a distributed system—the database cannot magically hide these from you. -While a distributed shared-nothing architecture has many advantages, it usually also incurs additional complexity for applications and sometimes limits the expressive‐ ness of the data models you can use. In some cases, a simple single-threaded program can perform significantly better than a cluster with over 100 CPU cores [4]. On the other hand, shared-nothing systems can be very powerful. The next few chapters go into details on the issues that arise when data is distributed. +While a distributed shared-nothing architecture has many advantages, it usually also incurs additional complexity for applications and sometimes limits the expressiveness of the data models you can use. +In some cases, a simple single-threaded program can perform significantly better than a cluster with over 100 CPU cores [4]. On the other hand, shared-nothing systems can be very powerful. +The next few chapters go into details on the issues that arise when data is distributed. ### Replication Versus Partitioning @@ -60,15 +77,18 @@ There are two common ways data is distributed across multiple nodes: ***Replication*** -Keeping a copy of the same data on several different nodes, potentially in differ‐ ent locations. Replication provides redundancy: if some nodes are unavailable, the data can still be served from the remaining nodes. Replication can also help improve performance. We discuss replication in [Chapter 5](/en/ch5). +Keeping a copy of the same data on several different nodes, potentially in different locations. +Replication provides redundancy: if some nodes are unavailable, the data can still be served from the remaining nodes. +Replication can also help improve performance. We discuss replication in [Chapter 6](/en/ch6). 
***Partitioning*** - Splitting a big database into smaller subsets called *partitions* so that different par‐ titions can be assigned to different nodes (also known as *sharding*). We discuss partitioning in [Chapter 6](/en/ch6). + Splitting a big database into smaller subsets called *partitions* so that different partitions can be assigned to different nodes (also known as *sharding*). + We discuss partitioning in [Chapter 7](/en/ch7). These are separate mechanisms, but they often go hand in hand, as illustrated in Figure II-1. -![](/img/figii-1.png) +![](/fig/ddia_08.png) > *Figure II-1. A database split into two partitions, with two replicas per partition.* @@ -79,11 +99,11 @@ Later, in Part III of this book, we will discuss how you can take several (poten ## Index -- [5. Replication](/en/ch5) -- [6. Partitioning](/en/ch6) -- [7. Transactions](/en/ch7) -- [8. The Trouble with Distributed Systems](/en/ch8) -- [9. Consistency and Consensus](/en/ch9) +- [6. Replication](/en/ch6) +- [7. Partitioning](/en/ch7) +- [8. Transactions](/en/ch8) +- [9. The Trouble with Distributed Systems](/en/ch9) +- [10. Consistency and Consensus](/en/ch10) ## References diff --git a/content/en/part-iii.md b/content/en/part-iii.md index f719657..79220d4 100644 --- a/content/en/part-iii.md +++ b/content/en/part-iii.md @@ -4,11 +4,20 @@ weight: 300 breadcrumbs: false --- -In Parts [I](/en/part-i) and [II](/en/part-ii) of this book, we assembled from the ground up all the major consid‐ erations that go into a distributed database, from the layout of data on disk all the way to the limits of distributed consistency in the presence of faults. However, this discussion assumed that there was only one database in the application. +> [!IMPORTANT] +> This page is from the 1st edition -In reality, data systems are often more complex. 
In a large application you often need to be able to access and process data in many different ways, and there is no one data‐ base that can satisfy all those different needs simultaneously. Applications thus com‐ monly use a combination of several different datastores, indexes, caches, analytics systems, etc. and implement mechanisms for moving data from one store to another. +In Parts [I](/en/part-i) and [II](/en/part-ii) of this book, we assembled from the ground up all the major considerations that go into a distributed database, +from the layout of data on disk all the way to the limits of distributed consistency in the presence of faults. However, this discussion assumed that there was only one database in the application. -In this final part of the book, we will examine the issues around integrating multiple different data systems, potentially with different data models and optimized for dif‐ ferent access patterns, into one coherent application architecture. This aspect of system-building is often overlooked by vendors who claim that their product can sat‐ isfy all your needs. In reality, integrating disparate systems is one of the most impor‐ tant things that needs to be done in a nontrivial application. +In reality, data systems are often more complex. In a large application you often need to be able to access and process data in many different ways, +and there is no one database that can satisfy all those different needs simultaneously. Applications thus commonly use a combination of several different datastores, +indexes, caches, analytics systems, etc. and implement mechanisms for moving data from one store to another. + +In this final part of the book, we will examine the issues around integrating multiple different data systems, +potentially with different data models and optimized for different access patterns, into one coherent application architecture. 
+This aspect of system-building is often overlooked by vendors who claim that their product can satisfy all your needs. +In reality, integrating disparate systems is one of the most important things that needs to be done in a nontrivial application. ## Systems of Record and Derived Data @@ -18,31 +27,45 @@ On a high level, systems that store and process data can be grouped into two bro ***Systems of record*** -A system of record, also known as *source of truth*, holds the authoritative version of your data. When new data comes in, e.g., as user input, it is first written here. Each fact is represented exactly once (the representation is typically *normalized*). If there is any discrepancy between another system and the system of record, then the value in the system of record is (by definition) the correct one. +A system of record, also known as *source of truth*, holds the authoritative version of your data. +When new data comes in, e.g., as user input, it is first written here. +Each fact is represented exactly once (the representation is typically *normalized*). +If there is any discrepancy between another system and the system of record, +then the value in the system of record is (by definition) the correct one. ***Derived data systems*** -Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way. If you lose derived data, you can recreate it from the original source. A classic example is a cache: data can be served from the cache if present, but if the cache doesn’t contain what you need, you can fall back to the underlying database. Denormalized values, indexes, and materialized views also fall into this category. In recommendation systems, predictive summary data is often derived from usage logs. +Data in a derived system is the result of taking some existing data from another system and transforming or processing it in some way. 
+If you lose derived data, you can recreate it from the original source. A classic example is a cache: data can be served from the cache if present, +but if the cache doesn’t contain what you need, you can fall back to the underlying database. Denormalized values, indexes, +and materialized views also fall into this category. In recommendation systems, predictive summary data is often derived from usage logs. +Technically speaking, derived data is *redundant*, in the sense that it duplicates existing information. +However, it is often essential for getting good performance on read queries. It is commonly *denormalized*. +You can derive several different datasets from a single source, enabling you to look at the data from different “points of view.” +Not all systems make a clear distinction between systems of record and derived data in their architecture, +but it’s a very helpful distinction to make, because it clarifies the dataflow through your system: +it makes explicit which parts of the system have which inputs and which outputs, and how they depend on each other. -Technically speaking, derived data is *redundant*, in the sense that it duplicates exist‐ ing information. However, it is often essential for getting good performance on read queries. It is commonly *denormalized*. You can derive several different datasets from a single source, enabling you to look at the data from different “points of view.” +Most databases, storage engines, and query languages are not inherently either a system of record or a derived system. +A database is just a tool: how you use it is up to you. +The distinction between system of record and derived data system depends not on the tool, but on how you use it in your application. 
-Not all systems make a clear distinction between systems of record and derived data in their architecture, but it’s a very helpful distinction to make, because it clarifies the dataflow through your system: it makes explicit which parts of the system have which inputs and which outputs, and how they depend on each other. - -Most databases, storage engines, and query languages are not inherently either a sys‐ tem of record or a derived system. A database is just a tool: how you use it is up to you. The distinction between system of record and derived data system depends not on the tool, but on how you use it in your application. - -By being clear about which data is derived from which other data, you can bring clarity to an otherwise confusing system architecture. This point will be a running theme throughout this part of the book. +By being clear about which data is derived from which other data, you can bring clarity to an otherwise confusing system architecture. +This point will be a running theme throughout this part of the book. ## Overview of Chapters -We will start in [Chapter 10](/en/ch10) by examining batch-oriented dataflow systems such as MapReduce, and see how they give us good tools and principles for building large- scale data systems. In [Chapter 11](/en/ch11) we will take those ideas and apply them to data streams, which allow us to do the same kinds of things with lower delays. [Chapter 12](/en/ch12) concludes the book by exploring ideas about how we might use these tools to build reliable, scalable, and maintainable applications in the future. +We will start in [Chapter 11](/en/ch11) by examining batch-oriented dataflow systems such as MapReduce, and see how they give us good tools and principles for building large-scale data systems. +In [Chapter 12](/en/ch12) we will take those ideas and apply them to data streams, which allow us to do the same kinds of things with lower delays.
+[Chapter 13](/en/ch13) concludes the book by exploring ideas about how we might use these tools to build reliable, scalable, and maintainable applications in the future. ## Index -- [10. Batch Processing](/en/ch10) -- [11. Stream Processing](/en/ch11) -- [12. The Future of Data Systems](/en/ch12) \ No newline at end of file +- [11. Batch Processing](/en/ch11) (WIP) +- [12. Stream Processing](/en/ch12) (WIP) +- [13. Doing the Right Thing](/en/ch13) (WIP) diff --git a/content/en/preface.md b/content/en/preface.md index e3d7106..8b9d90e 100644 --- a/content/en/preface.md +++ b/content/en/preface.md @@ -4,6 +4,9 @@ weight: 50 breadcrumbs: false --- +> [!IMPORTANT] +> This page is from the 1st edition + If you have worked in software engineering in recent years, especially in server-side and backend systems, you have probably been bombarded with a plethora of buzz‐ words relating to storage and processing of data. NoSQL! Big Data! Web-scale! Sharding! Eventual consistency! ACID! CAP theorem! Cloud services! MapReduce! Real-time! In the last decade we have seen many interesting developments in databases, in dis‐ tributed systems, and in the ways we build applications on top of them. There are various driving forces for these developments: diff --git a/content/en/toc.md b/content/en/toc.md index bbff663..c81f983 100644 --- a/content/en/toc.md +++ b/content/en/toc.md @@ -5,24 +5,35 @@ weight: 10 breadcrumbs: false --- -![](/img/title.png) -* [Preface](/en/preface) -* [Part I: Foundations of Data Systems](/en/part-i) - - [1. Reliable, Scalable, and Maintainable Applications](/en/ch1) - - [2. Data Models and Query Languages](/en/ch2) - - [3. Storage and Retrieval](/en/ch3) - - [4. Encoding and Evolution](/en/ch4) -* [Part II: Distributed Data](/en/part-ii) - - [5. Replication](/en/ch5) - - [6. Partitioning](/en/ch6) - - [7. Transactions](/en/ch7) - - [8. The Trouble with Distributed Systems](/en/ch8) - - [9. 
Consistency and Consensus](/en/ch9) -* [Part III: Derived Data](/en/part-iii) - - [10. Batch Processing](/en/ch10) - - [11. Stream Processing](/en/ch11) - - [12. The Future of Data Systems](/en/ch12) -* [Glossary](/en/glossary) -* [Colophon](/en/colophon) + +## Table of Contents + +### [Preface](/en/preface) + +### [Part I: Foundations of Data Systems](/en/part-i) +- [1. Tradeoffs in Data Systems Architecture](/en/ch1) +- [2. Defining NonFunctional Requirements](/en/ch2) +- [3. Data Models and Query Languages](/en/ch3) +- [4. Storage and Retrieval](/en/ch4) +- [5. Encoding and Evolution](/en/ch5) + +### [Part II: Distributed Data](/en/part-ii) +- [6. Replication](/en/ch6) +- [7. Partitioning](/en/ch7) +- [8. Transactions](/en/ch8) +- [9. The Trouble with Distributed Systems](/en/ch9) +- [10. Consistency and Consensus](/en/ch10) + +### [Part III: Derived Data](/en/part-iii) +- [11. Batch Processing](/en/ch11) (WIP) +- [12. Stream Processing](/en/ch12) (WIP) +- [13. Doing the Right Thing](/en/ch13) (WIP) + +### [Glossary](/en/glossary) + +### [Colophon](/en/colophon) + + +![](/title.jpg) \ No newline at end of file