diff --git a/content/en/ch1.md b/content/en/ch1.md index a8c0910..18f5148 100644 --- a/content/en/ch1.md +++ b/content/en/ch1.md @@ -23,7 +23,7 @@ more complex, it is no longer sufficient to store everything in one system, but necessary to combine multiple storage or processing systems that provide different capabilities. We call an application *data-intensive* if data management is one of the primary challenges in -developing the application [[1](/en/ch1#Kouzes2009)]. +developing the application [^1]. While in *compute-intensive* systems the challenge is parallelizing some very large computation, in data-intensive applications we usually worry more about things like storing and processing large data volumes, managing changes to data, ensuring consistency in the face of failures and @@ -86,7 +86,7 @@ for web applications, the client-side code (which runs in a web browser) is call and the server-side code that handles user requests is known as the *backend*. Mobile apps are similar to frontends in that they provide user interfaces, which often communicate over the Internet with a server-side backend. Frontends sometimes manage data locally on the user’s device -[[2](/en/ch1#Kleppmann2019_ch1)], +[^2], but the greatest data infrastructure challenges often lie in the backend: a frontend only needs to handle one user’s data, whereas the backend manages data on behalf of *all* of the users. @@ -132,10 +132,10 @@ As we shall see in the next section, operational and analytical systems are ofte good reasons. As these systems have matured, two new specialized roles have emerged: *data engineers* and *analytics engineers*. Data engineers are the people who know how to integrate the operational and the analytical systems, and who take responsibility for the organization’s data -infrastructure more widely [[3](/en/ch1#Reis2022)]. +infrastructure more widely [^3]. 
Analytics engineers model and transform data to make it more useful for the business analysts and data scientists in an organization -[[4](/en/ch1#Machado2023)]. +[^4]. Many engineers specialize on either the operational or the analytical side. However, this book covers both operational and analytical data systems, since both play an important role in the @@ -176,7 +176,7 @@ answer analytic queries such as: The reports that result from these types of queries are important for business intelligence, helping the management decide what to do next. In order to differentiate this pattern of using databases from transaction processing, it has been called *online analytic processing* (OLAP) -[[5](/en/ch1#Codd1993)]. +[^5]. The difference between OLTP and analytics is not always clear-cut, but some typical characteristics are listed in [Table 1-1](/en/ch1#tab_oltp_vs_olap). @@ -211,7 +211,7 @@ There is also a type of systems that is designed for analytical workloads (queri over many records) but that are embedded into user-facing products. This category is known as *product analytics* or *real-time analytics*, and systems designed for this type of use include Pinot, Druid, and ClickHouse -[[6](/en/ch1#Soman2023)]. +[^6]. ## Data Warehousing @@ -242,7 +242,7 @@ systems, for several reasons: A *data warehouse*, by contrast, is a separate database that analysts can query to their hearts’ content, without affecting OLTP operations -[[7](/en/ch1#Chaudhuri1997)]. +[^7]. As we shall see in [Chapter 4](/en/ch4#ch_storage), data warehouses often store data in a way that is very different from OLTP databases, in order to optimize for the types of queries that are common in analytics. @@ -267,8 +267,7 @@ specialist data connector services such as Fivetran, Singer, or AirByte. 
Some database systems offer *hybrid transactional/analytic processing* (HTAP), which aims to enable OLTP and analytics in a single system without requiring ETL from one system into another -[[8](/en/ch1#Ozcan2017), -[9](/en/ch1#Prout2022_ch1)]. +[^8] [^9]. However, many HTAP systems internally consist of an OLTP system coupled with a separate analytical system, hidden behind a common interface—so the distinction between the two remains important for understanding how these systems work. @@ -283,13 +282,13 @@ data from several operational systems in a single query. HTAP therefore does not replace data warehouses. Rather, it is useful in scenarios where the same application needs to both perform analytics queries that scan a large number of rows, and also read and update individual records with low latency. Fraud detection can involve such workloads, for -example [[10](/en/ch1#Zhang2024)]. +example [^10]. The separation between operational and analytical systems is part of a wider trend: as workloads have become more demanding, systems have become more specialized and optimized for particular workloads. General-purpose systems can handle small data volumes comfortably, but the greater the scale, the more specialized systems tend to become -[[11](/en/ch1#Stonebraker2005fitsall)]. +[^11]. ### From data warehouse to data lake @@ -308,14 +307,11 @@ needs of data scientists, who might need to perform tasks such as: they mention). Similarly, they might need to extract structured information from photos using computer vision techniques. 
-Although there have been efforts to add machine learning operators to a SQL data model -[[12](/en/ch1#Cohen2009)] -and to build efficient machine learning systems on top of a relational foundation -[[13](/en/ch1#Olteanu2020)], +Although there have been efforts to add machine learning operators to a SQL data model [^12] +and to build efficient machine learning systems on top of a relational foundation [^13], many data scientists prefer not to work in a relational database such as a data warehouse. Instead, many prefer to use Python data analysis libraries such as pandas and scikit-learn, statistical -analysis languages such as R, and distributed analytics frameworks such as Spark -[[14](/en/ch1#Bornstein2020)]. +analysis languages such as R, and distributed analytics frameworks such as Spark [^14]. We discuss these further in [“Dataframes, Matrices, and Arrays”](/en/ch3#sec_datamodels_dataframes). Consequently, organizations face a need to make data available in a form that is suitable for use by @@ -325,7 +321,7 @@ difference from a data warehouse is that a data lake simply contains files, with particular file format or data model. Files in a data lake might be collections of database records, encoded using a file format such as Avro or Parquet (see [Chapter 5](/en/ch5#ch_encoding)), but they can equally well contain text, images, videos, sensor readings, sparse matrices, feature vectors, genome sequences, -or any other kind of data [[15](/en/ch1#Fowler2015)]. +or any other kind of data [^15]. Besides being more flexible, this is also often cheaper than relational data storage, since the data lake can use commoditized file storage such as object stores (see [“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native)). 
@@ -334,14 +330,13 @@ an intermediate stop on the path from the operational systems to the data wareho contains data in a “raw” form produced by the operational systems, without the transformation into a relational data warehouse schema. This approach has the advantage that each consumer of the data can transform the raw data into a form that best suits their needs. It has been dubbed the *sushi -principle*: “raw data is better” [[16](/en/ch1#Johnson2015)]. +principle*: “raw data is better” [^16]. Besides loading data from a data lake into a separate data warehouse, it is also possible to run typical data warehousing workloads (SQL queries and business analytics) directly on the files in the data lake, alongside data science/machine learning workloads. This architecture is known as a *data lakehouse*, and it requires a query execution engine and a metadata (e.g., schema management) layer -that extend the data lake’s file storage -[[17](/en/ch1#Armbrust2021)]. +that extend the data lake’s file storage [^17]. Apache Hive, Spark SQL, Presto, and Trino are examples of this approach. @@ -349,7 +344,7 @@ Apache Hive, Spark SQL, Presto, and Trino are examples of this approach. As analytics practices have matured, organizations have been increasingly paying attention to the management and operations of analytics systems and data pipelines, as captured for example in the -DataOps manifesto [[18](/en/ch1#DataOps)]. +DataOps manifesto [^18]. Part of this are issues of governance, privacy, and compliance with regulation such as GDPR and CCPA, which we discuss in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance) and [Link to Come]. @@ -361,11 +356,9 @@ application and how time-sensitive it is, a stream processing approach can be va to identify and block potentially fraudulent or abusive activity. 
In some cases the outputs of analytics systems are made available to operational systems (a process -sometimes known as *reverse ETL* [[19](/en/ch1#Manohar2021)]). For example, a -machine-learning model that was trained on data in an analytics system may be deployed to +sometimes known as *reverse ETL* [^19]). For example, a machine-learning model that was trained on data in an analytics system may be deployed to production, so that it can generate recommendations for end-users, such as “people who bought X also -bought Y”. Such deployed outputs of analytics systems are also known as *data products* -[[20](/en/ch1#ORegan2018)]. +bought Y”. Such deployed outputs of analytics systems are also known as *data products* [^20]. Machine learning models can be deployed to operational systems using specialized tools such as TFX, Kubeflow, or MLflow. @@ -425,7 +418,7 @@ in-house, or should it be outsourced? Should you build or should you buy? Ultimately, this is a question about business priorities. The received management wisdom is that things that are a core competency or a competitive advantage of your organization should be done in-house, whereas things that are non-core, routine, or commonplace should be left to a vendor -[[21](/en/ch1#Fournier2021)]. +[^21]. To give an extreme example, most companies do not generate their own electricity (unless they are an energy company, and leaving aside emergency backup power), since it is cheaper to buy electricity from the grid. @@ -464,8 +457,7 @@ Whether a cloud service is actually cheaper and easier than self-hosting depends skills and the workload on your systems. If you already have experience setting up and operating the systems you need, and if your load is quite predictable (i.e., the number of machines you need does not fluctuate wildly), then it’s often cheaper to buy your own machines and run the software on them -yourself [[22](/en/ch1#HeinemeierHansson2022), -[23](/en/ch1#Badizadegan2022)]. +yourself [^22] [^23]. 
On the other hand, if you need a system that you don’t already know how to deploy and operate, then adopting a cloud service is often easier and quicker than learning to manage the system yourself. If @@ -508,7 +500,7 @@ The biggest downside of a cloud service is that you have no control over it: * Moreover, if the service shuts down or becomes unacceptably expensive, or if the vendor decides to change their product in a way you don’t like, you are at their mercy—continuing to run an old version of the software is usually not an option, so you will be forced to migrate to an - alternative service [[24](/en/ch1#Yegge2020)]. + alternative service [^24]. This risk is mitigated if there are alternative services that expose a compatible API, but for many cloud services there are no standard APIs, which raises the cost of switching, making vendor lock-in a problem. @@ -535,17 +527,15 @@ and indeed such managed services are now available for many popular data systems that have been designed from the ground up to be cloud-native have been shown to have several advantages: better performance on the same hardware, faster recovery from failures, being able to quickly scale computing resources to match the load, and supporting larger datasets -[[25](/en/ch1#Verbitski2017), -[26](/en/ch1#Antonopoulos2019_ch1), -[27](/en/ch1#Vuppalapati2020)]. +[^25] [^26] [^27]. [Table 1-2](/en/ch1#tab_cloud_native_dbs) lists some examples of both types of systems. Table 1-2. 
Examples of self-hosted and cloud-native database systems -| Category | Self-hosted systems | Cloud-native systems | -| --- | --- | --- | -| Operational/OLTP | MySQL, PostgreSQL, MongoDB | AWS Aurora [[25](/en/ch1#Verbitski2017)], Azure SQL DB Hyperscale [[26](/en/ch1#Antonopoulos2019_ch1)], Google Cloud Spanner | -| Analytical/OLAP | Teradata, ClickHouse, Spark | Snowflake [[27](/en/ch1#Vuppalapati2020)], Google BigQuery, Azure Synapse Analytics | +| Category | Self-hosted systems | Cloud-native systems | +|------------------|-----------------------------|-----------------------------------------------------------------------| +| Operational/OLTP | MySQL, PostgreSQL, MongoDB | AWS Aurora [^25], Azure SQL DB Hyperscale [^26], Google Cloud Spanner | +| Analytical/OLAP | Teradata, ClickHouse, Spark | Snowflake [^27], Google BigQuery, Azure Synapse Analytics | ### Layering of cloud services @@ -574,7 +564,7 @@ higher-level services. For example: lost. * Many other services are in turn built upon object storage and other cloud services: for example, Snowflake is a cloud-based analytic database (data warehouse) that relies on S3 for data storage - [[27](/en/ch1#Vuppalapati2020)], and some other services in turn + [^27], and some other services in turn build upon Snowflake. As always with abstractions in computing, there is no one right answer to what you should use. As a @@ -605,9 +595,9 @@ cloud service provided by a separate set of machines, which emulates the behavio *block device*, where each block is typically 4 KiB in size). This technology makes it possible to run traditional disk-based software in the cloud, but the block device emulation introduces overheads that can be avoided in systems that are designed from the ground up for the -cloud [[25](/en/ch1#Verbitski2017)]. It also makes the application +cloud [^25]. 
It also makes the application very sensitive to network glitches, since every I/O on the virtual block device is actually a -network call [[28](/en/ch1#NickVanWiggeren2025)]. +network call [^28]. To address this problem, cloud-native services generally avoid using virtual disks, and instead build on dedicated storage services that are optimized for particular workloads. Object storage @@ -615,28 +605,23 @@ services such as S3 are designed for long-term storage of fairly large files, ra of kilobytes to several gigabytes in size. The individual rows or values stored in a database are typically much smaller than this; cloud databases therefore typically manage smaller values in a separate service, and store larger data blocks (containing many individual values) in an object -store [[26](/en/ch1#Antonopoulos2019_ch1), -[29](/en/ch1#Breck2024)]. +store [^26] [^29]. We will see ways of doing this in [Chapter 4](/en/ch4#ch_storage). In a traditional systems architecture, the same computer is responsible for both storage (disk) and computation (CPU and RAM), but in cloud-native systems, these two responsibilities have become -somewhat separated or *disaggregated* [[9](/en/ch1#Prout2022_ch1), -[27](/en/ch1#Vuppalapati2020), -[30](/en/ch1#Shapira2023separation), -[31](/en/ch1#Murthy2022)]: +somewhat separated or *disaggregated* [^9] [^27] [^30] [^31]: for example, S3 only stores files, and if you want to analyze that data, you will have to run the analysis code somewhere outside of S3. This implies transferring the data over the network, which we will discuss further in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed). Moreover, cloud-native systems are often *multitenant*, which means that rather than having a separate machine for each customer, data and computation from several different customers are -handled on the same shared hardware by the same service -[[32](/en/ch1#Vanlightly2023serverless)]. 
+handled on the same shared hardware by the same service [^32]. + Multitenancy can enable better hardware utilization, easier scalability, and easier management by the cloud provider, but it also requires careful engineering to ensure that one customer’s activity -does not affect the performance or security of the system for other customers -[[33](/en/ch1#Jonas2019)]. +does not affect the performance or security of the system for other customers [^33]. ## Operations in the Cloud Era @@ -645,7 +630,7 @@ Traditionally, the people managing an organization’s server-side data infrastr organizations have tried to integrate the roles of software development and operations into teams with a shared responsibility for both backend services and data infrastructure; the *DevOps* philosophy has guided this trend. *Site Reliability Engineers* (SREs) are Google’s implementation of -this idea [[34](/en/ch1#Beyer2016)]. +this idea [^34]. The role of operations is to ensure services are reliably delivered to users (including configuring infrastructure and deploying applications), and to ensure a stable production environment (including @@ -669,31 +654,28 @@ processes and tools have evolved. The DevOps/SRE philosophy places greater empha * preferring ephemeral virtual machines and services over long running servers, * enabling frequent application updates, * learning from incidents, and -* preserving the organization’s knowledge about the system, even as individual people come and go - [[35](/en/ch1#Limoncelli2020)]. +* preserving the organization’s knowledge about the system, even as individual people come and go [^35]. With the rise of cloud services, there has been a bifurcation of roles: operations teams at infrastructure companies specialize in the details of providing a reliable service to a large number of customers, while the customers of the service spend as little time and effort as possible on -infrastructure [[36](/en/ch1#Majors2020)]. +infrastructure [^36]. 
Customers of cloud services still require operations, but they focus on different aspects, such as choosing the most appropriate service for a given task, integrating different services with each other, and migrating from one service to another. Even though metered billing removes the need for capacity planning in the traditional sense, it’s still important to know what resources you are using for which purpose, so that you don’t waste money on cloud resources that are not needed: -capacity planning becomes financial planning, and performance optimization becomes cost optimization -[[37](/en/ch1#Cherkasky2021)]. +capacity planning becomes financial planning, and performance optimization becomes cost optimization [^37]. + Moreover, cloud services do have resource limits or *quotas* (such as the maximum number of -processes you can run concurrently), which you need to know about and plan for before you run into -them [[38](/en/ch1#Kushchi2023)]. +processes you can run concurrently), which you need to know about and plan for before you run into them [^38]. Adopting a cloud service can be easier and quicker than running your own infrastructure, although even here there is a cost in learning how to use it, and perhaps working around its limitations. Integration between different services becomes a particular challenge as a growing number of vendors -offers an ever broader range of cloud services targeting different use cases -[[39](/en/ch1#Bernhardsson2021), -[40](/en/ch1#Stancil2021)]. +offers an ever broader range of cloud services targeting different use cases [^39][^40]. + ETL (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) is only part of the story; operational cloud services also need to be integrated with each other. At present, there is a lack of standards that would facilitate this sort of integration, so it often involves significant manual effort. 
@@ -751,8 +733,7 @@ Using specialized hardware Legal compliance : Some countries have data residency laws that require data about people in their jurisdiction to be - stored and processed geographically within that country - [[41](/en/ch1#Korolov2022)]. + stored and processed geographically within that country [^41]. The scope of these rules varies—for example, in some cases it applies only to medical or financial data, while other cases are broader. A service with users in several such jurisdictions will therefore have to distribute their data across servers in several locations. @@ -761,9 +742,7 @@ Sustainability : If you have flexibility on where and when to run your jobs, you might be able to run them in a time and place where plenty of renewable electricity is available, and avoid running them when the power grid is under strain. This can reduce your carbon emissions and allow you to take advantage - of cheap power when it is available - [[42](/en/ch1#Borenstein2025), - [43](/en/ch1#Acun2023)]. + of cheap power when it is available [^42][^43]. These reasons apply both to services that you write yourself (application code) and services consisting of off-the-shelf software (such as databases). @@ -777,39 +756,32 @@ case, we don’t know whether the service received the request, and simply retry safe. We will discuss these problems in detail in [Chapter 9](/en/ch9#ch_distributed). Although datacenter networks are fast, making a call to another service is still vastly slower than -calling a function in the same process -[[44](/en/ch1#Nath2019)]. +calling a function in the same process [^44]. + When operating on large volumes of data, rather than transferring the data from storage to a separate machine that processes it, it can be faster to bring the computation to the machine that -already has the data -[[45](/en/ch1#Hellerstein2019)]. +already has the data [^45]. 
+ More nodes are not always faster: in some cases, a simple single-threaded program on one computer -can perform significantly better than a cluster with over 100 CPU cores -[[46](/en/ch1#McSherry2015_ch1)]. +can perform significantly better than a cluster with over 100 CPU cores [^46]. Troubleshooting a distributed system is often difficult: if the system is slow to respond, how do you figure out where the problem lies? Techniques for diagnosing problems in distributed systems are -developed under the heading of *observability* [[47](/en/ch1#Sridharan2018), -[48](/en/ch1#Majors2019)], +developed under the heading of *observability* [^47] [^48], which involves collecting data about the execution of a system, and allowing it to be queried in ways that allows both high-level metrics and individual events to be analyzed. *Tracing* tools such as OpenTelemetry, Zipkin, and Jaeger allow you to track which client called which server for which -operation, and how long each call took -[[49](/en/ch1#Sigelman2010)]. +operation, and how long each call took [^49]. Databases provide various mechanisms for ensuring data consistency, as we shall see in [Chapter 6](/en/ch6#ch_replication) and [Chapter 8](/en/ch8#ch_transactions). However, when each service has its own database, maintaining consistency of data across those different services becomes the application’s problem. Distributed transactions, which we explore in [Chapter 8](/en/ch8#ch_transactions), are a possible technique for ensuring consistency, but they are rarely used in a microservices context because they run counter -to the goal of making services independent from each other, and many databases don’t support them -[[50](/en/ch1#Laigner2021)]. +to the goal of making services independent from each other, and many databases don’t support them [^50]. 
For all these reasons, if you can do something on a single machine, this is often much simpler and -cheaper compared to setting up a distributed system -[[23](/en/ch1#Badizadegan2022), -[46](/en/ch1#McSherry2015_ch1), -[51](/en/ch1#Tigani2023)]. +cheaper compared to setting up a distributed system [^23] [^46] [^51]. CPUs, memory, and disks have grown larger, faster, and more reliable. When combined with single-node databases such as DuckDB, SQLite, and KùzuDB, many workloads can now run on a single node. We will explore more on this topic in [Chapter 4](/en/ch4#ch_storage). @@ -823,8 +795,7 @@ server (handling incoming requests) and a client (making outbound requests to ot This way of building applications has traditionally been called a *service-oriented architecture* (SOA); more recently the idea has been refined into a *microservices* architecture -[[52](/en/ch1#Newman2021_ch1), -[53](/en/ch1#Richardson2014)]. +[^52] [^53]. In this architecture, a service has one well-defined purpose (for example, in the case of S3, this would be file storage); each service exposes an API that can be called by clients via the network, and each service has one team that is responsible for its maintenance. A complex application can @@ -857,16 +828,14 @@ client and server APIs; we discuss these further in [Chapter 5](/en/ch5#ch_enco Microservices are primarily a technical solution to a people problem: allowing different teams to make progress independently without having to coordinate with each other. This is valuable in a large company, but in a small company where there are not many teams, using microservices is likely to be -unnecessary overhead, and it is preferable to implement the application in the simplest way possible -[[52](/en/ch1#Newman2021_ch1)]. +unnecessary overhead, and it is preferable to implement the application in the simplest way possible [^52]. 
*Serverless*, or *function-as-a-service* (FaaS), is another approach to deploying services, in which -the management of the infrastructure is outsourced to a cloud vendor -[[33](/en/ch1#Jonas2019)]. +the management of the infrastructure is outsourced to a cloud vendor [^33]. When using virtual machines, you have to explicitly choose when to start up or shut down an instance; in contrast, with the serverless model, the cloud provider automatically allocates and frees hardware resources as needed, based on the incoming requests to your service -[[54](/en/ch1#Shahrad2020)]. Serverless deployment +[^54]. Serverless deployment shifts more of the operational burden to cloud providers and enables flexible billing by usage rather than machine instances. To offer such benefits, many serverless infrastructure providers impose a time limit on function execution, limit runtime environments, and might suffer from slow @@ -896,22 +865,20 @@ enterprise datacenter systems. Some of those differences are: * A supercomputer typically runs large batch jobs that checkpoint the state of their computation to disk from time to time. If a node fails, a common solution is to simply stop the entire cluster workload, repair the faulty node, and then restart the computation from the last checkpoint - [[55](/en/ch1#Barroso2018), - [56](/en/ch1#Fiala2012)]. + [^55] [^56]. With cloud services, it is usually not desirable to stop the entire cluster, since the services need to continually serve users with minimal interruptions. * Supercomputer nodes typically communicate through shared memory and remote direct memory access (RDMA), which support high bandwidth and low latency, but assume a high level of trust among the - users of the system [[57](/en/ch1#KornfeldSimpson2020)]. + users of the system [^57]. 
In cloud computing, the network and the machines are often shared by mutually untrusting organizations, requiring stronger security mechanisms such as resource isolation (e.g., virtual machines), encryption and authentication. * Cloud datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to provide high bisection bandwidth—a commonly used measure of a network’s overall performance - [[55](/en/ch1#Barroso2018), - [58](/en/ch1#Singh2015)]. + [^55] [^58]. Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses - [[59](/en/ch1#Lockwood2014)], + [^59], which yield better performance for HPC workloads with known communication patterns. * Cloud computing allows nodes to be distributed across multiple geographic regions, whereas supercomputers generally assume that all of their nodes are close together. @@ -940,16 +907,14 @@ of the effects that computer systems have on people and society. Social media ha individuals consume news, which influences their political opinions and hence may affect the outcome of elections. Automated systems increasingly make decisions that have profound consequences for individuals, such as deciding who should be given a loan or insurance coverage, who should be -invited to a job interview, or who should be suspected of a crime -[[60](/en/ch1#ONeil2016_ch1)]. +invited to a job interview, or who should be suspected of a crime [^60]. Everyone who works on such systems shares a responsibility for considering the ethical impact and ensuring that they comply with relevant law. It is not necessary for everybody to become an expert in law and ethics, but a basic awareness of legal and ethical principles is just as important as, say, some foundational knowledge in distributed systems. -Legal considerations are influencing the very foundations of how data systems are being designed -[[61](/en/ch1#Shastri2020)]. 
+Legal considerations are influencing the very foundations of how data systems are being designed [^61]. For example, the GDPR grants individuals the right to have their data erased on request (sometimes known as the *right to be forgotten*). However, as we shall see in this book, many data systems rely on immutable constructs such as append-only logs as part of their design; how can we ensure deletion @@ -970,7 +935,7 @@ However, it is worth remembering that the costs of storage are not just the bill S3 or another service: the cost-benefit calculation should also take into account the risks of liability and reputational damage if the data were to be leaked or compromised by adversaries, and the risk of legal costs and fines if the storage and processing of the data is found not to be -compliant with the law [[51](/en/ch1#Tigani2023)]. +compliant with the law [^51]. Governments or police forces might also compel companies to hand over data. When there is a risk that the data may reveal criminalized behaviors (for example, homosexuality in several Middle @@ -982,12 +947,10 @@ indicate approximate location). Once all the risks are taken into account, it might be reasonable to decide that some data is simply not worth storing, and that it should therefore be deleted. This principle of *data minimization* (sometimes known by the German term *Datensparsamkeit*) runs counter to the “big data” philosophy of -storing lots of data speculatively in case it turns out to be useful in the future -[[62](/en/ch1#Datensparsamkeit)]. +storing lots of data speculatively in case it turns out to be useful in the future [^62]. But it fits with the GDPR, which mandates that personal data may only be collected for a specified, explicit purpose, that this data may not later be used for any other purpose, and that the data must -not be kept for longer than necessary for the purposes for which it was collected -[[63](/en/ch1#GDPR)]. 
+not be kept for longer than necessary for the purposes for which it was collected [^63]. Businesses have also taken notice of privacy and safety concerns. Credit card companies require payment processing businesses to adhere to strict payment card industry (PCI) standards. Processors @@ -1033,346 +996,71 @@ data is being processed—an aspect that many engineers are prone to ignoring. H requirements into technical implementations is not yet well understood, but it’s important to keep this question in mind as we move through the rest of this book. -##### Footnotes - -##### References - -[[1](/en/ch1#Kouzes2009-marker)] Richard T. Kouzes, -Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. -[The -Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, -January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26) - -[[2](/en/ch1#Kleppmann2019_ch1-marker)] Martin Kleppmann, Adam Wiggins, Peter van -Hardenberg, and Mark McGranaghan. [Local-first -software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International -Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), -October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737) - -[[3](/en/ch1#Reis2022-marker)] Joe Reis and Matt Housley. -[*Fundamentals -of Data Engineering*](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/). O’Reilly Media, 2022. ISBN: 9781098108304 - -[[4](/en/ch1#Machado2023-marker)] Rui Pedro Machado and Helder Russa. -[*Analytics -Engineering with SQL and dbt*](https://www.oreilly.com/library/view/analytics-engineering-with/9781098142377/). O’Reilly Media, 2023. ISBN: 9781098142384 - -[[5](/en/ch1#Codd1993-marker)] Edgar F. Codd, S. B. Codd, and C. T. Salley. 
-[Providing -OLAP to User-Analysts: An IT Mandate](https://www.estgv.ipv.pt/PaginasPessoais/jloureiro/ESI_AID2007_2008/fichas/codd.pdf). E. F. Codd Associates, 1993. -Archived at [perma.cc/RKX8-2GEE](https://perma.cc/RKX8-2GEE) - -[[6](/en/ch1#Soman2023-marker)] Chinmay Soman and Neha Pawar. -[Comparing Three -Real-Time OLAP Databases: Apache Pinot, Apache Druid, and ClickHouse](https://startree.ai/blog/a-tale-of-three-real-time-olap-databases). *startree.ai*, -April 2023. Archived at [perma.cc/8BZP-VWPA](https://perma.cc/8BZP-VWPA) - -[[7](/en/ch1#Chaudhuri1997-marker)] Surajit Chaudhuri and Umeshwar Dayal. -[An Overview of Data -Warehousing and OLAP Technology](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigrecord.pdf). *ACM SIGMOD Record*, volume 26, issue 1, pages 65–74, -March 1997. [doi:10.1145/248603.248616](https://doi.org/10.1145/248603.248616) - -[[8](/en/ch1#Ozcan2017-marker)] Fatma Özcan, Yuanyuan Tian, and Pinar Tözün. -[Hybrid Transactional/Analytical -Processing: A Survey](https://humming80.github.io/papers/sigmod-htaptut.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017. -[doi:10.1145/3035918.3054784](https://doi.org/10.1145/3035918.3054784) - -[[9](/en/ch1#Prout2022_ch1-marker)] Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu -Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. -[Cloud-Native Transactions and Analytics -in SingleStore](https://dl.acm.org/doi/abs/10.1145/3514221.3526055). At *International Conference on Management of Data* (SIGMOD), June 2022. -[doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055) - -[[10](/en/ch1#Zhang2024-marker)] Chao Zhang, Guoliang Li, Jintao Zhang, -Xinning Zhang, and Jianhua Feng. -[HTAP Databases: A Survey](https://arxiv.org/pdf/2404.15670). -*IEEE Transactions on Knowledge and Data Engineering*, April 2024. 
-[doi:10.1109/TKDE.2024.3389693](https://doi.org/10.1109/TKDE.2024.3389693) - -[[11](/en/ch1#Stonebraker2005fitsall-marker)] Michael Stonebraker and Uğur Çetintemel. -[‘One Size Fits All’: An -Idea Whose Time Has Come and Gone](https://pages.cs.wisc.edu/~shivaram/cs744-readings/fits_all.pdf). At *21st International Conference on Data Engineering* -(ICDE), April 2005. [doi:10.1109/ICDE.2005.1](https://doi.org/10.1109/ICDE.2005.1) - -[[12](/en/ch1#Cohen2009-marker)] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. -Hellerstein, and Caleb Welton. [MAD Skills: -New Analysis Practices for Big Data](https://www.vldb.org/pvldb/vol2/vldb09-219.pdf). *Proceedings of the VLDB Endowment*, volume 2, -issue 2, pages 1481–1492, August 2009. -[doi:10.14778/1687553.1687576](https://doi.org/10.14778/1687553.1687576) - -[[13](/en/ch1#Olteanu2020-marker)] Dan Olteanu. -[The Relational Data Borg is Learning](https://www.vldb.org/pvldb/vol13/p3502-olteanu.pdf). -*Proceedings of the VLDB Endowment*, volume 13, issue 12, August 2020. -[doi:10.14778/3415478.3415572](https://doi.org/10.14778/3415478.3415572) - -[[14](/en/ch1#Bornstein2020-marker)] Matt Bornstein, Martin Casado, and Jennifer Li. -[Emerging -Architectures for Modern Data Infrastructure: 2020](https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/). *future.a16z.com*, October 2020. -Archived at [perma.cc/LF8W-KDCC](https://perma.cc/LF8W-KDCC) - -[[15](/en/ch1#Fowler2015-marker)] Martin Fowler. -[DataLake](https://www.martinfowler.com/bliki/DataLake.html). -*martinfowler.com*, February 2015. -Archived at [perma.cc/4WKN-CZUK](https://perma.cc/4WKN-CZUK) - -[[16](/en/ch1#Johnson2015-marker)] Bobby Johnson and Joseph Adler. -[The -Sushi Principle: Raw Data Is Better](https://learning.oreilly.com/videos/strata-hadoop/9781491924143/9781491924143-video210840/). At *Strata+Hadoop World*, February 2015. - -[[17](/en/ch1#Armbrust2021-marker)] Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. 
-[Lakehouse: A New Generation of -Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference -on Innovative Data Systems Research* (CIDR), January 2021. - -[[18](/en/ch1#DataOps-marker)] DataKitchen, Inc. -[The DataOps Manifesto](https://dataopsmanifesto.org/en/). *dataopsmanifesto.org*, 2017. -Archived at [perma.cc/3F5N-FUQ4](https://perma.cc/3F5N-FUQ4) - -[[19](/en/ch1#Manohar2021-marker)] Tejas Manohar. -[What is Reverse ETL: A Definition & Why It’s -Taking Off](https://hightouch.io/blog/reverse-etl/). *hightouch.io*, November 2021. -Archived at [perma.cc/A7TN-GLYJ](https://perma.cc/A7TN-GLYJ) - -[[20](/en/ch1#ORegan2018-marker)] Simon O’Regan. -[Designing Data -Products](https://towardsdatascience.com/designing-data-products-b6b93edf3d23). *towardsdatascience.com*, August 2018. -Archived at [perma.cc/HU67-3RV8](https://perma.cc/HU67-3RV8) - -[[21](/en/ch1#Fournier2021-marker)] Camille Fournier. -[Why is it so -hard to decide to buy?](https://skamille.medium.com/why-is-it-so-hard-to-decide-to-buy-d86fee98e88e) *skamille.medium.com*, July 2021. -Archived at [perma.cc/6VSG-HQ5X](https://perma.cc/6VSG-HQ5X) - -[[22](/en/ch1#HeinemeierHansson2022-marker)] David Heinemeier Hansson. -[Why we’re leaving the cloud](https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0). -*world.hey.com*, October 2022. -Archived at [perma.cc/82E6-UJ65](https://perma.cc/82E6-UJ65) - -[[23](/en/ch1#Badizadegan2022-marker)] Nima Badizadegan. -[Use One Big Server](https://specbranch.com/posts/one-big-server/). -*specbranch.com*, August 2022. -Archived at [perma.cc/M8NB-95UK](https://perma.cc/M8NB-95UK) - -[[24](/en/ch1#Yegge2020-marker)] Steve Yegge. -[Dear -Google Cloud: Your Deprecation Policy is Killing You](https://steve-yegge.medium.com/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc). *steve-yegge.medium.com*, August 2020. 
-Archived at [perma.cc/KQP9-SPGU](https://perma.cc/KQP9-SPGU) - -[[25](/en/ch1#Verbitski2017-marker)] Alexandre Verbitski, Anurag Gupta, Debanjan -Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz -Kharatishvili, and Xiaofeng Bao. -[Amazon -Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases](https://media.amazonwebservices.com/blog/2017/aurora-design-considerations-paper.pdf). -At *ACM International Conference on Management of Data* (SIGMOD), pages 1041–1052, May 2017. -[doi:10.1145/3035918.3056101](https://doi.org/10.1145/3035918.3056101) - -[[26](/en/ch1#Antonopoulos2019_ch1-marker)] Panagiotis Antonopoulos, Alex Budovski, Cristian -Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar -Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna -Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. -[Socrates: The -New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data* -(SIGMOD), pages 1743–1756, June 2019. -[doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047) - -[[27](/en/ch1#Vuppalapati2020-marker)] Midhul Vuppalapati, Justin Miron, Rachit Agarwal, -Dan Truong, Ashish Motivala, and Thierry Cruanes. -[Building An Elastic Query -Engine on Disaggregated Storage](https://www.usenix.org/system/files/nsdi20-paper-vuppalapati.pdf). At *17th USENIX Symposium on Networked Systems Design and -Implementation* (NSDI), February 2020. - -[[28](/en/ch1#NickVanWiggeren2025-marker)] Nick Van Wiggeren. -[The Real Failure Rate of EBS](https://planetscale.com/blog/the-real-fail-rate-of-ebs). -*planetscale.com*, March 2025. -Archived at [perma.cc/43CR-SAH5](https://perma.cc/43CR-SAH5) - -[[29](/en/ch1#Breck2024-marker)] Colin Breck. 
-[Predicting the -Future of Distributed Systems](https://blog.colinbreck.com/predicting-the-future-of-distributed-systems/). *blog.colinbreck.com*, August 2024. -Archived at [perma.cc/K5FC-4XX2](https://perma.cc/K5FC-4XX2) - -[[30](/en/ch1#Shapira2023separation-marker)] Gwen Shapira. -[Compute-Storage Separation Explained](https://www.thenile.dev/blog/storage-compute). -*thenile.dev*, January 2023. Archived at -[perma.cc/QCV3-XJNZ](https://perma.cc/QCV3-XJNZ) - -[[31](/en/ch1#Murthy2022-marker)] Ravi Murthy and Gurmeet Goindi. -[AlloyDB -for PostgreSQL under the hood: Intelligent, database-aware storage](https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage). *cloud.google.com*, -May 2022. Archived at -[archive.org](https://web.archive.org/web/20220514021120/https%3A//cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage) - -[[32](/en/ch1#Vanlightly2023serverless-marker)] Jack Vanlightly. -[The -Architecture of Serverless Data Systems](https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems). *jack-vanlightly.com*, November 2023. -Archived at [perma.cc/UDV4-TNJ5](https://perma.cc/UDV4-TNJ5) - -[[33](/en/ch1#Jonas2019-marker)] Eric Jonas, Johann Schleier-Smith, Vikram -Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, -Neeraja Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, David A. Patterson. -[Cloud Programming Simplified: A Berkeley View on -Serverless Computing](https://arxiv.org/abs/1902.03383). *arxiv.org*, February 2019. - -[[34](/en/ch1#Beyer2016-marker)] Betsy Beyer, Jennifer Petoff, Chris -Jones, and Niall Richard Murphy. -[*Site -Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/). -O’Reilly Media, 2016. ISBN: 9781491929124 - -[[35](/en/ch1#Limoncelli2020-marker)] Thomas Limoncelli. 
-[The Time I Stole $10,000 from Bell Labs](https://queue.acm.org/detail.cfm?id=3434773). -*ACM Queue*, volume 18, issue 5, November 2020. -[doi:10.1145/3434571.3434773](https://doi.org/10.1145/3434571.3434773) - -[[36](/en/ch1#Majors2020-marker)] Charity Majors. -[The Future of Ops Jobs](https://acloudguru.com/blog/engineering/the-future-of-ops-jobs). -*acloudguru.com*, August 2020. -Archived at [perma.cc/GRU2-CZG3](https://perma.cc/GRU2-CZG3) - -[[37](/en/ch1#Cherkasky2021-marker)] Boris Cherkasky. -[(Over)Pay -As You Go for Your Datastore](https://medium.com/riskified-technology/over-pay-as-you-go-for-your-datastore-11a29ae49a8b). *medium.com*, September 2021. -Archived at [perma.cc/Q8TV-2AM2](https://perma.cc/Q8TV-2AM2) - -[[38](/en/ch1#Kushchi2023-marker)] Shlomi Kushchi. -[Serverless Doesn’t Mean -DevOpsLess or NoOps](https://thenewstack.io/serverless-doesnt-mean-devopsless-or-noops/). *thenewstack.io*, February 2023. -Archived at [perma.cc/3NJR-AYYU](https://perma.cc/3NJR-AYYU) - -[[39](/en/ch1#Bernhardsson2021-marker)] Erik Bernhardsson. -[Storm -in the stratosphere: how the cloud will be reshuffled](https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html). *erikbern.com*, November 2021. -Archived at [perma.cc/SYB2-99P3](https://perma.cc/SYB2-99P3) - -[[40](/en/ch1#Stancil2021-marker)] Benn Stancil. -[The data OS](https://benn.substack.com/p/the-data-os). *benn.substack.com*, -September 2021. Archived at [perma.cc/WQ43-FHS6](https://perma.cc/WQ43-FHS6) - -[[41](/en/ch1#Korolov2022-marker)] Maria Korolov. -[Data -residency laws pushing companies toward residency as a service](https://www.csoonline.com/article/3647761/data-residency-laws-pushing-companies-toward-residency-as-a-service.html). *csoonline.com*, -January 2022. Archived at [perma.cc/CHE4-XZZ2](https://perma.cc/CHE4-XZZ2) - -[[42](/en/ch1#Borenstein2025-marker)] Severin Borenstein. 
-[Can -Data Centers Flex Their Power Demand?](https://energyathaas.wordpress.com/2025/04/14/can-data-centers-flex-their-power-demand/) *energyathaas.wordpress.com*, April 2025. -Archived at - -[[43](/en/ch1#Acun2023-marker)] Bilge Acun, Benjamin Lee, Fiodar Kazhamiaka, Aditya -Sundarrajan, Kiwan Maeng, Manoj Chakkaravarthy, David Brooks, and Carole-Jean Wu. -[Carbon Dependencies in -Datacenter Design and Management](https://hotcarbon.org/assets/2022/pdf/hotcarbon22-acun.pdf). -*ACM SIGENERGY Energy Informatics Review*, volume 3, issue 3, pages 21–26. -[doi:10.1145/3630614.3630619](https://doi.org/10.1145/3630614.3630619) - -[[44](/en/ch1#Nath2019-marker)] Kousik Nath. -[These are -the numbers every computer engineer should know](https://www.freecodecamp.org/news/must-know-numbers-for-every-computer-engineer/). *freecodecamp.org*, September 2019. -Archived at [perma.cc/RW73-36RL](https://perma.cc/RW73-36RL) - -[[45](/en/ch1#Hellerstein2019-marker)] Joseph M. Hellerstein, Jose Faleiro, Joseph E. -Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu. -[Serverless Computing: One Step Forward, Two Steps Back](https://arxiv.org/abs/1812.03651). -At *Conference on Innovative Data Systems Research* (CIDR), January 2019. - -[[46](/en/ch1#McSherry2015_ch1-marker)] Frank McSherry, Michael Isard, and Derek G. Murray. -[Scalability! -But at What COST?](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf) At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), -May 2015. - -[[47](/en/ch1#Sridharan2018-marker)] Cindy Sridharan. -*[Distributed -Systems Observability: A Guide to Building Robust Systems](https://unlimited.humio.com/rs/756-LMY-106/images/Distributed-Systems-Observability-eBook.pdf)*. Report, O’Reilly Media, May 2018. -Archived at [perma.cc/M6JL-XKCM](https://perma.cc/M6JL-XKCM) - -[[48](/en/ch1#Majors2019-marker)] Charity Majors. 
-[Observability — A 3-Year -Retrospective](https://thenewstack.io/observability-a-3-year-retrospective/). *thenewstack.io*, August 2019. -Archived at [perma.cc/CG62-TJWL](https://perma.cc/CG62-TJWL) - -[[49](/en/ch1#Sigelman2010-marker)] Benjamin H. Sigelman, Luiz André Barroso, Mike -Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. -[Dapper, a Large-Scale Distributed Systems Tracing -Infrastructure](https://research.google/pubs/pub36356/). Google Technical Report dapper-2010-1, April 2010. -Archived at [perma.cc/K7KU-2TMH](https://perma.cc/K7KU-2TMH) - -[[50](/en/ch1#Laigner2021-marker)] Rodrigo Laigner, Yongluan Zhou, Marcos Antonio -Vaz Salles, Yijian Liu, and Marcos Kalinowski. -[Data management in microservices: State -of the practice, challenges, and research directions](https://www.vldb.org/pvldb/vol14/p3348-laigner.pdf). *Proceedings of the VLDB Endowment*, -volume 14, issue 13, pages 3348–3361, September 2021. -[doi:10.14778/3484224.3484232](https://doi.org/10.14778/3484224.3484232) - -[[51](/en/ch1#Tigani2023-marker)] Jordan Tigani. -[Big Data is Dead](https://motherduck.com/blog/big-data-is-dead/). -*motherduck.com*, February 2023. -Archived at [perma.cc/HT4Q-K77U](https://perma.cc/HT4Q-K77U) - -[[52](/en/ch1#Newman2021_ch1-marker)] Sam Newman. -[*Building -Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O’Reilly Media, 2021. ISBN: 9781492034025 - -[[53](/en/ch1#Richardson2014-marker)] Chris Richardson. -[Microservices: Decomposing -Applications for Deployability and Scalability](https://www.infoq.com/articles/microservices-intro/). *infoq.com*, May 2014. -Archived at [perma.cc/CKN4-YEQ2](https://perma.cc/CKN4-YEQ2) - -[[54](/en/ch1#Shahrad2020-marker)] Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, -Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, Ricardo Bianchini. 
-[Serverless in the Wild: -Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider](https://www.usenix.org/system/files/atc20-shahrad.pdf). -At *USENIX Annual Technical Conference* (ATC), July 2020. - -[[55](/en/ch1#Barroso2018-marker)] Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. -[The Datacenter as a -Computer: Designing Warehouse-Scale Machines](https://www.morganclaypool.com/doi/10.2200/S00874ED3V01Y201809CAC046), third edition. -Morgan & Claypool Synthesis Lectures on Computer Architecture, October 2018. -[doi:10.2200/S00874ED3V01Y201809CAC046](https://doi.org/10.2200/S00874ED3V01Y201809CAC046) - -[[56](/en/ch1#Fiala2012-marker)] David Fiala, Frank Mueller, Christian Engelmann, Rolf -Riesen, Kurt Ferreira, and Ron Brightwell. -[Detection and -Correction of Silent Data Corruption for Large-Scale High-Performance Computing](https://arcb.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/sc12.pdf),” at -*International Conference for High Performance Computing, Networking, Storage and -Analysis* (SC), November 2012. -[doi:10.1109/SC.2012.49](https://doi.org/10.1109/SC.2012.49) - -[[57](/en/ch1#KornfeldSimpson2020-marker)] Anna Kornfeld -Simpson, Adriana Szekeres, Jacob Nelson, and Irene Zhang. -[Securing RDMA -for High-Performance Datacenter Storage Systems](https://www.usenix.org/conference/hotcloud20/presentation/kornfeld-simpson). At *12th USENIX Workshop on Hot Topics in -Cloud Computing* (HotCloud), July 2020. - -[[58](/en/ch1#Singh2015-marker)] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, -Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, -Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat. -[Jupiter Rising: A -Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf). 
At -*Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. -[doi:10.1145/2785956.2787508](https://doi.org/10.1145/2785956.2787508) - -[[59](/en/ch1#Lockwood2014-marker)] Glenn K. Lockwood. -[Hadoop’s -Uncomfortable Fit in HPC](https://blog.glennklockwood.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html). *glennklockwood.blogspot.co.uk*, May 2014. -Archived at [perma.cc/S8XX-Y67B](https://perma.cc/S8XX-Y67B) - -[[60](/en/ch1#ONeil2016_ch1-marker)] Cathy O’Neil: *Weapons of Math Destruction: -How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. -ISBN: 9780553418811 - -[[61](/en/ch1#Shastri2020-marker)] Supreeth Shastri, Vinay Banakar, Melissa -Wasserman, Arun Kumar, and Vijay Chidambaram. -[Understanding and Benchmarking the -Impact of GDPR on Database Systems](https://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue -7, pages 1064–1077, March 2020. -[doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354) - -[[62](/en/ch1#Datensparsamkeit-marker)] Martin Fowler. -[Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html). -*martinfowler.com*, December 2013. -Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6) - -[[63](/en/ch1#GDPR-marker)] [Regulation -(EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data -Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016. +## Footnotes + +## References + +[^1]: Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. [The Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, January 2009. 
[doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26) +[^2]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737) +[^3]: Joe Reis and Matt Housley. [*Fundamentals of Data Engineering*](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/). O’Reilly Media, 2022. ISBN: 9781098108304 +[^4]: Rui Pedro Machado and Helder Russa. [*Analytics Engineering with SQL and dbt*](https://www.oreilly.com/library/view/analytics-engineering-with/9781098142377/). O’Reilly Media, 2023. ISBN: 9781098142384 +[^5]: Edgar F. Codd, S. B. Codd, and C. T. Salley. [Providing OLAP to User-Analysts: An IT Mandate](https://www.estgv.ipv.pt/PaginasPessoais/jloureiro/ESI_AID2007_2008/fichas/codd.pdf). E. F. Codd Associates, 1993. Archived at [perma.cc/RKX8-2GEE](https://perma.cc/RKX8-2GEE) +[^6]: Chinmay Soman and Neha Pawar. [Comparing Three Real-Time OLAP Databases: Apache Pinot, Apache Druid, and ClickHouse](https://startree.ai/blog/a-tale-of-three-real-time-olap-databases). *startree.ai*, April 2023. Archived at [perma.cc/8BZP-VWPA](https://perma.cc/8BZP-VWPA) +[^7]: Surajit Chaudhuri and Umeshwar Dayal. [An Overview of Data Warehousing and OLAP Technology](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigrecord.pdf). *ACM SIGMOD Record*, volume 26, issue 1, pages 65–74, March 1997. [doi:10.1145/248603.248616](https://doi.org/10.1145/248603.248616) +[^8]: Fatma Özcan, Yuanyuan Tian, and Pinar Tözün. [Hybrid Transactional/Analytical Processing: A Survey](https://humming80.github.io/papers/sigmod-htaptut.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017. 
[doi:10.1145/3035918.3054784](https://doi.org/10.1145/3035918.3054784) +[^9]: Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. [Cloud-Native Transactions and Analytics in SingleStore](https://dl.acm.org/doi/abs/10.1145/3514221.3526055). At *International Conference on Management of Data* (SIGMOD), June 2022. [doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055) +[^10]: Chao Zhang, Guoliang Li, Jintao Zhang, Xinning Zhang, and Jianhua Feng. [HTAP Databases: A Survey](https://arxiv.org/pdf/2404.15670). *IEEE Transactions on Knowledge and Data Engineering*, April 2024. [doi:10.1109/TKDE.2024.3389693](https://doi.org/10.1109/TKDE.2024.3389693) +[^11]: Michael Stonebraker and Uğur Çetintemel. [‘One Size Fits All’: An Idea Whose Time Has Come and Gone](https://pages.cs.wisc.edu/~shivaram/cs744-readings/fits_all.pdf). At *21st International Conference on Data Engineering* (ICDE), April 2005. [doi:10.1109/ICDE.2005.1](https://doi.org/10.1109/ICDE.2005.1) +[^12]: Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton. [MAD Skills: New Analysis Practices for Big Data](https://www.vldb.org/pvldb/vol2/vldb09-219.pdf). *Proceedings of the VLDB Endowment*, volume 2, issue 2, pages 1481–1492, August 2009. [doi:10.14778/1687553.1687576](https://doi.org/10.14778/1687553.1687576) +[^13]: Dan Olteanu. [The Relational Data Borg is Learning](https://www.vldb.org/pvldb/vol13/p3502-olteanu.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, August 2020. [doi:10.14778/3415478.3415572](https://doi.org/10.14778/3415478.3415572) +[^14]: Matt Bornstein, Martin Casado, and Jennifer Li. [Emerging Architectures for Modern Data Infrastructure: 2020](https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/). *future.a16z.com*, October 2020. 
Archived at [perma.cc/LF8W-KDCC](https://perma.cc/LF8W-KDCC) +[^15]: Martin Fowler. [DataLake](https://www.martinfowler.com/bliki/DataLake.html). *martinfowler.com*, February 2015. Archived at [perma.cc/4WKN-CZUK](https://perma.cc/4WKN-CZUK) +[^16]: Bobby Johnson and Joseph Adler. [The Sushi Principle: Raw Data Is Better](https://learning.oreilly.com/videos/strata-hadoop/9781491924143/9781491924143-video210840/). At *Strata+Hadoop World*, February 2015. +[^17]: Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference on Innovative Data Systems Research* (CIDR), January 2021. +[^18]: DataKitchen, Inc. [The DataOps Manifesto](https://dataopsmanifesto.org/en/). *dataopsmanifesto.org*, 2017. Archived at [perma.cc/3F5N-FUQ4](https://perma.cc/3F5N-FUQ4) +[^19]: Tejas Manohar. [What is Reverse ETL: A Definition & Why It’s Taking Off](https://hightouch.io/blog/reverse-etl/). *hightouch.io*, November 2021. Archived at [perma.cc/A7TN-GLYJ](https://perma.cc/A7TN-GLYJ) +[^20]: Simon O’Regan. [Designing Data Products](https://towardsdatascience.com/designing-data-products-b6b93edf3d23). *towardsdatascience.com*, August 2018. Archived at [perma.cc/HU67-3RV8](https://perma.cc/HU67-3RV8) +[^21]: Camille Fournier. [Why is it so hard to decide to buy?](https://skamille.medium.com/why-is-it-so-hard-to-decide-to-buy-d86fee98e88e) *skamille.medium.com*, July 2021. Archived at [perma.cc/6VSG-HQ5X](https://perma.cc/6VSG-HQ5X) +[^22]: David Heinemeier Hansson. [Why we’re leaving the cloud](https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0). *world.hey.com*, October 2022. Archived at [perma.cc/82E6-UJ65](https://perma.cc/82E6-UJ65) +[^23]: Nima Badizadegan. [Use One Big Server](https://specbranch.com/posts/one-big-server/). *specbranch.com*, August 2022. 
Archived at [perma.cc/M8NB-95UK](https://perma.cc/M8NB-95UK) +[^24]: Steve Yegge. [Dear Google Cloud: Your Deprecation Policy is Killing You](https://steve-yegge.medium.com/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc). *steve-yegge.medium.com*, August 2020. Archived at [perma.cc/KQP9-SPGU](https://perma.cc/KQP9-SPGU) +[^25]: Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. [Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases](https://media.amazonwebservices.com/blog/2017/aurora-design-considerations-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 1041–1052, May 2017. [doi:10.1145/3035918.3056101](https://doi.org/10.1145/3035918.3056101) +[^26]: Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. [Socrates: The New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 1743–1756, June 2019. [doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047) +[^27]: Midhul Vuppalapati, Justin Miron, Rachit Agarwal, Dan Truong, Ashish Motivala, and Thierry Cruanes. [Building An Elastic Query Engine on Disaggregated Storage](https://www.usenix.org/system/files/nsdi20-paper-vuppalapati.pdf). At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020. +[^28]: Nick Van Wiggeren. [The Real Failure Rate of EBS](https://planetscale.com/blog/the-real-fail-rate-of-ebs). *planetscale.com*, March 2025. 
Archived at [perma.cc/43CR-SAH5](https://perma.cc/43CR-SAH5) +[^29]: Colin Breck. [Predicting the Future of Distributed Systems](https://blog.colinbreck.com/predicting-the-future-of-distributed-systems/). *blog.colinbreck.com*, August 2024. Archived at [perma.cc/K5FC-4XX2](https://perma.cc/K5FC-4XX2) +[^30]: Gwen Shapira. [Compute-Storage Separation Explained](https://www.thenile.dev/blog/storage-compute). *thenile.dev*, January 2023. Archived at [perma.cc/QCV3-XJNZ](https://perma.cc/QCV3-XJNZ) +[^31]: Ravi Murthy and Gurmeet Goindi. [AlloyDB for PostgreSQL under the hood: Intelligent, database-aware storage](https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage). *cloud.google.com*, May 2022. Archived at [archive.org](https://web.archive.org/web/20220514021120/https%3A//cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage) +[^32]: Jack Vanlightly. [The Architecture of Serverless Data Systems](https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems). *jack-vanlightly.com*, November 2023. Archived at [perma.cc/UDV4-TNJ5](https://perma.cc/UDV4-TNJ5) +[^33]: Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, David A. Patterson. [Cloud Programming Simplified: A Berkeley View on Serverless Computing](https://arxiv.org/abs/1902.03383). *arxiv.org*, February 2019. +[^34]: Betsy Beyer, Jennifer Petoff, Chris Jones, and Niall Richard Murphy. [*Site Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/). O’Reilly Media, 2016. ISBN: 9781491929124 +[^35]: Thomas Limoncelli. [The Time I Stole $10,000 from Bell Labs](https://queue.acm.org/detail.cfm?id=3434773). *ACM Queue*, volume 18, issue 5, November 2020. 
[doi:10.1145/3434571.3434773](https://doi.org/10.1145/3434571.3434773) +[^36]: Charity Majors. [The Future of Ops Jobs](https://acloudguru.com/blog/engineering/the-future-of-ops-jobs). *acloudguru.com*, August 2020. Archived at [perma.cc/GRU2-CZG3](https://perma.cc/GRU2-CZG3) +[^37]: Boris Cherkasky. [(Over)Pay As You Go for Your Datastore](https://medium.com/riskified-technology/over-pay-as-you-go-for-your-datastore-11a29ae49a8b). *medium.com*, September 2021. Archived at [perma.cc/Q8TV-2AM2](https://perma.cc/Q8TV-2AM2) +[^38]: Shlomi Kushchi. [Serverless Doesn’t Mean DevOpsLess or NoOps](https://thenewstack.io/serverless-doesnt-mean-devopsless-or-noops/). *thenewstack.io*, February 2023. Archived at [perma.cc/3NJR-AYYU](https://perma.cc/3NJR-AYYU) +[^39]: Erik Bernhardsson. [Storm in the stratosphere: how the cloud will be reshuffled](https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html). *erikbern.com*, November 2021. Archived at [perma.cc/SYB2-99P3](https://perma.cc/SYB2-99P3) +[^40]: Benn Stancil. [The data OS](https://benn.substack.com/p/the-data-os). *benn.substack.com*, September 2021. Archived at [perma.cc/WQ43-FHS6](https://perma.cc/WQ43-FHS6) +[^41]: Maria Korolov. [Data residency laws pushing companies toward residency as a service](https://www.csoonline.com/article/3647761/data-residency-laws-pushing-companies-toward-residency-as-a-service.html). *csoonline.com*, January 2022. Archived at [perma.cc/CHE4-XZZ2](https://perma.cc/CHE4-XZZ2) +[^42]: Severin Borenstein. [Can Data Centers Flex Their Power Demand?](https://energyathaas.wordpress.com/2025/04/14/can-data-centers-flex-their-power-demand/) *energyathaas.wordpress.com*, April 2025. Archived at +[^43]: Bilge Acun, Benjamin Lee, Fiodar Kazhamiaka, Aditya Sundarrajan, Kiwan Maeng, Manoj Chakkaravarthy, David Brooks, and Carole-Jean Wu. [Carbon Dependencies in Datacenter Design and Management](https://hotcarbon.org/assets/2022/pdf/hotcarbon22-acun.pdf). 
*ACM SIGENERGY Energy Informatics Review*, volume 3, issue 3, pages 21–26. [doi:10.1145/3630614.3630619](https://doi.org/10.1145/3630614.3630619) +[^44]: Kousik Nath. [These are the numbers every computer engineer should know](https://www.freecodecamp.org/news/must-know-numbers-for-every-computer-engineer/). *freecodecamp.org*, September 2019. Archived at [perma.cc/RW73-36RL](https://perma.cc/RW73-36RL) +[^45]: Joseph M. Hellerstein, Jose Faleiro, Joseph E. Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu. [Serverless Computing: One Step Forward, Two Steps Back](https://arxiv.org/abs/1812.03651). At *Conference on Innovative Data Systems Research* (CIDR), January 2019. +[^46]: Frank McSherry, Michael Isard, and Derek G. Murray. [Scalability! But at What COST?](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf) At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015. +[^47]: Cindy Sridharan. *[Distributed Systems Observability: A Guide to Building Robust Systems](https://unlimited.humio.com/rs/756-LMY-106/images/Distributed-Systems-Observability-eBook.pdf)*. Report, O’Reilly Media, May 2018. Archived at [perma.cc/M6JL-XKCM](https://perma.cc/M6JL-XKCM) +[^48]: Charity Majors. [Observability — A 3-Year Retrospective](https://thenewstack.io/observability-a-3-year-retrospective/). *thenewstack.io*, August 2019. Archived at [perma.cc/CG62-TJWL](https://perma.cc/CG62-TJWL) +[^49]: Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. [Dapper, a Large-Scale Distributed Systems Tracing Infrastructure](https://research.google/pubs/pub36356/). Google Technical Report dapper-2010-1, April 2010. Archived at [perma.cc/K7KU-2TMH](https://perma.cc/K7KU-2TMH) +[^50]: Rodrigo Laigner, Yongluan Zhou, Marcos Antonio Vaz Salles, Yijian Liu, and Marcos Kalinowski. 
[Data management in microservices: State of the practice, challenges, and research directions](https://www.vldb.org/pvldb/vol14/p3348-laigner.pdf). *Proceedings of the VLDB Endowment*, volume 14, issue 13, pages 3348–3361, September 2021. [doi:10.14778/3484224.3484232](https://doi.org/10.14778/3484224.3484232) +[^51]: Jordan Tigani. [Big Data is Dead](https://motherduck.com/blog/big-data-is-dead/). *motherduck.com*, February 2023. Archived at [perma.cc/HT4Q-K77U](https://perma.cc/HT4Q-K77U) +[^52]: Sam Newman. [*Building Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O’Reilly Media, 2021. ISBN: 9781492034025 +[^53]: Chris Richardson. [Microservices: Decomposing Applications for Deployability and Scalability](https://www.infoq.com/articles/microservices-intro/). *infoq.com*, May 2014. Archived at [perma.cc/CKN4-YEQ2](https://perma.cc/CKN4-YEQ2) +[^54]: Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. [Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider](https://www.usenix.org/system/files/atc20-shahrad.pdf). At *USENIX Annual Technical Conference* (ATC), July 2020. +[^55]: Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. [The Datacenter as a Computer: Designing Warehouse-Scale Machines](https://www.morganclaypool.com/doi/10.2200/S00874ED3V01Y201809CAC046), third edition. Morgan & Claypool Synthesis Lectures on Computer Architecture, October 2018. [doi:10.2200/S00874ED3V01Y201809CAC046](https://doi.org/10.2200/S00874ED3V01Y201809CAC046) +[^56]: David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell.
[Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing](https://arcb.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/sc12.pdf). At *International Conference for High Performance Computing, Networking, Storage and Analysis* (SC), November 2012. [doi:10.1109/SC.2012.49](https://doi.org/10.1109/SC.2012.49) +[^57]: Anna Kornfeld Simpson, Adriana Szekeres, Jacob Nelson, and Irene Zhang. [Securing RDMA for High-Performance Datacenter Storage Systems](https://www.usenix.org/conference/hotcloud20/presentation/kornfeld-simpson). At *12th USENIX Workshop on Hot Topics in Cloud Computing* (HotCloud), July 2020. +[^58]: Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat. [Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf). At *Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. [doi:10.1145/2785956.2787508](https://doi.org/10.1145/2785956.2787508) +[^59]: Glenn K. Lockwood. [Hadoop’s Uncomfortable Fit in HPC](https://blog.glennklockwood.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html). *glennklockwood.blogspot.co.uk*, May 2014. Archived at [perma.cc/S8XX-Y67B](https://perma.cc/S8XX-Y67B) +[^60]: Cathy O’Neil. *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. ISBN: 9780553418811 +[^61]: Supreeth Shastri, Vinay Banakar, Melissa Wasserman, Arun Kumar, and Vijay Chidambaram. [Understanding and Benchmarking the Impact of GDPR on Database Systems](https://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 7, pages 1064–1077, March 2020.
[doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354) +[^62]: Martin Fowler. [Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html). *martinfowler.com*, December 2013. Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6) +[^63]: [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016. diff --git a/content/en/ch10.md b/content/en/ch10.md index 64d21a4..8f9dc78 100644 --- a/content/en/ch10.md +++ b/content/en/ch10.md @@ -70,11 +70,11 @@ other inconsistencies. That would give us the advantage of fault tolerance, but complexity arising from having to think about multiple replicas. This is the idea behind *linearizability* -[[1](/en/ch10#Herlihy1990)] +[^1] (also known as *atomic consistency* -[[2](/en/ch10#Lamport1986)], +[^2], *strong consistency*, *immediate consistency*, or *external consistency* -[[3](/en/ch10#Gifford1981)]). +[^3]). The exact definition of linearizability is quite subtle, and we will explore it in the rest of this section. But the basic idea is to make a system appear as if there were only one copy of the data, and all operations on it are atomic. With this guarantee, even though there may be multiple replicas @@ -91,7 +91,7 @@ guarantee*. To clarify this idea, let’s look at an example of a system that is ###### Figure 10-1. This system is not linearizable, causing sports fans to be confused. [Figure 10-1](/en/ch10#fig_consistency_linearizability_0) shows an example of a nonlinearizable sports website -[[4](/en/ch10#Kleppmann2015stop)]. +[^4]. Aaliyah and Bryce are sitting in the same room, both checking their phones to see the outcome of a game their favorite team is playing. 
Just after the final score is announced, Aaliyah refreshes the page, sees the winner announced, and excitedly tells Bryce about it. Bryce incredulously hits @@ -170,7 +170,7 @@ still ongoing. (It’s the same situation as with Aaliyah and Bryce in read the new value.) We can further refine this timing diagram to visualize each operation taking effect atomically at -some point in time [[5](/en/ch10#Kingsbury2015mongodb)], +some point in time [^5], like in the more complex example shown in [Figure 10-4](/en/ch10#fig_consistency_linearizability_3). In this example we add a third type of operation besides *read* and *write*: @@ -217,7 +217,7 @@ There are a few interesting details to point out in [Figure 10-4](/en/ch10#fig_ and Bryce in [Figure 10-1](/en/ch10#fig_consistency_linearizability_0). That is the intuition behind linearizability; the formal definition -[[1](/en/ch10#Herlihy1990)] describes it more precisely. It is +[^1] describes it more precisely. It is possible (though computationally expensive) to test whether a system’s behavior is linearizable by recording the timings of all requests and responses, and checking whether they can be arranged into a valid sequential order [[6](/en/ch10#Kingsbury2014knossos), @@ -226,7 +226,7 @@ a valid sequential order [[6](/en/ch10#Kingsbury2014knossos), Just as there are various weak isolation levels for transactions besides serializability (see [“Weak Isolation Levels”](/en/ch8#sec_transactions_isolation_levels)), there are also various weaker consistency models for replicated systems besides linearizability -[[8](/en/ch10#Viotti2016)]. +[^8]. In fact, the *read-after-write*, *monotonic reads*, and *consistent prefix reads* properties we saw in [“Problems with Replication Lag”](/en/ch6#sec_replication_lag) are examples of such weaker consistency models. Linearizability guarantees all these weaker properties, and more. 
In this chapter we will focus on linearizability, @@ -244,7 +244,7 @@ Serializability same as if they had executed in *some* serial order: that is, as if you first performed all of one transaction’s operations, then all of another transaction’s operations, and so on, without interleaving them. It is okay for that serial order to be different from the order in which the - transactions were actually run [[9](/en/ch10#Bailis2014linear)]. + transactions were actually run [^9]. Linearizability : Linearizability is a guarantee on reads and writes of a register (an *individual object*). It @@ -253,10 +253,10 @@ Linearizability is a *recency* guarantee: it requires that if one operation finishes before another one starts, then the later operation must observe a state that is at least as new as the earlier operation. Serializability does not have that requirement: for example, stale reads are allowed by - serializability [[10](/en/ch10#Abadi2019serializable)]. + serializability [^10]. (*Sequential consistency* is something else again -[[8](/en/ch10#Viotti2016)], but we won’t discuss it here.) +[^8], but we won’t discuss it here.) A database may provide both serializability and linearizability, and this combination is known as *strict serializability* or *strong one-copy serializability* (*strong-1SR*) @@ -265,9 +265,9 @@ A database may provide both serializability and linearizability, and this combin Single-node databases are typically linearizable. With distributed databases using optimistic methods like serializable snapshot isolation (see [“Serializable Snapshot Isolation (SSI)”](/en/ch8#sec_transactions_ssi)) the situation is more complicated: for example, CockroachDB provides serializability, and some recency guarantees on -reads, but not strict serializability [[13](/en/ch10#Matei2021)] +reads, but not strict serializability [^13] because this would require expensive coordination between transactions -[[14](/en/ch10#Demirbas2022)]. +[^14]. 
It is also possible to combine a weaker isolation level with linearizability, or a weaker consistency model with serializability; in fact, consistency model and isolation level can be chosen @@ -286,12 +286,12 @@ requirement for making a system work correctly. A system that uses single-leader replication needs to ensure that there is indeed only one leader, not several (split brain). One way of electing a leader is to use a lease: every node that starts up tries to acquire the lease, and the one that succeeds becomes the leader -[[17](/en/ch10#Burrows2006_ch10)]. +[^17]. No matter how this mechanism is implemented, it must be linearizable: it should not be possible for two different nodes to acquire the lease at the same time. Coordination services like Apache ZooKeeper -[[18](/en/ch10#Junqueira2013_ch10)] +[^18] and etcd are often used to implement distributed leases and leader election. They use consensus algorithms to implement linearizable operations in a fault-tolerant way (we discuss such algorithms later in this chapter). There are still many subtle details to implementing leases and leader @@ -303,12 +303,12 @@ linearizable storage service is the basic foundation for these coordination task Strictly speaking, ZooKeeper provides linearizable writes, but reads may be stale, since there is no guarantee that they are served from the current leader -[[18](/en/ch10#Junqueira2013_ch10)]. +[^18]. etcd since version 3 provides linearizable reads by default. Distributed locking is also used at a much more granular level in some distributed databases, such as Oracle Real Application Clusters (RAC) -[[19](/en/ch10#Vallath2006)]. +[^19]. RAC uses a lock per disk page, with multiple nodes sharing access to the same disk storage system. Since these linearizable locks are on the critical path of transaction execution, RAC deployments usually have a dedicated cluster interconnect network for @@ -341,7 +341,7 @@ loosely interpreted constraints in [Link to Come]. 
However, a hard uniqueness constraint, such as the one you typically find in relational databases, requires linearizability. Other kinds of constraints, such as foreign key or attribute constraints, can be implemented without linearizability -[[20](/en/ch10#Bailis2014coord_ch10)]. +[^20]. ### Cross-channel timing dependencies @@ -412,7 +412,7 @@ Single-leader replication (potentially linearizable) assumes that you know for sure who the leader is. As discussed in [“Distributed Locks and Leases”](/en/ch9#sec_distributed_lock_fencing), it is quite possible for a node to think that it is the leader, when in fact it is not—and if the delusional leader continues to serve requests, it is likely to - violate linearizability [[21](/en/ch10#Kingsbury2014etcd)]. + violate linearizability [^21]. With asynchronous replication, failover may even lose committed writes, which violates both durability and linearizability. @@ -424,9 +424,9 @@ Consensus algorithms (likely linearizable) : Some consensus algorithms are essentially single-leader replication with automatic leader election and failover. They are carefully designed to prevent split brain, allowing them to implement linearizable storage safely. ZooKeeper uses the Zab consensus algorithm - [[22](/en/ch10#Junqueira2011)] + [^22] and etcd uses Raft - [[23](/en/ch10#Ongaro2014atc)], for example. + [^23], for example. However, just because a system uses consensus does not guarantee that all operations on it are linearizable: if it allows reads on a node without checking that it is still the leader, the results of the read may be stale if a new leader has just been elected. @@ -472,19 +472,19 @@ returns the new value. 
(It’s once again the Aaliyah and Bryce situation from It is possible to make Dynamo-style quorums linearizable at the cost of reduced performance: a reader must perform read repair (see [“Catching up on missed writes”](/en/ch6#sec_replication_read_repair)) synchronously, before returning results to the application -[[24](/en/ch10#Attiya1995)]. +[^24]. Moreover, before writing, a writer must read the latest state of a quorum of nodes to fetch the latest timestamp of any prior write, and ensure that the new write has a greater timestamp [[25](/en/ch10#Lynch1997), [26](/en/ch10#Cachin2011)]. However, Riak does not perform synchronous read repair due to the performance penalty. Cassandra does wait for read repair to complete on quorum reads -[[27](/en/ch10#Ekstrom2012)], +[^27], but it loses linearizability due to its use of time-of-day clocks for timestamps. Moreover, only linearizable read and write operations can be implemented in this way; a linearizable compare-and-set operation cannot, because it requires a consensus algorithm -[[28](/en/ch10#Herlihy1991)]. +[^28]. In summary, it is safest to assume that a leaderless system with Dynamo-style replication does not provide linearizability, even with quorum reads and writes. @@ -560,10 +560,10 @@ distributed databases since the 1970s CAP was originally proposed as a rule of thumb, without precise definitions, with the goal of starting a discussion about trade-offs in databases. At the time, many distributed databases focused on providing linearizable semantics on a cluster of machines with shared storage -[[19](/en/ch10#Vallath2006)], and CAP encouraged database engineers +[^19], and CAP encouraged database engineers to explore a wider design space of distributed shared-nothing systems, which were more suitable for implementing large-scale web services -[[36](/en/ch10#Brewer2012nosql)]. +[^36]. 
CAP deserves credit for this culture shift—it helped trigger the NoSQL movement, a burst of new database technologies around the mid-2000s. @@ -571,7 +571,7 @@ database technologies around the mid-2000s. CAP is sometimes presented as *Consistency, Availability, Partition tolerance: pick 2 out of 3*. Unfortunately, putting it this way is misleading -[[32](/en/ch10#Brewer2012rules)] because network partitions are a kind of +[^32] because network partitions are a kind of fault, so they aren’t something about which you have a choice: they will happen whether you like it or not. @@ -579,16 +579,16 @@ At times when the network is working correctly, a system can provide both consis (linearizability) and total availability. When a network fault occurs, you have to choose between either linearizability or total availability. Thus, a better way of phrasing CAP would be *either Consistent or Available when Partitioned* -[[37](/en/ch10#Cockcroft2014)]. +[^37]. A more reliable network needs to make this choice less often, but at some point the choice is inevitable. The CP/AP classification scheme has several further flaws -[[4](/en/ch10#Kleppmann2015stop)]. *Consistency* is formalized as +[^4]. *Consistency* is formalized as linearizability (the theorem doesn’t say anything about weaker consistency models), and the -formalization of *availability* [[30](/en/ch10#Gilbert2002)] does not +formalization of *availability* [^30] does not match the usual meaning of the term -[[38](/en/ch10#Kleppmann2015critique)]. Many highly available (fault-tolerant) systems actually do not meet CAP’s +[^38]. Many highly available (fault-tolerant) systems actually do not meet CAP’s idiosyncratic definition of availability. 
Moreover, some system designers choose (with good reason) to provide neither linearizability nor the form of availability that the CAP theorem assumes, so those systems are neither CP nor AP [[39](/en/ch10#Abadi2010), @@ -597,10 +597,10 @@ those systems are neither CP nor AP [[39](/en/ch10#Abadi2010), All in all, there is a lot of misunderstanding and confusion around CAP, and it does not help us understand systems better, so CAP is best avoided. -The CAP theorem as formally defined [[30](/en/ch10#Gilbert2002)] is of +The CAP theorem as formally defined [^30] is of very narrow scope: it only considers one consistency model (namely linearizability) and one kind of fault (network partitions, which according to data from Google are the cause of less than 8% of -incidents [[41](/en/ch10#Brewer2017)]). +incidents [^41]). It doesn’t say anything about network delays, dead nodes, or other trade-offs. Thus, although CAP has been historically influential, it has little practical value for designing systems [[4](/en/ch10#Kleppmann2015stop), @@ -617,7 +617,7 @@ consistency (C). However, this definition inherits several problems with CAP, su counterintuitive definitions of consistency and availability. There are many more interesting impossibility results in distributed systems -[[43](/en/ch10#Lynch1989)], +[^43], and CAP has now been superseded by more precise results [[44](/en/ch10#Mahajan2011), [45](/en/ch10#Attiya2015)], @@ -627,16 +627,16 @@ so it is of mostly historical interest today. Although linearizability is a useful guarantee, surprisingly few systems are actually linearizable in practice. 
For example, even RAM on a modern multi-core CPU is not linearizable -[[46](/en/ch10#Sewell2010)]: +[^46]: if a thread running on one CPU core writes to a memory address, and a thread on another CPU core reads the same address shortly afterward, it is not guaranteed to read the value written by the first thread (unless a *memory barrier* or *fence* -[[47](/en/ch10#Thompson2011)] is used). +[^47] is used). The reason for this behavior is that every CPU core has its own memory cache and store buffer. Memory access first goes to the cache by default, and any changes are asynchronously written out to main memory. Since accessing data in the cache is much faster than going to main memory -[[48](/en/ch10#Drepper2007_ch10)], this feature is essential for +[^48], this feature is essential for good performance on modern CPUs. However, there are now several copies of the data (one in main memory, and perhaps several more in various caches), and these copies are asynchronously updated, so linearizability is lost. @@ -645,15 +645,15 @@ Why make this trade-off? It makes no sense to use the CAP theorem to justify the consistency model: within one computer we usually assume reliable communication, and we don’t expect one CPU core to be able to continue operating normally if it is disconnected from the rest of the computer. The reason for dropping linearizability is *performance*, not fault tolerance -[[39](/en/ch10#Abadi2010)]. +[^39]. The same is true of many distributed databases that choose not to provide linearizable guarantees: they do so primarily to increase performance, not so much for fault tolerance -[[42](/en/ch10#Abadi2012)]. +[^42]. Linearizability is slow—and this is true all the time, not only during a network fault. Can’t we maybe find a more efficient implementation of linearizable storage? 
It seems the answer is -no: Attiya and Welch [[49](/en/ch10#Attiya1994)] +no: Attiya and Welch [^49] prove that if you want linearizability, the response time of read and write requests is at least proportional to the uncertainty of delays in the network. In a network with highly variable delays, like most computer networks (see [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing)), the response time of linearizable @@ -732,13 +732,13 @@ Wall-clock timestamp made unique with extra information that ensures the ID is unique even if the timestamp is not—for example, a shard number and a per-shard incrementing sequence number, or a long random value. This approach is used in Version 7 UUIDs - [[50](/en/ch10#Davis2024)], - Twitter’s Snowflake [[51](/en/ch10#King2010)], - ULIDs [[52](/en/ch10#Feerasta2016)], + [^50], + Twitter’s Snowflake [^51], + ULIDs [^52], Hazelcast’s Flake ID generator, MongoDB ObjectIDs, and many similar schemes - [[50](/en/ch10#Davis2024)]. + [^50]. You can implement these ID generators in application code or within a database - [[53](/en/ch10#Conery2014)]. + [^53]. All these schemes generate IDs that are unique (at least with high enough probability that collisions are vanishingly rare), but they have much weaker ordering guarantees for IDs than the @@ -782,7 +782,7 @@ discussed do not meet the causal ordering requirement. Fortunately, there is a simple method for generating logical timestamps that *is* consistent with causality, and which you can use as a distributed ID generator. It is called a *Lamport clock*, -proposed in 1978 by Leslie Lamport [[54](/en/ch10#Lamport1978_ch10)], +proposed in 1978 by Leslie Lamport [^54], in what is now one of the most-cited papers in the field of distributed systems. 
[Figure 10-9](/en/ch10#fig_consistency_lamport_ts) shows how a Lamport clock would work in the chat example of @@ -829,7 +829,7 @@ limitations: A *hybrid logical clock* combines the advantages of physical time-of-day clocks with the ordering guarantees of Lamport clocks -[[55](/en/ch10#Kulkarni2014)]. +[^55]. Like a physical clock, it counts seconds or microseconds. Like a Lamport clock, when one node sees a timestamp from another node that is greater than its local clock value, it moves its own local value forward to match the other node’s timestamp. As a result, if one node’s clock is running fast, the @@ -853,7 +853,7 @@ essentially, by giving each transaction a transaction ID, and allowing each tran writes made by transactions with a lower ID, but to make writes by transactions with higher IDs invisible. Lamport clocks and hybrid logical clocks are a good way of generating these transaction IDs, because they ensure that the snapshot is consistent with causality -[[56](/en/ch10#Bravo2015)]. +[^56]. When multiple timestamps are generated concurrently, these algorithms order them arbitrarily. This means that when you look at two timestamps, you generally can’t tell whether they were generated @@ -915,7 +915,7 @@ for this purpose. That node only needs to atomically increment a counter and ret requested, persist the counter value (so that it doesn’t generate duplicate IDs if the node crashes and restarts), and replicate it for fault tolerance (using single-leader replication). This approach is used in practice: for example, TiDB/TiKV calls it a *timestamp oracle*, inspired by Google’s -Percolator [[57](/en/ch10#Peng2010_ch10)]. +Percolator [^57]. As an optimization, you can avoid performing a disk write and replication on every single request. 
Instead, the ID generator can write a record describing a batch of IDs; once that record is @@ -958,7 +958,7 @@ sure that no future request will receive an even lower timestamp than the winner Unfortunately, part of the problem is still unsolved: how does a node know whether its own timestamp is the lowest? To be sure, it needs to hear from *every* other node that might have generated a -timestamp [[54](/en/ch10#Lamport1978_ch10)]. If one of the other nodes +timestamp [^54]. If one of the other nodes has failed in the meantime, or cannot be reached due to a network problem, this system would grind to a halt, because we can’t be sure whether that node might have the lowest timestamp. This is not the kind of fault-tolerant system that we need. @@ -1017,14 +1017,14 @@ common assumption is that fewer than one-third of the nodes are Byzantine-faulty [[26](/en/ch10#Cachin2011), [70](/en/ch10#Castro2002)]. Such *Byzantine fault tolerant* (BFT) consensus algorithms are used in blockchains -[[71](/en/ch10#Bano2019_ch10)]. +[^71]. However, as explained in [“Byzantine Faults”](/en/ch9#sec_distributed_byzantine), BFT algorithms are beyond the scope of this book. # The Impossibility of Consensus You may have heard about the FLP result -[[72](/en/ch10#Fischer1985)]—named after the +[^72]—named after the authors Fischer, Lynch, and Paterson—which proves that there is no algorithm that is always able to reach consensus if there is a risk that a node may crash. In a distributed system, we must assume that nodes may crash, so reliable consensus is impossible. Yet, here we are, discussing algorithms @@ -1035,9 +1035,9 @@ a consensus algorithm will *always* terminate. Moreover, the FLP result is prove deterministic algorithm in the asynchronous system model (see [“System Model and Reality”](/en/ch9#sec_distributed_system_model)), which means the algorithm cannot use any clocks or timeouts. 
If it can use timeouts to suspect that another node may have crashed (even if the suspicion is sometimes wrong), then consensus becomes -solvable [[73](/en/ch10#Chandra1996)]. +solvable [^73]. Even just allowing the algorithm to use random numbers is sufficient to get around the impossibility -result [[74](/en/ch10#BenOr1983)]. +result [^74]. Thus, although the FLP result about the impossibility of consensus is of great theoretical importance, distributed systems can usually achieve consensus in practice. @@ -1076,7 +1076,7 @@ More generally, one or more nodes may *propose* values, and the consensus algori of those values. In the examples above, each node could propose its own ID, and the algorithm decides which node ID should become the new leader, the holder of the lease, or the buyer of the airplane/theater seat. In this formalism, a consensus algorithm must satisfy the following -properties [[26](/en/ch10#Cachin2011)]: +properties [^26]: Uniform agreement : No two nodes decide differently. @@ -1121,14 +1121,14 @@ Of course, if *all* nodes crash and none of them are running, then it is not pos algorithm to decide anything. There is a limit to the number of failures that an algorithm can tolerate: in fact, it can be proved that any consensus algorithm requires at least a majority of nodes to be functioning correctly in order to assure termination -[[73](/en/ch10#Chandra1996)]. That majority can safely form a quorum +[^73]. That majority can safely form a quorum (see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)). Thus, the termination property is subject to the assumption that fewer than half of the nodes are crashed or unreachable. However, most consensus algorithms ensure that the safety properties—agreement, integrity, and validity—are always met, even if a majority of nodes fail or there is a severe network problem -[[75](/en/ch10#Dwork1988_ch10)]. +[^75]. 
Thus, a large-scale outage can stop the system from being able to process requests, but it cannot corrupt the consensus system by causing it to make inconsistent decisions. @@ -1159,7 +1159,7 @@ name has not been created or modified by another client since the current client However, a linearizable read-write register is not sufficient to solve consensus. The FLP result tells us that consensus cannot be solved by a deterministic algorithm in the asynchronous crash-stop -model [[72](/en/ch10#Fischer1985)], but we saw in +model [^72], but we saw in [“Linearizability and quorums”](/en/ch10#sec_consistency_quorum_linearizable) that a linearizable register can be implemented using quorum reads/writes in this model [[24](/en/ch10#Attiya1995), [25](/en/ch10#Lynch1997), [26](/en/ch10#Cachin2011)]. @@ -1211,11 +1211,11 @@ If you have an implementation of a shared log, it is easy to solve the consensus that wants to propose a value requests for it to be added to the log, and whichever value is read back in the first log entry is the value that is decided. Since all nodes read log entries in the same order, they are guaranteed to agree on which value is delivered first -[[28](/en/ch10#Herlihy1991)]. +[^28]. Conversely, if you have a solution for consensus, you can implement a shared log. The details are a bit more complicated, but the basic idea is this -[[73](/en/ch10#Chandra1996)]: +[^73]: 1. You have a slot in the log for every future log entry, and you run a separate instance of the consensus algorithm for every such slot to decide what value should go in that entry. @@ -1264,7 +1264,7 @@ the nodes can send each other the values they want to propose, and then each per fetch-and-add operation. The node that reads zero decides its own value, and the node that reads one decides the other node’s value. This solves the consensus problem among two nodes, which is why we can say that fetch-and-add has a *consensus number* of two -[[28](/en/ch10#Herlihy1991)]. +[^28]. 
In contrast, CAS and shared logs solve consensus for any number of nodes that may propose values, so they have a consensus number of ∞ (infinity). @@ -1280,7 +1280,7 @@ similar—both require nodes to come to some form of agreement. However, there i difference: with consensus it’s okay to decide any value that proposed, whereas with atomic commitment the algorithm *must* abort if *any* of the participants voted to abort. More precisely, atomic commitment requires the following properties -[[78](/en/ch10#Guerraoui1995)]: +[^78]: Uniform agreement : No two nodes decide on different outcomes. @@ -1345,7 +1345,7 @@ Multi-Paxos, which also provides a shared log. A shared log is a good fit for database replication: if every log entry represents a write to the database, and every replica processes the same writes in the same order using deterministic logic, then the replicas will all end up in a consistent state. This idea is known as *state machine -replication* [[80](/en/ch10#Schneider1990)], +replication* [^80], and it is the principle behind event sourcing, which we saw in [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events). Shared logs are also useful for stream processing, as we shall see in [Link to Come]. @@ -1361,7 +1361,7 @@ then the transactions will be serializable Sharded databases with a strong consistency model often maintain a separate log per shard, which improves scalability, but limits the consistency guarantees (e.g., consistent snapshots, foreign key references) they can offer across shards. Serializable transactions across shards are possible, but -require additional coordination [[83](/en/ch10#Balakrishnan2012)]. +require additional coordination [^83]. A shared log is also powerful because it can easily be adapted to other forms of consensus: @@ -1374,7 +1374,7 @@ A shared log is also powerful because it can easily be adapted to other forms of current counter value is the sum of all of the log entries so far. 
A simple counter on log entries can be used to generate fencing tokens (see [“Fencing off zombies and delayed requests”](/en/ch9#sec_distributed_fencing_tokens)); for example, in ZooKeeper, this sequence number is called `zxid` - [[18](/en/ch10#Junqueira2013_ch10)]. + [^18]. ### From single-leader replication to consensus @@ -1387,7 +1387,7 @@ failover as an action that a human administrator had to perform manually. Unfort a significant amount of downtime, since there is a limit to how fast humans can react, and it doesn’t satisfy the termination property of consensus. For consensus, we require that the algorithm can automatically choose a new leader. (Not all consensus algorithms have a leader, but the commonly -used algorithms do [[84](/en/ch10#Gavrielatos2021)].) +used algorithms do [^84].) However, there is a problem. We previously discussed the problem of split brain, and said that all nodes need to agree who the leader is—otherwise two different nodes could each believe themselves to @@ -1409,14 +1409,14 @@ leader with the higher epoch number prevails. Before a leader is allowed to append the next entry to the shared log, it must first check that there isn’t some other leader with a higher epoch number which might append a different entry. It can do this by collecting votes from a quorum of nodes—typically, but not always, a majority of -nodes [[85](/en/ch10#Howard2016_ch10)]. +nodes [^85]. A node votes yes only if it is not aware of any other leader with a higher epoch. Thus, we have two rounds of voting: once to choose a leader, and a second time to vote on a leader’s proposal for the next entry to append to the log. The quorums for those two votes must overlap: if a vote on a proposal succeeds, at least one of the nodes that voted for it must have also participated in the most recent successful leader election -[[85](/en/ch10#Howard2016_ch10)]. Thus, if the vote on a proposal +[^85]. 
Thus, if the vote on a proposal passes without revealing any higher-numbered epoch, the current leader can conclude that no leader with a higher epoch number has been elected, and therefore it can safely append the proposed entry to the log [[26](/en/ch10#Cachin2011), @@ -1441,7 +1441,7 @@ approaches. For example, when the old leader fails and a new one is elected, the ensure that the new leader honors any log entries that had already been appended by the old leader before it failed. Raft does this by only allowing a node to become the new leader if its log is at least as up-to-date as a majority of its followers -[[69](/en/ch10#Howard2020)]. +[^69]. In contrast, Paxos allows any node to become the new leader, but requires it to bring its log up-to-date with other nodes before it can start appending new entries of its own. @@ -1492,7 +1492,7 @@ face of all the problems we discussed in [Chapter 9](/en/ch9#ch_distributed). Since single-leader replication with automatic failover is essentially one of the definitions of consensus, any system that provides automatic failover but does not use a proven consensus algorithm -is likely to be unsafe [[87](/en/ch10#Kingsbury2015elastic)]. +is likely to be unsafe [^87]. Using a proven consensus algorithm is not a guarantee of correctness of the whole system—there are still plenty of other places where bugs can lurk—but it’s a good start. @@ -1735,498 +1735,98 @@ availability and better performance. In these cases, it is common to use leaderl replication, which we previously discussed in [Chapter 6](/en/ch6#ch_replication). The logical clocks that we discussed in this chapter are helpful in that context. -##### Footnotes - -##### References - -[[1](/en/ch10#Herlihy1990-marker)] Maurice P. Herlihy and Jeannette M. Wing. -[Linearizability: A Correctness -Condition for Concurrent Objects](https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf). 
*ACM Transactions on Programming Languages and Systems* -(TOPLAS), volume 12, issue 3, pages 463–492, July 1990. -[doi:10.1145/78969.78972](https://doi.org/10.1145/78969.78972) - -[[2](/en/ch10#Lamport1986-marker)] Leslie Lamport. -[On -interprocess communication](https://www.microsoft.com/en-us/research/publication/interprocess-communication-part-basic-formalism-part-ii-algorithms/). *Distributed Computing*, volume 1, issue 2, pages 77–101, -June 1986. [doi:10.1007/BF01786228](https://doi.org/10.1007/BF01786228) - -[[3](/en/ch10#Gifford1981-marker)] David K. Gifford. -[Information -Storage in a Decentralized Computer System](https://bitsavers.org/pdf/xerox/parc/techReports/CSL-81-8_Information_Storage_in_a_Decentralized_Computer_System.pdf). Xerox Palo Alto Research Centers, CSL-81-8, June 1981. -Archived at [perma.cc/2XXP-3JPB](https://perma.cc/2XXP-3JPB) - -[[4](/en/ch10#Kleppmann2015stop-marker)] Martin Kleppmann. -[Please -Stop Calling Databases CP or AP](https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html). *martin.kleppmann.com*, May 2015. -Archived at [perma.cc/MJ5G-75GL](https://perma.cc/MJ5G-75GL) - -[[5](/en/ch10#Kingsbury2015mongodb-marker)] Kyle Kingsbury. -[Call Me Maybe: MongoDB -Stale Reads](https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads). *aphyr.com*, April 2015. -Archived at [perma.cc/DXB4-J4JC](https://perma.cc/DXB4-J4JC) - -[[6](/en/ch10#Kingsbury2014knossos-marker)] Kyle Kingsbury. -[Computational Techniques -in Knossos](https://aphyr.com/posts/314-computational-techniques-in-knossos). *aphyr.com*, May 2014. -Archived at [perma.cc/2X5M-EHTU](https://perma.cc/2X5M-EHTU) - -[[7](/en/ch10#Kingsbury2020elle-marker)] Kyle Kingsbury and Peter Alvaro. -[Elle: Inferring Isolation Anomalies from -Experimental Observations](https://www.vldb.org/pvldb/vol14/p268-alvaro.pdf). *Proceedings of the VLDB Endowment*, volume 14, issue 3, pages -268–280, November 2020. 
-[doi:10.14778/3430915.3430918](https://doi.org/10.14778/3430915.3430918) - -[[8](/en/ch10#Viotti2016-marker)] Paolo Viotti and Marko Vukolić. -[Consistency in Non-Transactional Distributed Storage -Systems](https://arxiv.org/abs/1512.00168). *ACM Computing Surveys* (CSUR), volume 49, issue 1, article no. 19, June 2016. -[doi:10.1145/2926965](https://doi.org/10.1145/2926965) - -[[9](/en/ch10#Bailis2014linear-marker)] Peter Bailis. -[Linearizability -Versus Serializability](http://www.bailis.org/blog/linearizability-versus-serializability/). *bailis.org*, September 2014. -Archived at [perma.cc/386B-KAC3](https://perma.cc/386B-KAC3) - -[[10](/en/ch10#Abadi2019serializable-marker)] Daniel Abadi. -[Correctness -Anomalies Under Serializable Isolation](https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html). *dbmsmusings.blogspot.com*, June 2019. -Archived at [perma.cc/JGS7-BZFY](https://perma.cc/JGS7-BZFY) - -[[11](/en/ch10#Bailis2014virtues_ch10-marker)] Peter Bailis, Aaron Davidson, Alan -Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. -[Highly Available Transactions: Virtues and -Limitations](https://www.vldb.org/pvldb/vol7/p181-bailis.pdf). *Proceedings of the VLDB Endowment*, volume 7, issue 3, pages 181–192, -November 2013. [doi:10.14778/2732232.2732237](https://doi.org/10.14778/2732232.2732237), -extended version published as [arXiv:1302.0309](https://arxiv.org/abs/1302.0309) - -[[12](/en/ch10#Bernstein1987_ch10-marker)] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. -[*Concurrency Control and -Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at -[*microsoft.com*](https://www.microsoft.com/en-us/research/people/philbe/book/). - -[[13](/en/ch10#Matei2021-marker)] Andrei Matei. -[CockroachDB’s consistency model](https://www.cockroachlabs.com/blog/consistency-model/). -*cockroachlabs.com*, February 2021. 
-Archived at [perma.cc/MR38-883B](https://perma.cc/MR38-883B) - -[[14](/en/ch10#Demirbas2022-marker)] Murat Demirbas. -[Strict-serializability, -but at what cost, for what purpose?](https://muratbuffalo.blogspot.com/2022/08/strict-serializability-but-at-what-cost.html) *muratbuffalo.blogspot.com*, August 2022. -Archived at [perma.cc/T8AY-N3U9](https://perma.cc/T8AY-N3U9) - -[[15](/en/ch10#Darnell2022-marker)] Ben Darnell. -[How to talk about -consistency and isolation in distributed DBs](https://www.cockroachlabs.com/blog/db-consistency-isolation-terminology/). *cockroachlabs.com*, February 2022. -Archived at [perma.cc/53SV-JBGK](https://perma.cc/53SV-JBGK) - -[[16](/en/ch10#Abadi2019consistency-marker)] Daniel Abadi. -[An -explanation of the difference between Isolation levels vs. Consistency levels](https://dbmsmusings.blogspot.com/2019/08/an-explanation-of-difference-between.html). -*dbmsmusings.blogspot.com*, August 2019. -Archived at [perma.cc/QSF2-CD4P](https://perma.cc/QSF2-CD4P) - -[[17](/en/ch10#Burrows2006_ch10-marker)] Mike Burrows. -[The Chubby Lock Service for Loosely-Coupled -Distributed Systems](https://research.google/pubs/pub27897/). At *7th USENIX Symposium on Operating System Design and -Implementation* (OSDI), November 2006. - -[[18](/en/ch10#Junqueira2013_ch10-marker)] Flavio P. Junqueira and Benjamin Reed. -[*ZooKeeper: Distributed -Process Coordination*](https://www.oreilly.com/library/view/zookeeper/9781449361297/). O’Reilly Media, 2013. ISBN: 978-1-449-36130-3 - -[[19](/en/ch10#Vallath2006-marker)] Murali Vallath. -[*Oracle 10g RAC -Grid, Services & Clustering*](https://www.oreilly.com/library/view/oracle-10g-rac/9781555583217/). Elsevier Digital Press, 2006. ISBN: 978-1-555-58321-7 - -[[20](/en/ch10#Bailis2014coord_ch10-marker)] Peter Bailis, Alan Fekete, Michael J. -Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. -[Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). 
-*Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014. -[doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509) - -[[21](/en/ch10#Kingsbury2014etcd-marker)] Kyle Kingsbury. -[Call Me Maybe: etcd and -Consul](https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul). *aphyr.com*, June 2014. -Archived at [perma.cc/XL7U-378K](https://perma.cc/XL7U-378K) - -[[22](/en/ch10#Junqueira2011-marker)] Flavio P. Junqueira, Benjamin C. Reed, and -Marco Serafini. [Zab: High-Performance -Broadcast for Primary-Backup Systems](https://marcoserafini.github.io/assets/pdf/zab.pdf). At *41st IEEE International Conference on Dependable -Systems and Networks* (DSN), June 2011. -[doi:10.1109/DSN.2011.5958223](https://doi.org/10.1109/DSN.2011.5958223) - -[[23](/en/ch10#Ongaro2014atc-marker)] Diego Ongaro and John K. Ousterhout. -[In Search -of an Understandable Consensus Algorithm](https://www.usenix.org/system/files/conference/atc14/atc14-paper-ongaro.pdf). At *USENIX Annual Technical Conference* -(ATC), June 2014. - -[[24](/en/ch10#Attiya1995-marker)] Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. -[Sharing Memory Robustly in -Message-Passing Systems](https://www.cs.huji.ac.il/course/2004/dist/p124-attiya.pdf). *Journal of the ACM*, volume 42, issue 1, pages 124–142, January 1995. -[doi:10.1145/200836.200869](https://doi.org/10.1145/200836.200869) - -[[25](/en/ch10#Lynch1997-marker)] Nancy Lynch and Alex Shvartsman. -[Robust Emulation of Shared Memory -Using Dynamic Quorum-Acknowledged Broadcasts](https://groups.csail.mit.edu/tds/papers/Lynch/FTCS97.pdf). At *27th Annual International Symposium on -Fault-Tolerant Computing* (FTCS), June 1997. -[doi:10.1109/FTCS.1997.614100](https://doi.org/10.1109/FTCS.1997.614100) - -[[26](/en/ch10#Cachin2011-marker)] Christian Cachin, Rachid Guerraoui, and Luís Rodrigues. -[*Introduction to Reliable and Secure Distributed -Programming*](https://www.distributedprogramming.net/), 2nd edition. 
Springer, 2011. ISBN: 978-3-642-15259-7, -[doi:10.1007/978-3-642-15260-3](https://doi.org/10.1007/978-3-642-15260-3) - -[[27](/en/ch10#Ekstrom2012-marker)] Niklas Ekström, Mikhail Panchenko, and Jonathan Ellis. -[Possible -Issue with Read Repair?](https://lists.apache.org/thread/wwsjnnc93mdlpw8nb0d5gn4q1bmpzbon) Email thread on *cassandra-dev* mailing list, October 2012. - -[[28](/en/ch10#Herlihy1991-marker)] Maurice P. Herlihy. -[Wait-Free Synchronization](https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf). -*ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 13, issue 1, -pages 124–149, January 1991. -[doi:10.1145/114005.102808](https://doi.org/10.1145/114005.102808) - -[[29](/en/ch10#Fox1999-marker)] Armando Fox and Eric A. Brewer. -[Harvest, Yield, and -Scalable Tolerant Systems](https://radlab.cs.berkeley.edu/people/fox/static/pubs/pdf/c18.pdf). At *7th Workshop on Hot Topics in Operating Systems* (HotOS), -March 1999. [doi:10.1109/HOTOS.1999.798396](https://doi.org/10.1109/HOTOS.1999.798396) - -[[30](/en/ch10#Gilbert2002-marker)] Seth Gilbert and Nancy Lynch. -[Brewer’s Conjecture -and the Feasibility of Consistent, Available, Partition-Tolerant Web Services](https://www.comp.nus.edu.sg/~gilbert/pubs/BrewersConjecture-SigAct.pdf). -*ACM SIGACT News*, volume 33, issue 2, pages 51–59, June 2002. -[doi:10.1145/564585.564601](https://doi.org/10.1145/564585.564601) - -[[31](/en/ch10#Gilbert2012-marker)] Seth Gilbert and Nancy Lynch. -[Perspectives on the CAP -Theorem](https://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 30–36, February 2012. -[doi:10.1109/MC.2011.389](https://doi.org/10.1109/MC.2011.389) - -[[32](/en/ch10#Brewer2012rules-marker)] Eric A. Brewer. -[CAP Twelve Years -Later: How the ‘Rules’ Have Changed](https://sites.cs.ucsb.edu/~rich/class/cs293-cloud/papers/brewer-cap.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages -23–29, February 2012. 
[doi:10.1109/MC.2012.37](https://doi.org/10.1109/MC.2012.37) - -[[33](/en/ch10#Davidson1985-marker)] Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen. -[Consistency in Partitioned -Networks](https://www.cs.rice.edu/~alc/old/comp520/papers/DGS85.pdf). *ACM Computing Surveys*, volume 17, issue 3, pages 341–370, September 1985. -[doi:10.1145/5505.5508](https://doi.org/10.1145/5505.5508) - -[[34](/en/ch10#Johnson1975-marker)] Paul R. Johnson and Robert H. Thomas. -[RFC 677: The Maintenance of Duplicate -Databases](https://tools.ietf.org/html/rfc677). Network Working Group, January 1975. - -[[35](/en/ch10#Fischer1982-marker)] Michael J. Fischer and Alan Michael. -[Sacrificing -Serializability to Attain High Availability of Data in an Unreliable Network](https://sites.cs.ucsb.edu/~agrawal/spring2011/ugrad/p70-fischer.pdf). At -*1st ACM Symposium on Principles of Database Systems* (PODS), March 1982. -[doi:10.1145/588111.588124](https://doi.org/10.1145/588111.588124) - -[[36](/en/ch10#Brewer2012nosql-marker)] Eric A. Brewer. -[NoSQL: Past, Present, Future](https://www.infoq.com/presentations/NoSQL-History/). -At *QCon San Francisco*, November 2012. - -[[37](/en/ch10#Cockcroft2014-marker)] Adrian Cockcroft. -[Migrating to Microservices](https://www.infoq.com/presentations/migration-cloud-native/). -At *QCon London*, March 2014. - -[[38](/en/ch10#Kleppmann2015critique-marker)] Martin Kleppmann. -[A Critique of the CAP Theorem](https://arxiv.org/abs/1509.05393). arXiv:1509.05393, -September 2015. - -[[39](/en/ch10#Abadi2010-marker)] Daniel Abadi. -[Problems -with CAP, and Yahoo’s little known NoSQL system](https://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html). *dbmsmusings.blogspot.com*, April 2010. -Archived at [perma.cc/4NTZ-CLM9](https://perma.cc/4NTZ-CLM9) - -[[40](/en/ch10#Abadi2017-marker)] Daniel Abadi. -[Hazelcast -and the Mythical PA/EC System](https://dbmsmusings.blogspot.com/2017/10/hazelcast-and-mythical-paec-system.html). 
*dbmsmusings.blogspot.com*, October 2017. -Archived at [perma.cc/J5XM-U5C2](https://perma.cc/J5XM-U5C2) - -[[41](/en/ch10#Brewer2017-marker)] Eric Brewer. -[Spanner, TrueTime & The CAP -Theorem](https://research.google.com/pubs/archive/45855.pdf). *research.google.com*, February 2017. -Archived at [perma.cc/59UW-RH7N](https://perma.cc/59UW-RH7N) - -[[42](/en/ch10#Abadi2012-marker)] Daniel J. Abadi. -[Consistency Tradeoffs in -Modern Distributed Database System Design](https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf). *IEEE Computer Magazine*, -volume 45, issue 2, pages 37–42, February 2012. -[doi:10.1109/MC.2012.33](https://doi.org/10.1109/MC.2012.33) - -[[43](/en/ch10#Lynch1989-marker)] Nancy A. Lynch. -[A Hundred Impossibility Proofs -for Distributed Computing](https://groups.csail.mit.edu/tds/papers/Lynch/podc89.pdf). At *8th ACM Symposium on Principles of Distributed -Computing* (PODC), August 1989. -[doi:10.1145/72981.72982](https://doi.org/10.1145/72981.72982) - -[[44](/en/ch10#Mahajan2011-marker)] Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin. -[Consistency, Availability, -and Convergence](https://apps.cs.utexas.edu/tech_reports/reports/tr/TR-2036.pdf). University of Texas at Austin, Department of Computer Science, Tech Report UTCS -TR-11-22, May 2011. Archived at [perma.cc/SAV8-9JAJ](https://perma.cc/SAV8-9JAJ) - -[[45](/en/ch10#Attiya2015-marker)] Hagit Attiya, Faith Ellen, and Adam Morrison. -[Limitations -of Highly-Available Eventually-Consistent Data Stores](https://www.cs.tau.ac.il/~mad/publications/podc2015-replds.pdf). At *ACM Symposium on Principles of -Distributed Computing* (PODC), July 2015. -[doi:10.1145/2767386.2767419](https://doi.org/10.1145/2767386.2767419) - -[[46](/en/ch10#Sewell2010-marker)] Peter Sewell, Susmit Sarkar, Scott Owens, -Francesco Zappa Nardelli, and Magnus O. Myreen. -[x86-TSO: A Rigorous and Usable -Programmer’s Model for x86 Multiprocessors](https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf). 
*Communications of the ACM*, -volume 53, issue 7, pages 89–97, July 2010. -[doi:10.1145/1785414.1785443](https://doi.org/10.1145/1785414.1785443) - -[[47](/en/ch10#Thompson2011-marker)] Martin Thompson. -[Memory -Barriers/Fences](https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html). *mechanical-sympathy.blogspot.co.uk*, July 2011. -Archived at [perma.cc/7NXM-GC5U](https://perma.cc/7NXM-GC5U) - -[[48](/en/ch10#Drepper2007_ch10-marker)] Ulrich Drepper. -[What Every Programmer Should Know About -Memory](https://www.akkadia.org/drepper/cpumemory.pdf). *akkadia.org*, November 2007. Archived at -[perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ) - -[[49](/en/ch10#Attiya1994-marker)] Hagit Attiya and Jennifer L. Welch. -[Sequential Consistency -Versus Linearizability](https://courses.csail.mit.edu/6.852/01/papers/p91-attiya.pdf). *ACM Transactions on Computer Systems* (TOCS), -volume 12, issue 2, pages 91–122, May 1994. -[doi:10.1145/176575.176576](https://doi.org/10.1145/176575.176576) - -[[50](/en/ch10#Davis2024-marker)] Kyzer R. Davis, Brad G. Peabody, and Paul J. Leach. -[Universally Unique IDentifiers (UUIDs)](https://www.rfc-editor.org/rfc/rfc9562). -RFC 9562, IETF, May 2024. - -[[51](/en/ch10#King2010-marker)] Ryan King. -[Announcing Snowflake](https://blog.x.com/engineering/en_us/a/2010/announcing-snowflake). -*blog.x.com*, June 2010. Archived at -[archive.org](https://web.archive.org/web/20241128214604/https%3A//blog.x.com/engineering/en_us/a/2010/announcing-snowflake) - -[[52](/en/ch10#Feerasta2016-marker)] Alizain Feerasta. -[Universally Unique Lexicographically Sortable Identifier](https://github.com/ulid/spec). -*github.com*, 2016. -Archived at [perma.cc/NV2Y-ZP8U](https://perma.cc/NV2Y-ZP8U) - -[[53](/en/ch10#Conery2014-marker)] Rob Conery. -[A Better ID -Generator for PostgreSQL](https://bigmachine.io/2014/05/29/a-better-id-generator-for-postgresql/). *bigmachine.io*, May 2014. 
-Archived at [perma.cc/K7QV-3KFC](https://perma.cc/K7QV-3KFC) - -[[54](/en/ch10#Lamport1978_ch10-marker)] Leslie Lamport. -[Time, -Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, -volume 21, issue 7, pages 558–565, July 1978. -[doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563) - -[[55](/en/ch10#Kulkarni2014-marker)] Sandeep S. Kulkarni, Murat Demirbas, Deepak -Madeppa, Bharadwaj Avva, and Marcelo Leone. -[Logical Physical Clocks](https://cse.buffalo.edu/~demirbas/publications/hlc.pdf). -*18th International Conference on Principles of Distributed Systems* (OPODIS), December 2014. -[doi:10.1007/978-3-319-14472-6\_2](https://doi.org/10.1007/978-3-319-14472-6_2) - -[[56](/en/ch10#Bravo2015-marker)] Manuel Bravo, Nuno Diegues, Jingna Zeng, Paolo -Romano, and Luís Rodrigues. -[On the use of Clocks to Enforce -Consistency in the Cloud](http://sites.computer.org/debull/A15mar/p18.pdf). *IEEE Data Engineering Bulletin*, volume 38, issue 1, -pages 18–31, March 2015. -Archived at [perma.cc/68ZU-45SH](https://perma.cc/68ZU-45SH) - -[[57](/en/ch10#Peng2010_ch10-marker)] Daniel Peng and Frank Dabek. -[Large-Scale -Incremental Processing Using Distributed Transactions and Notifications](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf). At *9th USENIX -Conference on Operating Systems Design and Implementation* (OSDI), October 2010. - -[[58](/en/ch10#Chandra2007-marker)] Tushar Deepak Chandra, Robert Griesemer, and Joshua -Redstone. [Paxos -Made Live – An Engineering Perspective](https://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf). At *26th ACM Symposium on Principles of Distributed -Computing* (PODC), June 2007. -[doi:10.1145/1281100.1281103](https://doi.org/10.1145/1281100.1281103) - -[[59](/en/ch10#Portnoy2012-marker)] Will Portnoy. 
-[Lessons Learned from -Implementing Paxos](https://blog.willportnoy.com/2012/06/lessons-learned-from-paxos.html). *blog.willportnoy.com*, June 2012. -Archived at [perma.cc/QHD9-FDD2](https://perma.cc/QHD9-FDD2) - -[[60](/en/ch10#Oki1988-marker)] Brian M. Oki and Barbara H. Liskov. -[Viewstamped Replication: A New Primary Copy Method -to Support Highly-Available Distributed Systems](https://pmg.csail.mit.edu/papers/vr.pdf). At *7th ACM Symposium on Principles of -Distributed Computing* (PODC), August 1988. -[doi:10.1145/62546.62549](https://doi.org/10.1145/62546.62549) - -[[61](/en/ch10#Liskov2012-marker)] Barbara H. Liskov and James Cowling. -[Viewstamped Replication Revisited](https://pmg.csail.mit.edu/papers/vr-revisited.pdf). -Massachusetts Institute of Technology, Tech Report MIT-CSAIL-TR-2012-021, July 2012. -Archived at [perma.cc/56SJ-WENQ](https://perma.cc/56SJ-WENQ) - -[[62](/en/ch10#Lamport1998-marker)] Leslie Lamport. -[The -Part-Time Parliament](https://www.microsoft.com/en-us/research/publication/part-time-parliament/). *ACM Transactions on Computer Systems*, volume 16, issue 2, -pages 133–169, May 1998. -[doi:10.1145/279227.279229](https://doi.org/10.1145/279227.279229) - -[[63](/en/ch10#Lamport2001-marker)] Leslie Lamport. -[Paxos Made -Simple](https://www.microsoft.com/en-us/research/publication/paxos-made-simple/). *ACM SIGACT News*, volume 32, issue 4, pages 51–58, December 2001. -Archived at [perma.cc/82HP-MNKE](https://perma.cc/82HP-MNKE) - -[[64](/en/ch10#vanRenesse2011-marker)] Robbert van Renesse and Deniz Altinbuken. -[Paxos Made -Moderately Complex](https://people.cs.umass.edu/~arun/590CC/papers/paxos-moderately-complex.pdf). *ACM Computing Surveys* (CSUR), volume 47, issue 3, article no. 42, -February 2015. [doi:10.1145/2673577](https://doi.org/10.1145/2673577) - -[[65](/en/ch10#Ongaro2014thesis-marker)] Diego Ongaro. -[Consensus: Bridging Theory and Practice](https://github.com/ongardie/dissertation). 
-PhD Thesis, Stanford University, August 2014. -Archived at [perma.cc/5VTZ-2ADH](https://perma.cc/5VTZ-2ADH) - -[[66](/en/ch10#Howard2015refloated-marker)] Heidi Howard, Malte Schwarzkopf, Anil -Madhavapeddy, and Jon Crowcroft. -[Raft -Refloated: Do We Have Consensus?](https://www.cl.cam.ac.uk/research/srg/netos/papers/2015-raftrefloated-osr.pdf) *ACM SIGOPS Operating Systems Review*, volume 49, issue -1, pages 12–21, January 2015. -[doi:10.1145/2723872.2723876](https://doi.org/10.1145/2723872.2723876) - -[[67](/en/ch10#Medeiros2012-marker)] André Medeiros. -[ZooKeeper’s Atomic -Broadcast Protocol: Theory and Practice](http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf). Aalto University School of Science, March 2012. -Archived at [perma.cc/FVL4-JMVA](https://perma.cc/FVL4-JMVA) - -[[68](/en/ch10#vanRenesse2014-marker)] Robbert van Renesse, Nicolas Schiper, and -Fred B. Schneider. [Vive La Différence: Paxos vs. -Viewstamped Replication vs. Zab](https://arxiv.org/abs/1309.5671). *IEEE Transactions on Dependable and Secure Computing*, -volume 12, issue 4, pages 472–484, September 2014. -[doi:10.1109/TDSC.2014.2355848](https://doi.org/10.1109/TDSC.2014.2355848) - -[[69](/en/ch10#Howard2020-marker)] Heidi Howard and Richard Mortier. -[Paxos vs Raft: Have we reached consensus on distributed -consensus?](https://arxiv.org/abs/2004.05074). At *7th Workshop on Principles and Practice of Consistency for Distributed -Data* (PaPoC), April 2020. -[doi:10.1145/3380787.3393681](https://doi.org/10.1145/3380787.3393681) - -[[70](/en/ch10#Castro2002-marker)] Miguel Castro and Barbara H. Liskov. -[Practical -Byzantine Fault Tolerance and Proactive Recovery](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/p398-castro-bft-tocs.pdf). *ACM Transactions on Computer Systems*, -volume 20, issue 4, pages 396–461, November 2002. 
-[doi:10.1145/571637.571640](https://doi.org/10.1145/571637.571640) - -[[71](/en/ch10#Bano2019_ch10-marker)] Shehar Bano, Alberto Sonnino, Mustafa -Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis. -[SoK: Consensus in the Age of Blockchains](https://smeiklej.com/files/aft19a.pdf). At -*1st ACM Conference on Advances in Financial Technologies* (AFT), October 2019. -[doi:10.1145/3318041.3355458](https://doi.org/10.1145/3318041.3355458) - -[[72](/en/ch10#Fischer1985-marker)] Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. -[Impossibility of Distributed Consensus with -One Faulty Process](https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf). *Journal of the ACM*, volume 32, issue 2, pages 374–382, April 1985. -[doi:10.1145/3149.214121](https://doi.org/10.1145/3149.214121) - -[[73](/en/ch10#Chandra1996-marker)] Tushar Deepak Chandra and Sam Toueg. -[Unreliable Failure Detectors -for Reliable Distributed Systems](https://courses.csail.mit.edu/6.852/08/papers/CT96-JACM.pdf). *Journal of the ACM*, volume 43, issue 2, pages -225–267, March 1996. -[doi:10.1145/226643.226647](https://doi.org/10.1145/226643.226647) - -[[74](/en/ch10#BenOr1983-marker)] Michael Ben-Or. -[Another Advantage of Free Choice: -Completely Asynchronous Agreement Protocols](https://homepage.cs.uiowa.edu/~ghosh/BenOr.pdf). At *2nd ACM Symposium on Principles of -Distributed Computing* (PODC), August 1983. -[doi:10.1145/800221.806707](https://doi.org/10.1145/800221.806707) - -[[75](/en/ch10#Dwork1988_ch10-marker)] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. -[Consensus in the Presence of -Partial Synchrony](https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf). *Journal of the ACM*, volume 35, issue 2, pages 288–323, April 1988. -[doi:10.1145/42282.42283](https://doi.org/10.1145/42282.42283) - -[[76](/en/ch10#Defago2004-marker)] Xavier Défago, André Schiper, and Péter Urbán. 
-[Total Order -Broadcast and Multicast Algorithms: Taxonomy and Survey](https://dspace.jaist.ac.jp/dspace/bitstream/10119/4883/1/defago_et_al.pdf). *ACM Computing Surveys*, volume -36, issue 4, pages 372–421, December 2004. -[doi:10.1145/1041680.1041682](https://doi.org/10.1145/1041680.1041682) - -[[77](/en/ch10#Attiya2004-marker)] Hagit Attiya and Jennifer Welch. *Distributed -Computing: Fundamentals, Simulations and Advanced Topics*, 2nd edition. -John Wiley & Sons, 2004. ISBN: 978-0-471-45324-6, -[doi:10.1002/0471478210](https://doi.org/10.1002/0471478210) - -[[78](/en/ch10#Guerraoui1995-marker)] Rachid Guerraoui. -[Revisiting -the Relationship Between Non-Blocking Atomic Commitment and Consensus](https://citeseerx.ist.psu.edu/pdf/5d06489503b6f791aa56d2d7942359c2592e44b0). At *9th International -Workshop on Distributed Algorithms* (WDAG), September 1995. -[doi:10.1007/BFb0022140](https://doi.org/10.1007/BFb0022140) - -[[79](/en/ch10#Gray2006-marker)] Jim N. Gray and Leslie Lamport. -[Consensus on Transaction -Commit](https://dsf.berkeley.edu/cs286/papers/paxoscommit-tods2006.pdf). *ACM Transactions on Database Systems* (TODS), volume 31, issue 1, pages 133–160, -March 2006. [doi:10.1145/1132863.1132867](https://doi.org/10.1145/1132863.1132867) - -[[80](/en/ch10#Schneider1990-marker)] Fred B. Schneider. -[Implementing Fault-Tolerant -Services Using the State Machine Approach: A Tutorial](https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf). *ACM Computing Surveys*, volume -22, issue 4, pages 299–319, December 1990. -[doi:10.1145/98163.98167](https://doi.org/10.1145/98163.98167) - -[[81](/en/ch10#Thomson2012-marker)] Alexander Thomson, Thaddeus Diamond, Shu-Chun -Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. -[Calvin: Fast -Distributed Transactions for Partitioned Database Systems](https://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf). At *ACM International Conference -on Management of Data* (SIGMOD), May 2012. 
-[doi:10.1145/2213836.2213838](https://doi.org/10.1145/2213836.2213838) - -[[82](/en/ch10#Balakrishnan2013-marker)] Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, -Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and Aviad Zuck. -[Tango: -Distributed Data Structures over a Shared Log](https://www.microsoft.com/en-us/research/publication/tango-distributed-data-structures-over-a-shared-log/). At *24th ACM Symposium on Operating Systems -Principles* (SOSP), November 2013. -[doi:10.1145/2517349.2522732](https://doi.org/10.1145/2517349.2522732) - -[[83](/en/ch10#Balakrishnan2012-marker)] Mahesh -Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, and John D. Davis. -[CORFU: A Shared -Log Design for Flash Clusters](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final30.pdf). At *9th USENIX Symposium on Networked Systems Design and -Implementation* (NSDI), April 2012. - -[[84](/en/ch10#Gavrielatos2021-marker)] Vasilis Gavrielatos, -Antonios Katsarakis, and Vijay Nagarajan. -[Odyssey: the impact of modern -hardware on strongly-consistent replication protocols](https://vasigavr1.github.io/files/Odyssey_Eurosys_2021.pdf). At *16th European Conference on -Computer Systems* (EuroSys), April 2021. -[doi:10.1145/3447786.3456240](https://doi.org/10.1145/3447786.3456240) - -[[85](/en/ch10#Howard2016_ch10-marker)] Heidi Howard, Dahlia Malkhi, and -Alexander Spiegelman. -[Flexible -Paxos: Quorum Intersection Revisited](https://drops.dagstuhl.de/opus/volltexte/2017/7094/pdf/LIPIcs-OPODIS-2016-25.pdf). At *20th International Conference on Principles of -Distributed Systems* (OPODIS), December 2016. -[doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25) - -[[86](/en/ch10#Kleppmann2024distsys-marker)] Martin Kleppmann. -[Distributed Systems -lecture notes](https://www.cl.cam.ac.uk/teaching/2425/ConcDisSys/dist-sys-notes.pdf). *University of Cambridge*, October 2024. 
-Archived at [perma.cc/SS3Q-FNS5](https://perma.cc/SS3Q-FNS5) - -[[87](/en/ch10#Kingsbury2015elastic-marker)] Kyle Kingsbury. -[Call Me Maybe: -Elasticsearch 1.5.0](https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0). *aphyr.com*, April 2015. -Archived at [perma.cc/37MZ-JT7H](https://perma.cc/37MZ-JT7H) - -[[88](/en/ch10#Howard2015coracle-marker)] Heidi Howard and Jon Crowcroft. -[Coracle: Evaluating -Consensus at the Internet Edge](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p85.pdf). At *Annual Conference of the ACM Special Interest Group on -Data Communication* (SIGCOMM), August 2015. -[doi:10.1145/2829988.2790010](https://doi.org/10.1145/2829988.2790010) - -[[89](/en/ch10#Lianza2020_ch10-marker)] Tom Lianza and Chris Snook. -[A Byzantine failure -in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/). *blog.cloudflare.com*, November 2020. -Archived at [perma.cc/83EZ-ALCY](https://perma.cc/83EZ-ALCY) - -[[90](/en/ch10#Kelly2014-marker)] Ivan Kelly. -[BookKeeper Tutorial](https://github.com/ivankelly/bookkeeper-tutorial). -*github.com*, October 2014. -Archived at [perma.cc/37Y6-VZWU](https://perma.cc/37Y6-VZWU) - -[[91](/en/ch10#Vanlightly2021-marker)] Jack Vanlightly. -[Apache -BookKeeper Insights Part 1 — External Consensus and Dynamic Membership](https://medium.com/splunk-maas/apache-bookkeeper-insights-part-1-external-consensus-and-dynamic-membership-c259f388da21). *medium.com*, November 2021. -Archived at [perma.cc/3MDB-8GFB](https://perma.cc/3MDB-8GFB) \ No newline at end of file +### Footnotes + +### References + +[^1]: Maurice P. Herlihy and Jeannette M. Wing. [Linearizability: A Correctness Condition for Concurrent Objects](https://cs.brown.edu/~mph/HerlihyW90/p463-herlihy.pdf). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 12, issue 3, pages 463–492, July 1990. [doi:10.1145/78969.78972](https://doi.org/10.1145/78969.78972) +[^2]: Leslie Lamport. 
[On interprocess communication](https://www.microsoft.com/en-us/research/publication/interprocess-communication-part-basic-formalism-part-ii-algorithms/). *Distributed Computing*, volume 1, issue 2, pages 77–101, June 1986. [doi:10.1007/BF01786228](https://doi.org/10.1007/BF01786228) +[^3]: David K. Gifford. [Information Storage in a Decentralized Computer System](https://bitsavers.org/pdf/xerox/parc/techReports/CSL-81-8_Information_Storage_in_a_Decentralized_Computer_System.pdf). Xerox Palo Alto Research Centers, CSL-81-8, June 1981. Archived at [perma.cc/2XXP-3JPB](https://perma.cc/2XXP-3JPB) +[^4]: Martin Kleppmann. [Please Stop Calling Databases CP or AP](https://martin.kleppmann.com/2015/05/11/please-stop-calling-databases-cp-or-ap.html). *martin.kleppmann.com*, May 2015. Archived at [perma.cc/MJ5G-75GL](https://perma.cc/MJ5G-75GL) +[^5]: Kyle Kingsbury. [Call Me Maybe: MongoDB Stale Reads](https://aphyr.com/posts/322-call-me-maybe-mongodb-stale-reads). *aphyr.com*, April 2015. Archived at [perma.cc/DXB4-J4JC](https://perma.cc/DXB4-J4JC) +[^6]: Kyle Kingsbury. [Computational Techniques in Knossos](https://aphyr.com/posts/314-computational-techniques-in-knossos). *aphyr.com*, May 2014. Archived at [perma.cc/2X5M-EHTU](https://perma.cc/2X5M-EHTU) +[^7]: Kyle Kingsbury and Peter Alvaro. [Elle: Inferring Isolation Anomalies from Experimental Observations](https://www.vldb.org/pvldb/vol14/p268-alvaro.pdf). *Proceedings of the VLDB Endowment*, volume 14, issue 3, pages 268–280, November 2020. [doi:10.14778/3430915.3430918](https://doi.org/10.14778/3430915.3430918) +[^8]: Paolo Viotti and Marko Vukolić. [Consistency in Non-Transactional Distributed Storage Systems](https://arxiv.org/abs/1512.00168). *ACM Computing Surveys* (CSUR), volume 49, issue 1, article no. 19, June 2016. [doi:10.1145/2926965](https://doi.org/10.1145/2926965) +[^9]: Peter Bailis. [Linearizability Versus Serializability](http://www.bailis.org/blog/linearizability-versus-serializability/). 
*bailis.org*, September 2014. Archived at [perma.cc/386B-KAC3](https://perma.cc/386B-KAC3) +[^10]: Daniel Abadi. [Correctness Anomalies Under Serializable Isolation](https://dbmsmusings.blogspot.com/2019/06/correctness-anomalies-under.html). *dbmsmusings.blogspot.com*, June 2019. Archived at [perma.cc/JGS7-BZFY](https://perma.cc/JGS7-BZFY) +[^11]: Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Highly Available Transactions: Virtues and Limitations](https://www.vldb.org/pvldb/vol7/p181-bailis.pdf). *Proceedings of the VLDB Endowment*, volume 7, issue 3, pages 181–192, November 2013. [doi:10.14778/2732232.2732237](https://doi.org/10.14778/2732232.2732237), extended version published as [arXiv:1302.0309](https://arxiv.org/abs/1302.0309) +[^12]: Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. [*Concurrency Control and Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at [*microsoft.com*](https://www.microsoft.com/en-us/research/people/philbe/book/). +[^13]: Andrei Matei. [CockroachDB’s consistency model](https://www.cockroachlabs.com/blog/consistency-model/). *cockroachlabs.com*, February 2021. Archived at [perma.cc/MR38-883B](https://perma.cc/MR38-883B) +[^14]: Murat Demirbas. [Strict-serializability, but at what cost, for what purpose?](https://muratbuffalo.blogspot.com/2022/08/strict-serializability-but-at-what-cost.html) *muratbuffalo.blogspot.com*, August 2022. Archived at [perma.cc/T8AY-N3U9](https://perma.cc/T8AY-N3U9) +[^15]: Ben Darnell. [How to talk about consistency and isolation in distributed DBs](https://www.cockroachlabs.com/blog/db-consistency-isolation-terminology/). *cockroachlabs.com*, February 2022. Archived at [perma.cc/53SV-JBGK](https://perma.cc/53SV-JBGK) +[^16]: Daniel Abadi. [An explanation of the difference between Isolation levels vs. 
Consistency levels](https://dbmsmusings.blogspot.com/2019/08/an-explanation-of-difference-between.html). *dbmsmusings.blogspot.com*, August 2019. Archived at [perma.cc/QSF2-CD4P](https://perma.cc/QSF2-CD4P) +[^17]: Mike Burrows. [The Chubby Lock Service for Loosely-Coupled Distributed Systems](https://research.google/pubs/pub27897/). At *7th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), November 2006. +[^18]: Flavio P. Junqueira and Benjamin Reed. [*ZooKeeper: Distributed Process Coordination*](https://www.oreilly.com/library/view/zookeeper/9781449361297/). O’Reilly Media, 2013. ISBN: 978-1-449-36130-3 +[^19]: Murali Vallath. [*Oracle 10g RAC Grid, Services & Clustering*](https://www.oreilly.com/library/view/oracle-10g-rac/9781555583217/). Elsevier Digital Press, 2006. ISBN: 978-1-555-58321-7 +[^20]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). *Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014. [doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509) +[^21]: Kyle Kingsbury. [Call Me Maybe: etcd and Consul](https://aphyr.com/posts/316-call-me-maybe-etcd-and-consul). *aphyr.com*, June 2014. Archived at [perma.cc/XL7U-378K](https://perma.cc/XL7U-378K) +[^22]: Flavio P. Junqueira, Benjamin C. Reed, and Marco Serafini. [Zab: High-Performance Broadcast for Primary-Backup Systems](https://marcoserafini.github.io/assets/pdf/zab.pdf). At *41st IEEE International Conference on Dependable Systems and Networks* (DSN), June 2011. [doi:10.1109/DSN.2011.5958223](https://doi.org/10.1109/DSN.2011.5958223) +[^23]: Diego Ongaro and John K. Ousterhout. [In Search of an Understandable Consensus Algorithm](https://www.usenix.org/system/files/conference/atc14/atc14-paper-ongaro.pdf). At *USENIX Annual Technical Conference* (ATC), June 2014.
+[^24]: Hagit Attiya, Amotz Bar-Noy, and Danny Dolev. [Sharing Memory Robustly in Message-Passing Systems](https://www.cs.huji.ac.il/course/2004/dist/p124-attiya.pdf). *Journal of the ACM*, volume 42, issue 1, pages 124–142, January 1995. [doi:10.1145/200836.200869](https://doi.org/10.1145/200836.200869) +[^25]: Nancy Lynch and Alex Shvartsman. [Robust Emulation of Shared Memory Using Dynamic Quorum-Acknowledged Broadcasts](https://groups.csail.mit.edu/tds/papers/Lynch/FTCS97.pdf). At *27th Annual International Symposium on Fault-Tolerant Computing* (FTCS), June 1997. [doi:10.1109/FTCS.1997.614100](https://doi.org/10.1109/FTCS.1997.614100) +[^26]: Christian Cachin, Rachid Guerraoui, and Luís Rodrigues. [*Introduction to Reliable and Secure Distributed Programming*](https://www.distributedprogramming.net/), 2nd edition. Springer, 2011. ISBN: 978-3-642-15259-7, [doi:10.1007/978-3-642-15260-3](https://doi.org/10.1007/978-3-642-15260-3) +[^27]: Niklas Ekström, Mikhail Panchenko, and Jonathan Ellis. [Possible Issue with Read Repair?](https://lists.apache.org/thread/wwsjnnc93mdlpw8nb0d5gn4q1bmpzbon) Email thread on *cassandra-dev* mailing list, October 2012. +[^28]: Maurice P. Herlihy. [Wait-Free Synchronization](https://cs.brown.edu/~mph/Herlihy91/p124-herlihy.pdf). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 13, issue 1, pages 124–149, January 1991. [doi:10.1145/114005.102808](https://doi.org/10.1145/114005.102808) +[^29]: Armando Fox and Eric A. Brewer. [Harvest, Yield, and Scalable Tolerant Systems](https://radlab.cs.berkeley.edu/people/fox/static/pubs/pdf/c18.pdf). At *7th Workshop on Hot Topics in Operating Systems* (HotOS), March 1999. [doi:10.1109/HOTOS.1999.798396](https://doi.org/10.1109/HOTOS.1999.798396) +[^30]: Seth Gilbert and Nancy Lynch. [Brewer’s Conjecture and the Feasibility of Consistent, Available, Partition-Tolerant Web Services](https://www.comp.nus.edu.sg/~gilbert/pubs/BrewersConjecture-SigAct.pdf). 
*ACM SIGACT News*, volume 33, issue 2, pages 51–59, June 2002. [doi:10.1145/564585.564601](https://doi.org/10.1145/564585.564601) +[^31]: Seth Gilbert and Nancy Lynch. [Perspectives on the CAP Theorem](https://groups.csail.mit.edu/tds/papers/Gilbert/Brewer2.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 30–36, February 2012. [doi:10.1109/MC.2011.389](https://doi.org/10.1109/MC.2011.389) +[^32]: Eric A. Brewer. [CAP Twelve Years Later: How the ‘Rules’ Have Changed](https://sites.cs.ucsb.edu/~rich/class/cs293-cloud/papers/brewer-cap.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 23–29, February 2012. [doi:10.1109/MC.2012.37](https://doi.org/10.1109/MC.2012.37) +[^33]: Susan B. Davidson, Hector Garcia-Molina, and Dale Skeen. [Consistency in Partitioned Networks](https://www.cs.rice.edu/~alc/old/comp520/papers/DGS85.pdf). *ACM Computing Surveys*, volume 17, issue 3, pages 341–370, September 1985. [doi:10.1145/5505.5508](https://doi.org/10.1145/5505.5508) +[^34]: Paul R. Johnson and Robert H. Thomas. [RFC 677: The Maintenance of Duplicate Databases](https://tools.ietf.org/html/rfc677). Network Working Group, January 1975. +[^35]: Michael J. Fischer and Alan Michael. [Sacrificing Serializability to Attain High Availability of Data in an Unreliable Network](https://sites.cs.ucsb.edu/~agrawal/spring2011/ugrad/p70-fischer.pdf). At *1st ACM Symposium on Principles of Database Systems* (PODS), March 1982. [doi:10.1145/588111.588124](https://doi.org/10.1145/588111.588124) +[^36]: Eric A. Brewer. [NoSQL: Past, Present, Future](https://www.infoq.com/presentations/NoSQL-History/). At *QCon San Francisco*, November 2012. +[^37]: Adrian Cockcroft. [Migrating to Microservices](https://www.infoq.com/presentations/migration-cloud-native/). At *QCon London*, March 2014. +[^38]: Martin Kleppmann. [A Critique of the CAP Theorem](https://arxiv.org/abs/1509.05393). arXiv:1509.05393, September 2015. +[^39]: Daniel Abadi. 
[Problems with CAP, and Yahoo’s little known NoSQL system](https://dbmsmusings.blogspot.com/2010/04/problems-with-cap-and-yahoos-little.html). *dbmsmusings.blogspot.com*, April 2010. Archived at [perma.cc/4NTZ-CLM9](https://perma.cc/4NTZ-CLM9) +[^40]: Daniel Abadi. [Hazelcast and the Mythical PA/EC System](https://dbmsmusings.blogspot.com/2017/10/hazelcast-and-mythical-paec-system.html). *dbmsmusings.blogspot.com*, October 2017. Archived at [perma.cc/J5XM-U5C2](https://perma.cc/J5XM-U5C2) +[^41]: Eric Brewer. [Spanner, TrueTime & The CAP Theorem](https://research.google.com/pubs/archive/45855.pdf). *research.google.com*, February 2017. Archived at [perma.cc/59UW-RH7N](https://perma.cc/59UW-RH7N) +[^42]: Daniel J. Abadi. [Consistency Tradeoffs in Modern Distributed Database System Design](https://www.cs.umd.edu/~abadi/papers/abadi-pacelc.pdf). *IEEE Computer Magazine*, volume 45, issue 2, pages 37–42, February 2012. [doi:10.1109/MC.2012.33](https://doi.org/10.1109/MC.2012.33) +[^43]: Nancy A. Lynch. [A Hundred Impossibility Proofs for Distributed Computing](https://groups.csail.mit.edu/tds/papers/Lynch/podc89.pdf). At *8th ACM Symposium on Principles of Distributed Computing* (PODC), August 1989. [doi:10.1145/72981.72982](https://doi.org/10.1145/72981.72982) +[^44]: Prince Mahajan, Lorenzo Alvisi, and Mike Dahlin. [Consistency, Availability, and Convergence](https://apps.cs.utexas.edu/tech_reports/reports/tr/TR-2036.pdf). University of Texas at Austin, Department of Computer Science, Tech Report UTCS TR-11-22, May 2011. Archived at [perma.cc/SAV8-9JAJ](https://perma.cc/SAV8-9JAJ) +[^45]: Hagit Attiya, Faith Ellen, and Adam Morrison. [Limitations of Highly-Available Eventually-Consistent Data Stores](https://www.cs.tau.ac.il/~mad/publications/podc2015-replds.pdf). At *ACM Symposium on Principles of Distributed Computing* (PODC), July 2015. 
[doi:10.1145/2767386.2767419](https://doi.org/10.1145/2767386.2767419) +[^46]: Peter Sewell, Susmit Sarkar, Scott Owens, Francesco Zappa Nardelli, and Magnus O. Myreen. [x86-TSO: A Rigorous and Usable Programmer’s Model for x86 Multiprocessors](https://www.cl.cam.ac.uk/~pes20/weakmemory/cacm.pdf). *Communications of the ACM*, volume 53, issue 7, pages 89–97, July 2010. [doi:10.1145/1785414.1785443](https://doi.org/10.1145/1785414.1785443) +[^47]: Martin Thompson. [Memory Barriers/Fences](https://mechanical-sympathy.blogspot.com/2011/07/memory-barriersfences.html). *mechanical-sympathy.blogspot.co.uk*, July 2011. Archived at [perma.cc/7NXM-GC5U](https://perma.cc/7NXM-GC5U) +[^48]: Ulrich Drepper. [What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf). *akkadia.org*, November 2007. Archived at [perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ) +[^49]: Hagit Attiya and Jennifer L. Welch. [Sequential Consistency Versus Linearizability](https://courses.csail.mit.edu/6.852/01/papers/p91-attiya.pdf). *ACM Transactions on Computer Systems* (TOCS), volume 12, issue 2, pages 91–122, May 1994. [doi:10.1145/176575.176576](https://doi.org/10.1145/176575.176576) +[^50]: Kyzer R. Davis, Brad G. Peabody, and Paul J. Leach. [Universally Unique IDentifiers (UUIDs)](https://www.rfc-editor.org/rfc/rfc9562). RFC 9562, IETF, May 2024. +[^51]: Ryan King. [Announcing Snowflake](https://blog.x.com/engineering/en_us/a/2010/announcing-snowflake). *blog.x.com*, June 2010. Archived at [archive.org](https://web.archive.org/web/20241128214604/https%3A//blog.x.com/engineering/en_us/a/2010/announcing-snowflake) +[^52]: Alizain Feerasta. [Universally Unique Lexicographically Sortable Identifier](https://github.com/ulid/spec). *github.com*, 2016. Archived at [perma.cc/NV2Y-ZP8U](https://perma.cc/NV2Y-ZP8U) +[^53]: Rob Conery. [A Better ID Generator for PostgreSQL](https://bigmachine.io/2014/05/29/a-better-id-generator-for-postgresql/). 
*bigmachine.io*, May 2014. Archived at [perma.cc/K7QV-3KFC](https://perma.cc/K7QV-3KFC) +[^54]: Leslie Lamport. [Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, volume 21, issue 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563) +[^55]: Sandeep S. Kulkarni, Murat Demirbas, Deepak Madeppa, Bharadwaj Avva, and Marcelo Leone. [Logical Physical Clocks](https://cse.buffalo.edu/~demirbas/publications/hlc.pdf). At *18th International Conference on Principles of Distributed Systems* (OPODIS), December 2014. [doi:10.1007/978-3-319-14472-6\_2](https://doi.org/10.1007/978-3-319-14472-6_2) +[^56]: Manuel Bravo, Nuno Diegues, Jingna Zeng, Paolo Romano, and Luís Rodrigues. [On the use of Clocks to Enforce Consistency in the Cloud](http://sites.computer.org/debull/A15mar/p18.pdf). *IEEE Data Engineering Bulletin*, volume 38, issue 1, pages 18–31, March 2015. Archived at [perma.cc/68ZU-45SH](https://perma.cc/68ZU-45SH) +[^57]: Daniel Peng and Frank Dabek. [Large-Scale Incremental Processing Using Distributed Transactions and Notifications](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Peng.pdf). At *9th USENIX Conference on Operating Systems Design and Implementation* (OSDI), October 2010. +[^58]: Tushar Deepak Chandra, Robert Griesemer, and Joshua Redstone. [Paxos Made Live – An Engineering Perspective](https://www.read.seas.harvard.edu/~kohler/class/08w-dsi/chandra07paxos.pdf). At *26th ACM Symposium on Principles of Distributed Computing* (PODC), June 2007. [doi:10.1145/1281100.1281103](https://doi.org/10.1145/1281100.1281103) +[^59]: Will Portnoy. [Lessons Learned from Implementing Paxos](https://blog.willportnoy.com/2012/06/lessons-learned-from-paxos.html). *blog.willportnoy.com*, June 2012. Archived at [perma.cc/QHD9-FDD2](https://perma.cc/QHD9-FDD2) +[^60]: Brian M.
Oki and Barbara H. Liskov. [Viewstamped Replication: A New Primary Copy Method to Support Highly-Available Distributed Systems](https://pmg.csail.mit.edu/papers/vr.pdf). At *7th ACM Symposium on Principles of Distributed Computing* (PODC), August 1988. [doi:10.1145/62546.62549](https://doi.org/10.1145/62546.62549) +[^61]: Barbara H. Liskov and James Cowling. [Viewstamped Replication Revisited](https://pmg.csail.mit.edu/papers/vr-revisited.pdf). Massachusetts Institute of Technology, Tech Report MIT-CSAIL-TR-2012-021, July 2012. Archived at [perma.cc/56SJ-WENQ](https://perma.cc/56SJ-WENQ) +[^62]: Leslie Lamport. [The Part-Time Parliament](https://www.microsoft.com/en-us/research/publication/part-time-parliament/). *ACM Transactions on Computer Systems*, volume 16, issue 2, pages 133–169, May 1998. [doi:10.1145/279227.279229](https://doi.org/10.1145/279227.279229) +[^63]: Leslie Lamport. [Paxos Made Simple](https://www.microsoft.com/en-us/research/publication/paxos-made-simple/). *ACM SIGACT News*, volume 32, issue 4, pages 51–58, December 2001. Archived at [perma.cc/82HP-MNKE](https://perma.cc/82HP-MNKE) +[^64]: Robbert van Renesse and Deniz Altinbuken. [Paxos Made Moderately Complex](https://people.cs.umass.edu/~arun/590CC/papers/paxos-moderately-complex.pdf). *ACM Computing Surveys* (CSUR), volume 47, issue 3, article no. 42, February 2015. [doi:10.1145/2673577](https://doi.org/10.1145/2673577) +[^65]: Diego Ongaro. [Consensus: Bridging Theory and Practice](https://github.com/ongardie/dissertation). PhD Thesis, Stanford University, August 2014. Archived at [perma.cc/5VTZ-2ADH](https://perma.cc/5VTZ-2ADH) +[^66]: Heidi Howard, Malte Schwarzkopf, Anil Madhavapeddy, and Jon Crowcroft. [Raft Refloated: Do We Have Consensus?](https://www.cl.cam.ac.uk/research/srg/netos/papers/2015-raftrefloated-osr.pdf) *ACM SIGOPS Operating Systems Review*, volume 49, issue 1, pages 12–21, January 2015. 
[doi:10.1145/2723872.2723876](https://doi.org/10.1145/2723872.2723876) +[^67]: André Medeiros. [ZooKeeper’s Atomic Broadcast Protocol: Theory and Practice](http://www.tcs.hut.fi/Studies/T-79.5001/reports/2012-deSouzaMedeiros.pdf). Aalto University School of Science, March 2012. Archived at [perma.cc/FVL4-JMVA](https://perma.cc/FVL4-JMVA) +[^68]: Robbert van Renesse, Nicolas Schiper, and Fred B. Schneider. [Vive La Différence: Paxos vs. Viewstamped Replication vs. Zab](https://arxiv.org/abs/1309.5671). *IEEE Transactions on Dependable and Secure Computing*, volume 12, issue 4, pages 472–484, September 2014. [doi:10.1109/TDSC.2014.2355848](https://doi.org/10.1109/TDSC.2014.2355848) +[^69]: Heidi Howard and Richard Mortier. [Paxos vs Raft: Have we reached consensus on distributed consensus?](https://arxiv.org/abs/2004.05074). At *7th Workshop on Principles and Practice of Consistency for Distributed Data* (PaPoC), April 2020. [doi:10.1145/3380787.3393681](https://doi.org/10.1145/3380787.3393681) +[^70]: Miguel Castro and Barbara H. Liskov. [Practical Byzantine Fault Tolerance and Proactive Recovery](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/01/p398-castro-bft-tocs.pdf). *ACM Transactions on Computer Systems*, volume 20, issue 4, pages 396–461, November 2002. [doi:10.1145/571637.571640](https://doi.org/10.1145/571637.571640) +[^71]: Shehar Bano, Alberto Sonnino, Mustafa Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis. [SoK: Consensus in the Age of Blockchains](https://smeiklej.com/files/aft19a.pdf). At *1st ACM Conference on Advances in Financial Technologies* (AFT), October 2019. [doi:10.1145/3318041.3355458](https://doi.org/10.1145/3318041.3355458) +[^72]: Michael J. Fischer, Nancy Lynch, and Michael S. Paterson. [Impossibility of Distributed Consensus with One Faulty Process](https://groups.csail.mit.edu/tds/papers/Lynch/jacm85.pdf). *Journal of the ACM*, volume 32, issue 2, pages 374–382, April 1985. 
[doi:10.1145/3149.214121](https://doi.org/10.1145/3149.214121) +[^73]: Tushar Deepak Chandra and Sam Toueg. [Unreliable Failure Detectors for Reliable Distributed Systems](https://courses.csail.mit.edu/6.852/08/papers/CT96-JACM.pdf). *Journal of the ACM*, volume 43, issue 2, pages 225–267, March 1996. [doi:10.1145/226643.226647](https://doi.org/10.1145/226643.226647) +[^74]: Michael Ben-Or. [Another Advantage of Free Choice: Completely Asynchronous Agreement Protocols](https://homepage.cs.uiowa.edu/~ghosh/BenOr.pdf). At *2nd ACM Symposium on Principles of Distributed Computing* (PODC), August 1983. [doi:10.1145/800221.806707](https://doi.org/10.1145/800221.806707) +[^75]: Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. [Consensus in the Presence of Partial Synchrony](https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf). *Journal of the ACM*, volume 35, issue 2, pages 288–323, April 1988. [doi:10.1145/42282.42283](https://doi.org/10.1145/42282.42283) +[^76]: Xavier Défago, André Schiper, and Péter Urbán. [Total Order Broadcast and Multicast Algorithms: Taxonomy and Survey](https://dspace.jaist.ac.jp/dspace/bitstream/10119/4883/1/defago_et_al.pdf). *ACM Computing Surveys*, volume 36, issue 4, pages 372–421, December 2004. [doi:10.1145/1041680.1041682](https://doi.org/10.1145/1041680.1041682) +[^77]: Hagit Attiya and Jennifer Welch. *Distributed Computing: Fundamentals, Simulations and Advanced Topics*, 2nd edition. John Wiley & Sons, 2004. ISBN: 978-0-471-45324-6, [doi:10.1002/0471478210](https://doi.org/10.1002/0471478210) +[^78]: Rachid Guerraoui. [Revisiting the Relationship Between Non-Blocking Atomic Commitment and Consensus](https://citeseerx.ist.psu.edu/pdf/5d06489503b6f791aa56d2d7942359c2592e44b0). At *9th International Workshop on Distributed Algorithms* (WDAG), September 1995. [doi:10.1007/BFb0022140](https://doi.org/10.1007/BFb0022140) +[^79]: Jim N. Gray and Leslie Lamport. 
[Consensus on Transaction Commit](https://dsf.berkeley.edu/cs286/papers/paxoscommit-tods2006.pdf). *ACM Transactions on Database Systems* (TODS), volume 31, issue 1, pages 133–160, March 2006. [doi:10.1145/1132863.1132867](https://doi.org/10.1145/1132863.1132867) +[^80]: Fred B. Schneider. [Implementing Fault-Tolerant Services Using the State Machine Approach: A Tutorial](https://www.cs.cornell.edu/fbs/publications/SMSurvey.pdf). *ACM Computing Surveys*, volume 22, issue 4, pages 299–319, December 1990. [doi:10.1145/98163.98167](https://doi.org/10.1145/98163.98167) +[^81]: Alexander Thomson, Thaddeus Diamond, Shu-Chun Weng, Kun Ren, Philip Shao, and Daniel J. Abadi. [Calvin: Fast Distributed Transactions for Partitioned Database Systems](https://cs.yale.edu/homes/thomson/publications/calvin-sigmod12.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2012. [doi:10.1145/2213836.2213838](https://doi.org/10.1145/2213836.2213838) +[^82]: Mahesh Balakrishnan, Dahlia Malkhi, Ted Wobber, Ming Wu, Vijayan Prabhakaran, Michael Wei, John D. Davis, Sriram Rao, Tao Zou, and Aviad Zuck. [Tango: Distributed Data Structures over a Shared Log](https://www.microsoft.com/en-us/research/publication/tango-distributed-data-structures-over-a-shared-log/). At *24th ACM Symposium on Operating Systems Principles* (SOSP), November 2013. [doi:10.1145/2517349.2522732](https://doi.org/10.1145/2517349.2522732) +[^83]: Mahesh Balakrishnan, Dahlia Malkhi, Vijayan Prabhakaran, Ted Wobber, Michael Wei, and John D. Davis. [CORFU: A Shared Log Design for Flash Clusters](https://www.usenix.org/system/files/conference/nsdi12/nsdi12-final30.pdf). At *9th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), April 2012. +[^84]: Vasilis Gavrielatos, Antonios Katsarakis, and Vijay Nagarajan. [Odyssey: the impact of modern hardware on strongly-consistent replication protocols](https://vasigavr1.github.io/files/Odyssey_Eurosys_2021.pdf). 
At *16th European Conference on Computer Systems* (EuroSys), April 2021. [doi:10.1145/3447786.3456240](https://doi.org/10.1145/3447786.3456240) +[^85]: Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. [Flexible Paxos: Quorum Intersection Revisited](https://drops.dagstuhl.de/opus/volltexte/2017/7094/pdf/LIPIcs-OPODIS-2016-25.pdf). At *20th International Conference on Principles of Distributed Systems* (OPODIS), December 2016. [doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25) +[^86]: Martin Kleppmann. [Distributed Systems lecture notes](https://www.cl.cam.ac.uk/teaching/2425/ConcDisSys/dist-sys-notes.pdf). *University of Cambridge*, October 2024. Archived at [perma.cc/SS3Q-FNS5](https://perma.cc/SS3Q-FNS5) +[^87]: Kyle Kingsbury. [Call Me Maybe: Elasticsearch 1.5.0](https://aphyr.com/posts/323-call-me-maybe-elasticsearch-1-5-0). *aphyr.com*, April 2015. Archived at [perma.cc/37MZ-JT7H](https://perma.cc/37MZ-JT7H) +[^88]: Heidi Howard and Jon Crowcroft. [Coracle: Evaluating Consensus at the Internet Edge](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p85.pdf). At *Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. [doi:10.1145/2829988.2790010](https://doi.org/10.1145/2829988.2790010) +[^89]: Tom Lianza and Chris Snook. [A Byzantine failure in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/). *blog.cloudflare.com*, November 2020. Archived at [perma.cc/83EZ-ALCY](https://perma.cc/83EZ-ALCY) +[^90]: Ivan Kelly. [BookKeeper Tutorial](https://github.com/ivankelly/bookkeeper-tutorial). *github.com*, October 2014. Archived at [perma.cc/37Y6-VZWU](https://perma.cc/37Y6-VZWU) +[^91]: Jack Vanlightly. [Apache BookKeeper Insights Part 1 — External Consensus and Dynamic Membership](https://medium.com/splunk-maas/apache-bookkeeper-insights-part-1-external-consensus-and-dynamic-membership-c259f388da21). *medium.com*, November 2021. 
Archived at [perma.cc/3MDB-8GFB](https://perma.cc/3MDB-8GFB) \ No newline at end of file diff --git a/content/en/ch2.md b/content/en/ch2.md index 584a31c..a3ba5ec 100644 --- a/content/en/ch2.md +++ b/content/en/ch2.md @@ -45,15 +45,11 @@ scalability. Imagine you are given the task of implementing a social network in the style of X (formerly Twitter), in which users can post messages and follow other users. This will be a huge -simplification of how such a service actually works -[[1](/en/ch2#Cvet2016), -[2](/en/ch2#Krikorian2012_ch2), -[3](/en/ch2#Twitter2023)], +simplification of how such a service actually works [^1] [^2] [^3], but it will help illustrate some of the issues that arise in large-scale systems. Let’s assume that users make 500 million posts per day, or 5,700 posts per second on average. -Occasionally, the rate can spike as high as 150,000 posts/second -[[4](/en/ch2#Krikorian2013)]. +Occasionally, the rate can spike as high as 150,000 posts/second [^4]. Let’s also assume that the average user follows 200 people and has 200 followers (although there is a very wide range: most people have only a handful of followers, and a few celebrities such as Barack Obama have over 100 million followers). @@ -143,7 +139,7 @@ extreme cases: unlikely that the user is actually reading all of the posts in their timeline, and therefore it’s okay to simply drop some of their timeline writes and show the user only a sample of the posts from the accounts they’re following - [[5](/en/ch2#Volpert2025)]. + [^5]. * When a celebrity account with a very large number of followers makes a post, we have to do a large amount of work to insert that post into the home timelines of each of their millions of followers. In this case it’s not okay to drop some of those writes. 
One way of solving this problem is to @@ -151,7 +147,7 @@ extreme cases: adding them to millions of timelines by storing the celebrity posts separately and merging them with the materialized timeline when it is read. Despite such optimizations, handling celebrities on a social network can require a lot of infrastructure - [[6](/en/ch2#Axon2010_ch2)]. + [^6]. # Describing Performance @@ -201,14 +197,14 @@ retries on the client side (*exponential backoff* and temporarily stop sending requests to a service that has returned errors or timed out recently (using a *circuit breaker* [[11](/en/ch2#Nygard2018), [12](/en/ch2#Chen2022)] -or *token bucket* algorithm [[13](/en/ch2#Brooker2022retries)]). +or *token bucket* algorithm [^13]). The server can also detect when it is approaching overload and start proactively rejecting requests -(*load shedding* [[14](/en/ch2#YanacekLoadShedding)]), and send back +(*load shedding* [^14]), and send back responses asking clients to slow down (*backpressure* [[1](/en/ch2#Cvet2016), [15](/en/ch2#Sackman2016_ch2)]). The choice of queueing and load-balancing algorithms can also make a difference -[[16](/en/ch2#Kopytkov2018)]. +[^16]. In terms of performance metrics, the response time is usually what users care about the most, whereas the throughput determines the required computing resources (e.g., how many servers you need), @@ -247,7 +243,7 @@ The response time can vary significantly from one request to the next, even if y same request over and over again. Many factors can add random delays: for example, a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack -[[17](/en/ch2#Gunawi2018_ch2)], +[^17], or many other causes. We will discuss this topic in more detail in [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing). 
Queueing delays often account for a large part of the variability in response times. As a server @@ -272,7 +268,7 @@ Variation in network delay is also known as *jitter*. It’s common to report the *average* response time of a service (technically, the *arithmetic mean*: that is, sum all the response times, and divide by the number of requests). The mean response time -is useful for estimating throughput limits [[18](/en/ch2#Brooker2017)]. +is useful for estimating throughput limits [^18]. However, the mean is not a very good metric if you want to know your “typical” response time, because it doesn’t tell you how many users actually experienced that delay. @@ -296,7 +292,7 @@ requirements for internal services in terms of the 99.9th percentile, even thoug in 1,000 requests. This is because the customers with the slowest requests are often those who have the most data on their accounts because they have made many purchases—that is, they’re the most valuable customers -[[19](/en/ch2#DeCandia2007_ch1)]. +[^19]. It’s important to keep those customers happy by ensuring the website is fast for them. On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed @@ -307,22 +303,22 @@ control, and the benefits are diminishing. # The user impact of response times It seems intuitively obvious that a fast service is better for users than a slow service -[[20](/en/ch2#Whitenton2020)]. +[^20]. However, it is surprisingly difficult to get hold of reliable data to quantify the effect that latency has on user behavior. Some often-cited statistics are unreliable. In 2006 Google reported that a slowdown in search results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue -[[21](/en/ch2#Linden2006)]. +[^21]. 
However, another Google study from 2009 reported that a 400 ms increase in latency resulted in only 0.6% fewer searches per day -[[22](/en/ch2#Brutlag2009)], +[^22], and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3% -[[23](/en/ch2#Schurman2009)]. +[^23]. Newer data from these companies appears not to be publicly available. A more recent Akamai study -[[24](/en/ch2#Akamai2017)] +[^24] claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites by up to 7%; however, on closer inspection, the same study reveals that very *fast* page load times are also correlated with lower conversion rates! This seemingly paradoxical result is explained by @@ -331,7 +327,7 @@ error pages). However, since the study makes no effort to separate the effects o the effects of load time, its results are probably not meaningful. A study by Yahoo -[[25](/en/ch2#Bai2017)] +[^25] compares click-through rates on fast-loading versus slow-loading search results, controlling for quality of search results. It finds 20–30% more clicks on fast searches when the difference between fast and slow responses is 1.25 seconds or more. @@ -345,7 +341,7 @@ slow call to make the entire end-user request slow, as illustrated in [Figure 2 Even if only a small percentage of backend calls are slow, the chance of getting a slow call increases if an end-user request requires multiple backend calls, and so a higher proportion of end-user requests end up being slow (an effect known as *tail latency amplification* -[[26](/en/ch2#Dean2013_ch2)]). +[^26]). ![ddia 0206](/fig/ddia_0206.png) @@ -353,7 +349,7 @@ end-user requests end up being slow (an effect known as *tail latency amplificat Percentiles are often used in *service level objectives* (SLOs) and *service level agreements* (SLAs) as ways of defining the expected performance and availability of a service -[[27](/en/ch2#Hidalgo2020)]. +[^27]. 
For example, an SLO may set a target for a service to have a median response time of less than 200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not @@ -375,12 +371,12 @@ that can calculate a good approximation of percentiles at minimal CPU and memory Open source percentile estimation libraries include HdrHistogram, t-digest [[30](/en/ch2#Dunning2021), [31](/en/ch2#Kohn2021)], -OpenHistogram [[32](/en/ch2#Hartmann2020)], and DDSketch -[[33](/en/ch2#Masson2019)]. +OpenHistogram [^32], and DDSketch +[^33]. Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from several machines, is mathematically meaningless—the right way of aggregating response time data -is to add the histograms [[34](/en/ch2#Schwartz2015)]. +is to add the histograms [^34]. # Reliability and Fault Tolerance @@ -438,12 +434,12 @@ getting that budget item approved. Counter-intuitively, in such fault-tolerant systems, it can make sense to *increase* the rate of faults by triggering them deliberately—for example, by randomly killing individual processes without warning. This is called *fault injection*. Many critical bugs are actually due to poor error -handling [[38](/en/ch2#Yuan2014)]; by deliberately inducing faults, you ensure +handling [^38]; by deliberately inducing faults, you ensure that the fault-tolerance machinery is continually exercised and tested, which can increase your confidence that faults will be handled correctly when they occur naturally. *Chaos engineering* is a discipline that aims to improve confidence in fault-tolerance mechanisms through experiments such as deliberately injecting faults -[[39](/en/ch2#Rosenthal2020)]. +[^39]. Although we generally prefer tolerating faults over preventing faults, there are cases where prevention is better than cure (e.g., because no cure exists). 
This is the case with security @@ -460,11 +456,11 @@ When we think of causes of system failure, hardware faults quickly come to mind: [41](/en/ch2#Schroeder2007)]; in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day. Recent data suggests that disks are getting more reliable, but failure rates remain significant - [[42](/en/ch2#Klein2021)]. + [^42]. * Approximately 0.5–1% of solid state drives (SSDs) fail per year - [[43](/en/ch2#Narayanan2016)]. + [^43]. Small numbers of bit errors are corrected automatically - [[44](/en/ch2#Alibaba2019_ch2)], + [^44], but uncorrectable errors occur approximately once per year per drive, even in drives that are fairly new (i.e., that have experienced little wear); this error rate is higher than that of magnetic hard drives @@ -485,19 +481,19 @@ When we think of causes of system failure, hardware faults quickly come to mind: permanent physical defects. Even when memory with error-correcting codes (ECC) is used, more than 1% of machines encounter an uncorrectable error in a given year, which typically leads to a crash of the machine and the affected memory module needing to be replaced - [[52](/en/ch2#Schroeder2009)]. + [^52]. Moreover, certain pathological memory access patterns can flip bits with high probability - [[53](/en/ch2#Kim2014)]. + [^53]. * An entire datacenter might become unavailable (for example, due to power outage or network misconfiguration) or even be permanently destroyed (for example by fire, flood, or earthquake - [[54](/en/ch2#Bray2021)]). + [^54]). A solar storm, which induces large electrical currents in long-distance wires when the sun ejects a large mass of charged particles, could damage power grids and undersea network cables - [[55](/en/ch2#AbduJyothi2021)]. + [^55]. Although such large-scale failures are rare, their impact can be catastrophic if a service cannot tolerate the loss of a datacenter - [[56](/en/ch2#Cockcroft2019)]. + [^56]. 
These events are rare enough that you often don’t need to worry about them when working on a small system, as long as you can easily replace hardware that becomes faulty. However, in a large-scale @@ -551,25 +547,25 @@ common for many nodes to run the same software and thus have the same bugs [[59](/en/ch2#Gunawi2014), [60](/en/ch2#Kreps2012_ch1)]. Such faults are harder to anticipate, and they tend to cause many more system failures than -uncorrelated hardware faults [[47](/en/ch2#Ford2010)]. For example: +uncorrelated hardware faults [^47]. For example: * A software bug that causes every node to fail at the same time in particular circumstances. For example, on June 30, 2012, a leap second caused many Java applications to hang simultaneously due to a bug in the Linux kernel, bringing down many Internet services - [[61](/en/ch2#Minar2012_ch1)]. + [^61]. Due to a firmware bug, all SSDs of certain models suddenly fail after precisely 32,768 hours of operation (less than 4 years), rendering the data on them unrecoverable - [[62](/en/ch2#HPE2019_ch2)]. + [^62]. * A runaway process that uses up some shared, limited resource, such as CPU time, memory, disk space, network bandwidth, or threads - [[63](/en/ch2#Hochstein2020)]. + [^63]. For example, a process that consumes too much memory while processing a large request may be killed by the operating system. A bug in a client library could cause a much higher request - volume than anticipated [[64](/en/ch2#McCaffrey2015)]. + volume than anticipated [^64]. * A service that the system depends on slows down, becomes unresponsive, or starts returning corrupted responses. * An interaction between different systems results in emergent behavior that does not occur when - each system was tested in isolation [[65](/en/ch2#Tang2023)]. + each system was tested in isolation [^65]. 
* Cascading failures, where a problem in one component causes another component to become overloaded and slow down, which in turn brings down another component [[66](/en/ch2#Ulrich2016), @@ -595,19 +591,19 @@ adaptive in getting their job done. However, this characteristic also leads to u sometimes mistakes that can lead to failures, despite best intentions. For example, one study of large internet services found that configuration changes by operators were the leading cause of outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages -[[70](/en/ch2#Oppenheimer2003)]. +[^70]. It is tempting to label such problems as “human error” and to wish that they could be solved by better controlling human behavior through tighter procedures and compliance with rules. However, blaming people for mistakes is counterproductive. What we call “human error” is not really the cause of an incident, but rather a symptom of a problem with the sociotechnical system in which people are -trying their best to do their jobs [[71](/en/ch2#Dekker2017)]. +trying their best to do their jobs [^71]. Often complex systems have emergent behavior, in which unexpected interactions between components -may also lead to failures [[72](/en/ch2#Dekker2011)]. +may also lead to failures [^72]. Various technical measures can help minimize the impact of human mistakes, including thorough testing (both hand-written tests and *property testing* on lots of random inputs) -[[38](/en/ch2#Yuan2014)], rollback mechanisms for quickly +[^38], rollback mechanisms for quickly reverting configuration changes, gradual roll-outs of new code, detailed and clear monitoring, observability tools for diagnosing production issues (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)), and well-designed interfaces that encourage “the right thing” and discourage “the wrong thing”. @@ -622,7 +618,7 @@ problem is the organization’s priorities. 
Increasingly, organizations are adopting a culture of *blameless postmortems*: after an incident, the people involved are encouraged to share full details about what happened, without fear of punishment, since this allows others in the organization to learn how to prevent similar problems in -the future [[73](/en/ch2#Allspaw2012)]. +the future [^73]. This process may uncover a need to change business priorities, a need to invest in areas that have been neglected, a need to change the incentives for the people involved, or some other systemic issue that needs to be brought to the management’s attention. @@ -632,7 +628,7 @@ answers. “Bob should have been more careful when deploying that change” is n neither is “We must rewrite the backend in Haskell.” Instead, management should take the opportunity to learn the details of how the sociotechnical system works from the point of view of the people who work with it every day, and take steps to improve it based on this feedback -[[71](/en/ch2#Dekker2017)]. +[^71]. # How Important Is Reliability? @@ -642,21 +638,21 @@ risks if figures are reported incorrectly), and outages of e-commerce sites can terms of lost revenue and damage to reputation. In many applications, a temporary outage of a few minutes or even a few hours is tolerable -[[74](/en/ch2#Sabo2023)], +[^74], but permanent data loss or corruption would be catastrophic. Consider a parent who stores all their pictures and videos of their children in your photo application -[[75](/en/ch2#Jurewitz2013)]. How would they +[^75]. How would they feel if that database was suddenly corrupted? Would they know how to restore it from a backup? As another example of how unreliable software can harm people, consider the Post Office Horizon scandal. Between 1999 and 2019, hundreds of people managing Post Office branches in Britain were convicted of theft or fraud because the accounting software showed a shortfall in their accounts. 
Eventually it became clear that many of these shortfalls were due to bugs in the software, and many -convictions have since been overturned [[76](/en/ch2#Halper2025)]. +convictions have since been overturned [^76]. What led to this, probably the largest miscarriage of justice in British history, is the fact that English law assumes that computers operate correctly (and hence, evidence produced by computers is reliable) unless there is evidence to the contrary -[[77](/en/ch2#Bohm2022)]. +[^77]. Software engineers may laugh at the idea that software could ever be bug-free, but this is little solace to the people who were wrongfully imprisoned, declared bankrupt, or even committed suicide as a result of a wrongful conviction due to an unreliable computer system. @@ -680,7 +676,7 @@ to you depends on the type of application you are building. If you are building a new product that currently only has a small number of users, perhaps at a startup, the overriding engineering goal is usually to keep the system as simple and flexible as possible, so that you can easily modify and adapt the features of your product as you learn more -about customers’ needs [[78](/en/ch2#McKinley2015)]. +about customers’ needs [^78]. In such an environment, it is counterproductive to worry about hypothetical scale that might be needed in the future: in the best case, investments in scalability are wasted effort and premature optimization; in the worst case, they lock you into an inflexible design and make it harder to @@ -758,10 +754,10 @@ CPUs and RAM, but which stores data on an array of disks that is shared between are connected via a fast network: *Network-Attached Storage* (NAS) or *Storage Area Network* (SAN). This architecture has traditionally been used for on-premises data warehousing workloads, but contention and the overhead of locking limit the scalability of the shared-disk approach -[[81](/en/ch2#Stopford2009)]. +[^81]. 
By contrast, the *shared-nothing architecture* -[[82](/en/ch2#Stonebraker1986)] +[^82] (also called *horizontal scaling* or *scaling out*) has gained a lot of popularity. In this approach, we use a distributed system with multiple nodes, each of which has its own CPUs, RAM, and disks. Any coordination between nodes is done at the software level, via a conventional network. @@ -778,7 +774,7 @@ Some cloud-native database systems use separate services for storage and transac storage service. This model has some similarity to a shared-disk architecture, but it avoids the scalability problems of older systems: instead of providing a filesystem (NAS) or block device (SAN) abstraction, the storage service offers a specialized API that is designed for the specific needs of -the database [[83](/en/ch2#Antonopoulos2019_ch2)]. +the database [^83]. ## Principles for Scalability @@ -801,7 +797,7 @@ operate largely independently from each other. This is the underlying principle ([Link to Come]), and shared-nothing architectures. However, the challenge is in knowing where to draw the line between things that should be together, and things that should be apart. Design guidelines for microservices can be found in other books -[[84](/en/ch2#Newman2021_ch2)], +[^84], and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding). Another good principle is not to make things more complicated than necessary. If a single-machine @@ -830,7 +826,7 @@ and COBOL code); institutional knowledge of how and why a system was designed in have been lost as people have left the organization; it might be necessary to fix other people’s mistakes. Moreover, the computer system is often intertwined with the human organization that it supports, which means that maintenance of such *legacy* systems is as much a people problem as a -technical one [[87](/en/ch2#Bellotti2021)]. +technical one [^87]. 
Every system we create today will one day become a legacy system if it is valuable enough to survive for a long time. In order to minimize the pain for future generations who need to maintain our @@ -855,14 +851,14 @@ We previously discussed the role of operations in [“Operations in the Cloud Er human processes are at least as important for reliable operations as software tools. In fact, it has been suggested that “good operations can often work around the limitations of bad (or incomplete) software, but good software cannot run reliably with bad operations” -[[60](/en/ch2#Kreps2012_ch1)]. +[^60]. In large-scale systems consisting of many thousands of machines, manual maintenance would be unreasonably expensive, and automation is essential. However, automation can be a two-edged sword: there will always be edge cases (such as rare failure scenarios) that require manual intervention from the operations team. Since the cases that cannot be handled automatically are the most complex issues, greater automation requires a *more* skilled operations team that can resolve those issues -[[88](/en/ch2#Bainbridge1983)]. +[^88]. Moreover, if an automated system goes wrong, it is often harder to troubleshoot than a system that relies on an operator to perform some actions manually. For that reason, it is not the case that @@ -871,12 +867,12 @@ and the sweet spot will depend on the specifics of your particular application a Good operability means making routine tasks easy, allowing the operations team to focus their efforts on high-value activities. Data systems can do various things to make routine tasks easy, including -[[89](/en/ch2#Hamilton2007)]: +[^89]: * Allowing monitoring tools to check the system’s key metrics, and supporting observability tools (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)) to give insights into the system’s runtime behavior. 
A variety of commercial and open source tools can help here - [[90](/en/ch2#Horovits2021)]. + [^90]. * Avoiding dependency on individual machines (allowing machines to be taken down for maintenance while the system as a whole continues running uninterrupted) * Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”) @@ -890,30 +886,30 @@ Small software projects can have delightfully simple and expressive code, but as larger, they often become very complex and difficult to understand. This complexity slows down everyone who needs to work on the system, further increasing the cost of maintenance. A software project mired in complexity is sometimes described as a *big ball of mud* -[[91](/en/ch2#Foote1997)]. +[^91]. When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked -[[69](/en/ch2#Woods2017)]. +[^69]. Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build. Simple systems are easier to understand, and therefore we should try to solve a given problem in the simplest way possible. Unfortunately, this is easier said than done. Whether something is simple or not is often a subjective matter of taste, as there is no objective standard of simplicity -[[92](/en/ch2#Brooker2022)]. +[^92]. For example, one system may hide a complex implementation behind a simple interface, whereas another may have a simple implementation that exposes more internal detail to its users—which one is simpler? One attempt at reasoning about complexity has been to break it down into two categories, *essential* -and *accidental* complexity [[93](/en/ch2#Brooks1995)]. 
+and *accidental* complexity [^93]. The idea is that essential complexity is inherent in the problem domain of the application, while accidental complexity arises only because of limitations of our tooling. Unfortunately, this distinction is also flawed, because boundaries between the essential and the accidental shift as our -tooling evolves [[94](/en/ch2#Luu2020)]. +tooling evolves [^94]. One of the best tools we have for managing complexity is *abstraction*. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade. A good @@ -929,8 +925,8 @@ programming in a high-level language, we are still using machine code; we are ju Abstractions for application code, which aim to reduce its complexity, can be created using methodologies such as *design patterns* -[[95](/en/ch2#Gamma1994)] -and *domain-driven design* (DDD) [[96](/en/ch2#Evans2003)]. +[^95] +and *domain-driven design* (DDD) [^96]. This book is not about such application-specific abstractions, but rather about general-purpose abstractions on top of which you can build your applications, such as database transactions, indexes, and event logs. If you want to use techniques such as DDD, you can implement them on top of @@ -953,11 +949,11 @@ The ease with which you can modify a data system, and adapt it to changing requi linked to its simplicity and its abstractions: loosely-coupled, simple systems are usually easier to modify than tightly-coupled, complex ones. Since this is such an important idea, we will use a different word to refer to agility on a data system level: *evolvability* -[[97](/en/ch2#Breivold2008)]. +[^97]. One major factor that makes change difficult in large systems is when some action is irreversible, and therefore that action needs to be taken very carefully -[[98](/en/ch2#Zaninotto2002)]. +[^98]. 
For example, say you are migrating from one database to another: if you cannot switch back to the old system in case of problems with the new one, the stakes are much higher than if you can easily go back. Minimizing irreversibility improves flexibility. @@ -990,529 +986,105 @@ There are no easy answers on how to achieve these things, but one thing that can applications using well-understood building blocks that provide useful abstractions. The rest of this book will cover a selection of building blocks that have proved to be valuable in practice. -##### Footnotes - ##### References -[[1](/en/ch2#Cvet2016-marker)] Mike Cvet. -[How We Learned to Stop Worrying and Love -Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016. -[[2](/en/ch2#Krikorian2012_ch2-marker)] Raffi Krikorian. -[Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). -At *QCon San Francisco*, November 2012. -Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK) - -[[3](/en/ch2#Twitter2023-marker)] Twitter. -[Twitter’s -Recommendation Algorithm](https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm). *blog.twitter.com*, March 2023. -Archived at [perma.cc/L5GT-229T](https://perma.cc/L5GT-229T) - -[[4](/en/ch2#Krikorian2013-marker)] Raffi Krikorian. -[New -Tweets per second record, and how!](https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how) *blog.twitter.com*, August 2013. -Archived at [perma.cc/6JZN-XJYN](https://perma.cc/6JZN-XJYN) - -[[5](/en/ch2#Volpert2025-marker)] Jaz Volpert. -[When Imperfect Systems are Good, Actually: -Bluesky’s Lossy Timelines](https://jazco.dev/2025/02/19/imperfection/). *jazco.dev*, February 2025. -Archived at [perma.cc/2PVE-L2MX](https://perma.cc/2PVE-L2MX) - -[[6](/en/ch2#Axon2010_ch2-marker)] Samuel Axon. 
-[3% of Twitter’s Servers -Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010. -Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX) - -[[7](/en/ch2#Bronson2021-marker)] Nathan Bronson, Abutalib Aghayev, Aleksey -Charapko, and Timothy Zhu. -[Metastable -Failures in Distributed Systems](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf). -At *Workshop on Hot Topics in Operating Systems* (HotOS), May 2021. -[doi:10.1145/3458336.3465286](https://doi.org/10.1145/3458336.3465286) - -[[8](/en/ch2#Brooker2021-marker)] Marc Brooker. -[Metastability and Distributed -Systems](https://brooker.co.za/blog/2021/05/24/metastable.html). *brooker.co.za*, May 2021. -Archived at [perma.cc/7FGJ-7XRK](https://perma.cc/7FGJ-7XRK) - -[[9](/en/ch2#Brooker2015-marker)] Marc Brooker. -[Exponential -Backoff And Jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/). *aws.amazon.com*, March 2015. -Archived at [perma.cc/R6MS-AZKH](https://perma.cc/R6MS-AZKH) - -[[10](/en/ch2#Brooker2022backoff-marker)] Marc Brooker. -[What is Backoff For?](https://brooker.co.za/blog/2022/08/11/backoff.html) -*brooker.co.za*, August 2022. -Archived at [perma.cc/PW9N-55Q5](https://perma.cc/PW9N-55Q5) - -[[11](/en/ch2#Nygard2018-marker)] Michael T. Nygard. -[*Release It!*](https://learning.oreilly.com/library/view/release-it-2nd/9781680504552/), -2nd Edition. Pragmatic Bookshelf, January 2018. ISBN: 9781680502398 - -[[12](/en/ch2#Chen2022-marker)] Frank Chen. -[Slowing Down to Speed Up – Circuit Breakers -for Slack’s CI/CD](https://slack.engineering/circuit-breakers/). *slack.engineering*, August 2022. -Archived at [perma.cc/5FGS-ZPH3](https://perma.cc/5FGS-ZPH3) - -[[13](/en/ch2#Brooker2022retries-marker)] Marc Brooker. -[Fixing retries with token buckets and -circuit breakers](https://brooker.co.za/blog/2022/02/28/retries.html). *brooker.co.za*, February 2022. 
-Archived at [perma.cc/MD6N-GW26](https://perma.cc/MD6N-GW26) - -[[14](/en/ch2#YanacekLoadShedding-marker)] David Yanacek. -[Using load -shedding to avoid overload](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/). Amazon Builders’ Library, *aws.amazon.com*. -Archived at [perma.cc/9SAW-68MP](https://perma.cc/9SAW-68MP) - -[[15](/en/ch2#Sackman2016_ch2-marker)] Matthew Sackman. -[Pushing Back](https://wellquite.org/posts/lshift/pushing_back/). -*wellquite.org*, May 2016. -Archived at [perma.cc/3KCZ-RUFY](https://perma.cc/3KCZ-RUFY) - -[[16](/en/ch2#Kopytkov2018-marker)] Dmitry Kopytkov and Patrick Lee. -[Meet Bandaid, -the Dropbox service proxy](https://dropbox.tech/infrastructure/meet-bandaid-the-dropbox-service-proxy). *dropbox.tech*, March 2018. -Archived at [perma.cc/KUU6-YG4S](https://perma.cc/KUU6-YG4S) - -[[17](/en/ch2#Gunawi2018_ch2-marker)] Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, -Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, -Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert -Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. -[Fail-Slow at -Scale: Evidence of Hardware Performance Faults in Large Production Systems](https://www.usenix.org/system/files/conference/fast18/fast18-gunawi.pdf). -At *16th USENIX Conference on File and Storage Technologies*, February 2018. - -[[18](/en/ch2#Brooker2017-marker)] Marc Brooker. -[Is the Mean Really Useless?](https://brooker.co.za/blog/2017/12/28/mean.html) -*brooker.co.za*, December 2017. -Archived at [perma.cc/U5AE-CVEM](https://perma.cc/U5AE-CVEM) - -[[19](/en/ch2#DeCandia2007_ch1-marker)] Giuseppe DeCandia, Deniz Hastorun, Madan -Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter -Vosshall, and Werner Vogels. 
-[Dynamo: -Amazon’s Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating -Systems Principles* (SOSP), October 2007. -[doi:10.1145/1294261.1294281](https://doi.org/10.1145/1294261.1294281) - -[[20](/en/ch2#Whitenton2020-marker)] Kathryn Whitenton. -[The Need for Speed, 23 Years Later](https://www.nngroup.com/articles/the-need-for-speed/). -*nngroup.com*, May 2020. -Archived at [perma.cc/C4ER-LZYA](https://perma.cc/C4ER-LZYA) - -[[21](/en/ch2#Linden2006-marker)] Greg Linden. -[Marissa Mayer at Web 2.0](https://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html). -*glinden.blogspot.com*, November 2005. -Archived at [perma.cc/V7EA-3VXB](https://perma.cc/V7EA-3VXB) - -[[22](/en/ch2#Brutlag2009-marker)] Jake Brutlag. -[Speed Matters for Google -Web Search](https://services.google.com/fh/files/blogs/google_delayexp.pdf). *services.google.com*, June 2009. -Archived at [perma.cc/BK7R-X7M2](https://perma.cc/BK7R-X7M2) - -[[23](/en/ch2#Schurman2009-marker)] Eric Schurman and Jake Brutlag. -[Performance Related Changes and their User Impact](https://www.youtube.com/watch?v=bQSE51-gr2s). -Talk at *Velocity 2009*. - -[[24](/en/ch2#Akamai2017-marker)] Akamai Technologies, Inc. -[The -State of Online Retail Performance](https://web.archive.org/web/20210729180749/https%3A//www.akamai.com/us/en/multimedia/documents/report/akamai-state-of-online-retail-performance-spring-2017.pdf). *akamai.com*, April 2017. -Archived at [perma.cc/UEK2-HYCS](https://perma.cc/UEK2-HYCS) - -[[25](/en/ch2#Bai2017-marker)] Xiao Bai, Ioannis Arapakis, B. Barla Cambazoglu, and Ana Freire. -[Understanding and Leveraging the Impact of -Response Latency on User Behaviour in Web Search](https://iarapakis.github.io/papers/TOIS17.pdf). *ACM Transactions on Information Systems*, -volume 36, issue 2, article 21, April 2018. 
-[doi:10.1145/3106372](https://doi.org/10.1145/3106372) - -[[26](/en/ch2#Dean2013_ch2-marker)] Jeffrey Dean and Luiz André Barroso. -[The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/). -*Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013. -[doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794) - -[[27](/en/ch2#Hidalgo2020-marker)] Alex Hidalgo. -[*Implementing -Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets*](https://www.oreilly.com/library/view/implementing-service-level/9781492076803/). O’Reilly -Media, September 2020. ISBN: 1492076813 - -[[28](/en/ch2#Mogul2019-marker)] Jeffrey C. Mogul and John Wilkes. -[Nines are Not Enough: Meaningful Metrics for -Clouds](https://research.google/pubs/pub48033/). At *17th Workshop on Hot Topics in Operating Systems* (HotOS), May 2019. -[doi:10.1145/3317550.3321432](https://doi.org/10.1145/3317550.3321432) - -[[29](/en/ch2#Hauer2020-marker)] Tamás Hauer, Philipp Hoffmann, John Lunney, Dan Ardelean, and Amer Diwan. -[Meaningful Availability](https://www.usenix.org/conference/nsdi20/presentation/hauer). -At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020. - -[[30](/en/ch2#Dunning2021-marker)] Ted Dunning. -[The t-digest: -Efficient estimates of distributions](https://www.sciencedirect.com/science/article/pii/S2665963820300403). *Software Impacts*, volume 7, article 100049, February 2021. -[doi:10.1016/j.simpa.2020.100049](https://doi.org/10.1016/j.simpa.2020.100049) - -[[31](/en/ch2#Kohn2021-marker)] David Kohn. -[How -percentile approximation works (and why it’s more useful than averages)](https://www.timescale.com/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/). *timescale.com*, -September 2021. Archived at [perma.cc/3PDP-NR8B](https://perma.cc/3PDP-NR8B) - -[[32](/en/ch2#Hartmann2020-marker)] Heinrich Hartmann and Theo Schlossnagle. 
-[Circllhist — A Log-Linear Histogram Data Structure -for IT Infrastructure Monitoring](https://arxiv.org/pdf/2001.06561.pdf). *arxiv.org*, January 2020. - -[[33](/en/ch2#Masson2019-marker)] Charles Masson, Jee E. Rim, and Homin K. Lee. -[DDSketch: A Fast and Fully-Mergeable -Quantile Sketch with Relative-Error Guarantees](https://www.vldb.org/pvldb/vol12/p2195-masson.pdf). *Proceedings of the VLDB Endowment*, -volume 12, issue 12, pages 2195–2205, August 2019. -[doi:10.14778/3352063.3352135](https://doi.org/10.14778/3352063.3352135) - -[[34](/en/ch2#Schwartz2015-marker)] Baron Schwartz. -[Why -Percentiles Don’t Work the Way You Think](https://orangematter.solarwinds.com/2016/11/18/why-percentiles-dont-work-the-way-you-think/). *solarwinds.com*, November 2016. -Archived at [perma.cc/469T-6UGB](https://perma.cc/469T-6UGB) - -[[35](/en/ch2#Heimerdinger1992-marker)] Walter L. Heimerdinger and Charles B. Weinstock. -[A Conceptual -Framework for System Fault Tolerance](https://resources.sei.cmu.edu/asset_files/TechnicalReport/1992_005_001_16112.pdf). Technical Report CMU/SEI-92-TR-033, Software Engineering -Institute, Carnegie Mellon University, October 1992. -Archived at [perma.cc/GD2V-DMJW](https://perma.cc/GD2V-DMJW) - -[[36](/en/ch2#Gaertner1999-marker)] Felix C. Gärtner. -[Fundamentals of fault-tolerant -distributed computing in asynchronous environments](https://dl.acm.org/doi/pdf/10.1145/311531.311532). *ACM Computing Surveys*, volume 31, -issue 1, pages 1–26, March 1999. -[doi:10.1145/311531.311532](https://doi.org/10.1145/311531.311532) - -[[37](/en/ch2#Avizienis2004-marker)] Algirdas Avižienis, Jean-Claude Laprie, Brian Randell, -and Carl Landwehr. -[Basic Concepts and Taxonomy of Dependable and Secure -Computing](https://hdl.handle.net/1903/6459). *IEEE Transactions on Dependable and Secure Computing*, volume 1, issue 1, -January 2004. 
[doi:10.1109/TDSC.2004.2](https://doi.org/10.1109/TDSC.2004.2) - -[[38](/en/ch2#Yuan2014-marker)] Ding Yuan, Yu Luo, Xin Zhuang, Guilherme -Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. -[Simple -Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed -Data-Intensive Systems](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf). At *11th USENIX Symposium on Operating Systems Design -and Implementation* (OSDI), October 2014. - -[[39](/en/ch2#Rosenthal2020-marker)] Casey Rosenthal and Nora Jones. -[*Chaos -Engineering*](https://learning.oreilly.com/library/view/chaos-engineering/9781492043850/). O’Reilly Media, April 2020. ISBN: 9781492043867 - -[[40](/en/ch2#Pinheiro2007-marker)] Eduardo Pinheiro, Wolf-Dietrich Weber, and -Luiz Andre Barroso. -[Failure -Trends in a Large Disk Drive Population](https://www.usenix.org/legacy/events/fast07/tech/full_papers/pinheiro/pinheiro_old.pdf). At *5th USENIX Conference on File and Storage -Technologies* (FAST), February 2007. - -[[41](/en/ch2#Schroeder2007-marker)] Bianca Schroeder and Garth A. Gibson. -[Disk failures -in the real world: What does an MTTF of 1,000,000 hours mean to you?](https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder.pdf) At *5th USENIX -Conference on File and Storage Technologies* (FAST), February 2007. - -[[42](/en/ch2#Klein2021-marker)] Andy Klein. -[Backblaze Drive Stats -for Q2 2021](https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2021/). *backblaze.com*, August 2021. -Archived at [perma.cc/2943-UD5E](https://perma.cc/2943-UD5E) - -[[43](/en/ch2#Narayanan2016-marker)] Iyswarya Narayanan, Di Wang, Myeongjae Jeon, -Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, and -Kushagra Vaid. -[SSD -Failures in Datacenters: What? When? 
and Why?](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/a7-narayanan.pdf) At *9th ACM International on Systems and -Storage Conference* (SYSTOR), June 2016. -[doi:10.1145/2928275.2928278](https://doi.org/10.1145/2928275.2928278) - -[[44](/en/ch2#Alibaba2019_ch2-marker)] Alibaba Cloud Storage Team. -[Storage System Design Analysis: Factors -Affecting NVMe SSD Performance (1)](https://www.alibabacloud.com/blog/594375). *alibabacloud.com*, January 2019. Archived at -[archive.org](https://web.archive.org/web/20230522005034/https%3A//www.alibabacloud.com/blog/594375) - -[[45](/en/ch2#Schroeder2016_ch2-marker)] Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. -[Flash -Reliability in Production: The Expected and the Unexpected](https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf). At *14th USENIX Conference on -File and Storage Technologies* (FAST), February 2016. - -[[46](/en/ch2#Alter2019-marker)] Jacob Alter, Ji Xue, Alma Dimnaku, and Evgenia Smirni. -[SSD failures in the field: symptoms, -causes, and prediction models](https://dl.acm.org/doi/pdf/10.1145/3295500.3356172). At *International Conference for High Performance Computing, -Networking, Storage and Analysis* (SC), November 2019. -[doi:10.1145/3295500.3356172](https://doi.org/10.1145/3295500.3356172) - -[[47](/en/ch2#Ford2010-marker)] Daniel Ford, François Labelle, Florentina I. -Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. -[Availability in -Globally Distributed Storage Systems](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Ford.pdf). At *9th USENIX Symposium on Operating Systems Design -and Implementation* (OSDI), October 2010. - -[[48](/en/ch2#Vishwanath2010-marker)] Kashi Venkatesh Vishwanath and Nachiappan Nagappan. -[Characterizing -Cloud Computing Hardware Reliability](https://www.microsoft.com/en-us/research/wp-content/uploads/2010/06/socc088-vishwanath.pdf). 
At *1st ACM Symposium on Cloud Computing* (SoCC), -June 2010. [doi:10.1145/1807128.1807161](https://doi.org/10.1145/1807128.1807161) - -[[49](/en/ch2#Hochschild2021-marker)] Peter H. Hochschild, Paul Turner, Jeffrey C. -Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, and Amin Vahdat. -[Cores that -don’t count](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf). At *Workshop on Hot Topics in Operating Systems* (HotOS), June 2021. -[doi:10.1145/3458336.3465297](https://doi.org/10.1145/3458336.3465297) - -[[50](/en/ch2#Dixit2021-marker)] Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, -Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, and Sriram Sankar. -[Silent Data Corruptions at Scale](https://arxiv.org/abs/2102.11245). -*arXiv:2102.11245*, February 2021. - -[[51](/en/ch2#Behrens2015-marker)] Diogo Behrens, Marco Serafini, Sergei Arnautov, Flavio P. -Junqueira, and Christof Fetzer. -[Scalable -Error Isolation for Distributed Systems](https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/behrens). At *12th USENIX Symposium on Networked Systems -Design and Implementation* (NSDI), May 2015. - -[[52](/en/ch2#Schroeder2009-marker)] Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. -[DRAM -Errors in the Wild: A Large-Scale Field Study](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf). At *11th International Joint Conference on -Measurement and Modeling of Computer Systems* (SIGMETRICS), June 2009. -[doi:10.1145/1555349.1555372](https://doi.org/10.1145/1555349.1555372) - -[[53](/en/ch2#Kim2014-marker)] Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, -Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu. -[Flipping Bits in Memory Without -Accessing Them: An Experimental Study of DRAM Disturbance Errors](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf). 
At *41st Annual -International Symposium on Computer Architecture* (ISCA), June 2014. -[doi:10.5555/2665671.2665726](https://doi.org/10.5555/2665671.2665726) - -[[54](/en/ch2#Bray2021-marker)] Tim Bray. -[Worst Case](https://www.tbray.org/ongoing/When/202x/2021/10/08/The-WOrst-Case). -*tbray.org*, October 2021. -Archived at [perma.cc/4QQM-RTHN](https://perma.cc/4QQM-RTHN) - -[[55](/en/ch2#AbduJyothi2021-marker)] Sangeetha Abdu Jyothi. -[Solar Superstorms: Planning for -an Internet Apocalypse](https://ics.uci.edu/~sabdujyo/papers/sigcomm21-cme.pdf). At *ACM SIGCOMM Conferene*, August 2021. -[doi:10.1145/3452296.3472916](https://doi.org/10.1145/3452296.3472916) - -[[56](/en/ch2#Cockcroft2019-marker)] Adrian Cockcroft. -[Failure -Modes and Continuous Resilience](https://adrianco.medium.com/failure-modes-and-continuous-resilience-6553078caad5). *adrianco.medium.com*, November 2019. -Archived at [perma.cc/7SYS-BVJP](https://perma.cc/7SYS-BVJP) - -[[57](/en/ch2#Han2021-marker)] Shujie Han, Patrick P. C. Lee, Fan Xu, Yi Liu, Cheng He, and Jiongzhou Liu. -[An In-Depth Study of Correlated -Failures in Production SSD-Based Data Centers](https://www.usenix.org/conference/fast21/presentation/han). At *19th USENIX Conference on File and Storage -Technologies* (FAST), February 2021. - -[[58](/en/ch2#Nightingale2011-marker)] Edmund B. Nightingale, John R. Douceur, and Vince Orgovan. -[Cycles, Cells and -Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs](https://eurosys2011.cs.uni-salzburg.at/pdf/eurosys2011-nightingale.pdf). -At *6th European Conference on Computer Systems* (EuroSys), April 2011. -[doi:10.1145/1966445.1966477](https://doi.org/10.1145/1966445.1966477) - -[[59](/en/ch2#Gunawi2014-marker)] Haryadi S. Gunawi, Mingzhe Hao, Tanakorn -Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, -Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. 
-[What Bugs Live in the Cloud?](https://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf) -At *5th ACM Symposium on Cloud Computing* (SoCC), November 2014. -[doi:10.1145/2670979.2670986](https://doi.org/10.1145/2670979.2670986) - -[[60](/en/ch2#Kreps2012_ch1-marker)] Jay Kreps. -[Getting -Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. -Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW) - -[[61](/en/ch2#Minar2012_ch1-marker)] Nelson Minar. -[Leap Second Crashes Half -the Internet](https://www.somebits.com/weblog/tech/bad/leap-second-2012.html). *somebits.com*, July 2012. -Archived at [perma.cc/2WB8-D6EU](https://perma.cc/2WB8-D6EU) - -[[62](/en/ch2#HPE2019_ch2-marker)] Hewlett Packard Enterprise. -[Support -Alerts – Customer Bulletin a00092491en\_us](https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us). *support.hpe.com*, November 2019. -Archived at [perma.cc/S5F6-7ZAC](https://perma.cc/S5F6-7ZAC) - -[[63](/en/ch2#Hochstein2020-marker)] Lorin Hochstein. -[awesome limits](https://github.com/lorin/awesome-limits). *github.com*, -November 2020. Archived at [perma.cc/3R5M-E5Q4](https://perma.cc/3R5M-E5Q4) - -[[64](/en/ch2#McCaffrey2015-marker)] Caitie McCaffrey. -[Clients -Are Jerks: AKA How Halo 4 DoSed the Services at Launch & How We Survived](https://www.caitiem.com/2015/06/23/clients-are-jerks-aka-how-halo-4-dosed-the-services-at-launch-how-we-survived/). *caitiem.com*, -June 2015. Archived at [perma.cc/MXX4-W373](https://perma.cc/MXX4-W373) - -[[65](/en/ch2#Tang2023-marker)] Lilia Tang, -Chaitanya Bhandari, Yongle Zhang, Anna Karanika, Shuyang Ji, Indranil Gupta, and Tianyin Xu. -[Fail through the Cracks: Cross-System -Interaction Failures in Modern Cloud Systems](https://tianyin.github.io/pub/csi-failures.pdf). At *18th European Conference on Computer -Systems* (EuroSys), May 2023. 
-[doi:10.1145/3552326.3587448](https://doi.org/10.1145/3552326.3587448) - -[[66](/en/ch2#Ulrich2016-marker)] Mike Ulrich. -[Addressing Cascading Failures](https://sre.google/sre-book/addressing-cascading-failures/). -In Betsy Beyer, Jennifer Petoff, Chris Jones, and Niall Richard Murphy (ed). -[*Site -Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/). -O’Reilly Media, 2016. ISBN: 9781491929124 - -[[67](/en/ch2#Fassbender2022-marker)] Harri Faßbender. -[Cascading -failures in large-scale distributed systems](https://blog.mi.hdm-stuttgart.de/index.php/2022/03/03/cascading-failures-in-large-scale-distributed-systems/). *blog.mi.hdm-stuttgart.de*, March 2022. -Archived at [perma.cc/K7VY-YJRX](https://perma.cc/K7VY-YJRX) - -[[68](/en/ch2#Cook2000-marker)] Richard I. Cook. -[How Complex -Systems Fail](https://www.adaptivecapacitylabs.com/HowComplexSystemsFail.pdf). Cognitive Technologies Laboratory, April 2000. -Archived at [perma.cc/RDS6-2YVA](https://perma.cc/RDS6-2YVA) - -[[69](/en/ch2#Woods2017-marker)] David D. Woods. -[STELLA: Report from the SNAFUcatchers Workshop on Coping -With Complexity](https://snafucatchers.github.io/). *snafucatchers.github.io*, March 2017. Archived at -[archive.org](https://web.archive.org/web/20230306130131/https%3A//snafucatchers.github.io/) - -[[70](/en/ch2#Oppenheimer2003-marker)] David Oppenheimer, Archana Ganapathi, and David A. Patterson. -[Why -Do Internet Services Fail, and What Can Be Done About It?](https://static.usenix.org/events/usits03/tech/full_papers/oppenheimer/oppenheimer.pdf) At *4th USENIX Symposium on -Internet Technologies and Systems* (USITS), March 2003. - -[[71](/en/ch2#Dekker2017-marker)] Sidney Dekker. -[*The Field -Guide to Understanding ‘Human Error’, 3rd Edition*](https://learning.oreilly.com/library/view/the-field-guide/9781317031833/). CRC Press, November 2017. 
-ISBN: 9781472439055 - -[[72](/en/ch2#Dekker2011-marker)] Sidney Dekker. -[*Drift -into Failure: From Hunting Broken Components to Understanding Complex Systems*](https://www.taylorfrancis.com/books/mono/10.1201/9781315257396/drift-failure-sidney-dekker). -CRC Press, 2011. ISBN: 9781315257396 - -[[73](/en/ch2#Allspaw2012-marker)] John Allspaw. -[Blameless PostMortems and a Just -Culture](https://www.etsy.com/codeascraft/blameless-postmortems/). *etsy.com*, May 2012. -Archived at [perma.cc/YMJ7-NTAP](https://perma.cc/YMJ7-NTAP) - -[[74](/en/ch2#Sabo2023-marker)] Itzy Sabo. -[Uptime -Guarantees — A Pragmatic Perspective](https://world.hey.com/itzy/uptime-guarantees-a-pragmatic-perspective-736d7ea4). *world.hey.com*, March 2023. -Archived at [perma.cc/F7TU-78JB](https://perma.cc/F7TU-78JB) - -[[75](/en/ch2#Jurewitz2013-marker)] Michael Jurewitz. -[The Human Impact of Bugs](http://jury.me/blog/2013/3/14/the-human-impact-of-bugs). -*jury.me*, March 2013. -Archived at [perma.cc/5KQ4-VDYL](https://perma.cc/5KQ4-VDYL) - -[[76](/en/ch2#Halper2025-marker)] Mark Halper. -[How -Software Bugs led to ‘One of the Greatest Miscarriages of Justice’ in British History](https://cacm.acm.org/news/how-software-bugs-led-to-one-of-the-greatest-miscarriages-of-justice-in-british-history/). -*Communications of the ACM*, January 2025. -[doi:10.1145/3703779](https://doi.org/10.1145/3703779) - -[[77](/en/ch2#Bohm2022-marker)] Nicholas Bohm, James Christie, Peter Bernard Ladkin, -Bev Littlewood, Paul Marshall, Stephen Mason, Martin Newby, Steven J. Murdoch, Harold Thimbleby, and Martyn Thomas. -[The -legal rule that computers are presumed to be operating correctly – unforeseen and unjust -consequences](https://www.benthamsgaze.org/wp-content/uploads/2022/06/briefing-presumption-that-computers-are-reliable.pdf). Briefing note, *benthamsgaze.org*, June 2022. -Archived at [perma.cc/WQ6X-TMW4](https://perma.cc/WQ6X-TMW4) - -[[78](/en/ch2#McKinley2015-marker)] Dan McKinley. 
-[Choose Boring Technology](https://mcfunley.com/choose-boring-technology). -*mcfunley.com*, March 2015. -Archived at [perma.cc/7QW7-J4YP](https://perma.cc/7QW7-J4YP) - -[[79](/en/ch2#Warfield2023_ch2-marker)] Andy Warfield. -[Building -and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. -Archived at [perma.cc/7LPK-TP7V](https://perma.cc/7LPK-TP7V) - -[[80](/en/ch2#Brooker2023multitenancy-marker)] Marc Brooker. -[Surprising Scalability of -Multitenancy](https://brooker.co.za/blog/2023/03/23/economics.html). *brooker.co.za*, March 2023. -Archived at [perma.cc/ZZD9-VV8T](https://perma.cc/ZZD9-VV8T) - -[[81](/en/ch2#Stopford2009-marker)] Ben Stopford. -[Shared -Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/). *benstopford.com*, -November 2009. Archived at [perma.cc/7BXH-EDUR](https://perma.cc/7BXH-EDUR) - -[[82](/en/ch2#Stonebraker1986-marker)] Michael Stonebraker. -[The Case for Shared Nothing](https://dsf.berkeley.edu/papers/hpts85-nothing.pdf). -*IEEE Database Engineering Bulletin*, volume 9, issue 1, pages 4–9, March 1986. - -[[83](/en/ch2#Antonopoulos2019_ch2-marker)] Panagiotis Antonopoulos, -Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald -Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya -Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. -[Socrates: The -New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data* -(SIGMOD), pages 1743–1756, June 2019. -[doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047) - -[[84](/en/ch2#Newman2021_ch2-marker)] Sam Newman. 
-[*Building -Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O’Reilly Media, 2021. ISBN: 9781492034025 - -[[85](/en/ch2#Ensmenger2016-marker)] Nathan Ensmenger. -[When -Good Software Goes Bad: The Surprising Durability of an Ephemeral Technology](https://themaintainers.wpengine.com/wp-content/uploads/2021/04/ensmenger-maintainers-v2.pdf). -At *The Maintainers Conference*, April 2016. -Archived at [perma.cc/ZXT4-HGZB](https://perma.cc/ZXT4-HGZB) - -[[86](/en/ch2#Glass2002-marker)] Robert L. Glass. -[*Facts and -Fallacies of Software Engineering*](https://learning.oreilly.com/library/view/facts-and-fallacies/0321117425/). -Addison-Wesley Professional, October 2002. ISBN: 9780321117427 - -[[87](/en/ch2#Bellotti2021-marker)] Marianne Bellotti. -[*Kill It with -Fire*](https://learning.oreilly.com/library/view/kill-it-with/9781098128883/). No Starch Press, April 2021. ISBN: 9781718501188 - -[[88](/en/ch2#Bainbridge1983-marker)] Lisanne Bainbridge. -[Ironies of -automation](https://www.adaptivecapacitylabs.com/IroniesOfAutomation-Bainbridge83.pdf). *Automatica*, volume 19, issue 6, pages 775–779, November 1983. -[doi:10.1016/0005-1098(83)90046-8](https://doi.org/10.1016/0005-1098%2883%2990046-8) - -[[89](/en/ch2#Hamilton2007-marker)] James Hamilton. -[On -Designing and Deploying Internet-Scale Services](https://www.usenix.org/legacy/events/lisa07/tech/full_papers/hamilton/hamilton.pdf). At *21st Large Installation -System Administration Conference* (LISA), November 2007. - -[[90](/en/ch2#Horovits2021-marker)] Dotan Horovits. -[Open Source -for Better Observability](https://horovits.medium.com/open-source-for-better-observability-8c65b5630561). *horovits.medium.com*, October 2021. -Archived at [perma.cc/R2HD-U2ZT](https://perma.cc/R2HD-U2ZT) - -[[91](/en/ch2#Foote1997-marker)] Brian Foote and Joseph Yoder. -[Big Ball of Mud](http://www.laputan.org/pub/foote/mud.pdf). 
At -*4th Conference on Pattern Languages of Programs* (PLoP), September 1997. -Archived at [perma.cc/4GUP-2PBV](https://perma.cc/4GUP-2PBV) - -[[92](/en/ch2#Brooker2022-marker)] Marc Brooker. -[What is a simple system?](https://brooker.co.za/blog/2022/05/03/simplicity.html) -*brooker.co.za*, May 2022. -Archived at [perma.cc/U72T-BFVE](https://perma.cc/U72T-BFVE) - -[[93](/en/ch2#Brooks1995-marker)] Frederick P. Brooks. -[No Silver Bullet – Essence and -Accident in Software Engineering](https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf). In -[*The Mythical -Man-Month*](https://www.oreilly.com/library/view/mythical-man-month-the/0201835959/), Anniversary edition, Addison-Wesley, 1995. ISBN: 9780201835953 - -[[94](/en/ch2#Luu2020-marker)] Dan Luu. -[Against essential and accidental complexity](https://danluu.com/essential-complexity/). -*danluu.com*, December 2020. -Archived at [perma.cc/H5ES-69KC](https://perma.cc/H5ES-69KC) - -[[95](/en/ch2#Gamma1994-marker)] Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. -[*Design Patterns: -Elements of Reusable Object-Oriented Software*](https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/). Addison-Wesley Professional, October 1994. -ISBN: 9780201633610 - -[[96](/en/ch2#Evans2003-marker)] Eric Evans. -[*Domain-Driven -Design: Tackling Complexity in the Heart of Software*](https://learning.oreilly.com/library/view/domain-driven-design-tackling/0321125215/). Addison-Wesley Professional, August 2003. -ISBN: 9780321125217 - -[[97](/en/ch2#Breivold2008-marker)] Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson. -[Analyzing Software Evolvability](https://www.es.mdh.se/pdf_publications/1251.pdf). -at *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008. -[doi:10.1109/COMPSAC.2008.50](https://doi.org/10.1109/COMPSAC.2008.50) - -[[98](/en/ch2#Zaninotto2002-marker)] Enrico Zaninotto. 
-[From X programming to the X organisation](https://martinfowler.com/articles/zaninotto.pdf). -At *XP Conference*, May 2002. -Archived at [perma.cc/R9AR-QCKZ](https://perma.cc/R9AR-QCKZ) +[^1]: Mike Cvet. [How We Learned to Stop Worrying and Love Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016. +[^2]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK) +[^3]: Twitter. [Twitter’s Recommendation Algorithm](https://blog.twitter.com/engineering/en_us/topics/open-source/2023/twitter-recommendation-algorithm). *blog.twitter.com*, March 2023. Archived at [perma.cc/L5GT-229T](https://perma.cc/L5GT-229T) +[^4]: Raffi Krikorian. [New Tweets per second record, and how!](https://blog.twitter.com/engineering/en_us/a/2013/new-tweets-per-second-record-and-how) *blog.twitter.com*, August 2013. Archived at [perma.cc/6JZN-XJYN](https://perma.cc/6JZN-XJYN) +[^5]: Jaz Volpert. [When Imperfect Systems are Good, Actually: Bluesky’s Lossy Timelines](https://jazco.dev/2025/02/19/imperfection/). *jazco.dev*, February 2025. Archived at [perma.cc/2PVE-L2MX](https://perma.cc/2PVE-L2MX) +[^6]: Samuel Axon. [3% of Twitter’s Servers Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010. Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX) +[^7]: Nathan Bronson, Abutalib Aghayev, Aleksey Charapko, and Timothy Zhu. [Metastable Failures in Distributed Systems](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s11-bronson.pdf). At *Workshop on Hot Topics in Operating Systems* (HotOS), May 2021. [doi:10.1145/3458336.3465286](https://doi.org/10.1145/3458336.3465286) +[^8]: Marc Brooker. [Metastability and Distributed Systems](https://brooker.co.za/blog/2021/05/24/metastable.html). *brooker.co.za*, May 2021. 
Archived at [perma.cc/7FGJ-7XRK](https://perma.cc/7FGJ-7XRK) +[^9]: Marc Brooker. [Exponential Backoff And Jitter](https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/). *aws.amazon.com*, March 2015. Archived at [perma.cc/R6MS-AZKH](https://perma.cc/R6MS-AZKH) +[^10]: Marc Brooker. [What is Backoff For?](https://brooker.co.za/blog/2022/08/11/backoff.html) *brooker.co.za*, August 2022. Archived at [perma.cc/PW9N-55Q5](https://perma.cc/PW9N-55Q5) +[^11]: Michael T. Nygard. [*Release It!*](https://learning.oreilly.com/library/view/release-it-2nd/9781680504552/), 2nd Edition. Pragmatic Bookshelf, January 2018. ISBN: 9781680502398 +[^12]: Frank Chen. [Slowing Down to Speed Up – Circuit Breakers for Slack’s CI/CD](https://slack.engineering/circuit-breakers/). *slack.engineering*, August 2022. Archived at [perma.cc/5FGS-ZPH3](https://perma.cc/5FGS-ZPH3) +[^13]: Marc Brooker. [Fixing retries with token buckets and circuit breakers](https://brooker.co.za/blog/2022/02/28/retries.html). *brooker.co.za*, February 2022. Archived at [perma.cc/MD6N-GW26](https://perma.cc/MD6N-GW26) +[^14]: David Yanacek. [Using load shedding to avoid overload](https://aws.amazon.com/builders-library/using-load-shedding-to-avoid-overload/). Amazon Builders’ Library, *aws.amazon.com*. Archived at [perma.cc/9SAW-68MP](https://perma.cc/9SAW-68MP) +[^15]: Matthew Sackman. [Pushing Back](https://wellquite.org/posts/lshift/pushing_back/). *wellquite.org*, May 2016. Archived at [perma.cc/3KCZ-RUFY](https://perma.cc/3KCZ-RUFY) +[^16]: Dmitry Kopytkov and Patrick Lee. [Meet Bandaid, the Dropbox service proxy](https://dropbox.tech/infrastructure/meet-bandaid-the-dropbox-service-proxy). *dropbox.tech*, March 2018. Archived at [perma.cc/KUU6-YG4S](https://perma.cc/KUU6-YG4S) +[^17]: Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. 
Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. [Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems](https://www.usenix.org/system/files/conference/fast18/fast18-gunawi.pdf). At *16th USENIX Conference on File and Storage Technologies* (FAST), February 2018. +[^18]: Marc Brooker. [Is the Mean Really Useless?](https://brooker.co.za/blog/2017/12/28/mean.html) *brooker.co.za*, December 2017. Archived at [perma.cc/U5AE-CVEM](https://perma.cc/U5AE-CVEM) +[^19]: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. [Dynamo: Amazon’s Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007. [doi:10.1145/1294261.1294281](https://doi.org/10.1145/1294261.1294281) +[^20]: Kathryn Whitenton. [The Need for Speed, 23 Years Later](https://www.nngroup.com/articles/the-need-for-speed/). *nngroup.com*, May 2020. Archived at [perma.cc/C4ER-LZYA](https://perma.cc/C4ER-LZYA) +[^21]: Greg Linden. [Marissa Mayer at Web 2.0](https://glinden.blogspot.com/2006/11/marissa-mayer-at-web-20.html). *glinden.blogspot.com*, November 2006. Archived at [perma.cc/V7EA-3VXB](https://perma.cc/V7EA-3VXB) +[^22]: Jake Brutlag. [Speed Matters for Google Web Search](https://services.google.com/fh/files/blogs/google_delayexp.pdf). *services.google.com*, June 2009. Archived at [perma.cc/BK7R-X7M2](https://perma.cc/BK7R-X7M2) +[^23]: Eric Schurman and Jake Brutlag. [Performance Related Changes and their User Impact](https://www.youtube.com/watch?v=bQSE51-gr2s). Talk at *Velocity 2009*. +[^24]: Akamai Technologies, Inc.
[The State of Online Retail Performance](https://web.archive.org/web/20210729180749/https%3A//www.akamai.com/us/en/multimedia/documents/report/akamai-state-of-online-retail-performance-spring-2017.pdf). *akamai.com*, April 2017. Archived at [perma.cc/UEK2-HYCS](https://perma.cc/UEK2-HYCS) +[^25]: Xiao Bai, Ioannis Arapakis, B. Barla Cambazoglu, and Ana Freire. [Understanding and Leveraging the Impact of Response Latency on User Behaviour in Web Search](https://iarapakis.github.io/papers/TOIS17.pdf). *ACM Transactions on Information Systems*, volume 36, issue 2, article 21, April 2018. [doi:10.1145/3106372](https://doi.org/10.1145/3106372) +[^26]: Jeffrey Dean and Luiz André Barroso. [The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/). *Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013. [doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794) +[^27]: Alex Hidalgo. [*Implementing Service Level Objectives: A Practical Guide to SLIs, SLOs, and Error Budgets*](https://www.oreilly.com/library/view/implementing-service-level/9781492076803/). O’Reilly Media, September 2020. ISBN: 1492076813 +[^28]: Jeffrey C. Mogul and John Wilkes. [Nines are Not Enough: Meaningful Metrics for Clouds](https://research.google/pubs/pub48033/). At *17th Workshop on Hot Topics in Operating Systems* (HotOS), May 2019. [doi:10.1145/3317550.3321432](https://doi.org/10.1145/3317550.3321432) +[^29]: Tamás Hauer, Philipp Hoffmann, John Lunney, Dan Ardelean, and Amer Diwan. [Meaningful Availability](https://www.usenix.org/conference/nsdi20/presentation/hauer). At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020. +[^30]: Ted Dunning. [The t-digest: Efficient estimates of distributions](https://www.sciencedirect.com/science/article/pii/S2665963820300403). *Software Impacts*, volume 7, article 100049, February 2021. 
[doi:10.1016/j.simpa.2020.100049](https://doi.org/10.1016/j.simpa.2020.100049) +[^31]: David Kohn. [How percentile approximation works (and why it’s more useful than averages)](https://www.timescale.com/blog/how-percentile-approximation-works-and-why-its-more-useful-than-averages/). *timescale.com*, September 2021. Archived at [perma.cc/3PDP-NR8B](https://perma.cc/3PDP-NR8B) +[^32]: Heinrich Hartmann and Theo Schlossnagle. [Circllhist — A Log-Linear Histogram Data Structure for IT Infrastructure Monitoring](https://arxiv.org/pdf/2001.06561.pdf). *arxiv.org*, January 2020. +[^33]: Charles Masson, Jee E. Rim, and Homin K. Lee. [DDSketch: A Fast and Fully-Mergeable Quantile Sketch with Relative-Error Guarantees](https://www.vldb.org/pvldb/vol12/p2195-masson.pdf). *Proceedings of the VLDB Endowment*, volume 12, issue 12, pages 2195–2205, August 2019. [doi:10.14778/3352063.3352135](https://doi.org/10.14778/3352063.3352135) +[^34]: Baron Schwartz. [Why Percentiles Don’t Work the Way You Think](https://orangematter.solarwinds.com/2016/11/18/why-percentiles-dont-work-the-way-you-think/). *solarwinds.com*, November 2016. Archived at [perma.cc/469T-6UGB](https://perma.cc/469T-6UGB) +[^35]: Walter L. Heimerdinger and Charles B. Weinstock. [A Conceptual Framework for System Fault Tolerance](https://resources.sei.cmu.edu/asset_files/TechnicalReport/1992_005_001_16112.pdf). Technical Report CMU/SEI-92-TR-033, Software Engineering Institute, Carnegie Mellon University, October 1992. Archived at [perma.cc/GD2V-DMJW](https://perma.cc/GD2V-DMJW) +[^36]: Felix C. Gärtner. [Fundamentals of fault-tolerant distributed computing in asynchronous environments](https://dl.acm.org/doi/pdf/10.1145/311531.311532). *ACM Computing Surveys*, volume 31, issue 1, pages 1–26, March 1999. [doi:10.1145/311531.311532](https://doi.org/10.1145/311531.311532) +[^37]: Algirdas Avižienis, Jean-Claude Laprie, Brian Randell, and Carl Landwehr. 
[Basic Concepts and Taxonomy of Dependable and Secure Computing](https://hdl.handle.net/1903/6459). *IEEE Transactions on Dependable and Secure Computing*, volume 1, issue 1, January 2004. [doi:10.1109/TDSC.2004.2](https://doi.org/10.1109/TDSC.2004.2) +[^38]: Ding Yuan, Yu Luo, Xin Zhuang, Guilherme Renna Rodrigues, Xu Zhao, Yongle Zhang, Pranay U. Jain, and Michael Stumm. [Simple Testing Can Prevent Most Critical Failures: An Analysis of Production Failures in Distributed Data-Intensive Systems](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-yuan.pdf). At *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014. +[^39]: Casey Rosenthal and Nora Jones. [*Chaos Engineering*](https://learning.oreilly.com/library/view/chaos-engineering/9781492043850/). O’Reilly Media, April 2020. ISBN: 9781492043867 +[^40]: Eduardo Pinheiro, Wolf-Dietrich Weber, and Luiz Andre Barroso. [Failure Trends in a Large Disk Drive Population](https://www.usenix.org/legacy/events/fast07/tech/full_papers/pinheiro/pinheiro_old.pdf). At *5th USENIX Conference on File and Storage Technologies* (FAST), February 2007. +[^41]: Bianca Schroeder and Garth A. Gibson. [Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you?](https://www.usenix.org/legacy/events/fast07/tech/schroeder/schroeder.pdf) At *5th USENIX Conference on File and Storage Technologies* (FAST), February 2007. +[^42]: Andy Klein. [Backblaze Drive Stats for Q2 2021](https://www.backblaze.com/blog/backblaze-drive-stats-for-q2-2021/). *backblaze.com*, August 2021. Archived at [perma.cc/2943-UD5E](https://perma.cc/2943-UD5E) +[^43]: Iyswarya Narayanan, Di Wang, Myeongjae Jeon, Bikash Sharma, Laura Caulfield, Anand Sivasubramaniam, Ben Cutler, Jie Liu, Badriddine Khessib, and Kushagra Vaid. [SSD Failures in Datacenters: What? When? 
and Why?](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/08/a7-narayanan.pdf) At *9th ACM International on Systems and Storage Conference* (SYSTOR), June 2016. [doi:10.1145/2928275.2928278](https://doi.org/10.1145/2928275.2928278) +[^44]: Alibaba Cloud Storage Team. [Storage System Design Analysis: Factors Affecting NVMe SSD Performance (1)](https://www.alibabacloud.com/blog/594375). *alibabacloud.com*, January 2019. Archived at [archive.org](https://web.archive.org/web/20230522005034/https%3A//www.alibabacloud.com/blog/594375) +[^45]: Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. [Flash Reliability in Production: The Expected and the Unexpected](https://www.usenix.org/system/files/conference/fast16/fast16-papers-schroeder.pdf). At *14th USENIX Conference on File and Storage Technologies* (FAST), February 2016. +[^46]: Jacob Alter, Ji Xue, Alma Dimnaku, and Evgenia Smirni. [SSD failures in the field: symptoms, causes, and prediction models](https://dl.acm.org/doi/pdf/10.1145/3295500.3356172). At *International Conference for High Performance Computing, Networking, Storage and Analysis* (SC), November 2019. [doi:10.1145/3295500.3356172](https://doi.org/10.1145/3295500.3356172) +[^47]: Daniel Ford, François Labelle, Florentina I. Popovici, Murray Stokely, Van-Anh Truong, Luiz Barroso, Carrie Grimes, and Sean Quinlan. [Availability in Globally Distributed Storage Systems](https://www.usenix.org/legacy/event/osdi10/tech/full_papers/Ford.pdf). At *9th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2010. +[^48]: Kashi Venkatesh Vishwanath and Nachiappan Nagappan. [Characterizing Cloud Computing Hardware Reliability](https://www.microsoft.com/en-us/research/wp-content/uploads/2010/06/socc088-vishwanath.pdf). At *1st ACM Symposium on Cloud Computing* (SoCC), June 2010. [doi:10.1145/1807128.1807161](https://doi.org/10.1145/1807128.1807161) +[^49]: Peter H. Hochschild, Paul Turner, Jeffrey C. 
Mogul, Rama Govindaraju, Parthasarathy Ranganathan, David E. Culler, and Amin Vahdat. [Cores that don’t count](https://sigops.org/s/conferences/hotos/2021/papers/hotos21-s01-hochschild.pdf). At *Workshop on Hot Topics in Operating Systems* (HotOS), June 2021. [doi:10.1145/3458336.3465297](https://doi.org/10.1145/3458336.3465297) +[^50]: Harish Dattatraya Dixit, Sneha Pendharkar, Matt Beadon, Chris Mason, Tejasvi Chakravarthy, Bharath Muthiah, and Sriram Sankar. [Silent Data Corruptions at Scale](https://arxiv.org/abs/2102.11245). *arXiv:2102.11245*, February 2021. +[^51]: Diogo Behrens, Marco Serafini, Sergei Arnautov, Flavio P. Junqueira, and Christof Fetzer. [Scalable Error Isolation for Distributed Systems](https://www.usenix.org/conference/nsdi15/technical-sessions/presentation/behrens). At *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015. +[^52]: Bianca Schroeder, Eduardo Pinheiro, and Wolf-Dietrich Weber. [DRAM Errors in the Wild: A Large-Scale Field Study](https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35162.pdf). At *11th International Joint Conference on Measurement and Modeling of Computer Systems* (SIGMETRICS), June 2009. [doi:10.1145/1555349.1555372](https://doi.org/10.1145/1555349.1555372) +[^53]: Yoongu Kim, Ross Daly, Jeremie Kim, Chris Fallin, Ji Hye Lee, Donghyuk Lee, Chris Wilkerson, Konrad Lai, and Onur Mutlu. [Flipping Bits in Memory Without Accessing Them: An Experimental Study of DRAM Disturbance Errors](https://users.ece.cmu.edu/~yoonguk/papers/kim-isca14.pdf). At *41st Annual International Symposium on Computer Architecture* (ISCA), June 2014. [doi:10.5555/2665671.2665726](https://doi.org/10.5555/2665671.2665726) +[^54]: Tim Bray. [Worst Case](https://www.tbray.org/ongoing/When/202x/2021/10/08/The-WOrst-Case). *tbray.org*, October 2021. Archived at [perma.cc/4QQM-RTHN](https://perma.cc/4QQM-RTHN) +[^55]: Sangeetha Abdu Jyothi. 
[Solar Superstorms: Planning for an Internet Apocalypse](https://ics.uci.edu/~sabdujyo/papers/sigcomm21-cme.pdf). At *ACM SIGCOMM Conference*, August 2021. [doi:10.1145/3452296.3472916](https://doi.org/10.1145/3452296.3472916) +[^56]: Adrian Cockcroft. [Failure Modes and Continuous Resilience](https://adrianco.medium.com/failure-modes-and-continuous-resilience-6553078caad5). *adrianco.medium.com*, November 2019. Archived at [perma.cc/7SYS-BVJP](https://perma.cc/7SYS-BVJP) +[^57]: Shujie Han, Patrick P. C. Lee, Fan Xu, Yi Liu, Cheng He, and Jiongzhou Liu. [An In-Depth Study of Correlated Failures in Production SSD-Based Data Centers](https://www.usenix.org/conference/fast21/presentation/han). At *19th USENIX Conference on File and Storage Technologies* (FAST), February 2021. +[^58]: Edmund B. Nightingale, John R. Douceur, and Vince Orgovan. [Cycles, Cells and Platters: An Empirical Analysis of Hardware Failures on a Million Consumer PCs](https://eurosys2011.cs.uni-salzburg.at/pdf/eurosys2011-nightingale.pdf). At *6th European Conference on Computer Systems* (EuroSys), April 2011. [doi:10.1145/1966445.1966477](https://doi.org/10.1145/1966445.1966477) +[^59]: Haryadi S. Gunawi, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, Thanh Do, Jeffry Adityatama, Kurnia J. Eliazar, Agung Laksono, Jeffrey F. Lukman, Vincentius Martin, and Anang D. Satria. [What Bugs Live in the Cloud?](https://ucare.cs.uchicago.edu/pdf/socc14-cbs.pdf) At *5th ACM Symposium on Cloud Computing* (SoCC), November 2014. [doi:10.1145/2670979.2670986](https://doi.org/10.1145/2670979.2670986) +[^60]: Jay Kreps. [Getting Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW) +[^61]: Nelson Minar. [Leap Second Crashes Half the Internet](https://www.somebits.com/weblog/tech/bad/leap-second-2012.html).
*somebits.com*, July 2012. Archived at [perma.cc/2WB8-D6EU](https://perma.cc/2WB8-D6EU) +[^62]: Hewlett Packard Enterprise. [Support Alerts – Customer Bulletin a00092491en\_us](https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us). *support.hpe.com*, November 2019. Archived at [perma.cc/S5F6-7ZAC](https://perma.cc/S5F6-7ZAC) +[^63]: Lorin Hochstein. [awesome limits](https://github.com/lorin/awesome-limits). *github.com*, November 2020. Archived at [perma.cc/3R5M-E5Q4](https://perma.cc/3R5M-E5Q4) +[^64]: Caitie McCaffrey. [Clients Are Jerks: AKA How Halo 4 DoSed the Services at Launch & How We Survived](https://www.caitiem.com/2015/06/23/clients-are-jerks-aka-how-halo-4-dosed-the-services-at-launch-how-we-survived/). *caitiem.com*, June 2015. Archived at [perma.cc/MXX4-W373](https://perma.cc/MXX4-W373) +[^65]: Lilia Tang, Chaitanya Bhandari, Yongle Zhang, Anna Karanika, Shuyang Ji, Indranil Gupta, and Tianyin Xu. [Fail through the Cracks: Cross-System Interaction Failures in Modern Cloud Systems](https://tianyin.github.io/pub/csi-failures.pdf). At *18th European Conference on Computer Systems* (EuroSys), May 2023. [doi:10.1145/3552326.3587448](https://doi.org/10.1145/3552326.3587448) +[^66]: Mike Ulrich. [Addressing Cascading Failures](https://sre.google/sre-book/addressing-cascading-failures/). In Betsy Beyer, Jennifer Petoff, Chris Jones, and Niall Richard Murphy (ed). [*Site Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/). O’Reilly Media, 2016. ISBN: 9781491929124 +[^67]: Harri Faßbender. [Cascading failures in large-scale distributed systems](https://blog.mi.hdm-stuttgart.de/index.php/2022/03/03/cascading-failures-in-large-scale-distributed-systems/). *blog.mi.hdm-stuttgart.de*, March 2022. Archived at [perma.cc/K7VY-YJRX](https://perma.cc/K7VY-YJRX) +[^68]: Richard I. Cook. 
[How Complex Systems Fail](https://www.adaptivecapacitylabs.com/HowComplexSystemsFail.pdf). Cognitive Technologies Laboratory, April 2000. Archived at [perma.cc/RDS6-2YVA](https://perma.cc/RDS6-2YVA) +[^69]: David D. Woods. [STELLA: Report from the SNAFUcatchers Workshop on Coping With Complexity](https://snafucatchers.github.io/). *snafucatchers.github.io*, March 2017. Archived at [archive.org](https://web.archive.org/web/20230306130131/https%3A//snafucatchers.github.io/) +[^70]: David Oppenheimer, Archana Ganapathi, and David A. Patterson. [Why Do Internet Services Fail, and What Can Be Done About It?](https://static.usenix.org/events/usits03/tech/full_papers/oppenheimer/oppenheimer.pdf) At *4th USENIX Symposium on Internet Technologies and Systems* (USITS), March 2003. +[^71]: Sidney Dekker. [*The Field Guide to Understanding ‘Human Error’, 3rd Edition*](https://learning.oreilly.com/library/view/the-field-guide/9781317031833/). CRC Press, November 2017. ISBN: 9781472439055 +[^72]: Sidney Dekker. [*Drift into Failure: From Hunting Broken Components to Understanding Complex Systems*](https://www.taylorfrancis.com/books/mono/10.1201/9781315257396/drift-failure-sidney-dekker). CRC Press, 2011. ISBN: 9781315257396 +[^73]: John Allspaw. [Blameless PostMortems and a Just Culture](https://www.etsy.com/codeascraft/blameless-postmortems/). *etsy.com*, May 2012. Archived at [perma.cc/YMJ7-NTAP](https://perma.cc/YMJ7-NTAP) +[^74]: Itzy Sabo. [Uptime Guarantees — A Pragmatic Perspective](https://world.hey.com/itzy/uptime-guarantees-a-pragmatic-perspective-736d7ea4). *world.hey.com*, March 2023. Archived at [perma.cc/F7TU-78JB](https://perma.cc/F7TU-78JB) +[^75]: Michael Jurewitz. [The Human Impact of Bugs](http://jury.me/blog/2013/3/14/the-human-impact-of-bugs). *jury.me*, March 2013. Archived at [perma.cc/5KQ4-VDYL](https://perma.cc/5KQ4-VDYL) +[^76]: Mark Halper. 
[How Software Bugs led to ‘One of the Greatest Miscarriages of Justice’ in British History](https://cacm.acm.org/news/how-software-bugs-led-to-one-of-the-greatest-miscarriages-of-justice-in-british-history/). *Communications of the ACM*, January 2025. [doi:10.1145/3703779](https://doi.org/10.1145/3703779) +[^77]: Nicholas Bohm, James Christie, Peter Bernard Ladkin, Bev Littlewood, Paul Marshall, Stephen Mason, Martin Newby, Steven J. Murdoch, Harold Thimbleby, and Martyn Thomas. [The legal rule that computers are presumed to be operating correctly – unforeseen and unjust consequences](https://www.benthamsgaze.org/wp-content/uploads/2022/06/briefing-presumption-that-computers-are-reliable.pdf). Briefing note, *benthamsgaze.org*, June 2022. Archived at [perma.cc/WQ6X-TMW4](https://perma.cc/WQ6X-TMW4) +[^78]: Dan McKinley. [Choose Boring Technology](https://mcfunley.com/choose-boring-technology). *mcfunley.com*, March 2015. Archived at [perma.cc/7QW7-J4YP](https://perma.cc/7QW7-J4YP) +[^79]: Andy Warfield. [Building and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. Archived at [perma.cc/7LPK-TP7V](https://perma.cc/7LPK-TP7V) +[^80]: Marc Brooker. [Surprising Scalability of Multitenancy](https://brooker.co.za/blog/2023/03/23/economics.html). *brooker.co.za*, March 2023. Archived at [perma.cc/ZZD9-VV8T](https://perma.cc/ZZD9-VV8T) +[^81]: Ben Stopford. [Shared Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/). *benstopford.com*, November 2009. Archived at [perma.cc/7BXH-EDUR](https://perma.cc/7BXH-EDUR) +[^82]: Michael Stonebraker. [The Case for Shared Nothing](https://dsf.berkeley.edu/papers/hpts85-nothing.pdf). *IEEE Database Engineering Bulletin*, volume 9, issue 1, pages 4–9, March 1986. 
+[^83]: Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. [Socrates: The New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 1743–1756, June 2019. [doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047) +[^84]: Sam Newman. [*Building Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O’Reilly Media, 2021. ISBN: 9781492034025 +[^85]: Nathan Ensmenger. [When Good Software Goes Bad: The Surprising Durability of an Ephemeral Technology](https://themaintainers.wpengine.com/wp-content/uploads/2021/04/ensmenger-maintainers-v2.pdf). At *The Maintainers Conference*, April 2016. Archived at [perma.cc/ZXT4-HGZB](https://perma.cc/ZXT4-HGZB) +[^86]: Robert L. Glass. [*Facts and Fallacies of Software Engineering*](https://learning.oreilly.com/library/view/facts-and-fallacies/0321117425/). Addison-Wesley Professional, October 2002. ISBN: 9780321117427 +[^87]: Marianne Bellotti. [*Kill It with Fire*](https://learning.oreilly.com/library/view/kill-it-with/9781098128883/). No Starch Press, April 2021. ISBN: 9781718501188 +[^88]: Lisanne Bainbridge. [Ironies of automation](https://www.adaptivecapacitylabs.com/IroniesOfAutomation-Bainbridge83.pdf). *Automatica*, volume 19, issue 6, pages 775–779, November 1983. [doi:10.1016/0005-1098(83)90046-8](https://doi.org/10.1016/0005-1098%2883%2990046-8) +[^89]: James Hamilton. [On Designing and Deploying Internet-Scale Services](https://www.usenix.org/legacy/events/lisa07/tech/full_papers/hamilton/hamilton.pdf). 
At *21st Large Installation System Administration Conference* (LISA), November 2007. +[^90]: Dotan Horovits. [Open Source for Better Observability](https://horovits.medium.com/open-source-for-better-observability-8c65b5630561). *horovits.medium.com*, October 2021. Archived at [perma.cc/R2HD-U2ZT](https://perma.cc/R2HD-U2ZT) +[^91]: Brian Foote and Joseph Yoder. [Big Ball of Mud](http://www.laputan.org/pub/foote/mud.pdf). At *4th Conference on Pattern Languages of Programs* (PLoP), September 1997. Archived at [perma.cc/4GUP-2PBV](https://perma.cc/4GUP-2PBV) +[^92]: Marc Brooker. [What is a simple system?](https://brooker.co.za/blog/2022/05/03/simplicity.html) *brooker.co.za*, May 2022. Archived at [perma.cc/U72T-BFVE](https://perma.cc/U72T-BFVE) +[^93]: Frederick P. Brooks. [No Silver Bullet – Essence and Accident in Software Engineering](https://worrydream.com/refs/Brooks_1986_-_No_Silver_Bullet.pdf). In [*The Mythical Man-Month*](https://www.oreilly.com/library/view/mythical-man-month-the/0201835959/), Anniversary edition, Addison-Wesley, 1995. ISBN: 9780201835953 +[^94]: Dan Luu. [Against essential and accidental complexity](https://danluu.com/essential-complexity/). *danluu.com*, December 2020. Archived at [perma.cc/H5ES-69KC](https://perma.cc/H5ES-69KC) +[^95]: Erich Gamma, Richard Helm, Ralph Johnson, and John Vlissides. [*Design Patterns: Elements of Reusable Object-Oriented Software*](https://learning.oreilly.com/library/view/design-patterns-elements/0201633612/). Addison-Wesley Professional, October 1994. ISBN: 9780201633610 +[^96]: Eric Evans. [*Domain-Driven Design: Tackling Complexity in the Heart of Software*](https://learning.oreilly.com/library/view/domain-driven-design-tackling/0321125215/). Addison-Wesley Professional, August 2003. ISBN: 9780321125217 +[^97]: Hongyu Pei Breivold, Ivica Crnkovic, and Peter J. Eriksson. [Analyzing Software Evolvability](https://www.es.mdh.se/pdf_publications/1251.pdf). 
At *32nd Annual IEEE International Computer Software and Applications Conference* (COMPSAC), July 2008. [doi:10.1109/COMPSAC.2008.50](https://doi.org/10.1109/COMPSAC.2008.50) +[^98]: Enrico Zaninotto. [From X programming to the X organisation](https://martinfowler.com/articles/zaninotto.pdf). At *XP Conference*, May 2002. Archived at [perma.cc/R9AR-QCKZ](https://perma.cc/R9AR-QCKZ) diff --git a/content/en/ch3.md b/content/en/ch3.md index c0cc6d5..5d7d96d 100644 --- a/content/en/ch3.md +++ b/content/en/ch3.md @@ -55,17 +55,17 @@ the computer which operations to perform in which order. A declarative query lan because it is typically more concise and easier to write than an explicit algorithm. But more importantly, it also hides implementation details of the query engine, which makes it possible for the database system to introduce performance improvements without requiring any changes to queries. -[[1](/en/ch3#Brandon2024)]. +[^1]. For example, a database might be able to execute a declarative query in parallel across multiple CPU cores and machines, without you having to worry about how to implement that parallelism -[[2](/en/ch3#Hellerstein2010)]. +[^2]. In a hand-coded algorithm it would be a lot of work to implement such parallel execution yourself. # Relational Model versus Document Model The best-known data model today is probably that of SQL, based on the relational model proposed by -Edgar Codd in 1970 [[3](/en/ch3#Codd1970)]: +Edgar Codd in 1970 [^3]: data is organized into *relations* (called *tables* in SQL), where each relation is an unordered collection of *tuples* (*rows* in SQL). @@ -80,10 +80,10 @@ and early 1980s, the *network model* and the *hierarchical model* were the main the relational model came to dominate them. Object databases came and went again in the late 1980s and early 1990s. XML databases appeared in the early 2000s, but have only seen niche adoption.
Each competitor to the relational model generated a lot of hype in its time, but it never lasted -[[4](/en/ch3#Stonebraker2005around)]. +[^4]. Instead, SQL has grown to incorporate other data types besides its relational core—for example, adding support for XML, JSON, and graph data -[[5](/en/ch3#Winand2015)]. +[^5]. In the 2010s, *NoSQL* was the latest buzzword that tried to overthrow the dominance of relational databases. NoSQL refers not to a single technology, but a loose set of ideas around new data models, @@ -122,7 +122,7 @@ reflections and other troubles. Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of boilerplate code required for this translation layer, but they are often criticized -[[6](/en/ch3#Fowler2012)]. +[^6]. Some commonly cited problems are: * ORMs are complex and can’t completely hide the differences between the two models, so developers @@ -137,7 +137,7 @@ Some commonly cited problems are: database. Customizing the ORM’s schema and query generation can be complex and negate the benefit of using the ORM in the first place. * ORMs make it easy to accidentally write inefficient queries, such as the *N+1 query problem* - [[7](/en/ch3#Mihalcea2023)]. + [^7]. For example, say you want to display a list of user comments on a page, so you perform one query that returns *N* comments, each containing the ID of its author. To show the name of the comment author you need to look up the ID in the users table. In hand-written SQL you would probably @@ -213,7 +213,7 @@ The JSON representation has better *locality* than the multi-table schema in [Figure 3-1](/en/ch3#fig_obama_relational) (see [“Data locality for reads and writes”](/en/ch3#sec_datamodels_document_locality)). 
If you want to fetch a profile in the relational example, you need to either perform multiple queries (query each table by `user_id`) or perform a messy multi-way join between the `users` table and its subordinate tables -[[8](/en/ch3#Schauder2023)]. +[^8]. In the JSON representation, all the relevant information is in one place, making the query both faster and simpler. @@ -314,7 +314,7 @@ name: * In a denormalized representation, we would include the image URL of the logo on every individual person’s profile; this makes the JSON document self-contained, but it creates a headache if we ever need to change the logo, because we now need to find all of the occurrences of the old URL - and update them [[9](/en/ch3#Zola2014)]. + and update them [^9]. * In a normalized representation, we would create an entity representing an organization or school, and store its name, logo URL, and perhaps other attributes (description, news feed, etc.) once on that entity. Every résumé that mentions the organization would then simply reference its ID, and @@ -350,7 +350,7 @@ denormalized representation consistent. However, the implementation of materialized timelines at X (formerly Twitter) does not store the actual text of each post: each entry actually only stores the post ID, the ID of the user who posted it, and a little bit of extra information to identify reposts and replies -[[11](/en/ch3#Krikorian2012_ch3)]. +[^11]. In other words, it is a precomputed result of (approximately) the following query: ``` @@ -366,7 +366,7 @@ the post ID to fetch the actual post content (as well as statistics such as the and replies), and look up the sender’s profile by ID (to get their username, profile picture, and other details). This process of looking up the human-readable information by ID is called *hydrating* the IDs, and it is essentially a join performed in application code -[[11](/en/ch3#Krikorian2012_ch3)]. +[^11]. 
The reason for storing only IDs in the precomputed timeline is that the data they refer to is fast-changing: the number of likes and replies may change multiple times per second on a popular @@ -453,7 +453,7 @@ support are able to create such indexes on values inside a document. Data warehouses (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are usually relational, and there are a few widely-used conventions for the structure of tables in a data warehouse: a *star schema*, *snowflake schema*, *dimensional modeling* -[[12](/en/ch3#Kimball2013_ch3)], +[^12], and *one big table* (OBT). These structures are optimized for the needs of business analysts. ETL processes translate data from operational systems into this schema. @@ -498,7 +498,7 @@ product categories, and each row in the `dim_product` table could reference the as foreign keys, rather than storing them as strings in the `dim_product` table. Snowflake schemas are more normalized than star schemas, but star schemas are often preferred because they are simpler for analysts to work with -[[12](/en/ch3#Kimball2013_ch3)]. +[^12]. In a typical data warehouse, tables are often quite wide: fact tables often have over 100 columns, sometimes several hundred. Dimension tables can also be wide, as they include all the metadata that @@ -519,7 +519,7 @@ Some data warehouse schemas take denormalization even further and leave out the entirely, folding the information in the dimensions into denormalized columns on the fact table instead (essentially, precomputing the join between the fact table and the dimension tables). This approach is known as *one big table* (OBT), and while it requires more storage space, it sometimes -enables faster queries [[13](/en/ch3#Kaminsky2022)]. +enables faster queries [^13]. 
In the context of analytics, such denormalization is unproblematic, since the data typically represents a log of historical data that is not going to change (except maybe for occasionally @@ -564,23 +564,23 @@ reading, clients have no guarantees as to what fields the documents may contain. Document databases are sometimes called *schemaless*, but that’s misleading, as the code that reads the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not -enforced by the database [[17](/en/ch3#Schemaless)]. +enforced by the database [^17]. A more accurate term is *schema-on-read* (the structure of the data is implicit, and only interpreted when the data is read), in contrast with *schema-on-write* (the traditional approach of relational databases, where the schema is explicit and the database ensures all data conforms to it -when the data is written) [[18](/en/ch3#Awadallah2009)]. +when the data is written) [^18]. Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas schema-on-write is similar to static (compile-time) type checking. Just as the advocates of static and dynamic type checking have big debates about their relative merits -[[19](/en/ch3#Odersky2013)], +[^19], enforcement of schemas in database is a contentious topic, and in general there’s no right or wrong answer. The difference between the approaches is particularly noticeable in situations where an application wants to change the format of its data. For example, say you are currently storing each user’s full name in one field, and you instead want to store the first name and last name separately -[[20](/en/ch3#Irwin2013)]. +[^20]. In a document database, you would just start writing new documents with the new fields and have code in the application that handles the case when old documents are read. For example: @@ -647,12 +647,12 @@ However, the idea of storing related data together for locality is not limited t model. 
For example, Google’s Spanner database offers the same locality properties in a relational data model, by allowing the schema to declare that a table’s rows should be interleaved (nested) within a parent table -[[25](/en/ch3#Corbett2012_ch2)]. +[^25]. Oracle allows the same, using a feature called *multi-table index cluster tables* -[[26](/en/ch3#BurlesonCluster)]. +[^26]. The *wide-column* data model popularized by Google’s Bigtable, and used e.g. in HBase and Accumulo, has a concept of *column families*, which have a similar purpose of managing locality -[[27](/en/ch3#Chang2006_ch3)]. +[^27]. ### Query languages for documents @@ -663,9 +663,9 @@ to query for values inside documents, and some provide rich query languages. XML databases are often queried using XQuery and XPath, which are designed to allow complex queries, including joins across multiple documents, and also format their results as XML -[[28](/en/ch3#Walmsley2015)]. JSON Pointer -[[29](/en/ch3#Bryan2013)] and JSONPath -[[30](/en/ch3#Goessner2024)] provide an equivalent to XPath for JSON. +[^28]. JSON Pointer +[^29] and JSONPath +[^30] provide an equivalent to XPath for JSON. MongoDB’s aggregation pipeline, whose `$lookup` operator for joins we saw in [“Normalization, Denormalization, and Joins”](/en/ch3#sec_datamodels_normalization), is an example of a query language for collections of JSON @@ -716,7 +716,7 @@ matter of taste. Document databases and relational databases started out as very different approaches to data management, but they have grown more similar over time -[[31](/en/ch3#Stonebraker2024)]. +[^31]. Relational databases added support for JSON types and query operators, and the ability to index properties inside documents. Some document databases (such as MongoDB, Couchbase, and RethinkDB) added support for joins, secondary indexes, and declarative query languages. @@ -730,7 +730,7 @@ combination. 
###### Note Codd’s original description of the relational model -[[3](/en/ch3#Codd1970)] actually allowed something similar to JSON +[^3] actually allowed something similar to JSON within a relational schema. He called it *nonsimple domains*. The idea was that a value in a row doesn’t have to just be a primitive datatype like a number or a string, but it could also be a nested relation (table)—so you can have an arbitrarily nested tree structure as a value, much like @@ -763,7 +763,7 @@ Well-known algorithms can operate on these graphs: for example, map navigation a the shortest path between two points in a road network, and PageRank can be used on the web graph to determine the popularity of a web page and thus its ranking in search results -[[32](/en/ch3#Page1999)]. +[^32]. Graphs can be represented in several different ways. In the *adjacency list* model, each vertex stores the IDs of its neighbor vertices that are one edge away. Alternatively, you can use an @@ -781,24 +781,24 @@ types of objects in a single database. For example: represent people, locations, events, checkins, and comments made by users; edges indicate which people are friends with each other, which checkin happened in which location, who commented on which post, who attended which event, and so on - [[33](/en/ch3#Bronson2013)]. + [^33]. * Knowledge graphs are used by search engines to record facts about entities that often occur in search queries, such as organizations, people, and places - [[34](/en/ch3#Noy2019)]. + [^34]. This information is obtained by crawling and analyzing the text on websites; some websites, such as Wikidata, also publish graph data in a structured form. There are several different, but related, ways of structuring and querying data in graphs. 
In this section we will discuss the *property graph* model (implemented by Neo4j, Memgraph, KùzuDB -[[35](/en/ch3#Feng2023)], -and others [[36](/en/ch3#Besta2019)]) +[^35], +and others [^36]) and the *triple-store* model (implemented by Datomic, AllegroGraph, Blazegraph, and others). These models are fairly similar in what they can express, and some graph databases (such as Amazon Neptune) support both models. We will also look at four query languages for graphs (Cypher, SPARQL, Datalog, and GraphQL), as well as SQL support for querying graphs. Other graph query languages exist, such as Gremlin -[[37](/en/ch3#TinkerPop2023)], +[^37], but these will give us a representative overview. To illustrate these different languages and models, this section uses the graph shown in @@ -902,12 +902,12 @@ extended to accommodate changes in your application’s data structures. *Cypher* is a query language for property graphs, originally created for the Neo4j graph database, and later developed into an open standard as *openCypher* -[[38](/en/ch3#Francis2018)]. +[^38]. Besides Neo4j, Cypher is supported by Memgraph, KùzuDB -[[35](/en/ch3#Feng2023)], +[^35], Amazon Neptune, Apache AGE (with storage in PostgreSQL), and others. It is named after a character in the movie *The Matrix* and is not related to ciphers in cryptography -[[39](/en/ch3#EifremTweet)]. +[^39]. [Example 3-4](/en/ch3#fig_cypher_create) shows the Cypher query to insert the lefthand portion of [Figure 3-6](/en/ch3#fig_datamodels_graph) into a graph database. The rest of the graph can be added similarly. Each @@ -1069,17 +1069,17 @@ JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id; The fact that a 4-line Cypher query requires 31 lines in SQL shows how much of a difference the right choice of data model and query language can make. 
And this is just the beginning; there are more details to consider, e.g., around handling cycles, and choosing between breadth-first or -depth-first traversal [[40](/en/ch3#Tisiot2021)]. +depth-first traversal [^40]. Oracle has a different SQL extension for recursive queries, which it calls *hierarchical* -[[41](/en/ch3#Goel2020)]. +[^41]. However, the situation may be improving: at the time of writing, there are plans to add a graph query language called GQL to the SQL standard [[42](/en/ch3#Deutsch2022), [43](/en/ch3#Green2019)], which will provide a syntax inspired by Cypher, GSQL -[[44](/en/ch3#Deutsch2018)], and PGQL -[[45](/en/ch3#vanRest2016)]. +[^44], and PGQL +[^45]. ## Triple-Stores and SPARQL @@ -1107,15 +1107,15 @@ The subject of a triple is equivalent to a vertex in a graph. The object is one To be precise, databases that offer a triple-like data model often need to store some additional metadata on each tuple. For example, AWS Neptune uses quads (4-tuples) by adding a graph ID to each -triple [[46](/en/ch3#NeptuneDataModel)]; +triple [^46]; Datomic uses 5-tuples, extending each triple with a transaction ID and a boolean to indicate -deletion [[47](/en/ch3#DatomicDataModel)]. +deletion [^47]. Since these databases retain the basic *subject-predicate-object* structure explained above, this book nevertheless calls them triple-stores. [Example 3-7](/en/ch3#fig_graph_n3_triples) shows the same data as in [Example 3-4](/en/ch3#fig_cypher_create), written as triples in a format called *Turtle*, a subset of *Notation3* (*N3*) -[[48](/en/ch3#Beckett2011)]. +[^48]. ##### Example 3-7. 
A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as Turtle triples @@ -1166,13 +1166,13 @@ Web as originally envisioned did not succeed [[49](/en/ch3#Target2018), [50](/en/ch3#MendelGleason2022)], the legacy of the Semantic Web project lives on in a couple of specific technologies: *linked data* -standards such as JSON-LD [[51](/en/ch3#Sporny2014)], +standards such as JSON-LD [^51], *ontologies* used in biomedical science -[[52](/en/ch3#MichiganOntologies)], +[^52], Facebook’s Open Graph protocol -[[53](/en/ch3#OpenGraph)] +[^53] (which is used for link unfurling -[[54](/en/ch3#Haughey2015)]), +[^54]), knowledge graphs such as Wikidata, and standardized vocabularies for structured data maintained by [`schema.org`](https://schema.org/). @@ -1184,7 +1184,7 @@ for applications. The Turtle language we used in [Example 3-8](/en/ch3#fig_graph_n3_shorthand) is actually a way of encoding data in the *Resource Description Framework* (RDF) -[[55](/en/ch3#W3CRDF)], +[^55], a data model that was designed for the Semantic Web. RDF data can also be encoded in other ways, for example (more verbosely) in XML, as shown in [Example 3-9](/en/ch3#fig_graph_rdf_xml). Tools like Apache Jena can automatically convert between different RDF encodings. @@ -1235,7 +1235,7 @@ just specify this prefix once at the top of the file, and then forget about it. ### The SPARQL query language *SPARQL* is a query language for triple-stores using the RDF data model -[[56](/en/ch3#Harris2013)]. +[^56]. (It is an acronym for *SPARQL Protocol and RDF Query Language*, pronounced “sparkle.”) It predates Cypher, and since Cypher’s pattern matching is borrowed from SPARQL, they look quite similar. @@ -1275,7 +1275,7 @@ bound to any vertex that has a `name` property whose value is the string `"Unite ``` SPARQL is supported by Amazon Neptune, AllegroGraph, Blazegraph, OpenLink Virtuoso, Apache Jena, and -various other triple stores [[36](/en/ch3#Besta2019)]. 
+various other triple stores [^36]. ## Datalog: Recursive Relational Queries @@ -1286,7 +1286,7 @@ Datalog is a much older language than SPARQL or Cypher: it arose from academic r It is less well known among software engineers and not widely supported in mainstream databases, but it ought to be better-known since it is a very expressive language that is particularly powerful for complex queries. Several niche databases, including Datomic, LogicBlox, CozoDB, and LinkedIn’s -LIquid [[60](/en/ch3#Meyer2020)] use Datalog as +LIquid [^60] use Datalog as their query language. Datalog is actually based on a relational data model, not a graph, but it appears in the graph @@ -1403,7 +1403,7 @@ APIs. GraphQL’s flexibility comes at a cost. Organizations that adopt GraphQL often need tooling to convert GraphQL queries into requests to internal services, which often use REST or gRPC (see [Chapter 5](/en/ch5#ch_encoding)). Authorization, rate limiting, and performance challenges are additional concerns -[[61](/en/ch3#Bessey2024)]. +[^61]. GraphQL’s query language is also limited since GraphQL come from an untrusted source. The language does not allow anything that could be expensive to execute, since otherwise users could perform denial-of-service attacks on a server by running lots of expensive queries. In particular, GraphQL @@ -1547,7 +1547,7 @@ known as *event sourcing* [[62](/en/ch3#Betts2012), [63](/en/ch3#Young2014)]. The principle of maintaining separate read-optimized representations and deriving them from the write-optimized representation is called *command query responsibility segregation (CQRS)* -[[64](/en/ch3#Young2010)]. +[^64]. These terms originated in the domain-driven design (DDD) community, although similar ideas have been around for a long time, for example in *state machine replication* (see [“Using shared logs”](/en/ch10#sec_consistency_smr)). @@ -1661,7 +1661,7 @@ users. 
Dataframe APIs also offer a wide variety of operations that go far beyond what relational databases offer, and the data model is often used in ways that are very different from typical relational data -modelling [[65](/en/ch3#Petersohn2020)]. +modelling [^65]. For example, a common use of dataframes is to transform data from a relational-like representation into a matrix or multidimensional array representation, which is the form that many machine learning algorithms expect of their input. @@ -1698,14 +1698,14 @@ into a matrix representation, while giving the data scientist control over the r is most suitable for achieving the goals of the data analysis or model training process. There are also databases such as TileDB -[[66](/en/ch3#Papadopoulos2016)] +[^66] that specialize in storing large multidimensional arrays of numbers; they are called *array databases* and are most commonly used for scientific datasets such as geospatial measurements (raster data on a regularly spaced grid), medical imaging, or observations from astronomical -telescopes [[67](/en/ch3#Rusu2022)]. +telescopes [^67]. Dataframes are also used in the financial industry for representing *time series data*, such as the prices of assets and trades over time -[[68](/en/ch3#Targett2023)]. +[^68]. # Summary @@ -1757,7 +1757,7 @@ a few brief examples: means taking one very long string (representing a DNA molecule) and matching it against a large database of strings that are similar, but not identical. None of the databases described here can handle this kind of usage, which is why researchers have written specialized genome database - software like GenBank [[69](/en/ch3#Benson2007)]. + software like GenBank [^69]. * Many financial systems use *ledgers* with double-entry accounting as their data model. This type of data can be represented in relational databases, but there are also databases such as TigerBeetle that specialize in this data model. 
Cryptocurrencies and blockchains are typically @@ -1771,361 +1771,78 @@ come into play when *implementing* the data models described in this chapter. ##### Footnotes + ##### References -[[1](/en/ch3#Brandon2024-marker)] Jamie Brandon. -[Unexplanations: -query optimization works because sql is declarative](https://www.scattered-thoughts.net/writing/unexplanations-sql-declarative/). *scattered-thoughts.net*, February 2024. -Archived at [perma.cc/P6W2-WMFZ](https://perma.cc/P6W2-WMFZ) -[[2](/en/ch3#Hellerstein2010-marker)] Joseph M. Hellerstein. -[The Declarative -Imperative: Experiences and Conjectures in Distributed Logic](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf). Tech report UCB/EECS-2010-90, -Electrical Engineering and Computer Sciences, University of California at Berkeley, June 2010. -Archived at [perma.cc/K56R-VVQM](https://perma.cc/K56R-VVQM) - -[[3](/en/ch3#Codd1970-marker)] Edgar F. Codd. -[A Relational Model of Data for Large -Shared Data Banks](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf). *Communications of the ACM*, volume 13, issue 6, pages 377–387, June 1970. -[doi:10.1145/362384.362685](https://doi.org/10.1145/362384.362685) - -[[4](/en/ch3#Stonebraker2005around-marker)] Michael Stonebraker and Joseph M. Hellerstein. -[What Goes Around Comes Around](http://mitpress2.mit.edu/books/chapters/0262693143chapm1.pdf). -In *Readings in Database Systems*, 4th edition, MIT Press, pages 2–41, 2005. -ISBN: 9780262693141 - -[[5](/en/ch3#Winand2015-marker)] Markus Winand. -[Modern SQL: Beyond Relational](https://modern-sql.com/). *modern-sql.com*, 2015. -Archived at [perma.cc/D63V-WAPN](https://perma.cc/D63V-WAPN) - -[[6](/en/ch3#Fowler2012-marker)] Martin Fowler. -[OrmHate](https://martinfowler.com/bliki/OrmHate.html). *martinfowler.com*, May -2012. Archived at [perma.cc/VCM8-PKNG](https://perma.cc/VCM8-PKNG) - -[[7](/en/ch3#Mihalcea2023-marker)] Vlad Mihalcea. 
-[N+1 query problem with JPA and Hibernate](https://vladmihalcea.com/n-plus-1-query-problem/). -*vladmihalcea.com*, January 2023. -Archived at [perma.cc/79EV-TZKB](https://perma.cc/79EV-TZKB) - -[[8](/en/ch3#Schauder2023-marker)] Jens Schauder. -[This -is the Beginning of the End of the N+1 Problem: Introducing Single Query Loading](https://spring.io/blog/2023/08/31/this-is-the-beginning-of-the-end-of-the-n-1-problem-introducing-single-query). *spring.io*, August 2023. -Archived at [perma.cc/6V96-R333](https://perma.cc/6V96-R333) - -[[9](/en/ch3#Zola2014-marker)] William Zola. -[6 Rules of -Thumb for MongoDB Schema Design](https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design). *mongodb.com*, June 2014. -Archived at [perma.cc/T2BZ-PPJB](https://perma.cc/T2BZ-PPJB) - -[[10](/en/ch3#Andrews2023-marker)] Sidney Andrews and Christopher McClister. -[Data modeling in -Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data). *learn.microsoft.com*, February 2023. Archived at -[archive.org](https://web.archive.org/web/20230207193233/https%3A//learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data) - -[[11](/en/ch3#Krikorian2012_ch3-marker)] Raffi Krikorian. -[Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). -At *QCon San Francisco*, November 2012. -Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK) - -[[12](/en/ch3#Kimball2013_ch3-marker)] Ralph Kimball and Margy Ross. -[*The Data -Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*](https://learning.oreilly.com/library/view/the-data-warehouse/9781118530801/), -3rd edition. John Wiley & Sons, July 2013. ISBN: 9781118530801 - -[[13](/en/ch3#Kaminsky2022-marker)] Michael Kaminsky. -[Data warehouse modeling: Star schema vs. -OBT](https://www.fivetran.com/blog/star-schema-vs-obt). *fivetran.com*, August 2022. 
-Archived at [perma.cc/2PZK-BFFP](https://perma.cc/2PZK-BFFP) - -[[14](/en/ch3#Nelson2018-marker)] Joe Nelson. -[User-defined Order in -SQL](https://begriffs.com/posts/2018-03-20-user-defined-order.html). *begriffs.com*, March 2018. -Archived at [perma.cc/GS3W-F7AD](https://perma.cc/GS3W-F7AD) - -[[15](/en/ch3#Wallace2017-marker)] Evan Wallace. -[Realtime Editing of -Ordered Sequences](https://www.figma.com/blog/realtime-editing-of-ordered-sequences/). *figma.com*, March 2017. -Archived at [perma.cc/K6ER-CQZW](https://perma.cc/K6ER-CQZW) - -[[16](/en/ch3#Greenspan2020-marker)] David Greenspan. -[Implementing -Fractional Indexing](https://observablehq.com/%40dgreensp/implementing-fractional-indexing). *observablehq.com*, October 2020. -Archived at [perma.cc/5N4R-MREN](https://perma.cc/5N4R-MREN) - -[[17](/en/ch3#Schemaless-marker)] Martin Fowler. -[Schemaless Data Structures](https://martinfowler.com/articles/schemaless/). -*martinfowler.com*, January 2013. - -[[18](/en/ch3#Awadallah2009-marker)] Amr Awadallah. -[Schema-on-Read vs. -Schema-on-Write](https://www.slideshare.net/awadallah/schemaonread-vs-schemaonwrite). At *Berkeley EECS RAD Lab Retreat*, Santa Cruz, CA, May 2009. -Archived at [perma.cc/DTB2-JCFR](https://perma.cc/DTB2-JCFR) - -[[19](/en/ch3#Odersky2013-marker)] Martin Odersky. -[The Trouble with Types](https://www.infoq.com/presentations/data-types-issues/). -At *Strange Loop*, September 2013. -Archived at [perma.cc/85QE-PVEP](https://perma.cc/85QE-PVEP) - -[[20](/en/ch3#Irwin2013-marker)] Conrad Irwin. -[MongoDB—Confessions -of a PostgreSQL Lover](https://speakerdeck.com/conradirwin/mongodb-confessions-of-a-postgresql-lover). At *HTML5DevConf*, October 2013. -Archived at [perma.cc/C2J6-3AL5](https://perma.cc/C2J6-3AL5) - -[[21](/en/ch3#Percona2023-marker)] [Percona -Toolkit Documentation: pt-online-schema-change](https://docs.percona.com/percona-toolkit/pt-online-schema-change.html). *docs.percona.com*, 2023. 
-Archived at [perma.cc/9K8R-E5UH](https://perma.cc/9K8R-E5UH) - -[[22](/en/ch3#Noach2016-marker)] Shlomi Noach. -[gh-ost: -GitHub’s Online Schema Migration Tool for MySQL](https://github.blog/2016-08-01-gh-ost-github-s-online-migration-tool-for-mysql/). *github.blog*, August 2016. -Archived at [perma.cc/7XAG-XB72](https://perma.cc/7XAG-XB72) - -[[23](/en/ch3#Mukherjee2022-marker)] Shayon Mukherjee. -[pg-osc: -Zero downtime schema changes in PostgreSQL](https://www.shayon.dev/post/2022/47/pg-osc-zero-downtime-schema-changes-in-postgresql/). *shayon.dev*, February 2022. -Archived at [perma.cc/35WN-7WMY](https://perma.cc/35WN-7WMY) - -[[24](/en/ch3#PerezAradros2023-marker)] Carlos Pérez-Aradros Herce. -[Introducing pgroll: zero-downtime, -reversible, schema migrations for Postgres](https://xata.io/blog/pgroll-schema-migrations-postgres). *xata.io*, October 2023. Archived at -[archive.org](https://web.archive.org/web/20231008161750/https%3A//xata.io/blog/pgroll-schema-migrations-postgres) - -[[25](/en/ch3#Corbett2012_ch2-marker)] James C. Corbett, Jeffrey Dean, Michael -Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher -Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, -Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, -Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. -[Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). -At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), -October 2012. - -[[26](/en/ch3#BurlesonCluster-marker)] Donald K. Burleson. -[Reduce I/O with Oracle -Cluster Tables](http://www.dba-oracle.com/oracle_tip_hash_index_cluster_table.htm). *dba-oracle.com*. -Archived at [perma.cc/7LBJ-9X2C](https://perma.cc/7LBJ-9X2C) - -[[27](/en/ch3#Chang2006_ch3-marker)] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, -Wilson C. 
Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. -[Bigtable: A Distributed Storage System for -Structured Data](https://research.google/pubs/pub27898/). At *7th USENIX Symposium on Operating System Design and Implementation* -(OSDI), November 2006. - -[[28](/en/ch3#Walmsley2015-marker)] Priscilla Walmsley. -[*XQuery, -2nd Edition*](https://learning.oreilly.com/library/view/xquery-2nd-edition/9781491915080/). O’Reilly Media, December 2015. ISBN: 9781491915080 - -[[29](/en/ch3#Bryan2013-marker)] Paul C. Bryan, Kris Zyp, and Mark Nottingham. -[JavaScript Object Notation (JSON) Pointer](https://www.rfc-editor.org/rfc/rfc6901). -RFC 6901, IETF, April 2013. - -[[30](/en/ch3#Goessner2024-marker)] Stefan Gössner, Glyn Normington, and Carsten Bormann. -[JSONPath: Query Expressions for JSON](https://www.rfc-editor.org/rfc/rfc9535.html). -RFC 9535, IETF, February 2024. - -[[31](/en/ch3#Stonebraker2024-marker)] Michael Stonebraker and Andrew Pavlo. -[What Goes Around Comes -Around… And Around…](https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf). *ACM SIGMOD Record*, volume 53, issue 2, pages 21–37. -[doi:10.1145/3685980.3685984](https://doi.org/10.1145/3685980.3685984) - -[[32](/en/ch3#Page1999-marker)] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. -[The PageRank Citation Ranking: Bringing Order to the Web](http://ilpubs.stanford.edu:8090/422/). -Technical Report 1999-66, Stanford University InfoLab, November 1999. -Archived at [perma.cc/UML9-UZHW](https://perma.cc/UML9-UZHW) - -[[33](/en/ch3#Bronson2013-marker)] Nathan Bronson, Zach Amsden, George Cabrera, -Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, -Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. -[TAO: -Facebook’s Distributed Data Store for the Social Graph](https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson). 
At *USENIX Annual Technical -Conference* (ATC), June 2013. - -[[34](/en/ch3#Noy2019-marker)] Natasha Noy, Yuqing Gao, Anshu Jain, Anant Narayanan, -Alan Patterson, and Jamie Taylor. -[Industry-Scale -Knowledge Graphs: Lessons and Challenges](https://cacm.acm.org/magazines/2019/8/238342-industry-scale-knowledge-graphs/fulltext). *Communications of the ACM*, volume 62, issue -8, pages 36–43, August 2019. -[doi:10.1145/3331166](https://doi.org/10.1145/3331166) - -[[35](/en/ch3#Feng2023-marker)] Xiyang Feng, Guodong Jin, Ziyi Chen, Chang Liu, and Semih Salihoğlu. -[KÙZU Graph Database Management System](https://www.cidrdb.org/cidr2023/papers/p48-jin.pdf). -At *3th Annual Conference on Innovative Data Systems Research* (CIDR 2023), January 2023. - -[[36](/en/ch3#Besta2019-marker)] Maciej Besta, Emanuel Peter, Robert -Gerstenberger, Marc Fischer, Michał Podstawski, Claude Barthels, Gustavo Alonso, Torsten Hoefler. -[Demystifying Graph Databases: Analysis and Taxonomy -of Data Organization, System Designs, and Graph Queries](https://arxiv.org/pdf/1910.09017.pdf). *arxiv.org*, October 2019. - -[[37](/en/ch3#TinkerPop2023-marker)] [Apache -TinkerPop 3.6.3 Documentation](https://tinkerpop.apache.org/docs/3.6.3/reference/). *tinkerpop.apache.org*, May 2023. -Archived at [perma.cc/KM7W-7PAT](https://perma.cc/KM7W-7PAT) - -[[38](/en/ch3#Francis2018-marker)] Nadime Francis, Alastair Green, Paolo Guagliardo, -Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and -Andrés Taylor. [Cypher: An Evolving Query -Language for Property Graphs](https://core.ac.uk/download/pdf/158372754.pdf). At *International Conference on Management of Data* -(SIGMOD), pages 1433–1445, May 2018. -[doi:10.1145/3183713.3190657](https://doi.org/10.1145/3183713.3190657) - -[[39](/en/ch3#EifremTweet-marker)] Emil Eifrem. -[Twitter correspondence](https://twitter.com/emileifrem/status/419107961512804352), -January 2014. 
Archived at [perma.cc/WM4S-BW64](https://perma.cc/WM4S-BW64) - -[[40](/en/ch3#Tisiot2021-marker)] Francesco Tisiot. -[Explore -the new SEARCH and CYCLE features in PostgreSQL® 14](https://aiven.io/blog/explore-the-new-search-and-cycle-features-in-postgresql-14). *aiven.io*, December 2021. -Archived at [perma.cc/J6BT-83UZ](https://perma.cc/J6BT-83UZ) - -[[41](/en/ch3#Goel2020-marker)] Gaurav Goel. -[Understanding -Hierarchies in Oracle](https://towardsdatascience.com/understanding-hierarchies-in-oracle-43f85561f3d9). *towardsdatascience.com*, May 2020. -Archived at [perma.cc/5ZLR-Q7EW](https://perma.cc/5ZLR-Q7EW) - -[[42](/en/ch3#Deutsch2022-marker)] Alin -Deutsch, Nadime Francis, Alastair Green, Keith Hare, Bei Li, Leonid Libkin, Tobias Lindaaker, Victor -Marsault, Wim Martens, Jan Michels, Filip Murlak, Stefan Plantikow, Petra Selmer, Oskar van Rest, -Hannes Voigt, Domagoj Vrgoč, Mingxi Wu, and Fred Zemke. -[Graph Pattern Matching in GQL and SQL/PGQ](https://arxiv.org/abs/2112.06217). -At *International Conference on Management of Data* (SIGMOD), pages 2246–2258, June 2022. -[doi:10.1145/3514221.3526057](https://doi.org/10.1145/3514221.3526057) - -[[43](/en/ch3#Green2019-marker)] Alastair Green. -[SQL... and now GQL](https://opencypher.org/articles/2019/09/12/SQL-and-now-GQL/). -*opencypher.org*, September 2019. -Archived at [perma.cc/AFB2-3SY7](https://perma.cc/AFB2-3SY7) - -[[44](/en/ch3#Deutsch2018-marker)] Alin Deutsch, Yu Xu, and Mingxi Wu. -[Seamless -Syntactic and Semantic Integration of Query Primitives over Relational and Graph Data in GSQL](https://cdn2.hubspot.net/hubfs/4114546/IntegrationQuery%20PrimitivesGSQL.pdf). -*tigergraph.com*, November 2018. -Archived at [perma.cc/JG7J-Y35X](https://perma.cc/JG7J-Y35X) - -[[45](/en/ch3#vanRest2016-marker)] Oskar van Rest, Sungpack Hong, Jinha Kim, Xuming -Meng, and Hassan Chafi. [PGQL: a property -graph query language](https://event.cwi.nl/grades/2016/07-VanRest.pdf). 
At *4th International Workshop on Graph Data Management Experiences and -Systems* (GRADES), June 2016. -[doi:10.1145/2960414.2960421](https://doi.org/10.1145/2960414.2960421) - -[[46](/en/ch3#NeptuneDataModel-marker)] Amazon Web Services. -[Neptune -Graph Data Model](https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html). Amazon Neptune User Guide, *docs.aws.amazon.com*. -Archived at [perma.cc/CX3T-EZU9](https://perma.cc/CX3T-EZU9) - -[[47](/en/ch3#DatomicDataModel-marker)] Cognitect. -[Datomic Data Model](https://docs.datomic.com/cloud/whatis/data-model.html). -Datomic Cloud Documentation, *docs.datomic.com*. -Archived at [perma.cc/LGM9-LEUT](https://perma.cc/LGM9-LEUT) - -[[48](/en/ch3#Beckett2011-marker)] David Beckett and Tim Berners-Lee. -[Turtle – Terse RDF Triple Language](https://www.w3.org/TeamSubmission/turtle/). -W3C Team Submission, March 2011. - -[[49](/en/ch3#Target2018-marker)] Sinclair Target. -[Whatever Happened to the Semantic -Web?](https://twobithistory.org/2018/05/27/semantic-web.html) *twobithistory.org*, May 2018. -Archived at [perma.cc/M8GL-9KHS](https://perma.cc/M8GL-9KHS) - -[[50](/en/ch3#MendelGleason2022-marker)] Gavin Mendel-Gleason. -[The Semantic Web is Dead – Long Live -the Semantic Web!](https://terminusdb.com/blog/the-semantic-web-is-dead/) *terminusdb.com*, August 2022. -Archived at [perma.cc/G2MZ-DSS3](https://perma.cc/G2MZ-DSS3) - -[[51](/en/ch3#Sporny2014-marker)] Manu Sporny. -[JSON-LD and Why I Hate the Semantic Web](http://manu.sporny.org/2014/json-ld-origins-2/). -*manu.sporny.org*, January 2014. -Archived at [perma.cc/7PT4-PJKF](https://perma.cc/7PT4-PJKF) - -[[52](/en/ch3#MichiganOntologies-marker)] University of Michigan Library. -[Biomedical Ontologies and Controlled Vocabularies](https://guides.lib.umich.edu/ontology), -*guides.lib.umich.edu/ontology*. -Archived at [perma.cc/Q5GA-F2N8](https://perma.cc/Q5GA-F2N8) - -[[53](/en/ch3#OpenGraph-marker)] Facebook. 
-[The Open Graph protocol](https://ogp.me/), *ogp.me*. -Archived at [perma.cc/C49A-GUSY](https://perma.cc/C49A-GUSY) - -[[54](/en/ch3#Haughey2015-marker)] Matt Haughey. -[Everything -you ever wanted to know about unfurling but were afraid to ask /or/ How to make your site previews -look amazing in Slack](https://medium.com/slack-developer-blog/everything-you-ever-wanted-to-know-about-unfurling-but-were-afraid-to-ask-or-how-to-make-your-e64b4bb9254). *medium.com*, November 2015. -Archived at [perma.cc/C7S8-4PZN](https://perma.cc/C7S8-4PZN) - -[[55](/en/ch3#W3CRDF-marker)] W3C RDF Working Group. -[Resource Description Framework (RDF)](https://www.w3.org/RDF/). -*w3.org*, February 2004. - -[[56](/en/ch3#Harris2013-marker)] Steve Harris, Andy Seaborne, and Eric -Prud’hommeaux. [SPARQL 1.1 Query Language](https://www.w3.org/TR/sparql11-query/). -W3C Recommendation, March 2013. - -[[57](/en/ch3#Green2013-marker)] Todd J. Green, Shan Shan Huang, Boon Thau Loo, and Wenchao Zhou. -[Datalog and Recursive -Query Processing](http://blogs.evergreen.edu/sosw/files/2014/04/Green-Vol5-DBS-017.pdf). *Foundations and Trends in Databases*, volume 5, issue 2, pages 105–195, -November 2013. [doi:10.1561/1900000017](https://doi.org/10.1561/1900000017) - -[[58](/en/ch3#Ceri1989-marker)] Stefano Ceri, Georg Gottlob, and Letizia Tanca. -[What -You Always Wanted to Know About Datalog (And Never Dared to Ask)](https://www.researchgate.net/profile/Letizia_Tanca/publication/3296132_What_you_always_wanted_to_know_about_Datalog_and_never_dared_to_ask/links/0fcfd50ca2d20473ca000000.pdf). *IEEE Transactions on -Knowledge and Data Engineering*, volume 1, issue 1, pages 146–166, March 1989. -[doi:10.1109/69.43410](https://doi.org/10.1109/69.43410) - -[[59](/en/ch3#Abiteboul1995-marker)] Serge Abiteboul, Richard Hull, and Victor Vianu. -[*Foundations of Databases*](http://webdam.inria.fr/Alice/). Addison-Wesley, 1995. 
-ISBN: 9780201537710, available online at -[*webdam.inria.fr/Alice*](http://webdam.inria.fr/Alice/) - -[[60](/en/ch3#Meyer2020-marker)] Scott Meyer, Andrew Carter, and Andrew Rodriguez. -[LIquid: -The soul of a new graph database, Part 2](https://engineering.linkedin.com/blog/2020/liquid--the-soul-of-a-new-graph-database--part-2). *engineering.linkedin.com*, September 2020. -Archived at [perma.cc/K9M4-PD6Q](https://perma.cc/K9M4-PD6Q) - -[[61](/en/ch3#Bessey2024-marker)] Matt Bessey. -[Why, after 6 years, I’m over -GraphQL](https://bessey.dev/blog/2024/05/24/why-im-over-graphql/). *bessey.dev*, May 2024. Archived at -[perma.cc/2PAU-JYRA](https://perma.cc/2PAU-JYRA) - -[[62](/en/ch3#Betts2012-marker)] Dominic Betts, Julián -Domínguez, Grigori Melnik, Fernando Simonazzi, and Mani Subramanian. -[*Exploring -CQRS and Event Sourcing*](https://learn.microsoft.com/en-us/previous-versions/msp-n-p/jj554200%28v%3Dpandp.10%29). Microsoft Patterns & Practices, July 2012. -ISBN: 1621140164, archived at [perma.cc/7A39-3NM8](https://perma.cc/7A39-3NM8) - -[[63](/en/ch3#Young2014-marker)] Greg Young. -[CQRS and Event Sourcing](https://www.youtube.com/watch?v=JHGkaShoyNs). At *Code on -the Beach*, August 2014. - -[[64](/en/ch3#Young2010-marker)] Greg Young. -[CQRS Documents](https://cqrs.files.wordpress.com/2010/11/cqrs_documents.pdf). -*cqrs.wordpress.com*, November 2010. -Archived at [perma.cc/X5R6-R47F](https://perma.cc/X5R6-R47F) - -[[65](/en/ch3#Petersohn2020-marker)] Devin Petersohn, Stephen Macke, Doris -Xin, William Ma, Doris Lee, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. -Joseph, and Aditya Parameswaran. -[Towards Scalable Dataframe Systems](https://www.vldb.org/pvldb/vol13/p2033-petersohn.pdf). -*Proceedings of the VLDB Endowment*, volume 13, issue 11, pages 2033–2046. 
-[doi:10.14778/3407790.3407807](https://doi.org/10.14778/3407790.3407807) - -[[66](/en/ch3#Papadopoulos2016-marker)] Stavros Papadopoulos, Kushal Datta, Samuel -Madden, and Timothy Mattson. -[The TileDB Array Data Storage Manager](https://www.vldb.org/pvldb/vol10/p349-papadopoulos.pdf). -*Proceedings of the VLDB Endowment*, volume 10, issue 4, pages 349–360, November 2016. -[doi:10.14778/3025111.3025117](https://doi.org/10.14778/3025111.3025117) - -[[67](/en/ch3#Rusu2022-marker)] Florin Rusu. -[Multidimensional -Array Data Management](https://faculty.ucmerced.edu/frusu/Papers/Report/2022-09-fntdb-arrays.pdf). *Foundations and Trends in Databases*, volume 12, numbers 2–3, -pages 69–220, February 2023. -[doi:10.1561/1900000069](https://doi.org/10.1561/1900000069) - -[[68](/en/ch3#Targett2023-marker)] Ed Targett. -[Bloomberg, -Man Group team up to develop open source “ArcticDB” database](https://www.thestack.technology/bloomberg-man-group-arcticdb-database-dataframe/). *thestack.technology*, -March 2023. Archived at [perma.cc/M5YD-QQYV](https://perma.cc/M5YD-QQYV) - -[[69](/en/ch3#Benson2007-marker)] Dennis A. Benson, Ilene -Karsch-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. -[GenBank](https://academic.oup.com/nar/article/36/suppl_1/D25/2507746). -*Nucleic Acids Research*, volume 36, database issue, pages D25–D30, December 2007. -[doi:10.1093/nar/gkm929](https://doi.org/10.1093/nar/gkm929) +[^1]: Jamie Brandon. [Unexplanations: query optimization works because sql is declarative](https://www.scattered-thoughts.net/writing/unexplanations-sql-declarative/). *scattered-thoughts.net*, February 2024. Archived at [perma.cc/P6W2-WMFZ](https://perma.cc/P6W2-WMFZ) +[^2]: Joseph M. Hellerstein. [The Declarative Imperative: Experiences and Conjectures in Distributed Logic](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf). 
Tech report UCB/EECS-2010-90, Electrical Engineering and Computer Sciences, University of California at Berkeley, June 2010. Archived at [perma.cc/K56R-VVQM](https://perma.cc/K56R-VVQM) +[^3]: Edgar F. Codd. [A Relational Model of Data for Large Shared Data Banks](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf). *Communications of the ACM*, volume 13, issue 6, pages 377–387, June 1970. [doi:10.1145/362384.362685](https://doi.org/10.1145/362384.362685) +[^4]: Michael Stonebraker and Joseph M. Hellerstein. [What Goes Around Comes Around](http://mitpress2.mit.edu/books/chapters/0262693143chapm1.pdf). In *Readings in Database Systems*, 4th edition, MIT Press, pages 2–41, 2005. ISBN: 9780262693141 +[^5]: Markus Winand. [Modern SQL: Beyond Relational](https://modern-sql.com/). *modern-sql.com*, 2015. Archived at [perma.cc/D63V-WAPN](https://perma.cc/D63V-WAPN) +[^6]: Martin Fowler. [OrmHate](https://martinfowler.com/bliki/OrmHate.html). *martinfowler.com*, May 2012. Archived at [perma.cc/VCM8-PKNG](https://perma.cc/VCM8-PKNG) +[^7]: Vlad Mihalcea. [N+1 query problem with JPA and Hibernate](https://vladmihalcea.com/n-plus-1-query-problem/). *vladmihalcea.com*, January 2023. Archived at [perma.cc/79EV-TZKB](https://perma.cc/79EV-TZKB) +[^8]: Jens Schauder. [This is the Beginning of the End of the N+1 Problem: Introducing Single Query Loading](https://spring.io/blog/2023/08/31/this-is-the-beginning-of-the-end-of-the-n-1-problem-introducing-single-query). *spring.io*, August 2023. Archived at [perma.cc/6V96-R333](https://perma.cc/6V96-R333) +[^9]: William Zola. [6 Rules of Thumb for MongoDB Schema Design](https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design). *mongodb.com*, June 2014. Archived at [perma.cc/T2BZ-PPJB](https://perma.cc/T2BZ-PPJB) +[^10]: Sidney Andrews and Christopher McClister. [Data modeling in Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data). *learn.microsoft.com*, February 2023. 
Archived at [archive.org](https://web.archive.org/web/20230207193233/https%3A//learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data) +[^11]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK) +[^12]: Ralph Kimball and Margy Ross. [*The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*](https://learning.oreilly.com/library/view/the-data-warehouse/9781118530801/), 3rd edition. John Wiley & Sons, July 2013. ISBN: 9781118530801 +[^13]: Michael Kaminsky. [Data warehouse modeling: Star schema vs. OBT](https://www.fivetran.com/blog/star-schema-vs-obt). *fivetran.com*, August 2022. Archived at [perma.cc/2PZK-BFFP](https://perma.cc/2PZK-BFFP) +[^14]: Joe Nelson. [User-defined Order in SQL](https://begriffs.com/posts/2018-03-20-user-defined-order.html). *begriffs.com*, March 2018. Archived at [perma.cc/GS3W-F7AD](https://perma.cc/GS3W-F7AD) +[^15]: Evan Wallace. [Realtime Editing of Ordered Sequences](https://www.figma.com/blog/realtime-editing-of-ordered-sequences/). *figma.com*, March 2017. Archived at [perma.cc/K6ER-CQZW](https://perma.cc/K6ER-CQZW) +[^16]: David Greenspan. [Implementing Fractional Indexing](https://observablehq.com/%40dgreensp/implementing-fractional-indexing). *observablehq.com*, October 2020. Archived at [perma.cc/5N4R-MREN](https://perma.cc/5N4R-MREN) +[^17]: Martin Fowler. [Schemaless Data Structures](https://martinfowler.com/articles/schemaless/). *martinfowler.com*, January 2013. +[^18]: Amr Awadallah. [Schema-on-Read vs. Schema-on-Write](https://www.slideshare.net/awadallah/schemaonread-vs-schemaonwrite). At *Berkeley EECS RAD Lab Retreat*, Santa Cruz, CA, May 2009. Archived at [perma.cc/DTB2-JCFR](https://perma.cc/DTB2-JCFR) +[^19]: Martin Odersky. [The Trouble with Types](https://www.infoq.com/presentations/data-types-issues/). At *Strange Loop*, September 2013. 
Archived at [perma.cc/85QE-PVEP](https://perma.cc/85QE-PVEP) +[^20]: Conrad Irwin. [MongoDB—Confessions of a PostgreSQL Lover](https://speakerdeck.com/conradirwin/mongodb-confessions-of-a-postgresql-lover). At *HTML5DevConf*, October 2013. Archived at [perma.cc/C2J6-3AL5](https://perma.cc/C2J6-3AL5) +[^21]: [Percona Toolkit Documentation: pt-online-schema-change](https://docs.percona.com/percona-toolkit/pt-online-schema-change.html). *docs.percona.com*, 2023. Archived at [perma.cc/9K8R-E5UH](https://perma.cc/9K8R-E5UH) +[^22]: Shlomi Noach. [gh-ost: GitHub’s Online Schema Migration Tool for MySQL](https://github.blog/2016-08-01-gh-ost-github-s-online-migration-tool-for-mysql/). *github.blog*, August 2016. Archived at [perma.cc/7XAG-XB72](https://perma.cc/7XAG-XB72) +[^23]: Shayon Mukherjee. [pg-osc: Zero downtime schema changes in PostgreSQL](https://www.shayon.dev/post/2022/47/pg-osc-zero-downtime-schema-changes-in-postgresql/). *shayon.dev*, February 2022. Archived at [perma.cc/35WN-7WMY](https://perma.cc/35WN-7WMY) +[^24]: Carlos Pérez-Aradros Herce. [Introducing pgroll: zero-downtime, reversible, schema migrations for Postgres](https://xata.io/blog/pgroll-schema-migrations-postgres). *xata.io*, October 2023. Archived at [archive.org](https://web.archive.org/web/20231008161750/https%3A//xata.io/blog/pgroll-schema-migrations-postgres) +[^25]: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. [Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), October 2012. +[^26]: Donald K. Burleson. 
[Reduce I/O with Oracle Cluster Tables](http://www.dba-oracle.com/oracle_tip_hash_index_cluster_table.htm). *dba-oracle.com*. Archived at [perma.cc/7LBJ-9X2C](https://perma.cc/7LBJ-9X2C) +[^27]: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. [Bigtable: A Distributed Storage System for Structured Data](https://research.google/pubs/pub27898/). At *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006. +[^28]: Priscilla Walmsley. [*XQuery, 2nd Edition*](https://learning.oreilly.com/library/view/xquery-2nd-edition/9781491915080/). O’Reilly Media, December 2015. ISBN: 9781491915080 +[^29]: Paul C. Bryan, Kris Zyp, and Mark Nottingham. [JavaScript Object Notation (JSON) Pointer](https://www.rfc-editor.org/rfc/rfc6901). RFC 6901, IETF, April 2013. +[^30]: Stefan Gössner, Glyn Normington, and Carsten Bormann. [JSONPath: Query Expressions for JSON](https://www.rfc-editor.org/rfc/rfc9535.html). RFC 9535, IETF, February 2024. +[^31]: Michael Stonebraker and Andrew Pavlo. [What Goes Around Comes Around… And Around…](https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf). *ACM SIGMOD Record*, volume 53, issue 2, pages 21–37. [doi:10.1145/3685980.3685984](https://doi.org/10.1145/3685980.3685984) +[^32]: Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. [The PageRank Citation Ranking: Bringing Order to the Web](http://ilpubs.stanford.edu:8090/422/). Technical Report 1999-66, Stanford University InfoLab, November 1999. Archived at [perma.cc/UML9-UZHW](https://perma.cc/UML9-UZHW) +[^33]: Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. 
[TAO: Facebook’s Distributed Data Store for the Social Graph](https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson). At *USENIX Annual Technical Conference* (ATC), June 2013. +[^34]: Natasha Noy, Yuqing Gao, Anshu Jain, Anant Narayanan, Alan Patterson, and Jamie Taylor. [Industry-Scale Knowledge Graphs: Lessons and Challenges](https://cacm.acm.org/magazines/2019/8/238342-industry-scale-knowledge-graphs/fulltext). *Communications of the ACM*, volume 62, issue 8, pages 36–43, August 2019. [doi:10.1145/3331166](https://doi.org/10.1145/3331166) +[^35]: Xiyang Feng, Guodong Jin, Ziyi Chen, Chang Liu, and Semih Salihoğlu. [KÙZU Graph Database Management System](https://www.cidrdb.org/cidr2023/papers/p48-jin.pdf). At *13th Annual Conference on Innovative Data Systems Research* (CIDR 2023), January 2023. +[^36]: Maciej Besta, Emanuel Peter, Robert Gerstenberger, Marc Fischer, Michał Podstawski, Claude Barthels, Gustavo Alonso, and Torsten Hoefler. [Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries](https://arxiv.org/pdf/1910.09017.pdf). *arxiv.org*, October 2019. +[^37]: [Apache TinkerPop 3.6.3 Documentation](https://tinkerpop.apache.org/docs/3.6.3/reference/). *tinkerpop.apache.org*, May 2023. Archived at [perma.cc/KM7W-7PAT](https://perma.cc/KM7W-7PAT) +[^38]: Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. [Cypher: An Evolving Query Language for Property Graphs](https://core.ac.uk/download/pdf/158372754.pdf). At *International Conference on Management of Data* (SIGMOD), pages 1433–1445, May 2018. [doi:10.1145/3183713.3190657](https://doi.org/10.1145/3183713.3190657) +[^39]: Emil Eifrem. [Twitter correspondence](https://twitter.com/emileifrem/status/419107961512804352), January 2014. Archived at [perma.cc/WM4S-BW64](https://perma.cc/WM4S-BW64) +[^40]: Francesco Tisiot.
[Explore the new SEARCH and CYCLE features in PostgreSQL® 14](https://aiven.io/blog/explore-the-new-search-and-cycle-features-in-postgresql-14). *aiven.io*, December 2021. Archived at [perma.cc/J6BT-83UZ](https://perma.cc/J6BT-83UZ) +[^41]: Gaurav Goel. [Understanding Hierarchies in Oracle](https://towardsdatascience.com/understanding-hierarchies-in-oracle-43f85561f3d9). *towardsdatascience.com*, May 2020. Archived at [perma.cc/5ZLR-Q7EW](https://perma.cc/5ZLR-Q7EW) +[^42]: Alin Deutsch, Nadime Francis, Alastair Green, Keith Hare, Bei Li, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Wim Martens, Jan Michels, Filip Murlak, Stefan Plantikow, Petra Selmer, Oskar van Rest, Hannes Voigt, Domagoj Vrgoč, Mingxi Wu, and Fred Zemke. [Graph Pattern Matching in GQL and SQL/PGQ](https://arxiv.org/abs/2112.06217). At *International Conference on Management of Data* (SIGMOD), pages 2246–2258, June 2022. [doi:10.1145/3514221.3526057](https://doi.org/10.1145/3514221.3526057) +[^43]: Alastair Green. [SQL... and now GQL](https://opencypher.org/articles/2019/09/12/SQL-and-now-GQL/). *opencypher.org*, September 2019. Archived at [perma.cc/AFB2-3SY7](https://perma.cc/AFB2-3SY7) +[^44]: Alin Deutsch, Yu Xu, and Mingxi Wu. [Seamless Syntactic and Semantic Integration of Query Primitives over Relational and Graph Data in GSQL](https://cdn2.hubspot.net/hubfs/4114546/IntegrationQuery%20PrimitivesGSQL.pdf). *tigergraph.com*, November 2018. Archived at [perma.cc/JG7J-Y35X](https://perma.cc/JG7J-Y35X) +[^45]: Oskar van Rest, Sungpack Hong, Jinha Kim, Xuming Meng, and Hassan Chafi. [PGQL: a property graph query language](https://event.cwi.nl/grades/2016/07-VanRest.pdf). At *4th International Workshop on Graph Data Management Experiences and Systems* (GRADES), June 2016. [doi:10.1145/2960414.2960421](https://doi.org/10.1145/2960414.2960421) +[^46]: Amazon Web Services. [Neptune Graph Data Model](https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html). 
Amazon Neptune User Guide, *docs.aws.amazon.com*. Archived at [perma.cc/CX3T-EZU9](https://perma.cc/CX3T-EZU9) +[^47]: Cognitect. [Datomic Data Model](https://docs.datomic.com/cloud/whatis/data-model.html). Datomic Cloud Documentation, *docs.datomic.com*. Archived at [perma.cc/LGM9-LEUT](https://perma.cc/LGM9-LEUT) +[^48]: David Beckett and Tim Berners-Lee. [Turtle – Terse RDF Triple Language](https://www.w3.org/TeamSubmission/turtle/). W3C Team Submission, March 2011. +[^49]: Sinclair Target. [Whatever Happened to the Semantic Web?](https://twobithistory.org/2018/05/27/semantic-web.html) *twobithistory.org*, May 2018. Archived at [perma.cc/M8GL-9KHS](https://perma.cc/M8GL-9KHS) +[^50]: Gavin Mendel-Gleason. [The Semantic Web is Dead – Long Live the Semantic Web!](https://terminusdb.com/blog/the-semantic-web-is-dead/) *terminusdb.com*, August 2022. Archived at [perma.cc/G2MZ-DSS3](https://perma.cc/G2MZ-DSS3) +[^51]: Manu Sporny. [JSON-LD and Why I Hate the Semantic Web](http://manu.sporny.org/2014/json-ld-origins-2/). *manu.sporny.org*, January 2014. Archived at [perma.cc/7PT4-PJKF](https://perma.cc/7PT4-PJKF) +[^52]: University of Michigan Library. [Biomedical Ontologies and Controlled Vocabularies](https://guides.lib.umich.edu/ontology), *guides.lib.umich.edu/ontology*. Archived at [perma.cc/Q5GA-F2N8](https://perma.cc/Q5GA-F2N8) +[^53]: Facebook. [The Open Graph protocol](https://ogp.me/), *ogp.me*. Archived at [perma.cc/C49A-GUSY](https://perma.cc/C49A-GUSY) +[^54]: Matt Haughey. [Everything you ever wanted to know about unfurling but were afraid to ask /or/ How to make your site previews look amazing in Slack](https://medium.com/slack-developer-blog/everything-you-ever-wanted-to-know-about-unfurling-but-were-afraid-to-ask-or-how-to-make-your-e64b4bb9254). *medium.com*, November 2015. Archived at [perma.cc/C7S8-4PZN](https://perma.cc/C7S8-4PZN) +[^55]: W3C RDF Working Group. [Resource Description Framework (RDF)](https://www.w3.org/RDF/). 
*w3.org*, February 2004. +[^56]: Steve Harris, Andy Seaborne, and Eric Prud’hommeaux. [SPARQL 1.1 Query Language](https://www.w3.org/TR/sparql11-query/). W3C Recommendation, March 2013. +[^57]: Todd J. Green, Shan Shan Huang, Boon Thau Loo, and Wenchao Zhou. [Datalog and Recursive Query Processing](http://blogs.evergreen.edu/sosw/files/2014/04/Green-Vol5-DBS-017.pdf). *Foundations and Trends in Databases*, volume 5, issue 2, pages 105–195, November 2013. [doi:10.1561/1900000017](https://doi.org/10.1561/1900000017) +[^58]: Stefano Ceri, Georg Gottlob, and Letizia Tanca. [What You Always Wanted to Know About Datalog (And Never Dared to Ask)](https://www.researchgate.net/profile/Letizia_Tanca/publication/3296132_What_you_always_wanted_to_know_about_Datalog_and_never_dared_to_ask/links/0fcfd50ca2d20473ca000000.pdf). *IEEE Transactions on Knowledge and Data Engineering*, volume 1, issue 1, pages 146–166, March 1989. [doi:10.1109/69.43410](https://doi.org/10.1109/69.43410) +[^59]: Serge Abiteboul, Richard Hull, and Victor Vianu. [*Foundations of Databases*](http://webdam.inria.fr/Alice/). Addison-Wesley, 1995. ISBN: 9780201537710, available online at [*webdam.inria.fr/Alice*](http://webdam.inria.fr/Alice/) +[^60]: Scott Meyer, Andrew Carter, and Andrew Rodriguez. [LIquid: The soul of a new graph database, Part 2](https://engineering.linkedin.com/blog/2020/liquid--the-soul-of-a-new-graph-database--part-2). *engineering.linkedin.com*, September 2020. Archived at [perma.cc/K9M4-PD6Q](https://perma.cc/K9M4-PD6Q) +[^61]: Matt Bessey. [Why, after 6 years, I’m over GraphQL](https://bessey.dev/blog/2024/05/24/why-im-over-graphql/). *bessey.dev*, May 2024. Archived at [perma.cc/2PAU-JYRA](https://perma.cc/2PAU-JYRA) +[^62]: Dominic Betts, Julián Domínguez, Grigori Melnik, Fernando Simonazzi, and Mani Subramanian. [*Exploring CQRS and Event Sourcing*](https://learn.microsoft.com/en-us/previous-versions/msp-n-p/jj554200%28v%3Dpandp.10%29). Microsoft Patterns & Practices, July 2012. 
ISBN: 1621140164, archived at [perma.cc/7A39-3NM8](https://perma.cc/7A39-3NM8) +[^63]: Greg Young. [CQRS and Event Sourcing](https://www.youtube.com/watch?v=JHGkaShoyNs). At *Code on the Beach*, August 2014. +[^64]: Greg Young. [CQRS Documents](https://cqrs.files.wordpress.com/2010/11/cqrs_documents.pdf). *cqrs.wordpress.com*, November 2010. Archived at [perma.cc/X5R6-R47F](https://perma.cc/X5R6-R47F) +[^65]: Devin Petersohn, Stephen Macke, Doris Xin, William Ma, Doris Lee, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, and Aditya Parameswaran. [Towards Scalable Dataframe Systems](https://www.vldb.org/pvldb/vol13/p2033-petersohn.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 11, pages 2033–2046. [doi:10.14778/3407790.3407807](https://doi.org/10.14778/3407790.3407807) +[^66]: Stavros Papadopoulos, Kushal Datta, Samuel Madden, and Timothy Mattson. [The TileDB Array Data Storage Manager](https://www.vldb.org/pvldb/vol10/p349-papadopoulos.pdf). *Proceedings of the VLDB Endowment*, volume 10, issue 4, pages 349–360, November 2016. [doi:10.14778/3025111.3025117](https://doi.org/10.14778/3025111.3025117) +[^67]: Florin Rusu. [Multidimensional Array Data Management](https://faculty.ucmerced.edu/frusu/Papers/Report/2022-09-fntdb-arrays.pdf). *Foundations and Trends in Databases*, volume 12, numbers 2–3, pages 69–220, February 2023. [doi:10.1561/1900000069](https://doi.org/10.1561/1900000069) +[^68]: Ed Targett. [Bloomberg, Man Group team up to develop open source “ArcticDB” database](https://www.thestack.technology/bloomberg-man-group-arcticdb-database-dataframe/). *thestack.technology*, March 2023. Archived at [perma.cc/M5YD-QQYV](https://perma.cc/M5YD-QQYV) +[^69]: Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. [GenBank](https://academic.oup.com/nar/article/36/suppl_1/D25/2507746). *Nucleic Acids Research*, volume 36, database issue, pages D25–D30, December 2007. 
[doi:10.1093/nar/gkm929](https://doi.org/10.1093/nar/gkm929) \ No newline at end of file diff --git a/content/en/ch4.md b/content/en/ch4.md index 55cca43..6432c55 100644 --- a/content/en/ch4.md +++ b/content/en/ch4.md @@ -125,7 +125,7 @@ to be updated every time data is written. This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but every index consumes additional disk space and slows down writes, sometimes substantially -[[1](/en/ch4#Samokhvalov2021)]. +[^1]. For this reason, databases don’t usually index everything by default, but require you—the person writing the application or administering the database—to choose indexes manually, using your knowledge of the application’s typical query patterns. You can then choose the indexes that give @@ -157,7 +157,7 @@ This approach is much faster, but it still suffers from several problems: * The hash table must fit in memory. In principle, you could maintain a hash table on disk, but unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of random access I/O, it is expensive to grow when it becomes full, and hash collisions require - fiddly logic [[2](/en/ch4#Graefe2011)]. + fiddly logic [^2]. * Range queries are not efficient. For example, you cannot easily scan over all keys between `10000` and `19999`—you’d have to look up each key individually in the hash map. @@ -165,7 +165,7 @@ This approach is much faster, but it still suffers from several problems: In practice, hash tables are not used very often for database indexes, and instead it is much more common to keep data in a structure that is *sorted by key* -[[3](/en/ch4#Jones2019)]. +[^3]. One example of such a structure is a *Sorted String Table*, or *SSTable* for short, as shown in [Figure 4-2](/en/ch4#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that they are sorted by key, and each key only appears once in the file. 
@@ -179,7 +179,7 @@ SSTable into *blocks* of a few kilobytes, and then store the first key of each b This kind of index, which stores only some of the keys, is called *sparse*. This index is stored in a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data structure that allows queries to quickly look up a particular key -[[4](/en/ch4#Lambov2022a)]. +[^4]. For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the first key of the next block is `handsome`. Now say you’re looking for the key `handiwork`, which @@ -203,8 +203,8 @@ We can solve this problem with a *log-structured* approach, which is a hybrid be log and a sorted file: 1. When a write comes in, add it to an in-memory ordered map data structure, such as a red-black - tree, skip list [[5](/en/ch4#Cormen2009)], or trie - [[6](/en/ch4#Lambov2022b)]. + tree, skip list [^5], or trie + [^6]. With these data structures, you can insert keys in any order, look them up efficiently, and read them back in sorted order. This in-memory data structure is called the *memtable*. 2. When the memtable gets bigger than some threshold—typically a few megabytes—write it out to @@ -221,7 +221,7 @@ log and a sorted file: and to discard overwritten or deleted values. Merging segments works similarly to the *mergesort* algorithm -[[5](/en/ch4#Cormen2009)]. The process is illustrated in +[^5]. The process is illustrated in [Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If the same key appears in more than one input file, keep only the more recent value. This produces a @@ -244,17 +244,17 @@ process to discard any previous values for the deleted key. Once the tombstone i oldest segment, it can be dropped. 
The algorithm described here is essentially what is used in RocksDB -[[7](/en/ch4#Borthakur2013)], +[^7], Cassandra, Scylla, and HBase -[[8](/en/ch4#Bertozzi2012)], +[^8], all of which were inspired by Google’s Bigtable paper -[[9](/en/ch4#Chang2006_ch4)] +[^9] (which introduced the terms *SSTable* and *memtable*). The algorithm was originally published in 1996 under the name *Log-Structured Merge-Tree* or *LSM-Tree* -[[10](/en/ch4#ONeil1996)], +[^10], building on earlier work on log-structured filesystems -[[11](/en/ch4#Rosenblum1992)]. +[^11]. For this reason, storage engines that are based on the principle of merging and compacting sorted files are often called *LSM storage engines*. @@ -267,7 +267,7 @@ can be deleted. The segment files don’t necessarily have to be stored on local disk: they are also well suited for writing to object storage. SlateDB and Delta Lake -[[12](/en/ch4#Armbrust2020)]. +[^12]. take this approach, for example. Having immutable segment files also simplifies crash recovery: if a crash happens while writing out @@ -282,14 +282,14 @@ more about durability and crash recovery in [Chapter 8](/en/ch8#ch_transactions With LSM storage it can be slow to read a key that was last updated a long time ago, or that does not exist, since the storage engine needs to check several segment files. In order to speed up such reads, LSM storage engines often include a *Bloom filter* -[[13](/en/ch4#Bloom1970)] +[^13] in each segment, which provides a fast but approximate way of checking whether a particular key appears in a particular SSTable. [Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash function, producing a set of numbers that are then interpreted as indexes into the array of bits -[[14](/en/ch4#Kirsch2008)]. +[^14]. We set the bits corresponding to those indexes to 1, and leave the rest as 0. 
For example, the key `handbag` hashes to the numbers (2, 9, 4), so we set the 2nd, 9th, and 4th bits to 1. The bitmap is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of @@ -313,7 +313,7 @@ as if a key is present, even though it isn’t, is called a *false positive*. The probability of false positives depends on the number of keys, the number of bits set per key, and the total number of bits in the Bloom filter. You can use an online calculator tool to work out the right parameters for your application -[[15](/en/ch4#Hurst2023)]. +[^15]. As a rule of thumb, you need to allocate 10 bits of Bloom filter space for every key in the SSTable to get a false positive probability of 1%, and the probability is reduced tenfold for every 5 additional bits you allocate per key. @@ -349,7 +349,7 @@ Leveled compaction As a rule of thumb, size-tiered compaction performs better if you have mostly writes and few reads, whereas leveled compaction performs better if your workload is dominated by reads. If you write a small number of keys frequently and a large number of keys rarely, then leveled compaction can also -be advantageous [[18](/en/ch4#Callaghan2018)]. +be advantageous [^18]. Even though there are many subtleties, the basic idea of LSM-trees—keeping a cascade of SSTables that are merged in the background—is simple and effective. We discuss their performance @@ -362,7 +362,7 @@ databases that don’t expose a network API. Instead, they are libraries that ru as your application code, typically reading and writing files on the local disk, and you interact with them through normal function calls. Examples of embedded storage engines include RocksDB, SQLite, LMDB, DuckDB, and KùzuDB -[[19](/en/ch4#Rao2023)]. +[^19]. Embedded databases are very commonly used in mobile apps to store the local user’s data. 
On the backend, they can be an appropriate choice if the data is small enough to fit on a single machine, @@ -370,7 +370,7 @@ and if there are not many concurrent transactions. For example, in a multitenant each tenant is small enough and completely separate from others (i.e., you do not need to run queries that combine data from multiple tenants), you can potentially use a separate embedded database instance per tenant -[[20](/en/ch4#BlueskySQLite)]. +[^20]. The storage and retrieval methods we discuss in this chapter are used in both embedded and in client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques @@ -381,9 +381,9 @@ for scaling a database across multiple machines. The log-structured approach is popular, but it is not the only form of key-value storage. The most widely used structure for reading and writing database records by key is the *B-tree*. -Introduced in 1970 [[21](/en/ch4#Bayer1970)] +Introduced in 1970 [^21] and called “ubiquitous” less than 10 years later -[[22](/en/ch4#Comer1979)], +[^22], B-trees have stood the test of time very well. They remain the standard index implementation in almost all relational databases, and many nonrelational databases use them too. @@ -443,7 +443,7 @@ both children, with a boundary value of 337 between them. If the parent page doe space for the new reference, it may also need to be split, and the splits can continue all the way to the root of the tree. When the root is split, we make a new root above it. Deleting keys (which may require nodes to be merged) is more complex -[[5](/en/ch4#Cormen2009)]. +[^5]. This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth of *O*(log *n*). 
Most databases can fit into a B-tree that is three or four levels deep, so @@ -462,7 +462,7 @@ Overwriting several pages at once, like in a page split, is a dangerous operatio crashes after only some of the pages have been written, you end up with a corrupted tree (e.g., there may be an *orphan* page that is not a child of any parent). If the hardware can’t atomically write an entire page, you can also end up with a partially written page (this is known as a *torn -page* [[23](/en/ch4#Miller2025)]). +page* [^23]). In order to make the database resilient to crashes, it is common for B-tree implementations to include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file @@ -476,7 +476,7 @@ To improve performance, B-tree implementations typically don’t immediately wri to disk, but buffer the B-tree pages in memory for a while first. The write-ahead log then also ensures that data is not lost in the case of a crash: as long as data has been written to the WAL, and flushed to disk using the `fsync()` system call, the data will be durable as the database will -be able to recover it after a crash [[25](/en/ch4#Suzuki2017_ch4)]. +be able to recover it after a crash [^25]. ### B-tree variants @@ -484,7 +484,7 @@ As B-trees have been around for so long, many variants have been developed over mention just a few: * Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB) - use a copy-on-write scheme [[26](/en/ch4#Chu2014)]. + use a copy-on-write scheme [^26]. A modified page is written to a different location, and a new version of the parent pages in the tree is created, pointing at the new location. This approach is also useful for concurrency control, as we shall see in [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation). 
@@ -524,7 +524,7 @@ LSM storage, range queries can also take advantage of the SSTable sorting, but t the segments in parallel and combine the results. Bloom filters don’t help for range queries (since you would need to compute the hash of every possible key within the range, which is impractical), making range queries more expensive than point queries in the LSM approach -[[29](/en/ch4#Callaghan2016lsm)]. +[^29]. High write throughput can cause latency spikes in a log-structured storage engine if the memtable fills up. This happens if data can’t be written out to disk fast enough, perhaps because @@ -537,7 +537,7 @@ been written out to disk Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but storage engines need to be carefully designed to take advantage of this parallelism -[[32](/en/ch4#Haas2023)]. +[^32]. ### Sequential vs. random writes @@ -570,7 +570,7 @@ but it can only be erased one block (typically 512 KiB) at a time. Some of the may contain valid data, whereas others may contain data that is no longer needed. Before erasing a block, the controller must first move pages containing valid data into other blocks; this process is called *garbage collection* (GC) -[[33](/en/ch4#Goossaert2014)]. +[^33]. A sequential write workload writes larger chunks of data at a time, so it is likely that a whole 512 KiB block belongs to a single file; when that file is later deleted again, the whole block @@ -593,7 +593,7 @@ durability, then again when the memtable is written to disk, and again every tim is part of a compaction. (If the values are significantly larger than the keys, this overhead can be reduced by storing values separately from keys, and performing compaction only on SSTables containing keys and references to values -[[37](/en/ch4#Lu2016)].) +[^37].) 
A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once to the tree page itself. In addition, they sometimes need to write out an entire page, even if only @@ -612,7 +612,7 @@ Write amplification is a problem in both LSM-trees and B-trees. Which one is bet various factors, such as the length of your keys and values, and how often you overwrite existing keys versus insert new ones. For typical workloads, LSM-trees tend to have lower write amplification because they don’t have to write entire pages and they can compress chunks of the SSTable -[[40](/en/ch4#Callaghan2015)]. +[^40]. This is another factor that makes LSM storage engines well suited for write-heavy workloads. Besides affecting throughput, write amplification is also relevant for the wear on SSDs: a storage @@ -630,7 +630,7 @@ database file may contain a lot of pages that are no longer used by the B-tree. to the B-tree can use those free pages, but they can’t easily be returned to the operating system because they are in the middle of the file, so they still take up space on the filesystem. Databases therefore need a background process that moves pages around to place them better, such as the vacuum -process in PostgreSQL [[25](/en/ch4#Suzuki2017_ch4)]. +process in PostgreSQL [^25]. Fragmentation is less of a problem in LSM-trees, since the compaction process periodically rewrites the data files anyway, and SSTables don’t have pages with unused space. Moreover, blocks of @@ -647,7 +647,7 @@ and be confident that it really has been deleted (perhaps to comply with data pr regulations). For example, in most LSM storage engines a deleted record may still exist in the higher levels until the tombstone representing the deletion has been propagated through all of the compaction levels, which may take a long time. Specialist storage engine designs can propagate -deletions faster [[42](/en/ch4#Sarkar2023)]. +deletions faster [^42]. 
On the other hand, the immutable nature of SSTable segment files is useful if you want to take a snapshot of a database at some point in time (e.g. for a backup or to create a copy of the database @@ -685,16 +685,16 @@ The key in an index is the thing that queries search by, but the value can be on * If the actual data (row, document, vertex) is stored directly within the index structure, it is called a *clustered index*. For example, in MySQL’s InnoDB storage engine, the primary key of a table is always a clustered index, and in SQL Server, you can specify one clustered index per - table [[43](/en/ch4#Fittl2025)]. + table [^43]. * Alternatively, the value can be a reference to the actual data: either the primary key of the row in question (InnoDB does this for secondary indexes), or a direct reference to a location on disk. In the latter case, the place where rows are stored is known as a *heap file*, and it stores data in no particular order (it may be append-only, or it may keep track of deleted rows in order to overwrite them with new data later). For example, Postgres uses the heap file approach - [[44](/en/ch4#Silcock2024)]. + [^44]. * A middle ground between the two is a *covering index* or *index with included columns*, which stores *some* of a table’s columns within the index, in addition to storing the full row on the - heap or in the primary key clustered index [[45](/en/ch4#Webb2008)]. + heap or in the primary key clustered index [^45]. This allows some queries to be answered by using the index alone, without having to resolve the primary key or look in the heap file (in which case, the index is said to *cover* the query). 
This can make some queries faster, but the duplication of data means the index uses more disk space and slows down @@ -708,7 +708,7 @@ overwritten in place, provided that the new value is not larger than the old val more complicated if the new value is larger, as it probably needs to be moved to a new location in the heap where there is enough space. In that case, either all indexes need to be updated to point at the new heap location of the record, or a forwarding pointer is left behind in the old heap -location [[2](/en/ch4#Graefe2011)]. +location [^2]. ## Keeping everything in memory @@ -742,7 +742,7 @@ associated with managing on-disk data structures [47](/en/ch4#VoltDB2014uj)]. RAMCloud is an open source, in-memory key-value store with durability (using a log-structured approach for the data in memory as well as the data on disk) -[[48](/en/ch4#Rumble2014)]. +[^48]. Redis and Couchbase provide weak durability by writing to disk asynchronously. @@ -751,7 +751,7 @@ they don’t need to read from disk. Even a disk-based storage engine may never if you have enough memory, because the operating system caches recently used disk blocks in memory anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data structures in a form that can be written to disk -[[49](/en/ch4#Harizopoulos2008)]. +[^49]. Besides performance, another interesting area for in-memory databases is providing data models that are difficult to implement with disk-based indexes. For example, Redis offers a database-like @@ -792,7 +792,7 @@ Cloud data warehouses tend to integrate better with other cloud services and to For example, many cloud warehouses support automatic log ingestion, and offer easy integration with data processing frameworks such as Google Cloud’s Dataflow or Amazon Web Services’ Kinesis. These warehouses are also more elastic because they decouple query computation from the storage layer -[[54](/en/ch4#Tereshko2016)]. +[^54]. 
Data is persisted on object storage rather than local disks, which makes it easy to adjust storage capacity and compute resources for queries independently, as we previously saw in [“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native). @@ -800,7 +800,7 @@ capacity and compute resources for queries independently, as we previously saw i Open source data warehouses such as Apache Hive, Trino, and Apache Spark have also evolved with the cloud. As data storage for analytics has moved to data lakes on object storage, open source warehouses have begun to break apart -[[55](/en/ch4#McKinney2023)]. The following +[^55]. The following components, which were previously integrated in a single system such as Apache Hive, are now often implemented as separate components: @@ -813,7 +813,7 @@ Query engine Storage format : The storage format determines how the rows of a table are encoded as bytes in a file, which is then typically stored in object storage or a distributed filesystem - [[12](/en/ch4#Armbrust2020)]. + [^12]. This data can then be accessed by the query engine, but also by other applications using the data lake. Examples of such storage formats are Parquet, ORC, Lance, or Nimble, and we will see more about them in the next section. @@ -846,7 +846,7 @@ rows), so in this section we will focus on storage of facts. Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4 or 5 of them at one time (`"SELECT *"` queries are rarely needed for analytics) -[[52](/en/ch4#Stonebraker2013)]. Take the query in +[^52]. Take the query in [Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone buying fruit or candy during the 2024 calendar year), but it only needs to access three columns of the `fact_sales` table: `date_key`, `product_sk`, @@ -884,7 +884,7 @@ long time. 
The idea behind *column-oriented* (or *columnar*) storage is simple: don’t store all the values from one row together, but store all the values from each *column* together instead -[[56](/en/ch4#Stonebraker2005)]. +[^56]. If each column is stored separately, a query only needs to read and parse those columns that are used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema). @@ -893,11 +893,11 @@ an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema) Column storage is easiest to understand in a relational data model, but it applies equally to nonrelational data. For example, Parquet -[[57](/en/ch4#LeDem2013)] +[^57] is a columnar storage format that supports a document data model, based on Google’s Dremel -[[58](/en/ch4#Melnik2010)], +[^58], using a technique known as *shredding* or *striping* -[[59](/en/ch4#Kearney2016)]. +[^59]. ![ddia 0407](/fig/ddia_0407.png) @@ -910,32 +910,32 @@ individual columns and put them together to form the 23rd row of the table. In fact, columnar storage engines don’t actually store an entire column (containing perhaps trillions of rows) in one go. Instead, they break the table into blocks of thousands or millions of rows, and within each block they store the values from each column separately -[[60](/en/ch4#Brandon2023)]. +[^60]. Since many queries are restricted to a particular date range, it is common to make each block contain the rows for a particular timestamp range. A query then only needs to load the columns it needs in those blocks that overlap with the required date range. 
Columnar storage is used in almost all analytic databases nowadays -[[60](/en/ch4#Brandon2023)], +[^60], ranging from large-scale cloud data warehouses such as Snowflake -[[61](/en/ch4#Dageville2016)] +[^61] to single-node embedded databases such as DuckDB -[[62](/en/ch4#Raasveldt2020)], +[^62], and product analytics systems such as Pinot -[[63](/en/ch4#Im2018)] -and Druid [[64](/en/ch4#Yang2014)]. +[^63] +and Druid [^64]. It is used in storage formats such as Parquet, ORC [[65](/en/ch4#Liu2023), [66](/en/ch4#Zeng2023)], -Lance [[67](/en/ch4#Pace2024)], -and Nimble [[68](/en/ch4#Helfman2024)], +Lance [^67], +and Nimble [^68], and in-memory analytics formats like Apache Arrow [[65](/en/ch4#Liu2023), [69](/en/ch4#McKinney2021)] -and Pandas/NumPy [[70](/en/ch4#McKinney2022)]. +and Pandas/NumPy [^70]. Some time-series databases, such as InfluxDB IOx -[[71](/en/ch4#Dix2021)] and TimescaleDB -[[72](/en/ch4#Soto2024)], +[^71] and TimescaleDB +[^72], are also based on column-oriented storage. ### Column Compression @@ -964,7 +964,7 @@ a lot of zeros (we say that they are *sparse*). In that case, the bitmaps can ad run-length encoded: counting the number of consecutive zeros or ones and storing that number, as shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the two bitmap representations, using whichever is the most compact -[[73](/en/ch4#Lemire2016)]. +[^73]. This can make the encoding of a column remarkably efficient. Bitmap indexes such as these are very well suited for the kinds of queries that are common in a data @@ -981,15 +981,15 @@ warehouse. For example: Bitmaps can also be used to answer graph queries, such as finding all users of a social network who are followed by user *X* and who also follow user *Y* -[[74](/en/ch4#Volpert2024)]. +[^74]. There are also various other compression schemes for columnar databases, which you can find in the -references [[75](/en/ch4#Abadi2013)]. +references [^75]. 
###### Note Don’t confuse column-oriented databases with the *wide-column* (also known as *column-family*) data model, in which a row can have thousands of columns, and there is no need for all the rows to have -the same columns [[9](/en/ch4#Chang2006_ch4)]. Despite the similarity +the same columns [^9]. Despite the similarity in name, wide-column databases are row-oriented, since they store all values from a row together. Google’s Bigtable, Apache Accumulo, and HBase are examples of the wide-column model. @@ -1072,7 +1072,7 @@ operators. The simplest kind of operator is like an interpreter for a programmin iterating over each row, it checks a data structure representing the query to find out which comparisons or calculations it needs to perform on which columns. Unfortunately, this is too slow for many analytics purposes. Two alternative approaches for efficient query execution have emerged -[[77](/en/ch4#Kersten2018)]: +[^77]: Query compilation : The query engine takes the SQL query and generates code for executing it. The code iterates over @@ -1101,11 +1101,11 @@ Vectorized processing ###### Figure 4-9. A bitwise AND between two bitmaps lends itself to vectorization. The two approaches are very different in terms of their implementation, but both are used in -practice [[77](/en/ch4#Kersten2018)]. Both can achieve very good +practice [^77]. Both can achieve very good performance by taking advantages of the characteristics of modern CPUs: * preferring sequential memory access over random access to reduce cache misses - [[78](/en/ch4#Smith2020)], + [^78], * doing most of the work in tight inner loops (that is, with a small number of instructions and no function calls) to keep the CPU instruction processing pipeline busy and avoid branch mispredictions, @@ -1127,7 +1127,7 @@ expanded query. When the underlying data changes, a materialized view needs to be updated accordingly. 
Some databases can do that automatically, and there are also systems such as Materialize that specialize in materialized view maintenance -[[81](/en/ch4#Bartley2024)]. +[^81]. Performing such updates means more work on writes, but materialized views can improve read performance in workloads that repeatedly need to perform the same queries. @@ -1137,7 +1137,7 @@ discussed earlier, data warehouse queries often involve an aggregate function, s wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that queries use most often? A *data cube* or *OLAP cube* does this by creating a grid of aggregates grouped by different dimensions -[[82](/en/ch4#Gray2007)]. +[^82]. [Figure 4-10](/en/ch4#fig_data_cube) shows an example. ![ddia 0410](/fig/ddia_0410.png) @@ -1201,15 +1201,15 @@ South poles), but not both simultaneously. One option is to translate a two-dimensional location into a single number using a space-filling curve, and then to use a regular B-tree index -[[83](/en/ch4#Ramsak2000)]. +[^83]. More commonly, specialized spatial indexes such as R-trees or Bkd-trees -[[84](/en/ch4#Procopiuc2003)] +[^84] are used; they divide up the space so that nearby data points tend to be grouped in the same subtree. For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL’s Generalized Search Tree indexing facility -[[85](/en/ch4#Hellerstein1995)]. +[^85]. It is also possible to use regularly spaced grids of triangles, squares, or hexagons -[[86](/en/ch4#Brodsky2018)]. +[^86]. Multi-dimensional indexes are not just for geographic locations. 
For example, on an ecommerce website you could use a three-dimensional index on the dimensions (*red*, *green*, *blue*) to search @@ -1219,13 +1219,13 @@ observations during the year 2013 where the temperature was between 25 and 30℃ one-dimensional index, you would have to either scan over all the records from 2013 (regardless of temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by timestamp and temperature simultaneously -[[87](/en/ch4#Escriva2012)]. +[^87]. ## Full-Text Search Full-text search allows you to search a collection of text documents (web pages, product descriptions, etc.) by keywords that might appear anywhere in the text -[[88](/en/ch4#Manning2008_ch4)]. +[^88]. Information retrieval is a big, specialist topic that often involves language-specific processing: for example, several Asian languages are written without spaces or punctuation between words, and therefore splitting text into words requires a model that indicates which character sequences @@ -1245,7 +1245,7 @@ index*. This is a key-value structure where the key is a term, and the value is all the documents that contain the term (the *postings list*). If the document IDs are sequential numbers, the postings list can also be represented as a sparse bitmap, like in [Figure 4-8](/en/ch4#fig_bitmap_index): the *n*th bit in the bitmap for term *x* is a 1 if the document with ID *n* contains the term *x* -[[89](/en/ch4#Wang2017)]. +[^89]. Finding all the documents that contain both terms *x* and *y* is now similar to a vectorized data warehouse query that searches for rows matching two conditions ([Figure 4-9](/en/ch4#fig_bitmap_and)): load the two @@ -1253,10 +1253,10 @@ bitmaps for terms *x* and *y* and compute their bitwise AND. Even if the bitmaps encoded, this can be done very efficiently. For example, Lucene, the full-text indexing engine used by Elasticsearch and Solr, works like this -[[90](/en/ch4#Grand2013)]. +[^90]. 
It stores the mapping from term to postings list in SSTable-like sorted files, which are merged in the background using the same log-structured approach we saw earlier in this chapter -[[91](/en/ch4#McCandless2011merges)]. +[^91]. PostgreSQL’s GIN index type also uses postings lists to support full-text search and indexing inside JSON documents [[92](/en/ch4#Fittl2021), @@ -1267,16 +1267,16 @@ which are called *n*-grams. For example, the trigrams (*n* = 3) of the string `"hello"` are `"hel"`, `"ell"`, and `"llo"`. If we build an inverted index of all trigrams, we can search the documents for arbitrary substrings that are at least three characters long. Trigram indexes even allow regular expressions in search queries; the downside is that they are quite large -[[94](/en/ch4#Korotkov2012)]. +[^94]. To cope with typos in documents or queries, Lucene is able to search text for words within a certain edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced) -[[95](/en/ch4#McCandless2011fuzzy)]. +[^95]. It does this by storing the set of terms as a finite state automaton over the characters in the keys, similar to a *trie* -[[96](/en/ch4#Heinz2002)], +[^96], and transforming it into a *Levenshtein automaton*, which supports efficient search for words within -a given edit distance [[97](/en/ch4#Schulz2002)]. +a given edit distance [^97]. ## Vector Embeddings @@ -1314,11 +1314,11 @@ vectors to determine how close they are, while Euclidean distance measures the s distance between two points in space. Many early embedding models such as Word2Vec -[[98](/en/ch4#Mikolov2013)], +[^98], BERT -[[99](/en/ch4#Devlin2018)], +[^99], and GPT -[[100](/en/ch4#Radford2018)] +[^100] worked with text data. Such models are usually implemented as neural networks. Researchers went on to create embedding models for video, audio, and images as well.
More recently, model architecture has become *multimodal*: a single model can generate vector embeddings for multiple @@ -1362,9 +1362,9 @@ Hierarchical Navigable Small World (HNSW) Many popular vector databases implement IVF and HNSW indexes. Facebook’s Faiss library has many variations of each -[[101](/en/ch4#Faiis2023)], +[^101], and PostgreSQL’s pgvector supports both as well -[[102](/en/ch4#Matevosyan2024)]. +[^102]. The full details of the IVF and HNSW algorithms are beyond the scope of this book, but their papers are an excellent resource [[103](/en/ch4#Baranchuk2018), @@ -1419,567 +1419,113 @@ documentation for the database of your choice. ##### Footnotes + ##### References -[[1](/en/ch4#Samokhvalov2021-marker)] Nikolay Samokhvalov. -[How -partial, covering, and multicolumn indexes may slow down UPDATEs in PostgreSQL](https://postgres.ai/blog/20211029-how-partial-and-covering-indexes-affect-update-performance-in-postgresql). -*postgres.ai*, October 2021. -Archived at [perma.cc/PBK3-F4G9](https://perma.cc/PBK3-F4G9) -[[2](/en/ch4#Graefe2011-marker)] Goetz Graefe. -[Modern B-Tree Techniques](https://w6113.github.io/files/papers/btreesurvey-graefe.pdf). -*Foundations and Trends in Databases*, volume 3, issue 4, pages 203–402, August 2011. -[doi:10.1561/1900000028](https://doi.org/10.1561/1900000028) -[[3](/en/ch4#Jones2019-marker)] Evan Jones. -[Why databases use ordered -indexes but programming uses hash tables](https://www.evanjones.ca/ordered-vs-unordered-indexes.html). *evanjones.ca*, December 2019. -Archived at [perma.cc/NJX8-3ZZD](https://perma.cc/NJX8-3ZZD) -[[4](/en/ch4#Lambov2022a-marker)] Branimir Lambov. -[CEP-25: -Trie-indexed SSTable format](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A%2BTrie-indexed%2BSSTable%2Bformat). *cwiki.apache.org*, November 2022. -Archived at [perma.cc/HD7W-PW8U](https://perma.cc/HD7W-PW8U). 
-Linked Google Doc archived at [perma.cc/UL6C-AAAE](https://perma.cc/UL6C-AAAE) - -[[5](/en/ch4#Cormen2009-marker)] Thomas H. Cormen, Charles E. -Leiserson, Ronald L. Rivest, and Clifford Stein: *Introduction to Algorithms*, 3rd edition. -MIT Press, 2009. ISBN: 978-0-262-53305-8 - -[[6](/en/ch4#Lambov2022b-marker)] Branimir Lambov. -[Trie Memtables in Cassandra](https://www.vldb.org/pvldb/vol15/p3359-lambov.pdf). -*Proceedings of the VLDB Endowment*, volume 15, issue 12, pages 3359–3371, August 2022. -[doi:10.14778/3554821.3554828](https://doi.org/10.14778/3554821.3554828) - -[[7](/en/ch4#Borthakur2013-marker)] Dhruba Borthakur. -[The History of RocksDB](https://rocksdb.blogspot.com/2013/11/the-history-of-rocksdb.html). -*rocksdb.blogspot.com*, November 2013. -Archived at [perma.cc/Z7C5-JPSP](https://perma.cc/Z7C5-JPSP) - -[[8](/en/ch4#Bertozzi2012-marker)] Matteo Bertozzi. -[Apache HBase I/O – HFile](https://blog.cloudera.com/apache-hbase-i-o-hfile/). -*blog.cloudera.com*, June 2012. -Archived at [perma.cc/U9XH-L2KL](https://perma.cc/U9XH-L2KL) - -[[9](/en/ch4#Chang2006_ch4-marker)] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, -Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. -[Bigtable: A Distributed Storage System -for Structured Data](https://research.google/pubs/pub27898/). At *7th USENIX Symposium on Operating System Design and -Implementation* (OSDI), November 2006. - -[[10](/en/ch4#ONeil1996-marker)] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and -Elizabeth O’Neil. -[The Log-Structured Merge-Tree (LSM-Tree)](https://www.cs.umb.edu/~poneil/lsmtree.pdf). -*Acta Informatica*, volume 33, issue 4, pages 351–385, June 1996. -[doi:10.1007/s002360050048](https://doi.org/10.1007/s002360050048) - -[[11](/en/ch4#Rosenblum1992-marker)] Mendel Rosenblum and John K. Ousterhout. -[The Design and Implementation of -a Log-Structured File System](https://research.cs.wisc.edu/areas/os/Qual/papers/lfs.pdf). 
-*ACM Transactions on Computer Systems*, volume 10, issue 1, pages 26–52, February 1992. -[doi:10.1145/146941.146943](https://doi.org/10.1145/146941.146943) - -[[12](/en/ch4#Armbrust2020-marker)] Michael Armbrust, Tathagata Das, Liwen Sun, -Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja -Łuszczak, Michał Świtakowski, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter -Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, and Matei Zaharia. -[Delta Lake: High-Performance ACID Table -Storage over Cloud Object Stores](https://vldb.org/pvldb/vol13/p3411-armbrust.pdf). *Proceedings of the VLDB Endowment*, volume 13, -issue 12, pages 3411–3424, August 2020. -[doi:10.14778/3415478.3415560](https://doi.org/10.14778/3415478.3415560) - -[[13](/en/ch4#Bloom1970-marker)] Burton H. Bloom. -[Space/Time -Trade-offs in Hash Coding with Allowable Errors](https://people.cs.umass.edu/~emery/classes/cmpsci691st/readings/Misc/p422-bloom.pdf). *Communications of the ACM*, -volume 13, issue 7, pages 422–426, July 1970. -[doi:10.1145/362686.362692](https://doi.org/10.1145/362686.362692) - -[[14](/en/ch4#Kirsch2008-marker)] Adam Kirsch and Michael Mitzenmacher. -[Less Hashing, Same -Performance: Building a Better Bloom Filter](https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf). *Random Structures & Algorithms*, -volume 33, issue 2, pages 187–218, September 2008. -[doi:10.1002/rsa.20208](https://doi.org/10.1002/rsa.20208) - -[[15](/en/ch4#Hurst2023-marker)] Thomas Hurst. -[Bloom Filter Calculator](https://hur.st/bloomfilter/). *hur.st*, September 2023. -Archived at [perma.cc/L3AV-6VC2](https://perma.cc/L3AV-6VC2) - -[[16](/en/ch4#Luo2019-marker)] Chen Luo and Michael J. Carey. -[LSM-based storage techniques: a survey](https://arxiv.org/abs/1812.07527). -*The VLDB Journal*, volume 29, pages 393–418, July 2019. 
-[doi:10.1007/s00778-019-00555-y](https://doi.org/10.1007/s00778-019-00555-y) - -[[17](/en/ch4#Sarkar2022-marker)] Subhadeep Sarkar and Manos Athanassoulis. -[Dissecting, Designing, and Optimizing -LSM-based Data Stores](https://www.youtube.com/watch?v=hkMkBZn2mGs). Tutorial at *ACM International Conference on Management of Data* -(SIGMOD), June 2022. Slides archived at -[perma.cc/93B3-E827](https://perma.cc/93B3-E827) - -[[18](/en/ch4#Callaghan2018-marker)] Mark Callaghan. -[Name that -compaction algorithm](https://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html). *smalldatum.blogspot.com*, August 2018. -Archived at [perma.cc/CN4M-82DY](https://perma.cc/CN4M-82DY) - -[[19](/en/ch4#Rao2023-marker)] Prashanth Rao. -[Embedded databases (1): The harmony of -DuckDB, KùzuDB and LanceDB](https://thedataquarry.com/posts/embedded-db-1/). *thedataquarry.com*, August 2023. -Archived at [perma.cc/PA28-2R35](https://perma.cc/PA28-2R35) - -[[20](/en/ch4#BlueskySQLite-marker)] Hacker News discussion. -[Bluesky migrates to single-tenant SQLite](https://news.ycombinator.com/item?id=38171322). -*news.ycombinator.com*, October 2023. -Archived at [perma.cc/69LM-5P6X](https://perma.cc/69LM-5P6X) - -[[21](/en/ch4#Bayer1970-marker)] Rudolf Bayer and Edward M. McCreight. -[Organization and Maintenance of Large -Ordered Indices](https://dl.acm.org/doi/pdf/10.1145/1734663.1734671). Boeing Scientific Research Laboratories, Mathematical and Information Sciences -Laboratory, report no. 20, July 1970. -[doi:10.1145/1734663.1734671](https://doi.org/10.1145/1734663.1734671) - -[[22](/en/ch4#Comer1979-marker)] Douglas Comer. -[The -Ubiquitous B-Tree](https://web.archive.org/web/20170809145513id_/http%3A//sites.fas.harvard.edu/~cs165/papers/comer.pdf). *ACM Computing Surveys*, volume 11, issue 2, pages 121–137, June 1979. -[doi:10.1145/356770.356776](https://doi.org/10.1145/356770.356776) - -[[23](/en/ch4#Miller2025-marker)] Alex Miller. 
-[Torn Write Detection and Protection](https://transactional.blog/blog/2025-torn-writes). -*transactional.blog*, April 2025. -Archived at [perma.cc/G7EB-33EW](https://perma.cc/G7EB-33EW) - -[[24](/en/ch4#Mohan1992-marker)] C. Mohan and Frank Levine. -[ARIES/IM: An Efficient and High -Concurrency Index Management Method Using Write-Ahead Logging](https://ics.uci.edu/~cs223/papers/p371-mohan.pdf). At *ACM -International Conference on Management of Data* (SIGMOD), June 1992. -[doi:10.1145/130283.130338](https://doi.org/10.1145/130283.130338) - -[[25](/en/ch4#Suzuki2017_ch4-marker)] Hironobu Suzuki. -[The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017. - -[[26](/en/ch4#Chu2014-marker)] Howard Chu. -[LDAP at Lightning Speed](https://buildstuff14.sched.com/event/08a1a368e272eb599a52e08b4c3c779d). -At *Build Stuff ’14*, November 2014. -Archived at [perma.cc/GB6Z-P8YH](https://perma.cc/GB6Z-P8YH) - -[[27](/en/ch4#Athanassoulis2016-marker)] Manos Athanassoulis, Michael S. Kester, -Lukas M. Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, and Mark Callaghan. -[Designing Access Methods: The RUM -Conjecture](https://openproceedings.org/2016/conf/edbt/paper-12.pdf). At *19th International Conference on Extending Database Technology* (EDBT), March 2016. -[doi:10.5441/002/edbt.2016.42](https://doi.org/10.5441/002/edbt.2016.42) - -[[28](/en/ch4#Stopford2015-marker)] Ben Stopford. -[Log Structured Merge Trees](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/). -*benstopford.com*, February 2015. -Archived at [perma.cc/E5BV-KUJ6](https://perma.cc/E5BV-KUJ6) - -[[29](/en/ch4#Callaghan2016lsm-marker)] Mark Callaghan. -[The -Advantages of an LSM vs a B-Tree](https://smalldatum.blogspot.com/2016/01/summary-of-advantages-of-lsm-vs-b-tree.html). *smalldatum.blogspot.co.uk*, January 2016. 
-Archived at [perma.cc/3TYZ-EFUD](https://perma.cc/3TYZ-EFUD) - -[[30](/en/ch4#Balmau2019-marker)] Oana Balmau, Florin Dinu, Willy Zwaenepoel, Karan -Gupta, Ravishankar Chandhiramoorthi, and Diego Didona. -[SILK: Preventing Latency -Spikes in Log-Structured Merge Key-Value Stores](https://www.usenix.org/conference/atc19/presentation/balmau). At *USENIX Annual Technical Conference*, -July 2019. - -[[31](/en/ch4#RocksDBTuning-marker)] Igor Canadi, Siying Dong, Mark Callaghan, et al. -[RocksDB Tuning Guide](https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide). -*github.com*, 2023. -Archived at [perma.cc/UNY4-MK6C](https://perma.cc/UNY4-MK6C) - -[[32](/en/ch4#Haas2023-marker)] Gabriel Haas and Viktor Leis. -[What Modern NVMe Storage Can Do, and How -to Exploit it: High-Performance I/O for High-Performance Storage Engines](https://www.vldb.org/pvldb/vol16/p2090-haas.pdf). *Proceedings of the -VLDB Endowment*, volume 16, issue 9, pages 2090-2102. -[doi:10.14778/3598581.3598584](https://doi.org/10.14778/3598581.3598584) - -[[33](/en/ch4#Goossaert2014-marker)] Emmanuel Goossaert. -[Coding -for SSDs](https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/). *codecapsule.com*, February 2014. - -[[34](/en/ch4#Vanlightly2023nvme-marker)] Jack Vanlightly. -[Is -sequential IO dead in the era of the NVMe drive?](https://jack-vanlightly.com/blog/2023/5/9/is-sequential-io-dead-in-the-era-of-the-nvme-drive) *jack-vanlightly.com*, May 2023. -Archived at [perma.cc/7TMZ-TAPU](https://perma.cc/7TMZ-TAPU) - -[[35](/en/ch4#Alibaba2019_ch4-marker)] Alibaba Cloud Storage Team. -[Storage System Design Analysis: Factors Affecting -NVMe SSD Performance (2)](https://www.alibabacloud.com/blog/594376). *alibabacloud.com*, January 2019. Archived at -[archive.org](https://web.archive.org/web/20230510065132/https%3A//www.alibabacloud.com/blog/594376) - -[[36](/en/ch4#Hu2010-marker)] Xiao-Yu Hu and Robert Haas. 
-[The Fundamental Limit of Flash -Random Write Performance: Understanding, Analysis and Performance Modelling](https://dominoweb.draco.res.ibm.com/reports/rz3771.pdf). -*dominoweb.draco.res.ibm.com*, March 2010. -Archived at [perma.cc/8JUL-4ZDS](https://perma.cc/8JUL-4ZDS) - -[[37](/en/ch4#Lu2016-marker)] Lanyue Lu, Thanumalayan Sankaranarayana Pillai, -Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. -[WiscKey: -Separating Keys from Values in SSD-conscious Storage](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf). At *4th USENIX Conference on File and -Storage Technologies* (FAST), February 2016. - -[[38](/en/ch4#Zaitsev2006-marker)] Peter Zaitsev. -[Innodb Double Write](https://www.percona.com/blog/innodb-double-write/). -*percona.com*, August 2006. -Archived at [perma.cc/NT4S-DK7T](https://perma.cc/NT4S-DK7T) - -[[39](/en/ch4#Vondra2016-marker)] Tomas Vondra. -[On the Impact of -Full-Page Writes](https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-page-writes/). *2ndquadrant.com*, November 2016. -Archived at [perma.cc/7N6B-CVL3](https://perma.cc/7N6B-CVL3) - -[[40](/en/ch4#Callaghan2015-marker)] Mark Callaghan. -[Read, -write & space amplification - B-Tree vs LSM](https://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html). *smalldatum.blogspot.com*, November 2015. -Archived at [perma.cc/S487-WK5P](https://perma.cc/S487-WK5P) - -[[41](/en/ch4#Callaghan2016rocksdb-marker)] Mark Callaghan. -[Choosing Between Efficiency and -Performance with RocksDB](https://codemesh.io/codemesh2016/mark-callaghan). At *Code Mesh*, November 2016. -Video at [youtube.com/watch?v=tgzkgZVXKB4](https://www.youtube.com/watch?v=tgzkgZVXKB4) - -[[42](/en/ch4#Sarkar2023-marker)] Subhadeep Sarkar, Tarikul Islam -Papon, Dimitris Staratzis, Zichen Zhu, and Manos Athanassoulis. -[Enabling -Timely and Persistent Deletion in LSM-Engines](https://subhadeep.net/assets/fulltext/Enabling_Timely_and_Persistent_Deletion_in_LSM-Engines.pdf). 
*ACM Transactions on Database Systems*, -volume 48, issue 3, article no. 8, August 2023. -[doi:10.1145/3599724](https://doi.org/10.1145/3599724) - -[[43](/en/ch4#Fittl2025-marker)] Lukas Fittl. -[Postgres -vs. SQL Server: B-Tree Index Differences & the Benefit of Deduplication](https://pganalyze.com/blog/postgresql-vs-sql-server-btree-index-deduplication). -*pganalyze.com*, April 2025. -Archived at [perma.cc/XY6T-LTPX](https://perma.cc/XY6T-LTPX) - -[[44](/en/ch4#Silcock2024-marker)] Drew Silcock. -[How Postgres stores data -on disk – this one’s a page turner](https://drew.silcock.dev/blog/how-postgres-stores-data-on-disk/). *drew.silcock.dev*, August 2024. -Archived at [perma.cc/8K7K-7VJ2](https://perma.cc/8K7K-7VJ2) - -[[45](/en/ch4#Webb2008-marker)] Joe Webb. -[Using -Covering Indexes to Improve Query Performance](https://www.red-gate.com/simple-talk/databases/sql-server/learn/using-covering-indexes-to-improve-query-performance/). *simple-talk.com*, September 2008. -Archived at [perma.cc/6MEZ-R5VR](https://perma.cc/6MEZ-R5VR) - -[[46](/en/ch4#Stonebraker2007-marker)] Michael Stonebraker, Samuel Madden, Daniel J. -Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. -[The End of an -Architectural Era (It’s Time for a Complete Rewrite)](https://vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf). At *33rd International Conference on -Very Large Data Bases* (VLDB), September 2007. - -[[47](/en/ch4#VoltDB2014uj-marker)] [VoltDB -Technical Overview White Paper](https://www.voltactivedata.com/wp-content/uploads/2017/03/hv-white-paper-voltdb-technical-overview.pdf). VoltDB, 2017. -Archived at [perma.cc/B9SF-SK5G](https://perma.cc/B9SF-SK5G) - -[[48](/en/ch4#Rumble2014-marker)] Stephen M. Rumble, Ankita Kejriwal, and John K. Ousterhout. -[Log-Structured -Memory for DRAM-Based Storage](https://www.usenix.org/system/files/conference/fast14/fast14-paper_rumble.pdf). At *12th USENIX Conference on File and Storage -Technologies* (FAST), February 2014. 
- -[[49](/en/ch4#Harizopoulos2008-marker)] Stavros Harizopoulos, Daniel J. Abadi, -Samuel Madden, and Michael Stonebraker. -[OLTP Through the Looking Glass, -and What We Found There](https://hstore.cs.brown.edu/papers/hstore-lookingglass.pdf). At *ACM International Conference on Management of Data* -(SIGMOD), June 2008. -[doi:10.1145/1376616.1376713](https://doi.org/10.1145/1376616.1376713) - -[[50](/en/ch4#Larson2013-marker)] Per-Åke Larson, Cipri Clinciu, Campbell Fraser, -Eric N. Hanson, Mostafa Mokhtar, Michal Nowakiewicz, Vassilis Papadimos, Susan L. Price, Srikumar -Rangarajan, Remus Rusanu, and Mayukh Saubhasik. -[Enhancements -to SQL Server Column Stores](https://web.archive.org/web/20131203001153id_/http%3A//research.microsoft.com/pubs/193599/Apollo3%20-%20Sigmod%202013%20-%20final.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2013. -[doi:10.1145/2463676.2463708](https://doi.org/10.1145/2463676.2463708) - -[[51](/en/ch4#Farber2012-marker)] Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, -Ingo Müller, Hannes Rauhe, and Jonathan Dees. -[The -SAP HANA Database – An Architecture Overview](https://web.archive.org/web/20220208081111id_/http%3A//sites.computer.org/debull/A12mar/hana.pdf). -*IEEE Data Engineering Bulletin*, volume 35, issue 1, pages 28–33, March 2012. - -[[52](/en/ch4#Stonebraker2013-marker)] Michael Stonebraker. -[The Traditional RDBMS Wisdom Is (Almost Certainly) All -Wrong](https://slideshot.epfl.ch/talks/166). Presentation at *EPFL*, May 2013. - -[[53](/en/ch4#Prout2022_ch4-marker)] Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu -Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. -[Cloud-Native Transactions and Analytics -in SingleStore](https://dl.acm.org/doi/pdf/10.1145/3514221.3526055). At *ACM International Conference on Management of Data* (SIGMOD), June 2022. 
-[doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055) - -[[54](/en/ch4#Tereshko2016-marker)] Tino Tereshko and Jordan Tigani. -[BigQuery under the -hood](https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood). *cloud.google.com*, January 2016. -Archived at [perma.cc/WP2Y-FUCF](https://perma.cc/WP2Y-FUCF) - -[[55](/en/ch4#McKinney2023-marker)] Wes McKinney. -[The Road to Composable Data Systems: -Thoughts on the Last 15 Years and the Future](https://wesmckinney.com/blog/looking-back-15-years/). *wesmckinney.com*, September 2023. -Archived at [perma.cc/6L2M-GTJX](https://perma.cc/6L2M-GTJX) - -[[56](/en/ch4#Stonebraker2005-marker)] Michael Stonebraker, Daniel -J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam -Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. -[C-Store: -A Column-oriented DBMS](https://www.vldb.org/archives/website/2005/program/paper/thu/p553-stonebraker.pdf). At *31st International Conference on Very Large Data Bases* -(VLDB), pages 553–564, September 2005. - -[[57](/en/ch4#LeDem2013-marker)] Julien Le Dem. -[Dremel -Made Simple with Parquet](https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html). *blog.twitter.com*, September 2013. - -[[58](/en/ch4#Melnik2010-marker)] Sergey Melnik, Andrey Gubarev, Jing Jing Long, -Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. -[Dremel: Interactive Analysis of Web-Scale -Datasets](https://vldb.org/pvldb/vol3/R29.pdf). At *36th International Conference on Very Large Data Bases* (VLDB), pages -330–339, September 2010. -[doi:10.14778/1920841.1920886](https://doi.org/10.14778/1920841.1920886) - -[[59](/en/ch4#Kearney2016-marker)] Joe Kearney. -[Understanding Record -Shredding: storing nested data in columns](https://www.joekearney.co.uk/posts/understanding-record-shredding). *joekearney.co.uk*, December 2016. 
-Archived at [perma.cc/ZD5N-AX5D](https://perma.cc/ZD5N-AX5D) - -[[60](/en/ch4#Brandon2023-marker)] Jamie Brandon. -[A -shallow survey of OLAP and HTAP query engines](https://www.scattered-thoughts.net/writing/a-shallow-survey-of-olap-and-htap-query-engines). *scattered-thoughts.net*, September 2023. -Archived at [perma.cc/L3KH-J4JF](https://perma.cc/L3KH-J4JF) - -[[61](/en/ch4#Dageville2016-marker)] Benoit Dageville, Thierry Cruanes, Marcin -Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin -Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter -Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. -[The Snowflake Elastic Data Warehouse](https://dl.acm.org/doi/pdf/10.1145/2882903.2903741). -At *ACM International Conference on Management of Data* (SIGMOD), pages 215–226, June 2016. -[doi:10.1145/2882903.2903741](https://doi.org/10.1145/2882903.2903741) - -[[62](/en/ch4#Raasveldt2020-marker)] Mark Raasveldt and Hannes Mühleisen. -[Data Management for Data -Science Towards Embedded Analytics](https://duckdb.org/pdf/CIDR2020-raasveldt-muehleisen-duckdb.pdf). At *10th Conference on Innovative Data Systems -Research* (CIDR), January 2020. - -[[63](/en/ch4#Im2018-marker)] Jean-François Im, Kishore Gopalakrishna, Subbu -Subramaniam, Mayank Shrivastava, Adwait Tumbde, Xiaotian Jiang, Jennifer Dai, Seunghyun Lee, Neha -Pawar, Jialiang Li, and Ravi Aringunram. -[Pinot: -Realtime OLAP for 530 Million Users](https://cwiki.apache.org/confluence/download/attachments/103092375/Pinot.pdf). At *ACM International Conference on Management of -Data* (SIGMOD), pages 583–594, May 2018. -[doi:10.1145/3183713.3190661](https://doi.org/10.1145/3183713.3190661) - -[[64](/en/ch4#Yang2014-marker)] Fangjin Yang, Eric Tschetter, Xavier -Léauté, Nelson Ray, Gian Merlino, and Deep Ganguli. -[Druid: A Real-time Analytical Data Store](https://static.druid.io/docs/druid.pdf). 
-At *ACM International Conference on Management of Data* (SIGMOD), June 2014. -[doi:10.1145/2588555.2595631](https://doi.org/10.1145/2588555.2595631) - -[[65](/en/ch4#Liu2023-marker)] Chunwei Liu, Anna Pavlenko, Matteo Interlandi, and Brandon Haynes. -[Deep Dive into Common Open Formats for Analytical DBMSs](https://www.vldb.org/pvldb/vol16/p3044-liu.pdf). -*Proceedings of the VLDB Endowment*, volume 16, issue 11, pages 3044–3056, July 2023. -[doi:10.14778/3611479.3611507](https://doi.org/10.14778/3611479.3611507) - -[[66](/en/ch4#Zeng2023-marker)] Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes -McKinney, and Huanchen Zhang. [An Empirical -Evaluation of Columnar Storage Formats](https://www.vldb.org/pvldb/vol17/p148-zeng.pdf). *Proceedings of the VLDB Endowment*, volume 17, -issue 2, pages 148–161. -[doi:10.14778/3626292.3626298](https://doi.org/10.14778/3626292.3626298) - -[[67](/en/ch4#Pace2024-marker)] Weston Pace. -[Lance v2: A columnar container format for modern data](https://blog.lancedb.com/lance-v2/). -*blog.lancedb.com*, April 2024. -Archived at [perma.cc/ZK3Q-S9VJ](https://perma.cc/ZK3Q-S9VJ) - -[[68](/en/ch4#Helfman2024-marker)] Yoav Helfman. -[Nimble, A New Columnar File Format](https://www.youtube.com/watch?v=bISBNVtXZ6M). -At *VeloxCon*, April 2024. - -[[69](/en/ch4#McKinney2021-marker)] Wes McKinney. -[Apache Arrow: High-Performance Columnar Data -Framework](https://www.youtube.com/watch?v=YhF8YR0OEFk). At *CMU Database Group – Vaccination Database Tech Talks*, December 2021. - -[[70](/en/ch4#McKinney2022-marker)] Wes McKinney. -[Python for Data -Analysis, 3rd Edition](https://learning.oreilly.com/library/view/python-for-data/9781098104023/). O’Reilly Media, August 2022. ISBN: 9781098104023 - -[[71](/en/ch4#Dix2021-marker)] Paul Dix. -[The Design of InfluxDB IOx: An In-Memory -Columnar Database Written in Rust with Apache Arrow](https://www.youtube.com/watch?v=_zbwz-4RDXg). 
At *CMU Database Group – Vaccination -Database Tech Talks*, May 2021. - -[[72](/en/ch4#Soto2024-marker)] Carlota Soto and Mike Freedman. -[Building -Columnar Compression for Large PostgreSQL Databases](https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database/). *timescale.com*, March 2024. -Archived at [perma.cc/7KTF-V3EH](https://perma.cc/7KTF-V3EH) - -[[73](/en/ch4#Lemire2016-marker)] Daniel Lemire, Gregory Ssi‐Yan‐Kai, and Owen Kaser. -[Consistently faster and smaller compressed bitmaps with Roaring](https://arxiv.org/pdf/1603.06549). -*Software: Practice and Experience*, volume 46, issue 11, pages 1547–1569, November 2016. -[doi:10.1002/spe.2402](https://doi.org/10.1002/spe.2402) - -[[74](/en/ch4#Volpert2024-marker)] Jaz Volpert. -[An entire Social Network in 1.6GB (GraphD -Part 2)](https://jazco.dev/2024/04/20/roaring-bitmaps/). *jazco.dev*, April 2024. -Archived at [perma.cc/L27Z-QVMG](https://perma.cc/L27Z-QVMG) - -[[75](/en/ch4#Abadi2013-marker)] Daniel J. Abadi, Peter Boncz, Stavros -Harizopoulos, Stratos Idreos, and Samuel Madden. -[The Design and -Implementation of Modern Column-Oriented Database Systems](https://www.cs.umd.edu/~abadi/papers/abadi-column-stores.pdf). *Foundations and Trends in -Databases*, volume 5, issue 3, pages 197–280, December 2013. -[doi:10.1561/1900000024](https://doi.org/10.1561/1900000024) - -[[76](/en/ch4#Lamb2012-marker)] Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, -Nga Tran, Ben Vandiver, Lyric Doshi, and Chuck Bear. -[The Vertica Analytic Database: C-Store 7 Years Later](https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf). -*Proceedings of the VLDB Endowment*, volume 5, issue 12, pages 1790–1801, August 2012. -[doi:10.14778/2367502.2367518](https://doi.org/10.14778/2367502.2367518) - -[[77](/en/ch4#Kersten2018-marker)] Timo Kersten, Viktor Leis, Alfons Kemper, Thomas -Neumann, Andrew Pavlo, and Peter Boncz. 
-[Everything You Always Wanted to Know -About Compiled and Vectorized Queries But Were Afraid to Ask](https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf). *Proceedings of the VLDB -Endowment*, volume 11, issue 13, pages 2209–2222, September 2018. -[doi:10.14778/3275366.3284966](https://doi.org/10.14778/3275366.3284966) - -[[78](/en/ch4#Smith2020-marker)] Forrest Smith. -[Memory Bandwidth Napkin -Math](https://www.forrestthewoods.com/blog/memory-bandwidth-napkin-math/). *forrestthewoods.com*, February 2020. -Archived at [perma.cc/Y8U4-PS7N](https://perma.cc/Y8U4-PS7N) - -[[79](/en/ch4#Boncz2005-marker)] Peter Boncz, Marcin Zukowski, and Niels Nes. -[MonetDB/X100: Hyper-Pipelining Query Execution](https://www.cidrdb.org/cidr2005/papers/P19.pdf). -At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005. - -[[80](/en/ch4#Zhou2002-marker)] Jingren Zhou and Kenneth A. Ross. -[Implementing Database Operations Using SIMD Instructions](https://www1.cs.columbia.edu/~kar/pubsk/simd.pdf). -At *ACM International Conference on Management of Data* (SIGMOD), pages 145–156, June 2002. -[doi:10.1145/564691.564709](https://doi.org/10.1145/564691.564709) - -[[81](/en/ch4#Bartley2024-marker)] Kevin Bartley. -[OLTP Queries: Transfer Expensive Workloads to -Materialize](https://materialize.com/blog/oltp-queries/). *materialize.com*, August 2024. -Archived at [perma.cc/4TYM-TYD8](https://perma.cc/4TYM-TYD8) - -[[82](/en/ch4#Gray2007-marker)] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew -Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. -[Data Cube: A Relational Aggregation Operator -Generalizing Group-By, Cross-Tab, and Sub-Totals](https://arxiv.org/pdf/cs/0701155). *Data Mining and Knowledge -Discovery*, volume 1, issue 1, pages 29–53, March 2007. 
-[doi:10.1023/A:1009726021843](https://doi.org/10.1023/A%3A1009726021843) - -[[83](/en/ch4#Ramsak2000-marker)] Frank Ramsak, Volker Markl, Robert Fenk, Martin -Zirkel, Klaus Elhardt, and Rudolf Bayer. -[Integrating the UB-Tree into a Database System Kernel](https://www.vldb.org/conf/2000/P263.pdf). -At *26th International Conference on Very Large Data Bases* (VLDB), September 2000. - -[[84](/en/ch4#Procopiuc2003-marker)] Octavian Procopiuc, Pankaj K. Agarwal, Lars -Arge, and Jeffrey Scott Vitter. -[Bkd-Tree: A Dynamic -Scalable kd-Tree](https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf). At *8th International Symposium on Spatial and Temporal Databases* -(SSTD), pages 46–65, July 2003. -[doi:10.1007/978-3-540-45072-6\_4](https://doi.org/10.1007/978-3-540-45072-6_4) - -[[85](/en/ch4#Hellerstein1995-marker)] Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer. -[Generalized Search Trees for Database Systems](https://dsf.berkeley.edu/papers/vldb95-gist.pdf). -At *21st International Conference on Very Large Data Bases* (VLDB), September 1995. - -[[86](/en/ch4#Brodsky2018-marker)] Isaac Brodsky. -[H3: Uber’s Hexagonal Hierarchical Spatial Index](https://eng.uber.com/h3/). -*eng.uber.com*, June 2018. -Archived at [archive.org](https://web.archive.org/web/20240722003854/https%3A//www.uber.com/blog/h3/) - -[[87](/en/ch4#Escriva2012-marker)] Robert Escriva, Bernard Wong, and Emin Gün Sirer. -[HyperDex: -A Distributed, Searchable Key-Value Store](https://www.cs.princeton.edu/courses/archive/fall13/cos518/papers/hyperdex.pdf). At *ACM SIGCOMM Conference*, August 2012. -[doi:10.1145/2377677.2377681](https://doi.org/10.1145/2377677.2377681) - -[[88](/en/ch4#Manning2008_ch4-marker)] Christopher D. Manning, Prabhakar Raghavan, -and Hinrich Schütze. -[*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/). -Cambridge University Press, 2008. 
ISBN: 978-0-521-86571-5, available online at -[nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/) - -[[89](/en/ch4#Wang2017-marker)] Jianguo Wang, Chunbin Lin, Yannis Papakonstantinou, -and Steven Swanson. -[An Experimental -Study of Bitmap Compression vs. Inverted List Compression](https://cseweb.ucsd.edu/~swanson/papers/SIGMOD2017-ListCompression.pdf). At *ACM International Conference -on Management of Data* (SIGMOD), pages 993–1008, May 2017. -[doi:10.1145/3035918.3064007](https://doi.org/10.1145/3035918.3064007) - -[[90](/en/ch4#Grand2013-marker)] Adrien Grand. -[What is in a Lucene -Index?](https://speakerdeck.com/elasticsearch/what-is-in-a-lucene-index) At *Lucene/Solr Revolution*, November 2013. -Archived at [perma.cc/Z7QN-GBYY](https://perma.cc/Z7QN-GBYY) - -[[91](/en/ch4#McCandless2011merges-marker)] Michael McCandless. -[Visualizing -Lucene’s Segment Merges](https://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html). *blog.mikemccandless.com*, February 2011. -Archived at [perma.cc/3ZV8-72W6](https://perma.cc/3ZV8-72W6) - -[[92](/en/ch4#Fittl2021-marker)] Lukas Fittl. -[Understanding Postgres GIN Indexes: The Good and the -Bad](https://pganalyze.com/blog/gin-index). *pganalyze.com*, December 2021. -Archived at [perma.cc/V3MW-26H6](https://perma.cc/V3MW-26H6) - -[[93](/en/ch4#Angelakos2020-marker)] Jimmy Angelakos. -[The State of (Full) Text Search in PostgreSQL -12](https://www.youtube.com/watch?v=c8IrUHV70KQ). At *FOSDEM*, February 2020. -Archived at [perma.cc/J6US-3WZS](https://perma.cc/J6US-3WZS) - -[[94](/en/ch4#Korotkov2012-marker)] Alexander Korotkov. -[Index -support for regular expression search](https://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf). At *PGConf.EU Prague*, October 2012. -Archived at [perma.cc/5RFZ-ZKDQ](https://perma.cc/5RFZ-ZKDQ) - -[[95](/en/ch4#McCandless2011fuzzy-marker)] Michael McCandless. 
-[Lucene’s -FuzzyQuery Is 100 Times Faster in 4.0](https://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html). *blog.mikemccandless.com*, March 2011. -Archived at [perma.cc/E2WC-GHTW](https://perma.cc/E2WC-GHTW) - -[[96](/en/ch4#Heinz2002-marker)] Steffen Heinz, Justin Zobel, and Hugh E. Williams. -[Burst -Tries: A Fast, Efficient Data Structure for String Keys](https://web.archive.org/web/20130903070248id_/http%3A//ww2.cs.mu.oz.au%3A80/~jz/fulltext/acmtois02.pdf). -*ACM Transactions on Information Systems*, volume 20, issue 2, pages 192–223, April 2002. -[doi:10.1145/506309.506312](https://doi.org/10.1145/506309.506312) - -[[97](/en/ch4#Schulz2002-marker)] Klaus U. Schulz and Stoyan Mihov. -[Fast String -Correction with Levenshtein Automata](https://dmice.ohsu.edu/bedricks/courses/cs655/pdf/readings/2002_Schulz.pdf). *International Journal on Document Analysis and -Recognition*, volume 5, issue 1, pages 67–85, November 2002. -[doi:10.1007/s10032-002-0082-8](https://doi.org/10.1007/s10032-002-0082-8) - -[[98](/en/ch4#Mikolov2013-marker)] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. -[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781). -At *International Conference on Learning Representations* (ICLR), May 2013. -[doi:10.48550/arXiv.1301.3781](https://doi.org/10.48550/arXiv.1301.3781) - -[[99](/en/ch4#Devlin2018-marker)] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. -[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805). -At *Conference of the North American Chapter of the Association for Computational -Linguistics: Human Language Technologies*, volume 1, pages 4171–4186, June 2019. -[doi:10.18653/v1/N19-1423](https://doi.org/10.18653/v1/N19-1423) - -[[100](/en/ch4#Radford2018-marker)] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 
-[Improving -Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf). *openai.com*, June 2018. -Archived at [perma.cc/5N3C-DJ4C](https://perma.cc/5N3C-DJ4C) - -[[101](/en/ch4#Faiis2023-marker)] Matthijs Douze, Maria Lomeli, and Lucas Hosseini. -[Faiss indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). -*github.com*, August 2024. -Archived at [perma.cc/2EWG-FPBS](https://perma.cc/2EWG-FPBS) - -[[102](/en/ch4#Matevosyan2024-marker)] Varik Matevosyan. -[Understanding pgvector’s HNSW Index Storage in Postgres](https://lantern.dev/blog/pgvector-storage). -*lantern.dev*, August 2024. -Archived at [perma.cc/B2YB-JB59](https://perma.cc/B2YB-JB59) - -[[103](/en/ch4#Baranchuk2018-marker)] Dmitry Baranchuk, Artem Babenko, and Yury Malkov. -[Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors](https://arxiv.org/pdf/1802.02422). -At *European Conference on Computer Vision* (ECCV), pages 202–216, September 2018. -[doi:10.1007/978-3-030-01258-8\_13](https://doi.org/10.1007/978-3-030-01258-8_13) - -[[104](/en/ch4#Malkov2020-marker)] Yury A. Malkov and Dmitry A. Yashunin. -[Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/pdf/1603.09320). -*IEEE Transactions on Pattern Analysis and Machine Intelligence*, volume 42, issue 4, pages 824–836, April 2020. -[doi:10.1109/TPAMI.2018.2889473](https://doi.org/10.1109/TPAMI.2018.2889473) +[^1]: Nikolay Samokhvalov. [How partial, covering, and multicolumn indexes may slow down UPDATEs in PostgreSQL](https://postgres.ai/blog/20211029-how-partial-and-covering-indexes-affect-update-performance-in-postgresql). *postgres.ai*, October 2021. Archived at [perma.cc/PBK3-F4G9](https://perma.cc/PBK3-F4G9) +[^2]: Goetz Graefe. [Modern B-Tree Techniques](https://w6113.github.io/files/papers/btreesurvey-graefe.pdf). 
*Foundations and Trends in Databases*, volume 3, issue 4, pages 203–402, August 2011. [doi:10.1561/1900000028](https://doi.org/10.1561/1900000028) +[^3]: Evan Jones. [Why databases use ordered indexes but programming uses hash tables](https://www.evanjones.ca/ordered-vs-unordered-indexes.html). *evanjones.ca*, December 2019. Archived at [perma.cc/NJX8-3ZZD](https://perma.cc/NJX8-3ZZD) +[^4]: Branimir Lambov. [CEP-25: Trie-indexed SSTable format](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A%2BTrie-indexed%2BSSTable%2Bformat). *cwiki.apache.org*, November 2022. Archived at [perma.cc/HD7W-PW8U](https://perma.cc/HD7W-PW8U). Linked Google Doc archived at [perma.cc/UL6C-AAAE](https://perma.cc/UL6C-AAAE) +[^5]: Thomas H. Cormen, Charles E. Leiserson, Ronald L. Rivest, and Clifford Stein: *Introduction to Algorithms*, 3rd edition. MIT Press, 2009. ISBN: 978-0-262-53305-8 +[^6]: Branimir Lambov. [Trie Memtables in Cassandra](https://www.vldb.org/pvldb/vol15/p3359-lambov.pdf). *Proceedings of the VLDB Endowment*, volume 15, issue 12, pages 3359–3371, August 2022. [doi:10.14778/3554821.3554828](https://doi.org/10.14778/3554821.3554828) +[^7]: Dhruba Borthakur. [The History of RocksDB](https://rocksdb.blogspot.com/2013/11/the-history-of-rocksdb.html). *rocksdb.blogspot.com*, November 2013. Archived at [perma.cc/Z7C5-JPSP](https://perma.cc/Z7C5-JPSP) +[^8]: Matteo Bertozzi. [Apache HBase I/O – HFile](https://blog.cloudera.com/apache-hbase-i-o-hfile/). *blog.cloudera.com*, June 2012. Archived at [perma.cc/U9XH-L2KL](https://perma.cc/U9XH-L2KL) +[^9]: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. [Bigtable: A Distributed Storage System for Structured Data](https://research.google/pubs/pub27898/). At *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006. +[^10]: Patrick O’Neil, Edward Cheng, Dieter Gawlick, and Elizabeth O’Neil. 
[The Log-Structured Merge-Tree (LSM-Tree)](https://www.cs.umb.edu/~poneil/lsmtree.pdf). *Acta Informatica*, volume 33, issue 4, pages 351–385, June 1996. [doi:10.1007/s002360050048](https://doi.org/10.1007/s002360050048) +[^11]: Mendel Rosenblum and John K. Ousterhout. [The Design and Implementation of a Log-Structured File System](https://research.cs.wisc.edu/areas/os/Qual/papers/lfs.pdf). *ACM Transactions on Computer Systems*, volume 10, issue 1, pages 26–52, February 1992. [doi:10.1145/146941.146943](https://doi.org/10.1145/146941.146943) +[^12]: Michael Armbrust, Tathagata Das, Liwen Sun, Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja Łuszczak, Michał Świtakowski, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, and Matei Zaharia. [Delta Lake: High-Performance ACID Table Storage over Cloud Object Stores](https://vldb.org/pvldb/vol13/p3411-armbrust.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3411–3424, August 2020. [doi:10.14778/3415478.3415560](https://doi.org/10.14778/3415478.3415560) +[^13]: Burton H. Bloom. [Space/Time Trade-offs in Hash Coding with Allowable Errors](https://people.cs.umass.edu/~emery/classes/cmpsci691st/readings/Misc/p422-bloom.pdf). *Communications of the ACM*, volume 13, issue 7, pages 422–426, July 1970. [doi:10.1145/362686.362692](https://doi.org/10.1145/362686.362692) +[^14]: Adam Kirsch and Michael Mitzenmacher. [Less Hashing, Same Performance: Building a Better Bloom Filter](https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf). *Random Structures & Algorithms*, volume 33, issue 2, pages 187–218, September 2008. [doi:10.1002/rsa.20208](https://doi.org/10.1002/rsa.20208) +[^15]: Thomas Hurst. [Bloom Filter Calculator](https://hur.st/bloomfilter/). *hur.st*, September 2023. Archived at [perma.cc/L3AV-6VC2](https://perma.cc/L3AV-6VC2) +[^16]: Chen Luo and Michael J. 
Carey. [LSM-based storage techniques: a survey](https://arxiv.org/abs/1812.07527). *The VLDB Journal*, volume 29, pages 393–418, July 2019. [doi:10.1007/s00778-019-00555-y](https://doi.org/10.1007/s00778-019-00555-y) +[^17]: Subhadeep Sarkar and Manos Athanassoulis. [Dissecting, Designing, and Optimizing LSM-based Data Stores](https://www.youtube.com/watch?v=hkMkBZn2mGs). Tutorial at *ACM International Conference on Management of Data* (SIGMOD), June 2022. Slides archived at [perma.cc/93B3-E827](https://perma.cc/93B3-E827) +[^18]: Mark Callaghan. [Name that compaction algorithm](https://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html). *smalldatum.blogspot.com*, August 2018. Archived at [perma.cc/CN4M-82DY](https://perma.cc/CN4M-82DY) +[^19]: Prashanth Rao. [Embedded databases (1): The harmony of DuckDB, KùzuDB and LanceDB](https://thedataquarry.com/posts/embedded-db-1/). *thedataquarry.com*, August 2023. Archived at [perma.cc/PA28-2R35](https://perma.cc/PA28-2R35) +[^20]: Hacker News discussion. [Bluesky migrates to single-tenant SQLite](https://news.ycombinator.com/item?id=38171322). *news.ycombinator.com*, October 2023. Archived at [perma.cc/69LM-5P6X](https://perma.cc/69LM-5P6X) +[^21]: Rudolf Bayer and Edward M. McCreight. [Organization and Maintenance of Large Ordered Indices](https://dl.acm.org/doi/pdf/10.1145/1734663.1734671). Boeing Scientific Research Laboratories, Mathematical and Information Sciences Laboratory, report no. 20, July 1970. [doi:10.1145/1734663.1734671](https://doi.org/10.1145/1734663.1734671) +[^22]: Douglas Comer. [The Ubiquitous B-Tree](https://web.archive.org/web/20170809145513id_/http%3A//sites.fas.harvard.edu/~cs165/papers/comer.pdf). *ACM Computing Surveys*, volume 11, issue 2, pages 121–137, June 1979. [doi:10.1145/356770.356776](https://doi.org/10.1145/356770.356776) +[^23]: Alex Miller. [Torn Write Detection and Protection](https://transactional.blog/blog/2025-torn-writes). *transactional.blog*, April 2025. 
Archived at [perma.cc/G7EB-33EW](https://perma.cc/G7EB-33EW) +[^24]: C. Mohan and Frank Levine. [ARIES/IM: An Efficient and High Concurrency Index Management Method Using Write-Ahead Logging](https://ics.uci.edu/~cs223/papers/p371-mohan.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 1992. [doi:10.1145/130283.130338](https://doi.org/10.1145/130283.130338) +[^25]: Hironobu Suzuki. [The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017. +[^26]: Howard Chu. [LDAP at Lightning Speed](https://buildstuff14.sched.com/event/08a1a368e272eb599a52e08b4c3c779d). At *Build Stuff ’14*, November 2014. Archived at [perma.cc/GB6Z-P8YH](https://perma.cc/GB6Z-P8YH) +[^27]: Manos Athanassoulis, Michael S. Kester, Lukas M. Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, and Mark Callaghan. [Designing Access Methods: The RUM Conjecture](https://openproceedings.org/2016/conf/edbt/paper-12.pdf). At *19th International Conference on Extending Database Technology* (EDBT), March 2016. [doi:10.5441/002/edbt.2016.42](https://doi.org/10.5441/002/edbt.2016.42) +[^28]: Ben Stopford. [Log Structured Merge Trees](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/). *benstopford.com*, February 2015. Archived at [perma.cc/E5BV-KUJ6](https://perma.cc/E5BV-KUJ6) +[^29]: Mark Callaghan. [The Advantages of an LSM vs a B-Tree](https://smalldatum.blogspot.com/2016/01/summary-of-advantages-of-lsm-vs-b-tree.html). *smalldatum.blogspot.co.uk*, January 2016. Archived at [perma.cc/3TYZ-EFUD](https://perma.cc/3TYZ-EFUD) +[^30]: Oana Balmau, Florin Dinu, Willy Zwaenepoel, Karan Gupta, Ravishankar Chandhiramoorthi, and Diego Didona. [SILK: Preventing Latency Spikes in Log-Structured Merge Key-Value Stores](https://www.usenix.org/conference/atc19/presentation/balmau). At *USENIX Annual Technical Conference*, July 2019. +[^31]: Igor Canadi, Siying Dong, Mark Callaghan, et al. 
[RocksDB Tuning Guide](https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide). *github.com*, 2023. Archived at [perma.cc/UNY4-MK6C](https://perma.cc/UNY4-MK6C) +[^32]: Gabriel Haas and Viktor Leis. [What Modern NVMe Storage Can Do, and How to Exploit it: High-Performance I/O for High-Performance Storage Engines](https://www.vldb.org/pvldb/vol16/p2090-haas.pdf). *Proceedings of the VLDB Endowment*, volume 16, issue 9, pages 2090–2102. [doi:10.14778/3598581.3598584](https://doi.org/10.14778/3598581.3598584) +[^33]: Emmanuel Goossaert. [Coding for SSDs](https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/). *codecapsule.com*, February 2014. +[^34]: Jack Vanlightly. [Is sequential IO dead in the era of the NVMe drive?](https://jack-vanlightly.com/blog/2023/5/9/is-sequential-io-dead-in-the-era-of-the-nvme-drive) *jack-vanlightly.com*, May 2023. Archived at [perma.cc/7TMZ-TAPU](https://perma.cc/7TMZ-TAPU) +[^35]: Alibaba Cloud Storage Team. [Storage System Design Analysis: Factors Affecting NVMe SSD Performance (2)](https://www.alibabacloud.com/blog/594376). *alibabacloud.com*, January 2019. Archived at [archive.org](https://web.archive.org/web/20230510065132/https%3A//www.alibabacloud.com/blog/594376) +[^36]: Xiao-Yu Hu and Robert Haas. [The Fundamental Limit of Flash Random Write Performance: Understanding, Analysis and Performance Modelling](https://dominoweb.draco.res.ibm.com/reports/rz3771.pdf). *dominoweb.draco.res.ibm.com*, March 2010. Archived at [perma.cc/8JUL-4ZDS](https://perma.cc/8JUL-4ZDS) +[^37]: Lanyue Lu, Thanumalayan Sankaranarayana Pillai, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [WiscKey: Separating Keys from Values in SSD-conscious Storage](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf). At *14th USENIX Conference on File and Storage Technologies* (FAST), February 2016. +[^38]: Peter Zaitsev.
[Innodb Double Write](https://www.percona.com/blog/innodb-double-write/). *percona.com*, August 2006. Archived at [perma.cc/NT4S-DK7T](https://perma.cc/NT4S-DK7T) +[^39]: Tomas Vondra. [On the Impact of Full-Page Writes](https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-page-writes/). *2ndquadrant.com*, November 2016. Archived at [perma.cc/7N6B-CVL3](https://perma.cc/7N6B-CVL3) +[^40]: Mark Callaghan. [Read, write & space amplification - B-Tree vs LSM](https://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html). *smalldatum.blogspot.com*, November 2015. Archived at [perma.cc/S487-WK5P](https://perma.cc/S487-WK5P) +[^41]: Mark Callaghan. [Choosing Between Efficiency and Performance with RocksDB](https://codemesh.io/codemesh2016/mark-callaghan). At *Code Mesh*, November 2016. Video at [youtube.com/watch?v=tgzkgZVXKB4](https://www.youtube.com/watch?v=tgzkgZVXKB4) +[^42]: Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Zichen Zhu, and Manos Athanassoulis. [Enabling Timely and Persistent Deletion in LSM-Engines](https://subhadeep.net/assets/fulltext/Enabling_Timely_and_Persistent_Deletion_in_LSM-Engines.pdf). *ACM Transactions on Database Systems*, volume 48, issue 3, article no. 8, August 2023. [doi:10.1145/3599724](https://doi.org/10.1145/3599724) +[^43]: Lukas Fittl. [Postgres vs. SQL Server: B-Tree Index Differences & the Benefit of Deduplication](https://pganalyze.com/blog/postgresql-vs-sql-server-btree-index-deduplication). *pganalyze.com*, April 2025. Archived at [perma.cc/XY6T-LTPX](https://perma.cc/XY6T-LTPX) +[^44]: Drew Silcock. [How Postgres stores data on disk – this one’s a page turner](https://drew.silcock.dev/blog/how-postgres-stores-data-on-disk/). *drew.silcock.dev*, August 2024. Archived at [perma.cc/8K7K-7VJ2](https://perma.cc/8K7K-7VJ2) +[^45]: Joe Webb. 
[Using Covering Indexes to Improve Query Performance](https://www.red-gate.com/simple-talk/databases/sql-server/learn/using-covering-indexes-to-improve-query-performance/). *simple-talk.com*, September 2008. Archived at [perma.cc/6MEZ-R5VR](https://perma.cc/6MEZ-R5VR) +[^46]: Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. [The End of an Architectural Era (It’s Time for a Complete Rewrite)](https://vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf). At *33rd International Conference on Very Large Data Bases* (VLDB), September 2007. +[^47]: [VoltDB Technical Overview White Paper](https://www.voltactivedata.com/wp-content/uploads/2017/03/hv-white-paper-voltdb-technical-overview.pdf). VoltDB, 2017. Archived at [perma.cc/B9SF-SK5G](https://perma.cc/B9SF-SK5G) +[^48]: Stephen M. Rumble, Ankita Kejriwal, and John K. Ousterhout. [Log-Structured Memory for DRAM-Based Storage](https://www.usenix.org/system/files/conference/fast14/fast14-paper_rumble.pdf). At *12th USENIX Conference on File and Storage Technologies* (FAST), February 2014. +[^49]: Stavros Harizopoulos, Daniel J. Abadi, Samuel Madden, and Michael Stonebraker. [OLTP Through the Looking Glass, and What We Found There](https://hstore.cs.brown.edu/papers/hstore-lookingglass.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2008. [doi:10.1145/1376616.1376713](https://doi.org/10.1145/1376616.1376713) +[^50]: Per-Åke Larson, Cipri Clinciu, Campbell Fraser, Eric N. Hanson, Mostafa Mokhtar, Michal Nowakiewicz, Vassilis Papadimos, Susan L. Price, Srikumar Rangarajan, Remus Rusanu, and Mayukh Saubhasik. [Enhancements to SQL Server Column Stores](https://web.archive.org/web/20131203001153id_/http%3A//research.microsoft.com/pubs/193599/Apollo3%20-%20Sigmod%202013%20-%20final.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2013. 
[doi:10.1145/2463676.2463708](https://doi.org/10.1145/2463676.2463708) +[^51]: Franz Färber, Norman May, Wolfgang Lehner, Philipp Große, Ingo Müller, Hannes Rauhe, and Jonathan Dees. [The SAP HANA Database – An Architecture Overview](https://web.archive.org/web/20220208081111id_/http%3A//sites.computer.org/debull/A12mar/hana.pdf). *IEEE Data Engineering Bulletin*, volume 35, issue 1, pages 28–33, March 2012. +[^52]: Michael Stonebraker. [The Traditional RDBMS Wisdom Is (Almost Certainly) All Wrong](https://slideshot.epfl.ch/talks/166). Presentation at *EPFL*, May 2013. +[^53]: Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. [Cloud-Native Transactions and Analytics in SingleStore](https://dl.acm.org/doi/pdf/10.1145/3514221.3526055). At *ACM International Conference on Management of Data* (SIGMOD), June 2022. [doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055) +[^54]: Tino Tereshko and Jordan Tigani. [BigQuery under the hood](https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood). *cloud.google.com*, January 2016. Archived at [perma.cc/WP2Y-FUCF](https://perma.cc/WP2Y-FUCF) +[^55]: Wes McKinney. [The Road to Composable Data Systems: Thoughts on the Last 15 Years and the Future](https://wesmckinney.com/blog/looking-back-15-years/). *wesmckinney.com*, September 2023. Archived at [perma.cc/6L2M-GTJX](https://perma.cc/6L2M-GTJX) +[^56]: Michael Stonebraker, Daniel J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik. [C-Store: A Column-oriented DBMS](https://www.vldb.org/archives/website/2005/program/paper/thu/p553-stonebraker.pdf). At *31st International Conference on Very Large Data Bases* (VLDB), pages 553–564, September 2005. +[^57]: Julien Le Dem. 
[Dremel Made Simple with Parquet](https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html). *blog.twitter.com*, September 2013. +[^58]: Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis. [Dremel: Interactive Analysis of Web-Scale Datasets](https://vldb.org/pvldb/vol3/R29.pdf). At *36th International Conference on Very Large Data Bases* (VLDB), pages 330–339, September 2010. [doi:10.14778/1920841.1920886](https://doi.org/10.14778/1920841.1920886) +[^59]: Joe Kearney. [Understanding Record Shredding: storing nested data in columns](https://www.joekearney.co.uk/posts/understanding-record-shredding). *joekearney.co.uk*, December 2016. Archived at [perma.cc/ZD5N-AX5D](https://perma.cc/ZD5N-AX5D) +[^60]: Jamie Brandon. [A shallow survey of OLAP and HTAP query engines](https://www.scattered-thoughts.net/writing/a-shallow-survey-of-olap-and-htap-query-engines). *scattered-thoughts.net*, September 2023. Archived at [perma.cc/L3KH-J4JF](https://perma.cc/L3KH-J4JF) +[^61]: Benoit Dageville, Thierry Cruanes, Marcin Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner. [The Snowflake Elastic Data Warehouse](https://dl.acm.org/doi/pdf/10.1145/2882903.2903741). At *ACM International Conference on Management of Data* (SIGMOD), pages 215–226, June 2016. [doi:10.1145/2882903.2903741](https://doi.org/10.1145/2882903.2903741) +[^62]: Mark Raasveldt and Hannes Mühleisen. [Data Management for Data Science Towards Embedded Analytics](https://duckdb.org/pdf/CIDR2020-raasveldt-muehleisen-duckdb.pdf). At *10th Conference on Innovative Data Systems Research* (CIDR), January 2020. 
+[^63]: Jean-François Im, Kishore Gopalakrishna, Subbu Subramaniam, Mayank Shrivastava, Adwait Tumbde, Xiaotian Jiang, Jennifer Dai, Seunghyun Lee, Neha Pawar, Jialiang Li, and Ravi Aringunram. [Pinot: Realtime OLAP for 530 Million Users](https://cwiki.apache.org/confluence/download/attachments/103092375/Pinot.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 583–594, May 2018. [doi:10.1145/3183713.3190661](https://doi.org/10.1145/3183713.3190661) +[^64]: Fangjin Yang, Eric Tschetter, Xavier Léauté, Nelson Ray, Gian Merlino, and Deep Ganguli. [Druid: A Real-time Analytical Data Store](https://static.druid.io/docs/druid.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2014. [doi:10.1145/2588555.2595631](https://doi.org/10.1145/2588555.2595631) +[^65]: Chunwei Liu, Anna Pavlenko, Matteo Interlandi, and Brandon Haynes. [Deep Dive into Common Open Formats for Analytical DBMSs](https://www.vldb.org/pvldb/vol16/p3044-liu.pdf). *Proceedings of the VLDB Endowment*, volume 16, issue 11, pages 3044–3056, July 2023. [doi:10.14778/3611479.3611507](https://doi.org/10.14778/3611479.3611507) +[^66]: Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes McKinney, and Huanchen Zhang. [An Empirical Evaluation of Columnar Storage Formats](https://www.vldb.org/pvldb/vol17/p148-zeng.pdf). *Proceedings of the VLDB Endowment*, volume 17, issue 2, pages 148–161. [doi:10.14778/3626292.3626298](https://doi.org/10.14778/3626292.3626298) +[^67]: Weston Pace. [Lance v2: A columnar container format for modern data](https://blog.lancedb.com/lance-v2/). *blog.lancedb.com*, April 2024. Archived at [perma.cc/ZK3Q-S9VJ](https://perma.cc/ZK3Q-S9VJ) +[^68]: Yoav Helfman. [Nimble, A New Columnar File Format](https://www.youtube.com/watch?v=bISBNVtXZ6M). At *VeloxCon*, April 2024. +[^69]: Wes McKinney. [Apache Arrow: High-Performance Columnar Data Framework](https://www.youtube.com/watch?v=YhF8YR0OEFk). 
At *CMU Database Group – Vaccination Database Tech Talks*, December 2021. +[^70]: Wes McKinney. [Python for Data Analysis, 3rd Edition](https://learning.oreilly.com/library/view/python-for-data/9781098104023/). O’Reilly Media, August 2022. ISBN: 9781098104023 +[^71]: Paul Dix. [The Design of InfluxDB IOx: An In-Memory Columnar Database Written in Rust with Apache Arrow](https://www.youtube.com/watch?v=_zbwz-4RDXg). At *CMU Database Group – Vaccination Database Tech Talks*, May 2021. +[^72]: Carlota Soto and Mike Freedman. [Building Columnar Compression for Large PostgreSQL Databases](https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database/). *timescale.com*, March 2024. Archived at [perma.cc/7KTF-V3EH](https://perma.cc/7KTF-V3EH) +[^73]: Daniel Lemire, Gregory Ssi‐Yan‐Kai, and Owen Kaser. [Consistently faster and smaller compressed bitmaps with Roaring](https://arxiv.org/pdf/1603.06549). *Software: Practice and Experience*, volume 46, issue 11, pages 1547–1569, November 2016. [doi:10.1002/spe.2402](https://doi.org/10.1002/spe.2402) +[^74]: Jaz Volpert. [An entire Social Network in 1.6GB (GraphD Part 2)](https://jazco.dev/2024/04/20/roaring-bitmaps/). *jazco.dev*, April 2024. Archived at [perma.cc/L27Z-QVMG](https://perma.cc/L27Z-QVMG) +[^75]: Daniel J. Abadi, Peter Boncz, Stavros Harizopoulos, Stratos Idreos, and Samuel Madden. [The Design and Implementation of Modern Column-Oriented Database Systems](https://www.cs.umd.edu/~abadi/papers/abadi-column-stores.pdf). *Foundations and Trends in Databases*, volume 5, issue 3, pages 197–280, December 2013. [doi:10.1561/1900000024](https://doi.org/10.1561/1900000024) +[^76]: Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan, Nga Tran, Ben Vandiver, Lyric Doshi, and Chuck Bear. [The Vertica Analytic Database: C-Store 7 Years Later](https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf). *Proceedings of the VLDB Endowment*, volume 5, issue 12, pages 1790–1801, August 2012. 
[doi:10.14778/2367502.2367518](https://doi.org/10.14778/2367502.2367518) +[^77]: Timo Kersten, Viktor Leis, Alfons Kemper, Thomas Neumann, Andrew Pavlo, and Peter Boncz. [Everything You Always Wanted to Know About Compiled and Vectorized Queries But Were Afraid to Ask](https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf). *Proceedings of the VLDB Endowment*, volume 11, issue 13, pages 2209–2222, September 2018. [doi:10.14778/3275366.3284966](https://doi.org/10.14778/3275366.3284966) +[^78]: Forrest Smith. [Memory Bandwidth Napkin Math](https://www.forrestthewoods.com/blog/memory-bandwidth-napkin-math/). *forrestthewoods.com*, February 2020. Archived at [perma.cc/Y8U4-PS7N](https://perma.cc/Y8U4-PS7N) +[^79]: Peter Boncz, Marcin Zukowski, and Niels Nes. [MonetDB/X100: Hyper-Pipelining Query Execution](https://www.cidrdb.org/cidr2005/papers/P19.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005. +[^80]: Jingren Zhou and Kenneth A. Ross. [Implementing Database Operations Using SIMD Instructions](https://www1.cs.columbia.edu/~kar/pubsk/simd.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 145–156, June 2002. [doi:10.1145/564691.564709](https://doi.org/10.1145/564691.564709) +[^81]: Kevin Bartley. [OLTP Queries: Transfer Expensive Workloads to Materialize](https://materialize.com/blog/oltp-queries/). *materialize.com*, August 2024. Archived at [perma.cc/4TYM-TYD8](https://perma.cc/4TYM-TYD8) +[^82]: Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh. [Data Cube: A Relational Aggregation Operator Generalizing Group-By, Cross-Tab, and Sub-Totals](https://arxiv.org/pdf/cs/0701155). *Data Mining and Knowledge Discovery*, volume 1, issue 1, pages 29–53, March 2007. [doi:10.1023/A:1009726021843](https://doi.org/10.1023/A%3A1009726021843) +[^83]: Frank Ramsak, Volker Markl, Robert Fenk, Martin Zirkel, Klaus Elhardt, and Rudolf Bayer. 
[Integrating the UB-Tree into a Database System Kernel](https://www.vldb.org/conf/2000/P263.pdf). At *26th International Conference on Very Large Data Bases* (VLDB), September 2000. +[^84]: Octavian Procopiuc, Pankaj K. Agarwal, Lars Arge, and Jeffrey Scott Vitter. [Bkd-Tree: A Dynamic Scalable kd-Tree](https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf). At *8th International Symposium on Spatial and Temporal Databases* (SSTD), pages 46–65, July 2003. [doi:10.1007/978-3-540-45072-6\_4](https://doi.org/10.1007/978-3-540-45072-6_4) +[^85]: Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer. [Generalized Search Trees for Database Systems](https://dsf.berkeley.edu/papers/vldb95-gist.pdf). At *21st International Conference on Very Large Data Bases* (VLDB), September 1995. +[^86]: Isaac Brodsky. [H3: Uber’s Hexagonal Hierarchical Spatial Index](https://eng.uber.com/h3/). *eng.uber.com*, June 2018. Archived at [archive.org](https://web.archive.org/web/20240722003854/https%3A//www.uber.com/blog/h3/) +[^87]: Robert Escriva, Bernard Wong, and Emin Gün Sirer. [HyperDex: A Distributed, Searchable Key-Value Store](https://www.cs.princeton.edu/courses/archive/fall13/cos518/papers/hyperdex.pdf). At *ACM SIGCOMM Conference*, August 2012. [doi:10.1145/2377677.2377681](https://doi.org/10.1145/2377677.2377681) +[^88]: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/). Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at [nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/) +[^89]: Jianguo Wang, Chunbin Lin, Yannis Papakonstantinou, and Steven Swanson. [An Experimental Study of Bitmap Compression vs. Inverted List Compression](https://cseweb.ucsd.edu/~swanson/papers/SIGMOD2017-ListCompression.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 993–1008, May 2017. 
[doi:10.1145/3035918.3064007](https://doi.org/10.1145/3035918.3064007) +[^90]: Adrien Grand. [What is in a Lucene Index?](https://speakerdeck.com/elasticsearch/what-is-in-a-lucene-index) At *Lucene/Solr Revolution*, November 2013. Archived at [perma.cc/Z7QN-GBYY](https://perma.cc/Z7QN-GBYY) +[^91]: Michael McCandless. [Visualizing Lucene’s Segment Merges](https://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html). *blog.mikemccandless.com*, February 2011. Archived at [perma.cc/3ZV8-72W6](https://perma.cc/3ZV8-72W6) +[^92]: Lukas Fittl. [Understanding Postgres GIN Indexes: The Good and the Bad](https://pganalyze.com/blog/gin-index). *pganalyze.com*, December 2021. Archived at [perma.cc/V3MW-26H6](https://perma.cc/V3MW-26H6) +[^93]: Jimmy Angelakos. [The State of (Full) Text Search in PostgreSQL 12](https://www.youtube.com/watch?v=c8IrUHV70KQ). At *FOSDEM*, February 2020. Archived at [perma.cc/J6US-3WZS](https://perma.cc/J6US-3WZS) +[^94]: Alexander Korotkov. [Index support for regular expression search](https://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf). At *PGConf.EU Prague*, October 2012. Archived at [perma.cc/5RFZ-ZKDQ](https://perma.cc/5RFZ-ZKDQ) +[^95]: Michael McCandless. [Lucene’s FuzzyQuery Is 100 Times Faster in 4.0](https://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html). *blog.mikemccandless.com*, March 2011. Archived at [perma.cc/E2WC-GHTW](https://perma.cc/E2WC-GHTW) +[^96]: Steffen Heinz, Justin Zobel, and Hugh E. Williams. [Burst Tries: A Fast, Efficient Data Structure for String Keys](https://web.archive.org/web/20130903070248id_/http%3A//ww2.cs.mu.oz.au%3A80/~jz/fulltext/acmtois02.pdf). *ACM Transactions on Information Systems*, volume 20, issue 2, pages 192–223, April 2002. [doi:10.1145/506309.506312](https://doi.org/10.1145/506309.506312) +[^97]: Klaus U. Schulz and Stoyan Mihov. 
[Fast String Correction with Levenshtein Automata](https://dmice.ohsu.edu/bedricks/courses/cs655/pdf/readings/2002_Schulz.pdf). *International Journal on Document Analysis and Recognition*, volume 5, issue 1, pages 67–85, November 2002. [doi:10.1007/s10032-002-0082-8](https://doi.org/10.1007/s10032-002-0082-8) +[^98]: Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. [Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781). At *International Conference on Learning Representations* (ICLR), May 2013. [doi:10.48550/arXiv.1301.3781](https://doi.org/10.48550/arXiv.1301.3781) +[^99]: Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. [BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805). At *Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies*, volume 1, pages 4171–4186, June 2019. [doi:10.18653/v1/N19-1423](https://doi.org/10.18653/v1/N19-1423) +[^100]: Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. [Improving Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf). *openai.com*, June 2018. Archived at [perma.cc/5N3C-DJ4C](https://perma.cc/5N3C-DJ4C) +[^101]: Matthijs Douze, Maria Lomeli, and Lucas Hosseini. [Faiss indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes). *github.com*, August 2024. Archived at [perma.cc/2EWG-FPBS](https://perma.cc/2EWG-FPBS) +[^102]: Varik Matevosyan. [Understanding pgvector’s HNSW Index Storage in Postgres](https://lantern.dev/blog/pgvector-storage). *lantern.dev*, August 2024. Archived at [perma.cc/B2YB-JB59](https://perma.cc/B2YB-JB59) +[^103]: Dmitry Baranchuk, Artem Babenko, and Yury Malkov. [Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors](https://arxiv.org/pdf/1802.02422). 
At *European Conference on Computer Vision* (ECCV), pages 202–216, September 2018. [doi:10.1007/978-3-030-01258-8\_13](https://doi.org/10.1007/978-3-030-01258-8_13) +[^104]: Yury A. Malkov and Dmitry A. Yashunin. [Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/pdf/1603.09320). *IEEE Transactions on Pattern Analysis and Machine Intelligence*, volume 42, issue 4, pages 824–836, April 2020. [doi:10.1109/TPAMI.2018.2889473](https://doi.org/10.1109/TPAMI.2018.2889473) \ No newline at end of file diff --git a/content/en/ch5.md b/content/en/ch5.md index 1c137c5..27e9da1 100644 --- a/content/en/ch5.md +++ b/content/en/ch5.md @@ -119,17 +119,17 @@ restored with minimal additional code. However, they also have a number of deep integrating your systems with those of other organizations (which may use different languages). * In order to restore data in the same object types, the decoding process needs to be able to instantiate arbitrary classes. This is frequently a source of security problems - [[1](/en/ch5#CWE502)]: + [^1]: if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate arbitrary classes, which in turn often allows them to do terrible things such as remotely executing arbitrary code [[2](/en/ch5#Breen2015), [3](/en/ch5#McKenzie2013)]. * Versioning data is often an afterthought in these libraries: as they are intended for quick and easy encoding of data, they often neglect the inconvenient problems of forward and backward - compatibility [[4](/en/ch5#Goetz2019)]. + compatibility [^4]. * Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also often an afterthought. For example, Java’s built-in serialization is notorious for its bad - performance and bloated encoding [[5](/en/ch5#JvmSerializers)]. + performance and bloated encoding [^5]. 
For these reasons it’s generally a bad idea to use your language’s built-in encoding for anything other than very transient purposes. @@ -139,7 +139,7 @@ other than very transient purposes. When moving to standardized encodings that can be written and read by many programming languages, JSON and XML are the obvious contenders. They are widely known, widely supported, and almost as widely disliked. XML is often criticized for being too verbose and unnecessarily complicated -[[6](/en/ch5#XMLSExp)]. +[^6]. JSON’s popularity is mainly due to its built-in support in web browsers and simplicity relative to XML. CSV is another popular language-independent format, but it only supports tabular data without nesting. @@ -156,11 +156,11 @@ problems: This is a problem when dealing with large numbers; for example, integers greater than 2^53 cannot be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become inaccurate when parsed in a language that uses floating-point numbers, such as JavaScript - [[7](/en/ch5#Evans2023)]. + [^7]. An example of numbers larger than 2^53 occurs on X (formerly Twitter), which uses a 64-bit number to identify each post. The JSON returned by the API includes post IDs twice, once as a JSON number and once as a decimal string, to work around the fact that the numbers are not correctly parsed by - JavaScript applications [[8](/en/ch5#Harris2010)]. + JavaScript applications [^8]. * JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they don’t support binary strings (sequences of bytes without a character encoding). Binary strings are a useful feature, so people get around this limitation by encoding the binary data as text using @@ -174,7 +174,7 @@ problems: column. If an application change adds a new row or column, you have to handle that change manually. CSV is also a quite vague format (what happens if a value contains a comma or a newline character?).
Although its escaping rules have been formally specified - [[9](/en/ch5#Shafranovich2005)], + [^9], not all parsers implement them correctly. Despite these flaws, JSON, XML, and CSV are good enough for many purposes. It’s likely that they will @@ -228,9 +228,9 @@ In addition to open and closed content models and validators, JSON Schema suppor if/else schema logic, named types, references to remote schemas, and much more. All of this makes for a very powerful schema language. Such features also make for unwieldy definitions. It can be challenging to resolve remote schemas, reason about conditional rules, or evolve schemas in a -forwards or backwards compatible way [[10](/en/ch5#Coates2024)]. +forwards or backwards compatible way [^10]. Similar concerns apply to XML Schema -[[11](/en/ch5#Geneves2008)]. +[^11]. ### Binary encoding @@ -239,7 +239,7 @@ observation led to the development of a profusion of binary encodings for JSON ( BSON, BJSON, UBJSON, BISON, Hessian, and Smile, to name a few) and for XML (WBXML and Fast Infoset, for example). These formats have been adopted in various niches, as they are more compact and sometimes faster to parse, but none of them are as widely adopted as the textual versions of JSON -and XML [[12](/en/ch5#Bray2019)]. +and XML [^12]. Some of these formats extend the set of datatypes (e.g., distinguishing integers and floating-point numbers, or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. In @@ -287,7 +287,7 @@ In the following sections we will see how we can do much better, and encode the Protocol Buffers (protobuf) is a binary encoding library developed at Google. It is similar to Apache Thrift, which was originally developed by Facebook -[[13](/en/ch5#Slee2007)]; +[^13]; most of what this section says about Protocol Buffers applies also to Thrift. Protocol Buffers requires a schema for any data that is encoded. 
To encode the data @@ -311,7 +311,7 @@ language is very simple compared to JSON Schema: it only defines the fields of r types, but it does not support other restrictions on the possible values of fields. Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in -[Figure 5-3](/en/ch5#fig_encoding_protobuf) [[14](/en/ch5#Kleppmann2012evolution)]. +[Figure 5-3](/en/ch5#fig_encoding_protobuf) [^14]. ![ddia 0503](/fig/ddia_0503.png) @@ -382,7 +382,7 @@ value won’t fit in 32 bits, it will be truncated. Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers. It was started in 2009 as a subproject of Hadoop, as a result of Protocol Buffers not being a good fit for Hadoop’s use cases -[[15](/en/ch5#Cutting2009)]. +[^15]. Avro also uses a schema to specify the structure of the data being encoded. It has two schema languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily @@ -493,7 +493,7 @@ case in Avro: if you want to allow a field to be null, you have to use a *union `union { null, long, string } field;` indicates that `field` can be a number, or a string, or null. You can only use `null` as a default value if it is the first branch of the union. This is a little more verbose than having everything nullable by default, but it helps prevent bugs by being explicit -about what can and cannot be null [[18](/en/ch5#Hoare2009)]. +about what can and cannot be null [^18]. Changing the datatype of a field is possible, provided that Avro can convert the type. Changing the name of a field is possible but a little tricky: the reader’s schema can contain aliases for field @@ -525,9 +525,9 @@ Database with individually written records schema, it can decode the rest of the record. 
Confluent’s schema registry for Apache Kafka - [[19](/en/ch5#ConfluentSchemaReg)] + [^19] and LinkedIn’s Espresso - [[20](/en/ch5#Auradkar2015)] + [^20] work this way, for example. Sending records over a network connection @@ -537,7 +537,7 @@ Sending records over a network connection A database of schema versions is a useful thing to have in any case, since it acts as documentation and gives you a chance to check schema compatibility -[[21](/en/ch5#Kreps2015)]. +[^21]. As the version number, you could use a simple incrementing integer, or you could use a hash of the schema. @@ -552,7 +552,7 @@ you have a relational database whose contents you want to dump to a file, and yo binary format to avoid the aforementioned problems with textual formats (JSON, CSV, XML). If you use Avro, you can fairly easily generate an Avro schema (in the JSON representation we saw earlier) from the relational schema and encode the database contents using that schema, dumping it all to an Avro -object container file [[22](/en/ch5#Shapira2014)]. +object container file [^22]. You can generate a record schema for each database table, and each column becomes a field in that record. The column name in the database maps to the field name in Avro. @@ -585,9 +585,9 @@ common with ASN.1, a schema definition language that was first standardized in 1 [24](/en/ch5#Kaliski1993)]. It was used to define various network protocols, and its binary encoding (DER) is still used to encode SSL certificates (X.509), for example -[[25](/en/ch5#HoffmanAndrews2020)]. +[^25]. ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers -[[26](/en/ch5#Walkin2010)]. +[^26]. However, it’s also very complex and badly documented, so ASN.1 is probably not a good choice for new applications. @@ -680,9 +680,9 @@ versions of the schema. 
More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or moving some data into a separate table—still require data to be rewritten, often at the application -level [[27](/en/ch5#Xu2017)]. +level [^27]. Maintaining forward and backward compatibility across such migrations is still a research problem -[[28](/en/ch5#Litt2020)]. +[^28]. ### Archival storage @@ -723,7 +723,7 @@ In some ways, services are similar to databases: they typically allow clients to data. However, while databases allow arbitrary queries using the query languages we discussed in [Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs that are predetermined by the business logic (application code) of the service -[[29](/en/ch5#Helland2005_ch5)]. This restriction provides a degree of encapsulation: services can impose +[^29]. This restriction provides a degree of encapsulation: services can impose fine-grained restrictions on what clients can and cannot do. A key design goal of a service-oriented/microservices architecture is to make the application easier @@ -764,7 +764,7 @@ need to somehow find out these details. Service developers often use an interfac language (IDL) to define and document their service’s API endpoints and data models, and to evolve them over time. Other developers can then use the service definition to determine how to query the service. The two most popular service IDLs are OpenAPI (also known as Swagger -[[32](/en/ch5#Swagger2014)]) +[^32]) and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send and receive Protocol Buffers. @@ -838,7 +838,7 @@ requests over a network, many of which received a lot of hype but have serious p JavaBeans (EJB) and Java’s Remote Method Invocation (RMI) are limited to Java. The Distributed Component Object Model (DCOM) is limited to Microsoft platforms. 
The Common Object Request Broker Architecture (CORBA) is excessively complex, and does not provide backward or forward -compatibility [[33](/en/ch5#Henning2006)]. +compatibility [^33]. SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are also plagued by complexity and compatibility problems [[34](/en/ch5#Lacey2006), @@ -846,7 +846,7 @@ also plagued by complexity and compatibility problems [36](/en/ch5#Bray2004)]. All of these are based on the idea of a *remote procedure call* (RPC), which has been around since -the 1970s [[37](/en/ch5#Birrell1984)]. +the 1970s [^37]. The RPC model tries to make a request to a remote network service look the same as calling a function or method in your programming language, within the same process (this abstraction is called *location transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed @@ -868,7 +868,7 @@ A network request is very different from a local function call: through, and only the response was lost. In that case, retrying will cause the action to be performed multiple times, unless you build a mechanism for deduplication (*idempotence*) into - the protocol [[40](/en/ch5#Leach2017idemptence)]. + the protocol [^40]. Local function calls don’t have this problem. (We discuss idempotence in more detail in [Link to Come].) * Every time you call a local function, it normally takes about the same time to execute. A network @@ -902,7 +902,7 @@ overloaded, the client has to be manually reconfigured. To provide higher availability and scalability, there are usually multiple instances of a service running on different machines, any of which can handle an incoming request. Spreading requests across these instances is called *load balancing* -[[41](/en/ch5#Rose2023)]. +[^41]. There are many load balancing and service discovery solutions available: * *Hardware load balancers* are specialized pieces of equipment that are installed in data centers. 
@@ -974,12 +974,12 @@ indefinitely. If a compatibility-breaking change is required, the service provid maintaining multiple versions of the service API side by side. There is no agreement on how API versioning should work (i.e., how a client can indicate which -version of the API it wants to use [[42](/en/ch5#Hunt2014wn)]). +version of the API it wants to use [^42]). For RESTful APIs, common approaches are to use a version number in the URL or in the HTTP `Accept` header. For services that use API keys to identify a particular client, another option is to store a client’s requested API version on the server and to allow this version selection to be updated through a separate administrative interface -[[43](/en/ch5#Leach2017versioning)]. +[^43]. ## Durable Execution and Workflows @@ -995,7 +995,7 @@ the credit card, and call the banking service to deposit debited funds, as shown Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a general-purpose programming language, a domain specific language (DSL), or a markup language such as Business Process Execution Language (BPEL) -[[44](/en/ch5#BPEL2007)]. +[^44]. # Tasks, Activities, and Functions @@ -1068,19 +1068,19 @@ class PaymentWorkflow: Frameworks like Temporal are not without their challenges. External services, such as the third-party payment gateway in our example, must still provide an idempotent API. Developers must remember to use unique IDs for these APIs to prevent duplicate execution -[[47](/en/ch5#Tenzer2024)]. +[^47]. And because durable execution frameworks log each RPC call in order, it expects a subsequent execution to make the same RPC calls in the same order. This makes code changes brittle: you might introduce undefined behavior simply by re-ordering function calls -[[48](/en/ch5#TemporalWorkflow)]. +[^48]. 
Instead of modifying the code of an existing workflow, it is safer to deploy a new version of the code separately, so that re-executions of existing workflow invocations continue to use the old version, and only new invocations use the new code -[[49](/en/ch5#Kleeman2024)]. +[^49]. Similarly, because durable execution frameworks expect to replay all code deterministically (the same inputs produce the same outputs), nondeterministic code such as random number generators or -system clocks are problematic [[48](/en/ch5#TemporalWorkflow)]. +system clocks are problematic [^48]. Frameworks often provide their own, deterministic implementations of such library functions, but you have to remember to use them. In some cases, such as with Temporal’s workflowcheck tool, frameworks provide static analysis tools to determine if nondeterministic behavior has been @@ -1099,7 +1099,7 @@ unlike RPC, the sender usually does not wait for the recipient to process the ev events are typically not sent to the recipient via a direct network connection, but go via an intermediary called a *message broker* (also called an *event broker*, *message queue*, or *message-oriented middleware*), which stores the message temporarily. -[[50](/en/ch5#Perera2023)]. +[^50]. Using a message broker has several advantages compared to direct RPC: @@ -1162,7 +1162,7 @@ scenarios, messages will be lost. Since each actor processes only one message at need to worry about threads, and each actor can be scheduled independently by the framework. In *distributed actor frameworks* such as Akka, Orleans -[[51](/en/ch5#Bernstein2014)], +[^51], and Erlang/OTP, this programming model is used to scale an application across multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient are on the same node or different nodes. If they are on different nodes, the message is @@ -1225,257 +1225,58 @@ quite achievable. 
May your application’s evolution be rapid and your deploymen ##### Footnotes + ##### References -[[1](/en/ch5#CWE502-marker)] [CWE-502: -Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*, -July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y) -[[2](/en/ch5#Breen2015-marker)] Steve Breen. -[What -Do WebLogic, WebSphere, JBoss, Jenkins, OpenNMS, and Your Application Have in Common? This -Vulnerability](https://foxglovesecurity.com/2015/11/06/what-do-weblogic-websphere-jboss-jenkins-opennms-and-your-application-have-in-common-this-vulnerability/). *foxglovesecurity.com*, November 2015. -Archived at [perma.cc/9U97-UVVD](https://perma.cc/9U97-UVVD) - -[[3](/en/ch5#McKenzie2013-marker)] Patrick McKenzie. -[What -the Rails Security Issue Means for Your Startup](https://www.kalzumeus.com/2013/01/31/what-the-rails-security-issue-means-for-your-startup/). *kalzumeus.com*, January 2013. -Archived at [perma.cc/2MBJ-7PZ6](https://perma.cc/2MBJ-7PZ6) - -[[4](/en/ch5#Goetz2019-marker)] Brian Goetz. -[Towards -Better Serialization](https://openjdk.org/projects/amber/design-notes/towards-better-serialization). *openjdk.org*, June 2019. -Archived at [perma.cc/UK6U-GQDE](https://perma.cc/UK6U-GQDE) - -[[5](/en/ch5#JvmSerializers-marker)] Eishay Smith. -[jvm-serializers wiki](https://github.com/eishay/jvm-serializers/wiki). -*github.com*, October 2023. -Archived at [perma.cc/PJP7-WCNG](https://perma.cc/PJP7-WCNG) - -[[6](/en/ch5#XMLSExp-marker)] [XML -Is a Poor Copy of S-Expressions](https://wiki.c2.com/?XmlIsaPoorCopyOfEssExpressions). *wiki.c2.com*, May 2013. -Archived at [perma.cc/7FAN-YBKL](https://perma.cc/7FAN-YBKL) - -[[7](/en/ch5#Evans2023-marker)] Julia Evans. -[Examples of floating -point problems](https://jvns.ca/blog/2023/01/13/examples-of-floating-point-problems/). *jvns.ca*, January 2023. 
-Archived at [perma.cc/M57L-QKKW](https://perma.cc/M57L-QKKW) - -[[8](/en/ch5#Harris2010-marker)] Matt Harris. -[Snowflake: -An Update and Some Very Important Information](https://groups.google.com/g/twitter-development-talk/c/ahbvo3VTIYI). Email to *Twitter Development -Talk* mailing list, October 2010. -Archived at [perma.cc/8UBV-MZ3D](https://perma.cc/8UBV-MZ3D) - -[[9](/en/ch5#Shafranovich2005-marker)] Yakov Shafranovich. -[RFC 4180: Common Format and MIME Type for -Comma-Separated Values (CSV) Files](https://tools.ietf.org/html/rfc4180). IETF, October 2005. - -[[10](/en/ch5#Coates2024-marker)] Andy Coates. -[Evolving JSON Schemas - Part I](https://www.creekservice.org/articles/2024/01/08/json-schema-evolution-part-1.html) and -[Part II](https://www.creekservice.org/articles/2024/01/09/json-schema-evolution-part-2.html). -*creekservice.org*, January 2024. Archived at -[perma.cc/MZW3-UA54](https://perma.cc/MZW3-UA54) and -[perma.cc/GT5H-WKZ5](https://perma.cc/GT5H-WKZ5) - -[[11](/en/ch5#Geneves2008-marker)] Pierre Genevès, Nabil Layaïda, and Vincent Quint. -[Ensuring Query Compatibility with Evolving XML Schemas](https://arxiv.org/abs/0811.4324). -INRIA Technical Report 6711, November 2008. - -[[12](/en/ch5#Bray2019-marker)] Tim Bray. -[Bits On the Wire](https://www.tbray.org/ongoing/When/201x/2019/11/17/Bits-On-the-Wire). -*tbray.org*, November 2019. -Archived at [perma.cc/3BT3-BQU3](https://perma.cc/3BT3-BQU3) - -[[13](/en/ch5#Slee2007-marker)] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. -[Thrift: Scalable -Cross-Language Services Implementation](https://thrift.apache.org/static/files/thrift-20070401.pdf). Facebook technical report, April 2007. -Archived at [perma.cc/22BS-TUFB](https://perma.cc/22BS-TUFB) - -[[14](/en/ch5#Kleppmann2012evolution-marker)] Martin Kleppmann. -[Schema -Evolution in Avro, Protocol Buffers and Thrift](https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html). 
*martin.kleppmann.com*, December 2012. -Archived at [perma.cc/E4R2-9RJT](https://perma.cc/E4R2-9RJT) - -[[15](/en/ch5#Cutting2009-marker)] Doug Cutting, Chad Walters, Jim Kellerman, et al. -[[PROPOSAL] -New Subproject: Avro](https://lists.apache.org/thread/z571w0r5jmfsjvnl0fq4fgg0vh28d3bk). Email thread on *hadoop-general* mailing list, -*lists.apache.org*, April 2009. -Archived at [perma.cc/4A79-BMEB](https://perma.cc/4A79-BMEB) - -[[16](/en/ch5#AvroSpec-marker)] Apache Software Foundation. -[Apache Avro 1.12.0 Specification](https://avro.apache.org/docs/1.12.0/specification/). -*avro.apache.org*, August 2024. -Archived at [perma.cc/C36P-5EBQ](https://perma.cc/C36P-5EBQ) - -[[17](/en/ch5#AvroParsing-marker)] Apache Software Foundation. -[Avro -schemas as LL(1) CFG definitions](https://avro.apache.org/docs/1.12.0/api/java/org/apache/avro/io/parsing/doc-files/parsing.html). *avro.apache.org*, August 2024. -Archived at [perma.cc/JB44-EM9Q](https://perma.cc/JB44-EM9Q) - -[[18](/en/ch5#Hoare2009-marker)] Tony Hoare. -[Null -References: The Billion Dollar Mistake](https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/). Talk at *QCon London*, March 2009. - -[[19](/en/ch5#ConfluentSchemaReg-marker)] Confluent, Inc. -[Schema Registry -Overview](https://docs.confluent.io/platform/current/schema-registry/index.html). *docs.confluent.io*, 2024. -Archived at [perma.cc/92C3-A9JA](https://perma.cc/92C3-A9JA) - -[[20](/en/ch5#Auradkar2015-marker)] Aditya Auradkar and Tom Quiggle. -[Introducing -Espresso—LinkedIn’s Hot New Distributed Document Store](https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store). *engineering.linkedin.com*, January 2015. -Archived at [perma.cc/FX4P-VW9T](https://perma.cc/FX4P-VW9T) - -[[21](/en/ch5#Kreps2015-marker)] Jay Kreps. 
-[Putting Apache Kafka to -Use: A Practical Guide to Building a Stream Data Platform (Part 2)](https://www.confluent.io/blog/event-streaming-platform-2/). *confluent.io*, -February 2015. Archived at [perma.cc/8UA4-ZS5S](https://perma.cc/8UA4-ZS5S) - -[[22](/en/ch5#Shapira2014-marker)] Gwen Shapira. -[The Problem of Managing -Schemas](https://www.oreilly.com/content/the-problem-of-managing-schemas/). *oreilly.com*, November 2014. -Archived at [perma.cc/BY8Q-RYV3](https://perma.cc/BY8Q-RYV3) - -[[23](/en/ch5#Larmouth1999-marker)] John Larmouth. -[*ASN.1 -Complete*](https://www.oss.com/asn1/resources/books-whitepapers-pubs/larmouth-asn1-book.pdf). Morgan Kaufmann, 1999. ISBN: 978-0-122-33435-1. -Archived at [perma.cc/GB7Y-XSXQ](https://perma.cc/GB7Y-XSXQ) - -[[24](/en/ch5#Kaliski1993-marker)] Burton S. Kaliski Jr. -[A Layman’s Guide to a Subset of ASN.1, -BER, and DER](https://luca.ntop.org/Teaching/Appunti/asn1.html). Technical Note, RSA Data Security, Inc., November 1993. -Archived at [perma.cc/2LMN-W9U8](https://perma.cc/2LMN-W9U8) - -[[25](/en/ch5#HoffmanAndrews2020-marker)] Jacob Hoffman-Andrews. -[A Warm Welcome to ASN.1 and DER](https://letsencrypt.org/docs/a-warm-welcome-to-asn1-and-der/). -*letsencrypt.org*, April 2020. -Archived at [perma.cc/CYT2-GPQ8](https://perma.cc/CYT2-GPQ8) - -[[26](/en/ch5#Walkin2010-marker)] Lev Walkin. -[Question: -Extensibility and Dropping Fields](https://lionet.info/asn1c/blog/2010/09/21/question-extensibility-removing-fields/). *lionet.info*, September 2010. -Archived at [perma.cc/VX8E-NLH3](https://perma.cc/VX8E-NLH3) - -[[27](/en/ch5#Xu2017-marker)] Jacqueline Xu. -[Online migrations at scale](https://stripe.com/blog/online-migrations). -*stripe.com*, February 2017. -Archived at [perma.cc/X59W-DK7Y](https://perma.cc/X59W-DK7Y) - -[[28](/en/ch5#Litt2020-marker)] Geoffrey Litt, Peter van Hardenberg, and Orion Henry. -[Project Cambria: Translate your data with lenses](https://www.inkandswitch.com/cambria/). 
-Technical Report, *Ink & Switch*, October 2020. -Archived at [perma.cc/WA4V-VKDB](https://perma.cc/WA4V-VKDB) - -[[29](/en/ch5#Helland2005_ch5-marker)] Pat Helland. -[Data on the Outside Versus Data on the -Inside](https://www.cidrdb.org/cidr2005/papers/P12.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), -January 2005. - -[[30](/en/ch5#Fielding2000-marker)] Roy Thomas Fielding. -[Architectural -Styles and the Design of Network-Based Software Architectures](https://ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf). PhD Thesis, University of -California, Irvine, 2000. Archived at [perma.cc/LWY9-7BPE](https://perma.cc/LWY9-7BPE) - -[[31](/en/ch5#Fielding2008-marker)] Roy Thomas Fielding. -[REST APIs must -be hypertext-driven](https://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven).” *roy.gbiv.com*, October 2008. -Archived at [perma.cc/M2ZW-8ATG](https://perma.cc/M2ZW-8ATG) - -[[32](/en/ch5#Swagger2014-marker)] [OpenAPI -Specification Version 3.1.0](https://swagger.io/specification/). *swagger.io*, February 2021. -Archived at [perma.cc/3S6S-K5M4](https://perma.cc/3S6S-K5M4) - -[[33](/en/ch5#Henning2006-marker)] Michi Henning. -[The Rise and Fall of CORBA](https://cacm.acm.org/practice/the-rise-and-fall-of-corba/). -*Communications of the ACM*, volume 51, issue 8, pages 52–57, August 2008. -[doi:10.1145/1378704.1378718](https://doi.org/10.1145/1378704.1378718) - -[[34](/en/ch5#Lacey2006-marker)] Pete Lacey. -[The S Stands for Simple](https://harmful.cat-v.org/software/xml/soap/simple). -*harmful.cat-v.org*, November 2006. -Archived at [perma.cc/4PMK-Z9X7](https://perma.cc/4PMK-Z9X7) - -[[35](/en/ch5#Tilkov2006-marker)] Stefan Tilkov. -[Interview: Pete Lacey Criticizes -Web Services](https://www.infoq.com/articles/pete-lacey-ws-criticism/). *infoq.com*, December 2006. -Archived at [perma.cc/JWF4-XY3P](https://perma.cc/JWF4-XY3P) - -[[36](/en/ch5#Bray2004-marker)] Tim Bray. 
-[The Loyal WS-Opposition](https://www.tbray.org/ongoing/When/200x/2004/09/18/WS-Oppo). -*tbray.org*, September 2004. -Archived at [perma.cc/J5Q8-69Q2](https://perma.cc/J5Q8-69Q2) - -[[37](/en/ch5#Birrell1984-marker)] Andrew D. Birrell and Bruce Jay Nelson. -[Implementing -Remote Procedure Calls](https://www.cs.princeton.edu/courses/archive/fall03/cs518/papers/rpc.pdf). *ACM Transactions on Computer Systems* (TOCS), -volume 2, issue 1, pages 39–59, February 1984. -[doi:10.1145/2080.357392](https://doi.org/10.1145/2080.357392) - -[[38](/en/ch5#Waldo1994-marker)] Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall. -[A Note on Distributed Computing](https://m.mirror.facebook.net/kde/devel/smli_tr-94-29.pdf). -Sun Microsystems Laboratories, Inc., Technical Report TR-94-29, November 1994. -Archived at [perma.cc/8LRZ-BSZR](https://perma.cc/8LRZ-BSZR) - -[[39](/en/ch5#Vinoski2008-marker)] Steve Vinoski. -[Convenience over -Correctness](https://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf). *IEEE Internet Computing*, volume 12, issue 4, pages 89–92, July 2008. -[doi:10.1109/MIC.2008.75](https://doi.org/10.1109/MIC.2008.75) - -[[40](/en/ch5#Leach2017idemptence-marker)] Brandur Leach. -[Designing robust and predictable APIs with -idempotency](https://stripe.com/blog/idempotency). *stripe.com*, February 2017. -Archived at [perma.cc/JD22-XZQT](https://perma.cc/JD22-XZQT) - -[[41](/en/ch5#Rose2023-marker)] Sam Rose. -[Load Balancing](https://samwho.dev/load-balancing/). *samwho.dev*, April 2023. -Archived at [perma.cc/Q7BA-9AE2](https://perma.cc/Q7BA-9AE2) - -[[42](/en/ch5#Hunt2014wn-marker)] Troy Hunt. -[Your API versioning is -wrong, which is why I decided to do it 3 different wrong ways](https://www.troyhunt.com/your-api-versioning-is-wrong-which-is/). *troyhunt.com*, -February 2014. Archived at [perma.cc/9DSW-DGR5](https://perma.cc/9DSW-DGR5) - -[[43](/en/ch5#Leach2017versioning-marker)] Brandur Leach. 
-[APIs as infrastructure: future-proofing Stripe with -versioning](https://stripe.com/blog/api-versioning). *stripe.com*, August 2017. -Archived at [perma.cc/L63K-USFW](https://perma.cc/L63K-USFW) - -[[44](/en/ch5#BPEL2007-marker)] Alexandre Alves, Assaf Arkin, Sid Askary, et al. -[Web Services Business Process -Execution Language Version 2.0](https://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html). *docs.oasis-open.org*, April 2007. - -[[45](/en/ch5#TemporalService-marker)] [What -is a Temporal Service?](https://docs.temporal.io/clusters) *docs.temporal.io*, 2024. -Archived at [perma.cc/32P3-CJ9V](https://perma.cc/32P3-CJ9V) - -[[46](/en/ch5#Ewen2023-marker)] Stephan Ewen. -[Why we built Restate](https://restate.dev/blog/why-we-built-restate/). *restate.dev*, -August 2023. Archived at [perma.cc/BJJ2-X75K](https://perma.cc/BJJ2-X75K) - -[[47](/en/ch5#Tenzer2024-marker)] Keith Tenzer and Joshua Smith. -[Idempotency and Durable -Execution](https://temporal.io/blog/idempotency-and-durable-execution). *temporal.io*, February 2024. -Archived at [perma.cc/9LGW-PCLU](https://perma.cc/9LGW-PCLU) - -[[48](/en/ch5#TemporalWorkflow-marker)] [What -is a Temporal Workflow?](https://docs.temporal.io/workflows) *docs.temporal.io*, 2024. -Archived at [perma.cc/B5C5-Y396](https://perma.cc/B5C5-Y396) - -[[49](/en/ch5#Kleeman2024-marker)] Jack Kleeman. -[Solving durable -execution’s immutability problem](https://restate.dev/blog/solving-durable-executions-immutability-problem/). *restate.dev*, February 2024. -Archived at [perma.cc/G55L-EYH5](https://perma.cc/G55L-EYH5) - -[[50](/en/ch5#Perera2023-marker)] Srinath Perera. -[Exploring -Event-Driven Architecture: A Beginner’s Guide for Cloud Native Developers](https://wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/). *wso2.com*, -August 2023. 
Archived at -[archive.org](https://web.archive.org/web/20240716204613/https%3A//wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/) - -[[51](/en/ch5#Bernstein2014-marker)] Philip A. Bernstein, Sergey Bykov, Alan -Geller, Gabriel Kliot, and Jorgen Thelin. -[Orleans: -Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical -Report MSR-TR-2014-41, March 2014. -Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF) +[^1]: [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*, July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y) +[^2]: Steve Breen. [What Do WebLogic, WebSphere, JBoss, Jenkins, OpenNMS, and Your Application Have in Common? This Vulnerability](https://foxglovesecurity.com/2015/11/06/what-do-weblogic-websphere-jboss-jenkins-opennms-and-your-application-have-in-common-this-vulnerability/). *foxglovesecurity.com*, November 2015. Archived at [perma.cc/9U97-UVVD](https://perma.cc/9U97-UVVD) +[^3]: Patrick McKenzie. [What the Rails Security Issue Means for Your Startup](https://www.kalzumeus.com/2013/01/31/what-the-rails-security-issue-means-for-your-startup/). *kalzumeus.com*, January 2013. Archived at [perma.cc/2MBJ-7PZ6](https://perma.cc/2MBJ-7PZ6) +[^4]: Brian Goetz. [Towards Better Serialization](https://openjdk.org/projects/amber/design-notes/towards-better-serialization). *openjdk.org*, June 2019. Archived at [perma.cc/UK6U-GQDE](https://perma.cc/UK6U-GQDE) +[^5]: Eishay Smith. [jvm-serializers wiki](https://github.com/eishay/jvm-serializers/wiki). *github.com*, October 2023. 
Archived at [perma.cc/PJP7-WCNG](https://perma.cc/PJP7-WCNG) +[^6]: [XML Is a Poor Copy of S-Expressions](https://wiki.c2.com/?XmlIsaPoorCopyOfEssExpressions). *wiki.c2.com*, May 2013. Archived at [perma.cc/7FAN-YBKL](https://perma.cc/7FAN-YBKL) +[^7]: Julia Evans. [Examples of floating point problems](https://jvns.ca/blog/2023/01/13/examples-of-floating-point-problems/). *jvns.ca*, January 2023. Archived at [perma.cc/M57L-QKKW](https://perma.cc/M57L-QKKW) +[^8]: Matt Harris. [Snowflake: An Update and Some Very Important Information](https://groups.google.com/g/twitter-development-talk/c/ahbvo3VTIYI). Email to *Twitter Development Talk* mailing list, October 2010. Archived at [perma.cc/8UBV-MZ3D](https://perma.cc/8UBV-MZ3D) +[^9]: Yakov Shafranovich. [RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files](https://tools.ietf.org/html/rfc4180). IETF, October 2005. +[^10]: Andy Coates. [Evolving JSON Schemas - Part I](https://www.creekservice.org/articles/2024/01/08/json-schema-evolution-part-1.html) and [Part II](https://www.creekservice.org/articles/2024/01/09/json-schema-evolution-part-2.html). *creekservice.org*, January 2024. Archived at [perma.cc/MZW3-UA54](https://perma.cc/MZW3-UA54) and [perma.cc/GT5H-WKZ5](https://perma.cc/GT5H-WKZ5) +[^11]: Pierre Genevès, Nabil Layaïda, and Vincent Quint. [Ensuring Query Compatibility with Evolving XML Schemas](https://arxiv.org/abs/0811.4324). INRIA Technical Report 6711, November 2008. +[^12]: Tim Bray. [Bits On the Wire](https://www.tbray.org/ongoing/When/201x/2019/11/17/Bits-On-the-Wire). *tbray.org*, November 2019. Archived at [perma.cc/3BT3-BQU3](https://perma.cc/3BT3-BQU3) +[^13]: Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. [Thrift: Scalable Cross-Language Services Implementation](https://thrift.apache.org/static/files/thrift-20070401.pdf). Facebook technical report, April 2007. Archived at [perma.cc/22BS-TUFB](https://perma.cc/22BS-TUFB) +[^14]: Martin Kleppmann. 
[Schema Evolution in Avro, Protocol Buffers and Thrift](https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html). *martin.kleppmann.com*, December 2012. Archived at [perma.cc/E4R2-9RJT](https://perma.cc/E4R2-9RJT) +[^15]: Doug Cutting, Chad Walters, Jim Kellerman, et al. [[PROPOSAL] New Subproject: Avro](https://lists.apache.org/thread/z571w0r5jmfsjvnl0fq4fgg0vh28d3bk). Email thread on *hadoop-general* mailing list, *lists.apache.org*, April 2009. Archived at [perma.cc/4A79-BMEB](https://perma.cc/4A79-BMEB) +[^16]: Apache Software Foundation. [Apache Avro 1.12.0 Specification](https://avro.apache.org/docs/1.12.0/specification/). *avro.apache.org*, August 2024. Archived at [perma.cc/C36P-5EBQ](https://perma.cc/C36P-5EBQ) +[^17]: Apache Software Foundation. [Avro schemas as LL(1) CFG definitions](https://avro.apache.org/docs/1.12.0/api/java/org/apache/avro/io/parsing/doc-files/parsing.html). *avro.apache.org*, August 2024. Archived at [perma.cc/JB44-EM9Q](https://perma.cc/JB44-EM9Q) +[^18]: Tony Hoare. [Null References: The Billion Dollar Mistake](https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/). Talk at *QCon London*, March 2009. +[^19]: Confluent, Inc. [Schema Registry Overview](https://docs.confluent.io/platform/current/schema-registry/index.html). *docs.confluent.io*, 2024. Archived at [perma.cc/92C3-A9JA](https://perma.cc/92C3-A9JA) +[^20]: Aditya Auradkar and Tom Quiggle. [Introducing Espresso—LinkedIn’s Hot New Distributed Document Store](https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store). *engineering.linkedin.com*, January 2015. Archived at [perma.cc/FX4P-VW9T](https://perma.cc/FX4P-VW9T) +[^21]: Jay Kreps. [Putting Apache Kafka to Use: A Practical Guide to Building a Stream Data Platform (Part 2)](https://www.confluent.io/blog/event-streaming-platform-2/). *confluent.io*, February 2015. 
Archived at [perma.cc/8UA4-ZS5S](https://perma.cc/8UA4-ZS5S) +[^22]: Gwen Shapira. [The Problem of Managing Schemas](https://www.oreilly.com/content/the-problem-of-managing-schemas/). *oreilly.com*, November 2014. Archived at [perma.cc/BY8Q-RYV3](https://perma.cc/BY8Q-RYV3) +[^23]: John Larmouth. [*ASN.1 Complete*](https://www.oss.com/asn1/resources/books-whitepapers-pubs/larmouth-asn1-book.pdf). Morgan Kaufmann, 1999. ISBN: 978-0-122-33435-1. Archived at [perma.cc/GB7Y-XSXQ](https://perma.cc/GB7Y-XSXQ) +[^24]: Burton S. Kaliski Jr. [A Layman’s Guide to a Subset of ASN.1, BER, and DER](https://luca.ntop.org/Teaching/Appunti/asn1.html). Technical Note, RSA Data Security, Inc., November 1993. Archived at [perma.cc/2LMN-W9U8](https://perma.cc/2LMN-W9U8) +[^25]: Jacob Hoffman-Andrews. [A Warm Welcome to ASN.1 and DER](https://letsencrypt.org/docs/a-warm-welcome-to-asn1-and-der/). *letsencrypt.org*, April 2020. Archived at [perma.cc/CYT2-GPQ8](https://perma.cc/CYT2-GPQ8) +[^26]: Lev Walkin. [Question: Extensibility and Dropping Fields](https://lionet.info/asn1c/blog/2010/09/21/question-extensibility-removing-fields/). *lionet.info*, September 2010. Archived at [perma.cc/VX8E-NLH3](https://perma.cc/VX8E-NLH3) +[^27]: Jacqueline Xu. [Online migrations at scale](https://stripe.com/blog/online-migrations). *stripe.com*, February 2017. Archived at [perma.cc/X59W-DK7Y](https://perma.cc/X59W-DK7Y) +[^28]: Geoffrey Litt, Peter van Hardenberg, and Orion Henry. [Project Cambria: Translate your data with lenses](https://www.inkandswitch.com/cambria/). Technical Report, *Ink & Switch*, October 2020. Archived at [perma.cc/WA4V-VKDB](https://perma.cc/WA4V-VKDB) +[^29]: Pat Helland. [Data on the Outside Versus Data on the Inside](https://www.cidrdb.org/cidr2005/papers/P12.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005. +[^30]: Roy Thomas Fielding. 
[Architectural Styles and the Design of Network-Based Software Architectures](https://ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf). PhD Thesis, University of California, Irvine, 2000. Archived at [perma.cc/LWY9-7BPE](https://perma.cc/LWY9-7BPE) +[^31]: Roy Thomas Fielding. [REST APIs must be hypertext-driven](https://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven). *roy.gbiv.com*, October 2008. Archived at [perma.cc/M2ZW-8ATG](https://perma.cc/M2ZW-8ATG) +[^32]: [OpenAPI Specification Version 3.1.0](https://swagger.io/specification/). *swagger.io*, February 2021. Archived at [perma.cc/3S6S-K5M4](https://perma.cc/3S6S-K5M4) +[^33]: Michi Henning. [The Rise and Fall of CORBA](https://cacm.acm.org/practice/the-rise-and-fall-of-corba/). *Communications of the ACM*, volume 51, issue 8, pages 52–57, August 2008. [doi:10.1145/1378704.1378718](https://doi.org/10.1145/1378704.1378718) +[^34]: Pete Lacey. [The S Stands for Simple](https://harmful.cat-v.org/software/xml/soap/simple). *harmful.cat-v.org*, November 2006. Archived at [perma.cc/4PMK-Z9X7](https://perma.cc/4PMK-Z9X7) +[^35]: Stefan Tilkov. [Interview: Pete Lacey Criticizes Web Services](https://www.infoq.com/articles/pete-lacey-ws-criticism/). *infoq.com*, December 2006. Archived at [perma.cc/JWF4-XY3P](https://perma.cc/JWF4-XY3P) +[^36]: Tim Bray. [The Loyal WS-Opposition](https://www.tbray.org/ongoing/When/200x/2004/09/18/WS-Oppo). *tbray.org*, September 2004. Archived at [perma.cc/J5Q8-69Q2](https://perma.cc/J5Q8-69Q2) +[^37]: Andrew D. Birrell and Bruce Jay Nelson. [Implementing Remote Procedure Calls](https://www.cs.princeton.edu/courses/archive/fall03/cs518/papers/rpc.pdf). *ACM Transactions on Computer Systems* (TOCS), volume 2, issue 1, pages 39–59, February 1984. [doi:10.1145/2080.357392](https://doi.org/10.1145/2080.357392) +[^38]: Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall.
[A Note on Distributed Computing](https://m.mirror.facebook.net/kde/devel/smli_tr-94-29.pdf). Sun Microsystems Laboratories, Inc., Technical Report TR-94-29, November 1994. Archived at [perma.cc/8LRZ-BSZR](https://perma.cc/8LRZ-BSZR) +[^39]: Steve Vinoski. [Convenience over Correctness](https://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf). *IEEE Internet Computing*, volume 12, issue 4, pages 89–92, July 2008. [doi:10.1109/MIC.2008.75](https://doi.org/10.1109/MIC.2008.75) +[^40]: Brandur Leach. [Designing robust and predictable APIs with idempotency](https://stripe.com/blog/idempotency). *stripe.com*, February 2017. Archived at [perma.cc/JD22-XZQT](https://perma.cc/JD22-XZQT) +[^41]: Sam Rose. [Load Balancing](https://samwho.dev/load-balancing/). *samwho.dev*, April 2023. Archived at [perma.cc/Q7BA-9AE2](https://perma.cc/Q7BA-9AE2) +[^42]: Troy Hunt. [Your API versioning is wrong, which is why I decided to do it 3 different wrong ways](https://www.troyhunt.com/your-api-versioning-is-wrong-which-is/). *troyhunt.com*, February 2014. Archived at [perma.cc/9DSW-DGR5](https://perma.cc/9DSW-DGR5) +[^43]: Brandur Leach. [APIs as infrastructure: future-proofing Stripe with versioning](https://stripe.com/blog/api-versioning). *stripe.com*, August 2017. Archived at [perma.cc/L63K-USFW](https://perma.cc/L63K-USFW) +[^44]: Alexandre Alves, Assaf Arkin, Sid Askary, et al. [Web Services Business Process Execution Language Version 2.0](https://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html). *docs.oasis-open.org*, April 2007. +[^45]: [What is a Temporal Service?](https://docs.temporal.io/clusters) *docs.temporal.io*, 2024. Archived at [perma.cc/32P3-CJ9V](https://perma.cc/32P3-CJ9V) +[^46]: Stephan Ewen. [Why we built Restate](https://restate.dev/blog/why-we-built-restate/). *restate.dev*, August 2023. Archived at [perma.cc/BJJ2-X75K](https://perma.cc/BJJ2-X75K) +[^47]: Keith Tenzer and Joshua Smith. 
[Idempotency and Durable Execution](https://temporal.io/blog/idempotency-and-durable-execution). *temporal.io*, February 2024. Archived at [perma.cc/9LGW-PCLU](https://perma.cc/9LGW-PCLU) +[^48]: [What is a Temporal Workflow?](https://docs.temporal.io/workflows) *docs.temporal.io*, 2024. Archived at [perma.cc/B5C5-Y396](https://perma.cc/B5C5-Y396) +[^49]: Jack Kleeman. [Solving durable execution’s immutability problem](https://restate.dev/blog/solving-durable-executions-immutability-problem/). *restate.dev*, February 2024. Archived at [perma.cc/G55L-EYH5](https://perma.cc/G55L-EYH5) +[^50]: Srinath Perera. [Exploring Event-Driven Architecture: A Beginner’s Guide for Cloud Native Developers](https://wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/). *wso2.com*, August 2023. Archived at [archive.org](https://web.archive.org/web/20240716204613/https%3A//wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/) +[^51]: Philip A. Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. [Orleans: Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical Report MSR-TR-2014-41, March 2014. Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF) \ No newline at end of file diff --git a/content/en/ch6.md b/content/en/ch6.md index bf1a2f0..0eff31d 100644 --- a/content/en/ch6.md +++ b/content/en/ch6.md @@ -15,10 +15,8 @@ network. 
As discussed in [“Distributed versus Single-Node Systems”](https:// why you might want to replicate data: * To keep data geographically close to your users (and thus reduce access latency) -* To allow the system to continue working even if some of its parts have failed (and thus - increase availability) -* To scale out the number of machines that can serve read queries (and thus increase read - throughput) +* To allow the system to continue working even if some of its parts have failed (and thus increase availability) +* To scale out the number of machines that can serve read queries (and thus increase read throughput) In this chapter we will assume that your dataset is small enough that each machine can hold a copy of the entire dataset. In [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding) we will relax that assumption and discuss *sharding* @@ -39,7 +37,7 @@ many different implementations. We will discuss the consequences of such choices Replication of databases is an old topic—the principles haven’t changed much since they were studied in the 1970s -[[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lindsay1979_ch6)], +[^1], because the fundamental constraints of networks have remained the same. Despite being so old, concepts such as *eventual consistency* still cause confusion. In [“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag) we will get more precise about eventual consistency and discuss things like the *read-your-writes* and @@ -74,7 +72,7 @@ longer contain the same data. The most common solution is called *leader-based r [Figure 6-1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_leader_follower)): 1. 
One of the replicas is designated the *leader* (also known as *primary* or *source* - [[2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gryp2020)]). + [^2]). When clients want to write to the database, they must send their requests to the leader, which first writes the new data to its local storage. 2. The other replicas are known as *followers* (*read replicas*, *secondaries*, or *hot standbys*). @@ -97,15 +95,15 @@ multiple leaders for the same shard at the same time. Single-leader replication is very widely used. It’s a built-in feature of many relational databases, such as PostgreSQL, MySQL, Oracle Data Guard -[[3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Oracle2019)], +[^3], and SQL Server’s Always On Availability Groups -[[4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#AlwaysOn2012)]. +[^4]. It is also used in some document databases such as MongoDB and DynamoDB -[[5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Elhemali2022_ch6)], +[^5], message brokers such as Kafka, replicated block devices such as DRBD, and some network filesystems. Many consensus algorithms such as Raft, which is used for replication in CockroachDB -[[6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Taft2020_ch6)], -TiDB [[7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Huang2020_ch6)], +[^6], +TiDB [^7], etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and automatically elect a new leader if the old one fails (we will discuss consensus in more detail in [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency)). 
@@ -114,7 +112,7 @@ automatically elect a new leader if the old one fails (we will discuss consensus In older documents you may see the term *master–slave replication*. It means the same as leader-based replication, but the term should be avoided as it is widely considered offensive -[[8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Knodel2023)]. +[^8]. ## Synchronous Versus Asynchronous Replication @@ -174,7 +172,7 @@ processing writes, even if all of its followers have fallen behind. Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless widely used, especially if there are many followers or if they are geographically distributed -[[9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2018)]. +[^9]. We will return to this issue in [“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag). ## Setting Up New Followers @@ -250,7 +248,7 @@ architecture that places less frequently accessed data on object storage while n accessed data is kept on faster storage devices such as SSDs, NVMe, or even in memory. Other systems use object storage as their primary storage tier, but use a separate low-latency storage system such as Amazon’s EBS or Neon’s Safekeepers -[[12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kelvich2022)]) +[^12] to store their WAL. Recently, some systems have gone even farther by adopting a *zero-disk architecture* (ZDA). ZDA-based systems persist all data to object storage and use disks and memory strictly for caching. This allows nodes to have no persistent state, which dramatically @@ -312,7 +310,7 @@ consists of the following steps: 2.
*Choosing a new leader.* This could be done through an election process (where the leader is chosen by a majority of the remaining replicas), or a new leader could be appointed by a previously established *controller node* - [[13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Fontaine2021)]. + [^13]. The best candidate for leadership is usually the replica with the most up-to-date data changes from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader is a consensus problem, discussed in detail in [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency). @@ -333,7 +331,7 @@ Failover is fraught with things that can go wrong: * Discarding writes is especially dangerous if other storage systems outside of the database need to be coordinated with the database contents. For example, in one incident at GitHub - [[14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Newland2012)], + [^14], an out-of-date MySQL follower was promoted to leader. The database used an autoincrementing counter to assign primary keys to new rows, but because the new leader’s counter lagged behind the old leader’s, it reused some @@ -346,7 +344,7 @@ Failover is fraught with things that can go wrong: [“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some systems have a mechanism to shut down one node if two leaders are detected. However, if this mechanism is not carefully designed, you can end up with both nodes being shut down - [[15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Imbriaco2012_ch6)]. + [^15]. 
Moreover, there is a risk that by the time the split brain is detected and the old node is shut down, it is already too late and data has already been corrupted. * What is the right timeout before the leader is declared dead? A longer timeout means a longer @@ -413,7 +411,7 @@ Statement-based replication was used in MySQL before version 5.1. It is still so as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if there is any nondeterminism in a statement. VoltDB uses statement-based replication, and makes it safe by requiring transactions to be deterministic -[[16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hugg2015)]. +[^16]. However, determinism can be hard to guarantee in practice, so many databases prefer other replication methods. @@ -464,17 +462,17 @@ indicating that the transaction was committed. MySQL keeps a separate logical re called the *binlog*, in addition to the WAL (when configured to use row-based replication). PostgreSQL implements logical replication by decoding the physical WAL into row insertion/update/delete events -[[19](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2023)]. +[^19]. Since a logical log is decoupled from the storage engine internals, it can more easily be kept backward compatible, allowing the leader and the follower to run different versions of the database software. This in turn enables upgrading to a new version with minimal downtime -[[20](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Petchimuthu2021)]. +[^20]. A logical log format is also easier for external applications to parse. 
This aspect is useful if you want to send the contents of a database to an external system, such as a data warehouse for offline analysis, or for building custom indexes and caches -[[21](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Sharma2015te_ch6)]. +[^21]. This technique is called *change data capture*, and we will return to it in [Link to Come]. # Problems with Replication Lag @@ -502,14 +500,14 @@ database: if you run the same query on the leader and a follower at the same tim different results, because not all writes have been reflected in the follower. This inconsistency is just a temporary state—if you stop writing to the database and wait a while, the followers will eventually catch up and become consistent with the leader. For that reason, this effect is known -as *eventual consistency* [[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry2011)]. +as *eventual consistency* [^22]. ###### Note The term *eventual consistency* was coined by Douglas Terry et al. -[[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry1994)], +[^23], popularized by Werner Vogels -[[24](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Vogels2008)], +[^24], and became the battle cry of many NoSQL projects. However, not only NoSQL databases are eventually consistent: followers in an asynchronously replicated relational database have the same characteristics. @@ -542,7 +540,7 @@ submitted was lost, so they will be understandably unhappy. ###### Figure 6-3. A user makes a write, followed by a read from a stale replica. To prevent this anomaly, we need read-after-write consistency. 
In this situation, we need *read-after-write consistency*, also known as *read-your-writes consistency* -[[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry1994)]. +[^23]. This is a guarantee that if the user reloads the page, they will always see any updates they submitted themselves. It makes no promises about other users: other users’ updates may not be visible until some later time. However, it reassures the user that their own input has been saved @@ -563,14 +561,14 @@ are various possible techniques. To mention a few: scaling). In that case, other criteria may be used to decide whether to read from the leader. For example, you could track the time of the last update and, for one minute after the last update, make all reads from the leader - [[25](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Willison2022)]. + [^25]. You could also monitor the replication lag on followers and prevent queries on any follower that is more than one minute behind the leader. * The client can remember the timestamp of its most recent write—then the system can ensure that the replica serving any reads for that user reflects updates at least until that timestamp. If a replica is not sufficiently up to date, either the read can be handled by another replica or the query can wait until the replica has caught up - [[26](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Tharakan2020)]. + [^26]. The timestamp could be a *logical timestamp* (something that indicates ordering of writes, such as the log sequence number) or the actual system clock (in which case clock synchronization becomes critical; see [“Unreliable Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_clocks)). @@ -632,7 +630,7 @@ and then see it disappear again. 
###### Figure 6-4. A user first reads from a fresh replica, then from a stale replica. Time appears to go backward. To prevent this anomaly, we need monotonic reads. -*Monotonic reads* [[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry2011)] is a guarantee that this +*Monotonic reads* [^22] is a guarantee that this kind of anomaly does not happen. It’s a lesser guarantee than strong consistency, but a stronger guarantee than eventual consistency. When you read data, you may see an old value; monotonic reads only means that if one user makes several reads in sequence, they will not see time go @@ -669,14 +667,14 @@ Mr. Poons To the observer it looks as though Mrs. Cake is answering the question before Mr. Poons has even asked it. Such psychic powers are impressive, but very confusing -[[27](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Pratchett1991)]. +[^27]. ![ddia 0605](/fig/ddia_0605.png) ###### Figure 6-5. If some shards are replicated slower than others, an observer may see the answer before they see the question. Preventing this kind of anomaly requires another type of guarantee: *consistent prefix reads* -[[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry2011)]. This guarantee says that if a sequence of +[^22]. This guarantee says that if a sequence of writes happens in a certain order, then anyone reading those writes will see them appear in the same order. @@ -811,7 +809,7 @@ Consistency with another write on another leader. This is simply a fundamental limitation of distributed systems - [[28](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Bailis2014coord_ch6)]. + [^28]. If you need to enforce such constraints, you’re therefore better off with a single-leader system. 
However, as we will see in [“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts), multi-leader systems can still achieve consistency properties that are useful in a wide range of apps that don’t need such @@ -820,13 +818,13 @@ Consistency Multi-leader replication is less common than single-leader replication, but it is still supported by many databases, including MySQL, Oracle, SQL Server, and YugabyteDB. In some cases it is an external add-on feature, for example in Redis Enterprise, EDB Postgres Distributed, and pglogical -[[29](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Raja2022)]. +[^29]. As multi-leader replication is a somewhat retrofitted feature in many databases, there are often subtle configuration pitfalls and surprising interactions with other database features. For example, autoincrementing keys, triggers, and integrity constraints can be problematic. For this reason, multi-leader replication is often considered dangerous territory that should be avoided if possible -[[30](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2012)]. +[^30]. ### Multi-leader replication topologies @@ -857,7 +855,7 @@ In circular and star topologies, a write may need to pass through several nodes all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To prevent infinite replication loops, each node is given a unique identifier, and in the replication log, each write is tagged with the identifiers of all the nodes it has passed through -[[31](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#HBase7709)]. +[^31]. 
When a node receives a data change that is tagged with its own identifier, that data change is ignored, because the node knows that it has already been processed. @@ -949,13 +947,13 @@ existed for a long time, the term has recently gained attention [37](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Jayakar2024)]. An application that allows a user to continue editing a file while offline (which may be implemented using a sync engine) is called *offline-first* -[[38](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Feyerke2013)]. +[^38]. The term *local-first software* refers to collaborative apps that are not only offline-first, but are also designed to continue working even if the developer who made the software shuts down all of -their online services [[39](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kleppmann2019_ch6)]. +their online services [^39]. This can be achieved by using a sync engine with an open standard sync protocol for which multiple service providers are available -[[40](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kleppmann2024lofi)]. +[^40]. For example, Git is a local-first collaboration system (albeit one that doesn’t support real-time collaboration) since you can sync via GitHub, GitLab, or any other repository hosting service. @@ -979,11 +977,11 @@ approach has a number of advantages: [“The problems with remote procedure calls (RPCs)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch05.html#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user interface needs to somehow reflect that error. 
A sync engine allows the app to perform reads and writes on local data, which almost never fails, leading to a more declarative programming style - [[41](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hofmeyr2024)]. + [^41]. * In order to display edits from other users in real-time, you need to receive notifications of those edits and efficiently update the user interface accordingly. A sync engine combined with a *reactive programming* model is a good way of implementing this - [[42](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#vanHardenberg2020)]. + [^42]. Sync engines work best when all the data that the user may need is downloaded in advance and stored persistently on the client. This means that the data is available for offline access when needed, @@ -993,7 +991,7 @@ of data. For example, downloading all the files that the user themselves created e-commerce website probably doesn’t make sense. The sync engine was pioneered by Lotus Notes in the 1980s -[[43](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kawell1988)] +[^43] (without using that term), and sync for specific apps such as calendars has also existed for a long time. Today there are a number of general-purpose sync engines, some of which use a proprietary backend service (e.g., Google Firestore, Realm, or Ditto), and some have an open source backend, @@ -1003,7 +1001,7 @@ Multiplayer video games have a similar need to respond immediately to the user reconcile them with other players’ actions received asynchronously over the network. In game development jargon the equivalent of a sync engine is called *netcode*. 
The techniques used in netcode are quite specific to the requirements of games -[[44](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Pusch2019)], and don’t directly +[^44], and don’t directly carry over to other types of software, so we won’t consider them further in this book. ## Dealing with Conflicting Writes @@ -1040,7 +1038,7 @@ One strategy for conflicts is to avoid them occurring in the first place. For ex application can ensure that all writes for a particular record go through the same leader, then conflicts cannot occur, even if the database as a whole is multi-leader. This approach is not possible in the case of a sync engine client being updated offline, but it is sometimes possible in -geo-replicated server systems [[30](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2012)]. +geo-replicated server systems [^30]. For example, in an application where a user can only edit their own data, you can ensure that requests from a particular user are always routed to the same region and use the leader in that @@ -1126,7 +1124,7 @@ suffers from a number of problems: union of the carts). This meant that if the customer had removed an item from their cart in one sibling, but another sibling still contained that old item, the removed item would unexpectedly reappear in the customer’s cart - [[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6)]. + [^45]. [Figure 6-10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear. 
* If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution @@ -1177,8 +1175,8 @@ then conflict resolution is inevitable, and automating it is often the best appr Two families of algorithms are commonly used to implement automatic conflict resolution: *Conflict-free replicated datatypes* (CRDTs) -[[46](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Shapiro2011)] and *Operational Transformation* (OT) -[[47](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Sun1998)]. +[^46] and *Operational Transformation* (OT) +[^47]. They have different design philosophies and performance characteristics, but both are able to perform automatic merges for all the aforementioned types of data. @@ -1214,12 +1212,12 @@ There are many algorithms based on variations of these ideas. Lists/arrays can b similarly, using list elements instead of characters, and other datatypes such as key-value maps can be added quite easily. There are some performance and functionality trade-offs between OT and CRDTs, but it’s possible to combine the advantages of CRDTs and OT in one algorithm -[[48](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gentle2025)]. +[^48]. OT is most often used for real-time collaborative editing of text, e.g. in Google Docs -[[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DayRichter2010)], whereas CRDTs can be found in +[^32], whereas CRDTs can be found in distributed databases such as Redis Enterprise, Riak, and Azure Cosmos DB -[[49](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Shukla2018)]. +[^49]. Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge or Yjs) and with OT (e.g., ShareDB). 
@@ -1256,17 +1254,17 @@ systems were leaderless [[1](https://learning.oreilly.com/library/view/designing [50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gifford1979)], but the idea was mostly forgotten during the era of dominance of relational databases. It once again became a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in -2007 [[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6)]. +2007 [^45]. Riak, Cassandra, and ScyllaDB are open source datastores with leaderless replication models inspired by Dynamo, so this kind of database is also known as *Dynamo-style*. ###### Note The original *Dynamo* system was only described in a paper -[[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6)], but never released outside of +[^45], but never released outside of Amazon. The similarly-named *DynamoDB* is a more recent cloud database from AWS, but it has a completely different architecture: it uses single-leader replication based on the Multi-Paxos -consensus algorithm [[5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Elhemali2022_ch6)]. +consensus algorithm [^5]. In some leaderless implementations, the client directly sends its writes to several replicas, while in others, a coordinator node does this on behalf of the client. However, unlike a leader database, @@ -1348,7 +1346,7 @@ considered successful, and we must query at least *r* nodes for each read. (In o *n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* > *n*, we expect to get an up-to-date value when reading, because at least one of the *r* nodes we’re reading from must be up to date. 
Reads and writes that obey these *r* and *w* values are called -*quorum* reads and writes [[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gifford1979)]. +*quorum* reads and writes [^50]. You can think of *r* and *w* as the minimum number of votes required for the read or write to be valid. @@ -1402,7 +1400,7 @@ Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, becau not necessarily majorities—it only matters that the sets of nodes used by the read and write operations overlap in at least one node. Other quorum assignments are possible, which allows some flexibility in the design of distributed algorithms -[[51](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Howard2016_ch6)]. +[^51]. You may also set *w* and *r* to smaller numbers, so that *w* + *r* ≤ *n* (i.e., the quorum condition is not satisfied). In this case, reads and writes will still be sent to *n* @@ -1432,7 +1430,7 @@ properties can be confusing. Some scenarios include: nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the replicas where it succeeded. This means that if a write was reported as failed, subsequent reads may or may not return the value from that write - [[52](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Blomstedt2012ricon)]. + [^52]. * If the database uses timestamps from a real-time clock to determine which write is newer (as Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lww). @@ -1445,7 +1443,7 @@ properties can be confusing. 
Some scenarios include: Thus, although quorums appear to guarantee that a read returns the latest written value, in practice it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate eventual consistency. The parameters *w* and *r* allow you to adjust the probability of stale values -being read [[53](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Bailis2014pbs)], +being read [^53], but it’s wise to not take them as absolute guarantees. ### Monitoring staleness @@ -1464,7 +1462,7 @@ current position, you can measure the amount of replication lag. However, in systems with leaderless replication, there is no fixed order in which writes are applied, which makes monitoring more difficult. The number of hints that a replica stores for handoff can be one measure of system health, but it’s difficult to interpret usefully -[[54](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Breck2019)]. +[^54]. Eventual consistency is a deliberately vague guarantee, but for operability it’s important to be able to quantify “eventual.” @@ -1493,13 +1491,13 @@ Because there is no failover, and requests go to multiple replicas in parallel a becoming slow or unavailable has very little impact on response times: the client simply uses the responses from the other replicas that are faster to respond. Using the fastest responses is called *request hedging*, and it can significantly reduce tail latency -[[55](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Dean2013_ch6)]). +[^55]). At its core, the resilience of a leaderless system comes from the fact that it doesn’t distinguish between the normal case and the failure case. 
This is especially helpful when handling so-called *gray failures*, in which a node isn’t completely down, but running in a degraded state where it is unusually slow to handle requests -[[56](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Huang2017_ch6)], +[^56], or when a node is simply overloaded (for example, if a node has been offline for a while, recovery via hinted handoff can cause a lot of additional load). A leader-based system has to decide whether the situation is bad enough to warrant a failover (which can itself cause further disruption), @@ -1511,7 +1509,7 @@ That said, leaderless systems can have performance problems as well: another replica is unavailable so that it can store hints about writes that the unavailable replica missed. When the unavailable replica comes back, the handoff process needs to send it those hints. This puts additional load on the replicas at a time when the system is already under - strain [[54](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Breck2019)]. + strain [^54]. * The more replicas you have, the bigger the size of your quorums, and the more responses you have to wait for before a request can complete. Even if you wait only for the fastest *r* or *w* replicas to respond, and even if you make the requests in parallel, a bigger *r* or *w* increases @@ -1521,7 +1519,7 @@ That said, leaderless systems can have performance problems as well: make it impossible to form a quorum. Some leaderless databases offer a configuration option that allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that key (Riak and Dynamo call this a *sloppy quorum* - [[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6)]; + [^45]; Cassandra and ScyllaDB call it *consistency level ANY*). 
There is no guarantee that subsequent reads will see the written value, but depending on the application it may still be better than having the write fail. @@ -1603,7 +1601,7 @@ An operation A *happens before* another operation B if B knows about A, or depen upon A in some way. Whether one operation happens before another operation is the key to defining what concurrency means. In fact, we can simply say that two operations are *concurrent* if neither happens before the other (i.e., neither knows about the other) -[[57](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lamport1978_ch6)]. +[^57]. Thus, whenever you have two operations A and B, there are three possibilities: either A happened before B, or B happened before A, or A and B are concurrent. What we need is an algorithm to tell us @@ -1621,7 +1619,7 @@ at exactly the same time—an issue we will discuss in more detail in [Chapter  For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if they are both unaware of each other, regardless of the physical time at which they occurred. People sometimes make a connection between this principle and the special theory of relativity in physics -[[57](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lamport1978_ch6)], which introduced the idea that +[^57], which introduced the idea that information cannot travel faster than the speed of light. Consequently, two events that occur some distance apart cannot possibly affect each other if the time between the events is shorter than the time it takes light to travel the distance between them. @@ -1719,7 +1717,7 @@ version numbers it has seen from each of the other replicas. This information in to overwrite and which values to keep as siblings. 
The collection of version numbers from all the replicas is called a *version vector* -[[58](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#ParkerJr1983)]. +[^58]. A few variants of this idea are in use, but the most interesting is probably the *dotted version vector* [[59](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Preguica2010), @@ -1827,350 +1825,71 @@ machine to store only a subset of the data. ##### Footnotes + ##### References -[[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lindsay1979_ch6-marker)] B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. -Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. -[Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). -IBM Research, Research Report RJ2571(33471), July 1979. -Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD) -[[2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gryp2020-marker)] Kenny Gryp. -[MySQL Terminology -Updates](https://dev.mysql.com/blog-archive/mysql-terminology-updates/). *dev.mysql.com*, July 2020. -Archived at [perma.cc/S62G-6RJ2](https://perma.cc/S62G-6RJ2) - -[[3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Oracle2019-marker)] Oracle Corporation. -[Oracle -(Active) Data Guard 19c: Real-Time Data Protection and Availability](https://www.oracle.com/technetwork/database/availability/dg-adg-technical-overview-wp-5347548.pdf). White Paper, *oracle.com*, March 2019. -Archived at [perma.cc/P5ST-RPKE](https://perma.cc/P5ST-RPKE) - -[[4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#AlwaysOn2012-marker)] Microsoft. 
-[What -is an Always On availability group?](https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server) *learn.microsoft.com*, September 2024. -Archived at [perma.cc/ABH6-3MXF](https://perma.cc/ABH6-3MXF) - -[[5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Elhemali2022_ch6-marker)] Mostafa Elhemali, Niall Gallagher, Nicholas -Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu -Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, -Doug Terry, and Akshat Vig. -[Amazon DynamoDB: A Scalable, -Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical -Conference* (ATC), July 2022. - -[[6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Taft2020_ch6-marker)] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan -VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul -Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. -[CockroachDB: The Resilient -Geo-Distributed SQL Database](https://dl.acm.org/doi/abs/10.1145/3318464.3386134). At *ACM SIGMOD International Conference on Management of -Data* (SIGMOD), pages 1493–1509, June 2020. -[doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134) - -[[7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Huang2020_ch6-marker)] Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, -Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, -Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan -Pei, and Xin Tang. 
-[TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf). -*Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084. -[doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535) - -[[8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Knodel2023-marker)] Mallory Knodel and Niels ten Oever. -[Terminology, Power, and -Inclusive Language in Internet-Drafts and RFCs](https://www.ietf.org/archive/id/draft-knodel-terminology-14.html). *IETF Internet-Draft*, August 2023. -Archived at [perma.cc/5ZY9-725E](https://perma.cc/5ZY9-725E) - -[[9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2018-marker)] Buck Hodges. -[Postmortem: VSTS 4 September 2018](https://devblogs.microsoft.com/devopsservice/?p=17485). -*devblogs.microsoft.com*, September 2018. -Archived at [perma.cc/ZF5R-DYZS](https://perma.cc/ZF5R-DYZS) - -[[10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Morling2024_ch6-marker)] Gunnar Morling. -[Leader -Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024. -Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y) - -[[11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Chandramohan2024-marker)] Vignesh Chandramohan, Rohan Desai, and Chris Riccomini. -[SlateDB Manifest -Design](https://github.com/slatedb/slatedb/blob/main/rfcs/0001-manifest.md). *github.com*, May 2024. -Archived at [perma.cc/8EUY-P32Z](https://perma.cc/8EUY-P32Z) - -[[12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kelvich2022-marker)] Stas Kelvich. 
-[Why does Neon use Paxos instead of Raft, and what’s the -difference?](https://neon.tech/blog/paxos) *neon.tech*, August 2022. -Archived at [perma.cc/SEZ4-2GXU](https://perma.cc/SEZ4-2GXU) - -[[13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Fontaine2021-marker)] Dimitri Fontaine. -[An -introduction to the pg\_auto\_failover project](https://tapoueh.org/blog/2021/11/an-introduction-to-the-pg_auto_failover-project/). *tapoueh.org*, November 2021. -Archived at [perma.cc/3WH5-6BAF](https://perma.cc/3WH5-6BAF) - -[[14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Newland2012-marker)] Jesse Newland. -[GitHub -availability this week](https://github.blog/news-insights/the-library/github-availability-this-week/). *github.blog*, September 2012. -Archived at [perma.cc/3YRF-FTFJ](https://perma.cc/3YRF-FTFJ) - -[[15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Imbriaco2012_ch6-marker)] Mark Imbriaco. -[Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/). -*github.blog*, December 2012. -Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ) - -[[16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hugg2015-marker)] John Hugg. -[‘All In’ with Determinism for Performance and -Testing in Distributed Systems](https://www.youtube.com/watch?v=gJRj3vJL4wE). At *Strange Loop*, September 2015. - -[[17](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Suzuki2017_ch6-marker)] Hironobu Suzuki. -[The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017. - -[[18](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2012-marker)] Amit Kapila. 
-[WAL -Internals of PostgreSQL](https://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf). At *PostgreSQL Conference* (PGCon), May 2012. -Archived at [perma.cc/6225-3SUX](https://perma.cc/6225-3SUX) - -[[19](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2023-marker)] Amit Kapila. -[Evolution -of Logical Replication](https://amitkapila16.blogspot.com/2023/09/evolution-of-logical-replication.html). *amitkapila16.blogspot.com*, September 2023. -Archived at [perma.cc/F9VX-JLER](https://perma.cc/F9VX-JLER) - -[[20](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Petchimuthu2021-marker)] Aru Petchimuthu. -[Upgrade -your Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL database, Part 2: Using the pglogical -extension](https://aws.amazon.com/blogs/database/part-2-upgrade-your-amazon-rds-for-postgresql-database-using-the-pglogical-extension/). *aws.amazon.com*, August 2021. -Archived at [perma.cc/RXT8-FS2T](https://perma.cc/RXT8-FS2T) - -[[21](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Sharma2015te_ch6-marker)] Yogeshwer Sharma, Philippe Ajoux, Petchean -Ang, David Callies, Abhishek Choudhary, Laurent Demailly, Thomas Fersch, Liat Atsmon Guz, Andrzej -Kotulski, Sachin Kulkarni, Sanjeev Kumar, Harry Li, Jun Li, Evgeniy Makeev, Kowshik Prakasam, -Robbert van Renesse, Sabyasachi Roy, Pratyush Seth, Yee Jiun Song, Benjamin Wester, Kaushik -Veeraraghavan, and Peter Xie. -[Wormhole: -Reliable Pub-Sub to Support Geo-Replicated Internet Services](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf). At *12th USENIX -Symposium on Networked Systems Design and Implementation* (NSDI), May 2015. - -[[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry2011-marker)] Douglas B. Terry. 
-[Replicated -Data Consistency Explained Through Baseball](https://www.microsoft.com/en-us/research/publication/replicated-data-consistency-explained-through-baseball/). Microsoft Research, Technical Report -MSR-TR-2011-137, October 2011. -Archived at [perma.cc/F4KZ-AR38](https://perma.cc/F4KZ-AR38) - -[[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry1994-marker)] Douglas B. Terry, Alan J. Demers, Karin Petersen, -Mike J. Spreitzer, Marvin M. Theher, and Brent B. Welch. -[Session Guarantees -for Weakly Consistent Replicated Data](https://csis.pace.edu/~marchese/CS865/Papers/SessionGuaranteesPDIS.pdf). At *3rd International Conference on Parallel and -Distributed Information Systems* (PDIS), September 1994. -[doi:10.1109/PDIS.1994.331722](https://doi.org/10.1109/PDIS.1994.331722) - -[[24](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Vogels2008-marker)] Werner Vogels. -[Eventually Consistent](https://queue.acm.org/detail.cfm?id=1466448). -*ACM Queue*, volume 6, issue 6, pages 14–19, October 2008. -[doi:10.1145/1466443.1466448](https://doi.org/10.1145/1466443.1466448) - -[[25](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Willison2022-marker)] Simon Willison. -[Reply to: “My thoughts about Fly.io (so -far) and other newish technology I’m getting into”](https://news.ycombinator.com/item?id=31434055). *news.ycombinator.com*, May 2022. -Archived at [perma.cc/ZRV4-WWV8](https://perma.cc/ZRV4-WWV8) - -[[26](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Tharakan2020-marker)] Nithin Tharakan. -[Scaling Bitbucket’s -Database](https://www.atlassian.com/blog/bitbucket/scaling-bitbuckets-database). *atlassian.com*, October 2020. 
-Archived at [perma.cc/JAB7-9FGX](https://perma.cc/JAB7-9FGX) - -[[27](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Pratchett1991-marker)] Terry Pratchett. *Reaper Man: A Discworld -Novel*. Victor Gollancz, 1991. ISBN: 978-0-575-04979-6 - -[[28](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Bailis2014coord_ch6-marker)] Peter Bailis, Alan Fekete, Michael J. -Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. -[Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). -*Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014. -[doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509) - -[[29](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Raja2022-marker)] Yaser Raja and Peter Celentano. -[PostgreSQL -bi-directional replication using pglogical](https://aws.amazon.com/blogs/database/postgresql-bi-directional-replication-using-pglogical/). *aws.amazon.com*, January 2022. -Archived at - -[[30](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2012-marker)] Robert Hodges. -[If -You \*Must\* Deploy Multi-Master Replication, Read This First](https://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html). *scale-out-blog.blogspot.com*, -April 2012. Archived at [perma.cc/C2JN-F6Y8](https://perma.cc/C2JN-F6Y8) - -[[31](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#HBase7709-marker)] Lars Hofhansl. -[HBASE-7709: Infinite Loop Possible in -Master/Master Replication](https://issues.apache.org/jira/browse/HBASE-7709). *issues.apache.org*, January 2013. 
-Archived at [perma.cc/24G2-8NLC](https://perma.cc/24G2-8NLC) - -[[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DayRichter2010-marker)] John Day-Richter. -[What’s -Different About the New Google Docs: Making Collaboration Fast](https://drive.googleblog.com/2010/09/whats-different-about-new-google-docs.html). *drive.googleblog.com*, -September 2010. Archived at [perma.cc/5TL8-TSJ2](https://perma.cc/5TL8-TSJ2) - -[[33](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Wallace2019-marker)] Evan Wallace. -[How Figma’s -multiplayer technology works](https://www.figma.com/blog/how-figmas-multiplayer-technology-works/). *figma.com*, October 2019. -Archived at [perma.cc/L49H-LY4D](https://perma.cc/L49H-LY4D) - -[[34](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Artman2023-marker)] Tuomas Artman. -[Scaling the Linear Sync Engine](https://linear.app/blog/scaling-the-linear-sync-engine). -*linear.app*, June 2023. - -[[35](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Saafan2024-marker)] Amr Saafan. -[Why Sync -Engines Might Be the Future of Web Applications](https://www.nilebits.com/blog/2024/09/sync-engines-future-web-applications/). *nilebits.com*, September 2024. -Archived at [perma.cc/5N73-5M3V](https://perma.cc/5N73-5M3V) - -[[36](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hagoel2024-marker)] Isaac Hagoel. -[Are Sync -Engines The Future of Web Applications?](https://dev.to/isaachagoel/are-sync-engines-the-future-of-web-applications-1bbi) *dev.to*, July 2024. -Archived at [perma.cc/R9HF-BKKL](https://perma.cc/R9HF-BKKL) - -[[37](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Jayakar2024-marker)] Sujay Jayakar. 
-[A Map of Sync](https://stack.convex.dev/a-map-of-sync). *stack.convex.dev*, -October 2024. Archived at [perma.cc/82R3-H42A](https://perma.cc/82R3-H42A) - -[[38](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Feyerke2013-marker)] Alex Feyerke. -[Designing Offline-First Web Apps](https://alistapart.com/article/offline-first/). -*alistapart.com*, December 2013. -Archived at [perma.cc/WH7R-S2DS](https://perma.cc/WH7R-S2DS) - -[[39](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kleppmann2019_ch6-marker)] Martin Kleppmann, -Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. -[Local-first software: You own your data, in -spite of the cloud](https://www.inkandswitch.com/local-first/). At *ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and -Reflections on Programming and Software* (Onward!), October 2019, pages 154–178. -[doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737) - -[[40](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kleppmann2024lofi-marker)] Martin Kleppmann. -[The past, present, and -future of local-first](https://martin.kleppmann.com/2024/05/30/local-first-conference.html). At *Local-First Conference*, May 2024. - -[[41](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hofmeyr2024-marker)] Conrad Hofmeyr. -[API -Calling is to Sync Engines as jQuery is to React](https://www.powersync.com/blog/api-calling-is-to-sync-engines-as-jquery-is-to-react). *powersync.com*, November 2024. -Archived at [perma.cc/2FP9-7WJJ](https://perma.cc/2FP9-7WJJ) - -[[42](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#vanHardenberg2020-marker)] Peter van Hardenberg and Martin Kleppmann. 
-[PushPin: Towards -Production-Quality Peer-to-Peer Collaboration](https://martin.kleppmann.com/papers/pushpin-papoc20.pdf). At *7th Workshop on Principles and Practice -of Consistency for Distributed Data* (PaPoC), April 2020. -[doi:10.1145/3380787.3393683](https://doi.org/10.1145/3380787.3393683) - -[[43](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kawell1988-marker)] Leonard Kawell, Jr., Steven Beckhardt, Timothy -Halvorsen, Raymond Ozzie, and Irene Greif. -[Replicated document management in a group -communication system](https://dl.acm.org/doi/pdf/10.1145/62266.1024798). At *ACM Conference on Computer-Supported Cooperative Work* (CSCW), -September 1988. -[doi:10.1145/62266.1024798](https://doi.org/10.1145/62266.1024798) - -[[44](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Pusch2019-marker)] Ricky Pusch. -[Explaining how fighting games use delay-based and -rollback netcode](https://words.infil.net/w02-netcode.html). *words.infil.net* and *arstechnica.com*, October 2019. -Archived at [perma.cc/DE7W-RDJ8](https://perma.cc/DE7W-RDJ8) - -[[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6-marker)] Giuseppe DeCandia, Deniz Hastorun, Madan -Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, -Peter Vosshall, and Werner Vogels. -[Dynamo: Amazon’s -Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating Systems Principles* -(SOSP), October 2007. -[doi:10.1145/1323293.1294281](https://doi.org/10.1145/1323293.1294281) - -[[46](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Shapiro2011-marker)] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and -Marek Zawirski. 
[A Comprehensive Study -of Convergent and Commutative Replicated Data Types](https://inria.hal.science/inria-00555588v1/document). INRIA Research Report no. 7506, January -2011. - -[[47](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Sun1998-marker)] Chengzheng Sun and Clarence Ellis. -[Operational -Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=aef660812c5a9c4d3f06775f9455eeb090a4ff0f). At -*ACM Conference on Computer Supported Cooperative Work* (CSCW), November 1998. -[doi:10.1145/289444.289469](https://doi.org/10.1145/289444.289469) - -[[48](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gentle2025-marker)] Joseph Gentle and Martin Kleppmann. -[Collaborative Text Editing with Eg-walker: Better, -Faster, Smaller](https://arxiv.org/abs/2409.14252). At *20th European Conference on Computer Systems* (EuroSys), March 2025. -[doi:10.1145/3689031.3696076](https://doi.org/10.1145/3689031.3696076) - -[[49](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Shukla2018-marker)] Dharma Shukla. -[Azure -Cosmos DB: Pushing the frontier of globally distributed databases](https://azure.microsoft.com/en-us/blog/azure-cosmos-db-pushing-the-frontier-of-globally-distributed-databases/). *azure.microsoft.com*, September 2018. -Archived at [perma.cc/UT3B-HH6R](https://perma.cc/UT3B-HH6R) - -[[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gifford1979-marker)] David K. Gifford. -[Weighted Voting for -Replicated Data](https://www.cs.cmu.edu/~15-749/READINGS/required/availability/gifford79.pdf). At *7th ACM Symposium on Operating Systems Principles* (SOSP), December 1979. 
-[doi:10.1145/800215.806583](https://doi.org/10.1145/800215.806583) - -[[51](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Howard2016_ch6-marker)] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. -[Flexible Paxos: -Quorum Intersection Revisited](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.OPODIS.2016.25). At *20th International Conference on Principles of Distributed -Systems* (OPODIS), December 2016. -[doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25) - -[[52](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Blomstedt2012ricon-marker)] Joseph Blomstedt. -[Bringing Consistency to Riak](https://vimeo.com/51973001). At *RICON West*, -October 2012. - -[[53](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Bailis2014pbs-marker)] Peter Bailis, Shivaram Venkataraman, -Michael J. Franklin, Joseph M. Hellerstein, and Ion Stoica. -[Quantifying eventual consistency with -PBS](http://www.bailis.org/papers/pbs-vldbj2014.pdf). *The VLDB Journal*, volume 23, pages 279–302, April 2014. -[doi:10.1007/s00778-013-0330-1](https://doi.org/10.1007/s00778-013-0330-1) - -[[54](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Breck2019-marker)] Colin Breck. -[Shared-Nothing -Architectures for Server Replication and Synchronization](https://blog.colinbreck.com/shared-nothing-architectures-for-server-replication-and-synchronization/). *blog.colinbreck.com*, December 2019. -Archived at [perma.cc/48P3-J6CJ](https://perma.cc/48P3-J6CJ) - -[[55](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Dean2013_ch6-marker)] Jeffrey Dean and Luiz André Barroso. -[The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/). 
-*Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013. -[doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794) - -[[56](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Huang2017_ch6-marker)] Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. -Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. -[Gray -Failure: The Achilles’ Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in -Operating Systems* (HotOS), May 2017. -[doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005) - -[[57](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lamport1978_ch6-marker)] Leslie Lamport. -[Time, -Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, -volume 21, issue 7, pages 558–565, July 1978. -[doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563) - -[[58](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#ParkerJr1983-marker)] D. Stott Parker Jr., Gerald J. Popek, Gerard -Rudisin, Allen Stoughton, Bruce J. Walker, Evelyn Walton, Johanna M. Chow, David Edwards, Stephen -Kiser, and Charles Kline. -[Detection of -Mutual Inconsistency in Distributed Systems](https://pages.cs.wisc.edu/~remzi/Classes/739/Papers/parker83detection.pdf). *IEEE Transactions on Software Engineering*, -volume SE-9, issue 3, pages 240–247, May 1983. -[doi:10.1109/TSE.1983.236733](https://doi.org/10.1109/TSE.1983.236733) - -[[59](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Preguica2010-marker)] Nuno Preguiça, Carlos Baquero, Paulo Sérgio -Almeida, Victor Fonte, and Ricardo Gonçalves. 
[Dotted -Version Vectors: Logical Clocks for Optimistic Replication](https://arxiv.org/abs/1011.5808). arXiv:1011.5808, November 2010. - -[[60](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Manepalli2022-marker)] Giridhar Manepalli. -[Clocks and Causality - Ordering Events -in Distributed Systems](https://www.exhypothesi.com/clocks-and-causality/). *exhypothesi.com*, November 2022. -Archived at [perma.cc/8REU-KVLQ](https://perma.cc/8REU-KVLQ) - -[[61](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Cribbs2014-marker)] Sean Cribbs. -[A Brief History of Time in Riak](https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak). -At *RICON*, October 2014. Archived at [perma.cc/7U9P-6JFX](https://perma.cc/7U9P-6JFX) - -[[62](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Brown2015-marker)] Russell Brown. -[Vector -Clocks Revisited Part 2: Dotted Version Vectors](https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/). *riak.com*, November 2015. -Archived at [perma.cc/96QP-W98R](https://perma.cc/96QP-W98R) - -[[63](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Baquero2011-marker)] Carlos Baquero. -[Version -Vectors Are Not Vector Clocks](https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/). *haslab.wordpress.com*, July 2011. -Archived at [perma.cc/7PNU-4AMG](https://perma.cc/7PNU-4AMG) - -[[64](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Schwarz1994-marker)] Reinhard Schwarz and Friedemann Mattern. -[Detecting Causal -Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). 
*Distributed -Computing*, volume 7, issue 3, pages 149–174, March 1994. -[doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859) +[^1]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD) +[^2]: Kenny Gryp. [MySQL Terminology Updates](https://dev.mysql.com/blog-archive/mysql-terminology-updates/). *dev.mysql.com*, July 2020. Archived at [perma.cc/S62G-6RJ2](https://perma.cc/S62G-6RJ2) +[^3]: Oracle Corporation. [Oracle (Active) Data Guard 19c: Real-Time Data Protection and Availability](https://www.oracle.com/technetwork/database/availability/dg-adg-technical-overview-wp-5347548.pdf). White Paper, *oracle.com*, March 2019. Archived at [perma.cc/P5ST-RPKE](https://perma.cc/P5ST-RPKE) +[^4]: Microsoft. [What is an Always On availability group?](https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server) *learn.microsoft.com*, September 2024. Archived at [perma.cc/ABH6-3MXF](https://perma.cc/ABH6-3MXF) +[^5]: Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, and Akshat Vig. [Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical Conference* (ATC), July 2022. 
+[^6]: Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. [CockroachDB: The Resilient Geo-Distributed SQL Database](https://dl.acm.org/doi/abs/10.1145/3318464.3386134). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 1493–1509, June 2020. [doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134) +[^7]: Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, and Xin Tang. [TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084. [doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535) +[^8]: Mallory Knodel and Niels ten Oever. [Terminology, Power, and Inclusive Language in Internet-Drafts and RFCs](https://www.ietf.org/archive/id/draft-knodel-terminology-14.html). *IETF Internet-Draft*, August 2023. Archived at [perma.cc/5ZY9-725E](https://perma.cc/5ZY9-725E) +[^9]: Buck Hodges. [Postmortem: VSTS 4 September 2018](https://devblogs.microsoft.com/devopsservice/?p=17485). *devblogs.microsoft.com*, September 2018. Archived at [perma.cc/ZF5R-DYZS](https://perma.cc/ZF5R-DYZS) +[^10]: Gunnar Morling. [Leader Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024. Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y) +[^11]: Vignesh Chandramohan, Rohan Desai, and Chris Riccomini. [SlateDB Manifest Design](https://github.com/slatedb/slatedb/blob/main/rfcs/0001-manifest.md). *github.com*, May 2024. 
Archived at [perma.cc/8EUY-P32Z](https://perma.cc/8EUY-P32Z) +[^12]: Stas Kelvich. [Why does Neon use Paxos instead of Raft, and what’s the difference?](https://neon.tech/blog/paxos) *neon.tech*, August 2022. Archived at [perma.cc/SEZ4-2GXU](https://perma.cc/SEZ4-2GXU) +[^13]: Dimitri Fontaine. [An introduction to the pg\_auto\_failover project](https://tapoueh.org/blog/2021/11/an-introduction-to-the-pg_auto_failover-project/). *tapoueh.org*, November 2021. Archived at [perma.cc/3WH5-6BAF](https://perma.cc/3WH5-6BAF) +[^14]: Jesse Newland. [GitHub availability this week](https://github.blog/news-insights/the-library/github-availability-this-week/). *github.blog*, September 2012. Archived at [perma.cc/3YRF-FTFJ](https://perma.cc/3YRF-FTFJ) +[^15]: Mark Imbriaco. [Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/). *github.blog*, December 2012. Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ) +[^16]: John Hugg. [‘All In’ with Determinism for Performance and Testing in Distributed Systems](https://www.youtube.com/watch?v=gJRj3vJL4wE). At *Strange Loop*, September 2015. +[^17]: Hironobu Suzuki. [The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017. +[^18]: Amit Kapila. [WAL Internals of PostgreSQL](https://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf). At *PostgreSQL Conference* (PGCon), May 2012. Archived at [perma.cc/6225-3SUX](https://perma.cc/6225-3SUX) +[^19]: Amit Kapila. [Evolution of Logical Replication](https://amitkapila16.blogspot.com/2023/09/evolution-of-logical-replication.html). *amitkapila16.blogspot.com*, September 2023. Archived at [perma.cc/F9VX-JLER](https://perma.cc/F9VX-JLER) +[^20]: Aru Petchimuthu. 
[Upgrade your Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL database, Part 2: Using the pglogical extension](https://aws.amazon.com/blogs/database/part-2-upgrade-your-amazon-rds-for-postgresql-database-using-the-pglogical-extension/). *aws.amazon.com*, August 2021. Archived at [perma.cc/RXT8-FS2T](https://perma.cc/RXT8-FS2T) +[^21]: Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, David Callies, Abhishek Choudhary, Laurent Demailly, Thomas Fersch, Liat Atsmon Guz, Andrzej Kotulski, Sachin Kulkarni, Sanjeev Kumar, Harry Li, Jun Li, Evgeniy Makeev, Kowshik Prakasam, Robbert van Renesse, Sabyasachi Roy, Pratyush Seth, Yee Jiun Song, Benjamin Wester, Kaushik Veeraraghavan, and Peter Xie. [Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf). At *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015. +[^22]: Douglas B. Terry. [Replicated Data Consistency Explained Through Baseball](https://www.microsoft.com/en-us/research/publication/replicated-data-consistency-explained-through-baseball/). Microsoft Research, Technical Report MSR-TR-2011-137, October 2011. Archived at [perma.cc/F4KZ-AR38](https://perma.cc/F4KZ-AR38) +[^23]: Douglas B. Terry, Alan J. Demers, Karin Petersen, Mike J. Spreitzer, Marvin M. Theimer, and Brent B. Welch. [Session Guarantees for Weakly Consistent Replicated Data](https://csis.pace.edu/~marchese/CS865/Papers/SessionGuaranteesPDIS.pdf). At *3rd International Conference on Parallel and Distributed Information Systems* (PDIS), September 1994. [doi:10.1109/PDIS.1994.331722](https://doi.org/10.1109/PDIS.1994.331722) +[^24]: Werner Vogels. [Eventually Consistent](https://queue.acm.org/detail.cfm?id=1466448). *ACM Queue*, volume 6, issue 6, pages 14–19, October 2008. [doi:10.1145/1466443.1466448](https://doi.org/10.1145/1466443.1466448) +[^25]: Simon Willison.
[Reply to: “My thoughts about Fly.io (so far) and other newish technology I’m getting into”](https://news.ycombinator.com/item?id=31434055). *news.ycombinator.com*, May 2022. Archived at [perma.cc/ZRV4-WWV8](https://perma.cc/ZRV4-WWV8) +[^26]: Nithin Tharakan. [Scaling Bitbucket’s Database](https://www.atlassian.com/blog/bitbucket/scaling-bitbuckets-database). *atlassian.com*, October 2020. Archived at [perma.cc/JAB7-9FGX](https://perma.cc/JAB7-9FGX) +[^27]: Terry Pratchett. *Reaper Man: A Discworld Novel*. Victor Gollancz, 1991. ISBN: 978-0-575-04979-6 +[^28]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). *Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014. [doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509) +[^29]: Yaser Raja and Peter Celentano. [PostgreSQL bi-directional replication using pglogical](https://aws.amazon.com/blogs/database/postgresql-bi-directional-replication-using-pglogical/). *aws.amazon.com*, January 2022. Archived at +[^30]: Robert Hodges. [If You \*Must\* Deploy Multi-Master Replication, Read This First](https://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html). *scale-out-blog.blogspot.com*, April 2012. Archived at [perma.cc/C2JN-F6Y8](https://perma.cc/C2JN-F6Y8) +[^31]: Lars Hofhansl. [HBASE-7709: Infinite Loop Possible in Master/Master Replication](https://issues.apache.org/jira/browse/HBASE-7709). *issues.apache.org*, January 2013. Archived at [perma.cc/24G2-8NLC](https://perma.cc/24G2-8NLC) +[^32]: John Day-Richter. [What’s Different About the New Google Docs: Making Collaboration Fast](https://drive.googleblog.com/2010/09/whats-different-about-new-google-docs.html). *drive.googleblog.com*, September 2010. Archived at [perma.cc/5TL8-TSJ2](https://perma.cc/5TL8-TSJ2) +[^33]: Evan Wallace. 
[How Figma’s multiplayer technology works](https://www.figma.com/blog/how-figmas-multiplayer-technology-works/). *figma.com*, October 2019. Archived at [perma.cc/L49H-LY4D](https://perma.cc/L49H-LY4D) +[^34]: Tuomas Artman. [Scaling the Linear Sync Engine](https://linear.app/blog/scaling-the-linear-sync-engine). *linear.app*, June 2023. +[^35]: Amr Saafan. [Why Sync Engines Might Be the Future of Web Applications](https://www.nilebits.com/blog/2024/09/sync-engines-future-web-applications/). *nilebits.com*, September 2024. Archived at [perma.cc/5N73-5M3V](https://perma.cc/5N73-5M3V) +[^36]: Isaac Hagoel. [Are Sync Engines The Future of Web Applications?](https://dev.to/isaachagoel/are-sync-engines-the-future-of-web-applications-1bbi) *dev.to*, July 2024. Archived at [perma.cc/R9HF-BKKL](https://perma.cc/R9HF-BKKL) +[^37]: Sujay Jayakar. [A Map of Sync](https://stack.convex.dev/a-map-of-sync). *stack.convex.dev*, October 2024. Archived at [perma.cc/82R3-H42A](https://perma.cc/82R3-H42A) +[^38]: Alex Feyerke. [Designing Offline-First Web Apps](https://alistapart.com/article/offline-first/). *alistapart.com*, December 2013. Archived at [perma.cc/WH7R-S2DS](https://perma.cc/WH7R-S2DS) +[^39]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: You own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019, pages 154–178. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737) +[^40]: Martin Kleppmann. [The past, present, and future of local-first](https://martin.kleppmann.com/2024/05/30/local-first-conference.html). At *Local-First Conference*, May 2024. +[^41]: Conrad Hofmeyr. [API Calling is to Sync Engines as jQuery is to React](https://www.powersync.com/blog/api-calling-is-to-sync-engines-as-jquery-is-to-react). *powersync.com*, November 2024. 
Archived at [perma.cc/2FP9-7WJJ](https://perma.cc/2FP9-7WJJ) +[^42]: Peter van Hardenberg and Martin Kleppmann. [PushPin: Towards Production-Quality Peer-to-Peer Collaboration](https://martin.kleppmann.com/papers/pushpin-papoc20.pdf). At *7th Workshop on Principles and Practice of Consistency for Distributed Data* (PaPoC), April 2020. [doi:10.1145/3380787.3393683](https://doi.org/10.1145/3380787.3393683) +[^43]: Leonard Kawell, Jr., Steven Beckhardt, Timothy Halvorsen, Raymond Ozzie, and Irene Greif. [Replicated document management in a group communication system](https://dl.acm.org/doi/pdf/10.1145/62266.1024798). At *ACM Conference on Computer-Supported Cooperative Work* (CSCW), September 1988. [doi:10.1145/62266.1024798](https://doi.org/10.1145/62266.1024798) +[^44]: Ricky Pusch. [Explaining how fighting games use delay-based and rollback netcode](https://words.infil.net/w02-netcode.html). *words.infil.net* and *arstechnica.com*, October 2019. Archived at [perma.cc/DE7W-RDJ8](https://perma.cc/DE7W-RDJ8) +[^45]: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. [Dynamo: Amazon’s Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007. [doi:10.1145/1323293.1294281](https://doi.org/10.1145/1323293.1294281) +[^46]: Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. [A Comprehensive Study of Convergent and Commutative Replicated Data Types](https://inria.hal.science/inria-00555588v1/document). INRIA Research Report no. 7506, January 2011. +[^47]: Chengzheng Sun and Clarence Ellis. [Operational Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=aef660812c5a9c4d3f06775f9455eeb090a4ff0f). 
At *ACM Conference on Computer Supported Cooperative Work* (CSCW), November 1998. [doi:10.1145/289444.289469](https://doi.org/10.1145/289444.289469) +[^48]: Joseph Gentle and Martin Kleppmann. [Collaborative Text Editing with Eg-walker: Better, Faster, Smaller](https://arxiv.org/abs/2409.14252). At *20th European Conference on Computer Systems* (EuroSys), March 2025. [doi:10.1145/3689031.3696076](https://doi.org/10.1145/3689031.3696076) +[^49]: Dharma Shukla. [Azure Cosmos DB: Pushing the frontier of globally distributed databases](https://azure.microsoft.com/en-us/blog/azure-cosmos-db-pushing-the-frontier-of-globally-distributed-databases/). *azure.microsoft.com*, September 2018. Archived at [perma.cc/UT3B-HH6R](https://perma.cc/UT3B-HH6R) +[^50]: David K. Gifford. [Weighted Voting for Replicated Data](https://www.cs.cmu.edu/~15-749/READINGS/required/availability/gifford79.pdf). At *7th ACM Symposium on Operating Systems Principles* (SOSP), December 1979. [doi:10.1145/800215.806583](https://doi.org/10.1145/800215.806583) +[^51]: Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. [Flexible Paxos: Quorum Intersection Revisited](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.OPODIS.2016.25). At *20th International Conference on Principles of Distributed Systems* (OPODIS), December 2016. [doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25) +[^52]: Joseph Blomstedt. [Bringing Consistency to Riak](https://vimeo.com/51973001). At *RICON West*, October 2012. +[^53]: Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, and Ion Stoica. [Quantifying eventual consistency with PBS](http://www.bailis.org/papers/pbs-vldbj2014.pdf). *The VLDB Journal*, volume 23, pages 279–302, April 2014. [doi:10.1007/s00778-013-0330-1](https://doi.org/10.1007/s00778-013-0330-1) +[^54]: Colin Breck. 
[Shared-Nothing Architectures for Server Replication and Synchronization](https://blog.colinbreck.com/shared-nothing-architectures-for-server-replication-and-synchronization/). *blog.colinbreck.com*, December 2019. Archived at [perma.cc/48P3-J6CJ](https://perma.cc/48P3-J6CJ) +[^55]: Jeffrey Dean and Luiz André Barroso. [The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/). *Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013. [doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794) +[^56]: Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. [Gray Failure: The Achilles’ Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in Operating Systems* (HotOS), May 2017. [doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005) +[^57]: Leslie Lamport. [Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, volume 21, issue 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563) +[^58]: D. Stott Parker Jr., Gerald J. Popek, Gerard Rudisin, Allen Stoughton, Bruce J. Walker, Evelyn Walton, Johanna M. Chow, David Edwards, Stephen Kiser, and Charles Kline. [Detection of Mutual Inconsistency in Distributed Systems](https://pages.cs.wisc.edu/~remzi/Classes/739/Papers/parker83detection.pdf). *IEEE Transactions on Software Engineering*, volume SE-9, issue 3, pages 240–247, May 1983. [doi:10.1109/TSE.1983.236733](https://doi.org/10.1109/TSE.1983.236733) +[^59]: Nuno Preguiça, Carlos Baquero, Paulo Sérgio Almeida, Victor Fonte, and Ricardo Gonçalves. [Dotted Version Vectors: Logical Clocks for Optimistic Replication](https://arxiv.org/abs/1011.5808). arXiv:1011.5808, November 2010. 
+[^60]: Giridhar Manepalli. [Clocks and Causality - Ordering Events in Distributed Systems](https://www.exhypothesi.com/clocks-and-causality/). *exhypothesi.com*, November 2022. Archived at [perma.cc/8REU-KVLQ](https://perma.cc/8REU-KVLQ) +[^61]: Sean Cribbs. [A Brief History of Time in Riak](https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak). At *RICON*, October 2014. Archived at [perma.cc/7U9P-6JFX](https://perma.cc/7U9P-6JFX) +[^62]: Russell Brown. [Vector Clocks Revisited Part 2: Dotted Version Vectors](https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/). *riak.com*, November 2015. Archived at [perma.cc/96QP-W98R](https://perma.cc/96QP-W98R) +[^63]: Carlos Baquero. [Version Vectors Are Not Vector Clocks](https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/). *haslab.wordpress.com*, July 2011. Archived at [perma.cc/7PNU-4AMG](https://perma.cc/7PNU-4AMG) +[^64]: Reinhard Schwarz and Friedemann Mattern. [Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). *Distributed Computing*, volume 7, issue 3, pages 149–174, March 1994. [doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859) \ No newline at end of file diff --git a/content/en/ch7.md b/content/en/ch7.md index 8b0030f..b5f9ada 100644 --- a/content/en/ch7.md +++ b/content/en/ch7.md @@ -58,7 +58,7 @@ In many other systems, partitioning is just another word for sharding. While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to one theory, the term arose from the online role-play game *Ultima Online*, in which a magic crystal was shattered into pieces, and each of those shards refracted a copy of the game world -[[3](/en/ch7#Koster2009)]. +[^3]. The term *shard* thus came to mean one of a set of parallel game servers, and later was carried over to databases. 
Another theory is that *shard* was originally an acronym of *System for Highly Available Replicated Data*—reportedly a 1980s database, details of which are lost to history. @@ -88,7 +88,7 @@ single-shard database. The reason for this recommendation is that sharding often adds complexity: you typically have to decide which records to put in which shard by choosing a *partition key*; all records with the same partition key are placed in the same shard -[[4](/en/ch7#Fidalgo2021)]. +[^4]. This choice matters because accessing a record is fast if you know which shard it’s in, but if you don’t know the shard you have to do an inefficient search across all shards, and the sharding scheme is difficult to change. @@ -108,10 +108,10 @@ some systems don’t support them at all. Some systems use sharding even on a single machine, typically running one single-threaded process per CPU core to make use of the parallelism in the CPU, or to take advantage of a *nonuniform memory access* (NUMA) architecture in which some banks of memory are closer to one CPU than to others -[[5](/en/ch7#Drepper2007)]. +[^5]. For example, Redis, VoltDB, and FoundationDB use one process per core, and rely on sharding to spread load across CPU cores in the same machine -[[6](/en/ch7#Zhou2021_ch7)]. +[^6]. ## Sharding for Multitenancy @@ -125,7 +125,7 @@ Sometimes sharding is used to implement multitenant systems: either each tenant shard, or multiple small tenants may be grouped together into a larger shard. These shards might be physically separate databases (which we previously touched on in [“Embedded storage engines”](/en/ch4#sidebar_embedded)), or separately manageable portions of a larger logical database -[[7](/en/ch7#Slot2023)]. +[^7]. Using sharding for multitenancy has several advantages: Resource isolation @@ -143,19 +143,19 @@ Cell-based architecture tenants are grouped into a self-contained *cell*, and different cells are set up such that they can run largely independently from each other. 
This approach provides *fault isolation*: that is, a fault in one cell remains limited to that cell, and tenants in other cells are not affected - [[8](/en/ch7#Oliveira2023)]. + [^8]. Per-tenant backup and restore : Backing up each tenant’s shard separately makes it possible to restore a tenant’s state from a backup without affecting other tenants, which can be useful in case the tenant accidentally deletes or overwrites important data - [[9](/en/ch7#Shapira2023dont)]. + [^9]. Regulatory compliance : Data privacy regulation such as the GDPR gives individuals the right to access and delete all data stored about them. If each person’s data is stored in a separate shard, this translates into simple data export and deletion operations on their shard - [[10](/en/ch7#Schwarzkopf2019)]. + [^10]. Data residence : If a particular tenant’s data needs to be stored in a particular jurisdiction in order to comply @@ -166,14 +166,14 @@ Gradual schema rollout : Schema migrations (previously discussed in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)) can be rolled out gradually, one tenant at a time. This reduces risk, as you can detect problems before they affect all tenants, but it can be difficult to do transactionally - [[11](/en/ch7#Shapira2024)]. + [^11]. The main challenges around using sharding for multitenancy are: * It assumes that each individual tenant is small enough to fit on a single node. If that is not the case, and you have a single tenant that’s too big for one machine, you would need to additionally perform sharding within a single tenant, which brings us back to the topic of sharding for - scalability [[12](/en/ch7#Ganguli2020)]. + scalability [^12]. * If you have many small tenants, then creating a separate shard for each one may incur too much overhead. You could group several small tenants together into a bigger shard, but then you have the problem of how you move tenants from one shard to another as they grow. 
@@ -227,7 +227,7 @@ The shard boundaries might be chosen manually by an administrator, or the databa automatically. Manual key-range sharding is used by Vitess (a sharding layer for MySQL), for example; the automatic variant is used by Bigtable, its open source equivalent HBase, the range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB -[[6](/en/ch7#Zhou2021_ch7)]. YugabyteDB offers both manual and automatic +[^6]. YugabyteDB offers both manual and automatic tablet splitting. Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in @@ -242,7 +242,7 @@ lot of writes to nearby keys. For example, if the key is a timestamp, then the s ranges of time—e.g., one shard per month. Unfortunately, if you write data from the sensors to the database as the measurements happen, all the writes end up going to the same shard (the one for this month), so that shard can be overloaded with writes while others sit idle -[[13](/en/ch7#Lan2011)]. +[^13]. To avoid this problem in the sensor database, you need to use something other than the timestamp as the first element of the key. For example, you could prefix each timestamp with the sensor ID so @@ -257,7 +257,7 @@ When you first set up your database, there are no key ranges to split into shard such as HBase and MongoDB, allow you to configure an initial set of shards on an empty database, which is called *pre-splitting*. This requires that you already have some idea of what the key distribution is going to look like, so that you can choose appropriate key range boundaries -[[14](/en/ch7#Soztutar2013split)]. +[^14]. 
Later on, as your data volume and write throughput grow, a system with key-range sharding grows by splitting an existing shard into two or more smaller shards, each of which holds a contiguous @@ -276,7 +276,7 @@ With databases that manage shard boundaries automatically, a shard split is typi An advantage of key-range sharding is that the number of shards adapts to the data volume. If there is only a small amount of data, a small number of shards is sufficient, so overheads are small; if there is a huge amount of data, the size of each individual shard is limited to a configurable -maximum [[15](/en/ch7#Evans2013)]. +maximum [^15]. A downside of this approach is that splitting a shard is an expensive operation, since it requires all of its data to be rewritten into new files, similarly to a compaction in a log-structured @@ -301,7 +301,7 @@ uses MD5, whereas Cassandra and ScyllaDB use Murmur3. Many programming languages functions built in (as they are used for hash tables), but they may not be suitable for sharding: for example, in Java’s `Object.hashCode()` and Ruby’s `Object#hash`, the same key may have a different hash value in different processes, making them unsuitable for sharding -[[16](/en/ch7#Kleppmann2012hash)]. +[^16]. ### Hash modulo number of nodes @@ -350,7 +350,7 @@ used for any reads and writes that happen while the transfer is in progress. It’s common to choose the number of shards to be a number that is divisible by many factors, so that the dataset can be evenly split across various different numbers of nodes—not requiring the number -of nodes to be a power of 2, for example [[4](/en/ch7#Fidalgo2021)]. +of nodes to be a power of 2, for example [^4]. You can even account for mismatched hardware in your cluster: by assigning more shards to nodes that are more powerful, you can make those nodes take a greater share of the load. @@ -412,7 +412,7 @@ supports cluster keys. 
Clustering data not only improves range scan performance, improve compression and filtering performance as well. Hash-range sharding is used in YugabyteDB and DynamoDB -[[17](/en/ch7#Elhemali2022_ch7)], and is an option in MongoDB. +[^17], and is an option in MongoDB. Cassandra and ScyllaDB use a variant of this approach that is illustrated in [Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8 @@ -427,7 +427,7 @@ those imbalances tend to even out ###### Figure 7-6. Cassandra and ScyllaDB split the range of possible hash values (here 0–1023) into contiguous ranges with random boundaries, and assign several ranges to each node. When nodes are added or removed, range boundaries are added and removed, and shards are split or -merged accordingly [[19](/en/ch7#Lambov2016)]. +merged accordingly [^19]. In the example of [Figure 7-6](/en/ch7#fig_sharding_cassandra), when node 3 is added, node 1 transfers parts of two of its ranges to node 3, and node 2 transfers part of one of its ranges to node 3. This has the effect of giving the new node an approximately fair share of the dataset, @@ -447,13 +447,13 @@ the same shard as much as possible. The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of consistent hashing -[[20](/en/ch7#Karger1997)], +[^20], but several other consistent hashing algorithms have also been proposed -[[21](/en/ch7#Gryski2018)], +[^21], such as *highest random weight*, also known as *rendezvous hashing* -[[22](/en/ch7#Thaler1998)], +[^22], and *jump consistent hash* -[[23](/en/ch7#Lamping2014)]. +[^23]. 
With Cassandra’s algorithm, if one node is added, a small number of existing shards are split into sub-ranges; on the other hand, with rendezvous and jump consistent hashes, the new node is assigned individual keys that were previously scattered across all of the other nodes. Which one is @@ -468,7 +468,7 @@ some keys is much higher than to others—you can still end up with some servers while others sit almost idle. For example, on a social media site, a celebrity user with millions of followers may cause a storm -of activity when they do something [[24](/en/ch7#Axon2010_ch7)]. +of activity when they do something [^24]. This event can result in a large volume of reads and writes to the same key (where the partition key is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on). @@ -477,7 +477,7 @@ In such situations, a more flexible sharding policy is required [26](/en/ch7#Lee2021)]. A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put an individual hot key in a shard by its own, and perhaps even assigning it a dedicated machine -[[27](/en/ch7#Fritchie2018)]. +[^27]. It’s also possible to compensate for skew at the application level. For example, if one key is known to be very hot, a simple technique is to add a random number to the beginning or end of the key. @@ -499,8 +499,8 @@ necessitating different strategies for handling them. Some systems (especially cloud services designed for large scale) have automated approaches for dealing with hot shards; for example, Amazon calls it *heat management* -[[28](/en/ch7#Warfield2023_ch7)] -or *adaptive capacity* [[17](/en/ch7#Elhemali2022_ch7)]. +[^28] +or *adaptive capacity* [^17]. The details of how these systems work go beyond the scope of this book. ## Operations: Automatic or Manual Rebalancing @@ -527,7 +527,7 @@ another. 
If it is not done carefully, this process can overload the network or t might harm the performance of other requests. The system must continue processing writes while the rebalancing is in progress; if a system is near its maximum write throughput, the shard-splitting process might not even be able to keep up with the rate of incoming writes -[[29](/en/ch7#Houlihan2017)]. +[^29]. Such automation can be dangerous in combination with automatic failure detection. For example, say one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that @@ -667,7 +667,7 @@ shards. Whenever you write to the database—to add, remove, or update a records deal with the shard that contains the record that you are writing. For that reason, this type of secondary index is known as a *local index*. In an information retrieval context it is also known as a *document-partitioned index* -[[30](/en/ch7#Manning2008_ch7)]. +[^30]. When reading from a local secondary index, if you already know the partition key of the record you’re looking for, you can just perform the search on the appropriate shard. Moreover, if you only @@ -685,10 +685,10 @@ shards lets you store more data, but it doesn’t increase your query throughput process every query anyway. Nevertheless, local secondary indexes are widely used -[[31](/en/ch7#Busch2012)]: -for example, MongoDB, Riak, Cassandra [[32](/en/ch7#HarEl2017)], -Elasticsearch [[33](/en/ch7#Tong2013)], SolrCloud, -and VoltDB [[34](/en/ch7#Pavlo2013)] +[^31]: +for example, MongoDB, Riak, Cassandra [^32], +Elasticsearch [^33], SolrCloud, +and VoltDB [^34] all use local secondary indexes. ## Global Secondary Indexes @@ -709,7 +709,7 @@ The index on the make of car is partitioned similarly (with the shard boundary b ###### Figure 7-10. A global secondary index reflects data from all shards, and is itself sharded by the indexed value. 
This kind of index is also called *term-partitioned* -[[30](/en/ch7#Manning2008_ch7)]: +[^30]: recall from [“Full-Text Search”](/en/ch4#sec_storage_full_text) that in full-text search, a *term* is a keyword in a text that you can search for. Here we generalise it to mean any value that you can search for in the secondary index. @@ -728,7 +728,7 @@ certain make, or searching for multiple words occurring in the same text), it’ terms will be assigned to different shards. To compute the logical AND of the two conditions, the system needs to find all the IDs that occur in both of the postings lists. That’s no problem if the postings lists are short, but if they are long, it can be slow to send them over the network to -compute their intersection [[30](/en/ch7#Manning2008_ch7)]. +compute their intersection [^30]. Another challenge with global secondary indexes is that writes are more complicated than with local indexes, because writing a single record might affect multiple shards of the index (every term in @@ -797,191 +797,41 @@ that question in the following chapters. ##### Footnotes + ##### References -[[1](/en/ch7#Giordano2023-marker)] Claire Giordano. -[Understanding -partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023. -Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959) -[[2](/en/ch7#Leach2022-marker)] Brandur Leach. -[Partitioning in Postgres, 2022 -edition](https://brandur.org/fragments/postgres-partitioning-2022). *brandur.org*, October 2022. -Archived at [perma.cc/Z5LE-6AKX](https://perma.cc/Z5LE-6AKX) - -[[3](/en/ch7#Koster2009-marker)] Raph Koster. -[Database “sharding” -came from UO?](https://www.raphkoster.com/2009/01/08/database-sharding-came-from-uo/) *raphkoster.com*, January 2009. -Archived at [perma.cc/4N9U-5KYF](https://perma.cc/4N9U-5KYF) - -[[4](/en/ch7#Fidalgo2021-marker)] Garrett Fidalgo. 
-[Herding elephants: Lessons learned -from sharding Postgres at Notion](https://www.notion.com/blog/sharding-postgres-at-notion). *notion.com*, October 2021. -Archived at [perma.cc/5J5V-W2VX](https://perma.cc/5J5V-W2VX) - -[[5](/en/ch7#Drepper2007-marker)] Ulrich Drepper. -[What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf). -*akkadia.org*, November 2007. Archived at -[perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ) - -[[6](/en/ch7#Zhou2021_ch7-marker)] Jingyu Zhou, Meng Xu, Alexander Shraer, Bala -Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, -Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin -Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. -[FoundationDB: A Distributed Unbundled -Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data* -(SIGMOD), June 2021. -[doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559) - -[[7](/en/ch7#Slot2023-marker)] Marco Slot. -[Citus 12: -Schema-based sharding for PostgreSQL](https://www.citusdata.com/blog/2023/07/18/citus-12-schema-based-sharding-for-postgres/). *citusdata.com*, July 2023. -Archived at [perma.cc/R874-EC9W](https://perma.cc/R874-EC9W) - -[[8](/en/ch7#Oliveira2023-marker)] Robisson Oliveira. -[Reducing -the Scope of Impact with Cell-Based Architecture](https://docs.aws.amazon.com/pdfs/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/reducing-scope-of-impact-with-cell-based-architecture.pdf). AWS Well-Architected white paper, Amazon Web -Services, September 2023. -Archived at [perma.cc/4KWW-47NR](https://perma.cc/4KWW-47NR) - -[[9](/en/ch7#Shapira2023dont-marker)] Gwen Shapira. -[Things DBs Don’t Do - But Should](https://www.thenile.dev/blog/things-dbs-dont-do). -*thenile.dev*, February 2023. 
-Archived at [perma.cc/C3J4-JSFW](https://perma.cc/C3J4-JSFW) - -[[10](/en/ch7#Schwarzkopf2019-marker)] Malte Schwarzkopf, Eddie Kohler, M. Frans -Kaashoek, and Robert Morris. -[Position: GDPR -Compliance by Construction](https://cs.brown.edu/people/malte/pub/papers/2019-poly-gdpr.pdf). At *Towards Polystores that manage multiple Databases, Privacy, -Security and/or Policy Issues for Heterogenous Data* (Poly), August 2019. -[doi:10.1007/978-3-030-33752-0\_3](https://doi.org/10.1007/978-3-030-33752-0_3) - -[[11](/en/ch7#Shapira2024-marker)] Gwen Shapira. -[Introducing pg\_karnak: Transactional schema -migration across tenant databases](https://www.thenile.dev/blog/distributed-ddl). *thenile.dev*, November 2024. -Archived at [perma.cc/R5RD-8HR9](https://perma.cc/R5RD-8HR9) - -[[12](/en/ch7#Ganguli2020-marker)] Arka Ganguli, Guido Iaquinti, -Maggie Zhou, and Rafael Chacón. -[Scaling Datastores at -Slack with Vitess](https://slack.engineering/scaling-datastores-at-slack-with-vitess/). *slack.engineering*, December 2020. -Archived at [perma.cc/UW8F-ALJK](https://perma.cc/UW8F-ALJK) - -[[13](/en/ch7#Lan2011-marker)] Ikai Lan. -[App -Engine Datastore Tip: Monotonically Increasing Values Are Bad](https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/). *ikaisays.com*, -January 2011. Archived at [perma.cc/BPX8-RPJB](https://perma.cc/BPX8-RPJB) - -[[14](/en/ch7#Soztutar2013split-marker)] Enis Soztutar. -[Apache -HBase Region Splitting and Merging](https://www.cloudera.com/blog/technical/apache-hbase-region-splitting-and-merging.html). *cloudera.com*, February 2013. -Archived at [perma.cc/S9HS-2X2C](https://perma.cc/S9HS-2X2C) - -[[15](/en/ch7#Evans2013-marker)] Eric Evans. -[Rethinking Topology in Cassandra](https://www.youtube.com/watch?v=Qz6ElTdYjjU). At -*Cassandra Summit*, June 2013. -Archived at [perma.cc/2DKM-F438](https://perma.cc/2DKM-F438) - -[[16](/en/ch7#Kleppmann2012hash-marker)] Martin Kleppmann. 
-[Java’s -hashCode Is Not Safe for Distributed Systems](https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html). *martin.kleppmann.com*, June 2012. -Archived at [perma.cc/LK5U-VZSN](https://perma.cc/LK5U-VZSN) - -[[17](/en/ch7#Elhemali2022_ch7-marker)] Mostafa Elhemali, Niall Gallagher, Nicholas -Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu -Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, -Doug Terry, and Akshat Vig. -[Amazon DynamoDB: A Scalable, -Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical -Conference* (ATC), July 2022. - -[[18](/en/ch7#Williams2012-marker)] Brandon Williams. -[Virtual Nodes in Cassandra -1.2](https://www.datastax.com/blog/virtual-nodes-cassandra-12). *datastax.com*, December 2012. -Archived at [perma.cc/N385-EQXV](https://perma.cc/N385-EQXV) - -[[19](/en/ch7#Lambov2016-marker)] Branimir Lambov. -[New Token -Allocation Algorithm in Cassandra 3.0](https://www.datastax.com/blog/new-token-allocation-algorithm-cassandra-30). *datastax.com*, January 2016. -Archived at [perma.cc/2BG7-LDWY](https://perma.cc/2BG7-LDWY) - -[[20](/en/ch7#Karger1997-marker)] David Karger, Eric Lehman, Tom Leighton, Rina -Panigrahy, Matthew Levine, and Daniel Lewin. -[Consistent Hashing and Random Trees: -Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web](https://people.csail.mit.edu/karger/Papers/web.pdf). -At *29th Annual ACM Symposium on Theory of Computing* (STOC), May 1997. -[doi:10.1145/258533.258660](https://doi.org/10.1145/258533.258660) - -[[21](/en/ch7#Gryski2018-marker)] Damian Gryski. -[Consistent -Hashing: Algorithmic Tradeoffs](https://dgryski.medium.com/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8). *dgryski.medium.com*, April 2018. 
-Archived at [perma.cc/B2WF-TYQ8](https://perma.cc/B2WF-TYQ8) - -[[22](/en/ch7#Thaler1998-marker)] David G. Thaler and Chinya V. Ravishankar. -[Using name-based mappings to increase -hit rates](https://www.cs.kent.edu/~javed/DL/web/p1-thaler.pdf). *IEEE/ACM Transactions on Networking*, volume 6, issue 1, pages 1–14, February 1998. -[doi:10.1109/90.663936](https://doi.org/10.1109/90.663936) - -[[23](/en/ch7#Lamping2014-marker)] John Lamping and Eric Veach. -[A Fast, Minimal Memory, Consistent Hash -Algorithm](https://arxiv.org/abs/1406.2294). *arxiv.org*, June 2014. - -[[24](/en/ch7#Axon2010_ch7-marker)] Samuel Axon. -[3% of Twitter’s Servers -Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010. -Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX) - -[[25](/en/ch7#Guo2020-marker)] Gerald Guo and Thawan Kooburat. -[Scaling -services with Shard Manager](https://engineering.fb.com/2020/08/24/production-engineering/scaling-services-with-shard-manager/). *engineering.fb.com*, August 2020. -Archived at [perma.cc/EFS3-XQYT](https://perma.cc/EFS3-XQYT) - -[[26](/en/ch7#Lee2021-marker)] Sangmin Lee, Zhenhua Guo, Omer Sunercan, Jun Ying, Thawan -Kooburat, Suryadeep Biswal, Jun Chen, Kun Huang, Yatpang Cheung, Yiding Zhou, Kaushik Veeraraghavan, -Biren Damani, Pol Mauri Ruiz, Vikas Mehta, and Chunqiang Tang. -[Shard Manager: A Generic Shard -Management Framework for Geo-distributed Applications](https://dl.acm.org/doi/pdf/10.1145/3477132.3483546). *28th ACM SIGOPS Symposium on -Operating Systems Principles* (SOSP), pages 553–569, October 2021. -[doi:10.1145/3477132.3483546](https://doi.org/10.1145/3477132.3483546) - -[[27](/en/ch7#Fritchie2018-marker)] Scott Lystig Fritchie. -[A Critique of Resizable Hash -Tables: Riak Core & Random Slicing](https://www.infoq.com/articles/dynamo-riak-random-slicing/). *infoq.com*, August 2018. 
-Archived at [perma.cc/RPX7-7BLN](https://perma.cc/RPX7-7BLN) - -[[28](/en/ch7#Warfield2023_ch7-marker)] Andy Warfield. -[Building -and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. -Archived at [perma.cc/6S7P-GLM4](https://perma.cc/6S7P-GLM4) - -[[29](/en/ch7#Houlihan2017-marker)] Rich Houlihan. -[DynamoDB adaptive capacity: smooth performance -for chaotic workloads (DAT327)](https://www.youtube.com/watch?v=kMY0_m29YzU). At *AWS re:Invent*, November 2017. - -[[30](/en/ch7#Manning2008_ch7-marker)] Christopher D. Manning, Prabhakar Raghavan, -and Hinrich Schütze. -[*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/). -Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at -[nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/) - -[[31](/en/ch7#Busch2012-marker)] Michael Busch, Krishna Gade, Brian Larson, Patrick -Lok, Samuel Luckenbill, and Jimmy Lin. -[Earlybird: -Real-Time Search at Twitter](https://cs.uwaterloo.ca/~jimmylin/publications/Busch_etal_ICDE2012.pdf). At *28th IEEE International Conference on Data Engineering* -(ICDE), April 2012. -[doi:10.1109/ICDE.2012.149](https://doi.org/10.1109/ICDE.2012.149) - -[[32](/en/ch7#HarEl2017-marker)] Nadav Har’El. -[Indexing in Cassandra 3](https://github.com/scylladb/scylladb/wiki/Indexing-in-Cassandra-3). -*github.com*, April 2017. -Archived at [perma.cc/3ENV-8T9P](https://perma.cc/3ENV-8T9P) - -[[33](/en/ch7#Tong2013-marker)] Zachary Tong. -[Customizing Your -Document Routing](https://www.elastic.co/blog/customizing-your-document-routing/). *elastic.co*, June 2013. -Archived at [perma.cc/97VM-MREN](https://perma.cc/97VM-MREN) - -[[34](/en/ch7#Pavlo2013-marker)] Andrew Pavlo. -[H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/). -*hstore.cs.brown.edu*, October 2013. 
-Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z) +[^1]: Claire Giordano. [Understanding partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023. Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959) +[^2]: Brandur Leach. [Partitioning in Postgres, 2022 edition](https://brandur.org/fragments/postgres-partitioning-2022). *brandur.org*, October 2022. Archived at [perma.cc/Z5LE-6AKX](https://perma.cc/Z5LE-6AKX) +[^3]: Raph Koster. [Database “sharding” came from UO?](https://www.raphkoster.com/2009/01/08/database-sharding-came-from-uo/) *raphkoster.com*, January 2009. Archived at [perma.cc/4N9U-5KYF](https://perma.cc/4N9U-5KYF) +[^4]: Garrett Fidalgo. [Herding elephants: Lessons learned from sharding Postgres at Notion](https://www.notion.com/blog/sharding-postgres-at-notion). *notion.com*, October 2021. Archived at [perma.cc/5J5V-W2VX](https://perma.cc/5J5V-W2VX) +[^5]: Ulrich Drepper. [What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf). *akkadia.org*, November 2007. Archived at [perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ) +[^6]: Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. [FoundationDB: A Distributed Unbundled Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2021. [doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559) +[^7]: Marco Slot. [Citus 12: Schema-based sharding for PostgreSQL](https://www.citusdata.com/blog/2023/07/18/citus-12-schema-based-sharding-for-postgres/). *citusdata.com*, July 2023. 
Archived at [perma.cc/R874-EC9W](https://perma.cc/R874-EC9W) +[^8]: Robisson Oliveira. [Reducing the Scope of Impact with Cell-Based Architecture](https://docs.aws.amazon.com/pdfs/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/reducing-scope-of-impact-with-cell-based-architecture.pdf). AWS Well-Architected white paper, Amazon Web Services, September 2023. Archived at [perma.cc/4KWW-47NR](https://perma.cc/4KWW-47NR) +[^9]: Gwen Shapira. [Things DBs Don’t Do - But Should](https://www.thenile.dev/blog/things-dbs-dont-do). *thenile.dev*, February 2023. Archived at [perma.cc/C3J4-JSFW](https://perma.cc/C3J4-JSFW) +[^10]: Malte Schwarzkopf, Eddie Kohler, M. Frans Kaashoek, and Robert Morris. [Position: GDPR Compliance by Construction](https://cs.brown.edu/people/malte/pub/papers/2019-poly-gdpr.pdf). At *Towards Polystores that manage multiple Databases, Privacy, Security and/or Policy Issues for Heterogenous Data* (Poly), August 2019. [doi:10.1007/978-3-030-33752-0\_3](https://doi.org/10.1007/978-3-030-33752-0_3) +[^11]: Gwen Shapira. [Introducing pg\_karnak: Transactional schema migration across tenant databases](https://www.thenile.dev/blog/distributed-ddl). *thenile.dev*, November 2024. Archived at [perma.cc/R5RD-8HR9](https://perma.cc/R5RD-8HR9) +[^12]: Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón. [Scaling Datastores at Slack with Vitess](https://slack.engineering/scaling-datastores-at-slack-with-vitess/). *slack.engineering*, December 2020. Archived at [perma.cc/UW8F-ALJK](https://perma.cc/UW8F-ALJK) +[^13]: Ikai Lan. [App Engine Datastore Tip: Monotonically Increasing Values Are Bad](https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/). *ikaisays.com*, January 2011. Archived at [perma.cc/BPX8-RPJB](https://perma.cc/BPX8-RPJB) +[^14]: Enis Soztutar. 
[Apache HBase Region Splitting and Merging](https://www.cloudera.com/blog/technical/apache-hbase-region-splitting-and-merging.html). *cloudera.com*, February 2013. Archived at [perma.cc/S9HS-2X2C](https://perma.cc/S9HS-2X2C) +[^15]: Eric Evans. [Rethinking Topology in Cassandra](https://www.youtube.com/watch?v=Qz6ElTdYjjU). At *Cassandra Summit*, June 2013. Archived at [perma.cc/2DKM-F438](https://perma.cc/2DKM-F438) +[^16]: Martin Kleppmann. [Java’s hashCode Is Not Safe for Distributed Systems](https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html). *martin.kleppmann.com*, June 2012. Archived at [perma.cc/LK5U-VZSN](https://perma.cc/LK5U-VZSN) +[^17]: Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, and Akshat Vig. [Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical Conference* (ATC), July 2022. +[^18]: Brandon Williams. [Virtual Nodes in Cassandra 1.2](https://www.datastax.com/blog/virtual-nodes-cassandra-12). *datastax.com*, December 2012. Archived at [perma.cc/N385-EQXV](https://perma.cc/N385-EQXV) +[^19]: Branimir Lambov. [New Token Allocation Algorithm in Cassandra 3.0](https://www.datastax.com/blog/new-token-allocation-algorithm-cassandra-30). *datastax.com*, January 2016. Archived at [perma.cc/2BG7-LDWY](https://perma.cc/2BG7-LDWY) +[^20]: David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. [Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web](https://people.csail.mit.edu/karger/Papers/web.pdf). At *29th Annual ACM Symposium on Theory of Computing* (STOC), May 1997. 
[doi:10.1145/258533.258660](https://doi.org/10.1145/258533.258660) +[^21]: Damian Gryski. [Consistent Hashing: Algorithmic Tradeoffs](https://dgryski.medium.com/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8). *dgryski.medium.com*, April 2018. Archived at [perma.cc/B2WF-TYQ8](https://perma.cc/B2WF-TYQ8) +[^22]: David G. Thaler and Chinya V. Ravishankar. [Using name-based mappings to increase hit rates](https://www.cs.kent.edu/~javed/DL/web/p1-thaler.pdf). *IEEE/ACM Transactions on Networking*, volume 6, issue 1, pages 1–14, February 1998. [doi:10.1109/90.663936](https://doi.org/10.1109/90.663936) +[^23]: John Lamping and Eric Veach. [A Fast, Minimal Memory, Consistent Hash Algorithm](https://arxiv.org/abs/1406.2294). *arxiv.org*, June 2014. +[^24]: Samuel Axon. [3% of Twitter’s Servers Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010. Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX) +[^25]: Gerald Guo and Thawan Kooburat. [Scaling services with Shard Manager](https://engineering.fb.com/2020/08/24/production-engineering/scaling-services-with-shard-manager/). *engineering.fb.com*, August 2020. Archived at [perma.cc/EFS3-XQYT](https://perma.cc/EFS3-XQYT) +[^26]: Sangmin Lee, Zhenhua Guo, Omer Sunercan, Jun Ying, Thawan Kooburat, Suryadeep Biswal, Jun Chen, Kun Huang, Yatpang Cheung, Yiding Zhou, Kaushik Veeraraghavan, Biren Damani, Pol Mauri Ruiz, Vikas Mehta, and Chunqiang Tang. [Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications](https://dl.acm.org/doi/pdf/10.1145/3477132.3483546). *28th ACM SIGOPS Symposium on Operating Systems Principles* (SOSP), pages 553–569, October 2021. [doi:10.1145/3477132.3483546](https://doi.org/10.1145/3477132.3483546) +[^27]: Scott Lystig Fritchie. [A Critique of Resizable Hash Tables: Riak Core & Random Slicing](https://www.infoq.com/articles/dynamo-riak-random-slicing/). *infoq.com*, August 2018. 
Archived at [perma.cc/RPX7-7BLN](https://perma.cc/RPX7-7BLN) +[^28]: Andy Warfield. [Building and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. Archived at [perma.cc/6S7P-GLM4](https://perma.cc/6S7P-GLM4) +[^29]: Rich Houlihan. [DynamoDB adaptive capacity: smooth performance for chaotic workloads (DAT327)](https://www.youtube.com/watch?v=kMY0_m29YzU). At *AWS re:Invent*, November 2017. +[^30]: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/). Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at [nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/) +[^31]: Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. [Earlybird: Real-Time Search at Twitter](https://cs.uwaterloo.ca/~jimmylin/publications/Busch_etal_ICDE2012.pdf). At *28th IEEE International Conference on Data Engineering* (ICDE), April 2012. [doi:10.1109/ICDE.2012.149](https://doi.org/10.1109/ICDE.2012.149) +[^32]: Nadav Har’El. [Indexing in Cassandra 3](https://github.com/scylladb/scylladb/wiki/Indexing-in-Cassandra-3). *github.com*, April 2017. Archived at [perma.cc/3ENV-8T9P](https://perma.cc/3ENV-8T9P) +[^33]: Zachary Tong. [Customizing Your Document Routing](https://www.elastic.co/blog/customizing-your-document-routing/). *elastic.co*, June 2013. Archived at [perma.cc/97VM-MREN](https://perma.cc/97VM-MREN) +[^34]: Andrew Pavlo. [H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/). *hstore.cs.brown.edu*, October 2013. Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z) \ No newline at end of file diff --git a/content/en/ch8.md b/content/en/ch8.md index 5a6edef..19d6a3e 100644 --- a/content/en/ch8.md +++ b/content/en/ch8.md @@ -47,7 +47,7 @@ higher availability). 
Some safety properties can be achieved without transaction hand, transactions can prevent a lot of grief: for example, the technical cause behind the Post Office Horizon scandal (see [“How Important Is Reliability?”](/en/ch2#sidebar_reliability_importance)) was probably a lack of ACID transactions in the underlying accounting system -[[1](/en/ch8#Murdoch2021)]. +[^1]. How do you figure out whether you need transactions? In order to answer that question, we first need to understand exactly what safety guarantees transactions can provide, and what costs are associated @@ -86,10 +86,10 @@ The hype around NoSQL distributed databases led to a popular belief that transac fundamentally unscalable, and that any large-scale system would have to abandon transactions in order to maintain good performance and high availability. More recently, that belief has turned out to be wrong. So-called “NewSQL” databases such as CockroachDB -[[5](/en/ch8#Taft2020_ch8)], -TiDB [[6](/en/ch8#Huang2020)], -Spanner [[7](/en/ch8#Corbett2012_ch8)], -FoundationDB [[8](/en/ch8#Zhou2021_ch8)], +[^5], +TiDB [^6], +Spanner [^7], +FoundationDB [^8], and Yugabyte have shown that transactional systems can scale to large data volumes and high throughput. These systems combine sharding with consensus protocols ([Chapter 10](/en/ch10#ch_consistency)) to provide strong ACID guarantees at scale. @@ -104,19 +104,19 @@ operation and in various extreme (but realistic) circumstances. The safety guarantees provided by transactions are often described by the well-known acronym *ACID*, which stands for *Atomicity*, *Consistency*, *Isolation*, and *Durability*. It was coined in 1983 by Theo Härder and Andreas Reuter -[[9](/en/ch8#Harder1983)] +[^9] in an effort to establish precise terminology for fault-tolerance mechanisms in databases. However, in practice, one database’s implementation of ACID does not equal another’s implementation. 
For example, as we shall see, there is a lot of ambiguity around the meaning of *isolation* -[[10](/en/ch8#Bailis2013HAT)]. +[^10]. The high-level idea is sound, but the devil is in the details. Today, when a system claims to be “ACID compliant,” it’s unclear what guarantees you can actually expect. ACID has unfortunately become mostly a marketing term. (Systems that do not meet the ACID criteria are sometimes called *BASE*, which stands for *Basically Available*, *Soft state*, and *Eventual consistency* -[[11](/en/ch8#Fox1997)]. +[^11]. This is even more vague than the definition of ACID. It seems that the only sensible definition of BASE is “not ACID”; i.e., it can mean almost anything you want.) @@ -183,7 +183,7 @@ If you want the database to enforce your invariants, you need to declare them as part of the schema. For example, foreign key constraints, uniqueness constraints, or check constraints (which restrict the values that can appear in an individual row) are often used to model specific types of invariants. More complex consistency requirements can sometimes be modeled -using triggers or materialized views [[12](/en/ch8#Andrews2004)]. +using triggers or materialized views [^12]. However, complex invariants can be difficult or impossible to model using the constraints that databases usually provide. In that case, it’s the application’s responsibility to define its @@ -214,7 +214,7 @@ isolation as *serializability*, which means that each transaction can pretend th transaction running on the entire database. The database ensures that when the transactions have committed, the result is the same as if they had run *serially* (one after another), even though in reality they may have run concurrently -[[13](/en/ch8#Bernstein1987_ch8)]. +[^13]. However, serializability has a performance cost. 
In practice, many databases use forms of isolation that are weaker than serializability: that is, they allow concurrent transactions to interfere with @@ -262,12 +262,12 @@ The truth is, nothing is perfect: unavailable (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)). * When the power is suddenly cut, SSDs in particular have been shown to sometimes violate the guarantees they are supposed to provide: even `fsync` isn’t guaranteed to work correctly - [[15](/en/ch8#Zheng2013)]. + [^15]. Disk firmware can have bugs, just like any other kind of software [[16](/en/ch8#Denness2015), [17](/en/ch8#Surak2015)], e.g. causing drives to fail after exactly 32,768 hours of operation - [[18](/en/ch8#HPE2019_ch8)]. + [^18]. And `fsync` is hard to use; even PostgreSQL used it incorrectly for over 20 years [[19](/en/ch8#Ringer2018), [20](/en/ch8#Rebello2020), @@ -277,21 +277,21 @@ The truth is, nothing is perfect: [[22](/en/ch8#Pillai2014), [23](/en/ch8#Siebenmann2016)]. Filesystem errors on one replica can sometimes spread to other replicas as well - [[24](/en/ch8#Ganesan2017)]. + [^24]. * Data on disk can gradually become corrupted without this being detected - [[25](/en/ch8#Bairavasundaram2008)]. + [^25]. If data has been corrupted for some time, replicas and recent backups may also be corrupted. In this case, you will need to try to restore the data from a historical backup. * One study of SSDs found that between 30% and 80% of drives develop at least one bad block during the first four years of operation, and only some of these can be corrected by the firmware - [[26](/en/ch8#Schroeder2016_ch8)]. + [^26]. Magnetic hard drives have a lower rate of bad sectors, but a higher rate of complete failure than SSDs. * When a worn-out SSD (that has gone through many write/erase cycles) is disconnected from power, it can start losing data within a timescale of weeks to months, depending on the temperature - [[27](/en/ch8#Allison2015)]. + [^27]. 
This is less of a problem for drives with lower wear levels - [[28](/en/ch8#MahUng2015)]. + [^28]. In practice, there is no one technique that can provide absolute guarantees. There are only various risk-reduction techniques, including writing to disk, replicating to remote machines, and @@ -491,11 +491,11 @@ without any concurrency). In practice, isolation is unfortunately not that simple. Serializable isolation has a performance cost, and many databases don’t want to pay that price -[[10](/en/ch8#Bailis2013HAT)]. It’s therefore common for systems to use +[^10]. It’s therefore common for systems to use weaker levels of isolation, which protect against *some* concurrency issues, but not all. Those levels of isolation are much harder to understand, and they can lead to subtle bugs, but they are nevertheless used in practice -[[29](/en/ch8#Kleppmann2014)]. +[^29]. Concurrency bugs caused by weak transaction isolation are not just a theoretical problem. They have caused substantial loss of money @@ -503,8 +503,8 @@ caused substantial loss of money [31](/en/ch8#DAgosta2014), [32](/en/ch8#bitcointhief2014)], led to investigation by financial auditors -[[33](/en/ch8#Jorwekar2007_ch8)], -and caused customer data to be corrupted [[34](/en/ch8#Melanson2014)]. +[^33], +and caused customer data to be corrupted [^34]. A popular comment on revelations of such problems is “Use an ACID database if you’re handling financial data!”—but that misses the point. Even many popular relational database systems (which are usually considered “ACID”) use weak isolation, so they wouldn’t necessarily have prevented these @@ -513,14 +513,14 @@ bugs from occurring. ###### Note Incidentally, much of the banking system relies on text files that are exchanged via secure FTP -[[35](/en/ch8#Kim2014ACH)]. +[^35]. In this context, having an audit trail and some human-level fraud prevention measures is actually more important than ACID properties. 
Those examples also highlight an important point: even if concurrency issues are rare in normal operation, you have to consider the possibility that an attacker deliberately sends a burst of highly concurrent requests to your API in an attempt to deliberately exploit concurrency bugs -[[30](/en/ch8#Warszawski2017)]. Therefore, in order to build +[^30]. Therefore, in order to build applications that are reliable and secure, you have to ensure that such bugs are systematically prevented. @@ -551,7 +551,7 @@ writes, but does not prevent dirty reads. Let’s discuss these two guarantees i Imagine a transaction has written some data to the database, but the transaction has not yet committed or aborted. Can another transaction see that uncommitted data? If yes, that is called a -*dirty read* [[3](/en/ch8#Gray1976)]. +*dirty read* [^3]. Transactions running at the read committed isolation level must prevent dirty reads. This means that any writes by a transaction only become visible to others when that transaction commits (and then @@ -584,7 +584,7 @@ the earlier write. However, what happens if the earlier write is part of a transaction that has not yet committed, so the later write overwrites an uncommitted value? This is called a *dirty write* -[[36](/en/ch8#Berenson1995)]. Transactions running at the read +[^36]. Transactions running at the read committed isolation level must prevent dirty writes, usually by delaying the second write until the first write’s transaction has committed or aborted. @@ -611,7 +611,7 @@ By preventing dirty writes, this isolation level avoids some kinds of concurrenc Read committed is a very popular isolation level. It is the default setting in Oracle Database, PostgreSQL, SQL Server, and many other databases -[[10](/en/ch8#Bailis2013HAT)]. +[^10]. 
Most commonly, databases prevent dirty writes by using row-level locks: when a transaction wants to modify a particular row (or document or some other object), it must first acquire a lock on that @@ -636,7 +636,7 @@ different part of the application, due to waiting for locks. Nevertheless, locks are used to prevent dirty reads in some databases, such as IBM Db2 and Microsoft SQL Server in the `read_committed_snapshot=off` setting -[[29](/en/ch8#Kleppmann2014)]. +[^29]. A more commonly used approach to preventing dirty reads is the one illustrated in [Figure 8-4](/en/ch8#fig_transactions_read_committed): for every @@ -698,7 +698,7 @@ Analytic queries and integrity checks check that everything is in order (monitoring for data corruption). These queries are likely to return nonsensical results if they observe parts of the database at different points in time. -*Snapshot isolation* [[36](/en/ch8#Berenson1995)] is the most common +*Snapshot isolation* [^36] is the most common solution to this problem. The idea is that each transaction reads from a *consistent snapshot* of the database—that is, the transaction sees all the data that was committed in the database at the start of the transaction. Even if the data is subsequently changed by another transaction, each @@ -758,7 +758,7 @@ garbage collection process in the database removes any rows marked for deletion space. An update is internally translated into a delete and a insert -[[44](/en/ch8#Alleti2025)]. +[^44]. For example, in [Figure 8-7](/en/ch8#fig_transactions_mvcc), transaction 13 deducts $100 from account 2, changing the balance from $500 to $400. 
The `accounts` table now actually contains two rows for account 2: a row with a balance of $500 which was marked as deleted by transaction 13, and a row with a balance of @@ -776,7 +776,7 @@ version or the other way round, so that queries can internally iterate over all When a transaction reads from the database, transaction IDs are used to decide which row versions it can see and which are invisible. By carefully defining visibility rules, the database can present a consistent snapshot of the database to the application. This works roughly as follows -[[43](/en/ch8#Suzuki2017_ch8)]: +[^43]: 1. At the start of each transaction, the database makes a list of all the other transactions that are in progress (not yet committed or aborted) at that time. Any writes that those @@ -820,7 +820,7 @@ are no longer visible to any transaction, the corresponding index entries can al Many implementation details affect the performance of multi-version concurrency control [[45](/en/ch8#Pavlo2023), [46](/en/ch8#Wu2017)]. For example, PostgreSQL has optimizations for avoiding index updates if different versions of the -same row can fit on the same page [[40](/en/ch8#Momjian2014)]. +same row can fit on the same page [^40]. Some other databases avoid storing full copies of modified rows, and only store differences between versions to save space. @@ -829,7 +829,7 @@ Another approach is used in CouchDB, Datomic, and LMDB. Although they also use B pages of the tree when they are updated, but instead creates a new copy of each modified page. Parent pages, up to the root of the tree, are copied and updated to point to the new versions of their child pages. Any pages that are not affected by a write do not need to be copied, and can be -shared with the new tree [[47](/en/ch8#Prokopov2014)]. +shared with the new tree [^47]. 
With immutable B-trees, every write transaction (or batch of transactions) creates a new B-tree root, and a particular root is a consistent snapshot of the database at the point in time when it @@ -842,28 +842,28 @@ requires a background process for compaction and garbage collection. MVCC is a commonly used implementation technique for databases, and often it is used to implement snapshot isolation. However, different databases sometimes use different terms to refer to the same thing: for example, snapshot isolation is called “repeatable read” in PostgreSQL, and “serializable” -in Oracle [[29](/en/ch8#Kleppmann2014)]. Sometimes different systems +in Oracle [^29]. Sometimes different systems use the same term to mean different things: for example, while in PostgreSQL “repeatable read” means snapshot isolation, in MySQL it means an implementation of MVCC with weaker consistency than -snapshot isolation [[41](/en/ch8#Alvaro2023)]. +snapshot isolation [^41]. The reason for this naming confusion is that the SQL standard doesn’t have the concept of snapshot isolation, because the standard is based on System R’s 1975 definition of isolation levels -[[3](/en/ch8#Gray1976)] and snapshot isolation hadn’t yet been +[^3] and snapshot isolation hadn’t yet been invented then. Instead, it defines repeatable read, which looks superficially similar to snapshot isolation. PostgreSQL calls its snapshot isolation level “repeatable read” because it meets the requirements of the standard, and so they can claim standards compliance. Unfortunately, the SQL standard’s definition of isolation levels is flawed—it is ambiguous, imprecise, and not as implementation-independent as a standard should be -[[36](/en/ch8#Berenson1995)]. Even though several databases +[^36]. Even though several databases implement repeatable read, there are big differences in the guarantees they actually provide, despite being ostensibly standardized -[[29](/en/ch8#Kleppmann2014)]. 
There has been a formal definition of +[^29]. There has been a formal definition of repeatable read in the research literature [[37](/en/ch8#Adya1999), [38](/en/ch8#Bailis2014virtues_ch8)], but most implementations don’t satisfy that formal definition. And to top it off, IBM Db2 uses “repeatable read” to refer to serializability -[[10](/en/ch8#Bailis2013HAT)]. +[^10]. As a result, nobody really knows what repeatable read means. @@ -892,7 +892,7 @@ pattern occurs in various different scenarios: entire page contents to the server, overwriting whatever is currently in the database Because this is such a common problem, a variety of solutions have been developed -[[48](/en/ch8#Svetlov2025)]. +[^48]. ### Atomic write operations @@ -1011,7 +1011,7 @@ If the content has changed and no longer matches `'old content'`, this update wi so you need to check whether the update took effect and retry if necessary. Instead of comparing the full content, you could also use a version number column that you increment on every update, and apply the update only if the current version number hasn’t changed. This approach is sometimes -called *optimistic locking* [[52](/en/ch8#Dogan2020)]. +called *optimistic locking* [^52]. Note that if another transaction has concurrently modified `content`, the new content may not be visible under the MVCC visibility rules (see [“Visibility rules for observing a consistent snapshot”](/en/ch8#sec_transactions_mvcc_visibility)). Many @@ -1082,7 +1082,7 @@ been violated. ### Characterizing write skew -This anomaly is called *write skew* [[36](/en/ch8#Berenson1995)]. It +This anomaly is called *write skew* [^36]. It is neither a dirty write nor a lost update, because the two transactions are updating two different objects (Aaliyah’s and Bryce’s on-call records, respectively). 
It is less obvious that a conflict occurred here, but it’s definitely a race condition: if the two transactions had run one after another, the @@ -1101,7 +1101,7 @@ options are more restricted: * The automatic detection of lost updates that you find in some implementations of snapshot isolation unfortunately doesn’t help either: write skew is not automatically detected in PostgreSQL’s repeatable read, MySQL/InnoDB’s repeatable read, Oracle’s serializable, or SQL - Server’s snapshot isolation level [[29](/en/ch8#Kleppmann2014)]. + Server’s snapshot isolation level [^29]. Automatically preventing write skew requires true serializable isolation (see [“Serializability”](/en/ch8#sec_transactions_serializability)). * Some databases allow you to configure constraints, which are then enforced by the database (e.g., @@ -1109,7 +1109,7 @@ options are more restricted: specify that at least one doctor must be on call, you would need a constraint that involves multiple objects. Most databases do not have built-in support for such constraints, but you may be able to implement them with triggers or materialized views, as discussed in - [“Consistency”](/en/ch8#sec_transactions_acid_consistency) [[12](/en/ch8#Andrews2004)]. + [“Consistency”](/en/ch8#sec_transactions_acid_consistency) [^12]. * If you can’t use a serializable isolation level, the second-best option in this case is probably to explicitly lock the rows that the transaction depends on. In the doctors example, you could write something like the following: @@ -1139,7 +1139,7 @@ more situations in which it can occur. Here are some more examples: Meeting room booking system : Say you want to enforce that there cannot be two bookings for the same meeting room at the same - time [[55](/en/ch8#Terry1995_ch8)]. + time [^55]. 
When someone wants to make a booking, you first check for any conflicting bookings (i.e., bookings for the same room with an overlapping time range), and if none are found, you create the meeting (see [Example 8-2](/en/ch8#fig_transactions_meeting_rooms)). @@ -1216,10 +1216,10 @@ returned in step 1, so we could make the transaction safe and avoid write skew b in step 1 (`SELECT FOR UPDATE`). However, the other four examples are different: they check for the *absence* of rows matching some search condition, and the write *adds* a row matching the same condition. If the query in step 1 doesn’t return any rows, `SELECT FOR UPDATE` can’t attach locks to -anything [[56](/en/ch8#Schoenig2021)]. +anything [^56]. This effect, where a write in one transaction changes the result of a search query in another -transaction, is called a *phantom* [[4](/en/ch8#Eswaran1976)]. +transaction, is called a *phantom* [^4]. Snapshot isolation avoids phantoms in read-only queries, but in read-write transactions like the examples we discussed, phantoms can lead to particularly tricky cases of write skew. The SQL generated by ORMs is also prone to write skew @@ -1244,7 +1244,7 @@ to prevent bookings on the same room and time range from being modified concurre This approach is called *materializing conflicts*, because it takes a phantom and turns it into a lock conflict on a concrete set of rows that exist in the database -[[14](/en/ch8#Fekete2005)]. Unfortunately, it can be hard and +[^14]. Unfortunately, it can be hard and error-prone to figure out how to materialize conflicts, and it’s ugly to let a concurrency control mechanism leak into the application data model. For those reasons, materializing conflicts should be considered a last resort if no alternative is possible. 
A serializable isolation level is much @@ -1263,12 +1263,12 @@ a sad situation: particular isolation level—especially in a large application, where you might not be aware of all the things that may be happening concurrently. * There are no good tools to help us detect race conditions. In principle, static analysis may - help [[33](/en/ch8#Jorwekar2007_ch8)], but research techniques have not + help [^33], but research techniques have not yet found their way into practical use. Testing for concurrency issues is hard, because they are usually nondeterministic—problems only occur if you get unlucky with the timing. This is not a new problem—it has been like this since the 1970s, when weak isolation levels were -first introduced [[3](/en/ch8#Gray1976)]. All along, the answer +first introduced [^3]. All along, the answer from researchers has been simple: use *serializable* isolation! Serializable isolation is the strongest isolation level. It guarantees that even @@ -1297,7 +1297,7 @@ isolation is by definition serializable. Even though this seems like an obvious idea, it was only in the 2000s that database designers decided that a single-threaded loop for executing transactions was feasible -[[57](/en/ch8#Stonebraker2007_ch8)]. +[^57]. If multi-threaded concurrency was considered essential for getting good performance during the previous 30 years, what changed to make single-threaded execution possible? @@ -1353,7 +1353,7 @@ in order to get reasonable performance. For this reason, systems with single-threaded serial transaction processing don’t allow interactive multi-statement transactions. Instead, the application must either limit itself to transactions containing a single statement, or submit the entire transaction code to the database ahead of time, -as a *stored procedure* [[61](/en/ch8#Hugg2014debunking)]. +as a *stored procedure* [^61]. 
The differences between interactive transactions and stored procedures is illustrated in [Figure 8-9](/en/ch8#fig_transactions_stored_proc). Provided that all data required by a transaction is in memory, the @@ -1381,7 +1381,7 @@ SQL standard (SQL/PSM) since 1999. They have gained a somewhat bad reputation, f badly written code in an application server. * In a multitenant system that allows tenants to write their own stored procedures, it’s a security risk to execute untrusted code in the same process as the database kernel - [[62](/en/ch8#Zhou2025)]. + [^62]. However, those issues can be overcome. Modern implementations of stored procedures have abandoned PL/SQL and use existing general-purpose programming languages instead: VoltDB uses Java or Groovy, @@ -1418,7 +1418,7 @@ In order to scale to multiple CPU cores, and multiple nodes, you can shard your so that each transaction only needs to read and write data within a single shard, then each shard can have its own transaction processing thread running independently from the others. In this case, you can give each CPU core its own shard, which allows your transaction throughput to scale linearly -with the number of CPU cores [[59](/en/ch8#Kallman2008)]. +with the number of CPU cores [^59]. However, for any transaction that needs to access multiple shards, the database must coordinate the transaction across all the shards that it touches. The stored procedure needs to be performed in @@ -1427,9 +1427,9 @@ lock-step across all shards to ensure serializability across the whole system. Since cross-shard transactions have additional coordination overhead, they are vastly slower than single-shard transactions. VoltDB reports a throughput of about 1,000 cross-shard writes per second, which is orders of magnitude below its single-shard throughput and cannot be increased by adding -more machines [[61](/en/ch8#Hugg2014debunking)]. More recent research +more machines [^61]. 
More recent research has explored ways of making multi-shard transactions more scalable -[[63](/en/ch8#Zhou2022)]. +[^63]. Whether transactions can be single-shard depends very much on the structure of the data used by the application. Simple key-value data can often be sharded very easily, but data with multiple @@ -1489,7 +1489,7 @@ it protects against all the race conditions discussed earlier, including lost up 2PL is used by the serializable isolation level in MySQL (InnoDB) and SQL Server, and the repeatable read isolation level in Db2 -[[29](/en/ch8#Kleppmann2014)]. +[^29]. The blocking of readers and writers is implemented by having a lock on each object in the database. The lock can either be in *shared mode* or in *exclusive mode* (also known as a @@ -1556,7 +1556,7 @@ time range. (It’s okay to concurrently insert bookings for other rooms, or for different time that doesn’t affect the proposed booking.) How do we implement this? Conceptually, we need a *predicate lock* -[[4](/en/ch8#Eswaran1976)]. It works similarly to the +[^4]. It works similarly to the shared/exclusive lock described earlier, but rather than belonging to a particular object (e.g., one row in a table), it belongs to all objects that match some search condition, such as: @@ -1636,11 +1636,11 @@ comparatively new: it was first described in 2008 [65](/en/ch8#Cahill2009)]. Today SSI and similar algorithms are used in single-node databases (the serializable isolation level -in PostgreSQL [[54](/en/ch8#Ports2012)], SQL Server’s In-Memory -OLTP/Hekaton [[66](/en/ch8#Diaconu2013)], and HyPer -[[67](/en/ch8#Neumann2015)]), -distributed databases (CockroachDB [[5](/en/ch8#Taft2020_ch8)] and -FoundationDB [[8](/en/ch8#Zhou2021_ch8)]), and embedded storage +in PostgreSQL [^54], SQL Server’s In-Memory +OLTP/Hekaton [^66], and HyPer +[^67]), +distributed databases (CockroachDB [^5] and +FoundationDB [^8]), and embedded storage engines such as BadgerDB. 
### Pessimistic versus optimistic concurrency control @@ -1663,9 +1663,9 @@ isolation was violated); if so, the transaction is aborted and has to be retried that executed serializably are allowed to commit. Optimistic concurrency control is an old idea -[[68](/en/ch8#Badal1979)], +[^68], and its advantages and disadvantages have been debated for a long time -[[69](/en/ch8#Agrawal1987)]. +[^69]. It performs badly if there is high contention (many transactions trying to access the same objects), as this leads to a high proportion of transactions needing to abort. If the system is already close to its maximum throughput, the additional transaction load from retried transactions can make @@ -1802,9 +1802,9 @@ serializable isolation. Compared to non-serializable snapshot isolation, the need to check for serializability violations introduces some performance overheads. How significant these overheads are is a matter of debate: some believe that serializability checking is not worth it -[[70](/en/ch8#Brooker2024snapshot)], +[^70], while others believe that the performance of serializability is now so good that there is no need to -use the weaker snapshot isolation any more [[67](/en/ch8#Neumann2015)]. +use the weaker snapshot isolation any more [^67]. The rate of aborts significantly affects the overall performance of SSI. For example, a transaction that reads and writes data over a long period of time is likely to run into conflicts and abort, so @@ -1819,7 +1819,7 @@ algorithms we have seen apply to both single-node and distributed databases: alt challenges in making concurrency control algorithms scalable (for example, performing distributed serializability checking for SSI), the high-level ideas for distributed concurrency control are similar to single-node concurrency control -[[8](/en/ch8#Zhou2021_ch8)]. +[^8]. Consistency and durability also don’t change much when we move to distributed transactions. However, atomicity requires more care. 
@@ -1834,7 +1834,7 @@ writes from that transaction are rolled back. Thus, on a single node, transaction commitment crucially depends on the *order* in which data is durably written to disk: first the data, then the commit record -[[22](/en/ch8#Pillai2014)]. +[^22]. The key deciding moment for whether the transaction commits or aborts is the moment at which the disk finishes writing the commit record: before that moment, it is still possible to abort (due to a crash), but after that moment, the transaction is committed (even if the database crashes). Thus, it @@ -1883,7 +1883,7 @@ is a classic algorithm in distributed databases [71](/en/ch8#Lindsay1979_ch8), [72](/en/ch8#Mohan1986)]. 2PC is used internally in some databases and also made available to applications in the form of *XA transactions* -[[73](/en/ch8#XASpec1991)] +[^73] (which are supported by the Java Transaction API, for example) or via WS-AtomicTransaction for SOAP web services [[74](/en/ch8#Neto2008), @@ -1920,7 +1920,7 @@ asks the bride and groom individually whether each wants to marry the other, and the answer “I do” from both. After receiving both acknowledgments, the minister pronounces the couple husband and wife: the transaction is committed, and the happy fact is broadcast to all attendees. If either bride or groom does not say “yes,” the ceremony is aborted -[[76](/en/ch8#Gray1981_ch8)]. +[^76]. ### A system of promises @@ -2036,7 +2036,7 @@ more than they can deliver [[78](/en/ch8#Hohpe2005), [80](/en/ch8#Oliver2011), [81](/en/ch8#Rahien2014)]. Many cloud services choose not to implement distributed transactions due to the operational -problems they engender [[82](/en/ch8#Vasters2012)]. +problems they engender [^82]. Some implementations of distributed transactions carry a heavy performance penalty. 
Much of the performance cost inherent in two-phase commit is due to the additional disk forcing (`fsync`) that @@ -2093,7 +2093,7 @@ atomic commit protocol that allows such heterogeneous distributed transactions. ### XA transactions *X/Open XA* (short for *eXtended Architecture*) is a standard for implementing two-phase commit -across heterogeneous technologies [[73](/en/ch8#XASpec1991)]. +across heterogeneous technologies [^73]. It was introduced in 1991 and has been widely implemented: XA is supported by many traditional relational databases (including PostgreSQL, MySQL, Db2, SQL Server, and Oracle) and message brokers (including ActiveMQ, HornetQ, MSMQ, and IBM MQ). @@ -2172,7 +2172,7 @@ most likely needs to be done under high stress and time pressure during a seriou Many XA implementations have an emergency escape hatch called *heuristic decisions*: allowing a participant to unilaterally decide to abort or commit an in-doubt transaction without a definitive -decision from the coordinator [[73](/en/ch8#XASpec1991)]. To be clear, +decision from the coordinator [^73]. To be clear, *heuristic* here is a euphemism for *probably breaking atomicity*, since the heuristic decision violates the system of promises in two-phase commit. Thus, heuristic decisions are intended only for getting out of catastrophic situations, and not for regular use. @@ -2214,12 +2214,12 @@ As explained previously, there is a big difference between distributed transacti multiple heterogeneous storage technologies, and those that are internal to a system—i.e., where all the participating nodes are shards of the same database running the same software. 
Such internal distributed transactions are a defining feature of “NewSQL” databases such as -CockroachDB [[5](/en/ch8#Taft2020_ch8)], -TiDB [[6](/en/ch8#Huang2020)], -Spanner [[7](/en/ch8#Corbett2012_ch8)], -FoundationDB [[8](/en/ch8#Zhou2021_ch8)], and YugabyteDB, for +CockroachDB [^5], +TiDB [^6], +Spanner [^7], +FoundationDB [^8], and YugabyteDB, for example. Some message brokers such as Kafka also support internal distributed transactions -[[85](/en/ch8#Wang2021)]. +[^85]. Many of these systems use 2-phase commit to ensure atomicity of transactions that write to multiple shards, and yet they don’t suffer the same problems as XA transactions. The reason is that because @@ -2320,12 +2320,12 @@ discussing various examples of race conditions, summarized in [Table 8-1](/en/c Table 8-1. Summary of anomalies that can occur at various isolation levels -| Isolation level | Dirty reads | Read skew | Phantom reads | Lost updates | Write skew | -| --- | --- | --- | --- | --- | --- | -| Read uncommitted | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | -| Read committed | ✓ Prevented | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | -| Snapshot isolation | ✓ Prevented | ✓ Prevented | ✓ Prevented | ? Depends | ✗ Possible | -| Serializable | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | +| Isolation level | Dirty reads | Read skew | Phantom reads | Lost updates | Write skew | +|--------------------|-------------|-------------|---------------|--------------|-------------| +| Read uncommitted | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | +| Read committed | ✓ Prevented | ✗ Possible | ✗ Possible | ✗ Possible | ✗ Possible | +| Snapshot isolation | ✓ Prevented | ✓ Prevented | ✓ Prevented | ? Depends | ✗ Possible | +| Serializable | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | ✓ Prevented | Dirty reads : One client reads another client’s writes before they have been committed. 
The read committed @@ -2390,482 +2390,93 @@ is used. ##### Footnotes + ##### References -[[1](/en/ch8#Murdoch2021-marker)] Steven J. Murdoch. -[What -went wrong with Horizon: learning from the Post Office Trial](https://www.benthamsgaze.org/2021/07/15/what-went-wrong-with-horizon-learning-from-the-post-office-trial/). *benthamsgaze.org*, July 2021. -Archived at [perma.cc/CNM4-553F](https://perma.cc/CNM4-553F) -[[2](/en/ch8#Chamberlin1981-marker)] Donald D. Chamberlin, Morton M. Astrahan, -Michael W. Blasgen, James N. Gray, W. Frank King, Bruce G. Lindsay, Raymond Lorie, James W. Mehl, -Thomas G. Price, Franco Putzolu, Patricia Griffiths Selinger, Mario Schkolnick, Donald R. Slutz, -Irving L. Traiger, Bradford W. Wade, and Robert A. Yost. -[A History and Evaluation of System -R](https://dsf.berkeley.edu/cs262/2005/SystemR.pdf). *Communications of the ACM*, volume 24, issue 10, pages 632–646, October 1981. -[doi:10.1145/358769.358784](https://doi.org/10.1145/358769.358784) -[[3](/en/ch8#Gray1976-marker)] Jim N. Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger. -[Granularity of -Locks and Degrees of Consistency in a Shared Data Base](https://citeseerx.ist.psu.edu/pdf/e127f0a6a912bb9150ecfe03c0ebf7fbc289a023). in *Modelling in Data Base Management -Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management -Systems*, edited by G. M. Nijssen, pages 364–394, Elsevier/North Holland Publishing, 1976. Also -in *Readings in Database Systems*, 4th edition, edited by Joseph M. Hellerstein and Michael -Stonebraker, MIT Press, 2005. ISBN: 978-0-262-69314-1 - -[[4](/en/ch8#Eswaran1976-marker)] Kapali P. Eswaran, Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger. 
-[The -Notions of Consistency and Predicate Locks in a Database System](https://jimgray.azurewebsites.net/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf?from=https://research.microsoft.com/en-us/um/people/gray/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf). *Communications of the -ACM*, volume 19, issue 11, pages 624–633, November 1976. -[doi:10.1145/360363.360369](https://doi.org/10.1145/360363.360369) - -[[5](/en/ch8#Taft2020_ch8-marker)] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan -VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul -Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. -[CockroachDB: The Resilient -Geo-Distributed SQL Database](https://dl.acm.org/doi/pdf/10.1145/3318464.3386134). At *ACM SIGMOD International Conference on Management of -Data* (SIGMOD), pages 1493–1509, June 2020. -[doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134) - -[[6](/en/ch8#Huang2020-marker)] Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, -Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, -Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan -Pei, and Xin Tang. -[TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf). -*Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084. -[doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535) - -[[7](/en/ch8#Corbett2012_ch8-marker)] James C. 
Corbett, Jeffrey Dean, -Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, -Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, -Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, -Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. -[Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). -At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), -October 2012. - -[[8](/en/ch8#Zhou2021_ch8-marker)] Jingyu Zhou, Meng Xu, Alexander -Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty -Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, -Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. -[FoundationDB: A Distributed Unbundled -Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data* -(SIGMOD), June 2021. -[doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559) - -[[9](/en/ch8#Harder1983-marker)] Theo Härder and Andreas Reuter. -[Principles of -Transaction-Oriented Database Recovery](https://citeseerx.ist.psu.edu/pdf/11ef7c142295aeb1a28a0e714c91fc8d610c3047). *ACM Computing Surveys*, volume 15, issue 4, -pages 287–317, December 1983. [doi:10.1145/289.291](https://doi.org/10.1145/289.291) - -[[10](/en/ch8#Bailis2013HAT-marker)] Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph -M. Hellerstein, and Ion Stoica. -[HAT, not CAP: -Towards Highly Available Transactions](https://www.usenix.org/system/files/conference/hotos13/hotos13-final80.pdf). At *14th USENIX Workshop on Hot Topics in Operating -Systems* (HotOS), May 2013. - -[[11](/en/ch8#Fox1997-marker)] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric -A. Brewer, and Paul Gauthier. 
-[Cluster-Based Scalable Network -Services](https://people.eecs.berkeley.edu/~brewer/cs262b/TACC.pdf). At *16th ACM Symposium on Operating Systems Principles* (SOSP), October 1997. -[doi:10.1145/268998.266662](https://doi.org/10.1145/268998.266662) - -[[12](/en/ch8#Andrews2004-marker)] Tony Andrews. -[Enforcing -Complex Constraints in Oracle](https://tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html). *tonyandrews.blogspot.co.uk*, October 2004. Archived at -[archive.org](https://web.archive.org/web/20220201190625/https%3A//tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html) - -[[13](/en/ch8#Bernstein1987_ch8-marker)] Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. -[*Concurrency Control and -Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available -online at [*microsoft.com*](https://www.microsoft.com/en-us/research/people/philbe/book/). - -[[14](/en/ch8#Fekete2005-marker)] Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil, -Patrick O’Neil, and Dennis Shasha. -[Making -Snapshot Isolation Serializable](https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2009/Papers/p492-fekete.pdf). *ACM Transactions on Database Systems*, -volume 30, issue 2, pages 492–528, June 2005. -[doi:10.1145/1071610.1071615](https://doi.org/10.1145/1071610.1071615) - -[[15](/en/ch8#Zheng2013-marker)] Mai Zheng, Joseph Tucek, Feng Qin, and Mark Lillibridge. -[Understanding -the Robustness of SSDs Under Power Fault](https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf). At *11th USENIX Conference on File and Storage -Technologies* (FAST), February 2013. - -[[16](/en/ch8#Denness2015-marker)] Laurie Denness. -[SSDs: A Gift and a Curse](https://laur.ie/blog/2015/06/ssds-a-gift-and-a-curse/). -*laur.ie*, June 2015. Archived at [perma.cc/6GLP-BX3T](https://perma.cc/6GLP-BX3T) - -[[17](/en/ch8#Surak2015-marker)] Adam Surak. 
-[When -Solid State Drives Are Not That Solid](https://www.algolia.com/blog/engineering/when-solid-state-drives-are-not-that-solid). *blog.algolia.com*, June 2015. -Archived at [perma.cc/CBR9-QZEE](https://perma.cc/CBR9-QZEE) - -[[18](/en/ch8#HPE2019_ch8-marker)] Hewlett Packard Enterprise. -[Bulletin: -(Revision) HPE SAS Solid State Drives - Critical Firmware Upgrade Required for Certain HPE SAS -Solid State Drive Models to Prevent Drive Failure at 32,768 Hours of Operation](https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us). -*support.hpe.com*, November 2019. -Archived at [perma.cc/CZR4-AQBS](https://perma.cc/CZR4-AQBS) - -[[19](/en/ch8#Ringer2018-marker)] Craig Ringer et al. -[PostgreSQL’s -handling of fsync() errors is unsafe and risks data loss at least on XFS](https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com). Email thread on -pgsql-hackers mailing list, *postgresql.org*, March 2018. -Archived at [perma.cc/5RKU-57FL](https://perma.cc/5RKU-57FL) - -[[20](/en/ch8#Rebello2020-marker)] Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan, -Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. -[Can Applications Recover -from fsync Failures?](https://www.usenix.org/conference/atc20/presentation/rebello) At *USENIX Annual Technical Conference* (ATC), July 2020. - -[[21](/en/ch8#Pillai2015-marker)] Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, -Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. -[Crash Consistency: Rethinking the -Fundamental Abstractions of the File System](https://dl.acm.org/doi/pdf/10.1145/2800695.2801719). *ACM Queue*, volume 13, issue 7, pages 20–28, July 2015. -[doi:10.1145/2800695.2801719](https://doi.org/10.1145/2800695.2801719) - -[[22](/en/ch8#Pillai2014-marker)] Thanumalayan Sankaranarayana Pillai, Vijay -Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. 
Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. -[All File -Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf). -At *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014. - -[[23](/en/ch8#Siebenmann2016-marker)] Chris Siebenmann. -[Unix’s File Durability -Problem](https://utcc.utoronto.ca/~cks/space/blog/unix/FileSyncProblem). *utcc.utoronto.ca*, April 2016. -Archived at [perma.cc/VSS8-5MC4](https://perma.cc/VSS8-5MC4) - -[[24](/en/ch8#Ganesan2017-marker)] Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. -Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. -[Redundancy -Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and -Corruptions](https://www.usenix.org/conference/fast17/technical-sessions/presentation/ganesan). At *15th USENIX Conference on File and Storage Technologies* (FAST), -February 2017. - -[[25](/en/ch8#Bairavasundaram2008-marker)] Lakshmi N. Bairavasundaram, Garth R. -Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. -[An -Analysis of Data Corruption in the Storage Stack](https://www.usenix.org/legacy/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf). At *6th USENIX Conference on File and -Storage Technologies* (FAST), February 2008. - -[[26](/en/ch8#Schroeder2016_ch8-marker)] Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. -[Flash -Reliability in Production: The Expected and the Unexpected](https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder). At *14th USENIX Conference on -File and Storage Technologies* (FAST), February 2016. - -[[27](/en/ch8#Allison2015-marker)] Don Allison. -[SSD Storage – Ignorance of Technology Is No -Excuse](https://blog.korelogic.com/blog/2015/03/24). *blog.korelogic.com*, March 2015. 
-Archived at [perma.cc/9QN4-9SNJ](https://perma.cc/9QN4-9SNJ) - -[[28](/en/ch8#MahUng2015-marker)] Gordon Mah Ung. -[Debunked: -Your SSD won’t lose data if left unplugged after all](https://www.pcworld.com/article/427602/debunked-your-ssd-wont-lose-data-if-left-unplugged-after-all.html). *pcworld.com*, May 2015. -Archived at [perma.cc/S46H-JUDU](https://perma.cc/S46H-JUDU) - -[[29](/en/ch8#Kleppmann2014-marker)] Martin Kleppmann. -[Hermitage: -Testing the ‘I’ in ACID](https://martin.kleppmann.com/2014/11/25/hermitage-testing-the-i-in-acid.html). *martin.kleppmann.com*, November 2014. -Archived at [perma.cc/KP2Y-AQGK](https://perma.cc/KP2Y-AQGK) - -[[30](/en/ch8#Warszawski2017-marker)] Todd Warszawski and Peter Bailis. -[ACIDRain: Concurrency-Related Attacks -on Database-Backed Web Applications](http://www.bailis.org/papers/acidrain-sigmod2017.pdf). At *ACM International Conference on Management of -Data* (SIGMOD), May 2017. -[doi:10.1145/3035918.3064037](https://doi.org/10.1145/3035918.3064037) - -[[31](/en/ch8#DAgosta2014-marker)] Tristan D’Agosta. -[BTC Stolen from Poloniex](https://bitcointalk.org/index.php?topic=499580). -*bitcointalk.org*, March 2014. -Archived at [perma.cc/YHA6-4C5D](https://perma.cc/YHA6-4C5D) - -[[32](/en/ch8#bitcointhief2014-marker)] bitcointhief2. -[How -I Stole Roughly 100 BTC from an Exchange and How I Could Have Stolen More!](https://www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and/) *reddit.com*, -February 2014. Archived at -[archive.org](https://web.archive.org/web/20250118042610/https%3A//www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and/) - -[[33](/en/ch8#Jorwekar2007_ch8-marker)] Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan. -[Automating the -Detection of Snapshot Isolation Anomalies](https://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf). 
At *33rd International Conference on Very Large -Data Bases* (VLDB), September 2007. - -[[34](/en/ch8#Melanson2014-marker)] Michael Melanson. -[Transactions: -The Limits of Isolation](https://www.michaelmelanson.net/posts/transactions-the-limits-of-isolation/). *michaelmelanson.net*, November 2014. -Archived at [perma.cc/RG5R-KMYZ](https://perma.cc/RG5R-KMYZ) - -[[35](/en/ch8#Kim2014ACH-marker)] Edward Kim. -[How -ACH works: A developer perspective — Part 1](https://engineering.gusto.com/how-ach-works-a-developer-perspective-part-1-339d3e7bea1). *engineering.gusto.com*, April 2014. -Archived at [perma.cc/7B2H-PU94](https://perma.cc/7B2H-PU94) - -[[36](/en/ch8#Berenson1995-marker)] Hal Berenson, Philip A. Bernstein, Jim N. Gray, -Jim Melton, Elizabeth O’Neil, and Patrick O’Neil. -[A Critique of -ANSI SQL Isolation Levels](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf). At *ACM International Conference on Management of Data* (SIGMOD), -May 1995. [doi:10.1145/568271.223785](https://doi.org/10.1145/568271.223785) - -[[37](/en/ch8#Adya1999-marker)] Atul Adya. [Weak -Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions](https://pmg.csail.mit.edu/papers/adya-phd.pdf). -PhD Thesis, Massachusetts Institute of Technology, March 1999. -Archived at [perma.cc/E97M-HW5Q](https://perma.cc/E97M-HW5Q) - -[[38](/en/ch8#Bailis2014virtues_ch8-marker)] Peter Bailis, Aaron Davidson, Alan Fekete, Ali -Ghodsi, Joseph M. Hellerstein, and Ion Stoica. -[Highly Available Transactions: Virtues and -Limitations](https://www.vldb.org/pvldb/vol7/p181-bailis.pdf). At *40th International Conference on Very Large Data Bases* (VLDB), -September 2014. - -[[39](/en/ch8#Crooks2017-marker)] Natacha Crooks, Youer Pu, Lorenzo Alvisi, and Allen Clement. -[Seeing is Believing: A -Client-Centric Specification of Database Isolation](https://www.cs.cornell.edu/lorenzo/papers/Crooks17Seeing.pdf). 
At *ACM Symposium on Principles of -Distributed Computing* (PODC), pages 73–82, July 2017. -[doi:10.1145/3087801.3087802](https://doi.org/10.1145/3087801.3087802) - -[[40](/en/ch8#Momjian2014-marker)] Bruce Momjian. -[MVCC Unmasked](https://momjian.us/main/writings/pgsql/mvcc.pdf). *momjian.us*, -July 2014. Archived at [perma.cc/KQ47-9GYB](https://perma.cc/KQ47-9GYB) - -[[41](/en/ch8#Alvaro2023-marker)] Peter Alvaro and Kyle Kingsbury. -[MySQL 8.0.34](https://jepsen.io/analyses/mysql-8.0.34). *jepsen.io*, December 2023. -Archived at [perma.cc/HGE2-Z878](https://perma.cc/HGE2-Z878) - -[[42](/en/ch8#Rogov2023-marker)] Egor Rogov. -[PostgreSQL 14 Internals](https://postgrespro.com/community/books/internals). -*postgrespro.com*, April 2023. -Archived at [perma.cc/FRK2-D7WB](https://perma.cc/FRK2-D7WB) - -[[43](/en/ch8#Suzuki2017_ch8-marker)] Hironobu Suzuki. -[The Internals of PostgreSQL](https://www.interdb.jp/pg/). -*interdb.jp*, 2017. - -[[44](/en/ch8#Alleti2025-marker)] Rohan Reddy Alleti. -[Internals -of MVCC in Postgres: Hidden costs of Updates vs Inserts](https://medium.com/%40rohanjnr44/internals-of-mvcc-in-postgres-hidden-costs-of-updates-vs-inserts-381eadd35844). *medium.com*, March 2025. -Archived at [perma.cc/3ACX-DFXT](https://perma.cc/3ACX-DFXT) - -[[45](/en/ch8#Pavlo2023-marker)] Andy Pavlo and Bohan Zhang. -[The -Part of PostgreSQL We Hate the Most](https://www.cs.cmu.edu/~pavlo/blog/2023/04/the-part-of-postgresql-we-hate-the-most.html). *cs.cmu.edu*, April 2023. -Archived at [perma.cc/XSP6-3JBN](https://perma.cc/XSP6-3JBN) - -[[46](/en/ch8#Wu2017-marker)] Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo. -[An empirical evaluation of in-memory -multi-version concurrency control](https://vldb.org/pvldb/vol10/p781-Wu.pdf). *Proceedings of the VLDB Endowment*, volume 10, issue -7, pages 781–792, March 2017. -[doi:10.14778/3067421.3067427](https://doi.org/10.14778/3067421.3067427) - -[[47](/en/ch8#Prokopov2014-marker)] Nikita Prokopov. 
-[Unofficial Guide to Datomic -Internals](https://tonsky.me/blog/unofficial-guide-to-datomic-internals/). *tonsky.me*, May 2014. - -[[48](/en/ch8#Svetlov2025-marker)] Daniil Svetlov. -[A Practical Guide to Taming Postgres Isolation -Anomalies](https://dansvetlov.me/postgres-anomalies/). *dansvetlov.me*, March 2025. -Archived at [perma.cc/L7LE-TDLS](https://perma.cc/L7LE-TDLS) - -[[49](/en/ch8#Wiger2010-marker)] Nate Wiger. -[An Atomic Rant](https://nateware.com/2010/02/18/an-atomic-rant/). *nateware.com*, -February 2010. Archived at [perma.cc/5ZYB-PE44](https://perma.cc/5ZYB-PE44) - -[[50](/en/ch8#Coglan2020-marker)] James Coglan. -[Reading and writing, -part 3: web applications](https://blog.jcoglan.com/2020/10/12/reading-and-writing-part-3/). *blog.jcoglan.com*, October 2020. -Archived at [perma.cc/A7EK-PJVS](https://perma.cc/A7EK-PJVS) - -[[51](/en/ch8#Bailis2015_ch8-marker)] Peter Bailis, Alan Fekete, Michael J. Franklin, -Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. -[Feral Concurrency Control: An -Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf). At *ACM International Conference on -Management of Data* (SIGMOD), June 2015. -[doi:10.1145/2723372.2737784](https://doi.org/10.1145/2723372.2737784) - -[[52](/en/ch8#Dogan2020-marker)] Jaana Dogan. -[Things -I Wished More Developers Knew About Databases](https://rakyll.medium.com/things-i-wished-more-developers-knew-about-databases-2d0178464f78). *rakyll.medium.com*, April 2020. -Archived at [perma.cc/6EFK-P2TD](https://perma.cc/6EFK-P2TD) - -[[53](/en/ch8#Cahill2008-marker)] Michael J. Cahill, Uwe Röhm, and Alan Fekete. -[Serializable -Isolation for Snapshot Databases](https://www.cs.cornell.edu/~sowell/dbpapers/serializable_isolation.pdf). At *ACM International Conference on Management of Data* -(SIGMOD), June 2008. -[doi:10.1145/1376616.1376690](https://doi.org/10.1145/1376616.1376690) - -[[54](/en/ch8#Ports2012-marker)] Dan R. K. 
Ports and Kevin Grittner. -[Serializable Snapshot Isolation in PostgreSQL](https://drkp.net/papers/ssi-vldb12.pdf). -At *38th International Conference on Very Large Databases* (VLDB), August 2012. - -[[55](/en/ch8#Terry1995_ch8-marker)] Douglas B. Terry, Marvin M. Theimer, -Karin Petersen, Alan J. Demers, Mike J. Spreitzer and Carl H. Hauser. -[Managing -Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](https://pdos.csail.mit.edu/6.824/papers/bayou-conflicts.pdf). At -*15th ACM Symposium on Operating Systems Principles* (SOSP), December 1995. -[doi:10.1145/224056.224070](https://doi.org/10.1145/224056.224070) - -[[56](/en/ch8#Schoenig2021-marker)] Hans-Jürgen Schönig. -[Constraints -over multiple rows in PostgreSQL](https://www.cybertec-postgresql.com/en/postgresql-constraints-over-multiple-rows/). *cybertec-postgresql.com*, June 2021. -Archived at [perma.cc/2TGH-XUPZ](https://perma.cc/2TGH-XUPZ) - -[[57](/en/ch8#Stonebraker2007_ch8-marker)] Michael Stonebraker, Samuel Madden, -Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. -[The End of an -Architectural Era (It’s Time for a Complete Rewrite)](https://vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf). At *33rd International Conference on -Very Large Data Bases* (VLDB), September 2007. - -[[58](/en/ch8#Hugg2014streaming-marker)] John Hugg. -[H-Store/VoltDB Architecture vs. CEP Systems -and Newer Streaming Architectures](https://www.youtube.com/watch?v=hD5M4a1UVz8). At *Data @Scale Boston*, November 2014. - -[[59](/en/ch8#Kallman2008-marker)] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew -Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang -Zhang, John Hugg, and Daniel J. Abadi. -[H-Store: A High-Performance, Distributed Main -Memory Transaction Processing System](https://www.vldb.org/pvldb/vol1/1454211.pdf). *Proceedings of the VLDB Endowment*, volume 1, -issue 2, pages 1496–1499, August 2008. 
- -[[60](/en/ch8#Hickey2012-marker)] Rich Hickey. -[The Architecture of Datomic](https://www.infoq.com/articles/Architecture-Datomic/). -*infoq.com*, November 2012. -Archived at [perma.cc/5YWU-8XJK](https://perma.cc/5YWU-8XJK) - -[[61](/en/ch8#Hugg2014debunking-marker)] John Hugg. -[Debunking Myths -About the VoltDB In-Memory Database](https://dzone.com/articles/debunking-myths-about-voltdb). *dzone.com*, May 2014. -Archived at [perma.cc/2Z9N-HPKF](https://perma.cc/2Z9N-HPKF) - -[[62](/en/ch8#Zhou2025-marker)] Xinjing Zhou, Viktor Leis, Xiangyao Yu, and Michael Stonebraker. -[OLTP Through the Looking Glass 16 -Years Later: Communication is the New Bottleneck](https://www.vldb.org/cidrdb/papers/2025/p17-zhou.pdf). At *15th Annual Conference on Innovative -Data Systems Research* (CIDR), January 2025. - -[[63](/en/ch8#Zhou2022-marker)] Xinjing Zhou, Xiangyao Yu, Goetz Graefe, and Michael Stonebraker. -[Lotus: scalable multi-partition -transactions on single-threaded partitioned databases](https://www.vldb.org/pvldb/vol15/p2939-zhou.pdf). *Proceedings of the VLDB -Endowment* (PVLDB), volume 15, issue 11, pages 2939–2952, July 2022. -[doi:10.14778/3551793.3551843](https://doi.org/10.14778/3551793.3551843) - -[[64](/en/ch8#Hellerstein2007_ch8-marker)] Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton. -[Architecture of a Database System](https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf). -*Foundations and Trends in Databases*, volume 1, issue 2, pages 141–259, November 2007. -[doi:10.1561/1900000002](https://doi.org/10.1561/1900000002) - -[[65](/en/ch8#Cahill2009-marker)] Michael J. Cahill. -[Serializable -Isolation for Snapshot Databases](https://ses.library.usyd.edu.au/bitstream/handle/2123/5353/michael-cahill-2009-thesis.pdf). PhD Thesis, University of Sydney, July 2009. 
-Archived at [perma.cc/727J-NTMP](https://perma.cc/727J-NTMP) - -[[66](/en/ch8#Diaconu2013-marker)] Cristian Diaconu, Craig Freedman, -Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. -[Hekaton: -SQL Server’s Memory-Optimized OLTP Engine](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/06/Hekaton-Sigmod2013-final.pdf). At *ACM SIGMOD International Conference on -Management of Data* (SIGMOD), pages 1243–1254, June 2013. -[doi:10.1145/2463676.2463710](https://doi.org/10.1145/2463676.2463710) - -[[67](/en/ch8#Neumann2015-marker)] Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. -[Fast Serializable Multi-Version Concurrency -Control for Main-Memory Database Systems](https://db.in.tum.de/~muehlbau/papers/mvcc.pdf). At *ACM SIGMOD International Conference on -Management of Data* (SIGMOD), pages 677–689, May 2015. -[doi:10.1145/2723372.2749436](https://doi.org/10.1145/2723372.2749436) - -[[68](/en/ch8#Badal1979-marker)] D. Z. Badal. -[Correctness of Concurrency Control and -Implications in Distributed Databases](https://ieeexplore.ieee.org/abstract/document/762563). At *3rd International IEEE Computer Software and -Applications Conference* (COMPSAC), November 1979. -[doi:10.1109/CMPSAC.1979.762563](https://doi.org/10.1109/CMPSAC.1979.762563) - -[[69](/en/ch8#Agrawal1987-marker)] Rakesh Agrawal, Michael J. Carey, and Miron Livny. -[Concurrency Control -Performance Modeling: Alternatives and Implications](https://people.eecs.berkeley.edu/~brewer/cs262/ConcControl.pdf). *ACM Transactions on Database -Systems* (TODS), volume 12, issue 4, pages 609–654, December 1987. -[doi:10.1145/32204.32220](https://doi.org/10.1145/32204.32220) - -[[70](/en/ch8#Brooker2024snapshot-marker)] Marc Brooker. -[Snapshot Isolation vs -Serializability](https://brooker.co.za/blog/2024/12/17/occ-and-isolation.html). *brooker.co.za*, December 2024. 
-Archived at [perma.cc/5TRC-CR5G](https://perma.cc/5TRC-CR5G) - -[[71](/en/ch8#Lindsay1979_ch8-marker)] B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. -Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. -[Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). -IBM Research, Research Report RJ2571(33471), July 1979. -Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD) - -[[72](/en/ch8#Mohan1986-marker)] C. Mohan, Bruce G. Lindsay, and Ron Obermarck. -[Transaction -Management in the R\* Distributed Database Management System](https://cs.brown.edu/courses/csci2270/archives/2012/papers/dtxn/p378-mohan.pdf). -*ACM Transactions on Database Systems*, volume 11, issue 4, pages 378–396, December 1986. -[doi:10.1145/7239.7266](https://doi.org/10.1145/7239.7266) - -[[73](/en/ch8#XASpec1991-marker)] X/Open Company Ltd. -[Distributed Transaction Processing: -The XA Specification](https://pubs.opengroup.org/onlinepubs/009680699/toc.pdf). Technical Standard XO/CAE/91/300, December 1991. ISBN: 978-1-872-63024-3, -archived at [perma.cc/Z96H-29JB](https://perma.cc/Z96H-29JB) - -[[74](/en/ch8#Neto2008-marker)] Ivan Silva Neto and Francisco Reverbel. -[Lessons Learned from Implementing -WS-Coordination and WS-AtomicTransaction](https://www.ime.usp.br/~reverbel/papers/icis2008.pdf). At *7th IEEE/ACIS International Conference on -Computer and Information Science* (ICIS), May 2008. -[doi:10.1109/ICIS.2008.75](https://doi.org/10.1109/ICIS.2008.75) - -[[75](/en/ch8#Johnson2004-marker)] James E. Johnson, David E. Langworthy, Leslie Lamport, -and Friedrich H. Vogt. -[Formal -Specification of a Web Services Protocol](https://www.microsoft.com/en-us/research/publication/formal-specification-of-a-web-services-protocol/). At *1st International Workshop on Web Services and -Formal Methods* (WS-FM), February 2004. 
-[doi:10.1016/j.entcs.2004.02.022](https://doi.org/10.1016/j.entcs.2004.02.022) - -[[76](/en/ch8#Gray1981_ch8-marker)] Jim Gray. -[The Transaction -Concept: Virtues and Limitations](https://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf). At *7th International Conference on Very Large Data -Bases* (VLDB), September 1981. - -[[77](/en/ch8#Skeen1981-marker)] Dale Skeen. -[Nonblocking Commit -Protocols](https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/Ske81.pdf). At *ACM International Conference on Management of Data* (SIGMOD), April 1981. -[doi:10.1145/582318.582339](https://doi.org/10.1145/582318.582339) - -[[78](/en/ch8#Hohpe2005-marker)] Gregor Hohpe. -[Your Coffee Shop Doesn’t Use -Two-Phase Commit](https://www.martinfowler.com/ieeeSoftware/coffeeShop.pdf). *IEEE Software*, volume 22, issue 2, pages 64–66, March 2005. -[doi:10.1109/MS.2005.52](https://doi.org/10.1109/MS.2005.52) - -[[79](/en/ch8#Helland2007_ch8-marker)] Pat Helland. -[Life Beyond Distributed Transactions: -An Apostate’s Opinion](https://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf). At *3rd Biennial Conference on Innovative Data Systems Research* -(CIDR), January 2007. - -[[80](/en/ch8#Oliver2011-marker)] Jonathan Oliver. -[My Beef with -MSDTC and Two-Phase Commits](https://blog.jonathanoliver.com/my-beef-with-msdtc-and-two-phase-commits/). *blog.jonathanoliver.com*, April 2011. -Archived at [perma.cc/K8HF-Z4EN](https://perma.cc/K8HF-Z4EN) - -[[81](/en/ch8#Rahien2014-marker)] Oren Eini (Ahende Rahien). -[The Fallacy of -Distributed Transactions](https://ayende.com/blog/167362/the-fallacy-of-distributed-transactions). *ayende.com*, July 2014. -Archived at [perma.cc/VB87-2JEF](https://perma.cc/VB87-2JEF) - -[[82](/en/ch8#Vasters2012-marker)] Clemens Vasters. -[Transactions -in Windows Azure (with Service Bus) – An Email Discussion](https://learn.microsoft.com/en-gb/archive/blogs/clemensv/transactions-in-windows-azure-with-service-bus-an-email-discussion). 
*learn.microsoft.com*, July 2012. -Archived at [perma.cc/4EZ9-5SKW](https://perma.cc/4EZ9-5SKW) - -[[83](/en/ch8#Dhariwal2008-marker)] Ajmer Dhariwal. -[Orphaned MSDTC -Transactions (-2 spids)](https://www.eraofdata.com/posts/2008/orphaned-msdtc-transactions-2-spids/). *eraofdata.com*, December 2008. -Archived at [perma.cc/YG6F-U34C](https://perma.cc/YG6F-U34C) - -[[84](/en/ch8#Randal2013-marker)] Paul Randal. -[Real -World Story of DBCC PAGE Saving the Day](https://www.sqlskills.com/blogs/paul/real-world-story-of-dbcc-page-saving-the-day/). *sqlskills.com*, June 2013. -Archived at [perma.cc/2MJN-A5QH](https://perma.cc/2MJN-A5QH) - -[[85](/en/ch8#Wang2021-marker)] Guozhang Wang, Lei Chen, Ayusman Dikshit, Jason -Gustafson, Boyang Chen, Matthias J. Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva -Mehta, Varun Madan, and Jun Rao. -[Consistency and Completeness: -Rethinking Distributed Stream Processing in Apache Kafka](https://dl.acm.org/doi/pdf/10.1145/3448016.3457556). At *ACM International Conference on -Management of Data* (SIGMOD), June 2021. -[doi:10.1145/3448016.3457556](https://doi.org/10.1145/3448016.3457556) +[^1]: Steven J. Murdoch. [What went wrong with Horizon: learning from the Post Office Trial](https://www.benthamsgaze.org/2021/07/15/what-went-wrong-with-horizon-learning-from-the-post-office-trial/). *benthamsgaze.org*, July 2021. Archived at [perma.cc/CNM4-553F](https://perma.cc/CNM4-553F) +[^2]: Donald D. Chamberlin, Morton M. Astrahan, Michael W. Blasgen, James N. Gray, W. Frank King, Bruce G. Lindsay, Raymond Lorie, James W. Mehl, Thomas G. Price, Franco Putzolu, Patricia Griffiths Selinger, Mario Schkolnick, Donald R. Slutz, Irving L. Traiger, Bradford W. Wade, and Robert A. Yost. [A History and Evaluation of System R](https://dsf.berkeley.edu/cs262/2005/SystemR.pdf). *Communications of the ACM*, volume 24, issue 10, pages 632–646, October 1981. [doi:10.1145/358769.358784](https://doi.org/10.1145/358769.358784) +[^3]: Jim N. 
Gray, Raymond A. Lorie, Gianfranco R. Putzolu, and Irving L. Traiger. [Granularity of Locks and Degrees of Consistency in a Shared Data Base](https://citeseerx.ist.psu.edu/pdf/e127f0a6a912bb9150ecfe03c0ebf7fbc289a023). in *Modelling in Data Base Management Systems: Proceedings of the IFIP Working Conference on Modelling in Data Base Management Systems*, edited by G. M. Nijssen, pages 364–394, Elsevier/North Holland Publishing, 1976. Also in *Readings in Database Systems*, 4th edition, edited by Joseph M. Hellerstein and Michael Stonebraker, MIT Press, 2005. ISBN: 978-0-262-69314-1 +[^4]: Kapali P. Eswaran, Jim N. Gray, Raymond A. Lorie, and Irving L. Traiger. [The Notions of Consistency and Predicate Locks in a Database System](https://jimgray.azurewebsites.net/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf?from=https://research.microsoft.com/en-us/um/people/gray/papers/On%20the%20Notions%20of%20Consistency%20and%20Predicate%20Locks%20in%20a%20Database%20System%20CACM.pdf). *Communications of the ACM*, volume 19, issue 11, pages 624–633, November 1976. [doi:10.1145/360363.360369](https://doi.org/10.1145/360363.360369) +[^5]: Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. [CockroachDB: The Resilient Geo-Distributed SQL Database](https://dl.acm.org/doi/pdf/10.1145/3318464.3386134). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 1493–1509, June 2020. [doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134) +[^6]: Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, and Xin Tang. 
[TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084. [doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535) +[^7]: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. [Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), October 2012. +[^8]: Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. [FoundationDB: A Distributed Unbundled Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2021. [doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559) +[^9]: Theo Härder and Andreas Reuter. [Principles of Transaction-Oriented Database Recovery](https://citeseerx.ist.psu.edu/pdf/11ef7c142295aeb1a28a0e714c91fc8d610c3047). *ACM Computing Surveys*, volume 15, issue 4, pages 287–317, December 1983. [doi:10.1145/289.291](https://doi.org/10.1145/289.291) +[^10]: Peter Bailis, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [HAT, not CAP: Towards Highly Available Transactions](https://www.usenix.org/system/files/conference/hotos13/hotos13-final80.pdf). 
At *14th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2013. +[^11]: Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. [Cluster-Based Scalable Network Services](https://people.eecs.berkeley.edu/~brewer/cs262b/TACC.pdf). At *16th ACM Symposium on Operating Systems Principles* (SOSP), October 1997. [doi:10.1145/268998.266662](https://doi.org/10.1145/268998.266662) +[^12]: Tony Andrews. [Enforcing Complex Constraints in Oracle](https://tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html). *tonyandrews.blogspot.co.uk*, October 2004. Archived at [archive.org](https://web.archive.org/web/20220201190625/https%3A//tonyandrews.blogspot.com/2004/10/enforcing-complex-constraints-in.html) +[^13]: Philip A. Bernstein, Vassos Hadzilacos, and Nathan Goodman. [*Concurrency Control and Recovery in Database Systems*](https://www.microsoft.com/en-us/research/people/philbe/book/). Addison-Wesley, 1987. ISBN: 978-0-201-10715-9, available online at [*microsoft.com*](https://www.microsoft.com/en-us/research/people/philbe/book/). +[^14]: Alan Fekete, Dimitrios Liarokapis, Elizabeth O’Neil, Patrick O’Neil, and Dennis Shasha. [Making Snapshot Isolation Serializable](https://www.cse.iitb.ac.in/infolab/Data/Courses/CS632/2009/Papers/p492-fekete.pdf). *ACM Transactions on Database Systems*, volume 30, issue 2, pages 492–528, June 2005. [doi:10.1145/1071610.1071615](https://doi.org/10.1145/1071610.1071615) +[^15]: Mai Zheng, Joseph Tucek, Feng Qin, and Mark Lillibridge. [Understanding the Robustness of SSDs Under Power Fault](https://www.usenix.org/system/files/conference/fast13/fast13-final80.pdf). At *11th USENIX Conference on File and Storage Technologies* (FAST), February 2013. +[^16]: Laurie Denness. [SSDs: A Gift and a Curse](https://laur.ie/blog/2015/06/ssds-a-gift-and-a-curse/). *laur.ie*, June 2015. Archived at [perma.cc/6GLP-BX3T](https://perma.cc/6GLP-BX3T) +[^17]: Adam Surak. 
[When Solid State Drives Are Not That Solid](https://www.algolia.com/blog/engineering/when-solid-state-drives-are-not-that-solid). *blog.algolia.com*, June 2015. Archived at [perma.cc/CBR9-QZEE](https://perma.cc/CBR9-QZEE) +[^18]: Hewlett Packard Enterprise. [Bulletin: (Revision) HPE SAS Solid State Drives - Critical Firmware Upgrade Required for Certain HPE SAS Solid State Drive Models to Prevent Drive Failure at 32,768 Hours of Operation](https://support.hpe.com/hpesc/public/docDisplay?docId=emr_na-a00092491en_us). *support.hpe.com*, November 2019. Archived at [perma.cc/CZR4-AQBS](https://perma.cc/CZR4-AQBS) +[^19]: Craig Ringer et al. [PostgreSQL’s handling of fsync() errors is unsafe and risks data loss at least on XFS](https://www.postgresql.org/message-id/flat/CAMsr%2BYHh%2B5Oq4xziwwoEfhoTZgr07vdGG%2Bhu%3D1adXx59aTeaoQ%40mail.gmail.com). Email thread on pgsql-hackers mailing list, *postgresql.org*, March 2018. Archived at [perma.cc/5RKU-57FL](https://perma.cc/5RKU-57FL) +[^20]: Anthony Rebello, Yuvraj Patel, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [Can Applications Recover from fsync Failures?](https://www.usenix.org/conference/atc20/presentation/rebello) At *USENIX Annual Technical Conference* (ATC), July 2020. +[^21]: Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [Crash Consistency: Rethinking the Fundamental Abstractions of the File System](https://dl.acm.org/doi/pdf/10.1145/2800695.2801719). *ACM Queue*, volume 13, issue 7, pages 20–28, July 2015. [doi:10.1145/2800695.2801719](https://doi.org/10.1145/2800695.2801719) +[^22]: Thanumalayan Sankaranarayana Pillai, Vijay Chidambaram, Ramnatthan Alagappan, Samer Al-Kiswany, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. 
[All File Systems Are Not Created Equal: On the Complexity of Crafting Crash-Consistent Applications](https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-pillai.pdf). At *11th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2014. +[^23]: Chris Siebenmann. [Unix’s File Durability Problem](https://utcc.utoronto.ca/~cks/space/blog/unix/FileSyncProblem). *utcc.utoronto.ca*, April 2016. Archived at [perma.cc/VSS8-5MC4](https://perma.cc/VSS8-5MC4) +[^24]: Aishwarya Ganesan, Ramnatthan Alagappan, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [Redundancy Does Not Imply Fault Tolerance: Analysis of Distributed Storage Reactions to Single Errors and Corruptions](https://www.usenix.org/conference/fast17/technical-sessions/presentation/ganesan). At *15th USENIX Conference on File and Storage Technologies* (FAST), February 2017. +[^25]: Lakshmi N. Bairavasundaram, Garth R. Goodson, Bianca Schroeder, Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau. [An Analysis of Data Corruption in the Storage Stack](https://www.usenix.org/legacy/event/fast08/tech/full_papers/bairavasundaram/bairavasundaram.pdf). At *6th USENIX Conference on File and Storage Technologies* (FAST), February 2008. +[^26]: Bianca Schroeder, Raghav Lagisetty, and Arif Merchant. [Flash Reliability in Production: The Expected and the Unexpected](https://www.usenix.org/conference/fast16/technical-sessions/presentation/schroeder). At *14th USENIX Conference on File and Storage Technologies* (FAST), February 2016. +[^27]: Don Allison. [SSD Storage – Ignorance of Technology Is No Excuse](https://blog.korelogic.com/blog/2015/03/24). *blog.korelogic.com*, March 2015. Archived at [perma.cc/9QN4-9SNJ](https://perma.cc/9QN4-9SNJ) +[^28]: Gordon Mah Ung. [Debunked: Your SSD won’t lose data if left unplugged after all](https://www.pcworld.com/article/427602/debunked-your-ssd-wont-lose-data-if-left-unplugged-after-all.html). *pcworld.com*, May 2015. 
Archived at [perma.cc/S46H-JUDU](https://perma.cc/S46H-JUDU) +[^29]: Martin Kleppmann. [Hermitage: Testing the ‘I’ in ACID](https://martin.kleppmann.com/2014/11/25/hermitage-testing-the-i-in-acid.html). *martin.kleppmann.com*, November 2014. Archived at [perma.cc/KP2Y-AQGK](https://perma.cc/KP2Y-AQGK) +[^30]: Todd Warszawski and Peter Bailis. [ACIDRain: Concurrency-Related Attacks on Database-Backed Web Applications](http://www.bailis.org/papers/acidrain-sigmod2017.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017. [doi:10.1145/3035918.3064037](https://doi.org/10.1145/3035918.3064037) +[^31]: Tristan D’Agosta. [BTC Stolen from Poloniex](https://bitcointalk.org/index.php?topic=499580). *bitcointalk.org*, March 2014. Archived at [perma.cc/YHA6-4C5D](https://perma.cc/YHA6-4C5D) +[^32]: bitcointhief2. [How I Stole Roughly 100 BTC from an Exchange and How I Could Have Stolen More!](https://www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and/) *reddit.com*, February 2014. Archived at [archive.org](https://web.archive.org/web/20250118042610/https%3A//www.reddit.com/r/Bitcoin/comments/1wtbiu/how_i_stole_roughly_100_btc_from_an_exchange_and/) +[^33]: Sudhir Jorwekar, Alan Fekete, Krithi Ramamritham, and S. Sudarshan. [Automating the Detection of Snapshot Isolation Anomalies](https://www.vldb.org/conf/2007/papers/industrial/p1263-jorwekar.pdf). At *33rd International Conference on Very Large Data Bases* (VLDB), September 2007. +[^34]: Michael Melanson. [Transactions: The Limits of Isolation](https://www.michaelmelanson.net/posts/transactions-the-limits-of-isolation/). *michaelmelanson.net*, November 2014. Archived at [perma.cc/RG5R-KMYZ](https://perma.cc/RG5R-KMYZ) +[^35]: Edward Kim. [How ACH works: A developer perspective — Part 1](https://engineering.gusto.com/how-ach-works-a-developer-perspective-part-1-339d3e7bea1). *engineering.gusto.com*, April 2014. 
Archived at [perma.cc/7B2H-PU94](https://perma.cc/7B2H-PU94) +[^36]: Hal Berenson, Philip A. Bernstein, Jim N. Gray, Jim Melton, Elizabeth O’Neil, and Patrick O’Neil. [A Critique of ANSI SQL Isolation Levels](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tr-95-51.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 1995. [doi:10.1145/568271.223785](https://doi.org/10.1145/568271.223785) +[^37]: Atul Adya. [Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions](https://pmg.csail.mit.edu/papers/adya-phd.pdf). PhD Thesis, Massachusetts Institute of Technology, March 1999. Archived at [perma.cc/E97M-HW5Q](https://perma.cc/E97M-HW5Q) +[^38]: Peter Bailis, Aaron Davidson, Alan Fekete, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Highly Available Transactions: Virtues and Limitations](https://www.vldb.org/pvldb/vol7/p181-bailis.pdf). At *40th International Conference on Very Large Data Bases* (VLDB), September 2014. +[^39]: Natacha Crooks, Youer Pu, Lorenzo Alvisi, and Allen Clement. [Seeing is Believing: A Client-Centric Specification of Database Isolation](https://www.cs.cornell.edu/lorenzo/papers/Crooks17Seeing.pdf). At *ACM Symposium on Principles of Distributed Computing* (PODC), pages 73–82, July 2017. [doi:10.1145/3087801.3087802](https://doi.org/10.1145/3087801.3087802) +[^40]: Bruce Momjian. [MVCC Unmasked](https://momjian.us/main/writings/pgsql/mvcc.pdf). *momjian.us*, July 2014. Archived at [perma.cc/KQ47-9GYB](https://perma.cc/KQ47-9GYB) +[^41]: Peter Alvaro and Kyle Kingsbury. [MySQL 8.0.34](https://jepsen.io/analyses/mysql-8.0.34). *jepsen.io*, December 2023. Archived at [perma.cc/HGE2-Z878](https://perma.cc/HGE2-Z878) +[^42]: Egor Rogov. [PostgreSQL 14 Internals](https://postgrespro.com/community/books/internals). *postgrespro.com*, April 2023. Archived at [perma.cc/FRK2-D7WB](https://perma.cc/FRK2-D7WB) +[^43]: Hironobu Suzuki. 
[The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017. +[^44]: Rohan Reddy Alleti. [Internals of MVCC in Postgres: Hidden costs of Updates vs Inserts](https://medium.com/%40rohanjnr44/internals-of-mvcc-in-postgres-hidden-costs-of-updates-vs-inserts-381eadd35844). *medium.com*, March 2025. Archived at [perma.cc/3ACX-DFXT](https://perma.cc/3ACX-DFXT) +[^45]: Andy Pavlo and Bohan Zhang. [The Part of PostgreSQL We Hate the Most](https://www.cs.cmu.edu/~pavlo/blog/2023/04/the-part-of-postgresql-we-hate-the-most.html). *cs.cmu.edu*, April 2023. Archived at [perma.cc/XSP6-3JBN](https://perma.cc/XSP6-3JBN) +[^46]: Yingjun Wu, Joy Arulraj, Jiexi Lin, Ran Xian, and Andrew Pavlo. [An empirical evaluation of in-memory multi-version concurrency control](https://vldb.org/pvldb/vol10/p781-Wu.pdf). *Proceedings of the VLDB Endowment*, volume 10, issue 7, pages 781–792, March 2017. [doi:10.14778/3067421.3067427](https://doi.org/10.14778/3067421.3067427) +[^47]: Nikita Prokopov. [Unofficial Guide to Datomic Internals](https://tonsky.me/blog/unofficial-guide-to-datomic-internals/). *tonsky.me*, May 2014. +[^48]: Daniil Svetlov. [A Practical Guide to Taming Postgres Isolation Anomalies](https://dansvetlov.me/postgres-anomalies/). *dansvetlov.me*, March 2025. Archived at [perma.cc/L7LE-TDLS](https://perma.cc/L7LE-TDLS) +[^49]: Nate Wiger. [An Atomic Rant](https://nateware.com/2010/02/18/an-atomic-rant/). *nateware.com*, February 2010. Archived at [perma.cc/5ZYB-PE44](https://perma.cc/5ZYB-PE44) +[^50]: James Coglan. [Reading and writing, part 3: web applications](https://blog.jcoglan.com/2020/10/12/reading-and-writing-part-3/). *blog.jcoglan.com*, October 2020. Archived at [perma.cc/A7EK-PJVS](https://perma.cc/A7EK-PJVS) +[^51]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. 
[Feral Concurrency Control: An Empirical Investigation of Modern Application Integrity](http://www.bailis.org/papers/feral-sigmod2015.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2015. [doi:10.1145/2723372.2737784](https://doi.org/10.1145/2723372.2737784) +[^52]: Jaana Dogan. [Things I Wished More Developers Knew About Databases](https://rakyll.medium.com/things-i-wished-more-developers-knew-about-databases-2d0178464f78). *rakyll.medium.com*, April 2020. Archived at [perma.cc/6EFK-P2TD](https://perma.cc/6EFK-P2TD) +[^53]: Michael J. Cahill, Uwe Röhm, and Alan Fekete. [Serializable Isolation for Snapshot Databases](https://www.cs.cornell.edu/~sowell/dbpapers/serializable_isolation.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2008. [doi:10.1145/1376616.1376690](https://doi.org/10.1145/1376616.1376690) +[^54]: Dan R. K. Ports and Kevin Grittner. [Serializable Snapshot Isolation in PostgreSQL](https://drkp.net/papers/ssi-vldb12.pdf). At *38th International Conference on Very Large Databases* (VLDB), August 2012. +[^55]: Douglas B. Terry, Marvin M. Theimer, Karin Petersen, Alan J. Demers, Mike J. Spreitzer and Carl H. Hauser. [Managing Update Conflicts in Bayou, a Weakly Connected Replicated Storage System](https://pdos.csail.mit.edu/6.824/papers/bayou-conflicts.pdf). At *15th ACM Symposium on Operating Systems Principles* (SOSP), December 1995. [doi:10.1145/224056.224070](https://doi.org/10.1145/224056.224070) +[^56]: Hans-Jürgen Schönig. [Constraints over multiple rows in PostgreSQL](https://www.cybertec-postgresql.com/en/postgresql-constraints-over-multiple-rows/). *cybertec-postgresql.com*, June 2021. Archived at [perma.cc/2TGH-XUPZ](https://perma.cc/2TGH-XUPZ) +[^57]: Michael Stonebraker, Samuel Madden, Daniel J. Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland. 
[The End of an Architectural Era (It’s Time for a Complete Rewrite)](https://vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf). At *33rd International Conference on Very Large Data Bases* (VLDB), September 2007. +[^58]: John Hugg. [H-Store/VoltDB Architecture vs. CEP Systems and Newer Streaming Architectures](https://www.youtube.com/watch?v=hD5M4a1UVz8). At *Data @Scale Boston*, November 2014. +[^59]: Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alexander Rasin, Stanley Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang Zhang, John Hugg, and Daniel J. Abadi. [H-Store: A High-Performance, Distributed Main Memory Transaction Processing System](https://www.vldb.org/pvldb/vol1/1454211.pdf). *Proceedings of the VLDB Endowment*, volume 1, issue 2, pages 1496–1499, August 2008. +[^60]: Rich Hickey. [The Architecture of Datomic](https://www.infoq.com/articles/Architecture-Datomic/). *infoq.com*, November 2012. Archived at [perma.cc/5YWU-8XJK](https://perma.cc/5YWU-8XJK) +[^61]: John Hugg. [Debunking Myths About the VoltDB In-Memory Database](https://dzone.com/articles/debunking-myths-about-voltdb). *dzone.com*, May 2014. Archived at [perma.cc/2Z9N-HPKF](https://perma.cc/2Z9N-HPKF) +[^62]: Xinjing Zhou, Viktor Leis, Xiangyao Yu, and Michael Stonebraker. [OLTP Through the Looking Glass 16 Years Later: Communication is the New Bottleneck](https://www.vldb.org/cidrdb/papers/2025/p17-zhou.pdf). At *15th Annual Conference on Innovative Data Systems Research* (CIDR), January 2025. +[^63]: Xinjing Zhou, Xiangyao Yu, Goetz Graefe, and Michael Stonebraker. [Lotus: scalable multi-partition transactions on single-threaded partitioned databases](https://www.vldb.org/pvldb/vol15/p2939-zhou.pdf). *Proceedings of the VLDB Endowment* (PVLDB), volume 15, issue 11, pages 2939–2952, July 2022. [doi:10.14778/3551793.3551843](https://doi.org/10.14778/3551793.3551843) +[^64]: Joseph M. Hellerstein, Michael Stonebraker, and James Hamilton. 
[Architecture of a Database System](https://dsf.berkeley.edu/papers/fntdb07-architecture.pdf). *Foundations and Trends in Databases*, volume 1, issue 2, pages 141–259, November 2007. [doi:10.1561/1900000002](https://doi.org/10.1561/1900000002) +[^65]: Michael J. Cahill. [Serializable Isolation for Snapshot Databases](https://ses.library.usyd.edu.au/bitstream/handle/2123/5353/michael-cahill-2009-thesis.pdf). PhD Thesis, University of Sydney, July 2009. Archived at [perma.cc/727J-NTMP](https://perma.cc/727J-NTMP) +[^66]: Cristian Diaconu, Craig Freedman, Erik Ismert, Per-Åke Larson, Pravin Mittal, Ryan Stonecipher, Nitin Verma, and Mike Zwilling. [Hekaton: SQL Server’s Memory-Optimized OLTP Engine](https://www.microsoft.com/en-us/research/wp-content/uploads/2013/06/Hekaton-Sigmod2013-final.pdf). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 1243–1254, June 2013. [doi:10.1145/2463676.2463710](https://doi.org/10.1145/2463676.2463710) +[^67]: Thomas Neumann, Tobias Mühlbauer, and Alfons Kemper. [Fast Serializable Multi-Version Concurrency Control for Main-Memory Database Systems](https://db.in.tum.de/~muehlbau/papers/mvcc.pdf). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 677–689, May 2015. [doi:10.1145/2723372.2749436](https://doi.org/10.1145/2723372.2749436) +[^68]: D. Z. Badal. [Correctness of Concurrency Control and Implications in Distributed Databases](https://ieeexplore.ieee.org/abstract/document/762563). At *3rd International IEEE Computer Software and Applications Conference* (COMPSAC), November 1979. [doi:10.1109/CMPSAC.1979.762563](https://doi.org/10.1109/CMPSAC.1979.762563) +[^69]: Rakesh Agrawal, Michael J. Carey, and Miron Livny. [Concurrency Control Performance Modeling: Alternatives and Implications](https://people.eecs.berkeley.edu/~brewer/cs262/ConcControl.pdf). *ACM Transactions on Database Systems* (TODS), volume 12, issue 4, pages 609–654, December 1987. 
[doi:10.1145/32204.32220](https://doi.org/10.1145/32204.32220) +[^70]: Marc Brooker. [Snapshot Isolation vs Serializability](https://brooker.co.za/blog/2024/12/17/occ-and-isolation.html). *brooker.co.za*, December 2024. Archived at [perma.cc/5TRC-CR5G](https://perma.cc/5TRC-CR5G) +[^71]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD) +[^72]: C. Mohan, Bruce G. Lindsay, and Ron Obermarck. [Transaction Management in the R\* Distributed Database Management System](https://cs.brown.edu/courses/csci2270/archives/2012/papers/dtxn/p378-mohan.pdf). *ACM Transactions on Database Systems*, volume 11, issue 4, pages 378–396, December 1986. [doi:10.1145/7239.7266](https://doi.org/10.1145/7239.7266) +[^73]: X/Open Company Ltd. [Distributed Transaction Processing: The XA Specification](https://pubs.opengroup.org/onlinepubs/009680699/toc.pdf). Technical Standard XO/CAE/91/300, December 1991. ISBN: 978-1-872-63024-3, archived at [perma.cc/Z96H-29JB](https://perma.cc/Z96H-29JB) +[^74]: Ivan Silva Neto and Francisco Reverbel. [Lessons Learned from Implementing WS-Coordination and WS-AtomicTransaction](https://www.ime.usp.br/~reverbel/papers/icis2008.pdf). At *7th IEEE/ACIS International Conference on Computer and Information Science* (ICIS), May 2008. [doi:10.1109/ICIS.2008.75](https://doi.org/10.1109/ICIS.2008.75) +[^75]: James E. Johnson, David E. Langworthy, Leslie Lamport, and Friedrich H. Vogt. [Formal Specification of a Web Services Protocol](https://www.microsoft.com/en-us/research/publication/formal-specification-of-a-web-services-protocol/). At *1st International Workshop on Web Services and Formal Methods* (WS-FM), February 2004. 
[doi:10.1016/j.entcs.2004.02.022](https://doi.org/10.1016/j.entcs.2004.02.022) +[^76]: Jim Gray. [The Transaction Concept: Virtues and Limitations](https://jimgray.azurewebsites.net/papers/thetransactionconcept.pdf). At *7th International Conference on Very Large Data Bases* (VLDB), September 1981. +[^77]: Dale Skeen. [Nonblocking Commit Protocols](https://www.cs.utexas.edu/~lorenzo/corsi/cs380d/papers/Ske81.pdf). At *ACM International Conference on Management of Data* (SIGMOD), April 1981. [doi:10.1145/582318.582339](https://doi.org/10.1145/582318.582339) +[^78]: Gregor Hohpe. [Your Coffee Shop Doesn’t Use Two-Phase Commit](https://www.martinfowler.com/ieeeSoftware/coffeeShop.pdf). *IEEE Software*, volume 22, issue 2, pages 64–66, March 2005. [doi:10.1109/MS.2005.52](https://doi.org/10.1109/MS.2005.52) +[^79]: Pat Helland. [Life Beyond Distributed Transactions: An Apostate’s Opinion](https://www.cidrdb.org/cidr2007/papers/cidr07p15.pdf). At *3rd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2007. +[^80]: Jonathan Oliver. [My Beef with MSDTC and Two-Phase Commits](https://blog.jonathanoliver.com/my-beef-with-msdtc-and-two-phase-commits/). *blog.jonathanoliver.com*, April 2011. Archived at [perma.cc/K8HF-Z4EN](https://perma.cc/K8HF-Z4EN) +[^81]: Oren Eini (Ayende Rahien). [The Fallacy of Distributed Transactions](https://ayende.com/blog/167362/the-fallacy-of-distributed-transactions). *ayende.com*, July 2014. Archived at [perma.cc/VB87-2JEF](https://perma.cc/VB87-2JEF) +[^82]: Clemens Vasters. [Transactions in Windows Azure (with Service Bus) – An Email Discussion](https://learn.microsoft.com/en-gb/archive/blogs/clemensv/transactions-in-windows-azure-with-service-bus-an-email-discussion). *learn.microsoft.com*, July 2012. Archived at [perma.cc/4EZ9-5SKW](https://perma.cc/4EZ9-5SKW) +[^83]: Ajmer Dhariwal. [Orphaned MSDTC Transactions (-2 spids)](https://www.eraofdata.com/posts/2008/orphaned-msdtc-transactions-2-spids/).
*eraofdata.com*, December 2008. Archived at [perma.cc/YG6F-U34C](https://perma.cc/YG6F-U34C) +[^84]: Paul Randal. [Real World Story of DBCC PAGE Saving the Day](https://www.sqlskills.com/blogs/paul/real-world-story-of-dbcc-page-saving-the-day/). *sqlskills.com*, June 2013. Archived at [perma.cc/2MJN-A5QH](https://perma.cc/2MJN-A5QH) +[^85]: Guozhang Wang, Lei Chen, Ayusman Dikshit, Jason Gustafson, Boyang Chen, Matthias J. Sax, John Roesler, Sophie Blee-Goldman, Bruno Cadonna, Apurva Mehta, Varun Madan, and Jun Rao. [Consistency and Completeness: Rethinking Distributed Stream Processing in Apache Kafka](https://dl.acm.org/doi/pdf/10.1145/3448016.3457556). At *ACM International Conference on Management of Data* (SIGMOD), June 2021. [doi:10.1145/3448016.3457556](https://doi.org/10.1145/3448016.3457556) \ No newline at end of file diff --git a/content/en/ch9.md b/content/en/ch9.md index b3bc107..2ab9953 100644 --- a/content/en/ch9.md +++ b/content/en/ch9.md @@ -62,7 +62,7 @@ When you are writing software that runs on several computers, connected by a net is fundamentally different. In distributed systems, faults occur much more frequently, and so we can no longer ignore them—we have no choice but to confront the messy reality of the physical world. And in the physical world, a remarkably wide range of things can go wrong, as illustrated by this -anecdote [[3](/en/ch9#Hale2010)]: +anecdote [^3]: > In my limited experience I’ve dealt with long-lived network partitions in a single data center (DC), > PDU [power distribution unit] failures, switch failures, accidental power cycles of whole racks, @@ -79,7 +79,7 @@ anything involving multiple nodes and the network, it may sometimes work and som fail. As we shall see, you may not even *know* whether something succeeded or not! This nondeterminism and possibility of partial failures is what makes distributed systems hard to -work with [[4](/en/ch9#Hodges2013)]. +work with [^4]. 
On the other hand, if a distributed system can tolerate partial failures, that opens up powerful possibilities: for example, it allows you to perform a rolling upgrade, rebooting one node at a time to install software updates while the system as a whole continues working uninterrupted all the @@ -150,7 +150,7 @@ retransmits dropped packets, it detects reordered packets and puts them back in and it detects packet corruption using a simple checksum. It also figures out how fast it can send data so that it is transferred as quickly as possible, but without overloading the network or the receiving node; this is known as *congestion control*, *flow control*, or *backpressure* -[[5](/en/ch9#Jacobson1988)]. +[^5]. When you “send” some data by writing it to a socket, it actually doesn’t get sent immediately, but it’s only placed in a buffer managed by your operating system. When the congestion control @@ -159,7 +159,7 @@ that buffer and passes it to the network interface. The packet passes through se routers, and eventually the receiving node’s operating system places the packet’s data in a receive buffer and sends an acknowledgment packet back to the sender. Only then does the receiving operating system notify the application that some more data has arrived -[[6](/en/ch9#Hubert2009)]. +[^6]. So, if TCP provides “reliability”, does that mean we no longer need to worry about networks being unreliable? Unfortunately not. It decides that a packet must have been lost if no acknowledgment @@ -170,12 +170,12 @@ you. Eventually, after a configurable timeout, TCP gives up and signals an error If a TCP connection is closed with an error—perhaps because the remote node crashed, or perhaps because the network was interrupted—you unfortunately have no way of knowing how much data was -actually processed by the remote node [[6](/en/ch9#Hubert2009)]. +actually processed by the remote node [^6]. 
Even if TCP acknowledged that a packet was delivered, this only means that the operating system kernel on the remote node received it, but the application may have crashed before it handled that data. If you want to be sure that a request was successful, you need a positive response from the application itself -[[7](/en/ch9#Saltzer1984_ch9)]. +[^7]. Nevertheless, TCP is very useful, because it provides a convenient way of sending and receiving messages that are too big to fit in one packet. Once a TCP connection is established, you can also @@ -189,40 +189,40 @@ We have been building computer networks for decades—one might hope that by now out how to make them reliable. Unfortunately, we have not yet succeeded. There are some systematic studies, and plenty of anecdotal evidence, showing that network problems can be surprisingly common, even in controlled environments like a datacenter operated by one company -[[8](/en/ch9#Bailis2014reliable)]: +[^8]: * One study in a medium-sized datacenter found about 12 network faults per month, of which half disconnected a single machine, and half disconnected an entire rack - [[9](/en/ch9#Leners2015)]. + [^9]. * Another study measured the failure rates of components like top-of-rack switches, aggregation switches, and load balancers - [[10](/en/ch9#Gill2011)]. + [^10]. It found that adding redundant networking gear doesn’t reduce faults as much as you might hope, since it doesn’t guard against human error (e.g., misconfigured switches), which is a major cause of outages. * Interruptions of wide-area fiber links have been blamed on cows - [[11](/en/ch9#Hoelzle2020)], - beavers [[12](/en/ch9#CBCNews2021)], - and sharks [[13](/en/ch9#Oremus2014)] + [^11], + beavers [^12], + and sharks [^13] (though shark bites have become rarer due to better shielding of submarine cables - [[14](/en/ch9#AuerbachJahajeeah2023)]). + [^14]). 
Humans are also at fault, be it due to accidental misconfiguration - [[15](/en/ch9#Janardhan2021)], - scavenging [[16](/en/ch9#Parfitt2011)], + [^15], + scavenging [^16], or sabotage - [[17](/en/ch9#Voce2025)]. + [^17]. * Across different cloud regions, round-trip times of up to several *minutes* have been observed at high percentiles [[18](/en/ch9#Liu2016), Table 3]. Even within a single datacenter, packet delay of more than a minute can occur during a network topology reconfiguration, triggered by a problem during a software upgrade for a switch - [[19](/en/ch9#Imbriaco2012_ch9)]. + [^19]. Thus, we have to assume that messages might be delayed arbitrarily. * Sometimes communications are partially interrupted, depending on who you’re talking to: for example, A and B can communicate, B and C can communicate, but A and C cannot [[20](/en/ch9#Lianza2020_ch9), [21](/en/ch9#Alfatafta2020)]. Other surprising faults include a network interface that sometimes drops all inbound packets but - sends outbound packets successfully [[22](/en/ch9#Donges2012)]: + sends outbound packets successfully [^22]: just because a network link works in one direction doesn’t guarantee it’s also working in the opposite direction. * Even a brief network interruption can have repercussions that last for much longer than the @@ -243,9 +243,9 @@ may fail—there is no way around it. If the error handling of network faults is not defined and tested, arbitrarily bad things could happen: for example, the cluster could become deadlocked and permanently unable to serve requests, -even when the network recovers [[24](/en/ch9#Kingsbury2014elastic)], +even when the network recovers [^24], or it could even delete all of your data -[[25](/en/ch9#Sanfilippo2014)]. +[^25]. If software is put in an unanticipated situation, it may do arbitrary unexpected things. 
Handling network faults doesn’t necessarily mean *tolerating* them: if your network is normally @@ -273,7 +273,7 @@ that something is not working: * If a node process crashed (or was killed by an administrator) but the node’s operating system is still running, a script can notify other nodes about the crash so that another node can take over quickly without having to wait for a timeout to expire. For example, HBase does this - [[26](/en/ch9#Liochon2015)]. + [^26]. * If you have access to the management interface of the network switches in your datacenter, you can query them to detect link failures at a hardware level (e.g., if the remote machine is powered down). This option is ruled out if you’re connecting via the internet, or if you’re in a shared @@ -333,7 +333,7 @@ times to throw the system off-balance. When driving a car, travel times on road networks often vary most due to traffic congestion. Similarly, the variability of packet delays on computer networks is most often due to queueing -[[27](/en/ch9#Grosvenor2015)]: +[^27]: * If several different nodes simultaneously try to send packets to the same destination, the network switch must queue them up and feed them into the destination network link one by one (as illustrated @@ -344,11 +344,11 @@ Similarly, the variability of packet delays on computer networks is most often d * When a packet reaches the destination machine, if all CPU cores are currently busy, the incoming request from the network is queued by the operating system until the application is ready to handle it. Depending on the load on the machine, this may take an arbitrary length of time - [[28](/en/ch9#Julienne2019)]. + [^28]. * In virtualized environments, a running operating system is often paused for tens of milliseconds while another virtual machine uses a CPU core. 
During this time, the VM cannot consume any data from the network, so the incoming data is queued (buffered) by the virtual machine monitor - [[29](/en/ch9#Wang2010)], + [^29], further increasing the variability of network delays. * As mentioned earlier, in order to avoid overloading the network, TCP limits the rate at which it sends data. This means additional queueing at the sender before the data even enters the network. @@ -396,11 +396,11 @@ determine an appropriate trade-off between failure detection delay and risk of p Even better, rather than using configured constant timeouts, systems can continually measure response times and their variability (*jitter*), and automatically adjust timeouts according to the observed response time distribution. The Phi Accrual failure detector -[[32](/en/ch9#Hayashibara2004)], +[^32], which is used for example in Akka and Cassandra -[[33](/en/ch9#Wang2013)] +[^33] is one way of doing this. TCP retransmission timeouts also work similarly -[[5](/en/ch9#Jacobson1988)]. +[^5]. ## Synchronous Versus Asynchronous Networks @@ -417,12 +417,12 @@ similar reliability and predictability in computer networks? When you make a call over the telephone network, it establishes a *circuit*: a fixed, guaranteed amount of bandwidth is allocated for the call, along the entire route between the two callers. This circuit remains in place until the call ends -[[34](/en/ch9#Keshav1997)]. +[^34]. For example, an ISDN network runs at a fixed rate of 4,000 frames per second. When a call is established, it is allocated 16 bits of space within each frame (in each direction). Thus, for the duration of the call, each side is guaranteed to be able to send exactly 16 bits of audio data every 250 microseconds -[[35](/en/ch9#Kyas1995)]. +[^35]. 
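
The ISDN figures above can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are assumptions introduced here, and the 64 kbit/s result is simply the product of the two numbers quoted in the text.

```python
# Worked arithmetic for the ISDN example (names are illustrative, not from the text).

FRAMES_PER_SECOND = 4_000        # fixed frame rate quoted in the text
BITS_PER_FRAME_PER_CALL = 16     # space reserved for one call in each frame, per direction
MICROSECONDS_PER_SECOND = 1_000_000


def frame_interval_us(frames_per_second: int) -> float:
    """Time between two consecutive frames, in microseconds."""
    return MICROSECONDS_PER_SECOND / frames_per_second


def call_bandwidth_bps(frames_per_second: int, bits_per_frame: int) -> int:
    """Guaranteed bandwidth of one call in one direction, in bits per second."""
    return frames_per_second * bits_per_frame


if __name__ == "__main__":
    print(frame_interval_us(FRAMES_PER_SECOND))                            # 250.0
    print(call_bandwidth_bps(FRAMES_PER_SECOND, BITS_PER_FRAME_PER_CALL))  # 64000
```

Because the 16-bit slot recurs every 250 microseconds regardless of other traffic, the per-call bandwidth is a constant rather than a best-effort share of the link.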
This kind of network is *synchronous*: even as data passes through several routers, it does not suffer from queueing, because the 16 bits of space for the call have already been reserved in the @@ -459,10 +459,10 @@ the rate of data transfer to the available network capacity. There have been some attempts to build hybrid networks that support both circuit switching and packet switching. *Asynchronous Transfer Mode* (ATM) was a competitor to Ethernet in the 1980s, but it didn’t gain much adoption outside of telephone network core switches. InfiniBand has some similarities -[[36](/en/ch9#Mellanox2014)]: +[^36]: it implements end-to-end flow control at the link layer, which reduces the need for queueing in the network, although it can still suffer from delays due to link congestion -[[37](/en/ch9#Santos2003)]. +[^37]. With careful use of *quality of service* (QoS, prioritization and scheduling of packets) and *admission control* (rate-limiting senders), it is possible to emulate circuit switching on packet networks, or provide statistically bounded delay [[27](/en/ch9#Grosvenor2015), @@ -491,7 +491,7 @@ fixed cost, so if you utilize it better, each byte you send over the wire is che A similar situation arises with CPUs: if you share each CPU core dynamically between several threads, one thread sometimes has to wait in the operating system’s run queue while another thread is running, so a thread can be paused for varying lengths of time -[[38](/en/ch9#Li2014)]. +[^38]. However, this utilizes the hardware better than if you allocated a static number of CPU cycles to each thread (see [“Response time guarantees”](/en/ch9#sec_distributed_clocks_realtime)). Better hardware utilization is also why cloud platforms run several virtual machines from different customers on the same physical machine. @@ -546,7 +546,7 @@ a quartz crystal oscillator. These devices are not perfectly accurate, so each m notion of time, which may be slightly faster or slower than on other machines. 
It is possible to synchronize clocks to some degree: the most commonly used mechanism is the Network Time Protocol (NTP), which allows the computer clock to be adjusted according to the time reported by a group of servers -[[39](/en/ch9#Windl2006)]. +[^39]. The servers in turn get their time from a more accurate time source, such as a GPS receiver. ## Monotonic Versus Time-of-Day Clocks @@ -572,20 +572,20 @@ various oddities, as described in the next section. In particular, if the local ahead of the NTP server, it may be forcibly reset and appear to jump back to a previous point in time. These jumps, as well as similar jumps caused by leap seconds, make time-of-day clocks unsuitable for measuring elapsed time -[[40](/en/ch9#GrahamCumming2017)]. +[^40]. Time-of-day clocks can experience jumps due to the start and end of Daylight Saving Time (DST); these can be avoided by always using UTC as time zone, which does not have DST. Time-of-day clocks have also historically had quite a coarse-grained resolution, e.g., moving forward in steps of 10 ms on older Windows systems -[[41](/en/ch9#Holmes2006)]. +[^41]. On recent systems, this is less of a problem. ### Monotonic clocks A monotonic clock is suitable for measuring a duration (time interval), such as a timeout or a service’s response time: `clock_gettime(CLOCK_MONOTONIC)` or `clock_gettime(CLOCK_BOOTTIME)` on -Linux [[42](/en/ch9#Greef2021)] +Linux [^42] and `System.nanoTime()` in Java are monotonic clocks, for example. The name comes from the fact that they are guaranteed to always move forward (whereas a time-of-day clock may jump back in time). @@ -598,11 +598,11 @@ clock values from two different computers, because they don’t mean the same th On a server with multiple CPU sockets, there may be a separate timer per CPU, which is not necessarily synchronized with other CPUs -[[43](/en/ch9#Yang2015)]. +[^43]. 
Operating systems compensate for any discrepancy and try to present a monotonic view of the clock to application threads, even as they are scheduled across different CPUs. However, it is wise to take this guarantee of monotonicity with a pinch of salt -[[44](/en/ch9#Loughran2015)]. +[^44]. NTP may adjust the frequency at which the monotonic clock moves forward (this is known as *slewing* the clock) if it detects that the computer’s local quartz is moving faster or slower than the NTP @@ -625,12 +625,12 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples * The quartz clock in a computer is not very accurate: it *drifts* (runs faster or slower than it should). Clock drift varies depending on the temperature of the machine. Google assumes a clock drift of up to 200 ppm (parts per million) for its servers - [[45](/en/ch9#Corbett2012_ch9)], + [^45], which is equivalent to 6 ms drift for a clock that is resynchronized with a server every 30 seconds, or 17 seconds drift for a clock that is resynchronized once a day. This drift limits the best possible accuracy you can achieve, even if everything is working correctly. * If a computer’s clock differs too much from an NTP server, it may refuse to synchronize, or the - local clock will be forcibly reset [[39](/en/ch9#Windl2006)]. Any + local clock will be forcibly reset [^39]. Any applications observing the time before and after this reset may see time go backward or suddenly jump forward. * If a node is accidentally firewalled off from NTP servers, the misconfiguration may go @@ -639,7 +639,7 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples * NTP synchronization can only be as good as the network delay, so there is a limit to its accuracy when you’re on a congested network with variable packet delays. 
One experiment showed that a minimum error of 35 ms is achievable when synchronizing over the internet - [[46](/en/ch9#Caporaloni2012)], + [^46], though occasional spikes in network delay lead to errors of around a second. Depending on the configuration, large network delays can cause the NTP client to give up entirely. * Some NTP servers are wrong or misconfigured, reporting time that is off by hours @@ -650,7 +650,7 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples were told by a stranger on the internet. * Leap seconds result in a minute that is 59 seconds or 61 seconds long, which messes up timing assumptions in systems that are not designed with leap seconds in mind - [[49](/en/ch9#Kamp2011)]. + [^49]. The fact that leap seconds have crashed many large systems [[40](/en/ch9#GrahamCumming2017), [50](/en/ch9#Minar2012_ch9)] @@ -660,21 +660,21 @@ hope—hardware clocks and NTP can be fickle beasts. To give just a few examples [[51](/en/ch9#Pascoe2011), [52](/en/ch9#Zhao2015)], although actual NTP server behavior varies in practice - [[53](/en/ch9#Veitch2016)]. + [^53]. Leap seconds will no longer be used from 2035 onwards, so this problem will fortunately go away. * In virtual machines, the hardware clock is virtualized, which raises additional challenges for applications that need accurate timekeeping - [[54](/en/ch9#VMware2011)]. + [^54]. When a CPU core is shared between virtual machines, each VM is paused for tens of milliseconds while another VM is running. From an application’s point of view, this pause manifests itself as - the clock suddenly jumping forward [[29](/en/ch9#Wang2010)]. + the clock suddenly jumping forward [^29]. If a VM pauses for several seconds, the clock may then be several seconds behind the actual time, but NTP may continue to report that the clock is almost perfectly in sync - [[55](/en/ch9#Yodaiken2017)]. + [^55]. 
* If you run software on devices that you don’t fully control (e.g., mobile or embedded devices), you probably cannot trust the device’s hardware clock at all. Some users deliberately set their hardware clock to an incorrect date and time, for example to cheat in games - [[56](/en/ch9#EmreAcer2017)]. + [^56]. As a result, the clock might be set to a time wildly in the past or the future. It is possible to achieve very good clock accuracy if you care about it sufficiently to invest @@ -682,7 +682,7 @@ significant resources. For example, the MiFID II European regulation for financi institutions requires all high-frequency trading funds to synchronize their clocks to within 100 microseconds of UTC, in order to help debug market anomalies such as “flash crashes” and to help detect market manipulation -[[57](/en/ch9#MiFID2015)]. +[^57]. Such accuracy can be achieved with some special hardware (GPS receivers and/or atomic clocks), the Precision Time Protocol (PTP) and careful deployment and monitoring @@ -690,10 +690,10 @@ Precision Time Protocol (PTP) and careful deployment and monitoring [59](/en/ch9#Obleukhov2022)]. Relying on GPS alone can be risky because GPS signals can easily be jammed. In some locations this happens frequently, e.g. close to military facilities -[[60](/en/ch9#Wiseman2022)]. +[^60]. Some cloud providers have begun offering high-accuracy clock synchronization for their virtual machines -[[61](/en/ch9#Levinson2023)]. +[^61]. However, clock synchronization still requires a lot of care. If your NTP daemon is misconfigured, or a firewall is blocking NTP traffic, the clock error due to drift can quickly become large. @@ -727,7 +727,7 @@ the broken clocks before they can cause too much damage. Let’s consider one particular situation in which it is tempting, but dangerous, to rely on clocks: ordering of events across multiple nodes -[[64](/en/ch9#Brooker2023time)]. +[^64]. For example, if two clients write to a distributed database, who got there first? 
Which write is the more recent one? @@ -762,7 +762,7 @@ a higher timestamp than the overwritten value, even if that timestamp is ahead o clock. However, that incurs the cost of an additional read to find the greatest existing timestamp. Some systems, including Cassandra and ScyllaDB, want to write to all replicas in a single round trip, and therefore they simply use the client clock’s timestamp along with a last write wins -policy [[62](/en/ch9#Kingsbury2013cassandra)]. This approach has some +policy [^62]. This approach has some serious problems: * Database writes can mysteriously disappear: a node with a lagging clock is unable to overwrite @@ -779,7 +779,7 @@ serious problems: * It is possible for two nodes to independently generate writes with the same timestamp, especially when the clock only has millisecond resolution. An additional tiebreaker value (which can simply be a large random number) is required to resolve such conflicts, but this approach can also lead to - violations of causality [[62](/en/ch9#Kingsbury2013cassandra)]. + violations of causality [^62]. Thus, even though it is tempting to resolve conflicts by keeping the most “recent” value and discarding others, it’s important to be aware that the definition of “recent” depends on a local @@ -795,7 +795,7 @@ you would need the clock error to be significantly lower than the network delay, possible. So-called *logical clocks* -[[66](/en/ch9#Lamport1978_ch9)], +[^66], which are based on incrementing counters rather than an oscillating quartz crystal, are a safer alternative for ordering events (see [“Detecting Concurrent Writes”](/en/ch6#sec_replication_concurrent)). 
Logical clocks do not measure the time of day or the number of seconds elapsed, only the relative ordering of events (whether one @@ -816,7 +816,7 @@ possible accuracy is probably to the tens of milliseconds, and the error may eas Thus, it doesn’t make sense to think of a clock reading as a point in time—it is more like a range of times, within a confidence interval: for example, a system may be 95% confident that the time now is between 10.3 and 10.5 seconds past the minute, but it doesn’t know any more precisely -than that [[67](/en/ch9#Sheehy2015)]. +than that [^67]. If we only know the time +/– 100 ms, the microsecond digits in the timestamp are essentially meaningless. @@ -832,7 +832,7 @@ Unfortunately, most systems don’t expose this uncertainty: for example, when y don’t know if its confidence interval is five milliseconds or five years. There are exceptions: the *TrueTime* API in Google’s Spanner -[[45](/en/ch9#Corbett2012_ch9)] and Amazon’s ClockBound explicitly report the +[^45] and Amazon’s ClockBound explicitly report the confidence interval on the local clock. When you ask it for the current time, you get back two values: `[earliest, latest]`, which are the *earliest possible* and the *latest possible* timestamp. Based on its uncertainty calculations, the clock knows that the actual current time is @@ -880,12 +880,12 @@ ensures that any transaction that may read the data is at a sufficiently later t confidence intervals do not overlap. In order to keep the wait time as short as possible, Spanner needs to keep the clock uncertainty as small as possible; for this purpose, Google deploys a GPS receiver or atomic clock in each datacenter, allowing clocks to be synchronized to within about -7 ms [[45](/en/ch9#Corbett2012_ch9)]. +7 ms [^45]. The atomic clocks and GPS receivers are not strictly necessary in Spanner: the important thing is to have a confidence interval, and the accurate clock sources only help keep that interval small. 
Other systems are beginning to adopt similar approaches: for example, YugabyteDB can leverage ClockBound -when running on AWS [[70](/en/ch9#Pachot2024)], +when running on AWS [^70], and several other systems now also rely on clock synchronization to various degrees [[71](/en/ch9#Kimball2022), [72](/en/ch9#Demirbas2025)]. @@ -898,7 +898,7 @@ node know that it is still leader (that it hasn’t been declared dead by the ot safely accept writes? One option is for the leader to obtain a *lease* from the other nodes, which is similar to a lock -with a timeout [[73](/en/ch9#Gray1989)]. +with a timeout [^73]. Only one node can hold the lease at any one time—thus, when a node obtains a lease, it knows that it is the leader for some amount of time, until the lease expires. In order to remain leader, the node must periodically renew the lease before it expires. If the node fails, it stops renewing the @@ -946,11 +946,11 @@ various reasons why this could happen: * Contention among threads accessing a shared resource, such as a lock or queue, can cause threads to spend a lot of their time waiting. Moving to a machine with more CPU cores can make such problems worse, and contention problems can be difficult to diagnose - [[74](/en/ch9#Sturman2022)]. + [^74]. * Many programming language runtimes (such as the Java Virtual Machine) have a *garbage collector* (GC) that occasionally needs to stop all running threads. In the past, such *“stop-the-world” GC pauses* would sometimes last for several minutes - [[75](/en/ch9#Lipcon2011)]! + [^75]! With modern GC algorithms this is less of a problem, but GC pauses can still be noticable (see [“Limiting the impact of garbage collection”](/en/ch9#sec_distributed_gc_impact)). * In virtualized environments, a virtual machine can be *suspended* (pausing the execution of all @@ -959,7 +959,7 @@ various reasons why this could happen: last for an arbitrary length of time. 
This feature is sometimes used for *live migration* of virtual machines from one host to another without a reboot, in which case the length of the pause depends on the rate at which processes are writing to memory - [[76](/en/ch9#Clark2005)]. + [^76]. * On end-user devices such as laptops and phones, execution may also be suspended and resumed arbitrarily, e.g., when the user closes the lid of their laptop. * When the operating system context-switches to another thread, or when the hypervisor switches to a @@ -969,14 +969,14 @@ various reasons why this could happen: there is a long queue of threads waiting to run—it may take some time before the paused thread gets to run again. * If the application performs synchronous disk access, a thread may be paused waiting for a slow - disk I/O operation to complete [[77](/en/ch9#Shaver2008)]. In many languages, disk access can happen + disk I/O operation to complete [^77]. In many languages, disk access can happen surprisingly, even if the code doesn’t explicitly mention file access—for example, the Java classloader lazily loads class files when they are first used, which could happen at any time in the program execution. I/O pauses and GC pauses may even conspire to combine their delays - [[78](/en/ch9#Zhuang2016)]. + [^78]. If the disk is actually a network filesystem or network block device (such as Amazon’s EBS), the I/O latency is further subject to the variability of network delays - [[31](/en/ch9#Newman2012)]. + [^31]. * If the operating system is configured to allow *swapping to disk* (*paging*), a simple memory access may result in a page fault that requires a page from disk to be loaded into memory. The thread is paused while this slow I/O operation takes place. If memory pressure is high, this may @@ -1051,7 +1051,7 @@ operating in a non-real-time environment. 
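
To make the danger of pauses concrete, here is a minimal sketch of a lease holder that checks its lease against a monotonic clock. The `Lease` and `FakeClock` classes are hypothetical and exist only for illustration; a real lease is granted by other nodes, whereas this sketch only models the holder's local view. The fake clock stands in for `time.monotonic` so that a pause can be simulated deterministically.

```python
import time


class Lease:
    """Local view of a lease, timed with a monotonic clock (hypothetical helper)."""

    def __init__(self, duration_s, clock=time.monotonic):
        self._clock = clock
        self._duration_s = duration_s
        self._expires_at = clock() + duration_s

    def renew(self):
        # In a real system this would involve a round trip to the other nodes.
        self._expires_at = self._clock() + self._duration_s

    def remaining(self):
        return self._expires_at - self._clock()

    def is_valid(self, safety_margin_s=0.0):
        return self.remaining() > safety_margin_s


class FakeClock:
    """Stand-in for time.monotonic that only moves when told to."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


clock = FakeClock()
lease = Lease(duration_s=10.0, clock=clock)

checked = lease.is_valid(safety_margin_s=1.0)  # check passes: 10 s remain
clock.advance(15.0)                            # simulated pause (GC, VM suspend, ...)
still_valid = lease.is_valid()                 # the lease expired during the pause

print(checked, still_valid)  # True False
```

The check that passed before the pause says nothing about the state of the lease afterward, and the paused thread has no way of noticing the gap unless it looks at the clock again. A local check of this kind is therefore not sufficient on its own to keep the holder's actions safe.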
### Limiting the impact of garbage collection Garbage collection used to be one of the biggest reasons for process pauses -[[79](/en/ch9#Thompson2013)], +[^79], but fortunately GC algorithms have improved a lot: a properly tuned collector will now usually pause for no more than a few milliseconds. The Java runtime offers collectors such as concurrent mark sweep (CMS), garbage-first (G1), the Z garbage collector (ZGC), Epsilon, and Shenandoah. Each of @@ -1099,7 +1099,7 @@ it is in, because problems in the network cannot reliably be distinguished from Discussions of these systems border on the philosophical: What do we know to be true or false in our system? How sure can we be of that knowledge, if the mechanisms for perception and measurement are -unreliable [[83](/en/ch9#Halpern1990)]? +unreliable [^83]? Should software systems obey the laws that we expect of the physical world, such as cause and effect? Fortunately, we don’t need to go as far as figuring out the meaning of life. In a distributed @@ -1119,7 +1119,7 @@ assumptions. Imagine a network with an asymmetric fault: a node is able to receive all messages sent to it, but any outgoing messages from that node are dropped or delayed -[[22](/en/ch9#Donges2012)]. Even though that node is working +[^22]. Even though that node is working perfectly well, and is receiving requests from other nodes, the other nodes cannot hear its responses. After some timeout, the other nodes declare it dead, because they haven’t heard from the node. The situation unfolds like a nightmare: the semi-disconnected node is dragged to the @@ -1161,7 +1161,7 @@ the use of quorums in more detail when we get to *consensus algorithms* in [Chap ## Distributed Locks and Leases Locks and leases in distributed application are prone to be misused, and a common source of bugs -[[84](/en/ch9#Tang2022)]. +[^84]. Let’s look at one particular case of how they can go wrong. 
In [“Process Pauses”](/en/ch9#sec_distributed_clocks_pauses) we saw that a lease is a kind of lock that times out and can be @@ -1221,13 +1221,13 @@ rule out zombies entirely, we have to instead ensure that they can’t do any da split brain. This is called *fencing off* the zombie. Some systems attempt to fence off zombies by shutting them down, for example by disconnecting them -from the network [[9](/en/ch9#Leners2015)], shutting down the VM via +from the network [^9], shutting down the VM via the cloud provider’s management interface, or even physically powering down the machine -[[87](/en/ch9#SUSE2025)]. +[^87]. This approach is known as *Shoot The Other Node In The Head* or STONITH. Unfortunately, it suffers from some problems: it does not protect against large network delays like in [Figure 9-5](/en/ch9#fig_distributed_lease_delay); it can happen that all of the nodes shut each other down -[[19](/en/ch9#Imbriaco2012_ch9)]; and by the time the zombie has been +[^19]; and by the time the zombie has been detected and shut down, it may already be too late and data may already have been corrupted. A more robust fencing solution, which protects against both zombies and delayed requests, is @@ -1245,7 +1245,7 @@ it must include its current fencing token. ###### Note There are several alternative names for fencing tokens. In Chubby, Google’s lock service, they are -called *sequencers* [[88](/en/ch9#Burrows2006_ch9)], and in Kafka they are called *epoch numbers*. +called *sequencers* [^88], and in Kafka they are called *epoch numbers*. In consensus algorithms, which we will discuss in [Chapter 10](/en/ch10#ch_consistency), the *ballot number* (Paxos) or *term number* (Raft) serves a similar purpose. @@ -1259,11 +1259,11 @@ has just acquired the lease must immediately make a write to the storage service write has completed, any zombies are fenced off. 
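
The check on the storage side can be sketched as follows. This is a minimal in-memory illustration, assuming that fencing tokens are numbers that increase every time the lease is granted; the `FencedStore` class, the `StaleTokenError` exception, and the token values 33 and 34 are hypothetical and do not correspond to any particular product's API.

```python
class StaleTokenError(Exception):
    """Raised when a write carries a token older than one already accepted."""


class FencedStore:
    """In-memory storage service that rejects writes with outdated fencing tokens."""

    def __init__(self):
        self._data = {}
        self._highest_token = {}  # key -> highest fencing token accepted so far

    def write(self, key, value, token):
        highest = self._highest_token.get(key, -1)
        if token < highest:
            # A newer leaseholder has already written: this client is a zombie,
            # or its request was delayed in the network.
            raise StaleTokenError(f"token {token} is older than {highest}")
        self._highest_token[key] = token
        self._data[key] = value

    def read(self, key):
        return self._data.get(key)


store = FencedStore()
store.write("file", "written by new leaseholder", token=34)

try:
    # The old leaseholder wakes up from a pause and still believes it holds the lease.
    store.write("file", "written by zombie", token=33)
except StaleTokenError as err:
    print("rejected:", err)

print(store.read("file"))  # written by new leaseholder
```

Writes with an equal token are accepted, so the current leaseholder can write repeatedly. Only a strictly older token is rejected, which is what fences off the previous holder once the new one has made its first write.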
If ZooKeeper is your lock service, you can use the transaction ID `zxid` or the node version -`cversion` as fencing token [[85](/en/ch9#Junqueira2013_ch9)]. +`cversion` as fencing token [^85]. With etcd, the revision number along with the lease ID serves a similar purpose -[[89](/en/ch9#Kingsbury2020etcd)]. +[^89]. The FencedLock API in Hazelcast explicitly generates a fencing token -[[90](/en/ch9#BasriKahveci2019)]. +[^90]. This mechanism requires that the storage service has some way of checking whether a write is based on an outdated token. Alternatively, it’s sufficient for the service to support a write that @@ -1279,7 +1279,7 @@ lock service is somewhat redundant [[91](/en/ch9#Kleppmann2016), [92](/en/ch9#Sanfilippo2016)], since the lease assignment could have been implemented directly based on that storage service -[[93](/en/ch9#Morling2024_ch9)]. +[^93]. However, once you have a fencing token you can also use it with multiple services or replicas, and ensure that the old leaseholder is fenced off on all of those services. @@ -1325,12 +1325,12 @@ Distributed systems problems become much harder if there is a risk that nodes ma arbitrary faulty or corrupted responses)—for example, it might cast multiple contradictory votes in the same election. Such behavior is known as a *Byzantine fault*, and the problem of reaching consensus in this untrusting environment is known as the *Byzantine Generals Problem* -[[94](/en/ch9#Lamport1982)]. +[^94]. # The Byzantine Generals Problem The Byzantine Generals Problem is a generalization of the so-called *Two Generals Problem* -[[95](/en/ch9#Gray1978)], +[^95], which imagines a situation in which two army generals need to agree on a battle plan. As they have set up camp on two different sites, they can only communicate by messenger, and the messengers sometimes get delayed or lost (like packets in a network). 
We will discuss this problem of @@ -1345,10 +1345,10 @@ Byzantium was an ancient Greek city that later became Constantinople, in the pla Istanbul in Turkey. There isn’t any historic evidence that the generals of Byzantium were any more prone to intrigue and conspiracy than those elsewhere. Rather, the name is derived from *Byzantine* in the sense of *excessively complicated, bureaucratic, devious*, which was used in politics long -before computers [[96](/en/ch9#Palmer2011)]. +before computers [^96]. Lamport wanted to choose a nationality that would not offend any readers, and he was advised that calling it *The Albanian Generals Problem* was not such a good idea -[[97](/en/ch9#LamportPubs)]. +[^97]. A system is *Byzantine fault-tolerant* if it continues to operate correctly even if some of the nodes are malfunctioning and not obeying the protocol, or if malicious attackers are interfering @@ -1365,19 +1365,19 @@ with the network. This concern is relevant in certain specific circumstances. Fo messages, since they may be sent with malicious intent. For example, cryptocurrencies like Bitcoin and other blockchains can be considered to be a way of getting mutually untrusting parties to agree whether a transaction happened or not, without relying on a central authority - [[100](/en/ch9#Bano2019_ch9)]. + [^100]. However, in the kinds of systems we discuss in this book, we can usually safely assume that there are no Byzantine faults. In a datacenter, all the nodes are controlled by your organization (so they can hopefully be trusted) and radiation levels are low enough that memory corruption is not a major problem (although datacenters in orbit are being considered -[[101](/en/ch9#Feilden2024)]). +[^101]). Multitenant systems have mutually untrusting tenants, but they are isolated from each other using firewalls, virtualization, and access control policies, not using Byzantine fault tolerance. 
Protocols for making systems Byzantine fault-tolerant are quite expensive -[[102](/en/ch9#Mickens2013)], +[^102], and fault-tolerant embedded systems rely on support from the hardware level -[[98](/en/ch9#Rushby2001)]. In most server-side data systems, the +[^98]. In most server-side data systems, the cost of deploying Byzantine fault-tolerant solutions makes them impracticable. Web applications do need to expect arbitrary and malicious behavior of clients that are under @@ -1422,12 +1422,12 @@ pragmatic steps toward better reliability. For example: checking that a value is within a reasonable range and limiting the size of strings to prevent denial of service through large memory allocations. An internal service behind a firewall may be able to get away with less strict checks on inputs, but basic checks in protocol parsers are still - a good idea [[105](/en/ch9#Gilman2015)]. + a good idea [^105]. * NTP clients can be configured with multiple server addresses. When synchronizing, the client contacts all of them, estimates their errors, and checks that a majority of servers agree on some time range. As long as most of the servers are okay, a misconfigured NTP server that is reporting an incorrect time is detected as an outlier and is excluded from synchronization - [[39](/en/ch9#Windl2006)]. The use of multiple servers makes NTP + [^39]. The use of multiple servers makes NTP more robust than if it only uses a single server. ## System Model and Reality @@ -1448,14 +1448,14 @@ Synchronous model : The synchronous model assumes bounded network delay, bounded process pauses, and bounded clock error. This does not imply exactly synchronized clocks or zero network delay; it just means you know that network delay, pauses, and clock drift will never exceed some fixed upper bound - [[108](/en/ch9#Dwork1988_ch9)]. + [^108]. The synchronous model is not a realistic model of most practical systems, because (as discussed in this chapter) unbounded delays and pauses do occur. 
Partially synchronous model : Partial synchrony means that a system behaves like a synchronous system *most of the time*, but it sometimes exceeds the bounds for network delay, process pauses, and clock drift - [[108](/en/ch9#Dwork1988_ch9)]. This is a realistic model of many + [^108]. This is a realistic model of many systems: most of the time, networks and processes are quite well behaved—otherwise we would never be able to get anything done—but we have to reckon with the fact that any timing assumptions may be shattered occasionally. When this happens, network delay, pauses, and clock error may become @@ -1472,7 +1472,7 @@ nodes are: Crash-stop faults : In the *crash-stop* (or *fail-stop*) model, an algorithm may assume that a node can fail in only one way, namely by crashing - [[109](/en/ch9#Schlichting1983)]. + [^109]. This means that the node may suddenly stop responding at any moment, and thereafter that node is gone forever—it never comes back. @@ -1486,18 +1486,18 @@ Degraded performance and partial functionality : In addition to crashing and restarting, nodes may go slow: they may still be able to respond to health check requests, while being too slow to get any real work done. For example, a Gigabit network interface could suddenly drop to 1 Kb/s throughput due to a driver bug - [[110](/en/ch9#Do2013)]; + [^110]; a process that is under memory pressure may spend most of its time performing garbage collection - [[111](/en/ch9#Snyder2019)]; + [^111]; worn-out SSDs can have erratic performance; and hardware can be affected by high temperature, loose connectors, mechanical vibration, power supply problems, firmware bugs, and more - [[112](/en/ch9#Gunawi2018_ch9)]. + [^112]. Such a situation is called a *limping node*, *gray failure*, or *fail-slow* - [[113](/en/ch9#Huang2017_ch9)], + [^113], and it can be even more difficult to deal with than a cleanly failed node. 
A related problem is when a process stops doing some of the things it is supposed to do while other aspects continue working, for example because a background thread is crashed or deadlocked - [[114](/en/ch9#Lou2020)]. + [^114]. Byzantine (arbitrary) faults : Nodes may do absolutely anything, including trying to trick and deceive other nodes, as described @@ -1541,13 +1541,13 @@ safety properties, but *availability* is a liveness property. What distinguishes the two kinds of properties? A giveaway is that liveness properties often include the word “eventually” in their definition. (And yes, you guessed it—*eventual consistency* is a -liveness property [[115](/en/ch9#Bailis2013_ch9)].) +liveness property [^115].) Safety is often informally defined as *nothing bad happens*, and liveness as *something good eventually happens*. However, it’s best to not read too much into those informal definitions, because “good” and “bad” are value judgements that don’t apply well to algorithms. The actual definitions of safety and liveness are more precise -[[116](/en/ch9#Alpern1985)]: +[^116]: * If a safety property is violated, we can point at a particular point in time at which it was broken (for example, if the uniqueness property was violated, we can identify the particular @@ -1560,7 +1560,7 @@ definitions of safety and liveness are more precise An advantage of distinguishing between safety and liveness properties is that it helps us deal with difficult system models. For distributed algorithms, it is common to require that safety properties *always* hold, in all possible situations of a system model -[[108](/en/ch9#Dwork1988_ch9)]. That is, even if all nodes crash, or +[^108]. That is, even if all nodes crash, or the entire network fails, the algorithm must nevertheless ensure that it does not return a wrong result (i.e., that the safety properties remain satisfied). @@ -1580,10 +1580,10 @@ abstraction of reality. 
For example, algorithms in the crash-recovery model generally assume that data in stable storage survives crashes. However, what happens if the data on disk is corrupted, or the data is wiped out due to hardware error or misconfiguration -[[117](/en/ch9#Junqueira2015)]? +[^117]? What happens if a server has a firmware bug and fails to recognize its hard drives on reboot, even though the drives are correctly attached to the server -[[118](/en/ch9#Sanders2016)]? +[^118]? Quorum algorithms (see [“Quorums for reading and writing”](/en/ch6#sec_replication_quorum_condition)) rely on a node remembering the data that it claims to have stored. If a node may suffer from amnesia and forget previously stored data, @@ -1596,7 +1596,7 @@ to happen—and in non-Byzantine systems, we do have to make some assumptions ab and cannot happen. However, a real implementation may still have to include code to handle the case where something happens that was assumed to be impossible, even if that handling boils down to `printf("Sucks to be you")` and `exit(666)`—i.e., letting a human operator clean up the mess -[[119](/en/ch9#Kreps2013)]. +[^119]. (This is one difference between computer science and software engineering.) That is not to say that theoretical, abstract system models are worthless—quite the opposite. @@ -1650,15 +1650,15 @@ and fix bugs [124](/en/ch9#VanBenschoten2019)]. For example, using TLA+, researchers were able to demonstrate the potential for data loss in viewstamped replication (VR) caused by ambiguity in the prose description of the algorithm -[[125](/en/ch9#Vanlightly2022)]. +[^125]. By design, model checkers don’t run your actual code, but rather a simplified model that specifies only the core ideas of your protocol. This makes it more tractable to systematically explore the state space, but it risks that your specification and your implementation go out of sync with each -other [[126](/en/ch9#Wayne2024)]. +other [^126]. 
It is possible to check whether the model and the real implementation have equivalent behavior, but this requires instrumentation in the real implementation -[[127](/en/ch9#Ouyang2025)]. +[^127]. ### Fault injection @@ -1671,7 +1671,7 @@ processes—anything you can imagine going wrong with a computer. Fault injection tests are typically run in an environment that closely resembles the production environment where the system will run. Some even inject faults directly into their production environment. Netflix popularized this approach with their Chaos Monkey tool -[[128](/en/ch9#Izrailevsky2011)]. Production fault +[^128]. Production fault injection is often referred to as *chaos engineering*, which we discussed in [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability). @@ -1687,7 +1687,7 @@ The myriad of tools required to trigger failures make fault injection tests cumb It’s common to adopt a fault injection framework like Jepsen to run fault injection tests to simplify the process. Such frameworks come with integrations for various operating systems and many pre-built fault injectors -[[129](/en/ch9#Kingsbury2013jepsen)]. +[^129]. Jepsen has been remarkably effective at finding critical bugs in many widely-used systems [[130](/en/ch9#Kingsbury2024), [131](/en/ch9#Majumdar2017)]. @@ -1714,18 +1714,18 @@ Application-level example, FoundationDB, one of the pioneers in the DST space, is built using an asynchronous communication library called Flow. Flow provides a point for developers to inject a deterministic network simulation into the system - [[132](/en/ch9#FoundationDB_ch9)]. + [^132]. Similarly, TigerBeetle is an online transaction processing (OLTP) database with first-class DST support. The system’s state is modeled as a state machine, with all mutations occuring within a single event loop. When combined with mock deterministic primitives such as clocks, such an architecture is able to run deterministically - [[133](/en/ch9#Kladov2023)]. 
+ [^133]. Runtime-level : Languages with asynchronous runtimes and commonly used libraries provide an insertion point to introduce determinism. A single-threaded runtime is used to force all asynchronous code to run sequentially. FrostDB, for example, patches Go’s runtime to execute goroutines sequentially - [[134](/en/ch9#Marques2024)]. + [^134]. Rust’s madsim library works in a similar manner. Madsim provides deterministic implementations of Tokio’s asynchronous runtime API, AWS’s S3 library, Kafka’s Rust library, and many others. Applications can swap in deterministic libraries and runtimes to get deterministic test executions @@ -1804,7 +1804,7 @@ slow to do anything useful, is even harder. Once a fault is detected, making a system tolerate it is not easy either: there is no global variable, no shared memory, no common knowledge or any other kind of shared state between the -machines [[83](/en/ch9#Halpern1990)]. +machines [^83]. Nodes can’t even agree on what time it is, let alone on anything more profound. The only way information can flow from one node to another is by sending it over the unreliable network. Major decisions cannot be safely made by a single node, so we require protocols that enlist help from @@ -1814,7 +1814,7 @@ If you’re used to writing software in the idealized mathematical perfection of where the same operation always deterministically returns the same result, then moving to the messy physical reality of distributed systems can be a bit of a shock. Conversely, distributed systems engineers will often regard a problem as trivial if it can be solved on a single computer -[[4](/en/ch9#Hodges2013)], +[^4], and indeed a single computer can do a lot nowadays. If you can avoid opening Pandora’s box and simply keep things on a single machine, for example by using an embedded storage engine (see [“Embedded storage engines”](/en/ch4#sidebar_embedded)), it is generally worth doing so. @@ -1839,711 +1839,141 @@ problems in distributed systems. 
##### Footnotes + ##### References -[[1](/en/ch9#Cavage2013-marker)] Mark Cavage. -[There’s Just No Getting Around It: You’re -Building a Distributed System](https://queue.acm.org/detail.cfm?id=2482856). *ACM Queue*, volume 11, issue 4, pages 80-89, April 2013. -[doi:10.1145/2466486.2482856](https://doi.org/10.1145/2466486.2482856) -[[2](/en/ch9#Kreps2012_ch9-marker)] Jay Kreps. -[Getting -Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. -Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW) - -[[3](/en/ch9#Hale2010-marker)] Coda Hale. -[You Can’t Sacrifice -Partition Tolerance](https://codahale.com/you-cant-sacrifice-partition-tolerance/). *codahale.com*, October 2010. - - -[[4](/en/ch9#Hodges2013-marker)] Jeff Hodges. -[Notes -on Distributed Systems for Young Bloods](https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/). *somethingsimilar.com*, January 2013. -Archived at [perma.cc/B636-62CE](https://perma.cc/B636-62CE) - -[[5](/en/ch9#Jacobson1988-marker)] Van Jacobson. -[Congestion -Avoidance and Control](https://www.cs.usask.ca/ftp/pub/discus/seminars2002-2003/p314-jacobson.pdf). At *ACM Symposium on Communications Architectures and -Protocols* (SIGCOMM), August 1988. -[doi:10.1145/52324.52356](https://doi.org/10.1145/52324.52356) - -[[6](/en/ch9#Hubert2009-marker)] Bert Hubert. -[The -Ultimate SO\_LINGER Page, or: Why Is My TCP Not Reliable](https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable). *blog.netherlabs.nl*, January 2009. -Archived at [perma.cc/6HDX-L2RR](https://perma.cc/6HDX-L2RR) - -[[7](/en/ch9#Saltzer1984_ch9-marker)] Jerome H. Saltzer, David P. Reed, and David D. Clark. -[End-To-End -Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf). 
*ACM Transactions on Computer Systems*, volume 2, issue 4, -pages 277–288, November 1984. -[doi:10.1145/357401.357402](https://doi.org/10.1145/357401.357402) - -[[8](/en/ch9#Bailis2014reliable-marker)] Peter Bailis and Kyle Kingsbury. -[The Network Is Reliable](https://queue.acm.org/detail.cfm?id=2655736). -*ACM Queue*, volume 12, issue 7, pages 48-55, July 2014. -[doi:10.1145/2639988.2639988](https://doi.org/10.1145/2639988.2639988) - -[[9](/en/ch9#Leners2015-marker)] Joshua B. Leners, Trinabh Gupta, Marcos K. -Aguilera, and Michael Walfish. -[Taming Uncertainty in -Distributed Systems with Help from the Network](https://cs.nyu.edu/~mwalfish/papers/albatross-eurosys15.pdf). At *10th European Conference on Computer -Systems* (EuroSys), April 2015. -[doi:10.1145/2741948.2741976](https://doi.org/10.1145/2741948.2741976) - -[[10](/en/ch9#Gill2011-marker)] Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. -[Understanding -Network Failures in Data Centers: Measurement, Analysis, and Implications](https://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p350.pdf). At -*ACM SIGCOMM Conference*, August 2011. -[doi:10.1145/2018436.2018477](https://doi.org/10.1145/2018436.2018477) - -[[11](/en/ch9#Hoelzle2020-marker)] Urs Hölzle. -[But recently a farmer had started -grazing a herd of cows nearby. And whenever they stepped on the fiber link, they bent it enough -to cause a blip](https://x.com/uhoelzle/status/1263333283107991558). *x.com*, May 2020. -Archived at [perma.cc/WX8X-ZZA5](https://perma.cc/WX8X-ZZA5) - -[[12](/en/ch9#CBCNews2021-marker)] CBC News. -[Hundreds -lose internet service in northern B.C. after beaver chews through cable](https://www.cbc.ca/news/canada/british-columbia/beaver-internet-down-tumbler-ridge-1.6001594). *cbc.ca*, -April 2021. Archived at [perma.cc/UW8C-H2MY](https://perma.cc/UW8C-H2MY) - -[[13](/en/ch9#Oremus2014-marker)] Will Oremus. 
-[The -Global Internet Is Being Attacked by Sharks, Google Confirms](https://slate.com/technology/2014/08/shark-attacks-threaten-google-s-undersea-internet-cables-video.html). *slate.com*, August 2014. -Archived at [perma.cc/P6F3-C6YG](https://perma.cc/P6F3-C6YG) - -[[14](/en/ch9#AuerbachJahajeeah2023-marker)] Jess Auerbach Jahajeeah. -[Down to the wire: The -ship fixing our internet](https://continent.substack.com/p/down-to-the-wire-the-ship-fixing). *continent.substack.com*, November 2023. -Archived at [perma.cc/DP7B-EQ7S](https://perma.cc/DP7B-EQ7S) - -[[15](/en/ch9#Janardhan2021-marker)] Santosh Janardhan. -[More details -about the October 4 outage](https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/). *engineering.fb.com*, October 2021. -Archived at [perma.cc/WW89-VSXH](https://perma.cc/WW89-VSXH) - -[[16](/en/ch9#Parfitt2011-marker)] Tom Parfitt. -[Georgian -woman cuts off web access to whole of Armenia](https://www.theguardian.com/world/2011/apr/06/georgian-woman-cuts-web-access). *theguardian.com*, April 2011. -Archived at [perma.cc/KMC3-N3NZ](https://perma.cc/KMC3-N3NZ) - -[[17](/en/ch9#Voce2025-marker)] Antonio Voce, Tural Ahmedzade and Ashley Kirk. -[‘Shadow -fleets’ and subaquatic sabotage: are Europe’s undersea internet cables under attack?](https://www.theguardian.com/world/ng-interactive/2025/mar/05/shadow-fleets-subaquatic-sabotage-europe-undersea-internet-cables-under-attack) -*theguardian.com*, March 2025. -Archived at [perma.cc/HA7S-ZDBV](https://perma.cc/HA7S-ZDBV) - -[[18](/en/ch9#Liu2016-marker)] Shengyun Liu, Paolo Viotti, -Christian Cachin, Vivien Quéma, and Marko Vukolić. -[XFT: Practical -Fault Tolerance beyond Crashes](https://www.usenix.org/system/files/conference/osdi16/osdi16-liu.pdf). At *12th USENIX Symposium on Operating Systems Design and -Implementation* (OSDI), November 2016. - -[[19](/en/ch9#Imbriaco2012_ch9-marker)] Mark Imbriaco. 
-[Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/). -*github.blog*, December 2012. -Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ) - -[[20](/en/ch9#Lianza2020_ch9-marker)] Tom Lianza and Chris Snook. -[A Byzantine failure -in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/). *blog.cloudflare.com*, November 2020. -Archived at [perma.cc/83EZ-ALCY](https://perma.cc/83EZ-ALCY) - -[[21](/en/ch9#Alfatafta2020-marker)] Mohammed Alfatafta, Basil Alkhatib, Ahmed Alquraan, -and Samer Al-Kiswany. -[Toward a Generic Fault -Tolerance Technique for Partial Network Partitioning](https://www.usenix.org/conference/osdi20/presentation/alfatafta). At *14th USENIX Symposium on -Operating Systems Design and Implementation* (OSDI), November 2020. - -[[22](/en/ch9#Donges2012-marker)] Marc A. Donges. -[Re: bnx2 cards Intermittantly Going -Offline](https://www.spinics.net/lists/netdev/msg210485.html). Message to Linux *netdev* mailing list, *spinics.net*, September 2012. -Archived at [perma.cc/TXP6-H8R3](https://perma.cc/TXP6-H8R3) - -[[23](/en/ch9#Toman2020-marker)] Troy Toman. -[Inside a CODE RED: -Network Edition](https://signalvnoise.com/svn3/inside-a-code-red-network-edition/). *signalvnoise.com*, September 2020. -Archived at [perma.cc/BET6-FY25](https://perma.cc/BET6-FY25) - -[[24](/en/ch9#Kingsbury2014elastic-marker)] Kyle Kingsbury. -[Call Me Maybe: -Elasticsearch](https://aphyr.com/posts/317-call-me-maybe-elasticsearch). *aphyr.com*, June 2014. -[perma.cc/JK47-S89J](https://perma.cc/JK47-S89J) - -[[25](/en/ch9#Sanfilippo2014-marker)] Salvatore Sanfilippo. -[A Few Arguments About Redis Sentinel Properties and Fail -Scenarios](https://antirez.com/news/80). *antirez.com*, October 2014. -[perma.cc/8XEU-CLM8](https://perma.cc/8XEU-CLM8) - -[[26](/en/ch9#Liochon2015-marker)] Nicolas Liochon. 
-[CAP: -If All You Have Is a Timeout, Everything Looks Like a Partition](http://blog.thislongrun.com/2015/05/CAP-theorem-partition-timeout-zookeeper.html). *blog.thislongrun.com*, -May 2015. Archived at [perma.cc/FS57-V2PZ](https://perma.cc/FS57-V2PZ) - -[[27](/en/ch9#Grosvenor2015-marker)] Matthew P. Grosvenor, Malte Schwarzkopf, Ionel -Gog, Robert N. M. Watson, Andrew W. Moore, Steven Hand, and Jon Crowcroft. -[Queues -Don’t Matter When You Can JUMP Them!](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-grosvenor_update.pdf) At *12th USENIX Symposium on Networked -Systems Design and Implementation* (NSDI), May 2015. - -[[28](/en/ch9#Julienne2019-marker)] Theo Julienne. -[Debugging -network stalls on Kubernetes](https://github.blog/engineering/debugging-network-stalls-on-kubernetes/). *github.blog*, November 2019. -Archived at [perma.cc/K9M8-XVGL](https://perma.cc/K9M8-XVGL) - -[[29](/en/ch9#Wang2010-marker)] Guohui Wang and T. S. Eugene Ng. -[The Impact of -Virtualization on Network Performance of Amazon EC2 Data Center](https://www.cs.rice.edu/~eugeneng/papers/INFOCOM10-ec2.pdf). At *29th IEEE -International Conference on Computer Communications* (INFOCOM), March 2010. -[doi:10.1109/INFCOM.2010.5461931](https://doi.org/10.1109/INFCOM.2010.5461931) - -[[30](/en/ch9#Philips2014-marker)] Brandon Philips. -[etcd: Distributed Locking and Service -Discovery](https://www.youtube.com/watch?v=HJIjTTHWYnE). At *Strange Loop*, September 2014. - -[[31](/en/ch9#Newman2012-marker)] Steve Newman. -[A Systematic Look at EC2 I/O](https://www.sentinelone.com/blog/a-systematic-look-at-ec2-i-o/). -*blog.scalyr.com*, October 2012. -Archived at [perma.cc/FL4R-H2VE](https://perma.cc/FL4R-H2VE) - -[[32](/en/ch9#Hayashibara2004-marker)] Naohiro Hayashibara, Xavier Défago, Rami Yared, and -Takuya Katayama. [The ϕ Accrual Failure -Detector](https://hdl.handle.net/10119/4784). 
Japan Advanced Institute of Science and Technology, School of Information -Science, Technical Report IS-RR-2004-010, May 2004. -Archived at [perma.cc/NSM2-TRYA](https://perma.cc/NSM2-TRYA) - -[[33](/en/ch9#Wang2013-marker)] Jeffrey Wang. -[Phi -Accrual Failure Detector](https://ternarysearch.blogspot.com/2013/08/phi-accrual-failure-detector.html). *ternarysearch.blogspot.co.uk*, August 2013. -[perma.cc/L452-AMLV](https://perma.cc/L452-AMLV) - -[[34](/en/ch9#Keshav1997-marker)] Srinivasan Keshav. *An Engineering Approach -to Computer Networking: ATM Networks, the Internet, and the Telephone Network*. -Addison-Wesley Professional, May 1997. ISBN: 978-0-201-63442-6 - -[[35](/en/ch9#Kyas1995-marker)] Othmar Kyas. *ATM Networks*. -International Thomson Publishing, 1995. ISBN: 978-1-850-32128-6 - -[[36](/en/ch9#Mellanox2014-marker)] Mellanox Technologies. -[InfiniBand -FAQ, Rev 1.3](https://network.nvidia.com/related-docs/whitepapers/InfiniBandFAQ_FQ_100.pdf). *network.nvidia.com*, December 2014. -Archived at [perma.cc/LQJ4-QZVK](https://perma.cc/LQJ4-QZVK) - -[[37](/en/ch9#Santos2003-marker)] Jose Renato Santos, Yoshio Turner, and G. (John) Janakiraman. -[End-to-End Congestion Control -for InfiniBand](https://infocom2003.ieee-infocom.org/papers/28_01.PDF). At *22nd Annual Joint Conference of the IEEE Computer and -Communications Societies* (INFOCOM), April 2003. Also published by HP Laboratories Palo -Alto, Tech Report HPL-2002-359. -[doi:10.1109/INFCOM.2003.1208949](https://doi.org/10.1109/INFCOM.2003.1208949) - -[[38](/en/ch9#Li2014-marker)] Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and -Steven D. Gribble. -[Tales of the Tail: Hardware, -OS, and Application-level Sources of Tail Latency](https://syslab.cs.washington.edu/papers/latency-socc14.pdf). At *ACM Symposium on Cloud Computing* -(SOCC), November 2014. 
-[doi:10.1145/2670979.2670988](https://doi.org/10.1145/2670979.2670988) - -[[39](/en/ch9#Windl2006-marker)] Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley. -[The NTP FAQ and HOWTO](https://www.ntp.org/ntpfaq/). *ntp.org*, November 2006. - -[[40](/en/ch9#GrahamCumming2017-marker)] John Graham-Cumming. -[How and -why the leap second affected Cloudflare DNS](https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/). *blog.cloudflare.com*, January 2017. -Archived at [archive.org](https://web.archive.org/web/20250202041444/https%3A//blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/) - -[[41](/en/ch9#Holmes2006-marker)] David Holmes. -[Inside -the Hotspot VM: Clocks, Timers and Scheduling Events – Part I – Windows](https://web.archive.org/web/20160308031939/https%3A//blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks). *blogs.oracle.com*, -October 2006. Archived at [archive.org](https://web.archive.org/web/20160308031939/https%3A//blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks) - -[[42](/en/ch9#Greef2021-marker)] Joran Dirk Greef. -[Three Clocks are -Better than One](https://tigerbeetle.com/blog/2021-08-30-three-clocks-are-better-than-one/). *tigerbeetle.com*, August 2021. -Archived at [perma.cc/5RXG-EU6B](https://perma.cc/5RXG-EU6B) - -[[43](/en/ch9#Yang2015-marker)] Oliver Yang. -[Pitfalls of TSC usage](https://oliveryang.net/2015/09/pitfalls-of-TSC-usage/). -*oliveryang.net*, September 2015. -Archived at [perma.cc/Z2QY-5FRA](https://perma.cc/Z2QY-5FRA) - -[[44](/en/ch9#Loughran2015-marker)] Steve Loughran. -[Time -on Multi-Core, Multi-Socket Servers](https://steveloughran.blogspot.com/2015/09/time-on-multi-core-multi-socket-servers.html). *steveloughran.blogspot.co.uk*, September 2015. -Archived at [perma.cc/7M4S-D4U6](https://perma.cc/7M4S-D4U6) - -[[45](/en/ch9#Corbett2012_ch9-marker)] James C. 
Corbett, Jeffrey Dean, Michael -Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher -Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, -Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, -Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. -[Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). -At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), -October 2012. - -[[46](/en/ch9#Caporaloni2012-marker)] M. Caporaloni and R. Ambrosini. -[How Closely Can a Personal Computer -Clock Track the UTC Timescale Via the Internet?](https://iopscience.iop.org/0143-0807/23/4/103/) *European Journal of Physics*, -volume 23, issue 4, pages L17–L21, June 2012. -[doi:10.1088/0143-0807/23/4/103](https://doi.org/10.1088/0143-0807/23/4/103) - -[[47](/en/ch9#Minar1999-marker)] Nelson Minar. -[A Survey of the NTP Network](https://alumni.media.mit.edu/~nelson/research/ntp-survey99/). -*alumni.media.mit.edu*, December 1999. -Archived at [perma.cc/EV76-7ZV3](https://perma.cc/EV76-7ZV3) - -[[48](/en/ch9#Holub2014-marker)] Viliam Holub. -[Synchronizing -Clocks in a Cassandra Cluster Pt. 1 – The Problem](https://blog.rapid7.com/2014/03/14/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/). *blog.rapid7.com*, March 2014. -Archived at [perma.cc/N3RV-5LNL](https://perma.cc/N3RV-5LNL) - -[[49](/en/ch9#Kamp2011-marker)] Poul-Henning Kamp. -[The One-Second War (What Time Will You Die?)](https://queue.acm.org/detail.cfm?id=1967009) -*ACM Queue*, volume 9, issue 4, pages 44–48, April 2011. -[doi:10.1145/1966989.1967009](https://doi.org/10.1145/1966989.1967009) - -[[50](/en/ch9#Minar2012_ch9-marker)] Nelson Minar. -[Leap Second Crashes Half -the Internet](https://www.somebits.com/weblog/tech/bad/leap-second-2012.html). *somebits.com*, July 2012. 
-Archived at [perma.cc/2WB8-D6EU](https://perma.cc/2WB8-D6EU) - -[[51](/en/ch9#Pascoe2011-marker)] Christopher Pascoe. -[Time, -Technology and Leaping Seconds](https://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html). *googleblog.blogspot.co.uk*, September 2011. -Archived at [perma.cc/U2JL-7E74](https://perma.cc/U2JL-7E74) - -[[52](/en/ch9#Zhao2015-marker)] Mingxue Zhao and Jeff Barr. -[Look -Before You Leap – The Coming Leap Second and AWS](https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/). *aws.amazon.com*, May 2015. -Archived at [perma.cc/KPE9-XMFM](https://perma.cc/KPE9-XMFM) - -[[53](/en/ch9#Veitch2016-marker)] Darryl Veitch and Kanthaiah Vijayalayan. -[Network Timing -and the 2015 Leap Second](https://opus.lib.uts.edu.au/bitstream/10453/43923/1/LeapSecond_camera.pdf). At *17th International Conference on Passive and Active -Measurement* (PAM), April 2016. -[doi:10.1007/978-3-319-30505-9\_29](https://doi.org/10.1007/978-3-319-30505-9_29) - -[[54](/en/ch9#VMware2011-marker)] VMware, Inc. -[Timekeeping in VMware Virtual -Machines](https://www.vmware.com/docs/vmware_timekeeping). *vmware.com*, October 2008. -Archived at [perma.cc/HM5R-T5NF](https://perma.cc/HM5R-T5NF) - -[[55](/en/ch9#Yodaiken2017-marker)] Victor Yodaiken. -[Clock -Synchronization in Finance and Beyond](https://www.yodaiken.com/wp-content/uploads/2018/05/financeandbeyond.pdf). *yodaiken.com*, November 2017. -Archived at [perma.cc/9XZD-8ZZN](https://perma.cc/9XZD-8ZZN) - -[[56](/en/ch9#EmreAcer2017-marker)] Mustafa Emre Acer, Emily Stark, Adrienne Porter -Felt, Sascha Fahl, Radhika Bhargava, Bhanu Dev, Matt Braithwaite, Ryan Sleevi, and Parisa Tabriz. -[Where the Wild Warnings Are: Root Causes -of Chrome HTTPS Certificate Errors](https://acmccs.github.io/papers/p1407-acerA.pdf). At *ACM SIGSAC Conference on Computer and -Communications Security* (CCS), pages 1407–1420, October 2017. 
-[doi:10.1145/3133956.3134007](https://doi.org/10.1145/3133956.3134007) - -[[57](/en/ch9#MiFID2015-marker)] European Securities and Markets Authority. -[MiFID -II / MiFIR: Regulatory Technical and Implementing Standards – Annex I](https://www.esma.europa.eu/sites/default/files/library/2015/11/2015-esma-1464_annex_i_-_draft_rts_and_its_on_mifid_ii_and_mifir.pdf). -*esma.europa.eu*, Report ESMA/2015/1464, September 2015. -Archived at [perma.cc/ZLX9-FGQ3](https://perma.cc/ZLX9-FGQ3) - -[[58](/en/ch9#Bigum2015-marker)] Luke Bigum. -[Solving -MiFID II Clock Synchronisation With Minimum Spend (Part 1)](https://catach.blogspot.com/2015/11/solving-mifid-ii-clock-synchronisation.html). *catach.blogspot.com*, -November 2015. Archived at [perma.cc/4J5W-FNM4](https://perma.cc/4J5W-FNM4) - -[[59](/en/ch9#Obleukhov2022-marker)] Oleg Obleukhov and Ahmad Byagowi. -[How -Precision Time Protocol is being deployed at Meta](https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/). *engineering.fb.com*, November 2022. -Archived at [perma.cc/29G6-UJNW](https://perma.cc/29G6-UJNW) - -[[60](/en/ch9#Wiseman2022-marker)] John Wiseman. -[gpsjam.org](https://gpsjam.org/), July 2022. - -[[61](/en/ch9#Levinson2023-marker)] Josh Levinson, Julien Ridoux, and Chris Munns. -[It’s -About Time: Microsecond-Accurate Clocks on Amazon EC2 Instances](https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/). *aws.amazon.com*, November 2023. -Archived at [perma.cc/56M6-5VMZ](https://perma.cc/56M6-5VMZ) - -[[62](/en/ch9#Kingsbury2013cassandra-marker)] Kyle Kingsbury. -[Call Me Maybe: Cassandra](https://aphyr.com/posts/294-call-me-maybe-cassandra/). -*aphyr.com*, September 2013. -Archived at [perma.cc/4MBR-J96V](https://perma.cc/4MBR-J96V) - -[[63](/en/ch9#Daily2013_ch9-marker)] John Daily. 
-[Clocks Are Bad, or, -Welcome to the Wonderful World of Distributed Systems](https://riak.com/clocks-are-bad-or-welcome-to-distributed-systems/). *riak.com*, November 2013. -Archived at [perma.cc/4XB5-UCXY](https://perma.cc/4XB5-UCXY) - -[[64](/en/ch9#Brooker2023time-marker)] Marc Brooker. -[It’s About Time!](https://brooker.co.za/blog/2023/11/27/about-time.html) -*brooker.co.za*, November 2023. -Archived at [perma.cc/N6YK-DRPA](https://perma.cc/N6YK-DRPA) - -[[65](/en/ch9#Kingsbury2013timestamps-marker)] Kyle Kingsbury. -[The Trouble with Timestamps](https://aphyr.com/posts/299-the-trouble-with-timestamps). -*aphyr.com*, October 2013. -Archived at [perma.cc/W3AM-5VAV](https://perma.cc/W3AM-5VAV) - -[[66](/en/ch9#Lamport1978_ch9-marker)] Leslie Lamport. -[Time, -Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, -volume 21, issue 7, pages 558–565, July 1978. -[doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563) - -[[67](/en/ch9#Sheehy2015-marker)] Justin Sheehy. -[There Is No Now: Problems With Simultaneity -in Distributed Systems](https://queue.acm.org/detail.cfm?id=2745385). *ACM Queue*, volume 13, issue 3, pages 36–41, March 2015. -[doi:10.1145/2733108](https://doi.org/10.1145/2733108) - -[[68](/en/ch9#Demirbas2013-marker)] Murat Demirbas. -[Spanner: -Google’s Globally-Distributed Database](https://muratbuffalo.blogspot.com/2013/07/spanner-googles-globally-distributed_4.html). *muratbuffalo.blogspot.co.uk*, July 2013. -Archived at [perma.cc/6VWR-C9WB](https://perma.cc/6VWR-C9WB) - -[[69](/en/ch9#Malkhi2013-marker)] Dahlia Malkhi and Jean-Philippe Martin. -[Spanner’s Concurrency -Control](https://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf). *ACM SIGACT News*, volume 44, issue 3, pages 73–77, September 2013. 
-[doi:10.1145/2527748.2527767](https://doi.org/10.1145/2527748.2527767) - -[[70](/en/ch9#Pachot2024-marker)] Franck Pachot. -[Achieving Precise Clock -Synchronization on AWS](https://www.yugabyte.com/blog/aws-clock-synchronization/). *yugabyte.com*, December 2024. -Archived at [perma.cc/UYM6-RNBS](https://perma.cc/UYM6-RNBS) - -[[71](/en/ch9#Kimball2022-marker)] Spencer Kimball. -[Living Without Atomic -Clocks: Where CockroachDB and Spanner diverge](https://www.cockroachlabs.com/blog/living-without-atomic-clocks/). *cockroachlabs.com*, January 2022. -Archived at [perma.cc/AWZ7-RXFT](https://perma.cc/AWZ7-RXFT) - -[[72](/en/ch9#Demirbas2025-marker)] Murat Demirbas. -[Use of -Time in Distributed Databases (part 4): Synchronized clocks in production databases](https://muratbuffalo.blogspot.com/2025/01/use-of-time-in-distributed-databases.html). -*muratbuffalo.blogspot.com*, January 2025. -Archived at [perma.cc/9WNX-Q9U3](https://perma.cc/9WNX-Q9U3) - -[[73](/en/ch9#Gray1989-marker)] Cary G. Gray and David R. Cheriton. -[Leases: An Efficient -Fault-Tolerant Mechanism for Distributed File Cache Consistency](https://courses.cs.duke.edu/spring11/cps210/papers/p202-gray.pdf). At -*12th ACM Symposium on Operating Systems Principles* (SOSP), December 1989. -[doi:10.1145/74850.74870](https://doi.org/10.1145/74850.74870) - -[[74](/en/ch9#Sturman2022-marker)] Daniel Sturman, Scott Delap, Max Ross, et al. -[Roblox -Return to Service](https://corp.roblox.com/newsroom/2022/01/roblox-return-to-service-10-28-10-31-2021). *corp.roblox.com*, January 2022. -Archived at [perma.cc/8ALT-WAS4](https://perma.cc/8ALT-WAS4) - -[[75](/en/ch9#Lipcon2011-marker)] Todd Lipcon. -[Avoiding Full GCs -with MemStore-Local Allocation Buffers](https://www.slideshare.net/slideshow/hbase-hug-presentation/7038178). *slideshare.net*, February 2011. 
-Archived at - -[[76](/en/ch9#Clark2005-marker)] Christopher Clark, Keir Fraser, Steven Hand, -Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. -[Live -Migration of Virtual Machines](https://www.usenix.org/legacy/publications/library/proceedings/nsdi05/tech/full_papers/clark/clark.pdf). At *2nd USENIX Symposium on Symposium on -Networked Systems Design & Implementation* (NSDI), May 2005. - -[[77](/en/ch9#Shaver2008-marker)] Mike Shaver. -[fsyncers and -Curveballs](https://web.archive.org/web/20220107141023/http%3A//shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/). *shaver.off.net*, May 2008. Archived at -[archive.org](https://web.archive.org/web/20220107141023/http%3A//shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/) - -[[78](/en/ch9#Zhuang2016-marker)] Zhenyun Zhuang and Cuong Tran. -[Eliminating -Large JVM GC Pauses Caused by Background IO Traffic](https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic). *engineering.linkedin.com*, February 2016. -Archived at [perma.cc/ML2M-X9XT](https://perma.cc/ML2M-X9XT) - -[[79](/en/ch9#Thompson2013-marker)] Martin Thompson. -[Java -Garbage Collection Distilled](https://mechanical-sympathy.blogspot.com/2013/07/java-garbage-collection-distilled.html). *mechanical-sympathy.blogspot.co.uk*, July 2013. -Archived at [perma.cc/DJT3-NQLQ](https://perma.cc/DJT3-NQLQ) - -[[80](/en/ch9#Terei2015-marker)] David Terei and Amit Levy. -[Blade: A Data Center Garbage Collector](https://arxiv.org/pdf/1504.02578). -arXiv:1504.02578, April 2015. - -[[81](/en/ch9#Maas2015-marker)] Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz. -[Trash Day: Coordinating Garbage Collection in -Distributed Systems](https://timharris.uk/papers/2015-hotos.pdf). At *15th USENIX Workshop on Hot Topics in Operating Systems* -(HotOS), May 2015. - -[[82](/en/ch9#Fowler2011_ch9-marker)] Martin Fowler. 
-[The LMAX Architecture](https://martinfowler.com/articles/lmax.html). -*martinfowler.com*, July 2011. -Archived at [perma.cc/5AV4-N6RJ](https://perma.cc/5AV4-N6RJ) - -[[83](/en/ch9#Halpern1990-marker)] Joseph Y. Halpern and Yoram Moses. -[Knowledge and common knowledge -in a distributed environment](https://groups.csail.mit.edu/tds/papers/Halpern/JACM90.pdf). *Journal of the ACM* (JACM), volume 37, issue 3, pages -549–587, July 1990. -[doi:10.1145/79147.79161](https://doi.org/10.1145/79147.79161) - -[[84](/en/ch9#Tang2022-marker)] Chuzhe Tang, Zhaoguo Wang, Xiaodong Zhang, Qianmian -Yu, Binyu Zang, Haibing Guan, and Haibo Chen. -[Ad Hoc Transactions -in Web Applications: The Good, the Bad, and the Ugly](https://ipads.se.sjtu.edu.cn/_media/publications/concerto-sigmod22.pdf). At *ACM International Conference on -Management of Data* (SIGMOD), June 2022. -[doi:10.1145/3514221.3526120](https://doi.org/10.1145/3514221.3526120) - -[[85](/en/ch9#Junqueira2013_ch9-marker)] Flavio P. Junqueira and Benjamin Reed. -[*ZooKeeper: Distributed -Process Coordination*](https://www.oreilly.com/library/view/zookeeper/9781449361297/). O’Reilly Media, 2013. ISBN: 978-1-449-36130-3 - -[[86](/en/ch9#Soztutar2013hdfs-marker)] Enis Söztutar. -[HBase -and HDFS: Understanding Filesystem Usage in HBase](https://www.slideshare.net/slideshow/hbase-and-hdfs-understanding-filesystem-usage/22990858). At *HBaseCon*, June 2013. -Archived at [perma.cc/4DXR-9P88](https://perma.cc/4DXR-9P88) - -[[87](/en/ch9#SUSE2025-marker)] SUSE LLC. -[SUSE -Linux Enterprise High Availability 15 SP6 Administration Guide, Section 12: Fencing and STONITH](https://documentation.suse.com/sle-ha/15-SP6/html/SLE-HA-all/cha-ha-fencing.html). -*documentation.suse.com*, March 2025. -Archived at [perma.cc/8LAR-EL9D](https://perma.cc/8LAR-EL9D) - -[[88](/en/ch9#Burrows2006_ch9-marker)] Mike Burrows. -[The Chubby Lock Service for Loosely-Coupled -Distributed Systems](https://research.google/pubs/pub27897/). 
At *7th USENIX Symposium on Operating System Design and -Implementation* (OSDI), November 2006. - -[[89](/en/ch9#Kingsbury2020etcd-marker)] Kyle Kingsbury. -[etcd 3.4.3](https://jepsen.io/analyses/etcd-3.4.3). *jepsen.io*, January 2020. -Archived at [perma.cc/2P3Y-MPWU](https://perma.cc/2P3Y-MPWU) - -[[90](/en/ch9#BasriKahveci2019-marker)] Ensar Basri Kahveci. -[Distributed Locks are Dead; Long -Live Distributed Locks!](https://hazelcast.com/blog/long-live-distributed-locks/) *hazelcast.com*, April 2019. -Archived at [perma.cc/7FS5-LDXE](https://perma.cc/7FS5-LDXE) - -[[91](/en/ch9#Kleppmann2016-marker)] Martin Kleppmann. -[How to do -distributed locking](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html). *martin.kleppmann.com*, February 2016. -Archived at [perma.cc/Y24W-YQ5L](https://perma.cc/Y24W-YQ5L) - -[[92](/en/ch9#Sanfilippo2016-marker)] Salvatore Sanfilippo. -[Is Redlock safe?](https://antirez.com/news/101) *antirez.com*, February 2016. -Archived at [perma.cc/B6GA-9Q6A](https://perma.cc/B6GA-9Q6A) - -[[93](/en/ch9#Morling2024_ch9-marker)] Gunnar Morling. -[Leader -Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024. -Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y) - -[[94](/en/ch9#Lamport1982-marker)] Leslie Lamport, Robert Shostak, and Marshall Pease. -[The -Byzantine Generals Problem](https://www.microsoft.com/en-us/research/publication/byzantine-generals-problem/). *ACM Transactions on Programming Languages and Systems* -(TOPLAS), volume 4, issue 3, pages 382–401, July 1982. -[doi:10.1145/357172.357176](https://doi.org/10.1145/357172.357176) - -[[95](/en/ch9#Gray1978-marker)] Jim N. Gray. -[Notes on Data Base -Operating Systems](https://jimgray.azurewebsites.net/papers/dbos.pdf). in *Operating Systems: An Advanced Course*, Lecture -Notes in Computer Science, volume 60, edited by R. Bayer, R. M. Graham, and G. 
Seegmüller, -pages 393–481, Springer-Verlag, 1978. ISBN: 978-3-540-08755-7. -Archived at [perma.cc/7S9M-2LZU](https://perma.cc/7S9M-2LZU) - -[[96](/en/ch9#Palmer2011-marker)] Brian Palmer. -[How -Complicated Was the Byzantine Empire?](https://slate.com/news-and-politics/2011/10/the-byzantine-tax-code-how-complicated-was-byzantium-anyway.html) *slate.com*, October 2011. -Archived at [perma.cc/AN7X-FL3N](https://perma.cc/AN7X-FL3N) - -[[97](/en/ch9#LamportPubs-marker)] Leslie Lamport. -[My Writings](https://lamport.azurewebsites.net/pubs/pubs.html). -*lamport.azurewebsites.net*, December 2014. -Archived at [perma.cc/5NNM-SQGR](https://perma.cc/5NNM-SQGR) - -[[98](/en/ch9#Rushby2001-marker)] John Rushby. -[Bus Architectures for -Safety-Critical Embedded Systems](https://www.csl.sri.com/papers/emsoft01/emsoft01.pdf). At *1st International Workshop on Embedded Software* -(EMSOFT), October 2001. -[doi:10.1007/3-540-45449-7\_22](https://doi.org/10.1007/3-540-45449-7_22) - -[[99](/en/ch9#Edge2013-marker)] Jake Edge. -[ELC: SpaceX Lessons Learned](https://lwn.net/Articles/540368/). *lwn.net*, -March 2013. Archived at [perma.cc/AYX8-QP5X](https://perma.cc/AYX8-QP5X) - -[[100](/en/ch9#Bano2019_ch9-marker)] Shehar Bano, Alberto Sonnino, Mustafa -Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis. -[SoK: Consensus in the Age of Blockchains](https://smeiklej.com/files/aft19a.pdf). At -*1st ACM Conference on Advances in Financial Technologies* (AFT), October 2019. -[doi:10.1145/3318041.3355458](https://doi.org/10.1145/3318041.3355458) - -[[101](/en/ch9#Feilden2024-marker)] Ezra Feilden, Adi Oltean, and Philip Johnston. -[Why we should train AI in space](https://www.starcloud.com/wp). -White Paper, *starcloud.com*, September 2024. -Archived at [perma.cc/7Y3S-8UB6](https://perma.cc/7Y3S-8UB6) - -[[102](/en/ch9#Mickens2013-marker)] James Mickens. -[The Saddest -Moment](https://www.usenix.org/system/files/login-logout_1305_mickens.pdf). 
*USENIX ;login*, May 2013. -Archived at [perma.cc/T7BZ-XCFR](https://perma.cc/T7BZ-XCFR) - -[[103](/en/ch9#Kleppmann2020-marker)] Martin Kleppmann and Heidi Howard. -[Byzantine Eventual Consistency and the Fundamental Limits -of Peer-to-Peer Databases](https://arxiv.org/abs/2012.00472). *arxiv.org*, December 2020. -[doi:10.48550/arXiv.2012.00472](https://doi.org/10.48550/arXiv.2012.00472) - -[[104](/en/ch9#Kleppmann2022-marker)] Martin Kleppmann. -[Making CRDTs Byzantine Fault -Tolerant](https://martin.kleppmann.com/papers/bft-crdt-papoc22.pdf). At *9th Workshop on Principles and Practice of Consistency for Distributed -Data* (PaPoC), April 2022. -[doi:10.1145/3517209.3524042](https://doi.org/10.1145/3517209.3524042) - -[[105](/en/ch9#Gilman2015-marker)] Evan Gilman. -[The -Discovery of Apache ZooKeeper’s Poison Packet](https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/). *pagerduty.com*, May 2015. -Archived at [perma.cc/RV6L-Y5CQ](https://perma.cc/RV6L-Y5CQ) - -[[106](/en/ch9#Stone2000-marker)] Jonathan Stone and Craig Partridge. -[When -the CRC and TCP Checksum Disagree](https://conferences2.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf). At *ACM Conference on Applications, -Technologies, Architectures, and Protocols for Computer Communication* (SIGCOMM), August 2000. -[doi:10.1145/347059.347561](https://doi.org/10.1145/347059.347561) - -[[107](/en/ch9#Jones2015-marker)] Evan Jones. -[How Both TCP and Ethernet -Checksums Fail](https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html). *evanjones.ca*, October 2015. -Archived at [perma.cc/9T5V-B8X5](https://perma.cc/9T5V-B8X5) - -[[108](/en/ch9#Dwork1988_ch9-marker)] Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. -[Consensus in the -Presence of Partial Synchrony](https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf). *Journal of the ACM*, volume 35, issue 2, pages 288–323, -April 1988. 
[doi:10.1145/42282.42283](https://doi.org/10.1145/42282.42283) - -[[109](/en/ch9#Schlichting1983-marker)] Richard D. Schlichting and Fred B. Schneider. -[Fail-stop processors: an -approach to designing fault-tolerant computing systems](https://www.cs.cornell.edu/fbs/publications/Fail_Stop.pdf). *ACM Transactions on Computer -Systems* (TOCS), volume 1, issue 3, pages 222–238, August 1983. -[doi:10.1145/357369.357371](https://doi.org/10.1145/357369.357371) - -[[110](/en/ch9#Do2013-marker)] Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, -Tiratat Patana-anake, and Haryadi S. Gunawi. -[Limplock: Understanding the Impact -of Limpware on Scale-out Cloud Systems](https://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf). At *4th ACM Symposium on Cloud Computing* -(SoCC), October 2013. -[doi:10.1145/2523616.2523627](https://doi.org/10.1145/2523616.2523627) - -[[111](/en/ch9#Snyder2019-marker)] Josh Snyder and Joseph Lynch. -[Garbage collecting -unhealthy JVMs, a proactive approach](https://netflixtechblog.medium.com/introducing-jvmquake-ec944c60ba70). Netflix Technology Blog, -*netflixtechblog.medium.com*, November 2019. -Archived at [perma.cc/8BTA-N3YB](https://perma.cc/8BTA-N3YB) - -[[112](/en/ch9#Gunawi2018_ch9-marker)] Haryadi S. Gunawi, Riza O. Suminto, Russell -Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah -Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree -Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. -[Fail-Slow at -Scale: Evidence of Hardware Performance Faults in Large Production Systems](https://www.usenix.org/system/files/conference/fast18/fast18-gunawi.pdf). -At *16th USENIX Conference on File and Storage Technologies*, February 2018. - -[[113](/en/ch9#Huang2017_ch9-marker)] Peng Huang, Chuanxiong Guo, Lidong Zhou, -Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. 
-[Gray -Failure: The Achilles’ Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in -Operating Systems* (HotOS), May 2017. -[doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005) - -[[114](/en/ch9#Lou2020-marker)] Chang Lou, Peng Huang, and Scott Smith. -[Understanding, Detecting and -Localizing Partial Failures in Large System Software](https://www.usenix.org/conference/nsdi20/presentation/lou). At *17th USENIX Symposium on -Networked Systems Design and Implementation* (NSDI), February 2020. - -[[115](/en/ch9#Bailis2013_ch9-marker)] Peter Bailis and Ali Ghodsi. -[Eventual Consistency Today: Limitations, -Extensions, and Beyond](https://queue.acm.org/detail.cfm?id=2462076). *ACM Queue*, volume 11, issue 3, pages 55-63, March 2013. -[doi:10.1145/2460276.2462076](https://doi.org/10.1145/2460276.2462076) - -[[116](/en/ch9#Alpern1985-marker)] Bowen Alpern and Fred B. Schneider. -[Defining Liveness](https://www.cs.cornell.edu/fbs/publications/DefLiveness.pdf). -*Information Processing Letters*, volume 21, issue 4, pages 181–185, October 1985. -[doi:10.1016/0020-0190(85)90056-0](https://doi.org/10.1016/0020-0190%2885%2990056-0) - -[[117](/en/ch9#Junqueira2015-marker)] Flavio P. Junqueira. -[Dude, Where’s My Metadata?](https://fpj.me/2015/05/28/dude-wheres-my-metadata/) -*fpj.me*, May 2015. -Archived at [perma.cc/D2EU-Y9S5](https://perma.cc/D2EU-Y9S5) - -[[118](/en/ch9#Sanders2016-marker)] Scott Sanders. -[January 28th Incident -Report](https://github.com/blog/2106-january-28th-incident-report). *github.com*, February 2016. -Archived at [perma.cc/5GZR-88TV](https://perma.cc/5GZR-88TV) - -[[119](/en/ch9#Kreps2013-marker)] Jay Kreps. -[A Few Notes -on Kafka and Jepsen](https://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen). *blog.empathybox.com*, September 2013. 
-[perma.cc/XJ5C-F583](https://perma.cc/XJ5C-F583) - -[[120](/en/ch9#Brooker2024correctness-marker)] Marc Brooker and Ankush Desai. -[Systems Correctness Practices at AWS](https://dl.acm.org/doi/pdf/10.1145/3712057). -*Queue, Volume 22, Issue 6*, November/December 2024. -[doi:10.1145/3712057](https://doi.org/10.1145/3712057) - -[[121](/en/ch9#SatarinTesting-marker)] Andrey Satarin. -[Testing Distributed Systems: -Curated list of resources on testing distributed systems](https://asatarin.github.io/testing-distributed-systems/). *asatarin.github.io*. -Archived at [perma.cc/U5V8-XP24](https://perma.cc/U5V8-XP24) - -[[122](/en/ch9#Vanlightly2024-marker)] Jack Vanlightly. -[Verifying Kafka transactions - Diary entry 2 - Writing an initial TLA+ spec](https://jack-vanlightly.com/analyses/2024/12/3/verifying-kafka-transactions-diary-entry-2-writing-an-initial-tla-spec). -*jack-vanlightly.com*, December 2024. -Archived at [perma.cc/NSQ8-MQ5N](https://perma.cc/NSQ8-MQ5N) - -[[123](/en/ch9#Tang2018-marker)] Siddon Tang. -[From Chaos to Order — Tools and -Techniques for Testing TiDB, A Distributed NewSQL Database](https://www.pingcap.com/blog/chaos-practice-in-tidb/). *pingcap.com*, April 2018. -Archived at [perma.cc/5EJB-R29F](https://perma.cc/5EJB-R29F) - -[[124](/en/ch9#VanBenschoten2019-marker)] Nathan VanBenschoten. -[Parallel Commits: An atomic commit -protocol for globally distributed transactions](https://www.cockroachlabs.com/blog/parallel-commits/). *cockroachlabs.com*, November 2019. -Archived at [perma.cc/5FZ7-QK6J](https://perma.cc/5FZ7-QK6J%20) - -[[125](/en/ch9#Vanlightly2022-marker)] Jack Vanlightly. -[Paper: VR Revisited - State Transfer (part 3)](https://jack-vanlightly.com/analyses/2022/12/28/paper-vr-revisited-state-transfer-part-3). -*jack-vanlightly.com*, December 2022. -Archived at [perma.cc/KNK3-K6WS](https://perma.cc/KNK3-K6WS) - -[[126](/en/ch9#Wayne2024-marker)] Hillel Wayne. 
-[What if -the spec doesn’t match the code?](https://buttondown.com/hillelwayne/archive/what-if-the-spec-doesnt-match-the-code/) *buttondown.com*, March 2024. -Archived at [perma.cc/8HEZ-KHER](https://perma.cc/8HEZ-KHER) - -[[127](/en/ch9#Ouyang2025-marker)] Lingzhi Ouyang, Xudong Sun, Ruize Tang, Yu Huang, -Madhav Jivrajani, Xiaoxing Ma, Tianyin Xu. -[Multi-Grained Specifications for Distributed System Model -Checking and Verification](https://arxiv.org/abs/2409.14301). At *20th European Conference on Computer Systems* (EuroSys), -March 2025. [doi:10.1145/3689031.3696069](https://doi.org/10.1145/3689031.3696069) - -[[128](/en/ch9#Izrailevsky2011-marker)] Yury Izrailevsky and Ariel Tseitlin. -[The Netflix Simian Army](https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116). -*netflixtechblog.com*, July, 2011. -Archived at [perma.cc/M3NY-FJW6](https://perma.cc/M3NY-FJW6) - -[[129](/en/ch9#Kingsbury2013jepsen-marker)] Kyle Kingsbury. -[Jepsen: On the perils of network partitions](https://aphyr.com/posts/281-jepsen-on-the-perils-of-network-partitions). -*aphyr.com*, May, 2013. -Archived at [perma.cc/W98G-6HQP](https://perma.cc/W98G-6HQP) - -[[130](/en/ch9#Kingsbury2024-marker)] Kyle Kingsbury. -[Jepsen Analyses](https://jepsen.io/analyses). *jepsen.io*, 2024. -Archived at [perma.cc/8LDN-D2T8](https://perma.cc/8LDN-D2T8) - -[[131](/en/ch9#Majumdar2017-marker)] Rupak Majumdar and Filip Niksic. -[Why is random testing effective for partition -tolerance bugs?](https://dl.acm.org/doi/pdf/10.1145/3158134) *Proceedings of the ACM on Programming Languages* (PACMPL), volume 2, -issue POPL, article no. 46, December 2017. -[doi:10.1145/3158134](https://doi.org/10.1145/3158134) - -[[132](/en/ch9#FoundationDB_ch9-marker)] FoundationDB project authors. -[Simulation and Testing](https://apple.github.io/foundationdb/testing.html). -*apple.github.io*. -Archived at [perma.cc/NQ3L-PM4C](https://perma.cc/NQ3L-PM4C) - -[[133](/en/ch9#Kladov2023-marker)] Alex Kladov. 
-[Simulation -Testing For Liveness](https://tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness/). *tigerbeetle.com*, July 2023. -Archived at [perma.cc/RKD4-HGCR](https://perma.cc/RKD4-HGCR) - -[[134](/en/ch9#Marques2024-marker)] Alfonso Subiotto Marqués. -[(Mostly) -Deterministic Simulation Testing in Go](https://www.polarsignals.com/blog/posts/2024/05/28/mostly-dst-in-go). *polarsignals.com*, May 2024. -Archived at [perma.cc/ULD6-TSA4](https://perma.cc/ULD6-TSA4) +[^1]: Mark Cavage. [There’s Just No Getting Around It: You’re Building a Distributed System](https://queue.acm.org/detail.cfm?id=2482856). *ACM Queue*, volume 11, issue 4, pages 80-89, April 2013. [doi:10.1145/2466486.2482856](https://doi.org/10.1145/2466486.2482856) +[^2]: Jay Kreps. [Getting Real About Distributed System Reliability](https://blog.empathybox.com/post/19574936361/getting-real-about-distributed-system-reliability). *blog.empathybox.com*, March 2012. Archived at [perma.cc/9B5Q-AEBW](https://perma.cc/9B5Q-AEBW) +[^3]: Coda Hale. [You Can’t Sacrifice Partition Tolerance](https://codahale.com/you-cant-sacrifice-partition-tolerance/). *codahale.com*, October 2010. +[^4]: Jeff Hodges. [Notes on Distributed Systems for Young Bloods](https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/). *somethingsimilar.com*, January 2013. Archived at [perma.cc/B636-62CE](https://perma.cc/B636-62CE) +[^5]: Van Jacobson. [Congestion Avoidance and Control](https://www.cs.usask.ca/ftp/pub/discus/seminars2002-2003/p314-jacobson.pdf). At *ACM Symposium on Communications Architectures and Protocols* (SIGCOMM), August 1988. [doi:10.1145/52324.52356](https://doi.org/10.1145/52324.52356) +[^6]: Bert Hubert. [The Ultimate SO\_LINGER Page, or: Why Is My TCP Not Reliable](https://blog.netherlabs.nl/articles/2009/01/18/the-ultimate-so_linger-page-or-why-is-my-tcp-not-reliable). *blog.netherlabs.nl*, January 2009. 
Archived at [perma.cc/6HDX-L2RR](https://perma.cc/6HDX-L2RR) +[^7]: Jerome H. Saltzer, David P. Reed, and David D. Clark. [End-To-End Arguments in System Design](https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf). *ACM Transactions on Computer Systems*, volume 2, issue 4, pages 277–288, November 1984. [doi:10.1145/357401.357402](https://doi.org/10.1145/357401.357402) +[^8]: Peter Bailis and Kyle Kingsbury. [The Network Is Reliable](https://queue.acm.org/detail.cfm?id=2655736). *ACM Queue*, volume 12, issue 7, pages 48-55, July 2014. [doi:10.1145/2639988.2639988](https://doi.org/10.1145/2639988.2639988) +[^9]: Joshua B. Leners, Trinabh Gupta, Marcos K. Aguilera, and Michael Walfish. [Taming Uncertainty in Distributed Systems with Help from the Network](https://cs.nyu.edu/~mwalfish/papers/albatross-eurosys15.pdf). At *10th European Conference on Computer Systems* (EuroSys), April 2015. [doi:10.1145/2741948.2741976](https://doi.org/10.1145/2741948.2741976) +[^10]: Phillipa Gill, Navendu Jain, and Nachiappan Nagappan. [Understanding Network Failures in Data Centers: Measurement, Analysis, and Implications](https://conferences.sigcomm.org/sigcomm/2011/papers/sigcomm/p350.pdf). At *ACM SIGCOMM Conference*, August 2011. [doi:10.1145/2018436.2018477](https://doi.org/10.1145/2018436.2018477) +[^11]: Urs Hölzle. [But recently a farmer had started grazing a herd of cows nearby. And whenever they stepped on the fiber link, they bent it enough to cause a blip](https://x.com/uhoelzle/status/1263333283107991558). *x.com*, May 2020. Archived at [perma.cc/WX8X-ZZA5](https://perma.cc/WX8X-ZZA5) +[^12]: CBC News. [Hundreds lose internet service in northern B.C. after beaver chews through cable](https://www.cbc.ca/news/canada/british-columbia/beaver-internet-down-tumbler-ridge-1.6001594). *cbc.ca*, April 2021. Archived at [perma.cc/UW8C-H2MY](https://perma.cc/UW8C-H2MY) +[^13]: Will Oremus. 
[The Global Internet Is Being Attacked by Sharks, Google Confirms](https://slate.com/technology/2014/08/shark-attacks-threaten-google-s-undersea-internet-cables-video.html). *slate.com*, August 2014. Archived at [perma.cc/P6F3-C6YG](https://perma.cc/P6F3-C6YG) +[^14]: Jess Auerbach Jahajeeah. [Down to the wire: The ship fixing our internet](https://continent.substack.com/p/down-to-the-wire-the-ship-fixing). *continent.substack.com*, November 2023. Archived at [perma.cc/DP7B-EQ7S](https://perma.cc/DP7B-EQ7S) +[^15]: Santosh Janardhan. [More details about the October 4 outage](https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/). *engineering.fb.com*, October 2021. Archived at [perma.cc/WW89-VSXH](https://perma.cc/WW89-VSXH) +[^16]: Tom Parfitt. [Georgian woman cuts off web access to whole of Armenia](https://www.theguardian.com/world/2011/apr/06/georgian-woman-cuts-web-access). *theguardian.com*, April 2011. Archived at [perma.cc/KMC3-N3NZ](https://perma.cc/KMC3-N3NZ) +[^17]: Antonio Voce, Tural Ahmedzade and Ashley Kirk. [‘Shadow fleets’ and subaquatic sabotage: are Europe’s undersea internet cables under attack?](https://www.theguardian.com/world/ng-interactive/2025/mar/05/shadow-fleets-subaquatic-sabotage-europe-undersea-internet-cables-under-attack) *theguardian.com*, March 2025. Archived at [perma.cc/HA7S-ZDBV](https://perma.cc/HA7S-ZDBV) +[^18]: Shengyun Liu, Paolo Viotti, Christian Cachin, Vivien Quéma, and Marko Vukolić. [XFT: Practical Fault Tolerance beyond Crashes](https://www.usenix.org/system/files/conference/osdi16/osdi16-liu.pdf). At *12th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), November 2016. +[^19]: Mark Imbriaco. [Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/). *github.blog*, December 2012. Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ) +[^20]: Tom Lianza and Chris Snook. 
[A Byzantine failure in the real world](https://blog.cloudflare.com/a-byzantine-failure-in-the-real-world/). *blog.cloudflare.com*, November 2020. Archived at [perma.cc/83EZ-ALCY](https://perma.cc/83EZ-ALCY) +[^21]: Mohammed Alfatafta, Basil Alkhatib, Ahmed Alquraan, and Samer Al-Kiswany. [Toward a Generic Fault Tolerance Technique for Partial Network Partitioning](https://www.usenix.org/conference/osdi20/presentation/alfatafta). At *14th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), November 2020. +[^22]: Marc A. Donges. [Re: bnx2 cards Intermittantly Going Offline](https://www.spinics.net/lists/netdev/msg210485.html). Message to Linux *netdev* mailing list, *spinics.net*, September 2012. Archived at [perma.cc/TXP6-H8R3](https://perma.cc/TXP6-H8R3) +[^23]: Troy Toman. [Inside a CODE RED: Network Edition](https://signalvnoise.com/svn3/inside-a-code-red-network-edition/). *signalvnoise.com*, September 2020. Archived at [perma.cc/BET6-FY25](https://perma.cc/BET6-FY25) +[^24]: Kyle Kingsbury. [Call Me Maybe: Elasticsearch](https://aphyr.com/posts/317-call-me-maybe-elasticsearch). *aphyr.com*, June 2014. Archived at [perma.cc/JK47-S89J](https://perma.cc/JK47-S89J) +[^25]: Salvatore Sanfilippo. [A Few Arguments About Redis Sentinel Properties and Fail Scenarios](https://antirez.com/news/80). *antirez.com*, October 2014. Archived at [perma.cc/8XEU-CLM8](https://perma.cc/8XEU-CLM8) +[^26]: Nicolas Liochon. [CAP: If All You Have Is a Timeout, Everything Looks Like a Partition](http://blog.thislongrun.com/2015/05/CAP-theorem-partition-timeout-zookeeper.html). *blog.thislongrun.com*, May 2015. Archived at [perma.cc/FS57-V2PZ](https://perma.cc/FS57-V2PZ) +[^27]: Matthew P. Grosvenor, Malte Schwarzkopf, Ionel Gog, Robert N. M. Watson, Andrew W. Moore, Steven Hand, and Jon Crowcroft.
[Queues Don’t Matter When You Can JUMP Them!](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-grosvenor_update.pdf) At *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015. +[^28]: Theo Julienne. [Debugging network stalls on Kubernetes](https://github.blog/engineering/debugging-network-stalls-on-kubernetes/). *github.blog*, November 2019. Archived at [perma.cc/K9M8-XVGL](https://perma.cc/K9M8-XVGL) +[^29]: Guohui Wang and T. S. Eugene Ng. [The Impact of Virtualization on Network Performance of Amazon EC2 Data Center](https://www.cs.rice.edu/~eugeneng/papers/INFOCOM10-ec2.pdf). At *29th IEEE International Conference on Computer Communications* (INFOCOM), March 2010. [doi:10.1109/INFCOM.2010.5461931](https://doi.org/10.1109/INFCOM.2010.5461931) +[^30]: Brandon Philips. [etcd: Distributed Locking and Service Discovery](https://www.youtube.com/watch?v=HJIjTTHWYnE). At *Strange Loop*, September 2014. +[^31]: Steve Newman. [A Systematic Look at EC2 I/O](https://www.sentinelone.com/blog/a-systematic-look-at-ec2-i-o/). *blog.scalyr.com*, October 2012. Archived at [perma.cc/FL4R-H2VE](https://perma.cc/FL4R-H2VE) +[^32]: Naohiro Hayashibara, Xavier Défago, Rami Yared, and Takuya Katayama. [The ϕ Accrual Failure Detector](https://hdl.handle.net/10119/4784). Japan Advanced Institute of Science and Technology, School of Information Science, Technical Report IS-RR-2004-010, May 2004. Archived at [perma.cc/NSM2-TRYA](https://perma.cc/NSM2-TRYA) +[^33]: Jeffrey Wang. [Phi Accrual Failure Detector](https://ternarysearch.blogspot.com/2013/08/phi-accrual-failure-detector.html). *ternarysearch.blogspot.co.uk*, August 2013. Archived at [perma.cc/L452-AMLV](https://perma.cc/L452-AMLV) +[^34]: Srinivasan Keshav. *An Engineering Approach to Computer Networking: ATM Networks, the Internet, and the Telephone Network*. Addison-Wesley Professional, May 1997. ISBN: 978-0-201-63442-6 +[^35]: Othmar Kyas. *ATM Networks*.
International Thomson Publishing, 1995. ISBN: 978-1-850-32128-6 +[^36]: Mellanox Technologies. [InfiniBand FAQ, Rev 1.3](https://network.nvidia.com/related-docs/whitepapers/InfiniBandFAQ_FQ_100.pdf). *network.nvidia.com*, December 2014. Archived at [perma.cc/LQJ4-QZVK](https://perma.cc/LQJ4-QZVK) +[^37]: Jose Renato Santos, Yoshio Turner, and G. (John) Janakiraman. [End-to-End Congestion Control for InfiniBand](https://infocom2003.ieee-infocom.org/papers/28_01.PDF). At *22nd Annual Joint Conference of the IEEE Computer and Communications Societies* (INFOCOM), April 2003. Also published by HP Laboratories Palo Alto, Tech Report HPL-2002-359. [doi:10.1109/INFCOM.2003.1208949](https://doi.org/10.1109/INFCOM.2003.1208949) +[^38]: Jialin Li, Naveen Kr. Sharma, Dan R. K. Ports, and Steven D. Gribble. [Tales of the Tail: Hardware, OS, and Application-level Sources of Tail Latency](https://syslab.cs.washington.edu/papers/latency-socc14.pdf). At *ACM Symposium on Cloud Computing* (SOCC), November 2014. [doi:10.1145/2670979.2670988](https://doi.org/10.1145/2670979.2670988) +[^39]: Ulrich Windl, David Dalton, Marc Martinec, and Dale R. Worley. [The NTP FAQ and HOWTO](https://www.ntp.org/ntpfaq/). *ntp.org*, November 2006. +[^40]: John Graham-Cumming. [How and why the leap second affected Cloudflare DNS](https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/). *blog.cloudflare.com*, January 2017. Archived at [archive.org](https://web.archive.org/web/20250202041444/https%3A//blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/) +[^41]: David Holmes. [Inside the Hotspot VM: Clocks, Timers and Scheduling Events – Part I – Windows](https://web.archive.org/web/20160308031939/https%3A//blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks). *blogs.oracle.com*, October 2006. 
Archived at [archive.org](https://web.archive.org/web/20160308031939/https%3A//blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks) +[^42]: Joran Dirk Greef. [Three Clocks are Better than One](https://tigerbeetle.com/blog/2021-08-30-three-clocks-are-better-than-one/). *tigerbeetle.com*, August 2021. Archived at [perma.cc/5RXG-EU6B](https://perma.cc/5RXG-EU6B) +[^43]: Oliver Yang. [Pitfalls of TSC usage](https://oliveryang.net/2015/09/pitfalls-of-TSC-usage/). *oliveryang.net*, September 2015. Archived at [perma.cc/Z2QY-5FRA](https://perma.cc/Z2QY-5FRA) +[^44]: Steve Loughran. [Time on Multi-Core, Multi-Socket Servers](https://steveloughran.blogspot.com/2015/09/time-on-multi-core-multi-socket-servers.html). *steveloughran.blogspot.co.uk*, September 2015. Archived at [perma.cc/7M4S-D4U6](https://perma.cc/7M4S-D4U6) +[^45]: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. [Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). At *10th USENIX Symposium on Operating System Design and Implementation* (OSDI), October 2012. +[^46]: M. Caporaloni and R. Ambrosini. [How Closely Can a Personal Computer Clock Track the UTC Timescale Via the Internet?](https://iopscience.iop.org/0143-0807/23/4/103/) *European Journal of Physics*, volume 23, issue 4, pages L17–L21, June 2002. [doi:10.1088/0143-0807/23/4/103](https://doi.org/10.1088/0143-0807/23/4/103) +[^47]: Nelson Minar. [A Survey of the NTP Network](https://alumni.media.mit.edu/~nelson/research/ntp-survey99/). *alumni.media.mit.edu*, December 1999. Archived at [perma.cc/EV76-7ZV3](https://perma.cc/EV76-7ZV3) +[^48]: Viliam Holub.
[Synchronizing Clocks in a Cassandra Cluster Pt. 1 – The Problem](https://blog.rapid7.com/2014/03/14/synchronizing-clocks-in-a-cassandra-cluster-pt-1-the-problem/). *blog.rapid7.com*, March 2014. Archived at [perma.cc/N3RV-5LNL](https://perma.cc/N3RV-5LNL) +[^49]: Poul-Henning Kamp. [The One-Second War (What Time Will You Die?)](https://queue.acm.org/detail.cfm?id=1967009) *ACM Queue*, volume 9, issue 4, pages 44–48, April 2011. [doi:10.1145/1966989.1967009](https://doi.org/10.1145/1966989.1967009) +[^50]: Nelson Minar. [Leap Second Crashes Half the Internet](https://www.somebits.com/weblog/tech/bad/leap-second-2012.html). *somebits.com*, July 2012. Archived at [perma.cc/2WB8-D6EU](https://perma.cc/2WB8-D6EU) +[^51]: Christopher Pascoe. [Time, Technology and Leaping Seconds](https://googleblog.blogspot.com/2011/09/time-technology-and-leaping-seconds.html). *googleblog.blogspot.co.uk*, September 2011. Archived at [perma.cc/U2JL-7E74](https://perma.cc/U2JL-7E74) +[^52]: Mingxue Zhao and Jeff Barr. [Look Before You Leap – The Coming Leap Second and AWS](https://aws.amazon.com/blogs/aws/look-before-you-leap-the-coming-leap-second-and-aws/). *aws.amazon.com*, May 2015. Archived at [perma.cc/KPE9-XMFM](https://perma.cc/KPE9-XMFM) +[^53]: Darryl Veitch and Kanthaiah Vijayalayan. [Network Timing and the 2015 Leap Second](https://opus.lib.uts.edu.au/bitstream/10453/43923/1/LeapSecond_camera.pdf). At *17th International Conference on Passive and Active Measurement* (PAM), April 2016. [doi:10.1007/978-3-319-30505-9\_29](https://doi.org/10.1007/978-3-319-30505-9_29) +[^54]: VMware, Inc. [Timekeeping in VMware Virtual Machines](https://www.vmware.com/docs/vmware_timekeeping). *vmware.com*, October 2008. Archived at [perma.cc/HM5R-T5NF](https://perma.cc/HM5R-T5NF) +[^55]: Victor Yodaiken. [Clock Synchronization in Finance and Beyond](https://www.yodaiken.com/wp-content/uploads/2018/05/financeandbeyond.pdf). *yodaiken.com*, November 2017. 
Archived at [perma.cc/9XZD-8ZZN](https://perma.cc/9XZD-8ZZN) +[^56]: Mustafa Emre Acer, Emily Stark, Adrienne Porter Felt, Sascha Fahl, Radhika Bhargava, Bhanu Dev, Matt Braithwaite, Ryan Sleevi, and Parisa Tabriz. [Where the Wild Warnings Are: Root Causes of Chrome HTTPS Certificate Errors](https://acmccs.github.io/papers/p1407-acerA.pdf). At *ACM SIGSAC Conference on Computer and Communications Security* (CCS), pages 1407–1420, October 2017. [doi:10.1145/3133956.3134007](https://doi.org/10.1145/3133956.3134007) +[^57]: European Securities and Markets Authority. [MiFID II / MiFIR: Regulatory Technical and Implementing Standards – Annex I](https://www.esma.europa.eu/sites/default/files/library/2015/11/2015-esma-1464_annex_i_-_draft_rts_and_its_on_mifid_ii_and_mifir.pdf). *esma.europa.eu*, Report ESMA/2015/1464, September 2015. Archived at [perma.cc/ZLX9-FGQ3](https://perma.cc/ZLX9-FGQ3) +[^58]: Luke Bigum. [Solving MiFID II Clock Synchronisation With Minimum Spend (Part 1)](https://catach.blogspot.com/2015/11/solving-mifid-ii-clock-synchronisation.html). *catach.blogspot.com*, November 2015. Archived at [perma.cc/4J5W-FNM4](https://perma.cc/4J5W-FNM4) +[^59]: Oleg Obleukhov and Ahmad Byagowi. [How Precision Time Protocol is being deployed at Meta](https://engineering.fb.com/2022/11/21/production-engineering/precision-time-protocol-at-meta/). *engineering.fb.com*, November 2022. Archived at [perma.cc/29G6-UJNW](https://perma.cc/29G6-UJNW) +[^60]: John Wiseman. [gpsjam.org](https://gpsjam.org/), July 2022. +[^61]: Josh Levinson, Julien Ridoux, and Chris Munns. [It’s About Time: Microsecond-Accurate Clocks on Amazon EC2 Instances](https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/). *aws.amazon.com*, November 2023. Archived at [perma.cc/56M6-5VMZ](https://perma.cc/56M6-5VMZ) +[^62]: Kyle Kingsbury. [Call Me Maybe: Cassandra](https://aphyr.com/posts/294-call-me-maybe-cassandra/). *aphyr.com*, September 2013. 
Archived at [perma.cc/4MBR-J96V](https://perma.cc/4MBR-J96V) +[^63]: John Daily. [Clocks Are Bad, or, Welcome to the Wonderful World of Distributed Systems](https://riak.com/clocks-are-bad-or-welcome-to-distributed-systems/). *riak.com*, November 2013. Archived at [perma.cc/4XB5-UCXY](https://perma.cc/4XB5-UCXY) +[^64]: Marc Brooker. [It’s About Time!](https://brooker.co.za/blog/2023/11/27/about-time.html) *brooker.co.za*, November 2023. Archived at [perma.cc/N6YK-DRPA](https://perma.cc/N6YK-DRPA) +[^65]: Kyle Kingsbury. [The Trouble with Timestamps](https://aphyr.com/posts/299-the-trouble-with-timestamps). *aphyr.com*, October 2013. Archived at [perma.cc/W3AM-5VAV](https://perma.cc/W3AM-5VAV) +[^66]: Leslie Lamport. [Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, volume 21, issue 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563) +[^67]: Justin Sheehy. [There Is No Now: Problems With Simultaneity in Distributed Systems](https://queue.acm.org/detail.cfm?id=2745385). *ACM Queue*, volume 13, issue 3, pages 36–41, March 2015. [doi:10.1145/2733108](https://doi.org/10.1145/2733108) +[^68]: Murat Demirbas. [Spanner: Google’s Globally-Distributed Database](https://muratbuffalo.blogspot.com/2013/07/spanner-googles-globally-distributed_4.html). *muratbuffalo.blogspot.co.uk*, July 2013. Archived at [perma.cc/6VWR-C9WB](https://perma.cc/6VWR-C9WB) +[^69]: Dahlia Malkhi and Jean-Philippe Martin. [Spanner’s Concurrency Control](https://www.cs.cornell.edu/~ie53/publications/DC-col51-Sep13.pdf). *ACM SIGACT News*, volume 44, issue 3, pages 73–77, September 2013. [doi:10.1145/2527748.2527767](https://doi.org/10.1145/2527748.2527767) +[^70]: Franck Pachot. [Achieving Precise Clock Synchronization on AWS](https://www.yugabyte.com/blog/aws-clock-synchronization/). 
*yugabyte.com*, December 2024. Archived at [perma.cc/UYM6-RNBS](https://perma.cc/UYM6-RNBS) +[^71]: Spencer Kimball. [Living Without Atomic Clocks: Where CockroachDB and Spanner diverge](https://www.cockroachlabs.com/blog/living-without-atomic-clocks/). *cockroachlabs.com*, January 2022. Archived at [perma.cc/AWZ7-RXFT](https://perma.cc/AWZ7-RXFT) +[^72]: Murat Demirbas. [Use of Time in Distributed Databases (part 4): Synchronized clocks in production databases](https://muratbuffalo.blogspot.com/2025/01/use-of-time-in-distributed-databases.html). *muratbuffalo.blogspot.com*, January 2025. Archived at [perma.cc/9WNX-Q9U3](https://perma.cc/9WNX-Q9U3) +[^73]: Cary G. Gray and David R. Cheriton. [Leases: An Efficient Fault-Tolerant Mechanism for Distributed File Cache Consistency](https://courses.cs.duke.edu/spring11/cps210/papers/p202-gray.pdf). At *12th ACM Symposium on Operating Systems Principles* (SOSP), December 1989. [doi:10.1145/74850.74870](https://doi.org/10.1145/74850.74870) +[^74]: Daniel Sturman, Scott Delap, Max Ross, et al. [Roblox Return to Service](https://corp.roblox.com/newsroom/2022/01/roblox-return-to-service-10-28-10-31-2021). *corp.roblox.com*, January 2022. Archived at [perma.cc/8ALT-WAS4](https://perma.cc/8ALT-WAS4) +[^75]: Todd Lipcon. [Avoiding Full GCs with MemStore-Local Allocation Buffers](https://www.slideshare.net/slideshow/hbase-hug-presentation/7038178). *slideshare.net*, February 2011. Archived at +[^76]: Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, and Andrew Warfield. [Live Migration of Virtual Machines](https://www.usenix.org/legacy/publications/library/proceedings/nsdi05/tech/full_papers/clark/clark.pdf). At *2nd USENIX Symposium on Networked Systems Design & Implementation* (NSDI), May 2005. +[^77]: Mike Shaver. [fsyncers and Curveballs](https://web.archive.org/web/20220107141023/http%3A//shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/).
*shaver.off.net*, May 2008. Archived at [archive.org](https://web.archive.org/web/20220107141023/http%3A//shaver.off.net/diary/2008/05/25/fsyncers-and-curveballs/) +[^78]: Zhenyun Zhuang and Cuong Tran. [Eliminating Large JVM GC Pauses Caused by Background IO Traffic](https://engineering.linkedin.com/blog/2016/02/eliminating-large-jvm-gc-pauses-caused-by-background-io-traffic). *engineering.linkedin.com*, February 2016. Archived at [perma.cc/ML2M-X9XT](https://perma.cc/ML2M-X9XT) +[^79]: Martin Thompson. [Java Garbage Collection Distilled](https://mechanical-sympathy.blogspot.com/2013/07/java-garbage-collection-distilled.html). *mechanical-sympathy.blogspot.co.uk*, July 2013. Archived at [perma.cc/DJT3-NQLQ](https://perma.cc/DJT3-NQLQ) +[^80]: David Terei and Amit Levy. [Blade: A Data Center Garbage Collector](https://arxiv.org/pdf/1504.02578). arXiv:1504.02578, April 2015. +[^81]: Martin Maas, Tim Harris, Krste Asanović, and John Kubiatowicz. [Trash Day: Coordinating Garbage Collection in Distributed Systems](https://timharris.uk/papers/2015-hotos.pdf). At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015. +[^82]: Martin Fowler. [The LMAX Architecture](https://martinfowler.com/articles/lmax.html). *martinfowler.com*, July 2011. Archived at [perma.cc/5AV4-N6RJ](https://perma.cc/5AV4-N6RJ) +[^83]: Joseph Y. Halpern and Yoram Moses. [Knowledge and common knowledge in a distributed environment](https://groups.csail.mit.edu/tds/papers/Halpern/JACM90.pdf). *Journal of the ACM* (JACM), volume 37, issue 3, pages 549–587, July 1990. [doi:10.1145/79147.79161](https://doi.org/10.1145/79147.79161) +[^84]: Chuzhe Tang, Zhaoguo Wang, Xiaodong Zhang, Qianmian Yu, Binyu Zang, Haibing Guan, and Haibo Chen. [Ad Hoc Transactions in Web Applications: The Good, the Bad, and the Ugly](https://ipads.se.sjtu.edu.cn/_media/publications/concerto-sigmod22.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2022. 
[doi:10.1145/3514221.3526120](https://doi.org/10.1145/3514221.3526120) +[^85]: Flavio P. Junqueira and Benjamin Reed. [*ZooKeeper: Distributed Process Coordination*](https://www.oreilly.com/library/view/zookeeper/9781449361297/). O’Reilly Media, 2013. ISBN: 978-1-449-36130-3 +[^86]: Enis Söztutar. [HBase and HDFS: Understanding Filesystem Usage in HBase](https://www.slideshare.net/slideshow/hbase-and-hdfs-understanding-filesystem-usage/22990858). At *HBaseCon*, June 2013. Archived at [perma.cc/4DXR-9P88](https://perma.cc/4DXR-9P88) +[^87]: SUSE LLC. [SUSE Linux Enterprise High Availability 15 SP6 Administration Guide, Section 12: Fencing and STONITH](https://documentation.suse.com/sle-ha/15-SP6/html/SLE-HA-all/cha-ha-fencing.html). *documentation.suse.com*, March 2025. Archived at [perma.cc/8LAR-EL9D](https://perma.cc/8LAR-EL9D) +[^88]: Mike Burrows. [The Chubby Lock Service for Loosely-Coupled Distributed Systems](https://research.google/pubs/pub27897/). At *7th USENIX Symposium on Operating System Design and Implementation* (OSDI), November 2006. +[^89]: Kyle Kingsbury. [etcd 3.4.3](https://jepsen.io/analyses/etcd-3.4.3). *jepsen.io*, January 2020. Archived at [perma.cc/2P3Y-MPWU](https://perma.cc/2P3Y-MPWU) +[^90]: Ensar Basri Kahveci. [Distributed Locks are Dead; Long Live Distributed Locks!](https://hazelcast.com/blog/long-live-distributed-locks/) *hazelcast.com*, April 2019. Archived at [perma.cc/7FS5-LDXE](https://perma.cc/7FS5-LDXE) +[^91]: Martin Kleppmann. [How to do distributed locking](https://martin.kleppmann.com/2016/02/08/how-to-do-distributed-locking.html). *martin.kleppmann.com*, February 2016. Archived at [perma.cc/Y24W-YQ5L](https://perma.cc/Y24W-YQ5L) +[^92]: Salvatore Sanfilippo. [Is Redlock safe?](https://antirez.com/news/101) *antirez.com*, February 2016. Archived at [perma.cc/B6GA-9Q6A](https://perma.cc/B6GA-9Q6A) +[^93]: Gunnar Morling. 
[Leader Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024. Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y) +[^94]: Leslie Lamport, Robert Shostak, and Marshall Pease. [The Byzantine Generals Problem](https://www.microsoft.com/en-us/research/publication/byzantine-generals-problem/). *ACM Transactions on Programming Languages and Systems* (TOPLAS), volume 4, issue 3, pages 382–401, July 1982. [doi:10.1145/357172.357176](https://doi.org/10.1145/357172.357176) +[^95]: Jim N. Gray. [Notes on Data Base Operating Systems](https://jimgray.azurewebsites.net/papers/dbos.pdf). In *Operating Systems: An Advanced Course*, Lecture Notes in Computer Science, volume 60, edited by R. Bayer, R. M. Graham, and G. Seegmüller, pages 393–481, Springer-Verlag, 1978. ISBN: 978-3-540-08755-7. Archived at [perma.cc/7S9M-2LZU](https://perma.cc/7S9M-2LZU) +[^96]: Brian Palmer. [How Complicated Was the Byzantine Empire?](https://slate.com/news-and-politics/2011/10/the-byzantine-tax-code-how-complicated-was-byzantium-anyway.html) *slate.com*, October 2011. Archived at [perma.cc/AN7X-FL3N](https://perma.cc/AN7X-FL3N) +[^97]: Leslie Lamport. [My Writings](https://lamport.azurewebsites.net/pubs/pubs.html). *lamport.azurewebsites.net*, December 2014. Archived at [perma.cc/5NNM-SQGR](https://perma.cc/5NNM-SQGR) +[^98]: John Rushby. [Bus Architectures for Safety-Critical Embedded Systems](https://www.csl.sri.com/papers/emsoft01/emsoft01.pdf). At *1st International Workshop on Embedded Software* (EMSOFT), October 2001. [doi:10.1007/3-540-45449-7\_22](https://doi.org/10.1007/3-540-45449-7_22) +[^99]: Jake Edge. [ELC: SpaceX Lessons Learned](https://lwn.net/Articles/540368/). *lwn.net*, March 2013. Archived at [perma.cc/AYX8-QP5X](https://perma.cc/AYX8-QP5X) +[^100]: Shehar Bano, Alberto Sonnino, Mustafa Al-Bassam, Sarah Azouvi, Patrick McCorry, Sarah Meiklejohn, and George Danezis.
[SoK: Consensus in the Age of Blockchains](https://smeiklej.com/files/aft19a.pdf). At *1st ACM Conference on Advances in Financial Technologies* (AFT), October 2019. [doi:10.1145/3318041.3355458](https://doi.org/10.1145/3318041.3355458) +[^101]: Ezra Feilden, Adi Oltean, and Philip Johnston. [Why we should train AI in space](https://www.starcloud.com/wp). White Paper, *starcloud.com*, September 2024. Archived at [perma.cc/7Y3S-8UB6](https://perma.cc/7Y3S-8UB6) +[^102]: James Mickens. [The Saddest Moment](https://www.usenix.org/system/files/login-logout_1305_mickens.pdf). *USENIX ;login*, May 2013. Archived at [perma.cc/T7BZ-XCFR](https://perma.cc/T7BZ-XCFR) +[^103]: Martin Kleppmann and Heidi Howard. [Byzantine Eventual Consistency and the Fundamental Limits of Peer-to-Peer Databases](https://arxiv.org/abs/2012.00472). *arxiv.org*, December 2020. [doi:10.48550/arXiv.2012.00472](https://doi.org/10.48550/arXiv.2012.00472) +[^104]: Martin Kleppmann. [Making CRDTs Byzantine Fault Tolerant](https://martin.kleppmann.com/papers/bft-crdt-papoc22.pdf). At *9th Workshop on Principles and Practice of Consistency for Distributed Data* (PaPoC), April 2022. [doi:10.1145/3517209.3524042](https://doi.org/10.1145/3517209.3524042) +[^105]: Evan Gilman. [The Discovery of Apache ZooKeeper’s Poison Packet](https://www.pagerduty.com/blog/the-discovery-of-apache-zookeepers-poison-packet/). *pagerduty.com*, May 2015. Archived at [perma.cc/RV6L-Y5CQ](https://perma.cc/RV6L-Y5CQ) +[^106]: Jonathan Stone and Craig Partridge. [When the CRC and TCP Checksum Disagree](https://conferences2.sigcomm.org/sigcomm/2000/conf/paper/sigcomm2000-9-1.pdf). At *ACM Conference on Applications, Technologies, Architectures, and Protocols for Computer Communication* (SIGCOMM), August 2000. [doi:10.1145/347059.347561](https://doi.org/10.1145/347059.347561) +[^107]: Evan Jones. [How Both TCP and Ethernet Checksums Fail](https://www.evanjones.ca/tcp-and-ethernet-checksums-fail.html). *evanjones.ca*, October 2015. 
Archived at [perma.cc/9T5V-B8X5](https://perma.cc/9T5V-B8X5) +[^108]: Cynthia Dwork, Nancy Lynch, and Larry Stockmeyer. [Consensus in the Presence of Partial Synchrony](https://groups.csail.mit.edu/tds/papers/Lynch/jacm88.pdf). *Journal of the ACM*, volume 35, issue 2, pages 288–323, April 1988. [doi:10.1145/42282.42283](https://doi.org/10.1145/42282.42283) +[^109]: Richard D. Schlichting and Fred B. Schneider. [Fail-stop processors: an approach to designing fault-tolerant computing systems](https://www.cs.cornell.edu/fbs/publications/Fail_Stop.pdf). *ACM Transactions on Computer Systems* (TOCS), volume 1, issue 3, pages 222–238, August 1983. [doi:10.1145/357369.357371](https://doi.org/10.1145/357369.357371) +[^110]: Thanh Do, Mingzhe Hao, Tanakorn Leesatapornwongsa, Tiratat Patana-anake, and Haryadi S. Gunawi. [Limplock: Understanding the Impact of Limpware on Scale-out Cloud Systems](https://ucare.cs.uchicago.edu/pdf/socc13-limplock.pdf). At *4th ACM Symposium on Cloud Computing* (SoCC), October 2013. [doi:10.1145/2523616.2523627](https://doi.org/10.1145/2523616.2523627) +[^111]: Josh Snyder and Joseph Lynch. [Garbage collecting unhealthy JVMs, a proactive approach](https://netflixtechblog.medium.com/introducing-jvmquake-ec944c60ba70). Netflix Technology Blog, *netflixtechblog.medium.com*, November 2019. Archived at [perma.cc/8BTA-N3YB](https://perma.cc/8BTA-N3YB) +[^112]: Haryadi S. Gunawi, Riza O. Suminto, Russell Sears, Casey Golliher, Swaminathan Sundararaman, Xing Lin, Tim Emami, Weiguang Sheng, Nematollah Bidokhti, Caitie McCaffrey, Gary Grider, Parks M. Fields, Kevin Harms, Robert B. Ross, Andree Jacobson, Robert Ricci, Kirk Webb, Peter Alvaro, H. Birali Runesha, Mingzhe Hao, and Huaicheng Li. [Fail-Slow at Scale: Evidence of Hardware Performance Faults in Large Production Systems](https://www.usenix.org/system/files/conference/fast18/fast18-gunawi.pdf). At *16th USENIX Conference on File and Storage Technologies*, February 2018. 
+[^113]: Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. [Gray Failure: The Achilles’ Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in Operating Systems* (HotOS), May 2017. [doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005) +[^114]: Chang Lou, Peng Huang, and Scott Smith. [Understanding, Detecting and Localizing Partial Failures in Large System Software](https://www.usenix.org/conference/nsdi20/presentation/lou). At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020. +[^115]: Peter Bailis and Ali Ghodsi. [Eventual Consistency Today: Limitations, Extensions, and Beyond](https://queue.acm.org/detail.cfm?id=2462076). *ACM Queue*, volume 11, issue 3, pages 55–63, March 2013. [doi:10.1145/2460276.2462076](https://doi.org/10.1145/2460276.2462076) +[^116]: Bowen Alpern and Fred B. Schneider. [Defining Liveness](https://www.cs.cornell.edu/fbs/publications/DefLiveness.pdf). *Information Processing Letters*, volume 21, issue 4, pages 181–185, October 1985. [doi:10.1016/0020-0190(85)90056-0](https://doi.org/10.1016/0020-0190%2885%2990056-0) +[^117]: Flavio P. Junqueira. [Dude, Where’s My Metadata?](https://fpj.me/2015/05/28/dude-wheres-my-metadata/) *fpj.me*, May 2015. Archived at [perma.cc/D2EU-Y9S5](https://perma.cc/D2EU-Y9S5) +[^118]: Scott Sanders. [January 28th Incident Report](https://github.com/blog/2106-january-28th-incident-report). *github.com*, February 2016. Archived at [perma.cc/5GZR-88TV](https://perma.cc/5GZR-88TV) +[^119]: Jay Kreps. [A Few Notes on Kafka and Jepsen](https://blog.empathybox.com/post/62279088548/a-few-notes-on-kafka-and-jepsen). *blog.empathybox.com*, September 2013. Archived at [perma.cc/XJ5C-F583](https://perma.cc/XJ5C-F583) +[^120]: Marc Brooker and Ankush Desai.
[Systems Correctness Practices at AWS](https://dl.acm.org/doi/pdf/10.1145/3712057). *ACM Queue*, volume 22, issue 6, November/December 2024. [doi:10.1145/3712057](https://doi.org/10.1145/3712057) +[^121]: Andrey Satarin. [Testing Distributed Systems: Curated list of resources on testing distributed systems](https://asatarin.github.io/testing-distributed-systems/). *asatarin.github.io*. Archived at [perma.cc/U5V8-XP24](https://perma.cc/U5V8-XP24) +[^122]: Jack Vanlightly. [Verifying Kafka transactions - Diary entry 2 - Writing an initial TLA+ spec](https://jack-vanlightly.com/analyses/2024/12/3/verifying-kafka-transactions-diary-entry-2-writing-an-initial-tla-spec). *jack-vanlightly.com*, December 2024. Archived at [perma.cc/NSQ8-MQ5N](https://perma.cc/NSQ8-MQ5N) +[^123]: Siddon Tang. [From Chaos to Order — Tools and Techniques for Testing TiDB, A Distributed NewSQL Database](https://www.pingcap.com/blog/chaos-practice-in-tidb/). *pingcap.com*, April 2018. Archived at [perma.cc/5EJB-R29F](https://perma.cc/5EJB-R29F) +[^124]: Nathan VanBenschoten. [Parallel Commits: An atomic commit protocol for globally distributed transactions](https://www.cockroachlabs.com/blog/parallel-commits/). *cockroachlabs.com*, November 2019. Archived at [perma.cc/5FZ7-QK6J](https://perma.cc/5FZ7-QK6J) +[^125]: Jack Vanlightly. [Paper: VR Revisited - State Transfer (part 3)](https://jack-vanlightly.com/analyses/2022/12/28/paper-vr-revisited-state-transfer-part-3). *jack-vanlightly.com*, December 2022. Archived at [perma.cc/KNK3-K6WS](https://perma.cc/KNK3-K6WS) +[^126]: Hillel Wayne. [What if the spec doesn’t match the code?](https://buttondown.com/hillelwayne/archive/what-if-the-spec-doesnt-match-the-code/) *buttondown.com*, March 2024. Archived at [perma.cc/8HEZ-KHER](https://perma.cc/8HEZ-KHER) +[^127]: Lingzhi Ouyang, Xudong Sun, Ruize Tang, Yu Huang, Madhav Jivrajani, Xiaoxing Ma, and Tianyin Xu.
[Multi-Grained Specifications for Distributed System Model Checking and Verification](https://arxiv.org/abs/2409.14301). At *20th European Conference on Computer Systems* (EuroSys), March 2025. [doi:10.1145/3689031.3696069](https://doi.org/10.1145/3689031.3696069) +[^128]: Yury Izrailevsky and Ariel Tseitlin. [The Netflix Simian Army](https://netflixtechblog.com/the-netflix-simian-army-16e57fbab116). *netflixtechblog.com*, July 2011. Archived at [perma.cc/M3NY-FJW6](https://perma.cc/M3NY-FJW6) +[^129]: Kyle Kingsbury. [Jepsen: On the perils of network partitions](https://aphyr.com/posts/281-jepsen-on-the-perils-of-network-partitions). *aphyr.com*, May 2013. Archived at [perma.cc/W98G-6HQP](https://perma.cc/W98G-6HQP) +[^130]: Kyle Kingsbury. [Jepsen Analyses](https://jepsen.io/analyses). *jepsen.io*, 2024. Archived at [perma.cc/8LDN-D2T8](https://perma.cc/8LDN-D2T8) +[^131]: Rupak Majumdar and Filip Niksic. [Why is random testing effective for partition tolerance bugs?](https://dl.acm.org/doi/pdf/10.1145/3158134) *Proceedings of the ACM on Programming Languages* (PACMPL), volume 2, issue POPL, article no. 46, December 2017. [doi:10.1145/3158134](https://doi.org/10.1145/3158134) +[^132]: FoundationDB project authors. [Simulation and Testing](https://apple.github.io/foundationdb/testing.html). *apple.github.io*. Archived at [perma.cc/NQ3L-PM4C](https://perma.cc/NQ3L-PM4C) +[^133]: Alex Kladov. [Simulation Testing For Liveness](https://tigerbeetle.com/blog/2023-07-06-simulation-testing-for-liveness/). *tigerbeetle.com*, July 2023. Archived at [perma.cc/RKD4-HGCR](https://perma.cc/RKD4-HGCR) +[^134]: Alfonso Subiotto Marqués. [(Mostly) Deterministic Simulation Testing in Go](https://www.polarsignals.com/blog/posts/2024/05/28/mostly-dst-in-go). *polarsignals.com*, May 2024.
Archived at [perma.cc/ULD6-TSA4](https://perma.cc/ULD6-TSA4) \ No newline at end of file diff --git a/hugo.yaml b/hugo.yaml index 7b45334..13aca49 100644 --- a/hugo.yaml +++ b/hugo.yaml @@ -1,6 +1,6 @@ baseURL: 'https://ddia.vonng.com/' languageCode: 'zh-CN' -title: '设计数据密集型应用' +title: '设计数据密集型应用第二版' enableRobotsTXT: true # Parse Git commit @@ -28,7 +28,7 @@ languages: languageCode: zh contentDir: content/zh weight: 1 - title: 设计数据密集型应用 + title: 设计数据密集型应用(第二版) v2: languageName: 第二版 languageCode: v2 @@ -40,27 +40,29 @@ languages: languageCode: tw contentDir: content/tw weight: 3 - title: 設計資料密集型應用 + title: 設計資料密集型應用(第二版) en: languageName: English languageCode: en contentDir: content/en weight: 4 - title: Designing Data-Intensive Applications - + title: Designing Data-Intensive Applications 2nd Edition markup: - highlight: - noClasses: false goldmark: - renderer: - unsafe: true extensions: - passthrough: - delimiters: - block: [['\[', '\]'], ['$$', '$$']] - inline: [['\(', '\)']] - enable: true + footnote: true # Enable footnote syntax: [^id] / [^id]: text + linkify: true # Automatically turn URL text into links + table: true # Enable Markdown tables + taskList: true # Enable task lists [ ] / [x] + typographer: true # Smart typography (quotes, dashes, etc.) + parser: + attribute: true # Allow {#id .class key=val} after a heading, for explicit anchors + autoHeadingID: true # Auto-generate heading IDs (a hand-written {#id} overrides the generated one) + autoHeadingIDType: github # Auto ID scheme: github / blackfriday / none + tableOfContents: + startLevel: 2 # ToC starts at h2 + endLevel: 4 # ToC ends at h4 menu: main: