mirror of https://github.com/Vonng/ddia.git synced 2026-06-21 00:47:05 +08:00

fix reference link

Feng Ruohang 2025-08-09 15:21:36 +08:00
parent 0c9db16820
commit d216e35c8e
11 changed files with 1494 additions and 4958 deletions


@ -23,7 +23,7 @@ more complex, it is no longer sufficient to store everything in one system, but
necessary to combine multiple storage or processing systems that provide different capabilities.
We call an application *data-intensive* if data management is one of the primary challenges in
developing the application [[1](/en/ch1#Kouzes2009)].
developing the application [^1].
While in *compute-intensive* systems the challenge is parallelizing some very large computation, in
data-intensive applications we usually worry more about things like storing and processing large
data volumes, managing changes to data, ensuring consistency in the face of failures and
@ -86,7 +86,7 @@ for web applications, the client-side code (which runs in a web browser) is call
and the server-side code that handles user requests is known as the *backend*. Mobile apps are
similar to frontends in that they provide user interfaces, which often communicate over the Internet
with a server-side backend. Frontends sometimes manage data locally on the user’s device
[[2](/en/ch1#Kleppmann2019_ch1)],
[^2],
but the greatest data infrastructure challenges often lie in the backend: a frontend only needs to
handle one user’s data, whereas the backend manages data on behalf of *all* of the users.
@ -132,10 +132,10 @@ As we shall see in the next section, operational and analytical systems are ofte
good reasons. As these systems have matured, two new specialized roles have emerged: *data
engineers* and *analytics engineers*. Data engineers are the people who know how to integrate the
operational and the analytical systems, and who take responsibility for the organization’s data
infrastructure more widely [[3](/en/ch1#Reis2022)].
infrastructure more widely [^3].
Analytics engineers model and transform data to make it more useful for the business analysts and
data scientists in an organization
[[4](/en/ch1#Machado2023)].
[^4].
Many engineers specialize on either the operational or the analytical side. However, this book
covers both operational and analytical data systems, since both play an important role in the
@ -176,7 +176,7 @@ answer analytic queries such as:
The reports that result from these types of queries are important for business intelligence, helping
the management decide what to do next. In order to differentiate this pattern of using databases
from transaction processing, it has been called *online analytic processing* (OLAP)
[[5](/en/ch1#Codd1993)].
[^5].
The difference between OLTP and analytics is not always clear-cut, but some typical characteristics
are listed in [Table 1-1](/en/ch1#tab_oltp_vs_olap).
@ -211,7 +211,7 @@ There is also a type of systems that is designed for analytical workloads (queri
over many records) but that are embedded into user-facing products. This category is known as
*product analytics* or *real-time analytics*, and systems designed for this type of use include
Pinot, Druid, and ClickHouse
[[6](/en/ch1#Soman2023)].
[^6].
## Data Warehousing
@ -242,7 +242,7 @@ systems, for several reasons:
A *data warehouse*, by contrast, is a separate database that analysts can query to their heart’s
content, without affecting OLTP operations
[[7](/en/ch1#Chaudhuri1997)].
[^7].
As we shall see in [Chapter 4](/en/ch4#ch_storage), data warehouses often store data in a way that is very different
from OLTP databases, in order to optimize for the types of queries that are common in analytics.
@ -267,8 +267,7 @@ specialist data connector services such as Fivetran, Singer, or AirByte.
Some database systems offer *hybrid transactional/analytic processing* (HTAP), which aims to enable
OLTP and analytics in a single system without requiring ETL from one system into another
[[8](/en/ch1#Ozcan2017),
[9](/en/ch1#Prout2022_ch1)].
[^8] [^9].
However, many HTAP systems internally consist of an OLTP system coupled with a separate analytical
system, hidden behind a common interface—so the distinction between the two remains important for
understanding how these systems work.
@ -283,13 +282,13 @@ data from several operational systems in a single query.
HTAP therefore does not replace data warehouses. Rather, it is useful in scenarios where the same
application needs to both perform analytics queries that scan a large number of rows, and also
read and update individual records with low latency. Fraud detection can involve such workloads, for
example [[10](/en/ch1#Zhang2024)].
example [^10].
The separation between operational and analytical systems is part of a wider trend: as workloads
have become more demanding, systems have become more specialized and optimized for particular
workloads. General-purpose systems can handle small data volumes comfortably, but the greater the
scale, the more specialized systems tend to become
[[11](/en/ch1#Stonebraker2005fitsall)].
[^11].
### From data warehouse to data lake
@ -308,14 +307,11 @@ needs of data scientists, who might need to perform tasks such as:
they mention). Similarly, they might need to extract structured information from photos using
computer vision techniques.
Although there have been efforts to add machine learning operators to a SQL data model
[[12](/en/ch1#Cohen2009)]
and to build efficient machine learning systems on top of a relational foundation
[[13](/en/ch1#Olteanu2020)],
Although there have been efforts to add machine learning operators to a SQL data model [^12]
and to build efficient machine learning systems on top of a relational foundation [^13],
many data scientists prefer not to work in a relational database such as a data warehouse. Instead,
many prefer to use Python data analysis libraries such as pandas and scikit-learn, statistical
analysis languages such as R, and distributed analytics frameworks such as Spark
[[14](/en/ch1#Bornstein2020)].
analysis languages such as R, and distributed analytics frameworks such as Spark [^14].
We discuss these further in [“Dataframes, Matrices, and Arrays”](/en/ch3#sec_datamodels_dataframes).
Consequently, organizations face a need to make data available in a form that is suitable for use by
@ -325,7 +321,7 @@ difference from a data warehouse is that a data lake simply contains files, with
particular file format or data model. Files in a data lake might be collections of database records,
encoded using a file format such as Avro or Parquet (see [Chapter 5](/en/ch5#ch_encoding)), but they can equally well
contain text, images, videos, sensor readings, sparse matrices, feature vectors, genome sequences,
or any other kind of data [[15](/en/ch1#Fowler2015)].
or any other kind of data [^15].
Besides being more flexible, this is also often cheaper than relational data storage, since the data
lake can use commoditized file storage such as object stores (see [“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native)).
@ -334,14 +330,13 @@ an intermediate stop on the path from the operational systems to the data wareho
contains data in a “raw” form produced by the operational systems, without the transformation into a
relational data warehouse schema. This approach has the advantage that each consumer of the data can
transform the raw data into a form that best suits their needs. It has been dubbed the *sushi
principle*: “raw data is better” [[16](/en/ch1#Johnson2015)].
principle*: “raw data is better” [^16].
Besides loading data from a data lake into a separate data warehouse, it is also possible to run
typical data warehousing workloads (SQL queries and business analytics) directly on the files in the
data lake, alongside data science/machine learning workloads. This architecture is known as a *data
lakehouse*, and it requires a query execution engine and a metadata (e.g., schema management) layer
that extend the data lake’s file storage
[[17](/en/ch1#Armbrust2021)].
that extend the data lake’s file storage [^17].
Apache Hive, Spark SQL, Presto, and Trino are examples of this approach.
@ -349,7 +344,7 @@ Apache Hive, Spark SQL, Presto, and Trino are examples of this approach.
As analytics practices have matured, organizations have been increasingly paying attention to the
management and operations of analytics systems and data pipelines, as captured for example in the
DataOps manifesto [[18](/en/ch1#DataOps)].
DataOps manifesto [^18].
Part of this are issues of governance, privacy, and compliance with regulation such as GDPR and
CCPA, which we discuss in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance) and [Link to Come].
@ -361,11 +356,9 @@ application and how time-sensitive it is, a stream processing approach can be va
to identify and block potentially fraudulent or abusive activity.
In some cases the outputs of analytics systems are made available to operational systems (a process
sometimes known as *reverse ETL* [[19](/en/ch1#Manohar2021)]). For example, a
machine-learning model that was trained on data in an analytics system may be deployed to
sometimes known as *reverse ETL* [^19]). For example, a machine-learning model that was trained on data in an analytics system may be deployed to
production, so that it can generate recommendations for end-users, such as “people who bought X also
bought Y”. Such deployed outputs of analytics systems are also known as *data products*
[[20](/en/ch1#ORegan2018)].
bought Y”. Such deployed outputs of analytics systems are also known as *data products* [^20].
Machine learning models can be deployed to operational systems using specialized tools such as
TFX, Kubeflow, or MLflow.
@ -425,7 +418,7 @@ in-house, or should it be outsourced? Should you build or should you buy?
Ultimately, this is a question about business priorities. The received management wisdom is that
things that are a core competency or a competitive advantage of your organization should be done
in-house, whereas things that are non-core, routine, or commonplace should be left to a vendor
[[21](/en/ch1#Fournier2021)].
[^21].
To give an extreme example, most companies do not generate their own electricity (unless they are an
energy company, and leaving aside emergency backup power), since it is cheaper to buy electricity
from the grid.
@ -464,8 +457,7 @@ Whether a cloud service is actually cheaper and easier than self-hosting depends
skills and the workload on your systems. If you already have experience setting up and operating the
systems you need, and if your load is quite predictable (i.e., the number of machines you need does
not fluctuate wildly), then it’s often cheaper to buy your own machines and run the software on them
yourself [[22](/en/ch1#HeinemeierHansson2022),
[23](/en/ch1#Badizadegan2022)].
yourself [^22] [^23].
On the other hand, if you need a system that you don’t already know how to deploy and operate, then
adopting a cloud service is often easier and quicker than learning to manage the system yourself. If
@ -508,7 +500,7 @@ The biggest downside of a cloud service is that you have no control over it:
* Moreover, if the service shuts down or becomes unacceptably expensive, or if the vendor decides to
change their product in a way you don’t like, you are at their mercy—continuing to run an old
version of the software is usually not an option, so you will be forced to migrate to an
alternative service [[24](/en/ch1#Yegge2020)].
alternative service [^24].
This risk is mitigated if there are alternative services that expose a compatible API, but for
many cloud services there are no standard APIs, which raises the cost of switching, making vendor
lock-in a problem.
@ -535,17 +527,15 @@ and indeed such managed services are now available for many popular data systems
that have been designed from the ground up to be cloud-native have been shown to have several
advantages: better performance on the same hardware, faster recovery from failures, being able to
quickly scale computing resources to match the load, and supporting larger datasets
[[25](/en/ch1#Verbitski2017),
[26](/en/ch1#Antonopoulos2019_ch1),
[27](/en/ch1#Vuppalapati2020)].
[^25] [^26] [^27].
[Table 1-2](/en/ch1#tab_cloud_native_dbs) lists some examples of both types of systems.
Table 1-2. Examples of self-hosted and cloud-native database systems
| Category | Self-hosted systems | Cloud-native systems |
| --- | --- | --- |
| Operational/OLTP | MySQL, PostgreSQL, MongoDB | AWS Aurora [[25](/en/ch1#Verbitski2017)], Azure SQL DB Hyperscale [[26](/en/ch1#Antonopoulos2019_ch1)], Google Cloud Spanner |
| Analytical/OLAP | Teradata, ClickHouse, Spark | Snowflake [[27](/en/ch1#Vuppalapati2020)], Google BigQuery, Azure Synapse Analytics |
| Category | Self-hosted systems | Cloud-native systems |
|------------------|-----------------------------|-----------------------------------------------------------------------|
| Operational/OLTP | MySQL, PostgreSQL, MongoDB | AWS Aurora [^25], Azure SQL DB Hyperscale [^26], Google Cloud Spanner |
| Analytical/OLAP | Teradata, ClickHouse, Spark | Snowflake [^27], Google BigQuery, Azure Synapse Analytics |
### Layering of cloud services
@ -574,7 +564,7 @@ higher-level services. For example:
lost.
* Many other services are in turn built upon object storage and other cloud services: for example,
Snowflake is a cloud-based analytic database (data warehouse) that relies on S3 for data storage
[[27](/en/ch1#Vuppalapati2020)], and some other services in turn
[^27], and some other services in turn
build upon Snowflake.
As always with abstractions in computing, there is no one right answer to what you should use. As a
@ -605,9 +595,9 @@ cloud service provided by a separate set of machines, which emulates the behavio
*block device*, where each block is typically 4 KiB in size). This technology makes it
possible to run traditional disk-based software in the cloud, but the block device emulation
introduces overheads that can be avoided in systems that are designed from the ground up for the
cloud [[25](/en/ch1#Verbitski2017)]. It also makes the application
cloud [^25]. It also makes the application
very sensitive to network glitches, since every I/O on the virtual block device is actually a
network call [[28](/en/ch1#NickVanWiggeren2025)].
network call [^28].
To address this problem, cloud-native services generally avoid using virtual disks, and instead
build on dedicated storage services that are optimized for particular workloads. Object storage
@ -615,28 +605,23 @@ services such as S3 are designed for long-term storage of fairly large files, ra
of kilobytes to several gigabytes in size. The individual rows or values stored in a database are
typically much smaller than this; cloud databases therefore typically manage smaller values in a
separate service, and store larger data blocks (containing many individual values) in an object
store [[26](/en/ch1#Antonopoulos2019_ch1),
[29](/en/ch1#Breck2024)].
store [^26] [^29].
We will see ways of doing this in [Chapter 4](/en/ch4#ch_storage).
In a traditional systems architecture, the same computer is responsible for both storage (disk) and
computation (CPU and RAM), but in cloud-native systems, these two responsibilities have become
somewhat separated or *disaggregated* [[9](/en/ch1#Prout2022_ch1),
[27](/en/ch1#Vuppalapati2020),
[30](/en/ch1#Shapira2023separation),
[31](/en/ch1#Murthy2022)]:
somewhat separated or *disaggregated* [^9] [^27] [^30] [^31]:
for example, S3 only stores files, and if you want to analyze that data, you will have to run the
analysis code somewhere outside of S3. This implies transferring the data over the network, which we
will discuss further in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed).
Moreover, cloud-native systems are often *multitenant*, which means that rather than having a
separate machine for each customer, data and computation from several different customers are
handled on the same shared hardware by the same service
[[32](/en/ch1#Vanlightly2023serverless)].
handled on the same shared hardware by the same service [^32].
Multitenancy can enable better hardware utilization, easier scalability, and easier management by
the cloud provider, but it also requires careful engineering to ensure that one customer’s activity
does not affect the performance or security of the system for other customers
[[33](/en/ch1#Jonas2019)].
does not affect the performance or security of the system for other customers [^33].
## Operations in the Cloud Era
@ -645,7 +630,7 @@ Traditionally, the people managing an organizations server-side data infrastr
organizations have tried to integrate the roles of software development and operations into teams
with a shared responsibility for both backend services and data infrastructure; the *DevOps*
philosophy has guided this trend. *Site Reliability Engineers* (SREs) are Google’s implementation of
this idea [[34](/en/ch1#Beyer2016)].
this idea [^34].
The role of operations is to ensure services are reliably delivered to users (including configuring
infrastructure and deploying applications), and to ensure a stable production environment (including
@ -669,31 +654,28 @@ processes and tools have evolved. The DevOps/SRE philosophy places greater empha
* preferring ephemeral virtual machines and services over long running servers,
* enabling frequent application updates,
* learning from incidents, and
* preserving the organization’s knowledge about the system, even as individual people come and go
[[35](/en/ch1#Limoncelli2020)].
* preserving the organization’s knowledge about the system, even as individual people come and go [^35].
With the rise of cloud services, there has been a bifurcation of roles: operations teams at
infrastructure companies specialize in the details of providing a reliable service to a large number
of customers, while the customers of the service spend as little time and effort as possible on
infrastructure [[36](/en/ch1#Majors2020)].
infrastructure [^36].
Customers of cloud services still require operations, but they focus on different aspects, such as
choosing the most appropriate service for a given task, integrating different services with each
other, and migrating from one service to another. Even though metered billing removes the need for
capacity planning in the traditional sense, it’s still important to know what resources you are
using for which purpose, so that you don’t waste money on cloud resources that are not needed:
capacity planning becomes financial planning, and performance optimization becomes cost optimization
[[37](/en/ch1#Cherkasky2021)].
capacity planning becomes financial planning, and performance optimization becomes cost optimization [^37].
Moreover, cloud services do have resource limits or *quotas* (such as the maximum number of
processes you can run concurrently), which you need to know about and plan for before you run into
them [[38](/en/ch1#Kushchi2023)].
processes you can run concurrently), which you need to know about and plan for before you run into them [^38].
Adopting a cloud service can be easier and quicker than running your own infrastructure, although
even here there is a cost in learning how to use it, and perhaps working around its limitations.
Integration between different services becomes a particular challenge as a growing number of vendors
offers an ever broader range of cloud services targeting different use cases
[[39](/en/ch1#Bernhardsson2021),
[40](/en/ch1#Stancil2021)].
offers an ever broader range of cloud services targeting different use cases [^39][^40].
ETL (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) is only part of the story; operational cloud services also need
to be integrated with each other. At present, there is a lack of standards that would facilitate
this sort of integration, so it often involves significant manual effort.
@ -751,8 +733,7 @@ Using specialized hardware
Legal compliance
: Some countries have data residency laws that require data about people in their jurisdiction to be
stored and processed geographically within that country
[[41](/en/ch1#Korolov2022)].
stored and processed geographically within that country [^41].
The scope of these rules varies—for example, in some cases it applies only to medical or financial
data, while other cases are broader. A service with users in several such jurisdictions will
therefore have to distribute their data across servers in several locations.
@ -761,9 +742,7 @@ Sustainability
: If you have flexibility on where and when to run your jobs, you might be able to run them in a
time and place where plenty of renewable electricity is available, and avoid running them when the
power grid is under strain. This can reduce your carbon emissions and allow you to take advantage
of cheap power when it is available
[[42](/en/ch1#Borenstein2025),
[43](/en/ch1#Acun2023)].
of cheap power when it is available [^42][^43].
These reasons apply both to services that you write yourself (application code) and services
consisting of off-the-shelf software (such as databases).
@ -777,39 +756,32 @@ case, we dont know whether the service received the request, and simply retry
safe. We will discuss these problems in detail in [Chapter 9](/en/ch9#ch_distributed).
Although datacenter networks are fast, making a call to another service is still vastly slower than
calling a function in the same process
[[44](/en/ch1#Nath2019)].
calling a function in the same process [^44].
When operating on large volumes of data, rather than transferring the data from storage to a
separate machine that processes it, it can be faster to bring the computation to the machine that
already has the data
[[45](/en/ch1#Hellerstein2019)].
already has the data [^45].
More nodes are not always faster: in some cases, a simple single-threaded program on one computer
can perform significantly better than a cluster with over 100 CPU cores
[[46](/en/ch1#McSherry2015_ch1)].
can perform significantly better than a cluster with over 100 CPU cores [^46].
Troubleshooting a distributed system is often difficult: if the system is slow to respond, how do
you figure out where the problem lies? Techniques for diagnosing problems in distributed systems are
developed under the heading of *observability* [[47](/en/ch1#Sridharan2018),
[48](/en/ch1#Majors2019)],
developed under the heading of *observability* [^47] [^48],
which involves collecting data about the execution of a system, and allowing it to be queried in
ways that allows both high-level metrics and individual events to be analyzed. *Tracing* tools such
as OpenTelemetry, Zipkin, and Jaeger allow you to track which client called which server for which
operation, and how long each call took
[[49](/en/ch1#Sigelman2010)].
operation, and how long each call took [^49].
Databases provide various mechanisms for ensuring data consistency, as we shall see in
[Chapter 6](/en/ch6#ch_replication) and [Chapter 8](/en/ch8#ch_transactions). However, when each service has its own database,
maintaining consistency of data across those different services becomes the application’s problem.
Distributed transactions, which we explore in [Chapter 8](/en/ch8#ch_transactions), are a possible technique for
ensuring consistency, but they are rarely used in a microservices context because they run counter
to the goal of making services independent from each other, and many databases don’t support them
[[50](/en/ch1#Laigner2021)].
to the goal of making services independent from each other, and many databases don’t support them [^50].
For all these reasons, if you can do something on a single machine, this is often much simpler and
cheaper compared to setting up a distributed system
[[23](/en/ch1#Badizadegan2022),
[46](/en/ch1#McSherry2015_ch1),
[51](/en/ch1#Tigani2023)].
cheaper compared to setting up a distributed system [^23] [^46] [^51].
CPUs, memory, and disks have grown larger, faster, and more reliable. When combined with single-node
databases such as DuckDB, SQLite, and KùzuDB, many workloads can now run on a single node. We will
explore more on this topic in [Chapter 4](/en/ch4#ch_storage).
@ -823,8 +795,7 @@ server (handling incoming requests) and a client (making outbound requests to ot
This way of building applications has traditionally been called a *service-oriented architecture*
(SOA); more recently the idea has been refined into a *microservices* architecture
[[52](/en/ch1#Newman2021_ch1),
[53](/en/ch1#Richardson2014)].
[^52] [^53].
In this architecture, a service has one well-defined purpose (for example, in the case of S3, this
would be file storage); each service exposes an API that can be called by clients via the network,
and each service has one team that is responsible for its maintenance. A complex application can
@ -857,16 +828,14 @@ client and server APIs; we discuss these further in [Chapter 5](/en/ch5#ch_enco
Microservices are primarily a technical solution to a people problem: allowing different teams to
make progress independently without having to coordinate with each other. This is valuable in a large
company, but in a small company where there are not many teams, using microservices is likely to be
unnecessary overhead, and it is preferable to implement the application in the simplest way possible
[[52](/en/ch1#Newman2021_ch1)].
unnecessary overhead, and it is preferable to implement the application in the simplest way possible [^52].
*Serverless*, or *function-as-a-service* (FaaS), is another approach to deploying services, in which
the management of the infrastructure is outsourced to a cloud vendor
[[33](/en/ch1#Jonas2019)].
the management of the infrastructure is outsourced to a cloud vendor [^33].
When using virtual machines, you have to explicitly choose when to start up or shut down an
instance; in contrast, with the serverless model, the cloud provider automatically allocates and
frees hardware resources as needed, based on the incoming requests to your service
[[54](/en/ch1#Shahrad2020)]. Serverless deployment
[^54]. Serverless deployment
shifts more of the operational burden to cloud providers and enables flexible billing by usage
rather than machine instances. To offer such benefits, many serverless infrastructure providers
impose a time limit on function execution, limit runtime environments, and might suffer from slow
@ -896,22 +865,20 @@ enterprise datacenter systems. Some of those differences are:
* A supercomputer typically runs large batch jobs that checkpoint the state of their computation to
disk from time to time. If a node fails, a common solution is to simply stop the entire cluster
workload, repair the faulty node, and then restart the computation from the last checkpoint
[[55](/en/ch1#Barroso2018),
[56](/en/ch1#Fiala2012)].
[^55] [^56].
With cloud services, it is usually not desirable to stop the entire cluster, since the services
need to continually serve users with minimal interruptions.
* Supercomputer nodes typically communicate through shared memory and remote direct memory access
(RDMA), which support high bandwidth and low latency, but assume a high level of trust among the
users of the system [[57](/en/ch1#KornfeldSimpson2020)].
users of the system [^57].
In cloud computing, the network and the machines are often shared by mutually untrusting
organizations, requiring stronger security mechanisms such as resource isolation (e.g., virtual
machines), encryption and authentication.
* Cloud datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to
provide high bisection bandwidth—a commonly used measure of a network’s overall performance
[[55](/en/ch1#Barroso2018),
[58](/en/ch1#Singh2015)].
[^55] [^58].
Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses
[[59](/en/ch1#Lockwood2014)],
[^59],
which yield better performance for HPC workloads with known communication patterns.
* Cloud computing allows nodes to be distributed across multiple geographic regions, whereas
supercomputers generally assume that all of their nodes are close together.
@ -940,16 +907,14 @@ of the effects that computer systems have on people and society. Social media ha
individuals consume news, which influences their political opinions and hence may affect the outcome
of elections. Automated systems increasingly make decisions that have profound consequences for
individuals, such as deciding who should be given a loan or insurance coverage, who should be
invited to a job interview, or who should be suspected of a crime
[[60](/en/ch1#ONeil2016_ch1)].
invited to a job interview, or who should be suspected of a crime [^60].
Everyone who works on such systems shares a responsibility for considering the ethical impact and
ensuring that they comply with relevant law. It is not necessary for everybody to become an expert
in law and ethics, but a basic awareness of legal and ethical principles is just as important as,
say, some foundational knowledge in distributed systems.
Legal considerations are influencing the very foundations of how data systems are being designed
[[61](/en/ch1#Shastri2020)].
Legal considerations are influencing the very foundations of how data systems are being designed [^61].
For example, the GDPR grants individuals the right to have their data erased on request (sometimes
known as the *right to be forgotten*). However, as we shall see in this book, many data systems rely
on immutable constructs such as append-only logs as part of their design; how can we ensure deletion
@ -970,7 +935,7 @@ However, it is worth remembering that the costs of storage are not just the bill
S3 or another service: the cost-benefit calculation should also take into account the risks of
liability and reputational damage if the data were to be leaked or compromised by adversaries, and
the risk of legal costs and fines if the storage and processing of the data is found not to be
compliant with the law [[51](/en/ch1#Tigani2023)].
compliant with the law [^51].
Governments or police forces might also compel companies to hand over data. When there is a risk
that the data may reveal criminalized behaviors (for example, homosexuality in several Middle
@ -982,12 +947,10 @@ indicate approximate location).
Once all the risks are taken into account, it might be reasonable to decide that some data is simply
not worth storing, and that it should therefore be deleted. This principle of *data minimization*
(sometimes known by the German term *Datensparsamkeit*) runs counter to the “big data” philosophy of
storing lots of data speculatively in case it turns out to be useful in the future
[[62](/en/ch1#Datensparsamkeit)].
storing lots of data speculatively in case it turns out to be useful in the future [^62].
But it fits with the GDPR, which mandates that personal data may only be collected for a specified,
explicit purpose, that this data may not later be used for any other purpose, and that the data must
not be kept for longer than necessary for the purposes for which it was collected
[[63](/en/ch1#GDPR)].
not be kept for longer than necessary for the purposes for which it was collected [^63].
Businesses have also taken notice of privacy and safety concerns. Credit card companies require
payment processing businesses to adhere to strict payment card industry (PCI) standards. Processors
@ -1033,346 +996,71 @@ data is being processed—an aspect that many engineers are prone to ignoring. H
requirements into technical implementations is not yet well understood, but it’s important to keep
this question in mind as we move through the rest of this book.
##### Footnotes
##### References
[[1](/en/ch1#Kouzes2009-marker)] Richard T. Kouzes,
Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio.
[The
Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1,
January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26)
[[2](/en/ch1#Kleppmann2019_ch1-marker)] Martin Kleppmann, Adam Wiggins, Peter van
Hardenberg, and Mark McGranaghan. [Local-first
software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International
Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!),
October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
[[3](/en/ch1#Reis2022-marker)] Joe Reis and Matt Housley.
[*Fundamentals
of Data Engineering*](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/). O’Reilly Media, 2022. ISBN: 9781098108304
[[4](/en/ch1#Machado2023-marker)] Rui Pedro Machado and Helder Russa.
[*Analytics
Engineering with SQL and dbt*](https://www.oreilly.com/library/view/analytics-engineering-with/9781098142377/). O’Reilly Media, 2023. ISBN: 9781098142384
[[5](/en/ch1#Codd1993-marker)] Edgar F. Codd, S. B. Codd, and C. T. Salley.
[Providing
OLAP to User-Analysts: An IT Mandate](https://www.estgv.ipv.pt/PaginasPessoais/jloureiro/ESI_AID2007_2008/fichas/codd.pdf). E. F. Codd Associates, 1993.
Archived at [perma.cc/RKX8-2GEE](https://perma.cc/RKX8-2GEE)
[[6](/en/ch1#Soman2023-marker)] Chinmay Soman and Neha Pawar.
[Comparing Three
Real-Time OLAP Databases: Apache Pinot, Apache Druid, and ClickHouse](https://startree.ai/blog/a-tale-of-three-real-time-olap-databases). *startree.ai*,
April 2023. Archived at [perma.cc/8BZP-VWPA](https://perma.cc/8BZP-VWPA)
[[7](/en/ch1#Chaudhuri1997-marker)] Surajit Chaudhuri and Umeshwar Dayal.
[An Overview of Data
Warehousing and OLAP Technology](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigrecord.pdf). *ACM SIGMOD Record*, volume 26, issue 1, pages 65–74,
March 1997. [doi:10.1145/248603.248616](https://doi.org/10.1145/248603.248616)
[[8](/en/ch1#Ozcan2017-marker)] Fatma Özcan, Yuanyuan Tian, and Pinar Tözün.
[Hybrid Transactional/Analytical
Processing: A Survey](https://humming80.github.io/papers/sigmod-htaptut.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017.
[doi:10.1145/3035918.3054784](https://doi.org/10.1145/3035918.3054784)
[[9](/en/ch1#Prout2022_ch1-marker)] Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu
Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov.
[Cloud-Native Transactions and Analytics
in SingleStore](https://dl.acm.org/doi/abs/10.1145/3514221.3526055). At *International Conference on Management of Data* (SIGMOD), June 2022.
[doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055)
[[10](/en/ch1#Zhang2024-marker)] Chao Zhang, Guoliang Li, Jintao Zhang,
Xinning Zhang, and Jianhua Feng.
[HTAP Databases: A Survey](https://arxiv.org/pdf/2404.15670).
*IEEE Transactions on Knowledge and Data Engineering*, April 2024.
[doi:10.1109/TKDE.2024.3389693](https://doi.org/10.1109/TKDE.2024.3389693)
[[11](/en/ch1#Stonebraker2005fitsall-marker)] Michael Stonebraker and Uğur Çetintemel.
[One Size Fits All: An
Idea Whose Time Has Come and Gone](https://pages.cs.wisc.edu/~shivaram/cs744-readings/fits_all.pdf). At *21st International Conference on Data Engineering*
(ICDE), April 2005. [doi:10.1109/ICDE.2005.1](https://doi.org/10.1109/ICDE.2005.1)
[[12](/en/ch1#Cohen2009-marker)] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M.
Hellerstein, and Caleb Welton. [MAD Skills:
New Analysis Practices for Big Data](https://www.vldb.org/pvldb/vol2/vldb09-219.pdf). *Proceedings of the VLDB Endowment*, volume 2,
issue 2, pages 1481–1492, August 2009.
[doi:10.14778/1687553.1687576](https://doi.org/10.14778/1687553.1687576)
[[13](/en/ch1#Olteanu2020-marker)] Dan Olteanu.
[The Relational Data Borg is Learning](https://www.vldb.org/pvldb/vol13/p3502-olteanu.pdf).
*Proceedings of the VLDB Endowment*, volume 13, issue 12, August 2020.
[doi:10.14778/3415478.3415572](https://doi.org/10.14778/3415478.3415572)
[[14](/en/ch1#Bornstein2020-marker)] Matt Bornstein, Martin Casado, and Jennifer Li.
[Emerging
Architectures for Modern Data Infrastructure: 2020](https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/). *future.a16z.com*, October 2020.
Archived at [perma.cc/LF8W-KDCC](https://perma.cc/LF8W-KDCC)
[[15](/en/ch1#Fowler2015-marker)] Martin Fowler.
[DataLake](https://www.martinfowler.com/bliki/DataLake.html).
*martinfowler.com*, February 2015.
Archived at [perma.cc/4WKN-CZUK](https://perma.cc/4WKN-CZUK)
[[16](/en/ch1#Johnson2015-marker)] Bobby Johnson and Joseph Adler.
[The
Sushi Principle: Raw Data Is Better](https://learning.oreilly.com/videos/strata-hadoop/9781491924143/9781491924143-video210840/). At *Strata+Hadoop World*, February 2015.
[[17](/en/ch1#Armbrust2021-marker)] Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia.
[Lakehouse: A New Generation of
Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference
on Innovative Data Systems Research* (CIDR), January 2021.
[[18](/en/ch1#DataOps-marker)] DataKitchen, Inc.
[The DataOps Manifesto](https://dataopsmanifesto.org/en/). *dataopsmanifesto.org*, 2017.
Archived at [perma.cc/3F5N-FUQ4](https://perma.cc/3F5N-FUQ4)
[[19](/en/ch1#Manohar2021-marker)] Tejas Manohar.
[What is Reverse ETL: A Definition & Why It’s
Taking Off](https://hightouch.io/blog/reverse-etl/). *hightouch.io*, November 2021.
Archived at [perma.cc/A7TN-GLYJ](https://perma.cc/A7TN-GLYJ)
[[20](/en/ch1#ORegan2018-marker)] Simon O’Regan.
[Designing Data
Products](https://towardsdatascience.com/designing-data-products-b6b93edf3d23). *towardsdatascience.com*, August 2018.
Archived at [perma.cc/HU67-3RV8](https://perma.cc/HU67-3RV8)
[[21](/en/ch1#Fournier2021-marker)] Camille Fournier.
[Why is it so
hard to decide to buy?](https://skamille.medium.com/why-is-it-so-hard-to-decide-to-buy-d86fee98e88e) *skamille.medium.com*, July 2021.
Archived at [perma.cc/6VSG-HQ5X](https://perma.cc/6VSG-HQ5X)
[[22](/en/ch1#HeinemeierHansson2022-marker)] David Heinemeier Hansson.
[Why we’re leaving the cloud](https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0).
*world.hey.com*, October 2022.
Archived at [perma.cc/82E6-UJ65](https://perma.cc/82E6-UJ65)
[[23](/en/ch1#Badizadegan2022-marker)] Nima Badizadegan.
[Use One Big Server](https://specbranch.com/posts/one-big-server/).
*specbranch.com*, August 2022.
Archived at [perma.cc/M8NB-95UK](https://perma.cc/M8NB-95UK)
[[24](/en/ch1#Yegge2020-marker)] Steve Yegge.
[Dear
Google Cloud: Your Deprecation Policy is Killing You](https://steve-yegge.medium.com/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc). *steve-yegge.medium.com*, August 2020.
Archived at [perma.cc/KQP9-SPGU](https://perma.cc/KQP9-SPGU)
[[25](/en/ch1#Verbitski2017-marker)] Alexandre Verbitski, Anurag Gupta, Debanjan
Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz
Kharatishvili, and Xiaofeng Bao.
[Amazon
Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases](https://media.amazonwebservices.com/blog/2017/aurora-design-considerations-paper.pdf).
At *ACM International Conference on Management of Data* (SIGMOD), pages 1041–1052, May 2017.
[doi:10.1145/3035918.3056101](https://doi.org/10.1145/3035918.3056101)
[[26](/en/ch1#Antonopoulos2019_ch1-marker)] Panagiotis Antonopoulos, Alex Budovski, Cristian
Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar
Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna
Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade.
[Socrates: The
New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data*
(SIGMOD), pages 1743–1756, June 2019.
[doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047)
[[27](/en/ch1#Vuppalapati2020-marker)] Midhul Vuppalapati, Justin Miron, Rachit Agarwal,
Dan Truong, Ashish Motivala, and Thierry Cruanes.
[Building An Elastic Query
Engine on Disaggregated Storage](https://www.usenix.org/system/files/nsdi20-paper-vuppalapati.pdf). At *17th USENIX Symposium on Networked Systems Design and
Implementation* (NSDI), February 2020.
[[28](/en/ch1#NickVanWiggeren2025-marker)] Nick Van Wiggeren.
[The Real Failure Rate of EBS](https://planetscale.com/blog/the-real-fail-rate-of-ebs).
*planetscale.com*, March 2025.
Archived at [perma.cc/43CR-SAH5](https://perma.cc/43CR-SAH5)
[[29](/en/ch1#Breck2024-marker)] Colin Breck.
[Predicting the
Future of Distributed Systems](https://blog.colinbreck.com/predicting-the-future-of-distributed-systems/). *blog.colinbreck.com*, August 2024.
Archived at [perma.cc/K5FC-4XX2](https://perma.cc/K5FC-4XX2)
[[30](/en/ch1#Shapira2023separation-marker)] Gwen Shapira.
[Compute-Storage Separation Explained](https://www.thenile.dev/blog/storage-compute).
*thenile.dev*, January 2023. Archived at
[perma.cc/QCV3-XJNZ](https://perma.cc/QCV3-XJNZ)
[[31](/en/ch1#Murthy2022-marker)] Ravi Murthy and Gurmeet Goindi.
[AlloyDB
for PostgreSQL under the hood: Intelligent, database-aware storage](https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage). *cloud.google.com*,
May 2022. Archived at
[archive.org](https://web.archive.org/web/20220514021120/https%3A//cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage)
[[32](/en/ch1#Vanlightly2023serverless-marker)] Jack Vanlightly.
[The
Architecture of Serverless Data Systems](https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems). *jack-vanlightly.com*, November 2023.
Archived at [perma.cc/UDV4-TNJ5](https://perma.cc/UDV4-TNJ5)
[[33](/en/ch1#Jonas2019-marker)] Eric Jonas, Johann Schleier-Smith, Vikram
Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth,
Neeraja Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, and David A. Patterson.
[Cloud Programming Simplified: A Berkeley View on
Serverless Computing](https://arxiv.org/abs/1902.03383). *arxiv.org*, February 2019.
[[34](/en/ch1#Beyer2016-marker)] Betsy Beyer, Jennifer Petoff, Chris
Jones, and Niall Richard Murphy.
[*Site
Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/).
O’Reilly Media, 2016. ISBN: 9781491929124
[[35](/en/ch1#Limoncelli2020-marker)] Thomas Limoncelli.
[The Time I Stole $10,000 from Bell Labs](https://queue.acm.org/detail.cfm?id=3434773).
*ACM Queue*, volume 18, issue 5, November 2020.
[doi:10.1145/3434571.3434773](https://doi.org/10.1145/3434571.3434773)
[[36](/en/ch1#Majors2020-marker)] Charity Majors.
[The Future of Ops Jobs](https://acloudguru.com/blog/engineering/the-future-of-ops-jobs).
*acloudguru.com*, August 2020.
Archived at [perma.cc/GRU2-CZG3](https://perma.cc/GRU2-CZG3)
[[37](/en/ch1#Cherkasky2021-marker)] Boris Cherkasky.
[(Over)Pay
As You Go for Your Datastore](https://medium.com/riskified-technology/over-pay-as-you-go-for-your-datastore-11a29ae49a8b). *medium.com*, September 2021.
Archived at [perma.cc/Q8TV-2AM2](https://perma.cc/Q8TV-2AM2)
[[38](/en/ch1#Kushchi2023-marker)] Shlomi Kushchi.
[Serverless Doesn’t Mean
DevOpsLess or NoOps](https://thenewstack.io/serverless-doesnt-mean-devopsless-or-noops/). *thenewstack.io*, February 2023.
Archived at [perma.cc/3NJR-AYYU](https://perma.cc/3NJR-AYYU)
[[39](/en/ch1#Bernhardsson2021-marker)] Erik Bernhardsson.
[Storm
in the stratosphere: how the cloud will be reshuffled](https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html). *erikbern.com*, November 2021.
Archived at [perma.cc/SYB2-99P3](https://perma.cc/SYB2-99P3)
[[40](/en/ch1#Stancil2021-marker)] Benn Stancil.
[The data OS](https://benn.substack.com/p/the-data-os). *benn.substack.com*,
September 2021. Archived at [perma.cc/WQ43-FHS6](https://perma.cc/WQ43-FHS6)
[[41](/en/ch1#Korolov2022-marker)] Maria Korolov.
[Data
residency laws pushing companies toward residency as a service](https://www.csoonline.com/article/3647761/data-residency-laws-pushing-companies-toward-residency-as-a-service.html). *csoonline.com*,
January 2022. Archived at [perma.cc/CHE4-XZZ2](https://perma.cc/CHE4-XZZ2)
[[42](/en/ch1#Borenstein2025-marker)] Severin Borenstein.
[Can
Data Centers Flex Their Power Demand?](https://energyathaas.wordpress.com/2025/04/14/can-data-centers-flex-their-power-demand/) *energyathaas.wordpress.com*, April 2025.
Archived at [perma.cc/MUD3-A6FF](https://perma.cc/MUD3-A6FF)
[[43](/en/ch1#Acun2023-marker)] Bilge Acun, Benjamin Lee, Fiodar Kazhamiaka, Aditya
Sundarrajan, Kiwan Maeng, Manoj Chakkaravarthy, David Brooks, and Carole-Jean Wu.
[Carbon Dependencies in
Datacenter Design and Management](https://hotcarbon.org/assets/2022/pdf/hotcarbon22-acun.pdf).
*ACM SIGENERGY Energy Informatics Review*, volume 3, issue 3, pages 21–26.
[doi:10.1145/3630614.3630619](https://doi.org/10.1145/3630614.3630619)
[[44](/en/ch1#Nath2019-marker)] Kousik Nath.
[These are
the numbers every computer engineer should know](https://www.freecodecamp.org/news/must-know-numbers-for-every-computer-engineer/). *freecodecamp.org*, September 2019.
Archived at [perma.cc/RW73-36RL](https://perma.cc/RW73-36RL)
[[45](/en/ch1#Hellerstein2019-marker)] Joseph M. Hellerstein, Jose Faleiro, Joseph E.
Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu.
[Serverless Computing: One Step Forward, Two Steps Back](https://arxiv.org/abs/1812.03651).
At *Conference on Innovative Data Systems Research* (CIDR), January 2019.
[[46](/en/ch1#McSherry2015_ch1-marker)] Frank McSherry, Michael Isard, and Derek G. Murray.
[Scalability!
But at What COST?](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf) At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS),
May 2015.
[[47](/en/ch1#Sridharan2018-marker)] Cindy Sridharan.
*[Distributed
Systems Observability: A Guide to Building Robust Systems](https://unlimited.humio.com/rs/756-LMY-106/images/Distributed-Systems-Observability-eBook.pdf)*. Report, O’Reilly Media, May 2018.
Archived at [perma.cc/M6JL-XKCM](https://perma.cc/M6JL-XKCM)
[[48](/en/ch1#Majors2019-marker)] Charity Majors.
[Observability — A 3-Year
Retrospective](https://thenewstack.io/observability-a-3-year-retrospective/). *thenewstack.io*, August 2019.
Archived at [perma.cc/CG62-TJWL](https://perma.cc/CG62-TJWL)
[[49](/en/ch1#Sigelman2010-marker)] Benjamin H. Sigelman, Luiz André Barroso, Mike
Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag.
[Dapper, a Large-Scale Distributed Systems Tracing
Infrastructure](https://research.google/pubs/pub36356/). Google Technical Report dapper-2010-1, April 2010.
Archived at [perma.cc/K7KU-2TMH](https://perma.cc/K7KU-2TMH)
[[50](/en/ch1#Laigner2021-marker)] Rodrigo Laigner, Yongluan Zhou, Marcos Antonio
Vaz Salles, Yijian Liu, and Marcos Kalinowski.
[Data management in microservices: State
of the practice, challenges, and research directions](https://www.vldb.org/pvldb/vol14/p3348-laigner.pdf). *Proceedings of the VLDB Endowment*,
volume 14, issue 13, pages 3348–3361, September 2021.
[doi:10.14778/3484224.3484232](https://doi.org/10.14778/3484224.3484232)
[[51](/en/ch1#Tigani2023-marker)] Jordan Tigani.
[Big Data is Dead](https://motherduck.com/blog/big-data-is-dead/).
*motherduck.com*, February 2023.
Archived at [perma.cc/HT4Q-K77U](https://perma.cc/HT4Q-K77U)
[[52](/en/ch1#Newman2021_ch1-marker)] Sam Newman.
[*Building
Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O’Reilly Media, 2021. ISBN: 9781492034025
[[53](/en/ch1#Richardson2014-marker)] Chris Richardson.
[Microservices: Decomposing
Applications for Deployability and Scalability](https://www.infoq.com/articles/microservices-intro/). *infoq.com*, May 2014.
Archived at [perma.cc/CKN4-YEQ2](https://perma.cc/CKN4-YEQ2)
[[54](/en/ch1#Shahrad2020-marker)] Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri,
Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini.
[Serverless in the Wild:
Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider](https://www.usenix.org/system/files/atc20-shahrad.pdf).
At *USENIX Annual Technical Conference* (ATC), July 2020.
[[55](/en/ch1#Barroso2018-marker)] Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan.
[The Datacenter as a
Computer: Designing Warehouse-Scale Machines](https://www.morganclaypool.com/doi/10.2200/S00874ED3V01Y201809CAC046), third edition.
Morgan & Claypool Synthesis Lectures on Computer Architecture, October 2018.
[doi:10.2200/S00874ED3V01Y201809CAC046](https://doi.org/10.2200/S00874ED3V01Y201809CAC046)
[[56](/en/ch1#Fiala2012-marker)] David Fiala, Frank Mueller, Christian Engelmann, Rolf
Riesen, Kurt Ferreira, and Ron Brightwell.
[Detection and
Correction of Silent Data Corruption for Large-Scale High-Performance Computing](https://arcb.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/sc12.pdf). At
*International Conference for High Performance Computing, Networking, Storage and
Analysis* (SC), November 2012.
[doi:10.1109/SC.2012.49](https://doi.org/10.1109/SC.2012.49)
[[57](/en/ch1#KornfeldSimpson2020-marker)] Anna Kornfeld
Simpson, Adriana Szekeres, Jacob Nelson, and Irene Zhang.
[Securing RDMA
for High-Performance Datacenter Storage Systems](https://www.usenix.org/conference/hotcloud20/presentation/kornfeld-simpson). At *12th USENIX Workshop on Hot Topics in
Cloud Computing* (HotCloud), July 2020.
[[58](/en/ch1#Singh2015-marker)] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson,
Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala,
Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat.
[Jupiter Rising: A
Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf). At
*Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015.
[doi:10.1145/2785956.2787508](https://doi.org/10.1145/2785956.2787508)
[[59](/en/ch1#Lockwood2014-marker)] Glenn K. Lockwood.
[Hadoop’s
Uncomfortable Fit in HPC](https://blog.glennklockwood.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html). *glennklockwood.blogspot.co.uk*, May 2014.
Archived at [perma.cc/S8XX-Y67B](https://perma.cc/S8XX-Y67B)
[[60](/en/ch1#ONeil2016_ch1-marker)] Cathy O’Neil. *Weapons of Math Destruction:
How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016.
ISBN: 9780553418811
[[61](/en/ch1#Shastri2020-marker)] Supreeth Shastri, Vinay Banakar, Melissa
Wasserman, Arun Kumar, and Vijay Chidambaram.
[Understanding and Benchmarking the
Impact of GDPR on Database Systems](https://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue
7, pages 1064–1077, March 2020.
[doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354)
[[62](/en/ch1#Datensparsamkeit-marker)] Martin Fowler.
[Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html).
*martinfowler.com*, December 2013.
Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6)
[[63](/en/ch1#GDPR-marker)] [Regulation
(EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data
Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016.
## Footnotes
## References
[^1]: Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. [The Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26)
[^2]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
[^3]: Joe Reis and Matt Housley. [*Fundamentals of Data Engineering*](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/). O’Reilly Media, 2022. ISBN: 9781098108304
[^4]: Rui Pedro Machado and Helder Russa. [*Analytics Engineering with SQL and dbt*](https://www.oreilly.com/library/view/analytics-engineering-with/9781098142377/). O’Reilly Media, 2023. ISBN: 9781098142384
[^5]: Edgar F. Codd, S. B. Codd, and C. T. Salley. [Providing OLAP to User-Analysts: An IT Mandate](https://www.estgv.ipv.pt/PaginasPessoais/jloureiro/ESI_AID2007_2008/fichas/codd.pdf). E. F. Codd Associates, 1993. Archived at [perma.cc/RKX8-2GEE](https://perma.cc/RKX8-2GEE)
[^6]: Chinmay Soman and Neha Pawar. [Comparing Three Real-Time OLAP Databases: Apache Pinot, Apache Druid, and ClickHouse](https://startree.ai/blog/a-tale-of-three-real-time-olap-databases). *startree.ai*, April 2023. Archived at [perma.cc/8BZP-VWPA](https://perma.cc/8BZP-VWPA)
[^7]: Surajit Chaudhuri and Umeshwar Dayal. [An Overview of Data Warehousing and OLAP Technology](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigrecord.pdf). *ACM SIGMOD Record*, volume 26, issue 1, pages 65–74, March 1997. [doi:10.1145/248603.248616](https://doi.org/10.1145/248603.248616)
[^8]: Fatma Özcan, Yuanyuan Tian, and Pinar Tözün. [Hybrid Transactional/Analytical Processing: A Survey](https://humming80.github.io/papers/sigmod-htaptut.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017. [doi:10.1145/3035918.3054784](https://doi.org/10.1145/3035918.3054784)
[^9]: Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov. [Cloud-Native Transactions and Analytics in SingleStore](https://dl.acm.org/doi/abs/10.1145/3514221.3526055). At *International Conference on Management of Data* (SIGMOD), June 2022. [doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055)
[^10]: Chao Zhang, Guoliang Li, Jintao Zhang, Xinning Zhang, and Jianhua Feng. [HTAP Databases: A Survey](https://arxiv.org/pdf/2404.15670). *IEEE Transactions on Knowledge and Data Engineering*, April 2024. [doi:10.1109/TKDE.2024.3389693](https://doi.org/10.1109/TKDE.2024.3389693)
[^11]: Michael Stonebraker and Uğur Çetintemel. [One Size Fits All: An Idea Whose Time Has Come and Gone](https://pages.cs.wisc.edu/~shivaram/cs744-readings/fits_all.pdf). At *21st International Conference on Data Engineering* (ICDE), April 2005. [doi:10.1109/ICDE.2005.1](https://doi.org/10.1109/ICDE.2005.1)
[^12]: Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M. Hellerstein, and Caleb Welton. [MAD Skills: New Analysis Practices for Big Data](https://www.vldb.org/pvldb/vol2/vldb09-219.pdf). *Proceedings of the VLDB Endowment*, volume 2, issue 2, pages 1481–1492, August 2009. [doi:10.14778/1687553.1687576](https://doi.org/10.14778/1687553.1687576)
[^13]: Dan Olteanu. [The Relational Data Borg is Learning](https://www.vldb.org/pvldb/vol13/p3502-olteanu.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, August 2020. [doi:10.14778/3415478.3415572](https://doi.org/10.14778/3415478.3415572)
[^14]: Matt Bornstein, Martin Casado, and Jennifer Li. [Emerging Architectures for Modern Data Infrastructure: 2020](https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/). *future.a16z.com*, October 2020. Archived at [perma.cc/LF8W-KDCC](https://perma.cc/LF8W-KDCC)
[^15]: Martin Fowler. [DataLake](https://www.martinfowler.com/bliki/DataLake.html). *martinfowler.com*, February 2015. Archived at [perma.cc/4WKN-CZUK](https://perma.cc/4WKN-CZUK)
[^16]: Bobby Johnson and Joseph Adler. [The Sushi Principle: Raw Data Is Better](https://learning.oreilly.com/videos/strata-hadoop/9781491924143/9781491924143-video210840/). At *Strata+Hadoop World*, February 2015.
[^17]: Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia. [Lakehouse: A New Generation of Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference on Innovative Data Systems Research* (CIDR), January 2021.
[^18]: DataKitchen, Inc. [The DataOps Manifesto](https://dataopsmanifesto.org/en/). *dataopsmanifesto.org*, 2017. Archived at [perma.cc/3F5N-FUQ4](https://perma.cc/3F5N-FUQ4)
[^19]: Tejas Manohar. [What is Reverse ETL: A Definition & Why It’s Taking Off](https://hightouch.io/blog/reverse-etl/). *hightouch.io*, November 2021. Archived at [perma.cc/A7TN-GLYJ](https://perma.cc/A7TN-GLYJ)
[^20]: Simon O’Regan. [Designing Data Products](https://towardsdatascience.com/designing-data-products-b6b93edf3d23). *towardsdatascience.com*, August 2018. Archived at [perma.cc/HU67-3RV8](https://perma.cc/HU67-3RV8)
[^21]: Camille Fournier. [Why is it so hard to decide to buy?](https://skamille.medium.com/why-is-it-so-hard-to-decide-to-buy-d86fee98e88e) *skamille.medium.com*, July 2021. Archived at [perma.cc/6VSG-HQ5X](https://perma.cc/6VSG-HQ5X)
[^22]: David Heinemeier Hansson. [Why we’re leaving the cloud](https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0). *world.hey.com*, October 2022. Archived at [perma.cc/82E6-UJ65](https://perma.cc/82E6-UJ65)
[^23]: Nima Badizadegan. [Use One Big Server](https://specbranch.com/posts/one-big-server/). *specbranch.com*, August 2022. Archived at [perma.cc/M8NB-95UK](https://perma.cc/M8NB-95UK)
[^24]: Steve Yegge. [Dear Google Cloud: Your Deprecation Policy is Killing You](https://steve-yegge.medium.com/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc). *steve-yegge.medium.com*, August 2020. Archived at [perma.cc/KQP9-SPGU](https://perma.cc/KQP9-SPGU)
[^25]: Alexandre Verbitski, Anurag Gupta, Debanjan Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz Kharatishvili, and Xiaofeng Bao. [Amazon Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases](https://media.amazonwebservices.com/blog/2017/aurora-design-considerations-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 1041–1052, May 2017. [doi:10.1145/3035918.3056101](https://doi.org/10.1145/3035918.3056101)
[^26]: Panagiotis Antonopoulos, Alex Budovski, Cristian Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade. [Socrates: The New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data* (SIGMOD), pages 1743–1756, June 2019. [doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047)
[^27]: Midhul Vuppalapati, Justin Miron, Rachit Agarwal, Dan Truong, Ashish Motivala, and Thierry Cruanes. [Building An Elastic Query Engine on Disaggregated Storage](https://www.usenix.org/system/files/nsdi20-paper-vuppalapati.pdf). At *17th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), February 2020.
[^28]: Nick Van Wiggeren. [The Real Failure Rate of EBS](https://planetscale.com/blog/the-real-fail-rate-of-ebs). *planetscale.com*, March 2025. Archived at [perma.cc/43CR-SAH5](https://perma.cc/43CR-SAH5)
[^29]: Colin Breck. [Predicting the Future of Distributed Systems](https://blog.colinbreck.com/predicting-the-future-of-distributed-systems/). *blog.colinbreck.com*, August 2024. Archived at [perma.cc/K5FC-4XX2](https://perma.cc/K5FC-4XX2)
[^30]: Gwen Shapira. [Compute-Storage Separation Explained](https://www.thenile.dev/blog/storage-compute). *thenile.dev*, January 2023. Archived at [perma.cc/QCV3-XJNZ](https://perma.cc/QCV3-XJNZ)
[^31]: Ravi Murthy and Gurmeet Goindi. [AlloyDB for PostgreSQL under the hood: Intelligent, database-aware storage](https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage). *cloud.google.com*, May 2022. Archived at [archive.org](https://web.archive.org/web/20220514021120/https%3A//cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage)
[^32]: Jack Vanlightly. [The Architecture of Serverless Data Systems](https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems). *jack-vanlightly.com*, November 2023. Archived at [perma.cc/UDV4-TNJ5](https://perma.cc/UDV4-TNJ5)
[^33]: Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth, Neeraja Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, and David A. Patterson. [Cloud Programming Simplified: A Berkeley View on Serverless Computing](https://arxiv.org/abs/1902.03383). *arxiv.org*, February 2019.
[^34]: Betsy Beyer, Jennifer Petoff, Chris Jones, and Niall Richard Murphy. [*Site Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/). O’Reilly Media, 2016. ISBN: 9781491929124
[^35]: Thomas Limoncelli. [The Time I Stole $10,000 from Bell Labs](https://queue.acm.org/detail.cfm?id=3434773). *ACM Queue*, volume 18, issue 5, November 2020. [doi:10.1145/3434571.3434773](https://doi.org/10.1145/3434571.3434773)
[^36]: Charity Majors. [The Future of Ops Jobs](https://acloudguru.com/blog/engineering/the-future-of-ops-jobs). *acloudguru.com*, August 2020. Archived at [perma.cc/GRU2-CZG3](https://perma.cc/GRU2-CZG3)
[^37]: Boris Cherkasky. [(Over)Pay As You Go for Your Datastore](https://medium.com/riskified-technology/over-pay-as-you-go-for-your-datastore-11a29ae49a8b). *medium.com*, September 2021. Archived at [perma.cc/Q8TV-2AM2](https://perma.cc/Q8TV-2AM2)
[^38]: Shlomi Kushchi. [Serverless Doesn’t Mean DevOpsLess or NoOps](https://thenewstack.io/serverless-doesnt-mean-devopsless-or-noops/). *thenewstack.io*, February 2023. Archived at [perma.cc/3NJR-AYYU](https://perma.cc/3NJR-AYYU)
[^39]: Erik Bernhardsson. [Storm in the stratosphere: how the cloud will be reshuffled](https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html). *erikbern.com*, November 2021. Archived at [perma.cc/SYB2-99P3](https://perma.cc/SYB2-99P3)
[^40]: Benn Stancil. [The data OS](https://benn.substack.com/p/the-data-os). *benn.substack.com*, September 2021. Archived at [perma.cc/WQ43-FHS6](https://perma.cc/WQ43-FHS6)
[^41]: Maria Korolov. [Data residency laws pushing companies toward residency as a service](https://www.csoonline.com/article/3647761/data-residency-laws-pushing-companies-toward-residency-as-a-service.html). *csoonline.com*, January 2022. Archived at [perma.cc/CHE4-XZZ2](https://perma.cc/CHE4-XZZ2)
[^42]: Severin Borenstein. [Can Data Centers Flex Their Power Demand?](https://energyathaas.wordpress.com/2025/04/14/can-data-centers-flex-their-power-demand/) *energyathaas.wordpress.com*, April 2025. Archived at [perma.cc/MUD3-A6FF](https://perma.cc/MUD3-A6FF)
[^43]: Bilge Acun, Benjamin Lee, Fiodar Kazhamiaka, Aditya Sundarrajan, Kiwan Maeng, Manoj Chakkaravarthy, David Brooks, and Carole-Jean Wu. [Carbon Dependencies in Datacenter Design and Management](https://hotcarbon.org/assets/2022/pdf/hotcarbon22-acun.pdf). *ACM SIGENERGY Energy Informatics Review*, volume 3, issue 3, pages 21–26. [doi:10.1145/3630614.3630619](https://doi.org/10.1145/3630614.3630619)
[^44]: Kousik Nath. [These are the numbers every computer engineer should know](https://www.freecodecamp.org/news/must-know-numbers-for-every-computer-engineer/). *freecodecamp.org*, September 2019. Archived at [perma.cc/RW73-36RL](https://perma.cc/RW73-36RL)
[^45]: Joseph M. Hellerstein, Jose Faleiro, Joseph E. Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu. [Serverless Computing: One Step Forward, Two Steps Back](https://arxiv.org/abs/1812.03651). At *Conference on Innovative Data Systems Research* (CIDR), January 2019.
[^46]: Frank McSherry, Michael Isard, and Derek G. Murray. [Scalability! But at What COST?](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf) At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
[^47]: Cindy Sridharan. *[Distributed Systems Observability: A Guide to Building Robust Systems](https://unlimited.humio.com/rs/756-LMY-106/images/Distributed-Systems-Observability-eBook.pdf)*. Report, O'Reilly Media, May 2018. Archived at [perma.cc/M6JL-XKCM](https://perma.cc/M6JL-XKCM)
[^48]: Charity Majors. [Observability — A 3-Year Retrospective](https://thenewstack.io/observability-a-3-year-retrospective/). *thenewstack.io*, August 2019. Archived at [perma.cc/CG62-TJWL](https://perma.cc/CG62-TJWL)
[^49]: Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag. [Dapper, a Large-Scale Distributed Systems Tracing Infrastructure](https://research.google/pubs/pub36356/). Google Technical Report dapper-2010-1, April 2010. Archived at [perma.cc/K7KU-2TMH](https://perma.cc/K7KU-2TMH)
[^50]: Rodrigo Laigner, Yongluan Zhou, Marcos Antonio Vaz Salles, Yijian Liu, and Marcos Kalinowski. [Data management in microservices: State of the practice, challenges, and research directions](https://www.vldb.org/pvldb/vol14/p3348-laigner.pdf). *Proceedings of the VLDB Endowment*, volume 14, issue 13, pages 3348–3361, September 2021. [doi:10.14778/3484224.3484232](https://doi.org/10.14778/3484224.3484232)
[^51]: Jordan Tigani. [Big Data is Dead](https://motherduck.com/blog/big-data-is-dead/). *motherduck.com*, February 2023. Archived at [perma.cc/HT4Q-K77U](https://perma.cc/HT4Q-K77U)
[^52]: Sam Newman. [*Building Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O'Reilly Media, 2021. ISBN: 9781492034025
[^53]: Chris Richardson. [Microservices: Decomposing Applications for Deployability and Scalability](https://www.infoq.com/articles/microservices-intro/). *infoq.com*, May 2014. Archived at [perma.cc/CKN4-YEQ2](https://perma.cc/CKN4-YEQ2)
[^54]: Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. [Serverless in the Wild: Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider](https://www.usenix.org/system/files/atc20-shahrad.pdf). At *USENIX Annual Technical Conference* (ATC), July 2020.
[^55]: Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. [The Datacenter as a Computer: Designing Warehouse-Scale Machines](https://www.morganclaypool.com/doi/10.2200/S00874ED3V01Y201809CAC046), third edition. Morgan & Claypool Synthesis Lectures on Computer Architecture, October 2018. [doi:10.2200/S00874ED3V01Y201809CAC046](https://doi.org/10.2200/S00874ED3V01Y201809CAC046)
[^56]: David Fiala, Frank Mueller, Christian Engelmann, Rolf Riesen, Kurt Ferreira, and Ron Brightwell. [Detection and Correction of Silent Data Corruption for Large-Scale High-Performance Computing](https://arcb.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/sc12.pdf). At *International Conference for High Performance Computing, Networking, Storage and Analysis* (SC), November 2012. [doi:10.1109/SC.2012.49](https://doi.org/10.1109/SC.2012.49)
[^57]: Anna Kornfeld Simpson, Adriana Szekeres, Jacob Nelson, and Irene Zhang. [Securing RDMA for High-Performance Datacenter Storage Systems](https://www.usenix.org/conference/hotcloud20/presentation/kornfeld-simpson). At *12th USENIX Workshop on Hot Topics in Cloud Computing* (HotCloud), July 2020.
[^58]: Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat. [Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google's Datacenter Network](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf). At *Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015. [doi:10.1145/2785956.2787508](https://doi.org/10.1145/2785956.2787508)
[^59]: Glenn K. Lockwood. [Hadoop's Uncomfortable Fit in HPC](https://blog.glennklockwood.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html). *glennklockwood.blogspot.co.uk*, May 2014. Archived at [perma.cc/S8XX-Y67B](https://perma.cc/S8XX-Y67B)
[^60]: Cathy O'Neil. *Weapons of Math Destruction: How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016. ISBN: 9780553418811
[^61]: Supreeth Shastri, Vinay Banakar, Melissa Wasserman, Arun Kumar, and Vijay Chidambaram. [Understanding and Benchmarking the Impact of GDPR on Database Systems](https://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 7, pages 1064–1077, March 2020. [doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354)
[^62]: Martin Fowler. [Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html). *martinfowler.com*, December 2013. Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6)
[^63]: [Regulation (EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016.


@ -55,17 +55,17 @@ the computer which operations to perform in which order. A declarative query lan
because it is typically more concise and easier to write than an explicit algorithm. But more
importantly, it also hides implementation details of the query engine, which makes it possible for
the database system to introduce performance improvements without requiring any changes to queries
[[1](/en/ch3#Brandon2024)].
[^1].
For example, a database might be able to execute a declarative query in parallel across multiple CPU
cores and machines, without you having to worry about how to implement that parallelism
[[2](/en/ch3#Hellerstein2010)].
[^2].
In a hand-coded algorithm it would be a lot of work to implement such parallel execution yourself.
# Relational Model versus Document Model
The best-known data model today is probably that of SQL, based on the relational model proposed by
Edgar Codd in 1970 [[3](/en/ch3#Codd1970)]:
Edgar Codd in 1970 [^3]:
data is organized into *relations* (called *tables* in SQL), where each relation is an unordered collection
of *tuples* (*rows* in SQL).
@ -80,10 +80,10 @@ and early 1980s, the *network model* and the *hierarchical model* were the main
the relational model came to dominate them. Object databases came and went again in the late 1980s
and early 1990s. XML databases appeared in the early 2000s, but have only seen niche adoption. Each
competitor to the relational model generated a lot of hype in its time, but it never lasted
[[4](/en/ch3#Stonebraker2005around)].
[^4].
Instead, SQL has grown to incorporate other data types besides its relational core—for example,
adding support for XML, JSON, and graph data
[[5](/en/ch3#Winand2015)].
[^5].
In the 2010s, *NoSQL* was the latest buzzword that tried to overthrow the dominance of relational
databases. NoSQL refers not to a single technology, but a loose set of ideas around new data models,
@ -122,7 +122,7 @@ reflections and other troubles.
Object-relational mapping (ORM) frameworks like ActiveRecord and Hibernate reduce the amount of
boilerplate code required for this translation layer, but they are often criticized
[[6](/en/ch3#Fowler2012)].
[^6].
Some commonly cited problems are:
* ORMs are complex and can't completely hide the differences between the two models, so developers
@ -137,7 +137,7 @@ Some commonly cited problems are:
 database. Customizing the ORM's schema and query generation can be complex and negate the benefit
of using the ORM in the first place.
* ORMs make it easy to accidentally write inefficient queries, such as the *N+1 query problem*
[[7](/en/ch3#Mihalcea2023)].
[^7].
For example, say you want to display a list of user comments on a page, so you perform one query
that returns *N* comments, each containing the ID of its author. To show the name of the comment
author you need to look up the ID in the users table. In hand-written SQL you would probably
@ -213,7 +213,7 @@ The JSON representation has better *locality* than the multi-table schema in
[Figure 3-1](/en/ch3#fig_obama_relational) (see [“Data locality for reads and writes”](/en/ch3#sec_datamodels_document_locality)). If you want to fetch a profile
in the relational example, you need to either perform multiple queries (query each table by
`user_id`) or perform a messy multi-way join between the `users` table and its subordinate tables
[[8](/en/ch3#Schauder2023)].
[^8].
In the JSON representation, all the relevant information is in one place, making the query both
faster and simpler.
@ -314,7 +314,7 @@ name:
* In a denormalized representation, we would include the image URL of the logo on every individual
 person's profile; this makes the JSON document self-contained, but it creates a headache if we
ever need to change the logo, because we now need to find all of the occurrences of the old URL
and update them [[9](/en/ch3#Zola2014)].
and update them [^9].
* In a normalized representation, we would create an entity representing an organization or school,
and store its name, logo URL, and perhaps other attributes (description, news feed, etc.) once on
that entity. Every résumé that mentions the organization would then simply reference its ID, and
@ -350,7 +350,7 @@ denormalized representation consistent.
However, the implementation of materialized timelines at X (formerly Twitter) does not store the
actual text of each post: each entry actually only stores the post ID, the ID of the user who posted
it, and a little bit of extra information to identify reposts and replies
[[11](/en/ch3#Krikorian2012_ch3)].
[^11].
In other words, it is a precomputed result of (approximately) the following query:
```
@ -366,7 +366,7 @@ the post ID to fetch the actual post content (as well as statistics such as the
and replies), and look up the sender's profile by ID (to get their username, profile picture, and
other details). This process of looking up the human-readable information by ID is called
*hydrating* the IDs, and it is essentially a join performed in application code
[[11](/en/ch3#Krikorian2012_ch3)].
[^11].
The reason for storing only IDs in the precomputed timeline is that the data they refer to is
fast-changing: the number of likes and replies may change multiple times per second on a popular
@ -453,7 +453,7 @@ support are able to create such indexes on values inside a document.
Data warehouses (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are usually relational, and there are a few
widely-used conventions for the structure of tables in a data warehouse: a *star schema*,
*snowflake schema*, *dimensional modeling*
[[12](/en/ch3#Kimball2013_ch3)],
[^12],
and *one big table* (OBT). These structures are optimized for the needs of business analysts. ETL
processes translate data from operational systems into this schema.
@ -498,7 +498,7 @@ product categories, and each row in the `dim_product` table could reference the
as foreign keys, rather than storing them as strings in the `dim_product` table. Snowflake schemas
are more normalized than star schemas, but star schemas are often preferred because
they are simpler for analysts to work with
[[12](/en/ch3#Kimball2013_ch3)].
[^12].
In a typical data warehouse, tables are often quite wide: fact tables often have over 100 columns,
sometimes several hundred. Dimension tables can also be wide, as they include all the metadata that
@ -519,7 +519,7 @@ Some data warehouse schemas take denormalization even further and leave out the
entirely, folding the information in the dimensions into denormalized columns on the fact table
instead (essentially, precomputing the join between the fact table and the dimension tables). This
approach is known as *one big table* (OBT), and while it requires more storage space, it sometimes
enables faster queries [[13](/en/ch3#Kaminsky2022)].
enables faster queries [^13].
In the context of analytics, such denormalization is unproblematic, since the data typically
represents a log of historical data that is not going to change (except maybe for occasionally
@ -564,23 +564,23 @@ reading, clients have no guarantees as to what fields the documents may contain.
Document databases are sometimes called *schemaless*, but that's misleading, as the code that reads
the data usually assumes some kind of structure—i.e., there is an implicit schema, but it is not
enforced by the database [[17](/en/ch3#Schemaless)].
enforced by the database [^17].
A more accurate term is *schema-on-read* (the structure of the data is implicit, and only
interpreted when the data is read), in contrast with *schema-on-write* (the traditional approach of
relational databases, where the schema is explicit and the database ensures all data conforms to it
when the data is written) [[18](/en/ch3#Awadallah2009)].
when the data is written) [^18].
Schema-on-read is similar to dynamic (runtime) type checking in programming languages, whereas
schema-on-write is similar to static (compile-time) type checking. Just as the advocates of static
and dynamic type checking have big debates about their relative merits
[[19](/en/ch3#Odersky2013)],
[^19],
enforcement of schemas in databases is a contentious topic, and in general there's no right or wrong
answer.
The difference between the approaches is particularly noticeable in situations where an application
wants to change the format of its data. For example, say you are currently storing each user's full
name in one field, and you instead want to store the first name and last name separately
[[20](/en/ch3#Irwin2013)].
[^20].
In a document database, you would just start writing new documents with the new fields and have
code in the application that handles the case when old documents are read. For example:
@ -647,12 +647,12 @@ However, the idea of storing related data together for locality is not limited t
model. For example, Google's Spanner database offers the same locality properties in a relational
data model, by allowing the schema to declare that a table's rows should be interleaved (nested)
within a parent table
[[25](/en/ch3#Corbett2012_ch2)].
[^25].
Oracle allows the same, using a feature called *multi-table index cluster tables*
[[26](/en/ch3#BurlesonCluster)].
[^26].
The *wide-column* data model popularized by Google's Bigtable, and used e.g. in HBase and Accumulo,
has a concept of *column families*, which have a similar purpose of managing locality
[[27](/en/ch3#Chang2006_ch3)].
[^27].
### Query languages for documents
@ -663,9 +663,9 @@ to query for values inside documents, and some provide rich query languages.
XML databases are often queried using XQuery and XPath, which are designed to allow complex queries,
including joins across multiple documents, and also format their results as XML
[[28](/en/ch3#Walmsley2015)]. JSON Pointer
[[29](/en/ch3#Bryan2013)] and JSONPath
[[30](/en/ch3#Goessner2024)] provide an equivalent to XPath for JSON.
[^28]. JSON Pointer
[^29] and JSONPath
[^30] provide an equivalent to XPath for JSON.
MongoDBs aggregation pipeline, whose `$lookup` operator for joins we saw in
[“Normalization, Denormalization, and Joins”](/en/ch3#sec_datamodels_normalization), is an example of a query language for collections of JSON
@ -716,7 +716,7 @@ matter of taste.
Document databases and relational databases started out as very different approaches to data
management, but they have grown more similar over time
[[31](/en/ch3#Stonebraker2024)].
[^31].
Relational databases added support for JSON types and query operators, and the ability to index
properties inside documents. Some document databases (such as MongoDB, Couchbase, and RethinkDB)
added support for joins, secondary indexes, and declarative query languages.
@ -730,7 +730,7 @@ combination.
###### Note
Codd's original description of the relational model
[[3](/en/ch3#Codd1970)] actually allowed something similar to JSON
[^3] actually allowed something similar to JSON
within a relational schema. He called it *nonsimple domains*. The idea was that a value in a row
doesn't have to just be a primitive datatype like a number or a string, but it could also be a
nested relation (table)—so you can have an arbitrarily nested tree structure as a value, much like
@ -763,7 +763,7 @@ Well-known algorithms can operate on these graphs: for example, map navigation a
the shortest path between two points in a road network, and
PageRank can be used on the web graph to determine the
popularity of a web page and thus its ranking in search results
[[32](/en/ch3#Page1999)].
[^32].
Graphs can be represented in several different ways. In the *adjacency list* model, each vertex
stores the IDs of its neighbor vertices that are one edge away. Alternatively, you can use an
@ -781,24 +781,24 @@ types of objects in a single database. For example:
represent people, locations, events, checkins, and comments made by users; edges indicate which
people are friends with each other, which checkin happened in which location, who commented on
which post, who attended which event, and so on
[[33](/en/ch3#Bronson2013)].
[^33].
* Knowledge graphs are used by search engines to record facts about entities that often occur in
search queries, such as organizations, people, and places
[[34](/en/ch3#Noy2019)].
[^34].
This information is obtained by crawling and analyzing the text on websites; some websites, such
as Wikidata, also publish graph data in a structured form.
There are several different, but related, ways of structuring and querying data in graphs. In this
section we will discuss the *property graph* model (implemented by Neo4j, Memgraph, KùzuDB
[[35](/en/ch3#Feng2023)],
and others [[36](/en/ch3#Besta2019)])
[^35],
and others [^36])
and the *triple-store* model (implemented by Datomic, AllegroGraph, Blazegraph, and others). These
models are fairly similar in what they can express, and some graph databases (such as Amazon
Neptune) support both models.
We will also look at four query languages for graphs (Cypher, SPARQL, Datalog, and GraphQL), as well
as SQL support for querying graphs. Other graph query languages exist, such as Gremlin
[[37](/en/ch3#TinkerPop2023)],
[^37],
but these will give us a representative overview.
To illustrate these different languages and models, this section uses the graph shown in
@ -902,12 +902,12 @@ extended to accommodate changes in your application's data structures.
*Cypher* is a query language for property graphs, originally created for the Neo4j graph database,
and later developed into an open standard as *openCypher*
[[38](/en/ch3#Francis2018)].
[^38].
Besides Neo4j, Cypher is supported by Memgraph, KùzuDB
[[35](/en/ch3#Feng2023)],
[^35],
Amazon Neptune, Apache AGE (with storage in PostgreSQL), and others. It is named after a character
in the movie *The Matrix* and is not related to ciphers in cryptography
[[39](/en/ch3#EifremTweet)].
[^39].
[Example 3-4](/en/ch3#fig_cypher_create) shows the Cypher query to insert the lefthand portion of
[Figure 3-6](/en/ch3#fig_datamodels_graph) into a graph database. The rest of the graph can be added similarly. Each
@ -1069,17 +1069,17 @@ JOIN lives_in_europe ON vertices.vertex_id = lives_in_europe.vertex_id;
The fact that a 4-line Cypher query requires 31 lines in SQL shows how much of a difference the
right choice of data model and query language can make. And this is just the beginning; there are
more details to consider, e.g., around handling cycles, and choosing between breadth-first or
depth-first traversal [[40](/en/ch3#Tisiot2021)].
depth-first traversal [^40].
Oracle has a different SQL extension for recursive queries, which it calls *hierarchical*
[[41](/en/ch3#Goel2020)].
[^41].
However, the situation may be improving: at the time of writing, there are plans to add a graph
query language called GQL to the SQL standard [[42](/en/ch3#Deutsch2022),
[43](/en/ch3#Green2019)],
which will provide a syntax inspired by Cypher, GSQL
[[44](/en/ch3#Deutsch2018)], and PGQL
[[45](/en/ch3#vanRest2016)].
[^44], and PGQL
[^45].
## Triple-Stores and SPARQL
@ -1107,15 +1107,15 @@ The subject of a triple is equivalent to a vertex in a graph. The object is one
To be precise, databases that offer a triple-like data model often need to store some additional
metadata on each tuple. For example, AWS Neptune uses quads (4-tuples) by adding a graph ID to each
triple [[46](/en/ch3#NeptuneDataModel)];
triple [^46];
Datomic uses 5-tuples, extending each triple with a transaction ID and a boolean to indicate
deletion [[47](/en/ch3#DatomicDataModel)].
deletion [^47].
Since these databases retain the basic *subject-predicate-object* structure explained above, this
book nevertheless calls them triple-stores.
[Example 3-7](/en/ch3#fig_graph_n3_triples) shows the same data as in [Example 3-4](/en/ch3#fig_cypher_create), written as
triples in a format called *Turtle*, a subset of *Notation3* (*N3*)
[[48](/en/ch3#Beckett2011)].
[^48].
##### Example 3-7. A subset of the data in [Figure 3-6](/en/ch3#fig_datamodels_graph), represented as Turtle triples
@ -1166,13 +1166,13 @@ Web as originally envisioned did not succeed
[[49](/en/ch3#Target2018),
[50](/en/ch3#MendelGleason2022)],
the legacy of the Semantic Web project lives on in a couple of specific technologies: *linked data*
standards such as JSON-LD [[51](/en/ch3#Sporny2014)],
standards such as JSON-LD [^51],
*ontologies* used in biomedical science
[[52](/en/ch3#MichiganOntologies)],
[^52],
Facebook's Open Graph protocol
[[53](/en/ch3#OpenGraph)]
[^53]
(which is used for link unfurling
[[54](/en/ch3#Haughey2015)]),
[^54]),
knowledge graphs such as Wikidata, and standardized vocabularies for structured data maintained by
[`schema.org`](https://schema.org/).
@ -1184,7 +1184,7 @@ for applications.
The Turtle language we used in [Example 3-8](/en/ch3#fig_graph_n3_shorthand) is actually a way of encoding data in the
*Resource Description Framework* (RDF)
[[55](/en/ch3#W3CRDF)],
[^55],
a data model that was designed for the Semantic Web. RDF data can also be encoded in other ways, for
example (more verbosely) in XML, as shown in [Example 3-9](/en/ch3#fig_graph_rdf_xml). Tools like Apache Jena can
automatically convert between different RDF encodings.
@ -1235,7 +1235,7 @@ just specify this prefix once at the top of the file, and then forget about it.
### The SPARQL query language
*SPARQL* is a query language for triple-stores using the RDF data model
[[56](/en/ch3#Harris2013)].
[^56].
(It is an acronym for *SPARQL Protocol and RDF Query Language*, pronounced “sparkle.”)
It predates Cypher, and since Cypher's pattern matching is borrowed from SPARQL, they look quite
similar.
@ -1275,7 +1275,7 @@ bound to any vertex that has a `name` property whose value is the string `"Unite
```
SPARQL is supported by Amazon Neptune, AllegroGraph, Blazegraph, OpenLink Virtuoso, Apache Jena, and
various other triple stores [[36](/en/ch3#Besta2019)].
various other triple stores [^36].
## Datalog: Recursive Relational Queries
@ -1286,7 +1286,7 @@ Datalog is a much older language than SPARQL or Cypher: it arose from academic r
It is less well known among software engineers and not widely supported in mainstream databases, but
it ought to be better-known since it is a very expressive language that is particularly powerful for
complex queries. Several niche databases, including Datomic, LogicBlox, CozoDB, and LinkedIn's
LIquid [[60](/en/ch3#Meyer2020)] use Datalog as
LIquid [^60] use Datalog as
their query language.
Datalog is actually based on a relational data model, not a graph, but it appears in the graph
@ -1403,7 +1403,7 @@ APIs.
GraphQL's flexibility comes at a cost. Organizations that adopt GraphQL often need tooling to
convert GraphQL queries into requests to internal services, which often use REST or gRPC (see
[Chapter 5](/en/ch5#ch_encoding)). Authorization, rate limiting, and performance challenges are additional concerns
[[61](/en/ch3#Bessey2024)].
[^61].
GraphQL's query language is also limited since GraphQL queries come from an untrusted source. The language
does not allow anything that could be expensive to execute, since otherwise users could perform
denial-of-service attacks on a server by running lots of expensive queries. In particular, GraphQL
@ -1547,7 +1547,7 @@ known as *event sourcing* [[62](/en/ch3#Betts2012),
[63](/en/ch3#Young2014)].
The principle of maintaining separate read-optimized representations and deriving them from the
write-optimized representation is called *command query responsibility segregation (CQRS)*
[[64](/en/ch3#Young2010)].
[^64].
These terms originated in the domain-driven design (DDD) community, although similar ideas have been
around for a long time, for example in *state machine replication* (see [“Using shared logs”](/en/ch10#sec_consistency_smr)).
@ -1661,7 +1661,7 @@ users.
Dataframe APIs also offer a wide variety of operations that go far beyond what relational databases
offer, and the data model is often used in ways that are very different from typical relational data
modelling [[65](/en/ch3#Petersohn2020)].
modelling [^65].
For example, a common use of dataframes is to transform data from a relational-like representation
into a matrix or multidimensional array representation, which is the form that many machine learning
algorithms expect of their input.
@ -1698,14 +1698,14 @@ into a matrix representation, while giving the data scientist control over the r
is most suitable for achieving the goals of the data analysis or model training process.
There are also databases such as TileDB
[[66](/en/ch3#Papadopoulos2016)]
[^66]
that specialize in storing large multidimensional arrays of numbers; they are called *array
databases* and are most commonly used for scientific datasets such as geospatial measurements
(raster data on a regularly spaced grid), medical imaging, or observations from astronomical
telescopes [[67](/en/ch3#Rusu2022)].
telescopes [^67].
Dataframes are also used in the financial industry for representing *time series data*, such as the
prices of assets and trades over time
[[68](/en/ch3#Targett2023)].
[^68].
# Summary
@ -1757,7 +1757,7 @@ a few brief examples:
means taking one very long string (representing a DNA molecule) and matching it against a large
database of strings that are similar, but not identical. None of the databases described here can
handle this kind of usage, which is why researchers have written specialized genome database
software like GenBank [[69](/en/ch3#Benson2007)].
software like GenBank [^69].
* Many financial systems use *ledgers* with double-entry accounting as their data model. This type
of data can be represented in relational databases, but there are also databases such as
TigerBeetle that specialize in this data model. Cryptocurrencies and blockchains are typically
@ -1771,361 +1771,78 @@ come into play when *implementing* the data models described in this chapter.
##### Footnotes
##### References
[[1](/en/ch3#Brandon2024-marker)] Jamie Brandon.
[Unexplanations:
query optimization works because sql is declarative](https://www.scattered-thoughts.net/writing/unexplanations-sql-declarative/). *scattered-thoughts.net*, February 2024.
Archived at [perma.cc/P6W2-WMFZ](https://perma.cc/P6W2-WMFZ)
[[2](/en/ch3#Hellerstein2010-marker)] Joseph M. Hellerstein.
[The Declarative
Imperative: Experiences and Conjectures in Distributed Logic](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf). Tech report UCB/EECS-2010-90,
Electrical Engineering and Computer Sciences, University of California at Berkeley, June 2010.
Archived at [perma.cc/K56R-VVQM](https://perma.cc/K56R-VVQM)
[[3](/en/ch3#Codd1970-marker)] Edgar F. Codd.
[A Relational Model of Data for Large
Shared Data Banks](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf). *Communications of the ACM*, volume 13, issue 6, pages 377–387, June 1970.
[doi:10.1145/362384.362685](https://doi.org/10.1145/362384.362685)
[[4](/en/ch3#Stonebraker2005around-marker)] Michael Stonebraker and Joseph M. Hellerstein.
[What Goes Around Comes Around](http://mitpress2.mit.edu/books/chapters/0262693143chapm1.pdf).
In *Readings in Database Systems*, 4th edition, MIT Press, pages 2–41, 2005.
ISBN: 9780262693141
[[5](/en/ch3#Winand2015-marker)] Markus Winand.
[Modern SQL: Beyond Relational](https://modern-sql.com/). *modern-sql.com*, 2015.
Archived at [perma.cc/D63V-WAPN](https://perma.cc/D63V-WAPN)
[[6](/en/ch3#Fowler2012-marker)] Martin Fowler.
[OrmHate](https://martinfowler.com/bliki/OrmHate.html). *martinfowler.com*, May
2012. Archived at [perma.cc/VCM8-PKNG](https://perma.cc/VCM8-PKNG)
[[7](/en/ch3#Mihalcea2023-marker)] Vlad Mihalcea.
[N+1 query problem with JPA and Hibernate](https://vladmihalcea.com/n-plus-1-query-problem/).
*vladmihalcea.com*, January 2023.
Archived at [perma.cc/79EV-TZKB](https://perma.cc/79EV-TZKB)
[[8](/en/ch3#Schauder2023-marker)] Jens Schauder.
[This
is the Beginning of the End of the N+1 Problem: Introducing Single Query Loading](https://spring.io/blog/2023/08/31/this-is-the-beginning-of-the-end-of-the-n-1-problem-introducing-single-query). *spring.io*, August 2023.
Archived at [perma.cc/6V96-R333](https://perma.cc/6V96-R333)
[^1]: Jamie Brandon. [Unexplanations: query optimization works because sql is declarative](https://www.scattered-thoughts.net/writing/unexplanations-sql-declarative/). *scattered-thoughts.net*, February 2024. Archived at [perma.cc/P6W2-WMFZ](https://perma.cc/P6W2-WMFZ)
[^2]: Joseph M. Hellerstein. [The Declarative Imperative: Experiences and Conjectures in Distributed Logic](https://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-90.pdf). Tech report UCB/EECS-2010-90, Electrical Engineering and Computer Sciences, University of California at Berkeley, June 2010. Archived at [perma.cc/K56R-VVQM](https://perma.cc/K56R-VVQM)
[^3]: Edgar F. Codd. [A Relational Model of Data for Large Shared Data Banks](https://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf). *Communications of the ACM*, volume 13, issue 6, pages 377–387, June 1970. [doi:10.1145/362384.362685](https://doi.org/10.1145/362384.362685)
[^4]: Michael Stonebraker and Joseph M. Hellerstein. [What Goes Around Comes Around](http://mitpress2.mit.edu/books/chapters/0262693143chapm1.pdf). In *Readings in Database Systems*, 4th edition, MIT Press, pages 2–41, 2005. ISBN: 9780262693141
[^5]: Markus Winand. [Modern SQL: Beyond Relational](https://modern-sql.com/). *modern-sql.com*, 2015. Archived at [perma.cc/D63V-WAPN](https://perma.cc/D63V-WAPN)
[^6]: Martin Fowler. [OrmHate](https://martinfowler.com/bliki/OrmHate.html). *martinfowler.com*, May 2012. Archived at [perma.cc/VCM8-PKNG](https://perma.cc/VCM8-PKNG)
[^7]: Vlad Mihalcea. [N+1 query problem with JPA and Hibernate](https://vladmihalcea.com/n-plus-1-query-problem/). *vladmihalcea.com*, January 2023. Archived at [perma.cc/79EV-TZKB](https://perma.cc/79EV-TZKB)
[^8]: Jens Schauder. [This is the Beginning of the End of the N+1 Problem: Introducing Single Query Loading](https://spring.io/blog/2023/08/31/this-is-the-beginning-of-the-end-of-the-n-1-problem-introducing-single-query). *spring.io*, August 2023. Archived at [perma.cc/6V96-R333](https://perma.cc/6V96-R333)
[^9]: William Zola. [6 Rules of Thumb for MongoDB Schema Design](https://www.mongodb.com/blog/post/6-rules-of-thumb-for-mongodb-schema-design). *mongodb.com*, June 2014. Archived at [perma.cc/T2BZ-PPJB](https://perma.cc/T2BZ-PPJB)
[^10]: Sidney Andrews and Christopher McClister. [Data modeling in Azure Cosmos DB](https://learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data). *learn.microsoft.com*, February 2023. Archived at [archive.org](https://web.archive.org/web/20230207193233/https%3A//learn.microsoft.com/en-us/azure/cosmos-db/nosql/modeling-data)
[^11]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK)
[^12]: Ralph Kimball and Margy Ross. [*The Data Warehouse Toolkit: The Definitive Guide to Dimensional Modeling*](https://learning.oreilly.com/library/view/the-data-warehouse/9781118530801/), 3rd edition. John Wiley & Sons, July 2013. ISBN: 9781118530801
[^13]: Michael Kaminsky. [Data warehouse modeling: Star schema vs. OBT](https://www.fivetran.com/blog/star-schema-vs-obt). *fivetran.com*, August 2022. Archived at [perma.cc/2PZK-BFFP](https://perma.cc/2PZK-BFFP)
[^14]: Joe Nelson. [User-defined Order in SQL](https://begriffs.com/posts/2018-03-20-user-defined-order.html). *begriffs.com*, March 2018. Archived at [perma.cc/GS3W-F7AD](https://perma.cc/GS3W-F7AD)
[^15]: Evan Wallace. [Realtime Editing of Ordered Sequences](https://www.figma.com/blog/realtime-editing-of-ordered-sequences/). *figma.com*, March 2017. Archived at [perma.cc/K6ER-CQZW](https://perma.cc/K6ER-CQZW)
[^16]: David Greenspan. [Implementing Fractional Indexing](https://observablehq.com/%40dgreensp/implementing-fractional-indexing). *observablehq.com*, October 2020. Archived at [perma.cc/5N4R-MREN](https://perma.cc/5N4R-MREN)
[^17]: Martin Fowler. [Schemaless Data Structures](https://martinfowler.com/articles/schemaless/). *martinfowler.com*, January 2013.
[^18]: Amr Awadallah. [Schema-on-Read vs. Schema-on-Write](https://www.slideshare.net/awadallah/schemaonread-vs-schemaonwrite). At *Berkeley EECS RAD Lab Retreat*, Santa Cruz, CA, May 2009. Archived at [perma.cc/DTB2-JCFR](https://perma.cc/DTB2-JCFR)
[^19]: Martin Odersky. [The Trouble with Types](https://www.infoq.com/presentations/data-types-issues/). At *Strange Loop*, September 2013. Archived at [perma.cc/85QE-PVEP](https://perma.cc/85QE-PVEP)
[^20]: Conrad Irwin. [MongoDB—Confessions of a PostgreSQL Lover](https://speakerdeck.com/conradirwin/mongodb-confessions-of-a-postgresql-lover). At *HTML5DevConf*, October 2013. Archived at [perma.cc/C2J6-3AL5](https://perma.cc/C2J6-3AL5)
[^21]: [Percona Toolkit Documentation: pt-online-schema-change](https://docs.percona.com/percona-toolkit/pt-online-schema-change.html). *docs.percona.com*, 2023. Archived at [perma.cc/9K8R-E5UH](https://perma.cc/9K8R-E5UH)
[^22]: Shlomi Noach. [gh-ost: GitHub’s Online Schema Migration Tool for MySQL](https://github.blog/2016-08-01-gh-ost-github-s-online-migration-tool-for-mysql/). *github.blog*, August 2016. Archived at [perma.cc/7XAG-XB72](https://perma.cc/7XAG-XB72)
[^23]: Shayon Mukherjee. [pg-osc: Zero downtime schema changes in PostgreSQL](https://www.shayon.dev/post/2022/47/pg-osc-zero-downtime-schema-changes-in-postgresql/). *shayon.dev*, February 2022. Archived at [perma.cc/35WN-7WMY](https://perma.cc/35WN-7WMY)
[^24]: Carlos Pérez-Aradros Herce. [Introducing pgroll: zero-downtime, reversible, schema migrations for Postgres](https://xata.io/blog/pgroll-schema-migrations-postgres). *xata.io*, October 2023. Archived at [archive.org](https://web.archive.org/web/20231008161750/https%3A//xata.io/blog/pgroll-schema-migrations-postgres)
[^25]: James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Dale Woodford, Yasushi Saito, Christopher Taylor, Michal Szymaniak, and Ruth Wang. [Spanner: Google’s Globally-Distributed Database](https://research.google/pubs/pub39966/). At *10th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), October 2012.
[^26]: Donald K. Burleson. [Reduce I/O with Oracle Cluster Tables](http://www.dba-oracle.com/oracle_tip_hash_index_cluster_table.htm). *dba-oracle.com*. Archived at [perma.cc/7LBJ-9X2C](https://perma.cc/7LBJ-9X2C)
[^27]: Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. [Bigtable: A Distributed Storage System for Structured Data](https://research.google/pubs/pub27898/). At *7th USENIX Symposium on Operating Systems Design and Implementation* (OSDI), November 2006.
[^28]: Priscilla Walmsley. [*XQuery, 2nd Edition*](https://learning.oreilly.com/library/view/xquery-2nd-edition/9781491915080/). O’Reilly Media, December 2015. ISBN: 9781491915080
[^29]: Paul C. Bryan, Kris Zyp, and Mark Nottingham. [JavaScript Object Notation (JSON) Pointer](https://www.rfc-editor.org/rfc/rfc6901). RFC 6901, IETF, April 2013.
[^30]: Stefan Gössner, Glyn Normington, and Carsten Bormann. [JSONPath: Query Expressions for JSON](https://www.rfc-editor.org/rfc/rfc9535.html). RFC 9535, IETF, February 2024.
[^31]: Michael Stonebraker and Andrew Pavlo. [What Goes Around Comes Around… And Around…](https://db.cs.cmu.edu/papers/2024/whatgoesaround-sigmodrec2024.pdf). *ACM SIGMOD Record*, volume 53, issue 2, pages 21–37. [doi:10.1145/3685980.3685984](https://doi.org/10.1145/3685980.3685984)
[^32]: Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. [The PageRank Citation Ranking: Bringing Order to the Web](http://ilpubs.stanford.edu:8090/422/). Technical Report 1999-66, Stanford University InfoLab, November 1999. Archived at [perma.cc/UML9-UZHW](https://perma.cc/UML9-UZHW)
[^33]: Nathan Bronson, Zach Amsden, George Cabrera, Prasad Chakka, Peter Dimov, Hui Ding, Jack Ferris, Anthony Giardullo, Sachin Kulkarni, Harry Li, Mark Marchukov, Dmitri Petrov, Lovro Puzar, Yee Jiun Song, and Venkat Venkataramani. [TAO: Facebook’s Distributed Data Store for the Social Graph](https://www.usenix.org/conference/atc13/technical-sessions/presentation/bronson). At *USENIX Annual Technical Conference* (ATC), June 2013.
[^34]: Natasha Noy, Yuqing Gao, Anshu Jain, Anant Narayanan, Alan Patterson, and Jamie Taylor. [Industry-Scale Knowledge Graphs: Lessons and Challenges](https://cacm.acm.org/magazines/2019/8/238342-industry-scale-knowledge-graphs/fulltext). *Communications of the ACM*, volume 62, issue 8, pages 36–43, August 2019. [doi:10.1145/3331166](https://doi.org/10.1145/3331166)
[^35]: Xiyang Feng, Guodong Jin, Ziyi Chen, Chang Liu, and Semih Salihoğlu. [KÙZU Graph Database Management System](https://www.cidrdb.org/cidr2023/papers/p48-jin.pdf). At *13th Annual Conference on Innovative Data Systems Research* (CIDR 2023), January 2023.
[^36]: Maciej Besta, Emanuel Peter, Robert Gerstenberger, Marc Fischer, Michał Podstawski, Claude Barthels, Gustavo Alonso, and Torsten Hoefler. [Demystifying Graph Databases: Analysis and Taxonomy of Data Organization, System Designs, and Graph Queries](https://arxiv.org/pdf/1910.09017.pdf). *arxiv.org*, October 2019.
[^37]: [Apache TinkerPop 3.6.3 Documentation](https://tinkerpop.apache.org/docs/3.6.3/reference/). *tinkerpop.apache.org*, May 2023. Archived at [perma.cc/KM7W-7PAT](https://perma.cc/KM7W-7PAT)
[^38]: Nadime Francis, Alastair Green, Paolo Guagliardo, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Stefan Plantikow, Mats Rydberg, Petra Selmer, and Andrés Taylor. [Cypher: An Evolving Query Language for Property Graphs](https://core.ac.uk/download/pdf/158372754.pdf). At *International Conference on Management of Data* (SIGMOD), pages 1433–1445, May 2018. [doi:10.1145/3183713.3190657](https://doi.org/10.1145/3183713.3190657)
[^39]: Emil Eifrem. [Twitter correspondence](https://twitter.com/emileifrem/status/419107961512804352), January 2014. Archived at [perma.cc/WM4S-BW64](https://perma.cc/WM4S-BW64)
[^40]: Francesco Tisiot. [Explore the new SEARCH and CYCLE features in PostgreSQL® 14](https://aiven.io/blog/explore-the-new-search-and-cycle-features-in-postgresql-14). *aiven.io*, December 2021. Archived at [perma.cc/J6BT-83UZ](https://perma.cc/J6BT-83UZ)
[^41]: Gaurav Goel. [Understanding Hierarchies in Oracle](https://towardsdatascience.com/understanding-hierarchies-in-oracle-43f85561f3d9). *towardsdatascience.com*, May 2020. Archived at [perma.cc/5ZLR-Q7EW](https://perma.cc/5ZLR-Q7EW)
[^42]: Alin Deutsch, Nadime Francis, Alastair Green, Keith Hare, Bei Li, Leonid Libkin, Tobias Lindaaker, Victor Marsault, Wim Martens, Jan Michels, Filip Murlak, Stefan Plantikow, Petra Selmer, Oskar van Rest, Hannes Voigt, Domagoj Vrgoč, Mingxi Wu, and Fred Zemke. [Graph Pattern Matching in GQL and SQL/PGQ](https://arxiv.org/abs/2112.06217). At *International Conference on Management of Data* (SIGMOD), pages 2246–2258, June 2022. [doi:10.1145/3514221.3526057](https://doi.org/10.1145/3514221.3526057)
[^43]: Alastair Green. [SQL... and now GQL](https://opencypher.org/articles/2019/09/12/SQL-and-now-GQL/). *opencypher.org*, September 2019. Archived at [perma.cc/AFB2-3SY7](https://perma.cc/AFB2-3SY7)
[^44]: Alin Deutsch, Yu Xu, and Mingxi Wu. [Seamless Syntactic and Semantic Integration of Query Primitives over Relational and Graph Data in GSQL](https://cdn2.hubspot.net/hubfs/4114546/IntegrationQuery%20PrimitivesGSQL.pdf). *tigergraph.com*, November 2018. Archived at [perma.cc/JG7J-Y35X](https://perma.cc/JG7J-Y35X)
[^45]: Oskar van Rest, Sungpack Hong, Jinha Kim, Xuming Meng, and Hassan Chafi. [PGQL: a property graph query language](https://event.cwi.nl/grades/2016/07-VanRest.pdf). At *4th International Workshop on Graph Data Management Experiences and Systems* (GRADES), June 2016. [doi:10.1145/2960414.2960421](https://doi.org/10.1145/2960414.2960421)
[^46]: Amazon Web Services. [Neptune Graph Data Model](https://docs.aws.amazon.com/neptune/latest/userguide/feature-overview-data-model.html). Amazon Neptune User Guide, *docs.aws.amazon.com*. Archived at [perma.cc/CX3T-EZU9](https://perma.cc/CX3T-EZU9)
[^47]: Cognitect. [Datomic Data Model](https://docs.datomic.com/cloud/whatis/data-model.html). Datomic Cloud Documentation, *docs.datomic.com*. Archived at [perma.cc/LGM9-LEUT](https://perma.cc/LGM9-LEUT)
[^48]: David Beckett and Tim Berners-Lee. [Turtle - Terse RDF Triple Language](https://www.w3.org/TeamSubmission/turtle/). W3C Team Submission, March 2011.
[^49]: Sinclair Target. [Whatever Happened to the Semantic Web?](https://twobithistory.org/2018/05/27/semantic-web.html) *twobithistory.org*, May 2018. Archived at [perma.cc/M8GL-9KHS](https://perma.cc/M8GL-9KHS)
[^50]: Gavin Mendel-Gleason. [The Semantic Web is Dead - Long Live the Semantic Web!](https://terminusdb.com/blog/the-semantic-web-is-dead/) *terminusdb.com*, August 2022. Archived at [perma.cc/G2MZ-DSS3](https://perma.cc/G2MZ-DSS3)
[^51]: Manu Sporny. [JSON-LD and Why I Hate the Semantic Web](http://manu.sporny.org/2014/json-ld-origins-2/). *manu.sporny.org*, January 2014. Archived at [perma.cc/7PT4-PJKF](https://perma.cc/7PT4-PJKF)
[^52]: University of Michigan Library. [Biomedical Ontologies and Controlled Vocabularies](https://guides.lib.umich.edu/ontology), *guides.lib.umich.edu/ontology*. Archived at [perma.cc/Q5GA-F2N8](https://perma.cc/Q5GA-F2N8)
[^53]: Facebook. [The Open Graph protocol](https://ogp.me/), *ogp.me*. Archived at [perma.cc/C49A-GUSY](https://perma.cc/C49A-GUSY)
[^54]: Matt Haughey. [Everything you ever wanted to know about unfurling but were afraid to ask /or/ How to make your site previews look amazing in Slack](https://medium.com/slack-developer-blog/everything-you-ever-wanted-to-know-about-unfurling-but-were-afraid-to-ask-or-how-to-make-your-e64b4bb9254). *medium.com*, November 2015. Archived at [perma.cc/C7S8-4PZN](https://perma.cc/C7S8-4PZN)
[^55]: W3C RDF Working Group. [Resource Description Framework (RDF)](https://www.w3.org/RDF/). *w3.org*, February 2004.
[^56]: Steve Harris, Andy Seaborne, and Eric Prud’hommeaux. [SPARQL 1.1 Query Language](https://www.w3.org/TR/sparql11-query/). W3C Recommendation, March 2013.
[^57]: Todd J. Green, Shan Shan Huang, Boon Thau Loo, and Wenchao Zhou. [Datalog and Recursive Query Processing](http://blogs.evergreen.edu/sosw/files/2014/04/Green-Vol5-DBS-017.pdf). *Foundations and Trends in Databases*, volume 5, issue 2, pages 105–195, November 2013. [doi:10.1561/1900000017](https://doi.org/10.1561/1900000017)
[^58]: Stefano Ceri, Georg Gottlob, and Letizia Tanca. [What You Always Wanted to Know About Datalog (And Never Dared to Ask)](https://www.researchgate.net/profile/Letizia_Tanca/publication/3296132_What_you_always_wanted_to_know_about_Datalog_and_never_dared_to_ask/links/0fcfd50ca2d20473ca000000.pdf). *IEEE Transactions on Knowledge and Data Engineering*, volume 1, issue 1, pages 146–166, March 1989. [doi:10.1109/69.43410](https://doi.org/10.1109/69.43410)
[^59]: Serge Abiteboul, Richard Hull, and Victor Vianu. [*Foundations of Databases*](http://webdam.inria.fr/Alice/). Addison-Wesley, 1995. ISBN: 9780201537710, available online at [*webdam.inria.fr/Alice*](http://webdam.inria.fr/Alice/)
[^60]: Scott Meyer, Andrew Carter, and Andrew Rodriguez. [LIquid: The soul of a new graph database, Part 2](https://engineering.linkedin.com/blog/2020/liquid--the-soul-of-a-new-graph-database--part-2). *engineering.linkedin.com*, September 2020. Archived at [perma.cc/K9M4-PD6Q](https://perma.cc/K9M4-PD6Q)
[^61]: Matt Bessey. [Why, after 6 years, I’m over GraphQL](https://bessey.dev/blog/2024/05/24/why-im-over-graphql/). *bessey.dev*, May 2024. Archived at [perma.cc/2PAU-JYRA](https://perma.cc/2PAU-JYRA)
[^62]: Dominic Betts, Julián Domínguez, Grigori Melnik, Fernando Simonazzi, and Mani Subramanian. [*Exploring CQRS and Event Sourcing*](https://learn.microsoft.com/en-us/previous-versions/msp-n-p/jj554200%28v%3Dpandp.10%29). Microsoft Patterns & Practices, July 2012. ISBN: 1621140164, archived at [perma.cc/7A39-3NM8](https://perma.cc/7A39-3NM8)
[^63]: Greg Young. [CQRS and Event Sourcing](https://www.youtube.com/watch?v=JHGkaShoyNs). At *Code on the Beach*, August 2014.
[^64]: Greg Young. [CQRS Documents](https://cqrs.files.wordpress.com/2010/11/cqrs_documents.pdf). *cqrs.wordpress.com*, November 2010. Archived at [perma.cc/X5R6-R47F](https://perma.cc/X5R6-R47F)
[^65]: Devin Petersohn, Stephen Macke, Doris Xin, William Ma, Doris Lee, Xiangxi Mo, Joseph E. Gonzalez, Joseph M. Hellerstein, Anthony D. Joseph, and Aditya Parameswaran. [Towards Scalable Dataframe Systems](https://www.vldb.org/pvldb/vol13/p2033-petersohn.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 11, pages 2033–2046. [doi:10.14778/3407790.3407807](https://doi.org/10.14778/3407790.3407807)
[^66]: Stavros Papadopoulos, Kushal Datta, Samuel Madden, and Timothy Mattson. [The TileDB Array Data Storage Manager](https://www.vldb.org/pvldb/vol10/p349-papadopoulos.pdf). *Proceedings of the VLDB Endowment*, volume 10, issue 4, pages 349–360, November 2016. [doi:10.14778/3025111.3025117](https://doi.org/10.14778/3025111.3025117)
[^67]: Florin Rusu. [Multidimensional Array Data Management](https://faculty.ucmerced.edu/frusu/Papers/Report/2022-09-fntdb-arrays.pdf). *Foundations and Trends in Databases*, volume 12, numbers 2–3, pages 69–220, February 2023. [doi:10.1561/1900000069](https://doi.org/10.1561/1900000069)
[^68]: Ed Targett. [Bloomberg, Man Group team up to develop open source “ArcticDB” database](https://www.thestack.technology/bloomberg-man-group-arcticdb-database-dataframe/). *thestack.technology*, March 2023. Archived at [perma.cc/M5YD-QQYV](https://perma.cc/M5YD-QQYV)
[^69]: Dennis A. Benson, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and David L. Wheeler. [GenBank](https://academic.oup.com/nar/article/36/suppl_1/D25/2507746). *Nucleic Acids Research*, volume 36, database issue, pages D25–D30, December 2007. [doi:10.1093/nar/gkm929](https://doi.org/10.1093/nar/gkm929)


@ -119,17 +119,17 @@ restored with minimal additional code. However, they also have a number of deep
integrating your systems with those of other organizations (which may use different languages).
* In order to restore data in the same object types, the decoding process needs to be able to
instantiate arbitrary classes. This is frequently a source of security problems
[[1](/en/ch5#CWE502)]:
[^1]:
if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate
arbitrary classes, which in turn often allows them to do terrible things such as remotely
executing arbitrary code [[2](/en/ch5#Breen2015),
[3](/en/ch5#McKenzie2013)].
* Versioning data is often an afterthought in these libraries: as they are intended for quick and
easy encoding of data, they often neglect the inconvenient problems of forward and backward
compatibility [[4](/en/ch5#Goetz2019)].
compatibility [^4].
* Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also
often an afterthought. For example, Java’s built-in serialization is notorious for its bad
performance and bloated encoding [[5](/en/ch5#JvmSerializers)].
performance and bloated encoding [^5].
For these reasons it’s generally a bad idea to use your language’s built-in encoding for anything
other than very transient purposes.
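To make the security concern above concrete, here is a minimal sketch using Python's `pickle` module, which is one such language-specific encoding. The `Exploit` class is hypothetical and the payload is a harmless arithmetic expression; the point is only that the encoded bytes, not the decoding application, decide which callable runs.

```python
import pickle


class Exploit:
    """Hypothetical attacker-controlled class, for illustration only."""

    def __reduce__(self):
        # pickle records this callable and its arguments in the byte stream
        # and invokes it while decoding. The expression here is harmless,
        # but an attacker could name any importable callable instead.
        return (eval, ("6 * 7",))


payload = pickle.dumps(Exploit())

# Decoding runs eval("6 * 7") and returns its result, not an Exploit object.
result = pickle.loads(payload)
print(result)
```

This is why decoding such formats from an untrusted source is unsafe, regardless of how the decoded value is used afterwards.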
@ -139,7 +139,7 @@ other than very transient purposes.
When moving to standardized encodings that can be written and read by many programming languages, JSON
and XML are the obvious contenders. They are widely known, widely supported, and almost as widely
disliked. XML is often criticized for being too verbose and unnecessarily complicated
[[6](/en/ch5#XMLSExp)].
[^6].
JSON’s popularity is mainly due to its built-in support in web browsers and simplicity relative to
XML. CSV is another popular language-independent format, but it only supports tabular data without
nesting.
@ -156,11 +156,11 @@ problems:
This is a problem when dealing with large numbers; for example, integers greater than 2^53 cannot
be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become
inaccurate when parsed in a language that uses floating-point numbers, such as JavaScript
[[7](/en/ch5#Evans2023)].
[^7].
An example of numbers larger than 2^53 occurs on X (formerly Twitter), which uses a 64-bit number to
identify each post. The JSON returned by the API includes post IDs twice, once as a JSON number and
once as a decimal string, to work around the fact that the numbers are not correctly parsed by
JavaScript applications [[8](/en/ch5#Harris2010)].
JavaScript applications [^8].
* JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they
don’t support binary strings (sequences of bytes without a character encoding). Binary strings are a
useful feature, so people get around this limitation by encoding the binary data as text using
@ -174,7 +174,7 @@ problems:
column. If an application change adds a new row or column, you have to handle that change manually.
CSV is also a quite vague format (what happens if a value contains a comma or a newline character?).
Although its escaping rules have been formally specified
[[9](/en/ch5#Shafranovich2005)],
[^9],
not all parsers implement them correctly.
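Returning to the large-number problem in the list above, the following sketch shows the precision loss and the decimal-string workaround. The field names `id` and `id_str` are assumptions chosen for illustration, and `parse_int=float` is used only to imitate a parser that maps every JSON number to a double.

```python
import json

# The smallest positive integer that a double cannot represent exactly.
post_id = 2**53 + 1

# Converting to a double-precision float silently rounds the value.
as_double = float(post_id)
print(int(as_double) == post_id)

# Include the ID twice: once as a JSON number, once as a decimal string.
doc = json.dumps({"id": post_id, "id_str": str(post_id)})

# Python's json module parses integers with arbitrary precision, so the
# numeric field survives a round trip here.
parsed = json.loads(doc)

# Imitate a floating-point-only parser: the numeric field is now inexact,
# but the string field still carries the exact digits.
lossy = json.loads(doc, parse_int=float)
recovered = int(lossy["id_str"])
```

A client whose language has only floating-point numbers would read the string field and treat the ID as opaque text.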
Despite these flaws, JSON, XML, and CSV are good enough for many purposes. It’s likely that they will
@ -228,9 +228,9 @@ In addition to open and closed content models and validators, JSON Schema suppor
if/else schema logic, named types, references to remote schemas, and much more. All of this makes
for a very powerful schema language. Such features also make for unwieldy definitions. It can be
challenging to resolve remote schemas, reason about conditional rules, or evolve schemas in a
forwards or backwards compatible way [[10](/en/ch5#Coates2024)].
forwards or backwards compatible way [^10].
Similar concerns apply to XML Schema
[[11](/en/ch5#Geneves2008)].
[^11].
### Binary encoding
@ -239,7 +239,7 @@ observation led to the development of a profusion of binary encodings for JSON (
BSON, BJSON, UBJSON, BISON, Hessian, and Smile, to name a few) and for XML (WBXML and Fast Infoset,
for example). These formats have been adopted in various niches, as they are more compact and
sometimes faster to parse, but none of them are as widely adopted as the textual versions of JSON
and XML [[12](/en/ch5#Bray2019)].
and XML [^12].
Some of these formats extend the set of datatypes (e.g., distinguishing integers and floating-point numbers,
or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. In
@ -287,7 +287,7 @@ In the following sections we will see how we can do much better, and encode the
Protocol Buffers (protobuf) is a binary encoding library developed at Google.
It is similar to Apache Thrift, which was originally developed by Facebook
[[13](/en/ch5#Slee2007)];
[^13];
most of what this section says about Protocol Buffers applies also to Thrift.
Protocol Buffers requires a schema for any data that is encoded. To encode the data
@ -311,7 +311,7 @@ language is very simple compared to JSON Schema: it only defines the fields of r
types, but it does not support other restrictions on the possible values of fields.
Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in
[Figure 5-3](/en/ch5#fig_encoding_protobuf) [[14](/en/ch5#Kleppmann2012evolution)].
[Figure 5-3](/en/ch5#fig_encoding_protobuf) [^14].
![ddia 0503](/fig/ddia_0503.png)
@ -382,7 +382,7 @@ value wont fit in 32 bits, it will be truncated.
Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers.
It was started in 2009 as a subproject of Hadoop, as a result of Protocol Buffers not being a good
fit for Hadoop’s use cases
[[15](/en/ch5#Cutting2009)].
[^15].
Avro also uses a schema to specify the structure of the data being encoded. It has two schema
languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily
@ -493,7 +493,7 @@ case in Avro: if you want to allow a field to be null, you have to use a *union
`union { null, long, string } field;` indicates that `field` can be a number, or a string, or null.
You can only use `null` as a default value if it is the first branch of the union. This is a little
more verbose than having everything nullable by default, but it helps prevent bugs by being explicit
about what can and cannot be null [[18](/en/ch5#Hoare2009)].
about what can and cannot be null [^18].
Changing the datatype of a field is possible, provided that Avro can convert the type. Changing the
name of a field is possible but a little tricky: the reader’s schema can contain aliases for field
@ -525,9 +525,9 @@ Database with individually written records
schema, it can decode the rest of the record.
Confluent’s schema registry for Apache Kafka
[[19](/en/ch5#ConfluentSchemaReg)]
[^19]
and LinkedIn’s Espresso
[[20](/en/ch5#Auradkar2015)]
[^20]
work this way, for example.
Sending records over a network connection
@ -537,7 +537,7 @@ Sending records over a network connection
A database of schema versions is a useful thing to have in any case, since it acts as documentation
and gives you a chance to check schema compatibility
[[21](/en/ch5#Kreps2015)].
[^21].
As the version number, you could use a simple incrementing integer, or you could use a hash of the
schema.
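As a rough illustration of the hash option, the sketch below derives a version identifier from a schema's contents. The helper `schema_version` and the example schemas are hypothetical. Avro's specification defines its own canonical form and fingerprint algorithms, which this simplified version does not implement.

```python
import hashlib
import json


def schema_version(schema: dict) -> str:
    """Derive a version identifier from the schema contents (hypothetical helper)."""
    # Serialize with sorted keys and no extra whitespace so that the same
    # schema always yields the same bytes, regardless of dict key order.
    canonical = json.dumps(schema, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]


v1 = {"type": "record", "name": "Person",
      "fields": [{"name": "userName", "type": "string"}]}

# Same schema with keys written in a different order.
v1_reordered = {"name": "Person", "fields": [{"type": "string", "name": "userName"}],
                "type": "record"}

# A new version that adds a nullable field.
v2 = {"type": "record", "name": "Person",
      "fields": [{"name": "userName", "type": "string"},
                 {"name": "favoriteNumber", "type": ["null", "long"], "default": None}]}

same = schema_version(v1) == schema_version(v1_reordered)
changed = schema_version(v1) != schema_version(v2)
```

Compared with an incrementing integer, a hash needs no central counter, but it gives no indication of which version is newer.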
@ -552,7 +552,7 @@ you have a relational database whose contents you want to dump to a file, and yo
binary format to avoid the aforementioned problems with textual formats (JSON, CSV, XML). If you use
Avro, you can fairly easily generate an Avro schema (in the JSON representation we saw earlier) from the
relational schema and encode the database contents using that schema, dumping it all to an Avro
object container file [[22](/en/ch5#Shapira2014)].
object container file [^22].
You can generate a record schema for each database table, and each column becomes a field in that
record. The column name in the database maps to the field name in Avro.
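The mapping described above can be sketched as follows. The function name, the column description format, and the `SQL_TO_AVRO` type table are assumptions made for illustration; a real exporter would read the database catalog and cover many more types.

```python
import json

# Assumed mapping from SQL column types to Avro primitive types.
SQL_TO_AVRO = {
    "bigint": "long",
    "integer": "int",
    "text": "string",
    "boolean": "boolean",
    "double precision": "double",
}


def table_to_avro_schema(table: str, columns: list[tuple[str, str, bool]]) -> dict:
    """Build an Avro record schema (JSON form) from (name, sql_type, nullable) columns."""
    fields = []
    for name, sql_type, nullable in columns:
        avro_type = SQL_TO_AVRO[sql_type]
        if nullable:
            # A nullable column becomes a union with null as the first
            # branch, so that null can be used as the default value.
            fields.append({"name": name, "type": ["null", avro_type], "default": None})
        else:
            fields.append({"name": name, "type": avro_type})
    return {"type": "record", "name": table, "fields": fields}


schema = table_to_avro_schema(
    "users",
    [("id", "bigint", False), ("favorite_number", "bigint", True)],
)
print(json.dumps(schema, indent=2))
```

If a column is later added or dropped, rerunning the generator produces an updated schema without any manual field numbering.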
@ -585,9 +585,9 @@ common with ASN.1, a schema definition language that was first standardized in 1
[24](/en/ch5#Kaliski1993)].
It was used to define various network protocols, and its binary encoding (DER) is still used to encode
SSL certificates (X.509), for example
[[25](/en/ch5#HoffmanAndrews2020)].
[^25].
ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers
[[26](/en/ch5#Walkin2010)].
[^26].
However, it’s also very complex and badly documented, so ASN.1
is probably not a good choice for new applications.
@ -680,9 +680,9 @@ versions of the schema.
More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or
moving some data into a separate table—still require data to be rewritten, often at the application
level [[27](/en/ch5#Xu2017)].
level [^27].
Maintaining forward and backward compatibility across such migrations is still a research problem
[[28](/en/ch5#Litt2020)].
[^28].
### Archival storage
@ -723,7 +723,7 @@ In some ways, services are similar to databases: they typically allow clients to
data. However, while databases allow arbitrary queries using the query languages we discussed in
[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
that are predetermined by the business logic (application code) of the service
[[29](/en/ch5#Helland2005_ch5)]. This restriction provides a degree of encapsulation: services can impose
[^29]. This restriction provides a degree of encapsulation: services can impose
fine-grained restrictions on what clients can and cannot do.
A key design goal of a service-oriented/microservices architecture is to make the application easier
@ -764,7 +764,7 @@ need to somehow find out these details. Service developers often use an interfac
language (IDL) to define and document their service’s API endpoints and data models, and to evolve
them over time. Other developers can then use the service definition to determine how to query the
service. The two most popular service IDLs are OpenAPI (also known as Swagger
[[32](/en/ch5#Swagger2014)])
[^32])
and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send
and receive Protocol Buffers.
@ -838,7 +838,7 @@ requests over a network, many of which received a lot of hype but have serious p
JavaBeans (EJB) and Java’s Remote Method Invocation (RMI) are limited to Java. The Distributed
Component Object Model (DCOM) is limited to Microsoft platforms. The Common Object Request Broker
Architecture (CORBA) is excessively complex, and does not provide backward or forward
compatibility [[33](/en/ch5#Henning2006)].
compatibility [^33].
SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are
also plagued by complexity and compatibility problems
[[34](/en/ch5#Lacey2006),
@ -846,7 +846,7 @@ also plagued by complexity and compatibility problems
[36](/en/ch5#Bray2004)].
All of these are based on the idea of a *remote procedure call* (RPC), which has been around since
the 1970s [[37](/en/ch5#Birrell1984)].
the 1970s [^37].
The RPC model tries to make a request to a remote network service look the same as calling a function or
method in your programming language, within the same process (this abstraction is called *location
transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed
@ -868,7 +868,7 @@ A network request is very different from a local function call:
through, and only the response was lost.
In that case, retrying will cause the action to
be performed multiple times, unless you build a mechanism for deduplication (*idempotence*) into
the protocol [[40](/en/ch5#Leach2017idemptence)].
the protocol [^40].
Local function calls don’t have this problem. (We discuss idempotence in more detail
in [Link to Come].)
* Every time you call a local function, it normally takes about the same time to execute. A network
@ -902,7 +902,7 @@ overloaded, the client has to be manually reconfigured.
To provide higher availability and scalability, there are usually multiple instances of a service
running on different machines, any of which can handle an incoming request. Spreading requests
across these instances is called *load balancing*
[[41](/en/ch5#Rose2023)].
[^41].
There are many load balancing and service discovery solutions available:
* *Hardware load balancers* are specialized pieces of equipment that are installed in data centers.
@ -974,12 +974,12 @@ indefinitely. If a compatibility-breaking change is required, the service provid
maintaining multiple versions of the service API side by side.
There is no agreement on how API versioning should work (i.e., how a client can indicate which
version of the API it wants to use [[42](/en/ch5#Hunt2014wn)]).
version of the API it wants to use [^42]).
For RESTful APIs, common approaches are to use a version
number in the URL or in the HTTP `Accept` header. For services that use API keys to identify a
particular client, another option is to store a client’s requested API version on the server and to
allow this version selection to be updated through a separate administrative interface
[[43](/en/ch5#Leach2017versioning)].
[^43].
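The three ways of selecting a version mentioned above could be combined along these lines. This is only a sketch: the function name, the `/v2/` URL pattern, the `version=` media type parameter, and the order of precedence are all assumptions, since there is no agreed convention.

```python
import re
from typing import Optional


def requested_version(path: str, accept: str, stored: Optional[str] = None,
                      default: str = "1") -> str:
    """Pick the API version for a request (hypothetical precedence rules)."""
    # 1. A version number in the URL, e.g. /v2/customers
    url_match = re.match(r"/v(\d+)/", path)
    if url_match:
        return url_match.group(1)
    # 2. A version parameter in the Accept header,
    #    e.g. application/vnd.example+json; version=3
    header_match = re.search(r";\s*version=(\d+)", accept)
    if header_match:
        return header_match.group(1)
    # 3. The version stored on the server for this client's API key,
    #    falling back to a service-wide default.
    return stored or default


from_url = requested_version("/v2/customers", "application/json")
from_header = requested_version("/customers", "application/vnd.example+json; version=3")
from_store = requested_version("/customers", "application/json", stored="4")
fallback = requested_version("/customers", "application/json")
```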
## Durable Execution and Workflows
@ -995,7 +995,7 @@ the credit card, and call the banking service to deposit debited funds, as shown
Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a
general-purpose programming language, a domain specific language (DSL), or a markup language such as
Business Process Execution Language (BPEL)
[[44](/en/ch5#BPEL2007)].
[^44].
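As a small illustration of a workflow defined as a graph of tasks in a general-purpose language, the sketch below uses Python's standard `graphlib` module to derive an execution order. The task names are hypothetical and loosely follow the payment example; a real workflow engine would also handle retries, state persistence, and parallel branches.

```python
from graphlib import TopologicalSorter

# Each task maps to the set of tasks that must finish before it can start.
workflow = {
    "check_fraud": set(),
    "debit_card": {"check_fraud"},
    "deposit_funds": {"debit_card"},
}

# A topological order runs every task after all of its dependencies.
order = list(TopologicalSorter(workflow).static_order())
print(order)
```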
# Tasks, Activities, and Functions
@ -1068,19 +1068,19 @@ class PaymentWorkflow:
Frameworks like Temporal are not without their challenges. External services, such as the
third-party payment gateway in our example, must still provide an idempotent API. Developers must
remember to use unique IDs for these APIs to prevent duplicate execution
[[47](/en/ch5#Tenzer2024)].
[^47].
And because durable execution frameworks log each RPC call in order, they expect a subsequent
execution to make the same RPC calls in the same order. This makes code changes brittle: you
might introduce undefined behavior simply by re-ordering function calls
[[48](/en/ch5#TemporalWorkflow)].
[^48].
Instead of modifying the code of an existing workflow, it is safer to deploy a new version of the
code separately, so that re-executions of existing workflow invocations continue to use the old
version, and only new invocations use the new code
[[49](/en/ch5#Kleeman2024)].
[^49].
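The requirement above that external services be idempotent, and that callers supply unique IDs, can be sketched as follows. `PaymentGateway` is a hypothetical in-memory stand-in for a third-party service, not any real gateway's API.

```python
import uuid


class PaymentGateway:
    """Hypothetical stand-in for an external service with an idempotent API."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}
        self.charges_made = 0

    def charge(self, idempotency_key: str, amount_cents: int) -> str:
        # A repeated key returns the stored result instead of charging again.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        receipt = f"receipt-{self.charges_made}"
        self._results[idempotency_key] = receipt
        return receipt


gateway = PaymentGateway()

# The key is generated once per logical payment and reused on every retry.
key = str(uuid.uuid4())
first = gateway.charge(key, 1999)
retry = gateway.charge(key, 1999)  # e.g. re-executed after a crash
```

If the workflow generated a fresh key on each re-execution, the retry would look like a new payment and the card would be charged twice.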
Similarly, because durable execution frameworks expect to replay all code deterministically (the
same inputs produce the same outputs), nondeterministic code such as random number generators or
system clocks is problematic [[48](/en/ch5#TemporalWorkflow)].
system clocks is problematic [^48].
Frameworks often provide their own, deterministic implementations of such library functions, but
you have to remember to use them. In some cases, such as with Temporal’s workflowcheck tool,
frameworks provide static analysis tools to determine if nondeterministic behavior has been
@ -1099,7 +1099,7 @@ unlike RPC, the sender usually does not wait for the recipient to process the ev
events are typically not sent to the recipient via a direct network connection, but go via an
intermediary called a *message broker* (also called an *event broker*, *message queue*, or
*message-oriented middleware*), which stores the message temporarily
[[50](/en/ch5#Perera2023)].
[^50].
Using a message broker has several advantages compared to direct RPC:
@ -1162,7 +1162,7 @@ scenarios, messages will be lost. Since each actor processes only one message at
need to worry about threads, and each actor can be scheduled independently by the framework.
In *distributed actor frameworks* such as Akka, Orleans
[[51](/en/ch5#Bernstein2014)],
[^51],
and Erlang/OTP, this programming model is used to scale an application across
multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient
are on the same node or different nodes. If they are on different nodes, the message is
@ -1225,257 +1225,58 @@ quite achievable. May your applications evolution be rapid and your deploymen
##### Footnotes
##### References
[[1](/en/ch5#CWE502-marker)] [CWE-502:
Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*,
July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y)
[[2](/en/ch5#Breen2015-marker)] Steve Breen.
[What
Do WebLogic, WebSphere, JBoss, Jenkins, OpenNMS, and Your Application Have in Common? This
Vulnerability](https://foxglovesecurity.com/2015/11/06/what-do-weblogic-websphere-jboss-jenkins-opennms-and-your-application-have-in-common-this-vulnerability/). *foxglovesecurity.com*, November 2015.
Archived at [perma.cc/9U97-UVVD](https://perma.cc/9U97-UVVD)
[[3](/en/ch5#McKenzie2013-marker)] Patrick McKenzie.
[What
the Rails Security Issue Means for Your Startup](https://www.kalzumeus.com/2013/01/31/what-the-rails-security-issue-means-for-your-startup/). *kalzumeus.com*, January 2013.
Archived at [perma.cc/2MBJ-7PZ6](https://perma.cc/2MBJ-7PZ6)
[[4](/en/ch5#Goetz2019-marker)] Brian Goetz.
[Towards
Better Serialization](https://openjdk.org/projects/amber/design-notes/towards-better-serialization). *openjdk.org*, June 2019.
Archived at [perma.cc/UK6U-GQDE](https://perma.cc/UK6U-GQDE)
[[5](/en/ch5#JvmSerializers-marker)] Eishay Smith.
[jvm-serializers wiki](https://github.com/eishay/jvm-serializers/wiki).
*github.com*, October 2023.
Archived at [perma.cc/PJP7-WCNG](https://perma.cc/PJP7-WCNG)
[[6](/en/ch5#XMLSExp-marker)] [XML
Is a Poor Copy of S-Expressions](https://wiki.c2.com/?XmlIsaPoorCopyOfEssExpressions). *wiki.c2.com*, May 2013.
Archived at [perma.cc/7FAN-YBKL](https://perma.cc/7FAN-YBKL)
[[7](/en/ch5#Evans2023-marker)] Julia Evans.
[Examples of floating
point problems](https://jvns.ca/blog/2023/01/13/examples-of-floating-point-problems/). *jvns.ca*, January 2023.
Archived at [perma.cc/M57L-QKKW](https://perma.cc/M57L-QKKW)
[[8](/en/ch5#Harris2010-marker)] Matt Harris.
[Snowflake:
An Update and Some Very Important Information](https://groups.google.com/g/twitter-development-talk/c/ahbvo3VTIYI). Email to *Twitter Development
Talk* mailing list, October 2010.
Archived at [perma.cc/8UBV-MZ3D](https://perma.cc/8UBV-MZ3D)
[[9](/en/ch5#Shafranovich2005-marker)] Yakov Shafranovich.
[RFC 4180: Common Format and MIME Type for
Comma-Separated Values (CSV) Files](https://tools.ietf.org/html/rfc4180). IETF, October 2005.
[[10](/en/ch5#Coates2024-marker)] Andy Coates.
[Evolving JSON Schemas - Part I](https://www.creekservice.org/articles/2024/01/08/json-schema-evolution-part-1.html) and
[Part II](https://www.creekservice.org/articles/2024/01/09/json-schema-evolution-part-2.html).
*creekservice.org*, January 2024. Archived at
[perma.cc/MZW3-UA54](https://perma.cc/MZW3-UA54) and
[perma.cc/GT5H-WKZ5](https://perma.cc/GT5H-WKZ5)
[[11](/en/ch5#Geneves2008-marker)] Pierre Genevès, Nabil Layaïda, and Vincent Quint.
[Ensuring Query Compatibility with Evolving XML Schemas](https://arxiv.org/abs/0811.4324).
INRIA Technical Report 6711, November 2008.
[[12](/en/ch5#Bray2019-marker)] Tim Bray.
[Bits On the Wire](https://www.tbray.org/ongoing/When/201x/2019/11/17/Bits-On-the-Wire).
*tbray.org*, November 2019.
Archived at [perma.cc/3BT3-BQU3](https://perma.cc/3BT3-BQU3)
[[13](/en/ch5#Slee2007-marker)] Mark Slee, Aditya Agarwal, and Marc Kwiatkowski.
[Thrift: Scalable
Cross-Language Services Implementation](https://thrift.apache.org/static/files/thrift-20070401.pdf). Facebook technical report, April 2007.
Archived at [perma.cc/22BS-TUFB](https://perma.cc/22BS-TUFB)
[[14](/en/ch5#Kleppmann2012evolution-marker)] Martin Kleppmann.
[Schema
Evolution in Avro, Protocol Buffers and Thrift](https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html). *martin.kleppmann.com*, December 2012.
Archived at [perma.cc/E4R2-9RJT](https://perma.cc/E4R2-9RJT)
[[15](/en/ch5#Cutting2009-marker)] Doug Cutting, Chad Walters, Jim Kellerman, et al.
[[PROPOSAL]
New Subproject: Avro](https://lists.apache.org/thread/z571w0r5jmfsjvnl0fq4fgg0vh28d3bk). Email thread on *hadoop-general* mailing list,
*lists.apache.org*, April 2009.
Archived at [perma.cc/4A79-BMEB](https://perma.cc/4A79-BMEB)
[[16](/en/ch5#AvroSpec-marker)] Apache Software Foundation.
[Apache Avro 1.12.0 Specification](https://avro.apache.org/docs/1.12.0/specification/).
*avro.apache.org*, August 2024.
Archived at [perma.cc/C36P-5EBQ](https://perma.cc/C36P-5EBQ)
[[17](/en/ch5#AvroParsing-marker)] Apache Software Foundation.
[Avro
schemas as LL(1) CFG definitions](https://avro.apache.org/docs/1.12.0/api/java/org/apache/avro/io/parsing/doc-files/parsing.html). *avro.apache.org*, August 2024.
Archived at [perma.cc/JB44-EM9Q](https://perma.cc/JB44-EM9Q)
[[18](/en/ch5#Hoare2009-marker)] Tony Hoare.
[Null
References: The Billion Dollar Mistake](https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/). Talk at *QCon London*, March 2009.
[[19](/en/ch5#ConfluentSchemaReg-marker)] Confluent, Inc.
[Schema Registry
Overview](https://docs.confluent.io/platform/current/schema-registry/index.html). *docs.confluent.io*, 2024.
Archived at [perma.cc/92C3-A9JA](https://perma.cc/92C3-A9JA)
[[20](/en/ch5#Auradkar2015-marker)] Aditya Auradkar and Tom Quiggle.
[Introducing
Espresso—LinkedIn’s Hot New Distributed Document Store](https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store). *engineering.linkedin.com*, January 2015.
Archived at [perma.cc/FX4P-VW9T](https://perma.cc/FX4P-VW9T)
[[21](/en/ch5#Kreps2015-marker)] Jay Kreps.
[Putting Apache Kafka to
Use: A Practical Guide to Building a Stream Data Platform (Part 2)](https://www.confluent.io/blog/event-streaming-platform-2/). *confluent.io*,
February 2015. Archived at [perma.cc/8UA4-ZS5S](https://perma.cc/8UA4-ZS5S)
[[22](/en/ch5#Shapira2014-marker)] Gwen Shapira.
[The Problem of Managing
Schemas](https://www.oreilly.com/content/the-problem-of-managing-schemas/). *oreilly.com*, November 2014.
Archived at [perma.cc/BY8Q-RYV3](https://perma.cc/BY8Q-RYV3)
[[23](/en/ch5#Larmouth1999-marker)] John Larmouth.
[*ASN.1
Complete*](https://www.oss.com/asn1/resources/books-whitepapers-pubs/larmouth-asn1-book.pdf). Morgan Kaufmann, 1999. ISBN: 978-0-122-33435-1.
Archived at [perma.cc/GB7Y-XSXQ](https://perma.cc/GB7Y-XSXQ)
[[24](/en/ch5#Kaliski1993-marker)] Burton S. Kaliski Jr.
[A Layman’s Guide to a Subset of ASN.1,
BER, and DER](https://luca.ntop.org/Teaching/Appunti/asn1.html). Technical Note, RSA Data Security, Inc., November 1993.
Archived at [perma.cc/2LMN-W9U8](https://perma.cc/2LMN-W9U8)
[[25](/en/ch5#HoffmanAndrews2020-marker)] Jacob Hoffman-Andrews.
[A Warm Welcome to ASN.1 and DER](https://letsencrypt.org/docs/a-warm-welcome-to-asn1-and-der/).
*letsencrypt.org*, April 2020.
Archived at [perma.cc/CYT2-GPQ8](https://perma.cc/CYT2-GPQ8)
[[26](/en/ch5#Walkin2010-marker)] Lev Walkin.
[Question:
Extensibility and Dropping Fields](https://lionet.info/asn1c/blog/2010/09/21/question-extensibility-removing-fields/). *lionet.info*, September 2010.
Archived at [perma.cc/VX8E-NLH3](https://perma.cc/VX8E-NLH3)
[[27](/en/ch5#Xu2017-marker)] Jacqueline Xu.
[Online migrations at scale](https://stripe.com/blog/online-migrations).
*stripe.com*, February 2017.
Archived at [perma.cc/X59W-DK7Y](https://perma.cc/X59W-DK7Y)
[[28](/en/ch5#Litt2020-marker)] Geoffrey Litt, Peter van Hardenberg, and Orion Henry.
[Project Cambria: Translate your data with lenses](https://www.inkandswitch.com/cambria/).
Technical Report, *Ink & Switch*, October 2020.
Archived at [perma.cc/WA4V-VKDB](https://perma.cc/WA4V-VKDB)
[[29](/en/ch5#Helland2005_ch5-marker)] Pat Helland.
[Data on the Outside Versus Data on the
Inside](https://www.cidrdb.org/cidr2005/papers/P12.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR),
January 2005.
[[30](/en/ch5#Fielding2000-marker)] Roy Thomas Fielding.
[Architectural
Styles and the Design of Network-Based Software Architectures](https://ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf). PhD Thesis, University of
California, Irvine, 2000. Archived at [perma.cc/LWY9-7BPE](https://perma.cc/LWY9-7BPE)
[[31](/en/ch5#Fielding2008-marker)] Roy Thomas Fielding.
[REST APIs must
be hypertext-driven](https://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven). *roy.gbiv.com*, October 2008.
Archived at [perma.cc/M2ZW-8ATG](https://perma.cc/M2ZW-8ATG)
[[32](/en/ch5#Swagger2014-marker)] [OpenAPI
Specification Version 3.1.0](https://swagger.io/specification/). *swagger.io*, February 2021.
Archived at [perma.cc/3S6S-K5M4](https://perma.cc/3S6S-K5M4)
[[33](/en/ch5#Henning2006-marker)] Michi Henning.
[The Rise and Fall of CORBA](https://cacm.acm.org/practice/the-rise-and-fall-of-corba/).
*Communications of the ACM*, volume 51, issue 8, pages 52–57, August 2008.
[doi:10.1145/1378704.1378718](https://doi.org/10.1145/1378704.1378718)
[[34](/en/ch5#Lacey2006-marker)] Pete Lacey.
[The S Stands for Simple](https://harmful.cat-v.org/software/xml/soap/simple).
*harmful.cat-v.org*, November 2006.
Archived at [perma.cc/4PMK-Z9X7](https://perma.cc/4PMK-Z9X7)
[[35](/en/ch5#Tilkov2006-marker)] Stefan Tilkov.
[Interview: Pete Lacey Criticizes
Web Services](https://www.infoq.com/articles/pete-lacey-ws-criticism/). *infoq.com*, December 2006.
Archived at [perma.cc/JWF4-XY3P](https://perma.cc/JWF4-XY3P)
[[36](/en/ch5#Bray2004-marker)] Tim Bray.
[The Loyal WS-Opposition](https://www.tbray.org/ongoing/When/200x/2004/09/18/WS-Oppo).
*tbray.org*, September 2004.
Archived at [perma.cc/J5Q8-69Q2](https://perma.cc/J5Q8-69Q2)
[[37](/en/ch5#Birrell1984-marker)] Andrew D. Birrell and Bruce Jay Nelson.
[Implementing
Remote Procedure Calls](https://www.cs.princeton.edu/courses/archive/fall03/cs518/papers/rpc.pdf). *ACM Transactions on Computer Systems* (TOCS),
volume 2, issue 1, pages 39–59, February 1984.
[doi:10.1145/2080.357392](https://doi.org/10.1145/2080.357392)
[[38](/en/ch5#Waldo1994-marker)] Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall.
[A Note on Distributed Computing](https://m.mirror.facebook.net/kde/devel/smli_tr-94-29.pdf).
Sun Microsystems Laboratories, Inc., Technical Report TR-94-29, November 1994.
Archived at [perma.cc/8LRZ-BSZR](https://perma.cc/8LRZ-BSZR)
[[39](/en/ch5#Vinoski2008-marker)] Steve Vinoski.
[Convenience over
Correctness](https://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf). *IEEE Internet Computing*, volume 12, issue 4, pages 89–92, July 2008.
[doi:10.1109/MIC.2008.75](https://doi.org/10.1109/MIC.2008.75)
[[40](/en/ch5#Leach2017idemptence-marker)] Brandur Leach.
[Designing robust and predictable APIs with
idempotency](https://stripe.com/blog/idempotency). *stripe.com*, February 2017.
Archived at [perma.cc/JD22-XZQT](https://perma.cc/JD22-XZQT)
[[41](/en/ch5#Rose2023-marker)] Sam Rose.
[Load Balancing](https://samwho.dev/load-balancing/). *samwho.dev*, April 2023.
Archived at [perma.cc/Q7BA-9AE2](https://perma.cc/Q7BA-9AE2)
[[42](/en/ch5#Hunt2014wn-marker)] Troy Hunt.
[Your API versioning is
wrong, which is why I decided to do it 3 different wrong ways](https://www.troyhunt.com/your-api-versioning-is-wrong-which-is/). *troyhunt.com*,
February 2014. Archived at [perma.cc/9DSW-DGR5](https://perma.cc/9DSW-DGR5)
[[43](/en/ch5#Leach2017versioning-marker)] Brandur Leach.
[APIs as infrastructure: future-proofing Stripe with
versioning](https://stripe.com/blog/api-versioning). *stripe.com*, August 2017.
Archived at [perma.cc/L63K-USFW](https://perma.cc/L63K-USFW)
[[44](/en/ch5#BPEL2007-marker)] Alexandre Alves, Assaf Arkin, Sid Askary, et al.
[Web Services Business Process
Execution Language Version 2.0](https://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html). *docs.oasis-open.org*, April 2007.
[[45](/en/ch5#TemporalService-marker)] [What
is a Temporal Service?](https://docs.temporal.io/clusters) *docs.temporal.io*, 2024.
Archived at [perma.cc/32P3-CJ9V](https://perma.cc/32P3-CJ9V)
[[46](/en/ch5#Ewen2023-marker)] Stephan Ewen.
[Why we built Restate](https://restate.dev/blog/why-we-built-restate/). *restate.dev*,
August 2023. Archived at [perma.cc/BJJ2-X75K](https://perma.cc/BJJ2-X75K)
[[47](/en/ch5#Tenzer2024-marker)] Keith Tenzer and Joshua Smith.
[Idempotency and Durable
Execution](https://temporal.io/blog/idempotency-and-durable-execution). *temporal.io*, February 2024.
Archived at [perma.cc/9LGW-PCLU](https://perma.cc/9LGW-PCLU)
[[48](/en/ch5#TemporalWorkflow-marker)] [What
is a Temporal Workflow?](https://docs.temporal.io/workflows) *docs.temporal.io*, 2024.
Archived at [perma.cc/B5C5-Y396](https://perma.cc/B5C5-Y396)
[[49](/en/ch5#Kleeman2024-marker)] Jack Kleeman.
[Solving durable
execution’s immutability problem](https://restate.dev/blog/solving-durable-executions-immutability-problem/). *restate.dev*, February 2024.
Archived at [perma.cc/G55L-EYH5](https://perma.cc/G55L-EYH5)
[[50](/en/ch5#Perera2023-marker)] Srinath Perera.
[Exploring
Event-Driven Architecture: A Beginner’s Guide for Cloud Native Developers](https://wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/). *wso2.com*,
August 2023. Archived at
[archive.org](https://web.archive.org/web/20240716204613/https%3A//wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/)
[[51](/en/ch5#Bernstein2014-marker)] Philip A. Bernstein, Sergey Bykov, Alan
Geller, Gabriel Kliot, and Jorgen Thelin.
[Orleans:
Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical
Report MSR-TR-2014-41, March 2014.
Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF)
[^1]: [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*, July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y)
[^2]: Steve Breen. [What Do WebLogic, WebSphere, JBoss, Jenkins, OpenNMS, and Your Application Have in Common? This Vulnerability](https://foxglovesecurity.com/2015/11/06/what-do-weblogic-websphere-jboss-jenkins-opennms-and-your-application-have-in-common-this-vulnerability/). *foxglovesecurity.com*, November 2015. Archived at [perma.cc/9U97-UVVD](https://perma.cc/9U97-UVVD)
[^3]: Patrick McKenzie. [What the Rails Security Issue Means for Your Startup](https://www.kalzumeus.com/2013/01/31/what-the-rails-security-issue-means-for-your-startup/). *kalzumeus.com*, January 2013. Archived at [perma.cc/2MBJ-7PZ6](https://perma.cc/2MBJ-7PZ6)
[^4]: Brian Goetz. [Towards Better Serialization](https://openjdk.org/projects/amber/design-notes/towards-better-serialization). *openjdk.org*, June 2019. Archived at [perma.cc/UK6U-GQDE](https://perma.cc/UK6U-GQDE)
[^5]: Eishay Smith. [jvm-serializers wiki](https://github.com/eishay/jvm-serializers/wiki). *github.com*, October 2023. Archived at [perma.cc/PJP7-WCNG](https://perma.cc/PJP7-WCNG)
[^6]: [XML Is a Poor Copy of S-Expressions](https://wiki.c2.com/?XmlIsaPoorCopyOfEssExpressions). *wiki.c2.com*, May 2013. Archived at [perma.cc/7FAN-YBKL](https://perma.cc/7FAN-YBKL)
[^7]: Julia Evans. [Examples of floating point problems](https://jvns.ca/blog/2023/01/13/examples-of-floating-point-problems/). *jvns.ca*, January 2023. Archived at [perma.cc/M57L-QKKW](https://perma.cc/M57L-QKKW)
[^8]: Matt Harris. [Snowflake: An Update and Some Very Important Information](https://groups.google.com/g/twitter-development-talk/c/ahbvo3VTIYI). Email to *Twitter Development Talk* mailing list, October 2010. Archived at [perma.cc/8UBV-MZ3D](https://perma.cc/8UBV-MZ3D)
[^9]: Yakov Shafranovich. [RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files](https://tools.ietf.org/html/rfc4180). IETF, October 2005.
[^10]: Andy Coates. [Evolving JSON Schemas - Part I](https://www.creekservice.org/articles/2024/01/08/json-schema-evolution-part-1.html) and [Part II](https://www.creekservice.org/articles/2024/01/09/json-schema-evolution-part-2.html). *creekservice.org*, January 2024. Archived at [perma.cc/MZW3-UA54](https://perma.cc/MZW3-UA54) and [perma.cc/GT5H-WKZ5](https://perma.cc/GT5H-WKZ5)
[^11]: Pierre Genevès, Nabil Layaïda, and Vincent Quint. [Ensuring Query Compatibility with Evolving XML Schemas](https://arxiv.org/abs/0811.4324). INRIA Technical Report 6711, November 2008.
[^12]: Tim Bray. [Bits On the Wire](https://www.tbray.org/ongoing/When/201x/2019/11/17/Bits-On-the-Wire). *tbray.org*, November 2019. Archived at [perma.cc/3BT3-BQU3](https://perma.cc/3BT3-BQU3)
[^13]: Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. [Thrift: Scalable Cross-Language Services Implementation](https://thrift.apache.org/static/files/thrift-20070401.pdf). Facebook technical report, April 2007. Archived at [perma.cc/22BS-TUFB](https://perma.cc/22BS-TUFB)
[^14]: Martin Kleppmann. [Schema Evolution in Avro, Protocol Buffers and Thrift](https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html). *martin.kleppmann.com*, December 2012. Archived at [perma.cc/E4R2-9RJT](https://perma.cc/E4R2-9RJT)
[^15]: Doug Cutting, Chad Walters, Jim Kellerman, et al. [[PROPOSAL] New Subproject: Avro](https://lists.apache.org/thread/z571w0r5jmfsjvnl0fq4fgg0vh28d3bk). Email thread on *hadoop-general* mailing list, *lists.apache.org*, April 2009. Archived at [perma.cc/4A79-BMEB](https://perma.cc/4A79-BMEB)
[^16]: Apache Software Foundation. [Apache Avro 1.12.0 Specification](https://avro.apache.org/docs/1.12.0/specification/). *avro.apache.org*, August 2024. Archived at [perma.cc/C36P-5EBQ](https://perma.cc/C36P-5EBQ)
[^17]: Apache Software Foundation. [Avro schemas as LL(1) CFG definitions](https://avro.apache.org/docs/1.12.0/api/java/org/apache/avro/io/parsing/doc-files/parsing.html). *avro.apache.org*, August 2024. Archived at [perma.cc/JB44-EM9Q](https://perma.cc/JB44-EM9Q)
[^18]: Tony Hoare. [Null References: The Billion Dollar Mistake](https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/). Talk at *QCon London*, March 2009.
[^19]: Confluent, Inc. [Schema Registry Overview](https://docs.confluent.io/platform/current/schema-registry/index.html). *docs.confluent.io*, 2024. Archived at [perma.cc/92C3-A9JA](https://perma.cc/92C3-A9JA)
[^20]: Aditya Auradkar and Tom Quiggle. [Introducing Espresso—LinkedIn's Hot New Distributed Document Store](https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store). *engineering.linkedin.com*, January 2015. Archived at [perma.cc/FX4P-VW9T](https://perma.cc/FX4P-VW9T)
[^21]: Jay Kreps. [Putting Apache Kafka to Use: A Practical Guide to Building a Stream Data Platform (Part 2)](https://www.confluent.io/blog/event-streaming-platform-2/). *confluent.io*, February 2015. Archived at [perma.cc/8UA4-ZS5S](https://perma.cc/8UA4-ZS5S)
[^22]: Gwen Shapira. [The Problem of Managing Schemas](https://www.oreilly.com/content/the-problem-of-managing-schemas/). *oreilly.com*, November 2014. Archived at [perma.cc/BY8Q-RYV3](https://perma.cc/BY8Q-RYV3)
[^23]: John Larmouth. [*ASN.1 Complete*](https://www.oss.com/asn1/resources/books-whitepapers-pubs/larmouth-asn1-book.pdf). Morgan Kaufmann, 1999. ISBN: 978-0-122-33435-1. Archived at [perma.cc/GB7Y-XSXQ](https://perma.cc/GB7Y-XSXQ)
[^24]: Burton S. Kaliski Jr. [A Layman's Guide to a Subset of ASN.1, BER, and DER](https://luca.ntop.org/Teaching/Appunti/asn1.html). Technical Note, RSA Data Security, Inc., November 1993. Archived at [perma.cc/2LMN-W9U8](https://perma.cc/2LMN-W9U8)
[^25]: Jacob Hoffman-Andrews. [A Warm Welcome to ASN.1 and DER](https://letsencrypt.org/docs/a-warm-welcome-to-asn1-and-der/). *letsencrypt.org*, April 2020. Archived at [perma.cc/CYT2-GPQ8](https://perma.cc/CYT2-GPQ8)
[^26]: Lev Walkin. [Question: Extensibility and Dropping Fields](https://lionet.info/asn1c/blog/2010/09/21/question-extensibility-removing-fields/). *lionet.info*, September 2010. Archived at [perma.cc/VX8E-NLH3](https://perma.cc/VX8E-NLH3)
[^27]: Jacqueline Xu. [Online migrations at scale](https://stripe.com/blog/online-migrations). *stripe.com*, February 2017. Archived at [perma.cc/X59W-DK7Y](https://perma.cc/X59W-DK7Y)
[^28]: Geoffrey Litt, Peter van Hardenberg, and Orion Henry. [Project Cambria: Translate your data with lenses](https://www.inkandswitch.com/cambria/). Technical Report, *Ink & Switch*, October 2020. Archived at [perma.cc/WA4V-VKDB](https://perma.cc/WA4V-VKDB)
[^29]: Pat Helland. [Data on the Outside Versus Data on the Inside](https://www.cidrdb.org/cidr2005/papers/P12.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005.
[^30]: Roy Thomas Fielding. [Architectural Styles and the Design of Network-Based Software Architectures](https://ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf). PhD Thesis, University of California, Irvine, 2000. Archived at [perma.cc/LWY9-7BPE](https://perma.cc/LWY9-7BPE)
[^31]: Roy Thomas Fielding. [REST APIs must be hypertext-driven](https://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven). *roy.gbiv.com*, October 2008. Archived at [perma.cc/M2ZW-8ATG](https://perma.cc/M2ZW-8ATG)
[^32]: [OpenAPI Specification Version 3.1.0](https://swagger.io/specification/). *swagger.io*, February 2021. Archived at [perma.cc/3S6S-K5M4](https://perma.cc/3S6S-K5M4)
[^33]: Michi Henning. [The Rise and Fall of CORBA](https://cacm.acm.org/practice/the-rise-and-fall-of-corba/). *Communications of the ACM*, volume 51, issue 8, pages 52–57, August 2008. [doi:10.1145/1378704.1378718](https://doi.org/10.1145/1378704.1378718)
[^34]: Pete Lacey. [The S Stands for Simple](https://harmful.cat-v.org/software/xml/soap/simple). *harmful.cat-v.org*, November 2006. Archived at [perma.cc/4PMK-Z9X7](https://perma.cc/4PMK-Z9X7)
[^35]: Stefan Tilkov. [Interview: Pete Lacey Criticizes Web Services](https://www.infoq.com/articles/pete-lacey-ws-criticism/). *infoq.com*, December 2006. Archived at [perma.cc/JWF4-XY3P](https://perma.cc/JWF4-XY3P)
[^36]: Tim Bray. [The Loyal WS-Opposition](https://www.tbray.org/ongoing/When/200x/2004/09/18/WS-Oppo). *tbray.org*, September 2004. Archived at [perma.cc/J5Q8-69Q2](https://perma.cc/J5Q8-69Q2)
[^37]: Andrew D. Birrell and Bruce Jay Nelson. [Implementing Remote Procedure Calls](https://www.cs.princeton.edu/courses/archive/fall03/cs518/papers/rpc.pdf). *ACM Transactions on Computer Systems* (TOCS), volume 2, issue 1, pages 39–59, February 1984. [doi:10.1145/2080.357392](https://doi.org/10.1145/2080.357392)
[^38]: Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall. [A Note on Distributed Computing](https://m.mirror.facebook.net/kde/devel/smli_tr-94-29.pdf). Sun Microsystems Laboratories, Inc., Technical Report TR-94-29, November 1994. Archived at [perma.cc/8LRZ-BSZR](https://perma.cc/8LRZ-BSZR)
[^39]: Steve Vinoski. [Convenience over Correctness](https://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf). *IEEE Internet Computing*, volume 12, issue 4, pages 89–92, July 2008. [doi:10.1109/MIC.2008.75](https://doi.org/10.1109/MIC.2008.75)
[^40]: Brandur Leach. [Designing robust and predictable APIs with idempotency](https://stripe.com/blog/idempotency). *stripe.com*, February 2017. Archived at [perma.cc/JD22-XZQT](https://perma.cc/JD22-XZQT)
[^41]: Sam Rose. [Load Balancing](https://samwho.dev/load-balancing/). *samwho.dev*, April 2023. Archived at [perma.cc/Q7BA-9AE2](https://perma.cc/Q7BA-9AE2)
[^42]: Troy Hunt. [Your API versioning is wrong, which is why I decided to do it 3 different wrong ways](https://www.troyhunt.com/your-api-versioning-is-wrong-which-is/). *troyhunt.com*, February 2014. Archived at [perma.cc/9DSW-DGR5](https://perma.cc/9DSW-DGR5)
[^43]: Brandur Leach. [APIs as infrastructure: future-proofing Stripe with versioning](https://stripe.com/blog/api-versioning). *stripe.com*, August 2017. Archived at [perma.cc/L63K-USFW](https://perma.cc/L63K-USFW)
[^44]: Alexandre Alves, Assaf Arkin, Sid Askary, et al. [Web Services Business Process Execution Language Version 2.0](https://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html). *docs.oasis-open.org*, April 2007.
[^45]: [What is a Temporal Service?](https://docs.temporal.io/clusters) *docs.temporal.io*, 2024. Archived at [perma.cc/32P3-CJ9V](https://perma.cc/32P3-CJ9V)
[^46]: Stephan Ewen. [Why we built Restate](https://restate.dev/blog/why-we-built-restate/). *restate.dev*, August 2023. Archived at [perma.cc/BJJ2-X75K](https://perma.cc/BJJ2-X75K)
[^47]: Keith Tenzer and Joshua Smith. [Idempotency and Durable Execution](https://temporal.io/blog/idempotency-and-durable-execution). *temporal.io*, February 2024. Archived at [perma.cc/9LGW-PCLU](https://perma.cc/9LGW-PCLU)
[^48]: [What is a Temporal Workflow?](https://docs.temporal.io/workflows) *docs.temporal.io*, 2024. Archived at [perma.cc/B5C5-Y396](https://perma.cc/B5C5-Y396)
[^49]: Jack Kleeman. [Solving durable execution's immutability problem](https://restate.dev/blog/solving-durable-executions-immutability-problem/). *restate.dev*, February 2024. Archived at [perma.cc/G55L-EYH5](https://perma.cc/G55L-EYH5)
[^50]: Srinath Perera. [Exploring Event-Driven Architecture: A Beginner's Guide for Cloud Native Developers](https://wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/). *wso2.com*, August 2023. Archived at [archive.org](https://web.archive.org/web/20240716204613/https%3A//wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/)
[^51]: Philip A. Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. [Orleans: Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical Report MSR-TR-2014-41, March 2014. Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF)

@ -15,10 +15,8 @@ network. As discussed in [“Distributed versus Single-Node Systems”](https://
why you might want to replicate data:
* To keep data geographically close to your users (and thus reduce access latency)
* To allow the system to continue working even if some of its parts have failed (and thus
increase availability)
* To scale out the number of machines that can serve read queries (and thus increase read
throughput)
* To allow the system to continue working even if some of its parts have failed (and thus increase availability)
* To scale out the number of machines that can serve read queries (and thus increase read throughput)
In this chapter we will assume that your dataset is small enough that each machine can hold a copy of
the entire dataset. In [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding) we will relax that assumption and discuss *sharding*
@ -39,7 +37,7 @@ many different implementations. We will discuss the consequences of such choices
Replication of databases is an old topic—the principles haven't changed much since they were
studied in the 1970s
[[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lindsay1979_ch6)],
[^1],
because the fundamental constraints of networks have remained the same. Despite being so old,
concepts such as *eventual consistency* still cause confusion. In [“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag) we will
get more precise about eventual consistency and discuss things like the *read-your-writes* and
@ -74,7 +72,7 @@ longer contain the same data. The most common solution is called *leader-based r
[Figure 6-1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_leader_follower)):
1. One of the replicas is designated the *leader* (also known as *primary* or *source*
[[2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gryp2020)]).
[^2]).
When clients want to write to the database, they must send their requests to the leader, which
first writes the new data to its local storage.
2. The other replicas are known as *followers* (*read replicas*, *secondaries*, or *hot standbys*).
@ -97,15 +95,15 @@ multiple leaders for the same shard at the same time.
Single-leader replication is very widely used. It's a built-in feature of many relational databases,
such as PostgreSQL, MySQL, Oracle Data Guard
[[3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Oracle2019)],
[^3],
and SQL Server's Always On Availability Groups
[[4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#AlwaysOn2012)].
[^4].
It is also used in some document databases such as MongoDB and DynamoDB
[[5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Elhemali2022_ch6)],
[^5],
message brokers such as Kafka, replicated block devices such as DRBD, and some network filesystems.
Many consensus algorithms such as Raft, which is used for replication in CockroachDB
[[6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Taft2020_ch6)],
TiDB [[7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Huang2020_ch6)],
[^6],
TiDB [^7],
etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and
automatically elect a new leader if the old one fails (we will discuss consensus in more detail in
[Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency)).
@ -114,7 +112,7 @@ automatically elect a new leader if the old one fails (we will discuss consensus
In older documents you may see the term *master–slave replication*. It means the same as
leader-based replication, but the term should be avoided as it is widely considered offensive
[[8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Knodel2023)].
[^8].
## Synchronous Versus Asynchronous Replication
@ -174,7 +172,7 @@ processing writes, even if all of its followers have fallen behind.
Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless
widely used, especially if there are many followers or if they are geographically distributed
[[9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2018)].
[^9].
We will return to this issue in [“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag).
## Setting Up New Followers
@ -250,7 +248,7 @@ architecture that places less frequently accessed data on object storage while n
accessed data is kept on faster storage devices such as SSDs, NVMe, or even in memory. Other systems
use object storage as their primary storage tier, but use a separate low-latency storage system such
as Amazon's EBS or Neon's Safekeepers
[[12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kelvich2022)])
[^12])
to store their WAL. Recently, some systems have gone even farther by adopting a
*zero-disk architecture* (ZDA). ZDA-based systems persist all data to object storage and use disks
and memory strictly for caching. This allows nodes to have no persistent state, which dramatically
@ -312,7 +310,7 @@ consists of the following steps:
2. *Choosing a new leader.* This could be done through an election process (where the leader is chosen by
a majority of the remaining replicas), or a new leader could be appointed by a previously
established *controller node*
[[13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Fontaine2021)].
[^13].
The best candidate for leadership is usually the replica with the most up-to-date data changes
from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader
is a consensus problem, discussed in detail in [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency).
@ -333,7 +331,7 @@ Failover is fraught with things that can go wrong:
* Discarding writes is especially dangerous if other storage systems outside of the database need to
be coordinated with the database contents.
For example, in one incident at GitHub
[[14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Newland2012)],
[^14],
an out-of-date MySQL follower
was promoted to leader. The database used an autoincrementing counter to assign primary keys to
new rows, but because the new leader's counter lagged behind the old leader's, it reused some
@ -346,7 +344,7 @@ Failover is fraught with things that can go wrong:
[“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some
systems have a mechanism to shut down one node if two leaders are detected. However, if this
mechanism is not carefully designed, you can end up with both nodes being shut down
[[15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Imbriaco2012_ch6)].
[^15].
Moreover, there is a risk that by the time the split brain is detected and the old node is shut
down, it is already too late and data has already been corrupted.
* What is the right timeout before the leader is declared dead? A longer timeout means a longer
@ -413,7 +411,7 @@ Statement-based replication was used in MySQL before version 5.1. It is still so
as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if
there is any nondeterminism in a statement. VoltDB uses statement-based replication, and makes it
safe by requiring transactions to be deterministic
[[16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hugg2015)].
[^16].
However, determinism can be hard to guarantee in practice, so many databases prefer other
replication methods.
@ -464,17 +462,17 @@ indicating that the transaction was committed. MySQL keeps a separate logical re
called the *binlog*, in addition to the WAL (when configured to use row-based replication).
PostgreSQL implements logical replication by decoding the physical WAL into row
insertion/update/delete events
[[19](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2023)].
[^19].
Since a logical log is decoupled from the storage engine internals, it can more easily be kept
backward compatible, allowing the leader and the follower to run different versions of the database
software. This in turn enables upgrading to a new version with minimal downtime
[[20](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Petchimuthu2021)].
[^20].
A logical log format is also easier for external applications to parse. This aspect is useful if you want
to send the contents of a database to an external system, such as a data warehouse for offline
analysis, or for building custom indexes and caches
[[21](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Sharma2015te_ch6)].
[^21].
This technique is called *change data capture*, and we will return to it in [Link to Come].
# Problems with Replication Lag
@ -502,14 +500,14 @@ database: if you run the same query on the leader and a follower at the same tim
different results, because not all writes have been reflected in the follower. This inconsistency is
just a temporary state—if you stop writing to the database and wait a while, the followers will
eventually catch up and become consistent with the leader. For that reason, this effect is known
as *eventual consistency* [[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry2011)].
as *eventual consistency* [^22].
###### Note
The term *eventual consistency* was coined by Douglas Terry et al.
[[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry1994)],
[^23],
popularized by Werner Vogels
[[24](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Vogels2008)],
[^24],
and became the battle cry of many NoSQL projects. However, not only NoSQL databases are eventually
consistent: followers in an asynchronously replicated relational database have the same
characteristics.
@ -542,7 +540,7 @@ submitted was lost, so they will be understandably unhappy.
###### Figure 6-3. A user makes a write, followed by a read from a stale replica. To prevent this anomaly, we need read-after-write consistency.
In this situation, we need *read-after-write consistency*, also known as *read-your-writes consistency*
[[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry1994)].
[^23].
This is a guarantee that if the user reloads the page, they will always see any updates they
submitted themselves. It makes no promises about other users: other users' updates may not be
visible until some later time. However, it reassures the user that their own input has been saved
@ -563,14 +561,14 @@ are various possible techniques. To mention a few:
scaling). In that case, other criteria may be used to decide whether to read from the leader. For
example, you could track the time of the last update and, for one minute after the last update, make all
reads from the leader
[[25](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Willison2022)].
[^25].
You could also monitor the replication lag on followers and prevent queries on any follower that
is more than one minute behind the leader.
* The client can remember the timestamp of its most recent write—then the system can ensure that the
replica serving any reads for that user reflects updates at least until that timestamp. If a
replica is not sufficiently up to date, either the read can be handled by another replica or the
query can wait until the replica has caught up
[[26](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Tharakan2020)].
[^26].
The timestamp could be a *logical timestamp* (something that indicates ordering of writes, such as
the log sequence number) or the actual system clock (in which case clock synchronization becomes
critical; see [“Unreliable Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_clocks)).
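The timestamp-based approach in the last bullet can be sketched roughly as follows. This is a minimal illustration, assuming a logical timestamp in the form of a log sequence number; the `Replica` class, its `applied_lsn` field, and the `pick_replica` function are hypothetical names chosen for this example, not part of any particular database.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    applied_lsn: int  # highest log position this replica has applied (hypothetical field)


def pick_replica(replicas, last_write_lsn):
    """Return the first replica that reflects the client's latest write, or None."""
    for replica in replicas:
        if replica.applied_lsn >= last_write_lsn:
            return replica
    # No replica is sufficiently up to date: the caller may wait and retry,
    # or fall back to reading from the leader.
    return None
```

A client that remembers log position 100 from its last write would be routed only to a follower whose `applied_lsn` is at least 100; a follower at position 90 is skipped because it might not yet reflect that write.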
@ -632,7 +630,7 @@ and then see it disappear again.
###### Figure 6-4. A user first reads from a fresh replica, then from a stale replica. Time appears to go backward. To prevent this anomaly, we need monotonic reads.
*Monotonic reads* [[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry2011)] is a guarantee that this
*Monotonic reads* [^22] is a guarantee that this
kind of anomaly does not happen. It's a lesser guarantee than strong consistency, but a stronger
guarantee than eventual consistency. When you read data, you may see an old value; monotonic reads
only means that if one user makes several reads in sequence, they will not see time go
@ -669,14 +667,14 @@ Mr. Poons
To the observer it looks as though Mrs. Cake is answering the question before Mr. Poons has even asked
it. Such psychic powers are impressive, but very confusing
[[27](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Pratchett1991)].
[^27].
![ddia 0605](/fig/ddia_0605.png)
###### Figure 6-5. If some shards are replicated slower than others, an observer may see the answer before they see the question.
Preventing this kind of anomaly requires another type of guarantee: *consistent prefix reads*
[[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry2011)]. This guarantee says that if a sequence of
[^22]. This guarantee says that if a sequence of
writes happens in a certain order, then anyone reading those writes will see them appear in the same
order.
@ -811,7 +809,7 @@ Consistency
with another write on another leader.
This is simply a fundamental limitation of distributed systems
[[28](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Bailis2014coord_ch6)].
[^28].
If you need to enforce such constraints, you're therefore better off with a single-leader system.
However, as we will see in [“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts), multi-leader systems can still
achieve consistency properties that are useful in a wide range of apps that don't need such
@ -820,13 +818,13 @@ Consistency
Multi-leader replication is less common than single-leader replication, but it is still supported by
many databases, including MySQL, Oracle, SQL Server, and YugabyteDB. In some cases it is an external
add-on feature, for example in Redis Enterprise, EDB Postgres Distributed, and pglogical
[[29](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Raja2022)].
[^29].
As multi-leader replication is a somewhat retrofitted feature in many databases, there are often
subtle configuration pitfalls and surprising interactions with other database features. For example,
autoincrementing keys, triggers, and integrity constraints can be problematic. For this reason,
multi-leader replication is often considered dangerous territory that should be avoided if possible
[[30](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2012)].
[^30].
### Multi-leader replication topologies
@ -857,7 +855,7 @@ In circular and star topologies, a write may need to pass through several nodes
all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To
prevent infinite replication loops, each node is given a unique identifier, and in the replication
log, each write is tagged with the identifiers of all the nodes it has passed through
[[31](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#HBase7709)].
[^31].
When a node receives a data change that is tagged with its own identifier, that data change is
ignored, because the node knows that it has already been processed.
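The loop-prevention rule described above can be illustrated with a small simulation of a circular topology. This is a sketch under simplifying assumptions (synchronous forwarding, no failures); the `Change` and `Node` classes, the `seen_by` tag set, and `make_ring` are hypothetical names for this example and do not reflect how any specific database represents its replication log.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Change:
    """A replicated write plus the IDs of every node it has passed through."""
    key: str
    value: str
    seen_by: frozenset = frozenset()


@dataclass
class Node:
    node_id: str
    data: dict = field(default_factory=dict)
    next_node: "Node | None" = None  # successor in the ring
    applied: int = 0  # number of changes this node actually applied

    def write(self, key, value):
        """Accept a client write locally, then start forwarding it."""
        self.receive(Change(key, value))

    def receive(self, change):
        if self.node_id in change.seen_by:
            # Tagged with our own identifier: already processed, so ignore it.
            return
        self.data[change.key] = change.value
        self.applied += 1
        if self.next_node is not None:
            # Add our identifier to the tag set before forwarding.
            tagged = Change(change.key, change.value, change.seen_by | {self.node_id})
            self.next_node.receive(tagged)


def make_ring(ids):
    """Build nodes connected in a circle: each forwards to the next, the last to the first."""
    nodes = [Node(node_id) for node_id in ids]
    for i, node in enumerate(nodes):
        node.next_node = nodes[(i + 1) % len(nodes)]
    return nodes
```

In a three-node ring, a write accepted by the first node travels once around the circle; when it arrives back at the first node it carries that node's identifier and is dropped, so every node applies it exactly once.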
@ -949,13 +947,13 @@ existed for a long time, the term has recently gained attention
[37](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Jayakar2024)].
An application that allows a user to continue editing a file while offline (which may be implemented
using a sync engine) is called *offline-first*
[[38](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Feyerke2013)].
[^38].
The term *local-first software* refers to collaborative apps that are not only offline-first, but
are also designed to continue working even if the developer who made the software shuts down all of
their online services [[39](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kleppmann2019_ch6)].
their online services [^39].
This can be achieved by using a sync engine with an open standard sync protocol for which multiple
service providers are available
[[40](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kleppmann2024lofi)].
[^40].
For example, Git is a local-first collaboration system (albeit one that doesn't support real-time
collaboration) since you can sync via GitHub, GitLab, or any other repository hosting service.
@ -979,11 +977,11 @@ approach has a number of advantages:
[“The problems with remote procedure calls (RPCs)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch05.html#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user
interface needs to somehow reflect that error. A sync engine allows the app to perform reads and
writes on local data, which almost never fails, leading to a more declarative programming style
[[41](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hofmeyr2024)].
[^41].
* In order to display edits from other users in real-time, you need to receive notifications of
those edits and efficiently update the user interface accordingly. A sync engine combined with a
*reactive programming* model is a good way of implementing this
[[42](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#vanHardenberg2020)].
[^42].
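The two points above can be pictured with a small sketch: reads and writes go to local data, and the user interface reacts to changes whether they come from the local user or from another user. This is only an illustration under stated assumptions: `LocalStore`, `subscribe`, `write`, `apply_remote`, and `outbox` are hypothetical names, not the API of any real sync engine, and conflict handling is deliberately left out.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A listener is called with (key, new_value) whenever local data changes.
Listener = Callable[[str, str], None]


class LocalStore:
    """Hypothetical in-memory stand-in for a sync engine's local replica."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}
        self._listeners: List[Listener] = []
        # Local writes that still need to be sent to the server or peers.
        self.outbox: List[Tuple[str, str]] = []

    def subscribe(self, listener: Listener) -> None:
        """Register a callback, e.g. a UI component that re-renders."""
        self._listeners.append(listener)

    def read(self, key: str) -> Optional[str]:
        """Reads are served from local data, with no network round trip."""
        return self._data.get(key)

    def write(self, key: str, value: str) -> None:
        """A local write takes effect immediately and is queued for sync."""
        self._data[key] = value
        self.outbox.append((key, value))
        self._notify(key, value)

    def apply_remote(self, key: str, value: str) -> None:
        """Apply an edit received from another user (no conflict handling)."""
        self._data[key] = value
        self._notify(key, value)

    def _notify(self, key: str, value: str) -> None:
        for listener in self._listeners:
            listener(key, value)
```

In this sketch the interface code only ever subscribes to the store, so a local edit and a remote edit reach the screen through the same path, and a network failure affects only how long entries stay in `outbox`.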
Sync engines work best when all the data that the user may need is downloaded in advance and stored
persistently on the client. This means that the data is available for offline access when needed,
@ -993,7 +991,7 @@ of data. For example, downloading all the files that the user themselves created
e-commerce website probably doesn’t make sense.
The sync engine was pioneered by Lotus Notes in the 1980s
[[43](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kawell1988)]
[^43]
(without using that term), and sync for specific apps such as calendars has also existed for a long
time. Today there are a number of general-purpose sync engines, some of which use a proprietary
backend service (e.g., Google Firestore, Realm, or Ditto), and some have an open source backend,
@ -1003,7 +1001,7 @@ Multiplayer video games have a similar need to respond immediately to the user
reconcile them with other players’ actions received asynchronously over the network. In game
development jargon the equivalent of a sync engine is called *netcode*. The techniques used in
netcode are quite specific to the requirements of games
[[44](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Pusch2019)], and don’t directly
[^44], and don’t directly
carry over to other types of software, so we won’t consider them further in this book.
## Dealing with Conflicting Writes
@ -1040,7 +1038,7 @@ One strategy for conflicts is to avoid them occurring in the first place. For ex
application can ensure that all writes for a particular record go through the same leader, then
conflicts cannot occur, even if the database as a whole is multi-leader. This approach is not
possible in the case of a sync engine client being updated offline, but it is sometimes possible in
geo-replicated server systems [[30](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2012)].
geo-replicated server systems [^30].
For example, in an application where a user can only edit their own data, you can ensure that
requests from a particular user are always routed to the same region and use the leader in that
@ -1126,7 +1124,7 @@ suffers from a number of problems:
union of the carts). This meant that if the customer had removed an item from their cart in one
sibling, but another sibling still contained that old item, the removed item would unexpectedly
reappear in the customer’s cart
[[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6)].
[^45].
[Figure 6-10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear.
* If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution
@ -1177,8 +1175,8 @@ then conflict resolution is inevitable, and automating it is often the best appr
Two families of algorithms are commonly used to implement automatic conflict resolution:
*Conflict-free replicated datatypes* (CRDTs)
[[46](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Shapiro2011)] and *Operational Transformation* (OT)
[[47](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Sun1998)].
[^46] and *Operational Transformation* (OT)
[^47].
They have different design philosophies and performance characteristics, but both are able to
perform automatic merges for all the aforementioned types of data.
@ -1214,12 +1212,12 @@ There are many algorithms based on variations of these ideas. Lists/arrays can b
similarly, using list elements instead of characters, and other datatypes such as key-value maps can
be added quite easily. There are some performance and functionality trade-offs between OT and CRDTs,
but it’s possible to combine the advantages of CRDTs and OT in one algorithm
[[48](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gentle2025)].
[^48].
OT is most often used for real-time collaborative editing of text, e.g. in Google Docs
[[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DayRichter2010)], whereas CRDTs can be found in
[^32], whereas CRDTs can be found in
distributed databases such as Redis Enterprise, Riak, and Azure Cosmos DB
[[49](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Shukla2018)].
[^49].
Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge or Yjs) and with OT
(e.g., ShareDB).
@ -1256,17 +1254,17 @@ systems were leaderless [[1](https://learning.oreilly.com/library/view/designing
[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gifford1979)], but the
idea was mostly forgotten during the era of dominance of relational databases. It once again became
a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in
2007 [[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6)].
2007 [^45].
Riak, Cassandra, and ScyllaDB are open source datastores with leaderless replication models inspired
by Dynamo, so this kind of database is also known as *Dynamo-style*.
###### Note
The original *Dynamo* system was only described in a paper
[[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6)], but never released outside of
[^45], but never released outside of
Amazon. The similarly-named *DynamoDB* is a more recent cloud database from AWS, but it has a
completely different architecture: it uses single-leader replication based on the Multi-Paxos
consensus algorithm [[5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Elhemali2022_ch6)].
consensus algorithm [^5].
In some leaderless implementations, the client directly sends its writes to several replicas, while
in others, a coordinator node does this on behalf of the client. However, unlike a leader database,
@ -1348,7 +1346,7 @@ considered successful, and we must query at least *r* nodes for each read. (In o
*n* = 3, *w* = 2, *r* = 2.) As long as *w* + *r* >
*n*, we expect to get an up-to-date value when reading, because at least one of the *r* nodes we’re
reading from must be up to date. Reads and writes that obey these *r* and *w* values are called
*quorum* reads and writes [[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gifford1979)].
*quorum* reads and writes [^50].
You can think of *r* and *w* as the minimum number of votes required for the read or write to be
valid.
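The quorum condition is simple arithmetic, which the following minimal sketch makes explicit. The helper names `is_quorum` and `min_overlap` are assumptions made for illustration and do not correspond to any particular database's API.

```python
def is_quorum(n: int, w: int, r: int) -> bool:
    """Return True if every set of r read nodes must overlap every set of w write nodes.

    n: number of replicas, w: write acknowledgements required,
    r: read responses required.
    """
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must each be between 1 and n")
    return w + r > n


def min_overlap(n: int, w: int, r: int) -> int:
    """Smallest possible number of nodes shared by a write set and a read set."""
    return max(0, w + r - n)
```

With *n* = 3, *w* = 2, *r* = 2, `is_quorum(3, 2, 2)` is `True` and `min_overlap(3, 2, 2)` is 1, so at least one of the nodes read from has seen the latest successful write. With *w* = 1 and *r* = 1 the overlap can be zero, and a read may return only stale values.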
@ -1402,7 +1400,7 @@ Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, becau
not necessarily majorities—it only matters that the sets of nodes used by the read and write
operations overlap in at least one node. Other quorum assignments are possible, which allows some
flexibility in the design of distributed algorithms
[[51](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Howard2016_ch6)].
[^51].
You may also set *w* and *r* to smaller numbers, so that *w* + *r* ≤ *n* (i.e.,
the quorum condition is not satisfied). In this case, reads and writes will still be sent to *n*
@ -1432,7 +1430,7 @@ properties can be confusing. Some scenarios include:
nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the
replicas where it succeeded. This means that if a write was reported as failed, subsequent reads
may or may not return the value from that write
[[52](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Blomstedt2012ricon)].
[^52].
* If the database uses timestamps from a real-time clock to determine which write is newer (as
Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a
faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lww).
@ -1445,7 +1443,7 @@ properties can be confusing. Some scenarios include:
Thus, although quorums appear to guarantee that a read returns the latest written value, in practice
it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate
eventual consistency. The parameters *w* and *r* allow you to adjust the probability of stale values
being read [[53](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Bailis2014pbs)],
being read [^53],
but it’s wise to not take them as absolute guarantees.
### Monitoring staleness
@ -1464,7 +1462,7 @@ current position, you can measure the amount of replication lag.
However, in systems with leaderless replication, there is no fixed order in which writes are
applied, which makes monitoring more difficult. The number of hints that a replica stores for
handoff can be one measure of system health, but it’s difficult to interpret usefully
[[54](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Breck2019)].
[^54].
Eventual consistency is a deliberately vague guarantee, but for operability it’s important to be
able to quantify “eventual.”
@ -1493,13 +1491,13 @@ Because there is no failover, and requests go to multiple replicas in parallel a
becoming slow or unavailable has very little impact on response times: the client simply uses the
responses from the other replicas that are faster to respond. Using the fastest responses is called
*request hedging*, and it can significantly reduce tail latency
[[55](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Dean2013_ch6)]).
[^55].
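As a rough illustration of using the fastest responses, the sketch below sends a read to every replica in parallel and returns as soon as *r* of them have answered. It is a simplified model, not a client for any real database: `hedged_read` and `make_replica` are hypothetical names, replicas are simulated with `asyncio.sleep`, and a replica that fails is simply ignored.

```python
import asyncio
from typing import Awaitable, Callable, List

# A replica is modelled as a function that starts an asynchronous read.
Replica = Callable[[], Awaitable[str]]


def make_replica(value: str, delay: float) -> Replica:
    """Simulated replica that returns `value` after `delay` seconds."""
    async def read() -> str:
        await asyncio.sleep(delay)
        return value
    return read


async def hedged_read(replicas: List[Replica], r: int) -> List[str]:
    """Query all replicas in parallel and return the r fastest responses."""
    tasks = [asyncio.create_task(replica()) for replica in replicas]
    results: List[str] = []
    try:
        # as_completed yields the reads in the order in which they finish.
        for finished in asyncio.as_completed(tasks):
            try:
                results.append(await finished)
            except Exception:
                continue  # a failed replica does not block the request
            if len(results) == r:
                break
    finally:
        for task in tasks:
            task.cancel()  # has no effect on tasks that already finished
    if len(results) < r:
        raise RuntimeError("fewer than r replicas responded")
    return results
```

Calling `asyncio.run(hedged_read([make_replica("a", 0.01), make_replica("b", 1.0), make_replica("c", 0.05)], r=2))` completes after roughly 0.05 seconds rather than 1 second, because the slow replica is never waited for.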
At its core, the resilience of a leaderless system comes from the fact that it doesn’t distinguish
between the normal case and the failure case. This is especially helpful when handling so-called
*gray failures*, in which a node isn’t completely down, but running in a degraded state where it is
unusually slow to handle requests
[[56](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Huang2017_ch6)],
[^56],
or when a node is simply overloaded (for example, if a node has been offline for a while, recovery
via hinted handoff can cause a lot of additional load). A leader-based system has to decide whether
the situation is bad enough to warrant a failover (which can itself cause further disruption),
@ -1511,7 +1509,7 @@ That said, leaderless systems can have performance problems as well:
another replica is unavailable so that it can store hints about writes that the unavailable
replica missed. When the unavailable replica comes back, the handoff process needs to send it
those hints. This puts additional load on the replicas at a time when the system is already under
strain [[54](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Breck2019)].
strain [^54].
* The more replicas you have, the bigger the size of your quorums, and the more responses you have
to wait for before a request can complete. Even if you wait only for the fastest *r* or *w*
replicas to respond, and even if you make the requests in parallel, a bigger *r* or *w* increases
@ -1521,7 +1519,7 @@ That said, leaderless systems can have performance problems as well:
make it impossible to form a quorum. Some leaderless databases offer a configuration option that
allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that
key (Riak and Dynamo call this a *sloppy quorum*
[[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6)];
[^45];
Cassandra and ScyllaDB call it *consistency level ANY*). There is no guarantee that subsequent
reads will see the written value, but depending on the application it may still be better than
having the write fail.
@ -1603,7 +1601,7 @@ An operation A *happens before* another operation B if B knows about A, or depen
upon A in some way. Whether one operation happens before another operation is the key to defining
what concurrency means. In fact, we can simply say that two operations are *concurrent* if neither
happens before the other (i.e., neither knows about the other)
[[57](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lamport1978_ch6)].
[^57].
Thus, whenever you have two operations A and B, there are three possibilities: either A happened
before B, or B happened before A, or A and B are concurrent. What we need is an algorithm to tell us
@ -1621,7 +1619,7 @@ at exactly the same time—an issue we will discuss in more detail in [Chapter 
For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if
they are both unaware of each other, regardless of the physical time at which they occurred. People
sometimes make a connection between this principle and the special theory of relativity in physics
[[57](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lamport1978_ch6)], which introduced the idea that
[^57], which introduced the idea that
information cannot travel faster than the speed of light. Consequently, two events that occur some
distance apart cannot possibly affect each other if the time between the events is shorter than the
time it takes light to travel the distance between them.
@ -1719,7 +1717,7 @@ version numbers it has seen from each of the other replicas. This information in
to overwrite and which values to keep as siblings.
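One way to picture how per-replica version numbers separate overwrites from siblings is the comparison sketched below. It assumes that each value carries a dictionary mapping a replica ID to the highest version number seen from that replica. The function name `compare` and its string results are illustrative assumptions, not the interface of any particular database.

```python
from typing import Dict


def compare(a: Dict[str, int], b: Dict[str, int]) -> str:
    """Compare two sets of per-replica version numbers.

    A missing replica ID is treated as version 0.
    """
    replicas = set(a) | set(b)
    a_ahead = any(a.get(k, 0) > b.get(k, 0) for k in replicas)
    b_ahead = any(b.get(k, 0) > a.get(k, 0) for k in replicas)
    if a_ahead and b_ahead:
        return "concurrent"  # neither has seen the other: keep both as siblings
    if a_ahead:
        return "a_newer"     # a has seen everything in b, so a may overwrite b
    if b_ahead:
        return "b_newer"     # b has seen everything in a, so b may overwrite a
    return "equal"
```

For example, `compare({"r1": 2, "r2": 1}, {"r1": 1, "r2": 1})` returns `"a_newer"`, whereas `compare({"r1": 2}, {"r2": 1})` returns `"concurrent"`, because each side contains a write that the other has not seen.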
The collection of version numbers from all the replicas is called a *version vector*
[[58](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#ParkerJr1983)].
[^58].
A few variants of this idea are in use, but the most interesting is probably the *dotted version
vector*
[[59](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Preguica2010),
@ -1827,350 +1825,71 @@ machine to store only a subset of the data.
##### Footnotes
##### References
[[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lindsay1979_ch6-marker)] B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N.
Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade.
[Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf).
IBM Research, Research Report RJ2571(33471), July 1979.
Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)
[[2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gryp2020-marker)] Kenny Gryp.
[MySQL Terminology
Updates](https://dev.mysql.com/blog-archive/mysql-terminology-updates/). *dev.mysql.com*, July 2020.
Archived at [perma.cc/S62G-6RJ2](https://perma.cc/S62G-6RJ2)
[[3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Oracle2019-marker)] Oracle Corporation.
[Oracle
(Active) Data Guard 19c: Real-Time Data Protection and Availability](https://www.oracle.com/technetwork/database/availability/dg-adg-technical-overview-wp-5347548.pdf). White Paper, *oracle.com*, March 2019.
Archived at [perma.cc/P5ST-RPKE](https://perma.cc/P5ST-RPKE)
[[4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#AlwaysOn2012-marker)] Microsoft.
[What
is an Always On availability group?](https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server) *learn.microsoft.com*, September 2024.
Archived at [perma.cc/ABH6-3MXF](https://perma.cc/ABH6-3MXF)
[[5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Elhemali2022_ch6-marker)] Mostafa Elhemali, Niall Gallagher, Nicholas
Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu
Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul,
Doug Terry, and Akshat Vig.
[Amazon DynamoDB: A Scalable,
Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical
Conference* (ATC), July 2022.
[[6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Taft2020_ch6-marker)] Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan
VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul
Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis.
[CockroachDB: The Resilient
Geo-Distributed SQL Database](https://dl.acm.org/doi/abs/10.1145/3318464.3386134). At *ACM SIGMOD International Conference on Management of
Data* (SIGMOD), pages 1493–1509, June 2020.
[doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134)
[[7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Huang2020_ch6-marker)] Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang,
Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang,
Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan
Pei, and Xin Tang.
[TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf).
*Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084.
[doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535)
[[8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Knodel2023-marker)] Mallory Knodel and Niels ten Oever.
[Terminology, Power, and
Inclusive Language in Internet-Drafts and RFCs](https://www.ietf.org/archive/id/draft-knodel-terminology-14.html). *IETF Internet-Draft*, August 2023.
Archived at [perma.cc/5ZY9-725E](https://perma.cc/5ZY9-725E)
[[9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2018-marker)] Buck Hodges.
[Postmortem: VSTS 4 September 2018](https://devblogs.microsoft.com/devopsservice/?p=17485).
*devblogs.microsoft.com*, September 2018.
Archived at [perma.cc/ZF5R-DYZS](https://perma.cc/ZF5R-DYZS)
[[10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Morling2024_ch6-marker)] Gunnar Morling.
[Leader
Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024.
Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y)
[[11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Chandramohan2024-marker)] Vignesh Chandramohan, Rohan Desai, and Chris Riccomini.
[SlateDB Manifest
Design](https://github.com/slatedb/slatedb/blob/main/rfcs/0001-manifest.md). *github.com*, May 2024.
Archived at [perma.cc/8EUY-P32Z](https://perma.cc/8EUY-P32Z)
[[12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kelvich2022-marker)] Stas Kelvich.
[Why does Neon use Paxos instead of Raft, and what’s the
difference?](https://neon.tech/blog/paxos) *neon.tech*, August 2022.
Archived at [perma.cc/SEZ4-2GXU](https://perma.cc/SEZ4-2GXU)
[[13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Fontaine2021-marker)] Dimitri Fontaine.
[An
introduction to the pg\_auto\_failover project](https://tapoueh.org/blog/2021/11/an-introduction-to-the-pg_auto_failover-project/). *tapoueh.org*, November 2021.
Archived at [perma.cc/3WH5-6BAF](https://perma.cc/3WH5-6BAF)
[[14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Newland2012-marker)] Jesse Newland.
[GitHub
availability this week](https://github.blog/news-insights/the-library/github-availability-this-week/). *github.blog*, September 2012.
Archived at [perma.cc/3YRF-FTFJ](https://perma.cc/3YRF-FTFJ)
[[15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Imbriaco2012_ch6-marker)] Mark Imbriaco.
[Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/).
*github.blog*, December 2012.
Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ)
[[16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hugg2015-marker)] John Hugg.
[All In with Determinism for Performance and
Testing in Distributed Systems](https://www.youtube.com/watch?v=gJRj3vJL4wE). At *Strange Loop*, September 2015.
[[17](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Suzuki2017_ch6-marker)] Hironobu Suzuki.
[The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017.
[[18](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2012-marker)] Amit Kapila.
[WAL
Internals of PostgreSQL](https://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf). At *PostgreSQL Conference* (PGCon), May 2012.
Archived at [perma.cc/6225-3SUX](https://perma.cc/6225-3SUX)
[[19](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2023-marker)] Amit Kapila.
[Evolution
of Logical Replication](https://amitkapila16.blogspot.com/2023/09/evolution-of-logical-replication.html). *amitkapila16.blogspot.com*, September 2023.
Archived at [perma.cc/F9VX-JLER](https://perma.cc/F9VX-JLER)
[[20](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Petchimuthu2021-marker)] Aru Petchimuthu.
[Upgrade
your Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL database, Part 2: Using the pglogical
extension](https://aws.amazon.com/blogs/database/part-2-upgrade-your-amazon-rds-for-postgresql-database-using-the-pglogical-extension/). *aws.amazon.com*, August 2021.
Archived at [perma.cc/RXT8-FS2T](https://perma.cc/RXT8-FS2T)
[[21](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Sharma2015te_ch6-marker)] Yogeshwer Sharma, Philippe Ajoux, Petchean
Ang, David Callies, Abhishek Choudhary, Laurent Demailly, Thomas Fersch, Liat Atsmon Guz, Andrzej
Kotulski, Sachin Kulkarni, Sanjeev Kumar, Harry Li, Jun Li, Evgeniy Makeev, Kowshik Prakasam,
Robbert van Renesse, Sabyasachi Roy, Pratyush Seth, Yee Jiun Song, Benjamin Wester, Kaushik
Veeraraghavan, and Peter Xie.
[Wormhole:
Reliable Pub-Sub to Support Geo-Replicated Internet Services](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf). At *12th USENIX
Symposium on Networked Systems Design and Implementation* (NSDI), May 2015.
[[22](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry2011-marker)] Douglas B. Terry.
[Replicated
Data Consistency Explained Through Baseball](https://www.microsoft.com/en-us/research/publication/replicated-data-consistency-explained-through-baseball/). Microsoft Research, Technical Report
MSR-TR-2011-137, October 2011.
Archived at [perma.cc/F4KZ-AR38](https://perma.cc/F4KZ-AR38)
[[23](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Terry1994-marker)] Douglas B. Terry, Alan J. Demers, Karin Petersen,
Mike J. Spreitzer, Marvin M. Theimer, and Brent B. Welch.
[Session Guarantees
for Weakly Consistent Replicated Data](https://csis.pace.edu/~marchese/CS865/Papers/SessionGuaranteesPDIS.pdf). At *3rd International Conference on Parallel and
Distributed Information Systems* (PDIS), September 1994.
[doi:10.1109/PDIS.1994.331722](https://doi.org/10.1109/PDIS.1994.331722)
[[24](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Vogels2008-marker)] Werner Vogels.
[Eventually Consistent](https://queue.acm.org/detail.cfm?id=1466448).
*ACM Queue*, volume 6, issue 6, pages 14–19, October 2008.
[doi:10.1145/1466443.1466448](https://doi.org/10.1145/1466443.1466448)
[[25](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Willison2022-marker)] Simon Willison.
[Reply to: “My thoughts about Fly.io (so
far) and other newish technology I’m getting into”](https://news.ycombinator.com/item?id=31434055). *news.ycombinator.com*, May 2022.
Archived at [perma.cc/ZRV4-WWV8](https://perma.cc/ZRV4-WWV8)
[[26](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Tharakan2020-marker)] Nithin Tharakan.
[Scaling Bitbucket’s
Database](https://www.atlassian.com/blog/bitbucket/scaling-bitbuckets-database). *atlassian.com*, October 2020.
Archived at [perma.cc/JAB7-9FGX](https://perma.cc/JAB7-9FGX)
[[27](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Pratchett1991-marker)] Terry Pratchett. *Reaper Man: A Discworld
Novel*. Victor Gollancz, 1991. ISBN: 978-0-575-04979-6
[[28](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Bailis2014coord_ch6-marker)] Peter Bailis, Alan Fekete, Michael J.
Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica.
[Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237).
*Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014.
[doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509)
[[29](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Raja2022-marker)] Yaser Raja and Peter Celentano.
[PostgreSQL
bi-directional replication using pglogical](https://aws.amazon.com/blogs/database/postgresql-bi-directional-replication-using-pglogical/). *aws.amazon.com*, January 2022.
Archived at [perma.cc/BUQ2-5QWN](https://perma.cc/BUQ2-5QWN)
[[30](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hodges2012-marker)] Robert Hodges.
[If
You \*Must\* Deploy Multi-Master Replication, Read This First](https://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html). *scale-out-blog.blogspot.com*,
April 2012. Archived at [perma.cc/C2JN-F6Y8](https://perma.cc/C2JN-F6Y8)
[[31](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#HBase7709-marker)] Lars Hofhansl.
[HBASE-7709: Infinite Loop Possible in
Master/Master Replication](https://issues.apache.org/jira/browse/HBASE-7709). *issues.apache.org*, January 2013.
Archived at [perma.cc/24G2-8NLC](https://perma.cc/24G2-8NLC)
[[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DayRichter2010-marker)] John Day-Richter.
[What’s
Different About the New Google Docs: Making Collaboration Fast](https://drive.googleblog.com/2010/09/whats-different-about-new-google-docs.html). *drive.googleblog.com*,
September 2010. Archived at [perma.cc/5TL8-TSJ2](https://perma.cc/5TL8-TSJ2)
[[33](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Wallace2019-marker)] Evan Wallace.
[How Figma’s
multiplayer technology works](https://www.figma.com/blog/how-figmas-multiplayer-technology-works/). *figma.com*, October 2019.
Archived at [perma.cc/L49H-LY4D](https://perma.cc/L49H-LY4D)
[[34](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Artman2023-marker)] Tuomas Artman.
[Scaling the Linear Sync Engine](https://linear.app/blog/scaling-the-linear-sync-engine).
*linear.app*, June 2023.
[[35](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Saafan2024-marker)] Amr Saafan.
[Why Sync
Engines Might Be the Future of Web Applications](https://www.nilebits.com/blog/2024/09/sync-engines-future-web-applications/). *nilebits.com*, September 2024.
Archived at [perma.cc/5N73-5M3V](https://perma.cc/5N73-5M3V)
[[36](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hagoel2024-marker)] Isaac Hagoel.
[Are Sync
Engines The Future of Web Applications?](https://dev.to/isaachagoel/are-sync-engines-the-future-of-web-applications-1bbi) *dev.to*, July 2024.
Archived at [perma.cc/R9HF-BKKL](https://perma.cc/R9HF-BKKL)
[[37](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Jayakar2024-marker)] Sujay Jayakar.
[A Map of Sync](https://stack.convex.dev/a-map-of-sync). *stack.convex.dev*,
October 2024. Archived at [perma.cc/82R3-H42A](https://perma.cc/82R3-H42A)
[[38](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Feyerke2013-marker)] Alex Feyerke.
[Designing Offline-First Web Apps](https://alistapart.com/article/offline-first/).
*alistapart.com*, December 2013.
Archived at [perma.cc/WH7R-S2DS](https://perma.cc/WH7R-S2DS)
[[39](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kleppmann2019_ch6-marker)] Martin Kleppmann,
Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan.
[Local-first software: You own your data, in
spite of the cloud](https://www.inkandswitch.com/local-first/). At *ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and
Reflections on Programming and Software* (Onward!), October 2019, pages 154–178.
[doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
[[40](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kleppmann2024lofi-marker)] Martin Kleppmann.
[The past, present, and
future of local-first](https://martin.kleppmann.com/2024/05/30/local-first-conference.html). At *Local-First Conference*, May 2024.
[[41](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hofmeyr2024-marker)] Conrad Hofmeyr.
[API
Calling is to Sync Engines as jQuery is to React](https://www.powersync.com/blog/api-calling-is-to-sync-engines-as-jquery-is-to-react). *powersync.com*, November 2024.
Archived at [perma.cc/2FP9-7WJJ](https://perma.cc/2FP9-7WJJ)
[[42](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#vanHardenberg2020-marker)] Peter van Hardenberg and Martin Kleppmann.
[PushPin: Towards
Production-Quality Peer-to-Peer Collaboration](https://martin.kleppmann.com/papers/pushpin-papoc20.pdf). At *7th Workshop on Principles and Practice
of Consistency for Distributed Data* (PaPoC), April 2020.
[doi:10.1145/3380787.3393683](https://doi.org/10.1145/3380787.3393683)
[[43](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kawell1988-marker)] Leonard Kawell, Jr., Steven Beckhardt, Timothy
Halvorsen, Raymond Ozzie, and Irene Greif.
[Replicated document management in a group
communication system](https://dl.acm.org/doi/pdf/10.1145/62266.1024798). At *ACM Conference on Computer-Supported Cooperative Work* (CSCW),
September 1988.
[doi:10.1145/62266.1024798](https://doi.org/10.1145/62266.1024798)
[[44](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Pusch2019-marker)] Ricky Pusch.
[Explaining how fighting games use delay-based and
rollback netcode](https://words.infil.net/w02-netcode.html). *words.infil.net* and *arstechnica.com*, October 2019.
Archived at [perma.cc/DE7W-RDJ8](https://perma.cc/DE7W-RDJ8)
[[45](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DeCandia2007_ch6-marker)] Giuseppe DeCandia, Deniz Hastorun, Madan
Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian,
Peter Vosshall, and Werner Vogels.
[Dynamo: Amazon's
Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating Systems Principles*
(SOSP), October 2007.
[doi:10.1145/1323293.1294281](https://doi.org/10.1145/1323293.1294281)
[[46](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Shapiro2011-marker)] Marc Shapiro, Nuno Preguiça, Carlos Baquero, and
Marek Zawirski. [A Comprehensive Study
of Convergent and Commutative Replicated Data Types](https://inria.hal.science/inria-00555588v1/document). INRIA Research Report no. 7506, January
2011.
[[47](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Sun1998-marker)] Chengzheng Sun and Clarence Ellis.
[Operational
Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=aef660812c5a9c4d3f06775f9455eeb090a4ff0f). At
*ACM Conference on Computer Supported Cooperative Work* (CSCW), November 1998.
[doi:10.1145/289444.289469](https://doi.org/10.1145/289444.289469)
[[48](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gentle2025-marker)] Joseph Gentle and Martin Kleppmann.
[Collaborative Text Editing with Eg-walker: Better,
Faster, Smaller](https://arxiv.org/abs/2409.14252). At *20th European Conference on Computer Systems* (EuroSys), March 2025.
[doi:10.1145/3689031.3696076](https://doi.org/10.1145/3689031.3696076)
[[49](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Shukla2018-marker)] Dharma Shukla.
[Azure
Cosmos DB: Pushing the frontier of globally distributed databases](https://azure.microsoft.com/en-us/blog/azure-cosmos-db-pushing-the-frontier-of-globally-distributed-databases/). *azure.microsoft.com*, September 2018.
Archived at [perma.cc/UT3B-HH6R](https://perma.cc/UT3B-HH6R)
[[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gifford1979-marker)] David K. Gifford.
[Weighted Voting for
Replicated Data](https://www.cs.cmu.edu/~15-749/READINGS/required/availability/gifford79.pdf). At *7th ACM Symposium on Operating Systems Principles* (SOSP), December 1979.
[doi:10.1145/800215.806583](https://doi.org/10.1145/800215.806583)
[[51](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Howard2016_ch6-marker)] Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman.
[Flexible Paxos:
Quorum Intersection Revisited](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.OPODIS.2016.25). At *20th International Conference on Principles of Distributed
Systems* (OPODIS), December 2016.
[doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25)
[[52](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Blomstedt2012ricon-marker)] Joseph Blomstedt.
[Bringing Consistency to Riak](https://vimeo.com/51973001). At *RICON West*,
October 2012.
[[53](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Bailis2014pbs-marker)] Peter Bailis, Shivaram Venkataraman,
Michael J. Franklin, Joseph M. Hellerstein, and Ion Stoica.
[Quantifying eventual consistency with
PBS](http://www.bailis.org/papers/pbs-vldbj2014.pdf). *The VLDB Journal*, volume 23, pages 279–302, April 2014.
[doi:10.1007/s00778-013-0330-1](https://doi.org/10.1007/s00778-013-0330-1)
[[54](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Breck2019-marker)] Colin Breck.
[Shared-Nothing
Architectures for Server Replication and Synchronization](https://blog.colinbreck.com/shared-nothing-architectures-for-server-replication-and-synchronization/). *blog.colinbreck.com*, December 2019.
Archived at [perma.cc/48P3-J6CJ](https://perma.cc/48P3-J6CJ)
[[55](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Dean2013_ch6-marker)] Jeffrey Dean and Luiz André Barroso.
[The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/).
*Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013.
[doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794)
[[56](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Huang2017_ch6-marker)] Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R.
Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao.
[Gray
Failure: The Achilles Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in
Operating Systems* (HotOS), May 2017.
[doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005)
[[57](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lamport1978_ch6-marker)] Leslie Lamport.
[Time,
Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*,
volume 21, issue 7, pages 558–565, July 1978.
[doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563)
[[58](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#ParkerJr1983-marker)] D. Stott Parker Jr., Gerald J. Popek, Gerard
Rudisin, Allen Stoughton, Bruce J. Walker, Evelyn Walton, Johanna M. Chow, David Edwards, Stephen
Kiser, and Charles Kline.
[Detection of
Mutual Inconsistency in Distributed Systems](https://pages.cs.wisc.edu/~remzi/Classes/739/Papers/parker83detection.pdf). *IEEE Transactions on Software Engineering*,
volume SE-9, issue 3, pages 240–247, May 1983.
[doi:10.1109/TSE.1983.236733](https://doi.org/10.1109/TSE.1983.236733)
[[59](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Preguica2010-marker)] Nuno Preguiça, Carlos Baquero, Paulo Sérgio
Almeida, Victor Fonte, and Ricardo Gonçalves. [Dotted
Version Vectors: Logical Clocks for Optimistic Replication](https://arxiv.org/abs/1011.5808). arXiv:1011.5808, November 2010.
[[60](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Manepalli2022-marker)] Giridhar Manepalli.
[Clocks and Causality - Ordering Events
in Distributed Systems](https://www.exhypothesi.com/clocks-and-causality/). *exhypothesi.com*, November 2022.
Archived at [perma.cc/8REU-KVLQ](https://perma.cc/8REU-KVLQ)
[[61](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Cribbs2014-marker)] Sean Cribbs.
[A Brief History of Time in Riak](https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak).
At *RICON*, October 2014. Archived at [perma.cc/7U9P-6JFX](https://perma.cc/7U9P-6JFX)
[[62](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Brown2015-marker)] Russell Brown.
[Vector
Clocks Revisited Part 2: Dotted Version Vectors](https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/). *riak.com*, November 2015.
Archived at [perma.cc/96QP-W98R](https://perma.cc/96QP-W98R)
[[63](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Baquero2011-marker)] Carlos Baquero.
[Version
Vectors Are Not Vector Clocks](https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/). *haslab.wordpress.com*, July 2011.
Archived at [perma.cc/7PNU-4AMG](https://perma.cc/7PNU-4AMG)
[[64](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Schwarz1994-marker)] Reinhard Schwarz and Friedemann Mattern.
[Detecting Causal
Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). *Distributed
Computing*, volume 7, issue 3, pages 149–174, March 1994.
[doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859)
[^1]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)
[^2]: Kenny Gryp. [MySQL Terminology Updates](https://dev.mysql.com/blog-archive/mysql-terminology-updates/). *dev.mysql.com*, July 2020. Archived at [perma.cc/S62G-6RJ2](https://perma.cc/S62G-6RJ2)
[^3]: Oracle Corporation. [Oracle (Active) Data Guard 19c: Real-Time Data Protection and Availability](https://www.oracle.com/technetwork/database/availability/dg-adg-technical-overview-wp-5347548.pdf). White Paper, *oracle.com*, March 2019. Archived at [perma.cc/P5ST-RPKE](https://perma.cc/P5ST-RPKE)
[^4]: Microsoft. [What is an Always On availability group?](https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/overview-of-always-on-availability-groups-sql-server) *learn.microsoft.com*, September 2024. Archived at [perma.cc/ABH6-3MXF](https://perma.cc/ABH6-3MXF)
[^5]: Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, and Akshat Vig. [Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical Conference* (ATC), July 2022.
[^6]: Rebecca Taft, Irfan Sharif, Andrei Matei, Nathan VanBenschoten, Jordan Lewis, Tobias Grieger, Kai Niemi, Andy Woods, Anne Birzin, Raphael Poss, Paul Bardea, Amruta Ranade, Ben Darnell, Bram Gruneir, Justin Jaffray, Lucy Zhang, and Peter Mattis. [CockroachDB: The Resilient Geo-Distributed SQL Database](https://dl.acm.org/doi/abs/10.1145/3318464.3386134). At *ACM SIGMOD International Conference on Management of Data* (SIGMOD), pages 1493–1509, June 2020. [doi:10.1145/3318464.3386134](https://doi.org/10.1145/3318464.3386134)
[^7]: Dongxu Huang, Qi Liu, Qiu Cui, Zhuhe Fang, Xiaoyu Ma, Fei Xu, Li Shen, Liu Tang, Yuxing Zhou, Menglong Huang, Wan Wei, Cong Liu, Jian Zhang, Jianjun Li, Xuelian Wu, Lingyu Song, Ruoxi Sun, Shuaipeng Yu, Lei Zhao, Nicholas Cameron, Liquan Pei, and Xin Tang. [TiDB: a Raft-based HTAP database](https://www.vldb.org/pvldb/vol13/p3072-huang.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue 12, pages 3072–3084. [doi:10.14778/3415478.3415535](https://doi.org/10.14778/3415478.3415535)
[^8]: Mallory Knodel and Niels ten Oever. [Terminology, Power, and Inclusive Language in Internet-Drafts and RFCs](https://www.ietf.org/archive/id/draft-knodel-terminology-14.html). *IETF Internet-Draft*, August 2023. Archived at [perma.cc/5ZY9-725E](https://perma.cc/5ZY9-725E)
[^9]: Buck Hodges. [Postmortem: VSTS 4 September 2018](https://devblogs.microsoft.com/devopsservice/?p=17485). *devblogs.microsoft.com*, September 2018. Archived at [perma.cc/ZF5R-DYZS](https://perma.cc/ZF5R-DYZS)
[^10]: Gunnar Morling. [Leader Election With S3 Conditional Writes](https://www.morling.dev/blog/leader-election-with-s3-conditional-writes/). *www.morling.dev*, August 2024. Archived at [perma.cc/7V2N-J78Y](https://perma.cc/7V2N-J78Y)
[^11]: Vignesh Chandramohan, Rohan Desai, and Chris Riccomini. [SlateDB Manifest Design](https://github.com/slatedb/slatedb/blob/main/rfcs/0001-manifest.md). *github.com*, May 2024. Archived at [perma.cc/8EUY-P32Z](https://perma.cc/8EUY-P32Z)
[^12]: Stas Kelvich. [Why does Neon use Paxos instead of Raft, and what's the difference?](https://neon.tech/blog/paxos) *neon.tech*, August 2022. Archived at [perma.cc/SEZ4-2GXU](https://perma.cc/SEZ4-2GXU)
[^13]: Dimitri Fontaine. [An introduction to the pg\_auto\_failover project](https://tapoueh.org/blog/2021/11/an-introduction-to-the-pg_auto_failover-project/). *tapoueh.org*, November 2021. Archived at [perma.cc/3WH5-6BAF](https://perma.cc/3WH5-6BAF)
[^14]: Jesse Newland. [GitHub availability this week](https://github.blog/news-insights/the-library/github-availability-this-week/). *github.blog*, September 2012. Archived at [perma.cc/3YRF-FTFJ](https://perma.cc/3YRF-FTFJ)
[^15]: Mark Imbriaco. [Downtime last Saturday](https://github.blog/news-insights/the-library/downtime-last-saturday/). *github.blog*, December 2012. Archived at [perma.cc/M7X5-E8SQ](https://perma.cc/M7X5-E8SQ)
[^16]: John Hugg. [All In with Determinism for Performance and Testing in Distributed Systems](https://www.youtube.com/watch?v=gJRj3vJL4wE). At *Strange Loop*, September 2015.
[^17]: Hironobu Suzuki. [The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017.
[^18]: Amit Kapila. [WAL Internals of PostgreSQL](https://www.pgcon.org/2012/schedule/attachments/258_212_Internals%20Of%20PostgreSQL%20Wal.pdf). At *PostgreSQL Conference* (PGCon), May 2012. Archived at [perma.cc/6225-3SUX](https://perma.cc/6225-3SUX)
[^19]: Amit Kapila. [Evolution of Logical Replication](https://amitkapila16.blogspot.com/2023/09/evolution-of-logical-replication.html). *amitkapila16.blogspot.com*, September 2023. Archived at [perma.cc/F9VX-JLER](https://perma.cc/F9VX-JLER)
[^20]: Aru Petchimuthu. [Upgrade your Amazon RDS for PostgreSQL or Amazon Aurora PostgreSQL database, Part 2: Using the pglogical extension](https://aws.amazon.com/blogs/database/part-2-upgrade-your-amazon-rds-for-postgresql-database-using-the-pglogical-extension/). *aws.amazon.com*, August 2021. Archived at [perma.cc/RXT8-FS2T](https://perma.cc/RXT8-FS2T)
[^21]: Yogeshwer Sharma, Philippe Ajoux, Petchean Ang, David Callies, Abhishek Choudhary, Laurent Demailly, Thomas Fersch, Liat Atsmon Guz, Andrzej Kotulski, Sachin Kulkarni, Sanjeev Kumar, Harry Li, Jun Li, Evgeniy Makeev, Kowshik Prakasam, Robbert van Renesse, Sabyasachi Roy, Pratyush Seth, Yee Jiun Song, Benjamin Wester, Kaushik Veeraraghavan, and Peter Xie. [Wormhole: Reliable Pub-Sub to Support Geo-Replicated Internet Services](https://www.usenix.org/system/files/conference/nsdi15/nsdi15-paper-sharma.pdf). At *12th USENIX Symposium on Networked Systems Design and Implementation* (NSDI), May 2015.
[^22]: Douglas B. Terry. [Replicated Data Consistency Explained Through Baseball](https://www.microsoft.com/en-us/research/publication/replicated-data-consistency-explained-through-baseball/). Microsoft Research, Technical Report MSR-TR-2011-137, October 2011. Archived at [perma.cc/F4KZ-AR38](https://perma.cc/F4KZ-AR38)
[^23]: Douglas B. Terry, Alan J. Demers, Karin Petersen, Mike J. Spreitzer, Marvin M. Theimer, and Brent B. Welch. [Session Guarantees for Weakly Consistent Replicated Data](https://csis.pace.edu/~marchese/CS865/Papers/SessionGuaranteesPDIS.pdf). At *3rd International Conference on Parallel and Distributed Information Systems* (PDIS), September 1994. [doi:10.1109/PDIS.1994.331722](https://doi.org/10.1109/PDIS.1994.331722)
[^24]: Werner Vogels. [Eventually Consistent](https://queue.acm.org/detail.cfm?id=1466448). *ACM Queue*, volume 6, issue 6, pages 14–19, October 2008. [doi:10.1145/1466443.1466448](https://doi.org/10.1145/1466443.1466448)
[^25]: Simon Willison. [Reply to: “My thoughts about Fly.io (so far) and other newish technology I'm getting into”](https://news.ycombinator.com/item?id=31434055). *news.ycombinator.com*, May 2022. Archived at [perma.cc/ZRV4-WWV8](https://perma.cc/ZRV4-WWV8)
[^26]: Nithin Tharakan. [Scaling Bitbucket's Database](https://www.atlassian.com/blog/bitbucket/scaling-bitbuckets-database). *atlassian.com*, October 2020. Archived at [perma.cc/JAB7-9FGX](https://perma.cc/JAB7-9FGX)
[^27]: Terry Pratchett. *Reaper Man: A Discworld Novel*. Victor Gollancz, 1991. ISBN: 978-0-575-04979-6
[^28]: Peter Bailis, Alan Fekete, Michael J. Franklin, Ali Ghodsi, Joseph M. Hellerstein, and Ion Stoica. [Coordination Avoidance in Database Systems](https://arxiv.org/abs/1402.2237). *Proceedings of the VLDB Endowment*, volume 8, issue 3, pages 185–196, November 2014. [doi:10.14778/2735508.2735509](https://doi.org/10.14778/2735508.2735509)
[^29]: Yaser Raja and Peter Celentano. [PostgreSQL bi-directional replication using pglogical](https://aws.amazon.com/blogs/database/postgresql-bi-directional-replication-using-pglogical/). *aws.amazon.com*, January 2022. Archived at [perma.cc/BUQ2-5QWN](https://perma.cc/BUQ2-5QWN)
[^30]: Robert Hodges. [If You \*Must\* Deploy Multi-Master Replication, Read This First](https://scale-out-blog.blogspot.com/2012/04/if-you-must-deploy-multi-master.html). *scale-out-blog.blogspot.com*, April 2012. Archived at [perma.cc/C2JN-F6Y8](https://perma.cc/C2JN-F6Y8)
[^31]: Lars Hofhansl. [HBASE-7709: Infinite Loop Possible in Master/Master Replication](https://issues.apache.org/jira/browse/HBASE-7709). *issues.apache.org*, January 2013. Archived at [perma.cc/24G2-8NLC](https://perma.cc/24G2-8NLC)
[^32]: John Day-Richter. [What's Different About the New Google Docs: Making Collaboration Fast](https://drive.googleblog.com/2010/09/whats-different-about-new-google-docs.html). *drive.googleblog.com*, September 2010. Archived at [perma.cc/5TL8-TSJ2](https://perma.cc/5TL8-TSJ2)
[^33]: Evan Wallace. [How Figma's multiplayer technology works](https://www.figma.com/blog/how-figmas-multiplayer-technology-works/). *figma.com*, October 2019. Archived at [perma.cc/L49H-LY4D](https://perma.cc/L49H-LY4D)
[^34]: Tuomas Artman. [Scaling the Linear Sync Engine](https://linear.app/blog/scaling-the-linear-sync-engine). *linear.app*, June 2023.
[^35]: Amr Saafan. [Why Sync Engines Might Be the Future of Web Applications](https://www.nilebits.com/blog/2024/09/sync-engines-future-web-applications/). *nilebits.com*, September 2024. Archived at [perma.cc/5N73-5M3V](https://perma.cc/5N73-5M3V)
[^36]: Isaac Hagoel. [Are Sync Engines The Future of Web Applications?](https://dev.to/isaachagoel/are-sync-engines-the-future-of-web-applications-1bbi) *dev.to*, July 2024. Archived at [perma.cc/R9HF-BKKL](https://perma.cc/R9HF-BKKL)
[^37]: Sujay Jayakar. [A Map of Sync](https://stack.convex.dev/a-map-of-sync). *stack.convex.dev*, October 2024. Archived at [perma.cc/82R3-H42A](https://perma.cc/82R3-H42A)
[^38]: Alex Feyerke. [Designing Offline-First Web Apps](https://alistapart.com/article/offline-first/). *alistapart.com*, December 2013. Archived at [perma.cc/WH7R-S2DS](https://perma.cc/WH7R-S2DS)
[^39]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: You own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019, pages 154–178. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
[^40]: Martin Kleppmann. [The past, present, and future of local-first](https://martin.kleppmann.com/2024/05/30/local-first-conference.html). At *Local-First Conference*, May 2024.
[^41]: Conrad Hofmeyr. [API Calling is to Sync Engines as jQuery is to React](https://www.powersync.com/blog/api-calling-is-to-sync-engines-as-jquery-is-to-react). *powersync.com*, November 2024. Archived at [perma.cc/2FP9-7WJJ](https://perma.cc/2FP9-7WJJ)
[^42]: Peter van Hardenberg and Martin Kleppmann. [PushPin: Towards Production-Quality Peer-to-Peer Collaboration](https://martin.kleppmann.com/papers/pushpin-papoc20.pdf). At *7th Workshop on Principles and Practice of Consistency for Distributed Data* (PaPoC), April 2020. [doi:10.1145/3380787.3393683](https://doi.org/10.1145/3380787.3393683)
[^43]: Leonard Kawell, Jr., Steven Beckhardt, Timothy Halvorsen, Raymond Ozzie, and Irene Greif. [Replicated document management in a group communication system](https://dl.acm.org/doi/pdf/10.1145/62266.1024798). At *ACM Conference on Computer-Supported Cooperative Work* (CSCW), September 1988. [doi:10.1145/62266.1024798](https://doi.org/10.1145/62266.1024798)
[^44]: Ricky Pusch. [Explaining how fighting games use delay-based and rollback netcode](https://words.infil.net/w02-netcode.html). *words.infil.net* and *arstechnica.com*, October 2019. Archived at [perma.cc/DE7W-RDJ8](https://perma.cc/DE7W-RDJ8)
[^45]: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels. [Dynamo: Amazon's Highly Available Key-Value Store](https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). At *21st ACM Symposium on Operating Systems Principles* (SOSP), October 2007. [doi:10.1145/1323293.1294281](https://doi.org/10.1145/1323293.1294281)
[^46]: Marc Shapiro, Nuno Preguiça, Carlos Baquero, and Marek Zawirski. [A Comprehensive Study of Convergent and Commutative Replicated Data Types](https://inria.hal.science/inria-00555588v1/document). INRIA Research Report no. 7506, January 2011.
[^47]: Chengzheng Sun and Clarence Ellis. [Operational Transformation in Real-Time Group Editors: Issues, Algorithms, and Achievements](https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=aef660812c5a9c4d3f06775f9455eeb090a4ff0f). At *ACM Conference on Computer Supported Cooperative Work* (CSCW), November 1998. [doi:10.1145/289444.289469](https://doi.org/10.1145/289444.289469)
[^48]: Joseph Gentle and Martin Kleppmann. [Collaborative Text Editing with Eg-walker: Better, Faster, Smaller](https://arxiv.org/abs/2409.14252). At *20th European Conference on Computer Systems* (EuroSys), March 2025. [doi:10.1145/3689031.3696076](https://doi.org/10.1145/3689031.3696076)
[^49]: Dharma Shukla. [Azure Cosmos DB: Pushing the frontier of globally distributed databases](https://azure.microsoft.com/en-us/blog/azure-cosmos-db-pushing-the-frontier-of-globally-distributed-databases/). *azure.microsoft.com*, September 2018. Archived at [perma.cc/UT3B-HH6R](https://perma.cc/UT3B-HH6R)
[^50]: David K. Gifford. [Weighted Voting for Replicated Data](https://www.cs.cmu.edu/~15-749/READINGS/required/availability/gifford79.pdf). At *7th ACM Symposium on Operating Systems Principles* (SOSP), December 1979. [doi:10.1145/800215.806583](https://doi.org/10.1145/800215.806583)
[^51]: Heidi Howard, Dahlia Malkhi, and Alexander Spiegelman. [Flexible Paxos: Quorum Intersection Revisited](https://drops.dagstuhl.de/entities/document/10.4230/LIPIcs.OPODIS.2016.25). At *20th International Conference on Principles of Distributed Systems* (OPODIS), December 2016. [doi:10.4230/LIPIcs.OPODIS.2016.25](https://doi.org/10.4230/LIPIcs.OPODIS.2016.25)
[^52]: Joseph Blomstedt. [Bringing Consistency to Riak](https://vimeo.com/51973001). At *RICON West*, October 2012.
[^53]: Peter Bailis, Shivaram Venkataraman, Michael J. Franklin, Joseph M. Hellerstein, and Ion Stoica. [Quantifying eventual consistency with PBS](http://www.bailis.org/papers/pbs-vldbj2014.pdf). *The VLDB Journal*, volume 23, pages 279–302, April 2014. [doi:10.1007/s00778-013-0330-1](https://doi.org/10.1007/s00778-013-0330-1)
[^54]: Colin Breck. [Shared-Nothing Architectures for Server Replication and Synchronization](https://blog.colinbreck.com/shared-nothing-architectures-for-server-replication-and-synchronization/). *blog.colinbreck.com*, December 2019. Archived at [perma.cc/48P3-J6CJ](https://perma.cc/48P3-J6CJ)
[^55]: Jeffrey Dean and Luiz André Barroso. [The Tail at Scale](https://cacm.acm.org/research/the-tail-at-scale/). *Communications of the ACM*, volume 56, issue 2, pages 74–80, February 2013. [doi:10.1145/2408776.2408794](https://doi.org/10.1145/2408776.2408794)
[^56]: Peng Huang, Chuanxiong Guo, Lidong Zhou, Jacob R. Lorch, Yingnong Dang, Murali Chintalapati, and Randolph Yao. [Gray Failure: The Achilles Heel of Cloud-Scale Systems](https://www.microsoft.com/en-us/research/wp-content/uploads/2017/06/paper-1.pdf). At *16th Workshop on Hot Topics in Operating Systems* (HotOS), May 2017. [doi:10.1145/3102980.3103005](https://doi.org/10.1145/3102980.3103005)
[^57]: Leslie Lamport. [Time, Clocks, and the Ordering of Events in a Distributed System](https://www.microsoft.com/en-us/research/publication/time-clocks-ordering-events-distributed-system/). *Communications of the ACM*, volume 21, issue 7, pages 558–565, July 1978. [doi:10.1145/359545.359563](https://doi.org/10.1145/359545.359563)
[^58]: D. Stott Parker Jr., Gerald J. Popek, Gerard Rudisin, Allen Stoughton, Bruce J. Walker, Evelyn Walton, Johanna M. Chow, David Edwards, Stephen Kiser, and Charles Kline. [Detection of Mutual Inconsistency in Distributed Systems](https://pages.cs.wisc.edu/~remzi/Classes/739/Papers/parker83detection.pdf). *IEEE Transactions on Software Engineering*, volume SE-9, issue 3, pages 240–247, May 1983. [doi:10.1109/TSE.1983.236733](https://doi.org/10.1109/TSE.1983.236733)
[^59]: Nuno Preguiça, Carlos Baquero, Paulo Sérgio Almeida, Victor Fonte, and Ricardo Gonçalves. [Dotted Version Vectors: Logical Clocks for Optimistic Replication](https://arxiv.org/abs/1011.5808). arXiv:1011.5808, November 2010.
[^60]: Giridhar Manepalli. [Clocks and Causality - Ordering Events in Distributed Systems](https://www.exhypothesi.com/clocks-and-causality/). *exhypothesi.com*, November 2022. Archived at [perma.cc/8REU-KVLQ](https://perma.cc/8REU-KVLQ)
[^61]: Sean Cribbs. [A Brief History of Time in Riak](https://speakerdeck.com/seancribbs/a-brief-history-of-time-in-riak). At *RICON*, October 2014. Archived at [perma.cc/7U9P-6JFX](https://perma.cc/7U9P-6JFX)
[^62]: Russell Brown. [Vector Clocks Revisited Part 2: Dotted Version Vectors](https://riak.com/posts/technical/vector-clocks-revisited-part-2-dotted-version-vectors/). *riak.com*, November 2015. Archived at [perma.cc/96QP-W98R](https://perma.cc/96QP-W98R)
[^63]: Carlos Baquero. [Version Vectors Are Not Vector Clocks](https://haslab.wordpress.com/2011/07/08/version-vectors-are-not-vector-clocks/). *haslab.wordpress.com*, July 2011. Archived at [perma.cc/7PNU-4AMG](https://perma.cc/7PNU-4AMG)
[^64]: Reinhard Schwarz and Friedemann Mattern. [Detecting Causal Relationships in Distributed Computations: In Search of the Holy Grail](https://disco.ethz.ch/courses/hs08/seminar/papers/mattern4.pdf). *Distributed Computing*, volume 7, issue 3, pages 149–174, March 1994. [doi:10.1007/BF02277859](https://doi.org/10.1007/BF02277859)

@ -58,7 +58,7 @@ In many other systems, partitioning is just another word for sharding.
While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to
one theory, the term arose from the online role-play game *Ultima Online*, in which a magic crystal
was shattered into pieces, and each of those shards refracted a copy of the game world
[[3](/en/ch7#Koster2009)].
[^3].
The term *shard* thus came to mean one of a set of parallel game servers, and later was carried over
to databases. Another theory is that *shard* was originally an acronym of *System for Highly
Available Replicated Data*—reportedly a 1980s database, details of which are lost to history.
@ -88,7 +88,7 @@ single-shard database.
The reason for this recommendation is that sharding often adds complexity: you typically have to
decide which records to put in which shard by choosing a *partition key*; all records with the
same partition key are placed in the same shard
[[4](/en/ch7#Fidalgo2021)].
[^4].
This choice matters because accessing a record is fast if you know which shard it's in, but if you
don't know the shard you have to do an inefficient search across all shards, and the sharding scheme
is difficult to change.
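The trade-off described here can be sketched in a few lines of Python. This is a minimal illustration, not how any particular database routes requests: the fixed `NUM_SHARDS`, the hash-modulo routing in `shard_for`, and the in-memory `shards` list are all assumptions made for the example.

```python
import hashlib

NUM_SHARDS = 4  # hypothetical fixed shard count, chosen only for illustration


def shard_for(partition_key: str) -> int:
    """Map a partition key to a shard number (hash-modulo routing is an assumption)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_SHARDS


# Each shard is modelled as a plain dict keyed by (partition_key, record_id).
shards = [{} for _ in range(NUM_SHARDS)]


def put(partition_key: str, record_id: str, value: str) -> None:
    """Store a record in the one shard that owns its partition key."""
    shards[shard_for(partition_key)][(partition_key, record_id)] = value


def get_with_key(partition_key: str, record_id: str):
    """Fast path: the partition key tells us which single shard to ask."""
    return shards[shard_for(partition_key)].get((partition_key, record_id))


def find_without_key(record_id: str) -> list:
    """Slow path: without the partition key, every shard must be scanned."""
    return [
        value
        for shard in shards
        for (_, rid), value in shard.items()
        if rid == record_id
    ]


put("tenant-a", "order-1", "3 widgets")
put("tenant-b", "order-2", "1 gadget")
print(get_with_key("tenant-a", "order-1"))   # touches one shard
print(find_without_key("order-2"))           # touches all shards
```

In this sketch, changing `NUM_SHARDS` would send most existing keys to a different shard, which is one concrete way a sharding scheme can be hard to change after data has been loaded.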
@ -108,10 +108,10 @@ some systems don't support them at all.
Some systems use sharding even on a single machine, typically running one single-threaded process
per CPU core to make use of the parallelism in the CPU, or to take advantage of a *nonuniform memory
access* (NUMA) architecture in which some banks of memory are closer to one CPU than to others
[[5](/en/ch7#Drepper2007)].
[^5].
For example, Redis, VoltDB, and FoundationDB use one process per core, and rely on sharding to
spread load across CPU cores in the same machine
[[6](/en/ch7#Zhou2021_ch7)].
[^6].
## Sharding for Multitenancy
@ -125,7 +125,7 @@ Sometimes sharding is used to implement multitenant systems: either each tenant
shard, or multiple small tenants may be grouped together into a larger shard. These shards might be
physically separate databases (which we previously touched on in [“Embedded storage engines”](/en/ch4#sidebar_embedded)), or
separately manageable portions of a larger logical database
[[7](/en/ch7#Slot2023)].
[^7].
Using sharding for multitenancy has several advantages:
Resource isolation
@ -143,19 +143,19 @@ Cell-based architecture
tenants are grouped into a self-contained *cell*, and different cells are set up such that they
can run largely independently from each other. This approach provides *fault isolation*: that is,
a fault in one cell remains limited to that cell, and tenants in other cells are not affected
[[8](/en/ch7#Oliveira2023)].
[^8].
Per-tenant backup and restore
: Backing up each tenant's shard separately makes it possible to restore a tenant's state from a
backup without affecting other tenants, which can be useful in case the tenant accidentally
deletes or overwrites important data
[[9](/en/ch7#Shapira2023dont)].
[^9].
Regulatory compliance
: Data privacy regulation such as the GDPR gives individuals the right to access and delete all data
stored about them. If each person's data is stored in a separate shard, this translates into
simple data export and deletion operations on their shard
[[10](/en/ch7#Schwarzkopf2019)].
[^10].
Data residence
: If a particular tenant's data needs to be stored in a particular jurisdiction in order to comply
@ -166,14 +166,14 @@ Gradual schema rollout
: Schema migrations (previously discussed in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)) can be rolled
out gradually, one tenant at a time. This reduces risk, as you can detect problems before they
affect all tenants, but it can be difficult to do transactionally
[[11](/en/ch7#Shapira2024)].
[^11].
The main challenges around using sharding for multitenancy are:
* It assumes that each individual tenant is small enough to fit on a single node. If that is not the
case, and you have a single tenant that's too big for one machine, you would need to additionally
perform sharding within a single tenant, which brings us back to the topic of sharding for
scalability [[12](/en/ch7#Ganguli2020)].
scalability [^12].
* If you have many small tenants, then creating a separate shard for each one may incur too much
overhead. You could group several small tenants together into a bigger shard, but then you have
the problem of how you move tenants from one shard to another as they grow.
@ -227,7 +227,7 @@ The shard boundaries might be chosen manually by an administrator, or the databa
automatically. Manual key-range sharding is used by Vitess (a sharding layer for MySQL), for
example; the automatic variant is used by Bigtable, its open source equivalent HBase, the
range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB
[[6](/en/ch7#Zhou2021_ch7)]. YugabyteDB offers both manual and automatic
[^6]. YugabyteDB offers both manual and automatic
tablet splitting.
Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in
@ -242,7 +242,7 @@ lot of writes to nearby keys. For example, if the key is a timestamp, then the s
ranges of time—e.g., one shard per month. Unfortunately, if you write data from the sensors to the
database as the measurements happen, all the writes end up going to the same shard (the one for
this month), so that shard can be overloaded with writes while others sit idle
[[13](/en/ch7#Lan2011)].
[^13].
To avoid this problem in the sensor database, you need to use something other than the timestamp as
the first element of the key. For example, you could prefix each timestamp with the sensor ID so
@ -257,7 +257,7 @@ When you first set up your database, there are no key ranges to split into shard
such as HBase and MongoDB, allow you to configure an initial set of shards on an empty database,
which is called *pre-splitting*. This requires that you already have some idea of what the key
distribution is going to look like, so that you can choose appropriate key range boundaries
[[14](/en/ch7#Soztutar2013split)].
[^14].
Later on, as your data volume and write throughput grow, a system with key-range sharding grows by
splitting an existing shard into two or more smaller shards, each of which holds a contiguous
@ -276,7 +276,7 @@ With databases that manage shard boundaries automatically, a shard split is typi
An advantage of key-range sharding is that the number of shards adapts to the data volume. If there
is only a small amount of data, a small number of shards is sufficient, so overheads are small; if
there is a huge amount of data, the size of each individual shard is limited to a configurable
maximum [[15](/en/ch7#Evans2013)].
maximum [^15].
A downside of this approach is that splitting a shard is an expensive operation, since it requires
all of its data to be rewritten into new files, similarly to a compaction in a log-structured
@ -301,7 +301,7 @@ uses MD5, whereas Cassandra and ScyllaDB use Murmur3. Many programming languages
functions built in (as they are used for hash tables), but they may not be suitable for sharding:
for example, in Java's `Object.hashCode()` and Ruby's `Object#hash`, the same key may have a
different hash value in different processes, making them unsuitable for sharding
[[16](/en/ch7#Kleppmann2012hash)].
[^16].
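The same pitfall exists in Python, whose built-in `hash()` for strings is randomized per process by default. A minimal sketch of a process-independent alternative, assuming MD5 as the hash function and a hypothetical fixed count of 16 shards (both chosen only for illustration, not taken from any particular database):

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count, for illustration only


def stable_hash(key: str) -> int:
    """Return a 64-bit integer derived from MD5; identical in every process."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an unsigned big-endian integer.
    return int.from_bytes(digest[:8], "big")


def shard_for(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Map a key to a shard number in the range [0, num_shards)."""
    return stable_hash(key) % num_shards
```

Because the result depends only on the bytes of the key, every client and every node computes the same shard for the same key, which is the property the built-in hash functions mentioned above do not guarantee.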
### Hash modulo number of nodes
@ -350,7 +350,7 @@ used for any reads and writes that happen while the transfer is in progress.
It's common to choose the number of shards to be a number that is divisible by many factors, so that
the dataset can be evenly split across various different numbers of nodes—not requiring the number
of nodes to be a power of 2, for example [[4](/en/ch7#Fidalgo2021)].
of nodes to be a power of 2, for example [^4].
You can even account for mismatched hardware in your cluster: by assigning more shards to nodes that
are more powerful, you can make those nodes take a greater share of the load.
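As a rough sketch of this idea, the following assigns a fixed number of shards to nodes in proportion to a per-node weight. The function name `assign_shards`, the node names, and the weights are hypothetical, and the sketch assumes the shard count is a multiple of the total weight so that the split is exact:

```python
def assign_shards(num_shards: int, node_weights: dict[str, int]) -> dict[int, str]:
    """Assign shard numbers to nodes in proportion to each node's weight."""
    total_weight = sum(node_weights.values())
    if num_shards % total_weight != 0:
        raise ValueError("num_shards must be a multiple of the total weight")
    shards_per_unit = num_shards // total_weight

    assignment: dict[int, str] = {}
    shard = 0
    for node, weight in node_weights.items():
        # A node with weight w receives w * shards_per_unit shards.
        for _ in range(weight * shards_per_unit):
            assignment[shard] = node
            shard += 1
    return assignment
```

For example, with 12 shards and weights `{"a": 1, "b": 1, "c": 2}`, nodes `a` and `b` each receive 3 shards and the more powerful node `c` receives 6. A shard count such as 12 also divides evenly across 2, 3, 4, or 6 equally weighted nodes.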
@ -412,7 +412,7 @@ supports cluster keys. Clustering data not only improves range scan performance,
improve compression and filtering performance as well.
Hash-range sharding is used in YugabyteDB and DynamoDB
[[17](/en/ch7#Elhemali2022_ch7)], and is an option in MongoDB.
[^17], and is an option in MongoDB.
Cassandra and ScyllaDB use a variant of this approach that is illustrated in
[Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
@ -427,7 +427,7 @@ those imbalances tend to even out
###### Figure 7-6. Cassandra and ScyllaDB split the range of possible hash values (here 0–1023) into contiguous ranges with random boundaries, and assign several ranges to each node.
When nodes are added or removed, range boundaries are added and removed, and shards are split or
merged accordingly [[19](/en/ch7#Lambov2016)].
merged accordingly [^19].
In the example of [Figure 7-6](/en/ch7#fig_sharding_cassandra), when node 3 is added, node 1
transfers parts of two of its ranges to node 3, and node 2 transfers part of one of its ranges to
node 3. This has the effect of giving the new node an approximately fair share of the dataset,
@ -447,13 +447,13 @@ the same shard as much as possible.
The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of
consistent hashing
[[20](/en/ch7#Karger1997)],
[^20],
but several other consistent hashing algorithms have also been proposed
[[21](/en/ch7#Gryski2018)],
[^21],
such as *highest random weight*, also known as *rendezvous hashing*
[[22](/en/ch7#Thaler1998)],
[^22],
and *jump consistent hash*
[[23](/en/ch7#Lamping2014)].
[^23].
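To make one of these alternatives concrete, here is a minimal sketch of rendezvous (highest random weight) hashing. It assumes SHA-256 as the scoring hash and uses hypothetical node names; it illustrates the general technique and is not the algorithm of any specific database:

```python
import hashlib


def score(key: str, node: str) -> int:
    """Pseudo-random but deterministic weight for a (key, node) pair."""
    digest = hashlib.sha256(f"{node}:{key}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def rendezvous_node(key: str, nodes: list[str]) -> str:
    """Pick the node with the highest score for this key."""
    return max(nodes, key=lambda node: score(key, node))
```

When a node is added, a key changes owner only if the new node's score for that key beats all existing scores, so the only keys that move are those that move to the new node.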
With Cassandra's algorithm, if one node is added, a small number of existing shards are split into
sub-ranges; on the other hand, with rendezvous and jump consistent hashes, the new node is assigned
individual keys that were previously scattered across all of the other nodes. Which one is
@ -468,7 +468,7 @@ some keys is much higher than to others—you can still end up with some servers
while others sit almost idle.
For example, on a social media site, a celebrity user with millions of followers may cause a storm
of activity when they do something [[24](/en/ch7#Axon2010_ch7)].
of activity when they do something [^24].
This event can result in a large volume of reads and writes to the same key (where the partition key
is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on).
@ -477,7 +477,7 @@ In such situations, a more flexible sharding policy is required
[26](/en/ch7#Lee2021)].
A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put
an individual hot key in a shard by its own, and perhaps even assigning it a dedicated machine
[[27](/en/ch7#Fritchie2018)].
[^27].
It's also possible to compensate for skew at the application level. For example, if one key is known
to be very hot, a simple technique is to add a random number to the beginning or end of the key.
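A minimal sketch of this key-salting technique, assuming a hypothetical two-digit suffix separated by `#` and 100 salt values (names and format are illustrative only):

```python
import random

SALT_BUCKETS = 100  # hypothetical number of sub-keys per hot key


def salted_write_key(hot_key: str) -> str:
    """Append a random two-digit suffix so writes spread over SALT_BUCKETS keys."""
    return f"{hot_key}#{random.randrange(SALT_BUCKETS):02d}"


def all_read_keys(hot_key: str) -> list[str]:
    """Every sub-key a reader must fetch and combine to see all writes."""
    return [f"{hot_key}#{i:02d}" for i in range(SALT_BUCKETS)]
```

The sketch makes the cost visible: writes are spread across many keys (and therefore potentially many shards), but a read now has to fetch all of the sub-keys and merge the results.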
@ -499,8 +499,8 @@ necessitating different strategies for handling them.
Some systems (especially cloud services designed for large scale) have automated approaches for
dealing with hot shards; for example, Amazon calls it *heat management*
[[28](/en/ch7#Warfield2023_ch7)]
or *adaptive capacity* [[17](/en/ch7#Elhemali2022_ch7)].
[^28]
or *adaptive capacity* [^17].
The details of how these systems work go beyond the scope of this book.
## Operations: Automatic or Manual Rebalancing
@ -527,7 +527,7 @@ another. If it is not done carefully, this process can overload the network or t
might harm the performance of other requests. The system must continue processing writes while the
rebalancing is in progress; if a system is near its maximum write throughput, the shard-splitting
process might not even be able to keep up with the rate of incoming writes
[[29](/en/ch7#Houlihan2017)].
[^29].
Such automation can be dangerous in combination with automatic failure detection. For example, say
one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that
@ -667,7 +667,7 @@ shards. Whenever you write to the database—to add, remove, or update a records
deal with the shard that contains the record that you are writing. For that reason, this type of
secondary index is known as a *local index*. In an information retrieval context it is also known as
a *document-partitioned index*
[[30](/en/ch7#Manning2008_ch7)].
[^30].
When reading from a local secondary index, if you already know the partition key of the record
you're looking for, you can just perform the search on the appropriate shard. Moreover, if you only
@ -685,10 +685,10 @@ shards lets you store more data, but it doesnt increase your query throughput
process every query anyway.
Nevertheless, local secondary indexes are widely used
[[31](/en/ch7#Busch2012)]:
for example, MongoDB, Riak, Cassandra [[32](/en/ch7#HarEl2017)],
Elasticsearch [[33](/en/ch7#Tong2013)], SolrCloud,
and VoltDB [[34](/en/ch7#Pavlo2013)]
[^31]:
for example, MongoDB, Riak, Cassandra [^32],
Elasticsearch [^33], SolrCloud,
and VoltDB [^34]
all use local secondary indexes.
## Global Secondary Indexes
@ -709,7 +709,7 @@ The index on the make of car is partitioned similarly (with the shard boundary b
###### Figure 7-10. A global secondary index reflects data from all shards, and is itself sharded by the indexed value.
This kind of index is also called *term-partitioned*
[[30](/en/ch7#Manning2008_ch7)]:
[^30]:
recall from [“Full-Text Search”](/en/ch4#sec_storage_full_text) that in full-text search, a *term* is a keyword in a text that
you can search for. Here we generalise it to mean any value that you can search for in the secondary
index.
@ -728,7 +728,7 @@ certain make, or searching for multiple words occurring in the same text), it
terms will be assigned to different shards. To compute the logical AND of the two conditions, the
system needs to find all the IDs that occur in both of the postings lists. That's no problem if the
postings lists are short, but if they are long, it can be slow to send them over the network to
compute their intersection [[30](/en/ch7#Manning2008_ch7)].
compute their intersection [^30].
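The intersection step itself can be sketched as a linear merge, assuming each postings list is a sorted list of integer record IDs (the function name and the IDs in the usage note are hypothetical):

```python
def intersect_postings(a: list[int], b: list[int]) -> list[int]:
    """Return the IDs present in both sorted postings lists."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            # The ID matches both conditions; keep it and advance both lists.
            result.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return result
```

For instance, intersecting `[191, 306, 768]` with `[191, 515, 768, 893]` yields `[191, 768]`. The merge is cheap once both lists are on the same machine; the expense described above comes from shipping a long list from one index shard to another before the merge can run.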
Another challenge with global secondary indexes is that writes are more complicated than with local
indexes, because writing a single record might affect multiple shards of the index (every term in
@ -797,191 +797,41 @@ that question in the following chapters.
##### Footnotes
##### References
[[1](/en/ch7#Giordano2023-marker)] Claire Giordano.
[Understanding
partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023.
Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959)
[[2](/en/ch7#Leach2022-marker)] Brandur Leach.
[Partitioning in Postgres, 2022
edition](https://brandur.org/fragments/postgres-partitioning-2022). *brandur.org*, October 2022.
Archived at [perma.cc/Z5LE-6AKX](https://perma.cc/Z5LE-6AKX)
[[3](/en/ch7#Koster2009-marker)] Raph Koster.
[Database “sharding”
came from UO?](https://www.raphkoster.com/2009/01/08/database-sharding-came-from-uo/) *raphkoster.com*, January 2009.
Archived at [perma.cc/4N9U-5KYF](https://perma.cc/4N9U-5KYF)
[[4](/en/ch7#Fidalgo2021-marker)] Garrett Fidalgo.
[Herding elephants: Lessons learned
from sharding Postgres at Notion](https://www.notion.com/blog/sharding-postgres-at-notion). *notion.com*, October 2021.
Archived at [perma.cc/5J5V-W2VX](https://perma.cc/5J5V-W2VX)
[[5](/en/ch7#Drepper2007-marker)] Ulrich Drepper.
[What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf).
*akkadia.org*, November 2007. Archived at
[perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ)
[[6](/en/ch7#Zhou2021_ch7-marker)] Jingyu Zhou, Meng Xu, Alexander Shraer, Bala
Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach,
Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin
Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav.
[FoundationDB: A Distributed Unbundled
Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data*
(SIGMOD), June 2021.
[doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559)
[[7](/en/ch7#Slot2023-marker)] Marco Slot.
[Citus 12:
Schema-based sharding for PostgreSQL](https://www.citusdata.com/blog/2023/07/18/citus-12-schema-based-sharding-for-postgres/). *citusdata.com*, July 2023.
Archived at [perma.cc/R874-EC9W](https://perma.cc/R874-EC9W)
[[8](/en/ch7#Oliveira2023-marker)] Robisson Oliveira.
[Reducing
the Scope of Impact with Cell-Based Architecture](https://docs.aws.amazon.com/pdfs/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/reducing-scope-of-impact-with-cell-based-architecture.pdf). AWS Well-Architected white paper, Amazon Web
Services, September 2023.
Archived at [perma.cc/4KWW-47NR](https://perma.cc/4KWW-47NR)
[[9](/en/ch7#Shapira2023dont-marker)] Gwen Shapira.
[Things DBs Don't Do - But Should](https://www.thenile.dev/blog/things-dbs-dont-do).
*thenile.dev*, February 2023.
Archived at [perma.cc/C3J4-JSFW](https://perma.cc/C3J4-JSFW)
[[10](/en/ch7#Schwarzkopf2019-marker)] Malte Schwarzkopf, Eddie Kohler, M. Frans
Kaashoek, and Robert Morris.
[Position: GDPR
Compliance by Construction](https://cs.brown.edu/people/malte/pub/papers/2019-poly-gdpr.pdf). At *Towards Polystores that manage multiple Databases, Privacy,
Security and/or Policy Issues for Heterogenous Data* (Poly), August 2019.
[doi:10.1007/978-3-030-33752-0\_3](https://doi.org/10.1007/978-3-030-33752-0_3)
[[11](/en/ch7#Shapira2024-marker)] Gwen Shapira.
[Introducing pg\_karnak: Transactional schema
migration across tenant databases](https://www.thenile.dev/blog/distributed-ddl). *thenile.dev*, November 2024.
Archived at [perma.cc/R5RD-8HR9](https://perma.cc/R5RD-8HR9)
[[12](/en/ch7#Ganguli2020-marker)] Arka Ganguli, Guido Iaquinti,
Maggie Zhou, and Rafael Chacón.
[Scaling Datastores at
Slack with Vitess](https://slack.engineering/scaling-datastores-at-slack-with-vitess/). *slack.engineering*, December 2020.
Archived at [perma.cc/UW8F-ALJK](https://perma.cc/UW8F-ALJK)
[[13](/en/ch7#Lan2011-marker)] Ikai Lan.
[App
Engine Datastore Tip: Monotonically Increasing Values Are Bad](https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/). *ikaisays.com*,
January 2011. Archived at [perma.cc/BPX8-RPJB](https://perma.cc/BPX8-RPJB)
[[14](/en/ch7#Soztutar2013split-marker)] Enis Soztutar.
[Apache
HBase Region Splitting and Merging](https://www.cloudera.com/blog/technical/apache-hbase-region-splitting-and-merging.html). *cloudera.com*, February 2013.
Archived at [perma.cc/S9HS-2X2C](https://perma.cc/S9HS-2X2C)
[[15](/en/ch7#Evans2013-marker)] Eric Evans.
[Rethinking Topology in Cassandra](https://www.youtube.com/watch?v=Qz6ElTdYjjU). At
*Cassandra Summit*, June 2013.
Archived at [perma.cc/2DKM-F438](https://perma.cc/2DKM-F438)
[[16](/en/ch7#Kleppmann2012hash-marker)] Martin Kleppmann.
[Java's
hashCode Is Not Safe for Distributed Systems](https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html). *martin.kleppmann.com*, June 2012.
Archived at [perma.cc/LK5U-VZSN](https://perma.cc/LK5U-VZSN)
[[17](/en/ch7#Elhemali2022_ch7-marker)] Mostafa Elhemali, Niall Gallagher, Nicholas
Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu
Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul,
Doug Terry, and Akshat Vig.
[Amazon DynamoDB: A Scalable,
Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical
Conference* (ATC), July 2022.
[[18](/en/ch7#Williams2012-marker)] Brandon Williams.
[Virtual Nodes in Cassandra
1.2](https://www.datastax.com/blog/virtual-nodes-cassandra-12). *datastax.com*, December 2012.
Archived at [perma.cc/N385-EQXV](https://perma.cc/N385-EQXV)
[[19](/en/ch7#Lambov2016-marker)] Branimir Lambov.
[New Token
Allocation Algorithm in Cassandra 3.0](https://www.datastax.com/blog/new-token-allocation-algorithm-cassandra-30). *datastax.com*, January 2016.
Archived at [perma.cc/2BG7-LDWY](https://perma.cc/2BG7-LDWY)
[[20](/en/ch7#Karger1997-marker)] David Karger, Eric Lehman, Tom Leighton, Rina
Panigrahy, Matthew Levine, and Daniel Lewin.
[Consistent Hashing and Random Trees:
Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web](https://people.csail.mit.edu/karger/Papers/web.pdf).
At *29th Annual ACM Symposium on Theory of Computing* (STOC), May 1997.
[doi:10.1145/258533.258660](https://doi.org/10.1145/258533.258660)
[[21](/en/ch7#Gryski2018-marker)] Damian Gryski.
[Consistent
Hashing: Algorithmic Tradeoffs](https://dgryski.medium.com/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8). *dgryski.medium.com*, April 2018.
Archived at [perma.cc/B2WF-TYQ8](https://perma.cc/B2WF-TYQ8)
[[22](/en/ch7#Thaler1998-marker)] David G. Thaler and Chinya V. Ravishankar.
[Using name-based mappings to increase
hit rates](https://www.cs.kent.edu/~javed/DL/web/p1-thaler.pdf). *IEEE/ACM Transactions on Networking*, volume 6, issue 1, pages 1–14, February 1998.
[doi:10.1109/90.663936](https://doi.org/10.1109/90.663936)
[[23](/en/ch7#Lamping2014-marker)] John Lamping and Eric Veach.
[A Fast, Minimal Memory, Consistent Hash
Algorithm](https://arxiv.org/abs/1406.2294). *arxiv.org*, June 2014.
[[24](/en/ch7#Axon2010_ch7-marker)] Samuel Axon.
[3% of Twitter's Servers
Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010.
Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX)
[[25](/en/ch7#Guo2020-marker)] Gerald Guo and Thawan Kooburat.
[Scaling
services with Shard Manager](https://engineering.fb.com/2020/08/24/production-engineering/scaling-services-with-shard-manager/). *engineering.fb.com*, August 2020.
Archived at [perma.cc/EFS3-XQYT](https://perma.cc/EFS3-XQYT)
[[26](/en/ch7#Lee2021-marker)] Sangmin Lee, Zhenhua Guo, Omer Sunercan, Jun Ying, Thawan
Kooburat, Suryadeep Biswal, Jun Chen, Kun Huang, Yatpang Cheung, Yiding Zhou, Kaushik Veeraraghavan,
Biren Damani, Pol Mauri Ruiz, Vikas Mehta, and Chunqiang Tang.
[Shard Manager: A Generic Shard
Management Framework for Geo-distributed Applications](https://dl.acm.org/doi/pdf/10.1145/3477132.3483546). *28th ACM SIGOPS Symposium on
Operating Systems Principles* (SOSP), pages 553–569, October 2021.
[doi:10.1145/3477132.3483546](https://doi.org/10.1145/3477132.3483546)
[[27](/en/ch7#Fritchie2018-marker)] Scott Lystig Fritchie.
[A Critique of Resizable Hash
Tables: Riak Core & Random Slicing](https://www.infoq.com/articles/dynamo-riak-random-slicing/). *infoq.com*, August 2018.
Archived at [perma.cc/RPX7-7BLN](https://perma.cc/RPX7-7BLN)
[[28](/en/ch7#Warfield2023_ch7-marker)] Andy Warfield.
[Building
and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023.
Archived at [perma.cc/6S7P-GLM4](https://perma.cc/6S7P-GLM4)
[[29](/en/ch7#Houlihan2017-marker)] Rich Houlihan.
[DynamoDB adaptive capacity: smooth performance
for chaotic workloads (DAT327)](https://www.youtube.com/watch?v=kMY0_m29YzU). At *AWS re:Invent*, November 2017.
[[30](/en/ch7#Manning2008_ch7-marker)] Christopher D. Manning, Prabhakar Raghavan,
and Hinrich Schütze.
[*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/).
Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at
[nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/)
[[31](/en/ch7#Busch2012-marker)] Michael Busch, Krishna Gade, Brian Larson, Patrick
Lok, Samuel Luckenbill, and Jimmy Lin.
[Earlybird:
Real-Time Search at Twitter](https://cs.uwaterloo.ca/~jimmylin/publications/Busch_etal_ICDE2012.pdf). At *28th IEEE International Conference on Data Engineering*
(ICDE), April 2012.
[doi:10.1109/ICDE.2012.149](https://doi.org/10.1109/ICDE.2012.149)
[[32](/en/ch7#HarEl2017-marker)] Nadav Har'El.
[Indexing in Cassandra 3](https://github.com/scylladb/scylladb/wiki/Indexing-in-Cassandra-3).
*github.com*, April 2017.
Archived at [perma.cc/3ENV-8T9P](https://perma.cc/3ENV-8T9P)
[[33](/en/ch7#Tong2013-marker)] Zachary Tong.
[Customizing Your
Document Routing](https://www.elastic.co/blog/customizing-your-document-routing/). *elastic.co*, June 2013.
Archived at [perma.cc/97VM-MREN](https://perma.cc/97VM-MREN)
[[34](/en/ch7#Pavlo2013-marker)] Andrew Pavlo.
[H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/).
*hstore.cs.brown.edu*, October 2013.
Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z)
[^1]: Claire Giordano. [Understanding partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023. Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959)
[^2]: Brandur Leach. [Partitioning in Postgres, 2022 edition](https://brandur.org/fragments/postgres-partitioning-2022). *brandur.org*, October 2022. Archived at [perma.cc/Z5LE-6AKX](https://perma.cc/Z5LE-6AKX)
[^3]: Raph Koster. [Database “sharding” came from UO?](https://www.raphkoster.com/2009/01/08/database-sharding-came-from-uo/) *raphkoster.com*, January 2009. Archived at [perma.cc/4N9U-5KYF](https://perma.cc/4N9U-5KYF)
[^4]: Garrett Fidalgo. [Herding elephants: Lessons learned from sharding Postgres at Notion](https://www.notion.com/blog/sharding-postgres-at-notion). *notion.com*, October 2021. Archived at [perma.cc/5J5V-W2VX](https://perma.cc/5J5V-W2VX)
[^5]: Ulrich Drepper. [What Every Programmer Should Know About Memory](https://www.akkadia.org/drepper/cpumemory.pdf). *akkadia.org*, November 2007. Archived at [perma.cc/NU6Q-DRXZ](https://perma.cc/NU6Q-DRXZ)
[^6]: Jingyu Zhou, Meng Xu, Alexander Shraer, Bala Namasivayam, Alex Miller, Evan Tschannen, Steve Atherton, Andrew J. Beamon, Rusty Sears, John Leach, Dave Rosenthal, Xin Dong, Will Wilson, Ben Collins, David Scherer, Alec Grieser, Young Liu, Alvin Moore, Bhaskar Muppana, Xiaoge Su, and Vishesh Yadav. [FoundationDB: A Distributed Unbundled Transactional Key Value Store](https://www.foundationdb.org/files/fdb-paper.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2021. [doi:10.1145/3448016.3457559](https://doi.org/10.1145/3448016.3457559)
[^7]: Marco Slot. [Citus 12: Schema-based sharding for PostgreSQL](https://www.citusdata.com/blog/2023/07/18/citus-12-schema-based-sharding-for-postgres/). *citusdata.com*, July 2023. Archived at [perma.cc/R874-EC9W](https://perma.cc/R874-EC9W)
[^8]: Robisson Oliveira. [Reducing the Scope of Impact with Cell-Based Architecture](https://docs.aws.amazon.com/pdfs/wellarchitected/latest/reducing-scope-of-impact-with-cell-based-architecture/reducing-scope-of-impact-with-cell-based-architecture.pdf). AWS Well-Architected white paper, Amazon Web Services, September 2023. Archived at [perma.cc/4KWW-47NR](https://perma.cc/4KWW-47NR)
[^9]: Gwen Shapira. [Things DBs Don't Do - But Should](https://www.thenile.dev/blog/things-dbs-dont-do). *thenile.dev*, February 2023. Archived at [perma.cc/C3J4-JSFW](https://perma.cc/C3J4-JSFW)
[^10]: Malte Schwarzkopf, Eddie Kohler, M. Frans Kaashoek, and Robert Morris. [Position: GDPR Compliance by Construction](https://cs.brown.edu/people/malte/pub/papers/2019-poly-gdpr.pdf). At *Towards Polystores that manage multiple Databases, Privacy, Security and/or Policy Issues for Heterogenous Data* (Poly), August 2019. [doi:10.1007/978-3-030-33752-0\_3](https://doi.org/10.1007/978-3-030-33752-0_3)
[^11]: Gwen Shapira. [Introducing pg\_karnak: Transactional schema migration across tenant databases](https://www.thenile.dev/blog/distributed-ddl). *thenile.dev*, November 2024. Archived at [perma.cc/R5RD-8HR9](https://perma.cc/R5RD-8HR9)
[^12]: Arka Ganguli, Guido Iaquinti, Maggie Zhou, and Rafael Chacón. [Scaling Datastores at Slack with Vitess](https://slack.engineering/scaling-datastores-at-slack-with-vitess/). *slack.engineering*, December 2020. Archived at [perma.cc/UW8F-ALJK](https://perma.cc/UW8F-ALJK)
[^13]: Ikai Lan. [App Engine Datastore Tip: Monotonically Increasing Values Are Bad](https://ikaisays.com/2011/01/25/app-engine-datastore-tip-monotonically-increasing-values-are-bad/). *ikaisays.com*, January 2011. Archived at [perma.cc/BPX8-RPJB](https://perma.cc/BPX8-RPJB)
[^14]: Enis Soztutar. [Apache HBase Region Splitting and Merging](https://www.cloudera.com/blog/technical/apache-hbase-region-splitting-and-merging.html). *cloudera.com*, February 2013. Archived at [perma.cc/S9HS-2X2C](https://perma.cc/S9HS-2X2C)
[^15]: Eric Evans. [Rethinking Topology in Cassandra](https://www.youtube.com/watch?v=Qz6ElTdYjjU). At *Cassandra Summit*, June 2013. Archived at [perma.cc/2DKM-F438](https://perma.cc/2DKM-F438)
[^16]: Martin Kleppmann. [Java's hashCode Is Not Safe for Distributed Systems](https://martin.kleppmann.com/2012/06/18/java-hashcode-unsafe-for-distributed-systems.html). *martin.kleppmann.com*, June 2012. Archived at [perma.cc/LK5U-VZSN](https://perma.cc/LK5U-VZSN)
[^17]: Mostafa Elhemali, Niall Gallagher, Nicholas Gordon, Joseph Idziorek, Richard Krog, Colin Lazier, Erben Mo, Akhilesh Mritunjai, Somu Perianayagam, Tim Rath, Swami Sivasubramanian, James Christopher Sorenson III, Sroaj Sosothikul, Doug Terry, and Akshat Vig. [Amazon DynamoDB: A Scalable, Predictably Performant, and Fully Managed NoSQL Database Service](https://www.usenix.org/conference/atc22/presentation/elhemali). At *USENIX Annual Technical Conference* (ATC), July 2022.
[^18]: Brandon Williams. [Virtual Nodes in Cassandra 1.2](https://www.datastax.com/blog/virtual-nodes-cassandra-12). *datastax.com*, December 2012. Archived at [perma.cc/N385-EQXV](https://perma.cc/N385-EQXV)
[^19]: Branimir Lambov. [New Token Allocation Algorithm in Cassandra 3.0](https://www.datastax.com/blog/new-token-allocation-algorithm-cassandra-30). *datastax.com*, January 2016. Archived at [perma.cc/2BG7-LDWY](https://perma.cc/2BG7-LDWY)
[^20]: David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. [Consistent Hashing and Random Trees: Distributed Caching Protocols for Relieving Hot Spots on the World Wide Web](https://people.csail.mit.edu/karger/Papers/web.pdf). At *29th Annual ACM Symposium on Theory of Computing* (STOC), May 1997. [doi:10.1145/258533.258660](https://doi.org/10.1145/258533.258660)
[^21]: Damian Gryski. [Consistent Hashing: Algorithmic Tradeoffs](https://dgryski.medium.com/consistent-hashing-algorithmic-tradeoffs-ef6b8e2fcae8). *dgryski.medium.com*, April 2018. Archived at [perma.cc/B2WF-TYQ8](https://perma.cc/B2WF-TYQ8)
[^22]: David G. Thaler and Chinya V. Ravishankar. [Using name-based mappings to increase hit rates](https://www.cs.kent.edu/~javed/DL/web/p1-thaler.pdf). *IEEE/ACM Transactions on Networking*, volume 6, issue 1, pages 1–14, February 1998. [doi:10.1109/90.663936](https://doi.org/10.1109/90.663936)
[^23]: John Lamping and Eric Veach. [A Fast, Minimal Memory, Consistent Hash Algorithm](https://arxiv.org/abs/1406.2294). *arxiv.org*, June 2014.
[^24]: Samuel Axon. [3% of Twitter's Servers Dedicated to Justin Bieber](https://mashable.com/archive/justin-bieber-twitter). *mashable.com*, September 2010. Archived at [perma.cc/F35N-CGVX](https://perma.cc/F35N-CGVX)
[^25]: Gerald Guo and Thawan Kooburat. [Scaling services with Shard Manager](https://engineering.fb.com/2020/08/24/production-engineering/scaling-services-with-shard-manager/). *engineering.fb.com*, August 2020. Archived at [perma.cc/EFS3-XQYT](https://perma.cc/EFS3-XQYT)
[^26]: Sangmin Lee, Zhenhua Guo, Omer Sunercan, Jun Ying, Thawan Kooburat, Suryadeep Biswal, Jun Chen, Kun Huang, Yatpang Cheung, Yiding Zhou, Kaushik Veeraraghavan, Biren Damani, Pol Mauri Ruiz, Vikas Mehta, and Chunqiang Tang. [Shard Manager: A Generic Shard Management Framework for Geo-distributed Applications](https://dl.acm.org/doi/pdf/10.1145/3477132.3483546). *28th ACM SIGOPS Symposium on Operating Systems Principles* (SOSP), pages 553–569, October 2021. [doi:10.1145/3477132.3483546](https://doi.org/10.1145/3477132.3483546)
[^27]: Scott Lystig Fritchie. [A Critique of Resizable Hash Tables: Riak Core & Random Slicing](https://www.infoq.com/articles/dynamo-riak-random-slicing/). *infoq.com*, August 2018. Archived at [perma.cc/RPX7-7BLN](https://perma.cc/RPX7-7BLN)
[^28]: Andy Warfield. [Building and operating a pretty big storage system called S3](https://www.allthingsdistributed.com/2023/07/building-and-operating-a-pretty-big-storage-system.html). *allthingsdistributed.com*, July 2023. Archived at [perma.cc/6S7P-GLM4](https://perma.cc/6S7P-GLM4)
[^29]: Rich Houlihan. [DynamoDB adaptive capacity: smooth performance for chaotic workloads (DAT327)](https://www.youtube.com/watch?v=kMY0_m29YzU). At *AWS re:Invent*, November 2017.
[^30]: Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze. [*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/). Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at [nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/)
[^31]: Michael Busch, Krishna Gade, Brian Larson, Patrick Lok, Samuel Luckenbill, and Jimmy Lin. [Earlybird: Real-Time Search at Twitter](https://cs.uwaterloo.ca/~jimmylin/publications/Busch_etal_ICDE2012.pdf). At *28th IEEE International Conference on Data Engineering* (ICDE), April 2012. [doi:10.1109/ICDE.2012.149](https://doi.org/10.1109/ICDE.2012.149)
[^32]: Nadav Har'El. [Indexing in Cassandra 3](https://github.com/scylladb/scylladb/wiki/Indexing-in-Cassandra-3). *github.com*, April 2017. Archived at [perma.cc/3ENV-8T9P](https://perma.cc/3ENV-8T9P)
[^33]: Zachary Tong. [Customizing Your Document Routing](https://www.elastic.co/blog/customizing-your-document-routing/). *elastic.co*, June 2013. Archived at [perma.cc/97VM-MREN](https://perma.cc/97VM-MREN)
[^34]: Andrew Pavlo. [H-Store Frequently Asked Questions](https://hstore.cs.brown.edu/documentation/faq/). *hstore.cs.brown.edu*, October 2013. Archived at [perma.cc/X3ZA-DW6Z](https://perma.cc/X3ZA-DW6Z)

File diff suppressed because it is too large

File diff suppressed because it is too large


@ -1,6 +1,6 @@
baseURL: 'https://ddia.vonng.com/'
languageCode: 'zh-CN'
title: '设计数据密集型应用'
title: '设计数据密集型应用第二版'
enableRobotsTXT: true
# Parse Git commit
@ -28,7 +28,7 @@ languages:
languageCode: zh
contentDir: content/zh
weight: 1
title: 设计数据密集型应用
title: 设计数据密集型应用(第二版)
v2:
languageName: 第二版
languageCode: v2
@ -40,27 +40,29 @@ languages:
languageCode: tw
contentDir: content/tw
weight: 3
title: 設計資料密集型應用
title: 設計資料密集型應用(第二版)
en:
languageName: English
languageCode: en
contentDir: content/en
weight: 4
title: Designing Data-Intensive Applications
title: Designing Data-Intensive Applications 2nd Edition
markup:
highlight:
noClasses: false
goldmark:
renderer:
unsafe: true
extensions:
passthrough:
delimiters:
block: [['\[', '\]'], ['$$', '$$']]
inline: [['\(', '\)']]
enable: true
footnote: true # enable footnote syntax: [^id] / [^id]: text
linkify: true # automatically turn URL text into links
table: true # enable Markdown tables
taskList: true # enable task lists [ ] / [x]
typographer: true # smart typography (quotes, dashes, etc.)
parser:
attribute: true # allow {#id .class key=val} after a heading, for explicit anchors
autoHeadingID: true # auto-generate IDs for headings (a hand-written {#id} overrides the generated one)
autoHeadingIDType: github # auto ID rule: github / blackfriday / none
tableOfContents:
startLevel: 2 # ToC starts at h2
endLevel: 4 # ToC ends at h4
menu:
main: