mirror of
https://github.com/Vonng/ddia.git
synced 2026-06-23 01:47:00 +08:00
1378 lines
90 KiB
Markdown
1378 lines
90 KiB
Markdown
---
|
||
title: "1. Trade-offs in Data Systems Architecture"
|
||
weight: 101
|
||
breadcrumbs: false
|
||
---
|
||
|
||
> *There are no solutions, there are only trade-offs. […] But you try to get the best
|
||
> trade-off you can get, and that’s all you can hope for.*
|
||
>
|
||
> [Thomas Sowell](https://www.youtube.com/watch?v=2YUtKr8-_Fg),
|
||
> Interview with Fred Barnes (2005)
|
||
|
||
Data is central to much application development today. With web and mobile apps, software as a
|
||
service (SaaS), and cloud services, it has become normal to store data from many different users in
|
||
a shared server-based data infrastructure. Data from user activity, business transactions, devices
|
||
and sensors needs to be stored and made available for analysis. As users interact with an
|
||
application, they both read the data that is stored, and also generate more data.
|
||
|
||
Small amounts of data, which can be stored and processed on a single machine, are often fairly easy
|
||
to deal with. However, as the data volume or the rate of queries grows, it needs to be distributed
|
||
across multiple machines, which introduces many challenges. As the needs of the application become
|
||
more complex, it is no longer sufficient to store everything in one system, but it might be
|
||
necessary to combine multiple storage or processing systems that provide different capabilities.
|
||
|
||
We call an application *data-intensive* if data management is one of the primary challenges in
|
||
developing the application [[1](/en/ch1#Kouzes2009)].
|
||
While in *compute-intensive* systems the challenge is parallelizing some very large computation, in
|
||
data-intensive applications we usually worry more about things like storing and processing large
|
||
data volumes, managing changes to data, ensuring consistency in the face of failures and
|
||
concurrency, and making sure services are highly available.
|
||
|
||
Such applications are typically built from standard building blocks that provide commonly needed
|
||
functionality. For example, many applications need to:
|
||
|
||
* Store data so that they, or another application, can find it again later (*databases*)
|
||
* Remember the result of an expensive operation, to speed up reads (*caches*)
|
||
* Allow users to search data by keyword or filter it in various ways (*search indexes*)
|
||
* Handle events and data changes as soon as they occur (*stream processing*)
|
||
* Periodically crunch a large amount of accumulated data (*batch processing*)
|
||
|
||
In building an application we typically take several software systems or services, such as databases
|
||
or APIs, and glue them together with some application code. If you are doing exactly what the data
|
||
systems were designed for, then this process can be quite easy.
|
||
|
||
However, as your application becomes more ambitious, challenges arise. There are many database
|
||
systems with different characteristics, suitable for different purposes—how do you choose which one
|
||
to use? There are various approaches to caching, several ways of building search indexes, and so
|
||
on—how do you reason about their trade-offs? You need to figure out which tools and which approaches
|
||
are the most appropriate for the task at hand, and it can be difficult to combine tools when you
|
||
need to do something that a single tool cannot do alone.
|
||
|
||
This book is a guide to help you make decisions about which technologies to use and how to combine
|
||
them. As you will see, there is no one approach that is fundamentally better than others; everything
|
||
has pros and cons. With this book, you will learn to ask the right questions to evaluate and compare
|
||
data systems, so that you can figure out which approach will best serve the needs of your particular
|
||
application.
|
||
|
||
We will start our journey by looking at some of the ways that data is typically used in
|
||
organizations today. Many of the ideas here have their origin in *enterprise software* (i.e., the
|
||
software needs and engineering practices of large organizations, such as big corporations and
|
||
governments), since historically, only large organizations had the large data volumes that required
|
||
sophisticated technical solutions. If your data volume is small enough, you can simply keep it in a
|
||
spreadsheet! However, more recently it has also become common for smaller companies and startups to
|
||
manage large data volumes and build data-intensive systems.
|
||
|
||
One of the key challenges with data systems is that different people need to do very different
|
||
things with data. If you are working at a company, you and your team will have one set of
|
||
priorities, while another team may have entirely different goals, even though you might be working
|
||
with the same dataset! Moreover, those goals might not be explicitly articulated, which can lead to
|
||
misunderstandings and disagreement about the right approach.
|
||
|
||
To help you understand what choices you can make, this chapter compares several contrasting
|
||
concepts, and explores their trade-offs:
|
||
|
||
* the difference between operational and analytical systems ([“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics));
|
||
* pros and cons of cloud services and self-hosted systems ([“Cloud versus Self-Hosting”](/en/ch1#sec_introduction_cloud));
|
||
* when to move from single-node systems to distributed systems ([“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed)); and
|
||
* balancing the needs of the business and the rights of the user ([“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance)).
|
||
|
||
Moreover, this chapter will provide you with terminology that we will need for the rest of the book.
|
||
|
||
# Terminology: Frontends and Backends
|
||
|
||
Much of what we will discuss in this book relates to *backend development*. To explain that term:
|
||
for web applications, the client-side code (which runs in a web browser) is called the *frontend*,
|
||
and the server-side code that handles user requests is known as the *backend*. Mobile apps are
|
||
similar to frontends in that they provide user interfaces, which often communicate over the Internet
|
||
with a server-side backend. Frontends sometimes manage data locally on the user’s device
|
||
[[2](/en/ch1#Kleppmann2019_ch1)],
|
||
but the greatest data infrastructure challenges often lie in the backend: a frontend only needs to
|
||
handle one user’s data, whereas the backend manages data on behalf of *all* of the users.
|
||
|
||
A backend service is often reachable via HTTP (sometimes WebSocket); it usually consists of some
|
||
application code that reads and writes data in one or more databases, and sometimes interfaces with
|
||
additional data systems such as caches or message queues (which we might collectively call *data
|
||
infrastructure*). The application code is often *stateless* (i.e., when it finishes handling one
|
||
HTTP request, it forgets everything about that request), and any information that needs to persist
|
||
from one request to another needs to be stored either on the client, or in the server-side data
|
||
infrastructure.
|
||
|
||
# Analytical versus Operational Systems
|
||
|
||
If you are working on data systems in an enterprise, you are likely to encounter several different
|
||
types of people who work with data. The first type are *backend engineers* who build services that
|
||
handle requests for reading and updating data; these services often serve external users, either
|
||
directly or indirectly via other services (see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)). Sometimes
|
||
services are for internal use by other parts of the organization.
|
||
|
||
In addition to the teams managing backend services, two other groups of people typically require
|
||
access to an organization’s data: *business analysts*, who generate reports about the activities of
|
||
the organization in order to help the management make better decisions (*business intelligence* or
|
||
*BI*), and *data scientists*, who look for novel insights in data or who create user-facing product
|
||
features that are enabled by data analysis and machine learning/AI (for example, “people who bought
|
||
X also bought Y” recommendations on an e-commerce website, predictive analytics such as risk scoring
|
||
or spam filtering, and ranking of search results).
|
||
|
||
Although business analysts and data scientists tend to use different tools and operate in different
|
||
ways, they have some things in common: both perform *analytics*, which means they look at the data
|
||
that the users and backend services have generated, but they generally do not modify this data
|
||
(except perhaps for fixing mistakes). They might create derived datasets in which the original data
|
||
has been processed in some way. This has led to a split between two types of systems—a distinction
|
||
that we will use throughout this book:
|
||
|
||
* *Operational systems* consist of the backend services and data infrastructure where data is
|
||
created, for example by serving external users. Here, the application code both reads and modifies
|
||
the data in its databases, based on the actions performed by the users.
|
||
* *Analytical systems* serve the needs of business analysts and data scientists. They contain a
|
||
read-only copy of the data from the operational systems, and they are optimized for the types of
|
||
data processing that are needed for analytics.
|
||
|
||
As we shall see in the next section, operational and analytical systems are often kept separate, for
|
||
good reasons. As these systems have matured, two new specialized roles have emerged: *data
|
||
engineers* and *analytics engineers*. Data engineers are the people who know how to integrate the
|
||
operational and the analytical systems, and who take responsibility for the organization’s data
|
||
infrastructure more widely [[3](/en/ch1#Reis2022)].
|
||
Analytics engineers model and transform data to make it more useful for the business analysts and
|
||
data scientists in an organization
|
||
[[4](/en/ch1#Machado2023)].
|
||
|
||
Many engineers specialize on either the operational or the analytical side. However, this book
|
||
covers both operational and analytical data systems, since both play an important role in the
|
||
lifecycle of data within an organization. We will explore in-depth the data infrastructure that is
|
||
used to deliver services both to internal and external users, so that you can work better with your
|
||
colleagues on the other side of this divide.
|
||
|
||
## Characterizing Transaction Processing and Analytics
|
||
|
||
In the early days of business data processing, a write to the database typically corresponded to a
|
||
*commercial transaction* taking place: making a sale, placing an order with a supplier, paying an
|
||
employee’s salary, etc. As databases expanded into areas that didn’t involve money changing hands,
|
||
the term *transaction* nevertheless stuck, referring to a group of reads and writes that form a
|
||
logical unit.
|
||
|
||
###### Note
|
||
|
||
[Chapter 8](/en/ch8#ch_transactions) explores in detail what we mean with a transaction. This chapter uses the term
|
||
loosely to refer to low-latency reads and writes.
|
||
|
||
Even though databases started being used for many different kinds of data—posts on social media,
|
||
moves in a game, contacts in an address book, and many others—the basic access pattern
|
||
remained similar to processing business transactions. An operational system typically looks up a
|
||
small number of records by some key (this is called a *point query*). Records are inserted, updated,
|
||
or deleted based on the user’s input. Because these applications are interactive, this access
|
||
pattern became known as *online transaction processing* (OLTP).
|
||
|
||
However, databases also started being increasingly used for analytics, which has very different
|
||
access patterns compared to OLTP. Usually an analytic query scans over a huge number of records, and
|
||
calculates aggregate statistics (such as count, sum, or average) rather than returning the
|
||
individual records to the user. For example, a business analyst at a supermarket chain may want to
|
||
answer analytic queries such as:
|
||
|
||
* What was the total revenue of each of our stores in January?
|
||
* How many more bananas than usual did we sell during our latest promotion?
|
||
* Which brand of baby food is most often purchased together with brand X diapers?
|
||
|
||
The reports that result from these types of queries are important for business intelligence, helping
|
||
the management decide what to do next. In order to differentiate this pattern of using databases
|
||
from transaction processing, it has been called *online analytic processing* (OLAP)
|
||
[[5](/en/ch1#Codd1993)].
|
||
The difference between OLTP and analytics is not always clear-cut, but some typical characteristics
|
||
are listed in [Table 1-1](/en/ch1#tab_oltp_vs_olap).
|
||
|
||
Table 1-1. Comparing characteristics of operational and analytic systems
|
||
|
||
| Property | Operational systems (OLTP) | Analytical systems (OLAP) |
|
||
| --- | --- | --- |
|
||
| Main read pattern | Point queries (fetch individual records by key) | Aggregate over large number of records |
|
||
| Main write pattern | Create, update, and delete individual records | Bulk import (ETL) or event stream |
|
||
| Human user example | End user of web/mobile application | Internal analyst, for decision support |
|
||
| Machine use example | Checking if an action is authorized | Detecting fraud/abuse patterns |
|
||
| Type of queries | Fixed set of queries, predefined by application | Analyst can make arbitrary queries |
|
||
| Data represents | Latest state of data (current point in time) | History of events that happened over time |
|
||
| Dataset size | Gigabytes to terabytes | Terabytes to petabytes |
|
||
|
||
###### Note
|
||
|
||
The meaning of *online* in *OLAP* is unclear; it probably refers to the fact that queries are not
|
||
just for predefined reports, but that analysts use the OLAP system interactively for explorative
|
||
queries.
|
||
|
||
With operational systems, users are generally not allowed to construct custom SQL queries and run
|
||
them on the database, since that would potentially allow them to read or modify data that they do
|
||
not have permission to access. Moreover, they might write queries that are expensive to execute, and
|
||
hence affect the database performance for other users. For these reasons, OLTP systems mostly run a
|
||
fixed set of queries that are baked into the application code, and use one-off custom queries only
|
||
occasionally for maintenance or troubleshooting. On the other hand, analytic databases usually give
|
||
their users the freedom to write arbitrary SQL queries by hand, or to generate queries automatically
|
||
using a data visualization or dashboard tool such as Tableau, Looker, or Microsoft Power BI.
|
||
|
||
There is also a type of systems that is designed for analytical workloads (queries that aggregate
|
||
over many records) but that are embedded into user-facing products. This category is known as
|
||
*product analytics* or *real-time analytics*, and systems designed for this type of use include
|
||
Pinot, Druid, and ClickHouse
|
||
[[6](/en/ch1#Soman2023)].
|
||
|
||
## Data Warehousing
|
||
|
||
At first, the same databases were used for both transaction processing and analytic queries. SQL
|
||
turned out to be quite flexible in this regard: it works well for both types of queries.
|
||
Nevertheless, in the late 1980s and early 1990s, there was a trend for companies to stop using their
|
||
OLTP systems for analytics purposes, and to run the analytics on a separate database system instead.
|
||
This separate database was called a *data warehouse*.
|
||
|
||
A large enterprise may have dozens, even hundreds, of online transaction processing systems:
|
||
systems powering the customer-facing website, controlling point of sale (checkout) systems in
|
||
physical stores, tracking inventory in warehouses, planning routes for vehicles, managing suppliers,
|
||
administering employees, and performing many other tasks. Each of these systems is complex and needs
|
||
a team of people to maintain it, so these systems end up operating mostly independently from each
|
||
other.
|
||
|
||
It is usually undesirable for business analysts and data scientists to directly query these OLTP
|
||
systems, for several reasons:
|
||
|
||
* the data of interest may be spread across multiple operational systems, making it difficult to
|
||
combine those datasets in a single query (a problem known as *data silos*);
|
||
* the kinds of schemas and data layouts that are good for OLTP are less well suited for analytics
|
||
(see [“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics));
|
||
* analytic queries can be quite expensive, and running them on an OLTP database would impact the
|
||
performance for other users; and
|
||
* the OLTP systems might reside in a separate network that users are not allowed direct access to
|
||
for security or compliance reasons.
|
||
|
||
A *data warehouse*, by contrast, is a separate database that analysts can query to their hearts’
|
||
content, without affecting OLTP operations
|
||
[[7](/en/ch1#Chaudhuri1997)].
|
||
As we shall see in [Chapter 4](/en/ch4#ch_storage), data warehouses often store data in a way that is very different
|
||
from OLTP databases, in order to optimize for the types of queries that are common in analytics.
|
||
|
||
The data warehouse contains a read-only copy of the data in all the various OLTP systems in the
|
||
company. Data is extracted from OLTP databases (using either a periodic data dump or a continuous
|
||
stream of updates), transformed into an analysis-friendly schema, cleaned up, and then loaded into
|
||
the data warehouse. This process of getting data into the data warehouse is known as
|
||
*Extract–Transform–Load* (ETL) and is illustrated in [Figure 1-1](/en/ch1#fig_dwh_etl). Sometimes the order of the
|
||
*transform* and *load* steps is swapped (i.e., the transformation is done in the data warehouse,
|
||
after loading), resulting in *ELT*.
|
||
|
||

|
||
|
||
###### Figure 1-1. Simplified outline of ETL into a data warehouse.
|
||
|
||
In some cases the data sources of the ETL processes are external SaaS products such as customer
|
||
relationship management (CRM), email marketing, or credit card processing systems. In those cases,
|
||
you do not have direct access to the original database, since it is accessible only via the software
|
||
vendor’s API. Bringing the data from these external systems into your own data warehouse can enable
|
||
analyses that are not possible via the SaaS API. ETL for SaaS APIs is often implemented by
|
||
specialist data connector services such as Fivetran, Singer, or AirByte.
|
||
|
||
Some database systems offer *hybrid transactional/analytic processing* (HTAP), which aims to enable
|
||
OLTP and analytics in a single system without requiring ETL from one system into another
|
||
[[8](/en/ch1#Ozcan2017),
|
||
[9](/en/ch1#Prout2022_ch1)].
|
||
However, many HTAP systems internally consist of an OLTP system coupled with a separate analytical
|
||
system, hidden behind a common interface—so the distinction between the two remains important for
|
||
understanding how these systems work.
|
||
|
||
Moreover, even though HTAP exists, it is common to have a separation between transactional and
|
||
analytic systems due to their different goals and requirements. In particular, it is considered good
|
||
practice for each operational system to have its own database (see
|
||
[“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), leading to hundreds of separate operational databases; on the
|
||
other hand, an enterprise usually has a single data warehouse, so that business analysts can combine
|
||
data from several operational systems in a single query.
|
||
|
||
HTAP therefore does not replace data warehouses. Rather, it is useful in scenarios where the same
|
||
application needs to both perform analytics queries that scan a large number of rows, and also
|
||
read and update individual records with low latency. Fraud detection can involve such workloads, for
|
||
example [[10](/en/ch1#Zhang2024)].
|
||
|
||
The separation between operational and analytical systems is part of a wider trend: as workloads
|
||
have become more demanding, systems have become more specialized and optimized for particular
|
||
workloads. General-purpose systems can handle small data volumes comfortably, but the greater the
|
||
scale, the more specialized systems tend to become
|
||
[[11](/en/ch1#Stonebraker2005fitsall)].
|
||
|
||
### From data warehouse to data lake
|
||
|
||
A data warehouse often uses a *relational* data model that is queried through SQL (see
|
||
[Chapter 3](/en/ch3#ch_datamodels)), perhaps using specialized business intelligence software. This model works well
|
||
for the types of queries that business analysts need to make, but it is less well suited to the
|
||
needs of data scientists, who might need to perform tasks such as:
|
||
|
||
* Transform data into a form that is suitable for training a machine learning model; often this
|
||
requires turning the rows and columns of a database table into a vector or matrix of numerical
|
||
values called *features*. The process of performing this transformation in a way that maximizes
|
||
the performance of the trained model is called *feature engineering*, and it often requires custom
|
||
code that is difficult to express using SQL.
|
||
* Take textual data (e.g., reviews of a product) and use natural language processing techniques to
|
||
try to extract structured information from it (e.g., the sentiment of the author, or which topics
|
||
they mention). Similarly, they might need to extract structured information from photos using
|
||
computer vision techniques.
|
||
|
||
Although there have been efforts to add machine learning operators to a SQL data model
|
||
[[12](/en/ch1#Cohen2009)]
|
||
and to build efficient machine learning systems on top of a relational foundation
|
||
[[13](/en/ch1#Olteanu2020)],
|
||
many data scientists prefer not to work in a relational database such as a data warehouse. Instead,
|
||
many prefer to use Python data analysis libraries such as pandas and scikit-learn, statistical
|
||
analysis languages such as R, and distributed analytics frameworks such as Spark
|
||
[[14](/en/ch1#Bornstein2020)].
|
||
We discuss these further in [“Dataframes, Matrices, and Arrays”](/en/ch3#sec_datamodels_dataframes).
|
||
|
||
Consequently, organizations face a need to make data available in a form that is suitable for use by
|
||
data scientists. The answer is a *data lake*: a centralized data repository that holds a copy of any
|
||
data that might be useful for analysis, obtained from operational systems via ETL processes. The
|
||
difference from a data warehouse is that a data lake simply contains files, without imposing any
|
||
particular file format or data model. Files in a data lake might be collections of database records,
|
||
encoded using a file format such as Avro or Parquet (see [Chapter 5](/en/ch5#ch_encoding)), but they can equally well
|
||
contain text, images, videos, sensor readings, sparse matrices, feature vectors, genome sequences,
|
||
or any other kind of data [[15](/en/ch1#Fowler2015)].
|
||
Besides being more flexible, this is also often cheaper than relational data storage, since the data
|
||
lake can use commoditized file storage such as object stores (see [“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native)).
|
||
|
||
ETL processes have been generalized to *data pipelines*, and in some cases the data lake has become
|
||
an intermediate stop on the path from the operational systems to the data warehouse. The data lake
|
||
contains data in a “raw” form produced by the operational systems, without the transformation into a
|
||
relational data warehouse schema. This approach has the advantage that each consumer of the data can
|
||
transform the raw data into a form that best suits their needs. It has been dubbed the *sushi
|
||
principle*: “raw data is better” [[16](/en/ch1#Johnson2015)].
|
||
|
||
Besides loading data from a data lake into a separate data warehouse, it is also possible to run
|
||
typical data warehousing workloads (SQL queries and business analytics) directly on the files in the
|
||
data lake, alongside data science/machine learning workloads. This architecture is known as a *data
|
||
lakehouse*, and it requires a query execution engine and a metadata (e.g., schema management) layer
|
||
that extend the data lake’s file storage
|
||
[[17](/en/ch1#Armbrust2021)].
|
||
|
||
Apache Hive, Spark SQL, Presto, and Trino are examples of this approach.
|
||
|
||
### Beyond the data lake
|
||
|
||
As analytics practices have matured, organizations have been increasingly paying attention to the
|
||
management and operations of analytics systems and data pipelines, as captured for example in the
|
||
DataOps manifesto [[18](/en/ch1#DataOps)].
|
||
Part of this are issues of governance, privacy, and compliance with regulation such as GDPR and
|
||
CCPA, which we discuss in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance) and [Link to Come].
|
||
|
||
Moreover, analytical data is increasingly made available not only as files and relational tables,
|
||
but also as streams of events (see [Link to Come]). With file-based data analysis you can re-run the
|
||
analysis periodically (e.g., daily) in order to respond to changes in the data, but stream processing
|
||
allows analytics systems to respond to events much faster, on the order of seconds. Depending on the
|
||
application and how time-sensitive it is, a stream processing approach can be valuable, for example
|
||
to identify and block potentially fraudulent or abusive activity.
|
||
|
||
In some cases the outputs of analytics systems are made available to operational systems (a process
|
||
sometimes known as *reverse ETL* [[19](/en/ch1#Manohar2021)]). For example, a
|
||
machine-learning model that was trained on data in an analytics system may be deployed to
|
||
production, so that it can generate recommendations for end-users, such as “people who bought X also
|
||
bought Y”. Such deployed outputs of analytics systems are also known as *data products*
|
||
[[20](/en/ch1#ORegan2018)].
|
||
Machine learning models can be deployed to operational systems using specialized tools such as
|
||
TFX, Kubeflow, or MLflow.
|
||
|
||
## Systems of Record and Derived Data
|
||
|
||
Related to the distinction between operational and analytical systems, this book also distinguishes
|
||
between *systems of record* and *derived data systems*. These terms are useful because they can help
|
||
you clarify the flow of data through a system:
|
||
|
||
Systems of record
|
||
: A system of record, also known as *source of truth*, holds the authoritative or *canonical*
|
||
version of some data. When new data comes in, e.g., as user input, it is first written here. Each
|
||
fact is represented exactly once (the representation is typically *normalized*; see
|
||
[“Normalization, Denormalization, and Joins”](/en/ch3#sec_datamodels_normalization)). If there is any discrepancy between another system and the
|
||
system of record, then the value in the system of record is (by definition) the correct one.
|
||
|
||
Derived data systems
|
||
: Data in a derived system is the result of taking some existing data from another system and
|
||
transforming or processing it in some way. If you lose derived data, you can recreate it from the
|
||
original source. A classic example is a cache: data can be served from the cache if present, but
|
||
if the cache doesn’t contain what you need, you can fall back to the underlying database.
|
||
Denormalized values, indexes, materialized views, transformed data representations, and models
|
||
trained on a dataset also fall into this category.
|
||
|
||
Technically speaking, derived data is *redundant*, in the sense that it duplicates existing
|
||
information. However, it is often essential for getting good performance on read queries. You can
|
||
derive several different datasets from a single source, enabling you to look at the data from
|
||
different “points of view.”
|
||
|
||
Analytical systems are usually derived data systems, because they are consumers of data created
|
||
elsewhere. Operational services may contain a mixture of systems of record and derived data systems.
|
||
The systems of record are the primary databases to which data is first written, whereas the derived
|
||
data systems are the indexes and caches that speed up common read operations, especially for queries
|
||
that the system of record cannot answer efficiently.
|
||
|
||
Most databases, storage engines, and query languages are not inherently a system of record or a
|
||
derived system. A database is just a tool: how you use it is up to you. The distinction between
|
||
system of record and derived data system depends not on the tool, but on how you use it in your
|
||
application. By being clear about which data is derived from which other data, you can bring clarity
|
||
to an otherwise confusing system architecture.
|
||
|
||
When the data in one system is derived from the data in another, you need a process for updating the
|
||
derived data when the original in the system of record changes. Unfortunately, many databases are
|
||
designed based on the assumption that your application only ever needs to use that one database, and
|
||
they do not make it easy to integrate multiple systems in order to propagate such updates. In
|
||
[Link to Come] we will discuss approaches to *data integration*, which allow us to compose multiple
|
||
data systems to achieve things that one system alone cannot do.
|
||
|
||
That brings us to the end of our comparison of analytics and transaction processing. In the next
|
||
section, we will examine another trade-off that you might have already seen debated multiple times.
|
||
|
||
# Cloud versus Self-Hosting
|
||
|
||
With anything that an organization needs to do, one of the first questions is: should it be done
|
||
in-house, or should it be outsourced? Should you build or should you buy?
|
||
|
||
Ultimately, this is a question about business priorities. The received management wisdom is that
|
||
things that are a core competency or a competitive advantage of your organization should be done
|
||
in-house, whereas things that are non-core, routine, or commonplace should be left to a vendor
|
||
[[21](/en/ch1#Fournier2021)].
|
||
To give an extreme example, most companies do not generate their own electricity (unless they are an
|
||
energy company, and leaving aside emergency backup power), since it is cheaper to buy electricity
|
||
from the grid.
|
||
|
||
With software, two important decisions to be made are who builds the software and who deploys it.
|
||
There is a spectrum of possibilities that outsource each decision to various degrees, as illustrated
|
||
in [Figure 1-2](/en/ch1#fig_cloud_spectrum). At one extreme is bespoke software that you write and run in-house; at
|
||
the other extreme are widely-used cloud services or Software as a Service (SaaS) products that are
|
||
implemented and operated by an external vendor, and which you only access through a web interface or
|
||
API.
|
||
|
||

|
||
|
||
###### Figure 1-2. A spectrum of types of software and its operations.
|
||
|
||
The middle ground is off-the-shelf software (open source or commercial) that you *self-host*, i.e.,
|
||
deploy yourself—for example, if you download MySQL and install it on a server you control. This
|
||
could be on your own hardware (often called *on-premises*, even if the server is actually in a
|
||
rented datacenter rack and not literally on your own premises), or on a virtual machine in the cloud
|
||
(*Infrastructure as a Service* or IaaS). There are still more points along this spectrum, e.g.,
|
||
taking open source software and running a modified version of it.
|
||
|
||
Separately from this spectrum there is also the question of *how* you deploy services, either in the
|
||
cloud or on-premises—for example, whether you use an orchestration framework such as Kubernetes.
|
||
However, choice of deployment tooling is out of scope of this book, since other factors have a
|
||
greater influence on the architecture of data systems.
|
||
|
||
## Pros and Cons of Cloud Services
|
||
|
||
Using a cloud service, rather than running comparable software yourself, essentially outsources the
|
||
operation of that software to the cloud provider. There are good arguments for and against cloud
|
||
services. Cloud providers claim that using their services saves you time and money, and allows you
|
||
to move faster compared to setting up your own infrastructure.
|
||
|
||
Whether a cloud service is actually cheaper and easier than self-hosting depends very much on your
|
||
skills and the workload on your systems. If you already have experience setting up and operating the
|
||
systems you need, and if your load is quite predictable (i.e., the number of machines you need does
|
||
not fluctuate wildly), then it’s often cheaper to buy your own machines and run the software on them
|
||
yourself [[22](/en/ch1#HeinemeierHansson2022),
|
||
[23](/en/ch1#Badizadegan2022)].
|
||
|
||
On the other hand, if you need a system that you don’t already know how to deploy and operate, then
|
||
adopting a cloud service is often easier and quicker than learning to manage the system yourself. If
|
||
you have to hire and train staff specifically to maintain and operate the system, that can get very
|
||
expensive. You still need an operations team when you’re using the cloud (see
|
||
[“Operations in the Cloud Era”](/en/ch1#sec_introduction_operations)), but outsourcing the basic system administration can free up your
|
||
team to focus on higher-level concerns.
|
||
|
||
When you outsource the operation of a system to a company that specializes in running that service,
|
||
that can potentially result in a better service, since the provider gains operational expertise from
|
||
providing the service to many customers. On the other hand, if you run the service yourself, you can
|
||
configure and tune it to perform well on your particular workload; it is unlikely that a cloud
|
||
service would be willing to make such customizations on your behalf.
|
||
|
||
Cloud services are particularly valuable if the load on your systems varies a lot over time. If you
|
||
provision your machines to be able to handle peak load, but those computing resources are idle most
|
||
of the time, the system becomes less cost-effective. In this situation, cloud services have the
|
||
advantage that they can make it easier to scale your computing resources up or down in response to
|
||
changes in demand.
|
||
|
||
For example, analytics systems often have extremely variable load: running a large analytical query
|
||
quickly requires a lot of computing resources in parallel, but once the query completes, those
|
||
resources sit idle until the user makes the next query. Predefined queries (e.g., for daily reports)
|
||
can be enqueued and scheduled to smooth out the load, but for interactive queries, the faster you
|
||
want them to complete, the more variable the workload becomes. If your dataset is so large that
|
||
querying it quickly requires significant computing resources, using the cloud can save money, since
|
||
you can return unused resources to the provider rather than leaving them idle. For smaller datasets,
|
||
this difference is less significant.
|
||
|
||
The biggest downside of a cloud service is that you have no control over it:
|
||
|
||
* If it is lacking a feature you need, all you can do is to politely ask the vendor whether they
|
||
will add it; you generally cannot implement it yourself.
|
||
* If the service goes down, all you can do is to wait for it to recover.
|
||
* If you are using the service in a way that triggers a bug or causes performance problems, it will
|
||
be difficult for you to diagnose the issue. With software that you run yourself, you can get
|
||
performance metrics and debugging information from the operating system to help you understand its
|
||
behavior, and you can look at the server logs, but with a service hosted by a vendor you usually
|
||
do not have access to these internals.
|
||
* Moreover, if the service shuts down or becomes unacceptably expensive, or if the vendor decides to
|
||
change their product in a way you don’t like, you are at their mercy—continuing to run an old
|
||
version of the software is usually not an option, so you will be forced to migrate to an
|
||
alternative service [[24](/en/ch1#Yegge2020)].
|
||
This risk is mitigated if there are alternative services that expose a compatible API, but for
|
||
many cloud services there are no standard APIs, which raises the cost of switching, making vendor
|
||
lock-in a problem.
|
||
* The cloud provider needs to be trusted to keep the data secure, which can complicate the process
|
||
of complying with privacy and security regulations.
|
||
|
||
Despite all these risks, it has become more and more popular for organizations to build new
|
||
applications on top of cloud services, or adopting a hybrid approach in which cloud services are
|
||
used for some aspects of a system. However, cloud services will not subsume all in-house data
|
||
systems: many older systems predate the cloud, and for any services that have specialist
|
||
requirements that existing cloud services cannot meet, in-house systems remain necessary. For
|
||
example, very latency-sensitive applications such as high-frequency trading require full control of
|
||
the hardware.
|
||
|
||
## Cloud-Native System Architecture
|
||
|
||
Besides having a different economic model (subscribing to a service instead of buying hardware and
|
||
licensing software to run on it), the rise of the cloud has also had a profound effect on how data
|
||
systems are implemented on a technical level. The term *cloud-native* is used to describe an
|
||
architecture that is designed to take advantage of cloud services.
|
||
|
||
In principle, almost any software that you can self-host could also be provided as a cloud service,
|
||
and indeed such managed services are now available for many popular data systems. However, systems
|
||
that have been designed from the ground up to be cloud-native have been shown to have several
|
||
advantages: better performance on the same hardware, faster recovery from failures, being able to
|
||
quickly scale computing resources to match the load, and supporting larger datasets
|
||
[[25](/en/ch1#Verbitski2017),
|
||
[26](/en/ch1#Antonopoulos2019_ch1),
|
||
[27](/en/ch1#Vuppalapati2020)].
|
||
[Table 1-2](/en/ch1#tab_cloud_native_dbs) lists some examples of both types of systems.
|
||
|
||
Table 1-2. Examples of self-hosted and cloud-native database systems
|
||
|
||
| Category | Self-hosted systems | Cloud-native systems |
|
||
| --- | --- | --- |
|
||
| Operational/OLTP | MySQL, PostgreSQL, MongoDB | AWS Aurora [[25](/en/ch1#Verbitski2017)], Azure SQL DB Hyperscale [[26](/en/ch1#Antonopoulos2019_ch1)], Google Cloud Spanner |
|
||
| Analytical/OLAP | Teradata, ClickHouse, Spark | Snowflake [[27](/en/ch1#Vuppalapati2020)], Google BigQuery, Azure Synapse Analytics |
|
||
|
||
### Layering of cloud services
|
||
|
||
Many self-hosted data systems have very simple system requirements: they run on a conventional
|
||
operating system such as Linux or Windows, they store their data as files on the filesystem, and
|
||
they communicate via standard network protocols such as TCP/IP. A few systems depend on special
|
||
hardware such as GPUs (for machine learning) or RDMA network interfaces, but on the whole,
|
||
self-hosted software tends to use very generic computing resources: CPU, RAM, a filesystem, and an
|
||
IP network.
|
||
|
||
In a cloud, this type of software can be run on an Infrastructure-as-a-Service environment, using
|
||
one or more virtual machines (or *instances*) with a certain allocation of CPUs, memory, disk, and
|
||
network bandwidth. Compared to physical machines, cloud instances can be provisioned faster and they
|
||
come in a greater variety of sizes, but otherwise they are similar to a traditional computer: you
|
||
can run any software you like on it, but you are responsible for administering it yourself.
|
||
|
||
In contrast, the key idea of cloud-native services is to use not only the computing resources
|
||
managed by your operating system, but also to build upon lower-level cloud services to create
|
||
higher-level services. For example:
|
||
|
||
* *Object storage* services such as Amazon S3, Azure Blob Storage, and Cloudflare R2 store large
|
||
files. They provide more limited APIs than a typical filesystem (basic file reads and writes), but
|
||
they have the advantage that they hide the underlying physical machines: the service automatically
|
||
distributes the data across many machines, so that you don’t have to worry about running out of
|
||
disk space on any one machine. Even if some machines or their disks fail entirely, no data is
|
||
lost.
|
||
* Many other services are in turn built upon object storage and other cloud services: for example,
|
||
Snowflake is a cloud-based analytic database (data warehouse) that relies on S3 for data storage
|
||
[[27](/en/ch1#Vuppalapati2020)], and some other services in turn
|
||
build upon Snowflake.
|
||
|
||
As always with abstractions in computing, there is no one right answer to what you should use. As a
|
||
general rule, higher-level abstractions tend to be more oriented towards particular use cases. If
|
||
your needs match the situations for which a higher-level system is designed, using the existing
|
||
higher-level system will probably provide what you need with much less hassle than building it
|
||
yourself from lower-level systems. On the other hand, if there is no high-level system that meets
|
||
your needs, then building it yourself from lower-level components is the only option.
|
||
|
||
### Separation of storage and compute
|
||
|
||
In traditional computing, disk storage is regarded as durable (we assume that once something is
|
||
written to disk, it will not be lost). To tolerate the failure of an individual hard disk, RAID
|
||
(Redundant Array of Independent Disks) is often used to maintain copies of the data on several
|
||
disks attached to the same machine. RAID can be performed either in hardware or in software by the
|
||
operating system, and it is transparent to the applications accessing the filesystem.
|
||
|
||
In the cloud, compute instances (virtual machines) may also have local disks attached, but
|
||
cloud-native systems typically treat these disks more like an ephemeral cache, and less like
|
||
long-term storage. This is because the local disk becomes inaccessible if the associated instance
|
||
fails, or if the instance is replaced with a bigger or a smaller one (on a different physical
|
||
machine) in order to adapt to changes in load.
|
||
|
||
As an alternative to local disks, cloud services also offer virtual disk storage that can be
|
||
detached from one instance and attached to a different one (Amazon EBS, Azure managed disks, and
|
||
persistent disks in Google Cloud). Such a virtual disk is not actually a physical disk, but rather a
|
||
cloud service provided by a separate set of machines, which emulates the behavior of a disk (a
|
||
*block device*, where each block is typically 4 KiB in size). This technology makes it
|
||
possible to run traditional disk-based software in the cloud, but the block device emulation
|
||
introduces overheads that can be avoided in systems that are designed from the ground up for the
|
||
cloud [[25](/en/ch1#Verbitski2017)]. It also makes the application
|
||
very sensitive to network glitches, since every I/O on the virtual block device is actually a
|
||
network call [[28](/en/ch1#NickVanWiggeren2025)].
|
||
|
||
To address this problem, cloud-native services generally avoid using virtual disks, and instead
|
||
build on dedicated storage services that are optimized for particular workloads. Object storage
|
||
services such as S3 are designed for long-term storage of fairly large files, ranging from hundreds
|
||
of kilobytes to several gigabytes in size. The individual rows or values stored in a database are
|
||
typically much smaller than this; cloud databases therefore typically manage smaller values in a
|
||
separate service, and store larger data blocks (containing many individual values) in an object
|
||
store [[26](/en/ch1#Antonopoulos2019_ch1),
|
||
[29](/en/ch1#Breck2024)].
|
||
We will see ways of doing this in [Chapter 4](/en/ch4#ch_storage).
|
||
|
||
In a traditional systems architecture, the same computer is responsible for both storage (disk) and
|
||
computation (CPU and RAM), but in cloud-native systems, these two responsibilities have become
|
||
somewhat separated or *disaggregated* [[9](/en/ch1#Prout2022_ch1),
|
||
[27](/en/ch1#Vuppalapati2020),
|
||
[30](/en/ch1#Shapira2023separation),
|
||
[31](/en/ch1#Murthy2022)]:
|
||
for example, S3 only stores files, and if you want to analyze that data, you will have to run the
|
||
analysis code somewhere outside of S3. This implies transferring the data over the network, which we
|
||
will discuss further in [“Distributed versus Single-Node Systems”](/en/ch1#sec_introduction_distributed).
|
||
|
||
Moreover, cloud-native systems are often *multitenant*, which means that rather than having a
|
||
separate machine for each customer, data and computation from several different customers are
|
||
handled on the same shared hardware by the same service
|
||
[[32](/en/ch1#Vanlightly2023serverless)].
|
||
Multitenancy can enable better hardware utilization, easier scalability, and easier management by
|
||
the cloud provider, but it also requires careful engineering to ensure that one customer’s activity
|
||
does not affect the performance or security of the system for other customers
|
||
[[33](/en/ch1#Jonas2019)].
|
||
|
||
## Operations in the Cloud Era
|
||
|
||
Traditionally, the people managing an organization’s server-side data infrastructure were known as
|
||
*database administrators* (DBAs) or *system administrators* (sysadmins). More recently, many
|
||
organizations have tried to integrate the roles of software development and operations into teams
|
||
with a shared responsibility for both backend services and data infrastructure; the *DevOps*
|
||
philosophy has guided this trend. *Site Reliability Engineers* (SREs) are Google’s implementation of
|
||
this idea [[34](/en/ch1#Beyer2016)].
|
||
|
||
The role of operations is to ensure services are reliably delivered to users (including configuring
|
||
infrastructure and deploying applications), and to ensure a stable production environment (including
|
||
monitoring and diagnosing any problems that may affect reliability). For self-hosted systems,
|
||
operations traditionally involves a significant amount of work at the level of individual machines,
|
||
such as capacity planning (e.g., monitoring available disk space and adding more disks before you
|
||
run out of space), provisioning new machines, moving services from one machine to another, and
|
||
installing operating system patches.
|
||
|
||
Many cloud services present an API that hides the individual machines that actually implement the
|
||
service. For example, cloud storage replaces fixed-size disks with *metered billing*, where you can
|
||
store data without planning your capacity needs in advance, and you are then charged based on the
|
||
space actually used. Moreover, many cloud services remain highly available, even when individual
|
||
machines have failed (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability)).
|
||
|
||
This shift in emphasis from individual machines to services has been accompanied by a change in the
|
||
role of operations. The high-level goal of providing a reliable service remains the same, but the
|
||
processes and tools have evolved. The DevOps/SRE philosophy places greater emphasis on:
|
||
|
||
* automation—preferring repeatable processes over manual one-off jobs,
|
||
* preferring ephemeral virtual machines and services over long running servers,
|
||
* enabling frequent application updates,
|
||
* learning from incidents, and
|
||
* preserving the organization’s knowledge about the system, even as individual people come and go
|
||
[[35](/en/ch1#Limoncelli2020)].
|
||
|
||
With the rise of cloud services, there has been a bifurcation of roles: operations teams at
|
||
infrastructure companies specialize in the details of providing a reliable service to a large number
|
||
of customers, while the customers of the service spend as little time and effort as possible on
|
||
infrastructure [[36](/en/ch1#Majors2020)].
|
||
|
||
Customers of cloud services still require operations, but they focus on different aspects, such as
|
||
choosing the most appropriate service for a given task, integrating different services with each
|
||
other, and migrating from one service to another. Even though metered billing removes the need for
|
||
capacity planning in the traditional sense, it’s still important to know what resources you are
|
||
using for which purpose, so that you don’t waste money on cloud resources that are not needed:
|
||
capacity planning becomes financial planning, and performance optimization becomes cost optimization
|
||
[[37](/en/ch1#Cherkasky2021)].
|
||
Moreover, cloud services do have resource limits or *quotas* (such as the maximum number of
|
||
processes you can run concurrently), which you need to know about and plan for before you run into
|
||
them [[38](/en/ch1#Kushchi2023)].
|
||
|
||
Adopting a cloud service can be easier and quicker than running your own infrastructure, although
|
||
even here there is a cost in learning how to use it, and perhaps working around its limitations.
|
||
Integration between different services becomes a particular challenge as a growing number of vendors
|
||
offers an ever broader range of cloud services targeting different use cases
|
||
[[39](/en/ch1#Bernhardsson2021),
|
||
[40](/en/ch1#Stancil2021)].
|
||
ETL (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) is only part of the story; operational cloud services also need
|
||
to be integrated with each other. At present, there is a lack of standards that would facilitate
|
||
this sort of integration, so it often involves significant manual effort.
|
||
|
||
Other operational aspects that cannot fully be outsourced to cloud services include maintaining the
|
||
security of an application and the libraries it uses, managing the interactions between your own
|
||
services, monitoring the load on your services, and tracking down the cause of problems such as
|
||
performance degradations or outages. While the cloud is changing the role of operations, the need
|
||
for operations is as great as ever.
|
||
|
||
# Distributed versus Single-Node Systems
|
||
|
||
A system that involves several machines communicating via a network is called a *distributed
|
||
system*. Each of the processes participating in a distributed system is called a *node*. There are
|
||
various reasons why you might want a system to be distributed:
|
||
|
||
Inherently distributed systems
|
||
: If an application involves two or more interacting users, each using their own device, then the
|
||
system is unavoidably distributed: the communication between the devices will have to go via a
|
||
network.
|
||
|
||
Requests between cloud services
|
||
: If data is stored in one service but processed in another, it must be transferred over the network
|
||
from one service to the other.
|
||
|
||
Fault tolerance/high availability
|
||
: If your application needs to continue working even if one machine (or several machines, or
|
||
the network, or an entire datacenter) goes down, you can use multiple machines to give you
|
||
redundancy. When one fails, another one can take over. See [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability) and
|
||
[Chapter 6](/en/ch6#ch_replication) on replication.
|
||
|
||
Scalability
|
||
: If your data volume or computing requirements grow bigger than a single machine can handle,
|
||
you can potentially spread the load across multiple machines. See
|
||
[“Scalability”](/en/ch2#sec_introduction_scalability).
|
||
|
||
Latency
|
||
: If you have users around the world, you might want to have servers in various regions
|
||
worldwide so that each user can be served from a server that is geographically close to
|
||
them. That avoids the users having to wait for network packets to travel halfway around the
|
||
world to answer their requests. See [“Describing Performance”](/en/ch2#sec_introduction_percentiles).
|
||
|
||
Elasticity
|
||
: If your application is busy at some times and idle at other times, a cloud deployment can scale up
|
||
or down to meet the demand, so that you pay only for resources you are actively using. This is more
|
||
difficult on a single machine, which needs to be provisioned to handle the maximum load, even at
|
||
times when it is barely used.
|
||
|
||
Using specialized hardware
|
||
: Different parts of the system can take advantage of different types of hardware to match their
|
||
workload. For example, an object store may use machines with many disks but few CPUs, whereas a
|
||
data analysis system may use machines with lots of CPU and memory but no disks, and a machine
|
||
learning system may use machines with GPUs (which are much more efficient than CPUs for training
|
||
deep neural networks and other machine learning tasks).
|
||
|
||
Legal compliance
|
||
: Some countries have data residency laws that require data about people in their jurisdiction to be
|
||
stored and processed geographically within that country
|
||
[[41](/en/ch1#Korolov2022)].
|
||
The scope of these rules varies—for example, in some cases it applies only to medical or financial
|
||
data, while other cases are broader. A service with users in several such jurisdictions will
|
||
therefore have to distribute their data across servers in several locations.
|
||
|
||
Sustainability
|
||
: If you have flexibility on where and when to run your jobs, you might be able to run them in a
|
||
time and place where plenty of renewable electricity is available, and avoid running them when the
|
||
power grid is under strain. This can reduce your carbon emissions and allow you to take advantage
|
||
of cheap power when it is available
|
||
[[42](/en/ch1#Borenstein2025),
|
||
[43](/en/ch1#Acun2023)].
|
||
|
||
These reasons apply both to services that you write yourself (application code) and services
|
||
consisting of off-the-shelf software (such as databases).
|
||
|
||
## Problems with Distributed Systems
|
||
|
||
Distributed systems also have downsides. Every request and API call that goes via the network needs
|
||
to deal with the possibility of failure: the network may be interrupted, or the service may be
|
||
overloaded or crashed, and therefore any request may time out without receiving a response. In this
|
||
case, we don’t know whether the service received the request, and simply retrying it might not be
|
||
safe. We will discuss these problems in detail in [Chapter 9](/en/ch9#ch_distributed).
|
||
|
||
Although datacenter networks are fast, making a call to another service is still vastly slower than
|
||
calling a function in the same process
|
||
[[44](/en/ch1#Nath2019)].
|
||
When operating on large volumes of data, rather than transferring the data from storage to a
|
||
separate machine that processes it, it can be faster to bring the computation to the machine that
|
||
already has the data
|
||
[[45](/en/ch1#Hellerstein2019)].
|
||
More nodes are not always faster: in some cases, a simple single-threaded program on one computer
|
||
can perform significantly better than a cluster with over 100 CPU cores
|
||
[[46](/en/ch1#McSherry2015_ch1)].
|
||
|
||
Troubleshooting a distributed system is often difficult: if the system is slow to respond, how do
|
||
you figure out where the problem lies? Techniques for diagnosing problems in distributed systems are
|
||
developed under the heading of *observability* [[47](/en/ch1#Sridharan2018),
|
||
[48](/en/ch1#Majors2019)],
|
||
which involves collecting data about the execution of a system, and allowing it to be queried in
|
||
ways that allows both high-level metrics and individual events to be analyzed. *Tracing* tools such
|
||
as OpenTelemetry, Zipkin, and Jaeger allow you to track which client called which server for which
|
||
operation, and how long each call took
|
||
[[49](/en/ch1#Sigelman2010)].
|
||
|
||
Databases provide various mechanisms for ensuring data consistency, as we shall see in
|
||
[Chapter 6](/en/ch6#ch_replication) and [Chapter 8](/en/ch8#ch_transactions). However, when each service has its own database,
|
||
maintaining consistency of data across those different services becomes the application’s problem.
|
||
Distributed transactions, which we explore in [Chapter 8](/en/ch8#ch_transactions), are a possible technique for
|
||
ensuring consistency, but they are rarely used in a microservices context because they run counter
|
||
to the goal of making services independent from each other, and many databases don’t support them
|
||
[[50](/en/ch1#Laigner2021)].
|
||
|
||
For all these reasons, if you can do something on a single machine, this is often much simpler and
|
||
cheaper compared to setting up a distributed system
|
||
[[23](/en/ch1#Badizadegan2022),
|
||
[46](/en/ch1#McSherry2015_ch1),
|
||
[51](/en/ch1#Tigani2023)].
|
||
CPUs, memory, and disks have grown larger, faster, and more reliable. When combined with single-node
|
||
databases such as DuckDB, SQLite, and KùzuDB, many workloads can now run on a single node. We will
|
||
explore more on this topic in [Chapter 4](/en/ch4#ch_storage).
|
||
|
||
## Microservices and Serverless
|
||
|
||
The most common way of distributing a system across multiple machines is to divide them into clients
|
||
and servers, and let the clients make requests to the servers. Most commonly HTTP is used for this
|
||
communication, as we will discuss in [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc). The same process may be both a
|
||
server (handling incoming requests) and a client (making outbound requests to other services).
|
||
|
||
This way of building applications has traditionally been called a *service-oriented architecture*
|
||
(SOA); more recently the idea has been refined into a *microservices* architecture
|
||
[[52](/en/ch1#Newman2021_ch1),
|
||
[53](/en/ch1#Richardson2014)].
|
||
In this architecture, a service has one well-defined purpose (for example, in the case of S3, this
|
||
would be file storage); each service exposes an API that can be called by clients via the network,
|
||
and each service has one team that is responsible for its maintenance. A complex application can
|
||
thus be decomposed into multiple interacting services, each managed by a separate team.
|
||
|
||
There are several advantages to breaking down a complex piece of software into multiple services:
|
||
each service can be updated independently, reducing coordination effort among teams; each service
|
||
can be assigned the hardware resources it needs; and by hiding the implementation details behind an
|
||
API, the service owners are free to change the implementation without affecting clients. In terms of
|
||
data storage, it is common for each service to have its own databases, and not to share databases
|
||
between services: sharing a database would effectively make the entire database structure a part of
|
||
the service’s API, and then that structure would be difficult to change. Shared databases could also
|
||
cause one service’s queries to negatively impact the performance of other services.
|
||
|
||
On the other hand, having many services can itself breed complexity: each service requires
|
||
infrastructure for deploying new releases, adjusting the allocated hardware resources to match the
|
||
load, collecting logs, monitoring service health, and alerting an on-call engineer in the case of a
|
||
problem. *Orchestration* frameworks such as Kubernetes have become a popular way of deploying
|
||
services, since they provide a foundation for this infrastructure. Testing a service during
|
||
development can be complicated, since you also need to run all the other services that it depends
|
||
on.
|
||
|
||
Microservice APIs can be challenging to evolve. Clients that call an API expect the API to have
|
||
certain fields. Developers might wish to add or remove fields to an API as business needs change,
|
||
but doing so can cause clients to fail. Worse still, such failures are often not discovered until
|
||
late in the development cycle when the updated service API is deployed to a staging or production
|
||
environment. API description standards such as OpenAPI and gRPC help manage the relationship between
|
||
client and server APIs; we discuss these further in [Chapter 5](/en/ch5#ch_encoding).
|
||
|
||
Microservices are primarily a technical solution to a people problem: allowing different teams to
|
||
make progress independently without having to coordinate with each other. This is valuable in a large
|
||
company, but in a small company where there are not many teams, using microservices is likely to be
|
||
unnecessary overhead, and it is preferable to implement the application in the simplest way possible
|
||
[[52](/en/ch1#Newman2021_ch1)].
|
||
|
||
*Serverless*, or *function-as-a-service* (FaaS), is another approach to deploying services, in which
|
||
the management of the infrastructure is outsourced to a cloud vendor
|
||
[[33](/en/ch1#Jonas2019)].
|
||
When using virtual machines, you have to explicitly choose when to start up or shut down an
|
||
instance; in contrast, with the serverless model, the cloud provider automatically allocates and
|
||
frees hardware resources as needed, based on the incoming requests to your service
|
||
[[54](/en/ch1#Shahrad2020)]. Serverless deployment
|
||
shifts more of the operational burden to cloud providers and enables flexible billing by usage
|
||
rather than machine instances. To offer such benefits, many serverless infrastructure providers
|
||
impose a time limit on function execution, limit runtime environments, and might suffer from slow
|
||
start times when a function is first invoked. The term “serverless” can also be misleading: each
|
||
serverless function execution still runs on a server, but subsequent executions might run on a
|
||
different one. Moreover, infrastructure such as BigQuery and various Kafka offerings have adopted
|
||
“serverless” terminology to signal that their services auto-scale and that they bill by usage rather
|
||
than machine instances.
|
||
|
||
Just like cloud storage replaced capacity planning (deciding in advance how many disks to buy) with
|
||
a metered billing model, the serverless approach is bringing metered billing to code execution: you
|
||
only pay for the time that your application code is actually running, rather than having to
|
||
provision resources in advance.
|
||
|
||
## Cloud Computing versus Supercomputing
|
||
|
||
Cloud computing is not the only way of building large-scale computing systems; an alternative is
|
||
*high-performance computing* (HPC), also known as *supercomputing*. Although there are overlaps, HPC
|
||
often has different priorities and uses different techniques compared to cloud computing and
|
||
enterprise datacenter systems. Some of those differences are:
|
||
|
||
* Supercomputers are typically used for computationally intensive scientific computing tasks, such
|
||
as weather forecasting, climate modeling, molecular dynamics (simulating the movement of atoms and
|
||
molecules), complex optimization problems, and solving partial differential equations. On the
|
||
other hand, cloud computing tends to be used for online services, business data systems, and
|
||
similar systems that need to serve user requests with high availability.
|
||
* A supercomputer typically runs large batch jobs that checkpoint the state of their computation to
|
||
disk from time to time. If a node fails, a common solution is to simply stop the entire cluster
|
||
workload, repair the faulty node, and then restart the computation from the last checkpoint
|
||
[[55](/en/ch1#Barroso2018),
|
||
[56](/en/ch1#Fiala2012)].
|
||
With cloud services, it is usually not desirable to stop the entire cluster, since the services
|
||
need to continually serve users with minimal interruptions.
|
||
* Supercomputer nodes typically communicate through shared memory and remote direct memory access
|
||
(RDMA), which support high bandwidth and low latency, but assume a high level of trust among the
|
||
users of the system [[57](/en/ch1#KornfeldSimpson2020)].
|
||
In cloud computing, the network and the machines are often shared by mutually untrusting
|
||
organizations, requiring stronger security mechanisms such as resource isolation (e.g., virtual
|
||
machines), encryption and authentication.
|
||
* Cloud datacenter networks are often based on IP and Ethernet, arranged in Clos topologies to
|
||
provide high bisection bandwidth—a commonly used measure of a network’s overall performance
|
||
[[55](/en/ch1#Barroso2018),
|
||
[58](/en/ch1#Singh2015)].
|
||
Supercomputers often use specialized network topologies, such as multi-dimensional meshes and toruses
|
||
[[59](/en/ch1#Lockwood2014)],
|
||
which yield better performance for HPC workloads with known communication patterns.
|
||
* Cloud computing allows nodes to be distributed across multiple geographic regions, whereas
|
||
supercomputers generally assume that all of their nodes are close together.
|
||
|
||
Large-scale analytics systems sometimes share some characteristics with supercomputing, which is why
|
||
it can be worth knowing about these techniques if you are working in this area. However, this book
|
||
is mostly concerned with services that need to be continually available, as discussed in
|
||
[“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability).
|
||
|
||
# Data Systems, Law, and Society
|
||
|
||
So far you’ve seen in this chapter that the architecture of data systems is influenced not only by
|
||
technical goals and requirements, but also by the human needs of the organizations that they
|
||
support. Increasingly, data systems engineers are realizing that serving the needs of their own
|
||
business is not enough: we also have a responsibility towards society at large.
|
||
|
||
One particular concern are systems that store data about people and their behavior. Since 2018 the
|
||
*General Data Protection Regulation* (GDPR) has given residents of many European countries greater
|
||
control and legal rights over their personal data, and similar privacy regulation has been adopted
|
||
in various other countries and states around the world, including for example the California
|
||
Consumer Privacy Act (CCPA). Regulations around AI, such as the *EU AI Act*, place further
|
||
restrictions on how personal data can be used.
|
||
|
||
Moreover, even in areas that are not directly subject to regulation, there is increasing recognition
|
||
of the effects that computer systems have on people and society. Social media has changed how
|
||
individuals consume news, which influences their political opinions and hence may affect the outcome
|
||
of elections. Automated systems increasingly make decisions that have profound consequences for
|
||
individuals, such as deciding who should be given a loan or insurance coverage, who should be
|
||
invited to a job interview, or who should be suspected of a crime
|
||
[[60](/en/ch1#ONeil2016_ch1)].
|
||
|
||
Everyone who works on such systems shares a responsibility for considering the ethical impact and
|
||
ensuring that they comply with relevant law. It is not necessary for everybody to become an expert
|
||
in law and ethics, but a basic awareness of legal and ethical principles is just as important as,
|
||
say, some foundational knowledge in distributed systems.
|
||
|
||
Legal considerations are influencing the very foundations of how data systems are being designed
|
||
[[61](/en/ch1#Shastri2020)].
|
||
For example, the GDPR grants individuals the right to have their data erased on request (sometimes
|
||
known as the *right to be forgotten*). However, as we shall see in this book, many data systems rely
|
||
on immutable constructs such as append-only logs as part of their design; how can we ensure deletion
|
||
of some data in the middle of a file that is supposed to be immutable? How do we handle deletion of
|
||
data that has been incorporated into derived datasets (see [“Systems of Record and Derived Data”](/en/ch1#sec_introduction_derived)), such as
|
||
training data for machine learning models? Answering these questions creates new engineering
|
||
challenges.
|
||
|
||
At present we don’t have clear guidelines on which particular technologies or system architectures
|
||
should be considered “GDPR-compliant” or not. The regulation deliberately does not mandate
|
||
particular technologies, because these may quickly change as technology progresses. Instead, the
|
||
legal texts set out high-level principles that are subject to interpretation. This means that there
|
||
are no simple answers to the question of how to comply with privacy regulation, but we will look at
|
||
some of the technologies in this book through this lens.
|
||
|
||
In general, we store data because we think that its value is greater than the costs of storing it.
|
||
However, it is worth remembering that the costs of storage are not just the bill you pay for Amazon
|
||
S3 or another service: the cost-benefit calculation should also take into account the risks of
|
||
liability and reputational damage if the data were to be leaked or compromised by adversaries, and
|
||
the risk of legal costs and fines if the storage and processing of the data is found not to be
|
||
compliant with the law [[51](/en/ch1#Tigani2023)].
|
||
|
||
Governments or police forces might also compel companies to hand over data. When there is a risk
|
||
that the data may reveal criminalized behaviors (for example, homosexuality in several Middle
|
||
Eastern and African countries, or seeking an abortion in several US states), storing that data
|
||
creates real safety risks for users. Travel to an abortion clinic, for example, could easily be
|
||
revealed by location data, perhaps even by a log of the user’s IP addresses over time (which
|
||
indicate approximate location).
|
||
|
||
Once all the risks are taken into account, it might be reasonable to decide that some data is simply
|
||
not worth storing, and that it should therefore be deleted. This principle of *data minimization*
|
||
(sometimes known by the German term *Datensparsamkeit*) runs counter to the “big data” philosophy of
|
||
storing lots of data speculatively in case it turns out to be useful in the future
|
||
[[62](/en/ch1#Datensparsamkeit)].
|
||
But it fits with the GDPR, which mandates that personal data may only be collected for a specified,
|
||
explicit purpose, that this data may not later be used for any other purpose, and that the data must
|
||
not be kept for longer than necessary for the purposes for which it was collected
|
||
[[63](/en/ch1#GDPR)].
|
||
|
||
Businesses have also taken notice of privacy and safety concerns. Credit card companies require
|
||
payment processing businesses to adhere to strict payment card industry (PCI) standards. Processors
|
||
undergo frequent evaluations from independent auditors to verify continued compliance. Software
|
||
vendors have also seen increased scrutiny. Many buyers now require their vendors to comply with
|
||
Service Organization Control (SOC) Type 2 standards. As with PCI compliance, vendors undergo third
|
||
party audits to verify adherence.
|
||
|
||
Generally, it is important to balance the needs of your business against the needs of the people
|
||
whose data you are collecting and processing. There is much more to this topic; in [Link to Come] we
|
||
will go deeper into the topics of ethics and legal compliance, including the problems of bias and
|
||
discrimination.
|
||
|
||
# Summary
|
||
|
||
The theme of this chapter has been to understand trade-offs: that is, to recognize that for many
|
||
questions there is not one right answer, but several different approaches that each have various
|
||
pros and cons. We explored some of the most important choices that affect the architecture of data
|
||
systems, and introduced terminology that will be needed throughout the rest of this book.
|
||
|
||
We started by making a distinction between operational (transaction-processing, OLTP) and analytical
|
||
(OLAP) systems, and saw their different characteristics: not only managing different types of data
|
||
with different access patterns, but also serving different audiences. We encountered the concept of
|
||
a data warehouse and data lake, which receive data feeds from operational systems via ETL. In
|
||
[Chapter 4](/en/ch4#ch_storage) we will see that operational and analytical systems often use very different internal
|
||
data layouts because of the different types of queries they need to serve.
|
||
|
||
We then compared cloud services, a comparatively recent development, to the traditional paradigm of
|
||
self-hosted software that has previously dominated data systems architecture. Which of these
|
||
approaches is more cost-effective depends a lot on your particular situation, but it’s undeniable
|
||
that cloud-native approaches are bringing big changes to the way data systems are architected, for
|
||
example in the way they separate storage and compute.
|
||
|
||
Cloud systems are intrinsically distributed, and we briefly examined some of the trade-offs of
|
||
distributed systems compared to using a single machine. There are situations in which you can’t
|
||
avoid going distributed, but it’s advisable not to rush into making a system distributed if it’s
|
||
possible to keep it on a single machine. In [Chapter 9](/en/ch9#ch_distributed) we will cover the challenges with
|
||
distributed systems in more detail.
|
||
|
||
Finally, we saw that data systems architecture is determined not only by the needs of the business
|
||
deploying the system, but also by privacy regulation that protects the rights of the people whose
|
||
data is being processed—an aspect that many engineers are prone to ignoring. How we translate legal
|
||
requirements into technical implementations is not yet well understood, but it’s important to keep
|
||
this question in mind as we move through the rest of this book.
|
||
|
||
##### Footnotes
|
||
|
||
##### References
|
||
|
||
[[1](/en/ch1#Kouzes2009-marker)] Richard T. Kouzes,
|
||
Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio.
|
||
[The
|
||
Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1,
|
||
January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26)
|
||
|
||
[[2](/en/ch1#Kleppmann2019_ch1-marker)] Martin Kleppmann, Adam Wiggins, Peter van
|
||
Hardenberg, and Mark McGranaghan. [Local-first
|
||
software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International
|
||
Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!),
|
||
October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
|
||
|
||
[[3](/en/ch1#Reis2022-marker)] Joe Reis and Matt Housley.
|
||
[*Fundamentals
|
||
of Data Engineering*](https://www.oreilly.com/library/view/fundamentals-of-data/9781098108298/). O’Reilly Media, 2022. ISBN: 9781098108304
|
||
|
||
[[4](/en/ch1#Machado2023-marker)] Rui Pedro Machado and Helder Russa.
|
||
[*Analytics
|
||
Engineering with SQL and dbt*](https://www.oreilly.com/library/view/analytics-engineering-with/9781098142377/). O’Reilly Media, 2023. ISBN: 9781098142384
|
||
|
||
[[5](/en/ch1#Codd1993-marker)] Edgar F. Codd, S. B. Codd, and C. T. Salley.
|
||
[Providing
|
||
OLAP to User-Analysts: An IT Mandate](https://www.estgv.ipv.pt/PaginasPessoais/jloureiro/ESI_AID2007_2008/fichas/codd.pdf). E. F. Codd Associates, 1993.
|
||
Archived at [perma.cc/RKX8-2GEE](https://perma.cc/RKX8-2GEE)
|
||
|
||
[[6](/en/ch1#Soman2023-marker)] Chinmay Soman and Neha Pawar.
|
||
[Comparing Three
|
||
Real-Time OLAP Databases: Apache Pinot, Apache Druid, and ClickHouse](https://startree.ai/blog/a-tale-of-three-real-time-olap-databases). *startree.ai*,
|
||
April 2023. Archived at [perma.cc/8BZP-VWPA](https://perma.cc/8BZP-VWPA)
|
||
|
||
[[7](/en/ch1#Chaudhuri1997-marker)] Surajit Chaudhuri and Umeshwar Dayal.
|
||
[An Overview of Data
|
||
Warehousing and OLAP Technology](https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/sigrecord.pdf). *ACM SIGMOD Record*, volume 26, issue 1, pages 65–74,
|
||
March 1997. [doi:10.1145/248603.248616](https://doi.org/10.1145/248603.248616)
|
||
|
||
[[8](/en/ch1#Ozcan2017-marker)] Fatma Özcan, Yuanyuan Tian, and Pinar Tözün.
|
||
[Hybrid Transactional/Analytical
|
||
Processing: A Survey](https://humming80.github.io/papers/sigmod-htaptut.pdf). At *ACM International Conference on Management of Data* (SIGMOD), May 2017.
|
||
[doi:10.1145/3035918.3054784](https://doi.org/10.1145/3035918.3054784)
|
||
|
||
[[9](/en/ch1#Prout2022_ch1-marker)] Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu
|
||
Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov.
|
||
[Cloud-Native Transactions and Analytics
|
||
in SingleStore](https://dl.acm.org/doi/abs/10.1145/3514221.3526055). At *International Conference on Management of Data* (SIGMOD), June 2022.
|
||
[doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055)
|
||
|
||
[[10](/en/ch1#Zhang2024-marker)] Chao Zhang, Guoliang Li, Jintao Zhang,
|
||
Xinning Zhang, and Jianhua Feng.
|
||
[HTAP Databases: A Survey](https://arxiv.org/pdf/2404.15670).
|
||
*IEEE Transactions on Knowledge and Data Engineering*, April 2024.
|
||
[doi:10.1109/TKDE.2024.3389693](https://doi.org/10.1109/TKDE.2024.3389693)
|
||
|
||
[[11](/en/ch1#Stonebraker2005fitsall-marker)] Michael Stonebraker and Uğur Çetintemel.
|
||
[‘One Size Fits All’: An
|
||
Idea Whose Time Has Come and Gone](https://pages.cs.wisc.edu/~shivaram/cs744-readings/fits_all.pdf). At *21st International Conference on Data Engineering*
|
||
(ICDE), April 2005. [doi:10.1109/ICDE.2005.1](https://doi.org/10.1109/ICDE.2005.1)
|
||
|
||
[[12](/en/ch1#Cohen2009-marker)] Jeffrey Cohen, Brian Dolan, Mark Dunlap, Joseph M.
|
||
Hellerstein, and Caleb Welton. [MAD Skills:
|
||
New Analysis Practices for Big Data](https://www.vldb.org/pvldb/vol2/vldb09-219.pdf). *Proceedings of the VLDB Endowment*, volume 2,
|
||
issue 2, pages 1481–1492, August 2009.
|
||
[doi:10.14778/1687553.1687576](https://doi.org/10.14778/1687553.1687576)
|
||
|
||
[[13](/en/ch1#Olteanu2020-marker)] Dan Olteanu.
|
||
[The Relational Data Borg is Learning](https://www.vldb.org/pvldb/vol13/p3502-olteanu.pdf).
|
||
*Proceedings of the VLDB Endowment*, volume 13, issue 12, August 2020.
|
||
[doi:10.14778/3415478.3415572](https://doi.org/10.14778/3415478.3415572)
|
||
|
||
[[14](/en/ch1#Bornstein2020-marker)] Matt Bornstein, Martin Casado, and Jennifer Li.
|
||
[Emerging
|
||
Architectures for Modern Data Infrastructure: 2020](https://future.a16z.com/emerging-architectures-for-modern-data-infrastructure-2020/). *future.a16z.com*, October 2020.
|
||
Archived at [perma.cc/LF8W-KDCC](https://perma.cc/LF8W-KDCC)
|
||
|
||
[[15](/en/ch1#Fowler2015-marker)] Martin Fowler.
|
||
[DataLake](https://www.martinfowler.com/bliki/DataLake.html).
|
||
*martinfowler.com*, February 2015.
|
||
Archived at [perma.cc/4WKN-CZUK](https://perma.cc/4WKN-CZUK)
|
||
|
||
[[16](/en/ch1#Johnson2015-marker)] Bobby Johnson and Joseph Adler.
|
||
[The
|
||
Sushi Principle: Raw Data Is Better](https://learning.oreilly.com/videos/strata-hadoop/9781491924143/9781491924143-video210840/). At *Strata+Hadoop World*, February 2015.
|
||
|
||
[[17](/en/ch1#Armbrust2021-marker)] Michael Armbrust, Ali Ghodsi, Reynold Xin, and Matei Zaharia.
|
||
[Lakehouse: A New Generation of
|
||
Open Platforms that Unify Data Warehousing and Advanced Analytics](https://www.cidrdb.org/cidr2021/papers/cidr2021_paper17.pdf). At *11th Annual Conference
|
||
on Innovative Data Systems Research* (CIDR), January 2021.
|
||
|
||
[[18](/en/ch1#DataOps-marker)] DataKitchen, Inc.
|
||
[The DataOps Manifesto](https://dataopsmanifesto.org/en/). *dataopsmanifesto.org*, 2017.
|
||
Archived at [perma.cc/3F5N-FUQ4](https://perma.cc/3F5N-FUQ4)
|
||
|
||
[[19](/en/ch1#Manohar2021-marker)] Tejas Manohar.
|
||
[What is Reverse ETL: A Definition & Why It’s
|
||
Taking Off](https://hightouch.io/blog/reverse-etl/). *hightouch.io*, November 2021.
|
||
Archived at [perma.cc/A7TN-GLYJ](https://perma.cc/A7TN-GLYJ)
|
||
|
||
[[20](/en/ch1#ORegan2018-marker)] Simon O’Regan.
|
||
[Designing Data
|
||
Products](https://towardsdatascience.com/designing-data-products-b6b93edf3d23). *towardsdatascience.com*, August 2018.
|
||
Archived at [perma.cc/HU67-3RV8](https://perma.cc/HU67-3RV8)
|
||
|
||
[[21](/en/ch1#Fournier2021-marker)] Camille Fournier.
|
||
[Why is it so
|
||
hard to decide to buy?](https://skamille.medium.com/why-is-it-so-hard-to-decide-to-buy-d86fee98e88e) *skamille.medium.com*, July 2021.
|
||
Archived at [perma.cc/6VSG-HQ5X](https://perma.cc/6VSG-HQ5X)
|
||
|
||
[[22](/en/ch1#HeinemeierHansson2022-marker)] David Heinemeier Hansson.
|
||
[Why we’re leaving the cloud](https://world.hey.com/dhh/why-we-re-leaving-the-cloud-654b47e0).
|
||
*world.hey.com*, October 2022.
|
||
Archived at [perma.cc/82E6-UJ65](https://perma.cc/82E6-UJ65)
|
||
|
||
[[23](/en/ch1#Badizadegan2022-marker)] Nima Badizadegan.
|
||
[Use One Big Server](https://specbranch.com/posts/one-big-server/).
|
||
*specbranch.com*, August 2022.
|
||
Archived at [perma.cc/M8NB-95UK](https://perma.cc/M8NB-95UK)
|
||
|
||
[[24](/en/ch1#Yegge2020-marker)] Steve Yegge.
|
||
[Dear
|
||
Google Cloud: Your Deprecation Policy is Killing You](https://steve-yegge.medium.com/dear-google-cloud-your-deprecation-policy-is-killing-you-ee7525dc05dc). *steve-yegge.medium.com*, August 2020.
|
||
Archived at [perma.cc/KQP9-SPGU](https://perma.cc/KQP9-SPGU)
|
||
|
||
[[25](/en/ch1#Verbitski2017-marker)] Alexandre Verbitski, Anurag Gupta, Debanjan
|
||
Saha, Murali Brahmadesam, Kamal Gupta, Raman Mittal, Sailesh Krishnamurthy, Sandor Maurice, Tengiz
|
||
Kharatishvili, and Xiaofeng Bao.
|
||
[Amazon
|
||
Aurora: Design Considerations for High Throughput Cloud-Native Relational Databases](https://media.amazonwebservices.com/blog/2017/aurora-design-considerations-paper.pdf).
|
||
At *ACM International Conference on Management of Data* (SIGMOD), pages 1041–1052, May 2017.
|
||
[doi:10.1145/3035918.3056101](https://doi.org/10.1145/3035918.3056101)
|
||
|
||
[[26](/en/ch1#Antonopoulos2019_ch1-marker)] Panagiotis Antonopoulos, Alex Budovski, Cristian
|
||
Diaconu, Alejandro Hernandez Saenz, Jack Hu, Hanuma Kodavalla, Donald Kossmann, Sandeep Lingam, Umar
|
||
Farooq Minhas, Naveen Prakash, Vijendra Purohit, Hugh Qu, Chaitanya Sreenivas Ravella, Krystyna
|
||
Reisteter, Sheetal Shrotri, Dixin Tang, and Vikram Wakade.
|
||
[Socrates: The
|
||
New SQL Server in the Cloud](https://www.microsoft.com/en-us/research/uploads/prod/2019/05/socrates.pdf). At *ACM International Conference on Management of Data*
|
||
(SIGMOD), pages 1743–1756, June 2019.
|
||
[doi:10.1145/3299869.3314047](https://doi.org/10.1145/3299869.3314047)
|
||
|
||
[[27](/en/ch1#Vuppalapati2020-marker)] Midhul Vuppalapati, Justin Miron, Rachit Agarwal,
|
||
Dan Truong, Ashish Motivala, and Thierry Cruanes.
|
||
[Building An Elastic Query
|
||
Engine on Disaggregated Storage](https://www.usenix.org/system/files/nsdi20-paper-vuppalapati.pdf). At *17th USENIX Symposium on Networked Systems Design and
|
||
Implementation* (NSDI), February 2020.
|
||
|
||
[[28](/en/ch1#NickVanWiggeren2025-marker)] Nick Van Wiggeren.
|
||
[The Real Failure Rate of EBS](https://planetscale.com/blog/the-real-fail-rate-of-ebs).
|
||
*planetscale.com*, March 2025.
|
||
Archived at [perma.cc/43CR-SAH5](https://perma.cc/43CR-SAH5)
|
||
|
||
[[29](/en/ch1#Breck2024-marker)] Colin Breck.
|
||
[Predicting the
|
||
Future of Distributed Systems](https://blog.colinbreck.com/predicting-the-future-of-distributed-systems/). *blog.colinbreck.com*, August 2024.
|
||
Archived at [perma.cc/K5FC-4XX2](https://perma.cc/K5FC-4XX2)
|
||
|
||
[[30](/en/ch1#Shapira2023separation-marker)] Gwen Shapira.
|
||
[Compute-Storage Separation Explained](https://www.thenile.dev/blog/storage-compute).
|
||
*thenile.dev*, January 2023. Archived at
|
||
[perma.cc/QCV3-XJNZ](https://perma.cc/QCV3-XJNZ)
|
||
|
||
[[31](/en/ch1#Murthy2022-marker)] Ravi Murthy and Gurmeet Goindi.
|
||
[AlloyDB
|
||
for PostgreSQL under the hood: Intelligent, database-aware storage](https://cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage). *cloud.google.com*,
|
||
May 2022. Archived at
|
||
[archive.org](https://web.archive.org/web/20220514021120/https%3A//cloud.google.com/blog/products/databases/alloydb-for-postgresql-intelligent-scalable-storage)
|
||
|
||
[[32](/en/ch1#Vanlightly2023serverless-marker)] Jack Vanlightly.
|
||
[The
|
||
Architecture of Serverless Data Systems](https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems). *jack-vanlightly.com*, November 2023.
|
||
Archived at [perma.cc/UDV4-TNJ5](https://perma.cc/UDV4-TNJ5)
|
||
|
||
[[33](/en/ch1#Jonas2019-marker)] Eric Jonas, Johann Schleier-Smith, Vikram
|
||
Sreekanti, Chia-Che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, Joao Carreira, Karl Krauth,
|
||
Neeraja Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, David A. Patterson.
|
||
[Cloud Programming Simplified: A Berkeley View on
|
||
Serverless Computing](https://arxiv.org/abs/1902.03383). *arxiv.org*, February 2019.
|
||
|
||
[[34](/en/ch1#Beyer2016-marker)] Betsy Beyer, Jennifer Petoff, Chris
|
||
Jones, and Niall Richard Murphy.
|
||
[*Site
|
||
Reliability Engineering: How Google Runs Production Systems*](https://www.oreilly.com/library/view/site-reliability-engineering/9781491929117/).
|
||
O’Reilly Media, 2016. ISBN: 9781491929124
|
||
|
||
[[35](/en/ch1#Limoncelli2020-marker)] Thomas Limoncelli.
|
||
[The Time I Stole $10,000 from Bell Labs](https://queue.acm.org/detail.cfm?id=3434773).
|
||
*ACM Queue*, volume 18, issue 5, November 2020.
|
||
[doi:10.1145/3434571.3434773](https://doi.org/10.1145/3434571.3434773)
|
||
|
||
[[36](/en/ch1#Majors2020-marker)] Charity Majors.
|
||
[The Future of Ops Jobs](https://acloudguru.com/blog/engineering/the-future-of-ops-jobs).
|
||
*acloudguru.com*, August 2020.
|
||
Archived at [perma.cc/GRU2-CZG3](https://perma.cc/GRU2-CZG3)
|
||
|
||
[[37](/en/ch1#Cherkasky2021-marker)] Boris Cherkasky.
|
||
[(Over)Pay
|
||
As You Go for Your Datastore](https://medium.com/riskified-technology/over-pay-as-you-go-for-your-datastore-11a29ae49a8b). *medium.com*, September 2021.
|
||
Archived at [perma.cc/Q8TV-2AM2](https://perma.cc/Q8TV-2AM2)
|
||
|
||
[[38](/en/ch1#Kushchi2023-marker)] Shlomi Kushchi.
|
||
[Serverless Doesn’t Mean
|
||
DevOpsLess or NoOps](https://thenewstack.io/serverless-doesnt-mean-devopsless-or-noops/). *thenewstack.io*, February 2023.
|
||
Archived at [perma.cc/3NJR-AYYU](https://perma.cc/3NJR-AYYU)
|
||
|
||
[[39](/en/ch1#Bernhardsson2021-marker)] Erik Bernhardsson.
|
||
[Storm
|
||
in the stratosphere: how the cloud will be reshuffled](https://erikbern.com/2021/11/30/storm-in-the-stratosphere-how-the-cloud-will-be-reshuffled.html). *erikbern.com*, November 2021.
|
||
Archived at [perma.cc/SYB2-99P3](https://perma.cc/SYB2-99P3)
|
||
|
||
[[40](/en/ch1#Stancil2021-marker)] Benn Stancil.
|
||
[The data OS](https://benn.substack.com/p/the-data-os). *benn.substack.com*,
|
||
September 2021. Archived at [perma.cc/WQ43-FHS6](https://perma.cc/WQ43-FHS6)
|
||
|
||
[[41](/en/ch1#Korolov2022-marker)] Maria Korolov.
|
||
[Data
|
||
residency laws pushing companies toward residency as a service](https://www.csoonline.com/article/3647761/data-residency-laws-pushing-companies-toward-residency-as-a-service.html). *csoonline.com*,
|
||
January 2022. Archived at [perma.cc/CHE4-XZZ2](https://perma.cc/CHE4-XZZ2)
|
||
|
||
[[42](/en/ch1#Borenstein2025-marker)] Severin Borenstein.
|
||
[Can
|
||
Data Centers Flex Their Power Demand?](https://energyathaas.wordpress.com/2025/04/14/can-data-centers-flex-their-power-demand/) *energyathaas.wordpress.com*, April 2025.
|
||
Archived at <https://perma.cc/MUD3-A6FF>
|
||
|
||
[[43](/en/ch1#Acun2023-marker)] Bilge Acun, Benjamin Lee, Fiodar Kazhamiaka, Aditya
|
||
Sundarrajan, Kiwan Maeng, Manoj Chakkaravarthy, David Brooks, and Carole-Jean Wu.
|
||
[Carbon Dependencies in
|
||
Datacenter Design and Management](https://hotcarbon.org/assets/2022/pdf/hotcarbon22-acun.pdf).
|
||
*ACM SIGENERGY Energy Informatics Review*, volume 3, issue 3, pages 21–26.
|
||
[doi:10.1145/3630614.3630619](https://doi.org/10.1145/3630614.3630619)
|
||
|
||
[[44](/en/ch1#Nath2019-marker)] Kousik Nath.
|
||
[These are
|
||
the numbers every computer engineer should know](https://www.freecodecamp.org/news/must-know-numbers-for-every-computer-engineer/). *freecodecamp.org*, September 2019.
|
||
Archived at [perma.cc/RW73-36RL](https://perma.cc/RW73-36RL)
|
||
|
||
[[45](/en/ch1#Hellerstein2019-marker)] Joseph M. Hellerstein, Jose Faleiro, Joseph E.
|
||
Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu.
|
||
[Serverless Computing: One Step Forward, Two Steps Back](https://arxiv.org/abs/1812.03651).
|
||
At *Conference on Innovative Data Systems Research* (CIDR), January 2019.
|
||
|
||
[[46](/en/ch1#McSherry2015_ch1-marker)] Frank McSherry, Michael Isard, and Derek G. Murray.
|
||
[Scalability!
|
||
But at What COST?](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-mcsherry.pdf) At *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS),
|
||
May 2015.
|
||
|
||
[[47](/en/ch1#Sridharan2018-marker)] Cindy Sridharan.
|
||
*[Distributed
|
||
Systems Observability: A Guide to Building Robust Systems](https://unlimited.humio.com/rs/756-LMY-106/images/Distributed-Systems-Observability-eBook.pdf)*. Report, O’Reilly Media, May 2018.
|
||
Archived at [perma.cc/M6JL-XKCM](https://perma.cc/M6JL-XKCM)
|
||
|
||
[[48](/en/ch1#Majors2019-marker)] Charity Majors.
|
||
[Observability — A 3-Year
|
||
Retrospective](https://thenewstack.io/observability-a-3-year-retrospective/). *thenewstack.io*, August 2019.
|
||
Archived at [perma.cc/CG62-TJWL](https://perma.cc/CG62-TJWL)
|
||
|
||
[[49](/en/ch1#Sigelman2010-marker)] Benjamin H. Sigelman, Luiz André Barroso, Mike
|
||
Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag.
|
||
[Dapper, a Large-Scale Distributed Systems Tracing
|
||
Infrastructure](https://research.google/pubs/pub36356/). Google Technical Report dapper-2010-1, April 2010.
|
||
Archived at [perma.cc/K7KU-2TMH](https://perma.cc/K7KU-2TMH)
|
||
|
||
[[50](/en/ch1#Laigner2021-marker)] Rodrigo Laigner, Yongluan Zhou, Marcos Antonio
|
||
Vaz Salles, Yijian Liu, and Marcos Kalinowski.
|
||
[Data management in microservices: State
|
||
of the practice, challenges, and research directions](https://www.vldb.org/pvldb/vol14/p3348-laigner.pdf). *Proceedings of the VLDB Endowment*,
|
||
volume 14, issue 13, pages 3348–3361, September 2021.
|
||
[doi:10.14778/3484224.3484232](https://doi.org/10.14778/3484224.3484232)
|
||
|
||
[[51](/en/ch1#Tigani2023-marker)] Jordan Tigani.
|
||
[Big Data is Dead](https://motherduck.com/blog/big-data-is-dead/).
|
||
*motherduck.com*, February 2023.
|
||
Archived at [perma.cc/HT4Q-K77U](https://perma.cc/HT4Q-K77U)
|
||
|
||
[[52](/en/ch1#Newman2021_ch1-marker)] Sam Newman.
|
||
[*Building
|
||
Microservices*, second edition](https://www.oreilly.com/library/view/building-microservices-2nd/9781492034018/). O’Reilly Media, 2021. ISBN: 9781492034025
|
||
|
||
[[53](/en/ch1#Richardson2014-marker)] Chris Richardson.
|
||
[Microservices: Decomposing
|
||
Applications for Deployability and Scalability](https://www.infoq.com/articles/microservices-intro/). *infoq.com*, May 2014.
|
||
Archived at [perma.cc/CKN4-YEQ2](https://perma.cc/CKN4-YEQ2)
|
||
|
||
[[54](/en/ch1#Shahrad2020-marker)] Mohammad Shahrad, Rodrigo Fonseca, Íñigo Goiri,
|
||
Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, Ricardo Bianchini.
|
||
[Serverless in the Wild:
|
||
Characterizing and Optimizing the Serverless Workload at a Large Cloud Provider](https://www.usenix.org/system/files/atc20-shahrad.pdf).
|
||
At *USENIX Annual Technical Conference* (ATC), July 2020.
|
||
|
||
[[55](/en/ch1#Barroso2018-marker)] Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan.
|
||
[The Datacenter as a
|
||
Computer: Designing Warehouse-Scale Machines](https://www.morganclaypool.com/doi/10.2200/S00874ED3V01Y201809CAC046), third edition.
|
||
Morgan & Claypool Synthesis Lectures on Computer Architecture, October 2018.
|
||
[doi:10.2200/S00874ED3V01Y201809CAC046](https://doi.org/10.2200/S00874ED3V01Y201809CAC046)
|
||
|
||
[[56](/en/ch1#Fiala2012-marker)] David Fiala, Frank Mueller, Christian Engelmann, Rolf
|
||
Riesen, Kurt Ferreira, and Ron Brightwell.
|
||
[Detection and
|
||
Correction of Silent Data Corruption for Large-Scale High-Performance Computing](https://arcb.csc.ncsu.edu/~mueller/ftp/pub/mueller/papers/sc12.pdf),” at
|
||
*International Conference for High Performance Computing, Networking, Storage and
|
||
Analysis* (SC), November 2012.
|
||
[doi:10.1109/SC.2012.49](https://doi.org/10.1109/SC.2012.49)
|
||
|
||
[[57](/en/ch1#KornfeldSimpson2020-marker)] Anna Kornfeld
|
||
Simpson, Adriana Szekeres, Jacob Nelson, and Irene Zhang.
|
||
[Securing RDMA
|
||
for High-Performance Datacenter Storage Systems](https://www.usenix.org/conference/hotcloud20/presentation/kornfeld-simpson). At *12th USENIX Workshop on Hot Topics in
|
||
Cloud Computing* (HotCloud), July 2020.
|
||
|
||
[[58](/en/ch1#Singh2015-marker)] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson,
|
||
Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala,
|
||
Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, and Amin Vahdat.
|
||
[Jupiter Rising: A
|
||
Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network](https://conferences.sigcomm.org/sigcomm/2015/pdf/papers/p183.pdf). At
|
||
*Annual Conference of the ACM Special Interest Group on Data Communication* (SIGCOMM), August 2015.
|
||
[doi:10.1145/2785956.2787508](https://doi.org/10.1145/2785956.2787508)
|
||
|
||
[[59](/en/ch1#Lockwood2014-marker)] Glenn K. Lockwood.
|
||
[Hadoop’s
|
||
Uncomfortable Fit in HPC](https://blog.glennklockwood.com/2014/05/hadoops-uncomfortable-fit-in-hpc.html). *glennklockwood.blogspot.co.uk*, May 2014.
|
||
Archived at [perma.cc/S8XX-Y67B](https://perma.cc/S8XX-Y67B)
|
||
|
||
[[60](/en/ch1#ONeil2016_ch1-marker)] Cathy O’Neil: *Weapons of Math Destruction:
|
||
How Big Data Increases Inequality and Threatens Democracy*. Crown Publishing, 2016.
|
||
ISBN: 9780553418811
|
||
|
||
[[61](/en/ch1#Shastri2020-marker)] Supreeth Shastri, Vinay Banakar, Melissa
|
||
Wasserman, Arun Kumar, and Vijay Chidambaram.
|
||
[Understanding and Benchmarking the
|
||
Impact of GDPR on Database Systems](https://www.vldb.org/pvldb/vol13/p1064-shastri.pdf). *Proceedings of the VLDB Endowment*, volume 13, issue
|
||
7, pages 1064–1077, March 2020.
|
||
[doi:10.14778/3384345.3384354](https://doi.org/10.14778/3384345.3384354)
|
||
|
||
[[62](/en/ch1#Datensparsamkeit-marker)] Martin Fowler.
|
||
[Datensparsamkeit](https://www.martinfowler.com/bliki/Datensparsamkeit.html).
|
||
*martinfowler.com*, December 2013.
|
||
Archived at [perma.cc/R9QX-CME6](https://perma.cc/R9QX-CME6)
|
||
|
||
[[63](/en/ch1#GDPR-marker)] [Regulation
|
||
(EU) 2016/679 of the European Parliament and of the Council of 27 April 2016 (General Data
|
||
Protection Regulation)](https://eur-lex.europa.eu/legal-content/EN/TXT/HTML/?uri=CELEX:32016R0679&from=EN). *Official Journal of the European Union* L 119/1, May 2016.
|
||
|