ddia/content/en/ch4.md

---
title: "4. Storage and Retrieval"
weight: 104
breadcrumbs: false
---

> *One of the miseries of life is that everybody names things a little bit wrong. And so it makes
> everything a little harder to understand in the world than it would be if it were named
> differently. A computer does not primarily compute in the sense of doing arithmetic. […] They
> primarily are filing systems.*
>
> [Richard Feynman](https://www.youtube.com/watch?v=EKWGGDXe5MA&t=296s),
> *Idiosyncratic Thinking* seminar (1985)

On the most fundamental level, a database needs to do two things: when you give it some data, it
should store the data, and when you ask it again later, it should give the data back to you.

In [Chapter 3](/en/ch3#ch_datamodels) we discussed data models and query languages—i.e., the format in which you give
the database your data, and the interface through which you can ask for it again later. In this
chapter we discuss the same from the database’s point of view: how the database can store the data
that you give it, and how it can find the data again when you ask for it.

Why should you, as an application developer, care how the database handles storage and retrieval
internally? You’re probably not going to implement your own storage engine from scratch, but you
*do* need to select a storage engine that is appropriate for your application, from the many that
are available. In order to configure a storage engine to perform well on your kind of workload, you
need to have a rough idea of what the storage engine is doing under the hood.

In particular, there is a big difference between storage engines that are optimized for
transactional workloads (OLTP) and those that are optimized for analytics (we introduced this
distinction in [“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics)). This chapter starts by examining two families of
storage engines for OLTP: *log-structured* storage engines that write out immutable data files, and
storage engines such as *B-trees* that update data in-place. These structures are used for both
key-value storage as well as secondary indexes.

Later in [“Data Storage for Analytics”](/en/ch4#sec_storage_analytics) we’ll discuss a family of storage engines that is optimized for
analytics, and in [“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional) we’ll briefly look at indexes for more advanced
queries, such as text retrieval.

# Storage and Indexing for OLTP

Consider the world’s simplest database, implemented as two Bash functions:

```
#!/bin/bash

db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
```

These two functions implement a key-value store. You can call `db_set key value`, which will store
`key` and `value` in the database. The key and value can be (almost) anything you like—for
example, the value could be a JSON document. You can then call `db_get key`, which looks up the most
recent value associated with that particular key and returns it.

And it works:

```
$ db_set 12 '{"name":"London","attractions":["Big Ben","London Eye"]}'

$ db_set 42 '{"name":"San Francisco","attractions":["Golden Gate Bridge"]}'

$ db_get 42
{"name":"San Francisco","attractions":["Golden Gate Bridge"]}

```

The storage format is very simple: a text file where each line contains a key-value pair, separated
by a comma (roughly like a CSV file, ignoring escaping issues). Every call to `db_set` appends to
the end of the file. If you update a key several times, old versions of the value are not
overwritten—you need to look at the last occurrence of a key in a file to find the latest value
(hence the `tail -n 1` in `db_get`):

```
$ db_set 42 '{"name":"San Francisco","attractions":["Exploratorium"]}'

$ db_get 42
{"name":"San Francisco","attractions":["Exploratorium"]}

$ cat database
12,{"name":"London","attractions":["Big Ben","London Eye"]}
42,{"name":"San Francisco","attractions":["Golden Gate Bridge"]}
42,{"name":"San Francisco","attractions":["Exploratorium"]}

```

The `db_set` function actually has pretty good performance for something that is so simple, because
appending to a file is generally very efficient. Similarly to what `db_set` does, many databases
internally use a *log*, which is an append-only data file. Real databases have more issues to deal
with (such as handling concurrent writes, reclaiming disk space so that the log doesn’t grow
forever, and handling partially written records when recovering from a crash), but the basic
principle is the same. Logs are incredibly useful, and we will encounter them several times in this
book.

###### Note

The word *log* is often used to refer to application logs, where an application outputs text that
describes what’s happening. In this book, *log* is used in the more general sense: an append-only
sequence of records on disk. It doesn’t have to be human-readable; it might be binary and intended
only for internal use by the database system.

On the other hand, the `db_get` function has terrible performance if you have a large number of
records in your database. Every time you want to look up a key, `db_get` has to scan the entire
database file from beginning to end, looking for occurrences of the key. In algorithmic terms, the
cost of a lookup is *O*(*n*): if you double the number of records *n* in your database, a lookup
takes twice as long. That’s not good.

In order to efficiently find the value for a particular key in the database, we need a different
data structure: an *index*. In this chapter we will look at a range of indexing structures and see
how they compare; the general idea is to structure the data in a particular way (e.g., sorted by
some key) that makes it faster to locate the data you want. If you want to search the same data in
several different ways, you may need several different indexes on different parts of the data.

An index is an *additional* structure that is derived from the primary data. Many databases allow
you to add and remove indexes, and this doesn’t affect the contents of the database; it only affects
the performance of queries. Maintaining additional structures incurs overhead, especially on writes. For
writes, it’s hard to beat the performance of simply appending to a file, because that’s the simplest
possible write operation. Any kind of index usually slows down writes, because the index also needs
to be updated every time data is written.

This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but
every index consumes additional disk space and slows down writes, sometimes substantially
[[1](/en/ch4#Samokhvalov2021)].
For this reason, databases don’t usually index everything by default, but require you—the person
writing the application or administering the database—to choose indexes manually, using your
knowledge of the application’s typical query patterns. You can then choose the indexes that give
your application the greatest benefit, without introducing more overhead on writes than necessary.

## Log-Structured Storage

To start, let’s assume that you want to continue storing data in the append-only file written by
`db_set`, and you just want to speed up reads. One way you could do this is by keeping a hash map in
memory, in which every key is mapped to the byte offset in the file at which the most recent value
for that key can be found, as illustrated in [Figure 4-1](/en/ch4#fig_storage_csv_hash_index).

![ddia 0401](/fig/ddia_0401.png)

###### Figure 4-1. Storing a log of key-value pairs in a CSV-like format, indexed with an in-memory hash map.

Whenever you append a new key-value pair to the file, you also update the hash map to reflect the
offset of the data you just wrote. When you want to look up a value, you use the hash map to find
the offset in the log file, seek to that location, and read the value. If that part of the data file
is already in the filesystem cache, a read doesn’t require any disk I/O at all.

This approach is much faster, but it still suffers from several problems:

* You never free up disk space occupied by old log entries that have been overwritten; if you keep
  writing to the database you might run out of disk space.
* The hash map is not persisted, so you have to rebuild it when you restart the database—for
  example, by scanning the whole log file to find the latest byte offset for each key. This makes
  restarts slow if you have a lot of data.
* The hash table must fit in memory. In principle, you could maintain a hash table on disk, but
  unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of
  random access I/O, it is expensive to grow when it becomes full, and hash collisions require
  fiddly logic [[2](/en/ch4#Graefe2011)].
* Range queries are not efficient. For example, you cannot easily scan over all keys between `10000`
  and `19999`—you’d have to look up each key individually in the hash map.

### The SSTable file format

In practice, hash tables are not used very often for database indexes, and instead it is much more
common to keep data in a structure that is *sorted by key*
[[3](/en/ch4#Jones2019)].
One example of such a structure is a *Sorted String Table*, or *SSTable* for short, as shown in
[Figure 4-2](/en/ch4#fig_storage_sstable_index). This file format also stores key-value pairs, but it ensures that
they are sorted by key, and each key only appears once in the file.

![ddia 0402](/fig/ddia_0402.png)

###### Figure 4-2. An SSTable with a sparse index, allowing queries to jump to the right block.

Now you do not need to keep all the keys in memory: you can group the key-value pairs within an
SSTable into *blocks* of a few kilobytes, and then store the first key of each block in the index.
This kind of index, which stores only some of the keys, is called *sparse*. This index is stored in
a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data
structure that allows queries to quickly look up a particular key
[[4](/en/ch4#Lambov2022a)].

For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the
first key of the next block is `handsome`. Now say you’re looking for the key `handiwork`, which
doesn’t appear in the sparse index. Because of the sorting you know that `handiwork` must appear
between `handbag` and `handsome`. This means you can seek to the offset for `handbag` and scan the
file from there until you find `handiwork` (or not, if the key is not present in the file). A block
of a few kilobytes can be scanned very quickly.

Moreover, each block of records can be compressed (indicated by the shaded area in
[Figure 4-2](/en/ch4#fig_storage_sstable_index)). Besides saving disk space, compression also reduces the I/O
bandwidth use, at the cost of using a bit more CPU time.

### Constructing and merging SSTables

The SSTable file format is better for reading than an append-only log, but it makes writes more
difficult. We can’t simply append at the end, because then the file would no longer be sorted
(unless the keys happen to be written in ascending order). If we had to rewrite the whole SSTable
every time a key is inserted somewhere in the middle, writes would become far too expensive.

We can solve this problem with a *log-structured* approach, which is a hybrid between an append-only
log and a sorted file:

1. When a write comes in, add it to an in-memory ordered map data structure, such as a red-black
   tree, skip list [[5](/en/ch4#Cormen2009)], or trie
   [[6](/en/ch4#Lambov2022b)].
   With these data structures, you can insert keys in any order, look them up efficiently, and read
   them back in sorted order. This in-memory data structure is called the *memtable*.
2. When the memtable gets bigger than some threshold—typically a few megabytes—write it out to
   disk in sorted order as an SSTable file. We call this new SSTable file the most recent *segment*
   of the database, and it is stored as a separate file alongside the older segments. Each segment
   has a separate index of its contents. While the new segment is being written out to disk, the
   database can continue writing to a new memtable instance, and the old memtable’s memory is freed
   when the writing of the SSTable is complete.
3. In order to read the value for some key, first try to find the key in the memtable and the most
   recent on-disk segment. If it’s not there, look in the next-older segment, etc. until you either
   find the key or reach the oldest segment. If the key does not appear in any of the segments, it
   does not exist in the database.
4. From time to time, run a merging and compaction process in the background to combine segment files
   and to discard overwritten or deleted values.

Merging segments works similarly to the *mergesort* algorithm
[[5](/en/ch4#Cormen2009)]. The process is illustrated in
[Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key
in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If
the same key appears in more than one input file, keep only the more recent value. This produces a
new merged segment file, also sorted by key, with one value per key, and it uses minimal memory
because we can iterate over the SSTables one key at a time.

![ddia 0403](/fig/ddia_0403.png)

###### Figure 4-3. Merging several SSTable segments, retaining only the most recent value for each key.

To ensure that the data in the memtable is not lost if the database crashes, the storage engine
keeps a separate log on disk to which every write is immediately appended. This log is not sorted by
key, but that doesn’t matter, because its only purpose is to restore the memtable after a crash.
Every time the memtable has been written out to an SSTable, the corresponding part of the log can be
discarded.

If you want to delete a key and its associated value, you have to append a special deletion record
called a *tombstone* to the data file. When log segments are merged, the tombstone tells the merging
process to discard any previous values for the deleted key. Once the tombstone is merged into the
oldest segment, it can be dropped.

The algorithm described here is essentially what is used in RocksDB
[[7](/en/ch4#Borthakur2013)],
Cassandra, Scylla, and HBase
[[8](/en/ch4#Bertozzi2012)],
all of which were inspired by Google’s Bigtable paper
[[9](/en/ch4#Chang2006_ch4)]
(which introduced the terms *SSTable* and *memtable*).

The algorithm was originally published in 1996 under the name *Log-Structured Merge-Tree* or *LSM-Tree*
[[10](/en/ch4#ONeil1996)],
building on earlier work on log-structured filesystems
[[11](/en/ch4#Rosenblum1992)].
For this reason, storage engines that are based on the principle of merging and compacting sorted
files are often called *LSM storage engines*.

In LSM storage engines, a segment file is written in one pass (either by writing out the memtable or
by merging some existing segments), and thereafter it is immutable. The merging and compaction of
segments can be done in a background thread, and while it is going on, we can still continue to
serve reads using the old segment files. When the merging process is complete, we switch read
requests to using the new merged segment instead of the old segments, and then the old segment files
can be deleted.

The segment files don’t necessarily have to be stored on local disk: they are also well suited for
writing to object storage. SlateDB and Delta Lake
[[12](/en/ch4#Armbrust2020)].
take this approach, for example.

Having immutable segment files also simplifies crash recovery: if a crash happens while writing out
the memtable or while merging segments, the database can just delete the unfinished SSTable and
start afresh. The log that persists writes to the memtable could contain incomplete records if there
was a crash halfway through writing a record, or if the disk was full; these are typically detected
by including checksums in the log, and discarding corrupted or incomplete log entries. We will talk
more about durability and crash recovery in [Chapter 8](/en/ch8#ch_transactions).

### Bloom filters

With LSM storage it can be slow to read a key that was last updated a long time ago, or that does
not exist, since the storage engine needs to check several segment files. In order to speed up such
reads, LSM storage engines often include a *Bloom filter*
[[13](/en/ch4#Bloom1970)]
in each segment, which provides a fast but approximate way of checking whether a particular key
appears in a particular SSTable.

[Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in
reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash
function, producing a set of numbers that are then interpreted as indexes into the array of bits
[[14](/en/ch4#Kirsch2008)].
We set the bits corresponding to those indexes to 1, and leave the rest as 0. For example, the key
`handbag` hashes to the numbers (2, 9, 4), so we set the 2nd, 9th, and 4th bits to 1. The bitmap
is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of
extra space, but the Bloom filter is generally small compared to the rest of the SSTable.

![ddia 0404](/fig/ddia_0404.png)

###### Figure 4-4. A Bloom filter provides a fast, probabilistic check whether a particular key exists in a particular SSTable.

When we want to know whether a key appears in the SSTable, we compute the same hash of that key as
before, and check the bits at those indexes. For example, in [Figure 4-4](/en/ch4#fig_storage_bloom), we’re querying
the key `handheld`, which hashes to (6, 11, 2). One of those bits is 1 (namely, bit number 2),
while the other two are 0. These checks can be made extremely fast using the bitwise operations that
all CPUs support.

If at least one of the bits is 0, we know that the key definitely does not appear in the SSTable.
If the bits in the query are all 1, it’s likely that the key is in the SSTable, but it’s also
possible that by coincidence all of those bits were set to 1 by other keys. This case when it looks
as if a key is present, even though it isn’t, is called a *false positive*.

The probability of false positives depends on the number of keys, the number of bits set per key,
and the total number of bits in the Bloom filter. You can use an online calculator tool to work out
the right parameters for your application
[[15](/en/ch4#Hurst2023)].
As a rule of thumb, you need to allocate 10 bits of Bloom filter space for every key in the SSTable
to get a false positive probability of 1%, and the probability is reduced tenfold for every 5
additional bits you allocate per key.

In the context of an LSM storage engines, false positives are no problem:

* If the Bloom filter says that a key *is not* present, we can safely skip that SSTable, since we
  can be sure that it doesn’t contain the key.
* If the Bloom filter says the key *is* present, we have to consult the sparse index and decode the
  block of key-value pairs to check whether the key really is there. If it was a false positive, we
  have done a bit of unnecessary work, but otherwise no harm is done—we just continue the search
  with the next-oldest segment.

### Compaction strategies

An important detail is how the LSM storage chooses when to perform compaction, and which SSTables to
include in a compaction. Many LSM-based storage systems allow you to configure which compaction
strategy to use, and some of the common choices are
[[16](/en/ch4#Luo2019),
[17](/en/ch4#Sarkar2022)]:

Size-tiered compaction
:   Newer and smaller SSTables are successively merged into older and larger SSTables. The SSTables
    containing older data can get very large, and merging them requires a lot of temporary disk space.
    The advantage of this strategy is that it can handle very high write throughput.

Leveled compaction
:   The key range is split up into smaller SSTables and older data is moved into separate “levels,”
    which allows the compaction to proceed more incrementally and use less disk space than the
    size-tiered strategy. This strategy is more efficient for reads than size-tiered compaction
    because the storage engine needs to read fewer SSTables to check whether they contain the key.

As a rule of thumb, size-tiered compaction performs better if you have mostly writes and few reads,
whereas leveled compaction performs better if your workload is dominated by reads. If you write a
small number of keys frequently and a large number of keys rarely, then leveled compaction can also
be advantageous [[18](/en/ch4#Callaghan2018)].

Even though there are many subtleties, the basic idea of LSM-trees—keeping a cascade of SSTables
that are merged in the background—is simple and effective. We discuss their performance
characteristics in more detail in [“Comparing B-Trees and LSM-Trees”](/en/ch4#sec_storage_btree_lsm_comparison).

# Embedded storage engines

Many databases run as a service that accepts queries over a network, but there are also *embedded*
databases that don’t expose a network API. Instead, they are libraries that run in the same process
as your application code, typically reading and writing files on the local disk, and you interact
with them through normal function calls. Examples of embedded storage engines include RocksDB,
SQLite, LMDB, DuckDB, and KùzuDB
[[19](/en/ch4#Rao2023)].

Embedded databases are very commonly used in mobile apps to store the local user’s data. On the
backend, they can be an appropriate choice if the data is small enough to fit on a single machine,
and if there are not many concurrent transactions. For example, in a multitenant system in which
each tenant is small enough and completely separate from others (i.e., you do not need to run
queries that combine data from multiple tenants), you can potentially use a separate embedded
database instance per tenant
[[20](/en/ch4#BlueskySQLite)].

The storage and retrieval methods we discuss in this chapter are used in both embedded and in
client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques
for scaling a database across multiple machines.

## B-Trees

The log-structured approach is popular, but it is not the only form of key-value storage. The most
widely used structure for reading and writing database records by key is the *B-tree*.

Introduced in 1970 [[21](/en/ch4#Bayer1970)]
and called “ubiquitous” less than 10 years later
[[22](/en/ch4#Comer1979)],
B-trees have stood the test of time very well. They remain the standard index implementation in
almost all relational databases, and many nonrelational databases use them too.

Like SSTables, B-trees keep key-value pairs sorted by key, which allows efficient key-value lookups
and range queries. But that’s where the similarity ends: B-trees have a very different design
philosophy.

The log-structured indexes we saw earlier break the database down into variable-size *segments*,
typically several megabytes or more in size, that are written once and are then immutable. By
contrast, B-trees break the database down into fixed-size *blocks* or *pages*, and may overwrite a
page in-place. A page is traditionally 4 KiB in size, but PostgreSQL now uses 8 KiB and
MySQL uses 16 KiB by default.

Each page can be identified using a page number, which allows one page to refer to another—similar
to a pointer, but on disk instead of in memory. If all the pages are stored in the same file,
multiplying the page number by the page size gives us the byte offset in the file where the page is
located. We can use these page references to construct a tree of pages, as illustrated in
[Figure 4-5](/en/ch4#fig_storage_b_tree).

![ddia 0405](/fig/ddia_0405.png)

###### Figure 4-5. Looking up the key 251 using a B-tree index. From the root page we first follow the reference to the page for keys 200–300, then the page for keys 250–270.

One page is designated as the *root* of the B-tree; whenever you want to look up a key in the index,
you start here. The page contains several keys and references to child pages.
Each child is responsible for a continuous range of keys, and the keys between the references indicate
where the boundaries between those ranges lie.
(This structure is sometimes called a B+ tree, but we don’t need to distinguish it
from other B-tree variants.)

In the example in [Figure 4-5](/en/ch4#fig_storage_b_tree), we are looking for the key 251, so we know that we need to
follow the page reference between the boundaries 200 and 300. That takes us to a similar-looking
page that further breaks down the 200–300 range into subranges. Eventually we get down to a
page containing individual keys (a *leaf page*), which either contains the value for each key
inline or contains references to the pages where the values can be found.

The number of references to child pages in one page of the B-tree is called the *branching factor*.
For example, in [Figure 4-5](/en/ch4#fig_storage_b_tree) the branching factor is six. In practice, the branching
factor depends on the amount of space required to store the page references and the range
boundaries, but typically it is several hundred.

If you want to update the value for an existing key in a B-tree, you search for the leaf page
containing that key, and overwrite that page on disk with a version that contains the new value.
If you want to add a new key, you need to find the page whose range encompasses the new key and add
it to that page. If there isn’t enough free space in the page to accommodate the new key, the page
is split into two half-full pages, and the parent page is updated to account for the new subdivision
of key ranges.

![ddia 0406](/fig/ddia_0406.png)

###### Figure 4-6. Growing a B-tree by splitting a page on the boundary key 337. The parent page is updated to reference both children.

In the example of [Figure 4-6](/en/ch4#fig_storage_b_tree_split), we want to insert the key 334, but the page for the
range 333–345 is already full. We therefore split it into a page for the range 333–337 (including
the new key), and a page for 337–344. We also have to update the parent page to have references to
both children, with a boundary value of 337 between them. If the parent page doesn’t have enough
space for the new reference, it may also need to be split, and the splits can continue all the way
to the root of the tree. When the root is split, we make a new root above it. Deleting keys (which
may require nodes to be merged) is more complex
[[5](/en/ch4#Cormen2009)].

This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth
of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so
you don’t need to follow many page references to find the page you are looking for. (A four-level
tree of 4 KiB pages with a branching factor of 500 can store up to 250 TB.)

### Making B-trees reliable

The basic underlying write operation of a B-tree is to overwrite a page on disk with new data. It is
assumed that the overwrite does not change the location of the page; i.e., all references to that
page remain intact when the page is overwritten. This is in stark contrast to log-structured indexes
such as LSM-trees, which only append to files (and eventually delete obsolete files) but never
modify files in place.

Overwriting several pages at once, like in a page split, is a dangerous operation: if the database
crashes after only some of the pages have been written, you end up with a corrupted tree (e.g.,
there may be an *orphan* page that is not a child of any parent). If the hardware can’t atomically
write an entire page, you can also end up with a partially written page (this is known as a *torn
page* [[23](/en/ch4#Miller2025)]).

In order to make the database resilient to crashes, it is common for B-tree implementations to
include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file
to which every B-tree modification must be written before it can be applied to the pages of the tree
itself. When the database comes back up after a crash, this log is used to restore the B-tree back
to a consistent state [[2](/en/ch4#Graefe2011),
[24](/en/ch4#Mohan1992)].
In filesystems, the equivalent mechanism is known as *journaling*.

To improve performance, B-tree implementations typically don’t immediately write every modified page
to disk, but buffer the B-tree pages in memory for a while first. The write-ahead log then also
ensures that data is not lost in the case of a crash: as long as data has been written to the WAL,
and flushed to disk using the `fsync()` system call, the data will be durable as the database will
be able to recover it after a crash [[25](/en/ch4#Suzuki2017_ch4)].

### B-tree variants

As B-trees have been around for so long, many variants have been developed over the years. To
mention just a few:

* Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB)
  use a copy-on-write scheme [[26](/en/ch4#Chu2014)].
  A modified page is written to a different location, and a new version of the parent pages in the tree
  is created, pointing at the new location. This approach is also useful for concurrency control, as we shall
  see in [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation).
* We can save space in pages by not storing the entire key, but abbreviating it. Especially in pages
  on the interior of the tree, keys only need to provide enough information to act as boundaries
  between key ranges. Packing more keys into a page allows the tree to have a higher branching
  factor, and thus fewer levels.
* To speed up scans over the key range in sorted order, some B-tree implementations try to lay out
  the tree so that leaf pages appear in sequential order on disk, reducing the number of disk seeks.
  However, it’s difficult to maintain that order as the tree grows.
* Additional pointers have been added to the tree. For example, each leaf page may have references to
  its sibling pages to the left and right, which allows scanning keys in order without jumping back
  to parent pages.

## Comparing B-Trees and LSM-Trees

As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads
[[27](/en/ch4#Athanassoulis2016),
[28](/en/ch4#Stopford2015)].
However, benchmarks are often sensitive to details of the workload. You need to test systems with
your particular workload in order to make a valid comparison. Moreover, it’s not a strict either/or
choice between LSM and B-trees: storage engines sometimes blend characteristics of both approaches,
for example by having multiple B-trees and merging them LSM-style. In this section we will briefly
discuss a few things that are worth considering when measuring the performance of a storage engine.

### Read performance

In a B-tree, looking up a key involves reading one page at each level of the B-tree. Since the
number of levels is usually quite small, this means that reads from a B-tree are generally fast and
have predictable performance. In an LSM storage engine, reads often have to check several different
SSTables at different stages of compaction, but Bloom filters help reduce the number of actual disk
I/O operations required. Both approaches can perform well, and which is faster depends on the
details of the storage engine and the workload.

Range queries are simple and fast on B-trees, as they can use the sorted structure of the tree. On
LSM storage, range queries can also take advantage of the SSTable sorting, but they need to scan all
the segments in parallel and combine the results. Bloom filters don’t help for range queries (since
you would need to compute the hash of every possible key within the range, which is impractical),
making range queries more expensive than point queries in the LSM approach
[[29](/en/ch4#Callaghan2016lsm)].

High write throughput can cause latency spikes in a log-structured storage engine if the
memtable fills up. This happens if data can’t be written out to disk fast enough, perhaps because
the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB,
perform *backpressure* in this situation: they suspend all reads and writes until the memtable has
been written out to disk
[[30](/en/ch4#Balmau2019),
[31](/en/ch4#RocksDBTuning)].

Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read
requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but
storage engines need to be carefully designed to take advantage of this parallelism
[[32](/en/ch4#Haas2023)].

### Sequential vs. random writes

With a B-tree, if the application writes keys that are scattered all over the key space, the
resulting disk operations are also scattered randomly, since the pages that the storage engine needs
to overwrite could be located anywhere on disk. On the other hand, a log-structured storage engine
writes entire segment files at a time (either writing out the memtable or while compacting existing
segments), which are much bigger than a page in a B-tree.

The pattern of many small, scattered writes (as found in B-trees) is called *random writes*, while
the pattern of fewer large writes (as found in LSM-trees) is called *sequential writes*. Disks
generally have higher sequential write throughput than random write throughput, which means that a
log-structured storage engine can generally handle higher write throughput on the same hardware than
a B-tree. This difference is particularly big on spinning-disk hard drives (HDDs); on the solid
state drives (SSDs) that most databases use today, the difference is smaller, but still noticeable
(see [“Sequential vs. Random Writes on SSDs”](/en/ch4#sidebar_sequential)).

# Sequential vs. Random Writes on SSDs

On spinning-disk hard drives (HDDs), sequential writes are much faster than random writes: a random
write has to mechanically move the disk head to a new position and wait for the right part of the
platter to pass underneath the disk head, which takes several milliseconds—an eternity in computing
timescales. However, SSDs (solid-state drives) including NVMe (Non-Volatile Memory Express, i.e.
flash memory attached to the PCI Express bus) have now overtaken HDDs for many use cases, and they
are not subject to such mechanical limitations.

Nevertheless, SSDs also have higher throughput for sequential writes than for than random writes.
The reason is that flash memory can be read or written one page (typically 4 KiB) at a time,
but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block
may contain valid data, whereas others may contain data that is no longer needed. Before erasing a
block, the controller must first move pages containing valid data into other blocks; this process is
called *garbage collection* (GC)
[[33](/en/ch4#Goossaert2014)].

A sequential write workload writes larger chunks of data at a time, so it is likely that a whole
512 KiB block belongs to a single file; when that file is later deleted again, the whole block
can be erased without having to perform any GC. On the other hand, with a random write workload, it
is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
to perform more work before a block can be erased
[[34](/en/ch4#Vanlightly2023nvme),
[35](/en/ch4#Alibaba2019_ch4),
[36](/en/ch4#Hu2010)].

The write bandwidth consumed by GC is then not available for the application. Moreover, the
additional writes performed by GC contribute to wear on the flash memory; therefore, random writes
wear out the drive faster than sequential writes.

### Write amplification

With any type of storage engine, one write request from the application turns into multiple I/O
operations on the underlying disk. With LSM-trees, a value is first written to the log for
durability, then again when the memtable is written to disk, and again every time the key-value pair
is part of a compaction. (If the values are significantly larger than the keys, this overhead can be
reduced by storing values separately from keys, and performing compaction only on SSTables
containing keys and references to values
[[37](/en/ch4#Lu2016)].)

A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once
to the tree page itself. In addition, they sometimes need to write out an entire page, even if only
a few bytes in that page changed, to ensure the B-tree can be correctly recovered after a crash or
power failure [[38](/en/ch4#Zaitsev2006),
[39](/en/ch4#Vondra2016)].

If you take the total number of bytes written to disk in some workload, and divide by the number of
bytes you would have to write if you simply wrote an append-only log with no index, you get the
*write amplification*. (Sometimes write amplification is defined in terms of I/O operations rather
than bytes.) In write-heavy applications, the bottleneck might be the rate at which the database can
write to disk. In this case, the higher the write amplification, the fewer writes per second it can
handle within the available disk bandwidth.

Write amplification is a problem in both LSM-trees and B-trees. Which one is better depends on
various factors, such as the length of your keys and values, and how often you overwrite existing
keys versus insert new ones. For typical workloads, LSM-trees tend to have lower write amplification
because they don’t have to write entire pages and they can compress chunks of the SSTable
[[40](/en/ch4#Callaghan2015)].
This is another factor that makes LSM storage engines well suited for write-heavy workloads.

Besides affecting throughput, write amplification is also relevant for the wear on SSDs: a storage
engine with lower write amplification will wear out the SSD less quickly.

When measuring the write throughput of a storage engine, it is important to run the experiment for
long enough that the effects of write amplification become clear. When writing to an empty LSM-tree,
there are no compactions going on yet, so all of the disk bandwidth is available for new writes. As
the database grows, new writes need to share the disk bandwidth with compaction.

### Disk space usage

B-trees can become *fragmented* over time: for example, if a large number of keys are deleted, the
database file may contain a lot of pages that are no longer used by the B-tree. Subsequent additions
to the B-tree can use those free pages, but they can’t easily be returned to the operating system
because they are in the middle of the file, so they still take up space on the filesystem. Databases
therefore need a background process that moves pages around to place them better, such as the vacuum
process in PostgreSQL [[25](/en/ch4#Suzuki2017_ch4)].

Fragmentation is less of a problem in LSM-trees, since the compaction process periodically rewrites
the data files anyway, and SSTables don’t have pages with unused space. Moreover, blocks of
key-value pairs can better be compressed in SSTables, and thus often produce smaller files on disk
than B-trees. Keys and values that have been overwritten continue to consume space until they are
removed by a compaction, but this overhead is quite low when using leveled compaction
[[40](/en/ch4#Callaghan2015),
[41](/en/ch4#Callaghan2016rocksdb)].
Size-tiered compaction (see [“Compaction strategies”](/en/ch4#sec_storage_lsm_compaction)) uses more disk space, especially
temporarily during compaction.

Having multiple copies of some data on disk can also be a problem when you need to delete some data,
and be confident that it really has been deleted (perhaps to comply with data protection
regulations). For example, in most LSM storage engines a deleted record may still exist in the higher
levels until the tombstone representing the deletion has been propagated through all of the
compaction levels, which may take a long time. Specialist storage engine designs can propagate
deletions faster [[42](/en/ch4#Sarkar2023)].

On the other hand, the immutable nature of SSTable segment files is useful if you want to take a
snapshot of a database at some point in time (e.g. for a backup or to create a copy of the database
for testing): you can write out the memtable and record which segment files existed at that point in
time. As long as you don’t delete the files that are part of the snapshot, you don’t need to
actually copy them. In a B-tree whose pages are overwritten, taking such a snapshot efficiently is
more difficult.

## Multi-Column and Secondary Indexes

So far we have only discussed key-value indexes, which are like a *primary key* index in the
relational model. A primary key uniquely identifies one row in a relational table, or one document
in a document database, or one vertex in a graph database. Other records in the database can refer
to that row/document/vertex by its primary key (or ID), and the index is used to resolve such
references.

It is also very common to have *secondary indexes*. In relational databases, you can create several
secondary indexes on the same table using the `CREATE INDEX` command, allowing you to search by
columns other than the primary key. For example, in [Figure 3-1](/en/ch3#fig_obama_relational) in [Chapter 3](/en/ch3#ch_datamodels)
you would most likely have a secondary index on the `user_id` columns so that you can find all the
rows belonging to the same user in each of the tables.

A secondary index can easily be constructed from a key-value index. The main difference is that
in a secondary index, the indexed values are not necessarily unique; that is,
there might be many rows (documents, vertices) under the same index entry. This can be
solved in two ways: either by making each value in the index a list of matching row identifiers (like a
postings list in a full-text index) or by making each entry unique by appending a row identifier to
it. Storage engines with in-place updates, like B-trees, and log-structured storage can both be used
to implement an index.

### Storing values within the index

The key in an index is the thing that queries search by, but the value can be one of several things:

* If the actual data (row, document, vertex) is stored directly within the index structure, it is
  called a *clustered index*. For example, in MySQL’s InnoDB storage engine, the primary key of a
  table is always a clustered index, and in SQL Server, you can specify one clustered index per
  table [[43](/en/ch4#Fittl2025)].
* Alternatively, the value can be a reference to the actual data: either the primary key of the row
  in question (InnoDB does this for secondary indexes), or a direct reference to a location on disk.
  In the latter case, the place where rows are stored is known as a *heap file*, and it stores data
  in no particular order (it may be append-only, or it may keep track of deleted rows in order to
  overwrite them with new data later). For example, Postgres uses the heap file approach
  [[44](/en/ch4#Silcock2024)].
* A middle ground between the two is a *covering index* or *index with included columns*, which
  stores *some* of a table’s columns within the index, in addition to storing the full row on the
  heap or in the primary key clustered index [[45](/en/ch4#Webb2008)].
  This allows some queries to be answered by using the index alone, without having to resolve the
  primary key or look in the heap file (in which case, the index is said to *cover* the query).
  This can make some queries faster, but the duplication of data means the index uses more disk space and slows down
  writes.

The indexes discussed so far only map a single key to a value. If you need to query multiple columns
of a table (or multiple fields in a document) simultaneously, see [“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional).

When updating a value without changing the key, the heap file approach can allow the record to be
overwritten in place, provided that the new value is not larger than the old value. The situation is
more complicated if the new value is larger, as it probably needs to be moved to a new location in
the heap where there is enough space. In that case, either all indexes need to be updated to point
at the new heap location of the record, or a forwarding pointer is left behind in the old heap
location [[2](/en/ch4#Graefe2011)].

## Keeping everything in memory

The data structures discussed so far in this chapter have all been answers to the limitations of
disks. Compared to main memory, disks are awkward to deal with. With both magnetic disks and SSDs,
data on disk needs to be laid out carefully if you want good performance on reads and writes.
However, we tolerate this awkwardness because disks have two significant advantages: they are
durable (their contents are not lost if the power is turned off), and they have a lower cost per
gigabyte than RAM.

As RAM becomes cheaper, the cost-per-gigabyte argument is eroded. Many datasets are simply not that
big, so it’s quite feasible to keep them entirely in memory, potentially distributed across several
machines. This has led to the development of *in-memory databases*.

Some in-memory key-value stores, such as Memcached, are intended for caching use only, where it’s
acceptable for data to be lost if a machine is restarted. But other in-memory databases aim for
durability, which can be achieved with special hardware (such as battery-powered RAM), by writing a
log of changes to disk, by writing periodic snapshots to disk, or by replicating the in-memory state
to other machines.

When an in-memory database is restarted, it needs to reload its state, either from disk or over the
network from a replica (unless special hardware is used). Despite writing to disk, it’s still an
in-memory database, because the disk is merely used as an append-only log for durability, and reads
are served entirely from memory. Writing to disk also has operational advantages: files on disk can
easily be backed up, inspected, and analyzed by external utilities.

Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model,
and the vendors claim that they can offer big performance improvements by removing all the overheads
associated with managing on-disk data structures
[[46](/en/ch4#Stonebraker2007),
[47](/en/ch4#VoltDB2014uj)].
RAMCloud is an open source, in-memory key-value store with durability (using a log-structured
approach for the data in memory as well as the data on disk)
[[48](/en/ch4#Rumble2014)].

Redis and Couchbase provide weak durability by writing to disk asynchronously.

Counterintuitively, the performance advantage of in-memory databases is not due to the fact that
they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk
if you have enough memory, because the operating system caches recently used disk blocks in memory
anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data
structures in a form that can be written to disk
[[49](/en/ch4#Harizopoulos2008)].

Besides performance, another interesting area for in-memory databases is providing data models that
are difficult to implement with disk-based indexes. For example, Redis offers a database-like
interface to various data structures such as priority queues and sets. Because it keeps all data in
memory, its implementation is comparatively simple.

# Data Storage for Analytics

The data model of a data warehouse is most commonly relational, because SQL is generally a good fit
for analytic queries. There are many graphical data analysis tools that generate SQL queries,
visualize the results, and allow analysts to explore the data (through operations such as
*drill-down* and *slicing and dicing*).

On the surface, a data warehouse and a relational OLTP database look similar, because they both have
a SQL query interface. However, the internals of the systems can look quite different, because they
are optimized for very different query patterns. Many database vendors now focus on supporting
either transaction processing or analytics workloads, but not both.

Some databases, such as Microsoft SQL Server, SAP HANA, and SingleStore, have support for
transaction processing and data warehousing in the same product. However, these hybrid transactional
and analytical processing (HTAP) databases (introduced in [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are increasingly
becoming two separate storage and query engines, which happen to be accessible through a common SQL
interface
[[50](/en/ch4#Larson2013),
[51](/en/ch4#Farber2012),
[52](/en/ch4#Stonebraker2013),
[53](/en/ch4#Prout2022_ch4)].

## Cloud Data Warehouses

Data warehouse vendors such as Teradata, Vertica, and SAP HANA sell both on-premises warehouses
under commercial licenses and cloud-based solutions. But as many of their customers move to the
cloud, new cloud data warehouses such as Google Cloud BigQuery, Amazon Redshift, and Snowflake have
also become widely adopted. Unlike traditional data warehouses, cloud data warehouses take advantage
of scalable cloud infrastructure like object storage and serverless computation platforms.

Cloud data warehouses tend to integrate better with other cloud services and to be more elastic.
For example, many cloud warehouses support automatic log ingestion, and offer easy integration with
data processing frameworks such as Google Cloud’s Dataflow or Amazon Web Services’ Kinesis. These
warehouses are also more elastic because they decouple query computation from the storage layer
[[54](/en/ch4#Tereshko2016)].
Data is persisted on object storage rather than local disks, which makes it easy to adjust storage
capacity and compute resources for queries independently, as we previously saw in
[“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native).

Open source data warehouses such as Apache Hive, Trino, and Apache Spark have also evolved with the
cloud. As data storage for analytics has moved to data lakes on object storage, open source warehouses
have begun to break apart
[[55](/en/ch4#McKinney2023)]. The following
components, which were previously integrated in a single system such as Apache Hive, are now often
implemented as separate components:

Query engine
:   Query engines such as Trino, Apache DataFusion, and Presto parse SQL queries, optimize them into
    execution plans, and execute them against the data. Execution usually requires parallel,
    distributed data processing tasks. Some query engines provide built-in task execution, while
    others choose to use third party execution frameworks such as Apache Spark or Apache Flink.

Storage format
:   The storage format determines how the rows of a table are encoded as bytes in a file, which is
    then typically stored in object storage or a distributed filesystem
    [[12](/en/ch4#Armbrust2020)].
    This data can then be accessed by the query engine, but also by other applications using the data
    lake. Examples of such storage formats are Parquet, ORC, Lance, or Nimble, and we will see more
    about them in the next section.

Table format
:   Files written in Apache Parquet and similar storage formats are typically immutable once written.
    To support row inserts and deletions, a table format such as Apache Iceberg or Databricks’s Delta
    format are used. Table formats specify a file format that defines which files constitute a table
    along with the table’s schema. Such formats also offer advanced features such as time travel (the
    ability to query a table as it was at a previous point in time), garbage collection, and even
    transactions.

Data catalog
:   Much like a table format defines which files make up a table, a data catalog defines which tables
    comprise a database. Catalogs are used to create, rename, and drop tables. Unlike storage and table
    formats, data catalogs such as Snowflake’s Polaris and Databricks’s Unity Catalog usually run as a
    standalone service that can be queried using a REST interface. Apache Iceberg also offers a
    catalog, which can be run inside a client or as a separate process. Query engines use catalog
    information when reading and writing tables. Traditionally, catalogs and query engines have been
    integrated, but decoupling them has enabled data discovery and data governance systems
    (discussed in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance)) to access a catalog’s metadata as well.

## Column-Oriented Storage

As discussed in [“Stars and Snowflakes: Schemas for Analytics”](/en/ch3#sec_datamodels_analytics), data warehouses by convention often use a relational
schema with a big fact table that contains foreign key references into dimension tables.
If you have trillions of rows and petabytes of data in your fact tables, storing and querying them
efficiently becomes a challenging problem. Dimension tables are usually much smaller (millions of
rows), so in this section we will focus on storage of facts.

Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4
or 5 of them at one time (`"SELECT *"` queries are rarely needed for analytics)
[[52](/en/ch4#Stonebraker2013)]. Take the query in
[Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone
buying fruit or candy during the 2024 calendar year), but it only needs to access three columns of
the `fact_sales` table: `date_key`, `product_sk`,
and `quantity`. The query ignores all other columns.

##### Example 4-1. Analyzing whether people are more inclined to buy fresh fruit or candy, depending on the day of the week

```
SELECT
  dim_date.weekday, dim_product.category,
  SUM(fact_sales.quantity) AS quantity_sold
FROM fact_sales
  JOIN dim_date    ON fact_sales.date_key   = dim_date.date_key
  JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
WHERE
  dim_date.year = 2024 AND
  dim_product.category IN ('Fresh fruit', 'Candy')
GROUP BY
  dim_date.weekday, dim_product.category;
```

How can we execute this query efficiently?

In most OLTP databases, storage is laid out in a *row-oriented* fashion: all the values from one row
of a table are stored next to each other. Document databases are similar: an entire document is
typically stored as one contiguous sequence of bytes. You can see this in the CSV example of
[Figure 4-1](/en/ch4#fig_storage_csv_hash_index).

In order to process a query like [Example 4-1](/en/ch4#fig_storage_analytics_query), you may have indexes on
`fact_sales.date_key` and/or `fact_sales.product_sk` that tell the storage engine where to find
all the sales for a particular date or for a particular product. But then, a row-oriented storage
engine still needs to load all of those rows (each consisting of over 100 attributes) from disk into
memory, parse them, and filter out those that don’t meet the required conditions. That can take a
long time.

The idea behind *column-oriented* (or *columnar*) storage is simple: don’t store all the values from
one row together, but store all the values from each *column* together instead
[[56](/en/ch4#Stonebraker2005)].
If each column is stored separately, a query only needs to read and parse those columns that are
used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using
an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema).

###### Note

Column storage is easiest to understand in a relational data model, but it applies equally to
nonrelational data. For example, Parquet
[[57](/en/ch4#LeDem2013)]
is a columnar storage format that supports a document data model, based on Google’s Dremel
[[58](/en/ch4#Melnik2010)],
using a technique known as *shredding* or *striping*
[[59](/en/ch4#Kearney2016)].

![ddia 0407](/fig/ddia_0407.png)

###### Figure 4-7. Storing relational data by column, rather than by row.

The column-oriented storage layout relies on each column storing the rows in the same order.
Thus, if you need to reassemble an entire row, you can take the 23rd entry from each of the
individual columns and put them together to form the 23rd row of the table.

In fact, columnar storage engines don’t actually store an entire column (containing perhaps
trillions of rows) in one go. Instead, they break the table into blocks of thousands or millions of
rows, and within each block they store the values from each column separately
[[60](/en/ch4#Brandon2023)].
Since many queries are restricted to a particular date range, it is common to make each block
contain the rows for a particular timestamp range. A query then only needs to load the columns it
needs in those blocks that overlap with the required date range.

Columnar storage is used in almost all analytic databases nowadays
[[60](/en/ch4#Brandon2023)],
ranging from large-scale cloud data warehouses such as Snowflake
[[61](/en/ch4#Dageville2016)]
to single-node embedded databases such as DuckDB
[[62](/en/ch4#Raasveldt2020)],
and product analytics systems such as Pinot
[[63](/en/ch4#Im2018)]
and Druid [[64](/en/ch4#Yang2014)].
It is used in storage formats such as Parquet, ORC
[[65](/en/ch4#Liu2023),
[66](/en/ch4#Zeng2023)],
Lance [[67](/en/ch4#Pace2024)],
and Nimble [[68](/en/ch4#Helfman2024)],
and in-memory analytics formats like Apache Arrow
[[65](/en/ch4#Liu2023),
[69](/en/ch4#McKinney2021)]
and Pandas/NumPy [[70](/en/ch4#McKinney2022)].
Some time-series databases, such as InfluxDB IOx
[[71](/en/ch4#Dix2021)] and TimescaleDB
[[72](/en/ch4#Soto2024)],
are also based on column-oriented storage.

### Column Compression

Besides only loading those columns from disk that are required for a query, we can further reduce
the demands on disk throughput and network bandwidth by compressing data. Fortunately,
column-oriented storage often lends itself very well to compression.

Take a look at the sequences of values for each column in [Figure 4-7](/en/ch4#fig_column_store): they often look quite
repetitive, which is a good sign for compression. Depending on the data in the column, different
compression techniques can be used. One technique that is particularly effective in data warehouses
is *bitmap encoding*, illustrated in [Figure 4-8](/en/ch4#fig_bitmap_index).

![ddia 0408](/fig/ddia_0408.png)

###### Figure 4-8. Compressed, bitmap-indexed storage of a single column.

Often, the number of distinct values in a column is small compared to the number of rows (for
example, a retailer may have billions of sales transactions, but only 100,000 distinct products).
We can now take a column with *n* distinct values and turn it into *n* separate bitmaps: one bitmap
for each distinct value, with one bit for each row. The bit is 1 if the row has that value, and 0 if
not.

One option is to store those bitmaps using one bit per row. However, these bitmaps typically contain
a lot of zeros (we say that they are *sparse*). In that case, the bitmaps can additionally be
run-length encoded: counting the number of consecutive zeros or ones and storing that number, as
shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the
two bitmap representations, using whichever is the most compact
[[73](/en/ch4#Lemire2016)].
This can make the encoding of a column remarkably efficient.

Bitmap indexes such as these are very well suited for the kinds of queries that are common in a data
warehouse. For example:

`WHERE product_sk IN (31, 68, 69):`
:   Load the three bitmaps for `product_sk = 31`, `product_sk = 68`, and `product_sk = 69`, and
    calculate the bitwise *OR* of the three bitmaps, which can be done very efficiently.

`WHERE product_sk = 30 AND store_sk = 3:`
:   Load the bitmaps for `product_sk = 30` and `store_sk = 3`, and calculate the bitwise *AND*. This
    works because the columns contain the rows in the same order, so the *k*th bit in one column’s
    bitmap corresponds to the same row as the *k*th bit in another column’s bitmap.

Bitmaps can also be used to answer graph queries, such as finding all users of a social network who
are followed by user *X* and who also follow user *Y*
[[74](/en/ch4#Volpert2024)].
There are also various other compression schemes for columnar databases, which you can find in the
references [[75](/en/ch4#Abadi2013)].

###### Note

Don’t confuse column-oriented databases with the *wide-column* (also known as *column-family*) data
model, in which a row can have thousands of columns, and there is no need for all the rows to have
the same columns [[9](/en/ch4#Chang2006_ch4)]. Despite the similarity
in name, wide-column databases are row-oriented, since they store all values from a row together.
Google’s Bigtable, Apache Accumulo, and HBase are examples of the wide-column model.

### Sort Order in Column Storage

In a column store, it doesn’t necessarily matter in which order the rows are stored. It’s easiest to
store them in the order in which they were inserted, since then inserting a new row just means
appending to each of the columns. However, we can choose to impose an order, like we did with
SSTables previously, and use that as an indexing mechanism.

Note that it wouldn’t make sense to sort each column independently, because then we would no longer
know which items in the columns belong to the same row. We can only reconstruct a row because we
know that the *k*th item in one column belongs to the same row as the *k*th item in another
column.

Rather, the data needs to be sorted an entire row at a time, even though it is stored by column.
The administrator of the database can choose the columns by which the table should be sorted, using
their knowledge of common queries. For example, if queries often target date ranges, such as the
last month, it might make sense to make `date_key` the first sort key. Then the query can
scan only the rows from the last month, which will be much faster than scanning all rows.

A second column can determine the sort order of any rows that have the same value in the first
column. For example, if `date_key` is the first sort key in [Figure 4-7](/en/ch4#fig_column_store), it might make
sense for `product_sk` to be the second sort key so that all sales for the same product on the same
day are grouped together in storage. That will help queries that need to group or filter sales by
product within a certain date range.

Another advantage of sorted order is that it can help with compression of columns. If the primary
sort column does not have many distinct values, then after sorting, it will have long sequences
where the same value is repeated many times in a row. A simple run-length encoding, like we used for
the bitmaps in [Figure 4-8](/en/ch4#fig_bitmap_index), could compress that column down to a few kilobytes—even if
the table has billions of rows.

That compression effect is strongest on the first sort key. The second and third sort keys will be
more jumbled up, and thus not have such long runs of repeated values. Columns further down the
sorting priority appear in essentially random order, so they probably won’t compress as well. But
having the first few columns sorted is still a win overall.

### Writing to Column-Oriented Storage

We saw in [“Characterizing Transaction Processing and Analytics”](/en/ch1#sec_introduction_oltp) that reads in data warehouses tend to consist of aggregations
over a large number of rows; column-oriented storage, compression, and sorting all help to make
those read queries faster. Writes in a data warehouse tend to be a bulk import of data, often via an
ETL process.

With columnar storage, writing an individual row somewhere in the middle of a sorted table would be
very inefficient, as you would have to rewrite all the compressed columns from the insertion
position onwards. However, a bulk write of many rows at once amortizes the cost of rewriting those
columns, making it efficient.

A log-structured approach is often used to perform writes in batches. All writes first go to a
row-oriented, sorted, in-memory store. When enough writes have accumulated, they are merged with the
column-encoded files on disk and written to new files in bulk. As old files remain immutable, and
new files are written in one go, object storage is well suited for storing these files.

Queries need to examine both the column data on disk and the recent writes in memory, and combine
the two. The query execution engine hides this distinction from the user. From an analyst’s point
of view, data that has been modified with inserts, updates, or deletes is immediately reflected in
subsequent queries. Snowflake, Vertica, Apache Pinot, Apache Druid, and many others do this
[[61](/en/ch4#Dageville2016), [63](/en/ch4#Im2018),
[64](/en/ch4#Yang2014),
[76](/en/ch4#Lamb2012)].

## Query Execution: Compilation and Vectorization

A complex SQL query for analytics is broken down into a *query plan* consisting of multiple stages,
called *operators*, which may be distributed across multiple machines for parallel execution. Query
planners can perform a lot of optimizations by choosing which operators to use, in which order to
perform them, and where to run each operator.

Within each operator, the query engine needs to do various things with the values in a column, such
as finding all the rows where the value is among a particular set of values (perhaps as part of a
join), or checking whether the value is greater than 15. It also needs to look at several columns
for the same row, for example to find all sales transactions where the product is bananas and the
store is a particular store of interest.

For data warehouse queries that need to scan over millions of rows, we need to worry not only about
the amount of data they need to read off disk, but also the CPU time required to execute complex
operators. The simplest kind of operator is like an interpreter for a programming language: while
iterating over each row, it checks a data structure representing the query to find out which
comparisons or calculations it needs to perform on which columns. Unfortunately, this is too slow
for many analytics purposes. Two alternative approaches for efficient query execution have emerged
[[77](/en/ch4#Kersten2018)]:

Query compilation
:   The query engine takes the SQL query and generates code for executing it. The code iterates over
    the rows one by one, looks at the values in the columns of interest, performs whatever comparisons
    or calculations are needed, and copies the necessary values to an output buffer if the required
    conditions are satisfied. The query engine compiles the generated code to machine code (often
    using an existing compiler such as LLVM), and then runs it on the column-encoded data that has
    been loaded into memory. This approach to code generation is similar to the just-in-time (JIT)
    compilation approach that is used in the Java Virtual Machine (JVM) and similar runtimes.

Vectorized processing
:   The query is interpreted, not compiled, but it is made fast by processing many values from a
    column in a batch, instead of iterating over rows one by one. A fixed set of predefined operators
    are built into the database; we can pass arguments to them and get back a batch of results
    [[50](/en/ch4#Larson2013), [75](/en/ch4#Abadi2013)].

    For example, we could pass the `product_sk` column and the ID of “bananas” to an equality operator,
    and get back a bitmap (one bit per value in the input column, which is 1 if it’s a banana); we could
    then pass the `store_sk` column and the ID of the store of interest to the same equality operator,
    and get back another bitmap; and then we could pass the two bitmaps to a “bitwise AND” operator, as
    shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in
    a particular store.

![ddia 0409](/fig/ddia_0409.png)

###### Figure 4-9. A bitwise AND between two bitmaps lends itself to vectorization.

The two approaches are very different in terms of their implementation, but both are used in
practice [[77](/en/ch4#Kersten2018)]. Both can achieve very good
performance by taking advantages of the characteristics of modern CPUs:

* preferring sequential memory access over random access to reduce cache misses
  [[78](/en/ch4#Smith2020)],
* doing most of the work in tight inner loops (that is, with a small number of instructions and no
  function calls) to keep the CPU instruction processing pipeline busy and avoid branch
  mispredictions,
* making use of parallelism such as multiple threads and single-instruction-multi-data (SIMD)
  instructions [[79](/en/ch4#Boncz2005),
  [80](/en/ch4#Zhou2002)], and
* operating directly on compressed data without decoding it into a separate in-memory
  representation, which saves memory allocation and copying costs.

## Materialized Views and Data Cubes

We previously encountered *materialized views* in [“Materializing and Updating Timelines”](/en/ch2#sec_introduction_materializing):
in a relational data model, they are table-like object whose contents are the results of some
query. The difference is that a materialized view is an actual copy of the query results, written to
disk, whereas a virtual view is just a shortcut for writing queries. When you read from a virtual
view, the SQL engine expands it into the view’s underlying query on the fly and then processes the
expanded query.

When the underlying data changes, a materialized view needs to be updated accordingly. Some
databases can do that automatically, and there are also systems such as Materialize that specialize
in materialized view maintenance
[[81](/en/ch4#Bartley2024)].
Performing such updates means more work on writes, but materialized views can improve read
performance in workloads that repeatedly need to perform the same queries.

*Materialized aggregates* are a type of materialized views that can be useful in data warehouses. As
discussed earlier, data warehouse queries often involve an aggregate function, such as `COUNT`, `SUM`,
`AVG`, `MIN`, or `MAX` in SQL. If the same aggregates are used by many different queries, it can be
wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that
queries use most often? A *data cube* or *OLAP cube* does this by creating a grid of aggregates
grouped by different dimensions
[[82](/en/ch4#Gray2007)].
[Figure 4-10](/en/ch4#fig_data_cube) shows an example.

![ddia 0410](/fig/ddia_0410.png)

###### Figure 4-10. Two dimensions of a data cube, aggregating data by summing.

Imagine for now that each fact has foreign keys to only two dimension tables—in [Figure 4-10](/en/ch4#fig_data_cube),
these are `date_key` and `product_sk`. You can now draw a two-dimensional table, with
dates along one axis and products along the other. Each cell contains the aggregate (e.g., `SUM`) of
an attribute (e.g., `net_price`) of all facts with that date-product combination. Then you can apply
the same aggregate along each row or column and get a summary that has been reduced by one
dimension (the sales by product regardless of date, or the sales by date regardless of product).

In general, facts often have more than two dimensions. In [Figure 3-5](/en/ch3#fig_dwh_schema) there are five
dimensions: date, product, store, promotion, and customer. It’s a lot harder to imagine what a
five-dimensional hypercube would look like, but the principle remains the same: each cell contains
the sales for a particular date-product-store-promotion-customer combination. These values can then
repeatedly be summarized along each of the dimensions.

The advantage of a materialized data cube is that certain queries become very fast because they
have effectively been precomputed. For example, if you want to know the total sales per store
yesterday, you just need to look at the totals along the appropriate dimension—no need to scan
millions of rows.

The disadvantage is that a data cube doesn’t have the same flexibility as querying the raw data. For example,
there is no way of calculating which proportion of sales comes from items that cost more than $100,
because the price isn’t one of the dimensions. Most data warehouses therefore try to keep as much
raw data as possible, and use aggregates such as data cubes only as a performance boost for certain
queries.

# Multidimensional and Full-Text Indexes

The B-trees and LSM-trees we saw in the first half of this chapter allow range queries over a single
attribute: for example, if the key is a username, you can use them as an index to efficiently find
all names starting with an L. But sometimes, searching by a single attribute is not enough.

The most common type of multi-column index is called a *concatenated index*, which simply combines
several fields into one key by appending one column to another (the index definition specifies in
which order the fields are concatenated). This is like an old-fashioned paper phone book, which
provides an index from (*lastname*, *firstname*) to phone number. Due to the sort order, the index
can be used to find all the people with a particular last name, or all the people with a particular
*lastname-firstname* combination. However, the index is useless if you want to find all the people
with a particular first name.

On the other hand, *multi-dimensional indexes* allow you to query several columns at once.
One case where this is particularly important is geospatial data. For example, a restaurant-search
website may have a database containing the latitude and longitude of each restaurant. When a user is
looking at the restaurants on a map, the website needs to search for all the restaurants within the
rectangular map area that the user is currently viewing. This requires a two-dimensional range query
like the following:

```
SELECT * FROM restaurants WHERE latitude  > 51.4946 AND latitude  < 51.5079
                            AND longitude > -0.1162 AND longitude < -0.1004;
```

A concatenated index over the latitude and longitude columns is not able to answer that kind of
query efficiently: it can give you either all the restaurants in a range of latitudes (but at any
longitude), or all the restaurants in a range of longitudes (but anywhere between the North and
South poles), but not both simultaneously.

One option is to translate a two-dimensional location into a single number using a space-filling
curve, and then to use a regular B-tree index
[[83](/en/ch4#Ramsak2000)].
More commonly, specialized spatial indexes such as R-trees or Bkd-trees
[[84](/en/ch4#Procopiuc2003)]
are used; they divide up the space so that nearby data points tend to be grouped in the same
subtree. For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL’s
Generalized Search Tree indexing facility
[[85](/en/ch4#Hellerstein1995)].
It is also possible to use regularly spaced grids of triangles, squares, or hexagons
[[86](/en/ch4#Brodsky2018)].

Multi-dimensional indexes are not just for geographic locations. For example, on an ecommerce
website you could use a three-dimensional index on the dimensions (*red*, *green*, *blue*) to search
for products in a certain range of colors, or in a database of weather observations you could have a
two-dimensional index on (*date*, *temperature*) in order to efficiently search for all the
observations during the year 2013 where the temperature was between 25 and 30℃. With a
one-dimensional index, you would have to either scan over all the records from 2013 (regardless of
temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by
timestamp and temperature simultaneously
[[87](/en/ch4#Escriva2012)].

## Full-Text Search

Full-text search allows you to search a collection of text documents (web pages, product
descriptions, etc.) by keywords that might appear anywhere in the text
[[88](/en/ch4#Manning2008_ch4)].
Information retrieval is a big, specialist topic that often involves language-specific processing:
for example, several Asian languages are written without spaces or punctuation between words, and
therefore splitting text into words requires a model that indicates which character sequences
constitute a word. Full-text search also often involves matching words that are similar but not
identical (such as typos or different grammatical forms of words) and synonyms. Those problems go
beyond the scope of this book.

However, at its core, you can think of full-text search as another kind of multidimensional query:
in this case, each word that might appear in a text (a *term*) is a dimension. A document that
contains term *x* has a value of 1 in dimension *x*, and a document that doesn’t contain *x* has a
value of 0. Searching for documents mentioning “red apples” means a query that looks for a 1 in the
*red* dimension, and simultaneously a 1 in the *apples* dimension. The number of dimensions may thus
be very large.

The data structure that many search engines use to answer such queries is called an *inverted
index*. This is a key-value structure where the key is a term, and the value is the list of IDs of
all the documents that contain the term (the *postings list*). If the document IDs are sequential
numbers, the postings list can also be represented as a sparse bitmap, like in [Figure 4-8](/en/ch4#fig_bitmap_index):
the *n*th bit in the bitmap for term *x* is a 1 if the document with ID *n* contains the term *x*
[[89](/en/ch4#Wang2017)].

Finding all the documents that contain both terms *x* and *y* is now similar to a vectorized data
warehouse query that searches for rows matching two conditions ([Figure 4-9](/en/ch4#fig_bitmap_and)): load the two
bitmaps for terms *x* and *y* and compute their bitwise AND. Even if the bitmaps are run-length
encoded, this can be done very efficiently.

For example, Lucene, the full-text indexing engine used by Elasticsearch and Solr, works like this
[[90](/en/ch4#Grand2013)].
It stores the mapping from term to postings list in SSTable-like sorted files, which are merged in
the background using the same log-structured approach we saw earlier in this chapter
[[91](/en/ch4#McCandless2011merges)].
PostgreSQL’s GIN index type also uses postings lists to support full-text search and indexing inside
JSON documents
[[92](/en/ch4#Fittl2021),
[93](/en/ch4#Angelakos2020)].

Instead of breaking text into words, an alternative is to find all the substrings of length *n*,
which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
`"hello"` are `"hel"`, `"ell"`, and `"llo"`. If we build an inverted index of all trigrams, we can
search the documents for arbitrary substrings that are at least three characters long. Trigram
indexes even allows regular expressions in search queries; the downside is that they are quite large
[[94](/en/ch4#Korotkov2012)].

To cope with typos in documents or queries, Lucene is able to search text for words within a certain
edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced)
[[95](/en/ch4#McCandless2011fuzzy)].
It does this by storing the set of terms as a finite state automaton over the characters in the
keys, similar to a *trie*
[[96](/en/ch4#Heinz2002)],
and transforming it into a *Levenshtein automaton*, which supports efficient search for words within
a given edit distance [[97](/en/ch4#Schulz2002)].

## Vector Embeddings

Semantic search goes beyond synonyms and typos to try and understand document concepts
and user intentions. For example, if your help pages contain a page titled “cancelling your
subscription”, users should still be able to find that page when searching for “how to close my
account” or “terminate contract”, which are close in terms of meaning even though they use
completely different words.

To understand a document’s semantics—its meaning—semantic search indexes use embedding models to
translate a document into a vector of floating-point values, called a *vector embedding*. The vector
represents a point in a multi-dimensional space, and each floating-point value represents the document’s
location along one dimension’s axis. Embedding models generate vector embeddings that are near
each other (in this multi-dimensional space) when the embedding’s input documents are semantically
similar.

###### Note

We saw the term *vectorized processing* in [“Query Execution: Compilation and Vectorization”](/en/ch4#sec_storage_vectorized).
Vectors in semantic search have a different meaning. In vectorized processing, the vector refers to
a batch of bits that can be processed with specially optimized code. In embedding models, vectors are a list of
floating point numbers that represent a location in multi-dimensional space.

For example, a three-dimensional vector embedding for a Wikipedia page about agriculture might be
[0.1, 0.22, 0.11]. A Wikipedia page about vegetables would be quite near, perhaps with an embedding
of [0.13, 0.19, 0.24]. A page about star schemas might have an embedding of [0.82, 0.39, -0.74],
comparatively far away. We can tell by looking that the first two vectors are closer than the third.

Embedding models use much larger vectors (often over 1,000 numbers), but the principles are the
same. We don’t try to understand what the individual numbers mean;
they’re simply a way for embedding models to point to a location in an abstract multi-dimensional
space. Search engines use distance functions such as cosine similarity or Euclidean distance to
measure the distance between vectors. Cosine similarity measures the cosine of the angle of two
vectors to determine how close they are, while Euclidean distance measures the straight-line
distance between two points in space.

Many early embedding models such as Word2Vec
[[98](/en/ch4#Mikolov2013)],
BERT
[[99](/en/ch4#Devlin2018)],
and GPT
[[100](/en/ch4#Radford2018)]
worked with text data. Such models are usually implemented as neural networks. Researchers went on to
create embedding models for video, audio, and images as well. More recently, model
architecture has become *multimodal*: a single model can generate vector embeddings for multiple
modalities such as text and images.

Semantic search engines use an embedding model to generate a vector embedding when a user enters a
query. The user’s query and related context (such as a user’s location) are fed into the embedding
model. After the embedding model generates the query’s vector embedding, the search engine must find
documents with similar vector embeddings using a vector index.

Vector indexes store the vector embeddings of a collection of documents. To query the index, you
pass in the vector embedding of the query, and the index returns the documents whose vectors are
closest to the query vector. Since the R-trees we saw previously don’t work well for vectors with
many dimensions, specialized vector indexes are used, such as:

Flat indexes
:   Vectors are stored in the index as they are. A query must read every vector and measure its
    distance to the query vector. Flat indexes are accurate, but measuring the distance between the
    query and each vector is slow.

Inverted file (IVF) indexes
:   The vector space is clustered into partitions (called *centroids*) of vectors to reduce the number
    of vectors that must be compared. IVF indexes are faster than flat indexes, but can give only
    approximate results: the query and a document may fall into different partitions, even though they
    are close to each other. A query on an IVF index first defines *probes*, which are simply the number
    of partitions to check. Queries that use more probes will be more accurate, but will be slower, as
    more vectors must be compared.

Hierarchical Navigable Small World (HNSW)
:   HNSW indexes maintain multiple layers of the vector space, as illustrated in [Figure 4-11](/en/ch4#fig_vector_hnsw).
    Each layer is represented as a graph, where nodes represent vectors, and edges represent proximity
    to nearby vectors. A query starts by locating the nearest vector in the topmost layer, which has a
    small number of nodes. The query then moves to the same node in the layer below and follows the
    edges in that layer, which is more densely connected, looking for a vector that is closer to the
    query vector. The process continues until the last layer is reached. As with IVF indexes, HNSW
    indexes are approximate.

![ddia 0411](/fig/ddia_0411.png)

###### Figure 4-11. Searching for the database entry that is closest to a given query vector in a HNSW index.

Many popular vector databases implement IVF and HNSW indexes. Facebook’s Faiss library has many
variations of each
[[101](/en/ch4#Faiis2023)],
and PostgreSQL’s pgvector supports both as well
[[102](/en/ch4#Matevosyan2024)].
The full details of the IVF and HNSW algorithms are beyond the scope of this book, but their papers
are an excellent resource
[[103](/en/ch4#Baranchuk2018),
[104](/en/ch4#Malkov2020)].

# Summary

In this chapter we tried to get to the bottom of how databases perform storage and retrieval. What
happens when you store data in a database, and what does the database do when you query for the
data again later?

[“Analytical versus Operational Systems”](/en/ch1#sec_introduction_analytics) introduced the distinction between transaction processing (OLTP) and
analytics (OLAP). In this chapter we saw that storage engines optimized for OLTP look very different
from those optimized for analytics:

* OLTP systems are optimized for a high volume of requests, each of which reads and writes a small
  number of records, and which need fast responses. The records are typically accessed via a primary
  key or a secondary index, and these indexes are typically ordered mappings from key to record,
  which also support range queries.
* Data warehouses and similar analytic systems are optimized for complex read queries that scan over
  a large number of records. They generally use a column-oriented storage layout with compression
  that minimizes the amount of data that such a query needs to read off disk, and just-in-time
  compilation of queries or vectorization to minimize the amount of CPU time spent processing the
  data.

On the OLTP side, we saw storage engines from two main schools of thought:

* The log-structured approach, which only permits appending to files and deleting obsolete files,
  but never updates a file that has been written. SSTables, LSM-trees, RocksDB, Cassandra, HBase,
  Scylla, Lucene, and others belong to this group. In general, log-structured storage engines tend
  to provide high write throughput.
* The update-in-place approach, which treats the disk as a set of fixed-size pages that can be
  overwritten. B-trees, the biggest example of this philosophy, are used in all major relational
  OLTP databases and also many nonrelational ones. As a rule of thumb, B-trees tend to be better for
  reads, providing higher read throughput and lower response times than log-structured storage.

We then looked at indexes that can search for multiple conditions at the same time: multidimensional
indexes such as R-trees that can search for points on a map by latitude and longitude at the same
time, and full-text search indexes that can search for multiple keywords appearing in the same text.
Finally, vector databases are used for semantic search on text documents and other media; they use
vectors with a larger number of dimensions and find similar documents by comparing vector
similarity.

As an application developer, if you’re armed with this knowledge about the internals of storage
engines, you are in a much better position to know which tool is best suited for your particular
application. If you need to adjust a database’s tuning parameters, this understanding allows you to
imagine what effect a higher or a lower value may have.

Although this chapter couldn’t make you an expert in tuning any one particular storage engine, it
has hopefully equipped you with enough vocabulary and ideas that you can make sense of the
documentation for the database of your choice.

##### Footnotes

##### References

[[1](/en/ch4#Samokhvalov2021-marker)] Nikolay Samokhvalov.
[How
partial, covering, and multicolumn indexes may slow down UPDATEs in PostgreSQL](https://postgres.ai/blog/20211029-how-partial-and-covering-indexes-affect-update-performance-in-postgresql).
*postgres.ai*, October 2021.
Archived at [perma.cc/PBK3-F4G9](https://perma.cc/PBK3-F4G9)

[[2](/en/ch4#Graefe2011-marker)] Goetz Graefe.
[Modern B-Tree Techniques](https://w6113.github.io/files/papers/btreesurvey-graefe.pdf).
*Foundations and Trends in Databases*, volume 3, issue 4, pages 203–402, August 2011.
[doi:10.1561/1900000028](https://doi.org/10.1561/1900000028)

[[3](/en/ch4#Jones2019-marker)] Evan Jones.
[Why databases use ordered
indexes but programming uses hash tables](https://www.evanjones.ca/ordered-vs-unordered-indexes.html). *evanjones.ca*, December 2019.
Archived at [perma.cc/NJX8-3ZZD](https://perma.cc/NJX8-3ZZD)

[[4](/en/ch4#Lambov2022a-marker)] Branimir Lambov.
[CEP-25:
Trie-indexed SSTable format](https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-25%3A%2BTrie-indexed%2BSSTable%2Bformat). *cwiki.apache.org*, November 2022.
Archived at [perma.cc/HD7W-PW8U](https://perma.cc/HD7W-PW8U).
Linked Google Doc archived at [perma.cc/UL6C-AAAE](https://perma.cc/UL6C-AAAE)

[[5](/en/ch4#Cormen2009-marker)] Thomas H. Cormen, Charles E.
Leiserson, Ronald L. Rivest, and Clifford Stein: *Introduction to Algorithms*, 3rd edition.
MIT Press, 2009. ISBN: 978-0-262-53305-8

[[6](/en/ch4#Lambov2022b-marker)] Branimir Lambov.
[Trie Memtables in Cassandra](https://www.vldb.org/pvldb/vol15/p3359-lambov.pdf).
*Proceedings of the VLDB Endowment*, volume 15, issue 12, pages 3359–3371, August 2022.
[doi:10.14778/3554821.3554828](https://doi.org/10.14778/3554821.3554828)

[[7](/en/ch4#Borthakur2013-marker)] Dhruba Borthakur.
[The History of RocksDB](https://rocksdb.blogspot.com/2013/11/the-history-of-rocksdb.html).
*rocksdb.blogspot.com*, November 2013.
Archived at [perma.cc/Z7C5-JPSP](https://perma.cc/Z7C5-JPSP)

[[8](/en/ch4#Bertozzi2012-marker)] Matteo Bertozzi.
[Apache HBase I/O – HFile](https://blog.cloudera.com/apache-hbase-i-o-hfile/).
*blog.cloudera.com*, June 2012.
Archived at [perma.cc/U9XH-L2KL](https://perma.cc/U9XH-L2KL)

[[9](/en/ch4#Chang2006_ch4-marker)] Fay Chang, Jeffrey Dean, Sanjay Ghemawat,
Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber.
[Bigtable: A Distributed Storage System
for Structured Data](https://research.google/pubs/pub27898/). At *7th USENIX Symposium on Operating System Design and
Implementation* (OSDI), November 2006.

[[10](/en/ch4#ONeil1996-marker)] Patrick O’Neil, Edward Cheng, Dieter Gawlick, and
Elizabeth O’Neil.
[The Log-Structured Merge-Tree (LSM-Tree)](https://www.cs.umb.edu/~poneil/lsmtree.pdf).
*Acta Informatica*, volume 33, issue 4, pages 351–385, June 1996.
[doi:10.1007/s002360050048](https://doi.org/10.1007/s002360050048)

[[11](/en/ch4#Rosenblum1992-marker)] Mendel Rosenblum and John K. Ousterhout.
[The Design and Implementation of
a Log-Structured File System](https://research.cs.wisc.edu/areas/os/Qual/papers/lfs.pdf).
*ACM Transactions on Computer Systems*, volume 10, issue 1, pages 26–52, February 1992.
[doi:10.1145/146941.146943](https://doi.org/10.1145/146941.146943)

[[12](/en/ch4#Armbrust2020-marker)] Michael Armbrust, Tathagata Das, Liwen Sun,
Burak Yavuz, Shixiong Zhu, Mukul Murthy, Joseph Torres, Herman van Hovell, Adrian Ionescu, Alicja
Łuszczak, Michał Świtakowski, Michał Szafrański, Xiao Li, Takuya Ueshin, Mostafa Mokhtar, Peter
Boncz, Ali Ghodsi, Sameer Paranjpye, Pieter Senster, Reynold Xin, and Matei Zaharia.
[Delta Lake: High-Performance ACID Table
Storage over Cloud Object Stores](https://vldb.org/pvldb/vol13/p3411-armbrust.pdf). *Proceedings of the VLDB Endowment*, volume 13,
issue 12, pages 3411–3424, August 2020.
[doi:10.14778/3415478.3415560](https://doi.org/10.14778/3415478.3415560)

[[13](/en/ch4#Bloom1970-marker)] Burton H. Bloom.
[Space/Time
Trade-offs in Hash Coding with Allowable Errors](https://people.cs.umass.edu/~emery/classes/cmpsci691st/readings/Misc/p422-bloom.pdf). *Communications of the ACM*,
volume 13, issue 7, pages 422–426, July 1970.
[doi:10.1145/362686.362692](https://doi.org/10.1145/362686.362692)

[[14](/en/ch4#Kirsch2008-marker)] Adam Kirsch and Michael Mitzenmacher.
[Less Hashing, Same
Performance: Building a Better Bloom Filter](https://www.eecs.harvard.edu/~michaelm/postscripts/tr-02-05.pdf). *Random Structures & Algorithms*,
volume 33, issue 2, pages 187–218, September 2008.
[doi:10.1002/rsa.20208](https://doi.org/10.1002/rsa.20208)

[[15](/en/ch4#Hurst2023-marker)] Thomas Hurst.
[Bloom Filter Calculator](https://hur.st/bloomfilter/). *hur.st*, September 2023.
Archived at [perma.cc/L3AV-6VC2](https://perma.cc/L3AV-6VC2)

[[16](/en/ch4#Luo2019-marker)] Chen Luo and Michael J. Carey.
[LSM-based storage techniques: a survey](https://arxiv.org/abs/1812.07527).
*The VLDB Journal*, volume 29, pages 393–418, July 2019.
[doi:10.1007/s00778-019-00555-y](https://doi.org/10.1007/s00778-019-00555-y)

[[17](/en/ch4#Sarkar2022-marker)] Subhadeep Sarkar and Manos Athanassoulis.
[Dissecting, Designing, and Optimizing
LSM-based Data Stores](https://www.youtube.com/watch?v=hkMkBZn2mGs). Tutorial at *ACM International Conference on Management of Data*
(SIGMOD), June 2022. Slides archived at
[perma.cc/93B3-E827](https://perma.cc/93B3-E827)

[[18](/en/ch4#Callaghan2018-marker)] Mark Callaghan.
[Name that
compaction algorithm](https://smalldatum.blogspot.com/2018/08/name-that-compaction-algorithm.html). *smalldatum.blogspot.com*, August 2018.
Archived at [perma.cc/CN4M-82DY](https://perma.cc/CN4M-82DY)

[[19](/en/ch4#Rao2023-marker)] Prashanth Rao.
[Embedded databases (1): The harmony of
DuckDB, KùzuDB and LanceDB](https://thedataquarry.com/posts/embedded-db-1/). *thedataquarry.com*, August 2023.
Archived at [perma.cc/PA28-2R35](https://perma.cc/PA28-2R35)

[[20](/en/ch4#BlueskySQLite-marker)] Hacker News discussion.
[Bluesky migrates to single-tenant SQLite](https://news.ycombinator.com/item?id=38171322).
*news.ycombinator.com*, October 2023.
Archived at [perma.cc/69LM-5P6X](https://perma.cc/69LM-5P6X)

[[21](/en/ch4#Bayer1970-marker)] Rudolf Bayer and Edward M. McCreight.
[Organization and Maintenance of Large
Ordered Indices](https://dl.acm.org/doi/pdf/10.1145/1734663.1734671). Boeing Scientific Research Laboratories, Mathematical and Information Sciences
Laboratory, report no. 20, July 1970.
[doi:10.1145/1734663.1734671](https://doi.org/10.1145/1734663.1734671)

[[22](/en/ch4#Comer1979-marker)] Douglas Comer.
[The
Ubiquitous B-Tree](https://web.archive.org/web/20170809145513id_/http%3A//sites.fas.harvard.edu/~cs165/papers/comer.pdf). *ACM Computing Surveys*, volume 11, issue 2, pages 121–137, June 1979.
[doi:10.1145/356770.356776](https://doi.org/10.1145/356770.356776)

[[23](/en/ch4#Miller2025-marker)] Alex Miller.
[Torn Write Detection and Protection](https://transactional.blog/blog/2025-torn-writes).
*transactional.blog*, April 2025.
Archived at [perma.cc/G7EB-33EW](https://perma.cc/G7EB-33EW)

[[24](/en/ch4#Mohan1992-marker)] C. Mohan and Frank Levine.
[ARIES/IM: An Efficient and High
Concurrency Index Management Method Using Write-Ahead Logging](https://ics.uci.edu/~cs223/papers/p371-mohan.pdf). At *ACM
International Conference on Management of Data* (SIGMOD), June 1992.
[doi:10.1145/130283.130338](https://doi.org/10.1145/130283.130338)

[[25](/en/ch4#Suzuki2017_ch4-marker)] Hironobu Suzuki.
[The Internals of PostgreSQL](https://www.interdb.jp/pg/). *interdb.jp*, 2017.

[[26](/en/ch4#Chu2014-marker)] Howard Chu.
[LDAP at Lightning Speed](https://buildstuff14.sched.com/event/08a1a368e272eb599a52e08b4c3c779d).
At *Build Stuff ’14*, November 2014.
Archived at [perma.cc/GB6Z-P8YH](https://perma.cc/GB6Z-P8YH)

[[27](/en/ch4#Athanassoulis2016-marker)] Manos Athanassoulis, Michael S. Kester,
Lukas M. Maas, Radu Stoica, Stratos Idreos, Anastasia Ailamaki, and Mark Callaghan.
[Designing Access Methods: The RUM
Conjecture](https://openproceedings.org/2016/conf/edbt/paper-12.pdf). At *19th International Conference on Extending Database Technology* (EDBT), March 2016.
[doi:10.5441/002/edbt.2016.42](https://doi.org/10.5441/002/edbt.2016.42)

[[28](/en/ch4#Stopford2015-marker)] Ben Stopford.
[Log Structured Merge Trees](http://www.benstopford.com/2015/02/14/log-structured-merge-trees/).
*benstopford.com*, February 2015.
Archived at [perma.cc/E5BV-KUJ6](https://perma.cc/E5BV-KUJ6)

[[29](/en/ch4#Callaghan2016lsm-marker)] Mark Callaghan.
[The
Advantages of an LSM vs a B-Tree](https://smalldatum.blogspot.com/2016/01/summary-of-advantages-of-lsm-vs-b-tree.html). *smalldatum.blogspot.co.uk*, January 2016.
Archived at [perma.cc/3TYZ-EFUD](https://perma.cc/3TYZ-EFUD)

[[30](/en/ch4#Balmau2019-marker)] Oana Balmau, Florin Dinu, Willy Zwaenepoel, Karan
Gupta, Ravishankar Chandhiramoorthi, and Diego Didona.
[SILK: Preventing Latency
Spikes in Log-Structured Merge Key-Value Stores](https://www.usenix.org/conference/atc19/presentation/balmau). At *USENIX Annual Technical Conference*,
July 2019.

[[31](/en/ch4#RocksDBTuning-marker)] Igor Canadi, Siying Dong, Mark Callaghan, et al.
[RocksDB Tuning Guide](https://github.com/facebook/rocksdb/wiki/RocksDB-Tuning-Guide).
*github.com*, 2023.
Archived at [perma.cc/UNY4-MK6C](https://perma.cc/UNY4-MK6C)

[[32](/en/ch4#Haas2023-marker)] Gabriel Haas and Viktor Leis.
[What Modern NVMe Storage Can Do, and How
to Exploit it: High-Performance I/O for High-Performance Storage Engines](https://www.vldb.org/pvldb/vol16/p2090-haas.pdf). *Proceedings of the
VLDB Endowment*, volume 16, issue 9, pages 2090-2102.
[doi:10.14778/3598581.3598584](https://doi.org/10.14778/3598581.3598584)

[[33](/en/ch4#Goossaert2014-marker)] Emmanuel Goossaert.
[Coding
for SSDs](https://codecapsule.com/2014/02/12/coding-for-ssds-part-1-introduction-and-table-of-contents/). *codecapsule.com*, February 2014.

[[34](/en/ch4#Vanlightly2023nvme-marker)] Jack Vanlightly.
[Is
sequential IO dead in the era of the NVMe drive?](https://jack-vanlightly.com/blog/2023/5/9/is-sequential-io-dead-in-the-era-of-the-nvme-drive) *jack-vanlightly.com*, May 2023.
Archived at [perma.cc/7TMZ-TAPU](https://perma.cc/7TMZ-TAPU)

[[35](/en/ch4#Alibaba2019_ch4-marker)] Alibaba Cloud Storage Team.
[Storage System Design Analysis: Factors Affecting
NVMe SSD Performance (2)](https://www.alibabacloud.com/blog/594376). *alibabacloud.com*, January 2019. Archived at
[archive.org](https://web.archive.org/web/20230510065132/https%3A//www.alibabacloud.com/blog/594376)

[[36](/en/ch4#Hu2010-marker)] Xiao-Yu Hu and Robert Haas.
[The Fundamental Limit of Flash
Random Write Performance: Understanding, Analysis and Performance Modelling](https://dominoweb.draco.res.ibm.com/reports/rz3771.pdf).
*dominoweb.draco.res.ibm.com*, March 2010.
Archived at [perma.cc/8JUL-4ZDS](https://perma.cc/8JUL-4ZDS)

[[37](/en/ch4#Lu2016-marker)] Lanyue Lu, Thanumalayan Sankaranarayana Pillai,
Andrea C. Arpaci-Dusseau, and Remzi H. Arpaci-Dusseau.
[WiscKey:
Separating Keys from Values in SSD-conscious Storage](https://www.usenix.org/system/files/conference/fast16/fast16-papers-lu.pdf). At *4th USENIX Conference on File and
Storage Technologies* (FAST), February 2016.

[[38](/en/ch4#Zaitsev2006-marker)] Peter Zaitsev.
[Innodb Double Write](https://www.percona.com/blog/innodb-double-write/).
*percona.com*, August 2006.
Archived at [perma.cc/NT4S-DK7T](https://perma.cc/NT4S-DK7T)

[[39](/en/ch4#Vondra2016-marker)] Tomas Vondra.
[On the Impact of
Full-Page Writes](https://www.2ndquadrant.com/en/blog/on-the-impact-of-full-page-writes/). *2ndquadrant.com*, November 2016.
Archived at [perma.cc/7N6B-CVL3](https://perma.cc/7N6B-CVL3)

[[40](/en/ch4#Callaghan2015-marker)] Mark Callaghan.
[Read,
write & space amplification - B-Tree vs LSM](https://smalldatum.blogspot.com/2015/11/read-write-space-amplification-b-tree.html). *smalldatum.blogspot.com*, November 2015.
Archived at [perma.cc/S487-WK5P](https://perma.cc/S487-WK5P)

[[41](/en/ch4#Callaghan2016rocksdb-marker)] Mark Callaghan.
[Choosing Between Efficiency and
Performance with RocksDB](https://codemesh.io/codemesh2016/mark-callaghan). At *Code Mesh*, November 2016.
Video at [youtube.com/watch?v=tgzkgZVXKB4](https://www.youtube.com/watch?v=tgzkgZVXKB4)

[[42](/en/ch4#Sarkar2023-marker)] Subhadeep Sarkar, Tarikul Islam
Papon, Dimitris Staratzis, Zichen Zhu, and Manos Athanassoulis.
[Enabling
Timely and Persistent Deletion in LSM-Engines](https://subhadeep.net/assets/fulltext/Enabling_Timely_and_Persistent_Deletion_in_LSM-Engines.pdf). *ACM Transactions on Database Systems*,
volume 48, issue 3, article no. 8, August 2023.
[doi:10.1145/3599724](https://doi.org/10.1145/3599724)

[[43](/en/ch4#Fittl2025-marker)] Lukas Fittl.
[Postgres
vs. SQL Server: B-Tree Index Differences & the Benefit of Deduplication](https://pganalyze.com/blog/postgresql-vs-sql-server-btree-index-deduplication).
*pganalyze.com*, April 2025.
Archived at [perma.cc/XY6T-LTPX](https://perma.cc/XY6T-LTPX)

[[44](/en/ch4#Silcock2024-marker)] Drew Silcock.
[How Postgres stores data
on disk – this one’s a page turner](https://drew.silcock.dev/blog/how-postgres-stores-data-on-disk/). *drew.silcock.dev*, August 2024.
Archived at [perma.cc/8K7K-7VJ2](https://perma.cc/8K7K-7VJ2)

[[45](/en/ch4#Webb2008-marker)] Joe Webb.
[Using
Covering Indexes to Improve Query Performance](https://www.red-gate.com/simple-talk/databases/sql-server/learn/using-covering-indexes-to-improve-query-performance/). *simple-talk.com*, September 2008.
Archived at [perma.cc/6MEZ-R5VR](https://perma.cc/6MEZ-R5VR)

[[46](/en/ch4#Stonebraker2007-marker)] Michael Stonebraker, Samuel Madden, Daniel J.
Abadi, Stavros Harizopoulos, Nabil Hachem, and Pat Helland.
[The End of an
Architectural Era (It’s Time for a Complete Rewrite)](https://vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf). At *33rd International Conference on
Very Large Data Bases* (VLDB), September 2007.

[[47](/en/ch4#VoltDB2014uj-marker)] [VoltDB
Technical Overview White Paper](https://www.voltactivedata.com/wp-content/uploads/2017/03/hv-white-paper-voltdb-technical-overview.pdf). VoltDB, 2017.
Archived at [perma.cc/B9SF-SK5G](https://perma.cc/B9SF-SK5G)

[[48](/en/ch4#Rumble2014-marker)] Stephen M. Rumble, Ankita Kejriwal, and John K. Ousterhout.
[Log-Structured
Memory for DRAM-Based Storage](https://www.usenix.org/system/files/conference/fast14/fast14-paper_rumble.pdf). At *12th USENIX Conference on File and Storage
Technologies* (FAST), February 2014.

[[49](/en/ch4#Harizopoulos2008-marker)] Stavros Harizopoulos, Daniel J. Abadi,
Samuel Madden, and Michael Stonebraker.
[OLTP Through the Looking Glass,
and What We Found There](https://hstore.cs.brown.edu/papers/hstore-lookingglass.pdf). At *ACM International Conference on Management of Data*
(SIGMOD), June 2008.
[doi:10.1145/1376616.1376713](https://doi.org/10.1145/1376616.1376713)

[[50](/en/ch4#Larson2013-marker)] Per-Åke Larson, Cipri Clinciu, Campbell Fraser,
Eric N. Hanson, Mostafa Mokhtar, Michal Nowakiewicz, Vassilis Papadimos, Susan L. Price, Srikumar
Rangarajan, Remus Rusanu, and Mayukh Saubhasik.
[Enhancements
to SQL Server Column Stores](https://web.archive.org/web/20131203001153id_/http%3A//research.microsoft.com/pubs/193599/Apollo3%20-%20Sigmod%202013%20-%20final.pdf). At *ACM International Conference on Management of Data* (SIGMOD), June 2013.
[doi:10.1145/2463676.2463708](https://doi.org/10.1145/2463676.2463708)

[[51](/en/ch4#Farber2012-marker)] Franz Färber, Norman May, Wolfgang Lehner, Philipp Große,
Ingo Müller, Hannes Rauhe, and Jonathan Dees.
[The
SAP HANA Database – An Architecture Overview](https://web.archive.org/web/20220208081111id_/http%3A//sites.computer.org/debull/A12mar/hana.pdf).
*IEEE Data Engineering Bulletin*, volume 35, issue 1, pages 28–33, March 2012.

[[52](/en/ch4#Stonebraker2013-marker)] Michael Stonebraker.
[The Traditional RDBMS Wisdom Is (Almost Certainly) All
Wrong](https://slideshot.epfl.ch/talks/166). Presentation at *EPFL*, May 2013.

[[53](/en/ch4#Prout2022_ch4-marker)] Adam Prout, Szu-Po Wang, Joseph Victor, Zhou Sun, Yongzhu
Li, Jack Chen, Evan Bergeron, Eric Hanson, Robert Walzer, Rodrigo Gomes, and Nikita Shamgunov.
[Cloud-Native Transactions and Analytics
in SingleStore](https://dl.acm.org/doi/pdf/10.1145/3514221.3526055). At *ACM International Conference on Management of Data* (SIGMOD), June 2022.
[doi:10.1145/3514221.3526055](https://doi.org/10.1145/3514221.3526055)

[[54](/en/ch4#Tereshko2016-marker)] Tino Tereshko and Jordan Tigani.
[BigQuery under the
hood](https://cloud.google.com/blog/products/bigquery/bigquery-under-the-hood). *cloud.google.com*, January 2016.
Archived at [perma.cc/WP2Y-FUCF](https://perma.cc/WP2Y-FUCF)

[[55](/en/ch4#McKinney2023-marker)] Wes McKinney.
[The Road to Composable Data Systems:
Thoughts on the Last 15 Years and the Future](https://wesmckinney.com/blog/looking-back-15-years/). *wesmckinney.com*, September 2023.
Archived at [perma.cc/6L2M-GTJX](https://perma.cc/6L2M-GTJX)

[[56](/en/ch4#Stonebraker2005-marker)] Michael Stonebraker, Daniel
J. Abadi, Adam Batkin, Xuedong Chen, Mitch Cherniack, Miguel Ferreira, Edmond Lau, Amerson Lin, Sam
Madden, Elizabeth O’Neil, Pat O’Neil, Alex Rasin, Nga Tran, and Stan Zdonik.
[C-Store:
A Column-oriented DBMS](https://www.vldb.org/archives/website/2005/program/paper/thu/p553-stonebraker.pdf). At *31st International Conference on Very Large Data Bases*
(VLDB), pages 553–564, September 2005.

[[57](/en/ch4#LeDem2013-marker)] Julien Le Dem.
[Dremel
Made Simple with Parquet](https://blog.twitter.com/engineering/en_us/a/2013/dremel-made-simple-with-parquet.html). *blog.twitter.com*, September 2013.

[[58](/en/ch4#Melnik2010-marker)] Sergey Melnik, Andrey Gubarev, Jing Jing Long,
Geoffrey Romer, Shiva Shivakumar, Matt Tolton, and Theo Vassilakis.
[Dremel: Interactive Analysis of Web-Scale
Datasets](https://vldb.org/pvldb/vol3/R29.pdf). At *36th International Conference on Very Large Data Bases* (VLDB), pages
330–339, September 2010.
[doi:10.14778/1920841.1920886](https://doi.org/10.14778/1920841.1920886)

[[59](/en/ch4#Kearney2016-marker)] Joe Kearney.
[Understanding Record
Shredding: storing nested data in columns](https://www.joekearney.co.uk/posts/understanding-record-shredding). *joekearney.co.uk*, December 2016.
Archived at [perma.cc/ZD5N-AX5D](https://perma.cc/ZD5N-AX5D)

[[60](/en/ch4#Brandon2023-marker)] Jamie Brandon.
[A
shallow survey of OLAP and HTAP query engines](https://www.scattered-thoughts.net/writing/a-shallow-survey-of-olap-and-htap-query-engines). *scattered-thoughts.net*, September 2023.
Archived at [perma.cc/L3KH-J4JF](https://perma.cc/L3KH-J4JF)

[[61](/en/ch4#Dageville2016-marker)] Benoit Dageville, Thierry Cruanes, Marcin
Zukowski, Vadim Antonov, Artin Avanes, Jon Bock, Jonathan Claybaugh, Daniel Engovatov, Martin
Hentschel, Jiansheng Huang, Allison W. Lee, Ashish Motivala, Abdul Q. Munir, Steven Pelley, Peter
Povinec, Greg Rahn, Spyridon Triantafyllis, and Philipp Unterbrunner.
[The Snowflake Elastic Data Warehouse](https://dl.acm.org/doi/pdf/10.1145/2882903.2903741).
At *ACM International Conference on Management of Data* (SIGMOD), pages 215–226, June 2016.
[doi:10.1145/2882903.2903741](https://doi.org/10.1145/2882903.2903741)

[[62](/en/ch4#Raasveldt2020-marker)] Mark Raasveldt and Hannes Mühleisen.
[Data Management for Data
Science Towards Embedded Analytics](https://duckdb.org/pdf/CIDR2020-raasveldt-muehleisen-duckdb.pdf). At *10th Conference on Innovative Data Systems
Research* (CIDR), January 2020.

[[63](/en/ch4#Im2018-marker)] Jean-François Im, Kishore Gopalakrishna, Subbu
Subramaniam, Mayank Shrivastava, Adwait Tumbde, Xiaotian Jiang, Jennifer Dai, Seunghyun Lee, Neha
Pawar, Jialiang Li, and Ravi Aringunram.
[Pinot:
Realtime OLAP for 530 Million Users](https://cwiki.apache.org/confluence/download/attachments/103092375/Pinot.pdf). At *ACM International Conference on Management of
Data* (SIGMOD), pages 583–594, May 2018.
[doi:10.1145/3183713.3190661](https://doi.org/10.1145/3183713.3190661)

[[64](/en/ch4#Yang2014-marker)] Fangjin Yang, Eric Tschetter, Xavier
Léauté, Nelson Ray, Gian Merlino, and Deep Ganguli.
[Druid: A Real-time Analytical Data Store](https://static.druid.io/docs/druid.pdf).
At *ACM International Conference on Management of Data* (SIGMOD), June 2014.
[doi:10.1145/2588555.2595631](https://doi.org/10.1145/2588555.2595631)

[[65](/en/ch4#Liu2023-marker)] Chunwei Liu, Anna Pavlenko, Matteo Interlandi, and Brandon Haynes.
[Deep Dive into Common Open Formats for Analytical DBMSs](https://www.vldb.org/pvldb/vol16/p3044-liu.pdf).
*Proceedings of the VLDB Endowment*, volume 16, issue 11, pages 3044–3056, July 2023.
[doi:10.14778/3611479.3611507](https://doi.org/10.14778/3611479.3611507)

[[66](/en/ch4#Zeng2023-marker)] Xinyu Zeng, Yulong Hui, Jiahong Shen, Andrew Pavlo, Wes
McKinney, and Huanchen Zhang. [An Empirical
Evaluation of Columnar Storage Formats](https://www.vldb.org/pvldb/vol17/p148-zeng.pdf). *Proceedings of the VLDB Endowment*, volume 17,
issue 2, pages 148–161.
[doi:10.14778/3626292.3626298](https://doi.org/10.14778/3626292.3626298)

[[67](/en/ch4#Pace2024-marker)] Weston Pace.
[Lance v2: A columnar container format for modern data](https://blog.lancedb.com/lance-v2/).
*blog.lancedb.com*, April 2024.
Archived at [perma.cc/ZK3Q-S9VJ](https://perma.cc/ZK3Q-S9VJ)

[[68](/en/ch4#Helfman2024-marker)] Yoav Helfman.
[Nimble, A New Columnar File Format](https://www.youtube.com/watch?v=bISBNVtXZ6M).
At *VeloxCon*, April 2024.

[[69](/en/ch4#McKinney2021-marker)] Wes McKinney.
[Apache Arrow: High-Performance Columnar Data
Framework](https://www.youtube.com/watch?v=YhF8YR0OEFk). At *CMU Database Group – Vaccination Database Tech Talks*, December 2021.

[[70](/en/ch4#McKinney2022-marker)] Wes McKinney.
[Python for Data
Analysis, 3rd Edition](https://learning.oreilly.com/library/view/python-for-data/9781098104023/). O’Reilly Media, August 2022. ISBN: 9781098104023

[[71](/en/ch4#Dix2021-marker)] Paul Dix.
[The Design of InfluxDB IOx: An In-Memory
Columnar Database Written in Rust with Apache Arrow](https://www.youtube.com/watch?v=_zbwz-4RDXg). At *CMU Database Group – Vaccination
Database Tech Talks*, May 2021.

[[72](/en/ch4#Soto2024-marker)] Carlota Soto and Mike Freedman.
[Building
Columnar Compression for Large PostgreSQL Databases](https://www.timescale.com/blog/building-columnar-compression-in-a-row-oriented-database/). *timescale.com*, March 2024.
Archived at [perma.cc/7KTF-V3EH](https://perma.cc/7KTF-V3EH)

[[73](/en/ch4#Lemire2016-marker)] Daniel Lemire, Gregory Ssi‐Yan‐Kai, and Owen Kaser.
[Consistently faster and smaller compressed bitmaps with Roaring](https://arxiv.org/pdf/1603.06549).
*Software: Practice and Experience*, volume 46, issue 11, pages 1547–1569, November 2016.
[doi:10.1002/spe.2402](https://doi.org/10.1002/spe.2402)

[[74](/en/ch4#Volpert2024-marker)] Jaz Volpert.
[An entire Social Network in 1.6GB (GraphD
Part 2)](https://jazco.dev/2024/04/20/roaring-bitmaps/). *jazco.dev*, April 2024.
Archived at [perma.cc/L27Z-QVMG](https://perma.cc/L27Z-QVMG)

[[75](/en/ch4#Abadi2013-marker)] Daniel J. Abadi, Peter Boncz, Stavros
Harizopoulos, Stratos Idreos, and Samuel Madden.
[The Design and
Implementation of Modern Column-Oriented Database Systems](https://www.cs.umd.edu/~abadi/papers/abadi-column-stores.pdf). *Foundations and Trends in
Databases*, volume 5, issue 3, pages 197–280, December 2013.
[doi:10.1561/1900000024](https://doi.org/10.1561/1900000024)

[[76](/en/ch4#Lamb2012-marker)] Andrew Lamb, Matt Fuller, Ramakrishna Varadarajan,
Nga Tran, Ben Vandiver, Lyric Doshi, and Chuck Bear.
[The Vertica Analytic Database: C-Store 7 Years Later](https://vldb.org/pvldb/vol5/p1790_andrewlamb_vldb2012.pdf).
*Proceedings of the VLDB Endowment*, volume 5, issue 12, pages 1790–1801, August 2012.
[doi:10.14778/2367502.2367518](https://doi.org/10.14778/2367502.2367518)

[[77](/en/ch4#Kersten2018-marker)] Timo Kersten, Viktor Leis, Alfons Kemper, Thomas
Neumann, Andrew Pavlo, and Peter Boncz.
[Everything You Always Wanted to Know
About Compiled and Vectorized Queries But Were Afraid to Ask](https://www.vldb.org/pvldb/vol11/p2209-kersten.pdf). *Proceedings of the VLDB
Endowment*, volume 11, issue 13, pages 2209–2222, September 2018.
[doi:10.14778/3275366.3284966](https://doi.org/10.14778/3275366.3284966)

[[78](/en/ch4#Smith2020-marker)] Forrest Smith.
[Memory Bandwidth Napkin
Math](https://www.forrestthewoods.com/blog/memory-bandwidth-napkin-math/). *forrestthewoods.com*, February 2020.
Archived at [perma.cc/Y8U4-PS7N](https://perma.cc/Y8U4-PS7N)

[[79](/en/ch4#Boncz2005-marker)] Peter Boncz, Marcin Zukowski, and Niels Nes.
[MonetDB/X100: Hyper-Pipelining Query Execution](https://www.cidrdb.org/cidr2005/papers/P19.pdf).
At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005.

[[80](/en/ch4#Zhou2002-marker)] Jingren Zhou and Kenneth A. Ross.
[Implementing Database Operations Using SIMD Instructions](https://www1.cs.columbia.edu/~kar/pubsk/simd.pdf).
At *ACM International Conference on Management of Data* (SIGMOD), pages 145–156, June 2002.
[doi:10.1145/564691.564709](https://doi.org/10.1145/564691.564709)

[[81](/en/ch4#Bartley2024-marker)] Kevin Bartley.
[OLTP Queries: Transfer Expensive Workloads to
Materialize](https://materialize.com/blog/oltp-queries/). *materialize.com*, August 2024.
Archived at [perma.cc/4TYM-TYD8](https://perma.cc/4TYM-TYD8)

[[82](/en/ch4#Gray2007-marker)] Jim Gray, Surajit Chaudhuri, Adam Bosworth, Andrew
Layman, Don Reichart, Murali Venkatrao, Frank Pellow, and Hamid Pirahesh.
[Data Cube: A Relational Aggregation Operator
Generalizing Group-By, Cross-Tab, and Sub-Totals](https://arxiv.org/pdf/cs/0701155). *Data Mining and Knowledge
Discovery*, volume 1, issue 1, pages 29–53, March 2007.
[doi:10.1023/A:1009726021843](https://doi.org/10.1023/A%3A1009726021843)

[[83](/en/ch4#Ramsak2000-marker)] Frank Ramsak, Volker Markl, Robert Fenk, Martin
Zirkel, Klaus Elhardt, and Rudolf Bayer.
[Integrating the UB-Tree into a Database System Kernel](https://www.vldb.org/conf/2000/P263.pdf).
At *26th International Conference on Very Large Data Bases* (VLDB), September 2000.

[[84](/en/ch4#Procopiuc2003-marker)] Octavian Procopiuc, Pankaj K. Agarwal, Lars
Arge, and Jeffrey Scott Vitter.
[Bkd-Tree: A Dynamic
Scalable kd-Tree](https://users.cs.duke.edu/~pankaj/publications/papers/bkd-sstd.pdf). At *8th International Symposium on Spatial and Temporal Databases*
(SSTD), pages 46–65, July 2003.
[doi:10.1007/978-3-540-45072-6\_4](https://doi.org/10.1007/978-3-540-45072-6_4)

[[85](/en/ch4#Hellerstein1995-marker)] Joseph M. Hellerstein, Jeffrey F. Naughton, and Avi Pfeffer.
[Generalized Search Trees for Database Systems](https://dsf.berkeley.edu/papers/vldb95-gist.pdf).
At *21st International Conference on Very Large Data Bases* (VLDB), September 1995.

[[86](/en/ch4#Brodsky2018-marker)] Isaac Brodsky.
[H3: Uber’s Hexagonal Hierarchical Spatial Index](https://eng.uber.com/h3/).
*eng.uber.com*, June 2018.
Archived at [archive.org](https://web.archive.org/web/20240722003854/https%3A//www.uber.com/blog/h3/)

[[87](/en/ch4#Escriva2012-marker)] Robert Escriva, Bernard Wong, and Emin Gün Sirer.
[HyperDex:
A Distributed, Searchable Key-Value Store](https://www.cs.princeton.edu/courses/archive/fall13/cos518/papers/hyperdex.pdf). At *ACM SIGCOMM Conference*, August 2012.
[doi:10.1145/2377677.2377681](https://doi.org/10.1145/2377677.2377681)

[[88](/en/ch4#Manning2008_ch4-marker)] Christopher D. Manning, Prabhakar Raghavan,
and Hinrich Schütze.
[*Introduction to Information Retrieval*](https://nlp.stanford.edu/IR-book/).
Cambridge University Press, 2008. ISBN: 978-0-521-86571-5, available online at
[nlp.stanford.edu/IR-book](https://nlp.stanford.edu/IR-book/)

[[89](/en/ch4#Wang2017-marker)] Jianguo Wang, Chunbin Lin, Yannis Papakonstantinou,
and Steven Swanson.
[An Experimental
Study of Bitmap Compression vs. Inverted List Compression](https://cseweb.ucsd.edu/~swanson/papers/SIGMOD2017-ListCompression.pdf). At *ACM International Conference
on Management of Data* (SIGMOD), pages 993–1008, May 2017.
[doi:10.1145/3035918.3064007](https://doi.org/10.1145/3035918.3064007)

[[90](/en/ch4#Grand2013-marker)] Adrien Grand.
[What is in a Lucene
Index?](https://speakerdeck.com/elasticsearch/what-is-in-a-lucene-index) At *Lucene/Solr Revolution*, November 2013.
Archived at [perma.cc/Z7QN-GBYY](https://perma.cc/Z7QN-GBYY)

[[91](/en/ch4#McCandless2011merges-marker)] Michael McCandless.
[Visualizing
Lucene’s Segment Merges](https://blog.mikemccandless.com/2011/02/visualizing-lucenes-segment-merges.html). *blog.mikemccandless.com*, February 2011.
Archived at [perma.cc/3ZV8-72W6](https://perma.cc/3ZV8-72W6)

[[92](/en/ch4#Fittl2021-marker)] Lukas Fittl.
[Understanding Postgres GIN Indexes: The Good and the
Bad](https://pganalyze.com/blog/gin-index). *pganalyze.com*, December 2021.
Archived at [perma.cc/V3MW-26H6](https://perma.cc/V3MW-26H6)

[[93](/en/ch4#Angelakos2020-marker)] Jimmy Angelakos.
[The State of (Full) Text Search in PostgreSQL
12](https://www.youtube.com/watch?v=c8IrUHV70KQ). At *FOSDEM*, February 2020.
Archived at [perma.cc/J6US-3WZS](https://perma.cc/J6US-3WZS)

[[94](/en/ch4#Korotkov2012-marker)] Alexander Korotkov.
[Index
support for regular expression search](https://wiki.postgresql.org/images/6/6c/Index_support_for_regular_expression_search.pdf). At *PGConf.EU Prague*, October 2012.
Archived at [perma.cc/5RFZ-ZKDQ](https://perma.cc/5RFZ-ZKDQ)

[[95](/en/ch4#McCandless2011fuzzy-marker)] Michael McCandless.
[Lucene’s
FuzzyQuery Is 100 Times Faster in 4.0](https://blog.mikemccandless.com/2011/03/lucenes-fuzzyquery-is-100-times-faster.html). *blog.mikemccandless.com*, March 2011.
Archived at [perma.cc/E2WC-GHTW](https://perma.cc/E2WC-GHTW)

[[96](/en/ch4#Heinz2002-marker)] Steffen Heinz, Justin Zobel, and Hugh E. Williams.
[Burst
Tries: A Fast, Efficient Data Structure for String Keys](https://web.archive.org/web/20130903070248id_/http%3A//ww2.cs.mu.oz.au%3A80/~jz/fulltext/acmtois02.pdf).
*ACM Transactions on Information Systems*, volume 20, issue 2, pages 192–223, April 2002.
[doi:10.1145/506309.506312](https://doi.org/10.1145/506309.506312)

[[97](/en/ch4#Schulz2002-marker)] Klaus U. Schulz and Stoyan Mihov.
[Fast String
Correction with Levenshtein Automata](https://dmice.ohsu.edu/bedricks/courses/cs655/pdf/readings/2002_Schulz.pdf). *International Journal on Document Analysis and
Recognition*, volume 5, issue 1, pages 67–85, November 2002.
[doi:10.1007/s10032-002-0082-8](https://doi.org/10.1007/s10032-002-0082-8)

[[98](/en/ch4#Mikolov2013-marker)] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean.
[Efficient Estimation of Word Representations in Vector Space](https://arxiv.org/pdf/1301.3781).
At *International Conference on Learning Representations* (ICLR), May 2013.
[doi:10.48550/arXiv.1301.3781](https://doi.org/10.48550/arXiv.1301.3781)

[[99](/en/ch4#Devlin2018-marker)] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova.
[BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding](https://arxiv.org/pdf/1810.04805).
At *Conference of the North American Chapter of the Association for Computational
Linguistics: Human Language Technologies*, volume 1, pages 4171–4186, June 2019.
[doi:10.18653/v1/N19-1423](https://doi.org/10.18653/v1/N19-1423)

[[100](/en/ch4#Radford2018-marker)] Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever.
[Improving
Language Understanding by Generative Pre-Training](https://cdn.openai.com/research-covers/language-unsupervised/language_understanding_paper.pdf). *openai.com*, June 2018.
Archived at [perma.cc/5N3C-DJ4C](https://perma.cc/5N3C-DJ4C)

[[101](/en/ch4#Faiis2023-marker)] Matthijs Douze, Maria Lomeli, and Lucas Hosseini.
[Faiss indexes](https://github.com/facebookresearch/faiss/wiki/Faiss-indexes).
*github.com*, August 2024.
Archived at [perma.cc/2EWG-FPBS](https://perma.cc/2EWG-FPBS)

[[102](/en/ch4#Matevosyan2024-marker)] Varik Matevosyan.
[Understanding pgvector’s HNSW Index Storage in Postgres](https://lantern.dev/blog/pgvector-storage).
*lantern.dev*, August 2024.
Archived at [perma.cc/B2YB-JB59](https://perma.cc/B2YB-JB59)

[[103](/en/ch4#Baranchuk2018-marker)] Dmitry Baranchuk, Artem Babenko, and Yury Malkov.
[Revisiting the Inverted Indices for Billion-Scale Approximate Nearest Neighbors](https://arxiv.org/pdf/1802.02422).
At *European Conference on Computer Vision* (ECCV), pages 202–216, September 2018.
[doi:10.1007/978-3-030-01258-8\_13](https://doi.org/10.1007/978-3-030-01258-8_13)

[[104](/en/ch4#Malkov2020-marker)] Yury A. Malkov and Dmitry A. Yashunin.
[Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs](https://arxiv.org/pdf/1603.09320).
*IEEE Transactions on Pattern Analysis and Machine Intelligence*, volume 42, issue 4, pages 824–836, April 2020.
[doi:10.1109/TPAMI.2018.2889473](https://doi.org/10.1109/TPAMI.2018.2889473)