fix reference summary

2026-06-25 10:56:50 +08:00 · 2025-08-09 16:09:53 +08:00 · 4ec385f161
commit 4ec385f161
parent 752c2f58c7
14 changed files with 2811 additions and 3255 deletions
--- a/content/en/ch1.md
+++ b/content/en/ch1.md
@@ -252,9 +252,7 @@ the data warehouse. This process of getting data into the data warehouse is know
 *transform* and *load* steps is swapped (i.e., the transformation is done in the data warehouse,
 after loading), resulting in *ELT*.
-![ddia 0101](/fig/ddia_0101.png)
+{{< figure src="/fig/ddia_0101.png" id="fig_dwh_etl" title="Figure 1-1. Simplified outline of ETL into a data warehouse." class="w-full my-4" >}}
 ###### Figure 1-1. Simplified outline of ETL into a data warehouse.
 In some cases the data sources of the ETL processes are external SaaS products such as customer
 relationship management (CRM), email marketing, or credit card processing systems. In those cases,
@@ -428,9 +426,10 @@ the other extreme are widely-used cloud services or Software as a Service (SaaS)
 implemented and operated by an external vendor, and which you only access through a web interface or
 API.
 ![ddia 0102](/fig/ddia_0102.png)
-###### Figure 1-2. A spectrum of types of software and its operations.
+{{< figure src="/fig/ddia_0102.png" id="fig_cloud_spectrum" title="Figure 1-2. A spectrum of types of software and its operations." class="w-full my-4" >}}
 The middle ground is off-the-shelf software (open source or commercial) that you *self-host*, i.e.,
 deploy yourself—for example, if you download MySQL and install it on a server you control. This
@@ -672,7 +671,7 @@ processes you can run concurrently), which you need to know about and plan for b
 Adopting a cloud service can be easier and quicker than running your own infrastructure, although
 even here there is a cost in learning how to use it, and perhaps working around its limitations.
 Integration between different services becomes a particular challenge as a growing number of vendors
-offers an ever broader range of cloud services targeting different use cases [^39][^40].
+offers an ever broader range of cloud services targeting different use cases [^39] [^40].
 ETL (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) is only part of the story; operational cloud services also need
 to be integrated with each other. At present, there is a lack of standards that would facilitate
@@ -740,7 +739,7 @@ Sustainability
 :   If you have flexibility on where and when to run your jobs, you might be able to run them in a
    time and place where plenty of renewable electricity is available, and avoid running them when the
    power grid is under strain. This can reduce your carbon emissions and allow you to take advantage
-    of cheap power when it is available [^42][^43].
+    of cheap power when it is available [^42] [^43].
 These reasons apply both to services that you write yourself (application code) and services
 consisting of off-the-shelf software (such as databases).
@@ -962,7 +961,7 @@ whose data you are collecting and processing. There is much more to this topic;
 will go deeper into the topics of ethics and legal compliance, including the problems of bias and
 discrimination.
-# Summary
+## Summary
 The theme of this chapter has been to understand trade-offs: that is, to recognize that for many
 questions there is not one right answer, but several different approaches that each have various
@@ -994,9 +993,7 @@ data is being processed—an aspect that many engineers are prone to ignoring. H
 requirements into technical implementations is not yet well understood, but it’s important to keep
 this question in mind as we move through the rest of this book.
-## Footnotes
+### References
 ## References
 [^1]: Richard T. Kouzes, Gordon A. Anderson, Stephen T. Elbert, Ian Gorton, and Deborah K. Gracio. [The Changing Paradigm of Data-Intensive Computing](http://www2.ic.uff.br/~boeres/slides_AP/papers/TheChanginParadigmDataIntensiveComputing_2009.pdf). *IEEE Computer*, volume 42, issue 1, January 2009. [doi:10.1109/MC.2009.26](https://doi.org/10.1109/MC.2009.26)
 [^2]: Martin Kleppmann, Adam Wiggins, Peter van Hardenberg, and Mark McGranaghan. [Local-first software: you own your data, in spite of the cloud](https://www.inkandswitch.com/local-first/). At *2019 ACM SIGPLAN International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software* (Onward!), October 2019. [doi:10.1145/3359591.3359737](https://doi.org/10.1145/3359591.3359737)
--- a/content/en/ch10.md
+++ b/content/en/ch10.md
--- a/content/en/ch11.md
+++ b/content/en/ch11.md
@@ -35,7 +35,7 @@ Stream processing is somewhere between online and offline/batch processing (so i
 As we shall see in this chapter, batch processing is an important building block in our quest to build reliable, scalable, and maintainable applications. For example, Map‐ Reduce, a batch processing algorithm published in 2004 [1], was (perhaps over- enthusiastically) called “the algorithm that makes Google so massively scalable” [2]. It was subsequently implemented in various open source data systems, including Hadoop, CouchDB, and MongoDB.
-MapReduce is a fairly low-level programming model compared to the parallel pro‐ cessing systems that were developed for data warehouses many years previously [3, 4], but it was a major step forward in terms of the scale of processing that could be achieved on commodity hardware. Although the importance of MapReduce is now declining [5], it is still worth understanding, because it provides a clear picture of why and how batch processing is useful.
+MapReduce is a fairly low-level programming model compared to the parallel pro‐ cessing systems that were developed for data warehouses many years previously [^3] [^4], but it was a major step forward in terms of the scale of processing that could be achieved on commodity hardware. Although the importance of MapReduce is now declining [5], it is still worth understanding, because it provides a clear picture of why and how batch processing is useful.
 In fact, batch processing is a very old form of computing. Long before programmable digital computers were invented, punch card tabulating machines—such as the Hol‐ lerith machines used in the 1890 US Census [6]—implemented a semi-mechanized form of batch processing to compute aggregate statistics from large inputs. And Map‐ Reduce bears an uncanny resemblance to the electromechanical IBM card-sorting machines that were widely used for business data processing in the 1940s and 1950s [7]. As usual, history has a tendency of repeating itself.
@@ -94,7 +94,7 @@ In the next chapter, we will turn to stream processing, in which the input is *u
-## References
+### References
 1. Jeffrey Dean and Sanjay Ghemawat: “[MapReduce: Simplified Data Processing on Large Clusters](https://research.google/pubs/pub62/),” at *6th USENIX Symposium on Operating System Design and Implementation* (OSDI), December 2004.
 1. Joel Spolsky: “[The Perils of JavaSchools](https://www.joelonsoftware.com/2005/12/29/the-perils-of-javaschools-2/),” *joelonsoftware.com*, December 29, 2005.
--- a/content/en/ch12.md
+++ b/content/en/ch12.md
@@ -75,7 +75,7 @@ Finally, we discussed techniques for achieving fault tolerance and exactly-once
-## References
+### References
 1. Tyler Akidau, Robert Bradshaw, Craig Chambers, et al.: “[The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing](http://www.vldb.org/pvldb/vol8/p1792-Akidau.pdf),” *Proceedings of the VLDB Endowment*, volume 8, number 12, pages 1792–1803, August 2015. [doi:10.14778/2824032.2824076](http://dx.doi.org/10.14778/2824032.2824076)
 1. Harold Abelson, Gerald Jay Sussman, and Julie Sussman: [*Structure and Interpretation of Computer Programs*](https://web.archive.org/web/20220807043536/https://mitpress.mit.edu/sites/default/files/sicp/index.html), 2nd edition. MIT Press, 1996. ISBN: 978-0-262-51087-5, available online at *mitpress.mit.edu*
--- a/content/en/ch13.md
+++ b/content/en/ch13.md
@@ -48,7 +48,7 @@ Finally, we took a step back and examined some ethical aspects of building data-
 As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal.
-## References
+### References
 1. Rachid Belaid: “[Postgres Full-Text Search is Good Enough!](http://rachbelaid.com/postgres-full-text-search-is-good-enough/),” *rachbelaid.com*, July 13, 2015.
 1. Philippe Ajoux, Nathan Bronson, Sanjeev Kumar, et al.: “[Challenges to Adopting Stronger Consistency at Scale](https://www.usenix.org/system/files/conference/hotos15/hotos15-paper-ajoux.pdf),” at *15th USENIX Workshop on Hot Topics in Operating Systems* (HotOS), May 2015.
--- a/content/en/ch2.md
+++ b/content/en/ch2.md
@@ -30,9 +30,9 @@ articulate them for your own systems:
 * How to define and measure the *performance* of a system (see [“Describing Performance”](/en/ch2#sec_introduction_percentiles));
 * What it means for a service to be *reliable*—namely, continuing to work correctly, even when
-  things go wrong (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability));
+ things go wrong (see [“Reliability and Fault Tolerance”](/en/ch2#sec_introduction_reliability));
 * Allowing a system to be *scalable* by having efficient ways of adding computing
-  capacity as the load on the system grows (see [“Scalability”](/en/ch2#sec_introduction_scalability)); and
+ capacity as the load on the system grows (see [“Scalability”](/en/ch2#sec_introduction_scalability)); and
 * Making it easier to maintain a system in the long term (see [“Maintainability”](/en/ch2#sec_introduction_maintainability)).
 The terminology introduced in this chapter will also be useful in the following chapters, when we go
@@ -70,11 +70,11 @@ query to get the home timeline for a particular user:
 ```
 SELECT posts.*, users.* FROM posts
-  JOIN follows ON posts.sender_id = follows.followee_id
+ JOIN follows ON posts.sender_id = follows.followee_id
-  JOIN users   ON posts.sender_id = users.id
+ JOIN users ON posts.sender_id = users.id
-  WHERE follows.follower_id = current_user
+ WHERE follows.follower_id = current_user
-  ORDER BY posts.timestamp DESC
+ ORDER BY posts.timestamp DESC
-  LIMIT 1000
+ LIMIT 1000
 ```
 To execute this query, the database will use the `follows` table to find everybody who
@@ -135,32 +135,32 @@ write. The cost of writes for most users is modest, but a social network also ha
 extreme cases:
 * If a user is following a very large number of accounts, and those accounts post a lot, that user
-  will have a high rate of writes to their materialized timeline. However, in this case it’s
+ will have a high rate of writes to their materialized timeline. However, in this case it’s
-  unlikely that the user is actually reading all of the posts in their timeline, and therefore it’s
+ unlikely that the user is actually reading all of the posts in their timeline, and therefore it’s
-  okay to simply drop some of their timeline writes and show the user only a sample of the posts
+ okay to simply drop some of their timeline writes and show the user only a sample of the posts
-  from the accounts they’re following
+ from the accounts they’re following
-  [^5].
+ [^5].
 * When a celebrity account with a very large number of followers makes a post, we have to do a large
-  amount of work to insert that post into the home timelines of each of their millions of followers.
+ amount of work to insert that post into the home timelines of each of their millions of followers.
-  In this case it’s not okay to drop some of those writes. One way of solving this problem is to
+ In this case it’s not okay to drop some of those writes. One way of solving this problem is to
-  handle celebrity posts separately from everyone else’s posts: we can save ourselves the effort of
+ handle celebrity posts separately from everyone else’s posts: we can save ourselves the effort of
-  adding them to millions of timelines by storing the celebrity posts separately and merging them
+ adding them to millions of timelines by storing the celebrity posts separately and merging them
-  with the materialized timeline when it is read. Despite such optimizations, handling celebrities
+ with the materialized timeline when it is read. Despite such optimizations, handling celebrities
-  on a social network can require a lot of infrastructure
+ on a social network can require a lot of infrastructure
-  [^6].
+ [^6].
 # Describing Performance
 Most discussions of software performance consider two main types of metric:
 Response time
-:   The elapsed time from the moment when a user makes a request until they receive the requested
+: The elapsed time from the moment when a user makes a request until they receive the requested
-    answer. The unit of measurement is seconds (or milliseconds, or microseconds).
+ answer. The unit of measurement is seconds (or milliseconds, or microseconds).
 Throughput
-:   The number of requests per second, or the data volume per second, that the system is processing.
+: The number of requests per second, or the data volume per second, that the system is processing.
-    For a given allocation of hardware resources, there is a *maximum throughput* that can be handled.
+ For a given allocation of hardware resources, there is a *maximum throughput* that can be handled.
-    The unit of measurement is “somethings per second”.
+ The unit of measurement is “somethings per second”.
 In the social network case study, “posts per second” and “timeline writes per second” are throughput
 metrics, whereas the “time it takes to load the home timeline” or the “time until a post is
@@ -187,24 +187,19 @@ time out and resend their request. This causes the rate of requests to increase
 the problem worse—a *retry storm*. Even when the load is reduced again, such a system may remain in
 an overloaded state until it is rebooted or otherwise reset. This phenomenon is called a *metastable
 failure*, and it can cause serious outages in production systems
-[[7](/en/ch2#Bronson2021),
+[[^7], [^8]].
 [8](/en/ch2#Brooker2021)].
 To avoid retries overloading a service, you can increase and randomize the time between successive
 retries on the client side (*exponential backoff*
-[[9](/en/ch2#Brooker2015),
+[[^9], [^10]]),
 [10](/en/ch2#Brooker2022backoff)]),
 and temporarily stop sending requests to a service that has returned errors or timed out recently
-(using a *circuit breaker* [[11](/en/ch2#Nygard2018),
+(using a *circuit breaker* [[^11], [^12]]
 [12](/en/ch2#Chen2022)]
 or *token bucket* algorithm [^13]).
 The server can also detect when it is approaching overload and start proactively rejecting requests
 (*load shedding* [^14]), and send back
 responses asking clients to slow down (*backpressure*
-[[1](/en/ch2#Cvet2016),
+[[^1], [^15]]).
-[15](/en/ch2#Sackman2016_ch2)]).
+The choice of queueing and load-balancing algorithms can also make a difference [^16].
 The choice of queueing and load-balancing algorithms can also make a difference
 [^16].
 In terms of performance metrics, the response time is usually what users care about the most,
 whereas the throughput determines the required computing resources (e.g., how many servers you need),
@@ -221,15 +216,15 @@ scalability in [“Scalability”](/en/ch2#sec_introduction_scalability).
 terms in a specific way (illustrated in [Figure 2-4](/en/ch2#fig_response_time)):
 * The *response time* is what the client sees; it includes all delays incurred anywhere in the
-  system.
+ system.
 * The *service time* is the duration for which the service is actively processing the user request.
 * *Queueing delays* can occur at several points in the flow: for example, after a request is
-  received, it might need to wait until a CPU is available before it can be processed; a response
+ received, it might need to wait until a CPU is available before it can be processed; a response
-  packet might need to be buffered before it is sent over the network if other tasks on the same
+ packet might need to be buffered before it is sent over the network if other tasks on the same
-  machine are sending a lot of data via the outbound network interface.
+ machine are sending a lot of data via the outbound network interface.
 * *Latency* is a catch-all term for time during which a request is not being actively processed,
-  i.e., during which it is *latent*. In particular, *network latency* or *network delay* refers to
+ i.e., during which it is *latent*. In particular, *network latency* or *network delay* refers to
-  the time that request and response spend traveling through the network.
+ the time that request and response spend traveling through the network.
 ![ddia 0204](/fig/ddia_0204.png)
@@ -242,8 +237,7 @@ to another. You will encounter this style of diagram frequently over the course
 The response time can vary significantly from one request to the next, even if you keep making the
 same request over and over again. Many factors can add random delays: for example, a context switch
 to a background process, the loss of a network packet and TCP retransmission, a garbage collection
-pause, a page fault forcing a read from disk, mechanical vibrations in the server rack
+pause, a page fault forcing a read from disk, mechanical vibrations in the server rack [^17],
 [^17],
 or many other causes. We will discuss this topic in more detail in [“Timeouts and Unbounded Delays”](/en/ch9#sec_distributed_queueing).
 Queueing delays often account for a large part of the variability in response times. As a server
@@ -291,8 +285,7 @@ directly affect users’ experience of the service. For example, Amazon describe
 requirements for internal services in terms of the 99.9th percentile, even though it only affects 1
 in 1,000 requests. This is because the customers with the slowest requests are often those who have
 the most data on their accounts because they have made many purchases—that is, they’re the most
-valuable customers
+valuable customers [^19].
 [^19].
 It’s important to keep those customers happy by ensuring the website is fast for them.
 On the other hand, optimizing the 99.99th percentile (the slowest 1 in 10,000 requests) was deemed
@@ -302,23 +295,19 @@ control, and the benefits are diminishing.
 # The user impact of response times
-It seems intuitively obvious that a fast service is better for users than a slow service
+It seems intuitively obvious that a fast service is better for users than a slow service [^20].
 [^20].
 However, it is surprisingly difficult to get hold of reliable data to quantify the effect that
 latency has on user behavior.
 Some often-cited statistics are unreliable. In 2006 Google reported that a slowdown in search
-results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue
+results from 400 ms to 900 ms was associated with a 20% drop in traffic and revenue [^21].
 [^21].
 However, another Google study from 2009 reported that a 400 ms increase in latency resulted in
-only 0.6% fewer searches per day
+only 0.6% fewer searches per day [^22],
 [^22],
 and in the same year Bing found that a two-second increase in load time reduced ad revenue by 4.3%
 [^23].
 Newer data from these companies appears not to be publicly available.
-A more recent Akamai study
+A more recent Akamai study [^24]
 [^24]
 claims that a 100 ms increase in response time reduced the conversion rate of e-commerce sites
 by up to 7%; however, on closer inspection, the same study reveals that very *fast* page load times
 are also correlated with lower conversion rates! This seemingly paradoxical result is explained by
@@ -326,8 +315,7 @@ the fact that the pages that load fastest are often those that have no useful co
 error pages). However, since the study makes no effort to separate the effects of page content from
 the effects of load time, its results are probably not meaningful.
-A study by Yahoo
+A study by Yahoo [^25]
 [^25]
 compares click-through rates on fast-loading versus slow-loading search results, controlling for
 quality of search results. It finds 20–30% more clicks on fast searches when the difference between
 fast and slow responses is 1.25 seconds or more.
@@ -348,15 +336,13 @@ end-user requests end up being slow (an effect known as *tail latency amplificat
 ###### Figure 2-6. When several backend calls are needed to serve a request, it takes just a single slow backend request to slow down the entire end-user request.
 Percentiles are often used in *service level objectives* (SLOs) and *service level agreements*
-(SLAs) as ways of defining the expected performance and availability of a service
+(SLAs) as ways of defining the expected performance and availability of a service [^27].
 [^27].
 For example, an SLO may set a target for a service to have a median response time of less than
 200 ms and a 99th percentile under 1 s, and a target that at least 99.9% of valid requests
 result in non-error responses. An SLA is a contract that specifies what happens if the SLO is not
 met (for example, customers may be entitled to a refund). That is the basic idea, at least; in
 practice, defining good availability metrics for SLOs and SLAs is not straightforward
-[[28](/en/ch2#Mogul2019),
+[[^28], [^29]].
 [29](/en/ch2#Hauer2020)].
 # Computing percentiles
@@ -369,10 +355,8 @@ The simplest implementation is to keep a list of response times for all requests
 window and to sort that list every minute. If that is too inefficient for you, there are algorithms
 that can calculate a good approximation of percentiles at minimal CPU and memory cost.
 Open source percentile estimation libraries include HdrHistogram,
-t-digest [[30](/en/ch2#Dunning2021),
+t-digest [[^30], [^31]],
-[31](/en/ch2#Kohn2021)],
+OpenHistogram [^32], and DDSketch [^33].
 OpenHistogram [^32], and DDSketch
 [^33].
 Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from
 several machines, is mathematically meaningless—the right way of aggregating response time data
@@ -391,18 +375,16 @@ software, typical expectations include:
 If all those things together mean “working correctly,” then we can understand *reliability* as
 meaning, roughly, “continuing to work correctly, even when things go wrong.” To be more precise
 about things going wrong, we will distinguish between *faults* and *failures*
-[[35](/en/ch2#Heimerdinger1992),
+[[^35], [^36], [^37]]:
 [36](/en/ch2#Gaertner1999),
 [37](/en/ch2#Avizienis2004)]:
 Fault
-:   A fault is when a particular *part* of a system stops working correctly: for example, if a
+: A fault is when a particular *part* of a system stops working correctly: for example, if a
-    single hard drive malfunctions, or a single machine crashes, or an external service (that the
+ single hard drive malfunctions, or a single machine crashes, or an external service (that the
-    system depends on) has an outage.
+ system depends on) has an outage.
 Failure
-:   A failure is when the system *as a whole* stops providing the required service to the user; in
+: A failure is when the system *as a whole* stops providing the required service to the user; in
-    other words, when it does not meet the service level objective (SLO).
+ other words, when it does not meet the service level objective (SLO).
 The distinction between fault and failure can be confusing because they are the same thing, just at
 different levels. For example, if a hard drive stops working, we say that the hard drive has failed:
@@ -438,8 +420,7 @@ handling [^38]; by deliberately inducing faults, you ensure
 that the fault-tolerance machinery is continually exercised and tested, which can increase your
 confidence that faults will be handled correctly when they occur naturally. *Chaos engineering* is
 a discipline that aims to improve confidence in fault-tolerance mechanisms through experiments such
-as deliberately injecting faults
+as deliberately injecting faults [^39].
 [^39].
 Although we generally prefer tolerating faults over preventing faults, there are cases where
 prevention is better than cure (e.g., because no cure exists). This is the case with security
@@ -452,48 +433,34 @@ cured, as described in the following sections.
 When we think of causes of system failure, hardware faults quickly come to mind:
 * Approximately 2–5% of magnetic hard drives fail per year
-  [[40](/en/ch2#Pinheiro2007),
+ [[^40],
-  [41](/en/ch2#Schroeder2007)];
+ [^41]];
-  in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
+ in a storage cluster with 10,000 disks, we should therefore expect on average one disk failure per day.
-  Recent data suggests that disks are getting more reliable, but failure rates remain significant
+ Recent data suggests that disks are getting more reliable, but failure rates remain significant
-  [^42].
+ [^42].
 * Approximately 0.5–1% of solid state drives (SSDs) fail per year
-  [^43].
+ [^43].
-  Small numbers of bit errors are corrected automatically
+ Small numbers of bit errors are corrected automatically
-  [^44],
+ [^44],
-  but uncorrectable errors occur approximately once per year per drive, even in drives that are
+ but uncorrectable errors occur approximately once per year per drive, even in drives that are
-  fairly new (i.e., that have experienced little wear); this error rate is higher than that of
+ fairly new (i.e., that have experienced little wear); this error rate is higher than that of
-  magnetic hard drives
+ magnetic hard drives
-  [[45](/en/ch2#Schroeder2016_ch2),
+ [[^45],
-  [46](/en/ch2#Alter2019)].
+ [^46]].
 * Other hardware components such as power supplies, RAID controllers, and memory modules also fail,
-  although less frequently than hard drives
+ although less frequently than hard drives [^47] [^48].
  [[47](/en/ch2#Ford2010),
  [48](/en/ch2#Vishwanath2010)].
 * Approximately one in 1,000 machines has a CPU core that occasionally computes the wrong result,
-  likely due to manufacturing defects
+ likely due to manufacturing defects [^49] [^50] [^51]. In some cases, an erroneous computation leads to a crash, but in other cases it leads to a program simply returning the wrong result.
  [[49](/en/ch2#Hochschild2021),
  [50](/en/ch2#Dixit2021),
  [51](/en/ch2#Behrens2015)].
  In some cases, an erroneous computation leads to a crash, but in other cases it leads to a program
  simply returning the wrong result.
 * Data in RAM can also be corrupted, either due to random events such as cosmic rays, or due to
-  permanent physical defects. Even when memory with error-correcting codes (ECC) is used, more than
+ permanent physical defects. Even when memory with error-correcting codes (ECC) is used, more than
-  1% of machines encounter an uncorrectable error in a given year, which typically leads to a crash
+ 1% of machines encounter an uncorrectable error in a given year, which typically leads to a crash
-  of the machine and the affected memory module needing to be replaced
+ of the machine and the affected memory module needing to be replaced [^52].
-  [^52].
+ Moreover, certain pathological memory access patterns can flip bits with high probability [^53].
  Moreover, certain pathological memory access patterns can flip bits with high probability
  [^53].
 * An entire datacenter might become unavailable (for example, due to power outage or network
-  misconfiguration) or even be permanently destroyed (for example by fire, flood, or earthquake
+ misconfiguration) or even be permanently destroyed (for example by fire, flood, or earthquake [^54]).
-  [^54]).
+ A solar storm, which induces large electrical currents in long-distance wires when the sun ejects
-  A solar storm, which induces large electrical currents in long-distance wires when the sun ejects
+ a large mass of charged particles, could damage power grids and undersea network cables [^55].
-  a large mass of charged particles, could damage power grids and undersea network cables
+ Although such large-scale failures are rare, their impact can be catastrophic if a service cannot tolerate the loss of a datacenter [^56].
  [^55].
  Although such large-scale failures are rare, their impact can be catastrophic if a service cannot
  tolerate the loss of a datacenter
  [^56].
 These events are rare enough that you often don’t need to worry about them when working on a small
 system, as long as you can easily replace hardware that becomes faulty. However, in a large-scale
@@ -510,10 +477,7 @@ running uninterrupted for years.
 Redundancy is most effective when component faults are independent, that is, the occurrence of one
 fault does not change how likely it is that another fault will occur. However, experience has shown
-that there are often significant correlations between component failures
+that there are often significant correlations between component failures [^41] [^57] [^58];
 [[41](/en/ch2#Schroeder2007),
 [57](/en/ch2#Han2021),
 [58](/en/ch2#Nightingale2011)];
 unavailability of an entire server rack or an entire datacenter still happens more often than we
 would like.
@@ -543,40 +507,30 @@ upgrade*, and we will discuss it further in [Chapter 5](/en/ch5#ch_encoding).
 Although hardware failures can be weakly correlated, they are still mostly independent: for
 example, if one disk fails, it’s likely that other disks in the same machine will be fine for
 another while. On the other hand, software faults are often very highly correlated, because it is
-common for many nodes to run the same software and thus have the same bugs
+common for many nodes to run the same software and thus have the same bugs [^59] [^60].
 [[59](/en/ch2#Gunawi2014),
 [60](/en/ch2#Kreps2012_ch1)].
 Such faults are harder to anticipate, and they tend to cause many more system failures than
 uncorrelated hardware faults [^47]. For example:
 * A software bug that causes every node to fail at the same time in particular circumstances. For
-  example, on June 30, 2012, a leap second caused many Java applications to hang simultaneously due
+ example, on June 30, 2012, a leap second caused many Java applications to hang simultaneously due
-  to a bug in the Linux kernel, bringing down many Internet services
+ to a bug in the Linux kernel, bringing down many Internet services [^61].
-  [^61].
+ Due to a firmware bug, all SSDs of certain models suddenly fail after precisely 32,768 hours of
-  Due to a firmware bug, all SSDs of certain models suddenly fail after precisely 32,768 hours of
+ operation (less than 4 years), rendering the data on them unrecoverable [^62].
  operation (less than 4 years), rendering the data on them unrecoverable
  [^62].
 * A runaway process that uses up some shared, limited resource, such as CPU time, memory, disk
-  space, network bandwidth, or threads
+ space, network bandwidth, or threads [^63]. For example, a process that consumes too much memory while processing a large request may be
-  [^63].
+ killed by the operating system. A bug in a client library could cause a much higher request
-  For example, a process that consumes too much memory while processing a large request may be
+ volume than anticipated [^64].
  killed by the operating system. A bug in a client library could cause a much higher request
  volume than anticipated [^64].
 * A service that the system depends on slows down, becomes unresponsive, or starts returning
-  corrupted responses.
+ corrupted responses.
 * An interaction between different systems results in emergent behavior that does not occur when
-  each system was tested in isolation [^65].
+ each system was tested in isolation [^65].
 * Cascading failures, where a problem in one component causes another component to become overloaded
-  and slow down, which in turn brings down another component
+ and slow down, which in turn brings down another component [^66] [^67].
  [[66](/en/ch2#Ulrich2016),
  [67](/en/ch2#Fassbender2022)].
 The bugs that cause these kinds of software faults often lie dormant for a long time until they are
 triggered by an unusual set of circumstances. In those circumstances, it is revealed that the
 software is making some kind of assumption about its environment—and while that assumption is
-usually true, it eventually stops being true for some reason
+usually true, it eventually stops being true for some reason [^68] [^69].
 [[68](/en/ch2#Cook2000),
 [69](/en/ch2#Woods2017)].
 There is no quick solution to the problem of systematic faults in software. Lots of small things can
 help: carefully thinking about assumptions and interactions in the system; thorough testing; process
@ -590,8 +544,7 @@ human. Unlike machines, humans don’t just follow rules; their strength is bein
 adaptive in getting their job done. However, this characteristic also leads to unpredictability, and
 sometimes mistakes that can lead to failures, despite best intentions. For example, one study of
 large internet services found that configuration changes by operators were the leading cause of
-outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages
+outages, whereas hardware faults (servers or network) played a role in only 10–25% of outages [^70].
 [^70].
 It is tempting to label such problems as “human error” and to wish that they could be solved by
 better controlling human behavior through tighter procedures and compliance with rules. However,
@ -602,8 +555,7 @@ Often complex systems have emergent behavior, in which unexpected interactions b
 may also lead to failures [^72].
 Various technical measures can help minimize the impact of human mistakes, including thorough
-testing (both hand-written tests and *property testing* on lots of random inputs)
+testing (both hand-written tests and *property testing* on lots of random inputs) [^38], rollback mechanisms for quickly
 [^38], rollback mechanisms for quickly
 reverting configuration changes, gradual roll-outs of new code, detailed and clear monitoring,
 observability tools for diagnosing production issues (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)),
 and well-designed interfaces that encourage “the right thing” and discourage “the wrong thing”.
@ -627,8 +579,7 @@ As a general principle, when investigating an incident, you should be suspicious
 answers. “Bob should have been more careful when deploying that change” is not productive, but
 neither is “We must rewrite the backend in Haskell.” Instead, management should take the opportunity
 to learn the details of how the sociotechnical system works from the point of view of the people who
-work with it every day, and take steps to improve it based on this feedback
+work with it every day, and take steps to improve it based on this feedback [^71].
 [^71].
 # How Important Is Reliability?
@ -637,11 +588,9 @@ are also expected to work reliably. Bugs in business applications cause lost pro
 risks if figures are reported incorrectly), and outages of e-commerce sites can have huge costs in
 terms of lost revenue and damage to reputation.
-In many applications, a temporary outage of a few minutes or even a few hours is tolerable
+In many applications, a temporary outage of a few minutes or even a few hours is tolerable [^74],
 [^74],
 but permanent data loss or corruption would be catastrophic. Consider a parent who stores all their
-pictures and videos of their children in your photo application
+pictures and videos of their children in your photo application [^75]. How would they
 [^75]. How would they
 feel if that database was suddenly corrupted? Would they know how to restore it from a backup?
 As another example of how unreliable software can harm people, consider the Post Office Horizon
@ -651,8 +600,7 @@ Eventually it became clear that many of these shortfalls were due to bugs in the
 convictions have since been overturned [^76].
 What led to this, probably the largest miscarriage of justice in British history, is the fact that
 English law assumes that computers operate correctly (and hence, evidence produced by computers is
-reliable) unless there is evidence to the contrary
+reliable) unless there is evidence to the contrary [^77].
 [^77].
 Software engineers may laugh at the idea that software could ever be bug-free, but this is little
 solace to the people who were wrongfully imprisoned, declared bankrupt, or even committed suicide as
 a result of a wrongful conviction due to an unreliable computer system.
@ -714,9 +662,9 @@ Once you have described the load on your system, you can investigate what happen
 increases. You can look at it in two ways:
 * When you increase the load in a certain way and keep the system resources (CPUs, memory, network
-  bandwidth, etc.) unchanged, how is the performance of your system affected?
+ bandwidth, etc.) unchanged, how is the performance of your system affected?
 * When you increase the load in a certain way, how much do you need to increase the resources if you
-  want to keep performance unchanged?
+ want to keep performance unchanged?
 Usually our goal is to keep the performance of the system within the requirements of the SLA
 (see [“Use of Response Time Metrics”](/en/ch2#sec_introduction_slo_sla)) while also minimizing the cost of running the system. The greater
@ -728,8 +676,7 @@ If you can double the resources in order to handle twice the load, while keeping
 same, we say that you have *linear scalability*, and this is considered a good thing. Occasionally
 it is possible to handle twice the load with less than double the resources, due to economies of
 scale or a better distribution of peak load
-[[79](/en/ch2#Warfield2023_ch2),
+[[^79], [^80]].
 [80](/en/ch2#Brooker2023multitenancy)].
 Much more likely is that the cost grows faster than linearly, and there may be many reasons for the
 inefficiency. For example, if you have a lot of data, then processing a single write request may
 involve more work than if you have a small amount of data, even if the size of the request is the
@ -753,8 +700,7 @@ Another approach is the *shared-disk architecture*, which uses several machines
 CPUs and RAM, but which stores data on an array of disks that is shared between the machines, which
 are connected via a fast network: *Network-Attached Storage* (NAS) or *Storage Area Network* (SAN).
 This architecture has traditionally been used for on-premises data warehousing workloads, but
-contention and the overhead of locking limit the scalability of the shared-disk approach
+contention and the overhead of locking limit the scalability of the shared-disk approach [^81].
 [^81].
 By contrast, the *shared-nothing architecture*
 [^82]
@ -796,8 +742,7 @@ operate largely independently from each other. This is the underlying principle
 (see [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)), sharding ([Chapter 7](/en/ch7#ch_sharding)), stream processing
 ([Link to Come]), and shared-nothing architectures. However, the challenge is in knowing where to
 draw the line between things that should be together, and things that should be apart. Design
-guidelines for microservices can be found in other books
+guidelines for microservices can be found in other books [^84],
 [^84],
 and we discuss sharding of shared-nothing systems in [Chapter 7](/en/ch7#ch_sharding).
 Another good principle is not to make things more complicated than necessary. If a single-machine
@ -817,8 +762,7 @@ bugs that need fixing.
 It is widely recognized that the majority of the cost of software is not in its initial development,
 but in its ongoing maintenance—fixing bugs, keeping its systems operational, investigating failures,
 adapting it to new platforms, modifying it for new use cases, repaying technical debt, and adding
-new features [[85](/en/ch2#Ensmenger2016),
+new features [[^85], [^86]].
 [86](/en/ch2#Glass2002)].
 However, maintenance is also difficult. If a system has been successfully running for a long time,
 it may well use outdated technologies that not many engineers understand today (such as mainframes
@ -835,15 +779,15 @@ which decisions might create maintenance headaches in the future, in this book w
 to several principles that are widely applicable:
 Operability
-:   Make it easy for the organization to keep the system running smoothly.
+: Make it easy for the organization to keep the system running smoothly.
 Simplicity
-:   Make it easy for new engineers to understand the system, by implementing it using well-understood,
+: Make it easy for new engineers to understand the system, by implementing it using well-understood,
-    consistent patterns and structures, and avoiding unnecessary complexity.
+ consistent patterns and structures, and avoiding unnecessary complexity.
 Evolvability
-:   Make it easy for engineers to make changes to the system in the future, adapting it and extending
+: Make it easy for engineers to make changes to the system in the future, adapting it and extending
-    it for unanticipated use cases as requirements change.
+ it for unanticipated use cases as requirements change.
 ## Operability: Making Life Easy for Operations
@ -857,8 +801,7 @@ In large-scale systems consisting of many thousands of machines, manual maintena
 unreasonably expensive, and automation is essential. However, automation can be a two-edged sword:
 there will always be edge cases (such as rare failure scenarios) that require manual intervention
 from the operations team. Since the cases that cannot be handled automatically are the most complex
-issues, greater automation requires a *more* skilled operations team that can resolve those issues
+issues, greater automation requires a *more* skilled operations team that can resolve those issues [^88].
 [^88].
 Moreover, if an automated system goes wrong, it is often harder to troubleshoot than a system that
 relies on an operator to perform some actions manually. For that reason, it is not the case that
@ -866,15 +809,14 @@ more automation is always better for operability. However, some amount of automa
 and the sweet spot will depend on the specifics of your particular application and organization.
 Good operability means making routine tasks easy, allowing the operations team to focus their efforts
-on high-value activities. Data systems can do various things to make routine tasks easy, including
+on high-value activities. Data systems can do various things to make routine tasks easy, including [^89]:
 [^89]:
 * Allowing monitoring tools to check the system’s key metrics, and supporting observability tools
-  (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)) to give insights into the system’s runtime behavior.
+ (see [“Problems with Distributed Systems”](/en/ch1#sec_introduction_dist_sys_problems)) to give insights into the system’s runtime behavior.
-  A variety of commercial and open source tools can help here
+ A variety of commercial and open source tools can help here
-  [^90].
+ [^90].
 * Avoiding dependency on individual machines (allowing machines to be taken down for maintenance
-  while the system as a whole continues running uninterrupted)
+ while the system as a whole continues running uninterrupted)
 * Providing good documentation and an easy-to-understand operational model (“If I do X, Y will happen”)
 * Providing good default behavior, but also giving administrators the freedom to override defaults when needed
 * Self-healing where appropriate, but also giving administrators manual control over the system state when needed
@ -891,15 +833,13 @@ project mired in complexity is sometimes described as a *big ball of mud*
 When complexity makes maintenance hard, budgets and schedules are often overrun. In complex
 software, there is also a greater risk of introducing bugs when making a change: when the system is
 harder for developers to understand and reason about, hidden assumptions, unintended consequences,
-and unexpected interactions are more easily overlooked
+and unexpected interactions are more easily overlooked [^69].
 [^69].
 Conversely, reducing complexity greatly improves the maintainability of software, and thus
 simplicity should be a key goal for the systems we build.
 Simple systems are easier to understand, and therefore we should try to solve a given problem in the
 simplest way possible. Unfortunately, this is easier said than done. Whether something is simple or
-not is often a subjective matter of taste, as there is no objective standard of simplicity
+not is often a subjective matter of taste, as there is no objective standard of simplicity [^92].
 [^92].
 For example, one system may hide a complex implementation behind a simple interface, whereas another
 may have a simple implementation that exposes more internal detail to its users—which one is
 simpler?
@ -952,13 +892,12 @@ different word to refer to agility on a data system level: *evolvability*
 [^97].
 One major factor that makes change difficult in large systems is when some action is irreversible,
-and therefore that action needs to be taken very carefully
+and therefore that action needs to be taken very carefully [^98].
 [^98].
 For example, say you are migrating from one database to another: if you cannot switch back to the
 old system in case of problems with the new one, the stakes are much higher than if you can easily go
 back. Minimizing irreversibility improves flexibility.
-# Summary
+## Summary
 In this chapter we examined several examples of nonfunctional requirements: performance,
 reliability, scalability, and maintainability. Through these topics we have also encountered
@ -986,8 +925,7 @@ There are no easy answers on how to achieve these things, but one thing that can
 applications using well-understood building blocks that provide useful abstractions. The rest of
 this book will cover a selection of building blocks that have proved to be valuable in practice.
-##### References
+### References
 [^1]: Mike Cvet. [How We Learned to Stop Worrying and Love Fan-In at Twitter](https://www.youtube.com/watch?v=WEgCjwyXvwc). At *QCon San Francisco*, December 2016. 
 [^2]: Raffi Krikorian. [Timelines at Scale](https://www.infoq.com/presentations/Twitter-Timeline-Scalability/). At *QCon San Francisco*, November 2012. Archived at [perma.cc/V9G5-KLYK](https://perma.cc/V9G5-KLYK) 
--- a/content/en/ch3.md
+++ b/content/en/ch3.md
--- a/content/en/ch4.md
+++ b/content/en/ch4.md
@ -45,11 +45,11 @@ Consider the world’s simplest database, implemented as two Bash functions:
 #!/bin/bash
 db_set () {
-    echo "$1,$2" >> database
+ echo "$1,$2" >> database
 }
 db_get () {
-    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
+ grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
 }
 ```
@ -123,8 +123,7 @@ possible write operation. Any kind of index usually slows down writes, because t
 to be updated every time data is written.
 This is an important trade-off in storage systems: well-chosen indexes speed up read queries, but
-every index consumes additional disk space and slows down writes, sometimes substantially
+every index consumes additional disk space and slows down writes, sometimes substantially [^1].
 [^1].
 For this reason, databases don’t usually index everything by default, but require you—the person
 writing the application or administering the database—to choose indexes manually, using your
 knowledge of the application’s typical query patterns. You can then choose the indexes that give
@ -149,16 +148,16 @@ is already in the filesystem cache, a read doesn’t require any disk I/O at all
 This approach is much faster, but it still suffers from several problems:
 * You never free up disk space occupied by old log entries that have been overwritten; if you keep
-  writing to the database you might run out of disk space.
+ writing to the database you might run out of disk space.
 * The hash map is not persisted, so you have to rebuild it when you restart the database—for
-  example, by scanning the whole log file to find the latest byte offset for each key. This makes
+ example, by scanning the whole log file to find the latest byte offset for each key. This makes
-  restarts slow if you have a lot of data.
+ restarts slow if you have a lot of data.
 * The hash table must fit in memory. In principle, you could maintain a hash table on disk, but
-  unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of
+ unfortunately it is difficult to make an on-disk hash map perform well. It requires a lot of
-  random access I/O, it is expensive to grow when it becomes full, and hash collisions require
+ random access I/O, it is expensive to grow when it becomes full, and hash collisions require
-  fiddly logic [^2].
+ fiddly logic [^2].
 * Range queries are not efficient. For example, you cannot easily scan over all keys between `10000`
-  and `19999`—you’d have to look up each key individually in the hash map.
+ and `19999`—you’d have to look up each key individually in the hash map.
 ### The SSTable file format
@ -177,8 +176,7 @@ Now you do not need to keep all the keys in memory: you can group the key-value
 SSTable into *blocks* of a few kilobytes, and then store the first key of each block in the index.
 This kind of index, which stores only some of the keys, is called *sparse*. This index is stored in
 a separate part of the SSTable, for example using an immutable B-tree, a trie, or another data
-structure that allows queries to quickly look up a particular key
+structure that allows queries to quickly look up a particular key [^4].
 [^4].
 For example, in [Figure 4-2](/en/ch4#fig_storage_sstable_index), the first key of one block is `handbag`, and the
 first key of the next block is `handsome`. Now say you’re looking for the key `handiwork`, which
@ -202,25 +200,24 @@ We can solve this problem with a *log-structured* approach, which is a hybrid be
 log and a sorted file:
 1. When a write comes in, add it to an in-memory ordered map data structure, such as a red-black
-   tree, skip list [^5], or trie
+ tree, skip list [^5], or trie
-   [^6].
+ [^6].
-   With these data structures, you can insert keys in any order, look them up efficiently, and read
+ With these data structures, you can insert keys in any order, look them up efficiently, and read
-   them back in sorted order. This in-memory data structure is called the *memtable*.
+ them back in sorted order. This in-memory data structure is called the *memtable*.
 2. When the memtable gets bigger than some threshold—typically a few megabytes—write it out to
-   disk in sorted order as an SSTable file. We call this new SSTable file the most recent *segment*
+ disk in sorted order as an SSTable file. We call this new SSTable file the most recent *segment*
-   of the database, and it is stored as a separate file alongside the older segments. Each segment
+ of the database, and it is stored as a separate file alongside the older segments. Each segment
-   has a separate index of its contents. While the new segment is being written out to disk, the
+ has a separate index of its contents. While the new segment is being written out to disk, the
-   database can continue writing to a new memtable instance, and the old memtable’s memory is freed
+ database can continue writing to a new memtable instance, and the old memtable’s memory is freed
-   when the writing of the SSTable is complete.
+ when the writing of the SSTable is complete.
 3. In order to read the value for some key, first try to find the key in the memtable and the most
-   recent on-disk segment. If it’s not there, look in the next-older segment, etc. until you either
+ recent on-disk segment. If it’s not there, look in the next-older segment, etc. until you either
-   find the key or reach the oldest segment. If the key does not appear in any of the segments, it
+ find the key or reach the oldest segment. If the key does not appear in any of the segments, it
-   does not exist in the database.
+ does not exist in the database.
 4. From time to time, run a merging and compaction process in the background to combine segment files
-   and to discard overwritten or deleted values.
+ and to discard overwritten or deleted values.
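
Steps 1 to 3 of the list above can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any production engine is implemented: the class name `TinyLSM` and the `threshold` parameter are hypothetical, a plain `dict` stands in for the ordered in-memory map, and segments are kept as in-memory sorted lists rather than files. Step 4 (merging and compaction) is left out.

```python
class TinyLSM:
    """Toy log-structured store: one memtable plus a list of sorted segments."""

    def __init__(self, threshold=4):
        self.threshold = threshold  # hypothetical: number of keys that triggers a flush
        self.memtable = {}          # stand-in for a red-black tree or skip list
        self.segments = []          # sorted (key, value) lists, oldest first

    def put(self, key, value):
        # Step 1: every write goes to the in-memory structure first.
        self.memtable[key] = value
        # Step 2: once the memtable reaches the threshold, write it out sorted.
        if len(self.memtable) >= self.threshold:
            self.flush()

    def flush(self):
        if self.memtable:
            self.segments.append(sorted(self.memtable.items()))
            self.memtable = {}

    def get(self, key):
        # Step 3: check the memtable, then the segments from newest to oldest.
        if key in self.memtable:
            return self.memtable[key]
        for segment in reversed(self.segments):
            for k, v in segment:  # a real engine would consult the segment's index
                if k == key:
                    return v
        return None  # the key is in no segment, so it does not exist
```

Because reads visit the newest data first, a later write to the same key shadows older values without any segment being modified in place.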
-Merging segments works similarly to the *mergesort* algorithm
+Merging segments works similarly to the *mergesort* algorithm [^5]. The process is illustrated in
 [^5]. The process is illustrated in
 [Figure 4-3](/en/ch4#fig_storage_sstable_merging): start reading the input files side by side, look at the first key
 in each file, copy the lowest key (according to the sort order) to the output file, and repeat. If
 the same key appears in more than one input file, keep only the more recent value. This produces a
@ -242,18 +239,14 @@ called a *tombstone* to the data file. When log segments are merged, the tombsto
 process to discard any previous values for the deleted key. Once the tombstone is merged into the
 oldest segment, it can be dropped.
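
The merge step and the treatment of tombstones can be sketched as follows. This is a minimal illustration, assuming segments are in-memory lists of `(key, value)` pairs sorted by key and passed oldest first; `merge_segments`, `TOMBSTONE`, and `drop_tombstones` are hypothetical names, not the API of any particular storage engine.

```python
import heapq

TOMBSTONE = object()  # hypothetical marker standing in for a deletion record


def merge_segments(segments, drop_tombstones=False):
    """Merge sorted segments (oldest first) into a single sorted segment.

    Each key may appear at most once per segment. Set drop_tombstones=True
    only when the output becomes the oldest segment.
    """
    newest = len(segments) - 1
    # Tag every entry with a rank (0 = newest segment) so that, among equal
    # keys, the most recent value is seen first during the merge.
    tagged = [
        [(key, newest - index, value) for key, value in segment]
        for index, segment in enumerate(segments)
    ]
    merged = []
    last_key = object()  # sentinel that compares unequal to every real key
    for key, _, value in heapq.merge(*tagged, key=lambda e: (e[0], e[1])):
        if key == last_key:
            continue  # older value for a key that was already handled
        last_key = key
        if value is TOMBSTONE and drop_tombstones:
            continue  # nothing older remains, so the marker can go
        merged.append((key, value))
    return merged
```

When the output is not the oldest segment, the tombstone has to be kept, since an even older segment may still hold a value for the deleted key.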
-The algorithm described here is essentially what is used in RocksDB
+The algorithm described here is essentially what is used in RocksDB [^7],
-[^7],
+Cassandra, Scylla, and HBase [^8],
-Cassandra, Scylla, and HBase
+all of which were inspired by Google’s Bigtable paper [^9]
 [^8],
 all of which were inspired by Google’s Bigtable paper
 [^9]
 (which introduced the terms *SSTable* and *memtable*).
 The algorithm was originally published in 1996 under the name *Log-Structured Merge-Tree* or *LSM-Tree*
 [^10],
-building on earlier work on log-structured filesystems
+building on earlier work on log-structured filesystems [^11].
 [^11].
 For this reason, storage engines that are based on the principle of merging and compacting sorted
 files are often called *LSM storage engines*.
@ -265,8 +258,7 @@ requests to using the new merged segment instead of the old segments, and then t
 can be deleted.
 The segment files don’t necessarily have to be stored on local disk: they are also well suited for
-writing to object storage. SlateDB and Delta Lake
+writing to object storage. SlateDB and Delta Lake [^12]
 [^12].
 take this approach, for example.
 Having immutable segment files also simplifies crash recovery: if a crash happens while writing out
@ -287,8 +279,7 @@ appears in a particular SSTable.
 [Figure 4-4](/en/ch4#fig_storage_bloom) shows an example of a Bloom filter containing two keys and 16 bits (in
 reality, it would contain more keys and more bits). For every key in the SSTable we compute a hash
-function, producing a set of numbers that are then interpreted as indexes into the array of bits
+function, producing a set of numbers that are then interpreted as indexes into the array of bits [^14].
 [^14].
 We set the bits corresponding to those indexes to 1, and leave the rest as 0. For example, the key
 `handbag` hashes to the numbers (2, 9, 4), so we set the 2nd, 9th, and 4th bits to 1. The bitmap
 is then stored as part of the SSTable, along with the sparse index of keys. This takes a bit of
@ -311,8 +302,7 @@ as if a key is present, even though it isn’t, is called a *false positive*.
 The probability of false positives depends on the number of keys, the number of bits set per key,
 and the total number of bits in the Bloom filter. You can use an online calculator tool to work out
-the right parameters for your application
+the right parameters for your application [^15].
 [^15].
 As a rule of thumb, you need to allocate 10 bits of Bloom filter space for every key in the SSTable
 to get a false positive probability of 1%, and the probability is reduced tenfold for every 5
 additional bits you allocate per key.
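
A Bloom filter along these lines can be sketched in Python. This is an illustrative toy, not the filter used by any specific engine: the class and method names are hypothetical, and deriving several bit positions by salting SHA-256 is just one convenient choice (production filters typically use faster non-cryptographic hashes). The bit positions it produces will not match the ones shown in the figure.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=16, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = [0] * num_bits

    def _indexes(self, key):
        # Derive num_hashes bit positions for the key by salting one hash.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for index in self._indexes(key):
            self.bits[index] = 1

    def might_contain(self, key):
        # False: the key is definitely absent.
        # True: the key is present, or this is a false positive.
        return all(self.bits[index] for index in self._indexes(key))
```

A key that was added always has all of its bits set, so the filter never reports a false negative; only the "present" answer needs to be double-checked against the SSTable.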
@ -320,30 +310,29 @@ additional bits you allocate per key.
 In the context of LSM storage engines, false positives are no problem:
 * If the Bloom filter says that a key *is not* present, we can safely skip that SSTable, since we
-  can be sure that it doesn’t contain the key.
+ can be sure that it doesn’t contain the key.
 * If the Bloom filter says the key *is* present, we have to consult the sparse index and decode the
-  block of key-value pairs to check whether the key really is there. If it was a false positive, we
+ block of key-value pairs to check whether the key really is there. If it was a false positive, we
-  have done a bit of unnecessary work, but otherwise no harm is done—we just continue the search
+ have done a bit of unnecessary work, but otherwise no harm is done—we just continue the search
-  with the next-oldest segment.
+ with the next-oldest segment.
 ### Compaction strategies
 An important detail is how the LSM storage engine chooses when to perform compaction, and which SSTables to
 include in a compaction. Many LSM-based storage systems allow you to configure which compaction
 strategy to use, and some of the common choices are
-[[16](/en/ch4#Luo2019),
+[[^16], [^17]]:
 [17](/en/ch4#Sarkar2022)]:
 Size-tiered compaction
-:   Newer and smaller SSTables are successively merged into older and larger SSTables. The SSTables
+: Newer and smaller SSTables are successively merged into older and larger SSTables. The SSTables
-    containing older data can get very large, and merging them requires a lot of temporary disk space.
+ containing older data can get very large, and merging them requires a lot of temporary disk space.
-    The advantage of this strategy is that it can handle very high write throughput.
+ The advantage of this strategy is that it can handle very high write throughput.
 Leveled compaction
-:   The key range is split up into smaller SSTables and older data is moved into separate “levels,”
+: The key range is split up into smaller SSTables and older data is moved into separate “levels,”
-    which allows the compaction to proceed more incrementally and use less disk space than the
+ which allows the compaction to proceed more incrementally and use less disk space than the
-    size-tiered strategy. This strategy is more efficient for reads than size-tiered compaction
+ size-tiered strategy. This strategy is more efficient for reads than size-tiered compaction
-    because the storage engine needs to read fewer SSTables to check whether they contain the key.
+ because the storage engine needs to read fewer SSTables to check whether they contain the key.
 As a rule of thumb, size-tiered compaction performs better if you have mostly writes and few reads,
 whereas leveled compaction performs better if your workload is dominated by reads. If you write a
@ -360,16 +349,14 @@ Many databases run as a service that accepts queries over a network, but there a
 databases that don’t expose a network API. Instead, they are libraries that run in the same process
 as your application code, typically reading and writing files on the local disk, and you interact
 with them through normal function calls. Examples of embedded storage engines include RocksDB,
-SQLite, LMDB, DuckDB, and KùzuDB
+SQLite, LMDB, DuckDB, and KùzuDB [^19].
 [^19].
 Embedded databases are very commonly used in mobile apps to store the local user’s data. On the
 backend, they can be an appropriate choice if the data is small enough to fit on a single machine,
 and if there are not many concurrent transactions. For example, in a multitenant system in which
 each tenant is small enough and completely separate from others (i.e., you do not need to run
 queries that combine data from multiple tenants), you can potentially use a separate embedded
-database instance per tenant
+database instance per tenant [^20].
 [^20].
 The storage and retrieval methods we discuss in this chapter are used in both embedded and in
 client-server databases. In [Chapter 6](/en/ch6#ch_replication) and [Chapter 7](/en/ch7#ch_sharding) we will discuss techniques
@ -381,8 +368,7 @@ The log-structured approach is popular, but it is not the only form of key-value
 widely used structure for reading and writing database records by key is the *B-tree*.
 Introduced in 1970 [^21]
-and called “ubiquitous” less than 10 years later
+and called “ubiquitous” less than 10 years later [^22],
 [^22],
 B-trees have stood the test of time very well. They remain the standard index implementation in
 almost all relational databases, and many nonrelational databases use them too.
@ -441,8 +427,7 @@ the new key), and a page for 337–344. We also have to update the parent page t
 both children, with a boundary value of 337 between them. If the parent page doesn’t have enough
 space for the new reference, it may also need to be split, and the splits can continue all the way
 to the root of the tree. When the root is split, we make a new root above it. Deleting keys (which
-may require nodes to be merged) is more complex
+may require nodes to be merged) is more complex [^5].
 [^5].
 This algorithm ensures that the tree remains *balanced*: a B-tree with *n* keys always has a depth
 of *O*(log *n*). Most databases can fit into a B-tree that is three or four levels deep, so
@ -467,8 +452,7 @@ In order to make the database resilient to crashes, it is common for B-tree impl
 include an additional data structure on disk: a *write-ahead log* (WAL). This is an append-only file
 to which every B-tree modification must be written before it can be applied to the pages of the tree
 itself. When the database comes back up after a crash, this log is used to restore the B-tree back
-to a consistent state [[2](/en/ch4#Graefe2011),
+to a consistent state [[^2], [^24]].
 [24](/en/ch4#Mohan1992)].
 In filesystems, the equivalent mechanism is known as *journaling*.
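
The write-ahead discipline can be sketched as follows. This is a simplified illustration, not a B-tree implementation: a Python `dict` stands in for the tree's pages, the log holds one JSON record per line, and `WalStore` and its methods are hypothetical names. The point is only the ordering: the change reaches durable storage in the log before the main structure is modified, and the log is replayed on startup.

```python
import json
import os


class WalStore:
    def __init__(self, path):
        self.path = path
        self.data = {}  # stand-in for the B-tree pages
        self._recover()
        self.log = open(path, "a", encoding="utf-8")

    def _recover(self):
        # Replay the log to rebuild a consistent state after a restart.
        if not os.path.exists(self.path):
            return
        with open(self.path, encoding="utf-8") as f:
            for line in f:
                try:
                    key, value = json.loads(line)
                except ValueError:
                    break  # torn final record from a crash during the append
                self.data[key] = value

    def put(self, key, value):
        # 1. Append the modification to the log and force it to disk.
        self.log.write(json.dumps([key, value]) + "\n")
        self.log.flush()
        os.fsync(self.log.fileno())
        # 2. Only then apply it to the main data structure.
        self.data[key] = value

    def close(self):
        self.log.close()
```

Reopening the same path after a crash rebuilds `data` from the log, so every write that was acknowledged by `put` survives.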
 To improve performance, B-tree implementations typically don’t immediately write every modified page
@ -483,26 +467,25 @@ As B-trees have been around for so long, many variants have been developed over
 mention just a few:
 * Instead of overwriting pages and maintaining a WAL for crash recovery, some databases (like LMDB)
-  use a copy-on-write scheme [^26].
+ use a copy-on-write scheme [^26].
-  A modified page is written to a different location, and a new version of the parent pages in the tree
+ A modified page is written to a different location, and a new version of the parent pages in the tree
-  is created, pointing at the new location. This approach is also useful for concurrency control, as we shall
+ is created, pointing at the new location. This approach is also useful for concurrency control, as we shall
-  see in [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation).
+ see in [“Snapshot Isolation and Repeatable Read”](/en/ch8#sec_transactions_snapshot_isolation).
 * We can save space in pages by not storing the entire key, but abbreviating it. Especially in pages
-  on the interior of the tree, keys only need to provide enough information to act as boundaries
+ on the interior of the tree, keys only need to provide enough information to act as boundaries
-  between key ranges. Packing more keys into a page allows the tree to have a higher branching
+ between key ranges. Packing more keys into a page allows the tree to have a higher branching
-  factor, and thus fewer levels.
+ factor, and thus fewer levels.
 * To speed up scans over the key range in sorted order, some B-tree implementations try to lay out
-  the tree so that leaf pages appear in sequential order on disk, reducing the number of disk seeks.
+ the tree so that leaf pages appear in sequential order on disk, reducing the number of disk seeks.
-  However, it’s difficult to maintain that order as the tree grows.
+ However, it’s difficult to maintain that order as the tree grows.
 * Additional pointers have been added to the tree. For example, each leaf page may have references to
-  its sibling pages to the left and right, which allows scanning keys in order without jumping back
+ its sibling pages to the left and right, which allows scanning keys in order without jumping back
-  to parent pages.
+ to parent pages.
 ## Comparing B-Trees and LSM-Trees
 As a rule of thumb, LSM-trees are better suited for write-heavy applications, whereas B-trees are faster for reads
-[[27](/en/ch4#Athanassoulis2016),
+[[^27], [^28]].
-[28](/en/ch4#Stopford2015)].
 However, benchmarks are often sensitive to details of the workload. You need to test systems with
 your particular workload in order to make a valid comparison. Moreover, it’s not a strict either/or
 choice between LSM and B-trees: storage engines sometimes blend characteristics of both approaches,
@ -522,21 +505,18 @@ Range queries are simple and fast on B-trees, as they can use the sorted structu
 LSM storage, range queries can also take advantage of the SSTable sorting, but they need to scan all
 the segments in parallel and combine the results. Bloom filters don’t help for range queries (since
 you would need to compute the hash of every possible key within the range, which is impractical),
-making range queries more expensive than point queries in the LSM approach
+making range queries more expensive than point queries in the LSM approach [^29].
-[^29].
 High write throughput can cause latency spikes in a log-structured storage engine if the
 memtable fills up. This happens if data can’t be written out to disk fast enough, perhaps because
 the compaction process cannot keep up with incoming writes. Many storage engines, including RocksDB,
 perform *backpressure* in this situation: they suspend all reads and writes until the memtable has
 been written out to disk
-[[30](/en/ch4#Balmau2019),
+[[^30], [^31]].
-[31](/en/ch4#RocksDBTuning)].
 Regarding read throughput, modern SSDs (and especially NVMe) can perform many independent read
 requests in parallel. Both LSM-trees and B-trees are able to provide high read throughput, but
-storage engines need to be carefully designed to take advantage of this parallelism
+storage engines need to be carefully designed to take advantage of this parallelism [^32].
-[^32].
 ### Sequential vs. random writes
@ -568,17 +548,14 @@ The reason is that flash memory can be read or written one page (typically 4 Ki
 but it can only be erased one block (typically 512 KiB) at a time. Some of the pages in a block
 may contain valid data, whereas others may contain data that is no longer needed. Before erasing a
 block, the controller must first move pages containing valid data into other blocks; this process is
-called *garbage collection* (GC)
+called *garbage collection* (GC) [^33].
-[^33].
 A sequential write workload writes larger chunks of data at a time, so it is likely that a whole
 512 KiB block belongs to a single file; when that file is later deleted again, the whole block
 can be erased without having to perform any GC. On the other hand, with a random write workload, it
 is more likely that a block contains a mixture of pages with valid and invalid data, so the GC has
 to perform more work before a block can be erased
-[[34](/en/ch4#Vanlightly2023nvme),
+[[^34], [^35], [^36]].
-[35](/en/ch4#Alibaba2019_ch4),
-[36](/en/ch4#Hu2010)].
 The write bandwidth consumed by GC is then not available for the application. Moreover, the
 additional writes performed by GC contribute to wear on the flash memory; therefore, random writes
@ -591,14 +568,12 @@ operations on the underlying disk. With LSM-trees, a value is first written to t
 durability, then again when the memtable is written to disk, and again every time the key-value pair
 is part of a compaction. (If the values are significantly larger than the keys, this overhead can be
 reduced by storing values separately from keys, and performing compaction only on SSTables
-containing keys and references to values
+containing keys and references to values [^37].)
-[^37].)
 A B-tree index must write every piece of data at least twice: once to the write-ahead log, and once
 to the tree page itself. In addition, they sometimes need to write out an entire page, even if only
 a few bytes in that page changed, to ensure the B-tree can be correctly recovered after a crash or
-power failure [[38](/en/ch4#Zaitsev2006),
+power failure [[^38], [^39]].
-[39](/en/ch4#Vondra2016)].
 If you take the total number of bytes written to disk in some workload, and divide by the number of
 bytes you would have to write if you simply wrote an append-only log with no index, you get the
@ -610,8 +585,7 @@ handle within the available disk bandwidth.
 Write amplification is a problem in both LSM-trees and B-trees. Which one is better depends on
 various factors, such as the length of your keys and values, and how often you overwrite existing
 keys versus insert new ones. For typical workloads, LSM-trees tend to have lower write amplification
-because they don’t have to write entire pages and they can compress chunks of the SSTable
+because they don’t have to write entire pages and they can compress chunks of the SSTable [^40].
-[^40].
 This is another factor that makes LSM storage engines well suited for write-heavy workloads.
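The ratio described above can be made concrete with a small calculation. This is only a sketch: the function name, the number of compaction passes, and the disk bandwidth are invented for illustration and are not measurements of any real storage engine.

```python
def write_amplification(bytes_written_to_disk: int, bytes_in_plain_log: int) -> float:
    """Bytes the engine writes, divided by what an index-free append-only log would write."""
    if bytes_in_plain_log <= 0:
        raise ValueError("bytes_in_plain_log must be positive")
    return bytes_written_to_disk / bytes_in_plain_log

GIB = 1024 ** 3
app_bytes = 10 * GIB  # hypothetical: the application writes 10 GiB of key-value data

# Hypothetical LSM engine: each byte is written once to the log, once when the
# memtable is flushed, and once more in each of three compaction passes.
lsm_bytes = app_bytes * (1 + 1 + 3)
lsm_wa = write_amplification(lsm_bytes, app_bytes)

# With a fixed disk write bandwidth (assumed here to be 500 MB/s), higher write
# amplification leaves less bandwidth for the application's own writes.
disk_bandwidth_mb_per_s = 500
usable_mb_per_s = disk_bandwidth_mb_per_s / lsm_wa
```

In this made-up workload the engine writes five bytes to disk for every byte the application writes, so only a fifth of the disk's write bandwidth is available to the application.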
 Besides affecting throughput, write amplification is also relevant for the wear on SSDs: a storage
@ -636,8 +610,7 @@ the data files anyway, and SSTables don’t have pages with unused space. Moreov
 key-value pairs can better be compressed in SSTables, and thus often produce smaller files on disk
 than B-trees. Keys and values that have been overwritten continue to consume space until they are
 removed by a compaction, but this overhead is quite low when using leveled compaction
-[[40](/en/ch4#Callaghan2015),
+[[^40], [^41]].
-[41](/en/ch4#Callaghan2016rocksdb)].
 Size-tiered compaction (see [“Compaction strategies”](/en/ch4#sec_storage_lsm_compaction)) uses more disk space, especially
 temporarily during compaction.
@ -682,22 +655,22 @@ to implement an index.
 The key in an index is the thing that queries search by, but the value can be one of several things:
 * If the actual data (row, document, vertex) is stored directly within the index structure, it is
-  called a *clustered index*. For example, in MySQL’s InnoDB storage engine, the primary key of a
+ called a *clustered index*. For example, in MySQL’s InnoDB storage engine, the primary key of a
-  table is always a clustered index, and in SQL Server, you can specify one clustered index per
+ table is always a clustered index, and in SQL Server, you can specify one clustered index per
-  table [^43].
+ table [^43].
 * Alternatively, the value can be a reference to the actual data: either the primary key of the row
-  in question (InnoDB does this for secondary indexes), or a direct reference to a location on disk.
+ in question (InnoDB does this for secondary indexes), or a direct reference to a location on disk.
-  In the latter case, the place where rows are stored is known as a *heap file*, and it stores data
+ In the latter case, the place where rows are stored is known as a *heap file*, and it stores data
-  in no particular order (it may be append-only, or it may keep track of deleted rows in order to
+ in no particular order (it may be append-only, or it may keep track of deleted rows in order to
-  overwrite them with new data later). For example, Postgres uses the heap file approach
+ overwrite them with new data later). For example, Postgres uses the heap file approach
-  [^44].
+ [^44].
 * A middle ground between the two is a *covering index* or *index with included columns*, which
-  stores *some* of a table’s columns within the index, in addition to storing the full row on the
+ stores *some* of a table’s columns within the index, in addition to storing the full row on the
-  heap or in the primary key clustered index [^45].
+ heap or in the primary key clustered index [^45].
-  This allows some queries to be answered by using the index alone, without having to resolve the
+ This allows some queries to be answered by using the index alone, without having to resolve the
-  primary key or look in the heap file (in which case, the index is said to *cover* the query).
+ primary key or look in the heap file (in which case, the index is said to *cover* the query).
-  This can make some queries faster, but the duplication of data means the index uses more disk space and slows down
+ This can make some queries faster, but the duplication of data means the index uses more disk space and slows down
-  writes.
+ writes.
 The indexes discussed so far only map a single key to a value. If you need to query multiple columns
 of a table (or multiple fields in a document) simultaneously, see [“Multidimensional and Full-Text Indexes”](/en/ch4#sec_storage_multidimensional).
@ -737,11 +710,9 @@ easily be backed up, inspected, and analyzed by external utilities.
 Products such as VoltDB, SingleStore, and Oracle TimesTen are in-memory databases with a relational model,
 and the vendors claim that they can offer big performance improvements by removing all the overheads
 associated with managing on-disk data structures
-[[46](/en/ch4#Stonebraker2007),
+[[^46], [^47]].
-[47](/en/ch4#VoltDB2014uj)].
 RAMCloud is an open source, in-memory key-value store with durability (using a log-structured
-approach for the data in memory as well as the data on disk)
+approach for the data in memory as well as the data on disk) [^48].
-[^48].
 Redis and Couchbase provide weak durability by writing to disk asynchronously.
@ -749,8 +720,7 @@ Counterintuitively, the performance advantage of in-memory databases is not due
 they don’t need to read from disk. Even a disk-based storage engine may never need to read from disk
 if you have enough memory, because the operating system caches recently used disk blocks in memory
 anyway. Rather, they can be faster because they can avoid the overheads of encoding in-memory data
-structures in a form that can be written to disk
+structures in a form that can be written to disk [^49].
-[^49].
 Besides performance, another interesting area for in-memory databases is providing data models that
 are difficult to implement with disk-based indexes. For example, Redis offers a database-like
@ -774,10 +744,7 @@ transaction processing and data warehousing in the same product. However, these
 and analytical processing (HTAP) databases (introduced in [“Data Warehousing”](/en/ch1#sec_introduction_dwh)) are increasingly
 becoming two separate storage and query engines, which happen to be accessible through a common SQL
 interface
-[[50](/en/ch4#Larson2013),
+[[^50], [^51], [^52], [^53]].
-[51](/en/ch4#Farber2012),
-[52](/en/ch4#Stonebraker2013),
-[53](/en/ch4#Prout2022_ch4)].
 ## Cloud Data Warehouses
@ -790,50 +757,48 @@ of scalable cloud infrastructure like object storage and serverless computation
 Cloud data warehouses tend to integrate better with other cloud services and to be more elastic.
 For example, many cloud warehouses support automatic log ingestion, and offer easy integration with
 data processing frameworks such as Google Cloud’s Dataflow or Amazon Web Services’ Kinesis. These
-warehouses are also more elastic because they decouple query computation from the storage layer
+warehouses are also more elastic because they decouple query computation from the storage layer [^54].
-[^54].
 Data is persisted on object storage rather than local disks, which makes it easy to adjust storage
 capacity and compute resources for queries independently, as we previously saw in
 [“Cloud-Native System Architecture”](/en/ch1#sec_introduction_cloud_native).
 Open source data warehouses such as Apache Hive, Trino, and Apache Spark have also evolved with the
 cloud. As data storage for analytics has moved to data lakes on object storage, open source warehouses
-have begun to break apart
+have begun to break apart [^55]. The following
-[^55]. The following
 components, which were previously integrated in a single system such as Apache Hive, are now often
 implemented as separate components:
 Query engine
-:   Query engines such as Trino, Apache DataFusion, and Presto parse SQL queries, optimize them into
+: Query engines such as Trino, Apache DataFusion, and Presto parse SQL queries, optimize them into
-    execution plans, and execute them against the data. Execution usually requires parallel,
+ execution plans, and execute them against the data. Execution usually requires parallel,
-    distributed data processing tasks. Some query engines provide built-in task execution, while
+ distributed data processing tasks. Some query engines provide built-in task execution, while
-    others choose to use third party execution frameworks such as Apache Spark or Apache Flink.
+ others choose to use third party execution frameworks such as Apache Spark or Apache Flink.
 Storage format
-:   The storage format determines how the rows of a table are encoded as bytes in a file, which is
+: The storage format determines how the rows of a table are encoded as bytes in a file, which is
-    then typically stored in object storage or a distributed filesystem
+ then typically stored in object storage or a distributed filesystem
-    [^12].
+ [^12].
-    This data can then be accessed by the query engine, but also by other applications using the data
+ This data can then be accessed by the query engine, but also by other applications using the data
-    lake. Examples of such storage formats are Parquet, ORC, Lance, or Nimble, and we will see more
+ lake. Examples of such storage formats are Parquet, ORC, Lance, or Nimble, and we will see more
-    about them in the next section.
+ about them in the next section.
 Table format
-:   Files written in Apache Parquet and similar storage formats are typically immutable once written.
+: Files written in Apache Parquet and similar storage formats are typically immutable once written.
-    To support row inserts and deletions, a table format such as Apache Iceberg or Databricks’s Delta
+ To support row inserts and deletions, a table format such as Apache Iceberg or Databricks’s Delta
-    format is used. Table formats specify a file format that defines which files constitute a table
+ format is used. Table formats specify a file format that defines which files constitute a table
-    along with the table’s schema. Such formats also offer advanced features such as time travel (the
+ along with the table’s schema. Such formats also offer advanced features such as time travel (the
-    ability to query a table as it was at a previous point in time), garbage collection, and even
+ ability to query a table as it was at a previous point in time), garbage collection, and even
-    transactions.
+ transactions.
 Data catalog
-:   Much like a table format defines which files make up a table, a data catalog defines which tables
+: Much like a table format defines which files make up a table, a data catalog defines which tables
-    comprise a database. Catalogs are used to create, rename, and drop tables. Unlike storage and table
+ comprise a database. Catalogs are used to create, rename, and drop tables. Unlike storage and table
-    formats, data catalogs such as Snowflake’s Polaris and Databricks’s Unity Catalog usually run as a
+ formats, data catalogs such as Snowflake’s Polaris and Databricks’s Unity Catalog usually run as a
-    standalone service that can be queried using a REST interface. Apache Iceberg also offers a
+ standalone service that can be queried using a REST interface. Apache Iceberg also offers a
-    catalog, which can be run inside a client or as a separate process. Query engines use catalog
+ catalog, which can be run inside a client or as a separate process. Query engines use catalog
-    information when reading and writing tables. Traditionally, catalogs and query engines have been
+ information when reading and writing tables. Traditionally, catalogs and query engines have been
-    integrated, but decoupling them has enabled data discovery and data governance systems
+ integrated, but decoupling them has enabled data discovery and data governance systems
-    (discussed in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance)) to access a catalog’s metadata as well.
+ (discussed in [“Data Systems, Law, and Society”](/en/ch1#sec_introduction_compliance)) to access a catalog’s metadata as well.
 ## Column-Oriented Storage
@ -844,8 +809,7 @@ efficiently becomes a challenging problem. Dimension tables are usually much sma
 rows), so in this section we will focus on storage of facts.
 Although fact tables are often over 100 columns wide, a typical data warehouse query only accesses 4
-or 5 of them at one time (`"SELECT *"` queries are rarely needed for analytics)
+or 5 of them at one time (`"SELECT *"` queries are rarely needed for analytics) [^52]. Take the query in
-[^52]. Take the query in
 [Example 4-1](/en/ch4#fig_storage_analytics_query): it accesses a large number of rows (every occurrence of someone
 buying fruit or candy during the 2024 calendar year), but it only needs to access three columns of
 the `fact_sales` table: `date_key`, `product_sk`,
@ -855,16 +819,16 @@ and `quantity`. The query ignores all other columns.
 ```
 SELECT
-  dim_date.weekday, dim_product.category,
+ dim_date.weekday, dim_product.category,
-  SUM(fact_sales.quantity) AS quantity_sold
+ SUM(fact_sales.quantity) AS quantity_sold
 FROM fact_sales
-  JOIN dim_date    ON fact_sales.date_key   = dim_date.date_key
+ JOIN dim_date ON fact_sales.date_key = dim_date.date_key
-  JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
+ JOIN dim_product ON fact_sales.product_sk = dim_product.product_sk
 WHERE
-  dim_date.year = 2024 AND
+ dim_date.year = 2024 AND
-  dim_product.category IN ('Fresh fruit', 'Candy')
+ dim_product.category IN ('Fresh fruit', 'Candy')
 GROUP BY
-  dim_date.weekday, dim_product.category;
+ dim_date.weekday, dim_product.category;
 ```
 How can we execute this query efficiently?
@ -882,8 +846,7 @@ memory, parse them, and filter out those that don’t meet the required conditio
 long time.
 The idea behind *column-oriented* (or *columnar*) storage is simple: don’t store all the values from
-one row together, but store all the values from each *column* together instead
+one row together, but store all the values from each *column* together instead [^56].
-[^56].
 If each column is stored separately, a query only needs to read and parse those columns that are
 used in that query, which can save a lot of work. [Figure 4-7](/en/ch4#fig_column_store) shows this principle using
 an expanded version of the fact table from [Figure 3-5](/en/ch3#fig_dwh_schema).
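As a rough illustration of the idea, the following sketch converts a handful of rows into one list per column and then answers a query by reading only the two columns it needs. The row values and function names are invented for this example, and a real column store would keep each column in a separate compressed file rather than in Python lists.

```python
# Hypothetical fact_sales rows; the values are invented for illustration.
rows = [
    {"date_key": 20240102, "product_sk": 69, "store_sk": 4, "quantity": 1},
    {"date_key": 20240102, "product_sk": 31, "store_sk": 5, "quantity": 3},
    {"date_key": 20240103, "product_sk": 69, "store_sk": 4, "quantity": 5},
    {"date_key": 20240103, "product_sk": 30, "store_sk": 2, "quantity": 2},
]

def to_columns(rows):
    """Turn a list of row dicts into one list per column, keeping row order."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

def total_quantity(columns, product_sk):
    """Sum quantity for one product, touching only the two columns the query needs."""
    return sum(
        quantity
        for sk, quantity in zip(columns["product_sk"], columns["quantity"])
        if sk == product_sk
    )

columns = to_columns(rows)
sold_69 = total_quantity(columns, product_sk=69)
```

The `date_key` and `store_sk` columns are never read by `total_quantity`, which is the saving that column-oriented storage aims for.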
@ -907,33 +870,24 @@ individual columns and put them together to form the 23rd row of the table.
 In fact, columnar storage engines don’t actually store an entire column (containing perhaps
 trillions of rows) in one go. Instead, they break the table into blocks of thousands or millions of
-rows, and within each block they store the values from each column separately
+rows, and within each block they store the values from each column separately [^60].
-[^60].
 Since many queries are restricted to a particular date range, it is common to make each block
 contain the rows for a particular timestamp range. A query then only needs to load the columns it
 needs in those blocks that overlap with the required date range.
-Columnar storage is used in almost all analytic databases nowadays
-[^60],
-ranging from large-scale cloud data warehouses such as Snowflake
-[^61]
-to single-node embedded databases such as DuckDB
-[^62],
-and product analytics systems such as Pinot
-[^63]
+Columnar storage is used in almost all analytic databases nowadays [^60],
+ranging from large-scale cloud data warehouses such as Snowflake [^61]
+to single-node embedded databases such as DuckDB [^62],
+and product analytics systems such as Pinot [^63]
 and Druid [^64].
 It is used in storage formats such as Parquet, ORC
-[[65](/en/ch4#Liu2023),
+[[^65], [^66]],
-[66](/en/ch4#Zeng2023)],
 Lance [^67],
 and Nimble [^68],
 and in-memory analytics formats like Apache Arrow
-[[65](/en/ch4#Liu2023),
+[[^65], [^69]]
-[69](/en/ch4#McKinney2021)]
 and Pandas/NumPy [^70].
-Some time-series databases, such as InfluxDB IOx
+Some time-series databases, such as InfluxDB IOx [^71] and TimescaleDB [^72],
-[^71] and TimescaleDB
-[^72],
 are also based on column-oriented storage.
 ### Column Compression
@ -961,21 +915,20 @@ One option is to store those bitmaps using one bit per row. However, these bitma
 a lot of zeros (we say that they are *sparse*). In that case, the bitmaps can additionally be
 run-length encoded: counting the number of consecutive zeros or ones and storing that number, as
 shown at the bottom of [Figure 4-8](/en/ch4#fig_bitmap_index). Techniques such as *roaring bitmaps* switch between the
-two bitmap representations, using whichever is the most compact
+two bitmap representations, using whichever is the most compact [^73].
-[^73].
 This can make the encoding of a column remarkably efficient.
 Bitmap indexes such as these are very well suited for the kinds of queries that are common in a data
 warehouse. For example:
 `WHERE product_sk IN (31, 68, 69):`
-:   Load the three bitmaps for `product_sk = 31`, `product_sk = 68`, and `product_sk = 69`, and
+: Load the three bitmaps for `product_sk = 31`, `product_sk = 68`, and `product_sk = 69`, and
-    calculate the bitwise *OR* of the three bitmaps, which can be done very efficiently.
+ calculate the bitwise *OR* of the three bitmaps, which can be done very efficiently.
 `WHERE product_sk = 30 AND store_sk = 3:`
-:   Load the bitmaps for `product_sk = 30` and `store_sk = 3`, and calculate the bitwise *AND*. This
+: Load the bitmaps for `product_sk = 30` and `store_sk = 3`, and calculate the bitwise *AND*. This
-    works because the columns contain the rows in the same order, so the *k*th bit in one column’s
+ works because the columns contain the rows in the same order, so the *k*th bit in one column’s
-    bitmap corresponds to the same row as the *k*th bit in another column’s bitmap.
+ bitmap corresponds to the same row as the *k*th bit in another column’s bitmap.
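The two queries above can be sketched with Python integers standing in for bitmaps, where bit *k* of a bitmap corresponds to row *k*. The column contents and helper names are made up for this example, and real systems use compressed bitmap representations rather than plain integers.

```python
def build_bitmaps(column):
    """Map each distinct value to an int whose bit k is set if row k has that value."""
    bitmaps = {}
    for row, value in enumerate(column):
        bitmaps[value] = bitmaps.get(value, 0) | (1 << row)
    return bitmaps

def matching_rows(bitmap):
    """Return the row numbers whose bits are set in the bitmap."""
    rows, row = [], 0
    while bitmap:
        if bitmap & 1:
            rows.append(row)
        bitmap >>= 1
        row += 1
    return rows

# Hypothetical column values; both columns list the rows in the same order.
product_sk = [31, 30, 68, 30, 69, 31, 30, 29]
store_sk = [3, 3, 5, 8, 2, 3, 3, 3]

product_bitmaps = build_bitmaps(product_sk)
store_bitmaps = build_bitmaps(store_sk)

# WHERE product_sk IN (31, 68, 69): bitwise OR of three bitmaps
in_query = product_bitmaps[31] | product_bitmaps[68] | product_bitmaps[69]

# WHERE product_sk = 30 AND store_sk = 3: bitwise AND of two bitmaps
and_query = product_bitmaps[30] & store_bitmaps[3]

in_rows = matching_rows(in_query)
and_rows = matching_rows(and_query)
```

The AND only gives a meaningful answer because `product_sk` and `store_sk` list the rows in the same order.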
 Bitmaps can also be used to answer graph queries, such as finding all users of a social network who
 are followed by user *X* and who also follow user *Y*
@ -1046,9 +999,7 @@ Queries need to examine both the column data on disk and the recent writes in me
 the two. The query execution engine hides this distinction from the user. From an analyst’s point
 of view, data that has been modified with inserts, updates, or deletes is immediately reflected in
 subsequent queries. Snowflake, Vertica, Apache Pinot, Apache Druid, and many others do this
-[[61](/en/ch4#Dageville2016), [63](/en/ch4#Im2018),
+[[^61], [^63], [^64], [^76]].
-[64](/en/ch4#Yang2014),
-[76](/en/ch4#Lamb2012)].
 ## Query Execution: Compilation and Vectorization
@ -1068,30 +1019,29 @@ the amount of data they need to read off disk, but also the CPU time required to
 operators. The simplest kind of operator is like an interpreter for a programming language: while
 iterating over each row, it checks a data structure representing the query to find out which
 comparisons or calculations it needs to perform on which columns. Unfortunately, this is too slow
-for many analytics purposes. Two alternative approaches for efficient query execution have emerged
+for many analytics purposes. Two alternative approaches for efficient query execution have emerged [^77]:
-[^77]:
 Query compilation
-:   The query engine takes the SQL query and generates code for executing it. The code iterates over
+: The query engine takes the SQL query and generates code for executing it. The code iterates over
-    the rows one by one, looks at the values in the columns of interest, performs whatever comparisons
+ the rows one by one, looks at the values in the columns of interest, performs whatever comparisons
-    or calculations are needed, and copies the necessary values to an output buffer if the required
+ or calculations are needed, and copies the necessary values to an output buffer if the required
-    conditions are satisfied. The query engine compiles the generated code to machine code (often
+ conditions are satisfied. The query engine compiles the generated code to machine code (often
-    using an existing compiler such as LLVM), and then runs it on the column-encoded data that has
+ using an existing compiler such as LLVM), and then runs it on the column-encoded data that has
-    been loaded into memory. This approach to code generation is similar to the just-in-time (JIT)
+ been loaded into memory. This approach to code generation is similar to the just-in-time (JIT)
-    compilation approach that is used in the Java Virtual Machine (JVM) and similar runtimes.
+ compilation approach that is used in the Java Virtual Machine (JVM) and similar runtimes.
 Vectorized processing
-:   The query is interpreted, not compiled, but it is made fast by processing many values from a
+: The query is interpreted, not compiled, but it is made fast by processing many values from a
-    column in a batch, instead of iterating over rows one by one. A fixed set of predefined operators
+ column in a batch, instead of iterating over rows one by one. A fixed set of predefined operators
-    are built into the database; we can pass arguments to them and get back a batch of results
+ are built into the database; we can pass arguments to them and get back a batch of results
-    [[50](/en/ch4#Larson2013), [75](/en/ch4#Abadi2013)].
+ [[^50], [^75]].
-    For example, we could pass the `product_sk` column and the ID of “bananas” to an equality operator,
+ For example, we could pass the `product_sk` column and the ID of “bananas” to an equality operator,
-    and get back a bitmap (one bit per value in the input column, which is 1 if it’s a banana); we could
+ and get back a bitmap (one bit per value in the input column, which is 1 if it’s a banana); we could
-    then pass the `store_sk` column and the ID of the store of interest to the same equality operator,
+ then pass the `store_sk` column and the ID of the store of interest to the same equality operator,
-    and get back another bitmap; and then we could pass the two bitmaps to a “bitwise AND” operator, as
+ and get back another bitmap; and then we could pass the two bitmaps to a “bitwise AND” operator, as
-    shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in
+ shown in [Figure 4-9](/en/ch4#fig_bitmap_and). The result would be a bitmap containing a 1 for all sales of bananas in
-    a particular store.
+ a particular store.
 ![ddia 0409](/fig/ddia_0409.png)
@ -1102,15 +1052,15 @@ practice [^77]. Both can achieve very good
 performance by taking advantage of the characteristics of modern CPUs:
 * preferring sequential memory access over random access to reduce cache misses
-  [^78],
+ [^78],
 * doing most of the work in tight inner loops (that is, with a small number of instructions and no
-  function calls) to keep the CPU instruction processing pipeline busy and avoid branch
+ function calls) to keep the CPU instruction processing pipeline busy and avoid branch
-  mispredictions,
+ mispredictions,
 * making use of parallelism such as multiple threads and single-instruction-multi-data (SIMD)
-  instructions [[79](/en/ch4#Boncz2005),
+ instructions [[^79],
-  [80](/en/ch4#Zhou2002)], and
+ [^80]], and
 * operating directly on compressed data without decoding it into a separate in-memory
-  representation, which saves memory allocation and copying costs.
+ representation, which saves memory allocation and copying costs.
 ## Materialized Views and Data Cubes
@ -1123,8 +1073,7 @@ expanded query.
 When the underlying data changes, a materialized view needs to be updated accordingly. Some
 databases can do that automatically, and there are also systems such as Materialize that specialize
-in materialized view maintenance
+in materialized view maintenance [^81].
-[^81].
 Performing such updates means more work on writes, but materialized views can improve read
 performance in workloads that repeatedly need to perform the same queries.
@ -1133,8 +1082,7 @@ discussed earlier, data warehouse queries often involve an aggregate function, s
 `AVG`, `MIN`, or `MAX` in SQL. If the same aggregates are used by many different queries, it can be
 wasteful to crunch through the raw data every time. Why not cache some of the counts or sums that
 queries use most often? A *data cube* or *OLAP cube* does this by creating a grid of aggregates
-grouped by different dimensions
+grouped by different dimensions [^82].
-[^82].
 [Figure 4-10](/en/ch4#fig_data_cube) shows an example.
 ![ddia 0410](/fig/ddia_0410.png)
@ -1187,8 +1135,8 @@ rectangular map area that the user is currently viewing. This requires a two-dim
 like the following:
 ```
-SELECT * FROM restaurants WHERE latitude  > 51.4946 AND latitude  < 51.5079
+SELECT * FROM restaurants WHERE latitude > 51.4946 AND latitude < 51.5079
-                            AND longitude > -0.1162 AND longitude < -0.1004;
+ AND longitude > -0.1162 AND longitude < -0.1004;
 ```
 A concatenated index over the latitude and longitude columns is not able to answer that kind of
@ -1197,16 +1145,12 @@ longitude), or all the restaurants in a range of longitudes (but anywhere betwee
 South poles), but not both simultaneously.
 One option is to translate a two-dimensional location into a single number using a space-filling
-curve, and then to use a regular B-tree index
-[^83].
-More commonly, specialized spatial indexes such as R-trees or Bkd-trees
-[^84]
+curve, and then to use a regular B-tree index [^83].
+More commonly, specialized spatial indexes such as R-trees or Bkd-trees [^84]
 are used; they divide up the space so that nearby data points tend to be grouped in the same
 subtree. For example, PostGIS implements geospatial indexes as R-trees using PostgreSQL’s
-Generalized Search Tree indexing facility
-[^85].
-It is also possible to use regularly spaced grids of triangles, squares, or hexagons
-[^86].
+Generalized Search Tree indexing facility [^85].
+It is also possible to use regularly spaced grids of triangles, squares, or hexagons [^86].
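To illustrate the space-filling curve option, here is a minimal sketch of one such curve, the Z-order (Morton) curve, which interleaves the bits of two coordinates into a single integer. The text does not say which curve a given system uses, so the choice of Z-order, the 16-bit grid resolution, and all function names here are assumptions made for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y: x goes to even positions, y to odd."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z

def quantize(value: float, lo: float, hi: float, bits: int = 16) -> int:
    """Map a value in [lo, hi] onto an integer grid cell in [0, 2**bits - 1]."""
    cells = (1 << bits) - 1
    return int((value - lo) / (hi - lo) * cells)

def z_order_key(latitude: float, longitude: float, bits: int = 16) -> int:
    """Single-number key that could be stored in an ordinary one-dimensional index."""
    x = quantize(longitude, -180.0, 180.0, bits)
    y = quantize(latitude, -90.0, 90.0, bits)
    return interleave_bits(x, y, bits)

key = z_order_key(51.5, -0.11)
```

Points that are close together in two dimensions often, but not always, receive nearby keys, and a rectangular search area generally maps to several separate key ranges rather than one.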
 Multi-dimensional indexes are not just for geographic locations. For example, on an ecommerce
 website you could use a three-dimensional index on the dimensions (*red*, *green*, *blue*) to search
@ -1215,14 +1159,12 @@ two-dimensional index on (*date*, *temperature*) in order to efficiently search
 observations during the year 2013 where the temperature was between 25 and 30℃. With a
 one-dimensional index, you would have to either scan over all the records from 2013 (regardless of
 temperature) and then filter them by temperature, or vice versa. A 2D index could narrow down by
-timestamp and temperature simultaneously
+timestamp and temperature simultaneously [^87].
-[^87].
 ## Full-Text Search
 Full-text search allows you to search a collection of text documents (web pages, product
-descriptions, etc.) by keywords that might appear anywhere in the text
+descriptions, etc.) by keywords that might appear anywhere in the text [^88].
-[^88].
 Information retrieval is a big, specialist topic that often involves language-specific processing:
 for example, several Asian languages are written without spaces or punctuation between words, and
 therefore splitting text into words requires a model that indicates which character sequences
@ -1249,26 +1191,21 @@ warehouse query that searches for rows matching two conditions ([Figure 4-9](/e
 bitmaps for terms *x* and *y* and compute their bitwise AND. Even if the bitmaps are run-length
 encoded, this can be done very efficiently.
-For example, Lucene, the full-text indexing engine used by Elasticsearch and Solr, works like this
+For example, Lucene, the full-text indexing engine used by Elasticsearch and Solr, works like this [^90].
 [^90].
 It stores the mapping from term to postings list in SSTable-like sorted files, which are merged in
-the background using the same log-structured approach we saw earlier in this chapter
+the background using the same log-structured approach we saw earlier in this chapter [^91].
 [^91].
 PostgreSQL’s GIN index type also uses postings lists to support full-text search and indexing inside
 JSON documents
-[[92](/en/ch4#Fittl2021),
+[[^92], [^93]].
 [93](/en/ch4#Angelakos2020)].
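The bitwise AND over postings bitmaps described above can be sketched as follows. This is only an illustration that uses Python integers as bitmaps; the document IDs and the helper names `to_bitmap` and `from_bitmap` are made up for the example and say nothing about how Lucene or PostgreSQL actually lay out their postings lists.

```python
def to_bitmap(doc_ids):
    """Set bit i for every document ID i in the postings list."""
    bitmap = 0
    for doc_id in doc_ids:
        bitmap |= 1 << doc_id
    return bitmap


def from_bitmap(bitmap):
    """Return the sorted document IDs whose bits are set."""
    doc_ids = []
    doc_id = 0
    while bitmap:
        if bitmap & 1:
            doc_ids.append(doc_id)
        bitmap >>= 1
        doc_id += 1
    return doc_ids


# Hypothetical postings lists for two terms x and y.
postings_x = to_bitmap([1, 4, 7, 9])
postings_y = to_bitmap([2, 4, 9, 11])

# Documents that contain both terms: a single bitwise AND.
both = from_bitmap(postings_x & postings_y)
print(both)
```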
 Instead of breaking text into words, an alternative is to find all the substrings of length *n*,
 which are called *n*-grams. For example, the trigrams (*n* = 3) of the string
 `"hello"` are `"hel"`, `"ell"`, and `"llo"`. If we build an inverted index of all trigrams, we can
 search the documents for arbitrary substrings that are at least three characters long. Trigram
-indexes even allow regular expressions in search queries; the downside is that they are quite large
+indexes even allow regular expressions in search queries; the downside is that they are quite large [^94].
 [^94].
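A trigram index for substring search might look roughly like the sketch below. It is a toy in-memory version under stated assumptions: the function names (`trigrams`, `build_index`, `search`) and the sample documents are invented for illustration, and real implementations store the index far more compactly.

```python
from collections import defaultdict


def trigrams(text):
    """Return all substrings of length 3, in order of appearance."""
    return [text[i:i + 3] for i in range(len(text) - 2)]


def build_index(documents):
    """Map each trigram to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for gram in trigrams(text):
            index[gram].add(doc_id)
    return index


def search(index, documents, query):
    """Find documents containing `query` (three or more characters) as a substring."""
    grams = trigrams(query)
    if not grams:
        raise ValueError("query must be at least three characters long")
    # A matching document must contain every trigram of the query...
    candidates = set.intersection(*(index.get(g, set()) for g in grams))
    # ...but the trigrams need not be adjacent, so check the actual text too.
    return sorted(d for d in candidates if query in documents[d])


# Hypothetical documents, keyed by document ID.
documents = {1: "hello world", 2: "yellow submarine", 3: "help wanted"}
index = build_index(documents)
print(search(index, documents, "ello"))
```

The final verification step matters because the index only narrows the search to candidates: a document can contain all of a query's trigrams without containing the query itself.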
 To cope with typos in documents or queries, Lucene is able to search text for words within a certain
-edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced)
+edit distance (an edit distance of 1 means that one letter has been added, removed, or replaced) [^95].
 [^95].
 It does this by storing the set of terms as a finite state automaton over the characters in the
 keys, similar to a *trie*
 [^96],
@ -1309,12 +1246,9 @@ measure the distance between vectors. Cosine similarity measures the cosine of t
 vectors to determine how close they are, while Euclidean distance measures the straight-line
 distance between two points in space.
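To make the difference between the two measures concrete, here is a small sketch in plain Python. The vectors `u` and `v` are arbitrary example values, not the output of any embedding model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between a and b (1.0 means same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def euclidean_distance(a, b):
    """Straight-line distance between the points a and b."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]  # same direction as u, but twice as long

print(cosine_similarity(u, v))   # close to 1.0: the angle between them is zero
print(euclidean_distance(u, v))  # clearly nonzero: the points are far apart
```

The example shows why the choice of measure matters: two vectors that point in the same direction are maximally similar by cosine similarity even when they are far apart in Euclidean terms.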
-Many early embedding models such as Word2Vec
+Many early embedding models such as Word2Vec [^98],
-[^98],
+BERT [^99],
-BERT
+and GPT [^100]
 [^99],
 and GPT
 [^100]
 worked with text data. Such models are usually implemented as neural networks. Researchers went on to
 create embedding models for video, audio, and images as well. More recently, model
 architecture has become *multimodal*: a single model can generate vector embeddings for multiple
@ -1331,42 +1265,39 @@ closest to the query vector. Since the R-trees we saw previously don’t work we
 many dimensions, specialized vector indexes are used, such as:
 Flat indexes
-:   Vectors are stored in the index as they are. A query must read every vector and measure its
+: Vectors are stored in the index as they are. A query must read every vector and measure its
-    distance to the query vector. Flat indexes are accurate, but measuring the distance between the
+ distance to the query vector. Flat indexes are accurate, but measuring the distance between the
-    query and each vector is slow.
+ query and each vector is slow.
 Inverted file (IVF) indexes
-:   The vector space is clustered into partitions (called *centroids*) of vectors to reduce the number
+: The vector space is clustered into partitions (called *centroids*) of vectors to reduce the number
-    of vectors that must be compared. IVF indexes are faster than flat indexes, but can give only
+ of vectors that must be compared. IVF indexes are faster than flat indexes, but can give only
-    approximate results: the query and a document may fall into different partitions, even though they
+ approximate results: the query and a document may fall into different partitions, even though they
-    are close to each other. A query on an IVF index first defines *probes*, which are simply the number
+ are close to each other. A query on an IVF index first defines *probes*, which are simply the number
-    of partitions to check. Queries that use more probes will be more accurate, but will be slower, as
+ of partitions to check. Queries that use more probes will be more accurate, but will be slower, as
-    more vectors must be compared.
+ more vectors must be compared.
 Hierarchical Navigable Small World (HNSW)
-:   HNSW indexes maintain multiple layers of the vector space, as illustrated in [Figure 4-11](/en/ch4#fig_vector_hnsw).
+: HNSW indexes maintain multiple layers of the vector space, as illustrated in [Figure 4-11](/en/ch4#fig_vector_hnsw).
-    Each layer is represented as a graph, where nodes represent vectors, and edges represent proximity
+ Each layer is represented as a graph, where nodes represent vectors, and edges represent proximity
-    to nearby vectors. A query starts by locating the nearest vector in the topmost layer, which has a
+ to nearby vectors. A query starts by locating the nearest vector in the topmost layer, which has a
-    small number of nodes. The query then moves to the same node in the layer below and follows the
+ small number of nodes. The query then moves to the same node in the layer below and follows the
-    edges in that layer, which is more densely connected, looking for a vector that is closer to the
+ edges in that layer, which is more densely connected, looking for a vector that is closer to the
-    query vector. The process continues until the last layer is reached. As with IVF indexes, HNSW
+ query vector. The process continues until the last layer is reached. As with IVF indexes, HNSW
-    indexes are approximate.
+ indexes are approximate.
 ![ddia 0411](/fig/ddia_0411.png)
 ###### Figure 4-11. Searching for the database entry that is closest to a given query vector in an HNSW index.
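The trade-off between flat and IVF indexes can be sketched in a few lines. This is a simplified illustration, not how any particular vector database works: the centroids are supplied by hand rather than learned by clustering, and the function names and sample vectors are assumptions made for the example.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def flat_search(vectors, query):
    """Exact nearest neighbor: compare the query with every vector."""
    return min(vectors, key=lambda v: distance(v, query))


def build_ivf(vectors, centroids):
    """Assign each vector to the partition of its nearest centroid."""
    partitions = {i: [] for i in range(len(centroids))}
    for v in vectors:
        nearest = min(range(len(centroids)), key=lambda i: distance(centroids[i], v))
        partitions[nearest].append(v)
    return partitions


def ivf_search(partitions, centroids, query, probes=1):
    """Approximate search: scan only the `probes` partitions nearest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: distance(centroids[i], query))
    candidates = [v for i in order[:probes] for v in partitions[i]]
    if not candidates:
        return None
    return min(candidates, key=lambda v: distance(v, query))


# Hand-picked example data: two partitions along the x axis.
vectors = [(0.0, 0.0), (4.8, 0.0), (9.0, 0.0), (10.0, 0.0)]
centroids = [(0.0, 0.0), (10.0, 0.0)]
partitions = build_ivf(vectors, centroids)
query = (5.2, 0.0)

print(flat_search(vectors, query))                       # exact answer
print(ivf_search(partitions, centroids, query, 1))       # one probe: misses it
print(ivf_search(partitions, centroids, query, 2))       # two probes: finds it
```

Here the query lies just across a partition boundary from its true nearest neighbor, so a single probe returns a vector from the wrong partition, while probing both partitions recovers the exact result at the cost of comparing more vectors.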
 Many popular vector databases implement IVF and HNSW indexes. Facebook’s Faiss library has many
-variations of each
+variations of each [^101],
-[^101],
+and PostgreSQL’s pgvector supports both as well [^102].
 and PostgreSQL’s pgvector supports both as well
 [^102].
 The full details of the IVF and HNSW algorithms are beyond the scope of this book, but their papers
 are an excellent resource
-[[103](/en/ch4#Baranchuk2018),
+[[^103], [^104]].
 [104](/en/ch4#Malkov2020)].
-# Summary
+## Summary
 In this chapter we tried to get to the bottom of how databases perform storage and retrieval. What
 happens when you store data in a database, and what does the database do when you query for the
@ -1377,25 +1308,25 @@ analytics (OLAP). In this chapter we saw that storage engines optimized for OLTP
 from those optimized for analytics:
 * OLTP systems are optimized for a high volume of requests, each of which reads and writes a small
-  number of records, and which need fast responses. The records are typically accessed via a primary
+ number of records, and which need fast responses. The records are typically accessed via a primary
-  key or a secondary index, and these indexes are typically ordered mappings from key to record,
+ key or a secondary index, and these indexes are typically ordered mappings from key to record,
-  which also support range queries.
+ which also support range queries.
 * Data warehouses and similar analytic systems are optimized for complex read queries that scan over
-  a large number of records. They generally use a column-oriented storage layout with compression
+ a large number of records. They generally use a column-oriented storage layout with compression
-  that minimizes the amount of data that such a query needs to read off disk, and just-in-time
+ that minimizes the amount of data that such a query needs to read off disk, and just-in-time
-  compilation of queries or vectorization to minimize the amount of CPU time spent processing the
+ compilation of queries or vectorization to minimize the amount of CPU time spent processing the
-  data.
+ data.
 On the OLTP side, we saw storage engines from two main schools of thought:
 * The log-structured approach, which only permits appending to files and deleting obsolete files,
-  but never updates a file that has been written. SSTables, LSM-trees, RocksDB, Cassandra, HBase,
+ but never updates a file that has been written. SSTables, LSM-trees, RocksDB, Cassandra, HBase,
-  Scylla, Lucene, and others belong to this group. In general, log-structured storage engines tend
+ Scylla, Lucene, and others belong to this group. In general, log-structured storage engines tend
-  to provide high write throughput.
+ to provide high write throughput.
 * The update-in-place approach, which treats the disk as a set of fixed-size pages that can be
-  overwritten. B-trees, the biggest example of this philosophy, are used in all major relational
+ overwritten. B-trees, the biggest example of this philosophy, are used in all major relational
-  OLTP databases and also many nonrelational ones. As a rule of thumb, B-trees tend to be better for
+ OLTP databases and also many nonrelational ones. As a rule of thumb, B-trees tend to be better for
-  reads, providing higher read throughput and lower response times than log-structured storage.
+ reads, providing higher read throughput and lower response times than log-structured storage.
 We then looked at indexes that can search for multiple conditions at the same time: multidimensional
 indexes such as R-trees that can search for points on a map by latitude and longitude at the same
@ -1413,10 +1344,11 @@ Although this chapter couldn’t make you an expert in tuning any one particular
 has hopefully equipped you with enough vocabulary and ideas that you can make sense of the
 documentation for the database of your choice.
 ##### Footnotes
-##### References
+
 ### Summary
--- a/content/en/ch5.md
+++ b/content/en/ch5.md
@ -31,22 +31,22 @@ and writing that field). However, in a large application, code changes often can
 instantaneously:
 * With server-side applications you may want to perform a *rolling upgrade*
-  (also known as a *staged rollout*), deploying the new version to a few nodes at a time, checking
+ (also known as a *staged rollout*), deploying the new version to a few nodes at a time, checking
-  whether the new version is running smoothly, and gradually working your way through all the nodes.
+ whether the new version is running smoothly, and gradually working your way through all the nodes.
-  This allows new versions to be deployed without service downtime, and thus encourages more
+ This allows new versions to be deployed without service downtime, and thus encourages more
-  frequent releases and better evolvability.
+ frequent releases and better evolvability.
 * With client-side applications you’re at the mercy of the user, who may not install the update for
-  some time.
+ some time.
 This means that old and new versions of the code, and old and new data formats, may potentially all
 coexist in the system at the same time. In order for the system to continue running smoothly, we
 need to maintain compatibility in both directions:
 Backward compatibility
-:   Newer code can read data that was written by older code.
+: Newer code can read data that was written by older code.
 Forward compatibility
-:   Older code can read data that was written by newer code.
+: Older code can read data that was written by newer code.
 Backward compatibility is normally not hard to achieve: as author of the newer code, you know the
 format of data written by older code, and so you can explicitly handle it (if necessary by simply
@ -77,12 +77,12 @@ message queues.
 Programs usually work with data in (at least) two different representations:
 1. In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These
-   data structures are optimized for efficient access and manipulation by the CPU (typically using
+ data structures are optimized for efficient access and manipulation by the CPU (typically using
-   pointers).
+ pointers).
 2. When you want to write data to a file or send it over the network, you have to encode it as some
-   kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn’t
+ kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldn’t
-   make sense to any other process, this sequence-of-bytes representation often looks quite
+ make sense to any other process, this sequence-of-bytes representation often looks quite
-   different from the data structures that are normally used in memory.
+ different from the data structures that are normally used in memory.
 Thus, we need some kind of translation between the two representations. The translation from the
 in-memory representation to a byte sequence is called *encoding* (also known as *serialization* or
@ -114,22 +114,20 @@ These encoding libraries are very convenient, because they allow in-memory objec
 restored with minimal additional code. However, they also have a number of deep problems:
 * The encoding is often tied to a particular programming language, and reading the data in another
-  language is very difficult. If you store or transmit data in such an encoding, you are committing
+ language is very difficult. If you store or transmit data in such an encoding, you are committing
-  yourself to your current programming language for potentially a very long time, and precluding
+ yourself to your current programming language for potentially a very long time, and precluding
-  integrating your systems with those of other organizations (which may use different languages).
+ integrating your systems with those of other organizations (which may use different languages).
 * In order to restore data in the same object types, the decoding process needs to be able to
-  instantiate arbitrary classes. This is frequently a source of security problems
+ instantiate arbitrary classes. This is frequently a source of security problems [^1]:
-  [^1]:
+ if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate
-  if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate
+ arbitrary classes, which in turn often allows them to do terrible things such as remotely
-  arbitrary classes, which in turn often allows them to do terrible things such as remotely
+ executing arbitrary code [^2] [^3].
  executing arbitrary code [[2](/en/ch5#Breen2015),
  [3](/en/ch5#McKenzie2013)].
 * Versioning data is often an afterthought in these libraries: as they are intended for quick and
-  easy encoding of data, they often neglect the inconvenient problems of forward and backward
+ easy encoding of data, they often neglect the inconvenient problems of forward and backward
-  compatibility [^4].
+ compatibility [^4].
 * Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also
-  often an afterthought. For example, Java’s built-in serialization is notorious for its bad
+ often an afterthought. For example, Java’s built-in serialization is notorious for its bad
-  performance and bloated encoding [^5].
+ performance and bloated encoding [^5].
 For these reasons it’s generally a bad idea to use your language’s built-in encoding for anything
 other than very transient purposes.
@ -138,8 +136,7 @@ other than very transient purposes.
 When moving to standardized encodings that can be written and read by many programming languages, JSON
 and XML are the obvious contenders. They are widely known, widely supported, and almost as widely
-disliked. XML is often criticized for being too verbose and unnecessarily complicated
+disliked. XML is often criticized for being too verbose and unnecessarily complicated [^6].
 [^6].
 JSON’s popularity is mainly due to its built-in support in web browsers and simplicity relative to
 XML. CSV is another popular language-independent format, but it only supports tabular data without
 nesting.
@ -149,33 +146,31 @@ popular topic of debate). Besides the superficial syntactic issues, they also ha
 problems:
 * There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish
-  between a number and a string that happens to consist of digits (except by referring to an external
+ between a number and a string that happens to consist of digits (except by referring to an external
-  schema). JSON distinguishes strings and numbers, but it doesn’t distinguish integers and
+ schema). JSON distinguishes strings and numbers, but it doesn’t distinguish integers and
-  floating-point numbers, and it doesn’t specify a precision.
+ floating-point numbers, and it doesn’t specify a precision.
-  This is a problem when dealing with large numbers; for example, integers greater than 2^53 cannot
+ This is a problem when dealing with large numbers; for example, integers greater than 2^53 cannot
-  be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become
+ be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become
-  inaccurate when parsed in a language that uses floating-point numbers, such as JavaScript
+ inaccurate when parsed in a language that uses floating-point numbers, such as JavaScript [^7].
-  [^7].
+ An example of numbers larger than 2^53 occurs on X (formerly Twitter), which uses a 64-bit number to
-  An example of numbers larger than 2^53 occurs on X (formerly Twitter), which uses a 64-bit number to
+ identify each post. The JSON returned by the API includes post IDs twice, once as a JSON number and
-  identify each post. The JSON returned by the API includes post IDs twice, once as a JSON number and
+ once as a decimal string, to work around the fact that the numbers are not correctly parsed by
-  once as a decimal string, to work around the fact that the numbers are not correctly parsed by
+ JavaScript applications [^8].
  JavaScript applications [^8].
 * JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they
-  don’t support binary strings (sequences of bytes without a character encoding). Binary strings are a
+ don’t support binary strings (sequences of bytes without a character encoding). Binary strings are a
-  useful feature, so people get around this limitation by encoding the binary data as text using
+ useful feature, so people get around this limitation by encoding the binary data as text using
-  Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded.
+ Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded.
-  This works, but it’s somewhat hacky and increases the data size by 33%.
+ This works, but it’s somewhat hacky and increases the data size by 33%.
 * XML Schema and JSON Schema are powerful, and thus quite
-  complicated to learn and implement. Since the correct interpretation of data (such as numbers and
+ complicated to learn and implement. Since the correct interpretation of data (such as numbers and
-  binary strings) depends on information in the schema, applications that don’t use XML/JSON schemas
+ binary strings) depends on information in the schema, applications that don’t use XML/JSON schemas
-  need to potentially hard-code the appropriate encoding/decoding logic instead.
+ need to potentially hard-code the appropriate encoding/decoding logic instead.
 * CSV does not have any schema, so it is up to the application to define the meaning of each row and
-  column. If an application change adds a new row or column, you have to handle that change manually.
+ column. If an application change adds a new row or column, you have to handle that change manually.
-  CSV is also a quite vague format (what happens if a value contains a comma or a newline character?).
+ CSV is also a quite vague format (what happens if a value contains a comma or a newline character?).
-  Although its escaping rules have been formally specified
+ Although its escaping rules have been formally specified [^9],
-  [^9],
+ not all parsers implement them correctly.
  not all parsers implement them correctly.
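Two of the problems listed above, the loss of precision for large integers and the Base64 size overhead, are easy to reproduce. The sketch below uses only Python's standard library; passing `parse_int=float` is a way of simulating a parser that treats every number as a double (as JavaScript does), and the ID value and payload are arbitrary examples.

```python
import base64
import json

# A JSON document containing the integer 2**53 + 1.
doc = '{"id": 9007199254740993}'

# Python's json module keeps integers exact by default.
exact = json.loads(doc)["id"]

# Simulate a parser that stores all numbers as IEEE 754 doubles.
as_double = json.loads(doc, parse_int=float)["id"]

print(exact)           # the original integer
print(int(as_double))  # off by one: the double cannot hold 2**53 + 1

# Base64 turns every 3 bytes of binary data into 4 ASCII characters.
payload = bytes(300)
encoded = base64.b64encode(payload)
print(len(payload), len(encoded))
```

Sending the ID as a decimal string as well, as described above, sidesteps the problem because strings are never converted to floating point.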
 Despite these flaws, JSON, XML, and CSV are good enough for many purposes. It’s likely that they will
 remain popular, especially as data interchange formats (i.e., for sending data from one organization to
@ -211,16 +206,16 @@ JSON Schema so that keys may only contain digits, and values can only be strings
 ##### Example 5-1. Example JSON Schema with integer keys and string values. Integer keys are represented as strings containing only integers since JSON Schema requires all keys to be strings.
-```
+```json
 {
-  "$schema": "http://json-schema.org/draft-07/schema#",
+ "$schema": "http://json-schema.org/draft-07/schema#",
-  "type": "object",
+ "type": "object",
-  "patternProperties": {
+ "patternProperties": {
-    "^[0-9]+$": {
+ "^[0-9]+$": {
-      "type": "string"
+ "type": "string"
-    }
+ }
-  },
+ },
-  "additionalProperties": false
+ "additionalProperties": false
 }
 ```
@ -229,8 +224,7 @@ if/else schema logic, named types, references to remote schemas, and much more.
 for a very powerful schema language. Such features also make for unwieldy definitions. It can be
 challenging to resolve remote schemas, reason about conditional rules, or evolve schemas in a
 forwards or backwards compatible way [^10].
-Similar concerns apply to XML Schema
+Similar concerns apply to XML Schema [^11].
 [^11].
 ### Binary encoding
@ -251,9 +245,9 @@ will need to include the strings `userName`, `favoriteNumber`, and `interests` s
 ```
 {
-    "userName": "Martin",
+ "userName": "Martin",
-    "favoriteNumber": 1337,
+ "favoriteNumber": 1337,
-    "interests": ["daydreaming", "hacking"]
+ "interests": ["daydreaming", "hacking"]
 }
 ```
@ -262,13 +256,13 @@ shows the byte sequence that you get if you encode the JSON document in [Example
 MessagePack. The first few bytes are as follows:
 1. The first byte, `0x83`, indicates that what follows is an object (top four bits = `0x80`) with three
-   fields (bottom four bits = `0x03`). (In case you’re wondering what happens if an object has more
+ fields (bottom four bits = `0x03`). (In case you’re wondering what happens if an object has more
-   than 15 fields, so that the number of fields doesn’t fit in four bits, it then gets a different type
+ than 15 fields, so that the number of fields doesn’t fit in four bits, it then gets a different type
-   indicator, and the number of fields is encoded in two or four bytes.)
+ indicator, and the number of fields is encoded in two or four bytes.)
 2. The second byte, `0xa8`, indicates that what follows is a string (top four bits = `0xa0`) that is eight
-   bytes long (bottom four bits = `0x08`).
+ bytes long (bottom four bits = `0x08`).
 3. The next eight bytes are the field name `userName` in ASCII. Since the length was indicated
-   previously, there’s no need for any marker to tell us where the string ends (or any escaping).
+ previously, there’s no need for any marker to tell us where the string ends (or any escaping).
 4. The next seven bytes encode the six-letter string value `Martin` with a prefix `0xa6`, and so on.
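The first bytes walked through above can be reproduced by hand. The sketch below is not a MessagePack library: it only covers the two cases just described (a small object header and short strings whose length fits in the low four bits of the type byte), and the function names are invented for the example.

```python
def object_header(num_fields):
    """Type byte for an object with up to 15 fields: 0x80 plus the field count."""
    if not 0 <= num_fields <= 15:
        raise ValueError("this sketch only handles objects with up to 15 fields")
    return bytes([0x80 | num_fields])


def short_string(text):
    """Type byte 0xa0 plus the length, followed by the ASCII bytes of the string."""
    data = text.encode("ascii")
    if len(data) > 15:
        raise ValueError("this sketch only handles strings of up to 15 bytes")
    return bytes([0xA0 | len(data)]) + data


# Object with three fields, then the first key and its string value.
encoded = object_header(3) + short_string("userName") + short_string("Martin")
print(encoded.hex())
```

The output starts with `83`, then `a8` followed by the eight bytes of `userName`, then `a6` followed by the six bytes of `Martin`, matching steps 1 to 4.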
 The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the
@ -286,8 +280,7 @@ In the following sections we will see how we can do much better, and encode the
 ## Protocol Buffers
 Protocol Buffers (protobuf) is a binary encoding library developed at Google.
-It is similar to Apache Thrift, which was originally developed by Facebook
+It is similar to Apache Thrift, which was originally developed by Facebook [^13];
 [^13];
 most of what this section says about Protocol Buffers applies also to Thrift.
 Protocol Buffers requires a schema for any data that is encoded. To encode the data
@ -298,9 +291,9 @@ interface definition language (IDL) like this:
 syntax = "proto3";
 message Person {
-    string user_name = 1;
+ string user_name = 1;
-    int64 favorite_number = 2;
+ int64 favorite_number = 2;
-    repeated string interests = 3;
+ repeated string interests = 3;
 }
 ```
@ -381,8 +374,7 @@ value won’t fit in 32 bits, it will be truncated.
 Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers.
 It was started in 2009 as a subproject of Hadoop, as a result of Protocol Buffers not being a good
-fit for Hadoop’s use cases
+fit for Hadoop’s use cases [^15].
 [^15].
 Avro also uses a schema to specify the structure of the data being encoded. It has two schema
 languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily
@ -393,9 +385,9 @@ Our example schema, written in Avro IDL, might look like this:
 ```
 record Person {
-    string               userName;
+ string userName;
-    union { null, long } favoriteNumber = null;
+ union { null, long } favoriteNumber = null;
-    array<string>        interests;
+ array<string> interests;
 }
 ```
@ -403,13 +395,13 @@ The equivalent JSON representation of that schema is as follows:
 ```
 {
-    "type": "record",
+ "type": "record",
-    "name": "Person",
+ "name": "Person",
-    "fields": [
+ "fields": [
-        {"name": "userName",       "type": "string"},
+ {"name": "userName", "type": "string"},
-        {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
+ {"name": "favoriteNumber", "type": ["null", "long"], "default": null},
-        {"name": "interests",      "type": {"type": "array", "items": "string"}}
+ {"name": "interests", "type": {"type": "array", "items": "string"}}
-    ]
+ ]
 }
 ```
@ -455,8 +447,7 @@ application code is expecting, and their types.
 If the reader’s and writer’s schema are the same, decoding is easy. If they are different, Avro
 resolves the differences by looking at the writer’s schema and the reader’s schema side by side and
 translating the data from the writer’s schema into the reader’s schema. The Avro specification
-[[16](/en/ch5#AvroSpec),
+[[^16], [^17]]
 [17](/en/ch5#AvroParsing)]
 defines exactly how this resolution works, and it is illustrated in
 [Figure 5-6](/en/ch5#fig_encoding_avro_resolution).
@ -511,33 +502,32 @@ the space savings from the binary encoding futile.
 The answer depends on the context in which Avro is being used. To give a few examples:
 Large file with lots of records
-:   A common use for Avro is for storing a large file containing millions of records, all encoded with
+: A common use for Avro is for storing a large file containing millions of records, all encoded with
-    the same schema. (We will discuss this kind of situation in [Link to Come].) In this case, the
+ the same schema. (We will discuss this kind of situation in [Link to Come].) In this case, the
-    writer of that file can just include the writer’s schema once at the beginning of the file. Avro
+ writer of that file can just include the writer’s schema once at the beginning of the file. Avro
-    specifies a file format (object container files) to do this.
+ specifies a file format (object container files) to do this.
 Database with individually written records
-:   In a database, different records may be written at different points in time using different
+: In a database, different records may be written at different points in time using different
-    writer’s schemas—you cannot assume that all the records will have the same schema. The simplest
+ writer’s schemas—you cannot assume that all the records will have the same schema. The simplest
-    solution is to include a version number at the beginning of every encoded record, and to keep a
+ solution is to include a version number at the beginning of every encoded record, and to keep a
-    list of schema versions in your database. A reader can fetch a record, extract the version number,
+ list of schema versions in your database. A reader can fetch a record, extract the version number,
-    and then fetch the writer’s schema for that version number from the database. Using that writer’s
+ and then fetch the writer’s schema for that version number from the database. Using that writer’s
-    schema, it can decode the rest of the record.
+ schema, it can decode the rest of the record.
-    Confluent’s schema registry for Apache Kafka
+ Confluent’s schema registry for Apache Kafka
-    [^19]
+ [^19]
-    and LinkedIn’s Espresso
+ and LinkedIn’s Espresso
-    [^20]
+ [^20]
-    work this way, for example.
+ work this way, for example.
 Sending records over a network connection
-:   When two processes are communicating over a bidirectional network connection, they can negotiate
+: When two processes are communicating over a bidirectional network connection, they can negotiate
-    the schema version on connection setup and then use that schema for the lifetime of the
+ the schema version on connection setup and then use that schema for the lifetime of the
-    connection. The Avro RPC protocol (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) works like this.
+ connection. The Avro RPC protocol (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) works like this.
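The version-number approach for individually written records can be sketched as follows. This is a deliberately simplified stand-in, not Avro: the values are serialized as a JSON array instead of Avro's binary encoding, the in-memory `SCHEMAS` dictionary plays the role of the schema database, and all names are hypothetical.

```python
import json

# Hypothetical schema registry: version number -> writer's schema,
# reduced here to an ordered list of field names.
SCHEMAS = {
    1: ["userName", "favoriteNumber"],
    2: ["userName", "favoriteNumber", "interests"],
}


def encode(record, version):
    """Prefix the record with its schema version; field names are not stored."""
    fields = SCHEMAS[version]
    values = [record[name] for name in fields]
    return bytes([version]) + json.dumps(values).encode("utf-8")


def decode(data):
    """Read the version byte, look up the writer's schema, then decode the rest."""
    version = data[0]
    writer_schema = SCHEMAS[version]
    values = json.loads(data[1:].decode("utf-8"))
    return dict(zip(writer_schema, values))


old = encode({"userName": "Martin", "favoriteNumber": 1337}, version=1)
new = encode(
    {"userName": "Martin", "favoriteNumber": 1337, "interests": ["hacking"]},
    version=2,
)

# Records written with different schema versions can sit side by side.
print(decode(old))
print(decode(new))
```

A real reader would go one step further and translate the decoded record from the writer's schema into its own reader's schema; that resolution step is omitted here.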
 A database of schema versions is a useful thing to have in any case, since it acts as documentation
-and gives you a chance to check schema compatibility
+and gives you a chance to check schema compatibility [^21].
 [^21].
 As the version number, you could use a simple incrementing integer, or you could use a hash of the
 schema.
@ -581,13 +571,10 @@ languages.
 The ideas on which these encodings are based are by no means new. For example, they have a lot in
 common with ASN.1, a schema definition language that was first standardized in 1984
-[[23](/en/ch5#Larmouth1999),
+[[^23], [^24]].
 [24](/en/ch5#Kaliski1993)].
 It was used to define various network protocols, and its binary encoding (DER) is still used to encode
-SSL certificates (X.509), for example
+SSL certificates (X.509), for example [^25].
-[^25].
+ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers [^26].
 ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers
 [^26].
 However, it’s also very complex and badly documented, so ASN.1
 is probably not a good choice for new applications.
@ -601,14 +588,14 @@ So, we can see that although textual data formats such as JSON, XML, and CSV are
 encodings based on schemas are also a viable option. They have a number of nice properties:
 * They can be much more compact than the various “binary JSON” variants, since they can omit field
-  names from the encoded data.
+ names from the encoded data.
 * The schema is a valuable form of documentation, and because the schema is required for decoding,
-  you can be sure that it is up to date (whereas manually maintained documentation may easily
+ you can be sure that it is up to date (whereas manually maintained documentation may easily
-  diverge from reality).
+ diverge from reality).
 * Keeping a database of schemas allows you to check forward and backward compatibility of schema
-  changes, before anything is deployed.
+ changes, before anything is deployed.
 * For users of statically typed programming languages, the ability to generate code from the schema
-  is useful, since it enables type-checking at compile time.
+ is useful, since it enables type-checking at compile time.
 In summary, schema evolution allows the same kind of flexibility as schemaless/schema-on-read JSON
 databases provide (see [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)), while also providing better
@ -681,8 +668,7 @@ versions of the schema.
 More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or
 moving some data into a separate table—still require data to be rewritten, often at the application
 level [^27].
-Maintaining forward and backward compatibility across such migrations is still a research problem
+Maintaining forward and backward compatibility across such migrations is still a research problem [^28].
 [^28].
 ### Archival storage
@ -722,8 +708,7 @@ application-specific, and the client and server need to agree on the details of
 In some ways, services are similar to databases: they typically allow clients to submit and query
 data. However, while databases allow arbitrary queries using the query languages we discussed in
 [Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
-that are predetermined by the business logic (application code) of the service
+that are predetermined by the business logic (application code) of the service [^29]. This restriction provides a degree of encapsulation: services can impose
 [^29]. This restriction provides a degree of encapsulation: services can impose
 fine-grained restrictions on what clients can and cannot do.
 A key design goal of a service-oriented/microservices architecture is to make the application easier
@ -742,18 +727,17 @@ perhaps a slight misnomer, because web services are not only used on the web, bu
 different contexts. For example:
 1. A client application running on a user’s device (e.g., a native app on a mobile device, or a
-   JavaScript web app in a browser) making requests to a service over HTTP. These requests typically
+ JavaScript web app in a browser) making requests to a service over HTTP. These requests typically
-   go over the public internet.
+ go over the public internet.
 2. One service making requests to another service owned by the same organization, often located
-   within the same datacenter, as part of a service-oriented/microservices architecture.
+ within the same datacenter, as part of a service-oriented/microservices architecture.
 3. One service making requests to a service owned by a different organization, usually via the
-   internet. This is used for data exchange between different organizations’ backend systems. This
+ internet. This is used for data exchange between different organizations’ backend systems. This
-   category includes public APIs provided by online services, such as credit card processing
+ category includes public APIs provided by online services, such as credit card processing
-   systems, or OAuth for shared access to user data.
+ systems, or OAuth for shared access to user data.
 The most popular service design philosophy is REST, which builds upon the principles of HTTP
-[[30](/en/ch5#Fielding2000),
+[[^30], [^31]].
 [31](/en/ch5#Fielding2008)].
 It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for
 cache control, authentication, and content type negotiation. An API designed according to the
 principles of REST is called *RESTful*.
@ -763,8 +747,7 @@ format to send and expect in response. Even if a service adopts RESTful design p
 need to somehow find out these details. Service developers often use an interface definition
 language (IDL) to define and document their service’s API endpoints and data models, and to evolve
 them over time. Other developers can then use the service definition to determine how to query the
-service. The two most popular service IDLs are OpenAPI (also known as Swagger
+service. The two most popular service IDLs are OpenAPI (also known as Swagger [^32])
 [^32])
 and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send
 and receive Protocol Buffers.
@ -778,25 +761,25 @@ definitions.
 ```
 openapi: 3.0.0
 info:
-  title: Ping, Pong
+ title: Ping, Pong
-  version: 1.0.0
+ version: 1.0.0
 servers:
-  - url: http://localhost:8080
+ - url: http://localhost:8080
 paths:
-  /ping:
+ /ping:
-    get:
+ get:
-      summary: Given a ping, returns a pong message
+ summary: Given a ping, returns a pong message
-      responses:
+ responses:
-        '200':
+ '200':
-          description: A pong
+ description: A pong
-          content:
+ content:
-            application/json:
+ application/json:
-              schema:
+ schema:
-                type: object
+ type: object
-                properties:
+ properties:
-                  message:
+ message:
-                    type: string
+ type: string
-                    example: Pong!
+ example: Pong!
 ```
 Even if a design philosophy and IDL are adopted, developers must still write the code that
@ -815,12 +798,12 @@ from pydantic import BaseModel
 app = FastAPI(title="Ping, Pong", version="1.0.0")
 class PongResponse(BaseModel):
-    message: str = "Pong!"
+ message: str = "Pong!"
 @app.get("/ping", response_model=PongResponse,
-         summary="Given a ping, returns a pong message")
+ summary="Given a ping, returns a pong message")
 async def ping():
-    return PongResponse()
+ return PongResponse()
 ```
 Many frameworks couple service definitions and server code together. In some cases, such as with the
@ -841,50 +824,47 @@ Architecture (CORBA) is excessively complex, and does not provide backward or fo
 compatibility [^33].
 SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are
 also plagued by complexity and compatibility problems
-[[34](/en/ch5#Lacey2006),
+[[^34], [^35], [^36]].
 [35](/en/ch5#Tilkov2006),
 [36](/en/ch5#Bray2004)].
 All of these are based on the idea of a *remote procedure call* (RPC), which has been around since
 the 1970s [^37].
 The RPC model tries to make a request to a remote network service look the same as calling a function or
 method in your programming language, within the same process (this abstraction is called *location
 transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed
-[[38](/en/ch5#Waldo1994),
+[[^38], [^39]].
 [39](/en/ch5#Vinoski2008)].
 A network request is very different from a local function call:
 * A local function call is predictable and either succeeds or fails, depending only on parameters
-  that are under your control. A network request is unpredictable: the request or response may be
+ that are under your control. A network request is unpredictable: the request or response may be
-  lost due to a network problem, or the remote machine may be slow or unavailable, and such problems
+ lost due to a network problem, or the remote machine may be slow or unavailable, and such problems
-  are entirely outside of your control. Network problems are common, so you have to anticipate them,
+ are entirely outside of your control. Network problems are common, so you have to anticipate them,
-  for example by retrying a failed request.
+ for example by retrying a failed request.
 * A local function call either returns a result, or throws an exception, or never returns (because
-  it goes into an infinite loop or the process crashes). A network request has another possible
+ it goes into an infinite loop or the process crashes). A network request has another possible
-  outcome: it may return without a result, due to a *timeout*. In that case, you simply don’t know
+ outcome: it may return without a result, due to a *timeout*. In that case, you simply don’t know
-  what happened: if you don’t get a response from the remote service, you have no way of knowing
+ what happened: if you don’t get a response from the remote service, you have no way of knowing
-  whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/en/ch9#ch_distributed).)
+ whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/en/ch9#ch_distributed).)
 * If you retry a failed network request, it could happen that the previous request actually got
-  through, and only the response was lost.
+ through, and only the response was lost.
-  In that case, retrying will cause the action to
+ In that case, retrying will cause the action to
-  be performed multiple times, unless you build a mechanism for deduplication (*idempotence*) into
+ be performed multiple times, unless you build a mechanism for deduplication (*idempotence*) into
-  the protocol [^40].
+ the protocol [^40].
-  Local function calls don’t have this problem. (We discuss idempotence in more detail
+ Local function calls don’t have this problem. (We discuss idempotence in more detail
-  in [Link to Come].)
+ in [Link to Come].)
 * Every time you call a local function, it normally takes about the same time to execute. A network
-  request is much slower than a function call, and its latency is also wildly variable: at good
+ request is much slower than a function call, and its latency is also wildly variable: at good
-  times it may complete in less than a millisecond, but when the network is congested or the remote
+ times it may complete in less than a millisecond, but when the network is congested or the remote
-  service is overloaded it may take many seconds to do exactly the same thing.
+ service is overloaded it may take many seconds to do exactly the same thing.
 * When you call a local function, you can efficiently pass it references (pointers) to objects in
-  local memory. When you make a network request, all those parameters need to be encoded into a
+ local memory. When you make a network request, all those parameters need to be encoded into a
-  sequence of bytes that can be sent over the network. That’s okay if the parameters are immutable
+ sequence of bytes that can be sent over the network. That’s okay if the parameters are immutable
-  primitives like numbers or short strings, but it quickly becomes problematic with larger amounts
+ primitives like numbers or short strings, but it quickly becomes problematic with larger amounts
-  of data and mutable objects.
+ of data and mutable objects.
 * The client and the service may be implemented in different programming languages, so the RPC
-  framework must translate datatypes from one language into another. This can end up ugly, since not
+ framework must translate datatypes from one language into another. This can end up ugly, since not
-  all languages have the same types—recall JavaScript’s problems with numbers greater than 2^53,
+ all languages have the same types—recall JavaScript’s problems with numbers greater than 2^53,
-  for example (see [“JSON, XML, and Binary Variants”](/en/ch5#sec_encoding_json)). This problem doesn’t exist in a single process written in
+ for example (see [“JSON, XML, and Binary Variants”](/en/ch5#sec_encoding_json)). This problem doesn’t exist in a single process written in
-  a single language.
+ a single language.
 All of these factors mean that there’s no point trying to make a remote service look too much like a
 local object in your programming language, because it’s a fundamentally different thing. Part of the
@ -906,43 +886,43 @@ across these instances is called *load balancing*
 There are many load balancing and service discovery solutions available:
 * *Hardware load balancers* are specialized pieces of equipment that are installed in data centers.
-  They allow clients to connect to a single host and port, and incoming connections are routed to
+ They allow clients to connect to a single host and port, and incoming connections are routed to
-  one of the servers running the service. Such load balancers detect network failures when
+ one of the servers running the service. Such load balancers detect network failures when
-  connecting to a downstream server and shift the traffic to other servers.
+ connecting to a downstream server and shift the traffic to other servers.
 * *Software load balancers* behave in much the same way as hardware load balancers. But rather than
-  requiring a special appliance, software load balancers such as Nginx and HAProxy are applications
+ requiring a special appliance, software load balancers such as Nginx and HAProxy are applications
-  that can be installed on a standard machine.
+ that can be installed on a standard machine.
 * The *domain name service (DNS)* is how domain names are resolved on the Internet when you open a
-  webpage. It supports load balancing by allowing multiple IP addresses to be associated with a
+ webpage. It supports load balancing by allowing multiple IP addresses to be associated with a
-  single domain name. Clients can then be configured to connect to a service using a domain name
+ single domain name. Clients can then be configured to connect to a service using a domain name
-  rather than IP address, and the client’s network layer picks which IP address to use when making a
+ rather than IP address, and the client’s network layer picks which IP address to use when making a
-  connection. One drawback of this approach is that DNS is designed to propagate changes over longer
+ connection. One drawback of this approach is that DNS is designed to propagate changes over longer
-  periods of time, and to cache DNS entries. If servers are started, stopped, or moved frequently,
+ periods of time, and to cache DNS entries. If servers are started, stopped, or moved frequently,
-  clients might see stale IP addresses that no longer have a server running on them.
+ clients might see stale IP addresses that no longer have a server running on them.
 * *Service discovery systems* use a centralized registry rather than DNS to track which service
-  endpoints are available. When a new service instance starts up, it registers itself with the
+ endpoints are available. When a new service instance starts up, it registers itself with the
-  service discovery system by declaring the host and port it’s listening on, along with relevant
+ service discovery system by declaring the host and port it’s listening on, along with relevant
-  metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location,
+ metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location,
-  and more. The service then periodically sends a heartbeat signal to the discovery system to signal
+ and more. The service then periodically sends a heartbeat signal to the discovery system to signal
-  that the service is still available.
+ that the service is still available.
-  When a client wishes to connect to a service, it first queries the discovery system to get a list of
+ When a client wishes to connect to a service, it first queries the discovery system to get a list of
-  available endpoints, and then connects directly to the endpoint. Compared to DNS, service discovery
+ available endpoints, and then connects directly to the endpoint. Compared to DNS, service discovery
-  supports a much more dynamic environment where service instances change frequently. Discovery
+ supports a much more dynamic environment where service instances change frequently. Discovery
-  systems also give clients more metadata about the service they’re connecting to, which enables
+ systems also give clients more metadata about the service they’re connecting to, which enables
-  clients to make smarter load balancing decisions.
+ clients to make smarter load balancing decisions.
 * *Service meshes* are a sophisticated form of load balancing that combine software load balancers
-  and service discovery. Unlike traditional software load balancers, which run on a separate
+ and service discovery. Unlike traditional software load balancers, which run on a separate
-  machine, service mesh load balancers are typically deployed as an in-process client library or as
+ machine, service mesh load balancers are typically deployed as an in-process client library or as
-  a process or “sidecar” container on both the client and server. Client applications connect
+ a process or “sidecar” container on both the client and server. Client applications connect
-  to their own local service load balancer, which connects to the server’s load balancer. From
+ to their own local service load balancer, which connects to the server’s load balancer. From
-  there, the connection is routed to the local server process.
+ there, the connection is routed to the local server process.
-  Though complicated, this topology offers a number of advantages. Because the clients and servers are
+ Though complicated, this topology offers a number of advantages. Because the clients and servers are
-  routed entirely through local connections, connection encryption can be handled entirely at the load
+ routed entirely through local connections, connection encryption can be handled entirely at the load
-  balancer level. This shields clients and servers from having to deal with the complexities of SSL
+ balancer level. This shields clients and servers from having to deal with the complexities of SSL
-  certificates and TLS. Mesh systems also provide sophisticated observability. They can track which
+ certificates and TLS. Mesh systems also provide sophisticated observability. They can track which
-  services are calling each other in realtime, detect failures, track traffic load, and more.
+ services are calling each other in realtime, detect failures, track traffic load, and more.
 Which solution is appropriate depends on an organization’s needs. Those running in a very dynamic
 service environment with an orchestrator such as Kubernetes often choose to run a service mesh such
@ -962,10 +942,10 @@ The backward and forward compatibility properties of an RPC scheme are inherited
 encoding it uses:
 * gRPC (Protocol Buffers) and Avro RPC can be evolved according to the compatibility rules of the
-  respective encoding format.
+ respective encoding format.
 * RESTful APIs most commonly use JSON for responses, and JSON or URI-encoded/form-encoded request
-  parameters for requests. Adding optional request parameters and adding new fields to response
+ parameters for requests. Adding optional request parameters and adding new fields to response
-  objects are usually considered changes that maintain compatibility.
+ objects are usually considered changes that maintain compatibility.
 Service compatibility is made harder by the fact that RPC is often used for communication across
 organizational boundaries, so the provider of a service often has no control over its clients and
@ -978,8 +958,7 @@ version of the API it wants to use [^42]).
 For RESTful APIs, common approaches are to use a version
 number in the URL or in the HTTP `Accept` header. For services that use API keys to identify a
 particular client, another option is to store a client’s requested API version on the server and to
-allow this version selection to be updated through a separate administrative interface
+allow this version selection to be updated through a separate administrative interface [^43].
 [^43].
 ## Durable Execution and Workflows
@ -994,8 +973,7 @@ the credit card, and call the banking service to deposit debited funds, as shown
 [Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*.
 Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a
 general-purpose programming language, a domain specific language (DSL), or a markup language such as
-Business Process Execution Language (BPEL)
+Business Process Execution Language (BPEL) [^44].
 [^44].
 # Tasks, Activities, and Functions
@ -1038,8 +1016,7 @@ task fails, the framework will re-execute the task, but will skip any RPC calls
 that the task made successfully before failing. Instead, the framework will pretend to make the
 call, but will instead return the results from the previous call. This is possible because durable
 execution frameworks log all RPCs and state changes to durable storage like a write-ahead log
-[[45](/en/ch5#TemporalService),
+[[^45], [^46]].
 [46](/en/ch5#Ewen2023)].
 [Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
 using Temporal.
@ -1048,35 +1025,32 @@ using Temporal.
 ```
 @workflow.defn
 class PaymentWorkflow:
-    @workflow.run
+ @workflow.run
-    async def run(self, payment: PaymentRequest) -> PaymentResult:
+ async def run(self, payment: PaymentRequest) -> PaymentResult:
-        is_fraud = await workflow.execute_activity(
+ is_fraud = await workflow.execute_activity(
-            check_fraud,
+ check_fraud,
-            payment,
+ payment,
-            start_to_close_timeout=timedelta(seconds=15),
+ start_to_close_timeout=timedelta(seconds=15),
-        )
+ )
-        if is_fraud:
+ if is_fraud:
-            return PaymentResultFraudulent
+ return PaymentResultFraudulent
-        credit_card_response = await workflow.execute_activity(
+ credit_card_response = await workflow.execute_activity(
-            debit_credit_card,
+ debit_credit_card,
-            payment,
+ payment,
-            start_to_close_timeout=timedelta(seconds=15),
+ start_to_close_timeout=timedelta(seconds=15),
-        )
+ )
-        # ...
+ # ...
 ```
 Frameworks like Temporal are not without their challenges. External services, such as the
 third-party payment gateway in our example, must still provide an idempotent API. Developers must
-remember to use unique IDs for these APIs to prevent duplicate execution
+remember to use unique IDs for these APIs to prevent duplicate execution [^47].
 [^47].
 And because durable execution frameworks log each RPC call in order, it expects a subsequent
 execution to make the same RPC calls in the same order. This makes code changes brittle: you
-might introduce undefined behavior simply by re-ordering function calls
+might introduce undefined behavior simply by re-ordering function calls [^48].
 [^48].
 Instead of modifying the code of an existing workflow, it is safer to deploy a new version of the
 code separately, so that re-executions of existing workflow invocations continue to use the old
-version, and only new invocations use the new code
+version, and only new invocations use the new code [^49].
 [^49].
 Similarly, because durable execution frameworks expect to replay all code deterministically (the
 same inputs produce the same outputs), nondeterministic code such as random number generators or
@ -1097,20 +1071,19 @@ how encoded data can flow from one process to another. A request is called an *e
 unlike RPC, the sender usually does not wait for the recipient to process the event. Moreover,
 events are typically not sent to the recipient via a direct network connection, but go via an
 intermediary called a *message broker* (also called an *event broker*, *message queue*, or
-*message-oriented middleware*), which stores the message temporarily.
+*message-oriented middleware*), which stores the message temporarily [^50].
 [^50].
 Using a message broker has several advantages compared to direct RPC:
 * It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system
-  reliability.
+ reliability.
 * It can automatically redeliver messages to a process that has crashed, and thus prevent messages from
-  being lost.
+ being lost.
 * It avoids the need for service discovery, since senders do not need to directly connect to the IP
-  address of the recipient.
+ address of the recipient.
 * It allows the same message to be sent to several recipients.
 * It logically decouples the sender from the recipient (the sender just publishes messages and
-  doesn’t care who consumes them).
+ doesn’t care who consumes them).
 The communication via a message broker is *asynchronous*: the sender doesn’t wait for the message to
 be delivered, but simply sends it and then forgets about it. It’s possible to implement a
@ -1128,15 +1101,15 @@ The detailed delivery semantics vary by implementation and configuration, but in
 message distribution patterns are most often used:
 * One process adds a message to a named *queue*, and the broker delivers that message to a
-  *consumer* of that queue. If there are multiple consumers, one of them receives the message.
+ *consumer* of that queue. If there are multiple consumers, one of them receives the message.
 * One process publishes a message to a named *topic*, and the broker delivers that message to all
-  *subscribers* of that topic. If there are multiple subscribers, they all receive the message.
+ *subscribers* of that topic. If there are multiple subscribers, they all receive the message.
 Message brokers typically don’t enforce any particular data model—a message is just a sequence of
 bytes with some metadata, so you can use any encoding format. A common approach is to use Protocol
 Buffers, Avro, or JSON, and to deploy a schema registry alongside the message broker to store all
 the valid schema versions and check their compatibility
-[[19](/en/ch5#ConfluentSchemaReg), [21](/en/ch5#Kreps2015)].
+[[^19], [^21]].
 AsyncAPI, a messaging-based equivalent of OpenAPI, can also be used to specify the schema of
 messages.
@ -1160,8 +1133,7 @@ sending and receiving asynchronous messages. Message delivery is not guaranteed:
 scenarios, messages will be lost. Since each actor processes only one message at a time, it doesn’t
 need to worry about threads, and each actor can be scheduled independently by the framework.
-In *distributed actor frameworks* such as Akka, Orleans
+In *distributed actor frameworks* such as Akka, Orleans [^51],
 [^51],
 and Erlang/OTP, this programming model is used to scale an application across
 multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient
 are on the same node or different nodes. If they are on different nodes, the message is
@ -1178,7 +1150,7 @@ application, you still have to worry about forward and backward compatibility, a
 sent from a node running the new version to a node running the old version, and vice versa. This can
 be achieved by using one of the encodings discussed in this chapter.
-# Summary
+## Summary
 In this chapter we looked at several ways of turning data structures into bytes on the network or
 bytes on disk. We saw how the details of these encodings affect not only their efficiency, but more
@ -1199,33 +1171,34 @@ read old data) and forward compatibility (old code can read new data).
 We discussed several data encoding formats and their compatibility properties:
 * Programming language–specific encodings are restricted to a single programming language and often
-  fail to provide forward and backward compatibility.
+ fail to provide forward and backward compatibility.
 * Textual formats like JSON, XML, and CSV are widespread, and their compatibility depends on how you
-  use them. They have optional schema languages, which are sometimes helpful and sometimes a
+ use them. They have optional schema languages, which are sometimes helpful and sometimes a
-  hindrance. These formats are somewhat vague about datatypes, so you have to be careful with things
+ hindrance. These formats are somewhat vague about datatypes, so you have to be careful with things
-  like numbers and binary strings.
+ like numbers and binary strings.
 * Binary schema–driven formats like Protocol Buffers and Avro allow compact, efficient encoding with
-  clearly defined forward and backward compatibility semantics. The schemas can be useful for
+ clearly defined forward and backward compatibility semantics. The schemas can be useful for
-  documentation and code generation in statically typed languages. However, these formats have the
+ documentation and code generation in statically typed languages. However, these formats have the
-  downside that data needs to be decoded before it is human-readable.
+ downside that data needs to be decoded before it is human-readable.
 We also discussed several modes of dataflow, illustrating different scenarios in which data
 encodings are important:
 * Databases, where the process writing to the database encodes the data and the process reading
-  from the database decodes it
+ from the database decodes it
 * RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes
-  a response, and the client finally decodes the response
+ a response, and the client finally decodes the response
 * Event-driven architectures (using message brokers or actors), where nodes communicate by sending
-  each other messages that are encoded by the sender and decoded by the recipient
+ each other messages that are encoded by the sender and decoded by the recipient
 We can conclude that with a bit of care, backward/forward compatibility and rolling upgrades are
 quite achievable. May your application’s evolution be rapid and your deployments be frequent.
 ##### Footnotes
-##### References
+
 ### Summary
 [^1]: [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*, July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y) 
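Note on the ch5.md hunks above: the RPC bullet on retries says that a retried request may be executed more than once unless the protocol includes deduplication (*idempotence*), and the durable execution paragraph asks developers to use unique IDs for the same reason. The following is a minimal sketch of that idea, not code from the book. `PaymentService`, `debit`, and the `idempotency_key` parameter are hypothetical names chosen for illustration, and the in-memory dictionary stands in for whatever durable storage a real service would need.

```python
import uuid


class PaymentService:
    """Hypothetical server that deduplicates requests by idempotency key."""

    def __init__(self):
        self.balance = 100
        # Maps idempotency key -> result of the first execution.
        # Assumption: a real service would keep this in durable storage.
        self._results = {}

    def debit(self, idempotency_key, amount):
        # A retried request carries the same key, so return the stored
        # result instead of debiting a second time.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance -= amount
        result = {"status": "ok", "balance": self.balance}
        self._results[idempotency_key] = result
        return result


service = PaymentService()
key = str(uuid.uuid4())         # the client generates one key per logical request
first = service.debit(key, 30)  # suppose the response to this call is lost
retry = service.debit(key, 30)  # the client retries with the same key
```

Because the client reuses the key on retry, the second call returns the recorded result and the account is debited only once. A request with a fresh key would be treated as a new operation.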
--- a/content/en/ch6.md
+++ b/content/en/ch6.md
@ -11,7 +11,7 @@ breadcrumbs: false
 > Douglas Adams, *Mostly Harmless* (1992)
 *Replication* means keeping a copy of the same data on multiple machines that are connected via a
-network. As discussed in [“Distributed versus Single-Node Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_distributed), there are several reasons
+network. As discussed in [“Distributed versus Single-Node Systems”](/ch01.html#sec_introduction_distributed), there are several reasons
 why you might want to replicate data:
 * To keep data geographically close to your users (and thus reduce access latency)
@ -19,7 +19,7 @@ why you might want to replicate data:
 * To scale out the number of machines that can serve read queries (and thus increase read throughput)
 In this chapter we will assume that your dataset is small enough that each machine can hold a copy of
-the entire dataset. In [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding) we will relax that assumption and discuss *sharding*
+the entire dataset. In [Chapter 7](/ch07.html#ch_sharding) we will relax that assumption and discuss *sharding*
 (*partitioning*) of datasets that are too big for a single machine. In later chapters we will discuss
 various kinds of faults that can occur in a replicated data system, and how to deal with them.
@ -36,10 +36,8 @@ in databases, and although the details vary by database, the general principles
 many different implementations. We will discuss the consequences of such choices in this chapter.
 Replication of databases is an old topic—the principles haven’t changed much since they were
-studied in the 1970s
+studied in the 1970s [^1], because the fundamental constraints of networks have remained the same. Despite being so old,
-[^1],
+concepts such as *eventual consistency* still cause confusion. In [“Problems with Replication Lag”](/ch06.html#sec_replication_lag) we will
 because the fundamental constraints of networks have remained the same. Despite being so old,
 concepts such as *eventual consistency* still cause confusion. In [“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag) we will
 get more precise about eventual consistency and discuss things like the *read-your-writes* and
 *monotonic reads* guarantees.
@ -52,7 +50,7 @@ delete some data, replication doesn’t help since the deletion will have also b
 replicas, so you need a backup if you want to restore the deleted data.
 In fact, replication and backups are often complementary to each other. Backups are sometimes part
-of the process of setting up replication, as we shall see in [“Setting Up New Followers”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_new_replica).
+of the process of setting up replication, as we shall see in [“Setting Up New Followers”](/ch06.html#sec_replication_new_replica).
 Conversely, archiving replication logs can be part of a backup process.
 Some databases internally maintain immutable snapshots of past states, which serve as a kind of
@ -69,7 +67,7 @@ question inevitably arises: how do we ensure that all the data ends up on all th
 Every write to the database needs to be processed by every replica; otherwise, the replicas would no
 longer contain the same data. The most common solution is called *leader-based replication*,
 *primary-backup*, or *active/passive*. It works as follows (see
-[Figure 6-1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_leader_follower)):
+[Figure 6-1](/ch06.html#fig_replication_leader_follower)):
 1. One of the replicas is designated the *leader* (also known as *primary* or *source*
   [^2]).
@ -88,9 +86,9 @@ longer contain the same data. The most common solution is called *leader-based r
 ###### Figure 6-1. Single-leader replication directs all writes to a designated leader, which sends a stream of changes to the follower replicas.
-If the database is sharded (see [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding)), each shard has one leader. Different shards may
+If the database is sharded (see [Chapter 7](/ch07.html#ch_sharding)), each shard has one leader. Different shards may
 have their leaders on different nodes, but each shard must nevertheless have one leader node. In
-[“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader) we will discuss an alternative model in which a system may have
+[“Multi-Leader Replication”](/ch06.html#sec_replication_multi_leader) we will discuss an alternative model in which a system may have
 multiple leaders for the same shard at the same time.
 Single-leader replication is very widely used. It’s a built-in feature of many relational databases,
@ -106,7 +104,7 @@ Many consensus algorithms such as Raft, which is used for replication in Cockroa
 TiDB [^7],
 etcd, and RabbitMQ quorum queues (among others), are also based on a single leader, and
 automatically elect a new leader if the old one fails (we will discuss consensus in more detail in
-[Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency)).
+[Chapter 10](/ch10.html#ch_consistency)).
 > [!NOTE]
 > In older documents you may see the term *master–slave replication*. It means the same as
@ -119,17 +117,17 @@ An important detail of a replicated system is whether the replication happens *s
 *asynchronously*. (In relational databases, this is often a configurable option; other systems are
 often hardcoded to be either one or the other.)
-Think about what happens in [Figure 6-1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_leader_follower), where the user of a website updates
+Think about what happens in [Figure 6-1](/ch06.html#fig_replication_leader_follower), where the user of a website updates
 their profile image. At some point in time, the client sends the update request to the leader;
 shortly afterward, it is received by the leader. At some point, the leader forwards the data change
 to the followers. Eventually, the leader notifies the client that the update was successful.
-[Figure 6-2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_sync_replication) shows one possible way how the timings could work out.
+[Figure 6-2](/ch06.html#fig_replication_sync_replication) shows one possible way how the timings could work out.
 ![ddia 0602](/fig/ddia_0602.png)
 ###### Figure 6-2. Leader-based replication with one synchronous and one asynchronous follower.
-In the example of [Figure 6-2](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_sync_replication), the replication to follower 1 is
+In the example of [Figure 6-2](/ch06.html#fig_replication_sync_replication), the replication to follower 1 is
 *synchronous*: the leader waits until follower 1 has confirmed that it received the write before
 reporting success to the user, and before making the write visible to other clients. The replication
 to follower 2 is *asynchronous*: the leader sends the message, but doesn’t wait for a response from
@ -159,9 +157,9 @@ called *semi-synchronous*.
 In some systems, a *majority* (e.g., 3 out of 5 replicas, including the leader) of replicas is
 updated synchronously, and the remaining minority is asynchronous. This is an example of a *quorum*,
-which we will discuss further in [“Quorums for reading and writing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_quorum_condition). Majority quorums are often
+which we will discuss further in [“Quorums for reading and writing”](/ch06.html#sec_replication_quorum_condition). Majority quorums are often
 used in systems that use a consensus protocol for automatic leader election, which we will return to
-in [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency).
+in [Chapter 10](/ch10.html#ch_consistency).
 Sometimes, leader-based replication is configured to be completely asynchronous. In this case, if the
 leader fails and is not recoverable, any writes that have not yet been replicated to followers are
@ -172,7 +170,7 @@ processing writes, even if all of its followers have fallen behind.
 Weakening durability may sound like a bad trade-off, but asynchronous replication is nevertheless
 widely used, especially if there are many followers or if they are geographically distributed
 [^9].
-We will return to this issue in [“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag).
+We will return to this issue in [“Problems with Replication Lag”](/ch06.html#sec_replication_lag).
 ## Setting Up New Followers
@ -224,8 +222,8 @@ for live queries. Storing database data in object storage has many benefits:
  durability guarantees. This also allows databases to bypass inter-zone network fees.
 * Databases can use an object store’s *conditional write* feature—essentially, a *compare-and-set*
  (CAS) operation—to implement transactions and leadership election
-  [[10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Morling2024_ch6),
+  [[10](/ch06.html#Morling2024_ch6),
-  [11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Chandramohan2024)]).
+  [11](/ch06.html#Chandramohan2024)]).
 * Storing data from multiple databases in the same object store can simplify data integration,
  particularly when open formats such as Apache Parquet and Apache Iceberg are used.
@ -312,10 +310,10 @@ consists of the following steps:
   [^13].
   The best candidate for leadership is usually the replica with the most up-to-date data changes
   from the old leader (to minimize any data loss). Getting all the nodes to agree on a new leader
-   is a consensus problem, discussed in detail in [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency).
+   is a consensus problem, discussed in detail in [Chapter 10](/ch10.html#ch_consistency).
 3. *Reconfiguring the system to use the new leader.* Clients now need to send
   their write requests to the new leader (we discuss this
-   in [“Request Routing”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#sec_sharding_routing)). If the old leader comes back, it might still believe that it is
+   in [“Request Routing”](/ch07.html#sec_sharding_routing)). If the old leader comes back, it might still believe that it is
   the leader, not realizing that the other replicas have
   forced it to step down. The system needs to ensure that the old leader becomes a follower and
   recognizes the new leader.
@ -337,10 +335,10 @@ Failover is fraught with things that can go wrong:
  primary keys that were previously assigned by the old leader. These primary keys were also used in
  a Redis store, so the reuse of primary keys resulted in inconsistency between MySQL and Redis,
  which caused some private data to be disclosed to the wrong users.
-* In certain fault scenarios (see [Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed)), it could happen that two nodes both believe
+* In certain fault scenarios (see [Chapter 9](/ch09.html#ch_distributed)), it could happen that two nodes both believe
  that they are the leader. This situation is called *split brain*, and it is dangerous: if both
  leaders accept writes, and there is no process for resolving conflicts (see
-  [“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some
+  [“Multi-Leader Replication”](/ch06.html#sec_replication_multi_leader)), data is likely to be lost or corrupted. As a safety catch, some
  systems have a mechanism to shut down one node if two leaders are detected. However, if this
  mechanism is not carefully designed, you can end up with both nodes being shut down
  [^15].
@ -356,7 +354,7 @@ Failover is fraught with things that can go wrong:
 > [!NOTE]
 > Guarding against split brain by limiting or shutting down old leaders is known as *fencing* or, more
 > emphatically, *Shoot The Other Node In The Head* (STONITH). We will discuss fencing in more detail
-> in [“Distributed Locks and Leases”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_lock_fencing).
+> in [“Distributed Locks and Leases”](/ch09.html#sec_distributed_lock_fencing).
 There are no easy solutions to these problems. For this reason, some operations teams prefer to
 perform failovers manually, even if the software supports automatic failover.
@ -370,7 +368,7 @@ behind by several days could be catastrophic.
 These issues—node failures; unreliable networks; and trade-offs around replica consistency,
 durability, availability, and latency—are in fact fundamental problems in distributed systems.
-In [Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed) and [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency) we will discuss them in greater depth.
+In [Chapter 9](/ch09.html#ch_distributed) and [Chapter 10](/ch10.html#ch_consistency) we will discuss them in greater depth.
 ## Implementation of Replication Logs
@ -401,9 +399,9 @@ break down:
 It is possible to work around those issues—for example, the leader can replace any nondeterministic
 function calls with a fixed return value when the statement is logged so that the followers all get
 the same value. The idea of executing deterministic statements in a fixed order is similar to the
-event sourcing model that we previously discussed in [“Event Sourcing and CQRS”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_events). This approach is
+event sourcing model that we previously discussed in [“Event Sourcing and CQRS”](/ch03.html#sec_datamodels_events). This approach is
 also known as *state machine replication*, and we will discuss the theory behind it in
-[“Using shared logs”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_smr).
+[“Using shared logs”](/ch10.html#sec_consistency_smr).
 Statement-based replication was used in MySQL before version 5.1. It is still sometimes used today,
 as it is quite compact, but by default MySQL now switches to row-based replication (discussed shortly) if
@ -415,7 +413,7 @@ replication methods.
 ### Write-ahead log (WAL) shipping
-In [Chapter 4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch04.html#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
+In [Chapter 4](/ch04.html#ch_storage) we saw that a write-ahead log is needed to make B-tree storage engines robust:
 every modification is first written to the WAL so that the tree can be restored to a consistent
 state after a crash. Since the WAL contains all the information necessary to restore the indexes and
 heap into a consistent state, we can use the exact same log to build a replica on another node:
@ -423,8 +421,8 @@ besides writing the log to disk, the leader also sends it across the network to
 the follower processes this log, it builds a copy of the exact same files as found on the leader.
 This method of replication is used in PostgreSQL and Oracle, among others
-[[17](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Suzuki2017_ch6),
+[[17](/ch06.html#Suzuki2017_ch6),
-[18](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Kapila2012)].
+[18](/ch06.html#Kapila2012)].
 The main disadvantage is that the log describes the data on a very low level: a WAL contains details
 of which bytes were changed in which disk blocks. This makes replication tightly coupled to the
 storage engine. If the database changes its storage format from one version to another, it is
@ -476,7 +474,7 @@ This technique is called *change data capture*, and we will return to it in [Lin
 # Problems with Replication Lag
 Being able to tolerate node failures is just one reason for wanting replication. As mentioned
-in [“Distributed versus Single-Node Systems”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch01.html#sec_introduction_distributed), other reasons are scalability (processing more
+in [“Distributed versus Single-Node Systems”](/ch01.html#sec_introduction_distributed), other reasons are scalability (processing more
 requests than a single machine can handle) and latency (placing replicas geographically closer to
 users).
@ -528,7 +526,7 @@ be read from a follower. This is especially appropriate if data is frequently vi
 occasionally written.
 With asynchronous replication, there is a problem, illustrated in
-[Figure 6-3](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_read_your_writes): if the user views the data shortly after making a write, the
+[Figure 6-3](/ch06.html#fig_replication_read_your_writes): if the user views the data shortly after making a write, the
 new data may not yet have reached the replica. To the user, it looks as though the data they
 submitted was lost, so they will be understandably unhappy.
@ -568,7 +566,7 @@ are various possible techniques. To mention a few:
  [^26].
  The timestamp could be a *logical timestamp* (something that indicates ordering of writes, such as
  the log sequence number) or the actual system clock (in which case clock synchronization becomes
-  critical; see [“Unreliable Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_clocks)).
+  critical; see [“Unreliable Clocks”](/ch09.html#sec_distributed_clocks)).
 * If your replicas are distributed across regions (for geographical proximity to users or for
  availability), there is additional complexity. Any request that needs to be served by the leader
  must be routed to the region that contains the leader.
@ -604,7 +602,7 @@ zonal outages where one zone goes offline, but they do not protect against regio
 all zones in a region are unavailable. To survive a regional outage, a distributed system must be
 deployed across multiple regions, which can result in higher latencies, lower throughput, and
 increased cloud networking bills. We will discuss these tradeoffs more in
-[“Multi-leader replication topologies”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_topologies). For now, just know that when we say region, we mean a collection of
+[“Multi-leader replication topologies”](/ch06.html#sec_replication_topologies). For now, just know that when we say region, we mean a collection of
 zones/datacenters in a single geographic location.
 ## Monotonic Reads
@ -613,7 +611,7 @@ Our second example of an anomaly that can occur when reading from asynchronous f
 possible for a user to see things *moving backward in time*.
 This can happen if a user makes several reads from different replicas. For example,
-[Figure 6-4](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower
+[Figure 6-4](/ch06.html#fig_replication_monotonic_reads) shows user 2345 making the same query twice, first to a follower
 with little lag, then to a follower with greater lag. (This scenario is quite likely if the user
 refreshes a web page, and each request is routed to a random server.) The first query returns a
 comment that was recently added by user 1234, but the second query doesn’t return anything because
@ -654,7 +652,7 @@ answered it.
 Now, imagine a third person is listening to this conversation through followers. The things said by
 Mrs. Cake go through a follower with little lag, but the things said by Mr. Poons have a longer
-replication lag (see [Figure 6-5](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_consistent_prefix)). This observer would hear the following:
+replication lag (see [Figure 6-5](/ch06.html#fig_replication_consistent_prefix)). This observer would hear the following:
 Mrs. Cake
 :   About ten seconds usually, Mr. Poons.
@ -676,7 +674,7 @@ writes happens in a certain order, then anyone reading those writes will see the
 order.
 This is a particular problem in sharded (partitioned) databases, which we will discuss in
-[Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding). If the database always applies writes in the same order, reads always see a
+[Chapter 7](/ch07.html#ch_sharding). If the database always applies writes in the same order, reads always see a
 consistent prefix, so this anomaly cannot happen. However, in many distributed databases, different
 shards operate independently, so there is no global ordering of writes: when a user reads from the
 database, they may see some parts of the database in an older state and some in a newer state.
@ -684,7 +682,7 @@ database, they may see some parts of the database in an older state and some in
 One solution is to make sure that any writes that are causally related to each other are written to
 the same shard—but in some applications that cannot be done efficiently. There are also algorithms
 that explicitly keep track of causal dependencies, a topic that we will return to in
-[“The “happens-before” relation and concurrency”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_happens_before).
+[“The “happens-before” relation and concurrency”](/ch06.html#sec_replication_happens_before).
 ## Solutions for Replication Lag
@ -700,15 +698,15 @@ synchronously updated follower. However, dealing with these issues in applicatio
 and easy to get wrong.
 The simplest programming model for application developers is to choose a database that provides a
-strong consistency guarantee for replicas such as linearizability (see [Chapter 10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#ch_consistency)), and ACID
+strong consistency guarantee for replicas such as linearizability (see [Chapter 10](/ch10.html#ch_consistency)), and ACID
-transactions (see [Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions)). This allows you to mostly ignore the challenges that arise
+transactions (see [Chapter 8](/ch08.html#ch_transactions)). This allows you to mostly ignore the challenges that arise
 from replication, and treat the database as if it had just a single node. In the early 2010s the
 *NoSQL* movement promoted the view that these features limited scalability, and that large-scale
 systems would have to embrace eventual consistency.
 However, since then, a number of databases started providing strong consistency and transactions
 while also offering the fault tolerance, high availability, and scalability advantages of a
-distributed database. As mentioned in [“Relational Model versus Document Model”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_history), this trend is known as *NewSQL* to
+distributed database. As mentioned in [“Relational Model versus Document Model”](/ch03.html#sec_datamodels_history), this trend is known as *NewSQL* to
 contrast with NoSQL (although it’s less about SQL specifically, and more about new approaches to
 scalable transaction management).
@ -758,7 +756,7 @@ single-leader replication, the leader has to be in *one* of the regions, and all
 through that region.
 In a multi-leader configuration, you can have a leader in *each* region.
-[Figure 6-6](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_multi_dc) shows what this architecture might look like. Within each region,
+[Figure 6-6](/ch06.html#fig_replication_multi_dc) shows what this architecture might look like. Within each region,
 regular leader–follower replication is used (with followers maybe in a different availability zone
 from the leader); between regions, each region’s leader replicates its changes to the leaders in
 other regions.
@ -798,7 +796,7 @@ Tolerance of network problems
 Consistency
 :   A single-leader system can provide strong consistency guarantees, such as serializable
-    transactions, which we will discuss in [Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions). The biggest downside of multi-leader
+    transactions, which we will discuss in [Chapter 8](/ch08.html#ch_transactions). The biggest downside of multi-leader
    systems is that the consistency they can achieve is much weaker. For example, you can’t guarantee
    that a bank account won’t go negative or that a username is unique: it’s always possible for
    different leaders to process writes that are individually fine (paying out some of the money in an
@ -808,7 +806,7 @@ Consistency
    This is simply a fundamental limitation of distributed systems
    [^28].
    If you need to enforce such constraints, you’re therefore better off with a single-leader system.
-    However, as we will see in [“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts), multi-leader systems can still
+    However, as we will see in [“Dealing with Conflicting Writes”](/ch06.html#sec_replication_write_conflicts), multi-leader systems can still
    achieve consistency properties that are useful in a wide range of apps that don’t need such
    constraints.
@ -826,17 +824,17 @@ multi-leader replication is often considered dangerous territory that should be
 ### Multi-leader replication topologies
 A *replication topology* describes the communication paths along which writes are propagated from
-one node to another. If you have two leaders, like in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), there is
+one node to another. If you have two leaders, like in [Figure 6-9](/ch06.html#fig_replication_write_conflict), there is
 only one plausible topology: leader 1 must send all of its writes to leader 2, and vice versa. With
 more than two leaders, various different topologies are possible. Some examples are illustrated in
-[Figure 6-7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_topologies).
+[Figure 6-7](/ch06.html#fig_replication_topologies).
 ![ddia 0607](/fig/ddia_0607.png)
 ###### Figure 6-7. Three example topologies in which multi-leader replication can be set up.
 The most general topology is *all-to-all*, shown in
-[Figure 6-7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_topologies)(c),
+[Figure 6-7](/ch06.html#fig_replication_topologies)(c),
 in which every leader sends its writes to every other leader. However, more restricted topologies
 are also used: for example a *circular topology* in which each node receives writes from one node
 and forwards those writes (plus any writes of its own) to one other node. Another popular topology
@ -845,7 +843,7 @@ star topology can be generalized to a tree.
 > [!NOTE]
 > Don’t confuse a star-shaped network topology with a *star schema* (see
-> [“Stars and Snowflakes: Schemas for Analytics”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch03.html#sec_datamodels_analytics)), which describes the structure of a data model.
+> [“Stars and Snowflakes: Schemas for Analytics”](/ch03.html#sec_datamodels_analytics)), which describes the structure of a data model.
 In circular and star topologies, a write may need to pass through several nodes before it reaches
 all replicas. Therefore, nodes need to forward data changes they receive from other nodes. To
@ -866,28 +864,28 @@ along different paths, avoiding a single point of failure.
 On the other hand, all-to-all topologies can have issues too. In particular, some network links may
 be faster than others (e.g., due to network congestion), with the result that some replication
-messages may “overtake” others, as illustrated in [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality).
+messages may “overtake” others, as illustrated in [Figure 6-8](/ch06.html#fig_replication_causality).
 ![ddia 0608](/fig/ddia_0608.png)
 ###### Figure 6-8. With multi-leader replication, writes may arrive in the wrong order at some replicas.
-In [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
+In [Figure 6-8](/ch06.html#fig_replication_causality), client A inserts a row into a table on leader 1, and client B
 updates that row on leader 3. However, leader 2 may receive the writes in a different order: it may
 first receive the update (which, from its point of view, is an update to a row that does not exist
 in the database) and only later receive the corresponding insert (which should have preceded the
 update).
-This is a problem of causality, similar to the one we saw in [“Consistent Prefix Reads”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_consistent_prefix):
+This is a problem of causality, similar to the one we saw in [“Consistent Prefix Reads”](/ch06.html#sec_replication_consistent_prefix):
 the update depends on the prior insert, so we need to make sure that all nodes process the insert
 first, and then the update. Simply attaching a timestamp to every write is not sufficient, because
 clocks cannot be trusted to be sufficiently in sync to correctly order these events at leader 2 (see
-[Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed)).
+[Chapter 9](/ch09.html#ch_distributed)).
 To order these events correctly, a technique called *version vectors* can be used, which we will
-discuss later in this chapter (see [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent)). However, many multi-leader
+discuss later in this chapter (see [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent)). However, many multi-leader
 replication systems don’t use good techniques for ordering updates, leaving them vulnerable to
-issues like the one in [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality). If you are using multi-leader replication, it
+issues like the one in [Figure 6-8](/ch06.html#fig_replication_causality). If you are using multi-leader replication, it
 is worth being aware of these issues, carefully reading the documentation, and thoroughly testing
 your database to ensure that it really does provide the guarantees you believe it to have.
@ -918,9 +916,9 @@ Sheets for text documents and spreadsheets, Figma for graphics, and Linear for p
 What makes these apps so responsive is that user input is immediately reflected in the user
 interface, without waiting for a network round-trip to the server, and edits by one user are shown
 to their collaborators with low latency
-[[32](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#DayRichter2010),
+[[32](/ch06.html#DayRichter2010),
-[33](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Wallace2019),
+[33](/ch06.html#Wallace2019),
-[34](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Artman2023)].
+[34](/ch06.html#Artman2023)].
 This again results in a multi-leader architecture: each web browser tab that has opened the shared
 file is a replica, and any updates that you make to the file are asynchronously replicated to the
@ -938,9 +936,9 @@ those changes.
 A software library that supports this process is called a *sync engine*. Although the idea has
 existed for a long time, the term has recently gained attention
-[[35](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Saafan2024),
+[[35](/ch06.html#Saafan2024),
-[36](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Hagoel2024),
+[36](/ch06.html#Hagoel2024),
-[37](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Jayakar2024)].
+[37](/ch06.html#Jayakar2024)].
 An application that allows a user to continue editing a file while offline (which may be implemented
 using a sync engine) is called *offline-first*
 [^38].
@ -970,7 +968,7 @@ approach has a number of advantages:
  offline is the same as having very large network delay.
 * A sync engine simplifies the programming model for frontend apps, compared to performing explicit
  service calls in application code. Every service call requires error handling, as discussed in
-  [“The problems with remote procedure calls (RPCs)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch05.html#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user
+  [“The problems with remote procedure calls (RPCs)”](/ch05.html#sec_problems_with_rpc): for example, if a request to update data on a server fails, the user
  interface needs to somehow reflect that error. A sync engine allows the app to perform reads and
  writes on local data, which almost never fail, leading to a more declarative programming style
  [^41].
@ -1007,7 +1005,7 @@ a local-first sync engine on end user devices—is that concurrent writes on dif
 lead to conflicts that need to be resolved.
 For example, consider a wiki page that is simultaneously being edited by two users, as shown in
-[Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2
+[Figure 6-9](/ch06.html#fig_replication_write_conflict). User 1 changes the title of the page from A to B, and user 2
 independently changes the title from A to C. Each user’s change is successfully applied to their
 local leader. However, when the changes are asynchronously replicated, a conflict is detected.
 This problem does not occur in a single-leader database.
@ -1017,13 +1015,13 @@ This problem does not occur in a single-leader database.
 ###### Figure 6-9. A write conflict caused by two leaders concurrently updating the same record.
 > [!NOTE]
-> We say that the two writes in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict) are *concurrent* because neither
+> We say that the two writes in [Figure 6-9](/ch06.html#fig_replication_write_conflict) are *concurrent* because neither
 > was “aware” of the other at the time the write was originally made. It doesn’t matter whether the
 > writes literally happened at the same time; indeed, if the writes were made while offline, they
 > might have actually happened some time apart. What matters is whether one write occurred in a state
 > where the other write has already taken effect.
-In [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent) we will tackle the question of how a database can determine
+In [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent) we will tackle the question of how a database can determine
 whether two writes are concurrent. For now we will assume that we can detect conflicts, and we want
 to figure out the best way of resolving them.
@ -1052,13 +1050,13 @@ Another example of conflict avoidance: imagine you want to insert new records an
 IDs for them based on an auto-incrementing counter. If you have two leaders, you could set them up
 so that one leader only generates odd numbers and the other only generates even numbers. That way
 you can be sure that the two leaders won’t concurrently assign the same ID to different records.
-We will discuss other ID assignment schemes in [“ID Generators and Logical Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_logical).
+We will discuss other ID assignment schemes in [“ID Generators and Logical Clocks”](/ch10.html#sec_consistency_logical).
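As a rough illustration of the odd/even scheme described above, the sketch below gives each leader its own counter with a fixed stride. The class name `StridedIdGenerator` and its parameters are hypothetical and not taken from any particular database.

```python
import itertools


class StridedIdGenerator:
    """Hands out IDs start, start + stride, start + 2 * stride, ..."""

    def __init__(self, start: int, stride: int) -> None:
        self._counter = itertools.count(start, stride)

    def next_id(self) -> int:
        return next(self._counter)


# Two leaders: one generates only odd IDs, the other only even IDs,
# so they can never assign the same ID without coordinating.
leader_1 = StridedIdGenerator(start=1, stride=2)
leader_2 = StridedIdGenerator(start=2, stride=2)

ids_1 = [leader_1.next_id() for _ in range(3)]  # 1, 3, 5
ids_2 = [leader_2.next_id() for _ in range(3)]  # 2, 4, 6
```

The same idea extends to more than two leaders by using the number of leaders as the stride, at the cost of having to fix that number in advance.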
 ### Last write wins (discarding concurrent writes)
 If conflicts can’t be avoided, the simplest way of resolving them is to attach a timestamp to each
 write, and to always use the value with the greatest timestamp. For example, in
-[Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than
+[Figure 6-9](/ch06.html#fig_replication_write_conflict), let’s say that the timestamp of user 1’s write is greater than
 the timestamp of user 2’s write. In that case, both leaders will determine that the new title of the
 page should be B, and they discard the write that sets it to C. If the writes coincidentally have
 the same timestamp, the winner can be chosen by comparing the values (e.g., in the case of strings,
@ -1066,7 +1064,7 @@ taking the one that’s earlier in the alphabet).
 This approach is called *last write wins* (LWW) because the write with the greatest timestamp can be
 considered the “last” one. The term is misleading though, because when two writes are concurrent
-like in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), which one is older and which is later is undefined, and
+like in [Figure 6-9](/ch06.html#fig_replication_write_conflict), which one is older and which is later is undefined, and
 so the timestamp order of concurrent writes is essentially random.
 Therefore the real meaning of LWW is: when the same record is concurrently written on different
@ -1084,7 +1082,7 @@ Another problem with LWW is that if a real-time clock (e.g. a Unix timestamp) is
 for the writes, the system becomes very sensitive to clock synchronization. If one node has a clock
 that is ahead of the others, and you try to overwrite a value written by that node, your write may
 be ignored as it may have a lower timestamp, even though it clearly occurred later. This problem can
-be solved by using a *logical clock*, which we will discuss in [“ID Generators and Logical Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_logical).
+be solved by using a *logical clock*, which we will discuss in [“ID Generators and Logical Clocks”](/ch10.html#sec_consistency_logical).
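The following minimal sketch shows last write wins as a merge function, together with the clock-skew problem described above. The `Write` type, the `lww_merge` function, and the timestamp values are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Write:
    value: str
    timestamp: float  # assumed to come from the writing node's real-time clock


def lww_merge(a: Write, b: Write) -> Write:
    """Keep the write with the greater timestamp; break ties by comparing values."""
    if a.timestamp != b.timestamp:
        return a if a.timestamp > b.timestamp else b
    # Equal timestamps: choose deterministically, here the alphabetically earlier value.
    return a if a.value <= b.value else b


# Scenario of Figure 6-9: user 1's write (title B) has the greater timestamp,
# so every leader picks B regardless of the order in which the writes arrive.
title_b = Write("B", timestamp=100.0)
title_c = Write("C", timestamp=99.0)
winner = lww_merge(title_b, title_c)
same_winner = lww_merge(title_c, title_b)

# Clock skew: a node whose clock runs ahead wrote "old" with a future timestamp.
# A later overwrite from a node with a correct clock is silently discarded.
skewed = Write("old", timestamp=500.0)
later = Write("new", timestamp=120.0)
survivor = lww_merge(skewed, later)  # still the "old" write
```

Because the merge is deterministic and independent of argument order, all replicas converge on the same value, but the second scenario shows that the value they converge on is not necessarily the one written last.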
 ### Manual conflict resolution
@ -1096,7 +1094,7 @@ merge is complete.
 In a database, it would be impractical for a conflict to stop the entire replication process until a
 human has resolved it. Instead, databases typically store all the concurrently written values for a
-given record—for example, both B and C in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict). These values are
+given record—for example, both B and C in [Figure 6-9](/ch06.html#fig_replication_write_conflict). These values are
 sometimes called *siblings*. The next time you query that record, the database returns *all* those
 values, rather than just the latest one. You can then resolve those values in whatever way you want,
 either automatically in application code (for example, you could concatenate B and C into “B/C”), or
@ -1120,7 +1118,7 @@ suffers from a number of problems:
  sibling, but another sibling still contained that old item, the removed item would unexpectedly
  reappear in the customer’s cart
  [^45].
-  [Figure 6-10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
+  [Figure 6-10](/ch06.html#fig_replication_amazon_anomaly) shows an example where Device 1 removes Book from the shopping
  cart and concurrently Device 2 removes DVD, but after merging the conflict both items reappear.
 * If multiple nodes observe the conflict and concurrently resolve it, the conflict resolution
  process can itself introduce a new conflict. Those resolutions could even be inconsistent: for
@ -1149,7 +1147,7 @@ updates as much as possible, and hence avoiding data loss:
  same position, it can be ordered deterministically so that all nodes get the same merged outcome.
 * If the data is a collection of items (ordered like a to-do list, or unordered like a shopping
  cart), we can merge it similarly to text by tracking insertions and deletions. To avoid the
-  shopping cart issue in [Figure 6-10](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_amazon_anomaly), the algorithms track the fact that Book
+  shopping cart issue in [Figure 6-10](/ch06.html#fig_replication_amazon_anomaly), the algorithms track the fact that Book
  and DVD were deleted, so the merged result is Cart = {Soap}.
 * If the data is an integer representing a counter that can be incremented or decremented (e.g., the
  number of likes on a social media post), the merge algorithm can tell how many increments and
@ -1175,7 +1173,7 @@ Two families of algorithms are commonly used to implement automatic conflict res
 They have different design philosophies and performance characteristics, but both are able to
 perform automatic merges for all the aforementioned types of data.
-[Figure 6-11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
+[Figure 6-11](/ch06.html#fig_replication_ot_crdt) shows an example of how OT and a CRDT merge concurrent updates to a
 text. Assume you have two replicas that both start off with the text “ice”. One replica prepends the
 letter “n” to make “nice”, while concurrently the other replica appends an exclamation mark to make
 “ice!”.
@ -1196,7 +1194,7 @@ OT
 CRDT
 :   Most CRDTs give each character a unique, immutable ID and use those to determine the positions of
-    insertions/deletions, instead of indexes. For example, in [Figure 6-11](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_ot_crdt) we assign
+    insertions/deletions, instead of indexes. For example, in [Figure 6-11](/ch06.html#fig_replication_ot_crdt) we assign
    the ID 1A to “i”, the ID 2A to “c”, etc. When inserting the exclamation mark, we generate an
    operation containing the ID of the new character (4B) and the ID of the existing character after
    which we want to insert (3A). To insert at the beginning of the string we give “nil” as the
@ -1218,7 +1216,7 @@ Sync engines for JSON data can be implemented both with CRDTs (e.g., Automerge o
 ### What is a conflict?
-Some kinds of conflict are obvious. In the example in [Figure 6-9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_write_conflict), two writes
+Some kinds of conflict are obvious. In the example in [Figure 6-9](/ch06.html#fig_replication_write_conflict), two writes
 concurrently modified the same field in the same record, setting it to two different values. There
 is little doubt that this is a conflict.
@ -1232,7 +1230,7 @@ are made on two different leaders.
 There isn’t a quick ready-made answer, but in the following chapters we will trace a path toward a
 good understanding of this problem. We will see some more examples of conflicts in
-[Chapter 8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch08.html#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and
+[Chapter 8](/ch08.html#ch_transactions), and in [Link to Come] we will discuss scalable approaches for detecting and
 resolving conflicts in a replicated system.
 # Leaderless Replication
@ -1245,8 +1243,8 @@ writes in the same order.
 Some data storage systems take a different approach, abandoning the concept of a leader and
 allowing any replica to directly accept writes from clients. Some of the earliest replicated data
-systems were leaderless [[1](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Lindsay1979_ch6),
+systems were leaderless [[1](/ch06.html#Lindsay1979_ch6),
-[50](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Gifford1979)], but the
+[50](/ch06.html#Gifford1979)], but the
 idea was mostly forgotten during the era of dominance of relational databases. It once again became
 a fashionable architecture for databases after Amazon used it for its in-house *Dynamo* system in
 2007 [^45].
@ -1270,10 +1268,10 @@ profound consequences for the way the database is used.
 Imagine you have a database with three replicas, and one of the replicas is currently
 unavailable—perhaps it is being rebooted to install a system update. In a single-leader
 configuration, if you want to continue processing writes, you may need to perform a failover (see
-[“Handling Node Outages”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_failover)).
+[“Handling Node Outages”](/ch06.html#sec_replication_failover)).
 On the other hand, in a leaderless configuration, failover does not exist.
-[Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to
+[Figure 6-12](/ch06.html#fig_replication_quorum_node_outage) shows what happens: the client (user 1234) sends the write to
 all three replicas in parallel, and the two available replicas accept the write but the unavailable
 replica misses it. Let’s say that it’s sufficient for two out of three replicas to
 acknowledge the write: after user 1234 has received two *ok* responses, we consider the write to be
@ -1294,9 +1292,9 @@ stale value from another.
 In order to tell which responses are up-to-date and which are outdated, every value that is written
 needs to be tagged with a version number or timestamp, similarly to what we saw in
-[“Last write wins (discarding concurrent writes)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lww). When a client receives multiple values in response to a read, it uses the
+[“Last write wins (discarding concurrent writes)”](/ch06.html#sec_replication_lww). When a client receives multiple values in response to a read, it uses the
 one with the greatest timestamp (even if that value was only returned by one replica, and several
-other replicas returned older values). See [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent) for more details.
+other replicas returned older values). See [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent) for more details.
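A client-side sketch of this rule might look as follows. The `Response` type, the replica names, and the values are hypothetical; a real datastore would return richer metadata.

```python
from typing import NamedTuple


class Response(NamedTuple):
    replica: str
    version: int
    value: str


def newest_response(responses: list[Response]) -> Response:
    """Pick the response with the greatest version, however few replicas returned it."""
    return max(responses, key=lambda resp: resp.version)


# Only one replica has the newer value; the client still prefers it.
responses = [
    Response("replica 1", 6, "old"),
    Response("replica 2", 6, "old"),
    Response("replica 3", 7, "new"),
]
chosen = newest_response(responses)

# Replicas that returned an older version are the ones that missed a write.
stale = [resp.replica for resp in responses if resp.version < chosen.version]
```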
 ### Catching up on missed writes
@ -1306,7 +1304,7 @@ mechanisms are used in Dynamo-style datastores:
 Read repair
 :   When a client makes a read from several nodes in parallel, it can detect any stale responses.
-    For example, in [Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from
+    For example, in [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage), user 2345 gets a version 6 value from
    replica 3 and a version 7 value from replicas 1 and 2. The client sees that replica 3 has a stale
    value and writes the newer value back to that replica. This approach works well for values that are
    frequently read.
@ -1326,7 +1324,7 @@ Anti-entropy
 ### Quorums for reading and writing
-In the example of [Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage), we considered the write to be successful
+In the example of [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage), we considered the write to be successful
 even though it was only processed on two out of three replicas. What if only one out of three
 replicas accepted the write? How far can we push this?
@ -1354,7 +1352,7 @@ database writes to fail.
 > [!NOTE]
 > There may be more than *n* nodes in the cluster, but any given value is stored only on *n*
 > nodes. This allows the dataset to be sharded, supporting datasets that are larger than you can fit
-> on one node. We will return to sharding in [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding).
+> on one node. We will return to sharding in [Chapter 7](/ch07.html#ch_sharding).
 The quorum condition, *w* + *r* > *n*, allows the system to tolerate unavailable nodes
 as follows:
@ -1362,9 +1360,9 @@ as follows:
 * If *w* < *n*, we can still process writes if a node is unavailable.
 * If *r* < *n*, we can still process reads if a node is unavailable.
 * With *n* = 3, *w* = 2, *r* = 2 we can tolerate one unavailable
-  node, like in [Figure 6-12](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_node_outage).
+  node, like in [Figure 6-12](/ch06.html#fig_replication_quorum_node_outage).
 * With *n* = 5, *w* = 3, *r* = 3 we can tolerate two unavailable nodes.
-  This case is illustrated in [Figure 6-13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_overlap).
+  This case is illustrated in [Figure 6-13](/ch06.html#fig_replication_quorum_overlap).
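The arithmetic behind these bullet points can be sketched in a few lines. The function name `tolerated_failures` is an assumption made for this example.

```python
def tolerated_failures(n: int, w: int, r: int) -> int:
    """Number of unavailable nodes with which both reads and writes can still succeed.

    Writes need w of the n replicas and reads need r, so up to n - w nodes
    may be down for writes and up to n - r for reads.
    """
    if not w + r > n:
        raise ValueError("quorum condition w + r > n is not satisfied")
    return min(n - w, n - r)


one_down = tolerated_failures(n=3, w=2, r=2)  # 1
two_down = tolerated_failures(n=5, w=3, r=3)  # 2
```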
 Normally, reads and writes are always sent to all *n* replicas in parallel. The parameters *w* and
 *r* determine how many nodes we wait for—i.e., how many of the *n* nodes need to report success
@ -1386,7 +1384,7 @@ If you have *n* replicas, and you choose *w* and *r* such that *w* + *r* > *n*
 generally expect every read to return the most recent value written for a key. This is the case because the
 set of nodes to which you’ve written and the set of nodes from which you’ve read must overlap. That
 is, among the nodes you read there must be at least one node with the latest value (illustrated in
-[Figure 6-13](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_quorum_overlap)).
+[Figure 6-13](/ch06.html#fig_replication_quorum_overlap)).
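The overlap argument can be checked by brute force for small clusters. The sketch below enumerates every possible write set and read set; the function name `quorums_always_overlap` is hypothetical, and the approach is only practical for small *n*.

```python
from itertools import combinations


def quorums_always_overlap(n: int, w: int, r: int) -> bool:
    """Exhaustively check that every w-node write set shares a node with every r-node read set."""
    nodes = range(n)
    return all(
        set(write_set) & set(read_set)  # an empty intersection is falsy
        for write_set in combinations(nodes, w)
        for read_set in combinations(nodes, r)
    )


overlap_with_quorum = quorums_always_overlap(n=5, w=3, r=3)     # True: 3 + 3 > 5
overlap_without_quorum = quorums_always_overlap(n=5, w=2, r=3)  # False: 2 + 3 = 5
```

In the second case, a write to nodes {0, 1} and a read from nodes {2, 3, 4} share no node, so the read can miss the latest value.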
 Often, *r* and *w* are chosen to be a majority (more than *n*/2) of nodes, because that ensures
 *w* + *r* > *n* while still tolerating up to (*n* - 1)/2 (rounded down) node failures. But quorums are
@ -1413,12 +1411,12 @@ properties can be confusing. Some scenarios include:
  value, the number of replicas storing the new value may fall below *w*, breaking the quorum
  condition.
 * While a rebalancing is in progress, where some data is moved from one node to another (see
-  [Chapter 7](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch07.html#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n*
+  [Chapter 7](/ch07.html#ch_sharding)), nodes may have inconsistent views of which nodes should be holding the *n*
  replicas for a particular value. This can result in the read and write quorums no longer
  overlapping.
 * If a read is concurrent with a write operation, the read may or may not see the concurrently
  written value. In particular, it’s possible for one read to see the new value, and a subsequent
-  read to see the old value, as we shall see in [“Linearizability and quorums”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch10.html#sec_consistency_quorum_linearizable).
+  read to see the old value, as we shall see in [“Linearizability and quorums”](/ch10.html#sec_consistency_quorum_linearizable).
 * If a write succeeded on some replicas but failed on others (for example because the disks on some
  nodes are full), and overall succeeded on fewer than *w* replicas, it is not rolled back on the
  replicas where it succeeded. This means that if a write was reported as failed, subsequent reads
@ -1426,12 +1424,12 @@ properties can be confusing. Some scenarios include:
  [^52].
 * If the database uses timestamps from a real-time clock to determine which write is newer (as
  Cassandra and ScyllaDB do, for example), writes might be silently dropped if another node with a
-  faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lww).
+  faster clock has written to the same key—an issue we previously saw in [“Last write wins (discarding concurrent writes)”](/ch06.html#sec_replication_lww).
-  We will discuss this in more detail in [“Relying on Synchronized Clocks”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#sec_distributed_clocks_relying).
+  We will discuss this in more detail in [“Relying on Synchronized Clocks”](/ch09.html#sec_distributed_clocks_relying).
 * If two writes occur concurrently, one of them might be processed first on one replica, and the
  other might be processed first on another replica. This leads to a conflict, similarly to what we
-  saw for multi-leader replication (see [“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts)). We will return to this
+  saw for multi-leader replication (see [“Dealing with Conflicting Writes”](/ch06.html#sec_replication_write_conflicts)). We will return to this
-  topic in [“Detecting Concurrent Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_concurrent).
+  topic in [“Detecting Concurrent Writes”](/ch06.html#sec_replication_concurrent).
 Thus, although quorums appear to guarantee that a read returns the latest written value, in practice
 it is not so simple. Dynamo-style databases are generally optimized for use cases that can tolerate
@ -1463,7 +1461,7 @@ able to quantify “eventual.”
 A replication system based on a single leader can provide strong consistency guarantees that are
 difficult or impossible to achieve in a leaderless system. However, as we have seen in
-[“Problems with Replication Lag”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_lag), reads in a leader-based replicated system can also return stale values if
+[“Problems with Replication Lag”](/ch06.html#sec_replication_lag), reads in a leader-based replicated system can also return stale values if
 you make them on an asynchronously updated follower.
 Reading from the leader ensures up-to-date responses, but it suffers from performance problems:
@ -1507,7 +1505,7 @@ That said, leaderless systems can have performance problems as well:
  to wait for before a request can complete. Even if you wait only for the fastest *r* or *w*
  replicas to respond, and even if you make the requests in parallel, a bigger *r* or *w* increases
  the chance that you hit a slow replica, increasing the overall response time (see
-  [“Use of Response Time Metrics”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch02.html#sec_introduction_slo_sla)).
+  [“Use of Response Time Metrics”](/ch02.html#sec_introduction_slo_sla)).
 * A large-scale network interruption that disconnects a client from a large number of replicas can
  make it impossible to form a quorum. Some leaderless databases offer a configuration option that
  allows any reachable replica to accept writes, even if it’s not one of the usual replicas for that
@ -1526,7 +1524,7 @@ fault tolerance while also having a high likelihood of reading up-to-date data.
 ### Multi-region operation
 We previously discussed cross-region replication as a use case for multi-leader replication (see
-[“Multi-Leader Replication”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_multi_leader)). Leaderless replication is also suitable for
+[“Multi-Leader Replication”](/ch06.html#sec_replication_multi_leader)). Leaderless replication is also suitable for
 multi-region operation, since it is designed to tolerate conflicting concurrent writes, network
 interruptions, and latency spikes.
@ -1549,7 +1547,7 @@ resulting in conflicts that need to be resolved. Such conflicts may occur as the
 not always: they could also be detected later during read repair, hinted handoff, or anti-entropy.
 The problem is that events may arrive in a different order at different nodes, due to variable
-network delays and partial failures. For example, [Figure 6-14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_concurrency) shows two clients,
+network delays and partial failures. For example, [Figure 6-14](/ch06.html#fig_replication_concurrency) shows two clients,
 A and B, simultaneously writing to a key *X* in a three-node datastore:
 * Node 1 receives the write from A, but never receives the write from B due to a transient
@ -1563,13 +1561,13 @@ A and B, simultaneously writing to a key *X* in a three-node datastore:
 If each node simply overwrote the value for a key whenever it received a write request from a
 client, the nodes would become permanently inconsistent, as shown by the final *get* request in
-[Figure 6-14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other
+[Figure 6-14](/ch06.html#fig_replication_concurrency): node 2 thinks that the final value of *X* is B, whereas the other
 nodes think that the value is A.
 In order to become eventually consistent, the replicas should converge toward the same value. For
 this, we can use any of the conflict resolution mechanisms we previously discussed in
-[“Dealing with Conflicting Writes”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_write_conflicts), such as last-write-wins (used by Cassandra and ScyllaDB),
+[“Dealing with Conflicting Writes”](/ch06.html#sec_replication_write_conflicts), such as last-write-wins (used by Cassandra and ScyllaDB),
-manual resolution, or CRDTs (described in [“CRDTs and Operational Transformation”](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#sec_replication_crdts), and used by Riak).
+manual resolution, or CRDTs (described in [“CRDTs and Operational Transformation”](/ch06.html#sec_replication_crdts), and used by Riak).
 Last-write-wins is easy to implement: each write is tagged with a timestamp, and a value with a
 higher timestamp always overwrites a value with a lower timestamp. However, a timestamp doesn’t tell
@ -1582,11 +1580,11 @@ take more care to detect concurrent writes.
 How do we decide whether two operations are concurrent or not? To develop an intuition, let’s look
 at some examples:
-* In [Figure 6-8](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before*
+* In [Figure 6-8](/ch06.html#fig_replication_causality), the two writes are not concurrent: A’s insert *happens before*
  B’s increment, because the value incremented by B is the value inserted by A. In other words, B’s
  operation builds upon A’s operation, so B’s operation must have happened later.
  We also say that B is *causally dependent* on A.
-* On the other hand, the two writes in [Figure 6-14](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_concurrency) are concurrent: when each
+* On the other hand, the two writes in [Figure 6-14](/ch06.html#fig_replication_concurrency) are concurrent: when each
  client starts the operation, it does not know that another client is also performing an operation
  on the same key. Thus, there is no causal dependency between the operations.
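One way to make this intuition concrete is to record, for every operation, which earlier operations it knew about. The sketch below does this with plain sets of operation IDs; the `Operation` type and the IDs are assumptions for illustration, and this is not how a production database would track causality.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operation:
    op_id: str
    # IDs of all operations this one knew about (directly or transitively) when issued.
    seen: frozenset = frozenset()


def happens_before(a: Operation, b: Operation) -> bool:
    """a happens before b if b knew about a."""
    return a.op_id in b.seen


def concurrent(a: Operation, b: Operation) -> bool:
    """Two operations are concurrent if neither knew about the other."""
    return not happens_before(a, b) and not happens_before(b, a)


# B's increment builds on A's insert, so it is causally dependent on it.
insert_a = Operation("A-insert")
increment_b = Operation("B-increment", seen=frozenset({"A-insert"}))

# Two clients write key X without knowing about each other.
set_x_a = Operation("A-set-X")
set_x_b = Operation("B-set-X")
```

Here `happens_before(insert_a, increment_b)` holds, whereas `concurrent(set_x_a, set_x_b)` holds for the two independent writes.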
@ -1607,7 +1605,7 @@ conflict that needs to be resolved.
 It may seem that two operations should be called concurrent if they occur “at the same time”—but
 in fact, it is not important whether they literally overlap in time. Because of problems with clocks
 in distributed systems, it is actually quite difficult to tell whether two things happened
-at exactly the same time—an issue we will discuss in more detail in [Chapter 9](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch09.html#ch_distributed).
+at exactly the same time—an issue we will discuss in more detail in [Chapter 9](/ch09.html#ch_distributed).
 For defining concurrency, exact time doesn’t matter: we simply call two operations concurrent if
 they are both unaware of each other, regardless of the physical time at which they occurred. People
@ -1629,7 +1627,7 @@ happened before another. To keep things simple, let’s start with a database th
 replica. Once we have worked out how to do this on a single replica, we can generalize the approach
 to a leaderless database with multiple replicas.
-[Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) shows two clients concurrently adding items to the same
+[Figure 6-15](/ch06.html#fig_replication_causality_single) shows two clients concurrently adding items to the same
 shopping cart. (If that example strikes you as too inane, imagine instead two air traffic
 controllers concurrently adding aircraft to the sector they are tracking.) Initially, the cart is
 empty. Between them, the clients make five writes to the database:
@ -1664,8 +1662,8 @@ empty. Between them, the clients make five writes to the database:
 ###### Figure 6-15. Capturing causal dependencies between two clients concurrently editing a shopping cart.
-The dataflow between the operations in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) is illustrated
+The dataflow between the operations in [Figure 6-15](/ch06.html#fig_replication_causality_single) is illustrated
-graphically in [Figure 6-16](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causal_dependencies). The arrows indicate which operation
+graphically in [Figure 6-16](/ch06.html#fig_replication_causal_dependencies). The arrows indicate which operation
 *happened before* which other operation, in the sense that the later operation *knew about* or
 *depended on* the earlier one. In this example, the clients are never fully up to date with the data
 on the server, since there is always another operation going on concurrently. But old versions of
@ -1673,7 +1671,7 @@ the value do get overwritten eventually, and no writes are lost.
 ![ddia 0616](/fig/ddia_0616.png)
-###### Figure 6-16. Graph of causal dependencies in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single).
+###### Figure 6-16. Graph of causal dependencies in [Figure 6-15](/ch06.html#fig_replication_causality_single).
 Note that the server can determine whether two operations are concurrent by looking at the version
 numbers—it does not need to interpret the value itself (so the value could be any data
@ -1699,10 +1697,10 @@ on subsequent reads.
 ### Version vectors
-The example in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) used only a single replica. How does the
+The example in [Figure 6-15](/ch06.html#fig_replication_causality_single) used only a single replica. How does the
 algorithm change when there are multiple replicas, but no leader?
-[Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single) uses a single version number to capture dependencies between
+[Figure 6-15](/ch06.html#fig_replication_causality_single) uses a single version number to capture dependencies between
 operations, but that is not sufficient when there are multiple replicas accepting writes
 concurrently. Instead, we need to use a version number *per replica* as well as per key. Each
 replica increments its own version number when processing a write, and also keeps track of the
@ -1713,14 +1711,14 @@ The collection of version numbers from all the replicas is called a *version vec
 [^58].
 A few variants of this idea are in use, but the most interesting is probably the *dotted version
 vector*
-[[59](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Preguica2010),
+[[59](/ch06.html#Preguica2010),
-[60](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Manepalli2022)],
+[60](/ch06.html#Manepalli2022)],
 which is used in Riak 2.0
-[[61](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Cribbs2014),
+[[61](/ch06.html#Cribbs2014),
-[62](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Brown2015)].
+[62](/ch06.html#Brown2015)].
 We won’t go into the details, but the way it works is quite similar to what we saw in our cart example.
-Like the version numbers in [Figure 6-15](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#fig_replication_causality_single), version vectors are sent from the
+Like the version numbers in [Figure 6-15](/ch06.html#fig_replication_causality_single), version vectors are sent from the
 database replicas to clients when values are read, and need to be sent back to the database when a
 value is subsequently written. (Riak encodes the version vector as a string that it calls *causal
 context*.) The version vector allows the database to distinguish between overwrites and concurrent
@ -1734,12 +1732,12 @@ siblings are merged correctly.
 A *version vector* is sometimes also called a *vector clock*, even though they are not quite the
 same. The difference is subtle—please see the references for details
-[[60](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Manepalli2022),
+[[60](/ch06.html#Manepalli2022),
-[63](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Baquero2011),
+[63](/ch06.html#Baquero2011),
-[64](https://learning.oreilly.com/library/view/designing-data-intensive-applications/9781098119058/ch06.html#Schwarz1994)]. In brief, when
+[64](/ch06.html#Schwarz1994)]. In brief, when
 comparing the state of replicas, version vectors are the right data structure to use.
-# Summary
+## Summary
 In this chapter we looked at the issue of replication. Replication can serve several purposes:
@ -1816,10 +1814,10 @@ This chapter has assumed that every replica stores a full copy of the whole data
 unrealistic for large datasets. In the next chapter we will look at *sharding*, which allows each
 machine to store only a subset of the data.
 ##### Footnotes
-##### References
+
 ### Summary
 [^1]: B. G. Lindsay, P. G. Selinger, C. Galtieri, J. N. Gray, R. A. Lorie, T. G. Price, F. Putzolu, I. L. Traiger, and B. W. Wade. [Notes on Distributed Databases](https://dominoweb.draco.res.ibm.com/reports/RJ2571.pdf). IBM Research, Research Report RJ2571(33471), July 1979. Archived at [perma.cc/EPZ3-MHDD](https://perma.cc/EPZ3-MHDD)
--- a/content/en/ch7.md
+++ b/content/en/ch7.md
@ -13,10 +13,10 @@ breadcrumbs: false
 A distributed database typically distributes data across nodes in two ways:
 1. Having a copy of the same data on multiple nodes: this is *replication*, which we discussed in
-   [Chapter 6](/en/ch6#ch_replication).
+ [Chapter 6](/en/ch6#ch_replication).
 2. If we don’t want every node to store all the data, we can split up a large amount of data into
-   smaller *shards* or *partitions*, and store different shards on different nodes. We’ll discuss
+ smaller *shards* or *partitions*, and store different shards on different nodes. We’ll discuss
-   sharding in this chapter.
+ sharding in this chapter.
 Normally, shards are defined in such a way that each piece of data (each record, row, or document)
 belongs to exactly one shard. There are various ways of achieving this, which we discuss in depth in
@ -51,14 +51,12 @@ Some databases treat partitions and shards as two distinct concepts. For example
 partitioning is a way of splitting a large table into several files that are stored on the same
 machine (which has several advantages, such as making it very fast to delete an entire partition),
 whereas sharding splits a dataset across multiple machines
-[[1](/en/ch7#Giordano2023),
+[[^1], [^2]].
 [2](/en/ch7#Leach2022)].
 In many other systems, partitioning is just another word for sharding.
 While *partitioning* is quite descriptive, the term *sharding* is perhaps surprising. According to
 one theory, the term arose from the online role-play game *Ultima Online*, in which a magic crystal
-was shattered into pieces, and each of those shards refracted a copy of the game world
+was shattered into pieces, and each of those shards refracted a copy of the game world [^3].
 [^3].
 The term *shard* thus came to mean one of a set of parallel game servers, and later was carried over
 to databases. Another theory is that *shard* was originally an acronym of *System for Highly
 Available Replicated Data*—reportedly a 1980s database, details of which are lost to history.
@ -87,8 +85,7 @@ single-shard database.
 The reason for this recommendation is that sharding often adds complexity: you typically have to
 decide which records to put in which shard by choosing a *partition key*; all records with the
-same partition key are placed in the same shard
+same partition key are placed in the same shard [^4].
 [^4].
 This choice matters because accessing a record is fast if you know which shard it’s in, but if you
 don’t know the shard you have to do an inefficient search across all shards, and the sharding scheme
 is difficult to change.
@ -107,11 +104,9 @@ some systems don’t support them at all.
 Some systems use sharding even on a single machine, typically running one single-threaded process
 per CPU core to make use of the parallelism in the CPU, or to take advantage of a *nonuniform memory
-access* (NUMA) architecture in which some banks of memory are closer to one CPU than to others
+access* (NUMA) architecture in which some banks of memory are closer to one CPU than to others [^5].
 [^5].
 For example, Redis, VoltDB, and FoundationDB use one process per core, and rely on sharding to
-spread load across CPU cores in the same machine
+spread load across CPU cores in the same machine [^6].
 [^6].
 ## Sharding for Multitenancy
@ -124,61 +119,60 @@ signups, delivery data etc. are separate from those of other businesses.
 Sometimes sharding is used to implement multitenant systems: either each tenant is given a separate
 shard, or multiple small tenants may be grouped together into a larger shard. These shards might be
 physically separate databases (which we previously touched on in [“Embedded storage engines”](/en/ch4#sidebar_embedded)), or
-separately manageable portions of a larger logical database
+separately manageable portions of a larger logical database [^7].
 [^7].
 Using sharding for multitenancy has several advantages:
 Resource isolation
-:   If one tenant performs a computationally expensive operation, it is less likely that other
+: If one tenant performs a computationally expensive operation, it is less likely that other
-    tenants’ performance will be affected if they are running on different shards.
+ tenants’ performance will be affected if they are running on different shards.
 Permission isolation
-:   If there is a bug in your access control logic, it’s less likely that you will accidentally give
+: If there is a bug in your access control logic, it’s less likely that you will accidentally give
-    one tenant access to another tenant’s data if those tenants’ datasets are stored physically
+ one tenant access to another tenant’s data if those tenants’ datasets are stored physically
-    separately from each other.
+ separately from each other.
 Cell-based architecture
-:   You can apply sharding not only at the data storage level, but also for the services running your
+: You can apply sharding not only at the data storage level, but also for the services running your
-    application code. In a *cell-based architecture*, the services and storage for a particular set of
+ application code. In a *cell-based architecture*, the services and storage for a particular set of
-    tenants are grouped into a self-contained *cell*, and different cells are set up such that they
+ tenants are grouped into a self-contained *cell*, and different cells are set up such that they
-    can run largely independently from each other. This approach provides *fault isolation*: that is,
+ can run largely independently from each other. This approach provides *fault isolation*: that is,
-    a fault in one cell remains limited to that cell, and tenants in other cells are not affected
+ a fault in one cell remains limited to that cell, and tenants in other cells are not affected
-    [^8].
+ [^8].
 Per-tenant backup and restore
-:   Backing up each tenant’s shard separately makes it possible to restore a tenant’s state from a
+: Backing up each tenant’s shard separately makes it possible to restore a tenant’s state from a
-    backup without affecting other tenants, which can be useful in case the tenant accidentally
+ backup without affecting other tenants, which can be useful in case the tenant accidentally
-    deletes or overwrites important data
+ deletes or overwrites important data
-    [^9].
+ [^9].
 Regulatory compliance
-:   Data privacy regulation such as the GDPR gives individuals the right to access and delete all data
+: Data privacy regulation such as the GDPR gives individuals the right to access and delete all data
-    stored about them. If each person’s data is stored in a separate shard, this translates into
+ stored about them. If each person’s data is stored in a separate shard, this translates into
-    simple data export and deletion operations on their shard
+ simple data export and deletion operations on their shard
-    [^10].
+ [^10].
 Data residence
-:   If a particular tenant’s data needs to be stored in a particular jurisdiction in order to comply
+: If a particular tenant’s data needs to be stored in a particular jurisdiction in order to comply
-    with data residency laws, a region-aware database can allow you to assign that tenant’s shard to a
+ with data residency laws, a region-aware database can allow you to assign that tenant’s shard to a
-    particular region.
+ particular region.
 Gradual schema rollout
-:   Schema migrations (previously discussed in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)) can be rolled
+: Schema migrations (previously discussed in [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)) can be rolled
-    out gradually, one tenant at a time. This reduces risk, as you can detect problems before they
+ out gradually, one tenant at a time. This reduces risk, as you can detect problems before they
-    affect all tenants, but it can be difficult to do transactionally
+ affect all tenants, but it can be difficult to do transactionally
-    [^11].
+ [^11].
 The main challenges around using sharding for multitenancy are:
 * It assumes that each individual tenant is small enough to fit on a single node. If that is not the
-  case, and you have a single tenant that’s too big for one machine, you would need to additionally
+ case, and you have a single tenant that’s too big for one machine, you would need to additionally
-  perform sharding within a single tenant, which brings us back to the topic of sharding for
+ perform sharding within a single tenant, which brings us back to the topic of sharding for
-  scalability [^12].
+ scalability [^12].
 * If you have many small tenants, then creating a separate shard for each one may incur too much
-  overhead. You could group several small tenants together into a bigger shard, but then you have
+ overhead. You could group several small tenants together into a bigger shard, but then you have
-  the problem of how you move tenants from one shard to another as they grow.
+ the problem of how you move tenants from one shard to another as they grow.
 * If you ever need to support features that connect data across multiple tenants, these become
-  harder to implement if you need to join data across multiple shards.
+ harder to implement if you need to join data across multiple shards.
 # Sharding of Key-Value Data
@ -226,8 +220,7 @@ to distribute the data evenly, the shard boundaries need to adapt to the data.
 The shard boundaries might be chosen manually by an administrator, or the database can choose them
 automatically. Manual key-range sharding is used by Vitess (a sharding layer for MySQL), for
 example; the automatic variant is used by Bigtable, its open source equivalent HBase, the
-range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB
+range-based sharding option in MongoDB, CockroachDB, RethinkDB, and FoundationDB [^6]. YugabyteDB offers both manual and automatic
 [^6]. YugabyteDB offers both manual and automatic
 tablet splitting.
 Within each shard, keys are stored in sorted order (e.g., in a B-tree or SSTables, as discussed in
@ -241,8 +234,7 @@ A downside of key range sharding is that you can easily get a hot shard if there
 lot of writes to nearby keys. For example, if the key is a timestamp, then the shards correspond to
 ranges of time—e.g., one shard per month. Unfortunately, if you write data from the sensors to the
 database as the measurements happen, all the writes end up going to the same shard (the one for
-this month), so that shard can be overloaded with writes while others sit idle
+this month), so that shard can be overloaded with writes while others sit idle [^13].
 [^13].
 To avoid this problem in the sensor database, you need to use something other than the timestamp as
 the first element of the key. For example, you could prefix each timestamp with the sensor ID so
@ -256,8 +248,7 @@ need to perform a separate range query for each sensor.
 When you first set up your database, there are no key ranges to split into shards. Some databases,
 such as HBase and MongoDB, allow you to configure an initial set of shards on an empty database,
 which is called *pre-splitting*. This requires that you already have some idea of what the key
-distribution is going to look like, so that you can choose appropriate key range boundaries
+distribution is going to look like, so that you can choose appropriate key range boundaries [^14].
 [^14].
 Later on, as your data volume and write throughput grow, a system with key-range sharding grows by
 splitting an existing shard into two or more smaller shards, each of which holds a contiguous
@ -270,8 +261,8 @@ With databases that manage shard boundaries automatically, a shard split is typi
 * the shard reaching a configured size (for example, on HBase, the default is 10 GB), or
 * in some systems, the write throughput being persistently above some threshold. Thus, a hot shard
-  may be split even if it is not storing a lot of data, so that its write load can be distributed
+ may be split even if it is not storing a lot of data, so that its write load can be distributed
-  more uniformly.
+ more uniformly.
 An advantage of key-range sharding is that the number of shards adapts to the data volume. If there
 is only a small amount of data, a small number of shards is sufficient, so overheads are small; if
@ -300,8 +291,7 @@ For sharding purposes, the hash function need not be cryptographically strong: f
 uses MD5, whereas Cassandra and ScyllaDB use Murmur3. Many programming languages have simple hash
 functions built in (as they are used for hash tables), but they may not be suitable for sharding:
 for example, in Java’s `Object.hashCode()` and Ruby’s `Object#hash`, the same key may have a
-different hash value in different processes, making them unsuitable for sharding
+different hash value in different processes, making them unsuitable for sharding [^16].
 [^16].
 ### Hash modulo number of nodes
@ -411,16 +401,14 @@ cluster keys for a table. Delta Lake supports both manual and automatic partitio
 supports cluster keys. Clustering data not only improves range scan performance, but can
 improve compression and filtering performance as well.
-Hash-range sharding is used in YugabyteDB and DynamoDB
+Hash-range sharding is used in YugabyteDB and DynamoDB [^17], and is an option in MongoDB.
 [^17], and is an option in MongoDB.
 Cassandra and ScyllaDB use a variant of this approach that is illustrated in
 [Figure 7-6](/en/ch7#fig_sharding_cassandra): the space of hash values is split into a number of ranges proportional
 to the number of nodes (3 ranges per node in [Figure 7-6](/en/ch7#fig_sharding_cassandra), but actual numbers are 8
 per node in Cassandra by default, and 256 per node in ScyllaDB), with random boundaries between
 those ranges. This means some ranges are bigger than others, but by having multiple ranges per node
 those imbalances tend to even out
-[[15](/en/ch7#Evans2013),
+[[^15], [^18]].
 [18](/en/ch7#Williams2012)].
 ![ddia 0706](/fig/ddia_0706.png)
@ -446,10 +434,8 @@ ACID consistency (see [Chapter 8](/en/ch8#ch_transactions)), but rather describ
 the same shard as much as possible.
 The sharding algorithm used by Cassandra and ScyllaDB is similar to the original definition of
-consistent hashing
+consistent hashing [^20],
-[^20],
+but several other consistent hashing algorithms have also been proposed [^21],
 but several other consistent hashing algorithms have also been proposed
 [^21],
 such as *highest random weight*, also known as *rendezvous hashing*
 [^22],
 and *jump consistent hash*
@ -473,11 +459,9 @@ This event can result in a large volume of reads and writes to the same key (whe
 is perhaps the user ID of the celebrity, or the ID of the action that people are commenting on).
 In such situations, a more flexible sharding policy is required
-[[25](/en/ch7#Guo2020),
+[[^25], [^26]].
 [26](/en/ch7#Lee2021)].
 A system that defines shards based on ranges of keys (or ranges of hashes) makes it possible to put
-an individual hot key in a shard by its own, and perhaps even assigning it a dedicated machine
+an individual hot key in a shard by its own, and perhaps even assigning it a dedicated machine [^27].
 [^27].
 It’s also possible to compensate for skew at the application level. For example, if one key is known
 to be very hot, a simple technique is to add a random number to the beginning or end of the key.
@ -518,16 +502,14 @@ Fully automated rebalancing can be convenient, because there is less operational
 normal maintenance, and such systems can even auto-scale to adapt to changes in workload. Cloud
 databases such as DynamoDB are promoted as being able to automatically add and remove shards to
 adapt to big increases or decreases of load within a matter of minutes
-[[17](/en/ch7#Elhemali2022_ch7),
+[[^17], [^29]].
 [29](/en/ch7#Houlihan2017)].
 However, automatic shard management can also be unpredictable. Rebalancing is an expensive
 operation, because it requires rerouting requests and moving a large amount of data from one node to
 another. If it is not done carefully, this process can overload the network or the nodes, and it
 might harm the performance of other requests. The system must continue processing writes while the
 rebalancing is in progress; if a system is near its maximum write throughput, the shard-splitting
-process might not even be able to keep up with the rate of incoming writes
+process might not even be able to keep up with the rate of incoming writes [^29].
 [^29].
 Such automation can be dangerous in combination with automatic failure detection. For example, say
 one node is overloaded and is temporarily slow to respond to requests. The other nodes conclude that
@ -557,14 +539,14 @@ shards to nodes. On a high level, there are a few different approaches to this p
 in [Figure 7-7](/en/ch7#fig_sharding_routing)):
 1. Allow clients to contact any node (e.g., via a round-robin load balancer). If that node
-   coincidentally owns the shard to which the request applies, it can handle the request directly;
+ coincidentally owns the shard to which the request applies, it can handle the request directly;
-   otherwise, it forwards the request to the appropriate node, receives the reply, and passes the
+ otherwise, it forwards the request to the appropriate node, receives the reply, and passes the
-   reply along to the client.
+ reply along to the client.
 2. Send all requests from clients to a routing tier first, which determines the node that should
-   handle each request and forwards it accordingly. This routing tier does not itself handle any
+ handle each request and forwards it accordingly. This routing tier does not itself handle any
-   requests; it only acts as a shard-aware load balancer.
+ requests; it only acts as a shard-aware load balancer.
 3. Require that clients be aware of the sharding and the assignment of shards to nodes. In this
-   case, a client can connect directly to the appropriate node, without any intermediary.
+ case, a client can connect directly to the appropriate node, without any intermediary.
 ![ddia 0707](/fig/ddia_0707.png)
@ -573,15 +555,15 @@ in [Figure 7-7](/en/ch7#fig_sharding_routing)):
 In all cases, there are some key problems:
 * Who decides which shard should live on which node? It’s simplest to have a single coordinator
-  making that decision, but in that case how do you make it fault-tolerant in case the node running
+ making that decision, but in that case how do you make it fault-tolerant in case the node running
-  the coordinator goes down? And if the coordinator role can failover to another node, how do you
+ the coordinator goes down? And if the coordinator role can failover to another node, how do you
-  prevent a split-brain situation (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)) where two different
+ prevent a split-brain situation (see [“Handling Node Outages”](/en/ch6#sec_replication_failover)) where two different
-  coordinators make contradictory shard assignments?
+ coordinators make contradictory shard assignments?
 * How does the component performing the routing (which may be one of the nodes, or the routing tier,
-  or the client) learn about changes in the assignment of shards to nodes?
+ or the client) learn about changes in the assignment of shards to nodes?
 * While a shard is being moved from one node to another, there is a cutover period during which the
-  new node has taken over, but requests to the old node may still be in flight. How do you handle
+ new node has taken over, but requests to the old node may still be in flight. How do you handle
-  those?
+ those?
 Many distributed data systems rely on a separate coordination service such as ZooKeeper or etcd to
 keep track of shard assignments, as illustrated in [Figure 7-8](/en/ch7#fig_sharding_zookeeper). They use consensus
@ -684,8 +666,7 @@ expensive. Even if you query the shards in parallel, it is prone to tail latency
 shards lets you store more data, but it doesn’t increase your query throughput if every shard has to
 process every query anyway.
-Nevertheless, local secondary indexes are widely used
+Nevertheless, local secondary indexes are widely used [^31]:
 [^31]:
 for example, MongoDB, Riak, Cassandra [^32],
 Elasticsearch [^33], SolrCloud,
 and VoltDB [^34]
@ -742,7 +723,7 @@ indexes, so reads from a global index may be stale (similarly to replication lag
 Nevertheless, global indexes are useful if read throughput is higher than write throughput, and if
 the postings lists are not too long.
-# Summary
+## Summary
 In this chapter we explored different ways of sharding a large dataset into smaller subsets.
 Sharding is necessary when you have so much data that storing and processing it on a single machine
@ -756,20 +737,20 @@ cluster.
 We discussed two main approaches to sharding:
 * *Key range sharding*, where keys are sorted, and a shard owns all the keys from some minimum up to
-  some maximum. Sorting has the advantage that efficient range queries are possible, but there is a
+ some maximum. Sorting has the advantage that efficient range queries are possible, but there is a
-  risk of hot spots if the application often accesses keys that are close together in the sorted
+ risk of hot spots if the application often accesses keys that are close together in the sorted
-  order.
+ order.
-  In this approach, shards are typically rebalanced by splitting the range into two subranges when a
+ In this approach, shards are typically rebalanced by splitting the range into two subranges when a
-  shard gets too big.
+ shard gets too big.
 * *Hash sharding*, where a hash function is applied to each key, and a shard owns a range of hash
-  values (or another consistent hashing algorithm may be used to map hashes to shards). This method
+ values (or another consistent hashing algorithm may be used to map hashes to shards). This method
-  destroys the ordering of keys, making range queries inefficient, but it may distribute load more
+ destroys the ordering of keys, making range queries inefficient, but it may distribute load more
-  evenly.
+ evenly.
-  When sharding by hash, it is common to create a fixed number of shards in advance, to assign several
+ When sharding by hash, it is common to create a fixed number of shards in advance, to assign several
-  shards to each node, and to move entire shards from one node to another when nodes are added or
+ shards to each node, and to move entire shards from one node to another when nodes are added or
-  removed. Splitting shards, like with key ranges, is also possible.
+ removed. Splitting shards, like with key ranges, is also possible.
 It is common to use the first part of the key as the partition key (i.e., to identify the shard),
 and to sort records within that shard by the rest of the key. That way you can still have efficient
@ -779,13 +760,13 @@ We also discussed the interaction between sharding and secondary indexes. A seco
 needs to be sharded, and there are two methods:
 * *Local secondary indexes*, where the secondary indexes are stored
-  in the same shard as the primary key and value. This means that only a single shard needs to be
+ in the same shard as the primary key and value. This means that only a single shard needs to be
-  updated on write, but a lookup of the secondary index requires reading from all shards.
+ updated on write, but a lookup of the secondary index requires reading from all shards.
 * *Global secondary indexes*, which are sharded separately based on
-  the indexed values. An entry in the secondary index may refer to records from all shards of the
+ the indexed values. An entry in the secondary index may refer to records from all shards of the
-  primary key. When a record is written, several secondary index shards may need to be updated;
+ primary key. When a record is written, several secondary index shards may need to be updated;
-  however, a read of the postings list can be served from a single shard (fetching the actual
+ however, a read of the postings list can be served from a single shard (fetching the actual
-  records still requires reading from multiple shards).
+ records still requires reading from multiple shards).
 Finally, we discussed techniques for routing queries to the appropriate shard, and how a
 coordination service is often used to keep track of the assignment of shards to nodes.
@ -795,10 +776,10 @@ to multiple machines. However, operations that need to write to several shards c
 for example, what happens if the write to one shard succeeds, but another fails? We will address
 that question in the following chapters.
 ##### Footnotes
-##### References
+
 ### Summary
 [^1]: Claire Giordano. [Understanding partitioning and sharding in Postgres and Citus](https://www.citusdata.com/blog/2023/08/04/understanding-partitioning-and-sharding-in-postgres-and-citus/). *citusdata.com*, August 2023. Archived at [perma.cc/8BTK-8959](https://perma.cc/8BTK-8959) 
--- a/content/en/ch8.md
+++ b/content/en/ch8.md
--- a/content/en/ch9.md
+++ b/content/en/ch9.md
--- a/content/en/part-ii.md
+++ b/content/en/part-ii.md
@ -105,7 +105,7 @@ Later, in Part III of this book, we will discuss how you can take several (poten
 - [9. The Trouble with Distributed Systems](/en/ch9)
 - [10. Consistency and Consensus](/en/ch10)
-## References
+### References
 1. Ulrich Drepper: “[What Every Programmer Should Know About Memory](https://people.freebsd.org/~lstewart/articles/cpumemory.pdf),” akka‐dia.org, November 21, 2007.
 1. Ben Stopford: “[Shared Nothing vs. Shared Disk Architectures: An Independent View](http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/),” benstopford.com, November 24, 2009.