2
0
Fork 0
mirror of https://github.com/Vonng/ddia.git synced 2026-06-21 17:07:12 +08:00
ddia/content/en/ch5.md
2025-08-10 15:59:50 +08:00

1216 lines
No EOL
87 KiB
Markdown
Raw Blame History

This file contains invisible Unicode characters

This file contains invisible Unicode characters that are indistinguishable to humans but may be processed differently by a computer. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.

---
title: "5. Encoding and Evolution"
weight: 105
breadcrumbs: false
---
![](/map/ch04.png)
> *Everything changes and nothing stands still.*
>
> Heraclitus of Ephesus, as quoted by Plato in *Cratylus* (360 BCE)
Applications inevitably change over time. Features are added or modified as new products are
launched, user requirements become better understood, or business circumstances change. In
[Chapter 2](/en/ch2#ch_nonfunctional) we introduced the idea of *evolvability*: we should aim to build systems that
make it easy to adapt to change (see [“Evolvability: Making Change Easy”](/en/ch2#sec_introduction_evolvability)).
In most cases, a change to an applications features also requires a change to data that it stores:
perhaps a new field or record type needs to be captured, or perhaps existing data needs to be
presented in a new way.
The data models we discussed in [Chapter 3](/en/ch3#ch_datamodels) have different ways of coping with such change.
Relational databases generally assume that all data in the database conforms to one schema: although
that schema can be changed (through schema migrations; i.e., `ALTER` statements), there is exactly
one schema in force at any one point in time. By contrast, schema-on-read (“schemaless”) databases
dont enforce a schema, so the database can contain a mixture of older and newer data formats
written at different times (see [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)).
When a data format or schema changes, a corresponding change to application code often needs to
happen (for example, you add a new field to a record, and the application code starts reading
and writing that field). However, in a large application, code changes often cannot happen
instantaneously:
* With server-side applications you may want to perform a *rolling upgrade*
(also known as a *staged rollout*), deploying the new version to a few nodes at a time, checking
whether the new version is running smoothly, and gradually working your way through all the nodes.
This allows new versions to be deployed without service downtime, and thus encourages more frequent releases and better evolvability.
* With client-side applications youre at the mercy of the user, who may not install the update for some time.
This means that old and new versions of the code, and old and new data formats, may potentially all
coexist in the system at the same time. In order for the system to continue running smoothly, we
need to maintain compatibility in both directions:
Backward compatibility
: Newer code can read data that was written by older code.
Forward compatibility
: Older code can read data that was written by newer code.
Backward compatibility is normally not hard to achieve: as author of the newer code, you know the
format of data written by older code, and so you can explicitly handle it (if necessary by simply
keeping the old code to read the old data). Forward compatibility can be trickier, because it
requires older code to ignore additions made by a newer version of the code.
Another challenge with forward compatibility is illustrated in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
Say you add a field to a record schema, and the newer code creates a record containing that new
field and stores it in a database. Subsequently, an older version of the code (which doesnt yet
know about the new field) reads the record, updates it, and writes it back. In this situation, the
desirable behavior is usually for the old code to keep the new field intact, even though it couldnt
be interpreted. But if the record is decoded into a model object that does not explicitly
preserve unknown fields, data can be lost, like in [Figure 5-1](/en/ch5#fig_encoding_preserve_field).
{{< figure src="/fig/ddia_0501.png" id="fig_encoding_preserve_field" caption="When an older version of the application updates data previously written by a newer version of the application, data may be lost if youre not careful." class="w-full my-4" >}}
In this chapter we will look at several formats for encoding data, including JSON, XML, Protocol
Buffers, and Avro. In particular, we will look at how they handle schema changes and how they
support systems where old and new data and code need to coexist. We will then discuss how those
formats are used for data storage and for communication: in databases, web services, REST APIs,
remote procedure calls (RPC), workflow engines, and event-driven systems such as actors and
message queues.
## Formats for Encoding Data {#sec_encoding_formats}
Programs usually work with data in (at least) two different representations:
1. In memory, data is kept in objects, structs, lists, arrays, hash tables, trees, and so on. These
data structures are optimized for efficient access and manipulation by the CPU (typically using
pointers).
2. When you want to write data to a file or send it over the network, you have to encode it as some
kind of self-contained sequence of bytes (for example, a JSON document). Since a pointer wouldnt
make sense to any other process, this sequence-of-bytes representation often looks quite
different from the data structures that are normally used in memory.
Thus, we need some kind of translation between the two representations. The translation from the
in-memory representation to a byte sequence is called *encoding* (also known as *serialization* or
*marshalling*), and the reverse is called *decoding* (*parsing*, *deserialization*,
*unmarshalling*).
--------
> [!TIP] TERMINOLOGY CLASH
*Serialization* is unfortunately also used in the context of transactions (see [Chapter 8](/en/ch8#ch_transactions)),
with a completely different meaning. To avoid overloading the word well stick with *encoding* in
this book, even though *serialization* is perhaps a more common term.
--------
There are exceptions in which encoding/decoding is not needed—for example, when a database operates
directly on compressed data loaded from disk, as discussed in [“Query Execution: Compilation and Vectorization”](/en/ch4#sec_storage_vectorized). There are
also *zero-copy* data formats that are designed to be used both at runtime and on disk/on the
network, without an explicit conversion step, such as Capn Proto and FlatBuffers.
However, most systems need to convert between in-memory objects and flat byte sequences. As this is
such a common problem, there are a myriad different libraries and encoding formats to choose from.
Lets do a brief overview.
### Language-Specific Formats {#id96}
Many programming languages come with built-in support for encoding in-memory objects into byte
sequences. For example, Java has `java.io.Serializable`, Python has `pickle`, Ruby has `Marshal`,
and so on. Many third-party libraries also exist, such as Kryo for Java.
These encoding libraries are very convenient, because they allow in-memory objects to be saved and
restored with minimal additional code. However, they also have a number of deep problems:
* The encoding is often tied to a particular programming language, and reading the data in another
language is very difficult. If you store or transmit data in such an encoding, you are committing
yourself to your current programming language for potentially a very long time, and precluding
integrating your systems with those of other organizations (which may use different languages).
* In order to restore data in the same object types, the decoding process needs to be able to
instantiate arbitrary classes. This is frequently a source of security problems [^1]:
if an attacker can get your application to decode an arbitrary byte sequence, they can instantiate
arbitrary classes, which in turn often allows them to do terrible things such as remotely
executing arbitrary code [^2] [^3].
* Versioning data is often an afterthought in these libraries: as they are intended for quick and
easy encoding of data, they often neglect the inconvenient problems of forward and backward compatibility [^4].
* Efficiency (CPU time taken to encode or decode, and the size of the encoded structure) is also
often an afterthought. For example, Javas built-in serialization is notorious for its bad
performance and bloated encoding [^5].
For these reasons its generally a bad idea to use your languages built-in encoding for anything
other than very transient purposes.
### JSON, XML, and Binary Variants {#sec_encoding_json}
When moving to standardized encodings that can be written and read by many programming languages, JSON
and XML are the obvious contenders. They are widely known, widely supported, and almost as widely
disliked. XML is often criticized for being too verbose and unnecessarily complicated [^6].
JSONs popularity is mainly due to its built-in support in web browsers and simplicity relative to
XML. CSV is another popular language-independent format, but it only supports tabular data without
nesting.
JSON, XML, and CSV are textual formats, and thus somewhat human-readable (although the syntax is a
popular topic of debate). Besides the superficial syntactic issues, they also have some subtle
problems:
* There is a lot of ambiguity around the encoding of numbers. In XML and CSV, you cannot distinguish
between a number and a string that happens to consist of digits (except by referring to an external
schema). JSON distinguishes strings and numbers, but it doesnt distinguish integers and
floating-point numbers, and it doesnt specify a precision.
This is a problem when dealing with large numbers; for example, integers greater than 253 cannot
be exactly represented in an IEEE 754 double-precision floating-point number, so such numbers become
inaccurate when parsed in a language that uses floating-point numbers, such as JavaScript [^7].
An example of numbers larger than 253 occurs on X (formerly Twitter), which uses a 64-bit number to
identify each post. The JSON returned by the API includes post IDs twice, once as a JSON number and
once as a decimal string, to work around the fact that the numbers are not correctly parsed by
JavaScript applications [^8].
* JSON and XML have good support for Unicode character strings (i.e., human-readable text), but they
dont support binary strings (sequences of bytes without a character encoding). Binary strings are a
useful feature, so people get around this limitation by encoding the binary data as text using
Base64. The schema is then used to indicate that the value should be interpreted as Base64-encoded.
This works, but its somewhat hacky and increases the data size by 33%.
* XML Schema and JSON Schema are powerful, and thus quite
complicated to learn and implement. Since the correct interpretation of data (such as numbers and
binary strings) depends on information in the schema, applications that dont use XML/JSON schemas
need to potentially hard-code the appropriate encoding/decoding logic instead.
* CSV does not have any schema, so it is up to the application to define the meaning of each row and
column. If an application change adds a new row or column, you have to handle that change manually.
CSV is also a quite vague format (what happens if a value contains a comma or a newline character?).
Although its escaping rules have been formally specified [^9],
not all parsers implement them correctly.
Despite these flaws, JSON, XML, and CSV are good enough for many purposes. Its likely that they will
remain popular, especially as data interchange formats (i.e., for sending data from one organization to
another). In these situations, as long as people agree on what the format is, it often doesnt
matter how pretty or efficient the format is. The difficulty of getting different organizations to
agree on *anything* outweighs most other concerns.
#### JSON Schema {#json-schema}
JSON Schema has become widely adopted as a way to model data whenever its exchanged between systems
or written to storage. Youll find JSON schemas in web services (see [“Web services”](/en/ch5#sec_web_services)) as part
of the OpenAPI web service specification, schema registries such as Confluents Schema Registry and
Red Hats Apicurio Registry, and in databases such as PostgreSQLs pg\_jsonschema validator extension
and MongoDBs `$jsonSchema` validator syntax.
The JSON Schema specification offers a number of features. Schemas include standard primitive types
including strings, numbers, integers, objects, arrays, booleans, or nulls. But JSON Schema also
offers a separate validation specification that allows developers to overlay constraints on fields.
For example, a `port` field might have a minimum of 1 and a maximum of 65535.
JSON Schemas can have either open or closed content models. An open content model permits any field
not defined in the schema to exist with any data type, whereas a closed content model only allows
fields that are explicitly defined. The open content model in JSON Schema is enabled when
`additionalProperties` is set to `true`, which is the default. Thus, JSON Schemas are usually a
definition of what *isnt* permitted (namely, invalid values on any of the defined fields), rather
than what *is* permitted in a schema.
Open content models are powerful, but can be complex. For example, say you want to define a map from
integers (such as IDs) to strings. JSON does not have a map or dictionary type, only an “object”
type that can contain string keys, and values of any type. You can then constrain this type with
JSON Schema so that keys may only contain digits, and values can only be strings, using
`patternProperties` and `additionalProperties` as shown in [Example 5-1](/en/ch5#fig_encoding_json_schema).
{{< figure id="fig_encoding_json_schema" title="Example 5-1. Example JSON Schema with integer keys and string values. Integer keys are represented as strings containing only integers since JSON Schema requires all keys to be strings." class="w-full my-4" >}}
```json
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"patternProperties": {
"^[0-9]+$": {
"type": "string"
}
},
"additionalProperties": false
}
```
In addition to open and closed content models and validators, JSON Schema supports conditional
if/else schema logic, named types, references to remote schemas, and much more. All of this makes
for a very powerful schema language. Such features also make for unwieldy definitions. It can be
challenging to resolve remote schemas, reason about conditional rules, or evolve schemas in a
forwards or backwards compatible way [^10]. Similar concerns apply to XML Schema [^11].
#### Binary encoding {#binary-encoding}
JSON is less verbose than XML, but both still use a lot of space compared to binary formats. This
observation led to the development of a profusion of binary encodings for JSON (MessagePack, CBOR,
BSON, BJSON, UBJSON, BISON, Hessian, and Smile, to name a few) and for XML (WBXML and Fast Infoset,
for example). These formats have been adopted in various niches, as they are more compact and
sometimes faster to parse, but none of them are as widely adopted as the textual versions of JSON and XML [^12].
Some of these formats extend the set of datatypes (e.g., distinguishing integers and floating-point numbers,
or adding support for binary strings), but otherwise they keep the JSON/XML data model unchanged. In
particular, since they dont prescribe a schema, they need to include all the object field names within
the encoded data. That is, in a binary encoding of the JSON document in [Example 5-2](/en/ch5#fig_encoding_json), they
will need to include the strings `userName`, `favoriteNumber`, and `interests` somewhere.
{{< figure id="fig_encoding_json" title="Example 5-2. Example record which we will encode in several binary formats in this chapter" class="w-full my-4" >}}
```json
{
"userName": "Martin",
"favoriteNumber": 1337,
"interests": ["daydreaming", "hacking"]
}
```
Lets look at an example of MessagePack, a binary encoding for JSON. [Figure 5-2](/en/ch5#fig_encoding_messagepack)
shows the byte sequence that you get if you encode the JSON document in [Example 5-2](/en/ch5#fig_encoding_json) with
MessagePack. The first few bytes are as follows:
1. The first byte, `0x83`, indicates that what follows is an object (top four bits = `0x80`) with three
fields (bottom four bits = `0x03`). (In case youre wondering what happens if an object has more
than 15 fields, so that the number of fields doesnt fit in four bits, it then gets a different type
indicator, and the number of fields is encoded in two or four bytes.)
2. The second byte, `0xa8`, indicates that what follows is a string (top four bits = `0xa0`) that is eight
bytes long (bottom four bits = `0x08`).
3. The next eight bytes are the field name `userName` in ASCII. Since the length was indicated
previously, theres no need for any marker to tell us where the string ends (or any escaping).
4. The next seven bytes encode the six-letter string value `Martin` with a prefix `0xa6`, and so on.
The binary encoding is 66 bytes long, which is only a little less than the 81 bytes taken by the
textual JSON encoding (with whitespace removed). All the binary encodings of JSON are similar in
this regard. Its not clear whether such a small space reduction (and perhaps a speedup in parsing)
is worth the loss of human-readability.
In the following sections we will see how we can do much better, and encode the same record in just 32 bytes.
{{< figure link="#fig_encoding_json" src="/fig/ddia_0502.png" id="fig_encoding_messagepack" caption="Figure 5-2. Example record Example 5-2 encoded using MessagePack." class="w-full my-4" >}}
### Protocol Buffers {#sec_encoding_protobuf}
Protocol Buffers (protobuf) is a binary encoding library developed at Google.
It is similar to Apache Thrift, which was originally developed by Facebook [^13];
most of what this section says about Protocol Buffers applies also to Thrift.
Protocol Buffers requires a schema for any data that is encoded. To encode the data
in [Example 5-2](/en/ch5#fig_encoding_json) in Protocol Buffers, you would describe the schema in the Protocol Buffers
interface definition language (IDL) like this:
```protobuf
syntax = "proto3";
message Person {
string user_name = 1;
int64 favorite_number = 2;
repeated string interests = 3;
}
```
Protocol Buffers comes with a code generation tool that takes a schema definition like the one
shown here, and produces classes that implement the schema in various programming languages. Your
application code can call this generated code to encode or decode records of the schema. The schema
language is very simple compared to JSON Schema: it only defines the fields of records and their
types, but it does not support other restrictions on the possible values of fields.
Encoding [Example 5-2](/en/ch5#fig_encoding_json) using a Protocol Buffers encoder requires 33 bytes, as shown in [Figure 5-3](/en/ch5#fig_encoding_protobuf) [^14].
{{< figure src="/fig/ddia_0503.png" id="fig_encoding_protobuf" caption="Figure 5-3. Example record encoded using Protocol Buffers." class="w-full my-4" >}}
Similarly to [Figure 5-2](/en/ch5#fig_encoding_messagepack), each field has a type annotation (to indicate whether it
is a string, integer, etc.) and, where required, a length indication (such as the length of a
string). The strings that appear in the data (“Martin”, “daydreaming”, “hacking”) are also encoded
as ASCII (to be precise, UTF-8), similar to before.
The big difference compared to [Figure 5-2](/en/ch5#fig_encoding_messagepack) is that there are no field names
(`userName`, `favoriteNumber`, `interests`). Instead, the encoded data contains *field tags*, which
are numbers (`1`, `2`, and `3`). Those are the numbers that appear in the schema definition. Field tags
are like aliases for fields—they are a compact way of saying what field were talking about,
without having to spell out the field name.
As you can see, Protocol Buffers saves even more space by packing the field type and tag number into
a single byte. It uses variable-length integers: the number 1337 is encoded in two bytes, with the
top bit of each byte used to indicate whether there are still more bytes to come. This means numbers
between 64 and 63 are encoded in one byte, numbers between 8192 and 8191 are encoded in two bytes,
etc. Bigger numbers use more bytes.
Protocol Buffers doesnt have an explicit list or array datatype. Instead, the `repeated` modifier
on the `interests` field indicates that the field contains a list of values, rather than a single
value. In the binary encoding, the list elements are represented simply as repeated occurrences of
the same field tag within the same record.
#### Field tags and schema evolution {#field-tags-and-schema-evolution}
We said previously that schemas inevitably need to change over time. We call this *schema
evolution*. How does Protocol Buffers handle schema changes while keeping backward and forward compatibility?
As you can see from the examples, an encoded record is just the concatenation of its encoded fields.
Each field is identified by its tag number (the numbers `1`, `2`, `3` in the sample schema) and
annotated with a datatype (e.g., string or integer). If a field value is not set, it is simply
omitted from the encoded record. From this you can see that field tags are critical to the meaning
of the encoded data. You can change the name of a field in the schema, since the encoded data never
refers to field names, but you cannot change a fields tag, since that would make all existing
encoded data invalid.
You can add new fields to the schema, provided that you give each field a new tag number. If old
code (which doesnt know about the new tag numbers you added) tries to read data written by new
code, including a new field with a tag number it doesnt recognize, it can simply ignore that field.
The datatype annotation allows the parser to determine how many bytes it needs to skip, and preserve
the unknown fields to avoid the problem in [Figure 5-1](/en/ch5#fig_encoding_preserve_field). This maintains forward
compatibility: old code can read records that were written by new code.
What about backward compatibility? As long as each field has a unique tag number, new code can
always read old data, because the tag numbers still have the same meaning. If a field was added in
the new schema, and you read old data that does not yet contain that field, it is filled in with a
default value (for example, the empty string if the field type is string, or zero if its a number).
Removing a field is just like adding a field, with backward and forward compatibility concerns
reversed. You can never use the same tag number again, because you may still have data written
somewhere that includes the old tag number, and that field must be ignored by new code. Tag numbers
used in the past can be reserved in the schema definition to ensure they are not forgotten.
What about changing the datatype of a field? That is possible with some types—check the
documentation for details—but there is a risk that values will get truncated. For example, say you
change a 32-bit integer into a 64-bit integer. New code can easily read data written by old code,
because the parser can fill in any missing bits with zeros. However, if old code reads data written
by new code, the old code is still using a 32-bit variable to hold the value. If the decoded 64-bit
value wont fit in 32 bits, it will be truncated.
### Avro {#sec_encoding_avro}
Apache Avro is another binary encoding format that is interestingly different from Protocol Buffers.
It was started in 2009 as a subproject of Hadoop, as a result of Protocol Buffers not being a good
fit for Hadoops use cases [^15].
Avro also uses a schema to specify the structure of the data being encoded. It has two schema
languages: one (Avro IDL) intended for human editing, and one (based on JSON) that is more easily
machine-readable. Like Protocol Buffers, this schema language specifies only fields and their types,
and not complex validation rules like in JSON Schema.
Our example schema, written in Avro IDL, might look like this:
```c
record Person {
string userName;
union { null, long } favoriteNumber = null;
array<string> interests;
}
```
The equivalent JSON representation of that schema is as follows:
```c
{
"type": "record",
"name": "Person",
"fields": [
{"name": "userName", "type": "string"},
{"name": "favoriteNumber", "type": ["null", "long"], "default": null},
{"name": "interests", "type": {"type": "array", "items": "string"}}
]
}
```
First of all, notice that there are no tag numbers in the schema. If we encode our example record
([Example 5-2](/en/ch5#fig_encoding_json)) using this schema, the Avro binary encoding is just 32 bytes long—the
most compact of all the encodings we have seen. The breakdown of the encoded byte sequence is shown
in [Figure 5-4](/en/ch5#fig_encoding_avro).
If you examine the byte sequence, you can see that there is nothing to identify fields or their
datatypes. The encoding simply consists of values concatenated together. A string is just a length
prefix followed by UTF-8 bytes, but theres nothing in the encoded data that tells you that it is a
string. It could just as well be an integer, or something else entirely. An integer is encoded using
a variable-length encoding.
{{< figure src="/fig/ddia_0504.png" id="fig_encoding_avro" caption="Figure 5-4. Example record encoded using Avro." class="w-full my-4" >}}
To parse the binary data, you go through the fields in the order that they appear in the schema and
use the schema to tell you the datatype of each field. This means that the binary data can only be
decoded correctly if the code reading the data is using the *exact same schema* as the code that
wrote the data. Any mismatch in the schema between the reader and the writer would mean incorrectly
decoded data.
So, how does Avro support schema evolution?
#### The writers schema and the readers schema {#the-writers-schema-and-the-readers-schema}
When an application wants to encode some data (to write it to a file or database, to send it over
the network, etc.), it encodes the data using whatever version of the schema it knows about—for
example, that schema may be compiled into the application. This is known as the *writers schema*.
When an application wants to decode some data (read it from a file or database, receive it from the
network, etc.), it uses two schemas: the writers schema that is identical to the one used for
encoding, and the *readers schema*, which may be different. This is illustrated in
[Figure 5-5](/en/ch5#fig_encoding_avro_schemas). The readers schema defines the fields of each record that the
application code is expecting, and their types.
{{< figure src="/fig/ddia_0505.png" id="fig_encoding_avro_schemas" caption="Figure 5-5. In Protocol Buffers, encoding and decoding can use different versions of a schema. In Avro, decoding uses two schemas: the writer's schema must be identical to the one used for encoding, but the reader's schema can be an older or newer version." class="w-full my-4" >}}
If the readers and writers schema are the same, decoding is easy. If they are different, Avro
resolves the differences by looking at the writers schema and the readers schema side by side and
translating the data from the writers schema into the readers schema. The Avro specification [^16] [^17]
defines exactly how this resolution works, and it is illustrated in [Figure 5-6](/en/ch5#fig_encoding_avro_resolution).
For example, its no problem if the writers schema and the readers schema have their fields in a
different order, because the schema resolution matches up the fields by field name. If the code
reading the data encounters a field that appears in the writers schema but not in the readers
schema, it is ignored. If the code reading the data expects some field, but the writers schema does
not contain a field of that name, it is filled in with a default value declared in the readers
schema.
{{< figure src="/fig/ddia_0506.png" id="fig_encoding_avro_resolution" caption="Figure 5-6. An Avro reader resolves differences between the writer's schema and the reader's schema." class="w-full my-4" >}}
#### Schema evolution rules {#schema-evolution-rules}
With Avro, forward compatibility means that you can have a new version of the schema as writer and
an old version of the schema as reader. Conversely, backward compatibility means that you can have a
new version of the schema as reader and an old version as writer.
To maintain compatibility, you may only add or remove a field that has a default value. (The field
`favoriteNumber` in our Avro schema has a default value of `null`.) For example, say you add a
field with a default value, so this new field exists in the new schema but not the old one. When a
reader using the new schema reads a record written with the old schema, the default value is filled
in for the missing field.
If you were to add a field that has no default value, new readers wouldnt be able to read data
written by old writers, so you would break backward compatibility. If you were to remove a field
that has no default value, old readers wouldnt be able to read data written by new writers, so you
would break forward compatibility.
In some programming languages, `null` is an acceptable default for any variable, but this is not the
case in Avro: if you want to allow a field to be null, you have to use a *union type*. For example,
`union { null, long, string } field;` indicates that `field` can be a number, or a string, or null.
You can only use `null` as a default value if it is the first branch of the union. This is a little
more verbose than having everything nullable by default, but it helps prevent bugs by being explicit
about what can and cannot be null [^18].
Changing the datatype of a field is possible, provided that Avro can convert the type. Changing the
name of a field is possible but a little tricky: the readers schema can contain aliases for field
names, so it can match an old writers schema field names against the aliases. This means that
changing a field name is backward compatible but not forward compatible. Similarly, adding a branch
to a union type is backward compatible but not forward compatible.
#### But what is the writers schema? {#but-what-is-the-writers-schema}
There is an important question that weve glossed over so far: how does the reader know the writers
schema with which a particular piece of data was encoded? We cant just include the entire schema
with every record, because the schema would likely be much bigger than the encoded data, making all
the space savings from the binary encoding futile.
The answer depends on the context in which Avro is being used. To give a few examples:
Large file with lots of records
: A common use for Avro is for storing a large file containing millions of records, all encoded with
the same schema. (We will discuss this kind of situation in [Link to Come].) In this case, the
writer of that file can just include the writers schema once at the beginning of the file. Avro
specifies a file format (object container files) to do this.
Database with individually written records
: In a database, different records may be written at different points in time using different
writers schemas—you cannot assume that all the records will have the same schema. The simplest
solution is to include a version number at the beginning of every encoded record, and to keep a
list of schema versions in your database. A reader can fetch a record, extract the version number,
and then fetch the writers schema for that version number from the database. Using that writers
schema, it can decode the rest of the record.
Confluents schema registry for Apache Kafka [^19] and LinkedIns Espresso [^20] work this way, for example.
Sending records over a network connection
: When two processes are communicating over a bidirectional network connection, they can negotiate
the schema version on connection setup and then use that schema for the lifetime of the
connection. The Avro RPC protocol (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc)) works like this.
A database of schema versions is a useful thing to have in any case, since it acts as documentation
and gives you a chance to check schema compatibility [^21].
As the version number, you could use a simple incrementing integer, or you could use a hash of the schema.
#### Dynamically generated schemas {#dynamically-generated-schemas}
One advantage of Avros approach, compared to Protocol Buffers, is that the schema doesnt contain
any tag numbers. But why is this important? Whats the problem with keeping a couple of numbers in
the schema?
The difference is that Avro is friendlier to *dynamically generated* schemas. For example, say
you have a relational database whose contents you want to dump to a file, and you want to use a
binary format to avoid the aforementioned problems with textual formats (JSON, CSV, XML). If you use
Avro, you can fairly easily generate an Avro schema (in the JSON representation we saw earlier) from the
relational schema and encode the database contents using that schema, dumping it all to an Avro
object container file [^22].
You can generate a record schema for each database table, and each column becomes a field in that
record. The column name in the database maps to the field name in Avro.
Now, if the database schema changes (for example, a table has one column added and one column
removed), you can just generate a new Avro schema from the updated database schema and export data in
the new Avro schema. The data export process does not need to pay any attention to the schema
change—it can simply do the schema conversion every time it runs. Anyone who reads the new data
files will see that the fields of the record have changed, but since the fields are identified by
name, the updated writers schema can still be matched up with the old readers schema.
By contrast, if you were using Protocol Buffers for this purpose, the field tags would likely have
to be assigned by hand: every time the database schema changes, an administrator would have to
manually update the mapping from database column names to field tags. (It might be possible to
automate this, but the schema generator would have to be very careful to not assign previously used
field tags.) This kind of dynamically generated schema simply wasnt a design goal of Protocol
Buffers, whereas it was for Avro.
### The Merits of Schemas {#sec_encoding_schemas}
As we saw, Protocol Buffers and Avro both use a schema to describe a binary encoding format. Their
schema languages are much simpler than XML Schema or JSON Schema, which support much more detailed
validation rules (e.g., “the string value of this field must match this regular expression” or “the
integer value of this field must be between 0 and 100”). As Protocol Buffers and Avro are simpler to
implement and simpler to use, they have grown to support a fairly wide range of programming
languages.
The ideas on which these encodings are based are by no means new. For example, they have a lot in
common with ASN.1, a schema definition language that was first standardized in 1984 [^23] [^24].
It was used to define various network protocols, and its binary encoding (DER) is still used to encode
SSL certificates (X.509), for example [^25].
ASN.1 supports schema evolution using tag numbers, similar to Protocol Buffers [^26].
However, its also very complex and badly documented, so ASN.1 is probably not a good choice for new applications.
Many data systems also implement some kind of proprietary binary encoding for their data. For
example, most relational databases have a network protocol over which you can send queries to the
database and get back responses. Those protocols are generally specific to a particular database,
and the database vendor provides a driver (e.g., using the ODBC or JDBC APIs) that decodes responses
from the databases network protocol into in-memory data structures.
So, we can see that although textual data formats such as JSON, XML, and CSV are widespread, binary
encodings based on schemas are also a viable option. They have a number of nice properties:
* They can be much more compact than the various “binary JSON” variants, since they can omit field
names from the encoded data.
* The schema is a valuable form of documentation, and because the schema is required for decoding,
you can be sure that it is up to date (whereas manually maintained documentation may easily
diverge from reality).
* Keeping a database of schemas allows you to check forward and backward compatibility of schema
changes, before anything is deployed.
* For users of statically typed programming languages, the ability to generate code from the schema
is useful, since it enables type-checking at compile time.
In summary, schema evolution allows the same kind of flexibility as schemaless/schema-on-read JSON
databases provide (see [“Schema flexibility in the document model”](/en/ch3#sec_datamodels_schema_flexibility)), while also providing better
guarantees about your data and better tooling.
## Modes of Dataflow {#sec_encoding_dataflow}
At the beginning of this chapter we said that whenever you want to send some data to another process
with which you dont share memory—for example, whenever you want to send data over the network or
write it to a file—you need to encode it as a sequence of bytes. We then discussed a variety of
different encodings for doing this.
We talked about forward and backward compatibility, which are important for evolvability (making
change easy by allowing you to upgrade different parts of your system independently, and not having
to change everything at once). Compatibility is a relationship between one process that encodes the
data, and another process that decodes it.
Thats a fairly abstract idea—there are many ways data can flow from one process to another.
Who encodes the data, and who decodes it? In the rest of this chapter we will explore some of the
most common ways how data flows between processes:
* Via databases (see [“Dataflow Through Databases”](/en/ch5#sec_encoding_dataflow_db))
* Via service calls (see [“Dataflow Through Services: REST and RPC”](/en/ch5#sec_encoding_dataflow_rpc))
* Via workflow engines (see [“Durable Execution and Workflows”](/en/ch5#sec_encoding_dataflow_workflows))
* Via asynchronous messages (see [“Event-Driven Architectures”](/en/ch5#sec_encoding_dataflow_msg))
### Dataflow Through Databases {#sec_encoding_dataflow_db}
In a database, the process that writes to the database encodes the data, and the process that reads
from the database decodes it. There may just be a single process accessing the database, in which
case the reader is simply a later version of the same process—in that case you can think of
storing something in the database as *sending a message to your future self*.
Backward compatibility is clearly necessary here; otherwise your future self wont be able to decode
what you previously wrote.
In general, its common for several different processes to be accessing a database at the same time.
Those processes might be several different applications or services, or they may simply be several
instances of the same service (running in parallel for scalability or fault tolerance). Either way,
in an environment where the application is changing, it is likely that some processes accessing the
database will be running newer code and some will be running older code—for example because a new
version is currently being deployed in a rolling upgrade, so some instances have been updated while
others havent yet.
This means that a value in the database may be written by a *newer* version of the code, and
subsequently read by an *older* version of the code that is still running. Thus, forward
compatibility is also often required for databases.
#### Different values written at different times {#different-values-written-at-different-times}
A database generally allows any value to be updated at any time. This means that within a single
database you may have some values that were written five milliseconds ago, and some values that were
written five years ago.
When you deploy a new version of your application (of a server-side application, at least), you may
entirely replace the old version with the new version within a few minutes. The same is not true of
database contents: the five-year-old data will still be there, in the original encoding, unless you
have explicitly rewritten it since then. This observation is sometimes summed up as *data outlives
code*.
Rewriting (*migrating*) data into a new schema is certainly possible, but its an expensive thing to
do on a large dataset, so most databases avoid it if possible. Most relational databases allow
simple schema changes, such as adding a new column with a `null` default value, without rewriting
existing data. When an old row is read, the database fills in `null`s for any columns that are
missing from the encoded data on disk.
Schema evolution thus allows the entire database to appear as if it was encoded with a single
schema, even though the underlying storage may contain records encoded with various historical
versions of the schema.
More complex schema changes—for example, changing a single-valued attribute to be multi-valued, or
moving some data into a separate table—still require data to be rewritten, often at the application level [^27].
Maintaining forward and backward compatibility across such migrations is still a research problem [^28].
#### Archival storage {#archival-storage}
Perhaps you take a snapshot of your database from time to time, say for backup purposes or for
loading into a data warehouse (see [“Data Warehousing”](/en/ch1#sec_introduction_dwh)). In this case, the data dump will typically
be encoded using the latest schema, even if the original encoding in the source database contained a
mixture of schema versions from different eras. Since youre copying the data anyway, you might as
well encode the copy of the data consistently.
As the data dump is written in one go and is thereafter immutable, formats like Avro object
container files are a good fit. This is also a good opportunity to encode the data in an
analytics-friendly column-oriented format such as Parquet (see [“Column Compression”](/en/ch4#sec_storage_column_compression)).
In [Link to Come] we will talk more about using data in archival storage.
### Dataflow Through Services: REST and RPC {#sec_encoding_dataflow_rpc}
When you have processes that need to communicate over a network, there are a few different ways of
arranging that communication. The most common arrangement is to have two roles: *clients* and
*servers*. The servers expose an API over the network, and the clients can connect to the servers
to make requests to that API. The API exposed by the server is known as a *service*.
The web works this way: clients (web browsers) make requests to web servers, making `GET` requests
to download HTML, CSS, JavaScript, images, etc., and making `POST` requests to submit data to the
server. The API consists of a standardized set of protocols and data formats (HTTP, URLs, SSL/TLS,
HTML, etc.). Because web browsers, web servers, and website authors mostly agree on these standards,
you can use any web browser to access any website (at least in theory!).
Web browsers are not the only type of client. For example, native apps running on mobile devices and
desktop computers often talk to servers, and client-side JavaScript applications running inside web
browsers can also make HTTP requests.
In this case, the servers response is typically not HTML for displaying to a human, but rather data
in an encoding that is convenient for further processing by the client-side application code (most
often JSON). Although HTTP may be used as the transport protocol, the API implemented on top is
application-specific, and the client and server need to agree on the details of that API.
In some ways, services are similar to databases: they typically allow clients to submit and query
data. However, while databases allow arbitrary queries using the query languages we discussed in
[Chapter 3](/en/ch3#ch_datamodels), services expose an application-specific API that only allows inputs and outputs
that are predetermined by the business logic (application code) of the service [^29]. This restriction provides a degree of encapsulation: services can impose
fine-grained restrictions on what clients can and cannot do.
A key design goal of a service-oriented/microservices architecture is to make the application easier
to change and maintain by making services independently deployable and evolvable. A common principle
is that each service should be owned by one team, and that team should be able to release new
versions of the service frequently, without having to coordinate with other teams. We should
therefore expect old and new versions of servers and clients to be running at the same time, and so
the data encoding used by servers and clients must be compatible across versions of the service API.
#### Web services {#sec_web_services}
When HTTP is used as the underlying protocol for talking to the service, it is called a *web
service*. Web services are commonly used when building a service oriented or microservices
architecture (discussed earlier in [“Microservices and Serverless”](/en/ch1#sec_introduction_microservices)). The term “web service” is
perhaps a slight misnomer, because web services are not only used on the web, but in several
different contexts. For example:
1. A client application running on a users device (e.g., a native app on a mobile device, or a
JavaScript web app in a browser) making requests to a service over HTTP. These requests typically go over the public internet.
2. One service making requests to another service owned by the same organization, often located
within the same datacenter, as part of a service-oriented/microservices architecture.
3. One service making requests to a service owned by a different organization, usually via the
internet. This is used for data exchange between different organizations backend systems. This
category includes public APIs provided by online services, such as credit card processing
systems, or OAuth for shared access to user data.
The most popular service design philosophy is REST, which builds upon the principles of HTTP [^30] [^31].
It emphasizes simple data formats, using URLs for identifying resources and using HTTP features for
cache control, authentication, and content type negotiation. An API designed according to the
principles of REST is called *RESTful*.
Code that needs to invoke a web service API must know which HTTP endpoint to query, and what data
format to send and expect in response. Even if a service adopts RESTful design principles, clients
need to somehow find out these details. Service developers often use an interface definition
language (IDL) to define and document their services API endpoints and data models, and to evolve
them over time. Other developers can then use the service definition to determine how to query the
service. The two most popular service IDLs are OpenAPI (also known as Swagger [^32])
and gRPC. OpenAPI is used for web services that send and receive JSON data, while gRPC services send
and receive Protocol Buffers.
Developers typically write OpenAPI service definitions in JSON or YAML; see [Example 5-3](/en/ch5#fig_open_api_def).
The service definition allows developers to define service endpoints, documentation, versions, data
models, and much more. gRPC definitions look similar, but are defined using Protocol Buffers service definitions.
{{< figure id="fig_open_api_def" title="Example 5-3. Example OpenAPI service definition in YAML" class="w-full my-4" >}}
```yaml
openapi: 3.0.0
info:
title: Ping, Pong
version: 1.0.0
servers:
- url: http://localhost:8080
paths:
/ping:
get:
summary: Given a ping, returns a pong message
responses:
'200':
description: A pong
content:
application/json:
schema:
type: object
properties:
message:
type: string
example: Pong!
```
Even if a design philosophy and IDL are adopted, developers must still write the code that
implements their services API calls. A service framework is often adopted to simplify this
effort. Service frameworks such as Spring Boot, FastAPI, and gRPC allow developers to write the
business logic for each API endpoint while the framework code handles routing, metrics, caching,
authentication, and so on. [Example 5-4](/en/ch5#fig_fastapi_def) shows an example Python implementation of the service
defined in [Example 5-3](/en/ch5#fig_open_api_def).
{{< figure id="fig_fastapi_def" title="Example 5-4. Example FastAPI service implementing the definition from [Example 5-3](/en/ch5#fig_open_api_def)" class="w-full my-4" >}}
```python
from fastapi import FastAPI
from pydantic import BaseModel
app = FastAPI(title="Ping, Pong", version="1.0.0")
class PongResponse(BaseModel):
message: str = "Pong!"
@app.get("/ping", response_model=PongResponse,
summary="Given a ping, returns a pong message")
async def ping():
return PongResponse()
```
Many frameworks couple service definitions and server code together. In some cases, such as with the
popular Python FastAPI framework, servers are written in code and an IDL is generated automatically.
In other cases, such as with gRPC, the service definition is written first, and server code
scaffolding is generated. Both approaches allow developers to generate client libraries and SDKs
in a variety of languages from the service definition. In addition to code generation, IDL tools
such as Swaggers can generate documentation, verify schema change compatibility, and provide a
graphical user interfaces for developers to query and test services.
#### The problems with remote procedure calls (RPCs) {#sec_problems_with_rpc}
Web services are merely the latest incarnation of a long line of technologies for making API
requests over a network, many of which received a lot of hype but have serious problems. Enterprise
JavaBeans (EJB) and Javas Remote Method Invocation (RMI) are limited to Java. The Distributed
Component Object Model (DCOM) is limited to Microsoft platforms. The Common Object Request Broker
Architecture (CORBA) is excessively complex, and does not provide backward or forward compatibility [^33].
SOAP and the WS-\* web services framework aim to provide interoperability across vendors, but are
also plagued by complexity and compatibility problems [^34] [^35] [^36].
All of these are based on the idea of a *remote procedure call* (RPC), which has been around since the 1970s [^37].
The RPC model tries to make a request to a remote network service look the same as calling a function or
method in your programming language, within the same process (this abstraction is called *location
transparency*). Although RPC seems convenient at first, the approach is fundamentally flawed [^38] [^39].
A network request is very different from a local function call:
* A local function call is predictable and either succeeds or fails, depending only on parameters
that are under your control. A network request is unpredictable: the request or response may be
lost due to a network problem, or the remote machine may be slow or unavailable, and such problems
are entirely outside of your control. Network problems are common, so you have to anticipate them,
for example by retrying a failed request.
* A local function call either returns a result, or throws an exception, or never returns (because
it goes into an infinite loop or the process crashes). A network request has another possible
outcome: it may return without a result, due to a *timeout*. In that case, you simply dont know
what happened: if you dont get a response from the remote service, you have no way of knowing
whether the request got through or not. (We discuss this issue in more detail in [Chapter 9](/en/ch9#ch_distributed).)
* If you retry a failed network request, it could happen that the previous request actually got
through, and only the response was lost. In that case, retrying will cause the action to
be performed multiple times, unless you build a mechanism for deduplication (*idempotence*) into the protocol [^40].
Local function calls dont have this problem. (We discuss idempotence in more detail in [Link to Come].)
* Every time you call a local function, it normally takes about the same time to execute. A network
request is much slower than a function call, and its latency is also wildly variable: at good
times it may complete in less than a millisecond, but when the network is congested or the remote
service is overloaded it may take many seconds to do exactly the same thing.
* When you call a local function, you can efficiently pass it references (pointers) to objects in
local memory. When you make a network request, all those parameters need to be encoded into a
sequence of bytes that can be sent over the network. Thats okay if the parameters are immutable
primitives like numbers or short strings, but it quickly becomes problematic with larger amounts
of data and mutable objects.
* The client and the service may be implemented in different programming languages, so the RPC
framework must translate datatypes from one language into another. This can end up ugly, since not
all languages have the same types—recall JavaScripts problems with numbers greater than 253,
for example (see [“JSON, XML, and Binary Variants”](/en/ch5#sec_encoding_json)).
This problem doesnt exist in a single process written in a single language.
All of these factors mean that theres no point trying to make a remote service look too much like a
local object in your programming language, because its a fundamentally different thing. Part of the
appeal of REST is that it treats state transfer over a network as a process that is distinct from a
function call.
#### Load balancers, service discovery, and service meshes {#sec_encoding_service_discovery}
All services communicate over the network. For this reason, a client must know the address of the
service its connecting to—a problem known as *service discovery*. The simplest approach is to
configure a client to connect to the IP address and port where the service is running. This
configuration will work, but if the server goes offline, is transferred to a new machine, or becomes
overloaded, the client has to be manually reconfigured.
To provide higher availability and scalability, there are usually multiple instances of a service
running on different machines, any of which can handle an incoming request. Spreading requests
across these instances is called *load balancing* [^41].
There are many load balancing and service discovery solutions available:
* *Hardware load balancers* are specialized pieces of equipment that are installed in data centers.
They allow clients to connect to a single host and port, and incoming connections are routed to
one of the servers running the service. Such load balancers detect network failures when
connecting to a downstream server and shift the traffic to other servers.
* *Software load balancers* behave in much the same way as hardware load balancers. But rather than
requiring a special appliance, software load balancers such as Nginx and HAProxy are applications
that can be installed on a standard machine.
* The *domain name service (DNS)* is how domain names are resolved on the Internet when you open a
webpage. It supports load balancing by allowing multiple IP addresses to be associated with a
single domain name. Clients can then be configured to connect to a service using a domain name
rather than IP address, and the clients network layer picks which IP address to use when making a
connection. One drawback of this approach is that DNS is designed to propagate changes over longer
periods of time, and to cache DNS entries. If servers are started, stopped, or moved frequently,
clients might see stale IP addresses that no longer have a server running on them.
* *Service discovery systems* use a centralized registry rather than DNS to track which service
endpoints are available. When a new service instance starts up, it registers itself with the
service discovery system by declaring the host and port its listening on, along with relevant
metadata such as shard ownership information (see [Chapter 7](/en/ch7#ch_sharding)), data center location,
and more. The service then periodically sends a heartbeat signal to the discovery system to signal
that the service is still available.
When a client wishes to connect to a service, it first queries the discovery system to get a list of
available endpoints, and then connects directly to the endpoint. Compared to DNS, service discovery
supports a much more dynamic environment where service instances change frequently. Discovery
systems also give clients more metadata about the service theyre connecting to, which enables
clients to make smarter load balancing decisions.
* *Service meshes* are a sophisticated form of load balancing that combine software load balancers
and service discovery. Unlike traditional software load balancers, which run on a separate
machine, service mesh load balancers are typically deployed as an in-process client library or as
a process or “sidecar” container on both the client and server. Client applications connect
to their own local service load balancer, which connects to the servers load balancer. From
there, the connection is routed to the local server process.
Though complicated, this topology offers a number of advantages. Because the clients and servers are
routed entirely through local connections, connection encryption can be handled entirely at the load
balancer level. This shields clients and servers from having to deal with the complexities of SSL
certificates and TLS. Mesh systems also provide sophisticated observability. They can track which
services are calling each other in realtime, detect failures, track traffic load, and more.
Which solution is appropriate depends on an organizations needs. Those running in a very dynamic
service environment with an orchestrator such as Kubernetes often choose to run a service mesh such
as Istio or Linkerd. Specialized infrastructure such as databases or messaging systems might require
their own purpose-built load balancer. Simpler deployments are best served with software load
balancers.
#### Data encoding and evolution for RPC {#data-encoding-and-evolution-for-rpc}
For evolvability, it is important that RPC clients and servers can be changed and deployed
independently. Compared to data flowing through databases (as described in the last section), we can make a
simplifying assumption in the case of dataflow through services: it is reasonable to assume that
all the servers will be updated first, and all the clients second. Thus, you only need backward
compatibility on requests, and forward compatibility on responses.
The backward and forward compatibility properties of an RPC scheme are inherited from whatever encoding it uses:
* gRPC (Protocol Buffers) and Avro RPC can be evolved according to the compatibility rules of the respective encoding format.
* RESTful APIs most commonly use JSON for responses, and JSON or URI-encoded/form-encoded request
parameters for requests. Adding optional request parameters and adding new fields to response
objects are usually considered changes that maintain compatibility.
Service compatibility is made harder by the fact that RPC is often used for communication across
organizational boundaries, so the provider of a service often has no control over its clients and
cannot force them to upgrade. Thus, compatibility needs to be maintained for a long time, perhaps
indefinitely. If a compatibility-breaking change is required, the service provider often ends up
maintaining multiple versions of the service API side by side.
There is no agreement on how API versioning should work (i.e., how a client can indicate which
version of the API it wants to use [^42]).
For RESTful APIs, common approaches are to use a version
number in the URL or in the HTTP `Accept` header. For services that use API keys to identify a
particular client, another option is to store a clients requested API version on the server and to
allow this version selection to be updated through a separate administrative interface [^43].
### Durable Execution and Workflows {#sec_encoding_dataflow_workflows}
By definition, service-based architectures have multiple services that are all responsible for
different portions of an application. Consider a payment processing application that charges a
credit card and deposits the funds into a bank account. This system would likely have different
services responsible for fraud detection, credit card integration, bank integration, and so on.
Processing a single payment in our example requires many service calls. A payment processor service
might invoke the fraud detection service to check for fraud, call the credit card service to debit
the credit card, and call the banking service to deposit debited funds, as shown in
[Figure 5-7](/en/ch5#fig_encoding_workflow). We call this sequence of steps a *workflow*, and each step a *task*.
Workflows are typically defined as a graph of tasks. Workflow definitions may be written in a
general-purpose programming language, a domain specific language (DSL), or a markup language such as
Business Process Execution Language (BPEL) [^44].
--------
> [!TIP] TASKS, ACTIVITIES, AND FUNCTIONS
Different workflow engines use different names for tasks. Temporal, for example, uses the term
*activity*. Others refer to tasks as *durable functions*. Though the names differ, the concepts are the same.
--------
{{< figure src="/fig/ddia_0507.png" id="fig_encoding_workflow" title="Figure 5-7. Example of a workflow expressed using Business Process Model and Notation (BPMN), a graphical notation." class="w-full my-4" >}}
Workflows are run, or executed, by a *workflow engine*. Workflow engines determine when to run each
task, on which machine a task must be run, what to do if a task fails (e.g., if the machine crashes
while the task is running), how many tasks are allowed to execute in parallel, and more.
Workflow engines are typically composed of an orchestrator and an executor. The orchestrator is
responsible for scheduling tasks to be executed and the executor is responsible for executing tasks.
Execution begins when a workflow is triggered. The orchestrator triggers the workflow itself if
users define a time-based schedule, such as hourly execution. External sources such as a web service
or even a human can also trigger workflow executions. Once triggered, executors are invoked to run
tasks.
There are many kinds of workflow engines that address a diverse set of use cases. Some, such as
Airflow, Dagster, and Prefect, integrate with data systems and orchestrate ETL tasks. Others, such
as Camunda and Orkes, provide a graphical notation for workflows (such as BPMN, used in
[Figure 5-7](/en/ch5#fig_encoding_workflow)) so that non-engineers can more easily define and execute workflows. Still
others, such as Temporal and Restate provide *durable execution*.
#### Durable execution {#durable-execution}
Durable execution frameworks have become a popular way to build service-based architectures that
require transactionality. In our payment example, we would like to process each payment exactly
once. A failure while the workflow is executing could result in a credit card charge, but no
corresponding bank account deposit. In a service-based architecture, we cant simply wrap the two
tasks in a database transaction. Moreover, we might be interacting with third-party payment gateways
that we have limited control over.
Durable execution frameworks are a way to provide *exactly-once semantics* for workflows. If a
task fails, the framework will re-execute the task, but will skip any RPC calls or state changes
that the task made successfully before failing. Instead, the framework will pretend to make the
call, but will instead return the results from the previous call. This is possible because durable
execution frameworks log all RPCs and state changes to durable storage like a write-ahead log [^45] [^46].
[Example 5-5](/en/ch5#fig_temporal_workflow) shows an example of a workflow definition that supports durable execution
using Temporal.
{{< figure id="fig_temporal_workflow" title="Example 5-5. A Temporal workflow definition fragment for the payment workflow in [Figure 5-7](/en/ch5#fig_encoding_workflow)." class="w-full my-4" >}}
```python
@workflow.defn
class PaymentWorkflow:
@workflow.run
async def run(self, payment: PaymentRequest) -> PaymentResult:
is_fraud = await workflow.execute_activity(
check_fraud,
payment,
start_to_close_timeout=timedelta(seconds=15),
)
if is_fraud:
return PaymentResultFraudulent
credit_card_response = await workflow.execute_activity(
debit_credit_card,
payment,
start_to_close_timeout=timedelta(seconds=15),
)
# ...
```
Frameworks like Temporal are not without their challenges. External services, such as the
third-party payment gateway in our example, must still provide an idempotent API. Developers must
remember to use unique IDs for these APIs to prevent duplicate execution [^47].
And because durable execution frameworks log each RPC call in order, it expects a subsequent
execution to make the same RPC calls in the same order. This makes code changes brittle: you
might introduce undefined behavior simply by re-ordering function calls [^48].
Instead of modifying the code of an existing workflow, it is safer to deploy a new version of the
code separately, so that re-executions of existing workflow invocations continue to use the old
version, and only new invocations use the new code [^49].
Similarly, because durable execution frameworks expect to replay all code deterministically (the
same inputs produce the same outputs), nondeterministic code such as random number generators or system clocks are problematic [^48].
Frameworks often provide their own, deterministic implementations of such library functions, but
you have to remember to use them. In some cases, such as with Temporals workflowcheck tool,
frameworks provide static analysis tools to determine if nondeterministic behavior has been introduced.
--------
> [!NOTE]
> Making code deterministic is a powerful idea, but tricky to do robustly. In
> [“The Power of Determinism”](/en/ch9#sidebar_distributed_determinism) we will return to this topic.
--------
### Event-Driven Architectures {#sec_encoding_dataflow_msg}
In this final section, we will briefly look at *event-driven architectures*, which are another way
how encoded data can flow from one process to another. A request is called an *event* or *message*;
unlike RPC, the sender usually does not wait for the recipient to process the event. Moreover,
events are typically not sent to the recipient via a direct network connection, but go via an
intermediary called a *message broker* (also called an *event broker*, *message queue*, or
*message-oriented middleware*), which stores the message temporarily. [^50].
Using a message broker has several advantages compared to direct RPC:
* It can act as a buffer if the recipient is unavailable or overloaded, and thus improve system reliability.
* It can automatically redeliver messages to a process that has crashed, and thus prevent messages from being lost.
* It avoids the need for service discovery, since senders do not need to directly connect to the IP address of the recipient.
* It allows the same message to be sent to several recipients.
* It logically decouples the sender from the recipient (the sender just publishes messages and doesnt care who consumes them).
The communication via a message broker is *asynchronous*: the sender doesnt wait for the message to
be delivered, but simply sends it and then forgets about it. Its possible to implement a
synchronous RPC-like model by having the sender wait for a response on a separate channel.
#### Message brokers {#message-brokers}
In the past, the landscape of message brokers was dominated by commercial enterprise software from
companies such as TIBCO, IBM WebSphere, and webMethods, before open source implementations such as
RabbitMQ, ActiveMQ, HornetQ, NATS, and Apache Kafka become popular. More recently, cloud services
such as Amazon Kinesis, Azure Service Bus, and Google Cloud Pub/Sub have gained adoption. We will
compare them in more detail in [Link to Come].
The detailed delivery semantics vary by implementation and configuration, but in general, two
message distribution patterns are most often used:
* One process adds a message to a named *queue*, and the broker delivers that message to a
*consumer* of that queue. If there are multiple consumers, one of them receives the message.
* One process publishes a message to a named *topic*, and the broker delivers that message to all
*subscribers* of that topic. If there are multiple subscribers, they all receive the message.
Message brokers typically dont enforce any particular data model—a message is just a sequence of
bytes with some metadata, so you can use any encoding format. A common approach is to use Protocol
Buffers, Avro, or JSON, and to deploy a schema registry alongside the message broker to store all
the valid schema versions and check their compatibility [^19] [^21].
AsyncAPI, a messaging-based equivalent of OpenAPI, can also be used to specify the schema of messages.
Message brokers differ in terms of how durable their messages are. Many write messages to disk, so
that they are not lost in case the message broker crashes or needs to be restarted. Unlike
databases, many message brokers automatically delete messages again after they have been consumed.
Some brokers can be configured to store messages indefinitely, which you would require if you want
to use event sourcing (see [“Event Sourcing and CQRS”](/en/ch3#sec_datamodels_events)).
If a consumer republishes messages to another topic, you may need to be careful to preserve unknown
fields, to prevent the issue described previously in the context of databases
([Figure 5-1](/en/ch5#fig_encoding_preserve_field)).
#### Distributed actor frameworks {#distributed-actor-frameworks}
The *actor model* is a programming model for concurrency in a single process. Rather than dealing
directly with threads (and the associated problems of race conditions, locking, and deadlock), logic
is encapsulated in *actors*. Each actor typically represents one client or entity, it may have some
local state (which is not shared with any other actor), and it communicates with other actors by
sending and receiving asynchronous messages. Message delivery is not guaranteed: in certain error
scenarios, messages will be lost. Since each actor processes only one message at a time, it doesnt
need to worry about threads, and each actor can be scheduled independently by the framework.
In *distributed actor frameworks* such as Akka, Orleans [^51],
and Erlang/OTP, this programming model is used to scale an application across
multiple nodes. The same message-passing mechanism is used, no matter whether the sender and recipient
are on the same node or different nodes. If they are on different nodes, the message is
transparently encoded into a byte sequence, sent over the network, and decoded on the other side.
Location transparency works better in the actor model than in RPC, because the actor model already
assumes that messages may be lost, even within a single process. Although latency over the network
is likely higher than within the same process, there is less of a fundamental mismatch between local
and remote communication when using the actor model.
A distributed actor framework essentially integrates a message broker and the actor programming
model into a single framework. However, if you want to perform rolling upgrades of your actor-based
application, you still have to worry about forward and backward compatibility, as messages may be
sent from a node running the new version to a node running the old version, and vice versa. This can
be achieved by using one of the encodings discussed in this chapter.
## Summary {#summary}
In this chapter we looked at several ways of turning data structures into bytes on the network or
bytes on disk. We saw how the details of these encodings affect not only their efficiency, but more
importantly also the architecture of applications and your options for evolving them.
In particular, many services need to support rolling upgrades, where a new version of a service is
gradually deployed to a few nodes at a time, rather than deploying to all nodes simultaneously.
Rolling upgrades allow new versions of a service to be released without downtime (thus encouraging
frequent small releases over rare big releases) and make deployments less risky (allowing faulty
releases to be detected and rolled back before they affect a large number of users). These
properties are hugely beneficial for *evolvability*, the ease of making changes to an application.
During rolling upgrades, or for various other reasons, we must assume that different nodes are
running the different versions of our applications code. Thus, it is important that all data
flowing around the system is encoded in a way that provides backward compatibility (new code can
read old data) and forward compatibility (old code can read new data).
We discussed several data encoding formats and their compatibility properties:
* Programming languagespecific encodings are restricted to a single programming language and often
fail to provide forward and backward compatibility.
* Textual formats like JSON, XML, and CSV are widespread, and their compatibility depends on how you
use them. They have optional schema languages, which are sometimes helpful and sometimes a
hindrance. These formats are somewhat vague about datatypes, so you have to be careful with things
like numbers and binary strings.
* Binary schemadriven formats like Protocol Buffers and Avro allow compact, efficient encoding with
clearly defined forward and backward compatibility semantics. The schemas can be useful for
documentation and code generation in statically typed languages. However, these formats have the
downside that data needs to be decoded before it is human-readable.
We also discussed several modes of dataflow, illustrating different scenarios in which data
encodings are important:
* Databases, where the process writing to the database encodes the data and the process reading
from the database decodes it
* RPC and REST APIs, where the client encodes a request, the server decodes the request and encodes
a response, and the client finally decodes the response
* Event-driven architectures (using message brokers or actors), where nodes communicate by sending
each other messages that are encoded by the sender and decoded by the recipient
We can conclude that with a bit of care, backward/forward compatibility and rolling upgrades are
quite achievable. May your applications evolution be rapid and your deployments be frequent.
### References
[^1]: [CWE-502: Deserialization of Untrusted Data](https://cwe.mitre.org/data/definitions/502.html). Common Weakness Enumeration, *cwe.mitre.org*, July 2006. Archived at [perma.cc/26EU-UK9Y](https://perma.cc/26EU-UK9Y)
[^2]: Steve Breen. [What Do WebLogic, WebSphere, JBoss, Jenkins, OpenNMS, and Your Application Have in Common? This Vulnerability](https://foxglovesecurity.com/2015/11/06/what-do-weblogic-websphere-jboss-jenkins-opennms-and-your-application-have-in-common-this-vulnerability/). *foxglovesecurity.com*, November 2015. Archived at [perma.cc/9U97-UVVD](https://perma.cc/9U97-UVVD)
[^3]: Patrick McKenzie. [What the Rails Security Issue Means for Your Startup](https://www.kalzumeus.com/2013/01/31/what-the-rails-security-issue-means-for-your-startup/). *kalzumeus.com*, January 2013. Archived at [perma.cc/2MBJ-7PZ6](https://perma.cc/2MBJ-7PZ6)
[^4]: Brian Goetz. [Towards Better Serialization](https://openjdk.org/projects/amber/design-notes/towards-better-serialization). *openjdk.org*, June 2019. Archived at [perma.cc/UK6U-GQDE](https://perma.cc/UK6U-GQDE)
[^5]: Eishay Smith. [jvm-serializers wiki](https://github.com/eishay/jvm-serializers/wiki). *github.com*, October 2023. Archived at [perma.cc/PJP7-WCNG](https://perma.cc/PJP7-WCNG)
[^6]: [XML Is a Poor Copy of S-Expressions](https://wiki.c2.com/?XmlIsaPoorCopyOfEssExpressions). *wiki.c2.com*, May 2013. Archived at [perma.cc/7FAN-YBKL](https://perma.cc/7FAN-YBKL)
[^7]: Julia Evans. [Examples of floating point problems](https://jvns.ca/blog/2023/01/13/examples-of-floating-point-problems/). *jvns.ca*, January 2023. Archived at [perma.cc/M57L-QKKW](https://perma.cc/M57L-QKKW)
[^8]: Matt Harris. [Snowflake: An Update and Some Very Important Information](https://groups.google.com/g/twitter-development-talk/c/ahbvo3VTIYI). Email to *Twitter Development Talk* mailing list, October 2010. Archived at [perma.cc/8UBV-MZ3D](https://perma.cc/8UBV-MZ3D)
[^9]: Yakov Shafranovich. [RFC 4180: Common Format and MIME Type for Comma-Separated Values (CSV) Files](https://tools.ietf.org/html/rfc4180). IETF, October 2005.
[^10]: Andy Coates. [Evolving JSON Schemas - Part I](https://www.creekservice.org/articles/2024/01/08/json-schema-evolution-part-1.html) and [Part II](https://www.creekservice.org/articles/2024/01/09/json-schema-evolution-part-2.html). *creekservice.org*, January 2024. Archived at [perma.cc/MZW3-UA54](https://perma.cc/MZW3-UA54) and [perma.cc/GT5H-WKZ5](https://perma.cc/GT5H-WKZ5)
[^11]: Pierre Genevès, Nabil Layaïda, and Vincent Quint. [Ensuring Query Compatibility with Evolving XML Schemas](https://arxiv.org/abs/0811.4324). INRIA Technical Report 6711, November 2008.
[^12]: Tim Bray. [Bits On the Wire](https://www.tbray.org/ongoing/When/201x/2019/11/17/Bits-On-the-Wire). *tbray.org*, November 2019. Archived at [perma.cc/3BT3-BQU3](https://perma.cc/3BT3-BQU3)
[^13]: Mark Slee, Aditya Agarwal, and Marc Kwiatkowski. [Thrift: Scalable Cross-Language Services Implementation](https://thrift.apache.org/static/files/thrift-20070401.pdf). Facebook technical report, April 2007. Archived at [perma.cc/22BS-TUFB](https://perma.cc/22BS-TUFB)
[^14]: Martin Kleppmann. [Schema Evolution in Avro, Protocol Buffers and Thrift](https://martin.kleppmann.com/2012/12/05/schema-evolution-in-avro-protocol-buffers-thrift.html). *martin.kleppmann.com*, December 2012. Archived at [perma.cc/E4R2-9RJT](https://perma.cc/E4R2-9RJT)
[^15]: Doug Cutting, Chad Walters, Jim Kellerman, et al. [[PROPOSAL] New Subproject: Avro](https://lists.apache.org/thread/z571w0r5jmfsjvnl0fq4fgg0vh28d3bk). Email thread on *hadoop-general* mailing list, *lists.apache.org*, April 2009. Archived at [perma.cc/4A79-BMEB](https://perma.cc/4A79-BMEB)
[^16]: Apache Software Foundation. [Apache Avro 1.12.0 Specification](https://avro.apache.org/docs/1.12.0/specification/). *avro.apache.org*, August 2024. Archived at [perma.cc/C36P-5EBQ](https://perma.cc/C36P-5EBQ)
[^17]: Apache Software Foundation. [Avro schemas as LL(1) CFG definitions](https://avro.apache.org/docs/1.12.0/api/java/org/apache/avro/io/parsing/doc-files/parsing.html). *avro.apache.org*, August 2024. Archived at [perma.cc/JB44-EM9Q](https://perma.cc/JB44-EM9Q)
[^18]: Tony Hoare. [Null References: The Billion Dollar Mistake](https://www.infoq.com/presentations/Null-References-The-Billion-Dollar-Mistake-Tony-Hoare/). Talk at *QCon London*, March 2009.
[^19]: Confluent, Inc. [Schema Registry Overview](https://docs.confluent.io/platform/current/schema-registry/index.html). *docs.confluent.io*, 2024. Archived at [perma.cc/92C3-A9JA](https://perma.cc/92C3-A9JA)
[^20]: Aditya Auradkar and Tom Quiggle. [Introducing Espresso—LinkedIns Hot New Distributed Document Store](https://engineering.linkedin.com/espresso/introducing-espresso-linkedins-hot-new-distributed-document-store). *engineering.linkedin.com*, January 2015. Archived at [perma.cc/FX4P-VW9T](https://perma.cc/FX4P-VW9T)
[^21]: Jay Kreps. [Putting Apache Kafka to Use: A Practical Guide to Building a Stream Data Platform (Part 2)](https://www.confluent.io/blog/event-streaming-platform-2/). *confluent.io*, February 2015. Archived at [perma.cc/8UA4-ZS5S](https://perma.cc/8UA4-ZS5S)
[^22]: Gwen Shapira. [The Problem of Managing Schemas](https://www.oreilly.com/content/the-problem-of-managing-schemas/). *oreilly.com*, November 2014. Archived at [perma.cc/BY8Q-RYV3](https://perma.cc/BY8Q-RYV3)
[^23]: John Larmouth. [*ASN.1 Complete*](https://www.oss.com/asn1/resources/books-whitepapers-pubs/larmouth-asn1-book.pdf). Morgan Kaufmann, 1999. ISBN: 978-0-122-33435-1. Archived at [perma.cc/GB7Y-XSXQ](https://perma.cc/GB7Y-XSXQ)
[^24]: Burton S. Kaliski Jr. [A Laymans Guide to a Subset of ASN.1, BER, and DER](https://luca.ntop.org/Teaching/Appunti/asn1.html). Technical Note, RSA Data Security, Inc., November 1993. Archived at [perma.cc/2LMN-W9U8](https://perma.cc/2LMN-W9U8)
[^25]: Jacob Hoffman-Andrews. [A Warm Welcome to ASN.1 and DER](https://letsencrypt.org/docs/a-warm-welcome-to-asn1-and-der/). *letsencrypt.org*, April 2020. Archived at [perma.cc/CYT2-GPQ8](https://perma.cc/CYT2-GPQ8)
[^26]: Lev Walkin. [Question: Extensibility and Dropping Fields](https://lionet.info/asn1c/blog/2010/09/21/question-extensibility-removing-fields/). *lionet.info*, September 2010. Archived at [perma.cc/VX8E-NLH3](https://perma.cc/VX8E-NLH3)
[^27]: Jacqueline Xu. [Online migrations at scale](https://stripe.com/blog/online-migrations). *stripe.com*, February 2017. Archived at [perma.cc/X59W-DK7Y](https://perma.cc/X59W-DK7Y)
[^28]: Geoffrey Litt, Peter van Hardenberg, and Orion Henry. [Project Cambria: Translate your data with lenses](https://www.inkandswitch.com/cambria/). Technical Report, *Ink & Switch*, October 2020. Archived at [perma.cc/WA4V-VKDB](https://perma.cc/WA4V-VKDB)
[^29]: Pat Helland. [Data on the Outside Versus Data on the Inside](https://www.cidrdb.org/cidr2005/papers/P12.pdf). At *2nd Biennial Conference on Innovative Data Systems Research* (CIDR), January 2005.
[^30]: Roy Thomas Fielding. [Architectural Styles and the Design of Network-Based Software Architectures](https://ics.uci.edu/~fielding/pubs/dissertation/fielding_dissertation.pdf). PhD Thesis, University of California, Irvine, 2000. Archived at [perma.cc/LWY9-7BPE](https://perma.cc/LWY9-7BPE)
[^31]: Roy Thomas Fielding. [REST APIs must be hypertext-driven](https://roy.gbiv.com/untangled/2008/rest-apis-must-be-hypertext-driven).” *roy.gbiv.com*, October 2008. Archived at [perma.cc/M2ZW-8ATG](https://perma.cc/M2ZW-8ATG)
[^32]: [OpenAPI Specification Version 3.1.0](https://swagger.io/specification/). *swagger.io*, February 2021. Archived at [perma.cc/3S6S-K5M4](https://perma.cc/3S6S-K5M4)
[^33]: Michi Henning. [The Rise and Fall of CORBA](https://cacm.acm.org/practice/the-rise-and-fall-of-corba/). *Communications of the ACM*, volume 51, issue 8, pages 5257, August 2008. [doi:10.1145/1378704.1378718](https://doi.org/10.1145/1378704.1378718)
[^34]: Pete Lacey. [The S Stands for Simple](https://harmful.cat-v.org/software/xml/soap/simple). *harmful.cat-v.org*, November 2006. Archived at [perma.cc/4PMK-Z9X7](https://perma.cc/4PMK-Z9X7)
[^35]: Stefan Tilkov. [Interview: Pete Lacey Criticizes Web Services](https://www.infoq.com/articles/pete-lacey-ws-criticism/). *infoq.com*, December 2006. Archived at [perma.cc/JWF4-XY3P](https://perma.cc/JWF4-XY3P)
[^36]: Tim Bray. [The Loyal WS-Opposition](https://www.tbray.org/ongoing/When/200x/2004/09/18/WS-Oppo). *tbray.org*, September 2004. Archived at [perma.cc/J5Q8-69Q2](https://perma.cc/J5Q8-69Q2)
[^37]: Andrew D. Birrell and Bruce Jay Nelson. [Implementing Remote Procedure Calls](https://www.cs.princeton.edu/courses/archive/fall03/cs518/papers/rpc.pdf). *ACM Transactions on Computer Systems* (TOCS), volume 2, issue 1, pages 3959, February 1984. [doi:10.1145/2080.357392](https://doi.org/10.1145/2080.357392)
[^38]: Jim Waldo, Geoff Wyant, Ann Wollrath, and Sam Kendall. [A Note on Distributed Computing](https://m.mirror.facebook.net/kde/devel/smli_tr-94-29.pdf). Sun Microsystems Laboratories, Inc., Technical Report TR-94-29, November 1994. Archived at [perma.cc/8LRZ-BSZR](https://perma.cc/8LRZ-BSZR)
[^39]: Steve Vinoski. [Convenience over Correctness](https://steve.vinoski.net/pdf/IEEE-Convenience_Over_Correctness.pdf). *IEEE Internet Computing*, volume 12, issue 4, pages 8992, July 2008. [doi:10.1109/MIC.2008.75](https://doi.org/10.1109/MIC.2008.75)
[^40]: Brandur Leach. [Designing robust and predictable APIs with idempotency](https://stripe.com/blog/idempotency). *stripe.com*, February 2017. Archived at [perma.cc/JD22-XZQT](https://perma.cc/JD22-XZQT)
[^41]: Sam Rose. [Load Balancing](https://samwho.dev/load-balancing/). *samwho.dev*, April 2023. Archived at [perma.cc/Q7BA-9AE2](https://perma.cc/Q7BA-9AE2)
[^42]: Troy Hunt. [Your API versioning is wrong, which is why I decided to do it 3 different wrong ways](https://www.troyhunt.com/your-api-versioning-is-wrong-which-is/). *troyhunt.com*, February 2014. Archived at [perma.cc/9DSW-DGR5](https://perma.cc/9DSW-DGR5)
[^43]: Brandur Leach. [APIs as infrastructure: future-proofing Stripe with versioning](https://stripe.com/blog/api-versioning). *stripe.com*, August 2017. Archived at [perma.cc/L63K-USFW](https://perma.cc/L63K-USFW)
[^44]: Alexandre Alves, Assaf Arkin, Sid Askary, et al. [Web Services Business Process Execution Language Version 2.0](https://docs.oasis-open.org/wsbpel/2.0/wsbpel-v2.0.html). *docs.oasis-open.org*, April 2007.
[^45]: [What is a Temporal Service?](https://docs.temporal.io/clusters) *docs.temporal.io*, 2024. Archived at [perma.cc/32P3-CJ9V](https://perma.cc/32P3-CJ9V)
[^46]: Stephan Ewen. [Why we built Restate](https://restate.dev/blog/why-we-built-restate/). *restate.dev*, August 2023. Archived at [perma.cc/BJJ2-X75K](https://perma.cc/BJJ2-X75K)
[^47]: Keith Tenzer and Joshua Smith. [Idempotency and Durable Execution](https://temporal.io/blog/idempotency-and-durable-execution). *temporal.io*, February 2024. Archived at [perma.cc/9LGW-PCLU](https://perma.cc/9LGW-PCLU)
[^48]: [What is a Temporal Workflow?](https://docs.temporal.io/workflows) *docs.temporal.io*, 2024. Archived at [perma.cc/B5C5-Y396](https://perma.cc/B5C5-Y396)
[^49]: Jack Kleeman. [Solving durable executions immutability problem](https://restate.dev/blog/solving-durable-executions-immutability-problem/). *restate.dev*, February 2024. Archived at [perma.cc/G55L-EYH5](https://perma.cc/G55L-EYH5)
[^50]: Srinath Perera. [Exploring Event-Driven Architecture: A Beginners Guide for Cloud Native Developers](https://wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/). *wso2.com*, August 2023. Archived at [archive.org](https://web.archive.org/web/20240716204613/https%3A//wso2.com/blogs/thesource/exploring-event-driven-architecture-a-beginners-guide-for-cloud-native-developers/)
[^51]: Philip A. Bernstein, Sergey Bykov, Alan Geller, Gabriel Kliot, and Jorgen Thelin. [Orleans: Distributed Virtual Actors for Programmability and Scalability](https://www.microsoft.com/en-us/research/publication/orleans-distributed-virtual-actors-for-programmability-and-scalability/). Microsoft Research Technical Report MSR-TR-2014-41, March 2014. Archived at [perma.cc/PD3U-WDMF](https://perma.cc/PD3U-WDMF)