DDIA · Ch.5 | Data & Systems | 24 min read

Encoding and Evolution: How Data Survives Schema Changes

Chapter 5 of DDIA 2nd ed. distilled: the two compatibility directions (backward and forward) and why rolling upgrades make both mandatory; encoding format taxonomy from language-specific serialisers to JSON/XML/CSV to binary schema-driven formats (Protocol Buffers, Apache Avro); schema evolution rules for each format; Avro writer/reader schema resolution; the four dataflow modes (databases, REST/RPC services, workflow engines, message brokers); the fundamental problems with RPC; durable execution frameworks (Temporal, Restate) and exactly-once semantics; event-driven architectures with message brokers and the actor model.

The Compatibility Problem

Applications change continuously: features are added or modified, requirements shift. These changes require corresponding changes to data formats. The complication: in a large system, code changes cannot happen instantaneously. Rolling upgrades deploy new server versions gradually – a few nodes at a time – while client-side applications are at the mercy of users who may not update for months. During any deployment window, multiple versions of the code are running simultaneously, reading and writing to the same databases and message queues. This requires compatibility in two directions: backward compatibility (new code reads data written by old code) and forward compatibility (old code reads data written by new code). Backward compatibility is typically straightforward – you wrote the new code, so you can explicitly handle the old format. Forward compatibility is harder: it requires old code to gracefully ignore fields and structures it has never seen before.

The silent data-loss trap (Figure 5-1 in the book): New code adds a photoURL field to a record and writes it to the database. Old code (still running on other nodes) reads that record, decodes it into a Person model object – which doesn't have a photoURL field – then updates the record and writes it back. The photoURL value is silently dropped. This is a forward-compatibility failure in a real system, with real data loss.

The solution is either to use an encoding format that mechanically preserves unknown fields (Protocol Buffers, Avro), or to ensure your model objects pass unknown fields through without interpretation.
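
A minimal sketch of the trap and the pass-through fix, using JSON and a hypothetical name/photoURL record (the field names mirror the book's Figure 5-1 example):

```python
# Sketch of the Figure 5-1 trap and its fix (hypothetical Person record).
# Old code decodes a record into a model that only knows "name", then
# writes it back -- dropping the "photoURL" field added by newer code.

import json

record_from_new_code = '{"name": "Alice", "photoURL": "https://example.com/a.jpg"}'

def update_lossy(encoded: str) -> str:
    """Lossy: keep only the fields this version of the code knows about."""
    person = {"name": json.loads(encoded)["name"]}   # photoURL is lost here
    person["name"] = person["name"].upper()          # the actual update
    return json.dumps(person)

def update_passthrough(encoded: str) -> str:
    """Safe: decode the full record, modify known fields, pass the rest through."""
    person = json.loads(encoded)                     # unknown fields retained
    person["name"] = person["name"].upper()
    return json.dumps(person)

assert "photoURL" not in update_lossy(record_from_new_code)
assert "photoURL" in update_passthrough(record_from_new_code)
```

The pass-through version treats the decoded record as an opaque bag of fields plus the few keys it actually needs, which is exactly the discipline the paragraph above asks of model objects.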

Encoding Formats: A Taxonomy

In-memory data structures use pointers and CPU-optimised layouts. On-disk and on-wire formats must be self-contained byte sequences – no pointers. The translation between the two is encoding/serialisation (to bytes) and decoding/deserialisation (to objects). The choice of encoding format determines: byte size, parse speed, schema evolution capability, cross-language interoperability, and human readability. Note: columnar storage formats like Parquet operate directly on compressed column data without a decode step – a special case called zero-copy encoding.

| Format | Compactness | Schema Evolution | Cross-Language | Human Readable | Best For |
| --- | --- | --- | --- | --- | --- |
| Language-specific (Java Serializable, Python pickle) | ❌ No | ❌ Afterthought | ❌ Tied to one language | N/A | Transient caching only – never use for storage or inter-service comms |
| JSON / XML / CSV | ~ Medium | ~ Manual discipline required | ✓ Universal | ✓ Yes | External / public APIs, data interchange between organisations |
| JSON Schema / XML Schema | ~ Medium | ~ Complex to evolve correctly | ✓ Universal | ✓ Yes | OpenAPI specs, Confluent Schema Registry, validated data contracts |
| Protocol Buffers (protobuf) | ✓ ~33 bytes vs 81 for JSON | ✓ Field tags + required/optional rules | ✓ Code-gen for 10+ languages | ❌ Binary | gRPC services, high-throughput internal APIs, mobile payloads |
| Apache Avro | ✓ ~32 bytes – most compact | ✓ Writer/reader schema resolution | ✓ Code-gen + dynamic schemas | ❌ Binary | Kafka topics, Hadoop/data lake pipelines, dynamically generated schemas |
| Cap'n Proto / FlatBuffers | ✓ Zero-copy – no encode/decode step | ✓ Field tags like protobuf | ✓ Code-gen | ❌ Binary | Latency-critical systems, memory-mapped storage, game engines |
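
The JSON figure in the table can be checked directly. This snippet measures the compact JSON encoding of the example record used for the size comparison (userName "Martin", favoriteNumber 1337, two interests); the 81-byte count assumes no whitespace:

```python
import json

# The example record behind the size comparison; field names follow the book.
record = {
    "userName": "Martin",
    "favoriteNumber": 1337,
    "interests": ["daydreaming", "hacking"],
}

# separators=(",", ":") removes the spaces json.dumps adds by default.
encoded = json.dumps(record, separators=(",", ":")).encode("utf-8")
print(len(encoded))  # 81 bytes -- vs ~33 for protobuf, ~32 for Avro
```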

Protocol Buffers: Field Tags as the Compatibility Mechanism

Protocol Buffers encodes each field as a tag byte (field_number << 3 | wire_type) followed by the value. Field names are absent from the wire format – only tag numbers matter. Strings are length-prefixed UTF-8. Integers use variable-length varints, so small numbers take fewer bytes; the signed sint types additionally apply zigzag encoding. The same record takes 33 bytes in protobuf vs 81 bytes in JSON. Schema evolution rules: you may add new fields (give each a new unique tag number – old readers skip unknown tags, using the wire type to determine the skip length); you may remove optional fields (never reuse the tag number – reserved tags prevent future reuse accidents); you may rename fields (safe – the wire format never mentions the name). Changing int32 to int64 is safe; narrowing risks truncation. The repeated modifier encodes lists as repeated occurrences of the same tag, which is compatible with optional fields and enables evolution between single-valued and repeated fields.
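
A toy illustration of the tag-byte and varint mechanics described above – not a real protobuf encoder, just the two primitives, in Python:

```python
# Minimal sketch of protobuf's wire format primitives: the tag byte
# (field_number << 3 | wire_type) and base-128 varints. Illustrative only,
# but enough to show why renaming a field is safe (the name never appears
# on the wire) while reusing a tag number is not.

def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, MSB set on all but the last byte."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_string_field(field_number: int, value: str) -> bytes:
    WIRE_TYPE_LEN = 2  # length-delimited: strings, bytes, nested messages
    tag = (field_number << 3) | WIRE_TYPE_LEN
    data = value.encode("utf-8")
    return encode_varint(tag) + encode_varint(len(data)) + data

# Field 1, string "Martin": tag byte 0x0a, length 6, then the UTF-8 bytes.
encoded = encode_string_field(1, "Martin")
print(encoded.hex())  # 0a064d617274696e
```

Note that the output contains no trace of the field's name – decoding requires a schema that maps tag 1 back to a name and type.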

Schema comparison: Protocol Buffers vs Apache Avro
// Protocol Buffers schema (proto3)
message Person {
  string user_name     = 1;
  int64  favorite_num  = 2;
  repeated string interests = 3;
}
// Encoded: 33 bytes. Field names absent – only tag numbers on wire.

// Avro IDL schema
record Person {
  string                   userName;
  union { null, long } favoriteNumber = null;
  array<string>            interests;
}
// Encoded: 32 bytes. No tags – schema must be known to decode.

Apache Avro: Writer and Reader Schemas

Avro's binary encoding contains only values – no type indicators, no field numbers. Decoding is only possible with the exact writer's schema, which the Avro library uses to walk the byte stream field by field. Schema evolution is achieved through writer/reader schema resolution: the library compares the two schemas side-by-side, matching fields by name (not position or tag). Fields present in the writer's schema but absent from the reader's are ignored. Fields present in the reader's schema but absent from the writer's are filled in from the reader's default value. This means: you may only add or remove fields that have default values (otherwise, backward or forward compatibility breaks). Field renaming is backward compatible via aliases in the reader schema. Null values require explicit union types: union { null, long } – enforcing explicit nullability rather than everything-is-nullable-by-default. How does the reader know the writer's schema? It depends on context: large Avro files embed the schema once at the top (object container file format); databases include a schema version number per record and maintain a schema registry; streaming connections negotiate the schema on connection setup.
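
The resolution rules can be sketched in a few lines. This toy version matches fields by name against a simplified reader schema (plain dicts, not real Avro JSON) and applies defaults; real Avro does this while walking a binary stream:

```python
# Toy sketch of Avro-style writer/reader schema resolution: fields are
# matched by NAME, writer-only fields are ignored, reader-only fields
# take their default. The schema shape here is simplified for clarity.

def resolve(writer_record: dict, reader_schema: list[dict]) -> dict:
    result = {}
    for field in reader_schema:
        name = field["name"]
        if name in writer_record:
            result[name] = writer_record[name]       # matched by name
        elif "default" in field:
            result[name] = field["default"]          # filled from default
        else:
            # Neither side can supply a value: compatibility is broken.
            raise ValueError(f"no value and no default for {name!r}")
    return result  # fields only the writer knew are dropped silently

writer_record = {"userName": "Martin", "favoriteNumber": 1337}
reader_schema = [
    {"name": "userName"},
    {"name": "interests", "default": []},            # new field, has default
]
print(resolve(writer_record, reader_schema))
# {'userName': 'Martin', 'interests': []}
```

The ValueError branch is precisely why Avro only permits adding or removing fields that carry defaults.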

Why Avro beats Protobuf for dynamically generated schemas: Protobuf field tag numbers must be assigned by hand and tracked carefully – if you generate a Protobuf schema from a relational table that has 200 columns, someone must manually assign and maintain 200 tag numbers. When the table gains or loses columns, tags must be carefully updated. Avro fields are matched by name, so you can generate an Avro schema directly from a relational schema (each column becomes a field), export data to Avro object container files, and re-generate the schema each time the table changes – the export process never needs to track tag assignments. This is why Avro is the dominant format for Hadoop, Kafka, and data lake exports.
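
A sketch of that generation step, with an illustrative (not authoritative) SQL-to-Avro type mapping; nullable columns become ["null", type] unions with a null default:

```python
# Sketch: generating an Avro schema from a relational table definition --
# each column becomes a named field, so no tag numbers are ever assigned
# or tracked. The type mapping below is illustrative, not exhaustive.

SQL_TO_AVRO = {"INTEGER": "long", "TEXT": "string", "REAL": "double"}

def avro_schema_for_table(table: str,
                          columns: list[tuple[str, str, bool]]) -> dict:
    """columns: (name, sql_type, nullable) triples. Nullable columns map
    to a ["null", type] union with a null default, per Avro conventions."""
    fields = []
    for name, sql_type, nullable in columns:
        avro_type = SQL_TO_AVRO[sql_type]
        if nullable:
            fields.append({"name": name, "type": ["null", avro_type],
                           "default": None})
        else:
            fields.append({"name": name, "type": avro_type})
    return {"type": "record", "name": table, "fields": fields}

schema = avro_schema_for_table(
    "users", [("id", "INTEGER", False), ("email", "TEXT", True)])
print(schema["fields"][1])
# {'name': 'email', 'type': ['null', 'string'], 'default': None}
```

Re-running this generator after the table changes produces a fresh schema whose fields still line up by name with every previously exported file.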

Schema Evolution Rules by Format

| Format | Add a field | Remove a field | Change field type | Rename a field |
| --- | --- | --- | --- | --- |
| Protocol Buffers | ✓ Safe – old code ignores the unknown tag | ⚠️ Safe only if the field was optional; never reuse the tag number | ⚠️ Some changes safe (int32→int64); truncation risk if narrowing | ✓ Safe – wire format uses tag numbers, not names |
| Apache Avro | ✓ Safe if the new field has a default value | ✓ Safe only if the removed field had a default value | ✓ Safe if Avro can convert between the types | ✓ Backward compatible via aliases in the reader schema; not forward compatible |
| JSON / REST | ✓ Safe – receivers ignore unknown fields (by convention) | ⚠️ Risky if clients depend on the field | ⚠️ Risk of breakage – no type enforcement in JSON itself | ❌ Breaking change – maintain the old name for backward compat |

The Fundamental Problems with RPC

RPC (remote procedure call) attempts location transparency: making a remote service call appear syntactically identical to a local function call. This abstraction is fundamentally flawed. A network call and a local call differ in at least five critical ways that cannot be papered over by a framework. The history of enterprise RPC – CORBA, EJB, DCOM, SOAP, WS-* – is a history of failed attempts to hide this distinction. REST succeeds partly because it treats state transfer over the network as a distinct concept from a local function call, making the network boundary explicit.

01
Unpredictable failure modes
A local call either succeeds or throws an exception. A network call can also time out – leaving you with no way to know if the request arrived. Did it get processed? Was the response just lost? Local calls never have this ambiguity.
02
Retry-induced duplication
If you retry a timed-out request and the original did get through, the action executes twice. Local calls never have this problem. RPC systems require idempotence – designing operations so that calling them multiple times with the same input produces the same result as calling once.
03
Variable latency
A local call takes nanoseconds. A network call might take 0.5ms on a good day and 30 seconds when the network is congested. Treating them identically leads to slow and fragile code.
04
Parameter encoding cost
Local calls pass object references (pointers). Network calls must encode every parameter as bytes. Large mutable objects – easy to pass locally – become expensive to serialise repeatedly.
05
Cross-language type impedance
A gRPC service may be implemented in Go and called from Python. Integer types, null semantics, and floating-point behaviour differ across languages. JavaScript cannot correctly represent integers greater than 2⁵³ – a problem Twitter discovered when its post IDs exceeded this limit.
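
Point 02's idempotency-key pattern can be sketched as follows; the dedup store is a plain dict and all names are illustrative – a real service would persist the key with a unique constraint:

```python
# Sketch of retry-induced duplication and the idempotency-key fix.
# The client generates a key once and reuses it on every retry; the
# server treats a repeated key as "already done" and returns the
# stored result instead of re-executing the side effect.

import uuid

processed: dict[str, str] = {}   # idempotency key -> stored result
charges_made = 0                 # counts real side effects

def charge_card(idempotency_key: str, amount: int) -> str:
    global charges_made
    if idempotency_key in processed:         # duplicate delivery: no-op
        return processed[idempotency_key]
    charges_made += 1                        # the side effect, executed once
    result = f"charged {amount}"
    processed[idempotency_key] = result
    return result

key = str(uuid.uuid4())                      # client picks the key ONCE
charge_card(key, 100)
charge_card(key, 100)                        # timed-out retry: same key
assert charges_made == 1                     # executed once despite two calls
```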

Four Modes of Dataflow

Compatibility is a relationship between two processes: the one that encodes data and the one that decodes it. The encoding choices and compatibility requirements differ significantly depending on how data flows between those processes. The chapter identifies four principal dataflow modes.

01
Dataflow through databases

Writing to a database is encoding; reading back is decoding. The writer and reader are often different versions of the same application – especially during rolling upgrades. A value written by new code may be read by old code. This requires forward compatibility (old code can read new data). The classic trap: old code reads a record, decodes it into a model object (losing unknown fields), then writes it back – silently dropping fields added by the newer version. The database must be treated as "sending a message to your future self."

02
Dataflow through services (REST and RPC)

Clients and servers are typically deployed independently. The convention is to upgrade servers first, then clients – so servers must be backward compatible (able to read old client requests) and clients must be forward compatible (able to read new server responses). REST uses HTTP natively: URL paths, HTTP verbs, JSON bodies. gRPC uses Protocol Buffers over HTTP/2. Compatibility is inherited from the encoding format. For cross-organisational APIs, compatibility must be maintained indefinitely – the provider cannot force clients to upgrade.

03
Dataflow through workflow engines

Workflow engines like Temporal and Restate provide durable execution: exactly-once semantics for multi-step service workflows. If a task fails midway, the framework replays the workflow, substituting logged results for previously completed RPC calls so that only the failed step actually re-executes. This requires strict determinism – non-deterministic code (random numbers, system clocks, unordered maps) will produce different results on replay. Workflow code must never be modified in place; deploy a new version and let in-flight workflows complete on the old one.

04
Dataflow through message brokers

Message brokers (Kafka, RabbitMQ, Amazon Kinesis, Google Pub/Sub) decouple senders from receivers via an intermediary. The sender publishes to a topic or queue; one or more consumers receive asynchronously without a direct network connection. This provides buffering (handle slow consumers), automatic redelivery (if a consumer crashes), fan-out (one message to many subscribers), and service discovery (senders need not know receiver addresses). The encoding format is the contract – use protobuf/Avro with a schema registry to enforce compatibility across producer and consumer versions.

Durable execution and encoding: Temporal and Restate log every RPC call and its result to a WAL-like durable store. On replay after a failure, the framework intercepts RPC calls that already have a logged result and returns the cached value instead of re-executing. This means the serialised form of every RPC argument and result must remain decodable across application versions – making encoding format choice and schema evolution critical for workflow longevity. A workflow that runs for days or weeks may span multiple application deployments.
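
A toy version of that replay mechanism – a durable log of step results, with cached values returned on replay. The API names are invented for illustration, not Temporal's or Restate's:

```python
# Toy sketch of durable execution: every step's result is appended to a
# log; on replay, steps that already have a logged result return the
# cached value instead of re-executing. Names are illustrative only.

call_log: list[str] = []         # durable log (in memory for this sketch)
executions = 0                   # counts real (non-replayed) executions

def durable_call(step: int, fn):
    global executions
    if step < len(call_log):     # already logged: serve from the log
        return call_log[step]
    result = fn()                # first execution: run and log the result
    executions += 1
    call_log.append(result)
    return result

def workflow():
    a = durable_call(0, lambda: "reserved-seat")
    b = durable_call(1, lambda: "charged-card")
    return (a, b)

workflow()                       # first run: both steps execute
workflow()                       # replay after a crash: zero re-executions
assert executions == 2
```

This is also where encoding matters: the values in call_log would be serialised bytes in a real system, and they must stay decodable even if the application is redeployed mid-workflow.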

Message Brokers: Decoupling as the Primary Benefit

Message brokers provide four benefits that direct RPC cannot: buffering (absorb producer bursts when consumers are slow), redelivery (automatic retry if a consumer crashes before acknowledging), fan-out (one message delivered to all subscribers of a topic), and service discovery elimination (producers publish to a named topic without knowing consumer addresses). Two distribution patterns: queue (one message → one consumer; work distribution) and topic/pub-sub (one message → all subscribers; event broadcast). Message encoding is the contract – brokers are encoding-agnostic byte carriers. Pairing Avro or Protobuf with a schema registry (Confluent Schema Registry for Kafka, AsyncAPI specifications) lets you enforce and version that contract across producer and consumer deployments. A critical detail: if a consumer re-publishes messages to another topic after processing, it must preserve unknown fields from the original message – otherwise the forward-compatibility problem from Figure 5-1 reappears in the broker pipeline.
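
The two distribution patterns can be sketched with an in-memory broker (illustrative only – real brokers add durability, acknowledgements, and network transport):

```python
# Minimal in-memory sketch of the two distribution patterns above:
# a queue delivers each message to ONE consumer (work distribution),
# a topic delivers each message to ALL subscribers (fan-out).

import itertools

class Queue:
    """Work distribution: messages round-robin across consumers."""
    def __init__(self, consumers):
        self.consumers = itertools.cycle(consumers)
    def publish(self, msg):
        next(self.consumers)(msg)

class Topic:
    """Pub/sub fan-out: every subscriber sees every message."""
    def __init__(self, subscribers):
        self.subscribers = subscribers
    def publish(self, msg):
        for subscriber in self.subscribers:
            subscriber(msg)

a_seen, b_seen = [], []
queue = Queue([a_seen.append, b_seen.append])
queue.publish("job-1")
queue.publish("job-2")
assert a_seen == ["job-1"] and b_seen == ["job-2"]   # split across consumers

topic = Topic([a_seen.append, b_seen.append])
topic.publish("event-1")
assert "event-1" in a_seen and "event-1" in b_seen   # delivered to both
```

In both classes the publisher holds no address for any consumer – the decoupling the next section calls the primary benefit.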

Six Mental Models Worth Keeping

01
Data outlives code – plan for it

You can redeploy an application in minutes, replacing all running instances. You cannot redeploy a database. Data written five years ago with an old schema will still be there when you read it today. Schema evolution is not an edge case – it is the normal operating condition of any system that persists data.

02
Binary schema formats are not just about bytes

Protocol Buffers and Avro are ~2.5× smaller than JSON for the same record. But the real win is not compactness – it is the schema as living documentation and the ability to enforce backward/forward compatibility mechanically in CI before anything reaches production.

03
Avro and Protobuf solve the same problem differently

Protobuf uses field tag numbers embedded in each encoded record – the schema can change the field name, but never the tag. Avro uses no tags – the schema must be transmitted alongside (or negotiated), and decoding uses the writer/reader schema pair. Protobuf is more self-contained; Avro is better for dynamically generated schemas (e.g., dumping a relational database where column names come from a query).

04
Rolling upgrades make forward compatibility mandatory

During a rolling upgrade, old-code nodes and new-code nodes run simultaneously. A message written by new code will be read by old code – requiring forward compatibility (old code reads new data). This is why "just ignore unknown fields" is a hard requirement, not a nice-to-have. Avro and Protobuf handle this mechanically. JSON relies on developer discipline.

05
RPC is not a local call – stop pretending it is

The entire history of distributed systems RPC – CORBA, EJB, DCOM, SOAP – failed partly because they tried to make network calls look like local function calls (location transparency). This hides the fundamental differences: network calls fail in new ways, have variable latency, require idempotence, and cross language boundaries. REST succeeds partly because it treats network transfer as its own distinct thing.

06
Message brokers are decoupling, not just async delivery

The key property of a message broker is not that delivery is asynchronous – it is that the sender does not need to know who the receiver is, where they are, or even whether they are running. This enables fan-out (one producer, many consumers), absorbing traffic spikes (broker buffers), and independent deployment of producers and consumers – the foundation of event-driven architectures.

Part of the DDIA Series

This article covers Chapter 5 of Designing Data-Intensive Applications (2nd ed.) by Martin Kleppmann. The series distils each chapter into audience-aware reading notes across three levels: conceptual overview (Layman), architectural context (Reviewer), and engineering detail (Engineer).

← Ch.4: Storage and Retrieval | Back to Knowledge Hub →