Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Chapter 4. Encoding and Evolution

  • Changing features typically result in data model changes.

  • Different data models have different ways of coping with these changes.

  • Client-side updates are user-reliant, so the transition period (when old and new versions of the code coexist) needs to be designed well

    • backward compatibility
    • forward compatibility
      • Forward compatibility can be trickier, because it requires older code to ignore additions made by a newer version of the code.
  • The encoding format affects compatibility, which is why this chapter explores that first

Formats for Encoding Data #

Representations for:

  1. in-memory data
  2. self-contained over network / storage

Translating between the 2 representations becomes necessary.

In the following sections, we see some recurring patterns:

  1. tag-based field marking as a layer of indirection, its pros and cons, and especially how it relates to schema evolution rules

Language-Specific Formats #

Typically you get language-specific frameworks / libraries that are convenient for the in-memory objects used

Usual problems:

  1. non-agnostic formats introduced (tied to one language, so cross-language interop is hard)

  2. @ decoding, hydrating classes can be a security concern (arbitrary runtime constructs can be instantiated)

  3. versioning drifts

  4. efficiency / performance hits (CPU-time for encoding / decoding)

JSON, XML, and Binary Variants #

Subtle nuances:

  1. XML & CSV: can’t distinguish a number from a string (of digits)

    number conversions may introduce floating-point errors too. Interesting trick by Twitter: id fields are returned twice, once as a JSON number and once as a decimal string \(\implies\) clients can verify that the conversion didn’t introduce any fp-precision errors

  2. JSON,XML: supports unicode well but not binary strings \(\implies\) bound to the char encoding format

    workaround: base64-encode binary data as text. Problem: this increases the data size by ~33%

  3. JSON/XML have optional schema support to add structure to payloads, but people don’t always use it, so the schema sometimes ends up hardcoded in application logic (instead of using the schema support)

  4. CSV has no schema, and may have escaping issues (commas / newlines in values)
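The fp-precision hazard from point 1 can be sketched in Python (a minimal sketch; the id value is hypothetical, chosen to be just past the 53-bit mantissa limit of a double):

```python
import json

tweet_id = 2**53 + 1   # 9007199254740993: not exactly representable as a double
payload = json.dumps({"id": tweet_id, "id_str": str(tweet_id)})

# A parser that (like JavaScript's) stores every JSON number as an
# IEEE 754 double silently corrupts the numeric twin:
as_double = float(json.loads(payload)["id"])
print(int(as_double) == tweet_id)                      # False: precision lost
# the decimal-string twin survives intact:
print(int(json.loads(payload)["id_str"]) == tweet_id)  # True
```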
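The ~33% base64 overhead from point 2 follows directly from the encoding’s 3-bytes-in, 4-chars-out ratio; a quick Python check:

```python
import base64

blob = bytes(range(256)) * 4            # 1024 bytes of arbitrary binary data
encoded = base64.b64encode(blob)        # every 3 input bytes -> 4 output chars
print(len(encoded), len(encoded) / len(blob))   # 1368 bytes: ~1.34x the input
```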

Binary Encoding #

Useful to consider when data volume is high; can have meaningful performance benefits

  • JSON/XML are fat in textual form, so it makes sense to use the binary variants that some systems offer

    e.g. MessagePack
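To make the compactness concrete, here is a hand-rolled sketch of a tiny MessagePack subset (not the real msgpack library; it covers only fixmap / fixstr / positive-fixint type bytes from the spec):

```python
import json

def msgpack_tiny(obj):
    """Encode a dict of short string keys -> small positive ints,
    using MessagePack's fixmap / fixstr / positive-fixint type bytes."""
    out = bytearray([0x80 | len(obj)])      # fixmap: 0x80 | number of entries
    for key, val in obj.items():
        out.append(0xA0 | len(key))         # fixstr: 0xA0 | byte length
        out += key.encode("utf-8")
        out.append(val)                     # positive fixint (0..127) as-is
    return bytes(out)

doc = {"a": 1, "bc": 2}
# textual JSON vs binary encoding of the same document:
print(len(json.dumps(doc).encode()), len(msgpack_tiny(doc)))  # 17 vs 8 bytes
```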

Thrift and Protocol Buffers #

Thrift and Protocol Buffers are alternative binary encoding libraries based on similar packing principles.

  • both require a schema to be associated with the encoding; field tagging is a key mechanism for both formats

  • typically need some codegen to be compatible with your classes that define the schema

  • typically, optional fields mainly affect runtime presence checks, not the encoding itself

  • managing schema changes:

    • the presence of a tag is a layer of indirection; the field names are kept separate from the tag numbers
    • we can add new fields to the schema, each field just needs a new tag number
    • caveats:
      1. new fields can’t be forced to be required (else backward compatibility is lost)
      2. when doing datatype changes, there might be precision-losses

    this allows it to provide both forward and backward compatibility

  • special caveats:

    1. [list/array datatypes]

      protobufs: no dedicated list / array datatype, a field is just marked repeated

      • we can change an optional field (single-valued) into a repeated (multi-valued) field

      thrift: has a dedicated, parameterised list datatype, so it can’t do the optional → repeated change

      • but it can have nested lists!
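The tag mechanism above can be sketched with protobuf’s actual wire format: each field is a varint key `(tag << 3) | wire_type` followed by the value, which is what lets old readers skip unknown tags (a minimal sketch covering only varint-typed fields):

```python
def encode_varint(n):
    """Protobuf-style varint: 7 payload bits per byte, MSB = continuation."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)   # more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_int_field(tag, value):
    """Field key = (tag << 3) | wire_type; wire type 0 means varint."""
    return encode_varint((tag << 3) | 0) + encode_varint(value)

# Field tag 1 holding the value 150 -- a reader that doesn't know tag 1
# can still parse the key, see it's a varint, and skip over the value:
print(encode_int_field(1, 150).hex())  # "089601"
```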

Avro #

  • from Apache, a subproject of Hadoop, started because Thrift was not a good fit for Hadoop’s use-cases

  • The schema is the authority, it determines how to read the binary because there’s no other indication of fields.

    • to decode the binary at all, the reader must know the exact schema the data was written with

    • there’s no information on what type of data is encoded either

    • field order doesn’t have to be the same

    • null is not automatically allowed, have to union type it with the non-null datatype

      helps with preventing null-related bugs

  • effect on schema evolution:

    1. writer and reader schema DON’T have to be the SAME, they just have to be COMPATIBLE \(\implies\) that’s how we can handle schema evolution
    2. forward compatibility means that you can have a new version of the schema as writer and an old version of the schema as reader. Conversely, backward compatibility means that you can have a new version of the schema as reader and an old version as writer.
    3. evolution rules:
      • fields may only be added or removed if they have default values
      • field datatype changes work only if Avro can convert between the types
  • how to coordinate reader and writer schemas? \(\implies\) usage context matters here

    1. large file, many records (typically in the context of Hadoop)

      • typically all encoded with the same schema
      • in this case, writer of the file can include the writer’s schema once @ beginning of the file
      • there’s a special object container file format for this
    2. DB with individually written records

      • record writing might be done @ different points in time \(\implies\) schema drift \(\implies\) use version number @ the beginning of each encoded record, reader references schema using the number
    3. records sent over network connection

      • the schemas can be agreed upon @ the point of the connection setup, can be used for the rest of the connection’s lifecycle

    in general, it’s useful to have a database of schema versions; it helps with dynamic compatibility checks, referencing, and so on.
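The reader/writer-schema resolution that makes this work can be sketched as a toy (hypothetical helper, not the real Avro library): match fields by name, fill new reader fields from defaults, ignore writer fields the reader doesn’t know:

```python
def resolve(record, writer_schema, reader_schema):
    """Toy Avro-style schema resolution over already-decoded records."""
    writer_fields = {f["name"] for f in writer_schema["fields"]}
    out = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in writer_fields:
            out[name] = record[name]       # field known to both schemas
        elif "default" in field:
            out[name] = field["default"]   # reader-only field: use default
        else:
            raise ValueError(f"no value or default for {name!r}")
    return out

v1 = {"fields": [{"name": "user"}]}                     # old schema version
v2 = {"fields": [{"name": "user"},
                 {"name": "tier", "default": "free"}]}  # new schema version

# backward compatibility: new reader (v2) decodes an old (v1) record
print(resolve({"user": "ada"}, v1, v2))   # {'user': 'ada', 'tier': 'free'}
# forward compatibility: old reader (v1) decodes a new (v2) record
print(resolve({"user": "ada", "tier": "pro"}, v2, v1))  # {'user': 'ada'}
```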

Dynamically Generated Schemas #

  • no tag numbers \(\implies\) a schema can be generated JIT, e.g. from the relational schema
  • typically codegen is useful where it’s statically (and hopefully strictly) typed.
  • in dynamically typed language, codegen is not as good because
    1. no compile-time type checking anyway

    2. introduces a “compilation” step that is unnecessary from the POV of the rest of the dynamically typed language

    3. if the schema was dynamically generated, codegen is even less useful for dynamically typed languages – the data could otherwise just live in an object container file

      the object container file can be self-describing – similar to how JSON’s structure self-describes its schema

The Merits of Schemas #

  • there’s typically already some proprietary binary encoding in use for data, e.g. db queries over the network are typically db-specific
  • benefits of having a schema (using binary encoding formats)
    1. compactness, because field names can be omitted
    2. the schema is a form of docs, typically better kept up to date
    3. a db of schemas allows checks for forward and backward compatibility
    4. statically typed programming languages can make a lot of use of code-generation from the schema, which also gives compile-time type-checking benefits

Modes of Dataflow #

  • Compatibility is a relationship between one process that encodes the data, and another process that decodes it.

  • so the rest of the chapter is about modes of dataflow

Dataflow Through Databases #

  • there’s a need to consider BOTH forward and backward compatibility, because there’s a dimension of time (data written at one point may be read much later, by older or newer code)

  • need to check for schema changes, else there’s chances of data-loss.

    When an older version of the application updates data previously written by a newer version of the application, data may be lost if you’re not careful.
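That data-loss hazard can be sketched like so (hypothetical field names; the bug is the old version’s decode/re-encode cycle dropping fields it doesn’t know about):

```python
OLD_KNOWN_FIELDS = {"name", "email"}     # the schema the old app version knows

# a record written by the newer app version, with an extra field
record_v2 = {"name": "ada", "email": "a@x", "photo_url": "https://img"}

# naive old-version update path: decode only the known fields, edit, re-encode
decoded = {k: v for k, v in record_v2.items() if k in OLD_KNOWN_FIELDS}
decoded["email"] = "ada@x"               # the actual edit
print("photo_url" in decoded)            # False: the newer field was lost
```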

Different values written at different times #

application deployments can roll out fast, but db migrations may take time and might need backfilling and other ETL ops \(\implies\) “data outlives code”

  • migrations can be expensive enough to want to avoid it in some cases

    • the db’s understanding of the schema change can do data backfills (e.g. filling in NULLs) just in time, on read
  • Schema evolution thus allows the entire database to appear as if it was encoded with a single schema, even though the underlying storage may contain records encoded with various historical versions of the schema.

Archival Storage #

For snapshotting, even if the source contains mixed schema versions, we copy it out encoded with the latest schema – might as well do so, and the copy ends up being more consistent as well

the data dump is immutable anyway

it’s possible to encode the data in an analytics-friendly columnar format also (e.g. Parquet)
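The row-vs-column layout difference behind formats like Parquet can be sketched in a few lines (plain-Python stand-in, not actual Parquet encoding):

```python
# row-oriented: one dict per record, as the app sees the data
rows = [{"user": "ada", "age": 36}, {"user": "bob", "age": 41}]

# column-oriented: one array per field, which compresses and scans better
columns = {key: [r[key] for r in rows] for key in rows[0]}
print(columns["age"])   # [36, 41]: an analytics scan touches only this column
```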

Dataflow Through Services: REST and RPC #

  • typical stuff that we’re used to in the web world

  • having a hierarchical cascade of services \(\implies\) service oriented architecture (SOA) \(\implies\) more modernly, microservices architecture.

  • fair to see a service as playing a similar role to a db, just that the API adds a layer of indirection that allows some logic-encapsulation (so it’s more than just a data schema)

  • design goal of a service-oriented/microservices architecture is to make the application easier to change and maintain by making services independently deployable and evolvable.

Web-services #

  • REST should be seen as a design philosophy that builds onto HTTP’s principles
  • SOAP was supposed to work independently of HTTP if needed
  • SOAP typically suffers from interoperability issues between different vendors’ implementations

RPCs #

  • location transparency: we try to make a request to a remote network service look the same as calling a function / method in your programming language, as though it’s within the same process
  • fundamental flaws:
    1. unpredictability @ network, function execution, machine…
    2. outcomes of a function call can be varied, might even just timeout or something because of a crash
    3. idempotency issues when you retry a service call
    4. network latencies affecting client experience
    5. client and service don’t share memory \(\implies\) any context has to be encoded and transferred over if needed
  • modern RPCs still have their use cases, e.g.
    • gRPC supports streams of request-responses
    • others wrap things up into futures/promises
    • some of them provide a service registry, so load balancing and all that can happen
  • typically, custom RPC protocols with binary encoding formats can be more performant than generics like JSON over REST

Message-Passing Dataflow #

  • message passing inter-mediated via a message broker / message queue / message-oriented middleware
  • advantages compared to direct RPC
    • can be buffered \(\implies\) better system reliability
    • broker can redeliver messages \(\implies\) prevents message loss
    • node discovery can be outsourced
    • can broadcast messages
    • sender (producer) and recipient (consumer) are decoupled
  • async comms: typically it’s unidirectional, responses are typically over a separate channel – sender doesn’t wait (and get blocked), just sends and forgets
  • KIV chapter 11 for comparisons across message brokers
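The fire-and-forget, broker-mediated flow can be sketched with an in-process queue standing in for a real broker:

```python
import queue
import threading

broker = queue.Queue()   # stand-in for a real message broker
processed = []

def consumer():
    """Drain messages asynchronously; no reply is sent back to the producer."""
    while True:
        msg = broker.get()
        if msg is None:              # sentinel: shut down the consumer
            break
        processed.append(msg)        # handle the message

worker = threading.Thread(target=consumer)
worker.start()
broker.put({"event": "user_signed_up"})   # send and forget: producer moves on
broker.put(None)
worker.join()
print(processed)                          # [{'event': 'user_signed_up'}]
```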

Distributed actor frameworks – actor model #

  • actor model for concurrency. Note: I realise the actor model presented as a concurrency model might be foreign to people who haven’t seen concurrency models from a first-principles POV

  • better location transparency when we use distributed actor frameworks

    evolution still needs to be managed well regardless of framework used.

    examples of such frameworks:

    1. Akka

    2. Orleans

    3. Erlang OTP

Summary #