Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems

Chapter 1. Reliable, Scalable, and Maintainable Applications


Thinking About Data Systems #

Access patterns -> performance characteristics -> implementations #

This is where “data systems” differ: they have different access patterns, hence different performance characteristics, and therefore their implementations are different.

Reliability #

Faults vs Failures and how the objective of fault-tolerance mechanisms is to prevent faults from becoming failures #

A fault is usually defined as one component of the system deviating from its spec, whereas a failure is when the system as a whole stops providing the required service to the user. It is impossible to reduce the probability of a fault to zero; therefore it is usually best to design fault-tolerance mechanisms that prevent faults from causing failures.

“Failure” here is in providing the desired service to users.

Chaos Engineering as a way to exercise and test the fault-tolerance of a system #

Hardware Faults #

hardware errors being increasingly managed via software fault tolerance instead of hardware fault tolerance (e.g. redundancies) #

there is a move toward systems that can tolerate the loss of entire machines, by using software fault-tolerance techniques in preference or in addition to hardware redundancy.

Software Errors #

Software errors tend to affect a greater number of nodes than hardware errors #

and because they are correlated across nodes, they tend to cause many more system failures than uncorrelated hardware faults [5].

Software bugs: it stems from assumptions made about its environment #

software is making some kind of assumption about its environment—and while that assumption is usually true, it eventually stops being true for some reason [11].

Human Errors #

some ways of reducing human error #

systems combine several approaches:

• Design systems in a way that minimizes opportunities for error. For example, well-designed abstractions, APIs, and admin interfaces make it easy to do “the right thing” and discourage “the wrong thing.” However, if the interfaces are too restrictive people will work around them, negating their benefit, so this is a tricky balance to get right.

• Decouple the places where people make the most mistakes from the places where they can cause failures. In particular, provide fully featured non-production sandbox environments where people can explore and experiment safely, using real data, without affecting real users.

• Test thoroughly at all levels, from unit tests to whole-system integration tests and manual tests [3]. Automated testing is widely used, well understood, and especially valuable for covering corner cases that rarely arise in normal operation.

How Important Is Reliability? #

Cutting corners willfully when doing 0 to 1 work #

narrow profit margin)—but we should be very conscious of when we are cutting corners.

I think this part is especially relevant for startup work.

Scalability #

Describing Load #

Load is more of a time-snapshot #

Seems like being able to sample your load is a prerequisite to the other things, like describing your performance.

load parameters depend on which operations of yours are most important #

Load can be described with a few numbers which we call load parameters. The best choice of parameters depends on the architecture of your system: it may be requests per second to a web server, the ratio of reads to writes in a database, the number of simultaneously active users in a chat room, the hit rate on a cache, or something else. Perhaps the average case is what matters for you, or perhaps your bottleneck is dominated by a small number of extreme cases.
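As a toy illustration, here is a sketch (with an invented log format — the field names are made up) of deriving a few of those load parameters from a request log:

```javascript
// Hypothetical request log covering a one-second window (ts in milliseconds).
const log = [
  { ts: 0, op: 'read', cacheHit: true },
  { ts: 200, op: 'write', cacheHit: false },
  { ts: 400, op: 'read', cacheHit: true },
  { ts: 900, op: 'read', cacheHit: false },
];

const windowSeconds = 1;
const requestsPerSecond = log.length / windowSeconds;

const reads = log.filter(e => e.op === 'read').length;
const writes = log.filter(e => e.op === 'write').length;
const readWriteRatio = reads / writes;

const hits = log.filter(e => e.cacheHit).length;
const cacheHitRate = hits / log.length;

console.log({ requestsPerSecond, readWriteRatio, cacheHitRate });
// → { requestsPerSecond: 4, readWriteRatio: 3, cacheHitRate: 0.5 }
```

Which of these numbers matters is exactly the architecture-dependent choice the quote describes.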

fanouts from a single operation is important to consider (twitter example) #

In transaction processing systems, we use it to describe the number of requests to other services that we need to make in order to serve one incoming request.

Introduction to the Twitter Home Timeline Example #

The key learning here is that the relevant load parameter depends on the system: after working through two approaches for assembling the home timeline, it turns out to be the distribution of followers per user, since that is what drives the fan-out rate.

Also quite interesting that the mature solution ended up being a hybrid of the two instead of picking one over the other.
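The two approaches can be sketched roughly as follows. This is illustrative only, not Twitter’s actual design — the data structures and names are invented. Fan-out on write pushes each new tweet into every follower’s cached timeline; fan-out on read scans followed users’ tweets at read time.

```javascript
const followers = { alice: ['bob', 'carol'], bob: ['carol'] }; // author -> followers
const tweets = [];                                   // global list of { author, text }
const timelines = { alice: [], bob: [], carol: [] }; // per-user timeline cache

// Approach 1: fan out on write — write cost grows with follower count.
function postFanOutOnWrite(author, text) {
  const tweet = { author, text };
  tweets.push(tweet);
  for (const f of followers[author] ?? []) timelines[f].push(tweet);
}

// Approach 2: fan out on read — read cost grows with followed-user count.
function follows(user, author) {
  return (followers[author] ?? []).includes(user);
}
function readFanOutOnRead(user) {
  return tweets.filter(t => follows(user, t.author));
}

postFanOutOnWrite('alice', 'hello');
postFanOutOnWrite('bob', 'hi');
console.log(timelines.carol.length);           // 2 — precomputed at write time
console.log(readFanOutOnRead('carol').length); // 2 — computed at read time
```

The hybrid the chapter mentions uses the cheap path for most users and the other path for the extreme tail (users with huge follower counts).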

Describing Performance #

Response time should be thought of as a distribution of values instead of a single number #

Even if you only make the same request over and over again, you’ll get a slightly different response time on every try. In practice, in a system handling a variety of requests, the response time can vary a lot. We therefore need to think of response time not as a single number, but as a distribution of values that you can measure.

In Figure 1-4, each gray bar represents a request to a service, and its height shows how long that request took. Most requests are reasonably fast, but there are occasional outliers that take much longer. Perhaps the slow requests are intrinsically more expensive, e.g., because they process more data. But even in a scenario where you’d think all requests should take the same time, you get variation: random additional latency could be introduced by a context switch to a background process, the loss of a network packet and TCP retransmission, a garbage collection pause, a page fault forcing a read from disk, mechanical vibrations in the server rack [18], or many other causes.

Using average (arithmetic mean) response time vs using percentiles for response time #

Tail latencies are important to measure because it likely correlates to the experience that your power users get! #

High percentiles of response times, also known as tail latencies, are important because they directly affect users’ experience of the service.

The higher percentiles are an essential measure of users’ experience of the service. This matters because the outliers likely correspond to power users as well (e.g. the customers with the slowest requests are the ones with the most data on their accounts, because they’re the power users).
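A minimal sketch of the mean-vs-percentiles difference, using made-up response times with one slow outlier (nearest-rank percentiles for simplicity):

```javascript
const responseTimesMs = [12, 14, 15, 15, 16, 18, 20, 25, 40, 300]; // one outlier

// Nearest-rank percentile: smallest sample with at least p% of samples at or below it.
function percentile(samples, p) {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const mean = responseTimesMs.reduce((a, b) => a + b, 0) / responseTimesMs.length;
console.log(mean);                            // 47.5 — dragged up by the outlier
console.log(percentile(responseTimesMs, 50)); // 16  — the typical request
console.log(percentile(responseTimesMs, 90)); // 40
console.log(percentile(responseTimesMs, 99)); // 300 — the tail
```

No real request took anywhere near the mean of 47.5 ms, which is why the median and high percentiles describe the experience better than the average does.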

Response times should be measured client-side; accounts for events such as head-of-line (HOL) blocking #

example, by its number of CPU cores), it only takes a small number of slow requests to hold up the processing of subsequent requests—an effect sometimes known as head-of-line blocking. Even if those subsequent requests are fast to process on the server, the client will see a slow overall response time due to the time waiting for the prior request to complete. Due to this effect, it is important to measure response times on the client side.

This outlines some ways that HOL blocking can be better analysed, by focusing on client-side responses:

The rough idea is that there are both real-time and synthetic tests we can use to keep an eye on these kinds of client-side perceived experiences.

Measuring client-side response times is critical for accurately assessing real user experience, particularly when addressing issues like head-of-line (HOL) blocking. HOL blocking occurs when a slow or stalled request delays subsequent requests in a queue, creating bottlenecks that server-side metrics alone cannot fully capture. Below is a technical breakdown of client-side measurement approaches, tooling, and implementation strategies:

---

### **Why Client-Side Measurement Matters for HOL Blocking**
- **Server-side limitations**: Server metrics (e.g., TTFB) measure backend efficiency but miss network latency, browser rendering delays, and HOL queuing effects.
- **End-to-end visibility**: Client-side metrics reflect the actual user experience, including DNS resolution, TLS negotiation, and resource contention.
- **HOL impact**: A slow client-side request (e.g., a large image download) can block critical API calls or subsequent page elements, degrading perceived performance.

---

### **Client-Side Measurement Tools & Techniques**
#### 1. **Browser Developer Tools**
- **Chrome DevTools Network Tab**:
  - Records detailed timing data (TTFB, content download, DOMContentLoaded).
  - Identifies HOL blocking by analyzing request waterfalls and queuing delays.
  - Use `performance.getEntries()` in the console to extract granular metrics[1][3].
  ```javascript
  // Example: Measure API response time for a specific endpoint
  performance.getEntries().filter(entry => entry.name.includes('/api/data'));
  ```

#### 2. **Real User Monitoring (RUM)**
- **Tools**: Datadog APM, New Relic Browser, Raygun APM[2][6].
  - **Key features**:
    - Track `Navigation Timing API` metrics (e.g., `domInteractive`, `loadEventEnd`).
    - Capture resource timing for third-party scripts and API calls.
    - Correlate client-side traces with server-side APM data to pinpoint HOL bottlenecks.

#### 3. **Synthetic Monitoring**
- **Tools**: BrowserStack, WebPageTest, BlazeMeter[3][5].
  - Simulate user journeys from global locations to measure:
    - **First Contentful Paint (FCP)**
    - **Time to Interactive (TTI)**
    - **Total Blocking Time (TBT)**
  - Configure multi-step scripts to replicate HOL scenarios (e.g., sequential API calls).

#### 4. **Custom Performance Observers**
- **JavaScript Performance API**:
  - Measure specific user interactions or API calls:
  ```javascript
  // Track fetch() requests
  const observer = new PerformanceObserver((list) => {
    list.getEntries().forEach(entry => console.log(entry));
  });
  observer.observe({ type: 'resource', buffered: true });
  ```

#### 5. **Distributed Tracing**
- **Tools**: OpenTelemetry, Zipkin, Jaeger.
  - Instrument client-side spans to trace requests end-to-end.
  - Example workflow:
    1. Inject trace headers into client HTTP requests.
    2. Propagate context through microservices.
    3. Visualize client-side queuing delays in tracing UIs.
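Step 1 of that workflow can be sketched as below. This hand-rolls a W3C Trace Context `traceparent` header purely to show its shape; in practice an OpenTelemetry SDK generates and injects it for you.

```javascript
// Produce n random bytes as a lowercase hex string.
function randomHex(bytes) {
  return Array.from({ length: bytes * 2 },
    () => Math.floor(Math.random() * 16).toString(16)).join('');
}

// traceparent = version "00" - 16-byte trace-id - 8-byte span-id - flags "01" (sampled)
function makeTraceparent() {
  const traceId = randomHex(16); // 32 hex chars
  const spanId = randomHex(8);   // 16 hex chars
  return `00-${traceId}-${spanId}-01`;
}

const headers = { traceparent: makeTraceparent() };
// fetch('https://example.com/api', { headers })  // header propagates downstream
console.log(headers.traceparent);
```

Downstream services copy the trace-id into their own spans, which is what lets the tracing UI stitch the client-side wait time into the end-to-end picture.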

---

### **Key Metrics to Capture**
- **Time to First Byte (TTFB)**: Latency between request initiation and first byte received[5].
- **Total Blocking Time (TBT)**: Sum of main-thread blocking periods during page load.
- **90th/95th Percentile Response Times**: Identify outliers exacerbated by HOL blocking[4][5].
- **Resource Contention**: Concurrent connections and HTTP/2 stream prioritization issues.

---

### **Implementation Strategy**
1. **Instrument Critical User Journeys**:
   - Use `window.performance.mark()` to timestamp key events (e.g., login, checkout).
2. **Deploy RUM Agents**:
   - Embed lightweight scripts (e.g., Google Analytics 4) to collect real-user data.
3. **Leverage APM Integrations**:
   - Configure Datadog or New Relic to merge client/server traces[2][6].
4. **Automate Synthetic Tests**:
   - Schedule BrowserStack tests to monitor HOL regressions in CI/CD pipelines[3].
5. **Analyze Waterfall Charts**:
   - Identify queued requests in Chrome DevTools or WebPageTest[1][3].

---

### **Addressing HOL Blocking**
- **HTTP/2 Prioritization**: Assign weights to critical requests (e.g., CSS, API payloads).
- **Request Throttling**: Limit concurrent requests for non-essential resources.
- **Lazy Loading**: Defer non-critical assets (e.g., images below the fold).
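The request-throttling idea can be sketched as a small concurrency limiter — a toy version of what libraries like p-limit provide, shown here only to illustrate the mechanism:

```javascript
// Cap the number of in-flight tasks so non-essential requests
// can't queue up and starve critical ones.
function makeLimiter(maxConcurrent) {
  let active = 0;
  const waiting = [];
  const next = () => {
    if (active < maxConcurrent && waiting.length > 0) {
      active++;
      waiting.shift()();
    }
  };
  return function run(task) {
    return new Promise((resolve, reject) => {
      waiting.push(() =>
        Promise.resolve()
          .then(task)
          .then(resolve, reject)
          .finally(() => { active--; next(); }));
      next();
    });
  };
}

// Usage: at most 2 simulated downloads in flight at once.
const limit = makeLimiter(2);
const delay = ms => new Promise(r => setTimeout(r, ms));
Promise.all([1, 2, 3, 4].map(i => limit(() => delay(10).then(() => i))))
  .then(results => console.log(results)); // → [ 1, 2, 3, 4 ]
```

In a real page you would route image and analytics fetches through the limiter while letting critical API calls bypass it.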

By combining client-side tooling with metrics like percentile response times and TBT, teams can isolate HOL blocking and optimize for true user-perceived performance[4][5][6].

Citations:
[1] http://nodesource.com/blog/measuring-latency-client-side-devtools/
[2] https://www.techtarget.com/searchapparchitecture/tip/Top-application-performance-monitoring-tools
[3] https://www.browserstack.com/guide/response-time-test
[4] https://www.perfmatrix.com/client-side-result-analysis/
[5] https://www.blazemeter.com/blog/key-test-metrics-to-track
[6] https://odown.com/blog/what-is-a-good-api-response-time/
[7] https://evuez.net/notes/head-of-line-blocking.html
[8] https://dzone.com/articles/client-side-performance-testing
[9] https://stackoverflow.com/questions/14849314/measuring-server-response-time-client-side
[10] https://www.headspin.io/blog/client-side-performance-testing-metrics-to-consider
[11] https://www.reddit.com/r/reactjs/comments/10zy709/how_do_you_measure_performance_on_your_client_side/
[12] https://citeseerx.ist.psu.edu/document?repid=rep1&type=pdf&doi=646c2d0aeb595552b7fe257a4ef02d08f0cdb5b8
[13] https://testguild.com/load-testing-tools/
[14] https://www.zendesk.com/blog/first-reply-time/
[15] https://community.dynatrace.com/t5/Real-User-Monitoring/Client-Side-Response-Time/m-p/176138
[16] https://stackoverflow.com/questions/26787980/measure-average-web-app-response-time-from-the-client-side-during-a-long-period
[17] https://uptimerobot.com/response-time-monitoring/
[18] https://www.scoutapm.com/frontend-monitoring/


gotcha on fast calculation of metrics - percentiles should be accumulated and not averaged #

Beware that averaging percentiles, e.g., to reduce the time resolution or to combine data from several machines, is mathematically meaningless—the right way of aggregating response time data is to add the histograms [28].
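A sketch of why, with invented bucket counts: averaging two machines’ p90s yields a number that corresponds to nothing, while summing the histograms and reading the percentile off the merged counts gives the true combined p90.

```javascript
// Histograms as { bucketUpperBoundMs: count }; adding them is just
// summing counts per bucket.
function addHistograms(a, b) {
  const out = { ...a };
  for (const [bucket, count] of Object.entries(b)) {
    out[bucket] = (out[bucket] ?? 0) + count;
  }
  return out;
}

// Walk buckets in order until p% of the total count has been seen.
function percentileFromHistogram(hist, p) {
  const buckets = Object.keys(hist).map(Number).sort((x, y) => x - y);
  const total = buckets.reduce((s, b) => s + hist[b], 0);
  const target = (p / 100) * total;
  let seen = 0;
  for (const b of buckets) {
    seen += hist[b];
    if (seen >= target) return b;
  }
  return buckets[buckets.length - 1];
}

// Machine A is fast; machine B handles 10x the traffic and is slower.
const histA = { 10: 90, 100: 10 };    // p90 = 10 ms
const histB = { 10: 100, 100: 900 };  // p90 = 100 ms

const averaged = (percentileFromHistogram(histA, 90) +
                  percentileFromHistogram(histB, 90)) / 2;
console.log(averaged);  // 55 — meaningless: no percentile of the combined traffic
console.log(percentileFromHistogram(addHistograms(histA, histB), 90)); // 100 — correct
```

The averaged number ignores that machine B serves ten times the requests; adding the histograms weights every request equally, which is what a percentile over the whole fleet means.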

Approaches for Coping with Load #

elastic vs manually scaled systems #

systems are elastic, meaning that they can automatically add computing resources when they detect a load increase, whereas other systems are scaled manually (a human analyzes the capacity and decides to add more machines to the system). An elastic system can be useful if load is highly unpredictable, but manually scaled systems are simpler and may have fewer operational surprises

Scalable architecture is built using good assumptions and sound primitives #

An architecture that scales well for a particular application is built around assumptions of which operations will be common and which will be rare—the load parameters. If those assumptions turn out to be wrong, the engineering effort for scaling is at best wasted, and at worst counterproductive. In an early-stage startup or an unproven product it’s usually more important to be able to iterate quickly on product features than it is to scale to some hypothetical future load. Even though they are specific to a particular application, scalable architectures are nevertheless usually built from general-purpose building blocks, arranged in familiar patterns. In this book we discuss those building blocks and patterns.

Maintainability #

Operability: Making Life Easy for Operations #

Simplicity: Managing Complexity #

effect of complexity on maintainability; why keep things simple #

When complexity makes maintenance hard, budgets and schedules are often overrun. In complex software, there is also a greater risk of introducing bugs when making a change: when the system is harder for developers to understand and reason about, hidden assumptions, unintended consequences, and unexpected interactions are more easily overlooked. Conversely, reducing complexity greatly improves the maintainability of software, and thus simplicity should be a key goal for the systems we build.

defining “accidental complexity”: when complexity arises from the implementation and not from the problem to be solved #

removing accidental complexity. Moseley and Marks [32] define complexity as accidental if it is not inherent in the problem that the software solves (as seen by the users) but arises only from the implementation.

good abstractions as a deterrent to accidental complexity #

we have for removing accidental complexity is abstraction. A good abstraction can hide a great deal of implementation detail behind a clean, simple-to-understand façade. A good abstraction can also be used for a wide range of different applications. Not only is this reuse more efficient than reimplementing a similar thing multiple times, but it also leads to higher-quality software, as quality improvements in the abstracted component benefit all applications that use it.

Evolvability: Making Change Easy #

defining evolvability #

The ease with which you can modify a data system, and adapt it to changing requirements, is closely linked to its simplicity and its abstractions: simple and easy-to-understand systems are usually easier to modify than complex ones. But since this is such an important idea, we will use a different word to refer to agility on a data system level: evolvability [34].

Summary #