Distributed Tracing

🟠 P1 — following a request across service boundaries

Problem

A single user request can touch 5-20 services. When latency spikes, which service is the bottleneck? Logs from an individual service show only that service's slice of the request, not how time was spent across the whole call chain.

Mechanism

Trace ID: abc-123 (propagated via headers)

Gateway [50ms] ──→ Auth Service [10ms] ──→ User Service [200ms] ──→ DB [150ms]
                                          └──→ Cache [2ms] (parallel)
                ──→ Payment Service [30ms]

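Total: 290ms (50 + 10 + 200 + 30, reading the top-level hops as sequential; the 150ms DB query is part of User Service's 200ms). Bottleneck: User Service → DB query (150ms)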
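
One way to make the bottleneck reading above mechanical is to compute each span's self time: its duration minus the time covered by its children. The sketch below is illustrative only. The `Span` type, the helper names, and the start/end timestamps are assumptions chosen to match the durations in the diagram; a real tracing backend records its own timestamps.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    """Hypothetical minimal span record: a named interval with a parent."""
    name: str
    parent: Optional[str]
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def covered_ms(intervals: Iterable[Tuple[int, int]]) -> int:
    """Length of the union of (start, end) intervals, so parallel
    children (DB and Cache in the diagram) are not counted twice."""
    total = 0
    current_end: Optional[int] = None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            total += end - start
            current_end = end
        elif end > current_end:
            total += end - current_end
            current_end = end
    return total


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Time each span spent on its own work, excluding its children."""
    result = {}
    for span in spans:
        children = [(c.start_ms, c.end_ms) for c in spans if c.parent == span.name]
        result[span.name] = span.duration_ms - covered_ms(children)
    return result


# Assumed timeline that reproduces the diagram's durations.
SPANS = [
    Span("gateway", None, 0, 290),
    Span("auth", "gateway", 50, 60),
    Span("user", "gateway", 60, 260),
    Span("db", "user", 80, 230),
    Span("cache", "user", 80, 82),
    Span("payment", "gateway", 260, 290),
]

times = self_times(SPANS)
bottleneck = max(times, key=times.get)
print(times, bottleneck)
```

Under these assumed timestamps the DB span has the largest self time, which matches the diagram's conclusion. Ranking by raw duration instead would point at the gateway, since a parent span always includes its children.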

Key Concepts

Trace: The full journey of a request
Span: A single operation within the trace (one service call, one DB query)
Context propagation: Trace ID and parent span ID passed between services via HTTP headers (the W3C traceparent header) or gRPC metadata
Sampling: At high volume, trace 1-10% of requests to control storage costs
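
As a rough illustration of context propagation, the sketch below parses and re-emits a `traceparent` header in the W3C Trace Context version 00 layout (`version-traceid-parentid-flags`). The function and type names are hypothetical, and the parser deliberately accepts only version 00 as a simplification. In practice an instrumentation library such as OpenTelemetry would handle this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: 2 hex version, 32 hex trace id, 16 hex parent id, 2 hex flags.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str   # shared by every span in the trace
    span_id: str    # identifies the span that sent the request
    sampled: bool   # lowest bit of the flags field


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the context carried by a header, or None if it is invalid."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Simplification: only version 00 is handled here.
    if version != "00":
        return None
    # All-zero IDs are defined as invalid.
    if trace_id == "0" * 32 or span_id == "0" * 16:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def new_context(sampled: bool = True) -> TraceContext:
    """Start a new trace at the edge when no valid header arrived."""
    return TraceContext(secrets.token_hex(16), secrets.token_hex(8), sampled)


def child_of(parent: TraceContext) -> TraceContext:
    """Same trace, fresh span ID, inherited sampling decision."""
    return TraceContext(parent.trace_id, secrets.token_hex(8), parent.sampled)


def format_traceparent(ctx: TraceContext) -> str:
    flags = "01" if ctx.sampled else "00"
    return f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


def outgoing_headers(incoming: dict) -> dict:
    """Headers to attach to a downstream call made while handling `incoming`."""
    parent = parse_traceparent(incoming.get("traceparent", ""))
    ctx = child_of(parent) if parent else new_context()
    return {"traceparent": format_traceparent(ctx)}
```

A service would call something like `outgoing_headers(request_headers)` before each downstream request, so every hop keeps the same trace ID while getting its own span ID.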

Instinct

Tracing is most valuable for debugging latency, not errors (logs are better for errors).

Implement trace context propagation from day one — retrofitting it is painful.

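Use head-based sampling (decide at ingress) for predictable costs, and tail-based sampling (decide after the trace completes) to capture interesting traces (errors, high latency). Tail-based sampling has to buffer a trace's spans until it finishes, so it costs more to run.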
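
The two sampling strategies can be sketched as two small decision functions. This is a minimal illustration, not any particular collector's algorithm. The names `head_sample`, `tail_sample`, and `FinishedTrace`, along with the 500 ms threshold and 1% baseline rate, are assumptions chosen for the example, and the head-based check assumes trace IDs are uniformly random hex strings.

```python
from dataclasses import dataclass


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at ingress from the trace ID alone.

    The decision is deterministic, so every service that sees the same
    trace ID reaches the same answer without coordination.
    """
    low_64_bits = int(trace_id[-16:], 16)
    return low_64_bits < int(rate * 2**64)


@dataclass(frozen=True)
class FinishedTrace:
    """Hypothetical summary available only once the whole trace is done."""
    trace_id: str
    duration_ms: float
    has_error: bool


def tail_sample(
    trace: FinishedTrace,
    latency_threshold_ms: float = 500.0,  # assumed threshold
    baseline_rate: float = 0.01,          # assumed share of ordinary traces kept
) -> bool:
    """Tail-based: keep every error or slow trace, plus a small baseline."""
    if trace.has_error or trace.duration_ms >= latency_threshold_ms:
        return True
    return head_sample(trace.trace_id, baseline_rate)
```

The head-based function needs only the trace ID, which is why its cost is predictable. The tail-based function needs the finished trace's duration and error status, which is why spans must be held until the trace completes.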

References

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure — Google (2010)
OpenTelemetry: Traces