Distributed Tracing

🟠 P1 — following a request across service boundaries

Problem #

A single user request can touch 5-20 services. When latency spikes, which service is the bottleneck? Each service's logs cover only its own part of the request, so they don't show the full picture.

Mechanism #

Trace ID: abc-123 (propagated via headers)

Gateway [50ms] ──→ Auth Service [10ms] ──→ User Service [200ms] ──→ DB [150ms]
                                          └──→ Cache [2ms] (parallel)
                ──→ Payment Service [30ms]

Total: 290ms. Bottleneck: User Service → DB query (150ms)
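One way to find the bottleneck programmatically is to compute each span's self time: its duration minus the time covered by its child spans. The sketch below is only illustrative. The `Span` record, the span names, and the start/end timestamps are hypothetical, laid out so that the durations match the diagram above (including the cache call overlapping the DB query).

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Tuple


@dataclass(frozen=True)
class Span:
    name: str
    parent: Optional[str]  # name of the parent span, None for the root
    start_ms: int
    end_ms: int


def covered(intervals: Iterable[Tuple[int, int]]) -> int:
    """Total length of the union of (start, end) intervals, so that
    overlapping (parallel) children are not counted twice."""
    total, cur_end = 0, None
    for start, end in sorted(intervals):
        if cur_end is None or start > cur_end:
            total += end - start
            cur_end = end
        elif end > cur_end:
            total += end - cur_end
            cur_end = end
    return total


def self_times(spans: List[Span]) -> Dict[str, int]:
    """Self time per span: own duration minus time spent in children."""
    result = {}
    for s in spans:
        children = [(c.start_ms, c.end_ms) for c in spans if c.parent == s.name]
        result[s.name] = (s.end_ms - s.start_ms) - covered(children)
    return result


# Hypothetical timestamps chosen to reproduce the diagram's durations.
SPANS = [
    Span("gateway", None, 0, 290),
    Span("auth", "gateway", 50, 60),
    Span("user", "gateway", 60, 260),
    Span("db", "user", 70, 220),
    Span("cache", "user", 70, 72),  # runs in parallel with the DB query
    Span("payment", "gateway", 260, 290),
]

times = self_times(SPANS)
bottleneck = max(times, key=times.get)
print(bottleneck, times[bottleneck])
```

Ranking by self time rather than raw duration keeps a parent span from being blamed for time that was really spent in one of its children.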

Key Concepts #

  • Trace: The full journey of a request
  • Span: A single operation within the trace (one service call, one DB query)
  • Context propagation: Trace ID and parent span ID passed to downstream services via HTTP headers (W3C traceparent) or gRPC metadata
  • Sampling: At high volume, trace 1-10% of requests to control storage costs
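As a rough illustration of context propagation, the sketch below parses an incoming `traceparent` header and builds the header a service would attach to its outgoing calls. It assumes the W3C Trace Context layout (`version-traceid-parentid-flags`, lowercase hex); the short `abc-123` ID in the diagram is only a placeholder and would not pass this check. The helper names `parse_traceparent` and `outgoing_headers` are hypothetical, and the choice to mark a brand-new trace as sampled is an assumption made for brevity.

```python
import re
import secrets
from typing import Dict, NamedTuple, Optional

# version (2 hex) - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceContext]:
    """Return the context carried by the header, or None if it is invalid."""
    m = _TRACEPARENT.match(header.strip())
    if m is None:
        return None
    version, trace_id, span_id, flags = m.groups()
    # All-zero IDs and version "ff" are invalid in the W3C format.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return TraceContext(trace_id, span_id, bool(int(flags, 16) & 0x01))


def outgoing_headers(incoming: Optional[str]) -> Dict[str, str]:
    """Keep the caller's trace ID, but advertise this service's own span ID."""
    ctx = parse_traceparent(incoming) if incoming else None
    if ctx is None:
        # No usable context: start a new trace (assumed sampled here).
        ctx = TraceContext(secrets.token_hex(16), "", True)
    new_span_id = secrets.token_hex(8)
    flags = "01" if ctx.sampled else "00"
    return {"traceparent": f"00-{ctx.trace_id}-{new_span_id}-{flags}"}


incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"
print(outgoing_headers(incoming))
```

The trace ID stays constant across every hop while the span ID changes at each one, which is what lets a backend reassemble the spans into a single tree. In practice a tracing library such as OpenTelemetry usually handles this step.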

Instinct #

Tracing is most valuable for debugging latency; for errors, logs are usually the better tool.

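Implement trace context propagation from day one. Retrofitting it is painful because every service on the request path must forward the context, and a single gap splits the trace.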

Use head-based sampling (decide at ingress) for predictable costs, and tail-based sampling (decide after the trace completes) to capture the interesting traces (errors, high latency).
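The difference between the two sampling strategies can be sketched in a few lines. This is a simplified illustration, not any particular collector's implementation: `FinishedSpan`, `head_sample`, `tail_sample`, and the 500 ms threshold are all hypothetical, and deriving the head decision from the last 8 hex digits of the trace ID is just one common way to make it deterministic.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FinishedSpan:
    trace_id: str
    duration_ms: float
    error: bool = False


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at ingress, before any work is done.

    Maps the last 8 hex digits of the trace ID to a bucket in [0, 1),
    so the same trace ID always yields the same decision.
    """
    bucket = int(trace_id[-8:], 16) / 0x1_0000_0000
    return bucket < rate


def tail_sample(spans: Sequence[FinishedSpan], slow_ms: float = 500.0) -> bool:
    """Tail-based: decide once all spans of a trace have been collected.

    Keeps the trace if any span failed or was slower than the threshold.
    """
    return any(s.error or s.duration_ms >= slow_ms for s in spans)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
print(head_sample(trace_id, rate=0.10))
print(tail_sample([FinishedSpan(trace_id, 20.0), FinishedSpan(trace_id, 800.0)]))
```

Head-based sampling needs no state, which is why its cost is predictable. Tail-based sampling has to buffer every span of a trace until the trace completes, so it trades memory and infrastructure for the ability to keep exactly the traces worth looking at.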
