Distributed Tracing
🟠P1 — following a request across service boundaries
Problem #
A single user request touches 5-20 services. When latency spikes, which service is the bottleneck? Logs from individual services don’t show the full picture.
Mechanism #
Trace ID: abc-123 (propagated via headers)
Gateway [50ms] ──→ Auth Service [10ms] ──→ User Service [200ms] ──→ DB [150ms]
└──→ Cache [2ms] (parallel)
──→ Payment Service [30ms]
Total: 290ms. Bottleneck: User Service → DB query (150ms)

Key Concepts #
- Trace: The full journey of a request
- Span: A single operation within the trace (one service call, one DB query)
- Context propagation: Trace ID passed via HTTP headers (traceparent) or gRPC metadata
- Sampling: At high volume, trace 1-10% of requests to control storage costs
Instinct #
Tracing is most valuable for debugging latency, not errors (logs are better for errors).
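
As a rough sketch of how a trace pinpoints latency, the snippet below computes each span's self time (its duration minus the time covered by its direct children) for the trace in the Mechanism diagram. The durations come from that diagram. The start offsets, the parent/child nesting (DB and Cache under User Service, the Gateway span covering the whole 290ms request) and the `Span` type are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    name: str
    parent: Optional[str]
    start_ms: int
    end_ms: int


def covered_ms(intervals):
    """Total length of the union of (start, end) intervals."""
    total, cursor = 0, None
    for start, end in sorted(intervals):
        if cursor is None or start > cursor:
            total += end - start
            cursor = end
        elif end > cursor:
            total += end - cursor
            cursor = end
    return total


def self_times(spans):
    """Self time = span duration minus time covered by its direct children."""
    result = {}
    for span in spans:
        children = [(c.start_ms, c.end_ms) for c in spans if c.parent == span.name]
        result[span.name] = (span.end_ms - span.start_ms) - covered_ms(children)
    return result


# Offsets are invented; only the durations are taken from the diagram.
TRACE = [
    Span("Gateway", None, 0, 290),
    Span("Auth Service", "Gateway", 50, 60),
    Span("User Service", "Gateway", 60, 260),
    Span("DB", "User Service", 100, 250),
    Span("Cache", "User Service", 100, 102),  # runs in parallel with DB
    Span("Payment Service", "Gateway", 260, 290),
]

times = self_times(TRACE)
bottleneck = max(times, key=times.get)
print(bottleneck, times[bottleneck])
```

Ranking by self time rather than raw duration avoids blaming User Service for the 150ms it spends waiting on the DB query.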
Implement trace context propagation from day one — retrofitting it is painful.
Use head-based sampling (decide at ingress) for predictable costs, and tail-based sampling (decide after the trace completes) to capture interesting traces (errors, high latency).
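
The two sampling strategies can be sketched side by side. This is a simplified illustration, not a production sampler: the function names, the span dictionary shape (`start_ms`, `end_ms`, `error`) and the 500ms latency threshold are all assumptions.

```python
import hashlib


def head_sample(trace_id: str, rate: float) -> bool:
    """Head-based: decide at ingress, before any spans exist.

    Hashing the trace ID maps it to a bucket in [0, 1), so the same
    trace ID always gets the same decision.
    """
    digest = hashlib.sha256(trace_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000
    return bucket < rate


def tail_sample(spans: list, latency_threshold_ms: int = 500) -> bool:
    """Tail-based: decide after the trace completes.

    Keeps traces that contain an error or exceed the (hypothetical)
    latency threshold.
    """
    has_error = any(s.get("error", False) for s in spans)
    total_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    return has_error or total_ms >= latency_threshold_ms
```

With `rate=0.05`, `head_sample` keeps roughly 5% of traces at a fixed cost, but it cannot know which ones will turn out slow. `tail_sample` can, at the price of buffering every span until the trace finishes.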