Networking Essentials
🔴 P0: foundation for reasoning about inter-service communication, latency, and failure modes
Distributed systems communicate over networks. Understanding how networks behave — and fail — at each layer informs every protocol choice, load balancing decision, and failure-handling strategy in a system design.
- INTERVIEW: Networking is a stronger focus for infra and distributed systems interviews than for product design interviews. For product design, a brief mention is enough; for infra, expect deep dives.
The Layers That Matter #
Of the OSI model, three layers appear repeatedly in system design:
| Layer | Name | Concern | Key Protocols |
|---|---|---|---|
| L3 | Network | Routing and addressing; best-effort delivery | IP, ICMP |
| L4 | Transport | End-to-end reliability, ordering, flow control | TCP, UDP, QUIC |
| L7 | Application | Application-level semantics and data formatting | HTTP, WebSocket, DNS, gRPC |
- INSIGHT: The higher in the stack a component operates, the more parsing and processing it does per request, and the more latency it adds. This is why the choice of L4 vs L7 load balancing matters; see Load Balancing Patterns.
- FLEX: L3 isn’t limited to IP. InfiniBand operates at the network layer and is used for massive ML training clusters where IP’s overhead is unacceptable.
Journey of a Web Request #
A single browser request traverses every layer:
1. URL entered in browser
2. DNS resolution → IP address
3. TCP handshake (3-way: SYN → SYN-ACK → ACK)
4. TLS handshake (if HTTPS)
5. HTTP request transmitted
6. Server processes request (typically the dominant latency source)
7. HTTP response received
8. TCP teardown (4-way: FIN → ACK → FIN → ACK)
Observations #
- There is far more than a single request-response: DNS lookup, TCP handshake, TLS negotiation all add latency before the application even sees the request.
- Stateful connections must be kept alive or multiplexed; otherwise every request pays the handshake costs again. Two mechanisms:
  - HTTP keep-alive: reuse the same TCP connection for multiple requests
  - HTTP/2 multiplexing: a single connection carries concurrent streams, but it still suffers from TCP-level head-of-line blocking, which HTTP/3 over QUIC avoids
- MISCONCEPTION: “HTTPS means the request is trustworthy.” HTTPS encrypts traffic in transit and authenticates the server to the client, but it says nothing about who sent the request or whether its contents are honest. The server must still validate request bodies: a client can inject any user identifier it likes into the payload, so identity should come from the authenticated session rather than from fields the client supplies.
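To put rough numbers on the handshake overhead noted in the observations above, here is a minimal sketch that counts round trips per request. The function names and the per-phase round-trip counts (one for the TCP handshake, one for a full TLS 1.3 handshake, one for the HTTP exchange) are illustrative assumptions, not measurements. DNS is assumed to be cached and server processing time is ignored.

```python
def request_latency_ms(rtt_ms: float, *, new_connection: bool, https: bool = True) -> float:
    """Network-only latency of one HTTP request, in milliseconds.

    Assumed round trips (illustrative): TCP handshake = 1,
    TLS 1.3 full handshake = 1, HTTP request/response = 1.
    """
    round_trips = 1  # the HTTP request/response itself
    if new_connection:
        round_trips += 1  # TCP 3-way handshake
        if https:
            round_trips += 1  # TLS 1.3 handshake
    return round_trips * rtt_ms


def total_latency_ms(n_requests: int, rtt_ms: float, *, keep_alive: bool) -> float:
    """Sequential requests to one host, with or without connection reuse."""
    total = 0.0
    for i in range(n_requests):
        # With keep-alive, only the first request pays for the handshakes.
        new_connection = (i == 0) or not keep_alive
        total += request_latency_ms(rtt_ms, new_connection=new_connection)
    return total


if __name__ == "__main__":
    # 10 requests at 50 ms RTT
    print(total_latency_ms(10, 50.0, keep_alive=False))  # 10 * 3 RTT = 1500.0
    print(total_latency_ms(10, 50.0, keep_alive=True))   # 3 RTT + 9 * 1 RTT = 600.0
```

Under these assumptions, connection reuse cuts the network cost of ten sequential requests from 30 round trips to 12.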
Transport Layer: TCP vs UDP #
| Feature | UDP | TCP |
|---|---|---|
| Connection | Connectionless | Connection-oriented |
| Reliability | Best-effort | Reliable (acknowledgements, retransmission) |
| Ordering | No guarantee | Maintains order |
| Flow control | No | Yes |
| Header size | 8 bytes | 20-60 bytes |
| Speed | Faster (less overhead) | Slower (guarantees) |
- RULE OF THUMB: Default to TCP. Only consider UDP when the case is strong: latency-critical traffic (gaming, real-time), data that tolerates loss (media streaming), or high-volume telemetry where loss is acceptable. The platform must also allow it: browsers cannot open raw UDP sockets, so browser clients reach UDP only through WebRTC or HTTP/3.
- INSIGHT: QUIC (the transport under HTTP/3) provides TCP-like reliability over UDP, with per-stream flow control and 0-1 RTT connection setup. It’s increasingly relevant but still emerging for backend use.
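The connection-oriented vs connectionless distinction in the table above shows up directly in the socket API. The following is a minimal loopback sketch using Python's standard `socket` module; the function names are hypothetical, and it runs both ends in one process purely for illustration (it assumes small payloads that fit in the kernel socket buffers).

```python
import socket


def udp_send_once(payload: bytes) -> bytes:
    """UDP: no handshake. The sender just addresses a datagram and sends it."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))  # port 0: let the OS pick a free port
        receiver.settimeout(2.0)         # a lost datagram would otherwise block forever
        sender.sendto(payload, receiver.getsockname())  # fire and forget
        data, _addr = receiver.recvfrom(65535)
        return data


def tcp_send_once(payload: bytes) -> bytes:
    """TCP: listen, connect (3-way handshake), accept, then a byte stream."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        # connect() returns once the kernel has completed the handshake.
        with socket.create_connection(listener.getsockname(), timeout=2.0) as client:
            conn, _addr = listener.accept()
            with conn:
                conn.settimeout(2.0)
                client.sendall(payload)
                client.shutdown(socket.SHUT_WR)  # signal end of stream (FIN)
                chunks = []
                # TCP is a stream: read until the peer closes its side.
                while True:
                    chunk = conn.recv(4096)
                    if not chunk:
                        break
                    chunks.append(chunk)
                return b"".join(chunks)


if __name__ == "__main__":
    print(udp_send_once(b"ping"))
    print(tcp_send_once(b"hello"))
```

Note that the UDP path needs a timeout because nothing retransmits a lost datagram, while the TCP path needs a read loop because message boundaries are not preserved.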
Application Layer Protocols #
These protocols build on L4 and define how applications communicate. See Protocol Choices for when to use REST vs gRPC vs GraphQL, and Transport Protocols for HTTP version and real-time mechanism comparisons.
Key points for system design #
- RULE OF THUMB: REST for public-facing APIs, gRPC for inter-service communication. This is the default unless requirements dictate otherwise.
- CONVENTION: gRPC-Web needs a proxy (e.g. Envoy, gRPC-Gateway) for browser interop because native gRPC relies on HTTP/2 features, such as trailers, that browser APIs don’t expose. See Web-gRPC Proxy.
- PITFALL: gRPC-Web (the browser-compatible variant) does NOT support bidirectional streaming — only unary and server-streaming. For true bidirectional browser communication, WebSocket remains the standard.
Stateful Connection Protocols (SSE, WebSocket) #
Both SSE and WebSocket maintain persistent, stateful connections. This has architectural implications:
- CHALLENGE: Stateful connections need explicit handling of failure cases: what happens when the connection drops? How does the client reconnect? What state is lost?
- CHALLENGE: A stateless load balancer cannot be placed in front of stateful connections without care — see Load Balancing Patterns for L4 vs L7 considerations.
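- MISCONCEPTION: “You can’t have more than 6 concurrent SSE connections.” This is an HTTP/1.1 browser limitation (6 connections per domain), not an SSE limitation. Over HTTP/2, SSE streams are multiplexed on a single connection, so the ceiling becomes the negotiated maximum number of concurrent streams (100 by default in many implementations).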
- EXP: I hit a routing bug with stateful WebSocket connections on Fly.io’s anycast infrastructure and had to hack the routing to make connection persistence work correctly.
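For the "how does the client reconnect?" question above, one common approach (not prescribed by these notes) is exponential backoff with jitter, so that many clients dropped at the same moment do not all reconnect in the same instant. A minimal sketch; the function name, defaults, and the "full jitter" variant are assumptions chosen for illustration.

```python
import random
from typing import List, Optional


def reconnect_delays(attempts: int, base_s: float = 0.5, cap_s: float = 30.0,
                     rng: Optional[random.Random] = None) -> List[float]:
    """Delays (in seconds) to wait before each reconnect attempt.

    The ceiling doubles per attempt up to cap_s, and the actual delay is
    drawn uniformly from [0, ceiling] ("full jitter").
    """
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng.uniform(0.0, ceiling))
    return delays


if __name__ == "__main__":
    for i, delay in enumerate(reconnect_delays(6)):
        print(f"attempt {i}: wait {delay:.2f}s")
```

Backoff only covers when to reconnect. What state is lost, and how the client resumes (for example by sending the last event ID it saw), still has to be designed per protocol.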
WebRTC: P2P Communication #
WebRTC enables direct peer-to-peer communication using UDP as transport, without an intermediary server for the data channel itself (though signalling still requires a centralised server).
- Infrastructure requirements: STUN servers for NAT traversal (hole-punching to get routable addresses), TURN relay servers as fallback when direct connection fails.
- CHALLENGE: Correct implementation is very difficult and still prone to connection losses. Only reach for this when P2P is genuinely needed (AV conferencing, collaborative editing at scale).
- INSIGHT: For collaborative documents at very high participant counts, consider CRDTs as the conflict-resolution foundation rather than centralised coordination via WebRTC.
Load Balancing Fundamentals #
Two architectural choices: client-side vs server-side (dedicated) load balancing, and L4 vs L7 operation. See Load Balancing Patterns for algorithms and pattern-level guidance.
Client-side LB #
The client makes the routing decision, typically by querying a service registry for available instances.
- INTERVIEW: Client-side LB is great for internal microservices (built into gRPC). For all other cases, use a dedicated LB.
- RULE OF THUMB:
  - Few clients under your control → client-side LB can work (e.g. Redis Cluster client, gRPC)
  - Many clients AND can tolerate update delays → client-side LB can work (e.g. DNS)
  - Many clients AND need fast updates → dedicated LB
Redis Cluster example #
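Cluster nodes use a gossip protocol for membership. Clients fetch the cluster topology from any node, then hash keys locally (CRC16 of the key modulo 16384 hash slots) to determine the destination shard. A MOVED response indicates a stale mapping: the client updates its slot map and retries against the node named in the response. This is client-side LB at the application level.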
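The local hashing step can be sketched as follows. The slot computation (CRC16/XMODEM of the key, modulo 16384, with `{...}` hash tags) follows the Redis Cluster specification; the `route` helper and the slot-range layout are hypothetical, and a real client would also handle MOVED redirects and topology refresh.

```python
from typing import List, Tuple

HASH_SLOTS = 16384


def crc16_xmodem(data: bytes) -> int:
    """CRC16 with polynomial 0x1021, initial value 0, no reflection."""
    crc = 0
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            if crc & 0x8000:
                crc = ((crc << 1) ^ 0x1021) & 0xFFFF
            else:
                crc = (crc << 1) & 0xFFFF
    return crc


def key_hash_slot(key: str) -> int:
    """Map a key to one of 16384 slots, honouring {hash tags}."""
    raw = key.encode("utf-8")
    start = raw.find(b"{")
    if start != -1:
        end = raw.find(b"}", start + 1)
        if end != -1 and end != start + 1:  # only a non-empty tag counts
            raw = raw[start + 1:end]
    return crc16_xmodem(raw) % HASH_SLOTS


def route(key: str, slot_ranges: List[Tuple[int, int, str]]) -> str:
    """Hypothetical helper: pick the node owning the key's slot from a cached map."""
    slot = key_hash_slot(key)
    for first, last, node in slot_ranges:
        if first <= slot <= last:
            return node
    raise LookupError(f"no node owns slot {slot}; refresh the topology")


if __name__ == "__main__":
    ranges = [(0, 5460, "node-a"), (5461, 10922, "node-b"), (10923, 16383, "node-c")]
    print(key_hash_slot("foo"), route("foo", ranges))  # 12182 node-c
```

Keys that share a hash tag, such as `{user1000}.following` and `{user1000}.followers`, land in the same slot and therefore on the same node.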
DNS example #
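DNS servers can return the IP list in rotated order, so different clients naturally hit different servers. This keeps a dedicated load balancer, and its single point of failure, out of the request path and provides rudimentary load distribution. The cost is slow updates: resolvers and clients cache answers until the TTL expires, so a removed server can keep receiving traffic for a while.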
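The rotation idea can be illustrated with a toy model. This is a simulation only, not a DNS implementation: the class name is hypothetical, the addresses are made up, and it assumes each client connects to the first address in the answer.

```python
from collections import deque
from typing import Deque, List


class RoundRobinDNS:
    """Toy name server that rotates its address list on every query."""

    def __init__(self, addresses: List[str]) -> None:
        self._records: Deque[str] = deque(addresses)

    def resolve(self) -> List[str]:
        answer = list(self._records)
        self._records.rotate(-1)  # the next query sees the list shifted by one
        return answer


if __name__ == "__main__":
    dns = RoundRobinDNS(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
    for client in range(4):
        # Each client is assumed to connect to the first address it receives.
        print(f"client {client} -> {dns.resolve()[0]}")
```

Nothing in this model removes a dead server from the list, which is exactly the update-delay trade-off of DNS-based balancing.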
Dedicated (Server-side) LB #
A server or appliance between client and backends. Adds a network hop but provides fast server-list updates and fine-grained routing control.
L4 vs L7 #
| Dimension | L4 (Transport) | L7 (Application) |
|---|---|---|
| Routing basis | IP addresses, ports | Request content (URL, headers, cookies) |
| Connection model | Binds a connection to a backend | Binds each request to a backend |
| Persistent conns | Natural (same TCP session → same server) | Requires explicit WebSocket/SSE support |
| Performance | Fast (minimal inspection) | More CPU (parses each request) |
| Flexibility | Limited (blind to content) | Rich (path-based routing, SSL termination) |
- RULE OF THUMB: Persistent connections (WebSocket, gRPC streams)? L4 LB. Everything else? L7 LB for the routing flexibility.
- MISCONCEPTION: “WebSocket requires L4 LB.” Modern L7 LBs (NGINX, HAProxy, ALB) handle WebSocket upgrades correctly. The point is that L4 naturally preserves TCP connections while L7 needs explicit protocol support.
- INTERVIEW: Choosing L4 vs L7 comes up in system design interviews for real-time features. Pair L4 LB with WebSocket usage; use L7 for HTTP-based solutions.
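The difference in routing basis can be sketched with two small functions. This is an illustrative model only: the function names, the 5-tuple hashing, and the prefix-based route table are assumptions, not how any particular load balancer is implemented.

```python
import hashlib
from typing import Dict, List, Tuple

# src_ip, src_port, dst_ip, dst_port, protocol
FiveTuple = Tuple[str, int, str, int, str]


def l4_pick_backend(conn: FiveTuple, backends: List[str]) -> str:
    """L4: decide once per connection, using only addresses and ports."""
    digest = hashlib.sha256(repr(conn).encode("utf-8")).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def l7_pick_backend(path: str, routes: Dict[str, List[str]],
                    default_pool: List[str], request_no: int) -> str:
    """L7: decide per request, using request content (here, the URL path)."""
    pool = default_pool
    # Longest matching prefix wins.
    for prefix in sorted(routes, key=len, reverse=True):
        if path.startswith(prefix):
            pool = routes[prefix]
            break
    return pool[request_no % len(pool)]  # round-robin within the chosen pool


if __name__ == "__main__":
    conn = ("203.0.113.7", 51000, "198.51.100.1", 443, "tcp")
    print(l4_pick_backend(conn, ["b1", "b2", "b3"]))  # same answer for the connection's lifetime
    routes = {"/api/": ["api-1", "api-2"], "/static/": ["static-1"]}
    print(l7_pick_backend("/api/users", routes, ["web-1"], 0))  # api-1
    print(l7_pick_backend("/api/users", routes, ["web-1"], 1))  # api-2
```

The L4 function never sees the path, which is why it is blind to content but naturally keeps a long-lived connection on one backend.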
Health Checks #
LBs monitor backend health via periodic checks (TCP-level at L4, HTTP endpoint at L7). Unhealthy backends are removed from the pool until recovery. This makes LBs useful not just for load distribution but for high availability.
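A minimal sketch of the pool-management logic described above. The class name, the injected `probe` callable, and the consecutive-result thresholds are assumptions for illustration; real load balancers expose similar knobs under their own names.

```python
from typing import Callable, Dict, List


class HealthCheckedPool:
    """Tracks which backends may receive traffic, based on periodic probes."""

    def __init__(self, backends: List[str], probe: Callable[[str], bool],
                 unhealthy_after: int = 3, healthy_after: int = 2) -> None:
        self._probe = probe  # e.g. a TCP connect (L4) or an HTTP GET (L7)
        self._unhealthy_after = unhealthy_after
        self._healthy_after = healthy_after
        self._healthy: Dict[str, bool] = {b: True for b in backends}
        # Consecutive probe results that contradict the current state.
        self._streak: Dict[str, int] = {b: 0 for b in backends}

    def run_checks(self) -> None:
        """One round of probes; call this on a timer."""
        for backend, is_healthy in list(self._healthy.items()):
            ok = self._probe(backend)
            if ok == is_healthy:
                self._streak[backend] = 0
                continue
            self._streak[backend] += 1
            threshold = self._healthy_after if ok else self._unhealthy_after
            if self._streak[backend] >= threshold:
                self._healthy[backend] = ok  # flip state only after a full streak
                self._streak[backend] = 0

    def available(self) -> List[str]:
        return [b for b, healthy in self._healthy.items() if healthy]


if __name__ == "__main__":
    status = {"a": True, "b": False}
    pool = HealthCheckedPool(["a", "b"], probe=lambda b: status[b])
    for _ in range(3):
        pool.run_checks()
    print(pool.available())  # ['a']
```

Requiring several consecutive results before changing state avoids flapping when a single probe is lost.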
Geo-distribution and Latency #
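- RULE OF THUMB: Physical distance adds latency that no software can eliminate. NYC to London: signal propagation alone costs about 56 ms per round trip (roughly 5,570 km each way at 2/3 the speed of light in fibre), and real-world round trips are higher, often quoted as >80 ms.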
- CDNs form the edge layer for cacheable data, minimising latency and reducing backend load. See Cache Patterns.
- Regional partitioning works when data has natural geographic locality (e.g. ride-sharing: regional servers answer regional requests).
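The propagation floor in the rule of thumb above is simple arithmetic. A minimal sketch, assuming a great-circle distance of roughly 5,570 km between NYC and London and a signal speed of 2/3 the speed of light; real cable routes are longer and add router and queueing delay on top.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458
FIBRE_VELOCITY_FACTOR = 2 / 3   # assumed signal speed in fibre relative to c
NYC_LONDON_KM = 5_570           # approximate great-circle distance (assumption)


def propagation_rtt_ms(distance_km: float,
                       velocity_factor: float = FIBRE_VELOCITY_FACTOR) -> float:
    """Round-trip propagation delay in milliseconds, ignoring all processing."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * velocity_factor)
    return 2 * one_way_s * 1000


if __name__ == "__main__":
    print(f"{propagation_rtt_ms(NYC_LONDON_KM):.1f} ms")  # about 56 ms
```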
References #
DDIA 2e Reference #
- Chapter 8: The Trouble with Distributed Systems (network faults, unreliable clocks)
- Chapter 1: Reliability, scalability, maintainability framing