Networking Essentials

🔴 P0 — foundation for reasoning about inter-service communication, latency, and failure modes

AI usage disclosure Claude Opus 4.6 · content-consolidation

Structured and consolidated from personal study notes using Claude for semantic comparison, merging of overlapping content, accuracy review, and editorial polish. All technical content originates from personal notes, DDIA 2e, and HelloInterview references.

Distributed systems communicate over networks. Understanding how networks behave — and fail — at each layer informs every protocol choice, load balancing decision, and failure-handling strategy in a system design.

INTERVIEW: Networking is a stronger focus for infra and distributed systems interviews than for product design interviews. For product design, a brief mention is usually enough; for infra, expect deep-dives.

The Layers That Matter

Of the OSI model, three layers appear repeatedly in system design:

Layer	Name	Concern	Key Protocols
L3	Network	Routing and addressing; best-effort delivery	IP, ICMP
L4	Transport	End-to-end reliability, ordering, flow control	TCP, UDP, QUIC
L7	Application	Application-level semantics and data formatting	HTTP, WebSocket, DNS, gRPC

INSIGHT: The higher in the stack a component operates, the more parsing and processing each request needs, and so the more latency it adds. This is why the choice of L4 vs L7 load balancing matters; see Load Balancing Patterns.
FLEX: L3 isn’t limited to IP. InfiniBand operates at the network layer and is used for massive ML training clusters where IP’s overhead is unacceptable.

Journey of a Web Request

A single browser request traverses every layer:

1. URL entered in browser
2. DNS resolution → IP address
3. TCP handshake (3-way: SYN → SYN-ACK → ACK)
4. TLS handshake (if HTTPS)
5. HTTP request transmitted
6. Server processes request (typically the dominant latency source)
7. HTTP response received
8. TCP teardown (4-way: FIN → ACK → FIN → ACK)
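The steps above can be turned into a rough latency budget. This is a back-of-envelope sketch rather than a measurement: the function name and its parameters are hypothetical, it assumes a full TLS 1.3 handshake (one round trip), takes the DNS cost as an input, and ignores teardown because teardown does not delay the response.

```python
def request_latency_ms(rtt_ms, server_ms, *, new_connection=True, tls=True, dns_ms=0.0):
    """Estimate time to first complete response, in milliseconds.

    Assumptions (not from the notes): TLS 1.3 full handshake = 1 RTT,
    the request and response each fit in one flight, no packet loss.
    """
    round_trips = 1  # step 5 + 7: HTTP request out, response back
    if new_connection:
        round_trips += 1  # step 3: TCP handshake costs one RTT before data flows
        if tls:
            round_trips += 1  # step 4: TLS 1.3 handshake (TLS 1.2 would need 2)
    return dns_ms + round_trips * rtt_ms + server_ms


cold = request_latency_ms(rtt_ms=50, server_ms=20)
warm = request_latency_ms(rtt_ms=50, server_ms=20, new_connection=False)
print(cold, warm)
```

Under these assumptions, a cold HTTPS request at 50 ms RTT with 20 ms of server time costs about 170 ms, while the same request on an already-open connection costs about 70 ms.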

Observations

A page load involves far more than a single request-response: the DNS lookup, TCP handshake, and TLS negotiation all add latency before the application even sees the request.
Connections must be kept alive or multiplexed; otherwise every request pays the handshake costs again. Two mechanisms:
- HTTP keep-alive: reuse the same TCP connection for multiple requests
- HTTP/2 multiplexing: single connection, concurrent streams (but still suffers from TCP-level head-of-line blocking — unlike HTTP/3/QUIC which solves this)
MISCONCEPTION: “HTTPS means the request is trustworthy.” HTTPS encrypts data in transit and authenticates the server, but it provides no guarantee about the origin of the request or the honesty of its contents. The server must still validate request bodies: a client can place any user identifier it likes in a request, so identity must come from server-verified credentials, never from client-supplied fields.
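The misconception above comes down to not trusting identifiers that arrive in the request body. A minimal sketch of that check follows, assuming an HMAC-signed session token; the token format, the secret, and every function name here are hypothetical and chosen only for illustration.

```python
import hashlib
import hmac
from typing import Optional

SECRET = b"demo-secret"  # hypothetical; a real service would load this from config


def _mac(user_id: str) -> str:
    return hmac.new(SECRET, user_id.encode(), hashlib.sha256).hexdigest()


def issue_token(user_id: str) -> str:
    """Server-side: bind the user id to a signature the client cannot forge."""
    return f"{user_id}.{_mac(user_id)}"


def authenticated_user(token: str) -> Optional[str]:
    """Return the user id only if the signature verifies."""
    user_id, _, mac = token.rpartition(".")
    if not user_id:
        return None
    ok = hmac.compare_digest(mac.encode(), _mac(user_id).encode())
    return user_id if ok else None


def handle_update(token: str, body: dict) -> int:
    """Return an HTTP-style status code for a profile update request."""
    user = authenticated_user(token)
    if user is None:
        return 401  # no verifiable identity, regardless of HTTPS
    if body.get("user_id", user) != user:
        return 403  # body claims to act on behalf of someone else
    return 200


token = issue_token("alice")
print(handle_update(token, {"user_id": "alice"}))  # own account
print(handle_update(token, {"user_id": "bob"}))    # spoofed identifier
```

The design point is that the identity used for authorisation comes from the verified token; the `user_id` field in the body is treated as an untrusted claim to be checked against it.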

Transport Layer: TCP vs UDP

Feature	UDP	TCP
Connection	Connectionless	Connection-oriented
Reliability	Best-effort	Reliable (acknowledgements and retransmission)
Ordering	No guarantee	Maintains order
Flow control	No	Yes
Header size	8 bytes	20-60 bytes
Speed	Faster (less overhead)	Slower (guarantees)
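The "Connection" row of the table can be seen directly in the socket API. The sketch below runs both transports over loopback; the function names are hypothetical, and loopback hides the loss and reordering that UDP allows on a real network.

```python
import socket


def udp_send_once(payload: bytes) -> bytes:
    """Send one datagram over loopback; no handshake precedes the data."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
            socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
        receiver.settimeout(2.0)         # a lost datagram would otherwise block forever
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(2048)
        return data


def tcp_send_once(payload: bytes) -> bytes:
    """Send bytes over loopback TCP; connect() completes the handshake first."""
    with socket.create_server(("127.0.0.1", 0)) as listener:
        with socket.create_connection(listener.getsockname(), timeout=2.0) as client:
            conn, _addr = listener.accept()
            with conn:
                conn.settimeout(2.0)
                client.sendall(payload)
                client.shutdown(socket.SHUT_WR)  # signal end of stream to the reader
                chunks = []
                while chunk := conn.recv(4096):  # TCP is a byte stream, so read until EOF
                    chunks.append(chunk)
                return b"".join(chunks)


print(udp_send_once(b"ping"), tcp_send_once(b"ping"))
```

Note the asymmetry: the UDP sender needs only an address, while the TCP client must establish a connection before any payload moves, and the receiver reads a stream rather than discrete messages.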

RULE OF THUMB: Default to TCP. Only consider UDP when the case is strong: latency-critical traffic (gaming, real-time), workloads that tolerate data loss (media streaming), or high-volume telemetry where loss is acceptable. Platform constraints also apply: browsers cannot open raw UDP sockets, so plain UDP is only an option when the clients are not browsers.
INSIGHT: QUIC (the transport under HTTP/3) provides TCP-like reliability over UDP, with per-stream flow control and connection setup in 1 RTT (0 RTT when resuming with a known server). It’s increasingly relevant but still emerging for backend use.

Application Layer Protocols

These protocols build on L4 and define how applications communicate. See Protocol Choices for when to use REST vs gRPC vs GraphQL, and Transport Protocols for HTTP version and real-time mechanism comparisons.

Key points for system design

RULE OF THUMB: REST for public-facing APIs, gRPC for inter-service communication. This is the default unless requirements dictate otherwise.
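CONVENTION: gRPC-Web needs a proxy (e.g. Envoy) for browser interop because native gRPC relies on HTTP/2 features, such as trailers and fine-grained frame control, that browser APIs do not expose. gRPC-Gateway is a related but different option: it exposes a REST/JSON facade in front of gRPC services. See Web-gRPC Proxy.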
PITFALL: gRPC-Web (the browser-compatible variant) does NOT support bidirectional streaming — only unary and server-streaming. For true bidirectional browser communication, WebSocket remains the standard.

Stateful Connection Protocols (SSE, WebSocket)

Both SSE and WebSocket maintain persistent, stateful connections. This has architectural implications:

CHALLENGE: Stateful connections need explicit handling of failure cases: what happens when the connection drops? How does the client reconnect? What state is lost?
CHALLENGE: A stateless load balancer cannot be placed in front of stateful connections without care — see Load Balancing Patterns for L4 vs L7 considerations.
MISCONCEPTION: “You can’t have more than 6 concurrent SSE connections.” This is an HTTP/1.1 browser limitation (6 connections per domain), not an SSE limitation. HTTP/2 multiplexing removes this constraint.
EXP: I hit a routing bug with stateful WebSocket connections on Fly.io’s anycast infrastructure and had to hack the routing to make connection persistence work correctly.
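For the "how does the client reconnect?" question above, one common answer is exponential backoff with jitter, so that many clients dropped at once do not reconnect in lockstep. This is a sketch of that general technique, not something the notes prescribe; the function names, default delays, and the injected `connect`/`sleep` callables are all hypothetical.

```python
import random
import time


def backoff_delays(attempts, base_s=0.5, cap_s=30.0, rng=None):
    """Full-jitter backoff: each delay is uniform in [0, min(cap, base * 2^n)]."""
    rng = rng or random.Random()
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap_s, base_s * (2 ** attempt))
        delays.append(rng.uniform(0, ceiling))
    return delays


def reconnect(connect, attempts=5, sleep=time.sleep, rng=None):
    """Call connect() until it succeeds, sleeping a jittered delay after each failure."""
    for delay in backoff_delays(attempts, rng=rng):
        try:
            return connect()
        except ConnectionError:
            sleep(delay)
    raise ConnectionError(f"gave up after {attempts} attempts")


# Usage with a fake transport that fails twice before succeeding.
outcomes = iter([ConnectionError(), ConnectionError(), "connected"])

def fake_connect():
    result = next(outcomes)
    if isinstance(result, Exception):
        raise result
    return result

print(reconnect(fake_connect, sleep=lambda s: None))
```

Backoff only addresses when to reconnect. What state is lost, and how to resume from it (for example a last-seen event cursor), still has to be designed per protocol.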

WebRTC: P2P Communication

WebRTC enables direct peer-to-peer communication using UDP as transport, without an intermediary server for the data channel itself (though signalling still requires a centralised server).

Infrastructure requirements: STUN servers for NAT traversal (hole-punching to get routable addresses), TURN relay servers as fallback when direct connection fails.
CHALLENGE: Correct implementation is very difficult and still prone to connection losses. Only reach for this when P2P is genuinely needed (AV conferencing, collaborative editing at scale).
INSIGHT: For collaborative documents at very high participant counts, consider CRDTs as the conflict-resolution foundation rather than centralised coordination via WebRTC.

Load Balancing Fundamentals

Two architectural choices: client-side vs server-side (dedicated) load balancing, and L4 vs L7 operation. See Load Balancing Patterns for algorithms and pattern-level guidance.

Client-side LB

The client makes the routing decision, typically by querying a service registry for available instances.

INTERVIEW: Client-side LB is great for internal microservices (built into gRPC). For all other cases, use a dedicated LB.
RULE OF THUMB:
- Few clients under your control → client-side LB can work (e.g. Redis Cluster client, gRPC)
- Many clients AND can tolerate update delays → client-side LB can work (e.g. DNS)
- Many clients AND need fast updates → dedicated LB

Redis Cluster example

Cluster nodes use a gossip protocol for membership. Clients fetch cluster topology from any node, then hash keys locally to determine the destination shard. A MOVED response indicates a stale mapping. This is client-side LB at the application level.
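The local hashing step can be sketched as follows. Redis Cluster maps a key to one of 16384 slots using CRC16 (the XMODEM variant, which Python exposes as `binascii.crc_hqx` with an initial value of 0), hashing only the `{tag}` portion when a non-empty hash tag is present. The `ClusterClient` class and node names are hypothetical, and the sketch patches a single slot on MOVED, whereas real clients usually refresh more of the topology.

```python
import binascii

SLOTS = 16384


def key_slot(key: bytes) -> int:
    """Return the Redis Cluster hash slot for a key."""
    start = key.find(b"{")
    if start != -1:
        end = key.find(b"}", start + 1)
        if end > start + 1:  # only a non-empty {tag} narrows what is hashed
            key = key[start + 1:end]
    return binascii.crc_hqx(key, 0) % SLOTS


class ClusterClient:
    """Hypothetical client holding a local slot -> node map."""

    def __init__(self, slot_owner):
        self.slot_owner = dict(slot_owner)

    def route(self, key: bytes) -> str:
        # The routing decision is made on the client, with no LB in the path.
        return self.slot_owner[key_slot(key)]

    def on_moved(self, slot: int, node: str) -> None:
        # A MOVED reply means the local map was stale; correct it and retry.
        self.slot_owner[slot] = node


# Two-node topology: lower half of the slots on node-a, upper half on node-b.
topology = {s: ("node-a" if s < SLOTS // 2 else "node-b") for s in range(SLOTS)}
client = ClusterClient(topology)
print(client.route(b"{user1}:profile"), client.route(b"{user1}:cart"))
```

Keys that share a hash tag land in the same slot, which is how multi-key operations are kept on one shard.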

DNS example

DNS servers can rotate the order of the IP list they return, so successive clients tend to hit different servers. This provides rudimentary load distribution and avoids depending on a single address, but cached records mean that changes to the server list propagate slowly.
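The rotation described above can be simulated in a few lines. This is a toy model, not a DNS implementation: the `RoundRobinZone` class and the addresses are hypothetical, and it ignores TTLs and resolver caching.

```python
from collections import deque


class RoundRobinZone:
    """Toy name server that rotates its answer list on every query."""

    def __init__(self, addresses):
        self._addresses = deque(addresses)

    def resolve(self):
        answer = list(self._addresses)
        self._addresses.rotate(-1)  # the next query sees a different first entry
        return answer


zone = RoundRobinZone(["10.0.0.1", "10.0.0.2", "10.0.0.3"])
# Clients commonly try the first address in the answer.
first_choices = [zone.resolve()[0] for _ in range(6)]
print(first_choices)
```

Each client still chooses its own target from the list it receives, which is why this counts as client-side load balancing.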

Dedicated (Server-side) LB

A server or appliance between client and backends. Adds a network hop but provides fast server-list updates and fine-grained routing control.

L4 vs L7

Dimension	L4 (Transport)	L7 (Application)
Routing basis	IP addresses, ports	Request content (URL, headers, cookies)
Connection model	Binds a connection to a backend	Binds each request to a backend
Persistent conns	Natural (same TCP session → same server)	Requires explicit WebSocket/SSE support
Performance	Fast (minimal inspection)	More CPU (parses each request)
Flexibility	Limited (blind to content)	Rich (path-based routing, SSL termination)
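The "Routing basis" and "Connection model" rows can be illustrated with two toy pickers. This is a sketch under assumptions: the backend names, the route table, and the choice of hashing the connection tuple at L4 and round-robin within a pool at L7 are all hypothetical, and real load balancers offer many other algorithms.

```python
import hashlib

BACKENDS = ["app-1", "app-2", "app-3"]
ROUTES = [("/api/", ["api-1", "api-2"]), ("/static/", ["cdn-origin"])]


def l4_pick(src_ip, src_port, dst_ip, dst_port, backends=BACKENDS):
    """L4: decide once per connection, using only addresses and ports."""
    flow = f"{src_ip}:{src_port}->{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha256(flow).digest()
    return backends[int.from_bytes(digest[:8], "big") % len(backends)]


def l7_pick(path, request_number, routes=ROUTES, default=BACKENDS):
    """L7: decide per request, after reading the request line."""
    for prefix, pool in routes:
        if path.startswith(prefix):
            return pool[request_number % len(pool)]  # round-robin within the pool
    return default[request_number % len(default)]


# Every packet of one TCP connection maps to the same backend at L4.
print(l4_pick("203.0.113.7", 51000, "198.51.100.1", 443))
# Two requests on the same connection can go to different backends at L7.
print(l7_pick("/api/users", 0), l7_pick("/api/users", 1), l7_pick("/static/app.js", 2))
```

The L4 picker never sees the path, which is what "blind to content" means in the table; the L7 picker can route on it but has to parse every request to do so.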

RULE OF THUMB: Persistent connections (WebSocket, gRPC streams)? L4 LB. Everything else? L7 LB for the routing flexibility.
MISCONCEPTION: “WebSocket requires L4 LB.” Modern L7 LBs (NGINX, HAProxy, ALB) handle WebSocket upgrades correctly. The point is that L4 naturally preserves TCP connections while L7 needs explicit protocol support.
INTERVIEW: Choosing L4 vs L7 comes up in system design interviews for real-time features. Pair L4 LB with WebSocket usage; use L7 for HTTP-based solutions.

Health Checks

LBs monitor backend health via periodic checks (TCP-level at L4, HTTP endpoint at L7). Unhealthy backends are removed from the pool until recovery. This makes LBs useful not just for load distribution but for high availability.
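The removal-and-recovery behaviour above is often implemented with consecutive-result thresholds, so that one dropped probe does not evict a backend. The sketch below assumes that approach; the `BackendHealth` class and the threshold values are hypothetical, and the probe itself (a TCP connect or HTTP GET) is left to the caller.

```python
class BackendHealth:
    """Track one backend's health from a stream of probe results."""

    def __init__(self, unhealthy_after=3, healthy_after=2):
        self.unhealthy_after = unhealthy_after  # consecutive failures before removal
        self.healthy_after = healthy_after      # consecutive successes before return
        self.healthy = True
        self._streak = 0  # consecutive probes that contradict the current state

    def record(self, probe_ok: bool) -> bool:
        """Feed one probe result; return whether the backend is in the pool."""
        if probe_ok == self.healthy:
            self._streak = 0  # result agrees with current state, reset the streak
        else:
            self._streak += 1
            threshold = self.healthy_after if probe_ok else self.unhealthy_after
            if self._streak >= threshold:
                self.healthy = probe_ok
                self._streak = 0
        return self.healthy


backend = BackendHealth()
for result in [False, False, False, True, True]:
    print(result, "->", backend.record(result))
```

Requiring several consecutive results in each direction trades detection speed for stability; the right thresholds depend on probe interval and how costly a false eviction is.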

Geo-distribution and Latency

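RULE OF THUMB: Physical distance adds latency that no software can eliminate. NYC to London is roughly 5,570 km, so at about 2/3 the speed of light (the speed of signals in fibre) a round trip takes ≈ 56 ms for propagation alone. Real routes add routing and queuing delay on top, so budget around 80 ms.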
CDNs form the edge layer for cacheable data, which minimises latency and reduces backend load. See Cache Patterns.
Regional partitioning works when data has natural geographic locality (e.g. ride-sharing: regional servers answer regional requests).
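The propagation floor behind the latency rule of thumb is simple arithmetic. The sketch below assumes a great-circle distance of about 5,570 km between NYC and London and a signal speed of 2/3 c in fibre; the function name is hypothetical, and real cable paths are longer than the great-circle route.

```python
SPEED_OF_LIGHT_KM_S = 299_792.458


def min_rtt_ms(distance_km, fibre_factor=2 / 3):
    """Lower bound on round-trip time from propagation delay alone."""
    one_way_s = distance_km / (SPEED_OF_LIGHT_KM_S * fibre_factor)
    return 2 * one_way_s * 1000


print(round(min_rtt_ms(5570), 1))  # NYC to London, approximate distance
```

Anything a system adds on top of this floor (routing hops, queuing, handshakes, server time) is engineering overhead; the floor itself can only be lowered by moving the data closer to the user.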

References

HelloInterview: Networking Essentials

DDIA 2e Reference

Chapter 8: The Trouble with Distributed Systems (network faults, unreliable clocks)
Chapter 1: Reliability, scalability, maintainability framing