Skip to main content
  1. Posts/
  2. Series/
  3. 🐫 Songs of the OCaml Compiler/

🐫 Songs of the OCaml Compiler Part II: A Design Tour

··14186 words·67 mins

This is Part II of a series on learning OCaml by writing a Paxos simulator. We build on Part I’s abstract grammar and witness how subsystems take shape from it — guided, as it turns out, by the OCaml compiler I’m sure the magical source within the compiler is the underlying Hindley-Milner Type system which expects the programmer to exercise clearer type-discipline in exchange for superior inference capabilities that feel like an extension to the programmer’s own mind. I’ve put up a few words about what that experience has been like here <TODO: add reference to part 3 where we reflect on the experience of working with the compiler> itself.

Written primarily for a technical audience A little programming experience goes a long way. I won’t focus heavily on syntax – OCaml’s consistency makes it readable. For code-examples, I describe intent wherever possible so that unfamiliar syntax never gets in the way. Part II gets more involved in places where the type system is the point, but non-technical readers can gloss over those sections without losing the thread. , this series can be read in different orders – see the cover post for suggested reading paths. By prioritising intuition over a pedantic focus on syntax, I’m hoping that a more general audience may find this useful There are some early signs of usefulness (see Part III’s) <TODO link to appropriate part in Part 3 for this>. Please reach out for comments <TODO link to part 3 for RFC> as well.

Figure 1: πŸ™™ “Shadows of the Gobi” by Thibault Gerbaldi (profile) πŸ™›

From Forces to Forms: the Emergence of Subsystems #

Part I ended before any code was written so that when the first let statement appeared, the system’s shape was already half-decided by deliberating on the forces at play – this was by design. What follows is an account of the other half: the three places where the OCaml compiler itself influenced my design, and what I learned from listening to it.

A Quick Primer on the Paxos Protocol #

Here’s a short primer on the algorithm, using the same narrative backdrop.

Paxos is a set of rules that helps a scattered group agree on one thing even when the “world” around them is unreliable. Going back to our caravans in the desert: formations drift apart, messengers get lost, storms blow in, camels faint, and missives sometimes arrive long after they matter. Yet the group still needs to answer a simple question: “Which campsite are we heading towards?” Paxos gives them a way to do this without a chief, without a central authority, and without assuming everyone is present at the same time. Every formation is equal. The only thing they can do is send messages – and hope they arrive – while following a small set of disciplined steps.

The heart of the algorithm is surprisingly modest:

If a majority follow the same rules when responding to new information, the group will never contradict itself β€” no matter how unreliable the desert becomes.

Here’s how it plays out:

Step
1Someone wants to make a proposalA caravan β€” say, Aghilas β€” wants to suggest a campsite. Before naming it, it asks the others: “If I propose something, will you at least consider it?”. Paxos requires every proposer to begin by asking for permission.
2A majority must acknowledgeNot all caravans need to reply; some may be stranded or tending to a collapsed messenger. As long as more than half respond (i.e. a quorum is reached), the journey can continue. (Only majority agreement matters – though a majority declining can stall progress.)
3They reveal any older, more serious proposalsIf a caravan has heard of a previous suggestion the group nearly agreed on, it reports it. The new proposer must honour that older information. This safeguards consistency.
4The actual proposal is madeOnly now does Aghilas say, “Given everything I’ve learned, here is the campsite I propose.”
5A majority must acceptNot everyone needs to hear it immediately β€” sandstorms still exist β€” but once a majority have written the same decision onto their parchment, the group can never choose differently in the future.

That’s the beauty of Paxos: it guarantees agreement without guaranteeing that everyone is present, awake, or reachable. Messages may be late, out of order, or duplicated. A caravan may wake up after days of silence. Two caravans may propose ideas at once. It doesn’t matter as long as everyone follows the rules, the group will not split into contradictory futures.

This is why Paxos remains a classic in distributed systems. It is deliberately conservative, distrustful of the world’s reliability, and yet manages to let a group converge on one shared truth. Alternatives like the Raft Consensus Algorithm are often easier to implement, but Paxos remains valuable because of the flexibility it affords across different use-cases. To get more information, here are some notes (TODO: permalink) from the repo about the algorithm.

TODO: specify clearly that this implementation is a simple paxos ipmoementation and give short intutive descriptions of how multi-paxos, fast paxos and all that are natural evolutions that are stepping stones away from what this is right now. This is supposed t obe a short expectation signpost

Before understanding the Simulator, here’s a happy-case run of the Protocol.

Nodes within a cluster at every point in time are multi-hatting. Each node wears the hat of a Proposer, Acceptor, and Learner at the same time. These are 3 separate sub-states that are independent and can have state changes simultaneously.
The state names in the diagram (e.g. WaitingForPromises, ProposerAccepting,…) might make more sense after the GADT pressure section when we see the State-machines, it’s possible to follow the message flow while glossing over these foreign things for now. The sequence diagram is an evolution of the steps we’ve seen in the Tuareg-themed table before so it should be relatable.
%%{init: {'themeVariables': {'fontSize': '24px'}, 'sequence': {'mirrorActors': true, 'messageAlign': 'center', 'noteAlign': 'left'}}}%% sequenceDiagram autonumber %% ============================================================ %% Paxos Happy-Path: One Round of Consensus (3-node cluster) %% ============================================================ %% Context: All 3 nodes wear all 3 hats (Proposer, Acceptor, Learner) %% simultaneously. Node A initiates. Msgs pass through Bus (abstracted). %% ============================================================ participant A as Node A
(initiates as Proposer) participant B as Node B
(Acceptor + Learner) participant C as Node C
(Acceptor + Learner) note over A, C: Each node maintains all 3 sub-states (Proposer, Acceptor, Learner) independently.

For each sub-state, all state transitions persist a snapshot to Storage. note over A: Initial Proposer state: Idle %% note over A: Initial Proposer state: Idle

Every state transition persists a snapshot to Storage. %% ═══════════════════════════════════════ %% PHASE 1 β€” Permission / Prepare %% ═══════════════════════════════════════ rect rgba(100, 150, 255, 0.08) note left of A: Phase 1: Prepare note over A: Node A changes its Proposer state
on sending the first prepare msg from
Idle β†’ Preparing β†’ WaitingForPromises par Broadcast PermissionRequest A ->>+ B: PermissionRequest(proposal, value) and A ->>+ C: PermissionRequest(proposal, value) and Self-loopback A ->>+ A: PermissionRequest(proposal, value) note right of A: Loopback keeps actor semantics
consistent & clean end note over A, C: Each Acceptor independently checks permissibility:
is this proposal higher than any prior promise? par Responses arrive asynchronously B -->>- A: PermissionGranted(last_accepted_b) and C -->>- A: PermissionGranted(last_accepted_c) and A -->>- A: PermissionGranted(last_accepted_a) end note over A: Quorum reached (at least 2 of 3 acceptors).
Pick value: from highest-proposal-id promise,
or own value if all promises are None. end %% ═══════════════════════════════════════ %% PHASE 2 β€” Accept %% ═══════════════════════════════════════ rect rgba(100, 200, 100, 0.08) note left of A: Phase 2: Accept note over A: On quorum reached from Phase 1,
Node A changes state again
WaitingForPromises β†’ ProposerAccepting par Broadcast Suggestion A ->>+ B: Suggestion(proposal, chosen_value) and A ->>+ C: Suggestion(proposal, chosen_value) and Self-loopback A ->>+ A: Suggestion(proposal, chosen_value) end note over A, C: Each Acceptor checks: proposal β‰₯ promised?
Accept and record value. par Responses arrive asynchronously B -->>- A: Accepted(proposal, value) and C -->>- A: Accepted(proposal, value) and A -->>- A: Accepted(proposal, value) end note over A: Quorum of acceptances reached. end %% ═══════════════════════════════════════ %% RESOLUTION β€” Decided %% ═══════════════════════════════════════ rect rgba(255, 200, 80, 0.08) note left of A: Resolution
to be broadcasted note over A: ProposerAccepting β†’ Decided(value) par Broadcast Decided A ->> B: Decided(value) and A ->> C: Decided(value) and A ->> A: Decided(value) end note over A,C: All nodes learn the decided value.
Consensus achieved for this round. end

4 Layers of Responsibility #

These forces encouraged me to see the implementation as having the following concrete layers of responsibility:

SubsystemTuareg ParallelWhat it owns
Node LogicThe caravan formationsPaxos protocol state machines, per-role sub-states, message-handling logic ( the most complex layer )
Messaging LayerThe messengersMessage routing and delivery, topic-based pub-sub, network partition simulation
Time & SchedulingThe desert sunGlobal event queue, discrete logical tick loop, deterministic event ordering
Simulation HarnessThe chroniclerScenario definition and injection, terminal UI, structured logging, replayable traces

These aren’t equal in weight: the Node Logic layer carries the Paxos protocol itself and is where most of the interesting type-system work lives. The other three are the infrastructure harnesses that make the system observable and controllable. The boundaries emerged from the grammar constraints, not from OCaml conventions – so this is just my way of reacting to the forces at play here.

The Big Picture Map #

The architecture map below outlines my simulator’s design – it’s my way of consolidating structures and some rules of thumb in a diagram Clicking on various parts of the diagram will lead you directly to the code. . I recommend reading it top-to-bottom in two passes: first, ignoring the details and trying to understand the layers that the diagram shows and then zooming in layer-by-layer. Each layer zooms closer to the Node and the last section shows how to make a Node.

To prime us for the design pressure stories that come after, I'll draw attention the OCaml specifics (e.g. functors like Make_node) will be explained in each section; for now, treat this as a signpost of where we’re headed and the following diagram as the overall map of what was implemented to a couple of things:

TODO: [ polish ] with the new phrasing and flair applied to the 3 design pressures, consider if there’s better ways to prime / sign-post them from within this table.

TODO: [ polish ] the description of the pressures likely needs to be tightened

PressureWhereWhat
How a type-mismatch taught me what OCaml’s module system actually guaranteesPrimitives layer: the three functor arrowsMake_node, Make_node_state, Make_storage compose the system bottom-up by injecting Value and Bus as type parameters
How the type system was asked to enforce state-machine invariants, and where it fell shortPrimitives layer: Node_state box and its role_selector annotationEach role carries its own precise sub-state type; transition_role_state is the only write path
How naming conventions and a single chokepoint function drew the boundary between immutable reasoning and mutable storage“What a Node Contains” layer: the State pentagonState is owned entirely by the node; no other layer reaches into it

%%{init: {'themeVariables': {'fontSize': '24px'}, 'flowchart': {'curve': 'monotoneY'}}}%% %%{elk: {'elk.spacing.nodeNode': 2, 'elk.layered.spacing.nodeNodeBetweenLayers': 2, 'elk.padding': '[top=2,left=2,bottom=20,right=2]'}}%% flowchart TB %%------------ layer 1 : user_facing --------------- subgraph user_facing ["User-facing controllables"] direction LR cli ~~~~~~ ui config ~~~~~~ logging cli@{ shape: subproc, label: "CLI ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Entry point: scenario selection, log-level, flags" } config@{ shape: subproc, label: "Config ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ JSON scenario file: nodes, events, timing" } ui@{shape: subproc, label: "UI ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Terminal-interactive REPL: pause, resume, inspect"} logging@{shape: subproc, label: "Logging ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ All entities emit structured log events (domain-specific, not generic), presented via UI"} style cli stroke-width:3px,padding:12px; style config stroke-width:3px,padding:12px; style ui stroke-width:3px,padding:12px; style logging stroke-width:3px,padding:12px; end user_facing =="configures and controls"===> simulation_engine %%------------- layer 2 : simulation_engine ----------- subgraph simulation_engine ["Simulation Engine"] direction TB sig_runtime ~~~ simulator %%=========== layer 2.1: runtime_elements ============= subgraph runtime_elements ["Runtime Elements"] direction LR sig_runtime ~~~ simulator comment_elements ~~~ simulator sig_runtime@{shape: odd, label: "Runtime ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Β· start Β· stop Β· step Β· Β· seed_nodes Β· is_runnable "} style sig_runtime stroke-width:3px,padding:12px; comment_elements@{shape: braces, label: "Keeps logical time, Schedules events, Can do lookups"} style comment_elements stroke-width:5px, padding:12px; simulator --> clock %% simulator --"logical time"--> clock simulator --> scheduler %% simulator --"manage events"--> scheduler simulator --> registries clock ~~~ scheduler ~~~ registries simulator@{ shape: dbl-circ, label: "Simulator"} style simulator font-size:50px,stroke-width:5px,padding:12px; clock@{shape: notch-pent, label: "Clock"} scheduler@{shape: notch-pent, label: "Scheduler"} %% TODO: make the label more elaboarative e.g. "Registries allow simulator to lookup and access entities" registries@{shape: notch-pent, label: "Registries"} style clock stroke-width: 3px; style scheduler stroke-width: 3px; style registries stroke-width: 3px; end runtime_elements --"produces"--> simulation_events %%============ layer 2.2: Simulation Events ============== subgraph simulation_events ["Simulation Events"] direction TB %%>>>>>>>>>>>>> layer 2.2.1: event variants %% what are the events being simulated? subgraph event_variants ["Event Variants"] direction LR %%sig_event %%comment_e_variants sig_event ~~~ e_msg_dispatch comment_e_variants ~~~ e_node_action & e_narration e_msg_dispatch ~~~ e_control e_node_action ~~~ e_metric e_narration ~~~ e_custom %% --- variants shapes and labels: sig_event@{shape: odd, label: "Sim_event ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Β· kind Β· spec (JSON) Β· Β· action (thunk) Β· of_spec "} style sig_event stroke-width:3px,padding:12px; comment_e_variants@{ shape: braces, label: "Events, declared as JSON specs within Config, get hydrated into runtime thunks by the Simulator ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Hydration is internal, not exposed by Runtime.S" } %% comment_e_variants@{ shape: braces, label: "Before sim starts, events are seeded into the scheduler, using the config" } style comment_e_variants font-size:24px,stroke-width:5px,padding:5px; e_msg_dispatch@{shape: notch-rect, label: "Message Dispatch ⎯⎯⎯⎯⎯⎯⎯⎯⎯ Trigger node ⇛ node(s) msg send" } %%e_msg_dispatch@{shape: notch-rect, label: "Message Dispatch"} %% e_control@{shape: notch-rect, label: "Node Control" } e_control@{shape: notch-rect, label: "Node Control ⎯⎯⎯⎯⎯⎯⎯⎯⎯ e.g. Activate/deactivate" } %% e_metric@{shape: notch-rect, label: "Metric Event" } e_metric@{shape: notch-rect, label: "Metric Event ⎯⎯⎯⎯⎯⎯⎯⎯⎯ Emit stats / state snapshots" } %% e_node_action@{shape: notch-rect, label: "Node Action" } e_node_action@{shape: notch-rect, label: "Node Action ⎯⎯⎯⎯⎯⎯⎯⎯⎯ Paxos: propose, suggest, announce" } %% e_narration@{shape: notch-rect, label: "Sim Narration" } e_narration@{shape: notch-rect, label: "Sim Narration ⎯⎯⎯⎯⎯⎯⎯⎯⎯ Human-friendly narration" } %% e_custom@{shape: notch-rect, label: "Custom Event" } e_custom@{shape: notch-rect, label: "Custom Event ⎯⎯⎯⎯⎯⎯⎯⎯⎯ User-defined extensions" } style e_msg_dispatch font-size:24px, stroke-width: 5px; style e_control font-size:24px, stroke-width: 5px; style e_node_action font-size:24px, stroke-width: 5px; style e_narration font-size:24px, stroke-width: 5px; style e_custom font-size:24px, stroke-width: 5px; style e_metric font-size:24px, stroke-width: 5px; end event_variants ~~~ event_dispatch %% >>>>>>>>>> layer 2.2.2: dispatching events & how they work %% how do they work? subgraph event_dispatch ["Sim Events Dispatch Mechanism "] direction LR inst_scheduler ~~~ gathered_events comment_event_dispatch ~~~ gathered_events %%% scheduler relies on the clock to figure out the set of events to dispatch comment_event_dispatch@{ shape: braces, label: "Scheduler uses Simulator's Clock to gather the Events for dispatch" } style comment_event_dispatch stroke-width:5px,padding:5px; inst_scheduler@{ shape: dbl-circ, label: "Scheduler"} style inst_scheduler font-size:24px; events_queue e_dispatch@--> gathered_events %% events_queue e_dispatch@-->|"fires thunks"| gathered_events events_queue@{ shape: das, label: "Events Queue ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ @ Time = t, dispatches gathered sim events "} style events_queue font-size:24px, stroke-width: 5px; %% @ time = t, a set of gathered events are dispatched gathered_events@{ shape: processes, label: "[ Gathered Sim Event ] ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ e.g. 'Create a network partition to split Node Cluster X into 3 partitions'" } end style simulation_events padding:20px, fill:none; end simulation_events =="sim events can drive node communication"==> communication_layer %%============ layer 2.3: Communication Layer ============== subgraph communication_layer ["How Nodes Communicate"] direction TB comment_registry@{ shape: braces, label: "Simulator has a single cluster of nodes with 1..N partitions ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Nodes use bus for intra-partition msg-passing " } style comment_registry stroke-width:5px,padding:12px; inst_node_registry@{shape: notch-pent, label: "Node Registry ⎯⎯⎯⎯⎯ Access Nodes "} inst_partition_registry@{shape: notch-pent, label: "Partition Registry ⎯⎯⎯⎯⎯ Access Partitions "} comment_registry ~~~ ex_cluster inst_node_registry -.-> node_1_1 & node_2_1 & node_1_2 inst_partition_registry -.-> ex_partition_1 & ex_partition_2 %% >>>>> layer 2.3.1: Partitions and Clusters -- one cluster with many partitions %% -- A cluster of nodes may have network partitions within it %% -- nodes within a partition can send messges to each other and themselves subgraph ex_cluster ["e.g. A Cluster"] %% subgraph 2.3.1.1: partition 1 subgraph ex_partition_1 ["Partition 1"] node_1_1@{ shape: dbl-circ, label: "Node A"} node_1_2@{ shape: dbl-circ, label: "Node B"} mbus_1@{ shape: hex, label: "Bus ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Topic-based pub-sub within a partition "} node_1_1 <-->|Broadcast or Unicast msgs to other nodes| mbus_1 node_1_2 <-->|Sends & Rcvs msgs| mbus_1 end %% subgraph 2.3.1.2: partition 2 subgraph ex_partition_2 ["Partition 2"] node_2_1@{ shape: dbl-circ, label: "Node X"} mbus_2@{ shape: hex, label: "Bus "} node_2_1 <-->|All msg-passing is mediated by bus ~~ Even msgs sent by Node to itself i.e. loopbacks| mbus_2 mbus_2 ~~~ comment_drain comment_drain@{ shape: braces, label: "Bus uses snapshot-drain: messages enqueued during a drain wait for the next tick ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Guarantees bounded, deterministic processing per tick" } style comment_drain stroke-width:5px, padding:12px; end end style ex_cluster fill:none; end style communication_layer fill:none; %%communication_layer ~~~ message_passing %% communication_layer -->|"Messages are the payload; bus is the transport"| message_passing comment_payload_transport@{ shape: braces, label: " above: Bus is the transport i.e. How msgs travel ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ below: Messages are the payload i.e. What msgs carry "} style comment_payload_transport stroke-width:5px, padding:12px; communication_layer ==> comment_payload_transport ==> message_passing %%============ layer 2.4: Coordination Mechanism ============== subgraph message_passing ["Coordination via Message-Passing"] %% what are the variants for messages ?? %% how can they be used to coordinate, control and manage time? %% coordinate: the paxos protocol to coordinate and achieve consensus %% control: only by the simulator, the actors can't send these messages direction TB sig_message@{shape: odd, label: "Message ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Coordination | Control | Time polymorphic in 'v "} style sig_message stroke-width:3px,padding:12px; sig_message ~~~ v_coordination sig_message ~~~ v_time sig_message ~~~ v_control v_coordination@{shape: notch-rect, label: "Coordination" } v_time@{shape: notch-rect, label: "Time" } v_control@{shape: notch-rect, label: "Control" } style v_coordination font-size:30px,stroke-width:3px,padding:12px; style v_time font-size:30px,stroke-width:3px,padding:12px; style v_control font-size:30px,stroke-width:3px,padding:12px; comment_m_variants@{ shape: braces, label: " Coordination = protocol-level (Paxos) ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Control = simulator-level (Harness) ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Time = infrastructure-level (Heartbeats) " } %% comment_m_variants ~~~ v_coordination & v_time & v_control comment_m_variants ~~~ v_control style comment_m_variants stroke-width:5px,padding:12px; %% >>>>> layer 2.4.1: coordination messages needed for paxos protocol v_coordination --> paxos_msgs subgraph paxos_msgs ["Paxos Protocol Messages"] direction LR pm_1 ~~~ pm_2 pm_3 ~~~ pm_4 pm_5 ~~~ pm_6 pm_1@{shape: notch-rect, label: "Permission Request ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Phase 1A: Proposer asks for promise " } pm_2@{shape: notch-rect, label: "Permission Granted ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Phase 1B: Acceptor promises " } pm_3@{shape: notch-rect, label: "Suggestion ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Phase 2A: Proposer sends value " } pm_4@{shape: notch-rect, label: "Accepted ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Phase 2B: Acceptor accepts value " } pm_5@{shape: notch-rect, label: "Nack ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Rejection: Higher proposal seen " } pm_6@{shape: notch-rect, label: "Decided ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Consensus reached, broadcast result " } end %% >>>>> layer 2.4.2: control messages v_control ---> control_msgs subgraph control_msgs ["Simulation Control Messages"] direction LR control_Pause ~~~ control_Resume control_MakeNodeIdle ~~~ control_ActivateNode %% control_MakeNodeIdle@{shape: notch-rect, label: "Make Node Idle" } control_MakeNodeIdle@{shape: notch-rect, label: "Make Node Idle ⎯⎯⎯⎯⎯⎯⎯⎯ Sim node crash / inactivity " } %% control_ActivateNode@{shape: notch-rect, label: "Activate Node" } control_ActivateNode@{shape: notch-rect, label: "Activate Node ⎯⎯⎯⎯⎯⎯⎯⎯ Recover from storage snapshot " } %% control_Pause@{shape: notch-rect, label: "Pause Sim"} control_Pause@{shape: notch-rect, label: "Pause Sim ⎯⎯⎯⎯⎯ Halts the tick loop " } %% control_Resume@{shape: notch-rect, label: "Resume Sim" } control_Resume@{shape: notch-rect, label: "Resume Sim ⎯⎯⎯⎯⎯⎯⎯ Continue from curr tick " } end end end simulation_engine =="Simulation manages Nodes"=====> layer_combined_node %% ----- layer 3: combined node subgraph layer_combined_node ["What a Node Contains"] direction LR const_state ~~~ inst_node_combined ~~~ const_bus ~~~ const_storage inst_node_combined@{ shape: dbl-circ, label: "Node"} style inst_node_combined font-size:80px,stroke-width:5px,padding:12px; inst_node_combined --->|has its own| const_state inst_node_combined -->|for communication, is binded to a| const_bus inst_node_combined -->|can recover from failure using| const_storage const_state@{shape: notch-pent, label: "State"} style const_state stroke-width: 3px; const_storage@{shape: notch-pent, label: "Storage"} style const_storage stroke-width: 3px; const_bus@{shape: notch-pent, label: "Bus"} style const_bus stroke-width: 3px; end style layer_combined_node fill:none, stroke:none %% ----- layer 4: Primitives %% TODO: the entire primitives layer, I choose to omit the trivial ones like counters and such they should likely be left omitted -- %% TODO: I'm wondering if there's some short descriptions or comments I can add to improve the design of this diagram subgraph layer_primitive ["Primitives"] direction BT inst_node_node@{ shape: dbl-circ, label: "Node"} style inst_node_node font-size:40px, stroke-width:5px,padding:20px; inst_node_bus@{ shape: dbl-circ, label: "Bus"} style inst_node_bus font-size:40px, stroke-width:5px,padding:20px; inst_value@{ shape: dbl-circ, label: "Value"} style inst_value font-size:40px, stroke-width:5px,padding:12px; func_make_node_state@{ shape: trap-b, label: "Β«functorΒ» Make_node_state: Injects Value type in state machine "} style func_make_node_state stroke-width:5px,padding:12px; %% ----- layer 4.1: Node state primitives %%% Node state sig stuff: %%subgraph node_state_primitives ["State Primitives"] %% p1 %%end %%style node_state_primitives fill:none, stroke:none %%% const_proposer_state@{shape: notch-pent, label: "Proposer"} const_acceptor_state@{shape: notch-pent, label: "Acceptor"} const_learner_state@{shape: notch-pent, label: "Learner"} sig_node_state@{shape: odd, label: "Node_state ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Β· role_state Β· role_selector Β· Β· get_role Β· set_role Β· quorum Β· "} style sig_node_state stroke-width:3px,padding:12px; const_proposer_state -..- comment_role_selector const_acceptor_state -..- comment_role_selector const_learner_state -..- comment_role_selector comment_role_selector -.-> inst_node_state %% const_proposer_state -.-> inst_node_state %% const_acceptor_state -.-> inst_node_state %% const_learner_state -.-> inst_node_state %% add in the comment about choosing between states comment_role_selector@{shape: braces, label: "Each role carries its precise sub-state type ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ transition_role_state uses locally abstract types for compile-time safe state updates ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ role_selector is a GADT that is relied on "} style comment_role_selector stroke-width:5px,padding:5px; %%join_sub_states -.-> inst_node_state %%join_sub_states@{shape:circle, label: " "} %% sig_node_state -.-> inst_node_state sig_node_state ~~~ const_proposer_state sig_node_state ~~~ const_acceptor_state sig_node_state ~~~ const_learner_state inst_node_state@{ shape: dbl-circ, label: "State"} style inst_node_state font-size:40px, stroke-width:5px,padding:12px; inst_value ==> func_make_node_state ====> inst_node_state func_make_storage@{ shape: trap-b, label: "Β«functorΒ» Make_storage: Injects state into persistence "} style func_make_storage stroke-width:5px,padding:12px; impl_file_storage@{ shape: notch-rect, label: "File Storage ⎯⎯⎯⎯⎯⎯ Atomic JSON writes to /tmp/paxos/" } impl_mem_storage@{ shape: notch-rect, label: "Mem Storage ⎯⎯⎯⎯⎯⎯ In-memory list (test/dev)" } inst_node_storage@{ shape: dbl-circ, label: "Storage"} style inst_node_storage font-size:40px, stroke-width:5px,padding:12px; %% inst_node_state ==> func_make_storage ===> inst_node_storage inst_node_state ==> func_make_storage func_make_storage -.-> impl_file_storage & impl_mem_storage impl_file_storage ===> inst_node_storage impl_mem_storage -..->|"(alt: test/dev)"| inst_node_storage func_make_node@{ shape: trap-b, label: "Β«functorΒ» Make_node: Value + Bus β†’ full Node actor "} style func_make_node stroke-width:5px,padding:12px; comment_sharing_constraints@{shape: braces, label: "Sharing constraints ensure type consistency: ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Value.t flowing through Node, State, Storage, and Bus is unified at compile time via sharing-constraints"} style comment_sharing_constraints stroke-width:5px, padding:12px; inst_value ======> func_make_node inst_node_bus ====>|bound at partition registration| func_make_node %% func_make_node ===> inst_node_node func_make_node === comment_sharing_constraints comment_sharing_constraints ==> inst_node_node comment_storage@{shape: braces, label: "Storage is internal to Node ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ each node creates its own persistence instance This avoids dependency cycles: Node_state must exist before Storage"} style comment_storage stroke-width:5px, padding:12px; %% inst_node_storage -----> inst_node_node inst_node_storage --- comment_storage comment_storage ---> inst_node_node inst_node_state -----> inst_node_node %%% orientation comment: comment_primitives@{shape: braces, label: "How Modules are Composed from Primitives ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Here, inputs flow upward into functors, to form runtime instances"} style comment_primitives stroke-width:5px, padding:12px; inst_node_node ~~~ comment_primitives end layer_combined_node <==> layer_primitive %% TODOs that apply to many parts across the whole diagram %% 1. all the edges that we have right now are mostly unlabelled edges, we might find some value in adding some desc to the directed arrows to improve the design of the diagram. %%% click events: click cli "https://github.com" _blank click config "https://github.com" _blank click ui "https://github.com" _blank click logging "https://github.com" _blank click simulator "https://github.com" _blank click clock "https://github.com" _blank click scheduler "https://github.com" _blank click registries "https://github.com" _blank click sig_event "https://github.com" _blank click e_msg_dispatch "https://github.com" _blank click e_control "https://github.com" _blank click e_metric "https://github.com" _blank click e_node_action "https://github.com" _blank click e_narration "https://github.com" _blank click e_custom "https://github.com" _blank %% TODO: comment_e_variants-- link to hydration subroutine click comment_e_variants "https://github.com" _blank click inst_scheduler "https://github.com" _blank click events_queue "https://github.com" _blank click gathered_events "https://github.com" _blank %% TODO: comment_event_dispatch -- link to subroutine click comment_event_dispatch "https://github.com" _blank %% TODO: comment_registry -- link to subroutine click comment_registry "https://github.com" _blank click inst_node_registry "https://github.com" _blank click inst_partition_registry "https://github.com" _blank click node_1_1 "https://github.com" _blank click node_1_2 "https://github.com" _blank click node_2_1 "https://github.com" _blank click mbus_1 "https://github.com" _blank click mbus_2 "https://github.com" _blank click sig_message "https://github.com" _blank click v_coordination "https://github.com" _blank click v_control "https://github.com" _blank click v_time "https://github.com" _blank click pm_1 "https://github.com" _blank click pm_2 "https://github.com" _blank click pm_3 "https://github.com" _blank click pm_4 "https://github.com" _blank click pm_5 "https://github.com" _blank click pm_6 "https://github.com" _blank click control_Pause "https://github.com" _blank click control_Resume "https://github.com" _blank click control_MakeNodeIdle "https://github.com" _blank click control_ActivateNode "https://github.com" _blank click inst_node_combined "https://github.com" _blank click const_state "https://github.com" _blank click const_bus "https://github.com" _blank click const_storage "https://github.com" _blank click inst_node_node "https://github.com" _blank click inst_node_bus "https://github.com" _blank click inst_value "https://github.com" _blank click func_make_node_state "https://github.com" _blank click const_proposer_state "https://github.com" _blank click const_acceptor_state "https://github.com" _blank click const_learner_state "https://github.com" _blank click sig_node_state "https://github.com" _blank click inst_node_state "https://github.com" _blank click func_make_storage "https://github.com" _blank click inst_node_storage "https://github.com" _blank click func_make_node "https://github.com" _blank click sig_runtime "https://github.com" _blank

This is the form that the grammar ended up taking. The map feels orderly from here: there are clean layers, named boundaries and the functor arrows show how we compose our structs. This orderliness took some iterating. What the map doesn’t show are the moments where the compiler refused to accept what I had written. Understanding why it refused changed how I thought about the design.

Design Pressures #

These moments are the focus of the rest of this post. We start small by describing some newbie snags these are just common misconceptions / missteps since we’re learning something fresh – I’d like to think of it as the tax we pay as newbies faced when wiring up the subsystems using functor-based dependency injection. It will show how the compiler’s feedback and cautions helped identify some fundamental blind-spots in my understanding of the type system. We then progress to how state-machine invariants had to be managed – and how, when I reached for GADTs to codify them, the compiler pushed back and tempered my ambition in certain cases. Finally, we see the pattern of keeping clean mutation-boundaries as an example of how the collaborative relationship between the programmer and the compiler leads to thinking in its grammar anyway – even if the compiler has nothing to say.

These moments started off as noise – type errors that I couldn’t immediately explain and constraints that felt arbitrary – and they gradually acquired a melody. Three movements, three design pressures.

Functor-based Dependency Injection: Decoupling Subsystems at Compile-Time #

We’ve seen that the combined node (center of the architecture map) has its own State, is bound to a Bus for communication purposes and relies on Storage for having the ability to recover from failures. Central to these is the idea of a “value” that the system achieves consensus on, which all these modules need to play nice with. So we define it as an abstract type behind a module signature (Value.S) — any module that satisfies Value.S can be swapped in without us having to rewrite anything. A byte-string today, perhaps a MIDI note tomorrow to make nodes that literally sing not entirely a joke β€” the consensus logic is completely agnostic to what’s being agreed upon. A leadership lease, a configuration delta, an audio sample: the protocol doesn’t care. , or more practically, a heterogeneous product type — consensus value AND a leadership-lease so the cluster knows which node to follow.

This is a classic case of parametric polymorphism Parametric polymorphism means: write the logic once, and it works for any type β€” the parameter fills in the blank. A list that sorts any comparable element, a bus that carries any message type, a consensus protocol that works for any agreed-upon value. The word β€œparametric” here signals that the abstraction is over a type parameter, not over specific values. that we want: the logic for achieving consensus has to be abstracted over the type of payload that we’re achieving consensus over. We just need the same V.t (value type) to flow through the different modules consistently. This seems straightforward: the ML ML as in Meta-Language. It’s a whole family of programming languages out of which OCaml is a member -world commonly uses parameterised modules as a building block for structure and composition of swappable-logic. This fits our need for injecting this payload dependency and it’s how we end up with the Make_node functor.

What’s a functor?

A functor is a module that takes another module (or modules) as a parameter. Think of it as a function, but at the module level – instead of computing values, it computes whole modules.

This works because OCaml’s module system supports higher-order functors – functors that take other modules as arguments. This is distinct from first-class modules (a separate OCaml feature that contrasts functors), which allow modules to be packed and passed around as ordinary values at runtime.

In the example below, Make_node(V)(Bus) takes a Value module and a Bus module and produces a fully wired Node module whose types are consistent with both.

The types don’t need to be threaded manually – the compiler enforces the connections for us at the point of application.


Why care about the “compile-time” point?

Consider how a Python equivalent of Make_node might look:

# Python: injection happens at runtime -- V and Bus are passed as objects
class Node:
    def __init__(self, value_module, bus_module):
        self.value = value_module
        self.bus   = bus_module

    def propose(self, v):
        serialized = self.value.serialize(v)  # runtime dispatch: dict lookup
        self.bus.publish(serialized)          # runtime dispatch: dict lookup

This works well — and Python’s type system (Protocol, mypy) can add static checks on top to give us some peace of mind. Notice what happens at runtime though: every call to self.value.serialize and self.bus.publish involves a dictionary lookup There’s an adage that everything in Python is a dictionary. It’s hyperbolic but also: PEP 252 established dict semantics for objects, PEP 232 extended it to functions, and PEP 412 optimized the memory layout for instances – the whole python object model is built on dict-based attribute lookup. to resolve which concrete method to call. The abstraction is real, it just has a runtime-cost.

Functor application in OCaml resolves all of this before a single instruction executes. Once the compiler sees Make_node(Value_string)(Event_bus), it fully specialises the module: V.sexp_of_t is directly Value_string.sexp_of_t, Bus.publish_broadcast is directly Event_bus.publish_broadcast. There are no vtables, no dictionary lookups, no boxing. The generated code is as efficient as if we had written a concrete Node module with string values hard-coded — but we still get the full abstraction in source.

This is what zero-cost abstraction I find it a funny chicken an egg situation: I first learned of the Zero-overhead principle from C++, wherein it’s a principle on the language/compiler design and now I come across claims from rustaceans that it may have origins from the Functional Programming (FP) world means in practice: the compiler eliminates the indirection entirely, because by the time it generates code, there is no indirection left to eliminate.

Interestingly, OCaml does support runtime DI via first-class modules — we can pack a module as (module M : S) and pass it as an ordinary value, store it in a list, or resolve it from a dispatch table at runtime. (TODO: permalink) The simulator’s UI layer does exactly this: Logger.resolve_ui selects a (module Ui.S) implementation from a flag at startup. The trade-off is that we give up the zero-cost abstraction guarantee at the packing boundary.

Functor-based DI is the right tool when the dependency is fixed at composition time; first-class modules are the right tool when it genuinely needs to vary at runtime.


 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
module Make_node (V : Value.S) (Bus : Event_bus.S) : (* << functor >> with 2 parameters: V and Bus*)
S with module V = V with module Bus = Bus = struct
  module V = V
  module Bus = Bus
  module State = Make_node_state (V)
  module Storage = Make_file_storage (State)
  (*...*)
  type t = {
      id : Types.node_id;
      alias : string;
      config : config;
      mutable state : State.role_state;
      subs : topic_to_sub_handle_to_bus_registry;
      logger : Logger.t;
      storage : Storage.t ref;
    }
 (*...*)
end
Code Snippet 1: The same V.t flows through Node, State, Bus and Storage consistently ref lib/node.ml TODO: permalink

The Simulator calls Make_node, and that’s also where the Bus module type is concretised. In the snippet below, the partition type shows us the actual type of the bus here.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
module Simulator : Runtime.S = struct
  module V = Value_string.Value_string
  module B = Event_bus (* [1]: the concrete Bus module*)
  module N = Node.Make_node (V) (B) (* [2]: functor application -- the same V and same B get injected to yield the concrete module N which satisfies the Node signature (Node.S) *)
  (*...*)
  (*[3] partition type is a product type, we can see the type for bus here*)
  type partition = {
      id : partition_id;
      mutable member_node_ids : int_set;
      bus : V.t Message.t B.t; (* B.t is polymorphic: type 'a t.
                               Here we instantiate as (V.t Message.t) B.t,
                               i.e. a bus carrying payloads of type Message.t
                               (the 'a param). V.t is the "value" we're achieving consensus on. *)
    }
  (*[4] we rely on routines for partition-creation*)
  let create_partition sim id =
    let bus =
      B.create
        ~log_level:(sim |> logger_of |> Logger.get_level)
        ~payload_to_string:(V.sexp_of_t |> Message.to_string)
        ()
    in
    { id; member_node_ids = Set.empty (module Int); bus }
    (*...*)
end
Code Snippet 2: The Simulator concretises modules (ref: TODO permalink)

The bus creation signature (B.create) makes the polymorphism explicit:

1
2
3
4
5
6
7
8
9
(* so that the payload (messages) can have a string representation *)
type 'a payload_serialiser = 'a -> string

(* creation routine *)
val create :
  ?log_level:Log_level.t ->
  payload_to_string:'a payload_serialiser ->
  unit ->
  'a t

Things look right. V.t appears to flow correctly: Value_string.Value_string.t is string under the hood and the B.create signature is polymorphic in 'a – the bus doesn’t care what it carries, as long as we supply a serialiser for logging the payload.

So we should be also be able to implement the create function like so \(\ldots\) right?

1
2
3
4
5
6
7
8
let create_partition sim id =
  let bus =
    B.create
      ~payload_to_string:(fun msg ->
        Sexplib.Sexp.to_string (Message.sexp_of_t Base.String.sexp_of_t msg))
      ()
  in
  { id; member_node_ids = Set.empty (module Int); bus }
Code Snippet 3: Our first rejection

Newbie Snag: Understanding Nominal Typing #

File "lib/simulator.ml", line 189, characters 52-55:
189 |     { id; member_node_ids = Set.empty (module Int); bus }
                                                          ^^^
Error: The value bus has type string Message.t B.t
       but an expression was expected of type V.t Message.t B.t
       Type string is not compatible with type V.t
Had 1 error, waiting for filesystem changes...

This is the first proper newbie snag I faced. Without the fundamentals, this is extremely annoying because it’s hard to connect the dots. The newbie thinking would likely be that technically, at the concrete level we should have already satisfied the type of payload_to_string, it should have work because Value_string.Value_string.t is string. The implicit assumption being that surely OCaml can handle structural typing (e.g. like in TypeScript), not nominal typing?

If we listen carefully, the error message is already guiding us: type string Message.t B.t and type V.t Message.t B.t are not the same i.e. the two types don’t unify.

The reality is that OCaml doesn’t unify types across module boundaries by default. The abstract type V.t declared behind Value.S is nominally distinct from string — even when the underlying representation is identical.

The fix is simple, we just have to respect the module boundaries and use V’s serialiser (V.sexp_of_t):

let bus =
    B.create
-      ~payload_to_string:(fun msg ->
-        Sexplib.Sexp.to_string (Message.sexp_of_t Base.String.sexp_of_t msg))
+      ~payload_to_string:(V.sexp_of_t |> Message.to_string)
      ()
  in

This was an unfortunate time-sink. The optimist in me would be appreciative of the fact that the OCaml compiler is strict and principled – it encourages us to respect interface boundaries. So, it becomes a matter of perspective whether the compiler is admonishing us or singing a cautionary note the compiler as critic before it becomes collaborator β€” this is the first bar of that melody .

The second snag was in the same neighbourhood — the module signature boundary — but more architectural in its consequences.

Newbie Snag: Destructive Substitution #

During an earlier refactor, I had extracted Acceptor_record and Node_state as separate functors (before they were eventually collapsed back into Make_node). The intent felt clean: Acceptor_record depends on Value, so it should be its own Make_acceptor_record(V); Node_state depends on Acceptor_record, so Make_node_state(A : Acceptor_record.S) should naturally reach through to A.V for the value type. To keep the output signature tidy, I used destructive substitution (:=) when writing the Acceptor_record signature:

module Make_acceptor_record (V : Value.S) :
  S with module V := V = struct
  (* ... *)
end

and tried to access the V within the concrete Acceptor_record, A like so:

module Make_node_state (A : Acceptor_record.S) : S = struct
  module V = A.V   (* ⛔️ compiler error: A has no member V *)
  (*...*)
end
Diagrammatically: Just to be clear, Acceptor_record is no longer its own thing now, this pressure describes a historical version of how things used to be.
flowchart TB %% ======= ORIGINAL INTENT: layered functors ======= comment_intent@{shape: braces, label: "Intent: clean dependency layering ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Value β†’ Acceptor_record β†’ Node_state"} val@{ shape: dbl-circ, label: "Value (V)" } style val stroke-width:5px; f_acceptor@{ shape: trap-t, label: "Β«functorΒ» Make_acceptor_record(V) ⎯⎯⎯⎯⎯⎯⎯ depends on Value" } style f_acceptor stroke-width:5px,padding:12px; comment_erase@{shape: braces, label: "⚠️ REALITY: ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Destructive substitution (`:=`) erases V from output signature ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Fix: If `=` were used β†’ A.V exists"} ar@{ shape: dbl-circ, label: "Acceptor_record (A)" } style ar stroke-width:5px; f_state@{ shape: trap-t, label: "Β«functorΒ» Make_node_state(A) ⎯⎯⎯⎯⎯⎯⎯ depends on Acceptor_record" } style f_state stroke-width:5px,padding:12px; state@{ shape: dbl-circ, label: "Node_state (NS)" } style state stroke-width:5px; val ==>|injected into| f_acceptor ==>|produces module constrained by signature| ar ==>|injected into| f_state ==>|concretises| state f_acceptor ~~~ comment_erase style comment_intent stroke-width:5px,padding:12px; style comment_erase stroke-width:3px,padding:12px; state -...->|"Attempt: NS tries to access A.V ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ ⛔️ A.V doesn't exist (path erased)"|ar

The compiler outright refused, saying that “A had no V within it”. From my newbie-lens, this made no sense: V was plainly declared within Acceptor_record.

Turns out, the diagnosis is this: := works by erasing. with module V := V removes V from the output signature entirely. The path A.V no longer exists because there is no V to reach through to. Hence, the term “destructive”.

:= (Destructive Substitution) erases the path. = (Sharing-Constraint) constrains it. If I needed that path, I was to use constraint-sharing Pins a module component to a specific identity without removing its name from the signature. .

The fix was to switch to a plain sharing constraint:

- S with module V := V = struct
+ S with module V = V = struct

A.V is now accessible downstream because V remains in the output signature (Acceptor_record.S), its identity is pinned to the functor argument, A.

When to use which? Destructive substitution (:=) vs Constraint-sharing (=)

Destructive substitution (:=) is the right tool when we want to hide an internal type from callers — we’re saying “this type is resolved, callers don’t need to name it.” Sharing constraints (=) are the right tool when downstream code needs to reach through the module to its contents — we’re saying “this type is resolved AND still accessible by name.”

The confusion as newbies makes sense: both forms are about type unification Type unification is the process of making two types equal by finding a consistent substitution. In OCaml’s module system, constraints like with module V = V unify module components across boundaries β€” the key question is whether that unified type remains nameable (accessible as a path like A.V) or is erased from the signature entirely (as with :). at a module boundary. The difference is whether the unified type remains a named path in the output signature or disappears into it.

There’s a useful heuristic here that parallels the Law of Demeter: if downstream code needs to write A.V, it is depending on A’s internal structure, not just its interface. That’s a signal worth paying attention to.

The question to ask is: does V belong in A’s public contract?

  • If yes — it’s part of what callers legitimately need to know about A — use =, keep the path open.
  • If no — it’s an implementation detail callers shouldn’t reach into — use :=, erase it, and if downstream code still needs it, inject it directly as its own parameter.

In this case, V naturally belongs in Acceptor_record.S’s contract — the value type is fundamental to what an acceptor record is. Perhaps the destructive substitution was an over-eager attempt at tidiness that severed a path that should have stayed open.


Takeaways #

The meta-lesson here is less about syntax and more about what module signatures actually are: access-control boundaries, not just type hints. Erasing a path is a meaningful act — one that the := syntax makes easy to do accidentally.

In hindsight, there were some early code-smells which were itches that I hadn’t scratched yet, which might have arrived at these simplifications sooner.

Before the types were correctly wired, there were two parallel type families living side-by-side in the node module — an internal representation and a message-layer representation of the same semantic objects:

(* Internal node types *)
type assertion = V.t Types.paxos_assertion_state [@@deriving sexp]
type promise   = assertion option [@@deriving sexp]

(* Boilerplate converters: internal ↔ message-layer *)
let paxos_assertion_of_assertion (a : assertion) : V.t Types.paxos_assertion_state =
  { proposal = a.proposal; value = a.value }

let assertion_of_paxos_assertion (paxos_a : V.t Types.paxos_assertion_state) : assertion =
  { proposal = paxos_a.proposal; value = paxos_a.value }

let promise_of_paxos_promise (paxos_p : V.t Types.paxos_promise) : promise = (* ... *)

These converters were the smell: identical fields, shuffled back and forth, buying nothing other than preventing the compiler’s complaints. The problem wasn’t exactly the converters — it was the mental gymnastics that they had to exist at all.

The root cause: the Node signature was explicitly fixing the type of assertion in the interface:

- type assertion = V.t Types.paxos_assertion_state [@@deriving sexp]
+ type assertion [@@deriving sexp]

By making assertion abstract in the signature and re-introducing the concrete definition inside the functor body, the type equality is no induced and longer explicitly declared. The message layer, event bus, node state, and storage layer are all instantiated with the same V, so their exposed types unify automatically. No explicit constraint needed, nor any converters, or boilerplate.

Essentially, the system healed itself.


The system essentially healed itself once the types were right – the compiler would just tell me what the next type-issue was once I improved my fundamental understanding of the type system. Eventually, the boilerplate had evaporated when the types were correctly wired. The noise was resolving into a beat I could follow.
Why did the boilerplate disappear? \(\Leftarrow\) a mechanistic explanation on what the compiler does now

The message layer, the event bus, the node state, and the storage layer are all instantiated with the same module argument V : Value.S, so their exposed types all carry the same type constructor V.t. This induces structural sharing of type parameters across the modules, producing the necessary type equalities without requiring explicit with type clauses in the interface.

OCaml’s module system ensures that all these modules refer to the same instance of V, which is sufficient to unify the types of assertions, messages, accepted values, etc. The resulting unification is robust, type-safe, and requires no explicit signature annotation.


We now advance to a deeper pressure: one that asks how much the type system can actually guarantee about a state machine.

Enforcing State Machine Invariants: to GADT or not to GADT? #

Earlier in the quick primer for the Paxos Protocol, we saw the happy case where 3 different nodes coordinated to achieve consensus. There’s a whole world of edge-cases to deliberate on if we went with that approach — so we shift our attention to representing the protocol’s wisdom via Finite State Machines (FSMs) Just like the adage earlier about the Python language and its reliance on lookup tables, here’s another one about Computer Science / Programming in that β€œAll Programming Philosophies are about State” or, β€œeverything is just a state machine”. Some interesting reads on this if you’re not familiar with what State Machines are about: old blog post by Shopify on why every developer should be force-fed state machines. Coming from the Elixir world, the GenServer abstraction is a beautiful pattern that is implicitly an FSM – see more elixir-discourse on FSM use-cases here. Sound FSMs make concurrency problems so much simpler to wrap our human-brains around! .

A Node Wears 3 Hats at the Same Time #

A node maintains 3 sub-states simultaneously: Proposer, Acceptor, Learner. Drawing out their FSMs before deep-diving into their logic forces us to be clear about a few things: which states are truly distinct versus merely named differently, which transitions carry side-effects (e.g. persistence, message-dispatch) versus pure state-changes, and which transitions should be impossible. The last category makes FSMs a powerful design and analysis tool for protocol implementations: if we can enumerate the illegal transitions, we can start asking whether the type system can prevent them at compile time rather than at runtime.

The Proposer FSM is the most complex of the 3 — six states for v0, the Aborted state is actually just a shim. It’s not critical to the flow yet because we omit retry mechanisms in v0. , two quorum decision gates, and a simulation-control state that will cause us trouble shortly \(\ldots\)

stateDiagram-v2 direction TB [*] --> idle idle --> prep: propose(value) idle --> pi:DeactivateNode pi --> idle: ActivateNode prep --> wfp: broadcasted PermissionRequest state isPhase1Quorum <> wfp --> isPhase1Quorum: is quorum reached? isPhase1Quorum --> wfp: no isPhase1Quorum --> aborted: nack quorum (rejected) isPhase1Quorum --> pa: grant quorum (accepted) + broadcasted Suggestion state isPhase2Quorum <> pa --> isPhase2Quorum: is quorum reached? isPhase2Quorum --> pa:no isPhase2Quorum --> decided: accept quorum reached + dispatched Decided isPhase2Quorum --> aborted aborted --> idle: retry (omitted from v0) aborted --> [*] decided --> [*] note right of pi Simulation layer only, not protocol-level. ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ ProposerInactive is a sibling variant of active protocol states in the same sum type. The type system cannot distinguish it from Idle, WaitingForPromises, etc. ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ This is why is_quorum_reached still calls failwith on this branch β€” the compiler sees all variants as equal peers. end note %% states to represent %% -- protocol states state "Preparing" as prep state "Waiting for Promises" as wfp state "Proposer Accepting" as pa state "Decided" as decided state "Aborted" as aborted %% -- simulation states state "Proposer Inactive" as pi state "Idle" as idle %% visual hierarchy: classDef simLayer fill:#3B82F6,stroke:#1D4ED8,color:#fff,font-style:italic,stroke-width:2px classDef quorumGate fill:#F59E0B,stroke:#B45309,color:#1a1a1a,stroke-width:2px class idle, pi simLayer class isPhase1Quorum, isPhase2Quorum quorumGate
FSM for Acceptor Hat

stateDiagram-v2 direction TB [*] --> idle %% -- simulation layer -- idle --> ai : DeactivateNode ai --> idle : ActivateNode %% -- protocol -- idle --> accepting : on receiving the first valid PermissionRequest accepting --> accepting : update promise / accepted on each subsequent valid request note right of ai Simulation layer only, not protocol-level. ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Same sibling-type pattern as ProposerInactive β€” the type system sees Idle, Accepting, and AcceptorInactive as equal peers in the same sum type. end note note left of accepting Tracks two values: Β· promised β€” highest proposal seen Β· accepted β€” last value accepted ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Both in record called "acceptor_record". end note state "Idle" as idle state "Accepting" as accepting state "Acceptor Inactive" as ai classDef simLayer fill:#3B82F6,stroke:#1D4ED8,color:#fff,font-style:italic,stroke-width:2px classDef quorumGate fill:#F59E0B,stroke:#B45309,color:#1a1a1a,stroke-width:2px class idle simLayer class ai simLayer

FSM for Learner Hat (trivial)

stateDiagram-v2 direction TB [*] --> lempty : Learned [] lempty --> lvalues : Decided msg received lvalues --> lvalues : additional Decided msgs note right of lempty No Inactive state. ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Learners are passive observers β€” they accumulate decisions but drive no protocol logic, so no recovery path is needed. end note note left of lvalues Accumulates assertion list. One entry per decided round. end note state "Learned (empty)" as lempty state "Learned (values)" as lvalues classDef simLayer fill:#3B82F6,stroke:#1D4ED8,color:#fff,font-style:italic,stroke-width:2px classDef quorumGate fill:#F59E0B,stroke:#B45309,color:#1a1a1a,stroke-width:2px

What immediately stands out to us from the Proposer state-diagram is that the current version of this FSM actually has 2 categories of states that are represented:

CategoryRelevant States (examples)
Protocol-Level StatePreparing, WaitingForPromises, ProposerAccepting, Aborted, DecidedEntry and exit from these states are all done via Coordination messages between nodes. These are relevant for upholding the Protocol’s invariants.
Simulation-Level StateIdle, ProposerInactive, AcceptorInactiveA node enters or exits these states via Control messages. These are semantically excluded from the Protocol operations and exist for the Simulation harness.

This means that as it currently stands, the simulation-level states (Blue-coloured states) and the protocol-level states (Green-coloured states) are in the same sum-type A sum type (also called a tagged union or discriminated union) is a type that can be exactly one of several named alternatives. In OCaml, these alternatives are called variants. If you come from TypeScript, think of a discriminated union; from Rust, an enum with data; from Java/Kotlin, a sealed class hierarchy. The key property being: a match expression over a sum type is exhaustive – the compiler warns you if you forget a case. A product type, mentioned earlier, is the complement: a type that holds several values simultaneously (a record or tuple). So, Sum = β€œone of these”; product = β€œall of these together”. . Therefore, our type-system can’t distinguish “active protocol state” from “simulation inactive state” since they’re all siblings in one flat variant. This affects how we make guarantees that the system won’t have illegal state transitions, which is fundamental to many of the protocol invariants An invariant is a property that must always hold true, no matter what path the system takes. For our Paxos FSM, one invariant is: β€œa quorum check only happens when the proposer is in WaitingForPromises or ProposerAccepting.” Another: β€œa node may only broadcast Decided if it has actually reached a decision.” Violating invariants goes beyond just producing wrong answers, it actually makes a protocol’s safety-guarantees meaningless. The question is whether we enforce these invariants via runtime checks (hoping the programmer got it right) or via the type system (making it impossible for anyone to get wrong when writing the source code) .

A great place to see this happening is when we’re making decisions on what to do next based on whether a quorum has been reached. These correspond to the two quorum-gates (orange-coloured diamond nodes) that we see in the diagram. Both these decision-points rely on the same predicate function, is_quorum_reached. We know that the node only calls this predicate function when it’s in 2 possible Proposer states: WaitingForPromises OR ProposerAccepting – all other entry-points coming from all other states are illegal. So how can we guarantee that we meet this invariant?

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
  (** Quorum predicate for both phase 1 and phase 2 of paxos.

      NOTE: This runtime check exists because proposer_state includes both
      active and inactive variants. A GADT-typed view layer
      (active_proposer_state) would eliminate this branch entirely.*)
  let is_quorum_reached cluster_size rs =
    match Proposer |> get_role rs with
    | WaitingForPromises wfp ->
        wfp |> is_quorum_reached_on_promise_wait cluster_size
    | ProposerAccepting pa ->
        pa |> is_quorum_reached_on_proposer_accepting_wait cluster_size
    | _ ->           (* <== wildcard-arm *)
        failwith
          "is_quorum_reached called on non-proposer state (must be \
           WaitingForPromises or ProposerAccepting)"
Code Snippet 4: is_quorum_reached and managing invariants (ref lib/node_state.ml) TODO: permalink

Our current solution is to use a wildcard arm that invokes a failwith — our hands are forced at the moment to rely on runtime checks. This failwith approach works because the program does the right thing at runtime if the call-sites are correct.

The problem is that “correct by programmer discipline” is exactly the kind of guarantee OCaml’s type system can be trusted to replace. Every failwith in a pattern match is a statement that the programmer knows something the compiler doesn’t: “this branch will never be reached.” That’s a pinky-promise with no enforcement.

If someone – including a future version of us – adds a new call to is_quorum_reached from a handler that processes Idle nodes, the compiler won’t object. The program would compile, the tests would pass (unless we wrote one for that exact path), and the bug would hide until a specific scenario triggers it at runtime. For a consensus protocol, that’s a silent correctness violation. We’re leaving safety on the table that the compiler could be providing for free. By the way, we risk this at other places too:

@ announce_decision, whereby a node may only announce what it has decided if it’s currently in a Decided state

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
  let announce_decision
      ({ id; alias; logger; state = { proposer; _ }; _ } as t : t) ~msg_id ~time
      decided_assertion =
    match proposer with
    | State.Decided _decided_val ->
        Logger.log_announce_decided_action ~id ~alias
          ~assertion:(decided_assertion |> assertion_to_string)
          logger;
        let msg =
          Message.make_decided ~msg_id ~time ~from:id ~decided_assertion
        in
        let bus = bus_for_topic t Types.Coordination in
        ((Types.Coordination, None), msg) |> Bus.enqueue bus ~alias
    | _ ->
        failwith
          "a node can only announce what it has learned when it has decided "
Code Snippet 5: runtime exn bomb @ announce_decision, ref node.ml (TODO: permalink)

@ propose, whereby a node may only propose if it’s currently in an Idle state

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
  (** Represents the act of a node driving the first step of the paxos process
      (asking for permission). This means that the node's state as a Proposer
      will change from [ Idle ] to [ Peparing ], as we create the message then
      dispatch it. Once done dispatching, the node will change its state to
      [ WaitingForPromises ], marking it ready to receive responses (both grants
      and nacks) for that proposal.

      As such, every paxos process (i.e. an attempt to assert a value and seek
      consensus) can be uniquely identified via its [proposal_id] *)
  let propose ~msg_id ~time ~bus ~assertion
      ({ id; alias; logger; state = { proposer; _ }; _ } as t) =
    match proposer with
    | State.Idle ->
        Logger.log_proposal_action ~id ~alias
          ~assertion:(assertion |> assertion_to_string)
          logger;
        assertion |> State.Preparing
        |> transition_role_state t time State.Proposer;
        let msg =
          Message.make_permission_request ~msg_id ~from:id ~assertion ~time
        in
        ((Types.Coordination, None), msg) |> Bus.enqueue bus ~alias;
        { assertion; promises_received = []; nacks_received = [] }
        |> State.WaitingForPromises
        |> transition_role_state t time State.Proposer
    | _ -> failwith "we can only propose if we are currently idle"
Code Snippet 6: runtime exn bomb @ propose, ref node.ml (TODO: permalink)

There’s something interesting in is_quorum_reached that may already help us \(\ldots\)


GADTs to the rescue? #

The get_role function-call within is_quorum_reached is fascinating and compels us to look at its signature.

let get_role : type a. role_state -> a role_selector -> a
observation from signatureimplication
a, the return type for get_role, is not fixed, it is determined by the selectorSo, if we pass in a Proposer, we get back proposer_state. When we pass in Acceptor, we get back acceptor_state – that’s why in is_quorum_reached, we could pattern match on specific variants of proposer_state like WaitingForPromises and ProposerAccepting
Though get_role is a single function, the compiler tracks which specific type flows out based on which constructor flows in.This feels like a type-safe Visitor Pattern, where the return type is determined by the variant that you pass in – touching the edge of dependent-typing

This invites us to inspect the machinery here to understand how it works. type a in the signature of get_role is a locally-abstract type read: placeholder type that exists within this function (get_role) only. This type variable (a) is progressively refined as we pattern-match on the role_selector. Read more here. . It introduces a type variable (a) that is scoped to this function (get_role) which the compiler refines on each branch of There’s an OCaml forum post that goes through some misconceptions around this role_selector. This means that the compiler learns which specific type a is by looking at which constructor matched. role_selector is our Generalised Algebraic Data Type (GADT) here.

179
180
181
182
183
184
185
186
187
188
(** GADT to encode which role and its sub-state type This encodes the
    association between a constructor (Proposer, Acceptor, Learner) and its
    precise sub-state type. *)
type _ role_selector =
  | Proposer : proposer_state role_selector
  | Acceptor : acceptor_state role_selector
  | Learner  : learner_state role_selector
  (* each arm follows the pattern:
  | <constructor> : <return_type> role_selector
  *)
Code Snippet 7: role_selector GADT (ref lib/node_stae.ml TODO:permalink)

Each constructor carries its return type in its signature, that’s why we can say that the GADT indexes the type parameter. This means that if the pattern-matching triggers on Proposer, then the compiler will know that a = proposer_stateProposer is the variant that produces proposer_state. This indexing mechanism is useful to select amongst the different states (proposer_state, acceptor_state, learner_state).

The signature for Node_state shows how all six proposer states – including ProposerInactive – live in one flat variant, which is precisely the structure the GADT indexes over:

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
(** Part of the Node_state signature, showing how
    the FSM states within Proposer State can be built-up*)
module type S = sig
  (*...*)
  type waiting_for_promise_state = {
    assertion : assertion;
    promises_received : promise list;
    nacks_received : nack list;
  }
  [@@deriving sexp, yojson]

  (** State for a proposer actively collecting Phase 1 responses. *)
  type proposer_accepting_state = {
    assertion : assertion;
    acks : Types.node_id list;
    nacks_received : nack list;
  }
  [@@deriving sexp, yojson]

  (** States for the proposer role. *)
  type proposer_state =
    | ProposerInactive
        (** Node is unavailable; proposer cannot act until restored. *)
    | Idle  (** Proposer is available and may initiate Phase 1. *)
    | Preparing of assertion  (** Proposer has begun Phase 1. *)
    | WaitingForPromises of waiting_for_promise_state
        (** Proposer is waiting for a quorum of Phase 1 responses. *)
    | ProposerAccepting of proposer_accepting_state
        (** Proposer has entered Phase 2 and is collecting acceptances. *)
    | Decided of V.t  (** Proposer has reached a chosen value. *)
  [@@deriving sexp, yojson]

  type role_state = {
    proposer : proposer_state;
    acceptor : acceptor_state;
    learner : learner_state;
  }
  [@@deriving sexp, yojson]
  (*...*)
end
Code Snippet 8: Partial snippet of the Node_state signature ref lib/node_state.ml (TODO: permalink)

The usefulness of this indexing is clearer when we see the signature of set_role, the write complement to get_role’s read operation.

let set_role : type a. role_state -> a role_selector -> a -> role_state

Since the third argument to this function ( a ) is constrained by the role_selector, the compiler will forbid us from passing the wrong role. We won’t be able to pass an acceptor_state when the selector says Proposer. GADTs are shaping up to be the antidote to our earlier problem where we were looking for compile-time assurances so that we could do better than “correct by programmer discipline”. With this, an entire class of “wrong role” bugs are avoided at compile time.

In fact, we actually use it for our transition_role_state where we can see it all coming together.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
  (** Polymorphic transition function for role state. This is a classic
    GADT-indexed update, which is entirely side-effect focused.*)
  let transition_role_state ({ id; alias; logger; state; storage; _ } as node)
      time (type a) (sel : a State.role_selector) (new_substate : a) =
    let new_state = new_substate |> State.set_role state sel in
    Logger.node_state_change ~node_id:id ~alias logger
      ~old_state:(state |> State.state_to_str)
      ~new_state:(new_state |> State.state_to_str);
    node.state <- new_state;
    let updated_storage =
      match node.state |> Storage.persist_snapshot !storage time with
      | Ok updated_storage -> updated_storage
      | Error e ->
          Stdio.eprintf "Warning: failed to persist snapshot: %s\n%!"
            (Error.to_string_hum e);
          !(node.storage)
    in
    node.storage := updated_storage
Code Snippet 9: transition_role_state ref:TODO:permalink to the learning note

Every state transition in the entire system flows through this one function. The GADT ensures that the selector and the new substate agree on which role is being updated. It prevents us from accidentally transitioning the proposer state with an acceptor state. This is a genuine compile-time enforcement of types and it’s already helping us.

Though it solves the “wrong role” class of errors, it doesn’t really solve the “wrong state within a role” class of errors. is_quorum_reached still accepts ANY proposer_state (including Idle, ProposerInactive) because currently the GADT indexes over roles and not over states within a role. The _ in role_selector means “we don’t need to name the type parameter at this point – there exists a type parameter here, but we’re not going to bother naming it”. It helps us track More precisely, _ is an anonymous type parameter. Each GADT constructor instantiates it with a specific concrete type (proposer_state, acceptor_state, learner_state). This is called type indexing: the parameter serves as a compile-time tag. Pattern-matching on it introduces local type equalities that are valid within each branch. the sub-state type (e.g. acceptor_state vs learner_state). However, it doesn’t constrain which variant within that substate is allowed – so we can still call operations like is_quorum_reached on Idle or ProposerInactive nodes.

Figure 2: Now that we have a hammer, everything is a nail … right? (src). Read Law of the Instrument

The Serialisation Wall & Knowing the Tradeoff-Surface #

So, the natural move is to add in more GADTs, right? The idea practically writes itself. We saw the role_selector GADT prevent “wrong role” bugs. The natural next step is to introduce an active_proposer_state GADT that only admits WaitingForPromises and ProposerAccepting – making it structurally impossible to call is_quorum_reached on Idle or ProposerInactive nodes. The failwith arms then disappear. The compiler catches the rest. Clean, green, state-machine.

Except, our state types carry a pre-processor annotation As seen in the highlighted lines of the Node_state signature above, @@deriving@@ is an OCaml preprocessor attribute (metaprogramming in OCaml that does AST-to-AST transformations) that auto-generates boilerplate code. Think β€œcode that writes code”: telling the compiler β€œemit serialization functions for this type” saves us writing dozens of lines of sexp_of_ and of_yojson by hand. ([@@deriving sexp, yojson]) – they’re serialisable. At every state transition, we persist a snapshot to storage. This enables the simulator to dump, reload, diff, and replay node state at any point in time. That’s a fundamental capability of the simulator as a teaching tool.

Why GADTs and ppx deriving don’t play nicely

Briefly: ppx deriving directives (in practice) assume uniform, first-order algebraic types. GADTs violate this assumption by allowing constructors to refine their result types. Both operate at compile time β€” but with fundamentally different expectations about what a type definition looks like.

Concretely: for most ppx deriving directives to work, every constructor must produce the same result type. This follows from how OCaml does metaprogramming: we annotate a type definition, and a compiler plugin (a ppx rewriter) walks the Abstract Syntax Tree (AST) and splices in generated code. This is a purely structural approach – the generated code is derived from syntax alone, without type inference or the ability to reason about type equalities.

A typical variant satisfies this nicely:

type active_proposer_state =
  | WaitingForPromises
  | ProposerAccepting  (* every arm produces active_proposer_state *)

The rewriter can mechanically generate uniform cases over this shape.

Now let’s contrast that with a GADT:

type _ active_proposer_state =
  | WaitingForPromises : waiting_for_promise_state -> waiting_for_promise_state active_proposer_state
  | ProposerAccepting : proposer_accepting_state -> proposer_accepting_state active_proposer_state

Here, each constructor refines the result type. There is no longer a single uniform type that all constructors return. While this refinement is present in the AST, the rewriter cannot reason about it. Its code-generation templates assume a single return type – an assumption that no longer holds.

Therefore, derived code will either be rejected or fail to type-check.

This isn’t a fundamental limitation of ppx – it’s more of a consequence of the simplifying assumption that types are uniform. An assumption that GADTs deliberately break.


The compiler knows about it too – attempting to slap [@@deriving sexp] on our hypothetical active_proposer_state GADT would produce a build error. This is the compiler scolding us again, but this time, the scolding is pointing at a deeper design tension.

Pushing ahead with a full GADT encoding of the proposer states would mean writing custom serialisers, maintaining a parallel “erased” representation for storage, and updating both layers every time the state machine evolves. Every new constructor would require updating the GADT, the erased representation, the serialiser, the deserialiser, and the pattern matches that assumed exhaustiveness – five coordinated changes where the compiler can’t help us across the erased boundary. The burden of the impossible-case failwith we wanted to eliminate from the FSM handling just migrates to the serialisation logic – if we do this, we’d just be relocating the problem, not removing it. That’s the Serialisation Wall we face but we need not climb it.

The irony is that the more type-safe our runtime model becomes via GADTs, the less safe our serialisation boundary gets. Once bytes hit disk, all type indices are fiction unless we reify them manually.
Why it’s especially not worth-it to climb this wall for a Paxos simulator

Paxos state machines are not clean protocol FSMs in ways that full GADT encoding would fight against:

Forces
Cross-cutting stateProposer logic depends on acceptor responses; learner logic feeds back into proposer termination. The states are not as independent as the FSM diagrams suggest.
Temporal loosenessIn a simulator, states can be skipped, replayed, or re-entered. A GADT encoding that forbids “impossible” states would also forbid interesting states that we want to inspect mid-flight.
Observability requirementsThe simulator wants snapshots, rewinding, diffing, and replay. All of these require serialisable state. Full GADT encoding fights against all three.

The full-scope GADT refactor could make sense when the project matures and the representation stabilises. For v0, where we’re still actively evolving how states are factored, locking in a GADT-serialiser contract would be premature.


What a medium-term resolution may look like: typed GADT views

The principled resolution for this would likely be a two-layer architecture

Layer
A: storage layerwe keep the serialisable proposer_state as the storage layer
B: typed view layerwe introduce a non-serialisable active_proposer_state GADT and we bridge between Layer A and Layer B using converter functions

Layer A stays unchanged – our current proposer_state with [@@deriving sexp, yojson]:

type proposer_state =
  | ProposerInactive | Idle
  | Preparing of assertion
  | WaitingForPromises of waiting_for_promise_state
  | ProposerAccepting of proposer_accepting_state
  | Decided of V.t
[@@deriving sexp, yojson]

Layer B introduces a GADT that only admits the states where quorum logic is valid:

type _ active_proposer_state =
  | WaitingForPromises :
      waiting_for_promise_state -> waiting_for_promise_state active_proposer_state
  | ProposerAccepting :
      proposer_accepting_state -> proposer_accepting_state active_proposer_state

A bridge function converts from storage to view, returning None for states that have no typed view:

let active_proposer_of : proposer_state -> _ active_proposer_state option = function
  | WaitingForPromises w -> Some (WaitingForPromises w)
  | ProposerAccepting p -> Some (ProposerAccepting p)
  | _ -> None

With this, is_quorum_reached can be rewritten to accept only the typed view:

let is_quorum_reached_active ~cluster_size
    (state : _ active_proposer_state) : quorum_result =
  match state with
  | WaitingForPromises wfp ->
      is_quorum_reached_on_promise_wait cluster_size wfp
  | ProposerAccepting pa ->
      is_quorum_reached_on_proposer_accepting_wait cluster_size pa

No failwith needed and no wildcard arm. The compiler guarantees exhaustiveness because active_proposer_state has only the two valid constructors. The runtime check moves to the bridge function (active_proposer_of), where it belongs – at the boundary between serialised storage and typed logic, not deep inside the protocol code.

The main appeal of this separation is that the two layers evolve independently: storage representations can change conservatively (backwards-compatible schema migrations), while the typed views can be refactored aggressively as we learn more about which invariants are worth encoding.


After understanding the tradeoff-surface, we choose to do absolutely nothing for now. The debt is known and bounded with a planned resolution. That’s a different thing from debt that’s been swept under the rug. This gives us peace.

The failwith calls stay. They’re documented, they’re tested The test-suites for this project have covered the fundamental areas to protect. For this, this section within test_fsml.ml TODO: permalink to test failwith/is_quorum_reached section , and the two-layer escape hatch is designed for when the state machine stabilises a bit more.

Emerging Patterns: Clarity from Clean Mutation-Boundaries #

So far, the design pressures were driven by the compiler speaking – errors, warnings and build-failures that forced us to reconsider our design. Now, we arrive at a case where the compiler has nothing to say. And yet, instinctively, we find ourselves inheriting its discipline and idioms to judge clearer code-paths that give us clarity in places where the compiler can’t directly assist.

Let’s look at transition_role_state again — we had side-stepped the side-effects to focus on the GADT-indexed dispatch previously:

side-effect
mutationsnode.state <- new_state updates the node’s mutable state and node.storage := updated_storage updates the mutable storage ref. These are state updates.
loggingAn emission of a log event is an effect that has IO implications (via UI).
persistenceSnapshot persistence is most direct side-effect to pick up on here, involves a write operation onto a file in most cases — an IO side-effect.

If we swap out the lens of type-level correctness with one for discipline-level correctness, we start seeing the clarity within this flow of data. Every protocol handler (e.g. propose) computes a new substate purely and then delegates to this sole function for the side-effects. A single-point for all stateful effects of transitions means that it’s a clear mutation-boundary — it wasn’t planned from the beginning, it emerged.

Actors in Charge of Their Own Mailboxes #

We’ve been following the Actor Model of computation to represent logical units within our system. Each node owns its state exclusively and talks to others only through messages — a pattern we prescribed back in the Design Grammar in Part I.

What’s this “Actor Model” about?

It’s a concurrency model Concurrency models are ways to reason about doing many things at once. Without them, the number of possible interactions becomes unmanageable, so each model imposes constraints that make systems easier to think about. Entire ecosystems exist to provide these constraints as primitives β€” Erlang’s OTP being a classic example of behaviour-rules for the Actor model. In Python terms, think of each actor as an object that can only communicate with others through a message queue, never by directly calling methods on another object. where computation is organised around independent actors.

Each actor:
1) owns its state exclusively (no one else can touch it)
2) reacts to incoming messages
3) and influences others only by sending messages

Step back, and you might recognise that an actor is just a finite state machine with an input stream: a message arrives \(\Rightarrow\) the actor looks at its current state \(\Rightarrow\) computes the next state \(\Rightarrow\) optionally emits more messages.

Having constraints is the point of a model because it helps simplify reasoning. There’s no shared mutable state, and no direct access across boundaries for the Actor Model. Every state change within the Actor has a visible cause — some message arrived, and the actor reacted.

There were other ways this could have been modelled also e.g. shared mutable state with locks, or a central coordinator — but those blur ownership. The actor model keeps it sharp: each node is responsible for its own state, and coordination is always explicit.


Coming from the world of Elixir/Erlang, this concurrency-model is instinctive for me. We have processes (GenServers) with mailboxes, message passing, and isolation by default there. That instinct carries over Coming from Elixir, where GenServer gives you process-isolated mailboxes with selective receive and supervision for free, the instinct to build a per-node inbox was natural. The pattern was sound and it worked – but OCaml’s single-threaded (self-imposed v0 implementation constraint), functor-composed world offered little runtime support for the abstraction, so the coupling that Erlang’s runtime hides became visible in this OCaml codebase. : so the natural first-reach for an earlier version of the implementation was to give each node (Actor) a mailbox and let messages accumulate.

Here’s the rough shape of how it worked:

flowchart TB comment_pattern@{shape: braces, label: "A coherent actor-with-mailbox pattern -- not broken, but a tightly coupled one ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ We observe that three concerns interleave across the pipeline: accumulation, protocol logic, and state management"} style comment_pattern stroke-width:5px, padding:12px; bus_delivers@{shape: hex, label: "Bus ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ delivers message to node via callback"} style bus_delivers stroke-width:5px, padding:12px; comment_pattern ~~~ bus_delivers %% ====== Stage 1: handle_message ====== handle_msg@{shape: subproc, label: "[Stage 1] handle_message ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ 1. Extracts proposal_id from msg 2. Gets or creates inbox_entry for that proposal 3. Appends msg to inbox_entry.messages πŸ’₯ mutates inbox 4. Updates node.quorum from msg payload πŸ’₯ mutates quorum ref 5. Immediately calls process_inboxes as the next operation"} style handle_msg stroke-width:3px, padding:12px; bus_delivers ==> handle_msg %% ====== Stage 2: process_inboxes ====== process_inbox@{shape: subproc, label: "[Stage 2] process_inboxes ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ 1. Iterates every proposal key in inbox 2. Calls is_quorum_reached on each entry 3. If quorum met β†’ triggers next step πŸ’₯ would mutate state 4. Accumulation and evaluation interleaved"} style process_inbox stroke-width:3px, padding:12px; handle_msg ===>|"immediately"| process_inbox %% ====== Stage 3: set_node_state ====== set_state@{shape: subproc, label: "[Stage 3] set_node_state ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ 1. node.state ← new_state πŸ’₯ mutates state 2. printf logging πŸ’₯ side-effect (IO) 3. No persistence call"} style set_state stroke-width:3px, padding:12px; process_inbox ===> set_state %% ====== Storage: present but disconnected ====== storage_unwired@{shape: notch-pent, label: "Storage ⎯⎯⎯⎯⎯⎯ Present in Node.t Not wired into any transition"} style storage_unwired stroke-width:3px; set_state ~~~ comment_coupling set_state -..->|"batched over, early implementation"| storage_unwired %% ====== Annotation: the three concerns ====== comment_coupling@{shape: braces, label: "⚠️ Mutation-sites (πŸ’₯) are spread throughout the pipeline ⚠️: ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ [Stage 1] handle_message β†’ inbox + quorum [Stage 2] process_inboxes β†’ state (if quorum) [Stage 3] set_node_state β†’ state + logging ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ No single place to see: what happens when state changes?"} style comment_coupling stroke-width:5px, padding:12px; %% TODO: just like the archi map, this needs to have click events so that we can TODO: permalink

For this design, we notice that mutations get distributed across layers — every segment of the pipeline has mutations. This means three different functions \(\Rightarrow\) three different mutation-sites. There’s no single place where we can see “what happens when state changes”. Also, the lifecycle of an inbox is spread over multiple places.

The concurrency model is correct here. The inbox-pattern also works, but it doesn’t have the snap-together feel it would have had if it were a simple design.

This very instinct to reach for a mailbox is an example of correct thinking applied for the wrong runtime. The pattern works, but the infrastructure that makes it ergonomic in Erlang doesn’t exist here in OCaml —heuristics that come from prior experience can shape (and also sometimes misshape) design in a new language.

Chaos Naturally Converges into Clarity #

The FSM restructuring that was made in response to the compiler’s guidance for the previous pressure actually improved more than just type-safety. It unlocked new code-paths for us. Thanks to the layered, role-based substates, each FSM state can now carry its own accumulated data. We saw this earlier within waiting_for_promise_state (part of the Proposer state). There, we now hold promises_received and nacks_received (both, as an ordered list of promises/nacks) as part of that substate record. The stateful inbox gets sidelined into a redundant, parallel accumulator.

That’s what killed the inbox.

With the inbox now gone, where do the side-effects go?

Once the pure computation (assembling of new substate at handlers like propose ) is separated from the effectful execution (mutation, logging, persistence), we automagically see a clear mutation-boundary forming. All the side-effects naturally converge on transition_role_state.

flowchart TB subgraph mutation_boundary ["Functional Core, Imperative Shell"] direction TB proposer_handlers ~~~ acceptor_handlers ~~~ learner_handlers %% ============= PURE ZONE: handlers compute substates ============= subgraph pure_handlers ["Pure Computation β€” each handler yields a new substate"] direction TB subgraph proposer_handlers ["Proposer handlers"] direction LR h_propose@{shape: subproc, label: "propose"} h_perm_granted@{shape: subproc, label: "handle_permission_granted"} h_accepted@{shape: subproc, label: "handle_accepted"} h_nack@{shape: subproc, label: "handle_nack"} style h_propose stroke-width:3px,padding:12px; style h_perm_granted stroke-width:3px,padding:12px; style h_accepted stroke-width:3px,padding:12px; style h_nack stroke-width:3px,padding:12px; end subgraph acceptor_handlers ["Acceptor handlers"] direction LR h_perm_req@{shape: subproc, label: "handle_permission_request"} h_suggestion@{shape: subproc, label: "handle_suggestion"} style h_perm_req stroke-width:3px,padding:12px; style h_suggestion stroke-width:3px,padding:12px; end subgraph learner_handlers ["Learner handler"] direction LR h_decided@{shape: subproc, label: "handle_decided"} style h_decided stroke-width:3px,padding:12px; end end comment_pure@{shape: braces, label: "No mutation, no IO, no side-effects ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯ Each handler computes a typed substate value and passes it with the matching role_selector"} style comment_pure font-size:30px, stroke-width:5px, padding:12px; pure_handlers ~~~ comment_pure %% ============= THE CHOKEPOINT ============= chokepoint@{shape: dbl-circ, label: "transition_role_state"} style chokepoint font-size:40px, stroke-width:5px, padding:20px; h_propose ====> chokepoint h_perm_granted ==> chokepoint h_accepted ==> chokepoint h_nack ==> chokepoint h_perm_req ==> chokepoint h_suggestion ==> chokepoint h_decided ==> chokepoint %% ============= EFFECT ZONE: all side-effects here ============= subgraph effect_zone ["Effects β€” all here, nowhere else"] direction LR step_1@{shape: notch-rect, label: "State.set_role ⎯⎯⎯⎯⎯⎯ Pure immutable state transformation"} step_2@{shape: notch-rect, label: "Logger ⎯⎯⎯⎯⎯⎯ Structured state-change event"} step_3@{shape: notch-rect, label: "node.state ← ⎯⎯⎯⎯⎯⎯ Single mutable assignment"} step_4@{shape: notch-rect, label: "Storage.persist ⎯⎯⎯⎯⎯⎯ Snapshot to storage ref"} step_1 ~~~ step_2 ~~~ step_3 ~~~ step_4 style step_1 font-size:24px, stroke-width:5px; style step_2 font-size:24px, stroke-width:5px; style step_3 font-size:24px, stroke-width:5px; style step_4 font-size:24px, stroke-width:5px; end chokepoint ===> effect_zone end
There’s one small exception to this pattern

handle_simulation_control mutates node.state (ref TODO:permalink) outside this chokepoint because it’s a simulation-harness-level control, not part of the Paxos protocol logic – the same simulation-vs-protocol distinction from the GADT section. It’s known, bounded and consistent so we maintain our peace.


This emergent pattern has some overlaps with existing OCaml / FP idioms.

It seems that a common FP idiom that people in the interwebs cite is that of a “Functional Core Imperative Shell”, a demo can be found here. From the linked article, the part that I can relate to is:

The natural evolution is to make even more functions pure and focus more on the separation of concerns between domain logic and infrastructure. This can be achieved by representing effectful interactions as pure values and separating the evaluation and effectful execution from their pure representations.

That’s similar to our observations in how this mutation boundary emerged. I didn’t manage to frontload that much reading on more generalisable design patterns from the FP world. I do have this book – “Pearls of Functional Algorithm Design” by Richard Bird and this repo on design-patterns in Reason — in my reading list though. Anything with case-studies will be a good reference for gaining some intuition on this.


Once the boundary between pure and effectful code was visible, conventions naturally cascaded from it – in how we name things, how we construct them, and how we guard entry to state-dependent logic.

Pure assembly \(\Leftarrow\) make_* vs Stateful Creation / Initialisation \(\Leftarrow\) create_*

(* Pure assembly β€” no mutation, deterministic from arguments *)
let make_permission_request ~msg_id ~time ~from ~assertion =
  {meta= make_meta msg_id time Types.Coordination; from; assertion}

(* Resource allocation β€” creates fresh mutable bus, initialises set *)
let create_partition id =
  let bus = B.create ~payload_to_string:(...) in
  {id; member_node_ids= Set.empty (module Int); bus}

This is a straightforward one: the name-prefix tells you about the construction. make_* assembles a value; create_* allocates a resource with its own mutable identity. A reader who knows this convention can scan the codebase and know where the effects live without reading function bodies.


If Module Has_spec \(\implies\) can be hydrated/de-hydrated (spec-driven construction)

This pattern is about spec-driven construction. of_spec functions define a clear boundary where serialisable configuration (i.e. a spec) is hydrated into fully-initialised runtime structures.

There’s a module signature for this:

module type Has_spec = sig
  type t
  type spec [@@deriving sexp, yojson]

  val of_spec : spec -> t
end

We can mix that signature into an existing module signature like so:

module type S = sig
  module V : Value.S
  module Bus : Event_bus.S
  module State : Node_state.S
  module Storage : Storage.S with type snapshot_payload = State.role_state

  type t

  include Has_spec with type t := t (* πŸ”₯ DESTRUCTIVE! πŸ”₯ *)

  (*...*)
end
Code Snippet 10: Node module signature satisfies Has_spec (TODO: permalink)

To complete the example, here’s the actual hydration boundary functions for Node, implemented within the Make_node functor that we had seen before:

module Make_node (V : Value.S) (Bus : Event_bus.S) :
  S with module V = V with module Bus = Bus = struct
  (*...*)

  (**Shape of the static config for Node *)
  type spec = {
    node_id : int;
    node_alias : string;
    topic_strs : string list;
    initial_cluster_size : int;
    initial_state : string option; [@default None] [@yojson_drop_default]
    storage_config : string option; [@default None] [@yojson_drop_default]
  }
  [@@deriving sexp, yojson]

  (*...*)

  (** Hydration routine*)
  let of_spec { node_id; node_alias; topic_strs; initial_cluster_size; _ } =
    let default_state = State.init_state () in
    let config =
      {
        runtime = { cluster_size = ref initial_cluster_size };
        roles =
          [ "Acceptor"; "Learner"; "Proposer" ]
          |> List.filter_map ~f:role_of_str;
        topics = topic_strs |> List.filter_map ~f:Types.topic_of_str;
      }
    in
    {
      id = node_id;
      alias = node_alias;
      state = default_state;
      config;
      subs = Hashtbl.Poly.create ();
      storage = ref (Storage.create ~alias:node_alias ());
      logger = Logger.create Stdlib.__MODULE__ ();
    }

  (*...*)

  (** De-hydration routine*)
  let to_spec { id = node_id; alias = node_alias; config; _ } =
    {
      node_id;
      node_alias;
      topic_strs = config.topics |> List.map ~f:Types.topic_to_str;
      initial_cluster_size = !(config.runtime.cluster_size);
      initial_state = None;
      storage_config = None;
    }

  (*...*)
end
Code Snippet 11: The Hydration-boundary for Node (TODO:permalink)

This pattern allows us to load an effectful world out of static data (config files). It also overlaps with a few other patterns we might be familiar with:

  • hydration/re-hydration boundaries which is common in persistence layers
  • smart constructors (FP idiom, ref Haskell wiki).
  • configuration as a data-philosophy

Custom with-guards for State-Gated Effect Execution

This is about having guarded execution and effect-gating that allows for early-exit precondition checks.

let with_active ~node ~ignore_reason f =
  if node.state |> State.is_inactive then ignore_reason |> noop_ignore node
  else f ()

This also has the extra benefit of centralising both the guard and its fallback behaviour. It couples the condition, the reason and its fallback-effect on not meeting the state-preconditions. So now, handlers can have business-logic assuming preconditions hold.

  let handle_coordination node msg =
    with_active ~node
      ~ignore_reason:
        "inactive right now and can't be reached to get coordinated..."
    @@ fun () ->
    log_flow ~routine:Stdlib.__FUNCTION__ ~msg:"...coordination is happening"
      node;
    match msg |> Message.proposal_id_of with
    | None -> ()
    | Some _proposal_id -> (
        match msg with
        | Message.Coordination (PermissionRequest pr) ->
            pr |> handle_permission_request node
        | Message.Coordination (PermissionGranted pg) ->
            pg |> handle_permission_granted node
        | Message.Coordination (Suggestion s) -> s |> handle_suggestion node
        | Message.Coordination (Accepted a) -> a |> handle_accepted node
        | Message.Coordination (Nack n) -> n |> handle_nack node
        | Message.Coordination (Decided d) -> d |> handle_decided node
        | _ ->
            failwith
              "We can only coordinate if we receive a coordination message.")

This approach feels familiar if you’ve used Python’s precondition decorators before:

def requires_active(func):
    """ A decorator-based precondition guard"""
    def wrapper(node, *args, **kwargs):
        if node.is_inactive():
            log(f"{node.alias}: inactive, ignoring")
            return
        return func(node, *args, **kwargs)
    return wrapper

@requires_active # <--- applying that decorator
def handle_coordination(node, msg):
    # business logic assumes node is active
    ...

The OCaml version achieves the same pattern without decorators or dynamic dispatch β€” with_active is a plain higher-order function that takes the guarded logic as a thunk. The guard, the reason, and the fallback are all co-located in one call.


These aren’t any compiler-rules that we’re obliged to follow — they’re just extrapolations that we arrive at ourselves because they give us more of that satisfactory snap-together feel when we implement them.

Figure 3: πŸ™™ snap-together feel (retrieved) πŸ™›

In the Absence of the Compiler, We Hum its Tunes #

What’s fascinating is that the type system doesn't enforce any of the things In fact, OCaml has no effect system i.e. there’s no annotation that says β€œthis function is pure” or β€œside-effects live here”. Also β€” and this is something that blew my mind as a newbie into this language β€” OCaml is actually a general-purpose language. For example, we can use the power of Classes, Objects, First-class Modules (packed), Mixins, Inheritance \(\ldots\) levers you’d typically find in the OOP-world. Not to forget, imperative programming levers are also available. we just went through — e.g. nothing will stop someone from writing a create_* function that actually is pure, or a make_* function that allocates a mutable ref.

Yet, the discipline holds because of the compiler’s influence on our way of thinking. “What’s the shape of that struct, the direction of data-flow, the signature of that function, is it immutable or are we writing imperative logic in that area \(\ldots\) " – these are the threads of thought that makes the patterns feel natural, self-evident and emergent.

This pressure shows us the third type of relationship we can form with the compiler. The first pressure was about obeying – the compiler refused our types and we learned why. The second was about negotiating – we pushed the type system toward state-machine invariants and learned where it stops. The third is about internalising – applying the compiler’s discipline in territory it can’t reach.

What started as noise – type errors we didn’t yet understand, constraints that felt arbitrary – became a teacher’s voice we learned to trust. And what that voice left behind, once the specific errors were resolved and the specific constraints were satisfied, was a habit of thought: a preference for explicit boundaries, a sensitivity to where mutation lives, a belief that the names we give things should carry the weight of the architecture.

The compiler sings, you learn the songs, and eventually you find yourself humming them in places where the compiler isn’t.

Closer #

80’s Ad by Sony Walkman.