- rtshkmr's digital garden/
- Readings/
- Books/
- Real World OCaml: Functional Programming for the Masses/
- Chapter 16B: Parallel & Concurrent Programming in OCaml 5/
Chapter 16B: Parallel & Concurrent Programming in OCaml 5
Table of Contents
OCaml 5 removed the Global Runtime Lock and introduced Domains as the unit of true parallelism. These notes cover Domain-based programming, Domainslib’s work-stealing task pools, the relaxed memory model and DRF-SC guarantee, atomic variables for non-blocking communication, and blocking synchronisation with Mutex and Condition — a practical toolkit for parallel OCaml. We then go through Effects and Threads to complete the Concurrent programming picture.

Parallel Programming (via Domains) #
Domains #
Domains can be spawned and an associated lambda can be executed: Domain.spawn (fun _ -> print_endline "I ran in parallel"). There’s a limit by OCaml of 128 domains that can be active at the same time.
Spawned Domains can be joined to retrieve their results — join waits for target domain to terminate, this is a blocking operation.
Gotcha: join is blocking, but spawn is already parallel #
Careful not to misread this. The spawn will already make the work happen. So even if join is blocking, the work on the other domain wouldn’t be blocked (only the control flow would be blocked until the domain terminates). That’s why the double parallel call takes roughly the same time as a single call.
| |
$ ocamlopt -o fib_twice.exe fib_twice.ml
$ ./fib_twice.exe 42
fib(42) = 267914296
fib(42) = 267914296
$ hyperfine './fib_twice.exe 42'
Benchmark 1: ./fib_twice.exe 42
Time (mean ± sd): 1.249 s ± 0.025 s [User: 2.451 s, System: 0.012 s]
Range (min … max): 1.221 s … 1.290 s 10 runsDomainslib: for nested-parallel programming #
The idea is that the core compiler distribution only offers the low-lying primitives for concurrent and parallel programming. This is so that the high-level libraries can be distributed outside the core compiler distribution.
| |
perf analysis
$ ocamlfind ocamlopt -package domainslib -linkpkg -o fib_par2.exe fib_par2.ml
$ ./fib_par2.exe 1 42
fib(42) = 267914296
$ hyperfine './fib.exe 42' './fib_par2.exe 2 42' \
'./fib_par2.exe 4 42' './fib_par2.exe 8 42'
Benchmark 1: ./fib.exe 42
Time (mean ± sd): 1.217 s ± 0.018 s [User: 1.203 s, System: 0.004 s]
Range (min … max): 1.202 s … 1.261 s 10 runs
Benchmark 2: ./fib_par2.exe 2 42
Time (mean ± sd): 628.2 ms ± 2.9 ms [User: 1243.1 ms, System: 4.9 ms]
Range (min … max): 625.7 ms … 634.5 ms 10 runs
Benchmark 3: ./fib_par2.exe 4 42
Time (mean ± sd): 337.6 ms ± 23.4 ms [User: 1321.8 ms, System: 8.4 ms]
Range (min … max): 318.5 ms … 377.6 ms 10 runs
Benchmark 4: ./fib_par2.exe 8 42
Time (mean ± sd): 250.0 ms ± 9.4 ms [User: 1877.1 ms, System: 12.6 ms]
Range (min … max): 242.5 ms … 277.3 ms 11 runs
Summary
'./fib_par2.exe 8 42' ran
1.35 ± 0.11 times faster than './fib_par2.exe 4 42'
2.51 ± 0.10 times faster than './fib_par2.exe 2 42'
4.87 ± 0.20 times faster than './fib.exe 42'Notes:
- a good idea to keep things serial for small input sizes
- the task pool lifecycle management is ours to be responsible for:
setup_pool: configrun: kickstartteardown: fini
Parallel Iteration Constructs #
It’s an ergonomic loop-construct:
comparison between serial and parallel using the spectral_norm benchmark
| |
We observe the looping that happens at eval_A_times_u and eval_At_times_u and realise that the iterations of the outer loop are not dependent on each other and can be executed in parallel — Each iteration of the outer loop body reads from u but writes to disjoint memory locations in v.
| |
-let n = Array.length v - 1 in
- for i = 0 to n do
- let vi = ref 0. in
- for j = 0 to n do vi := !vi +. eval_A i j *. u.(j) done;
- v.(i) <- !vi
- done
+let n = Array.length v - 1 in
+ T.parallel_for pool ~start:0 ~finish:n ~body:(fun i ->
+ let vi = ref 0. in
+ for j = 0 to n do vi := !vi +. eval_A i j *. u.(j) done;
+ v.(i) <- !vi
+ )
Memory-model: The Easy Bits #
This section summarises the contents from this section of the OCaml manual.
Typical modern processors and compilers do aggressive optimisation of code, that’s why serial vs parallel code gives surprising behaviour differences in the parallel version.
OCaml also makes such optimisations but uses a relaxed memory model so that parallel code becomes more predictable. It describes what values an OCaml program is allowed to witness when reading a memory location.
Recipes to Follow #
So we get recipes to follow that retain the simplicity of sequential processing.
immutable values:
can be freely shared between multiple domains, can access in parallel
mutable data structures (e.g. ref cells arrays, mutable record fields)
the programmer needs to ensure that data-races are avoided — where a data-race is for concurent access of a non-atomic memory location without synchronisation AND at least one of the accesses is a write operation.
So it matters if the datastructure is atomic or non-atomic:
non-atomic datastructures (e.g. ref-cells, arrays, mutable record fields)
atomic datastructures — no need sync mechanisms
Approaches to introduce synchronisation:
- using atomic variables
- using explict sycn primitives mutexes, semaphores, conditional
The DRF-SC (Data-Race-Free programs with ) guarantee: when observed program behaviour can be explained by the interleaving of operations from different domains
In OCaml, DRF-SC guarantee is modular: if a part of a program is data-race free then the OCaml memory model ensures that those parts have sequential consistency despite other parts of the program having data-races.
Memory-model: The Hard Bits #
The rest of this section relates to the “Memory model: The Hard Bits”, the technical paper that is detailed about this can be see here.
Sequential Consistency: simple model to help us reason #
Sequential consistency is a consistency model that helps us easily reason about the behaviour of our programs.
values observed by the program can be explained through some interleaving of the operations from different domains in the program — we just need to know how the interleaving happens
elaboration on cases where sequential consistency model should make reasoning easier
| |
here we may have some observed behaviour, e.g. r1 = 2, r2 = 0, r3 = 2 if the write to b in d2 happened before the read of b in d1
We complicate this further, let a and b be the same ref
We might expect the assert statement to never fail since if r2 is 0 then the write in d2 occurred before the read of b in d1. Since a and b are aliases, the second read of a in d1 should also return 0, right?
Optimisation Mechanisms as a Source of Sequential Consistency Assertion Failure
Why would the assertion fail? \(\implies\) other mechanisms at play
compiler optimisations as a source of reordering
compiler optimisations may affect the program logic (e.g. via sub-expression elimination (CSE)). The use of optimisations is fair and valid and necesary for good performance. They don’t change the sequential meaning of the program but have effects within a parallel runtime.
CSE breaks sequential reasoning. \(\implies\) observed behaviour may not be able to be explained by interleaving of operations from different domains in the source program.
example of how compiler might optimise using CSE
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16let d1 a b = let r1 = !a * 2 in let r2 = !b in let r3 = r1 in (* CSE: !a * 2 ==> r1 *) (r1, r2, r3) let d2 b = b := 0 let main () = let ab = ref 1 in let h = Domain.spawn (fun _ -> let r1, r2, r3 = d1 ab ab in assert (not (r1 = 2 && r2 = 0 && r3 = 2))) in d2 ab; Domain.join hCode Snippet 8: Here, r3 's original!a * 2might be replaced with a reference-read by the compilerhardware optimisations as a source of reordering
Complex cache hierarchies with multiple levels of cache needs cache coherence: ensuring that reads and writes to a single memory location respect sequential consistency.
Problem: weaker guarantees on programs that operate on different memory locations. The changes within a core may not have been published / replicated to the other cores yet.
example of lack of cache coherence
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16let a = ref 0 and b = ref 0 let d1 () = a := 1; !b let d2 () = b := 1; !a let main () = let h = Domain.spawn d2 in let r1 = d1 () in let r2 = Domain.join h in assert (not (r1 = 0 && r2 = 0))Code Snippet 9: it's possible for the assertion here to failThe assertion may fail if the writes performed at a CPU core are not immediately published to all of the other cores. Since
aandbare different memory locations, the reads ofaandbmay both witness the initial values and that’s why there might be an assertion failure.Why this happens: This behaviour can be explained if a
loadis allowed to be reordered before a preceding store to a different memory location.This reordering can happen due to the presence of in-core store-buffers on modern processors.
Each core effectively has a FIFO buffer of pending writes to avoid the need to block while a write completes. The writes to
aandbmay be in the store-buffers of coresc1andc2running the domainsd1andd2, respectively. The reads ofbandarunning on the coresc1andc2, respectively, will not see the writes if the writes have not propagated from the buffers to the main memory.
Data race freedom implies sequential consistency #
OCaml’s relaxed memory model so that we can precisely describe which orders are preserved by teh OCaml program — compiler, hardware free to optimise the program as long as they respect the ordering guarantees of the memory model.
The memory model describes the conditions under which a program will only exhibit sequentially consistent behaviours — that’s what Data Race Freedom implies Sequential Consistency (DRF-SC) guarantee means.
precursor understanding to help us understand what DRF-SC guarantee means
memory locations
- classified as atomic memory locations and non-atomic memory locations by OCaml
- non-atomic: ref cells, array fields, mutable record fields, immutable objects are also non-atomic locations with an initialising write but no further updates
- atomic: locations created using
Atomicmodule.
Happens-Before Relation
action type meaning examples inter-domain actions can be observed, influenced by actions on other domains R/W of atomic AND non-atomic locations, Spawn and Join of Domains, operations on mutexes intra-domain actions can neither be observed nor influence the execution of other domains evaluating an arithmetic expression, calling a function, etc so a totally ordered list of actions that is executed by the abstract machine is the execution trace — due to non-determinism there may be several possible execution traces for a given program.
Irreflexive, Transitive, happens-before relation: for a given execution trace, the thing that captures the causality between actions in OCaml program.
It’s the smallest transitive relation that satisfies the situations:
xprecedesyif:if action
xprecedes another actionyin the program order, thenxprecedesyin the happens-before order.what is program order? it’s the order in which a domain executes its actions
if
xis a write to atomic location andyis a subsequent R/W to that location in the execution trace, thenxprecedesyin happens-before orderfor atomic locations, the functions are considered both a read and a write:
compare_and_set,fetch_and_add,exchange,incr,decrxisDomain.spawn fandyis the first action in the newly spawned domain executingf, thenxprecedesyin happens-before orderxis the last action in a domain,d, andyisDomain.join d, thenxprecedsyin happens-before order.xis an unlock operation on a mutex, andyis any subsequent operation on the mutex in the execution trace, thenxprecedesyin happens-before order.
Data-race: a tight definition
two actions conflict if in a trace, they access the same non-atomic location where at least one is a write and neither of them is an initialising write to that location.
there’s a data race if there exits some execution trace of the program with 2 conflicting actions and there isn’t a happens-before relationship between the conflicting accesses.
if a program has no data race \(\implies\) it’s correctly synchronised
DRF-SC
strong guarantee: program without data races will only exhibit sequentially consistent behaviours
with this guarantee, we can use sequential reasoning: reasoning by executing one inter-domain actions after the other, to identify whether the program has a data-race.
this means, we don’t need to reason about re-orderings – in order to determine if a program has a data-race.
Reasoning with DRF-SC #
For the code examples in this section, assume that they are called in parallel by main function.
here’s how the main function would look like
It’s just spawning the two domains, having handles to them and then joining them.
| |
simple example of a data-race
| |
So this r is a non-atomic ref so the 2 domains race to access the ref and d1 is a write. Because there’s no happens-before relationship, this is a data race. They may exhibit non-sequentially-consistent behaviours.
This is a clear-cut way of identifying data-races!
not data race: accessing disjoint array indices and fields of a record in parallel
| |
races on atomic locations don’t lead to a data-race
| |
use-case: message passing
We can use atomic variables to implement non-blocking communication between domains.
example of atomic variables used for inter-domain comms
1 2 3 4 5 6 7 8 9 10(* No data race *) let msg = ref 0 let flag = Atomic.make false let d1 () = msg := 42; (* --- action a *) Atomic.set flag true (* --- action b *) let d2 () = if Atomic.get flag (* --- action c *) then !msg (* --- action d *) else 0Code Snippet 10: Here,msgis non-atomic but the actions don't lead to a data-raceHere, we see that actions
aanddwrite and read from the same non-atomic location,msgand therefore they are conflicting. Ifaanddhave a happens-before relationship, then we can show that this program does NOT have a data-race:apreceedsbin program-order, henceahappens-beforebchappens-befored- CASE A: if
d2observes the atomic variable,flagto betruethenbprecedescin happens-before order- happens-before is a transitive relationship, so the conflicting actions
aanddare in happens-before order.
- happens-before is a transitive relationship, so the conflicting actions
- CASE B: If
d2observes the flag to befalsethen the read ofmsgis NOT done, so there’s no conflicting access in this execution trace \(\implies\) program does not have a data trace
The next example, does have a data-race:
1 2 3 4 5 6 7 8 9(* Has data race *) let msg = ref 0 let flag = Atomic.make false let d1 () = msg := 42; (* --- action a *) Atomic.set flag true (* --- action b *) let d2 () = ignore (Atomic.get flag); (*---action c *) !msg (* --- action d *)Code Snippet 11: version of the message passing program that has a dataa-raceNow, there’s a possible trace where
aanddare conflicting operations AND there’s no happens-before relationship between them so there’s a data-race:Atomic.get flag; (* c *) !msg; (* d *) msg := 42; (* a *) Atomic.set flag true (* b *)Code Snippet 1: conflicting trace without happens-before relationship \(\implies\) data-raceextension: honing the intuition that made me think that this is a mutex-like behaviour
The atomic flag does play an ordering role analogous to what a mutex does — but calling it “a sort of mutex” conflates two different things that mutexes provide, only one of which the flag gives you.
A mutex provides two distinct guarantees:
- Mutual exclusion — only one thread inside the critical section at a time
- Happens-before edges — the
releaseof a mutex happens-before the subsequentacquireof it by another thread
The atomic flag here provides only the second one. Both domains run fully in parallel — there’s no exclusion. What the flag establishes is a conditional synchronisation point: if
d2observesflag = true, then the SC atomic operations guarantee thatb(the write offlag) happens-beforec(the read offlag), which combined with program order gives usa → b → c → d.The flag doesn’t protect
msgthe way a mutex would. It creates a one-way, conditional ordering guarantee. The reason it works is specifically because SCatomics establish a total order over all atomic operations globally — any read that observes a particular write is ordered after that write in the happens-before relation.[ The More Precise Analogy ]
What this pattern actually is in memory model literature is a release-acquire handshake (even though OCaml’s
Atomicgives the stronger SC, which subsumes it):d1does its work then releases by writing the flagd2acquires by reading the flag and only proceeds if it observes the release
This is exactly how
std::atomicwithmemory_order_release/memory_order_acquireworks in C++, and it’s the canonical message passing pattern. A mutex generalises this to be symmetric and reusable — but under the hood, a mutex’sunlock/lockpair is implemented using exactly this kind of release-acquire on an atomic variable.So my intuition has the causality slightly backwards: it’s not that the flag acts like a mutex — it’s that mutexes are built from this kind of atomic flag pattern.
[ Why The Second Example Breaks It ]
The broken version makes this crisp.
d2ignores the value of the flag:ignore (Atomic.get flag); (* c *) !msg (* d *)The SC total order still exists over the atomic operations —
bandcare still totally ordered. But now regardless of which way they order,dalways executes. Ifchappens beforebin the SC order, thendruns concurrently witha— no happens-before between them, hence a race.The conditionality in the first example (
if Atomic.get flag then) is load-bearing. It’s what makes Case B safe — the conflicting accessdsimply doesn’t happen if the ordering isn’t established. The flag by itself doesn’t prevent the race; the branch on the flag’s value is what eliminates the conflicting access in the unsafe trace.
Local Data Race Freedom (LDRF-SC guarantee) #
Even for programs with data-races, the OCaml memory model has strong guarantees for local data race freedom sequential consistency (LDRF-SC). Read the paper for more info.
DRF-SC says that the data race free parts of the program remain sequentially consistent. That is, even if the program has data races, those parts of the program that are disjoint from the parts with data races are amenable to sequential reasoning.
This is because the OCaml memory model is bounded both in space and time.:
- bounded in space: data races in one location don’t affect unrelated locations
- bounded in time: data races in the future don’t corrupt past behaviour
example
| |
cCan read of c return a value that is NOT 42?
yes for C++ and Java memory models. If program has datarace, even in unrelated parts, then the semantics is undefined
so if this was FFI-linked to a library that had a data-race, then under the C++ memory model, the read may return any value \(\implies\) data-races on unrelated locations can affect program behaviour \(\implies\) C++ memory model is NOT unbounded in space.
Java memory model is bounded in space but is not bounded in time; data-races in the future will affect past behaviour:
here’s an example:
1 2 3 4 5 6 7 8// Thread 1 C c = new C(); c.x = 42; a = c.x; g = c; // Thread 2 g.x = 7;The read of
c.xand the write ofgin the first thread are done on separate memory locations.Hence, the Java memory model allows them to be reordered.
As a result, the write in the second thread may occur before the read of
c.x, and hence,c.xreturns7.
The OCaml version of the java example would be:
| |
so there’s a data-race on both g and c, within snippet, just see the first 3 instructions:
let c = ref 0 in
c := 42;
let a = !c in
...The only memory location here is c. There is neither the data-race in space (the race on g) nor in time (the future race on c) therefore the snippet will have sequentially consistent behaviour, and teh value returned by !c will be 42
The OCaml memory model guarantees that even for programs with data races, memory safety is preserved. While programs with data races may observe non-sequentially consistent behaviours, they will not crash.
An Operational View of the Memory Model #
The purpose of this section in the manual is to give us a mental model intuition of how to reason with all this using an abstract state machine. The examples are good to follow here.
Blocking Synchronisation #
Both Domains and System Threads (systhreads) can be synchronised using Mutex, Condition, Semaphore modules — they are blocking synchronisation primitives.
Writing a Concurrent Stack with Blocking Synchronisation #
Implementation of a concurrent stack; mutable contents stores stack elements, mutex controls access to contents, a nonempty Condition variable
| |
contents stores stack elements, mutex controls access to contents, a nonempty Condition variableThis is familiar, the critical sections are wrapped around with mutex locking and unlocking, the Condition variable nonempty, is signalled while the lock is held in order to wake up any domains waiting on this condition — only if there are waiting domains will one of them be woken up
| |
So for the pop operation, we lock the mutex and if the stack is empty then the calling domain waits on teh condition variable nonempty using the wait primitive — so wait will atomically suspend the execution of the current domain and unlock the mutex — when the domain is woken up again (i.e. when the wait returns), then it holds the lock on mutex. The domain reads the contents of the stack again — if still non-empty then pop can be done so it updates the contents to the tail of the old list (xs) and returns the head (x).
Using a mutex here is sufficient sync between multiple domains using the stack \(\implies\) there’s no data races when multiple domains use the stack in parallel.
Interaction with systhreads #
Systhreads are pinned to the domain in which they are created – only one systhread can run OCaml code on a particular domain. Systhreads belonging to a particular domain may run C lib or system code in parallel (old school style). SO, systhreads belonging to different domains may execute in parallel.
When using systhreads, the thread created for executing the computation given to Domain.spawn is also treated as a systhread.
Here we use a shared ref cell protected by a mutex to communicate between different systehreads running on two different domains. The systhread identifiers uniquely identify systhreads in teh program. The id-counters are 0-indexed.
| |
$ ocamlopt -I +threads -I +unix unix.cmxa threads.cmxa -o dom_thr.exe dom_thr.ml
$ ./dom_thr.exe
Thread 1 running on domain 1 saw initial write
Thread 0 running on domain 0 saw the write by thread 1
Thread 2 running on domain 1 saw the write by thread 0
Thread 3 running on domain 0 saw the write by thread 2Interactions with C bindings #
Prior to multicore OCaml, the FFI / C-binding assumption was that if OCaml runtime lock is not released (GRL), then it would be safe to manipulate global C state (e.g. initialise a function-local static value) — this assumption doesn’t hold anymore in the face of parallel execution with multiple domains.
In multicore OCaml, C code running on a domain may run in parallel with any C code running in other domains even if neither of them has released the “domain lock”.
Atomics #
Are all about non-blocking sync primitives.It’s not a mechanism for suspending and waking up domains, they’re compiled to atomic read-modify-write primitives that the hardware provides.
example: Atomic vs non-atomic Counters in Parallel #
| |
$ ocamlopt -o incr.exe incr.ml
$ ./incr.exe 1_000_000
Non-atomic ref count: 1187193
Atomic ref count: 2000000example: Atomic variables used for Message-Passing #
Atomic variables can be used for low-level sync between domains – e.g. message passing between domains:
| |
Lock-free stack #
Atomic module is for implementing non-blocking, lock-free data-structures.
| |
compare_and_set r seen v sets the value of r to v \(\iff\) it’s current value is physically equal to seen — the comparator and the udpate occur atomically.
if overall comparison succeeded and the update happened \(\rightarrow\) returns true
If compare_and_set fails, then some other domain is also attempting to update the atomic ref at the same time \(\implies\) Domains.cpu_relax to back-off fora short duration allowing competing domains to make progress before retrying the failed operation.
Structured Concurrency (via Effects, Effect-Handlers) #
The concurrency primitives now rely on Effects and Effect-handlers as an alternative to the older thread-based approaches. Notes cover content from KC Sivaramakrishnan’s Blog post and the OCaml Manual’s Effect Handlers.
In doing this exploration, there was a need to be sensitive about what’s proposed vs what’s actually implemented. Even for the implemented approaches, there’s also subsequent deprecations (e.g. the point on multishot continuations being “deprecated” and shifted to its own repo). This doesn’t invalidate the intuition behind many of the concepts discussed, just makes it especially important to be sensitive about proposals vs implementations. The manual is the canonical source of truth.
How KC Sivaramakrishnan’s blog post has an older framing by using “linear algebraic effects”
The post calls the system “linear algebraic effects” and ties the one-shot property to the word “linear.”
OCaml 5’s documentation and the 2021 PLDI paper (“Retrofitting Effect Handlers onto OCaml”) dropped the “linear” qualifier from the public framing. The one-shot property still exists but it’s described as a runtime enforcement rather than a type-system property. There is still no effect type system tracking which effects a function performs — this was described as “WIP” in 2015 and remains absent from shipped OCaml 5.
Overview and Motivation #
Effect-handlers:
- generalisation of exception handlers
- allow non-local control-flow mechanisms to be expressed in a composable fashion e.g. resumable exceptions, lightweight threads, coroutines, generators, async I/O
Core tenet of Programming with Algebraic Effects, to express effectful computation:
- When we can separate the expression of an effectful computation from its implementation — expressing effectful computation.
The objective for OCaml’s multicore implementations is to provide a minimal set of functionality onto which programmers can implement new primitives and schedulers as OCaml libraries.
the tradeoff involved if the multicore runtime bakes concurrency primitives and concurrent thread scheduler within it
The problem with other FP PLs (GHC, F#, Manticore…) that are multicore capable is that the primitives for concurrency and the concurrent thread scheduler is baked into the runtime system.
the runtime system tends to grow in complexity (complex monolithic piece of software with extensive use of locks, condition variables, timers, thread pools …)
\(\implies\) this makes it difficult to maintain existing concurrency libraries and add new ones.
lack of malleability \(\implies\) prevents devs from experimenting with custom concurrency libs and scheduling strategies — low ecosystem innovation
Basics of Effects and Effect Handlers #
Xchg effect, the continuation happens immediately just for demonstration. Because these are pointers to the underlying fibres, we can also choose to capture and store them in data-structures as part of more complex logic that we apply to work with delimited continuations – that’s what the concurrency extension to the example does.an effect
open Effect open Effect.Deep type _ Effect.t += Xchg: int -> int t let comp1 () = perform (Xchg 0) + perform (Xchg 1)Code Snippet 19:Xchgeffect: an operation that is performed“the
Xchgeffect takes an integer param and, when performed, returns an integer”:declaration happens by extending the variant type
Effect.t(which is predefined and extensible)so, it’s a new constructor to that variant:
Xchg: int -> int tcomp1is the computation performs the effect twice usingperformprimitive
an effect-handler
Computations have to be enclosed within effect-handlers — so
comp1 ()is run under an effect handler like so:try comp1 () with | effect (Xchg n), k -> continue k (n+1) (* - : int = 3 *)Code Snippet 20: Effect handler to handleXchgeffect — here we always return the successor of the offered valuegeneral shape of an effect handler
try computation with | effect E1 k -> ... | effect E2 k -> ... | exn -> ...Code Snippet 21: shape of an effect-handler with multiple clauses (arms)effectis a keyword to help do pattern-matching on effects and not exceptions (or return values). So, theXchgeffect is handled with a continuation bound tok(the delimited continuation function).krepresents the suspended computation between the point ofperformand this handler.[ comparing effect-handlers to exception handlers ]
- effect handlers are a generalisation of exception handlers
- similarity: when computation performs the effect, the control jumps to the corresponding effect-handler AND unhandled effects are forwarded to the outer handler
- difference: unlike exception handlers, the effect handler is also provided with the delimited continuation
k
walkthrough of what it means for
kto represent “delimited continuation”kis a function representing “the rest of the computation” fromperformto the handler boundary.Component Responsibility comp1requests effects handler interprets effects krepresents suspended computation Let’s walkthrough this:
a recap on the things we know so far:
- the
try-withis the effect-handler. It pattern matches on theXchgeffect - the computation we care about is
comp1 ()which may emit effects — this is also what we will pause and continue. We can seecomp1as having effect boundaries:Code Snippet 22: rewritten version ofcomp1that is easy to read the control flow
- the
we trace the execution of
comp1and see that when the first perform happens, OCaml captures the continuationk(\(k_{1}\)).k1 = fun v -> let a = v in let b = perform (Xchg 1) in a + bCode Snippet 23: This is what \(k_{1}\) ends up capturingvis the input to the continuation here, which is based on the effect being handled:| effect (Xchg n), k -> continue k (n+1). Given that the first effect is performed withn = 0, this means that the continuation \(k_{1}\) will havev = 1on
continue, the control jumps back to the continuation function, the computation looks like:let a = 1 in let b = perform (Xchg 1) in (*<--- this is when the control jumps back to the effect handler*) a + bCode Snippet 24: resumed computation, before the 2nd effect is performed / emittedwhen the second
performhappens, the control jumps back to the effect handler. so \(k_2\) is the remaining computation that gets captured:k2 = fun v -> let b = v in 1 + bCode Snippet 25: 2nd continuation that gets capturedthe handler runs again with
n = 1sov = 2and \(k_2\) eventually gives1 + 2 = 3.
resuming the computation
The effect handler can resume the computation.
continueprimitive resumes the suspended computation.So in the example,
comp1is the computation which performsXchg 0andXchg 1, which makes it receive the values1(from the effectXch 0) and2(from the effectXchg 1), which sum to give3.
2 versions: deep effect-handlers and shallow effect-handlers
Effect handlers can come in deep or shallow versions. Deep effect handlers monitor computation until the computation terminates (normally, exceptionally… ) and handles all of the effects performed in sequence by the computation.
Shallow effect-handlers monitor a computation until either:
- computation terminates OR
- computation performs one effect – and the effect handler handles that single effect only.
Concurrency #
We extend the Xchg example, now to handle message-passing concurrency between 2 concurrent compuations using Xchg effect — and we wrap them as tasks.
This is so cool.
| |
Program execution looks something like this:
1. comp1 offers values 0 and 1
2. comp2 offers values 21 and 21
3. comp1 receives values 21 and 21 := 21 + 21 = 42
4. comp2 receives values 0 and 1 := 0 * 1 = 0| point of interest | elaboration |
|---|---|
| A task | A concurrent computation, it has a completion status 'a status |
task status, 'a status | Either Complete or Suspended, with a msg and cont (for the suspended delimited computation) |
the step fn (the effect handler) | executes a single step for the computation, either completes or suspends. Returns 'a status type. It could be allowed to perform other effects. |
who performs the Xchg effect? | handled by step but not it isn’t performed by step; the computations, comp1 and comp2 perform the Xchg effect |
the scheduler run_both | is where the tasks run to completion, communication between the two computations is programmed entirely inside the scheduler (m2 injected into k1 and m1 injected into k2) |
| computations and effect | the computations, comp1 and comp2, by themselves, don’t assign any meaning to the Xchg effect, the “exchange” happens @ scheduler, run_both |
User-level Threads #
With effect-handlers in OCaml, we can implement user-level threads and their schedulers without needing to be provided with language primitives for user-level threads.
| |
Xchg, Fork, Yield effects| point of interest | elaboration |
|---|---|
the Fork effect | consumes a thunk (suspended computation), returns a unit to the its performer |
the Yield effect | unparameterised, returns unit when performed |
| convenience performers | perform the effects |
| what enables round-robin scheduling? | the use of a mutable FIFO queue, run_q (ready queue). Tasks here are just thunks. |
the exchanger ref cell | it holds a suspended task that is offering to exchange a value. Invariant: at any point in time, either 0 or 1 suspended task that is offerring and exchange. |
the importance of spawn here (effect-handler) | runs the given computation f in an effect handler. f may raise an exception (handle by printing then dequeue), may return unit (then dequeue), or may perform effects |
effects performed by spawn | effect Yield handled by suspending current task and running next in run_q; Fork suspends current task, executes new task f immediately via a tail-call to spawn f; Xchg effect checks if there’s a waiting task to exchange values with else makes the current task the waiting exchanger. |
implementing Fork | in the implementation, we run the new task first. This is arbitrary. We could have chosen to insert f (new task) into the ready queue and resumed k immediately |
| Interleaved messages | From the output, we see that messages from the 2 tasks are interleaved |
| Direct-style concurrency | The run fn invocation makes no reference to the effect-handlers and has no monadic operations. It’s written in a direct-style. The use of effect-handlers keeps the user-code in simple, direct-style and effect-handlers are contained within the concurrency library implementation. |
Resuming with an Exception
There’s a simple discipline to follow to avoid blocked tasks:
Every continuation must be eventually either continued or discontinued.
If we look at the earlier implementation of
dequeue, we notice that it’s buggy because we might have a case where there’s nothing left to dequeue from therun_qand it returns to its caller BUT there could be a waiting task stored @exchanger— so that task may be blocked forever and it might even have leaky resources.Code Snippet 28: an example of a leaky task that may be blocked even ifFun.protectis used for the output channelIt’s not enough that
Fun.protectis used (to ensure that theocis closed on both normal and exceptional return cases) because it doesn’t do anything if the task is just blocked.exception Improper_synchronization let dequeue () = if Queue.is_empty run_q then begin match !exchanger with | None -> () (* done *) | Some (n, k) -> exchanger := None; discontinue k Improper_synchronization end else (Queue.pop run_q) ()Code Snippet 29: usingdiscontinueto ensure no forever-blocks – forces termination (with an exception)So in the blocked case, the
dequeuefunctiondiscontinues the blocked thread with anImprovper_synchronizationexception — the exception is raised @ the blockedxchgfunction call, which causes thefinallyblock to be run, thereby closing theocchannel. The user only observes things as though the function callxchg 0=raises the exceptionImproper_synchronization.So, effect handlers can:
- resume a continuation with a value via
continue - resume by raising an effect at the point of perform via
discontinue
- resume a continuation with a value via
Control Inversion #
| api style | who has control | example |
|---|---|---|
| push-based | the producer has control | fn List.iter in List.iter f l controls the traversal |
| pull-based | the consumer has control | the module Seq , like a delayed list – is produced on demand by the consumer |
For flexibility, suppose we wish to write a producer in the pull-based style and consumer because it’s more natural and ergonomic to be in control \(\implies\) that’s why we might be interested in Control Inversion.
val invert : iter:(('a -> unit) -> unit) -> 'a Seq.tinvert function type, takes in an iter (producer that pushes elements to consumer) and returns a Seq (consumer has the control)(* --- the classic push style from a producer *)
lst_iter (fun i -> Printf.printf "%d\n" i)
(*
1
2
3
- : unit = ()
*)
(* --- pull style from inverting the control *)
let s = invert ~iter:lst_iter
let next = Seq.to_dispenser s;;
(* val s : int Seq.t = <fun> *)
(* val next : unit -> int option = <fun> *)
next ();;
(* - : int option = Some 1 *)
next ();;
(* - : int option = Some 2 *)
next ();;
(* - : int option = Some 3 *)
next ();;
(* - : int option = None *)Implementing Control Inversion
let invert (type a) ~(iter : (a -> unit) -> unit) : a Seq.t = let module M = struct type _ Effect.t += Yield : a -> unit t end in let yield v = perform (M.Yield v) in fun () -> match iter yield with | () -> Seq.Nil | effect M.Yield v, k -> Seq.Cons (v, continue k)Code Snippet 32:yieldfunction performs theYieldeffect; Lambdafun () ->defers the action until first-demandedOnce the first -element of the sequence is demanded, the computation
iter yieldis executed under an effect-handler. Every timeiterpushes an element toyieldfunction, the computation is interrupted byYieldeffect.Yieldeffect handled by returning the valueSeq.Cons(v, continue k)to the consumer (which gets the elementvand the suspended computation,k).k, from the consumer’s eyes is just the tail of the sequence.When consumer demands
next, thenkis resumed — that’s howiter yieldmakes progress until either yields another element or terminates normally (in which case,Seq.Nilis returned to the consumer). TheSeqhere, that is returned by theinvert, is an ephemeral sequence — must be used at most once and the sequence must be fully consumed (at least once) — captured continuation is used linearly (one-shot).
Semantics #
- Nesting Handlers
- Fibres
- Unhandled Effects
- Linear Continuations
Shallow Handlers #
Quick Example #
A Simple Round-Robin Scheduler #
[ DEPRECATIONS: ] The OCaml manual is the source of truth for APIs in general. In this section, we already have some changes from the writings here and how things are like now:
type _ eff +=as the declaration type for effect \(\rightarrow\)Effect.tcontinue k v\(\rightarrow\)Effect.Deep.continue k v- multishot continuations are not standard and have been deprecated from multicore ocaml (used to be that we could clone a single-shot continuation and mimic multishot, here’s why ), they’re extracted into an ocaml-multicont.
Here’s a basic interface for our scheduler:
(* Control operations on threads *)
val fork : (unit -> unit) -> unit
val yield : unit -> unit
(* Runs the scheduler. *)
val run : (unit -> unit) -> unitPerforming an effectful computation is separate from its interpretation \(\implies\) interpretation is dynamically chosen based on the context in which an effect is performed.
Effectful actions for our Scheduler that we will declare:
- spawning a new thread
- yielding control to another thread
Predefined variant type for effects is the type 'a eff — 'a being the return type of performing that effect.
type _ eff +=
| Fork : (unit -> unit) -> unit eff
| Yield : unit eff
(* --- with convenience syntax *)
effect Fork : (unit -> unit) -> unit
effect Yield : unitFork and Yield effectsEffects are performed. The primitive perform has the type 'a eff -> 'a.
We now define the functions:
let fork f = perform (Fork f)
let yield () = perform Yieldfork and yieldWe need an interpretation of what it means to perform fork and yield, for which we use effect-handlers.
| |
fork and yieldHow this works:
- the scheduler queue is a queue of delimited continuations, and we have queue-manipulation functions
The effect-handling is done like so here:
spawn f evaluates f in a new thread of control, which may have the following cases for how it returns:
return normally with
()\(\implies\) we pop the scheduler queue and resume the resultant continuation usingcontinuecontinue k vis how the continuation (k: ('a, 'b') continuation) is resumed with valuev: 'aand returns a type of value'b
return exceptionally with
e\(\implies\) same as the normal return casereturn effectfully, with the effect performed along with the delimited continuation,
kThe return type of the effect (
'a) matches the argument type of the continuation and the return type of the delimited continuation is'b— specifically, for the patterneffect e k, if the effectehas type'a eff, then the delimited continuationkhas type('a, 'b) continuation.[ fork effect ] effectful return of
Fork f:the continuation function (
k) is what we enqueue and we spawn a new thread of control for evaluatingf(different from the originalf, thisfis from the effectful return)[ yield effect ] effectful return of
Yield:enqueue the current continuation (
k) and resume some other saved continuation from the scheduler queue.
Though the effect pattern in match expressions IS still supported in OCaml 5 as surface syntax, the idiomatic effect handler syntax has changed. In modern multicore OCaml, it’s more like so:
| |
So here’s a concurrent program to test out this scheduler:
| |
output of running this test
$ git clone https://github.com/kayceesrk/ocaml-eff-example
$ cd ocaml-eff-example
$ make
$ ./concurrent
Starting number 0
Forking number 1
Starting number 1
Forking number 3
Starting number 3
Yielding in number 3
Forking number 2
Starting number 2
Forking number 5
Starting number 5
Yielding in number 5
Forking number 4
Starting number 4
Yielding in number 4
Resumed number 3
Finishing number 3
Finishing number 0
Forking number 6
Starting number 6
Yielding in number 6
Resumed number 5
Finishing number 5
Finishing number 1
Resumed number 4
Finishing number 4
Finishing number 2
Resumed number 6
Finishing number 6Implementation #
Algebraic effects were retroffitted, their implementation faced some challenges as well.
Fibres for Concurrency #
Main challenge: efficient management of delimited continuations.
Multicore OCaml uses fibers for this:
- represent the unit of concurrency in the runtime system
- small heap-allocated, dynamically resized stacks
Most continuations only need to be linear (one-shot) i.e. resumed only once (as opposed to multishot Reminder: multishot continuations could be done in multicore OCaml before by just cloning a one-shot continuation , but this got deprecated. ).
Capturing one-shot continuation requires no extra context vs fn-call
It’s fast to capture a one-shot continuation \(\implies\) just need a pointer to the underlying fiber \(\implies\) no need for any allocation. Capturing a one-shot continuation requires no more context than needed for a normal function call because the calling convention for OCaml doesn’t use callee-save registers.
OCaml doesn’t have linear types Linear types are a type-system guarantee about how many times a value can be used – they enforce correctness @ compile time. Today’s OCaml is neither affine (i.e. can be used at most once) nor linear (can be used exactly once). However, one-shot continuations are logically linear. . One-shot continuations are logically linear — they have to be enforced @ runtime by raising an exception the second time a continuation is invoked.
Continuations: delimited vs undelimited
Delimited continuations enable complex nested and hierarchical schedulers to be expressed more naturally because they introduce parent-child relationships between fibers — similar to function invocation.
Table 3: comparing undelimited continuation (runtimes like Manticore) vs delimited continuation (like in OCaml)feature undelimited continuation delimited continuation (OCaml) intuition “Pause the whole universe and resume it anywhere” “Pause this function and resume it inside its caller” scope whole program up to a handler structure flat / global nested / hierarchical control unrestricted scoped analogy like gotoon steroidslike a function call
Running on Multiple Cores #
We’ve covered Domains earlier. Each domain, our unit of parallelism, runs a separate system thread (with a small local heap, single shared major heap).
To distribute fibres amongst the available domains, working sharing / stealing schedulers are initiated on each domain. For inter-domain communication and mutual execution, we can rely on sync mechanisms.
TODO Illustrative Case Studies / Examples #
This section is about internalising the examples that can be found here.