Chapter 16B: Parallel & Concurrent Programming in OCaml 5
OCaml 5 removed the Global Runtime Lock and introduced Domains as the unit of true parallelism. These notes cover Domain-based programming, Domainslib’s work-stealing task pools, the relaxed memory model and DRF-SC guarantee, atomic variables for non-blocking communication, and blocking synchronisation with Mutex and Condition — a practical toolkit for parallel OCaml. We then go through Effects and Threads to complete the Concurrent programming picture.

Parallel Programming (via Domains) #
Domains #
A domain is spawned with a function to run in parallel: Domain.spawn (fun _ -> print_endline "I ran in parallel"). OCaml limits the number of domains that can be active at the same time to 128.
Spawned domains can be joined to retrieve their results — join waits for the target domain to terminate; it is a blocking operation.
Gotcha: join is blocking, but spawn is already parallel #
Be careful not to misread this: spawn already sets the work running in parallel. So even though join blocks, the work on the other domain is not blocked — only the control flow of the joining domain is, until the target domain terminates. That's why the double parallel call takes roughly the same time as a single call.
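A sketch of what the elided fib_twice.ml could look like (the naive fib and the argument default are my assumptions, not the book's exact code):

```ocaml
(* fib_twice.ml (hypothetical reconstruction): spawn two domains, then join *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

let () =
  (* default of 30 is an assumption; the transcript passes 42 on the CLI *)
  let n = try int_of_string Sys.argv.(1) with _ -> 30 in
  (* both computations are already running in parallel after these spawns *)
  let d1 = Domain.spawn (fun () -> fib n) in
  let d2 = Domain.spawn (fun () -> fib n) in
  (* join blocks only this domain's control flow, not the spawned work *)
  Printf.printf "fib(%d) = %d\n" n (Domain.join d1);
  Printf.printf "fib(%d) = %d\n" n (Domain.join d2)
```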
$ ocamlopt -o fib_twice.exe fib_twice.ml
$ ./fib_twice.exe 42
fib(42) = 267914296
fib(42) = 267914296
$ hyperfine './fib_twice.exe 42'
Benchmark 1: ./fib_twice.exe 42
Time (mean ± sd): 1.249 s ± 0.025 s [User: 2.451 s, System: 0.012 s]
Range (min … max): 1.221 s … 1.290 s 10 runs
Domainslib: for nested-parallel programming #
The idea is that the core compiler distribution offers only the low-level primitives for concurrent and parallel programming, so that high-level libraries can be developed and distributed outside the core compiler distribution.
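A hedged sketch of what fib_par2.ml might look like using Domainslib's Task module (the sequential cutoff of 20 and the argument defaults are my assumptions; this needs the domainslib opam package):

```ocaml
module T = Domainslib.Task

let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

(* below a cutoff, fall back to the serial version: spawning a task
   costs more than computing a small fib directly *)
let rec fib_par pool n =
  if n < 20 then fib n
  else
    let a = T.async pool (fun _ -> fib_par pool (n - 1)) in
    let b = T.async pool (fun _ -> fib_par pool (n - 2)) in
    T.await pool a + T.await pool b

let () =
  let num_domains = try int_of_string Sys.argv.(1) with _ -> 2 in
  let n = try int_of_string Sys.argv.(2) with _ -> 25 in
  (* setup_pool: config — the pool owns num_domains - 1 extra domains *)
  let pool = T.setup_pool ~num_domains:(num_domains - 1) () in
  (* run: kickstart — tasks must be awaited from within run *)
  let res = T.run pool (fun _ -> fib_par pool n) in
  Printf.printf "fib(%d) = %d\n" n res;
  (* teardown: winding the pool down is our responsibility *)
  T.teardown_pool pool
```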
perf analysis
$ ocamlfind ocamlopt -package domainslib -linkpkg -o fib_par2.exe fib_par2.ml
$ ./fib_par2.exe 1 42
fib(42) = 267914296
$ hyperfine './fib.exe 42' './fib_par2.exe 2 42' \
'./fib_par2.exe 4 42' './fib_par2.exe 8 42'
Benchmark 1: ./fib.exe 42
Time (mean ± sd): 1.217 s ± 0.018 s [User: 1.203 s, System: 0.004 s]
Range (min … max): 1.202 s … 1.261 s 10 runs
Benchmark 2: ./fib_par2.exe 2 42
Time (mean ± sd): 628.2 ms ± 2.9 ms [User: 1243.1 ms, System: 4.9 ms]
Range (min … max): 625.7 ms … 634.5 ms 10 runs
Benchmark 3: ./fib_par2.exe 4 42
Time (mean ± sd): 337.6 ms ± 23.4 ms [User: 1321.8 ms, System: 8.4 ms]
Range (min … max): 318.5 ms … 377.6 ms 10 runs
Benchmark 4: ./fib_par2.exe 8 42
Time (mean ± sd): 250.0 ms ± 9.4 ms [User: 1877.1 ms, System: 12.6 ms]
Range (min … max): 242.5 ms … 277.3 ms 11 runs
Summary
'./fib_par2.exe 8 42' ran
1.35 ± 0.11 times faster than './fib_par2.exe 4 42'
2.51 ± 0.10 times faster than './fib_par2.exe 2 42'
4.87 ± 0.20 times faster than './fib.exe 42'
Notes:
- it's a good idea to keep things serial for small input sizes
- the task-pool lifecycle management is our responsibility:
  - setup_pool: configure
  - run: kickstart
  - teardown_pool: finish
Parallel Iteration Constructs #
parallel_for is an ergonomic loop construct:
comparison between serial and parallel using the spectral_norm benchmark
Looking at the loops in eval_A_times_u and eval_At_times_u, we realise that the iterations of the outer loop do not depend on each other and can be executed in parallel — each iteration of the outer loop body reads from u but writes to disjoint memory locations in v.
-let n = Array.length v - 1 in
- for i = 0 to n do
- let vi = ref 0. in
- for j = 0 to n do vi := !vi +. eval_A i j *. u.(j) done;
- v.(i) <- !vi
- done
+let n = Array.length v - 1 in
+ T.parallel_for pool ~start:0 ~finish:n ~body:(fun i ->
+ let vi = ref 0. in
+ for j = 0 to n do vi := !vi +. eval_A i j *. u.(j) done;
+ v.(i) <- !vi
+ )
Memory-model: The Easy Bits #
This section summarises the contents from this section of the OCaml manual.
Modern processors and compilers aggressively optimise code; this is why a program that behaves predictably when serial can exhibit surprising behaviour when run in parallel.
OCaml performs such optimisations too, but specifies a relaxed memory model to keep parallel code predictable: the model describes which values an OCaml program is allowed to witness when reading a memory location.
Recipes to Follow #
So we get recipes to follow that retain the simplicity of sequential processing.
immutable values:
- can be freely shared between multiple domains and accessed in parallel
mutable data structures (e.g. ref cells, arrays, mutable record fields):
- the programmer needs to ensure that data races are avoided — a data race being a concurrent access to a non-atomic memory location, without synchronisation, where at least one of the accesses is a write operation.
So it matters whether the data structure is atomic or non-atomic:
- non-atomic data structures (e.g. ref cells, arrays, mutable record fields): need synchronisation
- atomic data structures: no synchronisation mechanisms needed
Approaches to introduce synchronisation:
- using atomic variables
- using explicit synchronisation primitives: mutexes, semaphores, condition variables
The DRF-SC (Data-Race-Freedom implies Sequential Consistency) guarantee: for data-race-free programs, observed program behaviour can always be explained by some interleaving of the operations from different domains.
In OCaml, DRF-SC guarantee is modular: if a part of a program is data-race free then the OCaml memory model ensures that those parts have sequential consistency despite other parts of the program having data-races.
Memory-model: The Hard Bits #
The rest of this section relates to the “Memory model: The Hard Bits”; the technical paper with the details can be seen here.
Sequential Consistency: simple model to help us reason #
Sequential consistency is a consistency model that helps us easily reason about the behaviour of our programs.
values observed by the program can be explained through some interleaving of the operations from different domains in the program — we just need to know how the interleaving happens
elaboration on cases where sequential consistency model should make reasoning easier
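A sketch of the elided two-domain example, assuming a and b start at 1 as in Snippet 8's main (my reconstruction, not the book's exact code):

```ocaml
let d1 a b =
  let r1 = !a * 2 in
  let r2 = !b in
  let r3 = !a * 2 in
  (r1, r2, r3)

let d2 b = b := 0

let () =
  let a = ref 1 and b = ref 1 in
  let h = Domain.spawn (fun _ -> d1 a b) in
  d2 b;
  let r1, r2, r3 = Domain.join h in
  (* r1 = 2, r2 = 0, r3 = 2 is one possible observation: the write to b
     in d2 lands between d1's read of a and its read of b *)
  Printf.printf "r1 = %d, r2 = %d, r3 = %d\n" r1 r2 r3
```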
here we may have some observed behaviour, e.g. r1 = 2, r2 = 0, r3 = 2 if the write to b in d2 happened before the read of b in d1
We complicate this further: let a and b be the same ref.
We might expect the assert statement to never fail since if r2 is 0 then the write in d2 occurred before the read of b in d1. Since a and b are aliases, the second read of a in d1 should also return 0, right?
Optimisation Mechanisms as a Source of Sequential Consistency Assertion Failure
Why would the assertion fail? \(\implies\) other mechanisms at play
compiler optimisations as a source of reordering
compiler optimisations may affect the program logic (e.g. via common sub-expression elimination (CSE)). Such optimisations are fair, valid and necessary for good performance. They don't change the sequential meaning of the program, but they do have visible effects under a parallel runtime.
CSE breaks sequential reasoning \(\implies\) the observed behaviour may no longer be explainable by an interleaving of operations from the different domains in the source program.
example of how compiler might optimise using CSE
let d1 a b =
  let r1 = !a * 2 in
  let r2 = !b in
  let r3 = r1 in (* CSE: !a * 2 ==> r1 *)
  (r1, r2, r3)

let d2 b = b := 0

let main () =
  let ab = ref 1 in
  let h =
    Domain.spawn (fun _ ->
        let r1, r2, r3 = d1 ab ab in
        assert (not (r1 = 2 && r2 = 0 && r3 = 2)))
  in
  d2 ab;
  Domain.join h
Code Snippet 8: Here, r3's original !a * 2 might be replaced with a reference-read by the compiler
hardware optimisations as a source of reordering
Complex cache hierarchies with multiple levels of cache need cache coherence: ensuring that reads and writes to a single memory location respect sequential consistency.
Problem: weaker guarantees on programs that operate on different memory locations. The changes within a core may not have been published / replicated to the other cores yet.
example of lack of cache coherence
let a = ref 0 and b = ref 0

let d1 () = a := 1; !b
let d2 () = b := 1; !a

let main () =
  let h = Domain.spawn d2 in
  let r1 = d1 () in
  let r2 = Domain.join h in
  assert (not (r1 = 0 && r2 = 0))
Code Snippet 9: it's possible for the assertion here to fail
The assertion may fail if the writes performed at a CPU core are not immediately published to all of the other cores. Since a and b are different memory locations, the reads of a and b may both witness the initial values — that's why there might be an assertion failure.
Why this happens: this behaviour can be explained if a load is allowed to be reordered before a preceding store to a different memory location. This reordering can happen due to the presence of in-core store buffers on modern processors.
Each core effectively has a FIFO buffer of pending writes, to avoid the need to block while a write completes. The writes to a and b may be in the store buffers of cores c1 and c2 running the domains d1 and d2, respectively. The reads of b and a running on cores c1 and c2, respectively, will not see the writes if the writes have not propagated from the buffers to main memory.
Data race freedom implies sequential consistency #
OCaml's relaxed memory model precisely describes which orderings are preserved by an OCaml program — the compiler and hardware are free to optimise the program as long as they respect the ordering guarantees of the memory model.
The memory model describes the conditions under which a program will only exhibit sequentially consistent behaviours — that’s what Data Race Freedom implies Sequential Consistency (DRF-SC) guarantee means.
precursor understanding to help us understand what DRF-SC guarantee means
memory locations
- classified by OCaml into atomic memory locations and non-atomic memory locations
- non-atomic: ref cells, array fields, mutable record fields; immutable objects are also non-atomic locations, with an initialising write but no further updates
- atomic: locations created using the Atomic module.
Happens-Before Relation
| action type | meaning | examples |
| --- | --- | --- |
| inter-domain actions | can be observed by, and influenced by, actions on other domains | R/W of atomic AND non-atomic locations; Spawn and Join of domains; operations on mutexes |
| intra-domain actions | can neither be observed by nor influence the execution of other domains | evaluating an arithmetic expression, calling a function, etc. |

A totally ordered list of the actions executed by the abstract machine is an execution trace — due to non-determinism there may be several possible execution traces for a given program.
The happens-before relation is irreflexive and transitive: for a given execution trace, it captures the causality between actions in an OCaml program.
It is the smallest transitive relation such that x precedes y in the happens-before order if:
- x precedes y in program order — the order in which a domain executes its actions
- x is a write to an atomic location and y is a subsequent read or write of that location in the execution trace. For atomic locations, the functions compare_and_set, fetch_and_add, exchange, incr and decr count as both a read and a write.
- x is Domain.spawn f and y is the first action in the newly spawned domain executing f
- x is the last action in a domain d and y is Domain.join d
- x is an unlock operation on a mutex and y is any subsequent operation on that mutex in the execution trace
Data-race: a tight definition
two actions conflict if, in a trace, they access the same non-atomic location, at least one of them is a write, and neither of them is an initialising write to that location.
there's a data race if there exists some execution trace of the program with 2 conflicting actions and no happens-before relationship between the conflicting accesses.
if a program has no data race \(\implies\) it’s correctly synchronised
DRF-SC
strong guarantee: program without data races will only exhibit sequentially consistent behaviours
with this guarantee, we can use sequential reasoning — executing one inter-domain action after another — to identify whether the program has a data race.
This means we don't need to reason about reorderings in order to determine whether a program has a data race.
Reasoning with DRF-SC #
For the code examples in this section, assume that they are called in parallel by main function.
here's what the main function would look like
It’s just spawning the two domains, having handles to them and then joining them.
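A sketch of such a main, parameterised over the two bodies (the names are mine):

```ocaml
(* run d1 and d2 in parallel: keep handles to both domains, then join *)
let main d1 d2 =
  let h1 = Domain.spawn d1 in
  let h2 = Domain.spawn d2 in
  let r1 = Domain.join h1 in
  let r2 = Domain.join h2 in
  (r1, r2)
```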
simple example of a data-race
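The elided example presumably looks something like this sketch:

```ocaml
(* Has a data race: r is a non-atomic ref, d1 writes while d2 reads,
   and no happens-before edge orders the two accesses *)
let r = ref 0
let d1 () = r := 1
let d2 () = !r
```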
Here r is a non-atomic ref, so the two domains race to access it, and d1's access is a write. Because there's no happens-before relationship between the accesses, this is a data race, and the program may exhibit non-sequentially-consistent behaviours.
This is a clear-cut way of identifying data-races!
not data race: accessing disjoint array indices and fields of a record in parallel
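A sketch of the no-data-race case (the array and field names are my own):

```ocaml
type t = { mutable x : int; mutable y : int }

let arr = [| 0; 0 |]
let rcd = { x = 0; y = 0 }

(* each domain touches a different index and a different field,
   so the accesses never conflict *)
let d1 () = arr.(0) <- 1; rcd.x <- 1
let d2 () = arr.(1) <- 2; rcd.y <- 2
```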
races on atomic locations don’t lead to a data-race
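A sketch of racing accesses on an atomic location, which by definition do not constitute a data race:

```ocaml
(* c is an atomic location: concurrent accesses to it never form
   a data race, even without any other synchronisation *)
let c = Atomic.make 0
let d1 () = Atomic.incr c
let d2 () = Atomic.get c
```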
use-case: message passing
We can use atomic variables to implement non-blocking communication between domains.
example of atomic variables used for inter-domain comms
(* No data race *)
let msg = ref 0
let flag = Atomic.make false

let d1 () =
  msg := 42;           (* --- action a *)
  Atomic.set flag true (* --- action b *)

let d2 () =
  if Atomic.get flag   (* --- action c *)
  then !msg            (* --- action d *)
  else 0
Code Snippet 10: Here, msg is non-atomic but the actions don't lead to a data race
Here, actions a and d write to and read from the same non-atomic location, msg, and are therefore conflicting. If a and d have a happens-before relationship, then we can show that this program does NOT have a data race:
- a precedes b in program order, hence a happens-before b
- c happens-before d
- CASE A: if d2 observes the atomic variable flag to be true, then b precedes c in happens-before order
  - happens-before is a transitive relation, so the conflicting actions a and d are in happens-before order
- CASE B: if d2 observes the flag to be false, then the read of msg is NOT done, so there is no conflicting access in this execution trace \(\implies\) the program does not have a data race
The next example does have a data race:
(* Has data race *)
let msg = ref 0
let flag = Atomic.make false

let d1 () =
  msg := 42;           (* --- action a *)
  Atomic.set flag true (* --- action b *)

let d2 () =
  ignore (Atomic.get flag); (* --- action c *)
  !msg                      (* --- action d *)
Code Snippet 11: version of the message-passing program that has a data race
Now there's a possible trace where a and d are conflicting operations AND there's no happens-before relationship between them, so there's a data race:
Atomic.get flag;     (* c *)
!msg;                (* d *)
msg := 42;           (* a *)
Atomic.set flag true (* b *)
Code Snippet 12: conflicting trace without a happens-before relationship \(\implies\) data race
extension: honing the intuition that made me think that this is a mutex-like behaviour
The atomic flag does play an ordering role analogous to what a mutex does — but calling it “a sort of mutex” conflates two different things that mutexes provide, only one of which the flag gives you.
A mutex provides two distinct guarantees:
- Mutual exclusion — only one thread inside the critical section at a time
- Happens-before edges — the release of a mutex happens-before the subsequent acquire of it by another thread
The atomic flag here provides only the second one. Both domains run fully in parallel — there's no exclusion. What the flag establishes is a conditional synchronisation point: if d2 observes flag = true, then the SC atomic operations guarantee that b (the write of flag) happens-before c (the read of flag), which combined with program order gives us a → b → c → d.
The flag doesn't protect msg the way a mutex would. It creates a one-way, conditional ordering guarantee. The reason it works is specifically because SC atomics establish a total order over all atomic operations globally — any read that observes a particular write is ordered after that write in the happens-before relation.
[ The More Precise Analogy ]
What this pattern actually is in memory-model literature is a release-acquire handshake (even though OCaml's Atomic gives the stronger SC ordering, which subsumes it):
- d1 does its work, then releases by writing the flag
- d2 acquires by reading the flag and only proceeds if it observes the release
This is exactly how std::atomic with memory_order_release / memory_order_acquire works in C++, and it's the canonical message-passing pattern. A mutex generalises this to be symmetric and reusable — but under the hood, a mutex's unlock/lock pair is implemented using exactly this kind of release-acquire on an atomic variable.
So my intuition has the causality slightly backwards: it's not that the flag acts like a mutex — it's that mutexes are built from this kind of atomic-flag pattern.
[ Why The Second Example Breaks It ]
The broken version makes this crisp. d2 ignores the value of the flag:
ignore (Atomic.get flag); (* c *)
!msg                      (* d *)
The SC total order still exists over the atomic operations — b and c are still totally ordered. But now, regardless of which way they order, d always executes. If c happens before b in the SC order, then d runs concurrently with a — no happens-before between them, hence a race.
The conditionality in the first example (if Atomic.get flag then) is load-bearing. It's what makes Case B safe — the conflicting access d simply doesn't happen if the ordering isn't established. The flag by itself doesn't prevent the race; the branch on the flag's value is what eliminates the conflicting access in the unsafe trace.
Local Data Race Freedom (LDRF-SC guarantee) #
Even for programs with data races, the OCaml memory model gives strong guarantees: local data race freedom implies sequential consistency (LDRF-SC). Read the paper for more info.
DRF-SC says that the data race free parts of the program remain sequentially consistent. That is, even if the program has data races, those parts of the program that are disjoint from the parts with data races are amenable to sequential reasoning.
This is because the OCaml memory model is bounded both in space and time.
example
Can the read of c return a value that is NOT 42?
Yes under the C++ and Java memory models. In C++, if the program has a data race, even in unrelated parts, the semantics is undefined.
So if this snippet were FFI-linked to a library that had a data race, then under the C++ memory model the read may return any value \(\implies\) data races on unrelated locations can affect program behaviour \(\implies\) the C++ memory model is NOT bounded in space.
The Java memory model is bounded in space but not bounded in time; data races in the future can affect past behaviour:
here’s an example:
(* Thread 1 *)
C c = new C();
c.x = 42;
a = c.x;
g = c;

(* Thread 2 *)
g.x = 7;
The read of c.x and the write of g in the first thread are done on separate memory locations. Hence, the Java memory model allows them to be reordered. As a result, the write in the second thread may occur before the read of c.x, and hence c.x may return 7.
The OCaml version of the Java example would be:
So there's a data race on both g and c. Within the snippet, consider just the first 3 instructions:
let c = ref 0 in
c := 42;
let a = !c in
...
The only memory location here is c. There is neither the data race in space (the race on g) nor in time (the future race on c), therefore the snippet will have sequentially consistent behaviour, and the value returned by !c will be 42.
The OCaml memory model guarantees that even for programs with data races, memory safety is preserved. While programs with data races may observe non-sequentially consistent behaviours, they will not crash.
An Operational View of the Memory Model #
The purpose of this section in the manual is to give us a mental model intuition of how to reason with all this using an abstract state machine. The examples are good to follow here.
Blocking Synchronisation #
Both Domains and System Threads (systhreads) can be synchronised using Mutex, Condition, Semaphore modules — they are blocking synchronisation primitives.
Writing a Concurrent Stack with Blocking Synchronisation #
Implementation of a concurrent stack: a mutable contents field stores the stack elements, a mutex controls access to contents, and a nonempty Condition variable signals waiting domains.
This is familiar: the critical sections are wrapped with mutex locking and unlocking, and the Condition variable nonempty is signalled while the lock is held in order to wake up any domains waiting on this condition — only if there are waiting domains will one of them be woken up.
For the pop operation, we lock the mutex; if the stack is empty, the calling domain waits on the condition variable nonempty using the wait primitive. wait atomically suspends the execution of the current domain and unlocks the mutex; when the domain is woken up again (i.e. when wait returns), it holds the lock on the mutex. The domain then reads the contents of the stack again — if it is still non-empty, the pop can proceed: it updates contents to the tail of the old list (xs) and returns the head (x).
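Putting the pieces together, a sketch of the stack described above (the field and function names are my guesses at the elided code):

```ocaml
type 'a stack = {
  mutable contents : 'a list;
  mutex : Mutex.t;
  nonempty : Condition.t;
}

let create () =
  { contents = []; mutex = Mutex.create (); nonempty = Condition.create () }

let push s v =
  Mutex.lock s.mutex;
  s.contents <- v :: s.contents;
  (* signalled while the lock is held; wakes one waiter, if any *)
  Condition.signal s.nonempty;
  Mutex.unlock s.mutex

let pop s =
  Mutex.lock s.mutex;
  let rec loop () =
    match s.contents with
    | [] ->
        (* wait atomically releases the mutex and suspends; the mutex is
           held again when wait returns, so we re-check contents *)
        Condition.wait s.nonempty s.mutex;
        loop ()
    | x :: xs ->
        s.contents <- xs;
        x
  in
  let x = loop () in
  Mutex.unlock s.mutex;
  x
```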
Using a mutex here provides sufficient synchronisation between the multiple domains using the stack \(\implies\) there are no data races when multiple domains use the stack in parallel.
Interaction with systhreads #
Systhreads are pinned to the domain in which they are created — only one systhread at a time can run OCaml code on a particular domain. Systhreads belonging to the same domain may still run C library or system code in parallel (old-school style). So, systhreads belonging to different domains may execute in parallel.
When using systhreads, the thread created for executing the computation given to Domain.spawn is also treated as a systhread.
Here we use a shared ref cell protected by a mutex to communicate between systhreads running on two different domains. The systhread identifiers uniquely identify systhreads in the program; the id counters are 0-indexed.
$ ocamlopt -I +threads -I +unix unix.cmxa threads.cmxa -o dom_thr.exe dom_thr.ml
$ ./dom_thr.exe
Thread 1 running on domain 1 saw initial write
Thread 0 running on domain 0 saw the write by thread 1
Thread 2 running on domain 1 saw the write by thread 0
Thread 3 running on domain 0 saw the write by thread 2
Interactions with C bindings #
Prior to multicore OCaml, the FFI / C-binding assumption was that as long as the OCaml runtime lock (the GRL) was not released, it was safe to manipulate global C state (e.g. initialise a function-local static value) — this assumption no longer holds in the face of parallel execution with multiple domains.
In multicore OCaml, C code running on a domain may run in parallel with any C code running in other domains even if neither of them has released the “domain lock”.
Atomics #
Atomics are all about non-blocking synchronisation primitives. They are not a mechanism for suspending and waking up domains; instead, they compile down to the atomic read-modify-write instructions that the hardware provides.
example: Atomic vs non-atomic Counters in Parallel #
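A sketch of what incr.ml might contain — two domains bumping a plain ref and an Atomic the same number of times (the argument default is my assumption):

```ocaml
let () =
  let n = try int_of_string Sys.argv.(1) with _ -> 1_000_000 in
  let plain = ref 0 in
  let atomic = Atomic.make 0 in
  let work () =
    for _ = 1 to n do
      incr plain;          (* data race: increments may be lost *)
      Atomic.incr atomic   (* atomic read-modify-write: none lost *)
    done
  in
  let d = Domain.spawn work in
  work ();
  Domain.join d;
  Printf.printf "Non-atomic ref count: %d\n" !plain;
  Printf.printf "Atomic ref count: %d\n" (Atomic.get atomic)
```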
$ ocamlopt -o incr.exe incr.ml
$ ./incr.exe 1_000_000
Non-atomic ref count: 1187193
Atomic ref count: 2000000
example: Atomic variables used for Message-Passing #
Atomic variables can be used for low-level sync between domains – e.g. message passing between domains:
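A sketch of spin-waiting message passing over an atomic flag (the busy-wait receiver is my addition on top of Snippet 10's pattern):

```ocaml
let msg = ref 0
let flag = Atomic.make false

let sender () =
  msg := 42;             (* plain write, published before the flag set *)
  Atomic.set flag true

let receiver () =
  while not (Atomic.get flag) do
    Domain.cpu_relax ()  (* back off while waiting for the sender *)
  done;
  !msg (* safe: observing flag = true gives a happens-before edge *)

let () =
  let d = Domain.spawn receiver in
  sender ();
  Printf.printf "received %d\n" (Domain.join d)
```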
Lock-free stack #
The Atomic module is for implementing non-blocking, lock-free data structures.
compare_and_set r seen v sets the value of r to v \(\iff\) its current value is physically equal to seen — the comparison and the update occur atomically.
If the comparison succeeded and the update happened \(\implies\) compare_and_set returns true.
If compare_and_set fails, then some other domain is attempting to update the atomic ref at the same time \(\implies\) use Domain.cpu_relax to back off for a short duration, allowing competing domains to make progress before retrying the failed operation.
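As a concrete sketch, a minimal Treiber-style lock-free stack built on compare_and_set (my own reconstruction, not the book's code):

```ocaml
type 'a t = 'a list Atomic.t

let create () : 'a t = Atomic.make []

let rec push s v =
  let seen = Atomic.get s in
  (* the update succeeds only if no other domain changed s since the read *)
  if not (Atomic.compare_and_set s seen (v :: seen)) then begin
    Domain.cpu_relax (); (* back off briefly, then retry *)
    push s v
  end

let rec pop s =
  match Atomic.get s with
  | [] -> None
  | (x :: xs) as seen ->
      (* `as seen` keeps the physically-equal list for the comparison *)
      if Atomic.compare_and_set s seen xs then Some x
      else begin
        Domain.cpu_relax ();
        pop s
      end
```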