Chapter 16B: Parallel & Concurrent Programming in OCaml 5
OCaml 5 removed the Global Runtime Lock and introduced Domains as the unit of true parallelism. These notes cover Domain-based programming, Domainslib’s work-stealing task pools, the relaxed memory model and DRF-SC guarantee, atomic variables for non-blocking communication, and blocking synchronisation with Mutex and Condition — a practical toolkit for parallel OCaml. We then go through Effects and Threads to complete the Concurrent programming picture.

Parallel Programming (via Domains) #
Domains #
A domain is spawned with a function to run in parallel: Domain.spawn (fun _ -> print_endline "I ran in parallel"). OCaml limits the number of domains that can be active at the same time to 128.
Spawned domains can be joined to retrieve their results — join waits for the target domain to terminate; it is a blocking operation.
Gotcha: join is blocking, but spawn is already parallel #
Be careful not to misread this: spawn already sets the work running in parallel. So even though join blocks, the work on the other domain is not blocked — only the control flow of the joining domain is, until the target domain terminates. That's why the double parallel call takes roughly the same time as a single call.
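A sketch of what the elided fib_twice.ml could look like (the naive fib and the argument default are my assumptions, not the book's exact code):

```ocaml
(* fib_twice.ml (hypothetical reconstruction): spawn two domains, then join *)
let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

let () =
  (* default of 30 is an assumption; the transcript passes 42 on the CLI *)
  let n = try int_of_string Sys.argv.(1) with _ -> 30 in
  (* both computations are already running in parallel after these spawns *)
  let d1 = Domain.spawn (fun () -> fib n) in
  let d2 = Domain.spawn (fun () -> fib n) in
  (* join blocks only this domain's control flow, not the spawned work *)
  Printf.printf "fib(%d) = %d\n" n (Domain.join d1);
  Printf.printf "fib(%d) = %d\n" n (Domain.join d2)
```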
$ ocamlopt -o fib_twice.exe fib_twice.ml
$ ./fib_twice.exe 42
fib(42) = 267914296
fib(42) = 267914296
$ hyperfine './fib_twice.exe 42'
Benchmark 1: ./fib_twice.exe 42
Time (mean ± sd): 1.249 s ± 0.025 s [User: 2.451 s, System: 0.012 s]
Range (min … max): 1.221 s … 1.290 s 10 runs
Domainslib: for nested-parallel programming #
The idea is that the core compiler distribution offers only the low-level primitives for concurrent and parallel programming, so that high-level libraries can be developed and distributed outside the core compiler distribution.
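A hedged sketch of what fib_par2.ml might look like using Domainslib's Task module (the sequential cutoff of 20 and the argument defaults are my assumptions; this needs the domainslib opam package):

```ocaml
module T = Domainslib.Task

let rec fib n = if n < 2 then n else fib (n - 1) + fib (n - 2)

(* below a cutoff, fall back to the serial version: spawning a task
   costs more than computing a small fib directly *)
let rec fib_par pool n =
  if n < 20 then fib n
  else
    let a = T.async pool (fun _ -> fib_par pool (n - 1)) in
    let b = T.async pool (fun _ -> fib_par pool (n - 2)) in
    T.await pool a + T.await pool b

let () =
  let num_domains = try int_of_string Sys.argv.(1) with _ -> 2 in
  let n = try int_of_string Sys.argv.(2) with _ -> 25 in
  (* setup_pool: config — the pool owns num_domains - 1 extra domains *)
  let pool = T.setup_pool ~num_domains:(num_domains - 1) () in
  (* run: kickstart — tasks must be awaited from within run *)
  let res = T.run pool (fun _ -> fib_par pool n) in
  Printf.printf "fib(%d) = %d\n" n res;
  (* teardown: winding the pool down is our responsibility *)
  T.teardown_pool pool
```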
perf analysis
$ ocamlfind ocamlopt -package domainslib -linkpkg -o fib_par2.exe fib_par2.ml
$ ./fib_par2.exe 1 42
fib(42) = 267914296
$ hyperfine './fib.exe 42' './fib_par2.exe 2 42' \
'./fib_par2.exe 4 42' './fib_par2.exe 8 42'
Benchmark 1: ./fib.exe 42
Time (mean ± sd): 1.217 s ± 0.018 s [User: 1.203 s, System: 0.004 s]
Range (min … max): 1.202 s … 1.261 s 10 runs
Benchmark 2: ./fib_par2.exe 2 42
Time (mean ± sd): 628.2 ms ± 2.9 ms [User: 1243.1 ms, System: 4.9 ms]
Range (min … max): 625.7 ms … 634.5 ms 10 runs
Benchmark 3: ./fib_par2.exe 4 42
Time (mean ± sd): 337.6 ms ± 23.4 ms [User: 1321.8 ms, System: 8.4 ms]
Range (min … max): 318.5 ms … 377.6 ms 10 runs
Benchmark 4: ./fib_par2.exe 8 42
Time (mean ± sd): 250.0 ms ± 9.4 ms [User: 1877.1 ms, System: 12.6 ms]
Range (min … max): 242.5 ms … 277.3 ms 11 runs
Summary
'./fib_par2.exe 8 42' ran
1.35 ± 0.11 times faster than './fib_par2.exe 4 42'
2.51 ± 0.10 times faster than './fib_par2.exe 2 42'
4.87 ± 0.20 times faster than './fib.exe 42'
Notes:
- it's a good idea to keep things serial for small input sizes
- the task-pool lifecycle management is our responsibility:
  - setup_pool: configure
  - run: kickstart
  - teardown_pool: finish
Parallel Iteration Constructs #
parallel_for is an ergonomic loop construct:
comparison between serial and parallel using the spectral_norm benchmark
Looking at the loops in eval_A_times_u and eval_At_times_u, we realise that the iterations of the outer loop do not depend on each other and can be executed in parallel — each iteration of the outer loop body reads from u but writes to disjoint memory locations in v.
-let n = Array.length v - 1 in
- for i = 0 to n do
- let vi = ref 0. in
- for j = 0 to n do vi := !vi +. eval_A i j *. u.(j) done;
- v.(i) <- !vi
- done
+let n = Array.length v - 1 in
+ T.parallel_for pool ~start:0 ~finish:n ~body:(fun i ->
+ let vi = ref 0. in
+ for j = 0 to n do vi := !vi +. eval_A i j *. u.(j) done;
+ v.(i) <- !vi
+ )
Memory-model: The Easy Bits #
This section summarises the contents from this section of the OCaml manual.
Modern processors and compilers aggressively optimise code; this is why a program that behaves predictably when serial can exhibit surprising behaviour when run in parallel.
OCaml performs such optimisations too, but specifies a relaxed memory model to keep parallel code predictable: the model describes which values an OCaml program is allowed to witness when reading a memory location.
Recipes to Follow #
So we get recipes to follow that retain the simplicity of sequential processing.
immutable values:
- can be freely shared between multiple domains and accessed in parallel
mutable data structures (e.g. ref cells, arrays, mutable record fields):
- the programmer needs to ensure that data races are avoided — a data race being a concurrent access to a non-atomic memory location, without synchronisation, where at least one of the accesses is a write operation.
So it matters whether the data structure is atomic or non-atomic:
- non-atomic data structures (e.g. ref cells, arrays, mutable record fields): need synchronisation
- atomic data structures: no synchronisation mechanisms needed
Approaches to introduce synchronisation:
- using atomic variables
- using explicit synchronisation primitives: mutexes, semaphores, condition variables
The DRF-SC (Data-Race-Freedom implies Sequential Consistency) guarantee: for data-race-free programs, observed program behaviour can always be explained by some interleaving of the operations from different domains.
In OCaml, DRF-SC guarantee is modular: if a part of a program is data-race free then the OCaml memory model ensures that those parts have sequential consistency despite other parts of the program having data-races.
Memory-model: The Hard Bits #
The rest of this section relates to the “Memory model: The Hard Bits”; the technical paper with the details can be seen here.
Sequential Consistency: simple model to help us reason #
Sequential consistency is a consistency model that helps us easily reason about the behaviour of our programs.
values observed by the program can be explained through some interleaving of the operations from different domains in the program — we just need to know how the interleaving happens
elaboration on cases where sequential consistency model should make reasoning easier
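A sketch of the elided two-domain example, assuming a and b start at 1 as in Snippet 8's main (my reconstruction, not the book's exact code):

```ocaml
let d1 a b =
  let r1 = !a * 2 in
  let r2 = !b in
  let r3 = !a * 2 in
  (r1, r2, r3)

let d2 b = b := 0

let () =
  let a = ref 1 and b = ref 1 in
  let h = Domain.spawn (fun _ -> d1 a b) in
  d2 b;
  let r1, r2, r3 = Domain.join h in
  (* r1 = 2, r2 = 0, r3 = 2 is one possible observation: the write to b
     in d2 lands between d1's read of a and its read of b *)
  Printf.printf "r1 = %d, r2 = %d, r3 = %d\n" r1 r2 r3
```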
here we may have some observed behaviour, e.g. r1 = 2, r2 = 0, r3 = 2 if the write to b in d2 happened before the read of b in d1
We complicate this further: let a and b be the same ref.
We might expect the assert statement to never fail since if r2 is 0 then the write in d2 occurred before the read of b in d1. Since a and b are aliases, the second read of a in d1 should also return 0, right?
Optimisation Mechanisms as a Source of Sequential Consistency Assertion Failure
Why would the assertion fail? \(\implies\) other mechanisms at play
compiler optimisations as a source of reordering
compiler optimisations may affect the program logic (e.g. via common sub-expression elimination (CSE)). Such optimisations are fair, valid and necessary for good performance. They don't change the sequential meaning of the program, but they do have visible effects under a parallel runtime.
CSE breaks sequential reasoning \(\implies\) the observed behaviour may no longer be explainable by an interleaving of operations from the different domains in the source program.
example of how compiler might optimise using CSE
let d1 a b =
  let r1 = !a * 2 in
  let r2 = !b in
  let r3 = r1 in (* CSE: !a * 2 ==> r1 *)
  (r1, r2, r3)

let d2 b = b := 0

let main () =
  let ab = ref 1 in
  let h =
    Domain.spawn (fun _ ->
        let r1, r2, r3 = d1 ab ab in
        assert (not (r1 = 2 && r2 = 0 && r3 = 2)))
  in
  d2 ab;
  Domain.join h
Code Snippet 8: Here, r3's original !a * 2 might be replaced with a reference-read by the compiler
hardware optimisations as a source of reordering
Complex cache hierarchies with multiple levels of cache need cache coherence: ensuring that reads and writes to a single memory location respect sequential consistency.
Problem: weaker guarantees on programs that operate on different memory locations. The changes within a core may not have been published / replicated to the other cores yet.
example of lack of cache coherence
let a = ref 0 and b = ref 0

let d1 () = a := 1; !b
let d2 () = b := 1; !a

let main () =
  let h = Domain.spawn d2 in
  let r1 = d1 () in
  let r2 = Domain.join h in
  assert (not (r1 = 0 && r2 = 0))
Code Snippet 9: it's possible for the assertion here to fail
The assertion may fail if the writes performed at a CPU core are not immediately published to all of the other cores. Since a and b are different memory locations, the reads of a and b may both witness the initial values — that's why there might be an assertion failure.
Why this happens: this behaviour can be explained if a load is allowed to be reordered before a preceding store to a different memory location. This reordering can happen due to the presence of in-core store buffers on modern processors.
Each core effectively has a FIFO buffer of pending writes, to avoid the need to block while a write completes. The writes to a and b may be in the store buffers of cores c1 and c2 running the domains d1 and d2, respectively. The reads of b and a running on cores c1 and c2, respectively, will not see the writes if the writes have not propagated from the buffers to main memory.
Data race freedom implies sequential consistency #
OCaml's relaxed memory model precisely describes which orderings are preserved by an OCaml program — the compiler and hardware are free to optimise the program as long as they respect the ordering guarantees of the memory model.
The memory model describes the conditions under which a program will only exhibit sequentially consistent behaviours — that’s what Data Race Freedom implies Sequential Consistency (DRF-SC) guarantee means.
precursor understanding to help us understand what DRF-SC guarantee means
memory locations
- classified by OCaml into atomic memory locations and non-atomic memory locations
- non-atomic: ref cells, array fields, mutable record fields; immutable objects are also non-atomic locations, with an initialising write but no further updates
- atomic: locations created using the Atomic module.
Happens-Before Relation
| action type | meaning | examples |
| --- | --- | --- |
| inter-domain actions | can be observed by, and influenced by, actions on other domains | R/W of atomic AND non-atomic locations; Spawn and Join of domains; operations on mutexes |
| intra-domain actions | can neither be observed by nor influence the execution of other domains | evaluating an arithmetic expression, calling a function, etc. |

A totally ordered list of the actions executed by the abstract machine is an execution trace — due to non-determinism there may be several possible execution traces for a given program.
The happens-before relation is irreflexive and transitive: for a given execution trace, it captures the causality between actions in an OCaml program.
It is the smallest transitive relation such that x precedes y in the happens-before order if:
- x precedes y in program order — the order in which a domain executes its actions
- x is a write to an atomic location and y is a subsequent read or write of that location in the execution trace. For atomic locations, the functions compare_and_set, fetch_and_add, exchange, incr and decr count as both a read and a write.
- x is Domain.spawn f and y is the first action in the newly spawned domain executing f
- x is the last action in a domain d and y is Domain.join d
- x is an unlock operation on a mutex and y is any subsequent operation on that mutex in the execution trace
Data-race: a tight definition
two actions conflict if, in a trace, they access the same non-atomic location, at least one of them is a write, and neither of them is an initialising write to that location.
there's a data race if there exists some execution trace of the program with 2 conflicting actions and no happens-before relationship between the conflicting accesses.
if a program has no data race \(\implies\) it’s correctly synchronised
DRF-SC
strong guarantee: program without data races will only exhibit sequentially consistent behaviours
with this guarantee, we can use sequential reasoning — executing one inter-domain action after another — to identify whether the program has a data race.
This means we don't need to reason about reorderings in order to determine whether a program has a data race.
Reasoning with DRF-SC #
For the code examples in this section, assume that they are called in parallel by main function.
here's what the main function would look like
It’s just spawning the two domains, having handles to them and then joining them.
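A sketch of such a main, parameterised over the two bodies (the names are mine):

```ocaml
(* run d1 and d2 in parallel: keep handles to both domains, then join *)
let main d1 d2 =
  let h1 = Domain.spawn d1 in
  let h2 = Domain.spawn d2 in
  let r1 = Domain.join h1 in
  let r2 = Domain.join h2 in
  (r1, r2)
```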
simple example of a data-race
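The elided example presumably looks something like this sketch:

```ocaml
(* Has a data race: r is a non-atomic ref, d1 writes while d2 reads,
   and no happens-before edge orders the two accesses *)
let r = ref 0
let d1 () = r := 1
let d2 () = !r
```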
Here r is a non-atomic ref, so the two domains race to access it, and d1's access is a write. Because there's no happens-before relationship between the accesses, this is a data race, and the program may exhibit non-sequentially-consistent behaviours.
This is a clear-cut way of identifying data-races!
not data race: accessing disjoint array indices and fields of a record in parallel
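A sketch of the no-data-race case (the array and field names are my own):

```ocaml
type t = { mutable x : int; mutable y : int }

let arr = [| 0; 0 |]
let rcd = { x = 0; y = 0 }

(* each domain touches a different index and a different field,
   so the accesses never conflict *)
let d1 () = arr.(0) <- 1; rcd.x <- 1
let d2 () = arr.(1) <- 2; rcd.y <- 2
```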
races on atomic locations don’t lead to a data-race
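A sketch of racing accesses on an atomic location, which by definition do not constitute a data race:

```ocaml
(* c is an atomic location: concurrent accesses to it never form
   a data race, even without any other synchronisation *)
let c = Atomic.make 0
let d1 () = Atomic.incr c
let d2 () = Atomic.get c
```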
use-case: message passing
We can use atomic variables to implement non-blocking communication between domains.
example of atomic variables used for inter-domain comms
(* No data race *)
let msg = ref 0
let flag = Atomic.make false

let d1 () =
  msg := 42;           (* --- action a *)
  Atomic.set flag true (* --- action b *)

let d2 () =
  if Atomic.get flag   (* --- action c *)
  then !msg            (* --- action d *)
  else 0
Code Snippet 10: Here, msg is non-atomic but the actions don't lead to a data race
Here, actions a and d write to and read from the same non-atomic location, msg, and are therefore conflicting. If a and d have a happens-before relationship, then we can show that this program does NOT have a data race:
- a precedes b in program order, hence a happens-before b
- c happens-before d
- CASE A: if d2 observes the atomic variable flag to be true, then b precedes c in happens-before order
  - happens-before is a transitive relation, so the conflicting actions a and d are in happens-before order
- CASE B: if d2 observes the flag to be false, then the read of msg is NOT done, so there is no conflicting access in this execution trace \(\implies\) the program does not have a data race
The next example does have a data race:
(* Has data race *)
let msg = ref 0
let flag = Atomic.make false

let d1 () =
  msg := 42;           (* --- action a *)
  Atomic.set flag true (* --- action b *)

let d2 () =
  ignore (Atomic.get flag); (* --- action c *)
  !msg                      (* --- action d *)
Code Snippet 11: version of the message-passing program that has a data race
Now there's a possible trace where a and d are conflicting operations AND there's no happens-before relationship between them, so there's a data race:
Atomic.get flag;     (* c *)
!msg;                (* d *)
msg := 42;           (* a *)
Atomic.set flag true (* b *)
Code Snippet 12: conflicting trace without a happens-before relationship \(\implies\) data race
extension: honing the intuition that made me think that this is a mutex-like behaviour
The atomic flag does play an ordering role analogous to what a mutex does — but calling it “a sort of mutex” conflates two different things that mutexes provide, only one of which the flag gives you.
A mutex provides two distinct guarantees:
- Mutual exclusion — only one thread inside the critical section at a time
- Happens-before edges — the release of a mutex happens-before the subsequent acquire of it by another thread
The atomic flag here provides only the second one. Both domains run fully in parallel — there's no exclusion. What the flag establishes is a conditional synchronisation point: if d2 observes flag = true, then the SC atomic operations guarantee that b (the write of flag) happens-before c (the read of flag), which combined with program order gives us a → b → c → d.
The flag doesn't protect msg the way a mutex would. It creates a one-way, conditional ordering guarantee. The reason it works is specifically because SC atomics establish a total order over all atomic operations globally — any read that observes a particular write is ordered after that write in the happens-before relation.
[ The More Precise Analogy ]
What this pattern actually is in memory-model literature is a release-acquire handshake (even though OCaml's Atomic gives the stronger SC ordering, which subsumes it):
- d1 does its work, then releases by writing the flag
- d2 acquires by reading the flag and only proceeds if it observes the release
This is exactly how std::atomic with memory_order_release / memory_order_acquire works in C++, and it's the canonical message-passing pattern. A mutex generalises this to be symmetric and reusable — but under the hood, a mutex's unlock/lock pair is implemented using exactly this kind of release-acquire on an atomic variable.
So my intuition has the causality slightly backwards: it's not that the flag acts like a mutex — it's that mutexes are built from this kind of atomic-flag pattern.
[ Why The Second Example Breaks It ]
The broken version makes this crisp. d2 ignores the value of the flag:
ignore (Atomic.get flag); (* c *)
!msg                      (* d *)
The SC total order still exists over the atomic operations — b and c are still totally ordered. But now, regardless of which way they order, d always executes. If c happens before b in the SC order, then d runs concurrently with a — no happens-before between them, hence a race.
The conditionality in the first example (if Atomic.get flag then) is load-bearing. It's what makes Case B safe — the conflicting access d simply doesn't happen if the ordering isn't established. The flag by itself doesn't prevent the race; the branch on the flag's value is what eliminates the conflicting access in the unsafe trace.
Local Data Race Freedom (LDRF-SC guarantee) #
Even for programs with data races, the OCaml memory model gives strong guarantees: local data race freedom implies sequential consistency (LDRF-SC). Read the paper for more info.
DRF-SC says that the data race free parts of the program remain sequentially consistent. That is, even if the program has data races, those parts of the program that are disjoint from the parts with data races are amenable to sequential reasoning.
This is because the OCaml memory model is bounded both in space and time.
example
Can the read of c return a value that is NOT 42?
Yes under the C++ and Java memory models. In C++, if the program has a data race, even in unrelated parts, the semantics is undefined.
So if this snippet were FFI-linked to a library that had a data race, then under the C++ memory model the read may return any value \(\implies\) data races on unrelated locations can affect program behaviour \(\implies\) the C++ memory model is NOT bounded in space.
The Java memory model is bounded in space but not bounded in time; data races in the future can affect past behaviour:
here’s an example:
(* Thread 1 *)
C c = new C();
c.x = 42;
a = c.x;
g = c;

(* Thread 2 *)
g.x = 7;
The read of c.x and the write of g in the first thread are done on separate memory locations. Hence, the Java memory model allows them to be reordered. As a result, the write in the second thread may occur before the read of c.x, and hence c.x may return 7.
The OCaml version of the Java example would be:
So there's a data race on both g and c. Within the snippet, consider just the first 3 instructions:
let c = ref 0 in
c := 42;
let a = !c in
...
The only memory location here is c. There is neither the data race in space (the race on g) nor in time (the future race on c), therefore the snippet will have sequentially consistent behaviour, and the value returned by !c will be 42.
The OCaml memory model guarantees that even for programs with data races, memory safety is preserved. While programs with data races may observe non-sequentially consistent behaviours, they will not crash.
An Operational View of the Memory Model #
The purpose of this section in the manual is to give us a mental model intuition of how to reason with all this using an abstract state machine. The examples are good to follow here.
Blocking Synchronisation #
Both Domains and System Threads (systhreads) can be synchronised using Mutex, Condition, Semaphore modules — they are blocking synchronisation primitives.
Writing a Concurrent Stack with Blocking Synchronisation #
Implementation of a concurrent stack: a mutable contents field stores the stack elements, a mutex controls access to contents, and a nonempty Condition variable signals waiting domains.
This is familiar: the critical sections are wrapped with mutex locking and unlocking, and the Condition variable nonempty is signalled while the lock is held in order to wake up any domains waiting on this condition — only if there are waiting domains will one of them be woken up.
For the pop operation, we lock the mutex; if the stack is empty, the calling domain waits on the condition variable nonempty using the wait primitive. wait atomically suspends the execution of the current domain and unlocks the mutex; when the domain is woken up again (i.e. when wait returns), it holds the lock on the mutex. The domain then reads the contents of the stack again — if it is still non-empty, the pop can proceed: it updates contents to the tail of the old list (xs) and returns the head (x).
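Putting the pieces together, a sketch of the stack described above (the field and function names are my guesses at the elided code):

```ocaml
type 'a stack = {
  mutable contents : 'a list;
  mutex : Mutex.t;
  nonempty : Condition.t;
}

let create () =
  { contents = []; mutex = Mutex.create (); nonempty = Condition.create () }

let push s v =
  Mutex.lock s.mutex;
  s.contents <- v :: s.contents;
  (* signalled while the lock is held; wakes one waiter, if any *)
  Condition.signal s.nonempty;
  Mutex.unlock s.mutex

let pop s =
  Mutex.lock s.mutex;
  let rec loop () =
    match s.contents with
    | [] ->
        (* wait atomically releases the mutex and suspends; the mutex is
           held again when wait returns, so we re-check contents *)
        Condition.wait s.nonempty s.mutex;
        loop ()
    | x :: xs ->
        s.contents <- xs;
        x
  in
  let x = loop () in
  Mutex.unlock s.mutex;
  x
```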
Using a mutex here provides sufficient synchronisation between the multiple domains using the stack \(\implies\) there are no data races when multiple domains use the stack in parallel.
Interaction with systhreads #
Systhreads are pinned to the domain in which they are created — only one systhread at a time can run OCaml code on a particular domain. Systhreads belonging to the same domain may still run C library or system code in parallel (old-school style). So, systhreads belonging to different domains may execute in parallel.
When using systhreads, the thread created for executing the computation given to Domain.spawn is also treated as a systhread.
Here we use a shared ref cell protected by a mutex to communicate between systhreads running on two different domains. The systhread identifiers uniquely identify systhreads in the program; the id counters are 0-indexed.
$ ocamlopt -I +threads -I +unix unix.cmxa threads.cmxa -o dom_thr.exe dom_thr.ml
$ ./dom_thr.exe
Thread 1 running on domain 1 saw initial write
Thread 0 running on domain 0 saw the write by thread 1
Thread 2 running on domain 1 saw the write by thread 0
Thread 3 running on domain 0 saw the write by thread 2
Interactions with C bindings #
Prior to multicore OCaml, the FFI / C-binding assumption was that as long as the OCaml runtime lock (the GRL) was not released, it was safe to manipulate global C state (e.g. initialise a function-local static value) — this assumption no longer holds in the face of parallel execution with multiple domains.
In multicore OCaml, C code running on a domain may run in parallel with any C code running in other domains even if neither of them has released the “domain lock”.
Atomics #
Atomics are all about non-blocking synchronisation primitives. They are not a mechanism for suspending and waking up domains; instead, they compile down to the atomic read-modify-write instructions that the hardware provides.
example: Atomic vs non-atomic Counters in Parallel #
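A sketch of what incr.ml might contain — two domains bumping a plain ref and an Atomic the same number of times (the argument default is my assumption):

```ocaml
let () =
  let n = try int_of_string Sys.argv.(1) with _ -> 1_000_000 in
  let plain = ref 0 in
  let atomic = Atomic.make 0 in
  let work () =
    for _ = 1 to n do
      incr plain;          (* data race: increments may be lost *)
      Atomic.incr atomic   (* atomic read-modify-write: none lost *)
    done
  in
  let d = Domain.spawn work in
  work ();
  Domain.join d;
  Printf.printf "Non-atomic ref count: %d\n" !plain;
  Printf.printf "Atomic ref count: %d\n" (Atomic.get atomic)
```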
$ ocamlopt -o incr.exe incr.ml
$ ./incr.exe 1_000_000
Non-atomic ref count: 1187193
Atomic ref count: 2000000
example: Atomic variables used for Message-Passing #
Atomic variables can be used for low-level sync between domains – e.g. message passing between domains:
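A sketch of spin-waiting message passing over an atomic flag (the busy-wait receiver is my addition on top of Snippet 10's pattern):

```ocaml
let msg = ref 0
let flag = Atomic.make false

let sender () =
  msg := 42;             (* plain write, published before the flag set *)
  Atomic.set flag true

let receiver () =
  while not (Atomic.get flag) do
    Domain.cpu_relax ()  (* back off while waiting for the sender *)
  done;
  !msg (* safe: observing flag = true gives a happens-before edge *)

let () =
  let d = Domain.spawn receiver in
  sender ();
  Printf.printf "received %d\n" (Domain.join d)
```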
Lock-free stack #
The Atomic module is for implementing non-blocking, lock-free data structures.
compare_and_set r seen v sets the value of r to v \(\iff\) its current value is physically equal to seen — the comparison and the update occur atomically.
If the comparison succeeded and the update happened \(\implies\) compare_and_set returns true.
If compare_and_set fails, then some other domain is attempting to update the atomic ref at the same time \(\implies\) use Domain.cpu_relax to back off for a short duration, allowing competing domains to make progress before retrying the failed operation.
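As a concrete sketch, a minimal Treiber-style lock-free stack built on compare_and_set (my own reconstruction, not the book's code):

```ocaml
type 'a t = 'a list Atomic.t

let create () : 'a t = Atomic.make []

let rec push s v =
  let seen = Atomic.get s in
  (* the update succeeds only if no other domain changed s since the read *)
  if not (Atomic.compare_and_set s seen (v :: seen)) then begin
    Domain.cpu_relax (); (* back off briefly, then retry *)
    push s v
  end

let rec pop s =
  match Atomic.get s with
  | [] -> None
  | (x :: xs) as seen ->
      (* `as seen` keeps the physically-equal list for the comparison *)
      if Atomic.compare_and_set s seen xs then Some x
      else begin
        Domain.cpu_relax ();
        pop s
      end
```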