
Chapter 16C: Retrofitting Parallelism onto OCaml


The design of OCaml 5’s multicore garbage collector required retrofitting parallelism onto a 30-year-old single-core runtime without breaking backwards compatibility. These notes follow the original ICFP 2020 paper closely — covering the Streamflow-based major heap allocator, the two approaches to parallelising the minor collector, and the engineering decisions behind supporting weak references, ephemerons, lazy values, and fibers in the presence of concurrent GC threads.

Figure 1: They retrofitted things with a snap-together fit

Retrofitting Parallelism onto OCaml #

We focus on the general design decisions behind the thread-safe GC implementation – as per the original paper on it (Retrofitting Parallelism onto OCaml). Most of the notes in this section are from that paper.

Watching this talk by KC Sivaramakrishnan (IIT Madras, Tarides) is a good overview too.

Priorities and Requirements #

  1. Retrofitting needs to be backwards compatible because OCaml already has large-scale adoption in production systems.

    The paper describes techniques that balance performance against feature backwards compatibility for sequential programs

    • compatibility of the language semantics, performance profile, memory usage and C bindings
  2. Main challenge: efficient design for the Garbage Collector (GC) – multicore capable GC.

  3. Requirements:

    1. a well-behaved serial program doesn’t break under parallel execution – its semantics and typing remain the same on both the serial and parallel runtimes

    2. the performance profile of a serial program on the parallel runtime should not be that different from the serial runtime. In particular, the GC pause times should remain more or less the same.

    3. parallel programs should minimise pause times and then aim to run as fast as possible on the available cores. This goal is harder than achieving good throughput. Optimising for the pause times and then optimising for the throughput is easier than the other way round.

Background on Existing Allocation & Garbage Collection #

See notes @ Chapter 24: Understanding the Garbage Collector. Those notes cover the foundations on which the pre-OCaml 5 GC was built, and they remain relevant for this section.

These are characteristics of the existing GC that are relevant to keep in mind.

  1. Because it’s a functional PL, OCaml usually has a high rate of allocation, with most objects being small and short-lived \(\implies\) memory allocation is done on a minor heap, with a generational GC \(\implies\) allocations @ minor heap are very fast because it’s just a pointer bump.

  2. Mutable vs Immutable:

    Immutable objects only have initialising writes – need no write barriers. See notes on Mutable Write Barrier.

    Mutable objects need a deletion (snapshot-at-the-beginning, a.k.a. Yuasa) barrier for writes, and none for reads – so reads are faster than writes. Keeping reads barrier-free is important for making things fast, and low pause times matter too.

  3. Promotion from minor to major heap

    For objects that survive a minor collection. The major heap GC is an incremental, non-moving, mark-and-sweep collector with an optional compaction phase \(\implies\) it is designed for minimal pause times, which is why latency-sensitive applications (network services, UIs) work well

  4. Possible approach that wasn’t used for the multicore GC implementation: serial and parallel code could have two different compilers and runtimes (similar to GHC Haskell – where a compiler flag determines which is used)

The OCaml folks prefer a unified runtime to avoid maintaining multiple code paths, so they didn’t follow that route – a deliberate difference from GHC Haskell.

What Feature Backwards Compatibility Means #

  • on Type-Safety

    Type safety shouldn’t break under data races, so the memory model for shared-memory parallel programs needs strong guarantees even in the presence of data races.

  • on Features Closely-tied with the GC

    Features that closely interact with the GC must not break (reading up on these separately helps in understanding how to work with the GC better):

    1. weak references: Weak references let you point to a value without keeping it alive. If the only remaining reference to an object is weak, the GC is free to reclaim it, and the weak reference then turns empty or invalid.

    2. finalisers: Finalisers are callbacks the GC runs after a value becomes unreachable. They are useful for releasing non-memory resources or doing cleanup, but they are not deterministic and should NOT be used for critical logic – remember that C finalizers and OCaml finalizers are similar things but different constructs

    3. ephemerons – weak hash tables that are helpful for caching / memoization or for “adding” a field to an arbitrary boxed OCaml value without having any memory leaks. Ephemerons are like weak hash-table entries with keys and optional data. They are especially useful for caches and memoization because they let you associate auxiliary data with a value without accidentally keeping the value alive forever; OCaml’s weak pointers are implemented as ephemerons without data

    4. lazy values – Lazy values defer computation until the value is forced. From the GC’s perspective, a lazy value is just a heap object that may hold a thunk or cached result; once no one can reach it, the GC can collect it like any other object.

  • Striking a balance with C APIs

    striking a balance between the complexity of having C APIs and the performance impact of what can be done.

    @ C API: OCaml’s C API exposes quite a lot of the internals of the memory representation, and has an expressive API to access the heap efficiently \(\implies\) API bakes in the invariants of the GC:

    1. reading any OCaml object (mutable / immutable) using the C API doesn’t add a read-barrier – it’s compiled as a plain memory-read

    2. compiler doesn’t check for incorrect uses of the C API \(\implies\) it’s difficult to write correct and efficient FFI code.
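To make the first two features above concrete, here is a minimal sketch using the stdlib `Weak` and `Gc` modules – the module APIs are real; the scenario itself is illustrative:

```ocaml
(* Weak reference: the weak array slot points to a value without
   keeping it alive on its own. *)
let () =
  let v = ref 42 in
  let w = Weak.create 1 in
  Weak.set w 0 (Some v);
  assert (Weak.check w 0);   (* true: v is still strongly reachable here *)
  ignore v;

  (* Finaliser: a callback the GC runs some time after the value
     becomes unreachable -- the timing is not deterministic. *)
  let ran = ref false in
  Gc.finalise (fun _ -> ran := true) (Bytes.create 16);
  Gc.full_major ();
  ignore !ran   (* likely true by now, but not guaranteed *)
```

Because finalisation timing is not guaranteed, the finaliser here only flips a flag; no critical logic depends on it.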

GC Design Inspirations #

  1. old generation (major heap), collection \(\Rightarrow\) VCGC (Very Concurrent Mark and Sweep Garbage Collection)
    • avoids having explicit phase transitions between marking and sweeping, which is the traditional source of bugs in concurrent collector designs.

    • this is the design used for collecting the old generation (the major heap). The major heap is shared between all mutators and its collection follows the VCGC design.

    • implication: if programs don’t use weak refs or ephemerons – the mutator and the GC threads only need to sync once per GC cycle (to agree that the current cycle is done) \(\implies\) minimises pause times

  2. old generation (major heap), allocation \(\Rightarrow\) Streamflow
    • size-segmented thread-local pages that favour locality — promising for multicore behaviour
  3. young generation (minor heap), collection \(\Rightarrow\) evaluating alternatives to consider given that young generation survivors are copied to the shared major heap:
    1. ❌ option A: concurrent collector with thread-private minor heaps (see elaboration here)

      1. design invariant from a quasi-real-time collector design: “no pointers from major to minor heaps” \(\implies\) if we store a pointer to a private object into the shared major heap then the private object and ALL objects reachable from it get promoted to the shared heap (old generation, major heap) en masse \(\implies\) eager promotion — loads of false positives on what should be promoted (just because an object is pointed to by a shared object, it doesn’t mean another thread will attempt to access it)

      2. so the actual concurrent minor collector design for Multicore OCaml is similar to this, but lazier – it follows GHC’s local heaps, where objects are promoted to the shared heap only when another thread actually tries to access them \(\implies\) a slower sharing operation, because now we need to sync two different threads – but it’s performed less often (laziness).

        Implication: requires that reads be safe points where garbage collection can occur.

        Problem: stock OCaml doesn’t use read barriers and C API works under that assumption

        So, choosing this approach would violate the backward compatibility requirement

    2. ✅ B: stop-the-world parallel minor collector (see elaboration here)

      • All mutators need to sync to perform the minor collection \(\implies\) bigger cost of possible worse pause times

      • benefits: memory access invariants are the same so no need to change the C API or its usage.

      • surprisingly, the stop-the-world minor collector outperforms the concurrent minor collector in almost ALL circumstances — even @ higher cores

Overview of the Memory Model in OCaml (Memory Management, Garbage Collection) #

My own notes on this from the textbook can be found here, GC notes here. This section only contains clarifying statements that improve on those notes.

  1. Uniform memory representation: every value has the same size (1 word long), LSB used to disambiguate pointers from integers

  2. Every allocation is word-aligned, that’s why LSB of pointers will always be 0 and integers are single-left-shift encoded (end up being 63-bits long)

  3. Every OCaml object has word-sized header: encodes length and type of object and 2 bits for colouring by the GC

  4. Generational GC with major and minor heaps — small, fixed-size minor heap, large major heap

    • collection @ minor heap:

      when: the minor heap is full

      a copying collector minimises pause times because the survival rate in the minor collection is low (it’s a functional language)

      mechanism:

      1. promote all the roots that point to the minor heap over to the major heap:

        these are the globals, local roots registered in C calls, registers, the remembered set of inter-generational pointers from major to minor heap, and the program stack

      2. transitively promote all the referenced objects. All refs from live objects to objects in the minor heap are updated to point to the new location (this includes refs from major heap to minor heap that are recorded in the remembered set)

      3. old minor heap is then reused in the next cycle

  5. The major heap allocator is more sophisticated and differs between stock vs Multicore OCaml
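The representation claims in points 1–3 can be poked at from within the language using the unsafe `Obj` module – illustrative only, and assuming a 64-bit system:

```ocaml
(* Probing the uniform value representation with the (unsafe) Obj
   module. Do not use Obj in production code. *)
let () =
  assert (Obj.is_int (Obj.repr 42));          (* immediates: LSB = 1 *)
  assert (Obj.is_block (Obj.repr "hello"));   (* heap values: word-aligned pointer, LSB = 0 *)
  assert (Obj.size (Obj.repr (1, 2)) = 2);    (* the header records the length *)
  assert (max_int = (1 lsl 62) - 1)           (* ints are 63-bit on 64-bit systems *)
```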

Major Heap: changes from stock vs multicore #

Table 1: characteristics retained from stock GC

  • Mark-and-sweep: two-phased collection, where marking determines which blocks are live and sweeping reclaims dead blocks so their memory is available for reuse
  • Non-moving: on the major heap, objects are not moved after they’re first allocated, until they’re collected
  • Incremental: the stop-the-world work is done on heap slices instead of the whole heap; this lets the collector interleave with the user program

Some changes are necessary for supporting parallelism

A point of clarification on the paper’s use of the term “virtual machine” in this area — When the paper says “the virtual machine contains a number of domains,” it’s saying: the OCaml runtime model is extended to include multiple domains. It’s describing the conceptual machine that the paper is modifying — not asserting that OCaml always interprets bytecode.

So the multicore version retains these and is also parallel within Domains: the VM contains a number of domains — which are separate system threads in the same address space.
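A minimal sketch of spawning a domain (OCaml ≥ 5.0); each spawned domain runs on its own system thread in the shared address space:

```ocaml
(* Domains are the unit of parallelism: spawn creates a new system
   thread sharing the address space; join waits for its result. *)
let () =
  let d = Domain.spawn (fun () -> 21 * 2) in
  assert (Domain.join d = 42)
```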

Central challenge: how to control the access to the large amount of shared, mutable state (because the GC needs to track how each byte of memory is being used) WITHOUT having too much use of expensive synchronisation?

Shared State Usage in Stock OCaml GC #

Mutable shared state here refers to the state tracked for colours. Collection is an incremental process, which is why this state has to persist across invocations of the collector.

Managing access to this shared state is easy for single-threaded collectors; it’s harder to do correctly and efficiently for a parallel collector.

Table 2: Where mutable, shared state is needed, between phases of…

  • marking and sweeping: the object colours, whose meaning changes between phases
  • mutation and marking: the write barrier supplies new gray objects to the collector
  • allocation and sweeping: allocation must determine the phase of the collector and the position of sweeping, as well as coordinate with the sweeper to manage free memory

3.1: Tricolor Collection in Stock OCaml GC #

Major-heap objects are coloured black, gray, or white (plus a blue colour for free blocks) via the 2-bit colour tag in the object header. The following is one way to design it; other design approaches exist too.

This part is better elaborated in my own notes here.

Some observations to take note of:

  1. between the mark phase and the sweep phase, user program execution may be interleaved — during this, the user program may mutate the heap and allocate objects too – that’s alright because the overall process is incremental

  2. when marking, the colouring is essential for a phased tree-traversal (the reachability tree). The current frontier is coloured grey.

    frontier elements will be within the mark-stack.

  3. the root-marking is also 2-phased:

    phase 1: non-incremental, scans the registers and the program stack

    phase 2: incremental, scans the global roots

  4. after marking, when sweeping: objects still in use (black) are reset to white for the next cycle; garbage is coloured blue and added to the free list.

    so the colour transitions at sweep time go black \(\Rightarrow\) white and white \(\Rightarrow\) blue

  5. Snapshot-at-beginning property: comes from the invariant that every object reachable at the start of marking eventually gets marked.

    in the context of mutations, e.g. mutating a field of an object @ the mark-phase of the GC: the write barrier loads the object pointed to by the old value of the field — if it’s white (i.e. collectable), the barrier greys it and adds it to the mark-stack

  6. Guaranteeing termination of the GC:

    The total amount of work to do during a GC cycle is bounded no matter what the mutator (program) does. This is because the old value of a field during mutation is grayed (this is the deletion barrier) and the roots are marked before anything else is marked (a black mutator).

    disadvantage: any new allocations on the major heap during the marking phase of one cycle can only be collected at the next cycle.
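A toy OCaml sketch of the deletion (Yuasa) barrier described in points 5–6 – the types and names are illustrative, not the runtime’s actual C implementation:

```ocaml
(* Deletion barrier sketch: before a mutation overwrites a field, the
   old value is greyed (scheduled for marking) if it is still white. *)
type colour = White | Grey | Black

type node = { mutable colour : colour; mutable field : node option }

let mark_stack : node list ref = ref []

let grey o =
  if o.colour = White then begin
    o.colour <- Grey;
    mark_stack := o :: !mark_stack   (* now on the marking frontier *)
  end

(* The write barrier runs before the actual store. *)
let set_field parent v =
  (match parent.field with
   | Some old when old.colour = White -> grey old
   | _ -> ());
  parent.field <- v

let () =
  let a = { colour = White; field = None } in
  let p = { colour = Black; field = Some a } in
  set_field p None;                  (* overwrites the pointer to [a] *)
  assert (a.colour = Grey);          (* the old value was greyed... *)
  assert (List.memq a !mark_stack)   (* ...so it will still be marked *)
```

This is what makes the cycle’s work bounded: anything reachable at the start of marking gets marked, even if the mutator deletes pointers to it mid-cycle.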

3.2: Multicore OCaml’s Collector #

In Multicore OCaml, they try to avoid as much of the shared state as possible (to avoid any need for synchronisation).

  • Reducing shared state @ mutators <> GC

    Each domain maintains its own stack of gray objects; when the write barrier needs to gray an object, it marks it and adds it to the local domain’s gray stack. There’s no need for any synchronisation because GC and OCaml code are interleaved on a single domain.

    it’s possible for a single object to be on the mark stacks of 2 separate domains, though.

  • Harder to avoid between marking and sweeping

    the approach changes to having 4 states: Marked, Unmarked, Garbage, Free

    The sets of objects affected by marking and sweeping are disjoint \(\implies\) no need for any synchronisation between them

    • Marking: Unmarked \(\rightarrow\) Marked
    • Sweeping: Garbage \(\rightarrow\) Free
    • new allocations: always Marked — so they won’t be immediately collected — allocation also doesn’t need any synchronisation
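The four-state protocol above can be sketched as a toy OCaml model (illustrative, not the runtime’s C code), showing that marking is idempotent, sweeping only touches garbage, and fresh allocations start out Marked:

```ocaml
(* Four-state object status: marking and sweeping act on disjoint
   states, so they need no synchronisation with each other. *)
type status = Marked | Unmarked | Garbage | Free

let mark = function Unmarked -> Marked | s -> s   (* marking: Unmarked -> Marked *)
let sweep = function Garbage -> Free | s -> s     (* sweeping: Garbage -> Free *)
let allocate () = Marked                          (* new objects survive this cycle *)

let () =
  assert (mark (mark Unmarked) = Marked);   (* marking twice is harmless *)
  assert (sweep Garbage = Free);
  assert (sweep Marked = Marked);           (* sweeping never touches live objects *)
  assert (allocate () = Marked)
```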

An additional consideration unique to parallel collector: state shared between multiple domains doing the same phase of the collector in parallel

  • marking should be idempotent

    that’s why it’s alright if an object is marked twice, that’s much cheaper than needing to synchronise each access to an object by the collector

  • sweeping should be disjoint; it’s not idempotent but that’s fine as long as it’s disjoint

    each domain sweeps only the memory that it allocated, so sweeping and allocation become local to a domain.

Omitted: there’s more elaboration on what the pseudocode looks like, how termination detection works, heap cycling, and reasoning on how certain invariants are adhered to; I’ve left it out of these notes — ref the original paper.

3.3 Allocator #

Performance Goal for Major Heap Allocator: cost of allocation should be roughly proportional to the cost of initialisation — because all values allocated in OCaml are immediately initialised.

For single-core, refer to the allocation strategies we cover in the Garbage Collector notes here.

Problem: these are all single-threaded; since allocation rates in OCaml are high, protecting the allocator with a lock won’t cut it performance-wise.

Solution: use a different heap allocator that is based on Streamflow design:

  • domain-local, size-segmented list of pages for small allocations ( \(<\) 128 words)

    each page of 4K words is carved into equal-sized slots, with size classes chosen so that less than 10% of space is wasted

  • large allocations (\(\ge\) 128 words) forwarded to malloc, maintained in a domain-local list of large blocks and returned via free when swept

    before a domain terminates, all of the pages it owns are moved to a global size-segmented list of pages — these orphaned pages must then be serviced by the other domains, because they might contain live objects referenced by other domains.

    accessing the global list of pages is mutex-protected.

Serving an allocation request to the major heap:

  1. first, the domain searches for a suitable slot in the corresponding size class in its local list of pages
  2. else, the domain sweeps the local unswept pages of that size class to find a free slot
  3. else, the domain adopts one of the full global pages of the appropriate size class, sweeps it and looks for a free slot
  4. else, the domain requests the OS to allocate an additional page

So, in most cases, the allocations can be serviced without the need for synchronisation and atomic operations.
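A hedged sketch of this four-step search – the functions passed in stand for the runtime’s internal operations, and all of the names here are hypothetical:

```ocaml
(* Allocation fast path: try each source in order; only step 3
   (adopting an orphaned page) touches the mutex-protected global list. *)
let allocate ~find_local ~sweep_local ~adopt_global ~alloc_from_os =
  match find_local () with
  | Some slot -> slot                 (* 1. free slot in the local pages *)
  | None ->
    match sweep_local () with
    | Some slot -> slot               (* 2. sweep local unswept pages *)
    | None ->
      match adopt_global () with
      | Some slot -> slot             (* 3. adopt an orphaned global page *)
      | None -> alloc_from_os ()      (* 4. ask the OS for a fresh page *)

let () =
  let slot =
    allocate
      ~find_local:(fun () -> None)
      ~sweep_local:(fun () -> None)
      ~adopt_global:(fun () -> Some "orphaned-page-slot")
      ~alloc_from_os:(fun () -> "fresh-os-page-slot")
  in
  assert (slot = "orphaned-page-slot")
```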

3.4 Safe Points #

Safe points are the same as in stock OCaml (needed for correct language semantics); they are also needed to support the stop-the-world pauses required by the major collector cycle changes and the parallel minor collector.

Safe points in Multicore OCaml are used in addition to the existing allocation points.

3.5 Opportunistic Work #

Because it’s incremental and concurrent, the work of the major collector can be done opportunistically:

  1. when yielding: non-blocking algos mostly involve some form of looping when attempting to progress in the face of contention, so they relax a bit (e.g. PAUSE on x86) \(\implies\) an opportunity to do a small amount of opportunistic marking or sweeping.

  2. when stop-the-world entry fails: in parallel minor collector, domains carry out opportunistic work until all domains are ready

More places to add opportunistic work exist; kept in view for future implementations are things like read faults or stop-the-world leave waits.

Minor Heap: changes #

Related dive into the fast minor heap can be seen in Chapter 24::The Fast Minor Heap.

The minor heap has a much higher rate of allocation and collection, and is much smaller than the major heap:

  1. copying: live objects are copied from the minor to the major heap, emptying the minor heap for use in the next cycle
  2. non-incremental: the mutator pauses for the whole minor collection

4.1 Copying Minor Collection in Stock OCaml #

Unlike the major collector:

  • Minor collector traces objects from the roots to find the live objects.
  • the set of roots includes the remembered set, which holds the references from the major heap to the minor heap \(\implies\) all objects in the minor heap that are pointed to by something in the major heap @ the time of minor collection get promoted to the major heap without needing to traverse the major heap to find them.

typically the number of copied-over objects is small, because:

  1. minor heap is relatively small
  2. only that part of the minor heap that is live at collection time must be traversed.

No need to have a fancy allocator here \(\implies\) we just use the bump-pointer algo
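A toy sketch of bump-pointer allocation (illustrative, not the runtime’s code): allocation is just a bounds check plus a pointer increment, which is why minor-heap allocation is so fast:

```ocaml
(* Bump-pointer allocator: [next] is the allocation pointer, [limit]
   the end of the minor heap; sizes and addresses are in words. *)
type minor_heap = { mutable next : int; limit : int }

let alloc heap n =
  if heap.next + n > heap.limit then None   (* heap full: a minor GC would run *)
  else begin
    let addr = heap.next in
    heap.next <- heap.next + n;             (* bump the pointer *)
    Some addr
  end

let () =
  let h = { next = 0; limit = 8 } in
  assert (alloc h 4 = Some 0);
  assert (alloc h 4 = Some 4);
  assert (alloc h 1 = None)                 (* full: trigger a minor collection *)
```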

4.2 Parallelising the Minor Collector #

The minor collector is more difficult to parallelise than the major collector because it moves objects. An object can’t be moved by one thread while another is using it.

What to avoid: anything that requires careful coordination between the collector and the mutator, fine-grained synchronisation.

Options: to separate the minor collector from the mutator in time or space — both were implemented

  • time-separation: when minor heap fills, stop all domains and make all domains collect the minor heap in parallel before resuming the mutator
  • space-separation: each domain to have a private minor heap — prevents any access by one domain to another’s heap and allows them to collect their minor heaps independently

4.3 Concurrent Minor Collector with Private Minor Heaps #

{ Invariant to maintain: } no pointers between the minor heaps of different domains, which permits each of the minor heaps to be independently collected. It’s legal to allow pointers from the shared major heap to the minor heap, to avoid paying the cost of early promotion (similar to GHC)

So the concurrent minor collector: domain-local, private minor heap arenas, each with their own remembered set.

[ Read Faults and Interrupts ]

  • there’s a common interrupt mechanism used for inter-domain signalling for multiple reasons. Implemented via a multi-producer, single-consumer queue for receiving inter-domain interrupts
    • triggering a read-fault

      i.e. following pointer from major to remote minor heap — needs to request promotion from remote minor

    • stop-the-world phase invocation

[ Read Barrier ]

Main challenge for concurrent GC design: read-barrier optimisation

Details omitted here, refer to paper.

[ Promotion ]

The chosen solution is a hybrid of the following, together with an enhanced write barrier that records a domain-local minor remembered set:

  • minor gc \(\rightarrow\) promotes requested object and others in the minor heap to major heap (suffers from early-promotion problem because both requester and target domain suffer the pause time for a full minor GC on the target domain)

  • promote only the transitive closure of the requested object, scan the minor GC roots and the minor heap for pointers to the promoted objects, and forward them to the major heap (problem: touches the entire minor heap when we only need to touch live objects during a minor GC, which is a small portion of the heap)

4.4 Stop-the-World Parallel Minor Collector #

This is an alternative to the concurrent minor collector.

Keeps the domain-local minor heaps, but relaxes the invariant that disallows references between minor heaps. Since inter-heap references are allowed here, a stop-the-world synchronisation involving all domains is needed to carry out a minor heap collection \(\implies\) this also retains the stock OCaml C API

[ Interrupts and Synchronization ]

When a domain wants to collect its minor heap, it uses the interrupt mechanism to signal all domains to do a stop-the-world collection.

At any one time, either stop-the-world minor collection or stop-the-world major collection requests are handled.

Interrupts are frequent, so safe points become more important to avoid long synchronisation times

[ Parallel Promotion ]

All domains promote reachable objects in parallel. If 2 domains attempt to promote the same object, they are serialised using Compare-and-Swap (CAS) operations on the object header. This is assisted by colouring and an intermediate state in the heap object header.

There’s also special fast-paths for cases where there’s only one domain running \(\implies\) same steps as stock OCaml minor collection for this situation.

[ Parallel Work-sharing ]

Static work sharing policies help to speed up the stop-the-world remembered set promotion. The workloads across domains maintaining their independent remembered sets can end up being imbalanced; that’s why a work sharing policy helps to balance the load — it splits the remembered set entries across all domains participating in the minor collection.

This only balances the roots in the remembered set – the underlying reachability trees may still end up being disproportionately distributed across domains

Completing OCaml Language Coverage #

The rest of this section is about the updates to the language features that are tightly linked to the GC.

The guiding influence for their decisions is that there’s a need to preserve fast single-core performance because it’s not practical to maintain multiple runtimes (single-core vs multicore).

I’ve omitted the details here.

5.1 Weak References and Ephemerons #

Weak refs \(\rightarrow\) weak reachability: x is weakly reachable if it’s referred to by a weak reference. If x is NOT also strongly reachable (via normal pointers) then the GC may collect x.

Ephemerons are extensions of weak references (a key–value pair where the ephemeron weakly points to the key). The value is reachable only if the ephemeron is reachable AND the key is strongly reachable.

  • provide the mechanism to express conjunction of reachability relations — usual GC mechanism can only express disjunctions.
  • ephemeron sweeping: if the key is not strongly reachable, the GC erases the references to the key and the value.

The mechanisms here feel too detailed to include in these notes.

5.2 Finalisers #

Each domain maintains a domain-local list of finalisers that it has installed — so it’s locally handled

5.3 Lazy Values #

The stock version was unsafe against race conditions, so there was a need to redesign it.

Multicore OCaml has lazy values that are safe for concurrent mutators and GC threads.
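A small example of the memoizing behaviour the redesign has to preserve – forcing runs the thunk once and caches the result (this uses the stdlib `Lazy` API, single-domain here; the concurrency-safety itself is internal to the runtime):

```ocaml
(* Lazy values defer and memoize a computation. *)
let count = ref 0
let l = lazy (incr count; !count)

let () =
  assert (Lazy.force l = 1);   (* first force runs the thunk *)
  assert (Lazy.force l = 1);   (* subsequent forces return the cached result *)
  assert (!count = 1)          (* the thunk ran exactly once *)
```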

5.4 Fibers #

Support for lightweight concurrency via language-level threads. They are implemented using runtime support for heap-allocated, dynamically resized, stack segments (i.e. fibers).

Fibers are:

  • heap allocated
  • dynamically resized
  • stack segments – fibers are program stacks

Interaction between concurrent GC and fibers has to be managed carefully \(\implies\) there’s a deletion barrier that requires that the program stack must be scanned at the start of the cycle.

Fibers are program stacks so before switching control to a fiber, all the objects on the fiber stack must be marked \(\implies\) risks race conditions between a mutator wanting to switch to a fiber and another domain wanting to mark the fiber stack \(\implies\) we need synchronisation primitives for this \(\implies\) uses a lock on the fiber whenever it’s being scanned for marking — so mutator would have to spin-wait until marking is complete

Discussion on Design Decisions #

This is about the difficult design decisions

  1. the few places where the OCaml language exposed too much of the single-core internals had to be updated

    The Obj module (from the stdlib) is an example, since it’s used in extracted code (e.g. from Coq) and in the compiler itself. Operations that update the header field of an object (Obj.set_tag and Obj.truncate) had to be removed because they conflict with concurrent GC threads non-atomically marking the object; higher-level functionality is provided for common use-cases via alternative primitives

  2. the parts of the language that interface with the outside world are another source of changes (this refers to signal handling, some parts of the FFI, marshalling). They had to be made more robust in the face of parallelism, mainly by removing race conditions or shared state.

    in most situations, they had to just tighten up the definition of allowable behaviour — e.g. named callbacks to OCaml from C can’t be mutated anymore using the raw pointer returned, once they’ve been registered

  3. the other advanced languages features (ephemerons, lazy values, finalizers…) had to be designed to minimise single-core performance impact.

    e.g. lazy values introduced 2 CAS operations into the value-forcing path but none if already forced – CAS cost gets amortized because lazy values memoize expensive computations

  4. Multicore OCaml doesn’t support compaction.

    empirically, they observed that the size-segmented multicore allocator is more tolerant to fragmentation and performs as well as the best-fit allocator

Retrofitting Effect Handlers onto OCaml #

Fascinatingly, the concurrency primitives had to be changed to offer meaningful structured concurrency (direct-style concurrent programming), and there’s also a sequel paper for it — “Retrofitting Effect Handlers onto OCaml”.

This section will be a WIP for now, it’s got quite a bit of deep-diving that I need to do to fully grasp internals and internalise the tradeoffs within it. I’ve chosen to do only a practical cut of this so far in my explorations.