Chapter 24: Understanding the Garbage Collector
(Notes on Real World OCaml: Functional Programming for the Masses.)
OCaml's runtime, which is a C library, handles most of the memory management for us. It provides routines that our OCaml program calls to allocate heap blocks and fill them with OCaml values. What's left is for us to understand how the garbage collector manages the rest of a value's lifecycle. Note: pre-multicore OCaml garbage collection was NOT thread-safe; the modern one is, so these notes also contain information about the multicore GC (in addition to what's covered in the textbook).
The C runtime manages a heap that it obtains from the OS. It carves this heap into blocks, which it fills with OCaml values in response to allocation requests from the OCaml program.

Mark and Sweep Garbage Collection #
When an allocation request can't be met, the GC is invoked. The GC decides which values are live and which are dead (to be collected) — the program can't explicitly free a value when it's done with it. (QQ: it claims that we can't force a collection — is that really true? e.g. can't we just do a System.sleep or some kind of Os module related call that would trigger a GC collection?)
It's not a continuously tracked, stateful approach: the mark and sweep is done periodically.
Scanning:
starts from a set of root values (e.g. the stack) and performs a graph traversal to figure out which heap blocks are reachable and which are not — this yields a directed graph, with an edge between heap-block nodes wherever one block holds a pointer to another
- dead blocks — unreachable via the graph traversal \(\implies\) can be freed and reused by the application
- live blocks — reachable
The marking is done by setting the colour field in each block's header; see below for what the colours mean.
Generational Garbage Collection #
Generational GC: separate memory regions to hold blocks based on how long the blocks have been alive:
- young: small, fixed-size minor heap where most blocks get allocated
- old generation: larger, variable-sized, major heap for long-lived blocks
The Gc Module and OCAMLRUNPARAM #
Runtime system behaviour can be altered/tuned either via the Gc module or via the OCAMLRUNPARAM environment variable.
The env var lets us set GC params without recompiling (since it's consumed at startup rather than baked into the code) — see the runtime system configuration parameters if interested.
The Fast Minor Heap #
It's a single contiguous chunk of virtual memory (fixed size) holding a sequence of OCaml blocks — if there is space, allocation is an \(O(1)\) operation taking a couple of CPU instructions \(\implies\) it's meant for short-lived values.
GC happens via copying collection: live blocks are moved to the major heap — so b live blocks take \(O(b)\) time — and given the Generational Hypothesis, b should be pretty small.
It's generally a stop-the-world action \(\implies\) that's why it needs to be FAST, so as not to block application execution.
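The copying/promotion behaviour can be observed from within a program. A small sketch using the stdlib Gc module — `Gc.minor ()` forces a minor collection, and `quick_stat`'s `promoted_words` counter shows how many words were copied to the major heap (the 100-element array is an arbitrary choice, small enough to be allocated on the minor heap):

```ocaml
(* Observing promotion with the stdlib Gc module: Gc.minor () forces a
   minor collection; values that are still live get copied (promoted) to
   the major heap, which shows up in the promoted_words counter. *)
let promoted_delta =
  let before = (Gc.quick_stat ()).Gc.promoted_words in
  (* an arbitrary small allocation; opaque_identity stops the compiler
     from optimising it away *)
  let keep = Sys.opaque_identity (Array.make 100 0) in
  Gc.minor ();  (* keep is still live here, so it gets promoted *)
  ignore (Sys.opaque_identity keep);
  (Gc.quick_stat ()).Gc.promoted_words -. before
```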
Allocating on the Minor Heap #
The overall minor heap is obtained from the OS via a malloc; that's what establishes the base and caml_young_end pointers into memory. To ensure word alignment, caml_young_start sits slightly to the right of base.
The minor heap grows (under OCaml runtime control) from right to left — allocation is quick because it's a branchless, simple sequence of:
- decrement the current allocation pointer (curr_ptr) by the size of the requested block
- check whether limit has been reached — if so, a collection is triggered.
QQ: it seems that the minor heap kind of behaves like a stack of its own because of the contiguity point and how the pointers just get decremented like that — it feels like it’s behaving more like a stack. Does my intuition or observation have any merit here?
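The fast path can be watched indirectly via the stdlib's `Gc.minor_words`, which reports the cumulative number of words ever allocated on the minor heap. The `ref` below is just an arbitrary small allocation (one header word plus one field):

```ocaml
(* Indirectly watching the bump-pointer allocator: Gc.minor_words reports
   the cumulative number of words allocated on the minor heap, so the
   delta around an allocation shows its cost in words. *)
let words_for_ref =
  let before = Gc.minor_words () in
  (* a ref cell: 1 header word + 1 field = 2 words on a 64-bit machine;
     opaque_identity prevents the allocation from being optimised away *)
  let r = Sys.opaque_identity (ref 42) in
  let after = Gc.minor_words () in
  ignore r;
  after -. before
```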
Purpose of limit #
limit is a dynamic pointer — if the runtime wants to schedule an early minor-heap collection, it just sets limit to end; on the next allocation there won't appear to be enough space, so a collection is triggered (early collection).
| Poll points | |
|---|---|
| why? | introduced by the compiler to allow pre-emption from UNIX signals and other internal book-keeping — especially because our coding style might mean the program goes a long time without allocating on the minor heap |
| what happens? | at a poll point, the allocation pointer is simply checked against limit |
| where? | typically at the start of every function and at the back edge of loops — the compiler analyses the dataflow and optimises placement to minimise the number of poll points |
Controlling the Minor Heap Size #
As usual, we could use Gc module or inject via OCAMLRUNPARAM:
open Core;;
let c = Gc.get ();;
val c : Gc.Control.t =
{Core.Gc.Control.minor_heap_size = 262144; major_heap_increment = 15;
space_overhead = 120; verbose = 0; max_overhead = 500;
stack_limit = 1048576; allocation_policy = 2; window_size = 1;
custom_major_ratio = 44; custom_minor_ratio = 100;
custom_minor_max_size = 8192}
(* --- how to tune the size: *)
Gc.tune ~minor_heap_size:(262144 * 2) ();;
- : unit = ()
Notes:
- Core significantly increases the default minor heap size over standard OCaml: 8MB by default, vs a 2MB default on 64-bit machines
- changing the minor heap size via Gc triggers an immediate minor collection
The Long-Lived Major Heap #
Variable-sized, holding the bulk of the longer-lived, larger values in the program. It can be any number of non-contiguous chunks of virtual memory, each containing live blocks interspersed with free memory, with room alongside to expand.
OCaml runtime maintains a free-list data structure that acts as a lookup index for free memory that it has allocated — used to satisfy allocation requests for OCaml blocks.
It can be gigabytes in size. Collection works in phases and is also a stop-the-world action, so that blocks can be moved around without being observed by the live application.
| Phase | Scope | What it does |
|---|---|---|
| mark | acts on a heap slice | scans the block graph and marks all live blocks by setting the colour tag in each block header |
| sweep | acts on a heap slice | sequentially scans the heap chunks and identifies dead blocks that weren't marked earlier |
| compact | acts on the entire heap at once | relocates live blocks into a freshly allocated heap to eliminate gaps in the free list. This prevents fragmentation of heap blocks in long-running programs and normally occurs much less frequently than the mark and sweep phases |
Since it’s a stop-the-world action, it is done incrementally over heap slices to avoid pausing the live application for too long.
Compaction touches the entire memory in one go, though it’s a rare operation — it’s for defragging.
Allocating on the Major Heap #
Singly linked list of contiguous memory chunks, sorted in increasing order of virtual memory address.
Each chunk comes from malloc and has a header plus a data area that contains OCaml heap blocks.
Typically this is the process for allocation:
- check free list of blocks for suitable region that fits
- if no room on free list, runtime will expand the major heap — allocates a fresh heap chunk that is large enough \(\implies\) chunk added to free list \(\implies\) repeat first step
- if room exists \(\implies\) allocate
Controlling the Heap Increment #
major_heap_increment is a value observed by the OCaml runtime that helps control major heap growth — it's the number of words to add to the major heap per expansion.
Memory Allocation Strategies #
Different strategies exist for different program behaviours. They all need to manage the following trade-off: allocation-cost vs heap fragmentation.
How they work:
- Best-fit strategy (the default) \(\implies\) best of both worlds
- small allocations (i.e. \(\le\) 16 words): use size-segregated free lists
- large allocations: use a splay tree (recency-biased) to select the smallest block that fits
- Next-fit strategy: holds a pointer to the last allocated block; scans forward through the free list and wraps around if necessary \(\implies\) prioritises low allocation cost over low fragmentation
- First-fit strategy: scans the free list from the beginning for the first block large enough \(\implies\) prioritises low fragmentation over low allocation cost
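The selection logic of two of these strategies can be sketched as a toy OCaml model (illustrative only — the real allocator is C code in the runtime; here the free list is just a list of block sizes, and next-fit is omitted since it additionally carries a roving cursor):

```ocaml
(* Toy model of free-list selection: each strategy picks which free block
   should service an allocation request of `request` words. *)

(* first-fit: take the first block that is large enough *)
let first_fit free_list request =
  List.find_opt (fun size -> size >= request) free_list

(* best-fit: take the smallest block that still fits *)
let best_fit free_list request =
  List.fold_left
    (fun best size ->
       if size < request then best
       else
         match best with
         | Some b when b <= size -> best  (* current best is tighter *)
         | _ -> Some size)
    None free_list

(* example free list of block sizes, in list order *)
let free_list = [ 8; 32; 16; 64 ]
(* first_fit free_list 12  picks 32 — first block big enough
   best_fit  free_list 12  picks 16 — smallest block that fits *)
```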
| Strategy | Core Idea | Advantages | Disadvantages | Best Use Case |
|---|---|---|---|---|
| Best-Fit (default) | Find the smallest suitable block | Balanced performance; efficient for common small allocations; reduces wasted space | Moderate CPU cost; not the fastest | General-purpose workloads with mixed allocation sizes |
| Next-Fit | Continue from last used position | Fast allocation; good cache locality; reuses nearby memory | Causes fragmentation, especially for large blocks at start of list | Workloads prioritizing speed and locality over fragmentation |
| First-Fit | Take the first suitable block | Reduces fragmentation; fewer heap compactions | Slower allocation due to full scan | Real-time or fragmentation-sensitive workloads with varied sizes |
Marking and Scanning the Heap #
This happens in heap slices instead of over the complete major heap at once. Each block has a 2-bit colour field in its header — this allows stateful colour checks, so marking can pause and resume:
| Colour value | Meaning |
|---|---|
| blue | on free list, not being used |
| white @ marking | not reached, possibly reachable |
| white @ sweeping | unreachable, can be freed |
| black | reachable, fields have been scanned |
When allocated, all heap values are coloured white (possibly reachable, but not scanned yet).
The runtime tracks the rate of allocation (and the available memory) and calculates the size of the heap slice to sweep.
Marking uses a mark stack:
- starts from the always-live root values, which are coloured black
- white blocks are blackened and pushed onto the mark stack
- if the mark stack overflows, a pruning step empties it entirely, summarizing the addresses of the pending blocks as start and end ranges in each heap chunk's header.
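The mark-stack traversal can be sketched as a toy OCaml model (illustrative only — the real marker is C code inside the runtime operating on raw block headers; overflow pruning is not modelled here):

```ocaml
(* Toy model of the mark phase: heap blocks form a directed graph, and
   marking walks it with an explicit mark stack instead of recursion,
   blackening each white block exactly once (so cycles are safe). *)
type color = White | Black

type block = {
  mutable color : color;
  mutable fields : block list;  (* outgoing pointers to other blocks *)
}

let mark roots =
  let stack = Stack.create () in
  let blacken b =
    if b.color = White then begin
      b.color <- Black;
      Stack.push b stack
    end
  in
  List.iter blacken roots;      (* roots are always considered live *)
  while not (Stack.is_empty stack) do
    let b = Stack.pop stack in
    List.iter blacken b.fields  (* scan the popped block's fields *)
  done

(* demo: a is a root pointing at c; d is unreachable and stays white,
   so a later sweep would free it *)
let a, c, d =
  let d = { color = White; fields = [] } in
  let c = { color = White; fields = [] } in
  let a = { color = White; fields = [ c ] } in
  mark [ a ];
  (a, c, d)
```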
Controlling Major Heap Collections #
It's possible to trigger a single slice of the major collection with Gc.major_slice, or a full major collection with Gc.full_major.
QQ: earlier, the text mentioned that it’s not possible to explicitly trigger a collection. But in this case, the major heap can be collected from (and implicitly, also the minor heap since this function will also collect from the minor heap)???
Heap Compaction #
It’s natural to end up having a fragmented major heap, so compaction is possible.
OCaml’s heap compaction happens in-place.
We can tune max_overhead, a threshold on the ratio of free to live memory: when the measured overhead exceeds it, compaction is triggered.
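A minimal sketch using the stdlib Gc module (Core's wrapper, shown earlier, differs slightly): read the control record, raise max_overhead, and request a compaction explicitly:

```ocaml
(* Tuning compaction with the stdlib Gc module: max_overhead is the
   free-memory overhead percentage above which compaction triggers
   (setting it to 1_000_000 effectively disables compaction). *)
let () =
  let c = Gc.get () in
  Gc.set { c with Gc.max_overhead = 200 };
  (* compaction can also be requested explicitly, on demand *)
  Gc.compact ()
```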
Intergenerational Pointers #
Problem: to know which minor heap blocks are live, the collector needs to check which minor heap blocks are pointed to from the major heap. Without extra bookkeeping, every minor collection would also have to scan the entire major heap.
Solution: maintain a set of intergenerational pointers that acts as a remembered set. Mutable updates go through a write barrier, which is how the set is kept up to date.
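The idea can be sketched as a toy OCaml model (illustrative only — the real barrier is caml_modify in the C runtime, and the remembered set holds field addresses, not whole values):

```ocaml
(* Toy model of the write barrier: every store into a mutable field goes
   through a barrier that records major->minor pointers in a remembered
   set, so minor collections need not scan the whole major heap. *)
type gen = Minor | Major

type value = { gen : gen; mutable field : value option }

let remembered_set : value list ref = ref []

(* stand-in for caml_modify: before the actual write, check whether this
   store creates an intergenerational (major -> minor) pointer *)
let write_barrier parent v =
  (match v with
   | Some child when parent.gen = Major && child.gen = Minor ->
       remembered_set := parent :: !remembered_set
   | _ -> ());
  parent.field <- v

let () =
  let young = { gen = Minor; field = None } in
  let old_ = { gen = Major; field = None } in
  write_barrier old_ (Some young);  (* major -> minor: remembered *)
  write_barrier young (Some old_)   (* minor -> major: not remembered *)
```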
Mutable Write Barrier #
The write barrier can have a significant performance impact — that's why an immutable update can sometimes be faster than a mutable one.
The compiler tracks all mutable types and, at runtime, caml_modify is called before making the mutable change. Its job is to keep the remembered set consistent — it checks the location of the target write and the value it's being changed to.
It's efficient, but sometimes it's slower than just allocating a fresh value on the fast minor heap and paying for some extra minor collections.
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌───────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ mutable │ 52.14ms │ 2.00Mw │ 20.75w │ 20.75w │ 100.00% │
│ immutable │ 41.62ms │ 5.00Mw │ 115.01w │ 115.01w │ 79.83% │
└───────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
module Mutable :
sig
type t = { mutable iters : Base.int; mutable count : Base.float; }
val test : t -> unit/2
end
module Immutable :
sig
type t = { iters : Base.int; count : Base.float; }
val test : t -> unit/2
end
Observations:
- Space/time tradeoff: the mutable version takes longer to complete than the immutable version, but it also allocates far fewer minor-heap words
- typically better to use immutable data structures than conventional mutable ones
- if mutation is rare, it can be faster to tolerate the write-barrier hit and avoid all the extra allocations.
Attaching Finalizer Functions to Values #
For situations that are classic sources of memory leaks (e.g. closing file descriptors, or flushing the last remnant of a log buffer), it's useful to attach finalizers so that cleanup runs when the value is collected.
Finalizers can run at any time in any thread, so they can be confusing in multi-threaded contexts.
Typically the GC calls finalization functions in the order of deallocation. If several values become unreachable during the same GC cycle, their finalization functions are called in the reverse order of the corresponding calls to add_finalizer. Each call to add_finalizer adds to the set of functions to run when the value becomes unreachable — we can have multiple finalizers attached to the same heap block.
Every finalizer runs at most once: it runs when the GC determines that a block, b, is unreachable, at which point the GC removes b's finalizers from the finalizer set.
Program termination will NOT cause all the finalizers to be run before the runtime exits.
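A minimal example using the stdlib's Gc.finalise (Core's registration functions live in Core.Gc.Expert): once the bytes value becomes unreachable, a forced full major collection runs its finalizer. (In general, relying on a finalizer running at a particular moment is fragile; this is just a sketch.)

```ocaml
(* Minimal finalizer demo with the stdlib: the finalizer runs after the
   GC finds the value unreachable, here forced with Gc.full_major. *)
let finalized = ref false

let register () =
  (* a heap-allocated value; immediates like ints can't be finalized *)
  let v = Bytes.create 16 in
  Gc.finalise (fun _ -> finalized := true) v

let () =
  register ();
  (* v is unreachable once register returns; a full major collection
     should run its finalizer *)
  Gc.full_major ();
  Printf.printf "finalized? %b\n" !finalized
```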
What Values Can Be Finalized? #
- Values that aren't heap allocated can't be finalized
- ints, constant constructors, bools, the empty array, the empty list, the unit value
- Some constant values (e.g. a list of integer constants) can be heap-allocated, yet never deallocated until the end of the program — use Core.Heap_block to check whether a value is heap-allocated and finalizable
- Finalizers need to be registered; the registration functions can be found in Core.Gc.Expert