Skip to main content
  1. Readings/
  2. Books/
  3. Real World OCaml: Functional Programming for the Masses/

Chapter 24: Understanding the Garbage Collector

·· 2542 words· 12 mins

OCaml, with its runtime that is a C library, handles most of the memory management for us. Routines exist that can be called by our OCaml program for heap block allocation to fill the heap up with OCaml values. What’s left is for us to understand how the Garbage collector manages the rest of the lifecycle. Note: pre-multicore OCaml garbage collection was NOT thread-safe, the modern one is, so these notes will contain information about the multicore-GC as well (in addition to what’s covered in the textbook).

The C runtime manages a heap that it gets from the OS. To do so, it holds up heap blocks that it fills with OCaml values based on alloc requests by the OCaml program.

Figure 1: Green-tea pickers in Hangzhou (retrieved)

Mark and Sweep Garbage Collection #

When an allocation request can’t be met, then the GC is invoked. GC decides live vs dead values (to be collected) – program can’t explicitly free a value when done (QQs: it claims that we can’t force a collection – is that really true? e.g. can’t we just do a System.sleep or some kind of Os module related call that would trigger a GC collection?).

It is a live-tracking stateful approach if we consider the major GC, the mark and sweep is done regularly across slices and the colouring is what allows us to track state. The colours persist across slice boundaries so that the GC can pause and resume (incremental progress).

The minor GC doesn’t do live-tracking incrementally, it collects in one go.

Scanning:

its’ a root of values (e.g. stack) and it’s a tree-traversal to figure out which heap blocks are reachable and which are not — ends up with a directed graph. Edge exists if there’s a pointer between heap block nodes

  • dead blocks — if can’t reach via the graph traversal \(\implies\) can be freed and reused by the application
  • alive blocks — reachable

The marking is done by colouring the color segment, see below for what the colours mean.

Generational Garbage Collection #

Generational GC: separate memory regions to hold blocks based on how long the blocks have been alive:

  • young: small, fixed-size minor heap where most blocks get allocated
  • old generation: larger, variable-sized, major heap for long-lived blocks
Relies on the Generational Hypothesis: if we code in a typical functional programming style then young blocks die young and old blocks stay around longer than young ones — the lifecycle patterns are distinct so the memory layouts and GC algorithms are also different.

The Gc Module and OCAMLRUNPARAM #

Runtime system behaviour can be altered / tuned by using either Gc module or using the OCAMLRUNPARAM env var.

The env var allows us to set GC params without recompiling (since it’s not within the code, just a consumed var) — see runtime system config params if interested.

The Fast Minor Heap #

Changes to the Minor Heap, when Retrofitting Parallelism into OCaml can be seen in Chapter 16C::Minor Heap: changes

It’s a single contiguous chunk of virtual memory (fixed size) that has a sequence of OCaml blocks — if there’s is space then allocation is \(O(1)\) operation withe a couple of CPU instructions \(\implies\) meant for short-lived values.

GC happens via copying collection: when live blocks get moved to the major heap — so b blocks take \(O(b)\) time — given the Generational Hypothesis, b should be pretty small.

It’s generally a stop-the-world action \(\implies\) that’s why it needs to be FAST to not block application execution.

Allocating on the Minor Heap #

So the overall minor heap is given from the OS from a malloc, that’s what establishes the base and caml_young_end pointers to memory. To ensure that it’s word-aligned, that’s why caml_young_start is slightly to the right of base.

The growth of the minor heap (OCaml runtime controlled) is from right to left — this is quick to allocate because it’s a branchless, simple action of:

  1. decrement the current curr_ptr and
  2. check if limit is reached — if so then collection is triggerred.

QQ: it seems that the minor heap kind of behaves like a stack of its own because of the contiguity point and how the pointers just get decremented like that — it feels like it’s behaving more like a stack. Does my intuition or observation have any merit here?

Figure 2: Layout of the minor heap

Purpose of limit #

limit is a dynamic pointer here – so if the runtime wants to schedule a minor heap collection, then it just needs to set limit to end and in the next allocation, there won’t be enough space for anything so a collection would be triggerred (early collection).

Table 1: Compiler introduces poll points
Poll Points
why?introduced by the compiler to allow pre-emption from UNIX signals, other internal book-keeping — especially because our coding style might make it such that it may take a long time do an allocation on the minor heap
what happens?@ a poll point, ptr is just checked against limit
where?typically at the start of every function, at back edge of loops – depends on the dataflow, the compiler optimises this to minimise the poll points

Controlling the Minor Heap Size #

As usual, we could use Gc module or inject via OCAMLRUNPARAM:

open Core;;

let c = Gc.get ();;
val c : Gc.Control.t =
  {Core.Gc.Control.minor_heap_size = 262144; major_heap_increment = 15;
   space_overhead = 120; verbose = 0; max_overhead = 500;
   stack_limit = 1048576; allocation_policy = 2; window_size = 1;
   custom_major_ratio = 44; custom_minor_ratio = 100;
   custom_minor_max_size = 8192}


(* --- how to tune the size: *)
Gc.tune ~minor_heap_size:(262144 * 2) ();;
- : unit = ()
Code Snippet 1: Example of reading and tuning, Core defaults to 8Mb, else it's a 2Mb default on 64-bit machines

Notes:

  1. Core increasese the default minor heap size from standard OCaml significantly
  2. if we change the GC size — will trigger an immediate minor heap-collection

The Long-Lived Major Heap #

Variable sized, where bulk of the longer-lived larger-values in the program exist. It can be any number of non-contiguous chunks of virtual memory, each with live blocks with free memory alongside to expand.

OCaml runtime maintains a free-list data structure that acts as a lookup index for free memory that it has allocated — used to satisfy allocation requests for OCaml blocks.

It can be in the size of Gbs, collection works in phases and the mark and sweep phases are done in incremental slices, each of which causes a brief pause. Only the compaction phase is a stop-the-world action so that blocks can be moved around without being observed by the live application.

phase
markacts on heap slicescans the block graph and marks all live blocks by setting the color tag of the block header
sweepacts on heap sliceequentially scans the heap chunks and identifies dead blocks that weren’t marked earlier
compactacts on entire memory at oncerelocates live blocks into a freshly allocated heap to eliminate gaps in the free list. This prevents the fragmentation of heap blocks in long-running programs and normally occurs much less frequently than the mark and sweep phases

Mark and sweep are done in incremental slices — each slice causes a brief pause, but the collection is interleaved with the running program. Only compaction is a true stop-the-world operation. Compaction touches the entire memory in one go, though it’s a rare operation — it’s for defragging.

Allocating on the Major Heap #

Singly linked list of contiguous memory chunks that are sorted in increasing order of virtual memory addresses.

Each chunk is from malloc, has a header and data-area that contains OCaml heap chunks.

Figure 3: Layout of a Contiguous Chunk allocated on the Major Heap

Typically this is the process for allocation:

  1. check free list of blocks for suitable region that fits
    1. if no room on free list, runtime will expand the major heap — allocates a fresh heap chunk that is large enough \(\implies\) chunk added to free list \(\implies\) repeat first step
    2. if room exists \(\implies\) allocate

Controlling the Heap Increment #

major_heap_increment is an observed value by the OCaml runtime, helps control the major heap growth — it’s the number of words to add to teh major heap per expansion

Memory Allocation Strategies #

Different strategies exist for different program behaviours. They all need to manage the following trade-off: allocation-cost vs heap fragmentation.

How they work:

  1. Best-fit strategy \(\implies\) best of both worlds
    1. small allocations: use size-segregated free-lists for small allocations (i.e. \(\le\) 16 words)
    2. large allocations: use splay tree (recency-biased) to select the smallest block that fits
  2. Next-fit strategy: holds a pointer to the last allocated block; scans forward through the free list and wraps around if necessary \(\implies\) prioritises low allocation cost vs fragmentation
  3. First-fit strategy: scans free list from the beginning for the first block large enough \(\implies\) lower fragmentation over allocation-cost
Table 2: Comparison of memory allocation strategies
StrategyCore IdeaAdvantagesDisadvantagesBest Use Case
Best-Fit (default)Find the smallest suitable blockBalanced performance; efficient for common small allocations; reduces wasted spaceModerate CPU cost; not the fastestGeneral-purpose workloads with mixed allocation sizes
Next-FitContinue from last used positionFast allocation; good cache locality; reuses nearby memoryCauses fragmentation, especially for large blocks at start of listWorkloads prioritizing speed and locality over fragmentation
First-FitTake the first suitable blockReduces fragmentation; fewer heap compactionsSlower allocation than next-fit, due to full scan – causes more fragmentation at the low end of the free list over time compared to best-fitReal-time or fragmentation-sensitive workloads with varied sizes

Marking and Scanning the Heap #

This happens in heap slices instead of the complete major heap at once. There’s 2-bit color field in its header — allows for stateful colour-checks to pause and resume:

Colour valueMeaning
blueon free list, not being used
white @ markingnot reached, possibly reachable
white @ sweepingDead Object, unreachable, can be freed
blackLive object, reachable, fields have been scanned
grayColouring frontier — marked objects but not scanned for outgoing pointers yet

When allocated, all heap values are coloured white (reachable, not scanned yet).

The runtime tracks the rate of allocation (and available memory) and calculates the size of slice of heap to sweep.

Marking uses a mark-stack:

  1. starts from always -alive root values – will be gray coloured
  2. when a white block is discovered as reachable, it is grayed and pushed to the mark-stack. It becomes black only after its fields have been scanned.
  3. if there’s a mark-stack overflow, then there’s a pruning that happens: empties the mark stack entirely, summarizing the addresses of each block as start and end ranges in each heap chunk header.

So, the mark-stack is the gray set, they become black after their fields are scanned and pushed / processed.

Controlling Major Heap Collections #

It’s possible to trigger a single heap slice using Gc.major_slice / Gc.full_major.

QQ: earlier, the text mentioned that it’s not possible to explicitly trigger a collection. But in this case, the major heap can be collected from (and implicitly, also the minor heap since this function will also collect from the minor heap)

Yes, we can explicitly trigger collection.

Gc.major_slice, Gc.full_major, Gc.compact all exist for this purpose.

The nuance is that our program can’t explicitly free a value when done — i.e. this is specifically about freeing individual values (which is true, you can’t free a specific block like C), not about triggering a GC pass.

The distinction is: you can trigger a collection, but you can’t say “free this specific value.”

Heap Compaction #

It’s natural to end up having a fragmented major heap, so compaction is possible.

OCaml’s heap compaction happens in-place.

We can tune the max_overhead which is something like a threshold/ratio of free:allocated memory before which compaction happens.

Intergenerational Pointers #

Problem: To know which minor heap blocks are live, the collector needs to check the major heap blocks to figure out which minor heap blocks are pointed to by the major heap blocks. If not, every minor collection needs to also scan the major heap.

Solution: maintain a set of intergenerational pointers that acts as a remembered set. There’s a write barrier to updating this remembered set, and that’s how the set is kept updated.

Mutable Write Barrier #

See notes @ Chapter 16C:Retrofitting Parallelism onto OCaml. It hinges on the foundations upon which the pre-OCaml 5 GC was built on, and is still relevant for this section.
what the write barrier does
  1. maintains the Snapshot-at-beginning property when object field is mutated: comes from the invariant that every object reachable at the start of marking eventually gets marked.

    in the context of mutations e.g. mutating a field of an object, @ mark-phase of the GC, the write barrier loads the object pointed to by the old value of the field — if it’s white (i.e. can be collected), then it grays and adds to the mark-stack)

  2. write barrier maintains the /remembered set/of intergenerational pointers (which are used as a root for minor collection)

The write barrier can have great impact — that’s why sometimes doing a immutable update can be faster than doing a mutable update.

The compiler tracks all mutable types and @ runtime, caml_modify is called before making the mutable change. It’s job is to keep the remembered set consistent — checks the location of the target write and the value it’s being changed to.

It’s efficient but sometimes might be slower than just allocating a fresh value on the fast minor heap and doing some extra collections.

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
# require "Core_bench";;
open Core
open Core_bench

module Mutable = struct
  type t =
    { mutable iters : int
    ; mutable count : float
    }

  let rec test t =
    if t.iters = 0
    then ()
    else (
      t.iters <- t.iters - 1;
      t.count <- t.count +. 1.0;
      test t)
end

module Immutable = struct
  type t =
    { iters : int
    ; count : float
    }

  let rec test t =
    if t.iters = 0
    then ()
    else test { iters = t.iters - 1; count = t.count +. 1.0 }
end

let () =
  let iters = 1_000_000 in
  let count = 0.0 in
  let tests =
    [ Bench.Test.create ~name:"mutable" (fun () ->
          Mutable.test { iters; count })
    ; Bench.Test.create ~name:"immutable" (fun () ->
          Immutable.test { iters; count })
    ]
  in
  Bench.make_command tests |> Command_unix.run
Code Snippet 2: Timed test: mutable vs immutable update
Estimated testing time 20s (2 benchmarks x 10s). Change using '-quota'.
┌───────────┬──────────┬─────────┬──────────┬──────────┬────────────┐
│ Name      │ Time/Run │ mWd/Run │ mjWd/Run │ Prom/Run │ Percentage │
├───────────┼──────────┼─────────┼──────────┼──────────┼────────────┤
│ mutable   │  52.14ms │  2.00Mw │   20.75w │   20.75w │    100.00% │
│ immutable │  41.62ms │  5.00Mw │  115.01w │  115.01w │     79.83% │
└───────────┴──────────┴─────────┴──────────┴──────────┴────────────┘
module Mutable :
  sig
    type t = { mutable iters : Base.int; mutable count : Base.float; }
    val test : t -> unit/2
  end
module Immutable :
  sig
    type t = { iters : Base.int; count : Base.float; }
    val test : t -> unit/2
  end
Code Snippet 1: Outcome of timed test

Observations:

  1. Space/time tradeoff: Mutable version takes longer to complete than immutable version but also uses fewer minor-heap words than the immutable version
  2. typically better to use immutable datastructures than conventional mutable ones
  3. if the mutation is rare, then it’s faster to tolerate the write-barrier hit and avoid all the extra allocations.

Attaching Finalizer Functions to Values #

For situations that are classic sources of memory leaks (e.g. closing of file-descriptors or last-remnant in log-buffer), it’s good to run finalisers to manually free things up.

Finalizers can run at any time in any thread, so they can be confusing in multi-threaded contexts.

Typically GC calls the finalization functions in the order of deallocation. Suppose values become unreachable during the same GC cycle, then the finalization functions will be called in the reverse order of the corresponding calls to add_finalizer. Each call to add_finalizer adds to the set of functions, which are run when value becomes unreachable — we can have mutliple finalizers pointing to the same heap block.

Every finalizer runs at most once, runs when the GC determines a block, b, is unreachable and removes its corresponding finalizers from the finalizers set.

Program termination will NOT cause all the finalizers to be run before the runtime exits.

Remember OCaml finalizers here are not the same as C finalizers in the context of FFIs.

What Values Can Be Finalized? #

  • Values that aren’t heap allocated can’t be finalized
    • ints, const constructors, bools, empty array, empty list, unit value
  • Some constant values (e.g. list of integer constants) can be heap-allocated but not deallocated until the end of the program — just use Core.Heap_block to check if it’s heap allocated and finalizable.
  • Finalizers need to be registered, the registereres can be found in Core.Gc.Expert