- rtshkmr's digital garden/
- Readings/
- Books/
- Real World OCaml: Functional Programming for the Masses/
- Chapter 14: Maps and Hash Tables/
Chapter 14: Maps and Hash Tables
Table of Contents
When storing associated, organised data, OCaml’s association lists come to mind but they’re inefficient because most actions are in \(O(n)\) time.
Here’s a short example:
open Base;;
let digit_alist =
[ 0, "zero"; 1, "one"; 2, "two" ; 3, "three"; 4, "four"
; 5, "five"; 6, "six"; 7, "seven"; 8, "eight"; 9, "nine" ];;
(*
val digit_alist : (int * string) list =
[(0, "zero"); (1, "one"); (2, "two"); (3, "three"); (4, "four");
(5, "five"); (6, "six"); (7, "seven"); (8, "eight"); (9, "nine")]
*)
List.Assoc.find ~equal:Int.equal digit_alist 6;;
(* - : string option = Some "six" *)
List.Assoc.find ~equal:Int.equal digit_alist 22;;
(* - : string option = None *)
List.Assoc.add ~equal:Int.equal digit_alist 0 "zilch";;
(*
- : (int, string) Base.List.Assoc.t =
[(0, "zilch"); (1, "one"); (2, "two"); (3, "three"); (4, "four");
(5, "five"); (6, "six"); (7, "seven"); (8, "eight"); (9, "nine")]
*)
Maps (work in logarithmic time) and Hash Tables (work in constant time) offer better performance.
Maps #
When looking at the basic syntax for using Maps, the following things stand out:
The type definition for type
tusesMap.twhich is supplied 3 types params:one for the type of the key
one for the type of the data
one for the comparator witness :
Indicates which comparator function was used to construct the map. It doesn’t say anything about the type of data stored in the map.
In the example below,
String.comparator_witnesstells us the default string comparator was used
Map.emptytakes in a first-class module as an argument.This first-class module provides the comparison function for building the map, along with the sexp converter for generating useful error messages.
This is also why we don’t need to provide the module again when using functions like
Map.findorMap.add.There’s some requirements we need to meet for our custom modules to be able to work like this (to be passed into empty map). KIV
(* ===== interface *)
open Base
(** A collection of string frequency counts *)
type t
(** The empty set of frequency counts *)
val empty : t
(** Bump the frequency count for the given string. *)
val touch : t -> string -> t
(** Converts the set of frequency counts to an association list. Every
string in the list will show up at most once, and the integers
will be at least 1. *)
val to_list : t -> (string * int) list
(* ==== implementation *)
open Base
type t = (string, int, String.comparator_witness) Map.t
let empty = Map.empty (module String)
let to_list t = Map.to_alist t
let touch t s =
let count =
match Map.find t s with
| None -> 0
| Some x -> x
in
Map.set t ~key:s ~data:(count + 1)
Sets #
Base gives a set data type also. It’s possible to encode sets in terms of maps but it’s more efficient and natural to use Base’s specialised set type.
We have the typical set operations that one would expect in any programming languages.
open Base
Set.of_list (module Int) [1;2;3] |> Set.to_list;;
(* - : int list = [1; 2; 3] *)
Set.union
(Set.of_list (module Int) [1;2;3;2])
(Set.of_list (module Int) [3;5;1])
|> Set.to_list;;
(* - : int list = [1; 2; 3; 5] *)
Modules and Comparators #
This section is about how we can make comparator witnesses work for our custom modules so that we can create maps typed with our custom modules. NOTE: the actual elaboration is done here in this section.
First, creating maps/sets has some useful functions that we should take note:
- Creating maps from alists:
open Base;; let digit_alist = [ 0, "zero"; 1, "one"; 2, "two" ; 3, "three"; 4, "four" ; 5, "five"; 6, "six"; 7, "seven"; 8, "eight"; 9, "nine" ];; (* val digit_alist : (int * string) list = [(0, "zero"); (1, "one"); (2, "two"); (3, "three"); (4, "four"); (5, "five"); (6, "six"); (7, "seven"); (8, "eight"); (9, "nine")] *) let digit_map = Map.of_alist_exn (module Int) digit_alist;; (* -- this will exn on duplicate keys encountered*) (* val digit_map : (int, string, Int.comparator_witness) Map.t = <abstr> *) Map.find digit_map 3;; (* - : string option = Some "three" *)
In order for us to make maps and sets from our own custom modules, we need to satisfy the Base.Comparator.S signature.
There’s a common idiom for doing this:
for our custom module e.g.
Book, we create a submodule,Tthat contains the basic functionality for OUR custom type (Book)Include that submodule,
TAlso include the result of applying a functor to that module. The functor here may be:
Comparator.Make(T)Comparable.Make(T): has some more helper functions e.g. infix comparison operators and min and max functions, in addition to the comparator itself.
module Book = struct
module T = struct
type t = { title: string; isbn: string }
let compare t1 t2 =
let cmp_title = String.compare t1.title t2.title in
if cmp_title <> 0 then cmp_title
else String.compare t1.isbn t2.isbn
let sexp_of_t t : Sexp.t =
List [ Atom t.title; Atom t.isbn ]
end
include T (*includes our custom module's implementation*)
include Comparator.Make(T) (*applies the functor*)
end;;
let some_programming_books =
Set.of_list (module Book)
[ { title = "Real World OCaml"
; isbn = "978-1449323912" }
; { title = "Structure and Interpretation of Computer Programs"
; isbn = "978-0262510875" }
; { title = "The C Programming Language"
; isbn = "978-0131101630" } ];;
(*
val some_programming_books : (Book.t, Book.comparator_witness) Set.t =
<abstr>
*)
Why Do We Need Comparator Witnesses? #
A comparator witness isn’t about tracking what kind of data we’re using — it’s about tracking how that data is ordered.
Some map and set operations (such as merging, union, or intersection) depend on both operands being ordered according to the same total order. That total order of individual operands is determined by the comparison function used when the data structure was created.
The compiler can’t just rely on the type of the operands here. The problem is that the type 'a alone doesn’t tell us which comparison function was used. e.g. 2 maps might both use keys of type int, yet be ordered by different comparison functions (e.g. ascending vs. descending). If we combined them blindly, we’d break the data structure’s invariants. Instead of having this ambiguity, we can have an explicit outcome defined in the form of a witness – hence “singles out” the ordering function that defines the structure’s total order
That’s why Comparator Witnesses need to be used.
Values that witness a specific comparison function and carry its identity at the type level. This ensures that only maps and sets built with the same comparator witness can be combined, enforcing correctness through the type system.
This is a code example of that:
(* === init some modules that will be comparable, but use different comparators as the functor *)
(* here's the ordinary version: *)
let alist = ["foo", 0; "snoo", 3];;
let ord_map = Map.of_alist_exn (module String) alist;;
(* this Reverse module is to act as a parallel to the ordinary version (which is just from an alist) *)
module Reverse = struct
module T = struct
type t = string
let sexp_of_t = String.sexp_of_t
let t_of_sexp = String.t_of_sexp
let compare x y = String.compare y x
end
include T
include Comparator.Make(T)
end;;
(* reverse_map: *)
let rev_map = Map.of_alist_exn (module Reverse) alist;;
(* ==== demonstration that although the types of their items are the same (ints), the ordering is different *)
Map.min_elt ord_map;;
(* - : (string * int) option = Some ("foo", 0) *)
Map.min_elt rev_map;;
(* - : (string * int) option = Some ("snoo", 3) *)
(* ==== example intended use for the two maps: *)
Map.symmetric_diff ord_map rev_map;;
(* -- this will error out because the comparator witnesses are different
Line 1, characters 28-35:
Error: This expression has type
(string, int, Reverse.comparator_witness) Map.t
but an expression was expected of type
(string, int, String.comparator_witness) Map.t
Type Reverse.comparator_witness is not compatible with type
String.comparator_witness
*)
The Polymorphic Comparator #
We have the choice of using the polymorphic comparator that is builtin instead of having to implement it all the time for our custom modules. However, we should use it with knowledge of their nuances.
Map.Poly.of_alist_exn digit_alist;;
(* - : (int, string) Map.Poly.t = <abstr> *)
(* --- this will error out : *)
Map.symmetric_diff
(Map.Poly.singleton 3 "three")
(Map.singleton (module Int) 3 "four" );;
(*
Line 3, characters 5-43:
Error: This expression has type (int, string, Int.comparator_witness) Map.t
but an expression was expected of type
(int, string, Map.Poly.comparator_witness) Map.t
Type Int.comparator_witness is not compatible with type
Map.Poly.comparator_witness
*)
maps based on the polymorphic comparator have different comparator witnesses than those based on the type-specific comparison function
this is expected because there’s NO guarantee that the comparator associated with a given type will order things in the same way that polymorphic compare does.
the perils of polymorphic compare #
RULE OF THUMB: polymorphic compare should be AVOIDED in production code.
Problems:
compares directly on the runtime representation of OCaml values
So, it will walk the structure of those values regardless of their type \(\implies\) it’s type-oblivious
This means that it peeks under the ordinary abstraction boundaries which can lead to counter-intuitive outcomes:
(* === example: m1 and m2 should technically be seen as the SAME map *) let m1 = Map.of_alist_exn (module Int) [1, "one";2, "two"];; (* val m1 : (int, string, Int.comparator_witness) Map.t = <abstr> *) let m2 = Map.of_alist_exn (module Int) [2, "two";1, "one"];; (* val m2 : (int, string, Int.comparator_witness) Map.t = <abstr> *) (* ==== using a more idiomatic comparison (Map.equal) shows this equality *) Map.equal String.equal m1 m2;; (* - : bool = true *) (* ==== PROBLEM: polymorphic compare walks the datastructures which will have a different underlying layout of the underlying trees so Polymorphic compare will say they are not equal *) Poly.(m1 = m2);; (* --- note this errors out because maps store the comparison function they were created with and poly compare won't work on functions *) (* Exception: Invalid_argument "compare: functional value". *) (* ==== Polymorphic compare on the binary tree *) (* --- Map.Using_comparator.to_tree exposes the binary tree. *) Poly.((Map.Using_comparator.to_tree m1) = (Map.Using_comparator.to_tree m2));; (* - : bool = false *)polymorphic compare won’t work on functions (naturally). So when it does the datastructure-walk, if there are function stored then a (runtime) exception will be raised. This is clear when we directly poly-compare maps (example above).
The abstraction breaking can result in bugs that are subtle e.g. a map where keys are sets? then the map built with the polymorphic comparator behaves incorrectly and inconsistently and separates out the keys that should be aggregated together.
The inconsistency would be from differences in the order in which the sets were built.
Satisfying Comparator.S with [@@deriving] #
Using maps and sets on our custom type means that we need to satisfy Comparator.S:
- need sexp converters for the type
- need comparison functions for the type
We can rely on derivation annotations (Base provides this) instead of manually having to write this all the time (naturally we can always write them ourselves if we want some custom logic to be used).
(* ==== CASE 1: no derivation ppx used at all *)
module Book = struct
module T = struct
type t = { title: string; isbn: string }
let compare t1 t2 =
let cmp_title = String.compare t1.title t2.title in
if cmp_title <> 0 then cmp_title
else String.compare t1.isbn t2.isbn
let sexp_of_t t : Sexp.t =
List [ Atom t.title; Atom t.isbn ]
end
include T
include Comparator.Make(T)
end
(* ==== CASE 2: using ppx_sexp_conv and ppx_compare *)
#require "ppx_jane";;
module Book = struct
module T = struct
type t = { title: string; isbn: string }
[@@deriving compare, sexp_of] (* -- these are the annotations *)
end
include T
include Comparator.Make(T)
end;;
=, == and phys_equal #
There’s different notions of equality:
physically equal: if same pointer in memory
so 2 data structures with identical content but structured separately would have different pointers in memory and hence physically unequal.
polymorphic equal: it’s a structural equality
the 2 data structures would be equal structurally – values having the same contents
This kind of equality is the same polymorphic equals point we were talking about earlier in the previous section.
Not only does OCaml have multiple notions of equality (and picking between them can be tricky), the opened default library may default to a different kind of equality.
Core:
- physical equality:
== - polymorphic equality:
=
- physical equality:
Base:
- physical equality: can be done using
phys_equal - polymorphic equality: hidden –
==is deprecated =is type-specific equality (operator overloading based on what type is used)
ref 1 == ref 1;; (*-- this is deprecated*) phys_equal (ref 1) (ref 1) ;;- physical equality: can be done using
Applying [@@deriving] to Maps and Sets (special syntax) #
Previously we saw how we could use [@@deriving] annotation and ready our custom types to be used with maps/sets.
won’t work:
If we try to do that with maps directly, this won’t work. There’s no equivalent
Map.t_of_sexpfor maps. this is because there’s no proper way to serialise/deserialise the comparator witness using sexps.will work: using
Map.Mmap factory functor to define the type, which uses two-level functorization to encode key types and comparator witness.This is the map factory functor that fixes the key type and comparator witness once and for all. At point in time when the functor is applied, the key and comparator witness have already been closed over. This makes
Map.M(Key).tequivalent to('key, 'value, Key.comparator_witness), but written in a form that behaves like a partially applied functor over the key type.The signature for this functor looks something like this:
module Map.M(String) : sig type 'v t (* parameterised value type, the key type is already captured *) val t_of_sexp : (Sexp.t -> 'v) -> Sexp.t -> 'v t ... end
Summary:
Map.t= fully generic map type(key, value, comparator).Map.M(Key).t= specialized map type where the key and comparator modules are fixed, leaving only the value type parameter.
RULE OF THUMB: just use this functor (Map.M(Key).t)
(* === these ways won't work: *)
type string_int_map =
(string,int,String.comparator_witness) Map.t [@@deriving sexp];;
(*
Line 2, characters 44-49:
Error: Unbound value Map.t_of_sexp
Hint: Did you mean m__t_of_sexp?
*)
(* === this way will work because we use the Map.M functor: *)
type string_int_map = int Map.M(String).t [@@deriving sexp];;
(*
type string_int_map = int Base.Map.M(Base.String).t
val string_int_map_of_sexp : Sexp.t -> string_int_map = <fun>
val sexp_of_string_int_map : string_int_map -> Sexp.t = <fun>
*)
(* === demo: the type signature looks different but the meaning is same *)
let m = Map.singleton (module String) "one" 1;; (* -- creates map*)
(* val m : (string, int, String.comparator_witness) Map.t = <abstr> *)
(m : int Map.M(String).t);;
(* - : int Base.Map.M(Base.String).t = <abstr> *)
Trees – for space efficiency #
Maps carry within them the comparator they were created with, but we could wish for space efficiency and just keep hold of ONLY the tree.
The comparator can still be typed via a phantom type, even if it physically wouldn’t include the comparator anymore. The only thing is that we’d need to explicitly provide the comparator when we use this tree-form of the map.
Phantom Type:
we had seen this earlier in our little language example where we encoded extra compile-time information without affecting runtime representation. Now we benefit from its use elsewhere.
the type doesn’t correspond to any value directly represented in the underlying physical structure of the value.
i.e. it doesn’t show up in the definition of the type.
the type reflects something about the logic of the value of the question
let ord_tree = Map.Using_comparator.to_tree ord_map;; (* --- see how the phantom type shows up here: val ord_tree : (string, int, String.comparator_witness) Map.Using_comparator.Tree.t = <abstr> *) (* ==== example of how to use the map with only its tree available. *) Map.Using_comparator.Tree.find ~comparator:String.comparator ord_tree "snoo";; (* - : int option = Some 3 *) (* -- the phantom type helps us enforce the invariant: we're using the same comparator when looking pu a value as when we stored it. So this example will fail: *) Map.Using_comparator.Tree.find ~comparator:Reverse.comparator ord_tree "snoo";; (* Line 1, characters 63-71: Error: This expression has type (string, int, String.comparator_witness) Map.Using_comparator.Tree.t but an expression was expected of type (string, int, Reverse.comparator_witness) Map.Using_comparator.Tree.t Type String.comparator_witness is not compatible with type Reverse.comparator_witness *)
Hash Tables #
The imperative cousins of maps, they differ in the following ways:
they are mutable: adding a k-v pair modifies the existing table
have better time complexity than maps
it’s constant time lookup and modifications vs log time for maps
hash tables depend on hash function (key to integer), maps depend on comparator function
when creating hash tables, we don’t need any type witnesses (like comparator witness needed for maps/sets) \(\implies\) the syntax is easier
REASON: No Keeping the same hash function / comparator function preserves no special invariant in the case of Hash Tables
(* === custom types just need to satisfy the signature for hashtable key: Base.Hashtbl.Key.S *)
#show Base.Hashtbl.Key.S;;
(* module type S = Base__Hashtbl_intf.Key.S *)
(* === we can use derive annotation for the hash function, makes our custom type hashable. *)
module Book = struct
type t = { title: string; isbn: string }
[@@deriving compare, sexp_of, hash]
end;;
(*
module Book :
sig
type t = { title : string; isbn : string; }
val compare : t -> t -> int
val sexp_of_t : t -> Sexp.t
val hash_fold_t :
Base_internalhash_types.state -> t -> Base_internalhash_types.state
val hash : t -> int
end
*)
let table = Hashtbl.create (module Book);;
(* val table : (Book.t, '_weak2) Base.Hashtbl.t = <abstr> *)
We could also use OCaml’s polymorphic hash function. RULE OF THUMB: don’t use it in areas where correctness matters.
Time Complexity of Hash Tables #
There needs to be some clarification on the time complexity aspects described:
it’s amortized time complexity.
The “constant time lookup and update” is more of amortised costs actually. Like other languages, OCaml’s hash table does need resizing so some operations may get expensive.
the cost of the hash function matters
collision, for
Baseleads to the use of binary trees for the hash-bucket \(\implies\) if your hash function is sub-par and has many collisions then hash table use ends up being logarithmic time complexityNOTE: this is useful to protect against DDOS attacks!
The logarithmic behavior of Base’s hash tables in the presence of hash collisions also helps protect against some denial-of-service attacks. One well-known type of attack is to send queries to a service with carefully chosen keys to cause many collisions. This, in combination with the linear behavior of most hashtables, can cause the service to become unresponsive due to high CPU load. Base’s hash tables would be much less susceptible to such an attack because the amount of degradation would be far less.
Collisions with the Polymorphic Hash Function #
Like Polymorphic Compare, Polymorphic Hash doesn’t pay attention to the type and blindly walks down the structure of a data type, computing the hash from what it sees.
Problems:
So as long as physical equality is not there, even with equivalent data, it will deem hash tables as unequal
extra prone to creating hash collisions.
this is because it works by BFS traversal walking over the data structure. Also it is bounded in the number of nodes it’s willing to traverse – which defaults to 10 “meaningful” nodes.
the bound means that parts of the data structure may be ignored and we result in pathological cases where every value we store has the same hash value.
Hashtbl.Poly.hashable.hash (List.range 0 9);; (* - : int = 209331808 *) Hashtbl.Poly.hashable.hash (List.range 0 10);; (* - : int = 182325193 *) Hashtbl.Poly.hashable.hash (List.range 0 11);; (* - : int = 182325193 *) Hashtbl.Poly.hashable.hash (List.range 0 100);; (* - : int = 182325193 *)
RULE OF THUMB: for large custom data-structures, write your own hash function OR use [@@deriving]
Inline create hash function using ppx_hash #
SYNTACTIC SUGAR: We can just use %hash directly. It’s a shorter form of [@@deriving hash] – that’s what the previous example uses.
Choosing between Maps and Hash Tables #
They overlap in functionality so we need to decide when to use what?
idiom: follow the programming paradigm
immutable maps (for functional code) vs mutable hash tables (for imperative code)
idiom: focus on performance needs and data volume
For code that is dominated by updates and lookups, hash tables are a clear performance win, and the win is clearer the larger the amount of data.
Here’s some benchmarks:
Bench mark to demonstrate generally more efficient hashtables than maps
open Base open Core_bench let map_iter ~num_keys ~iterations = let rec loop i map = if i <= 0 then () else loop (i - 1) (Map.change map (i % num_keys) ~f:(fun current -> Some (1 + Option.value ~default:0 current))) in loop iterations (Map.empty (module Int)) let table_iter ~num_keys ~iterations = let table = Hashtbl.create (module Int) ~size:num_keys in let rec loop i = if i <= 0 then () else ( Hashtbl.change table (i % num_keys) ~f:(fun current -> Some (1 + Option.value ~default:0 current)); loop (i - 1)) in loop iterations let tests ~num_keys ~iterations = let t name f = Bench.Test.create f ~name in [ t "table" (fun () -> table_iter ~num_keys ~iterations) ; t "map" (fun () -> map_iter ~num_keys ~iterations) ] let () = tests ~num_keys:1000 ~iterations:100_000 |> Bench.make_command |> Command_unix.runthe dune file:
(executable (name map_vs_hash) (libraries base core_bench core_unix.command_unix))dune build map_vs_hash.exe ./_build/default/map_vs_hash.exe -ascii -quota 1 -clear-columns time speedup Estimated testing time 2s (2 benchmarks x 1s). Change using -quota SECS. Name Time/Run Speedup ------- ---------- --------- table 13.34ms 1.00 map 44.54ms 3.34Benchmark to demonstrate that when we need to keep multiple related copies (like snapshots) of data at once, maps are better because they’re immutable
open Base open Core_bench let create_maps ~num_keys ~iterations = let rec loop i map = if i <= 0 then [] else ( let new_map = Map.change map (i % num_keys) ~f:(fun current -> Some (1 + Option.value ~default:0 current)) in new_map :: loop (i - 1) new_map) in loop iterations (Map.empty (module Int)) let create_tables ~num_keys ~iterations = let table = Hashtbl.create (module Int) ~size:num_keys in let rec loop i = if i <= 0 then [] else ( Hashtbl.change table (i % num_keys) ~f:(fun current -> Some (1 + Option.value ~default:0 current)); let new_table = Hashtbl.copy table in new_table :: loop (i - 1)) in loop iterations let tests ~num_keys ~iterations = let t name f = Bench.Test.create f ~name in [ t "table" (fun () -> ignore (create_tables ~num_keys ~iterations)) ; t "map" (fun () -> ignore (create_maps ~num_keys ~iterations)) ] let () = tests ~num_keys:50 ~iterations:1000 |> Bench.make_command |> Command_unix.runexecutable (name map_vs_hash2) (libraries core_bench core_unix.command_unix))dune build map_vs_hash2.exe ./_build/default/map_vs_hash2.exe -ascii -clear-columns time speedup Estimated testing time 20s (2 benchmarks x 10s). Change using -quota SECS. Name Time/Run Speedup ------- ------------ --------- table 4_453.95us 25.80 map 172.61us 1.00