optimap is a high-performance, open-addressing, general-purpose hash map for C++ that implements the Swiss Table design pattern. It leverages a memory layout friendly to modern CPUs and SIMD instructions for accelerated lookups.
This document details design choices, performance optimizations, and lessons learned during the development of optimap.
- CMake $\geq$ 4.0
- A C++23-compatible compiler (I used clang)
- `include/hashmap.hpp` is the main hash map implementation
- `include/gxhash.hpp` contains the hash function
```
mkdir build && cd build
cmake ..
make
```

That is the main build. Then, to run the tests:

```
ctest -V
```

Here lie the performance and memory usage benchmark plots for OptiMap compared to `std::unordered_map` and `ankerl::unordered_dense::map`. Those two maps use various hash functions, while OptiMap uses its own gxhash-based hash function.
Most of the benchmarks are taken from this repo, which I found from this article.
- **Memory Efficiency:** OptiMap rivals top hash maps in memory usage at scale.
- **Performance Gains Over `std::unordered_map`:** OptiMap is nearly 4 times faster than `std::unordered_map` across a wide range of large-scale operations, including insertion, deletion, and copying.
  - I am aware that `std::unordered_map` is not very performant compared to bespoke unordered maps like `ankerl` or `google::dense_hash_map`, but it is a good standard for establishing a widely understood performance baseline.
- **Fast `clear()` Operation:** Clearing a map with 100 million elements takes just ~17 milliseconds with OptiMap, orders of magnitude faster than `std::unordered_map`'s ~13 seconds and competitive with `ankerl`'s ~4 milliseconds.
- **Instantaneous Construction & Destruction:** Creating and destroying empty maps is effectively free in OptiMap, registering zero time and zero memory overhead, matching the behavior of `std::unordered_map`.
Peruse the performance plots below by clicking the dropdowns.
Memory Usage Plots: memory consumption for various operations (lower is better).
optimap is an open-addressing hash map engineered from first principles of data-oriented design. The architecture prioritizes CPU cache efficiency and instruction-level parallelism to achieve maximum throughput. Every design choice is intended to minimize cache misses, reduce branch mispredictions, and leverage modern CPU features like SIMD and AES-NI.
The core of optimap is a C++ implementation of the Swiss Table design, which decouples slot metadata from the key-value entries. This separation allows the algorithm to operate on a compact, cache-friendly metadata array for most of a probe sequence, deferring expensive access to the main bucket array until a potential match is identified. For further technical detail about the Swiss Table structure, see this link, the repo of the original implementation; it is well worth reading.
The lookup process is heavily optimized via SIMD instructions; it compares 16 slots in parallel rather than probing them one at a time.
- **Decoupled Metadata Array (`m_ctrl`):** The map maintains a contiguous `int8_t` array where each byte corresponds to a slot in the main bucket array. This array is cache-friendly; a single 64-byte cache line holds the metadata for 64 slots.
- **`H1`/`H2` Hash Partitioning:** A 64-bit hash is partitioned into two components:
  - `H1` (Lower): Determines the starting group index for a probe sequence (`hash & (capacity - 1)`).
  - `H2` (Upper): A "fingerprint" of the hash stored in the metadata array. The $8^{th}$ bit (MSB) is reserved as a state flag: a `1` indicates an EMPTY (`0b10000000`) or DELETED (`0b11111110`) slot, while a `0` indicates a FULL slot.
- **Parallel Lookup with SSE2:** The probing mechanism is executed with SSE2 intrinsics:
  - A 16-byte chunk of the metadata array is loaded into a `__m128i` register.
  - The target `H2` fingerprint is broadcast across another `__m128i` register using `_mm_set1_epi8`.
  - A single `_mm_cmpeq_epi8` instruction performs a parallel byte-wise comparison, identifying all slots in the group that match the `H2` fingerprint.
  - The 128-bit result is compacted into a 16-bit bitmask via `_mm_movemask_epi8`.
If this bitmask is non-zero, it signifies one or more potential matches. The algorithm then uses a bit-scan intrinsic (`__builtin_ctz` or `_BitScanForward`) to identify the index of each candidate, and only then performs the more expensive full-key comparison against the main bucket array. This strategy filters out the majority of non-matching slots using a handful of efficient CPU instructions.
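To make this concrete, below is a minimal, self-contained sketch of the group-matching step under the assumptions above. The names (`h1`, `h2`, `group_match`, `probe_group`, `kEmpty`, `kDeleted`) are illustrative and are not the exact identifiers used in `include/hashmap.hpp`.

```cpp
#include <cstddef>
#include <cstdint>
#include <emmintrin.h>   // SSE2 intrinsics

// Illustrative control-byte sentinels: the MSB doubles as the "not FULL" flag.
constexpr std::int8_t kEmpty   = static_cast<std::int8_t>(0b10000000);
constexpr std::int8_t kDeleted = static_cast<std::int8_t>(0b11111110);

// H1: starting group index (capacity is assumed to be a power of two).
inline std::size_t h1(std::uint64_t hash, std::size_t capacity) {
    return hash & (capacity - 1);
}

// H2: 7-bit fingerprint stored in the control byte; MSB clear marks a FULL slot.
inline std::int8_t h2(std::uint64_t hash) {
    return static_cast<std::int8_t>((hash >> 57) & 0x7F);
}

// Compare a 16-slot group of control bytes against the fingerprint and
// return a 16-bit mask with one bit per slot whose H2 matched.
inline unsigned group_match(const std::int8_t* ctrl, std::int8_t fingerprint) {
    __m128i group  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ctrl));
    __m128i target = _mm_set1_epi8(fingerprint);
    return static_cast<unsigned>(_mm_movemask_epi8(_mm_cmpeq_epi8(group, target)));
}

// Probe one group: visit candidate slots in ascending order and run the
// full-key comparison (supplied by the caller) only on H2 matches.
template <class KeyEq>
int probe_group(const std::int8_t* ctrl, std::int8_t fingerprint, KeyEq key_matches) {
    unsigned mask = group_match(ctrl, fingerprint);
    while (mask != 0) {
        int slot = __builtin_ctz(mask);      // index of the lowest set bit
        if (key_matches(slot)) return slot;  // expensive check hits the bucket array
        mask &= mask - 1;                    // clear that bit and continue
    }
    return -1;                               // no match in this group
}
```

Clearing the lowest set bit with `mask &= mask - 1` lets the loop visit each candidate exactly once, in ascending slot order, without re-scanning the register.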
The memory layout is optimized to prevent pipeline stalls and maximize data locality.
- **Contiguous Allocation:** The `m_ctrl` metadata, `m_buckets` key-value entries, and `m_group_mask` iteration-acceleration mask are allocated in a single, contiguous memory block (see the sketch after this list). This reduces allocation overhead and ensures that all components of the hash map are physically co-located, maximizing the utility of the CPU's prefetcher.
- **Cache-Line Alignment:** The block is aligned to a 64-byte boundary. This guarantees that a 16-byte metadata group can never be split across two cache lines, preventing alignment-related stalls during SIMD load operations.
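As an illustration only, a single-block, 64-byte-aligned allocation could look like the sketch below. The member names mirror the ones above, but the sizing and rounding logic, the raw-byte bucket storage, and the assumption that `m_group_mask` holds one bit per 16-slot group are guesses, not the actual implementation.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// One backing block laid out as [ m_ctrl | m_buckets | m_group_mask ].
struct MapStorage {
    std::int8_t*   m_ctrl       = nullptr;  // 1 control byte per slot
    std::byte*     m_buckets    = nullptr;  // key-value entries (raw storage here)
    std::uint64_t* m_group_mask = nullptr;  // assumed: 1 bit per 16-slot group
    void*          m_block      = nullptr;  // the single backing allocation

    static std::size_t round_up(std::size_t n, std::size_t a) {
        return (n + a - 1) / a * a;
    }

    void allocate(std::size_t capacity, std::size_t bucket_size) {
        const std::size_t ctrl_bytes   = round_up(capacity, 64);
        const std::size_t bucket_bytes = round_up(capacity * bucket_size, 64);
        const std::size_t groups       = capacity / 16;
        const std::size_t mask_bytes   = round_up((groups + 63) / 64 * 8, 64);

        // Single 64-byte-aligned allocation: a 16-byte metadata group can
        // never straddle a cache line, and all regions stay co-located.
        m_block = ::operator new(ctrl_bytes + bucket_bytes + mask_bytes,
                                 std::align_val_t{64});
        auto* p      = static_cast<std::byte*>(m_block);
        m_ctrl       = reinterpret_cast<std::int8_t*>(p);
        m_buckets    = p + ctrl_bytes;
        m_group_mask = reinterpret_cast<std::uint64_t*>(p + ctrl_bytes + bucket_bytes);
    }

    void deallocate() {
        ::operator delete(m_block, std::align_val_t{64});
        m_block = nullptr;
    }
};
```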
Olivier Giniaux wrote the original gxhash implementation in Rust; I ported it to C++. gxhash features a hardware-accelerated path using AES-NI CPU instructions, providing fast hash generation that minimizes collisions.
- **AES-NI Instruction Set:** `gxhash` leverages the AES instruction set (AES-NI), hardware support for AES encryption and decryption available on most modern x86 CPUs. These instructions can be repurposed as a powerful permutation and diffusion function for hashing. An AES round is effectively a high-quality, hardware-accelerated mixing function that is significantly faster than traditional integer multiplication and bit-rotation operations.
- **Runtime CPU Dispatching:** To maintain portability, `gxhash` performs runtime feature detection. It queries the CPU to determine if AES-NI is supported and dispatches to the hardware-accelerated implementation if available; otherwise, it falls back to a portable (but slower) hashing algorithm.
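The sketch below illustrates both ideas, but it is not the gxhash code: the round keys and the portable mixer are placeholders, the function names are invented, and `__builtin_cpu_supports` is the GCC/Clang feature-detection builtin used here purely as an example of runtime dispatch on x86-64.

```cpp
#include <cstdint>
#include <cstring>
#include <immintrin.h>   // AES-NI / SSE2 intrinsics

// Hardware path: one or two AES rounds act as a fast, high-quality mixer.
__attribute__((target("aes,sse2")))
static std::uint64_t mix_aesni(std::uint64_t state, std::uint64_t input) {
    __m128i s = _mm_set_epi64x(static_cast<long long>(state),
                               static_cast<long long>(input));
    s = _mm_aesenc_si128(s, _mm_set1_epi64x(0x9E3779B97F4A7C15LL)); // placeholder key
    s = _mm_aesenc_si128(s, _mm_setzero_si128());
    std::uint64_t out;
    std::memcpy(&out, &s, sizeof(out));   // take the low 64 bits of the state
    return out;
}

// Portable fallback: multiply-xorshift mixing (placeholder, not gxhash's fallback).
static std::uint64_t mix_portable(std::uint64_t state, std::uint64_t input) {
    std::uint64_t x = state ^ input;
    x *= 0x9E3779B97F4A7C15ULL;  x ^= x >> 32;
    x *= 0xD6E8FEB86659FD93ULL;  x ^= x >> 32;
    return x;
}

// Runtime dispatch: probe CPUID once and pick the implementation.
using MixFn = std::uint64_t (*)(std::uint64_t, std::uint64_t);

static MixFn select_mixer() {
    return __builtin_cpu_supports("aes") ? &mix_aesni : &mix_portable;
}

static const MixFn g_mix = select_mixer();  // resolved once at startup
```

Dispatching through a function pointer chosen once at startup keeps the per-hash cost of the feature check at zero while letting the same binary run on CPUs without AES-NI.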