
OptiMap: A Fast C++ Hash Map

optimap is a high-performance, open-addressing, general-purpose hash map for C++ that implements the Swiss Table design pattern. It leverages a memory layout friendly to modern CPUs and SIMD instructions for accelerated lookups.

This document details design choices, performance optimizations, and lessons learned during the development of optimap.

Prerequisites

  • CMake $\geq$ 4.0
  • A C++23-compatible compiler (I used clang)

Relevant Files

  • include/hashmap.hpp is the main hash map implementation
  • include/gxhash.hpp contains the hash function
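
For orientation, here is a minimal usage sketch. The optimap::HashMap name and namespace are assumptions (the README does not show the API), and the interface is assumed to mirror std::unordered_map; check include/hashmap.hpp for the real class name and methods.

#include <cstdint>
#include <iostream>
#include <string>

#include "hashmap.hpp"  // main hash map implementation from this repo

int main() {
    // Hypothetical class name; the actual type lives in include/hashmap.hpp.
    optimap::HashMap<std::string, std::uint64_t> counts;

    counts.insert({"apples", 3});
    counts["oranges"] = 7;  // operator[] assumed, as in std::unordered_map

    if (auto it = counts.find("apples"); it != counts.end()) {
        std::cout << it->first << " -> " << it->second << '\n';
    }

    counts.erase("oranges");
}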

Build

Main Build

mkdir build && cd build
cmake ..
make

Run Tests

Run the Main Build steps above, then:

ctest -V

Benchmarks

Here lie the performance and memory usage benchmark plots for OptiMap compared to std::unordered_map and ankerl::unordered_dense::map. The latter two use various hash functions, while OptiMap uses its own C++ port of gxhash.

Most of the benchmarks are taken from this repo, which I found through this article.

Highlights

  • Memory Efficiency: OptiMap rivals top hash maps in memory usage at scale.

  • Performance Gains Over std::unordered_map: OptiMap is nearly 4 times faster than std::unordered_map across a wide range of large-scale operations, including insertion, deletion, and copying.

    • I am aware that std::unordered_map is not very performant compared with bespoke unordered maps like ankerl, google::dense_hash_map, and others, but it is a good standard for establishing a widely understood performance baseline.
  • Fast clear() Operation: Clearing a map with 100 million elements takes just ~17 milliseconds with OptiMap, orders of magnitude faster than std::unordered_map's ~13 seconds and competitive with ankerl's ~4 milliseconds (see the sketch after this list).

  • Instantaneous Construction & Destruction: Creating and destroying empty maps is effectively free in OptiMap, registering zero time and zero memory overhead, matching the behavior of std::unordered_map.
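
One plausible reason clear() can be this fast — an assumption about the implementation, not something this README confirms — is that for trivially destructible entries, clearing only needs to reset the control-byte array rather than visit every bucket. A minimal sketch of that idea, using the EMPTY encoding described in the Architecture section:

#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

inline constexpr std::int8_t kEmpty = static_cast<std::int8_t>(0b10000000);

// Hypothetical sketch: reset metadata in one memset when the stored type
// needs no destructor; otherwise destroy the occupied entries first.
template <typename Entry>
void clear_slots(std::int8_t* ctrl, Entry* buckets, std::size_t capacity) {
    if constexpr (!std::is_trivially_destructible_v<Entry>) {
        for (std::size_t i = 0; i < capacity; ++i) {
            if (ctrl[i] >= 0) {  // MSB clear => FULL slot
                buckets[i].~Entry();
            }
        }
    }
    std::memset(ctrl, kEmpty, capacity);  // one pass over 1 byte per slot
}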

Performance Plots

Speed of various operations; lower is better. Plots cover Copy, CtorDtorEmptyMap, CtorDtorSingleEntryMap, InsertHugeInt, IterateIntegers, RandomDistinct2, RandomFind (200, 2000, and 500000), and RandomFindString (including a 1000000-element variant).

Memory Usage Plots

Memory consumption for the same set of operations; lower is better.

Architecture & Design

optimap is an open-addressing hash map engineered from first principles of data-oriented design. The architecture prioritizes CPU cache efficiency and instruction-level parallelism to achieve maximum throughput. Every design choice is intended to minimize cache misses, reduce branch mispredictions, and leverage modern CPU features like SIMD and AES-NI.

The core of optimap is a C++ implementation of the Swiss Table design, which decouples slot metadata from the key-value entries. This separation allows the algorithm to operate on a compact, cache-friendly metadata array for most of a probe sequence, deferring expensive access to the main bucket array until a potential match is identified. I defer further technical detail about the Swiss Table structure to this link, which is the repo of the original implementation; it is interesting and worth reading.
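
A minimal sketch of the decoupled layout. The member names m_ctrl and m_buckets come from the sections below; everything else is illustrative rather than the map's actual definition.

#include <cstddef>
#include <cstdint>
#include <utility>

// Illustrative skeleton only: one byte of metadata per slot, kept in a
// separate array from the key-value entries so probing stays in cache.
template <typename K, typename V>
struct SwissTableLayout {
    std::int8_t* m_ctrl;         // 1 control byte per slot (H2 + state flag)
    std::pair<K, V>* m_buckets;  // full entries, touched only on a match
    std::size_t m_capacity;      // power of two, so index = hash & (capacity - 1)
};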

Core Mechanics & Performance Optimizations

SIMD-Accelerated Swiss Table Probing

The lookup process is heavily optimized via SIMD instructions; it processes 16 slots in parallel instead of naively probing one slot at a time.

  • Decoupled Metadata Array (m_ctrl): The map maintains a contiguous int8_t array where each byte corresponds to a slot in the main bucket array. This array is cache-friendly; a single 64-byte cache line holds the metadata for 64 slots.

  • H1/H2 Hash Partitioning: A 64-bit hash is partitioned into two components:

    • H1 (Lower): Determines the starting group index for a probe sequence (hash & (capacity - 1))
    • H2 (Upper): A "fingerprint" of the hash stored in the metadata array. The $8^{th}$ bit (MSb) is reserved as a state flag, where a 1 indicates an EMPTY (0b10000000) or DELETED (0b11111110) slot. 0 indicates a FULL slot.
  • Parallel Lookup with SSE2: The probing mechanism is executed with SSE2 intrinsics:

    • A 16-byte chunk of the metadata array is loaded into a __m128i register.
    • The target H2 fingerprint is broadcast across another __m128i register using _mm_set1_epi8.
    • A single _mm_cmpeq_epi8 instruction performs a parallel byte-wise comparison, identifying all slots in the group that match the H2 fingerprint.
    • The 128-bit result is compacted into a 16-bit bitmask via _mm_movemask_epi8.

If this bitmask is non-zero, it signifies one or more potential matches. The algorithm then uses a bit-scan intrinsic (__builtin_ctz or _BitScanForward) to identify the index of each potential match, and only then performs the more expensive full-key comparison against the main bucket array. This strategy filters out the majority of non-matching slots using a handful of efficient CPU instructions, as the sketch below illustrates.
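
A minimal sketch of one probe group under these rules, assuming SSE2 and the control-byte encoding above. The function names and exact bit widths of the H1/H2 split are illustrative assumptions, not the map's actual internals.

#include <cstddef>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// H1 selects the starting index; H2 is a 7-bit fingerprint stored per slot
// (MSB kept clear so a FULL slot never collides with EMPTY/DELETED states).
inline std::size_t h1(std::uint64_t hash, std::size_t capacity) {
    return hash & (capacity - 1);
}
inline std::int8_t h2(std::uint64_t hash) {
    return static_cast<std::int8_t>(hash >> 57);  // top 7 bits, MSB clear
}

// Build a 16-bit mask whose bit i is set when ctrl[i] equals the H2 byte.
inline std::uint16_t match_group(const std::int8_t* ctrl, std::int8_t fingerprint) {
    const __m128i group  = _mm_loadu_si128(reinterpret_cast<const __m128i*>(ctrl));
    const __m128i needle = _mm_set1_epi8(fingerprint);     // broadcast H2
    const __m128i eq     = _mm_cmpeq_epi8(group, needle);  // 16 byte compares at once
    return static_cast<std::uint16_t>(_mm_movemask_epi8(eq));
}

// Visit each candidate slot, lowest index first, via a bit scan.
template <typename F>
inline void for_each_candidate(std::uint16_t mask, F&& full_key_compare) {
    while (mask != 0) {
        full_key_compare(__builtin_ctz(mask));  // expensive check only on a match
        mask &= mask - 1;                       // clear the lowest set bit
    }
}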

Memory Layout and Data Locality

The memory layout is optimized to prevent pipeline stalls and maximize data locality.

  • Contiguous Allocation: The m_ctrl metadata, m_buckets key-value entries, and m_group_mask iteration acceleration mask are allocated in a single, contiguous memory block. This reduces allocation overhead and ensures that all components of the hash map are physically co-located, maximizing the utility of the CPU's prefetcher (see the sketch after this list).
  • Cache-Line Alignment: The block is aligned to a 64-byte boundary. This guarantees that a 16-byte metadata group can never be split across two cache lines. This prevents alignment-related stalls during SIMD load operations.
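
A minimal sketch of the single-block carve-up, using C++17 aligned operator new. The region order, padding scheme, and names are assumptions for illustration; only the three components and the 64-byte alignment come from the points above.

#include <cstddef>
#include <cstdint>
#include <new>

// Illustrative only: carve metadata, buckets, and the iteration mask out
// of one 64-byte-aligned block instead of three separate allocations.
struct Block {
    std::int8_t* ctrl;
    std::byte* buckets;
    std::uint64_t* group_mask;
    void* raw;
};

inline Block allocate_block(std::size_t capacity, std::size_t bucket_size,
                            std::size_t mask_words) {
    // Pad each region to 64 bytes so the next one starts cache-line aligned.
    auto pad = [](std::size_t n) { return (n + 63) & ~std::size_t{63}; };

    const std::size_t ctrl_bytes   = capacity;  // 1 control byte per slot
    const std::size_t bucket_bytes = capacity * bucket_size;
    const std::size_t mask_bytes   = mask_words * sizeof(std::uint64_t);

    const std::size_t total = pad(ctrl_bytes) + pad(bucket_bytes) + mask_bytes;
    void* raw = ::operator new(total, std::align_val_t{64});

    auto* base = static_cast<std::byte*>(raw);
    Block b;
    b.raw        = raw;
    b.ctrl       = reinterpret_cast<std::int8_t*>(base);
    b.buckets    = base + pad(ctrl_bytes);
    b.group_mask = reinterpret_cast<std::uint64_t*>(base + pad(ctrl_bytes) + pad(bucket_bytes));
    return b;
}
// Free with: ::operator delete(b.raw, std::align_val_t{64});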

gxhash: Hardware-Accelerated Hashing

Oliver Giniaux wrote the original implementation in Rust; I ported it to C++. gxhash features a hardware-accelerated path using AES-NI CPU instructions, ensuring fast hash generation that minimizes collisions.

  • AES-NI Instruction Set: gxhash leverages the AES instruction set (AES-NI), hardware support for AES encryption and decryption available on most modern x86 CPUs. These instructions can be repurposed as a powerful permutation and diffusion function for hashing. An AES round is effectively a high-quality, hardware-accelerated mixing function that is significantly faster than traditional integer multiplication and bit-rotation operations.
  • Runtime CPU Dispatching: To maintain portability, gxhash performs runtime feature detection. It queries the CPU to determine if AES-NI is supported and dispatches to the hardware-accelerated implementation if available. Otherwise, it falls back to a portable (but slower) hashing algorithm.
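
A minimal sketch of the dispatch idea for x86-64 GCC/Clang. The actual gxhash port may detect features differently; __builtin_cpu_supports is a compiler builtin, the AES round shown is just the mixing primitive rather than the full hash, and the portable mixer below is a generic multiply-xorshift stand-in, not gxhash's real fallback.

#include <cstdint>
#include <immintrin.h>  // AES-NI intrinsics (x86)

// One AES round used purely as a fast mixing/diffusion primitive.
__attribute__((target("aes")))
static std::uint64_t mix_aesni(std::uint64_t x) {
    __m128i v = _mm_set1_epi64x(static_cast<long long>(x));
    v = _mm_aesenc_si128(v, _mm_set1_epi64x(0x9E3779B97F4A7C15LL));
    return static_cast<std::uint64_t>(_mm_cvtsi128_si64(v));
}

// Portable fallback mixer (a generic multiply-xorshift, not gxhash's).
static std::uint64_t mix_portable(std::uint64_t x) {
    x ^= x >> 33;
    x *= 0xFF51AFD7ED558CCDULL;
    x ^= x >> 33;
    return x;
}

// Resolve once at startup: take the AES-NI path only if the CPU reports it.
static std::uint64_t (*const mix)(std::uint64_t) =
    __builtin_cpu_supports("aes") ? &mix_aesni : &mix_portable;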

What I Learned

$$\{\text{this whole project}\}\setminus\{\text{C++ syntax, template programming, other minor C++ details}\}$$
