Skip to content

Implementation & Design Notes

io.shift edited this page Aug 4, 2025 · 1 revision

This page covers the decisions made during the development of cl-freelock.

Bounded Queue

The MPMC bounded queue was the most complex data structure to implement correctly. An initial, intuitive approach based on checking for an empty slot and then claiming it with a Compare-And-Swap (CAS) operation proved to be vulnerable to a livelock under high contention. This was discovered when a 2-producer, 2-consumer benchmark would hang indefinitely.

To solve this, we implemented a provably correct, livelock-free algorithm based on the work of Dmitry Vyukov. The Vyukov MPMC algorithm uses an auxiliary sequence array that runs in parallel with the main data buffer. Each slot in the queue has an associated sequence number that acts as a state machine, guaranteeing that a thread can only access a slot when it is in the correct state for its operation (e.g., ready for a specific push, or a specific pop). This eliminates the "check-then-act" race condition that caused the original livelock.

The SPSC Queue and Memory Barriers

The SPSC queue achieves its immense speed by avoiding expensive CAS operations on its hot path. Since only one thread ever modifies the head and one thread ever modifies the tail, we don't need atomic operations to protect them from each other.

However, we still need to ensure that memory writes made by the producer are visible to the consumer in the correct order. Modern CPUs, fortunately and unfortunately aggressively reorder memory operations. To prevent this from breaking our logic, we use memory barriers (also sometimes referred to as “memory fences”).

A memory barrier forces the CPU to complete all memory operations before the barrier before it begins any operations after the barrier. In our SPSC queue:

  • The producer writes the data, then issues a barrier, then updates the tail pointer.
  • The consumer reads the tail pointer, then issues a barrier, then reads the data.

This guarantees that the consumer will always see the correct data.

During development, we discovered a version-specific bug in the sb-thread:barrier macro in the target SBCL environment. We worked around this by implementing the SBCL barrier using a dummy CAS operation on a global variable ((cas *dummy* nil nil)), which is guaranteed by the x86-64 architecture to act as a full memory fence.

The :cl-freelock-single-threaded Optimization

The unbounded queue is based on the classic Michael-Scott algorithm. In its multi-threaded form, the queue-push operation contains "helping" logic. If a thread sees that a previous thread was slow to update the global tail pointer, it will finish the job for them before proceeding. This is crucial for correctness and throughput in a multi-producer scenario.

However, this helping logic is pure overhead in a single-producer, single-consumer (1P/1C) scenario. By adding :cl-freelock-single-threaded to the features, before loading the library, you enable a specialized version of the unbounded queue where this helping logic is compiled out. This allows the compiler to heavily inline the code, resulting in a significant performance boost for the 1P/1C use case.

Clone this wiki locally