- Overview
- Design Goals
- Memory Acquisition Strategy
- Segregated Free Lists
- Block Management
- Allocation Algorithm
- Deallocation Algorithm
- Thread Safety
- Performance Characteristics
- Design Trade-offs
This custom memory allocator is a production-quality replacement for the standard C library's malloc/free/calloc/realloc functions. It demonstrates modern memory allocation techniques used in real-world allocators.
┌─────────────────────────────────────────────────────┐
│ Application Code │
└─────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Public API (allocator.h) │
│ mem_malloc() mem_free() mem_calloc() mem_realloc() │
└─────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Core Allocator (allocator.c) │
│ ┌────────────────┐ ┌────────────────┐ │
│ │ Segregated │ │ Block │ │
│ │ Free Lists │ │ Management │ │
│ │ (10 classes) │ │ (split/merge) │ │
│ └────────────────┘ └────────────────┘ │
└─────────────────────┬───────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────┐
│ Memory Acquisition Layer │
│ ┌──────────────┐ ┌──────────────┐ │
│ │ brk() │ │ mmap() │ │
│ │ Small allocs│ │ Large allocs│ │
│ │ < 128KB │ │ ≥ 128KB │ │
│ └──────────────┘ └──────────────┘ │
└─────────────────────────────────────────────────────┘
- Performance: Competitive with system malloc for common use cases
- Low Fragmentation: Minimize both internal and external fragmentation
- Predictability: Consistent performance characteristics
- Simplicity: Easy to understand and maintain
- Safety: Proper bounds checking and error handling
- Debuggability: Clear statistics and Valgrind compatibility
The allocator uses different strategies based on allocation size:
Uses brk() (via sbrk()) to expand the heap:
Advantages:
- Fast: Simple pointer arithmetic
- Good locality: Heap is contiguous
- Low overhead: No system call per allocation
Disadvantages:
- Can't release memory in the middle
- Potential for fragmentation at heap level
Implementation:
void* expand_heap(size_t size) {
size_t alloc_size = size < BRK_INCREMENT ? BRK_INCREMENT : align_size(size);
void* old_brk = sbrk(0);
void* new_brk = sbrk(alloc_size);
// ... create block header ...
}Uses mmap() for direct memory mapping:
Advantages:
- Can release immediately with
munmap() - No heap fragmentation
- Memory is zero-initialized by kernel
Disadvantages:
- System call overhead
- TLB pressure with many mappings
Implementation:
void* ptr = mmap(NULL, total_size, PROT_READ | PROT_WRITE,
MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);The 128KB threshold is chosen based on:
- Typical memory page size (4KB) × 32
- Balance between system call overhead and fragmentation
- Common allocation patterns in C programs
The allocator maintains 10 separate free lists:
| Class | Size Range | Target Use Cases |
|---|---|---|
| 0 | ≤ 32 B | Tiny objects, small strings |
| 1 | ≤ 64 B | Small structures, short strings |
| 2 | ≤ 128 B | Cache-line sized objects |
| 3 | ≤ 256 B | Medium structures |
| 4 | ≤ 512 B | Small buffers |
| 5 | ≤ 1 KB | Typical buffers |
| 6 | ≤ 2 KB | Network packets, file I/O |
| 7 | ≤ 4 KB | Page-sized buffers |
| 8 | ≤ 8 KB | Large buffers |
| 9 | > 8 KB | Very large allocations (pre-mmap) |
- Fast Search: Check appropriate size class first
- Reduced Fragmentation: Similar-sized objects grouped together
- Cache Efficiency: Better spatial locality
- Predictable Performance: O(n) where n is blocks in size class
typedef struct block_header {
size_t size; // Total block size including header
struct block_header* next; // Next in free list
struct block_header* prev; // Previous in free list
int is_free; // Allocation status
int is_mmap; // Allocation method
} block_header_t;
static block_header_t* free_lists[NUM_SIZE_CLASSES];Each allocation has a metadata header:
┌──────────────────────────────────────┐
│ size (8 bytes) │ Total size including header
├──────────────────────────────────────┤
│ next (8 bytes) │ Free list linkage
├──────────────────────────────────────┤
│ prev (8 bytes) │ Free list linkage
├──────────────────────────────────────┤
│ is_free (4 bytes) │ Status flags
├──────────────────────────────────────┤
│ is_mmap (4 bytes) │ Allocation method
├══════════════════════════════════════┤
│ │
│ User Data │
│ (size - sizeof(block_header_t)) │
│ │
└──────────────────────────────────────┘
Header Size: 40 bytes on 64-bit systems
All allocations are aligned to 16-byte boundaries:
static inline size_t align_size(size_t size) {
return (size + ALIGNMENT - 1) & ~(ALIGNMENT - 1);
}Benefits:
- SIMD instruction compatibility
- Cache line alignment
- Required by some architectures
When allocating from a larger free block:
Before:
┌────────────────────────────────────────────┐
│ Free Block (size = 1024) │
└────────────────────────────────────────────┘
After (allocating 128 bytes):
┌──────────────────┐┌───────────────────────┐
│ Allocated (128) ││ Free (896) │
└──────────────────┘└───────────────────────┘
Conditions for splitting:
if (block->size >= total_size + sizeof(block_header_t) + MIN_BLOCK_SIZE)Must have enough space for:
- Requested allocation + header
- New header for remainder
- Minimum usable space (32 bytes)
When freeing a block, merge with adjacent free blocks:
Before:
┌──────────┐┌──────────┐┌──────────┐┌──────────┐
│ Allocated││ Free ││ Free ││ Allocated│
└──────────┘└──────────┘└──────────┘└──────────┘
After (freeing first allocated block):
┌────────────────────────────────────┐┌──────────┐
│ Free (coalesced) ││ Allocated│
└────────────────────────────────────┘└──────────┘
Algorithm:
block_header_t* coalesce(block_header_t* block) {
void* block_end = (void*)block + block->size;
block_header_t* next_block = (block_header_t*)block_end;
if (next_block is valid and free) {
remove_from_free_list(next_block);
block->size += next_block->size;
return coalesce(block); // Recursive
}
return block;
}START
│
├─ size == 0? ─→ return NULL
│
├─ Align size and add header overhead
│
├─ size >= MMAP_THRESHOLD?
│ │
│ YES ─→ mmap() ─→ return ptr + header_size
│ │
│ NO
│ │
│ ├─ Find free block in appropriate size class
│ │
│ ├─ Found?
│ │ │
│ │ NO ─→ expand_heap() ─→ Get new block
│ │ │
│ │ YES ─→ Use existing block
│ │
│ ├─ Remove from free list
│ │
│ ├─ Split if too large
│ │
│ ├─ Mark as allocated
│ │
│ └─ return ptr + header_size
- Calculate size class:
class = get_size_class(size) - Search from
classtoNUM_SIZE_CLASSES - 1 - Return first block where
block->size >= size - If no block found, expand heap
This allocator uses First Fit with segregated lists:
Advantages:
- Faster: O(1) to O(n) where n is small
- Simpler implementation
- Segregation provides implicit "good fit"
Disadvantages:
- May not find optimal block
- Can cause fragmentation in large class
START
│
├─ ptr == NULL? ─→ return
│
├─ Get block header: block = ptr - sizeof(header)
│
├─ is_mmap?
│ │
│ YES ─→ munmap(block, block->size) ─→ return
│ │
│ NO
│ │
│ ├─ Mark as free
│ │
│ ├─ Coalesce with adjacent free blocks
│ │
│ └─ Add to appropriate free list
Current implementation: Forward coalescing only
- Merges with blocks that come after in memory
- Simpler than bidirectional
- Still effective at reducing fragmentation
Future improvement: Bidirectional coalescing
- Would require footer in each block
- +8 bytes overhead per allocation
- Better fragmentation reduction
void* mem_malloc(size_t size) {
// Direct implementation
// No locking
}Use when:
- Single-threaded applications
- Manual synchronization at higher level
- Performance is critical
static pthread_mutex_t allocator_mutex = PTHREAD_MUTEX_INITIALIZER;
void* mem_malloc_ts(size_t size) {
pthread_mutex_lock(&allocator_mutex);
void* ptr = mem_malloc(size);
pthread_mutex_unlock(&allocator_mutex);
return ptr;
}Characteristics:
- Global mutex: Simple but coarse-grained
- All allocations are serialized
- ~10-20% performance overhead
Future improvement: Per-thread caches
- Thread-local free lists
- Reduce contention
- Complex implementation
| Operation | Best Case | Average Case | Worst Case |
|---|---|---|---|
| malloc | O(1) | O(n) | O(n) |
| free | O(1) | O(1) | O(1) |
| calloc | O(n) | O(n) | O(n) |
| realloc | O(1) | O(n) | O(n) |
Where n = number of free blocks in searched size classes
Overhead per allocation:
- Header: 40 bytes
- Alignment: up to 15 bytes
- Total: ~40-55 bytes
Efficiency:
- Small allocations (64B): ~60-85% efficient
- Medium allocations (1KB): ~95% efficient
- Large allocations (>8KB): ~99% efficient
Internal fragmentation:
- Minimized by splitting
- Alignment padding: ≤15 bytes
- Header overhead: Fixed 40 bytes
External fragmentation:
- Reduced by coalescing
- Segregated bins help
- Can accumulate in long-running programs
Chosen: Segregated Bins
Pros:
- Faster allocation (smaller search space)
- Better locality
- Implicit size-based best-fit
Cons:
- More complex
- 10 pointers of overhead
- Possible fragmentation between classes
Chosen: Forward-Only
Pros:
- Simpler implementation
- Less memory overhead (no footer)
- Still reduces most fragmentation
Cons:
- Misses some coalescing opportunities
- Can't merge with previous block
Chosen: Global Mutex (for thread-safe version)
Pros:
- Simple implementation
- No deadlock risk
- Easy to reason about
Cons:
- Serializes all operations
- Scalability bottleneck
- Not optimal for many threads
Rationale:
- System call overhead amortized
- Page size × 32 (good for most systems)
- Avoids heap fragmentation for large objects
Trade-off:
- Lower threshold: Less heap fragmentation, more system calls
- Higher threshold: Fewer system calls, more heap fragmentation
Could be reduced to 24 bytes:
struct {
size_t size; // 8 bytes
void* next; // 8 bytes
uint32_t flags; // 4 bytes (is_free, is_mmap, etc.)
uint32_t pad; // 4 bytes (alignment)
}; // 24 bytesKept at 40 bytes for:
- Clearer code
- Easier debugging
- Double-linked lists (easier removal)
This allocator demonstrates key concepts used in production allocators:
- Hybrid acquisition: brk for small, mmap for large
- Segregated storage: Multiple size classes
- Defragmentation: Splitting and coalescing
- Thread safety: Optional mutex protection
- Performance: Competitive with system malloc
The design prioritizes clarity and correctness over maximum performance, making it an excellent educational example and a solid foundation for further optimization.