This guide explains how to analyze memory usage in the cachesim project using Valgrind's massif tool and identifies critical memory overhead areas as potential optimization targets. The workflow consists of the following steps:
- Heap memory profiling
- Identifying memory hotspots
- Optimization opportunities
- Compile with debug symbols:
# Compile with debug symbols
mkdir _build && cd _build
cmake -G Ninja .. -DCMAKE_BUILD_TYPE=Debug
ninja- Run
massiffor a specific algorithm (e.g., S3-FIFO):
valgrind --tool=massif --massif-out-file=massif.out \
./bin/cachesim /path/to/wiki_2019t.oracleGeneral oracleGeneral s3fifo 1gb- Generate a human-readable report:
ms_print ./massif.out > massif.resultThe generated report primarily includes a bar chart of memory usage (with instructions executed as the x-axis) and several heap profile snapshots. Some snapshots display detailed function call relationships showing how memory was allocated. Below is an example of such a report:
MB
519.4^ :
|#:::::::::::::@::::::::::::::::@::@@::::::::::::::::::::::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
|#::: :: :: :::@: ::::: : ::: ::@::@ ::: :: : : ::::::: :::::@::@::@::@::
0 +----------------------------------------------------------------------->Gi
0 652.3
--------------------------------------------------------------------------------
n time(i) total(B) useful-heap(B) extra-heap(B) stacks(B)
--------------------------------------------------------------------------------
0 0 0 0 0 0
1 174,738,746 542,473,864 541,412,480 1,061,384 0
99.80% (541,412,480B) (heap allocation functions) malloc/new/new[], --alloc-fns, etc.
->98.97% (536,870,912B) 0x483659: create_chained_hashtable_v2 (chainedHashTableV2.c:93)
| ->98.97% (536,870,912B) 0x48CCD6: cache_struct_init (cache.c:60)
| ->74.23% (402,653,184B) 0x40FEB3: FIFO_init (FIFO.c:51)
| | ->24.74% (134,217,728B) 0x423884: S3FIFO_init (S3FIFO.c:113)
| | | ->24.74% (134,217,728B) 0x40880A: create_cache (cache_init.h:132)
| | | ->24.74% (134,217,728B) 0x40970B: parse_cmd (cli_parser.c:343)
| | | ->24.74% (134,217,728B) 0x406FFD: main (main.c:18)
| | |
| | ->24.74% (134,217,728B) 0x4238C0: S3FIFO_init (S3FIFO.c:118)
| | | ->24.74% (134,217,728B) 0x40880A: create_cache (cache_init.h:132)
| | | ->24.74% (134,217,728B) 0x40970B: parse_cmd (cli_parser.c:343)
| | | ->24.74% (134,217,728B) 0x406FFD: main (main.c:18)
| | |
| | ->24.74% (134,217,728B) 0x423921: S3FIFO_init (S3FIFO.c:125)
| | ->24.74% (134,217,728B) 0x40880A: create_cache (cache_init.h:132)
| | ->24.74% (134,217,728B) 0x40970B: parse_cmd (cli_parser.c:343)
| | ->24.74% (134,217,728B) 0x406FFD: main (main.c:18)
| |
| ->24.74% (134,217,728B) 0x4236BB: S3FIFO_init (S3FIFO.c:81)
| ->24.74% (134,217,728B) 0x40880A: create_cache (cache_init.h:132)
| ->24.74% (134,217,728B) 0x40970B: parse_cmd (cli_parser.c:343)
| ->24.74% (134,217,728B) 0x406FFD: main (main.c:18)
|
->00.84% (4,541,568B) in 1+ places, all below ms_print's threshold (01.00%)The above report indicates that the memory usage of S3-FIFO remains nearly constant throughout instruction execution. The peak memory usage is 519.4MB. 98.97% of memory is used for create_chained_hashtable_v2 (chainedHashTableV2.c:93). In the function S3FIFO_init(), there is one cache_struct_init() call and three FIFO_init() calls, each of which incurs another cache_struct_init(). In total, there are four cache_struct_init() calls, which create four hash tables. Each hash table occupies 24.74% of the allocated memory.
Optimization opportunities: S3-FIFO can be implemented with a single hash table, which can reduce the memory usage.
We conducted tests using the wiki_2019t workload (available at CMU's dataset repository) to evaluate the memory usage of various algorithms with a cache size of 1GB.
| Algorithm | Peak Memory Usage (MB) | Observations |
|---|---|---|
| FIFO | 132.1 | Memory usage remains constant. Major memory consumption from create_chained_hashtable_v2(). |
| LRU | 132.2 | Memory usage remains constant. Major memory consumption from create_chained_hashtable_v2(). |
| Clock | 132.0 | Memory usage remains constant. Major memory consumption from create_chained_hashtable_v2(). |
| SLRU | 132.6 | Memory usage remains constant. Major memory consumption from create_chained_hashtable_v2(). |
| LFU | 131.4 | Memory usage remains constant. Major memory consumption from create_chained_hashtable_v2(). |
| ARC | 813.1 | Memory usage increases steadily, peaking at the end. Major consumption from create_chained_hashtable_v2() and create_cache_obj_from_request(). |
| TwoQ | 517.8 | Memory usage remains constant. Major consumption from create_chained_hashtable_v2(). Involves four calls (two FIFO, one for LRU, one for TwoQ itself). |
| GDSF | 141.8 | Memory usage remains constant. Major consumption from create_chained_hashtable_v2(). |
| Hyperbolic | 4.480 | Extremely low memory usage. |
| LeCaR | 192.3 | Memory usage increases steadily, peaking at the end. Major consumption from create_chained_hashtable_v2() and insert_obj_info_freq_node(). |
| Cacheus | 805.2 | Memory usage remains constant. Major consumption from create_chained_hashtable_v2(). Involves multiple cache constructions like SR_LRU, CRLFU, and ghost cache. |
| LHD | 206.0 | Memory usage remains constant. Major consumption from create_chained_hashtable_v2() and vector maintenance for HitDensity in LHD. |
| WTinyLFU | 8580 | Memory usage remains constant. Major consumption from minimalIncrementCBF_init(), constructing a Counting Bloom Filter. |
| S3-FIFO | 519.8 | Memory usage remains constant. Major consumption from create_chained_hashtable_v2(). Involves multiple cache constructions like Small FIFO, Main FIFO, Ghost FIFO. |
| Sieve | 131.7 | Memory usage remains constant. Major consumption from minimalIncrementCBF_init(). |
- FIFO, LRU, Clock, SLRU, LFU, GDSF, Sieve all use only one hashtable, thus their memory usage is low.
- TwoQ, Cacheus, and S3-FIFO use multiple cache instances, resulting in multiple hashtables internally, which leads to higher memory usage.
- Hyperbolic uses a smaller hashpower during initialization, thus its memory usage is low.
- LeCaR and LHD use hashtables while also maintaining additional metadata, which increases memory usage.
- WTinyLFU uses a significant amount of memory to store Bloom Filters, thus its memory usage is high.
- ARC has high memory usage, the reason for which is unclear.