- Tested on Debian via Windows Subsystem for Linux 2 (WSL2)
- CPU: [Intel i7-11800H]
- Compiled with:

```shell
g++ -O3 -ffast-math -march=native -g
```

- Note: WSL2 introduces minimal virtualization overhead for CPU/cache benchmarks, but the relative difference between row-major and column-major traversal remains clearly visible (and is often amplified by the higher memory latency on misses). Results may vary slightly on native Linux, but the core lesson holds everywhere: align traversal with storage order for cache efficiency.
A practical demonstration of how memory layout and traversal order affect performance in C++. I started with the beginner-friendly `std::vector<std::vector<int>>`, saw the slowdowns, then optimized to a contiguous flat 1D vector, with benchmarks, perf stats, and explanations.

Key insight: cache misses dominate; aligning access with storage order can yield 20–50× speedups.
I began with the common way to represent a 2D matrix: a vector of vectors. It's easy and intuitive, but each row gets a separate heap allocation.
| Traversal | Time | L1 Cache Miss Rate (approx) | Notes |
|---|---|---|---|
| Row-major (fast) | ~0.7–0.8s | ~15–30% | Better locality within rows |
| Column-major (slow) | ~1–2s | ~90–99% | Huge strides + fragmentation |
```cpp
std::vector<std::vector<int>> arr(n, std::vector<int>(n, 0));

// Row-major (better)
for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
        arr[i][j]++;

// Column-major (terrible here)
for (int j = 0; j < n; ++j)
    for (int i = 0; i < n; ++i)
        arr[i][j]++;
```

Use a single `std::vector` with manual indexing -> full cache-line usage -> fast!
| Traversal | Time | L1 Cache Miss Rate (approx) | Notes |
|---|---|---|---|
| Row-major | ~0.08s | ~15–30% | 20–30× faster |
| Column-major | ~0.7s | ~90–99% | Still slow due to large strides |
```cpp
#include <iostream>
#include <vector>
#include <chrono>

int main() {
    const int n = 5000;
    std::vector<int> arr(n * n, 0);

    // Row-major traversal.
    auto s1 = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            arr[i * n + j]++;
        }
    }
    auto e1 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> diff1 = e1 - s1;

    // Column-major traversal.
    auto s2 = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < n; j++) {
        for (int i = 0; i < n; i++) {
            arr[i * n + j]++;
        }
    }
    auto e2 = std::chrono::high_resolution_clock::now();
    std::chrono::duration<double> diff2 = e2 - s2;

    std::cout << "Flat Row Major - " << diff1.count() << "s\n";
    std::cout << "Flat Column Major - " << diff2.count() << "s\n";

    // Print values so the compiler cannot optimize the loops away.
    std::cout << "Verify - " << arr[0] << " and " << arr[n * n - 1] << "\n";
    return 0;
}
```
- `std::vector<std::vector<int>>` is convenient but slow for large matrices.
- Use flat 1D arrays for performance-critical code.
- Always traverse in storage order (row-major in C++).
- Cache misses are the real bottleneck — profile with perf!
- Perf flags:

```shell
perf stat -r 5
```
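Here `-r 5` repeats the run 5 times and reports the mean with variance. A fuller invocation might look like the sketch below; the file names are illustrative, the event names vary by CPU, and hardware counters are not always exposed under WSL2:

```shell
# Build with symbols, then gather L1 data-cache statistics averaged over 5 runs.
# "main.cpp" and "bench" are placeholder names for this repo's source and binary.
g++ -O3 -ffast-math -march=native -g main.cpp -o bench
perf stat -r 5 -e L1-dcache-loads,L1-dcache-load-misses ./bench
```

If the `L1-dcache-*` events are unavailable, `perf list` shows what your kernel/CPU combination supports.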