Technologies: C/C++, OpenMP, Linux (WSL)
Parallel programs often fail to achieve expected speedup on multi-core processors. This project investigates hardware-level causes of performance degradation, specifically cache line contention (false sharing) and scalability limits using OpenMP workloads.
In the first experiment, each thread repeatedly incremented its own counter stored in a shared array, so adjacent counters landed on the same cache line and every write invalidated the other cores' copies. The array was then padded so each counter occupied a separate cache line.
Threads: 16
Normal: 0.0156784 sec
Padded: 0.0102941 sec
Observation: Performance improved by approximately 1.5x without changing the algorithm, confirming that cache line contention, not computation, was the bottleneck.
In the second experiment, execution time of a fixed workload was measured as the thread count increased.
Threads  1: 7.235e-06 sec
Threads  2: 0.000234615 sec
Threads  4: 0.000237547 sec
Threads  8: 0.000262015 sec
Threads 16: 0.0179845 sec
Observation: Increasing the thread count did not improve performance. The workload was too small to amortize the parallel overhead, so thread startup, scheduling cost, and cache coherence traffic dominated, and runtime rose sharply at 16 threads.
- Adjacent thread variables caused cache contention
- Separating data across cache lines improved performance
- Parallel scalability has practical limits
- Hardware coherence overhead can serialize execution
Parallel performance depends not only on algorithm design but also on memory layout and CPU cache behavior. False sharing significantly limits real-world scalability of multi-threaded programs.
g++ false_sharing.cpp -O2 -fopenmp -o fs
./fs
g++ scaling.cpp -O2 -fopenmp -o scaling
./scaling