In this project, I implemented various functions from the NumPy library in C to speed up computationally heavy tasks. This was achieved through a combination of data-level parallelism, thread-level parallelism, loop unrolling, and algorithmic improvements.
Various optimizations:
- SIMD data-level parallelism: used in matrix addition, subtraction, multiplication, and powering. Operating on multiple elements per instruction yielded speedups of roughly 4x.
- OpenMP thread-level parallelism: the OpenMP API was likewise applied to matrix addition, subtraction, multiplication, and powering, spreading the work across multiple cores.
- Loop Unrolling: by having each loop iteration do more work, the CPU branches less often and can better exploit its pipeline. The number of pipeline flushes caused by branch instructions was reduced roughly 4-fold, yielding a further speedup.
- Repeated Squares Algorithm: drastically reduces the runtime complexity of the matrix powering method. Say we want to raise an n×n matrix to the mth power. The initial approach called matrix multiplication iteratively, but this became far too slow for large matrices (sizes over 500×500): each multiplication is already O(n^3), and the loop ran m times, giving O(n^3 * m) overall. The repeated squares algorithm reduces the number of multiplications to logarithmic in m, for a total runtime of O(n^3 * log(m)). On my local machine this gave matrix powering a roughly 700x speedup over the initial method.