This project implements a parallelized version of the BFGS (Broyden–Fletcher–Goldfarb–Shanno) optimization algorithm using C++, OpenMP, and SIMD (Single Instruction Multiple Data) vectorization.
The goal of this project is to accelerate the computation of gradient-based optimization in high-dimensional spaces (e.g., 5,000+ dimensions) through a hybrid approach that combines shared-memory parallelism (OpenMP) and *data-level parallelism* (SIMD intrinsics).
This work was developed as part of the "Parallel computing systems" Master’s course at the Faculty of Electrical Engineering, University of Sarajevo.
- Full BFGS optimization implementation for multivariate functions
- Parallel gradient evaluations and Hessian-approximation updates using OpenMP
- SIMD acceleration for vector operations (dot products, matrix-vector multiplications)
- Flexible dimensionality — tested at 5,000+ dimensions
- Benchmarking suite comparing serial vs. parallel execution times on multiple benchmark functions
- Continuous (differentiable) TSP relaxation for testing BFGS
- OpenMP parallel regions divide the gradient computation and the Hessian-approximation update across threads.
- Reduction clauses are applied for summation-heavy operations (e.g., dot products).
- The algorithm maintains thread-safe updates to shared memory (Hessian approximation) through synchronized regions.
- Utilizes compiler auto-vectorization and manual SIMD intrinsics (e.g., SSE/AVX) for linear algebra routines.
- Ensures memory alignment of vectors and matrices for maximum SIMD efficiency (`alignas(32)` or `std::aligned_alloc`).
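
The OpenMP and SIMD points above can be illustrated with a minimal sketch of one BFGS inverse-Hessian update. This is not the project's actual code: the names `dot_simd`, `dot_omp`, `matvec`, and `bfgs_update` are hypothetical, the matrix is assumed to be dense, symmetric, and stored row-major, and the update partitions rows across threads (so no two threads write the same element) rather than using synchronized regions. Unaligned loads are used so the sketch works for any `n`; with rows padded to a 32-byte boundary, aligned loads could be substituted. The AVX path is compiled only when the compiler enables it (for example with `-mavx`), and the OpenMP pragmas take effect only with OpenMP enabled (for example `-fopenmp`).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Dot product with manual AVX (4 doubles per register) and a scalar tail.
// Without AVX support the scalar loop handles the whole range.
double dot_simd(const double* a, const double* b, std::size_t n) {
    double sum = 0.0;
    std::size_t i = 0;
#if defined(__AVX__)
    __m256d acc = _mm256_setzero_pd();
    for (; i + 4 <= n; i += 4) {
        // Unaligned loads are valid for any pointer.
        const __m256d va = _mm256_loadu_pd(a + i);
        const __m256d vb = _mm256_loadu_pd(b + i);
        acc = _mm256_add_pd(acc, _mm256_mul_pd(va, vb));
    }
    double lanes[4];
    _mm256_storeu_pd(lanes, acc);
    sum = lanes[0] + lanes[1] + lanes[2] + lanes[3];
#endif
    for (; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Dot product split across threads; the reduction clause gives every
// thread a private partial sum that is combined at the end.
double dot_omp(const double* a, const double* b, long n) {
    double sum = 0.0;
    #pragma omp parallel for reduction(+ : sum)
    for (long i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// out = M * x for a row-major n x n matrix; one row per loop iteration,
// rows distributed across threads, each row handled by the SIMD kernel.
void matvec(const std::vector<double>& M, const std::vector<double>& x,
            std::vector<double>& out) {
    const long n = static_cast<long>(x.size());
    assert(M.size() == x.size() * x.size() && out.size() == x.size());
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        out[i] = dot_simd(&M[i * n], x.data(), static_cast<std::size_t>(n));
}

// In-place BFGS update of the inverse-Hessian approximation H:
//   H <- (I - rho*s*y^T) H (I - rho*y*s^T) + rho*s*s^T,  rho = 1 / (y^T s)
// expanded (H symmetric) to
//   H <- H - rho*(s*(Hy)^T + (Hy)*s^T) + (rho^2 * y^T*Hy + rho) * s*s^T
void bfgs_update(std::vector<double>& H, const std::vector<double>& s,
                 const std::vector<double>& y) {
    const long n = static_cast<long>(s.size());
    assert(y.size() == s.size() && H.size() == s.size() * s.size());
    std::vector<double> Hy(s.size());
    matvec(H, y, Hy);
    const double sy = dot_omp(s.data(), y.data(), n);
    if (sy <= 0.0) return;  // curvature condition violated: skip this update
    const double rho = 1.0 / sy;
    const double c = rho * rho * dot_omp(y.data(), Hy.data(), n) + rho;
    // Each thread owns whole rows of H, so writes never overlap.
    #pragma omp parallel for
    for (long i = 0; i < n; ++i)
        for (long j = 0; j < n; ++j)
            H[i * n + j] += c * s[i] * s[j] - rho * (s[i] * Hy[j] + Hy[i] * s[j]);
}
```

After a successful update the new approximation satisfies the secant condition `H * y == s` (up to rounding), which is a convenient sanity check when comparing serial and parallel builds.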
The following table shows relative speedups (normalized to the sequential version = 1.0) for different compilation modes and problem sizes:
| Problem Dimension | Sequential | OpenMP + Autovectorization | OpenMP + SIMD | OpenMP + SIMD + Pointer Array |
|---|---|---|---|---|
| 1,000 | 1.00 | 1.95 | 1.88 | 2.02 |
| 3,000 | 1.00 | 9.18 | 9.07 | 9.27 |
| 5,000 | 1.00 | 6.49 | 6.49 | 7.15 |
(Exact results depend on CPU architecture and compiler optimizations.)
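
As a rough illustration of how such relative speedups can be obtained, the sketch below times a workload with `std::chrono::steady_clock` and divides the sequential time by the time of the variant under test. The helper names `best_time_seconds` and `speedup` are hypothetical, and taking the fastest of several runs is an assumption made here to reduce noise; the project's benchmarking suite may measure differently.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Wall-clock time, in seconds, of the fastest of `repeats` runs of `work`.
double best_time_seconds(const std::function<void()>& work, int repeats = 3) {
    assert(repeats > 0);
    double best = -1.0;
    for (int r = 0; r < repeats; ++r) {
        const auto t0 = std::chrono::steady_clock::now();
        work();
        const auto t1 = std::chrono::steady_clock::now();
        const double dt = std::chrono::duration<double>(t1 - t0).count();
        if (best < 0.0 || dt < best) best = dt;
    }
    return best;
}

// Relative speedup, normalized so that the sequential version equals 1.0.
double speedup(double t_sequential, double t_variant) {
    return t_sequential / t_variant;
}
```

For example, a sequential run of 6.0 s and a parallel run of 2.0 s give `speedup(6.0, 2.0) == 3.0`.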
- Intel Core i5-14600K
- 14 cores (6 Performance + 8 Efficiency)
- 20 threads
- Maximum boost clock up to 5.3 GHz
- 1.2 MB L1 cache, 20 MB L2 cache, 24 MB L3 cache
- 32 GB DDR5 RAM, 6000 MT/s
- Windows 11 Pro
- C++23 or newer
- OpenMP (usually included with GCC/Clang)
- Optional: Intel or GCC SIMD intrinsics
Licensed under the Apache License, Version 2.0. See LICENSE for details.
© 2025 Ismar Muslić, Tarik Hastor, Ivona Jozić, Kanita Kadušić, Merjem Gutošić