A high-performance C/C++ inference engine that runs on the CPU.
- Set up the virtual environment and install the dependencies: `uv sync`
- Convert the weights from safetensors: `python3 scripts/convert_weight.py`
- Move the converted weights and `index.json` to `/model`.
- Either run `bash setup.sh`, or continue with the manual build steps below.
- Create the build directory and enter it: `mkdir build`, then `cd build`
- Configure with CMake: `cmake ..`
- Build the project: `make -j`
- Run the executable: `cd .. && ./build/inferGPT`
| Optimization | Sampling Strategy | Throughput | Speedup |
|---|---|---|---|
| No SIMD (scalar baseline) | Temperature | 20 tok/s | 1.0x |
| NEON SIMD (dot product) | Temperature | 57.27 tok/s | 2.9x |
| int4 quantization | Temperature | 209.75 tok/s | 10x |
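The jump from the scalar baseline to the NEON row comes from vectorizing the inner dot product that dominates matmul time. The sketch below is illustrative only (function names and structure are assumptions, not taken from the inferGPT sources): a scalar reference kernel next to an AArch64 NEON kernel that processes four floats per iteration with fused multiply-add.

```cpp
// Illustrative sketch -- not the actual inferGPT kernels.
#include <cstddef>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

// Scalar reference ("No SIMD" row): one multiply-add per element.
float dot_scalar(const float* a, const float* b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

// NEON kernel ("NEON SIMD (dot product)" row): 4 floats per iteration.
float dot_neon(const float* a, const float* b, size_t n) {
#if defined(__ARM_NEON) && defined(__aarch64__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        acc = vfmaq_f32(acc, va, vb);       // acc += va * vb (lane-wise FMA)
    }
    float sum = vaddvq_f32(acc);            // horizontal add of the 4 lanes
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
    return sum;
#else
    return dot_scalar(a, b, n);             // portable fallback off AArch64
#endif
}
```

The int4 row presumably follows the same pattern, with weights stored as packed 4-bit blocks plus per-block scales and dequantized inside the loop; on memory-bound CPU inference, most of that additional speedup typically comes from the reduced weight traffic rather than from extra arithmetic.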
Roadmap (open question: would FlashAttention-style blocked/tiled attention make sense on a CPU?)
- Add conditional compilation for Metal architectures
- Operator fusion
- Implement SIMD instructions
- Add quantization algorithms with performance benchmarking
- Support GPU operations via CUDA C++