A high-performance C/C++ inference engine that runs on the CPU.
- Set up the virtual environment and install the dependencies: `uv sync`
- Convert the weights from safetensors: `python3 scripts/convert_weight.py`
- Move the converted weights and `index.json` to `/model`.
- Either run `bash setup.sh`, or continue with the manual build steps below.
- Create the build directory and enter it: `mkdir build`, then `cd build`
- Configure with CMake: `cmake ..`
- Build the project: `make -j`
- Run the executable: `cd .. && ./build/inferGPT`
| Optimization | Sampling Strategy | Throughput | Speedup |
|---|---|---|---|
| No SIMD (scalar baseline) | Temperature | 20 tok/s | 1.0x |
| NEON SIMD (dot product) | Temperature | 57.27 tok/s | 2.9x |
| int4 quantization | Temperature | 209.75 tok/s | 10x |
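The jump from the scalar baseline to the NEON row comes from vectorizing the inner dot product that dominates matmul time. The sketch below is illustrative only (function names and structure are assumptions, not taken from the inferGPT sources): a scalar reference kernel next to an AArch64 NEON kernel that processes four floats per iteration with fused multiply-add.

```cpp
// Illustrative sketch -- not the actual inferGPT kernels.
#include <cstddef>

#if defined(__ARM_NEON) && defined(__aarch64__)
#include <arm_neon.h>
#endif

// Scalar reference ("No SIMD" row): one multiply-add per element.
float dot_scalar(const float* a, const float* b, size_t n) {
    float acc = 0.0f;
    for (size_t i = 0; i < n; ++i) acc += a[i] * b[i];
    return acc;
}

// NEON kernel ("NEON SIMD (dot product)" row): 4 floats per iteration.
float dot_neon(const float* a, const float* b, size_t n) {
#if defined(__ARM_NEON) && defined(__aarch64__)
    float32x4_t acc = vdupq_n_f32(0.0f);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);
        float32x4_t vb = vld1q_f32(b + i);
        acc = vfmaq_f32(acc, va, vb);       // acc += va * vb (lane-wise FMA)
    }
    float sum = vaddvq_f32(acc);            // horizontal add of the 4 lanes
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
    return sum;
#else
    return dot_scalar(a, b, n);             // portable fallback off AArch64
#endif
}
```

The int4 row presumably follows the same pattern, with weights stored as packed 4-bit blocks plus per-block scales and dequantized inside the loop; on memory-bound CPU inference, most of that additional speedup typically comes from the reduced weight traffic rather than from extra arithmetic.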
Roadmap (open question: would FlashAttention-style blocked/tiled attention make sense on a CPU?)
- Add conditional compilation for Metal architectures
- Operator fusion
- Implement SIMD instructions
- Add quantization algorithms with performance benchmarking
- Support GPU operations via CUDA C++