inferGPT

A high-performance C/C++ inference engine that runs on the CPU.

Setup

  1. Set up the virtual environment.

     uv sync

  2. Convert the weights from safetensors.

     python3 scripts/convert_weight.py

  3. Move the weights and index.json to /model.

  4. Either run setup.sh, or continue manually from step 5.

     bash setup.sh

  5. Create the build directory.

     mkdir build
     cd build

  6. Configure with CMake.

     cmake ..

  7. Build the project.

     make -j

  8. Run the executable.

     cd .. && ./build/inferGPT

Current Benchmarks

| Vectorization           | Sampling Strategy | Performance     | Speedup |
|-------------------------|-------------------|-----------------|---------|
| No SIMD                 | Temperature       | 20 toks/sec     | 1.0x    |
| NEON SIMD (dot product) | Temperature       | 57.27 toks/sec  | 2.9x    |
| int4 quantization       | Temperature       | 209.75 toks/sec | 10x     |
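
The NEON row comes from vectorizing the dot product at the heart of the matrix-vector multiplies. Below is a minimal sketch of that kind of kernel, assuming fp32 weights and an AArch64 target; the function name dot_f32_neon is illustrative, not the repo's actual API.

```cpp
#include <arm_neon.h>
#include <cstddef>

// Hypothetical NEON fp32 dot product; processes 4 floats per iteration.
float dot_f32_neon(const float* a, const float* b, size_t n) {
    float32x4_t acc = vdupq_n_f32(0.0f);        // four running partial sums
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        float32x4_t va = vld1q_f32(a + i);      // load 4 floats from each input
        float32x4_t vb = vld1q_f32(b + i);
        acc = vfmaq_f32(acc, va, vb);           // fused multiply-add: acc += va * vb
    }
    float sum = vaddvq_f32(acc);                // horizontal add of the 4 lanes (AArch64)
    for (; i < n; ++i) sum += a[i] * b[i];      // scalar tail when n % 4 != 0
    return sum;
}
```

Four lanes per fused multiply-add, minus the per-element loop overhead, is roughly consistent with the ~2.9x speedup in the table.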

Roadmap (would FlashAttention make sense on a CPU? FlashAttention-style blocked matrices?)

  • Add conditional compilation for Metal architectures
  • Operator fusion
  • Implement SIMD instructions
  • Add quantization algorithms with performance benchmarking (see the int4 sketch after this list)
  • Support GPU operations via CUDA C++
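
For the int4 benchmark row, one common scheme is block-wise symmetric quantization (similar in spirit to llama.cpp's Q4_0). The sketch below is an assumption about the approach, not the repo's actual format: the block size, struct layout, and function names are all illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdint>

constexpr size_t kBlock = 32;      // hypothetical block size

struct BlockQ4 {
    float scale;                   // per-block dequantization scale
    uint8_t q[kBlock / 2];         // two 4-bit values packed per byte
};

// Quantize one block of 32 floats: map [-amax, amax] onto [-7, 7],
// then store each value as an unsigned nibble with a +8 offset.
BlockQ4 quantize_block(const float* x) {
    float amax = 0.0f;
    for (size_t i = 0; i < kBlock; ++i) amax = std::max(amax, std::fabs(x[i]));
    BlockQ4 out;
    out.scale = amax / 7.0f;
    float inv = out.scale != 0.0f ? 1.0f / out.scale : 0.0f;
    for (size_t i = 0; i < kBlock; i += 2) {
        int lo = (int)std::lround(x[i]     * inv) + 8;
        int hi = (int)std::lround(x[i + 1] * inv) + 8;
        lo = std::min(15, std::max(0, lo));          // clamp against rounding overflow
        hi = std::min(15, std::max(0, hi));
        out.q[i / 2] = (uint8_t)((hi << 4) | lo);    // pack two nibbles per byte
    }
    return out;
}

// Dequantize back to floats: q -> (q - 8) * scale.
void dequantize_block(const BlockQ4& b, float* x) {
    for (size_t i = 0; i < kBlock; i += 2) {
        x[i]     = ((int)(b.q[i / 2] & 0x0F) - 8) * b.scale;
        x[i + 1] = ((int)(b.q[i / 2] >> 4)   - 8) * b.scale;
    }
}
```

Packing two weights per byte cuts memory traffic roughly 8x versus fp32, which is where most of the throughput gain on a memory-bound CPU workload comes from.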

References

  1. https://github.com/a1k0n/a1gpt/
  2. https://github.com/karpathy/llama2.c
  3. https://github.com/ggml-org/llama.cpp
