FlashFlow: A GPU-Optimized Transformer Inference Runtime with Fused CUDA/Triton Kernels, IR Fusion Passes, and FX-Based Graph Lowering
FlashFlow is an end-to-end optimized inference engine for GPT-style Transformers. It integrates:
- A compiler-style IR with graph-level optimizations
- Fusion passes for attention + MLP
- Fused CUDA and Triton kernels for high-throughput inference
- Quantization-aware training (INT8) and quantized runtime
- Autotuning for kernel tile sizes
- A clean PyTorch-to-FlashFlow export pipeline using FX
- Profiling support via Nsight Compute and Nsight Systems
- A fully working reference PyTorch GPTSmall model + training scripts
- Extensive unit tests for both C++ and Python components
FlashFlow serves as a mini TensorRT / Inductor specialized for GPT inference, built from scratch.
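To give a flavor of the export pipeline, here is a minimal sketch of the FX capture step that export_fx_graph.py builds on. TinyBlock is a hypothetical stand-in for the real blocks in python/models/gpt_small.py; the torch.fx calls are standard PyTorch:

```python
import torch
import torch.nn as nn
from torch.fx import symbolic_trace

class TinyBlock(nn.Module):
    """Hypothetical stand-in block: LayerNorm -> Linear -> GELU -> Linear."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fc1 = nn.Linear(dim, 4 * dim)
        self.fc2 = nn.Linear(4 * dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.fc2(nn.functional.gelu(self.fc1(self.norm(x))))

traced = symbolic_trace(TinyBlock())
# Each FX node is a lowering candidate: fusion passes pattern-match runs of
# call_module / call_function nodes (e.g., Linear -> GELU -> Linear => fused MLP).
for node in traced.graph.nodes:
    print(node.op, node.target)
```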
Repository layout:

```
FLASHFLOW/
│
├── benchmarks/
│ ├── bench_kernel_micro.py
│ └── bench_throughput.py
│
├── cpp/
│ ├── include/flashflow/
│ │ ├── autotune.hpp
│ │ ├── graph.hpp
│ │ ├── ir.hpp
│ │ ├── kernels.hpp
│ │ ├── quantization.hpp
│ │ └── runtime.hpp
│ │
│ ├── kernels/
│ │ ├── cuda/
│ │ │ ├── fused_attention.cu
│ │ │ ├── fused_mlp.cu
│ │ │ ├── layernorm.cu
│ │ │ └── softmax.cu
│ │ └── triton/
│ │ ├── fused_attention_triton.py
│ │ └── fused_mlp_triton.py
│ │
│ ├── src/
│ │ ├── fusion_passes.cpp
│ │ ├── graph_lowering_fx.cpp
│ │ ├── ir.cpp
│ │ ├── kernel_registry.cpp
│ │ ├── logging.cpp
│ │ ├── memory_planner.cpp
│ │ ├── quant_runtime.cpp
│ │ └── runtime.cpp
│ │
│ └── CMakeLists.txt
│
├── python/
│ ├── eval/
│ │ ├── eval_latency.py
│ │ └── eval_perplexity.py
│ ├── export/
│ │ ├── export_checkpoint.py
│ │ └── export_fx_graph.py
│ ├── models/
│ │ ├── gpt_small.py
│ │ └── transformer_blocks.py
│ └── training/
│ │ ├── train_baseline.py
│ │ ├── train_mixed_precision.py
│ │ └── train_qat_int8.py
│
├── scripts/
│ ├── build_cpp.sh
│ ├── run_profiler_ncu.sh
│ └── run_profiler_nsys.sh
│
├── tests/
│ ├── cpp/
│ │ ├── test_fusion.cpp
│ │ ├── test_ir.cpp
│ │ └── test_runtime.cpp
│ └── python/
│ ├── test_export_fx.py
│ └── test_training_equivalence.py
│
└── README.md
```
FlashFlow requires:
- Python ≥ 3.9
- CUDA ≥ 11.7
- PyTorch ≥ 2.1
- A GPU with Compute Capability ≥ 7.0 (V100/A100/RTX30xx/RTX40xx)
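A quick sanity check for the GPU requirement (plain PyTorch calls, not a FlashFlow API):

```python
import torch

assert torch.cuda.is_available(), "FlashFlow needs a CUDA-capable GPU"
major, minor = torch.cuda.get_device_capability(0)
print(f"SM {major}.{minor}, CUDA {torch.version.cuda}, PyTorch {torch.__version__}")
assert (major, minor) >= (7, 0), "FlashFlow requires compute capability >= 7.0"
```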
Set up the environment:

```bash
conda create -n flashflow python=3.10 -y
conda activate flashflow
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
pip install -r requirements.txt
```

Build the C++ runtime and kernels with the helper script:

```bash
./scripts/build_cpp.sh
```

Or build manually:

```bash
mkdir -p build
cd build
cmake .. -DCMAKE_BUILD_TYPE=Release
cmake --build . -j$(nproc)
```

Run the Python and C++ test suites:

```bash
pytest -q tests/python
./build/test_ir
./build/test_fusion
./build/test_runtime
```

Train the reference GPTSmall model (baseline FP32, mixed-precision, or INT8 QAT):

```bash
python python/training/train_baseline.py --data data/train.txt --save-dir checkpoints/baseline
python python/training/train_mixed_precision.py --data data/train.txt --save-dir checkpoints/amp --amp-dtype bf16
python python/training/train_qat_int8.py --data data/train.txt --save-dir checkpoints/qat
```

Export a trained checkpoint to the FlashFlow IR:

```bash
python python/export/export_fx_graph.py --model-name checkpoints/baseline/final_model.pt --out graph.json
python python/export/export_checkpoint.py --model checkpoints/baseline/final_model.pt --out-dir export/
```

Evaluate latency and perplexity:

```bash
python python/eval/eval_latency.py --engine flashflow_engine.pt
python python/eval/eval_perplexity.py --model checkpoints/baseline/final_model.pt
```

Profile kernels with Nsight Compute and full runs with Nsight Systems:

```bash
./scripts/run_profiler_ncu.sh ./build/bench_mlp --kernel-name fused_mlp
./scripts/run_profiler_nsys.sh ./build/bench_attention
```

FlashFlow is a full-stack optimized Transformer inference engine with real fused kernels, IR fusion, FX lowering, quantization, autotuning, and runtime execution.
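For a taste of the kernel style under cpp/kernels/triton/, here is a minimal row-wise softmax in Triton. This is a generic sketch of the fusion idea (max, exp, sum, and divide in one launch), not the repository's fused_attention_triton.py, which fuses the full attention pattern:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def softmax_kernel(out_ptr, in_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    # One program instance per row; the row stays in registers, so the
    # max / exp / sum / divide steps are fused into a single kernel.
    row = tl.program_id(0)
    cols = tl.arange(0, BLOCK_SIZE)
    mask = cols < n_cols
    x = tl.load(in_ptr + row * n_cols + cols, mask=mask, other=-float("inf"))
    x = x - tl.max(x, axis=0)  # subtract row max for numerical stability
    num = tl.exp(x)
    den = tl.sum(num, axis=0)
    tl.store(out_ptr + row * n_cols + cols, num / den, mask=mask)

def softmax(x: torch.Tensor) -> torch.Tensor:
    # Assumes a contiguous 2-D CUDA tensor whose rows fit in one block.
    n_rows, n_cols = x.shape
    out = torch.empty_like(x)
    softmax_kernel[(n_rows,)](out, x, n_cols,
                              BLOCK_SIZE=triton.next_power_of_2(n_cols))
    return out

x = torch.randn(4, 1000, device="cuda")
torch.testing.assert_close(softmax(x), torch.softmax(x, dim=-1))
```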