Department of Poorly Optimized GPU Code presents...

Poorly optimized inference server

Stats:

All experiments were conducted on the generation of 300 new tokens. Since CUDA is asynchronous, we cannot use Python's time.time() utility, because it would only measure the overhead of launching the kernels. Instead we use torch.cuda.Event so that we correctly measure the actual execution time.
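
A minimal sketch of that measurement pattern (the `model` and `input_ids` names are placeholders, not this repo's actual objects):

```python
import torch

# Minimal CUDA-event timing sketch; `model` and `input_ids` are placeholders.
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

start.record()
with torch.no_grad():
    out = model(input_ids)  # stands in for the real generation loop
end.record()

torch.cuda.synchronize()  # wait for all queued kernels to finish
print(f"Elapsed: {start.elapsed_time(end):.4f} ms")  # elapsed_time returns milliseconds
```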

Also (not done here yet): CUDA needs to be initialized, so we should first run some data through the model as a warm-up, call torch.cuda.synchronize() at the end of it, and only then start the measurements.
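
A warm-up sketch along those lines (same placeholder names as above):

```python
# Warm-up sketch: one pass through the model so CUDA context creation and
# other lazy initialization don't leak into the first measurement.
with torch.no_grad():
    _ = model(input_ids)
torch.cuda.synchronize()  # make sure the warm-up work has actually finished
# ...start recording torch.cuda.Event timings only after this point.
```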

  • Hugging Face implementation

  • Naive implementation of the Qwen3 0.6B model:

========================================
    Time to first token: 49.2472 ms
    Tokens per second: 3.6248
========================================
  • With KV caching
========================================
    Time to first token: 53.4047 ms
    Tokens per second: 24.0186
========================================
  • torch.compile (a minimal usage sketch follows these results)
========================================
    Time to first token: 56.3527 ms
    Tokens per second: 57.3766
========================================
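
For reference, a hedged sketch of how torch.compile might be applied here (assuming `model` is the naive Qwen3 module; the repo's real setup may differ):

```python
# Compile the model once; the first call triggers compilation and should be
# excluded from any timing.
model = torch.compile(model)
with torch.no_grad():
    _ = model(input_ids)
torch.cuda.synchronize()
```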
