Tesla M40–optimized Rust + CUDA LLM server/runtime. FP16 weights, FP32 compute via cuBLAS. GGUF loader and stable C FFI (m40llm_*). Goal: be much faster than ollama on the M40.
- Single-GPU server for Maxwell Tesla M40 (sm_52)
- FP16 storage / FP32 compute (cuBLAS/cuBLASLt as available)
- GGUF loader; C FFI symbols `m40llm_*` for embedding (FFI sketch after this list)
- Small, explicit codebase focused on M40 performance
- Optional HTTP server (enable with `--features server`)
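
The C FFI is how you embed the runtime: link the library and call the `m40llm_*` symbols from any language with a C ABI. Here is a minimal Rust-side sketch; only the `m40llm_` prefix comes from this README, so the symbol names and signatures below are hypothetical placeholders:

```rust
// Hypothetical FFI sketch: the m40llm_ prefix is real, but these specific
// symbol names and signatures are illustrative placeholders.
use std::ffi::{c_char, c_void, CString};

extern "C" {
    // Hypothetical: load a GGUF model, returning an opaque handle (null on error).
    fn m40llm_load(gguf_path: *const c_char) -> *mut c_void;
    // Hypothetical: release a handle returned by m40llm_load.
    fn m40llm_free(handle: *mut c_void);
}

fn main() {
    let path = CString::new("path/to.gguf").unwrap();
    unsafe {
        let handle = m40llm_load(path.as_ptr());
        if !handle.is_null() {
            m40llm_free(handle);
        }
    }
}
```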
Who it's for:
- M40 owners who want maximum throughput/low latency on this specific card
- Tinkerers/researchers who want Maxwell-specific hacks, not generic portability
- Users who find vLLM hard/unsupported on M40 and llama.cpp too slow there
How it compares:
- vs ollama: we compete head‑on for M40. Expect higher throughput/lower latency from Maxwell‑specific kernels/layouts, FP16‑storage/FP32‑compute, and decode‑path tricks (graphs, persistent kernel, warp micro‑batching).
- vs vLLM: excellent on modern GPUs but impractical on M40 (sm_52). m40‑llm is M40‑first by design and actually installs and runs there.
- vs llama.cpp: very portable, but most speed paths target newer GPUs. On M40 it tends to run without its big speed tricks; m40‑llm focuses on sm_52‑specific performance instead of broad portability.
```bash
# CPU only
cargo build --no-default-features
# With NVCC installed
cargo build --features cuda
```

Recommended micromamba toolchain setup (x86_64, tested on this branch):

```bash
# install micromamba if needed
curl -Ls https://micro.mamba.pm/install.sh | bash
source ~/.bashrc
# CUDA 12.4 toolchain with cuBLAS headers + libs
micromamba create -y -n m40-llm -c conda-forge -c nvidia/label/cuda-12.4.1 \
cuda-nvcc=12.4.99 cuda-cudart=12.4.99 cuda-cudart-dev=12.4.99 \
libcublas=12.4.5.8 libcublas-dev=12.4.5.8
# Build/link with cuBLAS enabled
micromamba run -n m40-llm env M40LLM_ENABLE_CUBLAS=1 cargo build --features cuda
```

CI verifies two configurations:
- `noncuda`: No CUDA dependencies
- `cuda-with-nvcc`: Full CUDA+NVCC toolchain
Performance techniques:
- FP16 storage, FP32 compute tiles: load FP16 to shared, convert to FP32, compute in registers (host‑side sketch after this list)
- Tuned GEMM with cuBLAS/cuBLASLt; explicit row/col layouts; layout tests included
- CUDA Graphs + persistent decode kernel to minimize launch overhead
- Warp-level micro-batching (e.g., one warp per sequence) for decode
- Optimized KV cache: FP16 or INT8 per-head; contiguous per-head layout; pinned host staging
- Streams/Hyper‑Q: high‑priority decode stream, concurrent lower‑priority prefill
- Read‑only (`__ldg`) and texture caches for non‑GEMM ops (norms, embeddings)
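
The FP16‑storage/FP32‑compute split is the core numeric contract. Here is a host‑side illustration of the same pattern (CPU‑only, assuming the `half` crate; the real widening happens in the CUDA tiles, not in Rust):

```rust
// Illustration of the FP16-storage / FP32-compute pattern on the host.
// The GPU kernels do the same widening in shared memory / registers.
use half::f16;

/// Dot product over FP16-stored weights with FP32 accumulation.
fn dot_f16_storage_f32_compute(weights: &[f16], activations: &[f32]) -> f32 {
    weights
        .iter()
        .zip(activations)
        // Widen each FP16 weight to FP32 at the point of use...
        .map(|(w, a)| w.to_f32() * a)
        // ...and accumulate in FP32, mirroring register-resident compute.
        .sum()
}

fn main() {
    let w: Vec<f16> = [0.5f32, -1.25, 2.0].iter().map(|&x| f16::from_f32(x)).collect();
    let a = [1.0f32, 2.0, 3.0];
    // 0.5*1 - 1.25*2 + 2*3 = 4.0, exact in both FP16 and FP32.
    assert_eq!(dot_f16_storage_f32_compute(&w, &a), 4.0);
}
```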
This project uses Cargo feature flags to switch between CPU‑only and GPU‑accelerated builds, and to include an optional HTTP server.
- `cuda`: Enables the CUDA backend. When set:
  - Requires `nvcc` on PATH; CUDA builds fail fast if the toolchain is missing.
  - Compiles CUDA kernels for sm_52 (plus compute_52 PTX) and links against the CUDA runtime.
  - If the cuBLAS header (`cublas_v2.h`) is found and `M40LLM_ENABLE_CUBLAS=1` is set, we also link cuBLAS and enable GEMM paths and tests.
- `server`: Includes the HTTP server binary routes so you can run `m40-llm run ...`.
Build script behavior:
- Compiles kernels for `sm_52` and also embeds PTX for `compute_52` so newer GPUs can JIT from PTX if needed.
- Exposes `cfg(nvcc)` when a real CUDA toolchain is present.
- Exposes `cfg(have_cublas)` when cuBLAS headers and libraries are found and `M40LLM_ENABLE_CUBLAS=1` is set.
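
A condensed sketch of how a build script can emit those cfgs (the real build.rs probes more thoroughly; the `CUDA_HOME`-based header check below is an assumption):

```rust
// build.rs sketch: emits cfg(nvcc) / cfg(have_cublas) as described above.
// Detection is simplified relative to the real script.
use std::{env, path::Path, process::Command};

fn main() {
    // Declare the custom cfgs so newer Cargo/rustc don't warn about them.
    println!("cargo:rustc-check-cfg=cfg(nvcc)");
    println!("cargo:rustc-check-cfg=cfg(have_cublas)");

    // cfg(nvcc): a real CUDA toolchain is present if `nvcc --version` succeeds.
    let have_nvcc = Command::new("nvcc")
        .arg("--version")
        .output()
        .map(|o| o.status.success())
        .unwrap_or(false);
    if have_nvcc {
        println!("cargo:rustc-cfg=nvcc");
    }

    // cfg(have_cublas): header found AND the explicit opt-in is set.
    let opted_in = env::var("M40LLM_ENABLE_CUBLAS").map(|v| v == "1").unwrap_or(false);
    let header_found = env::var("CUDA_HOME") // assumed probe location
        .map(|home| Path::new(&home).join("include/cublas_v2.h").exists())
        .unwrap_or(false);
    if have_nvcc && opted_in && header_found {
        println!("cargo:rustc-cfg=have_cublas");
        println!("cargo:rustc-link-lib=cublas");
    }
}
```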
Build the project in one of these modes:
- CPU only (no CUDA):
  - Build: `cargo build --no-default-features`
  - Test: `cargo test --no-default-features`
- CUDA enabled (requires nvcc on PATH):
  - Build: `cargo build --features cuda`
  - Test: `cargo test --features cuda`
- CPU‑only mode: `cargo test --no-default-features` runs all non‑CUDA tests.
- CUDA mode (`--features cuda`): CUDA smoke and GEMM tests run when the environment has CUDA headers, and additional GEMM/cuBLAS tests run when the build detects `cublas_v2.h`. Tests rely on `nvcc` being present because the build fails without it when CUDA is enabled (gating sketch after this list).
- Minimal forward parity: see docs/minimal_forward.md and tests/forward_parity_toy.rs for a CUDA‑gated toy test validating one‑layer, seq_len=1 numerics.
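
The gating pattern itself is plain `cfg`: build.rs sets `nvcc`/`have_cublas`, and gated tests compile only when they apply. An illustrative sketch (these test names and bodies are placeholders, not the repo's actual tests):

```rust
// Gating sketch: cfg(nvcc) and cfg(have_cublas) come from build.rs.
// These tests are illustrative placeholders, not the real test suite.

#[cfg(nvcc)]
#[test]
fn cuda_smoke_illustration() {
    // Compiles and runs only when a real CUDA toolchain was detected.
}

#[cfg(all(nvcc, have_cublas))]
#[test]
fn cublas_gemm_illustration() {
    // Additionally requires cublas_v2.h found and M40LLM_ENABLE_CUBLAS=1
    // at build time.
}
```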
Device selection and runtime controls:
- Auto‑select M40: CudaContext::new(-1) will pick a Tesla M40 (sm_52) if one is visible. If none is visible, it falls back to device 0 (usage sketch after this list).
- Force selection: set M40LLM_FORCE_M40=1 to force runtime selection of an sm_52 device even when a specific device_id is passed.
- Respect CUDA_VISIBLE_DEVICES: device enumeration respects CUDA_VISIBLE_DEVICES. The auto‑picker searches only among visible devices and selects the first sm_52 it finds.
- cuBLAS control: by default, we do not link cuBLAS even if headers are present. Set M40LLM_ENABLE_CUBLAS=1 to enable cuBLAS integration if both the header (cublas_v2.h) and a shared library (e.g., libcublas.so.11) are detected. Otherwise, fallback CUDA kernels are used.
- Test gating: build.rs exposes cfg(nvcc) when a real CUDA toolchain is present and cfg(have_cublas) when cuBLAS is enabled; CUDA tests use these to gate cuBLAS‑specific coverage. Some CUDA tests also use require_sm52() to skip gracefully when not on an sm_52 device.
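
Putting the selection rules together, startup can look roughly like this (CudaContext::new(-1) is the documented auto‑pick call; the crate path and Result shape are assumptions):

```rust
// Device-selection sketch. Only CudaContext::new(-1) and the env vars are
// documented; the crate path and error handling here are assumptions.
use m40_llm::CudaContext;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Environment knobs (set outside the process):
    //   CUDA_VISIBLE_DEVICES=...  restricts which devices are enumerated
    //   M40LLM_FORCE_M40=1        forces sm_52 selection even with a device_id
    //   M40LLM_ENABLE_CUBLAS=1    build-time opt-in for cuBLAS linkage

    // -1 = auto-pick the first visible Tesla M40 (sm_52), else device 0.
    let ctx = CudaContext::new(-1)?;
    // ...load the GGUF model and serve decode requests with `ctx`.
    drop(ctx);
    Ok(())
}
```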
Run the server:

```bash
cargo run \
--no-default-features \
--features server \
-- run \
--model path/to.gguf \
--addr 0.0.0.0:58439
```
See CONTRIBUTING.md for guidelines.