Tesla M40–optimized Rust + CUDA LLM server/runtime. FP16 weights, FP32 compute via cuBLAS. GGUF loader and stable C FFI (m40llm_*). Goal: be much faster than ollama on the M40.
- Single-GPU server for Maxwell Tesla M40 (sm_52)
- FP16 storage / FP32 compute (cuBLAS/cuBLASLt as available)
- GGUF loader; C FFI symbols `m40llm_*` for embedding (FFI sketch after this list)
- Small, explicit codebase focused on M40 performance
- Optional HTTP server (enable with `--features server`)
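
The C FFI is how you embed the runtime: link the library and call the `m40llm_*` symbols from any language with a C ABI. Here is a minimal Rust-side sketch; only the `m40llm_` prefix comes from this README, so the symbol names and signatures below are hypothetical placeholders:

```rust
// Hypothetical FFI sketch: the m40llm_ prefix is real, but these specific
// symbol names and signatures are illustrative placeholders.
use std::ffi::{c_char, c_void, CString};

extern "C" {
    // Hypothetical: load a GGUF model, returning an opaque handle (null on error).
    fn m40llm_load(gguf_path: *const c_char) -> *mut c_void;
    // Hypothetical: release a handle returned by m40llm_load.
    fn m40llm_free(handle: *mut c_void);
}

fn main() {
    let path = CString::new("path/to.gguf").unwrap();
    unsafe {
        let handle = m40llm_load(path.as_ptr());
        if !handle.is_null() {
            m40llm_free(handle);
        }
    }
}
```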
Who it's for:
- M40 owners who want maximum throughput/low latency on this specific card
- Tinkerers/researchers who want Maxwell-specific hacks, not generic portability
- Users who find vLLM hard/unsupported on M40 and llama.cpp too slow there
How it compares:
- vs ollama: we compete head‑on for M40. Expect higher throughput/lower latency from Maxwell‑specific kernels/layouts, FP16‑storage/FP32‑compute, and decode‑path tricks (graphs, persistent kernel, warp micro‑batching).
- vs vLLM: excellent on modern GPUs but impractical on M40 (sm_52). m40‑llm is M40‑first by design and actually installs and runs there.
- vs llama.cpp: very portable, but most speed paths target newer GPUs. On M40 it tends to run without its big speed tricks; m40‑llm focuses on sm_52‑specific performance instead of broad portability.
```bash
# CPU only
cargo build --no-default-features
# With NVCC installed
cargo build --features cuda
```

Recommended micromamba toolchain setup (x86_64, tested on this branch):

```bash
# install micromamba if needed
curl -Ls https://micro.mamba.pm/install.sh | bash
source ~/.bashrc
# CUDA 12.4 toolchain with cuBLAS headers + libs
micromamba create -y -n m40-llm -c conda-forge -c nvidia/label/cuda-12.4.1 \
cuda-nvcc=12.4.99 cuda-cudart=12.4.99 cuda-cudart-dev=12.4.99 \
libcublas=12.4.5.8 libcublas-dev=12.4.5.8
# Build/link with cuBLAS enabled
micromamba run -n m40-llm env M40LLM_ENABLE_CUBLAS=1 cargo build --features cuda
```

CI verifies two configurations:
- `noncuda`: No CUDA dependencies
- `cuda-with-nvcc`: Full CUDA+NVCC toolchain
Performance techniques:
- FP16 storage, FP32 compute tiles: load FP16 to shared, convert to FP32, compute in registers (host‑side sketch after this list)
- Tuned GEMM with cuBLAS/cuBLASLt; explicit row/col layouts; layout tests included
- CUDA Graphs + persistent decode kernel to minimize launch overhead
- Warp-level micro-batching (e.g., one warp per sequence) for decode
- Optimized KV cache: FP16 or INT8 per-head; contiguous per-head layout; pinned host staging
- Streams/Hyper‑Q: high‑priority decode stream, concurrent lower‑priority prefill
- Read‑only (`__ldg`) and texture caches for non‑GEMM ops (norms, embeddings)
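
The FP16‑storage/FP32‑compute split is the core numeric contract. Here is a host‑side illustration of the same pattern (CPU‑only, assuming the `half` crate; the real widening happens in the CUDA tiles, not in Rust):

```rust
// Illustration of the FP16-storage / FP32-compute pattern on the host.
// The GPU kernels do the same widening in shared memory / registers.
use half::f16;

/// Dot product over FP16-stored weights with FP32 accumulation.
fn dot_f16_storage_f32_compute(weights: &[f16], activations: &[f32]) -> f32 {
    weights
        .iter()
        .zip(activations)
        // Widen each FP16 weight to FP32 at the point of use...
        .map(|(w, a)| w.to_f32() * a)
        // ...and accumulate in FP32, mirroring register-resident compute.
        .sum()
}

fn main() {
    let w: Vec<f16> = [0.5f32, -1.25, 2.0].iter().map(|&x| f16::from_f32(x)).collect();
    let a = [1.0f32, 2.0, 3.0];
    // 0.5*1 - 1.25*2 + 2*3 = 4.0, exact in both FP16 and FP32.
    assert_eq!(dot_f16_storage_f32_compute(&w, &a), 4.0);
}
```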
This project uses Cargo feature flags to switch between CPU‑only and GPU‑accelerated builds, and to include an optional HTTP server.
- `cuda`: Enables the CUDA backend. When set:
  - Requires `nvcc` on PATH; CUDA builds fail fast if the toolchain is missing.
  - Compiles CUDA kernels for sm_52 (plus compute_52 PTX) and links against the CUDA runtime.
  - If the cuBLAS header (`cublas_v2.h`) is found and `M40LLM_ENABLE_CUBLAS=1` is set, we also link cuBLAS and enable GEMM paths and tests.
- `server`: Includes the HTTP server binary routes so you can run `m40-llm run ...`.
Build script behavior:
- Compiles kernels for `sm_52` and also embeds PTX for `compute_52` so newer GPUs can JIT from PTX if needed.
- Exposes `cfg(nvcc)` when a real CUDA toolchain is present.
- Exposes `cfg(have_cublas)` when cuBLAS headers and libraries are found and `M40LLM_ENABLE_CUBLAS=1` is set.
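
A condensed sketch of how a build script can emit those cfgs (the real build.rs probes more thoroughly; the `CUDA_HOME`-based header check below is an assumption):

```rust
// build.rs sketch: emits cfg(nvcc) / cfg(have_cublas) as described above.
// Detection is simplified relative to the real script.
use std::{env, path::Path, process::Command};

fn main() {
    // Declare the custom cfgs so newer Cargo/rustc don't warn about them.
    println!("cargo:rustc-check-cfg=cfg(nvcc)");
    println!("cargo:rustc-check-cfg=cfg(have_cublas)");

    // cfg(nvcc): a real CUDA toolchain is present if `nvcc --version` succeeds.
    let have_nvcc = Command::new("nvcc")
        .arg("--version")
        .output()
        .map(|o| o.status.success())
        .unwrap_or(false);
    if have_nvcc {
        println!("cargo:rustc-cfg=nvcc");
    }

    // cfg(have_cublas): header found AND the explicit opt-in is set.
    let opted_in = env::var("M40LLM_ENABLE_CUBLAS").map(|v| v == "1").unwrap_or(false);
    let header_found = env::var("CUDA_HOME") // assumed probe location
        .map(|home| Path::new(&home).join("include/cublas_v2.h").exists())
        .unwrap_or(false);
    if have_nvcc && opted_in && header_found {
        println!("cargo:rustc-cfg=have_cublas");
        println!("cargo:rustc-link-lib=cublas");
    }
}
```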
Build the project in one of these modes:
- CPU only (no CUDA):
  - Build: `cargo build --no-default-features`
  - Test: `cargo test --no-default-features`
- CUDA enabled (requires nvcc on PATH):
  - Build: `cargo build --features cuda`
  - Test: `cargo test --features cuda`
- CPU‑only mode: `cargo test --no-default-features` runs all non‑CUDA tests.
- CUDA mode (`--features cuda`): CUDA smoke and GEMM tests run when the environment has CUDA headers, and additional GEMM/cuBLAS tests run when the build detects `cublas_v2.h`. Tests rely on `nvcc` being present because the build fails without it when CUDA is enabled (gating sketch after this list).
- Minimal forward parity: see docs/minimal_forward.md and tests/forward_parity_toy.rs for a CUDA‑gated toy test validating one‑layer, seq_len=1 numerics.
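
The gating pattern itself is plain `cfg`: build.rs sets `nvcc`/`have_cublas`, and gated tests compile only when they apply. An illustrative sketch (these test names and bodies are placeholders, not the repo's actual tests):

```rust
// Gating sketch: cfg(nvcc) and cfg(have_cublas) come from build.rs.
// These tests are illustrative placeholders, not the real test suite.

#[cfg(nvcc)]
#[test]
fn cuda_smoke_illustration() {
    // Compiles and runs only when a real CUDA toolchain was detected.
}

#[cfg(all(nvcc, have_cublas))]
#[test]
fn cublas_gemm_illustration() {
    // Additionally requires cublas_v2.h found and M40LLM_ENABLE_CUBLAS=1
    // at build time.
}
```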
Device selection and runtime controls:
- Auto‑select M40: CudaContext::new(-1) will pick a Tesla M40 (sm_52) if one is visible. If none is visible, it falls back to device 0 (usage sketch after this list).
- Force selection: set M40LLM_FORCE_M40=1 to force runtime selection of an sm_52 device even when a specific device_id is passed.
- Respect CUDA_VISIBLE_DEVICES: device enumeration respects CUDA_VISIBLE_DEVICES. The auto‑picker searches only among visible devices and selects the first sm_52 it finds.
- cuBLAS control: by default, we do not link cuBLAS even if headers are present. Set M40LLM_ENABLE_CUBLAS=1 to enable cuBLAS integration if both the header (cublas_v2.h) and a shared library (e.g., libcublas.so.11) are detected. Otherwise, fallback CUDA kernels are used.
- Test gating: build.rs exposes cfg(nvcc) when a real CUDA toolchain is present and cfg(have_cublas) when cuBLAS is enabled; CUDA tests use these to gate cuBLAS‑specific coverage. Some CUDA tests also use require_sm52() to skip gracefully when not on an sm_52 device.
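
Putting the selection rules together, startup can look roughly like this (CudaContext::new(-1) is the documented auto‑pick call; the crate path and Result shape are assumptions):

```rust
// Device-selection sketch. Only CudaContext::new(-1) and the env vars are
// documented; the crate path and error handling here are assumptions.
use m40_llm::CudaContext;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Environment knobs (set outside the process):
    //   CUDA_VISIBLE_DEVICES=...  restricts which devices are enumerated
    //   M40LLM_FORCE_M40=1        forces sm_52 selection even with a device_id
    //   M40LLM_ENABLE_CUBLAS=1    build-time opt-in for cuBLAS linkage

    // -1 = auto-pick the first visible Tesla M40 (sm_52), else device 0.
    let ctx = CudaContext::new(-1)?;
    // ...load the GGUF model and serve decode requests with `ctx`.
    drop(ctx);
    Ok(())
}
```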
Run the server:

```bash
cargo run \
--no-default-features \
--features server \
-- run \
--model path/to.gguf \
--addr 0.0.0.0:58439
```
See CONTRIBUTING.md for guidelines.