GPU-Accelerated Distributed LLM Inference Engine
Quick Start • Models • API • Distributed • Benchmarks
NeuroGrid is a high-performance inference engine for Large Language Models (LLMs), built from scratch in Go + CUDA. Designed for both single-GPU and distributed inference across multiple machines.
```bash
# 1. Download a model
make download-tinyllama # Small model for testing (~2.2GB)
# 2. Build and run
make run
# 3. Test the API
curl -X POST http://localhost:8090/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model": "tinyllama", "messages": [{"role": "user", "content": "Hello!"}]}'
```

That's it! The server auto-detects the model and starts on port 8090.

| Requirement | Version |
|---|---|
| Go | 1.21+ |
| CUDA Toolkit | 11.x or 12.x |
| GPU | NVIDIA with Compute 7.0+ (RTX 20/30/40/50) |
| OS | Linux (Ubuntu 22.04/24.04) |

| Model | Size | VRAM | Download Command |
|---|---|---|---|
| TinyLlama 1.1B | ~2.2GB | ~3GB | make download-tinyllama |
| Mistral 7B Instruct | ~15GB | ~14GB | make download-mistral7b-instruct |
| Llama 2 7B | ~13GB | ~14GB | make download-llama7b ¹ |
| Llama 2 13B | ~26GB | ~26GB | make download-llama13b ¹ |
¹ Requires HF_TOKEN environment variable (get token at huggingface.co/settings/tokens)
```bash
# Generic download - works with any public model
make download REPO=mistralai/Mistral-Nemo-Instruct-2407
make download REPO=Qwen/Qwen2.5-7B-Instruct
make download REPO=google/gemma-2-9b-it
# For gated models (Llama, etc.)
export HF_TOKEN=your_token
make download REPO=meta-llama/Llama-3.3-70B-Instruct
```

```bash
# Auto-detect model and run
make run
# Or run with specific model
make run-mistral # Mistral 7B Instruct
make run-tinyllama # TinyLlama 1.1B
make run-llama7b # Llama 2 7B
# Custom configuration
make run HTTP_PORT=8080 GPU_ID=1 LOG_LEVEL=debug
```

| Variable | Default | Description |
|---|---|---|
| `HTTP_PORT` | 8090 | API server port |
| `GPU_ID` | 0 | CUDA device ID |
| `LOG_LEVEL` | info | Log verbosity (debug, info, warn, error) |
POST /v1/chat/completions
```bash
curl -X POST http://localhost:8090/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "mistral-7b",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is the capital of France?"}
],
"max_tokens": 100,
"temperature": 0.7
}'
```

```json
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1769027418,
"model": "mistral-7b",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "The capital of France is Paris."
},
"finish_reason": "eos"
}]
}
```

```bash
curl http://localhost:8090/health
# {"status":"healthy","model":"mistral-7b","timestamp":1769027418,"version":"1.0.0"}
```
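Since `/v1/chat/completions` follows the OpenAI chat-completions format, any HTTP client can call it. Below is a minimal Go example built against the request and response shown above; it is an illustration, not code shipped with the project.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// Only the fields used in the examples above are modeled here.
type message struct {
	Role    string `json:"role"`
	Content string `json:"content"`
}

type chatRequest struct {
	Model     string    `json:"model"`
	Messages  []message `json:"messages"`
	MaxTokens int       `json:"max_tokens"`
}

type chatResponse struct {
	Choices []struct {
		Message message `json:"message"`
	} `json:"choices"`
}

func main() {
	body, _ := json.Marshal(chatRequest{
		Model:     "mistral-7b",
		Messages:  []message{{Role: "user", Content: "What is the capital of France?"}},
		MaxTokens: 100,
	})

	resp, err := http.Post("http://localhost:8090/v1/chat/completions",
		"application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out chatResponse
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	if len(out.Choices) > 0 {
		fmt.Println(out.Choices[0].Message.Content)
	}
}
```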
For multi-GPU inference across machines using Pipeline Parallelism.

```
┌──────────────────────────────────────────────┐
│  Coordinator (GH200)                         │
│  - HTTP API endpoint                         │
│  - Orchestrates inference                    │
│  - Layers 0-13 (local)                       │
└───────────────────────┬──────────────────────┘
                        │ P2P (libp2p)
            ┌───────────┴───────────┐
            │                       │
┌───────────┴─────────┐   ┌─────────┴───────────┐
│  Worker (RTX 4090)  │   │  Worker (RTX 2080)  │
│  Layers 14-26       │   │  Layers 27-39       │
└─────────────────────┘   └─────────────────────┘
```
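Conceptually, a pipeline-parallel decode step runs the hidden states through each stage's block of layers in order, hopping over libp2p between machines, and only the last stage's output is turned into logits for sampling. The sketch below illustrates this flow only; the `Stage` type and method names are hypothetical, not NeuroGrid's internal API.

```go
package main

import "fmt"

// Stage is a hypothetical abstraction for one pipeline stage: either the
// coordinator's local layers or a worker reached over libp2p.
type Stage interface {
	// Forward runs this stage's contiguous block of transformer layers.
	Forward(hidden []float32) ([]float32, error)
}

// localStage stands in for "run these layers on this GPU"; a remote stage
// would serialize the activations and send them to a worker instead.
type localStage struct{ name string }

func (s localStage) Forward(h []float32) ([]float32, error) {
	fmt.Println("running", s.name)
	return h, nil // a real stage would run its transformer layers here
}

// decodeStep pushes hidden states through every stage in order; the final
// output is projected to logits and a token is sampled on the coordinator.
func decodeStep(stages []Stage, hidden []float32) ([]float32, error) {
	for _, s := range stages {
		var err error
		if hidden, err = s.Forward(hidden); err != nil {
			return nil, err
		}
	}
	return hidden, nil
}

func main() {
	stages := []Stage{
		localStage{"coordinator: layers 0-13"},
		localStage{"worker 1: layers 14-26"},
		localStage{"worker 2: layers 27-39"},
	}
	if _, err := decodeStep(stages, make([]float32, 4096)); err != nil {
		panic(err)
	}
}
```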
When workers have the model downloaded locally, skip weight transfer for faster startup.
Important: Use -wait-for-assignment on workers so they only load layers assigned by the coordinator. This is critical for heterogeneous GPU clusters to prevent OOM errors.
```bash
# Coordinator (GH200 - start first)
LD_LIBRARY_PATH=./build ./build/neurogrid \
-model models/mistral-nemo-instruct-2407 \
-http-port 8090 \
-p2p-port 9000 \
-gpu 0 \
-min-peers 2 \
-max-seq-len 4096 \
-skip-weight-transfer \
-disable-mdns
# Worker 1 (RTX 4090 - connect via bootstrap)
LD_LIBRARY_PATH=./build ./build/worker \
-bootstrap /ip4/<COORDINATOR_IP>/tcp/9000/p2p/<COORDINATOR_PEER_ID> \
-model models/mistral-nemo-instruct-2407 \
-gpu 0 \
-port 9001 \
-wait-for-assignment
# Worker 2 (RTX 2080 - connect via bootstrap)
LD_LIBRARY_PATH=./build ./build/worker \
-bootstrap /ip4/<COORDINATOR_IP>/tcp/9000/p2p/<COORDINATOR_PEER_ID> \
-model /path/to/models/mistral-nemo-instruct-2407 \
-gpu 0 \
-port 9002 \
-wait-for-assignment
```

What happens:
- Workers connect and report their GPU info (VRAM, name) to coordinator
- Coordinator computes layer assignments based on actual VRAM of each GPU
- Coordinator sends layer requests to workers
- Workers load only their assigned layers from local storage
For simple LAN setups, workers can be discovered automatically via mDNS:
```bash
# Machine 1: Worker with GPU 0
make run-worker GPU_ID=0 P2P_PORT=9001
# Machine 2: Worker with GPU 0
make run-worker GPU_ID=0 P2P_PORT=9002
# Machine 3: Coordinator (connects to workers automatically via mDNS)
make run-coordinator MIN_PEERS=2
```
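Discovery here relies on libp2p's mDNS service. The standalone sketch below shows how LAN peer discovery works with go-libp2p; the "neurogrid" service tag and the connect-on-discovery behavior are assumptions for illustration, not the project's actual wiring.

```go
package main

import (
	"context"
	"fmt"

	"github.com/libp2p/go-libp2p"
	"github.com/libp2p/go-libp2p/core/host"
	"github.com/libp2p/go-libp2p/core/peer"
	"github.com/libp2p/go-libp2p/p2p/discovery/mdns"
)

// notifee is called by the mDNS service whenever a peer is found on the LAN.
type notifee struct{ h host.Host }

func (n *notifee) HandlePeerFound(pi peer.AddrInfo) {
	fmt.Println("discovered peer:", pi.ID)
	_ = n.h.Connect(context.Background(), pi) // dial the discovered peer
}

func main() {
	h, err := libp2p.New()
	if err != nil {
		panic(err)
	}
	defer h.Close()

	// "neurogrid" is an assumed service tag; all peers must use the same tag.
	svc := mdns.NewMdnsService(h, "neurogrid", &notifee{h: h})
	if err := svc.Start(); err != nil {
		panic(err)
	}
	select {} // keep running
}
```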
Coordinator flags:

| Flag | Default | Description |
|---|---|---|
| `-min-peers` | 0 | Minimum workers to wait for (0 = local only) |
| `-skip-weight-transfer` | false | Skip P2P weight distribution (workers have local models) |
| `-disable-mdns` | false | Disable mDNS discovery (use explicit bootstrap) |
| `-max-seq-len` | 4096 | Max sequence length (caps KV cache size) |

Worker flags:

| Flag | Default | Description |
|---|---|---|
| `-bootstrap` | "" | Coordinator address for explicit connection |
| `-model` | "" | Path to local model weights |
| `-wait-for-assignment` | false | Critical for heterogeneous clusters: wait for the coordinator to assign layers before loading |
| `-max-seq-len` | 4096 | Max sequence length (caps KV cache size) |
| `-port` | 9000 | P2P listen port |
| `-gpu` | 0 | GPU device ID |
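The `-max-seq-len` flag matters for memory because the KV cache grows linearly with sequence length. A rough back-of-the-envelope estimate is sketched below; the model dimensions (32 layers, 8 KV heads, head dim 128, FP16) are assumptions typical of a 7B-class model, not values read from NeuroGrid, and in distributed mode each node only caches the layers it hosts.

```go
package main

import "fmt"

func main() {
	// Assumed 7B-class model shape; adjust for the actual checkpoint.
	const (
		layers      = 32
		kvHeads     = 8    // grouped-query attention
		headDim     = 128
		bytesPerVal = 2    // FP16
		maxSeqLen   = 4096 // -max-seq-len
	)
	// K and V are cached per layer, per KV head, per position.
	kvBytes := 2 * layers * kvHeads * headDim * maxSeqLen * bytesPerVal
	fmt.Printf("KV cache upper bound: %.2f GiB\n", float64(kvBytes)/(1<<30))
	// Halving -max-seq-len halves this bound, which is why lowering it is
	// a quick fix for out-of-memory errors on smaller workers.
}
```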
NeuroGrid supports clusters with different GPU types (e.g., GH200 + RTX 4090 + RTX 2080). The system:
- GPU Info Protocol: Workers report actual VRAM to coordinator on connect
- VRAM-Aware Scheduling: Scheduler assigns layers based on each GPU's available memory (a sketch of the idea follows below)
- On-Demand Loading: With `-wait-for-assignment`, workers load only their assigned layers
This prevents OOM errors on smaller GPUs that would occur if all GPUs tried to load the same layers.
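A minimal sketch of a VRAM-proportional layer split is shown below. It illustrates the general idea only; it is not the project's actual scheduler, which would also need to leave headroom for the KV cache and activations.

```go
package main

import "fmt"

// assignLayers splits totalLayers across GPUs in proportion to their VRAM,
// so a 24 GB card receives roughly three times as many layers as an 8 GB card.
func assignLayers(totalLayers int, vramGB []float64) [][2]int {
	var totalVRAM float64
	for _, v := range vramGB {
		totalVRAM += v
	}
	ranges := make([][2]int, len(vramGB)) // half-open [start, end) per GPU
	start := 0
	for i, v := range vramGB {
		count := int(float64(totalLayers) * v / totalVRAM)
		if i == len(vramGB)-1 { // last GPU absorbs any rounding remainder
			count = totalLayers - start
		}
		ranges[i] = [2]int{start, start + count}
		start += count
	}
	return ranges
}

func main() {
	// Example: GH200 (96 GB), RTX 4090 (24 GB), RTX 2080 (8 GB), 40 layers.
	fmt.Println(assignLayers(40, []float64{96, 24, 8})) // [[0 30] [30 37] [37 40]]
}
```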
Ghost Peer Issue: If you see connections to unknown peers, use -disable-mdns and connect workers via explicit -bootstrap addresses.
Out of Memory on Workers:
- Use `-wait-for-assignment` on workers (most common fix)
- Reduce `-max-seq-len` to use less KV cache memory
- The scheduler automatically assigns fewer layers to GPUs with less VRAM
Workers not receiving layer assignments: Ensure coordinator has -min-peers set to the number of expected workers.
```bash
# Verify CUDA installation
nvcc --version
nvidia-smi
# Verify Go installation
go version
```

```bash
# Build everything (CUDA library + binaries)
make build-all
# Build CUDA library only
make cuda
# Build specific binary
make build-coordinator
make build-worker
# Clean build artifacts
make clean
```

```bash
make test # CUDA tests
make test-e2e # End-to-end tests (no CUDA required)
make test-all # All tests
```

| Metric | Value |
|---|---|
| TTFT (Time to First Token) | ~1.3s |
| Generation Speed | ~6.9 tokens/sec |
| GPU Memory | ~14GB |

```bash
make bench-quick # Quick benchmark
make bench-full # Full benchmark suite
```

```
neurogrid-engine/
├── cmd/
│ ├── neurogrid/ # Main server (coordinator)
│ ├── worker/ # Distributed worker node
│ └── download/ # Model download utility
├── gpu/
│ ├── cuda/ # CUDA kernels
│ ├── engine/ # C++ layer implementation
│ └── bindings/ # Go ↔ CUDA bindings
├── pkg/
│ ├── inference/ # Inference engine
│ ├── model/ # Model loading & tokenizer
│ ├── scheduler/ # Layer distribution
│ └── huggingface/ # HF model downloader
├── api/ # HTTP API server
├── p2p/ # libp2p networking
└── Makefile
```
If you see libgpu_engine.so: cannot open shared object file:
```bash
# Option 1: Use make (handles LD_LIBRARY_PATH automatically)
make run
# Option 2: Set LD_LIBRARY_PATH manually
export LD_LIBRARY_PATH=$(pwd)/build:/usr/local/cuda/lib64:$LD_LIBRARY_PATH
./build/neurogrid --model ./models/tinyllama --model-name tinyllama
```

```bash
# Check available models
ls ./models/
# Download a model
make download-tinyllama
```

Try a smaller model:

```bash
make download-tinyllama
make run-tinyllama
```

```bash
make help # Show all available commands
```

Source Available License with Academic & Educational Use Grant
- ✅ Free for students, researchers, and academic use
- ✅ Free for personal learning and non-commercial projects
- ❌ Commercial use requires a license
Contact: leandrobar93@gmail.com
- cuBLAS - NVIDIA's GPU-accelerated BLAS
- libp2p - Peer-to-peer networking
- HuggingFace - Model hub
NeuroGrid Engine v0.1.0
Built with Go + CUDA