Welcome to Inferno! This guide will help you get up and running with the Inferno AI/ML inference server quickly.
## Table of Contents

- Prerequisites
- Installation
- Quick Start
- Configuration
- Model Management
- API Usage
- Monitoring & Observability
- Production Deployment
- Next Steps
## Prerequisites

### Hardware

- OS: Linux, macOS, or Windows
- RAM: 8 GB minimum (16 GB+ recommended for larger models)
- CPU: Modern multi-core processor
- Storage: 10 GB+ free space for models and cache
- GPU (optional): NVIDIA CUDA, AMD ROCm, or Apple Metal for acceleration

### Software

- Rust: 1.70+ (for building from source)
- Docker: 20.10+ (for containerized deployment)
- curl: for testing API endpoints
## Installation

### Option 1: Build from Source

```bash
# Install Rust (if you don't have it)
curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
source ~/.cargo/env

# Clone the repository
git clone https://github.com/inferno-ai/inferno.git
cd inferno

# Build the project
cargo build --release

# The binary will be at target/release/inferno
```

### Option 2: Docker

```bash
# Pull the latest image
docker pull inferno:latest

# Or build from source
docker build -t inferno .
```

### Option 3: Pre-built Binary

```bash
# Download for your platform
curl -L -o inferno https://github.com/inferno-ai/inferno/releases/latest/download/inferno-linux-x86_64
chmod +x inferno
sudo mv inferno /usr/local/bin/
```

### Verify the Installation

```bash
# Check version
inferno --version

# View help
inferno --help
```
## Quick Start

### 1. Initialize Configuration

```bash
# Create default configuration
inferno config init

# View current configuration
inferno config show
```

### 2. Set Up a Models Directory

```bash
# Create models directory
mkdir -p ~/inferno/models

# Set models directory
export INFERNO_MODELS_DIR=~/inferno/models

# Or set it in the config
inferno config set models_dir ~/inferno/models
```

### 3. Add a Model

```bash
# For this example, we'll create a mock model file.
# In practice, you would download actual GGUF or ONNX models.
mkdir -p ~/inferno/models
echo "GGUF" > ~/inferno/models/sample-model.gguf

# List available models
inferno models list
```

### 4. Start the Server

```bash
# Start with default settings
inferno serve

# Or with custom settings
inferno serve --bind 0.0.0.0:8080 --models-dir ~/inferno/models
```

### 5. Test the API

```bash
# Health check
curl http://localhost:8080/health

# List models
curl http://localhost:8080/models

# Run inference (with the mock model)
curl -X POST http://localhost:8080/inference \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sample-model",
    "prompt": "Hello, world!",
    "max_tokens": 50
  }'
```
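The same request can be scripted. Below is a minimal Python sketch using the `requests` library; the response schema isn't documented above, so the example simply prints the raw JSON it gets back.

```python
# quickstart_client.py — minimal sketch of calling Inferno's /inference endpoint.
import requests

INFERNO_URL = "http://localhost:8080"

def run_inference(model: str, prompt: str, max_tokens: int = 50) -> dict:
    """Send one inference request and return the parsed JSON response."""
    resp = requests.post(
        f"{INFERNO_URL}/inference",
        json={"model": model, "prompt": prompt, "max_tokens": max_tokens},
        timeout=300,  # generous timeout; model inference can be slow
    )
    resp.raise_for_status()  # surface HTTP errors early
    return resp.json()

if __name__ == "__main__":
    print(run_inference("sample-model", "Hello, world!"))
```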
## Configuration

Inferno searches for configuration files in this order:

1. `.inferno.toml` (current directory)
2. `~/.inferno.toml` (home directory)
3. `~/.config/inferno/config.toml` (config directory)

Create a configuration file:
```bash
cat > ~/.inferno.toml << EOF
# Basic settings
models_dir = "~/inferno/models"
cache_dir = "~/inferno/cache"
log_level = "info"
log_format = "pretty"

# Server settings
[server]
bind_address = "127.0.0.1"
port = 8080
max_concurrent_requests = 10
request_timeout_seconds = 300

# Backend settings
[backend_config]
context_size = 4096
batch_size = 512
gpu_layers = 32

# Security settings
[model_security]
verify_checksums = true
allowed_model_extensions = ["gguf", "onnx"]
max_model_size_gb = 50.0
sandbox_enabled = true

# Metrics settings
[metrics]
enabled = true
bind_address = "127.0.0.1"
port = 9090
path = "/metrics"
collection_interval_seconds = 10

# Observability settings
[observability]
prometheus_enabled = true
prometheus_endpoint = "/metrics"
otel_enabled = false
grafana_enabled = false
EOF
```

### Environment Variables

You can override any configuration value with environment variables. Names are prefixed with `INFERNO_`, and nested keys use a double underscore (`__`) as the separator:
```bash
export INFERNO_LOG_LEVEL=debug
export INFERNO_SERVER__PORT=8081
export INFERNO_BACKEND_CONFIG__GPU_LAYERS=40
```

## Model Management

### Supported Formats

- GGUF: Quantized models (recommended for CPU inference)
- ONNX: Optimized neural networks
### Loading Models

```bash
# Load a model into memory
inferno models load llama-2-7b --gpu-layers 32

# Or via API
curl -X POST http://localhost:8080/models/llama-2-7b/load \
  -H "Content-Type: application/json" \
  -d '{"gpu_layers": 32, "context_size": 4096}'
```

### Inspecting Models

```bash
# List all models
inferno models list

# Show model details
inferno models info llama-2-7b

# Validate a model file
inferno validate ~/inferno/models/llama-2-7b.gguf
```

### Unloading Models

```bash
# Unload from memory
inferno models unload llama-2-7b

# Or via API
curl -X POST http://localhost:8080/models/llama-2-7b/unload
```
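In scripts, the load and unload endpoints pair naturally into a small lifecycle helper. A minimal Python sketch against the endpoints shown above (the helper names are ours; error handling is kept deliberately thin):

```python
# model_lifecycle.py — load a model, do work, and always unload it afterwards.
import requests

BASE = "http://localhost:8080"

def load_model(name: str, gpu_layers: int = 32, context_size: int = 4096) -> None:
    requests.post(
        f"{BASE}/models/{name}/load",
        json={"gpu_layers": gpu_layers, "context_size": context_size},
        timeout=300,
    ).raise_for_status()

def unload_model(name: str) -> None:
    requests.post(f"{BASE}/models/{name}/unload", timeout=60).raise_for_status()

load_model("llama-2-7b")
try:
    ...  # run inference requests here
finally:
    unload_model("llama-2-7b")  # release memory even if inference fails
```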
## API Usage

### Authentication

If authentication is enabled, you'll need an API key:

```bash
# Generate an API key
inferno security api-key create --user admin --name my-key

# Use it in requests
curl -H "Authorization: Bearer YOUR_API_KEY" \
  http://localhost:8080/models
```
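From Python, a `requests.Session` attaches the key to every call so you only set it once (substitute a real key for the placeholder):

```python
import requests

session = requests.Session()
session.headers["Authorization"] = "Bearer YOUR_API_KEY"  # placeholder key

# Every request made through this session now carries the API key.
models = session.get("http://localhost:8080/models", timeout=30)
print(models.json())
```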
-H "Content-Type: application/json" \
-d '{
"model": "llama-2-7b",
"prompt": "Explain quantum computing in simple terms",
"max_tokens": 200,
"temperature": 0.7,
"top_p": 0.9
}'curl -X POST http://localhost:8080/inference/stream \
-H "Content-Type: application/json" \
-H "Accept: text/event-stream" \
-d '{
"model": "llama-2-7b",
"prompt": "Write a short poem about AI",
"max_tokens": 100,
"stream": true
}'# Chat completions
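To consume the stream from Python, read the response incrementally. The sketch below assumes the endpoint emits standard server-sent events (`data: ...` lines); since the event payload format isn't documented above, each data line is printed as-is:

```python
import requests

resp = requests.post(
    "http://localhost:8080/inference/stream",
    headers={"Accept": "text/event-stream"},
    json={
        "model": "llama-2-7b",
        "prompt": "Write a short poem about AI",
        "max_tokens": 100,
        "stream": True,
    },
    stream=True,  # don't buffer the whole response body
    timeout=300,
)
resp.raise_for_status()

for line in resp.iter_lines(decode_unicode=True):
    # SSE frames arrive as "data: <payload>" lines separated by blank lines.
    if line and line.startswith("data:"):
        print(line[len("data:"):].strip())
```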
### OpenAI-Compatible Endpoints

```bash
# Chat completions
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "What is machine learning?"}
    ]
  }'

# Text completions
curl -X POST http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "prompt": "The future of AI is",
    "max_tokens": 50
  }'
```
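Because these endpoints follow the OpenAI wire format, existing OpenAI clients can usually be pointed at Inferno by overriding the base URL. A sketch using the official `openai` Python package (v1+); how complete the compatibility is depends on the server build:

```python
from openai import OpenAI

# The client requires an api_key argument; a local server ignores it unless
# Inferno's authentication is enabled.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="llama-2-7b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "What is machine learning?"},
    ],
)
print(reply.choices[0].message.content)
```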
### Batch Processing

```bash
# Submit a batch job
curl -X POST http://localhost:8080/batch \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama-2-7b",
    "requests": [
      {"id": "1", "prompt": "What is Python?"},
      {"id": "2", "prompt": "Explain machine learning"},
      {"id": "3", "prompt": "What is quantum computing?"}
    ],
    "max_tokens": 100
  }'

# Check batch status
curl http://localhost:8080/batch/BATCH_ID

# Get results
curl http://localhost:8080/batch/BATCH_ID/results
```
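A submit-and-poll loop is the usual pattern for batch jobs. In the Python sketch below, the response fields (`id`, `status`, and the `"running"` state) are illustrative assumptions — check the actual batch response schema in your deployment before relying on them:

```python
import time
import requests

BASE = "http://localhost:8080"

# Submit the batch job.
submit = requests.post(
    f"{BASE}/batch",
    json={
        "model": "llama-2-7b",
        "requests": [
            {"id": "1", "prompt": "What is Python?"},
            {"id": "2", "prompt": "Explain machine learning"},
        ],
        "max_tokens": 100,
    },
    timeout=60,
)
submit.raise_for_status()
batch_id = submit.json()["id"]  # assumed field name

# Poll the status endpoint until the job leaves the running state.
while requests.get(f"{BASE}/batch/{batch_id}", timeout=30).json().get("status") == "running":
    time.sleep(2)

# Fetch the results once the job has finished.
print(requests.get(f"{BASE}/batch/{batch_id}/results", timeout=30).json())
```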
## Monitoring & Observability

### Enabling Metrics

```bash
# Start the server with metrics enabled
inferno serve --metrics-enabled

# View Prometheus metrics
curl http://localhost:8080/metrics
```

### Observability Tooling

```bash
# Initialize observability
inferno observability init --prometheus --grafana

# Start the metrics server
inferno observability metrics serve

# Show observability status
inferno observability status
```

### Key Metrics

- `inferno_inference_requests_total`: Total inference requests
- `inferno_inference_duration_seconds`: Request duration
- `inferno_models_loaded`: Number of loaded models
- `inferno_memory_usage_bytes`: Memory consumption
- `inferno_errors_total`: Error count
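Since `/metrics` serves the Prometheus text exposition format, it is easy to spot-check from a script. A small Python sketch that prints only the `inferno_*` series:

```python
import requests

body = requests.get("http://localhost:8080/metrics", timeout=30).text
for line in body.splitlines():
    # Skip "# HELP" / "# TYPE" comments and any non-Inferno series.
    if line.startswith("inferno_"):
        print(line)
```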
## Production Deployment

### Docker Compose

```bash
# Clone the examples
git clone https://github.com/inferno-ai/inferno.git
cd inferno/examples

# Start the full stack
docker-compose up -d

# View logs
docker-compose logs -f inferno

# Scale horizontally
docker-compose up -d --scale inferno=3
```
### Production Environment Variables

```bash
export INFERNO_LOG_LEVEL=warn
export INFERNO_LOG_FORMAT=json
export INFERNO_AUTH_ENABLED=true
export INFERNO_RATE_LIMITING_ENABLED=true
export INFERNO_MAX_REQUESTS_PER_MINUTE=1000
export INFERNO_PROMETHEUS_ENABLED=true
export INFERNO_OTEL_ENABLED=true
```
### Security Hardening

```bash
# Enable authentication
inferno security init

# Create an admin user
inferno security user create admin --role admin

# Generate API keys
inferno security api-key create --user admin --name production

# Configure rate limiting
inferno security rate-limit set --requests-per-minute 1000
```
### Health Checks

```bash
# Built-in health check
curl http://localhost:8080/health
```

A custom health-check script (the `grep` assumes the `/health` response body contains the word "healthy"):

```bash
#!/bin/bash
response=$(curl -s http://localhost:8080/health)
if echo "$response" | grep -q "healthy"; then
    echo "Service is healthy"
    exit 0
else
    echo "Service is unhealthy"
    exit 1
fi
```
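For orchestrators or cron-based monitoring, the same check can be written in Python with an explicit timeout; the exit codes mirror the shell script above:

```python
import sys
import requests

try:
    resp = requests.get("http://localhost:8080/health", timeout=5)
    healthy = resp.ok and "healthy" in resp.text
except requests.RequestException:
    healthy = False  # connection refused, timeout, etc.

print("Service is healthy" if healthy else "Service is unhealthy")
sys.exit(0 if healthy else 1)
```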
## Next Steps

### Advanced Features

- Real-time streaming: `inferno streaming interactive --model llama-2-7b`
- Distributed inference: `inferno distributed worker start`
- GPU acceleration: `inferno gpu enable`, `inferno gpu status`
- A/B testing: `inferno ab-test create --name model-comparison`
### Community & Support

- GitHub: Issues and discussions
- GitHub Discussions: Community help
- Documentation: Wiki
### Troubleshooting

Common issues:

- Model not loading: check the file format and size limits
- Out of memory: reduce the context size or enable GPU offloading
- Slow inference: enable GPU acceleration or increase the batch size
- API errors: verify authentication and rate limits

Useful debug commands:

```bash
# Check logs
inferno logs

# Validate configuration
inferno config validate

# Test a model file
inferno validate /path/to/model.gguf

# System information
inferno system info
```

For more verbose output, raise the log level:

```bash
# Run with debug logging
INFERNO_LOG_LEVEL=debug inferno serve

# Enable trace logging with full backtraces
INFERNO_LOG_LEVEL=trace RUST_BACKTRACE=full inferno serve
```

🎉 Congratulations! You're now ready to use Inferno for AI/ML inference. Check out the examples directory for more detailed usage patterns and integration examples.