FAQ
Common questions and answers about Inferno. Can't find what you're looking for? Visit GitHub Discussions!
Inferno is a production-ready AI inference server that runs entirely on your local hardware. Think of it as your private ChatGPT that works offline, supports multiple model formats, and gives you complete control over your AI infrastructure.
| Aspect | Cloud AI (OpenAI, etc.) | Inferno |
|---|---|---|
| Privacy | Data sent to external servers | 100% local processing |
| Internet | Required for every request | Works completely offline |
| Models | Limited to provider's models | Use any model you want |
| Speed | Network dependent | Local hardware speed |
| Control | Limited customization | Full control over everything |
Inferno itself is free: it is completely open source under the MIT and Apache 2.0 licenses. You pay only for:
- Your hardware (computer, GPU)
- Electricity to run it
- Models (many are free, some commercial models require licenses)
No subscription fees, no per-token costs, no vendor lock-in.
Typical uses:
- Chat with AI models (like ChatGPT, but private)
- Generate code (coding assistants, debugging)
- Process documents (summarization, analysis, translation)
- Create content (writing, brainstorming, creative tasks)
- Automate workflows (batch processing, API integration)
- Build AI-powered applications (embed in your software)
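The last two items boil down to calling Inferno's HTTP API. As a taste, here is a stdlib-only Python sketch that builds the kind of OpenAI-style chat request shown later in this FAQ; it only constructs the request object, so no server is contacted:

```python
import json
import urllib.request

def build_chat_request(model: str, text: str) -> urllib.request.Request:
    """Prepare an OpenAI-style chat request for a local Inferno server."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": text}],
    }
    return urllib.request.Request(
        "http://localhost:8080/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("your-model", "Hello!")
print(req.get_method())  # POST
```

Sending it with `urllib.request.urlopen(req)` (or any HTTP client) against a running server returns the completion as JSON.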
Format Support:
- ✅ GGUF - Optimized format for CPU/GPU inference (llama.cpp compatible)
- ✅ ONNX - Cross-platform ML format (PyTorch, TensorFlow, scikit-learn exports)
- ✅ PyTorch - Via conversion to GGUF/ONNX
- ✅ SafeTensors - Via conversion to GGUF/ONNX
Popular Models:
- Llama 2 (7B, 13B, 70B) - General purpose, code, chat
- Mistral (7B, 8x7B) - High performance, multilingual
- CodeLlama (7B, 13B, 34B) - Code generation and completion
- Vicuna, Alpaca, WizardLM - Various specialized models
- Custom models - Any GGUF or ONNX model
Minimum (7B models):
- 8GB RAM, any modern CPU
- 20GB storage space
- No GPU required (but helpful)
Recommended (13B models):
- 16GB RAM, multi-core CPU
- NVIDIA RTX 3060+ or Apple M1/M2
- 50GB+ SSD storage
Optimal (30B+ models):
- 32GB+ RAM
- NVIDIA RTX 4080+ or Apple M2 Ultra
- 100GB+ NVMe SSD
See System Requirements for detailed specifications.
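A rough way to sanity-check these tiers: a quantized model's weights need roughly the parameter count times the bits per weight. The sketch below is a back-of-the-envelope estimate, not Inferno's actual memory accounting; the 4.5 bits/weight figure approximates Q4_0 quantization (4-bit weights plus a per-block scale), and real usage adds context and runtime overhead:

```python
def approx_model_ram_gb(params_billions: float, bits_per_weight: float = 4.5) -> float:
    """Back-of-the-envelope RAM estimate for a quantized model's weights."""
    bytes_total = params_billions * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

# A 7B model at Q4_0 quantization: weights alone need about 4 GB,
# which is why 8GB RAM is a workable minimum for this class.
print(round(approx_model_ram_gb(7), 1))  # 3.9
```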
Inferno does not need the internet for normal operation; it runs completely offline after initial setup:
Requires Internet:
- Downloading Inferno itself
- Downloading AI models (one-time)
- Software updates (optional)
Works Offline:
- All AI inference and chat
- Model conversion and management
- API services and batch processing
- Everything after initial setup
Perfect for air-gapped environments, remote locations, or privacy-critical applications.
Hardware Upgrades:
- GPU: Single biggest performance boost (10-50x faster)
- SSD: Faster model loading and caching
- RAM: Handle larger models and context
Software Optimizations:
- Model Choice: Smaller models (7B vs 70B) run much faster
- Quantization: Q4_0 models use less memory and run faster
- Context Size: Reduce `context_size` in your config
- Batch Size: Tune `batch_size` for your hardware
See Performance Tuning for detailed optimization guide.
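As a sketch, the settings above might look like this in Inferno's config file. The key names `context_size`, `batch_size`, and `models_dir` appear elsewhere in this FAQ; the values and exact file layout are assumptions, so check the Performance Tuning guide for the real schema:

```toml
# Illustrative values only - tune for your hardware
context_size = 2048   # smaller context window lowers memory use
batch_size = 256      # larger batches help GPUs, hurt low-RAM CPUs
models_dir = "/custom/path/models"
```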
Docker (Recommended):
- ✅ Fastest setup (5 minutes)
- ✅ Consistent environment
- ✅ Easy updates and backups
- ✅ Works on any platform
Pre-built Binaries:
- ✅ No Docker required
- ✅ Native performance
- ⚠️ Manual dependency management
Build from Source:
- ✅ Latest features
- ✅ Custom optimizations
- ⚠️ Requires development tools (30+ minutes)
Choose Docker unless you have specific requirements.
```bash
# Using Inferno's built-in downloader
inferno models download llama-2-7b-chat
inferno models download mistral-7b-instruct

# List available models
inferno models available

# Manual download (from Hugging Face, etc.)
# Place .gguf or .onnx files in your models directory
```

Popular Model Sources:
- Hugging Face - Largest collection
- TheBloke - High-quality GGUF conversions
- Microsoft - Official Microsoft models
- Meta - Official Llama models
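Downloads from these sources occasionally arrive truncated or mislabeled. One quick sanity check, shown here as an illustrative Python sketch (not part of Inferno), is the file's magic number: valid GGUF files begin with the four ASCII bytes `GGUF`:

```python
from pathlib import Path

def looks_like_gguf(path: str) -> bool:
    """Return True if the file starts with the GGUF magic bytes."""
    with open(path, "rb") as f:
        return f.read(4) == b"GGUF"

# Demo with a throwaway file standing in for a real download
Path("demo.gguf").write_bytes(b"GGUF" + b"\x00" * 16)
print(looks_like_gguf("demo.gguf"))  # True
```

The `file` command used later in the troubleshooting section performs the same check.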
Default Locations:
- Linux: `~/.local/share/inferno/models/`
- macOS: `~/Library/Application Support/inferno/models/`
- Windows: `%APPDATA%\inferno\models\`
- Docker: `/data/models/` (mapped to a host volume)
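When a client or script needs to find the model files, the per-platform defaults above can be resolved programmatically. This is an illustrative sketch mirroring the list (Inferno resolves the path internally, so treat this as a convenience, not its actual implementation):

```python
import os
import sys
from pathlib import Path

def default_models_dir() -> Path:
    """Mirror Inferno's default model locations per platform."""
    if sys.platform == "darwin":  # macOS
        return Path.home() / "Library" / "Application Support" / "inferno" / "models"
    if os.name == "nt":  # Windows: %APPDATA%\inferno\models
        return Path(os.environ.get("APPDATA", "")) / "inferno" / "models"
    # Linux and other Unix-likes
    return Path.home() / ".local" / "share" / "inferno" / "models"

print(default_models_dir())
```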
Custom Location:
```bash
# Via command line
inferno serve --models-dir /path/to/models
```

```toml
# Via config file
models_dir = "/custom/path/models"
```

```bash
# Via environment variable
export INFERNO_MODELS_DIR="/custom/path/models"
```

To update Inferno:
Docker:
```bash
docker pull inferno:latest
docker stop inferno && docker rm inferno
docker run -d --name inferno -p 8080:8080 inferno:latest serve
```

Binary:
```bash
# Download latest release
wget https://github.com/ringo380/inferno/releases/latest/download/inferno-linux-x86_64.tar.gz
tar xzf inferno-linux-x86_64.tar.gz
sudo mv inferno /usr/local/bin/
```

Source:
```bash
cd inferno
git pull
cargo build --release
sudo cp target/release/inferno /usr/local/bin/
```

100% Private by Design:
- ✅ No data transmission - Everything runs locally
- ✅ No telemetry - Inferno doesn't "phone home"
- ✅ No cloud dependencies - Works completely offline
- ✅ Open source - Audit the code yourself
- ✅ Your infrastructure - You control everything
Even more private than:
- Self-hosted cloud solutions (no internet required)
- VPN + cloud AI (no external connections)
- Enterprise AI platforms (no vendor access)
Security Features:
- ✅ Input validation - All user inputs sanitized
- ✅ Memory safety - Written in Rust (a memory-safe language)
- ✅ Authentication - JWT tokens, API keys, RBAC
- ✅ Audit logging - Track all operations
- ✅ Rate limiting - Prevent abuse
- ✅ TLS/HTTPS - Encrypted communications
Security Best Practices:
- Enable authentication for production use
- Use HTTPS/TLS for network access
- Keep Inferno updated
- Follow Security Hardening guide
Inferno is designed with sensitive use cases in mind:
Industries Using Inferno:
- Healthcare - HIPAA-compliant AI analysis
- Finance - PCI DSS compliant document processing
- Legal - Attorney-client privileged document review
- Government - Classified/sensitive information processing
- Research - Proprietary data analysis
Compliance Features:
- Air-gapped deployment support
- Comprehensive audit trails
- Data residency control
- No external dependencies
OpenAI-Compatible API:
```python
# Use any OpenAI client library
import openai

client = openai.OpenAI(
    base_url="http://localhost:8080/v1",
    api_key="not-needed"  # Or your actual API key if auth enabled
)

response = client.chat.completions.create(
    model="your-model",
    messages=[{"role": "user", "content": "Hello!"}]
)
```

Native REST API:
```bash
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "messages": [...]}'
```

WebSocket Streaming:
```javascript
const ws = new WebSocket('ws://localhost:8080/ws/stream');
ws.send(JSON.stringify({model: "your-model", prompt: "Hello!"}));
```

Inferno can serve multiple models at once, with several approaches:
Single Instance, Multiple Models:
```bash
# Load multiple models
inferno models load llama-2-7b
inferno models load codellama-13b

# Use different models in requests
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{"model": "llama-2-7b", "messages": [...]}'
curl -X POST http://localhost:8080/v1/chat/completions \
  -d '{"model": "codellama-13b", "messages": [...]}'
```

Multiple Instances:
```bash
# Run different models on different ports
inferno serve --model llama-2-7b --bind 0.0.0.0:8080 &
inferno serve --model codellama-13b --bind 0.0.0.0:8081 &
```

Load Balancing:
```bash
# Distribute requests across multiple instances
inferno serve --distributed --workers 4
```

Batch processing:
```bash
# Process multiple prompts from a file
inferno run --model your-model --batch --input prompts.txt --output responses.jsonl

# Process documents
inferno run --model your-model --batch --input documents/ --output summaries.json

# Scheduled batch processing
inferno batch-queue create --schedule "0 2 * * *" --input daily_reports/ --model summarizer
```

Input Formats:
- Text files - One prompt per line
- JSON Lines - Structured prompts
- CSV - Tabular data
- Directory - Process all files in a folder
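For the JSON Lines format, each line is one standalone JSON object. Here is a small Python sketch that writes such a prompt file; the `prompt` field name is an assumption for illustration, so check `inferno run --help` for the schema Inferno actually expects:

```python
import json

prompts = [
    {"prompt": "Summarize: quarterly revenue grew 12%."},
    {"prompt": "Translate to French: good morning."},
]

# One JSON object per line - no enclosing array, no trailing commas
with open("prompts.jsonl", "w") as f:
    for p in prompts:
        f.write(json.dumps(p) + "\n")

# Round-trip to confirm each line parses independently
with open("prompts.jsonl") as f:
    rows = [json.loads(line) for line in f]
print(len(rows))  # 2
```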
Check Common Issues:
```bash
# Verify installation
inferno --version

# Check port availability
sudo lsof -i :8080

# Try a different port
inferno serve --bind 0.0.0.0:8081

# Check permissions
ls -la /usr/local/bin/inferno

# View detailed logs
RUST_LOG=debug inferno serve
```

If a model won't load:
```bash
# Verify the model exists
inferno models list

# Check model format
file your-model.gguf  # Should show GGUF format

# Check available memory
free -h   # Linux
vm_stat   # macOS

# Try a smaller model
inferno models download microsoft/DialoGPT-small
```

Check Resource Usage:
```bash
# Monitor during inference
top         # CPU usage
nvidia-smi  # GPU usage (NVIDIA)
iostat      # Disk I/O
```

Common Solutions:
- Use smaller model (7B instead of 13B+)
- Enable GPU acceleration
- Reduce context_size in config
- Use quantized models (Q4_0, Q5_0)
- Add more RAM or switch to SSD
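When the API seems unresponsive, it helps to distinguish a server that is slow to start from one that is not listening at all. This stdlib-only poller is an illustrative sketch; only the `/health` endpoint itself comes from this FAQ:

```python
import time
import urllib.error
import urllib.request

def wait_for_server(url: str, attempts: int = 5, delay: float = 1.0) -> bool:
    """Poll a health URL until it returns HTTP 200 or attempts run out."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not up yet; retry after a short pause
        time.sleep(delay)
    return False

# Example: wait_for_server("http://localhost:8080/health")
```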
```bash
# Test basic connectivity
curl http://localhost:8080/health

# Check server logs
docker logs inferno        # Docker
journalctl -f -u inferno   # systemd

# Verify model name
inferno models list

# Test with a simple request
curl -X POST http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "your-model", "messages": [{"role": "user", "content": "test"}]}'
```

Inferno License: MIT/Apache 2.0 - Free for commercial use!
Model Licenses: Check individual model licenses
- Open Models (Llama 2, Mistral, etc.) - Free commercial use
- Commercial Models - May require paid licenses
- Custom Models - Your own license terms
Inferno can be used in commercial products. The MIT/Apache 2.0 licenses allow:
- ✅ Commercial use
- ✅ Modification and distribution
- ✅ Private use
- ✅ Selling applications that include Inferno
- ✅ SaaS applications using Inferno
Just remember:
- Include license notice in your distributions
- Check individual model licenses
- No warranty provided (standard open source terms)
Learning Path:
- Start Here: Quick Start Tutorial - Get your first AI working
- Understand Basics: Usage Examples - See what's possible
- Explore Models: Model Management - Try different AI capabilities
- Optimize: Performance Tuning - Make it faster
- Production: Production Deployment - Scale up
Recommended Reading:
- Hugging Face Course - Free AI/ML course
- Fast.ai - Practical deep learning
- OpenAI Cookbook - AI application patterns
Ways to Help:
- Report Issues - Found a bug? Report it
- Improve Docs - Contributing to Wiki
- Share Examples - Post your use cases in Discussions
- Write Code - See CONTRIBUTING.md
- Help Others - Answer questions in GitHub Discussions
Community Resources:
- GitHub Discussions - Community help and questions (fastest response)
- GitHub Issues - Bug reports
- Wiki - This documentation you're reading!
- Enterprise Support - Contact the maintainer for specialized installation assistance (details and pricing available on request)
Response Times:
- GitHub Discussions - Community-driven, usually within hours
- GitHub Issues - 1-3 days for bug reports and technical issues
- Wiki - Community maintained, always available
Near Term (v1.1-1.2):
- Visual web dashboard
- Model marketplace integration
- Enhanced multi-GPU support
- Performance optimizations
Medium Term (v1.3-1.5):
- Multi-modal support (vision, audio)
- Advanced clustering and scaling
- Enterprise SSO integration
- Advanced monitoring and analytics
Long Term (v2.0+):
- Federated learning support
- Edge device deployment
- Custom model training integration
- Advanced AI workflow automation
Current Status: Production ready for most use cases
Stability:
- ✅ Core Features - Stable, well-tested
- ✅ GGUF/ONNX Support - Production ready
- ✅ API Compatibility - OpenAI-compatible, stable
- ⚠️ Advanced Features - Some features still evolving
- ⚠️ Breaking Changes - Possible in minor versions (following semver)
Upgrade Recommendations:
- Production - Pin to specific versions, test upgrades
- Development - Use latest for newest features
- Enterprise - Consider LTS releases when available
FAQ last updated for Inferno v1.0.0. Question not answered? Visit GitHub Discussions for community help!