A WebUI-based management system for llama.cpp model lifecycle with Ollama API compatibility.
LlamaController provides a secure, web-based interface to manage llama.cpp instances with full model lifecycle control (load, unload, switch) while maintaining compatibility with Ollama's REST API ecosystem. This allows existing Ollama-compatible applications to seamlessly work with llama.cpp deployments.
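To illustrate that compatibility, an existing application built on the official `ollama` Python client (one common Ollama client; using it here is an assumption about your stack) should only need its host changed to point at LlamaController:

```python
# Minimal sketch: reusing an existing Ollama client against LlamaController.
# Assumes the official `ollama` package (pip install ollama) and that a model
# named "phi-4-reasoning" is configured; adjust host and model to your setup.
from ollama import Client

client = Client(host="http://localhost:3000")  # LlamaController instead of Ollama
response = client.chat(
    model="phi-4-reasoning",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["message"]["content"])
```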
- Centralized Model Management: Single interface to control multiple models
- API Compatibility: Drop-in replacement for Ollama in existing workflows
- Configuration Isolation: Separate llama.cpp binaries from model configurations
- Secure Access: Protected by authentication with token-based API access
- Multi-tenancy Support: Different tokens for different applications/users
- Web Interface: User-friendly dashboard for model management
- Multi-GPU Support: Load models on GPU 0, GPU 1, or both GPUs (in progress)
- GPU Status Detection: Real-time monitoring of GPU usage, supports idle/model-loaded/occupied-by-others states
- Mock GPU Testing: Supports mock mode for GPU status testing on machines without NVIDIA GPU
- Automatic GPU Status Refresh: Optimized refresh interval (5min), manual refresh on model load/unload
- Air-Gap Support: All web and API resources are served locally, fully offline compatible
- Python 3.8+ (Conda environment recommended)
- llama.cpp installed with the `llama-server` executable
- GGUF model files
- Optional: Multiple NVIDIA GPUs for multi-GPU support
```bash
conda create -n llama.cpp python=3.11 -y
conda activate llama.cpp
pip install -r requirements.txt
python scripts/init_db.py
```

Edit the configuration files in the `config/` directory to match your system:
- `config/llamacpp-config.yaml` - llama.cpp server settings
- `config/models-config.yaml` - Available models configuration
- `config/auth-config.yaml` - Authentication settings
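The exact keys in these files are defined by the project's Pydantic configuration models. As a rough illustration of how a YAML entry maps onto such a model (the field names below are assumptions made for this sketch, not LlamaController's actual schema), validation looks roughly like this:

```python
# Illustrative sketch only: validating a YAML config entry with Pydantic.
# ModelEntry's fields are hypothetical; consult config/models-config.yaml and
# the project's Pydantic models for the real schema.
from pathlib import Path

import yaml  # pip install pyyaml
from pydantic import BaseModel

class ModelEntry(BaseModel):
    id: str                    # name used in API calls, e.g. "phi-4-reasoning"
    path: str                  # path to the GGUF file on disk
    context_length: int = 4096

raw = yaml.safe_load(Path("config/models-config.yaml").read_text())
models = [ModelEntry(**entry) for entry in raw.get("models", [])]
print(f"Validated {len(models)} model entries")
```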
```bash
python run.py
```

Open your browser and navigate to: http://localhost:3000
Default credentials:
- Username: `admin`
- Password: `admin123`
```
llamacontroller/
├── src/llamacontroller/ # Main application code
│ ├── core/ # Core business logic
│ │ ├── config.py # Configuration management
│ │ ├── lifecycle.py # Model lifecycle manager
│ │ └── adapter.py # llama.cpp process adapter
│ ├── api/ # REST API endpoints
│ │ ├── management.py # Management API
│ │ ├── ollama.py # Ollama-compatible API
│ │ ├── auth.py # Authentication endpoints
│ │ └── tokens.py # Token management
│ ├── auth/ # Authentication
│ │ ├── service.py # Auth service
│ │ └── dependencies.py # FastAPI auth dependencies
│ ├── db/ # Database models
│ │ ├── models.py # SQLAlchemy models
│ │ └── crud.py # Database operations
│ ├── web/ # Web UI
│ │ ├── routes.py # Web routes
│ │ └── templates/ # Jinja2 templates
│ ├── models/ # Pydantic models
│ │ ├── config.py # Configuration models
│ │ ├── api.py # API request/response models
│ │ └── ollama.py # Ollama schema models
│ └── utils/ # Utilities
├── config/ # Configuration files
├── tests/ # Test suite
├── docs/ # Documentation
├── design/ # Design documents
├── scripts/ # Utility scripts
├── logs/ # Application logs (auto-created)
└── data/ # Runtime data (auto-created)
```
Current Version: 0.8.0 (Beta)
Project Status: Core features complete, multi-GPU enhancement in progress
- Project structure
- Configuration files (YAML-based)
- Configuration manager with Pydantic validation
- llama.cpp process adapter
- Logging system
- Model lifecycle manager
- Load/unload/switch operations
- Process health monitoring
- Auto-restart on crash
- FastAPI application
- Ollama-compatible endpoints
  - `/api/generate` - Text generation
  - `/api/chat` - Chat completion
  - `/api/tags` - List models
  - `/api/show` - Show model info
  - `/api/ps` - Running models
- Management API endpoints
  - `/api/v1/models/load` - Load model
  - `/api/v1/models/unload` - Unload model
  - `/api/v1/models/status` - Model status
- Request/response streaming support
- Automatic OpenAPI documentation at `/docs`
- SQLite database with SQLAlchemy
- User authentication (bcrypt password hashing)
- Session-based authentication for Web UI
- API token system with CRUD operations
- Token validation middleware
- Audit logging
- Security features (rate limiting, login lockout)
- Modern responsive interface (Tailwind CSS + HTMX + Alpine.js)
- Login page with authentication
- Dashboard for model management
- Load/unload/switch model controls
- API token management interface
- Server logs viewer
- Real-time status updates via HTMX
Goal: Support loading models on specific GPUs (GPU 0, GPU 1, or both), with robust GPU status detection
- GPU configuration models (ports: 8081, 8088)
- Adapter GPU parameter support (tensor-split)
- Web UI GPU selection interface (toggle buttons)
- Dashboard GPU status display (per-GPU cards)
- Real-time GPU status detection (idle/model-loaded/occupied-by-others)
- Mock GPU testing (configurable mock data for offline/dev environments)
- Automatic refresh & manual refresh on model load/unload
- Button disable logic for occupied/running GPUs
- Lifecycle manager multi-instance support
- API endpoints GPU parameter support
- Request routing to correct GPU instance
- Comprehensive multi-GPU testing
Multi-GPU & GPU Status Features:
- Load different models on different GPUs simultaneously
- Each GPU uses its own port (GPU 0: 8081, GPU 1: 8088)
- Support for single GPU or both GPUs with tensor splitting
- Web UI shows status of each GPU independently
- GPU status detection: idle, model loaded, occupied by others
- Mock mode: test GPU status logic without real GPU hardware
- Dashboard buttons auto-disable for occupied/running GPUs
- Status refresh interval optimized (5min), manual refresh on actions
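To make the per-GPU port mapping concrete, the sketch below shows one way an adapter can pin a `llama-server` instance to a single GPU on its own port. This is an illustration of the approach, not LlamaController's actual adapter code; the model paths are placeholders.

```python
# Illustrative sketch (not the project's adapter implementation): start one
# llama-server per GPU, GPU 0 on port 8081 and GPU 1 on port 8088.
# When a model spans both GPUs, llama.cpp's --tensor-split flag is used
# instead of pinning to a single device.
import os
import subprocess

def launch_on_gpu(model_path: str, gpu_index: int, port: int) -> subprocess.Popen:
    # Pin the process to a single GPU via CUDA_VISIBLE_DEVICES.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
    return subprocess.Popen(
        [
            "llama-server",
            "--model", model_path,
            "--port", str(port),
            "--n-gpu-layers", "99",  # offload all layers to the GPU
        ],
        env=env,
    )

gpu0 = launch_on_gpu("/path/to/model-a.gguf", gpu_index=0, port=8081)
gpu1 = launch_on_gpu("/path/to/model-b.gguf", gpu_index=1, port=8088)
```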
- Unit tests for core modules
- Integration tests for API and auth
- Configuration validation tests
- GPU status detection tests (`scripts/test_gpu_detection.py`)
- GPU refresh improvements tests (`scripts/test_gpu_refresh_improvements.py`)
- Mock GPU scenario tests (`tests/mock/scenarios/`)
- User documentation
- QUICKSTART.md
- API_TEST_REPORT.md
- TOKEN_AUTHENTICATION_GUIDE.md
- PARAMETER_CONFIGURATION.md
- Multi-GPU documentation
- Deployment guide
- Performance tuning guide
- Quick Start Guide - Get started quickly
- Token Authentication - API token usage
- Parameter Configuration - Model parameters
- Introduction
- Installation & Setup
- Basic Usage
- API Reference
- Multi-GPU Features
- Web UI Guide
- Troubleshooting
- Testing Best Practices
- Project Overview
- Enhancement Requirements
- Development Setup
- Architecture
- Implementation Guide
- Testing Best Practices
- GPU Status Detection
- GPU Refresh Improvements
- GPU Mock Testing
- Air-Gap Fix
List available models:

```bash
curl http://localhost:3000/api/tags
```

Generate text:

```bash
curl -X POST http://localhost:3000/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-4-reasoning", "prompt": "Explain quantum computing"}'
```
Chat completion:

```bash
curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-4-reasoning", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Load a model (requires an API token):

```bash
curl -X POST http://localhost:3000/api/v1/models/load \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model_id": "phi-4-reasoning"}'
```
Check model status:

```bash
curl http://localhost:3000/api/v1/models/status \
  -H "Authorization: Bearer YOUR_API_TOKEN"
```

Unload the current model:

```bash
curl -X POST http://localhost:3000/api/v1/models/unload \
  -H "Authorization: Bearer YOUR_API_TOKEN"
```

Query GPU status:

```bash
curl -X GET http://localhost:3000/gpu/status
curl -X GET http://localhost:3000/gpu/count
curl -X GET http://localhost:3000/gpu/config
```

Run the test suite:

```bash
pytest
pytest tests/test_api.py
pytest --cov=src/llamacontroller --cov-report=html
```

Run the standalone test scripts:

```bash
python scripts/test_api_endpoints.py
python scripts/test_auth_endpoints.py
python scripts/test_gpu_detection.py
python scripts/test_gpu_refresh_improvements.py
```

Security notes:

- Default admin credentials should be changed immediately
- API tokens should be kept secure and not committed to version control
- Use HTTPS in production environments
- Configure CORS appropriately for production
- Review audit logs regularly
- Keep llama.cpp server on localhost only (not exposed externally)
- Single model loaded at a time per GPU (multi-model support planned)
- Multi-GPU feature requires lifecycle manager refactoring (in progress)
- No GPU memory monitoring yet (planned)
- Session timeout is fixed at 1 hour (configurable in future)
- WebSocket real-time GPU status not yet implemented
- Air-gap support bundles the full Tailwind CDN script locally (consider a Tailwind CLI build for production)
- Complete multi-GPU lifecycle manager support
- GPU request routing logic
- Multi-GPU integration tests
- Multi-GPU documentation
- WebSocket real-time GPU status
- GPU memory usage chart
- GPU memory monitoring
- Model preloading for faster switching
- Advanced rate limiting
- Prometheus metrics export
- Air-gap Tailwind CSS optimization
- Multiple models per GPU
- Distributed deployment support
- Model download from HuggingFace
- Automatic GPU selection based on load
- Model quantization support
- AMD GPU support
This project is currently in active development. Contributions are welcome!
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: `pytest`
- Submit a pull request
- Follow PEP 8 style guide
- Use type hints
- Write docstrings for public functions
- Add tests for new features
- Update documentation
This project is licensed under the MIT License - see the LICENSE file for details.
- llama.cpp - The underlying inference engine
- Ollama - API specification inspiration
- FastAPI - Web framework
- HTMX - Dynamic HTML interactions
For issues and questions:
- Check the documentation
- Review work logs for implementation details
- Open an issue on GitHub (when available)
Status: Beta - Core features complete, multi-GPU in progress
Version: 0.8.0
Last Updated: 2025-11-14
Python: 3.8+
License: MIT
