A WebUI-based management system for llama.cpp model lifecycle with Ollama API compatibility.
LlamaController provides a secure, web-based interface to manage llama.cpp instances with full model lifecycle control (load, unload, switch) while maintaining compatibility with Ollama's REST API ecosystem. This allows existing Ollama-compatible applications to seamlessly work with llama.cpp deployments.
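To illustrate that compatibility, an existing application built on the official `ollama` Python client (one common Ollama client; using it here is an assumption about your stack) should only need its host changed to point at LlamaController:

```python
# Minimal sketch: reusing an existing Ollama client against LlamaController.
# Assumes the official `ollama` package (pip install ollama) and that a model
# named "phi-4-reasoning" is configured; adjust host and model to your setup.
from ollama import Client

client = Client(host="http://localhost:3000")  # LlamaController instead of Ollama
response = client.chat(
    model="phi-4-reasoning",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["message"]["content"])
```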
- Centralized Model Management: Single interface to control multiple models
- API Compatibility: Drop-in replacement for Ollama in existing workflows
- Configuration Isolation: Separate llama.cpp binaries from model configurations
- Secure Access: Protected by authentication with token-based API access
- Multi-tenancy Support: Different tokens for different applications/users
- Web Interface: User-friendly dashboard for model management
- Multi-GPU Support: Load models on GPU 0, GPU 1, or both GPUs (in progress)
- GPU Status Detection: Real-time monitoring of GPU usage, supports idle/model-loaded/occupied-by-others states
- Mock GPU Testing: Supports mock mode for GPU status testing on machines without NVIDIA GPU
- Automatic GPU Status Refresh: Optimized refresh interval (5min), manual refresh on model load/unload
- Air-Gap Support: All web and API resources are served locally, fully offline compatible
- Python 3.8+ (Conda environment recommended)
- llama.cpp installed with the `llama-server` executable
- GGUF model files
- Optional: Multiple NVIDIA GPUs for multi-GPU support
```bash
conda create -n llama.cpp python=3.11 -y
conda activate llama.cpp
pip install -r requirements.txt
python scripts/init_db.py
```

Edit the configuration files in the `config/` directory to match your system:
- `config/llamacpp-config.yaml` - llama.cpp server settings
- `config/models-config.yaml` - Available models configuration
- `config/auth-config.yaml` - Authentication settings
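The exact keys in these files are defined by the project's Pydantic configuration models. As a rough illustration of how a YAML entry maps onto such a model (the field names below are assumptions made for this sketch, not LlamaController's actual schema), validation looks roughly like this:

```python
# Illustrative sketch only: validating a YAML config entry with Pydantic.
# ModelEntry's fields are hypothetical; consult config/models-config.yaml and
# the project's Pydantic models for the real schema.
from pathlib import Path

import yaml  # pip install pyyaml
from pydantic import BaseModel

class ModelEntry(BaseModel):
    id: str                    # name used in API calls, e.g. "phi-4-reasoning"
    path: str                  # path to the GGUF file on disk
    context_length: int = 4096

raw = yaml.safe_load(Path("config/models-config.yaml").read_text())
models = [ModelEntry(**entry) for entry in raw.get("models", [])]
print(f"Validated {len(models)} model entries")
```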
```bash
python run.py
```

Open your browser and navigate to: http://localhost:3000
Default credentials:
- Username: `admin`
- Password: `admin123`
```
llamacontroller/
├── src/llamacontroller/ # Main application code
│ ├── core/ # Core business logic
│ │ ├── config.py # Configuration management
│ │ ├── lifecycle.py # Model lifecycle manager
│ │ └── adapter.py # llama.cpp process adapter
│ ├── api/ # REST API endpoints
│ │ ├── management.py # Management API
│ │ ├── ollama.py # Ollama-compatible API
│ │ ├── auth.py # Authentication endpoints
│ │ └── tokens.py # Token management
│ ├── auth/ # Authentication
│ │ ├── service.py # Auth service
│ │ └── dependencies.py # FastAPI auth dependencies
│ ├── db/ # Database models
│ │ ├── models.py # SQLAlchemy models
│ │ └── crud.py # Database operations
│ ├── web/ # Web UI
│ │ ├── routes.py # Web routes
│ │ └── templates/ # Jinja2 templates
│ ├── models/ # Pydantic models
│ │ ├── config.py # Configuration models
│ │ ├── api.py # API request/response models
│ │ └── ollama.py # Ollama schema models
│ └── utils/ # Utilities
├── config/ # Configuration files
├── tests/ # Test suite
├── docs/ # Documentation
├── design/ # Design documents
├── scripts/ # Utility scripts
├── logs/ # Application logs (auto-created)
└── data/ # Runtime data (auto-created)
```
Current Version: 0.8.0 (Beta)
Project Status: Core features complete, multi-GPU enhancement in progress
- Project structure
- Configuration files (YAML-based)
- Configuration manager with Pydantic validation
- llama.cpp process adapter
- Logging system
- Model lifecycle manager
- Load/unload/switch operations
- Process health monitoring
- Auto-restart on crash
- FastAPI application
- Ollama-compatible endpoints
  - `/api/generate` - Text generation
  - `/api/chat` - Chat completion
  - `/api/tags` - List models
  - `/api/show` - Show model info
  - `/api/ps` - Running models
- Management API endpoints
  - `/api/v1/models/load` - Load model
  - `/api/v1/models/unload` - Unload model
  - `/api/v1/models/status` - Model status
- Request/response streaming support
- Automatic OpenAPI documentation at `/docs`
- SQLite database with SQLAlchemy
- User authentication (bcrypt password hashing)
- Session-based authentication for Web UI
- API token system with CRUD operations
- Token validation middleware
- Audit logging
- Security features (rate limiting, login lockout)
- Modern responsive interface (Tailwind CSS + HTMX + Alpine.js)
- Login page with authentication
- Dashboard for model management
- Load/unload/switch model controls
- API token management interface
- Server logs viewer
- Real-time status updates via HTMX
Goal: Support loading models on specific GPUs (GPU 0, GPU 1, or both), with robust GPU status detection
- GPU configuration models (ports: 8081, 8088)
- Adapter GPU parameter support (tensor-split)
- Web UI GPU selection interface (toggle buttons)
- Dashboard GPU status display (per-GPU cards)
- Real-time GPU status detection (idle/model-loaded/occupied-by-others)
- Mock GPU testing (configurable mock data for offline/dev environments)
- Automatic refresh & manual refresh on model load/unload
- Button disable logic for occupied/running GPUs
- Lifecycle manager multi-instance support
- API endpoints GPU parameter support
- Request routing to correct GPU instance
- Comprehensive multi-GPU testing
Multi-GPU & GPU Status Features:
- Load different models on different GPUs simultaneously
- Each GPU uses its own port (GPU 0: 8081, GPU 1: 8088)
- Support for single GPU or both GPUs with tensor splitting
- Web UI shows status of each GPU independently
- GPU status detection: idle, model loaded, occupied by others
- Mock mode: test GPU status logic without real GPU hardware
- Dashboard buttons auto-disable for occupied/running GPUs
- Status refresh interval optimized (5min), manual refresh on actions
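To make the per-GPU port mapping concrete, the sketch below shows one way an adapter can pin a `llama-server` instance to a single GPU on its own port. This is an illustration of the approach, not LlamaController's actual adapter code; the model paths are placeholders.

```python
# Illustrative sketch (not the project's adapter implementation): start one
# llama-server per GPU, GPU 0 on port 8081 and GPU 1 on port 8088.
# When a model spans both GPUs, llama.cpp's --tensor-split flag is used
# instead of pinning to a single device.
import os
import subprocess

def launch_on_gpu(model_path: str, gpu_index: int, port: int) -> subprocess.Popen:
    # Pin the process to a single GPU via CUDA_VISIBLE_DEVICES.
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=str(gpu_index))
    return subprocess.Popen(
        [
            "llama-server",
            "--model", model_path,
            "--port", str(port),
            "--n-gpu-layers", "99",  # offload all layers to the GPU
        ],
        env=env,
    )

gpu0 = launch_on_gpu("/path/to/model-a.gguf", gpu_index=0, port=8081)
gpu1 = launch_on_gpu("/path/to/model-b.gguf", gpu_index=1, port=8088)
```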
- Unit tests for core modules
- Integration tests for API and auth
- Configuration validation tests
- GPU status detection tests (`scripts/test_gpu_detection.py`)
- GPU refresh improvements tests (`scripts/test_gpu_refresh_improvements.py`)
- Mock GPU scenario tests (`tests/mock/scenarios/`)
- User documentation
- QUICKSTART.md
- API_TEST_REPORT.md
- TOKEN_AUTHENTICATION_GUIDE.md
- PARAMETER_CONFIGURATION.md
- Multi-GPU documentation
- Deployment guide
- Performance tuning guide
- Quick Start Guide - Get started quickly
- Token Authentication - API token usage
- Parameter Configuration - Model parameters
- Introduction
- Installation & Setup
- Basic Usage
- API Reference
- Multi-GPU Features
- Web UI Guide
- Troubleshooting
- Testing Best Practices
- Project Overview
- Enhancement Requirements
- Development Setup
- Architecture
- Implementation Guide
- Testing Best Practices
- GPU Status Detection
- GPU Refresh Improvements
- GPU Mock Testing
- Air-Gap Fix
List available models:

```bash
curl http://localhost:3000/api/tags
```

Generate text:

```bash
curl -X POST http://localhost:3000/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-4-reasoning", "prompt": "Explain quantum computing"}'
```
Chat completion:

```bash
curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-4-reasoning", "messages": [{"role": "user", "content": "Hello!"}]}'
```

Load a model (requires an API token):

```bash
curl -X POST http://localhost:3000/api/v1/models/load \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model_id": "phi-4-reasoning"}'
```
Check model status:

```bash
curl http://localhost:3000/api/v1/models/status \
  -H "Authorization: Bearer YOUR_API_TOKEN"
```

Unload the current model:

```bash
curl -X POST http://localhost:3000/api/v1/models/unload \
  -H "Authorization: Bearer YOUR_API_TOKEN"
```

Query GPU status:

```bash
curl -X GET http://localhost:3000/gpu/status
curl -X GET http://localhost:3000/gpu/count
curl -X GET http://localhost:3000/gpu/config
```

Run the test suite:

```bash
pytest
pytest tests/test_api.py
pytest --cov=src/llamacontroller --cov-report=html
```

Run the standalone test scripts:

```bash
python scripts/test_api_endpoints.py
python scripts/test_auth_endpoints.py
python scripts/test_gpu_detection.py
python scripts/test_gpu_refresh_improvements.py
```

Security notes:

- Default admin credentials should be changed immediately
- API tokens should be kept secure and not committed to version control
- Use HTTPS in production environments
- Configure CORS appropriately for production
- Review audit logs regularly
- Keep llama.cpp server on localhost only (not exposed externally)
- Single model loaded at a time per GPU (multi-model support planned)
- Multi-GPU feature requires lifecycle manager refactoring (in progress)
- No GPU memory monitoring yet (planned)
- Session timeout is fixed at 1 hour (configurable in future)
- WebSocket real-time GPU status not yet implemented
- Air-gap support bundles the full Tailwind CDN script locally (consider a Tailwind CLI build for production)
- Complete multi-GPU lifecycle manager support
- GPU request routing logic
- Multi-GPU integration tests
- Multi-GPU documentation
- WebSocket real-time GPU status
- GPU memory usage chart
- GPU memory monitoring
- Model preloading for faster switching
- Advanced rate limiting
- Prometheus metrics export
- Air-gap Tailwind CSS optimization
- Multiple models per GPU
- Distributed deployment support
- Model download from HuggingFace
- Automatic GPU selection based on load
- Model quantization support
- AMD GPU support
This project is currently in active development. Contributions are welcome!
- Fork the repository
- Create a feature branch
- Make your changes
- Run tests: `pytest`
- Submit a pull request
- Follow PEP 8 style guide
- Use type hints
- Write docstrings for public functions
- Add tests for new features
- Update documentation
This project is licensed under the MIT License - see the LICENSE file for details.
- llama.cpp - The underlying inference engine
- Ollama - API specification inspiration
- FastAPI - Web framework
- HTMX - Dynamic HTML interactions
For issues and questions:
- Check the documentation
- Review work logs for implementation details
- Open an issue on GitHub (when available)
Status: Beta - Core features complete, multi-GPU in progress
Version: 0.8.0
Last Updated: 2025-11-14
Python: 3.8+
License: MIT
