LlamaController

Project screenshot

A WebUI-based management system for llama.cpp model lifecycle with Ollama API compatibility.

🎯 Project Overview

LlamaController provides a secure, web-based interface to manage llama.cpp instances with full model lifecycle control (load, unload, switch) while maintaining compatibility with Ollama's REST API ecosystem. This allows existing Ollama-compatible applications to seamlessly work with llama.cpp deployments.
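
Because LlamaController exposes Ollama's REST endpoints on its own port, an existing Ollama client can simply be pointed at it. A minimal sketch, assuming the ollama Python package is installed and that "phi-4-reasoning" matches a model ID configured in config/models-config.yaml:

# Point an off-the-shelf Ollama client at LlamaController instead of a stock
# Ollama server. The host and model name below are placeholders for your setup.
from ollama import Client

client = Client(host="http://localhost:3000")  # LlamaController, not Ollama's default 11434
response = client.chat(
    model="phi-4-reasoning",
    messages=[{"role": "user", "content": "Hello!"}],
)
print(response["message"]["content"])  # dict-style access to the reply text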

✨ Features

  • Centralized Model Management: Single interface to control multiple models
  • API Compatibility: Drop-in replacement for Ollama in existing workflows
  • Configuration Isolation: Separate llama.cpp binaries from model configurations
  • Secure Access: Protected by authentication with token-based API access
  • Multi-tenancy Support: Different tokens for different applications/users
  • Web Interface: User-friendly dashboard for model management
  • Multi-GPU Support: Load models on GPU 0, GPU 1, or both GPUs (in progress)
  • GPU Status Detection: Real-time monitoring of GPU usage, supports idle/model-loaded/occupied-by-others states
  • Mock GPU Testing: Supports mock mode for GPU status testing on machines without NVIDIA GPU
  • Automatic GPU Status Refresh: Refreshes on an optimized 5-minute interval, with a manual refresh triggered on model load/unload
  • Air-Gap Support: All web and API resources are served locally, so the system runs fully offline

📋 Prerequisites

  • Python 3.8+ (Conda environment recommended)
  • llama.cpp installed with llama-server executable
  • GGUF model files
  • Optional: Multiple NVIDIA GPUs for multi-GPU support

🚀 Quick Start

1. Set up Conda Environment

conda create -n llama.cpp python=3.11 -y
conda activate llama.cpp

2. Install Dependencies

pip install -r requirements.txt

3. Initialize Database

python scripts/init_db.py

4. Configure

Edit the configuration files in the config/ directory to match your system:

  • config/llamacpp-config.yaml - llama.cpp server settings
  • config/models-config.yaml - Available models configuration
  • config/auth-config.yaml - Authentication settings
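
Since all three files are plain YAML, a quick pre-flight parse can catch syntax errors before the server starts; schema validation itself is handled by the application's Pydantic configuration models. A minimal sketch, assuming PyYAML is available:

# Optional sanity check: confirm each config file is valid YAML before launch.
# This only checks syntax; field-level validation happens inside LlamaController.
from pathlib import Path
import yaml

for name in ("llamacpp-config.yaml", "models-config.yaml", "auth-config.yaml"):
    path = Path("config") / name
    data = yaml.safe_load(path.read_text())
    print(f"{name}: OK ({len(data or {})} top-level keys)")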

5. Start LlamaController

python run.py

6. Access Web UI

Open your browser and navigate to: http://localhost:3000

Default credentials:

  • Username: admin
  • Password: admin123

⚠️ Important: Change the default password after first login!

📁 Project Structure

llamacontroller/
├── src/llamacontroller/       # Main application code
│   ├── core/                  # Core business logic
│   │   ├── config.py          # Configuration management
│   │   ├── lifecycle.py       # Model lifecycle manager
│   │   └── adapter.py         # llama.cpp process adapter
│   ├── api/                   # REST API endpoints
│   │   ├── management.py      # Management API
│   │   ├── ollama.py          # Ollama-compatible API
│   │   ├── auth.py            # Authentication endpoints
│   │   └── tokens.py          # Token management
│   ├── auth/                  # Authentication
│   │   ├── service.py         # Auth service
│   │   └── dependencies.py    # FastAPI auth dependencies
│   ├── db/                    # Database models
│   │   ├── models.py          # SQLAlchemy models
│   │   └── crud.py            # Database operations
│   ├── web/                   # Web UI
│   │   ├── routes.py          # Web routes
│   │   └── templates/         # Jinja2 templates
│   ├── models/                # Pydantic models
│   │   ├── config.py          # Configuration models
│   │   ├── api.py             # API request/response models
│   │   └── ollama.py          # Ollama schema models
│   └── utils/                 # Utilities
├── config/                    # Configuration files
├── tests/                     # Test suite
├── docs/                      # Documentation
├── design/                    # Design documents
├── scripts/                   # Utility scripts
├── logs/                      # Application logs (auto-created)
└── data/                      # Runtime data (auto-created)

🔧 Development Status

Current Version: 0.8.0 (Beta)
Project Status: Core features complete, multi-GPU enhancement in progress

✅ Phase 1: Foundation (100% Complete)

  • Project structure
  • Configuration files (YAML-based)
  • Configuration manager with Pydantic validation
  • llama.cpp process adapter
  • Logging system

✅ Phase 2: Model Lifecycle (100% Complete)

  • Model lifecycle manager
  • Load/unload/switch operations
  • Process health monitoring
  • Auto-restart on crash

✅ Phase 3: REST API Layer (100% Complete)

  • FastAPI application
  • Ollama-compatible endpoints
    • /api/generate - Text generation
    • /api/chat - Chat completion
    • /api/tags - List models
    • /api/show - Show model info
    • /api/ps - Running models
  • Management API endpoints
    • /api/v1/models/load - Load model
    • /api/v1/models/unload - Unload model
    • /api/v1/models/status - Model status
  • Request/response streaming support (sketched below)
  • Automatic OpenAPI documentation at /docs
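
The streaming support noted above follows Ollama's conventions, so responses arrive as newline-delimited JSON. A hedged sketch of consuming /api/generate with the requests library, assuming each line carries a "response" fragment and a final object with "done": true, as in Ollama's API:

# Stream tokens from /api/generate; each non-empty line is one JSON object.
import json
import requests

with requests.post(
    "http://localhost:3000/api/generate",
    json={"model": "phi-4-reasoning", "prompt": "Explain quantum computing"},
    stream=True,
) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break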

✅ Phase 4: Authentication (100% Complete)

  • SQLite database with SQLAlchemy
  • User authentication (bcrypt password hashing)
  • Session-based authentication for Web UI
  • API token system with CRUD operations
  • Token validation middleware
  • Audit logging
  • Security features (rate limiting, login lockout)

✅ Phase 5: Web UI (100% Complete)

  • Modern responsive interface (Tailwind CSS + HTMX + Alpine.js)
  • Login page with authentication
  • Dashboard for model management
  • Load/unload/switch model controls
  • API token management interface
  • Server logs viewer
  • Real-time status updates via HTMX

🔄 Phase 6: Multi-GPU Enhancement & GPU Status Detection (40% Complete)

Goal: Support loading models on specific GPUs (GPU 0, GPU 1, or both), with robust GPU status detection

  • GPU configuration models (ports: 8081, 8088)
  • Adapter GPU parameter support (tensor-split)
  • Web UI GPU selection interface (toggle buttons)
  • Dashboard GPU status display (per-GPU cards)
  • Real-time GPU status detection (idle/model-loaded/occupied-by-others)
  • Mock GPU testing (configurable mock data for offline/dev environments)
  • Automatic refresh & manual refresh on model load/unload
  • Button disable logic for occupied/running GPUs
  • Lifecycle manager multi-instance support
  • API endpoints GPU parameter support
  • Request routing to correct GPU instance
  • Comprehensive multi-GPU testing

Multi-GPU & GPU Status Features:

  • Load different models on different GPUs simultaneously
  • Each GPU uses its own port (GPU 0: 8081, GPU 1: 8088)
  • Support for single GPU or both GPUs with tensor splitting
  • Web UI shows status of each GPU independently
  • GPU status detection: idle, model loaded, occupied by others
  • Mock mode: test GPU status logic without real GPU hardware
  • Dashboard buttons auto-disable for occupied/running GPUs
  • Status refresh interval optimized (5min), manual refresh on actions

📝 Phase 7: Testing & Documentation (70% Complete)

  • Unit tests for core modules
  • Integration tests for API and auth
  • Configuration validation tests
  • GPU status detection tests (scripts/test_gpu_detection.py)
  • GPU refresh improvements tests (scripts/test_gpu_refresh_improvements.py)
  • Mock GPU scenario tests (tests/mock/scenarios/)
  • User documentation
    • QUICKSTART.md
    • API_TEST_REPORT.md
    • TOKEN_AUTHENTICATION_GUIDE.md
    • PARAMETER_CONFIGURATION.md
  • Multi-GPU documentation
  • Deployment guide
  • Performance tuning guide

📖 Documentation

User Guides

User Manual (English)

Technical Documentation

Test Reports

🛠️ API Usage Examples

Using Ollama-Compatible API

# List available models
curl http://localhost:3000/api/tags

# Text generation
curl -X POST http://localhost:3000/api/generate \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-4-reasoning", "prompt": "Explain quantum computing"}'

# Chat completion
curl -X POST http://localhost:3000/api/chat \
  -H "Content-Type: application/json" \
  -d '{"model": "phi-4-reasoning", "messages": [{"role": "user", "content": "Hello!"}]}'

Using Management API

# Load a model (requires an API token)
curl -X POST http://localhost:3000/api/v1/models/load \
  -H "Authorization: Bearer YOUR_API_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"model_id": "phi-4-reasoning"}'

# Check model status
curl http://localhost:3000/api/v1/models/status \
  -H "Authorization: Bearer YOUR_API_TOKEN"

# Unload the current model
curl -X POST http://localhost:3000/api/v1/models/unload \
  -H "Authorization: Bearer YOUR_API_TOKEN"

GPU Status API

# Per-GPU status (idle / model loaded / occupied by others)
curl http://localhost:3000/gpu/status

# Number of detected GPUs
curl http://localhost:3000/gpu/count

# GPU-related configuration
curl http://localhost:3000/gpu/config
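
The response schema for these endpoints is not documented here, so a client can simply inspect whatever JSON comes back. A minimal sketch:

# Dump the per-GPU status report returned by LlamaController.
import json
import requests

status = requests.get("http://localhost:3000/gpu/status").json()
print(json.dumps(status, indent=2))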

🧪 Running Tests

# Run the full test suite
pytest

# Run a specific test file
pytest tests/test_api.py

# Run with a coverage report
pytest --cov=src/llamacontroller --cov-report=html

# Standalone test scripts
python scripts/test_api_endpoints.py
python scripts/test_auth_endpoints.py
python scripts/test_gpu_detection.py
python scripts/test_gpu_refresh_improvements.py

🔒 Security Notes

  • Default admin credentials should be changed immediately
  • API tokens should be kept secure and not committed to version control
  • Use HTTPS in production environments
  • Configure CORS appropriately for production
  • Review audit logs regularly
  • Keep llama.cpp server on localhost only (not exposed externally)

🚧 Known Limitations

  • Single model loaded at a time per GPU (multi-model support planned)
  • Multi-GPU feature requires lifecycle manager refactoring (in progress)
  • No GPU memory monitoring yet (planned)
  • Session timeout is fixed at 1 hour (configurable in future)
  • WebSocket real-time GPU status not yet implemented
  • Air-gap support currently bundles the full Tailwind CDN script locally (consider a Tailwind CLI build for production)

🗺️ Roadmap

Short Term (v0.9)

  • Complete multi-GPU lifecycle manager support
  • GPU request routing logic
  • Multi-GPU integration tests
  • Multi-GPU documentation
  • WebSocket real-time GPU status
  • GPU memory usage chart

Medium Term (v1.0)

  • GPU memory monitoring
  • Model preloading for faster switching
  • Advanced rate limiting
  • Prometheus metrics export
  • Air-gap Tailwind CSS optimization

Long Term (v2.0+)

  • Multiple models per GPU
  • Distributed deployment support
  • Model download from HuggingFace
  • Automatic GPU selection based on load
  • Model quantization support
  • AMD GPU support

🤝 Contributing

This project is currently in active development. Contributions are welcome!

Development Setup

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run tests: pytest
  5. Submit a pull request

Coding Standards

  • Follow PEP 8 style guide
  • Use type hints
  • Write docstrings for public functions
  • Add tests for new features
  • Update documentation

📝 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • llama.cpp - The underlying inference engine
  • Ollama - API specification inspiration
  • FastAPI - Web framework
  • HTMX - Dynamic HTML interactions

📞 Support

For issues and questions:

  • Check the documentation
  • Review work logs for implementation details
  • Open an issue on GitHub (when available)

Status: Beta - Core features complete, multi-GPU in progress
Version: 0.8.0
Last Updated: 2025-11-14
Python: 3.8+
License: MIT
