llama.cpp Studio is a web-based control plane for running and managing local LLMs on top of llama.cpp, ik_llama.cpp, and LMDeploy – all served through a single OpenAI-compatible endpoint powered by llama-swap.
It is designed for power users running models on a single machine or small server (Docker or bare metal) with strong support for:
- CPU-only inference (OpenBLAS)
- NVIDIA CUDA GPUs (via the NVIDIA Container Toolkit)
- HuggingFace search (GGUF + safetensors): Search the Hub, inspect metadata, and plan downloads by quantization or safetensors bundle.
- Model library with multi-quantization support: Manage multiple quantizations per base model in a grouped view with start/stop/delete actions.
- Per-model runtime configuration: Configure engine (llama.cpp / ik_llama / LMDeploy), context length, GPU layers, batch sizes, and advanced flags.
- Unified multi-model serving: Serve many GGUF quantizations at once via `llama-swap` on port `2000`.
- System & progress monitoring: Live system stats, GPU information, and unified progress for downloads, builds, CUDA/LMDeploy installs via SSE.
llama.cpp Studio is a single application composed of a Vue 3 SPA frontend and a FastAPI backend. The backend persists configuration to YAML files under `/app/data` and orchestrates runtimes through llama-swap.
```mermaid
flowchart LR
    userClient[User Client] --> browserUI["Web UI (Vue 3 SPA)"]
    browserUI --> fastapiBackend["FastAPI Backend"]
    fastapiBackend --> dataStore["YAML DataStore (models, engines, settings)"]
    fastapiBackend --> progressSSE["SSE /api/events"]
    fastapiBackend --> llamaSwap["llama-swap Proxy :2000"]
    llamaSwap --> llamaCpp["llama.cpp / ik_llama runtimes"]
    llamaSwap --> lmdeploy["LMDeploy TurboMind (safetensors)"]
```
`App.vue` provides the global shell:

- Header with llama-swap status and theme toggle
- Navigation between the main sections
- Central `<router-view>` for page content
- Global ConfirmDialog/Toast and SSE connection
- Main views:
  - Model Library (`/models`) – installed models grouped by base model and quantization.
  - Model Search (`/search`) – HuggingFace search & download (GGUF and safetensors).
  - Model Config (`/models/:id/config`) – per-quantization configuration.
  - Engines & System (`/engines`) – llama.cpp / ik_llama builds, CUDA and LMDeploy status, system & GPU info.
- State management:
  - `useModelStore` – models, search, downloads, metadata, start/stop/config operations.
  - `useEnginesStore` – engine versions, CUDA installer, system and GPU info.
  - `useProgressStore` – EventSource connection to `/api/events`, normalized tasks, logs, and notifications.
`backend/main.py`:

- Ensures the `/app/data` (or local `./data`) directory structure exists and is writable.
- Initializes the YAML-backed `DataStore` for models, engine versions, and settings.
- Loads `HUGGINGFACE_API_KEY` from the environment if present.
- Starts and manages the `llama-swap` proxy on port `2000` when a valid llama.cpp/ik_llama binary is active.
- Registers all known models with `llama-swap` at startup based on logical metadata (not hard-coded paths).
- Serves the built Vue app from `frontend/dist` and exposes a catch-all SPA route.
- Key route groups:
  - `/api/models` – model library, HuggingFace search, GGUF/safetensors downloads, configuration, start/stop.
  - `/api/llama-versions` – llama.cpp/ik_llama build settings, builds, version listing, activation, deletion, CUDA installer.
  - `/api/lmdeploy` – LMDeploy install/remove.
  - `/api/status` & `/api/gpu-info` – system and GPU metrics plus `llama-swap` proxy health.
  - `/api/events` – Server-Sent Events stream for unified progress and notifications.
`llama.cpp` and `ik_llama.cpp` versions are:

- Built from source under `/app/data/llama-cpp/...`
- Recorded in the DataStore with metadata and active version selection
- Exposed to the frontend via `/api/llama-versions`
- `llama-swap`:
  - Is downloaded and installed into the runtime image at build time.
  - Runs a single proxy process on port `2000` and multiplexes multiple model backends.
  - Reads its configuration from files generated by the backend based on stored models and the active engine.
- LMDeploy:
  - Is installed into `/app/data/lmdeploy/venv` from PyPI or source on demand.
  - Serves safetensors checkpoints using TurboMind behind `llama-swap`.
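For orientation, a llama-swap config maps model names to backend server commands. The sketch below follows the general shape of llama-swap's YAML format (a `models` map with a `cmd` template and a `${PORT}` macro); the model name, binary path, and flags are placeholders, not anything this backend literally emits:

```yaml
models:
  "Meta-Llama-3-8B-Instruct-Q4_K_M":
    cmd: >
      /app/data/llama-cpp/<active-version>/llama-server
      --port ${PORT}
      -m /app/data/models/<repo>/Meta-Llama-3-8B-Instruct.Q4_K_M.gguf
```

The backend regenerates files of this shape whenever models are registered or the active engine changes.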
- **Unified model library**
  - Models are grouped by HuggingFace repo (e.g. `meta-llama/Meta-Llama-3-8B-Instruct`).
  - Each group contains one or more quantizations (GGUF) and optional safetensors bundles.
  - Per-quantization rows show size, download timestamp, runtime type, and running state.
- **HuggingFace search (GGUF + safetensors)**
  - Search by model name or keyword with a choice of:
    - `gguf` – quantized GGUF files and bundles.
    - `safetensors` – safetensors checkpoints.
  - See metadata (file sizes, quantization names, tags) before you download.
- **Downloads & bundles**
  - GGUF:
    - Download individual quantizations or full bundles.
    - Optionally attach `.mmproj` projector files for multimodal models.
  - Safetensors:
    - Download full safetensors bundles.
  - All downloads are tracked as long-running tasks via SSE and shown in the global progress panel.
- **llama.cpp and ik_llama.cpp**
  - Multiple versions per engine are supported.
  - Builds are always from source, configured using stored build settings (CUDA flags, flash attention, CPU variants, etc.).
  - Versions can be activated/deactivated; activation updates the `llama-swap` configuration automatically.
  - Old versions can be removed to reclaim disk space.
- **CUDA toolkit management (NVIDIA only)**
  - An optional in-container CUDA installer can install or remove the CUDA Toolkit (plus optional cuDNN/TensorRT) under `/app/data/cuda`.
  - Progress and logs for installs/uninstalls are surfaced in the Engines/System view and via SSE events.
  - Only NVIDIA CUDA and CPU are documented and supported; other GPU backends are not part of this project's supported surface.
- **LMDeploy integration**
  - Install LMDeploy from PyPI or from source into a dedicated virtual environment under `/app/data/lmdeploy/venv`.
  - The backend exposes endpoints to:
    - Check for the latest LMDeploy version.
    - Install, update, or remove LMDeploy.
  - Once installed, safetensors models can be launched via LMDeploy TurboMind and are exposed through the same `llama-swap` endpoint.
- **Single OpenAI-compatible endpoint**
  - All models are served via `llama-swap` on `http://<host>:2000`.
  - The proxy implements the standard OpenAI-style `/v1/chat/completions` and `/v1/models` routes.
- **Concurrent GGUF quantizations**
  - Multiple GGUF quantizations can be active at once behind `llama-swap`.
  - The System Status view shows running models and basic health information.
- **Safetensors via LMDeploy**
  - One LMDeploy runtime is supported at a time for safetensors models.
  - It is exposed alongside GGUF models through the same `llama-swap` API.
- **System & GPU status**
  - `/api/status` reports CPU, memory, and disk utilization, running model instances, and `llama-swap` proxy health.
  - `/api/gpu-info` reports detected GPUs and their capabilities (focused on NVIDIA/CUDA).
- **Unified progress tracking**
  - `/api/events` streams:
    - Download progress and completion events.
    - llama.cpp/ik_llama source build progress.
    - CUDA toolkit installation/uninstallation status and logs.
    - LMDeploy installation status and logs.
    - Notifications related to long-running tasks.
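A `text/event-stream` like `/api/events` can be consumed without any SSE library; the sketch below parses the basic `event:`/`data:` framing. The `progress` event name and JSON payload shown are illustrative placeholders, not the backend's actual schema:

```python
import json

def parse_sse(lines):
    """Parse an iterable of raw SSE lines into (event, data) tuples.

    Implements the basic text/event-stream framing: `event:` and `data:`
    fields accumulate until a blank line dispatches the event.
    """
    event, data = "message", []
    for line in lines:
        if line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
        elif line == "":
            if data:
                yield event, "\n".join(data)
            event, data = "message", []

# Hypothetical event as it might arrive from /api/events:
sample = [
    "event: progress",
    'data: {"task": "download", "percent": 42}',
    "",
]
for name, payload in parse_sse(sample):
    print(name, json.loads(payload)["percent"])  # progress 42
```

In a real client you would feed the parser decoded lines from a streaming HTTP response (for example, `iter_lines()` from the requests library) instead of a fixed list.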
The recommended way to run llama.cpp Studio is via Docker Compose. All examples assume you’ve cloned the repository.
```bash
git clone <repository-url>
cd llama-cpp-studio
```

Use the CPU compose file (`docker-compose.cpu.yml`) during development. It mounts the backend source and enables reload:
```bash
docker-compose -f docker-compose.cpu.yml up --build
```

This will:

- Expose the web UI and API at `http://localhost:8080`.
- Expose the `llama-swap` proxy at `http://localhost:2000`.
- Mount `./data` to `/app/data` so models, configs, and logs persist between runs.
For NVIDIA GPUs with the NVIDIA Container Toolkit installed on the host:
```bash
docker-compose -f docker-compose.cuda.yml up --build -d
```

This will:

- Build the image from the current source tree.
- Map:
  - `8080:8080` – web UI + FastAPI backend
  - `2000:2000` – `llama-swap` OpenAI-compatible endpoint
- Mount `./data` to `/app/data`.
- Reserve all GPUs for the container using the Compose `deploy.resources.reservations.devices` section.
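That GPU reservation uses Compose's standard device-reservation syntax; the service name below is illustrative, and the stanza would look roughly like:

```yaml
services:
  llama-cpp-studio:
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
```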
You can also build and run the container without Compose:
```bash
# Build the image
docker build -t llama-cpp-studio .

# GPU-capable run (NVIDIA)
docker run -d \
  --name llama-cpp-studio \
  --gpus all \
  -p 8080:8080 \
  -p 2000:2000 \
  -v ./data:/app/data \
  llama-cpp-studio

# CPU-only run
docker run -d \
  --name llama-cpp-studio-cpu \
  -p 8080:8080 \
  -p 2000:2000 \
  -e CUDA_VISIBLE_DEVICES="" \
  -v ./data:/app/data \
  llama-cpp-studio
```

If you prefer pulling from a registry, use the GitHub Container Registry image published by this project (replace `<org-or-user>` with the correct namespace):

```bash
docker pull ghcr.io/<org-or-user>/llama-cpp-studio:latest
```

Run it with the same ports and volume mapping as above.
Common environment variables for the backend:

- `HUGGINGFACE_API_KEY` – HuggingFace token used for model search and download.
  - When set via environment variable, the UI treats it as read-only and shows only a masked preview.
- `CUDA_VISIBLE_DEVICES` – controls which GPUs are visible to the container:
  - Default in Compose is `all`.
  - Set to `""` (empty string) for CPU-only runs.
- `HF_HOME` and `HUGGINGFACE_HUB_CACHE` – location for the HuggingFace cache:
  - Default to `/app/data/hf-cache` and `/app/data/hf-cache/hub` so cache data is persisted in the volume.
- `BACKEND_CORS_ORIGINS`, `BACKEND_CORS_ALLOW_CREDENTIALS` – advanced CORS options for custom setups.
- `RELOAD` – when running the backend directly, controls uvicorn reload behavior (`true` in local dev, `false` in Docker).

These can be set directly in `docker-compose.yml` or via an `.env` file referenced by Compose.
The image expects a writable data directory at `/app/data`, typically mapped from `./data` on the host:

- Models – GGUF files and safetensors bundles.
- Config – YAML files for models, engines, and other settings.
- Logs – backend logs, build logs, installer logs.
- llama.cpp builds – source trees and build outputs.
- CUDA toolkit – if installed, under `/app/data/cuda`.
- LMDeploy virtualenv – under `/app/data/lmdeploy/venv`.
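Putting the paths mentioned in this README together, a populated `./data` directory looks roughly like this (only the explicitly documented paths are certain; other subdirectory names are indicative):

```
data/
├── llama-cpp/       # engine source trees and build outputs
├── cuda/            # optional in-container CUDA toolkit
├── lmdeploy/
│   └── venv/        # LMDeploy virtual environment
├── hf-cache/        # HuggingFace cache (HF_HOME)
└── logs/            # backend, build, and installer logs
```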
Recommended Compose mapping:

```yaml
volumes:
  - ./data:/app/data
```

You can provide your HuggingFace token in multiple ways:

- Directly in Compose (keep this private):

  ```yaml
  environment:
    - HUGGINGFACE_API_KEY=your_huggingface_token_here
  ```

- Via an `.env` file (not committed to git):

  ```
  HUGGINGFACE_API_KEY=your_huggingface_token_here
  ```

  Then in Compose:

  ```yaml
  env_file:
    - .env
  ```

Once configured, the Model Search UI will use this token transparently.
- Open the Model Search view.
- Enter a HuggingFace repo name or search term, then choose:
  - `gguf` – to browse GGUF quantizations and bundles.
  - `safetensors` – to browse safetensors bundles.
- Expand a result to:
  - Inspect file sizes and quantization names.
  - See optional projector (`.mmproj`) files for multimodal models.
- Click Download to start a download; progress will appear in the global progress panel.
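The search view queries the HuggingFace Hub server-side. If you want to reproduce a similar query yourself, the Hub's public `/api/models` endpoint accepts `search`, `filter`, and `limit` parameters; the helper below is a standalone sketch, not code from this project:

```python
from urllib.parse import urlencode

def hub_search_url(query, file_format=None, limit=20):
    """Build a search URL against the public HuggingFace Hub API.

    `file_format` values like "gguf" map onto Hub tags, which is how
    GGUF repos are commonly filtered.
    """
    params = {"search": query, "limit": limit}
    if file_format:
        params["filter"] = file_format
    return "https://huggingface.co/api/models?" + urlencode(params)

print(hub_search_url("llama-3", file_format="gguf"))
```

Fetching such a URL (e.g. with `urllib.request.urlopen`) returns a JSON list of matching repos with their tags and metadata.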
- Open the Model Library view (`/models`).
- Each card groups all quantizations for a base model:
  - GGUF quantizations (different sizes and quant schemes).
  - Safetensors bundles (if present).
- Per-row actions:
  - Start / Stop – launch or stop a model via `llama-swap`.
  - Configure – open the per-quantization configuration screen.
  - Delete – remove a specific quantization.
- Group-level actions let you delete entire model groups to reclaim disk space.
- From the library, click Configure on a quantization.
- Choose an engine:
  - `llama.cpp` or `ik_llama.cpp` for GGUF.
  - `LMDeploy` for safetensors.
- Adjust:
  - Context length.
  - GPU layers (`-ngl`-style behavior).
  - Batch sizes and other llama.cpp/LMDeploy flags.
- Advanced options are rendered from a parameter registry maintained on the backend, allowing you to set engine-specific flags explicitly.
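Conceptually, these per-quantization settings map onto llama.cpp server flags (`-c` for context length, `-ngl` for GPU layers, `-b` for batch size). The config keys and mapping function below are a hypothetical illustration, not the backend's actual implementation:

```python
def to_cli_args(cfg):
    """Translate a per-model config dict into llama.cpp-style CLI flags.

    The dict keys are made-up names for this sketch; the flags themselves
    (-c, -ngl, -b) are real llama.cpp server options.
    """
    mapping = {
        "context_length": "-c",   # tokens of context
        "gpu_layers": "-ngl",     # layers offloaded to the GPU
        "batch_size": "-b",       # logical batch size
    }
    args = []
    for key, flag in mapping.items():
        if key in cfg:
            args += [flag, str(cfg[key])]
    return args

print(to_cli_args({"context_length": 8192, "gpu_layers": 35}))
# → ['-c', '8192', '-ngl', '35']
```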
- Open the Engines & System view (`/engines`) to:
  - View and manage llama.cpp and ik_llama.cpp versions:
    - Build from source using saved build settings (e.g. CUDA on/off).
    - Activate a version (updates the `llama-swap` configuration).
    - Delete non-active versions to free disk.
  - Manage the CUDA toolkit in the container:
    - Install or uninstall specific CUDA versions.
    - See status and detailed logs.
  - Manage LMDeploy:
    - Install from PyPI or a git branch.
    - Remove LMDeploy and its virtualenv.
    - Tail installer logs for debugging.
All of these actions surface their progress and logs in the unified progress UI.
- The header shows a concise llama-swap status indicator (health and port).
- The System section displays:
  - CPU, memory, and disk usage.
  - Detected NVIDIA GPUs and key characteristics.
  - Currently running models as reported by `llama-swap`.
Once at least one model is running, you can call the llama-swap proxy directly.
- Base URL: `http://<host>:2000`
- Chat completions:

  ```bash
  curl http://localhost:2000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
      "model": "your-model-name",
      "messages": [{"role": "user", "content": "Hello!"}]
    }'
  ```

- Model listing: `GET http://localhost:2000/v1/models`
- Health: `GET http://localhost:2000/health`
Model IDs are shown in the System Status view and in the Model Library when a model is running.
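The same chat call can be made from Python with only the standard library. The base URL matches the proxy above; the model name is a placeholder you would replace with an ID from the Model Library:

```python
import json
from urllib import request

def chat_request(base_url, model, prompt):
    """Build an OpenAI-style chat completion request for the proxy."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("http://localhost:2000", "your-model-name", "Hello!")
# Requires a running model behind llama-swap:
# with request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```

Because the endpoint is OpenAI-compatible, official OpenAI client libraries pointed at `http://localhost:2000/v1` should also work.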
- **GPU not detected**
  - Ensure the NVIDIA Container Toolkit is installed and `nvidia-smi` works on the host.
  - Use `--gpus all` (docker run) or the `deploy.resources.reservations.devices` stanza in Compose.
  - Confirm `CUDA_VISIBLE_DEVICES` is not set to `""` when you intend to use the GPU.
- **Build failures (llama.cpp / ik_llama / CUDA)**
  - Check that you have enough disk space (≥ 10 GB free is a good baseline).
  - Verify CUDA and driver versions are compatible with the chosen build settings.
  - Review build or installer logs (via the progress UI or log files in `/app/data/logs`).
- **Memory errors / out-of-memory**
  - Reduce context length and/or batch size for the model configuration.
  - For GPU runs, lower GPU layers or choose a smaller quantization.
- **Model download failures**
  - Verify HuggingFace connectivity and model visibility (public/private).
  - Ensure `HUGGINGFACE_API_KEY` is correctly configured for private models.
  - Check free space under `/app/data`.
- **llama-swap**
  - Hit `http://localhost:2000/health` and `http://localhost:2000/v1/models` to check proxy state.
- Container logs:

  ```bash
  docker logs llama-cpp-studio
  ```

- Backend and task logs: stored under `/app/data/logs` and surfaced via `/api/events`.
- CUDA installer logs: available via CUDA log endpoints and the Engines/System view.
- The backend code lives under `backend/`.
- To run the backend directly in development:

  ```bash
  cd backend
  pip install -r ../requirements.txt
  uvicorn main:app --reload --port 8080
  ```

- The frontend SPA lives under `frontend/`:

  ```bash
  cd frontend
  npm install
  npm run dev
  ```

  The dev server (typically on port 5173) is configured to proxy API calls to the backend.
```bash
pip install -r requirements.txt pytest pytest-asyncio
PYTHONPATH=. pytest backend/tests -v
```

The test suite includes:

- Smoke tests to ensure the app boots and key routes (`/api/status`, `/api/models`, `/api/llama-versions`, `/api/events`) respond.
- Tests for LMDeploy management and configuration.
- Tests for CUDA installer flows and model introspection logic.
This project is licensed under the MIT License – see the LICENSE file for details.
- Fork the repository.
- Create a feature branch.
- Make your changes and add tests where appropriate.
- Open a pull request describing your changes and how you tested them.
- Open an issue on GitHub for bugs or feature requests.
- Review this README and the troubleshooting section before filing.
- llama.cpp – core inference engine.
- llama-swap – multi-model serving proxy.
- HuggingFace – model hosting and search.
- Vue.js – frontend framework.
- FastAPI – backend framework.