A comprehensive set of service configurations to deploy a high-performance, private AI infrastructure. This project leverages llama.cpp, whisper.cpp, and various specialized models to provide a unified OpenAI-compatible API interface through a KrakenD API Gateway.
- Full OpenAI Compatibility: Seamlessly use your favorite AI tools and clients.
- Unified Gateway: Single entry point for completions, tools, embeddings, and audio services via KrakenD.
- GPU Optimized: Configurations tuned for CUDA-accelerated inference.
- Robust Deployment: Ready-to-use systemd service files for automated startup and recovery.
- Diverse Model Support:
- 🧠 LLM: Qwen, DeepSeek, Llama.
- 🔍 Embeddings: BGE-M3, Nomic-Embed.
- 🔄 Reranking: BGE-Reranker.
- 🎙️ Audio: Whisper (Turbo, Large-v3) & XTTS/Silero TTS.
The infrastructure is split into individual microservices unified by a KrakenD gateway:
| Service | Endpoint (Internal) | Purpose | Backend Engine |
|---|---|---|---|
aismart.service |
:6150 |
High-quality smart completion | llama-server (Qwen3.5-35B) |
aifast.service |
:6155 |
Performance-optimized completion | llama-server (Fast models) |
aicoder.service |
:5000 |
Specialized coding completions | llama-server (DeepSeek-Coder) |
aiembed.service |
:5500 |
Text embedding generation | llama-server (BGE-M3) |
airerank.service |
:5550 |
Search result reranking | llama-server (BGE-Reranker) |
whisper.service |
:5005 |
STT (Speech-to-Text) | whisper-server (Whisper Large-v3) |
xtts.service |
:5050 / :10200 |
TTS (Text-to-Speech) | Silero / XTTS API Server |
The KrakenD Gateway (running on port 9000) exposes the following unified endpoints:
POST /v1/chat/completions- Unified chat interface.POST /v1/completions- Legacy completion support.POST /v1/tools- Tool-use and function calling support.
POST /v1/embeddings- Generate vector representations of text.POST /v1/rerank- Rank documents based on query relevance.
POST /v1/audio/transcriptions- Convert audio to text.POST /v1/audio/translations- Translate audio in real-time.POST /v1/audio/speech- Convert text to natural-sounding audio.
- Linux OS (Ubuntu/Debian recommended)
- NVIDIA GPU with CUDA drivers
llama.cppandwhisper.cppcompiled and available in/ai/KrakenDinstalled
Copy the .service files to your systemd directory:
cp *.service /etc/systemd/system/
systemctl daemon-reloadEnable and start the services you need:
systemctl enable --now aismart aiembed whisper xttsDeploy the KrakenD configuration:
krakend run -c krakend.jsonThe gateway includes API Key authentication by default (configurable in krakend.json).
Security is handled at the gateway level. You can manage access via the auth/api-keys section in krakend.json.
Default Key: a132b20c-96be-467f-a15a-ed08aed67888
This project configuration is provided for convenience. Ensure you comply with the licenses of the individual models (Qwen, DeepSeek, Whisper, BGE) and engines (llama.cpp, whisper.cpp) used.