Skip to content

NW15D/llama-api-layer

Repository files navigation

🚀 Unified Llama OpenAI-Compatible API Gateway

A comprehensive set of service configurations to deploy a high-performance, private AI infrastructure. This project leverages llama.cpp, whisper.cpp, and various specialized models to provide a unified OpenAI-compatible API interface through a KrakenD API Gateway.

🌟 Features

  • Full OpenAI Compatibility: Seamlessly use your favorite AI tools and clients.
  • Unified Gateway: Single entry point for completions, tools, embeddings, and audio services via KrakenD.
  • GPU Optimized: Configurations tuned for CUDA-accelerated inference.
  • Robust Deployment: Ready-to-use systemd service files for automated startup and recovery.
  • Diverse Model Support:
    • 🧠 LLM: Qwen, DeepSeek, Llama.
    • 🔍 Embeddings: BGE-M3, Nomic-Embed.
    • 🔄 Reranking: BGE-Reranker.
    • 🎙️ Audio: Whisper (Turbo, Large-v3) & XTTS/Silero TTS.

🏗️ Architecture

The infrastructure is split into individual microservices unified by a KrakenD gateway:

Service Endpoint (Internal) Purpose Backend Engine
aismart.service :6150 High-quality smart completion llama-server (Qwen3.5-35B)
aifast.service :6155 Performance-optimized completion llama-server (Fast models)
aicoder.service :5000 Specialized coding completions llama-server (DeepSeek-Coder)
aiembed.service :5500 Text embedding generation llama-server (BGE-M3)
airerank.service :5550 Search result reranking llama-server (BGE-Reranker)
whisper.service :5005 STT (Speech-to-Text) whisper-server (Whisper Large-v3)
xtts.service :5050 / :10200 TTS (Text-to-Speech) Silero / XTTS API Server

🛠️ Combined Endpoints (API Gateway)

The KrakenD Gateway (running on port 9000) exposes the following unified endpoints:

💬 Chat & Completions

  • POST /v1/chat/completions - Unified chat interface.
  • POST /v1/completions - Legacy completion support.
  • POST /v1/tools - Tool-use and function calling support.

🔍 Search & Retrieval

  • POST /v1/embeddings - Generate vector representations of text.
  • POST /v1/rerank - Rank documents based on query relevance.

🔊 Audio Services

  • POST /v1/audio/transcriptions - Convert audio to text.
  • POST /v1/audio/translations - Translate audio in real-time.
  • POST /v1/audio/speech - Convert text to natural-sounding audio.

🚀 Installation & Setup

1. Requirements

  • Linux OS (Ubuntu/Debian recommended)
  • NVIDIA GPU with CUDA drivers
  • llama.cpp and whisper.cpp compiled and available in /ai/
  • KrakenD installed

2. Service Deployment

Copy the .service files to your systemd directory:

cp *.service /etc/systemd/system/
systemctl daemon-reload

Enable and start the services you need:

systemctl enable --now aismart aiembed whisper xtts

3. API Gateway Configuration

Deploy the KrakenD configuration:

krakend run -c krakend.json

The gateway includes API Key authentication by default (configurable in krakend.json).


🔐 Authentication

Security is handled at the gateway level. You can manage access via the auth/api-keys section in krakend.json.

Default Key: a132b20c-96be-467f-a15a-ed08aed67888


📜 License

This project configuration is provided for convenience. Ensure you comply with the licenses of the individual models (Qwen, DeepSeek, Whisper, BGE) and engines (llama.cpp, whisper.cpp) used.

About

Proxy layer for llama.cpp server API to OpenAi API compatible

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors