The llama.cpp backend (backend/cpp/llama-cpp/grpc-server.cpp) is a gRPC adaptation of the upstream HTTP server (llama.cpp/tools/server/server.cpp). It uses the same underlying server infrastructure from llama.cpp/tools/server/server-context.cpp.
- Test llama.cpp backend compilation: `make backends/llama-cpp` (the backend is built as part of the main build process)
- Check `backend/cpp/llama-cpp/Makefile` for build configuration
- `grpc-server.cpp`: gRPC server implementation; adapts HTTP server patterns to gRPC
- Uses shared server infrastructure: `server-context.cpp`, `server-task.cpp`, `server-queue.cpp`, `server-common.cpp`
- The gRPC server mirrors the HTTP server's functionality but uses gRPC instead of HTTP
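The shared infrastructure is essentially a producer/consumer task queue: handler threads post tasks, and a worker thread that owns the model consumes them. A minimal stdlib-only sketch of that pattern (names are illustrative, not the actual `server-queue.cpp` API):

```cpp
#include <condition_variable>
#include <deque>
#include <mutex>
#include <string>

// Simplified stand-in for a server task; the real server-task.cpp
// carries far more state (sampling params, slot ids, etc.).
struct server_task_sketch {
    int id;
    std::string prompt;
};

class task_queue_sketch {
public:
    // Called by a gRPC (or HTTP) handler thread.
    void post(server_task_sketch task) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            tasks_.push_back(std::move(task));
        }
        cv_.notify_one();
    }

    // Called by the worker thread that owns the model/context.
    server_task_sketch wait_and_pop() {
        std::unique_lock<std::mutex> lock(mutex_);
        cv_.wait(lock, [this] { return !tasks_.empty(); });
        server_task_sketch task = std::move(tasks_.front());
        tasks_.pop_front();
        return task;
    }

private:
    std::mutex mutex_;
    std::condition_variable cv_;
    std::deque<server_task_sketch> tasks_;
};
```

Because both front ends only ever touch the queue, the gRPC adaptation can swap the transport without touching the inference path.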
When fixing compilation errors after upstream changes:
- Check how `server.cpp` (the HTTP server) handles the same change
- Look for new public APIs or getter methods
- Store copies of needed data instead of accessing private members
- Update function calls to match new signatures
- Test with `make backends/llama-cpp`
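The "store copies" rule above can be illustrated with a plain C++ sketch (both classes here are hypothetical, not real llama.cpp types): when an upstream refactor makes a member private with no getter, keep your own copy of the data at the point where you still have it.

```cpp
#include <memory>
#include <string>

// Hypothetical upstream class: after a refactor, the model path became
// private and no getter is exposed.
class upstream_context_sketch {
public:
    explicit upstream_context_sketch(std::string path)
        : model_path_(std::move(path)) {}
private:
    std::string model_path_;
};

// Backend-side fix: copy the data while we still have it, instead of
// reaching into the now-private member (which no longer compiles).
class backend_state_sketch {
public:
    void load(const std::string& path) {
        model_path_copy_ = path;  // our own copy, under our control
        ctx_ = std::make_unique<upstream_context_sketch>(path);
    }
    const std::string& model_path() const { return model_path_copy_; }
private:
    std::string model_path_copy_;
    std::unique_ptr<upstream_context_sketch> ctx_;
};
```

This keeps the backend decoupled from upstream's encapsulation choices, so the next refactor is less likely to break compilation again.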
- gRPC uses the `BackendServiceImpl` class with gRPC service methods
- The HTTP server uses `server_routes` with HTTP handlers
- Both use the same `server_context` and task queue infrastructure
- gRPC methods: `LoadModel`, `Predict`, `PredictStream`, `Embedding`, `Rerank`, `TokenizeString`, `GetMetrics`, `Health`
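The overall shape of the service class can be sketched as below. The gRPC types are mocked so the sketch stays self-contained, and the request/reply structs and method body are illustrative only (the real ones are protobuf messages and post work onto the shared task queue):

```cpp
#include <string>

// Minimal stand-ins for gRPC types so the sketch compiles without grpc++.
namespace grpc_mock {
struct Status {
    bool ok;
    static Status OK() { return Status{true}; }
};
}  // namespace grpc_mock

// Illustrative request/reply shapes; the real ones are generated protobufs.
struct PredictRequest { std::string prompt; };
struct PredictReply   { std::string text; };

// Sketch of the service layout: each RPC method (Predict, Embedding,
// Rerank, ...) translates its request into a server task, exactly as the
// HTTP handlers do, then waits for the result.
class BackendServiceImplSketch {
public:
    grpc_mock::Status Predict(const PredictRequest& req, PredictReply* reply) {
        // Real code: build a server_task, post it to the task queue,
        // collect the completion result. Here we just echo the call shape.
        reply->text = "echo: " + req.prompt;
        return grpc_mock::Status::OK();
    }
};
```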
When working on JSON/XML tool call parsing functionality, always check llama.cpp for the reference implementation and any updates:
- Review XML Format Definitions: Check `llama.cpp/common/chat-parser-xml-toolcall.h` for `xml_tool_call_format` struct changes
- Review Parsing Logic: Check `llama.cpp/common/chat-parser-xml-toolcall.cpp` for parsing algorithm updates
- Review Format Presets: Check `llama.cpp/common/chat-parser.cpp` for new XML format presets (search for `xml_tool_call_format`)
- Review Model Lists: Check `llama.cpp/common/chat.h` for `COMMON_CHAT_FORMAT_*` enum values that use XML parsing:
  - `COMMON_CHAT_FORMAT_GLM_4_5`
  - `COMMON_CHAT_FORMAT_MINIMAX_M2`
  - `COMMON_CHAT_FORMAT_KIMI_K2`
  - `COMMON_CHAT_FORMAT_QWEN3_CODER_XML`
  - `COMMON_CHAT_FORMAT_APRIEL_1_5`
  - `COMMON_CHAT_FORMAT_XIAOMI_MIMO`
  - Any new formats added
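As a rough illustration of what these parsers do, the sketch below extracts the payload of a single `<tool_call>` wrapper element. This is not llama.cpp's algorithm (the real parser handles streaming partial output, escaping, and per-model format variants driven by `xml_tool_call_format`); it only shows the basic extraction idea.

```cpp
#include <string>

// Minimal sketch: pull the payload out of one <tool_call>...</tool_call>
// element in a model's raw output. Returns "" when no complete call exists.
inline std::string extract_tool_call_sketch(const std::string& output) {
    const std::string open  = "<tool_call>";
    const std::string close = "</tool_call>";
    const size_t start = output.find(open);
    if (start == std::string::npos) return "";
    const size_t body = start + open.size();
    const size_t end = output.find(close, body);
    if (end == std::string::npos) return "";  // call not yet complete
    return output.substr(body, end - body);
}
```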
Always check llama.cpp for new model configuration options that should be supported in LocalAI:
- Check Server Context: Review `llama.cpp/tools/server/server-context.cpp` for new parameters
- Check Chat Params: Review `llama.cpp/common/chat.h` for `common_chat_params` struct changes
- Check Server Options: Review `llama.cpp/tools/server/server.cpp` for command-line argument changes
- Examples of options to check:
  - `ctx_shift` - Context shifting support
  - `parallel_tool_calls` - Parallel tool calling
  - `reasoning_format` - Reasoning format options
  - Any new flags or parameters
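When wiring a new option through, the usual pattern is a field with a safe default plus an override applied only when the caller set it. A hedged sketch (the field names follow the option names listed above, but the struct and helper are illustrative, not the real `common_chat_params` layout):

```cpp
#include <string>

// Illustrative options struct mirroring the kinds of parameters listed
// above; not the actual llama.cpp/LocalAI configuration types.
struct model_options_sketch {
    bool ctx_shift = true;              // context shifting support
    bool parallel_tool_calls = false;   // parallel tool calling
    std::string reasoning_format = "auto";
};

// Apply an override only when the request actually set one, so upstream
// defaults keep working for existing configs (backward compatibility).
inline void apply_reasoning_override(model_options_sketch& opts,
                                     const std::string& requested) {
    if (!requested.empty()) {
        opts.reasoning_format = requested;
    }
}
```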
- Feature Parity: Always aim for feature parity with llama.cpp's implementation
- Test Coverage: Add tests for new features matching llama.cpp's behavior
- Documentation: Update relevant documentation when adding new formats or options
- Backward Compatibility: Ensure changes don't break existing functionality
- `llama.cpp/common/chat-parser-xml-toolcall.h` - Format definitions
- `llama.cpp/common/chat-parser-xml-toolcall.cpp` - Parsing logic
- `llama.cpp/common/chat-parser.cpp` - Format presets and model-specific handlers
- `llama.cpp/common/chat.h` - Format enums and parameter structures
- `llama.cpp/tools/server/server-context.cpp` - Server configuration options