This repository contains scripts for running and interacting with a llama.cpp server.
- `start-local-server.sh` - Bash script to start the llama.cpp server locally
- `start-docker-server.sh` - Bash script to start the server using Docker
- `chat_client.py` - Python client for chatting with the llama.cpp server
- `examples.py` - Example usage of the Python chat client
Install the dependencies using pip:
```bash
pip install -r requirements.txt
```

Or install the project in editable mode:
```bash
pip install -e .
```

Use the provided script to start your server:
```bash
./start-local-server.sh
```

The script will:
- Allow you to configure server settings via environment variables or command-line arguments
- Show available models in your models directory
- Let you select which model to use
- Let you choose between foreground and background mode
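Once the script reports that the server is up, you can verify it is responding. A minimal check, assuming the default port of 8000:

```bash
# Quick sanity check; assumes the server is listening on localhost:8000.
curl http://localhost:8000/health
```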
Start an interactive chat session:
```bash
python chat_client.py
```

With custom server settings:
```bash
python chat_client.py --host localhost --port 8000
```

Send a single message:
```bash
python chat_client.py --message "Hello, how are you?"
```

Enable streaming for real-time responses:
```bash
python chat_client.py --stream
```

The interactive session supports the following commands:

- `help` - Show available commands
- `system <prompt>` - Set a system prompt
- `clear` - Clear the system prompt
- `stream` - Toggle streaming mode on/off
- `quit` or `exit` - End the conversation
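The command-line flags shown above can also be combined. For example, a one-shot streamed message against a non-default host (illustrative values, assuming the client accepts multiple flags in one invocation):

```bash
# Illustrative: stream a single message to a remote server.
python chat_client.py --host 192.168.1.50 --port 8000 --message "Tell me a joke" --stream
```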
You can also use the client directly from Python:

```python
from chat_client import LlamaCppClient

# Create client
client = LlamaCppClient(host="localhost", port=8000)

# Check if server is running
if client.check_health():
    # Send a message
    response = client.chat_completion("Hello!")
    print(response)
```

With a system prompt and custom generation parameters:

```python
response = client.chat_completion(
    message="Write a Python function to calculate fibonacci numbers",
    system_prompt="You are a helpful coding assistant",
    temperature=0.3,
    max_tokens=500
)
```

Streaming a response chunk by chunk:

```python
for chunk in client.stream_chat_completion("Tell me a story"):
    print(chunk, end="", flush=True)
```

You can configure the server using environment variables:
- `LLAMA_HOST` - Server host (default: `0.0.0.0`)
- `LLAMA_PORT` - Server port (default: `8000`)
- `LLAMA_MODELS_PATH` - Path to model files (default: `/models`)
- `LLAMA_CONTEXT_SIZE` - Context size (default: `512`)
- `LLAMA_GPU_LAYERS` - Number of layers to offload to the GPU (default: `99`)
- `LLAMA_LOG_FILE` - Log file name (default: `llama-server.log`)
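For example, you might export a few of these before launching the server (the values below are illustrative):

```bash
# Illustrative overrides; adjust the port, model path, and context size
# to match your setup before running the start script.
export LLAMA_PORT=8080
export LLAMA_MODELS_PATH="$HOME/models"
export LLAMA_CONTEXT_SIZE=4096
./start-local-server.sh
```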
The server script supports various command-line arguments that override environment variables:
- `--host` - Server host
- `-p, --port` - Server port
- `-m, --models-path` - Path to model files
- `-c, --context-size` - Context size
- `-g, --gpu-layers` - GPU layers
- `-l, --log-file` - Log file name
- `-h, --help` - Show help message
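Equivalently, the same settings can be passed as flags, which take precedence over any environment variables (values are illustrative):

```bash
# Flags override LLAMA_* environment variables.
./start-local-server.sh --host 127.0.0.1 -p 8080 -m "$HOME/models" -c 4096 -l my-server.log
```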
Run the example script to see different usage patterns:
```bash
python examples.py
```

This will demonstrate:
- Simple chat interactions
- Using system prompts
- Streaming responses
- Multi-turn conversations
The llama.cpp server provides OpenAI-compatible endpoints:
- `GET /health` - Health check
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completions (streaming and non-streaming)
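Because the endpoints are OpenAI-compatible, you can call them with plain HTTP. A minimal sketch with curl, assuming the default port; the model name is a placeholder, so use one returned by `GET /v1/models`:

```bash
# List the models the server knows about.
curl http://localhost:8000/v1/models

# Request a non-streaming chat completion.
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "your-model.gguf",
        "messages": [{"role": "user", "content": "Hello!"}]
      }'
```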
Troubleshooting common issues:

- Server not responding: Make sure the llama.cpp server is running and accessible
- Import errors: Install dependencies with `pip install -r requirements.txt`
- Connection refused: Check if the host and port are correct
- No models found: Ensure your models directory contains .gguf files
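A couple of quick checks that often narrow things down (paths and file names assume the defaults listed above):

```bash
# Does the models directory contain any GGUF files? (default: /models)
ls /models/*.gguf

# Any errors during startup? (default log file name)
tail -n 50 llama-server.log
```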
Requirements:

- Python 3.10+
- requests library
- llama.cpp server running locally or remotely