Llama.cpp Server Test

This repository contains scripts for starting a llama.cpp server and a Python client for chatting with it.

Components

  • start-local-server.sh - Bash script to start the llama.cpp server locally
  • start-docker-server.sh - Bash script to start the server using Docker
  • chat_client.py - Python client for chatting with the llama.cpp server
  • examples.py - Example usage of the Python chat client

Setup

1. Install Python Dependencies

Using pip:

pip install -r requirements.txt

Or install the project in editable mode:

pip install -e .

2. Start the Llama.cpp Server

Use the provided script to start your server:

./start-local-server.sh

The script will:

  • Allow you to configure server settings via environment variables or command-line arguments
  • Show available models in your models directory
  • Let you select which model to use
  • Let you choose between foreground and background mode

Using the Python Chat Client

Interactive Chat

Start an interactive chat session:

python chat_client.py

With custom server settings:

python chat_client.py --host localhost --port 8000

Single Message

Send a single message:

python chat_client.py --message "Hello, how are you?"

Streaming Responses

Enable streaming for real-time responses:

python chat_client.py --stream
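
These flags can be combined; for example, to stream a single message to a server on another machine (assuming the flags above can be combined; the host and message below are illustrative):

python chat_client.py --host 192.168.1.50 --port 8000 --stream --message "Explain what a GGUF file is"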

Available Commands in Interactive Mode

  • help - Show available commands
  • system <prompt> - Set a system prompt
  • clear - Clear the system prompt
  • stream - Toggle streaming mode on/off
  • quit or exit - End the conversation

Python API Examples

Basic Usage

from chat_client import LlamaCppClient

# Create client
client = LlamaCppClient(host="localhost", port=8000)

# Check if server is running
if client.check_health():
    # Send a message
    response = client.chat_completion("Hello!")
    print(response)

With System Prompt

# Lower temperature (0.3) makes the code output more deterministic
response = client.chat_completion(
    message="Write a Python function to calculate fibonacci numbers",
    system_prompt="You are a helpful coding assistant",
    temperature=0.3,
    max_tokens=500
)

Streaming Responses

# Reuses the client created in Basic Usage above
for chunk in client.stream_chat_completion("Tell me a story"):
    print(chunk, end="", flush=True)

Configuration

Environment Variables

You can configure the server using environment variables:

  • LLAMA_HOST - Server host (default: 0.0.0.0)
  • LLAMA_PORT - Server port (default: 8000)
  • LLAMA_MODELS_PATH - Path to model files (default: /models)
  • LLAMA_CONTEXT_SIZE - Context size (default: 512)
  • LLAMA_GPU_LAYERS - GPU layers (default: 99)
  • LLAMA_LOG_FILE - Log file name (default: llama-server.log)

Command-line Arguments

The server script supports various command-line arguments that override environment variables:

  • --host - Server host
  • -p, --port - Server port
  • -m, --models-path - Path to model files
  • -c, --context-size - Context size
  • -g, --gpu-layers - GPU layers
  • -l, --log-file - Log file name
  • -h, --help - Show help message
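
For example (values are illustrative; run ./start-local-server.sh --help for the full list):

./start-local-server.sh --port 8080 --context-size 4096 --gpu-layers 0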

Examples

Run the example script to see different usage patterns:

python examples.py

This will demonstrate:

  • Simple chat interactions
  • Using system prompts
  • Streaming responses
  • Multi-turn conversations

API Endpoints

The llama.cpp server provides OpenAI-compatible endpoints:

  • GET /health - Health check
  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completions (streaming and non-streaming)
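
Because these endpoints follow the OpenAI schema, they can also be called directly with the requests library. A minimal sketch, assuming the default host and port used elsewhere in this README and an OpenAI-style response body:

import requests

BASE_URL = "http://localhost:8000"

# Health check
print("healthy:", requests.get(f"{BASE_URL}/health", timeout=5).ok)

# Non-streaming chat completion with a short message history (multi-turn)
payload = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    "temperature": 0.7,
    "max_tokens": 128,
}
resp = requests.post(f"{BASE_URL}/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])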

Troubleshooting

  1. Server not responding: Make sure the llama.cpp server is running and accessible
  2. Import errors: Install dependencies with pip install -r requirements.txt
  3. Connection refused: Check if the host and port are correct
  4. No models found: Ensure your models directory contains .gguf files
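
A quick way to confirm connectivity from Python, reusing the client's health check shown earlier:

from chat_client import LlamaCppClient

# Prints True if the server's /health endpoint responds
print(LlamaCppClient(host="localhost", port=8000).check_health())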

Requirements

  • Python 3.10+
  • requests library
  • llama.cpp server running locally or remotely
