Performance Guide

This guide covers Hancock's latency targets, benchmark suite, and load testing tooling.

Latency Targets
Benchmark Suite
Load Testing with Locust
Performance Tests
CI Hardware Profile
Tuning Recommendations

Latency Targets

These are the target latencies for Hancock endpoints under normal load. They are enforced in CI via the benchmark suite.

| Endpoint | p50 | p95 | p99 |
|----------|-----|-----|-----|
| `GET /health` | < 10 ms | < 25 ms | < 50 ms |
| `GET /models` | < 20 ms | < 50 ms | < 100 ms |
| `GET /mode` | < 20 ms | < 50 ms | < 100 ms |
| `POST /v1/chat` (LLM mocked) | < 120 ms | < 260 ms | < 500 ms |

The benchmark suite keeps the p99 threshold of 500 ms as a hard CI gate.

tests/test_performance.py additionally enforces endpoint-level p50/p95 regression thresholds and validates burst and concurrency behavior around rate limiting and webhook HMAC checks.

Benchmark Suite

tests/benchmark_suite.py is a pytest-based micro-benchmark that runs locally or in CI.

Running Locally

# Run with default settings (50 iterations, 5-iteration warmup)
pytest tests/benchmark_suite.py -v

# Run a specific endpoint benchmark
pytest tests/benchmark_suite.py -v -k "health"

# Shorten failure tracebacks (--tb=short); add -s to stop pytest capturing printed output
pytest tests/benchmark_suite.py -v --tb=short

How It Works

Warm-up: 5 requests are sent to each endpoint to prime connection pools, caches, and lazily initialized code paths.
Measurement: 50 timed requests are sent sequentially.
Statistics: p50, p95, and p99 are computed from the 50 samples.
Assertion: The test fails if p99 exceeds the threshold (500 ms for non-LLM endpoints).
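The four steps above can be sketched in plain Python. This is a minimal illustration, not the code in tests/benchmark_suite.py: the function names, the nearest-rank percentile method, and the `send_request` callable (a stand-in for one HTTP call) are all assumptions.

```python
import math
import time
from typing import Callable

WARMUP_ITERATIONS = 5
MEASURED_ITERATIONS = 50
P99_THRESHOLD_MS = 500.0


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def benchmark(send_request: Callable[[], object]) -> dict[str, float]:
    # Warm-up: these calls are made but not timed.
    for _ in range(WARMUP_ITERATIONS):
        send_request()

    # Measurement: sequential timed calls, recorded in milliseconds.
    samples_ms = []
    for _ in range(MEASURED_ITERATIONS):
        start = time.perf_counter()
        send_request()
        samples_ms.append((time.perf_counter() - start) * 1000)

    # Statistics: p50, p95, and p99 over the measured samples.
    return {f"p{p}": percentile(samples_ms, p) for p in (50, 95, 99)}


if __name__ == "__main__":
    # time.sleep stands in for a real request to an endpoint.
    stats = benchmark(lambda: time.sleep(0.001))
    # Assertion: fail when p99 exceeds the gate.
    assert stats["p99"] < P99_THRESHOLD_MS, stats
    print(stats)
```

Under the nearest-rank method, p99 over 50 samples is simply the slowest sample, so a single outlier is enough to trip the gate. The actual suite may interpolate differently.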

CI Integration

The benchmark runs on every pull request via .github/workflows/benchmark.yml. It posts a summary table to the PR as a comment:

| Endpoint       | p50    | p95    | p99    | Status |
|----------------|--------|--------|--------|--------|
| GET /health    | 3 ms   | 6 ms   | 9 ms   | ✅     |
| GET /models    | 8 ms   | 14 ms  | 22 ms  | ✅     |
| POST /chat     | 41 ms  | 89 ms  | 134 ms | ✅     |

Load Testing with Locust

tests/load_test_locust.py provides Locust user profiles for sustained load testing.

User Profiles

| Class | Behaviour | Use Case |
|-------|-----------|----------|
| `HealthOnlyUser` | Polls `GET /health` | Smoke test: verifies availability under load |
| `ReadOnlyUser` | Mix of `GET /health`, `/models`, `/mode` | Read-only load without LLM calls |

Running Locust

Headless (CLI)

# Install Locust
pip install locust

# Smoke test: 10 users, 1 minute
# (the user class to run is passed as a positional argument;
#  --class-picker is a web UI flag and takes no value)
locust -f tests/load_test_locust.py \
  --host=http://localhost:5000 \
  --users 10 \
  --spawn-rate 2 \
  --run-time 60s \
  --headless \
  HealthOnlyUser

# Sustained read load: 50 users, 5 minutes
# (user class passed as a positional argument)
locust -f tests/load_test_locust.py \
  --host=http://localhost:5000 \
  --users 50 \
  --spawn-rate 5 \
  --run-time 5m \
  --headless \
  ReadOnlyUser

Web UI

locust -f tests/load_test_locust.py --host=http://localhost:5000
# Open http://localhost:8089

Interpreting Results

Key metrics to watch during a load test:

| Metric | Acceptable | Investigate |
|--------|------------|-------------|
| Failure rate | < 0.1% | > 1% |
| Median response time | < 100 ms | > 500 ms |
| p99 response time | < 500 ms | > 2 s |
| Requests/s at target load | Stable | Declining |
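The numeric rows of this table can be applied mechanically after a run. The sketch below is illustrative only: the function names are hypothetical, and the "watch" label for values that fall between the two columns is an assumption, since the table does not name that band.

```python
def classify(value: float, acceptable_below: float, investigate_above: float) -> str:
    """Map a metric onto the table's bands; 'watch' is an assumed label for the gap between them."""
    if value < acceptable_below:
        return "acceptable"
    if value > investigate_above:
        return "investigate"
    return "watch"


def review_run(requests: int, failures: int, median_ms: float, p99_ms: float) -> dict[str, str]:
    """Classify one load-test run using the thresholds from the table above."""
    failure_pct = 100 * failures / requests if requests else 0.0
    return {
        "failure_rate": classify(failure_pct, 0.1, 1.0),   # percent
        "median": classify(median_ms, 100, 500),           # milliseconds
        "p99": classify(p99_ms, 500, 2000),                # milliseconds
    }


if __name__ == "__main__":
    # Example figures, not measured results.
    print(review_run(requests=10_000, failures=5, median_ms=40, p99_ms=300))
```

Throughput stability is left out because judging "stable" versus "declining" requires a time series rather than a single number.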

Monitor process resource usage in Grafana during load tests, for example via hancock_memory_usage_bytes and hancock_active_connections, provided the metrics_exporter middleware is wired into the agent.

Running Against a Deployed Instance

locust -f tests/load_test_locust.py \
  --host=https://your-hancock-instance.example.com \
  --users 20 \
  --spawn-rate 2 \
  --run-time 2m \
  --headless

If API authentication is enabled, set HANCOCK_API_KEY in the environment — the Locust profiles read it automatically.

Performance Tests

tests/test_performance.py is a lighter pytest suite that runs on every push alongside the unit tests.

It asserts:

Burst behavior for POST /v1/ask and POST /v1/chat around HANCOCK_RATE_LIMIT.
Concurrent POST /v1/webhook HMAC validation under load (mixed valid/invalid signatures).
Endpoint p50/p95 latency thresholds for regression gating in CI.

These tests use the Flask test client (no real network), so they measure application logic overhead, not network latency.

pytest tests/test_performance.py -v
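For readers unfamiliar with the webhook check, the sketch below shows what a mix of valid and invalid signatures looks like. It assumes a hex-encoded HMAC-SHA256 over the raw request body and a hypothetical shared secret; the scheme Hancock actually uses for POST /v1/webhook may differ.

```python
import hashlib
import hmac

SECRET = b"example-webhook-secret"  # hypothetical value; never hard-code a real secret


def sign(body: bytes, secret: bytes = SECRET) -> str:
    """Hex-encoded HMAC-SHA256 of the raw request body (assumed scheme)."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def is_valid(body: bytes, signature: str, secret: bytes = SECRET) -> bool:
    """Compare in constant time so timing does not reveal how much of the signature matched."""
    return hmac.compare_digest(sign(body, secret).encode(), signature.encode())


body = b'{"event": "alert"}'
deliveries = [
    (body, sign(body)),                 # valid signature
    (body, "0" * 64),                   # wrong signature
    (b'{"event": "x"}', sign(body)),    # body changed after signing
]
results = [is_valid(payload, signature) for payload, signature in deliveries]
print(results)
```

A concurrency test would submit deliveries like these from several threads and check that each one is accepted or rejected independently of the others.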

CI Hardware Profile

To keep results comparable across contributors and CI, run the performance suite with this profile (or as close as possible):

CPU: 2-4 vCPU x86_64 (GitHub ubuntu-latest hosted runner class).
RAM: 7-16 GB.
Disk: SSD-backed ephemeral workspace with at least 10 GB free.
Python: 3.12.x with dependencies from requirements.txt and requirements-dev.txt.
Backend mode: Flask test client + mocked LLM responses (no network model calls).

Recommended local replication command:

HANCOCK_PERF_ARTIFACT=artifacts/performance-latency.json \
pytest tests/benchmark_suite.py tests/test_performance.py -v --tb=short -s \
  --junitxml=artifacts/benchmark-junit.xml

The CI workflow stores the following benchmark artifacts:

artifacts/performance-latency.json (p50/p95 stats from tests/test_performance.py)
artifacts/benchmark-junit.xml (pytest results)
artifacts/benchmark-summary.txt (standalone benchmark table)

Tuning Recommendations

LLM Backend

Ollama (local): Use a GPU-enabled host for the best model throughput. The llama3.1:8b model runs comfortably on a 16 GB VRAM GPU.
NVIDIA NIM: NIM endpoints are rate-limited. Use an NVIDIA_API_KEY whose quota covers your expected request rate.

Flask / WSGI

The default hancock_agent.py --server uses Flask's development server. For production, run behind a WSGI server:

# Gunicorn example
pip install gunicorn
gunicorn -w 4 -b 0.0.0.0:5000 "hancock_agent:create_app()"

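Use 4 × CPU_cores workers as a starting point (the upper end of the 2-4 × cores range that Gunicorn's settings documentation suggests), then tune based on Prometheus metrics. The -w 4 in the example above is a fixed count, not a computed one.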
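Instead of hard-coding the worker count, it can be derived from the core count in a Gunicorn config file, which is ordinary Python. This is a sketch under assumptions: the file name and the HANCOCK_WORKERS_PER_CORE variable are hypothetical, not existing Hancock settings.

```python
# gunicorn.conf.py (hypothetical file; Gunicorn reads module-level names as settings)
import multiprocessing
import os

# Assumed tunable, defaulting to the 4-per-core starting point described above.
WORKERS_PER_CORE = int(os.environ.get("HANCOCK_WORKERS_PER_CORE", "4"))

bind = "0.0.0.0:5000"
workers = multiprocessing.cpu_count() * WORKERS_PER_CORE
```

It would be loaded with `gunicorn -c gunicorn.conf.py "hancock_agent:create_app()"`. Inside a container with a CPU limit, `multiprocessing.cpu_count()` reports the host's cores, so the multiplier may need lowering there.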

Kubernetes HPA

deploy/k8s/hpa.yaml scales pods when CPU exceeds 70% or memory exceeds 80%. Adjust thresholds and maxReplicas based on your observed p99 latency at various replica counts.
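When picking thresholds, it helps to work through the scaling arithmetic. The sketch below follows the formula in the Kubernetes HPA documentation, desired = ceil(current replicas × current metric / target metric), taking the largest proposal when several metrics are configured. The replica counts and utilization figures are made-up inputs, and the real controller also applies a tolerance band and stabilization windows that are omitted here.

```python
import math


def desired_replicas(
    current_replicas: int,
    metrics: dict[str, tuple[float, float]],  # name -> (current %, target %)
    max_replicas: int,
    min_replicas: int = 1,
) -> int:
    """Replica count the HPA formula proposes, clamped to the min/max bounds."""
    proposals = [
        math.ceil(current_replicas * current / target)
        for current, target in metrics.values()
    ]
    return max(min_replicas, min(max(proposals), max_replicas))


if __name__ == "__main__":
    # 4 pods at 90% CPU (target 70%) and 60% memory (target 80%); maxReplicas of 10 is assumed.
    print(desired_replicas(4, {"cpu": (90, 70), "memory": (60, 80)}, max_replicas=10))
```

In this example CPU proposes 6 replicas and memory proposes 3, so the result is 6.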

Health Check TTL

monitoring/health_check.py caches deep health check results for 30 s to avoid hammering the Ollama endpoint on every Kubernetes liveness probe tick. Increase the TTL if Ollama probes are contributing to model latency.
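The caching behavior described here amounts to a time-to-live wrapper around the expensive probe. The class below is a minimal sketch of that idea, not the code in monitoring/health_check.py: its name, the `probe` callable, and the injectable clock are assumptions made for illustration.

```python
import time
from typing import Callable


class CachedHealthCheck:
    """Return a cached probe result until it is older than ttl_seconds."""

    def __init__(
        self,
        probe: Callable[[], dict],
        ttl_seconds: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._probe = probe
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so the TTL can be tested without sleeping
        self._cached: dict | None = None
        self._expires_at = 0.0

    def get(self) -> dict:
        now = self._clock()
        # Only call the backend when there is no result yet or the cached one has expired.
        if self._cached is None or now >= self._expires_at:
            self._cached = self._probe()
            self._expires_at = now + self._ttl
        return self._cached


if __name__ == "__main__":
    check = CachedHealthCheck(lambda: {"ollama": "ok"})  # stand-in for a real deep check
    print(check.get(), check.get())  # the second call is served from the cache
```

With a 30 s TTL and a probe period shorter than that, most liveness probes are answered from the cache and never reach the backend.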
