This guide covers Hancock's latency targets, benchmark suite, and load testing tooling.
- Latency Targets
- Benchmark Suite
- Load Testing with Locust
- Performance Tests
- CI Hardware Profile
- Tuning Recommendations
These are the target latencies for Hancock endpoints under normal load. They are enforced in CI via the benchmark suite.
| Endpoint | p50 | p95 | p99 |
|---|---|---|---|
GET /health |
< 10 ms | < 25 ms | < 50 ms |
GET /models |
< 20 ms | < 50 ms | < 100 ms |
GET /mode |
< 20 ms | < 50 ms | < 100 ms |
POST /v1/chat (LLM mocked) |
< 120 ms | < 260 ms | < 500 ms |
The benchmark suite keeps the p99 threshold of 500 ms as a hard CI gate.
tests/test_performance.py additionally enforces endpoint-level p50/p95 regression thresholds and validates burst + concurrency behavior around rate limiting and webhook HMAC checks.
tests/benchmark_suite.py is a pytest-based micro-benchmark that runs locally or in CI.
# Run with default settings (50 iterations, 5-iteration warmup)
pytest tests/benchmark_suite.py -v
# Run a specific endpoint benchmark
pytest tests/benchmark_suite.py -v -k "health"
# Output a summary table
pytest tests/benchmark_suite.py -v --tb=short- Warm-up: 5 requests are sent to each endpoint to prime connection pools and JIT paths.
- Measurement: 50 timed requests are sent sequentially.
- Statistics: p50, p95, and p99 are computed from the 50 samples.
- Assertion: The test fails if p99 exceeds the threshold (500 ms for non-LLM endpoints).
The benchmark runs on every pull request via .github/workflows/benchmark.yml. It posts a summary table to the PR as a comment:
| Endpoint | p50 | p95 | p99 | Status |
|----------------|--------|--------|--------|--------|
| GET /health | 3 ms | 6 ms | 9 ms | ✅ |
| GET /models | 8 ms | 14 ms | 22 ms | ✅ |
| POST /chat | 41 ms | 89 ms | 134 ms | ✅ |
tests/load_test_locust.py provides Locust user profiles for sustained load testing.
| Class | Behaviour | Use Case |
|---|---|---|
HealthOnlyUser |
Polls GET /health |
Smoke test — verifies availability under load |
ReadOnlyUser |
Mix of GET /health, /models, /mode |
Read-only load without LLM calls |
# Install Locust
pip install locust
# Smoke test — 10 users, 1 minute
locust -f tests/load_test_locust.py \
--host=http://localhost:5000 \
--users 10 \
--spawn-rate 2 \
--run-time 60s \
--headless \
--class-picker HealthOnlyUser
# Sustained read load — 50 users, 5 minutes
locust -f tests/load_test_locust.py \
--host=http://localhost:5000 \
--users 50 \
--spawn-rate 5 \
--run-time 5m \
--headless \
--class-picker ReadOnlyUserlocust -f tests/load_test_locust.py --host=http://localhost:5000
# Open http://localhost:8089Key metrics to watch during a load test:
| Metric | Acceptable | Investigate |
|---|---|---|
| Failure rate | < 0.1% | > 1% |
| Median response time | < 100 ms | > 500 ms |
| p99 response time | < 500 ms | > 2 s |
| Requests/s at target load | Stable | Declining |
Monitor process resource usage in Grafana during load tests (e.g., via hancock_memory_usage_bytes and hancock_active_connections if metrics_exporter middleware is wired into the agent).
locust -f tests/load_test_locust.py \
--host=https://your-hancock-instance.example.com \
--users 20 \
--spawn-rate 2 \
--run-time 2m \
--headlessIf API authentication is enabled, set HANCOCK_API_KEY in the environment — the Locust profiles read it automatically.
tests/test_performance.py is a lighter pytest suite that runs on every push alongside the unit tests.
It asserts:
- Burst behavior for
POST /v1/askandPOST /v1/chataroundHANCOCK_RATE_LIMIT. - Concurrent
POST /v1/webhookHMAC validation under load (mixed valid/invalid signatures). - Endpoint p50/p95 latency thresholds for regression gating in CI.
These tests use the Flask test client (no real network), so they measure application logic overhead, not network latency.
pytest tests/test_performance.py -vTo keep results comparable across contributors and CI, run the performance suite with this profile (or as close as possible):
- CPU: 2-4 vCPU x86_64 (GitHub
ubuntu-latesthosted runner class). - RAM: 7-16 GB.
- Disk: SSD-backed ephemeral workspace with at least 10 GB free.
- Python: 3.12.x with dependencies from
requirements.txtandrequirements-dev.txt. - Backend mode: Flask test client + mocked LLM responses (no network model calls).
Recommended local replication command:
HANCOCK_PERF_ARTIFACT=artifacts/performance-latency.json \
pytest tests/benchmark_suite.py tests/test_performance.py -v --tb=short -s \
--junitxml=artifacts/benchmark-junit.xmlThe CI workflow stores the following benchmark artifacts:
artifacts/performance-latency.json(p50/p95 stats fromtests/test_performance.py)artifacts/benchmark-junit.xml(pytest results)artifacts/benchmark-summary.txt(standalone benchmark table)
- Ollama (local): Use a GPU-enabled host for the best model throughput. The
llama3.1:8bmodel runs comfortably on a 16 GB VRAM GPU. - NVIDIA NIM: NIM endpoints are rate-limited. Use the
NVIDIA_API_KEYwith sufficient quota for your expected request rate.
The default hancock_agent.py --server uses Flask's development server. For production, run behind a WSGI server:
# Gunicorn example
pip install gunicorn
gunicorn -w 4 -b 0.0.0.0:5000 "hancock_agent:create_app()"Use 4 × CPU_cores workers as a starting point and tune based on Prometheus metrics.
deploy/k8s/hpa.yaml scales pods when CPU exceeds 70% or memory exceeds 80%. Adjust thresholds and maxReplicas based on your observed p99 latency at various replica counts.
monitoring/health_check.py caches deep health check results for 30 s to avoid hammering the Ollama endpoint on every Kubernetes liveness probe tick. Increase this if Ollama probes are contributing to model latency.