# Performance Testing Guide

## Overview

This directory contains performance testing tools and scripts for IronSys.

## Test Types

### 1. Unit Benchmarks (pytest-benchmark)

Python unit-level performance tests.

**Location:** `python/tests/benchmarks/`

**Run:**

```bash
cd python
pytest tests/benchmarks/ -v --benchmark-only
```

**Tests:**

- `test_cache_performance.py` - Cache operations (get, set, SWR, parallel)
- `test_rate_limiter_performance.py` - Rate limiter throughput

**Performance Targets:**

- Cache operations: > 500 ops/sec
- Parallel cache reads: > 2,000 ops/sec
- Rate limiter checks: > 20,000 ops/sec
- Parallel rate limiting: > 50,000 ops/sec
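Throughput numbers like these can be sanity-checked with a plain stdlib timing loop. The sketch below is not part of the real suite: it uses an in-memory dict as a stand-in for the cache client, and `measure_ops_per_sec` is a helper invented for this example.

```python
import time

def measure_ops_per_sec(fn, iterations=10_000):
    """Time `fn` over `iterations` calls and return throughput in ops/sec."""
    start = time.perf_counter()
    for _ in range(iterations):
        fn()
    elapsed = time.perf_counter() - start
    return iterations / elapsed

# In-memory stand-in for the cache under test; the real suite exercises
# the IronSys cache client, this only illustrates the measurement.
store = {}

set_rate = measure_ops_per_sec(lambda: store.__setitem__("slot:1", b"payload"))
get_rate = measure_ops_per_sec(lambda: store.get("slot:1"))

print(f"set: {set_rate:,.0f} ops/sec, get: {get_rate:,.0f} ops/sec")
```

pytest-benchmark does the same thing with calibrated rounds and statistics; a raw loop like this is only good for a quick order-of-magnitude check.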

### 2. Load Testing (k6)

Full-system load testing with realistic traffic patterns.

**Location:** `scripts/performance/load-test.js`

**Install k6:**

```bash
# macOS
brew install k6

# Debian/Ubuntu (k6 is not in the default repositories; add Grafana's
# k6 apt repository first -- see the k6 installation docs)
sudo apt-get install k6

# Or use Docker
docker pull grafana/k6
```

**Run:**

```bash
# Local testing
API_BASE_URL=http://localhost:8000 k6 run scripts/performance/load-test.js

# Production-like load
API_BASE_URL=https://api.ironsys.example.com \
  TEST_DURATION=10m \
  TARGET_RPS=5000 \
  k6 run scripts/performance/load-test.js

# Using Docker
docker run --rm -i grafana/k6 run - < scripts/performance/load-test.js
```

**Test Stages:**

1. Ramp up (2 min): 0 → 50% target RPS
2. Steady state (5 min): maintain target RPS
3. Peak load (5 min): 150% target RPS
4. Spike test (1.5 min): ramp to 300% target RPS, hold for 30s, recover
5. Ramp down (2 min): back to 0

**Scenarios:**

- 80% GET /slots (read operations)
- 20% POST /reserve (write operations)

**Performance Thresholds:**

- P95 latency < 500ms
- P99 latency < 1000ms
- Error rate < 1%
- Cache hit rate > 70%

**Metrics:**

- HTTP request duration (P50, P95, P99, max)
- Error rate
- Throughput (req/s)
- Cache hit rate
- Reservation latency
- Slot read latency
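The thresholds above can also be checked after a run by exporting a machine-readable summary with `k6 run --summary-export=summary.json`. A minimal sketch, assuming k6's summary-export layout (`http_req_duration` percentiles in milliseconds, `http_req_failed` ratio under `value`); `check_summary` is a helper defined here, not part of k6:

```python
import json

# Thresholds from this guide: P95 < 500 ms, P99 < 1000 ms, error rate < 1%.
THRESHOLDS = {"p(95)": 500.0, "p(99)": 1000.0}
MAX_ERROR_RATE = 0.01

def check_summary(summary: dict) -> list[str]:
    """Return a list of threshold violations from a k6 summary-export dict."""
    failures = []
    duration = summary["metrics"]["http_req_duration"]
    for pct, limit in THRESHOLDS.items():
        if duration[pct] >= limit:
            failures.append(f"http_req_duration {pct} = {duration[pct]}ms (limit {limit}ms)")
    error_rate = summary["metrics"]["http_req_failed"]["value"]
    if error_rate >= MAX_ERROR_RATE:
        failures.append(f"error rate = {error_rate:.2%} (limit {MAX_ERROR_RATE:.0%})")
    return failures

# Example: a run that meets every threshold.
sample = {
    "metrics": {
        "http_req_duration": {"p(95)": 342.0, "p(99)": 567.0},
        "http_req_failed": {"value": 0.0002},
    }
}
print(check_summary(sample))  # []
```

Note that `p(99)` only appears in the export if it is included in the script's `summaryTrendStats`.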

### 3. Stress Testing (bash + ab/wrk)

System behavior under extreme conditions.

**Location:** `scripts/performance/stress-test.sh`

**Prerequisites:**

```bash
# Install Apache Bench (part of apache2-utils)
sudo apt-get install apache2-utils

# Or install wrk (recommended)
sudo apt-get install wrk

# Or on macOS
brew install wrk
```

**Run:**

```bash
# Basic stress test
./scripts/performance/stress-test.sh

# Custom configuration
API_BASE_URL=http://localhost:8000 \
  CONCURRENT_USERS=2000 \
  DURATION=600 \
  ./scripts/performance/stress-test.sh
```

**Tests:**

1. Basic stress: high-concurrency reads and writes
2. Memory stress: many unique requests to fill the cache
3. Connection pool stress: concurrent long-running requests
4. Rate limiter stress: rapid requests that trigger rate limiting

**Environment Variables:**

- `API_BASE_URL`: API endpoint (default: `http://localhost:8000`)
- `DURATION`: Test duration in seconds (default: 300)
- `CONCURRENT_USERS`: Concurrent connections (default: 1000)
- `REQUESTS_PER_USER`: Requests per user for ab (default: 100)

## Performance Baselines

### API Performance

| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| P95 Latency | < 200ms | < 500ms | < 1000ms |
| P99 Latency | < 500ms | < 1000ms | < 2000ms |
| Error Rate | < 0.01% | < 0.1% | < 1% |
| Throughput | > 5000 rps | > 2000 rps | > 500 rps |

### Cache Performance

| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| Hit Rate | > 80% | > 70% | > 50% |
| Get Latency | < 2ms | < 5ms | < 10ms |
| Set Latency | < 2ms | < 5ms | < 10ms |

### Database Performance

| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| Query Time | < 10ms | < 50ms | < 100ms |
| Connection Pool | < 50% | < 80% | < 95% |
| Active Connections | < 10 | < 15 | < 20 |

### Kafka Performance

| Metric | Target | Acceptable | Critical |
|--------|--------|------------|----------|
| Consumer Lag | < 100 | < 1000 | < 10000 |
| Publish Latency | < 10ms | < 50ms | < 100ms |
| Error Rate | < 0.01% | < 0.1% | < 1% |

## Running Performance Tests in CI/CD

### GitHub Actions

See `.github/workflows/ci.yml` for automated performance testing:

```yaml
performance-test:
  runs-on: ubuntu-latest
  steps:
    - name: Run benchmark tests
      run: pytest tests/benchmarks/ --benchmark-only

    - name: Run load tests
      run: |
        docker-compose up -d
        sleep 10
        k6 run --vus 100 --duration 60s scripts/performance/load-test.js
```

## Monitoring During Tests

### Prometheus Metrics

Monitor these metrics during load tests:

```promql
# Request rate
rate(ironsys_requests_total[1m])

# P95 latency
histogram_quantile(0.95, rate(ironsys_request_duration_seconds_bucket[1m]))

# Error rate
rate(ironsys_requests_total{status=~"5.."}[1m]) /
rate(ironsys_requests_total[1m])

# Cache hit rate
rate(ironsys_cache_hits_total{type="fresh"}[1m]) /
rate(ironsys_cache_hits_total[1m])

# Consumer lag
kafka_consumer_lag

# Circuit breaker state
ironsys_circuit_breaker_state
```
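These expressions can also be pulled during a run via Prometheus's HTTP API (`/api/v1/query`). A small stdlib sketch; the `localhost:9090` address is an assumption about where Prometheus is exposed:

```python
import json
import urllib.parse
import urllib.request

PROM_URL = "http://localhost:9090"  # assumed local Prometheus; adjust as needed

def instant_query_url(promql: str, base: str = PROM_URL) -> str:
    """Build a Prometheus HTTP API instant-query URL for a PromQL expression."""
    return f"{base}/api/v1/query?" + urllib.parse.urlencode({"query": promql})

def instant_query(promql: str) -> dict:
    """Run an instant query and return the decoded JSON body."""
    with urllib.request.urlopen(instant_query_url(promql)) as resp:
        return json.load(resp)

# Example: the request-rate expression from above.
url = instant_query_url("rate(ironsys_requests_total[1m])")
```

Polling a few of these in a loop while k6 runs gives a crude live view when Grafana is not available.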

### Grafana Dashboard

Import the performance testing dashboard:

```bash
kubectl apply -f infra/grafana/dashboards/ironsys-overview.json
```

## Interpreting Results

### Good Performance

```text
Requests/sec:    5234.56
Transfer/sec:    2.34MB
Latency Distribution:
  50%: 123ms
  75%: 187ms
  90%: 256ms
  95%: 342ms
  99%: 567ms
Error Rate: 0.02%
```

### Degraded Performance

```text
Requests/sec:    1234.56
Transfer/sec:    567KB
Latency Distribution:
  50%: 456ms
  75%: 789ms
  90%: 1.2s
  95%: 2.3s
  99%: 4.5s
Error Rate: 2.5%
```

**Actions:**

1. Check resource utilization (CPU, memory)
2. Check database query performance
3. Check cache hit rate
4. Check circuit breaker states
5. Review application logs

### System Under Stress

```text
Requests/sec:    234.56
Transfer/sec:    123KB
Failed requests: 1234 (12.3%)
Error Rate: 12.3%
```

**Actions:**

1. Immediate: scale up resources
2. Check for resource exhaustion (memory, connections)
3. Check for cascading failures
4. Review circuit breaker states
5. Check Kafka consumer lag

## Performance Optimization Tips

### 1. Cache Optimization

```python
# Use SWR for better availability
cached_data, is_stale = await cache.get_with_swr(key)

# Batch cache operations with a pipeline
async with cache.client.pipeline() as pipe:
    pipe.get(key1)
    pipe.get(key2)
    results = await pipe.execute()
```
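For context on what `get_with_swr` returns, here is a minimal in-process sketch of stale-while-revalidate semantics. This is an illustration only: the real client is Redis-backed, and the `SWRCache` class and its TTL values are invented for this example.

```python
import time

class SWRCache:
    """Minimal in-process sketch of stale-while-revalidate semantics.

    Entries are fresh for `ttl` seconds, then served stale for another
    `stale_ttl` seconds (with is_stale=True) so callers can refresh in
    the background instead of blocking on a miss.
    """

    def __init__(self, ttl=30.0, stale_ttl=300.0):
        self.ttl = ttl
        self.stale_ttl = stale_ttl
        self._store = {}  # key -> (value, stored_at)

    def set(self, key, value):
        self._store[key] = (value, time.monotonic())

    def get_with_swr(self, key):
        """Return (value, is_stale); (None, False) on a hard miss."""
        entry = self._store.get(key)
        if entry is None:
            return None, False
        value, stored_at = entry
        age = time.monotonic() - stored_at
        if age < self.ttl:
            return value, False          # fresh hit
        if age < self.ttl + self.stale_ttl:
            return value, True           # stale hit: serve now, refresh async
        del self._store[key]             # fully expired
        return None, False

cache = SWRCache(ttl=30.0)
cache.set("slots:today", [{"id": 1}])
value, is_stale = cache.get_with_swr("slots:today")
```

The availability win is the stale window: during a backend outage shorter than `stale_ttl`, reads keep succeeding with slightly old data instead of erroring.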

### 2. Database Optimization

```python
# Use connection pooling
DB_POOL_SIZE = 20
DB_MAX_OVERFLOW = 10

# Use prepared statements
await conn.fetchrow("SELECT * FROM slots WHERE id = $1", slot_id)

# Batch operations inside a single transaction
async with conn.transaction():
    await conn.executemany("INSERT INTO ...", data)
```
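The win from batching inside one transaction is easy to demonstrate. The sketch below uses stdlib `sqlite3` so it runs anywhere (the real path is asyncpg against Postgres), and the `slots` schema here is hypothetical:

```python
import sqlite3

# sqlite3 stands in for Postgres so this sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE slots (id INTEGER PRIMARY KEY, capacity INTEGER)")

rows = [(i, 10) for i in range(1, 1001)]

# One transaction + executemany: a single commit for the whole batch,
# instead of per-row commit overhead.
with conn:
    conn.executemany("INSERT INTO slots (id, capacity) VALUES (?, ?)", rows)

count = conn.execute("SELECT COUNT(*) FROM slots").fetchone()[0]
print(count)  # 1000
```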

### 3. Kafka Optimization

```python
# Batch messages (illustrative producer API)
producer.send_batch(messages, partition_key=slot_id)

# Tune consumer settings
KAFKA_MAX_POLL_RECORDS = 100
KAFKA_MAX_POLL_INTERVAL_MS = 300000
```

### 4. API Optimization

```python
# Use async/await
async def get_slot(slot_id: UUID):
    return await db.fetchrow("SELECT * FROM slots WHERE id = $1", slot_id)

# Enable compression
app.add_middleware(GZipMiddleware, minimum_size=1000)

# Use multiple worker processes (uvicorn requires an import string when workers > 1)
uvicorn.run("main:app", workers=4)
```
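The `minimum_size=1000` cutoff above exists because gzip carries fixed header overhead: very small bodies come out larger after compression. A quick stdlib check, with an invented payload shape:

```python
import gzip
import json

# A large, repetitive JSON body compresses well...
large = json.dumps(
    [{"slot_id": i, "status": "available"} for i in range(500)]
).encode()
small = b'{"ok": true}'

large_gz = gzip.compress(large)
small_gz = gzip.compress(small)

print(len(large), len(large_gz))  # compression wins on the big payload
print(len(small), len(small_gz))  # the tiny one gets larger
```

So compressing everything would waste CPU and bytes on small responses; the size floor skips them.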

## Troubleshooting

### High Latency

1. Check database query performance
2. Check cache hit rate
3. Check network latency
4. Check resource utilization

### High Error Rate

1. Check circuit breaker states
2. Check rate limiter configuration
3. Check database connections
4. Check Kafka connectivity

### Low Throughput

1. Check worker concurrency
2. Check connection pool size
3. Check resource limits (CPU, memory)
4. Check network bandwidth

## References