Summary
Implement the ECS Fargate compute layer with Proxy and Dispatcher services, auto-scaling, and health checks.
Epic: #174
Architecture: docs/architecture/planned/aws-ecs-cdk.md
Tasks
ECS Cluster
- Create ECS Fargate cluster
- Configure cluster capacity providers
- Enable ECS Exec for debugging/manual access
- Enable Container Insights (optional, cost consideration)
- Create DynamoDB lock table for init container
Proxy Service Task Definition
- Create task definition (ARM64 for cost savings)
- Configure ECS native secrets injection:
  - MANAGEMENT_TOKEN ← Secrets Manager
  - DATABASE_URL ← Secrets Manager
  - REDIS_URL ← Secrets Manager
  - LOG_LEVEL ← SSM Parameter
  - DB_DRIVER ← SSM Parameter
  - etc.
- Configure shared volume for /config
- Add init container (see below)
- Add main proxy container (ports 8080, 8081)
Init Container (Config + Migrations)
The init container handles only:
- Fetch api_providers.yaml from SSM → shared volume
- Run migrations with DynamoDB lock
Note: Env vars (LOG_LEVEL, etc.) and secrets (tokens, passwords) are injected by ECS natively - no init container involvement!
┌─────────────────────────────────────────────────────────────┐
│ ECS Task Start │
│ │
│ ECS injects env vars automatically: │
│ MANAGEMENT_TOKEN, DATABASE_URL, LOG_LEVEL, etc. │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Init Container │
│ 1. Acquire DynamoDB lock (wait if held by another) │
│ 2. Fetch /llm-proxy/api-providers from SSM │
│ 3. Write to /config/api_providers.yaml (shared volume) │
│ 4. Run: goose up (migrations) │
│ 5. Release lock │
│ 6. Exit 0 │
└─────────────────────────────────────────────────────────────┘
↓
┌─────────────────────────────────────────────────────────────┐
│ Main Container (Proxy) │
│ - Env vars already populated by ECS ✓ │
│ - Reads /config/api_providers.yaml from shared volume ✓ │
│ - Starts llm-proxy server │
│ - NO app changes needed! │
└─────────────────────────────────────────────────────────────┘
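The ordering in the diagram maps onto two ECS container-definition fields: the init container is marked non-essential so it may exit, and the proxy container declares a dependsOn entry with condition SUCCESS so it starts only after the init container exits with code 0. The sketch below shows just those fields; the container and volume names are illustrative, and this is not the project's actual task definition.

```typescript
// Minimal subset of ECS container-definition fields relevant to startup ordering.
interface ContainerDef {
  name: string;
  essential: boolean;
  dependsOn?: {
    containerName: string;
    condition: "START" | "COMPLETE" | "SUCCESS" | "HEALTHY";
  }[];
  mountPoints?: { sourceVolume: string; containerPath: string; readOnly: boolean }[];
}

// Container and volume names are illustrative.
const containers: ContainerDef[] = [
  {
    name: "init",
    essential: false, // may exit without stopping the task
    mountPoints: [{ sourceVolume: "config", containerPath: "/config", readOnly: false }],
  },
  {
    name: "proxy",
    essential: true,
    // Start only after the init container has exited with code 0.
    dependsOn: [{ containerName: "init", condition: "SUCCESS" }],
    mountPoints: [{ sourceVolume: "config", containerPath: "/config", readOnly: true }],
  },
];

// Returns the names of containers that wait for `initName` to exit successfully.
function waitsForSuccess(defs: ContainerDef[], initName: string): string[] {
  return defs
    .filter((d) =>
      (d.dependsOn || []).some(
        (dep) => dep.containerName === initName && dep.condition === "SUCCESS",
      ),
    )
    .map((d) => d.name);
}
```

If the init container exits non-zero (for example on a lock timeout), the SUCCESS condition is never met and the proxy container does not start.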
Proxy Service
- Create ECS service (desired: 1, min: 1, max: 4)
- Configure health check (/health endpoint)
- Set up auto-scaling policies:
  - Target CPU utilization (70%)
  - Target request count per target
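To get a feel for how a 70% CPU target interacts with the 1 to 4 task range, the function below models target tracking as a simple proportional rule. This is a simplified approximation for reasoning about capacity, not the exact algorithm AWS applies (which also involves alarm evaluation periods and cooldowns).

```typescript
// Simplified proportional model: scale the task count by metric/target,
// round up, then clamp to the configured min/max.
function desiredTasks(
  current: number,
  metric: number,
  target: number,
  min: number,
  max: number,
): number {
  const proportional = Math.ceil(current * (metric / target));
  return Math.min(max, Math.max(min, proportional));
}

// 2 tasks averaging 90% CPU against a 70% target
const scaledOut = desiredTasks(2, 90, 70, 1, 4); // 3
// 4 tasks at 140% of target would want 8, but max is 4
const capped = desiredTasks(4, 140, 70, 1, 4); // 4
// 1 task at 35% CPU cannot go below the minimum
const floor = desiredTasks(1, 35, 70, 1, 4); // 1
```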
Dispatcher Service
- Create task definition (ARM64)
- Configure container (same image, different command)
- Configure ECS secrets injection (same as proxy)
- Create ECS service (desired: 1, no scaling)
- Configure health check
Docker Image
- Update Dockerfile for multi-arch build (AMD64 + ARM64)
- Build with -tags postgres for PostgreSQL support
- Create ECR repository
- Add init script for SSM fetch + migration with lock
Database & Redis Connectivity Testing
- Test goose migrations against Aurora (via init container)
- Verify Redis TLS connection (rediss://)
- Test LLM Proxy connects to both services
- Document connection troubleshooting
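A cheap first troubleshooting step is to check the shape of the connection strings before attempting any network connection. The helper below is a hypothetical preflight check, not part of LLM Proxy; it assumes DATABASE_URL is given in URL form (postgres:// or postgresql://), which may not hold if a key=value DSN is used instead.

```typescript
// Returns a list of problems found; an empty list means the basic checks passed.
function checkConnectionUrls(env: {
  DATABASE_URL?: string;
  REDIS_URL?: string;
}): string[] {
  const problems: string[] = [];
  // Assumption: URL-style PostgreSQL connection string.
  if (!env.DATABASE_URL || !/^postgres(ql)?:\/\//.test(env.DATABASE_URL)) {
    problems.push("DATABASE_URL should start with postgres:// or postgresql://");
  }
  // rediss:// (double s) selects TLS; redis:// is plaintext.
  if (!env.REDIS_URL || !/^rediss:\/\//.test(env.REDIS_URL)) {
    problems.push("REDIS_URL should use rediss:// (TLS), not redis://");
  }
  return problems;
}

// Hostnames are placeholders.
const issues = checkConnectionUrls({
  DATABASE_URL: "postgres://user:pass@db.example.internal:5432/llmproxy",
  REDIS_URL: "redis://cache.example.internal:6379",
});
```

Here issues contains only the Redis message, because the URL uses the plaintext scheme.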
DynamoDB Lock Pattern (with Wait)
#!/bin/sh
# init.sh - runs in init container
LOCK_ID="init"
LOCK_TABLE="llm-proxy-locks"
LOCK_TTL=300 # 5 minutes
MAX_WAIT=300 # 5 minutes max wait
POLL_INTERVAL=5
acquire_lock() {
  NOW=$(date +%s)
  TTL=$((NOW + LOCK_TTL))
  # Conditional put: succeeds only if no lock exists or the existing lock has expired.
  # Note: stderr is discarded, so permission or network errors also look like "lock held".
  aws dynamodb put-item \
    --table-name "$LOCK_TABLE" \
    --item "{\"lock_id\":{\"S\":\"$LOCK_ID\"},\"expires_at\":{\"N\":\"$TTL\"}}" \
    --condition-expression "attribute_not_exists(lock_id) OR expires_at < :now" \
    --expression-attribute-values "{\":now\":{\"N\":\"$NOW\"}}" 2>/dev/null
}
release_lock() {
  aws dynamodb delete-item --table-name "$LOCK_TABLE" --key "{\"lock_id\":{\"S\":\"$LOCK_ID\"}}"
}
# Step 1: Acquire lock (wait if held)
echo "Acquiring init lock..."
WAITED=0
while ! acquire_lock; do
  if [ "$WAITED" -ge "$MAX_WAIT" ]; then
    echo "ERROR: Timeout waiting for init lock"
    exit 1
  fi
  echo "Lock held by another task, waiting ${POLL_INTERVAL}s... ($WAITED/$MAX_WAIT)"
  sleep "$POLL_INTERVAL"
  WAITED=$((WAITED + POLL_INTERVAL))
done
echo "Lock acquired!"
# Step 2: Fetch api_providers.yaml from SSM
echo "Fetching config from SSM..."
if ! aws ssm get-parameter --name "/llm-proxy/api-providers" --query "Parameter.Value" --output text > /config/api_providers.yaml; then
  # Without this check a failed fetch would be ignored and the proxy would start with an empty config
  echo "ERROR: Failed to fetch config from SSM"
  release_lock
  exit 1
fi
# Step 3: Run migrations
echo "Running migrations..."
goose -dir /migrations postgres "$DATABASE_URL" up
MIGRATION_EXIT=$?
# Step 4: Release lock
echo "Releasing lock..."
release_lock
exit $MIGRATION_EXIT
Configuration Props
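One way to apply the defaults documented for these props is sketched here. The interface name ComputeProps and the resolveProps helper are hypothetical, and the min/max check is an added assumption rather than something the issue specifies.

```typescript
// Hypothetical interface name; fields and defaults follow the props in this section.
interface ComputeProps {
  proxyMinTasks?: number;
  proxyMaxTasks?: number;
  proxyCpu?: number;
  proxyMemory?: number;
  useArm64?: boolean;
  enableContainerInsights?: boolean;
  enableEcsExec?: boolean;
}

// Fill in each unset prop with its documented default.
// `??` keeps explicit `false` values, unlike `||`.
function resolveProps(props: ComputeProps = {}): Required<ComputeProps> {
  const resolved: Required<ComputeProps> = {
    proxyMinTasks: props.proxyMinTasks ?? 1,
    proxyMaxTasks: props.proxyMaxTasks ?? 4,
    proxyCpu: props.proxyCpu ?? 256,
    proxyMemory: props.proxyMemory ?? 512,
    useArm64: props.useArm64 ?? true,
    enableContainerInsights: props.enableContainerInsights ?? false,
    enableEcsExec: props.enableEcsExec ?? true,
  };
  // Assumed sanity check, not stated in the issue.
  if (resolved.proxyMinTasks > resolved.proxyMaxTasks) {
    throw new Error("proxyMinTasks must not exceed proxyMaxTasks");
  }
  return resolved;
}

const defaults = resolveProps();
const custom = resolveProps({ proxyMaxTasks: 8, useArm64: false });
```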
proxyMinTasks?: number; // default: 1
proxyMaxTasks?: number; // default: 4
proxyCpu?: number; // default: 256
proxyMemory?: number; // default: 512
useArm64?: boolean; // default: true
enableContainerInsights?: boolean; // default: false
enableEcsExec?: boolean; // default: true
Acceptance Criteria
- Proxy service runs on Fargate with ARM64
- Dispatcher service runs on Fargate
- ECS injects env vars from Secrets Manager + SSM
- Init container fetches api_providers.yaml from SSM
- Migrations run with DynamoDB lock (tasks WAIT)
- Services connect to Aurora and Redis
- Health checks pass consistently
- Auto-scaling responds to CPU load
- ECS Exec works for debugging
- Graceful shutdown handles SIGTERM
Dependencies
- Story 1: CDK Foundation (VPC, secrets, SSM params)
- Story 2: Data Layer (Aurora, Redis)
Estimated Effort
Large - 4-5 days
Notes
- ARM64 Fargate is ~20% cheaper than x86
- DynamoDB lock table is essentially free (on-demand, minimal usage)
- Lock TTL (5 min) prevents deadlocks if init container crashes
- ECS natively injects secrets/params as env vars - no .env file
- Init container only handles api_providers.yaml + migrations
- No LLM Proxy code changes needed
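To illustrate the note on the lock TTL: the conditional put in init.sh succeeds when no lock item exists or when the stored expires_at is already in the past. The in-memory model below mirrors that condition so the crash-recovery behaviour can be traced by hand. It is a sketch of the semantics only (the LockTable class is hypothetical and does not talk to DynamoDB).

```typescript
// In-memory stand-in for the lock table: lock_id -> expires_at (epoch seconds).
class LockTable {
  private expiresAt: { [lockId: string]: number } = {};

  // Mirrors: attribute_not_exists(lock_id) OR expires_at < :now
  acquire(lockId: string, now: number, ttlSeconds: number): boolean {
    const existing = this.expiresAt[lockId];
    if (existing !== undefined && existing >= now) {
      return false; // lock exists and has not expired yet
    }
    this.expiresAt[lockId] = now + ttlSeconds;
    return true;
  }

  release(lockId: string): void {
    delete this.expiresAt[lockId];
  }
}

const table = new LockTable();
const ttl = 300; // 5 minutes, as in init.sh

const taskA = table.acquire("init", 0, ttl); // true: no lock yet
const taskBEarly = table.acquire("init", 10, ttl); // false: held by task A
// Task A crashes here and never calls release().
const taskBAtExpiry = table.acquire("init", 300, ttl); // false: 300 < 300 does not hold
const taskBLater = table.acquire("init", 301, ttl); // true: lock has expired
```

The same model shows the flip side of the TTL: a migration that runs longer than 5 minutes would let a second task take over the lock while the first is still working.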