[AWS] Story 3: Compute Layer (ECS Fargate) #178

@mfittko

Summary

Implement the ECS Fargate compute layer with Proxy and Dispatcher services, auto-scaling, and health checks.

Epic: #174
Architecture: docs/architecture/planned/aws-ecs-cdk.md


Tasks

ECS Cluster

  • Create ECS Fargate cluster
  • Configure cluster capacity providers
  • Enable ECS Exec for debugging/manual access
  • Enable Container Insights (optional, cost consideration)
  • Create DynamoDB lock table for init container

Proxy Service Task Definition

  • Create task definition (ARM64 for cost savings)
  • Configure ECS native secrets injection:
    • MANAGEMENT_TOKEN ← Secrets Manager
    • DATABASE_URL ← Secrets Manager
    • REDIS_URL ← Secrets Manager
    • LOG_LEVEL ← SSM Parameter
    • DB_DRIVER ← SSM Parameter
    • etc.
  • Configure shared volume for /config
  • Add init container (see below)
  • Add main proxy container (ports 8080, 8081)
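A minimal sketch of the native injection listed above: each entry in an ECS container definition's `secrets` array pairs an env var `name` with a `valueFrom` reference. The helper `buildSecrets`, the account ID, the region, and the secret and parameter names are all placeholders for illustration; the real ARNs would come from Story 1.

```typescript
// Shape of one entry in an ECS container definition's `secrets` array.
interface EcsSecret {
  name: string;      // env var name seen by the container
  valueFrom: string; // Secrets Manager secret ARN or SSM parameter ARN
}

// Hypothetical helper: turns [envVar, arn] pairs into `secrets` entries.
function buildSecrets(sources: Array<[string, string]>): EcsSecret[] {
  return sources.map(([envVar, valueFrom]) => ({ name: envVar, valueFrom }));
}

// Placeholder ARNs (assumptions, not real resources).
const SM_PREFIX = "arn:aws:secretsmanager:eu-central-1:123456789012:secret:";
const SSM_PREFIX = "arn:aws:ssm:eu-central-1:123456789012:parameter";

const proxySecrets = buildSecrets([
  ["MANAGEMENT_TOKEN", SM_PREFIX + "llm-proxy/management-token"],
  ["DATABASE_URL", SM_PREFIX + "llm-proxy/database-url"],
  ["REDIS_URL", SM_PREFIX + "llm-proxy/redis-url"],
  ["LOG_LEVEL", SSM_PREFIX + "/llm-proxy/log-level"],
  ["DB_DRIVER", SSM_PREFIX + "/llm-proxy/db-driver"],
]);

console.log(JSON.stringify(proxySecrets, null, 2));
```

Because ECS resolves these references when the task starts, the task execution role needs read access to the referenced secrets and parameters.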

Init Container (Config + Migrations)

The init container handles only:

  1. Fetch api_providers.yaml from SSM → shared volume
  2. Run migrations with DynamoDB lock

Note: Env vars (LOG_LEVEL, etc.) and secrets (tokens, passwords) are injected natively by ECS, so the init container is not involved in either.

┌─────────────────────────────────────────────────────────────┐
│ ECS Task Start                                              │
│                                                             │
│  ECS injects env vars automatically:                        │
│    MANAGEMENT_TOKEN, DATABASE_URL, LOG_LEVEL, etc.          │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│ Init Container                                              │
│  1. Acquire DynamoDB lock (wait if held by another)         │
│  2. Fetch /llm-proxy/api-providers from SSM                 │
│  3. Write to /config/api_providers.yaml (shared volume)     │
│  4. Run: goose up (migrations)                              │
│  5. Release lock                                            │
│  6. Exit 0                                                  │
└─────────────────────────────────────────────────────────────┘
                          ↓
┌─────────────────────────────────────────────────────────────┐
│ Main Container (Proxy)                                      │
│  - Env vars already populated by ECS ✓                      │
│  - Reads /config/api_providers.yaml from shared volume ✓    │
│  - Starts llm-proxy server                                  │
│  - NO app changes needed!                                   │
└─────────────────────────────────────────────────────────────┘
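The ordering in the diagram can be expressed with a container dependency. The sketch below uses the field names of the ECS task definition JSON (`dependsOn`, `mountPoints`, `volumes`); the container names, the volume name `config`, and the script path `/init.sh` are assumptions.

```typescript
// Minimal subset of the ECS task definition JSON shape used in this sketch.
interface ContainerDef {
  name: string;
  essential: boolean;
  command?: string[];
  mountPoints: { sourceVolume: string; containerPath: string; readOnly: boolean }[];
  dependsOn?: { containerName: string; condition: "START" | "COMPLETE" | "SUCCESS" | "HEALTHY" }[];
}

const CONFIG_VOLUME = "config"; // hypothetical volume name

const initContainer: ContainerDef = {
  name: "init",
  essential: false, // non-essential, so exiting after init does not stop the task
  command: ["/init.sh"], // assumed script path
  mountPoints: [{ sourceVolume: CONFIG_VOLUME, containerPath: "/config", readOnly: false }],
};

const proxyContainer: ContainerDef = {
  name: "proxy",
  essential: true,
  mountPoints: [{ sourceVolume: CONFIG_VOLUME, containerPath: "/config", readOnly: true }],
  // Start the proxy only after the init container has exited with code 0.
  dependsOn: [{ containerName: "init", condition: "SUCCESS" }],
};

const taskDefinition = {
  volumes: [{ name: CONFIG_VOLUME }],
  containerDefinitions: [initContainer, proxyContainer],
};

console.log(JSON.stringify(taskDefinition, null, 2));
```

In CDK the same dependency would likely be declared with `addContainerDependencies` and `ContainerDependencyCondition.SUCCESS` on the proxy container.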

Proxy Service

  • Create ECS service (desired: 1, min: 1, max: 4)
  • Configure health check (/health endpoint)
  • Set up auto-scaling policies:
    • Target CPU utilization (70%)
    • Target request count per target
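To make the CPU target concrete, here is a rough sketch of the proportional math behind target tracking, clamped to the min/max above. It is a simplification: the real Application Auto Scaling behavior also involves alarms, cooldowns, and more conservative scale-in, and `desiredTaskCount` is a hypothetical helper.

```typescript
// Approximate target tracking: pick the task count that would bring the
// metric back to the target, then clamp it to the service's min/max.
function desiredTaskCount(
  currentTasks: number,
  metricValue: number, // e.g. average CPU utilization in percent
  targetValue: number, // e.g. 70
  minTasks: number,
  maxTasks: number
): number {
  const raw = Math.ceil(currentTasks * (metricValue / targetValue));
  return Math.min(maxTasks, Math.max(minTasks, raw));
}

console.log(desiredTaskCount(2, 90, 70, 1, 4));  // 3
console.log(desiredTaskCount(4, 100, 70, 1, 4)); // 4 (capped at max)
console.log(desiredTaskCount(1, 20, 70, 1, 4));  // 1 (floored at min)
```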

Dispatcher Service

  • Create task definition (ARM64)
  • Configure container (same image, different command)
  • Configure ECS secrets injection (same as proxy)
  • Create ECS service (desired: 1, no scaling)
  • Configure health check

Docker Image

  • Update Dockerfile for multi-arch build (AMD64 + ARM64)
  • Build with -tags postgres for PostgreSQL support
  • Create ECR repository
  • Add init script for SSM fetch + migration with lock

Database & Redis Connectivity Testing

  • Test goose migrations against Aurora (via init container)
  • Verify Redis TLS connection (rediss://)
  • Test LLM Proxy connects to both services
  • Document connection troubleshooting
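As a possible first troubleshooting step, a preflight check can catch a REDIS_URL that does not use the TLS scheme before any connection is attempted. This sketch only inspects the URL string, it does not verify the TLS handshake; `checkRedisUrl` and the host name are hypothetical.

```typescript
// Returns null when the URL uses the TLS scheme, otherwise a short reason.
function checkRedisUrl(url: string): string | null {
  if (url.indexOf("rediss://") === 0) {
    return null;
  }
  if (url.indexOf("redis://") === 0) {
    return "REDIS_URL uses redis:// (plaintext); use rediss:// for TLS";
  }
  return "REDIS_URL must start with rediss://";
}

console.log(checkRedisUrl("rediss://cache.example.internal:6379")); // null
console.log(checkRedisUrl("redis://cache.example.internal:6379"));
```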

DynamoDB Lock Pattern (with Wait)

#!/bin/sh
# init.sh - runs in init container

LOCK_ID="init"
LOCK_TABLE="llm-proxy-locks"
LOCK_TTL=300     # 5 minutes, in seconds
MAX_WAIT=300     # 5 minutes max wait, in seconds
POLL_INTERVAL=5  # seconds between acquire attempts

# Conditional put: succeeds only if no lock exists or the existing lock has expired.
# stderr is discarded, so other failures (e.g. missing IAM permissions) also look
# like a held lock and end in the timeout below.
acquire_lock() {
  NOW=$(date +%s)
  EXPIRES=$((NOW + LOCK_TTL))
  aws dynamodb put-item \
    --table-name "$LOCK_TABLE" \
    --item '{"lock_id":{"S":"'"$LOCK_ID"'"},"expires_at":{"N":"'"$EXPIRES"'"}}' \
    --condition-expression "attribute_not_exists(lock_id) OR expires_at < :now" \
    --expression-attribute-values '{":now":{"N":"'"$NOW"'"}}' 2>/dev/null
}

release_lock() {
  echo "Releasing lock..."
  aws dynamodb delete-item \
    --table-name "$LOCK_TABLE" \
    --key '{"lock_id":{"S":"'"$LOCK_ID"'"}}'
}

# Step 1: Acquire lock (wait if held)
echo "Acquiring init lock..."
WAITED=0
while ! acquire_lock; do
  if [ "$WAITED" -ge "$MAX_WAIT" ]; then
    echo "ERROR: Timeout waiting for init lock"
    exit 1
  fi
  echo "Lock held by another task, waiting ${POLL_INTERVAL}s... ($WAITED/$MAX_WAIT)"
  sleep "$POLL_INTERVAL"
  WAITED=$((WAITED + POLL_INTERVAL))
done

echo "Lock acquired!"

# Release the lock on every exit path from here on, including failures.
trap release_lock EXIT

# Step 2: Fetch api_providers.yaml from SSM
echo "Fetching config from SSM..."
if ! aws ssm get-parameter --name "/llm-proxy/api-providers" \
    --query "Parameter.Value" --output text > /config/api_providers.yaml; then
  echo "ERROR: Failed to fetch config from SSM"
  exit 1
fi

# Step 3: Run migrations
echo "Running migrations..."
goose -dir /migrations postgres "$DATABASE_URL" up
MIGRATION_EXIT=$?

# Step 4: Release lock (done by the EXIT trap) and report the migration result
exit "$MIGRATION_EXIT"
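The conditional write in acquire_lock can be hard to reason about from the CLI flags alone. The sketch below models the same rule against an in-memory table (a stand-in, not DynamoDB): the put succeeds only when no lock item exists or its `expires_at` is in the past. The function names and timestamps are illustrative.

```typescript
// In-memory stand-in for the lock table: lock_id -> expires_at (epoch seconds).
type LockTable = Record<string, number | undefined>;

// Mirrors "attribute_not_exists(lock_id) OR expires_at < :now".
function tryAcquire(locks: LockTable, lockId: string, now: number, ttl: number): boolean {
  const expiresAt = locks[lockId];
  if (expiresAt !== undefined && expiresAt >= now) {
    return false; // lock is held and not yet expired
  }
  locks[lockId] = now + ttl;
  return true;
}

function release(locks: LockTable, lockId: string): void {
  delete locks[lockId];
}

const locks: LockTable = {};
const first = tryAcquire(locks, "init", 1000, 300);    // free: acquired
const second = tryAcquire(locks, "init", 1010, 300);   // held until 1300: rejected
const takeover = tryAcquire(locks, "init", 1301, 300); // expired: acquired

console.log(first, second, takeover); // true false true
```

The third call shows why the TTL matters: a task that crashed while holding the lock is superseded once `expires_at` has passed.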

Configuration Props

proxyMinTasks?: number;      // default: 1
proxyMaxTasks?: number;      // default: 4
proxyCpu?: number;           // default: 256
proxyMemory?: number;        // default: 512
useArm64?: boolean;          // default: true
enableContainerInsights?: boolean; // default: false
enableEcsExec?: boolean;     // default: true
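One way these props and their documented defaults could be resolved is sketched below. The interface name `ComputeProps` and the helper `resolveComputeProps` are hypothetical; only the field names and default values come from the list above.

```typescript
// Hypothetical props interface wrapping the fields listed above.
interface ComputeProps {
  proxyMinTasks?: number;
  proxyMaxTasks?: number;
  proxyCpu?: number;    // Fargate CPU units (256 = 0.25 vCPU)
  proxyMemory?: number; // MiB
  useArm64?: boolean;
  enableContainerInsights?: boolean;
  enableEcsExec?: boolean;
}

// Fills in the documented defaults and rejects an inverted scaling range.
function resolveComputeProps(props: ComputeProps = {}): Required<ComputeProps> {
  const resolved: Required<ComputeProps> = {
    proxyMinTasks: props.proxyMinTasks ?? 1,
    proxyMaxTasks: props.proxyMaxTasks ?? 4,
    proxyCpu: props.proxyCpu ?? 256,
    proxyMemory: props.proxyMemory ?? 512,
    useArm64: props.useArm64 ?? true,
    enableContainerInsights: props.enableContainerInsights ?? false,
    enableEcsExec: props.enableEcsExec ?? true,
  };
  if (resolved.proxyMinTasks > resolved.proxyMaxTasks) {
    throw new Error("proxyMinTasks must not exceed proxyMaxTasks");
  }
  return resolved;
}

const defaults = resolveComputeProps();
const tuned = resolveComputeProps({ proxyMaxTasks: 8, useArm64: false });
console.log(defaults.proxyMaxTasks, tuned.proxyMaxTasks); // 4 8
```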

Acceptance Criteria

  • Proxy service runs on Fargate with ARM64
  • Dispatcher service runs on Fargate
  • ECS injects env vars from Secrets Manager + SSM
  • Init container fetches api_providers.yaml from SSM
  • Migrations run with DynamoDB lock (tasks WAIT)
  • Services connect to Aurora and Redis
  • Health checks pass consistently
  • Auto-scaling responds to CPU load
  • ECS Exec works for debugging
  • Graceful shutdown handles SIGTERM

Dependencies

  • Story 1: CDK Foundation (VPC, secrets, SSM params)
  • Story 2: Data Layer (Aurora, Redis)

Estimated Effort

Large - 4-5 days


Notes

  • ARM64 Fargate is ~20% cheaper than x86
  • DynamoDB lock table is essentially free (on-demand, minimal usage)
  • Lock TTL (5 min) prevents deadlocks if the init container crashes while holding the lock
  • ECS natively injects secrets/params as env vars - no .env file
  • Init container only handles api_providers.yaml + migrations
  • No LLM Proxy code changes needed
