EKS SRE Agent

AI-powered SRE agent for Amazon EKS clusters. Built on AWS Bedrock AgentCore with MCP (Model Context Protocol) tools.

Features

  • Kubernetes Operations: Get pods, describe pods, view logs, get events, restart deployments
  • Prometheus Integration: Query metrics, get alerts, analyze service health (Golden Signals, SLI/SLO)
  • CloudWatch Integration: Get alarms, query metrics, search logs

Architecture

┌─────────────────────────────────────────┐
│  AWS Bedrock AgentCore Runtime          │
│  ┌───────────────────────────────────┐  │
│  │  Docker Container                 │  │
│  │  src/main.py (Strands Agent)      │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
                    │
                    ▼ MCP Protocol
┌─────────────────────────────────────────┐
│  AgentCore Gateway (JWT Auth)           │
└─────────────────────────────────────────┘
        │              │              │
        ▼              ▼              ▼
   ┌────────┐    ┌────────┐    ┌────────┐
   │ K8s    │    │ Prom   │    │ CW     │
   │ Lambda │    │ Lambda │    │ Lambda │
   └────────┘    └────────┘    └────────┘
        │              │              │
        ▼              ▼              ▼
   EKS API      Prometheus      CloudWatch

Project Structure

eks-sre-agent/
├── src/
│   ├── main.py                 # Agent entrypoints
│   ├── mcp_client/client.py    # Gateway client with Cognito auth
│   └── model/load.py           # Bedrock model loader
├── mcp/
│   ├── lambda/                 # Kubernetes MCP tools
│   │   ├── handler.py
│   │   ├── requirements.txt
│   │   └── build_lambda.sh
│   ├── prometheus-lambda/      # Prometheus MCP tools
│   │   ├── handler.py
│   │   └── requirements.txt
│   └── cloudwatch-lambda/      # CloudWatch MCP tools
│       ├── handler.py
│       └── requirements.txt
├── terraform/
│   ├── main.tf
│   ├── bedrock_agentcore.tf    # All AWS resources
│   ├── variables.tf
│   └── terraform.tfvars.example
├── Dockerfile
└── pyproject.toml

MCP Tools Reference

Kubernetes Tool (kubernetes_sre_tool)

| Action | Parameters | Description |
|---|---|---|
| get_pods | namespace | List all pods with status |
| describe_pod | namespace, pod_name | Detailed pod info |
| get_pod_logs | namespace, pod_name, previous | Container logs |
| get_events | namespace | Kubernetes events |
| restart_deployment | namespace, deployment_name | Rolling restart |
| scale_deployment | namespace, deployment_name, replicas | Scale replicas |

Example Lambda invocation:

aws lambda invoke --function-name eks-sre-agent-McpLambda \
  --payload '{"tool_name": "get_pods", "namespace": "demo"}' \
  --cli-binary-format raw-in-base64-out output.json

Response:

{
  "statusCode": 200,
  "body": {
    "result": {
      "pods": [
        {
          "name": "nginx-7d6877d777-abc12",
          "namespace": "demo",
          "phase": "Running",
          "containers": [
            {"name": "nginx", "ready": true, "restarts": 0, "state": "Running"}
          ],
          "node": "ip-10-0-1-100.ec2.internal"
        }
      ],
      "count": 1
    }
  }
}
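
A response in this shape is easy to post-process. The following is a minimal sketch, not code from this repository: it assumes the response layout shown above (body.result.pods, with per-container ready and restarts fields) and lists pods that are not Running or have a container that is not ready. The function name unhealthy_pods and the second sample pod are made up for illustration.

```python
import json


def unhealthy_pods(response: dict) -> list:
    """Summarise pods that are not Running or have containers that are not ready."""
    body = response.get("body", {})
    # Assumption: the body may arrive as a JSON string instead of a dict.
    if isinstance(body, str):
        body = json.loads(body)
    flagged = []
    for pod in body.get("result", {}).get("pods", []):
        containers = pod.get("containers", [])
        not_ready = [c["name"] for c in containers if not c.get("ready", False)]
        # Note: completed Job pods (phase "Succeeded") are flagged too.
        if pod.get("phase") != "Running" or not_ready:
            flagged.append({
                "name": pod["name"],
                "phase": pod.get("phase"),
                "not_ready": not_ready,
                "restarts": sum(c.get("restarts", 0) for c in containers),
            })
    return flagged


# The first pod is copied from the response above; the second is hypothetical.
SAMPLE = {
    "statusCode": 200,
    "body": {"result": {"count": 2, "pods": [
        {"name": "nginx-7d6877d777-abc12", "namespace": "demo", "phase": "Running",
         "containers": [{"name": "nginx", "ready": True, "restarts": 0, "state": "Running"}],
         "node": "ip-10-0-1-100.ec2.internal"},
        {"name": "api-5c9f-xyz", "namespace": "demo", "phase": "Running",
         "containers": [{"name": "api", "ready": False, "restarts": 7, "state": "Waiting"}],
         "node": "ip-10-0-1-100.ec2.internal"},
    ]}},
}

if __name__ == "__main__":
    for pod in unhealthy_pods(SAMPLE):
        print(pod)
```

To use it on a real invocation, load output.json with json.load and pass the resulting dict to unhealthy_pods.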

Prometheus Tool (prometheus_tool)

| Action | Parameters | Description |
|---|---|---|
| query | query | Execute PromQL instant query |
| query_range | query, start, end, step | Range query |
| get_alerts | - | Get firing/pending alerts |
| get_error_rate | job, window | Calculate error rate % |
| get_latency_percentiles | job, window | P50/P90/P99 latency |
| get_throughput | job, window | Requests per second |
| get_saturation | resource, namespace | CPU/Memory saturation |
| analyze_service | job, window | Full Golden Signals analysis |
| calculate_sli | job, slo_target, window | SLI calculation |
| check_error_budget | job, slo_target, window | Error budget status |
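
For readers new to SLI and error-budget terms, here is a worked sketch of the usual arithmetic behind actions like calculate_sli and check_error_budget. It is the standard availability calculation, not the handler's actual code (which is not shown in this README). The function name, the event counts, and the returned keys are assumptions for illustration.

```python
def error_budget_status(good_events: int, total_events: int, slo_target: float) -> dict:
    """Compute SLI and error-budget usage; slo_target is a percentage such as 99.9."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    if not 0 < slo_target < 100:
        raise ValueError("slo_target must be between 0 and 100 (exclusive)")
    sli = 100.0 * good_events / total_events
    # The error budget is the share of events allowed to fail under the SLO.
    allowed_bad = total_events * (100.0 - slo_target) / 100.0
    actual_bad = total_events - good_events
    consumed = actual_bad / allowed_bad
    return {
        "sli": sli,
        "slo_met": sli >= slo_target,
        "budget_consumed": consumed,        # 1.0 means the budget is exhausted
        "budget_remaining": 1.0 - consumed,  # negative once the SLO is breached
    }


if __name__ == "__main__":
    # Hypothetical window: 1,000,000 requests, 400 of them failed, SLO 99.9%.
    status = error_budget_status(good_events=999_600, total_events=1_000_000, slo_target=99.9)
    print({key: round(value, 4) if isinstance(value, float) else value
           for key, value in status.items()})
```

With a 99.9% target, 1,000,000 requests allow roughly 1,000 failures, so 400 failures use about 40% of the budget.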

Example:

aws lambda invoke --function-name eks-sre-agent-PrometheusLambda \
  --payload '{"tool_name": "get_alerts"}' \
  --cli-binary-format raw-in-base64-out output.json

Response:

{
  "statusCode": 200,
  "body": {
    "status": "success",
    "firing_count": 1,
    "pending_count": 0,
    "firing_alerts": [
      {
        "labels": {"alertname": "KubePodCrashLooping", "pod": "app-xyz", "severity": "warning"},
        "annotations": {"summary": "Pod is crash looping."},
        "state": "firing",
        "activeAt": "2024-01-26T10:00:00Z"
      }
    ]
  }
}
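
When many alerts fire at once, a short summary is often more useful than the raw list. The sketch below assumes the body layout shown above (a firing_alerts list whose items carry Prometheus-style labels). It counts alerts from the list itself, not from the firing_count field. The function name and the second sample alert are hypothetical.

```python
from collections import Counter


def summarize_alerts(body: dict) -> dict:
    """Group firing alerts by severity and collect the distinct alert names."""
    alerts = body.get("firing_alerts", [])
    by_severity = Counter(
        alert.get("labels", {}).get("severity", "unknown") for alert in alerts
    )
    names = sorted(
        {alert["labels"]["alertname"] for alert in alerts
         if "alertname" in alert.get("labels", {})}
    )
    return {"total": len(alerts), "by_severity": dict(by_severity), "alertnames": names}


# The first alert mirrors the response above; the second is made up.
SAMPLE_BODY = {
    "status": "success",
    "firing_alerts": [
        {"labels": {"alertname": "KubePodCrashLooping", "pod": "app-xyz", "severity": "warning"},
         "annotations": {"summary": "Pod is crash looping."},
         "state": "firing", "activeAt": "2024-01-26T10:00:00Z"},
        {"labels": {"alertname": "TargetDown", "severity": "critical"},
         "annotations": {}, "state": "firing", "activeAt": "2024-01-26T10:05:00Z"},
    ],
}

if __name__ == "__main__":
    print(summarize_alerts(SAMPLE_BODY))
```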

CloudWatch Tool (cloudwatch_tool)

| Action | Parameters | Description |
|---|---|---|
| get_alarms | state | Get CloudWatch alarms |
| get_metrics | namespace, metric_name, period | Get metric statistics |
| query_logs | log_group, query, hours | CloudWatch Logs Insights |

Example:

aws lambda invoke --function-name eks-sre-agent-CloudWatchLambda \
  --payload '{"tool_name": "get_alarms", "state": "ALARM"}' \
  --cli-binary-format raw-in-base64-out output.json
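
Hand-written payloads are easy to mistype. As a sketch, the action and parameter names from the three tool tables can be transcribed into a lookup that rejects unknown actions and unexpected parameters before anything reaches a Lambda. This README does not say which parameters are required, so the helper does not check for missing ones. The names TOOLS and build_payload are illustrative and not part of the repository.

```python
import json

# Action -> allowed parameter names, transcribed from the tables in this README.
TOOLS = {
    "kubernetes_sre_tool": {
        "get_pods": {"namespace"},
        "describe_pod": {"namespace", "pod_name"},
        "get_pod_logs": {"namespace", "pod_name", "previous"},
        "get_events": {"namespace"},
        "restart_deployment": {"namespace", "deployment_name"},
        "scale_deployment": {"namespace", "deployment_name", "replicas"},
    },
    "prometheus_tool": {
        "query": {"query"},
        "query_range": {"query", "start", "end", "step"},
        "get_alerts": set(),
        "get_error_rate": {"job", "window"},
        "get_latency_percentiles": {"job", "window"},
        "get_throughput": {"job", "window"},
        "get_saturation": {"resource", "namespace"},
        "analyze_service": {"job", "window"},
        "calculate_sli": {"job", "slo_target", "window"},
        "check_error_budget": {"job", "slo_target", "window"},
    },
    "cloudwatch_tool": {
        "get_alarms": {"state"},
        "get_metrics": {"namespace", "metric_name", "period"},
        "query_logs": {"log_group", "query", "hours"},
    },
}


def build_payload(tool: str, action: str, **params) -> str:
    """Return a JSON payload string, rejecting unknown actions and parameters."""
    actions = TOOLS.get(tool)
    if actions is None:
        raise ValueError(f"unknown tool: {tool}")
    allowed = actions.get(action)
    if allowed is None:
        raise ValueError(f"unknown action for {tool}: {action}")
    unexpected = sorted(set(params) - allowed)
    if unexpected:
        raise ValueError(f"unexpected parameters for {action}: {unexpected}")
    # The examples in this README pass the action under the "tool_name" key.
    return json.dumps({"tool_name": action, **params})


if __name__ == "__main__":
    print(build_payload("cloudwatch_tool", "get_alarms", state="ALARM"))
```

The returned string can be passed as the --payload argument of aws lambda invoke.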

Deployment

Prerequisites

  • AWS CLI configured with appropriate permissions
  • Terraform >= 1.2
  • Docker
  • EKS cluster with:
    • Prometheus (kube-prometheus-stack) exposed via NodePort
    • IAM permissions for Lambda to access EKS API

Step 1: Configure Variables

cd terraform
cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars:

app_name         = "eks-sre-agent"
eks_cluster_name = "my-eks-cluster"
prometheus_url   = "http://10.0.1.100:32575"  # NodePort IP:Port

# Required if Prometheus is in a private subnet
prometheus_vpc_config = {
  subnet_ids         = ["subnet-abc123", "subnet-def456"]
  security_group_ids = ["sg-xyz789"]
}

Step 2: Deploy Infrastructure

terraform init
terraform plan
terraform apply

This creates:

  • ECR repository + Docker image
  • 3 Lambda functions (K8s, Prometheus, CloudWatch)
  • AgentCore Gateway with Cognito JWT auth
  • AgentCore Runtime
  • IAM roles and policies
  • Cognito User Pool for authentication

Step 3: Build Kubernetes Lambda (if needed)

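The Kubernetes Lambda requires the kubernetes Python package, which is not included in the Lambda runtime by default. If the deployed function fails to import it, build the package and update the function: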

cd mcp/lambda
chmod +x build_lambda.sh
./build_lambda.sh

# Update Lambda
aws lambda update-function-code \
  --function-name eks-sre-agent-McpLambda \
  --zip-file fileb://build/lambda.zip \
  --region eu-west-1

Usage

Using AgentCore CLI

# Install agentcore CLI
pip install bedrock-agentcore-cli

# Invoke the agent
agentcore invoke '{"prompt": "What pods are crashing in demo namespace?"}'

agentcore invoke '{"prompt": "Show me Prometheus alerts"}'

agentcore invoke '{"prompt": "Analyze the health of kube-state-metrics service"}'

Using AWS Console

  1. Go to Bedrock AgentCore → Runtimes
  2. Select eks-sre-agent_Agent
  3. Go to Test Console
  4. Enter payload:
{"prompt": "Scan demo namespace for problems"}

Available Entrypoints

| Entrypoint | Payload | Description |
|---|---|---|
| invoke | {"prompt": "..."} | General SRE queries |
| investigate | {"pod_name": "x", "namespace": "y"} | Deep dive into pod |
| heal | {"pod_name": "x", "namespace": "y", "dry_run": true} | Fix issues |
| scan | {"namespace": "demo"} | Scan for problems |
| analyze | {"job": "nginx", "window": "5m"} | Golden Signals |
| slo_status | {"job": "nginx", "slo_target": 99.9} | SLI/SLO check |
| alerts | {"state": "firing"} | Get all alerts |
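
The dry_run flag on heal suggests a preview-then-apply workflow. The sketch below shows one way a caller might wire that up. It assumes that dry_run: true only reports the planned fix and that dry_run: false applies it, which this README does not state explicitly. The invoke and approve callables are hypothetical stand-ins for whatever client sends the payload to the heal entrypoint and whatever review step you use.

```python
from typing import Callable


def heal_with_preview(
    invoke: Callable[[dict], dict],
    pod_name: str,
    namespace: str,
    approve: Callable[[dict], bool],
) -> dict:
    """Run heal as a dry run first; apply it only if the preview is approved."""
    base = {"pod_name": pod_name, "namespace": namespace}
    preview = invoke({**base, "dry_run": True})
    if not approve(preview):
        return {"applied": False, "preview": preview, "result": None}
    result = invoke({**base, "dry_run": False})
    return {"applied": True, "preview": preview, "result": result}


if __name__ == "__main__":
    sent = []

    def fake_invoke(payload: dict) -> dict:
        # Stand-in for a real call to the heal entrypoint; records what was sent.
        sent.append(payload)
        return {"dry_run": payload["dry_run"], "plan": "restart deployment"}

    outcome = heal_with_preview(fake_invoke, "x", "y", approve=lambda preview: True)
    print(outcome["applied"], sent)
```

Keeping the approval step outside the function makes it easy to swap a human prompt for an automated policy check.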

Example Prompts

"What pods are in CrashLoopBackOff in the demo namespace?"
"Show me the logs from the previous crash of pod nginx-abc123"
"What's the error rate for the api-gateway service in the last hour?"
"Is my SLO of 99.9% being met for the payment service?"
"Restart the nginx deployment in production namespace"
"What Prometheus alerts are currently firing?"

Environment Variables

The agent container uses these environment variables (set by Terraform):

| Variable | Description |
|---|---|
| AWS_REGION | AWS region |
| GATEWAY_URL | AgentCore Gateway URL |
| COGNITO_CLIENT_ID | Cognito app client ID |
| COGNITO_CLIENT_SECRET | Cognito app client secret |
| COGNITO_TOKEN_URL | Cognito OAuth token endpoint |
| COGNITO_SCOPE | OAuth scope |
| MEMORY_ID | AgentCore Memory ID (optional) |
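
The sketch below shows how these variables might fit together. It is not the code in src/mcp_client/client.py. It assumes the client uses the OAuth 2.0 client-credentials grant, which the presence of a client secret, token URL, and scope suggests but this README does not confirm. It validates the environment and builds (without sending) the token request. AgentConfig, load_config, and build_token_request are illustrative names.

```python
import base64
import urllib.parse
import urllib.request
from dataclasses import dataclass
from typing import Mapping, Optional

REQUIRED = (
    "AWS_REGION", "GATEWAY_URL", "COGNITO_CLIENT_ID",
    "COGNITO_CLIENT_SECRET", "COGNITO_TOKEN_URL", "COGNITO_SCOPE",
)


@dataclass(frozen=True)
class AgentConfig:
    aws_region: str
    gateway_url: str
    cognito_client_id: str
    cognito_client_secret: str
    cognito_token_url: str
    cognito_scope: str
    memory_id: Optional[str] = None  # MEMORY_ID is optional per the table above


def load_config(env: Mapping[str, str]) -> AgentConfig:
    """Build an AgentConfig from an environment mapping such as os.environ."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("missing environment variables: " + ", ".join(missing))
    values = {name.lower(): env[name] for name in REQUIRED}
    return AgentConfig(memory_id=env.get("MEMORY_ID") or None, **values)


def build_token_request(cfg: AgentConfig) -> urllib.request.Request:
    """Prepare a client-credentials token request (assumed grant type)."""
    credentials = f"{cfg.cognito_client_id}:{cfg.cognito_client_secret}".encode()
    form = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": cfg.cognito_scope}
    ).encode()
    return urllib.request.Request(
        cfg.cognito_token_url,
        data=form,
        method="POST",
        headers={
            "Authorization": "Basic " + base64.b64encode(credentials).decode(),
            "Content-Type": "application/x-www-form-urlencoded",
        },
    )
```

Passing the prepared request to urllib.request.urlopen would perform the call. The JSON response of such a grant carries an access_token, which would be sent to GATEWAY_URL as a bearer token.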

Lambda functions use:

| Variable | Description |
|---|---|
| EKS_CLUSTER_NAME | Target EKS cluster name |
| PROMETHEUS_URL | Prometheus server URL |

Troubleshooting

Lambda can't connect to EKS

  • Ensure Lambda is in VPC with access to EKS API
  • Check security group allows outbound to EKS
  • Verify IAM role has eks:DescribeCluster permission

Lambda can't connect to Prometheus

  • Expose Prometheus via NodePort: kubectl patch svc prometheus -n monitoring -p '{"spec":{"type":"NodePort"}}'
  • Get NodePort: kubectl get svc prometheus -n monitoring
  • Use node's private IP + NodePort in prometheus_url
  • Ensure security group allows traffic on NodePort
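
A quick sanity check on prometheus_url can catch the most common mistakes before running terraform apply. The sketch below only inspects the URL string; it does not test connectivity. It assumes the cluster uses the Kubernetes default NodePort range (30000-32767), which can be changed per cluster. The function name is illustrative.

```python
import ipaddress
from urllib.parse import urlsplit

# Kubernetes default NodePort range; clusters can override it.
NODEPORT_RANGE = range(30000, 32768)


def check_prometheus_url(url: str) -> list:
    """Return a list of problems found in a NodePort-style Prometheus URL."""
    problems = []
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        problems.append(f"unexpected scheme {parts.scheme!r}")
    host = parts.hostname or ""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        problems.append(f"host {host!r} is not an IP address")
    else:
        # is_private also accepts loopback and link-local addresses.
        if not address.is_private:
            problems.append(f"{address} is not a private address")
    try:
        port = parts.port
    except ValueError:
        port = None
    if port is None:
        problems.append("no valid port in URL")
    elif port not in NODEPORT_RANGE:
        problems.append(f"port {port} is outside the default NodePort range 30000-32767")
    return problems


if __name__ == "__main__":
    for candidate in ("http://10.0.1.100:32575", "http://prometheus.monitoring.svc:9090"):
        print(candidate, check_prometheus_url(candidate) or "looks OK")
```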

"No module named kubernetes" error

  • Run build_lambda.sh to include dependencies
  • Update Lambda with the new zip file

License

MIT
