AI-powered SRE agent for Amazon EKS clusters. Built on AWS Bedrock AgentCore with MCP (Model Context Protocol) tools.
- Kubernetes Operations: Get pods, describe pods, view logs, get events, restart deployments
- Prometheus Integration: Query metrics, get alerts, analyze service health (Golden Signals, SLI/SLO)
- CloudWatch Integration: Get alarms, query metrics, search logs
┌─────────────────────────────────────────┐
│      AWS Bedrock AgentCore Runtime      │
│  ┌───────────────────────────────────┐  │
│  │  Docker Container                 │  │
│  │  src/main.py (Strands Agent)      │  │
│  └───────────────────────────────────┘  │
└─────────────────────────────────────────┘
                    │
                    ▼ MCP Protocol
┌─────────────────────────────────────────┐
│      AgentCore Gateway (JWT Auth)       │
└─────────────────────────────────────────┘
       │            │            │
       ▼            ▼            ▼
   ┌────────┐   ┌────────┐   ┌────────┐
   │ K8s    │   │ Prom   │   │ CW     │
   │ Lambda │   │ Lambda │   │ Lambda │
   └────────┘   └────────┘   └────────┘
       │            │            │
       ▼            ▼            ▼
    EKS API    Prometheus   CloudWatch
eks-sre-agent/
├── src/
│   ├── main.py                  # Agent entrypoints
│   ├── mcp_client/client.py     # Gateway client with Cognito auth
│   └── model/load.py            # Bedrock model loader
├── mcp/
│   ├── lambda/                  # Kubernetes MCP tools
│   │   ├── handler.py
│   │   ├── requirements.txt
│   │   └── build_lambda.sh
│   ├── prometheus-lambda/       # Prometheus MCP tools
│   │   ├── handler.py
│   │   └── requirements.txt
│   └── cloudwatch-lambda/       # CloudWatch MCP tools
│       ├── handler.py
│       └── requirements.txt
├── terraform/
│   ├── main.tf
│   ├── bedrock_agentcore.tf     # All AWS resources
│   ├── variables.tf
│   └── terraform.tfvars.example
├── Dockerfile
└── pyproject.toml
| Action | Parameters | Description |
|---|---|---|
| get_pods | namespace | List all pods with status |
| describe_pod | namespace, pod_name | Detailed pod info |
| get_pod_logs | namespace, pod_name, previous | Container logs |
| get_events | namespace | Kubernetes events |
| restart_deployment | namespace, deployment_name | Rolling restart |
| scale_deployment | namespace, deployment_name, replicas | Scale replicas |
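
The table above can be turned into a small payload builder that rejects parameters a tool does not list. This is a minimal sketch, not the project's code: `K8S_TOOL_PARAMS` and `build_k8s_payload` are hypothetical names, and the table does not say which parameters are required, so the helper only checks for unknown ones.

```python
import json

# Parameters each Kubernetes tool accepts, copied from the table above.
K8S_TOOL_PARAMS = {
    "get_pods": {"namespace"},
    "describe_pod": {"namespace", "pod_name"},
    "get_pod_logs": {"namespace", "pod_name", "previous"},
    "get_events": {"namespace"},
    "restart_deployment": {"namespace", "deployment_name"},
    "scale_deployment": {"namespace", "deployment_name", "replicas"},
}


def build_k8s_payload(tool_name: str, **params) -> str:
    """Return the JSON payload string for one tool call (hypothetical helper)."""
    if tool_name not in K8S_TOOL_PARAMS:
        raise ValueError(f"unknown tool: {tool_name}")
    # Reject anything the table does not list for this tool.
    unexpected = set(params) - K8S_TOOL_PARAMS[tool_name]
    if unexpected:
        raise ValueError(f"{tool_name} does not accept: {sorted(unexpected)}")
    return json.dumps({"tool_name": tool_name, **params})


print(build_k8s_payload("get_pods", namespace="demo"))
```

The returned string has the same shape as the `--payload` argument passed to `aws lambda invoke`.
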
Example Lambda invocation:
aws lambda invoke --function-name eks-sre-agent-McpLambda \
--payload '{"tool_name": "get_pods", "namespace": "demo"}' \
  --cli-binary-format raw-in-base64-out output.json

Response:
{
"statusCode": 200,
"body": {
"result": {
"pods": [
{
"name": "nginx-7d6877d777-abc12",
"namespace": "demo",
"phase": "Running",
"containers": [
{"name": "nginx", "ready": true, "restarts": 0, "state": "Running"}
],
"node": "ip-10-0-1-100.ec2.internal"
}
],
"count": 1
}
}
}

| Action | Parameters | Description |
|---|---|---|
| query | query | Execute PromQL instant query |
| query_range | query, start, end, step | Range query |
| get_alerts | - | Get firing/pending alerts |
| get_error_rate | job, window | Calculate error rate % |
| get_latency_percentiles | job, window | P50/P90/P99 latency |
| get_throughput | job, window | Requests per second |
| get_saturation | resource, namespace | CPU/Memory saturation |
| analyze_service | job, window | Full Golden Signals analysis |
| calculate_sli | job, slo_target, window | SLI calculation |
| check_error_budget | job, slo_target, window | Error budget status |
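
The arithmetic behind `calculate_sli` and `check_error_budget` can be illustrated with the usual request-based definitions. This README does not describe how the Lambda computes them, so the sketch below is an assumption: the function names and the request counts are hypothetical, and `slo_target` is taken to be a percentage such as 99.9.

```python
def sli_percent(total_requests: int, failed_requests: int) -> float:
    """Share of successful requests in the window, as a percentage."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    return 100.0 * (total_requests - failed_requests) / total_requests


def error_budget_remaining(slo_target: float, total_requests: int,
                           failed_requests: int) -> float:
    """Percentage of the error budget still unspent; negative means overspent."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    # The budget is the number of failures the SLO tolerates in this window.
    allowed_failures = total_requests * (100.0 - slo_target) / 100.0
    if allowed_failures <= 0:
        raise ValueError("an SLO target of 100 leaves no error budget")
    return 100.0 * (1.0 - failed_requests / allowed_failures)


# Hypothetical window: 100,000 requests, 50 of them failed, SLO target 99.9%.
print(f"SLI: {sli_percent(100_000, 50):.2f}%")
print(f"Budget remaining: {error_budget_remaining(99.9, 100_000, 50):.2f}%")
```

With a 99.9% target, about 100 of 100,000 requests may fail, so 50 failures consume roughly half the budget.
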
Example:
aws lambda invoke --function-name eks-sre-agent-PrometheusLambda \
--payload '{"tool_name": "get_alerts"}' \
  --cli-binary-format raw-in-base64-out output.json

Response:
{
"statusCode": 200,
"body": {
"status": "success",
"firing_count": 2,
"pending_count": 0,
"firing_alerts": [
{
"labels": {"alertname": "KubePodCrashLooping", "pod": "app-xyz", "severity": "warning"},
"annotations": {"summary": "Pod is crash looping."},
"state": "firing",
"activeAt": "2024-01-26T10:00:00Z"
}
]
}
}

| Action | Parameters | Description |
|---|---|---|
| get_alarms | state | Get CloudWatch alarms |
| get_metrics | namespace, metric_name, period | Get metric statistics |
| query_logs | log_group, query, hours | CloudWatch Logs Insights |
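
CloudWatch Logs Insights queries take their time range as epoch seconds, so a relative `hours` value has to be converted before the query starts. The sketch below shows one way to do that conversion; `logs_query_window` is a hypothetical helper, and how the Lambda actually interprets `hours` is an assumption.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional, Tuple


def logs_query_window(hours: float,
                      now: Optional[datetime] = None) -> Tuple[int, int]:
    """Return (start, end) in epoch seconds for the last `hours` hours."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    end = now or datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    return int(start.timestamp()), int(end.timestamp())


# Window covering the last hour, ending now.
start_time, end_time = logs_query_window(1)
print(start_time, end_time)
```

The two integers map onto the `startTime` and `endTime` arguments of the Logs Insights `StartQuery` API.
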
Example:
aws lambda invoke --function-name eks-sre-agent-CloudWatchLambda \
--payload '{"tool_name": "get_alarms", "state": "ALARM"}' \
  --cli-binary-format raw-in-base64-out output.json

- AWS CLI configured with appropriate permissions
- Terraform >= 1.2
- Docker
- EKS cluster with:
  - Prometheus (kube-prometheus-stack) exposed via NodePort
  - IAM permissions for Lambda to access EKS API
cd terraform
cp terraform.tfvars.example terraform.tfvars

Edit terraform.tfvars:
app_name = "eks-sre-agent"
eks_cluster_name = "my-eks-cluster"
prometheus_url = "http://10.0.1.100:32575" # NodePort IP:Port
# Required if Prometheus is in private subnet
prometheus_vpc_config = {
subnet_ids = ["subnet-abc123", "subnet-def456"]
security_group_ids = ["sg-xyz789"]
}

terraform init
terraform plan
terraform apply

This creates:
- ECR repository + Docker image
- 3 Lambda functions (K8s, Prometheus, CloudWatch)
- AgentCore Gateway with Cognito JWT auth
- AgentCore Runtime
- IAM roles and policies
- Cognito User Pool for authentication
The Kubernetes Lambda requires the kubernetes Python package, which isn't included by default, so build and upload the deployment package:
cd mcp/lambda
chmod +x build_lambda.sh
./build_lambda.sh
# Update Lambda
aws lambda update-function-code \
--function-name eks-sre-agent-McpLambda \
--zip-file fileb://build/lambda.zip \
  --region eu-west-1

# Install agentcore CLI
pip install bedrock-agentcore-cli
# Invoke the agent
agentcore invoke '{"prompt": "What pods are crashing in demo namespace?"}'
agentcore invoke '{"prompt": "Show me Prometheus alerts"}'
agentcore invoke '{"prompt": "Analyze the health of kube-state-metrics service"}'

- Go to Bedrock AgentCore → Runtimes
- Select eks-sre-agent_Agent
- Go to Test Console
- Enter payload:
{"prompt": "Scan demo namespace for problems"}

| Entrypoint | Payload | Description |
|---|---|---|
| invoke | {"prompt": "..."} | General SRE queries |
| investigate | {"pod_name": "x", "namespace": "y"} | Deep dive into pod |
| heal | {"pod_name": "x", "namespace": "y", "dry_run": true} | Fix issues |
| scan | {"namespace": "demo"} | Scan for problems |
| analyze | {"job": "nginx", "window": "5m"} | Golden Signals |
| slo_status | {"job": "nginx", "slo_target": 99.9} | SLI/SLO check |
| alerts | {"state": "firing"} | Get all alerts |
"What pods are in CrashLoopBackOff in the demo namespace?"
"Show me the logs from the previous crash of pod nginx-abc123"
"What's the error rate for the api-gateway service in the last hour?"
"Is my SLO of 99.9% being met for the payment service?"
"Restart the nginx deployment in production namespace"
"What Prometheus alerts are currently firing?"
The agent container uses these environment variables (set by Terraform):
| Variable | Description |
|---|---|
| AWS_REGION | AWS region |
| GATEWAY_URL | AgentCore Gateway URL |
| COGNITO_CLIENT_ID | Cognito app client ID |
| COGNITO_CLIENT_SECRET | Cognito app client secret |
| COGNITO_TOKEN_URL | Cognito OAuth token endpoint |
| COGNITO_SCOPE | OAuth scope |
| MEMORY_ID | AgentCore Memory ID (optional) |
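
The Cognito variables above are what an OAuth 2.0 client-credentials request needs. The sketch below builds such a request from them; it is an illustration under assumptions, not the code in `src/mcp_client/client.py`, and `build_token_request` is a hypothetical name. It only constructs the request and sends nothing.

```python
import base64
import urllib.parse
import urllib.request


def build_token_request(env: dict) -> urllib.request.Request:
    """Build a client-credentials token request from the variables above."""
    # The app client ID and secret travel as HTTP Basic credentials.
    credentials = f"{env['COGNITO_CLIENT_ID']}:{env['COGNITO_CLIENT_SECRET']}"
    basic = base64.b64encode(credentials.encode()).decode()
    # Grant type and scope go in a form-encoded body.
    form = urllib.parse.urlencode(
        {"grant_type": "client_credentials", "scope": env["COGNITO_SCOPE"]}
    ).encode()
    return urllib.request.Request(
        env["COGNITO_TOKEN_URL"],
        data=form,
        headers={
            "Authorization": f"Basic {basic}",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Inside the container this could be called as `build_token_request(os.environ)` and sent with `urllib.request.urlopen`; the `access_token` field of the JSON response would then serve as the bearer token for `GATEWAY_URL`.
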
Lambda functions use:
| Variable | Description |
|---|---|
| EKS_CLUSTER_NAME | Target EKS cluster name |
| PROMETHEUS_URL | Prometheus server URL |
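
`PROMETHEUS_URL` is a base address, so the instant and range queries listed among the Prometheus tools resolve to paths of the Prometheus HTTP API beneath it. The sketch below shows how those URLs are formed; the helper names are hypothetical, and whether the Lambda builds its requests this way is an assumption.

```python
from urllib.parse import urlencode


def instant_query_url(base_url: str, promql: str) -> str:
    """URL for a PromQL instant query (Prometheus /api/v1/query)."""
    params = urlencode({"query": promql})
    return f"{base_url.rstrip('/')}/api/v1/query?{params}"


def range_query_url(base_url: str, promql: str, start, end, step) -> str:
    """URL for a PromQL range query (Prometheus /api/v1/query_range)."""
    params = urlencode(
        {"query": promql, "start": start, "end": end, "step": step}
    )
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"


# Base URL taken from the terraform.tfvars example in this README.
print(instant_query_url("http://10.0.1.100:32575", "up"))
```
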
- Ensure Lambda is in VPC with access to EKS API
- Check security group allows outbound to EKS
- Verify IAM role has eks:DescribeCluster permission
- Expose Prometheus via NodePort: kubectl patch svc prometheus -n monitoring -p '{"spec":{"type":"NodePort"}}'
- Get NodePort: kubectl get svc prometheus -n monitoring
- Use node's private IP + NodePort in prometheus_url
- Ensure security group allows traffic on NodePort
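
A quick TCP check can confirm that the security group and NodePort steps took effect before any Prometheus queries are debugged. This is a minimal sketch with a hypothetical helper name; it only shows that the port accepts connections, not that Prometheus answers correctly.

```python
import socket


def port_is_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable networks.
        return False
```

For the example `prometheus_url` in this README the call would be `port_is_reachable("10.0.1.100", 32575)`. The result is only meaningful when run from inside the VPC, for example from the subnets the Lambda uses.
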
- Run build_lambda.sh to include dependencies
- Update Lambda with the new zip file
MIT