kumar-ks/Agentic_CrewAi_OpsAgents

AgenticAI with CrewAI - DevOps Incident Management

A multi-agent framework for autonomous DevOps incident management, built on CrewAI. The Kubernetes and AWS tools run real kubectl and boto3 calls; the Dynatrace, Terraform/Pulumi, and notification tools are still placeholders. The framework also includes error handling with retries, metrics tracking, and self-healing agents.

🚀 Quick Start

cd AgenticAI
pip install -r requirements.txt
cp .env.example .env
# Add your OPENAI_API_KEY to .env

# Optional: Configure for real integrations
kubectl version  # Verify kubectl is installed
aws configure    # Configure AWS credentials

# Run examples
python devops_examples/real_world_examples.py

✨ What's New - Real Integrations

Real Tools (Not Placeholders!)

  • Kubernetes: Actual kubectl commands
  • AWS: Real boto3 SDK integration
  • Error Handling: Retry logic with exponential backoff
  • Metrics: MTTR, MTTD, SLA tracking
  • Monitoring: Health checks and performance tracking

Production Features

  • ✅ Comprehensive error handling
  • ✅ Structured logging
  • ✅ Metrics collection and export
  • ✅ SLA breach detection
  • ✅ Graceful degradation
  • ✅ Input validation
  • ✅ Timeout handling

🎯 What's Included

15 Specialized Agents

Detection & Triage (2 agents)

  • Incident Detector - Monitors all systems
  • Triage Specialist - Prioritizes incidents

Platform Specialists (4 agents)

  • Kubernetes Expert - Troubleshoots K8s
  • AWS Fargate Expert - Handles ECS/Fargate
  • AWS Lambda Expert - Debugs serverless
  • Container Security Expert - Scans images

Self-Healing (3 agents)

  • Self-Healing Agent - Auto-fixes issues
  • Emergency Responder - Fast incident response
  • Preventive Maintenance - Proactive fixes

Support & Analysis (6 agents)

  • Observability Expert - Analyzes Dynatrace data (placeholder integration)
  • IaC Expert - Manages Terraform/Pulumi
  • Incident Commander - Coordinates response
  • Communication Specialist - Updates stakeholders
  • Postmortem Analyst - Creates reports
  • Diagnostic Agent - Recommends fixes

30+ DevOps Tools

Real Kubernetes Tools (devops_tools/real_k8s_tools.py)

  • check_k8s_pod_status - Real kubectl get pods
  • get_k8s_logs - Real kubectl logs
  • get_k8s_events - Real kubectl get events
  • scale_k8s_deployment - Real kubectl scale
  • restart_pod - Real kubectl delete pod
  • restart_deployment - Real kubectl rollout restart
  • adjust_resource_limits - Real kubectl set resources
  • update_container_image - Real kubectl set image
  • rollback_deployment - Real kubectl rollout undo
  • create_horizontal_pod_autoscaler - Real kubectl autoscale
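
As a rough illustration of how a tool like check_k8s_pod_status could wrap kubectl, here is a minimal sketch. It is not the code in real_k8s_tools.py; the helper names build_get_pods_command and run_command are hypothetical. It assumes the tool shells out through subprocess, applies a timeout, and returns a structured result instead of raising when kubectl is missing.

```python
import shutil
import subprocess
from typing import List, Optional


def build_get_pods_command(namespace: str, pod: Optional[str] = None) -> List[str]:
    """Build the argument list for `kubectl get pods` (hypothetical helper)."""
    cmd = ["kubectl", "get", "pods"]
    if pod:
        cmd.append(pod)
    cmd += ["-n", namespace]
    return cmd


def run_command(cmd: List[str], timeout: int = 30) -> dict:
    """Run a command without a shell and return a structured result."""
    # Degrade gracefully when the binary is not installed.
    if shutil.which(cmd[0]) is None:
        return {"ok": False, "output": "", "error": f"{cmd[0]} not found on PATH"}
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except subprocess.TimeoutExpired:
        return {"ok": False, "output": "", "error": f"timed out after {timeout}s"}
    return {
        "ok": proc.returncode == 0,
        "output": proc.stdout.strip(),
        "error": proc.stderr.strip(),
    }
```

Passing the command as a list (no shell=True) keeps pod and namespace names from being interpreted by a shell, which matters when an LLM agent supplies the arguments.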

Real AWS Tools (devops_tools/real_aws_tools.py)

  • check_lambda_status - Real boto3 Lambda.get_function
  • analyze_lambda_errors - Real CloudWatch Logs Insights
  • invoke_lambda - Real boto3 Lambda.invoke
  • check_fargate_task_status - Real boto3 ECS.describe_tasks
  • scan_ecr_image - Real boto3 ECR.describe_image_scan_findings
  • list_ecr_images - Real boto3 ECR.describe_images
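
For orientation, a tool such as check_lambda_status might reduce the Lambda.get_function response to a short health summary. The sketch below is an assumption about the shape of such a tool, not the code in real_aws_tools.py; summarize_lambda_status and make_lambda_client are hypothetical names. The client is passed in so the logic can be exercised without AWS credentials.

```python
from typing import Any, Dict


def summarize_lambda_status(lambda_client: Any, function_name: str) -> Dict[str, Any]:
    """Return a small health summary built from Lambda.get_function."""
    response = lambda_client.get_function(FunctionName=function_name)
    config = response["Configuration"]
    state = config.get("State", "Unknown")
    last_update = config.get("LastUpdateStatus", "Unknown")
    return {
        "function": config.get("FunctionName", function_name),
        "state": state,
        "last_update_status": last_update,
        "runtime": config.get("Runtime"),
        "memory_mb": config.get("MemorySize"),
        "timeout_s": config.get("Timeout"),
        # Treat the function as healthy only if it is active and the
        # most recent configuration update succeeded.
        "healthy": state == "Active" and last_update == "Successful",
    }


def make_lambda_client(region: str = "us-east-1"):
    """Create a real client; boto3 is imported lazily so the sketch loads without it."""
    import boto3

    return boto3.client("lambda", region_name=region)
```

With credentials configured, `summarize_lambda_status(make_lambda_client(), "api-function")` would query the dummy function name used elsewhere in this README.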

Placeholder Tools (devops_tools/devops_tools.py)

  • Dynatrace integration (placeholder)
  • Terraform/Pulumi tools (placeholder)
  • Notification tools (placeholder)

Utilities (utils/)

  • Error handling with retry logic
  • Metrics collection (MTTR, MTTD, SLA)
  • Health monitoring
  • Structured logging
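
Retry logic with exponential backoff can be sketched as a decorator. This is a generic illustration, not the implementation in utils/error_handling.py; the name retry_with_backoff and its parameters are assumptions.

```python
import functools
import time


def retry_with_backoff(max_attempts=3, base_delay=1.0, factor=2.0,
                       exceptions=(Exception,), sleep=time.sleep):
    """Retry the wrapped call, multiplying the wait by `factor` after each failure."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    # Out of attempts: surface the original error.
                    if attempt == max_attempts:
                        raise
                    sleep(delay)
                    delay *= factor
        return wrapper
    return decorator
```

The sleep parameter is injectable so tests can record the delays instead of waiting; in practice it is common to add random jitter to the delay as well.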

Tech Stack Coverage

  • ✅ Kubernetes (Real)
  • ✅ AWS Lambda (Real)
  • ✅ AWS Fargate/ECS (Real)
  • ✅ AWS ECR (Real)
  • ⏳ Dynatrace (Placeholder)
  • ⏳ Terraform (Placeholder)
  • ⏳ Pulumi (Placeholder)

📚 Documentation

Complete documentation lives in docs/; runnable workflows are in devops_examples/.

🏗️ Project Structure

AgenticAI/
├── agents/              # Base agent framework
├── devops_agents/       # 15 specialized DevOps agents
│   ├── incident_agents.py
│   └── self_healing_agent.py
├── tools/               # Base tools
├── devops_tools/        # DevOps-specific tools
│   ├── devops_tools.py         # Placeholder tools
│   ├── real_k8s_tools.py       # Real kubectl integration ✅
│   └── real_aws_tools.py       # Real boto3 integration ✅
├── utils/               # Utilities
│   ├── error_handling.py       # Retry logic, validation ✅
│   ├── metrics.py              # MTTR, MTTD, SLA tracking ✅
│   └── logger.py
├── memory/              # Memory systems
├── prompts/             # Prompt templates
├── config/              # Configuration
├── examples/            # General examples
├── devops_examples/     # DevOps workflows
│   ├── incident_response_examples.py
│   └── real_world_examples.py  # Real tool examples ✅
├── docs/                # Complete documentation
├── tests/               # Tests
├── data/                # Data storage
└── logs/                # Application logs

💡 Example: Real Kubernetes Self-Healing

from devops_agents.self_healing_agent import SelfHealingAgentFactory
from agents.crew_task import CrewTask
from agents.crew_manager import CrewManager
from utils.metrics import track_incident, resolve_incident

# Track incident
track_incident("INC-001", "P1")

# Create self-healing agent
factory = SelfHealingAgentFactory()
healer = factory.create_self_healing_k8s_agent()

# Define problem (uses real kubectl commands)
task = CrewTask(
    description="Pod 'api-pod-12345' in 'production' is crashing. Fix it.",
    agent=healer.get_agent(),
    expected_output="Issue fixed and pod running"
)

# Execute - agent will:
# 1. Run: kubectl get pods -n production api-pod-12345
# 2. Run: kubectl logs -n production api-pod-12345
# 3. Run: kubectl get events -n production
# 4. Diagnose: OOMKilled
# 5. Run: kubectl set resources deployment/api-deployment -n production --limits=memory=1Gi
# 6. Verify: kubectl get pods -n production

crew = CrewManager(
    agents=[healer.get_agent()],
    tasks=[task.get_task()]
)

result = crew.kickoff()
resolve_incident("INC-001")

print(result)  # e.g. "Fixed OOMKilled by increasing memory to 1Gi" (LLM output varies)

📊 Metrics & Monitoring

from utils.metrics import get_metrics_summary, export_metrics

# Get metrics
summary = get_metrics_summary()
print(f"MTTR: {summary['mttr_seconds']}s")
print(f"MTTD: {summary['mttd_seconds']}s")
print(f"Success Rate: {summary['agent_success_rate']}%")

# Export to file
export_metrics("logs/metrics_report.json")
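
To make the metric names concrete, here is a minimal sketch of how MTTD, MTTR, and SLA breaches could be derived from incident timestamps. It is not the code in utils/metrics.py. It assumes MTTD is measured from incident start to detection and MTTR from detection to resolution; the IncidentRecord type and the SLA_SECONDS targets are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List

# Hypothetical resolution targets per severity, in seconds.
SLA_SECONDS: Dict[str, float] = {"P1": 3600.0, "P2": 14400.0}


@dataclass
class IncidentRecord:
    incident_id: str
    severity: str
    started_at: float   # epoch seconds when the fault began
    detected_at: float  # epoch seconds when it was detected
    resolved_at: float  # epoch seconds when it was resolved


def summarize(incidents: List[IncidentRecord]) -> dict:
    """Compute mean time to detect/resolve and list incidents over their SLA."""
    if not incidents:
        return {"mttd_seconds": 0.0, "mttr_seconds": 0.0, "sla_breaches": []}
    mttd = mean(i.detected_at - i.started_at for i in incidents)
    mttr = mean(i.resolved_at - i.detected_at for i in incidents)
    breaches = [
        i.incident_id
        for i in incidents
        # Severities without a target never count as a breach.
        if i.resolved_at - i.detected_at > SLA_SECONDS.get(i.severity, float("inf"))
    ]
    return {"mttd_seconds": mttd, "mttr_seconds": mttr, "sla_breaches": breaches}
```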

🎯 Use Cases

  1. Kubernetes Pod Crashes - Auto-detect and fix with real kubectl
  2. Lambda High Error Rate - Analyze with real CloudWatch Logs
  3. Production Outages - Full incident response workflow
  4. Infrastructure Drift - Detect and remediate
  5. Container Security - Scan ECR images with real boto3
  6. Performance Issues - Analyze metrics
  7. Post-Incident Analysis - Generate reports with metrics

🔧 Configuration

Environment Variables

# Required
OPENAI_API_KEY=sk-your-key-here

# Optional - for real AWS integrations
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key

# Optional - for Dynatrace
DYNATRACE_TENANT=your_tenant_id
DYNATRACE_API_TOKEN=your_api_token
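
A small loader can make the required/optional split explicit. The sketch below is an assumption about how these variables might be consumed, not the project's config code; Settings and load_settings are hypothetical names, and the us-east-1 fallback simply mirrors the example value above.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional


@dataclass(frozen=True)
class Settings:
    openai_api_key: str
    aws_region: str
    dynatrace_tenant: Optional[str]
    dynatrace_api_token: Optional[str]

    @property
    def dynatrace_enabled(self) -> bool:
        # Dynatrace needs both the tenant and the token.
        return bool(self.dynatrace_tenant and self.dynatrace_api_token)


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from the environment, failing fast on the required key."""
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("OPENAI_API_KEY is required")
    return Settings(
        openai_api_key=key,
        aws_region=env.get("AWS_REGION", "us-east-1"),
        dynatrace_tenant=env.get("DYNATRACE_TENANT") or None,
        dynatrace_api_token=env.get("DYNATRACE_API_TOKEN") or None,
    )
```

The AWS access key variables are left out on purpose: boto3 picks them up from the environment by itself.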

Dummy Data (Safe for Testing)

# Default dummy values in real_k8s_tools.py
DUMMY_NAMESPACE = "production"
DUMMY_POD = "api-pod-12345"
DUMMY_DEPLOYMENT = "api-deployment"

# Default dummy values in real_aws_tools.py
DUMMY_FUNCTION = "api-function"
DUMMY_CLUSTER = "prod-cluster"
DUMMY_REPOSITORY = "my-app"

# Replace with your actual resource names

📦 Installation

# Navigate to directory
cd AgenticAI

# Install dependencies
pip install -r requirements.txt

# Install kubectl (for K8s tools)
# macOS: brew install kubectl
# Linux: snap install kubectl --classic
# Windows: choco install kubernetes-cli

# Configure AWS (for AWS tools)
aws configure

# Configure environment
cp .env.example .env
# Edit .env and add OPENAI_API_KEY

# Run examples
python devops_examples/real_world_examples.py

🔑 Key Features

  • Real Integrations - Actual kubectl and boto3 commands
  • Error Handling - Retry logic with exponential backoff
  • Metrics Tracking - MTTR, MTTD, SLA compliance
  • Self-Healing - Automatic issue remediation
  • Multi-Platform - K8s and AWS (real); Dynatrace and IaC (placeholder)
  • Production-Ready - Logging, validation, timeouts
  • Graceful Degradation - Works without kubectl/AWS
  • Extensible - Easy to add agents and tools

⚠️ Important Notes

Real vs Placeholder Tools

Real (Production-Ready)

  • ✅ Kubernetes tools - Uses actual kubectl
  • ✅ AWS Lambda tools - Uses actual boto3
  • ✅ AWS Fargate tools - Uses actual boto3
  • ✅ AWS ECR tools - Uses actual boto3

Placeholder (Need Implementation)

  • ⏳ Dynatrace tools - API integration needed
  • ⏳ Terraform tools - CLI execution needed
  • ⏳ Pulumi tools - CLI execution needed
  • ⏳ Notification tools - Webhook integration needed

Safety Features

  • Uses dummy data by default
  • Validates inputs before execution
  • Logs all actions for audit trail
  • Handles errors gracefully
  • Supports dry-run mode
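
Input validation and dry-run mode can be combined in one guard around command execution. The following is an illustrative sketch with hypothetical helper names, not the project's implementation. For simplicity it applies the RFC 1123 label rule (lowercase alphanumerics and hyphens, at most 63 characters) to both namespace and deployment names, which is stricter than Kubernetes requires for some resource names.

```python
import re
import shlex
import subprocess
from typing import List

_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def validate_k8s_name(name: str) -> str:
    """Reject names that are not valid RFC 1123 labels."""
    if not isinstance(name, str) or len(name) > 63 or not _DNS_LABEL.fullmatch(name):
        raise ValueError(f"invalid Kubernetes name: {name!r}")
    return name


def scale_deployment_command(namespace: str, deployment: str, replicas: int) -> List[str]:
    """Build a validated `kubectl scale` argument list."""
    validate_k8s_name(namespace)
    validate_k8s_name(deployment)
    if isinstance(replicas, bool) or not isinstance(replicas, int) or replicas < 0:
        raise ValueError("replicas must be a non-negative integer")
    return ["kubectl", "scale", f"deployment/{deployment}",
            f"--replicas={replicas}", "-n", namespace]


def execute(cmd: List[str], dry_run: bool = True) -> str:
    """Return the command text in dry-run mode; otherwise run it and return stdout."""
    if dry_run:
        return "DRY RUN: " + shlex.join(cmd)
    completed = subprocess.run(cmd, capture_output=True, text=True, timeout=30)
    return completed.stdout
```

Defaulting dry_run to True means a caller has to opt in before anything touches the cluster.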

🤝 Contributing

Contributions welcome! Priority areas:

  • Implement Dynatrace API integration
  • Add Terraform/Pulumi CLI execution
  • Create notification integrations (Slack, PagerDuty)
  • Add more test coverage
  • Improve error handling

📄 License

MIT License

🆘 Support

Check documentation in docs/ or review examples in devops_examples/.


Built with CrewAI - Production-ready multi-agent DevOps automation
