AgenticAI with CrewAI - DevOps Incident Management

A production-ready multi-agent framework for autonomous DevOps incident management, built on CrewAI. The Kubernetes and AWS tools run real kubectl commands and boto3 calls, backed by error handling with retries, metrics tracking, and self-healing agents.

🚀 Quick Start

cd AgenticAI
pip install -r requirements.txt
cp .env.example .env
# Add your OPENAI_API_KEY to .env

# Optional: Configure for real integrations
kubectl version --client  # Verify kubectl is installed (no cluster connection needed)
aws configure    # Configure AWS credentials

# Run examples
python devops_examples/real_world_examples.py

✨ What's New - Real Integrations

Real Tools (Not Placeholders!)

✅ Kubernetes: Actual kubectl commands
✅ AWS: Real boto3 SDK integration
✅ Error Handling: Retry logic with exponential backoff
✅ Metrics: MTTR, MTTD, SLA tracking
✅ Monitoring: Health checks and performance tracking

Production Features

✅ Comprehensive error handling
✅ Structured logging
✅ Metrics collection and export
✅ SLA breach detection
✅ Graceful degradation
✅ Input validation
✅ Timeout handling
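
The retry behaviour listed above can be illustrated with a small decorator. This is a minimal sketch, not the code in utils/error_handling.py: the name retry_with_backoff and all of its parameters are assumptions made for illustration, and jitter is left out to keep the delays easy to follow.

```python
import functools
import time


def retry_with_backoff(max_attempts=3, base_delay=0.5, max_delay=8.0,
                       retry_on=(Exception,), sleep=time.sleep):
    """Retry the wrapped callable, doubling the wait after each failure.

    Hypothetical helper for illustration; `sleep` is injectable so the
    delays can be observed in tests without actually waiting.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except retry_on:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # base_delay, 2x, 4x, ... capped at max_delay
                    delay = min(base_delay * 2 ** (attempt - 1), max_delay)
                    sleep(delay)
        return wrapper
    return decorator


if __name__ == "__main__":
    calls = {"count": 0}

    @retry_with_backoff(max_attempts=4, base_delay=0.01)
    def flaky_call():
        # Fails twice with a transient error, then succeeds.
        calls["count"] += 1
        if calls["count"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    print(flaky_call(), "after", calls["count"], "attempts")
```

Capping the delay keeps a long outage from producing very long waits, and restricting retry_on to transient error types avoids retrying failures that will never succeed, such as invalid input.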

🎯 What's Included

15 Specialized Agents

Detection & Triage (2 agents)

Incident Detector - Monitors all systems
Triage Specialist - Prioritizes incidents

Platform Specialists (4 agents)

Kubernetes Expert - Troubleshoots K8s
AWS Fargate Expert - Handles ECS/Fargate
AWS Lambda Expert - Debugs serverless
Container Security Expert - Scans images

Self-Healing (3 agents)

Self-Healing Agent - Auto-fixes issues
Emergency Responder - Fast incident response
Preventive Maintenance - Proactive fixes

Support & Analysis (6 agents)

Observability Expert - Analyzes Dynatrace data (placeholder integration)
IaC Expert - Manages Terraform/Pulumi
Incident Commander - Coordinates response
Communication Specialist - Updates stakeholders
Postmortem Analyst - Creates reports
Diagnostic Agent - Recommends fixes

30+ DevOps Tools

Real Kubernetes Tools (devops_tools/real_k8s_tools.py)

check_k8s_pod_status - Real kubectl get pods
get_k8s_logs - Real kubectl logs
get_k8s_events - Real kubectl get events
scale_k8s_deployment - Real kubectl scale
restart_pod - Real kubectl delete pod
restart_deployment - Real kubectl rollout restart
adjust_resource_limits - Real kubectl set resources
update_container_image - Real kubectl set image
rollback_deployment - Real kubectl rollout undo
create_horizontal_pod_autoscaler - Real kubectl autoscale
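
As a rough illustration of how tools like these can wrap kubectl, the sketch below builds the argument list and runs it through subprocess with a timeout. It is not the code in devops_tools/real_k8s_tools.py; the function names, the result dictionary shape, and the binary parameter are assumptions. Only standard kubectl syntax (-n for the namespace) is used.

```python
import subprocess


def build_kubectl_command(args, namespace=None):
    """Return the argv list for a kubectl call, optionally namespaced."""
    cmd = ["kubectl", *args]
    if namespace:
        cmd += ["-n", namespace]
    return cmd


def run_kubectl(args, namespace=None, timeout=30, binary="kubectl"):
    """Run kubectl and return a result dict instead of raising.

    Hypothetical wrapper: a missing binary or a timeout becomes an error
    entry, so a calling agent can degrade gracefully.
    """
    cmd = build_kubectl_command(args, namespace)
    cmd[0] = binary
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True,
                              timeout=timeout)
    except FileNotFoundError:
        return {"ok": False, "output": "",
                "error": f"{binary} not found on PATH"}
    except subprocess.TimeoutExpired:
        return {"ok": False, "output": "",
                "error": f"timed out after {timeout}s"}
    return {"ok": proc.returncode == 0, "output": proc.stdout,
            "error": proc.stderr}


if __name__ == "__main__":
    # Only prints the command; nothing is executed against a cluster.
    print(build_kubectl_command(
        ["rollout", "restart", "deployment/api-deployment"],
        namespace="production"))
```

Passing the command as a list rather than a shell string means resource names supplied by an agent are never interpreted by a shell.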

Real AWS Tools (devops_tools/real_aws_tools.py)

check_lambda_status - Real boto3 Lambda.get_function
analyze_lambda_errors - Real CloudWatch Logs Insights
invoke_lambda - Real boto3 Lambda.invoke
check_fargate_task_status - Real boto3 ECS.describe_tasks
scan_ecr_image - Real boto3 ECR.describe_image_scan_findings
list_ecr_images - Real boto3 ECR.describe_images
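
To show the kind of call behind check_lambda_status, here is a minimal sketch around the boto3 Lambda get_function operation. The function name summarize_lambda_status, the shape of the returned summary, and the health rule are assumptions, not the repository's implementation. The client is passed in so the example runs without AWS credentials; with boto3 installed, boto3.client("lambda") would supply a real one.

```python
def summarize_lambda_status(lambda_client, function_name):
    """Summarise a Lambda function's configuration state.

    `lambda_client` is anything exposing get_function(FunctionName=...),
    such as boto3.client("lambda"). Hypothetical helper for illustration.
    """
    response = lambda_client.get_function(FunctionName=function_name)
    config = response["Configuration"]
    state = config.get("State", "Unknown")
    last_update = config.get("LastUpdateStatus", "Unknown")
    return {
        "name": config.get("FunctionName", function_name),
        "state": state,
        "last_update_status": last_update,
        "runtime": config.get("Runtime"),
        "memory_mb": config.get("MemorySize"),
        "timeout_s": config.get("Timeout"),
        # Assumed health rule: active and the last deployment did not fail.
        "healthy": state == "Active" and last_update != "Failed",
    }


class StubLambdaClient:
    """Stand-in for a boto3 Lambda client, returning canned data."""

    def get_function(self, FunctionName):
        return {"Configuration": {
            "FunctionName": FunctionName,
            "State": "Active",
            "LastUpdateStatus": "Successful",
            "Runtime": "python3.12",
            "MemorySize": 256,
            "Timeout": 30,
        }}


if __name__ == "__main__":
    print(summarize_lambda_status(StubLambdaClient(), "api-function"))
```

Injecting the client also makes the tool straightforward to unit test, since a stub can return canned responses.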

Placeholder Tools (devops_tools/devops_tools.py)

Dynatrace integration (placeholder)
Terraform/Pulumi tools (placeholder)
Notification tools (placeholder)

Utilities (utils/)

Error handling with retry logic
Metrics collection (MTTR, MTTD, SLA)
Health monitoring
Structured logging

Tech Stack Coverage

✅ Kubernetes (Real)
✅ AWS Lambda (Real)
✅ AWS Fargate/ECS (Real)
✅ AWS ECR (Real)
⏳ Dynatrace (Placeholder)
⏳ Terraform (Placeholder)
⏳ Pulumi (Placeholder)

📚 Documentation

Quick Start - Get started in 5 minutes
DevOps Guide - Complete DevOps documentation
Self-Healing Guide - Auto-fix capabilities
Visual Guide - Architecture diagrams
Detailed Guide - AgenticAI concepts
Examples - Usage examples
FAQ - Common questions

🏗️ Project Structure

AgenticAI/
├── agents/              # Base agent framework
├── devops_agents/       # 15 specialized DevOps agents
│   ├── incident_agents.py
│   └── self_healing_agent.py
├── tools/               # Base tools
├── devops_tools/        # DevOps-specific tools
│   ├── devops_tools.py         # Placeholder tools
│   ├── real_k8s_tools.py       # Real kubectl integration ✅
│   └── real_aws_tools.py       # Real boto3 integration ✅
├── utils/               # Utilities
│   ├── error_handling.py       # Retry logic, validation ✅
│   ├── metrics.py              # MTTR, MTTD, SLA tracking ✅
│   └── logger.py
├── memory/              # Memory systems
├── prompts/             # Prompt templates
├── config/              # Configuration
├── examples/            # General examples
├── devops_examples/     # DevOps workflows
│   ├── incident_response_examples.py
│   └── real_world_examples.py  # Real tool examples ✅
├── docs/                # Complete documentation
├── tests/               # Tests
├── data/                # Data storage
└── logs/                # Application logs

💡 Example: Real Kubernetes Self-Healing

from devops_agents.self_healing_agent import SelfHealingAgentFactory
from agents.crew_task import CrewTask
from agents.crew_manager import CrewManager
from utils.metrics import track_incident, resolve_incident

# Track incident
track_incident("INC-001", "P1")

# Create self-healing agent
factory = SelfHealingAgentFactory()
healer = factory.create_self_healing_k8s_agent()

# Define problem (uses real kubectl commands)
task = CrewTask(
    description="Pod 'api-pod-12345' in 'production' is crashing. Fix it.",
    agent=healer.get_agent(),
    expected_output="Issue fixed and pod running"
)

# Execute - agent will:
# 1. Run: kubectl get pods -n production api-pod-12345
# 2. Run: kubectl logs -n production api-pod-12345
# 3. Run: kubectl get events -n production
# 4. Diagnose: OOMKilled
# 5. Run: kubectl set resources deployment/api-deployment -n production --limits=memory=1Gi
# 6. Verify: kubectl get pods -n production

crew = CrewManager(
    agents=[healer.get_agent()],
    tasks=[task.get_task()]
)

result = crew.kickoff()
resolve_incident("INC-001")

print(result)  # e.g. "Fixed OOMKilled by increasing memory to 1Gi" (actual wording depends on the LLM)

📊 Metrics & Monitoring

from utils.metrics import get_metrics_summary, export_metrics

# Get metrics
summary = get_metrics_summary()
print(f"MTTR: {summary['mttr_seconds']}s")
print(f"MTTD: {summary['mttd_seconds']}s")
print(f"Success Rate: {summary['agent_success_rate']}%")

# Export to file
export_metrics("logs/metrics_report.json")
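
For readers unfamiliar with the metrics named above, the sketch below shows one way MTTD, MTTR, and SLA breaches could be derived from incident timestamps. It is an illustration only, not the code in utils/metrics.py: the IncidentRecord fields, the SLA targets, and the choice to measure MTTR from detection to resolution are all assumptions.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass
class IncidentRecord:
    """Hypothetical record of one incident's key timestamps."""
    incident_id: str
    priority: str
    started_at: datetime
    detected_at: datetime
    resolved_at: datetime


# Assumed resolution targets per priority, for illustration only.
SLA_TARGETS = {"P1": timedelta(hours=1), "P2": timedelta(hours=4)}


def mttd_seconds(incidents):
    """Mean time to detect: start of the problem until detection."""
    return mean((i.detected_at - i.started_at).total_seconds()
                for i in incidents)


def mttr_seconds(incidents):
    """Mean time to resolve, measured here from detection to resolution."""
    return mean((i.resolved_at - i.detected_at).total_seconds()
                for i in incidents)


def sla_breaches(incidents, targets=SLA_TARGETS):
    """Return IDs of incidents that took longer to resolve than their target."""
    return [i.incident_id for i in incidents
            if i.priority in targets
            and i.resolved_at - i.detected_at > targets[i.priority]]


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 12, 0, 0)
    records = [
        IncidentRecord("INC-001", "P1", t0, t0 + timedelta(minutes=1),
                       t0 + timedelta(minutes=11)),
        IncidentRecord("INC-002", "P1", t0, t0 + timedelta(minutes=2),
                       t0 + timedelta(minutes=122)),
    ]
    print("MTTD:", mttd_seconds(records), "s")
    print("MTTR:", mttr_seconds(records), "s")
    print("SLA breaches:", sla_breaches(records))
```

Some teams measure MTTR from the start of the incident rather than from detection; whichever convention is used should be stated next to the exported numbers.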

🎯 Use Cases

Kubernetes Pod Crashes - Auto-detect and fix with real kubectl
Lambda High Error Rate - Analyze with real CloudWatch Logs
Production Outages - Full incident response workflow
Infrastructure Drift - Detect and remediate
Container Security - Scan ECR images with real boto3
Performance Issues - Analyze metrics
Post-Incident Analysis - Generate reports with metrics

🔧 Configuration

Environment Variables

# Required
OPENAI_API_KEY=sk-your-key-here

# Optional - for real AWS integrations
AWS_REGION=us-east-1
AWS_ACCESS_KEY_ID=your_access_key
AWS_SECRET_ACCESS_KEY=your_secret_key

# Optional - for Dynatrace
DYNATRACE_TENANT=your_tenant_id
DYNATRACE_API_TOKEN=your_api_token
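
The variables above could be read and validated at startup along these lines. This is a sketch under assumptions: the Settings class, load_settings, and the default region are hypothetical and not part of the repository. Note that boto3 can also pick up credentials from profiles or instance roles, so the aws_enabled flag here only reflects what is set in the environment.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Settings:
    """Hypothetical container for the environment-driven configuration."""
    openai_api_key: str
    aws_region: str
    aws_enabled: bool
    dynatrace_enabled: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Build Settings from an environment mapping, failing fast if the
    required key is missing."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise ValueError("OPENAI_API_KEY is required")
    return Settings(
        openai_api_key=api_key,
        # Assumed default, matching the sample value shown above.
        aws_region=env.get("AWS_REGION", "us-east-1"),
        aws_enabled=bool(env.get("AWS_ACCESS_KEY_ID")
                         and env.get("AWS_SECRET_ACCESS_KEY")),
        dynatrace_enabled=bool(env.get("DYNATRACE_TENANT")
                               and env.get("DYNATRACE_API_TOKEN")),
    )


if __name__ == "__main__":
    demo = load_settings({"OPENAI_API_KEY": "sk-your-key-here"})
    print(demo.aws_region, demo.aws_enabled, demo.dynatrace_enabled)
```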

Dummy Data (Safe for Testing)

# Default dummy values in real_k8s_tools.py
DUMMY_NAMESPACE = "production"
DUMMY_POD = "api-pod-12345"
DUMMY_DEPLOYMENT = "api-deployment"

# Default dummy values in real_aws_tools.py
DUMMY_FUNCTION = "api-function"
DUMMY_CLUSTER = "prod-cluster"
DUMMY_REPOSITORY = "my-app"

# Replace with your actual resource names

📦 Installation

# Navigate to directory
cd AgenticAI

# Install dependencies
pip install -r requirements.txt

# Install kubectl (for K8s tools)
# macOS: brew install kubectl
# Linux: snap install kubectl --classic
# Windows: choco install kubernetes-cli

# Configure AWS (for AWS tools)
aws configure

# Configure environment
cp .env.example .env
# Edit .env and add OPENAI_API_KEY

# Run examples
python devops_examples/real_world_examples.py

🔑 Key Features

✅ Real Integrations - Actual kubectl and boto3 commands
✅ Error Handling - Retry logic with exponential backoff
✅ Metrics Tracking - MTTR, MTTD, SLA compliance
✅ Self-Healing - Automatic issue remediation
✅ Multi-Platform - K8s and AWS (real); Dynatrace and IaC (placeholder)
✅ Production-Ready - Logging, validation, timeouts
✅ Graceful Degradation - Works without kubectl/AWS
✅ Extensible - Easy to add agents and tools

⚠️ Important Notes

Real vs Placeholder Tools

Real (Production-Ready)

✅ Kubernetes tools - Uses actual kubectl
✅ AWS Lambda tools - Uses actual boto3
✅ AWS Fargate tools - Uses actual boto3
✅ AWS ECR tools - Uses actual boto3

Placeholder (Need Implementation)

⏳ Dynatrace tools - API integration needed
⏳ Terraform tools - CLI execution needed
⏳ Pulumi tools - CLI execution needed
⏳ Notification tools - Webhook integration needed

Safety Features

Uses dummy data by default
Validates inputs before execution
Logs all actions for audit trail
Handles errors gracefully
Supports dry-run mode
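
Input validation and dry-run mode can work together as in the sketch below, which checks names before building a kubectl scale command and only executes it when dry-run is switched off. The function names, the runner parameter, and the choice to validate both names against the stricter DNS label rule are assumptions for illustration, not the repository's implementation.

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and '-', must start and
# end with an alphanumeric character.
_DNS_LABEL = re.compile(r"[a-z0-9]([-a-z0-9]*[a-z0-9])?")


def is_valid_k8s_name(name):
    """Return True if `name` is a DNS label of at most 63 characters."""
    return (isinstance(name, str)
            and 0 < len(name) <= 63
            and _DNS_LABEL.fullmatch(name) is not None)


def scale_deployment(deployment, namespace, replicas, dry_run=True,
                     runner=None):
    """Validate inputs, then plan or run `kubectl scale`.

    Hypothetical helper: with dry_run=True (the default) or no runner,
    the command is returned without being executed.
    """
    if not is_valid_k8s_name(namespace):
        raise ValueError(f"invalid namespace: {namespace!r}")
    if not is_valid_k8s_name(deployment):
        raise ValueError(f"invalid deployment name: {deployment!r}")
    if isinstance(replicas, bool) or not isinstance(replicas, int) \
            or replicas < 0:
        raise ValueError(f"replicas must be a non-negative int: {replicas!r}")

    cmd = ["kubectl", "scale", f"deployment/{deployment}",
           f"--replicas={replicas}", "-n", namespace]
    if dry_run or runner is None:
        return {"executed": False, "command": cmd}
    runner(cmd)  # e.g. a function that calls subprocess.run(cmd, ...)
    return {"executed": True, "command": cmd}


if __name__ == "__main__":
    print(scale_deployment("api-deployment", "production", 3))
```

Defaulting dry_run to True means a caller has to opt in explicitly before anything changes in the cluster.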

🤝 Contributing

Contributions welcome! Priority areas:

Implement Dynatrace API integration
Add Terraform/Pulumi CLI execution
Create notification integrations (Slack, PagerDuty)
Add more test coverage
Improve error handling

📄 License

MIT License

🆘 Support

Check documentation in docs/ or review examples in devops_examples/.

Built with CrewAI - Production-ready multi-agent DevOps automation
