Ever been on-call at 2 AM, frantically switching between PagerDuty, Slack, Grafana, GitHub, and your runbooks trying to figure out what's broken? OpsLens is here to help.
It's an AI-powered copilot that pulls together all those scattered signals—alerts, logs, metrics, screenshots, PRs—and turns them into something actually useful: a clear timeline of what happened, AI-generated hypotheses about the root cause, and actionable next steps. And unlike you at 2 AM, it never sleeps.
- View all incidents at a glance with severity, status, and timestamps
- Chronological timeline showing events from GitHub, PagerDuty, and other sources
- AI-generated root cause hypotheses ranked by confidence with supporting evidence
- All evidence in one place: logs, metrics, screenshots, and their AI analysis
When an incident hits, OpsLens:
- Builds a timeline automatically - Pulls in events from GitHub (recent deployments), PagerDuty (alerts), and other sources to show you what changed and when
- Analyzes screenshots with AI - Upload a dashboard screenshot and it'll tell you what's wrong, what metrics are spiking, and what errors it sees
- Generates root cause hypotheses - Uses AI to analyze all the evidence and suggest what might be causing the issue, ranked by confidence
- Suggests actions - Gives you concrete next steps: queries to run, who to page, whether to rollback
- Drafts postmortems - When it's all over, it'll generate a postmortem draft with root cause analysis and follow-up actions
- GitHub - Automatically fetches your recent PR merges and adds them to incident timelines. Works with both personal accounts and organizations.
- PagerDuty - Pulls in active incidents and alerts, correlates them with your OpsLens incidents
- Webhooks - Accept webhooks from GitHub, PagerDuty, or any custom system to automatically create incidents
The screenshots above show the UI in action: the incident list, timeline view, hypotheses, and evidence viewer.
- LLM (Llama 3.1) - Summarizes logs, generates hypotheses, writes postmortems. All via Hugging Face's Inference API, so no need to host massive models yourself.
- VLM (Qwen2.5-VL) - Upload a screenshot of your Grafana dashboard and it'll tell you what it sees: errors, anomalies, key metrics. Pretty wild, honestly.
- RAG (BGE-M3) - Semantic search over your runbooks and historical postmortems. Ask "how do I fix database connection issues?" and it'll find the relevant runbook.
- REST API - Full REST API with OpenAPI docs. Create incidents, upload evidence, generate hypotheses—all programmatically
- API Key Authentication - Secure your API with API keys. Optional in dev, required in production
- Webhooks - Both incoming (GitHub, PagerDuty) and outgoing (notify your systems when incidents are created/updated)
- Test Endpoints - Built-in endpoints to test your integrations and verify everything's working
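As a sketch of what "all programmatically" looks like, here's a minimal Python helper that builds a create-incident request. The `/api/v1/incidents` path and `X-API-Key` header come from this README; the `title`/`severity` payload fields are assumptions based on the generic webhook example, not a verified schema.

```python
import json
import urllib.request

def build_incident_request(base_url, title, severity, api_key=None):
    """Build a POST request for the (assumed) incident-creation endpoint."""
    payload = json.dumps({"title": title, "severity": severity}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/v1/incidents",
        data=payload,
        method="POST",
        headers={"Content-Type": "application/json"},
    )
    if api_key:
        # Header name from the API key authentication section of this README
        req.add_header("X-API-Key", api_key)
    return req

# To actually send it:
# urllib.request.urlopen(build_incident_request("http://localhost:8000", "DB latency spike", "high"))
```

Check the interactive Swagger UI at /docs for the real request schema before relying on these field names.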
You'll need:
- Docker Desktop (make sure it's running)
- API keys (all free tiers work fine):
- Hugging Face - for AI models
- GitHub - Personal Access Token with `repo` scope
- PagerDuty - API key from your account
Clone it:

```bash
git clone https://github.com/yourusername/OpsLens.git
cd OpsLens
```

Set up your API keys:

```bash
cp secrets.env.example secrets.env
# Open secrets.env and paste your API keys
```

Start everything:

```bash
./setup.sh
```
This will start all the services, initialize the database, and optionally generate some realistic test data.
Open it up:
- Frontend: http://localhost:3000
- API Docs: http://localhost:8000/docs (interactive Swagger UI)
- Backend: http://localhost:8000
That's it! You should see the frontend with some sample incidents (if you generated synthetic data).
Just go to http://localhost:3000. You'll see a list of incidents (like in the first screenshot above). Each incident shows its severity, status, and when it was created. Click on any one to dive into the details.
On an incident page, click "Generate Timeline" (see the timeline screenshot above). OpsLens will:
- Fetch recent GitHub PR merges (last 24 hours)
- Pull in PagerDuty incidents
- Correlate everything chronologically
- Show you a clean timeline of what happened
The timeline view shows all events in chronological order, making it easy to see what changed and when.
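The correlation step above is conceptually simple: normalize events from each source and sort them by timestamp. A minimal sketch (the tuple shape and the sample events are made up for illustration; the real OpsLens event schema lives in the backend):

```python
from datetime import datetime, timezone

def merge_timeline(*sources):
    """Merge event lists from different sources into one chronological timeline.

    Each event is a (timestamp, source, description) tuple -- an illustrative
    shape, not the actual OpsLens model.
    """
    return sorted((e for src in sources for e in src), key=lambda e: e[0])

# Hypothetical sample events
github = [(datetime(2024, 5, 1, 13, 55, tzinfo=timezone.utc), "github", "PR merged to main")]
pagerduty = [(datetime(2024, 5, 1, 14, 2, tzinfo=timezone.utc), "pagerduty", "High error rate alert")]
timeline = merge_timeline(github, pagerduty)
```

The deploy landing seven minutes before the alert is exactly the kind of "what changed and when" signal the timeline surfaces.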
Got a dashboard screenshot? Upload it in the Evidence tab (see the evidence screenshot above). The VLM will analyze it and tell you:
- What errors it sees
- What metrics are spiking
- Any anomalies it notices
- A plain-English description of what's on the dashboard
Processing happens in the background, so give it a few seconds and refresh. The analysis will appear in the evidence content.
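If you're driving this from a script rather than the UI, you'll want to poll until the background job finishes. A small sketch with an injectable status callable (the "completed" status string and the polling interval are assumptions, not OpsLens defaults):

```python
import time

def wait_for_analysis(fetch_status, attempts=10, delay=3.0, sleep=time.sleep):
    """Poll a status callable until the background VLM analysis is done.

    `fetch_status` is any zero-arg function returning a status string; in
    practice it would GET the evidence record from the API.
    """
    for _ in range(attempts):
        if fetch_status() == "completed":
            return True
        sleep(delay)
    return False
```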
Click "Generate Hypotheses" (see the hypotheses screenshot above) and the AI will:
- Analyze all the evidence (logs, metrics, screenshots)
- Generate root cause hypotheses
- Rank them by confidence
- Link them to supporting evidence
The hypotheses are ranked by confidence, so you know which ones to investigate first.
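The ranking itself is just a descending sort over the hypothesis records. A sketch, with field names assumed rather than taken from the actual OpsLens schema:

```python
def rank_hypotheses(hypotheses):
    """Order hypothesis dicts by confidence, highest first.

    The `summary`/`confidence` keys are illustrative assumptions.
    """
    return sorted(hypotheses, key=lambda h: h["confidence"], reverse=True)

ranked = rank_hypotheses([
    {"summary": "Slow query from new index", "confidence": 0.4},
    {"summary": "Connection pool exhaustion after deploy", "confidence": 0.9},
])
```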
Want incidents to be created automatically? Set up webhooks:
GitHub:
- Go to your repo → Settings → Webhooks
- Add webhook: `https://your-domain.com/api/v1/webhooks/github`
- Select "Pull requests" events
PagerDuty:
- PagerDuty → Integrations → Webhooks
- Add webhook: `https://your-domain.com/api/v1/webhooks/pagerduty`
- Select incident events
Now when PRs merge or PagerDuty alerts fire, incidents are created automatically!
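If you set a webhook secret, GitHub signs each delivery with HMAC-SHA256 in the `X-Hub-Signature-256` header. Whether OpsLens enforces this check is covered in WEBHOOKS.md; the verification itself looks like this:

```python
import hashlib
import hmac

def verify_github_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Check GitHub's X-Hub-Signature-256 header against the raw request body.

    GitHub computes HMAC-SHA256 over the payload using your webhook secret
    and sends it as "sha256=<hexdigest>".
    """
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison to avoid timing side channels
    return hmac.compare_digest(expected, signature_header)
```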
```bash
# Test GitHub connection
curl http://localhost:8000/api/v1/integrations/test/github

# Test PagerDuty connection
curl http://localhost:8000/api/v1/integrations/test/pagerduty

# Test everything
curl http://localhost:8000/api/v1/integrations/test/all

# Check if VLM is ready
curl http://localhost:8000/api/v1/test/vlm/status

# Test with an actual screenshot
curl -X POST http://localhost:8000/api/v1/test/vlm \
  -F "file=@your_screenshot.png"

# Test webhook endpoint
curl http://localhost:8000/api/v1/webhooks/test

# Send a test webhook
curl -X POST http://localhost:8000/api/v1/webhooks/generic \
  -H "Content-Type: application/json" \
  -d '{"title": "Test Incident", "severity": "medium"}'
```

```
OpsLens/
├── backend/              # FastAPI backend
│   ├── app/
│   │   ├── api/          # REST endpoints (incidents, evidence, webhooks, auth)
│   │   ├── auth/         # Authentication (API keys, JWT)
│   │   ├── db/           # Database models
│   │   ├── integrations/ # GitHub, PagerDuty
│   │   ├── services/     # Business logic (ML, RAG, incidents)
│   │   └── workers/      # Celery async tasks
│   └── requirements.txt
├── frontend/             # Next.js frontend
│   ├── app/              # Pages and components
│   └── lib/              # API client, types
└── docker-compose.yml    # All services
```
Backend:
- FastAPI - Fast, modern Python web framework
- PostgreSQL + pgvector - Database with vector search built-in
- Celery + Redis - Background job processing
- Hugging Face Inference API - AI models without the infrastructure headache
Frontend:
- Next.js 14 - React framework
- TypeScript - Type safety
- Tailwind CSS - Styling
Infrastructure:
- Docker Compose - All services run in containers for easy setup
The API is fully documented at http://localhost:8000/docs. You can try endpoints right there.
By default, authentication is disabled for easy development. To enable it:
- Set `ENABLE_AUTH=true` in your config
- Create an API key:

```bash
curl -X POST http://localhost:8000/api/v1/auth/api-keys \
  -H "Content-Type: application/json" \
  -d '{"name": "My Integration"}'
```

- Use it:

```bash
curl -H "X-API-Key: your-key" \
  http://localhost:8000/api/v1/incidents
```
See API_AUTHENTICATION.md for the full guide.
OpsLens supports both incoming and outgoing webhooks:
- Incoming: GitHub, PagerDuty, or generic webhooks create incidents automatically
- Outgoing: OpsLens can notify your systems when incidents are created/updated
Check out WEBHOOKS.md for setup instructions.
```bash
# Start everything
docker-compose up -d

# View logs
docker-compose logs -f backend

# Restart a service
docker-compose restart backend

# Stop everything
docker-compose down
```

The architecture makes it easy to add new integrations:
- Create a new file in `backend/app/integrations/`
- Add test endpoints in `backend/app/api/integrations.py`
- Wire it into the incident worker
- Done!
See the GitHub and PagerDuty integrations for examples.
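As a rough sketch of that pattern, a new integration fetches provider events and normalizes them for the timeline. Everything here is hypothetical (class name, method names, event shape); the real conventions are whatever the GitHub and PagerDuty modules in `backend/app/integrations/` do:

```python
from datetime import datetime

class SlackIntegration:
    """Hypothetical integration following the fetch-then-normalize pattern."""

    source = "slack"

    def fetch_raw_events(self):
        # A real integration would call the provider's API here;
        # this returns a canned sample payload for illustration.
        return [{"ts": "2024-05-01T14:02:00+00:00", "text": "incident channel opened"}]

    def to_timeline_events(self):
        # Normalize provider-specific payloads into (timestamp, source, description)
        return [
            (datetime.fromisoformat(e["ts"]), self.source, e["text"])
            for e in self.fetch_raw_events()
        ]
```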
This is an MVP, but it's fully functional. Here's what's coming:
- More integrations - Slack, Datadog, Grafana, OpenTelemetry
- Better AI - Fine-tuned models, better hypothesis generation
- Production features - RBAC, audit logging, multi-tenant support
- Evaluation - Metrics to measure how well the AI is performing
Check the roadmap section for more details.
Services won't start?
- Make sure Docker Desktop is running
- Check if ports 3000, 8000, 5432, 6379 are free
- Look at the logs: `docker-compose logs backend`
API keys not working?
- Verify your keys in `secrets.env`
- Test the integration endpoints (see above)
- Check the logs for error messages
VLM not analyzing screenshots?
- Make sure your Hugging Face API key is set
- Check the VLM status endpoint
- First request might take 30-60 seconds (model loading)
Webhooks not working?
- Verify the webhook URL is accessible
- Check webhook secret matches
- Look at backend logs for incoming webhooks
Found a bug? Have an idea? Pull requests welcome! This is a portfolio project, so contributions that make it more impressive are especially appreciated.
MIT License - do whatever you want with it.
- API Authentication Guide - Setting up API keys and security
- Webhooks Guide - Incoming and outgoing webhooks
- VLM Testing - Testing the vision-language model
- Integration Guide - Integrating OpsLens into your systems
- API key authentication ✅ Done
- Webhook support (incoming & outgoing) ✅ Done
- VLM screenshot analysis ✅ Done
- GitHub & PagerDuty integrations ✅ Done
- Slack integration (incident threads, notifications)
- Datadog/Grafana metrics integration
- OpenTelemetry trace correlation
- Kubernetes event ingestion
- Advanced RAG with fine-tuning
- Evaluation harness for ML models
- Role-based access control
- Multi-tenant support
Built for on-call engineers who are tired of context-switching at 2 AM.