This document is the authoritative guide for all development work on the AIOps NextGen platform. Follow these instructions precisely to ensure a consistent, high-quality development, integration, and testing process.
- Project Context
- Document Hierarchy
- Development Principles
- OpenShift Sandbox Environment
- Sprint Execution Protocol
- Code Implementation Standards
- Testing Requirements
- Git Workflow
- Quality Gates
- Troubleshooting
- Reference Quick Links
AIOps NextGen is an AI-driven observability platform for managing multiple OpenShift/Kubernetes clusters with specialized support for:
- GPU workloads (AI/ML training and inference)
- Telecom CNF (PTP synchronization, SR-IOV networking)
- Federated observability (Prometheus, Loki, Tempo)
- Intelligent analysis (anomaly detection, root cause analysis)
The platform has 5 microservices implemented as scaffolds with mock data. A code audit identified 23 issues (6 CRITICAL, 10 HIGH, 5 MEDIUM, 2 LOW) that must be resolved before production deployment.
Transform the scaffold implementation into production-ready code by executing 10 sprints that address all identified issues while maintaining strict adherence to the specification documents.
These documents define the truth. All implementation MUST conform to these specifications:

| Document | Purpose | Location |
|---|---|---|
| `README.md` | Project overview and quick start | `/README.md` |
| `specs/00-overview.md` | Architecture vision, design principles | `/specs/00-overview.md` |
| `specs/01-data-models.md` | Pydantic model definitions | `/specs/01-data-models.md` |
| `specs/02-cluster-registry.md` | Cluster management service spec | `/specs/02-cluster-registry.md` |
| `specs/03-observability-collector.md` | Metrics/logs/traces collection spec | `/specs/03-observability-collector.md` |
| `specs/04-intelligence-engine.md` | AI/ML analysis service spec | `/specs/04-intelligence-engine.md` |
| `specs/05-realtime-streaming.md` | WebSocket streaming spec | `/specs/05-realtime-streaming.md` |
| `specs/06-api-gateway.md` | Gateway, auth, routing spec | `/specs/06-api-gateway.md` |
| `specs/07-frontend.md` | React UI specification | `/specs/07-frontend.md` |
| `specs/08-integration-matrix.md` | Service communication patterns | `/specs/08-integration-matrix.md` |
| `specs/09-deployment.md` | Kubernetes/Helm deployment | `/specs/09-deployment.md` |

| Document | Purpose | Location |
|---|---|---|
| `debug/issues.md` | 23 identified issues with severity | `/debug/issues.md` |
| `debug/bugfix-sprint-plan.md` | High-level sprint overview | `/debug/bugfix-sprint-plan.md` |

| Document | Focus Area | Location |
|---|---|---|
| `debug/sprint/README.md` | Sprint index and execution guide | `/debug/sprint/README.md` |
| `debug/sprint/sprint-01-*.md` | Security Foundation | `/debug/sprint/sprint-01-security-foundation.md` |
| `debug/sprint/sprint-02-*.md` | Kubernetes Integration | `/debug/sprint/sprint-02-kubernetes-integration.md` |
| `debug/sprint/sprint-03-*.md` | Prometheus Authentication | `/debug/sprint/sprint-03-prometheus-auth.md` |
| `debug/sprint/sprint-04-*.md` | Logs & Traces | `/debug/sprint/sprint-04-logs-traces.md` |
| `debug/sprint/sprint-05-*.md` | GPU Telemetry | `/debug/sprint/sprint-05-gpu-telemetry.md` |
| `debug/sprint/sprint-06-*.md` | CNF Monitoring | `/debug/sprint/sprint-06-cnf-monitoring.md` |
| `debug/sprint/sprint-07-*.md` | WebSocket Hardening | `/debug/sprint/sprint-07-websocket-hardening.md` |
| `debug/sprint/sprint-08-*.md` | Anomaly Detection & RCA | `/debug/sprint/sprint-08-anomaly-rca.md` |
| `debug/sprint/sprint-09-*.md` | Reports & MCP Tools | `/debug/sprint/sprint-09-reports-mcp.md` |
| `debug/sprint/sprint-10-*.md` | API Gateway Polish | `/debug/sprint/sprint-10-api-gateway-polish.md` |

| Document | Purpose | Location |
|---|---|---|
| `CLAUDE.md` | AI assistant context and commands | `/CLAUDE.md` |

RULE: The specs/ folder is the source of truth.
If code contradicts spec, the code is wrong.
Before implementing any feature:
- Read the relevant spec section
- Understand the data models from `specs/01-data-models.md`
- Check integration patterns in `specs/08-integration-matrix.md`
- Verify deployment requirements in `specs/09-deployment.md`
RULE: All features MUST work in air-gapped environments.
Local vLLM is preferred over external APIs.
- Never hardcode external API dependencies
- Always provide local alternatives
- Use environment variables for all external URLs
- Container images must be pullable from private registries
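
How this looks in practice — a minimal sketch of env-driven endpoint resolution with a local default. `LLM_PROVIDER` and `LLM_LOCAL_URL` match the environment variable table later in this guide; `LLM_EXTERNAL_URL` and the helper itself are illustrative assumptions, not the platform's actual implementation:

```python
import os

def resolve_llm_base_url() -> str:
    """Illustrative only: prefer local vLLM, never hardcode external URLs."""
    provider = os.environ.get("LLM_PROVIDER", "local")
    if provider == "local":
        # Default to the in-cluster vLLM server so air-gapped installs work.
        return os.environ.get("LLM_LOCAL_URL", "http://vllm:8080/v1")
    # External providers must also be configured via environment variables
    # (LLM_EXTERNAL_URL is a hypothetical variable for this sketch).
    url = os.environ.get("LLM_EXTERNAL_URL")
    if not url:
        raise RuntimeError("LLM_EXTERNAL_URL must be set for non-local providers")
    return url
```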
RULE: Security is not optional. Every endpoint MUST be authenticated.
Every operation MUST be authorized.
- OAuth 2.0 via OpenShift for authentication
- RBAC with admin/operator/viewer roles
- Credentials stored in Kubernetes Secrets only
- No secrets in environment variables or config files
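
For illustration, a FastAPI dependency chain enforcing both rules might look like the sketch below. The actual middleware is specified in `specs/06-api-gateway.md`; `validate_oauth_token` here is a hypothetical placeholder:

```python
from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

bearer = HTTPBearer()

async def validate_oauth_token(token: str) -> dict | None:
    """Hypothetical placeholder: a real implementation verifies the token
    against the OpenShift OAuth issuer and returns the user's claims."""
    raise NotImplementedError

async def current_user(
    creds: HTTPAuthorizationCredentials = Depends(bearer),
) -> dict:
    # Authentication: every request must carry a valid bearer token.
    user = await validate_oauth_token(creds.credentials)
    if user is None:
        raise HTTPException(status.HTTP_401_UNAUTHORIZED, "Invalid token")
    return user

def require_role(*roles: str):
    """Authorization: RBAC guard for the admin/operator/viewer roles."""
    async def checker(user: dict = Depends(current_user)) -> dict:
        if user.get("role") not in roles:
            raise HTTPException(status.HTTP_403_FORBIDDEN, "Insufficient permissions")
        return user
    return checker

# Usage sketch:
# @router.delete("/clusters/{id}", dependencies=[Depends(require_role("admin"))])
```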
RULE: Remove ALL mock data patterns.
Every function must interact with real systems.
Search for and eliminate:
- `# For sandbox testing` comments
- `# Mock data` comments
- `return []  # TODO` placeholders
- Hardcoded metric values
- Fake GPU data
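
One way to make this search repeatable is a small scan script along these lines (a sketch only; the regexes mirror the textual patterns above — hardcoded metric values and fake GPU data still require manual review):

```python
import re
from pathlib import Path

# Regexes for the scaffold/mock patterns listed above.
MOCK_PATTERNS = [
    r"#\s*For sandbox testing",
    r"#\s*Mock data",
    r"return \[\]\s*#\s*TODO",
]

def find_mock_code(root: str = "src") -> None:
    """Print every line under `root` that matches a known mock pattern."""
    for path in Path(root).rglob("*.py"):
        for lineno, line in enumerate(path.read_text().splitlines(), start=1):
            if any(re.search(pattern, line) for pattern in MOCK_PATTERNS):
                print(f"{path}:{lineno}: {line.strip()}")

if __name__ == "__main__":
    find_mock_code()
```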
RULE: Every service must emit logs, metrics, and traces.
- Structured JSON logging via structlog
- Prometheus metrics on `/metrics`
- OpenTelemetry traces to OTLP collector
- Health endpoints on `/health` and `/ready`
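
A minimal wiring sketch, assuming FastAPI with `structlog` and `prometheus-client` (in this codebase the `shared.observability` helpers described later provide this; the sketch only shows the shape):

```python
import structlog
from fastapi import FastAPI
from fastapi.responses import PlainTextResponse
from prometheus_client import CONTENT_TYPE_LATEST, generate_latest

logger = structlog.get_logger(__name__)
app = FastAPI()

# OpenTelemetry OTLP trace export is configured separately (omitted here).

@app.get("/metrics")
def metrics() -> PlainTextResponse:
    # Prometheus exposition format for scraping.
    return PlainTextResponse(generate_latest(), media_type=CONTENT_TYPE_LATEST)

@app.get("/health")
def health() -> dict:
    # Liveness: the process is up.
    return {"status": "healthy"}

@app.get("/ready")
def ready() -> dict:
    # Readiness: real services should verify DB/Redis connectivity here.
    return {"status": "ready"}
```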
RULE: ALL code changes MUST be tested on the OpenShift sandbox environment
before being considered complete. Local-only testing is NOT sufficient.
The AIOps NextGen platform runs on OpenShift. Every sprint task must be validated against the live sandbox cluster to ensure:
- Kubernetes API interactions work correctly
- Service-to-service communication functions
- Database and Redis connections are stable
- OAuth/RBAC integrations are operational
IMPORTANT FOR AI AGENTS:
Before starting any development or testing work, the agent MUST:
1. Ask the user for sandbox environment details:

   > "To proceed with development and testing, I need access to the OpenShift sandbox. Please provide:
   > 1. Path to your kubeconfig file (e.g., /Users/fenar/projects/clusters/sandbox01/kubeconfig)
   > 2. OpenShift API URL (e.g., https://api.sandbox01.narlabs.io:6443)
   > 3. Project/namespace name (e.g., aiops-nextgen)
   >
   > Once provided, I will verify cluster connectivity and pod status."

2. Store and use the credentials throughout the session
3. Never hardcode or persist credentials beyond the session

Once credentials are provided, set up the environment:
```bash
# Set the KUBECONFIG environment variable
export KUBECONFIG=/path/to/kubeconfig
# Verify cluster connectivity
oc whoami
oc project aiops-nextgen
# Check current pod status
oc get pods
# Expected services (all should be Running):
# - api-gateway-*
# - cluster-registry-*
# - intelligence-engine-*
# - observability-collector-*
# - realtime-streaming-*
# - postgresql-*
# - redis-*
```

Before starting a new development phase or sprint, verify that all existing OpenShift artifacts are cleaned up and the `aiops-nextgen` namespace is a clean slate:

```bash
# Check pod status
oc get deployments
oc get pods -o wide
# Check service endpoints
oc get svc
# Verify routes are accessible
oc get routes
```

| Status | Meaning | Resolution |
|---|---|---|
| `Running` | Healthy | No action needed |
| `CreateContainerConfigError` | Missing ConfigMap/Secret | Check `oc describe pod <name>` for missing refs |
| `CrashLoopBackOff` | Application crash | Check `oc logs <pod>` for errors |
| `ImagePullBackOff` | Cannot pull image | Verify image exists and registry credentials |
| `Pending` | Waiting for resources | Check node resources or PVC status |

Example: Fix `CreateContainerConfigError`

```bash
# Identify the issue
oc describe pod intelligence-engine-67f84c6675-fw2fn | grep -A5 "Events:"
# Common fix: Create missing secret
oc create secret generic llm-credentials \
--from-literal=LLM_PROVIDER=local \
--from-literal=LLM_LOCAL_URL=http://vllm:8080/v1
# Restart the deployment
oc rollout restart deployment/intelligence-engine
```

After implementing and testing code changes locally, deploy them to the sandbox:

```bash
# 1. Build the new container image
cd src/<service>
docker build -t quay.io/aiops-nextgen/<service>:dev .
# 2. Push to registry (if using external registry)
docker push quay.io/aiops-nextgen/<service>:dev
# 3. Update the deployment image
oc set image deployment/<service> <service>=quay.io/aiops-nextgen/<service>:dev
# 4. Watch the rollout
oc rollout status deployment/<service>
# 5. Verify the new pod is running
oc get pods -l app=<service>
# 6. Check logs for startup errors
oc logs -l app=<service> --tail=100 -f
```

After deploying changes, run integration tests:

```bash
# Get the route URL for the API gateway
export API_URL=$(oc get route api-gateway -o jsonpath='{.spec.host}')
# Test health endpoints
curl -s https://$API_URL/health | jq
curl -s https://$API_URL/ready | jq
# Test API endpoints (with auth token if required)
curl -s https://$API_URL/api/v1/clusters | jq
# Port-forward for direct service testing
oc port-forward svc/cluster-registry 8080:8080 &
curl -s http://localhost:8080/health | jq
# Run pytest against sandbox (with sandbox URL configured)
SANDBOX_API_URL=https://$API_URL pytest tests/integration/ -v
```

Useful commands for debugging services directly on the sandbox:

```bash
# Stream logs from a specific pod
oc logs -f deployment/<service>
# Get shell access to a pod
oc exec -it deployment/<service> -- /bin/sh
# Check environment variables
oc exec deployment/<service> -- env | sort
# Check mounted secrets/configmaps
oc exec deployment/<service> -- ls -la /etc/secrets/
# Debug networking
oc exec deployment/<service> -- curl -s http://cluster-registry:8080/health
# Check database connectivity
oc exec deployment/postgresql -- psql -U aiops -d aiops -c "SELECT 1;"
# Check Redis connectivity
oc exec deployment/redis -- redis-cli ping
```

Services require these environment variables (configured via ConfigMaps/Secrets):

| Variable | Description | Example |
|---|---|---|
| `POSTGRES_HOST` | PostgreSQL hostname | `postgresql` |
| `POSTGRES_PORT` | PostgreSQL port | `5432` |
| `POSTGRES_USER` | Database user | `aiops` |
| `POSTGRES_PASSWORD` | Database password | (from secret) |
| `POSTGRES_DATABASE` | Database name | `aiops` |
| `REDIS_HOST` | Redis hostname | `redis` |
| `REDIS_PORT` | Redis port | `6379` |
| `OAUTH_ISSUER` | OpenShift OAuth URL | `https://oauth-openshift.apps...` |
| `LLM_PROVIDER` | LLM provider type | `local` |
| `LLM_LOCAL_URL` | vLLM server URL | `http://vllm:8080/v1` |
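
As a sketch of how a service might consume these variables (assuming `pydantic-settings`; the real implementation lives in `shared.config` and should always be accessed via `get_settings()`, as shown later in this guide):

```python
from pydantic_settings import BaseSettings

class PostgresSettings(BaseSettings):
    """Maps the POSTGRES_* variables from the table above; defaults are illustrative."""

    model_config = {"env_prefix": "POSTGRES_"}

    host: str = "postgresql"
    port: int = 5432
    user: str = "aiops"
    password: str = ""  # injected from a Kubernetes Secret, never hardcoded
    database: str = "aiops"

    @property
    def async_url(self) -> str:
        # asyncpg-style URL for SQLAlchemy's async engine.
        return (
            f"postgresql+asyncpg://{self.user}:{self.password}"
            f"@{self.host}:{self.port}/{self.database}"
        )

settings = PostgresSettings()  # reads the environment at instantiation
```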
For each sprint task, verify on sandbox:
- Code deployed to sandbox cluster
- Pod running without errors
- Health endpoint returning healthy
- API endpoints responding correctly
- Integration with other services working
- Logs show expected behavior
- No error events in pod events
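
Parts of this checklist can be automated with an integration test run against the sandbox (a sketch assuming `httpx` and the `SANDBOX_API_URL` convention from the commands above):

```python
import os

import httpx
import pytest

API_URL = os.environ.get("SANDBOX_API_URL", "")

# Skip the whole module when no sandbox URL is configured.
pytestmark = pytest.mark.skipif(not API_URL, reason="SANDBOX_API_URL not set")

def test_health_endpoint() -> None:
    # Mirrors: curl -s https://$API_URL/health
    resp = httpx.get(f"{API_URL}/health", timeout=10)
    assert resp.status_code == 200

def test_ready_endpoint() -> None:
    resp = httpx.get(f"{API_URL}/ready", timeout=10)
    assert resp.status_code == 200

def test_clusters_endpoint_rejects_anonymous() -> None:
    # Once Sprint 1 lands, unauthenticated requests must be rejected.
    resp = httpx.get(f"{API_URL}/api/v1/clusters", timeout=10)
    assert resp.status_code in {401, 403}
```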
Before starting any sprint:
- Read the sprint document completely
- Verify all dependency sprints are completed
- Read relevant spec sections referenced in the sprint
- Understand the issues being addressed (check `debug/issues.md`)
- Create a feature branch from `main`
- Set up the local development environment
CRITICAL: Execute sprints in this exact order. Do not skip or reorder.
```text
Phase 1: Security Foundation (BLOCKING)
├── Sprint 1: OAuth + RBAC + WebSocket Auth
└── Sprint 2: K8s Secrets + Credential Validation + Discovery
Phase 2: Observability Stack
├── Sprint 3: Prometheus Authentication
├── Sprint 4: Loki + Tempo Collectors
├── Sprint 5: Real GPU Telemetry
└── Sprint 6: CNF Monitoring (PTP, SR-IOV)
Phase 3: Real-Time & Intelligence
├── Sprint 7: WebSocket Hardening
└── Sprint 8: Anomaly Detection + RCA
Phase 4: Completion
├── Sprint 9: Reports + MCP Tools
└── Sprint 10: Chat Persistence + Tracing
```
For each task in a sprint:
1. Read the Task Section
   - Understand the file to create/modify
   - Note the spec references
2. Implement the Code
   - Copy the code from the sprint document
   - Adapt only if necessary (explain why in the commit)
   - Add any missing imports
3. Write Tests
   - Implement the test file provided
   - Add edge cases not covered
   - Ensure >80% coverage for the file
4. Verify Locally

   ```bash
   # Lint the code
   ruff check src/<service>/

   # Run tests
   pytest src/<service>/tests/test_<module>.py -v

   # Type check (optional but recommended)
   mypy src/<service>/<module>.py
   ```

5. Mark Task Complete
   - Update sprint progress tracking
   - Note any deviations or issues
After completing all tasks in a sprint:
- All acceptance criteria checked and passing
- All tests passing with >80% coverage
- No linting errors
- API documentation updated if endpoints changed
- Integration test with dependent services
- PR created and reviewed
Every service follows this directory structure:

```text
src/<service>/
├── __init__.py                  # Package init with version
├── main.py                      # FastAPI application entry
├── api/
│   ├── __init__.py
│   └── v1/
│       ├── __init__.py
│       └── <resource>.py        # API endpoints
├── services/
│   ├── __init__.py
│   └── <service>.py             # Business logic
├── clients/
│   ├── __init__.py
│   └── <client>.py              # External service clients
├── middleware/
│   ├── __init__.py
│   └── <middleware>.py          # Request/response middleware
├── models/
│   ├── __init__.py
│   └── <models>.py              # Service-specific models
└── tests/
    ├── __init__.py
    ├── conftest.py              # Pytest fixtures
    └── test_<module>.py         # Test files
```
Order imports in this sequence:

```python
# Standard library
from datetime import datetime
from typing import Optional
# Third-party
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel
# Shared package
from shared.models import Cluster
from shared.config import get_settings
from shared.observability import get_logger
# Local
from services.my_service import my_function
```

Use structured logging via the shared logger:

```python
from shared.observability import get_logger
logger = get_logger(__name__)
# Use structured logging
logger.info(
"Operation completed",
operation="create_cluster",
cluster_id=cluster_id,
duration_ms=duration,
)
# Log levels:
# - DEBUG: Detailed debugging (not in production)
# - INFO: Normal operations
# - WARNING: Unexpected but handled situations
# - ERROR: Failures requiring attention
```

Error handling:

```python
from fastapi import HTTPException, status
# Always use appropriate HTTP status codes
raise HTTPException(
status_code=status.HTTP_404_NOT_FOUND,
detail=f"Cluster {cluster_id} not found",
)
# Standard error responses
# 400 - Bad Request (invalid input)
# 401 - Unauthorized (missing/invalid auth)
# 403 - Forbidden (insufficient permissions)
# 404 - Not Found (resource doesn't exist)
# 500 - Internal Server Error (unexpected failure)
```

Async patterns:

```python
# Always use async for I/O operations
async def get_cluster(cluster_id: str) -> Cluster:
async with get_async_session() as db:
result = await db.execute(
select(ClusterORM).where(ClusterORM.id == cluster_id)
)
return result.scalar_one_or_none()
# Use asyncio.gather for concurrent operations
results = await asyncio.gather(
get_metrics(cluster_id),
get_logs(cluster_id),
get_traces(cluster_id),
)
```

Configuration access:

```python
from shared.config import get_settings
# Always use get_settings() - it's cached
settings = get_settings()
# Access nested settings
db_url = settings.database.async_url
redis_url = settings.redis.url
llm_provider = settings.llm.provider
```

Test file structure:

```python
"""Tests for <module>."""
import pytest
from unittest.mock import AsyncMock, MagicMock, patch
from fastapi import HTTPException

from <module> import <Class>, <function>

# Note: the async tests below assume pytest-asyncio (e.g., asyncio_mode = "auto").
@pytest.fixture
def sample_data():
"""Provide sample test data."""
return {...}
class TestClassName:
"""Tests for ClassName."""
async def test_method_success(self, sample_data):
"""Test successful method execution."""
# Arrange
...
# Act
result = await method(sample_data)
# Assert
assert result.status == "success"
async def test_method_failure(self):
"""Test method failure handling."""
with pytest.raises(HTTPException) as exc_info:
await method(invalid_data)
assert exc_info.value.status_code == 400
```

Coverage requirements:

| Category | Minimum Coverage |
|---|---|
| Services | 80% |
| API Endpoints | 80% |
| Middleware | 90% |
| Clients | 70% |
| Models | 100% |

Run tests with:

```bash
# Run all tests for a service
pytest src/<service>/tests/ -v
# Run with coverage
pytest src/<service>/tests/ --cov=src/<service> --cov-report=html
# Run specific test file
pytest src/<service>/tests/test_<module>.py -v
# Run tests matching pattern
pytest -k "test_auth" -vfeature/sprint-XX-brief-description
bugfix/issue-XXX-brief-description
hotfix/critical-issue-description
Commit message format:

```text
<type>(<scope>): <description>

[optional body]

[optional footer]
```

Types: `feat`, `fix`, `docs`, `style`, `refactor`, `test`, `chore`

Scope: service name or component

Example:

```text
feat(api-gateway): implement OAuth middleware

- Add OAuth 2.0 token validation
- Integrate with OpenShift OAuth provider
- Cache JWKS with 1-hour TTL

Resolves: ISSUE-010
```
PR description template:

```markdown
## Summary
Brief description of changes

## Issues Addressed
- ISSUE-XXX: Description
- ISSUE-YYY: Description

## Sprint Reference
Sprint X: [Sprint Name](debug/sprint/sprint-XX-name.md)

## Changes
- [ ] File 1: Description
- [ ] File 2: Description

## Testing
- [ ] Unit tests passing
- [ ] Integration tests passing
- [ ] Manual testing completed

## Acceptance Criteria
- [ ] Criteria 1 from sprint doc
- [ ] Criteria 2 from sprint doc

## Rollback Plan
Steps to rollback if issues arise
```

Every commit must pass:

```bash
# Linting
ruff check src/
# Formatting
black --check src/
# Type checking (if configured)
mypy src/
# Tests
pytest src/<service>/tests/
```

Before a branch can merge:

- All CI checks passing
- Code review approved (1+ approver)
- No merge conflicts
- Branch up to date with main
- All acceptance criteria verified
- Documentation updated if needed
Before marking a sprint complete:
- All tasks implemented
- All tests passing (>80% coverage)
- All acceptance criteria checked
- Integration tested with dependent services
- No regression in existing functionality
- PR merged to main
Import Errors:

```bash
# Ensure shared package is installed
pip install -e src/shared/
```

Database Connection Failed:

```bash
# Check PostgreSQL is running
docker-compose ps postgres
# Verify connection string
echo $POSTGRES_HOST $POSTGRES_PORT $POSTGRES_DATABASE
```

Redis Connection Failed:

```bash
# Check Redis is running
docker-compose ps redis
# Test connection
redis-cli -h localhost ping
```

OAuth Validation Failed:

```bash
# Verify OAuth issuer is accessible
curl -s $OAUTH_ISSUER/.well-known/oauth-authorization-server | jq
```

Debug mode:

```bash
# Run with debug logging
LOG_LEVEL=DEBUG python -m uvicorn main:app --reload
# Enable SQL query logging
SQLALCHEMY_ECHO=true python -m uvicorn main:app
```

If you get stuck:

- Check the relevant spec document
- Review the sprint document for implementation details
- Search existing issues in `debug/issues.md`
- Check `CLAUDE.md` for AI assistant context
- [Architecture Overview](/specs/00-overview.md)
- [Data Models](/specs/01-data-models.md)
- [Cluster Registry](/specs/02-cluster-registry.md)
- [Observability Collector](/specs/03-observability-collector.md)
- [Intelligence Engine](/specs/04-intelligence-engine.md)
- [Realtime Streaming](/specs/05-realtime-streaming.md)
- [API Gateway](/specs/06-api-gateway.md)
- [Integration Matrix](/specs/08-integration-matrix.md)
- [Deployment](/specs/09-deployment.md)

| Version | Date | Author | Changes |
|---|---|---|---|
| 1.0 | 2024-12-29 | Claude | Initial version |
Remember: The specs/ folder is sacred. When in doubt, read the spec.