AIOps NextGen - DevOps Instructions

This document is the authoritative guide for all development work on the AIOps NextGen platform. Follow these instructions precisely to keep the development, integration, and testing process consistent and high quality.

Project Context
Document Hierarchy
Development Principles
OpenShift Sandbox Environment
Sprint Execution Protocol
Code Implementation Standards
Testing Requirements
Git Workflow
Quality Gates
Troubleshooting
Reference Quick Links

1. Project Context

What is AIOps NextGen?

AIOps NextGen is an AI-driven observability platform for managing multiple OpenShift/Kubernetes clusters with specialized support for:

GPU workloads (AI/ML training and inference)
Telecom CNF (PTP synchronization, SR-IOV networking)
Federated observability (Prometheus, Loki, Tempo)
Intelligent analysis (anomaly detection, root cause analysis)

Current State

The platform has 5 microservices implemented as scaffolds with mock data. A code audit identified 23 issues (6 CRITICAL, 10 HIGH, 5 MEDIUM, 2 LOW) that must be resolved before production deployment.

Goal

Transform the scaffold implementation into production-ready code by executing 10 sprints that address all identified issues while maintaining strict adherence to the specification documents.

2. Document Hierarchy

Holy Grail Foundations (NEVER DEVIATE)

These documents define the truth. All implementation MUST conform to these specifications:

Document	Purpose	Location
`README.md`	Project overview and quick start	`/README.md`
`specs/00-overview.md`	Architecture vision, design principles	`/specs/00-overview.md`
`specs/01-data-models.md`	Pydantic model definitions	`/specs/01-data-models.md`
`specs/02-cluster-registry.md`	Cluster management service spec	`/specs/02-cluster-registry.md`
`specs/03-observability-collector.md`	Metrics/logs/traces collection spec	`/specs/03-observability-collector.md`
`specs/04-intelligence-engine.md`	AI/ML analysis service spec	`/specs/04-intelligence-engine.md`
`specs/05-realtime-streaming.md`	WebSocket streaming spec	`/specs/05-realtime-streaming.md`
`specs/06-api-gateway.md`	Gateway, auth, routing spec	`/specs/06-api-gateway.md`
`specs/07-frontend.md`	React UI specification	`/specs/07-frontend.md`
`specs/08-integration-matrix.md`	Service communication patterns	`/specs/08-integration-matrix.md`
`specs/09-deployment.md`	Kubernetes/Helm deployment	`/specs/09-deployment.md`

Debug Documents (Problem Definition)

Document	Purpose	Location
`debug/issues.md`	23 identified issues with severity	`/debug/issues.md`
`debug/bugfix-sprint-plan.md`	High-level sprint overview	`/debug/bugfix-sprint-plan.md`

Sprint Definitions (Implementation Guide)

Document	Focus Area	Location
`debug/sprint/README.md`	Sprint index and execution guide	`/debug/sprint/README.md`
`debug/sprint/sprint-01-*.md`	Security Foundation	`/debug/sprint/sprint-01-security-foundation.md`
`debug/sprint/sprint-02-*.md`	Kubernetes Integration	`/debug/sprint/sprint-02-kubernetes-integration.md`
`debug/sprint/sprint-03-*.md`	Prometheus Authentication	`/debug/sprint/sprint-03-prometheus-auth.md`
`debug/sprint/sprint-04-*.md`	Logs & Traces	`/debug/sprint/sprint-04-logs-traces.md`
`debug/sprint/sprint-05-*.md`	GPU Telemetry	`/debug/sprint/sprint-05-gpu-telemetry.md`
`debug/sprint/sprint-06-*.md`	CNF Monitoring	`/debug/sprint/sprint-06-cnf-monitoring.md`
`debug/sprint/sprint-07-*.md`	WebSocket Hardening	`/debug/sprint/sprint-07-websocket-hardening.md`
`debug/sprint/sprint-08-*.md`	Anomaly Detection & RCA	`/debug/sprint/sprint-08-anomaly-rca.md`
`debug/sprint/sprint-09-*.md`	Reports & MCP Tools	`/debug/sprint/sprint-09-reports-mcp.md`
`debug/sprint/sprint-10-*.md`	API Gateway Polish	`/debug/sprint/sprint-10-api-gateway-polish.md`

Development Guide

Document	Purpose	Location
`CLAUDE.md`	AI assistant context and commands	`/CLAUDE.md`

3. Development Principles

3.1 Specification Compliance

RULE: The specs/ folder is the source of truth.
      If code contradicts spec, the code is wrong.

Before implementing any feature:

Read the relevant spec section
Understand the data models from specs/01-data-models.md
Check integration patterns in specs/08-integration-matrix.md
Verify deployment requirements in specs/09-deployment.md

3.2 Air-Gapped Ready Design

RULE: All features MUST work in air-gapped environments.
      Local vLLM is preferred over external APIs.

Never hardcode external API dependencies
Always provide local alternatives
Use environment variables for all external URLs
Container images must be pullable from private registries
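
As a minimal sketch of the "environment variables for all external URLs" rule, the function below resolves the LLM endpoint from the environment and falls back to the in-cluster vLLM server. `LLM_PROVIDER` and `LLM_LOCAL_URL` come from section 4.9; `LLM_EXTERNAL_URL` and the function name are hypothetical and only illustrate the pattern.

```python
from collections.abc import Mapping


def resolve_llm_url(env: Mapping[str, str]) -> str:
    """Return the LLM endpoint, preferring the local vLLM server."""
    provider = env.get("LLM_PROVIDER", "local")
    if provider == "local":
        # The in-cluster default keeps the service usable when air-gapped.
        return env.get("LLM_LOCAL_URL", "http://vllm:8080/v1")
    # LLM_EXTERNAL_URL is a hypothetical variable name; the point is that
    # the URL itself is never hardcoded.
    url = env.get("LLM_EXTERNAL_URL")
    if not url:
        raise RuntimeError("LLM_EXTERNAL_URL must be set for non-local providers")
    return url
```

In a service this would typically be called as `resolve_llm_url(os.environ)`.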

3.3 Security First

RULE: Security is not optional. Every endpoint MUST be authenticated.
      Every operation MUST be authorized.

OAuth 2.0 via OpenShift for authentication
RBAC with admin/operator/viewer roles
Credentials stored in Kubernetes Secrets only
No secrets in environment variables or config files
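
The role check can be sketched as follows, assuming the three roles form a strict hierarchy (admin above operator above viewer). That ordering is an assumption made for illustration; the authoritative permission model is whatever specs/06-api-gateway.md defines.

```python
# Assumed hierarchy: a higher rank includes every lower rank's permissions.
ROLE_RANK = {"viewer": 0, "operator": 1, "admin": 2}


def is_authorized(user_role: str, required_role: str) -> bool:
    """Return True if user_role meets or exceeds required_role."""
    if required_role not in ROLE_RANK:
        raise ValueError(f"Unknown required role: {required_role}")
    # Unknown user roles are denied by default.
    rank = ROLE_RANK.get(user_role)
    return rank is not None and rank >= ROLE_RANK[required_role]
```

Denying unknown roles by default keeps a misconfigured token from gaining access silently.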

3.4 No Mock Data in Production Code

RULE: Remove ALL mock data patterns.
      Every function must interact with real systems.

Search for and eliminate:

# For sandbox testing
# Mock data
return [] # TODO
Hardcoded metric values
Fake GPU data
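
A small scanner can flag the comment-style markers listed above. This is a sketch: the regular expressions are one reasonable reading of those markers, and hardcoded metric values or fake GPU data still need manual review because they have no fixed textual signature.

```python
import re

# Patterns for the textual markers listed above (case-insensitive).
MOCK_PATTERNS = [
    re.compile(r"#\s*For sandbox testing", re.IGNORECASE),
    re.compile(r"#\s*Mock data", re.IGNORECASE),
    re.compile(r"return\s*\[\]\s*#\s*TODO", re.IGNORECASE),
]


def find_mock_markers(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for every line matching a marker."""
    hits = []
    for number, line in enumerate(source.splitlines(), start=1):
        if any(pattern.search(line) for pattern in MOCK_PATTERNS):
            hits.append((number, line.strip()))
    return hits
```

Running it over each file under `src/` (for example with `pathlib.Path.rglob("*.py")`) gives a checklist of lines to replace with real integrations.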

3.5 Observability Built-In

RULE: Every service must emit logs, metrics, and traces.

Structured JSON logging via structlog
Prometheus metrics on /metrics
OpenTelemetry traces to OTLP collector
Health endpoints on /health and /ready
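
The readiness decision behind `/ready` can be sketched as a pure function that a FastAPI handler would wrap. The response shape, the check names, and the use of 503 for a failed dependency are assumptions here, not taken from the specs.

```python
from http import HTTPStatus


def readiness(checks: dict[str, bool]) -> tuple[int, dict]:
    """Map dependency check results to an HTTP status code and JSON body."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    if failed:
        # Any failing dependency makes the service not ready for traffic.
        body = {"status": "not_ready", "failed": failed}
        return HTTPStatus.SERVICE_UNAVAILABLE.value, body
    return HTTPStatus.OK.value, {"status": "ready", "failed": []}
```

Keeping the decision separate from the web framework makes it straightforward to unit test without starting the service.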

4. OpenShift Sandbox Environment

4.1 Mandatory Sandbox Testing

RULE: ALL code changes MUST be tested on the OpenShift sandbox environment
      before being considered complete. Local-only testing is NOT sufficient.

The AIOps NextGen platform runs on OpenShift. Every sprint task must be validated against the live sandbox cluster to ensure:

Kubernetes API interactions work correctly
Service-to-service communication functions
Database and Redis connections are stable
OAuth/RBAC integrations are operational

4.2 Agent Sandbox Credential Protocol

IMPORTANT FOR AI AGENTS:

Before starting any development or testing work, the agent MUST:

Ask the user for sandbox environment details:

"To proceed with development and testing, I need access to the OpenShift sandbox.
Please provide:
1. Path to your kubeconfig file (e.g., /Users/fenar/projects/clusters/sandbox01/kubeconfig)
2. OpenShift API URL (e.g., https://api.sandbox01.narlabs.io:6443)
3. Project/namespace name (e.g., aiops-nextgen)

Once provided, I will verify cluster connectivity and pod status."

Store and use the credentials throughout the session
Never hardcode or persist credentials beyond the session

4.3 Sandbox Environment Setup

Once credentials are provided, set up the environment:

# Set the KUBECONFIG environment variable
export KUBECONFIG=/path/to/kubeconfig

# Verify cluster connectivity
oc whoami
oc project aiops-nextgen

# Check current pod status
oc get pods

# Expected services (all should be Running):
# - api-gateway-*
# - cluster-registry-*
# - intelligence-engine-*
# - observability-collector-*
# - realtime-streaming-*
# - postgresql-*
# - redis-*
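
The expected-services check can also be automated against the parsed output of `oc get pods -o json`. This is a sketch under the assumption that pod names start with the service name followed by a hyphen, as in the list above; the function name is hypothetical.

```python
EXPECTED_SERVICES = (
    "api-gateway",
    "cluster-registry",
    "intelligence-engine",
    "observability-collector",
    "realtime-streaming",
    "postgresql",
    "redis",
)


def missing_services(pod_list: dict) -> list[str]:
    """Return expected services that have no pod in the Running phase."""
    running = [
        item["metadata"]["name"]
        for item in pod_list.get("items", [])
        if item.get("status", {}).get("phase") == "Running"
    ]
    return [
        service
        for service in EXPECTED_SERVICES
        if not any(name.startswith(service + "-") for name in running)
    ]
```

Note that the `Running` phase alone does not guarantee a healthy container, so this complements rather than replaces the health endpoint checks.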

4.4 Service Health Verification

Before starting a new development phase or sprint, verify that all existing OpenShift artifacts have been cleaned up and that the aiops-nextgen namespace is a clean slate:

# Check deployment and pod status
oc get deployments
oc get pods -o wide

# Check service endpoints
oc get svc

# Verify routes are accessible
oc get routes

4.5 Common Pod Issues and Resolution

Status	Meaning	Resolution
`Running`	Healthy	No action needed
`CreateContainerConfigError`	Missing ConfigMap/Secret	Check `oc describe pod <name>` for missing refs
`CrashLoopBackOff`	Application crash	Check `oc logs <pod>` for errors
`ImagePullBackOff`	Cannot pull image	Verify image exists and registry credentials
`Pending`	Waiting for resources	Check node resources or PVC status
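
The table can be encoded as a lookup that turns one pod object (from `oc get pod <name> -o json`) into resolution hints. This is a sketch; the function and dictionary names are hypothetical, and it reads the waiting reason from `status.containerStatuses`, which is where Kubernetes reports states such as `CrashLoopBackOff`.

```python
RESOLUTIONS = {
    "CreateContainerConfigError": "Check `oc describe pod <name>` for missing refs",
    "CrashLoopBackOff": "Check `oc logs <pod>` for errors",
    "ImagePullBackOff": "Verify image exists and registry credentials",
}


def triage(pod: dict) -> list[str]:
    """Return resolution hints for a single pod object."""
    hints = []
    status = pod.get("status", {})
    for container in status.get("containerStatuses", []):
        # A container that is not waiting has no "waiting" key in its state.
        reason = container.get("state", {}).get("waiting", {}).get("reason")
        if reason in RESOLUTIONS:
            hints.append(f"{container['name']}: {RESOLUTIONS[reason]}")
    if not hints and status.get("phase") == "Pending":
        hints.append("Check node resources or PVC status")
    return hints
```

An empty list means none of the known failure modes was detected.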

Example: Fix CreateContainerConfigError

# Identify the issue
oc describe pod intelligence-engine-67f84c6675-fw2fn | grep -A5 "Events:"

# Common fix: Create missing secret
oc create secret generic llm-credentials \
  --from-literal=LLM_PROVIDER=local \
  --from-literal=LLM_LOCAL_URL=http://vllm:8080/v1

# Restart the deployment
oc rollout restart deployment/intelligence-engine

4.6 Deploying Code Changes to Sandbox

After implementing and locally testing code changes:

# 1. Build the new container image
cd src/<service>
docker build -t quay.io/aiops-nextgen/<service>:dev .

# 2. Push to registry (if using external registry)
docker push quay.io/aiops-nextgen/<service>:dev

# 3. Update the deployment image
oc set image deployment/<service> <service>=quay.io/aiops-nextgen/<service>:dev

# 4. Watch the rollout
oc rollout status deployment/<service>

# 5. Verify the new pod is running
oc get pods -l app=<service>

# 6. Check logs for startup errors
oc logs -l app=<service> --tail=100 -f

4.7 Integration Testing on Sandbox

After deploying changes, run integration tests:

# Get the route URL for the API gateway
export API_URL=$(oc get route api-gateway -o jsonpath='{.spec.host}')

# Test health endpoints
curl -s https://$API_URL/health | jq
curl -s https://$API_URL/ready | jq

# Test API endpoints (pass a bearer token; assumes the gateway accepts the OpenShift session token)
curl -s -H "Authorization: Bearer $(oc whoami -t)" https://$API_URL/api/v1/clusters | jq

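# Port-forward for direct service testing
oc port-forward svc/cluster-registry 8080:8080 &
PF_PID=$!
sleep 2  # give the tunnel a moment to open
curl -s http://localhost:8080/health | jq
kill $PF_PID  # stop the background port-forward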

# Run pytest against sandbox (with sandbox URL configured)
SANDBOX_API_URL=https://$API_URL pytest tests/integration/ -v
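
The curl checks above can be wrapped in a small smoke test that a pytest integration suite could reuse. This is a sketch: the function names are hypothetical, and the HTTP call is injected so the logic can be exercised without a live cluster.

```python
import urllib.error
import urllib.request
from collections.abc import Callable

ENDPOINTS = ("/health", "/ready", "/api/v1/clusters")


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Non-2xx responses raise HTTPError; the code is still what we want.
        return err.code


def smoke_check(base_url: str, fetch: Callable[[str], int] = http_status) -> dict[str, int]:
    """Return {path: status_code} for every endpoint that did not return 200."""
    base = base_url.rstrip("/")
    failures = {}
    for path in ENDPOINTS:
        code = fetch(base + path)
        if code != 200:
            failures[path] = code
    return failures
```

Against the sandbox this would be called as `smoke_check(os.environ["SANDBOX_API_URL"])`; an empty dictionary means all three endpoints answered 200. Authenticated endpoints would need a fetch function that adds the bearer token.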

4.8 Debugging on Sandbox

# Stream logs from a specific pod
oc logs -f deployment/<service>

# Get shell access to a pod
oc exec -it deployment/<service> -- /bin/sh

# Check environment variables
oc exec deployment/<service> -- env | sort

# Check mounted secrets/configmaps
oc exec deployment/<service> -- ls -la /etc/secrets/

# Debug networking
oc exec deployment/<service> -- curl -s http://cluster-registry:8080/health

# Check database connectivity
oc exec deployment/postgresql -- psql -U aiops -d aiops -c "SELECT 1;"

# Check Redis connectivity
oc exec deployment/redis -- redis-cli ping

4.9 Sandbox Environment Variables

Services require these environment variables (configured via ConfigMaps/Secrets):

Variable	Description	Example
`POSTGRES_HOST`	PostgreSQL hostname	`postgresql`
`POSTGRES_PORT`	PostgreSQL port	`5432`
`POSTGRES_USER`	Database user	`aiops`
`POSTGRES_PASSWORD`	Database password	(from secret)
`POSTGRES_DATABASE`	Database name	`aiops`
`REDIS_HOST`	Redis hostname	`redis`
`REDIS_PORT`	Redis port	`6379`
`OAUTH_ISSUER`	OpenShift OAuth URL	`https://oauth-openshift.apps...`
`LLM_PROVIDER`	LLM provider type	`local`
`LLM_LOCAL_URL`	vLLM server URL	`http://vllm:8080/v1`
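
A sketch of how a service might validate these variables at startup and assemble a database URL from them. The `postgresql+asyncpg` driver prefix and the function name are assumptions; the actual URL construction lives in the shared configuration package.

```python
from collections.abc import Mapping
from urllib.parse import quote_plus

REQUIRED = ("POSTGRES_HOST", "POSTGRES_USER", "POSTGRES_PASSWORD", "POSTGRES_DATABASE")


def build_postgres_url(env: Mapping[str, str]) -> str:
    """Build an async SQLAlchemy URL, failing fast on missing variables."""
    missing = [name for name in REQUIRED if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    port = env.get("POSTGRES_PORT", "5432")
    # Percent-encode credentials so characters like "@" do not break the URL.
    user = quote_plus(env["POSTGRES_USER"])
    password = quote_plus(env["POSTGRES_PASSWORD"])
    host = env["POSTGRES_HOST"]
    database = env["POSTGRES_DATABASE"]
    return f"postgresql+asyncpg://{user}:{password}@{host}:{port}/{database}"
```

Failing at startup with the list of missing names is usually easier to diagnose than a `CrashLoopBackOff` caused by a connection error later on.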

4.10 Sprint Task Completion Checklist (Sandbox)

For each sprint task, verify on sandbox:

Code deployed to sandbox cluster
Pod running without errors
Health endpoint returning healthy
API endpoints responding correctly
Integration with other services working
Logs show expected behavior
No error events in pod events

5. Sprint Execution Protocol

5.1 Pre-Sprint Checklist

Before starting any sprint:

Read the sprint document completely
Verify all dependency sprints are completed
Read relevant spec sections referenced in the sprint
Understand the issues being addressed (check debug/issues.md)
Create feature branch from main
Set up local development environment

5.2 Sprint Execution Order

CRITICAL: Execute sprints in this exact order. Do not skip or reorder.

Phase 1: Security Foundation (BLOCKING)
├── Sprint 1: OAuth + RBAC + WebSocket Auth
└── Sprint 2: K8s Secrets + Credential Validation + Discovery

Phase 2: Observability Stack
├── Sprint 3: Prometheus Authentication
├── Sprint 4: Loki + Tempo Collectors
├── Sprint 5: Real GPU Telemetry
└── Sprint 6: CNF Monitoring (PTP, SR-IOV)

Phase 3: Real-Time & Intelligence
├── Sprint 7: WebSocket Hardening
└── Sprint 8: Anomaly Detection + RCA

Phase 4: Completion
├── Sprint 9: Reports + MCP Tools
└── Sprint 10: Chat Persistence + Tracing
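
The ordering rule can be expressed as a guard that a progress-tracking script might use. This is an illustrative sketch; the helper names are hypothetical and only the sprint numbers and phase grouping come from the plan above.

```python
from typing import Optional

# Sprint numbers grouped by phase, in execution order.
PHASES = {
    "Phase 1": (1, 2),
    "Phase 2": (3, 4, 5, 6),
    "Phase 3": (7, 8),
    "Phase 4": (9, 10),
}
SPRINT_ORDER = tuple(sprint for sprints in PHASES.values() for sprint in sprints)


def next_sprint(completed: set[int]) -> Optional[int]:
    """Return the first sprint not yet completed, or None when all are done."""
    for sprint in SPRINT_ORDER:
        if sprint not in completed:
            return sprint
    return None


def can_start(sprint: int, completed: set[int]) -> bool:
    """A sprint may start only if every earlier sprint is complete."""
    return next_sprint(completed) == sprint
```

For example, `can_start(4, {1, 2})` is false because Sprint 3 has not been completed.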

5.3 Task Execution Within Sprint

For each task in a sprint:

Read the Task Section
- Understand the file to create/modify
- Note the spec references
Implement the Code
- Copy the code from the sprint document
- Adapt only if necessary (explain why in commit)
- Add any missing imports
Write Tests
- Implement the test file provided
- Add edge cases not covered
- Ensure >80% coverage for the file

Verify Locally

# Lint the code
ruff check src/<service>/

# Run tests
pytest src/<service>/tests/test_<module>.py -v

# Type check (optional but recommended)
mypy src/<service>/<module>.py

Mark Task Complete
- Update sprint progress tracking
- Note any deviations or issues

5.4 Post-Sprint Verification

After completing all tasks in a sprint:

All acceptance criteria checked and passing
All tests passing with >80% coverage
No linting errors
API documentation updated if endpoints changed
Integration test with dependent services
PR created and reviewed

6. Code Implementation Standards

6.1 File Structure

src/<service>/
├── __init__.py           # Package init with version
├── main.py               # FastAPI application entry
├── api/
│   ├── __init__.py
│   └── v1/
│       ├── __init__.py
│       └── <resource>.py # API endpoints
├── services/
│   ├── __init__.py
│   └── <service>.py      # Business logic
├── clients/
│   ├── __init__.py
│   └── <client>.py       # External service clients
├── middleware/
│   ├── __init__.py
│   └── <middleware>.py   # Request/response middleware
├── models/
│   ├── __init__.py
│   └── <models>.py       # Service-specific models
└── tests/
    ├── __init__.py
    ├── conftest.py       # Pytest fixtures
    └── test_<module>.py  # Test files

6.2 Import Order

# Standard library
from datetime import datetime
from typing import Optional

# Third-party
from fastapi import APIRouter, HTTPException
from pydantic import BaseModel

# Shared package
from shared.models import Cluster
from shared.config import get_settings
from shared.observability import get_logger

# Local
from services.my_service import my_function

6.3 Logging Standards

from shared.observability import get_logger

logger = get_logger(__name__)

# Use structured logging
logger.info(
    "Operation completed",
    operation="create_cluster",
    cluster_id=cluster_id,
    duration_ms=duration,
)

# Log levels:
# - DEBUG: Detailed debugging (not in production)
# - INFO: Normal operations
# - WARNING: Unexpected but handled situations
# - ERROR: Failures requiring attention

6.4 Error Handling

from fastapi import HTTPException, status

# Always use appropriate HTTP status codes
raise HTTPException(
    status_code=status.HTTP_404_NOT_FOUND,
    detail=f"Cluster {cluster_id} not found",
)

# Standard error responses
# 400 - Bad Request (invalid input)
# 401 - Unauthorized (missing/invalid auth)
# 403 - Forbidden (insufficient permissions)
# 404 - Not Found (resource doesn't exist)
# 500 - Internal Server Error (unexpected failure)

6.5 Async Patterns

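import asyncio
from typing import Optional

from sqlalchemy import select

# Always use async for I/O operations
async def get_cluster(cluster_id: str) -> Optional[Cluster]:
    # scalar_one_or_none() yields None when no row matches,
    # so the return type must be Optional.
    async with get_async_session() as db:
        result = await db.execute(
            select(ClusterORM).where(ClusterORM.id == cluster_id)
        )
        return result.scalar_one_or_none()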

# Use asyncio.gather for concurrent operations
results = await asyncio.gather(
    get_metrics(cluster_id),
    get_logs(cluster_id),
    get_traces(cluster_id),
)
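
A self-contained sketch of the `asyncio.gather` pattern with stub collectors standing in for the real ones. Passing `return_exceptions=True` is a design choice assumed here so that one failing backend does not discard the results of the others; the stub bodies and the `collect` helper are hypothetical.

```python
import asyncio


async def get_metrics(cluster_id: str) -> dict:
    await asyncio.sleep(0)  # stand-in for a real network call
    return {"cluster_id": cluster_id, "kind": "metrics"}


async def get_logs(cluster_id: str) -> dict:
    await asyncio.sleep(0)
    return {"cluster_id": cluster_id, "kind": "logs"}


async def get_traces(cluster_id: str) -> dict:
    await asyncio.sleep(0)
    raise ConnectionError("trace backend unreachable")  # simulated failure


async def collect(cluster_id: str) -> dict:
    """Fetch all three signals concurrently; failed ones map to None."""
    names = ("metrics", "logs", "traces")
    results = await asyncio.gather(
        get_metrics(cluster_id),
        get_logs(cluster_id),
        get_traces(cluster_id),
        return_exceptions=True,
    )
    # gather() returns results in the same order as its arguments.
    return {
        name: None if isinstance(result, Exception) else result
        for name, result in zip(names, results)
    }
```

Calling `asyncio.run(collect("c1"))` returns the metrics and logs payloads and `None` for traces.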

6.6 Configuration Access

from shared.config import get_settings

# Always use get_settings() - it's cached
settings = get_settings()

# Access nested settings
db_url = settings.database.async_url
redis_url = settings.redis.url
llm_provider = settings.llm.provider
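
One common way to get a cached `get_settings()` is `functools.lru_cache`. The sketch below illustrates that pattern with plain dataclasses; it is an assumption about the approach, not the actual implementation in `shared.config`.

```python
import os
from dataclasses import dataclass
from functools import lru_cache


@dataclass(frozen=True)
class RedisSettings:
    host: str
    port: int

    @property
    def url(self) -> str:
        return f"redis://{self.host}:{self.port}"


@dataclass(frozen=True)
class Settings:
    redis: RedisSettings


@lru_cache(maxsize=1)
def get_settings() -> Settings:
    """Read the environment once; later calls return the same object."""
    return Settings(
        redis=RedisSettings(
            host=os.environ.get("REDIS_HOST", "redis"),
            port=int(os.environ.get("REDIS_PORT", "6379")),
        )
    )
```

Because the result is cached, tests that change environment variables need to call `get_settings.cache_clear()` before reading settings again.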

7. Testing Requirements

7.1 Test Structure

"""Tests for <module>."""

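from unittest.mock import AsyncMock, MagicMock, patch

import pytest
from fastapi import HTTPException

from <module> import <Class>, <function>

# Run every async test in this module under pytest-asyncio
# (not needed if asyncio_mode = "auto" is configured).
pytestmark = pytest.mark.asyncio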


@pytest.fixture
def sample_data():
    """Provide sample test data."""
    return {...}


class TestClassName:
    """Tests for ClassName."""

    async def test_method_success(self, sample_data):
        """Test successful method execution."""
        # Arrange
        ...
        # Act
        result = await method(sample_data)
        # Assert
        assert result.status == "success"

    async def test_method_failure(self):
        """Test method failure handling."""
        with pytest.raises(HTTPException) as exc_info:
            await method(invalid_data)
        assert exc_info.value.status_code == 400

7.2 Coverage Requirements

Category	Minimum Coverage
Services	80%
API Endpoints	80%
Middleware	90%
Clients	70%
Models	100%
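
These minimums can be enforced in a CI step once per-category coverage figures are available. A sketch follows; how the percentages are measured per category is left open, and the function name is hypothetical.

```python
# Minimum coverage percentages from the table above.
MINIMUM_COVERAGE = {
    "Services": 80.0,
    "API Endpoints": 80.0,
    "Middleware": 90.0,
    "Clients": 70.0,
    "Models": 100.0,
}


def coverage_failures(measured: dict[str, float]) -> dict[str, tuple[float, float]]:
    """Return {category: (actual, minimum)} for categories below their minimum."""
    failures = {}
    for category, minimum in MINIMUM_COVERAGE.items():
        # A category with no measurement counts as 0% covered.
        actual = measured.get(category, 0.0)
        if actual < minimum:
            failures[category] = (actual, minimum)
    return failures
```

An empty result means every category meets its threshold.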

7.3 Running Tests

# Run all tests for a service
pytest src/<service>/tests/ -v

# Run with coverage
pytest src/<service>/tests/ --cov=src/<service> --cov-report=html

# Run specific test file
pytest src/<service>/tests/test_<module>.py -v

# Run tests matching pattern
pytest -k "test_auth" -v

8. Git Workflow

8.1 Branch Naming

feature/sprint-XX-brief-description
bugfix/issue-XXX-brief-description
hotfix/critical-issue-description
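
Branch names can be checked with a regular expression, for example in a pre-push hook. The sketch assumes `XX` means a two-digit sprint number, `XXX` a three-digit issue number, and that descriptions are lowercase kebab-case; none of these details are stated explicitly above.

```python
import re

_SLUG = r"[a-z0-9]+(?:-[a-z0-9]+)*"  # lowercase kebab-case description
BRANCH_RE = re.compile(
    rf"(?:feature/sprint-\d{{2}}-{_SLUG}"
    rf"|bugfix/issue-\d{{3}}-{_SLUG}"
    rf"|hotfix/{_SLUG})"
)


def is_valid_branch(name: str) -> bool:
    """Return True if name follows one of the three branch conventions."""
    return BRANCH_RE.fullmatch(name) is not None
```

For instance, `feature/sprint-01-oauth-middleware` is accepted while `feature/oauth` is rejected.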

8.2 Commit Messages

<type>(<scope>): <description>

[optional body]

[optional footer]

Types: feat, fix, docs, style, refactor, test, chore
Scope: service name or component

Example:

feat(api-gateway): implement OAuth middleware

- Add OAuth 2.0 token validation
- Integrate with OpenShift OAuth provider
- Cache JWKS with 1-hour TTL

Resolves: ISSUE-010
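
The header line of a commit message can be validated mechanically, for example in a `commit-msg` hook. This sketch treats the scope as mandatory and lowercase, which is an assumption based on the template; the function name is hypothetical.

```python
import re
from typing import Optional

COMMIT_TYPES = ("feat", "fix", "docs", "style", "refactor", "test", "chore")
HEADER_RE = re.compile(
    r"^(?P<type>" + "|".join(COMMIT_TYPES) + r")"
    r"\((?P<scope>[a-z0-9-]+)\): (?P<description>\S.*)$"
)


def parse_commit_header(message: str) -> Optional[dict[str, str]]:
    """Parse the first line of a commit message; return None if it is invalid."""
    lines = message.splitlines()
    first_line = lines[0] if lines else ""
    match = HEADER_RE.match(first_line)
    return match.groupdict() if match else None
```

Only the first line is checked; the optional body and footer are left free-form.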

8.3 Pull Request Template

## Summary
Brief description of changes

## Issues Addressed
- ISSUE-XXX: Description
- ISSUE-YYY: Description

## Sprint Reference
Sprint X: [Sprint Name](debug/sprint/sprint-XX-name.md)

## Changes
- [ ] File 1: Description
- [ ] File 2: Description

## Testing
- [ ] Unit tests passing
- [ ] Integration tests passing
- [ ] Manual testing completed

## Acceptance Criteria
- [ ] Criteria 1 from sprint doc
- [ ] Criteria 2 from sprint doc

## Rollback Plan
Steps to rollback if issues arise

9. Quality Gates

9.1 Pre-Commit Checks

Every commit must pass:

# Linting
ruff check src/

# Formatting
black --check src/

# Type checking (if configured)
mypy src/

# Tests
pytest src/<service>/tests/

9.2 PR Merge Requirements

9.3 Sprint Completion Gate

Before marking a sprint complete:

All tasks implemented
All tests passing (>80% coverage)
All acceptance criteria checked
Integration tested with dependent services
No regression in existing functionality
PR merged to main

10. Troubleshooting

10.1 Common Issues

Import Errors

# Ensure shared package is installed
pip install -e src/shared/

Database Connection Failed

# Check PostgreSQL is running
docker-compose ps postgres

# Verify connection string
echo $POSTGRES_HOST $POSTGRES_PORT $POSTGRES_DATABASE

Redis Connection Failed

# Check Redis is running
docker-compose ps redis

# Test connection
redis-cli -h localhost ping

OAuth Validation Failed

# Verify OAuth issuer is accessible
curl -s $OAUTH_ISSUER/.well-known/oauth-authorization-server | jq

10.2 Debug Mode

# Run with debug logging
LOG_LEVEL=DEBUG python -m uvicorn main:app --reload

# Enable SQL query logging
SQLALCHEMY_ECHO=true python -m uvicorn main:app

10.3 Getting Help

Check the relevant spec document
Review the sprint document for implementation details
Search existing issues in debug/issues.md
Check CLAUDE.md for AI assistant context

11. Reference Quick Links

Specifications

Debug & Sprint

Development

Main README

Revision History

Version	Date	Author	Changes
1.0	2024-12-29	Claude	Initial version

Remember: The specs/ folder is sacred. When in doubt, read the spec.
