AI SRE Agent


A node-local collector plus a controller-side incident runtime for Linux, Kubernetes, GPU, and AI infrastructure.

Keep raw collection close to the host, keep controller state bounded, and keep incident reasoning inspectable after the fact. The controller writes a versioned artifact chain for the incident workflow so an operator can inspect what each logical role observed, inferred, proposed, executed, and verified.

中文: README.zh-CN.md

Implementation anchors

The maintained path is built from these code surfaces:

  • C++ probe-core: cpp/probe_core/main.cpp main(), parseOptions(), collectSysinfoHost(), collectMem(), parseProcStat(), and gzipCompress()
  • C++ GPU sampling: cpp/probe_core/gpu_nvml.cpp CollectNVMLSamples() plus appendNVMLProcesses()
  • Go collector: backend/internal/collector/collector.go New(), buildCollectorInfo(), and parseExternalMetricCommand()
  • Go durability and transport: backend/internal/collector/spool/spool.go NewWithOptions() / writeOffset(), and backend/internal/collector/transport/client.go New(), validateBatch(), validateAck(), loadTLSCredentials()
  • Go controller runtime: backend/internal/controller/controller.go New(), backend/internal/controller/agentcore/agent.go NewQueryService(), and backend/internal/controller/agentcore/workflow_engine.go NewWorkflowEngine()
  • Durable workflow artifacts: backend/internal/controller/agentcore/analysis_handoff.go buildAnalysisHandoff(), workflow_artifacts.go buildWorkflowArtifactChain(), and workflow_orchestrator.go NewBoltDurableStore() / NewPostgresDurableStore()
  • Frontend stack: React 18 + Vite with @tanstack/react-query, axios, recharts, lucide-react, and zustand under frontend/src/

Data-plane source policy

The current collector is kernel-first where the repo can do that without inventing a new runtime:

  • host CPU scheduler counters: perf_event_open software counters in cpp/probe_core/main.cpp
  • per-process accounting: cpp/probe_core/process_kernel_collector.cpp using taskstats generic-netlink
  • interface/link counters: cpp/probe_core/network_kernel_collector.cpp using rtnetlink
  • socket queue state: cpp/probe_core/network_kernel_collector.cpp using sock_diag
  • runtime event flow: the probe-core eBPF socket path now accepts a versioned binary record format in cpp/probe_core/kernel_event_protocol.cpp, with JSON fallback for compatibility
  • GPU: cpp/probe_core/gpu_nvml.cpp through NVML only; the probe-core hot path no longer shells out to nvidia-smi

The remaining file-based paths are explicit fallback or cold-path mechanisms:

  • PSI: /proc/pressure/*
  • cgroup stats: /sys/fs/cgroup/...
  • disk counters and queue attributes: /sys/block/* with /proc/diskstats fallback
  • process reconciliation and top-row enrichment: periodic /proc scans plus /proc/<pid>/smaps_rollup and /proc/<pid>/fd
  • hardware discovery: infrequent /proc and /sys reads under hardware.refresh_interval
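As a concrete example of the cold-path file format, PSI records under /proc/pressure/* use a fixed key=value layout. A minimal parsing sketch (the `psiLine` struct and `parsePSILine` function are illustrative, not the repo's actual code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// psiLine holds one "some" or "full" record from /proc/pressure/<resource>.
type psiLine struct {
	Kind         string  // "some" or "full"
	Avg10        float64 // rolling stall percentage, 10s window
	Avg60        float64 // 60s window
	Avg300       float64 // 300s window
	TotalStallUs uint64  // cumulative stall time in microseconds
}

// parsePSILine parses a line such as:
//   some avg10=1.23 avg60=0.50 avg300=0.10 total=123456
func parsePSILine(line string) (psiLine, error) {
	fields := strings.Fields(line)
	if len(fields) != 5 {
		return psiLine{}, fmt.Errorf("unexpected PSI line: %q", line)
	}
	p := psiLine{Kind: fields[0]}
	for _, f := range fields[1:] {
		k, v, ok := strings.Cut(f, "=")
		if !ok {
			return psiLine{}, fmt.Errorf("bad PSI field: %q", f)
		}
		switch k {
		case "avg10":
			p.Avg10, _ = strconv.ParseFloat(v, 64)
		case "avg60":
			p.Avg60, _ = strconv.ParseFloat(v, 64)
		case "avg300":
			p.Avg300, _ = strconv.ParseFloat(v, 64)
		case "total":
			p.TotalStallUs, _ = strconv.ParseUint(v, 10, 64)
		}
	}
	return p, nil
}
```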

Privilege reality in this tree:

  • CAP_BPF or CAP_SYS_ADMIN still gates the primary eBPF path
  • CAP_PERFMON or CAP_SYS_ADMIN is the clean path for perf-based host counters
  • CAP_NET_ADMIN or CAP_SYS_ADMIN is the expected privilege boundary for the taskstats/sock_diag process path
  • when those capabilities are missing, the collector stays up and falls back instead of pretending the kernel path exists
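The fallback decision can be sketched as a pure function over the capability set. The names below (`capSet`, `processPath`, `hostCounterPath`) are hypothetical; the real collector discovers capabilities from the kernel at startup:

```go
package main

// capSet is an illustrative view of the capabilities granted to the probe.
type capSet struct {
	BPF, Perfmon, NetAdmin, SysAdmin bool
}

// processPath picks the per-process accounting source: the taskstats
// generic-netlink path when privileged, otherwise the /proc scan fallback.
// The collector stays up either way rather than pretending the kernel
// path exists.
func processPath(c capSet) string {
	if c.NetAdmin || c.SysAdmin {
		return "taskstats"
	}
	return "procfs-fallback"
}

// hostCounterPath picks the host CPU counter source the same way.
func hostCounterPath(c capSet) string {
	if c.Perfmon || c.SysAdmin {
		return "perf_event_open"
	}
	return "proc-stat-fallback"
}
```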

Why this exists

Most SRE automation demos fail in the same places:

  • they assume telemetry is complete and always fresh
  • they treat retries as free
  • they hide reasoning inside one large in-memory object
  • they blur advisory output and executable action
  • they make recovery impossible once the controller restarts

This repo is built for the opposite constraints:

  • collector-side evidence can be delayed, dropped, or replayed
  • controller memory and file descriptors are bounded resources
  • execution needs policy, approval, idempotency, and post-action verification
  • operators need compact artifacts they can inspect under pressure

Runtime shape

```mermaid
flowchart LR
    subgraph Host[Observed host]
      P[probe-core / eBPF / helpers]
      C[collector]
      S[disk spool]
      P --> C --> S
    end

    subgraph Controller[controller]
      I[ingest]
      H[bounded hot state]
      O[observer role]
      A[analyzer role]
      R[planner role]
      G[policy gate]
      X[executor role]
      V[verifier role]
      M[memory role]
      U[HTTP API / UI]

      I --> H --> O --> A --> R --> G --> X --> V --> M --> U
    end

    S --> I
```

The controller still runs these logical agents in one process. The important boundary is the artifact contract, not the process boundary. Each stage emits a compact record with:

  • schema version
  • producer and consumer
  • workflow / incident / correlation IDs
  • timestamps and status
  • input artifact IDs
  • evidence references
  • replay flags

The chain lives inside the RCA evidence package and is exposed through the workflow APIs.

Logical agents and ownership

| Role | Owns | Reads | Writes | May change live state? |
|---|---|---|---|---|
| observer | current window summary | collector snapshots, bounded history | observation artifact | no |
| analyzer | anomaly grouping and RCA ranking | observation artifact, evidence refs | anomaly + hypothesis artifacts | no |
| planner | remediation proposals | hypothesis artifact, recommendations | proposal artifact | no |
| policy gate | execution eligibility | proposal artifact, controller policy | execution-plan artifact | no |
| executor | governed tool calls | execution-plan artifact | execution-result artifact | only when posture and approval allow it |
| verifier | before/after effect check | execution result, fresh evidence | verification artifact | no |
| memory | final incident record | full chain | final incident artifact, incident memory | no |

The old analysis_agent and validation_action_agent code paths still exist. The artifact chain narrows their logical roles without pretending they are separate daemons.
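The ownership table reduces to a small write-permission check. A sketch under the assumption that role names are plain strings (the repo's internal role identifiers may differ):

```go
package main

// mayChangeLiveState records which logical roles are ever allowed to touch
// live state. Only the executor can, and even then only when posture and
// approval allow it; everything else is read-and-emit.
var mayChangeLiveState = map[string]bool{
	"observer":    false,
	"analyzer":    false,
	"planner":     false,
	"policy_gate": false,
	"executor":    true,
	"verifier":    false,
	"memory":      false,
}

// writable reports whether a role may ever mutate live state.
// Unknown roles default to false, the safe direction.
func writable(role string) bool {
	return mayChangeLiveState[role]
}
```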

Artifact chain

The incident workflow now emits these stage artifacts in order:

  1. observation_summary
  2. anomaly_finding
  3. root_cause_hypothesis
  4. remediation_proposal
  5. execution_plan
  6. execution_result
  7. verification_result
  8. incident_report

Each artifact is compact by design. Raw telemetry stays out of the handoff. The artifact carries evidence IDs and a short list of raw references so downstream stages can reload details without copying large payloads through every step.

See docs/en/42-agent-artifacts-and-handoffs.md for the concrete schema.
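A chain validator over the stage order above is small enough to sketch inline. The `inOrder` helper is illustrative, assuming a run may stop early (for example, at the policy gate) but never reorder stages:

```go
package main

// stageOrder is the canonical artifact chain order described above.
var stageOrder = []string{
	"observation_summary",
	"anomaly_finding",
	"root_cause_hypothesis",
	"remediation_proposal",
	"execution_plan",
	"execution_result",
	"verification_result",
	"incident_report",
}

// inOrder reports whether the observed stages appear as an order-preserving
// subsequence of the canonical chain.
func inOrder(observed []string) bool {
	i := 0
	for _, s := range observed {
		// advance through the canonical order until we match s
		for i < len(stageOrder) && stageOrder[i] != s {
			i++
		}
		if i == len(stageOrder) {
			return false // s is out of order or unknown
		}
		i++
	}
	return true
}
```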

Deterministic boundary

Model output can influence hypotheses and suggestions. It does not decide execution.

Execution is still gated by controller code:

  • actuator safety classification
  • policy status
  • approval state
  • idempotency key reuse
  • post-action verification
  • optional rollback handling

Default posture remains conservative:

  • dry-run on
  • approval required
  • impactful and destructive paths blocked
  • validation defaults to read-only
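The gate logic above can be sketched as a deterministic function with no model input. The `posture` type, `allowExecution` signature, and safety labels are illustrative, not the controller's actual API:

```go
package main

// posture mirrors the default execution posture described above.
type posture struct {
	DryRun           bool
	ApprovalRequired bool
}

// allowExecution is a sketch of the deterministic boundary: model output can
// propose actions, but only controller code like this decides execution.
// safety is the actuator classification: "read_only", "impactful", or
// "destructive".
func allowExecution(p posture, approved bool, safety string) bool {
	if safety == "impactful" || safety == "destructive" {
		return false // blocked regardless of approval in the default posture
	}
	if p.DryRun {
		return false // dry-run proposes, never executes
	}
	if p.ApprovalRequired && !approved {
		return false // approval state gates everything else
	}
	return true
}
```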

Resource model

The code is written around bounded cost, not idealized throughput.

  • Memory: controller hot state and evidence references are bounded; artifact payloads are compact summaries, not telemetry dumps.
  • FD usage: the collector spools to disk, the controller persists artifacts through the artifact manager, and the workflow avoids keeping per-incident files open longer than a single write or read path.
  • Concurrency: action execution is explicitly bounded. Validation loops run under tool-call and iteration budgets.
  • Queue growth: replay and spool paths are bounded and visible. The workflow artifact chain does not create an unbounded side queue.
  • Serialization cost: the artifact chain is small enough to ship inside the evidence package and cheap enough to decode during debugging.
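The bounded-concurrency point is the standard buffered-channel semaphore pattern in Go. A minimal sketch (not the controller's actual executor code):

```go
package main

import "sync"

// runBounded executes tasks with at most limit running concurrently, using a
// buffered channel as a semaphore -- the same bounded-cost idea the
// controller applies to action execution.
func runBounded(limit int, tasks []func()) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, t := range tasks {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit workers are in flight
		go func(task func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			task()
		}(t)
	}
	wg.Wait()
}
```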

Failure model

Things that can and do go wrong:

  • stale or partial telemetry
  • controller restart during an incident
  • action proposal without enough rollback data
  • verification that cannot prove improvement because the evidence window is weak
  • duplicate requests for the same incident shape

The current design handles those cases by preserving state, surfacing uncertainty, and preferring proposal-only over unsafe execution.

Observability and operator surfaces

Useful endpoints during an incident:

  • GET /api/v1/agent/rca
  • GET /api/v1/agent/workflow/runs
  • GET /api/v1/agent/workflow/runs/{run_id}
  • GET /api/v1/agent/workflow/evidence/{run_id}
  • GET /api/v1/agent/workflow/audit
  • GET /api/v1/status
  • GET /api/v1/ingest/status
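The two per-run endpoints take the run ID as a path segment. A trivial URL-building sketch (the base address is an assumption; use whatever address the controller listens on):

```go
package main

import "fmt"

// runEndpoints builds the per-run inspection URLs from the endpoint list
// above. base is the controller address, e.g. "http://controller:8080"
// (illustrative; the actual listen address is deployment-specific).
func runEndpoints(base, runID string) (detail, evidence string) {
	detail = fmt.Sprintf("%s/api/v1/agent/workflow/runs/%s", base, runID)
	evidence = fmt.Sprintf("%s/api/v1/agent/workflow/evidence/%s", base, runID)
	return
}
```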

Useful files on disk:

  • data/agent/workflow_runs.db
  • data/agent/workflows/messages/<run_id>/
  • data/agent/workflows/evidence/<run_id>/package.json
  • artifact-manager metadata and payload roots

Concrete code paths behind those surfaces:

  • HTTP/UI routes are registered from backend/internal/controller/controller.go
  • query/RCA output is assembled by backend/internal/controller/agentcore/agent.go
  • durable artifact manifests come from backend/internal/controller/agentcore/workflow_artifacts.go

Deployment boundary

This repo does not assume one controller forever.

  • run metadata can move to Postgres
  • artifact metadata can move to a shared backend
  • payloads can move from filesystem to S3
  • hot state is still local to one active writer
  • HA followers still reject ingest writes

That means durability is better than it was, but the system is not yet a fully distributed workflow runtime.

Read next

About

AI SRE Agent: A push-based observability and ops platform for Linux/GPU AI infra. Collects signals via eBPF, analyzes joint risks and RCA with hybrid LLM workflows, and provides guarded actions to reduce MTTR. Built with Go, C++, TS. Open-source public edition.
