AI SRE Agent


A node-local collector plus a controller-side incident runtime for Linux, Kubernetes, GPU, and AI infrastructure.

Keep raw collection close to the host, keep controller state bounded, and keep incident reasoning inspectable after the fact. The controller writes a versioned artifact chain for the incident workflow so an operator can inspect what each logical role observed, inferred, proposed, executed, and verified.

中文: README.zh-CN.md

Implementation anchors

The maintained path is built from these code surfaces:

  • C++ probe-core: cpp/probe_core/main.cpp main(), parseOptions(), collectSysinfoHost(), collectMem(), parseProcStat(), and gzipCompress()
  • C++ GPU sampling: cpp/probe_core/gpu_nvml.cpp CollectNVMLSamples() plus appendNVMLProcesses()
  • Go collector: backend/internal/collector/collector.go New(), buildCollectorInfo(), and parseExternalMetricCommand()
  • Go durability and transport: backend/internal/collector/spool/spool.go NewWithOptions() / writeOffset(), and backend/internal/collector/transport/client.go New(), validateBatch(), validateAck(), loadTLSCredentials()
  • Go controller runtime: backend/internal/controller/controller.go New(), backend/internal/controller/agentcore/agent.go NewQueryService(), and backend/internal/controller/agentcore/workflow_engine.go NewWorkflowEngine()
  • Durable workflow artifacts: backend/internal/controller/agentcore/analysis_handoff.go buildAnalysisHandoff(), workflow_artifacts.go buildWorkflowArtifactChain(), and workflow_orchestrator.go NewBoltDurableStore() / NewPostgresDurableStore()
  • Frontend stack: React 18 + Vite with @tanstack/react-query, axios, recharts, lucide-react, and zustand under frontend/src/

Data-plane source policy

The current collector is kernel-first where the repo can do that without inventing a new runtime:

  • host CPU scheduler counters: perf_event_open software counters in cpp/probe_core/main.cpp
  • per-process accounting: cpp/probe_core/process_kernel_collector.cpp using taskstats generic-netlink
  • interface/link counters: cpp/probe_core/network_kernel_collector.cpp using rtnetlink
  • socket queue state: cpp/probe_core/network_kernel_collector.cpp using sock_diag
  • runtime event flow: the probe-core eBPF socket path now accepts a versioned binary record format in cpp/probe_core/kernel_event_protocol.cpp, with JSON fallback for compatibility
  • GPU: cpp/probe_core/gpu_nvml.cpp through NVML only; the probe-core hot path no longer shells out to nvidia-smi

The remaining file-based paths are explicit fallback or cold-path mechanisms:

  • PSI: /proc/pressure/*
  • cgroup stats: /sys/fs/cgroup/...
  • disk counters and queue attributes: /sys/block/* with /proc/diskstats fallback
  • process reconciliation and top-row enrichment: periodic /proc scans plus /proc/<pid>/smaps_rollup and /proc/<pid>/fd
  • hardware discovery: infrequent /proc and /sys reads under hardware.refresh_interval
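As a concrete example of the cold-path file format, PSI records under /proc/pressure/* use a fixed key=value layout. A minimal parsing sketch (the `psiLine` struct and `parsePSILine` function are illustrative, not the repo's actual code):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// psiLine holds one "some" or "full" record from /proc/pressure/<resource>.
type psiLine struct {
	Kind         string  // "some" or "full"
	Avg10        float64 // rolling stall percentage, 10s window
	Avg60        float64 // 60s window
	Avg300       float64 // 300s window
	TotalStallUs uint64  // cumulative stall time in microseconds
}

// parsePSILine parses a line such as:
//   some avg10=1.23 avg60=0.50 avg300=0.10 total=123456
func parsePSILine(line string) (psiLine, error) {
	fields := strings.Fields(line)
	if len(fields) != 5 {
		return psiLine{}, fmt.Errorf("unexpected PSI line: %q", line)
	}
	p := psiLine{Kind: fields[0]}
	for _, f := range fields[1:] {
		k, v, ok := strings.Cut(f, "=")
		if !ok {
			return psiLine{}, fmt.Errorf("bad PSI field: %q", f)
		}
		switch k {
		case "avg10":
			p.Avg10, _ = strconv.ParseFloat(v, 64)
		case "avg60":
			p.Avg60, _ = strconv.ParseFloat(v, 64)
		case "avg300":
			p.Avg300, _ = strconv.ParseFloat(v, 64)
		case "total":
			p.TotalStallUs, _ = strconv.ParseUint(v, 10, 64)
		}
	}
	return p, nil
}
```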

Privilege reality in this tree:

  • CAP_BPF or CAP_SYS_ADMIN still gates the primary eBPF path
  • CAP_PERFMON or CAP_SYS_ADMIN is the clean path for perf-based host counters
  • CAP_NET_ADMIN or CAP_SYS_ADMIN is the expected privilege boundary for the taskstats/sock_diag process path
  • when those capabilities are missing, the collector stays up and falls back instead of pretending the kernel path exists
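The fallback decision can be sketched as a pure function over the capability set. The names below (`capSet`, `processPath`, `hostCounterPath`) are hypothetical; the real collector discovers capabilities from the kernel at startup:

```go
package main

// capSet is an illustrative view of the capabilities granted to the probe.
type capSet struct {
	BPF, Perfmon, NetAdmin, SysAdmin bool
}

// processPath picks the per-process accounting source: the taskstats
// generic-netlink path when privileged, otherwise the /proc scan fallback.
// The collector stays up either way rather than pretending the kernel
// path exists.
func processPath(c capSet) string {
	if c.NetAdmin || c.SysAdmin {
		return "taskstats"
	}
	return "procfs-fallback"
}

// hostCounterPath picks the host CPU counter source the same way.
func hostCounterPath(c capSet) string {
	if c.Perfmon || c.SysAdmin {
		return "perf_event_open"
	}
	return "proc-stat-fallback"
}
```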

Why this exists

Most SRE automation demos fail in the same places:

  • they assume telemetry is complete and always fresh
  • they treat retries as free
  • they hide reasoning inside one large in-memory object
  • they blur advisory output and executable action
  • they make recovery impossible once the controller restarts

This repo is built for the opposite constraints:

  • collector-side evidence can be delayed, dropped, or replayed
  • controller memory and file descriptors are bounded resources
  • execution needs policy, approval, idempotency, and post-action verification
  • operators need compact artifacts they can inspect under pressure

Runtime shape

```mermaid
flowchart LR
    subgraph Host[Observed host]
      P[probe-core / eBPF / helpers]
      C[collector]
      S[disk spool]
      P --> C --> S
    end

    subgraph Controller[controller]
      I[ingest]
      H[bounded hot state]
      O[observer role]
      A[analyzer role]
      R[planner role]
      G[policy gate]
      X[executor role]
      V[verifier role]
      M[memory role]
      U[HTTP API / UI]

      I --> H --> O --> A --> R --> G --> X --> V --> M --> U
    end

    S --> I
```

The controller still runs these logical agents in one process. The important boundary is the artifact contract, not the process boundary. Each stage emits a compact record with:

  • schema version
  • producer and consumer
  • workflow / incident / correlation IDs
  • timestamps and status
  • input artifact IDs
  • evidence references
  • replay flags

The chain lives inside the RCA evidence package and is exposed through the workflow APIs.

Logical agents and ownership

| Role | Owns | Reads | Writes | May change live state? |
|---|---|---|---|---|
| observer | current window summary | collector snapshots, bounded history | observation artifact | no |
| analyzer | anomaly grouping and RCA ranking | observation artifact, evidence refs | anomaly + hypothesis artifacts | no |
| planner | remediation proposals | hypothesis artifact, recommendations | proposal artifact | no |
| policy gate | execution eligibility | proposal artifact, controller policy | execution-plan artifact | no |
| executor | governed tool calls | execution-plan artifact | execution-result artifact | only when posture and approval allow it |
| verifier | before/after effect check | execution result, fresh evidence | verification artifact | no |
| memory | final incident record | full chain | final incident artifact, incident memory | no |

The old analysis_agent and validation_action_agent code paths still exist. The artifact chain narrows their logical roles without pretending they are separate daemons.
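The ownership table reduces to a small write-permission check. A sketch under the assumption that role names are plain strings (the repo's internal role identifiers may differ):

```go
package main

// mayChangeLiveState records which logical roles are ever allowed to touch
// live state. Only the executor can, and even then only when posture and
// approval allow it; everything else is read-and-emit.
var mayChangeLiveState = map[string]bool{
	"observer":    false,
	"analyzer":    false,
	"planner":     false,
	"policy_gate": false,
	"executor":    true,
	"verifier":    false,
	"memory":      false,
}

// writable reports whether a role may ever mutate live state.
// Unknown roles default to false, the safe direction.
func writable(role string) bool {
	return mayChangeLiveState[role]
}
```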

Artifact chain

The incident workflow now emits these stage artifacts in order:

  1. observation_summary
  2. anomaly_finding
  3. root_cause_hypothesis
  4. remediation_proposal
  5. execution_plan
  6. execution_result
  7. verification_result
  8. incident_report

Each artifact is compact by design. Raw telemetry stays out of the handoff. The artifact carries evidence IDs and a short list of raw references so downstream stages can reload details without copying large payloads through every step.

See docs/en/42-agent-artifacts-and-handoffs.md for the concrete schema.
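A chain validator over the stage order above is small enough to sketch inline. The `inOrder` helper is illustrative, assuming a run may stop early (for example, at the policy gate) but never reorder stages:

```go
package main

// stageOrder is the canonical artifact chain order described above.
var stageOrder = []string{
	"observation_summary",
	"anomaly_finding",
	"root_cause_hypothesis",
	"remediation_proposal",
	"execution_plan",
	"execution_result",
	"verification_result",
	"incident_report",
}

// inOrder reports whether the observed stages appear as an order-preserving
// subsequence of the canonical chain.
func inOrder(observed []string) bool {
	i := 0
	for _, s := range observed {
		// advance through the canonical order until we match s
		for i < len(stageOrder) && stageOrder[i] != s {
			i++
		}
		if i == len(stageOrder) {
			return false // s is out of order or unknown
		}
		i++
	}
	return true
}
```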

Deterministic boundary

Model output can influence hypotheses and suggestions. It does not decide execution.

Execution is still gated by controller code:

  • actuator safety classification
  • policy status
  • approval state
  • idempotency key reuse
  • post-action verification
  • optional rollback handling

Default posture remains conservative:

  • dry-run on
  • approval required
  • impactful and destructive paths blocked
  • validation defaults to read-only
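The gate logic above can be sketched as a deterministic function with no model input. The `posture` type, `allowExecution` signature, and safety labels are illustrative, not the controller's actual API:

```go
package main

// posture mirrors the default execution posture described above.
type posture struct {
	DryRun           bool
	ApprovalRequired bool
}

// allowExecution is a sketch of the deterministic boundary: model output can
// propose actions, but only controller code like this decides execution.
// safety is the actuator classification: "read_only", "impactful", or
// "destructive".
func allowExecution(p posture, approved bool, safety string) bool {
	if safety == "impactful" || safety == "destructive" {
		return false // blocked regardless of approval in the default posture
	}
	if p.DryRun {
		return false // dry-run proposes, never executes
	}
	if p.ApprovalRequired && !approved {
		return false // approval state gates everything else
	}
	return true
}
```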

Resource model

The code is written around bounded cost, not idealized throughput.

  • Memory: controller hot state and evidence references are bounded; artifact payloads are compact summaries, not telemetry dumps.
  • FD usage: the collector spools to disk, the controller persists artifacts through the artifact manager, and the workflow avoids keeping per-incident files open longer than a single write or read path.
  • Concurrency: action execution is explicitly bounded. Validation loops run under tool-call and iteration budgets.
  • Queue growth: replay and spool paths are bounded and visible. The workflow artifact chain does not create an unbounded side queue.
  • Serialization cost: the artifact chain is small enough to ship inside the evidence package and cheap enough to decode during debugging.
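The bounded-concurrency point is the standard buffered-channel semaphore pattern in Go. A minimal sketch (not the controller's actual executor code):

```go
package main

import "sync"

// runBounded executes tasks with at most limit running concurrently, using a
// buffered channel as a semaphore -- the same bounded-cost idea the
// controller applies to action execution.
func runBounded(limit int, tasks []func()) {
	sem := make(chan struct{}, limit)
	var wg sync.WaitGroup
	for _, t := range tasks {
		wg.Add(1)
		sem <- struct{}{} // blocks while limit workers are in flight
		go func(task func()) {
			defer wg.Done()
			defer func() { <-sem }() // release the slot
			task()
		}(t)
	}
	wg.Wait()
}
```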

Failure model

Things that can and do go wrong:

  • stale or partial telemetry
  • controller restart during an incident
  • action proposal without enough rollback data
  • verification that cannot prove improvement because the evidence window is weak
  • duplicate requests for the same incident shape

The current design handles those cases by preserving state, surfacing uncertainty, and preferring proposal-only over unsafe execution.

Observability and operator surfaces

Useful endpoints during an incident:

  • GET /api/v1/agent/rca
  • GET /api/v1/agent/workflow/runs
  • GET /api/v1/agent/workflow/runs/{run_id}
  • GET /api/v1/agent/workflow/evidence/{run_id}
  • GET /api/v1/agent/workflow/audit
  • GET /api/v1/status
  • GET /api/v1/ingest/status
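The two per-run endpoints take the run ID as a path segment. A trivial URL-building sketch (the base address is an assumption; use whatever address the controller listens on):

```go
package main

import "fmt"

// runEndpoints builds the per-run inspection URLs from the endpoint list
// above. base is the controller address, e.g. "http://controller:8080"
// (illustrative; the actual listen address is deployment-specific).
func runEndpoints(base, runID string) (detail, evidence string) {
	detail = fmt.Sprintf("%s/api/v1/agent/workflow/runs/%s", base, runID)
	evidence = fmt.Sprintf("%s/api/v1/agent/workflow/evidence/%s", base, runID)
	return
}
```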

Useful files on disk:

  • data/agent/workflow_runs.db
  • data/agent/workflows/messages/<run_id>/
  • data/agent/workflows/evidence/<run_id>/package.json
  • artifact-manager metadata and payload roots

Concrete code paths behind those surfaces:

  • HTTP/UI routes are registered from backend/internal/controller/controller.go
  • query/RCA output is assembled by backend/internal/controller/agentcore/agent.go
  • durable artifact manifests come from backend/internal/controller/agentcore/workflow_artifacts.go

Deployment boundary

This repo does not assume one controller forever.

  • run metadata can move to Postgres
  • artifact metadata can move to a shared backend
  • payloads can move from filesystem to S3
  • hot state is still local to one active writer
  • HA followers still reject ingest writes

That means durability is better than it was, but the system is not yet a fully distributed workflow runtime.

Read next

About

AI SRE Agent: A push-based observability and ops platform for Linux/GPU AI infra. Collects signals via eBPF, analyzes joint risks and RCA with hybrid LLM workflows, and provides guarded actions to reduce MTTR. Built with Go, C++, TS. Open-source public edition.
