A node-local collector plus a controller-side incident runtime for Linux, Kubernetes, GPU, and AI infrastructure.
Keep raw collection close to the host, keep controller state bounded, and keep incident reasoning inspectable after the fact. The controller writes a versioned artifact chain for the incident workflow so an operator can inspect what each logical role observed, inferred, proposed, executed, and verified.
The maintained path is built from these code surfaces:
- C++ probe-core: `cpp/probe_core/main.cpp` (`main()`, `parseOptions()`, `collectSysinfoHost()`, `collectMem()`, `parseProcStat()`, and `gzipCompress()`)
- C++ GPU sampling: `cpp/probe_core/gpu_nvml.cpp` (`CollectNVMLSamples()` plus `appendNVMLProcesses()`)
- Go collector: `backend/internal/collector/collector.go` (`New()`, `buildCollectorInfo()`, and `parseExternalMetricCommand()`)
- Go durability and transport: `backend/internal/collector/spool/spool.go` (`NewWithOptions()` / `writeOffset()`), and `backend/internal/collector/transport/client.go` (`New()`, `validateBatch()`, `validateAck()`, `loadTLSCredentials()`)
- Go controller runtime: `backend/internal/controller/controller.go` (`New()`), `backend/internal/controller/agentcore/agent.go` (`NewQueryService()`), and `backend/internal/controller/agentcore/workflow_engine.go` (`NewWorkflowEngine()`)
- Durable workflow artifacts: `backend/internal/controller/agentcore/analysis_handoff.go` (`buildAnalysisHandoff()`), `workflow_artifacts.go` (`buildWorkflowArtifactChain()`), and `workflow_orchestrator.go` (`NewBoltDurableStore()` / `NewPostgresDurableStore()`)
- Frontend stack: React 18 + Vite with `@tanstack/react-query`, `axios`, `recharts`, `lucide-react`, and `zustand` under `frontend/src/`
The current collector is kernel-first where the repo can do that without inventing a new runtime:
- host CPU scheduler counters: `perf_event_open` software counters in `cpp/probe_core/main.cpp`
- per-process accounting: `cpp/probe_core/process_kernel_collector.cpp` using `taskstats` generic-netlink
- interface/link counters: `cpp/probe_core/network_kernel_collector.cpp` using `rtnetlink`
- socket queue state: `cpp/probe_core/network_kernel_collector.cpp` using `sock_diag`
- runtime event flow: the probe-core eBPF socket path now accepts a versioned binary record format in `cpp/probe_core/kernel_event_protocol.cpp`, with JSON fallback for compatibility
- GPU: `cpp/probe_core/gpu_nvml.cpp` through NVML only; the probe-core hot path no longer shells out to `nvidia-smi`
The remaining file-based paths are explicit fallback or cold-path mechanisms:
- PSI: `/proc/pressure/*`
- cgroup stats: `/sys/fs/cgroup/...`
- disk counters and queue attributes: `/sys/block/*` with `/proc/diskstats` fallback
- process reconciliation and top-row enrichment: periodic `/proc` scans plus `/proc/<pid>/smaps_rollup` and `/proc/<pid>/fd`
- hardware discovery: infrequent `/proc` and `/sys` reads under `hardware.refresh_interval`
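To make the PSI fallback concrete, here is a minimal Go sketch of parsing one `/proc/pressure/*` line. The function and struct names are hypothetical illustrations, not code from this repo; the real cold-path parsing lives in the C++ collector.

```go
package main

import (
	"fmt"
	"strings"
)

// psiLine holds one row of /proc/pressure/* output, e.g.
// "some avg10=1.50 avg60=0.30 avg300=0.10 total=12345".
type psiLine struct {
	Kind                 string // "some" or "full"
	Avg10, Avg60, Avg300 float64
	TotalUS              uint64 // cumulative stall time, microseconds
}

// parsePSILine parses a single pressure line; it sketches the kind
// of file-based cold-path read the PSI fallback performs.
func parsePSILine(line string) (psiLine, error) {
	fields := strings.Fields(line)
	if len(fields) != 5 {
		return psiLine{}, fmt.Errorf("unexpected PSI line: %q", line)
	}
	p := psiLine{Kind: fields[0]}
	if _, err := fmt.Sscanf(fields[1], "avg10=%f", &p.Avg10); err != nil {
		return psiLine{}, err
	}
	if _, err := fmt.Sscanf(fields[2], "avg60=%f", &p.Avg60); err != nil {
		return psiLine{}, err
	}
	if _, err := fmt.Sscanf(fields[3], "avg300=%f", &p.Avg300); err != nil {
		return psiLine{}, err
	}
	if _, err := fmt.Sscanf(fields[4], "total=%d", &p.TotalUS); err != nil {
		return psiLine{}, err
	}
	return p, nil
}

func main() {
	p, err := parsePSILine("some avg10=1.50 avg60=0.30 avg300=0.10 total=12345")
	fmt.Println(p, err)
}
```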
Privilege reality in this tree:
- `CAP_BPF` or `CAP_SYS_ADMIN` still gates the primary eBPF path
- `CAP_PERFMON` or `CAP_SYS_ADMIN` is the clean path for perf-based host counters
- `CAP_NET_ADMIN` or `CAP_SYS_ADMIN` is the expected privilege boundary for the taskstats/sock_diag process path
- when those capabilities are missing, the collector stays up and falls back instead of pretending the kernel path exists
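A sketch of how that fallback decision can be made in Go: read the effective capability mask from `/proc/self/status` and test the relevant bits. The bit numbers come from `linux/capability.h`; the helper names are hypothetical and the repo's actual capability probing lives in the C++ collector.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// Linux capability bit numbers (from linux/capability.h).
const (
	capNetAdmin = 12
	capSysAdmin = 21
	capPerfmon  = 38
	capBPF      = 39
)

// effectiveCaps reads the CapEff bitmask of the current process.
// Hypothetical helper for illustration only.
func effectiveCaps() (uint64, error) {
	f, err := os.Open("/proc/self/status")
	if err != nil {
		return 0, err
	}
	defer f.Close()
	sc := bufio.NewScanner(f)
	for sc.Scan() {
		if v, ok := strings.CutPrefix(sc.Text(), "CapEff:"); ok {
			return strconv.ParseUint(strings.TrimSpace(v), 16, 64)
		}
	}
	return 0, fmt.Errorf("CapEff not found")
}

func hasCap(mask uint64, bit uint) bool { return mask&(1<<bit) != 0 }

func main() {
	mask, err := effectiveCaps()
	if err != nil {
		fmt.Println("cannot read caps, using file-based fallback:", err)
		return
	}
	if hasCap(mask, capBPF) || hasCap(mask, capSysAdmin) {
		fmt.Println("eBPF path available")
	} else {
		fmt.Println("eBPF path unavailable, staying up on fallback collectors")
	}
}
```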
Most SRE automation demos fail in the same places:
- they assume telemetry is complete and always fresh
- they treat retries as free
- they hide reasoning inside one large in-memory object
- they blur advisory output and executable action
- they make recovery impossible once the controller restarts
This repo is built for the opposite constraints:
- collector-side evidence can be delayed, dropped, or replayed
- controller memory and file descriptors are bounded resources
- execution needs policy, approval, idempotency, and post-action verification
- operators need compact artifacts they can inspect under pressure
```mermaid
flowchart LR
  subgraph Host[Observed host]
    P[probe-core / eBPF / helpers]
    C[collector]
    S[disk spool]
    P --> C --> S
  end
  subgraph Controller[controller]
    I[ingest]
    H[bounded hot state]
    O[observer role]
    A[analyzer role]
    R[planner role]
    G[policy gate]
    X[executor role]
    V[verifier role]
    M[memory role]
    U[HTTP API / UI]
    I --> H --> O --> A --> R --> G --> X --> V --> M --> U
  end
  S --> I
```
The controller still runs these logical agents in one process. The important boundary is the artifact contract, not the process boundary. Each stage emits a compact record with:
- schema version
- producer and consumer
- workflow / incident / correlation IDs
- timestamps and status
- input artifact IDs
- evidence references
- replay flags
The chain lives inside the RCA evidence package and is exposed through the workflow APIs.
| Role | Owns | Reads | Writes | May change live state? |
|---|---|---|---|---|
| observer | current window summary | collector snapshots, bounded history | observation artifact | no |
| analyzer | anomaly grouping and RCA ranking | observation artifact, evidence refs | anomaly + hypothesis artifacts | no |
| planner | remediation proposals | hypothesis artifact, recommendations | proposal artifact | no |
| policy gate | execution eligibility | proposal artifact, controller policy | execution-plan artifact | no |
| executor | governed tool calls | execution-plan artifact | execution-result artifact | only when posture and approval allow it |
| verifier | before/after effect check | execution result, fresh evidence | verification artifact | no |
| memory | final incident record | full chain | final incident artifact, incident memory | no |
The old `analysis_agent` and `validation_action_agent` code paths still exist. The artifact chain narrows their logical roles without pretending they are separate daemons.
The incident workflow now emits these stage artifacts in order:
1. `observation_summary`
2. `anomaly_finding`
3. `root_cause_hypothesis`
4. `remediation_proposal`
5. `execution_plan`
6. `execution_result`
7. `verification_result`
8. `incident_report`
Each artifact is compact by design. Raw telemetry stays out of the handoff. The artifact carries evidence IDs and a short list of raw references so downstream stages can reload details without copying large payloads through every step.
See `docs/en/42-agent-artifacts-and-handoffs.md` for the concrete schema.
Model output can influence hypotheses and suggestions. It does not decide execution.
Execution is still gated by controller code:
- actuator safety classification
- policy status
- approval state
- idempotency key reuse
- post-action verification
- optional rollback handling
Default posture remains conservative:
- dry-run on
- approval required
- impactful and destructive paths blocked
- validation defaults to read-only
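The gating and posture rules above compose into a simple decision function. A hedged Go sketch, with all names and field shapes assumed for illustration rather than taken from the repo:

```go
package main

import (
	"fmt"
	"sync"
)

// executionRequest is an illustrative shape for a proposed action.
type executionRequest struct {
	SafetyClass    string // "read_only", "impactful", "destructive"
	Approved       bool
	IdempotencyKey string
}

// gate enforces the conservative defaults: dry-run on, approval
// required, impactful/destructive blocked, and idempotency keys
// accepted at most once.
type gate struct {
	dryRun   bool
	mu       sync.Mutex
	seenKeys map[string]bool
}

func newGate() *gate {
	return &gate{dryRun: true, seenKeys: map[string]bool{}}
}

// decide returns "execute", "dry_run", or a blocked reason.
func (g *gate) decide(req executionRequest) string {
	if req.SafetyClass == "impactful" || req.SafetyClass == "destructive" {
		return "blocked: safety class"
	}
	if !req.Approved {
		return "blocked: approval required"
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	if g.seenKeys[req.IdempotencyKey] {
		return "blocked: idempotency key reused"
	}
	g.seenKeys[req.IdempotencyKey] = true
	if g.dryRun {
		return "dry_run"
	}
	return "execute"
}

func main() {
	g := newGate()
	fmt.Println(g.decide(executionRequest{SafetyClass: "read_only", Approved: true, IdempotencyKey: "k1"}))
	fmt.Println(g.decide(executionRequest{SafetyClass: "read_only", Approved: true, IdempotencyKey: "k1"}))
}
```

Note the ordering: safety class and approval are checked before the idempotency key is consumed, so a blocked request does not burn its key.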
The code is written around bounded cost, not idealized throughput.
- Memory: controller hot state and evidence references are bounded; artifact payloads are compact summaries, not telemetry dumps.
- FD usage: the collector spools to disk, the controller persists artifacts through the artifact manager, and the workflow avoids keeping per-incident files open longer than a single write or read path.
- Concurrency: action execution is explicitly bounded. Validation loops run under tool-call and iteration budgets.
- Queue growth: replay and spool paths are bounded and visible. The workflow artifact chain does not create an unbounded side queue.
- Serialization cost: the artifact chain is small enough to ship inside the evidence package and cheap enough to decode during debugging.
Things that can and do go wrong:
- stale or partial telemetry
- controller restart during an incident
- action proposal without enough rollback data
- verification that cannot prove improvement because the evidence window is weak
- duplicate requests for the same incident shape
The current design handles those cases by preserving state, surfacing uncertainty, and preferring proposal-only over unsafe execution.
Useful endpoints during an incident:
- `GET /api/v1/agent/rca`
- `GET /api/v1/agent/workflow/runs`
- `GET /api/v1/agent/workflow/runs/{run_id}`
- `GET /api/v1/agent/workflow/evidence/{run_id}`
- `GET /api/v1/agent/workflow/audit`
- `GET /api/v1/status`
- `GET /api/v1/ingest/status`
Useful files on disk:
- `data/agent/workflow_runs.db`
- `data/agent/workflows/messages/<run_id>/`
- `data/agent/workflows/evidence/<run_id>/package.json`
- artifact-manager metadata and payload roots
Concrete code paths behind those surfaces:
- HTTP/UI routes are registered from `backend/internal/controller/controller.go`
- query/RCA output is assembled by `backend/internal/controller/agentcore/agent.go`
- durable artifact manifests come from `backend/internal/controller/agentcore/workflow_artifacts.go`
This repo does not assume one controller forever.
- run metadata can move to Postgres
- artifact metadata can move to a shared backend
- payloads can move from filesystem to S3
- hot state is still local to one active writer
- HA followers still reject ingest writes
That means durability is better than it was, but the system is not yet a fully distributed workflow runtime.