feat(daemoneye-agent): implement agent loading state, heartbeat detection, and recovery #134
Conversation
…ersions

Add documentation for two Clippy lints encountered during PR review:

- `map_err_ignore`: Name ignored variables in closures (`|_elapsed|`, not `|_|`)
- `as_conversions`: Add `#[allow(clippy::as_conversions)]` with a safety comment

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ordination

Implement the loading state machine for daemoneye-agent that coordinates collector registration during startup:

- Add `AgentState` enum with state machine transitions: Loading → Ready → SteadyState (plus StartupFailed and ShuttingDown)
- Add `CollectorsConfig` module for `/etc/daemoneye/collectors.json`:
  - JSON-based collector configuration with validation
  - Support for enabled/disabled collectors, startup timeouts, heartbeat intervals
  - Builder pattern for `CollectorEntry` construction
- Add collector readiness tracking via registration:
  - `CollectorReadinessTracker` struct tracks expected vs. ready collectors
  - Registration marks collectors as ready automatically
- Implement startup timeout handling:
  - Configurable timeout from the max of enabled collectors' `startup_timeout_secs`
  - `wait_for_collectors_ready()` with polling and timeout detection
  - Marks the agent as StartupFailed on timeout
- Implement the "begin monitoring" broadcast:
  - Sends a lifecycle event on the `control.collector.lifecycle` topic
  - Called after the transition to the Ready state
- Add a privilege-dropping stub for future implementation
- Integrate the loading state into the `main.rs` startup sequence:
  - Load the collectors config, wait for registration, transition states
  - Broadcast begin monitoring, enter steady-state operation
- Add 10 integration tests for the loading state workflow
- Add 27+ unit tests for the state machine and configuration

Future work: heartbeat failure detection and escalating recovery actions (Tasks #12, #13, and #17 are blocked pending this foundation).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
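The state machine described in this commit could be sketched as follows. The variant names come from the commit message, but the `can_transition_to` helper and its exact transition set are assumptions for illustration, not the PR's actual API:

```rust
// Hypothetical sketch of the AgentState transitions named in the commit:
// Loading -> Ready -> SteadyState, with StartupFailed and ShuttingDown
// as terminal/error states. The helper method is an invented name.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentState {
    Loading,
    Ready,
    SteadyState,
    StartupFailed,
    ShuttingDown,
}

impl AgentState {
    /// Returns true if moving from `self` to `next` follows the
    /// Loading -> Ready -> SteadyState chain; shutdown is reachable
    /// from any state, and StartupFailed only from Loading.
    pub fn can_transition_to(self, next: AgentState) -> bool {
        use AgentState::*;
        matches!(
            (self, next),
            (Loading, Ready)
                | (Ready, SteadyState)
                | (Loading, StartupFailed)
                | (_, ShuttingDown)
        )
    }
}

fn main() {
    assert!(AgentState::Loading.can_transition_to(AgentState::Ready));
    assert!(!AgentState::Loading.can_transition_to(AgentState::SteadyState));
    println!("transition checks passed");
}
```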
Caution: Review failed. Failed to post review comments.

Summary by CodeRabbit
Walkthrough

Introduces a collector configuration module, an agent startup state machine with readiness tracking in the broker, heartbeat-based recovery and escalation logic, and extensive unit/integration tests exercising readiness, heartbeat detection, and recovery flows.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Agent as Agent (main)
    participant Config as ConfigLoader
    participant Broker as BrokerManager
    participant Registry as RegistrationProvider
    participant Collector as Collector
    Agent->>Config: load_collectors_config()
    Config-->>Agent: CollectorsConfig
    Agent->>Broker: set_collectors_config(config)
    Agent->>Broker: wait_for_collectors_ready(timeout)
    Broker->>Registry: initialize readiness expectations
    loop poll / registrations
        Collector->>Registry: register_collector()
        Registry->>Broker: notify_registration(id)
        Broker->>Registry: mark_collector_ready(id)
        Registry-->>Broker: readiness update
    end
    Broker-->>Agent: all_collectors_ready -> transition_to_ready()
    Agent->>Broker: transition_to_steady_state()
    Agent->>Broker: broadcast_begin_monitoring()
    Agent->>Broker: drop_privileges()
```
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Pull request overview
This PR implements an agent loading state machine for coordinating startup between daemoneye-agent and its collectors. The state machine ensures all collectors are ready before transitioning to normal operation.
Changes:

- Implemented the agent loading state machine (Loading → Ready → SteadyState) with error states in `broker_manager.rs`
- Added a collectors configuration module in `collector_config.rs` for loading and validating `/etc/daemoneye/collectors.json`
- Integrated loading state coordination into the `main.rs` startup sequence with timeout handling
- Added 10 integration tests and 51 unit tests for comprehensive coverage
- Updated the spec to mark Ticket 3 as complete
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| `spec/procmond/index.md` | Marked Ticket 3 (RPC Service and Registration Manager) as complete |
| `daemoneye-agent/tests/loading_state_integration.rs` | New integration tests for the loading state workflow, including timeout, registration, and the full lifecycle |
| `daemoneye-agent/src/main.rs` | Integrated loading state coordination with configuration loading, collector waiting, a privilege-dropping stub, and state transitions |
| `daemoneye-agent/src/lib.rs` | Exported new public API items for collector configuration |
| `daemoneye-agent/src/collector_config.rs` | New module for loading, validating, and managing collector configurations from JSON |
| `daemoneye-agent/src/broker_manager.rs` | Implemented the `AgentState` enum, `CollectorReadinessTracker`, state transition methods, and integration with registration |
| `AGENTS.md` | Added coding guidelines for ignored variables in closures and `as_conversions` |
```rust
// Broadcast "begin monitoring" to all collectors
if let Err(e) = broker_manager.broadcast_begin_monitoring().await {
    error!(error = %e, "Failed to broadcast begin monitoring");
    // Continue anyway - broadcast failure is not fatal for the main loop
    warn!("Collectors may not have received begin monitoring signal");
}

// Transition to SteadyState
if let Err(e) = broker_manager.transition_to_steady_state().await {
    error!(error = %e, "Failed to transition to SteadyState");
    // This shouldn't happen if we're in Ready state, but log and continue
    warn!("Agent may not be in expected state");
}
```
**Copilot AI** · Feb 2, 2026
The "begin monitoring" broadcast is being performed twice: once explicitly here (line 204) and again inside transition_to_steady_state() (broker_manager.rs line 951). This duplication means collectors will receive the BeginMonitoring message twice.
Either remove the explicit broadcast call here and let transition_to_steady_state() handle it, or remove the broadcast from inside transition_to_steady_state() and keep it here. The recommended approach is to remove the explicit call here since the transition method's documentation states it broadcasts "begin monitoring" during the transition.
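If both call sites were ever intentionally kept, another way to prevent duplicate BeginMonitoring delivery would be a once-only guard. This is a hypothetical sketch, not code from the PR; `BroadcastGuard` and `should_broadcast` are invented names:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical guard: AtomicBool::swap returns the previous value,
// so only the first caller observes `false` and performs the broadcast.
pub struct BroadcastGuard {
    sent: AtomicBool,
}

impl BroadcastGuard {
    pub fn new() -> Self {
        Self { sent: AtomicBool::new(false) }
    }

    /// Returns true exactly once, no matter how many times it is called.
    pub fn should_broadcast(&self) -> bool {
        !self.sent.swap(true, Ordering::SeqCst)
    }
}

fn main() {
    let guard = BroadcastGuard::new();
    assert!(guard.should_broadcast());  // first call: broadcast proceeds
    assert!(!guard.should_broadcast()); // duplicate call: suppressed
    println!("broadcast guard ok");
}
```

The review's recommendation (a single call site) is still the simpler fix; a guard like this only matters if the transition method and the caller must both remain able to trigger the broadcast.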
```rust
// Broadcast "begin monitoring" to all collectors
drop(state); // Release lock before async operations
self.broadcast_begin_monitoring().await?;
```
**Copilot AI** · Feb 2, 2026
State transition and broadcast are not atomic. If `broadcast_begin_monitoring()` fails, the state will already have been changed to SteadyState (line 947), but the error will propagate (line 951 with `?`). This leaves the agent in SteadyState even though the broadcast failed.
Consider one of these approaches:

- Perform the broadcast BEFORE changing state, so failure prevents the transition
- Revert the state on broadcast failure
- Make broadcast failure non-fatal (remove the `?` and log the error), since collectors might still receive the message through other means

The current main.rs treats broadcast failure as non-fatal (lines 204-208), so option 3 aligns with that design.
```diff
- self.broadcast_begin_monitoring().await?;
+ if let Err(error) = self.broadcast_begin_monitoring().await {
+     warn!(%error, "Failed to broadcast begin monitoring; continuing in SteadyState");
+ }
```
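The second option (revert the state on broadcast failure) could look roughly like this. It is a stand-in sketch: a plain field replaces the broker's real locked state, and `Broker`, `State`, and the closure-based broadcast are invented for illustration:

```rust
// Hypothetical revert-on-failure transition: record the previous state,
// apply the new one, and roll back if the follow-up broadcast fails.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Ready,
    SteadyState,
}

pub struct Broker {
    pub state: State,
}

impl Broker {
    pub fn transition_to_steady_state(
        &mut self,
        broadcast: impl Fn() -> Result<(), String>,
    ) -> Result<(), String> {
        let previous = self.state;
        self.state = State::SteadyState;
        if let Err(e) = broadcast() {
            // Roll back so a failed broadcast never leaves us in SteadyState.
            self.state = previous;
            return Err(e);
        }
        Ok(())
    }
}

fn main() {
    let mut broker = Broker { state: State::Ready };
    assert!(broker.transition_to_steady_state(|| Err("bus down".to_string())).is_err());
    assert_eq!(broker.state, State::Ready); // reverted on failure
    assert!(broker.transition_to_steady_state(|| Ok(())).is_ok());
    assert_eq!(broker.state, State::SteadyState);
    println!("revert-on-failure ok");
}
```

In the real broker this would need care around the lock that is dropped before the async broadcast; the review's option 3 (non-fatal broadcast) avoids that complexity entirely.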
```rust
// 6. Broadcast begin monitoring
manager.broadcast_begin_monitoring().await?;

// 7. Transition to SteadyState
```
**Copilot AI** · Feb 2, 2026
This test broadcasts "begin monitoring" explicitly at line 288, and then calls transition_to_steady_state() at line 291. However, transition_to_steady_state() also broadcasts "begin monitoring" internally (broker_manager.rs line 951). This means the broadcast happens twice in this test.
This test should match the actual usage pattern. Remove line 288 to let transition_to_steady_state() handle the broadcast, or adjust the test to match the corrected implementation once the duplicate broadcast issue in main.rs is resolved.
```diff
- // 6. Broadcast begin monitoring
- manager.broadcast_begin_monitoring().await?;
- // 7. Transition to SteadyState
+ // 6. Transition to SteadyState
```
```rust
let message = serde_json::json!({
    "type": "BeginMonitoring",
    "timestamp": std::time::SystemTime::now()
        .duration_since(std::time::SystemTime::UNIX_EPOCH)
        .unwrap_or_default()
        .as_millis(),
```
**Copilot AI** · Feb 2, 2026
Using `unwrap_or_default()` here means that if the system time is before the Unix epoch (an extremely rare but possible scenario), the timestamp will silently be 0 milliseconds instead of propagating an error. This could cause confusion in logs or collector behavior.
Consider either:

- Using a proper error with context: `.context("System time is before Unix epoch")?`
- Adding a warning log when the fallback is used
- Documenting this behavior if it is intentional

Given that this is for a "BeginMonitoring" message, option 1 (failing fast) is recommended, since it indicates a serious system misconfiguration.
```diff
- let message = serde_json::json!({
-     "type": "BeginMonitoring",
-     "timestamp": std::time::SystemTime::now()
-         .duration_since(std::time::SystemTime::UNIX_EPOCH)
-         .unwrap_or_default()
-         .as_millis(),
+ let timestamp = std::time::SystemTime::now()
+     .duration_since(std::time::SystemTime::UNIX_EPOCH)
+     .context("System time is before Unix epoch when creating BeginMonitoring timestamp")?
+     .as_millis();
+ let message = serde_json::json!({
+     "type": "BeginMonitoring",
+     "timestamp": timestamp,
```
````rust
//! ```
//!
//! - **Loading**: Agent starting, broker initializing, spawning collectors
//! - **Ready**: All collectors registered and reported "ready", privileges dropped
````
**Copilot AI** · Feb 2, 2026
The documentation states that in the Ready state "privileges dropped", but this is inaccurate. Looking at main.rs lines 196-201, privilege dropping occurs AFTER the transition to Ready state. The Ready state only indicates that all collectors have registered.
Update the documentation to clarify: "All collectors registered and reported 'ready'. Waiting for privilege dropping before transitioning to steady state." or similar wording that doesn't imply privileges have already been dropped.
```diff
- //! - **Ready**: All collectors registered and reported "ready", privileges dropped
+ //! - **Ready**: All collectors registered and reported "ready". Awaiting privilege
+ //!   dropping before transitioning to `SteadyState`.
```
```rust
Loading,

/// All collectors have registered and reported "ready".
/// The agent has dropped privileges (if configured).
```
**Copilot AI** · Feb 2, 2026
This documentation states "The agent has dropped privileges (if configured)" in the Ready state, but according to main.rs (lines 196-201), privilege dropping occurs AFTER transitioning to Ready. The Ready state should only indicate that all collectors have registered.
Update this to: "All collectors have registered and reported 'ready'. Privilege dropping will occur before transitioning to steady state." to match the actual implementation.
```diff
- /// The agent has dropped privileges (if configured).
+ /// Privilege dropping will occur before transitioning to steady state.
```
…ecovery

Add comprehensive heartbeat monitoring and recovery infrastructure:

- `HeartbeatStatus` enum (Healthy, Degraded, Failed) with `needs_recovery()`
- Missed-heartbeat tracking in `CollectorRegistry` with `check_heartbeats()`
- `RecoveryAction` escalation chain: HealthCheck → GracefulShutdown → ForceKill → Restart
- `CollectorRecoveryState` for tracking recovery attempts per collector
- `execute_recovery()` function for automated escalating recovery
- 21 integration tests covering heartbeat and recovery workflows

Also includes documentation updates for async-in-tracing gotchas, pre-commit handling, and cross-crate trait imports.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
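The escalation chain named in this commit could be modeled as follows. The variant names come from the commit message, while the `escalate` method is an assumed shape for walking the chain, not the PR's actual API:

```rust
// Hypothetical model of the recovery ladder from the commit message:
// HealthCheck -> GracefulShutdown -> ForceKill -> Restart.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecoveryAction {
    HealthCheck,
    GracefulShutdown,
    ForceKill,
    Restart,
}

impl RecoveryAction {
    /// The next rung of the ladder, or None once Restart has been attempted.
    pub fn escalate(self) -> Option<RecoveryAction> {
        match self {
            RecoveryAction::HealthCheck => Some(RecoveryAction::GracefulShutdown),
            RecoveryAction::GracefulShutdown => Some(RecoveryAction::ForceKill),
            RecoveryAction::ForceKill => Some(RecoveryAction::Restart),
            RecoveryAction::Restart => None,
        }
    }
}

fn main() {
    // Walk the full chain: each failed attempt escalates to the next action.
    let mut action = RecoveryAction::HealthCheck;
    while let Some(next) = action.escalate() {
        action = next;
    }
    assert_eq!(action, RecoveryAction::Restart);
    println!("escalation chain terminates at Restart");
}
```

Encoding the chain as a total function on the enum keeps the escalation order in one place, so a per-collector recovery state only needs to store the current action and the attempt count.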
Summary

Loading State Machine

- Collector configuration from `/etc/daemoneye/collectors.json` with JSON schema and validation
- "Begin monitoring" broadcast on the `control.collector.lifecycle` topic

Heartbeat Detection & Recovery
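For illustration, a `/etc/daemoneye/collectors.json` covering the fields mentioned in this PR (enabled/disabled collectors, startup timeouts, heartbeat intervals) might look like the sketch below. The exact field names and layout are assumptions, since the schema itself is not shown here:

```json
{
  "collectors": [
    {
      "id": "procmond",
      "enabled": true,
      "startup_timeout_secs": 30,
      "heartbeat_interval_secs": 10
    },
    {
      "id": "netmond",
      "enabled": false,
      "startup_timeout_secs": 15,
      "heartbeat_interval_secs": 10
    }
  ]
}
```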
Test Plan

- `cargo clippy -- -D warnings` passes with zero warnings

🤖 Generated with Claude Code