unclesp1d3r (Member) commented Feb 2, 2026

Summary

Loading State Machine

  • Implement agent loading state machine (Loading → Ready → SteadyState) with error states
  • Add collectors configuration module for /etc/daemoneye/collectors.json with JSON schema and validation
  • Add collector readiness tracking via registration - collectors marked ready when they register
  • Implement startup timeout handling with configurable per-collector timeouts
  • Implement "begin monitoring" broadcast on control.collector.lifecycle topic
  • Add privilege dropping stub for future platform-specific implementation
  • Integrate loading state coordination into main.rs startup sequence
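
For reviewers who want a concrete picture of the state machine, here is a minimal sketch. It is not the code in broker_manager.rs: the `StartupFailed` and `ShuttingDown` variants come from the commit message, while the `can_transition_to` and `transition` helpers and the exact set of allowed edges are assumptions made for illustration.

```rust
/// Agent lifecycle states named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentState {
    Loading,
    Ready,
    SteadyState,
    StartupFailed,
    ShuttingDown,
}

impl AgentState {
    /// Returns true if moving from `self` to `next` is allowed.
    /// The edge set below is an assumption, not taken from the PR.
    pub fn can_transition_to(self, next: AgentState) -> bool {
        use AgentState::*;
        matches!(
            (self, next),
            (Loading, Ready)
                | (Loading, StartupFailed)
                | (Ready, SteadyState)
                | (Loading, ShuttingDown)
                | (Ready, ShuttingDown)
                | (SteadyState, ShuttingDown)
                | (StartupFailed, ShuttingDown)
        )
    }

    /// Attempts the transition and returns the new state,
    /// or a description of why it was rejected.
    pub fn transition(self, next: AgentState) -> Result<AgentState, String> {
        if self.can_transition_to(next) {
            Ok(next)
        } else {
            Err(format!("invalid transition: {:?} -> {:?}", self, next))
        }
    }
}
```

With these rules, `AgentState::Loading.transition(AgentState::SteadyState)` is rejected, which is the property the loading phase relies on: the agent cannot skip `Ready`.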

Heartbeat Detection & Recovery

  • Add HeartbeatStatus enum (Healthy, Degraded, Failed) with needs_recovery() helper
  • Implement missed heartbeat tracking in CollectorRegistry with check_heartbeats()
  • Create escalating RecoveryAction chain: HealthCheck → GracefulShutdown → ForceKill → Restart
  • Add CollectorRecoveryState for tracking recovery attempts per collector
  • Implement execute_recovery() function for automated escalating recovery
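
A rough sketch of how missed heartbeats could map onto `HeartbeatStatus`. The enum variants and the `needs_recovery()` name come from this PR; the `from_missed` constructor, its threshold parameters, and the rule that only `Failed` triggers recovery are assumptions for illustration.

```rust
/// Health classification named in this PR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum HeartbeatStatus {
    Healthy,
    Degraded,
    Failed,
}

impl HeartbeatStatus {
    /// Classifies a collector by its count of consecutive missed heartbeats.
    /// `degraded_after` and `failed_after` are hypothetical thresholds.
    pub fn from_missed(missed: u32, degraded_after: u32, failed_after: u32) -> Self {
        if missed >= failed_after {
            HeartbeatStatus::Failed
        } else if missed >= degraded_after {
            HeartbeatStatus::Degraded
        } else {
            HeartbeatStatus::Healthy
        }
    }

    /// Assumption: only `Failed` collectors are handed to recovery.
    pub fn needs_recovery(self) -> bool {
        matches!(self, HeartbeatStatus::Failed)
    }
}
```

Checking the higher threshold first keeps the classification correct even if both thresholds are reached at once.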

Test Plan

  • 51 unit tests pass for broker_manager and collector_config modules
  • 10 integration tests pass for loading state workflow
  • 21 integration tests pass for heartbeat detection and recovery
  • 10 unit tests pass for recovery module
  • All existing integration tests continue to pass
  • cargo clippy -- -D warnings passes with zero warnings
  • Pre-commit hooks (fmt, clippy, check, audit) all pass

🤖 Generated with Claude Code

unclesp1d3r and others added 3 commits February 1, 2026 23:48
…ersions

Add documentation for two Clippy lints encountered during PR review:

- map_err_ignore: Name ignored variables in closures (`|_elapsed|` not `|_|`)
- as_conversions: Add `#[allow(clippy::as_conversions)]` with safety comment
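
A small illustration of both lints, using only the standard library. The function names (`parse_port`, `ready_ratio`) are hypothetical and do not appear in the repository; they only show the shape of code each lint accepts.

```rust
/// Parses a port number, replacing the parse error with a custom message.
pub fn parse_port(input: &str) -> Result<u16, String> {
    // Naming the discarded error (`_parse_error` rather than `_`)
    // is what `clippy::map_err_ignore` asks for.
    input
        .trim()
        .parse::<u16>()
        .map_err(|_parse_error| format!("invalid port: {:?}", input))
}

/// Fraction of expected collectors that are ready.
// Safety: collector counts are far below 2^53, so `usize as f64` is exact here.
#[allow(clippy::as_conversions)]
pub fn ready_ratio(ready: usize, expected: usize) -> f64 {
    if expected == 0 {
        return 1.0;
    }
    ready as f64 / expected as f64
}
```

`usize` has no `From` conversion to `f64`, which is why the cast stays and is justified in a comment instead of being rewritten.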

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…ordination

Implement the loading state machine for daemoneye-agent that coordinates
collector registration during startup:

- Add AgentState enum with state machine transitions:
  Loading → Ready → SteadyState (plus StartupFailed and ShuttingDown)
- Add CollectorsConfig module for `/etc/daemoneye/collectors.json`:
  - JSON-based collector configuration with validation
  - Support for enabled/disabled collectors, startup timeouts, heartbeat intervals
  - Builder pattern for CollectorEntry construction
- Add collector readiness tracking via registration:
  - CollectorReadinessTracker struct tracks expected vs ready collectors
  - Registration marks collectors as ready automatically
- Implement startup timeout handling:
  - Configurable timeout from max of enabled collectors' startup_timeout_secs
  - wait_for_collectors_ready() with polling and timeout detection
  - Marks agent as StartupFailed on timeout
- Implement "begin monitoring" broadcast:
  - Sends lifecycle event on control.collector.lifecycle topic
  - Called after transition to Ready state
- Add privilege dropping stub for future implementation
- Integrate loading state into main.rs startup sequence:
  - Load collectors config, wait for registration, transition states
  - Broadcast begin monitoring, enter steady state operation
- Add 10 integration tests for loading state workflow
- Add 27+ unit tests for state machine and configuration
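
To make the readiness and timeout items above concrete, here is a minimal sketch. `CollectorReadinessTracker` and `startup_timeout_secs` are names from this commit; the trimmed-down `CollectorEntry`, the `startup_timeout` helper, and the tracker's method names are assumptions and may differ from broker_manager.rs.

```rust
use std::collections::HashSet;
use std::time::Duration;

/// Hypothetical, trimmed-down collector entry with only the fields used here.
pub struct CollectorEntry {
    pub id: String,
    pub enabled: bool,
    pub startup_timeout_secs: u64,
}

/// Startup timeout: the max of the enabled collectors' `startup_timeout_secs`.
/// Returns `None` when no collector is enabled.
pub fn startup_timeout(entries: &[CollectorEntry]) -> Option<Duration> {
    entries
        .iter()
        .filter(|e| e.enabled)
        .map(|e| e.startup_timeout_secs)
        .max()
        .map(Duration::from_secs)
}

/// Tracks which expected collectors have registered.
pub struct CollectorReadinessTracker {
    expected: HashSet<String>,
    ready: HashSet<String>,
}

impl CollectorReadinessTracker {
    /// Expects every enabled collector; disabled ones are never waited on.
    pub fn new(entries: &[CollectorEntry]) -> Self {
        let expected = entries
            .iter()
            .filter(|e| e.enabled)
            .map(|e| e.id.clone())
            .collect();
        Self { expected, ready: HashSet::new() }
    }

    /// Marks a collector ready; ids that were not expected are ignored.
    pub fn mark_ready(&mut self, id: &str) {
        if self.expected.contains(id) {
            self.ready.insert(id.to_owned());
        }
    }

    /// True once every expected collector has registered.
    pub fn all_ready(&self) -> bool {
        self.ready.len() == self.expected.len()
    }

    /// Ids still pending, sorted for stable log output.
    pub fn pending(&self) -> Vec<&str> {
        let mut pending: Vec<&str> = self
            .expected
            .difference(&self.ready)
            .map(String::as_str)
            .collect();
        pending.sort_unstable();
        pending
    }
}
```

A polling loop in the spirit of `wait_for_collectors_ready()` would check `all_ready()` until the duration from `startup_timeout` elapses, then report `pending()` when marking the agent `StartupFailed`.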

Future work: Heartbeat failure detection and escalating recovery actions
(Tasks #12, #13, #17 are blocked pending this foundation)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings February 2, 2026 15:05
@dosubot dosubot bot added the size:XXL This PR changes 1000+ lines, ignoring generated files. label Feb 2, 2026
coderabbitai bot commented Feb 2, 2026

Caution

Review failed

Failed to post review comments

Summary by CodeRabbit

  • New Features

    • Collector configuration loading/validation with startup timeouts and heartbeat intervals.
    • Agent startup state machine and readiness tracking to ensure collectors are ready before steady state.
    • Heartbeat monitoring with health status reporting and an escalating recovery workflow for unhealthy collectors.
  • Documentation

    • Updated coding guidelines and gotchas (naming, casts, tracing macros, formatting).
  • Tests

    • Extensive unit and integration tests for startup sequencing, readiness, heartbeat detection, and recovery.

Walkthrough

Introduces a collector configuration module, an agent startup state machine with readiness tracking in the broker, heartbeat-based recovery and escalation logic, and extensive unit/integration tests exercising readiness, heartbeat detection, and recovery flows.

Changes

| Cohort | File(s) | Summary |
| --- | --- | --- |
| Configuration | daemoneye-agent/src/collector_config.rs, daemoneye-agent/src/lib.rs | New CollectorsConfig and CollectorEntry types, loading/validation, platform defaults, builder helpers, and re-exports. |
| Broker & Startup State | daemoneye-agent/src/broker_manager.rs, daemoneye-agent/src/main.rs, daemoneye-agent/tests/loading_state_integration.rs | Adds AgentState state machine, CollectorReadinessTracker, readiness APIs (set/get config, mark ready, pending list, transitions, wait_for_collectors_ready, broadcast_begin_monitoring, drop_privileges), integrates config load and readiness wait in startup, and adds coverage tests. |
| Heartbeat & Recovery | daemoneye-agent/src/collector_registry.rs, daemoneye-agent/src/recovery.rs, daemoneye-agent/tests/heartbeat_detection_integration.rs | Adds missed-heartbeats tracking, HeartbeatStatus enum and helpers, heartbeat-check and recovery candidate APIs; introduces escalating recovery actions, recovery state, executor, and tests. |
| Public Surface | daemoneye-agent/src/lib.rs | Exports new modules/re-exports: collector_config (CollectorsConfig etc.), recovery items, and adds HeartbeatStatus to registry exports. |
| Docs & Spec | AGENTS.md, spec/procmond/index.md | Documentation guidelines added (naming ignored vars, as_conversions, tracing-await gotcha, pre-commit note, cross-crate traits) and a spec checkbox updated. |
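
As an illustration of the Configuration cohort, the builder helpers and validation might look roughly like the sketch below. The field names follow the options listed in this PR (enabled flag, startup timeout, heartbeat interval); the default values, method names, and validation rules are assumptions and not taken from collector_config.rs.

```rust
use std::collections::HashSet;

/// Hypothetical shape of one collector entry.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CollectorEntry {
    pub id: String,
    pub enabled: bool,
    pub startup_timeout_secs: u64,
    pub heartbeat_interval_secs: u64,
}

impl CollectorEntry {
    /// Starts an entry with assumed defaults (enabled, 60 s startup, 30 s heartbeat).
    pub fn new(id: impl Into<String>) -> Self {
        Self {
            id: id.into(),
            enabled: true,
            startup_timeout_secs: 60,
            heartbeat_interval_secs: 30,
        }
    }

    /// Builder-style setter for the enabled flag.
    pub fn with_enabled(mut self, enabled: bool) -> Self {
        self.enabled = enabled;
        self
    }

    /// Builder-style setter for the startup timeout.
    pub fn with_startup_timeout_secs(mut self, secs: u64) -> Self {
        self.startup_timeout_secs = secs;
        self
    }

    /// Builder-style setter for the heartbeat interval.
    pub fn with_heartbeat_interval_secs(mut self, secs: u64) -> Self {
        self.heartbeat_interval_secs = secs;
        self
    }

    /// Rejects empty ids and zero-length timing values.
    pub fn validate(&self) -> Result<(), String> {
        if self.id.trim().is_empty() {
            return Err("collector id must not be empty".to_owned());
        }
        if self.startup_timeout_secs == 0 {
            return Err(format!("{}: startup_timeout_secs must be > 0", self.id));
        }
        if self.heartbeat_interval_secs == 0 {
            return Err(format!("{}: heartbeat_interval_secs must be > 0", self.id));
        }
        Ok(())
    }
}

/// Validates every entry and rejects duplicate ids.
pub fn validate_collectors(entries: &[CollectorEntry]) -> Result<(), String> {
    let mut seen = HashSet::new();
    for entry in entries {
        entry.validate()?;
        if !seen.insert(entry.id.as_str()) {
            return Err(format!("duplicate collector id: {}", entry.id));
        }
    }
    Ok(())
}
```

For example, `CollectorEntry::new("procmond").with_startup_timeout_secs(45)` passes validation, while an entry with a zero heartbeat interval is rejected before startup begins.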

Sequence Diagram(s)

sequenceDiagram
    participant Agent as Agent (main)
    participant Config as ConfigLoader
    participant Broker as BrokerManager
    participant Registry as RegistrationProvider
    participant Collector as Collector

    Agent->>Config: load_collectors_config()
    Config-->>Agent: CollectorsConfig
    Agent->>Broker: set_collectors_config(config)
    Agent->>Broker: wait_for_collectors_ready(timeout)
    Broker->>Registry: initialize readiness expectations

    loop poll / registrations
        Collector->>Registry: register_collector()
        Registry->>Broker: notify_registration(id)
        Broker->>Registry: mark_collector_ready(id)
        Registry-->>Broker: readiness update
    end

    Broker-->>Agent: all_collectors_ready -> transition_to_ready()
    Agent->>Broker: transition_to_steady_state()
    Agent->>Broker: broadcast_begin_monitoring()
    Agent->>Broker: drop_privileges()

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes

Possibly related PRs

Suggested labels

documentation

Poem

🐰 I hopped through configs, ready and bright,
I nudged the broker toward startup light,
Collectors gathered, one by one,
Heartbeats checked, recoveries run,
Hooray — the agent hums through the night! 🥕

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title accurately summarizes the main changes: implementing agent loading state machine, heartbeat detection, and recovery mechanisms for daemoneye-agent. |
| Description check | ✅ Passed | The description is well-structured and comprehensively covers the loading state machine, heartbeat detection, recovery implementation, and test plan with detailed sub-sections matching the changeset. |
| Docstring Coverage | ✅ Passed | Docstring coverage is 100.00%, which is sufficient. The required threshold is 80.00%. |


dosubot bot commented Feb 2, 2026

Related Documentation

Checked 21 published document(s) in 1 knowledge base(s). No updates required.


qltysh bot (Contributor) commented Feb 2, 2026

2 new issues

| Tool | Category | Rule | Count |
| --- | --- | --- | --- |
| qlty | Structure | Function with many returns (count = 8): run | 1 |
| qlty | Structure | Function with high complexity (count = 19): run | 1 |

@coderabbitai coderabbitai bot added architecture System architecture and design decisions configuration Configuration management and settings integration Related to integration testing and component integration labels Feb 2, 2026
@dosubot dosubot bot added daemoneye An important component of the larger DaemonEye suite. process-monitoring Process monitoring and enumeration features labels Feb 2, 2026
Copilot AI left a comment

Pull request overview

This PR implements an agent loading state machine for coordinating startup between daemoneye-agent and its collectors. The state machine ensures all collectors are ready before transitioning to normal operation.

Changes:

  • Implemented agent loading state machine (Loading → Ready → SteadyState) with error states in broker_manager.rs
  • Added collectors configuration module in collector_config.rs for loading and validating /etc/daemoneye/collectors.json
  • Integrated loading state coordination into main.rs startup sequence with timeout handling
  • Added 10 integration tests and 51 unit tests for comprehensive coverage
  • Updated spec to mark Ticket 3 as complete

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| spec/procmond/index.md | Marked Ticket 3 (RPC Service and Registration Manager) as complete |
| daemoneye-agent/tests/loading_state_integration.rs | New integration tests for loading state workflow including timeout, registration, and full lifecycle |
| daemoneye-agent/src/main.rs | Integrated loading state coordination with configuration loading, collector waiting, privilege dropping stub, and state transitions |
| daemoneye-agent/src/lib.rs | Exported new public API items for collector configuration |
| daemoneye-agent/src/collector_config.rs | New module for loading, validating, and managing collector configurations from JSON |
| daemoneye-agent/src/broker_manager.rs | Implemented AgentState enum, CollectorReadinessTracker, state transition methods, and integrated with registration |
| AGENTS.md | Added coding guidelines for ignored variables in closures and as_conversions |

Comment on lines +203 to +215
// Broadcast "begin monitoring" to all collectors
if let Err(e) = broker_manager.broadcast_begin_monitoring().await {
    error!(error = %e, "Failed to broadcast begin monitoring");
    // Continue anyway - broadcast failure is not fatal for the main loop
    warn!("Collectors may not have received begin monitoring signal");
}

// Transition to SteadyState
if let Err(e) = broker_manager.transition_to_steady_state().await {
    error!(error = %e, "Failed to transition to SteadyState");
    // This shouldn't happen if we're in Ready state, but log and continue
    warn!("Agent may not be in expected state");
}
Copilot AI Feb 2, 2026

The "begin monitoring" broadcast is being performed twice: once explicitly here (line 204) and again inside transition_to_steady_state() (broker_manager.rs line 951). This duplication means collectors will receive the BeginMonitoring message twice.

Either remove the explicit broadcast call here and let transition_to_steady_state() handle it, or remove the broadcast from inside transition_to_steady_state() and keep it here. The recommended approach is to remove the explicit call here since the transition method's documentation states it broadcasts "begin monitoring" during the transition.
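
The duplication described in this comment can be reproduced with a small synchronous stand-in. This is only a sketch: `MockBroker` and the two `startup_*` functions are hypothetical, the real methods are async, and the counter exists purely to make the double broadcast visible.

```rust
/// Minimal synchronous stand-in for the broker manager.
#[derive(Default)]
pub struct MockBroker {
    pub steady: bool,
    pub broadcasts: u32,
}

impl MockBroker {
    /// Stand-in for `broadcast_begin_monitoring()`: only counts calls.
    pub fn broadcast_begin_monitoring(&mut self) {
        self.broadcasts += 1;
    }

    /// Stand-in for `transition_to_steady_state()`, which broadcasts internally.
    pub fn transition_to_steady_state(&mut self) {
        self.steady = true;
        self.broadcast_begin_monitoring();
    }
}

/// Startup as reviewed: an explicit broadcast plus the transition's own broadcast.
pub fn startup_with_explicit_broadcast(broker: &mut MockBroker) {
    broker.broadcast_begin_monitoring();
    broker.transition_to_steady_state();
}

/// Startup after the recommended fix: the transition owns the broadcast.
pub fn startup_transition_only(broker: &mut MockBroker) {
    broker.transition_to_steady_state();
}
```

Running the first function leaves `broadcasts` at 2, the second at 1, which is the behavior the recommendation aims for.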

Copilot uses AI. Check for mistakes.

// Broadcast "begin monitoring" to all collectors
drop(state); // Release lock before async operations
self.broadcast_begin_monitoring().await?;
Copilot AI Feb 2, 2026

State transition and broadcast are not atomic. If broadcast_begin_monitoring() fails, the state will already have been changed to SteadyState (line 947), but the error will propagate (line 951 with ?). This leaves the agent in SteadyState even though the broadcast failed.

Consider one of these approaches:

  1. Perform the broadcast BEFORE changing state, so failure prevents the transition
  2. Revert the state on broadcast failure
  3. Make broadcast failure non-fatal (remove ? and log error) since collectors might still receive the message through other means

The current main.rs treats broadcast failure as non-fatal (line 204-208), so option 3 aligns with that design.

Suggested change
- self.broadcast_begin_monitoring().await?;
+ if let Err(error) = self.broadcast_begin_monitoring().await {
+     warn!(%error, "Failed to broadcast begin monitoring; continuing in SteadyState");
+ }
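
To compare options 1 and 3 side by side, here is a synchronous sketch. `MockManager`, its `broadcast_ok` flag, and both method names are hypothetical; the real `transition_to_steady_state()` is async and holds a lock around the state.

```rust
/// The two states relevant to this transition.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum AgentState {
    Ready,
    SteadyState,
}

/// Synchronous stand-in; `broadcast_ok` simulates whether the broadcast succeeds.
pub struct MockManager {
    pub state: AgentState,
    pub broadcast_ok: bool,
}

impl MockManager {
    fn broadcast_begin_monitoring(&self) -> Result<(), String> {
        if self.broadcast_ok {
            Ok(())
        } else {
            Err("broker unavailable".to_owned())
        }
    }

    /// Option 1: broadcast first, so a failed broadcast leaves the agent in `Ready`.
    pub fn transition_broadcast_first(&mut self) -> Result<(), String> {
        self.broadcast_begin_monitoring()?;
        self.state = AgentState::SteadyState;
        Ok(())
    }

    /// Option 3: change state, then treat a broadcast failure as non-fatal.
    /// Returns the broadcast error, if any, so the caller can log it.
    pub fn transition_non_fatal(&mut self) -> Option<String> {
        self.state = AgentState::SteadyState;
        self.broadcast_begin_monitoring().err()
    }
}
```

Either variant removes the mismatch between the returned error and the observed state; which one fits depends on whether collectors can safely start without the signal.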

Comment on lines +287 to +290
// 6. Broadcast begin monitoring
manager.broadcast_begin_monitoring().await?;

// 7. Transition to SteadyState
Copilot AI Feb 2, 2026

This test broadcasts "begin monitoring" explicitly at line 288, and then calls transition_to_steady_state() at line 291. However, transition_to_steady_state() also broadcasts "begin monitoring" internally (broker_manager.rs line 951). This means the broadcast happens twice in this test.

This test should match the actual usage pattern. Remove line 288 to let transition_to_steady_state() handle the broadcast, or adjust the test to match the corrected implementation once the duplicate broadcast issue in main.rs is resolved.

Suggested change
- // 6. Broadcast begin monitoring
- manager.broadcast_begin_monitoring().await?;
- // 7. Transition to SteadyState
+ // 6. Transition to SteadyState

Comment on lines +1159 to +1164
let message = serde_json::json!({
    "type": "BeginMonitoring",
    "timestamp": std::time::SystemTime::now()
        .duration_since(std::time::SystemTime::UNIX_EPOCH)
        .unwrap_or_default()
        .as_millis(),
Copilot AI Feb 2, 2026

Using unwrap_or_default() here means if the system time is before Unix epoch (an extremely rare but possible scenario), the timestamp will silently be 0 milliseconds instead of propagating an error. This could cause confusion in logs or collector behavior.

Consider either:

  1. Using a proper error with context: .context("System time is before Unix epoch")?
  2. Adding a warning log when the fallback is used
  3. Documenting this behavior if it's intentional

Given that this is for a "BeginMonitoring" message, option 1 (failing fast) is recommended since it indicates a serious system misconfiguration.
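
The difference between the silent fallback and failing fast can be shown with the standard library alone. This sketch takes the clock reading as a parameter so both paths are testable; the function names are hypothetical, and it returns `SystemTimeError` directly rather than attaching context with an error-handling crate.

```rust
use std::time::{SystemTime, SystemTimeError, UNIX_EPOCH};

/// Milliseconds between the Unix epoch and `now`,
/// or an error if `now` is earlier than the epoch.
pub fn unix_millis(now: SystemTime) -> Result<u128, SystemTimeError> {
    Ok(now.duration_since(UNIX_EPOCH)?.as_millis())
}

/// The fallback flagged in this comment: a pre-epoch clock silently yields 0.
pub fn unix_millis_or_zero(now: SystemTime) -> u128 {
    now.duration_since(UNIX_EPOCH)
        .unwrap_or_default()
        .as_millis()
}
```

Production code would call `unix_millis(SystemTime::now())` and propagate the error, so a misconfigured clock surfaces at startup instead of producing a zero timestamp.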

Suggested change
- let message = serde_json::json!({
-     "type": "BeginMonitoring",
-     "timestamp": std::time::SystemTime::now()
-         .duration_since(std::time::SystemTime::UNIX_EPOCH)
-         .unwrap_or_default()
-         .as_millis(),
+ let timestamp = std::time::SystemTime::now()
+     .duration_since(std::time::SystemTime::UNIX_EPOCH)
+     .context("System time is before Unix epoch when creating BeginMonitoring timestamp")?
+     .as_millis();
+ let message = serde_json::json!({
+     "type": "BeginMonitoring",
+     "timestamp": timestamp,

//! ```
//!
//! - **Loading**: Agent starting, broker initializing, spawning collectors
//! - **Ready**: All collectors registered and reported "ready", privileges dropped
Copilot AI Feb 2, 2026

The documentation states that in the Ready state "privileges dropped", but this is inaccurate. Looking at main.rs lines 196-201, privilege dropping occurs AFTER the transition to Ready state. The Ready state only indicates that all collectors have registered.

Update the documentation to clarify: "All collectors registered and reported 'ready'. Waiting for privilege dropping before transitioning to steady state." or similar wording that doesn't imply privileges have already been dropped.

Suggested change
- //! - **Ready**: All collectors registered and reported "ready", privileges dropped
+ //! - **Ready**: All collectors registered and reported "ready". Awaiting privilege
+ //!   dropping before transitioning to `SteadyState`.

Loading,

/// All collectors have registered and reported "ready".
/// The agent has dropped privileges (if configured).
Copilot AI Feb 2, 2026

This documentation states "The agent has dropped privileges (if configured)" in the Ready state, but according to main.rs (lines 196-201), privilege dropping occurs AFTER transitioning to Ready. The Ready state should only indicate that all collectors have registered.

Update this to: "All collectors have registered and reported 'ready'. Privilege dropping will occur before transitioning to steady state." to match the actual implementation.

Suggested change
- /// The agent has dropped privileges (if configured).
+ /// Privilege dropping will occur before transitioning to steady state.

…ecovery

Add comprehensive heartbeat monitoring and recovery infrastructure:

- HeartbeatStatus enum (Healthy, Degraded, Failed) with needs_recovery()
- Missed heartbeat tracking in CollectorRegistry with check_heartbeats()
- RecoveryAction escalation chain: HealthCheck → GracefulShutdown → ForceKill → Restart
- CollectorRecoveryState for tracking recovery attempts per collector
- execute_recovery() function for automated escalating recovery
- 21 integration tests covering heartbeat and recovery workflows
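
As a rough sketch of the escalation chain listed above: the `RecoveryAction` variants, `CollectorRecoveryState`, and `execute_recovery` are names from this commit, but the fields, the `escalate` helper, the closure-based signature, and the "stop at the first step that succeeds" rule are assumptions and may differ from recovery.rs.

```rust
/// Escalation steps named in this commit, mildest first.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RecoveryAction {
    HealthCheck,
    GracefulShutdown,
    ForceKill,
    Restart,
}

impl RecoveryAction {
    /// Next, more drastic action; `None` once the chain is exhausted.
    pub fn escalate(self) -> Option<RecoveryAction> {
        match self {
            RecoveryAction::HealthCheck => Some(RecoveryAction::GracefulShutdown),
            RecoveryAction::GracefulShutdown => Some(RecoveryAction::ForceKill),
            RecoveryAction::ForceKill => Some(RecoveryAction::Restart),
            RecoveryAction::Restart => None,
        }
    }
}

/// Per-collector recovery bookkeeping (fields are assumptions).
#[derive(Debug)]
pub struct CollectorRecoveryState {
    pub attempts: u32,
    pub next_action: Option<RecoveryAction>,
}

impl CollectorRecoveryState {
    /// Starts at the mildest action with no attempts recorded.
    pub fn new() -> Self {
        Self { attempts: 0, next_action: Some(RecoveryAction::HealthCheck) }
    }
}

/// Tries actions in order until `try_action` reports success or the chain ends.
/// Returns the action that succeeded, if any.
pub fn execute_recovery(
    state: &mut CollectorRecoveryState,
    mut try_action: impl FnMut(RecoveryAction) -> bool,
) -> Option<RecoveryAction> {
    while let Some(action) = state.next_action {
        state.attempts += 1;
        if try_action(action) {
            return Some(action);
        }
        state.next_action = action.escalate();
    }
    None
}
```

Passing the action executor as a closure keeps the escalation logic testable without spawning or killing real collector processes.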

Also includes documentation updates for async-in-tracing gotchas,
pre-commit handling, and cross-crate trait imports.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@unclesp1d3r unclesp1d3r changed the title from "feat(daemoneye-agent): implement agent loading state and collector coordination" to "feat(daemoneye-agent): implement agent loading state, heartbeat detection, and recovery" Feb 2, 2026
@coderabbitai coderabbitai bot added the documentation Improvements or additions to documentation label Feb 2, 2026
Labels

architecture System architecture and design decisions configuration Configuration management and settings daemoneye An important component of the larger DaemonEye suite. documentation Improvements or additions to documentation integration Related to integration testing and component integration process-monitoring Process monitoring and enumeration features size:XXL This PR changes 1000+ lines, ignoring generated files.

2 participants