
feat(runner): consider two-tier metrics architecture (hypervisor + guest agent) #1283

@lancy

Background

We investigated moving metrics collection from inside the sandbox (microVM) to the runner layer. After researching how major cloud providers handle VM monitoring, we found that our current in-guest metrics collection approach aligns with industry standards.

Industry Research: AWS CloudWatch

AWS EC2 uses a two-tier monitoring architecture:

Tier 1: Hypervisor Metrics (Automatic, No Agent)

  • CPU Utilization ✅
  • Network I/O ✅
  • Disk I/O (read/write operations) ✅
  • Memory Utilization ❌ (not available)
  • Disk Space ❌ (not available)

Tier 2: Guest Agent Metrics (CloudWatch Agent Required)

  • Memory Utilization ✅
  • Disk Space ✅
  • Process-level metrics ✅

From AWS Documentation:

"When the CloudWatch agent collects memory metrics, the source is the host's memory management subsystem. For example, the Linux kernel exposes OS-maintained data in /proc. For memory, the data is in /proc/meminfo."

Why Hypervisor Cannot See Guest Memory

From AWS Under the Hood:

"Memory utilization needs to be assessed based on what processes within the operating system are using that aren't visible to the hypervisor."

"The hypervisor can observe how much data is read/written, but it doesn't know the available disk space or how it's partitioned within the VM's file system."

This is a fundamental architectural constraint: when the guest OS frees memory, it only updates its internal free list; the actual page data remains unchanged, so the hypervisor has no way to detect that the memory is free.

Current VM0 Implementation

Our sandbox scripts collect metrics from inside the VM:

  • CPU: /proc/stat
  • Memory: free -b
  • Disk: df -B1 /

This is exactly what the AWS CloudWatch Agent does: it runs inside the guest and reads from /proc/meminfo.
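For reference, a collector could take the same readings programmatically instead of shelling out to free and df. The sketch below is illustrative only: it samples aggregate CPU from /proc/stat and root-filesystem space via the statfs syscall (the mechanism behind df -B1 /):

```go
// Minimal sketch of the other two guest-side readings: CPU from /proc/stat
// and root-filesystem space via statfs (equivalent to `df -B1 /`).
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
	"time"
)

// cpuBusyAndTotal reads the aggregate "cpu" line from /proc/stat and returns
// busy and total jiffies. Utilization is the delta of busy/total between two samples.
func cpuBusyAndTotal() (busy, total uint64, err error) {
	data, err := os.ReadFile("/proc/stat")
	if err != nil {
		return 0, 0, err
	}
	line := strings.SplitN(string(data), "\n", 2)[0] // "cpu  user nice system idle iowait ..."
	fields := strings.Fields(line)[1:]
	for i, f := range fields {
		v, _ := strconv.ParseUint(f, 10, 64)
		total += v
		if i != 3 && i != 4 { // skip idle and iowait
			busy += v
		}
	}
	return busy, total, nil
}

func main() {
	b1, t1, _ := cpuBusyAndTotal()
	time.Sleep(time.Second)
	b2, t2, _ := cpuBusyAndTotal()
	fmt.Printf("cpu: %.1f%%\n", 100*float64(b2-b1)/float64(t2-t1))

	// Disk space on "/" in bytes, like `df -B1 /`.
	var st syscall.Statfs_t
	if err := syscall.Statfs("/", &st); err == nil {
		total := st.Blocks * uint64(st.Bsize)
		free := st.Bavail * uint64(st.Bsize)
		fmt.Printf("disk: %d of %d bytes free\n", free, total)
	}
}
```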

Future Consideration: Two-Tier Architecture

We could enhance our metrics by adding host-side (Tier 1) metrics to complement the existing guest (Tier 2) metrics:

| Metric Source | CPU | Memory | Disk Space | Disk I/O | Network I/O |
| --- | --- | --- | --- | --- | --- |
| Host (procfs/cgroups) | ✅ | ❌ (only RSS) | ❌ | ✅ | ✅ |
| Guest (current) | ✅ | ✅ | ✅ | - | - |

Potential Benefits

  • Host-side CPU includes VMM overhead (a more complete picture)
  • Network I/O from the TAP device (not currently collected); see the sketch after this list
  • Disk I/O from block-device statistics
  • Metrics remain available even if the guest agent fails
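To make the Network I/O point concrete, one rough sketch of reading TAP-device counters from sysfs on the host follows. The interface name tap0 is a placeholder, not our actual naming; the runner would substitute whatever TAP device it attached to the microVM:

```go
// Sketch: host-side (Tier 1) network counters for a microVM's TAP device,
// read from sysfs on the host.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readNetCounter reads a single counter file such as rx_bytes or tx_bytes
// under /sys/class/net/<dev>/statistics/.
func readNetCounter(dev, counter string) (uint64, error) {
	p := filepath.Join("/sys/class/net", dev, "statistics", counter)
	data, err := os.ReadFile(p)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	// "tap0" is a hypothetical device name for illustration.
	for _, c := range []string{"rx_bytes", "tx_bytes", "rx_packets", "tx_packets"} {
		v, err := readNetCounter("tap0", c)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: %d\n", c, v)
	}
}
```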

Implementation Options

  1. Firecracker balloon device - can provide guest memory stats via virtio, but requires guest kernel support
  2. procfs for the Firecracker process - CPU time and RSS (RSS is not an accurate measure of guest memory usage); see the sketch after this list
  3. cgroups v2 - if we run Firecracker inside cgroups (requires an infrastructure change)
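For option 2, a minimal sketch of what the runner could read from procfs for the Firecracker process follows. The VMM PID is assumed to be known to the runner, and as noted above, VmRSS reflects the VMM process footprint rather than guest memory usage:

```go
// Sketch for option 2: per-process CPU time and RSS for the Firecracker VMM,
// read from procfs on the host.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// vmmStats returns cumulative CPU time in clock ticks (utime+stime from
// /proc/<pid>/stat) and resident set size in kB (VmRSS from /proc/<pid>/status).
func vmmStats(pid int) (cpuTicks, rssKB uint64, err error) {
	stat, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return 0, 0, err
	}
	// utime and stime are fields 14 and 15 (1-indexed). Splitting after the
	// closing ')' of the comm field avoids problems with spaces in the name.
	s := string(stat)
	rest := s[strings.LastIndexByte(s, ')')+2:]
	fields := strings.Fields(rest) // fields[0] is the state field, so utime/stime are [11] and [12]
	utime, _ := strconv.ParseUint(fields[11], 10, 64)
	stime, _ := strconv.ParseUint(fields[12], 10, 64)
	cpuTicks = utime + stime

	status, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return 0, 0, err
	}
	for _, line := range strings.Split(string(status), "\n") {
		if strings.HasPrefix(line, "VmRSS:") {
			rssKB, _ = strconv.ParseUint(strings.Fields(line)[1], 10, 64)
			break
		}
	}
	return cpuTicks, rssKB, nil
}

func main() {
	pid := os.Getpid() // stand-in; the runner would use the Firecracker PID
	cpu, rss, err := vmmStats(pid)
	if err != nil {
		panic(err)
	}
	fmt.Printf("cpu ticks: %d, rss: %d kB\n", cpu, rss)
}
```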

Conclusion

No changes are needed now. Our current approach is industry-standard. This issue tracks a potential future enhancement to add complementary host-side metrics.
