## Background
We investigated moving metrics collection from inside the sandbox (microVM) to the runner layer. After researching how major cloud providers handle VM monitoring, we found that our current in-guest metrics collection approach aligns with industry standards.
## Industry Research: AWS CloudWatch
AWS EC2 uses a two-tier monitoring architecture:
### Tier 1: Hypervisor Metrics (Automatic, No Agent)
- CPU Utilization ✅
- Network I/O ✅
- Disk I/O (read/write operations) ✅
- Memory Utilization ❌ (not available)
- Disk Space ❌ (not available)
### Tier 2: Guest Agent Metrics (CloudWatch Agent Required)
- Memory Utilization ✅
- Disk Space ✅
- Process-level metrics ✅
From AWS Documentation:
"When the CloudWatch agent collects memory metrics, the source is the host's memory management subsystem. For example, the Linux kernel exposes OS-maintained data in /proc. For memory, the data is in /proc/meminfo."
## Why the Hypervisor Cannot See Guest Memory
From AWS Under the Hood:
"Memory utilization needs to be assessed based on what processes within the operating system are using that aren't visible to the hypervisor."
"The hypervisor can observe how much data is read/written, but it doesn't know the available disk space or how it's partitioned within the VM's file system."
This is a fundamental architectural constraint: when the guest OS frees memory, it updates its internal free list, but the actual page contents remain unchanged, so the hypervisor has no way to detect that those pages are free.
## Current VM0 Implementation
Our sandbox scripts collect metrics from inside the VM:
- CPU: `/proc/stat`
- Memory: `free -b`
- Disk: `df -B1 /`
This is exactly what the AWS CloudWatch agent does: it runs inside the guest and reads from `/proc/meminfo`.
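A minimal sketch of what that in-guest collection looks like (the output field names are illustrative, not our actual wire format):

```sh
#!/bin/sh
# Sketch of the in-guest collection described above.

# CPU: cumulative jiffies from the aggregate "cpu" line of /proc/stat.
# A real collector samples this twice and reports the delta as utilization.
cpu_total=$(awk '/^cpu / {s=0; for (i=2; i<=NF; i++) s+=$i; print s}' /proc/stat)
cpu_idle=$(awk '/^cpu / {print $5}' /proc/stat)

# Memory: total and used bytes from free -b (row 2: total=$2, used=$3).
mem_total=$(free -b | awk 'NR==2 {print $2}')
mem_used=$(free -b | awk 'NR==2 {print $3}')

# Disk: total and used bytes for the root filesystem from df -B1.
disk_total=$(df -B1 / | awk 'NR==2 {print $2}')
disk_used=$(df -B1 / | awk 'NR==2 {print $3}')

echo "cpu_total=${cpu_total} cpu_idle=${cpu_idle} mem=${mem_used}/${mem_total} disk=${disk_used}/${disk_total}"
```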
## Future Consideration: Two-Tier Architecture
We could enhance our metrics by adding host-side (Tier 1) metrics to complement existing guest metrics:
| Metric Source | CPU | Memory | Disk Space | Disk I/O | Network I/O |
|---|---|---|---|---|---|
| Host (procfs/cgroups) | ✅ | ❌ (only RSS) | ❌ | ✅ | ✅ |
| Guest (current) | ✅ | ✅ | ✅ | - | - |
### Potential Benefits
- Host-side CPU includes VMM overhead (more complete picture)
- Network I/O from TAP device (currently not collected)
- Disk I/O from block device statistics
- Metrics remain available even if the guest agent fails (a sketch of host-side collection follows this list)
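A rough sketch of host-side (Tier 1) collection from standard `/proc` and `/sys` interfaces. `FC_PID`, `TAP_DEV`, and `BLK_DEV` are hypothetical placeholders for values the runner would already know:

```sh
#!/bin/sh
# Sketch of host-side (Tier 1) collection; all three identifiers below are
# hypothetical placeholders.
FC_PID=1234      # PID of the Firecracker process (hypothetical)
TAP_DEV=tap0     # TAP device backing the guest NIC (hypothetical)
BLK_DEV=loop0    # host block device backing the guest drive (hypothetical)

# CPU: utime + stime of the Firecracker process, in clock ticks. Splitting
# on ")" first keeps the field count stable even if comm contains spaces.
cpu_ticks=$(cut -d')' -f2- "/proc/${FC_PID}/stat" | awk '{print $12 + $13}')

# Network I/O from the host side of the TAP device. Directions are flipped:
# bytes the guest transmits show up as rx_bytes on the TAP.
net_rx=$(cat "/sys/class/net/${TAP_DEV}/statistics/rx_bytes")
net_tx=$(cat "/sys/class/net/${TAP_DEV}/statistics/tx_bytes")

# Disk I/O from /sys/block/<dev>/stat: sectors read ($3) and written ($7),
# in 512-byte sectors.
rd_sectors=$(awk '{print $3}' "/sys/block/${BLK_DEV}/stat")
wr_sectors=$(awk '{print $7}' "/sys/block/${BLK_DEV}/stat")

echo "cpu_ticks=${cpu_ticks} net_rx=${net_rx} net_tx=${net_tx} disk_rd=${rd_sectors} disk_wr=${wr_sectors}"
```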
### Implementation Options
- **Firecracker balloon device** - can report guest memory stats over virtio, but requires balloon support in the guest kernel
- **procfs for the Firecracker process** - CPU time and RSS (RSS is not an accurate measure of guest memory usage)
- **cgroups v2** - works if we run Firecracker inside a cgroup (requires an infrastructure change); see the sketch below
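For the cgroups v2 option, reading the standard controller files might look like this. The cgroup path is hypothetical and assumes each Firecracker process is placed in its own cgroup:

```sh
#!/bin/sh
# Sketch for the cgroups v2 option; the cgroup path is a hypothetical
# per-sandbox placement.
CG=/sys/fs/cgroup/vm0.slice/sandbox-1234

# memory.current: bytes charged to the cgroup (VMM overhead plus every guest
# RAM page touched so far); note this is not the guest's used/free split.
mem_bytes=$(cat "${CG}/memory.current")

# cpu.stat: usage_usec is cumulative CPU time in microseconds.
cpu_usec=$(awk '/^usage_usec/ {print $2}' "${CG}/cpu.stat")

echo "cgroup_mem_bytes=${mem_bytes} cgroup_cpu_usec=${cpu_usec}"
```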
## Conclusion
No changes are needed now; our current approach is industry-standard. This issue tracks a potential future enhancement: adding complementary host-side metrics.