
feat(runner): consider two-tier metrics architecture (hypervisor + guest agent) #1283

@lancy

Background

We investigated moving metrics collection from inside the sandbox (microVM) to the runner layer. After researching how major cloud providers handle VM monitoring, we found that our current in-guest metrics collection approach aligns with industry standards.

Industry Research: AWS CloudWatch

AWS EC2 uses a two-tier monitoring architecture:

Tier 1: Hypervisor Metrics (Automatic, No Agent)

  • CPU Utilization ✅
  • Network I/O ✅
  • Disk I/O (read/write operations) ✅
  • Memory Utilization ❌ (not available)
  • Disk Space ❌ (not available)

Tier 2: Guest Agent Metrics (CloudWatch Agent Required)

  • Memory Utilization ✅
  • Disk Space ✅
  • Process-level metrics ✅

From AWS Documentation:

"When the CloudWatch agent collects memory metrics, the source is the host's memory management subsystem. For example, the Linux kernel exposes OS-maintained data in /proc. For memory, the data is in /proc/meminfo."

Why Hypervisor Cannot See Guest Memory

From AWS Under the Hood:

"Memory utilization needs to be assessed based on what processes within the operating system are using that aren't visible to the hypervisor."

"The hypervisor can observe how much data is read/written, but it doesn't know the available disk space or how it's partitioned within the VM's file system."

This is a fundamental architectural constraint: when the guest OS frees memory, it only updates its internal free list; the actual page data remains unchanged, so the hypervisor has no way to detect that the memory is free.

Current VM0 Implementation

Our sandbox scripts collect metrics from inside the VM:

  • CPU: /proc/stat
  • Memory: free -b
  • Disk: df -B1 /

This is exactly what the AWS CloudWatch Agent does: it runs inside the guest and reads from /proc/meminfo.
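For reference, a collector could take the same readings programmatically instead of shelling out to free and df. The sketch below is illustrative only: it samples aggregate CPU from /proc/stat and root-filesystem space via the statfs syscall (the mechanism behind df -B1 /):

```go
// Minimal sketch of the other two guest-side readings: CPU from /proc/stat
// and root-filesystem space via statfs (equivalent to `df -B1 /`).
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
	"syscall"
	"time"
)

// cpuBusyAndTotal reads the aggregate "cpu" line from /proc/stat and returns
// busy and total jiffies. Utilization is the delta of busy/total between two samples.
func cpuBusyAndTotal() (busy, total uint64, err error) {
	data, err := os.ReadFile("/proc/stat")
	if err != nil {
		return 0, 0, err
	}
	line := strings.SplitN(string(data), "\n", 2)[0] // "cpu  user nice system idle iowait ..."
	fields := strings.Fields(line)[1:]
	for i, f := range fields {
		v, _ := strconv.ParseUint(f, 10, 64)
		total += v
		if i != 3 && i != 4 { // skip idle and iowait
			busy += v
		}
	}
	return busy, total, nil
}

func main() {
	b1, t1, _ := cpuBusyAndTotal()
	time.Sleep(time.Second)
	b2, t2, _ := cpuBusyAndTotal()
	fmt.Printf("cpu: %.1f%%\n", 100*float64(b2-b1)/float64(t2-t1))

	// Disk space on "/" in bytes, like `df -B1 /`.
	var st syscall.Statfs_t
	if err := syscall.Statfs("/", &st); err == nil {
		total := st.Blocks * uint64(st.Bsize)
		free := st.Bavail * uint64(st.Bsize)
		fmt.Printf("disk: %d of %d bytes free\n", free, total)
	}
}
```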

Future Consideration: Two-Tier Architecture

We could enhance our metrics by adding host-side (Tier 1) metrics to complement the existing guest (Tier 2) metrics:

| Metric Source | CPU | Memory | Disk Space | Disk I/O | Network I/O |
| --- | --- | --- | --- | --- | --- |
| Host (procfs/cgroups) | ✅ | ❌ (only RSS) | ❌ | ✅ | ✅ |
| Guest (current) | ✅ | ✅ | ✅ | - | - |

Potential Benefits

  • Host-side CPU includes VMM overhead (a more complete picture)
  • Network I/O from the TAP device (not currently collected); see the sketch after this list
  • Disk I/O from block-device statistics
  • Metrics remain available even if the guest agent fails
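To make the Network I/O point concrete, one rough sketch of reading TAP-device counters from sysfs on the host follows. The interface name tap0 is a placeholder, not our actual naming; the runner would substitute whatever TAP device it attached to the microVM:

```go
// Sketch: host-side (Tier 1) network counters for a microVM's TAP device,
// read from sysfs on the host.
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strconv"
	"strings"
)

// readNetCounter reads a single counter file such as rx_bytes or tx_bytes
// under /sys/class/net/<dev>/statistics/.
func readNetCounter(dev, counter string) (uint64, error) {
	p := filepath.Join("/sys/class/net", dev, "statistics", counter)
	data, err := os.ReadFile(p)
	if err != nil {
		return 0, err
	}
	return strconv.ParseUint(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	// "tap0" is a hypothetical device name for illustration.
	for _, c := range []string{"rx_bytes", "tx_bytes", "rx_packets", "tx_packets"} {
		v, err := readNetCounter("tap0", c)
		if err != nil {
			fmt.Fprintln(os.Stderr, err)
			continue
		}
		fmt.Printf("%s: %d\n", c, v)
	}
}
```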

Implementation Options

  1. Firecracker balloon device - can provide guest memory stats via virtio, but requires guest kernel support
  2. procfs for the Firecracker process - CPU time and RSS (RSS is not an accurate measure of guest memory usage); see the sketch after this list
  3. cgroups v2 - if we run Firecracker inside cgroups (requires an infrastructure change)
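For option 2, a minimal sketch of what the runner could read from procfs for the Firecracker process follows. The VMM PID is assumed to be known to the runner, and as noted above, VmRSS reflects the VMM process footprint rather than guest memory usage:

```go
// Sketch for option 2: per-process CPU time and RSS for the Firecracker VMM,
// read from procfs on the host.
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// vmmStats returns cumulative CPU time in clock ticks (utime+stime from
// /proc/<pid>/stat) and resident set size in kB (VmRSS from /proc/<pid>/status).
func vmmStats(pid int) (cpuTicks, rssKB uint64, err error) {
	stat, err := os.ReadFile(fmt.Sprintf("/proc/%d/stat", pid))
	if err != nil {
		return 0, 0, err
	}
	// utime and stime are fields 14 and 15 (1-indexed). Splitting after the
	// closing ')' of the comm field avoids problems with spaces in the name.
	s := string(stat)
	rest := s[strings.LastIndexByte(s, ')')+2:]
	fields := strings.Fields(rest) // fields[0] is the state field, so utime/stime are [11] and [12]
	utime, _ := strconv.ParseUint(fields[11], 10, 64)
	stime, _ := strconv.ParseUint(fields[12], 10, 64)
	cpuTicks = utime + stime

	status, err := os.ReadFile(fmt.Sprintf("/proc/%d/status", pid))
	if err != nil {
		return 0, 0, err
	}
	for _, line := range strings.Split(string(status), "\n") {
		if strings.HasPrefix(line, "VmRSS:") {
			rssKB, _ = strconv.ParseUint(strings.Fields(line)[1], 10, 64)
			break
		}
	}
	return cpuTicks, rssKB, nil
}

func main() {
	pid := os.Getpid() // stand-in; the runner would use the Firecracker PID
	cpu, rss, err := vmmStats(pid)
	if err != nil {
		panic(err)
	}
	fmt.Printf("cpu ticks: %d, rss: %d kB\n", cpu, rss)
}
```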

Conclusion

No changes are needed now. Our current approach is industry-standard. This issue tracks a potential future enhancement to add complementary host-side metrics.
