Summary
I'd like to propose adding Kubernetes-native deployment and monitoring support for CoreWeave SUNK (Slurm on Kubernetes) environments. Since SUNK exposes the standard Slurm interface on top of Kubernetes, GCM already works with SUNK clusters out of the box for most collectors. This proposal focuses on closing the gaps and adding Kubernetes-aware monitoring.
Motivation
SUNK is gaining traction as a deployment model for GPU clusters, especially in cloud-native environments. While GCM's existing collectors work with SUNK (same Slurm CLI/REST interface), there are opportunities to:
- Enrich Slurm data with Kubernetes metadata — correlate Slurm job IDs with K8s pod status, resource requests/limits, and node conditions
- Deploy GCM natively in K8s — run as a sidecar or DaemonSet alongside SUNK components rather than requiring separate host-level installation
- Adapt health checks for containerized environments — checks that depend on host access (dmesg, syslogs) need container-aware alternatives
Proposed Scope
1. Kubernetes metrics collector (`k8s_pod_monitor.py`)
- New collector that queries the Kubernetes API to gather pod-level metrics
- Correlates Slurm job IDs with their corresponding K8s pods
- Collects pod status, resource requests/limits, restart counts, and node conditions
- Follows the existing collector pattern (Click CLI, `SinkImpl` protocol, `run_data_collection_loop`)
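As a rough sketch of the correlation logic (everything here is a hypothetical assumption, not an existing GCM or SUNK API: the `slurm-job-id` pod label, the function names, and the `slurm` namespace default are all illustrative):

```python
def summarize_pod(pod: dict) -> dict:
    """Reduce a pod (as returned by V1Pod.to_dict()) to job-level metrics."""
    status = pod.get("status") or {}
    statuses = status.get("container_statuses") or []
    containers = (pod.get("spec") or {}).get("containers") or []
    labels = (pod.get("metadata") or {}).get("labels") or {}
    return {
        # Assumption: SUNK-managed job pods carry their Slurm job ID as a label.
        "slurm_job_id": labels.get("slurm-job-id"),
        "phase": status.get("phase"),
        "restarts": sum(s.get("restart_count") or 0 for s in statuses),
        "cpu_requests": [
            ((c.get("resources") or {}).get("requests") or {}).get("cpu")
            for c in containers
        ],
    }


def collect_pod_metrics(namespace: str = "slurm") -> list:
    """Query the Kubernetes API for all pods in `namespace` and summarize them."""
    from kubernetes import client, config  # third-party: pip install kubernetes

    config.load_incluster_config()  # use the pod's ServiceAccount credentials
    pods = client.CoreV1Api().list_namespaced_pod(namespace)
    return [summarize_pod(p.to_dict()) for p in pods.items]
```

The summarized records would then be handed to the existing sink loop like any other collector's output; the actual label key used by SUNK would need to be confirmed against a real cluster.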
2. Helm chart adaptation for SUNK deployment
- Extend the existing Helm chart (from #57, "Add Docker image and Helm chart for Kubernetes deployment") with SUNK-specific values
- Support deployment as a sidecar container or standalone DaemonSet
- Add ServiceAccount and RBAC for Kubernetes API access
- ConfigMap-driven selection between CLI and REST client backends
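An illustrative `values.yaml` fragment showing what those knobs might look like (every key name below is hypothetical; none are taken from the existing chart):

```yaml
# Hypothetical SUNK values -- key names are illustrative only.
sunk:
  enabled: true
deployment:
  mode: daemonset          # or "sidecar" alongside SUNK components
rbac:
  create: true             # ServiceAccount + Role with pod/node read access
collector:
  backend: rest            # "cli" or "rest"; surfaced to GCM via a ConfigMap
  slurmrestdUrl: http://slurmrestd.slurm.svc:6820   # example in-cluster endpoint
```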
3. Container-aware health checks
- Adapt checks that rely on host-level access (dmesg, syslogs) to work inside containers
- Add K8s-specific health checks (pod readiness, node conditions, PVC status)
- Graceful degradation when host access is unavailable
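A minimal sketch of the degradation logic for the kernel-log check (the cgroup marker strings are a common heuristic, and the `"skip"` sentinel is an assumption, not an existing GCM convention):

```python
import os


def running_in_container(cgroup_text: str) -> bool:
    """Heuristic: container runtimes usually leave traces in /proc/1/cgroup."""
    markers = ("docker", "containerd", "kubepods")
    return any(m in cgroup_text for m in markers)


def pick_kernel_log_source(cgroup_text: str, kmsg_readable: bool) -> str:
    """Choose a kernel-log strategy, degrading gracefully inside containers."""
    if not running_in_container(cgroup_text):
        return "dmesg"        # host install: shell out to dmesg as today
    if kmsg_readable:
        return "/dev/kmsg"    # privileged pod with the host's kmsg readable
    return "skip"             # unprivileged pod: report the check as unavailable


def kernel_log_check() -> str:
    """Entry point: inspect the runtime environment and pick a log source."""
    try:
        with open("/proc/1/cgroup") as f:
            cgroup_text = f.read()
    except OSError:
        cgroup_text = ""
    return pick_kernel_log_source(cgroup_text, os.access("/dev/kmsg", os.R_OK))
```

Note the heuristic is imperfect (cgroup v2 pods may show only `0::/`), so a real implementation would likely combine several signals before marking a check as skipped.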
4. Documentation
- Deployment guide for CoreWeave SUNK environments
- Example `values.yaml` for SUNK integration
- Architecture diagram showing GCM components in a SUNK cluster
Compatibility
Since SUNK exposes the standard Slurm interface, all existing collectors remain fully compatible:
| GCM Component | SUNK Compatible | Notes |
|---|---|---|
| SlurmCliClient | Yes | Same CLI commands via login nodes |
| SlurmRestClient (#62) | Yes | slurmrestd available in SUNK |
| All collectors (squeue, sinfo, sshare, sprio, etc.) | Yes | Identical Slurm output |
| Health checks | Partial | Host-level checks need adaptation |
| Exporters/Sinks | Yes | Fully scheduler-agnostic |
Testing
I have access to my own lab environment where I can validate the implementation end-to-end on a SUNK cluster. All new code will include unit tests following the existing patterns.
Timeline
I estimate approximately 4 weeks to deliver a complete implementation; this allows for a trip scheduled in the middle of the development period. Happy to break this into smaller PRs if preferred.
Questions for Maintainers
- Does this direction align with the project's vision for "Support for additional schedulers beyond Slurm" from the roadmap?
- Would you prefer this as a single PR or broken into smaller incremental PRs?
- Any specific Kubernetes metrics or health checks you'd like to see prioritized?
Looking forward to your feedback before starting implementation.