Proposal: Kubernetes-native deployment and monitoring for SUNK (Slurm-on-K8s) #63

@gustcol

Summary

I'd like to propose adding Kubernetes-native deployment and monitoring support for CoreWeave SUNK (Slurm on Kubernetes) environments. Since SUNK exposes the standard Slurm interface on top of Kubernetes, GCM already works with SUNK clusters out of the box for most collectors. This proposal focuses on closing the gaps and adding Kubernetes-aware monitoring.

Motivation

SUNK is gaining traction as a deployment model for GPU clusters, especially in cloud-native environments. While GCM's existing collectors work with SUNK (same Slurm CLI/REST interface), there are opportunities to:

  1. Enrich Slurm data with Kubernetes metadata — correlate Slurm job IDs with K8s pod status, resource requests/limits, and node conditions
  2. Deploy GCM natively in K8s — run as a sidecar or DaemonSet alongside SUNK components rather than requiring separate host-level installation
  3. Adapt health checks for containerized environments — checks that depend on host access (dmesg, syslogs) need container-aware alternatives

Proposed Scope

1. Kubernetes metrics collector (k8s_pod_monitor.py)

  • New collector that queries the Kubernetes API to gather pod-level metrics
  • Correlates Slurm job IDs with their corresponding K8s pods
  • Collects pod status, resource requests/limits, restart counts, and node conditions
  • Follows the existing collector pattern (Click CLI, SinkImpl protocol, run_data_collection_loop)
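The correlation step of such a collector could look roughly like the sketch below. It is not GCM code: it operates on pod objects already decoded from the Kubernetes API's JSON (`metadata`, `spec`, and `status` follow the standard Pod schema), and it assumes the Slurm job ID is exposed as a pod annotation. The annotation key `example.com/slurm-job-id` is a placeholder, since how SUNK actually tags pods has to be confirmed on a real cluster, and `PodSummary` and `summarize_pod` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional

# Placeholder annotation key (assumption): the key SUNK really uses, if any,
# must be verified against a live cluster before relying on it.
JOB_ID_ANNOTATION = "example.com/slurm-job-id"


@dataclass(frozen=True)
class PodSummary:
    name: str
    node: Optional[str]
    phase: str
    slurm_job_id: Optional[str]
    restart_count: int
    requests: Dict[str, Dict[str, str]]  # container name -> resource requests
    limits: Dict[str, Dict[str, str]]    # container name -> resource limits


def summarize_pod(pod: Mapping[str, Any]) -> PodSummary:
    """Flatten one Pod object (decoded API JSON) into the fields of interest."""
    meta = pod.get("metadata") or {}
    spec = pod.get("spec") or {}
    status = pod.get("status") or {}

    # Sum restarts across every container reported in the pod status.
    restarts = sum(
        cs.get("restartCount", 0) for cs in status.get("containerStatuses") or []
    )

    # Keep requests and limits per container so nothing is silently merged.
    requests: Dict[str, Dict[str, str]] = {}
    limits: Dict[str, Dict[str, str]] = {}
    for container in spec.get("containers") or []:
        resources = container.get("resources") or {}
        requests[container["name"]] = dict(resources.get("requests") or {})
        limits[container["name"]] = dict(resources.get("limits") or {})

    return PodSummary(
        name=meta.get("name", ""),
        node=spec.get("nodeName"),
        phase=status.get("phase", "Unknown"),
        slurm_job_id=(meta.get("annotations") or {}).get(JOB_ID_ANNOTATION),
        restart_count=restarts,
        requests=requests,
        limits=limits,
    )
```

Keeping the function pure (dict in, dataclass out) means it can be unit-tested without a cluster; the API call that lists pods would live in a separate, thin layer.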

2. Helm chart adaptation for SUNK deployment

  • Extend the existing Helm chart (from #57, "Add Docker image and Helm chart for Kubernetes deployment") with SUNK-specific values
  • Support deployment as a sidecar container or standalone DaemonSet
  • Add ServiceAccount and RBAC for Kubernetes API access
  • ConfigMap-driven selection between CLI and REST client backends
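ConfigMap-driven backend selection could be as small as the sketch below, assuming the ConfigMap is projected into the container as environment variables. The variable names `GCM_SLURM_BACKEND` and `GCM_SLURMRESTD_URL` and the helper `select_backend` are hypothetical, not existing GCM settings. The function only returns which backend to use; constructing `SlurmCliClient` or `SlurmRestClient` is left out because their constructors are not assumed here.

```python
import os
from typing import Mapping, Optional, Tuple

VALID_BACKENDS = ("cli", "rest")


def select_backend(env: Mapping[str, str] = os.environ) -> Tuple[str, Optional[str]]:
    """Return (backend, rest_url) from ConfigMap-provided settings.

    Both variable names are hypothetical placeholders for this sketch.
    """
    backend = env.get("GCM_SLURM_BACKEND", "cli").strip().lower()
    if backend not in VALID_BACKENDS:
        raise ValueError(
            f"unknown backend {backend!r}; expected one of {VALID_BACKENDS}"
        )

    if backend == "rest":
        # The REST backend is useless without an endpoint, so fail early.
        url = env.get("GCM_SLURMRESTD_URL", "").strip()
        if not url:
            raise ValueError("rest backend requires GCM_SLURMRESTD_URL")
        return backend, url

    return backend, None
```

Defaulting to the CLI backend keeps an unconfigured deployment on the existing behavior, and validating at startup surfaces a mistyped ConfigMap value immediately rather than at the first collection cycle.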

3. Container-aware health checks

  • Adapt checks that rely on host-level access (dmesg, syslogs) to work inside containers
  • Add K8s-specific health checks (pod readiness, node conditions, PVC status)
  • Graceful degradation when host access is unavailable
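Graceful degradation could follow the pattern sketched below: attempt the host-level read, and report a distinct "skipped" status instead of a failure when the container cannot see the host. `CheckStatus`, `CheckResult`, and `check_host_log` are hypothetical names for illustration, not GCM's existing health-check interfaces, and the log path is whatever the deployment chooses to mount.

```python
from dataclasses import dataclass
from enum import Enum
from pathlib import Path
from typing import Sequence


class CheckStatus(Enum):
    OK = "ok"
    FAILED = "failed"
    SKIPPED = "skipped"  # host data is not reachable from this container


@dataclass(frozen=True)
class CheckResult:
    status: CheckStatus
    detail: str


def check_host_log(path: Path, patterns: Sequence[str]) -> CheckResult:
    """Scan a host log file for error patterns, degrading to SKIPPED
    when the file cannot be read (not mounted, no permission, ...)."""
    try:
        text = path.read_text(errors="replace")
    except OSError as exc:
        # Missing mount or denied access is an environment limitation,
        # not a node fault, so it must not be reported as FAILED.
        return CheckResult(
            CheckStatus.SKIPPED, f"{path} unavailable: {type(exc).__name__}"
        )

    hits = [
        line for line in text.splitlines() if any(p in line for p in patterns)
    ]
    if hits:
        return CheckResult(
            CheckStatus.FAILED, f"{len(hits)} matching line(s); first: {hits[0]}"
        )
    return CheckResult(CheckStatus.OK, "no matching lines")
```

A caller might pass a kernel log exposed through a read-only hostPath mount together with a pattern such as `"NVRM: Xid"`; where no such mount exists, the check reports `SKIPPED` and downstream alerting can treat that differently from a real failure.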

4. Documentation

  • Deployment guide for CoreWeave SUNK environments
  • Example values.yaml for SUNK integration
  • Architecture diagram showing GCM components in a SUNK cluster

Compatibility

Since SUNK exposes the standard Slurm interface, the existing collectors, clients, and sinks remain compatible; only the host-level health checks need adaptation:

| GCM Component | SUNK Compatible | Notes |
| --- | --- | --- |
| SlurmCliClient | Yes | Same CLI commands via login nodes |
| SlurmRestClient (#62) | Yes | slurmrestd available in SUNK |
| All collectors (squeue, sinfo, sshare, sprio, etc.) | Yes | Identical Slurm output |
| Health checks | Partial | Host-level checks need adaptation |
| Exporters/Sinks | Yes | Fully scheduler-agnostic |

Testing

I have access to my own lab environment where I can validate the implementation end-to-end on a SUNK cluster. All new code will include unit tests following the existing patterns.

Timeline

I estimate approximately 4 weeks to deliver a complete implementation; this allows for a trip I have scheduled in the middle of the development period. Happy to break this into smaller PRs if preferred.

Questions for Maintainers

  1. Does this direction align with the project's vision for "Support for additional schedulers beyond Slurm" from the roadmap?
  2. Would you prefer this as a single PR or broken into smaller incremental PRs?
  3. Any specific Kubernetes metrics or health checks you'd like to see prioritized?

Looking forward to your feedback before starting implementation.
