Summary
I'd like to propose adding Kubernetes-native deployment and monitoring support for CoreWeave SUNK (Slurm on Kubernetes) environments. Since SUNK exposes the standard Slurm interface on top of Kubernetes, GCM already works with SUNK clusters out of the box for most collectors. This proposal focuses on closing the gaps and adding Kubernetes-aware monitoring.
Motivation
SUNK is gaining traction as a deployment model for GPU clusters, especially in cloud-native environments. While GCM's existing collectors work with SUNK (same Slurm CLI/REST interface), there are opportunities to:
- Enrich Slurm data with Kubernetes metadata — correlate Slurm job IDs with K8s pod status, resource requests/limits, and node conditions
- Deploy GCM natively in K8s — run as a sidecar or DaemonSet alongside SUNK components rather than requiring separate host-level installation
- Adapt health checks for containerized environments — checks that depend on host access (dmesg, syslogs) need container-aware alternatives
Proposed Scope
1. Kubernetes metrics collector (`k8s_pod_monitor.py`)
- New collector that queries the Kubernetes API to gather pod-level metrics
- Correlates Slurm job IDs with their corresponding K8s pods
- Collects pod status, resource requests/limits, restart counts, and node conditions
- Follows the existing collector pattern (Click CLI, `SinkImpl` protocol, `run_data_collection_loop`)
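As a rough sketch of the correlation logic (everything here is a hypothetical assumption, not an existing GCM or SUNK API: the `slurm-job-id` pod label, the function names, and the `slurm` namespace default are all illustrative):

```python
def summarize_pod(pod: dict) -> dict:
    """Reduce a pod (as returned by V1Pod.to_dict()) to job-level metrics."""
    status = pod.get("status") or {}
    statuses = status.get("container_statuses") or []
    containers = (pod.get("spec") or {}).get("containers") or []
    labels = (pod.get("metadata") or {}).get("labels") or {}
    return {
        # Assumption: SUNK-managed job pods carry their Slurm job ID as a label.
        "slurm_job_id": labels.get("slurm-job-id"),
        "phase": status.get("phase"),
        "restarts": sum(s.get("restart_count") or 0 for s in statuses),
        "cpu_requests": [
            ((c.get("resources") or {}).get("requests") or {}).get("cpu")
            for c in containers
        ],
    }


def collect_pod_metrics(namespace: str = "slurm") -> list:
    """Query the Kubernetes API for all pods in `namespace` and summarize them."""
    from kubernetes import client, config  # third-party: pip install kubernetes

    config.load_incluster_config()  # use the pod's ServiceAccount credentials
    pods = client.CoreV1Api().list_namespaced_pod(namespace)
    return [summarize_pod(p.to_dict()) for p in pods.items]
```

The summarized records would then be handed to the existing sink loop like any other collector's output; the actual label key used by SUNK would need to be confirmed against a real cluster.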
2. Helm chart adaptation for SUNK deployment
- Extend the existing Helm chart (from #57, "Add Docker image and Helm chart for Kubernetes deployment") with SUNK-specific values
- Support deployment as a sidecar container or standalone DaemonSet
- Add ServiceAccount and RBAC for Kubernetes API access
- ConfigMap-driven selection between CLI and REST client backends
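An illustrative `values.yaml` fragment showing what those knobs might look like (every key name below is hypothetical; none are taken from the existing chart):

```yaml
# Hypothetical SUNK values -- key names are illustrative only.
sunk:
  enabled: true
deployment:
  mode: daemonset          # or "sidecar" alongside SUNK components
rbac:
  create: true             # ServiceAccount + Role with pod/node read access
collector:
  backend: rest            # "cli" or "rest"; surfaced to GCM via a ConfigMap
  slurmrestdUrl: http://slurmrestd.slurm.svc:6820   # example in-cluster endpoint
```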
3. Container-aware health checks
- Adapt checks that rely on host-level access (dmesg, syslogs) to work inside containers
- Add K8s-specific health checks (pod readiness, node conditions, PVC status)
- Graceful degradation when host access is unavailable
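A minimal sketch of the degradation logic for the kernel-log check (the cgroup marker strings are a common heuristic, and the `"skip"` sentinel is an assumption, not an existing GCM convention):

```python
import os


def running_in_container(cgroup_text: str) -> bool:
    """Heuristic: container runtimes usually leave traces in /proc/1/cgroup."""
    markers = ("docker", "containerd", "kubepods")
    return any(m in cgroup_text for m in markers)


def pick_kernel_log_source(cgroup_text: str, kmsg_readable: bool) -> str:
    """Choose a kernel-log strategy, degrading gracefully inside containers."""
    if not running_in_container(cgroup_text):
        return "dmesg"        # host install: shell out to dmesg as today
    if kmsg_readable:
        return "/dev/kmsg"    # privileged pod with the host's kmsg readable
    return "skip"             # unprivileged pod: report the check as unavailable


def kernel_log_check() -> str:
    """Entry point: inspect the runtime environment and pick a log source."""
    try:
        with open("/proc/1/cgroup") as f:
            cgroup_text = f.read()
    except OSError:
        cgroup_text = ""
    return pick_kernel_log_source(cgroup_text, os.access("/dev/kmsg", os.R_OK))
```

Note the heuristic is imperfect (cgroup v2 pods may show only `0::/`), so a real implementation would likely combine several signals before marking a check as skipped.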
4. Documentation
- Deployment guide for CoreWeave SUNK environments
- Example `values.yaml` for SUNK integration
- Architecture diagram showing GCM components in a SUNK cluster
Compatibility
Since SUNK exposes the standard Slurm interface, all existing collectors remain fully compatible:
| GCM Component | SUNK Compatible | Notes |
|---|---|---|
| SlurmCliClient | Yes | Same CLI commands via login nodes |
| SlurmRestClient (#62) | Yes | slurmrestd available in SUNK |
| All collectors (squeue, sinfo, sshare, sprio, etc.) | Yes | Identical Slurm output |
| Health checks | Partial | Host-level checks need adaptation |
| Exporters/Sinks | Yes | Fully scheduler-agnostic |
Testing
I have access to my own lab environment where I can validate the implementation end-to-end on a SUNK cluster. All new code will include unit tests following the existing patterns.
Timeline
I estimate approximately 4 weeks to deliver a complete implementation; this allows for a trip scheduled in the middle of the development period. Happy to break this into smaller PRs if preferred.
Questions for Maintainers
- Does this direction align with the project's vision for "Support for additional schedulers beyond Slurm" from the roadmap?
- Would you prefer this as a single PR or broken into smaller incremental PRs?
- Any specific Kubernetes metrics or health checks you'd like to see prioritized?
Looking forward to your feedback before starting implementation.