2 changes: 2 additions & 0 deletions .gitignore
@@ -282,3 +282,5 @@ go.work.sum
# Meta-internal CI
skycastle/
scrut/

.claude/
22 changes: 14 additions & 8 deletions charts/gcm/README.md
@@ -23,22 +23,18 @@ The chart is published to GHCR as an OCI artifact and versioned alongside GCM re

```shell
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
--set healthChecks.cluster=my-cluster \
--set healthChecks.sink=otel \
--set monitoring.sink=otel \
--set monitoring.cluster=my-cluster
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace>
```
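
The `--set` flags removed from the command above can instead live in the values file passed with `-f`. A minimal sketch, assuming the chart reads the same nested keys that the `--set` paths in this README name (`healthChecks.cluster`, `healthChecks.sink`, `monitoring.cluster`, `monitoring.sink`); the file name and the `my-cluster` value are placeholders:

```shell
# Write a hypothetical custom-values.yaml whose keys mirror the --set paths
# used elsewhere in this README (an assumption about the chart's values schema).
cat > custom-values.yaml <<'EOF'
healthChecks:
  cluster: my-cluster
  sink: otel
monitoring:
  cluster: my-cluster
  sink: otel
EOF

# Then install with the file (needs helm and cluster access, so left commented):
# helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
#   -f custom-values.yaml \
#   --namespace <namespace>
```

Any `--set` flag given alongside `-f` takes precedence over the same key in the file, which is why the later examples can still toggle individual values on the command line.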

**DCGM 3** (for older NVIDIA drivers R535/R525):

```shell
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.image.tag=dcgm3 \
--set healthChecks.image.tag=dcgm3 \
--set healthChecks.cluster=my-cluster \
--set healthChecks.sink=otel \
--set monitoring.sink=otel \
--set monitoring.cluster=my-cluster
```

To pin a specific chart version, add `--version X.Y.Z`.
@@ -48,11 +44,15 @@ Health checks and monitoring are independent — you can deploy either or both:
```shell
# Health checks only
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=false \
--set healthChecks.cluster=my-cluster

# Monitoring only
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set healthChecks.enabled=false \
--set monitoring.sink=otel \
--set monitoring.cluster=my-cluster
@@ -186,13 +186,17 @@ Sink-specific options can be passed via `sinkOpts` (OmegaConf dot-list syntax).
```shell
# Monitoring: send GPU metrics to an OpenTelemetry collector
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.sink=otel \
--set monitoring.cluster=my-cluster \
--set monitoring.sinkOpts[0]=otel_endpoint=http://otel-collector:4318 \
--set "monitoring.sinkOpts[1]=metric_resource_attributes={'environment': 'production'}"

# Health checks: send results to an OpenTelemetry collector
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set healthChecks.sink=otel \
--set healthChecks.cluster=my-cluster \
--set healthChecks.sinkOpts[0]=otel_endpoint=http://otel-collector:4318 \
@@ -207,6 +211,8 @@ For clusters that use **labels** instead of taints to identify GPU nodes, use `n

```shell
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.nodeSelector."nvidia\.com/gpu\.present"=true \
--set healthChecks.nodeSelector."nvidia\.com/gpu\.present"=true
```
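
The same selector can be kept in a values file, where the dots in the label key need no backslash escaping. A minimal sketch, assuming the chart passes `nodeSelector` through unchanged; the file name is hypothetical, and the value is quoted so it stays a string, as Kubernetes requires for `nodeSelector` entries:

```shell
# Write a hypothetical values file carrying the GPU node selector for both
# components; the key names mirror the --set paths shown above.
cat > gpu-node-selector-values.yaml <<'EOF'
monitoring:
  nodeSelector:
    nvidia.com/gpu.present: "true"
healthChecks:
  nodeSelector:
    nvidia.com/gpu.present: "true"
EOF

# Pass it as an additional -f file (needs helm and cluster access, so left commented):
# helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
#   -f <PATH/TO>/custom-values.yaml \
#   -f gpu-node-selector-values.yaml \
#   --namespace <namespace>
```

When several `-f` files are given, values from later files override earlier ones.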
4 changes: 2 additions & 2 deletions charts/gcm/values.yaml
@@ -5,8 +5,8 @@

# Usage:
# helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
# -f charts/gcm/<CUSTOM-OVERRIDES>.yaml \
# --namespace monitoring
# -f <PATH/TO>/custom-values.yaml \
# --namespace <namespace>

imagePullSecrets: []
nameOverride: ""
2 changes: 1 addition & 1 deletion shelper/go.mod
@@ -1,4 +1,4 @@
module github.com/fairinternal/fair-cluster-monitoring/shelper
module github.com/facebookresearch/gcm/shelper

go 1.24.3

4 changes: 2 additions & 2 deletions slurmprocessor/common.go
@@ -1,13 +1,13 @@
// Copyright (c) Meta Platforms, Inc. and affiliates.
// All rights reserved.
package main
package slurmprocessor

import (
"context"
"log"
"strings"

shelper "github.com/fairinternal/fair-cluster-monitoring/shelper"
shelper "github.com/facebookresearch/gcm/shelper"
"go.opentelemetry.io/collector/component"
"go.opentelemetry.io/collector/consumer"
"go.opentelemetry.io/collector/pdata/pcommon"
2 changes: 1 addition & 1 deletion slurmprocessor/config.go
@@ -1,6 +1,6 @@
// Copyright (c) Meta Platforms, Inc. and affiliates.
// All rights reserved.
package main
package slurmprocessor

// Config Structure
type Config struct {
2 changes: 1 addition & 1 deletion slurmprocessor/config_test.go
@@ -1,6 +1,6 @@
// Copyright (c) Meta Platforms, Inc. and affiliates.
// All rights reserved.
package main
package slurmprocessor

import (
"path"
11 changes: 5 additions & 6 deletions slurmprocessor/factory.go
@@ -1,18 +1,17 @@
// Copyright (c) Meta Platforms, Inc. and affiliates.
// All rights reserved.
package main
package slurmprocessor

import (
"context"
"log"

"github.com/open-telemetry/opentelemetry-collector-contrib/processor/attributesprocessor/internal/metadata"
"go.opentelemetry.io/collector/component"
"go.opentelemetry.io/collector/consumer"
"go.opentelemetry.io/collector/processor"
"go.opentelemetry.io/collector/processor/processorhelper"

shelper "github.com/fairinternal/fair-cluster-monitoring/shelper"
shelper "github.com/facebookresearch/gcm/shelper"
)

const (
@@ -25,9 +24,9 @@ func NewFactory() processor.Factory {
return processor.NewFactory(
component.MustNewType(typeStr),
createDefaultConfig,
processor.WithTraces(createTracesProcessor, metadata.TracesStability),
processor.WithMetrics(createMetricsProcessor, metadata.MetricsStability),
processor.WithLogs(createLogsProcessor, metadata.LogsStability),
processor.WithTraces(createTracesProcessor, component.StabilityLevelAlpha),
processor.WithMetrics(createMetricsProcessor, component.StabilityLevelAlpha),
processor.WithLogs(createLogsProcessor, component.StabilityLevelAlpha),
)
}

5 changes: 2 additions & 3 deletions slurmprocessor/go.mod
@@ -1,9 +1,9 @@
module github.com/fairinternal/gpu-cluster-monitoring/slurmprocessor
module github.com/facebookresearch/gcm/slurmprocessor

go 1.24.3

require (
github.com/open-telemetry/opentelemetry-collector-contrib/processor/attributesprocessor v0.126.0
github.com/facebookresearch/gcm/shelper v0.0.1
github.com/stretchr/testify v1.10.0
go.opentelemetry.io/collector/component v1.32.0
go.opentelemetry.io/collector/consumer v1.32.0
@@ -23,7 +23,6 @@ require (
github.com/cespare/xxhash/v2 v2.3.0 // indirect
github.com/davecgh/go-spew v1.1.1 // indirect
github.com/ebitengine/purego v0.8.3 // indirect
github.com/fairinternal/fair-cluster-monitoring/shelper v0.0.0-20250620180146-cf5a014efe9f // indirect
github.com/go-logr/logr v1.4.2 // indirect
github.com/go-logr/stdr v1.2.2 // indirect
github.com/go-ole/go-ole v1.2.6 // indirect
6 changes: 2 additions & 4 deletions slurmprocessor/go.sum
@@ -12,8 +12,8 @@ github.com/davecgh/go-spew v1.1.1 h1:vj9j/u1bqnvCEfJOwUhtlOARqs3+rkHYY13jYWTU97c
github.com/davecgh/go-spew v1.1.1/go.mod h1:J7Y8YcW2NihsgmVo/mv3lAwl/skON4iLHjSsI+c5H38=
github.com/ebitengine/purego v0.8.3 h1:K+0AjQp63JEZTEMZiwsI9g0+hAMNohwUOtY0RPGexmc=
github.com/ebitengine/purego v0.8.3/go.mod h1:iIjxzd6CiRiOG0UyXP+V1+jWqUXVjPKLAI0mRfJZTmQ=
github.com/fairinternal/fair-cluster-monitoring/shelper v0.0.0-20250620180146-cf5a014efe9f h1:6O4bbLKO9n8m2myxE5oMk7+RAxGy9UYGDJlB/pWRzQM=
github.com/fairinternal/fair-cluster-monitoring/shelper v0.0.0-20250620180146-cf5a014efe9f/go.mod h1:8iCqKvY1tift2boY06yo0PsgbhPwi/NSmyzNLRRZyMQ=
github.com/facebookresearch/gcm/shelper v0.0.1 h1:QLRBEoZuI6dFX0AUvizC3j0bxrCQrmRiErtsgg1tV98=
github.com/facebookresearch/gcm/shelper v0.0.1/go.mod h1:51syI2aPfL2BYW0zEYjblygcomw9SsMTte7LxgD+b9Q=
github.com/felixge/httpsnoop v1.0.4 h1:NFTV2Zj1bL4mc9sqWACXbQFVBBg2W3GPvqp8/ESS2Wg=
github.com/felixge/httpsnoop v1.0.4/go.mod h1:m8KPJKqk1gH5J9DgRY2ASl2lWCfGKXixSwevea8zH2U=
github.com/foxboron/go-tpm-keyfiles v0.0.0-20250323135004-b31fac66206e h1:2jjYsGgM13xId2Ku+UGDQTO5It50LhT6lljiVJvBj1Y=
@@ -83,8 +83,6 @@ github.com/modern-go/reflect2 v1.0.2 h1:xBagoLtFs94CBntxluKeaWgTMpvLxC4ur3nMaC9G
github.com/modern-go/reflect2 v1.0.2/go.mod h1:yWuevngMOJpCy52FWWMvUC8ws7m/LJsjYzDa0/r8luk=
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822 h1:C3w9PqII01/Oq1c1nUAm88MOHcQC9l5mIlSMApZMrHA=
github.com/munnerz/goautoneg v0.0.0-20191010083416-a7dc8b61c822/go.mod h1:+n7T8mK8HuQTcFwEeznm/DIxMOiR9yIdICNftLE1DvQ=
github.com/open-telemetry/opentelemetry-collector-contrib/processor/attributesprocessor v0.126.0 h1:ezG3TqbSnQG9JcaLMh0cTts/Jvek6mlj/WApOC3wQtE=
github.com/open-telemetry/opentelemetry-collector-contrib/processor/attributesprocessor v0.126.0/go.mod h1:xbMS6tl+zIdD26RQXr6VdP2bDuBCBEdV6pC0WgNKiUI=
github.com/pierrec/lz4/v4 v4.1.22 h1:cKFw6uJDK+/gfw5BcDL0JL5aBsAFdsIT18eRtLj7VIU=
github.com/pierrec/lz4/v4 v4.1.22/go.mod h1:gZWDp/Ze/IJXGXf23ltt2EXimqmTUXEy0GFuRQyBid4=
github.com/pmezard/go-difflib v1.0.0 h1:4DBwDE0NGyQoBHbLQYPwSUPoCMWR5BEzIk/f1lZbAQM=
4 changes: 2 additions & 2 deletions slurmprocessor/logs.go
@@ -1,12 +1,12 @@
// Copyright (c) Meta Platforms, Inc. and affiliates.
// All rights reserved.
package main
package slurmprocessor

import (
"context"
"log"

shelper "github.com/fairinternal/fair-cluster-monitoring/shelper"
shelper "github.com/facebookresearch/gcm/shelper"
"go.opentelemetry.io/collector/component"
"go.opentelemetry.io/collector/consumer"
"go.opentelemetry.io/collector/pdata/plog"
4 changes: 2 additions & 2 deletions slurmprocessor/metrics.go
@@ -1,12 +1,12 @@
// Copyright (c) Meta Platforms, Inc. and affiliates.
// All rights reserved.
package main
package slurmprocessor

import (
"context"
"log"

shelper "github.com/fairinternal/fair-cluster-monitoring/shelper"
shelper "github.com/facebookresearch/gcm/shelper"
"go.opentelemetry.io/collector/component"
"go.opentelemetry.io/collector/consumer"
"go.opentelemetry.io/collector/pdata/pmetric"
4 changes: 2 additions & 2 deletions slurmprocessor/traces.go
@@ -1,12 +1,12 @@
// Copyright (c) Meta Platforms, Inc. and affiliates.
// All rights reserved.
package main
package slurmprocessor

import (
"context"
"log"

shelper "github.com/fairinternal/fair-cluster-monitoring/shelper"
shelper "github.com/facebookresearch/gcm/shelper"
"go.opentelemetry.io/collector/component"
"go.opentelemetry.io/collector/consumer"
"go.opentelemetry.io/collector/pdata/ptrace"
12 changes: 8 additions & 4 deletions website/docs/GCM_Health_Checks/kubernetes_deployment.md
@@ -30,16 +30,20 @@ The recommended way to deploy on Kubernetes is via the [GCM Helm chart](https://

```shell
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
--set healthChecks.cluster=my-cluster \
--set healthChecks.sink=otel
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=false \
--set healthChecks.enabled=true
```

Or from source:

```shell
helm install gcm charts/gcm \
--set healthChecks.cluster=my-cluster \
--set healthChecks.sink=otel
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=false \
--set healthChecks.enabled=true
```

See the [Helm chart README](https://github.com/facebookresearch/gcm/tree/main/charts/gcm/README.md) for full configuration options.
24 changes: 18 additions & 6 deletions website/docs/GCM_Monitoring/kubernetes_deployment.md
@@ -29,16 +29,20 @@ The recommended way to deploy on Kubernetes is via the [GCM Helm chart](https://

```shell
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
--set monitoring.sink=otel \
--set monitoring.cluster=my-cluster
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=true \
--set healthChecks.enabled=false
```

Or from source:

```shell
helm install gcm charts/gcm \
--set monitoring.sink=otel \
--set monitoring.cluster=my-cluster
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=true \
--set healthChecks.enabled=false
```

See the [Helm chart README](https://github.com/facebookresearch/gcm/tree/main/charts/gcm/README.md) for full configuration options.
@@ -57,7 +61,11 @@ See the [Helm chart README](https://github.com/facebookresearch/gcm/tree/main/ch
### Sending Metrics to OpenTelemetry

```shell
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=true \
--set healthChecks.enabled=false \
--set monitoring.sink=otel \
--set monitoring.cluster=my-cluster \
--set monitoring.extraEnv[0].name=OTEL_EXPORTER_OTLP_ENDPOINT \
@@ -67,7 +75,11 @@ helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
Sink-specific options can also be passed via `monitoring.sinkOpts`:

```shell
helm install gcm oci://ghcr.io/facebookresearch/charts/gcm \
-f <PATH/TO>/custom-values.yaml \
--namespace <namespace> \
--set monitoring.enabled=true \
--set healthChecks.enabled=false \
--set monitoring.sink=otel \
--set monitoring.cluster=my-cluster \
--set monitoring.sinkOpts[0]=otel_endpoint=http://otel-collector:4318 \