2 changes: 1 addition & 1 deletion README.md
@@ -46,7 +46,7 @@ Facebook has adopted a Code of Conduct that we expect project participants to ad

## The Team

GPU Cluster Monitoring is actively maintained by [Lucca Bertoncini](https://github.com/luccabb), [Caleb Ho](https://github.com/calebho), [Apostolos Kokolis](https://github.com/A-Kokolis), [Liao Hu](https://github.com/L1A0), [Thanh Nguyen](https://github.com/giongto35), [Billy Campoli](https://github.com/tooji) with a number of contributions coming from talented individuals (in no particular order, and non-exhaustive): [Jörg Doku](https://github.com/Jorghi12), [Vivian Peng](https://github.com/vzpeng), [Parth Malani](https://github.com/pmmalani), [Kalyan Saladi](https://github.com/skalyan), [Shubho Sengupta](https://github.com/shubho), [Leo Huang](https://github.com/lifeihuang), [Robert Vincent](https://github.com/bvincent-penguin), [Max Wang](https://github.com/mxw), [Sujit Verma](https://github.com/sujitoc), [Teng Li](https://github.com/teng-li), [James Taylor](https://github.com/jamestaylr), [Xiaodong Ma](https://github.com/xman1979), [Chris Henry](https://github.com/chenry3), [Jakob Johnson](https://github.com/jj10306), [Kareem Sakher](https://github.com/kjsakher), [Abinesh Ramakrishnan](https://github.com/ibanesh), [Nabib Ahmed](https://github.com/nahmed3536), [Yong Li](https://github.com/yonglimeta), [Junjie Qian](https://github.com/junjieqian), [David Watson](https://github.com/davidewatson), [Guanyu Wu](https://github.com/kwu-penguin), [Jaromir Latal](https://github.com/jermenkoo), [Samuel Doud](https://github.com/SamuelDoud), [Yidi Wu](https://github.com/ydwu4), [Xinyuan Zhang](https://github.com/xinyuanzzz), [Neha Saxena](https://github.com/nehasaxena210), [Gustavo Lima](https://github.com/gustcol).
GPU Cluster Monitoring is actively maintained by [Lucca Bertoncini](https://github.com/luccabb), [Caleb Ho](https://github.com/calebho), [Apostolos Kokolis](https://github.com/A-Kokolis), [Liao Hu](https://github.com/L1A0), [Thanh Nguyen](https://github.com/giongto35), [Billy Campoli](https://github.com/tooji) with a number of contributions coming from talented individuals (in no particular order, and non-exhaustive): [Jörg Doku](https://github.com/Jorghi12), [Vivian Peng](https://github.com/vzpeng), [Parth Malani](https://github.com/pmmalani), [Kalyan Saladi](https://github.com/skalyan), [Shubho Sengupta](https://github.com/shubho), [Leo Huang](https://github.com/lifeihuang), [Robert Vincent](https://github.com/bvincent-penguin), [Max Wang](https://github.com/mxw), [Sujit Verma](https://github.com/sujitoc), [Teng Li](https://github.com/teng-li), [James Taylor](https://github.com/jamestaylr), [Xiaodong Ma](https://github.com/xman1979), [Chris Henry](https://github.com/chenry3), [Jakob Johnson](https://github.com/jj10306), [Kareem Sakher](https://github.com/kjsakher), [Abinesh Ramakrishnan](https://github.com/ibanesh), [Nabib Ahmed](https://github.com/nahmed3536), [Yong Li](https://github.com/yonglimeta), [Junjie Qian](https://github.com/junjieqian), [David Watson](https://github.com/davidewatson), [Guanyu Wu](https://github.com/kwu-penguin), [Jaromir Latal](https://github.com/jermenkoo), [Samuel Doud](https://github.com/SamuelDoud), [Yidi Wu](https://github.com/ydwu4), [Xinyuan Zhang](https://github.com/xinyuanzzz), [Neha Saxena](https://github.com/nehasaxena210), [Achintya Paningapalli](https://github.com/theap06), [Gustavo Lima](https://github.com/gustcol).

Feel free to contribute and add your name!

2 changes: 2 additions & 0 deletions dev-requirements.txt


13 changes: 12 additions & 1 deletion gcm/health_checks/cli/health_checks.py
@@ -26,9 +26,20 @@
@feature_flags_config(FeatureValueHealthChecksFeatures)
@toml_config_option("health_checks", default_config_path=DEFAULT_CONFIG_PATH)
@detach_option
@click.option(
    "--backend",
    type=click.Choice(["nvml"]),
    default="nvml",
    show_default=True,
    help="Accelerator backend used by GPU health checks.",
)
@click.version_option(__version__)
def health_checks(detach: bool) -> None:
def health_checks(detach: bool, backend: str) -> None:
    """GPU Cluster Monitoring: Large-Scale AI Research Cluster Monitoring."""
    ctx = click.get_current_context()
    ctx.meta["accelerator_backend"] = backend
    if isinstance(ctx.obj, dict):
        ctx.obj["accelerator_backend"] = backend


list_of_checks: List[click.core.Command] = [
94 changes: 94 additions & 0 deletions gcm/monitoring/accelerator/README.md
@@ -0,0 +1,94 @@
# Accelerator HAL (Python)

This package provides a hardware-agnostic accelerator abstraction for a
Python-first observability codebase.

## Layout

```text
gcm/monitoring/accelerator/
  backend.py      # core interfaces and identity models
  metrics.py      # normalized metrics and capability model
  errors.py       # typed errors for backend operations
  manager.py      # backend orchestration and routing
  probe.py        # dynamic shared library probe helpers
  registry.py     # default backend registration
  backends/
    nvml.py
```

## Design notes

- Backends are discovered and probed at runtime; missing drivers degrade
gracefully.
- Metric output uses a single normalized `MetricSet` type.
- Optional vendor fields remain `None` unless supported by backend capability.
- This design can be implemented directly in Python or backed by Rust/C++
worker processes behind the same backend protocol.
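
As a rough illustration of the first design note, the sketch below shows one way a probe could load a vendor library at runtime and report a reason instead of raising when the driver is missing. The function name `probe_shared_library` and its return shape are assumptions made for this example; they are not taken from `probe.py`.

```python
import ctypes
import ctypes.util
from typing import Optional, Sequence, Tuple


def probe_shared_library(
    candidates: Sequence[str],
) -> Tuple[bool, Optional[str], str]:
    """Try each candidate library name; return (healthy, path, reason).

    Hypothetical helper: a missing library is reported, never raised.
    """
    reason = "no candidate library names given"
    for name in candidates:
        # find_library resolves short names (e.g. "nvidia-ml");
        # fall back to the literal name if it finds nothing.
        resolved = ctypes.util.find_library(name) or name
        try:
            ctypes.CDLL(resolved)
        except OSError as exc:
            # Remember why this candidate failed and try the next one.
            reason = f"could not load {resolved}: {exc}"
            continue
        return True, resolved, "loaded"
    return False, None, reason
```

A caller could copy the returned path and reason into a `ProbeResult`, so an unhealthy backend is skipped rather than aborting collection.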

## Lifecycle

1. Build an `AcceleratorManager` from `default_backend_factories()`.
2. Call `probe_all()` to initialize and retain healthy backends.
3. Call `refresh_devices()` to enumerate backend devices and cache handles.
4. Call `read_all_metrics()` with a `MetricRequest` during each collection loop.
5. Call `close()` on shutdown.
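
The steps above can be sketched with stand-in classes. `FakeBackend` and `SketchManager` below are hypothetical and exist only to show the call order; the real `AcceleratorManager` constructor, the `MetricRequest` argument to `read_all_metrics()`, and the return types may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class FakeDevice:
    backend: str
    id: str


class FakeBackend:
    """Hypothetical in-memory backend used only to exercise the lifecycle."""

    def __init__(self, name: str, healthy: bool) -> None:
        self._name = name
        self._healthy = healthy

    def name(self) -> str:
        return self._name

    def probe(self) -> bool:
        return self._healthy

    def enumerate_devices(self) -> List[FakeDevice]:
        return [FakeDevice(self._name, f"{self._name}:0")]

    def read_metrics(self, device: FakeDevice) -> Dict[str, Optional[int]]:
        # Unsupported fields stay None, mirroring the design notes.
        return {"gpu_util": 32, "mem_util": None}

    def close(self) -> None:
        pass


class SketchManager:
    """Stand-in for AcceleratorManager (assumed shape, not the PR's code)."""

    def __init__(self, factories: List[Callable[[], FakeBackend]]) -> None:
        self._factories = factories
        self._backends: Dict[str, FakeBackend] = {}
        self._devices: List[FakeDevice] = []

    def probe_all(self) -> Dict[str, bool]:
        results: Dict[str, bool] = {}
        for factory in self._factories:
            backend = factory()
            healthy = backend.probe()
            results[backend.name()] = healthy
            if healthy:
                self._backends[backend.name()] = backend  # retain healthy only
            else:
                backend.close()
        return results

    def refresh_devices(self) -> List[FakeDevice]:
        self._devices = [
            d for b in self._backends.values() for d in b.enumerate_devices()
        ]
        return list(self._devices)

    def read_all_metrics(self) -> Dict[str, Dict[str, Optional[int]]]:
        return {
            d.id: self._backends[d.backend].read_metrics(d) for d in self._devices
        }

    def close(self) -> None:
        for backend in self._backends.values():
            backend.close()
        self._backends.clear()
        self._devices.clear()


def run_once() -> Dict[str, Dict[str, Optional[int]]]:
    manager = SketchManager(
        [lambda: FakeBackend("nvml", True), lambda: FakeBackend("missing", False)]
    )
    try:
        manager.probe_all()        # step 2: keep only healthy backends
        manager.refresh_devices()  # step 3: enumerate and cache handles
        return manager.read_all_metrics()  # step 4: one collection pass
    finally:
        manager.close()            # step 5: always release backends
```

Wrapping the collection pass in `try`/`finally` is one way to make sure `close()` runs even if a read raises.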

## Backend authoring guide

- Implement `AcceleratorBackend` methods in `backends/<vendor>.py`.
- `probe()` should only verify runtime readiness and return a clear reason on
failure.
- `enumerate_devices()` should return stable, backend-scoped `DeviceHandle.id`
values.
- `read_metrics()` should map into normalized `MetricSet` fields and avoid
failing the full read when a single metric is unavailable.
- Keep unsupported fields as `None` and gate behavior through `CapabilitySet`.
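
A minimal sketch of the per-metric isolation described above, assuming a simplified metric container. `SketchMetricSet`, `read_metrics_sketch`, and the reader callables are hypothetical names for illustration; the real `MetricSet` fields and the NVML calls are not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class SketchMetricSet:
    """Simplified stand-in for MetricSet; every field is optional."""

    gpu_util: Optional[int] = None
    mem_util: Optional[int] = None
    temperature_c: Optional[int] = None


def _try_read(reader: Callable[[], int]) -> Optional[int]:
    """Return the reading, or None if this single metric is unavailable."""
    try:
        return reader()
    except Exception:
        # A real backend would likely catch its vendor-specific error type.
        return None


def read_metrics_sketch(readers: Dict[str, Callable[[], int]]) -> SketchMetricSet:
    """Read each metric independently so one failure cannot fail the whole read."""
    result = SketchMetricSet()
    for field_name, reader in readers.items():
        setattr(result, field_name, _try_read(reader))
    return result
```

Fields without a reader keep their `None` default, which matches the guidance to leave unsupported fields unset.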

## Scope in this PR

- Includes a functional NVML backend only.
- Keeps the HAL contract/manager generic so additional backends can be added in
follow-up PRs.

## Migration note

- HAL behavior is Python-first to simplify integration and testability.
- If needed later, vendor-specific FFI logic can move into Rust/C++ sidecar
workers without changing the Python HAL interface.

## Test plan

### Full-run commands (with output)

**gcm** (single collection, stdout sink):

```bash
gcm --backend=nvml nvml_monitor --sink=stdout --once --log-folder=/tmp/gcm-log
```

Example output (with NVIDIA GPUs present):

```json
[{"gpu_id": 0, "hostname": "node01", "mem_util": 45, "gpu_util": 32, ...}]
[{"gpu_index": 0, "max_gpu_util": 32, "min_gpu_util": 28, ...}]
```

Without GPUs, the command exits with a `DeviceTelemetryException` (NVML not found).

**health_checks** (nvidia-smi gpu_num check, stdout sink):

```bash
health_checks --backend=nvml check-nvidia-smi fair_cluster nagios --sink=stdout -c gpu_num --gpu_num=0
```

Example output:

```json
[{"node": "node01", "cluster": "fair_cluster", "health_check": "nvidia smi", "type": "nagios", "result": 0, "_msg": "Number of GPUs present is the same as expected, 0", ...}]
```

### Automated tests

- `pytest -q gcm/tests/test_accelerator_hal.py`
- `pytest -q gcm/tests/test_gcm.py -k "backend or full_run"`
37 changes: 37 additions & 0 deletions gcm/monitoring/accelerator/__init__.py
@@ -0,0 +1,37 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
from gcm.monitoring.accelerator.backend import (
    AcceleratorBackend,
    BackendName,
    DeviceHandle,
    ProbeResult,
)
from gcm.monitoring.accelerator.errors import (
    AcceleratorError,
    BackendUnavailableError,
    UnsupportedOperationError,
)
from gcm.monitoring.accelerator.manager import AcceleratorManager
from gcm.monitoring.accelerator.metrics import (
    Capability,
    CapabilitySet,
    MetricRequest,
    MetricSet,
)
from gcm.monitoring.accelerator.registry import default_backend_factories

__all__ = [
    "AcceleratorBackend",
    "AcceleratorError",
    "AcceleratorManager",
    "BackendName",
    "BackendUnavailableError",
    "Capability",
    "CapabilitySet",
    "DeviceHandle",
    "MetricRequest",
    "MetricSet",
    "ProbeResult",
    "UnsupportedOperationError",
    "default_backend_factories",
]
51 changes: 51 additions & 0 deletions gcm/monitoring/accelerator/backend.py
@@ -0,0 +1,51 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum
from typing import Callable, List, Protocol

from gcm.monitoring.accelerator.metrics import CapabilitySet, MetricRequest, MetricSet


class BackendName(str, Enum):
    NVML = "nvml"


@dataclass(frozen=True)
class ProbeResult:
    backend: BackendName
    healthy: bool
    reason: str
    library_path: str | None = None
    driver_version: str | None = None
    probed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


@dataclass(frozen=True)
class DeviceHandle:
    backend: BackendName
    id: str
    vendor: str
    model: str | None = None
    bus_id: str | None = None
    serial: str | None = None


class AcceleratorBackend(Protocol):
    def name(self) -> BackendName: ...

    def probe(self) -> ProbeResult: ...

    def enumerate_devices(self) -> List[DeviceHandle]: ...

    def capabilities(self, device: DeviceHandle) -> CapabilitySet: ...

    def read_metrics(
        self, device: DeviceHandle, request: MetricRequest
    ) -> MetricSet: ...

    def close(self) -> None: ...


BackendFactory = Callable[[], AcceleratorBackend]
2 changes: 2 additions & 0 deletions gcm/monitoring/accelerator/backends/__init__.py
@@ -0,0 +1,2 @@
# Copyright (c) Meta Platforms, Inc. and affiliates.
# All rights reserved.