facebookresearch · theap06 · Mar 11, 2026 · Mar 12, 2026
@@ -46,7 +46,7 @@ Facebook has adopted a Code of Conduct that we expect project participants to ad
 
 ## The Team
 
-GPU Cluster Monitoring is actively maintained by [Lucca Bertoncini](https://github.com/luccabb), [Caleb Ho](https://github.com/calebho), [Apostolos Kokolis](https://github.com/A-Kokolis), [Liao Hu](https://github.com/L1A0), [Thanh Nguyen](https://github.com/giongto35), [Billy Campoli](https://github.com/tooji) with a number of contributions coming from talented individuals (in no particular order, and non-exhaustive): [Jörg Doku](https://github.com/Jorghi12), [Vivian Peng](https://github.com/vzpeng), [Parth Malani](https://github.com/pmmalani), [Kalyan Saladi](https://github.com/skalyan), [Shubho Sengupta](https://github.com/shubho), [Leo Huang](https://github.com/lifeihuang), [Robert Vincent](https://github.com/bvincent-penguin), [Max Wang](https://github.com/mxw), [Sujit Verma](https://github.com/sujitoc), [Teng Li](https://github.com/teng-li), [James Taylor](https://github.com/jamestaylr), [Xiaodong Ma](https://github.com/xman1979), [Chris Henry](https://github.com/chenry3), [Jakob Johnson](https://github.com/jj10306), [Kareem Sakher](https://github.com/kjsakher), [Abinesh Ramakrishnan](https://github.com/ibanesh), [Nabib Ahmed](https://github.com/nahmed3536), [Yong Li](https://github.com/yonglimeta), [Junjie Qian](https://github.com/junjieqian), [David Watson](https://github.com/davidewatson), [Guanyu Wu](https://github.com/kwu-penguin), [Jaromir Latal](https://github.com/jermenkoo), [Samuel Doud](https://github.com/SamuelDoud), [Yidi Wu](https://github.com/ydwu4), [Xinyuan Zhang](https://github.com/xinyuanzzz), [Neha Saxena](https://github.com/nehasaxena210), [Gustavo Lima](https://github.com/gustcol).
+GPU Cluster Monitoring is actively maintained by [Lucca Bertoncini](https://github.com/luccabb), [Caleb Ho](https://github.com/calebho), [Apostolos Kokolis](https://github.com/A-Kokolis), [Liao Hu](https://github.com/L1A0), [Thanh Nguyen](https://github.com/giongto35), [Billy Campoli](https://github.com/tooji) with a number of contributions coming from talented individuals (in no particular order, and non-exhaustive): [Jörg Doku](https://github.com/Jorghi12), [Vivian Peng](https://github.com/vzpeng), [Parth Malani](https://github.com/pmmalani), [Kalyan Saladi](https://github.com/skalyan), [Shubho Sengupta](https://github.com/shubho), [Leo Huang](https://github.com/lifeihuang), [Robert Vincent](https://github.com/bvincent-penguin), [Max Wang](https://github.com/mxw), [Sujit Verma](https://github.com/sujitoc), [Teng Li](https://github.com/teng-li), [James Taylor](https://github.com/jamestaylr), [Xiaodong Ma](https://github.com/xman1979), [Chris Henry](https://github.com/chenry3), [Jakob Johnson](https://github.com/jj10306), [Kareem Sakher](https://github.com/kjsakher), [Abinesh Ramakrishnan](https://github.com/ibanesh), [Nabib Ahmed](https://github.com/nahmed3536), [Yong Li](https://github.com/yonglimeta), [Junjie Qian](https://github.com/junjieqian), [David Watson](https://github.com/davidewatson), [Guanyu Wu](https://github.com/kwu-penguin), [Jaromir Latal](https://github.com/jermenkoo), [Samuel Doud](https://github.com/SamuelDoud), [Yidi Wu](https://github.com/ydwu4), [Xinyuan Zhang](https://github.com/xinyuanzzz), [Neha Saxena](https://github.com/nehasaxena210), [Achintya Paningapalli](https://github.com/theap06), [Gustavo Lima](https://github.com/gustcol).
 
 Feel free to contribute and add your name!
 

@@ -26,9 +26,20 @@
 @feature_flags_config(FeatureValueHealthChecksFeatures)
 @toml_config_option("health_checks", default_config_path=DEFAULT_CONFIG_PATH)
 @detach_option
+@click.option(
+    "--backend",
+    type=click.Choice(["nvml"]),
+    default="nvml",
+    show_default=True,
+    help="Accelerator backend used by GPU health checks.",
+)
 @click.version_option(__version__)
-def health_checks(detach: bool) -> None:
+def health_checks(detach: bool, backend: str) -> None:
     """GPU Cluster Monitoring: Large-Scale AI Research Cluster Monitoring."""
+    ctx = click.get_current_context()
+    ctx.meta["accelerator_backend"] = backend
+    if isinstance(ctx.obj, dict):
+        ctx.obj["accelerator_backend"] = backend
 
 
 list_of_checks: List[click.core.Command] = [

@@ -0,0 +1,94 @@
+# Accelerator HAL (Python)
+
+This package provides a hardware-agnostic accelerator abstraction for a
+Python-first observability codebase.
+
+## Layout
+
+```text
+gcm/monitoring/accelerator/
+  backend.py                   # core interfaces and identity models
+  metrics.py                   # normalized metrics and capability model
+  errors.py                    # typed errors for backend operations
+  manager.py                   # backend orchestration and routing
+  probe.py                     # dynamic shared library probe helpers
+  registry.py                  # default backend registration
+  backends/
+    nvml.py
+```
+
+## Design notes
+
+- Backends are discovered and probed at runtime; when a driver is missing, the
+  HAL degrades gracefully.
+- Metric output uses a single normalized `MetricSet` type.
+- Optional vendor fields remain `None` unless the backend reports the capability.
+- This design can be implemented directly in Python or backed by Rust/C++
+  worker processes behind the same backend protocol.
+
+## Lifecycle
+
+1. Build an `AcceleratorManager` from `default_backend_factories()`.
+2. Call `probe_all()` to initialize and retain healthy backends.
+3. Call `refresh_devices()` to enumerate backend devices and cache handles.
+4. Call `read_all_metrics()` with a `MetricRequest` during each collection loop.
+5. Call `close()` on shutdown.
+
+## Backend authoring guide
+
+- Implement `AcceleratorBackend` methods in `backends/<vendor>.py`.
+- `probe()` should only verify runtime readiness and return a clear reason on
+  failure.
+- `enumerate_devices()` should return stable, backend-scoped `DeviceHandle.id`
+  values.
+- `read_metrics()` should map into normalized `MetricSet` fields and avoid
+  failing the full read when a single metric is unavailable.
+- Keep unsupported fields as `None` and gate behavior through `CapabilitySet`.
+
+## Scope in this PR
+
+- Includes a functional NVML backend only.
+- Keeps the HAL contract/manager generic so additional backends can be added in
+  follow-up PRs.
+
+## Migration note
+
+- HAL behavior is Python-first to simplify integration and testability.
+- If needed later, vendor-specific FFI logic can move into Rust/C++ sidecar
+  workers without changing the Python HAL interface.
+
+## Test plan
+
+### Full-run commands (with output)
+
+**gcm** (single collection, stdout sink):
+
+```bash
+gcm --backend=nvml nvml_monitor --sink=stdout --once --log-folder=/tmp/gcm-log
+```
+
+Example output (with NVIDIA GPUs present):
+
+```json
+[{"gpu_id": 0, "hostname": "node01", "mem_util": 45, "gpu_util": 32, ...}]
+[{"gpu_index": 0, "max_gpu_util": 32, "min_gpu_util": 28, ...}]
+```
+
+Without GPUs, the command exits with `DeviceTelemetryException` (NVML not found).
+
+**health_checks** (nvidia-smi gpu_num check, stdout sink):
+
+```bash
+health_checks --backend=nvml check-nvidia-smi fair_cluster nagios --sink=stdout -c gpu_num --gpu_num=0
+```
+
+Example output:
+
+```json
+[{"node": "node01", "cluster": "fair_cluster", "health_check": "nvidia smi", "type": "nagios", "result": 0, "_msg": "Number of GPUs present is the same as expected, 0", ...}]
+```
+
+### Automated tests
+
+- `pytest -q gcm/tests/test_accelerator_hal.py`
+- `pytest -q gcm/tests/test_gcm.py -k "backend or full_run"`
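
The README's authoring guide asks that `probe()` only verify runtime readiness and return a clear reason on failure. A minimal sketch of that idea using only the standard library follows. `LibraryProbe`, `probe_shared_library`, and the library name are hypothetical and are not taken from `probe.py`, whose contents are not shown in this diff.

```python
import ctypes
import ctypes.util
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class LibraryProbe:
    """Hypothetical, trimmed analogue of ProbeResult for this sketch."""

    healthy: bool
    reason: str
    library_path: Optional[str] = None


def probe_shared_library(name: str) -> LibraryProbe:
    """Check that a shared library can be located and loaded.

    The function never calls into the library; it only reports readiness
    and, on failure, a human-readable reason.
    """
    path = ctypes.util.find_library(name)
    if path is None:
        return LibraryProbe(False, f"shared library '{name}' not found on this host")
    try:
        ctypes.CDLL(path)
    except OSError as exc:
        return LibraryProbe(False, f"failed to load '{path}': {exc}", path)
    return LibraryProbe(True, "ok", path)


# A deliberately nonexistent library name (an assumption for illustration).
missing = probe_shared_library("gcm_sketch_no_such_library")
print(missing.healthy, missing.reason)
```

Returning a result object instead of raising lets the caller decide whether an unavailable backend is fatal or merely skipped.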
@@ -0,0 +1,37 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+from gcm.monitoring.accelerator.backend import (
+    AcceleratorBackend,
+    BackendName,
+    DeviceHandle,
+    ProbeResult,
+)
+from gcm.monitoring.accelerator.errors import (
+    AcceleratorError,
+    BackendUnavailableError,
+    UnsupportedOperationError,
+)
+from gcm.monitoring.accelerator.manager import AcceleratorManager
+from gcm.monitoring.accelerator.metrics import (
+    Capability,
+    CapabilitySet,
+    MetricRequest,
+    MetricSet,
+)
+from gcm.monitoring.accelerator.registry import default_backend_factories
+
+__all__ = [
+    "AcceleratorBackend",
+    "AcceleratorError",
+    "AcceleratorManager",
+    "BackendName",
+    "BackendUnavailableError",
+    "Capability",
+    "CapabilitySet",
+    "DeviceHandle",
+    "MetricRequest",
+    "MetricSet",
+    "ProbeResult",
+    "UnsupportedOperationError",
+    "default_backend_factories",
+]
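
The exported names above map onto the five lifecycle steps listed in the README. Since `manager.py` and `registry.py` are not part of this excerpt, the sketch below uses stand-in classes (`ToyBackend`, `ToyManager`, `ToyProbeResult`, all hypothetical) that only borrow the method names from the README; the real signatures and return types may differ.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class ToyProbeResult:
    healthy: bool
    reason: str


class ToyBackend:
    """Hypothetical in-memory backend; not the PR's NVML backend."""

    def __init__(self, name: str, healthy: bool, devices: List[str]) -> None:
        self._name = name
        self._healthy = healthy
        self._devices = devices
        self.closed = False

    def name(self) -> str:
        return self._name

    def probe(self) -> ToyProbeResult:
        reason = "ok" if self._healthy else "driver library not found"
        return ToyProbeResult(self._healthy, reason)

    def enumerate_devices(self) -> List[str]:
        return list(self._devices)

    def read_metrics(
        self, device: str, request: List[str]
    ) -> Dict[str, Optional[float]]:
        # The toy backend supports nothing, so every requested field is None.
        return {field_name: None for field_name in request}

    def close(self) -> None:
        self.closed = True


class ToyManager:
    """Stand-in for AcceleratorManager that follows the README steps."""

    def __init__(self, factories: List[Callable[[], ToyBackend]]) -> None:
        self._candidates = [factory() for factory in factories]
        self._healthy: List[ToyBackend] = []
        self._devices: Dict[str, List[str]] = {}

    def probe_all(self) -> Dict[str, ToyProbeResult]:
        # Step 2: probe every backend and retain only the healthy ones.
        results = {b.name(): b.probe() for b in self._candidates}
        self._healthy = [b for b in self._candidates if results[b.name()].healthy]
        return results

    def refresh_devices(self) -> None:
        # Step 3: enumerate devices per healthy backend and cache the handles.
        self._devices = {b.name(): b.enumerate_devices() for b in self._healthy}

    def read_all_metrics(
        self, request: List[str]
    ) -> Dict[Tuple[str, str], Dict[str, Optional[float]]]:
        # Step 4: one read per cached device, keyed by (backend, device).
        return {
            (b.name(), device): b.read_metrics(device, request)
            for b in self._healthy
            for device in self._devices.get(b.name(), [])
        }

    def close(self) -> None:
        # Step 5: release every backend, healthy or not.
        for b in self._candidates:
            b.close()


# Step 1: build the manager from a list of zero-argument factories.
manager = ToyManager(
    [
        lambda: ToyBackend("nvml", True, ["gpu0", "gpu1"]),
        lambda: ToyBackend("absent", False, []),
    ]
)
probes = manager.probe_all()
manager.refresh_devices()
metrics = manager.read_all_metrics(["gpu_util", "mem_util"])
manager.close()
```

In this sketch the unhealthy backend stays in the probe results, so its failure reason can be logged, but it contributes no devices or metrics.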
@@ -0,0 +1,51 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.
+from dataclasses import dataclass, field
+from datetime import datetime, timezone
+from enum import Enum
+from typing import Callable, List, Protocol
+
+from gcm.monitoring.accelerator.metrics import CapabilitySet, MetricRequest, MetricSet
+
+
+class BackendName(str, Enum):
+    NVML = "nvml"
+
+
+@dataclass(frozen=True)
+class ProbeResult:
+    backend: BackendName
+    healthy: bool
+    reason: str
+    library_path: str | None = None
+    driver_version: str | None = None
+    probed_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
+
+
+@dataclass(frozen=True)
+class DeviceHandle:
+    backend: BackendName
+    id: str
+    vendor: str
+    model: str | None = None
+    bus_id: str | None = None
+    serial: str | None = None
+
+
+class AcceleratorBackend(Protocol):
+    def name(self) -> BackendName: ...
+
+    def probe(self) -> ProbeResult: ...
+
+    def enumerate_devices(self) -> List[DeviceHandle]: ...
+
+    def capabilities(self, device: DeviceHandle) -> CapabilitySet: ...
+
+    def read_metrics(
+        self, device: DeviceHandle, request: MetricRequest
+    ) -> MetricSet: ...
+
+    def close(self) -> None: ...
+
+
+BackendFactory = Callable[[], AcceleratorBackend]
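
Because `AcceleratorBackend` is a `typing.Protocol`, a backend conforms structurally and does not need to inherit from it. The sketch below illustrates that with a trimmed protocol and a stub backend, and also shows one way to follow the README guidance that a single unavailable metric should not fail the full read. `SketchDevice`, `SketchMetrics`, `SketchBackend`, `StubBackend`, and the metric field names are hypothetical; the real `MetricSet` fields live in `metrics.py`, which is not shown here.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Protocol


@dataclass(frozen=True)
class SketchDevice:
    """Trimmed stand-in for DeviceHandle."""

    id: str
    vendor: str


@dataclass(frozen=True)
class SketchMetrics:
    """Hypothetical stand-in for MetricSet; unsupported fields stay None."""

    gpu_util: Optional[float] = None
    mem_util: Optional[float] = None
    power_watts: Optional[float] = None


class SketchBackend(Protocol):
    """Trimmed stand-in for the AcceleratorBackend protocol."""

    def enumerate_devices(self) -> List[SketchDevice]: ...

    def read_metrics(self, device: SketchDevice) -> SketchMetrics: ...

    def close(self) -> None: ...


class StubBackend:
    """Satisfies SketchBackend structurally; no inheritance is required."""

    def __init__(self, readers: Dict[str, Callable[[SketchDevice], float]]) -> None:
        self._readers = readers

    def enumerate_devices(self) -> List[SketchDevice]:
        return [SketchDevice(id="stub:0", vendor="example")]

    def read_metrics(self, device: SketchDevice) -> SketchMetrics:
        values: Dict[str, Optional[float]] = {}
        for field_name, reader in self._readers.items():
            try:
                values[field_name] = reader(device)
            except Exception:
                # Broad catch for brevity in this sketch: a failing metric is
                # recorded as None instead of aborting the whole read.
                values[field_name] = None
        return SketchMetrics(**values)

    def close(self) -> None:
        self._readers = {}


def failing_power_reader(device: SketchDevice) -> float:
    raise RuntimeError("power sensor unavailable")


def collect(backend: SketchBackend) -> List[SketchMetrics]:
    """Accepts any object with the protocol's methods."""
    return [backend.read_metrics(d) for d in backend.enumerate_devices()]


backend = StubBackend(
    {
        "gpu_util": lambda device: 32.0,
        "power_watts": failing_power_reader,
    }
)
samples = collect(backend)
backend.close()
```

A production backend would likely catch a narrower, vendor-specific exception type and record why the field is missing.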
@@ -0,0 +1,2 @@
+# Copyright (c) Meta Platforms, Inc. and affiliates.
+# All rights reserved.