
add python accelerator HAL for multi-vendor backends #86

Open

theap06 wants to merge 6 commits into facebookresearch:main from theap06:feat/accelerator-hal-upstream-clean

Conversation

@theap06 (Contributor) commented Mar 3, 2026

Introduce a hardware-agnostic accelerator abstraction layer with normalized metrics, backend management, and runtime probing. Includes a functional NVML backend plus ROCm/LevelZero/TPU/Neuron scaffolds and dedicated HAL tests.

- Adds a Python-first, hardware-agnostic accelerator HAL at gcm/monitoring/accelerator.
- Decouples telemetry collection from NVML-only assumptions via a common backend interface and normalized metrics.
- Implements a functional NVMLBackend; adds scaffold backends for ROCm, LevelZero, TPU, and Neuron.

Implements Feature Request #74
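
For context, a minimal sketch of the interface shape this introduces (illustrative only; the field and method names below are simplified stand-ins, not the exact definitions in gcm/monitoring/accelerator):

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class NormalizedMetrics:
    # Vendor-neutral snapshot for one device (fields are illustrative).
    device_index: int
    gpu_util: float
    mem_util: float
    temperature: float
    power_draw: float


class AcceleratorBackend(Protocol):
    # Common contract every vendor backend (NVML, ROCm, ...) implements.
    def probe(self) -> bool: ...  # is this vendor's library/hardware present?
    def enumerate_devices(self) -> list[int]: ...
    def read_metrics(self) -> list[NormalizedMetrics]: ...
    def close(self) -> None: ...
```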

Test Plan:

Ran HAL tests (12 passed):

pytest -q gcm/tests/test_accelerator_hal.py

gcm --backend=nvml nvml_monitor --sink=stdout --once --log-folder=/tmp/gcm-log

[{"gpu_id": 0, "hostname": "129-159-36-52", "mem_util": 0, "gpu_util": 0, "mem_used_percent": 0, "temperature": 45, "power_draw": 25, "power_used_percent": 5, "retired_pages_count_single_bit": 0, "retired_pages_count_double_bit": 0, "job_id": null, "job_user": null, ...}]
[{"gpu_index": 0, "mem_util": 0, "mem_used_percent": 0, "gpu_util": 0, "temperature": 45, "power_draw": 25, "power_used_percent": 5, "retired_pages_count_single_bit": 0, "retired_pages_count_double_bit": 0}, {"max_gpu_util": 0.0, "min_gpu_util": 0.0, "avg_gpu_util": 0.0, "ram_util": 0.42}]

echo $? 
Output: 0

health_checks --backend=nvml check-nvidia-smi fair_cluster nagios --sink=stdout -c gpu_num --gpu_num=0

[{"node": "129-159-36-52", "cluster": "fair_cluster", "derived_cluster": "fair_cluster", "health_check": "nvidia smi", "type": "nagios", "result": 0, "_msg": "Number of GPUs present is the same as expected, 0", "job_id": 0, "1772265789.8423011": <float>, "1772265789.8663087": <float>}]

echo $? 
Output: 0

@github-actions bot commented Mar 3, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes — a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

@theap06 theap06 marked this pull request as draft March 3, 2026 06:35
@theap06 theap06 marked this pull request as ready for review March 3, 2026 07:05
@theap06 (Contributor, Author) commented Mar 4, 2026

Could I get some feedback? This tackles my feature request (#74) on supporting other hardware.

@luccabb (Member) left a comment

> adds scaffold backends for ROCm, LevelZero, TPU, and Neuron

Let's just keep NVML since it's the only backend supported in this PR.

@theap06 theap06 requested a review from luccabb March 4, 2026 08:33
@luccabb (Member) left a comment

Add some integration to existing paths, then run the commands before and after these changes to prove it's all working as expected:

gcm --backend=nvml --sink=stdout ...
healthchecks ... --backend=nvml --sink=stdout ...

Add the above to the test plan.

@theap06 theap06 requested a review from A-Kokolis as a code owner March 4, 2026 09:00
@theap06 theap06 requested a review from luccabb March 4, 2026 09:01
@luccabb (Member) commented Mar 4, 2026

> gcm --backend=nvml --sink=stdout ...
> healthchecks ... --backend=nvml --sink=stdout ...

This should be a full run of each command with its corresponding output.

@theap06 theap06 requested a review from luccabb March 4, 2026 19:12
@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from 44b06c0 to 49fa293 on March 4, 2026 19:25
@luccabb (Member) left a comment

> gcm --backend=nvml --sink=stdout ...
> healthchecks ... --backend=nvml --sink=stdout ...

This should be a full run of each command with its corresponding output.

@theap06 theap06 requested a review from luccabb March 4, 2026 22:26
@theap06 (Contributor, Author) commented Mar 8, 2026

@luccabb The hardware backend works, and the functionality works on my end with a GPU instance. Let me know if you have any other feedback.

@luccabb (Member) left a comment

@theap06 I meant you should add the real CLI calls and their outputs to the test plan, run from inside your environment.

Similar to what you did for #61.

@theap06 (Contributor, Author) commented Mar 9, 2026

> @theap06 I meant you should add the real CLI calls and their outputs to the test plan, run from inside your environment.
>
> Similar to what you did for #61.

Done.

@theap06 theap06 requested a review from luccabb March 9, 2026 02:10
@theap06 (Contributor, Author) commented Mar 10, 2026

@luccabb Could I get some feedback? I added the HAL testing to the PR message.

@luccabb (Member) commented Mar 11, 2026

@claude can you review this PR?

@claude bot commented Mar 11, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

@luccabb (Member) commented Mar 11, 2026

@claude can you review this PR?

@claude bot commented Mar 11, 2026

Claude finished @luccabb's task in 1m 52s — View job


PR Review: add python accelerator HAL for multi-vendor backends

  • Gather context and understand request
  • Read CLAUDE.md for repo guidelines
  • Read changed files
  • Provide comprehensive review

Overall, this is a clean, well-structured HAL abstraction. The module layout, error hierarchy, Protocol-based interface, and test coverage are all solid. A few issues need to be addressed before merging.


Critical Issues

1. --backend option is accepted but never used

In both gcm/monitoring/cli/gcm.py:43 and gcm/health_checks/cli/health_checks.py:39, the backend parameter is immediately discarded:

def main(detach: bool, backend: str) -> None:
    _ = backend   # ← never forwarded anywhere

The option appears in --help and is validated by Click, but it does nothing. The new AcceleratorManager/NVMLBackend is never instantiated from the CLI path. The nvml_monitor subcommand still goes directly through the old DeviceTelemetryClient. Until this wiring is added, the HAL is not actually used at runtime. Fix this →
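
One possible shape for the wiring (a sketch using Click's standard context-object pattern; the repo's actual option names and group structure may differ):

```python
import click


@click.group()
@click.option("--backend", default="nvml", help="Accelerator backend to use.")
@click.pass_context
def main(ctx: click.Context, backend: str) -> None:
    # Stash the selection on the Click context so subcommands can read it,
    # instead of discarding it with `_ = backend`.
    ctx.ensure_object(dict)
    ctx.obj["accelerator_backend"] = backend
```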


2. get_vbios_version() used to populate the model field — semantically wrong

In gcm/monitoring/accelerator/backends/nvml.py:92:

model = client.get_device_by_index(index).get_vbios_version()

get_vbios_version() returns a firmware version string (e.g. "90.00.38.00.09"), not the GPU model name (e.g. "A100 80GB"). The DeviceHandle.model field implies a human-readable hardware model. This should use get_name() or similar if available on the client. Fix this →
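
A sketch of one possible fix, assuming the handle exposes a get_name() accessor (verify against the actual client API before relying on it):

```python
handle = client.get_device_by_index(index)
# Prefer the human-readable product name; only fall back to the VBIOS
# string if this client has no name accessor at all.
get_name = getattr(handle, "get_name", None)
model = get_name() if callable(get_name) else handle.get_vbios_version()
```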


Minor Issues

3. NVMLBackend.close() does not close the underlying client

In gcm/monitoring/accelerator/backends/nvml.py:174-176:

def close(self) -> None:
    self._client = None   # ← just drops the reference; no shutdown call
    return None

If DeviceTelemetryClient holds resources (e.g. NVML context), this leaks them. The close should call self._client.close() (or equivalent) before clearing the reference. Fix this →
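
A sketch of a safer close(), assuming the client exposes a close()-style shutdown method (the exact name may differ):

```python
def close(self) -> None:
    # Shut down the underlying client (e.g. the NVML context) before
    # dropping the reference, so resources are actually released.
    if self._client is not None:
        close_method = getattr(self._client, "close", None)
        if callable(close_method):
            close_method()
    self._client = None
```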

4. BackendOperationError stores cause redundantly

In gcm/monitoring/accelerator/errors.py:20-27, cause: Exception is stored as a dataclass field and set as the chained exception via raise ... from e at the call sites. This means the exception appears twice. The __str__ method also manually formats it. Consider dropping the explicit cause field and relying solely on Python's built-in exception chaining (__cause__).
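
For illustration, a minimal sketch of relying on built-in chaining alone (names simplified):

```python
class BackendOperationError(Exception):
    # No explicit `cause` field: `raise ... from e` already records the
    # original error on __cause__, and tracebacks render it automatically.
    def __init__(self, backend: str, operation: str) -> None:
        super().__init__(f"{backend}.{operation} failed")
        self.backend = backend
        self.operation = operation


def read_metrics() -> None:
    try:
        raise OSError("NVML library not loaded")  # stand-in for a real failure
    except OSError as e:
        raise BackendOperationError("nvml", "read_metrics") from e
```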

5. NVMLBackend subclasses AcceleratorBackend (a Protocol)

In gcm/monitoring/accelerator/backends/nvml.py:51:

@dataclass
class NVMLBackend(AcceleratorBackend):

AcceleratorBackend is defined as a Protocol. You don't need to explicitly inherit from a Protocol to satisfy it — you just need to implement the methods. Subclassing a Protocol creates a coupling that isn't needed and is non-idiomatic. NVMLBackend should just be a plain class (or dataclass) that happens to implement all the AcceleratorBackend methods.
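
A minimal illustration of the structural-typing point (names simplified):

```python
from typing import Protocol


class AcceleratorBackend(Protocol):
    def probe(self) -> bool: ...


class NVMLBackend:  # no inheritance from the Protocol needed
    def probe(self) -> bool:
        return True


def start(backend: AcceleratorBackend) -> bool:
    return backend.probe()


start(NVMLBackend())  # type-checks: NVMLBackend satisfies the Protocol structurally
```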


Observations / Non-blocking

  • MetricRequest.include_process_info is ignored in read_metrics: The _request parameter in NVMLBackend.read_metrics is never read. If process info collection is planned, a TODO comment would clarify intent.
  • probe() is called before enumerate_devices() in tests but not enforced: Nothing stops a caller from calling enumerate_devices() on an unprobed backend (the client will be lazily initialized, which may raise). A guard or documented contract would help.
  • Test coverage is good — the 12 HAL unit tests cover normal operation, partial failures, missing library, fallback paths, manager lifecycle, and registry. The new integration tests in test_gcm.py gracefully handle no-GPU environments.
  • README is clear and well-written — the lifecycle section and backend authoring guide are helpful.

Summary

The HAL design is solid and the tests are thorough. The main issue is that the --backend flag is not wired to the HAL — the new code is tested in isolation but not yet integrated into the runtime monitoring path, which undermines the core goal of the PR. The model field bug and close() resource leak should also be fixed. Happy to see this land once those are addressed.

@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from 33f4abe to f312b33 on March 11, 2026 23:23
@theap06 (Contributor, Author) commented Mar 11, 2026

@claude can I get feedback?

@theap06 (Contributor, Author) commented Mar 12, 2026

@luccabb I think it should be ready for prod

@luccabb (Member) commented Mar 12, 2026

@claude

@claude bot commented Mar 12, 2026

Claude finished @luccabb's task in 1m 29s — View job


PR Re-Review: add python accelerator HAL for multi-vendor backends

  • Gather context
  • Read changed files to check if previous issues were addressed
  • Provide updated review

All five issues from the previous review have been addressed. Here's the status:

Previously Critical — Now Fixed

  1. --backend is now wired
    gcm.py:46 stores the value in ctx.obj["accelerator_backend"], and nvml_monitor.py:77-98 reads it via _selected_backend_name() which is used to drive _probe_selected_backend(). The HAL is now actually exercised at startup.

    One note: the monitoring loop itself (nvml_monitor.py:367) still collects metrics via obj.get_device_telemetry() → NVMLDeviceTelemetryClient() directly, not through NVMLBackend.read_metrics(). The --backend flag now validates availability but doesn't route the data collection path through the HAL. This is fine as an initial integration — just worth documenting as a follow-up.

  2. get_vbios_version() → get_name() bug fixed
    nvml.py:87-91 now uses getattr(handle, "get_name", None) with a callable/type check before assigning model. Clean.

Previously Minor — Now Fixed

  1. close() resource leak fixed ✓ — nvml.py:178-180 now calls close_method() if available before clearing the reference.

  2. BackendOperationError redundant cause field removed ✓ — errors.py now only holds backend and operation; no duplicate storage.

  3. NVMLBackend no longer subclasses AcceleratorBackend Protocol ✓ — nvml.py:46 is now a plain @dataclass.


One remaining minor issue

In health_checks.py:40-41:

if isinstance(ctx.obj, dict):
    ctx.obj["accelerator_backend"] = backend

Compare to gcm.py:44-46 which ensures the dict exists first:

if not isinstance(ctx.obj, dict):
    ctx.obj = {}
ctx.obj["accelerator_backend"] = backend

If ctx.obj is not a dict when health_checks runs (e.g. not yet initialized by a prior decorator), the backend value is silently dropped. This is unlikely to cause a real bug today since DaemonGroup initializes context, but the asymmetry is fragile. The health_checks pattern should match gcm.py's defensive style. Fix this →


Summary: All critical and minor issues from the prior review are fixed. The remaining item above is a small defensive programming issue in health_checks.py. The HAL design, tests, and integration are solid. This is close to merge-ready once that conditional assignment is made consistent.

@theap06 (Contributor, Author) commented Mar 12, 2026

@claude could I get some feedback on this PR?
