
add python accelerator HAL for multi-vendor backends #86

Open

theap06 wants to merge 6 commits into facebookresearch:main from theap06:feat/accelerator-hal-upstream-clean

Conversation

@theap06 (Contributor) commented Mar 3, 2026

Introduce a hardware-agnostic accelerator abstraction layer with normalized metrics, backend management, and runtime probing. Includes a functional NVML backend plus ROCm/LevelZero/TPU/Neuron scaffolds and dedicated HAL tests.

- Adds a Python-first, hardware-agnostic accelerator HAL at gcm/monitoring/accelerator.
- Decouples telemetry collection from NVML-only assumptions via a common backend interface and normalized metrics.
- Implements a functional NVMLBackend; adds scaffold backends for ROCm, LevelZero, TPU, and Neuron.

Implements Feature Request #74
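
For context, a minimal sketch of the interface shape this introduces (illustrative only; the field and method names below are simplified stand-ins, not the exact definitions in gcm/monitoring/accelerator):

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class NormalizedMetrics:
    # Vendor-neutral snapshot for one device (fields are illustrative).
    device_index: int
    gpu_util: float
    mem_util: float
    temperature: float
    power_draw: float


class AcceleratorBackend(Protocol):
    # Common contract every vendor backend (NVML, ROCm, ...) implements.
    def probe(self) -> bool: ...  # is this vendor's library/hardware present?
    def enumerate_devices(self) -> list[int]: ...
    def read_metrics(self) -> list[NormalizedMetrics]: ...
    def close(self) -> None: ...
```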

Test Plan:

Ran HAL tests (12 passed):

pytest -q gcm/tests/test_accelerator_hal.py

gcm --backend=nvml nvml_monitor --sink=stdout --once --log-folder=/tmp/gcm-log

[{"gpu_id": 0, "hostname": "129-159-36-52", "mem_util": 0, "gpu_util": 0, "mem_used_percent": 0, "temperature": 45, "power_draw": 25, "power_used_percent": 5, "retired_pages_count_single_bit": 0, "retired_pages_count_double_bit": 0, "job_id": null, "job_user": null, ...}]
[{"gpu_index": 0, "mem_util": 0, "mem_used_percent": 0, "gpu_util": 0, "temperature": 45, "power_draw": 25, "power_used_percent": 5, "retired_pages_count_single_bit": 0, "retired_pages_count_double_bit": 0}, {"max_gpu_util": 0.0, "min_gpu_util": 0.0, "avg_gpu_util": 0.0, "ram_util": 0.42}]

echo $? 
Output: 0

health_checks --backend=nvml check-nvidia-smi fair_cluster nagios --sink=stdout -c gpu_num --gpu_num=0

[{"node": "129-159-36-52", "cluster": "fair_cluster", "derived_cluster": "fair_cluster", "health_check": "nvidia smi", "type": "nagios", "result": 0, "_msg": "Number of GPUs present is the same as expected, 0", "job_id": 0, "1772265789.8423011": <float>, "1772265789.8663087": <float>}]

echo $? 
Output: 0

@github-actions bot commented Mar 3, 2026

CI Commands

The following CI workflows run automatically on every push and pull request:

| Workflow | What it runs |
| --- | --- |
| GPU Cluster Monitoring Python CI | lint, tests, typecheck, format, deb build, pyoxidizer builds |
| Go packages CI | shelper tests, format, lint |

The following commands can be used by maintainers to trigger additional tests that require access to secrets:

| Command | Description | Requires approval? |
| --- | --- | --- |
| /metaci tests | Runs Meta internal integration tests (pytest) | Yes — a maintainer must trigger the command and approve the deployment request |
| /metaci integration tests | Same as above (alias) | Yes |

Note: Only repository maintainers (OWNER association) can trigger /metaci commands. After commenting the command, a maintainer must also navigate to the Actions tab and approve the deployment to the graph-api-access environment before the jobs will run. See the approval guidelines for what to approve or reject.

@theap06 theap06 marked this pull request as draft March 3, 2026 06:35
@theap06 theap06 marked this pull request as ready for review March 3, 2026 07:05
@theap06 (Contributor, Author) commented Mar 4, 2026

Could I get some feedback? This tackles my feature request (#74) on supporting other hardware.

@luccabb (Member) left a comment

> adds scaffold backends for ROCm, LevelZero, TPU, and Neuron

Let's just keep NVML since it's the only backend supported in this PR.

@theap06 theap06 requested a review from luccabb March 4, 2026 08:33
@luccabb (Member) left a comment

Add some integration to existing paths, then run the commands before and after these changes to prove it's all working as expected:

gcm --backend=nvml --sink=stdout ...
healthchecks ... --backend=nvml --sink=stdout ...

Add the above to the test plan.

@theap06 theap06 requested a review from A-Kokolis as a code owner March 4, 2026 09:00
@theap06 theap06 requested a review from luccabb March 4, 2026 09:01
@luccabb (Member) commented Mar 4, 2026

> gcm --backend=nvml --sink=stdout ...
> healthchecks ... --backend=nvml --sink=stdout ...

This should be a full run of each command with its corresponding output.

@theap06 theap06 requested a review from luccabb March 4, 2026 19:12
@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from 44b06c0 to 49fa293 on March 4, 2026 19:25
@luccabb (Member) left a comment

> gcm --backend=nvml --sink=stdout ...
> healthchecks ... --backend=nvml --sink=stdout ...

This should be a full run of each command with its corresponding output.

@theap06 theap06 requested a review from luccabb March 4, 2026 22:26
@theap06 (Contributor, Author) commented Mar 8, 2026

@luccabb The hardware backend works, and the functionality works on my end with a GPU instance. Let me know if you have any other feedback.

@luccabb (Member) left a comment

@theap06 I meant you should add the real CLI calls and their outputs to the test plan, run from inside your environment.

Similar to what you did for #61.

@theap06 (Contributor, Author) commented Mar 9, 2026

> @theap06 I meant you should add the real CLI calls and their outputs to the test plan, run from inside your environment.
>
> Similar to what you did for #61.

Done.

@theap06 theap06 requested a review from luccabb March 9, 2026 02:10
@theap06 (Contributor, Author) commented Mar 10, 2026

@luccabb Could I get some feedback? I added the HAL testing to the PR message.

@luccabb (Member) commented Mar 11, 2026

@claude can you review this PR?

@claude bot commented Mar 11, 2026

Claude Code is working…

I'll analyze this and get back to you.

View job run

@luccabb (Member) commented Mar 11, 2026

@claude can you review this PR?

@claude bot commented Mar 11, 2026

Claude finished @luccabb's task in 1m 52s — View job


PR Review: add python accelerator HAL for multi-vendor backends

  • Gather context and understand request
  • Read CLAUDE.md for repo guidelines
  • Read changed files
  • Provide comprehensive review

Overall, this is a clean, well-structured HAL abstraction. The module layout, error hierarchy, Protocol-based interface, and test coverage are all solid. A few issues need to be addressed before merging.


Critical Issues

1. --backend option is accepted but never used

In both gcm/monitoring/cli/gcm.py:43 and gcm/health_checks/cli/health_checks.py:39, the backend parameter is immediately discarded:

def main(detach: bool, backend: str) -> None:
    _ = backend   # ← never forwarded anywhere

The option appears in --help and is validated by Click, but it does nothing. The new AcceleratorManager/NVMLBackend is never instantiated from the CLI path. The nvml_monitor subcommand still goes directly through the old DeviceTelemetryClient. Until this wiring is added, the HAL is not actually used at runtime. Fix this →
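
One possible shape for the wiring (a sketch using Click's standard context-object pattern; the repo's actual option names and group structure may differ):

```python
import click


@click.group()
@click.option("--backend", default="nvml", help="Accelerator backend to use.")
@click.pass_context
def main(ctx: click.Context, backend: str) -> None:
    # Stash the selection on the Click context so subcommands can read it,
    # instead of discarding it with `_ = backend`.
    ctx.ensure_object(dict)
    ctx.obj["accelerator_backend"] = backend
```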


2. get_vbios_version() used to populate the model field — semantically wrong

In gcm/monitoring/accelerator/backends/nvml.py:92:

model = client.get_device_by_index(index).get_vbios_version()

get_vbios_version() returns a firmware version string (e.g. "90.00.38.00.09"), not the GPU model name (e.g. "A100 80GB"). The DeviceHandle.model field implies a human-readable hardware model. This should use get_name() or similar if available on the client. Fix this →
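
A sketch of one possible fix, assuming the handle exposes a get_name() accessor (verify against the actual client API before relying on it):

```python
handle = client.get_device_by_index(index)
# Prefer the human-readable product name; only fall back to the VBIOS
# string if this client has no name accessor at all.
get_name = getattr(handle, "get_name", None)
model = get_name() if callable(get_name) else handle.get_vbios_version()
```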


Minor Issues

3. NVMLBackend.close() does not close the underlying client

In gcm/monitoring/accelerator/backends/nvml.py:174-176:

def close(self) -> None:
    self._client = None   # ← just drops the reference; no shutdown call
    return None

If DeviceTelemetryClient holds resources (e.g. NVML context), this leaks them. The close should call self._client.close() (or equivalent) before clearing the reference. Fix this →
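
A sketch of a safer close(), assuming the client exposes a close()-style shutdown method (the exact name may differ):

```python
def close(self) -> None:
    # Shut down the underlying client (e.g. the NVML context) before
    # dropping the reference, so resources are actually released.
    if self._client is not None:
        close_method = getattr(self._client, "close", None)
        if callable(close_method):
            close_method()
    self._client = None
```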

4. BackendOperationError stores cause redundantly

In gcm/monitoring/accelerator/errors.py:20-27, cause: Exception is stored as a dataclass field and set as the chained exception via raise ... from e at the call sites. This means the exception appears twice. The __str__ method also manually formats it. Consider dropping the explicit cause field and relying solely on Python's built-in exception chaining (__cause__).
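
For illustration, a minimal sketch of relying on built-in chaining alone (names simplified):

```python
class BackendOperationError(Exception):
    # No explicit `cause` field: `raise ... from e` already records the
    # original error on __cause__, and tracebacks render it automatically.
    def __init__(self, backend: str, operation: str) -> None:
        super().__init__(f"{backend}.{operation} failed")
        self.backend = backend
        self.operation = operation


def read_metrics() -> None:
    try:
        raise OSError("NVML library not loaded")  # stand-in for a real failure
    except OSError as e:
        raise BackendOperationError("nvml", "read_metrics") from e
```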

5. NVMLBackend subclasses AcceleratorBackend (a Protocol)

In gcm/monitoring/accelerator/backends/nvml.py:51:

@dataclass
class NVMLBackend(AcceleratorBackend):

AcceleratorBackend is defined as a Protocol. You don't need to explicitly inherit from a Protocol to satisfy it — you just need to implement the methods. Subclassing a Protocol creates a coupling that isn't needed and is non-idiomatic. NVMLBackend should just be a plain class (or dataclass) that happens to implement all the AcceleratorBackend methods.
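
A minimal illustration of the structural-typing point (names simplified):

```python
from typing import Protocol


class AcceleratorBackend(Protocol):
    def probe(self) -> bool: ...


class NVMLBackend:  # no inheritance from the Protocol needed
    def probe(self) -> bool:
        return True


def start(backend: AcceleratorBackend) -> bool:
    return backend.probe()


start(NVMLBackend())  # type-checks: NVMLBackend satisfies the Protocol structurally
```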


Observations / Non-blocking

  • MetricRequest.include_process_info is ignored in read_metrics: The _request parameter in NVMLBackend.read_metrics is never read. If process info collection is planned, a TODO comment would clarify intent.
  • probe() is called before enumerate_devices() in tests but not enforced: Nothing stops a caller from calling enumerate_devices() on an unprobed backend (the client will be lazily initialized, which may raise). A guard or documented contract would help.
  • Test coverage is good — the 12 HAL unit tests cover normal operation, partial failures, missing library, fallback paths, manager lifecycle, and registry. The new integration tests in test_gcm.py gracefully handle no-GPU environments.
  • README is clear and well-written — the lifecycle section and backend authoring guide are helpful.

Summary

The HAL design is solid and the tests are thorough. The main issue is that the --backend flag is not wired to the HAL — the new code is tested in isolation but not yet integrated into the runtime monitoring path, which undermines the core goal of the PR. The model field bug and close() resource leak should also be fixed. Happy to see this land once those are addressed.

@theap06 theap06 force-pushed the feat/accelerator-hal-upstream-clean branch from 33f4abe to f312b33 on March 11, 2026 23:23
@theap06 (Contributor, Author) commented Mar 11, 2026

@claude can I get feedback?

@theap06 (Contributor, Author) commented Mar 12, 2026

@luccabb I think it should be ready for prod

@luccabb (Member) commented Mar 12, 2026

@claude

@claude bot commented Mar 12, 2026

Claude finished @luccabb's task in 1m 29s — View job


PR Re-Review: add python accelerator HAL for multi-vendor backends

  • Gather context
  • Read changed files to check if previous issues were addressed
  • Provide updated review

All five issues from the previous review have been addressed. Here's the status:

Previously Critical — Now Fixed

  1. --backend is now wired
    gcm.py:46 stores the value in ctx.obj["accelerator_backend"], and nvml_monitor.py:77-98 reads it via _selected_backend_name() which is used to drive _probe_selected_backend(). The HAL is now actually exercised at startup.

    One note: the monitoring loop itself (nvml_monitor.py:367) still collects metrics via obj.get_device_telemetry() → NVMLDeviceTelemetryClient() directly, not through NVMLBackend.read_metrics(). The --backend flag now validates availability but doesn't route the data collection path through the HAL. This is fine as an initial integration — just worth documenting as a follow-up.

  2. get_vbios_version() → get_name() bug fixed
    nvml.py:87-91 now uses getattr(handle, "get_name", None) with a callable/type check before assigning model. Clean.

Previously Minor — Now Fixed

  1. close() resource leak fixed ✓ — nvml.py:178-180 now calls close_method() if available before clearing the reference.

  2. BackendOperationError redundant cause field removed ✓ — errors.py now only holds backend and operation; no duplicate storage.

  3. NVMLBackend no longer subclasses AcceleratorBackend Protocol ✓ — nvml.py:46 is now a plain @dataclass.


One remaining minor issue

In health_checks.py:40-41:

if isinstance(ctx.obj, dict):
    ctx.obj["accelerator_backend"] = backend

Compare to gcm.py:44-46 which ensures the dict exists first:

if not isinstance(ctx.obj, dict):
    ctx.obj = {}
ctx.obj["accelerator_backend"] = backend

If ctx.obj is not a dict when health_checks runs (e.g. not yet initialized by a prior decorator), the backend value is silently dropped. This is unlikely to cause a real bug today since DaemonGroup initializes context, but the asymmetry is fragile. The health_checks pattern should match gcm.py's defensive style. Fix this →


Summary: All critical and minor issues from the prior review are fixed. The remaining item above is a small defensive programming issue in health_checks.py. The HAL design, tests, and integration are solid. This is close to merge-ready once that conditional assignment is made consistent.

@theap06 (Contributor, Author) commented Mar 12, 2026

@claude could I get some feedback on this PR?
