add python accelerator HAL for multi-vendor backends #86
theap06 wants to merge 6 commits into facebookresearch:main from
Could I get some feedback? This is tackling my aforementioned feature request on supporting other hardware |
luccabb left a comment
adds scaffold backends for ROCm, LevelZero, TPU, and Neuron
let's just keep NVML since it's the only backend supported in this PR
this should be a full run of the cmd with the corresponding output for each |
44b06c0 to 49fa293
luccabb left a comment
gcm --backend=nvml --sink=stdout ...
healthchecks ... --backend=nvml --sink=stdout ...
this should be a full run of the cmd with the corresponding output for each
|
@luccabb the hardware backend and its functionality work on my end with a GPU instance. Let me know if you have any other feedback. |
|
@luccabb could I get some feedback? I added the HAL testing to the PR message. |
|
@claude can you review this pr? |
|
I'll analyze this and get back to you. |
|
@claude can you review this pr? |
|
Claude finished @luccabb's task in 1m 52s

PR Review: add python accelerator HAL for multi-vendor backends

Overall, this is a clean, well-structured HAL abstraction. The module layout, error hierarchy, Protocol-based interface, and test coverage are all solid. A few issues need to be addressed before merging.

Critical Issues

1. In both

    def main(detach: bool, backend: str) -> None:
        _ = backend  # ← never forwarded anywhere

The option appears in

2. In

    model = client.get_device_by_index(index).get_vbios_version()

Minor Issues

3. In

    def close(self) -> None:
        self._client = None  # ← just drops the reference; no shutdown call
        return None

If

4. In

5. In

    @dataclass
    class NVMLBackend(AcceleratorBackend):

Observations / Non-blocking

Summary

The HAL design is solid and the tests are thorough. The main issue is that the |
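To illustrate issue 3, a `close()` that releases the underlying client instead of only dropping the reference might look roughly like the sketch below. This is not the PR's code: `ClientLike`, `FakeClient`, `SketchBackend`, and the `shutdown()` method are hypothetical stand-ins, since the actual client interface is not shown in this thread.

```python
from typing import Optional, Protocol


class ClientLike(Protocol):
    """Hypothetical client interface; the real client may differ."""

    def shutdown(self) -> None: ...


class FakeClient:
    """Stand-in client that counts how often shutdown() is called."""

    def __init__(self) -> None:
        self.shutdown_calls = 0

    def shutdown(self) -> None:
        self.shutdown_calls += 1


class SketchBackend:
    """Illustrative backend only; not the PR's NVMLBackend."""

    def __init__(self, client: ClientLike) -> None:
        self._client: Optional[ClientLike] = client

    def close(self) -> None:
        # Drop the reference first so a repeated close() is a no-op,
        # then release the client that was held.
        client, self._client = self._client, None
        if client is not None:
            client.shutdown()


fake = FakeClient()
backend = SketchBackend(fake)
backend.close()
backend.close()  # second call finds no client and does nothing
```

Clearing the reference before calling `shutdown()` keeps `close()` idempotent even if the shutdown call raises.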
33f4abe to f312b33
|
@claude can I get feedback |
|
@luccabb I think it should be ready for prod |
|
Claude finished @luccabb's task in 1m 29s

PR Re-Review: add python accelerator HAL for multi-vendor backends

All five issues from the previous review have been addressed. Here's the status:

Previously Critical: Now Fixed

Previously Minor: Now Fixed

One remaining minor issue

In

    if isinstance(ctx.obj, dict):
        ctx.obj["accelerator_backend"] = backend

Compare to

    if not isinstance(ctx.obj, dict):
        ctx.obj = {}
    ctx.obj["accelerator_backend"] = backend

If

Summary: All critical and minor issues from the prior review are fixed. The remaining item above is a small defensive programming issue in |
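The difference between the two snippets in the re-review can be shown with a small runnable sketch. It assumes only that the context object exposes a mutable `obj` attribute, as the snippets suggest; `store_backend` is a hypothetical helper name and `SimpleNamespace` stands in for the real CLI context.

```python
from types import SimpleNamespace
from typing import Any


def store_backend(ctx: Any, backend: str) -> None:
    """Record the selected backend on ctx.obj, creating the dict if needed."""
    if not isinstance(ctx.obj, dict):
        # Without this branch, a context whose obj is None would
        # silently lose the backend selection.
        ctx.obj = {}
    ctx.obj["accelerator_backend"] = backend


# SimpleNamespace is a stand-in for the CLI context (assumption).
fresh_ctx = SimpleNamespace(obj=None)
store_backend(fresh_ctx, "nvml")

existing_ctx = SimpleNamespace(obj={"sink": "stdout"})
store_backend(existing_ctx, "nvml")
```

With the guarded form, the selection is stored whether or not `obj` was already a dict, and existing keys are left untouched.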
|
@claude could I get some feedback on this pr? |

Introduce a hardware-agnostic accelerator abstraction layer with normalized metrics, backend management, and runtime probing. Includes a functional NVML backend plus ROCm/LevelZero/TPU/Neuron scaffolds and dedicated HAL tests.
Adds a Python-first hardware-agnostic accelerator HAL at gcm/monitoring/accelerator.
Decouples telemetry collection from NVML-only assumptions via a common backend interface and normalized metrics.
Implements functional NVMLBackend; adds scaffold backends for ROCm, LevelZero, TPU, and Neuron.
Implements Feature Request #74.
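For readers unfamiliar with the pattern, the idea of a common backend interface with normalized metrics and runtime probing could be sketched as follows. Every name here (`DeviceMetrics`, `BackendProtocol`, `StaticBackend`, `select_backend`, and the field names) is hypothetical and chosen for illustration; the actual classes under gcm/monitoring/accelerator may be shaped differently.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Protocol


@dataclass(frozen=True)
class DeviceMetrics:
    """Hypothetical vendor-neutral metrics record."""

    index: int
    utilization_pct: Optional[float] = None
    memory_used_bytes: Optional[int] = None


class BackendProtocol(Protocol):
    """Hypothetical minimal interface each vendor backend would satisfy."""

    name: str

    def probe(self) -> bool: ...

    def read_metrics(self) -> List[DeviceMetrics]: ...


class StaticBackend:
    """Toy backend returning fixed values, for illustration only."""

    def __init__(self, name: str, available: bool) -> None:
        self.name = name
        self._available = available

    def probe(self) -> bool:
        return self._available

    def read_metrics(self) -> List[DeviceMetrics]:
        return [DeviceMetrics(index=0, utilization_pct=50.0)]


def select_backend(
    backends: Dict[str, BackendProtocol], requested: str
) -> BackendProtocol:
    """Return the requested backend if it is registered and usable here."""
    backend = backends.get(requested)
    if backend is None:
        raise KeyError(f"unknown backend: {requested}")
    if not backend.probe():
        raise RuntimeError(f"backend unavailable: {requested}")
    return backend


registry: Dict[str, BackendProtocol] = {
    "nvml": StaticBackend("nvml", available=True),
    "rocm": StaticBackend("rocm", available=False),
}
chosen = select_backend(registry, "nvml")
metrics = chosen.read_metrics()
```

Callers depend only on the protocol and the metrics record, so a scaffold backend can be registered before it is functional and simply report itself as unavailable when probed.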
Test Plan:
Ran HAL tests:
12 passed