Summary
test_template_scale_values_self_compare[double] in accordo/tests/test_reduction_validation.py fails intermittently. The float variant always passes.
Failure
The test runs the same scale_values<double> kernel twice (self-compare) and expects identical snapshots. Instead, the first (reference) snapshot contains garbage/uninitialized memory while the second (optimized) snapshot is correct:
Reference 'input' (double*): [2.12e+000, 1.48e-323, 2.12e-314, 0.0, 3.16e-322, ...] ← garbage
Reference 'output' (double*): [9.17e+199, 1.17e+214, -6.06e-066, ...] ← garbage
Optimized 'input' (double*): [0., 1., 2., 3., 4., 5., ...] ← correct
Optimized 'output' (double*): [0., 2., 4., 6., 8., 10., ...] ← correct
Both input and output are T* (non-const), so Accordo captures IPC handles for both.
What's NOT the problem
GPU synchronization is correct. In write_packets() (accordo.hip:608-685):
- The packet is dispatched via writer() (line 663)
- hsa_signal_wait_scacquire() blocks until kernel completion (lines 665-666)
- send_message_and_wait() only runs after the kernel is done (line 680)
The barriers are in place — the kernel is fully complete before any IPC handles are created.
Likely root cause: IPC handle lifetime / missing hipIpcCloseMemHandle
After open_ipc_handle() (hip_interop.py:29-74) calls hipIpcOpenMemHandle, there is no corresponding hipIpcCloseMemHandle anywhere in the codebase. The IPC mapping leaks.
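One way to guarantee the close is never skipped is to pair open and close in a context manager. This is only a sketch of the pattern: `ipc_mapping`, `open_fn`, and `close_fn` are hypothetical names standing in for the hipIpcOpenMemHandle / hipIpcCloseMemHandle calls that hip_interop.py would make, with stub callables here so the example runs without a GPU.

```python
from contextlib import contextmanager

@contextmanager
def ipc_mapping(open_fn, close_fn, handle):
    """Map a child's IPC handle, and always unmap it afterwards.

    open_fn/close_fn stand in for hipIpcOpenMemHandle/hipIpcCloseMemHandle;
    the real versions would go through ctypes into the HIP runtime.
    """
    ptr = open_fn(handle)      # maps the child's allocation into this process
    try:
        yield ptr
    finally:
        close_fn(ptr)          # unmap even if the read raises, so nothing leaks

# Stub usage: record which pointers get closed.
closed = []
with ipc_mapping(lambda h: 0xDEAD, closed.append, "handle-1") as ptr:
    assert ptr == 0xDEAD       # "read" the snapshot here
print(closed)  # [57005], i.e. the mapping was closed exactly once
```

With this shape, the second snapshot never starts with a stale mapping from the first still open in the parent's address space.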
When two snapshots are taken sequentially:
- First child process spawned: the kernel runs, IPC handles are written, Python opens them via hipIpcOpenMemHandle, reads the data, and sends "done"; the child exits and its GPU allocations are freed
- Second child process spawned, but the stale IPC mapping from snapshot 1 is still open in Python's address space
- The leaked mapping may interfere with the second snapshot's IPC open, or the first snapshot's read may race with process cleanup
The fact that it's always the first snapshot that gets garbage (not the second) suggests the issue may be in the timing of the first child process's GPU memory becoming visible via IPC — possibly a race between hipIpcGetMemHandle in the child and hipIpcOpenMemHandle in the parent, or the child's hipMalloc returning a suballocated pointer from a pool that hasn't been committed yet for IPC.
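The ordering constraint the snapshot protocol has to uphold can be modeled without a GPU: the parent must not open the handle before the child has exported it, and the child must not free its allocations (or exit) until the parent signals it is done reading. The sketch below uses threads and an event as stand-ins; the function names are illustrative, not accordo's actual API.

```python
import queue
import threading

def child(handle_q, done):
    # In the real child: hipMalloc, run kernel, hipIpcGetMemHandle, then export.
    handle_q.put("ipc-handle")
    done.wait()                # hold allocations alive until the parent is done
    # ...only now may hipFree / process exit happen

def parent(handle_q, done, out):
    handle = handle_q.get()    # blocks until the child has exported the handle
    out.append(f"read-{handle}")  # real parent: hipIpcOpenMemHandle + copy out
    done.set()                 # release the child to tear down

out, done, q = [], threading.Event(), queue.Queue()
threads = [threading.Thread(target=child, args=(q, done)),
           threading.Thread(target=parent, args=(q, done, out))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(out)  # ['read-ipc-handle']
```

If the first snapshot's garbage comes from the parent reading before the child's memory is committed for IPC, the fix is an extra handshake step of exactly this kind on the export side.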
Why double but not float?
- double uses 2x the memory (8 KB vs 4 KB for 1024 elements), which may trigger different allocation paths in the HIP memory pool
- Larger allocations may be more susceptible to lazy commitment / page-fault timing
- The float test runs first (alphabetical parametrize order), so the GPU memory state differs by the time the double variant runs
Investigation needed
- Add hipIpcCloseMemHandle after reading data in hip_interop.py and see if the flake goes away
- Add debug logging to dump the actual pointer values and IPC handle contents for both snapshots to see if they differ
- Check if hipMalloc suballocates: if the tracked pointer_sizes_ from HSA-level hooks don't match the hipMalloc sizes, the IPC read could be at the wrong offset
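The suballocation check could start as a simple diff between the two size maps. The dicts below are invented example data and `find_size_mismatches` is a hypothetical helper; a real check would populate both sides from the debug logging suggested above.

```python
def find_size_mismatches(hsa_tracked, hip_requested):
    """Return pointers whose HSA-tracked size differs from the hipMalloc size.

    A non-empty result would support the suballocation hypothesis: the pool
    handed back a pointer into a larger HSA-level region, so an IPC read
    based on the tracked region could land at the wrong offset.
    """
    return {ptr: (hsa_tracked.get(ptr), hip_requested[ptr])
            for ptr in hip_requested
            if hsa_tracked.get(ptr) != hip_requested[ptr]}

# Invented example: the pool rounded the first allocation up to 16 KiB.
hsa_tracked   = {0x1000: 16384, 0x5000: 8192}
hip_requested = {0x1000: 8192,  0x5000: 8192}  # 1024 doubles = 8 KiB each
print(find_size_mismatches(hsa_tracked, hip_requested))
# {4096: (16384, 8192)}
```

An empty result on a failing run would rule this branch out and point the investigation back at handle lifetime or export timing.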
CI Evidence
- Fails on PR "README refresh, remove root metapackage, installer --tools" #81 (pytest non-editable - accordo): run log
- Main branch passes (the test is intermittent)