Flaky test: test_template_scale_values_self_compare[double] — GPU sync race in snapshot capture #84

@mawad-amd

Summary

test_template_scale_values_self_compare[double] in accordo/tests/test_reduction_validation.py fails intermittently. The float variant always passes.

Failure

The test runs the same scale_values<double> kernel twice (self-compare) and expects identical snapshots. Instead, the first (reference) snapshot contains garbage/uninitialized memory while the second (optimized) snapshot is correct:

Reference 'input' (double*):  [2.12e+000, 1.48e-323, 2.12e-314, 0.0, 3.16e-322, ...]  ← garbage
Reference 'output' (double*): [9.17e+199, 1.17e+214, -6.06e-066, ...]                   ← garbage

Optimized 'input' (double*):  [0., 1., 2., 3., 4., 5., ...]                             ← correct
Optimized 'output' (double*): [0., 2., 4., 6., 8., 10., ...]                            ← correct

Both input and output are T* (non-const), so Accordo captures IPC handles for both.

What's NOT the problem

GPU synchronization is correct. In write_packets() (accordo.hip:608-685):

  1. Packet dispatched via writer() (line 663)
  2. hsa_signal_wait_scacquire() blocks until kernel completion (line 665-666)
  3. send_message_and_wait() only runs after kernel is done (line 680)

The barriers are in place — the kernel is fully complete before any IPC handles are created.
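The ordering enforced here can be modeled with a plain threading primitive. This is a sketch, not the HSA API: `threading.Event` stands in for the completion signal that `hsa_signal_wait_scacquire()` blocks on, and the function names mirror the ones in `write_packets()` only for readability.

```python
import threading

# Simplified model of write_packets(): the IPC handshake must not begin
# until the kernel-completion signal has been observed. threading.Event
# stands in for the HSA signal; all names here are illustrative.
kernel_done = threading.Event()

def writer():
    # Step 1: dispatch the packet (simulated kernel launch on a thread).
    def kernel():
        # ... kernel work ...
        kernel_done.set()  # completion signal reaches zero
    threading.Thread(target=kernel).start()

def send_message_and_wait():
    return "ipc-handles-sent"

writer()
kernel_done.wait()                 # Step 2: block until kernel completion
result = send_message_and_wait()   # Step 3: runs only after the kernel is done
print(result)
```

In this model, as in the real code, step 3 cannot observe a half-finished kernel — which is why the garbage snapshots point away from GPU sync and toward IPC lifetime.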

Likely root cause: IPC handle lifetime / missing hipIpcCloseMemHandle

After open_ipc_handle() (hip_interop.py:29-74) calls hipIpcOpenMemHandle, there is no corresponding hipIpcCloseMemHandle anywhere in the codebase. The IPC mapping leaks.
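A context manager would make the missing close structurally impossible. This is a sketch, not the actual `hip_interop.py` code: `open_fn`/`close_fn` stand in for the real `hipIpcOpenMemHandle`/`hipIpcCloseMemHandle` ctypes bindings, whose exact signatures are not shown in the issue.

```python
from contextlib import contextmanager

@contextmanager
def ipc_mapping(open_fn, close_fn, handle):
    """Open an IPC handle and guarantee the matching close, even if the
    read raises. open_fn/close_fn are stand-ins for the real
    hipIpcOpenMemHandle/hipIpcCloseMemHandle bindings."""
    ptr = open_fn(handle)
    try:
        yield ptr
    finally:
        close_fn(ptr)

# Usage with stand-in functions (no GPU required):
closed = []
with ipc_mapping(lambda h: f"ptr({h})", closed.append, "handle-A") as ptr:
    data = f"read from {ptr}"
print(data, closed)
```

With this shape, investigation step 1 below (add the close and watch the flake) becomes a one-line change at each call site.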

When two snapshots are taken sequentially:

  1. First child process spawned, kernel runs, IPC handles written, Python opens them via hipIpcOpenMemHandle, reads data, sends "done" — child exits, GPU allocations freed
  2. Second child process spawned — but the stale IPC mapping from snapshot 1 is still open in Python's address space
  3. The leaked mapping may interfere with the second snapshot's IPC open, or the first snapshot's read may race with process cleanup
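The leak across snapshots can be sketched with a toy model: each `hipIpcOpenMemHandle` adds an entry to the parent's table of live mappings, and without a close the first snapshot's entries survive into the second. Everything here is illustrative, not the HIP runtime's actual bookkeeping.

```python
# Toy model: a table of IPC mappings that stays populated until an
# explicit close. All names and values are illustrative.
open_mappings = {}

def fake_ipc_open(handle):
    open_mappings[handle] = f"mapping-for-{handle}"
    return open_mappings[handle]

def fake_ipc_close(handle):
    open_mappings.pop(handle, None)

# Snapshot 1: open input and output, read, but never close (current behavior).
fake_ipc_open("snap1-input")
fake_ipc_open("snap1-output")

# Snapshot 2: the stale snap1 mappings are still live alongside snap2's.
fake_ipc_open("snap2-input")
print(sorted(open_mappings))
```

In the real runtime the stale mappings additionally pin GPU address ranges from a child process that has already exited, which is where the interference hypothesized above would come from.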

The fact that it's always the first snapshot that gets garbage (not the second) suggests a timing issue in when the first child's GPU memory becomes visible via IPC: either a race between hipIpcGetMemHandle in the child and hipIpcOpenMemHandle in the parent, or the child's hipMalloc returning a suballocated pointer from a pool that has not yet been committed for IPC.

Why double but not float?

  • double uses 2x the memory (8KB vs 4KB for 1024 elements), which may trigger different allocation paths in the HIP memory pool
  • Larger allocations may be more susceptible to lazy commitment / page fault timing
  • The float test runs first (alphabetical parametrize order), so GPU memory state is different
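The size difference in the first bullet is just element count times element width; the 1024-element count comes from the test parameters cited above.

```python
import struct

# Buffer sizes for the two parametrized variants (1024 elements each).
N = 1024
float_bytes = N * struct.calcsize("f")   # 4 bytes per float32
double_bytes = N * struct.calcsize("d")  # 8 bytes per float64
print(float_bytes, double_bytes)  # 4096 8192
```

Whether an 8 KB vs 4 KB request actually lands in a different HIP pool bucket is an assumption to verify, not a fact established by this arithmetic.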

Investigation needed

  1. Add hipIpcCloseMemHandle after reading data in hip_interop.py and see if the flake goes away
  2. Add debug logging to dump the actual pointer values and IPC handle contents for both snapshots to see if they differ
  3. Check if hipMalloc suballocates — if the tracked pointer_sizes_ from HSA-level hooks don't match the hipMalloc sizes, the IPC read could be at the wrong offset
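Step 3 could be checked with a simple consistency pass over the two size tables. `pointer_sizes_` and the hipMalloc sizes are real names from the issue; the helper below and its data are hypothetical.

```python
def find_size_mismatches(tracked_sizes, hipmalloc_sizes):
    """Compare HSA-level tracked sizes (pointer_sizes_) against the sizes
    requested via hipMalloc. Any mismatch suggests suballocation, meaning
    the IPC read could land at the wrong offset. Purely illustrative."""
    return {
        ptr: (tracked_sizes[ptr], hipmalloc_sizes[ptr])
        for ptr in tracked_sizes
        if ptr in hipmalloc_sizes and tracked_sizes[ptr] != hipmalloc_sizes[ptr]
    }

# Hypothetical example: a 2 MiB pool allocation tracked at HSA level,
# while hipMalloc only asked for 8 KiB -- a classic suballocation signature.
tracked = {0x7F00: 2 * 1024 * 1024}
requested = {0x7F00: 8192}
print(find_size_mismatches(tracked, requested))
```

If the real tables show mismatches like this, the IPC read offset (not just the handle lifetime) becomes a candidate explanation for the garbage.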

CI Evidence
