Flaky test: test_template_scale_values_self_compare[double] — GPU sync race in snapshot capture #84

@mawad-amd

Summary

test_template_scale_values_self_compare[double] in accordo/tests/test_reduction_validation.py fails intermittently. The float variant always passes.

Failure

The test runs the same scale_values<double> kernel twice (self-compare) and expects identical snapshots. Instead, the first (reference) snapshot contains garbage/uninitialized memory while the second (optimized) snapshot is correct:

Reference 'input' (double*):  [2.12e+000, 1.48e-323, 2.12e-314, 0.0, 3.16e-322, ...]  ← garbage
Reference 'output' (double*): [9.17e+199, 1.17e+214, -6.06e-066, ...]                   ← garbage

Optimized 'input' (double*):  [0., 1., 2., 3., 4., 5., ...]                             ← correct
Optimized 'output' (double*): [0., 2., 4., 6., 8., 10., ...]                            ← correct

Both input and output are T* (non-const), so Accordo captures IPC handles for both.

What's NOT the problem

GPU synchronization is correct. In write_packets() (accordo.hip:608-685):

  1. Packet dispatched via writer() (line 663)
  2. hsa_signal_wait_scacquire() blocks until kernel completion (line 665-666)
  3. send_message_and_wait() only runs after kernel is done (line 680)

The barriers are in place — the kernel is fully complete before any IPC handles are created.
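The ordering enforced here can be modeled with a plain threading primitive. This is a sketch, not the HSA API: `threading.Event` stands in for the completion signal that `hsa_signal_wait_scacquire()` blocks on, and the function names mirror the ones in `write_packets()` only for readability.

```python
import threading

# Simplified model of write_packets(): the IPC handshake must not begin
# until the kernel-completion signal has been observed. threading.Event
# stands in for the HSA signal; all names here are illustrative.
kernel_done = threading.Event()

def writer():
    # Step 1: dispatch the packet (simulated kernel launch on a thread).
    def kernel():
        # ... kernel work ...
        kernel_done.set()  # completion signal reaches zero
    threading.Thread(target=kernel).start()

def send_message_and_wait():
    return "ipc-handles-sent"

writer()
kernel_done.wait()                 # Step 2: block until kernel completion
result = send_message_and_wait()   # Step 3: runs only after the kernel is done
print(result)
```

In this model, as in the real code, step 3 cannot observe a half-finished kernel — which is why the garbage snapshots point away from GPU sync and toward IPC lifetime.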

Likely root cause: IPC handle lifetime / missing hipIpcCloseMemHandle

After open_ipc_handle() (hip_interop.py:29-74) calls hipIpcOpenMemHandle, there is no corresponding hipIpcCloseMemHandle anywhere in the codebase. The IPC mapping leaks.
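A context manager would make the missing close structurally impossible. This is a sketch, not the actual `hip_interop.py` code: `open_fn`/`close_fn` stand in for the real `hipIpcOpenMemHandle`/`hipIpcCloseMemHandle` ctypes bindings, whose exact signatures are not shown in the issue.

```python
from contextlib import contextmanager

@contextmanager
def ipc_mapping(open_fn, close_fn, handle):
    """Open an IPC handle and guarantee the matching close, even if the
    read raises. open_fn/close_fn are stand-ins for the real
    hipIpcOpenMemHandle/hipIpcCloseMemHandle bindings."""
    ptr = open_fn(handle)
    try:
        yield ptr
    finally:
        close_fn(ptr)

# Usage with stand-in functions (no GPU required):
closed = []
with ipc_mapping(lambda h: f"ptr({h})", closed.append, "handle-A") as ptr:
    data = f"read from {ptr}"
print(data, closed)
```

With this shape, investigation step 1 below (add the close and watch the flake) becomes a one-line change at each call site.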

When two snapshots are taken sequentially:

  1. First child process spawned, kernel runs, IPC handles written, Python opens them via hipIpcOpenMemHandle, reads data, sends "done" — child exits, GPU allocations freed
  2. Second child process spawned — but the stale IPC mapping from snapshot 1 is still open in Python's address space
  3. The leaked mapping may interfere with the second snapshot's IPC open, or the first snapshot's read may race with process cleanup
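The leak across snapshots can be sketched with a toy model: each `hipIpcOpenMemHandle` adds an entry to the parent's table of live mappings, and without a close the first snapshot's entries survive into the second. Everything here is illustrative, not the HIP runtime's actual bookkeeping.

```python
# Toy model: a table of IPC mappings that stays populated until an
# explicit close. All names and values are illustrative.
open_mappings = {}

def fake_ipc_open(handle):
    open_mappings[handle] = f"mapping-for-{handle}"
    return open_mappings[handle]

def fake_ipc_close(handle):
    open_mappings.pop(handle, None)

# Snapshot 1: open input and output, read, but never close (current behavior).
fake_ipc_open("snap1-input")
fake_ipc_open("snap1-output")

# Snapshot 2: the stale snap1 mappings are still live alongside snap2's.
fake_ipc_open("snap2-input")
print(sorted(open_mappings))
```

In the real runtime the stale mappings additionally pin GPU address ranges from a child process that has already exited, which is where the interference hypothesized above would come from.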

The fact that it's always the first snapshot that gets garbage (not the second) suggests a timing issue in when the first child's GPU memory becomes visible via IPC: either a race between hipIpcGetMemHandle in the child and hipIpcOpenMemHandle in the parent, or the child's hipMalloc returning a suballocated pointer from a pool that has not yet been committed for IPC.

Why double but not float?

  • double uses 2x the memory (8KB vs 4KB for 1024 elements), which may trigger different allocation paths in the HIP memory pool
  • Larger allocations may be more susceptible to lazy commitment / page fault timing
  • The float test runs first (alphabetical parametrize order), so GPU memory state is different
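The size difference in the first bullet is just element count times element width; the 1024-element count comes from the test parameters cited above.

```python
import struct

# Buffer sizes for the two parametrized variants (1024 elements each).
N = 1024
float_bytes = N * struct.calcsize("f")   # 4 bytes per float32
double_bytes = N * struct.calcsize("d")  # 8 bytes per float64
print(float_bytes, double_bytes)  # 4096 8192
```

Whether an 8 KB vs 4 KB request actually lands in a different HIP pool bucket is an assumption to verify, not a fact established by this arithmetic.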

Investigation needed

  1. Add hipIpcCloseMemHandle after reading data in hip_interop.py and see if the flake goes away
  2. Add debug logging to dump the actual pointer values and IPC handle contents for both snapshots to see if they differ
  3. Check if hipMalloc suballocates — if the tracked pointer_sizes_ from HSA-level hooks don't match the hipMalloc sizes, the IPC read could be at the wrong offset
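Step 3 could be checked with a simple consistency pass over the two size tables. `pointer_sizes_` and the hipMalloc sizes are real names from the issue; the helper below and its data are hypothetical.

```python
def find_size_mismatches(tracked_sizes, hipmalloc_sizes):
    """Compare HSA-level tracked sizes (pointer_sizes_) against the sizes
    requested via hipMalloc. Any mismatch suggests suballocation, meaning
    the IPC read could land at the wrong offset. Purely illustrative."""
    return {
        ptr: (tracked_sizes[ptr], hipmalloc_sizes[ptr])
        for ptr in tracked_sizes
        if ptr in hipmalloc_sizes and tracked_sizes[ptr] != hipmalloc_sizes[ptr]
    }

# Hypothetical example: a 2 MiB pool allocation tracked at HSA level,
# while hipMalloc only asked for 8 KiB -- a classic suballocation signature.
tracked = {0x7F00: 2 * 1024 * 1024}
requested = {0x7F00: 8192}
print(find_size_mismatches(tracked, requested))
```

If the real tables show mismatches like this, the IPC read offset (not just the handle lifetime) becomes a candidate explanation for the garbage.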

CI Evidence
