Summary
Break down benchmark timings to separately report PCIe transfer time (host↔device) versus kernel execution time, giving clear visibility into where time is actually spent.
Motivation
The current benchmark reports total wall-clock time per iteration, but this conflates two very different costs: data transfer over the PCIe bus and actual GPU kernel execution. For bandwidth-bound kernels (e.g. VectorAdd), transfer time may dominate. For compute-bound kernels (e.g. SHA-256), kernel time dominates. Without this breakdown, it's impossible to identify the real bottleneck or measure the benefit of optimisations like double-buffering (#6).
Acceptance Criteria
- Instrument the benchmark harness to separately time:
  - Host → Device transfer
  - Kernel execution
  - Device → Host transfer
- Report these as separate columns/rows in the benchmark output alongside existing metrics
- Maintain backward compatibility — existing total time and throughput metrics should still be reported
- Timing should use GPU-side events/synchronisation where possible for accuracy (not just host-side stopwatch)
- Include transfer vs compute ratio/percentage in the summary
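The phase isolation described above could be instrumented roughly like this with ILGPU, synchronising the accelerator after each phase so asynchronous copies and launches are fully captured by the host-side stopwatch. This is a minimal sketch assuming the ILGPU 1.x API; the buffer, kernel, and variable names are illustrative, not the repository's actual code.

```csharp
using System;
using System.Diagnostics;
using ILGPU;
using ILGPU.Runtime;

static class PhaseTimingSketch
{
    static void Main()
    {
        using var context = Context.CreateDefault();
        using var accelerator = context.GetPreferredDevice(preferCPU: false)
                                       .CreateAccelerator(context);

        var host = new float[1 << 20];
        var result = new float[host.Length];
        using var buffer = accelerator.Allocate1D<float>(host.Length);

        // Trivial placeholder kernel so the sketch is self-contained.
        var kernel = accelerator.LoadAutoGroupedStreamKernel<Index1D, ArrayView<float>>(
            (i, view) => view[i] = view[i] + 1.0f);

        // Host → Device: synchronise so the async copy is fully included.
        var h2d = Stopwatch.StartNew();
        buffer.CopyFromCPU(host);
        accelerator.Synchronize();
        h2d.Stop();

        // Kernel execution only.
        var kern = Stopwatch.StartNew();
        kernel((Index1D)host.Length, buffer.View);
        accelerator.Synchronize();
        kern.Stop();

        // Device → Host.
        var d2h = Stopwatch.StartNew();
        buffer.CopyToCPU(result);
        accelerator.Synchronize();
        d2h.Stop();

        Console.WriteLine($"H2D {h2d.Elapsed.TotalMilliseconds:F3} ms, " +
                          $"kernel {kern.Elapsed.TotalMilliseconds:F3} ms, " +
                          $"D2H {d2h.Elapsed.TotalMilliseconds:F3} ms");
    }
}
```

Note that the explicit `Synchronize()` per phase serialises the pipeline, so totals measured this way will slightly exceed the unsynchronised wall-clock total; GPU-side events, where the backend exposes them, avoid that distortion.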
Technical Notes
- ILGPU provides stream synchronisation primitives that can be used to isolate transfer and compute phases
- `BenchmarkResult` will need new fields for transfer and kernel timings
- `BenchmarkRunner` will need to wrap each phase with timing instrumentation
- Consider adding a `--verbose` or `--breakdown` flag to show the detailed timing (keeping the default output clean)
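One possible shape for the extended result type, including the transfer-vs-compute ratio from the acceptance criteria; the field names and the record form are assumptions for illustration, not the project's actual `BenchmarkResult` definition.

```csharp
using System;

// Hypothetical extension of BenchmarkResult: Total is kept as-is for
// backward compatibility, with the three phase timings added alongside it.
public sealed record BenchmarkResult(
    TimeSpan HostToDevice,
    TimeSpan Kernel,
    TimeSpan DeviceToHost,
    TimeSpan Total)
{
    // Fraction of the iteration spent on PCIe transfers (for the summary line).
    public double TransferRatio =>
        (HostToDevice + DeviceToHost).TotalMilliseconds / Total.TotalMilliseconds;

    public override string ToString() =>
        $"H2D {HostToDevice.TotalMilliseconds:F3} ms | " +
        $"kernel {Kernel.TotalMilliseconds:F3} ms | " +
        $"D2H {DeviceToHost.TotalMilliseconds:F3} ms | " +
        $"transfer {TransferRatio:P1} of total";
}
```

Keeping `Total` as an independently measured field (rather than deriving it from the three phases) preserves the existing throughput metrics unchanged.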