PCIe vs kernel timing breakdown #7

@SolidRegardless

Description

Summary

Break down benchmark timings to separately report PCIe transfer time (host↔device) versus kernel execution time, giving clear visibility into where time is actually spent.

Motivation

The current benchmark reports total wall-clock time per iteration, but this conflates two very different costs: data transfer over the PCIe bus and actual GPU kernel execution. For bandwidth-bound kernels (e.g. VectorAdd), transfer time may dominate. For compute-bound kernels (e.g. SHA-256), kernel time dominates. Without this breakdown, it's impossible to identify the real bottleneck or measure the benefit of optimisations like double-buffering (#6).

Acceptance Criteria

  • Instrument the benchmark harness to separately time:
    • Host → Device transfer
    • Kernel execution
    • Device → Host transfer
  • Report these as separate columns/rows in the benchmark output alongside existing metrics
  • Maintain backward compatibility — existing total time and throughput metrics should still be reported
  • Timing should use GPU-side events/synchronisation where possible for accuracy (not just host-side stopwatch)
  • Include transfer vs compute ratio/percentage in the summary
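One way to satisfy these criteria is to bracket each phase with a host-side `Stopwatch` and force completion with `accelerator.Synchronize()` before stopping the clock. The sketch below is illustrative only — the kernel, buffer names, and problem size are assumptions, not code from this repo — but it shows the three-phase split against ILGPU's public API:

```csharp
using System;
using System.Diagnostics;
using ILGPU;
using ILGPU.Runtime;

static class PhaseTimingSketch
{
    // Illustrative bandwidth-bound kernel (hypothetical, stands in for VectorAdd).
    static void VectorAddKernel(Index1D i, ArrayView<float> a, ArrayView<float> b, ArrayView<float> c) =>
        c[i] = a[i] + b[i];

    static void Main()
    {
        using var context = Context.CreateDefault();
        using var accelerator = context.GetPreferredDevice(preferCPU: false)
                                       .CreateAccelerator(context);

        const int n = 1 << 20;
        var hostA = new float[n];
        var hostB = new float[n];
        var hostC = new float[n];

        using var devA = accelerator.Allocate1D<float>(n);
        using var devB = accelerator.Allocate1D<float>(n);
        using var devC = accelerator.Allocate1D<float>(n);

        var kernel = accelerator.LoadAutoGroupedStreamKernel<
            Index1D, ArrayView<float>, ArrayView<float>, ArrayView<float>>(VectorAddKernel);

        var sw = new Stopwatch();

        // Phase 1: Host -> Device. Synchronize so the copy has actually finished
        // before the stopwatch is read.
        sw.Restart();
        devA.CopyFromCPU(hostA);
        devB.CopyFromCPU(hostB);
        accelerator.Synchronize();
        var h2dMs = sw.Elapsed.TotalMilliseconds;

        // Phase 2: kernel execution only.
        sw.Restart();
        kernel(n, devA.View, devB.View, devC.View);
        accelerator.Synchronize();
        var kernelMs = sw.Elapsed.TotalMilliseconds;

        // Phase 3: Device -> Host.
        sw.Restart();
        devC.CopyToCPU(hostC);
        accelerator.Synchronize();
        var d2hMs = sw.Elapsed.TotalMilliseconds;

        Console.WriteLine($"H2D {h2dMs:F3} ms | kernel {kernelMs:F3} ms | D2H {d2hMs:F3} ms");
    }
}
```

Note that the explicit synchronisation between phases prevents any transfer/compute overlap, so once double-buffering (#6) lands, the breakdown pass should probably run separately from the throughput pass rather than replace it. If the ILGPU version in use exposes stream profiling markers, those would satisfy the "GPU-side events" criterion more precisely than a host stopwatch.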

Technical Notes

  • ILGPU provides stream synchronisation primitives that can be used to isolate transfer and compute phases
  • BenchmarkResult will need new fields for transfer and kernel timings
  • BenchmarkRunner will need to wrap each phase with timing instrumentation
  • Consider adding a --verbose or --breakdown flag to show the detailed timing (keeping default output clean)
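A minimal shape for the extended result type could look like the following — all field and property names here are assumptions for illustration, not the repo's actual `BenchmarkResult`:

```csharp
// Hypothetical extension of BenchmarkResult; names are illustrative.
public readonly record struct BenchmarkResult(
    string Name,
    double HostToDeviceMs,
    double KernelMs,
    double DeviceToHostMs)
{
    public double TotalMs => HostToDeviceMs + KernelMs + DeviceToHostMs;
    public double TransferMs => HostToDeviceMs + DeviceToHostMs;

    // Transfer vs compute split for the summary line, guarding against
    // a zero total on degenerate runs.
    public double TransferPercent => TotalMs > 0 ? 100.0 * TransferMs / TotalMs : 0.0;
    public double KernelPercent => TotalMs > 0 ? 100.0 * KernelMs / TotalMs : 0.0;
}
```

Deriving the total from the per-phase fields keeps the existing total-time and throughput columns intact, which covers the backward-compatibility criterion without storing the same number twice.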
