-
Notifications
You must be signed in to change notification settings - Fork 0
Open
Description
Summary
Expand the benchmark task suite with three new GPU kernel implementations: Fast Fourier Transform (FFT), matrix multiplication, and sorting networks.
Motivation
The current five tasks demonstrate a range from bandwidth-bound to compute-bound workloads, but they're all element-wise operations with no inter-element dependencies. FFT, matrix multiply, and sorting networks introduce fundamentally different parallel patterns — butterfly networks, tiled shared-memory access, and compare-and-swap — that stress different aspects of GPU architecture and reveal new performance characteristics.
Acceptance Criteria
- Implement
FftTask— a radix-2 Cooley-Tukey FFT on a power-of-two input array - Implement
MatrixMultiplyTask— dense matrix multiplication (e.g. 1024×1024 × 1024×1024) - Implement
SortingNetworkTask— a bitonic sort or odd-even merge sort on the GPU - All three tasks implement
IBenchmarkTaskand integrate with the existing harness - Each task includes proper host↔device memory transfers (no shortcuts)
- Results should be validated for correctness against a CPU reference implementation
- Performance should be benchmarked and reported alongside existing tasks
Technical Notes
- FFT requires shared memory and butterfly-pattern access — may need ILGPU shared memory primitives
- Matrix multiply benefits hugely from tiled/blocked algorithms using shared memory
- Sorting networks are interesting because they have a fixed comparison pattern, making them ideal for GPU parallelism despite sorting being traditionally sequential
- Batch size semantics differ per task: for FFT it's the transform length, for matrix multiply it's the matrix dimension, for sorting it's the array length
- Consider adding these under
src/GpuStreamingDemo.Core/Tasks/following existing patterns
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels