How DRAM works and why should you care | GPU Programming
operation:
- memory-bound reduction
- compute-bound element-wise operation
thread = basic unit of execution
warp = group of threads
occupancy = active warps / max warps
resource-light kernel = high occupancy
high occupancy can hide memory latency: while one warp waits for a data transfer, other warps can run
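A rough sketch of the occupancy formula, to show why a resource-light kernel gets high occupancy. The function name `occupancy` and the per-SM limits (65536 registers, 64 warps, warp size 32) are illustrative assumptions, not values from these notes; real hardware also has shared-memory limits and allocation granularity that this ignores.

```python
import math

def occupancy(regs_per_thread, threads_per_block,
              regs_per_sm=65536, max_warps_per_sm=64, warp_size=32):
    """Estimate occupancy = active warps / max warps (registers only).

    All hardware limits here are assumed example values.
    """
    warps_per_block = math.ceil(threads_per_block / warp_size)
    regs_per_block = regs_per_thread * warps_per_block * warp_size
    # How many blocks fit on one SM under each limit.
    blocks_by_regs = regs_per_sm // regs_per_block
    blocks_by_warps = max_warps_per_sm // warps_per_block
    active_warps = min(blocks_by_regs, blocks_by_warps) * warps_per_block
    return active_warps / max_warps_per_sm

light = occupancy(regs_per_thread=32, threads_per_block=256)   # few registers
heavy = occupancy(regs_per_thread=128, threads_per_block=256)  # many registers
print(light, heavy)
```

Under these assumed limits, the kernel using fewer registers per thread keeps more warps resident, so there are more warps to switch to while others wait on memory.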
one kernel
break down into small kernels:
- reduce local max
- reduce global max
- reduce local sum of exp
- reduce global sum of exp
- div
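The five steps above can be sketched on the host as follows. This is plain Python that only mimics the data flow; each chunk stands in for one thread block, and the name `softmax_five_kernels` and the `block` parameter are made up for illustration.

```python
import math

def softmax_five_kernels(xs, block=4):
    """Softmax as five separate passes; each chunk plays the role of a block."""
    chunks = [xs[i:i + block] for i in range(0, len(xs), block)]
    # 1. reduce local max: one partial max per chunk
    local_max = [max(c) for c in chunks]
    # 2. reduce global max: combine the partial maxima
    global_max = max(local_max)
    # 3. reduce local sum of exp: one partial sum per chunk
    local_sum = [sum(math.exp(v - global_max) for v in c) for c in chunks]
    # 4. reduce global sum of exp: combine the partial sums
    global_sum = sum(local_sum)
    # 5. div: normalize every element
    return [math.exp(v - global_max) / global_sum for v in xs]

probs = softmax_five_kernels([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
print(probs)
```

Each numbered comment corresponds to one small kernel, which is what makes it possible to tune launch parameters per step.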
tune parameters for each kernel
Approximation of The Power Function
IEEE-754 32-bit floating-point representation:
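A minimal sketch of the layout: 1 sign bit, 8 exponent bits (bias 127), 23 mantissa bits. The helper names `decompose` and `fast_exp2` are hypothetical. `fast_exp2` is one well-known bit trick for approximating 2^x by writing x directly into the exponent field; it is shown as an example of the general idea and may not be the exact method the notes refer to.

```python
import struct

def decompose(x):
    """Split a float32 into (sign, biased exponent, mantissa) bit fields."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    sign = bits >> 31
    exponent = (bits >> 23) & 0xFF   # biased by 127
    mantissa = bits & 0x7FFFFF       # 23 fraction bits
    return sign, exponent, mantissa

def fast_exp2(x):
    """Crude 2**x: place x + 127 in the exponent field, fraction spills
    into the mantissa. Exact for integer x, linear in between.
    Only meaningful for roughly -126 < x < 128."""
    bits = int((x + 127.0) * (1 << 23))
    return struct.unpack("<f", struct.pack("<I", bits))[0]

print(decompose(1.0))
print(fast_exp2(3.0), fast_exp2(0.5))
```

Since exp(x) = 2^(x * log2(e)), an approximation of 2^x like this can be turned into an approximate exp at the cost of one multiply.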
share data at the register level
load data as vectors: fewer instructions
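A host-side simulation of register-level sharing, assuming it refers to warp shuffle instructions such as CUDA's `__shfl_down_sync`. The list `lanes` stands in for one register per thread of a 32-thread warp; the function name `warp_reduce_max` is made up for this sketch.

```python
def warp_reduce_max(lanes):
    """Simulate a shuffle-down max reduction across one warp.

    At each step, lane i reads the value held by lane i + offset and
    keeps the larger one. Lanes whose source is out of range keep
    their own value. After log2(32) = 5 steps lane 0 holds the max.
    """
    width = len(lanes)          # assumed to be a power of two, e.g. 32
    lanes = list(lanes)
    offset = width // 2
    while offset > 0:
        lanes = [
            max(lanes[i], lanes[i + offset]) if i + offset < width else lanes[i]
            for i in range(width)
        ]
        offset //= 2
    return lanes[0]

values = [float((7 * i) % 32) for i in range(32)]
print(warp_reduce_max(values))
```

No shared memory is involved in this pattern: every exchange is a register-to-register read within the warp.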
exp(x - global_max) = exp(x - local_max) * exp(local_max - global_max)
- reduce local max and sum
- reduce global max and sum
- div
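The three fused steps can be sketched in plain Python using the identity above: each chunk computes its max and its sum of exp relative to that local max in one pass, and the global step rescales each partial sum by exp(local_max - global_max). The name `softmax_three_kernels` and the `block` parameter are illustrative only.

```python
import math

def softmax_three_kernels(xs, block=4):
    """Softmax with local max and sum fused; each chunk plays one block."""
    chunks = [xs[i:i + block] for i in range(0, len(xs), block)]
    # 1. reduce local max and sum: one (max, sum) pair per chunk
    partials = []
    for c in chunks:
        m = max(c)
        s = sum(math.exp(v - m) for v in c)
        partials.append((m, s))
    # 2. reduce global max and sum: rescale each local sum to the global max
    global_max = max(m for m, _ in partials)
    global_sum = sum(s * math.exp(m - global_max) for m, s in partials)
    # 3. div: normalize every element
    return [math.exp(v - global_max) / global_sum for v in xs]

probs = softmax_three_kernels([0.5, -1.0, 3.0, 2.0, 7.0, 0.0, 1.5])
print(probs)
```

Compared with the five-step version, the input is read once fewer before the final division, at the cost of a slightly heavier per-chunk step.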
small input: the 3-kernel fusion wins, since it saves kernel launch overhead and memory passes
big input: the 5-kernel version wins, since simple, high-occupancy kernels are better at hiding memory latency, leading to higher effective memory bandwidth and performance
KISS: keep it simple, stupid
