FastSoftmax on tinygrad

$$\Large\text{softmax}(x_i) = \frac{e^{x_i - \max(x)}}{\sum_{j=1}^{K} e^{x_j - \max(x)}}$$
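
A minimal pure-Python reference for the formula above, not the tinygrad kernel itself. The function name `softmax` and the sample input are illustrative assumptions; the point is that subtracting `max(x)` keeps every exponent at or below zero, so `exp` cannot overflow.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)                            # shift by the max: every exponent is <= 0
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)                      # >= 1, because the max element contributes exp(0)
    return [e / total for e in exps]

# math.exp(1000.0) alone would raise OverflowError; the shifted form does not
probs = softmax([1000.0, 1001.0, 1002.0])
```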

background

How DRAM works and why should you care | GPU Programming

operation:

memory-bound reduction
compute-bound element-wise operation

thread = basic unit

warp = thread group

occupancy = active warps / max warps

resource-light kernel = high occupancy

high occupancy can hide memory latency: when one warp waits for a data transfer, other warps can run
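
A rough sketch of why a resource-light kernel reaches higher occupancy. It models register pressure only. The per-SM limits used as defaults (65536 registers, 64 resident warps, 32 threads per warp) are assumptions for illustration, not values taken from this repo, and real hardware adds further limits such as shared memory and allocation granularity.

```python
def occupancy(regs_per_thread, threads_per_block,
              regs_per_sm=65536, max_warps_per_sm=64, warp_size=32):
    """occupancy = active warps / max warps, limited here by registers only."""
    warps_per_block = -(-threads_per_block // warp_size)      # ceiling division
    regs_per_block = regs_per_thread * warps_per_block * warp_size
    blocks_by_regs = regs_per_sm // regs_per_block            # blocks that fit in the register file
    blocks_by_warps = max_warps_per_sm // warps_per_block     # blocks that fit in the warp slots
    active_warps = min(blocks_by_regs, blocks_by_warps) * warps_per_block
    return active_warps / max_warps_per_sm

light = occupancy(regs_per_thread=32, threads_per_block=256)   # resource-light kernel
heavy = occupancy(regs_per_thread=128, threads_per_block=256)  # register-hungry kernel
```

Under these assumed limits the light kernel fills every warp slot, while the heavy one leaves most of them empty, so fewer warps are available to cover a stalled load.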

kernel

base

one kernel

fusion 5

break down into small kernels:

reduce local max
reduce global max
reduce local exp
reduce global exp
div
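
The five steps above can be sketched in plain Python, with each pass standing in for one kernel. This is an illustration, not the repo's tinygrad code: the name `softmax_five_pass` and the `block` parameter are hypothetical, and "reduce local exp" is read here as a per-block sum of `exp(x - global_max)`.

```python
import math

def softmax_five_pass(xs, block=4):
    """Softmax as five small passes; each pass stands in for one kernel."""
    blocks = [xs[i:i + block] for i in range(0, len(xs), block)]
    local_max = [max(b) for b in blocks]                        # 1. reduce local max
    global_max = max(local_max)                                 # 2. reduce global max
    local_sum = [sum(math.exp(x - global_max) for x in b)       # 3. reduce local exp
                 for b in blocks]
    global_sum = sum(local_sum)                                 # 4. reduce global exp
    return [math.exp(x - global_max) / global_sum for x in xs]  # 5. div

out5 = softmax_five_pass([0.5, -1.0, 3.0, 2.0, 7.5, 7.0, -4.0, 1.0, 0.0])
```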

fusion 5 tuned

tune parameters for each kernel

fusion 5 fast exp

Approximation of The Power Function

$$e^x = 2^{x \log_2 e}$$

$$x' = x \log_2 e \quad \implies \quad 2^{x'} = 2^{i+f} = 2^i \cdot 2^f, \qquad i = \lfloor x' \rfloor,\; f = x' - i \in [0, 1)$$

IEEE 754 32-bit floating-point representation:

$$\text{bits}(2^i) = (i + 127) \ll 23$$

$$ 2^f \approx 0.0570f^3 + 0.2486f^2 + 0.6928f + 0.9992 $$
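
A sketch of the approximation in pure Python, using `struct` to build the float32 bit pattern for $2^i$. The name `fast_exp` and the flush-to-zero branch for very negative inputs are assumptions added for illustration. The bit trick only holds for $-126 \le i \le 127$, the normal float32 exponent range, so larger inputs are not handled here.

```python
import math
import struct

LOG2_E = 1.4426950408889634   # log2(e)

def fast_exp(x):
    """Approximate e**x as 2**i * 2**f, with x' = x * log2(e) = i + f."""
    xp = x * LOG2_E
    i = math.floor(xp)         # integer part (a Python int, may be negative)
    if i < -126:               # assumption: flush results below the normal range to zero
        return 0.0
    f = xp - i                 # fractional part, in [0, 1) up to rounding
    # bits(2**i) = (i + 127) << 23: biased exponent with an all-zero mantissa
    two_i = struct.unpack("<f", struct.pack("<I", (i + 127) << 23))[0]
    # cubic approximation of 2**f, evaluated in Horner form
    two_f = ((0.0570 * f + 0.2486) * f + 0.6928) * f + 0.9992
    return two_i * two_f
```

With these coefficients the relative error is on the order of 0.1%. In softmax the numerator and denominator both go through the same approximation.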

fusion 5 register

share data at the register level

fusion 5 vector

load data as vectors: fewer instructions

fusion 3

exp(x - max) = exp(x - local_max) * exp(local_max - global_max)

reduce local max and sum
reduce global max and sum
div
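
The three passes above can be sketched in plain Python. Each block keeps a `(max, sum)` pair, and the identity `exp(x - max) = exp(x - local_max) * exp(local_max - global_max)` rescales every local sum to the global max. The name `softmax_three_pass` and the `block` parameter are hypothetical, and this is an illustration rather than the repo's kernel code.

```python
import math

def softmax_three_pass(xs, block=4):
    """Softmax with per-block (max, sum) pairs merged by the rescaling identity."""
    blocks = [xs[i:i + block] for i in range(0, len(xs), block)]
    # 1. reduce local max and sum: one pass over each block yields both
    local = []
    for b in blocks:
        m = max(b)
        local.append((m, sum(math.exp(x - m) for x in b)))
    # 2. reduce global max and sum: rescale each local sum by exp(local_max - global_max)
    global_max = max(m for m, _ in local)
    global_sum = sum(s * math.exp(m - global_max) for m, s in local)
    # 3. div
    return [math.exp(x - global_max) / global_sum for x in xs]

out3 = softmax_three_pass([0.5, -1.0, 3.0, 2.0, 7.5, 7.0, -4.0, 1.0, 0.0])
```

Compared with the five-kernel version, the input is read once less before the final division, because the local sums do not have to wait for the global max.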

small input: fusion 3 saves kernel launch overhead and memory passes

big input: fusion 5's simple, high-occupancy kernels are better at hiding memory latency, leading to higher effective memory bandwidth and performance

benchmark

tip

KISS: keep it simple, stupid
