GPU kernel for sorted patches with chunk_offsets #7440
Conversation
Force-pushed from 667ac15 to b12a04c
Initial results seem very positive compared to develop:
#include "patches.h"

/// Load a chunk offset value, dispatching on the runtime type.
__device__ inline uint32_t load_chunk_offset(const GPUPatches &patches, uint32_t idx) {
this exists because the chunk_offsets array can be any PType. We only index it once when we set up the cursor, so instead of trying to do an expensive cast of the whole array upfront, I just do runtime dispatch on the PType
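To illustrate the idea, here is a host-side C++ sketch of per-element runtime dispatch on the element type. The `PType` variants and `GPUPatches` fields are assumptions for illustration, not the actual Vortex definitions:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tag for the runtime element type of the offsets buffer.
enum class PType { U8, U16, U32, U64 };

// Hypothetical mirror of the device-side patches struct: the offsets
// buffer is type-erased, with its element type carried at runtime.
struct GPUPatches {
    const void* chunk_offsets;
    PType offsets_ptype;
};

// Load one chunk offset, dispatching on the runtime type. Since the
// cursor reads only a single element at setup, per-element dispatch is
// cheaper than casting the whole array to a common width upfront.
inline uint32_t load_chunk_offset(const GPUPatches& patches, uint32_t idx) {
    switch (patches.offsets_ptype) {
        case PType::U8:
            return static_cast<const uint8_t*>(patches.chunk_offsets)[idx];
        case PType::U16:
            return static_cast<const uint16_t*>(patches.chunk_offsets)[idx];
        case PType::U32:
            return static_cast<const uint32_t*>(patches.chunk_offsets)[idx];
        case PType::U64:
            return static_cast<uint32_t>(
                static_cast<const uint64_t*>(patches.chunk_offsets)[idx]);
    }
    return 0; // unreachable for a valid PType
}
```

The same shape works as a `__device__` function; the switch compiles to a small branch that runs once per cursor setup rather than once per element.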
@claude take a first pass at this PR. Focus on approach, CUDA kernels, and finding opportunities to simplify/remove
Claude encountered an error — View job. I'll analyze this and get back to you.
Merging this PR will not alter performance
Force-pushed from 40aa26e to fc5ac8f
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Force-pushed from ef280f2 to 82f8911
CI clippy was failing on deprecation warnings in three CUDA bitpacked tests using `ArrayRef::to_canonical`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
Force-pushed from 7e3281c to 1a1c672
I think Patched array is the abstraction we want. How offsets are laid out was always independent in my head. The question is whether it's better for each thread to do a forward pass vs. knowing its starting point.
Summary
This branch is a proof of concept for doing data parallel patching without needing to do G-ALP style transposed patches.
Old Method
New Method
- `chunk_offsets` from the `Patches`. They are like an ends buffer except the final end is implicit and not stored.
- Each thread takes its `n_chunk_patches / n_threads` share of the patches for that chunk, which it just applies in a straight-line loop (to the shared memory, not global memory)

TODO

- `offset_within_chunk`, so need to add some tests for that
- Benchmark `build_chunk_offsets` to understand overhead

Follow up
This is strictly better than what we have, so I think it's worth taking as-is. But it raises the question of whether we need the Patched array at all anymore.
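The `chunk_offsets` construction described in the summary could look roughly like the following host-side sketch; the signature and names are assumptions, not the PR's actual `build_chunk_offsets`:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Build chunk_offsets from sorted patch positions: entry c is the index
// of the first patch belonging to chunk c. This is like an "ends" buffer
// except the final end (positions.size()) is implicit and not stored.
std::vector<uint32_t> build_chunk_offsets(
    const std::vector<uint32_t>& positions,
    uint32_t chunk_len, uint32_t n_chunks) {
    std::vector<uint32_t> offsets(n_chunks, 0);
    uint32_t p = 0;
    for (uint32_t c = 0; c < n_chunks; c++) {
        offsets[c] = p;
        // Advance past all patches that fall inside chunk c.
        while (p < positions.size() && positions[p] < (c + 1) * chunk_len) {
            p++;
        }
    }
    return offsets;
}
```

A chunk's patch count then falls out as `offsets[c + 1] - offsets[c]` (using `positions.size()` as the implicit final end), which is what each thread divides into its `n_chunk_patches / n_threads` share.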