Measure the performance of computing the lane id #2

Currently the lane id is read ([here](https://github.com/gkarlos/cuIdx/blob/e5435b54cbfbe9ef1e24e20f0f8ed01a2eed0952/cuidx.cuh#L359)) by accessing the `%laneid` register. An alternative is to compute it as `i % WARPSIZE`.
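For reference, here is a minimal sketch of the two approaches side by side. The function names (`laneid_register`, `laneid_computed`, `count_laneid_mismatches`) are made up for illustration and are not part of cuIdx. It assumes 1-D thread blocks, so that `i` is simply `threadIdx.x`; for 2-D or 3-D blocks the thread index would have to be linearized first.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Variant 1: read the lane id from the %laneid special register (inline PTX).
__device__ __forceinline__ unsigned laneid_register() {
    unsigned lane;
    asm("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

// Variant 2: compute the lane id arithmetically.
// Assumption: 1-D blocks, so the in-block thread index is threadIdx.x.
__device__ __forceinline__ unsigned laneid_computed() {
    return threadIdx.x % warpSize;
}

// Counts the threads for which the two variants disagree.
__global__ void compare_lane_ids(int *mismatches) {
    if (laneid_register() != laneid_computed()) atomicAdd(mismatches, 1);
}

// Hypothetical host helper: launches the comparison and returns the count.
int count_laneid_mismatches(int blocks, int threads) {
    int *d_count = nullptr;
    int h_count = -1;
    cudaMalloc(&d_count, sizeof(int));
    cudaMemset(d_count, 0, sizeof(int));
    compare_lane_ids<<<blocks, threads>>>(d_count);
    cudaMemcpy(&h_count, d_count, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_count);
    return h_count;
}

int main() {
    printf("mismatches: %d\n", count_laneid_mismatches(4, 128));
    return 0;
}
```

A check like this is useful before benchmarking, since the two versions should only be compared if they produce the same values for the launch configurations under test.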

According to [this](https://stackoverflow.com/questions/44337309/whats-the-most-efficient-way-to-calculate-the-warp-id-lane-id-in-a-1-d-grid) and [this](https://devtalk.nvidia.com/default/topic/1011523/cuda-programming-and-performance/how-costly-is-the-s2r-instruction-reading-a-special-register-/post/5165296/#5165296), reading from the `%laneid` register is more costly than `i % WARPSIZE`. It would be good to have some benchmark results that show the difference between the two versions.

First create a `benchmarks/laneid/` directory to hold all your files. You can start with some simple kernels, for instance, each thread `i` reading its lane id and writing it to the `i`-th index of an array. You may subsequently move to more involved kernels. For each benchmark include appropriate plot(s) and information about your GPU device and CUDA version.
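As a possible starting point, the simple kernel described above could be sketched roughly as follows. Everything here is an assumption made for illustration: the kernel and helper names, the block size of 256, the use of CUDA events for timing, and the hard-coded `WARPSIZE` of 32. It also prints the device name and runtime version, which the benchmark report should include.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int WARPSIZE = 32;   // assumed warp size
constexpr int THREADS  = 256;  // assumed 1-D block size (multiple of WARPSIZE)

// Each thread i writes the lane id read from %laneid to out[i].
__global__ void laneid_from_register(unsigned *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        unsigned lane;
        asm("mov.u32 %0, %%laneid;" : "=r"(lane));
        out[i] = lane;
    }
}

// Each thread i writes the arithmetically computed lane id to out[i].
__global__ void laneid_from_modulo(unsigned *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = threadIdx.x % WARPSIZE;
}

using Kernel = void (*)(unsigned *, int);

// Returns the average time per launch in milliseconds over `reps` launches.
float time_kernel(Kernel k, unsigned *d_out, int n, int reps) {
    const int blocks = (n + THREADS - 1) / THREADS;
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    k<<<blocks, THREADS>>>(d_out, n);  // warm-up launch, not timed
    cudaEventRecord(start);
    for (int r = 0; r < reps; ++r) k<<<blocks, THREADS>>>(d_out, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms / reps;
}

// Checks that out[i] holds the expected lane id for every i.
bool verify(const unsigned *d_out, int n) {
    std::vector<unsigned> h(n);
    cudaMemcpy(h.data(), d_out, n * sizeof(unsigned), cudaMemcpyDeviceToHost);
    for (int i = 0; i < n; ++i)
        if (h[i] != static_cast<unsigned>((i % THREADS) % WARPSIZE)) return false;
    return true;
}

// Times both kernels, prints the results, returns true if both outputs are correct.
bool run_benchmark(int n, int reps) {
    unsigned *d_out = nullptr;
    cudaMalloc(&d_out, n * sizeof(unsigned));
    float t_reg = time_kernel(laneid_from_register, d_out, n, reps);
    bool ok = verify(d_out, n);
    float t_mod = time_kernel(laneid_from_modulo, d_out, n, reps);
    ok = verify(d_out, n) && ok;
    cudaFree(d_out);
    printf("%%laneid: %f ms/launch, modulo: %f ms/launch\n", t_reg, t_mod);
    return ok;
}

int main() {
    cudaDeviceProp prop;
    int version = 0;
    cudaGetDeviceProperties(&prop, 0);
    cudaRuntimeGetVersion(&version);
    printf("device: %s, CUDA runtime: %d\n", prop.name, version);
    return run_benchmark(1 << 24, 100) ? 0 : 1;
}
```

A kernel this small is likely dominated by the global memory write, so any difference between the two versions may be hard to see; that is one reason to move on to more involved kernels afterwards.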
