CUDA Stream Compaction

University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2 - Stream Compaction

Tong Hu
- LinkedIn
Time tested on: Windows 11, Ryzen 7 1700X @ 3.4GHz 16GB, GTX 1080 16GB (Personal Computer)
Nsight system analysis tested on: Windows 11, i5-11600K @ 3.91GHz 32GB, RTX 2060 6GB (Personal Desktop) (since GTX 1080 does not support GPU metric collection)

Features

CPU Scan & Stream Compaction
Naive GPU Scan algorithm
Work-Efficient GPU Scan (extra credit version) & Stream Compaction
Using Thrust's Implementation

Introduction

In this project, I implemented the scan function commonly used in stream compaction of int array. I implemented 3 different version (CPU version, naive parallel version, and efficient work version) of the Scan(exlusive Prefix Sum) algorithm, and use thrust's implemntation of exclusive scan, then compared the performance of each version of Scan.

Roughly optimize the block sizes

Figure1. Time(ms) of scan function vs. block size (array size = $2^{27}$)

Table 1. Time(ms) of scan function in different block size (array size = $2^{27}$)

Block size	1	8	16	32	64	128	256	512
naïve pow2	4378.43	588.428	313.288	164.764	146.405	144.845	144.111	144.72
naïve npow2	4373.24	571.921	291.82	164.399	146.764	144.936	144.916	144.649
efficient pow2	328.222	64.9941	57.1655	58.7879	62.4402	62.0726	62.0135	60.9551
efficient npow2	332.084	65.0913	56.9548	58.7963	62.0282	62.4519	61.8724	61.3648

From Figure 1 we can see that as the block size increases, the time of Scan function decreases rapidly, and after some point, the time does not change much although the number of block increased. From Table 1 we can see the naive parallel scan performs best when the block size equals to 256, while the work efficient parallel scan performs best when the block size is 16. Therefore I select 256 and 16 as block sizes for naive scan and work-efficient scan in the following performance analysis.

Performance analysis

Figure 2. Time(ms) of scan function in diffrent array size.

Table 2. Time(ms) of scan function in diffrent array size.

log of array size	4	8	16	20	22	24	25	26	27	28	29	30
cpu	0.0002	0.0008	0.153	2.4054	9.0351	37.6997	78.1179	153.777	304.135	592.904	1266.01	2379.02
naïve	0.02048	0.19456	0.318272	0.918432	3.64445	16.1397	33.5231	68.8091	144.143	302.976	639.805	4626.05
efficient	0.306176	0.326656	0.566368	1.32307	2.16864	7.9192	15.282	29.2795	57.1435	113.729	234.424	429.09
thrust	0.233504	0.131072	0.111616	0.929792	0.75264	1.57184	2.51597	3.45184	5.73133	10.7909	20.3489	112.254

From Figure 2 we can tell that when the array size increased, the time cost of Scan will increase.

When the array size is small (smaller than $2^{16}$), CPU Scan performs better than GPU Scans and thrust's Scan. This is probably because the overhead of invoking GPU kernels overwhelms the benefit of parallel scans. When the array size is relative large, the time cost less for GPU Scans.

Comparing the naive parallel scan and work-efficient parallel scan, we can see from the figure that the work-efficient parallel scan performs better when the array size is large (greater than $2^{20}$). It takes naive parallel scan $O(n\log n)$ floating point adds operations while takes work-efficient scan $O(n)$ adds. Although both algorithm seems to run in parallel in the ideal case, in reality the number of threads run in parallel is bounded by hardware, and therefore, work-efficient scan performs better since the number of threads it need to lauch is smaller.

Figure 3. Nsight system trace

From Figure 3, we can see that compared with self-implemented Scan, the thrust's scan has very low DRAM bandwidth usage, and the unallocated warps in active SMs are also very low. Following are factors that might affect the performance:

Thrust's Scan might use shared memory, and memory coalescing when accessing global memory. This improves the memory throughput.
Thrust's Scan might optimize the block size and launch parameters based on workload and GPU type while our self-implemented scans hard code the block size only roughly optimized by eye.

Bottleneck

Also, from Figure 3, we can know that the bottleneck of performance of Scan functions should be the memory bandwidth. The bandwidth of DRAM of both naive and work-efficient Scan are almost full.

The output of the test program when the array size is $2^{25}$.

****************
** SCAN TESTS **
****************
    [  26   0  20  47  36   3  37  33  24  20  25  43  43 ...   6   0 ]
==== cpu scan, power-of-two ====
   elapsed time: 79.4345ms    (std::chrono Measured)
    [   0  26  26  46  93 129 132 169 202 226 246 271 314 ... 821748880 821748886 ]
==== cpu scan, non-power-of-two ====
   elapsed time: 78.2092ms    (std::chrono Measured)
    [   0  26  26  46  93 129 132 169 202 226 246 271 314 ... 821748831 821748870 ]
    passed
==== naive scan, power-of-two ====
   elapsed time: 34.0556ms    (CUDA Measured)
    passed
==== naive scan, non-power-of-two ====
   elapsed time: 33.1287ms    (CUDA Measured)
    passed
==== work-efficient scan, power-of-two ====
   elapsed time: 14.6806ms    (CUDA Measured)
    passed
==== work-efficient scan, non-power-of-two ====
   elapsed time: 14.6292ms    (CUDA Measured)
    passed
==== thrust scan, power-of-two ====
   elapsed time: 2.33472ms    (CUDA Measured)
    passed
==== thrust scan, non-power-of-two ====
   elapsed time: 2.2352ms    (CUDA Measured)
    passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
    [   1   0   0   1   1   0   0   2   2   1   3   1   0 ...   3   0 ]
==== cpu compact without scan, power-of-two ====
   elapsed time: 82.3193ms    (std::chrono Measured)
    [   1   1   1   2   2   1   3   1   1   1   2   1   2 ...   2   3 ]
    passed
==== cpu compact without scan, non-power-of-two ====
   elapsed time: 82.2665ms    (std::chrono Measured)
    [   1   1   1   2   2   1   3   1   1   1   2   1   2 ...   3   2 ]
    passed
==== cpu compact with scan ====
   elapsed time: 329.982ms    (std::chrono Measured)
    [   1   1   1   2   2   1   3   1   1   1   2   1   2 ...   2   3 ]
    passed
==== work-efficient compact, power-of-two ====
   elapsed time: 21.669ms    (CUDA Measured)
    passed
==== work-efficient compact, non-power-of-two ====
   elapsed time: 21.466ms    (CUDA Measured)
    passed

Name		Name	Last commit message	Last commit date
Latest commit History 63 Commits
cmake		cmake
img		img
src		src
stream_compaction		stream_compaction
.cproject		.cproject
.gitignore		.gitignore
.project		.project
CMakeLists.txt		CMakeLists.txt
GNUmakefile		GNUmakefile
INSTRUCTION.md		INSTRUCTION.md
README.md		README.md
cis565_stream_compaction_test.launch		cis565_stream_compaction_test.launch

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

CUDA Stream Compaction

Features

Introduction

Roughly optimize the block sizes

Performance analysis

Bottleneck

The output of the test program when the array size is $2^{25}$.

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

TongHuoAo/Project2-Stream-Compaction

Folders and files

Latest commit

History

Repository files navigation

CUDA Stream Compaction

Features

Introduction

Roughly optimize the block sizes

Performance analysis

Bottleneck

The output of the test program when the array size is $2^{25}$.

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages