82 changes: 76 additions & 6 deletions README.md
@@ -3,12 +3,82 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* (TODO) [LinkedIn](), [personal website](), [twitter](), etc.
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Sydney Miller
* [LinkedIn](https://www.linkedin.com/in/sydney-miller-upenn/)
* Tested on: GTX 222 222MB (CETS Virtual Lab)

### (TODO: Your README)

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)
### README

### Performance Analysis

#### Optimize Block Sizes for Each Implementation
![Scan Run Time vs Block Size Chart](img/ScanRunTimeVSBlockSizeChart.png)
![Scan Run Time vs Block Size Graph](img/ScanRunTimeVSBlockSize.png)

![Stream Compaction Run Time vs Block Size Chart](img/StreamCompactionRuntimeVSBlockSizeChart.png)
![Stream Compaction Run Time vs Block Size Graph](img/StreamCompactionRuntimeVSBlockSize.png)

Most of the implementations perform best when the block size is neither too small nor too large. Stream compaction favored a smaller block size than scan did.
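
As a point of reference, the launch configuration that the block size feeds into looks like the sketch below; the kernel name and the list of block sizes swept are hypothetical, but the grid-size calculation matches the one used in this project (`(n + blockSize - 1) / blockSize`).

```
#include <cuda_runtime.h>
#include <initializer_list>

// placeholder kernel; any of the scan or compaction kernels timed above could stand here
__global__ void kernExample(int n, int* data) { /* ... */ }

void sweepBlockSizes(int n, int* dev_data) {
    // hypothetical list of candidate block sizes
    for (int blockSize : {32, 64, 128, 256, 512, 1024}) {
        // round up so every element gets a thread
        dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);
        kernExample<<<fullBlocksPerGrid, blockSize>>>(n, dev_data);
        cudaDeviceSynchronize(); // finish before timing the next configuration
    }
}
```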

#### Compare GPU Scan Implementations to Serial CPU for Varying Array Sizes
![Scan Run Time vs Array Size Chart](img/ScanRunTimeVSArraySizeChart.png)
![Scan Run Time vs Array Size Graph](img/ScanRunTimeVSArraySize.png)

All of the implementations behaved similarly in that run time increased as the array size increased. The thrust implementation was much faster than the others and had a much shallower slope, as seen in the chart above, which means thrust handles exponential growth in array size much better than the other implementations. My guess is that thrust manages memory much better than my GPU implementations, since the largest performance gap shows up when the most memory is in use. The CPU implementation was the next fastest. I think the main bottleneck for my GPU implementations is their use of global memory; in addition, some of the kernels use the modulo operator (%), which can be slower on a GPU than comparison operators.
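
For comparison, the thrust version being measured is essentially a single library call; a minimal sketch (not necessarily this repo's exact `thrust.cu`) that performs an exclusive scan with thrust looks like this:

```
#include <thrust/device_vector.h>
#include <thrust/scan.h>
#include <thrust/copy.h>

void thrustScan(int n, int* odata, const int* idata) {
    // copy the input to the device; thrust manages the allocation
    thrust::device_vector<int> dev_in(idata, idata + n);
    thrust::device_vector<int> dev_out(n);

    // exclusive prefix sum, entirely on the GPU
    thrust::exclusive_scan(dev_in.begin(), dev_in.end(), dev_out.begin());

    // copy the result back to the host
    thrust::copy(dev_out.begin(), dev_out.end(), odata);
}
```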

#### Output of Test Program
```
****************
** SCAN TESTS **
****************
[ 5 39 0 1 23 32 46 30 49 44 40 18 31 ... 10 0 ]
==== cpu scan, power-of-two ====
elapsed time: 0.0005ms (std::chrono Measured)
[ 0 5 44 44 45 68 100 146 176 225 269 309 327 ... 6474 6484 ]
==== cpu scan, non-power-of-two ====
elapsed time: 0.0005ms (std::chrono Measured)
[ 0 5 44 44 45 68 100 146 176 225 269 309 327 ... 6380 6411 ]
passed
==== naive scan, power-of-two ====
elapsed time: 0ms (CUDA Measured)
passed
==== naive scan, non-power-of-two ====
elapsed time: 0ms (CUDA Measured)
passed
==== work-efficient scan, power-of-two ====
elapsed time: 0.043008ms (CUDA Measured)
passed
==== work-efficient scan, non-power-of-two ====
elapsed time: 0.03584ms (CUDA Measured)
passed
==== thrust scan, power-of-two ====
elapsed time: 0.0856ms (CUDA Measured)
passed
==== thrust scan, non-power-of-two ====
elapsed time: 0.04608ms (CUDA Measured)
passed

*****************************
** STREAM COMPACTION TESTS **
*****************************
[ 1 3 0 3 1 2 0 2 3 0 0 2 3 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 0.0009ms (std::chrono Measured)
[ 1 3 3 1 2 2 3 2 3 3 2 2 2 ... 3 2 ]
passed
==== cpu compact without scan, non-power-of-two ====
elapsed time: 0.0011ms (std::chrono Measured)
[ 1 3 3 1 2 2 3 2 3 3 2 2 2 ... 3 1 ]
passed
==== cpu compact with scan ====
elapsed time: 0.0006ms (std::chrono Measured)
[ 1 3 3 1 2 2 3 2 3 3 2 2 2 ... 3 2 ]
passed
==== work-efficient compact, power-of-two ====
elapsed time: 0.033792ms (CUDA Measured)
passed
==== work-efficient compact, non-power-of-two ====
elapsed time: 0.044032ms (CUDA Measured)
passed
```
Binary file added img/ScanRunTimeVSArraySize.png
Binary file added img/ScanRunTimeVSArraySizeChart.png
Binary file added img/ScanRunTimeVSBlockSize.png
Binary file added img/ScanRunTimeVSBlockSizeChart.png
Binary file added img/StreamCompactionRuntimeVSBlockSize.png
Binary file added img/StreamCompactionRuntimeVSBlockSizeChart.png
24 changes: 18 additions & 6 deletions stream_compaction/common.cu
@@ -1,6 +1,6 @@
#include "common.h"

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
void checkCUDAErrorFn(const char* msg, const char* file, int line) {
cudaError_t err = cudaGetLastError();
if (cudaSuccess == err) {
return;
@@ -22,17 +22,29 @@ namespace StreamCompaction {
* Maps an array to an array of 0s and 1s for stream compaction. Elements
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
__global__ void kernMapToBoolean(int n, int* bools, const int* idata) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}

bools[index] = idata[index] == 0 ? 0 : 1;
}

/**
* Performs scatter on an array. That is, for each element in idata,
* if bools[idx] == 1, it copies idata[idx] to odata[indices[idx]].
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
__global__ void kernScatter(int n, int* odata,
const int* idata, const int* bools, const int* indices) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n) {
return;
}

if (bools[index] == 1) {
odata[indices[index]] = idata[index];
}
}

}
39 changes: 32 additions & 7 deletions stream_compaction/cpu.cu
@@ -19,7 +19,10 @@ namespace StreamCompaction {
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
odata[0] = 0;
for (int i = 1; i < n; i++) {
odata[i] = idata[i - 1] + odata[i - 1];
}
timer().endCpuTimer();
}

@@ -30,9 +33,15 @@ namespace StreamCompaction {
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
int index = 0;
for (int i = 0; i < n; i++) {
if (idata[i] != 0) {
odata[index] = idata[i];
index++;
}
}
timer().endCpuTimer();
return -1;
return index;
}

/**
@@ -41,10 +50,26 @@ namespace StreamCompaction {
* @returns the number of elements remaining after compaction.
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
timer().endCpuTimer();
return -1;
//timer().startCpuTimer(); // not started here; scan() below starts its own CPU timer
// create a new array mapping the input array to zeros and ones
int* zerosAndOnes = new int[n];
for (int i = 0; i < n; i++) {
zerosAndOnes[i] = (idata[i] == 0) ? 0 : 1;
}

// scan new array
int* scannedArray = new int[n];
scan(n, scannedArray, zerosAndOnes);

// scatter
for (int i = 0; i < n; i++) {
if (zerosAndOnes[i] == 1) {
odata[scannedArray[i]] = idata[i];
}
}

//timer().endCpuTimer();
// exclusive scan: the count is the last scanned value plus the last boolean flag
int count = scannedArray[n - 1] + zerosAndOnes[n - 1];
delete[] zerosAndOnes;
delete[] scannedArray;
return count;
}
}
}
178 changes: 171 additions & 7 deletions stream_compaction/efficient.cu
@@ -1,8 +1,13 @@
#include <iostream>
#include <memory>
#include <cuda.h>
#include <cuda_runtime.h>
#include "common.h"
#include "efficient.h"

#define checkCUDAErrorWithLine(msg) checkCUDAError(msg, __LINE__)


namespace StreamCompaction {
namespace Efficient {
using StreamCompaction::Common::PerformanceTimer;
@@ -12,13 +17,131 @@ namespace StreamCompaction {
return timer;
}

// up sweep
__global__ void upSweep(int n, int d, int* data, int dist, int distHalf) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);

if (index >= n || index % dist != 0) {
return;
}

int toUpdate = index + dist - 1;
int toGet = index + distHalf - 1;

data[toUpdate] += data[toGet];
}

// up sweep efficient
__global__ void upSweepEfficient(int n, int d, int* data, int stride, int offset) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n || index >= n / stride) {
return;
}

int toUpdate = ((index + 1) * stride) - 1;
int toGet = toUpdate - offset;

data[toUpdate] += data[toGet];
}

// down sweep
__global__ void downSweep(int n, int d, int* data, int dist, int distHalf) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n || index % dist != 0) {
return;
}

int t_index = index + distHalf - 1;
int replace_index = index + dist - 1;

int t = data[t_index];
data[t_index] = data[replace_index];
data[replace_index] += t;
}

// down sweep efficient
__global__ void downSweepEfficient(int n, int d, int* data, int stride, int offset) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index >= n || index >= n / stride) {
return;
}

int replace_index = n - 1 - (index * stride);
int t_index = replace_index - offset;


int t = data[t_index];
data[t_index] = data[replace_index];
data[replace_index] += t;
}

// zero the elements from index n - 1 up to power_of_2 - 1 (this includes the root the down-sweep starts from)
__global__ void setZeros(int n, int power_of_2, int* data) {
int index = threadIdx.x + (blockIdx.x * blockDim.x);
if (index < power_of_2 && index >= n - 1) {
data[index] = 0;
}
}
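
// Worked example (for illustration only): exclusive scan of [1, 2, 3, 4]
//   up-sweep d = 0: [1, 3, 3, 7]    pairwise partial sums
//   up-sweep d = 1: [1, 3, 3, 10]   the last element now holds the total
//   zero the root:  [1, 3, 3, 0]
//   down-sweep d=1: [1, 0, 3, 3]
//   down-sweep d=0: [0, 1, 3, 6]    exclusive prefix sums of the input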

/**
* Performs prefix-sum (aka scan) on idata, storing the result into odata.
*/
void scan(int n, int *odata, const int *idata) {
void scan(int n, int* odata, const int* idata) {
int power_of_2 = 1;
while (power_of_2 < n) {
power_of_2 *= 2;
}

// create array of size power of 2
int* data;

cudaMalloc((void**)&data, power_of_2 * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc data failed!");

// fill array and pad end with 0's
std::unique_ptr<int[]>padded_array{ new int[power_of_2] };
cudaMemcpy(padded_array.get(), idata, sizeof(int) * n, cudaMemcpyHostToHost);
for (int i = n; i < power_of_2; i++) {
padded_array[i] = 0;
}

cudaMemcpy(data, padded_array.get(), sizeof(int) * power_of_2, cudaMemcpyHostToDevice);

// kernel values
int blockSize = 128;
dim3 fullBlocksPerGrid((power_of_2 + blockSize - 1) / blockSize);

timer().startGpuTimer();
// TODO
// up-sweep
for (int d = 0; d <= ilog2(power_of_2) - 1; d++) {
int dist = pow(2, d + 1);
int distHalf = pow(2, d);
upSweep << <fullBlocksPerGrid, blockSize >> > (power_of_2, d, data, dist, distHalf);
/*int stride = pow(2, d+1);
int offset = pow(2, d);
upSweepEfficient << <fullBlocksPerGrid, blockSize >> > (power_of_2, d, data, stride, offset);*/
}


// zero the root (and the padded tail) before the down-sweep
setZeros << <fullBlocksPerGrid, blockSize >> > (n, power_of_2, data);

// down-sweep
for (int d = ilog2(power_of_2) - 1; d >= 0; d--) {
int dist = pow(2, d + 1);
int distHalf = pow(2, d);
downSweep << <fullBlocksPerGrid, blockSize >> > (power_of_2, d, data, dist, distHalf);
/*int stride = pow(2, d + 1);
int offset = pow(2, d);
downSweepEfficient << <fullBlocksPerGrid, blockSize >> > (power_of_2, d, data, stride, offset);*/
}
timer().endGpuTimer();

// set the out data to the scanned data
cudaMemcpy(odata, data, sizeof(int) * n, cudaMemcpyDeviceToHost);

// free memory
cudaFree(data);
}

/**
@@ -30,11 +153,52 @@ namespace StreamCompaction {
* @param idata The array of elements to compact.
* @returns The number of elements remaining after compaction.
*/
int compact(int n, int *odata, const int *idata) {
timer().startGpuTimer();
// TODO
timer().endGpuTimer();
return -1;
int compact(int n, int* odata, const int* idata) {
// allocate the necessary buffers on the GPU
int* gpu_idata;
int* bools;
int* scanned_data;
int* scattered_data;

cudaMalloc((void**)&gpu_idata, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc gpu_idata failed!");
cudaMemcpy(gpu_idata, idata, sizeof(int) * n, cudaMemcpyHostToDevice);

cudaMalloc((void**)&bools, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc bools failed!");

cudaMalloc((void**)&scanned_data, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc scanned_data failed!");

cudaMalloc((void**)&scattered_data, n * sizeof(int));
checkCUDAErrorWithLine("cudaMalloc scattered_data failed!");

int blockSize = 128;
dim3 fullBlocksPerGrid((n + blockSize - 1) / blockSize);

//timer().startGpuTimer();
// change to zeros and ones
Common::kernMapToBoolean << <fullBlocksPerGrid, blockSize >> > (n, bools, gpu_idata);

// exclusive scan data
scan(n, scanned_data, bools);

// scatter
Common::kernScatter << <fullBlocksPerGrid, blockSize >> > (n, scattered_data, gpu_idata, bools, scanned_data);
cudaMemcpy(odata, scattered_data, sizeof(int) * n, cudaMemcpyDeviceToHost);
//timer().endGpuTimer();

// exclusive scan: the compacted count is the last scanned value plus the last boolean flag
int lastScan = 0;
int lastBool = 0;
cudaMemcpy(&lastScan, scanned_data + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
cudaMemcpy(&lastBool, bools + n - 1, sizeof(int), cudaMemcpyDeviceToHost);
int num = lastScan + lastBool;

// free device memory
cudaFree(gpu_idata);
cudaFree(bools);
cudaFree(scanned_data);
cudaFree(scattered_data);

return num;
}
}
}