26 changes: 23 additions & 3 deletions README.md
@@ -3,11 +3,31 @@ CUDA Stream Compaction

**University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2**

* (TODO) YOUR NAME HERE
* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
* Fengkai Wu
* Tested on: Windows 10, i7-6700 @ 3.40GHz 16GB, Quadro K620 4095MB (Twn M70 Lab)

### (TODO: Your README)
### Analysis

Include analysis, etc. (Remember, this is public, so don't put
anything here that you don't want to share with the world.)

The running times of exclusive scan under the different algorithms are as follows:
![img_1](https://github.com/wufk/Project2-Stream-Compaction/blob/master/img/Scan.png)

![img_1](https://github.com/wufk/Project2-Stream-Compaction/blob/master/img/proj2table.PNG)

The running time of stream compaction is as follows:
![img_2](https://github.com/wufk/Project2-Stream-Compaction/blob/master/img/cmpact.png)

As the graph shows, the naive scan takes an extremely long time while the work-efficient version is much faster. However, GPU performance still falls short of the CPU. In my implementation of the efficient scan, the number of actually working threads is recomputed for each upSweep/downSweep level, and the number of launched blocks is derived from that thread count. Bit shifts and modulus operations are also avoided. Other factors that hurt performance may include the many kernel launches during the sweeps, heavy use of global memory, and the large number of threads required when the array is big.
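The per-level launch-size computation described above can be sketched as follows (an illustrative sketch, not the exact code in this repository; names such as `kernUpSweepLevel` are made up for the example):

```cuda
// Up-sweep sketch: at a level with the given stride, only n / (2 * stride)
// threads do work, so we launch exactly that many and map thread i to its
// array offset by multiplication, avoiding per-thread modulus tests.
__global__ void kernUpSweepLevel(int n, int stride, int *data) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    int idx = (i + 1) * stride * 2 - 1;  // last element of this thread's pair
    if (idx < n) {
        data[idx] += data[idx - stride];
    }
}

// Host side: recompute the launch size per level instead of launching n threads.
void upSweep(int n, int *dev_data, int blockSize) {
    for (int stride = 1; stride < n; stride *= 2) {
        int activeThreads = n / (stride * 2);
        int blocks = (activeThreads + blockSize - 1) / blockSize;
        kernUpSweepLevel<<<blocks, blockSize>>>(n, stride, dev_data);
    }
}
```

Note that each level is still a separate kernel launch, which is one of the overheads mentioned above.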

Possible ways to further improve performance include using shared memory and dividing the array into blocks that are scanned independently.
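A block-wise shared-memory scan along those lines might look like the following sketch (it assumes one tile per block, a power-of-two block size, and a later pass that scans `tileSums` and adds the per-block offsets back; all names are illustrative):

```cuda
// Each block scans one tile of the input in shared memory (Hillis-Steele
// within the tile). Per-tile totals go to tileSums, which would be scanned
// and added back to each tile in a second pass.
__global__ void kernScanTile(int n, int *odata, const int *idata, int *tileSums) {
    extern __shared__ int tile[];
    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    tile[tid] = (gid < n) ? idata[gid] : 0;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        // Read before the barrier, write after it, so no thread overwrites
        // a value another thread still needs.
        int t = (tid >= offset) ? tile[tid - offset] : 0;
        __syncthreads();
        tile[tid] += t;
        __syncthreads();
    }

    if (gid < n) {
        odata[gid] = tile[tid] - idata[gid];  // inclusive -> exclusive
    }
    if (tid == blockDim.x - 1) {
        tileSums[blockIdx.x] = tile[tid];
    }
}
```

Keeping each level's traffic in shared memory avoids most of the global-memory round trips of the current implementation.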

Another point worth noting is that thrust runs much faster when the array size is not a power of two.
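For comparison, the thrust path in the test harness boils down to a single library call, roughly like this sketch:

```cuda
#include <thrust/device_vector.h>
#include <thrust/scan.h>

// Minimal thrust exclusive scan: copy in, scan on the device, copy out.
void thrustScan(int n, int *odata, const int *idata) {
    thrust::device_vector<int> dv_in(idata, idata + n);
    thrust::device_vector<int> dv_out(n);
    thrust::exclusive_scan(dv_in.begin(), dv_in.end(), dv_out.begin());
    thrust::copy(dv_out.begin(), dv_out.end(), odata);
}
```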

The timeline of execution when the array size is 2^20 is as follows:
![img_2](https://github.com/wufk/Project2-Stream-Compaction/blob/master/img/proj2Perform.PNG)

It shows that the CUDA memory-manipulation calls are very expensive. Furthermore, we can see that thrust and our own scan implementations call different CUDA runtime APIs: thrust calls cudaDeviceSynchronize while our code calls cudaEventSynchronize. This may partly explain why thrust runs much faster: it is optimized for the device and hardware, while our effort focuses only on the algorithm and the high-level part.
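The event-based timing pattern visible in the timeline is the standard CUDA idiom sketched below (the project wraps it in its PerformanceTimer class; this is an illustrative fragment, not the repository's exact code):

```cuda
// Time a sequence of kernel launches with CUDA events.
float timeKernels() {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // ... launch scan kernels here ...
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // block the host until `stop` has completed

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```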

In summary, to get better performance in GPU computing, architecture makes a huge difference: optimization must focus on allocating resources well and exploiting the special characteristics of the GPU hardware.
Binary file added img/Scan.png
Binary file added img/cmpact.png
Binary file added img/proj2Perform.PNG
Binary file added img/proj2table.PNG
64 changes: 50 additions & 14 deletions src/main.cpp
@@ -13,21 +13,36 @@
#include <stream_compaction/thrust.h>
#include "testing_helpers.hpp"

const int SIZE = 1 << 8; // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
int a[SIZE], b[SIZE], c[SIZE];


int main(int argc, char* argv[]) {
// Scan tests
if (argc != 2) {
printf("Usage: test.exe <log2(arraySize)>, please input a value from 1 to 25\n");
return 1;
}
const int SIZE = 1 << atoi(argv[1]); // feel free to change the size of array
const int NPOT = SIZE - 3; // Non-Power-Of-Two
//int a[SIZE], b[SIZE], c[SIZE];
int* b = new int[SIZE];
int* a = new int[SIZE];
int* c = new int[SIZE];

int *input_inclusive = new int[SIZE];
int *b_inclusive = new int[SIZE];
int *c_inclusive = new int[SIZE];

printf("\n");
printf("****************\n");
printf("** SCAN TESTS **\n");
printf("****************\n");
printf("********************\n");
printf("** SCAN TESTS, %d **\n", atoi(argv[1]));
printf("********************\n");

genArray(SIZE - 1, a, 50); // Leave a 0 at the end to test that edge case
genArray(SIZE - 1, a, 10); // Leave a 0 at the end to test that edge case
genArray(SIZE - 1, input_inclusive, 10);
a[SIZE - 1] = 0;
input_inclusive[SIZE - 1] = 0;
printArray(SIZE, a, true);
printArray(SIZE, input_inclusive, true);

// initialize b using StreamCompaction::CPU::scan you implement
// We use b for further comparison. Make sure your StreamCompaction::CPU::scan is correct.
@@ -38,39 +53,56 @@ int main(int argc, char* argv[]) {
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(SIZE, b, true);

zeroArray(SIZE, b_inclusive);
printDesc("cpu inclusive scan, power of two");
StreamCompaction::CPU::inScan(SIZE, b_inclusive, input_inclusive);
printArray(SIZE, b_inclusive, true);

zeroArray(SIZE, c);
printDesc("cpu scan, non-power-of-two");
StreamCompaction::CPU::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::CPU::timer().getCpuElapsedTimeForPreviousOperation(), "(std::chrono Measured)");
printArray(NPOT, b, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("cpu inclusive scan, non power of two");
StreamCompaction::CPU::inScan(NPOT, c, input_inclusive);
printArray(NPOT, c, true);
printCmpResult(NPOT, b_inclusive, c);

zeroArray(SIZE, c);
printDesc("SM inclusive scan, non power of two");
StreamCompaction::Efficient::scanSM(NPOT, c, input_inclusive);
printArray(NPOT, c, true);
printCmpResult(NPOT, b_inclusive, c);

zeroArray(SIZE, c);
printDesc("naive scan, power-of-two");
StreamCompaction::Naive::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("naive scan, non-power-of-two");
StreamCompaction::Naive::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Naive::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, power-of-two");
StreamCompaction::Efficient::scan(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(SIZE, c, true);
printArray(SIZE, c, true);
printCmpResult(SIZE, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient scan, non-power-of-two");
StreamCompaction::Efficient::scan(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(NPOT, c, true);
printArray(NPOT, c, true);
printCmpResult(NPOT, b, c);

zeroArray(SIZE, c);
@@ -129,15 +161,19 @@ int main(int argc, char* argv[]) {
printDesc("work-efficient compact, power-of-two");
count = StreamCompaction::Efficient::compact(SIZE, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedCount, b, c);

zeroArray(SIZE, c);
printDesc("work-efficient compact, non-power-of-two");
count = StreamCompaction::Efficient::compact(NPOT, c, a);
printElapsedTime(StreamCompaction::Efficient::timer().getGpuElapsedTimeForPreviousOperation(), "(CUDA Measured)");
//printArray(count, c, true);
printArray(count, c, true);
printCmpLenResult(count, expectedNPOT, b, c);

system("pause"); // stop Win32 console from closing on exit
delete[] a;
delete[] b;
delete[] c;
delete[] input_inclusive;
delete[] b_inclusive;
delete[] c_inclusive;
system("pause"); // stop Win32 console from closing on exit
}
2 changes: 1 addition & 1 deletion stream_compaction/CMakeLists.txt
@@ -13,5 +13,5 @@ set(SOURCE_FILES

cuda_add_library(stream_compaction
${SOURCE_FILES}
OPTIONS -arch=sm_20
OPTIONS -arch=sm_30
)
16 changes: 14 additions & 2 deletions stream_compaction/common.cu
@@ -1,4 +1,5 @@
#include "common.h"
#include "device_launch_parameters.h"

void checkCUDAErrorFn(const char *msg, const char *file, int line) {
cudaError_t err = cudaGetLastError();
@@ -23,7 +24,11 @@ namespace StreamCompaction {
* which map to 0 will be removed, and elements which map to 1 will be kept.
*/
__global__ void kernMapToBoolean(int n, int *bools, const int *idata) {
// TODO
int idx = threadIdx.x + blockIdx.x * blockDim.x;

if (idx >= n) return;

bools[idx] = (int)(idata[idx] != 0);
}

/**
@@ -32,7 +37,14 @@
*/
__global__ void kernScatter(int n, int *odata,
const int *idata, const int *bools, const int *indices) {
// TODO
int idx = threadIdx.x + blockDim.x * blockIdx.x;

if (idx >= n) return;

if (bools[idx])
{
odata[indices[idx]] = idata[idx];
}
}

}
73 changes: 62 additions & 11 deletions stream_compaction/cpu.cu
@@ -1,38 +1,75 @@
#include <cstdio>
#include "cpu.h"

#include "common.h"
#include "common.h"

namespace StreamCompaction {
namespace CPU {
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
using StreamCompaction::Common::PerformanceTimer;
PerformanceTimer& timer()
{
static PerformanceTimer timer;
return timer;
}

void scanImpl(int n, int *odata, const int *idata) {
// Exclusive scan. `pre` buffers the previous input element so the
// scan also works in place (odata == idata), which compactWithScan relies on.
int pre;
for (int i = 0; i < n; ++i)
{
if (i == 0) {
pre = idata[i];
odata[i] = 0;
}
else {
int temp = idata[i];
odata[i] = odata[i - 1] + pre;
pre = temp;
}
}
}

/**
* CPU scan (prefix sum).
* For performance analysis, this is supposed to be a simple for loop.
* (Optional) For better understanding before starting moving to GPU, you can simulate your GPU scan in this function first.
*/
void scan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
scanImpl(n, odata, idata);
timer().endCpuTimer();
}

void inScan(int n, int * odata, const int * idata)
{
if (n <= 0) return;
odata[0] = idata[0];
for (int i = 1; i < n; i++) {
odata[i] = odata[i - 1] + idata[i];
}
}

/**
* CPU stream compaction without using the scan function.
*
* @returns the number of elements remaining after compaction.
*/
int compactWithoutScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
// A single counter doubles as the output write index.
int count = 0;
for (int i = 0; i < n; ++i)
{
if (idata[i] != 0)
{
odata[count++] = idata[i];
}
}
timer().endCpuTimer();
return -1;
return count;
}

/**
@@ -42,9 +79,23 @@
*/
int compactWithScan(int n, int *odata, const int *idata) {
timer().startCpuTimer();
// TODO
// Map each element to 0/1 in place, then scan.
for (int i = 0; i < n; ++i)
{
odata[i] = (idata[i] != 0);
}

scanImpl(n, odata, odata);

// Scatter in place: the exclusive-scan index odata[i] is always <= i,
// so the write never clobbers a scan value that has not yet been read.
int count = 0;
for (int i = 0; i < n; ++i)
{
if (idata[i] != 0) {
odata[odata[i]] = idata[i];
count++;
}
}
timer().endCpuTimer();
return -1;
return count;
}
}
}
2 changes: 2 additions & 0 deletions stream_compaction/cpu.h
@@ -8,6 +8,8 @@ namespace StreamCompaction {

void scan(int n, int *odata, const int *idata);

void inScan(int n, int *odata, const int *idata);

int compactWithoutScan(int n, int *odata, const int *idata);

int compactWithScan(int n, int *odata, const int *idata);