diff --git a/README.md b/README.md
index cad1abd..9affe29 100644
--- a/README.md
+++ b/README.md
@@ -1,20 +1,41 @@
 CUDA Rasterizer
 ===============
 
-[CLICK ME FOR INSTRUCTION OF THIS PROJECT](./INSTRUCTION.md)
-
 **University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 4**
 
-* (TODO) YOUR NAME HERE
-* Tested on: (TODO) Windows 22, i7-2222 @ 2.22GHz 22GB, GTX 222 222MB (Moore 2222 Lab)
+* Name: William Ho
+* Email: willho@seas.upenn.edu
+* Tested on: Windows 7 Professional, i7-6700 @ 3.40 GHz 16.0GB, NVIDIA QuadroK620 (Moore 100C Lab)
+
+## Overview
+
+This is an implementation of a GPU rasterization pipeline, designed to simulate the OpenGL graphics pipeline. It implements vertex shading,rasterization, fragment shading, and a framebuffer.
+
+## Basic Rasterizer
+
+|| box.gltf | flower.gltf | cow.gltf |
+|:----:|:-----:|:-----:|:-----:|
+|Fragment Normals|![](renders/CubeNormals.PNG)|![](renders/FlowerNormals.PNG)|![](renders/CowNormals2.PNG)|
+|Lambertian Shading|![](renders/CubeLambert.PNG)|![](renders/FlowerLambert.PNG)|![](renders/CowLambert2.PNG)|
+
+
+
+These images are the first images I was able to generate from a basic rasterizer. On an implementation level, I am parallelizing over the primitives in my scene, and iterating through fragment coordinates. These renderings made use of axis-aligned bounding boxes for each primitive to cull the fragments that had to be checked. This optimization, is still quite timely. 
+
+The biggest hurdle in this basic rasterizer is the problem of race conditions that arise when trying to write to the same spaces in the depth buffer. Since a mutex must be used to lock access to those parts of the buffer, the rasterization time performance is considerably impacted. 
+
+## Additional Feature
+
+### Tile Based Approach
+
+![](renders/CubeTiled.PNG)
 
-### (TODO: Your README)
+With very limited success, I attempted a tile based approach to rasterizing. This involved a preprocess step of parellelizing over triangles in order to bucket them into their overlapping tiles, and then parallelizing over the tiles. The most significant advantage to this was the ability to use shared memory within tiles to avoid the common global memory accesses to the depth buffer that previously choked my naive rasterizer. With this approach, I was able to speed up the rasterization step considerably for `box.gltf`. With my naive rasterizer on the machine I was working on, I did not succeed in rendering `box.gltf` at higher than `100x100` pixel resolution because the rasterization kernel would exceed the NVIDIA timeout limit. Tiling allowed me to render up to `800x800` pixel resolution. In many ways, this seems promising.
 
-*DO NOT* leave the README to the last minute! It is a crucial part of the
-project, and we will not be able to grade you without a good README.
+However, a truly robust implementation of this approach would have to overcome several hurdles. First off, as evidenced in the above image, there are many visual artifacts that need to be addressed. More important, preprocessing triangles into buckets requires race condition handling, which, for any mesh with significant numbers of triangles, is far worse than my naive rasterizer implementation. As a result, I have not yet rendered successfully any higher polygon meshes, due to the timeout limit as stated above. 
 
 
-### Credits
+## Credits
 
 * [tinygltfloader](https://github.com/syoyo/tinygltfloader) by [@soyoyo](https://github.com/syoyo)
 * [glTF Sample Models](https://github.com/KhronosGroup/glTF/blob/master/sampleModels/README.md)
diff --git a/renders/CowLambert.PNG b/renders/CowLambert.PNG
new file mode 100755
index 0000000..b71de74
Binary files /dev/null and b/renders/CowLambert.PNG differ
diff --git a/renders/CowLambert2.PNG b/renders/CowLambert2.PNG
new file mode 100755
index 0000000..26b1983
Binary files /dev/null and b/renders/CowLambert2.PNG differ
diff --git a/renders/CowNormals.PNG b/renders/CowNormals.PNG
new file mode 100755
index 0000000..51033e1
Binary files /dev/null and b/renders/CowNormals.PNG differ
diff --git a/renders/CowNormals2.PNG b/renders/CowNormals2.PNG
new file mode 100755
index 0000000..312bc81
Binary files /dev/null and b/renders/CowNormals2.PNG differ
diff --git a/renders/CubeLambert.PNG b/renders/CubeLambert.PNG
new file mode 100755
index 0000000..5e268d7
Binary files /dev/null and b/renders/CubeLambert.PNG differ
diff --git a/renders/CubeNormals.PNG b/renders/CubeNormals.PNG
new file mode 100755
index 0000000..6039eb8
Binary files /dev/null and b/renders/CubeNormals.PNG differ
diff --git a/renders/CubeTiled.PNG b/renders/CubeTiled.PNG
new file mode 100755
index 0000000..a4b5bec
Binary files /dev/null and b/renders/CubeTiled.PNG differ
diff --git a/renders/FlowerLambert.PNG b/renders/FlowerLambert.PNG
new file mode 100755
index 0000000..6b5efbd
Binary files /dev/null and b/renders/FlowerLambert.PNG differ
diff --git a/renders/FlowerNormals.PNG b/renders/FlowerNormals.PNG
new file mode 100755
index 0000000..60da407
Binary files /dev/null and b/renders/FlowerNormals.PNG differ
diff --git a/renders/TileBasedCubeV001.PNG b/renders/TileBasedCubeV001.PNG
new file mode 100755
index 0000000..cbd55ba
Binary files /dev/null and b/renders/TileBasedCubeV001.PNG differ
diff --git a/src/main.cpp b/src/main.cpp
index 7986959..eff8d62 100644
--- a/src/main.cpp
+++ b/src/main.cpp
@@ -138,8 +138,8 @@ bool init(const tinygltf::Scene & scene) {
         return false;
     }
 
-    width = 800;
-    height = 800;
+    width = 400;
+    height = 400;
     window = glfwCreateWindow(width, height, "CIS 565 Pathtracer", NULL, NULL);
     if (!window) {
         glfwTerminate();
diff --git a/src/rasterize.cu b/src/rasterize.cu
index 1262a09..1c6455b 100644
--- a/src/rasterize.cu
+++ b/src/rasterize.cu
@@ -18,6 +18,13 @@
 #include <glm/gtc/quaternion.hpp>
 #include <glm/gtc/matrix_transform.hpp>
 
+#define TILERENDER 0
+#define TILERENDERWITHPREPROCESS 0
+//These need to be defined at compile time, but they need to be mathematically sound. TILEX * TILEY = TILESIZE
+#define TILEX 16
+#define TILEY 16
+#define TILESIZE 256
+
 namespace {
 
 	typedef unsigned short VertexIndex;
@@ -96,6 +103,13 @@ namespace {
 		// TODO: add more attributes when needed
 	};
 
+#if TILERENDERWITHPREPROCESS
+	struct Tile {
+		int numTriangles = 0;
+		int triangleIndices[1000];
+	};
+#endif
+
 }
 
 static std::map<std::string, std::vector<PrimitiveDevBufPointers>> mesh2PrimitivesMap;
@@ -103,13 +117,20 @@ static std::map<std::string, std::vector<PrimitiveDevBufPointers>> mesh2Primitiv
 
 static int width = 0;
 static int height = 0;
+static float depthRange = 5000000.0f;
 
 static int totalNumPrimitives = 0;
 static Primitive *dev_primitives = NULL;
 static Fragment *dev_fragmentBuffer = NULL;
 static glm::vec3 *dev_framebuffer = NULL;
 
+#if TILERENDERWITHPREPROCESS
+static int *dev_triangleIndicesForFrag = NULL;
+static Tile *dev_tiles = NULL;
+#endif
+
 static int * dev_depth = NULL;	// you might need this buffer when doing depth test
+static int * mutex = NULL;
 
 /**
  * Kernel that writes the image to the OpenGL PBO directly.
@@ -166,6 +187,17 @@ void rasterizeInit(int w, int h) {
 	cudaFree(dev_depth);
 	cudaMalloc(&dev_depth, width * height * sizeof(int));
 
+	cudaFree(mutex);
+	cudaMalloc(&mutex, sizeof(int));
+
+#if TILERENDERWITHPREPROCESS
+	cudaFree(dev_triangleIndicesForFrag);
+	cudaMalloc(&dev_triangleIndicesForFrag, width * height * sizeof(int));
+
+	cudaFree(dev_tiles);
+	cudaMalloc(&dev_tiles, ((width + TILEX - 1) / TILEX) * ((height + TILEY - 1) / TILEY) * sizeof(Tile));
+#endif
+
 	checkCUDAError("rasterizeInit");
 }
 
@@ -633,12 +665,25 @@ void _vertexTransformAndAssembly(
 	// vertex id
 	int vid = (blockIdx.x * blockDim.x) + threadIdx.x;
 	if (vid < numVertices) {
+		//VertexIndex vertexIndex = primitive.dev_indices[vid];
+		VertexOut &ref_vs_output = primitive.dev_verticesOut[vid];
+		//
+		//primitive.dev_verticesOut[vid].texcoord0 = primitive.dev_texcoord0[vid];
+		glm::vec4 pos = glm::vec4(primitive.dev_position[vid], 1.0f);
+		ref_vs_output.pos = MVP * pos;
+		ref_vs_output.eyePos = glm::vec3(MV * pos);
+		ref_vs_output.eyeNor = glm::vec3(MV_normal * primitive.dev_normal[vid]);
+
+		//ref_vs_output.eyeNor = primitive.dev_normal[vid];
+		//ref_vs_output.dev_diffuseTex = primitive.dev_diffuseTex;
 
 		// TODO: Apply vertex transformation here
 		// Multiply the MVP matrix for each vertex position, this will transform everything into clipping space
 		// Then divide the pos by its w element to transform into NDC space
 		// Finally transform x and y to viewport space
 
+		//basic 
+
 		// TODO: Apply vertex assembly here
 		// Assemble all attribute arraies into the primitive array
 		
@@ -660,12 +705,12 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 		// TODO: uncomment the following code for a start
 		// This is primitive assembly for triangles
 
-		//int pid;	// id for cur primitives vector
-		//if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
-		//	pid = iid / (int)primitive.primitiveType;
-		//	dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
-		//		= primitive.dev_verticesOut[primitive.dev_indices[iid]];
-		//}
+		int pid;	// id for cur primitives vector
+		if (primitive.primitiveMode == TINYGLTF_MODE_TRIANGLES) {
+			pid = iid / (int)primitive.primitiveType;
+			dev_primitives[pid + curPrimitiveBeginId].v[iid % (int)primitive.primitiveType]
+				= primitive.dev_verticesOut[primitive.dev_indices[iid]];
+		}
 
 
 		// TODO: other primitive types (point, line)
@@ -673,7 +718,380 @@ void _primitiveAssembly(int numIndices, int curPrimitiveBeginId, Primitive* dev_
 	
 }
 
+__device__
+glm::vec3 getBarycentricWeights(glm::vec3 p, glm::vec3 p1, glm::vec3 p2, glm::vec3 p3) {
+	float totalArea = glm::length(glm::cross(p1 - p3, p2 - p3)) / 2.0f;
+
+	float area1 = glm::length(glm::cross(p2 - p, p3 - p)) / 2.0f;
+	float area2 = glm::length(glm::cross(p3 - p, p1 - p)) / 2.0f;
+	float area3 = glm::length(glm::cross(p1 - p, p2 - p)) / 2.0f;
+
+	return glm::vec3(area1 / totalArea, area2 / totalArea, area3 / totalArea);
+} 
+
+__device__ 
+bool isInTriangle(glm::vec3 p, glm::vec3 p1 , glm::vec3 p2 , glm::vec3 p3 ) {
+	float totalArea = glm::abs((p1.x * (p2.y - p3.y) + p2.x * (p3.y - p1.y) + p3.x * (p1.y - p2.y)) / 2.0f);
+	
+	float area1 = glm::abs((p.x * (p2.y - p3.y) + p2.x * (p3.y - p.y) + p3.x * (p.y - p2.y)) / 2.0f);
+	float area2 = glm::abs((p1.x * (p.y - p3.y) + p.x * (p3.y - p1.y) + p3.x * (p1.y - p.y)) / 2.0f);
+	float area3 = glm::abs((p1.x * (p2.y - p.y) + p2.x * (p.y - p1.y) + p.x * (p1.y - p2.y)) / 2.0f);	
+	glm::vec3 bw = glm::vec3(area1 / totalArea, area2 / totalArea, area3 / totalArea);
+	return !((bw.x + bw.y + bw.z) > 1.00001f);
+}
 
+/**
+* Computes the axis-aligned bounding box for a given prim
+* Outputs:
+* glm::ivec4 Contains pixel coordinates for left bottom and top right corners of AABB
+*/
+__device__
+glm::ivec4 computeAABB(int width, int height, Primitive prim) {
+	float maxX = fmaxf(prim.v[0].pos.x, fmaxf(prim.v[1].pos.x, prim.v[2].pos.x));
+	float maxY = fmaxf(prim.v[0].pos.y, fmaxf(prim.v[1].pos.y, prim.v[2].pos.y));
+	float minX = fminf(prim.v[0].pos.x, fminf(prim.v[1].pos.x, prim.v[2].pos.x));
+	float minY = fminf(prim.v[0].pos.y, fminf(prim.v[1].pos.y, prim.v[2].pos.y));
+
+	return glm::ivec4(
+		(int) ((minX + 1.0f) * (width / 2)) - 1,
+		(int) ((1.0f - maxY) * (height / 2)) - 1, //Necessary to flip max and min Y because in pixel space, 0,0 is the top left
+		(int) ((maxX + 1.0f) * (width / 2)) + 1,
+		(int) ((1.0f - minY) * (height / 2) + 1)
+	);
+}
+
+/**
+* Converts pixel coordinates to fragment index
+*/
+__device__
+int pixelToFragIndex(int x, int y, int width, int height) {
+	return y * width - x;
+}
+
+__device__
+glm::vec3 NDCtoPixel(glm::vec3 p, int width, int height) {
+	return glm::vec3((p.x + 1.0f) * (width / 2), (1.0f - p.y) * (height / 2), p.z);
+}
+
+/* Takes in information in NDC and outputs z-depth
+*/
+
+__device__
+float computeFragmentDepth(glm::vec3 p, Primitive prim) {
+	glm::vec3 p1 = glm::vec3(prim.v[0].pos);
+	glm::vec3 p2 = glm::vec3(prim.v[1].pos);
+	glm::vec3 p3 = glm::vec3(prim.v[2].pos);
+
+	p1.z = 0.0f;
+	p2.z = 0.0f;
+	p3.z = 0.0f;
+
+	glm::vec3 eyePos1 = glm::vec3(prim.v[0].eyePos);
+	glm::vec3 eyePos2 = glm::vec3(prim.v[1].eyePos);
+	glm::vec3 eyePos3 = glm::vec3(prim.v[2].eyePos);
+
+	float totalArea = glm::length(glm::cross(p1 - p3, p2 - p3)) / 2.0f;
+
+	float area1 = glm::length(glm::cross(p2 - p, p3 - p)) / 2.0f;
+	float area2 = glm::length(glm::cross(p1 - p, p3 - p)) / 2.0f;
+	float area3 = glm::length(glm::cross(p1 - p, p2 - p)) / 2.0f;
+	glm::vec3 bw = glm::vec3(area1 / totalArea, area2 / totalArea, area3 / totalArea);
+
+	return 1.0f / ((1.0f / eyePos1.z) * bw.x + (1.0f / eyePos2.z) * bw.y + (1.0f / eyePos3.z)  * bw.z);
+}
+
+__device__
+glm::vec3 computeLambertian(glm::vec3 normal, glm::vec3 light, glm::vec3 baseColor) {
+	return fmaxf(glm::dot(glm::normalize(normal), glm::normalize(light)), 0.1f) * baseColor;
+}
+
+__global__
+void rasterizeTriangles(int numPrimitives, 
+	int width,
+	int height,
+	Primitive* dev_primitives, 
+	Fragment* dev_fragmentBuffer,
+	int * dev_depth,
+	float depthRange,
+	int * mutex) {
+	int primId = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (primId < numPrimitives) {
+		Primitive& primitive = dev_primitives[primId];
+		glm::ivec4 AABB = computeAABB(width, height, primitive);
+		glm::vec3 pPix1 = NDCtoPixel(glm::vec3(primitive.v[0].pos), width, height);
+		glm::vec3 pPix2 = NDCtoPixel(glm::vec3(primitive.v[1].pos), width, height);
+		glm::vec3 pPix3 = NDCtoPixel(glm::vec3(primitive.v[2].pos), width, height);
+		glm::vec3 tri[3] = {
+			pPix1, 
+			pPix2, 
+			pPix3 };
+		for (int y = AABB.y; y < AABB.w; y++) {
+			for (int x = AABB.x; x < AABB.z; x++) {
+				glm::vec3 bw = calculateBarycentricCoordinate(tri, glm::vec2(x, y));
+				if (isBarycentricCoordInBounds(bw)) {
+					int fragIndex = pixelToFragIndex(x, y, width, height);
+					//printf("%f\n", computeFragmentDepth(glm::vec3(x, y, 0.0f), primitive));
+					int depth = (int) (depthRange * -getZAtCoordinate(bw, tri));
+					//atomicMin(dev_depth + fragIndex, depth);
+					//if (depth == dev_depth[fragIndex]) dev_fragmentBuffer[fragIndex].color = glm::vec3((float) primId / (float) numPrimitives, 0.0f, 0.0f);
+					//printf("%i\n", depth);
+					bool isSet;
+					do {
+						isSet = (atomicCAS(mutex, 0, 1) == 0);
+						if (isSet) {
+							dev_depth[fragIndex] = min(dev_depth[fragIndex], depth);
+							if (depth == dev_depth[fragIndex]) dev_fragmentBuffer[fragIndex].color = primitive.v[0].eyeNor;
+
+						}
+						if (isSet) {
+							*mutex = 0;
+						}
+					} while (!isSet);
+				}
+			}
+		}
+	}
+
+}
+
+__device__
+int tileIndexToFragIndex(int tileIndex, int tileX, glm::ivec4 tile, int width, int height) {
+	int tileYcoord = tileIndex / tileX;
+	int tileXcoord = tileIndex - (tileYcoord * tileX);
+	//printf("%i, %i, %i\n", tileIndex, tileX, tileYcoord);
+
+	return pixelToFragIndex(tile.x + tileXcoord, tile.y + tileYcoord, width, height);
+}
+
+__device__
+bool pixelInTile(int x, int y, glm::ivec4 tile) {
+	return (x >= tile.x && x < tile.z && y >= tile.y && y < tile.w);
+}
+
+__global__
+void tileRasterizeTriangles(int numPrimitives,
+	int width,
+	int height,
+	Primitive* dev_primitives,
+	Fragment* dev_fragmentBuffer,
+	int * dev_depth,
+	float depthRange,
+	int * mutex) {
+	//Allocate shared memory for tile
+	//Format: x,y,z is color and w is depth
+	// On Moore 100 Machines, max shared memory is c000 (49152) bytes, 
+	// with this, shared memory usage is 40000 bytes
+	int tileX = blockIdx.x;
+	int tileY = blockIdx.y;
+	__shared__ glm::vec4 shared_tileBuffer[TILESIZE];
+	int sharedWritesPerThread = (TILESIZE + blockDim.x - 1) / blockDim.x;
+	for (int i = 0; i < sharedWritesPerThread; i++) {
+		if (i + threadIdx.x * sharedWritesPerThread < TILESIZE) {
+			shared_tileBuffer[i + threadIdx.x * sharedWritesPerThread] = glm::vec4(0.0f, 0.0f, 0.0f, FLT_MAX);
+		}
+	}
+	__syncthreads();
+	glm::ivec4 tile = glm::ivec4(tileX * TILEX, tileY * TILEY, (tileX + 1) * TILEX, (tileY + 1) * TILEY);
+	int primId = threadIdx.x;
+	if (primId < numPrimitives) {
+		//Initialize shared memory
+		//Evaluate bounding box and determine if the triangle needs to be rendered in this tile
+		Primitive& primitive = dev_primitives[primId];
+		glm::ivec4 AABB = computeAABB(width, height, primitive);
+		if (AABB.x >= tile.z || AABB.z < tile.x || AABB.y >= tile.w || AABB.w < tile.y) {
+			return;
+		}
+
+		glm::vec3 pPix1 = NDCtoPixel(glm::vec3(primitive.v[0].pos), width, height);
+		glm::vec3 pPix2 = NDCtoPixel(glm::vec3(primitive.v[1].pos), width, height);
+		glm::vec3 pPix3 = NDCtoPixel(glm::vec3(primitive.v[2].pos), width, height);
+		glm::vec3 tri[3] = {
+			pPix1,
+			pPix2,
+			pPix3 };
+		for (int y = AABB.y; y < AABB.w; y++) {
+			for (int x = AABB.x; x < AABB.z; x++) {
+				//Test if pixel x,y is in the triangle
+				glm::vec3 bw = calculateBarycentricCoordinate(tri, glm::vec2(x, y));
+				if (isBarycentricCoordInBounds(bw) && pixelInTile(x, y, tile)) {
+					//Convert the pixel to a fragmentIndex and compute its depth
+					int fragIndex = pixelToFragIndex(x, y, width, height);
+					int fragTileIndex = (x - tile.x) + (y - tile.y) * TILEY;
+					//Cull fragments outside of tile
+					float depth = (depthRange * -getZAtCoordinate(bw, tri));
+					//depth test, the mutex ensures that the code within "isSet" happens atomically
+					//that is, that code is guaranteed to execute in a single thread without anything executing in
+					//any other thread
+					bool isSet;
+					do {
+						isSet = (atomicCAS(mutex, 0, 1) == 0);
+						if (isSet) {
+							/*dev_depth[fragIndex] = min(dev_depth[fragIndex], (int)depth);
+							if (depth == dev_depth[fragIndex]) dev_fragmentBuffer[fragIndex].color = primitive.v[0].eyeNor;*/
+							float originalDepth = shared_tileBuffer[fragTileIndex].w;
+							shared_tileBuffer[fragTileIndex].w = fminf(originalDepth, depth);
+							if (depth < originalDepth) {
+								glm::vec3 color = primitive.v[0].eyeNor;
+								shared_tileBuffer[fragTileIndex].x = color.x;
+								shared_tileBuffer[fragTileIndex].y = color.y;
+								shared_tileBuffer[fragTileIndex].z = color.z;
+
+							}
+						}
+						if (isSet) {
+							*mutex = 0;
+						}
+					} while (!isSet);
+				}
+			}
+		}
+		//This is therefore where I should call __syncThreads(); and write to the dev_fragmentBuffer
+		//Will need some kind of for loop to write to the dev_fragmentBuffer
+	}
+	__syncthreads();
+
+	//Write shared memory to fragmentbuffer. This might be more optimal if we wanted to retire threads after the above branch and split up writes
+	//based on the number of primitives.
+	for (int i = 0; i < sharedWritesPerThread; i++) {
+		if (i + threadIdx.x * sharedWritesPerThread < TILESIZE) {
+			int fragIndex = tileIndexToFragIndex(i + threadIdx.x * sharedWritesPerThread,
+				TILEX,
+				tile,
+				width,
+				height);
+			dev_fragmentBuffer[fragIndex].color = glm::vec3(shared_tileBuffer[i + threadIdx.x * sharedWritesPerThread]);
+		}
+	}
+}
+
+
+#if TILERENDERWITHPREPROCESS
+__global__
+void computeTrianglesToBeRendered(int numPrimitives,
+	int width,
+	int height,
+	int tileGridWidth,
+	Primitive* dev_primitives,
+	Tile* dev_tiles,
+	int *mutex) {
+	//Parellelize over triangles
+	//Using AABB, bucket the triangles into the tiles
+	int primId = (blockIdx.x * blockDim.x) + threadIdx.x;
+	if (primId < numPrimitives) {
+		Primitive& primitive = dev_primitives[primId];
+		glm::ivec4 AABB = computeAABB(width, height, primitive);
+		int minX = (AABB.x / TILEX) * TILEX;
+		int minY = (AABB.y / TILEY) * TILEY;
+		for (int y = minY; y < AABB.w; y += TILEY) {
+			for (int x = minX; x < AABB.z; x += TILEX) {
+				Tile& tile = dev_tiles[(x / TILEX) + (y / TILEY) * tileGridWidth];
+				bool isSet;
+				do {
+					isSet = (atomicCAS(mutex, 0, 1) == 0);
+					if (isSet) {
+						int bucketIdx = tile.numTriangles;
+						tile.triangleIndices[bucketIdx] = primId;
+						tile.numTriangles++;
+					}
+					if (isSet) {
+						*mutex = 0;
+					}
+				} while (!isSet);
+			}
+		}
+	}
+
+}
+
+__global__
+void tileRasterizeTrianglesAfterPreProcess(
+	int width,
+	int height,
+	int tileGridWidth,
+	Primitive* dev_primitives,
+	Tile* dev_tiles,
+	Fragment* dev_fragmentBuffer,
+	float depthRange) {
+	//Block is tile
+	//Thread is pixel
+	//Loop over the triangles in the bucket
+	
+	__shared__ glm::vec4 shared_tileBuffer[TILESIZE];
+
+	int tileX = blockIdx.x;
+	int tileY = blockIdx.y;
+	int pixelX = blockIdx.x * blockDim.x + threadIdx.x;
+	int pixelY = blockIdx.y * blockDim.y + threadIdx.y;
+
+	int fragIdx = pixelToFragIndex(pixelX, pixelY, width, height);
+	Tile& tile = dev_tiles[tileX + tileY * tileGridWidth];
+	int idxInTile = threadIdx.x + threadIdx.y * blockDim.x;
+
+	shared_tileBuffer[idxInTile] = glm::vec4(0.0f, 0.0f, 0.0f, FLT_MAX);
+
+	__syncthreads();
+
+	for (int i = 0; i < tile.numTriangles; i++) {
+		int primId = tile.triangleIndices[i];
+		Primitive& primitive = dev_primitives[primId];
+		glm::vec3 pPix1 = NDCtoPixel(glm::vec3(primitive.v[0].pos), width, height);
+		glm::vec3 pPix2 = NDCtoPixel(glm::vec3(primitive.v[1].pos), width, height);
+		glm::vec3 pPix3 = NDCtoPixel(glm::vec3(primitive.v[2].pos), width, height);
+		glm::vec3 tri[3] = {
+			pPix1,
+			pPix2,
+			pPix3 };
+		glm::ivec4 AABB = computeAABB(width, height, primitive);
+		glm::vec3 bw = calculateBarycentricCoordinate(tri, glm::vec2(pixelX, pixelY));
+		if (isBarycentricCoordInBounds(bw) && pixelInTile(pixelX, pixelY, glm::ivec4(tileX * TILEX, 
+			tileY * TILEY, 
+			(tileX + 1) * TILEX,
+			(tileY + 1) * TILEY))) {
+			float depth = (depthRange * -getZAtCoordinate(bw, tri));
+			float originalDepth = shared_tileBuffer[idxInTile].w;
+			shared_tileBuffer[idxInTile].w = fminf(originalDepth, depth);
+			if (depth < originalDepth) {
+				glm::vec3 color = primitive.v[0].eyeNor;
+				shared_tileBuffer[idxInTile].x = color.x;
+				shared_tileBuffer[idxInTile].y = color.y;
+				shared_tileBuffer[idxInTile].z = color.z;
+			}
+		}
+	}
+
+	dev_fragmentBuffer[fragIdx].color = glm::vec3(shared_tileBuffer[idxInTile]);
+}
+
+__global__
+void colorTileBorders(int width, int height, Fragment* dev_fragmentBuffer) {
+	int pixelX = blockIdx.x * blockDim.x + threadIdx.x;
+	int pixelY = blockIdx.y * blockDim.y + threadIdx.y;
+
+	if ((threadIdx.x == 0 || threadIdx.y == 0) && pixelX < width && pixelY < height) {
+		dev_fragmentBuffer[pixelToFragIndex(pixelX, pixelY, width, height)].color = glm::vec3(1.0f, 0.0f, 0.0f);
+	}
+}
+#endif
+
+__global__
+void shadeLambertian(int width, int height, Fragment* dev_fragmentBuffer, glm::vec3 light) {
+	int pixelX = blockIdx.x * blockDim.x + threadIdx.x;
+	int pixelY = blockIdx.y * blockDim.y + threadIdx.y;
+	if (pixelX * pixelY < width * height) {
+		int fragIdx = pixelToFragIndex(pixelX, pixelY, width, height);
+		dev_fragmentBuffer[fragIdx].color = glm::dot(dev_fragmentBuffer[fragIdx].color, light) * glm::vec3(1.0f, 0.5f, 0.5f);
+	}
+}
+__global__
+void redFragments(int width, int height, Fragment* dev_fragmentBuffer) {
+	int fragX = (blockIdx.x * blockDim.x) + threadIdx.x;
+	int fragY = (blockIdx.y * blockDim.y) + threadIdx.y;
+
+	int index = (width - fragX) + (height - fragY) * width; 
+	if (index < width * height) {
+		dev_fragmentBuffer[index].color = glm::vec3((float) fragX / (float) width, (float) fragY / (float) height, 0.0f);
+	}
+}
 
 /**
  * Perform rasterization.
@@ -721,9 +1139,93 @@ void rasterize(uchar4 *pbo, const glm::mat4 & MVP, const glm::mat4 & MV, const g
 	
 	cudaMemset(dev_fragmentBuffer, 0, width * height * sizeof(Fragment));
 	initDepth << <blockCount2d, blockSize2d >> >(width, height, dev_depth);
-	
-	// TODO: rasterize
 
+	dim3 threadsPerBlock((curPrimitiveBeginId + 31) / 32);
+	dim3 numBlocksForRasterization((curPrimitiveBeginId + threadsPerBlock.x - 1) / threadsPerBlock.x);
+
+	cudaMemset(mutex, 0, sizeof(int));
+
+	glm::vec3 light = glm::vec3(1.0f, 1.0f, 1.0f);
+	
+	 //TODO: rasterize
+#if TILERENDER
+	dim3 blockDim2dTiles((width + TILEX - 1) / TILEX, (height + TILEY - 1) / TILEY);
+	tileRasterizeTriangles << <blockDim2dTiles, 1024 >> >
+		(curPrimitiveBeginId,
+			width,
+			height,
+			dev_primitives,
+			dev_fragmentBuffer,
+			dev_depth,
+			depthRange,
+			mutex);
+	/*tileRasterizeTriangles << <1, 128 >> >
+		(curPrimitiveBeginId,
+			width,
+			height,
+			dev_primitives,
+			dev_fragmentBuffer,
+			dev_depth,
+			depthRange,
+			mutex,
+			1,
+			1);*/
+
+#elif TILERENDERWITHPREPROCESS
+
+	computeTrianglesToBeRendered << <numBlocksForRasterization, threadsPerBlock >> > (
+		curPrimitiveBeginId,
+		width,
+		height,
+		(width + TILEX - 1) / TILEX,
+		dev_primitives,
+		dev_tiles,
+		mutex);
+
+	dim3 blockCountForTiles((width + TILEX - 1) / TILEX, (height + TILEY - 1) / TILEY);
+	dim3 blockSizeForTiles(TILEX, TILEY);
+
+	checkCUDAError("triangle bucketing");
+
+	tileRasterizeTrianglesAfterPreProcess<<<blockCountForTiles, blockSizeForTiles>>>(
+		width,
+		height,
+		(width + TILEX - 1) / TILEX,
+		dev_primitives,
+		dev_tiles,
+		dev_fragmentBuffer,
+		depthRange
+	);
+
+	shadeLambertian << <blockCount2d, blockSize2d >> > (
+		width, height,
+		dev_fragmentBuffer,
+		light);
+
+	colorTileBorders << <blockCount2d, blockSize2d >> > (
+		width, 
+		height,
+		dev_fragmentBuffer);
+#else
+	rasterizeTriangles << <numBlocksForRasterization, threadsPerBlock >> > 
+		(curPrimitiveBeginId, 
+		width,
+		height, 
+		dev_primitives, 
+		dev_fragmentBuffer,
+		dev_depth,
+		depthRange,
+		mutex);
+
+	/*shadeLambertian << <blockCount2d, blockSize2d >> > (
+		width, height,
+		dev_fragmentBuffer,
+		light);*/
+
+#endif
+	
+	//redFragments << <blockCount2d, blockSize2d >> > (width, height, dev_fragmentBuffer);
+	checkCUDAError("rasterization");
 
 
     // Copy depthbuffer colors into framebuffer