72 changes: 59 additions & 13 deletions Convolutional Neural Networks.ipynb
@@ -26,21 +26,51 @@
"\n",
"Neural networks have individual weights for each input feature because they expect each feature to represent a totally different thing (e.g. age, height, weight). In other domains like computer vision however, different features (pixels) can represent the same thing just in different locations (e.g. money on my left, money on my right).\n",
"\n",
"Imagine if we treated each and every pixel as a feature. That would mean we would need one weight, or parameter, for each pixel in an image. For a small 256x256 pixel image, that would mean at least 65,536 parameters! Each of these weights would have to be updated using backpropagation - leading to an explosion in computation as our image sizes increase.\n",
"\n",
"Luckily, mother nature has imbued in humans a mechanism for avoiding this kind of explosion. Within our own visual systems, we have the ability to focus our attention within certain regions of our visual field. If we see a tiger to the left of our visual field, you can bet we'll recognise it when it pounces at us from the right. In some sense, it almost seems like the mechanism for recognition *shares* its knowledge across various regions of our receptive field. What if we could do the same for neural networks?\n",
"\n",
"Instead of learning to look for the same features of an image with different weights for each position that feature might appear in, what if we could share the same learnt weights over all positions of the input? This would save us both time and memory in computing and storing these duplicate weights.\n",
"\n",
"#### So what\n",
"\n",
"Using our prior understanding of how image data should be processed spatially, we'd like to find some kind of model that can retain the spatial structure of an input, and look for the same features over the whole of this space. This is what we will use convolutional neural networks for.\n",
"\n",
"## Images as data\n",
"\n",
"What is an image? How is this represented as data? We have this notion of a 'pixel', but what is that actually comprised of?\n",
"\n",
"Images are not naturally vectors, or lists. They have a height and a width rather than just a length - so they need to at least be a matrix or grid of values.\n",
"\n",
"To represent a black and white image, for example, we could take a square matrix of values, and place a value of '1' where we would like the pixel to be white, and '0' where it should be black.\n",
"\n",
"This would mean that each pixel would have a **binary** intensity - either completely black (0) or completely white (1). \n",
"\n",
"![Binary Image Diagram](images/binary_image_diagram.png)\n",
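A binary image like the one above can be written down directly as a small matrix. The sketch below (not part of the original notebook; the pixel values are made up for illustration) uses NumPy to build a 5x5 binary image of a white plus sign on a black background:

```python
import numpy as np

# A 5x5 binary image: 1 = white pixel, 0 = black pixel.
# This one draws a white plus sign on a black background.
binary_image = np.array([
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 1, 0, 0],
    [0, 0, 1, 0, 0],
])

print(binary_image.shape)  # (5, 5)
```

Note that the image is just a 2D array of numbers - there is nothing special about "pixels" beyond where they sit in the grid.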
"\n",
"What if we wanted each pixel to be able to have a range of values, so that we could represent different shades of grey which sit in between white or black?\n",
"\n",
"To achieve this, we could ensure that each pixel can take a range of values, say within [0, 255]. This value would represent the **intensity** of the colour we want to show (white), where the absence of intensity (a '0' value) would represent pure black. 255 is a convenient upper bound, as this gives $2^8$ possible values - exactly what a single byte can store. We could now represent images in grayscale:\n",
"\n",
"![Grayscale Image Diagram](images/grayscale_image_diagram.png)\n",
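As a quick sketch (not in the original notebook, and the values are illustrative), a grayscale image is just a 2D array of 8-bit intensities - here a left-to-right ramp from black to white:

```python
import numpy as np

# A grayscale image stores one 8-bit intensity per pixel:
# 0 is black, 255 is white, values in between are shades of grey.
# Each of the 5 rows is the same left-to-right gradient.
gradient = np.tile(np.linspace(0, 255, 5, dtype=np.uint8), (5, 1))
print(gradient)
```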
"\n",
"#### What about colour?\n",
"\n",
"To represent, say, a red and black image, we could interpret the pixel value as the intensity of **redness** rather than **whiteness**. The same goes for **greenness** or **blueness**.\n",
"\n",
"![https://i1.wp.com/obsessive-coffee-disorder.com/wp-content/uploads/2014/02/lena.png?fit=1403%2C511](https://i1.wp.com/obsessive-coffee-disorder.com/wp-content/uploads/2014/02/lena.png?fit=1403%2C511)\n",
"\n",
"\n",
"The coloured pixels on your monitor represent the colours we know and recognise by combining different amounts of red, green and blue light. This is mixed and interpreted by the eye to form any of the 16.8 million colours that modern LED/LCD screens can reproduce. \n",
"\n",
"#### Channels\n",
"If you take a close up of images displayed on the screen you're reading, you'll notice that each image is made up of these smaller red, green and blue pixels:\n",
"\n",
"![https://upload.wikimedia.org/wikipedia/commons/5/57/Subpixel-rendering-RGB.png](https://upload.wikimedia.org/wikipedia/commons/5/57/Subpixel-rendering-RGB.png)\n",
"\n",
"Any non-black color can be made by combining 3 primary colors.\n",
"As such, as well as height and width, color images have another axis called the **channels**, which specifies the intensity (contribution) of each of these primary colors.\n",
"Red, green and blue are the (arbitrary) standard primary colors. \n",
"\n",
"So most images that we will work with have a red channel, a green channel and a blue channel.\n",
"This is illustrated below.\n",
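To make the channel axis concrete, here is a minimal NumPy sketch (not part of the original notebook; the shape convention and values are illustrative). A colour image is a stack of three 2D channels along a third axis:

```python
import numpy as np

# A colour image is a stack of three 2D channels: red, green and blue.
# Shape convention here is (height, width, channels).
height, width = 4, 6
red = np.zeros((height, width), dtype=np.uint8)
green = np.zeros((height, width), dtype=np.uint8)
blue = np.zeros((height, width), dtype=np.uint8)
red[:, :3] = 255  # make the left half of the image fully red

image = np.stack([red, green, blue], axis=-1)
print(image.shape)   # (4, 6, 3)
print(image[0, 0])   # a red pixel:   [255 0 0]
print(image[0, 5])   # a black pixel: [0 0 0]
```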
"\n",
@@ -53,35 +83,51 @@
"In the past, people would try to draw patterns that they thought would appear in images and be useful for the problem that they were trying to solve. This was a painstakingly long process, and was obviously susceptible to a lot of bias by these feature designers.\n",
"\n",
"## Filters/Kernels\n",
"\n",
"These supposedly useful patterns mentioned above are known as **filters** or **kernels**. \n",
"\n",
"Each filter is designed to highlight or emphasise a particular pattern or shape within an image.\n",
"\n",
"To design a filter which would detect circles, we would construct a matrix of values - effectively an image - that itself looks like a circle. When we apply this filter over a source image, it would produce a high value (known as an *activation*) whenever it comes across a circle, and low values elsewhere.\n",
"\n",
"![title](images/kernels.jpg)\n",
"\n",
"Filters *look* for the patterns they represent by checking how closely the pixels at any particular location in the input image match their own values, which define the shape or pattern they're looking for.\n",
"\n",
"A mathematically convenient way to do this is by taking a **dot product** between the filter's values (remember - an image is just a matrix of values as we saw earlier) and the pixels which it covers - an element wise multiplication and sum of the results. \n",
"\n",
"**This produces a single value** which should be larger when the input better matches the feature that the filter looks for.\n",
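The dot product described above can be sketched in a few lines of NumPy (this example is not in the original notebook; the filter and patch values are made up for illustration). A filter that "looks for" a diagonal line scores highly on a diagonal patch and poorly on a non-matching one:

```python
import numpy as np

# A filter looking for a diagonal line.
filt = np.array([[1, 0, 0],
                 [0, 1, 0],
                 [0, 0, 1]])

# A patch that contains a diagonal line...
diagonal_patch = np.array([[9, 0, 0],
                           [0, 9, 0],
                           [0, 0, 9]])

# ...and a patch that contains a horizontal one instead.
flat_patch = np.array([[0, 0, 0],
                       [9, 9, 9],
                       [0, 0, 0]])

# Element-wise multiply then sum = dot product of the flattened arrays.
print(np.sum(filt * diagonal_patch))  # 27 - strong match
print(np.sum(filt * flat_patch))      # 9  - weak match
```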
"\n",
"This 'sliding' across and element-wise multiplication is known as a **convolution**.\n",
"\n",
"## The convolution operation\n",
"\n",
"Now that we have our filter, we need some way of 'sliding' it across our input so that we can detect the patterns and shapes that we want across an entire image. Enter convolutions!\n",
"\n",
"In machine learning, convolution is the process of moving a filter across every possible position of the input and computing a value for how well it is matched at each location.\n",
"\n",
"This pattern matching over the spatially structured input produces a similarly spatially structured output. We call this output an **activation map** or a **feature map**, because it represents the activations in the next layer, which should represent some higher-level (more complex) features than the feature maps in the input.\n",
"\n",
"The animation below shows how a 1x3x3 filter is applied to a 1x5x5 image (for simplicity, input channels = 1).\n",
"\n",
"On the left is the filter that we will convolve over the input. In the centre is the input being convolved over. On the right is the output activation map produced by convolving this filter over this input.\n",
"\n",
"(*Why is it that our convolutions are producing volumes as opposed to single values or images?*)\n",
"\n",
"Notice how the output has high values when the filter is passed over locations where there is an X shape in the input image. This is because the image pixel values are similar to the pattern defined in the filter for those specific locations - resulting in higher dot product values.\n",
"\n",
"It is standard for filters to always convolve through the full depth of the input. So if we have an input with 3 channels (e.g. a color image), our kernel will also have a depth of 3 - each channel of the filter is matched against the corresponding colour channel, and the results from all channels are summed into a single output value.\n",
"\n",
"The width and height of our kernels is up to us (they are hyperparameters). It's standard to have kernels with equal width and height.\n",
"\n",
"![image](images/convolution_animation.gif)\n",
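The animation above can be reproduced with a short NumPy sketch (not part of the original notebook; the input and filter values are made up to mimic the X-shaped example). It slides a 3x3 filter over every valid position of a 5x5 input with no padding and stride 1:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide `kernel` over every valid position of `image` and take
    the dot product at each location (no padding, stride 1)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    oh, ow = ih - kh + 1, iw - kw + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i:i + kh, j:j + kw]
            out[i, j] = np.sum(patch * kernel)
    return out

# A 5x5 image containing an X shape.
image = np.array([[1, 0, 0, 0, 1],
                  [0, 1, 0, 1, 0],
                  [0, 0, 1, 0, 0],
                  [0, 1, 0, 1, 0],
                  [1, 0, 0, 0, 1]])

# A 3x3 filter that looks for an X shape.
x_filter = np.array([[1, 0, 1],
                     [0, 1, 0],
                     [1, 0, 1]])

result = convolve2d(image, x_filter)
print(result)  # highest activation (5) at the centre, where the X sits
```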
"\n",
"The convolution operation has a few possible parameters:\n",
"\n",
"### Stride\n",
"The stride is the number of pixels we slide our kernel along by to compute the next value in the output activation map. \n",
"\n",
"An increased stride means fewer values are computed for the output activation map. We therefore do less computation and store less output information - decreasing computing time and memory requirements, at the cost of output resolution.\n",
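The effect of stride (and of the padding discussed next) on output size follows the standard one-axis formula, sketched below (this helper is illustrative and not part of the original notebook):

```python
# Output size of a convolution along one axis:
#   out = (input - kernel + 2 * padding) // stride + 1
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    return (input_size - kernel_size + 2 * padding) // stride + 1

print(conv_output_size(5, 3, stride=1))             # 3 values per row
print(conv_output_size(5, 3, stride=2))             # 2 - coarser output
print(conv_output_size(5, 3, stride=1, padding=1))  # 5 - size preserved
```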
"\n",
"### Padding\n",
"We can *pad* the input with a border of extra pixels around the edge. Why might we want to do this?\n",
@@ -342,7 +388,7 @@
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.3"
}
},
"nbformat": 4,