From a953a0e5883be4699c172cb3fae6db6e0990a2b3 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Tue, 16 Dec 2025 00:13:56 +0000 Subject: [PATCH 1/2] Initial plan From b8b23f6fe5d757042d168506c0a2056587c3e4d5 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Tue, 16 Dec 2025 00:19:37 +0000 Subject: [PATCH 2/2] Create dl-final_exam.ipynb with 30 comprehensive exam questions Co-authored-by: liganega <2748863+liganega@users.noreply.github.com> --- notebooks/dl-final_exam.ipynb | 686 ++++++++++++++++++++++++++++++++++ 1 file changed, 686 insertions(+) create mode 100644 notebooks/dl-final_exam.ipynb diff --git a/notebooks/dl-final_exam.ipynb b/notebooks/dl-final_exam.ipynb new file mode 100644 index 0000000..9311837 --- /dev/null +++ b/notebooks/dl-final_exam.ipynb @@ -0,0 +1,686 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Deep Learning Final Exam\n", + "\n", + "This exam covers the material from the \"Introduction to Computer Vision\" notebook (`NB-computer_vision_intro.ipynb`).\n", + "\n", + "**Topics Covered:**\n", + "- Introduction to Convolutional Neural Networks (CNNs)\n", + "- Conv2D and MaxPooling2D layers\n", + "- Padding and Strides\n", + "- GlobalAveragePooling2D\n", + "- Training a CNN on MNIST\n", + "- The relevance of deep learning for small-data problems\n", + "- Data preprocessing (image_dataset_from_directory)\n", + "- TensorFlow Dataset objects\n", + "- Data augmentation\n", + "- Transfer learning and fine-tuning\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Part 1: Multiple Choice Questions (10 Questions)\n", + "\n", + "*Instructions: Choose the best answer for each question.*\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 1\n", + "\n", + "**What is the primary purpose of a Conv2D layer in a Convolutional Neural Network?**\n", + "\n", + "A) To reduce the spatial dimensions of the input\n", + "\n", + "B) To extract local features from the input data using learnable filters\n", + "\n", + "C) To normalize the pixel values between 0 and 1\n", + "\n", + "D) To flatten the input into a 1D vector\n", + "\n", + "E) To perform classification on the extracted features\n", + "\n", + "**Answer: B**\n", + "\n", + "**Explanation:** Conv2D layers apply learnable convolutional filters to extract local spatial features from images. Each filter learns to detect specific patterns like edges, textures, or more complex features in deeper layers." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 2\n", + "\n", + "**When using a 3x3 kernel with 'valid' padding on a 28x28 input image, what will be the output spatial dimensions?**\n", + "\n", + "A) 28x28\n", + "\n", + "B) 30x30\n", + "\n", + "C) 26x26\n", + "\n", + "D) 14x14\n", + "\n", + "E) 27x27\n", + "\n", + "**Answer: C**\n", + "\n", + "**Explanation:** With 'valid' padding (no padding), the output size is calculated as: (input_size - kernel_size + 1). Therefore, 28 - 3 + 1 = 26. The output will be 26x26." 
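+ "\n",
+ "A minimal sketch (not part of the original notebook) that verifies this output size with Keras; the filter count of 32 is an arbitrary illustrative choice:\n",
+ "\n",
+ "```python\n",
+ "from tensorflow import keras\n",
+ "from tensorflow.keras import layers\n",
+ "\n",
+ "# 'valid' padding is the Conv2D default: no zeros are added around the input.\n",
+ "inputs = keras.Input(shape=(28, 28, 1))\n",
+ "x = layers.Conv2D(filters=32, kernel_size=3, padding=\"valid\")(inputs)\n",
+ "print(x.shape)  # (None, 26, 26, 32) -> 28 - 3 + 1 = 26\n",
+ "```"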
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 3\n", + "\n", + "**What is the main benefit of using MaxPooling2D layers in a CNN?**\n", + "\n", + "A) To increase the number of parameters in the model\n", + "\n", + "B) To add non-linearity to the model\n", + "\n", + "C) To reduce spatial dimensions and provide translation invariance\n", + "\n", + "D) To normalize the activations\n", + "\n", + "E) To prevent vanishing gradients\n", + "\n", + "**Answer: C**\n", + "\n", + "**Explanation:** MaxPooling2D reduces the spatial dimensions (downsampling) while retaining the most important features. It also provides some translation invariance, meaning small shifts in the input won't drastically affect the output. Additionally, it reduces computational cost and helps prevent overfitting." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 4\n", + "\n", + "**What does GlobalAveragePooling2D do to a feature map?**\n", + "\n", + "A) Takes the maximum value from each channel\n", + "\n", + "B) Computes the average of all values in each channel, reducing spatial dimensions to 1x1\n", + "\n", + "C) Concatenates all spatial positions into a single vector\n", + "\n", + "D) Applies a weighted average based on learned parameters\n", + "\n", + "E) Randomly samples values from each channel\n", + "\n", + "**Answer: B**\n", + "\n", + "**Explanation:** GlobalAveragePooling2D computes the average of all spatial positions for each channel (filter), effectively reducing each feature map to a single value. For example, a (5, 5, 256) tensor becomes (256,). This reduces parameters compared to using Flatten + Dense layers." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 5\n", + "\n", + "**In the context of CNNs, what does a 'stride' of 2 mean?**\n", + "\n", + "A) The filter moves 2 pixels at a time during convolution\n", + "\n", + "B) The filter size is 2x2\n", + "\n", + "C) Two convolutional layers are stacked together\n", + "\n", + "D) The padding is set to 2 pixels\n", + "\n", + "E) The batch size is 2\n", + "\n", + "**Answer: A**\n", + "\n", + "**Explanation:** Stride refers to the step size by which the convolutional filter moves across the input. A stride of 2 means the filter moves 2 pixels at a time, which reduces the output dimensions and computational cost compared to stride of 1." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 6\n", + "\n", + "**Why is deep learning particularly relevant for small-data problems when using techniques like transfer learning?**\n", + "\n", + "A) Deep learning models always require large datasets\n", + "\n", + "B) Pretrained models on large datasets can be leveraged to extract meaningful features from small datasets\n", + "\n", + "C) Small datasets work better with deep networks than shallow networks\n", + "\n", + "D) Deep learning automatically generates more data\n", + "\n", + "E) Small datasets don't benefit from deep learning\n", + "\n", + "**Answer: B**\n", + "\n", + "**Explanation:** Transfer learning allows us to use models pretrained on large datasets (like ImageNet) to extract features from small datasets. The pretrained model has already learned general visual features, which can be adapted to new tasks with limited data through feature extraction or fine-tuning." 
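+ "\n",
+ "A minimal feature-extraction sketch of this idea (assuming VGG16 as the pretrained base and a binary cats-vs-dogs style task; the 180×180 input size and layer choices are illustrative, not taken from the notebook):\n",
+ "\n",
+ "```python\n",
+ "from tensorflow import keras\n",
+ "from tensorflow.keras import layers\n",
+ "\n",
+ "# Pretrained convolutional base, without its ImageNet classifier head.\n",
+ "base = keras.applications.VGG16(weights=\"imagenet\", include_top=False,\n",
+ "                                input_shape=(180, 180, 3))\n",
+ "base.trainable = False  # freeze: reuse the learned features as-is\n",
+ "\n",
+ "inputs = keras.Input(shape=(180, 180, 3))\n",
+ "x = keras.applications.vgg16.preprocess_input(inputs)\n",
+ "x = base(x, training=False)\n",
+ "x = layers.GlobalAveragePooling2D()(x)\n",
+ "outputs = layers.Dense(1, activation=\"sigmoid\")(x)  # e.g. cat vs dog\n",
+ "model = keras.Model(inputs, outputs)\n",
+ "```"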
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 7\n", + "\n", + "**What is the primary purpose of data augmentation in training CNNs?**\n", + "\n", + "A) To reduce the training time\n", + "\n", + "B) To increase the diversity of training data and reduce overfitting\n", + "\n", + "C) To normalize the input images\n", + "\n", + "D) To reduce the model size\n", + "\n", + "E) To improve the test accuracy without changing the model\n", + "\n", + "**Answer: B**\n", + "\n", + "**Explanation:** Data augmentation applies random transformations (rotations, flips, zooms, etc.) to training images, effectively creating new training samples. This increases data diversity, helps the model generalize better, and reduces overfitting, especially when training data is limited." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 8\n", + "\n", + "**What does `image_dataset_from_directory` do in Keras?**\n", + "\n", + "A) Downloads images from the internet\n", + "\n", + "B) Creates a tf.data.Dataset from image files organized in subdirectories by class\n", + "\n", + "C) Saves model predictions to a directory\n", + "\n", + "D) Converts images to numpy arrays\n", + "\n", + "E) Deletes temporary image files\n", + "\n", + "**Answer: B**\n", + "\n", + "**Explanation:** `image_dataset_from_directory` is a utility function that creates a TensorFlow Dataset from images organized in subdirectories, where each subdirectory represents a class. It automatically labels the images based on the directory structure and provides efficient data loading." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 9\n", + "\n", + "**In the MNIST CNN example, why is the input shape specified as (28, 28, 1)?**\n", + "\n", + "A) 28x28 is the image size, and 1 represents the grayscale channel\n", + "\n", + "B) 28x28 is the batch size, and 1 is the number of classes\n", + "\n", + "C) The first 28 represents width, second 28 represents height, and 1 is the batch size\n", + "\n", + "D) All three dimensions represent spatial coordinates\n", + "\n", + "E) 28x28 represents the number of parameters, and 1 is the learning rate\n", + "\n", + "**Answer: A**\n", + "\n", + "**Explanation:** MNIST images are 28x28 pixels in grayscale. The shape (28, 28, 1) represents height=28, width=28, and channels=1 (grayscale). For RGB images, the last dimension would be 3." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 10\n", + "\n", + "**What is the difference between feature extraction and fine-tuning in transfer learning?**\n", + "\n", + "A) Feature extraction freezes pretrained layers while fine-tuning updates them\n", + "\n", + "B) Feature extraction is faster but fine-tuning is always more accurate\n", + "\n", + "C) Fine-tuning requires more data than feature extraction\n", + "\n", + "D) Feature extraction can only be used with convolutional layers\n", + "\n", + "E) They are the same thing\n", + "\n", + "**Answer: A**\n", + "\n", + "**Explanation:** In feature extraction, the pretrained layers are frozen (not trainable), and only the new top layers are trained. In fine-tuning, some or all of the pretrained layers are unfrozen and updated during training, allowing the model to adapt more closely to the new task. Fine-tuning is typically done after feature extraction and requires careful learning rate adjustment." 
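+ "\n",
+ "A short sketch of the difference in code (assuming a frozen pretrained `base` and a `model` built on top of it, as in the example above; the learning rates shown are typical values, not prescriptions):\n",
+ "\n",
+ "```python\n",
+ "# Feature extraction: the pretrained base stays frozen; only new layers train.\n",
+ "base.trainable = False\n",
+ "model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-3),\n",
+ "              loss=\"binary_crossentropy\", metrics=[\"accuracy\"])\n",
+ "\n",
+ "# Fine-tuning: unfreeze (part of) the base and recompile with a much\n",
+ "# lower learning rate so the pretrained weights are only nudged.\n",
+ "base.trainable = True\n",
+ "model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),\n",
+ "              loss=\"binary_crossentropy\", metrics=[\"accuracy\"])\n",
+ "```"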
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "## Part 2: Short Answer Questions (10 Questions)\n", + "\n", + "*Instructions: Provide a concise answer for each question (2-3 sentences).*\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 11\n", + "\n", + "**Explain what happens to the spatial dimensions when you apply a MaxPooling2D layer with pool_size=2.**\n", + "\n", + "**Expected Answer:**\n", + "\n", + "MaxPooling2D with pool_size=2 divides each spatial dimension by 2. For example, an input of (13, 13, 64) becomes (6, 6, 64) after max pooling. The operation takes the maximum value from each 2x2 window, effectively downsampling the feature maps while preserving the most prominent features.\n", + "\n", + "**Explanation/Context:**\n", + "\n", + "Max pooling is a downsampling operation that reduces computational cost and provides some translation invariance. It helps create a hierarchical representation where higher layers capture more abstract features." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 12\n", + "\n", + "**Why do we normalize/rescale pixel values (e.g., divide by 255) before feeding images to a neural network?**\n", + "\n", + "**Expected Answer:**\n", + "\n", + "Pixel values typically range from 0 to 255. Normalizing them to the range [0, 1] by dividing by 255 helps neural networks train more effectively. This is because neural networks work better with small input values, and normalization prevents features with larger scales from dominating the learning process. It also helps with gradient flow and convergence during training.\n", + "\n", + "**Explanation/Context:**\n", + "\n", + "Input normalization is a standard preprocessing step in deep learning. It's often implemented using a Rescaling layer in Keras." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 13\n", + "\n", + "**What is the advantage of using GlobalAveragePooling2D instead of Flatten followed by Dense layers?**\n", + "\n", + "**Expected Answer:**\n", + "\n", + "GlobalAveragePooling2D significantly reduces the number of parameters compared to Flatten + Dense layers. For example, with a (5, 5, 256) feature map, Flatten would create 6,400 values requiring 6,400 × num_classes parameters, while GlobalAveragePooling2D creates only 256 values. This reduction helps prevent overfitting and reduces computational cost while maintaining good performance.\n", + "\n", + "**Explanation/Context:**\n", + "\n", + "GlobalAveragePooling2D is commonly used in modern CNN architectures as a more efficient alternative to fully connected layers at the end of the network." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 14\n", + "\n", + "**Describe what a convolutional filter (kernel) does when it slides over an image.**\n", + "\n", + "**Expected Answer:**\n", + "\n", + "A convolutional filter is a small matrix (e.g., 3x3) with learnable weights that slides over the input image. At each position, it performs element-wise multiplication with the overlapping region of the input and sums the results to produce a single output value. 
This process creates a feature map that highlights specific patterns the filter has learned to detect, such as edges, textures, or more complex features.\n", + "\n", + "**Explanation/Context:**\n", + "\n", + "Multiple filters are typically used in each Conv2D layer, each learning to detect different features. The number of filters determines the depth of the output feature maps." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 15\n", + "\n", + "**What is the purpose of using 'same' padding in a convolutional layer?**\n", + "\n", + "**Expected Answer:**\n", + "\n", + "'Same' padding adds zeros around the input image borders so that the output has the same spatial dimensions as the input (when stride=1). This prevents the feature maps from shrinking too quickly as we go deeper in the network and ensures that border pixels are processed as many times as central pixels. Without padding, each convolution reduces the spatial dimensions.\n", + "\n", + "**Explanation/Context:**\n", + "\n", + "The two main padding options are 'valid' (no padding) and 'same' (zero padding to maintain dimensions). The choice depends on the network architecture and desired output size." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 16\n", + "\n", + "**What is a TensorFlow Dataset object and why is it useful for training deep learning models?**\n", + "\n", + "**Expected Answer:**\n", + "\n", + "A TensorFlow Dataset (tf.data.Dataset) is an abstraction for representing a sequence of elements (like images and labels) that provides efficient data loading, preprocessing, and batching. It's useful because it can prefetch data, apply transformations in parallel, and handle large datasets that don't fit in memory. This improves training performance by ensuring the GPU always has data to process without waiting for data loading.\n", + "\n", + "**Explanation/Context:**\n", + "\n", + "Dataset objects can be created from directories, numpy arrays, or other sources, and support operations like map, batch, shuffle, and prefetch for efficient data pipelines." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 17\n", + "\n", + "**Give three examples of data augmentation techniques commonly used for image classification.**\n", + "\n", + "**Expected Answer:**\n", + "\n", + "Common data augmentation techniques include: (1) Random horizontal flips - mirrors images left-to-right, (2) Random rotations - rotates images by a small angle, and (3) Random zooms - zooms in or out of the image. Other examples include random crops, brightness/contrast adjustments, and translations. These transformations create variations of training images while preserving their labels.\n", + "\n", + "**Explanation/Context:**\n", + "\n", + "Data augmentation is implemented in Keras using layers like RandomFlip, RandomRotation, and RandomZoom, which can be included in the model or applied to the dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 18\n", + "\n", + "**In the context of CNNs, what does 'filters' parameter represent in a Conv2D layer?**\n", + "\n", + "**Expected Answer:**\n", + "\n", + "The 'filters' parameter specifies the number of convolutional filters (also called kernels) in the layer, which determines the depth/number of channels in the output feature maps. For example, Conv2D(filters=64) will produce 64 different feature maps, each detecting different patterns in the input. 
More filters allow the network to learn more diverse features but increase computational cost and parameters.\n", + "\n", + "**Explanation/Context:**\n", + "\n", + "Typical CNN architectures progressively increase the number of filters in deeper layers (e.g., 32 → 64 → 128 → 256) to capture increasingly complex features." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 19\n", + "\n", + "**Why might you want to use a pretrained model instead of training from scratch?**\n", + "\n", + "**Expected Answer:**\n", + "\n", + "Pretrained models have already learned general visual features from large datasets (like ImageNet with millions of images), which saves training time and computational resources. They often achieve better performance, especially on small datasets, because they can leverage knowledge from the source task. Training from scratch requires large amounts of data and computational power, which may not be available.\n", + "\n", + "**Explanation/Context:**\n", + "\n", + "Transfer learning with pretrained models is a standard practice in computer vision, allowing practitioners to achieve state-of-the-art results with limited resources." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 20\n", + "\n", + "**What is the relationship between the kernel_size and the number of parameters in a Conv2D layer?**\n", + "\n", + "**Expected Answer:**\n", + "\n", + "The number of parameters in a Conv2D layer is calculated as: kernel_size × kernel_size × input_channels × filters + filters (biases). For example, a 3x3 kernel with 64 input channels and 128 output filters has 3 × 3 × 64 × 128 + 128 = 73,856 parameters. Larger kernel sizes significantly increase the parameter count and computational cost.\n", + "\n", + "**Explanation/Context:**\n", + "\n", + "Most modern CNNs use small kernels (typically 3x3) to reduce parameters while maintaining expressive power through depth. Larger kernels like 5x5 or 7x7 are less common." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "## Part 3: Descriptive Questions (10 Questions)\n", + "\n", + "*Instructions: Explain the concepts or code logic in detail (5-7 sentences).*\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 21\n", + "\n", + "**Describe the complete architecture of the MNIST CNN model shown below and explain the purpose of each layer:**\n", + "\n", + "```python\n", + "inputs = keras.Input(shape=(28, 28, 1))\n", + "x = layers.Conv2D(filters=64, kernel_size=3, activation=\"relu\")(inputs)\n", + "x = layers.MaxPooling2D(pool_size=2)(x)\n", + "x = layers.Conv2D(filters=128, kernel_size=3, activation=\"relu\")(x)\n", + "x = layers.MaxPooling2D(pool_size=2)(x)\n", + "x = layers.Conv2D(filters=256, kernel_size=3, activation=\"relu\")(x)\n", + "x = layers.GlobalAveragePooling2D()(x)\n", + "outputs = layers.Dense(10, activation=\"softmax\")(x)\n", + "model = keras.Model(inputs=inputs, outputs=outputs)\n", + "```\n", + "\n", + "**Model Answer:**\n", + "\n", + "This CNN architecture processes 28×28 grayscale MNIST digit images through a hierarchical feature extraction pipeline. The model starts with an Input layer accepting (28, 28, 1) shaped tensors. The first Conv2D layer with 64 filters applies 3×3 kernels to extract low-level features like edges and corners, producing (26, 26, 64) feature maps with ReLU activation for non-linearity. 
The first MaxPooling2D layer downsamples to (13, 13, 64), reducing spatial dimensions by half while retaining important features.\n", + "\n", + "The second Conv2D layer with 128 filters learns mid-level features from the pooled outputs, creating (11, 11, 128) feature maps. Another MaxPooling2D reduces this to (5, 5, 128). The third Conv2D layer with 256 filters extracts high-level, abstract features, producing (3, 3, 256) feature maps. The GlobalAveragePooling2D layer compresses each of the 256 feature maps into a single value by averaging, resulting in a 256-dimensional vector. Finally, a Dense layer with 10 units and softmax activation produces probability distributions over the 10 digit classes. This architecture progressively increases filter depth while decreasing spatial dimensions, a common pattern in CNNs that builds increasingly abstract representations." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 22\n", + "\n", + "**Explain how convolution operation works and why it's particularly effective for image processing compared to fully connected layers.**\n", + "\n", + "**Model Answer:**\n", + "\n", + "The convolution operation works by sliding a small learnable filter (kernel) across the input image, computing element-wise products at each position and summing them to produce output values. For a 3×3 kernel on an image, at each position the kernel overlaps with a 3×3 patch of the image, multiplies corresponding values, and sums them to create one output pixel. This sliding window approach means the same filter weights are reused across the entire image, a property called parameter sharing or weight tying.\n", + "\n", + "Convolutions are particularly effective for images because they exploit two key properties: local connectivity and translation invariance. Local connectivity means each filter only looks at small regions of the input, which aligns with how visual features work - edges, textures, and patterns are local phenomena. Translation invariance means the same filter can detect the same feature anywhere in the image, so a filter that learns to detect horizontal edges works equally well at any position.\n", + "\n", + "Compared to fully connected layers, convolutions are vastly more efficient. A fully connected layer treating a 28×28 image as 784 inputs connected to 784 outputs would have 784×784 = 614,656 parameters, while a 3×3 convolutional filter has only 9 parameters (plus bias). This massive parameter reduction prevents overfitting, reduces memory requirements, and allows training on smaller datasets. Additionally, convolutions preserve spatial structure, while fully connected layers destroy it by flattening inputs, losing valuable information about spatial relationships between pixels." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 23\n", + "\n", + "**Discuss the trade-offs between using 'valid' padding versus 'same' padding in convolutional layers. When would you choose each option?**\n", + "\n", + "**Model Answer:**\n", + "\n", + "'Valid' padding means no padding is added, so the convolution only operates where the kernel fully fits within the input. With a 3×3 kernel on a 28×28 image, the output becomes 26×26 because the kernel can only be placed in 26 positions horizontally and vertically. 'Same' padding adds zeros around the borders (1 pixel for a 3×3 kernel) so the output has the same spatial dimensions as the input - 28×28 in this case. 
The choice between them involves several trade-offs.\n", + "\n", + "'Valid' padding has the advantage of not introducing artificial zero values, which means all computations use real image data. It's computationally slightly more efficient and can be useful when you want controlled dimension reduction. However, it causes rapid shrinkage of feature maps - with several layers, the spatial dimensions can become very small, limiting network depth. Additionally, border pixels are processed fewer times than central pixels, potentially losing important edge information.\n", + "\n", + "'Same' padding addresses these issues by maintaining dimensions, allowing much deeper networks without dimension collapse. It ensures border pixels get adequate processing, which can be important for tasks where edges matter. The downside is the introduction of zero values that don't represent real data, though in practice this rarely causes problems. Most modern architectures use 'same' padding as the default because it provides more flexibility in network design - you can add many convolutional layers without worrying about dimensions shrinking too quickly, and control downsampling explicitly through MaxPooling or strided convolutions instead." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 24\n", + "\n", + "**Explain the concept of transfer learning in deep learning and describe the difference between feature extraction and fine-tuning approaches.**\n", + "\n", + "**Model Answer:**\n", + "\n", + "Transfer learning is the practice of using a model pretrained on one task (usually ImageNet classification with 1.4 million images across 1000 categories) and adapting it to a different but related task. The key insight is that lower and middle layers of CNNs learn general-purpose visual features - edges, textures, shapes, and object parts - that are useful across many computer vision tasks. Rather than training from scratch, we leverage this learned knowledge, which is especially valuable when we have limited training data or computational resources.\n", + "\n", + "Feature extraction is the simpler transfer learning approach where we freeze all pretrained layers, making them non-trainable. We remove the original top layers (classifier) and add new layers specific to our task, then train only these new layers. For example, using a pretrained VGG16 or ResNet model, we'd freeze the convolutional base and train only a new Dense layer for our classes. This is fast and works well when the new task is similar to the pretraining task, as the pretrained features are already highly relevant. It requires minimal data since we're only learning the final classification layer.\n", + "\n", + "Fine-tuning goes a step further by unfreezing some or all of the pretrained layers and continuing training with a very low learning rate. This allows the pretrained weights to adapt to the specific characteristics of the new dataset. Typically, we first do feature extraction to train the new top layers, then unfreeze deeper layers gradually and fine-tune. Fine-tuning requires more data than pure feature extraction and risks overfitting on small datasets, but can achieve better performance when the new task differs somewhat from the pretraining task. The low learning rate is crucial - high learning rates would destroy the pretrained weights and lose the transfer learning benefit." 
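+ "\n",
+ "A hedged sketch of the gradual-unfreezing step described above (assuming a VGG16-style `base`; unfreezing only the deepest convolutional block is one illustrative choice):\n",
+ "\n",
+ "```python\n",
+ "# Unfreeze the whole base, then re-freeze everything except the last block.\n",
+ "base.trainable = True\n",
+ "for layer in base.layers:\n",
+ "    if not layer.name.startswith(\"block5\"):  # VGG16's deepest conv block\n",
+ "        layer.trainable = False\n",
+ "\n",
+ "# Recompile (required after changing trainability) with a very low learning rate.\n",
+ "model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-5),\n",
+ "              loss=\"binary_crossentropy\", metrics=[\"accuracy\"])\n",
+ "```"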
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 25\n", + "\n", + "**Describe the complete data preprocessing pipeline for training a CNN on the dogs vs cats dataset, including the role of `image_dataset_from_directory`.**\n", + "\n", + "**Model Answer:**\n", + "\n", + "The data preprocessing pipeline for the dogs vs cats dataset begins with organizing images into a directory structure where subdirectories represent classes - for example, `dogs_vs_cats_small/train/cat/` and `dogs_vs_cats_small/train/dog/`. The `image_dataset_from_directory` function reads this structure and creates a tf.data.Dataset object, automatically assigning labels based on directory names. We specify parameters like image_size=(180, 180) to resize all images to a consistent size, and batch_size (e.g., 32) to group images for efficient processing.\n", + "\n", + "The Dataset object provides several key benefits: it loads images on-the-fly from disk rather than loading everything into memory, applies batching automatically, and can prefetch data to ensure the GPU always has data ready. We typically create separate datasets for train, validation, and test sets using the same approach. These datasets can be passed directly to model.fit() without additional conversion.\n", + "\n", + "Additional preprocessing steps are incorporated into the model itself using Keras layers. A Rescaling(1./255) layer normalizes pixel values from [0, 255] to [0, 1]. For data augmentation during training, layers like RandomFlip('horizontal'), RandomRotation(0.1), and RandomZoom(0.2) are added. An important detail is that augmentation layers should only be applied to training data, not validation or test data, to ensure consistent evaluation. The complete pipeline - loading, resizing, batching, rescaling, and augmentation - is efficient, reproducible, and takes advantage of TensorFlow's optimization for parallel data loading and preprocessing while the GPU handles model training." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 26\n", + "\n", + "**Explain why data augmentation is important for training CNNs, particularly on small datasets. Provide specific examples of how it helps prevent overfitting.**\n", + "\n", + "**Model Answer:**\n", + "\n", + "Data augmentation is a regularization technique that artificially increases the effective size and diversity of the training dataset by applying random transformations to images during training. Without augmentation, a model trained on a small dataset (say, 2000 images) might memorize specific details of those exact images rather than learning generalizable features. For example, if all training images of cats show them facing right, the model might learn that \"cats face right\" instead of learning actual cat features. Data augmentation addresses this by showing the model variations it wouldn't otherwise see.\n", + "\n", + "Common augmentations like random horizontal flips teach the model that objects can appear in different orientations - a cat facing left is still a cat. Random rotations (e.g., ±10 degrees) help the model handle images taken at slight angles. Random zooms simulate objects at different distances from the camera. Random translations, brightness/contrast adjustments, and crops further increase variety. 
Critically, these transformations preserve the label - a horizontally flipped dog image is still a dog - so they provide new training examples without requiring manual labeling.\n", + "\n", + "The overfitting prevention mechanism works through multiple pathways. First, augmentation forces the model to be invariant to these transformations, learning more robust features. A model that learns \"pointy ears\" as a cat feature is more robust than one that learns \"pointy ears in the top-right corner.\" Second, since augmentations are random and applied on-the-fly during training, the model technically never sees the exact same image twice, which is equivalent to having a much larger dataset. Third, augmentation acts as a form of noise injection, making the model more resilient. The result is significantly improved generalization - in experiments, augmentation can improve test accuracy by 5-15% on small datasets, making the difference between a model that overfits and one that generalizes well." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 27\n", + "\n", + "**Compare and contrast MaxPooling2D and GlobalAveragePooling2D in terms of their functionality, when to use each, and their impact on the model.**\n", + "\n", + "**Model Answer:**\n", + "\n", + "MaxPooling2D and GlobalAveragePooling2D are both downsampling operations but serve different purposes in CNN architectures. MaxPooling2D operates locally, taking the maximum value from small windows (typically 2×2) across the feature maps. For instance, with pool_size=2, it divides each spatial dimension by 2, transforming a (26, 26, 64) tensor into (13, 13, 64). The operation preserves the number of channels and retains the strongest activations within each window, providing some translation invariance and reducing computational cost for subsequent layers.\n", + "\n", + "GlobalAveragePooling2D, in contrast, operates globally across entire feature maps. It computes the average of all spatial positions for each channel, completely eliminating spatial dimensions. A (3, 3, 256) tensor becomes a (256,) vector. This is typically used as the final pooling operation before the classification layer, replacing the traditional approach of Flatten + Dense layers. Its primary benefit is dramatic parameter reduction - with GlobalAveragePooling2D followed by Dense(10), we only need 256 × 10 = 2,560 parameters, compared to thousands or millions with Flatten + Dense.\n", + "\n", + "Use MaxPooling2D throughout the network between convolutional blocks to progressively reduce spatial dimensions while maintaining spatial information for deeper processing. Use GlobalAveragePooling2D once at the end to transition from spatial feature maps to a feature vector for classification. MaxPooling2D is about downsampling while preserving some spatial structure; GlobalAveragePooling2D is about completely pooling spatial information into channel-wise features. Some modern architectures (like ResNets) exclusively use strided convolutions instead of MaxPooling2D for downsampling, but GlobalAveragePooling2D remains popular as a parameter-efficient final pooling layer that reduces overfitting compared to fully connected layers." 
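+ "\n",
+ "A small shape-check sketch contrasting the two pooling operations (the input tensor size is illustrative):\n",
+ "\n",
+ "```python\n",
+ "import tensorflow as tf\n",
+ "from tensorflow.keras import layers\n",
+ "\n",
+ "feature_maps = tf.random.normal((1, 26, 26, 64))  # batch of one\n",
+ "\n",
+ "local = layers.MaxPooling2D(pool_size=2)(feature_maps)\n",
+ "print(local.shape)       # (1, 13, 13, 64): spatial dims halved, channels kept\n",
+ "\n",
+ "global_avg = layers.GlobalAveragePooling2D()(feature_maps)\n",
+ "print(global_avg.shape)  # (1, 64): one averaged value per channel\n",
+ "```"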
+ ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 28\n", + "\n", + "**Analyze the following model training code and explain each component:**\n", + "\n", + "```python\n", + "model.compile(\n", + " optimizer=\"adam\",\n", + " loss=\"sparse_categorical_crossentropy\",\n", + " metrics=[\"accuracy\"]\n", + ")\n", + "\n", + "model.fit(train_images, train_labels, epochs=5, batch_size=64)\n", + "```\n", + "\n", + "**Model Answer:**\n", + "\n", + "This code demonstrates the standard workflow for compiling and training a Keras model. The compile() method configures the learning process by specifying three essential components: optimizer, loss function, and metrics. The optimizer=\"adam\" choice uses the Adam (Adaptive Moment Estimation) optimizer, which is a sophisticated gradient descent variant that adapts learning rates for each parameter individually. Adam is popular because it works well with default parameters (learning_rate=0.001), handles sparse gradients well, and typically converges faster than basic SGD.\n", + "\n", + "The loss=\"sparse_categorical_crossentropy\" specifies the objective function to minimize during training. This loss is appropriate for multi-class classification problems where labels are integers (0, 1, 2, ..., 9 for MNIST) rather than one-hot encoded vectors. If labels were one-hot encoded, we'd use \"categorical_crossentropy\" instead. The \"sparse\" version is more memory-efficient and convenient when working with integer labels. Crossentropy measures the difference between predicted probability distributions and true labels, penalizing confident wrong predictions more heavily.\n", + "\n", + "The metrics=[\"accuracy\"] parameter specifies what to track during training. While the model minimizes loss, accuracy is often more interpretable for users - it's simply the percentage of correct predictions. During training, both loss and accuracy are displayed for monitoring. The fit() method then executes training: it runs for 5 epochs (complete passes through the training data), processing 64 images at a time (batch_size=64). This means in each epoch, 60,000 training images are divided into batches of 64, resulting in 938 batches per epoch. After each batch, the model computes gradients and updates weights. Batch size affects training speed (larger batches are faster on GPUs) and generalization (smaller batches add noise that can help escape local minima)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 29\n", + "\n", + "**Explain why CNNs achieve higher accuracy on MNIST compared to fully connected networks, discussing the architectural advantages that contribute to better performance.**\n", + "\n", + "**Model Answer:**\n", + "\n", + "CNNs achieve superior performance on MNIST (and image tasks generally) compared to fully connected networks due to three fundamental architectural principles: local connectivity, parameter sharing, and translation invariance, combined with hierarchical feature learning. A fully connected network treating a 28×28 MNIST image as 784 independent features ignores the critical spatial structure of images - it doesn't know that neighboring pixels are related. It must learn \"edge at position (5,10)\" and \"edge at position (6,10)\" as completely separate patterns, wasting parameters and learning capacity.\n", + "\n", + "CNNs, through convolutional layers, exploit local connectivity by processing small neighborhoods of pixels together. 
A 3×3 filter learns to detect a pattern (like a vertical edge) in its receptive field, and through parameter sharing, this same filter is applied across the entire image. This means the network learns \"vertical edge detector\" once and reuses it everywhere, dramatically reducing parameters from millions to thousands. This efficiency allows deeper networks that can learn more sophisticated features without overfitting on limited data. The reduced parameter count also means the model needs fewer training examples to learn effectively.\n", + "\n", + "The hierarchical architecture of CNNs - alternating Conv2D and MaxPooling2D layers - builds progressively abstract representations. Early layers detect simple features like edges and corners. Middle layers combine these into textures and simple shapes. Deeper layers recognize digit parts like loops and strokes. Final layers identify complete digits. This hierarchical processing mirrors how the visual cortex processes images and is much more powerful than a flat fully connected architecture. MaxPooling adds translation invariance - a digit slightly shifted remains recognizable - which fully connected networks lack. These combined advantages typically yield 98-99% accuracy on MNIST for CNNs versus 95-97% for fully connected networks, with the gap widening dramatically on more complex image datasets." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Question 30\n", + "\n", + "**Describe the complete workflow for implementing transfer learning with a pretrained model (like VGG16 or ResNet), from loading the model to training on a new dataset.**\n", + "\n", + "**Model Answer:**\n", + "\n", + "Implementing transfer learning begins with loading a pretrained model from Keras Applications (e.g., VGG16, ResNet50, or InceptionV3) using code like `base_model = keras.applications.VGG16(weights='imagenet', include_top=False, input_shape=(180, 180, 3))`. The `weights='imagenet'` loads pretrained weights, `include_top=False` removes the original classifier head (designed for 1000 ImageNet classes), and `input_shape` specifies our input dimensions. Setting `base_model.trainable = False` freezes all pretrained layers, preventing their weights from being updated during initial training.\n", + "\n", + "Next, we build our custom classifier on top of the frozen base. This typically involves adding GlobalAveragePooling2D to convert feature maps to vectors, followed by Dense layers for our specific task. For example: `x = GlobalAveragePooling2D()(base_model.output)`, then `x = Dense(256, activation='relu')(x)`, and finally `outputs = Dense(num_classes, activation='softmax')(x)`. We create the complete model with `model = keras.Model(inputs=base_model.input, outputs=outputs)`. This new model uses pretrained convolutional layers as a fixed feature extractor and trains only the new classification layers.\n", + "\n", + "For the feature extraction phase, we compile with a standard learning rate (e.g., 0.001) and train for several epochs until validation accuracy plateaus. This trains only the top layers while keeping the pretrained base frozen. Optionally, for fine-tuning, we then unfreeze some or all of the base model layers: `base_model.trainable = True`, possibly freezing early layers with layer-specific trainability. We must recompile after changing trainability. For fine-tuning, we use a much lower learning rate (e.g., 0.0001 or 1e-5) to avoid destroying pretrained weights with large updates. 
We train for additional epochs, monitoring validation metrics closely to avoid overfitting. This two-phase approach - feature extraction then fine-tuning - typically achieves better results than either alone, leveraging pretrained knowledge while adapting to the specific characteristics of the new dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "---\n", + "\n", + "## End of Exam\n", + "\n", + "**Note:** This exam is designed for study and practice purposes. All answers and explanations are provided to facilitate learning of computer vision and deep learning concepts covered in the NB-computer_vision_intro.ipynb notebook." + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.0" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}