+ Two definitions of Machine Learning are offered. Arthur Samuel described it as: "the field of study that gives computers the ability to learn without being explicitly programmed." This is an older, informal definition.
+
+
+ Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
+
+
+ Example: playing checkers.
+
+
+ E = the experience of playing many games of checkers
+
+
+ T = the task of playing checkers.
+
+
+ P = the probability that the program will win the next game.
+
+
+ In general, any machine learning problem can be assigned to one of two broad classifications:
+
+
+ supervised learning, OR
+
+
+ unsupervised learning.
+
+
+
+ Supervised Learning
+
+
+
+ In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output.
+
+
+ Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories. A description of continuous and discrete data can be found on Math is Fun.
+
+
+
+ Example 1:
+
+
+
+ Given data about the size of houses on the real estate market, try to predict their price. Price as a function of size is a continuous output, so this is a regression problem.
+
+
+ We could turn this example into a classification problem by instead making our output about whether the house "sells for more or less than the asking price." Here we are classifying the houses based on price into two discrete categories.
+
+
+
+ Example 2:
+
+
+ (a) Regression - Given a picture of a person, we have to predict their age on the basis of the given picture.
+
+
+ (b) Classification - Given a picture of a person, we have to predict whether they are of high school, college, or graduate age. Another example of classification: banks have to decide whether or not to give someone a loan on the basis of their credit history.
+
+
+
+ Unsupervised Learning
+
+
+
+ Unsupervised learning, on the other hand, allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables.
+
+
+ We can derive this structure by clustering the data based on relationships among the variables in the data.
+
+
+ With unsupervised learning there is no feedback based on the prediction results, i.e., there is no teacher to correct you.
+
+
+
+ Example:
+
+
+
+ Clustering: Take a collection of 1000 essays written on the US Economy, and find a way to automatically group these essays into a small number that are somehow similar or related by different variables, such as word frequency, sentence length, page count, and so on.
+
+ Recall that in regression problems, we are taking input variables and trying to fit the output onto a continuous expected result function.
+
+
+ Linear regression with one variable is also known as "univariate linear regression."
+
+
+ Univariate linear regression is used when you want to predict a single output value y from a single input value x. We're doing supervised learning here, so that means we already have an idea about what the input/output cause and effect should be.
+
+ Our hypothesis function has the general form: $$\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x$$
+
+ Note that this is like the equation of a straight line. We give $$h_\theta(x)$$ values for $$\theta_0$$ and $$\theta_1$$ to get our estimated output $$\hat{y}$$. In other words, we are trying to create a function called $$h_\theta$$ that is trying to map our input data (the x's) to our output data (the y's).
+
+
+ Example:
+
+
+ Suppose we have the following set of training data:
+
+ input x    output y
+    0          4
+    1          7
+    2          7
+    3          8
+ Now we can make a random guess about our $$h_\theta$$ function: $$\theta_0=2$$ and $$\theta_1=2$$. The hypothesis function becomes $$h_\theta(x)=2+2x$$.
+
+
+ So for input of 1 to our hypothesis, y will be 4. This is off by 3. Note that we will be trying out various values of $$\theta_0$$ and $$\theta_1$$ to try to find values which provide the best possible "fit" or the most representative "straight line" through the data points mapped on the x-y plane.
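+
+ As a quick sanity check, here is a minimal Octave sketch (the variable names are mine, not from the course) that evaluates this guessed hypothesis on the training data above and shows how far off each prediction is:
+
+x = [0; 1; 2; 3];                    % training inputs
+y = [4; 7; 7; 8];                    % training outputs
+theta0 = 2;  theta1 = 2;             % our random guess
+predictions = theta0 + theta1*x;     % h_theta(x) for every example
+errors = predictions - y             % [-2; -3; -1; 0], e.g. off by 3 at x = 1
+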
+
+
+ Cost Function
+
+
+ We can measure the accuracy of our hypothesis function by using a cost function. This takes an average (actually a fancier version of an average) of all the results of the hypothesis with inputs from x's compared to the actual output y's.
+
+ The cost function is $$J(\theta_0, \theta_1) = \dfrac{1}{2m} \displaystyle\sum_{i=1}^m \left(h_\theta(x_{i}) - y_{i}\right)^2$$. To break it apart, it is $$\frac{1}{2} \bar{x}$$ where $$\bar{x}$$ is the mean of the squares of $$h_\theta (x_{i}) - y_{i}$$, or the difference between the predicted value and the actual value.
+
+
+ This function is otherwise called the "Squared error function", or "Mean squared error". The mean is halved $$\left(\frac{1}{2m}\right)$$ as a convenience for the computation of the gradient descent, as the derivative term of the square function will cancel out the $$\frac{1}{2}$$ term.
+
+
+ Now we are able to concretely measure the accuracy of our predictor function against the correct results we have so that we can predict new results we don't have.
+
+
+ If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We are trying to make straight line (defined by $$h_\theta(x)$$) which passes through this scattered set of data. Our objective is to get the best possible line. The best possible line will be such so that the average squared vertical distances of the scattered points from the line will be the least. In the best case, the line should pass through all the points of our training data set. In such a case the value of $$J(\theta_0, \theta_1)$$ will be 0.
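+
+ Here is a minimal Octave sketch (variable names are mine) that computes this cost for the small training set above with the guess $$\theta_0 = 2$$, $$\theta_1 = 2$$:
+
+x = [0; 1; 2; 3];  y = [4; 7; 7; 8];
+theta = [2; 2];                          % [theta0; theta1]
+X = [ones(length(x), 1), x];             % prepend the x0 = 1 column
+m = length(y);
+J = 1/(2*m) * sum((X*theta - y).^2)      % squared-error cost, J = 1.75 here
+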
+
+
+ ML:Gradient Descent
+
+
+ So we have our hypothesis function and we have a way of measuring how well it fits the data. Now we need to estimate the parameters in the hypothesis function. That's where gradient descent comes in.
+
+
+ Imagine that we graph our hypothesis function based on its fields $$\theta_0$$ and $$\theta_1$$ (actually we are graphing the cost function as a function of the parameter estimates). This can be kind of confusing; we are moving up to a higher level of abstraction. We are not graphing x and y itself, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.
+
+
+ We put $$\theta_0$$ on the x axis and $$\theta_1$$ on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters.
+
+
+ We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum.
+
+
+ The way we do this is by taking the derivative (the tangential line to a function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction with the steepest descent, and the size of each step is determined by the parameter α, which is called the learning rate.
+
+ $$\theta_j := \theta_j - \alpha [\text{Slope of tangent aka derivative in j dimension}]$$
+
+
+
+ Gradient Descent for Linear Regression
+
+
+
+ When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to (the derivation of the formulas is out of the scope of this course, but a really great one can be found here):
+
+ $$\begin{align*} \text{repeat until convergence: } \lbrace & \newline \theta_0 := & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i}) \newline \theta_1 := & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right) \newline \rbrace& \end{align*}$$
+
+ where m is the size of the training set, $$\theta_0$$ is a constant that will be changing simultaneously with $$\theta_1$$, and $$x_{i}, y_{i}$$ are values of the given training set (data).
+
+
+ Note that we have separated out the two cases for $$\theta_j$$ into separate equations for $$\theta_0$$ and $$\theta_1$$; and that for $$\theta_1$$ we are multiplying $$x_{i}$$ at the end due to the derivative.
+
+
+ The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate.
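+
+ A minimal Octave sketch of these repeated, simultaneous updates (the data set is the small one used earlier; the learning rate and iteration count are illustrative choices of mine):
+
+X = [ones(4,1), [0; 1; 2; 3]];  y = [4; 7; 7; 8];  m = length(y);
+alpha = 0.1;  num_iters = 1500;
+theta = zeros(2, 1);                     % start from an arbitrary guess
+for iter = 1:num_iters
+  grad = (1/m) * X' * (X*theta - y);     % both partial derivatives at once
+  theta = theta - alpha * grad;          % simultaneous update of theta0 and theta1
+end
+theta                                    % converges to roughly [4.7; 1.2]
+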
+
+
+
+ Gradient Descent for Linear Regression: visual worked example
+
+
+ ML:Linear Algebra Review
+
+
+ Matrices and Vectors
+
+
+ Matrices are 2-dimensional arrays:
+ $$\begin{bmatrix} a & b & c \newline d & e & f \newline g & h & i \newline j & k & l\end{bmatrix}$$
+
+
+
+
+
+
+
+ The above matrix has four rows and three columns, so it is a 4 x 3 matrix.
+
+
+ A vector is a matrix with one column and many rows:
+
+
+
+
+
+ $$\begin{bmatrix} w \newline x \newline y \newline z \end{bmatrix}$$
+
+
+
+
+
+
+
+ So vectors are a subset of matrices. The above vector is a 4 x 1 matrix.
+
+
+
+ Notation and terms:
+
+
+
+
+ $$A_{ij}$$ refers to the element in the ith row and jth column of matrix A.
+
+
+
+
+ A vector with 'n' rows is referred to as an 'n'-dimensional vector
+
+
+
+
+ $$v_i$$ refers to the element in the ith row of the vector.
+
+
+
+
+ In general, all our vectors and matrices will be 1-indexed. Note that for some programming languages, the arrays are 0-indexed.
+
+
+
+
+ Matrices are usually denoted by uppercase names while vectors are lowercase.
+
+
+
+
+ "Scalar" means that an object is a single value, not a vector or matrix.
+
+
+
+
+ $$\mathbb{R}$$ refers to the set of scalar real numbers
+
+
+
+
+ $$\mathbb{R^n}$$ refers to the set of n-dimensional vectors of real numbers
+
+
+
+
+ Addition and Scalar Multiplication
+
+
+ Addition and subtraction are element-wise, so you simply add or subtract each corresponding element:
+
+
+
+
+
+ $$\begin{bmatrix} a & b \newline c & d \newline \end{bmatrix} +\begin{bmatrix} w & x \newline y & z \newline \end{bmatrix} =\begin{bmatrix} a+w & b+x \newline c+y & d+z \newline \end{bmatrix}$$
+
+
+
+
+
+ To add or subtract two matrices, their dimensions must be the same.
+
+
+ In scalar multiplication, we simply multiply every element by the scalar value:
+
+
+
+
+
+ $$\begin{bmatrix} a & b \newline c & d \newline \end{bmatrix} * x =\begin{bmatrix} a*x & b*x \newline c*x & d*x \newline \end{bmatrix}$$
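+
+ In Octave these element-wise operations look like this (a small illustrative sketch):
+
+A = [1 2; 3 4];  B = [5 6; 7 8];
+A + B          % element-wise addition:     [6 8; 10 12]
+A - B          % element-wise subtraction:  [-4 -4; -4 -4]
+3 * A          % scalar multiplication:     [3 6; 9 12]
+A / 2          % scalar division:           [0.5 1; 1.5 2]
+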
+
+
+
+
+
+
+
+ Matrix-Vector Multiplication
+
+
+ We map the column of the vector onto each row of the matrix, multiplying each element and summing the result.
+
+
+
+
+
+ $$\begin{bmatrix} a & b \newline c & d \newline e & f \end{bmatrix} *\begin{bmatrix} x \newline y \newline \end{bmatrix} =\begin{bmatrix} a*x + b*y \newline c*x + d*y \newline e*x + f*y\end{bmatrix}$$
+
+
+
+
+
+ The result is a vector. The vector must be the second term of the multiplication. The number of columns of the matrix must equal the number of rows of the vector.
+
+
+ An m x n matrix multiplied by an n x 1 vector results in an m x 1 vector.
+
+
+ Matrix-Matrix Multiplication
+
+
+ We multiply two matrices by breaking it into several vector multiplications and concatenating the result
+
+
+
+
+
+ $$\begin{bmatrix} a & b \newline c & d \newline e & f \end{bmatrix} *\begin{bmatrix} w & x \newline y & z \newline \end{bmatrix} =\begin{bmatrix} a*w + b*y & a*x + b*z \newline c*w + d*y & c*x + d*z \newline e*w + f*y & e*x + f*z\end{bmatrix}$$
+
+
+
+
+
+ An m x n matrix multiplied by an n x o matrix results in an m x o matrix. In the above example, a 3 x 2 matrix times a 2 x 2 matrix resulted in a 3 x 2 matrix.
+
+
+ To multiply two matrices, the number of columns of the first matrix must equal the number of rows of the second matrix.
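+
+ A small Octave sketch of both kinds of multiplication (illustrative values):
+
+A = [1 2; 3 4; 5 6];           % 3 x 2 matrix
+v = [7; 8];                    % 2 x 1 vector
+A * v                          % 3 x 1 result: [23; 53; 83]
+B = [1 0; 0 2];                % 2 x 2 matrix
+A * B                          % 3 x 2 result: [1 4; 3 8; 5 12]
+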
+
+
+ Matrix Multiplication Properties
+
+
+
+
+ Not commutative. A∗B≠B∗A
+
+
+
+
+ Associative. (A∗B)∗C=A∗(B∗C)
+
+
+
+
+ The identity matrix, when multiplied by any matrix of the same dimensions, results in the original matrix. It's just like multiplying numbers by 1. The identity matrix simply has 1's on the diagonal (upper left to lower right diagonal) and 0's elsewhere.
+
+ When multiplying the identity matrix after some matrix (A∗I), the square identity matrix should match the other matrix's columns. When multiplying the identity matrix before some other matrix (I∗A), the square identity matrix should match the other matrix's rows.
+
+
+ Inverse and Transpose
+
+
+ The inverse of a matrix A is denoted $$A^{-1}$$. Multiplying by the inverse results in the identity matrix.
+
+
+ A non-square matrix does not have an inverse matrix. We can compute inverses of matrices in Octave with the pinv(A) function [1] and in MATLAB with the inv(A) function. Matrices that don't have an inverse are singular or degenerate.
+
+
+ The transposition of a matrix is like rotating the matrix 90° clockwise and then reversing it. We can compute the transpose of a matrix in MATLAB/Octave with the transpose(A) function or A':
+
+
+
+
+
+
+
+ $$A = \begin{bmatrix} a & b \newline c & d \newline e & f \end{bmatrix}$$
+
+
+
+
+
+
+
+
+ $$A^T = \begin{bmatrix} a & c & e \newline b & d & f \newline \end{bmatrix}$$
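+
+ A quick Octave sketch verifying these operations on an illustrative matrix:
+
+A = [1 2; 3 4];
+A'                 % transpose: [1 3; 2 4]
+inv(A)             % inverse:   [-2 1; 1.5 -0.5]
+A * inv(A)         % returns the 2 x 2 identity matrix (up to rounding)
+pinv(A)            % pseudo-inverse; equals inv(A) whenever A is invertible
+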
+
+ Linear regression with multiple variables is also known as "multivariate linear regression".
+
+
+ We now introduce notation for equations where we can have any number of input variables.
+
+
+
+
+
+ $$\begin{align*}x_j^{(i)} &= \text{value of feature } j \text{ in the }i^{th}\text{ training example} \newline x^{(i)}& = \text{the column vector of all the feature inputs of the }i^{th}\text{ training example} \newline m &= \text{the number of training examples} \newline n &= \left| x^{(i)} \right| ; \text{(the number of features)} \end{align*}$$
+
+
+
+
+
+ Now define the multivariable form of the hypothesis function as follows, accommodating these multiple features:
+
+ $$h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$$
+
+ In order to develop intuition about this function, we can think about $$\theta_0$$ as the basic price of a house, $$\theta_1$$ as the price per square meter, $$\theta_2$$ as the price per floor, etc. $$x_1$$ will be the number of square meters in the house, $$x_2$$ the number of floors, etc.
+
+
+ Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as:
+
+ $$h_\theta(x) = \begin{bmatrix}\theta_0 & \theta_1 & \cdots & \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix} = \theta^T x$$
+
+ This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more.
+
+
+ Remark: Note that for convenience reasons in this course Mr. Ng assumes $$x_{0}^{(i)} =1 \text{ for } (i\in { 1,\dots, m } )$$
+
+
+ [Note: So that we can do matrix operations with theta and x, we will set $$x^{(i)}_0 = 1$$ for all values of i. This makes the two vectors 'theta' and $$x^{(i)}$$ match each other element-wise (that is, have the same number of elements: n+1).]
+
+
+ The training examples are stored in X row-wise, like such:
+
+ $$X = \begin{bmatrix} x^{(1)}_0 & x^{(1)}_1 \newline x^{(2)}_0 & x^{(2)}_1 \newline x^{(3)}_0 & x^{(3)}_1 \end{bmatrix}, \quad \theta = \begin{bmatrix} \theta_0 \newline \theta_1 \end{bmatrix}$$
+
+ You can then calculate the hypothesis for the whole training set at once as a column vector: $$h_\theta(X) = X\theta$$.
+
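+
+ In Octave this vectorized hypothesis is a single matrix product (a sketch with made-up numbers):
+
+X = [1 2104; 1 1416; 1 1534];      % m = 3 examples, x0 = 1 plus one feature
+theta = [50; 0.1];                 % made-up parameter values
+predictions = X * theta            % h_theta for all examples at once: [260.4; 191.6; 203.4]
+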
+ Sometimes, the summation of the product of two terms can be expressed as the product of two vectors.
+
+
+ Here, $$x_j^{(i)}$$, for i = 1,...,m, represents the m elements of the j-th column, $$\vec{x_j}$$ , of the training set X.
+
+
+ The other term $$\left(h_\theta(x^{(i)}) - y^{(i)} \right)$$ is the vector of the deviations between the predictions $$h_\theta(x^{(i)})$$ and the true values $$y^{(i)}$$. Re-writing $$\frac{\partial J(\theta)}{\partial \theta_j}$$, we have:
+
+ $$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m} \vec{x_j}^{T} (X\theta - \vec{y})$$
+
+ Finally, the matrix notation (vectorized) of the gradient descent rule for all parameters at once is: $$\theta := \theta - \frac{\alpha}{m} X^{T} (X\theta - \vec{y})$$
+
+ We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven.
+
+
+ The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally:
+
+
+ −1 ≤ $$x_{(i)}$$ ≤ 1
+
+
+ or
+
+
+ −0.5 ≤ $$x_{(i)}$$ ≤ 0.5
+
+
+ These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few.
+
+
+ Two techniques to help with this are feature scaling and mean normalization. Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula:
+
+
+ $$x_i := \dfrac{x_i - \mu_i}{s_i}$$
+
+
+ Where $$μ_i$$ is the average of all the values for feature (i) and $$s_i$$ is the range of values (max - min), or $$s_i$$ is the standard deviation.
+
+
+ Note that dividing by the range, or dividing by the standard deviation, give different results. The quizzes in this course use range - the programming exercises use standard deviation.
+
+
+ Example: $$x_i$$ is housing prices with range of 100 to 2000, with a mean value of 1000. Then, $$x_i := \dfrac{price-1000}{1900}$$.
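+
+ A minimal Octave sketch of mean normalization using the standard deviation as $$s_i$$ (as the programming exercises do); the data and variable names are mine:
+
+X = [2104 5; 1416 3; 1534 3; 852 2];      % made-up feature matrix (size, number of bedrooms)
+mu = mean(X);                             % per-feature averages
+sigma = std(X);                           % per-feature standard deviations
+X_norm = (X - mu) ./ sigma;               % broadcasting: each feature now has mean 0, std 1
+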
+
+
+ Quiz question #1 on Feature Normalization (Week 2, Linear Regression with Multiple Variables)
+
+
+ Your answer should be rounded to exactly two decimal places. Use a '.' for the decimal point, not a ','. The tricky part of this question is figuring out which feature of which training example you are asked to normalize. Note that the mobile app doesn't allow entering a negative number (Jan 2016), so you will need to use a browser to submit this quiz if your solution requires a negative number.
+
+
+ Gradient Descent Tips
+
+
+
+ Debugging gradient descent.
+
+ Make a plot with number of iterations on the x-axis. Now plot the cost function, J(θ), over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α.
+
+
+
+ Automatic convergence test.
+
+ Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as $$10^{-3}$$. However, in practice it's difficult to choose this threshold value.
+
+
+ It has been proven that if learning rate α is sufficiently small, then J(θ) will decrease on every iteration. Andrew Ng recommends decreasing α by multiples of 3.
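+
+ A sketch of the debugging plot in Octave, reusing the small linear regression example from earlier (J_history is my own variable name):
+
+X = [ones(4,1), [0; 1; 2; 3]];  y = [4; 7; 7; 8];
+m = length(y);  alpha = 0.1;  theta = zeros(2, 1);
+num_iters = 400;
+J_history = zeros(num_iters, 1);
+for iter = 1:num_iters
+  theta = theta - (alpha/m) * X' * (X*theta - y);      % one gradient descent step
+  J_history(iter) = 1/(2*m) * sum((X*theta - y).^2);   % record the cost
+end
+plot(1:num_iters, J_history);
+xlabel('number of iterations');  ylabel('J(theta)');   % J should decrease on every iteration
+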
+
+
+ Features and Polynomial Regression
+
+
+ We can improve our features and the form of our hypothesis function in a couple different ways.
+
+
+ We can combine multiple features into one. For example, we can combine $$x_1$$ and $$x_2$$ into a new feature $$x_3$$ by taking $$x_1 \cdot x_2$$.
+
+
+
+ Polynomial Regression
+
+
+
+ Our hypothesis function need not be linear (a straight line) if that does not fit the data well.
+
+
+ We can change the behavior or curve of our hypothesis function by making it a quadratic, cubic or square root function (or any other form).
+
+
+ For example, if our hypothesis function is $$h_\theta(x) = \theta_0 + \theta_1 x_1$$ then we can create additional features based on $$x_1$$, to get the quadratic function $$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$$ or the cubic function $$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$$
+
+
+ In the cubic version, we have created new features $$x_2$$ and $$x_3$$ where $$x_2 = x_1^2$$ and $$x_3 = x_1^3$$.
+
+
+ To make it a square root function, we could do: $$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$$
+
+
+ Note that at 2:52 and through 6:22 in the "Features and Polynomial Regression" video, the curve that Prof Ng discusses about "doesn't ever come back down" is in reference to the hypothesis function that uses the sqrt() function (shown by the solid purple line), not the one that uses $$size^2$$ (shown with the dotted blue line). The quadratic form of the hypothesis function would have the shape shown with the blue dotted line if $$\theta_2$$ was negative.
+
+
+ One important thing to keep in mind is, if you choose your features this way then feature scaling becomes very important.
+
+
+ eg. if $$x_1$$ has range 1 - 1000 then range of $$x_1^2$$ becomes 1 - 1000000 and that of $$x_1^3$$ becomes 1 - 1000000000.
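+
+ For instance, a hypothetical Octave sketch that builds quadratic and cubic features from a single column x1 and then scales them:
+
+x1 = [1; 50; 200; 1000];                  % made-up raw feature with a wide range
+X_poly = [x1, x1.^2, x1.^3];              % new features: x1, x1^2, x1^3
+mu = mean(X_poly);  sigma = std(X_poly);
+X_poly = (X_poly - mu) ./ sigma;          % feature scaling is now essential
+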
+
+
+ Normal Equation
+
+
+ The "Normal Equation" is a method of finding the optimum theta
+
+ without iteration.
+
+
+
+ $$\theta = (X^T X)^{-1}X^T y$$
+
+
+ There is no need to do feature scaling with the normal equation.
+
+
+ Mathematical proof of the Normal equation requires knowledge of linear algebra and is fairly involved, so you do not need to worry about the details.
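+
+ In Octave the normal equation is a one-liner; a sketch using the small data set from earlier (X already contains the column of ones):
+
+X = [1 0; 1 1; 1 2; 1 3];  y = [4; 7; 7; 8];
+theta = pinv(X' * X) * X' * y      % approximately [4.7; 1.2], no iterations and no alpha
+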
+
+
+ Proofs are available at these links for those who are interested:
+
+ The following is a comparison of gradient descent and the normal equation:
+
+
+
+
+
+ Gradient Descent                      Normal Equation
+
+ Need to choose alpha                  No need to choose alpha
+ Needs many iterations                 No need to iterate
+ O($$kn^2$$)                           O($$n^3$$), need to calculate inverse of $$X^TX$$
+ Works well when n is large            Slow if n is very large
+
+
+ With the normal equation, computing the inversion has complexity $$\mathcal{O}(n^3)$$. So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from a normal solution to an iterative process.
+
+
+
+ Normal Equation Noninvertibility
+
+
+
+ When implementing the normal equation in octave we want to use the 'pinv' function rather than 'inv.'
+
+
+ $$X^TX$$ may be noninvertible. The common causes are:
+
+
+
+
+ Redundant features, where two features are very closely related (i.e. they are linearly dependent)
+
+
+
+
+ Too many features (e.g. m ≤ n). In this case, delete some features or use "regularization" (to be explained in a later lesson).
+
+
+
+
+ Solutions to the above problems include deleting a feature that is linearly dependent with another or deleting one or more features when there are too many features.
+
+
+ ML:Octave Tutorial
+
+
+ Basic Operations
+
+
%% Change Octave prompt
+PS1('>> ');
+%% Change working directory in windows example:
+cd 'c:/path/to/desired/directory name'
+%% Note that it uses normal slashes and does not use escape characters for the empty spaces.
+
+%% elementary operations
+5+6
+3-2
+5*8
+1/2
+2^6
+1 == 2 % false
+1 ~= 2 % true. note, not "!="
+1 && 0
+1 || 0
+xor(1,0)
+
+
+%% variable assignment
+a = 3; % semicolon suppresses output
+b = 'hi';
+c = 3>=1;
+
+% Displaying them:
+a = pi
+disp(a)
+disp(sprintf('2 decimals: %0.2f', a))
+disp(sprintf('6 decimals: %0.6f', a))
+format long
+a
+format short
+a
+
+
+%% vectors and matrices
+A = [1 2; 3 4; 5 6]
+
+v = [1 2 3]
+v = [1; 2; 3]
+v = 1:0.1:2 % from 1 to 2, with stepsize of 0.1. Useful for plot axes
+v = 1:6 % from 1 to 6, assumes stepsize of 1 (row vector)
+
+C = 2*ones(2,3) % same as C = [2 2 2; 2 2 2]
+w = ones(1,3) % 1x3 vector of ones
+w = zeros(1,3)
+w = rand(1,3) % drawn from a uniform distribution
+w = randn(1,3)% drawn from a normal distribution (mean=0, var=1)
+w = -6 + sqrt(10)*(randn(1,10000)); % (mean = -6, var = 10) - note: add the semicolon
+hist(w) % plot histogram using 10 bins (default)
+hist(w,50) % plot histogram using 50 bins
+% note: if hist() crashes, try "graphics_toolkit('gnu_plot')"
+
+I = eye(4) % 4x4 identity matrix
+
+% help function
+help eye
+help rand
+help help
+
%% dimensions
+sz = size(A) % 1x2 matrix: [(number of rows) (number of columns)]
+size(A,1) % number of rows
+size(A,2) % number of cols
+length(v) % size of longest dimension
+
+
+%% loading data
+pwd % show current directory (current path)
+cd 'C:\Users\ang\Octave files' % change directory
+ls % list files in current directory
+load q1y.dat % alternatively, load('q1y.dat')
+load q1x.dat
+who % list variables in workspace
+whos % list variables in workspace (detailed view)
+clear q1y % clear command without any args clears all vars
+v = q1x(1:10); % first 10 elements of q1x (counts down the columns)
+save hello.mat v; % save variable v into file hello.mat
+save hello.txt v -ascii; % save as ascii
+% fopen, fread, fprintf, fscanf also work [[not needed in class]]
+
+%% indexing
+A(3,2) % indexing is (row,col)
+A(2,:) % get the 2nd row.
+ % ":" means every element along that dimension
+A(:,2) % get the 2nd col
+A([1 3],:) % print all the elements of rows 1 and 3
+
+A(:,2) = [10; 11; 12] % change second column
+A = [A, [100; 101; 102]]; % append column vec
+A(:) % Select all elements as a column vector.
+
+% Putting data together
+A = [1 2; 3 4; 5 6]
+B = [11 12; 13 14; 15 16] % same dims as A
+C = [A B] % concatenating A and B matrices side by side
+C = [A, B] % concatenating A and B matrices side by side
+C = [A; B] % Concatenating A and B top and bottom
+
+
+ Computing on Data
+
+
%% initialize variables
+A = [1 2;3 4;5 6]
+B = [11 12;13 14;15 16]
+C = [1 1;2 2]
+v = [1;2;3]
+
+%% matrix operations
+A * C % matrix multiplication
+A .* B % element-wise multiplication
+% A .* C or A * B gives error - wrong dimensions
+A .^ 2 % element-wise square of each element in A
+1./v % element-wise reciprocal
+log(v) % functions like this operate element-wise on vecs or matrices
+exp(v)
+abs(v)
+
+-v % -1*v
+
+v + ones(length(v), 1)
+% v + 1 % same
+
+A' % matrix transpose
+
+%% misc useful functions
+
+% max (or min)
+a = [1 15 2 0.5]
+val = max(a)
+[val,ind] = max(a) % val - maximum element of the vector a and index - index value where maximum occur
+val = max(A) % if A is matrix, returns max from each column
+
+% compare values in a matrix & find
+a < 3 % checks which values in a are less than 3
+find(a < 3) % gives location of elements less than 3
+A = magic(3) % generates a magic matrix - not much used in ML algorithms
+[r,c] = find(A>=7) % row, column indices for values matching comparison
+
+% sum, prod
+sum(a)
+prod(a)
+floor(a) % or ceil(a)
+max(rand(3),rand(3))
+max(A,[],1) % maximum along columns (this is the default: same as max(A))
+max(A,[],2) % maximum along rows
+A = magic(9)
+sum(A,1)
+sum(A,2)
+sum(sum( A .* eye(9) ))
+sum(sum( A .* flipud(eye(9)) ))
+
+
+% Matrix inverse (pseudo-inverse)
+pinv(A) % inv(A'*A)*A'
+
+
+ Plotting Data
+
+
%% plotting
+t = [0:0.01:0.98];
+y1 = sin(2*pi*4*t);
+plot(t,y1);
+y2 = cos(2*pi*4*t);
+hold on; % "hold off" to turn off
+plot(t,y2,'r');
+xlabel('time');
+ylabel('value');
+legend('sin','cos');
+title('my plot');
+print -dpng 'myPlot.png'
+close; % or, "close all" to close all figs
+figure(1); plot(t, y1);
+figure(2); plot(t, y2);
+figure(2), clf; % can specify the figure number
+subplot(1,2,1); % Divide plot into 1x2 grid, access 1st element
+plot(t,y1);
+subplot(1,2,2); % Divide plot into 1x2 grid, access 2nd element
+plot(t,y2);
+axis([0.5 1 -1 1]); % change axis scale
+
+%% display a matrix (or image)
+figure;
+imagesc(magic(15)), colorbar, colormap gray;
+% comma-chaining function calls.
+a=1,b=2,c=3
+a=1;b=2;c=3;
+
+
+ Control statements: for, while, if statements
+
+
v = zeros(10,1);
+for i=1:10,
+ v(i) = 2^i;
+end;
+% Can also use "break" and "continue" inside for and while loops to control execution.
+
+i = 1;
+while i <= 5,
+ v(i) = 100;
+ i = i+1;
+end
+
+i = 1;
+while true,
+ v(i) = 999;
+ i = i+1;
+ if i == 6,
+ break;
+ end;
+end
+
+if v(1)==1,
+ disp('The value is one!');
+elseif v(1)==2,
+ disp('The value is two!');
+else
+ disp('The value is not one or two!');
+end
+
+
+ Functions
+
+
+ To create a function, type the function code in a text editor (e.g. gedit or notepad), and save the file as "functionName.m"
+
+
+ Example function:
+
+
function y = squareThisNumber(x)
+
+y = x^2;
+
+
+ To call the function in Octave, do either:
+
+
+ 1) Navigate to the directory of the functionName.m file and call the function:
+
+
% Navigate to directory:
+ cd /path/to/function
+
+ % Call the function:
+ functionName(args)
+
+
+ 2) Add the directory of the function to the load path and save it:
+
+ You should not use addpath/savepath for any of the assignments in this course. Instead use 'cd' to change the current working directory. Watch the video on submitting assignments in week 2 for instructions.
+
+
+
% To add the path for the current session of Octave:
+ addpath('/path/to/function/')
+
+ % To remember the path for future sessions of Octave, after executing addpath above, also do:
+ savepath
+
+
+ Octave's functions can return more than one value:
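+
+ For example, a hypothetical file squareAndCubeThisNumber.m (the name is mine) could return two values by listing them in brackets:
+
+function [y1, y2] = squareAndCubeThisNumber(x)
+  y1 = x^2;   % first return value
+  y2 = x^3;   % second return value
+end
+
+% call it with:  [a, b] = squareAndCubeThisNumber(5)   % a = 25, b = 125
+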
+
+ Vectorization is the process of taking code that relies on loops and converting it into matrix operations. It is more efficient, more elegant, and more concise.
+
+
+ As an example, let's compute our prediction from a hypothesis. Theta is the vector of fields for the hypothesis and x is a vector of variables.
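+
+ A sketch of the unvectorized loop versus the vectorized one-liner (the numbers are made up; theta and x are (n+1)-dimensional column vectors):
+
+theta = [1; 2; 3];  x = [1; 5; 6];  n = 2;   % n features plus the x0 = 1 entry
+
+% unvectorized
+prediction = 0.0;
+for j = 1:n+1,
+  prediction = prediction + theta(j) * x(j);
+end;
+
+% vectorized
+prediction = theta' * x;                     % same result: 29
+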
+
+ If you recall the definition of multiplying vectors, you'll see that this one operation does the element-wise multiplication and overall sum in a very concise notation.
+
+
+ Working on and Submitting Programming Exercises
+
+
+
+
+ Download and extract the assignment's zip file.
+
+
+
+
+ Edit the proper file 'a.m', where a is the name of the exercise you're working on.
+
+
+
+
+ Run octave and cd to the assignment's extracted directory
+
+
+
+
+ Run the 'submit' function and enter the assignment number, your email, and a password (found on the top of the "Programming Exercises" page on coursera)
+
0:24 The size command
+1:39 The length command
+2:18 File system commands
+2:25 File handling
+4:50 Who, whos, and clear
+6:50 Saving data
+8:35 Manipulating data
+12:10 Unrolling a matrix
+12:35 Examples
+14:50 Summary
+
+
+ Computing on Data
+
+
+
0:00 Matrix operations
+0:57 Element-wise operations
+4:28 Min and max
+5:10 Element-wise comparisons
+5:43 The find command
+6:00 Various commands and operations
+
+
+ Plotting data
+
+
+
0:00 Introduction
+0:54 Basic plotting
+2:04 Superimposing plots and colors
+3:15 Saving a plot to an image
+4:19 Clearing a plot and multiple figures
+4:59 Subplots
+6:15 The axis command
+6:39 Color square plots
+8:35 Wrapping up
+
+
+ Control statements
+
+
+
0:10 For loops
+1:33 While loops
+3:35 If statements
+4:54 Functions
+6:15 Search paths
+7:40 Multiple return values
+8:59 Cost function example (machine learning)
+12:24 Summary
+
+
+
+ Vectorization
+
+
+
0:00 Why vectorize?
+1:30 Example
+4:22 C++ example
+5:40 Vectorization applied to gradient descent
+9:45 Python
+ ML:Logistic Regression
+
+
+ Now we are switching from regression problems to classification problems. Don't be confused by the name "Logistic Regression"; it is named that way for historical reasons and is actually an approach to classification problems, not regression problems.
+
+
+ Binary Classification
+
+
+ Instead of our output vector y being a continuous range of values, it will only be 0 or 1.
+
+
+ y∈{0,1}
+
+
+ Where 0 is usually taken as the "negative class" and 1 as the "positive class", but you are free to assign any representation to it.
+
+
+ We're only doing two classes for now, called a "Binary Classification Problem."
+
+
+ One method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. This method doesn't work well because classification is not actually a linear function.
+
+
+ Hypothesis Representation
+
+
+ Our hypothesis should satisfy:
+
+
+ $$0 \leq h_\theta (x) \leq 1$$
+
+
+ Our new form uses the "Sigmoid Function," also called the "Logistic Function":
+
+
+
+
+
+ $$\begin{align*}& h_\theta (x) = g ( \theta^T x ) \newline \newline& z = \theta^T x \newline& g(z) = \dfrac{1}{1 + e^{-z}}\end{align*}$$
+
+
+
+
+
+
+ The function g(z), shown here, maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification. Try playing with interactive plot of sigmoid function : (
+
+ https://www.desmos.com/calculator/bgontvxotm
+
+ ).
+
+
+ We start with our old hypothesis (linear regression), except that we want to restrict the range to 0 and 1. This is accomplished by plugging $$\theta^Tx$$ into the Logistic Function.
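+
+ A sketch of the sigmoid and the hypothesis in Octave (the operations are element-wise, so g also works on vectors and matrices):
+
+g = @(z) 1 ./ (1 + exp(-z));       % sigmoid / logistic function
+g(0)                               % 0.5
+g([-10 0 10])                      % approximately [0.000045  0.5  0.999955]
+% hypothesis for one example, assuming theta and x are column vectors:
+% h = g(theta' * x);
+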
+
+
+ $$h_\theta$$ will give us the probability that our output is 1. For example, $$h_\theta(x)=0.7$$ gives us the probability of 70% that our output is 1.
+
+ Our probability that our prediction is 0 is just the complement of our probability that it is 1 (e.g. if probability that it is 1 is 70%, then the probability that it is 0 is 30%).
+
+
+ Decision Boundary
+
+
+ In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows:
+
+
+
+
+
+ $$\begin{align*}& h_\theta(x) \geq 0.5 \rightarrow y = 1 \newline& h_\theta(x) < 0.5 \rightarrow y = 0 \newline\end{align*}$$
+
+
+
+
+
+ The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5:
+
+
+
+
+
+ $$\begin{align*}& g(z) \geq 0.5 \newline& when \; z \geq 0\end{align*}$$
+
+ For example, with $$\theta = \begin{bmatrix}5 \newline -1 \newline 0\end{bmatrix}$$ we predict y = 1 when $$5 - x_1 \geq 0$$, i.e. when $$x_1 \leq 5$$. In this case, our decision boundary is a straight vertical line placed on the graph where $$x_1 = 5$$, and everything to the left of that denotes y = 1, while everything to the right denotes y = 0.
+
+
+ Again, the input to the sigmoid function g(z) (e.g. $$\theta^T X$$) doesn't need to be linear, and could be a function that describes a circle (e.g. $$z = \theta_0 + \theta_1 x_1^2 +\theta_2 x_2^2$$) or any shape to fit our data.
+
+
+ Cost Function
+
+
+ We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function.
+
+
+ Instead, our cost function for logistic regression looks like:
+
+ $$\begin{align*}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \;\; \text{if } y = 1 \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \;\; \text{if } y = 0\end{align*}$$
+
+ The more our hypothesis is off from y, the larger the cost function output. If our hypothesis is equal to y, then our cost is 0:
+
+
+
+
+
+ $$\begin{align*}& \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \; \mathrm{and} \; h_\theta(x) \rightarrow 1 \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \; \mathrm{and} \; h_\theta(x) \rightarrow 0 \newline \end{align*}$$
+
+
+
+
+
+ If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity.
+
+
+ If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity.
+
+
+ Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression.
+
+
+ Simplified Cost Function and Gradient Descent
+
+
+ We can compress our cost function's two conditional cases into one case:
+
+ $$\mathrm{Cost}(h_\theta(x),y) = - y \; \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$
+
+ Notice that when y is equal to 1, then the second term $$(1-y)\log(1-h_\theta(x))$$ will be zero and will not affect the result. If y is equal to 0, then the first term $$-y \log(h_\theta(x))$$ will be zero and will not affect the result.
+
+
+ We can fully write out our entire cost function as follows:
+
+ $$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m \left[ y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)})) \right]$$
+
+ "Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. A. Ng suggests not to write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized. Octave provides them.
+
+
+ We first need to provide a function that evaluates the following two functions for a given input value θ:
+
+ $$J(\theta) \quad \text{and} \quad \dfrac{\partial}{\partial \theta_j}J(\theta)$$
+
+ We can write a single function that returns both of these:
+
+
function [jVal, gradient] = costFunction(theta)
+ jVal = [...code to compute J(theta)...];
+ gradient = [...code to compute derivative of J(theta)...];
+end
+
+ Then we can use octave's "fminunc()" optimization algorithm along with the "optimset()" function that creates an object containing the options we want to send to "fminunc()". (Note: the value for MaxIter should be an integer, not a character string - errata in the video at 7:30)
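+
+ A sketch of what the call might look like for a two-parameter problem (the initial values and MaxIter here are illustrative):
+
+options = optimset('GradObj', 'on', 'MaxIter', 100);
+initialTheta = zeros(2, 1);
+[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
+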
+
+ We give to the function "fminunc()" our cost function, our initial vector of theta values, and the "options" object that we created beforehand.
+
+
+ Multiclass Classification: One-vs-all
+
+
+ Now we will approach the classification of data into more than two categories. Instead of y = {0,1} we will expand our definition so that y = {0,1...n}.
+
+
+ In this case we divide our problem into n+1 (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that 'y' is a member of one of our classes.
+
+
+
+
+
+ $$\begin{align*}& y \in \lbrace0, 1 ... n\rbrace \newline& h_\theta^{(0)}(x) = P(y = 0 | x ; \theta) \newline& h_\theta^{(1)}(x) = P(y = 1 | x ; \theta) \newline& \cdots \newline& h_\theta^{(n)}(x) = P(y = n | x ; \theta) \newline& \mathrm{prediction} = \max_i( h_\theta ^{(i)}(x) )\newline\end{align*}$$
+
+
+
+
+
+ We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction.
+
+
+ ML:Regularization
+
+
+
+ The Problem of Overfitting
+
+
+
+ Regularization is designed to address the problem of overfitting.
+
+
+ High bias or underfitting is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. eg. if we take $$h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2$$ then we are making an initial assumption that a linear model will fit the training data well and will be able to generalize but that may not be the case.
+
+
+ At the other extreme, overfitting or high variance is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data.
+
+
+ This terminology is applied to both linear and logistic regression. There are two main options to address the issue of overfitting:
+
+
+ 1) Reduce the number of features:
+
+
+ a) Manually select which features to keep.
+
+
+ b) Use a model selection algorithm (studied later in the course).
+
+
+ 2) Regularization
+
+
+ Keep all the features, but reduce the parameters $$\theta_j$$.
+
+
+ Regularization works well when we have a lot of slightly useful features.
+
+
+ Cost Function
+
+
+ If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost.
+
+
+ Say we wanted to make the following function more quadratic:
+
+ $$\theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3 + \theta_4 x^4$$
+
+ We'll want to eliminate the influence of $$\theta_3x^3$$ and $$\theta_4x^4$$. Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our cost function:
+
+ $$min_\theta\ \dfrac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\cdot\theta_3^2 + 1000\cdot\theta_4^2$$
+
+ We've added two extra terms at the end to inflate the cost of $$\theta_3$$ and $$\theta_4$$. Now, in order for the cost function to get close to zero, we will have to reduce the values of $$\theta_3$$ and $$\theta_4$$ to near zero. This will in turn greatly reduce the values of $$\theta_3x^3$$ and $$\theta_4x^4$$ in our hypothesis function.
+
+
+ We could also regularize all of our theta parameters in a single summation:
+
+ $$min_\theta\ \dfrac{1}{2m}\ \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2 \right]$$
+
+ The λ, or lambda, is the regularization parameter. It determines how much the costs of our theta parameters are inflated. You can visualize the effect of regularization in this interactive plot:
+
+ https://www.desmos.com/calculator/1hexc8ntqp
+
+
+
+ Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting.
+
+
+ Regularized Linear Regression
+
+
+ We can apply regularization to both linear regression and logistic regression. We will approach linear regression first.
+
+
+ Gradient Descent
+
+
+ We will modify our gradient descent function to separate out $$\theta_0$$ from the rest of the parameters because we do not want to penalize $$\theta_0$$:
+
+ $$\begin{align*} & \text{Repeat}\ \lbrace \newline & \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_0^{(i)} \newline & \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] \ \ \ \ \ j \in \lbrace 1,2...n\rbrace \newline & \rbrace \end{align*}$$
+
+ With some manipulation, the update rule for $$\theta_j$$ can also be written as: $$\theta_j := \theta_j\left(1 - \alpha\frac{\lambda}{m}\right) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$
+
+ The first term in the above equation, $$1 - \alpha\frac{\lambda}{m}$$ will always be less than 1. Intuitively you can see it as reducing the value of $$\theta_j$$ by some amount on every update.
+
+
+ Notice that the second term is now exactly the same as it was before.
+
+
+
+ Normal Equation
+
+
+
+ Now let's approach regularization using the alternate method of the non-iterative normal equation.
+
+
+ To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses:
+
+ $$\theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^T y \quad \text{where} \quad L = \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}$$
+
+ L is a matrix with 0 at the top left and 1's down the diagonal, with 0's everywhere else. It should have dimension (n+1)×(n+1). Intuitively, this is the identity matrix (though we are not including $$x_0$$), multiplied with a single real number λ.
+
+
+ Recall that if m ≤ n, then $$X^TX$$ is non-invertible. However, when we add the term λ⋅L, then $$X^TX$$ + λ⋅L becomes invertible.
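+
+ A sketch of the regularized normal equation in Octave (the data and lambda are made up; X includes the column of ones):
+
+X = [1 0; 1 1; 1 2; 1 3];  y = [4; 7; 7; 8];  lambda = 1;
+n = size(X, 2) - 1;                      % number of features, excluding x0
+L = eye(n + 1);  L(1, 1) = 0;            % identity matrix with the top-left entry zeroed
+theta = pinv(X'*X + lambda*L) * X' * y
+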
+
+
+ Regularized Logistic Regression
+
+
+ We can regularize logistic regression in a similar way that we regularize linear regression. Let's start with the cost function.
+
+
+ Cost Function
+
+
+ Recall that our cost function for logistic regression was:
+
+ $$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)})) \right]$$
+
+ We can regularize this equation by adding a term to the end:
+
+ $$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \left[ y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$
+
+
+ Note Well:
+
+ The second sum, $$\sum_{j=1}^n \theta_j^2$$, means to explicitly exclude the bias term, $$\theta_0$$. I.e. the θ vector is indexed from 0 to n (holding n+1 values, $$\theta_0$$ through $$\theta_n$$), and this sum explicitly skips $$\theta_0$$, by running from 1 to n, skipping 0.
+
+
+ Gradient Descent
+
+
+ Just like with linear regression, we will want to separately update $$\theta_0$$ and the rest of the parameters because we do not want to regularize $$\theta_0$$.
+
+ The update rule is identical in form to the gradient descent function presented for regularized linear regression; only the definition of the hypothesis changes, since here $$h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}$$.
+
+
+ Initial Ones Feature Vector
+
+
+ Constant Feature
+
+
+ As it turns out it is crucial to add a constant feature to your pool of features before starting any training of your machine. Normally that feature is just a set of ones for all your training examples.
+
+
+ Concretely, if X is your feature matrix then $$X_0$$ is a vector with ones.
+
+
+ Below are some insights to explain the reason for this constant feature. The first part draws some analogies from electrical engineering concept, the second looks at understanding the ones vector by using a simple machine learning example.
+
+
+ Electrical Engineering
+
+
+ From electrical engineering, in particular signal processing, this can be explained as DC and AC.
+
+
+ The initial feature vector X without the constant term captures the dynamics of your model. That means those features particularly record changes in your output y - in other words, changing some feature $$X_i$$ where $$i\not= 0$$ will cause a change in the output y. AC is normally made out of many components or harmonics; hence we also have many features (yet we have only one DC term).
+
+
+ The constant feature represents the DC component. In control engineering this can also be the steady state.
+
+
+ Interestingly removing the DC term is easily done by differentiating your signal - or simply taking a difference between consecutive points of a discrete signal (it should be noted that at this point the analogy is implying time-based signals - so this will also make sense for machine learning application with a time basis - e.g. forecasting stock exchange trends).
+
+
+ Another interesting note: if you were to play an AC+DC signal as well as an AC-only signal where both AC components are the same, they would sound exactly the same. That is because we only hear changes in signals and Δ(AC+DC)=Δ(AC).
+
+
+ Housing price example
+
+
+ Suppose you design a machine which predicts the price of a house based on some features. In this case what does the ones vector help with?
+
+
+ Let's assume a simple model which has features that are directly proportional to the expected price, i.e. if feature $$X_i$$ increases then the expected price y will also increase. So as an example we could have two features: namely the size of the house in m², and the number of rooms.
+
+
+ When you train your machine you will start by prepending a ones vector $$X_0$$. You may then find after training that the weight for your initial feature of ones is some value θ0. As it turns out, when applying your hypothesis function $$h_{\theta}(X)$$ - in the case of the initial feature you will just be multiplying by a constant (most probably θ0, if you are not applying any other functions such as sigmoids). This constant (let's say it's $$θ_0$$ for argument's sake) is the DC term. It is a constant that doesn't change.
+
+
+ But what does it mean for this example? Well, let's suppose that someone knows that you have a working model for housing prices. It turns out that for this example, if they ask you how much money they can expect if they sell the house you can say that they need at least θ0 dollars (or rands) before you even use your learning machine. As with the above analogy, your constant θ0 is somewhat of a steady state where all your inputs are zeros. Concretely, this is the price of a house with no rooms which takes up no space.
+
+
+ However this explanation has some holes because if you have some features which decrease the price e.g. age, then the DC term may not be an absolute minimum of the price. This is because the age may make the price go even lower.
+
+
+ Theoretically, if you were to train a machine without a ones vector, $$f_{AC}(X)$$, its output may not match the output of a machine which had a ones vector, $$f_{DC}(X)$$. However, $$f_{AC}(X)$$ may have exactly the same trend as $$f_{DC}(X)$$, i.e. if you were to plot both machines' outputs you would find that they look exactly the same except that one output has been shifted by a constant. With reference to the housing price problem: suppose you make predictions on two houses $$house_A$$ and $$house_B$$ using both machines. While the outputs from the two machines would differ, the difference between $$house_A$$'s and $$house_B$$'s predictions according to both machines could be exactly the same. Realistically, that means a machine trained without the ones vector, $$f_{AC}$$, could still be very useful if you have just one benchmark point, because you can find the missing constant by simply taking the difference between the machine's prediction and an actual price - then when making predictions you simply add that constant to whatever output you get. That is: if $$house_{benchmark}$$ is your benchmark then the DC component is simply $$price(house_{benchmark}) - f_{AC}(features(house_{benchmark}))$$
+
+
+ A more simple and crude way of putting it is that the DC component of your model represents the inherent bias of the model. The other features then cause tension in order to move away from that bias position.
+
+
+ Kholofelo Moyaba
+
+
+ A simpler approach
+
+
+ A "bias" feature is simply a way to move the "best fit" learned vector to better fit the data. For example, consider a learning problem with a single feature $$X_1$$. The formula without the $$X_0$$ feature is just $$theta_1 * X_1 = y$$. This is graphed as a line that always passes through the origin, with slope y/theta. The $$x_0$$ term allows the line to pass through a different point on the y axis. This will almost always give a better fit. Not all best fit lines go through the origin (0,0) right?
+
In English we use the word "combination" loosely, without thinking if the order of things is important. In other words:
+
+
+
+
"My fruit salad is a combination of apples, grapes and bananas" We don't care what order the fruits are in, they could also be "bananas, grapes and apples" or "grapes, apples and bananas", its the same fruit salad.
+
+
+
+
+
+
+
+
"The combination to the safe is 472". Now we do care about the order. "724" won't work, nor will "247". It has to be exactly 4-7-2.
+
+
+
So, in Mathematics we use more precise language:
+
+
+
+
When the order doesn't matter, it is a Combination.
+
+
+
+
When the order does matter it is a Permutation.
+
+
+
+
+
+
+
+
So, we should really call this a "Permutation Lock"!
+
+
+
In other words:
+
A Permutation is an ordered Combination.
+
+
+
+
+
+
To help you to remember, think "Permutation ... Position"
+
+
+
+
Permutations
+
There are basically two types of permutation:
+
+
Repetition is Allowed: such as the lock above. It could be "333".
+
No Repetition: for example the first three people in a running race. You can't be first and second.
+
+
+
1. Permutations with Repetition
+
These are the easiest to calculate.
+
When a thing has n different types ... we have n choices each time!
+
For example: choosing 3 of those things, the permutations are:
+
n × n × n
+ (n multiplied 3 times)
+
More generally: choosing r of something that has n different types, the permutations are:
+
n × n × ... (r times)
+
(In other words, there are n possibilities for the first choice, THEN there are n possibilities for the second choice, and so on, multiplying each time.)
+
Which is easier to write down using an exponent of r:
+
n × n × ... (r times) = n^r
+
+
Example: in the lock above, there are 10 numbers to choose from (0,1,2,3,4,5,6,7,8,9) and we choose 3 of them:
+
10 × 10 × ... (3 times) = 10^3 = 1,000 permutations
+
So, the formula is simply: n^r
where n is the number of things to choose from, and we choose r of them
+ (Repetition allowed, order matters)
+
+
+
+
+
2. Permutations without Repetition
+
In this case, we have to reduce the number of available choices each time.
+
+
For example, what order could 16 pool balls be in?
+
After choosing, say, number "14" we can't choose it again.
+
So, our first choice has 16 possibilities, and our next choice has 15 possibilities, then 14, 13, etc. And the total permutations are:
+
16 × 15 × 14 × 13 × ... = 20,922,789,888,000
+
But maybe we don't want to choose them all, just 3 of them, so that is only:
+
16 × 15 × 14 = 3,360
+
In other words, there are 3,360 different ways that 3 pool balls could be arranged out of 16 balls.
+
+
Without repetition our choices get reduced each time.
+
+
But how do we write that mathematically? Answer: we use the "factorial function"
+
+
+
+
+
The factorial function (symbol: !) just means to multiply a series of descending natural numbers. Examples:
+
+
4! = 4 × 3 × 2 × 1 = 24
+
7! = 7 × 6 × 5 × 4 × 3 × 2 × 1 = 5,040
+
1! = 1
+
+
+
+
Note: it is generally agreed that 0! = 1. It may seem funny that multiplying no numbers together gets us 1, but it helps simplify a lot of equations.
+
+
+
+
So, when we want to select all of the billiard balls the permutations are:
+
16! = 20,922,789,888,000
+
But when we want to select just 3 we don't want to multiply after 14. How do we do that? There is a neat trick: we divide by 13!
+
+
+
(16 × 15 × 14 × 13 × 12 ...) / (13 × 12 ...) = 16 × 15 × 14 = 3,360
+
+
+
+
Do you see? 16! / 13! = 16 × 15 × 14
+
The formula is written:
+
+
+
+
+
n! / (n − r)!
+
+
+
where n is the number of things to choose from, and we choose r of them
+ (No repetition, order matters)
+
+
+
+
+
Example Our "order of 3 out of 16 pool balls example" is:
+
+
+
16!
+
=
+
16!
+
=
+
20,922,789,888,000
+
= 3,360
+
+
+
(16-3)!
+
13!
+
6,227,020,800
+
+
+
(which is just the same as: 16 × 15 × 14 = 3,360)
+
+
+
Example: How many ways can first and second place be awarded to 10 people?
+
10! / (10−2)!  =  10! / 8!  =  3,628,800 / 40,320  =  90
+
(which is just the same as: 10 × 9 = 90)
+
+
Notation
+
Instead of writing the whole formula, people use different notations such as these:
+
P(n,r)  or  nPr  =  n! / (n − r)!
+
Example: P(10,2) = 90
+
+
Combinations
+
There are also two types of combinations (remember the order does not matter now):
+
+
Repetition is Allowed: such as coins in your pocket (5,5,5,10,10)
+
No Repetition: such as lottery numbers (2,14,15,27,30,33)
+
+
+
1. Combinations with Repetition
+
Actually, these are the hardest to explain, so we will come back to this later.
+
2. Combinations without Repetition
+
This is how lotteries work. The numbers are drawn one at a time, and if we have the lucky numbers (no matter what order) we win!
+
The easiest way to explain it is to:
+
+
assume that the order does matter (ie permutations),
+
then alter it so the order does not matter.
+
+
Going back to our pool ball example, let's say we just want to know which 3 pool balls are chosen, not the order.
+
We already know that 3 out of 16 gave us 3,360 permutations.
+
But many of those are the same to us now, because we don't care what order!
+
For example, let us say balls 1, 2 and 3 are chosen. These are the possibilities:
+
+
+
Order does matter        Order doesn't matter
1 2 3
1 3 2
2 1 3                    1 2 3
2 3 1
3 1 2
3 2 1
+
+
+
So, the permutations will have 6 times as many possibilities.
+
In fact there is an easy way to work out how many ways "1 2 3" could be placed in order, and we have already talked about it. The answer is:
+
3!= 3 × 2 × 1 = 6
+
(Another example: 4 things can be placed in 4!= 4 × 3 × 2 × 1 = 24 different ways, try it for yourself!)
+
So we adjust our permutations formula to reduce it by how many ways the objects could be in order (because we aren't interested in their order any more):
+
n! / ((n − r)! × r!)
+
That formula is so important it is often just written in big parentheses like this:
+
+
+
+
+
+
+
where n is the number of things to choose from, and we choose r of them
+ (No repetition, order doesn't matter)
+
+
+
+
It is often called "n choose r" (such as "16 choose 3")
As well as the "big parentheses", people also use these notations:
+
+
+
Just remember the formula:
+
n! / (r!(n − r)!)
+
Example
+
+
So, our pool ball example (now without order) is:
+
16! / (3!(16−3)!)  =  16! / (3! × 13!)  =  20,922,789,888,000 / (6 × 6,227,020,800)  =  560
+
Or we could do it this way:
+
(16 × 15 × 14) / (3 × 2 × 1)  =  3,360 / 6  =  560
+
+
+
+
+
It is interesting to also note how this formula is nice and symmetrical:
+
+
In other words choosing 3 balls out of 16, or choosing 13 balls out of 16 have the same number of combinations.
+
+
+
16! / (3!(16−3)!)  =  16! / (13!(16−13)!)  =  16! / (3! × 13!)  =  560
+
+
+
Pascal's Triangle
+
We can also use Pascal's Triangle to find the values. Go down to row "n" (the top row is 0), and then along "r" places and the value there is our answer. Here is an extract showing row 16:
Let us say there are five flavors of icecream: banana, chocolate, lemon, strawberry and vanilla.
+
We can have three scoops. How many variations will there be?
+
Let's use letters for the flavors: {b, c, l, s, v}. Example selections include
+
+
{c, c, c} (3 scoops of chocolate)
+
{b, l, v} (one each of banana, lemon and vanilla)
+
{b, v, v} (one of banana, two of vanilla)
+
+
+
(And just to be clear: There are n=5 things to choose from, and we choose r=3 of them.
+ Order does not matter, and we can repeat!)
+
Now, I can't describe directly to you how to calculate this, but I can show you a special technique that lets you work it out.
+
+
Think about the ice cream being in boxes, we could say "move past the first box, then take 3 scoops, then move along 3 more boxes to the end" and we will have 3 scoops of chocolate!
+
So it is like we are ordering a robot to get our ice cream, but it doesn't change anything, we still get what we want.
+
We can write this down as (arrow means move, circle means scoop).
+
In fact the three examples above can be written like this:
+
+
+
{c, c, c} (3 scoops of chocolate):  →○○○→→→
+
{b, l, v} (one each of banana, lemon and vanilla):  ○→→○→→○
+
{b, v, v} (one of banana, two of vanilla):  ○→→→→○○
+
+
+
+
OK, so instead of worrying about different flavors, we have a simpler question: "how many different ways can we arrange arrows and circles?"
+
Notice that there are always 3 circles (3 scoops of ice cream) and 4 arrows (we need to move 4 times to go from the 1st to 5th container).
+
So (being general here) there are r + (n−1) positions, and we want to choose r of them to have circles.
+
This is like saying "we have r + (n−1) pool balls and want to choose r of them". In other words it is now like the pool balls question, but with slightly changed numbers. And we can write it like this:
+
+
+
+
+
+
+
where n is the number of things to choose from, and we choose r of them
+ (Repetition allowed, order doesn't matter)
+
+
+
+
Interestingly, we can look at the arrows instead of the circles, and say "we have r + (n−1) positions and want to choose (n−1) of them to have arrows", and the answer is the same:
+
+
So, what about our example, what is the answer?
+
+
+
$$\frac{(3+5-1)!}{3!(5-1)!} = \frac{7!}{3! \times 4!} = \frac{5040}{6 \times 24} = 35$$
+
+
+
There are 35 ways of having 3 scoops from five flavors of icecream.
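Again, a quick Python check of the repetition formula (my own addition):

from math import factorial

def choose_with_repetition(n, r):
    # (r + n - 1)! / (r! (n - 1)!), combinations with repetition
    return factorial(r + n - 1) // (factorial(r) * factorial(n - 1))

print(choose_with_repetition(5, 3))   # 35 ways to pick 3 scoops from 5 flavors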
+
In Conclusion
+
Phew, that was a lot to absorb, so maybe you could read it again to be sure!
+
But knowing how these formulas work is only half the battle. Figuring out how to interpret a real world situation can be quite hard.
+
But at least now you know how to calculate all 4 variations of "Order does/does not matter" and "Repeats are/are not allowed".
Performing linear regression with a complex set of data with many features is very unwieldy. Say you wanted to create a hypothesis from three (3) features that included all the quadratic terms:

$$x_1^2,\; x_2^2,\; x_3^2,\; x_1x_2,\; x_1x_3,\; x_2x_3$$

That gives us 6 features. The exact way to calculate how many features for all polynomial terms is the combination function with repetition:
+
+ http://www.mathsisfun.com/combinatorics/combinations-permutations.html
+
$$\frac{(n+r-1)!}{r!(n-1)!}$$. In this case we are taking all two-element combinations of three features: $$\frac{(3 + 2 - 1)!}{2!\cdot (3-1)!} = \frac{4!}{4} = 6$$. (Note: you do not have to know these formulas, I just found it helpful for understanding.)
+
+
For 100 features, if we wanted to make them quadratic we would get $$\frac{(100 + 2 - 1)!}{2!\cdot (100-1)!} = 5050$$ resulting new features.
+
+
+ We can approximate the growth of the number of new features we get with all quadratic terms with $$\mathcal{O}(n^2/2)$$. And if you wanted to include all cubic terms in your hypothesis, the features would grow asymptotically at $$\mathcal{O}(n^3)$$. These are very steep growths, so as the number of our features increase, the number of quadratic or cubic features increase very rapidly and becomes quickly impractical.
+
+
+ Example: let our training set be a collection of 50 x 50 pixel black-and-white photographs, and our goal will be to classify which ones are photos of cars. Our feature set size is then n = 2500 if we compare every pair of pixels.
+
+
+ Now let's say we need to make a quadratic hypothesis function. With quadratic features, our growth is $$\mathcal{O}(n^2/2)$$. So our total features will be about $$2500^2 / 2 = 3125000$$, which is very impractical.
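As a rough check on these counts, here is a short Python sketch (my own addition) that computes the exact number of quadratic terms with the combinations-with-repetition formula and compares it to the $$n^2/2$$ approximation:

from math import comb

def num_quadratic_terms(n):
    # all two-element combinations of n features, repetition allowed:
    # (n + 2 - 1)! / (2! * (n - 1)!) = C(n + 1, 2)
    return comb(n + 1, 2)

print(num_quadratic_terms(3))      # 6
print(num_quadratic_terms(100))    # 5050
print(num_quadratic_terms(2500))   # 3126250, close to 2500**2 / 2 = 3125000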
+
+
Neural networks offer an alternate way to perform machine learning when we have complex hypotheses with many features.
+
+
+ Neurons and the Brain
+
+
+ Neural networks are limited imitations of how our own brains work. They've had a big recent resurgence because of advances in computer hardware.
+
+
There is evidence that the brain uses only one "learning algorithm" for all its different functions. Scientists have tried cutting (in an animal brain) the connection between the ears and the auditory cortex and rewiring the optic nerve to the auditory cortex, and found that the auditory cortex literally learns to see.
+
+
+ This principle is called "neuroplasticity" and has many examples and experimental evidence.
+
+
+ Model Representation I
+
+
+ Let's examine how we will represent a hypothesis function using neural networks.
+
+
At a very simple level, neurons are basically computational units that take inputs (dendrites) as electrical input (called "spikes") that are channeled to outputs (axons).
+
+
+ In our model, our dendrites are like the input features $$x_1\cdots x_n$$, and the output is the result of our hypothesis function:
+
+
+ In this model our x0 input node is sometimes called the "bias unit." It is always equal to 1.
+
+
In neural networks, we use the same logistic function as in classification: $$\frac{1}{1 + e^{-\theta^Tx}}$$. In neural networks, however, we sometimes call it a sigmoid (logistic) activation function.
+
+
+ Our "theta" parameters are sometimes instead called "weights" in the neural networks model.
+
+
+ Visually, a simplistic representation looks like:
+
+ Our input nodes (layer 1) go into another node (layer 2), and are output as the hypothesis function.
+
+
+ The first layer is called the "input layer" and the final layer the "output layer," which gives the final value computed on the hypothesis.
+
+
+ We can have intermediate layers of nodes between the input and output layers called the "hidden layer."
+
+
+ We label these intermediate or "hidden" layer nodes $$a^2_0 \cdots a^2_n$$ and call them "activation units."
+
+
+
+
+
+ $$\begin{align*}& a_i^{(j)} = \text{"activation" of unit $i$ in layer $j$} \newline& \Theta^{(j)} = \text{matrix of weights controlling function mapping from layer $j$ to layer $j+1$}\end{align*}$$
+
+
+
+
+
+ If we had one hidden layer, it would look visually something like:
+
+ This is saying that we compute our activation nodes by using a 3×4 matrix of parameters. We apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet another parameter matrix $$\Theta^{(2)}$$ containing the weights for our second layer of nodes.
+
+
+ Each layer gets its own matrix of weights, $$\Theta^{(j)}$$.
+
+
+ The dimensions of these matrices of weights is determined as follows:
+
+
+ $$\text{If network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.}$$
+
+
+ The +1 comes from the addition in $$\Theta^{(j)}$$ of the "bias nodes," $$x_0$$ and $$\Theta_0^{(j)}$$. In other words the output nodes will not include the bias nodes while the inputs will.
+
+
+ Example: layer 1 has 2 input nodes and layer 2 has 4 activation nodes. Dimension of $$\Theta^{(1)}$$ is going to be 4×3 where $$s_j = 2$$ and $$s_{j+1} = 4$$, so $$s_{j+1} \times (s_j + 1) = 4 \times 3$$.
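As a quick sanity check on that dimension rule, here is a short Python sketch (my own, not from the lecture) that builds randomly initialized weight matrices for a given layer layout and prints their shapes:

import numpy as np

def init_thetas(layer_sizes, seed=0):
    """layer_sizes = [s_1, s_2, ..., s_L]; returns one Theta per layer transition."""
    rng = np.random.default_rng(seed)
    thetas = []
    for s_j, s_next in zip(layer_sizes[:-1], layer_sizes[1:]):
        # dimension rule: Theta^(j) is s_{j+1} x (s_j + 1); the extra column is for the bias unit
        thetas.append(rng.standard_normal((s_next, s_j + 1)))
    return thetas

for j, theta in enumerate(init_thetas([2, 4, 1]), start=1):
    print(f"Theta^({j}) shape: {theta.shape}")   # (4, 3) then (1, 5)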
+
+
+ Model Representation II
+
+
+ In this section we'll do a vectorized implementation of the above functions. We're going to define a new variable $$z_k^{(j)}$$ that encompasses the parameters inside our g function. In our previous example if we replaced the variable z for all the parameters we would get:
+
+ Setting $$x = a^{(1)}$$, we can rewrite the equation as:
+
+
+
+
+
+ $$z^{(j)} = \Theta^{(j-1)}a^{(j-1)}$$
+
+
+
+
+
+ We are multiplying our matrix $$\Theta^{(j-1)}$$ with dimensions $$s_j\times (n+1)$$ (where $$s_j$$ is the number of our activation nodes) by our vector $$a^{(j-1)}$$ with height (n+1). This gives us our vector $$z^{(j)}$$ with height $$s_j$$.
+
+
+ Now we can get a vector of our activation nodes for layer j as follows:
+
+
+ $$a^{(j)} = g(z^{(j)})$$
+
+
+ Where our function g can be applied element-wise to our vector $$z^{(j)}$$.
+
+
+ We can then add a bias unit (equal to 1) to layer j after we have computed $$a^{(j)}$$. This will be element $$a_0^{(j)}$$ and will be equal to 1.
+
+
+ To compute our final hypothesis, let's first compute another z vector:
+
+
+ $$z^{(j+1)} = \Theta^{(j)}a^{(j)}$$
+
+
+ We get this final z vector by multiplying the next theta matrix after $$\Theta^{(j-1)}$$ with the values of all the activation nodes we just got.
+
+
This last theta matrix $$\Theta^{(j)}$$ will have only one row so that our result is a single number.
+
+
+ We then get our final result with:
+
+
+ $$h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})$$
+
+
Notice that in this last step, between layer j and layer j+1, we are doing exactly the same thing as we did in logistic regression.
+
+
+ Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses.
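To make the vectorized steps above concrete, here is a minimal numpy sketch of forward propagation in this notation (my own illustration; the layer sizes and random weights are placeholders, not values from the lecture):

import numpy as np

def g(z):
    # sigmoid (logistic) activation, applied element-wise
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, thetas):
    """x: input features without the bias unit; thetas: list of Theta^(j) matrices."""
    a = x
    for theta in thetas:
        a = np.insert(a, 0, 1.0)   # add the bias unit a_0^(j) = 1
        z = theta @ a              # z^(j+1) = Theta^(j) a^(j)
        a = g(z)                   # a^(j+1) = g(z^(j+1))
    return a                       # h_Theta(x)

rng = np.random.default_rng(0)
thetas = [rng.standard_normal((4, 3)), rng.standard_normal((1, 5))]  # layers: 2 -> 4 -> 1
print(forward(np.array([0.5, -1.2]), thetas))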
+
+
+ Examples and Intuitions I
+
+
+ A simple example of applying neural networks is by predicting $$x_1$$ AND $$x_2$$, which is the logical 'and' operator and is only true if both $$x_1$$ and $$x_2$$ are 1.
+
+ So we have constructed one of the fundamental operations in computers by using a small neural network rather than using an actual AND gate. Neural networks can also be used to simulate all the other logical gates.
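The weights themselves are not reproduced in this text, but one commonly quoted choice (an assumption here, not taken from the notes above) is $$\Theta^{(1)} = [-30 \ \ 20 \ \ 20]$$, which we can check directly:

import numpy as np

def g(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-30.0, 20.0, 20.0])   # assumed weights: bias, x1, x2

for x1 in (0, 1):
    for x2 in (0, 1):
        h = g(theta @ np.array([1.0, x1, x2]))
        print(x1, x2, round(float(h), 3))   # close to 0 except when x1 = x2 = 1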
+
+
+ Examples and Intuitions II
+
+
The $$\Theta^{(1)}$$ matrices for AND, NOR, and OR are:
+
And there we have the XNOR operator using a hidden layer with two nodes!
+
+
+ Multiclass Classification
+
+
+ To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we wanted to classify our data into one of four final resulting classes:
+
+ Our final layer of nodes, when multiplied by its theta matrix, will result in another vector, on which we will apply the g() logistic function to get a vector of hypothesis values.
+
+
+ Our resulting hypothesis for one set of inputs may look like:
+
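As a made-up illustration (my own, not from the lecture), the one-vs-all target vectors and a possible hypothesis output for four classes could look like this:

import numpy as np

# one training label per class, written as a vector ("one-hot" encoding)
y_class_2 = np.array([0, 1, 0, 0])

# an example hypothesis output; the largest entry picks the predicted class
h = np.array([0.03, 0.96, 0.02, 0.01])
print(np.argmax(h) + 1)   # 2, i.e. the input most resembles class 2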
In the last chapter we saw how neural networks can
learn their weights and biases using the gradient descent algorithm.
There was, however, a gap in our explanation: we didn't discuss how to
compute the gradient of the cost function. That's quite a gap! In
this chapter I'll explain a fast algorithm for computing such
gradients, an algorithm known as backpropagation.
The backpropagation algorithm was originally introduced in the 1970s,
but its importance wasn't fully appreciated until a
famous
1986 paper by
David
Rumelhart,
Geoffrey
Hinton, and
Ronald
Williams. That paper describes several
neural networks where backpropagation works far faster than earlier
approaches to learning, making it possible to use neural nets to solve
problems which had previously been insoluble. Today, the
backpropagation algorithm is the workhorse of learning in neural
networks.
This chapter is more mathematically involved than the rest of the
book. If you're not crazy about mathematics you may be tempted to
skip the chapter, and to treat backpropagation as a black box whose
details you're willing to ignore. Why take the time to study those
details?
The reason, of course, is understanding. At the heart of
backpropagation is an expression for the partial derivative $\partial
C / \partial w$ of the cost function $C$ with respect to any weight
$w$ (or bias $b$) in the network. The expression tells us how quickly
the cost changes when we change the weights and biases. And while the
expression is somewhat complex, it also has a beauty to it, with each
element having a natural, intuitive interpretation. And so
backpropagation isn't just a fast algorithm for learning. It actually
gives us detailed insights into how changing the weights and biases
changes the overall behaviour of the network. That's well worth
studying in detail.
With that said, if you want to skim the chapter, or jump straight to
the next chapter, that's fine. I've written the rest of the book to
be accessible even if you treat backpropagation as a black box. There
are, of course, points later in the book where I refer back to results
from this chapter. But at those points you should still be able to
understand the main conclusions, even if you don't follow all the
reasoning.
Before discussing backpropagation, let's warm up with a fast
matrix-based algorithm to compute the output from a neural network.
We actually already briefly saw this algorithm
near
the end of the last chapter, but I described it quickly, so it's
worth revisiting in detail. In particular, this is a good way of
getting comfortable with the notation used in backpropagation, in a
familiar context.
Let's begin with a notation which lets us refer to weights in the
network in an unambiguous way. We'll use $w^l_{jk}$ to denote the
weight for the connection from the $k^{\rm th}$ neuron in the
$(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$
layer. So, for example, the diagram below shows the weight on a
connection from the fourth neuron in the second layer to the second
neuron in the third layer of a network:
This notation is cumbersome at first, and it does take some work to
master. But with a little effort you'll find the notation becomes
easy and natural. One quirk of the notation is the ordering of the
$j$ and $k$ indices. You might think that it makes more sense to use
$j$ to refer to the input neuron, and $k$ to the output neuron, not
vice versa, as is actually done. I'll explain the reason for this
quirk below.
We use a similar notation for the network's biases and activations.
Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in
the $l^{\rm th}$ layer. And we use $a^l_j$ for the activation of the
$j^{\rm th}$ neuron in the $l^{\rm th}$ layer. The following diagram
shows examples of these notations in use:
With these notations, the activation $a^{l}_j$ of the $j^{\rm th}$
neuron in the $l^{\rm th}$ layer is related to the activations in the
$(l-1)^{\rm th}$ layer by the equation (compare
Equation (4) and surrounding discussion in the last chapter)
\begin{eqnarray}
a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right),
\tag{23}\end{eqnarray}
where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer. To
rewrite this expression in a matrix form we define a weight
matrix $w^l$ for each layer, $l$. The entries of the weight matrix
$w^l$ are just the weights connecting to the $l^{\rm th}$ layer of neurons,
that is, the entry in the $j^{\rm th}$ row and $k^{\rm th}$ column is $w^l_{jk}$.
Similarly, for each layer $l$ we define a bias vector, $b^l$.
You can probably guess how this works - the components of the bias
vector are just the values $b^l_j$, one component for each neuron in
the $l^{\rm th}$ layer. And finally, we define an activation vector $a^l$
whose components are the activations $a^l_j$.
The last ingredient we need to rewrite (23) in a
matrix form is the idea of vectorizing a function such as $\sigma$.
We met vectorization briefly in the last chapter, but to recap, the
idea is that we want to apply a function such as $\sigma$ to every
element in a vector $v$. We use the obvious notation $\sigma(v)$ to
denote this kind of elementwise application of a function. That is,
the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$.
As an example, if we have the function $f(x) = x^2$ then the
vectorized form of $f$ has the effect
\begin{eqnarray}
f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right)
= \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right]
= \left[ \begin{array}{c} 4 \\ 9 \end{array} \right],
\tag{24}\end{eqnarray}
that is, the vectorized $f$ just squares every element of the vector.
With these notations in mind, Equation (23) can
be rewritten in the beautiful and compact vectorized form
\begin{eqnarray}
a^{l} = \sigma(w^l a^{l-1}+b^l).
\tag{25}\end{eqnarray}
This expression gives us a much more global way of thinking about how
the activations in one layer relate to activations in the previous
layer: we just apply the weight matrix to the activations, then add
the bias vector, and finally apply the $\sigma$ function*
*By the way, it's this expression that
motivates the quirk in the $w^l_{jk}$ notation mentioned earlier.
If we used $j$ to index the input neuron, and $k$ to index the
output neuron, then we'd need to replace the weight matrix in
Equation (25) by the transpose of the
weight matrix. That's a small change, but annoying, and we'd lose
the easy simplicity of saying (and thinking) "apply the weight
matrix to the activations".. That global view is often easier and
more succinct (and involves fewer indices!) than the neuron-by-neuron
view we've taken to now. Think of it as a way of escaping index hell,
while remaining precise about what's going on. The expression is also
useful in practice, because most matrix libraries provide fast ways of
implementing matrix multiplication, vector addition, and
vectorization. Indeed, the
code
in the last chapter made implicit use of this expression to compute
the behaviour of the network.
When using Equation (25) to compute $a^l$,
we compute the intermediate quantity $z^l \equiv w^l a^{l-1}+b^l$
along the way. This quantity turns out to be useful enough to be
worth naming: we call $z^l$ the weighted input to the neurons
in layer $l$. We'll make considerable use of the weighted input $z^l$
later in the chapter. Equation (25) is
sometimes written in terms of the weighted input, as $a^l =
\sigma(z^l)$. It's also worth noting that $z^l$ has components $z^l_j
= \sum_k w^l_{jk} a^{l-1}_k+b^l_j$, that is, $z^l_j$ is just the
weighted input to the activation function for neuron $j$ in layer $l$.
The goal of backpropagation is to compute the partial derivatives
$\partial C / \partial w$ and $\partial C / \partial b$ of the cost
function $C$ with respect to any weight $w$ or bias $b$ in the
network. For backpropagation to work we need to make two main
assumptions about the form of the cost function. Before stating those
assumptions, though, it's useful to have an example cost function in
mind. We'll use the quadratic cost function from last chapter
(c.f. Equation (6)). In the notation of
the last section, the quadratic cost has the form
\begin{eqnarray}
C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2,
\tag{26}\end{eqnarray}
where: $n$ is the total number of training examples; the sum is over
individual training examples, $x$; $y = y(x)$ is the corresponding
desired output; $L$ denotes the number of layers in the network; and
$a^L = a^L(x)$ is the vector of activations output from the network
when $x$ is input.
Okay, so what assumptions do we need to make about our cost function,
$C$, in order that backpropagation can be applied? The first
assumption we need is that the cost function can be written as an
average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for
individual training examples, $x$. This is the case for the quadratic
cost function, where the cost for a single training example is $C_x =
\frac{1}{2} \|y-a^L \|^2$. This assumption will also hold true for
all the other cost functions we'll meet in this book.
The reason we need this assumption is because what backpropagation
actually lets us do is compute the partial derivatives $\partial C_x
/ \partial w$ and $\partial C_x / \partial b$ for a single training
example. We then recover $\partial C / \partial w$ and $\partial C
/ \partial b$ by averaging over training examples. In fact, with this
assumption in mind, we'll suppose the training example $x$ has been
fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$.
We'll eventually put the $x$ back in, but for now it's a notational
nuisance that is better left implicit.
The second assumption we make about the cost is that it can be written
as a function of the outputs from the neural network:
For example, the quadratic cost function satisfies this requirement,
since the quadratic cost for a single training example $x$ may be
written as
\begin{eqnarray}
C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2,
\tag{27}\end{eqnarray}
and thus is a function of the output activations. Of course, this
cost function also depends on the desired output $y$, and you may
wonder why we're not regarding the cost also as a function of $y$.
Remember, though, that the input training example $x$ is fixed, and so
the output $y$ is also a fixed parameter. In particular, it's not
something we can modify by changing the weights and biases in any way,
i.e., it's not something which the neural network learns. And so it
makes sense to regard $C$ as a function of the output activations
$a^L$ alone, with $y$ merely a parameter that helps define that
function.
The backpropagation algorithm is based on common linear algebraic
operations - things like vector addition, multiplying a vector by a
matrix, and so on. But one of the operations is a little less
commonly used. In particular, suppose $s$ and $t$ are two vectors of
the same dimension. Then we use $s \odot t$ to denote the
elementwise product of the two vectors. Thus the components of
$s \odot t$ are just $(s \odot t)_j = s_j t_j$. As an example,
\begin{eqnarray}
\left[\begin{array}{c} 1 \\ 2 \end{array}\right]
\odot \left[\begin{array}{c} 3 \\ 4\end{array} \right]
= \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right]
= \left[ \begin{array}{c} 3 \\ 8 \end{array} \right].
\tag{28}\end{eqnarray}
This kind of elementwise multiplication is sometimes called the
Hadamard product or Schur product. We'll refer to it as
the Hadamard product. Good matrix libraries usually provide fast
implementations of the Hadamard product, and that comes in handy when
implementing backpropagation.
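In numpy, for instance, the Hadamard product is just the ordinary * operator on arrays (a quick illustration of my own):

import numpy as np

s = np.array([1, 2])
t = np.array([3, 4])
print(s * t)   # [3 8], the elementwise (Hadamard) product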
Backpropagation is about understanding how changing the weights and
biases in a network changes the cost function. Ultimately, this means
computing the partial derivatives $\partial C / \partial w^l_{jk}$ and
$\partial C / \partial b^l_j$. But to compute those, we first
introduce an intermediate quantity, $\delta^l_j$, which we call the
error in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer.
Backpropagation will give us a procedure to compute the error
$\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C
/ \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.
To understand how the error is defined, imagine there is a demon in
our neural network:
The demon sits at the $j^{\rm th}$ neuron in layer $l$. As the input to the
neuron comes in, the demon messes with the neuron's operation. It
adds a little change $\Delta z^l_j$ to the neuron's weighted input, so
that instead of outputting $\sigma(z^l_j)$, the neuron instead outputs
$\sigma(z^l_j+\Delta z^l_j)$. This change propagates through later
layers in the network, finally causing the overall cost to change by
an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.
Now, this demon is a good demon, and is trying to help you improve the
cost, i.e., they're trying to find a $\Delta z^l_j$ which makes the
cost smaller. Suppose $\frac{\partial C}{\partial z^l_j}$ has a large
value (either positive or negative). Then the demon can lower the
cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign
to $\frac{\partial C}{\partial z^l_j}$. By contrast, if
$\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon
can't improve the cost much at all by perturbing the weighted input
$z^l_j$. So far as the demon can tell, the neuron is already pretty
near optimal*
*This is only the case for small changes $\Delta
z^l_j$, of course. We'll assume that the demon is constrained to
make such small changes.. And so there's a heuristic sense in
which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in
the neuron.
Motivated by this story, we define the error $\delta^l_j$ of neuron
$j$ in layer $l$ by
\begin{eqnarray}
\delta^l_j \equiv \frac{\partial C}{\partial z^l_j}.
\tag{29}\end{eqnarray}
As per our usual conventions, we use $\delta^l$ to denote the vector
of errors associated with layer $l$. Backpropagation will give us a
way of computing $\delta^l$ for every layer, and then relating those
errors to the quantities of real interest, $\partial C / \partial
w^l_{jk}$ and $\partial C / \partial b^l_j$.
You might wonder why the demon is changing the weighted input $z^l_j$.
Surely it'd be more natural to imagine the demon changing the output
activation $a^l_j$, with the result that we'd be using $\frac{\partial
C}{\partial a^l_j}$ as our measure of error. In fact, if you do
this things work out quite similarly to the discussion below. But it
turns out to make the presentation of backpropagation a little more
algebraically complicated. So we'll stick with $\delta^l_j =
\frac{\partial C}{\partial z^l_j}$ as our measure of error*
*In
classification problems like MNIST the term "error" is sometimes
used to mean the classification failure rate. E.g., if the neural
net correctly classifies 96.0 percent of the digits, then the error
is 4.0 percent. Obviously, this has quite a different meaning from
our $\delta$ vectors. In practice, you shouldn't have trouble
telling which meaning is intended in any given usage..
Plan of attack: Backpropagation is based around four
fundamental equations. Together, those equations give us a way of
computing both the error $\delta^l$ and the gradient of the cost
function. I state the four equations below. Be warned, though: you
shouldn't expect to instantaneously assimilate the equations. Such an
expectation will lead to disappointment. In fact, the backpropagation
equations are so rich that understanding them well requires
considerable time and patience as you gradually delve deeper into the
equations. The good news is that such patience is repaid many times
over. And so the discussion in this section is merely a beginning,
helping you on the way to a thorough understanding of the equations.
Here's a preview of the ways we'll delve more deeply into the
equations later in the chapter: I'll
give
a short proof of the equations, which helps explain why they are
true; we'll restate
the equations in algorithmic form as pseudocode, and
see how the
pseudocode can be implemented as real, running Python code; and, in
the final
section of the chapter, we'll develop an intuitive picture of what
the backpropagation equations mean, and how someone might discover
them from scratch. Along the way we'll return repeatedly to the four
fundamental equations, and as you deepen your understanding those
equations will come to seem comfortable and, perhaps, even beautiful
and natural.
An equation for the error in the output layer, $\delta^L$:
The components of $\delta^L$ are given by
\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j).
\tag{BP1}\end{eqnarray}
This is a very natural expression. The first term on the right,
$\partial C / \partial a^L_j$, just measures how fast the cost is
changing as a function of the $j^{\rm th}$ output activation. If, for
example, $C$ doesn't depend much on a particular output neuron, $j$,
then $\delta^L_j$ will be small, which is what we'd expect. The
second term on the right, $\sigma'(z^L_j)$, measures how fast the
activation function $\sigma$ is changing at $z^L_j$.
Notice that everything in (BP1) is easily computed. In
particular, we compute $z^L_j$ while computing the behaviour of the
network, and it's only a small additional overhead to compute
$\sigma'(z^L_j)$. The exact form of $\partial C / \partial a^L_j$
will, of course, depend on the form of the cost function. However,
provided the cost function is known there should be little trouble
computing $\partial C / \partial a^L_j$. For example, if we're using
the quadratic cost function then $C = \frac{1}{2} \sum_j
(y_j-a^L_j)^2$, and so $\partial C / \partial a^L_j = (a_j^L-y_j)$,
which obviously is easily computable.
An equation for the error $\delta^l$ in terms of the error in
the next layer, $\delta^{l+1}$: In particular
\begin{eqnarray}
\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l),
\tag{BP2}\end{eqnarray}
where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ for
the $(l+1)^{\rm th}$ layer. This equation appears complicated, but
each element has a nice interpretation. Suppose we know the error
$\delta^{l+1}$ at the $l+1^{\rm th}$ layer. When we apply the
transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of
this as moving the error backward through the network, giving
us some sort of measure of the error at the output of the $l^{\rm th}$
layer. We then take the Hadamard product $\odot \sigma'(z^l)$. This
moves the error backward through the activation function in layer $l$,
giving us the error $\delta^l$ in the weighted input to layer $l$.
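To see (BP1) and (BP2) in action, here is a small numpy sketch (my own illustration, using the quadratic cost so that $\partial C / \partial a^L_j = a^L_j - y_j$, with made-up values for the weighted inputs and weights):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    return sigmoid(z) * (1 - sigmoid(z))

# made-up quantities: 3 neurons in layer l, 2 neurons in the output layer L
z_prev = np.array([0.2, -0.3, 0.7])          # weighted inputs z^l
z_L = np.array([0.5, -1.0])                  # weighted inputs z^L
a_L = sigmoid(z_L)                           # output activations a^L
y = np.array([1.0, 0.0])                     # desired output
w_L = np.array([[0.1, -0.2, 0.4],
                [0.3,  0.5, -0.1]])          # weight matrix w^L (2 x 3)

delta_L = (a_L - y) * sigmoid_prime(z_L)                 # (BP1) with the quadratic cost
delta_prev = (w_L.T @ delta_L) * sigmoid_prime(z_prev)   # (BP2)
print(delta_L, delta_prev)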
An equation for the rate of change of the cost with respect to
any bias in the network: In particular:
\begin{eqnarray}
\frac{\partial C}{\partial b^l_j} = \delta^l_j.
\tag{BP3}\end{eqnarray}
An equation for the rate of change of the cost with respect to
any weight in the network: In particular:
\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j.
\tag{BP4}\end{eqnarray}
This tells us how to compute the partial derivatives $\partial C
/ \partial w^l_{jk}$ in terms of the quantities $\delta^l$ and
$a^{l-1}$, which we already know how to compute. The equation can be
rewritten in a less index-heavy notation as
\begin{eqnarray} \frac{\partial
C}{\partial w} = a_{\rm in} \delta_{\rm out},
\tag{32}\end{eqnarray}
where it's understood that $a_{\rm in}$ is the activation of the
neuron input to the weight $w$, and $\delta_{\rm out}$ is the error of
the neuron output from the weight $w$. Zooming in to look at just the
weight $w$, and the two neurons connected by that weight, we can
depict this as:
We can obtain similar insights for earlier layers. In particular,
note the $\sigma'(z^l)$ term in (BP2). This means that
$\delta^l_j$ is likely to get small if the neuron is near saturation.
And this, in turn, means that any weights input to a saturated neuron
will learn slowly*
*This reasoning won't hold if ${w^{l+1}}^T
\delta^{l+1}$ has large enough entries to compensate for the
smallness of $\sigma'(z^l_j)$. But I'm speaking of the general
tendency..
Summing up, we've learnt that a weight will learn slowly if either the
input neuron is low-activation, or if the output neuron has saturated,
i.e., is either high- or low-activation.
None of these observations is too greatly surprising. Still, they
help improve our mental model of what's going on as a neural network
learns. Furthermore, we can turn this type of reasoning around. The
four fundamental equations turn out to hold for any activation
function, not just the standard sigmoid function (that's because, as
we'll see in a moment, the proofs don't use any special properties of
$\sigma$). And so we can use these equations to design
activation functions which have particular desired learning
properties. As an example to give you the idea, suppose we were to
choose a (non-sigmoid) activation function $\sigma$ so that $\sigma'$
is always positive, and never gets close to zero. That would prevent
the slow-down of learning that occurs when ordinary sigmoid neurons
saturate. Later in the book we'll see examples where this kind of
modification is made to the activation function. Keeping the four
equations (BP1)-(BP4) in mind can help explain why such
modifications are tried, and what impact they can have.
Let's begin with Equation (BP1), which gives an expression for
the output error, $\delta^L$. To prove this equation, recall that by
definition
\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial z^L_j}.
\tag{36}\end{eqnarray}
Applying the chain rule, we can re-express the partial derivative
above in terms of partial derivatives with respect to the output
activations,
\begin{eqnarray}
\delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j},
\tag{37}\end{eqnarray}
where the sum is over all neurons $k$ in the output layer. Of course,
the output activation $a^L_k$ of the $k^{\rm th}$ neuron depends only
on the weighted input $z^L_j$ for the $j^{\rm th}$ neuron when $k =
j$. And so $\partial a^L_k / \partial z^L_j$ vanishes when $k \neq
j$. As a result we can simplify the previous equation to
\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}.
\tag{38}\end{eqnarray}
Recalling that $a^L_j = \sigma(z^L_j)$ the second term on the right
can be written as $\sigma'(z^L_j)$, and the equation becomes
\begin{eqnarray}
\delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j),
\tag{39}\end{eqnarray}
which is just (BP1), in component form.
Next, we'll prove (BP2), which gives an equation for the error
$\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$.
To do this, we want to rewrite $\delta^l_j = \partial C / \partial
z^l_j$ in terms of $\delta^{l+1}_k = \partial C / \partial z^{l+1}_k$.
We can do this using the chain rule,
\begin{eqnarray}
\delta^l_j & = & \frac{\partial C}{\partial z^l_j} \tag{40}\\
& = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{41}\\
& = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k,
\tag{42}\end{eqnarray}
where in the last line we have interchanged the two terms on the
right-hand side, and substituted the definition of $\delta^{l+1}_k$.
To evaluate the first term on the last line, note that
\begin{eqnarray}
z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k.
\tag{43}\end{eqnarray}
Differentiating, we obtain
\begin{eqnarray}
\frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j).
\tag{44}\end{eqnarray}
Substituting back into (42) we obtain
\begin{eqnarray}
\delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j).
\tag{45}\end{eqnarray}
This is just (BP2) written in component form.
That completes the proof of (BP1) and (BP2); the remaining two fundamental
equations, (BP3) and (BP4), follow from entirely similar applications of the
chain rule. The proof may seem complicated. But it's really
just the outcome of carefully applying the chain rule. A little less
succinctly, we can think of backpropagation as a way of computing the
gradient of the cost function by systematically applying the chain
rule from multi-variable calculus. That's all there really is to
backpropagation - the rest is details.
The backpropagation equations provide us with a way of computing the
gradient of the cost function. Let's explicitly write this out in the
form of an algorithm:
Input $x$: Set the corresponding activation $a^{1}$ for
the input layer.
Feedforward: For each $l = 2, 3, \ldots, L$ compute
$z^{l} = w^l a^{l-1}+b^l$ and $a^{l} = \sigma(z^{l})$.
Output error $\delta^L$: Compute the vector $\delta^{L}
= \nabla_a C \odot \sigma'(z^L)$.
Backpropagate the error: For each $l = L-1, L-2,
\ldots, 2$ compute $\delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot
\sigma'(z^{l})$.
Output: The gradient of the cost function is given by
$\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and
$\frac{\partial C}{\partial b^l_j} = \delta^l_j$.
Examining the algorithm you can see why it's called
backpropagation. We compute the error vectors $\delta^l$
backward, starting from the final layer. It may seem peculiar that
we're going through the network backward. But if you think about the
proof of backpropagation, the backward movement is a consequence of
the fact that the cost is a function of outputs from the network. To
understand how the cost varies with earlier weights and biases we need
to repeatedly apply the chain rule, working backward through the
layers to obtain usable expressions.
Backpropagation with a single modified neuron Suppose we modify
a single neuron in a feedforward network so that the output from the
neuron is given by $f(\sum_j w_j x_j + b)$, where $f$ is some
function other than the sigmoid. How should we modify the
backpropagation algorithm in this case?
Backpropagation with linear neurons Suppose we replace the
usual non-linear $\sigma$ function with $\sigma(z) = z$ throughout
the network. Rewrite the backpropagation algorithm for this case.
As I've described it above, the backpropagation algorithm computes the
gradient of the cost function for a single training example, $C =
C_x$. In practice, it's common to combine backpropagation with a
learning algorithm such as stochastic gradient descent, in which we
compute the gradient for many training examples. In particular, given
a mini-batch of $m$ training examples, the following algorithm applies
a gradient descent learning step based on that mini-batch:
Input a set of training examples
For each training example $x$: Set the corresponding
input activation $a^{x,1}$, and perform the following steps:
Feedforward: For each $l = 2, 3, \ldots, L$ compute
$z^{x,l} = w^l a^{x,l-1}+b^l$ and $a^{x,l} = \sigma(z^{x,l})$.
Backpropagate the error: For each $l = L-1, L-2,
\ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1})
\odot \sigma'(z^{x,l})$.
Gradient descent: For each $l = L, L-1, \ldots, 2$
update the weights according to the rule $w^l \rightarrow
w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the
biases according to the rule $b^l \rightarrow b^l-\frac{\eta}{m}
\sum_x \delta^{x,l}$.
Of course, to implement stochastic gradient descent in practice you
also need an outer loop generating mini-batches of training examples,
and an outer loop stepping through multiple epochs of training. I've
omitted those for simplicity.
Having understood backpropagation in the abstract, we can now
understand the code used in the last chapter to implement
backpropagation. Recall from
that
chapter that the code was contained in the update_mini_batch
and backprop methods of the Network class. The code for
these methods is a direct translation of the algorithm described
above. In particular, the update_mini_batch method updates the
Network's weights and biases by computing the gradient for the
current mini_batch of training examples:
class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
+
Most of the work is done by the line
delta_nabla_b, delta_nabla_w = self.backprop(x, y) which uses
the backprop method to figure out the partial derivatives
$\partial C_x / \partial b^l_j$ and $\partial C_x / \partial
w^l_{jk}$. The backprop method follows the algorithm in the
last section closely. There is one small change - we use a slightly
different approach to indexing the layers. This change is made to
take advantage of a feature of Python, namely the use of negative list
indices to count backward from the end of a list, so, e.g.,
l[-3] is the third last entry in a list l. The code for
backprop is below, together with a few helper functions, which
are used to compute the $\sigma$ function, the derivative $\sigma'$,
and the derivative of the cost function. With these inclusions you
should be able to understand the code in a self-contained way. If
something's tripping you up, you may find it helpful to consult
the
original description (and complete listing) of the code.
class Network(object):
...
    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x]  # list to store all the activations, layer by layer
        zs = []  # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

...

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))
Fully matrix-based approach to backpropagation over a
mini-batch Our implementation of stochastic gradient descent loops
over training examples in a mini-batch. It's possible to modify the
backpropagation algorithm so that it computes the gradients for all
training examples in a mini-batch simultaneously. The idea is that
instead of beginning with a single input vector, $x$, we can begin
with a matrix $X = [x_1 x_2 \ldots x_m]$ whose columns are the
vectors in the mini-batch. We forward-propagate by multiplying by
the weight matrices, adding a suitable matrix for the bias terms,
and applying the sigmoid function everywhere. We backpropagate along
similar lines. Explicitly write out pseudocode for this approach to
the backpropagation algorithm. Modify network.py so that it
uses this fully matrix-based approach. The advantage of this
approach is that it takes full advantage of modern libraries for
linear algebra. As a result it can be quite a bit faster than
looping over the mini-batch. (On my laptop, for example, the
speedup is about a factor of two when run on MNIST classification
problems like those we considered in the last chapter.) In
practice, all serious libraries for backpropagation use this fully
matrix-based approach or some variant.
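As a starting point for this problem, a minimal sketch of the matrix-based feedforward step might look like the following (my own sketch under the stated assumptions, not the book's solution; the backward pass can be vectorized along the same lines):

import numpy as np

def sigmoid(Z):
    return 1.0 / (1.0 + np.exp(-Z))

def feedforward_batch(X, weights, biases):
    """X holds one training example per column; weights and biases as in network.py."""
    A = X
    for w, b in zip(weights, biases):
        A = sigmoid(np.dot(w, A) + b)   # b is an (n, 1) column, so it broadcasts across the mini-batch
    return A

sizes = [784, 30, 10]
weights = [np.random.randn(y, x) for x, y in zip(sizes[:-1], sizes[1:])]
biases = [np.random.randn(y, 1) for y in sizes[1:]]
X = np.random.randn(784, 32)                          # a mini-batch of 32 examples
print(feedforward_batch(X, weights, biases).shape)    # (10, 32)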
In what sense is backpropagation a fast algorithm? To answer this
question, let's consider another approach to computing the gradient.
Imagine it's the early days of neural networks research. Maybe it's
the 1950s or 1960s, and you're the first person in the world to think
of using gradient descent to learn! But to make the idea work you
need a way of computing the gradient of the cost function. You think
back to your knowledge of calculus, and decide to see if you can use
the chain rule to compute the gradient. But after playing around a
bit, the algebra looks complicated, and you get discouraged. So you
try to find another approach. You decide to regard the cost as a
function of the weights $C = C(w)$ alone (we'll get back to the biases
in a moment). You number the weights $w_1, w_2, \ldots$, and want to
compute $\partial C / \partial w_j$ for some particular weight $w_j$.
An obvious way of doing that is to use the approximation
\begin{eqnarray} \frac{\partial
C}{\partial w_{j}} \approx \frac{C(w+\epsilon
e_j)-C(w)}{\epsilon},
\tag{46}\end{eqnarray}
where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit
vector in the $j^{\rm th}$ direction. In other words, we can estimate
$\partial C / \partial w_j$ by computing the cost $C$ for two slightly
different values of $w_j$, and then applying
Equation (46). The same idea will let us
compute the partial derivatives $\partial C / \partial b$ with respect
to the biases.
This approach looks very promising. It's simple conceptually, and
extremely easy to implement, using just a few lines of code.
Certainly, it looks much more promising than the idea of using the
chain rule to compute the gradient!
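Here is roughly what such an implementation looks like (a small numpy sketch of my own, using a toy cost function):

import numpy as np

def numerical_grad(C, w, eps=1e-6):
    """Estimate dC/dw_j for every j using (C(w + eps*e_j) - C(w)) / eps."""
    grad = np.zeros_like(w)
    base = C(w)
    for j in range(len(w)):
        w_step = w.copy()
        w_step[j] += eps
        grad[j] = (C(w_step) - base) / eps   # one extra cost evaluation per weight
    return grad

C = lambda w: 0.5 * np.sum(w**2)             # toy cost; its true gradient is w itself
w = np.array([1.0, -2.0, 3.0])
print(numerical_grad(C, w))                   # approximately [1, -2, 3]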
Unfortunately, while this approach appears promising, when you
implement the code it turns out to be extremely slow. To understand
why, imagine we have a million weights in our network. Then for each
distinct weight $w_j$ we need to compute $C(w+\epsilon e_j)$ in order
to compute $\partial C / \partial w_j$. That means that to compute
the gradient we need to compute the cost function a million different
times, requiring a million forward passes through the network (per
training example). We need to compute $C(w)$ as well, so that's a
total of a million and one passes through the network.
What's clever about backpropagation is that it enables us to
simultaneously compute all the partial derivatives $\partial C
/ \partial w_j$ using just one forward pass through the network,
followed by one backward pass through the network. Roughly speaking,
the computational cost of the backward pass is about the same as the
forward pass*
*This should be plausible, but it requires some
analysis to make a careful statement. It's plausible because the
dominant computational cost in the forward pass is multiplying by
the weight matrices, while in the backward pass it's multiplying by
the transposes of the weight matrices. These operations obviously
have similar computational cost.. And so the total cost of
backpropagation is roughly the same as making just two forward passes
through the network. Compare that to the million and one forward
passes we needed for the approach based
on (46)! And so even though backpropagation
appears superficially more complex than the approach based
on (46), it's actually much, much faster.
This speedup was first fully appreciated in 1986, and it greatly
expanded the range of problems that neural networks could solve.
That, in turn, caused a rush of people using neural networks. Of
course, backpropagation is not a panacea. Even in the late 1980s
people ran up against limits, especially when attempting to use
backpropagation to train deep neural networks, i.e., networks with
many hidden layers. Later in the book we'll see how modern computers
and some clever new ideas now make it possible to use backpropagation
to train such deep neural networks.
As I've explained it, backpropagation presents two mysteries. First,
what's the algorithm really doing? We've developed a picture of the
error being backpropagated from the output. But can we go any deeper,
and build up more intuition about what is going on when we do all
these matrix and vector multiplications? The second mystery is how
someone could ever have discovered backpropagation in the first place?
It's one thing to follow the steps in an algorithm, or even to follow
the proof that the algorithm works. But that doesn't mean you
understand the problem so well that you could have discovered the
algorithm in the first place. Is there a plausible line of reasoning
that could have led you to discover the backpropagation algorithm? In
this section I'll address both these mysteries.
To improve our intuition about what the algorithm is doing, let's
imagine that we've made a small change $\Delta w^l_{jk}$ to some
weight in the network, $w^l_{jk}$:
That change in weight will cause a change in the output activation
from the corresponding neuron:
That, in turn, will cause a change in all the activations in
the next layer:
Those changes will in turn cause changes in the next layer, and then
the next, and so on all the way through to causing a change in the
final layer, and then in the cost function:
The change $\Delta C$ in the cost is related to the change $\Delta
w^l_{jk}$ in the weight by the equation
\begin{eqnarray}
\Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}.
\tag{47}\end{eqnarray}
This suggests that a possible approach to computing $\frac{\partial
C}{\partial w^l_{jk}}$ is to carefully track how a small change in
$w^l_{jk}$ propagates to cause a small change in $C$. If we can do
that, being careful to express everything along the way in terms of
easily computable quantities, then we should be able to compute
$\partial C / \partial w^l_{jk}$.
Let's try to carry this out. The change $\Delta w^l_{jk}$ causes a
small change $\Delta a^{l}_j$ in the activation of the $j^{\rm th}$ neuron in
the $l^{\rm th}$ layer. This change is given by
\begin{eqnarray}
\Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}.
\tag{48}\end{eqnarray}
The change in activation $\Delta a^l_{j}$ will cause changes in
all the activations in the next layer, i.e., the $(l+1)^{\rm
th}$ layer. We'll concentrate on the way just a single one of those
activations is affected, say $a^{l+1}_q$,
In fact, it'll cause the following change:
\begin{eqnarray}
\Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \Delta a^l_j.
\tag{49}\end{eqnarray}
Substituting in the expression from Equation (48),
we get:
\begin{eqnarray}
\Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}.
\tag{50}\end{eqnarray}
Of course, the change $\Delta a^{l+1}_q$ will, in turn, cause changes
in the activations in the next layer. In fact, we can imagine a path
all the way through the network from $w^l_{jk}$ to $C$, with each
change in activation causing a change in the next activation, and,
finally, a change in the cost at the output. If the path goes through
activations $a^l_j, a^{l+1}_q, \ldots, a^{L-1}_n, a^L_m$ then the
resulting expression is
\begin{eqnarray}
\Delta C \approx \frac{\partial C}{\partial a^L_m}
\frac{\partial a^L_m}{\partial a^{L-1}_n}
\frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
\frac{\partial a^{l+1}_q}{\partial a^l_j}
\frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk},
\tag{51}\end{eqnarray}
that is, we've picked up a $\partial a / \partial a$ type term for
each additional neuron we've passed through, as well as the $\partial
C/\partial a^L_m$ term at the end. This represents the change in $C$
due to changes in the activations along this particular path through
the network. Of course, there's many paths by which a change in
$w^l_{jk}$ can propagate to affect the cost, and we've been
considering just a single path. To compute the total change in $C$ it
is plausible that we should sum over all the possible paths between
the weight and the final cost, i.e.,
\begin{eqnarray}
\Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}
\frac{\partial a^L_m}{\partial a^{L-1}_n}
\frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
\frac{\partial a^{l+1}_q}{\partial a^l_j}
\frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk},
\tag{52}\end{eqnarray}
where we've summed over all possible choices for the intermediate
neurons along the path. Comparing with (47) we
see that
\begin{eqnarray}
\frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m}
\frac{\partial a^L_m}{\partial a^{L-1}_n}
\frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots
\frac{\partial a^{l+1}_q}{\partial a^l_j}
\frac{\partial a^l_j}{\partial w^l_{jk}}.
\tag{53}\end{eqnarray}
Now, Equation (53) looks complicated. However,
it has a nice intuitive interpretation. We're computing the rate of
change of $C$ with respect to a weight in the network. What the
equation tells us is that every edge between two neurons in the
network is associated with a rate factor which is just the partial
derivative of one neuron's activation with respect to the other
neuron's activation. The edge from the first weight to the first
neuron has a rate factor $\partial a^{l}_j / \partial w^l_{jk}$. The
rate factor for a path is just the product of the rate factors along
the path. And the total rate of change $\partial C / \partial
w^l_{jk}$ is just the sum of the rate factors of all paths from the
initial weight to the final cost. This procedure is illustrated here,
for a single path:
What I've been providing up to now is a heuristic argument, a way of
thinking about what's going on when you perturb a weight in a network.
Let me sketch out a line of thinking you could use to further develop
this argument. First, you could derive explicit expressions for all
the individual partial derivatives in
Equation (53). That's easy to do with a bit of
calculus. Having done that, you could then try to figure out how to
write all the sums over indices as matrix multiplications. This turns
out to be tedious, and requires some persistence, but not
extraordinary insight. After doing all this, and then simplifying as
much as possible, what you discover is that you end up with exactly
the backpropagation algorithm! And so you can think of the
backpropagation algorithm as providing a way of computing the sum over
the rate factor for all these paths. Or, to put it slightly
differently, the backpropagation algorithm is a clever way of keeping
track of small perturbations to the weights (and biases) as they
propagate through the network, reach the output, and then affect the
cost.
Now, I'm not going to work through all this here. It's messy and
requires considerable care to work through all the details. If you're
up for a challenge, you may enjoy attempting it. And even if not, I
hope this line of thinking gives you some insight into what
backpropagation is accomplishing.
What about the other mystery - how backpropagation could have been
discovered in the first place? In fact, if you follow the approach I
just sketched you will discover a proof of backpropagation.
Unfortunately, the proof is quite a bit longer and more complicated
than the one I described earlier in this chapter. So how was that
short (but more mysterious) proof discovered? What you find when you
write out all the details of the long proof is that, after the fact,
there are several obvious simplifications staring you in the face.
You make those simplifications, get a shorter proof, and write that
out. And then several more obvious simplifications jump out at
you. So you repeat again. The result after a few iterations is the
proof we saw earlier*
*There is one clever step required. In
Equation (53) the intermediate variables are
activations like $a_q^{l+1}$. The clever idea is to switch to using
weighted inputs, like $z^{l+1}_q$, as the intermediate variables. If
you don't have this idea, and instead continue using the activations
$a^{l+1}_q$, the proof you obtain turns out to be slightly more
complex than the proof given earlier in the chapter. - short, but
somewhat obscure, because all the signposts to its construction have
been removed! I am, of course, asking you to trust me on this, but
there really is no great mystery to the origin of the earlier proof.
It's just a lot of hard work simplifying the proof I've sketched in
this section.
+
+
+
\ No newline at end of file
diff --git a/ML_Mathematical_Approach/07_week-5-lecture-notes/01__chapter3_-_bp.pdf b/ML_Mathematical_Approach/07_week-5-lecture-notes/01__chapter3_-_bp.pdf
new file mode 100644
index 0000000..4036092
Binary files /dev/null and b/ML_Mathematical_Approach/07_week-5-lecture-notes/01__chapter3_-_bp.pdf differ
diff --git a/ML_Mathematical_Approach/07_week-5-lecture-notes/01__node37.html b/ML_Mathematical_Approach/07_week-5-lecture-notes/01__node37.html
new file mode 100644
index 0000000..15fb3ae
--- /dev/null
+++ b/ML_Mathematical_Approach/07_week-5-lecture-notes/01__node37.html
@@ -0,0 +1,466 @@
+
+
+
+
+The Backpropagation Algorithm
+
+
+
+
+
+
+
+
+
+
+
+
+
1.

Propagates inputs forward in the usual way, i.e.

All outputs are computed using sigmoid thresholding of the inner
product of the corresponding weight and input vectors.

All outputs at stage n are connected to all the inputs at
stage n+1.

2.

Propagates the errors backwards by apportioning them to each
unit according to the amount of this error the unit is responsible for.
+
+We now derive the stochastic Backpropagation algorithm for the general
+case. The derivation is simple, but unfortunately the book-keeping is
+a little messy.
+
+
+
+
Notation:

$$\vec{x}_j$$ = input vector for unit j ($$x_{ji}$$ = ith input to the jth unit)

$$\vec{w}_j$$ = weight vector for unit j ($$w_{ji}$$ = weight on $$x_{ji}$$)

$$z_j = \sum_i w_{ji} x_{ji}$$, the weighted sum of inputs for unit j

$$o_j$$ = output of unit j ($$o_j = \sigma(z_j)$$, where $$\sigma$$ is the sigmoid function)

$$t_j$$ = target for unit j

Downstream(j) = set of units whose immediate inputs include the output of j

Outputs = set of output units in the final layer
+
+Since we update after each training example, we can simplify
+the notation somewhat by imagining that the training set consists of
+exactly one example and so the error can simply be denoted by E.
+
+
We want to calculate $$\frac{\partial E}{\partial w_{ji}}$$ for each input weight
$$w_{ji}$$ for each output unit j. Note first that since $$z_j$$ is a
function of $$w_{ji}$$ regardless of where in the network unit j is
located,

$$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial z_j}\frac{\partial z_j}{\partial w_{ji}} = \frac{\partial E}{\partial z_j}\, x_{ji}$$

Furthermore, $$\frac{\partial E}{\partial z_j}$$ is the same regardless of which input
weight of unit j we are trying to update. So we denote this
quantity by $$\delta_j$$.

Consider the case when $$j \in Outputs$$. We know

$$E = \frac{1}{2}\sum_{k \in Outputs}(t_k - o_k)^2$$

Since the outputs of all units $$k \neq j$$ are independent of $$w_{ji}$$,
we can drop the summation and consider just the contribution to E by j:

$$\delta_j = \frac{\partial E}{\partial z_j} = \frac{\partial}{\partial z_j}\,\frac{1}{2}(t_j - o_j)^2 = -(t_j - o_j)\frac{\partial o_j}{\partial z_j} = -(t_j - o_j)\, o_j (1 - o_j)$$

Thus

$$\frac{\partial E}{\partial w_{ji}} = -(t_j - o_j)\, o_j (1 - o_j)\, x_{ji} \qquad (17)$$
+
+
+
+
+
Now consider the case when j is a hidden unit. Like before, we make the
following two important observations.

1. For each unit k downstream from j, $$z_k$$ is a function of $$z_j$$.

2. The contribution to error by all units other than j in the same
layer as j is independent of $$w_{ji}$$.

We want to calculate $$\frac{\partial E}{\partial w_{ji}}$$ for each input weight
$$w_{ji}$$ for each hidden unit j. Note that $$w_{ji}$$ influences just
$$z_j$$, which influences $$o_j$$, which influences $$z_k$$ for each
$$k \in Downstream(j)$$, each of which influences E. So we can write

$$\frac{\partial E}{\partial w_{ji}} = \frac{\partial E}{\partial z_j}\frac{\partial z_j}{\partial w_{ji}} = \left[\sum_{k \in Downstream(j)} \frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial o_j}\frac{\partial o_j}{\partial z_j}\right] x_{ji}$$

Again note that all the terms except $$x_{ji}$$ in the above product are
the same regardless of which input weight of unit j we are trying to
update. Like before, we denote this common quantity by $$\delta_j$$.
Also note that $$\frac{\partial E}{\partial z_k} = \delta_k$$,
$$\frac{\partial z_k}{\partial o_j} = w_{kj}$$,
and $$\frac{\partial o_j}{\partial z_j} = o_j(1 - o_j)$$.
Substituting,

$$\delta_j = \sum_{k \in Downstream(j)} \delta_k\, w_{kj}\, o_j (1 - o_j) = o_j (1 - o_j) \sum_{k \in Downstream(j)} \delta_k\, w_{kj}$$

Thus,

$$\frac{\partial E}{\partial w_{ji}} = o_j (1 - o_j) \left[\sum_{k \in Downstream(j)} \delta_k\, w_{kj}\right] x_{ji} \qquad (18)$$
+
+
+
We are now in a position to state the Backpropagation algorithm
formally.

Each training example is of the form $$\langle \vec{x}, \vec{t}\, \rangle$$,
where $$\vec{x}$$ is the input vector and $$\vec{t}$$ is the
target vector. $$\eta$$ is the learning rate (e.g., .05). $$n_i$$, $$n_h$$
and $$n_o$$ are the number of input, hidden and output nodes
respectively. Input from unit i to unit j is denoted $$x_{ji}$$ and
its weight is denoted by $$w_{ji}$$.

Create a feed-forward network with $$n_i$$ inputs, $$n_h$$ hidden
units, and $$n_o$$ output units.

Initialize all the weights to small random values (e.g., between
-.05 and .05).

Until the termination condition is met, Do

For each training example $$\langle \vec{x}, \vec{t}\, \rangle$$, Do

1. Input the instance $$\vec{x}$$ and compute the output $$o_u$$ of every unit.
+
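The remaining update steps follow from Equations (17) and (18). As a rough illustration (not part of the original notes), here is a minimal Octave sketch of one stochastic update for a network with a single hidden layer of sigmoid units; the names W1, W2, eta, x and t are assumptions chosen for the sketch, and bias terms are omitted:

% One stochastic Backpropagation update (illustrative sketch)
z_h = W1 * x;    o_h = 1 ./ (1 + exp(-z_h));    % forward pass: hidden layer outputs
z_o = W2 * o_h;  o_o = 1 ./ (1 + exp(-z_o));    % forward pass: output layer outputs
delta_o = -(t - o_o) .* o_o .* (1 - o_o);       % output-unit error terms, Equation (17)
delta_h = (W2' * delta_o) .* o_h .* (1 - o_h);  % hidden-unit error terms, Equation (18)
W2 = W2 - eta * (delta_o * o_h');               % gradient descent: dE/dW2 = delta_o * o_h'
W1 = W1 - eta * (delta_h * x');                 % gradient descent: dE/dW1 = delta_h * x'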
+ Let's first define a few variables that we will need to use:
+
+
+ a) L= total number of layers in the network
+
+
+ b) $$s_l$$ = number of units (not counting bias unit) in layer l
+
+
+ c) K= number of output units/classes
+
+
+ Recall that in neural networks, we may have many output nodes. We denote $$h_\Theta(x)_k$$ as being a hypothesis that results in the $$k^{th}$$ output.
+
+
+ Our cost function for neural networks is going to be a generalization of the one we used for logistic regression.
+
+
Recall that the cost function for regularized logistic regression was:

$$J(\theta) = -\frac{1}{m}\sum_{i=1}^m \left[ y^{(i)}\log(h_\theta(x^{(i)})) + (1-y^{(i)})\log(1-h_\theta(x^{(i)})) \right] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$

For neural networks, it becomes:

$$J(\Theta) = -\frac{1}{m}\sum_{i=1}^m\sum_{k=1}^K \left[ y_k^{(i)}\log((h_\Theta(x^{(i)}))_k) + (1-y_k^{(i)})\log(1-(h_\Theta(x^{(i)}))_k) \right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}} (\Theta_{j,i}^{(l)})^2$$

We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, between the square brackets, we have an additional nested summation that loops through the number of output nodes.
+
+
+ In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term.
+
+
+ Note:
+
+
+
+
+ the double sum simply adds up the logistic regression costs calculated for each cell in the output layer; and
+
+
+
+
+ the triple sum simply adds up the squares of all the individual Θs in the entire network.
+
+
+
+
+ the i in the triple sum does
+
+ not
+
+ refer to training example i
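As a rough illustration of the double and triple sums (a sketch under assumptions, not the course's reference implementation: one hidden layer, a3 is the m x K matrix of hypothesis outputs $$h_\Theta(x)$$, Y is the m x K matrix of one-hot labels, and Theta1/Theta2 carry their bias columns first):

% Double sum: logistic regression cost over every example and every output unit
J = (1/m) * sum(sum(-Y .* log(a3) - (1 - Y) .* log(1 - a3)));
% Triple sum: squares of every Theta entry, skipping the bias columns (j = 0)
J = J + (lambda/(2*m)) * (sum(sum(Theta1(:, 2:end).^2)) + sum(sum(Theta2(:, 2:end).^2)));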
+
+
+
+
+ Backpropagation Algorithm
+
+
+ "Backpropagation" is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression.
+
+
+ Our goal is to compute:
+
+
+ $$\min_\Theta J(\Theta)$$
+
+
+ That is, we want to minimize our cost function J using an optimal set of parameters in theta.
+
+
In this section we'll look at the equations we use to compute the partial derivatives of J(Θ):

$$\dfrac{\partial}{\partial \Theta^{(l)}_{i,j}} J(\Theta)$$
+
+ In back propagation we're going to compute for every node:
+
+
+ $$\delta_j^{(l)}$$ = "error" of node j in layer l
+
+
+ Recall that $$a_j^{(l)}$$ is the activation of node j in layer l.
+
+
+ For the
+
+ last layer
+
+ , we can compute the vector of delta values with:
+
+
+ $$\delta^{(L)} = a^{(L)} - y$$
+
+
+ Where L is our total number of layers and $$a^{(L)}$$ is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y.
+
+
To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left:

$$\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ g'(z^{(l)})$$

The delta values of layer l are calculated by multiplying the delta values in the next layer with the theta matrix of layer l. We then element-wise multiply that with a function called g', or g-prime, which is the derivative of the activation function g evaluated with the input values given by $$z^{(l)}$$.
+
+
+ The g-prime derivative terms can also be written out as:
+
+
+
+
+
+ $$g'(u) = g(u)\ .*\ (1 - g(u))$$
+
+
+
+
+
The full back propagation equation for the inner nodes is then:

$$\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})$$

A. Ng states that the derivation and proofs are complicated and involved, but you can still implement the above equations to do back propagation without knowing the details.
+
+
We can compute our partial derivative terms by multiplying our activation values and our error values for each training example t:

$$\dfrac{\partial J(\Theta)}{\partial \Theta^{(l)}_{i,j}} = \dfrac{1}{m}\sum_{t=1}^m a^{(t)(l)}_j \delta^{(t)(l+1)}_i$$

This however ignores regularization, which we'll deal with later.
+
+
+ Note: $$\delta^{(l+1)}$$ and $$a^{(l+1)}$$ are vectors with $$s_{l+1}$$ elements. Similarly, $$a^{(l)}$$ is a vector with $$s_l$$ elements. Multiplying them produces a matrix that is $$s_{l+1}$$ by $$s_l$$, which is the same dimension as $$\Theta^{(l)}$$. That is, the process produces a gradient term for every element in $$\Theta^{(l)}$$. (Actually, $$\Theta^{(l)}$$ has $$s_l + 1$$ columns, so the dimensionality is not exactly the same.)
+
+
+ We can now take all these equations and put them together into a backpropagation algorithm:
+
+
+
+ Back propagation Algorithm
+
+
+
+ Given training set $$\lbrace (x^{(1)}, y^{(1)}) \cdots (x^{(m)}, y^{(m)})\rbrace$$
+
+
+
+
+ Set $$\Delta^{(l)}_{i,j}$$ := 0 for all (l,i,j)
+
+
+
+
+ For training example t =1 to m:
+
+
+
+
+ Set $$a^{(1)} := x^{(t)}$$
+
+
+
+
+ Perform forward propagation to compute $$a^{(l)}$$ for l=2,3,…,L
+
+
+
+
Using $$y^{(t)}$$, compute $$\delta^{(L)} = a^{(L)} - y^{(t)}$$

Compute $$\delta^{(L-1)}, \delta^{(L-2)}, \dots, \delta^{(2)}$$ using $$\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})$$

$$\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}$$ or with vectorization, $$\Delta^{(l)} := \Delta^{(l)} + \delta^{(l+1)}(a^{(l)})^T$$
+
+
+
+
+ $$D^{(l)}_{i,j} := \dfrac{1}{m}\left(\Delta^{(l)}_{i,j} + \lambda\Theta^{(l)}_{i,j}\right)$$ If j≠0 NOTE: Typo in lecture slide omits outside parentheses. This version is correct.
+
+
+
+
+ $$D^{(l)}_{i,j} := \dfrac{1}{m}\Delta^{(l)}_{i,j}$$ If j=0
+
+
+
+
+ The capital-delta matrix is used as an "accumulator" to add up our values as we go along and eventually compute our partial derivative.
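A minimal Octave sketch of this accumulation loop for a network with one hidden layer (a sketch under assumptions, not the course's reference code: X is m x n without the bias column, Y is m x K of one-hot labels, Theta1/Theta2 include the bias columns, and sigmoid()/sigmoidGradient() are available as in the programming exercises):

Delta1 = zeros(size(Theta1));
Delta2 = zeros(size(Theta2));
for t = 1:m
  a1 = [1; X(t, :)'];                      % input activation with bias unit
  z2 = Theta1 * a1;
  a2 = [1; sigmoid(z2)];                   % hidden activation with bias unit
  z3 = Theta2 * a2;
  a3 = sigmoid(z3);                        % output activation
  d3 = a3 - Y(t, :)';                      % delta for the output layer
  d2 = (Theta2(:, 2:end)' * d3) .* sigmoidGradient(z2);   % delta for the hidden layer
  Delta1 = Delta1 + d2 * a1';              % accumulate gradient contributions
  Delta2 = Delta2 + d3 * a2';
end
D1 = (1/m) * Delta1;
D2 = (1/m) * Delta2;
D1(:, 2:end) = D1(:, 2:end) + (lambda/m) * Theta1(:, 2:end);   % regularize, j != 0
D2(:, 2:end) = D2(:, 2:end) + (lambda/m) * Theta2(:, 2:end);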
+
+
The actual proof is quite involved, but the $$D^{(l)}_{i,j}$$ terms are the partial derivatives and the results we are looking for:

$$D^{(l)}_{i,j} = \dfrac{\partial J(\Theta)}{\partial \Theta^{(l)}_{i,j}}$$

Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are.
+
+
+ Note: In lecture, sometimes i is used to index a training example. Sometimes it is used to index a unit in a layer. In the Back Propagation Algorithm described here, t is used to index a training example rather than overloading the use of i.
+
+
+ Implementation Note: Unrolling Parameters
+
+
With neural networks, we are working with sets of matrices:

$$\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, \dots \quad \text{and} \quad D^{(1)}, D^{(2)}, D^{(3)}, \dots$$

In order to use built-in optimizing functions, we will want to "unroll" all the elements into one long vector. If the dimensions of Theta1 are 10x11, Theta2 is 10x11 and Theta3 is 1x11, then we can unroll them and get back our original matrices from the "unrolled" versions as follows:
+
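A minimal sketch of the unrolling and re-rolling, using the dimensions from the example above:

% Unroll all matrices into single long column vectors
thetaVector = [Theta1(:); Theta2(:); Theta3(:)];
deltaVector = [D1(:); D2(:); D3(:)];
% Reshape back into the original 10x11, 10x11 and 1x11 matrices
Theta1 = reshape(thetaVector(1:110), 10, 11);
Theta2 = reshape(thetaVector(111:220), 10, 11);
Theta3 = reshape(thetaVector(221:231), 1, 11);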
+ NOTE: The lecture slides show an example neural network with 3 layers. However,
+
+ 3
+
+ theta matrices are defined: Theta1, Theta2, Theta3. There should be only 2 theta matrices: Theta1 (10 x 11), Theta2 (1 x 11).
+
+
+ Gradient Checking
+
+
+ Gradient checking will assure that our backpropagation works as intended.
+
+
We can approximate the derivative of our cost function with:

$$\dfrac{\partial}{\partial\Theta_j}J(\Theta) \approx \dfrac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$$

A small value for $${\epsilon}$$ (epsilon), such as $${\epsilon = 10^{-4}}$$, makes the approximation above accurate. If the value is much smaller, we may end up with numerical problems. Professor Andrew Ng usually uses $${\epsilon = 10^{-4}}$$.
+
+
+ We are only adding or subtracting epsilon to the $$\Theta_j$$ matrix. In Octave we can do it as follows:
+
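A minimal sketch (it assumes a function costFunction(theta) that returns the cost J for an unrolled parameter vector theta; the name is illustrative):

epsilon = 1e-4;
n = length(theta);
gradApprox = zeros(n, 1);
for i = 1:n
  thetaPlus = theta;    thetaPlus(i)  = thetaPlus(i)  + epsilon;
  thetaMinus = theta;   thetaMinus(i) = thetaMinus(i) - epsilon;
  gradApprox(i) = (costFunction(thetaPlus) - costFunction(thetaMinus)) / (2 * epsilon);
end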
+ We then want to check that gradApprox ≈ deltaVector.
+
+
+ Once you've verified
+
+ once
+
+ that your backpropagation algorithm is correct, then you don't need to compute gradApprox again. The code to compute gradApprox is very slow.
+
+
+ Random Initialization
+
+
+ Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly.
+
+
+ Instead we can randomly initialize our weights:
+
+
+ Initialize each $$\Theta^{(l)}_{ij}$$ to a random value between $$[-\epsilon,\epsilon]$$:
+
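A minimal sketch, using the Theta1 (10 x 11) and Theta2 (1 x 11) dimensions mentioned above; INIT_EPSILON is a small constant such as 0.12 and is unrelated to the epsilon used in gradient checking:

INIT_EPSILON = 0.12;
% rand(r, c) gives values in [0, 1]; rescale them into [-INIT_EPSILON, INIT_EPSILON]
Theta1 = rand(10, 11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(1, 11)  * (2 * INIT_EPSILON) - INIT_EPSILON;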
+ First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers total.
+
+
+
+
+ Number of input units = dimension of features $$x^{(i)}$$
+
+
+
+
+ Number of output units = number of classes
+
+
+
+
+ Number of hidden units per layer = usually the more the better (must balance with the cost of computation, which increases with more hidden units)
+
+
+
+
+ Defaults: 1 hidden layer. If more than 1 hidden layer, then the same number of units in every hidden layer.
+
+
+
+
+
+ Training a Neural Network
+
+
+
+
+
+ Randomly initialize the weights
+
+
+
+
+ Implement forward propagation to get $$h_\theta(x^{(i)})$$
+
+
+
+
+ Implement the cost function
+
+
+
+
+ Implement backpropagation to compute partial derivatives
+
+
+
+
+ Use gradient checking to confirm that your backpropagation works. Then disable gradient checking.
+
+
+
+
+ Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta.
+
+
+
+
+ When we perform forward and back propagation, we loop on every training example:
+
+
for i = 1:m,
   Perform forward propagation and backpropagation using example (x(i),y(i))
   (Get activations a(l) and delta terms d(l) for l = 2,...,L)
end;
+
+ Bonus: Tutorial on How to classify your own images of digits
+
+
+ This tutorial will guide you on how to use the classifier provided in exercise 3 to classify your own images like this:
+
+
+
+
+
+ It will also explain how the images are converted through several formats to be processed and displayed.
+
+
+ Introduction
+
+
+ The classifier provided expects 20 x 20 pixel black and white images converted into a row vector of 400 real numbers like this
+
+
[ 0.14532, 0.12876, ...]
+
+ Each pixel is represented by a real number between -1.0 and 1.0, where -1.0 is black and 1.0 is white (any number in between is a shade of gray, and 0.0 is exactly the middle gray).
+
+
+
+ .jpg and color RGB images
+
+
+
+ The most common image format that can be read by Octave is .jpg, using the imread() function, which outputs a three-dimensional matrix of integers from 0 to 255, representing height x width x 3 values that index a color map for each pixel (explaining color maps is beyond our scope).
+
+
Image3DmatrixRGB = imread("myOwnPhoto.jpg");
+
+
+ Convert to Black & White
+
+
+
+ A common way to convert color images to black & white is to convert them to the YIQ standard and keep only the Y component, which represents the luma information (black & white). I and Q represent the chrominance information (color). Octave has a function
+
+ rgb2ntsc()
+
+ that outputs a similar three-dimensional matrix but of real numbers from -1.0 to 1.0, representing the height x width x 3 (Y luma, I in-phase, Q quadrature) intensity for each pixel.
+
+
Image3DmatrixYIQ = rgb2ntsc(MyImageRGB);
+
+ To obtain the Black & White component, just discard the I and Q matrices. This leaves a two-dimensional matrix of real numbers from -1.0 to 1.0 representing the height x width black & white pixel values.
+
+
Image2DmatrixBW = Image3DmatrixYIQ(:,:,1);
+
+
+ Cropping to square image
+
+
+
+ It is useful to crop the original image to be as square as possible. The way to crop a matrix is by selecting an area inside the original B&W image and copying it to a new matrix. This is done by selecting the rows and columns that define the area. In other words, it is copying a rectangular subset of the matrix like this:
+
+ Cropping does not have to be all the way to a square.
+
+ It could be cropping just a percentage of the way to a square
+
+ so you can leave more of the image intact. The next step of scaling will take care of stretching the image to fit a square.
+
+
+
+ Scaling to 20 x 20 pixels
+
+
+
+ The classifier provided was trained with 20 x 20 pixel images, so we need to scale our photos to match. This may cause distortion depending on the height and width ratio of the cropped original photo. There are many ways to scale a photo, but we are going to use the simplest one. We lay a scaled 20 x 20 grid over the original photo and take a sample pixel at the center of each grid cell. To lay a scaled grid, we compute two vectors of 20 indexes each, evenly spaced over the original size of the image: one for the height and one for the width. For example, an image of 320 x 200 pixels will produce two vectors like
+
+
[9 25 41 57 73 ... 313] % 20 indexes
+
[6 16 26 36 46 ... 196] % 20 indexes
+
+ Copy the value of each pixel located by the grid of these indexes to a new matrix. Ending up with a matrix of 20 x 20 real numbers.
+
+
+
+ Black & White to Gray & White
+
+
+
+ The classifier provided was trained with images of white digits over a gray background. Specifically, the 20 x 20 matrix of real numbers ranges ONLY from 0.0 to 1.0 instead of the complete black & white range of -1.0 to 1.0. This means that we have to normalize our photos to a range of 0.0 to 1.0 for this classifier to work. We also invert the black and white colors, because it is easier to "draw" black over white on our photos and we need to get white digits. So in short, we
+
+ invert black and white
+
+ and
+
+ stretch black to gray
+
+ .
+
+
+
+ Rotation of image
+
+
+
+ Sometimes our photos are automatically rotated, as on our cellular phones. The classifier provided cannot recognize rotated images, so we may need to rotate them back. This can be done with the Octave function
+
+ rot90()
+
+ like this.
+
+
ImageAligned = rot90(Image, rotationStep);
+
+ Where rotationStep is an integer: -1 means rotate 90 degrees CCW and 1 means rotate 90 degrees CW.
+
+
+ Approach
+
+
+
+
+ The approach is to have a function that converts our photo to the format the classifier is expecting, as if it were just a sample from the training data set.
+
+
+
+
+ Use the classifier to predict the digit in the converted image.
+
+
+
+
+ Code step by step
+
+
+ Define the function name, the output variable and three parameters: one for the filename of our photo, one optional cropping percentage (if not provided it will default to zero, meaning no cropping) and one optional rotation of the image (if not provided it will default to zero, meaning no rotation).
+
+
function vectorImage = imageTo20x20Gray(fileName, cropPercentage=0, rotStep=0)
+
+
+ Read the file as an RGB image and convert it to a Black & White 2D matrix (see the introduction).
+
+
% Read as RGB image
+Image3DmatrixRGB = imread(fileName);
+% Convert to NTSC image (YIQ)
+Image3DmatrixYIQ = rgb2ntsc(Image3DmatrixRGB );
+% Convert to grays keeping only luminance (Y)
+% ...and discard chrominance (IQ)
+Image2DmatrixBW = Image3DmatrixYIQ(:,:,1);
+
+
+ Establish the final size of the cropped image.
+
+
% Get the size of your image
+oldSize = size(Image2DmatrixBW);
+% Obtain crop size toward centered square (cropDelta)
+% ...will be zero for the already minimum dimension
+% ...and if the cropPercentage is zero,
+% ...both dimensions are zero
+% ...meaning that the original image will go intact to croppedImage
+cropDelta = floor((oldSize - min(oldSize)) .* (cropPercentage/100));
+% Compute the desired final pixel size for the original image
+finalSize = oldSize - cropDelta;
+
+
+ Obtain the origin and the number of columns and rows to be copied to the cropped image.
+
+
% Compute each dimension origin for cropping
+cropOrigin = floor(cropDelta / 2) + 1;
+% Compute each dimension copying size
+copySize = cropOrigin + finalSize - 1;
+% Copy just the desired cropped image from the original B&W image
+croppedImage = Image2DmatrixBW( ...
+ cropOrigin(1):copySize(1), cropOrigin(2):copySize(2));
+
+
+ Compute the scale and compute back the new size. This last step is extra: it is computed back so the code stays general for future modifications of the classifier size. For example, if it changed from 20 x 20 pixels to 30 x 30, then we only need to change the line of code where the scale is computed.
+
+
% Resolution scale factors: [rows cols]
+scale = [20 20] ./ finalSize;
+% Compute back the new image size (extra step to keep code general)
+newSize = max(floor(scale .* finalSize),1);
+
+
+ Compute two sets of 20 indexes evenly spaced. One over the original height and one over the original width of the image.
+
+
% Compute a re-sampled set of indices:
+rowIndex = min(round(((1:newSize(1))-0.5)./scale(1)+0.5), finalSize(1));
+colIndex = min(round(((1:newSize(2))-0.5)./scale(2)+0.5), finalSize(2));
+
+
+ Copy just the indexed values from old image to get new image of 20 x 20 real numbers. This is called "sampling" because it copies just a sample pixel indexed by a grid. All the sample pixels make the new image.
+
+
% Copy just the indexed values from old image to get new image
+newImage = croppedImage(rowIndex,colIndex,:);
+
+
+ Rotate the matrix using the
+
+ rot90()
+
+ function with the rotStep parameter: -1 is CCW, 0 is no rotate, 1 is CW.
+
+
% Rotate if needed: -1 is CCW, 0 is no rotate, 1 is CW
+newAlignedImage = rot90(newImage, rotStep);
+
+
+ Invert black and white because it is easier to draw black digits over white background in our photos but the classifier needs white digits.
+
+
% Invert black and white
+invertedImage = - newAlignedImage;
+
+
+ Find the min and max gray values in the image and compute the total value range in preparation for normalization.
+
+
% Find min and max grays values in the image
+maxValue = max(invertedImage(:));
+minValue = min(invertedImage(:));
+% Compute the value range of actual grays
+delta = maxValue - minValue;
+
+
+ Do normalization so all values end up between 0.0 and 1.0, because this particular classifier does not perform well with negative numbers.
+
+
% Normalize grays between 0 and 1
+normImage = (invertedImage - minValue) / delta;
+
Add some contrast to the image. The multiplication factor is the contrast control; you can increase it if desired to obtain sharper contrast (contrast only between gray and white, since black was already removed in normalization).

% Add contrast. Multiplication factor is contrast control.
contrastedImage = sigmoid((normImage -0.5) * 5);

Show the image specifying the black & white range [-1 1] to avoid automatic ranging using the image's gray-to-white range of values. Showing the photo with a different range does not affect the values in the output matrix, so it does not affect the classifier. It is only visual feedback for the user.
+
+
% Show image as seen by the classifier
+imshow(contrastedImage, [-1, 1] );
+
Finally, output the matrix as an unrolled vector to be compatible with the classifier.

% Output the matrix as an unrolled vector
vectorImage = reshape(contrastedImage, 1, newSize(1) * newSize(2));
+
+ End function.
+
+
end;
+
+
+
+ Usage samples
+
+
+ Single photo
+
+
+
+
+ Photo file in myDigit.jpg
+
+
+
+
+ Cropping 60% of the way to square photo
+
+
+
+
No rotation

vectorImage = imageTo20x20Gray('myDigit.jpg',60);
+predict(Theta1, Theta2, vectorImage)
+
+ Photo files in myFirstDigit.jpg, mySecondDigit.jpg
+
+
+
+
+ First crop to square and second 25% of the way to square photo
+
+
+
+
First no rotation and second CW rotation

vectorImage(1,:) = imageTo20x20Gray('myFirstDigit.jpg',100);
+vectorImage(2,:) = imageTo20x20Gray('mySecondDigit.jpg',25,1);
+predict(Theta1, Theta2, vectorImage)
+
+
+
+
+ Tips
+
+
+
+
+ JPG photos of black numbers over white background
+
+
+
+
+ Preferred square photos but not required
+
+
+
+
+ Rotate as needed because the classifier can only work with vertical digits
+
+
+
+
+ Leave background space around the digit. At least 2 pixels when seen at 20 x 20 resolution. This means that the classifier only really works in a 16 x 16 area.
+
+
+
+
+ Try changing the contrast multiplier to 10 (or more).
+
+
+
+
+ Complete code (just copy and paste)
+
+
+
+
function vectorImage = imageTo20x20Gray(fileName, cropPercentage=0, rotStep=0)
+%IMAGETO20X20GRAY display reduced image and converts for digit classification
+%
+% Sample usage:
+% imageTo20x20Gray('myDigit.jpg', 100, -1);
+%
+% First parameter: Image file name
+% Could be bigger than 20 x 20 px, it will
+% be resized to 20 x 20. Better if used with
+% square images but not required.
+%
+% Second parameter: cropPercentage (any number between 0 and 100)
+% 0 0% will be cropped (optional, not needed for square images)
+% 50 50% of available croping will be cropped
+% 100 crop all the way to square image (for rectangular images)
+%
+% Third parameter: rotStep
+% -1 rotate image 90 degrees CCW
+% 0 do not rotate (optional)
+% 1 rotate image 90 degrees CW
+%
+% (Thanks to Edwin Frühwirth for parts of this code)
+% Read as RGB image
+Image3DmatrixRGB = imread(fileName);
+% Convert to NTSC image (YIQ)
+Image3DmatrixYIQ = rgb2ntsc(Image3DmatrixRGB );
+% Convert to grays keeping only luminance (Y) and discard chrominance (IQ)
+Image2DmatrixBW = Image3DmatrixYIQ(:,:,1);
+% Get the size of your image
+oldSize = size(Image2DmatrixBW);
+% Obtain crop size toward centered square (cropDelta)
+% ...will be zero for the already minimum dimension
+% ...and if the cropPercentage is zero,
+% ...both dimensions are zero
+% ...meaning that the original image will go intact to croppedImage
+cropDelta = floor((oldSize - min(oldSize)) .* (cropPercentage/100));
+% Compute the desired final pixel size for the original image
+finalSize = oldSize - cropDelta;
+% Compute each dimension origin for cropping
+cropOrigin = floor(cropDelta / 2) + 1;
+% Compute each dimension copying size
+copySize = cropOrigin + finalSize - 1;
+% Copy just the desired cropped image from the original B&W image
+croppedImage = Image2DmatrixBW( ...
+ cropOrigin(1):copySize(1), cropOrigin(2):copySize(2));
+% Resolution scale factors: [rows cols]
+scale = [20 20] ./ finalSize;
+% Compute back the new image size (extra step to keep code general)
+newSize = max(floor(scale .* finalSize),1);
+% Compute a re-sampled set of indices:
+rowIndex = min(round(((1:newSize(1))-0.5)./scale(1)+0.5), finalSize(1));
+colIndex = min(round(((1:newSize(2))-0.5)./scale(2)+0.5), finalSize(2));
+% Copy just the indexed values from old image to get new image
+newImage = croppedImage(rowIndex,colIndex,:);
+% Rotate if needed: -1 is CCW, 0 is no rotate, 1 is CW
+newAlignedImage = rot90(newImage, rotStep);
+% Invert black and white
+invertedImage = - newAlignedImage;
+% Find min and max grays values in the image
+maxValue = max(invertedImage(:));
+minValue = min(invertedImage(:));
+% Compute the value range of actual grays
+delta = maxValue - minValue;
+% Normalize grays between 0 and 1
+normImage = (invertedImage - minValue) / delta;
+% Add contrast. Multiplication factor is contrast control.
+contrastedImage = sigmoid((normImage -0.5) * 5);
+% Show image as seen by the classifier
+imshow(contrastedImage, [-1, 1] );
+% Output the matrix as a unrolled vector
+vectorImage = reshape(contrastedImage, 1, newSize(1)*newSize(2));
+end
+
+ Photo Gallery
+
+
+ Digit 2
+
+
+
+
+ Digit 6
+
+
+
+
+ Digit 6 inverted is digit 9. This is the same photo of a six but rotated. Also, changed the contrast multiplier from 5 to 20. You can note that the gray background is smoother.
+
+
+
+
+
+
+ Digit 3
+
+
+
+
+
+
+ Explanation of Derivatives Used in Backpropagation
+
+
+
+
+ We know that for a logistic regression classifier (which is what all of the output neurons in a neural network are), we use the cost function, $$J(\theta) = -ylog(h_{\theta}(x)) - (1-y)log(1-h_{\theta}(x))$$, and apply this over the K output neurons, and for all m examples.
+
+
+
+
The equation to compute the partial derivatives of the theta terms in the output neurons is:

$$\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}}$$

And the equation to compute the partial derivatives of the theta terms in the hidden layer directly before it is:

$$\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}$$

Clearly they share some pieces in common, so a delta term ($$\delta^{(L)}$$) can be used for the common pieces between the output layer and the hidden layer immediately before it (with the possibility that there could be many hidden layers if we wanted):

$$\delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}}$$
+
+ And we can go ahead and use another delta term ($$δ^{(L−1)}$$) for the pieces that would be shared by the final hidden layer and a hidden layer before that, if we had one. Regardless, this delta term will still serve to make the math and implementation more concise.
+
+ Using $$\delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}}$$, we need to evaluate both partial derivatives.
+
+
+
+
Given $$J(\theta) = -y\log(a^{(L)}) - (1-y)\log(1-a^{(L)})$$, where $$a^{(L)} = h_{\theta}(x)$$, the partial derivative is:

$$\frac{\partial J(\theta)}{\partial a^{(L)}} = \frac{1-y}{1-a^{(L)}} - \frac{y}{a^{(L)}} = \frac{a^{(L)} - y}{a^{(L)}(1 - a^{(L)})}$$

Since $$a^{(L)} = g(z^{(L)})$$ with the sigmoid function g, we also have $$\frac{\partial a^{(L)}}{\partial z^{(L)}} = a^{(L)}(1 - a^{(L)})$$, and multiplying the two terms gives $$\delta^{(L)} = a^{(L)} - y$$, the expression used in the algorithm above.

As a further exercise, the NN we created for classification can easily be modified to have a linear output. First solve the 4th programming exercise. You can create a new function script, nnCostFunctionLinear.m, with the following characteristics
+
+
+
+
+ There is only one output node, so you do not need the 'num_labels' parameter.
+
+
+
+
+ Since there is one linear output, you do not need to convert y into a logical matrix.
+
+
+
+
+ You still need a non-linear function in the hidden layer.
+
+
+
+
+ The non-linear function is often the tanh() function - it has an output range from -1 to +1, and its gradient is easily implemented. Let g(z)=tanh(z).
+
+
+
+
+ The gradient of tanh is $$g'(z) = 1 - g(z)^2$$. Use this in backpropagation in place of the sigmoid gradient.
+
+
+
+
+ Remove the sigmoid function from the output layer (i.e. calculate a3 without using a sigmoid function), since we want a linear output.
+
+
+
+
+ Cost computation: Use the linear cost function for J (from ex1 and ex5) for the unregularized portion. For the regularized portion, use the same method as ex4.
+
+
+
+
+ Where reshape() is used to form the Theta matrices, replace 'num_labels' with '1'.
+
+
+
+
+ You still need to randomly initialize the Theta values, just as with any NN. You will want to experiment with different epsilon values. You will also need to create a predictLinear() function, using the tanh() function in the hidden layer, and a linear output.
+
+
+ Testing your linear NN
+
+
+ Here is a test case for your nnCostFunctionLinear()
+
+ Now create a script that uses the 'ex5data1.mat' from ex5, but without creating the polynomial terms. With 8 units in the hidden layer and MaxIter set to 200, you should be able to get a final cost value of 0.3 to 0.4. The results will vary a bit due to the random Theta initialization. If you plot the training set and the predicted values for the training set (using your predictLinear() function), you should have a good match.
+
+
+ Deriving the Sigmoid Gradient Function
+
+
+ We let the sigmoid function be $$ \sigma(x) = \frac{1}{1 + e^{-x}}$$
+
+
Differentiating yields

$$\sigma'(x) = -\left(\frac{1}{1 + e^{-x}}\right)^2 \frac{d}{dx}\left(1 + e^{-x}\right)$$

which is equal to

$$-\left(\frac{1}{1 + e^{-x}}\right)^2 e^{-x}\,(-1) = \left(\frac{1}{1 + e^{-x}}\right)\left(\frac{e^{-x}}{1 + e^{-x}}\right) = \sigma(x)\,(1 - \sigma(x))$$
+
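A minimal Octave sketch of this result, in the spirit of the sigmoidGradient() helper from the programming exercises (assumes a sigmoid() function is available):

function g = sigmoidGradient(z)
  % Element-wise sigmoid gradient: sigma(z) .* (1 - sigma(z))
  g = sigmoid(z) .* (1 - sigmoid(z));
end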
+
+
+
+
diff --git a/ML_Mathematical_Approach/08_week-6-lecture-notes/01__Bias-Variance.pdf b/ML_Mathematical_Approach/08_week-6-lecture-notes/01__Bias-Variance.pdf
new file mode 100644
index 0000000..f654c6b
Binary files /dev/null and b/ML_Mathematical_Approach/08_week-6-lecture-notes/01__Bias-Variance.pdf differ
diff --git a/ML_Mathematical_Approach/08_week-6-lecture-notes/01__Chap9.Part2.pdf b/ML_Mathematical_Approach/08_week-6-lecture-notes/01__Chap9.Part2.pdf
new file mode 100644
index 0000000..9abe9c2
Binary files /dev/null and b/ML_Mathematical_Approach/08_week-6-lecture-notes/01__Chap9.Part2.pdf differ
diff --git a/ML_Mathematical_Approach/08_week-6-lecture-notes/01__resources.html b/ML_Mathematical_Approach/08_week-6-lecture-notes/01__resources.html
new file mode 100644
index 0000000..d28d390
--- /dev/null
+++ b/ML_Mathematical_Approach/08_week-6-lecture-notes/01__resources.html
@@ -0,0 +1,1091 @@
+
+
+
+ ML:Advice for Applying Machine Learning
+
+
+ Deciding What to Try Next
+
+
+ Errors in your predictions can be addressed by:
+
+
+
+
+ Getting more training examples
+
+
+
+
+ Trying smaller sets of features
+
+
+
+
+ Trying additional features
+
+
+
+
+ Trying polynomial features
+
+
+
+
+ Increasing or decreasing λ
+
+
+
+
+ Don't just pick one of these avenues at random. We'll explore diagnostic techniques for choosing one of the above solutions in the following sections.
+
+
+ Evaluating a Hypothesis
+
+
+ A hypothesis may have low error for the training examples but still be inaccurate (because of overfitting).
+
+
+ With a given dataset of training examples, we can split up the data into two sets: a
+
+ training set
+
+ and a
+
+ test set
+
+ .
+
+
+ The new procedure using these two sets is then:
+
+
+
+
+ Learn $$\Theta$$ and minimize $$J_{train}(\Theta)$$ using the training set
+
+
+
+
+ Compute the test set error $$J_{test}(\Theta)$$
+
+
+
+
+ The test set error
+
+
+
+
For linear regression: $$J_{test}(\Theta) = \dfrac{1}{2m_{test}} \sum_{i=1}^{m_{test}}(h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test})^2$$

For classification, we can instead use the misclassification error (aka 0/1 misclassification error):

$$err(h_\Theta(x),y) = \begin{cases} 1 & \text{if } h_\Theta(x) \geq 0.5 \text{ and } y = 0,\ \text{or } h_\Theta(x) < 0.5 \text{ and } y = 1 \\ 0 & \text{otherwise} \end{cases}$$

The average test error for the test set is:

$$\text{Test Error} = \dfrac{1}{m_{test}} \sum_{i=1}^{m_{test}} err(h_\Theta(x^{(i)}_{test}), y^{(i)}_{test})$$

This gives us the proportion of the test data that was misclassified.
+
+
+ Model Selection and Train/Validation/Test Sets
+
+
+
+
+ Just because a learning algorithm fits a training set well, that does not mean it is a good hypothesis.
+
+
+
+
+ The error of your hypothesis as measured on the data set with which you trained the parameters will be lower than the error on any other data set.
+
+
+
+
+ In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result.
+
+
+
+ Without the Validation Set (note: this is a bad method - do not use it)
+
+
+
+
+
+ Optimize the parameters in Θ using the training set for each polynomial degree.
+
+
+
+
+ Find the polynomial degree d with the least error using the test set.
+
+
+
+
+ Estimate the generalization error also using the test set with $$J_{test}(\Theta^{(d)})$$ (d = the degree of the polynomial with the lowest error);
+
+
+
+
+ In this case, we have trained one variable, d, or the degree of the polynomial, using the test set. This will cause our error value to be greater for any other set of data.
+
+
+
+ Use of the CV set
+
+
+
+ To solve this, we can introduce a third set, the
+
+ Cross Validation Set
+
+ , to serve as an intermediate set that we can train d with. Then our test set will give us an accurate, non-optimistic error.
+
+
+ One example way to break down our dataset into the three sets is:
+
+
+
+
+ Training set: 60%
+
+
+
+
+ Cross validation set: 20%
+
+
+
+
+ Test set: 20%
+
+
+
+
+ We can now calculate three separate error values for the three different sets.
+
+
+
+ With the Validation Set (note: this method presumes we do not also use the CV set for regularization)
+
+
+
+
+
+ Optimize the parameters in Θ using the training set for each polynomial degree.
+
+
+
+
+ Find the polynomial degree d with the least error using the cross validation set.
+
+
+
+
+ Estimate the generalization error using the test set with $$J_{test}(\Theta^{(d)})$$ (d = the degree of the polynomial with the lowest error);
+
+
+
+
+ This way, the degree of the polynomial d has not been trained using the test set.
+
+
+ (Mentor note: be aware that using the CV set to select 'd' means that we cannot also use it for the validation curve process of setting the lambda value).
+
+
+ Diagnosing Bias vs. Variance
+
+
+ In this section we examine the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis.
+
+
+
+
+ We need to distinguish whether
+
+ bias
+
+ or
+
+ variance
+
+ is the problem contributing to bad predictions.
+
+
+
+
+ High bias is underfitting and high variance is overfitting. We need to find a golden mean between these two.
+
+
+
+
+ The training error will tend to
+
+ decrease
+
+ as we increase the degree d of the polynomial.
+
+
+ At the same time, the cross validation error will tend to
+
+ decrease
+
+ as we increase d up to a point, and then it will
+
+ increase
+
+ as d is increased, forming a convex curve.
+
+
+
+ High bias (underfitting)
+
+ : both $$J_{train}(\Theta)$$ and $$J_{CV}(\Theta)$$ will be high. Also, $$J_{CV}(\Theta) \approx J_{train}(\Theta)$$.
+
+
+
+ High variance (overfitting)
+
+ : $$J_{train}(\Theta)$$ will be low and $$J_{CV}(\Theta)$$ will be much greater than $$J_{train}(\Theta)$$.
+
+
+ This is represented in the figure below:
+
+
+
+
+
+ Regularization and Bias/Variance
+
+
+ Instead of looking at the degree d contributing to bias/variance, now we will look at the regularization parameter λ.
+
+
+
+
+ Large λ: High bias (underfitting)
+
+
+
+
+ Intermediate λ: just right
+
+
+
+
+ Small λ: High variance (overfitting)
+
+
+
+
+ A large lambda heavily penalizes all the Θ parameters, which greatly simplifies the line of our resulting function and causes underfitting.
+
+
+ The relationship of λ to the training set and the variance set is as follows:
+
+
+
+ Low λ
+
+ : $$J_{train}(\Theta)$$ is low and $$J_{CV}(\Theta)$$ is high (high variance/overfitting).
+
+
+
+ Intermediate λ
+
+ : $$J_{train}(\Theta)$$ and $$J_{CV}(\Theta)$$ are somewhat low and $$J_{train}(\Theta) \approx J_{CV}(\Theta)$$.
+
+
+
+ Large λ
+
+ : both $$J_{train}(\Theta)$$ and $$J_{CV}(\Theta)$$ will be high (underfitting /high bias)
+
+
+ The figure below illustrates the relationship between lambda and the hypothesis:
+
+
+
+
+
+ In order to choose the model and the regularization λ, we need:
+
+
+
+
+ 1. Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
+
+
+
+
+ 2. Create a set of models with different degrees or any other variants.
+
+
+ 3. Iterate through the $$\lambda$$s and for each $$\lambda$$ go through all the models to learn some $$\Theta$$.
+
+
+ 4. Compute the cross validation error using the learned Θ (computed with λ) on the $$J_{CV}(\Theta)$$ without regularization or λ = 0.
+
+
+ 5. Select the best combo that produces the lowest error on the cross validation set.
+
+
+ 6. Using the best combo Θ and λ, apply it on $$J_{test}(\Theta)$$ to see if it has a good generalization of the problem.
+
+
+ Learning Curves
+
+
+ Training on just 3 examples will easily give 0 errors because we can always find a quadratic curve that exactly touches 3 points.
+
+
+
+
+ As the training set gets larger, the error for a quadratic function increases.
+
+
+
+
+ The error value will plateau out after a certain m, or training set size.
+
+
+
+
+
+ With high bias
+
+
+
+
+ Low training set size
+
+ : causes $$J_{train}(\Theta)$$ to be low and $$J_{CV}(\Theta)$$ to be high.
+
+
+
+ Large training set size
+
+ : causes both $$J_{train}(\Theta)$$ and $$J_{CV}(\Theta)$$ to be high with $$J_{train}(\Theta)$$≈$$J_{CV}(\Theta)$$.
+
+
+ If a learning algorithm is suffering from
+
+ high bias
+
+ , getting more training data
+
+ will not (by itself) help much
+
+ .
+
+
+ For high variance, we have the following relationships in terms of the training set size:
+
+
+
+ With high variance
+
+
+
+
+ Low training set size
+
+ : $$J_{train}(\Theta)$$ will be low and $$J_{CV}(\Theta)$$ will be high.
+
+
+
+ Large training set size
+
+ : $$J_{train}(\Theta)$$ increases with training set size and $$J_{CV}(\Theta)$$ continues to decrease without leveling off. Also, $$J_{train}(\Theta)$$<$$J_{CV}(\Theta)$$ but the difference between them remains significant.
+
+
+ If a learning algorithm is suffering from
+
+ high variance
+
+ , getting more training data is
+
+ likely to help.
+
+
+
+
+
+
+
+
+
+
+
+ Deciding What to Do Next Revisited
+
+
+ Our decision process can be broken down as follows:
+
+
+
+
+ Getting more training examples
+
+
+
+
+ Fixes high variance
+
+
+
+
+ Trying smaller sets of features
+
+
+
+
+ Fixes high variance
+
+
+
+
+ Adding features
+
+
+
+
+ Fixes high bias
+
+
+
+
+ Adding polynomial features
+
+
+
+
+ Fixes high bias
+
+
+
+
+ Decreasing λ
+
+
+
+
+ Fixes high bias
+
+
+
+
+ Increasing λ
+
+
+
+
+ Fixes high variance
+
+
+ Diagnosing Neural Networks
+
+
+
+
+ A neural network with fewer parameters is
+
+ prone to underfitting
+
+ . It is also
+
+ computationally cheaper
+
+ .
+
+
+
+
+ A large neural network with more parameters is
+
+ prone to overfitting
+
+ . It is also
+
+ computationally expensive
+
+ . In this case you can use regularization (increase λ) to address the overfitting.
+
+
+
+
+ Using a single hidden layer is a good starting default. You can train your neural network on a number of hidden layers using your cross validation set.
+
+
+ Model Selection:
+
+
+ Choosing M the order of polynomials.
+
+
+ How can we tell which parameters Θ to leave in the model (known as "model selection")?
+
+
+ There are several ways to solve this problem:
+
+
+
+
+ Get more data (very difficult).
+
+
+
+
+ Choose the model which best fits the data without overfitting (very difficult).
+
+
+
+
+ Reduce the opportunity for overfitting through regularization.
+
+
+
+
+
+ Bias: approximation error (Difference between expected value and optimal value)
+
+
+
+
+
+ High Bias = UnderFitting (BU)
+
+
+
+
+ $$J_{train}(\Theta)$$ and $$J_{CV}(\Theta)$$ both will be high and $$J_{train}(\Theta)$$ ≈ $$J_{CV}(\Theta)$$
+
+
+
+
+
+ Variance: estimation error due to finite data
+
+
+
+
+
+ High Variance = OverFitting (VO)
+
+
+
+
+ $$J_{train}(\Theta)$$ is low and $$J_{CV}(\Theta)$$ ≫$$J_{train}(\Theta)$$
+
+
+
+
+
+ Intuition for the bias-variance trade-off:
+
+
+
+
+
+ Complex model => sensitive to data => much affected by changes in X => high variance, low bias.
+
+
+
+
+ Simple model => more rigid => does not change as much with changes in X => low variance, high bias.
+
+
+
+
+ One of the most important goals in learning: finding a model that is just right in the bias-variance trade-off.
+
+
+
+ Regularization Effects:
+
+
+
+
+
+ Small values of λ allow model to become finely tuned to noise leading to large variance => overfitting.
+
+
+
+
+ Large values of λ pull weight parameters to zero leading to large bias => underfitting.
+
+
+
+
+
+ Model Complexity Effects:
+
+
+
+
+
+ Lower-order polynomials (low model complexity) have high bias and low variance. In this case, the model fits poorly consistently.
+
+
+
+
+ Higher-order polynomials (high model complexity) fit the training data extremely well and the test data extremely poorly. These have low bias on the training data, but very high variance.
+
+
+
+
+ In reality, we would want to choose a model somewhere in between, that can generalize well but also fits the data reasonably well.
+
+
+
+
+
+ A typical rule of thumb when running diagnostics is:
+
+
+
+
+
+ More training examples fixes high variance but not high bias.
+
+
+
+
+ Fewer features fixes high variance but not high bias.
+
+
+
+
+ Additional features fixes high bias but not high variance.
+
+
+
+
+ The addition of polynomial and interaction features fixes high bias but not high variance.
+
+
+
+
+ When using gradient descent, decreasing lambda can fix high bias and increasing lambda can fix high variance (lambda is the regularization parameter).
+
+
+
+
+ When using neural networks, small neural networks are more prone to under-fitting and big neural networks are prone to over-fitting. Cross-validation of network size is a way to choose alternatives.
+
+
+
+
+ ML:Machine Learning System Design
+
+
+ Prioritizing What to Work On
+
+
+ Different ways we can approach a machine learning problem:
+
+
+
+
+ Collect lots of data (for example "honeypot" project but doesn't always work)
+
+
+
+
+ Develop sophisticated features (for example: using email header data in spam emails)
+
+
+
+
+ Develop algorithms to process your input in different ways (recognizing misspellings in spam).
+
+
+
+
+ It is difficult to tell which of the options will be helpful.
+
+
+ Error Analysis
+
+
+ The recommended approach to solving machine learning problems is:
+
+
+
+
+ Start with a simple algorithm, implement it quickly, and test it early.
+
+
+
+
+ Plot learning curves to decide if more data, more features, etc. will help
+
+
+
+
+ Error analysis: manually examine the errors on examples in the cross validation set and try to spot a trend.
+
+
+
+
+ It's important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm's performance.
+
+
+ You may need to process your input before it is useful. For example, if your input is a set of words, you may want to treat the same word with different forms (fail/failing/failed) as one word, so you must use "stemming software" to recognize them all as one.
+
+
+ Error Metrics for Skewed Classes
+
+
+ It is sometimes difficult to tell whether a reduction in error is actually an improvement of the algorithm.
+
+
+
+
+ For example: In predicting a cancer diagnosis where 0.5% of the examples have cancer, we find our learning algorithm has a 1% error. However, if we were to simply classify every single example as a 0, then our error would reduce to 0.5% even though we did not improve the algorithm.
+
+
+
+
+ This usually happens with
+
+ skewed classes
+
+ ; that is, when our class is very rare in the entire data set.
+
+
+ Or to say it another way, when we have a lot more examples from one class than from the other class.
+
+
+ For this we can use
+
+ Precision/Recall
+
+ .
+
+
+
+
+ Predicted: 1, Actual: 1 --- True positive
+
+
+
+
+ Predicted: 0, Actual: 0 --- True negative
+
+
+
+
+ Predicted: 0, Actual: 1 --- False negative
+
+
+
+
+ Predicted: 1, Actual: 0 --- False positive
+
+
+
+
+
+ Precision
+
+ : of all patients we predicted where y=1, what fraction actually has cancer?
+
+
+
+
+
+ $$\dfrac{\text{True Positives}}{\text{Total number of predicted positives}}
+= \dfrac{\text{True Positives}}{\text{True Positives}+\text{False positives}}$$
+
+
+
+
+
+
+ Recall
+
+ : Of all the patients that actually have cancer, what fraction did we correctly detect as having cancer?
+
+
+
+
+
+ $$\dfrac{\text{True Positives}}{\text{Total number of actual positives}}= \dfrac{\text{True Positives}}{\text{True Positives}+\text{False negatives}}$$
+
+
+
+
+
+ These two metrics give us a better sense of how our classifier is doing. We want both precision and recall to be high.
+
+
+ In the example at the beginning of the section, if we classify all patients as 0, then our
+
+ recall
+
+ will be $$\dfrac{0}{0 + f} = 0$$, so despite having a lower error percentage, we can quickly see it has worse recall.
+
+ Note 1: if an algorithm predicts only negatives, as it does in one of the exercises, the precision is not defined because it is impossible to divide by 0. The F1 score will not be defined either.
+
+
+ Trading Off Precision and Recall
+
+
+ We might want a
+
+ confident
+
+ prediction of two classes using logistic regression. One way is to increase our threshold:
+
+
+
+
+ Predict 1 if: $$h_\theta(x) \geq 0.7$$
+
+
+
+
+ Predict 0 if: $$h_\theta(x) < 0.7$$
+
+
+
+
+ This way, we only predict cancer if the patient has a 70% chance.
+
+
+ Doing this, we will have
+
+ higher precision
+
+ but
+
+ lower recall
+
+ (refer to the definitions in the previous section).
+
+
+ In the opposite example, we can lower our threshold:
+
+
+
+
+ Predict 1 if: $$h_\theta(x) \geq 0.3$$
+
+
+
+
+ Predict 0 if: $$h_\theta(x) < 0.3$$
+
+
+
+
+ That way, we get a very
+
+ safe
+
+ prediction. This will cause
+
+ higher recall
+
+ but
+
+ lower precision
+
+ .
+
+
+ The greater the threshold, the greater the precision and the lower the recall.
+
+
+ The lower the threshold, the greater the recall and the lower the precision.
+
+
+ In order to turn these two metrics into one single number, we can take the
+
+ F value
+
+ .
+
+
+ One way is to take the
+
+ average
+
+ :
+
+
+ $$\dfrac{P+R}{2}$$
+
+
+ This does not work well. If we predict all y=0 then that will bring the average up despite having 0 recall. If we predict all examples as y=1, then the very high recall will bring up the average despite having 0 precision.
+
+
+ A better way is to compute the
+
+ F Score
+
+ (or F1 score):
+
+
+ $$\text{F Score} = 2\dfrac{PR}{P + R}$$
+
+
+ In order for the F Score to be large, both precision and recall must be large.
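For instance (an illustrative calculation, not from the lecture): a classifier that predicts y=1 for everything on a rare-positive dataset might have P = 0.02 and R = 1.0. The simple average is $$\frac{0.02 + 1.0}{2} = 0.51$$, which looks respectable, but the F Score is $$2\cdot\frac{0.02 \cdot 1.0}{1.02} \approx 0.039$$, which correctly flags it as a poor classifier.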
+
+
+ We want to train precision and recall on the
+
+ cross validation set
+
+ so as not to bias our test set.
+
+
+ Data for Machine Learning
+
+
+ How much data should we train on?
+
+
+ In certain cases, an "inferior algorithm," if given enough data, can outperform a superior algorithm with less data.
+
+
+ We must choose our features to have
+
+ enough
+
+ information. A useful test is: Given input x, would a human expert be able to confidently predict y?
+
+
+
+ Rationale for large data
+
+ : if we have a
+
+ low bias
+
+ algorithm (many features or hidden units making a very complex function), then the larger the training set we use, the less we will have overfitting (and the more accurate the algorithm will be on the test set).
+
+
+ Quiz instructions
+
+
+ When the quiz instructions tell you to enter a value to "two decimal digits", what it really means is "two significant digits". So, just for example, the value 0.0123 should be entered as "0.012", not "0.01".
+
+ The
+
+ Support Vector Machine
+
+ (SVM) is yet another type of
+
+ supervised
+
+ machine learning algorithm. It is sometimes cleaner and more powerful.
+
+
+ Recall that in logistic regression, we use the following rules:
+
+
+ if y=1, then $$h_\theta(x) \approx 1$$ and $$\Theta^Tx \gg 0$$
+
+
+ if y=0, then $$h_\theta(x) \approx 0$$ and $$\Theta^Tx \ll 0$$
+
+
Recall the cost function for (unregularized) logistic regression:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^m \Big[ -y^{(i)}\log(h_\theta(x^{(i)})) - (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big) \Big]$$
+
+ To make a support vector machine, we will modify the first term of the cost function $$-\log(h_{\theta}(x)) = -\log\Big(\dfrac{1}{1 + e^{-\theta^Tx}}\Big)$$ so that when $$θ^Tx$$ (from now on, we shall refer to this as z) is
+
+ greater than
+
+ 1, it outputs 0. Furthermore, for values of z less than 1, we shall use a straight decreasing line instead of the sigmoid curve.(In the literature, this is called a hinge loss (
+
+ https://en.wikipedia.org/wiki/Hinge_loss)
+
+ function.)
+
+
+
+
+
+ Similarly, we modify the second term of the cost function $$-\log(1 - h_{\theta}(x)) = -\log\Big(1 - \dfrac{1}{1 + e^{-\theta^Tx}}\Big)$$ so that when z is
+
+ less than
+
+ -1, it outputs 0. We also modify it so that for values of z greater than -1, we use a straight increasing line instead of the sigmoid curve.
+
+
+
+
+
+ We shall denote these as $$\text{cost}_1(z)$$ and $$\text{cost}_0(z)$$ (respectively, note that $$\text{cost}_1(z)$$ is the cost for classifying when y=1, and $$\text{cost}_0(z)$$ is the cost for classifying when y=0), and we may define them as follows (where k is an arbitrary constant defining the magnitude of the slope of the line):
+
+
+ $$z = \theta^Tx$$
+
+
+ $$\text{cost}_0(z) = \max(0, k(1+z))$$
+
+
+ $$\text{cost}_1(z) = \max(0, k(1-z))$$
+
+
Recall the full cost function from (regularized) logistic regression:

$$J(\theta) = \frac{1}{m}\sum_{i=1}^m \Big[ -y^{(i)}\log(h_\theta(x^{(i)})) - (1-y^{(i)})\log\big(1-h_\theta(x^{(i)})\big) \Big] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$
+
+ We can optimize this a bit by multiplying this by m (thus removing the m factor in the denominators). Note that this does not affect our optimization, since we're simply multiplying our cost function by a positive constant (for example, minimizing $$(u-5)^2 + 1$$ gives us 5; multiplying it by 10 to make it $$10(u-5)^2 + 10$$ still gives us 5 when minimized).
+
+ This is equivalent to multiplying the equation by $$C = \dfrac{1}{\lambda}$$, and thus results in the same values when optimized. Now, when we wish to regularize more (that is, reduce overfitting), we
+
+ decrease
+
+ C, and when we wish to regularize less (that is, reduce underfitting), we
+
+ increase
+
+ C.
+
+
+ Finally, note that the hypothesis of the Support Vector Machine is
+
+ not
+
+ interpreted as the probability of y being 1 or 0 (as it is for the hypothesis of logistic regression). Instead, it outputs either 1 or 0. (In technical terms, it is a discriminant function.)
+
+ A useful way to think about Support Vector Machines is to think of them as
+
+ Large Margin Classifiers
+
+ .
+
+
+ If y=1, we want $$\Theta^Tx \geq 1$$ (not just ≥0)
+
+
+ If y=0, we want $$\Theta^Tx \leq -1$$ (not just <0)
+
+
+ Now when we set our constant C to a very
+
+ large
+
+ value (e.g. 100,000), our optimizing function will constrain Θ such that the equation A (the summation of the cost of each example) equals 0. We impose the following constraints on Θ:
+
+
+ $$\Theta^Tx \geq 1$$ if y=1 and $$\Theta^Tx \leq -1$$ if y=0.
+
+
If C is very large, we must choose Θ parameters such that:

$$\min_\Theta\ \frac{1}{2}\sum_{j=1}^n \Theta_j^2 \quad \text{s.t.} \quad \Theta^Tx^{(i)} \geq 1 \ \text{if}\ y^{(i)}=1, \qquad \Theta^Tx^{(i)} \leq -1 \ \text{if}\ y^{(i)}=0$$
+
+ Recall the decision boundary from logistic regression (the line separating the positive and negative examples). In SVMs, the decision boundary has the special property that it is
+
+ as far away as possible
+
+ from both the positive and the negative examples.
+
+
+ The distance of the decision boundary to the nearest example is called the
+
+ margin
+
+ . Since SVMs maximize this margin, it is often called a
+
+ Large Margin Classifier
+
+ .
+
+
+ The SVM will separate the negative and positive examples by a
+
+ large margin
+
+ .
+
+
+ This large margin is only achieved when
+
+ C is very large
+
+ .
+
+
+ Data is
+
+ linearly separable
+
+ when a
+
+ straight line
+
+ can separate the positive and negative examples.
+
+
+ If we have
+
+ outlier
+
+ examples that we don't want to affect the decision boundary, then we can
+
+ reduce
+
+ C.
+
+
+ Increasing and decreasing C is similar to respectively decreasing and increasing λ, and can simplify our decision boundary.
+
+
+ Mathematics Behind Large Margin Classification (Optional)
+
+ The
+
+ length of vector v
+
+ is denoted $$||v||$$, and it is the length of the line on a graph from the origin (0,0) to $$(v_1,v_2)$$.
+
+
+ The length of vector v can be calculated with $$\sqrt{v_1^2 + v_2^2}$$ by the Pythagorean theorem.
+
+
+ The
+
+ projection
+
+ of vector v onto vector u is found by taking a right angle from u to the end of v, creating a right triangle.
+
+
+
+
+ p= length of projection of v onto the vector u.
+
+
+
+
+ $$u^Tv= p \cdot ||u||$$
+
+
+
+
+ Note that $$u^Tv = ||u|| \cdot ||v|| \cos \theta$$ where θ is the angle between u and v. Also, $$p = ||v|| \cos \theta$$. If you substitute p for $$||v|| \cos \theta$$, you get $$u^Tv= p \cdot ||u||$$.
+
+
+ So the product $$u^Tv$$ is equal to the length of the projection times the length of vector u.
+
+
+ In our example, since u and v are vectors of the same length, $$u^Tv = v^Tu$$.
+
+ So we now have a new
+
+ optimization objective
+
+ by substituting $$p^{(i)} \cdot ||\Theta ||$$ in for $$\Theta^Tx^{(i)}$$:
+
+
+ If y=1, we want $$p^{(i)} \cdot ||\Theta || \geq 1$$
+
+
+ If y=0, we want $$p^{(i)} \cdot ||\Theta || \leq -1$$
+
+
+ The reason this causes a "large margin" is because: the vector for Θ is perpendicular to the decision boundary. In order for our optimization objective (above) to hold true, we need the absolute value of our projections $$p^{(i)}$$ to be as large as possible.
+
+
+ If $$\Theta_0 =0$$, then all our decision boundaries will intersect (0,0). If $$\Theta_0 \neq 0$$, the support vector machine will still find a large margin for the decision boundary.
+
+
+ Kernels I
+
+
+
+ Kernels
+
+ allow us to make complex, non-linear classifiers using Support Vector Machines.
+
+
+ Given x, compute new feature depending on proximity to landmarks $$l^{(1)},\ l^{(2)},\ l^{(3)}$$.
+
+
To do this, we find the "similarity" of x and some landmark $$l^{(i)}$$:

$$f_i = similarity(x, l^{(i)}) = \exp\left(-\dfrac{||x - l^{(i)}||^2}{2\sigma^2}\right)$$

This "similarity" function is called a Gaussian Kernel; it is a specific example of a kernel.
+
+ There are a couple properties of the similarity function:
+
+
+ If $$x \approx l^{(i)}$$, then $$f_i = \exp(-\dfrac{\approx 0^2}{2\sigma^2}) \approx 1$$
+
+
+ If x is far from $$l^{(i)}$$, then $$f_i = \exp(-\dfrac{(large\ number)^2}{2\sigma^2}) \approx 0$$
+
+
+ In other words, if x and the landmark are close, then the similarity will be close to 1, and if x and the landmark are far away from each other, the similarity will be close to 0.
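A minimal Octave sketch of this similarity function, in the spirit of the gaussianKernel() helper from the programming exercises (x1 and x2 are feature vectors of the same length):

function sim = gaussianKernel(x1, x2, sigma)
  % Gaussian (RBF) similarity between two vectors, with bandwidth sigma
  sim = exp(-sum((x1(:) - x2(:)) .^ 2) / (2 * sigma^2));
end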
+
+
Each landmark gives us a feature in our hypothesis:

$$h_\Theta(x) = \Theta_0 + \Theta_1 f_1 + \Theta_2 f_2 + \Theta_3 f_3 + \dots$$
+
+ $$\sigma^2$$ is a parameter of the Gaussian Kernel, and it can be modified to increase or decrease the
+
+ drop-off
+
+ of our feature $$f_i$$. Combined with looking at the values inside Θ, we can choose these landmarks to get the general shape of the decision boundary.
+
+
+ Kernels II
+
+
+ One way to get the landmarks is to put them in the
+
+ exact same
+
+ locations as all the training examples. This gives us m landmarks, with one landmark per training example.
+
+
+ Given example x:
+
+
+ $$f_1 = similarity(x,l^{(1)})$$, $$f_2 = similarity(x,l^{(2)})$$, $$f_3 = similarity(x,l^{(3)})$$, and so on.
+
+
+ This gives us a "feature vector," $$f_{(i)}$$ of all our features for example $$x_{(i)}$$. We may also set $$f_0 = 1$$ to correspond with $$Θ_0$$. Thus given training example $$x_{(i)}$$:
+
+ Using kernels to generate f(i) is not exclusive to SVMs and may also be applied to logistic regression. However, because of computational optimizations on SVMs, kernels combined with SVMs is much faster than with other algorithms, so kernels are almost always found combined only with SVMs.
+
+
+
+ Choosing SVM Parameters
+
+
+
+ Choosing C (recall that $$C = \dfrac{1}{\lambda}$$):
+
+
+
+
+ If C is large, then we get higher variance/lower bias
+
+
+
+
+ If C is small, then we get lower variance/higher bias
+
+
+
+
+ The other parameter we must choose is $$σ^2$$ from the Gaussian Kernel function:
+
+
+ With a large $$σ^2$$, the features fi vary more smoothly, causing higher bias and lower variance.
+
+
+ With a small $$σ^2$$, the features fi vary less smoothly, causing lower bias and higher variance.
+
+
+
+ Using An SVM
+
+
+
+ There are lots of good SVM libraries already written. A. Ng often uses 'liblinear' and 'libsvm'. In practical application, you should use one of these libraries rather than rewrite the functions.
+
+
+ In practical application, the choices you do need to make are:
+
+
+
+
+ Choice of parameter C
+
+
+
+
+ Choice of kernel (similarity function)
+
+
+
+
+ No kernel ("linear" kernel) -- gives standard linear classifier
+
+
+
+
+ Choose when n is large and when m is small
+
+
+
+
+ Gaussian Kernel (above) -- need to choose $$σ^2$$
+
+
+
+
+ Choose when n is small and m is large
+
+
+
+
+ The library may ask you to provide the kernel function.
+
+
+
+ Note:
+
+ do perform feature scaling before using the Gaussian Kernel.
+
+
+
+ Note:
+
+ not all similarity functions are valid kernels. They must satisfy "Mercer's Theorem" which guarantees that the SVM package's optimizations run correctly and do not diverge.
+
+
+ You want to train C and the parameters for the kernel function using the training and cross-validation datasets.
+
+
+
+ Multi-class Classification
+
+
+
+ Many SVM libraries have multi-class classification built-in.
+
+
+ You can use the
+
+ one-vs-all
+
+ method just like we did for logistic regression, where $$y \in {1,2,3,\dots,K}$$ with $$\Theta^{(1)}, \Theta^{(2)}, \dots, \Theta^{(K)}$$. We pick class i with the largest $$(\Theta^{(i)})^Tx$$.
+
+
+
+ Logistic Regression vs. SVMs
+
+
+
+ If n is large (relative to m), then use logistic regression, or SVM without a kernel (the "linear kernel")
+
+
+ If n is small and m is intermediate, then use SVM with a Gaussian Kernel
+
+
+ If n is small and m is large, then manually create/add more features, then use logistic regression or SVM without a kernel.
+
+
+ In the first case, we don't have enough examples to need a complicated polynomial hypothesis. In the second case, we have enough examples that we may need a complex non-linear hypothesis. In the last case, we want to increase our features so that logistic regression becomes applicable.
+
+
+
+ Note
+
+ : a neural network is likely to work well for any of these situations, but may be slower to train.
+
+
+
+
+
diff --git a/ML_Mathematical_Approach/09_week-7-lecture-notes/01__svm-notes-long-08.pdf b/ML_Mathematical_Approach/09_week-7-lecture-notes/01__svm-notes-long-08.pdf
new file mode 100644
index 0000000..d7e5866
Binary files /dev/null and b/ML_Mathematical_Approach/09_week-7-lecture-notes/01__svm-notes-long-08.pdf differ
diff --git a/ML_Mathematical_Approach/10_week-8-lecture-notes/01__resources.html b/ML_Mathematical_Approach/10_week-8-lecture-notes/01__resources.html
new file mode 100644
index 0000000..65b4cfa
--- /dev/null
+++ b/ML_Mathematical_Approach/10_week-8-lecture-notes/01__resources.html
@@ -0,0 +1,804 @@
+
+
+
+ ML:Clustering
+
+
+ Unsupervised Learning: Introduction
+
+
+ Unsupervised learning is contrasted from supervised learning because it uses an
+
+ unlabeled
+
+ training set rather than a labeled one.
+
+
+ In other words, we don't have the vector y of expected results, we only have a dataset of features where we can find structure.
+
+
+ Clustering is good for:
+
+
+
+
+ Market segmentation
+
+
+
+
+ Social network analysis
+
+
+
+
+ Organizing computer clusters
+
+
+
+
+ Astronomical data analysis
+
+
+
+
+ K-Means Algorithm
+
+
+ The K-Means Algorithm is the most popular and widely used algorithm for automatically grouping data into coherent subsets.
+
+
+
+
+ Randomly initialize two points in the dataset called the
+
+ cluster centroids
+
+ .
+
+
+
+
+ Cluster assignment: assign all examples into one of two groups based on which cluster centroid the example is closest to.
+
+
+
+
+ Move centroid: compute the averages for all the points inside each of the two cluster centroid groups, then move the cluster centroid points to those averages.
+
+
+
+
+ Re-run (2) and (3) until we have found our clusters.
+
+
+
+
+ Our main variables are:
+
+
+
+
+ K (number of clusters)
+
+
+
+
+ Training set $${x^{(1)}, x^{(2)}, \dots,x^{(m)}}$$
+
+
+
+
+ Where $$x^{(i)} \in \mathbb{R}^n$$
+
+
+
+
+ Note that we
+
+ will not use
+
+ the x0=1 convention.
+
+
+
+ The algorithm:
+
+
+
Randomly initialize K cluster centroids mu(1), mu(2), ..., mu(K)
+Repeat:
+ for i = 1 to m:
+ c(i):= index (from 1 to K) of cluster centroid closest to x(i)
+ for k = 1 to K:
+ mu(k):= average (mean) of points assigned to cluster k
+
+ The
+
+ first for-loop
+
+ is the 'Cluster Assignment' step. We make a vector
+
+ c
+
+ where
+
+ c(i)
+
+ represents the centroid assigned to example
+
+ x(i)
+
+ .
+
+
+ We can write the operation of the Cluster Assignment step more mathematically as follows:
+
+
+ $$c^{(i)} = argmin_k\ ||x^{(i)} - \mu_k||^2$$
+
+
+ That is, each $$c^{(i)}$$ contains the index of the centroid that has minimal distance to $$x^{(i)}$$.
+
+
+ By convention, we square the right-hand side, which makes the function we are trying to minimize increase more sharply. This is mostly just a convention, but it also reduces the computational load: the Euclidean distance would require a square root, and squaring cancels it.
+
+ The move centroid step computes $$\mu_k = \dfrac{1}{n}\left[x^{(k_1)} + x^{(k_2)} + \dots + x^{(k_n)}\right] \in \mathbb{R}^n$$, where each of $$x^{(k_1)}, x^{(k_2)}, \dots, x^{(k_n)}$$ are the training examples assigned to group $$\mu_k$$.
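+
+
+ As a rough Octave sketch (an illustration only, not the course's implementation), one iteration of the two steps above could look like this, with X an m x n matrix of examples (rows) and mu a K x n matrix of centroids (rows):
+
% Cluster assignment step: assign each example to its closest centroid.
+m = size(X, 1);
+K = size(mu, 1);
+c = zeros(m, 1);
+for i = 1:m
+  dists = sum((mu - X(i, :)) .^ 2, 2);   % squared distance to every centroid
+  [~, c(i)] = min(dists);                % index of the closest centroid
+end
+% Move centroid step: move each centroid to the mean of its assigned points.
+for k = 1:K
+  if any(c == k)                         % skip centroids with no points assigned
+    mu(k, :) = mean(X(c == k, :), 1);
+  end
+end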
+
+
+ If you have a cluster centroid with
+
+ 0 points
+
+ assigned to it, you can randomly
+
+ re-initialize
+
+ that centroid to a new point. You can also simply
+
+ eliminate
+
+ that cluster group.
+
+
+ After a number of iterations the algorithm will
+
+ converge
+
+ , where new iterations do not affect the clusters.
+
+
+ Note on non-separated clusters: some datasets have no real inner separation or natural structure. K-means can still evenly segment your data into K subsets, so can still be useful in this case.
+
+
+ Optimization Objective
+
+
+ Recall some of the parameters we used in our algorithm:
+
+
+
+
+ $$c^{(i)}$$ = index of cluster (1,2,...,K) to which example x(i) is currently assigned
+
+
+
+
+ $$\mu_k$$ = cluster centroid k ($$\mu_k \in \mathbb{R}^n$$)
+
+
+
+
+ $$\mu_{c^{(i)}}$$ = cluster centroid of cluster to which example x(i) has been assigned
+
+
+
+
+ Using these variables we can define our cost function:
+
+ $$J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \dfrac{1}{m}\sum_{i=1}^m ||x^{(i)} - \mu_{c^{(i)}}||^2$$
+
+ Our
+
+ optimization objective
+
+ is to minimize all our parameters using the above cost function:
+
+
+ $$min_{c,\mu}\ J(c,\mu)$$
+
+
+ That is, we are finding all the values in sets c, representing all our clusters, and μ, representing all our centroids, that will minimize
+
+ the average of the distances
+
+ of every training example to its corresponding cluster centroid.
+
+
+ The above cost function is often called the
+
+ distortion
+
+ of the training examples.
+
+
+ In the
+
+ cluster assignment step
+
+ , our goal is to:
+
+
+ Minimize J(…) with $$c^{(1)},\dots,c^{(m)}$$ (holding $$\mu_1,\dots,\mu_K$$ fixed)
+
+
+ In the
+
+ move centroid
+
+ step, our goal is to:
+
+
+ Minimize J(…) with $$\mu_1,\dots,\mu_K$$
+
+
+ With k-means, it is
+
+ not possible for the cost function to sometimes increase
+
+ . It should always descend.
+
+
+ Random Initialization
+
+
+ There's one particular recommended method for randomly initializing your cluster centroids.
+
+
+
+
+ Have K<m. That is, make sure the number of your clusters is less than the number of your training examples.
+
+
+
+
+ Randomly pick K training examples. (Not mentioned in the lecture, but also be sure the selected examples are unique).
+
+
+
+
+ Set $$\mu_1,\dots,\mu_K$$ equal to these K examples.
+
+
+
+
+ K-means
+
+ can get stuck in local optima
+
+ . To decrease the chance of this happening, you can run the algorithm on many different random initializations. In cases where K<10 it is strongly recommended to run a loop of random initializations.
+
+
for i = 1 to 100:
+ randomly initialize k-means
+ run k-means to get 'c' and 'm'
+ compute the cost function (distortion) J(c,m)
+pick the clustering that gave us the lowest cost
+
+
+ Choosing the Number of Clusters
+
+
+ Choosing K can be quite arbitrary and ambiguous.
+
+
+
+ The elbow method
+
+ : plot the cost J and the number of clusters K. The cost function should reduce as we increase the number of clusters, and then flatten out. Choose K at the point where the cost function starts to flatten out.
+
+
+ However, fairly often, the curve is
+
+ very gradual
+
+ , so there's no clear elbow.
+
+
+
+ Note:
+
+ J will
+
+ always
+
+ decrease as K is increased. The one exception is if k-means gets stuck at a bad local optimum.
+
+
+ Another way to choose K is to observe how well k-means performs on a
+
+ downstream purpose
+
+ . In other words, you choose K that proves to be most useful for some goal you're trying to achieve from using these clusters.
+
+ ML:Dimensionality Reduction
+
+
+ Motivation I: Data Compression
+
+
+ We may want to reduce the dimension of our features if we have a lot of redundant data.
+
+
+
+
+ To do this, we find two highly correlated features, plot them, and make a new line that seems to describe both features accurately. We place all the new features on this single line.
+
+
+
+
+ Doing dimensionality reduction will reduce the total data we have to store in computer memory and will speed up our learning algorithm.
+
+
+ Note: in dimensionality reduction, we are reducing our features rather than our number of examples. Our variable m will stay the same size; n, the number of features each example from $$x^{(1)}$$ to $$x^{(m)}$$ carries, will be reduced.
+
+
+
+ Motivation II: Visualization
+
+
+
+ It is not easy to visualize data that is more than three dimensions. We can reduce the dimensions of our data to 3 or less in order to plot it.
+
+
+ We need to find new features, $$z_1, z_2$$(and perhaps $$z_3$$) that can effectively
+
+ summarize
+
+ all the other features.
+
+
+ Example: hundreds of features related to a country's economic system may all be combined into one feature that you call "Economic Activity."
+
+
+ Principal Component Analysis Problem Formulation
+
+
+ The most popular dimensionality reduction algorithm is
+
+ Principal Component Analysis
+
+ (PCA)
+
+
+
+ Problem formulation
+
+
+
+ Given two features, $$x_1$$ and $$x_2$$, we want to find a single line that effectively describes both features at once. We then map our old features onto this new line to get a new single feature.
+
+
+ The same can be done with three features, where we map them to a plane.
+
+
+ The
+
+ goal of PCA
+
+ is to
+
+ reduce
+
+ the average of all the distances of every feature to the projection line. This is the
+
+ projection error
+
+ .
+
+
+ Reduce from 2d to 1d: find a direction (a vector $$u^{(1)} \in \mathbb{R}^n$$) onto which to project the data so as to minimize the projection error.
+
+
+ The more general case is as follows:
+
+
+ Reduce from n-dimension to k-dimension: Find k vectors $$u^{(1)}, u^{(2)}, \dots, u^{(k)}$$ onto which to project the data so as to minimize the projection error.
+
+
+ If we are converting from 3d to 2d, we will project our data onto two directions (a plane), so k will be 2.
+
+
+
+ PCA is not linear regression
+
+
+
+
+
+ In linear regression, we are minimizing the
+
+ squared error
+
+ from every point to our predictor line. These are vertical distances.
+
+
+
+
+ In PCA, we are minimizing the
+
+ shortest distance
+
+ , or shortest
+
+ orthogonal
+
+ distances, to our data points.
+
+
+
+
+ More generally, in linear regression we are taking all our examples in x and applying the parameters in Θ to predict y.
+
+
+ In PCA, we are taking a number of features $$x_1, x_2, \dots, x_n$$, and finding a closest common dataset among them. We aren't trying to predict any result and we aren't applying any theta weights to the features.
+
+
+ Principal Component Analysis Algorithm
+
+
+ Before we can apply PCA, there is a data pre-processing step we must perform:
+
+ Feature scaling/mean normalization: compute the mean of each feature, $$\mu_j = \dfrac{1}{m}\sum_{i=1}^m x_j^{(i)}$$, and replace each $$x_j^{(i)}$$ with $$x_j^{(i)} - \mu_j$$
+
+
+
+
+ If different features are on different scales (e.g., $$x_1$$ = size of house, $$x_2$$ = number of bedrooms), scale features to have a comparable range of values.
+
+
+
+
+ Above, we first subtract the mean of each feature from the original feature. Then we scale all the features $$x_j^{(i)} = \dfrac{x_j^{(i)} - \mu_j}{s_j}$$
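+
+
+ In Octave, that pre-processing step might look like the following sketch (assuming X is an m x n matrix with one example per row):
+
mu = mean(X);            % 1 x n vector of per-feature means
+X_norm = X - mu;         % mean normalization (broadcast across rows)
+s = std(X_norm);         % per-feature scale; (max - min) would also work
+s(s == 0) = 1;           % guard against dividing by zero
+X_norm = X_norm ./ s;    % feature scaling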
+
+
+ We can define specifically what it means to reduce from 2d to 1d data as follows: first compute the covariance matrix, $$\Sigma = \dfrac{1}{m}\sum_{i=1}^m (x^{(i)})(x^{(i)})^T$$, and then compute its eigenvectors using svd().
+
+ We denote the covariance matrix with a capital sigma (which happens to be the same symbol for summation, confusingly---they represent entirely different things).
+
+
+ Note that $$x^{(i)}$$ is an n×1 vector, $$(x^{(i)})^T$$ is a 1×n vector, and X is an m×n matrix (row-wise stored examples). The product of those will be an n×n matrix, which are the dimensions of Σ.
+
+ svd() is the 'singular value decomposition', a built-in Octave function.
+
+
+ What we actually want out of svd() is the 'U' matrix of the Sigma covariance matrix: $$U \in \mathbb{R}^{n \times n}$$. U contains $$u^{(1)},\dots,u^{(n)}$$, which is exactly what we want.
+
+
+
+ 3. Take the first k columns of the U matrix and compute z
+
+
+
+ We'll assign the first k columns of U to a variable called 'Ureduce'. This will be an n×k matrix. We compute z with:
+
+
+ $$z^{(i)} = Ureduce^T \cdot x^{(i)}$$
+
+
+ $$Ureduce^T$$ will have dimensions k×n while $$x^{(i)}$$ will have dimensions n×1. The product $$Ureduce^T \cdot x^{(i)}$$ will have dimensions k×1.
+
+
+ To summarize, the whole algorithm in octave is roughly:
+
+
Sigma = (1/m) * X' * X; % compute the covariance matrix
+[U,S,V] = svd(Sigma); % compute our projected directions
+Ureduce = U(:,1:k); % take the first k directions
+Z = X * Ureduce; % compute the projected data points
+
+
+ Reconstruction from Compressed Representation
+
+
+ If we use PCA to compress our data, how can we uncompress our data, or go back to our original number of features?
+
+
+ To go from 1-dimension back to 2d we do: $$z \in \mathbb{R} \rightarrow x \in \mathbb{R}^2$$.
+
+
+ We can do this with the equation: $$x_{approx}^{(1)} = U_{reduce} \cdot z^{(1)}$$.
+
+
+ Note that we can only get approximations of our original data.
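+
+
+ In Octave this reconstruction is a single matrix product; a short sketch using the Ureduce and Z computed in the algorithm above (Z holds the projected examples as rows):
+
X_approx = Z * Ureduce';   % m x k times k x n gives the m x n approximate reconstruction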
+
+
+ Note: It turns out that the U matrix has the special property that it is a Unitary Matrix. One of the special properties of a Unitary Matrix is:
+
+
+ $$U^{-1} = U^∗$$ where the "*" means "conjugate transpose".
+
+
+ Since we are dealing with real numbers here, this is equivalent to:
+
+
+ $$U^{-1} = U^T$$ So we could compute the inverse and use that, but it would be a waste of energy and compute cycles.
+
+
+ Choosing the Number of Principal Components
+
+
+ How do we choose k, also called the
+
+ number of principal components
+
+ ? Recall that k is the dimension we are reducing to.
+
+
+ One way to choose k is by using the following formula:
+
+
+
+
+ Given the average squared projection error: $$\dfrac{1}{m}\sum^m_{i=1}||x^{(i)} - x_{approx}^{(i)}||^2$$
+
+
+
+
+ Also given the total variation in the data: $$\dfrac{1}{m}\sum^m_{i=1}||x^{(i)}||^2$$
+
+
+
+
+ Choose k to be the smallest value such that: $$\dfrac{\dfrac{1}{m}\sum^m_{i=1}||x^{(i)} - x_{approx}^{(i)}||^2}{\dfrac{1}{m}\sum^m_{i=1}||x^{(i)}||^2} \leq 0.01$$
+
+
+
+
+ In other words, the squared projection error divided by the total variation should be less than one percent, so that
+
+ 99% of the variance is retained
+
+ .
+
+
+
+ Algorithm for choosing k
+
+
+
+
+
+ Try PCA with k=1,2,…
+
+
+
+
+ Compute $$U_{reduce}, z, x$$
+
+
+
+
+ Check the formula given above that 99% of the variance is retained. If not, go to step one and increase k.
+
+
+
+
+ This procedure would actually be horribly inefficient. In Octave, we will call svd:
+
+
[U,S,V] = svd(Sigma)
+
+
+ Which gives us a matrix S. We can actually check for 99% of retained variance using the S matrix as follows:
+
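+ Choose the smallest k such that $$\dfrac{\sum_{i=1}^k S_{ii}}{\sum_{i=1}^n S_{ii}} \geq 0.99$$, i.e. so that 99% of the variance is retained. In Octave this check is a one-liner over the diagonal of S (a sketch, with k the candidate number of components):
+
retained = sum(diag(S)(1:k)) / sum(diag(S));   % fraction of variance retained by the first k components
+
+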
+ Advice for Applying PCA
+
+
+ The most common use of PCA is to speed up supervised learning.
+
+
+ Given a training set with a large number of features (e.g. $$x^{(1)},\dots,x^{(m)} \in \mathbb{R}^{10000}$$ ) we can use PCA to reduce the number of features in each example of the training set (e.g. $$z^{(1)},\dots,z^{(m)} \in \mathbb{R}^{1000}$$).
+
+
+ Note that we should define the PCA reduction from $$x^{(i)}$$ to $$z^{(i)}$$ only on the training set and not on the cross-validation or test sets. You can apply the mapping z(i) to your cross-validation and test sets after it is defined on the training set.
+
+
+ Applications
+
+
+
+
+ Compression
+
+
+
+
+ Reduce space of data
+
+
+ Speed up algorithm
+
+
+
+
+ Visualization of data
+
+
+
+
+ Choose k = 2 or k = 3
+
+
+
+ Bad use of PCA: trying to prevent overfitting. We might think that reducing the features with PCA would be an effective way to address overfitting. It might work, but is not recommended because it does not consider the values of our results y. Using just regularization will be at least as effective.
+
+
+ Don't assume you need to do PCA.
+
+ Try your full machine learning algorithm without PCA first.
+
+ Then use PCA if you find that you need it.
+
+
+
+
+
+
+
diff --git a/ML_Mathematical_Approach/11_week-9-lecture-notes/01__gaussians.pdf b/ML_Mathematical_Approach/11_week-9-lecture-notes/01__gaussians.pdf
new file mode 100644
index 0000000..3d77a0c
Binary files /dev/null and b/ML_Mathematical_Approach/11_week-9-lecture-notes/01__gaussians.pdf differ
diff --git a/ML_Mathematical_Approach/11_week-9-lecture-notes/01__resources.html b/ML_Mathematical_Approach/11_week-9-lecture-notes/01__resources.html
new file mode 100644
index 0000000..ef771c6
--- /dev/null
+++ b/ML_Mathematical_Approach/11_week-9-lecture-notes/01__resources.html
@@ -0,0 +1,632 @@
+
+
+
+ ML:Anomaly Detection
+
+
+ Problem Motivation
+
+
+ Just like in other learning problems, we are given a dataset $${x^{(1)}, x^{(2)},\dots,x^{(m)}}$$.
+
+
+ We are then given a new example, $$x_{test}$$, and we want to know whether this new example is abnormal/anomalous.
+
+
+ We define a "model" p(x) that tells us the probability the example is not anomalous. We also use a threshold ϵ (epsilon) as a dividing line so we can say which examples are anomalous and which are not.
+
+
+ A very common application of anomaly detection is detecting fraud:
+
+
+
+
+ $$x^{(i)} =$$ features of user i's activities
+
+
+
+
+ Model p(x) from the data.
+
+
+
+
+ Identify unusual users by checking which have p(x)<ϵ.
+
+
+
+
+ If our anomaly detector is flagging
+
+ too many
+
+ anomalous examples, then we need to
+
+ decrease
+
+ our threshold ϵ
+
+
+ Gaussian Distribution
+
+
+ The Gaussian Distribution is a familiar bell-shaped curve that can be described by a function $$\mathcal{N}(\mu,\sigma^2)$$
+
+
+ Let x∈ℝ. If the probability distribution of x is Gaussian with mean μ, variance $$\sigma^2$$, then:
+
+
+ $$x \sim \mathcal{N}(\mu, \sigma^2)$$
+
+
+ The little ∼ or 'tilde' can be read as "distributed as."
+
+
+ The Gaussian Distribution is parameterized by a mean and a variance.
+
+
+ Mu, or μ, describes the center of the curve, called the mean. The width of the curve is described by sigma, or σ, called the standard deviation.
+
+ A vectorized version of the calculation for μ is $$\mu = \dfrac{1}{m}\displaystyle \sum_{i=1}^m x^{(i)}$$. You can vectorize $$\sigma^2$$ similarly.
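+
+
+ As a rough Octave sketch (illustrative, not the course's solution code), fitting the per-feature Gaussian parameters on an m x n matrix X and evaluating p(x) for a new example x (n x 1) against an assumed threshold epsilon:
+
mu = mean(X);                                   % 1 x n vector of feature means
+sigma2 = var(X, 1);                             % 1 x n vector of variances (1/m normalization)
+px = prod((1 ./ sqrt(2 * pi * sigma2)) .* ...
+          exp(-((x' - mu) .^ 2) ./ (2 * sigma2)));   % product of univariate Gaussian densities
+isAnomaly = (px < epsilon);                     % flag the example if p(x) falls below epsilon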
+
+
+ Developing and Evaluating an Anomaly Detection System
+
+
+ To evaluate our learning algorithm, we take some labeled data, categorized into anomalous and non-anomalous examples ( y = 0 if normal, y = 1 if anomalous).
+
+
+ Among that data, take a large proportion of
+
+ good
+
+ , non-anomalous data for the training set on which to train p(x).
+
+
+ Then, take a smaller proportion of mixed anomalous and non-anomalous examples (you will usually have many more non-anomalous examples) for your cross-validation and test sets.
+
+
+ For example, we may have a set where 0.2% of the data is anomalous. We take 60% of those examples, all of which are good (y=0), for the training set. We then take 20% of the examples for the cross-validation set (with 0.1% of the anomalous examples) and another 20% for the test set (with the other 0.1% of the anomalous examples).
+
+
+ In other words, we split the data 60/20/20 training/CV/test and then split the anomalous examples 50/50 between the CV and test sets.
+
+
+
+ Algorithm evaluation:
+
+
+
+ Fit model p(x) on training set $$\lbrace x^{(1)},\dots,x^{(m)} \rbrace$$
+
+
+ On a cross validation/test example x, predict:
+
+
+ If p(x) < ϵ (
+
+ anomaly
+
+ ), then y=1
+
+
+ If p(x) ≥ ϵ (
+
+ normal
+
+ ), then y=0
+
+
+ Possible evaluation metrics (see the "Machine Learning System Design" section): true positive, false positive, false negative, and true negative counts; precision/recall; and the F1 score.
+
+ Note that we use the cross-validation set to choose parameter ϵ
+
+
+ Anomaly Detection vs. Supervised Learning
+
+
+ When do we use anomaly detection and when do we use supervised learning?
+
+
+ Use anomaly detection when...
+
+
+
+
+ We have a very small number of positive examples (y=1 ... 0-20 examples is common) and a large number of negative (y=0) examples.
+
+
+
+
+ We have many different "types" of anomalies and it is hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we've seen so far.
+
+
+
+
+ Use supervised learning when...
+
+
+
+
+ We have a large number of both positive and negative examples. In other words, the training set is more evenly divided into classes.
+
+
+
+
+ We have enough positive examples for the algorithm to get a sense of what new positive examples look like. The future positive examples are likely to be similar to the ones in the training set.
+
+
+
+
+ Choosing What Features to Use
+
+
+ The features will greatly affect how well your anomaly detection algorithm works.
+
+
+ We can check that our features are
+
+ gaussian
+
+ by plotting a histogram of our data and checking for the bell-shaped curve.
+
+
+ Some
+
+ transforms
+
+ we can try on an example feature x that does not have the bell-shaped curve are:
+
+
+
+
+ log(x)
+
+
+
+
+ log(x+1)
+
+
+
+
+ log(x+c) for some constant
+
+
+
+
+ $$\sqrt{x}$$
+
+
+
+
+ $$x^{1/3}$$
+
+
+
+
+ We can play with each of these to try and achieve the gaussian shape in our data.
+
+
+ There is an
+
+ error analysis procedure
+
+ for anomaly detection that is very similar to the one in supervised learning.
+
+
+ Our goal is for p(x) to be large for normal examples and small for anomalous examples.
+
+
+ One common problem is when p(x) is similar for both types of examples. In this case, you need to examine the anomalous examples that are giving high probability in detail and try to figure out new features that will better distinguish the data.
+
+
+ In general, choose features that might take on unusually large or small values in the event of an anomaly.
+
+
+ Multivariate Gaussian Distribution (Optional)
+
+
+ The multivariate gaussian distribution is an extension of anomaly detection and may (or may not) catch more anomalies.
+
+
+ Instead of modeling $$p(x_1),p(x_2),\dots$$ separately, we will model p(x) all in one go. Our parameters will be $$\mu \in \mathbb{R}^n$$ and $$\Sigma \in \mathbb{R}^{n \times n}$$, with $$p(x;\mu,\Sigma) = \dfrac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} \exp\left(-\dfrac{1}{2}(x-\mu)^T \Sigma^{-1} (x-\mu)\right)$$
+
+ The important effect is that we can model oblong gaussian contours, allowing us to better fit data that might not fit into the normal circular contours.
+
+
+ Varying Σ changes the shape, width, and orientation of the contours. Changing μ will move the center of the distribution.
+
+ Anomaly Detection using the Multivariate Gaussian Distribution (Optional)
+
+
+ When doing anomaly detection with multivariate gaussian distribution, we compute μ and Σ normally. We then compute p(x) using the new formula in the previous section and flag an anomaly if p(x) < ϵ.
+
+
+ The original model for p(x) corresponds to a multivariate Gaussian where the contours of $$p(x;\mu,\Sigma)$$ are axis-aligned.
+
+
+ The multivariate Gaussian model can automatically capture correlations between different features of x.
+
+
+ However, the original model maintains some advantages: it is computationally cheaper (no matrix to invert, which is costly for a large number of features) and it performs well even with a small training set size (in the multivariate Gaussian model, m must be greater than the number of features for Σ to be invertible).
+
+
+ ML:Recommender Systems
+
+
+ Problem Formulation
+
+
+ Recommendation is currently a very popular application of machine learning.
+
+
+ Say we are trying to recommend movies to customers. We can use the following definitions
+
+
+
+
+ $$n_u =$$ number of users
+
+
+
+
+ $$n_m =$$ number of movies
+
+
+
+
+ $$r(i,j) = 1$$ if user j has rated movie i
+
+
+
+
+ $$y(i,j) =$$ rating given by user j to movie i (defined only if r(i,j)=1)
+
+
+
+
+ Content Based Recommendations
+
+
+ We can introduce two features, $$x_1$$ and $$x_2$$, which represent how much romance or how much action a movie may have (on a scale of 0−1).
+
+
+ One approach is that we could do linear regression for every single user. For each user j, learn a parameter $$\theta^{(j)} \in \mathbb{R}^3$$. Predict user j as rating movie i with $$(\theta^{(j)})^Tx^{(i)}$$ stars.
+
+
+
+
+ $$\theta^{(j)} =$$ parameter vector for user j
+
+
+
+
+ $$x^{(i)} =$$ feature vector for movie i
+
+
+
+
+ For user j, movie i, predicted rating: $$(\theta^{(j)})^T(x^{(i)})$$
+
+
+
+
+ $$m^{(j)} =$$ number of movies rated by user j
+
+
+
+
+ To learn $$\theta^{(j)}$$, we minimize the following cost: $$min_{\theta^{(j)}} \dfrac{1}{2}\displaystyle \sum_{i:r(i,j)=1} \left((\theta^{(j)})^T(x^{(i)}) - y^{(i,j)}\right)^2 + \dfrac{\lambda}{2} \sum_{k=1}^n \left(\theta_k^{(j)}\right)^2$$
+
+ We can apply our linear regression gradient descent update using the above cost function.
+
+
+ The only real difference is that we
+
+ eliminate the constant
+
+ $$\dfrac{1}{m}$$.
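+
+
+ A minimal Octave sketch of one such gradient descent step for a single user's parameter vector (an illustration; Xr is the feature matrix of the movies this user has rated, with a leading column of ones, yr holds the corresponding ratings, and theta, alpha, and lambda are assumed to be given):
+
grad = Xr' * (Xr * theta - yr);                      % unregularized gradient (no 1/m constant)
+grad(2:end) = grad(2:end) + lambda * theta(2:end);   % regularize every element except theta_0
+theta = theta - alpha * grad;                        % gradient descent update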
+
+
+ Collaborative Filtering
+
+
+ It can be very difficult to find features such as "amount of romance" or "amount of action" in a movie. To figure this out, we can use
+
+ feature finders
+
+ .
+
+
+ We can let the users tell us how much they like the different genres, providing their parameter vector immediately for us.
+
+
+ To infer the features from given parameters, we use the squared error function with regularization over all the users:
+
+ It looks very complicated, but we've only combined the cost function for theta and the cost function for x.
+
+
+ Because the algorithm can learn them itself, the bias units where x0=1 have been removed, therefore x∈ℝn and θ∈ℝn.
+
+
+ These are the steps in the algorithm:
+
+
+
+
+ Initialize $$x^{(i)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)}$$ to small random values. This serves to break symmetry and ensures that the algorithm learns features $$x^{(i)},...,x^{(n_m)}$$ that are different from each other.
+
+ For a user with parameters θ and a movie with (learned) features x, predict a star rating of $$\theta^Tx$$.
+
+
+
+
+ Vectorization: Low Rank Matrix Factorization
+
+
+ Given matrices X (each row containing features of a particular movie) and Θ (each row containing the weights for those features for a given user), then the full matrix Y of all predicted ratings of all movies by all users is given simply by: $$Y = X\Theta^T$$.
+
+
+ Predicting how similar two movies i and j are can be done using the distance between their respective feature vectors x. Specifically, we are looking for a small value of $$||x^{(i)} - x^{(j)}||$$.
+
+
+ Implementation Detail: Mean Normalization
+
+
+ If the ranking system for movies is used from the previous lectures, then new users (who have watched no movies) will be assigned ratings incorrectly. Specifically, they will be assigned θ with all components equal to zero due to the minimization of the regularization term. That is, we assume that the new user will rate all movies 0, which does not seem intuitively correct.
+
+
+ We rectify this problem by normalizing the data relative to the mean. First, we use a matrix Y to store the data from previous ratings, where the ith row of Y is the ratings for the ith movie and the jth column corresponds to the ratings for the jth user.
+
+ We then compute a vector μ whose ith entry, $$\mu_i = \dfrac{\sum_{j:r(i,j)=1} Y_{i,j}}{\sum_{j} r(i,j)}$$, is effectively the mean of the previous ratings for the ith movie (where only ratings that users have actually given are counted). We can now normalize the data by subtracting μ, the mean rating, from the actual ratings for each user (each column of matrix Y).
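+
+
+ A small Octave sketch of this normalization (illustrative only): Y is the num_movies x num_users ratings matrix and R is the matrix of r(i,j) indicators; unrated entries are ignored when computing each movie's mean.
+
mu = sum(Y .* R, 2) ./ max(sum(R, 2), 1);   % per-movie mean over rated entries only
+Ynorm = (Y - mu) .* R;                      % subtract the mean from rated entries; unrated stay 0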
+
+
+ As an example, consider the following matrix Y and mean ratings μ:
+
+ ML:Large Scale Machine Learning
+
+
+ Learning with Large Datasets
+
+
+ We mainly benefit from a very large dataset when our algorithm has high variance when m is small (i.e., the model overfits the smaller training set). Recall that if our algorithm has high bias, more data will not have any benefit.
+
+
+ Datasets can often approach such sizes as m = 100,000,000. In this case, our gradient descent step will have to make a summation over all one hundred million examples. We will want to try to avoid this -- the approaches for doing so are described below.
+
+
+ Stochastic Gradient Descent
+
+
+ Stochastic gradient descent is an alternative to classic (or batch) gradient descent and is more efficient and scalable to large data sets.
+
+
+ Stochastic gradient descent is written out in a different but similar way: first, randomly 'shuffle' the dataset; then, for $$i = 1,\dots,m$$, update $$\Theta_j := \Theta_j - \alpha (h_{\Theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}_j$$ for every j.
+
+ This algorithm will only try to fit one training example at a time. This way we can make progress in gradient descent without having to scan all m training examples first. Stochastic gradient descent will be unlikely to converge at the global minimum and will instead wander around it randomly, but usually yields a result that is close enough. Stochastic gradient descent will usually take 1-10 passes through your data set to get near the global minimum.
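+
+
+ A minimal Octave sketch of one stochastic gradient descent pass for linear regression (assumes X is m x (n+1) with a leading column of ones, y is m x 1, and theta and alpha are already defined):
+
m = size(X, 1);
+perm = randperm(m);                                % randomly shuffle the dataset
+for i = perm
+  h = X(i, :) * theta;                             % prediction for this single example
+  theta = theta - alpha * (h - y(i)) * X(i, :)';   % update using only example i
+end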
+
+
+ Mini-Batch Gradient Descent
+
+
+ Mini-batch gradient descent can sometimes be even faster than stochastic gradient descent. Instead of using all m examples as in batch gradient descent, and instead of using only 1 example as in stochastic gradient descent, we will use some in-between number of examples b.
+
+ For example, with b = 10, the update is $$\Theta_j := \Theta_j - \alpha \dfrac{1}{10} \displaystyle \sum_{k=i}^{i+9} (h_{\Theta}(x^{(k)}) - y^{(k)}) x_j^{(k)}$$. We're simply summing over ten examples at a time. The advantage of computing more than one example at a time is that we can use vectorized implementations over the b examples.
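+
+
+ A sketch of the same idea with mini-batches of size b = 10 in Octave (again assuming X, y, theta, and alpha are already defined):
+
b = 10;                                        % mini-batch size
+m = size(X, 1);
+for i = 1:b:(m - b + 1)
+  Xb = X(i:i+b-1, :);                          % the next b examples
+  yb = y(i:i+b-1);
+  grad = (1 / b) * Xb' * (Xb * theta - yb);    % vectorized gradient over the mini-batch
+  theta = theta - alpha * grad;
+end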
+
+
+ Stochastic Gradient Descent Convergence
+
+
+ How do we choose the learning rate α for stochastic gradient descent? Also, how do we debug stochastic gradient descent to make sure it is getting as close as possible to the global optimum?
+
+
+ One strategy is to plot the average cost of the hypothesis applied to every 1000 or so training examples. We can compute and save these costs during the gradient descent iterations.
+
+
+ With a smaller learning rate, it is
+
+ possible
+
+ that you may get a slightly better solution with stochastic gradient descent. That is because stochastic gradient descent will oscillate and jump around the global minimum, and it will make smaller random jumps with a smaller learning rate.
+
+
+ If you increase the number of examples you average over to plot the performance of your algorithm, the plot's line will become smoother.
+
+
+ With a very small number of examples for the average, the line will be too noisy and it will be difficult to find the trend.
+
+
+ One strategy for trying to actually converge at the global minimum is to
+
+ slowly decrease α over time
+
+ . For example $$\alpha = \dfrac{const1}{iterationNumber + const2}$$
+
+
+ However, this is not often done because people don't want to have to fiddle with even more parameters.
+
+
+ Online Learning
+
+
+ With a continuous stream of users to a website, we can run an endless loop that gets (x,y), where we collect some user actions for the features in x to predict some behavior y.
+
+
+ You can update θ for each individual (x,y) pair as you collect them. This way, you can adapt to new pools of users, since you are continuously updating theta.
+
+
+ Map Reduce and Data Parallelism
+
+
+ We can divide up batch gradient descent and dispatch the cost function for a subset of the data to many different machines so that we can train our algorithm in parallel.
+
+
+ You can split your training set into z subsets corresponding to the number of machines you have. On each of those machines calculate $$\displaystyle \sum_{i=p}^{q}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$$, where we've split the data starting at p and ending at q.
+
+
+ MapReduce will take all these dispatched (or 'mapped') jobs and 'reduce' them by calculating: $$\Theta_j := \Theta_j - \alpha\dfrac{1}{z}\left(temp_j^{(1)} + temp_j^{(2)} + \cdots + temp_j^{(z)}\right)$$
+
+ This is simply taking the computed cost from all the machines, calculating their average, multiplying by the learning rate, and updating theta.
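+
+
+ Conceptually, in Octave (a sketch that simulates z = 4 machines as chunks of one matrix; in practice each chunk's partial sum would be computed on a separate machine):
+
z = 4;                                      % number of (simulated) machines
+m = size(X, 1);
+edges = round(linspace(0, m, z + 1));       % split the m examples into z chunks
+temp = zeros(length(theta), z);
+for k = 1:z
+  idx = (edges(k) + 1):edges(k + 1);        % this machine's slice of the data
+  temp(:, k) = X(idx, :)' * (X(idx, :) * theta - y(idx));   % 'map': partial gradient sum
+end
+theta = theta - alpha * (1 / z) * sum(temp, 2);             % 'reduce': combine the partial sums and update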
+
+
+ Your learning algorithm is MapReduceable if it can be
+
+ expressed as computing sums of functions over the training set
+
+ . Linear regression and logistic regression are easily parallelizable.
+
+
+ For neural networks, you can compute forward propagation and back propagation on subsets of your data on many machines. Those machines can report their derivatives back to a 'master' server that will combine them.
+
+ Supervised Learning: 1:25: Describing the curve as quadratic is confusing since the independent variable is price, but the plot's X-axis represents area.
+
+
+
+
+
+
+ Unsupervised Learning: 6:56 - the mouse does not point to the correct audio sample being played on the slide. Each subsequent audio sample has the mouse pointing to the previous sample.
+
+
+
+
+ Unsupervised Learning: 12:50 - the slide shows the first option, "Given email labelled as spam/not spam, learn a spam filter", as one of the answers as well, whereas in the audio the Professor puts it in the Supervised Learning category.
+
+
+
+
+ Linear Regression With One Variable
+
+
+
+
+ A general note about the graphs that Prof Ng sketches when discussing the cost function. The vertical axis can be labeled either 'y' or 'h(x)' interchangeably. 'y' is the true value of the training example, and is indicated with a marker. 'h(x)' is the hypothesis, and is typically drawn as a curve. The scale of the vertical axis is the same, so both can be plotted on the same axis.
+
+
+
+
+ In the video "Cost Function - Intuition I", at about 6:34, the value given for J(0.5) is incorrect.
+
+
+
+
+ Parameter Learning: Video "Gradient Descent for Linear Regression": At 6:15, the equation Prof Ng writes in blue "h(x) = -900 - 0.1x" is incorrect, it should use "+900".
+
+
+
+
+ Gradient Descent for Linear Regression
+
+
+
+
+ At timestamp 3:27 of this video lecture, the equation for θ1 is wrong; please refer to the first line of page 6 of ex1.pdf (the Week 2 programming assignment) for the model equation (the last x is x superscript i, subscript j, which is 1 in this case, as it is for θ1). θ0 is correct, as it will be multiplied by 1 anyway (the value of x superscript i, subscript 0 is 1), as per the model equation.
+
+
+
+
+ Linear Algebra Review
+
+
+
+
+ Matrix-Matrix Multiplication: 7:14 to 7:33 - While exploring a matrix multiplication, Andrew solved the problem correctly below, but when he tried to rewrite the answer in the original problem, one of the numbers was written incorrectly. The correct result was (matrix 9 15) and (matrix 7 12), but when it was rewritten above it was written as (matrix 9 15) and (matrix 4 12). The 4 should have been a 7. (Thanks to John Kemp and others). This has been partially corrected in the video - third subresult matrix shows 7 but the sound is still 4 for both subresult and result matrices. Subtitle at 6:48 should be “two is seven and two”, and subtitle at 7:14 should be “seven twelve and you”.
+
+
+
+
+ 3.4: Matrix-Matrix Multiplication: 8:12 - Andrew says that the matrix on the bottom left shows the housing prices, but those are the house sizes as written above
+
+
+
+
+ 3.6: Transpose and Inverse: 9:23 - While demonstrating a transpose, an example was used to identify B(subscript 12) and A(subscript 21). The correct number 3 was circled in both cases above, but when it was written below, it was written as a 2. The 2 should have been a 3. (Thanks to John Kemp and others)
+
+
+
+
+ Addition and scalar multiplication video
+
+
+
+
+ Spanish subtitles for this video are wrong. Seems that those subtitles are from another video.
+
+ Multiple Features, at 7:25 to 7:30. It is recorded that $$\theta^T$$ is an (n+1)×1 matrix; it should be 1×(n+1) matrix since it has 1 row with n+1 columns.
+
+
+
+
+ Gradient Descent in Practice I - Feature Scaling, at 6:20 to 6:24. It is recorded that the average price of the house is 1,000 but he writes 100 on the slide. The slide should show "Average size = 1000" instead of "Average size = 100".
+
+
+
+
+ Gradient Descent in Practice II - Learning Rate, at 5:00. The plot on the right-hand side is not of J(θ) against the number of iterations, but rather against the parameters; so the x axis should say "θ", not "No. of iterations".
+
+
+
+
+ Error in Normal Equation at 2:19. The range shown for θ is 0 to m, but it should be 0 to n.
+
+
+
+
+ Normal Equation, from 8:00 to 8:44: The design matrix X (in the bottom right side of the slide) given in the example should have elements x with subscript 1 and superscripts varying from 1 to m, because for all m training examples there are only 2 features, x0 and x1.
+
+
+
+
+ Normal Equation, at 12:56: $$(X^TX)^{-1}$$ is described as an n×n matrix, but it should be (n+1)×(n+1).
+
+
+
+
+ Error in "Normal Equation Noninvertibility" at 3:20. Prof Ng states that X is non-invertible if m <= n. The correct statement should be that X is non-invertible if m < n, and may be non-invertible if m = n.
+
+
+
+
+
+
+
+
+ Ex1.pdf, page 9; code segment: Prior to defining J_vals, we must initialize theta0_vals and theta1_vals. Add "theta0_vals = -10:0.01:10; theta1_vals = -1:0.01:4;" to the beginning of the code segment.
+
+
+
+
+ Ex1.pdf, page 9; code segment: computeCost(x, y, t) - should be capital X (as it is correctly in ex1.m)
+
+
+
+
+ Ex1.pdf, page 14; normal equations in closed-form solution to linear regression: Inconsistency in notation: whereas the column vector $$\vec{y}$$ of dimension m×1 is denoted by an overline vector, the (n+1)×1 vector θ lacks the overline vector. The same applies to the definition of vector θ in the lecture notes. I would suggest always denoting column and row vectors by overline vectors; it increases readability enormously when distinguishing vectors from matrices. Matlab itself of course does not distinguish between vectors and matrices; vectors are just special matrices with only one row resp. column.
+
+
+
+
+ Ex1.pdf, page 14: In section 3.3, there is the following sentence: "The code in ex1.m will add the column of 1’s to X for you." But in this section, it is talking about ex1_multi.m. Of course, it is also true that ex1.m adds the bias units for you, but that is not the script being referred to here.
+
+
+
+
+ Octave Tutorial: When Prof Ng enters this line: "w = -6 + sqrt(10) * randn(1,10000)", add a semicolon at the end to prevent its display. If Octave crashes when you type "hist(w)", try the command "graphics_toolkit('gnuplot')".
+
+
+
+
+ Vectorization video at 2:19. The comma at the end of the line "for j=1:n+1," is a typo. The comma is not needed.
+
+
+
+
+ Errors in the Programming Exercise Instructions
+
+
+
+
+ ex1.pdf, page 2, 1st paragraph: Students are required to modify ex1_multi.m to complete the optional portion of this exercise, so ignore the sentence "You do not need to modify either of them."
+
+
+
+
+ Errors in the programming exercise scripts
+
+
+
+
+ In plotData.m, the statement "figure" at line 17 should be above the "===== YOUR CODE HERE ===" section. The code you add must be below the "figure" statement.
+
+ At 1:56 in the transcript, it should read 'sigmoid function' instead of 'sec y function'.
+
+
+ Cost Function
+
+
+ The section between 8:30 and 9:20 is then repeated from 9:20 to the quiz. The case for y=0 is explained twice.
+
+
+ Simplified Cost Function and Gradient Descent
+
+
+ These following mistakes also exist in the video:
+
+
+
+
+ 6.5: On page 19 in the PDF, the leftmost square bracket seems to be slightly misplaced.
+
+
+
+
+ 6.5: It seems that the factor 1/m is accidentally omitted between pages 20 and 21 when the handwritten expression is converted to a typeset one (starting at 6:53 of the video)
+
+
+
+
+ Advanced Optimization
+
+
+ In the video at 7:30, the notation for specifying MaxIter is incorrect. The value provided should be an integer, not a character string. So (...'MaxIter', '100') is incorrect. It should be (...'MaxIter', 100). This error only exists in the video - the exercise script files are correct.
+
+
+ VII. Regularization
+
+
+ The Problem of Overfitting
+
+
+ At 2:07, a curve is drawn using the predicting function $$\theta_0+\theta_1 x+\theta_2 x^2$$, which is said to be "just right". But when the size of the house is large enough, the prediction of this function will increase much faster than linear if $$\theta_2 > 0$$, or will decrease to −∞ if $$\theta_2 < 0$$, neither of which corresponds to reality. Instead, $$\theta_0+\theta_1 x+\theta_2 \sqrt{x}$$ may be "just right".
+
+
+ At 2:28, a curve is drawn using a quartic (degree 4) polynomial predicting function $$\theta_0+\theta_1 x+\theta_2 x^2 +\theta_3 x^3 +\theta_4 x^4$$; however, the curve drawn is at least quintic (degree 5).
+
+
+ Cost Function
+
+
+ In the video at 5:17, the sum of the regularization term should use 'j' instead of 'i', giving $$\sum_{j=1}^{n} \theta _j ^2$$ instead of $$\sum_{i=1}^{n} \theta _j ^2$$.
+
+
+ Regularized linear regression
+
+
+ In the video starting at 8:04, Prof Ng discusses the Normal Equation and invertibility. He states that X is non-invertible if m <= n. The correct statement should be that X is non-invertible if m < n, and may be non-invertible if m = n.
+
+
+ Regularized logistic regression
+
+
+ In the video at 3:52, the lecturer mistakenly said "gradient descent for regularized linear regression". Indeed, it should be "gradient descent for regularized logistic regression".
+
+
+ In the video at 5:21, the cost function is missing a pair of parentheses around the second log argument. It should be $$J(\theta) = [-\frac{1}{m}\sum_{i=1}^{m}y^{(i)}\log(h_\theta (x^{(i)})) + (1-y^{(i)})\log(1-h_\theta (x^{(i)}))] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta ^2 _j$$
+
+
+ In the original videos for the course (ML-001 through ML-008), there were typos in the equation for regularized logistic regression in both the video lecture and the PDF lecture notes. In the slides for "Gradient descent" and "advanced optimization", there should be positive signs for the regularization term of the gradient. The formula on page 10 of 'ex2.pdf' is correct. These issues in the video were corrected for the 'on-demand' format of the course.
+
+
+ Quizzes
+
+
+
+
+ Typo "it's" in question «Because logistic regression outputs values $$0≤h_θ(x)≤1$$, it's range [...]»
+
+
+
+
+ 1/m factor missing in the definition of the gradient in question «For logistic regression, the gradient is given by $$\frac{\partial}{\partial \theta_j} J(\theta) = \sum_{i=1}^m \dots$$». It should be $$\frac{\partial}{\partial \theta_j} J(\theta) = \frac{1}{m} \sum_{i=1}^m \dots$$.
+
+
+
+
+ Programming Exercise Errata
+
+
+
+
+ In ex2.pdf on page 5, Section 1.2.3, "gradent descent" should be "gradient descent".
+
+
+
+
+ If you are using a linux-derived operating system, you may need to remove the attribute "MarkerFaceColor" from the plot() function call in plotData.m.
+
+
+
+
+ In ex2.m at lines 10 through 13, the list of files the student needs to complete should include plotData.m
+
+ In the videos "Model Representation I" and "Model Representation II:, the diagram of the NN does not show the added bias units in the input and hidden layers. The bias units are represented in the equations as the variable x0.
+
+
+
+
+ In the video "Model representation I", in the in-video quiz, the figure is incorrect in that it does not show the added bias units. The bias units must be included when you calculate the size of a Theta matrix.
+
+
+
+
+ In the video "Model Representation II" at 2:42, Prof Ng mistakenly says that z(2) is a 3-dimensional vector. What he means is that the vector z(2) has three features - i.e it is size (3 x 1).
+
+
+
+
+ Errata in the programming exercise
+
+
+
+
+ In ex3.pdf at Section 1.3.2 "Vectorizing the gradient", there is a typo in the series of entries demonstrating how to compute the partial derivatives for all $$θ_j$$ where $$h_\theta(x) - y$$ is defined. The last row in the array has $$h_\theta(x^{(1)}) - y^{(m)}$$ but it should be $$h_\theta(x^{(m)}) - y^{(m)}$$
+
+
+
+
+ Clarification: The instructions in ex3.pdf ask you to first write the unregularized portions of cost function (in Section 1.3.1 for cost and 1.3.2 for the gradients), then to add the regularized portions of the cost function (in Section 1.3.3).
+
+
+
+
+ Note: The test case for lrCostFunction() in ex3.m includes regularization, so you should first complete through Section 1.3.3.
+
+
+
+
+ Errata in the quiz
+
+
+
+
+ In question 4 of the Neural Networks: Representation quiz, one potential answer may include the variable Theta2, even though this variable is undefined (the question only defines Theta1). When answering the question, treat Theta2 as Theta with a superscript "(2)", or $$\Theta^{(2)}$$, from lecture.
+
+ Note: In this video, the NN diagrams omit the bias units in the input and hidden layers.
+
+
+
+
+ At time 0:14-1.0, the indices for Θ should be $$\Theta_{ij}^{(l)}$$ for both in the cost function and in the partial derivative.
+
+
+
+
+ At 1:30, the first step of forward propagation omits adding the bias unit. The bias units are shown for a(2) and a(3), but not for a(1).
+
+
+
+
+ Errata in video "Backpropagation Intuition"
+
+
+
+
+ At time 4:39, the last term for the calculation for $$z^3_1$$ (three-color handwritten formula) should be $$a^2_2$$ instead of $$a^2_1$$.
+
+
+
+
+ At time 6:08 and after, the equation for cost(i) is incorrect. The first term is missing parentheses for the log() function, and the second term should be $$(1-y^{(i)})\log(1-h{_\theta}{(x^{(i)}}))$$
+
+
+
+
+ At time 7:10, the statement is given $$\delta_j^{(l)} = \frac{\partial}{\partial z_j^{(l)}} \text{cost}(t)$$ This statement is not strictly correct, and is provided as an intuition for how the backpropagation process works. This video does not attempt to provide mathematical proofs.
+
+
+
+
+ At time 8:50, Prof Ng writes on the slide that $$\delta^{(4)} = y - a^{(4)}$$. This is incorrect, it should be $$\delta^{(4)} = a^{(4)} - y$$
+
+
+
+
+ At time 9:40, the descriptions of the $$\delta_3$$ and $$\delta_2$$ values are not correct. Again, this video provides intuitions, and is not intended to be used for either proofs or implementation of your NN. See the video "Backpropagation Algorithm" for the correct implementation.
+
+
+
+
+ At time 11:09, Professor Ng correctly writes Θ but mistakenly says delta.
+
+
+
+
+ Errata in video "Implementation Note: Unrolling Parameters"
+
+
+
+
+ Starting at 2:03, the image in the upper right corner of the slide is incorrect - it is missing one of the hidden layers. The text of this lesson discusses a NN with two hidden layers.
+
+
+
+
+ Errata in video "Gradient Checking"
+
+
+ Errata in video "Random Initialization"
+
+
+
+
+ At 1:00 Prof Ng provides the example of $$\Theta_{ij} = 0$$ but his mathematical reasoning assumes $$\Theta_{ij} = n \neq 0$$. Otherwise he would use that $$a^{(2)}_{i} = 0.5$$, since the logistic function outputs 0.5 at input 0.
+
+
+
+
+ Errata in video "Putting It Together"
+
+
+
+
+ In minute 11, while Prof Ng is explaining gradient descent, the vertical axis on the graph of the cost function has a range of (-3,+3), but the cost function is positive by definition.
+
+
+
+
+ Errata in the lecture slides (Lecture9.pdf)
+
+
+
+
+ On page 5: The final term in the expression for J(θ) has a subscript i missing i.e. $$\theta_{j}^{(l)}$$ becomes $$\theta_{ij}^{(l)}$$, and i,j index allows every element in array l to contribute to the matrix norm. This matches the final equation on page 3.
+
+
+
+
+ On page 6: The first line of forward propagation omits adding the bias units.
+
+
+
+
+ On page 8: The equation for D when j ≠ 0 should include $$\dfrac{\lambda}{m}\Theta$$.
+
+
+
+
+ On page 8: Name collision! The loop/training example index "i" is overloaded with the node index for the next layer.
+
+
+
+
+ Errata in ex4.pdf
+
+
+
+
+ On page 3: Below Figure 1 text says "...5000 training examples in ex3data1.mat". The text should say "...in ex4data1.mat".
+
+
+
+
+ On page 9: In Step 5, the text says "...by dividing the accumulated gradients by $$\frac{1}{m}$$:". The text should say "... by multiplying...".
+
+
+
+
+ Errata in the programming exercise scripts
+
+
+
+
+ In ex4.m at line 114 and 115, the vector of test values for sigmoidGradient() should start with '-1', not '1'.
+
+
+
+
+ In ex4.m at line 168, the fprintf() statement is hard-coded to output "lambda = 10", even though the variable lambda is set to 3.
+
+
+
+
+ checkNNGradients.m: Line 41 should read "'(Right-Your Numerical Gradient, Left-Analytical Gradient)\n\n']);"
+
+
+
+
+ randInitializeWeights.m: line 19, "Note: The first row of W corresponds to the parameters for the bias units" - it should be "column", not "row"; also, it should be "bias unit" (singular).
+
+ Quiz questions in Week 6 should refer to linear regression, not logistic regression (typo only).
+
+
+
+ Errata in the Video Lectures
+
+
+
+ In the "Regularization and Bias/Variance" video
+
+
+ The slide "Linear Regression with Regularization" has an error in the formula for J(θ): the regularization term should go from j=1 up to n (and not m), that is $$\frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$$. The quiz in the video "Regularization and Bias/Variance" has regularization terms for $$J_{train}$$ and $$J_{CV}$$, while the rest of the video stresses that these should not be there. Also, the quiz says "Consider regularized logistic regression," but exhibits cost functions for regularized linear regression.
+
+
+ At around 5:58, Prof. Ng says, "picking theta-5, the fifth order polynomial". Instead, he should have said the fifth value of λ (0.08), because in this example, the polynomial degree is fixed at d = 4 and we are varying λ.
+
+
+ In the "Advice for applying ML" set of videos
+
+
+ Often (if not always)
+
+ the sums corresponding to the regularization terms in J(θ)
+
+ are (erroneously) written with j running from 1 to m. In fact,
+
+ j should run from 1 to n
+
+ , that is, the regularization term should be $$\lambda \sum_{j=1}^n \theta_j^2$$. The variable m is the number of (x,y) pairs in the set used to calculate the cost, while n is the largest index of j in the θj parameters or in the elements $$x_j$$ of the vector of features.
+
+
+ In the "Advice for Applying Machine Learning" section, the figure that illustrates the relationship between lambda and the hypothesis. used to detect high variance or high bias, is incorrect. Jtrain is low when lambda is small (indicating a high variance problem) and high when lambda is high (indicating a high bias problem).
+
+
+ Video (10-2: Advice for Applying Machine Learning -- hypothesis testing)
+
+
+ The slide that introduces
+
+ Training/Testing procedure for logistic regression
+
+ , (around 04:50) the cost function is incorrect. It should be:
+
+ Video Regularization and Bias/Variance (00:48)
+
+
+ Regularization term is wrong. Should be $$\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$ and not sum over m.
+
+
+ Videos 10-4 and 10-5: current subtitles are mistimed
+
+
+ Looks like the videos were updated in Sept 2014, but the subtitles were not updated accordingly. (10-3 was also updated in Aug 2014, but the subtitles were updated)
+
+
+ Errata in the ex5 programming exercise
+
+
+ In ex5.m at line 104, the reference to "slide 8 in ML-advice.pdf" should be "Figure 3 in ex5.pdf".
+
+ In 'Optimization Objective', starting at 9:25, the SVM regularization term should start from j=1, not j=0.
+
+
+
+
+ In 'Optimization Objective', starting at 13:37, the SVM regularization term should be summing over j instead of i: $$\sum\limits_{j=1}^n\theta{_j}^2$$.
+
+
+
+
+ In 'Large Margin Intuition', starting from 1:04, the graphic on the right should be labelled z≤−1. (It is drawn correctly.)
+
+
+
+
+ In 'Mathematics Behind Large Margin Classification', starting from 11:22, second condition should be $$p^{(i)} \cdot |\theta| \leq -1$$ if $$y^{(i)} = 0$$ instead of $$y^{(i)} = 1$$. This persists also in the quiz.
+
+
+
+
+ In 'Mathematics Behind Large Margin Classification', at 16:33, Dr. Ng writes towards the right of the slide that, for a vertical decision boundary, $$p^{(1)} \cdot ||\theta|| > 0$$, while it should be $$p^{(1)} \cdot ||\theta|| > 1$$.
+
+
+
+
+ In the 'Kernels I' video quiz, the notation surrounding $$x_1-l^{(1)}$$ inside the exp( ) function is the norm notation.
+
+
+
+
+ In 'Using An SVM', at 13:51, Dr. Ng writes θ = K instead of class y=K
+
+
+
+
+ Errata in programming assignment ex6
+
+
+
+
+ In ex6.pdf, typo on page 1: "SVM rraining function" should be "SVM training function"
+
+
+
+
+ In ex6.m at line 69: Inside the fprintf() statement, the text "sigma = 0.5" should be "sigma = 2"
+
+
+
+
+ In ex6.pdf, typo in section 1.2.2 on page 6: "obserse" should be ‘observe’.
+
+
+
+
+ In ex6.pdf, in section 1.2.3 on page 7: The submit grader requires that you use exactly the eight values listed for both C and sigma.
+
+
+
+
+ In dataset3Params.m, at line 2 "EX6PARAMS" should be "DATASET3PARAMS".
+
+
+
+
+ In visualizeBoundary.m at line 21, the statement "contour(X1, X2, vals, [0 0], 'Color', 'b');" does not reliably display the contour plot on all platforms. The form "contour(X1, X2, vals, [0.5 0.5], 'linecolor', 'b');" seems to be a better choice.
+
+ In the video ‘Motivation II: Visualization’, around 2:45, prof. Ng says $$ℝ^2$$, but writes ℝ. The latter is incorrect and should be $$ℝ^2$$.
+
+
+ In the video ‘Motivation II: Visualization’, the quiz at 5:00 has a typo where the reduced data set should be go up to $$z^{(n)}$$ rather than $$z^{(m)}$$.
+
+
+ In the video "Principal Component Analysis Algorithm", around 1:00 the slide should read "Replace each $$x_j^{(i)}$$ with $$x_j^{(i)}-\mu_j$$." (The second x is missing the superscript (i).)
+
+
+ In the video "Principal Component Analysis Algorithm", the formula shown at around 5:00 incorrectly shows summation from 1 to n. The correct summation (shown later in the video) is from 1 to m. In the matrix U shown at around 9:00 incorrectly shows superscript of last column-vector "u" as m, the correct superscript is n.
+
+
+ In the video "Reconstruction from Compressed Representation", the quiz refers to a formula which is defined in the next video, "Choosing the Number of Principal Components"
+
+
+ In the video "Choosing the number of principal components" at 8:45, the summation in the denominator should be from 1 to n (not 1 to m).
+
+
+ In the in-video quiz in "Data Compression" at 9:47 the correct answer contains k≤n but it should be k<n.
+
+
+ Programming Exercise Errata
+
+
+ In the ex7.pdf file, Section 2.2 says “You task is to complete the code” but it should be “Your task”
+
+
+ In the ex7.pdf file, Section 2.4.1 should say that each column (not row) vector of U represents a principal component.
+
+
+ In the ex7.pdf file, Section 2.4.2 there is a typo: “predict the identitfy of the person” (the 'f' is unneeded).
+
+
+ In the ex7_pca.m file at line 126, the fprintf string says '(this mght take a minute or two ...)'. The 'mght' should be 'might'.
+
+
+ In the ex7 projectData.m file, update the Instructions to read:
+
+
% projection_k = x' * U(:, k);
+
+
+ In the function script "pca.m", the 3rd line should read "[U, S] = pca(X)" not "[U, S, X] = pca(X)"
+
+ At the risk of being pedantic, it should be noted that p(x) is not a probability but rather the normalized probability density as parameterized by the feature vector, x; therefore, ϵ is a threshold condition on the probability density. Determination of the actual probability would require integration of this density over the appropriate extent of phase space.
+
+
+ In the
+
+ Developing and Evaluating an Anomaly Detection System
+
+ video an alternative way for some people to split the data is to use the same data for the cv and test sets, therefore the number of anomalous engines (y = 1) in each set would be
+
+ 20
+
+ rather than 10 as it states on the slide.
+
+
+ XVI. Recommender Systems
+
+
+ In the review questions, in question 5, in the option starting "Recall that the cost function for the content-based recommendation system is", the right side of the formula should be divided by m, where m is the number of movies. That would mean the formula is no longer the standard cost function for the content-based recommendation system; however, without this change the correct answer is marked as incorrect and vice-versa. This description is not very clear, but being more specific would mean breaking the honour code.
+
+
+ In the Problem Formulation video, the review question states that the number of movies is $$n_m = 1$$. The correct value is $$n_m = 2$$.
+
+
+ In "Collaborative Filtering" video, review question 2: "Which of the following is a correct gradient descent update rule for i ≠ 0?"; Instead of i ≠ 0 it should be k≠0.
+
+
+ In lesson 5 "Vectorization: Low Rank Matrix Factorization" and in lesson 6 "Implementation detail: Mean normalization" the matrix Y contains a mistake. The element $$Y^{(5,4)}$$ (Dave's opinion on "Sword vs Karate") should be a question mark but is incorrectly given as 0.
+
+
+ In lesson 6 this mistake is propagated to the calculation of μ. When μ is calculated, the 5th movie is given an average rating of 1.25 because (0+0+5+0)/4=1.25, but it should be (0+0+5)/3=1.67. This then affects the new values in the matrix Y.
+
+
+ In ex8_cofi.m at line 199, where theta is trained using fmincg() for the movie ratings, the use of "Y" in the function call should be "Ynorm". Y is normalized in line 181, creating Ynorm, but then it is never used. The video lecture "Implementation Detail: Mean Normalization" at 5:34 makes it pretty clear that the normalized Y matrix should be used for calculating theta.
+
+
+ In ex8.pdf section 2, "collaborative fitlering" should be "collaborative filtering".
+
+
+ In ex8.pdf section 2.2.1, “it will later by called” should be “it will later be called”.
+
+
+ In checkCostFunction.m it prints "If your backpropagation implementation is correct...", but in this exercise there is no backpropagation.
+
+
+ In the quiz, question 4 has an invalid phrase: "Even if you each user has rated only a small fraction of all of your products (so r(i,j)=0 for the vast majority of (i,j) pairs), you can still build a recommender system by using collaborative filtering." The word "you" seems misplaced; it should be "your" or omitted.
+
+
+ In the quiz, question 4, one of the answer options has a typo: "For collaborative filtering, it is possible to use one of the advanced optimization algoirthms".
+
+
+ In ex8.pdf at the bottom of page 8, the text says that the number of features used by ex8_cofi.m is 100. Actually the number of features is 10, not 100.
+
Octave supports various helpful statistical functions. Many are useful as
+initial steps to prepare a data set for further analysis. Others provide
+different measures from those of the basic descriptive statistics.
+
+
+
: center(x)
+
: center(x, dim)
+
Center data by subtracting its mean.
+
+
If x is a vector, subtract its mean.
+
+
If x is a matrix, do the above for each column.
+
+
If the optional argument dim is given, operate along this dimension.
+
+
Programming Note: center has obvious application for normalizing
+statistical data. It is also useful for improving the precision of general
+numerical calculations. Whenever there is a large value that is common
+to a batch of data, the mean can be subtracted off, the calculation
+performed, and then the mean added back to obtain the final answer.
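+
For example, the vector and matrix cases described above behave like this:
+
+
+
center ([1 2 3])
+  ⇒ -1   0   1
+
+center ([1 2; 3 4])
+  ⇒ -1  -1
+      1   1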
+
If x is a vector, subtract its mean and divide by its standard
+deviation. If the standard deviation is zero, divide by 1 instead.
+
+
The optional parameter opt determines the normalization to use when
+computing the standard deviation and has the same definition as the
+corresponding parameter for std.
+
+
If x is a matrix, calculate along the first non-singleton dimension.
+If the third optional argument dim is given, operate along this
+dimension.
+
+
The optional outputs mu and sigma contain the mean and standard
+deviation.
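+
In other words, for a vector the operation described above amounts to this
+small example (std uses its default normalization here):
+
+
+
x = [2 4 6 8];
+(x - mean (x)) / std (x)
+  ⇒ -1.1619  -0.3873   0.3873   1.1619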
+
When x is a vector, the function counts the number of elements of
+x that fall in the histogram bins defined by edges. This
+must be a vector of monotonically increasing values that define the edges
+of the histogram bins. n(k) contains the number of elements
+in x for which edges(k) <= x < edges(k+1).
+The final element of n contains the number of elements of x
+exactly equal to the last element of edges.
+
+
When x is an N-dimensional array, the computation is carried
+out along dimension dim. If not specified dim defaults to the
+first non-singleton dimension.
+
+
When a second output argument is requested an index matrix is also returned.
+The idx matrix has the same size as x. Each element of
+idx contains the index of the histogram bin in which the
+corresponding element of x was counted.
+
The unique function is often useful for statistics.
+
+
+
: c = nchoosek(n, k)
+
: c = nchoosek(set, k)
+
+
Compute the binomial coefficient of n or list all possible
+combinations of a set of items.
+
+
If n is a scalar then calculate the binomial coefficient
+of n and k which is defined as
+
+
+
$$\binom{n}{k} \;=\; \frac{n\,(n-1)\,(n-2)\cdots(n-k+1)}{k!} \;=\; \frac{n!}{k!\,(n-k)!}$$
+
+
+
This is the number of combinations of n items taken in groups of
+size k.
+
+
If the first argument is a vector, set, then generate all
+combinations of the elements of set, taken k at a time, with
+one row per combination. The result c has k columns and
+nchoosek (length (set), k) rows.
+
+
For example:
+
+
How many ways can three items be grouped into pairs?
+
+
+
nchoosek (3, 2)
+ ⇒ 3
+
+
+
What are the possible pairs?
+
+
+
nchoosek (1:3, 2)
+ ⇒ 1 2
+ 1 3
+ 2 3
+
+
+
Programming Note: When calculating the binomial coefficient nchoosek
+works only for non-negative, integer arguments. Use bincoeff for
+non-integer and negative scalar arguments, or for computing many binomial
+coefficients at once with vector inputs for n or k.
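+
For instance, a whole row of binomial coefficients can be computed at once with
+a vector argument:
+
+
+
bincoeff (5, 0:5)
+  ⇒ 1    5   10   10    5    1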
+
Broadcasting refers to how Octave binary operators and functions behave
+when their matrix or array operands or arguments differ in size. Since
+version 3.6.0, Octave now automatically broadcasts vectors, matrices,
+and arrays when using elementwise binary operators and functions.
+Broadly speaking, smaller arrays are “broadcast” across the larger
+one, until they have a compatible shape. The rule is that corresponding
+array dimensions must either
+
+
+
be equal, or
+
+
one of them must be 1.
+
+
+
In case all dimensions are equal, no broadcasting occurs and ordinary
+element-by-element arithmetic takes place. For arrays of higher
+dimensions, if the number of dimensions isn’t the same, then missing
+trailing dimensions are treated as 1. When one of the dimensions is 1,
+the array with that singleton dimension gets copied along that dimension
+until it matches the dimension of the other array. For example, consider
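+a pair like the following (the particular values are illustrative; any 3x3
+matrix and 1x3 row vector behave the same way):
+
+
+
x = [1 2 3;
+     4 5 6;
+     7 8 9];
+y = [10 20 30];
+x + y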
+
Without broadcasting, x + y would be an error because the dimensions
+do not agree. However, with broadcasting it is as if the following
+operation were performed:
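+(continuing with the illustrative x and y above)
+
+
+
x + [10 20 30;
+     10 20 30;
+     10 20 30]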
+
That is, the smaller array of size [1 3] gets copied along the
+singleton dimension (the number of rows) until it is [3 3]. No
+actual copying takes place, however. The internal implementation reuses
+elements along the necessary dimension in order to achieve the desired
+effect without copying in memory.
+
+
Both arrays can be broadcast across each other, for example, all
+pairwise differences of the elements of a vector with itself:
+
+
+
y - y'
+⇒ 0 10 20
+ -10 0 10
+ -20 -10 0
+
+
+
Here the vectors of size [1 3] and [3 1] both get
+broadcast into matrices of size [3 3] before ordinary matrix
+subtraction takes place.
+
+
A special case of broadcasting that may be familiar is when all
+dimensions of the array being broadcast are 1, i.e., the array is a
+scalar. Thus for example, operations like x - 42 and max
+(x, 2) are basic examples of broadcasting.
+
+
For a higher-dimensional example, suppose img is an RGB image of
+size [m n 3] and we wish to multiply each color by a different
+scalar. The following code accomplishes this with broadcasting,
+
+
+
img .*= permute ([0.8, 0.9, 1.2], [1, 3, 2]);
+
+
+
Note the usage of permute to match the dimensions of the
+[0.8, 0.9, 1.2] vector with img.
+
+
For functions that are not written with broadcasting semantics,
+bsxfun can be useful for coercing them to broadcast.
+
+
+
: bsxfun(f, A, B)
+
The binary singleton expansion function performs broadcasting,
+that is, it applies a binary function f element-by-element to two
+array arguments A and B, and expands as necessary
+singleton dimensions in either input argument.
+
+
f is a function handle, inline function, or string containing the name
+of the function to evaluate. The function f must be capable of
+accepting two column-vector arguments of equal length, or one column vector
+argument and a scalar.
+
+
The dimensions of A and B must be equal or singleton. The
+singleton dimensions of the arrays will be expanded to the same
+dimensionality as the other array.
+
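For instance, to subtract each column's mean from every row of a matrix (a small
+illustrative example):
+
+
+
A = [1 2; 3 4; 5 6];
+bsxfun (@minus, A, mean (A))
+  ⇒ -2  -2
+     0   0
+     2   2
+
+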
Broadcasting is only applied if either of the two broadcasting
+conditions hold. As usual, however, broadcasting does not apply when two
+dimensions differ and neither is 1:
+
+
+
x = [1 2 3
+ 4 5 6];
+y = [10 20
+ 30 40];
+x + y
+
+
+
This will produce an error about nonconformant arguments.
+
+
Besides common arithmetic operations, several functions of two arguments
+also broadcast. The full list of functions and operators that broadcast
+is
+
+
+
plus + .+
+ minus - .-
+ times .*
+ rdivide ./
+ ldivide .\
+ power .^ .**
+ lt <
+ le <=
+ eq ==
+ gt >
+ ge >=
+ ne != ~=
+ and &
+ or |
+ atan2
+ hypot
+ max
+ min
+ mod
+ rem
+ xor
+
+ += -= .+= .-= .*= ./= .\= .^= .**= &= |=
+
+
+
Beware of resorting to broadcasting if a simpler operation will suffice.
+For matrices a and b, consider the following:
+
+
+
c = sum (permute (a, [1, 3, 2]) .* permute (b, [3, 2, 1]), 3);
+
+
+
This operation broadcasts the two matrices with permuted dimensions
+across each other during elementwise multiplication in order to obtain a
+larger 3-D array, and this array is then summed along the third dimension.
+A moment of thought will prove that this operation is simply the much
+faster ordinary matrix multiplication, c = a*b;.
+
+
A note on terminology: “broadcasting” is the term popularized by the
+Numpy numerical environment in the Python programming language. In other
+programming languages and environments, broadcasting may also be known
+as binary singleton expansion (BSX, in MATLAB, and the
+origin of the name of the bsxfun function), recycling (R
+programming language), single-instruction multiple data (SIMD),
+or replication.
+
+
+
19.2.1 Broadcasting and Legacy Code
+
+
The new broadcasting semantics almost never affect code that worked
+in previous versions of Octave. Consequently, all code inherited from
+MATLAB that worked in previous versions of Octave should still work
+without change in Octave. The only exception is code such as
+
+
+
try
+ c = a.*b;
+catch
+ c = a.*a;
+end_try_catch
+
+
+
that may have relied on matrices of different size producing an error.
+Because such operation is now valid Octave syntax, this will no longer
+produce an error. Instead, the following code should be used:
+
+
+
if (isequal (size (a), size (b)))
+ c = a .* b;
+else
+ c = a .* a;
+endif
+
+ This is a step-by-step tutorial for how to complete the computeCost() function portion of ex1. You will still have to do some thinking, because I'll describe the implementation, but you have to turn it into Octave script commands. All the programming exercises in this course follow the same procedure; you are provided a starter code template for a function that you need to complete. You never have to start a new script file from scratch. This is a vectorized implementation. You're only going to write a few simple lines of code.
+
+
+ With a text editor (NOT a word processor), open up the computeCost.m file. Scroll down until you find the "====== YOUR CODE HERE =====" section. Below this section is where you're going to add your lines of code. Just skip over the lines that start with the '%' sign - those are instructive comments.
+
+
+ We'll write these three lines of code by inspecting the equation on Page 5 of ex1.pdf. The first line of code will compute a vector 'h' containing all of the hypothesis values - one for each training example (i.e. for each row of X). The hypothesis (also called the prediction) is simply the product of X and theta. So your first line of code is...
+
+
h = {multiply X and theta, in the proper order that the ....inner dimensions match}
+
+ Since X is size (m x n) and theta is size (n x 1), you arrange the order of operators so the result is size (m x 1).
+
+
+ The second line of code will compute the difference between the hypothesis and y - that's the error for each training example. Difference means subtract.
+
+
error = {the difference between h and y}
+
+ The third line of code will compute the square of each of those error terms (using element-wise exponentiation),
+
+
+ An example of using element-wise exponentiation - try this in your workspace command line so you see how it works.
+
+
v = [-2 3]
+
+v_sqr = v.^2
+
+ So, now you should compute the squares of the error terms:
+
+
error_sqr = {use what you have learned}
+
+ Next, here's an example of how the sum function works (try this from your command line)
+
+
q = sum([1 2 3])
+
+ Now, we'll finish the last two steps all in one line of code. You need to compute the sum of the error_sqr vector, and scale the result (multiply) by 1/(2*m). That completed sum is the cost value J.
+
+
J = {multiply 1/(2*m) times the sum of the error_sqr vector}
+
+ That's it. If you run the ex1.m script, you should have the correct value for J. Then you should run one of the unit tests (available in the Forum).
+
+
+ Then you can run the submit script, and hopefully it will pass.
+
+
+ Be sure that every line of code ends with a semicolon. That will suppress the output of any values to the workspace. Leaving out the semicolons will surely make the grader unhappy.
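+
+
+ For reference, here is one possible vectorized arrangement of the steps above (treat it as a sketch to check your own work against, not the only valid form):
+
+
h = X * theta;                        % hypothesis, size (m x 1)
+error = h - y;                        % difference between prediction and actual
+error_sqr = error .^ 2;               % element-wise squares of the errors
+J = (1 / (2 * m)) * sum(error_sqr);   % scalar cost value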
+
+
+
+ Gradient Descent Tutorial
+
+ - also applies to gradientDescentMulti() - includes test cases.
+
+
+ I use the vectorized method, hopefully you're comfortable with vector math. Using this method means you don't have to fuss with array indices, and your solution will automatically work for any number of features or training examples.
+
+
+ What follows is a vectorized implementation of the gradient descent equation on the bottom of Page 5 in ex1.pdf.
+
+
+ Reminder that 'm' is the number of training examples (the rows of X), and 'n' is the number of features (the columns of X). 'n' is also the size of the theta vector (n x 1).
+
+
+ Perform all of these steps within the provided for-loop from 1 to the number of iterations. Note that the code template provides you this for-loop - you only have to complete the body of the for-loop. The steps below go immediately below where the script template says "======= YOUR CODE HERE ======".
+
+
+ 1 - The hypothesis is a vector, formed by multiplying the X matrix and the theta vector. X has size (m x n), and theta is (n x 1), so the product is (m x 1). That's good, because it's the same size as 'y'. Call this hypothesis vector 'h'.
+
+
+ 2 - The "errors vector" is the difference between the 'h' vector and the 'y' vector.
+
+
+ 3 - The change in theta (the "gradient") is the sum of the product of X and the "errors vector", scaled by alpha and 1/m. Since X is (m x n), and the error vector is (m x 1), and the result you want is the same size as theta (which is (n x 1), you need to transpose X before you can multiply it by the error vector.
+
+
+ The vector multiplication automatically includes calculating the sum of the products.
+
+
+ When you're scaling by alpha and 1/m, be sure you use enough sets of parentheses to get the factors correct.
+
+
+ 4 - Subtract this "change in theta" from the original value of theta. A line of code like this will do it:
+
+
theta = theta - theta_change;
+
+ That's it. Since you're never indexing by m or n, this solution works identically for both gradientDescent() and gradientDescentMulti().
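+
+
+ For reference, a sketch of the loop body described in steps 1 through 4 (variable names follow the tutorial; verify the order and transposition against your own matrix sizes):
+
+
h = X * theta;                                % (m x 1) hypothesis
+errors = h - y;                               % (m x 1) errors vector
+theta_change = (alpha / m) * (X' * errors);   % (n x 1) scaled gradient
+theta = theta - theta_change;                 % simultaneous update of all theta values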
+
+
+
+ Feature Normalization Tutorial
+
+
+
+ There are a couple of methods to accomplish this. The method here is one I use that doesn't rely on automatic broadcasting or the bsxfun() or repmat() functions.
+
+
+ You can use the mean() and std() functions to get the mean and standard deviation for each column of X. These are returned as row vectors (1 x n).
+
+
+ Now you want to apply those values to each element in every row of the X matrix. One way to do this is to duplicate these vectors for each row in X, so they're the same size.
+
+
+ One method to do this is to create a column vector of all-ones - size (m x 1) - and multiply it by the mu or sigma row vector (1 x n). Dimensionally, (m x 1) * (1 x n) gives you a (m x n) matrix, and every row of the resulting matrix will be identical.
+
+
+ Now that X, mu, and sigma are all the same size, you can use element-wise operators to compute X_normalized.
+
+
+ Try these commands in your workspace:
+
+
X = [1 2 3; 4 5 6]
+
+% creates a test matrix
+
+mu = mean(X)
+
+% returns a row vector
+
+sigma = std(X)
+
+% returns a row vector
+
+m = size(X, 1)
+
+% returns the number of rows in X
+
+mu_matrix = ones(m, 1) * mu
+
+sigma_matrix = ones(m, 1) * sigma
+
+ Now you can subtract the mu matrix from X, and divide element-wise by the sigma matrix, and arrive at X_normalized.
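+
+
+ In code, that last step might look like this (assuming the mu_matrix and sigma_matrix computed above):
+
+
X_norm = (X - mu_matrix) ./ sigma_matrix;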
+
+
+ You can do this even easier if you're using a Matlab or Octave version that supports automatic broadcasting - then you can skip the "multiply by a column of 1's" part.
+
+
+ You can also use the bsxfun() or repmat() functions. Be advised that bsxfun() has a non-obvious syntax that I can never remember, and repmat() runs rather slowly.
+
+ The values can be inspected by adding the "keyboard" command within your for-loop. This exits the code to the debugger, where you can inspect the values. Use the "return" command to resume execution.
+
+
+ Test Case 2:
+
+
+ This test case is similar, but uses a non-zero initial theta value.
+
+ The submit script, for all the programming assignments, does not report the line number and location of the error when it crashes. The following method can be used to make it do so, which makes debugging easier.
+
+
+ Open ex1/lib/submitWithConfiguration.m and replace line:
+
+
fprintf('!! Please try again later.\n');
+
+
+ (around 28) with:
+
+
fprintf('Error from file:%s\nFunction:%s\nOn line:%d\n', e.stack(1,1).file,e.stack(1,1).name, e.stack(1,1).line );
+
+
+ The original line prints '!! Please try again later' on a crash; the replacement line will instead give the file, function, and line number of the error. This change can be applied to all the programming assignments.
+
+
+ Note for OS X users
+
+
+ If you are using OS X and get this error message when you run ex1.m and expect to see a plot figure:
+
+
gnuplot> set terminal aqua enhanced title "Figure 1" size 560 420 font "*,6" dashlength 1
+ ^
+ line 0: unknown or ambiguous terminal type; type just 'set terminal' for a list
+
+
+ ... try entering this command in the workspace console to change the terminal type:
+
+
setenv("GNUTERM","qt")
+
+
+ How to check format of function arguments
+
+
+ You can inspect an argument by typing its name on its own line in the body of the function, and then calling submit() in Octave.
+
+
+ For example, I can print the theta argument in the "Compute cost for one variable" exercise by writing this in my computeCost.m file. Of course, it will fail because 5 is just a random number, but it will show me the value of theta:
+
+
function J = computeCost(X, y, theta)
+ m = length(y);
+ J = 0
+ theta
+ J = 5 % I have added this line just to show that the argument you want to print doesn't have to be on the last line
+end
+
+
+ Testing matrix operations in Octave
+
+
+ In our programming exercises, there are many complex matrix operations where it may not be clear what form the result is in. I find it helpful to create a few basic matrices and vectors to test out my operations. For instance the following commands can be copied to a file to be used at any time for testing an operation.
+
+
X = [1 2 3; 1 2 3; 1 2 3; 1 2 3; 1 5 6] % Make sure X has more rows than theta and isn't square
+y = [1; 2; 3; 4; 5]
+theta = [1; 1; 1]
+
+
+ With these basic matrices and vectors you can model most of the programming exercises. If you don't know what form specific operations in the exercises take, you can test it in the Octave shell.
+
+
+ One thing that got me was using formulas like theta' * x where x was a single row of X. All the notes show x as a column vector (n x 1), but X(i,:) is a row vector (1 x n). Using the terminal, I figured out that I had to transpose x. Testing like this is very helpful.
+
+
+ Repeating previous operations in Octave
+
+
+ When using the great unit tests by Vinh, if your function doesn't work the first time: after you edit and save your function file, in your Octave window just type ctrl-p to back up to what you typed previously, then press enter to run it (once you've gone back, you can use ctrl-n for next). More info @
+
+ https://www.gnu.org/software/octave/doc/interpreter/Commands-For-History.html
+
+ )
+
+
+ Warm up exercise
+
+
+ If you type "ex1.m" you will get an error - just use "ex1". Press 'Run' in Matlab editor.
+
+
+ Compute cost for one variable
+
+
+ theta is a matrix of size 2x1; the first row is theta[0] and the second one is theta[1] (I am following the index convention of the videos here). Also, fill in arbitrary (non-zero) initial values for theta[0] and theta[1].
+
+
+ Gradient descent for one variable
+
+
+ See the 5th segment of Week 1 Video II ("Gradient Descent") for a key tip on simultaneous updates of theta.
+
+ The bsxfun function is helpful for applying a function (limited to two arguments) in an element-wise fashion to the rows of a matrix using a vector of source values. This is useful for feature normalization. An example you can enter at the Octave command line:
+
Z = [1 1 1; 2 2 2];
+v = [1 1 1];
+bsxfun(@minus, Z, v)   % or, with automatic broadcasting: Z - v
+ans =
+   0   0   0
+   1   1   1
+
+
+ A note regarding Feature Normalization when a feature is a constant: <provided by a ML-005 student>
+
+
+ When I used the feature normalization routine we used in class, it did not occur to me that some features of the training examples may have constant values, which means that the sigma vector has zeroes for those features. Thus when I divide by sigma to normalize the matrix, NaNs fill in some slots. This causes gradient descent to get lost wandering through a NaN wasteland, but never reporting why. The fix is easy. In featureNormalize, after sigma is calculated but before the division takes place, insert
+
+
sigma( sigma == 0 ) = 1; % to keep away the NaN's and Inf's
+
+ Once this was done, gradient descent ran fine.
+
+
+ TA note: for the ML class exercises, you do not need this trick, because the scripts add the column of bias units after the features are normalized. But for your use outside of the class exercises, this may be a useful technique.
+
+
+ Gradient descent for multiple variables
+
+
+ The lecture notes "Week 2" under section Matrix Notation basically spells out one line solution to the problem.
+
+
+ When predicting prices using theta derived from gradient descent, do not forget to normalize the input x, or you'll get a multimillion-dollar house value (the wrong one).
+
+
+ Normal Equations
+
+
+ I found that the line "data = csvread('ex1data2.txt');" in ex1_multi.m is not needed as we previously load this data via "data = load('ex1data2.txt');"
+
+
+ Prior steps normalized X; this line sets X back to the original values. To have the theta from gradient descent and the theta from the normal equations be close, run the normal equations using normalized features as well. Therefore, do not reload X.
+
+
+ Comment: I think the point of reloading is to show that you actually get the same results even without doing anything to the data beforehand. Of course, for this script it's not efficient, but in a real application you would use only one of the approaches. Similar considerations would argue against feature normalization. Therefore, do reload X.
+
+ Note for MATLAB users: If you are using MATLAB version R2015a or later, the fminunc() function has been changed in this version. The function works better, but does not give the expected result for Figure 5 in ex2.pdf, and it throws some warning messages (about a local minimum) when you run ex2_reg.m. This is normal, and you should still be able to submit your work to the grader.
+
+
+ Typos in the lectures (updated):
+
+
+ There are typos in the week 3 lectures, specifically for regularized logistic regression. This could create some confusion while doing the last part of exercise 2. The equations in ex2.pdf are correct.
+
+
+ Gradient and theta values for ex2.m
+
+
+ Here are the values of both cost J and the gradients for the "initial theta (zeros)" test (ex2.pdf Section 1.2.2):
+
+
Cost at initial theta (zeros): 0.693147
+Gradient at initial theta (zeros):
+ -0.100000
+ -12.009217
+ -11.262842
+
+
+ Here are the values for both cost J and theta for the "theta found by fminunc" test (ex2.pdf Section 1.2.3):
+
+
Cost at theta found by fminunc: 0.203498
+theta:
+ -25.164593
+ 0.206261
+ 0.201499
+
+ Not 100% sure about this, so please take this with a grain of salt.
+
+
+ It appears to me that the "mapFeature" vector displayed on page 9 of the ex2.pdf is the transpose of what is intended. Also, it would be more clear if each of the variables carried the (i) superscript denoting the trial
+
+ Of course this assumes exactly two features in the original dataset. I think of this more as "mapTrial" than as "mapFeature" because what we're really doing is mapping the original trials with two features onto a new set of trials with 28 features.
+
+
+ I would not have thought twice about this, had I not gulped hard at the imprecise use of the word "dimensions" in the phrase "a 28-dimensional vector" in the text which follows the expression.
+
+
+ This is how I interpreted it for the homework, and the results were accepted. But if I'm way off base, please delete this wiki entry.
+
+ The plot() attribute "MarkerFaceColor" may not be supported on your version of Octave or MATLAB. You may need to modify it. Use the command "plot help" to see what attributes are supported. (You might just try to replace "MarkerFaceColor" with "MarkerFace", then the plot should work, although you get a warning.)
+
+
+ Logistic Regression Gradient
+
+
+ [w.r.t.=with respect to]
+
+
+ Don't stumble over terminology - "the partial derivatives of the cost w.r.t. each parameter in theta" are:
+
+ I was confused about this and kept trying to return the updated theta values . . .
+
+
+ UPDATE (the above was really helpful, thank you for putting it here): As an additional hint, the instructions say "[...] the gradient of the cost with respect to the parameters" - you're only asked for a gradient, don't overdo it (see above). The fact that you're not given alpha should be a hint in itself. You don't need it. You won't be iterating, either.
+
+
+ Sigmoid function
+
+
+ 1) The sigmoid function accepts only one parameter named 'z'. This variable 'z' can represent a scalar, vector, or matrix. No other variable names should appear in the sigmoid() function.
+
+
+ 2) The implementation of the sigmoid function should use only element-wise operators. The operators needed are addition, element-wise division (the './' operator), and the exp() function.
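+
+
+ A minimal sketch consistent with the two notes above (element-wise, so it works for scalars, vectors, and matrices alike):
+
+
g = 1 ./ (1 + exp(-z));   % element-wise sigmoid of z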
+
+
+ Decision Boundary
+
+
+ Thoughts regarding why the equation $$\theta_{1} + \theta_{2}x_{2} + \theta_{3}x_{3}$$ is set equal to 0 for determining a decision boundary:
+
+
+ In this exercise, we're solving a
+
+ classification
+
+ problem using logistic regression.
+
+
+
+
+ The hypothesis equation is $$h_{\theta}(x) = g(z)$$, where g is the sigmoid function $$\dfrac{1}{1+e^{-z}}$$, and $$z = \theta^{T}x$$
+
+
+
+
+ For classification, we usually interpret a hypothesis value $$h_{\theta}(x) \geq 0.5$$ as predicting class "1"
+
+ This means that $$g(\theta^{T}x) \geq 0.5$$ predicts class "1"
+
+
+
+
+ The sigmoid function g(z) outputs ≥0.5 when z≥0 (look at a graph of the sigmoid function)
+
+
+
+
+ Remember, $$z = \theta^{T}x$$
+
+
+
+
+ So, $$\theta^{T}x \geq 0$$ predicts class "1"
+
+
+
+
+ Remember $$\theta^{T}x = \theta_{1} + \theta_{2}x_{2} + \theta_{3}x_{3}$$ in this example (using 1-indexing)
+
+
+
+
+ So, $$\theta_{1} + \theta_{2}x_{2} + \theta_{3}x_{3} \geq 0$$ predicts class "1"
+
+
+
+
+ The decision boundary lets us see the line that has been learned in order to separate out the y=0 vs y=1 classes, in this example
+
+
+
+
+ This boundary is at $$h_{\theta}(x) = 0.5$$ (remember, this is the lowest possible value for predicting that a class is "1")
+
+
+
+
+ So, $$\theta_{1} + \theta_{2}x_{2} + \theta_{3}x_{3} = 0$$ is the boundary
+
+
+
+
+ The decision boundary will be a line composed of
+
+ any
+
+ (x2,x3) points that make this equation
+
+ equal zero
+
+ .
+
+
+
+
+ In order to plot the line along the specific data we have, we arbitrarily decide to use values of $$x_{2}$$ from our data, by choosing the max and min, and then add/subtract a little bit in order to make the line fit nicely. Think about it, you could continue down the line in the above equation an infinite amount in either direction, and it will still be the line dividing the two classes. However, we only have data that lies around a certain area of this line, so we make sure to only plot the line and data in that region (otherwise it would just be a line and some blank space around it).
+
+
+
+
+ Solve for $$x_{3}$$ since we're using $$x_{2}$$ values (the max & min values +/- 2 in order to make a nice line) --> $$x_{3} = \dfrac{-1}{\theta_{3}}\,(\theta_{2}x_{2} + \theta_{1})$$, as seen in the Octave function.
+
+
+
+
+ Plug in the two $$x_2$$ values (stored in plot_x) into the above equation to get the two corresponding $$x_3$$ values (and store them in the plot_y variable); see the Octave sketch after this list.
+
+
+
+
+ Plot a line using these values -> this will be the decision boundary.
+
+
+
+
+ Plot the rest of our data on the graph as well, and notice that the line should separate the classes.
+
+
+
+
+ The above still applies even if you're using higher-order polynomial features, with the note that instead of a decision boundary "line", it will be a decision boundary "polynomial".
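+
+
+ A rough Octave sketch of those last few plotting steps (the variable names here are illustrative; plotDecisionBoundary.m in the exercise does essentially this):
+
+
plot_x = [min(X(:,2)) - 2, max(X(:,2)) + 2];                  % two x2 values spanning the data
+plot_y = (-1 / theta(3)) .* (theta(2) .* plot_x + theta(1));  % corresponding x3 values on the boundary
+plot(plot_x, plot_y)                                          % the decision boundary line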
+
+ ML: Programming Exercise 3: Multi-class Classification and Neural Networks
+
+
+ Debugging Tip
+
+
+ The submit script, for all the programming assignments, does not report the line number and location of the error when it crashes. The following method can be used to make it do so, which makes debugging easier.
+
+
+ Open ex3/lib/submitWithConfiguration.m and replace line:
+
+
fprintf('!! Please try again later.\n');
+
+
+ (around 28) with:
+
+
fprintf('Error from file:%s\nFunction:%s\nOn line:%d\n', e.stack(1,1).file,e.stack(1,1).name, e.stack(1,1).line );
+
+
+ The original line prints '!! Please try again later' on a crash; the replacement line will instead give the file, function, and line number of the error. This change can be applied to all the programming assignments.
+
+
+ 1.4.1 One-vs-all Prediction
+
+
+ The pdf says you should get 94.9% training accuracy. This might not be correct depending on how you implement your code.
+
+
+
+ "The result you will get may differ a little bit based on how you implement your code. Sometimes, although mathematically two expressions are the same, Matlab may compute them differently. For example, expressions X'*(sigmoid(X*theta)-y) and sum((sigmoid(X*theta)-y)*ones(1,size(X,2)).*X) are the same mathematically; however, Matlab does not compute them the same numerically. I tried to use the same input for these two expressions and Matlab gave me a difference about 2*10^(-10) in 1 norm. Therefore, when you use different expressions to compute the gradient and then use fmincg to learn the parameters, your result may be a little different. Actually, when I used the first expression, I got the accruacy 95.14% and when I used the second one, I got 94.94%. They should be both correct in this sense."
+
+
+ -Posted by guoxian (Student)
+
+
+
+ Use the submit feature to find out if you are correct even if you get a different answer for training accuracy.
+
+ It wasn't clear to me whether, when computing the hidden layer, you only need to compute $$g(z^1)$$, or should you transform it to binary values (set the value to 1 for g>0.5 and to 0 for g<0.5), like we learned in logistic regression. Both solutions give almost the same results in the final predictions. From the "submit" feature it is clear that you shouldn't transform the values to binary values.
+
+ -Posted by inna (Student)
+
+
+
+
+ Prediction of an image outside the dataset (Neural Network)
+
+
+
+ To test the prediction with images outside the dataset, below is a code that I wrote to import the image and use the prediction.
+
+
function p = predictImg(Theta1, Theta2, Img)
+X = imread(Img);% reads the image .bmp (24 bits) (20x20)
+
+X = double(X);% converts it to double
+temp = X;% creates a copy for later use
+
+X = (X.-128)./255;%normalize the features
+X = X .* (temp > 0);%return the original 0 values to the X
+X = reshape(X, [], numel(X));%converts the 20x20 matrix into a 1x400 vector
+
+displayData(X);%display the image imported
+
+p = predict(Theta1, Theta2, X);% calls the neural network prediction method
+
+
+ Usage:
+
+
p = predictImg(Theta1, Theta2, '1.bmp');
+
+
+ Obs: Because this function uses the Theta1 and Theta2 created by ex3_nn, run ex3_nn before the first use of this function.
+
+ This is the toughest exercise so far, mainly because you have to implement a series of steps, each subject to error, before you get any feedback. These techniques may help:
+
+
+ See the tutorial below (developed for the Spring 2014 session).
+
+
+ Use the command line. The command line is your friend. Run enough of ex4.m to initialize X, y, Theta1, and Theta2, then work one statement or operation at a time to get the results you want. When you get a statement working, transfer it to nnCostFunction--and save the file.
+
+
+ Use dimensions. Use size() to check the dimensions of vectors and matrices to determine order of multiplication and whether a transpose is needed. This is especially valuable for the gradients. Keep in mind that the gradient matrices are the same size as Theta1 and Theta2. Also note that you will need to do some things that may seem counter-intuitive, like multiplying a m X 1 vector by a 1 X n vector to get an m X n matrix.
+
+
+ You may find it helpful to note the dimensions of each matrix in a comment on the line of code, as you define it and use it, e.g.:
+
+
Theta1 = reshape(.....) % (nhn x (n+1))
+a = b * c % dimcheck: (nhn x (n+1)) = (nhn x m) * (m x (n+1))
+
+
+
+ Do not hard-code. Specifically, do not hard-code the size of the 'binarized' y vector to 10. It will work fine for the initial tests, but will blow up with cryptic error messages later on.
+
+
+
+
+ If you get stuck on gradients, try working on a smaller, easier to grasp problem. You can steal code from checkNNGradients and paste it into the command line to get a 3-5-3 network that's a bit more manageable.
+
+
+
+
+ Full vectorization of backprop
+
+
+
+
+ If you want to get rid of the loop over the training samples in the back propagation algorithm, you face the problem of creating a logical vector from y for all training examples. Some smart guy from the spring 2013 instance of this course came up with the following elegant solution for this task:
+
+
yv=[1:num_labels] == y
+
+
+ (This does not seem to work in Octave 3.2.4; I use 3.6.4. It doesn't work on 3.4 either.)
+
+
+ After getting this, it was pretty straightforward to vectorize the loop. I could transform each line from my for-loop 1:1 to the vectorized code.
+
+ Using vectorization speeds up the code considerably.
+
+
+ Another method for generating the y matrix, this time looping over the labels:
+
+
y_matrix = []; % create a null matrix
+for i = 1:num_labels
+  y_matrix = [y_matrix, y == i]; % append one logical column per label
+end
+
+
+ Another vectorized one-line method (using vectorized indexing of an eye matrix)- Spring 2014 session:
+
+
y_matrix = eye(num_labels)(y,:); % works for Octave
+...or
+all_combos = eye(num_labels);
+y_matrix = all_combos(y,:) % works for Matlab
+
+
+ This method uses an indexing trick to vectorize the creation of 'y_matrix', where each element of 'y' is mapped to a single-value row vector copied from an eye matrix.
+
+
+
+ FYI: Misleading Formula in Ex4 pdf for regularization term of cost
+
+
+
+ The summation indexes for Theta 1 and 2 should be from 2 to 26 and 2 to 401 respectively.
+
+
+ Tutorial for Ex.4 Forward and Backpropagation (Spring 2014 session)
+
+
+ This tutorial outlines the process of accomplishing the goals for Programming Exercise 4. The purpose is to create a collection of all the useful yet scattered and obscure knowledge that otherwise would require hours of frustrating searches. This tutorial is targeted solely at vectorized implementations. If you're a looper, you're doing it the hard way, and you're on your own. I'll use the less-than-helpful greek letters and math notation from the video lectures in this tutorial, though I'll start off with a glossary so we can agree on what they are. I will also suggest some common variable names, so students can more easily get help on the Forum. It is left to the reader to convert these lines into program statements. You will need to determine the correct order and transpositions for each matrix multiplication. Most of this material appears in either the video lectures, slides, course wiki, or the ex4.pdf file, though nowhere else does it all appear in one place.
+
+ Glossary:
+
+ Each of these variables will have a subscript, noting which NN layer it is associated with.
+
+
+ Θ: A matrix of weights to compute the inner values of the neural network. When we used single-vector theta values, it was noted with the lower-case character θ.
+
+
+ z: the result of multiplying a data vector with a Θ matrix. A typical variable name would be "z2".
+
+
+ a: The "activation" output from a neural layer. This is always generated using a sigmoid function g() on a z value. A typical variable name would be "a2".
+
+
+ δ: lower-case delta is used for the "error" term in each layer. A typical variable name would be "d2".
+
+
+ Δ: upper-case delta is used to hold the sum of the product of a δ value with the previous layer's a value. In the vectorized solution, these sums are calculated automatically through the magic of matrix algebra. A typical variable name would be "Delta2".
+
+
+ Θ gradient: This is the thing we're looking for, the partial derivative of theta. There is one of these variables associated with each Δ. These values are returned by nnCostFunction(), so the variable names must be "Theta1_grad" and "Theta2_grad".
+
+
+ g() is the sigmoid function.
+
+
+ g′() is the sigmoid gradient function.
+
+
+ Tip: One handy method for ignoring a column of bias units is to use the notation "SomeMatrix(:,2:end)". This selects all of the rows of a matrix, and omits the entire first column.
+
+ Here we go
+
+ Nearly all of the editing in this exercise happens in nnCostFunction.m. Let's get started.
+
+
+
+ A note regarding the sizes of these data objects:
+
+ See the Appendix at the bottom of the tutorial for information on the sizes of the data objects.
+
+ A note regarding bias units, regularization, and back-propagation:
+
+
+ There are two methods for handling the bias units in the back-propagation and gradient calculations. I've described only one of them here; it's the one that I understood the best. Both methods work; choose the one that makes sense to you and avoids dimension errors. It matters not a whit whether the bias unit is dropped before or after it is calculated - both methods give the same results, though the order of operations and transpositions required may be different. Those with contrary opinions are welcome to write their own tutorial.
+
+
+ Forward Propagation:
+
+ We'll start by outlining the forward propagation process. Though this was already accomplished once during Exercise 3, you'll need to duplicate some of that work because computing the gradients requires some of the intermediate results from forward propagation.
+
+
+ Step 1 - Expand the 'y' output values into a matrix of single values (see ex4.pdf Page 5). This is most easily done using an eye() matrix of size num_labels, with vectorized indexing by 'y', as in "eye(num_labels)(y,:)". Discussions of this and other methods are available in the Course Wiki - Programming Exercises section. A typical variable name would be "y_matrix".
+
+
+ Step 2 - perform the forward propagation (see the sketch below this list of steps):
+
+
+ a1 equals the X input matrix with a column of 1's added (bias units)
+
+
+ z2 equals the product of a1 and Θ1
+
+
+ a2 is the result of passing z2 through g()
+
+
+ a2 then has a column of 1's added (bias units)
+
+
+ z3 equals the product of a2 and Θ2
+
+
+ a3 is the result of passing z3 through g()
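+
+
+ For orientation, those statements might take roughly this shape (a sketch only; verify the transpositions against the sizes listed in the Appendix):
+
+
a1 = [ones(m, 1) X];             % (m x 401) input plus bias column
+z2 = a1 * Theta1';               % (m x 25)
+a2 = [ones(m, 1) sigmoid(z2)];   % (m x 26) activation plus bias column
+z3 = a2 * Theta2';               % (m x 10)
+a3 = sigmoid(z3);                % (m x 10)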
+
+ Cost Function, non-regularized
+
+
+
+ Step 3 - Compute the unregularized cost according to ex4.pdf (top of Page 5), using a3, your y_matrix, and m (the number of training examples). (I had a hard time understanding this equation, mainly because I had a misconception that $$y^{(i)}_k$$ is a vector; it is simply one number.) The cost should be a scalar value. If you get a vector of cost values, you can sum that vector to get the cost. Remember to use element-wise multiplication with the log() function. Now you can run ex4.m to check that the unregularized cost is correct, then you can submit Part 1 to the grader.
+
+
+
+ Cost Regularization
+
+
+
+ Step 4 - Compute the regularized component of the cost according to ex4.pdf Page 6, using Θ1 and Θ2 (ignoring the columns of bias units), along with λ and m. The easiest method to do this is to compute the regularization terms separately, then add them to the unregularized cost from Step 3. You can run ex4.m to check the regularized cost, then you can submit Part 2 to the grader.
+
+ Sigmoid Gradient and Random Initialization
+
+
+
+ Step 5 - You'll need to prepare the sigmoid gradient function g′(), as shown in ex4.pdf Page 7. You can submit Part 3 to the grader.
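+
+
+ A sketch of g′() following the formula on ex4.pdf Page 7 (element-wise, reusing the sigmoid function):
+
+
g = sigmoid(z) .* (1 - sigmoid(z));   % element-wise sigmoid gradient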
+
+
+ Step 6 - Implement the random initialization function as instructed on ex4.pdf, top of Page 8. You do not submit this function to the grader.
+
+ Backpropagation
+
+
+
+ Step 7 - Now we work from the output layer back to the hidden layer, calculating how bad the errors are. See ex4.pdf Page 9 for reference.
+
+
+ δ3 equals the difference between a3 and the y_matrix.
+
+
+ δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units), then multiplied element-wise by the g′() of z2 (computed back in Step 2).
+
+
+ Note that at this point, the instructions in ex4.pdf are specific to looping implementations, so the notation there is different.
+
+
+ Δ2 equals the product of δ3 and a2. This step calculates the product and sum of the errors.
+
+
+ Δ1 equals the product of δ2 and a1. This step calculates the product and sum of the errors.
+
+
+
+ Gradient, non-regularized
+
+
+
+ Step 8 - Now we calculate the non-regularized theta gradients, using the sums of the errors we just computed (see ex4.pdf, bottom of Page 11).
+
+
+ Θ1 gradient equals Δ1 scaled by 1/m
+
+
+ Θ2 gradient equals Δ2 scaled by 1/m
+
+
+ The ex4.m script will also perform gradient checking for you, using a smaller test case than the full character classification example. So if you're debugging your nnCostFunction() using the "keyboard" command during this, you'll suddenly be seeing some much smaller sizes of X and the Θ values. Do not be alarmed. If the feedback provided to you by ex4.m for gradient checking seems OK, you can now submit Part 4 to the grader.
+
+ Gradient Regularization
+
+
+
+ Step 9 - For reference see ex4.pdf, top of Page 12, for the right-most terms of the equation for j>=1. Now we calculate the regularization terms for the theta gradients. The goal is that regularization of the gradient should not change the theta gradient(:,1) values (for the bias units) calculated in Step 8. There are several ways to implement this (in Steps 9a and 9b).
+
+
+ Method 1: 9a) Calculate the regularization for indexes (:,2:end), and 9b) add it to the theta gradients (:,2:end).
+
+
+ Method 2: 9a) Calculate the regularization for the entire theta gradient, then overwrite the (:,1) value with 0 before 9b) adding it to the entire matrix.
+
+
+ Details for Steps 9a and 9b:
+
+
+ 9a) Pick a method, and calculate the regularization terms as follows: (λ/m)∗Θ1 (using either Method 1 or Method 2), and (λ/m)∗Θ2 (using either Method 1 or Method 2).
+
+
+ 9b) Add these regularization terms to the appropriate Θ1 gradient and Θ2 gradient terms from Step 8 (using either Method 1 or Method 2). Avoid modifying the bias unit of the theta gradients.
+
+
+ Note: there is an errata in the lecture video and slides regarding some missing parentheses for this calculation. The ex4.pdf file is correct.
+
+
+ The ex4.m script will provide you feedback regarding the acceptable relative difference. If all seems well, you can submit Part 5 to the grader. Now pat yourself on the back.
+
+
+
+ Appendix:
+
+
+
+ Here are the sizes for the character recognition example, using the method described in this tutorial:
+
+
+ a1: 5000x401
+
+
+ z2: 5000x25
+
+
+ a2: 5000x26
+
+
+ a3: 5000x10
+
+
+ d3: 5000x10
+
+
+ d2: 5000x25
+
+
+ Theta1, Delta1 and Theta1_grad: 25x401
+
+
+ Theta2, Delta2 and Theta2_grad: 10x26
+
+
+ Note that the ex4.m script uses several test cases of different sizes, and the submit grader uses yet another different test case.
+
+
+ Debugging Tip
+
+
+ The submit script, for all the programming assignments, does not report the line number and location of the error when it crashes. The following method can be used to make it do so, which makes debugging easier.
+
+
+ Open ex4/lib/submitWithConfiguration.m and replace line:
+
+
fprintf('!! Please try again later.\n');
+
+
+ (around 28) with:
+
+
fprintf('Error from file:%s\nFunction:%s\nOn line:%d\n', e.stack(1,1).file,e.stack(1,1).name, e.stack(1,1).line );
+
+
+ The original line prints '!! Please try again later' on a crash; the replacement line will instead give the file, function, and line number of the error. This change can be applied to all the programming assignments.
+
+
+ Tips for classifying your own images:
+
+
+ There's no documentation on how the images were prepared for this course. These tips may be helpful.
+
+
+
+
+ The images must be gray-scale with 20x20 pixels.
+
+
+
+
+ The image pixels are scaled (or normalized) so that -1.0 is black, 0.0 is grey, and +1.0 is white. However, nearly all of the pixels are in the 0.0 to +1.0 range. The backgrounds are grey, and the image "pen strokes" are white.
+
+
+
+
+ Your images must use the same value range as the training data, otherwise the NN will not be able to classify them.
+
+
+
+
+ Center the digit image so it does not use the two pixels around the borders.
+
+
+
+
+ Bonus: Neural Network does not need order in pixels of an image as humans do
+
+
+ The pixels order (as a human sees them) is not necessary (or relevant) for a Neural Network.
+
+
+ You can test it with a modified ex3.m program below (you can call it ex3_rand.m)
+
+
+ The program has a step that randomizes the pixel positions, "scrambling" the 400 vector positions BEFORE the training. As long as you keep the same pixel positions when predicting, the results are the same.
+
+
+ It is interesting to "see" how prediction perfectly works with a scrambled picture!
+
+
+ You can test it once you have submitted the ex3.m program OK (meaning that you have the oneVsAll function working first).
+
+
+ ex3_rand.m is a modified version of ex3.m
+
+
% ex3_rand.m (is a modified version of ex3.m to scramble pixels/features)
+%
+%% Machine Learning Online Class - Exercise 3 | Randomize Features
+
+%% Initialization
+clear; close all; clc
+
+%% Setup the parameters you will use for this part of the exercise
+input_layer_size = 400; % 20x20 Input Images of Digits
+num_labels = 10; % 10 labels, from 1 to 10
+ % (note that we have mapped "0" to label 10)
+
+%% =========== Part 1: Loading and Visualizing Data =============
+% We start the exercise by first loading and visualizing the dataset.
+% You will be working with a dataset that contains handwritten digits.
+%
+
+% Load Training Data
+fprintf('Loading and Visualizing Data ...\n')
+
+load('ex3data1.mat'); % training data stored in arrays X, y
+m = size(X, 1);
+
+% Randomly select 100 data points to display
+rand_indices = randperm(m, 100);
+sel = X(rand_indices,:);
+
+displayData(sel);
+
+fprintf('Program paused. Press enter to continue.\n');
+pause;
+
+%% ============ Part 2: Vectorize Logistic Regression ============
+% In this part of the exercise, you will reuse your logistic regression
+% code from the last exercise. You task here is to make sure that your
+% regularized logistic regression implementation is vectorized. After
+% that, you will implement one-vs-all classification for the handwritten
+% digit dataset.
+%
+
+% Added to randomize features (to probe that is irrelevant)
+fprintf('\nRandomizing columns...\n');
+X_rand = X(:, randperm(size(X,2)));
+
+fprintf('\nTraining One-vs-All Logistic Regression...\n')
+
+lambda = 0.1;
+[all_theta] = oneVsAll(X_rand, y, num_labels, lambda);
+
+fprintf('Program paused. Press enter to continue.\n');
+pause;
+
+
+%% ================ Part 3: Predict for One-Vs-All ================
+% After ...
+pred = predictOneVsAll(all_theta, X_rand);
+
+fprintf('\nTraining Set Accuracy:%f\n', mean(double(pred == y)) * 100);
+
+%% ============ Part 4: Predict Random Samples ============
+% To give you an idea of the network's output, you can also run
+% through the examples one at the a time to see what it is predicting.
+
+% Randomly permute examples
+rp = randperm(m);
+
+for i = 1:m
+ % Display
+ fprintf('\nDisplaying Example Randomized Image\n');
+ displayData(X_rand(rp(i),:));
+
+ pred = predictOneVsAll(all_theta, X_rand(rp(i),:));
+ fprintf('\nNeural Network Prediction:%d (label%d)\n', pred, y(rp(i)));
+
+ % Pause
+ fprintf('Program paused. Press enter to continue.\n');
+ pause;
+end
+
+
+ Why the order is Irrelevant for the Neural-Network
+
+
+ You can see that the order of the pixels is irrelevant as long as you are consistent in two ways:
+
+
+
+
+ Between samples. Each feature should mean the same pixel. You cannot change the pixel location for one sample and not for the others. You can scramble them, but you have to keep the "scrambling" fixed across all of the samples.
+
+
+
+
+ Between labels. Each label should represent the same digit for its group of samples. Meaning a digit four is a four for all of the samples you labeled as four, and you cannot change it. It does not matter if the pixels are "scrambled"; it is a four.
+
+
+
+
+ Equivalent example of order irrelevancy
+
+
+ An equivalent example is the order of variable names when solving a system of equations. It does not matter how you call a variable or the order as long as you are consistent through out the solution.
+
+
+ For example, this:
+
+
+ $$3x_1 + 4x_2 = 26$$
+
+
+ $$2x_1 -3x_2 = -11$$
+
+
+ Solution: $$x_1 = 2;\quad x_2=5$$
+
+
+ ...is equivalent to:
+
+
+ $$3x_2 + 4x_1 = 26$$
+
+
+ $$2x_2 - 3x_1 = -11$$
+
+
+ Solution: $$x_2 = 2;\quad x_1=5$$
+
+
+ ...also you can "scramble" the terms and "labels"
+
+
+ $$-3x_1 + 2x_2 = -11$$
+
+
+ $$4x_1 + 3x_2 = 26$$
+
+
+ Solution: $$x_1 = 5;\quad x_2 = 2$$
+
+
+ It has to do with convention. Any convention as long as it is the same all the way through.
+
+ ML: Programming Exercise 5: Regularized Linear Regression and Bias vs Variance
+
+
+ Proposed erratum: the Optional exercise (Section 3.5) instructs you to select i examples from the cross-validation set. Shouldn't you always validate on the full cross-validation set as in section 2.1?
+
+
+ Other miscellany:
+
+
+
+
+ shouldn't it be "vs." instead of "v.s."?
+
+
+
+
+ p. 3 "overal"
+
+
+
+
+ p. 6 "wil"
+
+
+
+
+ p. 7 "For use polynomial regression [sic]"
+
+
+
+
+ p. 7 "zero-eth" - shouldn't this be "zero-th"?
+
+
+
+
+ p. 9 "where the low training error is low [sic]"
+
+
+
+
+ Debugging Tip
+
+
+ The submit script, for all the programming assignments, does not report the line number and location of the error when it crashes. The following method can be used to make it do so, which makes debugging easier.
+
+
+ Open ex5/lib/submitWithConfiguration.m and replace line:
+
+
fprintf('!! Please try again later.\n');
+
+
+ (around 28) with:
+
+
fprintf('Error from file:%s\nFunction:%s\nOn line:%d\n', e.stack(1,1).file,e.stack(1,1).name, e.stack(1,1).line );
+
+
+ The original line prints '!! Please try again later' on a crash; the replacement line will instead give the file, function, and line number of the error. This change can be applied to all the programming assignments.
+
+This exercise gives you practice with using SVMs for linear classification.
+You will use a free SVM software package called LIBSVM that interfaces to
+MATLAB/Octave. To begin, download the LIBSVM Matlab Interface (choose the
+package with the description "a simple MATLAB interface") and unzip the
+contents to any convenient location on your computer.
+
+
+Then, download the data for this exercise: ex7Data.zip.
+
+
+
+
+
+
+
+Installing LIBSVM
+
+
+After you've downloaded the LIBSVM Matlab Interface,
+follow the instructions in the package's README file
+to build LIBSVM from its source code. Instructions are provided for both
+Matlab and Octave on Unix and Windows systems.
+
+
+If you've built LIBSVM successfully, you should see 4 files with the suffix "mexglx"
+ ("mexw32" on Windows). These are the binaries that you will run from MATLAB/Octave, and you
+need to make them visible to your working directory for this exercise.
+This can be done in any of the following 3 ways:
+
+
+(1). Creating links to the binaries from your working directory
+
+
+(2). Adding the location of the binaries to the Matlab/Octave path
+
+
+(3). Copying the binaries to your working directory.
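+
+
+For option (2), for example (the path below is hypothetical; point it at wherever you unzipped and built LIBSVM):
+
+
+% Make the compiled LIBSVM mex binaries visible from any working directory
+addpath('/path/to/libsvm/matlab');   % hypothetical install location
+savepath;                            % optional: keep it on the path for future sessions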
+
+
+Linear classification
+
+
+Recall from the video lectures that SVM classification solves the following
+optimization problem:
+
+
+ $$\min_{w,\,b,\,\xi}\ \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{m}\xi_i$$
+
+
+ $$\text{subject to}\quad y^{(i)}\left(w^T x^{(i)} + b\right) \geq 1 - \xi_i,\qquad \xi_i \geq 0,\qquad i = 1,\ldots,m$$
+
+
+After solving, the SVM classifier predicts "1"
+if $$w^T x + b \geq 0$$ and "-1" otherwise. The decision boundary is given by the
+line $$w^T x + b = 0$$.
+
+
+2-Dimensional classification problem
+
+
+Let's first consider a classification problem with two features.
+Load the "twofeature.txt" data file into Matlab/Octave with the following
+command:
+
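+
+With the LIBSVM Matlab interface, the load can be done with libsvmread (a sketch, using the variable names referred to below):
+
+
+[trainlabels, trainfeatures] = libsvmread('twofeature.txt');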
+
+Note that this file is formatted for LIBSVM, so loading it with the
+usual Matlab/Octave commands would not work.
+
+
+After loading, the "trainlabels" vector should contain the classification
+labels for your training data, and the "trainfeatures" matrix should contain
+2 features per training example.
+
+
+Now plot your data, using separate symbols for positives and negatives. Your
+plot should look similar to this:
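+
+
+One way to produce such a plot (a sketch, assuming the +1/-1 labels this exercise uses):
+
+
+% Plot positives and negatives with different symbols
+pos = find(trainlabels == 1);
+neg = find(trainlabels == -1);
+figure; hold on;
+plot(trainfeatures(pos, 1), trainfeatures(pos, 2), 'b+');
+plot(trainfeatures(neg, 1), trainfeatures(neg, 2), 'ro');
+xlabel('Feature 1'); ylabel('Feature 2');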
+
+
+
+
+
+
+In this plot, we see two classes of data with a somewhat obvious
+ separation gap. However, the blue class has an outlier on the far left.
+We'll now look at how this outlier affects the SVM decision boundary.
+
+
+Setting cost to C = 1
+
+
+Recall from the lecture videos that the parameter $$C$$ in the SVM optimization
+problem is a positive cost factor that
+penalizes misclassified training examples.
+A larger $$C$$ discourages misclassification more than a smaller $$C$$.
+
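+
+With the LIBSVM Matlab interface, a linear SVM with these settings can be trained as sketched below (variable names follow the load step above):
+
+
+% Train a C-SVC with a linear kernel and cost C = 1
+model = svmtrain(trainlabels, trainfeatures, '-s 0 -t 0 -c 1');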
+
+The last string argument tells LIBSVM to train using the options
+
+
+a. -s 0, SVM classification
+
+
+b. -t 0, a linear kernel, because we want a linear decision boundary
+
+
+c. -c 1, a cost factor of 1
+
+
+You can see all available options by typing "svmtrain" at the Matlab/Octave
+console.
+
+
+After training is done, "model" will be a struct
+that contains the model parameters. We're now interested
+in getting the variables $$w$$ and $$b$$. Unfortunately, these are not
+explicitly represented in the model struct, but you can
+calculate them with the following commands:
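+
+
+A sketch of those commands, based on the fields the LIBSVM model struct exposes (SVs, sv_coef, rho, Label):
+
+
+% Recover the linear decision function w'*x + b from the LIBSVM model struct
+w = full(model.SVs' * model.sv_coef);   % full() because model.SVs is stored sparse
+b = -model.rho;
+
+% LIBSVM treats the first label it encountered as the positive class, so flip
+% the signs if that label was -1 to keep "1" as the positive class here.
+if (model.Label(1) == -1)
+    w = -w;
+    b = -b;
+end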
+
+
+Once you have $$w$$ and $$b$$, use them to plot the decision boundary. The outcome
+should look like the graph below.
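+
+
+For example, the boundary line $$w^T x + b = 0$$ can be drawn by solving for the second feature (a sketch, reusing the data plot from above):
+
+
+% Draw the line w(1)*x1 + w(2)*x2 + b = 0 across the range of the first feature
+plot_x = linspace(min(trainfeatures(:, 1)), max(trainfeatures(:, 1)), 100);
+plot_y = (-1 / w(2)) * (w(1) * plot_x + b);
+plot(plot_x, plot_y, 'k-');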
+
+
+
+
+
+
+With $$C = 1$$, we see that the outlier is misclassified, but the decision boundary
+seems like a reasonable fit.
+
+
+Setting cost to C = 100
+
+
+Now let's look at what happens when the cost factor is much higher. Train your
+model and plot the decision boundary again, this time with $$C$$ set to 100.
+The outlier will now be classified correctly, but the decision boundary will
+not seem like a natural fit for the rest of the data:
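+
+
+Only the cost option changes; for example:
+
+
+% Same linear SVM, but with a much larger cost penalty
+model = svmtrain(trainlabels, trainfeatures, '-s 0 -t 0 -c 100');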
+
+
+
+
+
+
+This example shows that when the cost penalty is large, the
+SVM algorithm will try very hard to avoid misclassifications. The tradeoff is that
+the algorithm will give less weight to producing a large separation margin.
+
+
+Text classification
+
+
+Now let's return to our spam classification example from the previous exercise.
+In your data folder, there should be the same 4 training sets you saw in the
+Naive Bayes exercise, only now formatted for LIBSVM. They are named:
+
+
+a. email_train-50.txt (based on 50 email documents)
+
+
+b. email_train-100.txt (100 documents)
+
+
+c. email_train-400.txt (400 documents)
+
+
+d. email_train-all.txt (the complete 700 training documents)
+
+
+You will train a linear SVM model on each of the four training sets with
+$$C$$ left at the default SVM value. After training, test
+the performance of each model on the set named "email_test.txt." This
+is done with the "svmpredict" command, which you can find out more about
+by typing "svmpredict" at the MATLAB/Octave console.
+
+
+At test time, the accuracy on the test set will be printed to the console.
+Record the classification accuracy for each training set
+and check your answers with the solutions.
+How do the errors compare to the Naive Bayes errors?
+
+
+
+
+
+
+
+
+
+Solutions
+
+
+
+
+
+
+
+An m-file implementation of the two-feature exercise can be found here.
+
+
+An m-file for the email classification exercise is here.
+
+
+Classification accuracy
+
+
+Here are the classification performance results that LIBSVM reports.
+
+
+a. 50 email documents: Accuracy = 75.3846% (196/260)
+
+
+b. 100 email documents: Accuracy = 88.4615% (230/260)
+
+
+c. 400 email documents: Accuracy = 98.0769% (255/260)
+
+
+d. the complete 700 training documents: Accuracy = 98.4615% (256/260)
+
+
+Here are the error comparisons with Naive Bayes:
+
+
+
+
+
+
+
Training set      Naive Bayes    SVM
50 train docs         2.7%      24.6%
100 train docs        2.3%      11.5%
400 train docs        2.3%       1.9%
700 train docs        1.9%       1.5%
+
+
+
+
+The conclusion from these results is that Naive
+Bayes performs better than the SVM when there is little training data, but the SVM shows better asymptotic
+performance as the amount of training data increases.
+
+