diff --git a/ML_Mathematical_Approach/._classification_fmri.pdf b/ML_Mathematical_Approach/._classification_fmri.pdf new file mode 100644 index 0000000..55be2b8 Binary files /dev/null and b/ML_Mathematical_Approach/._classification_fmri.pdf differ diff --git a/ML_Mathematical_Approach/._nninitialization.pdf b/ML_Mathematical_Approach/._nninitialization.pdf new file mode 100644 index 0000000..6f4dabc Binary files /dev/null and b/ML_Mathematical_Approach/._nninitialization.pdf differ diff --git a/ML_Mathematical_Approach/01_tutorials/01__resources.html b/ML_Mathematical_Approach/01_tutorials/01__resources.html new file mode 100644 index 0000000..53f3cda --- /dev/null +++ b/ML_Mathematical_Approach/01_tutorials/01__resources.html @@ -0,0 +1,56 @@ + + +

+ + Programming Exercise Tutorials + +

+
+ + + diff --git a/ML_Mathematical_Approach/02_test-cases/01__resources.html b/ML_Mathematical_Approach/02_test-cases/01__resources.html new file mode 100644 index 0000000..871b51f --- /dev/null +++ b/ML_Mathematical_Approach/02_test-cases/01__resources.html @@ -0,0 +1,56 @@ + + +

+ + Programming Exercise Test Cases + +

+
+ + + diff --git a/ML_Mathematical_Approach/03_week-1-lecture-notes/01__resources.html b/ML_Mathematical_Approach/03_week-1-lecture-notes/01__resources.html new file mode 100644 index 0000000..638de48 --- /dev/null +++ b/ML_Mathematical_Approach/03_week-1-lecture-notes/01__resources.html @@ -0,0 +1,709 @@ + + +

+ ML:Introduction +

+

+ What is Machine Learning? +

+

+ Two definitions of Machine Learning are offered. Arthur Samuel described it as: "the field of study that gives computers the ability to learn without being explicitly programmed." This is an older, informal definition. +

+

+ Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." +

+

+ Example: playing checkers. +

+

+ E = the experience of playing many games of checkers +

+

+ T = the task of playing checkers. +

+

+ P = the probability that the program will win the next game. +

+

+ In general, any machine learning problem can be assigned to one of two broad classifications: +

+

+ supervised learning, OR +

+

+ unsupervised learning. +

+

+ + Supervised Learning + +

+

+ In supervised learning, we are given a data set and already know what our correct output should look like, having the idea that there is a relationship between the input and the output. +

+

+ Supervised learning problems are categorized into "regression" and "classification" problems. In a regression problem, we are trying to predict results within a continuous output, meaning that we are trying to map input variables to some continuous function. In a classification problem, we are instead trying to predict results in a discrete output. In other words, we are trying to map input variables into discrete categories. Math is Fun has a helpful description of Continuous and Discrete Data.

+

+ + Example 1: + +

+

+ Given data about the size of houses on the real estate market, try to predict their price. Price as a function of size is a continuous output, so this is a regression problem. +

+

+ We could turn this example into a classification problem by instead making our output about whether the house "sells for more or less than the asking price." Here we are classifying the houses based on price into two discrete categories. +

+

+ + Example 2 + + : +

+

+ (a) Regression - Given a picture of a person, we have to predict their age on the basis of the given picture.

+

+ (b) Classification - Given a picture of a person, we have to predict whether they are of high school, college, or graduate age. Another example of classification: banks have to decide whether or not to give someone a loan on the basis of their credit history.

+

+ + Unsupervised Learning + +

+

+ Unsupervised learning, on the other hand, allows us to approach problems with little or no idea what our results should look like. We can derive structure from data where we don't necessarily know the effect of the variables. +

+

+ We can derive this structure by clustering the data based on relationships among the variables in the data. +

+

+ With unsupervised learning there is no feedback based on the prediction results, i.e., there is no teacher to correct you. +

+

+ + Example: + +

+

+ Clustering: Take a collection of 1,000 essays written on the US Economy, and find a way to automatically group these essays into a small number of clusters that are somehow similar or related by different variables, such as word frequency, sentence length, page count, and so on.

+

+ Non-clustering: The "Cocktail Party Algorithm", which can find structure in messy data (such as the identification of individual voices and music from a mesh of sounds at a cocktail party ( + + https://en.wikipedia.org/wiki/Cocktail_party_effect + + )). Here is an answer on Quora to enhance your understanding: + + https://www.quora.com/What-is-the-difference-between-supervised-and-unsupervised-learning-algorithms + +

+

+ ML:Linear Regression with One Variable +

+

+ Model Representation +

+

+ Recall that in + + regression problems + + , we are taking input variables and trying to fit the output onto a + + continuous + + expected result function. +

+

+ Linear regression with one variable is also known as "univariate linear regression." +

+

+ Univariate linear regression is used when you want to predict a + + single output + + value y from a + + single input + + value x. We're doing + + supervised learning + + here, so that means we already have an idea about what the input/output cause and effect should be. +

+

+ The Hypothesis Function +

+

+ Our hypothesis function has the general form: +

+

+ $$\hat{y} = h_\theta(x) = \theta_0 + \theta_1 x$$ +

+

+ Note that this is like the equation of a straight line. We give $$h_\theta(x)$$ values for $$\theta_0$$ and $$\theta_1$$ to get our estimated output $$\hat{y}$$. In other words, we are trying to create a function called $$h_\theta$$ that maps our input data (the x's) to our output data (the y's).

+

+ Example: +

+

+ Suppose we have the following set of training data: +

+ + + + + + + + + + + + + + + + + + + + + +
+

+ input x +

+
+

+ output y +

+
+

+ 0 +

+
+

+ 4 +

+
+

+ 1 +

+
+

+ 7 +

+
+

+ 2 +

+
+

+ 7 +

+
+

+ 3 +

+
+

+ 8 +

+
+

+ Now we can make a random guess about our $$h_\theta$$ function: $$\theta_0=2$$ and $$\theta_1=2$$. The hypothesis function becomes $$h_\theta(x)=2+2x$$. +

+

+ So for an input of 1 to our hypothesis, the predicted y will be 4, while the actual y in the training data is 7, so the prediction is off by 3. Note that we will be trying out various values of $$\theta_0$$ and $$\theta_1$$ to try to find values which provide the best possible "fit" or the most representative "straight line" through the data points mapped on the x-y plane.
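+ As a quick check, here is a minimal Octave sketch that evaluates this guessed hypothesis on the training set above (variable names are illustrative):
+
+x = [0; 1; 2; 3];          % input features from the table
+y = [4; 7; 7; 8];          % actual outputs from the table
+theta0 = 2; theta1 = 2;    % our random guess
+h = theta0 + theta1 * x;   % predictions: [2; 4; 6; 8]
+errors = h - y             % deviations: [-2; -3; -1; 0]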

+

+ Cost Function +

+

+ We can measure the accuracy of our hypothesis function by using a + + cost function + + . This takes an average (actually a fancier version of an average) of all the results of the hypothesis with inputs from x's compared to the actual output y's. +

+

+ $$J(\theta_0, \theta_1) = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left ( \hat{y}_{i}- y_{i} \right)^2 = \dfrac {1}{2m} \displaystyle \sum _{i=1}^m \left (h_\theta (x_{i}) - y_{i} \right)^2$$ +

+

+ To break it apart, it is $$\frac{1}{2}$$ $$\bar{x}$$ where $$\bar{x}$$ is the mean of the squares of $$h_\theta (x_{i}) - y_{i}$$ , or the difference between the predicted value and the actual value. +

+

+ This function is otherwise called the "Squared error function", or "Mean squared error". The mean is halved $$\left(\frac{1}{2m}\right)$$ as a convenience for the computation of the gradient descent, as the derivative term of the square function will cancel out the $$\frac{1}{2}$$ term. +

+

+ Now we are able to concretely measure the accuracy of our predictor function against the correct results we have so that we can predict new results we don't have. +

+

+ If we try to think of it in visual terms, our training data set is scattered on the x-y plane. We are trying to make a straight line (defined by $$h_\theta(x)$$) which passes through this scattered set of data. Our objective is to get the best possible line. The best possible line will be such that the average squared vertical distances of the scattered points from the line will be the least. In the best case, the line should pass through all the points of our training data set. In such a case the value of $$J(\theta_0, \theta_1)$$ will be 0.
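+ Continuing the toy example, a minimal Octave sketch of this cost function (assuming x, y, theta0, and theta1 from the sketch above):
+
+m = length(y);                        % number of training examples
+h = theta0 + theta1 * x;              % hypothesis evaluated on all inputs
+J = (1 / (2*m)) * sum((h - y) .^ 2)   % squared-error cost; 0 only for a perfect fit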

+

+ ML:Gradient Descent +

+

+ So we have our hypothesis function and we have a way of measuring how well it fits into the data. Now we need to estimate the parameters in hypothesis function. That's where gradient descent comes in. +

+

+ Imagine that we graph our hypothesis function based on its fields $$\theta_0$$ and $$\theta_1$$ (actually we are graphing the cost function as a function of the parameter estimates). This can be kind of confusing; we are moving up to a higher level of abstraction. We are not graphing x and y themselves, but the parameter range of our hypothesis function and the cost resulting from selecting a particular set of parameters.

+

+ We put $$\theta_0$$ on the x axis and $$\theta_1$$ on the y axis, with the cost function on the vertical z axis. The points on our graph will be the result of the cost function using our hypothesis with those specific theta parameters. +

+

+ We will know that we have succeeded when our cost function is at the very bottom of the pits in our graph, i.e. when its value is the minimum. +

+

+ The way we do this is by taking the derivative (the slope of the tangent line to the function) of our cost function. The slope of the tangent is the derivative at that point and it will give us a direction to move towards. We make steps down the cost function in the direction of the steepest descent, and the size of each step is determined by the parameter α, which is called the learning rate.

+

+ The gradient descent algorithm is: +

+

+ repeat until convergence: +

+

+ $$\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)$$ +

+

+ where +

+

+ j=0,1 represents the feature index number. +

+

+ Intuitively, this could be thought of as: +

+

+ repeat until convergence: +

+

+ $$\theta_j := \theta_j - \alpha [\text{Slope of tangent aka derivative in j dimension}]$$

+

+ + Gradient Descent for Linear Regression + +

+

+ When specifically applied to the case of linear regression, a new form of the gradient descent equation can be derived. We can substitute our actual cost function and our actual hypothesis function and modify the equation to (the derivation of the formulas is out of the scope of this course, but a really great one can be found here):

+ + + + +
+

+ $$\begin{align*} + \text{repeat until convergence: } \lbrace & \newline + \theta_0 := & \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}(h_\theta(x_{i}) - y_{i}) \newline + \theta_1 := & \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m}\left((h_\theta(x_{i}) - y_{i}) x_{i}\right) \newline + \rbrace& + \end{align*}$$ +

+
+

+ where m is the size of the training set, $$\theta_0$$ is a constant that will be changing simultaneously with $$\theta_1$$, and $$x_{i}, y_{i}$$ are values of the given training set (data).

+

+ Note that we have separated out the two cases for $$\theta_j$$ into separate equations for $$\theta_0$$ and $$\theta_1$$; and that for $$\theta_1$$ we are multiplying $$x_{i}$$ at the end due to the derivative. +

+

+ The point of all this is that if we start with a guess for our hypothesis and then repeatedly apply these gradient descent equations, our hypothesis will become more and more accurate. +
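+ A minimal Octave sketch of these update rules (reusing x, y, m, theta0, and theta1 from the sketches above; alpha and num_iters are illustrative choices):
+
+alpha = 0.1;                          % learning rate
+num_iters = 1000;
+for iter = 1:num_iters
+  h = theta0 + theta1 * x;            % current predictions
+  temp0 = theta0 - alpha * (1/m) * sum(h - y);        % compute both updates first,
+  temp1 = theta1 - alpha * (1/m) * sum((h - y) .* x); % then assign simultaneously
+  theta0 = temp0;
+  theta1 = temp1;
+end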

+

+ + Gradient Descent for Linear Regression: visual worked example + +

+

+ Some may find the following video ( + + https://www.youtube.com/watch?v=WnqQrPNYz5Q + + ) useful as it visualizes the improvement of the hypothesis as the error function reduces. +

+

+ ML:Linear Algebra Review +

+

+ + + Khan Academy has excellent Linear Algebra Tutorials ( + + https://www.khanacademy.org/#linear-algebra + + ) +

+

+ Matrices and Vectors +

+

+ Matrices are 2-dimensional arrays: +

+ + + + +
+

+ $$\begin{bmatrix} a & b & c \newline d & e & f \newline g & h & i \newline j & k & l\end{bmatrix}$$ +

+
+

+

+

+ The above matrix has four rows and three columns, so it is a 4 x 3 matrix. +

+

+ A vector is a matrix with one column and many rows: +

+ + + + +
+

+ $$\begin{bmatrix} w \newline x \newline y \newline z \end{bmatrix}$$ +

+
+

+

+

+ So vectors are a subset of matrices. The above vector is a 4 x 1 matrix. +

+

+ + Notation and terms + + : +

+ +

+ Addition and Scalar Multiplication +

+

+ Addition and subtraction are + + element-wise + + , so you simply add or subtract each corresponding element: +

+ + + + +
+

+ $$\begin{bmatrix} a & b \newline c & d \newline \end{bmatrix} +\begin{bmatrix} w & x \newline y & z \newline \end{bmatrix} =\begin{bmatrix} a+w & b+x \newline c+y & d+z \newline \end{bmatrix}$$ +

+
+

+ To add or subtract two matrices, their dimensions must be + + the same + + . +

+

+ In scalar multiplication, we simply multiply every element by the scalar value: +

+ + + + +
+

+ $$\begin{bmatrix} a & b \newline c & d \newline \end{bmatrix} * x =\begin{bmatrix} a*x & b*x \newline c*x & d*x \newline \end{bmatrix}$$ +

+
+

+

+

+ Matrix-Vector Multiplication +

+

+ We map the column of the vector onto each row of the matrix, multiplying each element and summing the result. +

+ + + + +
+

+ $$\begin{bmatrix} a & b \newline c & d \newline e & f \end{bmatrix} *\begin{bmatrix} x \newline y \newline \end{bmatrix} =\begin{bmatrix} a*x + b*y \newline c*x + d*y \newline e*x + f*y\end{bmatrix}$$ +

+
+

+ The result is a + + vector + + . The vector must be the + + second + + term of the multiplication. The number of + + columns + + of the matrix must equal the number of + + rows + + of the vector. +

+

+ An + + m x n matrix + + multiplied by an + + n x 1 vector + + results in an + + m x 1 vector + + . +

+

+ Matrix-Matrix Multiplication +

+

+ We multiply two matrices by breaking the computation into several matrix-vector multiplications and concatenating the results:

+ + + + +
+

+ $$\begin{bmatrix} a & b \newline c & d \newline e & f \end{bmatrix} *\begin{bmatrix} w & x \newline y & z \newline \end{bmatrix} =\begin{bmatrix} a*w + b*y & a*x + b*z \newline c*w + d*y & c*x + d*z \newline e*w + f*y & e*x + f*z\end{bmatrix}$$ +

+
+

+ An + + m x n matrix + + multiplied by an + + n x o matrix + + results in an + + m x o + + matrix. In the above example, a 3 x 2 matrix times a 2 x 2 matrix resulted in a 3 x 2 matrix. +

+

+ To multiply two matrices, the number of + + columns + + of the first matrix must equal the number of + + rows + + of the second matrix. +
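+ These dimension rules are easy to check in Octave; a small illustration with arbitrary conforming matrices:
+
+A = [1 2; 3 4; 5 6];   % 3 x 2
+B = [7 8; 9 10];       % 2 x 2
+v = [1; 2];            % 2 x 1
+A * v                  % (3 x 2)(2 x 1) -> 3 x 1 vector
+A * B                  % (3 x 2)(2 x 2) -> 3 x 2 matrix
+% B * A would error: inner dimensions 2 and 3 do not match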

+

+ Matrix Multiplication Properties +

+ +

+ The + + identity matrix + + , when multiplied by any matrix of the same dimensions, results in the original matrix. It's just like multiplying numbers by 1. The identity matrix simply has 1's on the diagonal (upper left to lower right diagonal) and 0's elsewhere. +

+ + + + +
+

+ $$\begin{bmatrix} 1 & 0 & 0 \newline 0 & 1 & 0 \newline 0 & 0 & 1 \newline \end{bmatrix}$$ +

+
+

+ When multiplying the identity matrix after some matrix (A∗I), the square identity matrix should match the other matrix's + + columns + + . When multiplying the identity matrix before some other matrix (I∗A), the square identity matrix should match the other matrix's + + rows + + . +

+

+ Inverse and Transpose +

+

+ The + + inverse + + of a matrix A is denoted $$A^{-1}$$. Multiplying a matrix by its inverse results in the identity matrix.

+

+ A non-square matrix does not have an inverse. We can compute inverses of matrices in Octave with the pinv(A) function and in MATLAB with the inv(A) function. Matrices that don't have an inverse are + + singular + + or + + degenerate + + .

+

+ The + + transposition + + of a matrix is like rotating the matrix 90 + + ° + + clockwise and then reversing it. We can compute the transposition of matrices in MATLAB and Octave with the transpose(A) function or A':

+

+

+ + + + +
+

+ $$A = \begin{bmatrix} a & b \newline c & d \newline e & f \end{bmatrix}$$ +

+
+ + + + +
+

+ $$A^T = \begin{bmatrix} a & c & e \newline b & d & f \newline \end{bmatrix}$$ +

+
+

+ In other words: +

+

+ $$A_{ij} = A^T_{ji}$$ +

+
+ + + diff --git a/ML_Mathematical_Approach/04_week-2-lecture-notes/01__featuresX.dat b/ML_Mathematical_Approach/04_week-2-lecture-notes/01__featuresX.dat new file mode 100644 index 0000000..2cdd51f --- /dev/null +++ b/ML_Mathematical_Approach/04_week-2-lecture-notes/01__featuresX.dat @@ -0,0 +1,27 @@ +2104 3 +1600 3 +2400 3 +1416 2 +3000 4 +1985 4 +1534 3 +1427 3 +1380 3 +1494 3 +1940 4 +2000 3 +1890 3 +4478 5 +1268 3 +1437 3 +1239 3 +2132 4 +4215 4 +2162 4 +1664 2 +2238 3 +2567 4 +1200 3 +852 2 +1852 4 +1203 3 \ No newline at end of file diff --git a/ML_Mathematical_Approach/04_week-2-lecture-notes/01__priceY.dat b/ML_Mathematical_Approach/04_week-2-lecture-notes/01__priceY.dat new file mode 100644 index 0000000..3349235 --- /dev/null +++ b/ML_Mathematical_Approach/04_week-2-lecture-notes/01__priceY.dat @@ -0,0 +1,27 @@ +3999 +3299 +3690 +2320 +5399 +2999 +3149 +1989 +2120 +2425 +2399 +3470 +3299 +6999 +2599 +4499 +1509 +1667 +5948 +4718 +3932 +2011 +4538 +2251 +2617 +4084 +3523 \ No newline at end of file diff --git a/ML_Mathematical_Approach/04_week-2-lecture-notes/01__resources.html b/ML_Mathematical_Approach/04_week-2-lecture-notes/01__resources.html new file mode 100644 index 0000000..f30e235 --- /dev/null +++ b/ML_Mathematical_Approach/04_week-2-lecture-notes/01__resources.html @@ -0,0 +1,940 @@ + + +

+ ML:Linear Regression with Multiple Variables +

+

+ Linear regression with multiple variables is also known as "multivariate linear regression". +

+

+ We now introduce notation for equations where we can have any number of input variables. +

+ + + + +
+

+ $$\begin{align*}x_j^{(i)} &= \text{value of feature } j \text{ in the }i^{th}\text{ training example} \newline x^{(i)}& = \text{the column vector of all the feature inputs of the }i^{th}\text{ training example} \newline m &= \text{the number of training examples} \newline n &= \left| x^{(i)} \right| ; \text{(the number of features)} \end{align*}$$ +

+
+

+ Now define the multivariable form of the hypothesis function as follows, accommodating these multiple features: +

+

+ $$h_\theta (x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \cdots + \theta_n x_n$$ +

+

+ In order to develop intuition about this function, we can think about $$\theta_0$$ as the basic price of a house, $$\theta_1$$ as the price per square meter, $$\theta_2$$ as the price per floor, etc. $$x_1$$ will be the number of square meters in the house, $$x_2$$ the number of floors, etc. +

+

+ Using the definition of matrix multiplication, our multivariable hypothesis function can be concisely represented as: +

+ + + + +
+

+ $$\begin{align*}h_\theta(x) =\begin{bmatrix}\theta_0 \hspace{2em} \theta_1 \hspace{2em} ... \hspace{2em} \theta_n\end{bmatrix}\begin{bmatrix}x_0 \newline x_1 \newline \vdots \newline x_n\end{bmatrix}= \theta^T x\end{align*}$$ +

+
+

+ This is a vectorization of our hypothesis function for one training example; see the lessons on vectorization to learn more. +

+

+ Remark: Note that for convenience reasons in this course Prof. Ng assumes $$x_{0}^{(i)} =1 \text{ for } (i\in { 1,\dots, m } )$$

+

+ [ + + Note + + : So that we can do matrix operations with theta and x, we will set $$x^{(i)}_0$$ = 1, for all values of i. This makes the two vectors 'theta' and $$x^{(i)}$$ match each other element-wise (that is, have the same number of elements: n+1).]

+

+ The training examples are stored in X row-wise, like such: +

+ + + + +
+

+ $$\begin{align*}X = \begin{bmatrix}x^{(1)}_0 & x^{(1)}_1 \newline x^{(2)}_0 & x^{(2)}_1 \newline x^{(3)}_0 & x^{(3)}_1 \end{bmatrix}&,\theta = \begin{bmatrix}\theta_0 \newline \theta_1 \newline\end{bmatrix}\end{align*}$$ +

+
+

+ You can calculate the hypothesis as a column vector of size (m x 1) with: +

+

+ $$h_\theta(X) = X \theta$$ +

+

+ + For the rest of these notes, and other lecture notes, X will represent a matrix of training examples + + $$x^{(i)}$$ + + stored row-wise. +
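+ A small Octave sketch of this convention (the feature values are illustrative):
+
+X = [1 2104; 1 1416; 1 1534];   % m x (n+1) design matrix; first column is the ones feature
+theta = [100; 0.2];             % (n+1) x 1 parameter vector
+h = X * theta                   % m x 1 column vector of predictions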

+

+ + Cost function + +

+

+ For the parameter vector θ (of type $$\mathbb{R}^{n+1}$$ or $$\mathbb{R}^{(n+1) \times 1}$$), the cost function is:

+

+ $$J(\theta) = \dfrac {1}{2m} \displaystyle \sum_{i=1}^m \left (h_\theta (x^{(i)}) - y^{(i)} \right)^2$$ +

+

+ The vectorized version is: +

+

+ $$J(\theta) = \dfrac {1}{2m} (X\theta - \vec{y})^{T} (X\theta - \vec{y})$$ +

+

+ Where $$\vec{y}$$ denotes the vector of all y values. +
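+ In Octave the vectorized cost is essentially a transcription of this formula (a sketch, assuming X, theta, and y as above):
+
+m = length(y);
+J = (1 / (2*m)) * (X*theta - y)' * (X*theta - y)   % scalar cost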

+

+ + Gradient Descent for Multiple Variables + +

+

+ The gradient descent equation itself is generally the same form; we just have to repeat it for our 'n' features: +

+ + + + +
+

+ $$\begin{align*} +& \text{repeat until convergence:} \; \lbrace \newline +\; & \theta_0 := \theta_0 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_0^{(i)}\newline +\; & \theta_1 := \theta_1 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_1^{(i)} \newline +\; & \theta_2 := \theta_2 - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_2^{(i)} \newline +& \cdots +\newline \rbrace +\end{align*}$$ +

+
+

+ In other words: +

+ + + + +
+

+ $$\begin{align*}& \text{repeat until convergence:} \; \lbrace \newline \; & \theta_j := \theta_j - \alpha \frac{1}{m} \sum\limits_{i=1}^{m} (h_\theta(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)} \; & \text{for j := 0..n}\newline \rbrace\end{align*}$$ +

+
+

+ Matrix Notation +

+

+ The Gradient Descent rule can be expressed as: +

+

+ $$\theta := \theta - \alpha \nabla J(\theta)$$ +

+

+ Where $$\nabla J(\theta)$$ is a column vector of the form: +

+ + + + +
+

+ $$\nabla J(\theta) = \begin{bmatrix}\frac{\partial J(\theta)}{\partial \theta_0} \newline \frac{\partial J(\theta)}{\partial \theta_1} \newline \vdots \newline \frac{\partial J(\theta)}{\partial \theta_n} \end{bmatrix}$$ +

+
+

+ The j-th component of the gradient is the summation of the product of two terms: +

+ + + + +
+

+ $$\begin{align*} +\; &\frac{\partial J(\theta)}{\partial \theta_j} &=& \frac{1}{m} \sum\limits_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)} \right) \cdot x_j^{(i)} \newline +\; & &=& \frac{1}{m} \sum\limits_{i=1}^{m} x_j^{(i)} \cdot \left(h_\theta(x^{(i)}) - y^{(i)} \right) +\end{align*}$$ +

+
+

+ Sometimes, the summation of the product of two terms can be expressed as the product of two vectors. +

+

+ Here, $$x_j^{(i)}$$, for i = 1,...,m, represents the m elements of the j-th column, $$\vec{x_j}$$ , of the training set X. +

+

+ The other term $$\left(h_\theta(x^{(i)}) - y^{(i)} \right)$$ is the vector of the deviations between the predictions $$h_\theta(x^{(i)})$$ and the true values $$y^{(i)}$$. Re-writing $$\frac{\partial J(\theta)}{\partial \theta_j}$$, we have: +

+ + + + +
+

+ $$\begin{align*}\; &\frac{\partial J(\theta)}{\partial \theta_j} &=& \frac1m \vec{x_j}^{T} (X\theta - \vec{y}) \newline\newline\newline\; &\nabla J(\theta) & = & \frac 1m X^{T} (X\theta - \vec{y}) \newline\end{align*}$$ +

+
+

+ Finally, the matrix notation (vectorized) of the Gradient Descent rule is: +

+ + + + +
+

+ $$\theta := \theta - \frac{\alpha}{m} X^{T} (X\theta - \vec{y})$$ +
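+ This one line is the entire inner loop of gradient descent in Octave (a sketch, assuming X, y, theta, alpha, m, and num_iters are already defined):
+
+for iter = 1:num_iters
+  theta = theta - (alpha/m) * X' * (X*theta - y);  % updates every theta_j simultaneously
+end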

+
+

+ Feature Normalization +

+

+ We can speed up gradient descent by having each of our input values in roughly the same range. This is because θ will descend quickly on small ranges and slowly on large ranges, and so will oscillate inefficiently down to the optimum when the variables are very uneven. +

+

+ The way to prevent this is to modify the ranges of our input variables so that they are all roughly the same. Ideally: +

+

+ −1 ≤ $$x_i$$ ≤ 1

+

+ or +

+

+ −0.5 ≤ $$x_i$$ ≤ 0.5

+

+ These aren't exact requirements; we are only trying to speed things up. The goal is to get all input variables into roughly one of these ranges, give or take a few. +

+

+ Two techniques to help with this are + + feature scaling + + and + + mean normalization + + . Feature scaling involves dividing the input values by the range (i.e. the maximum value minus the minimum value) of the input variable, resulting in a new range of just 1. Mean normalization involves subtracting the average value for an input variable from the values for that input variable, resulting in a new average value for the input variable of just zero. To implement both of these techniques, adjust your input values as shown in this formula: +

+

+ $$x_i := \dfrac{x_i - \mu_i}{s_i}$$ +

+

+ Where $$μ_i$$ is the + + average + + of all the values for feature (i) and $$s_i$$ is the range of values (max - min), or $$s_i$$ is the standard deviation. +

+

+ Note that dividing by the range, or dividing by the standard deviation, give different results. The quizzes in this course use range - the programming exercises use standard deviation. +

+

+ Example: $$x_i$$ is housing prices with range of 100 to 2000, with a mean value of 1000. Then, $$x_i := \dfrac{price-1000}{1900}$$. +
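+ A sketch of both techniques in Octave, applied column-wise to a feature matrix (the function name featureNormalize is illustrative):
+
+function [X_norm, mu, s] = featureNormalize(X)
+  mu = mean(X);             % row vector of per-feature means
+  s = max(X) - min(X);      % per-feature range (use std(X) for the std-dev variant)
+  X_norm = (X - mu) ./ s;   % broadcasts mu and s across the rows of X
+end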

+

+ Quiz question #1 on Feature Normalization (Week 2, Linear Regression with Multiple Variables) +

+

+ Your answer should be rounded to exactly two decimal places. Use a '.' for the decimal point, not a ','. The tricky part of this question is figuring out which feature of which training example you are asked to normalize. Note that the mobile app doesn't allow entering a negative number (Jan 2016), so you will need to use a browser to submit this quiz if your solution requires a negative number. +

+

+ Gradient Descent Tips +

+

+ + Debugging gradient descent. + + Make a plot with + + number of iterations + + on the x-axis. Now plot the cost function, J(θ) over the number of iterations of gradient descent. If J(θ) ever increases, then you probably need to decrease α. +

+

+ + Automatic convergence test. + + Declare convergence if J(θ) decreases by less than E in one iteration, where E is some small value such as $$10^{-3}$$. However in practice it's difficult to choose this threshold value.

+

+ It has been proven that if the learning rate α is sufficiently small, then J(θ) will decrease on every iteration. If gradient descent is not converging, Andrew Ng recommends decreasing α by factors of 3.

+

+ Features and Polynomial Regression +

+

+ We can improve our features and the form of our hypothesis function in a couple different ways. +

+

+ We can + + combine + + multiple features into one. For example, we can combine $$x_1$$ and $$x_2$$ into a new feature $$x_3$$ by taking $$x_1$$⋅$$x_2$$. +

+

+ + Polynomial Regression + +

+

+ Our hypothesis function need not be linear (a straight line) if that does not fit the data well. +

+

+ We can + + change the behavior or curve + + of our hypothesis function by making it a quadratic, cubic or square root function (or any other form). +

+

+ For example, if our hypothesis function is $$h_\theta(x) = \theta_0 + \theta_1 x_1$$ then we can create additional features based on $$x_1$$, to get the quadratic function $$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2$$ or the cubic function $$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_1^2 + \theta_3 x_1^3$$ +

+

+ In the cubic version, we have created new features $$x_2$$ and $$x_3$$ where $$x_2 = x_1^2$$ and $$x_3 = x_1^3$$. +

+

+ To make it a square root function, we could do: $$h_\theta(x) = \theta_0 + \theta_1 x_1 + \theta_2 \sqrt{x_1}$$ +

+

+ Note that at 2:52 and through 6:22 in the "Features and Polynomial Regression" video, the curve that Prof Ng discusses about "doesn't ever come back down" is in reference to the hypothesis function that uses the sqrt() function (shown by the solid purple line), not the one that uses $$size^2$$ (shown with the dotted blue line). The quadratic form of the hypothesis function would have the shape shown with the blue dotted line if $$\theta_2$$ was negative. +

+

+ One important thing to keep in mind is, if you choose your features this way then feature scaling becomes very important. +

+

+ e.g. if $$x_1$$ has range 1 - 1000 then the range of $$x_1^2$$ becomes 1 - 1000000 and that of $$x_1^3$$ becomes 1 - 1000000000.

+

+ Normal Equation +

+

+ The "Normal Equation" is a method of finding the optimum theta + + without iteration. + +

+

+ $$\theta = (X^T X)^{-1}X^T y$$ +

+

+ There is + + no need + + to do feature scaling with the normal equation. +
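+ In Octave the normal equation is a single line (using pinv for numerical robustness, as recommended later in these notes):
+
+theta = pinv(X' * X) * X' * y;   % optimal theta computed directly, no alpha, no iteration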

+

+ Mathematical proof of the Normal equation requires knowledge of linear algebra and is fairly involved, so you do not need to worry about the details. +

+

+ Proofs are available at these links for those who are interested: +

+

+ + https://en.wikipedia.org/wiki/Linear_least_squares_(mathematics) + +

+

+ + http://eli.thegreenplace.net/2014/derivation-of-the-normal-equation-for-linear-regression + +

+

+ The following is a comparison of gradient descent and the normal equation: +

+ + + + + + + + + + + + + + + + + + + + + +
+

+ Gradient Descent +

+
+

+ Normal Equation +

+
+

+ Need to choose alpha +

+
+

+ No need to choose alpha +

+
+

+ Needs many iterations +

+
+

+ No need to iterate +

+
+

+ O ($$kn^2$$) +

+
+

+ O ($$n^3$$), need to calculate inverse of $$X^TX$$ +

+
+

+ Works well when n is large +

+
+

+ Slow if n is very large +

+
+

+ With the normal equation, computing the inversion has complexity $$\mathcal{O}(n^3)$$. So if we have a very large number of features, the normal equation will be slow. In practice, when n exceeds 10,000 it might be a good time to go from a normal solution to an iterative process. +

+

+ + Normal Equation Noninvertibility + +

+

+ When implementing the normal equation in Octave we want to use the 'pinv' function rather than 'inv'.

+

+ $$X^TX$$ may be + + noninvertible + + . The common causes are: +

+ Redundant features, where two features are very closely related (i.e. they are linearly dependent)
+ Too many features (e.g. m ≤ n)

+ Solutions to the above problems include deleting a feature that is linearly dependent with another or deleting one or more features when there are too many features. +

+

+ ML:Octave Tutorial +

+

+ Basic Operations +

+
%% Change Octave prompt  
+PS1('>> ');
+%% Change working directory in windows example:
+cd 'c:/path/to/desired/directory name'
+%% Note that it uses normal slashes and does not use escape characters for the empty spaces.
+
+%% elementary operations
+5+6
+3-2
+5*8
+1/2
+2^6
+1 == 2 % false
+1 ~= 2 % true.  note, not "!="
+1 && 0
+1 || 0
+xor(1,0)
+
+
+%% variable assignment
+a = 3; % semicolon suppresses output
+b = 'hi';
+c = 3>=1;
+
+% Displaying them:
+a = pi
+disp(a)
+disp(sprintf('2 decimals: %0.2f', a))
+disp(sprintf('6 decimals: %0.6f', a))
+format long
+a
+format short
+a
+
+
+%%  vectors and matrices
+A = [1 2; 3 4; 5 6]
+
+v = [1 2 3]
+v = [1; 2; 3]
+v = 1:0.1:2   % from 1 to 2, with stepsize of 0.1. Useful for plot axes
+v = 1:6       % from 1 to 6, assumes stepsize of 1 (row vector)
+
+C = 2*ones(2,3) % same as C = [2 2 2; 2 2 2]
+w = ones(1,3)   % 1x3 vector of ones
+w = zeros(1,3)
+w = rand(1,3) % drawn from a uniform distribution 
+w = randn(1,3)% drawn from a normal distribution (mean=0, var=1)
+w = -6 + sqrt(10)*(randn(1,10000));  % (mean = -6, var = 10) - note: add the semicolon
+hist(w)    % plot histogram using 10 bins (default)
+hist(w,50) % plot histogram using 50 bins
+% note: if hist() crashes, try "graphics_toolkit('gnuplot')"
+
+I = eye(4)   % 4x4 identity matrix
+
+% help function
+help eye
+help rand
+help help
+
+

+ Moving Data Around +

+

+ + Data files used in this section + + : + + featuresX.dat + + , + + priceY.dat + +

+
%% dimensions
+sz = size(A) % 1x2 matrix: [(number of rows) (number of columns)]
+size(A,1) % number of rows
+size(A,2) % number of cols
+length(v) % size of longest dimension
+
+
+%% loading data
+pwd   % show current directory (current path)
+cd 'C:\Users\ang\Octave files'  % change directory 
+ls    % list files in current directory 
+load q1y.dat   % alternatively, load('q1y.dat')
+load q1x.dat
+who   % list variables in workspace
+whos  % list variables in workspace (detailed view) 
+clear q1y      % clear command without any args clears all vars
+v = q1x(1:10); % first 10 elements of q1x (counts down the columns)
+save hello.mat v;  % save variable v into file hello.mat
+save hello.txt v -ascii; % save as ascii
+% fopen, fread, fprintf, fscanf also work  [[not needed in class]]
+
+%% indexing
+A(3,2)  % indexing is (row,col)
+A(2,:)  % get the 2nd row. 
+        % ":" means every element along that dimension
+A(:,2)  % get the 2nd col
+A([1 3],:) % print all  the elements of rows 1 and 3
+
+A(:,2) = [10; 11; 12]     % change second column
+A = [A, [100; 101; 102]]; % append column vec
+A(:) % Select all elements as a column vector.
+
+% Putting data together 
+A = [1 2; 3 4; 5 6]
+B = [11 12; 13 14; 15 16] % same dims as A
+C = [A B]  % concatenating A and B matrices side by side
+C = [A, B] % concatenating A and B matrices side by side
+C = [A; B] % Concatenating A and B top and bottom
+
+

+ Computing on Data +

+
%% initialize variables
+A = [1 2;3 4;5 6]
+B = [11 12;13 14;15 16]
+C = [1 1;2 2]
+v = [1;2;3]
+
+%% matrix operations
+A * C  % matrix multiplication
+A .* B % element-wise multiplication
+% A .* C  or A * B gives error - wrong dimensions
+A .^ 2 % element-wise square of each element in A
+1./v   % element-wise reciprocal
+log(v)  % functions like this operate element-wise on vecs or matrices 
+exp(v)
+abs(v)
+
+-v  % -1*v
+
+v + ones(length(v), 1)  
+% v + 1  % same
+
+A'  % matrix transpose
+
+%% misc useful functions
+
+% max  (or min)
+a = [1 15 2 0.5]
+val = max(a)
+[val,ind] = max(a) % val is the maximum element of the vector a; ind is the index where the maximum occurs
+val = max(A) % if A is matrix, returns max from each column
+
+% compare values in a matrix & find
+a < 3 % checks which values in a are less than 3
+find(a < 3) % gives location of elements less than 3
+A = magic(3) % generates a magic matrix - not much used in ML algorithms
+[r,c] = find(A>=7)  % row, column indices for values matching comparison
+
+% sum, prod
+sum(a)
+prod(a)
+floor(a) % or ceil(a)
+max(rand(3),rand(3))
+max(A,[],1) % maximum along columns (max(A) defaults to this)
+max(A,[],2) % maximum along rows
+A = magic(9)
+sum(A,1)
+sum(A,2)
+sum(sum( A .* eye(9) ))
+sum(sum( A .* flipud(eye(9)) ))
+
+
+% Matrix inverse (pseudo-inverse)
+pinv(A)        % inv(A'*A)*A'
+
+

+ Plotting Data +

+
%% plotting
+t = [0:0.01:0.98];
+y1 = sin(2*pi*4*t); 
+plot(t,y1);
+y2 = cos(2*pi*4*t);
+hold on;  % "hold off" to turn off
+plot(t,y2,'r');
+xlabel('time');
+ylabel('value');
+legend('sin','cos');
+title('my plot');
+print -dpng 'myPlot.png'
+close;           % or,  "close all" to close all figs
+figure(1); plot(t, y1);
+figure(2); plot(t, y2);
+figure(2), clf;  % can specify the figure number
+subplot(1,2,1);  % Divide plot into 1x2 grid, access 1st element
+plot(t,y1);
+subplot(1,2,2);  % Divide plot into 1x2 grid, access 2nd element
+plot(t,y2);
+axis([0.5 1 -1 1]);  % change axis scale
+
+%% display a matrix (or image) 
+figure;
+imagesc(magic(15)), colorbar, colormap gray;
+% comma-chaining function calls.  
+a=1,b=2,c=3
+a=1;b=2;c=3;
+
+

+ Control statements: for, while, if statements +

+
v = zeros(10,1);
+for i=1:10, 
+    v(i) = 2^i;
+end;
+% Can also use "break" and "continue" inside for and while loops to control execution.
+
+i = 1;
+while i <= 5,
+  v(i) = 100; 
+  i = i+1;
+end
+
+i = 1;
+while true, 
+  v(i) = 999; 
+  i = i+1;
+  if i == 6,
+    break;
+  end;
+end
+
+if v(1)==1,
+  disp('The value is one!');
+elseif v(1)==2,
+  disp('The value is two!');
+else
+  disp('The value is not one or two!');
+end
+
+

+ Functions +

+

+ To create a function, type the function code in a text editor (e.g. gedit or notepad), and save the file as "functionName.m" +

+

+ Example function: +

+
function y = squareThisNumber(x)
+
+y = x^2;
+
+

+ To call the function in Octave, do either: +

+

+ 1) Navigate to the directory of the functionName.m file and call the function: +

+
    % Navigate to directory:
+    cd /path/to/function
+
+    % Call the function:
+    functionName(args)
+
+

+ 2) Add the directory of the function to the load path and save it: + + You should not use addpath/savepath for any of the assignments in this course. Instead use 'cd' to change the current working directory. Watch the video on submitting assignments in week 2 for instructions. + +

+
    % To add the path for the current session of Octave:
+    addpath('/path/to/function/')
+
+    % To remember the path for future sessions of Octave, after executing addpath above, also do:
+    savepath
+
+

+ Octave's functions can return more than one value: +

+
    function [y1, y2] = squareandCubeThisNo(x)
+    y1 = x^2
+    y2 = x^3
+
+

+ Call the above function this way: +

+
    [a,b] = squareandCubeThisNo(x)
+
+

+ Vectorization +

+

+ Vectorization is the process of taking code that relies on + + loops + + and converting it into + + matrix operations + + . It is more efficient, more elegant, and more concise. +

+

+ As an example, let's compute our prediction from a hypothesis. Theta is the vector of fields for the hypothesis and x is a vector of variables. +

+

+ With loops: +

+
prediction = 0.0;
+for j = 1:n+1,
+  prediction += theta(j) * x(j);
+end;
+
+

+ With vectorization: +

+
prediction = theta' * x;
+
+

+ If you recall the definition of multiplying vectors, you'll see that this one operation does the element-wise multiplication and overall sum in a very concise notation.

+

+ Working on and Submitting Programming Exercises +

+
+ 1. Download and extract the assignment's zip file.
+ 2. Edit the proper file 'a.m', where a is the name of the exercise you're working on.
+ 3. Run Octave and cd to the assignment's extracted directory.
+ 4. Run the 'submit' function and enter the assignment number, your email, and a password (found on the top of the "Programming Exercises" page on coursera).
+

+ Video Lecture Table of Contents +

+

+ + Basic Operations + +

+
0:00    Introduction
+3:15    Elementary and Logical operations
+5:12    Variables
+7:38    Matrices
+8:30    Vectors
+11:53   Histograms
+12:44   Identity matrices
+13:14   Help command
+

+ + Moving Data Around + +

+
0:24    The size command
+1:39    The length command
+2:18    File system commands
+2:25    File handling
+4:50    Who, whos, and clear
+6:50    Saving data
+8:35    Manipulating data
+12:10   Unrolling a matrix
+12:35   Examples
+14:50   Summary
+

+ + Computing on Data + +

+
0:00    Matrix operations
+0:57    Element-wise operations
+4:28    Min and max
+5:10    Element-wise comparisons
+5:43    The find command
+6:00    Various commands and operations
+

+ + Plotting data + +

+
0:00    Introduction
+0:54    Basic plotting
+2:04    Superimposing plots and colors
+3:15    Saving a plot to an image
+4:19    Clearing a plot and multiple figures
+4:59    Subplots
+6:15    The axis command
+6:39    Color square plots
+8:35    Wrapping up
+

+ + Control statements + +

+
0:10    For loops
+1:33    While loops
+3:35    If statements
+4:54    Functions
+6:15    Search paths
+7:40    Multiple return values
+8:59    Cost function example (machine learning)
+12:24   Summary
+
+

+ + Vectorization + +

+
0:00    Why vectorize?
+1:30    Example
+4:22    C++ example
+5:40    Vectorization applied to gradient descent
+9:45    Python
+

+

+

+

+
+ + + diff --git a/ML_Mathematical_Approach/05_week-3-lecture-notes/01__resources.html b/ML_Mathematical_Approach/05_week-3-lecture-notes/01__resources.html new file mode 100644 index 0000000..272dd09 --- /dev/null +++ b/ML_Mathematical_Approach/05_week-3-lecture-notes/01__resources.html @@ -0,0 +1,714 @@ + + +

+ ML:Logistic Regression +

+

+ Now we are switching from regression problems to + + classification problems + + . Don't be confused by the name "Logistic Regression"; it is named that way for historical reasons and is actually an approach to classification problems, not regression problems. +

+

+ Binary Classification +

+

+ Instead of our output vector y being a continuous range of values, it will only be 0 or 1. +

+

+ y∈{0,1} +

+

+ Where 0 is usually taken as the "negative class" and 1 as the "positive class", but you are free to assign any representation to it. +

+

+ We're only doing two classes for now, called a "Binary Classification Problem." +

+

+ One method is to use linear regression and map all predictions greater than 0.5 as a 1 and all less than 0.5 as a 0. This method doesn't work well because classification is not actually a linear function. +

+

+ Hypothesis Representation +

+

+ Our hypothesis should satisfy: +

+

+ $$0 \leq h_\theta (x) \leq 1$$ +

+

+ Our new form uses the "Sigmoid Function," also called the "Logistic Function": +

+ + + + +
+

+ $$\begin{align*}& h_\theta (x) = g ( \theta^T x ) \newline \newline& z = \theta^T x \newline& g(z) = \dfrac{1}{1 + e^{-z}}\end{align*}$$ +

+
+ +

+ The function g(z), shown here, maps any real number to the (0, 1) interval, making it useful for transforming an arbitrary-valued function into a function better suited for classification. Try playing with an interactive plot of the sigmoid function: ( + + https://www.desmos.com/calculator/bgontvxotm + + ).
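+ A minimal Octave sketch of the sigmoid (element-wise, so it accepts scalars, vectors, or matrices):
+
+function g = sigmoid(z)
+  g = 1 ./ (1 + exp(-z));   % maps any real input into the open interval (0, 1)
+end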

+

+ We start with our old hypothesis (linear regression), except that we want to restrict the range to 0 and 1. This is accomplished by plugging $$\theta^Tx$$ into the Logistic Function. +

+

+ $$h_\theta$$ will give us the + + probability + + that our output is 1. For example, $$h_\theta(x)=0.7$$ gives us the probability of 70% that our output is 1. +

+ + + + +
+

+ $$\begin{align*}& h_\theta(x) = P(y=1 | x ; \theta) = 1 - P(y=0 | x ; \theta) \newline& P(y = 0 | x;\theta) + P(y = 1 | x ; \theta) = 1\end{align*}$$ +

+
+

+ Our probability that our prediction is 0 is just the complement of our probability that it is 1 (e.g. if probability that it is 1 is 70%, then the probability that it is 0 is 30%). +

+

+ Decision Boundary +

+

+ In order to get our discrete 0 or 1 classification, we can translate the output of the hypothesis function as follows: +

+ + + + +
+

+ $$\begin{align*}& h_\theta(x) \geq 0.5 \rightarrow y = 1 \newline& h_\theta(x) < 0.5 \rightarrow y = 0 \newline\end{align*}$$ +

+
+

+ The way our logistic function g behaves is that when its input is greater than or equal to zero, its output is greater than or equal to 0.5: +

+ + + + +
+

+ $$\begin{align*}& g(z) \geq 0.5 \newline& when \; z \geq 0\end{align*}$$ +

+
+

+ Remember:

+ + + + +
+

+ $$\begin{align*}z=0, e^{0}=1 \Rightarrow g(z)=1/2\newline z \to \infty, e^{-\infty} \to 0 \Rightarrow g(z)=1 \newline z \to -\infty, e^{\infty}\to \infty \Rightarrow g(z)=0 \end{align*}$$ +

+
+

+ So if our input to g is $$\theta^T X$$, then that means: +

+ + + + +
+

+ $$\begin{align*}& h_\theta(x) = g(\theta^T x) \geq 0.5 \newline& when \; \theta^T x \geq 0\end{align*}$$ +

+
+

+ From these statements we can now say: +

+ + + + +
+

+ $$\begin{align*}& \theta^T x \geq 0 \Rightarrow y = 1 \newline& \theta^T x < 0 \Rightarrow y = 0 \newline\end{align*}$$ +

+
+

+ The + + decision boundary + + is the line that separates the area where y = 0 and where y = 1. It is created by our hypothesis function. +

+

+ + Example + + : +

+ + + + +
+

+ $$\begin{align*}& \theta = \begin{bmatrix}5 \newline -1 \newline 0\end{bmatrix} \newline & y = 1 \; if \; 5 + (-1) x_1 + 0 x_2 \geq 0 \newline & 5 - x_1 \geq 0 \newline & - x_1 \geq -5 \newline& x_1 \leq 5 \newline \end{align*}$$ +

+
+

+ In this case, our decision boundary is a straight vertical line placed on the graph where $$x_1 = 5$$, and everything to the left of that denotes y = 1, while everything to the right denotes y = 0. +

+

+ Again, the input to the sigmoid function g(z) (e.g. $$\theta^T X$$) doesn't need to be linear, and could be a function that describes a circle (e.g. $$z = \theta_0 + \theta_1 x_1^2 +\theta_2 x_2^2$$) or any shape to fit our data. +

+

+ Cost Function +

+

+ We cannot use the same cost function that we use for linear regression because the Logistic Function will cause the output to be wavy, causing many local optima. In other words, it will not be a convex function. +

+

+ Instead, our cost function for logistic regression looks like: +

+ + + + +
+

+ $$\begin{align*}& J(\theta) = \dfrac{1}{m} \sum_{i=1}^m \mathrm{Cost}(h_\theta(x^{(i)}),y^{(i)}) \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(h_\theta(x)) \; & \text{if y = 1} \newline & \mathrm{Cost}(h_\theta(x),y) = -\log(1-h_\theta(x)) \; & \text{if y = 0}\end{align*}$$ +

+
+ +

+

+ +

+ The more our hypothesis is off from y, the larger the cost function output. If our hypothesis is equal to y, then our cost is 0: +

+ + + + +
+

+ $$\begin{align*}& \mathrm{Cost}(h_\theta(x),y) = 0 \text{ if } h_\theta(x) = y \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 0 \; \mathrm{and} \; h_\theta(x) \rightarrow 1 \newline & \mathrm{Cost}(h_\theta(x),y) \rightarrow \infty \text{ if } y = 1 \; \mathrm{and} \; h_\theta(x) \rightarrow 0 \newline \end{align*}$$ +

+
+

+ If our correct answer 'y' is 0, then the cost function will be 0 if our hypothesis function also outputs 0. If our hypothesis approaches 1, then the cost function will approach infinity. +

+

+ If our correct answer 'y' is 1, then the cost function will be 0 if our hypothesis function outputs 1. If our hypothesis approaches 0, then the cost function will approach infinity. +

+

+ Note that writing the cost function in this way guarantees that J(θ) is convex for logistic regression. +

+

+ Simplified Cost Function and Gradient Descent +

+

+ We can compress our cost function's two conditional cases into one case: +

+

+ $$\mathrm{Cost}(h_\theta(x),y) = - y \; \log(h_\theta(x)) - (1 - y) \log(1 - h_\theta(x))$$ +

+

+ Notice that when y is equal to 1, then the second term $$(1-y)\log(1-h_\theta(x))$$ will be zero and will not affect the result. If y is equal to 0, then the first term $$-y \log(h_\theta(x))$$ will be zero and will not affect the result. +

+

+ We can fully write out our entire cost function as follows: +

+ + + + +
+

+ $$J(\theta) = - \frac{1}{m} \displaystyle \sum_{i=1}^m [y^{(i)}\log (h_\theta (x^{(i)})) + (1 - y^{(i)})\log (1 - h_\theta(x^{(i)}))]$$ +

+
+

+ A vectorized implementation is: +

+ + + + +
+

+ $$\begin{align*} +& h = g(X\theta)\newline +& J(\theta) = \frac{1}{m} \cdot \left(-y^{T}\log(h)-(1-y)^{T}\log(1-h)\right) +\end{align*}$$ +

+
+

+ + Gradient Descent + +

+

+ Remember that the general form of gradient descent is: +

+ + + + +
+

+ $$\begin{align*}& Repeat \; \lbrace \newline & \; \theta_j := \theta_j - \alpha \dfrac{\partial}{\partial \theta_j}J(\theta) \newline & \rbrace\end{align*}$$ +

+
+

+ We can work out the derivative part using calculus to get: +

+ + + + +
+

+ $$\begin{align*} +& Repeat \; \lbrace \newline +& \; \theta_j := \theta_j - \frac{\alpha}{m} \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)}) x_j^{(i)} \newline & \rbrace +\end{align*}$$ +

+
+

+ Notice that this algorithm is identical to the one we used in linear regression. We still have to simultaneously update all values in theta. +

+

+ A vectorized implementation is: +

+

+ $$\theta := \theta - \frac{\alpha}{m} X^{T} (g(X \theta ) - \vec{y})$$ +

+

+ + Partial derivative of J(θ) + +

+

+ First we calculate the derivative of the sigmoid function (it will be useful when finding the partial derivative of J(θ)):

+ + + + +
+

+ $$\begin{align*}\sigma(x)'&=\left(\frac{1}{1+e^{-x}}\right)'=\frac{-(1+e^{-x})'}{(1+e^{-x})^2}=\frac{-1'-(e^{-x})'}{(1+e^{-x})^2}=\frac{0-(-x)'(e^{-x})}{(1+e^{-x})^2}=\frac{-(-1)(e^{-x})}{(1+e^{-x})^2}=\frac{e^{-x}}{(1+e^{-x})^2} \newline &=\left(\frac{1}{1+e^{-x}}\right)\left(\frac{e^{-x}}{1+e^{-x}}\right)=\sigma(x)\left(\frac{+1-1 + e^{-x}}{1+e^{-x}}\right)=\sigma(x)\left(\frac{1 + e^{-x}}{1+e^{-x}} - \frac{1}{1+e^{-x}}\right)=\sigma(x)(1 - \sigma(x))\end{align*}$$ +

+
+

+ Now we are ready to find the resulting partial derivative:

+ + + + +
+

+ $$\begin{align*}\frac{\partial}{\partial \theta_j} J(\theta) &= \frac{\partial}{\partial \theta_j} \frac{-1}{m}\sum_{i=1}^m \left [ y^{(i)} log (h_\theta(x^{(i)})) + (1-y^{(i)}) log (1 - h_\theta(x^{(i)})) \right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} \frac{\partial}{\partial \theta_j} log (h_\theta(x^{(i)})) + (1-y^{(i)}) \frac{\partial}{\partial \theta_j} log (1 - h_\theta(x^{(i)}))\right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} h_\theta(x^{(i)})}{h_\theta(x^{(i)})} + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - h_\theta(x^{(i)}))}{1 - h_\theta(x^{(i)})}\right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ \frac{y^{(i)} \frac{\partial}{\partial \theta_j} \sigma(\theta^T x^{(i)})}{h_\theta(x^{(i)})} + \frac{(1-y^{(i)})\frac{\partial}{\partial \theta_j} (1 - \sigma(\theta^T x^{(i)}))}{1 - h_\theta(x^{(i)})}\right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ \frac{y^{(i)} \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})} + \frac{- (1-y^{(i)}) \sigma(\theta^T x^{(i)}) (1 - \sigma(\theta^T x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})}\right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ \frac{y^{(i)} h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{h_\theta(x^{(i)})} - \frac{(1-y^{(i)}) h_\theta(x^{(i)}) (1 - h_\theta(x^{(i)})) \frac{\partial}{\partial \theta_j} \theta^T x^{(i)}}{1 - h_\theta(x^{(i)})}\right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} (1 - h_\theta(x^{(i)})) x^{(i)}_j - (1-y^{(i)}) h_\theta(x^{(i)}) x^{(i)}_j\right ] \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} (1 - h_\theta(x^{(i)})) - (1-y^{(i)}) h_\theta(x^{(i)}) \right ] x^{(i)}_j \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - y^{(i)} h_\theta(x^{(i)}) - h_\theta(x^{(i)}) + y^{(i)} h_\theta(x^{(i)}) \right ] x^{(i)}_j \newline&= - \frac{1}{m}\sum_{i=1}^m \left [ y^{(i)} - h_\theta(x^{(i)}) \right ] x^{(i)}_j \newline&= \frac{1}{m}\sum_{i=1}^m \left [ h_\theta(x^{(i)}) - y^{(i)} \right ] x^{(i)}_j\end{align*}$$ +

+
+

+ The vectorized version is:

+

+ $$\nabla J(\theta) = \frac{1}{m} \cdot X^T \cdot \left(g\left(X\cdot\theta\right) - \vec{y}\right)$$ +

+

+ Advanced Optimization +

+

+ "Conjugate gradient", "BFGS", and "L-BFGS" are more sophisticated, faster ways to optimize θ that can be used instead of gradient descent. A. Ng suggests not to write these more sophisticated algorithms yourself (unless you are an expert in numerical computing) but use the libraries instead, as they're already tested and highly optimized. Octave provides them. +

+

+ We first need to provide a function that evaluates the following two functions for a given input value θ: +

+

+ $$\begin{align*} & J(\theta) \newline & \dfrac{\partial}{\partial \theta_j}J(\theta)\end{align*}$$ +

+

+ We can write a single function that returns both of these: +

+
function [jVal, gradient] = costFunction(theta)
+  jVal = [...code to compute J(theta)...];
+  gradient = [...code to compute derivative of J(theta)...];
+end
+
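+ As an illustration, the two placeholders could be filled in with the vectorized formulas derived above (a sketch that assumes X, y, and the sigmoid function are in scope):
+
+function [jVal, gradient] = costFunction(theta)
+  m = length(y);
+  h = sigmoid(X * theta);                                  % m x 1 vector of predictions
+  jVal = (1/m) * (-y' * log(h) - (1 - y)' * log(1 - h));   % logistic cost J(theta)
+  gradient = (1/m) * X' * (h - y);                         % (n+1) x 1 gradient vector
+end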

+ Then we can use Octave's "fminunc()" optimization algorithm along with the "optimset()" function that creates an object containing the options we want to send to "fminunc()". (Note: the value for MaxIter should be an integer, not a character string - errata in the video at 7:30)

+
options = optimset('GradObj', 'on', 'MaxIter', 100);
+initialTheta = zeros(2,1);
+[optTheta, functionVal, exitFlag] = fminunc(@costFunction, initialTheta, options);
+
+

+ We give to the function "fminunc()" our cost function, our initial vector of theta values, and the "options" object that we created beforehand. +

+

+ Multiclass Classification: One-vs-all +

+

+ Now we will approach the classification of data into more than two categories. Instead of y = {0,1} we will expand our definition so that y = {0,1...n}. +

+

+ In this case we divide our problem into n+1 (+1 because the index starts at 0) binary classification problems; in each one, we predict the probability that 'y' is a member of one of our classes. +

+ + + + +
+

+ $$\begin{align*}& y \in \lbrace0, 1 ... n\rbrace \newline& h_\theta^{(0)}(x) = P(y = 0 | x ; \theta) \newline& h_\theta^{(1)}(x) = P(y = 1 | x ; \theta) \newline& \cdots \newline& h_\theta^{(n)}(x) = P(y = n | x ; \theta) \newline& \mathrm{prediction} = \max_i( h_\theta ^{(i)}(x) )\newline\end{align*}$$ +

+
+

+ We are basically choosing one class and then lumping all the others into a single second class. We do this repeatedly, applying binary logistic regression to each case, and then use the hypothesis that returned the highest value as our prediction. +
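+ A sketch of the prediction step in Octave (assuming all_theta holds one trained parameter row per class and sigmoid is defined as earlier):
+
+probs = sigmoid(X * all_theta');   % m x (n+1) times (n+1) x num_labels -> class probabilities
+[~, p] = max(probs, [], 2);        % for each example, pick the most probable class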

+

+ ML:Regularization +

+

+ + The Problem of Overfitting + +

+

+ Regularization is designed to address the problem of overfitting. +

+

+ High bias or underfitting is when the form of our hypothesis function h maps poorly to the trend of the data. It is usually caused by a function that is too simple or uses too few features. e.g. if we take $$h_\theta(x) = \theta_0 + \theta_1x_1 + \theta_2x_2$$ then we are making an initial assumption that a linear model will fit the training data well and will be able to generalize, but that may not be the case.

+

+ At the other extreme, overfitting or high variance is caused by a hypothesis function that fits the available data but does not generalize well to predict new data. It is usually caused by a complicated function that creates a lot of unnecessary curves and angles unrelated to the data. +

+

+ This terminology is applied to both linear and logistic regression. There are two main options to address the issue of overfitting: +

+

+ 1) Reduce the number of features: +

+

+ a) Manually select which features to keep. +

+

+ b) Use a model selection algorithm (studied later in the course). +

+

+ 2) Regularization +

+

+ Keep all the features, but reduce the magnitude of the parameters $$\theta_j$$.

+

+ Regularization works well when we have a lot of slightly useful features. +

+

+ Cost Function +

+

+ If we have overfitting from our hypothesis function, we can reduce the weight that some of the terms in our function carry by increasing their cost. +

+

+ Say we wanted to make the following function more quadratic: +

+

+ $$\theta_0 + \theta_1x + \theta_2x^2 + \theta_3x^3 + \theta_4x^4$$ +

+

+ We'll want to eliminate the influence of $$\theta_3x^3$$ and $$\theta_4x^4$$ . Without actually getting rid of these features or changing the form of our hypothesis, we can instead modify our + + cost function + + : +

+

+ $$min_\theta\ \dfrac{1}{2m}\sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + 1000\cdot\theta_3^2 + 1000\cdot\theta_4^2$$ +

+

+ We've added two extra terms at the end to inflate the cost of $$\theta_3$$ and $$\theta_4$$. Now, in order for the cost function to get close to zero, we will have to reduce the values of $$\theta_3$$ and $$\theta_4$$ to near zero. This will in turn greatly reduce the values of $$\theta_3x^3$$ and $$\theta_4x^4$$ in our hypothesis function. +

+

+ We could also regularize all of our theta parameters in a single summation: +

+ + + + +
+

+ $$min_\theta\ \dfrac{1}{2m}\ \left[ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})^2 + \lambda\ \sum_{j=1}^n \theta_j^2 \right]$$ +

+
+

+ The λ, or lambda, is the + + regularization parameter + + . It determines how much the costs of our theta parameters are inflated. You can visualize the effect of regularization in this interactive plot : + + https://www.desmos.com/calculator/1hexc8ntqp + +

+

+ Using the above cost function with the extra summation, we can smooth the output of our hypothesis function to reduce overfitting. If lambda is chosen to be too large, it may smooth out the function too much and cause underfitting. +

+

+ Regularized Linear Regression +

+

+ We can apply regularization to both linear regression and logistic regression. We will approach linear regression first. +

+

+ Gradient Descent +

+

+ We will modify our gradient descent function to separate out $$\theta_0$$ from the rest of the parameters because we do not want to penalize $$\theta_0$$. +

+ + + + +
+

+ $$\begin{align*} +& \text{Repeat}\ \lbrace \newline +& \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \newline +& \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\newline +& \rbrace +\end{align*}$$ +

+
+

+ The term $$\frac{\lambda}{m}\theta_j$$ performs our regularization. +

+

+ With some manipulation our update rule can also be represented as: +

+

+ $$\theta_j := \theta_j(1 - \alpha\frac{\lambda}{m}) - \alpha\frac{1}{m}\sum_{i=1}^m(h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)}$$ +

+

+ The first term in the above equation, $$1 - \alpha\frac{\lambda}{m}$$ will always be less than 1. Intuitively you can see it as reducing the value of $$\theta_j$$ by some amount on every update. +

+

+ Notice that the second term is now exactly the same as it was before. +
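A sketch of one such update step in numpy, assuming the same conventions as above (names are illustrative):

```python
import numpy as np

def gradient_descent_step(theta, X, y, alpha, lam):
    """One regularized gradient descent update.

    theta[0] is updated without the (lambda/m) * theta_j term,
    matching the separate update rule for theta_0 above.
    """
    m = len(y)
    grad = (X.T @ (X @ theta - y)) / m            # unregularized gradient
    reg = (lam / m) * theta
    reg[0] = 0.0                                  # do not penalize theta_0
    return theta - alpha * (grad + reg)
```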

+

+ + Normal Equation + +

+

+ Now let's approach regularization using the alternate method of the non-iterative normal equation. +

+

+ To add in regularization, the equation is the same as our original, except that we add another term inside the parentheses: +

+ + + + +
+

+ $$\begin{align*}& \theta = \left( X^TX + \lambda \cdot L \right)^{-1} X^Ty \newline& \text{where}\ \ L = \begin{bmatrix} 0 & & & & \newline & 1 & & & \newline & & 1 & & \newline & & & \ddots & \newline & & & & 1 \newline\end{bmatrix}\end{align*}$$ +

+
+

+ L is a matrix with 0 at the top left and 1's down the diagonal, with 0's everywhere else. It should have dimension (n+1)×(n+1). Intuitively, this is the identity matrix (though we are not including $$x_0$$), multiplied with a single real number λ. +

+

+ Recall that if m ≤ n, then $$X^TX$$ is non-invertible. However, when we add the term $$\lambda \cdot L$$, then $$X^TX + \lambda \cdot L$$ becomes invertible.
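A numpy sketch of this closed-form solution (using a linear solve rather than an explicit matrix inverse, which is numerically preferable; names are our own):

```python
import numpy as np

def normal_equation_regularized(X, y, lam):
    """theta = (X'X + lambda * L)^(-1) X'y, where L is the
    (n+1)x(n+1) identity matrix with its top-left entry zeroed."""
    L = np.eye(X.shape[1])
    L[0, 0] = 0.0                  # do not regularize the bias term
    return np.linalg.solve(X.T @ X + lam * L, X.T @ y)
```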

+

+ Regularized Logistic Regression +

+

+ We can regularize logistic regression in a similar way that we regularize linear regression. Let's start with the cost function. +

+

+ Cost Function +

+

+ Recall that our cost function for logistic regression was: +

+

+ $$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)})) \large]$$ +

+

+ We can regularize this equation by adding a term to the end: +

+ + + + +
+

+ $$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$ +

+
+

+ Note Well: The second sum, $$\sum_{j=1}^n \theta_j^2$$, explicitly excludes the bias term $$\theta_0$$. That is, the θ vector is indexed from 0 to n (holding n+1 values, $$\theta_0$$ through $$\theta_n$$), and the regularization sum skips $$\theta_0$$ by running only from 1 to n.
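A numpy sketch of this regularized logistic cost, again skipping $$\theta_0$$ in the penalty (names illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_cost_regularized(theta, X, y, lam):
    """Regularized logistic regression cost; theta[0] is not penalized."""
    m = len(y)
    h = sigmoid(X @ theta)
    cross_entropy = -(y @ np.log(h) + (1 - y) @ np.log(1 - h)) / m
    penalty = (lam / (2 * m)) * np.sum(theta[1:] ** 2)
    return cross_entropy + penalty
```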

+

+ Gradient Descent +

+

+ Just like with linear regression, we will want to + + separately + + update $$\theta_0$$ and the rest of the parameters because we do not want to regularize $$\theta_0$$. +

+ + + + +
+

+ $$\begin{align*}& \text{Repeat}\ \lbrace \newline& \ \ \ \ \theta_0 := \theta_0 - \alpha\ \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_0^{(i)} \newline& \ \ \ \ \theta_j := \theta_j - \alpha\ \left[ \left( \frac{1}{m}\ \sum_{i=1}^m (h_\theta(x^{(i)}) - y^{(i)})x_j^{(i)} \right) + \frac{\lambda}{m}\theta_j \right] &\ \ \ \ \ \ \ \ \ \ j \in \lbrace 1,2...n\rbrace\newline& \rbrace\end{align*}$$ +

+
+

+ This has the same form as the gradient descent rule for linear regression; the difference is that $$h_\theta(x)$$ is now the sigmoid function rather than a linear one.

+

+ Initial Ones Feature Vector +

+

+ Constant Feature +

+

+ As it turns out, it is crucial to add a constant feature to your pool of features before training. Normally that feature is a column of ones, one entry per training example.

+

+ Concretely, if X is your feature matrix then $$X_0$$ is a vector of ones.

+

+ Below are some insights to explain the reason for this constant feature. The first part draws some analogies from electrical engineering concepts; the second builds intuition for the ones vector using a simple machine learning example.

+

+ Electrical Engineering +

+

+ From electrical engineering, in particular signal processing, this can be explained as DC and AC. +

+

+ The initial feature vector X without the constant term captures the dynamics of your model. Those features record changes in your output y; in other words, changing some feature $$X_i$$ where $$i \neq 0$$ produces a change in the output y. AC is normally made up of many components or harmonics; hence we also have many features (yet only one DC term).

+

+ The constant feature represents the DC component. In control engineering this can also be the steady state. +

+

+ Interestingly, removing the DC term is easily done by differentiating your signal, or simply taking the difference between consecutive points of a discrete signal. (Note that at this point the analogy assumes time-based signals, so it also applies to machine learning problems with a time basis, e.g. forecasting stock exchange trends.)

+

+ Another interesting note: if you were to play an AC+DC signal as well as an AC-only signal where both AC components are the same, they would sound exactly the same. That is because we only hear changes in signals, and Δ(AC+DC) = Δ(AC).

+

+ Housing price example +

+

+ Suppose you design a machine which predicts the price of a house based on some features. In this case what does the ones vector help with? +

+

+ Let's assume a simple model whose features are directly proportional to the expected price, i.e. if feature $$X_i$$ increases then the expected price y also increases. As an example we could have two features: the size of the house in m², and the number of rooms.

+

+ When you train your machine you will start by prepending a ones vector $$X_0$$. After training, you may find that the weight for your initial feature of ones is some value $$\theta_0$$. As it turns out, when applying your hypothesis function $$h_{\theta}(X)$$, the initial feature simply contributes a constant (most probably $$\theta_0$$, if you are not applying any other functions such as sigmoids). This constant is the DC term: a constant that doesn't change.

+

+ But what does it mean for this example? Well, suppose someone knows that you have a working model for housing prices. If they ask how much money they can expect from selling a house, you can say that they can expect at least $$\theta_0$$ dollars (or rands) before you even use your learning machine. As with the above analogy, your constant $$\theta_0$$ is something of a steady state where all your inputs are zero. Concretely, it is the price of a house with no rooms which takes up no space.

+

+ However, this explanation has some holes: if you have features which decrease the price, e.g. age, then the DC term may not be an absolute minimum of the price, because age may push the price even lower.

+

+ Theoretically, if you were to train a machine without a ones vector, $$f_{AC}(X)$$, its output may not match the output of a machine which had a ones vector, $$f_{DC}(X)$$. However, $$f_{AC}(X)$$ may have exactly the same trend as $$f_{DC}(X)$$, i.e. if you were to plot both machines' outputs you would find that they look exactly the same except that one appears shifted by a constant. With reference to the housing price problem: suppose you make predictions on two houses, $$house_A$$ and $$house_B$$, using both machines. While the outputs from the two machines would differ, the difference between $$house_A$$'s and $$house_B$$'s predictions according to both machines could be exactly the same. Realistically, that means a machine trained without the ones vector, $$f_{AC}$$, could still be very useful if you have just one benchmark point: you can find the missing constant by taking the difference between the machine's prediction and the actual price, and then simply add that constant to whatever output you get when making predictions. That is: if $$house_{benchmark}$$ is your benchmark, then the DC component is simply $$price(house_{benchmark}) - f_{AC}(features(house_{benchmark}))$$

+

+ A simpler and cruder way of putting it is that the DC component of your model represents the inherent bias of the model. The other features then pull the prediction away from that bias position.

+

+ Kholofelo Moyaba +

+

+ A simpler approach +

+

+ A "bias" feature is simply a way to move the "best fit" learned vector to better fit the data. For example, consider a learning problem with a single feature $$X_1$$. The formula without the $$X_0$$ feature is just $$theta_1 * X_1 = y$$. This is graphed as a line that always passes through the origin, with slope y/theta. The $$x_0$$ term allows the line to pass through a different point on the y axis. This will almost always give a better fit. Not all best fit lines go through the origin (0,0) right? +

+

+ Joe Cotton +

+

+

+
+ + + diff --git a/ML_Mathematical_Approach/06_week-4-lecture-notes/01__combinations-permutations.html b/ML_Mathematical_Approach/06_week-4-lecture-notes/01__combinations-permutations.html new file mode 100644 index 0000000..cf6ae3d --- /dev/null +++ b/ML_Mathematical_Approach/06_week-4-lecture-notes/01__combinations-permutations.html @@ -0,0 +1,460 @@ + + + + +Combinations and Permutations + + + + + + + + + + +

Combinations and Permutations

+

What's the Difference?

+

In English we use the word "combination" loosely, without thinking if the order of things is important. In other words:

In speech: "My fruit salad is a combination of apples, grapes and bananas." We don't care what order the fruits are in; they could also be "bananas, grapes and apples" or "grapes, apples and bananas"; it's the same fruit salad.

In speech: "The combination to the safe is 472." Now we do care about the order. "724" won't work, nor will "247". It has to be exactly 4-7-2.

So, in Mathematics we use more precise language:

• When the order doesn't matter, it is a Combination.
• When the order does matter, it is a Permutation.
So, we should really call this a "Permutation Lock"!

In other words:

+

A Permutation is an ordered Combination.

+
+
To help you remember, think "Permutation ... Position".
+

Permutations

+

There are basically two types of permutation:

+
1. Repetition is Allowed: such as the lock above. It could be "333".
2. No Repetition: for example the first three people in a running race. You can't be first and second.

 

+

1. Permutations with Repetition

+

These are the easiest to calculate.

+

When a thing has n different types ... we have n choices each time!

+

For example: choosing 3 of those things, the permutations are:

+

n × n × n
+
(n multiplied 3 times)

+

More generally: choosing r of something that has n different types, the permutations are:

+

n × n × ... (r times)

+

(In other words, there are n possibilities for the first choice, THEN there are n possibilities for the second choice, and so on, multiplying each time.)

+

Which is easier to write down using an exponent of r:

+

n × n × ... (r times) = n^r

+
+

Example: in the lock above, there are 10 numbers to choose from (0,1,2,3,4,5,6,7,8,9) and we choose 3 of them:

+

10 × 10 × ... (3 times) = 10^3 = 1,000 permutations

+
+

So, the formula is simply:

+
n^r

where n is the number of things to choose from, and we choose r of them
(Repetition allowed, order matters)
+

 

+

2. Permutations without Repetition

+

In this case, we have to reduce the number of available choices each time.

+


+

For example, what order could 16 pool balls be in?

+

After choosing, say, number "14" we can't choose it again.

+

So, our first choice has 16 possibilities, and our next choice has 15 possibilities, then 14, 13, etc. And the total permutations are:

+

16 × 15 × 14 × 13 × ... = 20,922,789,888,000

+

But maybe we don't want to choose them all, just 3 of them, so that is only:

+

16 × 15 × 14 = 3,360

+

In other words, there are 3,360 different ways that 3 pool balls could be arranged out of 16 balls.

+
+

Without repetition our choices get reduced each time.

+
+

But how do we write that mathematically? Answer: we use the "factorial function"

+
The factorial function (symbol: !) just means to multiply a series of descending natural numbers. Examples:

• 4! = 4 × 3 × 2 × 1 = 24
• 7! = 7 × 6 × 5 × 4 × 3 × 2 × 1 = 5,040
• 1! = 1

Note: it is generally agreed that 0! = 1. It may seem funny that multiplying no numbers together gets us 1, but it helps simplify a lot of equations.
+

So, when we want to select all of the billiard balls the permutations are:

+

16! = 20,922,789,888,000

+

But when we want to select just 3 we don't want to multiply after 14. How do we do that? There is a neat trick: we divide by 13!

(16 × 15 × 14 × 13 × 12 ...) / (13 × 12 ...) = 16 × 15 × 14 = 3,360

Do you see? 16! / 13! = 16 × 15 × 14

+

The formula is written:

+
n! / (n − r)!

where n is the number of things to choose from, and we choose r of them
(No repetition, order matters)
+
+

Example: our "order 3 out of 16 pool balls" example is:

16! / (16 − 3)! = 16! / 13! = 20,922,789,888,000 / 6,227,020,800 = 3,360

(which is just the same as: 16 × 15 × 14 = 3,360)

+
+
+

Example: How many ways can first and second place be awarded to 10 people?

10! / (10 − 2)! = 10! / 8! = 3,628,800 / 40,320 = 90

(which is just the same as: 10 × 9 = 90)

+
+

Notation

+

Instead of writing the whole formula, people use different notations such as these:

+

P(n,r) = nPr = n! / (n − r)!

+
+

Example: P(10,2) = 90
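These formulas are easy to check in a few lines of Python (math.perm requires Python 3.8 or later):

```python
from math import factorial, perm

# Permutations with repetition: n^r
assert 10 ** 3 == 1000                   # the 3-digit lock, 10 symbols

# Permutations without repetition: n! / (n - r)!
def permutations(n, r):
    return factorial(n) // factorial(n - r)

assert permutations(16, 3) == 3360       # 3 of 16 pool balls, in order
assert permutations(10, 2) == 90         # first and second of 10 people
assert perm(10, 2) == 90                 # the built-in gives the same
```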

+
+

Combinations

+

There are also two types of combinations (remember the order does not matter now):

+
1. Repetition is Allowed: such as coins in your pocket (5,5,5,10,10)
2. No Repetition: such as lottery numbers (2,14,15,27,30,33)

 

+

1. Combinations with Repetition

+

Actually, these are the hardest to explain, so we will come back to this later.

+

2. Combinations without Repetition

+

This is how lotteries work. The numbers are drawn one at a time, and if we have the lucky numbers (no matter what order) we win!

+

The easiest way to explain it is to:

+
• assume that the order does matter (i.e. permutations),
• then alter it so the order does not matter.

Going back to our pool ball example, let's say we just want to know which 3 pool balls are chosen, not the order.

+

We already know that 3 out of 16 gave us 3,360 permutations.

+

But many of those are the same to us now, because we don't care what order!

+

For example, let us say balls 1, 2 and 3 are chosen. These are the possibilities:

Order does matter: 1 2 3, 1 3 2, 2 1 3, 2 3 1, 3 1 2, 3 2 1
Order doesn't matter: 1 2 3

So, the permutations will have 6 times as many possibilities.

+

In fact there is an easy way to work out how many ways "1 2 3" could be placed in order, and we have already talked about it. The answer is:

+

3! = 3 × 2 × 1 = 6

+

(Another example: 4 things can be placed in 4! = 4 × 3 × 2 × 1 = 24 different ways, try it for yourself!)

+

So we adjust our permutations formula to reduce it by how many ways the objects could be in order (because we aren't interested in their order any more):

+

n!/(n − r)! × 1/r! = n! / (r!(n − r)!)

+

That formula is so important it is often just written in big parentheses like this:

+
$$\frac{n!}{r!(n-r)!} = \binom{n}{r}$$

where n is the number of things to choose from, and we choose r of them
(No repetition, order doesn't matter)
+

It is often called "n choose r" (such as "16 choose 3")

+

And is also known as the Binomial Coefficient.

+
+

Notation

+

As well as the "big parentheses", people also use these notations:

+

C(n,r) = nCr = $$\binom{n}{r}$$ = n! / (r!(n − r)!)

+

 

+

Just remember the formula:

+

n! / (r!(n − r)!)

+

Example

+ +

So, our pool ball example (now without order) is:

16! / (3! × (16 − 3)!) = 16! / (3! × 13!) = 20,922,789,888,000 / (6 × 6,227,020,800) = 560

Or we could do it this way:

(16 × 15 × 14) / (3 × 2 × 1) = 3360 / 6 = 560
+
+

 

+

It is interesting to also note how this formula is nice and symmetrical:

+

$$\frac{n!}{r!(n-r)!} = \binom{n}{r} = \binom{n}{n-r}$$

+

In other words choosing 3 balls out of 16, or choosing 13 balls out of 16 have the same number of combinations.

16! / (3! × (16 − 3)!) = 16! / (13! × (16 − 13)!) = 16! / (3! × 13!) = 560
+
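Again this is easy to verify in Python (math.comb requires Python 3.8 or later):

```python
from math import comb, factorial

# Combinations without repetition: n! / (r! (n - r)!)
def combinations(n, r):
    return factorial(n) // (factorial(r) * factorial(n - r))

assert combinations(16, 3) == 560
assert comb(16, 3) == 560            # the built-in "16 choose 3"
assert comb(16, 13) == 560           # the symmetry C(n, r) == C(n, n-r)
```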

Pascal's Triangle

+

We can also use Pascal's Triangle to find the values. Go down to row "n" (the top row is 0), and then along "r" places and the value there is our answer. Here is an extract showing row 16:

+
+
1    14    91    364  ...
+1 15 105 455 1365 ...
+1 16 120 560 1820 4368 ...
+
+

 

+

1. Combinations with Repetition

+ +

OK, now we can tackle this one ...

+


+

Let us say there are five flavors of ice cream: banana, chocolate, lemon, strawberry and vanilla.

+

We can have three scoops. How many variations will there be?

+

Let's use letters for the flavors: {b, c, l, s, v}. Example selections include

+
• {c, c, c} (3 scoops of chocolate)
• {b, l, v} (one each of banana, lemon and vanilla)
• {b, v, v} (one of banana, two of vanilla)
+ +

(And just to be clear: There are n=5 things to choose from, and we choose r=3 of them.
+ Order does not matter, and we can repeat!)

+

Now, I can't describe directly to you how to calculate this, but I can show you a special technique that lets you work it out.

+

[Image: five boxes in a row, one per flavor: b, c, l, s, v]

+

Think about the ice cream being in boxes; we could say "move past the first box, then take 3 scoops, then move along 3 more boxes to the end" and we will have 3 scoops of chocolate!

+

So it is like we are ordering a robot to get our ice cream, but it doesn't change anything, we still get what we want.

+

We can write this down as →○○○→→→ (arrow means move, circle means scoop).

+

In fact the three examples above can be written like this:

{c, c, c} (3 scoops of chocolate): →○○○→→→
{b, l, v} (one each of banana, lemon and vanilla): ○→→○→→○
{b, v, v} (one of banana, two of vanilla): ○→→→→○○

OK, so instead of worrying about different flavors, we have a simpler question: "how many different ways can we arrange arrows and circles?"

+

Notice that there are always 3 circles (3 scoops of ice cream) and 4 arrows (we need to move 4 times to go from the 1st to 5th container).

+

So (being general here) there are r + (n−1) positions, and we want to choose r of them to have circles.

+

This is like saying "we have r + (n−1) pool balls and want to choose r of them". In other words it is now like the pool balls question, but with slightly changed numbers. And we can write it like this:

+
$$\binom{r+n-1}{r} = \frac{(r+n-1)!}{r!(n-1)!}$$

where n is the number of things to choose from, and we choose r of them
(Repetition allowed, order doesn't matter)
+

Interestingly, we can look at the arrows instead of the circles, and say "we have r + (n−1) positions and want to choose (n−1) of them to have arrows", and the answer is the same:

+

$$\binom{r+n-1}{r} = \binom{r+n-1}{n-1} = \frac{(r+n-1)!}{r!(n-1)!}$$

+

So, what about our example, what is the answer?

(3 + 5 − 1)! / (3! × (5 − 1)!) = 7! / (3! × 4!) = 5040 / (6 × 24) = 35

There are 35 ways of having 3 scoops from five flavors of ice cream.
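And checking the circles-and-arrows argument in Python:

```python
from math import comb

# Combinations with repetition: C(r + n - 1, r)
def combinations_with_repetition(n, r):
    return comb(r + n - 1, r)

assert combinations_with_repetition(5, 3) == 35   # 3 scoops, 5 flavors
assert comb(3 + 5 - 1, 5 - 1) == 35               # same, counting the arrows
```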

+

In Conclusion

+

Phew, that was a lot to absorb, so maybe you could read it again to be sure!

+

But knowing how these formulas work is only half the battle. Figuring out how to interpret a real world situation can be quite hard.

+

But at least now you know how to calculate all 4 variations of "Order does/does not matter" and "Repeats are/are not allowed".

+

 

+
+ Copyright © 2014 MathsIsFun.com
+ + + diff --git a/ML_Mathematical_Approach/06_week-4-lecture-notes/01__resources.html b/ML_Mathematical_Approach/06_week-4-lecture-notes/01__resources.html new file mode 100644 index 0000000..a08fa9d --- /dev/null +++ b/ML_Mathematical_Approach/06_week-4-lecture-notes/01__resources.html @@ -0,0 +1,478 @@ + + +

+ ML:Neural Networks: Representation +

+

+ Non-linear Hypotheses +

+

+ Performing linear regression with a complex set of data with many features is very unwieldy. Say you wanted to create a hypothesis from three (3) features that included all the quadratic terms: +

+ + + + +
+

+ $$\begin{align*}& g(\theta_0 + \theta_1x_1^2 + \theta_2x_1x_2 + \theta_3x_1x_3 \newline& + \theta_4x_2^2 + \theta_5x_2x_3 \newline& + \theta_6x_3^2 )\end{align*}$$ +

+
+

+ That gives us 6 features. The exact way to calculate how many features for all polynomial terms is the combination function with repetition: + + http://www.mathsisfun.com/combinatorics/combinations-permutations.html + + $$\frac{(n+r-1)!}{r!(n-1)!}$$. In this case we are taking all two-element combinations of three features: $$\frac{(3 + 2 - 1)!}{(2!\cdot (3-1)!)}$$ = $$\frac{4!}{4} = 6$$. ( + + Note + + : you do not have to know these formulas, I just found it helpful for understanding). +

+

+ For 100 features, if we wanted to make them quadratic we would get $$\frac{(100 + 2 - 1)!}{(2!\cdot (100-1)!)} = 5050$$ resulting new features.

+

+ We can approximate the growth of the number of new features we get with all quadratic terms with $$\mathcal{O}(n^2/2)$$. And if you wanted to include all cubic terms in your hypothesis, the features would grow asymptotically at $$\mathcal{O}(n^3)$$. These are very steep growths, so as the number of our features increase, the number of quadratic or cubic features increase very rapidly and becomes quickly impractical. +

+

+ Example: let our training set be a collection of 50 x 50 pixel black-and-white photographs, and our goal will be to classify which ones are photos of cars. Our feature set size is then n = 2500 (one feature per pixel intensity).

+

+ Now let's say we need to make a quadratic hypothesis function. With quadratic features, our growth is $$\mathcal{O}(n^2/2)$$. So our total features will be about $$2500^2 / 2 = 3125000$$, which is very impractical. +

+

+ Neural networks offer an alternative way to perform machine learning when we have complex hypotheses with many features.

+

+ Neurons and the Brain +

+

+ Neural networks are limited imitations of how our own brains work. They've had a big recent resurgence because of advances in computer hardware. +

+

+ There is evidence that the brain uses only one "learning algorithm" for all its different functions. Scientists have tried cutting (in an animal brain) the connection between the ears and the auditory cortex and rewiring the optic nerve to the auditory cortex, finding that the auditory cortex literally learns to see.

+

+ This principle is called "neuroplasticity" and has many examples and experimental evidence. +

+

+ Model Representation I +

+

+ Let's examine how we will represent a hypothesis function using neural networks. +

+

+ At a very simple level, neurons are basically computational units that take input ( + + dendrites + + ) as electrical input (called "spikes") that are channeled to outputs ( + + axons + + ). +

+

+ In our model, our dendrites are like the input features $$x_1\cdots x_n$$, and the output is the result of our hypothesis function: +

+

+ In this model our $$x_0$$ input node is sometimes called the "bias unit." It is always equal to 1.

+

+ In neural networks, we use the same logistic function as in classification: $$\frac{1}{1 + e^{-\theta^Tx}}$$. In neural networks however we sometimes call it a sigmoid (logistic) + + activation + + function. +

+

+ Our "theta" parameters are sometimes instead called "weights" in the neural networks model. +

+

+ Visually, a simplistic representation looks like: +

+ + + + +
+

+ $$\begin{bmatrix}x_0 \newline x_1 \newline x_2 \newline \end{bmatrix}\rightarrow\begin{bmatrix}\ \ \ \newline \end{bmatrix}\rightarrow h_\theta(x)$$ +

+
+

+ Our input nodes (layer 1) go into another node (layer 2), and are output as the hypothesis function. +

+

+ The first layer is called the "input layer" and the final layer the "output layer," which gives the final value computed on the hypothesis. +

+

+ We can have intermediate layers of nodes between the input and output layers called the "hidden layer." +

+

+ We label these intermediate or "hidden" layer nodes $$a^2_0 \cdots a^2_n$$ and call them "activation units." +

+ + + + +
+

+ $$\begin{align*}& a_i^{(j)} = \text{"activation" of unit $i$ in layer $j$} \newline& \Theta^{(j)} = \text{matrix of weights controlling function mapping from layer $j$ to layer $j+1$}\end{align*}$$ +

+
+

+ If we had one hidden layer, it would look visually something like: +

+ + + + +
+

+ $$\begin{bmatrix}x_0 \newline x_1 \newline x_2 \newline x_3\end{bmatrix}\rightarrow\begin{bmatrix}a_1^{(2)} \newline a_2^{(2)} \newline a_3^{(2)} \newline \end{bmatrix}\rightarrow h_\theta(x)$$ +

+
+

+ The values for each of the "activation" nodes is obtained as follows: +

+ + + + +
+

+ $$\begin{align*} +a_1^{(2)} = g(\Theta_{10}^{(1)}x_0 + \Theta_{11}^{(1)}x_1 + \Theta_{12}^{(1)}x_2 + \Theta_{13}^{(1)}x_3) \newline +a_2^{(2)} = g(\Theta_{20}^{(1)}x_0 + \Theta_{21}^{(1)}x_1 + \Theta_{22}^{(1)}x_2 + \Theta_{23}^{(1)}x_3) \newline +a_3^{(2)} = g(\Theta_{30}^{(1)}x_0 + \Theta_{31}^{(1)}x_1 + \Theta_{32}^{(1)}x_2 + \Theta_{33}^{(1)}x_3) \newline +h_\Theta(x) = a_1^{(3)} = g(\Theta_{10}^{(2)}a_0^{(2)} + \Theta_{11}^{(2)}a_1^{(2)} + \Theta_{12}^{(2)}a_2^{(2)} + \Theta_{13}^{(2)}a_3^{(2)}) \newline +\end{align*}$$ +

+
+

+ This is saying that we compute our activation nodes by using a 3×4 matrix of parameters. We apply each row of the parameters to our inputs to obtain the value for one activation node. Our hypothesis output is the logistic function applied to the sum of the values of our activation nodes, which have been multiplied by yet another parameter matrix $$\Theta^{(2)}$$ containing the weights for our second layer of nodes. +

+

+ Each layer gets its own matrix of weights, $$\Theta^{(j)}$$. +

+

+ The dimensions of these matrices of weights is determined as follows: +

+

+ $$\text{If network has $s_j$ units in layer $j$ and $s_{j+1}$ units in layer $j+1$, then $\Theta^{(j)}$ will be of dimension $s_{j+1} \times (s_j + 1)$.}$$ +

+

+ The +1 comes from the addition in $$\Theta^{(j)}$$ of the "bias nodes," $$x_0$$ and $$\Theta_0^{(j)}$$. In other words the output nodes will not include the bias nodes while the inputs will. +

+

+ Example: layer 1 has 2 input nodes and layer 2 has 4 activation nodes. Dimension of $$\Theta^{(1)}$$ is going to be 4×3 where $$s_j = 2$$ and $$s_{j+1} = 4$$, so $$s_{j+1} \times (s_j + 1) = 4 \times 3$$. +

+

+ Model Representation II +

+

+ In this section we'll do a vectorized implementation of the above functions. We're going to define a new variable $$z_k^{(j)}$$ that encompasses the parameters inside our g function. In our previous example if we replaced the variable z for all the parameters we would get: +

+ + + + +
+

+ $$\begin{align*}a_1^{(2)} = g(z_1^{(2)}) \newline a_2^{(2)} = g(z_2^{(2)}) \newline a_3^{(2)} = g(z_3^{(2)}) \newline \end{align*}$$ +

+
+

+ In other words, for layer j=2 and node k, the variable z will be: +

+ + + + +
+

+ $$z_k^{(2)} = \Theta_{k,0}^{(1)}x_0 + \Theta_{k,1}^{(1)}x_1 + \cdots + \Theta_{k,n}^{(1)}x_n$$ +

+
+

+ The vector representation of x and $$z^{(j)}$$ is:

+ + + + +
+

+ $$\begin{align*}x = \begin{bmatrix}x_0 \newline x_1 \newline\cdots \newline x_n\end{bmatrix} &z^{(j)} = \begin{bmatrix}z_1^{(j)} \newline z_2^{(j)} \newline\cdots \newline z_n^{(j)}\end{bmatrix}\end{align*}$$ +

+
+

+ Setting $$x = a^{(1)}$$, we can rewrite the equation as: +

+ + + + +
+

+ $$z^{(j)} = \Theta^{(j-1)}a^{(j-1)}$$ +

+
+

+ We are multiplying our matrix $$\Theta^{(j-1)}$$ with dimensions $$s_j\times (n+1)$$ (where $$s_j$$ is the number of our activation nodes) by our vector $$a^{(j-1)}$$ with height (n+1). This gives us our vector $$z^{(j)}$$ with height $$s_j$$. +

+

+ Now we can get a vector of our activation nodes for layer j as follows: +

+

+ $$a^{(j)} = g(z^{(j)})$$ +

+

+ Where our function g can be applied element-wise to our vector $$z^{(j)}$$. +

+

+ We can then add a bias unit (equal to 1) to layer j after we have computed $$a^{(j)}$$. This will be element $$a_0^{(j)}$$ and will be equal to 1. +

+

+ To compute our final hypothesis, let's first compute another z vector: +

+

+ $$z^{(j+1)} = \Theta^{(j)}a^{(j)}$$ +

+

+ We get this final z vector by multiplying the next theta matrix after $$\Theta^{(j-1)}$$ with the values of all the activation nodes we just got. +

+

+ This last theta matrix $$\Theta^{(j)}$$ will have only + + one row + + so that our result is a single number. +

+

+ We then get our final result with: +

+

+ $$h_\Theta(x) = a^{(j+1)} = g(z^{(j+1)})$$ +

+

+ Notice that in this + + last step + + , between layer j and layer j+1, we are doing + + exactly the same thing + + as we did in logistic regression. +

+

+ Adding all these intermediate layers in neural networks allows us to more elegantly produce interesting and more complex non-linear hypotheses. +
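As a minimal numpy sketch of this vectorized forward propagation (the function names are our own; thetas is assumed to be a list of the weight matrices $$\Theta^{(1)}, \Theta^{(2)}, \ldots$$):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward_propagate(x, thetas):
    """Compute h_Theta(x) for inputs x (given without the bias unit).

    Each theta in thetas has shape s_{j+1} x (s_j + 1).
    """
    a = np.asarray(x, dtype=float)
    for theta in thetas:
        a = np.insert(a, 0, 1.0)   # prepend the bias unit a_0 = 1
        z = theta @ a              # z^(j+1) = Theta^(j) a^(j)
        a = sigmoid(z)             # a^(j+1) = g(z^(j+1))
    return a                       # the final layer's activation
```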

+

+ Examples and Intuitions I +

+

+ A simple example of applying neural networks is by predicting $$x_1$$ AND $$x_2$$, which is the logical 'and' operator and is only true if both $$x_1$$ and $$x_2$$ are 1. +

+

+ The graph of our functions will look like: +

+ + + + +
+

+ $$\begin{align*}\begin{bmatrix}x_0 \newline x_1 \newline x_2\end{bmatrix} \rightarrow\begin{bmatrix}g(z^{(2)})\end{bmatrix} \rightarrow h_\Theta(x)\end{align*}$$ +

+
+

+ Remember that $$x_0$$ is our bias variable and is always 1. +

+

+ Let's set our first theta matrix as: +

+ + + + +
+

+ $$\Theta^{(1)} =\begin{bmatrix}-30 & 20 & 20\end{bmatrix}$$ +

+
+

+ This will cause the output of our hypothesis to only be positive if both $$x_1$$ and $$x_2$$ are 1. In other words: +

+ + + + +
+

+ $$\begin{align*}& h_\Theta(x) = g(-30 + 20x_1 + 20x_2) \newline \newline & x_1 = 0 \ \ and \ \ x_2 = 0 \ \ then \ \ g(-30) \approx 0 \newline & x_1 = 0 \ \ and \ \ x_2 = 1 \ \ then \ \ g(-10) \approx 0 \newline & x_1 = 1 \ \ and \ \ x_2 = 0 \ \ then \ \ g(-10) \approx 0 \newline & x_1 = 1 \ \ and \ \ x_2 = 1 \ \ then \ \ g(10) \approx 1\end{align*}$$ +

+
+

+ So we have constructed one of the fundamental operations in computers by using a small neural network rather than using an actual AND gate. Neural networks can also be used to simulate all the other logical gates. +

+

+ Examples and Intuitions II +

+

+ The $$\Theta^{(1)}$$ matrices for AND, NOR, and OR are:

+ + + + +
+

+ $$\begin{align*}AND:\newline\Theta^{(1)} &=\begin{bmatrix}-30 & 20 & 20\end{bmatrix} \newline NOR:\newline\Theta^{(1)} &= \begin{bmatrix}10 & -20 & -20\end{bmatrix} \newline OR:\newline\Theta^{(1)} &= \begin{bmatrix}-10 & 20 & 20\end{bmatrix} \newline\end{align*}$$ +

+
+

+ We can combine these to get the XNOR logical operator (which gives 1 if $$x_1$$ and $$x_2$$ are both 0 or both 1). +

+ + + + +
+

+ $$\begin{align*}\begin{bmatrix}x_0 \newline x_1 \newline x_2\end{bmatrix} \rightarrow\begin{bmatrix}a_1^{(2)} \newline a_2^{(2)} \end{bmatrix} \rightarrow\begin{bmatrix}a^{(3)}\end{bmatrix} \rightarrow h_\Theta(x)\end{align*}$$ +

+
+

+ For the transition between the first and second layer, we'll use a $$\Theta^{(1)}$$ matrix that combines the values for AND and NOR:

+ + + + +
+

+ $$\Theta^{(1)} =\begin{bmatrix}-30 & 20 & 20 \newline 10 & -20 & -20\end{bmatrix}$$ +

+
+

+ For the transition between the second and third layer, we'll use a $$\Theta^{(2)}$$ matrix that uses the value for OR:

+ + + + +
+

+ $$\Theta^{(2)} =\begin{bmatrix}-10 & 20 & 20\end{bmatrix}$$ +

+
+

+ Let's write out the values for all our nodes: +

+ + + + +
+

+ $$\begin{align*}& a^{(2)} = g(\Theta^{(1)} \cdot x) \newline& a^{(3)} = g(\Theta^{(2)} \cdot a^{(2)}) \newline& h_\Theta(x) = a^{(3)}\end{align*}$$ +

+
+

+ And there we have the XNOR operator using a single hidden layer with two nodes!
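We can verify the whole network numerically; here is a small sketch using the matrices above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta1 = np.array([[-30.0,  20.0,  20.0],    # AND
                   [ 10.0, -20.0, -20.0]])   # NOR
theta2 = np.array([-10.0, 20.0, 20.0])       # OR

def xnor(x1, x2):
    a2 = sigmoid(theta1 @ np.array([1.0, x1, x2]))   # hidden layer
    a3 = sigmoid(theta2 @ np.insert(a2, 0, 1.0))     # output layer
    return round(float(a3))

assert [xnor(0, 0), xnor(0, 1), xnor(1, 0), xnor(1, 1)] == [1, 0, 0, 1]
```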

+

+ Multiclass Classification +

+

+ To classify data into multiple classes, we let our hypothesis function return a vector of values. Say we wanted to classify our data into one of four final resulting classes: +

+ + + + +
+

+ $$\begin{align*}\begin{bmatrix}x_0 \newline x_1 \newline x_2 \newline\cdots \newline x_n\end{bmatrix} \rightarrow\begin{bmatrix}a_0^{(2)} \newline a_1^{(2)} \newline a_2^{(2)} \newline\cdots\end{bmatrix} \rightarrow\begin{bmatrix}a_0^{(3)} \newline a_1^{(3)} \newline a_2^{(3)} \newline\cdots\end{bmatrix} \rightarrow \cdots \rightarrow\begin{bmatrix}h_\Theta(x)_1 \newline h_\Theta(x)_2 \newline h_\Theta(x)_3 \newline h_\Theta(x)_4 \newline\end{bmatrix} \rightarrow\end{align*}$$ +

+
+

+ Our final layer of nodes, when multiplied by its theta matrix, will result in another vector, on which we will apply the g() logistic function to get a vector of hypothesis values. +

+

+ Our resulting hypothesis for one set of inputs may look like: +

+ + + + +
+

+ $$h_\Theta(x) =\begin{bmatrix}0 \newline 0 \newline 1 \newline 0 \newline\end{bmatrix}$$ +

+
+

+ In which case our resulting class is the third one down, or $$h_\Theta(x)_3$$. +

+

+ We can define our set of resulting classes as y: +

+ $$y \in \left\lbrace \begin{bmatrix}1 \newline 0 \newline 0 \newline 0\end{bmatrix}, \begin{bmatrix}0 \newline 1 \newline 0 \newline 0\end{bmatrix}, \begin{bmatrix}0 \newline 0 \newline 1 \newline 0\end{bmatrix}, \begin{bmatrix}0 \newline 0 \newline 0 \newline 1\end{bmatrix} \right\rbrace$$

+ Our final value of our hypothesis for a set of inputs will be one of the elements in y. +

+

+

+
+ + + diff --git a/ML_Mathematical_Approach/07_week-5-lecture-notes/01__chap2.html b/ML_Mathematical_Approach/07_week-5-lecture-notes/01__chap2.html new file mode 100644 index 0000000..b1fd6ed --- /dev/null +++ b/ML_Mathematical_Approach/07_week-5-lecture-notes/01__chap2.html @@ -0,0 +1,296 @@ + + + + + + + + + + + + + + + Neural networks and deep learning + + + + + + + + + + + + +

+ CHAPTER 2

+

How the backpropagation algorithm works

+


In the last chapter we saw how neural networks can learn their weights and biases using the gradient descent algorithm. There was, however, a gap in our explanation: we didn't discuss how to compute the gradient of the cost function. That's quite a gap! In this chapter I'll explain a fast algorithm for computing such gradients, an algorithm known as backpropagation.

The backpropagation algorithm was originally introduced in the 1970s, but its importance wasn't fully appreciated until a famous 1986 paper by David Rumelhart, Geoffrey Hinton, and Ronald Williams. That paper describes several neural networks where backpropagation works far faster than earlier approaches to learning, making it possible to use neural nets to solve problems which had previously been insoluble. Today, the backpropagation algorithm is the workhorse of learning in neural networks.

This chapter is more mathematically involved than the rest of the book. If you're not crazy about mathematics you may be tempted to skip the chapter, and to treat backpropagation as a black box whose details you're willing to ignore. Why take the time to study those details?

The reason, of course, is understanding. At the heart of backpropagation is an expression for the partial derivative $\partial C / \partial w$ of the cost function $C$ with respect to any weight $w$ (or bias $b$) in the network. The expression tells us how quickly the cost changes when we change the weights and biases. And while the expression is somewhat complex, it also has a beauty to it, with each element having a natural, intuitive interpretation. And so backpropagation isn't just a fast algorithm for learning. It actually gives us detailed insights into how changing the weights and biases changes the overall behaviour of the network. That's well worth studying in detail.

With that said, if you want to skim the chapter, or jump straight to the next chapter, that's fine. I've written the rest of the book to be accessible even if you treat backpropagation as a black box. There are, of course, points later in the book where I refer back to results from this chapter. But at those points you should still be able to understand the main conclusions, even if you don't follow all the reasoning.

Warm up: a fast matrix-based approach to computing the output from a neural network

Before discussing backpropagation, let's warm up with a fast matrix-based algorithm to compute the output from a neural network. We actually already briefly saw this algorithm near the end of the last chapter, but I described it quickly, so it's worth revisiting in detail. In particular, this is a good way of getting comfortable with the notation used in backpropagation, in a familiar context.

Let's begin with a notation which lets us refer to weights in the network in an unambiguous way. We'll use $w^l_{jk}$ to denote the weight for the connection from the $k^{\rm th}$ neuron in the $(l-1)^{\rm th}$ layer to the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. So, for example, the diagram below shows the weight on a connection from the fourth neuron in the second layer to the second neuron in the third layer of a network:

This notation is cumbersome at first, and it does take some work to master. But with a little effort you'll find the notation becomes easy and natural. One quirk of the notation is the ordering of the $j$ and $k$ indices. You might think that it makes more sense to use $j$ to refer to the input neuron, and $k$ to the output neuron, not vice versa, as is actually done. I'll explain the reason for this quirk below.

We use a similar notation for the network's biases and activations. Explicitly, we use $b^l_j$ for the bias of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. And we use $a^l_j$ for the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. The following diagram shows examples of these notations in use:

With these notations, the activation $a^{l}_j$ of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer is related to the activations in the $(l-1)^{\rm th}$ layer by the equation (compare Equation (4) and surrounding discussion in the last chapter) \begin{eqnarray} a^{l}_j = \sigma\left( \sum_k w^{l}_{jk} a^{l-1}_k + b^l_j \right), \tag{23}\end{eqnarray} where the sum is over all neurons $k$ in the $(l-1)^{\rm th}$ layer. To rewrite this expression in a matrix form we define a weight matrix $w^l$ for each layer, $l$. The entries of the weight matrix $w^l$ are just the weights connecting to the $l^{\rm th}$ layer of neurons, that is, the entry in the $j^{\rm th}$ row and $k^{\rm th}$ column is $w^l_{jk}$. Similarly, for each layer $l$ we define a bias vector, $b^l$. You can probably guess how this works - the components of the bias vector are just the values $b^l_j$, one component for each neuron in the $l^{\rm th}$ layer. And finally, we define an activation vector $a^l$ whose components are the activations $a^l_j$.

The last ingredient we need to rewrite (23) in a matrix form is the idea of vectorizing a function such as $\sigma$. We met vectorization briefly in the last chapter, but to recap, the idea is that we want to apply a function such as $\sigma$ to every element in a vector $v$. We use the obvious notation $\sigma(v)$ to denote this kind of elementwise application of a function. That is, the components of $\sigma(v)$ are just $\sigma(v)_j = \sigma(v_j)$. As an example, if we have the function $f(x) = x^2$ then the vectorized form of $f$ has the effect \begin{eqnarray} f\left(\left[ \begin{array}{c} 2 \\ 3 \end{array} \right] \right) = \left[ \begin{array}{c} f(2) \\ f(3) \end{array} \right] = \left[ \begin{array}{c} 4 \\ 9 \end{array} \right], \tag{24}\end{eqnarray} that is, the vectorized $f$ just squares every element of the vector.

With these notations in mind, Equation (23) can be rewritten in the beautiful and compact vectorized form \begin{eqnarray} a^{l} = \sigma(w^l a^{l-1}+b^l). \tag{25}\end{eqnarray} This expression gives us a much more global way of thinking about how the activations in one layer relate to activations in the previous layer: we just apply the weight matrix to the activations, then add the bias vector, and finally apply the $\sigma$ function* *By the way, it's this expression that motivates the quirk in the $w^l_{jk}$ notation mentioned earlier. If we used $j$ to index the input neuron, and $k$ to index the output neuron, then we'd need to replace the weight matrix in Equation (25) by the transpose of the weight matrix. That's a small change, but annoying, and we'd lose the easy simplicity of saying (and thinking) "apply the weight matrix to the activations".. That global view is often easier and more succinct (and involves fewer indices!) than the neuron-by-neuron view we've taken to now. Think of it as a way of escaping index hell, while remaining precise about what's going on. The expression is also useful in practice, because most matrix libraries provide fast ways of implementing matrix multiplication, vector addition, and vectorization. Indeed, the code in the last chapter made implicit use of this expression to compute the behaviour of the network.

When using Equation (25) to compute $a^l$, we compute the intermediate quantity $z^l \equiv w^l a^{l-1}+b^l$ along the way. This quantity turns out to be useful enough to be worth naming: we call $z^l$ the weighted input to the neurons in layer $l$. We'll make considerable use of the weighted input $z^l$ later in the chapter. Equation (25) is sometimes written in terms of the weighted input, as $a^l = \sigma(z^l)$. It's also worth noting that $z^l$ has components $z^l_j = \sum_k w^l_{jk} a^{l-1}_k+b^l_j$, that is, $z^l_j$ is just the weighted input to the activation function for neuron $j$ in layer $l$.
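As a quick sketch of Equation (25) in numpy (this mirrors the kind of code used in the last chapter, though the exact names here are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def feedforward(a, weights, biases):
    """Return the network's output for input activations a.

    weights and biases hold one (w^l, b^l) pair per
    non-input layer, in order from the first hidden layer on.
    """
    for w, b in zip(weights, biases):
        a = sigmoid(w @ a + b)    # a^l = sigma(w^l a^{l-1} + b^l)
    return a
```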

The two assumptions we need about the cost function

The goal of backpropagation is to compute the partial derivatives $\partial C / \partial w$ and $\partial C / \partial b$ of the cost function $C$ with respect to any weight $w$ or bias $b$ in the network. For backpropagation to work we need to make two main assumptions about the form of the cost function. Before stating those assumptions, though, it's useful to have an example cost function in mind. We'll use the quadratic cost function from last chapter (c.f. Equation (6)). In the notation of the last section, the quadratic cost has the form \begin{eqnarray} C = \frac{1}{2n} \sum_x \|y(x)-a^L(x)\|^2, \tag{26}\end{eqnarray} where: $n$ is the total number of training examples; the sum is over individual training examples, $x$; $y = y(x)$ is the corresponding desired output; $L$ denotes the number of layers in the network; and $a^L = a^L(x)$ is the vector of activations output from the network when $x$ is input.

Okay, so what assumptions do we need to make about our cost function, $C$, in order that backpropagation can be applied? The first assumption we need is that the cost function can be written as an average $C = \frac{1}{n} \sum_x C_x$ over cost functions $C_x$ for individual training examples, $x$. This is the case for the quadratic cost function, where the cost for a single training example is $C_x = \frac{1}{2} \|y-a^L \|^2$. This assumption will also hold true for all the other cost functions we'll meet in this book.

The reason we need this assumption is because what backpropagation actually lets us do is compute the partial derivatives $\partial C_x / \partial w$ and $\partial C_x / \partial b$ for a single training example. We then recover $\partial C / \partial w$ and $\partial C / \partial b$ by averaging over training examples. In fact, with this assumption in mind, we'll suppose the training example $x$ has been fixed, and drop the $x$ subscript, writing the cost $C_x$ as $C$. We'll eventually put the $x$ back in, but for now it's a notational nuisance that is better left implicit.

The second assumption we make about the cost is that it can be written as a function of the outputs from the neural network:

For example, the quadratic cost function satisfies this requirement, since the quadratic cost for a single training example $x$ may be written as \begin{eqnarray} C = \frac{1}{2} \|y-a^L\|^2 = \frac{1}{2} \sum_j (y_j-a^L_j)^2, \tag{27}\end{eqnarray} and thus is a function of the output activations. Of course, this cost function also depends on the desired output $y$, and you may wonder why we're not regarding the cost also as a function of $y$. Remember, though, that the input training example $x$ is fixed, and so the output $y$ is also a fixed parameter. In particular, it's not something we can modify by changing the weights and biases in any way, i.e., it's not something which the neural network learns. And so it makes sense to regard $C$ as a function of the output activations $a^L$ alone, with $y$ merely a parameter that helps define that function.

The Hadamard product, $s \odot t$

The backpropagation algorithm is based on common linear algebraic operations - things like vector addition, multiplying a vector by a matrix, and so on. But one of the operations is a little less commonly used. In particular, suppose $s$ and $t$ are two vectors of the same dimension. Then we use $s \odot t$ to denote the elementwise product of the two vectors. Thus the components of $s \odot t$ are just $(s \odot t)_j = s_j t_j$. As an example, \begin{eqnarray} \left[\begin{array}{c} 1 \\ 2 \end{array}\right] \odot \left[\begin{array}{c} 3 \\ 4\end{array} \right] = \left[ \begin{array}{c} 1 * 3 \\ 2 * 4 \end{array} \right] = \left[ \begin{array}{c} 3 \\ 8 \end{array} \right]. \tag{28}\end{eqnarray} This kind of elementwise multiplication is sometimes called the Hadamard product or Schur product. We'll refer to it as the Hadamard product. Good matrix libraries usually provide fast implementations of the Hadamard product, and that comes in handy when implementing backpropagation.
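In numpy, for instance, the elementwise * operator on arrays is exactly the Hadamard product:

```python
import numpy as np

s = np.array([1, 2])
t = np.array([3, 4])
print(s * t)    # [3 8], i.e. (s ⊙ t)_j = s_j * t_j
```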

The four fundamental equations behind backpropagation

Backpropagation is about understanding how changing the weights and biases in a network changes the cost function. Ultimately, this means computing the partial derivatives $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$. But to compute those, we first introduce an intermediate quantity, $\delta^l_j$, which we call the error in the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. Backpropagation will give us a procedure to compute the error $\delta^l_j$, and then will relate $\delta^l_j$ to $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.

To understand how the error is defined, imagine there is a demon in our neural network:

The demon sits at the $j^{\rm th}$ neuron in layer $l$. As the input to the neuron comes in, the demon messes with the neuron's operation. It adds a little change $\Delta z^l_j$ to the neuron's weighted input, so that instead of outputting $\sigma(z^l_j)$, the neuron instead outputs $\sigma(z^l_j+\Delta z^l_j)$. This change propagates through later layers in the network, finally causing the overall cost to change by an amount $\frac{\partial C}{\partial z^l_j} \Delta z^l_j$.

Now, this demon is a good demon, and is trying to help you improve the cost, i.e., they're trying to find a $\Delta z^l_j$ which makes the cost smaller. Suppose $\frac{\partial C}{\partial z^l_j}$ has a large value (either positive or negative). Then the demon can lower the cost quite a bit by choosing $\Delta z^l_j$ to have the opposite sign to $\frac{\partial C}{\partial z^l_j}$. By contrast, if $\frac{\partial C}{\partial z^l_j}$ is close to zero, then the demon can't improve the cost much at all by perturbing the weighted input $z^l_j$. So far as the demon can tell, the neuron is already pretty near optimal* *This is only the case for small changes $\Delta z^l_j$, of course. We'll assume that the demon is constrained to make such small changes.. And so there's a heuristic sense in which $\frac{\partial C}{\partial z^l_j}$ is a measure of the error in the neuron.

Motivated by this story, we define the error $\delta^l_j$ of neuron $j$ in layer $l$ by \begin{eqnarray} \delta^l_j \equiv \frac{\partial C}{\partial z^l_j}. \tag{29}\end{eqnarray} As per our usual conventions, we use $\delta^l$ to denote the vector of errors associated with layer $l$. Backpropagation will give us a way of computing $\delta^l$ for every layer, and then relating those errors to the quantities of real interest, $\partial C / \partial w^l_{jk}$ and $\partial C / \partial b^l_j$.

You might wonder why the demon is changing the weighted input $z^l_j$. Surely it'd be more natural to imagine the demon changing the output activation $a^l_j$, with the result that we'd be using $\frac{\partial C}{\partial a^l_j}$ as our measure of error. In fact, if you do this things work out quite similarly to the discussion below. But it turns out to make the presentation of backpropagation a little more algebraically complicated. So we'll stick with $\delta^l_j = \frac{\partial C}{\partial z^l_j}$ as our measure of error* *In classification problems like MNIST the term "error" is sometimes used to mean the classification failure rate. E.g., if the neural net correctly classifies 96.0 percent of the digits, then the error is 4.0 percent. Obviously, this has quite a different meaning from our $\delta$ vectors. In practice, you shouldn't have trouble telling which meaning is intended in any given usage..

Plan of attack: Backpropagation is based around four fundamental equations. Together, those equations give us a way of computing both the error $\delta^l$ and the gradient of the cost function. I state the four equations below. Be warned, though: you shouldn't expect to instantaneously assimilate the equations. Such an expectation will lead to disappointment. In fact, the backpropagation equations are so rich that understanding them well requires considerable time and patience as you gradually delve deeper into the equations. The good news is that such patience is repaid many times over. And so the discussion in this section is merely a beginning, helping you on the way to a thorough understanding of the equations.

Here's a preview of the ways we'll delve more deeply into the equations later in the chapter: I'll give a short proof of the equations, which helps explain why they are true; we'll restate the equations in algorithmic form as pseudocode, and see how the pseudocode can be implemented as real, running Python code; and, in the final section of the chapter, we'll develop an intuitive picture of what the backpropagation equations mean, and how someone might discover them from scratch. Along the way we'll return repeatedly to the four fundamental equations, and as you deepen your understanding those equations will come to seem comfortable and, perhaps, even beautiful and natural.

An equation for the error in the output layer, $\delta^L$: The components of $\delta^L$ are given by \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j). \tag{BP1}\end{eqnarray} This is a very natural expression. The first term on the right, $\partial C / \partial a^L_j$, just measures how fast the cost is changing as a function of the $j^{\rm th}$ output activation. If, for example, $C$ doesn't depend much on a particular output neuron, $j$, then $\delta^L_j$ will be small, which is what we'd expect. The second term on the right, $\sigma'(z^L_j)$, measures how fast the activation function $\sigma$ is changing at $z^L_j$.

Notice that everything in (BP1) is easily computed. In particular, we compute $z^L_j$ while computing the behaviour of the network, and it's only a small additional overhead to compute $\sigma'(z^L_j)$. The exact form of $\partial C / \partial a^L_j$ will, of course, depend on the form of the cost function. However, provided the cost function is known there should be little trouble computing $\partial C / \partial a^L_j$. For example, if we're using the quadratic cost function then $C = \frac{1}{2} \sum_j (y_j-a^L_j)^2$, and so $\partial C / \partial a^L_j = (a_j^L-y_j)$, which obviously is easily computable.

Equation (BP1) is a componentwise expression for $\delta^L$. It's a perfectly good expression, but not the matrix-based form we want for backpropagation. However, it's easy to rewrite the equation in a matrix-based form, as \begin{eqnarray} \delta^L = \nabla_a C \odot \sigma'(z^L). \tag{BP1a}\end{eqnarray} Here, $\nabla_a C$ is defined to be a vector whose components are the partial derivatives $\partial C / \partial a^L_j$. You can think of $\nabla_a C$ as expressing the rate of change of $C$ with respect to the output activations. It's easy to see that Equations (BP1a) and (BP1) are equivalent, and for that reason from now on we'll use (BP1) interchangeably to refer to both equations. As an example, in the case of the quadratic cost we have $\nabla_a C = (a^L-y)$, and so the fully matrix-based form of (BP1) becomes \begin{eqnarray} \delta^L = (a^L-y) \odot \sigma'(z^L). \tag{30}\end{eqnarray} As you can see, everything in this expression has a nice vector form, and is easily computed using a library such as Numpy.

An equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$: In particular \begin{eqnarray} \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l), \tag{BP2}\end{eqnarray} where $(w^{l+1})^T$ is the transpose of the weight matrix $w^{l+1}$ for the $(l+1)^{\rm th}$ layer. This equation appears complicated, but each element has a nice interpretation. Suppose we know the error $\delta^{l+1}$ at the $l+1^{\rm th}$ layer. When we apply the transpose weight matrix, $(w^{l+1})^T$, we can think intuitively of this as moving the error backward through the network, giving us some sort of measure of the error at the output of the $l^{\rm th}$ layer. We then take the Hadamard product $\odot \sigma'(z^l)$. This moves the error backward through the activation function in layer $l$, giving us the error $\delta^l$ in the weighted input to layer $l$.

By combining (BP2) with (BP1) we can compute the error $\delta^l$ for any layer in the network. We start by using (BP1) to compute $\delta^L$, then apply Equation (BP2) to compute $\delta^{L-1}$, then Equation (BP2) again to compute $\delta^{L-2}$, and so on, all the way back through the network.

An equation for the rate of change of the cost with respect to any bias in the network: In particular: \begin{eqnarray} \frac{\partial C}{\partial b^l_j} = \delta^l_j. \tag{BP3}\end{eqnarray} That is, the error $\delta^l_j$ is exactly equal to the rate of change $\partial C / \partial b^l_j$. This is great news, since (BP1) and (BP2) have already told us how to compute $\delta^l_j$. We can rewrite (BP3) in shorthand as \begin{eqnarray} \frac{\partial C}{\partial b} = \delta, \tag{31}\end{eqnarray} where it is understood that $\delta$ is being evaluated at the same neuron as the bias $b$.

An equation for the rate of change of the cost with respect to any weight in the network: In particular: \begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j. \tag{BP4}\end{eqnarray} This tells us how to compute the partial derivatives $\partial C / \partial w^l_{jk}$ in terms of the quantities $\delta^l$ and $a^{l-1}$, which we already know how to compute. The equation can be rewritten in a less index-heavy notation as \begin{eqnarray} \frac{\partial C}{\partial w} = a_{\rm in} \delta_{\rm out}, \tag{32}\end{eqnarray} where it's understood that $a_{\rm in}$ is the activation of the neuron input to the weight $w$, and $\delta_{\rm out}$ is the error of the neuron output from the weight $w$. Zooming in to look at just the weight $w$, and the two neurons connected by that weight, we can depict this as:

A nice consequence of Equation (32) is that when the activation $a_{\rm in}$ is small, $a_{\rm in} \approx 0$, the gradient term $\partial C / \partial w$ will also tend to be small. In this case, we'll say the weight learns slowly, meaning that it's not changing much during gradient descent. In other words, one consequence of (BP4) is that weights output from low-activation neurons learn slowly.

There are other insights along these lines which can be obtained from (BP1)-(BP4). Let's start by looking at the output layer. Consider the term $\sigma'(z^L_j)$ in (BP1). Recall from the graph of the sigmoid function in the last chapter that the $\sigma$ function becomes very flat when $\sigma(z^L_j)$ is approximately $0$ or $1$. When this occurs we will have $\sigma'(z^L_j) \approx 0$. And so the lesson is that a weight in the final layer will learn slowly if the output neuron is either low activation ($\approx 0$) or high activation ($\approx 1$). In this case it's common to say the output neuron has saturated and, as a result, the weight has stopped learning (or is learning slowly). Similar remarks hold also for the biases of the output neurons.

We can obtain similar insights for earlier layers. In particular, note the $\sigma'(z^l)$ term in (BP2). This means that $\delta^l_j$ is likely to get small if the neuron is near saturation. And this, in turn, means that any weights input to a saturated neuron will learn slowly.* *This reasoning won't hold if $(w^{l+1})^T \delta^{l+1}$ has large enough entries to compensate for the smallness of $\sigma'(z^l_j)$. But I'm speaking of the general tendency.

Summing up, we've learnt that a weight will learn slowly if either the input neuron is low-activation, or if the output neuron has saturated, i.e., is either high- or low-activation.

None of these observations is too greatly surprising. Still, they help improve our mental model of what's going on as a neural network learns. Furthermore, we can turn this type of reasoning around. The four fundamental equations turn out to hold for any activation function, not just the standard sigmoid function (that's because, as we'll see in a moment, the proofs don't use any special properties of $\sigma$). And so we can use these equations to design activation functions which have particular desired learning properties. As an example to give you the idea, suppose we were to choose a (non-sigmoid) activation function $\sigma$ so that $\sigma'$ is always positive, and never gets close to zero. That would prevent the slow-down of learning that occurs when ordinary sigmoid neurons saturate. Later in the book we'll see examples where this kind of modification is made to the activation function. Keeping the four equations (BP1)-(BP4) in mind can help explain why such modifications are tried, and what impact they can have.

Problem

Proof of the four fundamental equations (optional)

We'll now prove the four fundamental equations (BP1)-(BP4). All four are consequences of the chain rule from multivariable calculus. If you're comfortable with the chain rule, then I strongly encourage you to attempt the derivation yourself before reading on.

Let's begin with Equation (BP1), which gives an expression for the output error, $\delta^L$. To prove this equation, recall that by definition \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial z^L_j}. \tag{36}\end{eqnarray} Applying the chain rule, we can re-express the partial derivative above in terms of partial derivatives with respect to the output activations, \begin{eqnarray} \delta^L_j = \sum_k \frac{\partial C}{\partial a^L_k} \frac{\partial a^L_k}{\partial z^L_j}, \tag{37}\end{eqnarray} where the sum is over all neurons $k$ in the output layer. Of course, the output activation $a^L_k$ of the $k^{\rm th}$ neuron depends only on the weighted input $z^L_j$ for the $j^{\rm th}$ neuron when $k = j$. And so $\partial a^L_k / \partial z^L_j$ vanishes when $k \neq j$. As a result we can simplify the previous equation to \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \frac{\partial a^L_j}{\partial z^L_j}. \tag{38}\end{eqnarray} Recalling that $a^L_j = \sigma(z^L_j)$ the second term on the right can be written as $\sigma'(z^L_j)$, and the equation becomes \begin{eqnarray} \delta^L_j = \frac{\partial C}{\partial a^L_j} \sigma'(z^L_j), \tag{39}\end{eqnarray} which is just (BP1), in component form.

Next, we'll prove (BP2), which gives an equation for the error $\delta^l$ in terms of the error in the next layer, $\delta^{l+1}$. To do this, we want to rewrite $\delta^l_j = \partial C / \partial z^l_j$ in terms of $\delta^{l+1}_k = \partial C / \partial z^{l+1}_k$. We can do this using the chain rule, \begin{eqnarray} \delta^l_j & = & \frac{\partial C}{\partial z^l_j} \tag{40}\\ & = & \sum_k \frac{\partial C}{\partial z^{l+1}_k} \frac{\partial z^{l+1}_k}{\partial z^l_j} \tag{41}\\ & = & \sum_k \frac{\partial z^{l+1}_k}{\partial z^l_j} \delta^{l+1}_k, \tag{42}\end{eqnarray} where in the last line we have interchanged the two terms on the right-hand side, and substituted the definition of $\delta^{l+1}_k$. To evaluate the first term on the last line, note that \begin{eqnarray} z^{l+1}_k = \sum_j w^{l+1}_{kj} a^l_j +b^{l+1}_k = \sum_j w^{l+1}_{kj} \sigma(z^l_j) +b^{l+1}_k. \tag{43}\end{eqnarray} Differentiating, we obtain \begin{eqnarray} \frac{\partial z^{l+1}_k}{\partial z^l_j} = w^{l+1}_{kj} \sigma'(z^l_j). \tag{44}\end{eqnarray} Substituting back into (42) we obtain \begin{eqnarray} \delta^l_j = \sum_k w^{l+1}_{kj} \delta^{l+1}_k \sigma'(z^l_j). \tag{45}\end{eqnarray} This is just (BP2) written in component form.

The final two equations we want to prove are (BP3) and (BP4). These also follow from the chain rule, in a manner similar to the proofs of the two equations above. I leave them to you as an exercise.

Exercise

That completes the proof of the four fundamental equations of backpropagation. The proof may seem complicated. But it's really just the outcome of carefully applying the chain rule. A little less succinctly, we can think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multi-variable calculus. That's all there really is to backpropagation - the rest is details.

The backpropagation algorithm

The backpropagation equations provide us with a way of computing the gradient of the cost function. Let's explicitly write this out in the form of an algorithm:

  1. Input $x$: Set the corresponding activation $a^{1}$ for the input layer.

  2. Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{l} = w^l a^{l-1}+b^l$ and $a^{l} = \sigma(z^{l})$.

  3. Output error $\delta^L$: Compute the vector $\delta^{L} = \nabla_a C \odot \sigma'(z^L)$.

  4. Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{l} = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^{l})$.

  5. Output: The gradient of the cost function is given by $\frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j$ and $\frac{\partial C}{\partial b^l_j} = \delta^l_j$.

Examining the algorithm you can see why it's called backpropagation. We compute the error vectors $\delta^l$ backward, starting from the final layer. It may seem peculiar that we're going through the network backward. But if you think about the proof of backpropagation, the backward movement is a consequence of the fact that the cost is a function of outputs from the network. To understand how the cost varies with earlier weights and biases we need to repeatedly apply the chain rule, working backward through the layers to obtain usable expressions.

Exercises

As I've described it above, the backpropagation algorithm computes the gradient of the cost function for a single training example, $C = C_x$. In practice, it's common to combine backpropagation with a learning algorithm such as stochastic gradient descent, in which we compute the gradient for many training examples. In particular, given a mini-batch of $m$ training examples, the following algorithm applies a gradient descent learning step based on that mini-batch:

  1. Input a set of training examples

  2. For each training example $x$: Set the corresponding input activation $a^{x,1}$, and perform the following steps:

    • Feedforward: For each $l = 2, 3, \ldots, L$ compute $z^{x,l} = w^l a^{x,l-1}+b^l$ and $a^{x,l} = \sigma(z^{x,l})$.

    • Output error $\delta^{x,L}$: Compute the vector $\delta^{x,L} = \nabla_a C_x \odot \sigma'(z^{x,L})$.

    • Backpropagate the error: For each $l = L-1, L-2, \ldots, 2$ compute $\delta^{x,l} = ((w^{l+1})^T \delta^{x,l+1}) \odot \sigma'(z^{x,l})$.

  3. Gradient descent: For each $l = L, L-1, \ldots, 2$ update the weights according to the rule $w^l \rightarrow w^l-\frac{\eta}{m} \sum_x \delta^{x,l} (a^{x,l-1})^T$, and the biases according to the rule $b^l \rightarrow b^l-\frac{\eta}{m} \sum_x \delta^{x,l}$.

Of course, to implement stochastic gradient descent in practice you also need an outer loop generating mini-batches of training examples, and an outer loop stepping through multiple epochs of training. I've omitted those for simplicity.
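For concreteness, here is a minimal sketch of those outer loops in Python, written as a standalone function; it assumes a network object exposing the update_mini_batch method discussed in the next section, and the names are otherwise illustrative:

import random

def sgd(network, training_data, epochs, mini_batch_size, eta):
    """Outer loops for stochastic gradient descent: shuffle the
    training data each epoch, slice it into mini-batches, and apply
    one gradient descent step per mini-batch."""
    n = len(training_data)
    for epoch in range(epochs):
        random.shuffle(training_data)
        mini_batches = [training_data[k:k + mini_batch_size]
                        for k in range(0, n, mini_batch_size)]
        for mini_batch in mini_batches:
            network.update_mini_batch(mini_batch, eta)
        print("Epoch %s complete" % epoch)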

The code for backpropagation

Having understood backpropagation in the abstract, we can now understand the code used in the last chapter to implement backpropagation. Recall from that chapter that the code was contained in the update_mini_batch and backprop methods of the Network class. The code for these methods is a direct translation of the algorithm described above. In particular, the update_mini_batch method updates the Network's weights and biases by computing the gradient for the current mini_batch of training examples:

import numpy as np  # needed for the gradient arrays below

class Network(object):
...
    def update_mini_batch(self, mini_batch, eta):
        """Update the network's weights and biases by applying
        gradient descent using backpropagation to a single mini batch.
        The "mini_batch" is a list of tuples "(x, y)", and "eta"
        is the learning rate."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        for x, y in mini_batch:
            delta_nabla_b, delta_nabla_w = self.backprop(x, y)
            nabla_b = [nb+dnb for nb, dnb in zip(nabla_b, delta_nabla_b)]
            nabla_w = [nw+dnw for nw, dnw in zip(nabla_w, delta_nabla_w)]
        self.weights = [w-(eta/len(mini_batch))*nw
                        for w, nw in zip(self.weights, nabla_w)]
        self.biases = [b-(eta/len(mini_batch))*nb
                       for b, nb in zip(self.biases, nabla_b)]
Most of the work is done by the line delta_nabla_b, delta_nabla_w = self.backprop(x, y) which uses the backprop method to figure out the partial derivatives $\partial C_x / \partial b^l_j$ and $\partial C_x / \partial w^l_{jk}$. The backprop method follows the algorithm in the last section closely. There is one small change - we use a slightly different approach to indexing the layers. This change is made to take advantage of a feature of Python, namely the use of negative list indices to count backward from the end of a list, so, e.g., l[-3] is the third-to-last entry in a list l. The code for backprop is below, together with a few helper functions, which are used to compute the $\sigma$ function, the derivative $\sigma'$, and the derivative of the cost function. With these inclusions you should be able to understand the code in a self-contained way. If something's tripping you up, you may find it helpful to consult the original description (and complete listing) of the code.
class Network(object):
...
    def backprop(self, x, y):
        """Return a tuple "(nabla_b, nabla_w)" representing the
        gradient for the cost function C_x.  "nabla_b" and
        "nabla_w" are layer-by-layer lists of numpy arrays, similar
        to "self.biases" and "self.weights"."""
        nabla_b = [np.zeros(b.shape) for b in self.biases]
        nabla_w = [np.zeros(w.shape) for w in self.weights]
        # feedforward
        activation = x
        activations = [x] # list to store all the activations, layer by layer
        zs = [] # list to store all the z vectors, layer by layer
        for b, w in zip(self.biases, self.weights):
            z = np.dot(w, activation)+b
            zs.append(z)
            activation = sigmoid(z)
            activations.append(activation)
        # backward pass
        delta = self.cost_derivative(activations[-1], y) * \
            sigmoid_prime(zs[-1])
        nabla_b[-1] = delta
        nabla_w[-1] = np.dot(delta, activations[-2].transpose())
        # Note that the variable l in the loop below is used a little
        # differently to the notation in Chapter 2 of the book.  Here,
        # l = 1 means the last layer of neurons, l = 2 is the
        # second-last layer, and so on.  It's a renumbering of the
        # scheme in the book, used here to take advantage of the fact
        # that Python can use negative indices in lists.
        # (xrange is Python 2; in Python 3 use range instead.)
        for l in xrange(2, self.num_layers):
            z = zs[-l]
            sp = sigmoid_prime(z)
            delta = np.dot(self.weights[-l+1].transpose(), delta) * sp
            nabla_b[-l] = delta
            nabla_w[-l] = np.dot(delta, activations[-l-1].transpose())
        return (nabla_b, nabla_w)

...

    def cost_derivative(self, output_activations, y):
        """Return the vector of partial derivatives \partial C_x /
        \partial a for the output activations."""
        return (output_activations-y)

def sigmoid(z):
    """The sigmoid function."""
    return 1.0/(1.0+np.exp(-z))

def sigmoid_prime(z):
    """Derivative of the sigmoid function."""
    return sigmoid(z)*(1-sigmoid(z))

Problem

In what sense is backpropagation a fast algorithm?

In what sense is backpropagation a fast algorithm? To answer this question, let's consider another approach to computing the gradient. Imagine it's the early days of neural networks research. Maybe it's the 1950s or 1960s, and you're the first person in the world to think of using gradient descent to learn! But to make the idea work you need a way of computing the gradient of the cost function. You think back to your knowledge of calculus, and decide to see if you can use the chain rule to compute the gradient. But after playing around a bit, the algebra looks complicated, and you get discouraged. So you try to find another approach. You decide to regard the cost as a function of the weights $C = C(w)$ alone (we'll get back to the biases in a moment). You number the weights $w_1, w_2, \ldots$, and want to compute $\partial C / \partial w_j$ for some particular weight $w_j$. An obvious way of doing that is to use the approximation \begin{eqnarray} \frac{\partial C}{\partial w_{j}} \approx \frac{C(w+\epsilon e_j)-C(w)}{\epsilon}, \tag{46}\end{eqnarray} where $\epsilon > 0$ is a small positive number, and $e_j$ is the unit vector in the $j^{\rm th}$ direction. In other words, we can estimate $\partial C / \partial w_j$ by computing the cost $C$ for two slightly different values of $w_j$, and then applying Equation (46). The same idea will let us compute the partial derivatives $\partial C / \partial b$ with respect to the biases.

This approach looks very promising. It's simple conceptually, and extremely easy to implement, using just a few lines of code. Certainly, it looks much more promising than the idea of using the chain rule to compute the gradient!
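As a sketch of just how little code this takes, here is one way Equation (46) might be implemented in Python; cost_fn and the weight vector w are illustrative stand-ins for whatever cost function and float parameter vector you are using:

import numpy as np

def estimate_gradient(cost_fn, w, epsilon=1e-4):
    """Estimate each partial derivative dC/dw_j using Equation (46).
    Requires len(w) + 1 evaluations of the cost function."""
    base_cost = cost_fn(w)              # C(w), computed once
    grad = np.zeros_like(w)
    for j in range(len(w)):
        w_step = w.copy()
        w_step[j] += epsilon            # w + epsilon * e_j
        grad[j] = (cost_fn(w_step) - base_cost) / epsilon
    return grad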

Unfortunately, while this approach appears promising, when you implement the code it turns out to be extremely slow. To understand why, imagine we have a million weights in our network. Then for each distinct weight $w_j$ we need to compute $C(w+\epsilon e_j)$ in order to compute $\partial C / \partial w_j$. That means that to compute the gradient we need to compute the cost function a million different times, requiring a million forward passes through the network (per training example). We need to compute $C(w)$ as well, so that's a total of a million and one passes through the network.

What's clever about backpropagation is that it enables us to simultaneously compute all the partial derivatives $\partial C / \partial w_j$ using just one forward pass through the network, followed by one backward pass through the network. Roughly speaking, the computational cost of the backward pass is about the same as the forward pass.* *This should be plausible, but it requires some analysis to make a careful statement. It's plausible because the dominant computational cost in the forward pass is multiplying by the weight matrices, while in the backward pass it's multiplying by the transposes of the weight matrices. These operations obviously have similar computational cost. And so the total cost of backpropagation is roughly the same as making just two forward passes through the network. Compare that to the million and one forward passes we needed for the approach based on (46)! And so even though backpropagation appears superficially more complex than the approach based on (46), it's actually much, much faster.

This speedup was first fully appreciated in 1986, and it greatly expanded the range of problems that neural networks could solve. That, in turn, caused a rush of people using neural networks. Of course, backpropagation is not a panacea. Even in the late 1980s people ran up against limits, especially when attempting to use backpropagation to train deep neural networks, i.e., networks with many hidden layers. Later in the book we'll see how modern computers and some clever new ideas now make it possible to use backpropagation to train such deep neural networks.

Backpropagation: the big picture

As I've explained it, backpropagation presents two mysteries. First, what's the algorithm really doing? We've developed a picture of the error being backpropagated from the output. But can we go any deeper, and build up more intuition about what is going on when we do all these matrix and vector multiplications? The second mystery is how someone could ever have discovered backpropagation in the first place? It's one thing to follow the steps in an algorithm, or even to follow the proof that the algorithm works. But that doesn't mean you understand the problem so well that you could have discovered the algorithm in the first place. Is there a plausible line of reasoning that could have led you to discover the backpropagation algorithm? In this section I'll address both these mysteries.

To improve our intuition about what the algorithm is doing, let's imagine that we've made a small change $\Delta w^l_{jk}$ to some weight in the network, $w^l_{jk}$:

That change in weight will cause a change in the output activation from the corresponding neuron:
That, in turn, will cause a change in all the activations in the next layer:
Those changes will in turn cause changes in the next layer, and then the next, and so on all the way through to causing a change in the final layer, and then in the cost function:
The change $\Delta C$ in the cost is related to the change $\Delta w^l_{jk}$ in the weight by the equation \begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{47}\end{eqnarray} This suggests that a possible approach to computing $\frac{\partial C}{\partial w^l_{jk}}$ is to carefully track how a small change in $w^l_{jk}$ propagates to cause a small change in $C$. If we can do that, being careful to express everything along the way in terms of easily computable quantities, then we should be able to compute $\partial C / \partial w^l_{jk}$.

Let's try to carry this out. The change $\Delta w^l_{jk}$ causes a small change $\Delta a^{l}_j$ in the activation of the $j^{\rm th}$ neuron in the $l^{\rm th}$ layer. This change is given by \begin{eqnarray} \Delta a^l_j \approx \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{48}\end{eqnarray} The change in activation $\Delta a^l_{j}$ will cause changes in all the activations in the next layer, i.e., the $(l+1)^{\rm th}$ layer. We'll concentrate on the way just a single one of those activations is affected, say $a^{l+1}_q$,

In fact, it'll cause the following change: \begin{eqnarray} \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \Delta a^l_j. \tag{49}\end{eqnarray} Substituting in the expression from Equation (48), we get: \begin{eqnarray} \Delta a^{l+1}_q \approx \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}. \tag{50}\end{eqnarray} Of course, the change $\Delta a^{l+1}_q$ will, in turn, cause changes in the activations in the next layer. In fact, we can imagine a path all the way through the network from $w^l_{jk}$ to $C$, with each change in activation causing a change in the next activation, and, finally, a change in the cost at the output. If the path goes through activations $a^l_j, a^{l+1}_q, \ldots, a^{L-1}_n, a^L_m$ then the resulting expression is \begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}, \tag{51}\end{eqnarray} that is, we've picked up a $\partial a / \partial a$ type term for each additional neuron we've passed through, as well as the $\partial C/\partial a^L_m$ term at the end. This represents the change in $C$ due to changes in the activations along this particular path through the network. Of course, there's many paths by which a change in $w^l_{jk}$ can propagate to affect the cost, and we've been considering just a single path. To compute the total change in $C$ it is plausible that we should sum over all the possible paths between the weight and the final cost, i.e., \begin{eqnarray} \Delta C \approx \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}} \Delta w^l_{jk}, \tag{52}\end{eqnarray} where we've summed over all possible choices for the intermediate neurons along the path. Comparing with (47) we see that \begin{eqnarray} \frac{\partial C}{\partial w^l_{jk}} = \sum_{mnp\ldots q} \frac{\partial C}{\partial a^L_m} \frac{\partial a^L_m}{\partial a^{L-1}_n} \frac{\partial a^{L-1}_n}{\partial a^{L-2}_p} \ldots \frac{\partial a^{l+1}_q}{\partial a^l_j} \frac{\partial a^l_j}{\partial w^l_{jk}}. \tag{53}\end{eqnarray} Now, Equation (53) looks complicated. However, it has a nice intuitive interpretation. We're computing the rate of change of $C$ with respect to a weight in the network. What the equation tells us is that every edge between two neurons in the network is associated with a rate factor which is just the partial derivative of one neuron's activation with respect to the other neuron's activation. The edge from the first weight to the first neuron has a rate factor $\partial a^{l}_j / \partial w^l_{jk}$. The rate factor for a path is just the product of the rate factors along the path. And the total rate of change $\partial C / \partial w^l_{jk}$ is just the sum of the rate factors of all paths from the initial weight to the final cost. This procedure is illustrated here, for a single path:

What I've been providing up to now is a heuristic argument, a way of thinking about what's going on when you perturb a weight in a network. Let me sketch out a line of thinking you could use to further develop this argument. First, you could derive explicit expressions for all the individual partial derivatives in Equation (53). That's easy to do with a bit of calculus. Having done that, you could then try to figure out how to write all the sums over indices as matrix multiplications. This turns out to be tedious, and requires some persistence, but not extraordinary insight. After doing all this, and then simplifying as much as possible, what you discover is that you end up with exactly the backpropagation algorithm! And so you can think of the backpropagation algorithm as providing a way of computing the sum over the rate factor for all these paths. Or, to put it slightly differently, the backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost.

Now, I'm not going to work through all this here. It's messy and requires considerable care to work through all the details. If you're up for a challenge, you may enjoy attempting it. And even if not, I hope this line of thinking gives you some insight into what backpropagation is accomplishing.

What about the other mystery - how backpropagation could have been discovered in the first place? In fact, if you follow the approach I just sketched you will discover a proof of backpropagation. Unfortunately, the proof is quite a bit longer and more complicated than the one I described earlier in this chapter. So how was that short (but more mysterious) proof discovered? What you find when you write out all the details of the long proof is that, after the fact, there are several obvious simplifications staring you in the face. You make those simplifications, get a shorter proof, and write that out. And then several more obvious simplifications jump out at you. So you repeat again. The result after a few iterations is the proof we saw earlier* *There is one clever step required. In Equation (53) the intermediate variables are activations like $a_q^{l+1}$. The clever idea is to switch to using weighted inputs, like $z^{l+1}_q$, as the intermediate variables. If you don't have this idea, and instead continue using the activations $a^{l+1}_q$, the proof you obtain turns out to be slightly more complex than the proof given earlier in the chapter. - short, but somewhat obscure, because all the signposts to its construction have been removed! I am, of course, asking you to trust me on this, but there really is no great mystery to the origin of the earlier proof. It's just a lot of hard work simplifying the proof I've sketched in this section.




+ + + \ No newline at end of file diff --git a/ML_Mathematical_Approach/07_week-5-lecture-notes/01__chapter3_-_bp.pdf b/ML_Mathematical_Approach/07_week-5-lecture-notes/01__chapter3_-_bp.pdf new file mode 100644 index 0000000..4036092 Binary files /dev/null and b/ML_Mathematical_Approach/07_week-5-lecture-notes/01__chapter3_-_bp.pdf differ diff --git a/ML_Mathematical_Approach/07_week-5-lecture-notes/01__node37.html b/ML_Mathematical_Approach/07_week-5-lecture-notes/01__node37.html new file mode 100644 index 0000000..15fb3ae --- /dev/null +++ b/ML_Mathematical_Approach/07_week-5-lecture-notes/01__node37.html @@ -0,0 +1,466 @@ + + + + +The Backpropagation Algorithm + + + + + + + + + + + + +


The Backpropagation Algorithm

The Backpropagation algorithm:

1. Propagates inputs forward in the usual way, computing each unit's output from the outputs of the layer before it.
2. Propagates the errors backwards by apportioning them to each unit according to the amount of this error the unit is responsible for.

We now derive the stochastic Backpropagation algorithm for the general case. The derivation is simple, but unfortunately the book-keeping is a little messy. Since we update after each training example, we can simplify the notation somewhat by imagining that the training set consists of exactly one example, so the error can simply be denoted by $E$.

We want to calculate $\frac{\partial E}{\partial w_{ji}}$ for each input weight $w_{ji}$ of each unit $j$. Note first that since $z_j$ is a function of $w_{ji}$ regardless of where in the network unit $j$ is located,

\begin{eqnarray*} \frac{\partial E}{\partial w_{ji}} &=& \frac{\partial E}{\partial z_j} \frac{\partial z_j}{\partial w_{ji}} \\ &=& \frac{\partial E}{\partial z_j}\, x_{ji} \end{eqnarray*}

Furthermore, $\frac{\partial E}{\partial z_j}$ is the same regardless of which input weight of unit $j$ we are trying to update. So we denote this quantity by $\delta_j$.

Consider the case when $j \in Outputs$. We know

\begin{displaymath} E = \frac{1}{2}\sum_{k \in Outputs} (t_k - \sigma(z_k))^2 \end{displaymath}

Since the outputs of all units $k \ne j$ are independent of $w_{ji}$, we can drop the summation and consider just the contribution to $E$ by $j$.

\begin{eqnarray*} \delta_j = \frac{\partial E}{\partial z_j} &=& \frac{\partial}{\partial z_j}\, \frac{1}{2}(t_j - \sigma(z_j))^2 \\ &=& -(t_j - \sigma(z_j))\, \sigma'(z_j) \\ &=& -(t_j - o_j)(1-\sigma(z_j))\sigma(z_j) \\ &=& -(t_j - o_j)(1-o_j)o_j \end{eqnarray*}

Thus

\begin{eqnarray} \Delta w_{ji} = -\eta \frac{\partial E}{\partial w_{ji}} = -\eta\, \delta_j x_{ji} = \eta\, (t_j - o_j)(1-o_j)o_j\, x_{ji}. \tag{17} \end{eqnarray}

Now consider the case when $j$ is a hidden unit. As before, we make the following two important observations:

1. For each unit $k$ downstream from $j$, $z_k$ is a function of $z_j$.
2. The contribution to the error by all units $l \ne j$ in the same layer as $j$ is independent of $w_{ji}$.

We want to calculate $\frac{\partial E}{\partial w_{ji}}$ for each input weight $w_{ji}$ of each hidden unit $j$. Note that $w_{ji}$ influences just $z_j$, which influences $o_j$, which influences $z_k$ for all $k \in Downstream(j)$, each of which influences $E$. So we can write

+
\begin{eqnarray*} \frac{\partial E}{\partial w_{ji}} &=& \sum_{k \in Downstream(j)} \frac{\partial E}{\partial z_k} \cdot \frac{\partial z_k}{\partial o_j} \cdot \frac{\partial o_j}{\partial z_j} \cdot \frac{\partial z_j}{\partial w_{ji}} \\ &=& \sum_{k \in Downstream(j)} \frac{\partial E}{\partial z_k} \cdot \frac{\partial z_k}{\partial o_j} \cdot \frac{\partial o_j}{\partial z_j} \cdot x_{ji} \end{eqnarray*}

Again note that all the terms except $x_{ji}$ in the above product are the same regardless of which input weight of unit $j$ we are trying to update. As before, we denote this common quantity by $\delta_j$. Also note that $\frac{\partial E}{\partial z_k} = \delta_k$, $\frac{\partial z_k}{\partial o_j} = w_{kj}$ and $\frac{\partial o_j}{\partial z_j} = o_j (1-o_j)$. Substituting,

+
\begin{eqnarray*} \delta_j &=& \sum_{k \in Downstream(j)} \frac{\partial E}{\partial z_k} \cdot \frac{\partial z_k}{\partial o_j} \cdot \frac{\partial o_j}{\partial z_j} \\ &=& \sum_{k \in Downstream(j)} \delta_k\, w_{kj}\, o_j (1-o_j) \end{eqnarray*}

Thus,

\begin{eqnarray} \delta_j = o_j (1-o_j) \sum_{k \in Downstream(j)} \delta_k w_{kj}. \tag{18} \end{eqnarray}

We are now in a position to state the Backpropagation algorithm formally.

Formal statement of the algorithm:

Stochastic Backpropagation(training examples, $\eta$, $n_i$, $n_h$, $n_o$)

Each training example is of the form $\langle \vec{x}, \vec{t} \rangle$, where $\vec{x}$ is the input vector and $\vec{t}$ is the target vector. $\eta$ is the learning rate (e.g., 0.05). $n_i$, $n_h$ and $n_o$ are the number of input, hidden and output nodes respectively. The input from unit $i$ to unit $j$ is denoted $x_{ji}$ and its weight is denoted by $w_{ji}$.
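The step-by-step algorithm box from the original page does not survive in text form here, so the following is only a minimal Python sketch of one stochastic update for a single hidden layer of sigmoid units, consistent with the update rules (17) and (18) derived above; all names and shapes are illustrative, and bias terms are omitted for brevity:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def stochastic_backprop_step(x, t, W_hidden, W_output, eta=0.05):
    """One stochastic update; W_hidden is (n_h, n_i) and
    W_output is (n_o, n_h)."""
    # Propagate the input forward in the usual way.
    o_hidden = sigmoid(W_hidden @ x)        # hidden outputs o_j
    o_out = sigmoid(W_output @ o_hidden)    # network outputs o_k
    # For output units: delta_k = dE/dz_k = -(t_k - o_k)(1 - o_k) o_k.
    delta_out = -(t - o_out) * (1.0 - o_out) * o_out
    # For hidden units (Eq. 18): delta_j = o_j (1 - o_j) sum_k delta_k w_kj.
    delta_hidden = o_hidden * (1.0 - o_hidden) * (W_output.T @ delta_out)
    # Weight updates (Eq. 17): Delta w_ji = -eta * delta_j * x_ji.
    W_output -= eta * np.outer(delta_out, o_hidden)
    W_hidden -= eta * np.outer(delta_hidden, x)
    return W_hidden, W_output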

Anand Venkataraman, 1999-09-16
+ + diff --git a/ML_Mathematical_Approach/07_week-5-lecture-notes/01__resources.html b/ML_Mathematical_Approach/07_week-5-lecture-notes/01__resources.html new file mode 100644 index 0000000..235b671 --- /dev/null +++ b/ML_Mathematical_Approach/07_week-5-lecture-notes/01__resources.html @@ -0,0 +1,1377 @@ + + +

+ ML:Neural Networks: Learning +

+

+ Cost Function +

+

+ Let's first define a few variables that we will need to use: +

+

+ a) L= total number of layers in the network +

+

+ b) $$s_l$$ = number of units (not counting bias unit) in layer l +

+

+ c) K= number of output units/classes +

+

Recall that in neural networks, we may have many output nodes. We denote by $$h_\Theta(x)_k$$ the hypothesis that results in the $$k^{th}$$ output.

+

+ Our cost function for neural networks is going to be a generalization of the one we used for logistic regression. +

+

+ Recall that the cost function for regularized logistic regression was: +

+ + + + +
+

+ $$J(\theta) = - \frac{1}{m} \sum_{i=1}^m \large[ y^{(i)}\ \log (h_\theta (x^{(i)})) + (1 - y^{(i)})\ \log (1 - h_\theta(x^{(i)}))\large] + \frac{\lambda}{2m}\sum_{j=1}^n \theta_j^2$$ +

+
+

+ For neural networks, it is going to be slightly more complicated: +

+ + + + +
+

+ $$\begin{gather*}\large J(\Theta) = - \frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K \left[y^{(i)}_k \log ((h_\Theta (x^{(i)}))_k) + (1 - y^{(i)}_k)\log (1 - (h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \Theta_{j,i}^{(l)})^2\end{gather*}$$ +

+
+

+ We have added a few nested summations to account for our multiple output nodes. In the first part of the equation, between the square brackets, we have an additional nested summation that loops through the number of output nodes. +

+

+ In the regularization part, after the square brackets, we must account for multiple theta matrices. The number of columns in our current theta matrix is equal to the number of nodes in our current layer (including the bias unit). The number of rows in our current theta matrix is equal to the number of nodes in the next layer (excluding the bias unit). As before with logistic regression, we square every term. +
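To make the summations concrete, here is a minimal Python sketch of this cost computation; the function name, the list-of-matrices representation, and the one-hot Y matrix are illustrative choices, not part of the course code:

import numpy as np

def nn_cost(thetas, X, Y, lam):
    """Regularized neural-network cost J(Theta).
    thetas: list of Theta matrices, one per layer transition;
    X: (m, n) inputs; Y: (m, K) one-hot labels; lam: lambda."""
    m = X.shape[0]
    a = X
    for Theta in thetas:
        a = np.hstack([np.ones((m, 1)), a])        # prepend bias units
        a = 1.0 / (1.0 + np.exp(-(a @ Theta.T)))   # sigmoid activations
    h = a                                          # (m, K) hypotheses
    # Double sum over examples and output units.
    J = -np.sum(Y * np.log(h) + (1 - Y) * np.log(1 - h)) / m
    # Regularization: square every weight except the bias column.
    J += (lam / (2 * m)) * sum(np.sum(T[:, 1:] ** 2) for T in thetas)
    return J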

+

+ Note: +

+ +

+ Backpropagation Algorithm +

+

+ "Backpropagation" is neural-network terminology for minimizing our cost function, just like what we were doing with gradient descent in logistic and linear regression. +

+

+ Our goal is to compute: +

+

+ $$\min_\Theta J(\Theta)$$ +

+

+ That is, we want to minimize our cost function J using an optimal set of parameters in theta. +

+

+ In this section we'll look at the equations we use to compute the partial derivative of J(Θ): +

+

+ $$\dfrac{\partial}{\partial \Theta_{i,j}^{(l)}}J(\Theta)$$ +

+

+ In back propagation we're going to compute for every node: +

+

+ $$\delta_j^{(l)}$$ = "error" of node j in layer l +

+

+ Recall that $$a_j^{(l)}$$ is activation node j in layer l. +

+

+ For the + + last layer + + , we can compute the vector of delta values with: +

+

+ $$\delta^{(L)} = a^{(L)} - y$$ +

+

+ Where L is our total number of layers and $$a^{(L)}$$ is the vector of outputs of the activation units for the last layer. So our "error values" for the last layer are simply the differences of our actual results in the last layer and the correct outputs in y. +

+

+ To get the delta values of the layers before the last layer, we can use an equation that steps us back from right to left: +

+ + + + +
+

+ $$\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ g'(z^{(l)})$$ +

+
+

+ The delta values of layer l are calculated by multiplying the delta values in the next layer with the theta matrix of layer l. We then element-wise multiply that with a function called g', or g-prime, which is the derivative of the activation function g evaluated with the input values given by z(l). +

+

+ The g-prime derivative terms can also be written out as: +

+ + + + +
+

+ $$g'(u) = g(u)\ .*\ (1 - g(u))$$ +

+
+

+ The full back propagation equation for the inner nodes is then: +

+ + + + +
+

+ $$\delta^{(l)} = ((\Theta^{(l)})^T \delta^{(l+1)})\ .*\ a^{(l)}\ .*\ (1 - a^{(l)})$$ +

+
+

+ A. Ng states that the derivation and proofs are complicated and involved, but you can still implement the above equations to do back propagation without knowing the details. +

+

+ We can compute our partial derivative terms by multiplying our activation values and our error values for each training example t: +

+ + + + +
+

+ $$\dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}} = \frac{1}{m}\sum_{t=1}^m a_j^{(t)(l)} {\delta}_i^{(t)(l+1)}$$ +

+
+

+ This however ignores regularization, which we'll deal with later. +

+

Note: $$\delta^{l+1}$$ and $$a^{l+1}$$ are vectors with $$s_{l+1}$$ elements. Similarly, $$\ a^{(l)}$$ is a vector with $$s_l$$ elements. Multiplying them produces a matrix that is $$s_{l+1}$$ by $$s_l$$, which is the same dimension as $$\Theta^{(l)}$$. That is, the process produces a gradient term for every element in $$\Theta^{(l)}$$. (Actually, $$\Theta^{(l)}$$ has $$s_{l}$$ + 1 columns, so the dimensionality is not exactly the same.)

+

+ We can now take all these equations and put them together into a backpropagation algorithm: +

+

+ + Back propagation Algorithm + +

+

Given training set $$\lbrace (x^{(1)}, y^{(1)}) \cdots (x^{(m)}, y^{(m)})\rbrace$$, set $$\Delta^{(l)}_{i,j} := 0$$ for all $$(l,i,j)$$.

For training example t = 1 to m: perform forward propagation to compute the activations $$a^{(l)}$$ for each layer, use $$y^{(t)}$$ to compute the delta terms $$\delta^{(l)}$$ as described above, and accumulate $$\Delta^{(l)}_{i,j} := \Delta^{(l)}_{i,j} + a_j^{(l)} \delta_i^{(l+1)}$$.

+ The capital-delta matrix is used as an "accumulator" to add up our values as we go along and eventually compute our partial derivative. +

+

The actual proof is quite involved, but the $$D^{(l)}_{i,j}$$ terms are the partial derivatives and the results we are looking for:

+

+ $$D_{i,j}^{(l)} = \dfrac{\partial J(\Theta)}{\partial \Theta_{i,j}^{(l)}}.$$ +
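Putting the accumulator and the delta equations together, here is a minimal Python sketch for one hidden layer (the course itself uses Octave; the names, shapes and one-hot Y matrix here are illustrative):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nn_gradients(Theta1, Theta2, X, Y, lam):
    """Accumulate backpropagation gradients over m examples.
    Theta1: (s2, n+1); Theta2: (K, s2+1); X: (m, n); Y: (m, K) one-hot."""
    m = X.shape[0]
    Delta1 = np.zeros_like(Theta1)
    Delta2 = np.zeros_like(Theta2)
    for t in range(m):
        # Forward propagation, prepending the bias unit at each layer.
        a1 = np.concatenate(([1.0], X[t]))
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))
        a3 = sigmoid(Theta2 @ a2)
        # Output error, then backpropagate (dropping the bias entry).
        d3 = a3 - Y[t]
        d2 = (Theta2.T @ d3)[1:] * sigmoid(z2) * (1.0 - sigmoid(z2))
        # Accumulate Delta^(l) += delta^(l+1) * (a^(l))^T.
        Delta2 += np.outer(d3, a2)
        Delta1 += np.outer(d2, a1)
    # Average, and regularize every column except the bias column (j = 0).
    D1, D2 = Delta1 / m, Delta2 / m
    D1[:, 1:] += (lam / m) * Theta1[:, 1:]
    D2[:, 1:] += (lam / m) * Theta2[:, 1:]
    return D1, D2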

+

+ Backpropagation Intuition +

+

+ The cost function is: +

+ + + + +
+

$$\begin{gather*}J(\theta) = - \frac{1}{m} \sum_{t=1}^m\sum_{k=1}^K \left[ y^{(t)}_k \ \log (h_\theta (x^{(t)}))_k + (1 - y^{(t)}_k)\ \log (1 - h_\theta(x^{(t)})_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1} \sum_{i=1}^{s_l} \sum_{j=1}^{s_{l+1}} ( \theta_{j,i}^{(l)})^2\end{gather*}$$

+
+

+ If we consider simple non-multiclass classification (k = 1) and disregard regularization, the cost is computed with: +

+ + + + +
+

+ $$cost(t) =y^{(t)} \ \log (h_\theta (x^{(t)})) + (1 - y^{(t)})\ \log (1 - h_\theta(x^{(t)}))$$ +

+
+

+ More intuitively you can think of that equation roughly as: +

+ + + + +
+

+ $$cost(t) \approx (h_\theta(x^{(t)})-y^{(t)})^2$$ +

+
+

+ Intuitively, $$\delta_j^{(l)}$$ is the "error" for $$a^{(l)}_j$$ (unit j in layer l) +

+

+ More formally, the delta values are actually the derivative of the cost function: +

+ + + + +
+

+ $$\delta_j^{(l)} = \dfrac{\partial}{\partial z_j^{(l)}} cost(t)$$ +

+
+

+ Recall that our derivative is the slope of a line tangent to the cost function, so the steeper the slope the more incorrect we are. +

+

+ Note: In lecture, sometimes i is used to index a training example. Sometimes it is used to index a unit in a layer. In the Back Propagation Algorithm described here, t is used to index a training example rather than overloading the use of i. +

+

+ Implementation Note: Unrolling Parameters +

+

+ With neural networks, we are working with sets of matrices: +

+ + + + +
+

+ $$\begin{align*} +\Theta^{(1)}, \Theta^{(2)}, \Theta^{(3)}, \dots \newline +D^{(1)}, D^{(2)}, D^{(3)}, \dots +\end{align*}$$ +

+
+

+ In order to use optimizing functions such as "fminunc()", we will want to "unroll" all the elements and put them into one long vector: +

+
thetaVector = [ Theta1(:); Theta2(:); Theta3(:); ]
deltaVector = [ D1(:); D2(:); D3(:) ]
+

If Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11, then we can get back our original matrices from the "unrolled" versions as follows:

+
Theta1 = reshape(thetaVector(1:110),10,11)
Theta2 = reshape(thetaVector(111:220),10,11)
Theta3 = reshape(thetaVector(221:231),1,11)
+
+

+ NOTE: The lecture slides show an example neural network with 3 layers. However, + + 3 + + theta matrices are defined: Theta1, Theta2, Theta3. There should be only 2 theta matrices: Theta1 (10 x 11), Theta2 (1 x 11). +

+

+ Gradient Checking +

+

+ Gradient checking will assure that our backpropagation works as intended. +

+

+ We can approximate the derivative of our cost function with: +

+

+ $$\dfrac{\partial}{\partial\Theta}J(\Theta) \approx \dfrac{J(\Theta + \epsilon) - J(\Theta - \epsilon)}{2\epsilon}$$ +

+

+ With multiple theta matrices, we can approximate the derivative + + with respect to + + $$Θ_j$$ as follows: +

+

+ $$\dfrac{\partial}{\partial\Theta_j}J(\Theta) \approx \dfrac{J(\Theta_1, \dots, \Theta_j + \epsilon, \dots, \Theta_n) - J(\Theta_1, \dots, \Theta_j - \epsilon, \dots, \Theta_n)}{2\epsilon}$$ +

+

A good small value for $${\epsilon}$$ (epsilon) makes the approximation above accurate; if the value is much smaller, we may end up with numerical problems. Professor Ng usually uses the value $${\epsilon = 10^{-4}}$$.

+

We only add or subtract epsilon from the $$\Theta_j$$ matrix. In Octave we can do it as follows:

+
epsilon = 1e-4;
for i = 1:n,
  thetaPlus = theta;
  thetaPlus(i) += epsilon;
  thetaMinus = theta;
  thetaMinus(i) -= epsilon;
  gradApprox(i) = (J(thetaPlus) - J(thetaMinus))/(2*epsilon);
end;
+
+

+ We then want to check that gradApprox ≈ deltaVector. +

+

Once you've verified (a single time) that your backpropagation algorithm is correct, you don't need to compute gradApprox again. The code to compute gradApprox is very slow.

+

+ Random Initialization +

+

+ Initializing all theta weights to zero does not work with neural networks. When we backpropagate, all nodes will update to the same value repeatedly. +

+

+ Instead we can randomly initialize our weights: +

+

Initialize each $$\Theta^{(l)}_{ij}$$ to a random value in $$[-\epsilon,\epsilon]$$:

+

+ $$\epsilon = \dfrac{\sqrt{6}}{\sqrt{\mathrm{Loutput} + \mathrm{Linput}}}$$ +

+

+ $$\Theta^{(l)} = 2 \epsilon \; \mathrm{rand}(\mathrm{Loutput}, \mathrm{Linput} + 1) - \epsilon$$ +

+
If Theta1 is 10x11, Theta2 is 10x11 and Theta3 is 1x11:

Theta1 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta2 = rand(10,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
Theta3 = rand(1,11) * (2 * INIT_EPSILON) - INIT_EPSILON;
+
+

+ rand(x,y) will initialize a matrix of random real numbers between 0 and 1. (Note: this epsilon is unrelated to the epsilon from Gradient Checking) +

+

+ Why use this method? This paper may be useful: + + https://web.stanford.edu/class/ee373b/nninitialization.pdf + +

+

+ Putting it Together +

+

+ First, pick a network architecture; choose the layout of your neural network, including how many hidden units in each layer and how many layers total. +

+ +

+ + Training a Neural Network + +

+
1. Randomly initialize the weights
2. Implement forward propagation to get $$h_\theta(x^{(i)})$$
3. Implement the cost function
4. Implement backpropagation to compute the partial derivatives
5. Use gradient checking to confirm that your backpropagation works, then disable gradient checking
6. Use gradient descent or a built-in optimization function to minimize the cost function with the weights in theta

+ When we perform forward and back propagation, we loop on every training example: +

+
for i = 1:m,
   % Perform forward propagation and backpropagation using example (x(i), y(i)),
   % getting activations a(l) and delta terms d(l) for l = 2,...,L.
end;

+ Bonus: Tutorial on How to classify your own images of digits +

+

This tutorial will guide you on how to use the classifier provided in exercise 3 to classify your own images, like this:

+

+

+ +

It will also explain how the images are converted through several formats to be processed and displayed.

+

+ Introduction +

+

The classifier provided expects 20 x 20 pixel black and white images converted into a row vector of 400 real numbers, like this:

+
[ 0.14532, 0.12876, ...]
+

Each pixel is represented by a real number between -1.0 and 1.0, where -1.0 is black and 1.0 is white (any number in between is a shade of gray; 0.0 is exactly middle gray).

+

+ + .jpg and color RGB images + +

+

The most common image format that can be read by Octave is .jpg, using the imread() function, which outputs a three-dimensional matrix of integers from 0 to 255, representing height x width x 3 color values for each pixel (explaining color encodings in detail is beyond scope).

+
Image3DmatrixRGB = imread("myOwnPhoto.jpg");
+

+ + Convert to Black & White + +

+

A common way to convert color images to black & white is to convert them to the YIQ standard and keep only the Y component, which represents the luma information (black & white); I and Q represent the chrominance information (color). Octave has a function rgb2ntsc() that outputs a similar three-dimensional matrix, but of real numbers from -1.0 to 1.0, representing the height x width x 3 (Y luma, I in-phase, Q quadrature) intensities for each pixel.

+
Image3DmatrixYIQ = rgb2ntsc(MyImageRGB);
+

+ To obtain the Black & White component just discard the I and Q matrices. This leaves a two-dimensional matrix of real numbers from -1.0 to 1.0 representing the height x width pixels black & white values. +

+
Image2DmatrixBW = Image3DmatrixYIQ(:,:,1);
+

+ + Cropping to square image + +

+

+ It is useful to crop the original image to be as square as possible. The way to crop a matrix is by selecting an area inside the original B&W image and copy it to a new matrix. This is done by selecting the rows and columns that define the area. In other words, it is copying a rectangular subset of the matrix like this: +

+
croppedImage = Image2DmatrixBW(origin1:size1, origin2:size2);
+
+

Cropping does not have to be all the way to a square. It could be cropping just a percentage of the way to a square, so you can leave more of the image intact. The next step of scaling will take care of stretching the image to fit a square.

+

+ + Scaling to 20 x 20 pixels + +

+

The classifier provided was trained with 20 x 20 pixel images, so we need to scale our photos to match. This may cause distortion depending on the height and width ratio of the cropped original photo. There are many ways to scale a photo, but we are going to use the simplest one: we lay a scaled grid of 20 x 20 over the original photo and take a sample pixel at the center of each grid cell. To lay a scaled grid, we compute two vectors of 20 indexes each, evenly spaced over the original size of the image: one for the height and one for the width. For example, an image of 320 x 200 pixels will produce two vectors like

+
[9    25    41    57    73 ... 313] % 20 indexes
+
[6    16    26    36    46 ... 196] % 20 indexes
+

Copy the value of each pixel located by the grid of these indexes to a new matrix, ending up with a matrix of 20 x 20 real numbers.

+

+ + Black & White to Gray & White + +

+

The classifier provided was trained with images of white digits over a gray background. Specifically, the 20 x 20 matrix of real numbers ranges ONLY from 0.0 to 1.0, instead of the complete black & white range of -1.0 to 1.0. This means we have to normalize our photos to the range 0.0 to 1.0 for this classifier to work. But we also invert the black and white colors, because it is easier to "draw" black over white on our photos and we need to get white digits. So in short, we invert black and white and stretch black to gray.

+

+ + Rotation of image + +

+

Sometimes our photos are automatically rotated, as on our cellular phones. The classifier provided cannot recognize rotated images, so we may need to rotate them back. This can be done with the Octave function rot90(), like this.

+
ImageAligned = rot90(Image, rotationStep);
+

Where rotationStep is an integer: -1 means rotate 90 degrees CCW and 1 means rotate 90 degrees CW.

+

+ Approach +

+
1. Have a function that converts our photo to the format the classifier is expecting, as if it were just a sample from the training data set.
2. Use the classifier to predict the digit in the converted image.

+ Code step by step +

+

Define the function name, the output variable and three parameters: one for the filename of our photo, one optional cropping percentage (if not provided, defaults to zero, meaning no cropping) and one optional rotation of the image (if not provided, defaults to zero, meaning no rotation).

+
function vectorImage = imageTo20x20Gray(fileName, cropPercentage=0, rotStep=0)
+
+

Read the file as an RGB image and convert it to a black & white 2D matrix (see the introduction).

+
% Read as RGB image
+Image3DmatrixRGB = imread(fileName);
+% Convert to NTSC image (YIQ)
+Image3DmatrixYIQ = rgb2ntsc(Image3DmatrixRGB );
+% Convert to grays keeping only luminance (Y)
+%        ...and discard chrominance (IQ)
+Image2DmatrixBW  = Image3DmatrixYIQ(:,:,1);
+
+

+ Establish the final size of the cropped image. +

+
% Get the size of your image
+oldSize = size(Image2DmatrixBW);
+% Obtain crop size toward centered square (cropDelta)
+% ...will be zero for the already minimum dimension
+% ...and if the cropPercentage is zero, 
+% ...both dimensions are zero
+% ...meaning that the original image will go intact to croppedImage
+cropDelta = floor((oldSize - min(oldSize)) .* (cropPercentage/100));
+% Compute the desired final pixel size for the original image
+finalSize = oldSize - cropDelta;
+
+

Obtain the origin and number of the columns and rows to be copied to the cropped image.

+
% Compute each dimension origin for cropping
+cropOrigin = floor(cropDelta / 2) + 1;
+% Compute each dimension copying size
+copySize = cropOrigin + finalSize - 1;
+% Copy just the desired cropped image from the original B&W image
+croppedImage = Image2DmatrixBW( ...
+                    cropOrigin(1):copySize(1), cropOrigin(2):copySize(2));
+
+

Compute the scale and compute back the new size. This last step is extra: the size is computed back so the code stays general for future modification of the classifier size. For example, if the classifier changes from 20 x 20 pixels to 30 x 30, then we only need to change the line of code where the scale is computed.

+
% Resolution scale factors: [rows cols]
+scale = [20 20] ./ finalSize;
+% Compute back the new image size (extra step to keep code general)
+newSize = max(floor(scale .* finalSize),1); 
+
+

+ Compute two sets of 20 indexes evenly spaced. One over the original height and one over the original width of the image. +

+
% Compute a re-sampled set of indices:
+rowIndex = min(round(((1:newSize(1))-0.5)./scale(1)+0.5), finalSize(1));
+colIndex = min(round(((1:newSize(2))-0.5)./scale(2)+0.5), finalSize(2));
+
+

Copy just the indexed values from the old image to get a new image of 20 x 20 real numbers. This is called "sampling" because it copies just a sample pixel indexed by the grid; all the sample pixels make up the new image.

+
% Copy just the indexed values from old image to get new image
+newImage = croppedImage(rowIndex,colIndex,:);
+
+

+ Rotate the matrix using the + + rot90() + + function with the rotStep parameter: -1 is CCW, 0 is no rotate, 1 is CW. +

+
% Rotate if needed: -1 is CCW, 0 is no rotate, 1 is CW
+newAlignedImage = rot90(newImage, rotStep);
+
+

+ Invert black and white because it is easier to draw black digits over white background in our photos but the classifier needs white digits. +

+
% Invert black and white
+invertedImage = - newAlignedImage;
+
+

+ Find the min and max gray values in the image and compute the total value range in preparation for normalization. +

+
% Find min and max grays values in the image
+maxValue = max(invertedImage(:));
+minValue = min(invertedImage(:));
+% Compute the value range of actual grays
+delta = maxValue - minValue;
+
+

Do the normalization so all values end up between 0.0 and 1.0, because this particular classifier does not perform well with negative numbers.

+
% Normalize grays between 0 and 1
+normImage = (invertedImage - minValue) / delta;
+

Add some contrast to the image. The multiplication factor is the contrast control; you can increase it if desired to obtain sharper contrast (contrast only between gray and white, since black was already removed in normalization).

+
% Add contrast. Multiplication factor is contrast control.
+contrastedImage = sigmoid((normImage -0.5) * 5);
+
+

Show the image, specifying the black & white range [-1 1] to avoid automatic ranging based on the gray-to-white range of the image values. Showing the photo with a different range does not affect the values in the output matrix, so it does not affect the classifier. It is only visual feedback for the user.

+
% Show image as seen by the classifier
+imshow(contrastedImage, [-1, 1] );
+

Finally, output the matrix as an unrolled vector to be compatible with the classifier.

+
% Output the matrix as an unrolled vector
vectorImage = reshape(contrastedImage, 1, newSize(1) * newSize(2));
+

+ End function. +

+
end;
+

+

+

+ Usage samples +

+

+ Single photo +

+ +

+ Multiple photos +

+ +

+ Tips +

+ +

+ Complete code (just copy and paste) +

+

+

+
function vectorImage = imageTo20x20Gray(fileName, cropPercentage=0, rotStep=0)
+%IMAGETO20X20GRAY display reduced image and converts for digit classification
+%
+% Sample usage: 
+%       imageTo20x20Gray('myDigit.jpg', 100, -1);
+%
+%       First parameter: Image file name
+%             Could be bigger than 20 x 20 px, it will
+%             be resized to 20 x 20. Better if used with
+%             square images but not required.
+% 
+%       Second parameter: cropPercentage (any number between 0 and 100)
+%             0  0% will be cropped (optional, no needed for square images)
+%            50  50% of available croping will be cropped
+%           100  crop all the way to square image (for rectangular images)
+% 
+%       Third parameter: rotStep
+%            -1  rotate image 90 degrees CCW
+%             0  do not rotate (optional)
+%             1  rotate image 90 degrees CW
+%
+% (Thanks to Edwin Frühwirth for parts of this code)
+% Read as RGB image
+Image3DmatrixRGB = imread(fileName);
+% Convert to NTSC image (YIQ)
+Image3DmatrixYIQ = rgb2ntsc(Image3DmatrixRGB );
+% Convert to grays keeping only luminance (Y) and discard chrominance (IQ)
+Image2DmatrixBW  = Image3DmatrixYIQ(:,:,1);
+% Get the size of your image
+oldSize = size(Image2DmatrixBW);
+% Obtain crop size toward centered square (cropDelta)
+% ...will be zero for the already minimum dimension
+% ...and if the cropPercentage is zero, 
+% ...both dimensions are zero
+% ...meaning that the original image will go intact to croppedImage
+cropDelta = floor((oldSize - min(oldSize)) .* (cropPercentage/100));
+% Compute the desired final pixel size for the original image
+finalSize = oldSize - cropDelta;
+% Compute each dimension origin for cropping
+cropOrigin = floor(cropDelta / 2) + 1;
+% Compute each dimension copying size
+copySize = cropOrigin + finalSize - 1;
+% Copy just the desired cropped image from the original B&W image
+croppedImage = Image2DmatrixBW( ...
+                    cropOrigin(1):copySize(1), cropOrigin(2):copySize(2));
+% Resolution scale factors: [rows cols]
+scale = [20 20] ./ finalSize;
+% Compute back the new image size (extra step to keep code general)
+newSize = max(floor(scale .* finalSize),1); 
+% Compute a re-sampled set of indices:
+rowIndex = min(round(((1:newSize(1))-0.5)./scale(1)+0.5), finalSize(1));
+colIndex = min(round(((1:newSize(2))-0.5)./scale(2)+0.5), finalSize(2));
+% Copy just the indexed values from old image to get new image
+newImage = croppedImage(rowIndex,colIndex,:);
+% Rotate if needed: -1 is CCW, 0 is no rotate, 1 is CW
+newAlignedImage = rot90(newImage, rotStep);
+% Invert black and white
+invertedImage = - newAlignedImage;
+% Find min and max grays values in the image
+maxValue = max(invertedImage(:));
+minValue = min(invertedImage(:));
+% Compute the value range of actual grays
+delta = maxValue - minValue;
+% Normalize grays between 0 and 1
+normImage = (invertedImage - minValue) / delta;
+% Add contrast. Multiplication factor is contrast control.
+contrastedImage = sigmoid((normImage -0.5) * 5);
+% Show image as seen by the classifier
+imshow(contrastedImage, [-1, 1] );
+% Output the matrix as a unrolled vector
+vectorImage = reshape(contrastedImage, 1, newSize(1)*newSize(2));
+end
+

+ Photo Gallery +

+

+ Digit 2 +

+ +

+ + Digit 6 + +

+ +

Digit 6 inverted is digit 9. This is the same photo of a six, but rotated. The contrast multiplier was also changed from 5 to 20; you can see that the gray background is smoother.

+

+

+ +

+ + Digit 3 + +

+

+

+ +

+ Explanation of Derivatives Used in Backpropagation +

+ +

+ $$\delta^{(L)} = \frac{\partial J(\theta)}{\partial a^{(L)}} \frac{\partial a^{(L)}}{\partial z^{(L)}}$$ +

+

+ $$\delta^{(L)} = (\frac{1-y}{1-a^{(L)}} - \frac{y}{a^{(L)}}) (a^{(L)}(1-a^{(L)}))$$ +

+

+ $$\delta^{(L)} =a^{(L)} - y$$ +

+ +

+ $$\delta^{(3)} =a^{(3)} - y$$ +

+ +

+ $$\frac{\partial z^{(L)}}{\partial \theta^{(L-1)}} = a^{(L-1)}$$ +

+ +

+ $$\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial \theta^{(L-1)}}$$ +

+

+ $$\frac{\partial J(\theta)}{\partial \theta^{(L-1)}} = (a^{(L)} - y) (a^{(L-1)})$$ +

+ +

+ $$\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \delta^{(L-1)} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}$$ +

+ +

+ $$\frac{\partial z^{(L)}}{\partial a^{(L-1)}} = \theta^{(L-1)}$$ +

+ +

+ $$\delta^{(L-1)} = \delta^{(L)} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}$$ +

+

+ $$\delta^{(L-1)} = \delta^{(L)} (\theta^{(L-1)}) (a^{(L-1)}(1-a^{(L-1)}))$$ +

+

+ $$\delta^{(L-1)} = \delta^{(L)} \theta^{(L-1)} a^{(L-1)}(1-a^{(L-1)})$$ +

+ +

+ $$\delta^{(2)} = \delta^{(3)} \theta^{(2)} a^{(2)}(1-a^{(2)})$$ +

+ +

+ $$\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = \delta^{(L-1)} \frac{\partial z^{(L-1)}}{\partial \theta^{(L-2)}}$$ +

+

+ $$\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = (\delta^{(L)} \frac{\partial z^{(L)}}{\partial a^{(L-1)}} \frac{\partial a^{(L-1)}}{\partial z^{(L-1)}}) (a^{(L-2)})$$ +

+

+ $$\frac{\partial J(\theta)}{\partial \theta^{(L-2)}} = ((a^{(L)} - y) (\theta^{(L-1)})(a^{(L-1)}(1-a^{(L-1)}))) (a^{(L-2)})$$ +
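+
+ As a concrete illustration, here is a minimal Octave sketch of these delta computations for a single training example in a 3-layer sigmoid network. It assumes forward propagation has already produced activations a1, a2, a3 (with a1 and a2 including their bias units); the variable names are illustrative, not from the exercise code.
+
+delta3 = a3 - y;                                  % delta(L) = a(L) - y
+delta2 = (Theta2' * delta3) .* (a2 .* (1 - a2));  % delta(L-1) via the chain rule
+delta2 = delta2(2:end);                           % drop the bias-unit component
+Theta2_grad = delta3 * a2';                       % dJ/dTheta(L-1) = delta(L) * a(L-1)'
+Theta1_grad = delta2 * a1';                       % dJ/dTheta(L-2) = delta(L-1) * a(L-2)'
+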

+

+ NN for linear systems +

+

+ Introduction +

+

+ The NN we created for classification can easily be modified to have a linear output. First solve the 4th programming exercise. You can create a new function script, nnCostFunctionLinear.m, with the following characteristics +

+ The hidden layer uses tanh() as its activation function (so its gradient is $$1 - \tanh^2(z)$$), the output layer is linear (no sigmoid), and the cost is the squared error of the outputs plus the usual regularization term.

+ You still need to randomly initialize the Theta values, just as with any NN. You will want to experiment with different epsilon values. You will also need to create a predictLinear() function, using the tanh() function in the hidden layer, and a linear output. +

+

+ Testing your linear NN +

+

+ Here is a test case for your nnCostFunctionLinear() +

+
% inputs
+nn_params = [31 16 15 -29 -13 -8 -7 13 54 -17 -11 -9 16]'/ 10;
+il = 1;
+hl = 4;
+X = [1; 2; 3];
+y = [1; 4; 9];
+lambda = 0.01;
+
+% command
+[j g] = nnCostFunctionLinear(nn_params, il, hl, X, y, lambda)
+
+% results
+j =  0.020815
+g =
+    -0.0131002
+    -0.0110085
+    -0.0070569
+     0.0189212
+    -0.0189639
+    -0.0192539
+    -0.0102291
+     0.0344732
+     0.0024947
+     0.0080624
+     0.0021964
+     0.0031675
+    -0.0064244
+
+

+ Now create a script that uses the 'ex5data1.mat' from ex5, but without creating the polynomial terms. With 8 units in the hidden layer and MaxIter set to 200, you should be able to get a final cost value of 0.3 to 0.4. The results will vary a bit due to the random Theta initialization. If you plot the training set and the predicted values for the training set (using your predictLinear() function), you should have a good match. +

+

+ Deriving the Sigmoid Gradient Function +

+

+ We let the sigmoid function be $$ \sigma(x) = \frac{1}{1 + e^{-x}}$$ +

+

+ Differentiating the equation above with respect to x yields $$ -(\frac{1}{1 + e^{-x}})^2 \frac {d}{dx} (1 + e^{-x})$$

+

+ Which is equal to $$ -(\frac{1}{1 + e^{-x}})^2 e^{-x} (-1)$$

+

+ $$ (\frac{1}{1 + e^{-x}}) (\frac{1}{1 + e^{-x}}) (e^{-x})$$

+

+ $$ (\frac{1}{1 + e^{-x}}) (\frac{e^{-x}}{1 + e^{-x}})$$ and, since $$\frac{e^{-x}}{1 + e^{-x}} = 1 - \frac{1}{1 + e^{-x}}$$, this is

+

+ $$ \sigma(x)(1- \sigma(x))$$ +
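+
+ This result is a one-liner to implement. A minimal Octave sketch of the sigmoidGradient function used in the exercise (the sigmoid is inlined so the sketch is self-contained):
+
+function g = sigmoidGradient(z)
+% Gradient of the sigmoid evaluated at z; works element-wise on matrices.
+  s = 1.0 ./ (1.0 + exp(-z));  % sigmoid(z)
+  g = s .* (1 - s);            % sigma(z) * (1 - sigma(z))
+end
+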

+


+

+
+ + + diff --git a/ML_Mathematical_Approach/08_week-6-lecture-notes/01__Bias-Variance.pdf b/ML_Mathematical_Approach/08_week-6-lecture-notes/01__Bias-Variance.pdf new file mode 100644 index 0000000..f654c6b Binary files /dev/null and b/ML_Mathematical_Approach/08_week-6-lecture-notes/01__Bias-Variance.pdf differ diff --git a/ML_Mathematical_Approach/08_week-6-lecture-notes/01__Chap9.Part2.pdf b/ML_Mathematical_Approach/08_week-6-lecture-notes/01__Chap9.Part2.pdf new file mode 100644 index 0000000..9abe9c2 Binary files /dev/null and b/ML_Mathematical_Approach/08_week-6-lecture-notes/01__Chap9.Part2.pdf differ diff --git a/ML_Mathematical_Approach/08_week-6-lecture-notes/01__resources.html b/ML_Mathematical_Approach/08_week-6-lecture-notes/01__resources.html new file mode 100644 index 0000000..d28d390 --- /dev/null +++ b/ML_Mathematical_Approach/08_week-6-lecture-notes/01__resources.html @@ -0,0 +1,1091 @@ + + +

+ ML:Advice for Applying Machine Learning +

+

+ Deciding What to Try Next +

+

+ Errors in your predictions can be troubleshot by:

+    1. Getting more training examples
+    2. Trying smaller sets of features
+    3. Trying additional features
+    4. Trying polynomial features
+    5. Increasing or decreasing the regularization parameter λ

+ Don't just pick one of these avenues at random. We'll explore diagnostic techniques for choosing one of the above solutions in the following sections. +

+

+ Evaluating a Hypothesis +

+

+ A hypothesis may have low error for the training examples but still be inaccurate (because of overfitting). +

+

+ With a given dataset of training examples, we can split up the data into two sets: a + + training set + + and a + + test set + + . +

+

+ The new procedure using these two sets is then: +

+
+    1. Learn $$\Theta$$ and minimize $$J_{train}(\Theta)$$ using the training set
+    2. Compute the test set error $$J_{test}(\Theta)$$

+ The test set error +

+
+    1. For linear regression: $$J_{test}(\Theta) = \dfrac{1}{2m_{test}} \sum_{i=1}^{m_{test}}(h_\Theta(x^{(i)}_{test}) - y^{(i)}_{test})^2$$
+    2. For classification ~ Misclassification error (aka 0/1 misclassification error):
+

+ $$err(h_\Theta(x),y) = \begin{cases} 1 & \text{if } h_\Theta(x) \geq 0.5 \text{ and } y = 0, \text{ or } h_\Theta(x) < 0.5 \text{ and } y = 1 \newline 0 & \text{otherwise} \end{cases}$$

+
+

+ This gives us a binary 0 or 1 error result based on a misclassification. +

+

+ The average test error for the test set is +

+

+ $$\text{Test Error} = \dfrac{1}{m_{test}} \sum^{m_{test}}_{i=1} err(h_\Theta(x^{(i)}_{test}), y^{(i)}_{test})$$ +

+

+ This gives us the proportion of the test data that was misclassified. +
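+
+ In Octave this average is a one-liner; a sketch, where predictions and ytest are assumed names for the 0/1 predicted and actual labels:
+
+test_error = mean(double(predictions ~= ytest));  % fraction of test examples misclassified
+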

+

+ Model Selection and Train/Validation/Test Sets +

+ +

+ In order to choose the model of your hypothesis, you can test each degree of polynomial and look at the error result. +

+

+ + Without the Validation Set (note: this is a bad method - do not use it) + +

+
+    1. Optimize the parameters in Θ using the training set for each polynomial degree.
+    2. Find the polynomial degree d with the least error using the test set.
+    3. Estimate the generalization error also using the test set with $$J_{test}(\Theta^{(d)})$$ (d = theta from the polynomial with the lowest error).

+ In this case, we have trained one variable, d, the degree of the polynomial, using the test set. The test error is therefore an optimistic estimate: on any other set of data the error will generally be greater.

+

+ + Use of the CV set + +

+

+ To solve this, we can introduce a third set, the + + Cross Validation Set + + , to serve as an intermediate set that we can train d with. Then our test set will give us an accurate, non-optimistic error. +

+

+ One example way to break down our dataset into the three sets is: +

+    1. Training set: 60%
+    2. Cross validation set: 20%
+    3. Test set: 20%

+ We can now calculate three separate error values for the three different sets. +

+

+ + With the Validation Set (note: this method presumes we do not also use the CV set for regularization) + +

+
+    1. Optimize the parameters in Θ using the training set for each polynomial degree.
+    2. Find the polynomial degree d with the least error using the cross validation set.
+    3. Estimate the generalization error using the test set with $$J_{test}(\Theta^{(d)})$$ (d = theta from the polynomial with the lowest error).

+ This way, the degree of the polynomial d has not been trained using the test set. +

+

+ (Mentor note: be aware that using the CV set to select 'd' means that we cannot also use it for the validation curve process of setting the lambda value). +

+

+ Diagnosing Bias vs. Variance +

+

+ In this section we examine the relationship between the degree of the polynomial d and the underfitting or overfitting of our hypothesis. +

+ We need to distinguish whether bias or variance is the problem contributing to bad predictions. High bias is underfitting and high variance is overfitting. We need to find a golden mean between these two.

+ The training error will tend to + + decrease + + as we increase the degree d of the polynomial. +

+

+ At the same time, the cross validation error will tend to + + decrease + + as we increase d up to a point, and then it will + + increase + + as d is increased, forming a convex curve. +

+

+ + High bias (underfitting) + + : both $$J_{train}(\Theta)$$ and $$J_{CV}(\Theta)$$ will be high. Also, $$J_{CV}(\Theta) \approx J_{train}(\Theta)$$. +

+

+ + High variance (overfitting) + + : $$J_{train}(\Theta)$$ will be low and $$J_{CV}(\Theta)$$ will be much greater than $$J_{train}(\Theta)$$.

+

+ This is represented in the figure below:

+

+

+ +

+ Regularization and Bias/Variance +

+

+ Instead of looking at the degree d contributing to bias/variance, now we will look at the regularization parameter λ. +

+ +

+ A large lambda heavily penalizes all the Θ parameters, which greatly simplifies the resulting function and so causes underfitting.

+

+ The relationship of λ to the training set and the cross validation set is as follows:

+

+ + Low λ + + : $$J_{train}(\Theta)$$ is low and $$J_{CV}(\Theta)$$ is high (high variance/overfitting). +

+

+ + Intermediate λ + + : $$J_{train}(\Theta)$$ and $$J_{CV}(\Theta)$$ are somewhat low and $$J_{train}(\Theta) \approx J_{CV}(\Theta)$$. +

+

+ + Large λ + + : both $$J_{train}(\Theta)$$ and $$J_{CV}(\Theta)$$ will be high (underfitting /high bias) +

+

+ The figure below illustrates the relationship between lambda and the hypothesis: +

+

+

+ +

+ In order to choose the model and the regularization λ, we need: +

+
+    1. Create a list of lambdas (i.e. λ∈{0,0.01,0.02,0.04,0.08,0.16,0.32,0.64,1.28,2.56,5.12,10.24});
+    2. Create a set of models with different degrees or any other variants.
+    3. Iterate through the $$\lambda$$s and for each $$\lambda$$ go through all the models to learn some $$\Theta$$.
+    4. Compute the cross validation error using the learned Θ (computed with λ) on the $$J_{CV}(\Theta)$$ without regularization, i.e. with λ = 0.
+    5. Select the best combo that produces the lowest error on the cross validation set.
+    6. Using the best combo Θ and λ, apply it on $$J_{test}(\Theta)$$ to see if it has a good generalization of the problem.
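+
+ A rough Octave sketch of steps 3 to 5 for a single model (trainLinearReg and linearRegCostFunction stand in for whatever training and cost routines you are using; all names here are illustrative):
+
+lambdas = [0 0.01 0.02 0.04 0.08 0.16 0.32 0.64 1.28 2.56 5.12 10.24];
+best_err = Inf;
+for i = 1:length(lambdas)
+  theta = trainLinearReg(Xtrain, ytrain, lambdas(i));  % learn Theta with this lambda
+  err = linearRegCostFunction(Xcv, ycv, theta, 0);     % CV error, computed with lambda = 0
+  if err < best_err
+    best_err = err;
+    best_lambda = lambdas(i);
+  end
+end
+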

+

+ Learning Curves +

+

+ Training 3 examples will easily have 0 errors because we can always find a quadratic curve that exactly touches 3 points. +

+ As the training set gets larger, the error for a quadratic function increases. The error value will plateau out after a certain m, or training set size.

+ + With high bias + +

+

+ + Low training set size + + : causes $$J_{train}(\Theta)$$ to be low and $$J_{CV}(\Theta)$$ to be high. +

+

+ + Large training set size + + : causes both $$J_{train}(\Theta)$$ and $$J_{CV}(\Theta)$$ to be high with $$J_{train}(\Theta)$$≈$$J_{CV}(\Theta)$$. +

+

+ If a learning algorithm is suffering from + + high bias + + , getting more training data + + will not (by itself) help much + + . +

+

+ For high variance, we have the following relationships in terms of the training set size: +

+

+ + With high variance + +

+

+ + Low training set size + + : $$J_{train}(\Theta)$$ will be low and $$J_{CV}(\Theta)$$ will be high. +

+

+ + Large training set size + + : $$J_{train}(\Theta)$$ increases with training set size and $$J_{CV}(\Theta)$$ continues to decrease without leveling off. Also, $$J_{train}(\Theta)$$<$$J_{CV}(\Theta)$$ but the difference between them remains significant. +

+

+ If a learning algorithm is suffering from + + high variance + + , getting more training data is + + likely to help. + +

+

+ + +

+ +

+

+ +

+ Deciding What to Do Next Revisited +

+

+ Our decision process can be broken down as follows: +

+ Getting more training examples: Fixes high variance
+
+ Trying smaller sets of features: Fixes high variance
+
+ Trying additional features: Fixes high bias
+
+ Trying polynomial features: Fixes high bias
+
+ Decreasing λ: Fixes high bias
+
+ Increasing λ: Fixes high variance

+

+ Diagnosing Neural Networks +

+ A neural network with fewer parameters is prone to underfitting and is computationally cheaper. A larger neural network with more parameters is prone to overfitting and is computationally more expensive; in that case you can use regularization (increase λ) to address the overfitting.

+ Using a single hidden layer is a good starting default. You can train your neural network on a number of hidden layers using your cross validation set. +

+

+ Model Selection: +

+

+ Choosing M, the order of the polynomials.

+

+ How can we tell which parameters Θ to leave in the model (known as "model selection")? +

+

+ There are several ways to solve this problem: +

+ +

+ + Bias: approximation error (Difference between expected value and optimal value) + +

+ +

+ + Variance: estimation error due to finite data + +

+ +

+ + Intuition for the bias-variance trade-off: + +

+ +

+ One of the most important goals in learning: finding a model that is just right in the bias-variance trade-off. +

+

+ + Regularization Effects: + +

+ +

+ + Model Complexity Effects: + +

+ +

+ + A typical rule of thumb when running diagnostics is: + +

+ +

+ ML:Machine Learning System Design +

+

+ Prioritizing What to Work On +

+

+ Different ways we can approach a machine learning problem: +

+ +

+ It is difficult to tell which of the options will be helpful. +

+

+ Error Analysis +

+

+ The recommended approach to solving machine learning problems is: +

+ +

+ It's important to get error results as a single, numerical value. Otherwise it is difficult to assess your algorithm's performance. +

+

+ You may need to process your input before it is useful. For example, if your input is a set of words, you may want to treat the same word with different forms (fail/failing/failed) as one word, so you must use "stemming software" to recognize them all as one.

+

+ Error Metrics for Skewed Classes +

+

+ It is sometimes difficult to tell whether a reduction in error is actually an improvement of the algorithm. +

+ +

+ This usually happens with + + skewed classes + + ; that is, when our class is very rare in the entire data set. +

+

+ Or to say it another way, when we have a lot more examples from one class than from the other class.

+

+ For this we can use + + Precision/Recall + + . +

+ +

+ + Precision + + : of all patients for whom we predicted y=1, what fraction actually has cancer?

+ + + + +
+

+ $$\dfrac{\text{True Positives}}{\text{Total number of predicted positives}} += \dfrac{\text{True Positives}}{\text{True Positives}+\text{False positives}}$$ +

+
+

+ + Recall + + : Of all the patients that actually have cancer, what fraction did we correctly detect as having cancer? +

+ + + + +
+

+ $$\dfrac{\text{True Positives}}{\text{Total number of actual positives}}= \dfrac{\text{True Positives}}{\text{True Positives}+\text{False negatives}}$$ +

+
+

+ These two metrics give us a better sense of how our classifier is doing. We want both precision and recall to be high. +

+

+ In the example at the beginning of the section, if we classify all patients as 0, then our + + recall + + will be $$\dfrac{0}{0 + f} = 0$$, so despite having a lower error percentage, we can quickly see it has worse recall. +

+

+ Accuracy = $$\dfrac{\text{true positives} + \text{true negatives}}{\text{total population}}$$

+

+ Note 1: if an algorithm predicts only negatives, as it does in one of the exercises, the precision is not defined, since it is impossible to divide by 0. The F1 score will not be defined either.

+

+ Trading Off Precision and Recall +

+

+ We might want a + + confident + + prediction of two classes using logistic regression. One way is to increase our threshold: +

+ Predict 1 if: $$h_\theta(x) \geq 0.7$$. Predict 0 if: $$h_\theta(x) < 0.7$$.

+ This way, we only predict cancer if the patient has a 70% chance. +

+

+ Doing this, we will have + + higher precision + + but + + lower recall + + (refer to the definitions in the previous section). +

+

+ In the opposite example, we can lower our threshold: +

+ Predict 1 if: $$h_\theta(x) \geq 0.3$$. Predict 0 if: $$h_\theta(x) < 0.3$$.

+ That way, we get a very + + safe + + prediction. This will cause + + higher recall + + but + + lower precision + + . +

+

+ The greater the threshold, the greater the precision and the lower the recall. +

+

+ The lower the threshold, the greater the recall and the lower the precision. +

+

+ In order to turn these two metrics into one single number, we can take the + + F value + + . +

+

+ One way is to take the + + average + + : +

+

+ $$\dfrac{P+R}{2}$$ +

+

+ This does not work well. If we predict all y=0 then that will bring the average up despite having 0 recall. If we predict all examples as y=1, then the very high recall will bring up the average despite having 0 precision. +

+

+ A better way is to compute the + + F Score + + (or F1 score): +

+

+ $$\text{F Score} = 2\dfrac{PR}{P + R}$$ +

+

+ In order for the F Score to be large, both precision and recall must be large. +
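+
+ A small Octave sketch of these metrics, where predictions and yval are assumed names for 0/1 vectors of predicted and actual labels:
+
+tp = sum((predictions == 1) & (yval == 1));  % true positives
+fp = sum((predictions == 1) & (yval == 0));  % false positives
+fn = sum((predictions == 0) & (yval == 1));  % false negatives
+prec = tp / (tp + fp);                       % precision (undefined if nothing is predicted positive)
+rec  = tp / (tp + fn);                       % recall
+F1   = (2 * prec * rec) / (prec + rec);      % F score
+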

+

+ We want to train precision and recall on the + + cross validation set + + so as not to bias our test set. +

+

+ Data for Machine Learning +

+

+ How much data should we train on? +

+

+ In certain cases, an "inferior algorithm," if given enough data, can outperform a superior algorithm with less data. +

+

+ We must choose our features to have + + enough + + information. A useful test is: Given input x, would a human expert be able to confidently predict y? +

+

+ + Rationale for large data + + : if we have a + + low bias + + algorithm (many features or hidden units making a very complex function), then the larger the training set we use, the less we will have overfitting (and the more accurate the algorithm will be on the test set). +

+

+ Quiz instructions +

+

+ When the quiz instructions tell you to enter a value to "two decimal digits", what it really means is "two significant digits". So, just for example, the value 0.0123 should be entered as "0.012", not "0.01". +

+


+

+

+

+

+

+
+ + + diff --git a/ML_Mathematical_Approach/09_week-7-lecture-notes/01__resources.html b/ML_Mathematical_Approach/09_week-7-lecture-notes/01__resources.html new file mode 100644 index 0000000..18a561f --- /dev/null +++ b/ML_Mathematical_Approach/09_week-7-lecture-notes/01__resources.html @@ -0,0 +1,662 @@ + + +

+ Optimization Objective +

+

+ The + + Support Vector Machine + + (SVM) is yet another type of + + supervised + + machine learning algorithm. Compared to logistic regression and neural networks, it sometimes gives a cleaner and more powerful way of learning complex non-linear functions.

+

+ Recall that in logistic regression, we use the following rules: +

+

+ if y=1, then $$h_\theta(x) \approx 1$$ and $$\Theta^Tx \gg 0$$ +

+

+ if y=0, then $$h_\theta(x) \approx 0$$ and $$\Theta^Tx \ll 0$$ +

+

+ Recall the cost function for (unregularized) logistic regression: +

+

+

+ + + + +
+

+ $$\begin{align*}J(\theta) & = \frac{1}{m}\sum_{i=1}^m -y^{(i)} \log(h_\theta(x^{(i)})) - (1 - y^{(i)})\log(1 - h_\theta(x^{(i)}))\\ & = \frac{1}{m}\sum_{i=1}^m -y^{(i)} \log\Big(\dfrac{1}{1 + e^{-\theta^Tx^{(i)}}}\Big) - (1 - y^{(i)})\log\Big(1 - \dfrac{1}{1 + e^{-\theta^Tx^{(i)}}}\Big)\end{align*}$$ +

+
+

+ To make a support vector machine, we will modify the first term of the cost function $$-\log(h_{\theta}(x)) = -\log\Big(\dfrac{1}{1 + e^{-\theta^Tx}}\Big)$$ so that when $$θ^Tx$$ (from now on, we shall refer to this as z) is + + greater than + + 1, it outputs 0. Furthermore, for values of z less than 1, we shall use a straight decreasing line instead of the sigmoid curve. (In the literature, this is called a hinge loss function: + + https://en.wikipedia.org/wiki/Hinge_loss + + .)

+

+

+ +

+ Similarly, we modify the second term of the cost function $$-\log(1 - h_{\theta}(x)) = -\log\Big(1 - \dfrac{1}{1 + e^{-\theta^Tx}}\Big)$$ so that when z is + + less than + + -1, it outputs 0. We also modify it so that for values of z greater than -1, we use a straight increasing line instead of the sigmoid curve.

+

+

+ +

+ We shall denote these as $$\text{cost}_1(z)$$ and $$\text{cost}_0(z)$$ (respectively, note that $$\text{cost}_1(z)$$ is the cost for classifying when y=1, and $$\text{cost}_0(z)$$ is the cost for classifying when y=0), and we may define them as follows (where k is an arbitrary constant defining the magnitude of the slope of the line): +

+

+ $$z = \theta^Tx$$ +

+

+ $$\text{cost}_0(z) = \max(0, k(1+z))$$ +

+

+ $$\text{cost}_1(z) = \max(0, k(1-z))$$ +

+

+ Recall the full cost function from (regularized) logistic regression: +

+ + + + +
+

+ $$J(\theta) = \frac{1}{m} \sum_{i=1}^m y^{(i)}(-\log(h_\theta(x^{(i)}))) + (1 - y^{(i)})(-\log(1 - h_\theta(x^{(i)}))) + \dfrac{\lambda}{2m}\sum_{j=1}^n \Theta^2_j$$ +

+
+

+ Note that the negative sign has been distributed into the sum in the above equation. +

+

+ We may transform this into the cost function for support vector machines by substituting $$\text{cost}_0(z)$$ and $$\text{cost}_1(z)$$: +

+ + + + +
+

+ $$J(\theta) = \frac{1}{m} \sum_{i=1}^m y^{(i)} \ \text{cost}_1(\theta^Tx^{(i)}) + (1 - y^{(i)}) \ \text{cost}_0(\theta^Tx^{(i)}) + \dfrac{\lambda}{2m}\sum_{j=1}^n \Theta^2_j$$ +

+
+

+ We can optimize this a bit by multiplying this by m (thus removing the m factor in the denominators). Note that this does not affect our optimization, since we're simply multiplying our cost function by a positive constant (for example, minimizing $$(u-5)^2 + 1$$ gives us 5; multiplying it by 10 to make it $$10(u-5)^2 + 10$$ still gives us 5 when minimized). +

+ + + + +
+

+ $$J(\theta) = \sum_{i=1}^m y^{(i)} \ \text{cost}_1(\theta^Tx^{(i)}) + (1 - y^{(i)}) \ \text{cost}_0(\theta^Tx^{(i)}) + \dfrac{\lambda}{2}\sum_{j=1}^n \Theta^2_j$$ +

+
+

+ Furthermore, convention dictates that we regularize using a factor C, instead of λ, like so: +

+ + + + +
+

+ $$J(\theta) = C\sum_{i=1}^m y^{(i)} \ \text{cost}_1(\theta^Tx^{(i)}) + (1 - y^{(i)}) \ \text{cost}_0(\theta^Tx^{(i)}) + \dfrac{1}{2}\sum_{j=1}^n \Theta^2_j$$ +

+
+

+ This is equivalent to multiplying the equation by $$C = \dfrac{1}{\lambda}$$, and thus results in the same values when optimized. Now, when we wish to regularize more (that is, reduce overfitting), we + + decrease + + C, and when we wish to regularize less (that is, reduce underfitting), we + + increase + + C. +

+

+ Finally, note that the hypothesis of the Support Vector Machine is + + not + + interpreted as the probability of y being 1 or 0 (as it is for the hypothesis of logistic regression). Instead, it outputs either 1 or 0. (In technical terms, it is a discriminant function.) +

+ + + + +
+

+ $$h_\theta(x) =\begin{cases} 1 & \text{if} \ \Theta^Tx \geq 0 \\ 0 & \text{otherwise}\end{cases}$$ +

+
+

+ Large Margin Intuition +

+

+ A useful way to think about Support Vector Machines is to think of them as + + Large Margin Classifiers + + . +

+

+ If y=1, we want $$\Theta^Tx \geq 1$$ (not just ≥0) +

+

+ If y=0, we want $$\Theta^Tx \leq -1$$ (not just <0) +

+

+ Now when we set our constant C to a very + + large + + value (e.g. 100,000), our optimizing function will constrain Θ such that the first term (the summation of the cost of each example) equals 0. We impose the following constraints on Θ:

+

+ $$\Theta^Tx \geq 1$$ if y=1 and $$\Theta^Tx \leq -1$$ if y=0. +

+

+ If C is very large, we must choose Θ parameters such that: +

+

+ $$\sum_{i=1}^m y^{(i)}\text{cost}_1(\Theta^Tx^{(i)}) + (1 - y^{(i)})\text{cost}_0(\Theta^Tx^{(i)}) = 0$$

+

+ This reduces our cost function to: +

+ + + + +
+

+ $$ +\begin{align*} +J(\theta) = C \cdot 0 + \dfrac{1}{2}\sum_{j=1}^n \Theta^2_j \newline += \dfrac{1}{2}\sum_{j=1}^n \Theta^2_j +\end{align*}$$ +

+
+

+ Recall the decision boundary from logistic regression (the line separating the positive and negative examples). In SVMs, the decision boundary has the special property that it is + + as far away as possible + + from both the positive and the negative examples. +

+

+ The distance of the decision boundary to the nearest example is called the + + margin + + . Since SVMs maximize this margin, it is often called a + + Large Margin Classifier + + . +

+

+ The SVM will separate the negative and positive examples by a + + large margin + + . +

+

+ This large margin is only achieved when + + C is very large + + . +

+

+ Data is + + linearly separable + + when a + + straight line + + can separate the positive and negative examples. +

+

+ If we have + + outlier + + examples that we don't want to affect the decision boundary, then we can + + reduce + + C. +

+

+ Increasing and decreasing C is similar to respectively decreasing and increasing λ, and can simplify our decision boundary. +

+

+ Mathematics Behind Large Margin Classification (Optional) +

+

+ + Vector Inner Product + +

+

+ Say we have two vectors, u and v: +

+

+ $$\begin{align*} +u = +\begin{bmatrix} +u_1 \newline u_2 +\end{bmatrix} +& v = +\begin{bmatrix} +v_1 \newline v_2 +\end{bmatrix} +\end{align*}$$ +

+

+ The + + length of vector v + + is denoted $$||v||$$, and it is the length of the line on a graph from the origin (0,0) to $$(v_1,v_2)$$.

+

+ The length of vector v can be calculated with $$\sqrt{v_1^2 + v_2^2}$$ by the Pythagorean theorem.

+

+ The + + projection + + of vector v onto vector u is found by taking a right angle from u to the end of v, creating a right triangle. +

+ +

+ Note that $$u^Tv = ||u|| \cdot ||v|| \cos \theta$$ where θ is the angle between u and v. Also, $$p = ||v|| \cos \theta$$. If you substitute p for $$||v|| \cos \theta$$, you get $$u^Tv= p \cdot ||u||$$. +

+

+ So the product $$u^Tv$$ is equal to the length of the projection times the length of vector u. +

+

+ In our example, since u and v are vectors of the same dimension, $$u^Tv = v^Tu$$.

+

+ $$u^Tv = v^Tu = p \cdot ||u|| = u_1v_1 + u_2v_2$$ +

+

+ If the + + angle + + between the lines for v and u is + + greater than 90 degrees + + , then the projection p will be + + negative + + . +

+ + + + +
+

+ $$\begin{align*}&\min_\Theta \dfrac{1}{2}\sum_{j=1}^n \Theta_j^2 \newline&= \dfrac{1}{2}(\Theta_1^2 + \Theta_2^2 + \dots + \Theta_n^2) \newline&= \dfrac{1}{2}(\sqrt{\Theta_1^2 + \Theta_2^2 + \dots + \Theta_n^2})^2 \newline&= \dfrac{1}{2}||\Theta ||^2 \newline\end{align*}$$ +

+
+

+ We can use the same rules to rewrite $$\Theta^Tx^{(i)}$$: +

+

+ $$\Theta^Tx^{(i)} = p^{(i)} \cdot ||\Theta || = \Theta_1x_1^{(i)} + \Theta_2x_2^{(i)} + \dots + \Theta_n x_n^{(i)}$$ +

+

+ So we now have a new + + optimization objective + + by substituting $$p^{(i)} \cdot ||\Theta ||$$ in for $$\Theta^Tx^{(i)}$$: +

+

+ If y=1, we want $$p^{(i)} \cdot ||\Theta || \geq 1$$ +

+

+ If y=0, we want $$p^{(i)} \cdot ||\Theta || \leq -1$$ +

+

+ The reason this causes a "large margin" is because: the vector for Θ is perpendicular to the decision boundary. In order for our optimization objective (above) to hold true, we need the absolute value of our projections $$p^{(i)}$$ to be as large as possible. +

+

+ If $$\Theta_0 =0$$, then all our decision boundaries will intersect (0,0). If $$\Theta_0 \neq 0$$, the support vector machine will still find a large margin for the decision boundary. +

+

+ Kernels I +

+

+ + Kernels + + allow us to make complex, non-linear classifiers using Support Vector Machines. +

+

+ Given x, compute new feature depending on proximity to landmarks $$l^{(1)},\ l^{(2)},\ l^{(3)}$$. +

+

+ To do this, we find the "similarity" of x and some landmark $$l^{(i)}$$: +

+

+ $$f_i = similarity(x, l^{(i)}) = \exp(-\dfrac{||x - l^{(i)}||^2}{2\sigma^2})$$ +

+

+ This "similarity" function is called a + + Gaussian Kernel + + . It is a specific example of a kernel. +

+

+ The similarity function can also be written as follows: +

+

+ $$f_i = similarity(x, l^{(i)}) = \exp(-\dfrac{\sum^n_{j=1}(x_j-l_j^{(i)})^2}{2\sigma^2})$$ +

+

+ There are a couple properties of the similarity function: +

+

+ If $$x \approx l^{(i)}$$, then $$f_i = \exp(-\dfrac{\approx 0^2}{2\sigma^2}) \approx 1$$ +

+

+ If x is far from $$l^{(i)}$$, then $$f_i = \exp(-\dfrac{(large\ number)^2}{2\sigma^2}) \approx 0$$ +

+

+ In other words, if x and the landmark are close, then the similarity will be close to 1, and if x and the landmark are far away from each other, the similarity will be close to 0. +
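+
+ A minimal Octave sketch of this similarity function (the shape of the gaussianKernel function used in the exercise, for feature vectors x1 and x2):
+
+function sim = gaussianKernel(x1, x2, sigma)
+% Gaussian kernel: exp(-||x1 - x2||^2 / (2*sigma^2)).
+% Returns a value near 1 when x1 and x2 are close, near 0 when far apart.
+  diff = x1(:) - x2(:);
+  sim = exp(-(diff' * diff) / (2 * sigma^2));
+end
+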

+

+ Each landmark gives us the features in our hypothesis: +

+ + + + +
+

+ $$\begin{align*}l^{(1)} \rightarrow f_1 \newline l^{(2)} \rightarrow f_2 \newline l^{(3)} \rightarrow f_3 \newline\dots \newline h_\Theta(x) = \Theta_1f_1 + \Theta_2f_2 + \Theta_3f_3 + \dots\end{align*}$$ +

+
+

+ $$\sigma^2$$ is a parameter of the Gaussian Kernel, and it can be modified to increase or decrease the + + drop-off + + of our feature $$f_i$$. Combined with looking at the values inside Θ, we can choose these landmarks to get the general shape of the decision boundary. +

+

+ Kernels II +

+

+ One way to get the landmarks is to put them in the + + exact same + + locations as all the training examples. This gives us m landmarks, with one landmark per training example. +

+

+ Given example x: +

+

+ $$f_1 = similarity(x,l^{(1)})$$, $$f_2 = similarity(x,l^{(2)})$$, $$f_3 = similarity(x,l^{(3)})$$, and so on. +

+

+ This gives us a "feature vector," $$f^{(i)}$$, of all our features for example $$x^{(i)}$$. We may also set $$f_0 = 1$$ to correspond with $$Θ_0$$. Thus given training example $$x^{(i)}$$:

+ + + + +
+

+ $$x^{(i)} \rightarrow \begin{bmatrix}f_1^{(i)} = similarity(x^{(i)}, l^{(1)}) \newline f_2^{(i)} = similarity(x^{(i)}, l^{(2)}) \newline\vdots \newline f_m^{(i)} = similarity(x^{(i)}, l^{(m)}) \newline\end{bmatrix}$$ +

+
+

+ Now to get the parameters Θ we can use the SVM minimization algorithm but with $$f^{(i)}$$ substituted in for $$x^{(i)}$$: +

+

+ $$\min_{\Theta} C \sum_{i=1}^m y^{(i)}\text{cost}_1(\Theta^Tf^{(i)}) + (1 - y^{(i)})\text{cost}_0(\theta^Tf^{(i)}) + \dfrac{1}{2}\sum_{j=1}^n \Theta^2_j$$ +

+

+ Using kernels to generate $$f^{(i)}$$ is not exclusive to SVMs and may also be applied to logistic regression. However, because of computational optimizations specific to SVMs, kernels combined with SVMs run much faster than with other algorithms, so kernels are almost always found combined with SVMs.

+

+ + Choosing SVM Parameters + +

+

+ Choosing C (recall that $$C = \dfrac{1}{\lambda}$$):
+
+ If C is large, then we get higher variance/lower bias. If C is small, then we get lower variance/higher bias.

+ The other parameter we must choose is $$σ^2$$ from the Gaussian Kernel function: +

+

+ With a large $$σ^2$$, the features $$f_i$$ vary more smoothly, causing higher bias and lower variance.

+

+ With a small $$σ^2$$, the features $$f_i$$ vary less smoothly, causing lower bias and higher variance.

+

+ + Using An SVM + +

+

+ There are lots of good SVM libraries already written. A. Ng often uses 'liblinear' and 'libsvm'. In practical application, you should use one of these libraries rather than rewrite the functions. +

+

+ In practical application, the choices you do need to make are: +

+ The choice of parameter C, and the choice of kernel (similarity function): e.g. no kernel (a "linear kernel"), or the Gaussian Kernel (for which you must also choose $$σ^2$$).

+ The library may ask you to provide the kernel function. +

+

+ + Note: + + do perform feature scaling before using the Gaussian Kernel. +

+

+ + Note: + + not all similarity functions are valid kernels. They must satisfy "Mercer's Theorem" which guarantees that the SVM package's optimizations run correctly and do not diverge. +

+

+ You want to train C and the parameters for the kernel function using the training and cross-validation datasets. +

+

+ + Multi-class Classification + +

+

+ Many SVM libraries have multi-class classification built-in. +

+

+ You can use the + + one-vs-all + + method just like we did for logistic regression, where $$y \in \{1,2,3,\dots,K\}$$ with $$\Theta^{(1)}, \Theta^{(2)}, \dots, \Theta^{(K)}$$. We pick class i with the largest $$(\Theta^{(i)})^Tx$$.

+

+ + Logistic Regression vs. SVMs + +

+

+ If n is large (relative to m), then use logistic regression, or SVM without a kernel (the "linear kernel") +

+

+ If n is small and m is intermediate, then use SVM with a Gaussian Kernel +

+

+ If n is small and m is large, then manually create/add more features, then use logistic regression or SVM without a kernel. +

+

+ In the first case, we don't have enough examples to need a complicated polynomial hypothesis. In the second example, we have enough examples that we may need a complex non-linear hypothesis. In the last case, we want to increase our features so that logistic regression becomes applicable. +

+

+ + Note + + : a neural network is likely to work well for any of these situations, but may be slower to train. +

+


+

+
+ + + diff --git a/ML_Mathematical_Approach/09_week-7-lecture-notes/01__svm-notes-long-08.pdf b/ML_Mathematical_Approach/09_week-7-lecture-notes/01__svm-notes-long-08.pdf new file mode 100644 index 0000000..d7e5866 Binary files /dev/null and b/ML_Mathematical_Approach/09_week-7-lecture-notes/01__svm-notes-long-08.pdf differ diff --git a/ML_Mathematical_Approach/10_week-8-lecture-notes/01__resources.html b/ML_Mathematical_Approach/10_week-8-lecture-notes/01__resources.html new file mode 100644 index 0000000..65b4cfa --- /dev/null +++ b/ML_Mathematical_Approach/10_week-8-lecture-notes/01__resources.html @@ -0,0 +1,804 @@ + + +

+ ML:Clustering +

+

+ Unsupervised Learning: Introduction +

+

+ Unsupervised learning is contrasted from supervised learning because it uses an + + unlabeled + + training set rather than a labeled one. +

+

+ In other words, we don't have the vector y of expected results, we only have a dataset of features where we can find structure. +

+

+ Clustering is good for: +

+ Market segmentation; social network analysis; organizing computer clusters; astronomical data analysis.

+ K-Means Algorithm +

+

+ The K-Means Algorithm is the most popular and widely used algorithm for automatically grouping data into coherent subsets. +

+
+    1. Randomly initialize two points in the dataset called the + + cluster centroids + + .
+    2. Cluster assignment: assign all examples into one of two groups based on which cluster centroid the example is closest to.
+    3. Move centroid: compute the averages for all the points inside each of the two cluster centroid groups, then move the cluster centroid points to those averages.
+    4. Re-run (2) and (3) until we have found our clusters.
+

+ Our main variables are: +

+ K, the number of clusters; and the training set $$x^{(1)}, x^{(2)}, \dots, x^{(m)}$$, where each $$x^{(i)} \in \mathbb{R}^n$$.

+ Note that we + + will not use + + the x0=1 convention. +

+

+ + The algorithm: + +

+
Randomly initialize K cluster centroids mu(1), mu(2), ..., mu(K)
+Repeat:
+   for i = 1 to m:
+      c(i):= index (from 1 to K) of cluster centroid closest to x(i)
+   for k = 1 to K:
+      mu(k):= average (mean) of points assigned to cluster k
+

+ The + + first for-loop + + is the 'Cluster Assignment' step. We make a vector + + c + + where + + c(i) + + represents the centroid assigned to example + + x(i) + + . +

+

+ We can write the operation of the Cluster Assignment step more mathematically as follows: +

+

+ $$c^{(i)} = argmin_k\ ||x^{(i)} - \mu_k||^2$$ +

+

+ That is, each $$c^{(i)}$$ contains the index of the centroid that has minimal distance to $$x^{(i)}$$. +

+

+ By convention, we square the right-hand side, which makes the function we are trying to minimize more sharply increasing. It is mostly just a convention, but one that also reduces the computational load: the Euclidean distance requires a square root, and squaring cancels it.

+

+ Without the square: +

+

+ $$||x^{(i)} - \mu_k|| = \sqrt{(x_1^{(i)} - \mu_{1(k)})^2 + (x_2^{(i)} - \mu_{2(k)})^2 + (x_3^{(i)} - \mu_{3(k)})^2 + \dots}$$

+

+ With the square: +

+

+ $$||x^{(i)} - \mu_k||^2 = (x_1^{(i)} - \mu_{1(k)})^2 + (x_2^{(i)} - \mu_{2(k)})^2 + (x_3^{(i)} - \mu_{3(k)})^2 + \dots$$

+

+ ...so the square convention serves two purposes: a more sharply increasing objective to minimize, and less computation.

+

+ The + + second for-loop + + is the 'Move Centroid' step where we move each centroid to the average of its group. +

+

+ More formally, the equation for this loop is as follows: +

+

+ $$\mu_k = \dfrac{1}{n}[x^{(k_1)} + x^{(k_2)} + \dots + x^{(k_n)}] \in \mathbb{R}^n$$ +

+

+ Where each of $$x^{(k_1)}, x^{(k_2)}, \dots, x^{(k_n)}$$ are the training examples assigned to group $$\mu_k$$.
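+
+ In Octave, one iteration of the two steps can be sketched as follows (X is m×n with examples in rows, centroids is K×n; the subtraction broadcasts in modern Octave; the names are illustrative):
+
+c = zeros(m, 1);
+% Cluster assignment step: index of the closest centroid for each example
+for i = 1:m
+  dists = sum((centroids - X(i,:)).^2, 2);  % squared distance to each of the K centroids
+  [~, c(i)] = min(dists);
+end
+% Move centroid step: each centroid becomes the mean of its assigned points
+for k = 1:K
+  centroids(k,:) = mean(X(c == k, :), 1);
+end
+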

+

+ If you have a cluster centroid with + + 0 points + + assigned to it, you can randomly + + re-initialize + + that centroid to a new point. You can also simply + + eliminate + + that cluster group. +

+

+ After a number of iterations the algorithm will + + converge + + , where new iterations do not affect the clusters. +

+

+ Note on non-separated clusters: some datasets have no real inner separation or natural structure. K-means can still evenly segment your data into K subsets, so can still be useful in this case. +

+

+ Optimization Objective +

+

+ Recall some of the parameters we used in our algorithm: +

+ $$c^{(i)}$$ = index of the cluster (from 1 to K) to which example $$x^{(i)}$$ is currently assigned; $$\mu_k$$ = cluster centroid k ($$\mu_k \in \mathbb{R}^n$$); $$\mu_{c^{(i)}}$$ = the cluster centroid of the cluster to which example $$x^{(i)}$$ has been assigned.

+ Using these variables we can define our + + cost function + + : +

+

+ $$J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \dfrac{1}{m}\sum_{i=1}^m ||x^{(i)} - \mu_{c^{(i)}}||^2$$

+

+ Our + + optimization objective + + is to minimize all our parameters using the above cost function: +

+

+ $$min_{c,\mu}\ J(c,\mu)$$ +

+

+ That is, we are finding all the values in sets c, representing all our clusters, and μ, representing all our centroids, that will minimize + + the average of the distances + + of every training example to its corresponding cluster centroid. +

+

+ The above cost function is often called the + + distortion + + of the training examples. +

+

+ In the + + cluster assignment step + + , our goal is to: +

+

+ Minimize J(…) with $$c^{(1)},\dots,c^{(m)}$$ (holding $$\mu_1,\dots,\mu_K$$ fixed) +

+

+ In the + + move centroid + + step, our goal is to: +

+

+ Minimize J(…) with $$\mu_1,\dots,\mu_K$$ +

+

+ With k-means, it is + + not possible for the cost function to sometimes increase + + . It should always descend. +

+

+ Random Initialization +

+

+ There's one particular recommended method for randomly initializing your cluster centroids. +

+
    +
  1. +

    + Have K<m. That is, make sure the number of your clusters is less than the number of your training examples. +

    +
  2. +
  3. +

    + Randomly pick K training examples. (Not mentioned in the lecture, but also be sure the selected examples are unique). +

    +
  4. +
  5. +

    + Set $$\mu_1,\dots,\mu_K$$ equal to these K examples. +

    +
  6. +
+

+ K-means + + can get stuck in local optima + + . To decrease the chance of this happening, you can run the algorithm on many different random initializations. In cases where K<10 it is strongly recommended to run a loop of random initializations. +

+
for i = 1 to 100:
+   randomly initialize k-means
+   run k-means to get 'c' and 'mu'
+   compute the cost function (distortion) J(c, mu)
+pick the clustering that gave us the lowest cost
+
+

+ Choosing the Number of Clusters +

+

+ Choosing K can be quite arbitrary and ambiguous. +

+

+ + The elbow method + + : plot the cost J and the number of clusters K. The cost function should reduce as we increase the number of clusters, and then flatten out. Choose K at the point where the cost function starts to flatten out. +

+

+ However, fairly often, the curve is + + very gradual + + , so there's no clear elbow. +

+

+ + Note: + + J will + + always + + decrease as K is increased. The one exception is if k-means gets stuck at a bad local optimum. +

+

+ Another way to choose K is to observe how well k-means performs on a + + downstream purpose + + . In other words, you choose K that proves to be most useful for some goal you're trying to achieve from using these clusters. +

+

+ Bonus: Discussion of the drawbacks of K-Means +

+

+ This links to a discussion that shows various situations in which K-means gives totally correct but unexpected results: + + http://stats.stackexchange.com/questions/133656/how-to-understand-the-drawbacks-of-k-means + +

+

+ ML:Dimensionality Reduction +

+

+ + Motivation I: Data Compression + +

+ +

+ Doing dimensionality reduction will reduce the total data we have to store in computer memory and will speed up our learning algorithm. +

+

+ Note: in dimensionality reduction, we are reducing our features rather than our number of examples. Our variable m will stay the same size; n, the number of features each example from $$x^{(1)}$$ to $$x^{(m)}$$ carries, will be reduced. +

+

+ + Motivation II: Visualization + +

+

+ It is not easy to visualize data that is more than three dimensions. We can reduce the dimensions of our data to 3 or less in order to plot it. +

+

+ We need to find new features, $$z_1, z_2$$(and perhaps $$z_3$$) that can effectively + + summarize + + all the other features. +

+

+ Example: hundreds of features related to a country's economic system may all be combined into one feature that you call "Economic Activity." +

+

+ Principal Component Analysis Problem Formulation +

+

+ The most popular dimensionality reduction algorithm is + + Principal Component Analysis + + (PCA) +

+

+ + Problem formulation + +

+

+ Given two features, $$x_1$$ and $$x_2$$, we want to find a single line that effectively describes both features at once. We then map our old features onto this new line to get a new single feature. +

+

+ The same can be done with three features, where we map them to a plane. +

+

+ The + + goal of PCA + + is to + + reduce + + the average of all the distances of every feature to the projection line. This is the + + projection error + + . +

+

+ Reduce from 2d to 1d: find a direction (a vector $$u^{(1)} \in \mathbb{R}^n$$) onto which to project the data so as to minimize the projection error. +

+

+ The more general case is as follows: +

+

+ Reduce from n-dimension to k-dimension: Find k vectors $$u^{(1)}, u^{(2)}, \dots, u^{(k)}$$ onto which to project the data so as to minimize the projection error. +

+

+ If we are converting from 3d to 2d, we will project our data onto two directions (a plane), so k will be 2. +

+

+ + PCA is not linear regression + +

+ In linear regression, we are minimizing the squared error from every point to our predictor line; these are vertical distances. In PCA, we are minimizing the shortest distances from the points to the line, i.e. the orthogonal projection distances.

+ More generally, in linear regression we are taking all our examples in x and applying the parameters in Θ to predict y. +

+

+ In PCA, we are taking a number of features $$x_1, x_2, \dots, x_n$$, and finding the lower-dimensional surface that lies closest to all of them. We aren't trying to predict any result and we aren't applying any theta weights to the features.

+

+ Principal Component Analysis Algorithm +

+

+ Before we can apply PCA, there is a data pre-processing step we must perform: +

+

+ + Data preprocessing + +

+ +

+ $$\mu_j = \dfrac{1}{m}\sum^m_{i=1}x_j^{(i)}$$ +

+ +

+ Above, we first subtract the mean of each feature from the original feature. Then we scale all the features $$x_j^{(i)} = \dfrac{x_j^{(i)} - \mu_j}{s_j}$$ +

+

+ We can define specifically what it means to reduce from 2d to 1d data as follows: +

+

+ $$z^{(i)} = (u^{(1)})^T x^{(i)}$$

+

+ The z values are all real numbers and are the projections of our features onto $$u^{(1)}$$. +

+

+ So, PCA has two tasks: figure out $$u^{(1)},\dots,u^{(k)}$$ and also to find $$z_1, z_2, \dots, z_m$$. +

+

+ The mathematical proof for the following procedure is complicated and beyond the scope of this course. +

+

+ + 1. Compute "covariance matrix" + +

+

+ $$\Sigma = \dfrac{1}{m}\sum^m_{i=1}(x^{(i)})(x^{(i)})^T$$ +

+

+ This can be vectorized in Octave as: +

+
Sigma = (1/m) * X' * X;
+
+

+ We denote the covariance matrix with a capital sigma (which, confusingly, happens to be the same symbol used for summation; they represent entirely different things).

+

+ Note that $$x^{(i)}$$ is an n×1 vector and $$(x^{(i)})^T$$ is a 1×n vector, while X is an m×n matrix (row-wise stored examples). The product of those will be an n×n matrix, which are the dimensions of Σ.

+

+ + 2. Compute "eigenvectors" of covariance matrix Σ + +

+
[U,S,V] = svd(Sigma);
+
+

+ svd() is the 'singular value decomposition', a built-in Octave function. +

+

+ What we actually want out of svd() is the 'U' matrix of the Sigma covariance matrix: $$U \in \mathbb{R}^{n \times n}$$. U contains $$u^{(1)},\dots,u^{(n)}$$, which is exactly what we want. +

+

+ + 3. Take the first k columns of the U matrix and compute z + +

+

+ We'll assign the first k columns of U to a variable called 'Ureduce'. This will be an n×k matrix. We compute z with: +

+

+ $$z^{(i)} = Ureduce^T \cdot x^{(i)}$$ +

+

+ $$Ureduce^T$$ will have dimensions k×n while $$x^{(i)}$$ will have dimensions n×1. The product $$Ureduce^T \cdot x^{(i)}$$ will have dimensions k×1.

+

+ To summarize, the whole algorithm in octave is roughly: +

+
Sigma = (1/m) * X' * X; % compute the covariance matrix
+[U,S,V] = svd(Sigma);   % compute our projected directions
+Ureduce = U(:,1:k);     % take the first k directions
+Z = X * Ureduce;        % compute the projected data points
+
+

+ Reconstruction from Compressed Representation +

+

+ If we use PCA to compress our data, how can we uncompress our data, or go back to our original number of features? +

+

+ To go from 1-dimension back to 2d we do: $$z \in \mathbb{R} \rightarrow x \in \mathbb{R}^2$$. +

+

+ We can do this with the equation: $$x_{approx}^{(1)} = U_{reduce} \cdot z^{(1)}$$. +

+

+ Note that we can only get approximations of our original data. +
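+
+ For the row-stored Z computed in the earlier snippet, the reconstruction is one line in Octave (a sketch, reusing the Ureduce variable from above):
+
+X_approx = Z * Ureduce';  % (m x k) times (k x n) gives an m x n approximation of X
+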

+

+ Note: It turns out that the U matrix has the special property that it is a Unitary Matrix. One of the special properties of a Unitary Matrix is: +

+

+ $$U^{-1} = U^∗$$ where the "*" means "conjugate transpose". +

+

+ Since we are dealing with real numbers here, this is equivalent to: +

+

+ $$U^{-1} = U^T$$ So we could compute the inverse and use that, but it would be a waste of energy and compute cycles. +

+

+ Choosing the Number of Principal Components +

+

+ How do we choose k, also called the + + number of principal components + + ? Recall that k is the dimension we are reducing to. +

+

+ One way to choose k is by using the following formula: +

+ Choose k to be the smallest value such that $$\dfrac{\frac{1}{m}\sum_{i=1}^m ||x^{(i)} - x_{approx}^{(i)}||^2}{\frac{1}{m}\sum_{i=1}^m ||x^{(i)}||^2} \leq 0.01$$

+ In other words, the squared projection error divided by the total variation should be less than one percent, so that + + 99% of the variance is retained + + . +

+

+ + Algorithm for choosing k + +

+
    +
  1. +

    + Try PCA with k=1,2,… +

    +
  2. +
  3. +

    + Compute $$U_{reduce}, z, x$$ +

    +
  4. +
  5. +

    + Check the formula given above that 99% of the variance is retained. If not, go to step one and increase k. +

    +
  6. +
+

+ This procedure would actually be horribly inefficient. In Octave, we will call svd: +

+
[U,S,V] = svd(Sigma)
+
+

+ Which gives us a matrix S. We can actually check for 99% of retained variance using the S matrix as follows: +

+

+ $$\dfrac{\sum_{i=1}^kS_{ii}}{\sum_{i=1}^nS_{ii}} \geq 0.99$$ +
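+
+ Using the diagonal of S, this check is a short Octave loop (a sketch; n is the number of features):
+
+s = diag(S);                       % the singular values S11, S22, ..., Snn
+for k = 1:n
+  if sum(s(1:k)) / sum(s) >= 0.99  % 99% of variance retained
+    break;                         % this k is the smallest that qualifies
+  end
+end
+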

+

+ Advice for Applying PCA +

+

+ The most common use of PCA is to speed up supervised learning. +

+

+ Given a training set with a large number of features (e.g. $$x^{(1)},\dots,x^{(m)} \in \mathbb{R}^{10000}$$ ) we can use PCA to reduce the number of features in each example of the training set (e.g. $$z^{(1)},\dots,z^{(m)} \in \mathbb{R}^{1000}$$). +

+

+ Note that we should define the PCA reduction from $$x^{(i)}$$ to $$z^{(i)}$$ only on the training set and not on the cross-validation or test sets. You can apply the mapping z(i) to your cross-validation and test sets after it is defined on the training set. +

+

+ Applications
+
+ Compression: reduce the space the data occupies and speed up the learning algorithm.
+
+ Visualization of data: choose k = 2 or k = 3.

+

+ + Bad use of PCA + + : trying to prevent overfitting. We might think that reducing the features with PCA would be an effective way to address overfitting. It might work, but is not recommended because it does not consider the values of our results y. Using just regularization will be at least as effective.

+

+ Don't assume you need to do PCA. + + Try your full machine learning algorithm without PCA first. + + Then use PCA if you find that you need it. +

+

+

+
+ + + diff --git a/ML_Mathematical_Approach/11_week-9-lecture-notes/01__gaussians.pdf b/ML_Mathematical_Approach/11_week-9-lecture-notes/01__gaussians.pdf new file mode 100644 index 0000000..3d77a0c Binary files /dev/null and b/ML_Mathematical_Approach/11_week-9-lecture-notes/01__gaussians.pdf differ diff --git a/ML_Mathematical_Approach/11_week-9-lecture-notes/01__resources.html b/ML_Mathematical_Approach/11_week-9-lecture-notes/01__resources.html new file mode 100644 index 0000000..ef771c6 --- /dev/null +++ b/ML_Mathematical_Approach/11_week-9-lecture-notes/01__resources.html @@ -0,0 +1,632 @@ + + +

+ ML:Anomaly Detection +

+

+ Problem Motivation +

+

+ Just like in other learning problems, we are given a dataset $${x^{(1)}, x^{(2)},\dots,x^{(m)}}$$. +

+

+ We are then given a new example, $$x_{test}$$, and we want to know whether this new example is abnormal/anomalous. +

+

+ We define a "model" p(x) that tells us the probability the example is not anomalous. We also use a threshold ϵ (epsilon) as a dividing line so we can say which examples are anomalous and which are not. +

+

+ A very common application of anomaly detection is detecting fraud: +

+ $$x^{(i)}$$ = features of user i's activities; model p(x) from the data; identify unusual users by checking which have p(x) < ϵ.

+ If our anomaly detector is flagging + + too many + + anomalous examples, then we need to + + decrease + + our threshold ϵ +

+

+ Gaussian Distribution +

+

+ The Gaussian Distribution is a familiar bell-shaped curve that can be described by a function $$\mathcal{N}(\mu,\sigma^2)$$ +

+

+ Let x∈ℝ. If the probability distribution of x is Gaussian with mean μ, variance $$\sigma^2$$, then: +

+

+ $$x \sim \mathcal{N}(\mu, \sigma^2)$$ +

+

+ The little ∼ or 'tilde' can be read as "distributed as." +

+

+ The Gaussian Distribution is parameterized by a mean and a variance. +

+

+ Mu, or μ, describes the center of the curve, called the mean. The width of the curve is described by sigma, or σ, called the standard deviation. +

+

+ The full function is as follows: +

+

+ $$\large p(x;\mu,\sigma^2) = \dfrac{1}{\sigma\sqrt{(2\pi)}}e^{-\dfrac{1}{2}(\dfrac{x - \mu}{\sigma})^2}$$ +

+

+ We can estimate the parameter μ from a given dataset by simply taking the average of all the examples: +

+

+ $$\mu = \dfrac{1}{m}\displaystyle \sum_{i=1}^m x^{(i)}$$ +

+

+ We can estimate the other parameter, $$\sigma^2$$, with our familiar squared error formula: +

+

+ $$\sigma^2 = \dfrac{1}{m}\displaystyle \sum_{i=1}^m(x^{(i)} - \mu)^2$$ +

+

+ Algorithm +

+

+ Given a training set of examples, $$\lbrace x^{(1)},\dots,x^{(m)}\rbrace$$ where each example is a vector, $$x \in \mathbb{R}^n$$. +

+

+ $$p(x) = p(x_1;\mu_1,\sigma_1^2)p(x_2;\mu_2,\sigma^2_2)\cdots p(x_n;\mu_n,\sigma^2_n)$$ +

+

+ In statistics, this is called an "independence assumption" on the values of the features inside training example x. +

+

+ More compactly, the above expression can be written as follows: +

+

+ $$= \displaystyle \prod^n_{j=1} p(x_j;\mu_j,\sigma_j^2)$$ +

+

+ + The algorithm + +

+

+ Choose features $$x_i$$ that you think might be indicative of anomalous examples. +

+

+ Fit parameters $$\mu_1,\dots,\mu_n,\sigma_1^2,\dots,\sigma_n^2$$ +

+

+ Calculate $$\mu_j = \dfrac{1}{m}\displaystyle \sum_{i=1}^m x_j^{(i)}$$ +

+

+ Calculate $$\sigma^2_j = \dfrac{1}{m}\displaystyle \sum_{i=1}^m(x_j^{(i)} - \mu_j)^2$$ +

+

+ Given a new example x, compute p(x): +

+

+ $$p(x) = \displaystyle \prod^n_{j=1} p(x_j;\mu_j,\sigma_j^2) = \prod\limits^n_{j=1} \dfrac{1}{\sqrt{2\pi}\sigma_j}exp(-\dfrac{(x_j - \mu_j)^2}{2\sigma^2_j})$$ +

+

+ Anomaly if p(x)<ϵ +

+

+ A vectorized version of the calculation for μ is $$\mu = \dfrac{1}{m}\displaystyle \sum_{i=1}^m x^{(i)}$$. You can vectorize $$\sigma^2$$ similarly. +
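+
+ In Octave, with X stored m×n (examples in rows), the parameter fits and the density evaluation can be sketched as:
+
+mu = mean(X)';                 % n x 1 vector of feature means
+sigma2 = mean((X - mu').^2)';  % n x 1 vector of feature variances
+% p(x) for a new example x (n x 1): product of the per-feature Gaussians
+p = prod(exp(-(x - mu).^2 ./ (2 * sigma2)) ./ sqrt(2 * pi * sigma2));
+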

+

+ Developing and Evaluating an Anomaly Detection System +

+

+ To evaluate our learning algorithm, we take some labeled data, categorized into anomalous and non-anomalous examples ( y = 0 if normal, y = 1 if anomalous). +

+

+ Among that data, take a large proportion of + + good + + , non-anomalous data for the training set on which to train p(x). +

+

+ Then, take a smaller proportion of mixed anomalous and non-anomalous examples (you will usually have many more non-anomalous examples) for your cross-validation and test sets. +

+

+ For example, we may have a set where 0.2% of the data is anomalous. We take 60% of those examples, all of which are good (y=0), for the training set. We then take 20% of the examples for the cross-validation set (with 0.1% of the anomalous examples) and the remaining 20% for the test set (with the other 0.1% of the anomalous examples).

+

+ In other words, we split the data 60/20/20 training/CV/test and then split the anomalous examples 50/50 between the CV and test sets. +

+

+ + Algorithm evaluation: + +

+

+ Fit model p(x) on training set $$\lbrace x^{(1)},\dots,x^{(m)} \rbrace$$ +

+

+ On a cross validation/test example x, predict: +

+

+ If p(x) < ϵ ( + + anomaly + + ), then y=1 +

+

+ If p(x) ≥ ϵ ( + + normal + + ), then y=0 +

+

+ Possible evaluation metrics (see "Machine Learning System Design" section): +

+ True positive, false positive, false negative, true negative counts; precision/recall; F1 score.

+ Note that we use the cross-validation set to choose parameter ϵ +

+

+ Anomaly Detection vs. Supervised Learning +

+

+ When do we use anomaly detection and when do we use supervised learning? +

+

+ Use anomaly detection when... +

+ We have a very small number of positive (y=1) examples (0-20 is common) and a large number of negative (y=0) examples; or we have many different "types" of anomalies, so it is hard for any algorithm to learn from the positive examples what the anomalies look like, and future anomalies may look nothing like any we have seen so far.

+ Use supervised learning when... +

+ We have a large number of both positive and negative examples, and enough positive examples for the algorithm to get a sense of what new positives will look like.

+ Choosing What Features to Use +

+

+ The features will greatly affect how well your anomaly detection algorithm works. +

+

+ We can check that our features are + + gaussian + + by plotting a histogram of our data and checking for the bell-shaped curve. +

+

+ Some + + transforms + + we can try on an example feature x that does not have the bell-shaped curve are: +

+ log(x); log(x+1); log(x+c) for some constant c; $$\sqrt{x}$$; $$x^{1/3}$$.

+ We can play with each of these to try and achieve the gaussian shape in our data. +

+

+ There is an + + error analysis procedure + + for anomaly detection that is very similar to the one in supervised learning. +

+

+ Our goal is for p(x) to be large for normal examples and small for anomalous examples. +

+

+ One common problem is when p(x) is similar for both types of examples. In this case, you need to examine the anomalous examples that are giving high probability in detail and try to figure out new features that will better distinguish the data. +

+

+ In general, choose features that might take on unusually large or small values in the event of an anomaly. +

+

+ Multivariate Gaussian Distribution (Optional) +

+

+ The multivariate gaussian distribution is an extension of anomaly detection and may (or may not) catch more anomalies. +

+

+ Instead of modeling $$p(x_1),p(x_2),\dots$$ separately, we will model p(x) all in one go. Our parameters will be: $$\mu \in \mathbb{R}^n$$ and $$\Sigma \in \mathbb{R}^{n \times n}$$ +

+

+ $$p(x;\mu,\Sigma) = \dfrac{1}{(2\pi)^{n/2} |\Sigma|^{1/2}} exp(-1/2(x-\mu)^T\Sigma^{-1}(x-\mu))$$ +

+

+ The important effect is that we can model oblong gaussian contours, allowing us to better fit data that might not fit into the normal circular contours. +

+

+ Varying Σ changes the shape, width, and orientation of the contours. Changing μ will move the center of the distribution. +

+


+ Anomaly Detection using the Multivariate Gaussian Distribution (Optional) +

+

+ When doing anomaly detection with multivariate gaussian distribution, we compute μ and Σ normally. We then compute p(x) using the new formula in the previous section and flag an anomaly if p(x) < ϵ. +

+

+ The original model for p(x) corresponds to a multivariate Gaussian where the contours of $$p(x;\mu,\Sigma)$$ are axis-aligned. +

+

+ The multivariate Gaussian model can automatically capture correlations between different features of x. +

+

+ However, the original model maintains some advantages: it is computationally cheaper (no matrix to invert, which is costly for a large number of features) and it performs well even with a small training set size (in the multivariate Gaussian model, the number of examples m must be greater than the number of features n for Σ to be invertible).

+

+ ML:Recommender Systems +

+

+ Problem Formulation +

+

+ Recommendation is currently a very popular application of machine learning. +

+

+ Say we are trying to recommend movies to customers. We can use the following definitions +

+ $$n_u$$ = number of users; $$n_m$$ = number of movies; $$r(i,j)$$ = 1 if user j has rated movie i; $$y(i,j)$$ = rating given by user j to movie i (defined only if r(i,j)=1).

+ Content Based Recommendations +

+

+ We can introduce two features, $$x_1$$ and $$x_2$$ which represents how much romance or how much action a movie may have (on a scale of 0−1). +

+

+ One approach is that we could do linear regression for every single user. For each user j, learn a parameter $$\theta^{(j)} \in \mathbb{R}^3$$. Predict user j as rating movie i with $$(\theta^{(j)})^Tx^{(i)}$$ stars. +

+ +

+ For user j, movie i, predicted rating: $$(\theta^{(j)})^T(x^{(i)})$$ +

+ +

+ To learn $$\theta^{(j)}$$, we do the following +

+

+ $$min_{\theta^{(j)}} = \dfrac{1}{2}\displaystyle \sum_{i:r(i,j)=1} ((\theta^{(j)})^T(x^{(i)}) - y^{(i,j)})^2 + \dfrac{\lambda}{2} \sum_{k=1}^n(\theta_k^{(j)})^2$$ +

+

+ This is our familiar linear regression. The first summation ranges over all i such that $$r(i,j) = 1$$, i.e. all the movies user j has rated.

+

+ To get the parameters for all our users, we do the following: +

+

+ $$min_{\theta^{(1)},\dots,\theta^{(n_u)}} = \dfrac{1}{2}\displaystyle \sum_{j=1}^{n_u} \sum_{i:r(i,j)=1} ((\theta^{(j)})^T(x^{(i)}) - y^{(i,j)})^2 + \dfrac{\lambda}{2} \sum_{j=1}^{n_u} \sum_{k=1}^n(\theta_k^{(j)})^2$$ +

+

+ We can apply our linear regression gradient descent update using the above cost function. +

+

+ The only real difference is that we + + eliminate the constant + + $$\dfrac{1}{m}$$. +

+

+ Collaborative Filtering +

+

+ It can be very difficult to find features such as "amount of romance" or "amount of action" in a movie. To figure this out, we can use + + feature finders + + . +

+

+ We can let the users tell us how much they like the different genres, providing their parameter vector immediately for us. +

+

+ To infer the features from given parameters, we use the squared error function with regularization over all the users: +

+

+ $$min_{x^{(1)},\dots,x^{(n_m)}} \dfrac{1}{2} \displaystyle \sum_{i=1}^{n_m} \sum_{j:r(i,j)=1} ((\theta^{(j)})^T x^{(i)} - y^{(i,j)})^2 + \dfrac{\lambda}{2}\sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2$$ +

+

+ You can also + + randomly guess + + the values for theta, use them to infer the features, and repeat. You will actually converge to a good set of features.

+

+ Collaborative Filtering Algorithm +

+

+ To speed things up, we can simultaneously minimize our features and our parameters: +

+

+ $$J(x,\theta) = \dfrac{1}{2} \displaystyle \sum_{(i,j):r(i,j)=1}((\theta^{(j)})^Tx^{(i)} - y^{(i,j)})^2 + \dfrac{\lambda}{2}\sum_{i=1}^{n_m} \sum_{k=1}^{n} (x_k^{(i)})^2 + \dfrac{\lambda}{2}\sum_{j=1}^{n_u} \sum_{k=1}^{n} (\theta_k^{(j)})^2$$ +

+

+ It looks very complicated, but we've only combined the cost function for theta and the cost function for x. +

+

+ Because the algorithm can learn them itself, the bias units where $$x_0 = 1$$ have been removed; therefore $$x \in \mathbb{R}^n$$ and $$\theta \in \mathbb{R}^n$$.

+

+ These are the steps in the algorithm: +

+
    +
+   1. Initialize $$x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)}$$ to small random values. This serves to break symmetry and ensures that the algorithm learns features $$x^{(1)},...,x^{(n_m)}$$ that are different from each other.
+
+   2. Minimize $$J(x^{(1)},...,x^{(n_m)},\theta^{(1)},...,\theta^{(n_u)})$$ using gradient descent (or an advanced optimization algorithm); see the Octave sketch after this list. E.g. for every $$j=1,...,n_u,\ i=1,...,n_m$$:
+
+ $$x_k^{(i)} := x_k^{(i)} - \alpha\left (\displaystyle \sum_{j:r(i,j)=1}{((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) \theta_k^{(j)}} + \lambda x_k^{(i)} \right)$$
+
+ $$\theta_k^{(j)} := \theta_k^{(j)} - \alpha\left (\displaystyle \sum_{i:r(i,j)=1}{((\theta^{(j)})^T x^{(i)} - y^{(i,j)}) x_k^{(i)}} + \lambda \theta_k^{(j)} \right)$$
+
+   3. For a user with parameters θ and a movie with (learned) features x, predict a star rating of $$\theta^Tx$$.
+
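+ Here is a vectorized Octave sketch of the cost and gradients minimized in step 2. It assumes Y is the n_m x n_u matrix of ratings, R is the binary indicator matrix with R(i,j)=1 when movie i was rated by user j, X is n_m x n, and Theta is n_u x n; these names happen to match the later ex8 exercise, but the code is only an illustration:
+
+E = (X * Theta' - Y) .* R;    % prediction errors, counted only where r(i,j)=1
+
+J = (1/2) * sum(sum(E .^ 2)) ...
+    + (lambda/2) * (sum(sum(X .^ 2)) + sum(sum(Theta .^ 2)));
+
+X_grad     = E  * Theta + lambda * X;      % gradients for the features x^(i)
+Theta_grad = E' * X     + lambda * Theta;  % gradients for the parameters theta^(j)
+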

+ Vectorization: Low Rank Matrix Factorization +

+

+ Given matrices X (each row containing features of a particular movie) and Θ (each row containing the weights for those features for a given user), then the full matrix Y of all predicted ratings of all movies by all users is given simply by: $$Y = X\Theta^T$$. +

+

+ Predicting how similar two movies i and j are can be done using the distance between their respective feature vectors x. Specifically, we are looking for a small value of $$||x^{(i)} - x^{(j)}||$$. +

+
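+ In Octave both of these are short one-liners (a sketch; X is n_m x n, Theta is n_u x n, and i is the index of a chosen movie):
+
+Y_pred = X * Theta';            % n_m x n_u matrix of all predicted ratings
+
+% distance from movie i to every other movie
+d = sqrt(sum((X - repmat(X(i,:), size(X,1), 1)) .^ 2, 2));
+[s, idx] = sort(d);             % idx(2:6) are the 5 most similar movies
+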

+ Implementation Detail: Mean Normalization +

+

+ If the rating system from the previous lectures is used, then new users (who have rated no movies) will be given incorrect predictions. Specifically, they will be assigned θ with all components equal to zero due to the minimization of the regularization term. That is, we assume that the new user will rate all movies 0, which does not seem intuitively correct.

+

+ We rectify this problem by normalizing the data relative to the mean. First, we use a matrix Y to store the data from previous ratings, where the ith row of Y is the ratings for the ith movie and the jth column corresponds to the ratings for the jth user. +

+

+ We can now define a vector +

+

+ $$\mu = [\mu_1, \mu_2, \dots , \mu_{n_m}]$$ +

+

+ such that +

+

+ $$\mu_i = \frac{\sum_{j:r(i,j)=1}{Y_{i,j}}}{\sum_{j}{r(i,j)}}$$ +

+

+ This is effectively the mean of the previous ratings for the ith movie (averaging only over the users who have actually rated it). We can now normalize the data by subtracting μ, the vector of mean ratings, from the actual ratings for each user (column in matrix Y):

+

+ As an example, consider the following matrix Y and mean ratings μ: +

+

+ $$Y = +\begin{bmatrix} + 5 & 5 & 0 & 0 \newline + 4 & ? & ? & 0 \newline + 0 & 0 & 5 & 4 \newline + 0 & 0 & 5 & 0 \newline +\end{bmatrix}, \quad + \mu = +\begin{bmatrix} + 2.5 \newline + 2 \newline + 2.25 \newline + 1.25 \newline +\end{bmatrix}$$ +

+

+ The resulting Y′ matrix is:

+

+ $$Y' =
\begin{bmatrix}
  2.5 & 2.5 & -2.5 & -2.5 \newline
  2 & ? & ? & -2 \newline
  -2.25 & -2.25 & 2.75 & 1.75 \newline
  -1.25 & -1.25 & 3.75 & -1.25
\end{bmatrix}$$

+

+ Now we must slightly modify the linear regression prediction to include the mean normalization term: +

+

+ $$(\theta^{(j)})^T x^{(i)} + \mu_i$$ +

+

+ Now, for a new user, the initial predicted values will be equal to the μ term instead of simply being initialized to zero, which is more accurate. +

+
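+ A short Octave sketch of the whole normalization, again using the indicator matrix R with R(i,j)=1 when movie i has been rated by user j (this sketch assumes unrated entries of Y are stored as 0 and that every movie has at least one rating):
+
+Ymean = sum(Y .* R, 2) ./ sum(R, 2);              % mean of each movie's existing ratings
+Ynorm = (Y - repmat(Ymean, 1, size(Y,2))) .* R;   % unrated entries stay at 0
+
+% after learning X and Theta on Ynorm, predict all ratings for user j:
+pred_j = X * Theta(j,:)' + Ymean;
+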

+

+

+

+
+ + + diff --git a/ML_Mathematical_Approach/12_week-10-lecture-notes/01__resources.html b/ML_Mathematical_Approach/12_week-10-lecture-notes/01__resources.html new file mode 100644 index 0000000..8179fb5 --- /dev/null +++ b/ML_Mathematical_Approach/12_week-10-lecture-notes/01__resources.html @@ -0,0 +1,200 @@ + + +

+ Learning with Large Datasets +

+

+ We mainly benefit from a very large dataset when our algorithm has high variance when m is small. Recall that if our algorithm has high bias, more data will not have any benefit. +

+

+ Datasets can often approach such sizes as m = 100,000,000. In this case, our gradient descent step will have to make a summation over all one hundred million examples. We will want to try to avoid this -- the approaches for doing so are described below. +

+

+ Stochastic Gradient Descent +

+

+ Stochastic gradient descent is an alternative to classic (or batch) gradient descent and is more efficient and scalable to large data sets. +

+

+ Stochastic gradient descent is written out in a different but similar way: +

+

+ $$cost(\theta,(x^{(i)}, y^{(i)})) = \dfrac{1}{2}(h_{\theta}(x^{(i)}) - y^{(i)})^2$$ +

+

+ The only difference in the above cost function is the elimination of the m constant within $$\dfrac{1}{2}$$; this is the cost of a single training example rather than an average over all m examples.

+

+ $$J_{train}(\theta) = \dfrac{1}{m} \displaystyle \sum_{i=1}^m cost(\theta, (x^{(i)}, y^{(i)}))$$ +

+

+ $$J_{train}$$ is now just the average of the cost applied to all of our training examples. +

+

+ The algorithm is as follows +

+
    +
+   1. Randomly 'shuffle' the dataset.
+
+   2. For $$i = 1\dots m$$, update every parameter $$j = 0,\dots,n$$ using the single example $$(x^{(i)}, y^{(i)})$$:
+
+ $$\Theta_j := \Theta_j - \alpha (h_{\Theta}(x^{(i)}) - y^{(i)}) \cdot x^{(i)}_j$$

+

+ This algorithm will only try to fit one training example at a time. This way we can make progress in gradient descent without having to scan all m training examples first. Stochastic gradient descent will be unlikely to converge at the global minimum and will instead wander around it randomly, but usually yields a result that is close enough. Stochastic gradient descent will usually take 1-10 passes through your data set to get near the global minimum. +

+
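+ A minimal Octave sketch of one pass of stochastic gradient descent for linear regression (X is m x (n+1) with the bias column, y is m x 1, and alpha and theta are assumed to be initialized already; in practice you would repeat the loop 1-10 times):
+
+idx = randperm(m);                % step 1: shuffle the examples
+X = X(idx, :);
+y = y(idx);
+
+for i = 1:m                       % step 2: update using one example at a time
+  h = X(i,:) * theta;             % scalar prediction for example i
+  theta = theta - alpha * (h - y(i)) * X(i,:)';
+end
+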

+ Mini-Batch Gradient Descent +

+

+ Mini-batch gradient descent can sometimes be even faster than stochastic gradient descent. Instead of using all m examples as in batch gradient descent, and instead of using only 1 example as in stochastic gradient descent, we will use some in-between number of examples b. +

+

+ Typical values for b range from 2-100 or so. +

+

+ For example, with b=10 and m=1000: +

+

+ Repeat: +

+

+ For $$i = 1,11,21,31,\dots,991$$ +

+

+ $$\theta_j := \theta_j - \alpha \dfrac{1}{10} \displaystyle \sum_{k=i}^{i+9} (h_\theta(x^{(k)}) - y^{(k)})x_j^{(k)}$$ +

+

+ We're simply summing over ten examples at a time. The advantage of computing more than one example at a time is that we can use vectorized implementations over the b examples. +

+
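+ A corresponding Octave sketch (assuming for simplicity that b divides m evenly, with X, y, theta, and alpha as before):
+
+b = 10;
+for i = 1:b:m                     % i = 1, 11, 21, ..., m-9
+  Xb = X(i:i+b-1, :);             % the next b examples
+  yb = y(i:i+b-1);
+  theta = theta - alpha * (1/b) * Xb' * (Xb * theta - yb);   % one vectorized step
+end
+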

+ Stochastic Gradient Descent Convergence +

+

+ How do we choose the learning rate α for stochastic gradient descent? Also, how do we debug stochastic gradient descent to make sure it is getting as close as possible to the global optimum? +

+

+ One strategy is to plot the average cost of the hypothesis applied to every 1000 or so training examples. We can compute and save these costs during the gradient descent iterations. +

+
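+ One way to sketch this in Octave is to accumulate the per-example cost inside the stochastic gradient descent loop and log its average every 1000 examples (illustrative only; the variable names are our own):
+
+costs = [];  avg_cost = [];
+for i = 1:m
+  err = X(i,:) * theta - y(i);
+  costs(end+1) = 0.5 * err^2;        % cost on this example, before updating
+  theta = theta - alpha * err * X(i,:)';
+  if mod(i, 1000) == 0
+    avg_cost(end+1) = mean(costs);   % average cost over the last 1000 examples
+    costs = [];
+  end
+end
+plot(avg_cost);
+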

+ With a smaller learning rate, it is + + possible + + that you may get a slightly better solution with stochastic gradient descent. That is because stochastic gradient descent will oscillate and jump around the global minimum, and it will make smaller random jumps with a smaller learning rate. +

+

+ If you increase the number of examples you average over to plot the performance of your algorithm, the plot's line will become smoother. +

+

+ With a very small number of examples for the average, the line will be too noisy and it will be difficult to find the trend. +

+

+ One strategy for trying to actually converge at the global minimum is to + + slowly decrease α over time + + . For example $$\alpha = \dfrac{const1}{iterationNumber + const2}$$ +

+

+ However, this is not often done because people don't want to have to fiddle with even more parameters. +

+

+ Online Learning +

+

+ With a continuous stream of users to a website, we can run an endless loop that gets (x,y), where we collect some user actions for the features in x to predict some behavior y. +

+

+ You can update θ for each individual (x,y) pair as you collect them. This way, you can adapt to new pools of users, since you are continuously updating theta. +

+
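+ A sketch of such a loop for logistic regression in Octave (get_next_example() is a hypothetical stand-in for however your site collects a fresh (x, y) pair):
+
+while true
+  [x, y] = get_next_example();          % hypothetical stream; x is (n+1) x 1
+  h = 1 / (1 + exp(-theta' * x));       % current probability estimate
+  theta = theta - alpha * (h - y) * x;  % learn from this pair, then discard it
+end
+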

+ Map Reduce and Data Parallelism +

+

+ We can divide up batch gradient descent and dispatch the cost function for a subset of the data to many different machines so that we can train our algorithm in parallel. +

+

+ You can split your training set into z subsets corresponding to the number of machines you have. On each of those machines calculate $$\displaystyle \sum_{i=p}^{q}(h_{\theta}(x^{(i)}) - y^{(i)}) \cdot x_j^{(i)}$$, where we've split the data starting at p and ending at q. +

+

+ MapReduce will take all these dispatched (or 'mapped') jobs and 'reduce' them by calculating: +

+

+ $$\Theta_j := \Theta_j - \alpha \dfrac{1}{m}(temp_j^{(1)} + temp_j^{(2)} + \cdots + temp_j^{(z)})$$

+

+ For all $$j = 0, \dots, n$$. +

+

+ This is simply adding up the partial sums computed by all the machines, scaling by $$\dfrac{1}{m}$$ to recover the usual batch gradient, multiplying by the learning rate, and updating theta.

+
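+ A single-machine Octave sketch of the idea (the for-loop stands in for work that would really run on z separate machines; the variable names are our own):
+
+z = 4;
+edges = round(linspace(0, m, z + 1));   % chunk boundaries over the m examples
+temp = zeros(length(theta), z);
+
+for k = 1:z                             % "map": each chunk on its own machine
+  p = edges(k) + 1;  q = edges(k+1);
+  temp(:, k) = X(p:q, :)' * (X(p:q, :) * theta - y(p:q));
+end
+
+theta = theta - alpha * (1/m) * sum(temp, 2);   % "reduce": combine partial sums
+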

+ Your learning algorithm is MapReduceable if it can be + + expressed as computing sums of functions over the training set + + . Linear regression and logistic regression are easily parallelizable. +

+

+ For neural networks, you can compute forward propagation and back propagation on subsets of your data on many machines. Those machines can report their derivatives back to a 'master' server that will combine them. +

+

+

+
+ + + diff --git a/ML_Mathematical_Approach/13_errata-week-1/01__resources.html b/ML_Mathematical_Approach/13_errata-week-1/01__resources.html new file mode 100644 index 0000000..6c6d4e2 --- /dev/null +++ b/ML_Mathematical_Approach/13_errata-week-1/01__resources.html @@ -0,0 +1,133 @@ + + +

+ Introduction +

+ + +

+ Linear Regression With One Variable +

+ +

+ Gradient Descent for Linear Regression +

+ +

+ Linear Algebra Review +

+ +

+ Addition and scalar multiplication video +

+ +
+ + + diff --git a/ML_Mathematical_Approach/14_errata-week-2/01__resources.html b/ML_Mathematical_Approach/14_errata-week-2/01__resources.html new file mode 100644 index 0000000..df73be2 --- /dev/null +++ b/ML_Mathematical_Approach/14_errata-week-2/01__resources.html @@ -0,0 +1,145 @@ + + +

+ Errors in the video lectures +

+ +

+

+ +

+ Errors in the Programming Exercise Instructions +

+ +

+ Errors in the programming exercise scripts +

+ +
+ + + diff --git a/ML_Mathematical_Approach/15_errata-week-3/01__resources.html b/ML_Mathematical_Approach/15_errata-week-3/01__resources.html new file mode 100644 index 0000000..8031b4b --- /dev/null +++ b/ML_Mathematical_Approach/15_errata-week-3/01__resources.html @@ -0,0 +1,163 @@ + + +

+ VI. Logistic Regression +

+

+ Decision Boundary +

+

+ At 1:56 in the transcript, it should read 'sigmoid function' instead of 'sec y function'. +

+

+ Cost Function +

+

+ The section between 8:30 and 9:20 is then repeated from 9:20 to the quiz. The case for y=0 is explained twice. +

+

+ Simplified Cost Function and Gradient Descent +

+

+ The following mistakes also exist in the video:

+ +

+ Advanced Optimization +

+

+ In the video at 7:30, the notation for specifying MaxIter is incorrect. The value provided should be an integer, not a character string. So (...'MaxIter', '100') is incorrect. It should be (...'MaxIter', 100). This error only exists in the video - the exercise script files are correct. +

+

+ VII. Regularization +

+

+ The Problem of Overfitting +

+

+ At 2:07, a curve is drawn using the predicting function $$\theta_0+\theta_1 x+\theta_2 x^2$$, which is described as "just right". But when the size of the house is large enough, the prediction of this function will increase much faster than linear if $$\theta_2 > 0$$, or will decrease to −∞ if $$\theta_2 < 0$$, neither of which corresponds to reality. Instead, $$\theta_0+\theta_1 x+\theta_2 \sqrt{x}$$ may be "just right".

+

+ At 2:28, a curve is drawn using a quartic (degree 4) polynomial predicting function $$\theta_0+\theta_1 x+\theta_2 x^2 +\theta_3 x^3 +\theta_4 x^4$$; however, the curve drawn is at least quintic (degree 5). +

+

+ Cost Function +

+

+ In the video at 5:17, the sum of the regularization term should use 'j' instead of 'i', giving $$\sum_{j=1}^{n} \theta _j ^2$$ instead of $$\sum_{i=1}^{n} \theta _j ^2$$. +

+

+ Regularized linear regression +

+

+ In the video starting at 8:04, Prof Ng discusses the Normal Equation and invertibility. He states that X is non-invertible if m <= n. The correct statement should be that X is non-invertible if m < n, and may be non-invertible if m = n. +

+

+ Regularized logistic regression +

+

+ In the video at 3:52, the lecturer mistakenly said "gradient descent for regularized linear regression". Indeed, it should be "gradient descent for regularized logistic regression". +

+

+ In the video at 5:21, the cost function is missing a pair of parentheses around the second log argument. It should be $$J(\theta) = \left[-\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log(h _\theta (x^{(i)})) + (1-y^{(i)})\log(1-h _\theta (x^{(i)}))\right)\right] + \frac{\lambda}{2m} \sum_{j=1}^{n} \theta ^2 _j$$

+

+ In the original videos for the course (ML-001 through ML-008), there were typos in the equation for regularized logistic regression in both the video lecture and the PDF lecture notes. In the slides for "Gradient descent" and "advanced optimization", there should be positive signs for the regularization term of the gradient. The formula on page 10 of 'ex2.pdf' is correct. These issues in the video were corrected for the 'on-demand' format of the course. +

+

+ Quizzes +

+ +

+ Programming Exercise Errata +

+ +

+

+
+ + + diff --git a/ML_Mathematical_Approach/16_errata-week-4/01__resources.html b/ML_Mathematical_Approach/16_errata-week-4/01__resources.html new file mode 100644 index 0000000..ba35ac1 --- /dev/null +++ b/ML_Mathematical_Approach/16_errata-week-4/01__resources.html @@ -0,0 +1,101 @@ + + +

+ Errata in the video lectures +

+ +

+ Errata in the programming exercise +

+ +

+ Errata in the quiz +

+ +
+ + + diff --git a/ML_Mathematical_Approach/17_errata-week-5/01__resources.html b/ML_Mathematical_Approach/17_errata-week-5/01__resources.html new file mode 100644 index 0000000..5ff57cd --- /dev/null +++ b/ML_Mathematical_Approach/17_errata-week-5/01__resources.html @@ -0,0 +1,212 @@ + + +

+ Errata in video "Backpropagation Algorithm" +

+ +

+ Errata in video "Backpropagation Intuition" +

+ +

+ Errata in video "Implementation Note: Unrolling Parameters" +

+ +

+ Errata in video "Gradient Checking" +

+

+ Errata in video "Random Initialization" +

+ +

+ Errata in video "Putting It Together" +

+ +

+ Errata in the lecture slides (Lecture9.pdf) +

+ +

+ Errata in ex4.pdf +

+ +

+ Errata in the programming exercise scripts +

+ +
+ + + diff --git a/ML_Mathematical_Approach/18_errata-week-6/01__resources.html b/ML_Mathematical_Approach/18_errata-week-6/01__resources.html new file mode 100644 index 0000000..bceb3ec --- /dev/null +++ b/ML_Mathematical_Approach/18_errata-week-6/01__resources.html @@ -0,0 +1,121 @@ + + +

+ + Errata in the Graded Quizzes + +

+

+ Quiz questions in Week 6 should refer to linear regression, not logistic regression (typo only). +

+

+ + Errata in the Video Lectures + +

+

+ In the "Regularization and Bias/Variance" video +

+

+ The slide "Linear Regression with Regularization" has an error in the formula for J(θ): the regularization term should go from j=1 up to n (and not m), that is $$\frac{\lambda}{2m} \sum_{j=1}^n \theta_j^2$$. The quiz in the video "Regularization and Bias/Variance" has regularization terms for $$J_{train}$$ and $$J_{CV}$$, while the rest of the video stresses that these should not be there. Also, the quiz says "Consider regularized logistic regression," but exhibits cost functions for regularized linear regression. +

+

+ At around 5:58, Prof. Ng says, "picking theta-5, the fifth order polynomial". Instead, he should have said the fifth value of λ (0.08), because in this example, the polynomial degree is fixed at d = 4 and we are varying λ. +

+

+ In the "Advice for applying ML" set of videos +

+

+ Often (if not always) + + the sums corresponding to the regularization terms in J(θ) + + are (erroneously) written with j running from 1 to m. In fact, + + j should run from 1 to n + + , that is, the regularization term should be $$\lambda \sum_{j=1}^n \theta_j^2$$. The variable m is the number of (x,y) pairs in the set used to calculate the cost, while n is the largest index of j in the θj parameters or in the elements $$x_j$$ of the vector of features. +

+

+ In the "Advice for Applying Machine Learning" section, the figure that illustrates the relationship between lambda and the hypothesis. used to detect high variance or high bias, is incorrect. Jtrain is low when lambda is small (indicating a high variance problem) and high when lambda is high (indicating a high bias problem). +

+

+ Video (10-2: Advice for Applying Machine Learning -- hypothesis testing) +

+

+ The slide that introduces + + Training/Testing procedure for logistic regression + + , (around 04:50) the cost function is incorrect. It should be: +

+

+ $$J_{\mathrm{test}}(\theta)=-\frac{1}{m_{\mathrm{test}}}\sum_{i=1}^{m_{\mathrm{test}}}\left(y_{\mathrm{test}}^{(i)}\cdot \log(h_{\theta}(x_{\mathrm{test}}^{(i)})) + (1-y_{\mathrm{test}}^{(i)})\cdot \log(1-h_{\theta}(x_{\mathrm{test}}^{(i)}))\right)$$ +

+

+ Video Regularization and Bias/Variance (00:48) +

+

+ Regularization term is wrong. Should be $$\frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$ and not sum over m. +

+

+ Videos 10-4 and 10-5: current subtitles are mistimed +

+

+ Looks like the videos were updated in Sept 2014, but the subtitles were not updated accordingly. (10-3 was also updated in Aug 2014, but the subtitles were updated) +

+

+ Errata in the ex5 programming exercise +

+

+ In ex5.m at line 104, the reference to "slide 8 in ML-advice.pdf" should be "Figure 3 in ex5.pdf". +

+
+ + + diff --git a/ML_Mathematical_Approach/19_errata-week-7/01__resources.html b/ML_Mathematical_Approach/19_errata-week-7/01__resources.html new file mode 100644 index 0000000..3acda4a --- /dev/null +++ b/ML_Mathematical_Approach/19_errata-week-7/01__resources.html @@ -0,0 +1,130 @@ + + +

+ Errata in video lectures +

+ +

+ Errata in programming assignment ex6 +

+ +
+ + + diff --git a/ML_Mathematical_Approach/20_errata-week-8/01__resources.html b/ML_Mathematical_Approach/20_errata-week-8/01__resources.html new file mode 100644 index 0000000..fd0e556 --- /dev/null +++ b/ML_Mathematical_Approach/20_errata-week-8/01__resources.html @@ -0,0 +1,106 @@ + + +

+ Video Lecture Errata +

+

+ In the video ‘Motivation II: Visualization’, around 2:45, prof. Ng says $$ℝ^2$$, but writes ℝ. The latter is incorrect and should be $$ℝ^2$$. +

+

+ In the video ‘Motivation II: Visualization’, the quiz at 5:00 has a typo where the reduced data set should go up to $$z^{(n)}$$ rather than $$z^{(m)}$$.

+

+ In the video "Principal Component Analysis Algorithm", around 1:00 the slide should read "Replace each $$x_j^{(i)}$$ with $$x_j^{(i)}-\mu_j$$." (The second x is missing the superscript (i).) +

+

+ In the video "Principal Component Analysis Algorithm", the formula shown at around 5:00 incorrectly shows summation from 1 to n. The correct summation (shown later in the video) is from 1 to m. In the matrix U shown at around 9:00 incorrectly shows superscript of last column-vector "u" as m, the correct superscript is n. +

+

+ In the video "Reconstruction from Compressed Representation", the quiz refers to a formula which is defined in the next video, "Choosing the Number of Principal Components" +

+

+ In the video "Choosing the number of principal components" at 8:45, the summation in the denominator should be from 1 to n (not 1 to m). +

+

+ In the in-video quiz in "Data Compression" at 9:47 the correct answer contains k≤n but it should be k<n. +

+

+ Programming Exercise Errata +

+

+ In the ex7.pdf file, Section 2.2 says “You task is to complete the code” but it should be “Your task”

+

+ In the ex7.pdf file, Section 2.4.1 should say that each column (not row) vector of U represents a principal component. +

+

+ In the ex7.pdf file, Section 2.4.2 there is a typo: “predict the identitfy of the person” (the 'f' is unneeded).

+

+ In the ex7_pca.m file at line 126, the fprintf string says '(this mght take a minute or two ...)'. The 'mght' should be 'might'. +

+

+ In the ex7 projectData.m file, update the Instructions to read: +

+
%    projection_k = x' * U(:, k);
+
+

+ In the function script "pca.m", the 3rd line should read "[U, S] = pca(X)" not "[U, S, X] = pca(X)" +

+
+ + + diff --git a/ML_Mathematical_Approach/21_errata-week-9/01__resources.html b/ML_Mathematical_Approach/21_errata-week-9/01__resources.html new file mode 100644 index 0000000..147ff33 --- /dev/null +++ b/ML_Mathematical_Approach/21_errata-week-9/01__resources.html @@ -0,0 +1,123 @@ + + +

+ XV. Anomaly Detection +

+

+ At the risk of being pedantic, it should be noted that p(x) is not a probability but rather the normalized probability density as parameterized by the feature vector, x; therefore, ϵ is a threshold condition on the probability density. Determination of the actual probability would require integration of this density over the appropriate extent of phase space. +

+

+ In the + + Developing and Evaluating an Anomaly Detection System + + video an alternative way for some people to split the data is to use the same data for the cv and test sets, therefore the number of anomalous engines (y = 1) in each set would be + + 20 + + rather than 10 as it states on the slide. +

+

+ XVI. Recommender Systems +

+

+ In the review questions, in the question 5 option starting "Recall that the cost function for the content-based recommendation system is", the right side of the formula should be divided by m, where m is the number of movies. That would mean the formula is no longer the standard cost function for the content-based recommendation system. However, without this change the correct answer is marked as incorrect and vice versa. This description is not very clear, but being more specific would mean breaking the honour code.

+

+ In the Problem Formulation video the review question states that the no. of movies is $$n_m = 1$$. The correct value is $$n_m = 2$$.

+

+ In "Collaborative Filtering" video, review question 2: "Which of the following is a correct gradient descent update rule for i ≠ 0?"; Instead of i ≠ 0 it should be k≠0. +

+

+ In lesson 5 "Vectorization: Low Rank Matrix Factorization" and in lesson 6 "Implementation detail: Mean normalization" the matrix Y contains a mistake. The element $$Y^{(5,4)}$$ (Dave's opinion on "Sword vs Karate") should be a question mark but is incorrectly given as 0. +

+

+ In lesson 6 this mistake is propagated to the calculation of μ. When μ is calculated, the 5th movie is given an average rating of 1.25 because (0+0+5+0)/4=1.25, but it should be (0+0+5)/3=1.67. This then affects the new values in the matrix Y.

+

+ In ex8_cofi.m at line 199, where theta is trained using fmincg() for the movie ratings, the use of "Y" in the function call should be "Ynorm". Y is normalized in line 181, creating Ynorm, but then it is never used. The video lecture "Implementation Detail: Mean Normalization" at 5:34 makes it pretty clear that the normalized Y matrix should be used for calculating theta. +

+

+ In ex8.pdf section 2, "collaborative fitlering" should be "collaborative filtering"

+

+ In ex8.pdf section 2.2.1, “it will later by called” should be “it will later be called”

+

+ In checkCostFunction.m it prints "If your backpropagation implementation is correct...", but in this exercise there is no backpropagation. +

+

+ In the quiz, question 4 has an invalid phrase: "Even if you each user has rated only a small fraction of all of your products (so r(i,j)=0 for the vast majority of (i,j) pairs), you can still build a recommender system by using collaborative filtering." The word "you" seems misplaced; it should be "your" or omitted.

+

+ In the quiz, question 4, one of the answer options has a typo: "For collaborative filtering, it is possible to use one of the advanced optimization algoirthms"

+

+ In ex8.pdf at the bottom of page 8, the text says that the number of features used by ex8_cofi.m is 100. Actually the number of features is 10, not 100. +

+
+ + + diff --git a/ML_Mathematical_Approach/22_programming-ex-1/01__Basic-Statistical-Functions.html b/ML_Mathematical_Approach/22_programming-ex-1/01__Basic-Statistical-Functions.html new file mode 100644 index 0000000..68e4797 --- /dev/null +++ b/ML_Mathematical_Approach/22_programming-ex-1/01__Basic-Statistical-Functions.html @@ -0,0 +1,323 @@ + + + + +GNU Octave: Basic Statistical Functions + + + + + + + + + + + + + + + + + + + + + +
+


+
+
+ +

26.2 Basic Statistical Functions

+ +

Octave supports various helpful statistical functions. Many are useful as +initial steps to prepare a data set for further analysis. Others provide +different measures from those of the basic descriptive statistics. +

+
+
: center (x)
+
: center (x, dim)
+

Center data by subtracting its mean. +

+

If x is a vector, subtract its mean. +

+

If x is a matrix, do the above for each column. +

+

If the optional argument dim is given, operate along this dimension. +

+

Programming Note: center has obvious application for normalizing +statistical data. It is also useful for improving the precision of general +numerical calculations. Whenever there is a large value that is common +to a batch of data, the mean can be subtracted off, the calculation +performed, and then the mean added back to obtain the final answer. +

+

See also: zscore. +

+ + +
+
: z = zscore (x)
+
: z = zscore (x, opt)
+
: z = zscore (x, opt, dim)
+
: [z, mu, sigma] = zscore (…)
+

Compute the Z score of x +

+

If x is a vector, subtract its mean and divide by its standard +deviation. If the standard deviation is zero, divide by 1 instead. +

+

The optional parameter opt determines the normalization to use when +computing the standard deviation and has the same definition as the +corresponding parameter for std. +

+

If x is a matrix, calculate along the first non-singleton dimension. +If the third optional argument dim is given, operate along this +dimension. +

+

The optional outputs mu and sigma contain the mean and standard +deviation. +

+ +

See also: mean, std, center. +

+ + +
+
: n = histc (x, edges)
+
: n = histc (x, edges, dim)
+
: [n, idx] = histc (…)
+

Compute histogram counts. +

+

When x is a vector, the function counts the number of elements of +x that fall in the histogram bins defined by edges. This +must be a vector of monotonically increasing values that define the edges +of the histogram bins. n(k) contains the number of elements +in x for which edges(k) <= x < edges(k+1). +The final element of n contains the number of elements of x +exactly equal to the last element of edges. +

+

When x is an N-dimensional array, the computation is carried +out along dimension dim. If not specified dim defaults to the +first non-singleton dimension. +

+

When a second output argument is requested an index matrix is also returned. +The idx matrix has the same size as x. Each element of +idx contains the index of the histogram bin in which the +corresponding element of x was counted. +

+

See also: hist. +

+ + +

The unique function is often
+useful for statistics.

+
+
: c = nchoosek (n, k)
+
: c = nchoosek (set, k)
+
+

Compute the binomial coefficient of n or list all possible +combinations of a set of items. +

+

If n is a scalar then calculate the binomial coefficient +of n and k which is defined as +

+
+
 /   \
+ | n |    n (n-1) (n-2) … (n-k+1)       n!
+ |   |  = ------------------------- =  ---------
+ | k |               k!                k! (n-k)!
+ \   /
+
+ +

This is the number of combinations of n items taken in groups of +size k. +

+

If the first argument is a vector, set, then generate all +combinations of the elements of set, taken k at a time, with +one row per combination. The result c has k columns and +nchoosek (length (set), k) rows. +

+

For example: +

+

How many ways can three items be grouped into pairs? +

+
+
nchoosek (3, 2)
+   ⇒ 3
+
+ +

What are the possible pairs? +

+
+
nchoosek (1:3, 2)
+   ⇒  1   2
+       1   3
+       2   3
+
+ +

Programming Note: When calculating the binomial coefficient nchoosek +works only for non-negative, integer arguments. Use bincoeff for +non-integer and negative scalar arguments, or for computing many binomial +coefficients at once with vector inputs for n or k. +

+ +

See also: bincoeff, perms. +

+ + +
+
: perms (v)
+

Generate all permutations of v with one row per permutation. +

+

The result has size factorial (n) * n, where n +is the length of v. +

+

Example +

+
+
perms ([1, 2, 3])
+⇒
+  1   2   3
+  2   1   3
+  1   3   2
+  2   3   1
+  3   1   2
+  3   2   1
+
+ +

Programming Note: The maximum length of v should be less than or +equal to 10 to limit memory consumption. +

+

See also: permute, randperm, nchoosek. +

+ + +
+
: ranks (x, dim)
+

Return the ranks of x along the first non-singleton dimension +adjusted for ties. +

+

If the optional argument dim is given, operate along this dimension. +

+

See also: spearman, kendall. +

+ + +
+
: run_count (x, n)
+
: run_count (x, n, dim)
+

Count the upward runs along the first non-singleton dimension of x +of length 1, 2, …, n-1 and greater than or equal to n. +

+

If the optional argument dim is given then operate along this +dimension. +

+

See also: runlength. +

+ + +
+
: count = runlength (x)
+
: [count, value] = runlength (x)
+

Find the lengths of all sequences of common values. +

+

count is a vector with the lengths of each repeated value. +

+

The optional output value contains the value that was repeated in +the sequence. +

+
+
runlength ([2, 2, 0, 4, 4, 4, 0, 1, 1, 1, 1])
+⇒  [2, 1, 3, 1, 4]
+
+ +

See also: run_count. +

+ + +
+
: probit (p)
+

Return the probit (the quantile of the standard normal distribution) for +each element of p. +

+

See also: logit. +

+ + +
+
: logit (p)
+

Compute the logit for each value of p +

+

The logit is defined as +

+
+
logit (p) = log (p / (1-p))
+
+ + +

See also: probit, logistic_cdf. +

+ + +
+
: cloglog (x)
+

Return the complementary log-log function of x. +

+

The complementary log-log function is defined as +

+
+
cloglog (x) = - log (- log (x))
+
+ +
+ + +
+
: [t, l_x] = table (x)
+
: [t, l_x, l_y] = table (x, y)
+

Create a contingency table t from data vectors. +

+

The l_x and l_y vectors are the corresponding levels. +

+

Currently, only 1- and 2-dimensional tables are supported. +

+ + +
+
+


+
+ + + + + diff --git a/ML_Mathematical_Approach/22_programming-ex-1/01__Broadcasting.html b/ML_Mathematical_Approach/22_programming-ex-1/01__Broadcasting.html new file mode 100644 index 0000000..55976f5 --- /dev/null +++ b/ML_Mathematical_Approach/22_programming-ex-1/01__Broadcasting.html @@ -0,0 +1,278 @@ + + + + +GNU Octave: Broadcasting + + + + + + + + + + + + + + + + + + + + + +
+


+
+
+ +

19.2 Broadcasting

+ + + + + + +

Broadcasting refers to how Octave binary operators and functions behave +when their matrix or array operands or arguments differ in size. Since +version 3.6.0, Octave now automatically broadcasts vectors, matrices, +and arrays when using elementwise binary operators and functions. +Broadly speaking, smaller arrays are “broadcast” across the larger +one, until they have a compatible shape. The rule is that corresponding +array dimensions must either +

+
    +
  1. be equal, or + +
  2. one of them must be 1. +
+ +

In case all dimensions are equal, no broadcasting occurs and ordinary +element-by-element arithmetic takes place. For arrays of higher +dimensions, if the number of dimensions isn’t the same, then missing +trailing dimensions are treated as 1. When one of the dimensions is 1, +the array with that singleton dimension gets copied along that dimension +until it matches the dimension of the other array. For example, consider +

+
+
x = [1 2 3;
+     4 5 6;
+     7 8 9];
+
+y = [10 20 30];
+
+x + y
+
+ +

Without broadcasting, x + y would be an error because the dimensions +do not agree. However, with broadcasting it is as if the following +operation were performed: +

+
+
x = [1 2 3
+     4 5 6
+     7 8 9];
+
+y = [10 20 30
+     10 20 30
+     10 20 30];
+
+x + y
+⇒    11   22   33
+      14   25   36
+      17   28   39
+
+ +

That is, the smaller array of size [1 3] gets copied along the +singleton dimension (the number of rows) until it is [3 3]. No +actual copying takes place, however. The internal implementation reuses +elements along the necessary dimension in order to achieve the desired +effect without copying in memory. +

+

Both arrays can be broadcast across each other, for example, all +pairwise differences of the elements of a vector with itself: +

+
+
y - y'
+⇒    0   10   20
+    -10    0   10
+    -20  -10    0
+
+ +

Here the vectors of size [1 3] and [3 1] both get +broadcast into matrices of size [3 3] before ordinary matrix +subtraction takes place. +

+

A special case of broadcasting that may be familiar is when all +dimensions of the array being broadcast are 1, i.e., the array is a +scalar. Thus for example, operations like x - 42 and max +(x, 2) are basic examples of broadcasting. +

+

For a higher-dimensional example, suppose img is an RGB image of +size [m n 3] and we wish to multiply each color by a different +scalar. The following code accomplishes this with broadcasting, +

+
+
img .*= permute ([0.8, 0.9, 1.2], [1, 3, 2]);
+
+ +

Note the usage of permute to match the dimensions of the +[0.8, 0.9, 1.2] vector with img. +

+

For functions that are not written with broadcasting semantics, +bsxfun can be useful for coercing them to broadcast. +

+
+
: bsxfun (f, A, B)
+

The binary singleton expansion function performs broadcasting, +that is, it applies a binary function f element-by-element to two +array arguments A and B, and expands as necessary +singleton dimensions in either input argument. +

+

f is a function handle, inline function, or string containing the name +of the function to evaluate. The function f must be capable of +accepting two column-vector arguments of equal length, or one column vector +argument and a scalar. +

+

The dimensions of A and B must be equal or singleton. The +singleton dimensions of the arrays will be expanded to the same +dimensionality as the other array. +

+

See also: arrayfun, cellfun. +

+ + +

Broadcasting is only applied if either of the two broadcasting +conditions hold. As usual, however, broadcasting does not apply when two +dimensions differ and neither is 1: +

+
+
x = [1 2 3
+     4 5 6];
+y = [10 20
+     30 40];
+x + y
+
+ +

This will produce an error about nonconformant arguments. +

+

Besides common arithmetic operations, several functions of two arguments +also broadcast. The full list of functions and operators that broadcast +is +

+
+
      plus      +  .+
+      minus     -  .-
+      times     .*
+      rdivide   ./
+      ldivide   .\
+      power     .^  .**
+      lt        <
+      le        <=
+      eq        ==
+      gt        >
+      ge        >=
+      ne        !=  ~=
+      and       &
+      or        |
+      atan2
+      hypot
+      max
+      min
+      mod
+      rem
+      xor
+
+      +=  -=  .+=  .-=  .*=  ./=  .\=  .^=  .**=  &=  |=
+
+ +

Beware of resorting to broadcasting if a simpler operation will suffice. +For matrices a and b, consider the following: +

+
+
c = sum (permute (a, [1, 3, 2]) .* permute (b, [3, 2, 1]), 3);
+
+ +

This operation broadcasts the two matrices with permuted dimensions +across each other during elementwise multiplication in order to obtain a +larger 3-D array, and this array is then summed along the third dimension. +A moment of thought will prove that this operation is simply the much +faster ordinary matrix multiplication, c = a*b;. +

+

A note on terminology: “broadcasting” is the term popularized by the +Numpy numerical environment in the Python programming language. In other +programming languages and environments, broadcasting may also be known +as binary singleton expansion (BSX, in MATLAB, and the +origin of the name of the bsxfun function), recycling (R +programming language), single-instruction multiple data (SIMD), +or replication. +

+ +

19.2.1 Broadcasting and Legacy Code

+ +

The new broadcasting semantics almost never affect code that worked +in previous versions of Octave. Consequently, all code inherited from +MATLAB that worked in previous versions of Octave should still work +without change in Octave. The only exception is code such as +

+
+
try
+  c = a.*b;
+catch
+  c = a.*a;
+end_try_catch
+
+ +

that may have relied on matrices of different size producing an error. +Because such operation is now valid Octave syntax, this will no longer +produce an error. Instead, the following code should be used: +

+
+
if (isequal (size (a), size (b)))
+  c = a .* b;
+else
+  c = a .* a;
+endif
+
+ + +
+
+


+
+ + + + + diff --git a/ML_Mathematical_Approach/22_programming-ex-1/01__resources.html b/ML_Mathematical_Approach/22_programming-ex-1/01__resources.html new file mode 100644 index 0000000..d5e79e0 --- /dev/null +++ b/ML_Mathematical_Approach/22_programming-ex-1/01__resources.html @@ -0,0 +1,548 @@ + + +

+ Tutorials +

+

+ + Compute Cost Tutorial + +

+

+ This is a step-by-step tutorial for how to complete the computeCost() function portion of ex1. You will still have to do some thinking, because I'll describe the implementation, but you have to turn it into Octave script commands. All the programming exercises in this course follow the same procedure; you are provided a starter code template for a function that you need to complete. You never have to start a new script file from scratch. This is a vectorized implementation. You're only going to write a few simple lines of code. +

+

+ With a text editor (NOT a word processor), open up the computeCost.m file. Scroll down until you find the "====== YOUR CODE HERE =====" section. Below this section is where you're going to add your lines of code. Just skip over the lines that start with the '%' sign - those are instructive comments. +

+

+ We'll write these three lines of code by inspecting the equation on Page 5 of ex1.pdf. The first line of code will compute a vector 'h' containing all of the hypothesis values - one for each training example (i.e. for each row of X). The hypothesis (also called the prediction) is simply the product of X and theta. So your first line of code is... +

+
h = {multiply X and theta, in the proper order that the ....inner dimensions match}
+

+ Since X is size (m x n) and theta is size (n x 1), you arrange the order of operators so the result is size (m x 1). +

+

+ The second line of code will compute the difference between the hypothesis and y - that's the error for each training example. Difference means subtract. +

+
error = {the difference between h and y}
+

+ The third line of code will compute the square of each of those error terms (using element-wise exponentiation), +

+

+ An example of using element-wise exponentiation - try this in your workspace command line so you see how it works. +

+
v = [-2 3]
+
+v_sqr = v.^2
+

+ So, now you should compute the squares of the error terms: +

+
error_sqr = {use what you have learned}
+

+ Next, here's an example of how the sum function works (try this from your command line) +

+
q = sum([1 2 3])
+

+ Now, we'll finish the last two steps all in one line of code. You need to compute the sum of the error_sqr vector, and scale the result (multiply) by 1/(2*m). That completed sum is the cost value J. +

+
J = {multiply 1/(2*m) times the sum of the error_sqr vector}
+

+ That's it. If you run the ex1.m script, you should have the correct value for J. Then you should run one of the unit tests (available in the Forum). +

+

+ Then you can run the submit script, and hopefully it will pass. +

+

+ Be sure that every line of code ends with a semicolon. That will suppress the output of any values to the workspace. Leaving out the semicolons will surely make the grader unhappy. +

+

+ + Gradient Descent Tutorial + + - also applies to gradientDescentMulti() - includes test cases. +

+

+ I use the vectorized method, hopefully you're comfortable with vector math. Using this method means you don't have to fuss with array indices, and your solution will automatically work for any number of features or training examples. +

+

+ What follows is a vectorized implementation of the gradient descent equation on the bottom of Page 5 in ex1.pdf. +

+

+ Reminder that 'm' is the number of training examples (the rows of X), and 'n' is the number of features (the columns of X). 'n' is also the size of the theta vector (n x 1). +

+

+ Perform all of these steps within the provided for-loop from 1 to the number of iterations. Note that the code template provides you this for-loop - you only have to complete the body of the for-loop. The steps below go immediately below where the script template says "======= YOUR CODE HERE ======". +

+

+ 1 - The hypothesis is a vector, formed by multiplying the X matrix and the theta vector. X has size (m x n), and theta is (n x 1), so the product is (m x 1). That's good, because it's the same size as 'y'. Call this hypothesis vector 'h'. +

+

+ 2 - The "errors vector" is the difference between the 'h' vector and the 'y' vector. +

+

+ 3 - The change in theta (the "gradient") is the sum of the product of X and the "errors vector", scaled by alpha and 1/m. Since X is (m x n), and the error vector is (m x 1), and the result you want is the same size as theta (which is (n x 1)), you need to transpose X before you can multiply it by the error vector.

+

+ The vector multiplication automatically includes calculating the sum of the products. +

+

+ When you're scaling by alpha and 1/m, be sure you use enough sets of parenthesis to get the factors correct. +

+

+ 4 - Subtract this "change in theta" from the original value of theta. A line of code like this will do it: +

+
theta = theta - theta_change;
+

+ That's it. Since you're never indexing by m or n, this solution works identically for both gradientDescent() and gradientDescentMulti(). +

+

+ + Feature Normalization Tutorial + +

+

+ There are a couple of methods to accomplish this. The method here is one I use that doesn't rely on automatic broadcasting or the bsxfun() or repmat() functions. +

+

+ You can use the mean() and std() functions to get the mean and standard deviation for each column of X. These are returned as row vectors (1 x n)

+

+ Now you want to apply those values to each element in every row of the X matrix. One way to do this is to duplicate these vectors for each row in X, so they're the same size. +

+

+ One method to do this is to create a column vector of all-ones - size (m x 1) - and multiply it by the mu or sigma row vector (1 x n). Dimensionally, (m x 1) * (1 x n) gives you a (m x n) matrix, and every row of the resulting matrix will be identical. +

+

+ Now that X, mu, and sigma are all the same size, you can use element-wise operators to compute X_normalized. +

+

+ Try these commands in your workspace: +

+
X = [1 2 3; 4 5 6]
+
+% creates a test matrix
+
+mu = mean(X)
+
+% returns a row vector
+
+sigma = std(X)
+
+% returns a row vector
+
+m = size(X, 1)
+
+% returns the number of rows in X
+
+mu_matrix = ones(m, 1) * mu
+
+sigma_matrix = ones(m, 1) * sigma
+

+ Now you can subtract the mu matrix from X, and divide element-wise by the sigma matrix, and arrive at X_normalized. +

+

+ You can do this even easier if you're using a Matlab or Octave version that supports automatic broadcasting - then you can skip the "multiply by a column of 1's" part. +

+

+ You can also use the bsxfun() or repmat() functions. Be advised the bsxfun() has a non-obvious syntax that I can never remember, and repmat() runs rather slowly. +

+

+ Test Cases +

+

+ + computeCost: + +

+

+ >>computeCost( [1 2; 1 3; 1 4; 1 5], [7;6;5;4], [0.1;0.2] ) +

+

+ ans = 11.9450 +

+

+ ----- +

+

+ >>computeCost( [1 2 3; 1 3 4; 1 4 5; 1 5 6], [7;6;5;4], [0.1;0.2;0.3]) +

+

+ ans = 7.0175 +

+

+ ============ +

+

+ + gradientDescent: + +

+

+ Test Case 1: +

+
>>[theta J_hist] = gradientDescent([1 5; 1 2; 1 4; 1 5],[1 6 4 2]',[0 0]',0.01,1000);
+
+% then type in these variable names, to display the final results
+
+>>theta
+
+theta =
+
+5.2148
+
+-0.5733
+
+>>J_hist(1)
+
+ans = 5.9794
+
+>>J_hist(1000)
+
+ans = 0.85426
+

+ For debugging, here are the first few theta values computed in the gradientDescent() for-loop for this test case: +

+
% first iteration
+theta =
+  0.032500
+  0.107500
+% second iteration
+theta =
+  0.060375
+  0.194887
+% third iteration
+theta =
+  0.084476
+  0.265867
+% fourth iteration
+theta =
+  0.10550
+  0.32346
+

+ The values can be inspected by adding the "keyboard" command within your for-loop. This exits the code to the debugger, where you can inspect the values. Use the "return" command to resume execution. +

+

+ Test Case 2: +

+

+ This test case is similar, but uses a non-zero initial theta value. +

+
>> [theta J_hist] = gradientDescent([1 5; 1 2],[1 6]',[.5 .5]',0.1,10);
+>> theta
+theta =   
+1.70986  
+0.19229
+>> J_hist
+J_hist =   
+  5.8853  
+  5.7139  
+  5.5475  
+  5.3861  
+  5.2294  
+  5.0773  
+  4.9295  
+  4.7861  
+  4.6469  
+  4.5117
+

+ + featureNormalize(): + +

+
[Xn mu sigma] = featureNormalize([1 ; 2 ; 3])
+% result
+Xn = 
+  -1  
+  0  
+  1
+mu =  2
+sigma =  1
+[Xn mu sigma] = featureNormalize(magic(3))
+% result
+Xn =   
+  1.13389 -1.00000 0.37796  
+  -0.75593 0.00000 0.75593 
+  -0.37796 1.00000 -1.13389
+mu =   
+  5   5   5
+sigma =   
+  2.6458   4.0000   2.6458
+%--------------
+[Xn mu sigma] = featureNormalize([-ones(1,3); magic(3)])
+% results
+Xn =  
+  -1.21725  -1.01472  -1.21725   
+  1.21725  -0.56373   0.67625 
+  -0.13525   0.33824   0.94675 
+  0.13525   1.24022  -0.40575
+mu =   
+  3.5000   3.5000   3.5000
+sigma = 
+  3.6968   4.4347   3.6968
+

+ + computeCostMulti + +

+
X = [ 2 1 3; 7 1 9; 1 8 1; 3 7 4 ];
+y = [2 ; 5 ; 5 ; 6];
+theta_test = [0.4 ; 0.6 ; 0.8];
+computeCostMulti( X, y, theta_test )
+% result
+ans =  5.2950
+

+ + gradientDescentMulti + +

+
X = [ 2 1 3; 7 1 9; 1 8 1; 3 7 4 ];
+y = [2 ; 5 ; 5 ; 6];
+[theta J_hist] = gradientDescentMulti(X, y, zeros(3,1), 0.01, 100);
+
+% results
+
+>> theta
+theta =
+
+   0.23680
+   0.56524
+   0.31248
+
+>> J_hist(1)
+ans =  2.8299
+
+>> J_hist(end)
+ans =  0.0017196
+

+ + normalEqn + +

+
X = [ 2 1 3; 7 1 9; 1 8 1; 3 7 4 ];
+y = [2 ; 5 ; 5 ; 6];
+theta = normalEqn(X,y)
+
+% results
+theta =
+
+   0.0083857
+   0.5681342
+   0.4863732
+

+ Debugging Tip +

+

+ The submit script, for all the programming assignments, does not report the line number and location of the error when it crashes. The following method can be used to make it do so, which makes debugging easier.

+

+ Open ex1/lib/submitWithConfiguration.m and replace line: +

+
 fprintf('!! Please try again later.\n');
+
+

+ (around line 28) with:

+
fprintf('Error from file:%s\nFunction:%s\nOn line:%d\n', e.stack(1,1).file,e.stack(1,1).name, e.stack(1,1).line );
+
+

+ That top line says '!! Please try again later' on a crash; the replacement line will instead give the file, function, and line number of the error. This change can be applied to all the programming assignments.

+

+ Note for OS X users +

+

+ If you are using OS X and get this error message when you run ex1.m and expect to see a plot figure: +

+
gnuplot> set terminal aqua enhanced title "Figure 1" size 560 420  font "*,6" dashlength 1                     
+                  ^
+     line 0: unknown or ambiguous terminal type; type just 'set terminal' for a list
+
+

+ ... try entering this command in the workspace console to change the terminal type: +

+
setenv("GNUTERM","qt")
+
+

+ How to check format of function arguments +

+

+ You can print a function argument just by typing its name in the body of the function on a distinct line, and then calling submit() in Octave.

+

+ For example, I may print the theta argument in the "Compute cost for one variable" exercise by writing this in my computeCost.m file. Of course, it will fail because 5 is just a random number, but it will show me the value of theta:

+
function J = computeCost(X, y, theta)
+    m = length(y);
+    J = 0
+    theta
+    J = 5  % I have added this line just to show that the argument you want to print doesn't have to be on the last line
+end
+
+

+ Testing matrix operations in Octave +

+

+ In our programming exercises, there are many complex matrix operations where it may not be clear what form the result is in. I find it helpful to create a few basic matrices and vectors to test out my operations. For instance the following commands can be copied to a file to be used at any time for testing an operation. +

+
X = [1 2 3; 1 2 3; 1 2 3; 1 2 3; 1 5 6] % Make sure X has more rows than theta and isn't square
+y = [1; 2; 3; 4; 5]
+theta = [1; 1; 1]
+
+

+ With these basic matrices and vectors you can model most of the programming exercises. If you don't know what form specific operations in the exercises take, you can test it in the Octave shell. +

+

+ One thing that got me was using formulas like theta' * x where x was a single row in X. All the notes show x as being an n x 1 vector, but X(i,:) is a 1 x n vector. Using the terminal, I figured out that I had to transpose x. Testing interactively like this is very helpful.

+

+ Repeating previous operations in Octave +

+

+ When using the great unit tests by Vinh, if your function doesn't work the first time: after you edit and save your function file, then in your Octave window just type ctrl-p to go back to what you typed previously, then enter to run it. (Once you've gone back, you can use ctrl-n for next.) (more info @ + + https://www.gnu.org/software/octave/doc/interpreter/Commands-For-History.html + + )

+

+ Warm up exercise +

+

+ If you type "ex1.m" you will get an error - just use "ex1". Or press 'Run' in the MATLAB editor.

+

+ Compute cost for one variable +

+

+ theta is a matrix of size 2x1; the first row is theta[0] and the second one is theta[1] (following the index convention of the videos here). Also, fill in arbitrary (non-zero) initial values for theta[0] and theta[1].

+

+ Gradient descent for one variable +

+

+ See the 5th segment of Week 1 Video II ("Gradient Descent") for a key tip on simultaneous updates of theta. +

+

+ Feature normalization +

+

+ Use the zscore function to normalize: + + http://www.gnu.org/software/octave/doc/interpreter/Basic-Statistical-Functions.html#XREFzscore + +

+
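+ For example (zscore with three outputs is documented in the Octave manual excerpt included with this exercise):
+
+[Xn, mu, sigma] = zscore(X)   % Xn is X normalized; mu and sigma are row vectors
+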

+ The repmat function can also be used here; see the sketch below.

+
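+ A sketch with repmat, equivalent to the "multiply by a column of 1's" trick described in the tutorial above:
+
+mu = mean(X);
+sigma = std(X);
+m = size(X, 1);
+Xn = (X - repmat(mu, m, 1)) ./ repmat(sigma, m, 1);
+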

+ The bsxfun is helpful for applying a function (limited to two arguments) in an element-wise fashion to rows of a matrix using a vector of source values. This is useful for feature normalization. An example you can enter at the octave command line: +

+
Z=[1 1 1; 2 2 2;];
+v=[1 1 1];
+bsxfun(@minus,Z,v)
+ans =
+    0   0   0
+    1   1   1
+
+

+ In this case, the corresponding elements of v are subtracted from each row of Z. The minus(a,b) function is equivalent to computing (a-b). +

+

+ (other mathematical functions: @plus, @rdivide) +

+

+ In Octave >= 3.0.6 you can use broadcast feature to abbreviate: + + https://www.gnu.org/software/octave/doc/interpreter/Broadcasting.html#Broadcasting + +

+
Z=[1 1 1; 2 2 2;];
+v=[1 1 1];
+Z - v   % or Z .- v
+ans =
+   0   0   0
+   1   1   1
+
+

+ A note regarding Feature Normalization when a feature is a constant: <provided by a ML-005 student> +

+

+ When I used the feature normalization routine we used in class, it did not occur to me that some features of the training examples may have constant values, which means that the sigma vector has zeroes for those features. Thus when I divide by sigma to normalize the matrix, NaNs fill in some slots. This causes gradient descent to get lost wandering through a NaN wasteland, but never reporting why. The fix is easy. In featureNormalize, after sigma is calculated but before the division takes place, insert

+
sigma( sigma == 0 ) = 1;         % to keep away the NaN's and Inf's
+

+ Once this was done, gradient descent ran fine. +

+

+ TA note: for the ML class exercises, you do not need this trick, because the scripts add the column of bias units after the features are normalized. But for your use outside of the class exercises, this may be a useful technique. +

+

+ Gradient descent for multiple variables +

+

+ The lecture notes "Week 2" under section Matrix Notation basically spell out a one-line solution to the problem.

+

+ When predicting prices using theta derived from gradient descent, do not forget to normalize the input x, or you'll get a multimillion-dollar house value (the wrong one).

+

+ Normal Equations +

+

+ I found that the line "data = csvread('ex1data2.txt');" in ex1_multi.m is not needed as we previously load this data via "data = load('ex1data2.txt');" +

+

+ Prior steps normalized X; this line sets X back to the original values. To have theta from gradient descent and theta from the normal equations come out close, run the normal equations using normalized features as well. Therefore, do not reload X.

+

+ Comment: I think the point in reloading is to show that you actually get the same results even without doing anything to the data beforehand. Of course, for this script it's not effective, but in a real application you would use only one of the approaches. Similar considerations would argue against feature normalization. Therefore, do reload X.

+
+ + + diff --git a/ML_Mathematical_Approach/23_programming-ex-2/01__resources.html b/ML_Mathematical_Approach/23_programming-ex-2/01__resources.html new file mode 100644 index 0000000..1024f67 --- /dev/null +++ b/ML_Mathematical_Approach/23_programming-ex-2/01__resources.html @@ -0,0 +1,284 @@ + + +

+ Note for MATLAB users: If you are using MATLAB version R2015a or later, the fminunc() function has been changed in this version. The function works better, but does not give the expected result for Figure 5 in ex2.pdf, and it throws some warning messages (about a local minimum) when you run ex2_reg.m. This is normal, and you should still be able to submit your work to the grader. +

+

+ Typos in the lectures (updated): +

+

+ There are typos in the week 3 lectures, specifically for regularized logistic regression. This could create some confusion while doing the last part of exercise 2. The equations in ex2.pdf are correct.

+

+ Gradient and theta values for ex2.m +

+

+ Here are the values of both cost J and the gradients for the "initial theta (zeros)" test (ex2.pdf Section 1.2.2): +

+
Cost at initial theta (zeros): 0.693147
+Gradient at initial theta (zeros):
+ -0.100000
+ -12.009217
+ -11.262842
+
+

+ Here are the values for both cost J and theta for the "theta found by fminunc" test (ex2.pdf Section 1.2.3): +

+
Cost at theta found by fminunc: 0.203498
+theta:
+ -25.164593
+  0.206261
+  0.201499
+
+

+ mapFeature() discussion: +

+

+ For two features x1 and x2, mapFeature() calculates the following terms: $$\ 1 ,\ x_1 ,\ x_2 ,\ x_1^2 ,\ x_1x_2 ,\ x_2^2 ,\ x_1^3 ,\ x_1^2x_2 ,\ x_1x_2^2 ,\ x_2^3 ,\ x_1^4 ,\ x_1^3x_2 ,\ x_1^2x_2^2 ,\ x_1x_2^3 ,\ x_2^4 ,\ x_1^5 ,\ x_1^4x_2 ,\ x_1^3x_2^2 ,\ x_1^2x_2^3 ,\ x_1x_2^4 ,\ x_2^5 ,\ x_1^6 ,\ x_1^5x_2 ,\ x_1^4x_2^2 ,\ x_1^3x_2^3 ,\ x_1^2x_2^4 ,\ x_1x_2^5 ,\ x_2^6.$$

+
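+ A sketch of the mapping loop (this is essentially what the provided mapFeature.m does for two feature columns X1 and X2):
+
degree = 6;                      % highest polynomial degree; yields the 28 terms above
+out = ones(size(X1(:,1)));       % the constant term
+for i = 1:degree
+    for j = 0:i
+        out(:, end+1) = (X1.^(i-j)) .* (X2.^j);
+    end
+end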

+ Not 100% sure about this, so please take this with a grain of salt. +

+

+ It appears to me that the "mapFeature" vector displayed on page 9 of ex2.pdf is the transpose of what is intended. Also, it would be clearer if each of the variables carried the (i) superscript denoting the trial:

+

+ $$mapFeature(x^{(i)}) = \left[ \begin{array} {c} 1 \\ x_{1}^{(i)} \\ x_{2}^{(i)} \\ \left( x_{1}^{(i)} \right)^{2} \\ x_{1}^{(i)} x_{2}^{(i)} \\ \left( x_{2}^{(i)} \right)^{2} \\ \left( x_{1}^{(i)} \right)^{3} \\ \vdots \\ x_{1}^{(i)} \left( x_{2}^{(i)} \right)^{5} \\ \left( x_{2}^{(i)} \right)^{6} \end{array} \right] ^{T}$$ +

+

+ Of course this assumes exactly two features in the original dataset. I think of this more as "mapTrial" than as "mapFeature" because what we're really doing is mapping the original trials with two features onto a new set of trials with 28 features. +

+

+ I would not have thought twice about this, had I not gulped hard at the imprecise use of the word "dimensions" in the phrase "a 28-dimensional vector" in the text which follows the expression.

+

+ This is how I interpreted it for the homework, and the results were accepted. But if I'm way off base, please delete this wiki entry. +

+

+ =========================================================================== +

+

+ I found this Octave expression quite useful for the regularization programming exercise: +

+
 ones(size(theta)) - eye(size(theta))
+
+
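+ For a column vector theta, that expression builds the mask [0; 1; 1; ...; 1], which zeroes out theta(1) so the bias term is not regularized. A sketch of how it might be used (assuming lambda and m are defined as usual):
+
mask = ones(size(theta)) - eye(size(theta));
+J_reg = (lambda/(2*m)) * sum((mask .* theta).^2);   % cost penalty, theta(1) excluded
+grad_reg = (lambda/m) * (mask .* theta);            % gradient penalty, bias untouched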

+ =========================================================================== +

+

+ I found these other Octave expressions which also are quite useful for the regularization programming exercise: +

+
 theta(2:size(theta))
+ theta(2:end)
+
+

+ =========================================================================== +

+

+ plotData.m - color attributes +

+

+ The plot() attribute "MarkerFaceColor" may not be supported on your version of Octave or MATLAB. You may need to modify it. Use the command "help plot" to see what attributes are supported. (You might just try to replace "MarkerFaceColor" with "MarkerFace"; then the plot should work, although you get a warning.)

+

+ Logistic Regression Gradient +

+

+ [w.r.t.=with respect to] +

+

+ Don't stumble over terminology - "the partial derivatives of the cost w.r.t. each parameter in theta" are: +

+

+ $$\frac{1}{m} X^{T} (g(X \theta ) - \vec{y})$$

+

+ I was confused about this and kept trying to return the updated theta values . . . +

+

+ UPDATE (the above was really helpful, thank you for putting it here). As an additional hint: the instructions say "[...] the gradient of the cost with respect to the parameters" - you're only asked for the gradient, so don't overdo it (see above). The fact that you're not given alpha should be a hint in itself: you don't need it, and you won't be iterating either.

+
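+ Put together, the whole gradient is one line; a minimal sketch (assuming X includes the bias column):
+
grad = (1/m) * X' * (sigmoid(X*theta) - y);   % same size as theta; no alpha, no loop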

+ Sigmoid function +

+

+ 1) The sigmoid function accepts only one parameter, named 'z'. This variable 'z' can represent a scalar, vector, or matrix. No other variable names should appear in the sigmoid() function.

+

+ 2) The implementation of the sigmoid function should use only element-wise operators. The operators needed are addition, element-wise division (the './' operator), and the exp() function. +

+
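+ A minimal sketch that satisfies both points:
+
function g = sigmoid(z)
+% z may be a scalar, vector, or matrix; every operation below is element-wise
+g = 1 ./ (1 + exp(-z));
+end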

+ Decision Boundary +

+

+ Thoughts regarding why the equation $$\theta_{1} + \theta_{2}x_{2} + \theta_{3}x_{3}$$ is set equal to 0 for determining a decision boundary:

+

+ In this exercise, we're solving a classification problem using logistic regression.

+ +
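+ The hypothesis predicts y = 1 whenever $g(\theta^{T}x) \geq 0.5$, and since $g(z) \geq 0.5$ exactly when $z \geq 0$, the boundary between the two predicted classes is the set of points where $\theta^{T}x = 0$. Solving that equation for the last feature gives a line you can plot; a sketch in the spirit of the provided plotDecisionBoundary.m:
+
plot_x = [min(X(:,2))-2, max(X(:,2))+2];                   % two x-axis endpoints
+plot_y = (-1/theta(3)) .* (theta(2).*plot_x + theta(1));   % solve theta'*x = 0 for x3
+plot(plot_x, plot_y);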

+ Lambda's effect on the Decision Boundary

+

+

+ +

+

+
+ + + diff --git a/ML_Mathematical_Approach/24_programming-ex-3/01__resources.html b/ML_Mathematical_Approach/24_programming-ex-3/01__resources.html new file mode 100644 index 0000000..55e8674 --- /dev/null +++ b/ML_Mathematical_Approach/24_programming-ex-3/01__resources.html @@ -0,0 +1,140 @@ + + +

+ ML:Programming Exercise 3:Multi-class Classification and Neural Networks +

+

+ Debugging Tip +

+

+ The submit script, for all the programming assignments, does not report the line number and location of the error when it crashes. The following method can be used to make it do so, which makes debugging easier.

+

+ Open ex3/lib/submitWithConfiguration.m and replace line: +

+
 fprintf('!! Please try again later.\n');
+
+

+ (around line 28) with:

+
fprintf('Error from file:%s\nFunction:%s\nOn line:%d\n', e.stack(1,1).file,e.stack(1,1).name, e.stack(1,1).line );
+
+

+ The top line prints '!! Please try again later' on a crash; with the replacement, the script will instead report the file, function name, and line number of the error. This change can be applied to all the programming assignments.

+

+ 1.4.1 One-vs-all Prediction +

+

+ The pdf says you should get 94.9% training accuracy. You may get a slightly different value depending on how you implement your code.

+

+ + "The result you will get may differ a little bit based on how you implement your code. Sometimes, although mathematically two expressions are the same, Matlab may compute them differently. For example, expressions X'*(sigmoid(X*theta)-y) and sum((sigmoid(X*theta)-y)*ones(1,size(X,2)).*X) are the same mathematically; however, Matlab does not compute them the same numerically. I tried to use the same input for these two expressions and Matlab gave me a difference about 2*10^(-10) in 1 norm. Therefore, when you use different expressions to compute the gradient and then use fmincg to learn the parameters, your result may be a little different. Actually, when I used the first expression, I got the accruacy 95.14% and when I used the second one, I got 94.94%. They should be both correct in this sense." + + + -Posted by guoxian (Student) + +

+

+ Use the submit feature to find out if you are correct even if you get a different answer for training accuracy. +

+

+ + 2.2 Feedforward Propagation and Prediction (Neural network) + +

+

+ It wasn't clear to me whether, when computing the hidden layer, you only need to compute $$g(z^1)$$, or whether you should transform it to binary values (set the value to 1 for g>0.5 and to 0 for g<0.5), like we learned in logistic regression. Both solutions give almost the same results in the final predictions. From the "submit" feature it is clear that you shouldn't transform the values to binary values. -Posted by inna (Student)

+
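+ A minimal sketch of that feedforward prediction (assuming m = size(X, 1) and the Theta1/Theta2 loaded from ex3weights.mat):
+
a1 = [ones(m,1) X];        % input layer plus bias units
+a2 = sigmoid(a1 * Theta1');
+a2 = [ones(m,1) a2];       % hidden layer plus bias units
+a3 = sigmoid(a2 * Theta2');
+[~, p] = max(a3, [], 2);   % no thresholding; just take the most active output unit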

+ + Prediction of an image outside the dataset (Neural Network) + +

+

+ To test the prediction with images outside the dataset, below is a code that I wrote to import the image and use the prediction. +

+
function p = predictImg(Theta1, Theta2, Img)
+X = imread(Img);             % read the image (e.g. a 20x20, 24-bit .bmp)
+
+X = double(X);               % convert to double
+temp = X;                    % keep a copy for later use
+
+X = (X - 128) ./ 255;        % normalize the features
+X = X .* (temp > 0);         % restore the original 0 values in X
+X = reshape(X, 1, numel(X)); % convert the 20x20 matrix into a 1x400 row vector
+
+displayData(X);              % display the imported image
+
+p = predict(Theta1, Theta2, X);  % call the neural network prediction method
+
+

+ Usage: +

+
p = predictImg(Theta1, Theta2, '1.bmp');
+
+

+ Note: because this function uses the Theta1 and Theta2 created by ex3_nn, run ex3_nn before the first use of this function.

+

+ + -Posted by Vítor Albiero (Student) + +

+
+ + + diff --git a/ML_Mathematical_Approach/25_programming-ex-4/01__Broadcasting.html b/ML_Mathematical_Approach/25_programming-ex-4/01__Broadcasting.html new file mode 100644 index 0000000..55976f5 --- /dev/null +++ b/ML_Mathematical_Approach/25_programming-ex-4/01__Broadcasting.html @@ -0,0 +1,278 @@ + + + + +GNU Octave: Broadcasting + + + + + + + + + + + + + + + + + + + + + +
+


+
+
+ +

19.2 Broadcasting

+ + + + + + +

Broadcasting refers to how Octave binary operators and functions behave +when their matrix or array operands or arguments differ in size. Since +version 3.6.0, Octave now automatically broadcasts vectors, matrices, +and arrays when using elementwise binary operators and functions. +Broadly speaking, smaller arrays are “broadcast” across the larger +one, until they have a compatible shape. The rule is that corresponding +array dimensions must either +

+
    +
  1. be equal, or + +
  2. one of them must be 1. +
+ +

In case all dimensions are equal, no broadcasting occurs and ordinary +element-by-element arithmetic takes place. For arrays of higher +dimensions, if the number of dimensions isn’t the same, then missing +trailing dimensions are treated as 1. When one of the dimensions is 1, +the array with that singleton dimension gets copied along that dimension +until it matches the dimension of the other array. For example, consider +

+
+
x = [1 2 3;
+     4 5 6;
+     7 8 9];
+
+y = [10 20 30];
+
+x + y
+
+ +

Without broadcasting, x + y would be an error because the dimensions +do not agree. However, with broadcasting it is as if the following +operation were performed: +

+
+
x = [1 2 3
+     4 5 6
+     7 8 9];
+
+y = [10 20 30
+     10 20 30
+     10 20 30];
+
+x + y
+⇒    11   22   33
+      14   25   36
+      17   28   39
+
+ +

That is, the smaller array of size [1 3] gets copied along the +singleton dimension (the number of rows) until it is [3 3]. No +actual copying takes place, however. The internal implementation reuses +elements along the necessary dimension in order to achieve the desired +effect without copying in memory. +

+

Both arrays can be broadcast across each other, for example, all +pairwise differences of the elements of a vector with itself: +

+
+
y - y'
+⇒    0   10   20
+    -10    0   10
+    -20  -10    0
+
+ +

Here the vectors of size [1 3] and [3 1] both get +broadcast into matrices of size [3 3] before ordinary matrix +subtraction takes place. +

+

A special case of broadcasting that may be familiar is when all +dimensions of the array being broadcast are 1, i.e., the array is a +scalar. Thus for example, operations like x - 42 and max +(x, 2) are basic examples of broadcasting. +

+

For a higher-dimensional example, suppose img is an RGB image of +size [m n 3] and we wish to multiply each color by a different +scalar. The following code accomplishes this with broadcasting, +

+
+
img .*= permute ([0.8, 0.9, 1.2], [1, 3, 2]);
+
+ +

Note the usage of permute to match the dimensions of the +[0.8, 0.9, 1.2] vector with img. +

+

For functions that are not written with broadcasting semantics, +bsxfun can be useful for coercing them to broadcast. +

+
+
Built-in Function: bsxfun (f, A, B)
+

The binary singleton expansion function performs broadcasting, +that is, it applies a binary function f element-by-element to two +array arguments A and B, and expands as necessary +singleton dimensions in either input argument. +

+

f is a function handle, inline function, or string containing the name +of the function to evaluate. The function f must be capable of +accepting two column-vector arguments of equal length, or one column vector +argument and a scalar. +

+

The dimensions of A and B must be equal or singleton. The +singleton dimensions of the arrays will be expanded to the same +dimensionality as the other array. +

+

See also: arrayfun, cellfun. +

+ + +

Broadcasting is only applied if either of the two broadcasting +conditions hold. As usual, however, broadcasting does not apply when two +dimensions differ and neither is 1: +

+
+
x = [1 2 3
+     4 5 6];
+y = [10 20
+     30 40];
+x + y
+
+ +

This will produce an error about nonconformant arguments. +

+

Besides common arithmetic operations, several functions of two arguments +also broadcast. The full list of functions and operators that broadcast +is +

+
+
      plus      +  .+
+      minus     -  .-
+      times     .*
+      rdivide   ./
+      ldivide   .\
+      power     .^  .**
+      lt        <
+      le        <=
+      eq        ==
+      gt        >
+      ge        >=
+      ne        !=  ~=
+      and       &
+      or        |
+      atan2
+      hypot
+      max
+      min
+      mod
+      rem
+      xor
+
+      +=  -=  .+=  .-=  .*=  ./=  .\=  .^=  .**=  &=  |=
+
+ +

Beware of resorting to broadcasting if a simpler operation will suffice. +For matrices a and b, consider the following: +

+
+
c = sum (permute (a, [1, 3, 2]) .* permute (b, [3, 2, 1]), 3);
+
+ +

This operation broadcasts the two matrices with permuted dimensions +across each other during elementwise multiplication in order to obtain a +larger 3-D array, and this array is then summed along the third dimension. +A moment of thought will prove that this operation is simply the much +faster ordinary matrix multiplication, c = a*b;. +

+

A note on terminology: “broadcasting” is the term popularized by the +Numpy numerical environment in the Python programming language. In other +programming languages and environments, broadcasting may also be known +as binary singleton expansion (BSX, in MATLAB, and the +origin of the name of the bsxfun function), recycling (R +programming language), single-instruction multiple data (SIMD), +or replication. +

+ +

19.2.1 Broadcasting and Legacy Code

+ +

The new broadcasting semantics almost never affect code that worked +in previous versions of Octave. Consequently, all code inherited from +MATLAB that worked in previous versions of Octave should still work +without change in Octave. The only exception is code such as +

+
+
try
+  c = a.*b;
+catch
+  c = a.*a;
+end_try_catch
+
+ +

that may have relied on matrices of different size producing an error. +Because such operation is now valid Octave syntax, this will no longer +produce an error. Instead, the following code should be used: +

+
+
if (isequal (size (a), size (b)))
+  c = a .* b;
+else
+  c = a .* a;
+endif
+
+ + +
+
+


+
+ + + + + diff --git a/ML_Mathematical_Approach/25_programming-ex-4/01__resources.html b/ML_Mathematical_Approach/25_programming-ex-4/01__resources.html new file mode 100644 index 0000000..96f690e --- /dev/null +++ b/ML_Mathematical_Approach/25_programming-ex-4/01__resources.html @@ -0,0 +1,510 @@ + + +

+ ML:Programming Exercise 4:Neural Networks Learning +

+

+ This is the toughest exercise so far, mainly because you have to implement a series of steps, each subject to error, before you get any feedback. These techniques may help: +

+

+ See the tutorial below (developed for the Spring 2014 session). +

+

+ Use the command line. The command line is your friend. Run enough of ex4.m to initialize X, y, Theta1, and Theta2, then work one statement or operation at a time to get the results you want. When you get a statement working, transfer it to nnCostFunction--and save the file. +

+

+ Use dimensions. Use size() to check the dimensions of vectors and matrices to determine the order of multiplication and whether a transpose is needed. This is especially valuable for the gradients. Keep in mind that the gradient matrices are the same size as Theta1 and Theta2. Also note that you will need to do some things that may seem counter-intuitive, like multiplying an m x 1 vector by a 1 x n vector to get an m x n matrix.

+

+ You may find it helpful to note the dimensions of each matrix in a comment on the line of code, as you define it and use it, e.g.: +

+
Theta1 = reshape(.....)   % (nhn x (n+1))  
+a = b * c  % dimcheck: (nhn x (n+1))  = (nhn x m) * (m x (n+1))  
+ +

+ If you want to get rid of the loop over the training samples in the backpropagation algorithm, you face the problem of creating a logical matrix from y covering all training examples. Some smart guy from the spring 2013 instance of this course came up with the following elegant solution for this task:

+
yv=[1:num_labels] == y
+
+

+ (This does not work in Octave 3.2.4 or 3.4, which lack automatic broadcasting; I use 3.6.4, where it works.)

+

+ After getting this, it was pretty straightforward to vectorize the loop. I could transform each line from my for-loop 1:1 to the vectorized code. +

+

+ Note, the above expression relies on the broadcasting feature of Octave: see + + http://www.gnu.org/software/octave/doc/interpreter/Broadcasting.html + + . +

+

+ A call to bsxfun is an equivalent solution that applies the broadcast explicitly:

+
yv = bsxfun(@eq, y, 1:num_labels);
+
+

+ A different, slower solution (this loop alone took about half as long as my entire vectorized solution on a Mac laptop):

+
yv = zeros(m, num_labels);
+for i = 1:m
+  yv(i, y(i)) = 1;
+end
+
+

+ Using vectorization speeds up the code considerably. +

+

+ Another method for generating the y matrix, this time looping over the labels: +

+
y_matrix = [];   % start with an empty matrix
+for i = 1:num_labels
+    y_matrix = [y_matrix, y == i];   % append one column per label
+end
+
+

+ Another vectorized one-line method (using vectorized indexing of an eye matrix)- Spring 2014 session: +

+
y_matrix = eye(num_labels)(y,:);  % works for Octave
+...or
+all_combos = eye(num_labels);    
+y_matrix = all_combos(y,:)        % works for Matlab
+
+

+ This method uses an indexing trick to vectorize the creation of 'y_matrix', where each element of 'y' selects the matching one-hot row of an identity matrix.

+

+ + FYI: Misleading Formula in Ex4 pdf for regularization term of cost + +

+

+ The summation indexes for Theta1 and Theta2 should be from 2 to 401 and 2 to 26, respectively, so that the bias columns are excluded.

+

+ Tutorial for Ex.4 Forward and Backpropagation (Spring 2014 session) +

+

+ This tutorial outlines the process of accomplishing the goals for Programming Exercise 4. The purpose is to create a collection of all the useful yet scattered and obscure knowledge that otherwise would require hours of frustrating searches. This tutorial is targeted solely at vectorized implementations. If you're a looper, you're doing it the hard way, and you're on your own. I'll use the less-than-helpful Greek letters and math notation from the video lectures, though I'll start off with a glossary so we can agree on what they are. I will also suggest some common variable names, so students can more easily get help on the Forum. It is left to the reader to convert these lines into program statements. You will need to determine the correct order and transpositions for each matrix multiplication. Most of this material appears in either the video lectures, slides, course wiki, or the ex4.pdf file, though nowhere else does it all appear in one place.
+
+ Glossary: each of these variables will have a subscript, noting which NN layer it is associated with.
+
+ Θ: a matrix of weights to compute the inner values of the neural network. When we used single-vector theta values, it was noted with the lower-case character θ.
+ z: the result of multiplying a data vector with a Θ matrix. A typical variable name would be "z2".
+ a: the "activation" output from a neural layer. This is always generated using the sigmoid function g() on a z value. A typical variable name would be "a2".
+ δ: lower-case delta is used for the "error" term in each layer. A typical variable name would be "d2".
+ Δ: upper-case delta is used to hold the sum of the product of a δ value with the previous layer's a value. In the vectorized solution, these sums are calculated automatically through the magic of matrix algebra. A typical variable name would be "Delta2".
+ Θ gradient: this is the thing we're looking for, the partial derivative of the cost w.r.t. theta. There is one of these variables associated with each Δ. These values are returned by nnCostFunction(), so the variable names must be "Theta1_grad" and "Theta2_grad".
+ g(): the sigmoid function.
+ g′(): the sigmoid gradient function.
+
+ Tip: one handy method for ignoring a column of bias units is the notation "SomeMatrix(:,2:end)". This selects all of the rows of a matrix, and omits the entire first column.
+
+ Here we go. Nearly all of the editing in this exercise happens in nnCostFunction.m. Let's get started.

+

+ A note regarding the sizes of these data objects: see the Appendix at the bottom of the tutorial for information on the sizes of the data objects.
+
+ A note regarding bias units, regularization, and back-propagation: there are two methods for handling the bias units in the back-propagation and gradient calculations. I've described only one of them here; it's the one that I understood the best. Both methods work; choose the one that makes sense to you and avoids dimension errors. It matters not a whit whether the bias unit is dropped before or after it is calculated - both methods give the same results, though the order of operations and transpositions required may be different. Those with contrary opinions are welcome to write their own tutorial.
+
+ Forward Propagation: we'll start by outlining the forward propagation process. Though this was already accomplished once during Exercise 3, you'll need to duplicate some of that work because computing the gradients requires some of the intermediate results from forward propagation.

+

+ Step 1 - Expand the 'y' output values into a matrix of single values (see ex4.pdf Page 5). This is most easily done using an eye() matrix of size num_labels, with vectorized indexing by 'y', as in "eye(num_labels)(y,:)". Discussions of this and other methods are available in the Course Wiki - Programming Exercises section. A typical variable name would be "y_matrix". +

+

+ Step 2 - Perform the forward propagation:
+ a1 equals the X input matrix with a column of 1's added (bias units)
+ z2 equals the product of a1 and Θ1
+ a2 is the result of passing z2 through g()
+ a2 then has a column of 1's added (bias units)
+ z3 equals the product of a2 and Θ2
+ a3 is the result of passing z3 through g()

+
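+ A sketch of Step 2 (sizes in the comments are for the full character-recognition data; see the Appendix):
+
a1 = [ones(m,1) X];             % 5000 x 401
+z2 = a1 * Theta1';              % 5000 x 25
+a2 = [ones(m,1) sigmoid(z2)];   % 5000 x 26
+z3 = a2 * Theta2';              % 5000 x 10
+a3 = sigmoid(z3);               % 5000 x 10
+
+ Cost Function, non-regularized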

+ Step 3 - Compute the unregularized cost according to ex4.pdf (top of Page 5), using a3, your y_matrix, and m (the number of training examples). (I had a hard time understanding this equation, mainly because of a misconception that $y^{(i)}_k$ is a vector; it is actually just a single number.) The cost should be a scalar value. If you get a vector of cost values, you can sum that vector to get the cost. Remember to use element-wise multiplication with the log() function. Now you can run ex4.m to check that the unregularized cost is correct; then you can submit Part 1 to the grader.

+
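+ A sketch of the unregularized cost (element-wise products, then summed over both examples and labels):
+
J = (1/m) * sum(sum(-y_matrix .* log(a3) - (1 - y_matrix) .* log(1 - a3)));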

+ + Cost Regularization + +

+

+ Step 4 - Compute the regularized component of the cost according to ex4.pdf Page 6, using Θ1 and Θ2 (ignoring the columns of bias units), along with λ and m. The easiest method is to compute the regularization terms separately, then add them to the unregularized cost from Step 3. You can run ex4.m to check the regularized cost; then you can submit Part 2 to the grader.

+
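+ A sketch of Step 4, with the bias columns excluded via (:,2:end):
+
reg = (lambda/(2*m)) * (sum(sum(Theta1(:,2:end).^2)) + sum(sum(Theta2(:,2:end).^2)));
+J = J + reg;
+
+ Sigmoid Gradient and Random Initialization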

+ Step 5 - You'll need to prepare the sigmoid gradient function g′(), as shown in ex4.pdf Page 7. You can then submit Part 3 to the grader.

+
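+ A one-line sketch of sigmoidGradient():
+
g = sigmoid(z) .* (1 - sigmoid(z));   % works element-wise for scalars, vectors, matrices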

+ Step 6 - Implement the random initialization function as instructed in ex4.pdf, top of Page 8. You do not submit this function to the grader.

+
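+ A sketch of the random-initialization function (epsilon_init = 0.12 is the value ex4.pdf suggests for this network):
+
epsilon_init = 0.12;
+W = rand(L_out, 1 + L_in) * 2 * epsilon_init - epsilon_init;
+
+ Backpropagation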

+ Step 7 - Now we work from the output layer back to the hidden layer, calculating how bad the errors are. See ex4.pdf Page 9 for reference.
+ δ3 equals the difference between a3 and the y_matrix.
+ δ2 equals the product of δ3 and Θ2 (ignoring the Θ2 bias units), then multiplied element-wise by the g′() of z2 (computed back in Step 2).
+ Note that at this point, the instructions in ex4.pdf are specific to looping implementations, so the notation there is different.
+ Δ2 equals the product of δ3 and a2. This step calculates the product and sum of the errors.
+ Δ1 equals the product of δ2 and a1. This step calculates the product and sum of the errors.

+
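+ A sketch of Step 7 (the bias column of Theta2 is dropped before the multiplication, matching the method described above):
+
d3 = a3 - y_matrix;                                  % 5000 x 10
+d2 = (d3 * Theta2(:,2:end)) .* sigmoidGradient(z2);  % 5000 x 25
+Delta1 = d2' * a1;                                   % 25 x 401
+Delta2 = d3' * a2;                                   % 10 x 26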

+ + Gradient, non-regularized + +

+

+ Step 8 - Now we calculate the non-regularized theta gradients, using the sums of the errors we just computed (see ex4.pdf, bottom of Page 11).
+ Θ1 gradient equals Δ1 scaled by 1/m.
+ Θ2 gradient equals Δ2 scaled by 1/m.
+ The ex4.m script will also perform gradient checking for you, using a smaller test case than the full character classification example. So if you're debugging your nnCostFunction() using the "keyboard" command during this, you'll suddenly see some much smaller sizes of X and the Θ values. Do not be alarmed. If the feedback provided by ex4.m for gradient checking seems OK, you can now submit Part 4 to the grader.

+
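+ A sketch of Step 8:
+
Theta1_grad = (1/m) * Delta1;   % 25 x 401, same size as Theta1
+Theta2_grad = (1/m) * Delta2;   % 10 x 26, same size as Theta2
+
+ Gradient Regularization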

+ Step 9 - For reference, see ex4.pdf, top of Page 12, for the right-most terms of the equation for j >= 1. Now we calculate the regularization terms for the theta gradients. The goal is that regularization of the gradient should not change the theta gradient(:,1) values (for the bias units) calculated in Step 8. There are several ways to implement this (in Steps 9a and 9b).
+ Method 1: 9a) calculate the regularization for indexes (:,2:end), and 9b) add it to the theta gradients (:,2:end).
+ Method 2: 9a) calculate the regularization for the entire theta gradient, then overwrite the (:,1) value with 0 before 9b) adding it to the entire matrix.
+ Details for Steps 9a and 9b:
+ 9a) Pick a method, and calculate the regularization terms as (λ/m)∗Θ1 and (λ/m)∗Θ2 (using either Method 1 or Method 2).
+ 9b) Add these regularization terms to the appropriate Θ1 gradient and Θ2 gradient terms from Step 8 (using either Method 1 or Method 2). Avoid modifying the bias unit of the theta gradients.
+ Note: there is an erratum in the lecture video and slides regarding some missing parentheses for this calculation. The ex4.pdf file is correct.
+ The ex4.m script will provide feedback regarding the acceptable relative difference. If all seems well, you can submit Part 5 to the grader. Now pat yourself on the back.

+
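+ A sketch of Steps 9a and 9b using Method 2 (zero the bias column of the penalty, then add the whole matrix):
+
reg1 = (lambda/m) * Theta1;   reg1(:,1) = 0;
+reg2 = (lambda/m) * Theta2;   reg2(:,1) = 0;
+Theta1_grad = Theta1_grad + reg1;
+Theta2_grad = Theta2_grad + reg2;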

+ + Appendix: + +

+

+ Here are the sizes for the character recognition example, using the method described in this tutorial:
+ a1: 5000x401
+ z2: 5000x25
+ a2: 5000x26
+ z3: 5000x10
+ a3: 5000x10
+ d3: 5000x10
+ d2: 5000x25
+ Theta1, Delta1 and Theta1_grad: 25x401
+ Theta2, Delta2 and Theta2_grad: 10x26
+ Note that the ex4.m script uses several test cases of different sizes, and the submit grader uses yet another different test case.

+

+ Debugging Tip +

+

+ The submit script, for all the programming assignments, does not report the line number and location of the error when it crashes. The following method can be used to make it do so, which makes debugging easier.

+

+ Open ex4/lib/submitWithConfiguration.m and replace line: +

+
 fprintf('!! Please try again later.\n');
+
+

+ (around 28) with: +

+
fprintf('Error from file:%s\nFunction:%s\nOn line:%d\n', e.stack(1,1).file,e.stack(1,1).name, e.stack(1,1).line );
+
+

+ The top line prints '!! Please try again later' on a crash; with the replacement, the script will instead report the file, function name, and line number of the error. This change can be applied to all the programming assignments.

+

+ Tips for classifying your own images: +

+

+ There's no documentation on how the images were prepared for this course. These tips may be helpful. +

+ +

+ Bonus: Neural Network does not need order in pixels of an image as humans do +

+

+ The pixel order (as a human sees it) is not necessary (or relevant) for a Neural Network.

+

+ You can test it with a modified ex3.m program below (you can call it ex3_rand.m) +

+

+ The program adds a pixel-randomization step that "scrambles" the 400 vector positions BEFORE the training. As long as you keep the same pixel positions when predicting, the results are the same.

+

+ It is interesting to "see" how prediction works perfectly on a scrambled picture!

+

+ You can test it once you have successfully submitted the ex3.m program (meaning that you have the oneVsAll function working first).

+

+ ex3_rand.m is a modified version of ex3.m +

+
% ex3_rand.m (is a modified version of ex3.m to scramble pixels/features)
+%
+%% Machine Learning Online Class - Exercise 3 | Randomize Features
+
+%% Initialization
+clear; close all; clc
+
+%% Setup the parameters you will use for this part of the exercise
+input_layer_size  = 400; % 20x20 Input Images of Digits
+num_labels = 10;         % 10 labels, from 1 to 10   
+                         % (note that we have mapped "0" to label 10)
+
+%% =========== Part 1: Loading and Visualizing Data =============
+%  We start the exercise by first loading and visualizing the dataset. 
+%  You will be working with a dataset that contains handwritten digits.
+%
+
+% Load Training Data
+fprintf('Loading and Visualizing Data ...\n')
+
+load('ex3data1.mat'); % training data stored in arrays X, y
+m = size(X, 1);
+
+% Randomly select 100 data points to display
+rand_indices = randperm(m, 100);
+sel = X(rand_indices,:);
+
+displayData(sel);
+
+fprintf('Program paused. Press enter to continue.\n');
+pause;
+
+%% ============ Part 2: Vectorize Logistic Regression ============
+%  In this part of the exercise, you will reuse your logistic regression
+%  code from the last exercise. Your task here is to make sure that your
+%  regularized logistic regression implementation is vectorized. After
+%  that, you will implement one-vs-all classification for the handwritten
+%  digit dataset.
+%
+
+% Added to randomize the feature order (to prove that it is irrelevant)
+fprintf('\nRandomizing columns...\n');
+X_rand = X(:, randperm(size(X,2)));
+
+fprintf('\nTraining One-vs-All Logistic Regression...\n')
+
+lambda = 0.1;
+[all_theta] = oneVsAll(X_rand, y, num_labels, lambda);
+
+fprintf('Program paused. Press enter to continue.\n');
+pause;
+
+
+%% ================ Part 3: Predict for One-Vs-All ================
+%  After ...
+pred = predictOneVsAll(all_theta, X_rand);
+
+fprintf('\nTraining Set Accuracy:%f\n', mean(double(pred == y)) * 100);
+
+%% ============ Part 4: Predict Random Samples ============
+%  To give you an idea of the network's output, you can also run
+%  through the examples one at a time to see what it is predicting.
+
+%  Randomly permute examples
+rp = randperm(m);
+
+for i = 1:m
+   % Display 
+    fprintf('\nDisplaying Example Randomized Image\n');
+    displayData(X_rand(rp(i),:));
+
+    pred = predictOneVsAll(all_theta, X_rand(rp(i),:));
+    fprintf('\nNeural Network Prediction:%d (label%d)\n', pred, y(rp(i)));
+
+   % Pause
+    fprintf('Program paused. Press enter to continue.\n');
+    pause;
+end
+
+

+ Why the order is irrelevant for the Neural Network

+

+ You can see that the order of the pixels is irrelevant as long as you are consistent in two ways: +

+
+ 1. Between samples. Each feature should mean the same pixel. You cannot change the pixel location for one sample and not for the others. You can scramble them, but you have to keep the "scrambling" fixed across all samples.
+
+ 2. Between labels. Each label should represent the same digit for its group of samples. Meaning, a digit four is a four for all of the samples you labeled as four, and you cannot change it. It does not matter if the pixels are "scrambled"; it is still a four.

+ Equivalent example of order irrelevancy +

+

+ An equivalent example is the order of variable names when solving a system of equations. It does not matter what you call a variable, or in which order, as long as you are consistent throughout the solution.

+

+ For example, this: +

+

+ $$3x_1 + 4x_2 = 26$$ +

+

+ $$2x_1 -3x_2 = -11$$ +

+

+ Solution: $$x_1 = 2;\quad x_2=5$$ +

+

+ ...is equivalent to: +

+

+ $$3x_2 + 4x_1 = 26$$ +

+

+ $$2x_2 - 3x_1 = -11$$ +

+

+ Solution: $$x_2 = 2;\quad x_1=5$$ +

+

+ ...also you can "scramble" the terms and "labels" +

+

+ $$-3x_1 + 2x_2 = -11$$ +

+

+ $$4x_1 + 3x_2 = 26$$ +

+

+ Solution: $$x_1 = 5;\quad x_2 = 2$$ +

+

+ It has to do with convention. Any convention as long as it is the same all the way through. +

+
+ + + diff --git a/ML_Mathematical_Approach/26_programming-ex-5/01__resources.html b/ML_Mathematical_Approach/26_programming-ex-5/01__resources.html new file mode 100644 index 0000000..3499406 --- /dev/null +++ b/ML_Mathematical_Approach/26_programming-ex-5/01__resources.html @@ -0,0 +1,111 @@ + + +

+ ML:Programming Exercise 5:Regularized Linear Regression and Bias vs Variance +

+

+ Proposed erratum: the Optional exercise (Section 3.5) instructs you to select i examples from the cross-validation set. Shouldn't you always validate on the full cross-validation set as in section 2.1? +

+

+ Other miscellany: +

+ +

+ Debugging Tip +

+

+ The submit script, for all the programming assignments, does not report the line number and location of the error when it crashes. The following method can be used to make it do so, which makes debugging easier.

+

+ Open ex5/lib/submitWithConfiguration.m and replace line: +

+
 fprintf('!! Please try again later.\n');
+
+

+ (around 28) with: +

+
fprintf('Error from file:%s\nFunction:%s\nOn line:%d\n', e.stack(1,1).file,e.stack(1,1).name, e.stack(1,1).line );
+
+

+ The top line prints '!! Please try again later' on a crash; with the replacement, the script will instead report the file, function name, and line number of the error. This change can be applied to all the programming assignments.

+
+ + + diff --git a/ML_Mathematical_Approach/27_programming-ex-6/01__DocumentPage.php b/ML_Mathematical_Approach/27_programming-ex-6/01__DocumentPage.php new file mode 100644 index 0000000..928e6ca --- /dev/null +++ b/ML_Mathematical_Approach/27_programming-ex-6/01__DocumentPage.php @@ -0,0 +1,503 @@ + + + + + + + + Machine Learning + + + + + + + + + + + + + + +

Machine Learning

+

Andrew Ng

+

+ +

+Exercise 7: SVM Linear Classification + +

+This exercise gives you practice with using SVMs for linear classification. +You will use a free SVM software package called + +LIBSVM that interfaces to MATLAB/Octave. To begin, download the + + +LIBSVM Matlab Interface (choose the package with the description "a simple MATLAB interface") and +unzip the contents to any convenient location on your computer. + +

+Then, download the data for this exercise: +ex7Data.zip. + +

+ +

+
+ +

+Installing LIBSVM + +

+After you've downloaded the + + +LIBSVM Matlab Interface, +follow the instructions in the package's README file + to build LIBSVM from its source code. Instructions are provided for both +Matlab and Octave on Unix and Windows systems. + +

+If you've built LIBSVM successfully, you should see 4 files with the suffix "mexglx" + ("mexw32" on Windows). These are the binaries that you will run from MATLAB/Octave, and you +need to make them visible to your working directory for this exercise. +This can be done in any of the following 3 ways: + +

+(1). Creating links to the binaries from your working directory + +

+(2). Adding the location of the binaries to the Matlab/Octave path + +

+(3). Copying the binaries to your working directory. + +

+Linear classification + +

+Recall from the video lectures that SVM classification solves the following +optimization problem: + +

+

+
+$$\min_{w,b}\qquad \left\Vert w\right\Vert^{2} + C\sum_{i=1}^{m}\xi_{i}$$
+
+

+ +

+

+
+$$\mbox{subject to}\qquad y^{(i)}(w^{T}x^{(i)}+b) \geq 1-\xi_{i}, \qquad \xi_{i} \geq 0, \qquad i=1,2,\ldots,m$$
+

+

+ +

+After solving, the SVM classifier predicts "1" +if +$w^T x + b \geq 0$ and "-1" otherwise. The decision boundary is given by the +line $w^T x + b = 0$. + +

+2-Dimensional classification problem + +

+Let's first consider a classification problem with two features. +Load the "twofeature.txt" data file into Matlab/Octave with the following +command: + +

+

+[trainlabels, trainfeatures] = libsvmread('twofeature.txt');
+
+ +

+Note that this file is formatted for LIBSVM, so loading it with the +usual Matlab/Octave commands would not work. + +

+After loading, the "trainlabels" vector should contain the classification +labels for your training data, and the "trainfeatures" matrix should contain +2 features per training example. + +

+Now plot your data, using separate symbols for positives and negatives. Your +plot should look similar to this: + +

+

+
+ +

+In this plot, we see two classes of data with a somewhat obvious + separation gap. However, the blue class has an outlier on the far left. +We'll now look at how this outlier affects the SVM decision boundary. + +

+Setting cost to C = 1 + +

+Recall from the lecture videos that the parameter $C$ in the SVM optimization +problem is a positive cost factor that +penalizes misclassified training examples. +A larger $C$ discourages misclassification more than a smaller $C$. + +

+First, we'll run the classifier with $C = 1$. + +

+To train your model, call + +

+

+model = svmtrain(trainlabels, trainfeatures, '-s 0 -t 0 -c 1');
+
+ +

+The last string argument tells LIBSVM to train using the options + +

+a. -s 0, SVM classification + +

+b. -t 0, a linear kernel, because we want a linear decision boundary + +

+c. -c 1, a cost factor of 1 + +

+You can see all available options by typing "svmtrain" at the Matlab/Octave +console. + +

+After training is done, "model" will be a struct
+that contains the model parameters. We're now interested
+in getting the variables $w$ and $b$. Unfortunately, these are not
+explicitly represented in the model struct, but you can
+calculate them with the following commands:

+

+w = model.SVs' * model.sv_coef;
+b = -model.rho;
+if (model.Label(1) == -1)
+    w = -w; b = -b;
+end
+
+ +

+Once you have $w$ and $b$, use them to plot the decision boundary. The outcome +should look like the graph below. + +

+

+
+ +

+With $C = 1$, we see that the outlier is misclassified, but the decision boundary +seems like a reasonable fit. + +

+Setting cost to C = 100 + +

+Now let's look at what happens when the cost factor is much higher. Train your +model and plot the decision boundary again, this time with $C$ set to 100. +The outlier will now be classified correctly, but the decision boundary will +not seem like a natural fit for the rest of the data: + +

+

+
+ +

+This example shows that when the cost penalty is large, the
+SVM algorithm will try very hard to avoid misclassifications. The tradeoff is that
+the algorithm gives less weight to producing a large separation margin.

+Text classification + +

+Now let's return to our spam classification example from the previous exercise. +In your data folder, there should be the same 4 training sets you saw in the +Naive Bayes exercise, only now formatted for LIBSVM. They are named: + +

+a. email_train-50.txt (based on 50 email documents) + +

+b. email_train-100.txt (100 documents) + +

+c. email_train-400.txt (400 documents) + +

+d. email_train-all.txt (the complete 700 training documents) + +

+You will train a linear SVM model on each of the four training sets with
+$C$ left at the default SVM value. After training, test
+the performance of each model on the test set named "email_test.txt." This
+is done with the "svmpredict" command, which you can find out more about
+by typing "svmpredict" at the MATLAB/Octave console.

+During test time, the accuracy on the test set will be printed to the console. +Record the classification accuracy for each training set +and check your answers with the solutions. +How do the errors compare to the Naive Bayes errors? + +
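+A sketch of that test step (variable names are illustrative; svmpredict prints the accuracy for you):
+
+[testlabels, testfeatures] = libsvmread('email_test.txt');
+model = svmtrain(trainlabels, trainfeatures, '-s 0 -t 0');   % C left at its default
+[predictions, accuracy, decvals] = svmpredict(testlabels, testfeatures, model);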

+ +

+ +
+ + +

+


+ + + +
+ +
+



+
+
+
+ + + + + + diff --git a/ML_Mathematical_Approach/27_programming-ex-6/01__msg00226.html b/ML_Mathematical_Approach/27_programming-ex-6/01__msg00226.html new file mode 100644 index 0000000..2c3c673 --- /dev/null +++ b/ML_Mathematical_Approach/27_programming-ex-6/01__msg00226.html @@ -0,0 +1,168 @@ + + + + + + + + + + + +[Octave-bug-tracker] [bug #41096] Unknown hggroup property Color + + + +
+ +
+

[Octave-bug-tracker] [bug #41096] Unknown hggroup property Color

+
+ + + + + + + + + + + + + + + + + + + + + + + + + +
+From: +Rik
+Subject: +[Octave-bug-tracker] [bug #41096] Unknown hggroup property Color
+Date: +Sat, 04 Jan 2014 23:17:27 +0000
+User-agent: +Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:26.0) Gecko/20100101 Firefox/26.0
+ + +
+ + +
Update of bug #41096 (project octave):
+
+             Open/Closed:                    Open => Closed                 
+
+    _______________________________________________________
+
+Follow-up Comment #1:
+
+'color' isn't a supported property of a contour group.  I think the property
+you need to use is 'linecolor'.  The list of possible properties is documented
+here http://www.mathworks.com/help/matlab/ref/contourgroupproperties.html.
+
+This did work under 3.6.4, but I don't think it was ever supposed to.
+
+    _______________________________________________________
+
+Reply to this item at:
+
+  <http://savannah.gnu.org/bugs/?41096>
+
+_______________________________________________
+  Message sent via/by Savannah
+  http://savannah.gnu.org/
+
+
+
+
+ + + +
+
+ + + + +
+ + +
+ + + + + + + + + + + diff --git a/ML_Mathematical_Approach/27_programming-ex-6/01__resources.html b/ML_Mathematical_Approach/27_programming-ex-6/01__resources.html new file mode 100644 index 0000000..d39bd24 --- /dev/null +++ b/ML_Mathematical_Approach/27_programming-ex-6/01__resources.html @@ -0,0 +1,216 @@ + + +

+ ML:Programming Exercise 6:Support Vector Machines +

+

+ Keep in mind that all the programming exercise solutions should handle any number of features in the training examples. Passing the test case in the PDF file is not sufficient to be sure of passing the submit grader's test case. +

+

+ Debugging Tip +

+

+ The submit script, for all the programming assignments, does not report the line number and location of the error when it crashes. The following method can be used to make it do so, which makes debugging easier.

+

+ Open ex6/lib/submitWithConfiguration.m and replace line: +

+
 fprintf('!! Please try again later.\n');
+
+

+ (around 28) with: +

+
fprintf('Error from file:%s\nFunction:%s\nOn line:%d\n', e.stack(1,1).file,e.stack(1,1).name, e.stack(1,1).line );
+
+

+ The top line prints '!! Please try again later' on a crash; with the replacement, the script will instead report the file, function name, and line number of the error. This change can be applied to all the programming assignments.

+

+ Update to ex6.m +

+

+ At line 69/70, change "sigma = 0.5" to "sigma = %0.5f", and change the list of output variables from "sim" to "sigma, sim". This lets the screen output display the actual value of sigma, rather than an (incorrect) constant value. +

+
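+ A hedged sketch of what the modified line might look like (the exact wording of the string in your copy of ex6.m may differ):
+
fprintf('Gaussian Kernel with sigma = %0.5f : %f\n', sigma, sim);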

+ Trouble with the contour plot (visualizeBoundary.m) +

+

+ Octave 3.8.x and higher +

+

+ If you have Octave 3.8.x, the ex6 script will not plot the decision boundary, and it prints 'Unknown hggroup property Color' with a stack trace.

+

+ One fix is to modify line 21 in visualizeBoundary.m with this code: +

+
contour(X1, X2, vals, [1 1], 'linecolor', 'blue');
+
+

+ (Note: I tried this and although the error went away, I still don't see any contour line drawn; sokolov 3/22/2015) +

+

+ I had the same problem with the line not displaying until I changed the [0 0] to [1 1]. - tmcarthur 7/1/2016

+

+ OR +

+

+ If you change line 21 to the following, it will show two lines and will work with Octave >= 3.8.x:

+
contour(X1, X2, vals);
+
+

+ For more information see +

+

+ + http://lists.gnu.org/archive/html/octave-bug-tracker/2014-01/msg00226.html + +

+

+ Matlab +

+

+ In Matlab R2014b and R2015b, simply changing the [0 0] parameter on line 21 in visualizeBoundary.m to [1 1] plots the boundary. +

+

+ processEmail no loop possible +

+

+ You can use find() or ismember() with the word vocabulary cell array instead of looping over it; see the sketch below.

+
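+ A minimal sketch of the idea (vocabList is the vocabulary cell array from getVocabList(), str is the current token, and word_indices accumulates the result):
+
idx = find(strcmp(vocabList, str));   % empty when str is not in the vocabulary
+word_indices = [word_indices ; idx];  % ismember(str, vocabList) works similarly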

+ Understanding SMO and the svmTrain() and svmPredict() methods +

+

+ The + + svmTrain.m + + file is provided with this exercise and it contains an implementation of the Sequential Minimal Optimization (SMO) algorithm to minimize an SVM. You don't need to understand how it works in order to complete the exercise. There are comments in the code that reference numbered equations, but the code doesn't say what document those numbers reference. It turns out to be a section of the course materials from CS 229 at Stanford covering SMO, which can be found here: +

+

+ + http://cs229.stanford.edu/materials/smo.pdf + +

+

+ More SVM explanations +

+

+ "An Idiot's Guide to Support Vector Machines" +

+

+ + http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf + +

+

+ Information on LIBSVM

+

+ This exercise uses the LIBSVM package to solve a problem similar to ex6 (also by Prof Ng).

+

+ + http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex7/ex7.html + +

+

+ Using LIBSVM in MATLAB/Octave +

+

+ In the optional section of this exercise, Prof Ng recommended that we use LIBSVM to solve the problem. +

+

+ + http://www.csie.ntu.edu.tw/~cjlin/libsvm/ + +

+

+ Installing LIBSVM on MATLAB/Octave is very easy. +

+ +
+ + + diff --git a/ML_Mathematical_Approach/27_programming-ex-6/01__smo.pdf b/ML_Mathematical_Approach/27_programming-ex-6/01__smo.pdf new file mode 100644 index 0000000..de9afef Binary files /dev/null and b/ML_Mathematical_Approach/27_programming-ex-6/01__smo.pdf differ diff --git a/ML_Mathematical_Approach/27_programming-ex-6/01__svm-notes-long-08.pdf b/ML_Mathematical_Approach/27_programming-ex-6/01__svm-notes-long-08.pdf new file mode 100644 index 0000000..d7e5866 Binary files /dev/null and b/ML_Mathematical_Approach/27_programming-ex-6/01__svm-notes-long-08.pdf differ diff --git a/ML_Mathematical_Approach/28_programming-ex-7/01__bsxfun.html b/ML_Mathematical_Approach/28_programming-ex-7/01__bsxfun.html new file mode 100644 index 0000000..ccfa6fd --- /dev/null +++ b/ML_Mathematical_Approach/28_programming-ex-7/01__bsxfun.html @@ -0,0 +1,1087 @@ + + + +Apply element-wise operation to two arrays with implicit +expansion enabled - MATLAB bsxfun - MathWorks India + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
bsxfun

Apply element-wise operation to two arrays with implicit +expansion enabled

+

Syntax

C = bsxfun(fun,A,B)

Description


C = bsxfun(fun,A,B) applies +the element-wise binary operation specified by the function handle fun to +arrays A and B.

Examples


Subtract the column mean from the corresponding column elements of a matrix A. Then normalize by the standard deviation.

A = [1 2 10; 3 4 20; 9 6 15];
+C = bsxfun(@minus, A, mean(A));
+D = bsxfun(@rdivide, C, std(A))
D = 
+
+   -0.8006   -1.0000   -1.0000
+   -0.3203         0    1.0000
+    1.1209    1.0000         0
+
+

In MATLAB® R2016b and later, you can directly use operators instead of bsxfun, since the operators independently support implicit expansion of arrays with compatible sizes.

(A - mean(A))./std(A)
ans = 
+
+   -0.8006   -1.0000   -1.0000
+   -0.3203         0    1.0000
+    1.1209    1.0000         0
+
+

Compare the elements in a column vector and a row vector. The result is a matrix containing the comparison of each combination of elements from the vectors. An equivalent way to execute this operation is with A > B.

A = [8; 17; 20; 24]
A = 
+
+     8
+    17
+    20
+    24
+
+
B = [0 10 21]
B = 
+
+     0    10    21
+
+
C = bsxfun(@gt,A,B)
C = 4x3 logical array
+   1   0   0
+   1   1   0
+   1   1   0
+   1   1   1
+
+

Create a function handle that represents the function f(a,b) = a - e^b.

fun = @(a,b) a - exp(b);

Use bsxfun to apply the function to vectors a and b. The bsxfun function expands the vectors into matrices of the same size, which is an efficient way to evaluate fun for many combinations of the inputs.

a = 1:7;
+b = pi*[0 1/4 1/3 1/2 2/3 3/4 1].';
+C = bsxfun(fun,a,b)
C = 
+
+         0    1.0000    2.0000    3.0000    4.0000    5.0000    6.0000
+   -1.1933   -0.1933    0.8067    1.8067    2.8067    3.8067    4.8067
+   -1.8497   -0.8497    0.1503    1.1503    2.1503    3.1503    4.1503
+   -3.8105   -2.8105   -1.8105   -0.8105    0.1895    1.1895    2.1895
+   -7.1205   -6.1205   -5.1205   -4.1205   -3.1205   -2.1205   -1.1205
+   -9.5507   -8.5507   -7.5507   -6.5507   -5.5507   -4.5507   -3.5507
+  -22.1407  -21.1407  -20.1407  -19.1407  -18.1407  -17.1407  -16.1407
+
+

Input Arguments


+

Binary function to apply, specified as a function handle. fun must +be a binary (two-input) element-wise function of the form C += fun(A,B) that accepts arrays A and B with +compatible sizes. For more information, see Compatible Array Sizes for Basic Operations. fun must +support scalar expansion, such that if A or B is +a scalar, then C is the result of applying the +scalar to every element in the other input array.

+

In MATLAB® R2016b and later, the built-in binary functions +listed in this table independently support implicit expansion. With +these functions, you can call the function or operator directly instead +of using bsxfun. For example, you can replace C += bsxfun(@plus,A,B) with A+B.

+
Function    Symbol    Description
plus        +         Plus
minus       -         Minus
times       .*        Array multiply
rdivide     ./        Right array divide
ldivide     .\        Left array divide
power       .^        Array power
eq          ==        Equal
ne          ~=        Not equal
gt          >         Greater than
ge          >=        Greater than or equal to
lt          <         Less than
le          <=        Less than or equal to
and         &         Element-wise logical AND
or          |         Element-wise logical OR
xor         N/A       Logical exclusive OR
max         N/A       Binary maximum
min         N/A       Binary minimum
mod         N/A       Modulus after division
rem         N/A       Remainder after division
atan2       N/A       Four-quadrant inverse tangent; result in radians
atan2d      N/A       Four-quadrant inverse tangent; result in degrees
hypot       N/A       Square root of sum of squares

+

Example: C = bsxfun(@plus,[1 2],[2; 3])

+

Data Types: function_handle

+

Input arrays, specified as scalars, vectors, matrices, or multidimensional +arrays. Inputs A and B must +have compatible sizes. For more information, see Compatible Array Sizes for Basic Operations. Whenever +a dimension of A or B is singleton +(equal to one), bsxfun virtually replicates the +array along that dimension to match the other array. In the case where +a dimension of A or B is singleton, +and the corresponding dimension in the other array is zero, bsxfun virtually +diminishes the singleton dimension to zero.

+

Data Types: single | double | uint8 | uint16 | uint32 | uint64 | int8 | int16 | int32 | int64 | char | logical
+Complex Number Support: Yes

Extended Capabilities

C/C++ Code Generation
Generate C and C++ code using MATLAB® Coder™.

Introduced in R2007a

+ + + + \ No newline at end of file diff --git a/ML_Mathematical_Approach/28_programming-ex-7/01__permute.html b/ML_Mathematical_Approach/28_programming-ex-7/01__permute.html new file mode 100644 index 0000000..98e3332 --- /dev/null +++ b/ML_Mathematical_Approach/28_programming-ex-7/01__permute.html @@ -0,0 +1,1029 @@ + + + +Rearrange dimensions of N-D array - MATLAB permute - MathWorks India + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

permute

Rearrange dimensions of N-D array

+

Syntax

B = permute(A,order)
+

Description

B = permute(A,order) rearranges +the dimensions of A so that they are in the order +specified by the vector order. B has +the same values of A but the order of the subscripts +needed to access any particular element is rearranged as specified +by order. All the elements of order must +be unique, real, positive, integer values.

Examples


Create a 3-by-4-by-5 array and permute it so that the first and third dimensions are switched.

A = rand(3,4,5);
+B = permute(A,[3 2 1]);
+size(B)
ans = 
+
+     5     4     3
+
+

Tips

permute and ipermute are +a generalization of transpose (.') for multidimensional +arrays.


Introduced before R2006a

+ + + + \ No newline at end of file diff --git a/ML_Mathematical_Approach/28_programming-ex-7/01__repmat.html b/ML_Mathematical_Approach/28_programming-ex-7/01__repmat.html new file mode 100644 index 0000000..7e875f8 --- /dev/null +++ b/ML_Mathematical_Approach/28_programming-ex-7/01__repmat.html @@ -0,0 +1,1136 @@ + + + +Repeat copies of array - MATLAB repmat - MathWorks India + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +

repmat

Repeat copies of array

+

Syntax

B = repmat(A,n)
B = repmat(A,r1,...,rN)
B = repmat(A,r)

Description


B = repmat(A,n) returns +an array containing n copies of A in +the row and column dimensions. The size of B is size(A)*n when A is +a matrix.


B = repmat(A,r1,...,rN) specifies +a list of scalars, r1,..,rN, that describes how +copies of A are arranged in each dimension. When A has N dimensions, +the size of B is size(A).*[r1...rN]. +For example, repmat([1 2; 3 4],2,3) returns a 4-by-6 +matrix.


B = repmat(A,r) specifies +the repetition scheme with row vector r. For example, repmat(A,[2 +3]) returns the same result as repmat(A,2,3).

Examples


Repeat copies of a matrix into a 2-by-2 block arrangement.

A = diag([100 200 300])
A = 
+
+   100     0     0
+     0   200     0
+     0     0   300
+
+
B = repmat(A,2)
B = 
+
+   100     0     0   100     0     0
+     0   200     0     0   200     0
+     0     0   300     0     0   300
+   100     0     0   100     0     0
+     0   200     0     0   200     0
+     0     0   300     0     0   300
+
+

Repeat copies of a matrix into a 2-by-3 block arrangement.

A = diag([100 200 300])
A = 
+
+   100     0     0
+     0   200     0
+     0     0   300
+
+
B = repmat(A,2,3)
B = 
+
+   100     0     0   100     0     0   100     0     0
+     0   200     0     0   200     0     0   200     0
+     0     0   300     0     0   300     0     0   300
+   100     0     0   100     0     0   100     0     0
+     0   200     0     0   200     0     0   200     0
+     0     0   300     0     0   300     0     0   300
+
+

Repeat copies of a matrix into a 2-by-3-by-2 block arrangement.

A = [1 2; 3 4]
A = 
+
+     1     2
+     3     4
+
+
B = repmat(A,[2 3 2])
B = 
+B(:,:,1) =
+
+     1     2     1     2     1     2
+     3     4     3     4     3     4
+     1     2     1     2     1     2
+     3     4     3     4     3     4
+
+
+B(:,:,2) =
+
+     1     2     1     2     1     2
+     3     4     3     4     3     4
+     1     2     1     2     1     2
+     3     4     3     4     3     4
+
+

Vertically stack a row vector four times.

A = 1:4;
+B = repmat(A,4,1)
B = 
+
+     1     2     3     4
+     1     2     3     4
+     1     2     3     4
+     1     2     3     4
+
+

Horizontally stack a column vector four times.

A = (1:3)';  
+B = repmat(A,1,4)
B = 
+
+     1     1     1     1
+     2     2     2     2
+     3     3     3     3
+
+

Create a table with variables Age and Height.

A = table([39; 26],[70; 63],'VariableNames',{'Age' 'Height'})
A=2x2 table
+    Age    Height
+    ___    ______
+
+    39     70    
+    26     63    
+
+

Repeat copies of the table into a 2-by-3 block format.

B = repmat(A,2,3)
B=4x6 table
+    Age    Height    Age_1    Height_1    Age_2    Height_2
+    ___    ______    _____    ________    _____    ________
+
+    39     70        39       70          39       70      
+    26     63        26       63          26       63      
+    39     70        39       70          39       70      
+    26     63        26       63          26       63      
+
+

repmat repeats the entries of the table and appends a number to the new variable names.

Input Arguments


A - Input array

Input array, specified as a scalar, vector, matrix, or multidimensional array.

+

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64 | logical | char | string | struct | table | cell
+Complex Number Support: Yes

n - Number of times to repeat input array in row and column dimensions

Number of times to repeat the input array in the row and column dimensions, specified as an integer value. If n is 0 or negative, the result is an empty array.

+

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

r1,...,rN - Repetition factors for each dimension (as separate arguments)

Repetition factors for each dimension, specified as separate arguments of integer values. If any repetition factor is 0 or negative, the result is an empty array.

+

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

r - Vector of repetition factors

Vector of repetition factors for each dimension, specified as a row vector of integer values. If any value in r is 0 or negative, the result is an empty array.

+

Data Types: single | double | int8 | int16 | int32 | int64 | uint8 | uint16 | uint32 | uint64

Tips

  • To build block arrays by forming the tensor product of the input with an array of ones, use kron. For example, to stack the row vector A = 1:3 four times vertically, you can use B = kron(A,ones(4,1)).

  • To create block arrays and perform a binary operation in a single pass, use bsxfun. In some cases, bsxfun provides a simpler and more memory efficient solution. For example, to add the vectors A = 1:5 and B = (1:10)' to produce a 10-by-5 array, use bsxfun(@plus,A,B) instead of repmat(A,10,1) + repmat(B,1,5). (See the demo after this list.)

  • When A is a scalar of a certain type, you can use other functions to get the same result as repmat:

    repmat Syntax               Equivalent Alternative
    repmat(NaN,m,n)             NaN(m,n)
    repmat(single(inf),m,n)     inf(m,n,'single')
    repmat(int8(0),m,n)         zeros(m,n,'int8')
    repmat(uint32(1),m,n)       ones(m,n,'uint32')
    repmat(eps,m,n)             eps(ones(m,n))

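For instance, here is a small runnable comparison of the two approaches from the second tip (a demo added for illustration; the values follow the tip above):

   A = 1:5;                               % 1-by-5 row vector
   B = (1:10)';                           % 10-by-1 column vector
   C1 = bsxfun(@plus, A, B);              % 10-by-5, expands operands implicitly
   C2 = repmat(A,10,1) + repmat(B,1,5);   % same 10-by-5 result via explicit copies
   isequal(C1, C2)                        % ans = 1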

Introduced before R2006a

\ No newline at end of file diff --git a/ML_Mathematical_Approach/28_programming-ex-7/01__resources.html b/ML_Mathematical_Approach/28_programming-ex-7/01__resources.html new file mode 100644 index 0000000..77a10eb --- /dev/null +++ b/ML_Mathematical_Approach/28_programming-ex-7/01__resources.html @@ -0,0 +1,190 @@

ML:Programming Exercise 7: K-Means Clustering and PCA

+

+ Debugging Tip +

+

The submit script, for all the programming assignments, does not report the line number and location of the error when it crashes. The following method can be used to make it do so, which makes debugging easier.

+

+ Open ex7/lib/submitWithConfiguration.m and replace line: +

+
 fprintf('!! Please try again later.\n');
+
+

(around line 28) with:

+
fprintf('Error from file:%s\nFunction:%s\nOn line:%d\n', e.stack(1,1).file,e.stack(1,1).name, e.stack(1,1).line );
+
+

The top line prints '!! Please try again later' on a crash; the replacement line instead reports the file, function, and line number of the error. This change can be applied to all the programming assignments.

+

+ Workaround for problem in plotting routine +

+

{CTA Note: This problem only affects certain versions of Octave.} After completing the computeCentroids.m function, I ran into the following problem:

+
    K-Means iteration 1/10...
+    error: __scatter__: A(I): index out of bounds; value 4 out of bound 3
+    error: called from:
+    error:   /Applications/Octave.app/Contents/Resources/share/octave/3.4.0/m/plot/private/__scatter__.m at line 199, column 13
+    error:   /Applications/Octave.app/Contents/Resources/share/octave/3.4.0/m/plot/scatter.m at line 71, column 11
+    error:  ?/ex7/mlclass-ex7/plotDataPoints.m at line 12, column 1
+    error:  ?/ex7/mlclass-ex7/plotProgresskMeans.m at line 11, column 1
+    error:  ?/ex7/mlclass-ex7/runkMeans.m at line 48, column 9
+    error:  ?/ex7/mlclass-ex7/ex7.m at line 92, column 19
+
+

I don't think it is caused by my solution; I found a workaround by modifying plotDataPoints.m as follows:

+
   % use idx directly. It will index into the default color map.
+   % scatter(X(:,1), X(:,2), 15, colors);
+    scatter(X(:,1), X(:,2), 15, idx);
+
+

+ The issue is a bug in the scatter() function in certain versions of Octave. +

+

+ findClosestCentroids() issue with regards to the grader +

+

+ If two centroids have identical distances, the submit grader wants you to select the one with the lowest index value. This situation arises when running ex7_pca.m - some of the image pixels have the same minimum distance to more than one centroid. This restriction is most easily accommodated by using the min() function to find the centroid with the minimum distance. Students have found that using the find() function does not result in the answer the grader prefers. +

+
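For reference, here is a minimal sketch of that approach, using the exercise's findClosestCentroids() interface (the loop body is illustrative, not the official solution):

   function idx = findClosestCentroids(X, centroids)
     % For each example, pick the centroid with the smallest squared
     % distance; min() returns the lowest index on ties, which is the
     % tie-breaking behavior the grader expects.
     idx = zeros(size(X, 1), 1);
     for i = 1:size(X, 1)
       d = sum(bsxfun(@minus, centroids, X(i,:)) .^ 2, 2);
       [~, idx(i)] = min(d);
     end
   end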

+ Selecting the initial centroids - an additional consideration +

+

+ This issue was omitted from the lectures. When the initial centroids are selected, be sure that they are each unique. For example, if using K-Means to compress an image, each of the initial centroids should represent a unique color. If two initial centroids were the exact same color, then you would effectively have K-1 centroids, not K. +

+

+ Using the kMeansInitCentroids() method as given in ex7.pdf, an experiment on the "bird_small.mat" data set shows that approximately 5 tries in 10,000 will result in duplicate centroids. The method given in ex7.pdf only selects unique members of the training set as the centroids - it does not verify that they are not duplicate values. +

+

One method for preventing duplicate centroids would be as follows: randomly select K training examples as usual, check whether any of the selected centroids are duplicate values (for example with unique(centroids, 'rows')), and re-select until all K centroids are distinct.

+
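A sketch of that retry loop, assuming the kMeansInitCentroids() interface from ex7:

   function centroids = kMeansInitCentroids(X, K)
     % Re-draw random centroids until all K rows are distinct values.
     while true
       randidx = randperm(size(X, 1));
       centroids = X(randidx(1:K), :);
       if size(unique(centroids, 'rows'), 1) == K
         break;
       end
     end
   end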

Another method would be to prevent any duplicates at all by using the unique function on the training examples (unique(X, 'rows')) before randomly selecting the initial centroids.

+

+ Fully vectorizing findClosestCentroids() +

+

+ It is possible to fully vectorize this function by using 3D arrays for the training examples and the centroids. +

+

Tip 1: To transform 2D arrays to 3D, you can use permute with an extra dimension index. For example, you can transform an m×n (2D) matrix A2 to an m×1×n (3D) array A3 using A3 = permute(A2, [1 3 2]);

+

Tip 2: Instead of using repmat to "expand" a matrix for binary operations, it is usually faster to use bsxfun.

+
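Combining the two tips, one possible fully vectorized form looks like the sketch below (assuming the usual ex7 shapes, X being m×n and centroids being K×n; illustrative, not the official solution):

   C3 = permute(centroids, [3 2 1]);        % K-by-n -> 1-by-n-by-K
   D  = sum(bsxfun(@minus, X, C3) .^ 2, 2); % m-by-1-by-K squared distances
   D2 = reshape(D, size(X, 1), []);         % m-by-K
   [~, idx] = min(D2, [], 2);               % lowest index wins ties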

+ Errata in projectData.m +

+

+ Make the following change in the "Instructions" section: +

+
%      projection_k = x' * U(:, 1:k);
+
+

+ The "1:k" portion was missing the "1:" part. +

diff --git a/ML_Mathematical_Approach/29_programming-ex-8/01__resources.html b/ML_Mathematical_Approach/29_programming-ex-8/01__resources.html new file mode 100644 index 0000000..9edb722 --- /dev/null +++ b/ML_Mathematical_Approach/29_programming-ex-8/01__resources.html @@ -0,0 +1,134 @@

+ ML:Programming Exercise 8: Anomaly Detection and Recommender Systems +

+

+ Debugging Tip +

+

The submit script, for all the programming assignments, does not report the line number and location of the error when it crashes. The following method can be used to make it do so, which makes debugging easier.

+

+ Open ex8/lib/submitWithConfiguration.m and replace line: +

+
 fprintf('!! Please try again later.\n');
+
+

(around line 28) with:

+
fprintf('Error from file:%s\nFunction:%s\nOn line:%d\n', e.stack(1,1).file,e.stack(1,1).name, e.stack(1,1).line );
+
+

The top line prints '!! Please try again later' on a crash; the replacement line instead reports the file, function, and line number of the error. This change can be applied to all the programming assignments.

+

Error in ex8_cofi.m (reported by Charles Davis in session ML-005)

+

Line 199 in ex8_cofi.m reads:

+

+ theta = fmincg (@(t)(cofiCostFunc(t, Y, R, num_users, num_movies, num_features, lambda)), initial_parameters, options); +

+

+ but I believe it should be +

+

+ theta = fmincg (@(t)(cofiCostFunc(t, Ynorm, R, num_users, num_movies, num_features, lambda)), initial_parameters, options); +

+

...to avoid creating ratings > 5 at line 219. This doesn't affect the submissions, of course, just the cosmetics of the recommendations.

+

+ Supporting analysis: Y is normalized in line 181, creating Ynorm, but then it is never used. The video lecture "Implementation Detail: Mean Normalization" at 5:34 makes it pretty clear that the normalized Y matrix should be used for calculating theta. +

+
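For context, mean normalization computes each movie's average over only its rated entries and subtracts it only where ratings exist. A sketch of the idea (variable names follow ex8_cofi.m; this is illustrative, not the distributed normalizeRatings code):

   Ymean = sum(Y .* R, 2) ./ max(sum(R, 2), 1);     % per-movie mean of rated entries
   Ynorm = (Y - Ymean * ones(1, size(Y, 2))) .* R;  % subtract mean only where rated
   % After learning X and Theta from Ynorm, predictions re-add the mean:
   % p = X * Theta';  prediction(i,j) = p(i,j) + Ymean(i)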

+ This errata also means that "ex8.pdf" Figure 4 is incorrect, since it shows movies with ratings greater than 5-stars. +

+

+ Item 2: The grader uses Y with non-zero values +

+

+ When using the R matrix (to ignore movies that have not been rated), do not rely on Y(i,j) to be 0 when a user has not rated a film. This expectation is true for the ex8_cofi.m script, but that is NOT true for the test case used by the submit grader for Part 3 through Part 6. +

+

Note: This might no longer be true; the grader now seems to use Y(i,j) == 0 when a user has not rated a film.

+

+ Item 3: Regularization +

+

+ Note: Unlike previous assignments when we performed regularization, for this exercise, we do NOT skip the 1st column of Theta or X when computing regularization. This is because we are not specifying bias units in the collaborative filtering algorithm (since the algorithm determines all of the theta values, it can set one to the '1' value if it leads to the optimum solution). Therefore, all values of Theta and X should be considered in regularization. +

+
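Accordingly, the regularization term sums over every entry of both parameter matrices. A sketch consistent with the cofiCostFunc() variables (illustrative, not the distributed solution):

   err = (X * Theta' - Y) .* R;   % count only movies that were rated
   J = sum(err(:) .^ 2) / 2 ...
       + (lambda / 2) * (sum(Theta(:) .^ 2) + sum(X(:) .^ 2));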

+ 1.2 Estimating parameters for a Gaussian +

+

The var function can return normalization by 1/m instead of 1/(m-1): set the second argument to 0 for 1/(m-1), or to 1 for 1/m.

+
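For example, to get the 1/m (maximum likelihood) estimates the exercise expects (assuming X holds the training examples as rows):

   mu = mean(X);        % 1-by-n vector of feature means
   sigma2 = var(X, 1);  % second argument 1 -> normalize by 1/m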

+ errors in cofiCostFunc.m +

+

Line 9 should read "% Unfold the X and Theta matrices from params".

+

+

+
+ + + diff --git a/ML_Mathematical_Approach/30_installation-issues/01__Octave_3.6.1_for_windows_mingw_ b/ML_Mathematical_Approach/30_installation-issues/01__Octave_3.6.1_for_windows_mingw_ new file mode 100644 index 0000000..1c9db29 --- /dev/null +++ b/ML_Mathematical_Approach/30_installation-issues/01__Octave_3.6.1_for_windows_mingw_ @@ -0,0 +1,2107 @@ + + + + + + + + + + + + + + + + + + + + + Octave-Forge - Browse /Octave Windows binaries/Octave 3.6.1 for Windows MinGW installer at SourceForge.net + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
Name                                     Modified    Size      Downloads / Week
Parent folder
Octave3.6.1_gcc4.6.2_20120303.7z         2012-03-07  175.9 MB  0
Octave3.6.1_gcc4.6.2_pkgs_20120303.7z    2012-03-05  46.8 MB   0
README                                   2012-03-05  8.2 kB    0
Totals: 3 Items                                      222.8 MB
Octave-3.6.1-mingw + octaveforge pkgs

1. Files for manual installation

a. Octave-3.6.1-mingw binaries tree
Octave3.6.1_gcc4.6.2_20120303.7z - MD5:294B99B5E4D47CAA83E8940EB2918D10

This is a 7z archive which includes a directory tree of all the binaries and libraries required for a complete octave installation (excluding octaveforge packages).

The archive includes:
octave-3.6.1 including PDF documentation (built using Tatsuro Matsuoka's OctaveLibs and gplibs, http://www.tatsuromatsuoka.com/octave/Eng/Win/)
mingw32 + msys tool chain
gnuplot-4.4.4
fig2dev-3.2.5c
ghostscript-9.0.4
pstoedit-3.60

Optional blas lib replacements:
The <your_install_dir>\bin directory includes several libblas.dll.<libblas_source> files, where <libblas_source> is a text extension which describes the source of the library:
  libblas.dll.ref - reference blas implementation, very slow but most stable
  libblas.dll.libopenblas_dynamicarch-r0.1alpha2.5-0-fda39c6 - OpenBLAS based, up to 2 threads, detects cpu architecture and selects the respective lib
  libblas.dll.libopenblas_dynamicarch_nt4-r0.1alpha2.5-0-fda39c6 - OpenBLAS based, up to 4 threads, detects cpu architecture and selects the respective lib
  libblas.dll.libopenblas_nehalemp-r0.1alpha2.5-0-fda39c6 - OpenBLAS based, up to 2 threads, tuned for nehalem cpu architecture
  libblas.dll.libopenblas_nehalemp_nt4-r0.1alpha2.5-0-fda39c6 - OpenBLAS based, up to 4 threads, tuned for nehalem cpu architecture
  libblas.dll.libopenblas_core2p-r0.1alpha2.5-0-fda39c6 - OpenBLAS based, up to 2 threads, tuned for core2 cpu architecture
  libblas.dll.libopenblas_core2p_nt4-r0.1alpha2.5-0-fda39c6 - OpenBLAS based, up to 4 threads, tuned for core2 cpu architecture
  libblas.dll.altas-3.8.4_ht-pentium - ATLAS based libblas, tuned for older ht-pentium cpus (compiled by Tatsuro Matsuoka)
  libblas.dll.altas-3.8.4_corei5 - ATLAS based libblas, tuned for older core i5 cpus (compiled by Tatsuro Matsuoka)

The default installed libblas.dll is libopenblas_dynamicarch-r0.1alpha2.5-0-fda39c6, which is intended to automatically detect the cpu architecture and select a respectively tuned library.
In case the default library is not functioning properly on the actual cpu, or you wish to explore the performance of another libblas.dll.<libblas_source>, it can be manually selected to replace the default one:
  delete <your_install_dir>\bin\libblas.dll
  make a copy of the desired <your_install_dir>\bin\libblas.dll.<libblas_source>
  rename the copy to libblas.dll

Maintainer: Nitzan Arazi
Latest update: 2012-03-03

b. Octaveforge pkgs, built for Octave-3.6.1-mingw
Octave3.6.1_gcc4.6.2_pkgs_20120303.7z - MD5:44A85F26A8925FEC5E1F0856408C9DD5

This is a 7z archive which includes additional binaries and libraries for a set of octaveforge packages.

The included packages are:
actuarial-1.1.0, ad-1.0.6_patched, audio-1.1.4, benchmark-1.1.1, bim-1.0.2, bioinfo-0.1.2, civil-engineering-1.0.7, combinatorics-1.0.9, communications-1.1.0_svn20120127_patched, control-2.2.5, data-smoothing-1.3.0, dataframe-0.9.1, econometrics-1.0.8, fenv-0.1.0, financial-0.3.2, fpl-1.2.0, fuzzy-logic-toolkit-0.3.0, ga-0.9.8, general-1.2.2, generate_html-0.1.3, geometry-1.4.0, gnuplot-1.0.1, gpc-0.1.7, gsl-1.0.8, ident-1.0.7, image-1.0.15, informationtheory-0.1.8, integration-1.0.7_svn20120128, io-1.0.17, irsa-1.0.7, java-1.2.8_patched, linear-algebra-2.1.0_svn20120225, mapping-1.0.7, mechanics-1.2.0, miscellaneous-1.0.11_svn20120127, missing-functions-1.0.2, msh-1.0.2, multicore-0.2.15, nan-2.5.2, nlwing2-1.2.0, nnet-0.1.13, nurbs-1.3.5, ocs-0.1.3_svn20120128_patched, octclip-1.0.0, octgpr-1.2.0, odebvp-1.0.6, odepkg-0.8.0_svn20120127, optim-1.0.17_patched, optiminterp-0.3.4_svn20120128_patched, outliers-0.13.9, physicalconstants-0.1.7, plot-1.1.0, quaternion-1.0.0, queueing-1.0.0, secs1d-0.0.8, secs2d-0.0.8, secs3d-0.0.1, signal-1.1.2, simp-1.1.0, sockets-1.0.7_svn20120128_patched, specfun-1.1.0, special-matrix-1.0.7, spline-gcvspl-1.0.8, splines-1.0.7, statistics-1.1.0_svn20120128, strings-1.0.7, struct-1.0.9, symband-1.0.10, symbolic-1.1.0, tcl-octave-0.1.8, time-1.0.9, tsa-4.1.1, video-1.0.2_patched, vrml-1.0.12_svn20111014_patched, windows-1.1.0, xraylib-1.0.8, zenity-0.5.7

Maintainer: Nitzan Arazi
Latest update: 2012-03-03

2. Manual installation instructions

Create an installation directory whose path doesn't contain space characters (e.g. C:\Octave\Octave3.6.1_gcc4.6.2\). This directory is referred to hereafter as <your_install_dir>.

Extract the complete directories tree from Octave3.6.1_gcc4.6.2_20120303.7z to the installation directory, keeping the original directory structure as in the archive (you can use the 7-zip tool from http://www.7-zip.org/).

Copy octave3.6.1_gcc4.6.2.lnk to any convenient location and edit its properties to point to <your_install_dir>\bin\octave.exe, with <your_install_dir>\share\octave\3.6.1\imagelib\octave-logo.ico as an icon.

Copy octave3.6.1_gcc4.6.2_docs.lnk to any convenient location and edit its properties to point to <your_install_dir>\doc\octave, with <your_install_dir>\share\octave\3.6.1\imagelib\octave-logo.ico as an icon.

At this point you can:
a. Launch and use octave by double-clicking the copied octave3.6.1_gcc4.6.2.lnk
b. Access and browse the documentation files by double-clicking the copied octave3.6.1_gcc4.6.2_docs.lnk

3. Manual installation instructions for the Octave-forge packages

Extract the complete directories tree from Octave3.6.1_gcc4.6.2_pkgs_20120303.7z to the installation directory (<your_install_dir>), keeping the original directory structure as in the archive (you can use the 7-zip tool from http://www.7-zip.org/).

In order to update the octave_packages database with your installation tree and auto-load most packages (excluding 'ad' and 'windows', which may crash octave when loaded and 'clear all' is executed), launch Octave and execute the following rebuild commands from the octave console:

  pkg rebuild -auto
  pkg rebuild -noauto ad windows
  pkg rebuild -noauto nan % shadows many statistics functions
  pkg rebuild -noauto gsl % shadows some core functions
  pkg rebuild -auto java

The last pkg rebuild command is required in order for the java pkg entry to be moved to the top of the <your_install_dir>\share\octave\octave_packages db file - thus the java pkg is loaded before the io pkg, and io pkg related jars are added to the java class path.

You can optionally adjust your installed packages status per your specific needs and usage by executing the following commands:

a. To interactively load or unload a package
  pkg load <pkg_name>
or
  pkg unload <pkg_name>
b. To disable auto-load for a specific pkg <pkg_name>
  pkg rebuild -noauto <pkg_name>
c. To enable auto-load for a specific pkg <pkg_name>
  pkg rebuild -auto <pkg_name>
d. To completely uninstall a package
  pkg uninstall <pkg_name>

4. Optional installation of Notepad++ as an editor (recommended)

Download a recent Notepad++ installation package from http://notepad-plus-plus.org/ and install it on your system.

Edit <your_install_dir>\share\octave\site\m\startup\octaverc and un-comment the line which sets octave's default editor:

  EDITOR('C:\Program Files\Notepad++\notepad++.exe');

Note: You may adjust the above line for the location of notepad++.exe as installed on your system.

5. Troubleshooting

Upon launching, some warnings may be displayed. These warnings can be ignored.

The following warnings are about missing external tools whose absence may reduce the functionality of some packages. These external tools are not provided by the 7z archives on sourceforge.

  warning: gmsh does not seem to be present some functionalities will be disabled
  warning: dx does not seem to be present some functionalities will be disabled

The following warning is about the fstat function of the statistics package, which overloads the old (to be deprecated) fstat function of octave-3.6.1:

  warning: function C:\Octave\3.6.1_gcc-4.6.2\share\octave\packages\statistics-1.1.0\fstat.m shadows a core library function
+ Source: README, updated 2012-03-05 + +
diff --git a/ML_Mathematical_Approach/30_installation-issues/01__index.php b/ML_Mathematical_Approach/30_installation-issues/01__index.php new file mode 100644 index 0000000..da0e942 --- /dev/null +++ b/ML_Mathematical_Approach/30_installation-issues/01__index.php @@ -0,0 +1,1226 @@

Octave for Microsoft Windows - Octave
+
+
+ + +
+
+

Octave for Microsoft Windows

This article is about using pre-built installers of Octave for Windows; for instructions about building it, see Windows Installer.
+

The most recent Windows installers are available from ftp.gnu.org/gnu/octave/windows/. Users are encouraged to use the latest version unless a specific feature or requirement warrants using an older version of the software. Version-specific instructions and installation notes are provided below.

Be advised that GNU Octave is primarily developed on GNU/Linux and other POSIX-conformant systems. The ports of GNU Octave to Microsoft Windows use different approaches to get most of the original Octave working and adapt it to Microsoft Windows idiosyncrasies (e.g. dynamic libraries, file paths, permissions, environment variables, GUI system, etc.). Bear this in mind and don't panic if you get unexpected results. There are a lot of suggestions on the mailing lists for tuning your Octave installation. GNU Octave standalone ports for Windows are independently compiled using either the MinGW or the Microsoft Visual Studio development environment (3.6 or before).

+


Installers for Microsoft Windows

+

Octave-4.2.1

+

The easiest way to install GNU Octave on Microsoft Windows is using MXE builds. For the current 4.2.1 release both 32-bit and 64-bit installers and zip archived packages can be found at ftp.gnu.org/gnu/octave/windows/. +

For executable installers the user can simply run the downloaded file and follow the onscreen installation prompts. It is recommended that the installation path not include spaces or non-ASCII characters. Shortcuts to the program will be automatically created. +

For the zip-file archives, the user should extract the file content to a directory on the harddrive (such as C:\Octave). Manual shortcuts can then be created to either the octave.bat or octave.vbs files in the main installation directory. +

+

Packages

+

A selection of pre-built octave-forge packages is prepared for all versions of the official release. If you installed Octave using the executable installer, you can confirm the package list by typing the command below at the Octave command prompt:

+
 >> pkg list
+
+

If you instead installed Octave from the .zip archive, you need to first rebuild the package list on your local machine. (The command above will produce a blank output and packages will be inaccessible before rebuilding.) Do this by typing the following command: +

+
  >> pkg rebuild
+
+

Packages can be updated by running +

+
  >> pkg update
+
+

Other packages can be installed by running +

+
  >> pkg install -forge <package name>
+
+

To manually install a new or updated package version, the package file can be downloaded from the Octave-Forge website to the working directory and installed using:

+
  >> pkg install package_file_name.tar.gz
+
+

Detailed instructions for installing Octave-Forge packages are shown at Octave-Forge.


Note that a security-related issue in Windows XP currently prevents Octave from automatically retrieving packages from the website for installation or updates when running under that operating system; manual package installation is necessary to update or install new packages.

+

Octave 4.2.1 on cygwin

+ +
  • Latest packages:
+
octave-4.2.1-1
+
Its announcement on the cygwin mailing list [1]
+
octave-forge packages each have a cygwin package
+
Its announcement on the cygwin mailing list [2]
+
Full cygwin package list is available here [3]
+
As of today, 2017-04-06, 64 forge packages are available.
+
  • To install :
+
run cygwin setup-x86.exe (for cygwin 32 bit) or setup-x86_64.exe (for cygwin 64 bit) and select them in the Math category.
+
All the package dependencies will be also installed.
+
Graphics are based on X; to plot, you will need to start octave within xterm (or similar).
+
I recommend installing "xinit", "xlaunch" and "gnuplot". These packages will pull in a fully functional X server.
+
Otherwise the only graphics will be ASCII art ;-)
+

Notes

+
  • When building from development source (default branch)
+
"make check"
+
passes almost all the tests. The only failures, none of them substantial, are:
+
   
+    /pub/hg/octave/src/data.cc : 8 failures due to different handling of complex Inf on sort
+    /pub/hg/octave/src/syscalls.cc: 1 failure on fork. This disappears when octave is installed
+    /pub/hg/octave/scripts/sparse/svds.m: 1 failure due to test sensitivity on starting point. See 
+    https://mailman.cae.wisc.edu/pipermail/octave-maintainers/2011-September/024715.html
+
+
  • To build from cygwin source package, you need to install "cygport" and the relevant development libraries
+
    
+     $ tar -xf octave-4.2.0-1-src.tar.xz 
+     $ cygport octave.cygport almostall
+
+
See the cygport documentation for further info.
+

Older version instructions

+

Note that the instructions below may contain outdated links or instructions that are no longer relevant to current versions.

+

Older MXE builds

+

Octave-4.2.0

+

The instructions for Octave 4.2.0 are the same as for Octave 4.2.1 above. However, note that version 4.2.0 has a bug that prevents it from automatically retrieving packages from the Octave-Forge website for installation or updates. Manual package installation is necessary with this version to update or install new packages on Windows. +

To manually install a new or updated package version, the package file can be downloaded from the Octave-Forge website to the working directory and can be installed using: +

+
  >> pkg install package_file_name.tar.gz
+
+

Detailed instructions for installing Octave-Forge packages are shown at Octave-Forge.

+

Octave-4.0.3

+

The easiest way to install GNU Octave on Microsoft Windows is using MXE builds. For the current 4.0.3 release, installers and zip-archived packages can be found at ftp.gnu.org/gnu/octave/windows/. An unofficial 64-bit binary is available at "File list of Octave for Windows".

Known issues specific to Windows version 4.0:

+
  1. Neither the CLI nor the GUI can handle path names that contain non-ASCII characters. (For 3.8, the CLI can handle non-ASCII characters if the codepage is properly set to the machine locale.)
  2. The nan package cannot be installed by the "pkg install" command. A workaround is to execute "setenv CC gcc" before "pkg install".
+

Known issue at Octave startup:

+
  1. Octave sometimes fails to start up because of internal troubles. In that case, try deleting the .config\octave folder under the directory that the USERPROFILE environment variable points to. (One way to see the value of USERPROFILE: start a command prompt and type "set".)
+

Packages

+

Pre-built octave-forge packages are prepared for the official release. If you installed Octave 4.0.3 using the executable installer, you can confirm the package list by typing the command below at the Octave command prompt:

+
 >> pkg list
+
+

If you instead installed Octave from the .zip archive, you need to first rebuild the package list on your local machine. (The command above will produce a blank output.) Do this by typing the following command: +

+
 >> pkg rebuild
+
+


+Other packages can be installed by +

+
  >> pkg install -forge <package name>
+
+

Detailed instructions for installing Octave-Forge packages are shown at Octave-Forge.

For the 64-bit binary distributed from the unofficial site, pre-built packages are not prepared. Please follow the instructions on the distribution page.

+

gnuplot

+

The current Octave for Windows ships with a gnuplot that is not full-featured. Therefore you cannot use the full features of the gnuplot graphics toolkit (e.g. cairo-based devices like "-dpdfcairo"). If you want to use them, follow the instructions below.

Download and install gnuplot if you do not have it. You can find the Windows installer in the Files section of the gnuplot web site. The latest version is 5.0.3.

You can find the path of the USERPROFILE directory with:

+
 >> getenv USERPROFILE
+
+

Make an .octaverc file in the USERPROFILE directory with your favorite text editor and set gnuplot_binary, e.g.:

+
 gnuplot_binary 'C:\Program Files (x86)\gnuplot\bin\gnuplot.exe'
+
+

Please do not forget to quote the path name with single quotes (') if it contains white space. gnuplot ver. 5 supports the windows, wxt and qt terminals. In Octave, the windows terminal is the default. If you want to change it to the wxt terminal, execute:

+
>> setenv GNUTERM wxt
+
+

You can, of course, set this in .octaverc as well.

+

Older MinGW ports

+

Octave-4.0.0

+

The easiest way to install GNU Octave on Microsoft Windows is using MXE builds. For the current 4.0.0 release, installers can be found at ftp.gnu.org.

Known issues specific to windows version 4.0.0. +

+
  1. Neither the CLI nor the GUI can handle path names that contain non-ASCII characters. (For 3.8, the CLI can handle non-ASCII characters if the codepage is properly set to the machine locale.)
+

Packages

+

Pre-built octave-forge packages are not prepared for Windows from octave-3.8 onward. You can install some octave-forge packages using the archived sources and build script. However, small flaws exist in the current octave-4.0.0_0 distribution. Before installing, correct the version numbers of the general and signal packages to 2.0.0 and 1.3.2, respectively, and comment out the io package install as "#try_install io-2.2.7.tar.gz" in C:\octave\octave-4.0.0\src\build_packages.m. Then execute:

+
   >> cd C:\octave\octave-4.0.0\src
+   >> build_packages
+   >> pkg install -forge io
+
+

Other octave-forge packages may be installed by +

+
  >> pkg install -forge (package name)
+
+

Detailed instructions for installing Octave-Forge packages are shown at Octave-Forge.

+

Octave-3.8.2

+

The site that provided the previous version of Octave for Windows, ver. 3.8.2 (an unofficial build using mxe-octave), is closed. A mirrored binary can be downloaded at "File list of Octave for Windows".

If you get any problems while running Windows 8, or libstdc++-6.dll errors, try this octave-gui.bat file, placed in your Octave folder (e.g. `C:/octave/octave-3.8.2`):

+
   @echo off
+   set PATH=%CD%\bin\
+   start octave --force-gui -i --line-editing
+   exit
+
+

Octave-3.6.4-mingw + octaveforge pkgs

+

For build instructions before octave-3.8 see Octave_for_MinGW or the Octave-Forge repository [4]. +

+

Files for manual installation

+
  1. Octave-3.6.4-mingw binaries tree +
    • Octave3.6.4_gcc4.6.2_20130329.7z - MD5:46BD238C664E17B4B25E72A11C38163F
    +
    This is a 7z archive which includes a directory tree of all the binaries and libraries required for a complete octave installation (excluding octaveforge packages)
    +
    +
    It can be downloaded from Octave Forge
    +
    +
    • The archive includes:
    +
    • octave-3.6.4 including PDF documentation (built using Tatsuro Matsuka OctaveLibs and gplibs http://www.tatsuromatsuoka.com/octave/Eng/Win/)
    • +
    • mingw32 + msys tool chain
    • +
    • gnuplot-4.6.0
    • +
    • fig2dev-3.2.5c
    • +
    • ghostscript-9.0.7
    • +
    • pstoedit-3.61
    • +
    • OpenBLAS-v2.6.0 and ATLAS-3.8.4 based libblas alternatives
    +
    +
    • Maintainer: Nitzan Arazi
    • +
    • Latest update: 2013-03-29
  2. +
  3. Octaveforge pkgs, built for Octave-3.6.4-mingw +
    • Octave3.6.4_gcc4.6.2_pkgs_20130331.7z - MD5:8AB5F5F88E7267FB1E47BABC29FD7FE0
    +
    +
    It can be downloaded from Octave Forge
    +
    +
    This is a 7z archive which includes additional binaries and libraries for a set of octaveforge packages.
    +
    +
    • The included packages are:
    +
    • actuarial-1.1.0
    • +
    • ad-1.0.6_patched
    • +
    • audio-1.1.4
    • +
    • benchmark-1.1.1
    • +
    • bim-1.1.1
    • +
    • bioinfo-0.1.2
    • +
    • cgi-0.1.0
    • +
    • civil-engineering-1.0.7
    • +
    • combinatorics-2.0.0
    • +
    • communications-1.1.1_patched
    • +
    • control-2.4.2
    • +
    • data-smoothing-1.3.0
    • +
    • dataframe-0.9.1
    • +
    • econometrics-1.1.1
    • +
    • fenv-0.1.0
    • +
    • financial-0.4.0
    • +
    • fits-1.0.2
    • +
    • fpl-1.3.3
    • +
    • fuzzy-logic-toolkit-0.4.2
    • +
    • ga-0.10.0
    • +
    • general-1.3.2
    • +
    • generate_html-0.1.5
    • +
    • geometry-1.6.0
    • +
    • gnuplot-1.0.1
    • +
    • gsl-1.0.8
    • +
    • ident-1.0.7
    • +
    • image-2.0.0
    • +
    • informationtheory-0.1.8
    • +
    • integration-1.0.7_svn20120128
    • +
    • io-1.2.1
    • +
    • irsa-1.0.7
    • +
    • java-1.2.9_patched
    • +
    • linear-algebra-2.2.0
    • +
    • lssa-0.1.2
    • +
    • mapping-1.0.7
    • +
    • mechanics-1.3.1
    • +
    • miscellaneous-1.2.0
    • +
    • missing-functions-1.0.2
    • +
    • msh-1.0.6
    • +
    • multicore-0.2.15
    • +
    • nan-2.5.5
    • +
    • ncarray-1.0.0
    • +
    • nlwing2-1.2.0
    • +
    • nnet-0.1.13
    • +
    • nurbs-1.3.6
    • +
    • ocs-0.1.3_svn20120128_patched
    • +
    • octcdf-1.1.5
    • +
    • octclip-1.0.3
    • +
    • octgpr-1.2.0
    • +
    • octproj-1.1.2
    • +
    • odebvp-1.0.6
    • +
    • odepkg-0.8.4
    • +
    • optim-1.2.2
    • +
    • optiminterp-0.3.4
    • +
    • outliers-0.13.9
    • +
    • physicalconstants-1.0.0
    • +
    • plot-1.1.0
    • +
    • quaternion-2.0.2
    • +
    • queueing-1.2.1
    • +
    • secs1d-0.0.9
    • +
    • secs2d-0.0.8
    • +
    • secs3d-0.0.1
    • +
    • signal-1.2.1
    • +
    • simp-1.1.0
    • +
    • sockets-1.0.8_patched
    • +
    • specfun-1.1.0
    • +
    • special-matrix-1.0.7
    • +
    • splines-1.1.2
    • +
    • statistics-1.2.0
    • +
    • strings-1.1.0_patched
    • +
    • struct-1.0.10
    • +
    • symband-1.0.10
    • +
    • symbolic-1.1.0
    • +
    • tcl-octave-0.1.8
    • +
    • time-2.0.0
    • +
    • tsa-4.2.4
    • +
    • video-1.0.2_patched
    • +
    • vrml-1.0.13_patched
    • +
    • windows-1.2.1
    • +
    • xraylib-1.0.8
    • +
    • zenity-0.5.7
    +
    +
    • Maintainer: Nitzan Arazi
    • +
    • Latest update: 2013-03-31
+

Manual installation instructions

+
  1. Create an installation directory whose path doesn't contain space characters (e.g. C:\Octave\Octave3.6.4_gcc4.6.2\). This directory is referred to hereafter as <your_install_dir>.
  2. +
  3. Extract the complete directories tree from Octave3.6.4_gcc4.6.2_20130329.7z to the installation directory keeping the original directory structure as in the archive (you can use 7-zip tool from http://www.7-zip.org/). Note that the archive contains Octave3.6.4_gcc4.6.2 folder, so you want to extract to the *parent* of <your_install_dir>.
  4. +
  5. Copy octave3.6.4_gcc4.6.2.lnk to any convenient location and edit its properties respectively to point to <your_install_dir>\bin\octave.exe and <your_install_dir>\doc\octave\icons\octave-logo.ico as an icon +
    Note for windows 8 users: As a workaround for a gnulib windows 8 compatibility bug, add command line switches ' -i --line-editing' to the octave.exe shortcut (i.e. <octave-dir>\bin\octave.exe -i --line-editing)
  6. +
  7. Copy octave3.6.4_gcc4.6.2_docs.lnk to any convenient location and edit its properties respectively to point to <your_install_dir>\doc\octave and <your_install_dir>\doc\octave\icons\octave-logo.ico as an icon. +
    At this point you can:
    +
    a. Launch and use octave by double-clicking the copied octave3.6.4_gcc4.6.2.lnk
    +
    b. Access and browse the documentation files by double-clicking the copied octave3.6.4_gcc4.6.2_docs.lnk
  8. +
  9. Optional libblas dll replacement for optimizing the linear algebra subroutines for your CPU: +
    • The default configuration should automatically detect your cpu architecture and select an appropriately tuned library.
    • +
    • In case the default library is not properly functioning on the actual cpu, or you wish to explore the performance with another library, you can manually replace the default one by performing the following procedures (where libblas.dll.<libblas_source> should be replaced with the full name of the desired library from the list below these instructions): +
      1. Exit octave
      2. +
      3. Delete <your_install_dir>\bin\libblas.dll
      4. +
      5. Make a copy of the desired <your_install_dir>\bin\libblas.dll.<libblas_source>
      6. +
      7. Rename the copy of the desired <your_install_dir>\bin\libblas.dll.<libblas_source> to libblas.dll
    • +
    • The following is a list of the available libblas.dll.<libblas_source> options: +
      1. libblas.dll.ref - reference blas implementation, very slow but most stable
      2. +
      3. libblas.dll.OpenBLAS-v2.6.0-0-54e7b37_dynamicarch_nt4 - Openblas based, up to 4 threads, detects cpu architecture and selects respective lib
      4. +
      5. libblas.dll.OpenBLAS-v2.6.0-0-54e7b37_nehalem_nt4 - Openblas based, up to 4 threads, tuned for nehalem cpu architecture
      6. +
      7. libblas.dll.OpenBLAS-v2.6.0-0-54e7b37_core2_nt4 - Openblas based, up to 4 threads, tuned for core2 cpu architecture
      8. +
      9. libblas.dll.OpenBLAS-v2.6.0-0-54e7b37_sandybridge_nt4 - Openblas based, up to 4 threads, tuned for sandybridge cpu architecture
      10. +
      11. libblas.dll.OpenBLAS-v2.6.0-0-54e7b37_atom_nt4 - Openblas based, up to 4 threads, tuned for atom cpu architecture
      12. +
      13. libblas.dll.altas-3.8.4_ht-pentium - ATLAS based libblass, tuned for older ht-pentium (compiled by Tatsuro Matsuka)
      14. +
      15. libblas.dll.altas-3.8.4_corei5 - ATLAS based libblass, tuned for older core i5 cpu (compiled by Tatsuro Matsuka)
+

Manual installation instructions for the Octave-forge packages

+
  1. Extract the complete directories tree from Octave3.6.4_gcc4.6.2_pkgs_20130331.7z to the installation directory (<your_install_dir>) keeping the original directory structure as in the archive (you can use 7zip tool from http://www.7-zip.org/).
  2. +
  3. In order to update octave_packages database with your installation tree and auto-load most packages (excluding 'ad' and 'windows' which may crash octave when loaded and 'clear all' is executed), launch Octave and execute the following five rebuild commands from the octave console:
+
   pkg rebuild -auto
+   pkg rebuild -noauto ad % may crash octave when loaded and 'clear all' is executed
+   pkg rebuild -noauto nan % shadows many statistics functions
+   pkg rebuild -noauto gsl % shadows some core functions
+   pkg rebuild -auto java
+
+
  1. Last pkg rebuild command is required in order for the java pkg entry to be moved to the top of <your_install_dir>\share\octave\octave_packages db file - thus java pkg is loaded before io pkg is loaded, and io pkg related jars are added to java class path.
  2. +
  3. You can optionally adjust your installed packages status per your specific needs and usage by executing the following commands:
+
a. To interactively load or unload a package
+
pkg load <pkg_name>
+
+
or
+
pkg unload <pkg_name>
+
+
b. To disable auto-load for specific pkg <pkg_name>
+
pkg rebuild -noauto <pkg_name>
+
+
c. To enable auto-load for specific pkg <pkg_name>
+
pkg rebuild -auto <pkg_name>
+
+
d. To completely uninstall a package
+
pkg uninstall <pkg_name>
+
+

Optional installation of Notepad++ as an editor (recommended)

+
  1. Download recent Notepad++ installation package from http://notepad-plus-plus.org/ and install it on your system.
  2. +
  3. Edit <your_install_dir>\share\octave\site\m\startup\octaverc and un-comment the line which sets octave default editor:
+
EDITOR('C:\Program Files\Notepad++\notepad++.exe');
+edit ("editor", sprintf ("%s %%s", EDITOR ()))
+edit mode async
+
+
Note: win64 users may use the w32 programs directory:
+
EDITOR('C:\Program Files (x86)\Notepad++\notepad++.exe');
+
+
Note: You may adjust the above line for the location of notepad++.exe as installed on your system.
+

Troubleshooting

+

Upon launching, some warnings may be displayed. These warnings can be ignored. +

+
  • The following warnings are about missing external tools whose absence may reduce the functionality of some packages. These external tools are not provided by the 7z archives on sourceforge.
+
warning: gmsh does not seem to be present some functionalities will be disabled 
+warning: dx does not seem to be present some functionalities will be disabled 
+
+
  • The following warning is about the fstat function of the statistics package, which overloads the old (to be deprecated) fstat function of octave-3.6.4:
+
warning: function C:\Octave\3.6.4_gcc-4.6.2\share\octave\packages\statistics-1.2.0\fstat.m shadows a core library function
+
+

Octave-3.6.2-mingw + octaveforge pkgs

+

Files for manual installation

+
  1. Octave-3.6.2-mingw binaries tree +
    • Octave362_gcc462_20120609.7z - MD5:1FA1F6191C151D527830722F71822312
    +
    This is a 7z archive which includes a directory tree of all the binaries and libraries required for a complete octave installation (excluding octaveforge packages)
    +
    +
    It can be downloaded from Octave Forge
+

Octave-3.6.1-mingw + octaveforge pkgs

+

Files for manual installation

+
  1. Octave-3.6.1-mingw binaries tree +
    • Octave3.6.1_gcc4.6.2_20120303.7z - MD5:294B99B5E4D47CAA83E8940EB2918D10
    +
    This is a 7z archive which includes a directory tree of all the binaries and libraries required for a complete octave installation (excluding octaveforge packages)
    +
    +
    It can be downloaded from Octave Forge
    +
    +
    • The archive includes:
    +
    • octave-3.6.1 including PDF documentation (built by Tatsuro Matsuka http://www.tatsuromatsuoka.com/octave/Eng/Win/)
    • +
    • OpenBLAS-r0.1alpha2.5 and ATLAS-3.8.4 based libblas alternatives
    • +
    • mingw32 + msys tool chain
    • +
    • gnuplot-4.4.4
    • +
    • fig2dev-3.2.5c
    • +
    • ghostscript-9.0.4
    • +
    • pstoedit-3.60
    +
    +
    • Maintainer: Nitzan Arazi
    • +
    • Latest update: 2012-03-03
  2. +
  3. Octaveforge pkgs, built for Octave-3.6.1-mingw +
    • Octave3.6.1_gcc4.6.2_pkgs_20120303.7z - MD5:44A85F26A8925FEC5E1F0856408C9DD5
    +
    +
    It can be downloaded from Octave Forge
    +
    +
    This is a 7z archive which includes additional binaries and libraries for a set of octaveforge packages.
    +
    +
    • The included packages are:
    +
    • actuarial-1.1.0
    • +
    • ad-1.0.6_patched
    • +
    • audio-1.1.4
    • +
    • benchmark-1.1.1
    • +
    • bim-1.0.2
    • +
    • bioinfo-0.1.2
    • +
    • civil-engineering-1.0.7
    • +
    • combinatorics-1.0.9
    • +
    • communications-1.1.0_svn20120127_patched
    • +
    • control-2.2.5
    • +
    • data-smoothing-1.3.0
    • +
    • dataframe-0.9.1
    • +
    • econometrics-1.0.8
    • +
    • fenv-0.1.0
    • +
    • financial-0.3.2
    • +
    • fpl-1.2.0
    • +
    • fuzzy-logic-toolkit-0.3.0
    • +
    • ga-0.9.8
    • +
    • general-1.2.2
    • +
    • generate_html-0.1.3
    • +
    • geometry-1.4.0
    • +
    • gnuplot-1.0.1
    • +
    • gpc-0.1.7
    • +
    • gsl-1.0.8
    • +
    • ident-1.0.7
    • +
    • image-1.0.15
    • +
    • informationtheory-0.1.8
    • +
    • integration-1.0.7_svn20120128
    • +
    • io-1.0.17
    • +
    • irsa-1.0.7
    • +
    • java-1.2.8_patched
    • +
    • linear-algebra-2.1.0_svn20120225
    • +
    • mapping-1.0.7
    • +
    • mechanics-1.2.0
    • +
    • miscellaneous-1.0.11_svn20120127
    • +
    • missing-functions-1.0.2
    • +
    • msh-1.0.2
    • +
    • multicore-0.2.15
    • +
    • nan-2.5.2
    • +
    • nlwing2-1.2.0
    • +
    • nnet-0.1.13
    • +
    • nurbs-1.3.5
    • +
    • ocs-0.1.3_svn20120128_patched
    • +
    • octclip-1.0.0
    • +
    • octgpr-1.2.0
    • +
    • odebvp-1.0.6
    • +
    • odepkg-0.8.0_svn20120127
    • +
    • optim-1.0.17_patched
    • +
    • optiminterp-0.3.4_svn20120128_patched
    • +
    • outliers-0.13.9
    • +
    • physicalconstants-0.1.7
    • +
    • plot-1.1.0
    • +
    • quaternion-1.0.0
    • +
    • queueing-1.0.0
    • +
    • secs1d-0.0.8
    • +
    • secs2d-0.0.8
    • +
    • secs3d-0.0.1
    • +
    • signal-1.1.2
    • +
    • simp-1.1.0
    • +
    • sockets-1.0.7_svn20120128_patched
    • +
    • specfun-1.1.0
    • +
    • special-matrix-1.0.7
    • +
    • spline-gcvspl-1.0.8
    • +
    • splines-1.0.7
    • +
    • statistics-1.1.0_svn20120128
    • +
    • strings-1.0.7
    • +
    • struct-1.0.9
    • +
    • symband-1.0.10
    • +
    • symbolic-1.1.0
    • +
    • tcl-octave-0.1.8
    • +
    • time-1.0.9
    • +
    • tsa-4.1.1
    • +
    • video-1.0.2_patched
    • +
    • vrml-1.0.12_svn20111014_patched
    • +
    • windows-1.1.0
    • +
    • xraylib-1.0.8
    • +
    • zenity-0.5.7
    +
    +
    • Maintainer: Nitzan Arazi
    • +
    • Latest update: 2012-03-03
+

Manual installation instructions

+
  1. Create an installation directory whose path doesn't contain space characters (e.g. C:\Octave\Octave3.6.1_gcc4.6.2\). This directory is referred to hereafter as <your_install_dir>.
  2. +
  3. Extract the complete directories tree from Octave3.6.1_gcc4.6.2_20120303.7z to the installation directory keeping the original directory structure as in the archive (you can use 7-zip tool from http://www.7-zip.org/). Note that the archive contains Octave3.6.1_gcc4.6.2 folder, so you want to extract to the *parent* of <your_install_dir>.
  4. +
  5. Copy octave3.6.1_gcc4.6.2.lnk to any convenient location and edit its properties respectively to point to <your_install_dir>\bin\octave.exe and <your_install_dir>\share\octave\3.6.1\imagelib\octave-logo.ico as an icon
  6. +
  7. Copy octave3.6.1_gcc4.6.2_docs.lnk to any convenient location and edit its properties respectively to point to <your_install_dir>\doc\octave and <your_install_dir>\share\octave\3.6.1\imagelib\octave-logo.ico as an icon. +
    At this point you can:
    +
    a. Launch and use octave by double-clicking the copied octave3.6.1_gcc4.6.2.lnk
    +
    b. Access and browse the documentation files by double-clicking the copied octave3.6.1_gcc4.6.2_docs.lnk
  8. +
  9. Optional libblas dll replacement for optimizing the linear algebra subroutines for your CPU: +
    • The default configuration should automatically detect your cpu architecture and select an appropriately tuned library.
    • +
    • In case the default library is not properly functioning on the actual cpu, or you wish to explore the performance with another library, you can manually replace the default one by performing the following procedures (where libblas.dll.<libblas_source> should be replaced with the full name of the desired library from the list below these instructions): +
      1. Exit octave
      2. +
      3. Delete <your_install_dir>\bin\libblas.dll
      4. +
      5. Make a copy of the desired <your_install_dir>\bin\libblas.dll.<libblas_source>
      6. +
      7. Rename the copy of the desired <your_install_dir>\bin\libblas.dll.<libblas_source> to libblas.dll
    • +
    • The following is a list of the available libblas.dll.<libblas_source> options: +
      1. libblas.dll.ref - reference blas implementation, very slow but most stable
      2. +
      3. libblas.dll.libopenblas_dynamicarch-r0.1alpha2.5-0-fda39c6 - Openblas based, up to 2 threads, detects cpu architecture and selects respective lib
      4. +
      5. libblas.dll.libopenblas_dynamicarch_nt4-r0.1alpha2.5-0-fda39c6 - Openblas based, up to 4 threads, detects cpu architecture and selects respective lib
      6. +
      7. libblas.dll.libopenblas_nehalemp-r0.1alpha2.5-0-fda39c6 - Openblas based, up to 2 threads, tuned for nehalem cpu architecture
      8. +
      9. libblas.dll.libopenblas_nehalemp_nt4-r0.1alpha2.5-0-fda39c6 - Openblas based, up to 4 threads, tuned for nehalem cpu architecture
      10. +
      11. libblas.dll.libopenblas_core2p-r0.1alpha2.5-0-fda39c6 - Openblas based, up to 2 threads, tuned for core2 cpu architecture
      12. +
      13. libblas.dll.libopenblas_core2p_nt4-r0.1alpha2.5-0-fda39c6 - Openblas based, up to 4 threads, tuned for core2 cpu architecture
      14. +
      15. libblas.dll.altas-3.8.4_ht-pentium - ATLAS based libblass, tuned for older ht-pentium (compiled by Tatsuro Matsuka)
      16. +
      17. libblas.dll.altas-3.8.4_corei5 - ATLAS based libblass, tuned for older core i5 cpu (compiled by Tatsuro Matsuka)
+

Manual installation instructions for the Octave-forge packages

+
  1. Extract the complete directories tree from Octave3.6.1_gcc4.6.2_pkgs_20120303.7z to the installation directory (<your_install_dir>) keeping the original directory structure as in the archive (you can use 7zip tool from http://www.7-zip.org/).
  2. +
  3. In order to update the octave_packages database with your installation tree and auto-load most packages (excluding 'ad' and 'windows', which may crash octave when loaded and 'clear all' is executed), launch Octave and execute the following rebuild commands from the octave console:
+
   pkg rebuild -auto
+   pkg rebuild -noauto ad windows
+   pkg rebuild -noauto nan % shadows many statistics functions
+   pkg rebuild -noauto gsl % shadows some core functions
+   pkg rebuild -auto java
+
+
  1. Last pkg rebuild command is required in order for the java pkg entry to be moved to the top of <your_install_dir>\share\octave\octave_packages db file - thus java pkg is loaded before io pkg is loaded, and io pkg related jars are added to java class path.
  2. +
  3. You can optionally adjust your installed packages status per your specific needs and usage by executing the following commands:
+
a. To interactively load or unload a package
+
pkg load <pkg_name>
+
+
or
+
pkg unload <pkg_name>
+
+
b. To disable auto-load for specific pkg <pkg_name>
+
pkg rebuild -noauto <pkg_name>
+
+
c. To enable auto-load for specific pkg <pkg_name>
+
pkg rebuild -auto <pkg_name>
+
+
d. To completely uninstall a package
+
pkg uninstall <pkg_name>
+
+

Optional installation of Notepad++ as an editor (recommended)

+
  1. Download recent Notepad++ installation package from http://notepad-plus-plus.org/ and install it on your system.
  2. +
  3. Edit <your_install_dir>\share\octave\site\m\startup\octaverc and un-comment the line which sets octave default editor:
+
EDITOR('C:\Program Files\Notepad++\notepad++.exe');
+edit ("editor", sprintf ("%s %%s", EDITOR ()))
+edit mode async
+
+
Note: You may adjust the above line for the location of notepad++.exe as installed on your system.
+

Troubleshooting

+

Upon launching, some warnings may be displayed. These warnings can be ignored. +

+
  • The following warnings are about missing external tools whose absence may reduce the functionality of some packages. These external tools are not provided by the 7z archives on sourceforge.
+
warning: gmsh does not seem to be present some functionalities will be disabled 
+warning: dx does not seem to be present some functionalities will be disabled 
+
+
  • The following warning is about the fstat function of the statistics package, which overloads the old (to be deprecated) fstat function of octave-3.6.1:
+
warning: function C:\Octave\3.6.1_gcc-4.6.2\share\octave\packages\statistics-1.1.0\fstat.m shadows a core library function
+
+

Octave-3.6.0-mingw + octaveforge pkgs

+

Files for manual installation

+
  1. Octave-3.6.0-mingw binaries tree +
    • Octave3.6.0_gcc4.6.2_20120129.7z - MD5:53E4823B0DC5F2923C4CBCB8B60FC1B6
    +
    This is a 7z archive which includes a directory tree of all the binaries and libraries required for a complete octave installation (excluding octaveforge packages)
    +
    +
    It can be downloaded from octave forge
    +
    +
    • The archive includes:
    +
    +
    +
    • Maintainer: Nitzan Arazi
    • +
    • Latest update: 2012-01-29
  2. +
  3. Octaveforge pkgs, built for Octave-3.6.0-mingw +
    • Octave3.6.0_gcc4.6.2_pkgs_20120128.7z - MD5:93CC6207EED411BCE747193D3A8B6625
    +
    +
    It can be downloaded from octave forge
    +
    +
    This is a 7z archive which includes additional binaries and libraries for a set of octaveforge packages.
    +
    +
    • The included packages are:
    +
    • actuarial-1.1.0
    • +
    • ad-1.0.6_patched
    • +
    • audio-1.1.4
    • +
    • benchmark-1.1.1
    • +
    • bim-1.0.2
    • +
    • bioinfo-0.1.2
    • +
    • civil-engineering-1.0.7
    • +
    • combinatorics-1.0.9
    • +
    • communications-1.1.0_svn20120127_patched
    • +
    • control-2.2.4
    • +
    • data-smoothing-1.2.3
    • +
    • dataframe-0.8.2
    • +
    • econometrics-1.0.8
    • +
    • fenv-0.1.0
    • +
    • financial-0.3.2
    • +
    • fpl-1.2.0
    • +
    • fuzzy-logic-toolkit-0.3.0
    • +
    • ga-0.9.8
    • +
    • general-1.2.2
    • +
    • generate_html-0.1.3
    • +
    • geometry-1.4.0
    • +
    • gnuplot-1.0.1
    • +
    • gpc-0.1.7
    • +
    • gsl-1.0.8
    • +
    • ident-1.0.7
    • +
    • image-1.0.15
    • +
    • informationtheory-0.1.8
    • +
    • integration-1.0.7_svn20120128
    • +
    • io-1.0.16
    • +
    • irsa-1.0.7
    • +
    • java-1.2.8_patched
    • +
    • linear-algebra-2.1.0_svn20120127
    • +
    • mapping-1.0.7
    • +
    • mechanics-1.2.0
    • +
    • miscellaneous-1.0.11_svn20120127
    • +
    • missing-functions-1.0.2
    • +
    • msh-1.0.2
    • +
    • multicore-0.2.15
    • +
    • nlwing2-1.2.0
    • +
    • nnet-0.1.13
    • +
    • nurbs-1.3.5
    • +
    • ocs-0.1.3_svn20120128_patched
    • +
    • octclip-1.0.0
    • +
    • octgpr-1.2.0
    • +
    • odebvp-1.0.6
    • +
    • odepkg-0.8.0_svn20120127
    • +
    • optim-1.0.17_patched
    • +
    • optiminterp-0.3.4_svn20120128_patched
    • +
    • outliers-0.13.9
    • +
    • physicalconstants-0.1.7
    • +
    • plot-1.1.0
    • +
    • quaternion-1.0.0
    • +
    • secs1d-0.0.8
    • +
    • secs2d-0.0.8
    • +
    • secs3d-0.0.1
    • +
    • signal-1.1.2
    • +
    • simp-1.1.0
    • +
    • sockets-1.0.7_svn20120128_patched
    • +
    • specfun-1.1.0
    • +
    • special-matrix-1.0.7
    • +
    • spline-gcvspl-1.0.8
    • +
    • splines-1.0.7
    • +
    • statistics-1.1.0_svn20120128
    • +
    • strings-1.0.7
    • +
    • struct-1.0.9
    • +
    • symband-1.0.10
    • +
    • symbolic-1.1.0
    • +
    • tcl-octave-0.1.8
    • +
    • time-1.0.9
    • +
    • tsa-4.1.1
    • +
    • video-1.0.2_patched
    • +
    • vrml-1.0.12_svn20111014_patched
    • +
    • windows-1.1.0
    • +
    • xraylib-1.0.8
    • +
    • zenity-0.5.7
    +
    +
    • Maintainer: Nitzan Arazi
    • +
    • Latest update: 2012-01-28
+

Manual installation instructions

+
  1. Create an installation directory whose path doesn't contain space characters (e.g. C:\Octave\Octave3.6.0_gcc4.6.2\). This directory is referred to hereafter as <your_install_dir>.
  2. +
  3. Extract the complete directories tree from Octave3.6.0_gcc4.6.2_20120129.7z to the installation directory keeping the original directory structure as in the archive (you can use 7-zip tool from http://www.7-zip.org/).
  4. +
  5. Copy octave3.6.0_gcc4.6.2.lnk to any convenient location and edit its properties respectively to point to <your_install_dir>\bin\octave.exe and <your_install_dir>\share\octave\3.6.0\imagelib\octave-logo.ico as an icon
  6. +
  7. Copy octave3.6.0_gcc4.6.2_docs.lnk to any convenient location and edit its properties respectively to point to <your_install_dir>\doc\octave and <your_install_dir>\share\octave\3.6.0\imagelib\octave-logo.ico as an icon. +
    At this point you can:
    +
    a. Launch and use octave by double-clicking the copied octave3.6.0_gcc4.6.2.lnk
    +
    b. Access and browse the documentation files by double-clicking the copied octave3.6.0_gcc4.6.2_docs.lnk
+

Manual installation instructions for the Octave-forge packages

  1. Extract the complete directory tree from Octave3.6.0_gcc4.6.2_pkgs_20120128.7z to the installation directory (<your_install_dir>), keeping the original directory structure as in the archive (you can use the 7-zip tool from http://www.7-zip.org/).
  2. To update the octave_packages database with your installation tree and auto-load most packages (excluding 'ad' and 'windows', which may crash octave when loaded and 'clear all' is executed), launch Octave and execute the following 3 rebuild commands from the octave console:

   pkg rebuild -auto
   pkg rebuild -noauto ad windows
   pkg rebuild -auto java

  3. The last pkg rebuild command is required so that the java pkg entry is moved to the top of the <your_install_dir>\share\octave\octave_packages db file - thus the java pkg is loaded before the io pkg, and the io pkg related jars are added to the java class path.
  4. You can optionally adjust the status of your installed packages to your specific needs and usage by executing the following commands:
a. To interactively load or unload a package:

pkg load <pkg_name>

or

pkg unload <pkg_name>

b. To disable auto-load for a specific pkg <pkg_name>:

pkg rebuild -noauto <pkg_name>

c. To enable auto-load for a specific pkg <pkg_name>:

pkg rebuild -auto <pkg_name>

d. To completely uninstall a package:

pkg uninstall <pkg_name>
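For example, a quick way to review the resulting package status from the octave console (a minimal sketch; 'statistics' is just an example package name):

pkg list                  % lists installed packages; loaded ones are marked with '*'
pkg load statistics       % load the example package for the current session
pkg describe statistics   % show the package's description and provided functions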
+

Optional installation of Notepad++ as an editor (recommended)

  1. Download a recent Notepad++ installation package from http://notepad-plus-plus.org/ and install it on your system.
  2. Edit <your_install_dir>\share\octave\site\m\startup\octaverc and un-comment the lines which set octave's default editor:

EDITOR('C:\Program Files\Notepad++\notepad++.exe');
edit ("editor", sprintf ("%s %%s", EDITOR ()))
edit mode async

Note: You may adjust the first line for the location of notepad++.exe as installed on your system.
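To confirm the editor wiring works, a quick check from the octave console (myscript.m is just an example file name):

edit myscript.m    % should open the file in Notepad++ without blocking the octave console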

Troubleshooting

Upon launching, some warnings may be displayed. These warnings can be ignored.

  • The following warnings are about missing external tools, whose absence may reduce the functionality of some packages. These external tools are not provided by the 7z archives on sourceforge.

warning: gmsh does not seem to be present some functionalities will be disabled 
warning: dx does not seem to be present some functionalities will be disabled 

  • The following warning is about the fstat function of the statistics package, which overloads the old (to be deprecated) fstat function of octave-3.6.0:

warning: function C:\Octave\3.6.0_gcc-4.6.2\share\octave\packages\statistics-1.1.0\fstat.m shadows a core library function

Octave-3.4.3-mingw + octaveforge pkgs

  1. Octave-3.4.3-mingw (without pkgs)
    Octave3.4.3_gcc4.5.2_20111025.7z - MD5:5AA004D933E000E762AE2AE95573ACBD - http://www.multiupload.com/KDQ1N463UW
  2. Octaveforge pkgs, built for Octave-3.4.3-mingw
    Octave3.4.3_gcc4.5.2_pkgs_20111026.7z - MD5:2987F6078B4AD161F2D23634D5109D61 - http://www.multiupload.com/7U6J23CSZ6

The above archive files can now also be downloaded from octave forge.

Troubleshooting

Upon launching, some warnings may be displayed. The following warnings can be ignored:

  • The following warning is about the interpretation of logical operators (on scalars) in octave, which is slightly different from matlab's interpretation.

warning: C:\Octave\3.4.3_gcc-4.5.2\share\octave\packages\integration-1.0.7\PKG_ADD: possible Matlab-style short-circuit operator
at line 9, column 32 

  • The following messages are from the java package, reporting that java classes have been found and how to manually run a statement which will display its capabilities.

io PKG_ADD: java classes has been found and added in C:\Octave\3.4.3_gcc-4.5.2\bin 
io PKG_ADD: run chk_spreadsheet_support([],3) to view io support 

  • The following warnings are about missing external tools, whose absence may reduce the functionality of some packages. These external tools are not provided by the 7z archives on sourceforge.

warning: gmsh does not seem to be present some functionalities will be disabled 
warning: dx does not seem to be present some functionalities will be disabled 

  • The following warning is about the fstat function of the statistics package, which overloads the old (to be deprecated) fstat function of octave-3.4.3:

warning: function C:\Octave\3.4.3_gcc-4.5.2\share\octave\packages\statistics-1.0.10\fstat.m shadows a core library function

Octave-3.4.2-mingw + octaveforge pkgs

  1. Octave-3.4.2-mingw (without pkgs)
    Octave3.4.2_gcc4.5.2_20110914.7z - MD5:4AA0DD4C97F73B2E9E0F7370CD8AD719 - http://www.multiupload.com/TCUHKNNH9S
  2. Octaveforge pkgs, built for Octave-3.4.2-mingw
    Octave3.4.2_gcc4.5.2_pkgs_20111014.7z - MD5:49097AF3C6FC6CDB58EE83F510A50993 - http://www.multiupload.com/DCWFZOUGZA

Installation

The installation instructions are the same as for the 3.4.3 version, above.

+

Notes

For details, please see http://old.nabble.com/Octave-3.4.2-mingw-%2B-octaveforge-pkgs-to32394771.html

Upon launching, some warnings may be displayed. The following warnings can be ignored:

  • The following warning is about the interpretation of logical operators (on scalars) in octave, which is slightly different from matlab's interpretation.

warning: C:\Octave\Octave3.4.2_gcc4.5.2\share\octave\packages\integration-1.0.7\PKG_ADD: possible Matlab-style 
short-circuit operator at line 9, column 32

  • The following warnings are about missing external tools, whose absence may reduce the functionality of some packages. These external tools are not provided by the 7z archives on sourceforge.

warning: gmsh does not seem to be present some functionalities will be disabled
warning: dx does not seem to be present some functionalities will be disabled

  • The following warning is about the fstat function of the statistics package, which overloads the old (to be deprecated) fstat function of octave-3.4.2:

warning: function C:\Octave\Octave3.4.2_gcc4.5.2\share\octave\packages\statistics-1.0.10\fstat.m shadows a core library


Octave 3.2.4 for Windows MinGW32

Includes

  • GNU Octave, version 3.2.4 (i686-pc-mingw32)
  • atlas 3.8.2
  • mingw32 (GCC 4.4.0 on http://www.mingw.org )
  • gnuplot Version 4.4.0 specially prepared for octave
  • mini-MSYS 1.0.11
  • notepad++ 5.6.7 as text editor
  • Some components of octave-forge packages
    • actuarial-1.1.0 (New!)
    • audio-1.1.4
    • benchmark-1.1.1
    • bim-1.0.0 (New!)
    • bioinfo-0.1.2
    • combinatorics-1.0.9
    • communications-1.0.10
    • control-1.0.11
    • data-smoothing-1.2.0
    • econometrics-1.0.8
    • fenv-0.1.0 (New!)
    • financial-0.3.2
    • fixed-0.7.10
    • fpl-1.0.0 (New!)
    • ga-0.9.7
    • general-1.2.0 (updated)
    • generate_html-0.1.2 (New!)
    • gnuplot-1.0.1 (New!)
    • gpc-0.1.7
    • gsl-1.0.8
    • ident-1.0.7
    • image-1.0.10
    • informationtheory-0.1.8
    • integration-1.0.7
    • io-1.0.11 (updated)
    • irsa-1.0.7
    • java-1.2.7 (New!)
    • jhandles-0.3.5 (New!)
    • linear-algebra-1.0.8
    • mapping-1.0.7
    • miscellaneous-1.0.9
    • missing-functions-1.0.2
    • msh-1.0.0 (New!)
    • nlwing2-1.1.1 (New!)
    • nnet-0.1.10
    • nurbs-1.0.3 (New!)
    • ocs-0.0.4 (New!)
    • oct2mat-1.0.7 (New!)
    • octcdf-1.0.17 (updated 1.0.17+)
    • octgpr-1.1.5 (New!)
    • odebvp-1.0.6
    • odepkg-0.6.10 (updated)
    • optim-1.0.12 (updated)
    • optiminterp-0.3.2
    • outliers-0.13.9
    • physicalconstants-0.1.7
    • plot-1.0.7
    • quaternion-1.0.0
    • signal-1.0.10
    • simp-1.1.0 (New!)
    • sockets-1.0.5
    • specfun-1.0.8
    • special-matrix-1.0.7
    • spline-gcvspl-1.0.8 (New!)
    • splines-1.0.7
    • statistics-1.0.9
    • strings-1.0.7
    • struct-1.0.7
    • symband-1.0.10 (New!)
    • symbolic-1.0.9
    • time-1.0.9
    • video-1.0.2 (New!)
    • windows-1.0.8 (updated to 1.0.8+)
    • zenity-0.5.7

Notes

  • Although there are some remaining known issues, some bugs reported against octave-3.2.3 have been corrected. In addition, useful octave-forge packages have been added (Java, Jhandles, ....). Please see RELEASE_NOTES.txt for details.
  • The default Octave install folder has changed, e.g. to C:\Octave\3.2.4_gcc-4.4.0\.
    If you have installed octave in a folder whose path name has whitespace, for example C:\Program Files\, the 'pkg install (package name)' command will fail: see http://sourceforge.net/mailarchive/message.php?msg_name=4A1AF9EF.1000005@hotmail.com for details

Additional important topics found after the release:

  • If you do not want the oct2mat package to be loaded automatically, execute

     pkg rebuild -noauto oct2mat

    at the octave prompt and then restart octave. This results in the oct2mat package not being auto-loaded at startup. When you want to use oct2mat, execute

     pkg load oct2mat

  • The plot octave-forge package still contains ginput code, although the ginput function has now been merged into octave itself. Therefore a conflict occurs if the plot package is installed. To avoid this problem, rename 'ginput.m' in the folder ..\Octave\3.2.4_gcc-4.4.0\share\octave\packages\plot-1.0.7, for example to ginput.ob.m. On some computers with a single-core CPU, the response of ginput is very slow. In that case, modify '__gnuplot_ginput__.m' according to the following thread: http://old.nabble.com/ginput-on-Octave-3.2.4-mingw32-to28093888.html
  • As of gnuplot-4.4.0, the default terminal of gnuplot for windows is the wxt terminal. Some users may set the GNUTERM environment variable to make the windows terminal the default. The gnuplot for windows allows setting GNUTERM to 'win' (abbreviated form), but octave does not recognize the abbreviated form of the terminal name. If one would like to set GNUTERM to the windows terminal, one should specify it as 'windows' (full form), not 'win' (abbreviated form). For details see the following thread: http://old.nabble.com/flicking-problem-again-Octave-3.2.4-mingw32-td28038688.html
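For example, the variable can be set from within octave before plotting (a minimal sketch of the full-form setting described above):

setenv ("GNUTERM", "windows")    % full form; octave does not recognize the abbreviated 'win'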

Older Octave versions with Visual Studio

Octave binaries compiled with Microsoft Visual Studio are available for download from the Octave-Forge site. These binaries come in the form of an easy-to-use installer (created with NSIS) and are provided in 2 flavors: a pre-compiled version for Visual Studio 2008 and one for Visual Studio 2010.

These binaries do not include the Microsoft Visual C++ compiler. This must be installed separately, but is only required if you plan to compile and link source code against the pre-compiled octave release. If the Visual C++ compiler is not present on the target system, then the Visual C++ runtime libraries must be installed prior to the installation of these binaries. These runtime libraries are support libraries that are required by any code compiled with the Visual C++ compiler. They can be downloaded for free from the Microsoft download site:

Note that if you have already installed other software on your system, there is a possibility that these runtime libraries are already present. Search for files named msvcr90.dll (Visual Studio 2008) or msvcr100.dll (Visual Studio 2010) in the %WINDIR% directory (usually C:\WINDOWS).
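A quick way to check from a Windows command prompt (a sketch; dir /s /b searches subdirectories and prints bare paths):

dir /s /b %WINDIR%\msvcr90.dll
dir /s /b %WINDIR%\msvcr100.dll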

+

Installation

The pre-compiled versions for Visual Studio come in the form of a self-installing executable. Simply download the executable, run it, and follow the installer instructions. To avoid possible problems with white spaces in the octave paths, it is strongly recommended to install octave in a directory that does not contain any white spaces.

Octave-Forge packages are not installed by default. To install packages, expand the section "Octave Forge" in the component selection page of the installer and select the packages you wish to install. Note that installed packages are not loaded by default. To use the packages, you still need to load them into octave.

+

Printing (installing Ghostscript)

In order to use the print command, ghostscript must be installed. The installer may be obtained at sourceforge.

The instructions below assume the GPL version of Ghostscript is installed with the Destination directory C:\Program Files (x86)\GPLGS\. The Destination directory may be different for 32 bit and 64 bit windows and can also change for different versions of Ghostscript. Therefore, it is important that the user make note of the Destination directory used to install Ghostscript and use it in place of the Destination directory used in these instructions.

In order for Octave to find Ghostscript, the directory containing Ghostscript's command line program must be in the command shell's path. The name of Ghostscript's command line program may vary. Some examples are gswin32c.exe, gswin64c.exe, gs.exe, and mgs.exe. The directory containing Ghostscript's command line program may either be added to the command shell's path using the Windows Control Panel, or Octave may modify the path variable itself to include the directory where Ghostscript's command line program resides.

For the latter, the following lines may be placed in the ~/.octaverc file (where ~ indicates the user's home folder). The variable gs_path should be set to the Destination directory where Ghostscript was installed.

 cmd_path = getenv ("path");
 gs_path = 'C:\Program Files (x86)\GPLGS\';
 if (isempty (strfind (cmd_path, gs_path)))
   setenv ('path', strcat (cmd_path, pathsep (), gs_path));
 endif

In this case, the value of gs_path has been set to the location of Ghostscript's command line program for the GPL 8.15 version of Ghostscript. The location for other versions may differ. Please determine the location of the installed Ghostscript command line program and make the needed adjustments to these instructions.
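Once octave has been restarted, a quick check that the path adjustment took effect (a sketch; gswin32c.exe is the example program name used above):

 strfind (getenv ("path"), "GPLGS")   % non-empty if the Ghostscript directory was appended
 system ("gswin32c.exe --version")    % should print the installed Ghostscript version number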

To set the path via the Control Panel,

  • Go to Control Panel --> System and Security --> System
  • Click Advanced System Settings
  • Click Environment Variables
  • In the System Variables area, locate the Path variable, highlight it and click Edit.
  • Add the Destination directory where Ghostscript is installed and confirm the change by clicking OK, OK, OK.

If the 64 bit version of Ghostscript is installed, Octave will not automatically detect it. To use the 64 bit version, an option telling Octave about it must be passed to the print command. For example, to produce PDF output for a figure using the 64 bit version of Ghostscript, the command below may be used.

 print -Ggswin64c.exe figure.pdf

At this point most of Octave's printing functionality should work. When output is produced using the print command, the warnings below will be given.

 warning: print.m: epstool binary is not available.
 Some output formats are not available.
 warning: print.m: fig2dev binary is not available.
 Some output formats are not available.
 warning: print.m: pstoedit binary is not available.
 Some output formats are not available.

For the print command to be fully functional, each of these utilities will also need to be installed, and their locations added to the Path variable via either the Control Panel or Octave's ~/.octaverc file.

+

Using the Visual C++ compiler with Octave

As of version 3.6.1, the Microsoft Visual C++ compiler is not automatically detected. If you need to use it from octave (for instance to compile a MEX or OCT file, or to build packages), then you must configure your system by updating the appropriate environment variables: %PATH%, %INCLUDE% and %LIB%. One way to achieve this easily is to call the vcvarsall.bat script (from the Visual C++ installation directory) prior to executing octave. You can for instance automate this by creating a batch script with the following content (adapt paths to your actual installation):

call "C:\Program Files\Microsoft Visual Studio 9.0\VC\vcvarsall.bat"
"C:\Octave-3.6.1\bin\octave.exe"
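A quick check that the environment was picked up (a sketch; cl.exe prints its banner when the Visual C++ paths are active):

system ("cl")    % run from the octave prompt started by the batch script above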

Octave 3.6.4

Download

Octave-Forge

Content

  • Octave 3.6.4
  • OpenBLAS-0.2.2 (dynamic architectures, up to 4 threads)
  • ATLAS 3.8.4 single-threaded and multi-threaded (2 threads)
  • All required libraries
  • Gnuplot 4.4.4
  • FLTK
  • 82 packages from Octave-Forge (must be installed through the Octave installer, see README: 3. Content)

Octave 3.6.2

Download

Octave-Forge

Content

  • Octave 3.6.2
  • OpenBLAS-0.1.1 (dynamic architectures, up to 4 threads)
  • ATLAS 3.8.4 single-threaded and multi-threaded (2 threads)
  • All required libraries
  • QtHandles
  • Octave GUI (experimental, compiled from development sources)
  • Gnuplot 4.4.4
  • 72 packages from Octave-Forge (see README: 3. Content)

Octave 3.6.1

Download

Octave-Forge

Content

  • Octave 3.6.1
  • ATLAS 3.8.4 single-threaded (SSE/SSE2/SSE3) and multi-threaded (SSE3, 2 threads)
  • All required libraries
  • QtHandles
  • Octave GUI (experimental, compiled from development sources)
  • Gnuplot 4.4.4
  • 72 packages from Octave-Forge


+

+

Alternative

In addition to the instructions provided in the Octave manual and the Octave-Forge repository, a basic toolkit for building Octave on windows using the MSVC compiler has been produced by Michael Goffioul. It consists of a set of scripts that can be used to compile Octave and its dependencies.

A pre-compiled (with VS2010) version of everything has also been provided, so it is not necessary to recompile everything from scratch. The files can be found at: http://dl.dropbox.com/u/45539519/octave-build2.zip and http://dl.dropbox.com/u/45539519/VC10Libs.zip

Note that this is not an enterprise-level SDK, so don't try to start an enterprise with it.

diff --git a/ML_Mathematical_Approach/30_installation-issues/01__msg00110.html b/ML_Mathematical_Approach/30_installation-issues/01__msg00110.html new file mode 100644 index 0000000..2ec876a --- /dev/null +++ b/ML_Mathematical_Approach/30_installation-issues/01__msg00110.html @@ -0,0 +1,205 @@
[Octave-bug-tracker] [bug #35769] [Windows] Segmentation fault on subplot

From: Philip Nienhuis
Subject: [Octave-bug-tracker] [bug #35769] [Windows] Segmentation fault on subplot
Date: Sun, 11 Mar 2012 16:30:57 +0000
User-agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.11) Gecko/20100701 SeaMonkey/2.0.6

Update of bug #35769 (project octave):

                  Status:                    None => Works For Me           
             Open/Closed:                    Open => Closed                 

    _______________________________________________________

Follow-up Comment #4:

I suggest you'd better experiment with the various libblas.dll versions found
in /bin. See the README which is in the octave-3.6.1-.......7z archive.

E.g., on my Windows boxes a simple

  demo plotyy

can reliably crash Octave, depending on which libblas.dll is in place.

(Admittedly this libblas.dll affair is a little sneaky.... the effects
(crashes) can pop up in many unexpected situations. Sorry for that.)

The latest Octave-3.6.1 MinGW version (dated March 3, 2012) is much more
stable, yet offers a few more libblas.dll versions to try (9) than the earlier
3.6.1 one (just 5 or so IIRC).

On my box your example works OK (I tried it several times) with MinGW Octave
3.6.1, both fltk & gnuplot. 
On MSVC it works OK as well (gnuplot, fltk and qt).

I'll close this bug with a "Works for me". 
It can always be reopened if you report that your example code crashes with
all 9 (nine) libblas.dll versions in /bin in the latest octave-3.6.1-MinGW
drop.

    _______________________________________________________

Reply to this item at:

  <http://savannah.gnu.org/bugs/?35769>

_______________________________________________
  Message sent via/by Savannah
  http://savannah.gnu.org/
diff --git a/ML_Mathematical_Approach/30_installation-issues/01__resources.html b/ML_Mathematical_Approach/30_installation-issues/01__resources.html new file mode 100644 index 0000000..48a4b48 --- /dev/null +++ b/ML_Mathematical_Approach/30_installation-issues/01__resources.html @@ -0,0 +1,380 @@

General

Installation files for all platforms are available at the GNU Octave Repository on SourceForge.

The Gnu Octave Wiki has installation instructions for Windows and Mac OS X.

The present instance of the course was tested using Octave 3.8.2. Earlier versions of Octave are not guaranteed to work. Very early versions are known to NOT work with the submit.m script.

Windows

Version 4.0.1 (and later...)

If you don't want to use MATLAB, then Octave 4.0.1 (or later) is recommended.

Version 4.0.0

Octave 4.0.0 has a bug in the printf() function, which makes the submit script throw a very troubling error. Octave 4.0.0 is not recommended.

Version 3.8.2

The course materials were tested using Octave 3.8.2. This is the oldest version of Octave which will work with the present version of the course (as of June 2015). Earlier versions of Octave are not guaranteed to work, and some very early versions (such as 3.2.4, which Prof Ng uses in the video lectures) will not work at all.

Historical information on previous Octave versions...

Version 3.6.4

Version 3.6.2

Version 3.6.1

If you have the wrong version of libblas.dll installed, this will generally crash Octave. The libblas.dll.ref version is a slow but stable version.
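A sketch of swapping in the reference BLAS from a Windows command prompt, with Octave closed (the installation path below is just an example; the exact file names are described in the archive's README):

cd C:\Octave\Octave3.6.1\bin
copy libblas.dll libblas.dll.bak
copy libblas.dll.ref libblas.dll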

Version 3.2.4

Mac OS X

Version 3.8.1

If you are not afraid of the Terminal you can use Homebrew. This is by far the easiest method of installing everything that is needed. In fact, this may be the best option for versions of OSX prior to 10.9 Mavericks, because the Octave wiki states that the binary installer is to be used "at your own risk" and is "not guaranteed to work" with anything other than OSX 10.9 Mavericks.

Let's install Homebrew as the easiest solution. Open Terminal and type the installation command below, which is also listed on the Homebrew home page, and then hit enter:

ruby -e "$(curl -fsSL https://raw.githubusercontent.com/Homebrew/install/master/install)"

You will need a set of command line tools that are a subset of XCode. If you're comfortable with the Terminal, then use the command below to install just what is needed. Otherwise (or if you have issues), you need to install XCode from the App Store, go to the XCode preferences -> Downloads/Components section, and then select Command Line Tools to be installed.

xcode-select --install

Let's install an up-to-date gcc compiler:

brew install gcc

You will also need to install MacTeX. This can be done with Homebrew, as seen below using the Homebrew Cask project, or by downloading and installing from the website ( http://www.tug.org/mactex/ ).

brew install caskroom/cask/brew-cask
brew cask install mactex

Now let's install the plotting software, gnuplot, and have it use native Qt graphics (instead of the older X11 setup):

brew install gnuplot --with-qt

Finally, let's install Octave (note: all build issues have been fixed as of Nov 4th, 2014):

brew tap homebrew/science
brew install octave

And tell Octave to use Qt graphics with gnuplot when plotting:

echo "setenv GNUTERM qt;" >> ~/.octaverc

Now restart your Terminal and launch octave:

octave

Also, to avoid having to constantly change directories after starting octave, the following script will load octave with the Terminal's current working directory automatically set in octave. Let's set it up:

echo '#!/bin/bash' >> /usr/local/bin/oct
echo 'octave -p "`pwd`"' >> /usr/local/bin/oct
chmod +x /usr/local/bin/oct

And to start octave using this script instead (for example when you are working in a project directory and want to load octave with the directory already set):

oct

Known Issues: Executing GNUPlot gives an error message on Octave (v 3.4.0) with X11 installed on OS 10.6.8 (Snow Leopard). For example:

octave-3.4.0:10> hist(w)
warning: broken pipe -- some output may be lost
"Reason: Incompatible library version: libfontconfig.1.dylib requires version 14.0.0 or later, but libfreetype.6.dylib provides version 13.0.0"

Fix: Open a terminal window and type the following 3 lines:

cd /Applications/Gnuplot.app/Contents/Resources/lib
mv libfreetype.6.dylib libfreetype.6.dylib.bak
ln -s /usr/X11/lib/libfreetype.6.dylib .

(Source and explanation: http://stackoverflow.com/questions/19932161/incompatible-library-version-libfontconfig-1-dylib-13-instead-of-15 )

+

Linux

Ubuntu

Ubuntu Software Center

Just search for GNU Octave in the Ubuntu Software Center and click install. When the installation finishes, you're ready to use Octave.

+

Command Line

If you prefer using the command line, or if you have an Ubuntu-based version of Linux that comes without the Ubuntu Software Center, you can install Octave by typing this command in a terminal:

sudo apt-get install octave

To update to a newer version (Octave 4.0.0 does not work), try these commands:

sudo apt-add-repository ppa:octave/stable
sudo apt-get update
sudo apt-get upgrade octave

If you get an error "E: Package 'octave3.2' has no installation candidate", then follow these steps:

sudo apt-add-repository -y ppa:mtmiller/octave
sudo apt-get update
sudo apt-get install octave

This thread may be helpful with your Ubuntu installation: http://askubuntu.com/questions/194151/how-do-you-install-the-latest-version-of-gnu-octave/206050#206050
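Whichever route you take, a quick sanity check from the terminal (a sketch):

octave --version    # should report a working version such as 3.8.2, and not 4.0.0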

+

Red Hat / CentOS

Command Line

You can install Octave from the yum repository using the following commands:

sudo yum install epel-release
sudo yum install octave

ArchLinux

Command Line

Octave can be installed from pacman:

sudo pacman -S octave

Browser (any OS)

It is possible to use Octave online, without installing it on a local computer. The web interface has a code editor and a REPL console with inline plots. It gives access to Octave 3.6.4.

URL: http://octave.im/

Watching Videos in Chrome

If you are on Ubuntu and the embedded playback of the mp4 course videos fails in the Chromium browser, see this Launchpad answer. Chromium is the open source version of Chrome.

+
+ + + diff --git a/ML_Mathematical_Approach/31_useful-resources/01__105-machine-learning-paper.pdf b/ML_Mathematical_Approach/31_useful-resources/01__105-machine-learning-paper.pdf new file mode 100644 index 0000000..d03da60 Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__105-machine-learning-paper.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__Bishop_-_Pattern_Recognition_And_Machine_Learning_-_Springer__2006.pdf b/ML_Mathematical_Approach/31_useful-resources/01__Bishop_-_Pattern_Recognition_And_Machine_Learning_-_Springer__2006.pdf new file mode 100644 index 0000000..af7d777 Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__Bishop_-_Pattern_Recognition_And_Machine_Learning_-_Springer__2006.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__GreedyFuncApproxSS.pdf b/ML_Mathematical_Approach/31_useful-resources/01__GreedyFuncApproxSS.pdf new file mode 100644 index 0000000..5723fec Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__GreedyFuncApproxSS.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__LinAlg.pdf b/ML_Mathematical_Approach/31_useful-resources/01__LinAlg.pdf new file mode 100644 index 0000000..0838a06 Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__LinAlg.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__MLBOOK.pdf b/ML_Mathematical_Approach/31_useful-resources/01__MLBOOK.pdf new file mode 100644 index 0000000..bfae5a7 Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__MLBOOK.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__NormanEtAlTICS.pdf b/ML_Mathematical_Approach/31_useful-resources/01__NormanEtAlTICS.pdf new file mode 100644 index 0000000..5e717b3 Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__NormanEtAlTICS.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__Teach_Data_Science.html b/ML_Mathematical_Approach/31_useful-resources/01__Teach_Data_Science.html new file mode 100644 index 0000000..4184c6c --- /dev/null +++ b/ML_Mathematical_Approach/31_useful-resources/01__Teach_Data_Science.html @@ -0,0 +1,141 @@ + + + + +jsresearch.net + + + + + + + +
+ + + + + + + + + \ No newline at end of file diff --git a/ML_Mathematical_Approach/31_useful-resources/01__adaboost4.pdf b/ML_Mathematical_Approach/31_useful-resources/01__adaboost4.pdf new file mode 100644 index 0000000..1cdec02 Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__adaboost4.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__cacm12.pdf b/ML_Mathematical_Approach/31_useful-resources/01__cacm12.pdf new file mode 100644 index 0000000..892f46e Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__cacm12.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__copy.html b/ML_Mathematical_Approach/31_useful-resources/01__copy.html new file mode 100644 index 0000000..14e85aa --- /dev/null +++ b/ML_Mathematical_Approach/31_useful-resources/01__copy.html @@ -0,0 +1,33 @@ + + + + + + + + + + +
+ + + + diff --git a/ML_Mathematical_Approach/31_useful-resources/01__cs229-linalg.pdf b/ML_Mathematical_Approach/31_useful-resources/01__cs229-linalg.pdf new file mode 100644 index 0000000..5299bfd Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__cs229-linalg.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__doku.php b/ML_Mathematical_Approach/31_useful-resources/01__doku.php new file mode 100644 index 0000000..1222ee5 --- /dev/null +++ b/ML_Mathematical_Approach/31_useful-resources/01__doku.php @@ -0,0 +1,307 @@ + + + + + courses:bigdata:start | CILVR Lab @ NYU + + + + + + + + + + + + + + + + + + +

Big Data, Large Scale Machine Learning

Course Information

News
    +
  • 2013-05-09: Assignment 3 is released. It is an optional assignment.
  • 2013-04-08: Assignment 2 is ready.
  • 2013-02-26: Assignment 1 is out!
  • 2013-02-12: Public enrollment available in the Piazza discussion group. No more access codes or NYU email!
  • 2013-02-04: CHANGE OF CLASSROOM: lectures will now take place in the auditorium WWH 109
  • 2013-01-30: videos of first lecture available
  • 2013-01-28: first lecture today at 5:00 pm, Warren Weaver Hall, Room 101
    • topics: linear representation, on-line gradient descent and improvements thereof.
  • 2013-01-28: students will have access to a cluster with 100 nodes (with 8 cores each) running Linux and Hadoop.
+ +

Course Material

Prerequisites

This course is for people interested in automatically extracting knowledge from large amounts of data. Students should have some prior knowledge or experience with basic machine learning methods.

You must have taken a machine learning course at the undergraduate or graduate level prior to taking this course, or have industry experience with machine learning.

Required skills:

  • knowledge of basic methods in machine learning such as linear classifiers, logistic regression, K-Means clustering, and principal components analysis.
  • although much of the assignment work will use dynamic/scripting programming languages, some proficiency in C programming will be assumed
  • knowledge of basic concepts in probability and statistics: probability distributions and probability density functions, conditional probabilities, marginalization, Bayes' theorem
  • basic knowledge of linear algebra and multivariate calculus: linear system solving, eigenvalues/eigenvectors, least square minimization, gradient, Jacobian, and Hessian.
+ +

Syllabus

  • Introduction
  • Online methods for linear models
  • Online methods for nonlinear models
  • LBFGS
  • Boosted Decision Trees and stumps
  • Mapreduce/Allreduce
  • Hadoop
  • Parallelization of learning algorithms: OpenMP, CUDA, OpenCL
  • Inverted Indices & Predictive Indexing
  • Feature Hashing
  • Locally-sensitive Hashing & Linear Dimensionality Reduction
  • Nonlinear Dimensionality Reduction
  • Feature Learning
  • Handling Many Classes, class embedding
  • Active Learning
  • Exploration and Learning
+ +

Evaluation

Evaluation will be a combination of programming assignments and a final project.
+ + diff --git a/ML_Mathematical_Approach/31_useful-resources/01__index.html b/ML_Mathematical_Approach/31_useful-resources/01__index.html new file mode 100644 index 0000000..58b1899 --- /dev/null +++ b/ML_Mathematical_Approach/31_useful-resources/01__index.html @@ -0,0 +1,232 @@ + + + + + + + + + Summary — Shark 3.0a documentation + + + + + + + + + + + + + + + + + + + + + + +

Summary

SHARK is a fast, modular, feature-rich open-source C++ machine learning library. It provides methods for linear and nonlinear optimization, kernel-based learning algorithms, neural networks, and various other machine learning techniques (see the feature list below). It serves as a powerful toolbox for real world applications as well as research. Shark depends on Boost and CMake. It is compatible with Windows, Solaris, MacOS X, and Linux. Shark is licensed under the permissive GNU Lesser General Public License.

For an overview of the previous major release of Shark (2.0) we refer to:

Christian Igel, Verena Heidrich-Meisner, and Tobias Glasmachers. Shark. Journal of Machine Learning Research 9, pp. 993-996, 2008. [Bibtex]

Where to start

In the menu above, click on "Getting started", or use this direct link to the installation instructions. After installation, there is a guide to the different documentation pages available here.

Why Shark?

Speed and flexibility

Shark provides an excellent trade-off between flexibility and ease-of-use on the one hand, and computational efficiency on the other.

One for all

Shark offers numerous algorithms from various machine learning and computational intelligence domains in a way that they can be easily combined and extended.

Unique features

Shark comes with a lot of powerful algorithms that are, to our best knowledge, not implemented in any other library, for example in the domains of model selection and training of binary and multi-class SVMs, or evolutionary single- and multi-objective optimization.

Selected features

Shark currently supports:

  • Supervised learning
    • Linear discriminant analysis (LDA), Fisher–LDA
    • Linear regression
    • Support vector machines (SVMs) for one-class, binary and true multi-category classification as well as regression; includes fast variants for linear kernels.
    • Feed-forward and recurrent multi-layer artificial neural networks
    • Radial basis function networks
    • Regularization networks as well as Gaussian processes for regression
    • Iterative nearest neighbor classification and regression
    • Decision trees and random forests
  • Unsupervised learning
    • Principal component analysis
    • Restricted Boltzmann machines (including many state-of-the-art learning algorithms)
    • Hierarchical clustering
    • Data structures for efficient distance-based clustering
  • Evolutionary algorithms
    • Single-objective optimization (e.g., CMA–ES)
    • Multi-objective optimization (in particular, highly efficient algorithms for computing as well as approximating the contributing hypervolume)
  • Basic linear algebra and optimization algorithms
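Since Shark is built and consumed through CMake, a minimal CMakeLists.txt for a program linking against it might look like this (a sketch assuming a standard Shark installation; "example" and main.cpp are placeholder names):

# hypothetical minimal CMakeLists.txt for a program using Shark
find_package(Shark REQUIRED)          # locates SharkConfig.cmake
include(${SHARK_USE_FILE})            # sets include paths and compile flags
add_executable(example main.cpp)
target_link_libraries(example ${SHARK_LIBRARIES})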
+ + + + \ No newline at end of file diff --git a/ML_Mathematical_Approach/31_useful-resources/01__index.php b/ML_Mathematical_Approach/31_useful-resources/01__index.php new file mode 100644 index 0000000..2b8cd44 --- /dev/null +++ b/ML_Mathematical_Approach/31_useful-resources/01__index.php @@ -0,0 +1,21 @@ + + + Coursera + + + +
This page is no longer available.

This page was hosted on our old technology platform. We've moved to our new platform at www.coursera.org. Explore our catalog to see if this course is available on our new platform, or learn more about the platform transition here.
+ + \ No newline at end of file diff --git a/ML_Mathematical_Approach/31_useful-resources/01__kdd10-outlier-tutorial.pdf b/ML_Mathematical_Approach/31_useful-resources/01__kdd10-outlier-tutorial.pdf new file mode 100644 index 0000000..33d3a9e Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__kdd10-outlier-tutorial.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__materials.html b/ML_Mathematical_Approach/31_useful-resources/01__materials.html new file mode 100644 index 0000000..df321c1 --- /dev/null +++ b/ML_Mathematical_Approach/31_useful-resources/01__materials.html @@ -0,0 +1 @@ + diff --git a/ML_Mathematical_Approach/31_useful-resources/01__mimetextutorial.html b/ML_Mathematical_Approach/31_useful-resources/01__mimetextutorial.html new file mode 100644 index 0000000..751d1d9 --- /dev/null +++ b/ML_Mathematical_Approach/31_useful-resources/01__mimetextutorial.html @@ -0,0 +1,15 @@ + + + mimetextutorial.html + + + + + + <body> + <a href="http://www.forkosh.com/weblist.cgi?-t=weblist&-o=php&-f=sources/mimetextutorial.html">Click for mimetextutorial.html</a> + </body> + + + \ No newline at end of file diff --git a/ML_Mathematical_Approach/31_useful-resources/01__p675.pdf b/ML_Mathematical_Approach/31_useful-resources/01__p675.pdf new file mode 100644 index 0000000..81cea79 Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__p675.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__painless-conjugate-gradient.pdf b/ML_Mathematical_Approach/31_useful-resources/01__painless-conjugate-gradient.pdf new file mode 100644 index 0000000..99a26b5 Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__painless-conjugate-gradient.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__pmwiki.php b/ML_Mathematical_Approach/31_useful-resources/01__pmwiki.php new file mode 100644 index 0000000..eab48f3 --- /dev/null +++ b/ML_Mathematical_Approach/31_useful-resources/01__pmwiki.php @@ -0,0 +1,333 @@ + + + + + + + David Barber : Brml - Online browse + + + + + + + + + + + + + + + + +

Online Versions & Errata

The online version differs from the hardcopy in page numbering, so please refer to the hardcopy if you wish to cite a particular page. The list of errata for the first edition is here.

[Table of downloadable online versions; the most recent is dated 2 Feb 2017.]

This version corresponds to the published Cambridge University Press version, differing only in minor text details. There are some errata (in magenta) and addenda (in blue) from the published version, highlighted using the margin text `@@' and `++'.

Please leave a comment if you find an error or have a suggestion.
+ + + + + diff --git a/ML_Mathematical_Approach/31_useful-resources/01__previous.html b/ML_Mathematical_Approach/31_useful-resources/01__previous.html new file mode 100644 index 0000000..46f3e0c --- /dev/null +++ b/ML_Mathematical_Approach/31_useful-resources/01__previous.html @@ -0,0 +1,271 @@ + + + + +Learning From Data MOOC - The Lectures + + + + + + + + + + +
California Institute of Technology

THE LECTURES
  • Taught by Feynman Prize winner Professor Yaser Abu-Mostafa.
  • The fundamental concepts and techniques are explained in detail. The focus of the lectures is real understanding, not just "knowing."
  • Lectures use incremental viewgraphs (2853 in total) to simulate the pace of blackboard teaching.
  • The 18 lectures (below) are available on different platforms:
    Here is the playlist on YouTube
    Lectures are available on the iTunes U course app

Place the mouse on a lecture title for a short description

  • Lecture 1 (The Learning Problem)
    Lecture (some audio drops, sorry!) - Q&A - Slides
    The Learning Problem - Introduction; supervised, unsupervised, and reinforcement learning. Components of the learning problem.

  • Lecture 2 (Is Learning Feasible?)
    Review - Lecture - Q&A - Slides
    Is Learning Feasible? - Can we generalize from a limited sample to the entire space? Relationship between in-sample and out-of-sample.

  • Lecture 3 (The Linear Model I)
    Review - Lecture - Q&A - Slides
    The Linear Model I - Linear classification and linear regression. Extending linear models through nonlinear transforms.

  • Lecture 4 (Error and Noise)
    Review - Lecture - Q&A - Slides
    Error and Noise - The principled choice of error measures. What happens when the target we want to learn is noisy.

  • Lecture 5 (Training versus Testing)
    Review - Lecture - Q&A - Slides
    Training versus Testing - The difference between training and testing in mathematical terms. What makes a learning model able to generalize?

  • Lecture 6 (Theory of Generalization)
    Review - Lecture - Q&A - Slides
    Theory of Generalization - How an infinite model can learn from a finite sample. The most important theoretical result in machine learning.

  • Lecture 7 (The VC Dimension)
    Review - Lecture - Q&A - Slides
    The VC Dimension - A measure of what it takes a model to learn. Relationship to the number of parameters and degrees of freedom.

  • Lecture 8 (Bias-Variance Tradeoff)
    Review - Lecture - Q&A - Slides
    Bias-Variance Tradeoff - Breaking down the learning performance into competing quantities. The learning curves.

  • Lecture 9 (The Linear Model II)
    Review - Lecture - Q&A - Slides
    The Linear Model II - More about linear models. Logistic regression, maximum likelihood, and gradient descent.

  • Lecture 10 (Neural Networks)
    Review - Lecture - Q&A - Slides
    Neural Networks - A biologically inspired model. The efficient backpropagation learning algorithm. Hidden layers.

  • Lecture 11 (Overfitting)
    Review - Lecture - Q&A - Slides
    Overfitting - Fitting the data too well; fitting the noise. Deterministic noise versus stochastic noise.

  • Lecture 12 (Regularization)
    Review - Lecture - Q&A - Slides
    Regularization - Putting the brakes on fitting the noise. Hard and soft constraints. Augmented error and weight decay.

  • Lecture 13 (Validation)
    Review - Lecture - Q&A - Slides
    Validation - Taking a peek out of sample. Model selection and data contamination. Cross validation.

  • Lecture 14 (Support Vector Machines)
    Review - Lecture - Q&A - Slides
    Support Vector Machines - One of the most successful learning algorithms; getting a complex model at the price of a simple one.

  • Lecture 15 (Kernel Methods)
    Review - Lecture - Q&A - Slides
    Kernel Methods - Extending SVM to infinite-dimensional spaces using the kernel trick, and to non-separable data using soft margins.

  • Lecture 16 (Radial Basis Functions)
    Review - Lecture - Q&A - Slides
    Radial Basis Functions - An important learning model that connects several machine learning models and techniques.

  • Lecture 17 (Three Learning Principles)
    Review - Lecture - Q&A - Slides
    Three Learning Principles - Major pitfalls for machine learning practitioners; Occam's razor, sampling bias, and data snooping.

  • Lecture 18 (Epilogue)
    Review - Lecture - Acknowledgment - Slides
    Epilogue - The map of machine learning. Brief views of Bayesian learning and aggregation methods.
+ + + diff --git a/ML_Mathematical_Approach/31_useful-resources/01__resources.html b/ML_Mathematical_Approach/31_useful-resources/01__resources.html new file mode 100644 index 0000000..591989e --- /dev/null +++ b/ML_Mathematical_Approach/31_useful-resources/01__resources.html @@ -0,0 +1,636 @@ + + +

Below is a compilation of web links. Hopefully these resources will help improve your learning experience.

Informative Web Sites

Linear Algebra

Writing Equations in Forum Posts

Online E-Books

Textbook information

Advanced classes online

Machine Learning frameworks and libraries in Python

Machine Learning frameworks and libraries in C++

Machine Learning frameworks and libraries in Java

Machine Learning Data Sets

Octave packages

Octave online

Translation Projects

Useful papers

General

Boosting

Outlier and Anomaly Detection

SVM

http://web.mit.edu/6.034/wwwbob/svm-notes-long-08.pdf

Interesting applications

Deep Learning School, Sept. 2016 (URL includes links to video archives)

https://www.bayareadlschool.org/
+ + + diff --git a/ML_Mathematical_Approach/31_useful-resources/01__svm-notes-long-08.pdf b/ML_Mathematical_Approach/31_useful-resources/01__svm-notes-long-08.pdf new file mode 100644 index 0000000..d7e5866 Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__svm-notes-long-08.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__thebook.pdf b/ML_Mathematical_Approach/31_useful-resources/01__thebook.pdf new file mode 100644 index 0000000..240e7dd Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__thebook.pdf differ diff --git a/ML_Mathematical_Approach/31_useful-resources/01__viewcontent.cgi b/ML_Mathematical_Approach/31_useful-resources/01__viewcontent.cgi new file mode 100644 index 0000000..bfd4e8f Binary files /dev/null and b/ML_Mathematical_Approach/31_useful-resources/01__viewcontent.cgi differ diff --git a/ML_Mathematical_Approach/32_tips-on-octave-os-x/01__resources.html b/ML_Mathematical_Approach/32_tips-on-octave-os-x/01__resources.html new file mode 100644 index 0000000..b774880 --- /dev/null +++ b/ML_Mathematical_Approach/32_tips-on-octave-os-x/01__resources.html @@ -0,0 +1,175 @@ + + +

Try installing a later version of Octave than the one recommended in the course content (3.8.0).

http://wiki.octave.org/Octave_for_MacOS_X. Under "Installing a Mac OS X Bundle," you can click on "download Octave 4.0.3 with Graphical User Interface."

Error message "unknown or ambiguous terminal type"

A) Try changing the terminal type with this command, for any of "qt", "x11", or "aqua":

setenv("GNUTERM","qt")

Alternatively you can install the AquaTerm backend from SourceForge: http://sourceforge.net/projects/aquaterm/

and then reinstall GNUplot with Aqua terminal support:

brew uninstall gnuplot
brew install gnuplot --with-aquaterm

B) You may also try this:

brew uninstall fontconfig
brew install fontconfig --universal
brew uninstall gnuplot
brew install gnuplot --with-qt

... then add this to ~/.octaverc:

setenv("GNUTERM","qt")

The hist() or plot() function hangs

It's not really hung - on some distributions of Octave, the first plotting function you call causes the font library cache to be generated. This can take a minute or so the first time; after that the plotting functions will work much faster.

Alternatively, if Octave still does not respond after some time, you may have to change your fontconfig. I also installed gnuplot with x11 support and changed the file octaverc. These are the terminal commands:

brew install fltk
brew install gnuplot --with-x11
brew uninstall fontconfig
brew install fontconfig --universal

then edit /usr/local/share/octave/site/m/startup/octaverc and put this in the file:

-------- (start copy) --------

setenv ("GNUTERM", "X11")
gnuplot_binary("/usr/local/bin/gnuplot")
graphics_toolkit('gnuplot')

-------- (end copy) --------

Save the file and start octave.

Verify by running:

available_graphics_toolkits       % this will show the available graphics toolkits that can be loaded
loaded_graphics_toolkits          % this will show the graphics toolkit that is currently loaded

Other useful command examples:

register_graphics_toolkit("fltk") % this will add the fltk graphics toolkit to the available graphics toolkits list
graphics_toolkit("qt")            % this will load the qt graphics toolkit

Errors when editing ex1 "plotData.m"

If you get an error like:

error: invalid character '�' (ASCII 226) near line 14, column 14

then try unchecking the TextEdit preferences "Smart quotes" and "Smart dashes", and use double quotes (") instead of single quotes (').

Try Using Vagrant and Virtualbox

If you are using OS X (and some brands of Linux), you can have a lot of trouble getting the visualizations to work natively. One solution is to turn to virtualization; you can find a Vagrant file that gets an ubuntu machine configured in VirtualBox, along with scripts to make this feel like a native OS X app, here: http://deepneural.blogspot.fr/p/welcome.html. Another script can be found here, but this one is just the Vagrant file and does not have all the nice OS X scripts bundled with it: https://gist.github.com/Starefossen/9353638

You'll additionally need VirtualBox and Vagrant to go down this route, which are thankfully both free: https://www.virtualbox.org and https://www.vagrantup.com

You'll need an X server, which you almost certainly are using in Linux already, but which does not come out of the box with OS X. OS X users can get it here: http://www.xquartz.org
+ + + diff --git a/ML_Mathematical_Approach/32_tips-on-octave-os-x/01__welcome.html b/ML_Mathematical_Approach/32_tips-on-octave-os-x/01__welcome.html new file mode 100644 index 0000000..c8e8c91 --- /dev/null +++ b/ML_Mathematical_Approach/32_tips-on-octave-os-x/01__welcome.html @@ -0,0 +1,1556 @@ + + + + + + + + + + + + + + + + + + +Deep Learning and Neural Nets: Octave for Mac Installation + + + + + + +

NEW: RESOURCE PAGE

Searching for tutorials and software about Deep Learning and Neural Nets? Be sure to look at my Resource Page!

Looking for Octave? Go to my Easy Octave on Mac page!

Octave for Mac Installation

MacOctave, Octave for OS X Download.

Welcome. You are looking to install Octave on the Mac, without pain. You are in the right place. With the installer you can download below, you will get a very usable Octave system in about 20 minutes, provided you have a broadband connection.

If you have an issue, or would like to see some changes or improvements to this page, please send me an email!

edmundronald at gmail dot com

And please, please, TIP ME! $10 individuals, $5 students, and $50 for companies.
Contribution

I will be really thankful if you donate, and donations will help me maintain this software!

The release of Octave which will be installed at the moment is 4.02. A folder containing instructions and the necessary files is available for download. There is nothing to type; all you need to do is click.

WARNING: You may hit a Vagrant installation BUG. SEE HERE FOR DETAILS

BTW, if the Barista at Starbucks deserves a tip, don't you think I do too? Hit that Donate link, I really appreciate it ... if you are an individual user, $10 would be nice, $5 if you are a student, $50 if you use this for work.

If you need help you can comment below or shoot me an email, but do remember that tipping me keeps me happy; people who donate get faster responses and better handholding and problem solving. My email is edmundronald at gmail dot com.

Edmund

17 comments:
+
  1. Hi Edmund,

    Thanks for the help getting this working on my Hackintosh. For you and for anyone else that may be interested, here's the key stuff that may help the install go more smoothly on other Hackintoshes.

    First off some details about my particular Hackintosh

    System Hackintosh CustoMac Pro
    CPU Intel i7 3.4 GHz
    Graphics EVGA nVidia GeForce GTX 670
    Other Hardware GA-Z77X-UD5H/F14, 3.4 GHz i3770, 16GB 1600 MHz DDR3 (2x8GB Corsair Vengeance), EVGA GeForce GTX 670, SanDisk 240 GB SSD, Seagate Barracuda 7200 RPM 1TB, TP-Link TL-WDN4800, Sony 24x SATA DVD +/- RW, D

    Recommended Hackintosh GIGABYTE - UEFI DualBIOS settings (I was using these at the start of the process):

    Save & Exit Page: Load Optimized Defaults: Yes
    M.I.T\Advanced Memory Settings Page: Extreme Memory Profile(X.M.P.): Profile1

    Gigabyte 7 Series motherboard recommended GIGABYTE - UEFI DualBIOS settings:

    Save & Exit Page: Load Optimized Defaults: Yes
    M.I.T\Advanced Memory Settings Page: Extreme Memory Profile(X.M.P.): Profile1
    BIOS Features Page: Intel Virtualization Technology: Disabled
    BIOS Features Page: VT-d: Disabled
    Peripherals Page: VIA 1394 Controller: Enabled
    Power Management Page: Wake on LAN: Disabled

    The conventional wisdom in the Hackintosh community appears to be to disable Intel Virtualization Technology in the BIOS; however, this prevents running any virtual machines, and in the VirtualBox GUI you can see that attempting to boot any VM gives an error stating that "Vt-x is disabled in the BIOS". It's worth noting that VT-x is NOT the same as VT-d. I found that enabling VT-d would prevent my Hackintosh from booting; enabling Intel Virtualization Technology, however, did not prevent booting, and did enable VirtualBox to power up the Octave VM.

    Beyond this point I'm not clear as to what specifically got me up and running, but I believe it was some combination of the new Vagrant file you sent and rebooting one more time at my end after I had verified that I could power up VMs using VirtualBox.

    Bottom line: I primarily needed to enable Intel Virtualization Technology on the BIOS Features tab of my Gigabyte UEFI BIOS. Enabling VT-d was counterproductive and not needed.

    Thanks again for the help!

    Steve

    ReplyDelete
  2. Hi Steve - congrats on making this work. My only explanatory comment here is that on a Hackintosh one may have issues with the VirtualBox configuration. It is possible that another virtualiser might play better.

    If people need the Vagrantfile that shows the boot console they can email me, or just remove the "#" on the line that says "vb.gui = true ". But be careful to use a clean text editor. A backup of the Vagrantfile is provided for safety in the install package ...

    ReplyDelete
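    For reference, the edit Edmund describes can also be made from the terminal, which sidesteps the clean-text-editor worry; a minimal sketch, assuming the commented-out line in the Vagrantfile reads "# vb.gui = true":

    cd ~/oct   # the folder holding the Vagrantfile (name assumed from comment 11 below)
    # Uncomment the GUI line; -i.bak edits in place and keeps a .bak backup
    sed -i.bak 's/^\( *\)# *vb\.gui = true/\1vb.gui = true/' Vagrantfile
    vagrant reload   # restart the VM so the boot console appears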
  3. Hi Edmund,
    I am just trying out your software. It works pretty nicely so far, but I need to install an extra package (signal).
    I have no idea how to do that, actually.
    Could you give me some advice?
    Thanks a lot & best regards, f

    ReplyDelete
  4. Hi Fran,

    My advice would be to:
    1. Look at the following page: http://octave.sourceforge.net
    2. Try typing this at the Octave prompt: pkg install -forge signal

    You will get an error and be told to install control before signal ... which means typing:

    3. pkg install -forge control
    4. pkg install -forge signal

    After you've figured it out with these hints, you might be so kind as to tip your Octave Barista; your contribution, even small, will be greatly appreciated :)

    ReplyDelete
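    The same dependency-ordered install can also be driven from the host in one shot; a sketch, assuming the VM from the install package is up and the box ships the build tools that Octave Forge packages need (vagrant ssh -c and octave --eval are standard Vagrant/Octave options):

    cd ~/oct    # folder containing the Vagrantfile (name assumed from comment 11 below)
    vagrant up  # a no-op if the VM is already running
    # control must be installed before signal, exactly as the hints above say
    vagrant ssh -c 'octave --eval "pkg install -forge control; pkg install -forge signal"'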
  5. Hi,
    As Fran I'm trying to install an extra package (dynare http://www.dynare.org/DynareWiki/InstallOnDebianOrUbuntu )

    I added the apt lines for installing it in the virtual machine (using "vagrant up; vagrant ssh"), but when I run sudo apt-get install dynare it starts to require dependencies that, as far as I know, are already installed...

    Could you give some hint to do it?

    Thanks

    ReplyDelete
  6. Hi, unknown Dynare installer: A lot of people ask me for free tech support; I would suggest they try the Octave mailing list. If you had donated, I might have felt motivated to help you, but that ship has now sailed.

    You're like somebody who goes into Starbucks and asks for a glass of -free- water, goes to the restroom, and then comes back and tells the Barista, "oh, and as you give things away, could I have a latte?"

    Edmund

    ReplyDelete
  7. I just installed this wonderful package, but when I run the octave-gui-signed applet, only the command-line version of Octave is displayed; I don't get a GUI window. Any ideas?

    I'm running El Capitan 10.11.1

    Thanks!

    ReplyDelete
    Replies
    1. Hi Unknown,

      Please email me edmundronald at gmail, and we will try to figure out your bug.

      E.

      Delete
  8. Would it be possible to do a version of Octave that supports 16-bit images? I've tried to install Octave your way and then recompile GraphicsMagick and Octave from the source packages, but failed.

    ReplyDelete
  9. I love the package, but I need to use 16-bit image files. I tried to install your package and then recompile GraphicsMagick and Octave from source, but it doesn't seem to work. Can you build a package that supports GraphicsMagick with quantum depth = 16?

    ReplyDelete
    Replies
    1. The point about my package is that one doesn't recompile. I guess one can recompile (I've done it), but one really needs to be motivated.

      Delete
  10. Hi Edmund,

    Thank you for the great solution! I have installed Octave following the instructions and been able to submit my Coursera assignment. Still, I have two questions.

    First, I am wondering how to cleanly uninstall all the packages automatically installed by octave-gui-signed (though I will be using it for some time).

    In addition, the text looks fuzzy on my Retina screen. Is there some way to fix it? I guess this is probably a problem with XQuartz.

    Thanks!

    ReplyDelete
  11. To uninstall, cd to the oct folder and type:
    vagrant destroy

    BTW, if it worked for you, dude, did you tip?

    ReplyDelete
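    Note that vagrant destroy deletes the VM and its disk but leaves the downloaded base box cached on the host; a fuller cleanup sketch (the folder and box names are assumptions; check vagrant box list for yours):

    cd ~/oct             # the install folder
    vagrant destroy -f   # delete the VM and its virtual disk without prompting
    vagrant box list     # list base boxes still cached locally
    vagrant box remove ubuntu/trusty64   # hypothetical box name; frees the cached image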
  12. Hi Edmund,

    When i launch octave-gui-signed on my OS X, i am getting a popup "AppleEvent Timed Out Error -1712" and the application exits after i close the popup.

    Any suggestions?

    Thanks

    ReplyDelete
    Replies
    1. Yes, I wonder whether you have the screensaver running? Try turning it off when running the applet the first time, as it needs to download a lot of stuff.

      BTW, if the applet works for you, a donation would be appreciated ...

      Delete
  13. Looks like a great package you've made for us. Will install shortly.

    I noticed a broken link on your page, the link on the top: "Looking for Octave? Go to my Easy Octave on Mac page!" leads to "Sorry, the page you were looking for in this blog does not exist.".

    ReplyDelete
  14. Flemming,

    You are absolutely right. I think this is the page :) I will remove the dead link.
    And by the way, thank you very much for the donation you made!

    Edmund

    ReplyDelete
+
+
+
+ +

Hey, let me know what you think of my blog, and what material I should add!

\ No newline at end of file diff --git a/ML_Mathematical_Approach/classification_fmri.pdf b/ML_Mathematical_Approach/classification_fmri.pdf new file mode 100644 index 0000000..76688fb Binary files /dev/null and b/ML_Mathematical_Approach/classification_fmri.pdf differ diff --git a/ML_Mathematical_Approach/nninitialization.pdf b/ML_Mathematical_Approach/nninitialization.pdf new file mode 100644 index 0000000..7429cf6 Binary files /dev/null and b/ML_Mathematical_Approach/nninitialization.pdf differ