INF264 CS
- k-nearest neighbor classifier
- linear models for classification and regression
- decision trees
- support vector machines and kernels
- probabilistic models
- ensemble methods
- neural networks
- dimensionality reduction
- clustering
Inductive bias: similar objects have similar labels.
Method: find k points that are most similar to the test point and let them vote for the class label.
k is a parameter chosen by the modeller.
For binary classification, an odd k avoids ties.
Too small a k may lead to overfitting.
We can use the k nearest neighbors if we can compute the similarity between objects.
Often, objects are represented by vectors in some space and their dissimilarity is measured by their distance.
Common distance measures:
- Euclidean distance: d(x, z) = sqrt(Σ_i (x_i − z_i)^2)
- Manhattan distance: d(x, z) = Σ_i |x_i − z_i|
- L_p norm: ||x − z||_p = (Σ_i |x_i − z_i|^p)^(1/p)
Properties:
- Non-parametric: Cannot be characterized with a fixed set of parameters
- Lazy: No work at the learning stage (unless we want to store the training points in a smart way)
- Instance-based: Memorizes the whole dataset
Pros:
- No assumptions about data
- Simple
- Often good accuracy
- Versatile
- With enough training points, can approximate any decision boundary
Cons:
- High memory requirement
- Prediction stage can be slow
- Sensitive to irrelevant features and the scale of the data
- Curse of dimensionality
Variants:
- Majority vote
- Distance-weighted vote
  - Inverse of the distance
  - Inverse of the squared distance
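A minimal sketch of k-NN with both voting variants, assuming NumPy and Euclidean distance (the function, data, and names are illustrative, not from the course material):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Predict the class of x with a k-nearest-neighbor vote."""
    # Euclidean distances from x to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    votes = {}
    for i in nearest:
        # Majority vote: every neighbor counts 1.
        # Distance-weighted vote: weight by the inverse of the distance.
        w = 1.0 / (dists[i] + 1e-12) if weighted else 1.0
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```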
Support vector machines are supervised learning models. Classification is performed by computing a decision boundary, and the decision boundary is determined by the support vectors. In training, the data are mapped to points in space so that the margin, i.e. the gap between the classes, can be measured. The margin is given by the perpendicular distances between the decision boundary and the closest data points of each class. With a linear kernel, the decision boundary is a separating hyperplane. To obtain the best model, this margin should be maximized.
In a linear classification task, the data points closest to the decision boundary are the support vectors.
![](images/image1.png)

If the input data is not linearly separable, kernels can be applied to the learning algorithm. A kernel is a function that can be used by a support vector machine to project points to a higher dimensional space (feature space).
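As an illustration (assuming scikit-learn, which the notes do not prescribe), a kernelized SVM on data that is not linearly separable might look like this:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles are not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# An RBF kernel implicitly maps the points to a higher-dimensional feature
# space where a separating hyperplane exists; C controls margin softness.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.support_vectors_.shape)   # the support vectors determine the boundary
print(clf.score(X, y))              # training accuracy
```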
![](images/image2.png)

In real life, many tasks require us to reason and act under uncertainty. How do we assess uncertainty, pool information together, and make coherent inferences and decisions? Probabilistic modeling is a systematic approach to these problems. Probabilistic models can know when they don't know.
- Aleatory uncertainty
  - Due to randomness
  - We are not able to obtain observations which can reduce this uncertainty
- Epistemic uncertainty
  - Due to lack of knowledge
  - We are able to obtain observations which can reduce this uncertainty
  - Two observers may have different epistemic uncertainty
- Frequentist - Objective
  - Frequencies from repetitions of experiments (realizable or hypothetical)
  - Handles aleatory uncertainty
- Bayesian - Subjective (degree of belief)
  - Here A are propositions and P(A) is a degree of belief in A being true
  - We may say “I believe to the extent of P(A) that A is true”
  - Contrast with the frequentist interpretation: there P(A) is the proportion of times that A turns out to be true
  - Handles both aleatory and epistemic uncertainty
P(A | B) = P(B | A)P(A) / P(B) , ∀A, B, with P(B) > 0
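A small worked example of Bayes' rule; the numbers are hypothetical and chosen only for illustration:

```python
# Bayes' rule with hypothetical numbers: a test with 99% sensitivity,
# 95% specificity, and a 1% prevalence of the condition A.
p_A = 0.01                      # prior P(A)
p_B_given_A = 0.99              # likelihood P(B | A), B = "test is positive"
p_B_given_notA = 0.05           # false positive rate P(B | not A)

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # total probability
p_A_given_B = p_B_given_A * p_A / p_B                  # Bayes' rule
print(round(p_A_given_B, 3))    # ≈ 0.167: a positive test alone is weak evidence
```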
- So far, we have selected one model that is in some sense “best”
- E.g., using empirical risk minimization, which often corresponds to maximum likelihood: arg max_θ P(D | θ)
- Bayesian methods don't return a single model but a probability distribution over models
- Summarize the current knowledge (or lack of knowledge) in a prior distribution P(θ)
- Update the belief based on observed data D
- Output: posterior distribution P(θ|D) that summarizes the knowledge after observing D
Bayesian inference returns a posterior distribution over model parameters, but how do we use this for prediction? The answer is that we take a weighted average of predictions over all models:
![](images/image3.png)

Note: Instead of predicting a single label, we have a distribution over all possible labels
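In symbols, a standard form of the posterior predictive distribution (the figure above may use slightly different notation):

P(y | x, D) = ∫ P(y | x, θ) P(θ | D) dθ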
- Likelihood: P(D|θ) is the probability of observing the particular data D given that the data was generated from a model with parameters θ
  - “Loss”
- Prior: P(θ) is the probability distribution of the parameters before seeing data
  - Does not depend on data
  - Is our subjective uncertainty about the parameters before we see any data (we might have some other information, though)
  - Epistemic uncertainty
  - “Regularization”/Inductive bias
- Posterior: P(θ|D) is the probability of the parameters given the data
  - Probability of a model after seeing data
  - Our prior beliefs are updated based on the data we see
- Learn features from data (Construct complex features using lots of simple neurons)
Artificial Neurons:
- Multiply each input with a corresponding weight
- Take a sum of these products and feed it into a (non-linear) activation function
- Output the result
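A minimal sketch of a single artificial neuron, assuming NumPy and a sigmoid activation (the particular numbers are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Multiply each input with its weight, sum them, add the bias,
    # then feed the result through a non-linear activation.
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.8, 0.2, -0.4])   # weights
b = 0.1                          # bias
print(neuron(x, w, b))
```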
![](images/image4.png)

![](images/image5.png)

Forward pass: performs inference (predictions)
Backward pass: performs learning (finding the right weights and biases)
- Stochastic gradient descent is an optimization algorithm for minimizing the loss of a predictive model with regard to a training dataset.
- Back-propagation is an automatic differentiation algorithm for calculating gradients for the weights in a neural network graph structure.
- Stochastic gradient descent and the back-propagation of error algorithms together are used to train neural network models.
An efficient algorithm for computing the gradient (i.e., for determining how a single training example would like to nudge the weights and biases).
- First compute the gradient for the last set of weights
- Then propagate the error to the previous layer and compute the gradient for the next set of weights
- Repeat this for the rest of the layers
A stochastic gradient descent step (sketched in code after this list):
- Sample a minibatch of m training examples
- Compute the gradient over the samples in the minibatch
  - Forward pass to compute the loss
  - Backward pass to compute the gradient of the loss
- Update the weights
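A minimal NumPy sketch of one SGD step with back-propagation for a tiny one-hidden-layer network (the architecture, loss, and names are illustrative assumptions, not a reference implementation from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 2 inputs -> 3 hidden units (tanh) -> 1 linear output
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
lr = 0.1

# Sample a minibatch of m training examples (toy regression data)
X = rng.normal(size=(4, 2))                 # m = 4 inputs
y = (X[:, :1] + X[:, 1:]).copy()            # targets

# Forward pass: compute the loss
h = np.tanh(X @ W1.T + b1)                  # hidden activations, shape (4, 3)
y_hat = h @ W2.T + b2                       # predictions, shape (4, 1)
loss = ((y_hat - y) ** 2).mean()

# Backward pass: propagate the error from the output layer backwards
d_out = 2 * (y_hat - y) / len(X)            # dLoss/dy_hat
dW2 = d_out.T @ h                           # gradient for the last set of weights
db2 = d_out.sum(axis=0)
d_h = (d_out @ W2) * (1 - h ** 2)           # error propagated through tanh
dW1 = d_h.T @ X                             # gradient for the previous weights
db1 = d_h.sum(axis=0)

# Update the weights (one SGD step)
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print("loss before the step:", loss)
```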
Simple and widely studied models
Often computationally convenient
Pros:
- Adds flexibility
Cons:
- If dimensionality grows, then also computational requirements grow
- How to choose appropriate basis functions?
Use domain knowledge to construct transformations of the original features.
Can be achieved by basis functions.
Often critical to the performance of the machine learning algorithms
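A small illustration of a basis-function transformation, assuming NumPy and made-up quadratic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 0.1, size=50)   # quadratic target

# Basis functions phi(x) = (1, x, x^2): the model stays linear in the parameters
Phi = np.column_stack([np.ones_like(x), x, x**2])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # ordinary least squares
print(w)                                        # close to [1.0, 0.5, 2.0]
```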
How good is my model?
What if we consider only the error on the training data?
- A simple model will not be able to fit perfectly to the training data
- A complex model typically fits better
- For example, in polynomial regression increasing the degree of the polynomial never decreases the fit
- A polynomial with the degree n - 1 will fit perfectly to n data points
The best-fitting model is not necessarily the best-predicting model.
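For instance, a small NumPy illustration of the n-point claim above, with made-up data:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])          # n = 5 data points
y = np.array([1.0, 0.5, 3.0, 2.0, 5.0])

coeffs = np.polyfit(x, y, deg=len(x) - 1)        # degree n - 1 = 4
print(np.allclose(np.polyval(coeffs, x), y))     # True: zero training error
```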
Bias: Error due to modeling assumptions.
Variance: Error due to variations in the training set.
Simple models tend to have high bias and low variance
Complex models tend to have low bias and high variance
Goal: Low bias and low variance
Using standard linear regression is problematic for predicting categorical responses; instead, we can use a variant called logistic regression. The predicted Y lies within the range 0 to 1.
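A minimal sketch of the logistic model, assuming NumPy (the weights are placeholders, not fitted values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # The linear score X·w + b is squashed through the sigmoid,
    # so the predicted probability always lies between 0 and 1.
    return sigmoid(X @ w + b)

X = np.array([[0.5, 1.2], [-1.0, 0.3]])
w, b = np.array([0.8, -0.4]), 0.1
print(predict_proba(X, w, b))        # values in (0, 1)
```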
![](images/image6.png)

Inner nodes are associated with a test
- Usually, the test involves only one feature
- Outgoing edges correspond to possible test results
A child is selected based on the test result.
Leaf nodes make a prediction
- Classification: assign a class label to an object
- Regression: assign a numerical value
- Predictions can be also probabilistic
To classify an object, the tree is traversed from the root to a leaf.
Decision trees can approximate any function of the input features with arbitrary precision.
If there are no duplicates with different class labels, there is a consistent decision tree for every training set with one path to a leaf for each data point
- Can be large
- Probably does not generalize well
Prefer more compact decision trees (Ockham’s razor)
Finding an optimal decision tree (the one that minimizes the expected number of tests) is NP-hard
Trees are typically learned using greedy heuristics
- Start with an empty tree
- Greedily, select a feature to split
- Recurse
Typically, one wants to divide the space into “pure” parts. In other words, one wants to reduce uncertainty. That is, choose the feature which gives the highest information gain.
Pros:
- Fast (both to learn and predict)
- Easy to interpret
- Invariant to scaling
- Can handle both categorical and continuous data
- Implicit feature selection
Cons:
- Unstable
- Due to greedy learning, small differences in training data can result in totally different trees
- Usually competitive accuracy only in an ensemble
- Creates biased trees if some class dominates
- Balance data before learning
Entropy measures average information content that is obtained from a stochastic information source. More uncertainty -> more information content.
High entropy:
- Lots of uncertainty
- Nearly uniform distribution
Low entropy:
- Less uncertainty
- Outcomes have probabilities near 0 or 1
![](images/image7.png)
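A short sketch of entropy and information gain for a categorical split (NumPy; the data and names are illustrative):

```python
import numpy as np

def entropy(labels):
    # H = -sum_c p_c * log2(p_c): the average information content
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, feature_values):
    # Entropy before the split minus the weighted entropy after it
    gain = entropy(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

y = np.array([0, 0, 1, 1, 1, 0])
x = np.array(["a", "a", "b", "b", "b", "a"])
print(entropy(y), information_gain(y, x))   # 1.0 and 1.0: x splits y perfectly
```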

Iterative Dichotomiser 3
- If all data points have the same label
  - Return a leaf with that label
- Else if all data points have identical feature values
  - Return a leaf with the most common label
- Else
  - Choose a feature that maximizes the information gain
  - Add a branch for each value of the feature
  - For each branch
    - Call the algorithm recursively for the data points with the particular feature value
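A compact recursive sketch of the procedure above for categorical features (pure Python; the data layout and names are illustrative assumptions):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def id3(X, y, features):
    """X: list of dicts {feature: value}, y: list of class labels."""
    if len(set(y)) == 1:                        # all points have the same label
        return y[0]
    if not features:                            # no features left to split on
        return Counter(y).most_common(1)[0][0]  # most common label

    def gain(f):                                # information gain of splitting on f
        g = entropy(y)
        for v in {x[f] for x in X}:
            idx = [i for i, x in enumerate(X) if x[f] == v]
            g -= len(idx) / len(y) * entropy([y[i] for i in idx])
        return g

    best = max(features, key=gain)              # feature maximizing the gain
    tree = {best: {}}
    for v in {x[best] for x in X}:              # one branch per feature value
        idx = [i for i, x in enumerate(X) if x[best] == v]
        tree[best][v] = id3([X[i] for i in idx], [y[i] for i in idx],
                            [f for f in features if f != best])
    return tree

X = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
     {"outlook": "rainy", "windy": "no"}, {"outlook": "rainy", "windy": "yes"}]
y = ["yes", "yes", "yes", "no"]
print(id3(X, y, ["outlook", "windy"]))
```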
If a feature has more than two values, it is still common to split into only two branches
- Splits to several branches can be achieved by a sequence of two-way splits
Typically, ID3 overfits the training data.
- If there are no training points with identical feature values, ID3 will result in 100% training accuracy.
- Inductive bias: prefer smaller trees
- Set a minimum size for a leaf or maximum depth for the tree
- Usually not enough
- Early stopping (pre-pruning)
- Stop recursion once information gain is non-positive
- Problematic: cannot handle e.g. XOR
- Pruning (post-pruning)
- Build a full tree, remove parts of it later
Divide the data into training and pruning (validation) data
- Use the training data to build a full decision tree
- For each subtree T
- If replacing the subtree with the majority label in T does not decrease accuracy on the pruning data
  - Replace T with a leaf node that predicts the majority class
Regression trees:
- Predict values y using the points associated with the leaf
  - Typically, by averaging
- Learning
  - Small modifications to the algorithm
  - Instead of the impurity measures above, use squared error
  - Predict with the mean value for the data subset
  - Information gain from a split is the squared error before the split minus the squared error after the split
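A small sketch of the squared-error criterion for a regression split (NumPy; illustrative data):

```python
import numpy as np

def sse(y):
    # Squared error of predicting the subset with its mean value
    return ((y - y.mean()) ** 2).sum()

def split_gain(y, mask):
    # "Information gain": squared error before the split
    # minus squared error after the split
    return sse(y) - sse(y[mask]) - sse(y[~mask])

y = np.array([1.0, 1.2, 0.9, 4.0, 4.2, 3.8])
mask = np.array([True, True, True, False, False, False])
print(split_gain(y, mask))   # large gain: the split separates low from high values
```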
Goal: Find intrinsic structure in unlabeled data.
A cluster is a set of items that are similar to each other and differ from the items in other clusters.
Why clustering?
- Gain understanding about data
- “Classification without labels”
- Preprocessing
- Quantization (lossy compression)
- Reducing the number of data points
- Applications
- Marketing: find customer segments
- Biology: reconstruct evolutionary history of species
Cluster characterization:
- Centroid models
- A cluster is represented by a single mean vector
- Distribution models
- Clusters are modeled with statistical distributions
- Connectivity models:
- Build clusters based on distance connectivity
- Density models:
- A cluster is a set of points connected by dense regions
Probably the most famous and commonly used clustering method. A cluster is represented by a centroid. The goal is to put similar points in the same cluster by minimizing the within-cluster variance.
Assumption: there exists exactly k clusters.
Lloyd’s algorithm:
- Pick initial cluster centroids
- Repeat until nothing changes
  - Assign each data point to the cluster whose centroid is closest to that point
  - Update the centroids: the new centroid of a cluster is the mean of the data points assigned to that cluster
Note that generally the centroids are not data points. However, we usually initialize cluster centroids with observed data points.
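A compact NumPy sketch of Lloyd's algorithm as described above (initialization and names are illustrative; the sketch ignores the empty-cluster edge case):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the centroids with k observed data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each data point to the cluster whose centroid is closest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update the centroids: mean of the points assigned to each cluster
        # (a production implementation would also handle empty clusters)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break                                  # nothing changed: done
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)), rng.normal(3, 0.3, size=(20, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)                                   # roughly (0, 0) and (3, 3)
```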
Definition: A model performs well on training data but generalizes poorly.
Potential causes:
- Noise (irrelevant information or randomness in data)
- Too flexible (complex) model
- Too little data
- Ensemble methods combine several models to produce more accurate ones
- Bagging reduces variance
- Boosting reduces bias
- Random forests and (gradient) boosting are good choices for quick-and-dirty classifiers in many domains
Setup
- Generate a collection (ensemble) of classifiers (or regressors) from a given data set. Each classifier can be “weak”
- Each classifier is given a different data set. Data sets can be generated by resampling or weighting.
- The final classification is defined by voting
Nice property
- Each weak classifier can be simple with large error; the only requirement is that they are better than random
- Still, an ensemble can perform very well
Diagnostics:
- High bias, low variance → underfitting
- Low bias, high variance → overfitting
- To reduce error, one can either reduce bias or variance
Two approaches:
- Bagging: reduce variance
- Boosting: reduce bias
Bagging
Bootstrap aggregating
- Does not change bias
- Need to have low bias base models
- Less need to worry about overfitting because averaging takes care of it
- Bagging works best with unstable (high-variance) models, e.g., decision trees
- Easily parallelizable
Weaknesses
- Loss of interpretability
  - For example, with random forest the final classifier is not a tree
- Computational complexity
  - You need to learn lots of models
- Classifiers need to be independent
Boosting
- Create a simple classifier (“rule of thumb”) based on a subset of examples
- Repeat T times
How to choose examples at each round?
- Concentrate on “hardest” examples (those most often misclassified by the previous rules of thumb)
How to combine rules of thumb into a single prediction rule?
- Take (weighted) majority vote of rules of thumb
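As an illustration with scikit-learn (an assumption; the notes do not mandate a library), bagging- and boosting-style ensembles of decision trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging-style ensemble: averaging many deep (low-bias, high-variance) trees
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# Boosting: a sequence of weak learners, each focusing on previous mistakes
ada = AdaBoostClassifier(n_estimators=100, random_state=0)

print(cross_val_score(rf, X, y, cv=5).mean())
print(cross_val_score(ada, X, y, cv=5).mean())
```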
- Bayesian models provide a principled way to handle uncertainty
- In Bayesian inference, one obtains a posterior distribution of parameters by updating prior beliefs
- Simple Bayesian models for classification include naive Bayes
Strengths
- Provides a principled way to combine prior information and data
- Handles missing data naturally
- Gives the possibility to define a prior
- Assumptions become transparent
- Possibility to include prior information
Weaknesses
- Computationally intensive
- We have seen only “easy” models, so our view of the computational challenges may be biased
- Typically, there are no closed-form solutions and one has to resort to computationally intensive approximations
- Does not tell how to select a prior (and that selection can heavily affect the results)
Goal: represent data points with fewer features
Curse of dimensionality
- Various phenomena that occur in high-dimensional spaces (but not in low dimensions)
- Nearest neighbors will be far away
- Need more data for accurate predictions
What is dimensionality reduction?
- Feature selection
- Select a subset of features and ignore the others
- Feature extraction
- Create a small number of new features based on the original features and ignore the original ones
Autoencoders
- A neural network that tries to copy its input to its output
- Neural networks that are used to learn low-dimensional representation of data
Principal component analysis (PCA)
- Learn the most interesting directions (new basis functions) from the data
- Interesting direction = lots of variation
- Represent data using these directions (change of basis)
- Throw away “uninteresting” directions without much loss of information. This gives a lower dimensional representation
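A minimal PCA sketch using NumPy's SVD (illustrative; libraries such as scikit-learn provide this ready-made):

```python
import numpy as np

def pca(X, n_components):
    # Center the data, then find the directions with the most variation
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]          # new basis: top principal directions
    Z = Xc @ components.T                   # change of basis (lower-dimensional)
    explained = S[:n_components] ** 2 / (S ** 2).sum()
    return Z, components, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated features
Z, components, explained = pca(X, n_components=2)
print(Z.shape, explained)                   # (100, 2) and the variance ratios
```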