INF264 CS
- k-nearest neighbor classifier
- linear models for classification and regression
- decision trees
- support vector machines and kernels
- probabilistic models
- ensemble methods
- neural networks
- dimensionality reduction
- clustering
Inductive bias: similar objects have similar labels.
Method: find k points that are most similar to the test point and let them vote for the class label.
k is a parameter chosen by the modeller.
For binary classification, an odd k avoids ties.
Too small a k may lead to overfitting.
We can use the k nearest neighbors if we can compute the similarity between objects.
Often, objects are represented by vectors in some space and their dissimilarity is measured by their distance.
Common distance measures:
- Euclidean distance: d(x, z) = sqrt(Σ_i (x_i − z_i)^2)
- Manhattan distance: d(x, z) = Σ_i |x_i − z_i|
- L_p norm: ||x − z||_p = (Σ_i |x_i − z_i|^p)^(1/p)
Properties:
- Non-parametric: Cannot be characterized with a fixed set of parameters
- Lazy: No work at the learning stage (unless we want to store the training points in a smart way)
- Instance-based: Memorizes the whole dataset
Pros:
- No assumptions about data
- Simple
- Often good accuracy
- Versatile
- With enough training points, can approximate any decision boundary
Cons:
- High memory requirement
- Prediction stage can be slow
- Sensitive to irrelevant features and the scale of the data
- Curse of dimensionality
Variants:
- Majority vote
- Distance-weighted vote
  - Inverse of the distance
  - Inverse of the squared distance
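A minimal sketch of k-NN with both voting variants, assuming NumPy and Euclidean distance (the function, data, and names are illustrative, not from the course material):

```python
import numpy as np

def knn_predict(X_train, y_train, x, k=3, weighted=False):
    """Predict the class of x with a k-nearest-neighbor vote."""
    # Euclidean distances from x to every training point
    dists = np.sqrt(((X_train - x) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]          # indices of the k closest points
    votes = {}
    for i in nearest:
        # Majority vote: every neighbor counts 1.
        # Distance-weighted vote: weight by the inverse of the distance.
        w = 1.0 / (dists[i] + 1e-12) if weighted else 1.0
        votes[y_train[i]] = votes.get(y_train[i], 0.0) + w
    return max(votes, key=votes.get)

X_train = np.array([[0.0, 0.0], [0.1, 0.2], [1.0, 1.0], [0.9, 1.1]])
y_train = np.array([0, 0, 1, 1])
print(knn_predict(X_train, y_train, np.array([0.8, 0.9]), k=3))  # -> 1
```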
Support vector machines are supervised learning models. Classification is performed by computing a decision boundary, and the decision boundary is determined by the support vectors. In training, the data are mapped to points in space so that the margin, i.e. the gap between the classes, can be measured. The margin is given by the perpendicular distances between the decision boundary and the closest data points of each class. With a linear kernel, the decision boundary is a separating hyperplane. To obtain the best model, this margin should be maximized.
In a linear classification task, the data points closest to the decision boundary are the support vectors.
![](images/image1.png)

If the input data is not linearly separable, kernels can be applied to the learning algorithm. A kernel is a function that can be used by a support vector machine to project points to a higher dimensional space (feature space).
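As an illustration (assuming scikit-learn, which the notes do not prescribe), a kernelized SVM on data that is not linearly separable might look like this:

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Concentric circles are not linearly separable in the original 2-D space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# An RBF kernel implicitly maps the points to a higher-dimensional feature
# space where a separating hyperplane exists; C controls margin softness.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print(clf.support_vectors_.shape)   # the support vectors determine the boundary
print(clf.score(X, y))              # training accuracy
```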
![](images/image2.png)

In real life, many tasks require us to reason and act under uncertainty. How do we assess uncertainty, pool information together, and make coherent inferences and decisions? Probabilistic modeling is a systematic approach to these problems. Probabilistic models can know when they don't know.
- Aleatory uncertainty
  - Due to randomness
  - We are not able to obtain observations which can reduce this uncertainty
- Epistemic uncertainty
  - Due to lack of knowledge
  - We are able to obtain observations which can reduce this uncertainty
  - Two observers may have different epistemic uncertainty
- Frequentist - Objective
  - Frequencies from repetitions of experiments (realizable or hypothetical)
  - Handles aleatory uncertainty
- Bayesian - Subjective (degree of belief)
  - Here A are propositions and P(A) is a degree of belief in A being true
  - We may say “I believe to the extent of P(A) that A is true”
  - Contrast with the frequentist interpretation: there P(A) is the proportion of times that A turns out to be true
  - Handles both aleatory and epistemic uncertainty
P(A | B) = P(B | A)P(A) / P(B) , ∀A, B, with P(B) > 0
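A small worked example of Bayes' rule; the numbers are hypothetical and chosen only for illustration:

```python
# Bayes' rule with hypothetical numbers: a test with 99% sensitivity,
# 95% specificity, and a 1% prevalence of the condition A.
p_A = 0.01                      # prior P(A)
p_B_given_A = 0.99              # likelihood P(B | A), B = "test is positive"
p_B_given_notA = 0.05           # false positive rate P(B | not A)

p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)   # total probability
p_A_given_B = p_B_given_A * p_A / p_B                  # Bayes' rule
print(round(p_A_given_B, 3))    # ≈ 0.167: a positive test alone is weak evidence
```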
- So far, we have selected one model that is in some sense “best”
- E.g., using empirical risk minimization, which often corresponds to maximum likelihood: arg max_θ P(D | θ)
- Bayesian methods don't return a single model but a probability distribution over models
- Summarize the current knowledge (or lack of knowledge) in a prior distribution P(θ)
- Update the belief based on observed data D
- Output: posterior distribution P(θ|D) that summarizes the knowledge after observing D
Bayesian inference returns a posterior distribution over model parameters, but how do we use this for prediction? The answer is that we take a weighted average of predictions over all models:
![](images/image3.png)

Note: Instead of predicting a single label, we have a distribution over all possible labels
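In symbols, a standard form of the posterior predictive distribution (the figure above may use slightly different notation):

P(y | x, D) = ∫ P(y | x, θ) P(θ | D) dθ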
- Likelihood: P(D|θ) is the probability of observing the particular data D given that the data was generated from a model with parameters θ
  - “Loss”
- Prior: P(θ) is the probability distribution of the parameters before seeing data
  - Does not depend on data
  - Is our subjective uncertainty about the parameters before we see any data (we might have some other information, though)
  - Epistemic uncertainty
  - “Regularization”/Inductive bias
- Posterior: P(θ|D) is the probability of the parameters given the data
  - Probability of a model after seeing data
  - Our prior beliefs are updated based on the data we see
- Learn features from data (Construct complex features using lots of simple neurons)
Artificial Neurons:
- Multiply each input with a corresponding weight
- Take a sum of these products and feed it into a (non-linear) activation function
- Output the result
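A minimal sketch of a single artificial neuron, assuming NumPy and a sigmoid activation (the particular numbers are arbitrary):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neuron(x, w, b):
    # Multiply each input with its weight, sum them, add the bias,
    # then feed the result through a non-linear activation.
    return sigmoid(np.dot(w, x) + b)

x = np.array([0.5, -1.0, 2.0])   # inputs
w = np.array([0.8, 0.2, -0.4])   # weights
b = 0.1                          # bias
print(neuron(x, w, b))
```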
![](images/image4.png)

![](images/image5.png)

Forward pass: performs inference (predictions)
Backward pass: performs learning (finding the right weights and biases)
- Stochastic gradient descent is an optimization algorithm for minimizing the loss of a predictive model with regard to a training dataset.
- Back-propagation is an automatic differentiation algorithm for calculating gradients for the weights in a neural network graph structure.
- Stochastic gradient descent and the back-propagation of error algorithms together are used to train neural network models.
An efficient algorithm for computing the gradient (i.e., for determining how a single training example would like to nudge the weights and biases).
- First compute the gradient for the last set of weights
- Then propagate the error to the previous layer and compute the gradient for the next set of weights
- Repeat this for the rest of the layers
A stochastic gradient descent step (sketched in code after this list):
- Sample a minibatch of m training examples
- Compute the gradient over the samples in the minibatch
  - Forward pass to compute the loss
  - Backward pass to compute the gradient of the loss
- Update the weights
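A minimal NumPy sketch of one SGD step with back-propagation for a tiny one-hidden-layer network (the architecture, loss, and names are illustrative assumptions, not a reference implementation from the course):

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny network: 2 inputs -> 3 hidden units (tanh) -> 1 linear output
W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
W2, b2 = rng.normal(size=(1, 3)), np.zeros(1)
lr = 0.1

# Sample a minibatch of m training examples (toy regression data)
X = rng.normal(size=(4, 2))                 # m = 4 inputs
y = (X[:, :1] + X[:, 1:]).copy()            # targets

# Forward pass: compute the loss
h = np.tanh(X @ W1.T + b1)                  # hidden activations, shape (4, 3)
y_hat = h @ W2.T + b2                       # predictions, shape (4, 1)
loss = ((y_hat - y) ** 2).mean()

# Backward pass: propagate the error from the output layer backwards
d_out = 2 * (y_hat - y) / len(X)            # dLoss/dy_hat
dW2 = d_out.T @ h                           # gradient for the last set of weights
db2 = d_out.sum(axis=0)
d_h = (d_out @ W2) * (1 - h ** 2)           # error propagated through tanh
dW1 = d_h.T @ X                             # gradient for the previous weights
db1 = d_h.sum(axis=0)

# Update the weights (one SGD step)
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
print("loss before the step:", loss)
```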
Simple and widely studied models
Often computationally convenient
Pros:
- Adds flexibility
Cons:
- If dimensionality grows, then also computational requirements grow
- How to choose appropriate basis functions?
Use domain knowledge to construct transformations of the original features.
Can be achieved by basis functions.
Often critical to the performance of the machine learning algorithms
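A small illustration of a basis-function transformation, assuming NumPy and made-up quadratic data:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-2, 2, size=50)
y = 1.0 + 0.5 * x + 2.0 * x**2 + rng.normal(0, 0.1, size=50)   # quadratic target

# Basis functions phi(x) = (1, x, x^2): the model stays linear in the parameters
Phi = np.column_stack([np.ones_like(x), x, x**2])
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)    # ordinary least squares
print(w)                                        # close to [1.0, 0.5, 2.0]
```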
How good is my model?
What if we consider only the error on the training data?
- A simple model will not be able to fit perfectly to the training data
- A complex model typically fits better
- For example, in polynomial regression increasing the degree of the polynomial never decreases the fit
- A polynomial with the degree n - 1 will fit perfectly to n data points
The best-fitting model is not necessarily the best-predicting model.
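For instance, a small NumPy illustration of the n-point claim above, with made-up data:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])          # n = 5 data points
y = np.array([1.0, 0.5, 3.0, 2.0, 5.0])

coeffs = np.polyfit(x, y, deg=len(x) - 1)        # degree n - 1 = 4
print(np.allclose(np.polyval(coeffs, x), y))     # True: zero training error
```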
Bias: Error due to modeling assumptions.
Variance: Error due to variations in the training set.
Simple models tend to have high bias and low variance
Complex models tend to have low bias and high variance
Goal: Low bias and low variance
Using standard linear regression is problematic for predicting categorical responses; instead, we can use a variant called logistic regression. The predicted Y lies within the range 0 to 1.
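A minimal sketch of the logistic model, assuming NumPy (the weights are placeholders, not fitted values):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(X, w, b):
    # The linear score X·w + b is squashed through the sigmoid,
    # so the predicted probability always lies between 0 and 1.
    return sigmoid(X @ w + b)

X = np.array([[0.5, 1.2], [-1.0, 0.3]])
w, b = np.array([0.8, -0.4]), 0.1
print(predict_proba(X, w, b))        # values in (0, 1)
```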
![](images/image6.png)

Inner nodes are associated with a test
- Usually, the test involves only one feature
- Outgoing edges correspond to possible test results
A child is selected based on the test result.
Leaf nodes make a prediction
- Classification: assign a class label to an object
- Regression: assign a numerical value
- Predictions can be also probabilistic
To classify an object, the tree is traversed from the root to a leaf.
Decision trees can approximate any function of the input features with arbitrary precision.
If there are no duplicates with different class labels, there is a consistent decision tree for every training set with one path to a leaf for each data point
- Can be large
- Probably does not generalize well
Prefer more compact decision trees (Ockham’s razor)
Finding an optimal decision tree (the one that minimizes the expected number of tests) is NP-hard
Trees are typically learned using greedy heuristics
- Start with an empty tree
- Greedily, select a feature to split
- Recurse
Typically, one wants to divide the space into “pure” parts. In other words, one wants to reduce uncertainty. That is, choose the feature which gives the highest information gain.
Pros:
- Fast (both to learn and predict)
- Easy to interpret
- Invariant to scaling
- Can handle both categorical and continuous data
- Implicit feature selection
Cons:
- Unstable
- Due to greedy learning, small differences in training data can result in totally different trees
- Usually competitive accuracy only in an ensemble
- Creates biased trees if some class dominates
- Balance data before learning
Entropy measures average information content that is obtained from a stochastic information source. More uncertainty -> more information content.
High entropy:
- Lots of uncertainty
- Nearly uniform distribution
Low entropy:
- Less uncertainty
- Outcomes have probabilities near 0 or 1
![](images/image7.png)
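A short sketch of entropy and information gain for a categorical split (NumPy; the data and names are illustrative):

```python
import numpy as np

def entropy(labels):
    # H = -sum_c p_c * log2(p_c): the average information content
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def information_gain(labels, feature_values):
    # Entropy before the split minus the weighted entropy after it
    gain = entropy(labels)
    for v in np.unique(feature_values):
        mask = feature_values == v
        gain -= mask.mean() * entropy(labels[mask])
    return gain

y = np.array([0, 0, 1, 1, 1, 0])
x = np.array(["a", "a", "b", "b", "b", "a"])
print(entropy(y), information_gain(y, x))   # 1.0 and 1.0: x splits y perfectly
```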

Iterative Dichotomiser 3
- If all data points have the same label
  - Return a leaf with that label
- Else if all data points have identical feature values
  - Return a leaf with the most common label
- Else
  - Choose a feature that maximizes the information gain
  - Add a branch for each value of the feature
  - For each branch
    - Call the algorithm recursively for the data points with the particular feature value
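A compact recursive sketch of the procedure above for categorical features (pure Python; the data layout and names are illustrative assumptions):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def id3(X, y, features):
    """X: list of dicts {feature: value}, y: list of class labels."""
    if len(set(y)) == 1:                        # all points have the same label
        return y[0]
    if not features:                            # no features left to split on
        return Counter(y).most_common(1)[0][0]  # most common label

    def gain(f):                                # information gain of splitting on f
        g = entropy(y)
        for v in {x[f] for x in X}:
            idx = [i for i, x in enumerate(X) if x[f] == v]
            g -= len(idx) / len(y) * entropy([y[i] for i in idx])
        return g

    best = max(features, key=gain)              # feature maximizing the gain
    tree = {best: {}}
    for v in {x[best] for x in X}:              # one branch per feature value
        idx = [i for i, x in enumerate(X) if x[best] == v]
        tree[best][v] = id3([X[i] for i in idx], [y[i] for i in idx],
                            [f for f in features if f != best])
    return tree

X = [{"outlook": "sunny", "windy": "no"}, {"outlook": "sunny", "windy": "yes"},
     {"outlook": "rainy", "windy": "no"}, {"outlook": "rainy", "windy": "yes"}]
y = ["yes", "yes", "yes", "no"]
print(id3(X, y, ["outlook", "windy"]))
```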
If a feature has more than two values, it is still common to split into only two branches
- Splits to several branches can be achieved by a sequence of two-way splits
Typically, ID3 overfits the training data.
- If there are no training points with identical feature values, ID3 will result in 100% training accuracy.
- Inductive bias: prefer smaller trees
- Set a minimum size for a leaf or maximum depth for the tree
- Usually not enough
- Early stopping (pre-pruning)
- Stop recursion once information gain is non-positive
- Problematic: cannot handle e.g. XOR
- Pruning (post-pruning)
- Build a full tree, remove parts of it later
Divide the data into training and pruning (validation) data
- Use the training data to build a full decision tree
- For each subtree T
- If replacing the subtree with the majority label in T does not decrease accuracy on the pruning data
  - Replace T with a leaf node that predicts the majority class
Regression trees:
- Predict values y using the points associated with the leaf
  - Typically, by averaging
- Learning
  - Small modifications to the algorithm
  - Instead of the impurity measures above, use squared error
  - Predict with the mean value for the data subset
  - Information gain from a split is the squared error before the split minus the squared error after the split
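A small sketch of the squared-error criterion for a regression split (NumPy; illustrative data):

```python
import numpy as np

def sse(y):
    # Squared error of predicting the subset with its mean value
    return ((y - y.mean()) ** 2).sum()

def split_gain(y, mask):
    # "Information gain": squared error before the split
    # minus squared error after the split
    return sse(y) - sse(y[mask]) - sse(y[~mask])

y = np.array([1.0, 1.2, 0.9, 4.0, 4.2, 3.8])
mask = np.array([True, True, True, False, False, False])
print(split_gain(y, mask))   # large gain: the split separates low from high values
```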
Goal: Find intrinsic structure in unlabeled data.
A cluster is a set of items that are similar to each other and differ from the items in other clusters.
Why clustering?
- Gain understanding about data
- “Classification without labels”
- Preprocessing
- Quantization (lossy compression)
- Reducing the number of data points
- Applications
- Marketing: find customer segments
- Biology: reconstruct evolutionary history of species
Cluster characterization:
- Centroid models
- A cluster is represented by a single mean vector
- Distribution models
- Clusters are modeled with statistical distributions
- Connectivity models:
- Build clusters based on distance connectivity
- Density models:
- A cluster is a set of points connected by dense regions
Probably the most famous and commonly used clustering method. A cluster is represented by a centroid. The goal is to put similar points in the same cluster by minimizing the within-cluster variance.
Assumption: there exists exactly k clusters.
Lloyd’s algorithm:
- Pick initial cluster centroids
- Repeat until nothing changes
  - Assign each data point to the cluster whose centroid is closest to that point
  - Update the centroids: the new centroid of a cluster is the mean of the data points assigned to that cluster
Note that generally the centroids are not data points. However, we usually initialize cluster centroids with observed data points.
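A compact NumPy sketch of Lloyd's algorithm as described above (initialization and names are illustrative; the sketch ignores the empty-cluster edge case):

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Initialize the centroids with k observed data points
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each data point to the cluster whose centroid is closest
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update the centroids: mean of the points assigned to each cluster
        # (a production implementation would also handle empty clusters)
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break                                  # nothing changed: done
        centroids = new_centroids
    return centroids, labels

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, size=(20, 2)), rng.normal(3, 0.3, size=(20, 2))])
centroids, labels = kmeans(X, k=2)
print(centroids)                                   # roughly (0, 0) and (3, 3)
```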
Definition: A model performs well on training data but generalizes poorly.
Potential causes:
- Noise (irrelevant information or randomness in data)
- Too flexible (complex) model
- Too little data
- Ensemble methods combine several models to produce more accurate ones
- Bagging reduces variance
- Boosting reduces bias
- Random forests and (gradient) boosting are good choices for quick-and-dirty classifiers in many domains
Setup
- Generate a collection (ensemble) of classifiers (or regressors) from a given data set. Each classifier can be “weak”
- Each classifier is given a different data set. Data sets can be generated by resampling or weighting.
- The final classification is defined by voting
Nice property
- Each weak classifier can be simple with large error; the only requirement is that they are better than random
- Still, an ensemble can perform very well
Diagnostics:
- High bias, low variance → underfitting
- Low bias, high variance → overfitting
- To reduce error, one can either reduce bias or variance
Two approaches:
- Bagging: reduce variance
- Boosting: reduce bias
Bagging
Bootstrap aggregating
- Does not change bias
- Need to have low bias base models
- Less need to worry about overfitting because averaging takes care of it
- Bagging works best with unstable (high-variance) models, e.g., decision trees
- Easily parallelizable
Weaknesses
- Loss of interpretability
  - For example, with random forest the final classifier is not a tree
- Computational complexity
  - You need to learn lots of models
- Classifiers need to be independent
Boosting
- Create a simple classifier (“rule of thumb”) based on a subset of examples
- Repeat T times
How to choose examples at each round?
- Concentrate on “hardest” examples (those most often misclassified by the previous rules of thumb)
How to combine rules of thumb into a single prediction rule?
- Take (weighted) majority vote of rules of thumb
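As an illustration with scikit-learn (an assumption; the notes do not mandate a library), bagging- and boosting-style ensembles of decision trees:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Bagging-style ensemble: averaging many deep (low-bias, high-variance) trees
rf = RandomForestClassifier(n_estimators=100, random_state=0)
# Boosting: a sequence of weak learners, each focusing on previous mistakes
ada = AdaBoostClassifier(n_estimators=100, random_state=0)

print(cross_val_score(rf, X, y, cv=5).mean())
print(cross_val_score(ada, X, y, cv=5).mean())
```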
- Bayesian models provide a principled way to handle uncertainty
- In Bayesian inference, one obtains a posterior distribution of parameters by updating prior beliefs
- Simple Bayesian models for classification include naive Bayes
Strengths
- Provides a principled way to combine prior information and data
- Handles missing data naturally
- Gives the possibility to define a prior
- Assumptions become transparent
- Possibility to include prior information
Weaknesses
- Computationally intensive
- We have seen only “easy” models, so our view of the computational challenges may be biased
- Typically, there are no closed-form solutions and one has to resort to computationally intensive approximations
- Does not tell how to select a prior (and that selection can heavily affect the results)
Goal: represent data points with fewer features
Curse of dimensionality
- Various phenomena that occur in high-dimensional spaces (but not in low dimensions)
- Nearest neighbors will be far away
- Need more data for accurate predictions
What is dimensionality reduction?
- Feature selection
- Select a subset of features and ignore the others
- Feature extraction
- Create a small number of new features based on the original features and ignore the original ones
Autoencoders
- A neural network that tries to copy its input to its output
- Neural networks that are used to learn low-dimensional representation of data
Principal component analysis (PCA)
- Learn the most interesting directions (new basis functions) from the data
- Interesting direction = lots of variation
- Represent data using these directions (change of basis)
- Throw away “uninteresting” directions without much loss of information. This gives a lower dimensional representation
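A minimal PCA sketch using NumPy's SVD (illustrative; libraries such as scikit-learn provide this ready-made):

```python
import numpy as np

def pca(X, n_components):
    # Center the data, then find the directions with the most variation
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]          # new basis: top principal directions
    Z = Xc @ components.T                   # change of basis (lower-dimensional)
    explained = S[:n_components] ** 2 / (S ** 2).sum()
    return Z, components, explained

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5)) @ rng.normal(size=(5, 5))   # correlated features
Z, components, explained = pca(X, n_components=2)
print(Z.shape, explained)                   # (100, 2) and the variance ratios
```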