Supervised learning is a class of machine learning algorithms which uses a set of known examples to come up with a mechanism to predict something about unknown examples. A supervised learning algorithm is trained using a pre-labeled training set and uses that knowledge to make predictions about data that wasn't in the training set.
Basic definitions:
- Classification is the process of taking some input $X$ and mapping it to some discrete label - for example identifying a disease based on a chest x-ray image or MRI.
- Regression is the process of mapping some input $X$ to a real number value - for example predicting home prices.
- Instances are vectors of values that define the input space of a problem.
- A concept is an idealized function that maps from the input space to an output.
- The target concept is the concept that our algorithm is trying to achieve.
- The hypothesis class is all of the functions we are willing to consider as a target concept.
- A sample (aka training set) is a set of inputs paired with a correct output.
- A candidate is a concept that might be the target concept.
- A testing set is a set of inputs paired with a correct output which is not included in the training set.
Cross validation is a method of training a machine learning model which attempts to get the best performing, most general model. We hold out some of the training set as a stand-in for the test data to determine how well our algorithm performs on unseen data. The general process is to split the data into $K$ folds. Then we train on $K - 1$ folds and use the held-out fold for validation. We can use different combinations of $K - 1$ folds to train different versions of the model, tune hyperparameters, etc.
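The fold-splitting process can be sketched in Python (a minimal sketch; the function names are illustrative, not from any library):

```python
def k_folds(data, k):
    """Split data into k roughly equal folds."""
    return [data[i::k] for i in range(k)]

def cv_splits(data, k):
    """Yield (train, validation) pairs, holding out one fold at a time."""
    folds = k_folds(data, k)
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation

# every sample lands in the validation set exactly once across the k splits
data = list(range(10))
for train, validation in cv_splits(data, 5):
    assert len(train) + len(validation) == len(data)
    assert set(train) | set(validation) == set(data)
```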
A decision tree is a very simple type of machine learning algorithm which is used to classify samples.

                  O
                /   \
              O       O
            / | \   / | \
          []  []  [] []  []  []

Each O node is a decision point which splits the samples based on a specific feature. The [] nodes represent classifications based on the samples with those particular feature sets. Consider the tree:
                  O
                /   \              f0: left x = 0, right x = 1
              O       O
            / | \   / | \          f1: left x = 0, mid x = 1, right x = 2
          [1] [2] [3] [4] [5] [6]
Then [1] would have all samples with features [0, 0, ...], [4] would have all samples with features
[1, 0, ...], etc. The number of levels of a decision tree does not need to match the number of features in the
samples - the input [0, 2, 1, 3, 4] would be at node [3] in the decision tree above. Each of those nodes would
then correspond to a class - so classifying a sample is as simple as finding the matching leaf in the decision tree
and returning the associated class.
Decision trees are very useful classification tools, but they need to be constructed carefully because they can be
very space inefficient. A "naive" decision tree for an N-feature boolean problem could take $O(2^N)$ nodes to represent.
The general algorithm behind constructing a decision tree given a training set is known as ID3. The basic idea, for a particular set of sample data, is:
- Select the best attribute.
- Split the samples by the attribute.
- Repeat for the sample subsets until the sample subset is "pure" (ie all the samples in the subsets are of the same class) or for some predetermined set of iterations.
Or in code:

```python
def id3(node, samples):
    if is_pure(samples):
        # if all the samples have the same label
        node.label = get_label(samples)
        return node
    elif should_stop(node):
        # if an external stopping criterion is met
        node.label = get_majority_label(samples)
        return node
    # get the most informative attribute based on the samples
    attr = best_attribute(samples)
    node.decision_attr = attr
    for val in possible_values(attr):
        # for each value of the attribute, recursively build a child node
        # from the subset of samples that have that value
        child = id3(Node(), get_samples_where(samples, attr=attr, val=val))
        node.children.append(child)
    return node
```

Determining the "best" attribute is a bit tricky and there are a few ways to determine it. Generally speaking, the "best" attribute is defined as the attribute which is the most "informative" or which allows us to split the samples into "more pure" groups.
Note that this algorithm has a should_stop check which terminates the algorithm based on some external condition
and sets the leaf node to the most common label in the samples at that node. This stopping criterion is most
commonly based on a maximum depth parameter on the tree.
Information gain is a means of determining the most informative feature and is defined as:

$$Gain(S, A) = Entropy(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} Entropy(S_v)$$

where entropy is a function which defines the randomness of a sample set:

$$Entropy(S) = -\sum_{c} p_c \log_2 p_c$$

and $S_v$ is the subset of samples where attribute $A$ has value $v$. Note here that the expression $\frac{|S_v|}{|S|}$ weights each subset's entropy by the fraction of samples that fall into it. So information gain is the expected reduction in entropy from splitting the samples on attribute $A$ - the higher the gain, the more informative the attribute.
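The entropy and information-gain calculations can be sketched in Python (a minimal sketch; the helper names and the dict-based samples are illustrative):

```python
from math import log2
from collections import Counter

def entropy(labels):
    """H(S) = -sum_c p_c log2(p_c) over the class labels in the sample set."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(samples, labels, attr):
    """Entropy of the whole set minus the weighted entropy of each split."""
    n = len(labels)
    subsets = {}
    for s, y in zip(samples, labels):
        subsets.setdefault(s[attr], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in subsets.values())
    return entropy(labels) - remainder

# a perfectly informative attribute recovers all of the entropy
samples = [{"f": 0}, {"f": 0}, {"f": 1}, {"f": 1}]
labels = ["a", "a", "b", "b"]
assert entropy(labels) == 1.0
assert information_gain(samples, labels, "f") == 1.0
```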
ID3 generally has a bias towards generating trees with
- "Good" splits near the root
- Reasonably correct classifications (of the training data)
- Shorter trees (which is a product of the good splits near the root)
If the attributes are continuous we have to make some modifications - there are a lot of different ways to handle continuous attributes, for example defining ranges of values. For discrete valued attributes, an individual attribute should only appear once in a path to a leaf. For continuous valued attributes, an individual attribute can appear multiple times on a path to a leaf if the split criteria are different.
It is very easy for decision trees to overfit to training data - in decision trees overfitting happens when we create too complex of a tree based on the training data. We can avoid this in a few ways.
- Use cross validation to create the trees by generating a bunch of trees on different "folds" of training data and selecting the tree that performed best on the "validation set".
- Check the tree against a validation set before every new expansion and stop the tree construction once the validation accuracy has hit a specific target.
- Prune the tree after creation.
We can also use decision trees for regression problems, though we need to handle the leaves in a different way - perhaps by using the average of the values at the leaves. We also may need to redefine our sense of purity to include values that are similar to each other as opposed to strictly equal.
In classification problems we're trying to group data into distinct buckets based on features. However, in a regression problem we are more trying to come up with a continuous function which can map the features of a sample to the output value.
Linear regression is a method of regression where we try to fit a linear function (in 2 or more dimensions) to a set of points.
We can actually solve this problem analytically using calculus by defining an error function which describes the error between our ideal function and the sample points, then finding the parameters of the ideal function which minimizes that error function.
So given a set of sample points $(x_i, y_i)$, we want to find the weights $w$ of the function $\hat{y} = w^Tx$ which minimize the squared error:

$$E(w) = \sum_i (y_i - w^Tx_i)^2$$

where the constant (intercept) term is folded into $w$ by adding a constant feature $x_0 = 1$ to every sample.

Now this requires some linear algebra knowledge. We can simplify this expression by operating over the entire
matrix at once. Note that if $X$ is the matrix whose rows are the sample inputs and $y$ is the vector of outputs, then setting the derivative of $E$ with respect to $w$ to zero gives $X^TXw = X^Ty$. So our equation becomes:

$$w = (X^TX)^{-1}X^Ty$$

which is quite easy for a computer to solve.
Our fit function doesn't necessarily need to be a linear function though. We can choose any kind of function as
the basis - for example we can fit a polynomial by adding powers of the input ($x^2$, $x^3$, ...) as extra features and solving the same equation.
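The closed-form fit is a few lines of NumPy (a sketch on hypothetical noise-free points, so the weights are recovered exactly):

```python
import numpy as np

# sample points drawn from y = 2x + 1 with no noise (an assumption for this sketch)
xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = 2.0 * xs + 1.0

# design matrix with a column of ones so the intercept is folded into w
X = np.column_stack([np.ones_like(xs), xs])

# normal equation: solve (X^T X) w = X^T y rather than inverting explicitly
w = np.linalg.solve(X.T @ X, X.T @ ys)

assert np.allclose(w, [1.0, 2.0])  # intercept 1, slope 2
```

Using `np.linalg.solve` instead of computing $(X^TX)^{-1}$ directly is the numerically safer way to evaluate the same formula.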
A neural network is a class of machine learning models which attempts to use a system analogous to neurons in the brain to learn from data.
Neural networks are built off of a very simple unit called a perceptron. A perceptron is a function that models a very simple neuron.
A neuron receives input through signals at the "dendrites" - those inputs are weighted and combined together and, if the aggregate input reaches a certain activation threshold, the neuron fires an action potential which sends a neurotransmitter to other neurons. Essentially, a neuron is a function which takes the weighted sum of its inputs and passes that to a unit activation function.
A perceptron follows the idea of a neuron. Given some input vector $x$, a perceptron computes the weighted sum $\sum_i w_i x_i$ and outputs 1 if that sum meets an activation threshold $\theta$, and 0 otherwise.
We see that a perceptron defines a linear separating hyperplane. This means that a perceptron can classify some simple functions
like boolean AND:
|
|
o \ x
| \
o----o\-----
boolean OR:
|
|
x x
|\
o--\-x-----
boolean NOT:
|
-----x---|---o------
However, it can't classify non-linear functions like boolean XOR:
|
|
x o
|
o----x-----
To solve these kinds of problems we can use a network of perceptrons:

    x1 ----------w1----------\
      \                       \
       ===== ( AND ) ---w3--- (   ) -->
      /                       /
    x2 ----------w2----------/

    w1 = 1
    w2 = 1
    w3 = -2

Here x1 and x2 feed both the AND unit and the output unit; with these weights the output unit's input is $x_1 + x_2 - 2 \cdot AND(x_1, x_2)$, which (with a threshold of 1) computes XOR.
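This network can be checked exhaustively (a sketch; the unit thresholds - 2 for the AND unit, 1 for the output unit - are assumptions, since only w1, w2, w3 are given above):

```python
def step(activation, threshold=1):
    """A unit fires (outputs 1) when its weighted input reaches the threshold."""
    return 1 if activation >= threshold else 0

def xor_net(x1, x2):
    # hidden AND unit: fires only when both inputs are on (assumed threshold 2)
    h = step(x1 + x2, threshold=2)
    # output unit with the weights from above: w1 = 1, w2 = 1, w3 = -2
    return step(1 * x1 + 1 * x2 - 2 * h)

assert [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 1, 1, 0]
```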
The perceptron rule for training perceptrons is very simple. We know that we can calculate the output:

$$\hat{y} = \begin{cases} 1 & \text{if } \sum_i w_i x_i \geq 0 \\ 0 & \text{otherwise} \end{cases}$$

Note here that we have folded the threshold parameter $\theta$ into the weights as a bias weight $w_0$ with a constant input $x_0 = 1$. Then for every sample $(x, y)$ we calculate:

$$\Delta w_i = \eta (y - \hat{y}) x_i$$

Then we update each weight with $w_i = w_i + \Delta w_i$, where $\eta$ is the learning rate.
If the data is linearly separable - meaning there exists a hyperplane which can split the classes - this algorithm will always find a separating hyperplane in a finite number of iterations. However, if the data is not linearly separable, this algorithm will run forever.
We can also use gradient descent to train our perceptrons. In this method we use the sum of least squared error between the activation value of the perceptron and the expected class:

$$E(w) = \frac{1}{2}\sum_{(x, y)} (y - a)^2 \quad \text{where } a = \sum_i w_i x_i$$

We use the activation value instead of the "actual output" of the perceptron because the step function is not differentiable whereas the activation value is differentiable.

So then our update rule is:

$$\Delta w_i = \eta \sum_{(x, y)} (y - a) x_i$$
The update rules for these two approaches are very similar, but the gradient descent method is more robust against non-linearly separable data.
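The perceptron rule can be sketched as follows (a minimal sketch with the bias fold-in described above; the hyperparameters are illustrative):

```python
def train_perceptron(samples, eta=0.1, epochs=50):
    """Perceptron rule: w_i += eta * (y - y_hat) * x_i, with a folded-in bias weight."""
    n = len(samples[0][0])
    w = [0.0] * (n + 1)  # w[0] is the bias (the folded-in threshold)
    for _ in range(epochs):
        for x, y in samples:
            x = [1.0] + list(x)  # constant input x_0 = 1 for the bias weight
            y_hat = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else 0
            for i in range(len(w)):
                w[i] += eta * (y - y_hat) * x[i]
    return w

# boolean AND is linearly separable, so the rule converges to a separator
samples = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w = train_perceptron(samples)
predict = lambda x: 1 if w[0] + w[1] * x[0] + w[2] * x[1] >= 0 else 0
assert [predict(x) for x, _ in samples] == [0, 0, 0, 1]
```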
One problem with the gradient rule for training perceptrons is that the actual output of a perceptron is not a differentiable function since the step function is not differentiable. So we have to use the activation value instead, which is not optimal. To solve this problem we can use the sigmoid function instead of a step activation function to model perceptrons:

$$\sigma(a) = \frac{1}{1 + e^{-a}}$$

The sigmoid has the property that $\sigma(a) \to 0$ as $a \to -\infty$ and $\sigma(a) \to 1$ as $a \to \infty$, so it behaves like a smoothed, differentiable step function. Note that its derivative has the convenient form:

$$\sigma'(a) = \sigma(a)(1 - \sigma(a))$$
A neural network is a network of perceptrons typically arranged in a uniform multilayer pattern where each layer's outputs feeds into the next layer's inputs.
The input and output layers are exposed to the "world" and the rest of the layers are known as hidden layers. Each unit in the neural network is a perceptron whose activation is gated using a non-linear differentiable function - typically the sigmoid function, ReLU, tanh, etc. Thus we can actually represent an entire neural network (even an immensely complicated one) as a differentiable function and use that derivative to update the weights on each individual perceptron.
Backpropagation is an algorithm which uses the chain rule to more easily compute the derivatives of the units in the neural network. The general idea behind backpropagation is this:
Consider a perceptron unit $j$ which takes the outputs $o_i$ of the previous layer as inputs, weighted by $w_{j,i}$. Then:

$$z_j = \sum_i w_{j,i} o_i$$

and

$$o_j = \sigma(z_j)$$

             w_j1,i1
    ( i1 ) ----------\
             w_j1,i2  \
    ( i2 ) ----------- ( j1 ) --------\
             w_j1,i3  /                ( k1 )
    ( i3 ) ----------/   ( j2 ) ------/

To update an individual weight parameter in the neural network we need $\frac{\partial E}{\partial w_{j,i}}$, the derivative of the error with respect to that weight. Since $w_{j,i}$ only affects the error through $z_j$, the chain rule gives:

$$\frac{\partial E}{\partial w_{j,i}} = \frac{\partial E}{\partial z_j} \frac{\partial z_j}{\partial w_{j,i}}$$

And since $z_j = \sum_i w_{j,i} o_i$, we know $\frac{\partial z_j}{\partial w_{j,i}} = o_i$.

Let us define:

$$\delta_j = \frac{\partial E}{\partial z_j}$$

Note that the error at node $j$ only depends on the nodes $k$ in the next layer which consume its output $o_j$. Note that:

$$\frac{\partial E}{\partial o_j} = \sum_k \frac{\partial E}{\partial z_k}\frac{\partial z_k}{\partial o_j} = \sum_k \delta_k w_{k,j}$$

And:

$$\frac{\partial o_j}{\partial z_j} = \sigma'(z_j)$$

So:

$$\delta_j = \sigma'(z_j) \sum_k w_{k,j} \delta_k$$

And finally:

$$\frac{\partial E}{\partial w_{j,i}} = \delta_j o_i$$

This may look a bit messy, but it's actually quite elegant when considered recursively. Note that for the
output nodes we can compute $\delta$ directly from the prediction error, and every hidden node's $\delta$ is just a weighted sum of the $\delta$ values in the layer after it. So the algorithm is:
- Get the prediction for the network. In the process keep track of the inputs $z$ for every node.
- Calculate the error at the output node: $\delta_{n} = \hat{y} - y$.
- For every layer starting from $n-1$, calculate the weight updates using the $\delta_k$ values from the layer after it.
- Update the weights using $w_{j,i}' = w_{j, i} - \eta \frac{\partial E}{\partial w_{j,i}}$
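The delta recursion can be checked numerically on a tiny network (a sketch with one hidden layer; the shapes and the error $E = \frac{1}{2}(\hat{y} - y)^2$ are assumptions for this example):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, W2):
    """Forward pass, keeping the inputs z to each layer for the backward pass."""
    z1 = W1 @ x
    h = sigmoid(z1)
    z2 = W2 @ h
    y_hat = sigmoid(z2)
    return z1, h, z2, y_hat

def backprop(x, y, W1, W2):
    """delta at the output, then delta_j = sigma'(z_j) * sum_k w_kj delta_k."""
    z1, h, z2, y_hat = forward(x, W1, W2)
    delta2 = (y_hat - y) * y_hat * (1 - y_hat)       # output-layer delta
    delta1 = (W2.T @ delta2) * h * (1 - h)           # hidden-layer deltas
    return np.outer(delta1, x), np.outer(delta2, h)  # dE/dW1, dE/dW2

# sanity check one weight's gradient against a central finite difference
rng = np.random.default_rng(0)
x, y = rng.normal(size=3), 1.0
W1, W2 = rng.normal(size=(4, 3)), rng.normal(size=(1, 4))
g1, g2 = backprop(x, y, W1, W2)

eps = 1e-6
E = lambda a, b: 0.5 * (forward(x, a, b)[3][0] - y) ** 2
W1p, W1m = W1.copy(), W1.copy()
W1p[0, 0] += eps
W1m[0, 0] -= eps
numeric = (E(W1p, W2) - E(W1m, W2)) / (2 * eps)
assert abs(g1[0, 0] - numeric) < 1e-6
```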
With backpropagation and powerful computers it's very easy to design complicated neural networks that (in theory) can learn very complicated functions. And while we can use gradient descent and backpropagation to train the network, gradient descent can get stuck in local minima, so we can use different optimization techniques like momentum terms, calculating higher order derivatives, randomized optimization algorithms, and adding regularization parameters to penalize complexity.
Instance based learning algorithms are algorithms that "memorize" the training data and, when presented with a new sample, use only the training data it's memorized to make an inference about the new sample's output value.
The K Nearest Neighbors (kNN) algorithm is a very simple, but powerful, instance based learning algorithm. This algorithm
takes in the training data and, when presented with a test sample, uses the values of the $k$ "nearest" training points to infer an output - for example by majority vote for classification or by averaging for regression.

This kNN algorithm is very simple, but a lot of that is because it leaves a lot up to the designer. The designer
chooses the value for $k$, the distance metric which defines "nearest", and the way the neighbors' values are combined into an output.
Note that kNN is a lazy learning algorithm. This means that it only does work when it has to. "Training" is a constant time operation since it just has to memorize a set number of data points, but the real work happens on query time. This is different from other algorithms like decision trees and neural networks which are eager learners - they frontload the work at training time, but queries take little to no work.
kNN has a very specific preference bias. kNN assumes that locality implies similarity, that regression functions are smooth, and that each feature is equally important. That last point runs into the curse of dimensionality: as the number of features grows, the amount of data we need to generalize properly grows exponentially.
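A minimal kNN classifier, with the designer choices made concrete as squared Euclidean distance and majority vote (both are just one option among many):

```python
from collections import Counter

def knn_classify(train, query, k):
    """Majority vote among the k training points closest to the query."""
    dist = lambda a, b: sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    neighbors = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# two well-separated groups of labeled points
train = [((0, 0), "o"), ((0, 1), "o"), ((1, 0), "o"),
         ((5, 5), "x"), ((5, 6), "x"), ((6, 5), "x")]
assert knn_classify(train, (1, 1), k=3) == "o"
assert knn_classify(train, (5, 4), k=3) == "x"
```

Note that all the work happens inside the query call - "training" is just storing `train` - which is exactly the lazy-learning behavior described above.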
There is a special case of kNN where $k$ is the entire dataset and each neighbor's contribution is weighted by its distance to the query - this is known as locally weighted regression.
Consider the problem of detecting spam email. We can come up with a bunch of loose rules to help classify if an individual email is spam. For example,
- if the email only contains an image it is likely spam
- if the email contains some "blacklist" of words it is likely spam
- if the email is from someone in your contact list it is likely not spam
- if the email is just a URL it is likely spam
Each of these rules on its own is pretty weak - we can think of a number of reasonable, real life counterexamples for any of these rules individually. However, each rule individually is still better than randomly guessing.
Ensemble learning is a set of learning algorithms which uses the aggregation of the results of an "ensemble" of "weak" learning algorithms to come up with an output for a test sample.
The basic algorithm is:
- Learn over a subset of the data and generate a "rule"
- repeat step 1 for a set number of iterations
- Combine the generated rules into a more complex rule
This algorithm is quite straightforward though we need to decide how we pick subsets of the data, what kind of learner we use to generate a rule, and how we want to combine the rules together. And generally we find that if we do this right, using ensemble learning with less complex (or weak) learners performs better on validation and testing sets than a single complex learner.
Bagging is an ensemble learning algorithm which randomly samples by taking uniform sized random subsets of the data (with replacement), trains a weak learner on that subset, and combines the results by taking the mean of the individual weak learners. This is a very simple algorithm but in practice it can work quite well.
Boosting is an ensemble learning algorithm which:
- samples by trying to learn the "hardest" examples in a given iteration
- combines results using the weighted mean.
In boosting we define error as:

$$\epsilon = P_D[h(x_i) \neq y_i]$$

Essentially error is defined as the probability that a given sample is classified incorrectly, where $D$ is a probability distribution over the samples. We also define a weak learner as an algorithm where:

$$\forall D:\ \epsilon_D \leq \frac{1}{2} - \gamma \quad \text{for some } \gamma > 0$$

Basically a weak learner is a learner that must have an error rate that is less than 50% (at least slightly better than chance) for any possible probability distribution over the samples.
So the boosting algorithm is:

- Given training set $X$, $Y$ which is a binary classification problem. Note that $y \in Y$ is in the set $\{-1, 1\}$.
- Repeat for iteration $t$ until $T$:
    - construct probability distribution $D_t$. This is the distribution we will use to sample to train the weak learner.
    - find a weak learner classifier $h_t(x)$ by drawing from $D_t$, with small error $\epsilon_t = P_{D_t}[h_t(x_i) \neq y_i]$
- Output $H_{final}$.
We need to construct $D_t$ at each iteration. We start with the uniform distribution $D_1(i) = \frac{1}{n}$ and update it as:

$$D_{t+1}(i) = \frac{D_t(i)\, e^{-\alpha_t y_i h_t(x_i)}}{Z_t}$$

where

$$\alpha_t = \frac{1}{2} \ln\frac{1 - \epsilon_t}{\epsilon_t}$$

and $Z_t$ is a normalization constant chosen so that $D_{t+1}$ is a proper probability distribution. Note that since $y_i$ and $h_t(x_i)$ are both in $\{-1, 1\}$, the product $y_i h_t(x_i)$ is $+1$ when the learner is correct on sample $i$ and $-1$ when it is wrong.

So if a weak learner is correct for sample $i$, that sample's weight is scaled down by $e^{-\alpha_t}$. This happens because $\epsilon_t < \frac{1}{2}$ implies $\alpha_t > 0$. So if all of the samples agree with the learner, every weight shrinks by the same factor and normalization leaves the distribution unchanged. If any samples were not correctly classified, then their weights grow relative to the correctly classified samples, forcing the next weak learner to focus on the "hardest" examples.

If a weak learner is wrong for sample $i$, that sample's weight is scaled up by $e^{\alpha_t}$. Then the final output is:

$$H_{final}(x) = sgn\left(\sum_t \alpha_t h_t(x)\right)$$

So we weight each of the weak learners by how much error it has - low-error learners get large $\alpha_t$ and contribute more to the final vote.
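One round of this reweighting can be sketched directly from the formulas (the predictions are hypothetical, and `boost_round` is illustrative, not a full AdaBoost implementation):

```python
from math import log, exp

def boost_round(D, predictions, labels):
    """One boosting round: compute alpha_t and the reweighted distribution D_{t+1}.
    Assumes labels/predictions are in {-1, 1} and 0 < epsilon < 1."""
    epsilon = sum(d for d, p, y in zip(D, predictions, labels) if p != y)
    alpha = 0.5 * log((1 - epsilon) / epsilon)
    # correct samples (y * h = +1) are down-weighted, incorrect ones up-weighted
    new_D = [d * exp(-alpha * y * p) for d, p, y in zip(D, predictions, labels)]
    z = sum(new_D)
    return alpha, [d / z for d in new_D]

# four samples, uniform initial distribution; the weak learner gets one wrong
D = [0.25] * 4
labels = [1, 1, -1, -1]
predictions = [1, 1, -1, 1]  # wrong on the last sample
alpha, D2 = boost_round(D, predictions, labels)
assert alpha > 0                   # error < 0.5, so the learner gets positive weight
assert D2[3] == max(D2)            # the misclassified sample is now the "hardest"
assert abs(sum(D2) - 1.0) < 1e-12  # D_{t+1} is a proper distribution
```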
Support vector machines (SVMs) are a class of machine learning classification algorithms which attempt to find the "best" hyperplane to classify linearly separable data.
Consider the data below:
|
| x x
| o x
| o o x
| o
-----------------------
Since this data is, clearly, linearly separable we could use a perceptron to find a line (or hyperplane in the n-dimensional case) that classifies the points. However, the perceptron algorithm makes no guarantees about which dividing hyperplane it will select based on some training data.
|
| |x x
| o | x
| o o | x
| o |
-----------------------
and
|
| \ x x
| o \ x
| o o \ x
| o \
-----------------------
are both completely valid hyperplanes to separate the training data. However, intuitively, it's easy to understand that not all hyperplanes are created equal. A hyperplane which is "too close" to one class of data adds an overfitting bias to the algorithm since it adds additional constraints based on the training data. So the "best" hyperplane is the hyperplane that is farthest away from any of the training data.
So the SVM algorithm attempts to find a hyperplane that separates classes of the training data while also maximizing the distance between the hyperplane and the training data points.
So consider the equation of a hyperplane:

$$w^Tx + b = 0$$

where $w$ is a vector normal to the hyperplane and $b$ is an offset term.
Then let us classify the "upper" set of samples with class $+1$ and the "lower" set of samples with class $-1$ and define 3 parallel lines:
| l2 l3
| l1 \ \
| \ \ \ +1
| \ \ \+1 +1
| -1\ \ \ +1
---------------------
Where $l_2$ is the separating hyperplane $w^Tx + b = 0$, and $l_1$, $l_3$ are the parallel hyperplanes $w^Tx + b = -1$ and $w^Tx + b = 1$ which pass through the closest samples of each class. And we want to find the parameters $w$, $b$ which maximize the distance (the "margin") between $l_1$ and $l_3$. Consider a point $x_1$ on $l_1$ and a point $x_2$ on $l_3$:
| l2 l3
| l1 \ \
| \ \ x2
| x1 \ \
| \ \ \
---------------------
Then we can subtract the two hyperplane equations: $(w^Tx_2 + b) - (w^Tx_1 + b) = 1 - (-1)$, so $w^T(x_2 - x_1) = 2$. Recall that $w$ is normal to the hyperplanes, so projecting the difference onto the unit normal gives the margin:

$$m = \frac{w^T(x_2 - x_1)}{\|w\|} = \frac{2}{\|w\|}$$

Also note that in a hyperplane only the direction of $w$ matters for classification - scaling $w$ and $b$ together doesn't move the decision boundary - which is what lets us fix the values $\pm 1$ on $l_1$ and $l_3$.

So a support vector machine attempts to classify the points correctly (minimize the error) while maximizing the margin. In other words we want to keep:

$$y_i(w^Tx_i + b) \geq 1 \quad \forall i$$

while minimizing $\|w\|$. Note that maximizing $\frac{2}{\|w\|}$ is the same as minimizing $\frac{1}{2}\|w\|^2$. This works because $\|w\|$ is positive and squaring is monotonic, and the squared form is easier to optimize.
Through quadratic programming, this optimization problem transforms into maximizing:

$$W(\alpha) = \sum_i \alpha_i - \frac{1}{2}\sum_{i,j} \alpha_i \alpha_j y_i y_j x_i^Tx_j$$

such that

$$\alpha_i \geq 0$$

and

$$\sum_i \alpha_i y_i = 0$$

This is also known as the dual form of the problem.
Once we know the $\alpha_i$ values we can recover the hyperplane:

$$w = \sum_i \alpha_i y_i x_i$$

(and we can pull out $b$ from any point where $y_i(w^Tx_i + b) = 1$). Note that most of the $\alpha_i$ values are 0 - the only non-zero $\alpha_i$ belong to the points closest to the decision boundary, which are called the support vectors. Also note that the samples only enter the optimization through the dot product $x_i^Tx_j$, which can be thought of as a measure of similarity between the points.
This SVM method is great for linearly separable data but completely falls apart on non-linearly separable data. Consider data that looks like:
| x
| x o x
| x o o x
| x o o x
| x o x
| x x
-----------------------
This data is clearly not linearly separable, but we can see that a circular decision boundary could separate this
data. And we can do that by replacing the data points $x$ with projections $\Phi(x)$ into a higher dimensional space - for example $\Phi(x) = (x_1^2, \sqrt{2}x_1x_2, x_2^2)$ - where the data becomes linearly separable.

More interestingly, we don't always need to actually project $x$ into the higher dimensional space. Consider the dot product of two projected points:

$$\Phi(a)^T\Phi(b) = a_1^2b_1^2 + 2a_1a_2b_1b_2 + a_2^2b_2^2$$

Which can be factored into:

$$(a_1b_1 + a_2b_2)^2$$

Which is the same as:

$$(a^Tb)^2$$

So we can actually do the whole higher order function without taking the time and space to project the vectors into a higher dimensional space. Thus we can replace the $x_i^Tx_j$ term in the dual form with a kernel function $K(x_i, x_j)$ which implicitly computes the dot product in some higher dimensional space.
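This kernel identity can be checked numerically (the feature map $\Phi$ here is the standard one for the squared dot product kernel):

```python
import numpy as np

def project(v):
    """Explicit projection to the quadratic feature space (x1^2, sqrt(2) x1 x2, x2^2)."""
    x1, x2 = v
    return np.array([x1 * x1, np.sqrt(2) * x1 * x2, x2 * x2])

def kernel(a, b):
    """K(a, b) = (a . b)^2 -- computed without ever projecting."""
    return float(np.dot(a, b)) ** 2

a, b = np.array([1.0, 2.0]), np.array([3.0, -1.0])
# the kernel equals the dot product taken in the projected space
assert np.isclose(kernel(a, b), np.dot(project(a), project(b)))
```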
Information theory is a subfield of mathematics which has very useful applications to computer science. Basically, information theory provides us with a mathematical framework to determine whether input vectors are "similar" (mutual information) and whether specific features are informative (entropy).
More generally entropy is defined as the amount of information encoded in a specific variable. We can say that:

$$H(X) = -\sum_x P(x) \log_2 P(x)$$

or that the entropy of some variable $X$ is the expected number of bits needed to encode its value, so:

$$H(X) = E[-\log_2 P(x)]$$

Recall that $-\log_2 P(x)$ is large for improbable values and zero for certain ones, so entropy is small if a variable is constrained (is highly likely to be in a small number of states) and is large if the variable has a (close to) equal likelihood of being in a lot of states.
Often times we want to know how much information is encoded in multiple variables together. So we can either calculate the entropy of the joint distribution:

$$H(X, Y) = -\sum_{x, y} P(x, y) \log_2 P(x, y)$$

Or we can calculate the entropy of one variable conditioned on the other:

$$H(Y | X) = -\sum_{x, y} P(x, y) \log_2 P(y | x)$$

This can also be seen as the extra information needed to describe $Y$ once $X$ is known. From Bayes rule, $P(x, y) = P(x)P(y | x)$, so:

$$H(X, Y) = H(X) + H(Y | X)$$

Note that if $X$ and $Y$ are independent:

- $H(X, Y) = H(X) + H(Y)$
- $H(Y | X) = H(Y)$
Conditional entropy is a very useful metric but doesn't necessarily tell us everything. For example, it doesn't directly tell us how much knowing $X$ reduces our uncertainty about $Y$. Mutual information measures exactly that:

$$I(X; Y) = H(Y) - H(Y | X)$$

so basically mutual information is a measure of the reduction in randomness of a variable when it is conditioned on
another variable. Note that if $X$ and $Y$ are independent, $I(X; Y) = 0$.
KL Divergence measures how different two probability distributions are (it is not a true distance since it is not symmetric). It is given as:

$$D_{KL}(p \| q) = \sum_x p(x) \log_2\frac{p(x)}{q(x)}$$
Note that if the distributions are the same, the KL divergence is 0.
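These definitions translate directly to code (a sketch over discrete distributions given as probability lists; zero-probability terms are skipped since they contribute nothing):

```python
from math import log2

def entropy(p):
    """H(X) = -sum_x P(x) log2 P(x)."""
    return -sum(px * log2(px) for px in p if px > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = sum_x p(x) log2(p(x) / q(x))."""
    return sum(px * log2(px / qx) for px, qx in zip(p, q) if px > 0)

# a fair coin has maximal entropy; a certain outcome has none
assert entropy([0.5, 0.5]) == 1.0
assert entropy([1.0, 0.0]) == 0.0

# KL divergence is zero for identical distributions and positive otherwise
assert kl_divergence([0.5, 0.5], [0.5, 0.5]) == 0.0
assert kl_divergence([0.9, 0.1], [0.5, 0.5]) > 0.0
```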
Computational learning theory is a sub-field of computer science and machine learning which seeks to define new learning problems, show why and how certain learning algorithms work, and help us understand why some problems are fundamentally hard to solve. When thinking about learning algorithms we generally care about:
- probability of successful training: $p = 1 - \delta$
- number of samples to train on: $m$
- complexity of the hypothesis class
- accuracy to which target concept is approximated: $\epsilon$
- manner in which training examples are presented
- manner in which training examples are selected
Computational complexity is the amount of computational effort that is needed for a learner to converge.
Sample complexity is the number of training iterations needed for a learner to create a successful hypothesis.
Mistake bounds are the number of misclassifications a learner can make over an infinite run.
Given a true hypothesis (target concept) $c$, we say a candidate hypothesis $h$ is consistent with a set of training examples if $h(x) = c(x)$ for every training example $x$.

A version space is essentially the set of hypotheses in a hypothesis space which are consistent with the data given.
The training error of a hypothesis $h$ is the fraction of training examples it misclassifies. We also define the true error $error_D(h)$ as the probability that $h$ misclassifies a sample drawn from the underlying distribution $D$.

Then we say a concept class $C$ is PAC-learnable by a learner $L$ using hypothesis space $H$ if and only if $L$ will, with probability $1 - \delta$, output a hypothesis $h \in H$ with $error_D(h) \leq \epsilon$, using time and samples polynomial in $\frac{1}{\epsilon}$, $\frac{1}{\delta}$, and the size of the hypothesis space.

A version space is $\epsilon$-exhausted when every hypothesis remaining in it has true error at most $\epsilon$. We can then find that the number of training examples $m$ needed to $\epsilon$-exhaust a version space with probability $1 - \delta$ satisfies:

$$m \geq \frac{1}{\epsilon}\left(\ln|H| + \ln\frac{1}{\delta}\right)$$
VC dimensionality is a notion of computational learning theory which quantifies the expressive power of a hypothesis class. The VC dimensionality of $H$ is the size of the largest set of inputs that $H$ can label in all possible ways.

For example, consider the hypothesis class $h(x) = x \geq \theta$ over 1-dimensional inputs. A single point can be labeled both ways by choosing $\theta$:

    - | +
    -----------x-------theta----------

    - | +
    ------------theta-----x----------

But we can't produce all possible two-point labelings (+, +), (+, -), (-, +), (-, -) using a single threshold:

    - | +
    -------x----x-------theta----------

    - | + +
    ------------theta-----x------x----

    - | +
    -----x-------theta------x----------

We can produce (-, -), (+, +), and (-, +), but there is no way to label the left point + and the right point - since everything above $\theta$ is positive. This tells us that the VC dimensionality of this hypothesis class is 1.
This notion of "labeling in all possible ways" is called "shattering".
In general we find that for a hypothesis class that is represented by a d-dimensional hyperplane, the VC
dimensionality is $d + 1$.
But not all hypothesis classes have finite VC dimensions. A hypothesis class which returns convex polygons has an infinite VC dimension since a convex polygon can have an infinite number of sides and be pretty much any shape and therefore label any number of points.
We can use VC dimensionality to redefine the minimum amount of data required to learn an infinitely sized hypothesis class as:

$$m \geq \frac{1}{\epsilon}\left(8\, VC(H) \log_2\frac{13}{\epsilon} + 4 \log_2\frac{2}{\delta}\right)$$

Additionally, a hypothesis class is PAC-learnable if and only if its VC dimension is finite.
Bayesian learning is a way of learning the most probable hypothesis given the data and domain knowledge. Or we want to find:

$$h_{MAP} = argmax_h P(h | D)$$

Bayes Rule tells us that:

$$P(h | D) = \frac{P(D | h) P(h)}{P(D)}$$

This follows from writing the joint probability two ways: $P(h, D) = P(h | D)P(D)$ and $P(h, D) = P(D | h)P(h)$. Setting these equal and dividing by $P(D)$ gives the rule above. Now:
- $P(D)$ is the prior on the data. So this is our prior belief that this data will occur. Usually this term is just a normalizing factor.
- $P(D | h)$ is the probability of the training data given the hypothesis. Or more usefully it is the probability that the hypothesis is consistent with the input data. This is a binary value since the hypothesis can be empirically consistent or inconsistent with the input data.
- $P(h)$ is the prior on the hypothesis or our belief this hypothesis will occur.
Now $P(D)$ does not depend on $h$, so:

$$h_{MAP} = argmax_h P(D | h) P(h)$$

And if we assume a uniform prior over the hypotheses, we can drop $P(h)$ as well to get the maximum likelihood hypothesis:

$$h_{ML} = argmax_h P(D | h)$$

Now recall that we defined a version space as:

$$VS_{H,D} = \{h \in H \mid h(x_i) = y_i\ \forall (x_i, y_i) \in D\}$$

or the set of all hypotheses that are consistent with the data. And note that $P(D | h) = 1$ if $h$ is consistent with $D$ and 0 otherwise. We know that $P(D) = \sum_h P(D | h)P(h) = \frac{|VS_{H,D}|}{|H|}$ under the uniform prior. So:

$$P(h | D) = \frac{1 \cdot \frac{1}{|H|}}{\frac{|VS_{H,D}|}{|H|}} = \frac{1}{|VS_{H,D}|}$$

where $|VS_{H,D}|$ is the size of the version space. Then we know that for any consistent hypothesis the posterior is the same - given noise-free data and a uniform prior, every hypothesis in the version space is equally probable.
The above ideas are great if we have a completely noise free data set. But what if we have some data where each label has some noise attached:

$$y_i = h(x_i) + \epsilon_i \quad \epsilon_i \sim N(0, \sigma^2)$$

So the output of each datapoint is given by some hypothesis function with an additional error drawn from the normal distribution. We know from Bayes rule that we can find the best hypothesis by finding:

$$h_{ML} = argmax_h P(D | h)$$

In the noise-free dataset version, $P(D | h)$ was a binary value; here each sample instead contributes a Gaussian likelihood. And assuming the samples are independent:

$$P(D | h) = \prod_i P(y_i | h, x_i)$$

where

$$P(y_i | h, x_i) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - h(x_i))^2}{2\sigma^2}}$$

So

$$h_{ML} = argmax_h \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(y_i - h(x_i))^2}{2\sigma^2}}$$

Then, taking the log (which doesn't change the argmax):

$$h_{ML} = argmax_h \sum_i \left(\ln\frac{1}{\sqrt{2\pi\sigma^2}} - \frac{(y_i - h(x_i))^2}{2\sigma^2}\right)$$

Since that leading term is a constant that doesn't depend on $h$, we can drop it, and the constant factor $\frac{1}{2\sigma^2}$ doesn't change the argmax either. The result is:

$$h_{ML} = argmax_h \sum_i -(y_i - h(x_i))^2$$

which is the same as:

$$h_{ML} = argmin_h \sum_i (y_i - h(x_i))^2$$

which is the same as minimizing the sum of squares error of the data.
Bayesian inference is a machine learning and AI technique which uses probabilistic, graphical models to predict outcomes based on data. Bayesian inference is unlike any other machine learning algorithm we've discussed so far because it requires built in domain knowledge and understanding of the system at play.
Consider the following joint distribution
| storm | lightning | P(storm, lightning) |
|---|---|---|
| T | T | 0.25 |
| T | F | 0.4 |
| F | T | 0.05 |
| F | F | 0.3 |
This is a pretty small distribution that tells us some information. But now for every new attribute we add (if it's binary) we need to increase the amount of data we're storing exponentially - which is not computationally feasible for complicated systems. So how do we store complicated joint distributions?
We say that two variables $X$ and $Y$ are independent if:

$$P(X, Y) = P(X)P(Y)$$

Recall that in general:

$$P(X, Y) = P(X | Y)P(Y)$$

Therefore, if $X$ and $Y$ are independent:

$$P(X | Y) = P(X)$$

which is similar to the notion of conditional independence: $X$ is conditionally independent of $Y$ given $Z$ if $P(X | Y, Z) = P(X | Z)$.
We can represent these joint distributions in a Bayesian Network, which is a graphical model of how the
different variables in the distribution interact. In our previous distribution, we know that $P(storm) = 0.25 + 0.4 = 0.65$ and $P(lightning) = 0.25 + 0.05 = 0.3$. If storm and lightning were independent, we would have $P(lightning | storm) = P(lightning)$, but $P(lightning | storm) = \frac{0.25}{0.65} \approx 0.38$ and $P(lightning) = 0.3$, so the two variables are dependent.
So this distribution can be represented by a graph like:
( S ) --> ( L )
P(S) P(L|S)
where the dependence of $L$ on $S$ is captured by storing the conditional probability table (CPT) $P(L | S)$ at the $L$ node. Now consider another node $T$ for thunder:
| storm | lightning | thunder | P(storm, thunder, lightning) |
|---|---|---|---|
| T | T | T | 0.2 |
| T | F | T | 0.04 |
| F | T | T | 0.04 |
| F | F | T | 0.03 |
| T | T | F | 0.05 |
| T | F | F | 0.36 |
| F | T | F | 0.01 |
| F | F | F | 0.27 |
Note that to incorporate the new variable, we had to double the amount of information. But also note that thunder is conditionally independent of storm given lightning: $P(T | L, S) = \frac{0.2}{0.25} = 0.8$ and $P(T | L, \neg S) = \frac{0.04}{0.05} = 0.8$. And in fact if we enumerate over all of the possibilities for $S$, $L$, and $T$, we find that $P(T | L, S) = P(T | L)$ always holds.
We can represent this in graphical form as
( S ) --> ( L ) -> ( T )
P(S) P(L|S) P(T|L)
Note that for a set of parents $parents(X)$, each node in the network only needs to store the CPT $P(X | parents(X))$. Note that a general joint distribution over $n$ binary variables is of size $O(2^n)$, whereas a Bayesian network only needs $O(n \cdot 2^k)$ entries, where $k$ is the maximum number of parents of any node.
- (A) -> (B) <- (C): A and C are independent (blocking); A and C are conditionally dependent given B
- (A) -> (B) -> (C): A and C are conditionally independent given B (blocking); A and C are dependent
- (A) <- (B) -> (C): A and C are conditionally independent given B (blocking); A and C are dependent
We can also find which sets of variables are independent of each other by checking whether every path between them is "blocked" in one of these ways.
Inference is the process of extracting some useful information about the joint distribution from a Bayes net.
Usually we want to know the "posterior probability" of some variables or the "most likely explanation".
The posterior probability of a variable is its conditional distribution given the values of some evidence variables: $P(Q | E_1 = e_1, ..., E_k = e_k)$.

The most likely explanation is the most likely value for a variable given the evidence: $argmax_q P(Q = q | E_1 = e_1, ..., E_k = e_k)$.
The most common, and most brute force, method of inference is inference by enumeration. Given:
- Evidence variables: $E_1...E_k = e_1...e_k$
- Query variables: $Q$
- Hidden variables: $H_1...H_r$

We want to calculate:

$$P(Q | e_1...e_k) \propto \sum_{h_1...h_r} P(Q, h_1...h_r, e_1...e_k)$$
What we are doing here is essentially marginalizing over (summing out) all of the "hidden" variables (the variables that aren't part of the provided query and evidence).
For example, consider the Bayesian network above. If we wanted to calculate $P(thunder | storm)$, we would sum the joint distribution over the hidden variable $lightning$: $P(T | S) \propto \sum_{l} P(S)P(l | S)P(T | l)$.
This is clearly extremely inefficient - in fact, inference in bayes nets is an NP hard problem. However, we can still use a few tricks to make bayes nets useful in machine learning.
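Inference by enumeration on the storm/lightning/thunder network can be sketched directly, with the CPT values read off the joint tables above:

```python
# CPTs for the chain storm -> lightning -> thunder, derived from the tables above
P_S = {True: 0.65, False: 0.35}
P_L_given_S = {True: 0.25 / 0.65, False: 0.05 / 0.35}
P_T_given_L = {True: 0.8, False: 0.1}

def p_joint(s, l, t):
    """P(S, L, T) = P(S) P(L|S) P(T|L), using the chain structure."""
    p_l = P_L_given_S[s] if l else 1 - P_L_given_S[s]
    p_t = P_T_given_L[l] if t else 1 - P_T_given_L[l]
    return P_S[s] * p_l * p_t

def p_thunder_given_storm():
    """Query P(T=true | S=true) by summing out the hidden variable L."""
    numerator = sum(p_joint(True, l, True) for l in (True, False))
    evidence = sum(p_joint(True, l, t) for l in (True, False) for t in (True, False))
    return numerator / evidence

# matches summing the full joint table: (0.2 + 0.04) / 0.65
assert abs(p_thunder_given_storm() - 0.24 / 0.65) < 1e-12

# the joint reconstructed from the CPTs matches the table, e.g. P(S, L, T) = 0.2
assert abs(p_joint(True, True, True) - 0.2) < 1e-12
```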
Naive Bayes is a relatively simple algorithm which uses a simple bayesian network to make predictions. The bayes net is of the structure:
               ( x_1 )
             /
( output ) --- ( x_2 )
             \  ...
               ( x_n )
All of the features are conditionally independent of each other given the output.
We can train this network by treating each feature in the data as independent, counting the number of times it appears with the different output values, and using the counts to construct the CPTs $P(x_i | output)$.

Naive Bayes is a very simple algorithm, and despite the naive assumption it makes about feature independence, it actually works very well in practice. This is true because in practice we use it for classification, so there is room for error in the actual probabilities - even if the naive Bayes network doesn't perfectly model the joint distribution, it's close enough to work for classification.
Unsupervised learning is a completely different type of machine learning which attempts to make some sense of unlabeled data. Essentially unsupervised learning attempts to describe the data you have, whereas supervised learning attempts to find some function to match your already described data.
Randomized optimization (RO) is the process of using randomized algorithms to find the maximum of a fitness function. This is particularly important if your fitness function cannot be solved analytically (ie if it doesn't have a derivative or the derivative is difficult to find).
Given an input space $X$ and a fitness function $f: X \to \mathbb{R}$, the goal is to find $x^* = argmax_{x \in X} f(x)$.
This is a relatively simple and intuitive idea across many disciplines. We may need to do this kind of optimization to tune parameters for chemical or biological processes, find routes, find functional solutions or roots, tune hyperparameters for machine learning algorithms, tune weights for machine learning algorithms, etc.
Hill climbing is one of the simplest RO algorithms.
- Pick a random point $x \in X$.
- Select two points within a "neighborhood" around $x$ - $N(x)$.
- Let $x' = argmax_{z \in N(x)}{(f(z))}$
- If $f(x') \leq f(x)$, terminate. Otherwise set $x = x'$ and repeat from step 2.
This algorithm is very simple but has some clear drawbacks - specifically it is very easy for this algorithm to get caught in local maxima.
Random restart hill climbing is a slight modification of the standard hill climbing algorithm which attempts to make hill climbing more robust against local maxima.
- For i in range(N)
- Let x = the result of hill climbing
- If x > the best maximum currently found, update the best maximum
This algorithm is more robust against local maxima since we perform hill climbing from multiple places throughout the input space. While obviously we cannot guarantee that we will hit the global maximum, we can be more certain that the maximum we find after performing a series of random hill climbings is, at worst, a high local maximum.
Simulated annealing is an RO algorithm which attempts to solve the local maxima problem by allowing the algorithm to "explore" sub-optimal paths in the hopes of finding a global maximum.
- For a finite set of iterations:
    - Sample a point $x' \in N(x)$
    - Jump to the new point $x'$ based on a probability function $P(x', x, T)$
    - Decrease the "temperature" $T$
The probability function is piecewise:

$$P(x', x, T) = \begin{cases} 1 & \text{if } f(x') \geq f(x) \\ e^{\frac{f(x') - f(x)}{T}} & \text{otherwise} \end{cases}$$
This means that if the newly sampled point is at least as fit as the current point, we always jump to it; otherwise we jump with probability $e^{\frac{f(x') - f(x)}{T}}$.

- The numerator means that $P(x', x, T) \to 1$ as $f(x') \to f(x)$. This means that the probability of jumping is higher if the new point is "not so bad" (ie not much worse than the old point).
- The denominator means that $P(x', x, T) \to 0$ as $T \to 0$. This means that as we keep iterating (and $T$ keeps decreasing), the probability of jumping to bad points decreases. So as $T \to \infty$ simulated annealing acts as a random walk, and as $T \to 0$ it behaves like hill climbing. In our algorithm we generally want to decrease $T$ slowly so that we can let the algorithm find the best paths.
Simulated annealing has the interesting property that the probability of the simulated annealing algorithm ending at
some point $x$ is proportional to $e^{f(x)/T}$. This is known as the Boltzmann distribution.
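A minimal sketch of simulated annealing (the fitness function, neighborhood, and cooling schedule here are all illustrative choices; returning the best point seen, rather than the final point, is a common practical tweak):

```python
import math
import random

def simulated_annealing(f, x0, neighbor, t0=10.0, cooling=0.99, iters=2000, seed=0):
    """Always jump to better neighbors; jump to worse ones with
    probability e^((f(x') - f(x)) / T), then cool the temperature."""
    rng = random.Random(seed)
    x, t = x0, t0
    best = x0
    for _ in range(iters):
        x_new = neighbor(x, rng)
        if f(x_new) >= f(x) or rng.random() < math.exp((f(x_new) - f(x)) / t):
            x = x_new
        if f(x) > f(best):
            best = x
        t *= cooling  # cool slowly: early iterations explore, late ones climb
    return best

# a fitness function with a local maximum near x = -2 and a global one near x = 2
f = lambda x: -((x ** 2 - 4) ** 2) + x
neighbor = lambda x, rng: x + rng.uniform(-0.5, 0.5)

best = simulated_annealing(f, x0=-2.0, neighbor=neighbor)
assert f(best) >= f(-2.0)  # never worse than the starting local maximum
```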
Genetic algorithms are algorithms that incorporate techniques analogous to biological reproduction and evolution to perform a local search. Genetic algorithms generally assume a multidimensional input space (for example n-bit strings, vectors, etc).
- Define a population $P$ from the input space.
- Repeat until convergence:
    - Compute $f(x)\ \forall x \in P$.
    - Select the "most fit" samples from the population. This can be done by taking the samples with the best fitness scores, using a weight function to select samples, etc.
    - Replace the "less fit" samples by:
        - "pairing up" samples from the more fit pool and performing crossovers.
        - performing mutations on the more fit or crossover samples.
Crossover in this context means combining the features of two samples to create two new samples. There are many different techniques that can be used to do crossover.
Consider the samples 01101100 and 11010111.
A one point crossover takes one point (or position) as a dividing point and swaps halves of the samples. This kind of crossover assumes a structural locality to the data - it assumes that there is some value to maintaining some of the original structure.
In our samples above if we picked position 4,
0110 | 1100
1101 | 0111
01100111
11011100
A uniform crossover considers each point in the samples individually and uses a uniform probability distribution to determine whether to "keep" or "swap" the values at any position. This assumes that there is no structural locality to the data.
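Both crossover styles can be sketched directly on the bit strings from the example above (a minimal illustration, not a full genetic algorithm):

```python
import random

def one_point_crossover(a, b, point):
    """Swap tails after `point`; preserves structural locality."""
    return a[:point] + b[point:], b[:point] + a[point:]

def uniform_crossover(a, b, p=0.5):
    """Independently swap each position with probability p - assumes
    no structural locality in the data."""
    child1, child2 = [], []
    for x, y in zip(a, b):
        if random.random() < p:
            x, y = y, x
        child1.append(x)
        child2.append(y)
    return "".join(child1), "".join(child2)

# The one-point example from the text, split at position 4:
c1, c2 = one_point_crossover("01101100", "11010111", 4)
# c1 == "01100111", c2 == "11011100"

random.seed(0)
u1, u2 = uniform_crossover("01101100", "11010111")
```

Note that either way, each position of the two children together always contains exactly the two parent values at that position - crossover rearranges information but never invents it.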
The randomized algorithms discussed above are great at finding individual high-value points. However, they don't convey any kind of structure, and the probability distributions underlying them are unclear.
MIMIC is an algorithm that attempts to solve these issues with the generic RO algorithms by refining a specific probability function.
- Repeat until convergence:
  - Generate samples from $P^{\theta_{i}}(x)$.
  - Set $\theta_{i+1}$ based on the most fit samples and retain only those samples.
  - Estimate $P^{\theta_{i+1}}(x)$.
The tricky part here is how we estimate and sample from $P^{\theta_{i}}(x)$. MIMIC does this by approximating the joint distribution with a dependency tree - a maximum spanning tree over the pairwise mutual information between features.
MIMIC does very well with structure, but it can get stuck in local optima (though it's not common). Additionally, MIMIC takes much more time per iteration than other algorithms (like Simulated Annealing and Genetic Algorithms). However, MIMIC gives a lot more information per iteration than other algorithms.
The clustering problem is a relatively well defined unsupervised learning problem. Given a set of objects $X$ and a measure of inter-object distance $d(x, y)$, the goal is to partition the objects into groups such that objects in the same group are close together and objects in different groups are far apart.
Single linkage clustering is a very basic clustering algorithm.
- Initialize each data point as its own cluster.
- Repeat $N - k$ times:
  - Calculate the inter-cluster distances (the distances between all individual clusters).
  - Merge the two closest clusters.

This will create $k$ clusters.
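The merge loop is simple enough to sketch directly. This toy version works on 1-D points and uses the minimum member-to-member distance as the inter-cluster distance (which is what makes it "single" linkage):

```python
def single_linkage(points, k):
    """Merge the two closest clusters until only k remain.
    Cluster distance = distance between their closest members."""
    clusters = [[p] for p in points]
    dist = lambda p, q: abs(p - q)  # 1-D distance for simplicity
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(p, q) for p in clusters[i] for q in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge the two closest clusters
    return clusters

clusters = single_linkage([1.0, 1.2, 1.1, 5.0, 5.3, 9.9], k=3)
```

On this toy data the three tight groups (around 1, around 5, and the lone 9.9) come out as the three clusters.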
K means clustering is another very popular clustering algorithm.
- Pick $k$ cluster center points at random.
- Repeat until convergence:
  - For each cluster center, assign the closest points to that cluster.
  - Recompute the cluster centers as the mean of the points in the cluster.
We can also write this out more mathematically. Let $C_i$ be the center of cluster $i$ and $P(x)$ be the cluster assignment of point $x$.

Then K means clustering can be mathematically defined as:

Find the cluster assignments for every point: $$P(x) = \argmin_{i}{\|x - C_i\|^2}$$

Recalculate the cluster centers: $$C_i = \frac{\sum_{x : P(x) = i}{x}}{|\{x : P(x) = i\}|}$$
We can think of this algorithm as an optimization problem. Our state is the configuration of clusters - this can be represented as the cluster assignments $P(x)$ and the centers $C_i$. Our score is the total squared error $\sum_{x}{\|x - C_{P(x)}\|^2}$, which each iteration of the algorithm either decreases or leaves unchanged.
Since the error here is monotonically non-increasing, the algorithm will always converge in finite time (though it could take a long time). However, the algorithm can get stuck in non-optimal points. We can use random restarts or try to distribute cluster centers intelligently to mitigate that risk though.
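The assign/recompute loop can be sketched on toy 1-D data. This is a minimal illustration (Lloyd's algorithm), not a production clustering routine:

```python
import random

def kmeans(points, k, iters=100):
    """K means on 1-D points: alternate assignment and center updates."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster.
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda i: (p - centers[i]) ** 2)
            clusters[i].append(p)
        # Update step: recompute each center as the mean of its cluster.
        new_centers = [sum(c) / len(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        if new_centers == centers:  # converged
            break
        centers = new_centers
    return centers

random.seed(0)
centers = kmeans([1.0, 1.1, 0.9, 5.0, 5.1, 4.9], k=2)
```

On this data the two centers converge to the means of the two obvious groups, regardless of which points are sampled as the initial centers.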
One problem with k means clustering is that some points may not fit well into a single cluster - it may be more correct for a point to be "shared" by multiple clusters. Soft clustering is a clustering algorithm that tries to fix that. Essentially we follow a very similar idea to k-means. We pick $k$ Gaussian distributions (at random), and then rather than assigning each point to a single cluster, we compute the probability that each point was generated by each Gaussian.
Soft clustering is an expectation maximization algorithm. In the expectation step we compute the likelihood that point $x_i$ belongs to cluster $j$: $$E[z_{ij}] = \frac{P(x_i | \mu_j)}{\sum_{k}{P(x_i | \mu_k)}}$$

where $z_{ij}$ is an indicator variable for point $x_i$ belonging to cluster $j$ and $\mu_j$ is the mean of cluster $j$.

Then given the cluster assignment likelihoods we can recompute the means as: $$\mu_j = \frac{\sum_{i}{E[z_{ij}] x_i}}{\sum_{i}{E[z_{ij}]}}$$
EM is very similar to k-means (and in fact it can be modified slightly to behave almost exactly like k-means). The likelihood function is monotonically non-decreasing, but since there are an infinite number of possible configurations, EM does not have to converge (though it usually does). However, it will not diverge. It can also get stuck at non-optimal points, and it can work with any probability distribution (not just Gaussians).
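The E and M steps above can be sketched for a toy mixture of two 1-D Gaussians. For simplicity this hypothetical version fixes the variance and the (equal) priors and re-estimates only the means:

```python
import math

def em_1d(xs, mus, iters=50, sigma=1.0):
    """EM for a two-Gaussian 1-D mixture; only the means are learned."""
    for _ in range(iters):
        # E step: probability that each point came from each Gaussian
        # (the normalizing constants cancel since sigma is shared).
        resp = []
        for x in xs:
            ps = [math.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) for mu in mus]
            z = sum(ps)
            resp.append([p / z for p in ps])
        # M step: means become responsibility-weighted averages.
        mus = [sum(r[j] * x for r, x in zip(resp, xs)) /
               sum(r[j] for r in resp) for j in range(len(mus))]
    return mus

mus = em_1d([0.0, 0.2, -0.1, 4.0, 4.2, 3.9], mus=[-1.0, 1.0])
```

Each point contributes to *both* means, weighted by its responsibilities - this is the "soft" assignment that distinguishes EM from k-means' hard assignment.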
There are a few general properties of clustering algorithms we want to consider:
- Richness is the property that for any assignment of objects to clusters there is some distance matrix $D$ such that $P_D$ supports that clustering.
- Scale invariance is the property that scaling distances by a positive value should not change the clustering assignments.
- Consistency is the property that decreasing the intracluster distances and increasing the intercluster distances should not change the cluster assignments.
There is no clustering algorithm that has richness, scale invariance, and consistency. This is called the **impossibility theorem**.
Feature selection is the process of selecting a subset of features from our input data to train our machine learning algorithms. There are a few reasons we want to do this. First of all, reducing the number of features we have to deal with (by selecting the most relevant features) helps us better interpret and understand our data and results. Secondly, the curse of dimensionality says that the amount of data needed to train a machine learning algorithm grows exponentially with the number of features in the dataset. So reducing the number of features we have to deal with (ideally) reduces the amount of data needed to train.
However, we basically want to find the "best" subset of features to train our algorithm. This is an NP-Hard problem since there are $2^N$ possible subsets of $N$ features. There are two general approaches to approximating it: filtering and wrapping.
Filtering is a flow forward process where we first use a search algorithm to come up with a subset of features and then pass that subset over to the learner. Wrapping is a process by which the feature filtering process is part of the learning algorithm itself - so the learner learns on some subset of features and reports its results back to the filtering algorithm. The filtering algorithm can then update its selection process based on the results of the learner. Filtering is generally fast. However, it generally looks at the features in isolation and doesn't consider interdependencies between features. It also completely ignores the learner. Wrapping does consider feature dependencies (insofar as the learning algorithm does), but it is extremely slow.
There are a lot of different algorithms we can use for filtering feature selection.
- Train a decision tree and only consider features that are important for the tree - this is tantamount to selecting features with high information gain.
- Select features with high variance
- Select features with high entropy
- Select features that are not linear combinations of other features
For wrapping feature selection we can essentially perform a search using a Randomized Optimization algorithm where the fitness function is the accuracy of the learner. We can select a neighborhood in a "forward" way (start from a set of features and try to add a single feature) or a "backward" way (start with all the features and try to remove a single feature).
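The "forward" wrapping search can be sketched as a greedy loop. The `score` function here stands in for the learner's cross-validated accuracy - it and the toy data are hypothetical:

```python
def forward_selection(features, score, max_features):
    """Greedy wrapper: repeatedly add the single feature that most
    improves the learner's score (`score(subset)` is a stand-in for
    training and evaluating the learner on that subset)."""
    selected = []
    while len(selected) < max_features:
        candidates = [f for f in features if f not in selected]
        if not candidates:
            break
        best = max(candidates, key=lambda f: score(selected + [f]))
        if score(selected + [best]) <= score(selected):
            break  # no remaining feature improves the score
        selected.append(best)
    return selected

# Toy score: features 'a' and 'c' each help; the rest add nothing.
useful = {"a": 0.3, "c": 0.2}
score = lambda subset: sum(useful.get(f, 0.0) for f in subset)
chosen = forward_selection(["a", "b", "c", "d"], score, max_features=2)
```

The "backward" variant is symmetric: start from the full set and greedily remove the feature whose removal hurts the score least. Note that because `score` is re-evaluated for every candidate, this is exactly why wrapping is slow.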
In general in feature selection we can say that a feature $x_i$ is strongly relevant if removing it degrades the Bayes optimal classifier, weakly relevant if it is not strongly relevant but there exists some subset of features such that adding $x_i$ to that subset improves the Bayes optimal classifier, and otherwise irrelevant.
Feature transformation is the process of doing transformation on our current set of features to create a newer (more compact) feature set which retains as much (relevant) information as possible. Here we are focused on linear transformations - we are trying to find some matrix $P$ such that projecting our examples through $P$ maps them into a smaller feature space.
Principal components analysis or PCA is an eigenproblem which tries to project the feature space onto a new feature space with maximum variance. PCA essentially finds a direction which maximizes the variance (the principal component), then selects other directions which are orthogonal to the principal component.
In PCA we are essentially transforming our data into a new feature space (which is the same size as the original) but maximizing the variance in a smaller set of dimensions. We can then select the M highest variance dimensions in the transformed space. PCA has the property that if we go through the process of projecting our original feature space onto the M highest variance dimensions after PCA, reconstructing the data has the lowest L2 error of any possible reconstruction. However, we can see that using this method it is actually possible to drop genuinely relevant features that happen to have low variance if there are other, noisy features with high variance. So PCA needs to be taken with a grain of salt when used with classification problems.
PCA is a global algorithm, which means that there is a global constraint over all of the features - namely that all the dimensions must be orthogonal. This means that the algorithm tends to pick out "global" features. For example, on a set of images PCA would return the "average image" or the brightness, contrast, etc of the images.
Note an eigenproblem is a computational problem that can be solved by finding the eigenvalues and/or eigenvectors of a matrix.
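As a concrete sketch, PCA reduces to finding the eigenvectors of the covariance matrix and projecting onto the top few. The toy data below (points near the line $y = 2x$) is hypothetical:

```python
import numpy as np

def pca(X, m):
    """Project X (n samples x d features) onto its m highest-variance
    directions via the eigenvectors of the covariance matrix."""
    Xc = X - X.mean(axis=0)                  # center the data
    cov = Xc.T @ Xc / (len(X) - 1)           # covariance matrix
    vals, vecs = np.linalg.eigh(cov)         # eigh: for symmetric matrices
    order = np.argsort(vals)[::-1][:m]       # indices of top-m eigenvalues
    return Xc @ vecs[:, order]               # project onto those components

# Points lying near y = 2x: one direction carries almost all the variance.
X = np.array([[0.0, 0.1], [1.0, 2.0], [2.0, 3.9], [3.0, 6.1]])
Z = pca(X, 1)
```

The single retained component lines up with the $y = 2x$ direction, so the 1-D projection preserves the ordering of the points along that line.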
Independent components analysis or ICA is a process which tries to map the features into a new feature space such that the new features are independent of each other. ICA was originally developed to solve the blind source separation problem. Basically, consider a set of N people in a room who are all talking, where the noises are recorded by a set of N microphones. The data detected by each microphone is a linear combination of the sound from each of the sources - though based on the location of each microphone, the data at each microphone will be different. If we treat each microphone as a feature, ICA transforms the data in such a way as to make sure that the mutual information between the transformed features is 0 (that is, they are independent). Thus it is able to actually extract the "sources" of the noise as the independent features.
Unlike PCA, ICA doesn't have an orthogonality constraint, so it is more of a local algorithm - thus it can pick out more distinct local features. For example, on a set of images, ICA tends to pick out edges as the features.
Random components analysis (RCA) or random projection generates random directions and projects the data onto them. Surprisingly, this works quite well for classification (the random projections tend to preserve enough of the correlations in the data for a learner to pick up on), and it is extremely cheap and fast compared to PCA or ICA.
Linear discriminant analysis or LDA tries to find a projection that discriminates based on the classification label.
Reinforcement learning is a sub-discipline of machine learning similar to supervised learning though with a different objective. Where supervised learning's goal is to approximate some function given training data, reinforcement learning's goal is to learn how to act - to find a policy that maximizes long term reward given observations of interactions with an environment.
A Markov Decision Process (MDP) is a fundamental framework in reinforcement learning. It consists of:
- $S$: a set of states
- $T(s, a, s') = P(s'|s,a)$: a transition probability matrix
- $A(s)$ and $A$: a set of actions (where $A(s)$ is the set of valid actions at state $s \in S$)
- $R(s)$, $R(s, a)$, $R(s, a, s')$: the reward for a combination of states and actions
Given these parameters we want to generate a policy, $\pi(s) \to a$, which maps each state to the action to take in that state. The optimal policy $\pi^{*}$ is the policy that maximizes the agent's expected long term reward.
For example, consider a "grid world" where the agent starts at a specific spot and their goal is to make it to
the box marked GOAL - if they reach the goal, the agent "wins". However, if they land in the box marked PIT,
then the agent loses.
| x | 0 | 1 | 2 | 3 |
|---|---|---|---|---|
| 0 | GOAL | | | |
| 1 | xxxx | PIT | | |
| 2 | xxxx | | | |
| 3 | P | | | |
This world can be viewed as an MDP where:

- $S = \{ (0, 0), (0, 1), (0, 2), \ldots (4, 4) \}$
- $T(s, a, s') = 1$ if $s$ is adjacent to $s'$ else $0$
- $A = \{ U, D, L, R \}$
- $R((0, 4)) = 100$, $R((1, 4)) = -100$
Our goal here is to find an optimal policy $\pi^{*}$ - a mapping from each state to an action that maximizes the agent's expected reward.
Note that this is a markov decision process because:
- The transition model function $T(s, a, s')$ only cares about the current state. It does not care about anything that came before.
- The transition model doesn't change.
So generally, to train an MDP we take in a sequence of states, actions, and rewards and use the (delayed) rewards to work out what to do in each state.
Based on the constraints above we find that the utility of a sequence of states is: $$U(s_0, s_1, s_2, \ldots) = \sum_{t=0}^{\infty}{\gamma^{t} R(s_t)}$$

where $0 \leq \gamma < 1$ is a discount factor.

This is a geometric series, so if we have a maximum reward $R_{max}$: $$\sum_{t=0}^{\infty}{\gamma^{t} R(s_t)} \leq \frac{R_{max}}{1 - \gamma}$$
This notion of "discounting" the utility allows us to find a finite reward for a potentially infinite sequence of states.
Based on all of this we can define the optimal policy as: $$\pi^{*} = \argmax_{\pi}{E\left[\sum_{t}{\gamma^{t} R(s_t)}\ \Big|\ \pi\right]}$$

Then we can redefine the utility of some state (with respect to a policy) as the expected reward from that state onward: $$U^{\pi}(s) = E\left[\sum_{t}{\gamma^{t} R(s_t)}\ \Big|\ \pi, s_0 = s\right]$$
Note that the reward for a state is different from the utility from that state. The reward is the short term, immediate reward whereas the utility is the long term, delayed reward.
So then we can say that: $$\pi^{*}(s) = \argmax_{a}{\sum_{s'}{T(s, a, s')U(s')}}$$

where $U(s)$ is the utility of state $s$ under the optimal policy,

and $$U(s) = R(s) + \gamma \max_{a}{\sum_{s'}{T(s, a, s')U(s')}}$$
This is known as the Bellman equation. It is clearly a recursive function, so we need to use some computational tricks to find $U(s)$ for every state:
- Start off with some arbitrary utilities $\hat{U}_{0}(s)$.
- For $t = [1, 2, \ldots]$:
  - Update the utilities for each state by using the equation: $$\hat{U}_{t}(s) = R(s) + \gamma \max_{a}{\sum_{s'}{T(s, a, s')\hat{U}_{t - 1}(s')}}$$
  - If the utilities have converged, break the loop.
This algorithm is known as value iteration (note the utility of the state can also be considered the value of the state).
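Value iteration can be sketched on a toy MDP. The tiny two-state chain below is hypothetical - state "b" pays a reward of 1, and each action deterministically stays or moves:

```python
def value_iteration(states, actions, T, R, gamma=0.9, tol=1e-6):
    """T[s][a] is a list of (probability, next_state) pairs."""
    U = {s: 0.0 for s in states}
    while True:
        # One Bellman update: U(s) = R(s) + gamma * max_a sum_s' T * U(s')
        U_new = {s: R[s] + gamma * max(
                     sum(p * U[s2] for p, s2 in T[s][a]) for a in actions)
                 for s in states}
        if max(abs(U_new[s] - U[s]) for s in states) < tol:
            return U_new
        U = U_new

# Hypothetical 2-state chain: from either state, 'stay' or 'move'.
states = ["a", "b"]
actions = ["stay", "move"]
T = {"a": {"stay": [(1.0, "a")], "move": [(1.0, "b")]},
     "b": {"stay": [(1.0, "b")], "move": [(1.0, "a")]}}
R = {"a": 0.0, "b": 1.0}
U = value_iteration(states, actions, T, R)
```

Here the fixed point is $U(b) = 1 + 0.9\,U(b) = 10$ and $U(a) = 0.9\,U(b) = 9$, which the iteration converges to.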
Policy iteration is a slight variation of value iteration. Generally we don't actually need to know the actual values of the states - we just want to converge onto an optimal policy. So we can do something similar.
- Start out with an arbitrary initial policy $\pi_{0}$.
- For $t = [1, 2, \ldots]$:
  - Update the policy: $$\pi_{t}(s) = \argmax_{a}{\sum_{s'}{T(s, a, s')U_{t-1}(s')}}$$ where $U_{t}(s) = R(s) + \gamma \sum_{s'}{T(s, \pi_{t}(s), s')U_{t - 1}(s')}$
  - If the policy has converged, break the loop.
Value iteration and policy iteration are great techniques for solving an MDP where you know the model parameters.
However, in general you don't know parameters like the transition probabilities, rewards by state, etc. You need to
actually run through the environment to learn those parameters. In reinforcement learning we are given a series of observations of actions in an MDP - a sequence of tuples of the form $(s, a, r, s')$ - and must learn how to act from them.
There are a lot of different approaches we can take when attempting reinforcement learning.
- Policy search is an approach where we try to directly learn the policy. This is a bit difficult because we generally don't have access to the action we should have taken.
- Value based learning is an approach where we attempt to learn the value or utility of the states. We can then translate that to a policy by taking some kind of argmax. Note that this value is not necessarily the same as the value defined for value / policy iteration.
- Model based learning is an approach which attempts to learn the transition and reward functions given the sequence of $(s, a, r, s')$ tuples. Converting this into a policy essentially requires using the model we learn to perform value or policy iteration.
Q Learning is a very common value based learning algorithm which uses a value function $Q(s, a)$: $$Q(s, a) = R(s) + \gamma \sum_{s'}{T(s, a, s') \max_{a'}{Q(s', a')}}$$

So $Q(s, a)$ is the expected value of being in state $s$, taking action $a$, and then acting optimally from that point forward.
We can also redefine our original value and policy equations in terms of $Q$.

Recall that: $$U(s) = R(s) + \gamma \max_{a}{\sum_{s'}{T(s, a, s')U(s')}}$$ so we can say that: $$U(s) = \max_{a}{Q(s, a)}$$

We can also say that: $$\pi(s) = \argmax_{a}{Q(s, a)}$$

This makes sense since for some fixed state $s$, maximizing $Q(s, a)$ over the actions gives exactly the utility of acting optimally from $s$.
So clearly we could just use our observations to estimate $\hat{Q}$ directly, without ever learning the transition or reward models:
- Initialize an arbitrary $\hat{Q}$.
- For each observation $(s, a, r, s')$:
  - Calculate the update parameter: $$\hat{q}(s, a) = r + \gamma \max_{a'}{\hat{Q}_{prev}(s', a')}$$
  - Update: $$\hat{Q}(s, a) = (1 - \alpha_t)\hat{Q}_{prev}(s, a) + \alpha_t \hat{q}(s, a)$$
Note that the operation $$(1 - \alpha_t)\hat{Q}_{prev}(s, a) + \alpha_t \hat{q}(s, a)$$ calculates a moving average and converges to the expected value of $\hat{q}(s, a)$ (provided $\sum_{t}{\alpha_t} = \infty$ and $\sum_{t}{\alpha_t^2} < \infty$).
Note that we say "for each observation" in the algorithm, but really you are running an agent from state to state through the environment, generating observations as you go. So there are a few choices we have to make:
- How do we initialize $\hat{Q}$?
- How do we decay the learning rate $\alpha_t$?
- How do we choose actions?
There are a lot of approaches to tuning these parameters. For example, we could choose random actions or always choose a single action. However, these strategies are bad because they don't take into account what the algorithm has learned and so won't return good results. Furthermore, we could just follow the policy we've generated through previous iterations, but that runs the risk of getting stuck in local optima.
So a generally good one is to use a simulated annealing like approach. Given some state $s$, we usually take the action recommended by our current policy $\hat{\pi}(s)$, but with some small probability $\epsilon$ we take a random action instead. This is known as an epsilon-greedy strategy.
This leads to a more specific Q-Learning algorithm:
- Initialize an arbitrary $\hat{Q}$.
- While True:
  - Given state $s$, generate an observation by choosing an action and observing the resulting state and reward: $$a = \begin{cases} \hat{\pi}(s) & \text{if}\ rand() < 1 - \epsilon \\ rand(A) & \text{otherwise} \end{cases}$$ where $$\hat{\pi}(s) = \argmax_{a'}{\hat{Q}(s, a')}$$
  - Calculate the update parameter: $$\hat{q}(s, a) = r + \gamma \max_{a'}{\hat{Q}_{prev}(s', a')}$$
  - Update: $$\hat{Q}(s, a) = (1 - \alpha_t)\hat{Q}_{prev}(s, a) + \alpha_t \hat{q}(s, a)$$
It turns out that if the algorithm is greedy in the limit with infinite exploration (GLIE), this algorithm converges in the limit to the correct Q values and the optimal policy.
Another interesting approach is to initialize all of the $\hat{Q}$ values optimistically (to high values). The agent is then drawn to explore the state-action pairs it hasn't visited yet, since they still look promising. This is known as optimism in the face of uncertainty.
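The epsilon-greedy Q-learning loop above can be sketched on a toy environment. The 3-state "corridor" below is hypothetical, and the learning rate is held constant for simplicity rather than decayed:

```python
import random

def q_learning(step, n_states, n_actions, episodes=2000,
               gamma=0.9, alpha=0.1, epsilon=0.1):
    """Tabular Q-learning with an epsilon-greedy action choice.
    `step(s, a)` returns (reward, next_state, done)."""
    Q = [[0.0] * n_actions for _ in range(n_states)]
    for _ in range(episodes):
        s, done = 0, False
        while not done:
            if random.random() < epsilon:
                a = random.randrange(n_actions)                   # explore
            else:
                a = max(range(n_actions), key=lambda a: Q[s][a])  # exploit
            r, s2, done = step(s, a)
            # Moving-average update toward r + gamma * max_a' Q(s', a')
            target = r + (0.0 if done else gamma * max(Q[s2]))
            Q[s][a] = (1 - alpha) * Q[s][a] + alpha * target
            s = s2
    return Q

# Hypothetical 3-state corridor: action 1 moves right, action 0 stays.
# Reaching state 2 gives reward 1 and ends the episode.
def step(s, a):
    s2 = min(s + 1, 2) if a == 1 else s
    return (1.0, s2, True) if s2 == 2 else (0.0, s2, False)

random.seed(0)
Q = q_learning(step, n_states=3, n_actions=2)
```

The learned values approach $Q(1, \text{right}) = 1$ and $Q(0, \text{right}) = \gamma \cdot 1 = 0.9$, so the greedy policy moves right from every state - all without ever learning $T$ or $R$ explicitly.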
Game theory is a mathematical theory that centers around conflict - specifically how to behave optimally in specific conflicts.
When we think about games we often think of zero sum games. A zero sum game is a mathematical representation of a situation in which each participant's gain or loss of utility is exactly balanced by the losses or gains of the utility of the other participants. This means that in a zero sum game if player 1 earns 1 point, player 2 sees that as a loss of 1 point.
We often think of these types of games as game trees.
root (MAX) (Agent A)
/ | \
x1 x2 x3 (MIN) (Agent B)
/ \ / \ / \
x11 x12 x21 x22 x31 x32 (MAX) (Agent A)
... ... ...
The leaf nodes of the tree represent values that Agent A earns if they reach that point in the game tree. To find an optimal policy from any state, then, Agent A can simply find the action that maximizes its reward - it does that by recursively checking each action assuming that Agent B acts optimally. This algorithm is known as minimax.
```python
def max_value(state):
    if is_terminal(state):
        return terminal_value(state)  # value of terminal state
    v = float("-inf")
    for successor in successors(state):
        v = max(v, min_value(successor))
    return v

def min_value(state):
    if is_terminal(state):
        return terminal_value(state)  # value of terminal state
    v = float("inf")
    for successor in successors(state):
        v = min(v, max_value(successor))
    return v
```

In a 2 player zero sum game of perfect information there always exists an optimal pure strategy for each player. This is true for deterministic and non-deterministic games. For non-deterministic games we simply have to take the expectation at any stochastic nodes in the tree. Furthermore, for these games minimax yields the same result as maximin. This is known as Von Neumann's theorem.
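As a self-contained toy, minimax can be run on a game tree given as nested lists (a hypothetical encoding where leaves are A's payoffs and levels alternate between the two players):

```python
def minimax(node, is_max):
    """Value of a game tree encoded as nested lists; leaves are payoffs
    for the maximizing player, and levels alternate between players."""
    if not isinstance(node, list):
        return node  # leaf: terminal payoff
    values = [minimax(child, not is_max) for child in node]
    return max(values) if is_max else min(values)

# A (MAX) chooses one of three moves; B (MIN) then picks the leaf
# that is worst for A. B's replies: min(3,12)=3, min(2,8)=2, min(14,1)=1.
tree = [[3, 12], [2, 8], [14, 1]]
value = minimax(tree, is_max=True)  # A can guarantee a payoff of 3
```

Note how the guaranteed value (3) is far below the best leaf (14): minimax assumes the opponent plays optimally, so A never counts on B blundering into the 14.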
However, consider a game of hidden information. For example:
- A picks a card. There is a 50% probability he picks a red card vs a black card. B cannot see the card.
- If A picked red he can fold or hold. If he folds he earns -20c.
- If A chooses to hold:
  - If B chooses to hold:
    - If A has a red card -> A earns -40c
    - If A has a black card -> A earns 30c
  - If B chooses to fold:
    - A earns 10c
This game is said to have hidden information because B does not know which state he is in at any given moment. We can work out a value matrix for this game as:
| Actions | B folds | B holds |
|---|---|---|
| A folds | -5 | +5 |
| A holds | +10 | -5 |
Note that for "A folds" we are actually saying that A will only fold on red - he will always hold if he gets black. So for the A folds and B folds case we work out the value as: $$0.5(-20) + 0.5(10) = -5$$

and for the A folds and B holds case we work out the value as: $$0.5(-20) + 0.5(30) = +5$$
Note however, for this kind of system, minimax is not the same as maximin. So there is no pure strategy to solve this game. Instead we need to use some mixed strategy where we find a probability distribution over the possible actions.
In simple games like the game above we can do this relatively easily. We define $p$ as the probability that A holds. Then A's expected value is:

- If B folds: $$V = 10p - 5(1-p)$$
- If B holds: $$V = -5p + 5(1-p)$$
These define a pair of lines which intersect at $p = 0.4$.
Now A is trying to maximize its reward and B is trying to minimize its reward. So A will pick 0.4 as its hold probability because that is the probability that maximizes its reward in the "worst case" - if B picks the lower strategy. Note that this point can occur at the minimum (0), maximum (1), or intersection point of the lines (if it exists between 0 and 1).
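This maximin calculation can be checked numerically with a small grid search over $p$ (a brute-force sketch, standing in for solving the line intersection analytically):

```python
# Expected value of the card game for A as a function of p = P(A holds),
# using the two lines from the value matrix above.
def v_if_b_folds(p):
    return 10 * p - 5 * (1 - p)

def v_if_b_holds(p):
    return -5 * p + 5 * (1 - p)

# B plays whichever response gives A less, so A maximizes the minimum.
best_p = max((p / 100 for p in range(101)),
             key=lambda p: min(v_if_b_folds(p), v_if_b_holds(p)))
```

The search recovers $p = 0.4$, where both lines meet and A's guaranteed expected value is $+1$.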
Now let's consider a non-zero sum game of imperfect information. The prisoner's dilemma is a classic problem in game theory. Two criminals are being interrogated by policemen in separate rooms. Each criminal is told if they "defect" (or rat out the other) they will go free and the other criminal will serve 9 months time. If both criminals defect at the same time they both serve 6 months time. If both "cooperate" (or both keep their mouths shut) they serve 1 month in jail.
| Actions | B cooperate | B defect |
|---|---|---|
| A cooperate | -1, -1 | -9, 0 |
| A defect | 0, -9 | -6, -6 |
Note the tuples are (A reward, B reward). Note that this is not zero sum because the rewards do not balance each other out. Interestingly the correct choice here is to always defect.
This problem is easy to solve just by looking at it, but most games are more complicated than that. The Nash equilibrium attempts to solve this. Consider a situation with $n$ players with strategy sets $S_1, S_2, \ldots, S_n$. The strategies $s_1^* \in S_1, \ldots, s_n^* \in S_n$ form a Nash equilibrium iff: $$\forall_i\ s_i^* = \argmax_{s_i}{\text{utility}_i(s_1^*, \ldots, s_i, \ldots, s_n^*)}$$

That is, each of the strategies maximizes the utility for that agent given the other agents' strategies - so no agent has an incentive to unilaterally change its strategy.
For example, in the prisoner's dilemma:
| Actions | B cooperate | B defect |
|---|---|---|
| A cooperate | -1, -1 | -9, 0 |
| A defect | 0, -9 | -6, -6 |
- If B cooperates, A should defect because it would save 1 month.
- If B defects, A should defect because it would save 3 months.
So for A, defecting strictly dominates cooperating. Since this situation is symmetric, the exact same argument can be made for B.
A few properties of Nash Equilibrium:
- In the n player pure strategy game, if the elimination of strictly dominated strategies eliminates all but one combination, that combination is the unique Nash equilibrium.
- Any Nash equilibrium will survive elimination of strictly dominated strategies.
- If $n$ is finite and each $S_i$ is finite, there exists at least one (possibly mixed strategy) Nash equilibrium.
- If we are playing an n-repeated game, the Nash equilibrium is the n-repeated equilibrium of a single instance of the game. This is only true if the game has a deterministic number of rounds. If the number of rounds is unknown, then we can't just use the single game strategy.
An unbounded repeated game is a game where we repeat the same game for some unknown number of rounds. For example, we may play prisoner's dilemma but at the end of each round terminate the game with a probability $p$ - the expected number of rounds is then $\frac{1}{p}$.
In a repeated game setting, the possibility for "retaliation" opens the door for cooperation. For example, consider the prisoners dilemma:
| Actions | B cooperate | B defect |
|---|---|---|
| A cooperate | -1, -1 | -9, 0 |
| A defect | 0, -9 | -6, -6 |
One, quite simple, strategy to play the prisoners dilemma in an unbounded repeated sequence is the "tit for tat" strategy. Essentially on the first round, cooperate. On every subsequent round, do what the opponent did in the previous round.
So given the opponent strategies of "always cooperate", "always defect", and "tit for tat", we can see that the optimal strategies for the player are as follows:
| Opponent Strategy | Always Cooperate | Always Defect | Tit for tat |
|---|---|---|---|
| Always Cooperate | | x | |
| Always Defect | | x | |
| Tit for tat | x | | x |
Now note that:
- (always cooperate, always defect) is not a Nash equilibrium because the cooperating player does better by switching to always defect.
- (always defect, always defect) is a Nash equilibrium because from both players' perspectives that strategy gives the optimal utility.
- (tit for tat, always cooperate) is not a Nash equilibrium because the tit for tat player does better by switching to always defect (the always-cooperate opponent never retaliates).
- (tit for tat, tit for tat) is a Nash equilibrium because from both players' perspectives that strategy gives the optimal utility.
So here we have a game with multiple Nash equilibria.
If we plot the possible rewards for the prisoner's dilemma in a 2 dimensional plane we can see that they form a convex hull. The region inside the convex hull is known as the feasible region, which is the region of possible average rewards over some iterations of the game.
A minmax profile is a pair of payoffs, one for each player in a game, that represents the payoffs that can be achieved by a player defending itself from a malicious adversary.
For example consider the following reward matrix
| Actions | B left | B right |
|---|---|---|
| A left | 1, 2 | 0, 0 |
| A right | 0, 0 | 2, 1 |
The minimax profile would give us the average payoffs of each player assuming the other player was acting
maliciously. So that would be (1, 1).
The security level profile is similar but assumes that the other player is operating using a mixed strategy. Say B chooses left with some probability $p$. Then A's expected payoff is $p$ for choosing left and $2(1 - p)$ for choosing right. A malicious B minimizes A's best response by setting $p = 2(1 - p)$, so $p = \frac{2}{3}$ and A's expected payoff is $\frac{2}{3}$.

Since the matrix is symmetric, the security level profile is (2/3, 2/3).
The minmax profile is a point in the feasible region - the region above and to the right of that point is the "acceptable region".
The folk theorem in Game Theory says that any feasible payoff profile that strictly dominates the minmax / security level profile can be realized as a Nash equilibrium payoff profile with sufficiently large discount factor. This essentially says that if the minmax / security level profile strongly dominates all other actions, an agent can use it as a "threat". So the agents should behave to mutual benefit but as soon as the opponent harms the player agent, the player should apply the minmax / security level strategy. This is also known as the grim trigger.
The problem with the grim trigger strategy is that the threat is implausible. We say that a strategy is subgame perfect if for any constructed history, the strategy will choose the best response. Consider grim vs tit for tat in the prisoners dilemma.
Grim : C D D D D D ...
Tit for tat : D C D D D D ...
(note that the constructed history does not need to be consistent with the strategy).
We can see that at the second time step, Grim chose to defect which triggered a cascade of defects on both sides. But it would have been strictly better for Grim to have cooperated at that point. So grim is not subgame perfect.
It turns out that tit for tat vs tit for tat is not subgame perfect either. If the constructed history is:
A: C D C D C D...
B: D C D C D C...
then the average for each agent is around -4.5. But if agent A had instead chosen to cooperate after the first defect, its average would have worked out to -1.
The Pavlov strategy is similar to tit for tat:
- If both agents agree (both cooperate or both defect on the previous round) then cooperate
- If the agents disagree (one cooperates and the other defects) then defect
Pavlov v Pavlov is both in Nash equilibrium and subgame perfect. We can see the Nash equilibrium because both agents will cooperate and have no more optimal action. Subgame perfect because for each possible set of states, the agents will eventually reach the mutual cooperation state which is the best average we can hope for. So no matter what, there is no option where we can choose a better action (in the long run) than the Pavlov agent.
CC -> CC
CD -> DD -> CC
DC -> DD -> CC
DD -> CC
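The state transitions above can be verified with a tiny simulation (a toy sketch; `play` just iterates the two agents from an arbitrary starting pair of actions):

```python
def pavlov(my_prev, opp_prev):
    """Cooperate if both agents agreed on the previous round."""
    return "C" if my_prev == opp_prev else "D"

def play(a, b, rounds=5):
    """Iterate Pavlov vs Pavlov from starting actions (a, b)."""
    history = [(a, b)]
    for _ in range(rounds):
        a, b = pavlov(a, b), pavlov(b, a)  # both move simultaneously
        history.append((a, b))
    return history
```

Running `play` from each of the four starting states shows every trajectory reaching mutual cooperation within two rounds and staying there, which is exactly the subgame perfection argument above.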
The computational folk theorem says that for any 2 player game running for some nondeterministic number of iterations, we can, in polynomial time, find a strategy profile that forms a Nash equilibrium and is subgame perfect. Either:

- we can construct a Pavlov-like machine,
- the game is zero sum, in which case we can find the Nash equilibrium in polynomial time using linear programming, or
- at most one player can improve its strategy (from what is found from linear programming in the zero sum case), which gives us a Nash equilibrium.
Stochastic games provide a way for us to think about multi-agent MDPs. A stochastic game has:
- $S$: states
- $A_i$: actions for player $i$
- $T$: transitions. A transition here is of the form $T(s, (a_1, a_2, \ldots, a_n), s')$.
- $R_i$: rewards for player $i$. Rewards are of the form $R_i(s, (a_1, a_2, \ldots, a_n), s')$.
- $\gamma$: discount factor
Note that the stochastic game model is really a generalization of the MDP. If we ignore all other agents in the transitions and rewards, then the other agents just become part of the environment and this reduces to an MDP. Furthermore, if this is a 2 player game and $R_1 = -R_2$, then this is a zero sum game.
So given that this is a more general MDP, we can reformulate the Bellman equations to try to learn policies (strategies) for the game.
Let's assume that we're playing a two player zero sum game where $R_1 = -R_2$. A naive Q-function for player $i$ would be: $$Q_i(s, (a, b)) = R_i(s, (a, b)) + \gamma \sum_{s'}{T(s, (a, b), s') \max_{(a', b')}{Q_i(s', (a', b'))}}$$

But this doesn't make a lot of sense because it assumes that all agents are attempting to maximize the value for agent $i$. So we actually modify the equation to use minimax to estimate the future state values: $$Q_i(s, (a, b)) = R_i(s, (a, b)) + \gamma \sum_{s'}{T(s, (a, b), s')\ \text{minimax}_{(a', b')}{Q_i(s', (a', b'))}}$$

We can then use this equation like we would use Q-learning and say for some observation $(s, (a, b), (r_1, r_2), s')$: $$\hat{q}_i(s, (a, b)) = r_i + \gamma\ \text{minimax}_{(a', b')}{\hat{Q}_i(s', (a', b'))}$$
This is also known as minimax-Q.
We find that:
- We can do value iteration using these equations
- minimax-Q converges to a unique solution
- the policies for each agent can be computed independently
- the q functions are enough to specify a policy
- the updates are efficient (polynomial time)
In general sum games we would actually change the Q-function to use a Nash equilibrium computation in place of minimax: $$Q_i(s, (a, b)) = R_i(s, (a, b)) + \gamma \sum_{s'}{T(s, (a, b), s')\ \text{Nash}_{(a', b')}{Q_i(s', (a', b'))}}$$

which gives us a Nash-Q algorithm: $$\hat{q}_i(s, (a, b)) = r_i + \gamma\ \text{Nash}_{(a', b')}{\hat{Q}_i(s', (a', b'))}$$
However in Nash-Q:

- Value iteration doesn't work
- Nash-Q doesn't converge
- there is no unique solution
- the policies cannot be computed independently
- the Q functions are not sufficient to specify a policy
- the updates are not efficient (they are in the class PPAD, which can be as hard as NP)










