- This prevents the vanishing/exploding gradient problem, in which the model learns too slowly or too quickly.
- It does this by normalizing the data: subtracting the batch mean (centering it at 0) and dividing by the batch standard deviation (scaling it to 1).
- In order to do this, each node must now also carry two new learnable parameters, gamma (scale) and beta (shift). In effect, these let the network learn the best mean and standard deviation for that activation.
- Greatly speeds up training by making gradient descent easier (i.e. fewer steep/flat zones in the loss surface); see the sketch below.
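A minimal NumPy sketch of the normalize-then-scale/shift step described above (the function name `batch_norm_forward` and the `eps` stabilizer are illustrative, not from these notes):

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, eps=1e-5):
    # Normalize: zero mean, unit variance per feature across the batch
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_hat = (x - mean) / np.sqrt(var + eps)
    # Learned scale (gamma) and shift (beta) are applied afterward
    return gamma * x_hat + beta

# Example: batch of 4 samples with 3 features each
x = np.random.randn(4, 3)
out = batch_norm_forward(x, gamma=np.ones(3), beta=np.zeros(3))
print(out.mean(axis=0))  # ~0 per feature
print(out.std(axis=0))   # ~1 per feature
```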
- max(0, x)
- This is better than sigmoid/tanh because it does not saturate: the gradient stays at 1 for all positive inputs instead of shrinking toward 0 at the extremes.
- Variants
- Leaky ReLU - Piecewise function: f(x) = alpha * x for x < 0, f(x) = x for x >= 0. The small negative slope keeps a nonzero gradient for negative inputs, preventing neurons from dying (permanently outputting zero).
- Just use it (plain ReLU is the standard default); sketch of both below.
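A minimal NumPy sketch of both activations (the `alpha=0.01` default is a common illustrative choice, not from these notes):

```python
import numpy as np

def relu(x):
    # Zeroes negatives, passes positives through unchanged
    return np.maximum(0, x)

def leaky_relu(x, alpha=0.01):
    # Small slope alpha for negatives keeps a nonzero gradient there
    return np.where(x >= 0, x, alpha * x)

x = np.array([-2.0, -0.5, 0.0, 1.5])
print(relu(x))        # [0.  0.  0.  1.5]
print(leaky_relu(x))  # [-0.02  -0.005  0.  1.5]
```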
- Binary or categorical cross-entropy, depending on the number of output neurons; sketch below.
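Assuming this note refers to the cross-entropy loss, a minimal sketch of both forms (the `eps` clip, which guards against log(0), is an illustrative addition):

```python
import numpy as np

def binary_cross_entropy(y_true, y_pred, eps=1e-12):
    # One sigmoid output neuron: y_pred is a probability in (0, 1)
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_cross_entropy(y_true, y_pred, eps=1e-12):
    # One softmax output neuron per class: each row of y_pred sums to 1
    y_pred = np.clip(y_pred, eps, 1.0)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=1))

print(binary_cross_entropy(np.array([1, 0]), np.array([0.9, 0.2])))
print(categorical_cross_entropy(np.array([[0, 1, 0]]), np.array([[0.1, 0.8, 0.1]])))
```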
- Methods that prevent the model from overfitting
- L2 Regularization
- Adds a penalty term to the loss function: the sum of squared weights
- Penalizes large weights, pulling them toward zero
- One parameter (commonly lambda) scales the added term; see the sketch below
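A minimal sketch of the added term (the name `lam` stands in for the single scaling parameter; the value 1e-4 is illustrative):

```python
import numpy as np

def l2_penalty(weights, lam=1e-4):
    # lam is the single scaling parameter; larger lam -> stronger pull toward 0
    return lam * sum(np.sum(w ** 2) for w in weights)

# total_loss = base_loss + l2_penalty([W1, W2, ...])
weights = [np.random.randn(3, 3), np.random.randn(3, 1)]
print(l2_penalty(weights))
```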
- Dropout
- Randomly selects nodes to zero out (ignore) during each training pass
- Helps with generalization
- One parameter that determines the fraction of nodes dropped; sketch below
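A minimal sketch of dropout at training time; this uses the "inverted dropout" rescaling convention (an assumption, since the notes do not specify a variant):

```python
import numpy as np

def dropout(x, rate=0.5, training=True):
    # rate is the single parameter: the fraction of nodes zeroed per pass
    if not training:
        return x  # no dropout at inference time
    mask = np.random.rand(*x.shape) >= rate
    # Inverted dropout: rescale so the expected activation is unchanged
    return x * mask / (1.0 - rate)

activations = np.ones((2, 4))
print(dropout(activations, rate=0.5))
```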
- Convergence at a high loss
- probably too high a learning rate (updates overshoot the minimum)
- or possibly due to vanishing or exploding gradients