This project implements anomaly detection using Gaussian distribution models. It identifies potentially faulty servers by analyzing latency and throughput statistics, and extends to higher-dimensional data.
- Estimate Gaussian parameters: Compute mean (mu) and variance (sigma^2) for each feature
- Compute probabilities: For each data point, compute `p(x) = prod(N(x_i; mu_i, sigma_i^2))`
- Select threshold (epsilon): Using F1 score on a cross-validation set, find the epsilon that best separates normal from anomalous points
- Flag anomalies: Points where `p(x) < epsilon` are flagged as anomalous
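The estimation, probability, and flagging steps above can be sketched in a few lines. This is a minimal illustration in plain Python, not the repository's Octave/MATLAB code: the function names `estimate_gaussian`, `density`, and `flag_anomalies` are hypothetical, the toy data is made up, and dividing by `m` (rather than `m - 1`) for the variance is an assumption.

```python
import math

def estimate_gaussian(X):
    """Return per-feature means and variances for a list of equal-length rows."""
    m = len(X)
    n = len(X[0])
    mu = [sum(row[j] for row in X) / m for j in range(n)]
    # Assumption: population variance (divide by m, not m - 1).
    sigma2 = [sum((row[j] - mu[j]) ** 2 for row in X) / m for j in range(n)]
    return mu, sigma2

def density(x, mu, sigma2):
    """p(x) as the product of independent univariate Gaussian densities."""
    p = 1.0
    for x_i, mu_i, s2_i in zip(x, mu, sigma2):
        p *= math.exp(-((x_i - mu_i) ** 2) / (2 * s2_i)) / math.sqrt(2 * math.pi * s2_i)
    return p

def flag_anomalies(X, mu, sigma2, epsilon):
    """Return the indices of rows whose density falls below epsilon."""
    return [i for i, x in enumerate(X) if density(x, mu, sigma2) < epsilon]

# Toy data (made up for illustration): four typical points and one outlier.
X_train = [[1.0, 2.0], [1.2, 1.8], [0.8, 2.2], [1.0, 2.0]]
mu, sigma2 = estimate_gaussian(X_train)
outliers = flag_anomalies(X_train + [[5.0, 5.0]], mu, sigma2, epsilon=1e-3)
```

Treating the features as independent is equivalent to a multivariate Gaussian with a diagonal covariance matrix, which is why the product form suffices for this sketch.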
| File | Description |
|---|---|
| `sample8.m` | Main script: anomaly detection on server data |
| `estimateGaussian.m` | Estimates mu and sigma^2 from data |
| `multivariateGaussian.m` | Computes multivariate Gaussian probability |
| `selectThreshold.m` | Selects epsilon using F1 score |
| `visualizeFit.m` | Visualizes the Gaussian fit with contours |
| `ex8data1.mat` | 2D server statistics dataset |
| `ex8data2.mat` | 11D server statistics dataset |
- 2D dataset: Best epsilon = 8.99e-05, F1 score = 0.875
- 11D dataset: Best epsilon = 1.38e-18, F1 score = 0.615, with 117 anomalies found
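The F1 scores reported above combine precision and recall on the cross-validation set. A minimal Python sketch of how a threshold could be chosen this way follows; it is not the repository's `selectThreshold.m`, and the function names, the evenly spaced scan between the smallest and largest density, and the default of 1000 steps are all assumptions made for illustration.

```python
def f1_score(predicted, actual):
    """F1 = 2 * precision * recall / (precision + recall); 0.0 if no true positives."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def select_threshold(p_val, y_val, steps=1000):
    """Scan candidate epsilons and return (best_epsilon, best_f1).

    p_val: densities p(x) for the cross-validation points.
    y_val: ground-truth labels, 1 for anomalous and 0 for normal.
    """
    lo, hi = min(p_val), max(p_val)
    best_eps, best_f1 = 0.0, 0.0
    for k in range(1, steps + 1):
        eps = lo + (hi - lo) * k / steps
        predicted = [p < eps for p in p_val]  # flag points with low density
        f1 = f1_score(predicted, y_val)
        if f1 > best_f1:
            best_eps, best_f1 = eps, f1
    return best_eps, best_f1

# Toy cross-validation data (made up): two low-density points are the anomalies.
p_val = [0.9, 0.8, 0.85, 0.01, 0.02, 0.7]
y_val = [0, 0, 0, 1, 1, 0]
best_eps, best_f1 = select_threshold(p_val, y_val)
```

F1 is preferred over plain accuracy here because anomalies are rare: a classifier that flags nothing would score high accuracy but an F1 of zero.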
Left: Gaussian contours with anomalies circled in red. Right: F1 score vs. epsilon threshold for optimal threshold selection.
Exercises from Andrew Ng's Machine Learning course on Coursera, completed by Keivan Hassani Monfared.
