train.py
- trains on a set of training logs using various algorithms
- saves the trained models as joblib pickle files
- reports the accuracy of the trained models
- takes the following parameters:
--train_data_dir: sets the location of the training logs (default: data/train/laptop)
--test_data_dir: sets the location of the testing logs (default: data/test/laptop)
--save-dir: sets the location where the joblib pickle files are saved (default: save)
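For example, to run with all three locations spelled out explicitly (these are the documented flags with their default values):
python2.7 train.py --train_data_dir data/train/laptop --test_data_dir data/test/laptop --save-dir save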
Make sure you have a recent version of Python 2.7 and pip, then install the required libraries:
pip install numpy scikit-learn
Create the data directories:
mkdir -p data/{train,test}/laptop
Create the save directory:
mkdir -p save
Collect logs, keeping the first 90% of each file's lines for training and the last 10% for testing:
find /var/log -type f -size +10k -name "*.log" 2>/dev/null | while IFS= read -r log
do
    # Split each log: everything but the last tenth goes to training,
    # the last tenth goes to testing
    rows=$(wc -l < "$log")
    head -n $((rows - rows / 10)) "$log" > data/train/laptop/"${log##*/}"
    tail -n $((rows / 10)) "$log" > data/test/laptop/"${log##*/}"
done
Run the script:
python2.7 train.py
This should give something like the following:
Training log collection => 250587 data entries
Testing log collection => 27843 data entries
SGDClassifier Success rate: 97.38%
MultinomialNB Success rate: 98.64%
BernoulliNB Success rate: 96.36%
DecisionTreeClassifier Success rate: 95.26%
ExtraTreeClassifier Success rate: 94.52%
ExtraTreesClassifier Success rate: 99.21%
LinearSVC Success rate: 99.17%
NearestCentroid Success rate: 92.29%
RandomForestClassifier Success rate: 99.06%
RidgeClassifier Success rate: 99.16%
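The "data entries" counted above are individual log lines. Before any classifier can train on them, the raw text has to be turned into numeric features; a bag-of-words vectorizer is the standard scikit-learn approach, sketched below. This is an illustration only, not a detail confirmed from train.py, and the variable names train_lines/test_lines are hypothetical:

# Hypothetical feature-extraction step: turn raw log lines into a sparse
# bag-of-words matrix. CountVectorizer usage is an assumption, not train.py internals.
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
train_features = vectorizer.fit_transform(train_lines)  # learn the vocabulary from training lines
test_features = vectorizer.transform(test_lines)        # reuse the same vocabulary for testing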
predict.py
- loads the trained models from joblib pickle files
- reports the accuracy of the trained models on the testing logs
- takes the following parameters:
--test_data_dir: sets the location of the testing logs (default: data/test/laptop)
--save-dir: sets the location where the joblib pickle files are saved (default: save)
$ python2.7 predict.py
Testing log collection => 27843 data entries
SGDClassifier Success rate: 97.38%
MultinomialNB Success rate: 98.64%
BernoulliNB Success rate: 96.36%
DecisionTreeClassifier Success rate: 95.26%
ExtraTreeClassifier Success rate: 94.52%
ExtraTreesClassifier Success rate: 99.21%
LinearSVC Success rate: 99.17%
NearestCentroid Success rate: 92.29%
RandomForestClassifier Success rate: 99.06%
RidgeClassifier Success rate: 99.16%
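predict.py reverses the save step: it reloads each joblib pickle from the save directory and re-scores it on the testing logs. A minimal sketch of that load-and-score step, assuming the hypothetical test_features and equally hypothetical test_labels (for example, the name of the log file each line came from):

# Minimal sketch of loading and re-scoring one saved model (names are illustrative)
from sklearn.externals import joblib  # bundled with scikit-learn in the Python 2.7 era

clf = joblib.load('save/SGDClassifier.pkl')        # restore one joblib pickle file
accuracy = clf.score(test_features, test_labels)   # mean accuracy on the testing logs
print('%s Success rate: %.2f%%' % ('SGDClassifier', accuracy * 100))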
Adjust the algorithms array to include any number of scikit-learn classifiers you want to run; each entry is trained and scored in turn (see the sketch after the list):
from sklearn import ensemble, linear_model, naive_bayes, neighbors, neural_network, svm, tree

algorithms = [
    # svm.SVC(kernel='linear', C=1.0),  # QUITE SLOW
    linear_model.SGDClassifier(loss='hinge', penalty='l2', alpha=1e-3, random_state=42, max_iter=5, tol=None),
    naive_bayes.MultinomialNB(),
    naive_bayes.BernoulliNB(),
    tree.DecisionTreeClassifier(max_depth=1000),
    tree.ExtraTreeClassifier(),
    ensemble.ExtraTreesClassifier(),
    svm.LinearSVC(),
    # linear_model.LogisticRegressionCV(multi_class='multinomial'),  # A BIT SLOW
    # neural_network.MLPClassifier(),  # VERY SLOW
    neighbors.NearestCentroid(),
    ensemble.RandomForestClassifier(),
    linear_model.RidgeClassifier(),
]
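For reference, a minimal sketch of the loop that consumes this array, assuming the hypothetical vectorized features from the earlier sketch and equally hypothetical train/test labels; the variable names are illustrative, not taken from train.py:

# Minimal sketch of the fit/score/save loop (names are illustrative, not train.py internals)
from sklearn.externals import joblib  # bundled with scikit-learn in the Python 2.7 era

for clf in algorithms:
    name = clf.__class__.__name__                      # e.g. "SGDClassifier"
    clf.fit(train_features, train_labels)              # fit on the training logs
    accuracy = clf.score(test_features, test_labels)   # mean accuracy on the testing logs
    print('%s Success rate: %.2f%%' % (name, accuracy * 100))
    joblib.dump(clf, 'save/%s.pkl' % name)             # persist as a joblib pickle for predict.py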