Skip to content

Latest commit

 

History

History
160 lines (130 loc) · 5.07 KB

File metadata and controls

160 lines (130 loc) · 5.07 KB

Practical Machine Learning - Prediction Assignment Writeup

This document describes the steps taken for the prediction assignment of the practical machine learning course.

The first part is the declaration of the packages which are used Note : to be reproductible, I also set the seed value.

library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
library(randomForest)
## randomForest 4.6-7
## Type rfNews() to see new features/changes/bug fixes.
library(Hmisc)
library(randomForest)
library(foreach)
library(doParallel)
set.seed(1234)

The first step is to load the csv file (I've preloaded the code onto my workspace).

The data contained may NA's and hence those colums were removed using #DIV/0!

data <- read.csv("pml-training.csv", na.strings=c("#DIV/0!"))
clData <- data

IV/0!" values :

Forcing the cast to numeric values:

for(i in c(8:ncol(clData)-1)) {clData[,i] = as.numeric(as.character(clData[,i]))}

To manage the second issue we will select as feature only the column with a 100% completion

featuresnames <- colnames(clData[colSums(is.na(clData)) == 0])[-(1:7)]
features <- clData[featuresnames]

We have now a dataframe for features which are feasible for us

xdata <- createDataPartition(y=features$classe, p=3/4, list=FALSE )
training <- features[xdata,]
testing <- features[-xdata,]

We can now train a classifier with the training data. To do that we will use parallelise the processing with the foreach and doParallel package : we call registerDoParallel to instantiate the configuration. (By default it's assign the half of the core available on your laptop, for me it's 4, because of hyperthreading) So we ask to process 4 random forest with 150 trees each and combine then to have a random forest model with a total of 600 trees.

model <- foreach(ntree=rep(150, 4), .combine=randomForest::combine) %dopar% randomForest(training[-ncol(training)], training$classe, ntree=ntree)

The training and testing is done in the codes below.

predictionsTr <- predict(model, newdata=training)
confusionMatrix(predictionsTr,training$classe)
predictionsTe <- predict(model, newdata=testing)
confusionMatrix(predictionsTe,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 4185    0    0    0    0
##          B    0 2848    0    0    0
##          C    0    0 2567    0    0
##          D    0    0    0 2412    0
##          E    0    0    0    0 2706
## 
## Overall Statistics
##                                 
##                Accuracy : 1     
##                  95% CI : (1, 1)
##     No Information Rate : 0.284 
##     P-Value [Acc > NIR] : <2e-16
##                                 
##                   Kappa : 1     
##  Mcnemar's Test P-Value : NA    
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             1.000    1.000    1.000    1.000    1.000
## Specificity             1.000    1.000    1.000    1.000    1.000
## Pos Pred Value          1.000    1.000    1.000    1.000    1.000
## Neg Pred Value          1.000    1.000    1.000    1.000    1.000
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.284    0.194    0.174    0.164    0.184
## Detection Prevalence    0.284    0.194    0.174    0.164    0.184
## Balanced Accuracy       1.000    1.000    1.000    1.000    1.000
predictionsTe <- predict(model, newdata=testing)
confusionMatrix(predictionsTe,testing$classe)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    A    B    C    D    E
##          A 1395    1    0    0    0
##          B    0  946    6    0    0
##          C    0    2  849    6    1
##          D    0    0    0  798    1
##          E    0    0    0    0  899
## 
## Overall Statistics
##                                         
##                Accuracy : 0.997         
##                  95% CI : (0.994, 0.998)
##     No Information Rate : 0.284         
##     P-Value [Acc > NIR] : <2e-16        
##                                         
##                   Kappa : 0.997         
##  Mcnemar's Test P-Value : NA            
## 
## Statistics by Class:
## 
##                      Class: A Class: B Class: C Class: D Class: E
## Sensitivity             1.000    0.997    0.993    0.993    0.998
## Specificity             1.000    0.998    0.998    1.000    1.000
## Pos Pred Value          0.999    0.994    0.990    0.999    1.000
## Neg Pred Value          1.000    0.999    0.999    0.999    1.000
## Prevalence              0.284    0.194    0.174    0.164    0.184
## Detection Rate          0.284    0.193    0.173    0.163    0.183
## Detection Prevalence    0.285    0.194    0.175    0.163    0.183
## Balanced Accuracy       1.000    0.998    0.995    0.996    0.999

The Random walk produces a kappa of .997 which is very good result for us this is validated by the project submission score.