R: Machine Learning Intro

Some basic notes.

Steps: Question -> input data -> features -> algorithm -> parameters -> evaluation

Large data sample: 60% training, 20% test, 20% validation
Medium data sample: 60%-75% training, 25%-40% test
Small data sample: use all of it for training, then estimate the error with cross validation (repeatedly holding out part of the sample)
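The large-sample split above can be sketched in base R as follows. The data frame `toy` and the names `train_set`, `test_set` and `validation_set` are hypothetical, chosen only for illustration.

```r
# Hypothetical toy data set: 100 rows, one predictor and one outcome
set.seed(42)                                   # make the split reproducible
toy <- data.frame(x = rnorm(100), y = rnorm(100))

n   <- nrow(toy)
idx <- sample(n)                               # random permutation of the row numbers

n_train <- round(0.6 * n)                      # 60% training
n_test  <- round(0.2 * n)                      # 20% test

train_set      <- toy[idx[1:n_train], ]
test_set       <- toy[idx[(n_train + 1):(n_train + n_test)], ]
validation_set <- toy[idx[(n_train + n_test + 1):n], ]   # remaining 20%

c(train = nrow(train_set), test = nrow(test_set), validation = nrow(validation_set))
```

Shuffling the row numbers once and slicing the permutation guarantees that no row ends up in more than one subset.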

Evaluation metrics:

Sensitivity: True Positive / (True Positive + False Negative)
Specificity: True Negative / (False Positive + True Negative)
Positive Predictive Value: TP / (TP + FP)
Negative Predictive Value: TN / (FN + TN)
Accuracy: (TP + TN) / (TP + FP + TN + FN)
MSE (Mean Squared Error, continuous values): (1/N) Σᵢ (Predictionᵢ - Truthᵢ)²
RMSE (Root Mean Square Error): √MSE
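As a worked example, the formulas above can be evaluated directly in base R. The counts and the `prediction` / `truth` vectors are made-up numbers used only to illustrate the arithmetic.

```r
# Hypothetical confusion-matrix counts for a binary classifier
TP <- 40; FN <- 10; FP <- 5; TN <- 45

sensitivity <- TP / (TP + FN)                  # share of real positives that were found
specificity <- TN / (FP + TN)                  # share of real negatives that were found
ppv         <- TP / (TP + FP)                  # Positive Predictive Value
npv         <- TN / (FN + TN)                  # Negative Predictive Value
accuracy    <- (TP + TN) / (TP + FP + TN + FN)

# Hypothetical predictions and true values for a continuous outcome
prediction <- c(2.5, 0.0, 2.0, 8.0)
truth      <- c(3.0, -0.5, 2.0, 7.0)

mse  <- mean((prediction - truth)^2)           # square each error, then average
rmse <- sqrt(mse)                              # back on the scale of the outcome
```

Note that each difference is squared before summing; squaring the sum of the differences would let positive and negative errors cancel out.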

Caret package:

library(caret)

1. Pre-process (cleaning)

Check for missing values (NA) and for correlation between variables; if many predictors are strongly correlated, consider reducing them with PCA
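A minimal sketch of these checks, assuming the spam data set from the kernlab package (the one used later in these notes). The 0.8 correlation cutoff and the 90% variance threshold are arbitrary values picked for illustration, and the names `predictors`, `high_corr`, `pca_fit` and `pca_data` are hypothetical.

```r
library(caret)
library(kernlab)
data(spam)

predictors <- spam[, names(spam) != "type"]    # drop the outcome column

sum(is.na(predictors))                         # total number of missing values

# Columns suggested for removal because they are highly correlated with others
corr_matrix <- cor(predictors)
high_corr   <- findCorrelation(corr_matrix, cutoff = 0.8)
names(predictors)[high_corr]

# PCA keeping enough components to explain 90% of the variance
# (preProcess centers and scales the data when method = "pca")
pca_fit  <- preProcess(predictors, method = "pca", thresh = 0.9)
pca_data <- predict(pca_fit, predictors)
dim(pca_data)
```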

2. Build Partitions

library(kernlab)

#Example: see if email is spam
data(spam)
dati <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training <- spam[dati,]     #train data
testing <- spam[-dati,]     #test data

Alternatives to createDataPartition():

createFolds() to split the data into k folds for cross validation
createResample() for bootstrap samples (sampling with replacement, so a record can appear multiple times)
createTimeSlices() for time series (train on older data, test on more recent data)
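The three alternatives can be tried out as in the sketch below. The values of `k`, `times`, `initialWindow` and `horizon` are arbitrary choices, and `time_index` is a hypothetical stand-in for a real time series.

```r
library(caret)
library(kernlab)
data(spam)
set.seed(123)

# 10 folds; returnTrain = TRUE gives the row numbers of each training part
folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = TRUE)
sapply(folds, length)

# 10 bootstrap samples, each drawn with replacement from all rows
resamples <- createResample(y = spam$type, times = 10, list = TRUE)
sapply(resamples, length)

# Hypothetical time index of 100 observations:
# train on 20 consecutive points, test on the 10 that follow
time_index <- 1:100
slices <- createTimeSlices(y = time_index, initialWindow = 20, horizon = 10)
names(slices)
slices$train[[1]]
slices$test[[1]]
```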

3. Build the model and make predictions

Generalized linear model (with a two-level factor outcome such as type, this fits a logistic regression):

modello <- train(type~., data=training, method="glm")

See the coefficients:

modello$finalModel

Prediction on test data:

previsioni <- predict(modello, newdata=testing)

Some other models:

Trees, method="rpart"
Random forests, method="rf"
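Switching model only requires changing the method argument. The sketch below fits a tree with 5-fold cross validation; it assumes the rpart package is installed, and the names `in_train`, `ctrl`, `tree_fit` and `tree_pred` as well as the number of folds are illustrative choices.

```r
library(caret)
library(kernlab)
data(spam)
set.seed(123)

in_train <- createDataPartition(y = spam$type, p = 0.75, list = FALSE)
training <- spam[in_train, ]
testing  <- spam[-in_train, ]

# Resampling scheme used by train() to tune the model: 5-fold cross validation
ctrl <- trainControl(method = "cv", number = 5)

tree_fit  <- train(type ~ ., data = training, method = "rpart", trControl = ctrl)
tree_pred <- predict(tree_fit, newdata = testing)

table(tree_pred)                               # how many emails per predicted class
```

Replacing "rpart" with "rf" would fit a random forest in the same way, provided the randomForest package is available; expect it to take noticeably longer.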

4. Model evaluation (confusion matrix)

See how many predictions are correct, together with some evaluation metrics:

confusionMatrix(previsioni, testing$type)
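Individual metrics can also be pulled out of the result instead of reading the printed summary. The sketch below uses two small hypothetical label vectors (`predicted`, `actual`) so that it runs on its own; depending on the caret version, confusionMatrix() may also require the e1071 package to be installed.

```r
library(caret)

# Hypothetical predicted and true labels
lv        <- c("nonspam", "spam")
predicted <- factor(c("spam", "spam", "nonspam", "nonspam", "spam", "nonspam"), levels = lv)
actual    <- factor(c("spam", "nonspam", "nonspam", "nonspam", "spam", "spam"), levels = lv)

# positive = "spam" makes sensitivity refer to the spam class
cm <- confusionMatrix(data = predicted, reference = actual, positive = "spam")

cm$table                                       # counts of predicted vs. true classes
cm$overall["Accuracy"]
cm$byClass[c("Sensitivity", "Specificity", "Pos Pred Value", "Neg Pred Value")]
```

By default the first factor level is treated as the positive class, so setting positive explicitly avoids reading sensitivity and specificity the wrong way round.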