R: Machine Learning Intro

Some basic notes.

Steps: Question -> input data -> features -> algorithm -> parameters -> evaluation

  • Large data sample: 60% training, 20% test, 20% validation
  • Medium data sample: 60%-75% training, 25%-40% test
  • Small data sample: train on all the data, then estimate the error by cross-validation on resampled subsets
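
The large-sample split above can be sketched in base R as follows. This is only an illustration: the built-in iris data set and the seed are stand-ins, and the split is purely random (not stratified by outcome, unlike caret's createDataPartition()).

```r
# Sketch: random 60/20/20 split of a data frame (base R only).
set.seed(123)                 # assumed seed, only for reproducibility
df <- iris                    # stand-in data set (assumption)
n <- nrow(df)
idx <- sample(n)              # random permutation of the row numbers

n_train <- floor(0.6 * n)     # 60% training
n_test  <- floor(0.2 * n)     # 20% test, the remainder is validation

train_set <- df[idx[1:n_train], ]
test_set  <- df[idx[(n_train + 1):(n_train + n_test)], ]
valid_set <- df[idx[(n_train + n_test + 1):n], ]

# The three sets are disjoint and together cover every row.
c(train = nrow(train_set), test = nrow(test_set), validation = nrow(valid_set))
```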

Evaluation metrics:

  1. Sensitivity: True Positive / (True Positive + False Negative)
  2. Specificity: True Negative / (False Positive + True Negative)
  3. Positive Predictive Value: TP / (TP + FP)
  4. Negative Predictive Value: TN / (FN + TN)
  5. Accuracy: (TP + TN) / (TP + FP + TN + FN)
  6. MSE (Mean Squared Error, continuous values): (1/N) Σᵢ (Predictionᵢ - Truthᵢ)²
  7. RMSE (Root Mean Square Error): √MSE
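
As a quick check of the formulas above, here is a small base R sketch. The counts and the prediction/truth vectors are made-up numbers chosen only for illustration.

```r
# Hypothetical confusion counts for a binary classifier (illustrative only)
TP <- 40; FN <- 10; FP <- 5; TN <- 45

sensitivity <- TP / (TP + FN)                    # 40 / 50
specificity <- TN / (FP + TN)                    # 45 / 50
ppv         <- TP / (TP + FP)                    # 40 / 45
npv         <- TN / (FN + TN)                    # 45 / 55
accuracy    <- (TP + TN) / (TP + FP + TN + FN)   # 85 / 100

# Hypothetical continuous predictions and true values
prediction <- c(2.5, 0.0, 2.0, 8.0)
truth      <- c(3.0, -0.5, 2.0, 7.0)

mse  <- mean((prediction - truth)^2)   # average of the squared errors
rmse <- sqrt(mse)                      # same units as the outcome
```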

Caret package:

library(caret)

1. Pre-process (cleaning)

Check for missing values (NA) and for correlation between variables; if predictors are strongly correlated, consider PCA.
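
A minimal base R sketch of these checks follows. The built-in airquality data set is a stand-in, and the 0.6 correlation cutoff is an arbitrary assumption. In caret, preProcess() with method="pca" is one way to apply PCA inside the modeling workflow; the sketch uses prcomp() to stay dependency-free.

```r
df <- airquality                       # stand-in data set that contains NAs

# Count the missing values in each column
na_per_column <- colSums(is.na(df))

# Pairwise correlations, ignoring rows where either value is missing
cor_mat <- cor(df, use = "pairwise.complete.obs")

# Variable pairs whose absolute correlation exceeds an assumed cutoff
cutoff <- 0.6
high <- which(abs(cor_mat) > cutoff & upper.tri(cor_mat), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(cor_mat)[high[, "row"]],
                         var2 = colnames(cor_mat)[high[, "col"]])

# PCA on the complete rows, with variables centered and scaled
pca <- prcomp(na.omit(df), center = TRUE, scale. = TRUE)
summary(pca)                           # variance explained per component
```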

2. Build Partitions

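library(kernlab)     #provides the spam data set

#Example: see if email is spam
data(spam)
set.seed(123)        #any fixed seed makes the random split reproducible
dati <- createDataPartition(y=spam$type, p=0.75, list=FALSE)   #row indices for a 75% training split
training <- spam[dati,]     #train data
testing <- spam[-dati,]     #test data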

Alternatives to createDataPartition():

  • createFolds() to split the data into k folds for cross-validation
  • createResample() to draw bootstrap samples (sampling with replacement, so a record can appear multiple times)
  • createTimeSlices() for time series (train on an older window, test on the more recent observations)
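
The three alternatives can be tried on toy data as in the sketch below. The iris outcome, the seed, and the numeric settings (k, times, initialWindow, horizon) are arbitrary choices for illustration.

```r
library(caret)
set.seed(123)                          # assumed seed, for reproducibility

y <- iris$Species                      # stand-in outcome (assumption)

# 5 folds: each list element holds the row indices of one held-out fold
folds <- createFolds(y, k = 5, list = TRUE, returnTrain = FALSE)

# 3 bootstrap samples: indices drawn with replacement, each of length(y)
boot <- createResample(y, times = 3, list = TRUE)

# Time slices on a toy series of 20 ordered observations:
# train on a window of 10 points, test on the 2 points that follow it
ts_y <- 1:20
slices <- createTimeSlices(ts_y, initialWindow = 10, horizon = 2,
                           fixedWindow = TRUE)

sapply(folds, length)                  # fold sizes
slices$train[[1]]                      # first training window
slices$test[[1]]                       # first test window
```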

3. Build the model and make predictions

Generalized linear model (here a logistic regression, since the outcome type is a two-class factor):

modello <- train(type~., data=training, method="glm")

Inspect the fitted model and its coefficients:

modello$finalModel

Prediction on test data:

previsioni <- predict(modello, newdata=testing)

Some other models:

  • Trees, method="rpart"
  • Random forests, method="rf"
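
Switching model is mostly a matter of changing the method argument. The sketch below fits a tree with 5-fold cross-validation. It assumes the rpart package is installed, and it uses iris as a stand-in data set with an arbitrary seed and fold count.

```r
library(caret)
set.seed(123)                          # assumed seed, for reproducibility

# Resampling scheme used while tuning the model: 5-fold cross-validation
ctrl <- trainControl(method = "cv", number = 5)

# Classification tree; method = "rf" would fit a random forest instead
# (that needs the randomForest package)
tree_fit <- train(Species ~ ., data = iris, method = "rpart",
                  trControl = ctrl)

tree_fit                               # resampled accuracy per tuning value
tree_pred <- predict(tree_fit, newdata = iris)
```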

4. Model evaluation (confusion matrix)

See how many predictions are correct, along with some evaluation metrics:

confusionMatrix(previsioni, testing$type)
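
To see where these numbers come from, the table and the accuracy can be rebuilt by hand in base R. The labels below are made up for illustration. One point worth checking in your own output: confusionMatrix() treats the first factor level as the positive class unless its positive argument is set, so sensitivity and specificity swap roles depending on that choice.

```r
# Hypothetical true and predicted labels (illustrative only)
lev <- c("nonspam", "spam")
truth     <- factor(c("spam", "spam", "nonspam", "nonspam", "spam", "nonspam"),
                    levels = lev)
predicted <- factor(c("spam", "nonspam", "nonspam", "nonspam", "spam", "spam"),
                    levels = lev)

# Rows are predictions, columns are the reference (true) labels
tab <- table(Prediction = predicted, Reference = truth)

# Accuracy: correct predictions (the diagonal) over all predictions
accuracy <- sum(diag(tab)) / sum(tab)

# Sensitivity with "spam" taken as the positive class: TP / (TP + FN)
sens_spam <- tab["spam", "spam"] / sum(tab[, "spam"])
```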