R: Machine Learning Intro
Some basic notes.
Steps: Question -> input data -> features -> algorithm -> parameters -> evaluation
- Large data sample: 60% training, 20% test, 20% validation (a split sketch follows this list)
- Medium data sample: 60%-75% training, 25%-40% test
- Small data sample: use all the data for training and estimate performance with cross-validation
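A minimal sketch of the 60/20/20 split in base R (df is a placeholder for your data frame; the proportions are the ones listed above):
set.seed(42)
n   <- nrow(df)                                         # df: hypothetical data frame
idx <- sample(seq_len(n))                               # shuffled row indices
train_idx <- idx[1:floor(0.6 * n)]                      # first 60% -> training
test_idx  <- idx[(floor(0.6 * n) + 1):floor(0.8 * n)]   # next 20% -> test
valid_idx <- idx[(floor(0.8 * n) + 1):n]                # last 20% -> validation
train_set <- df[train_idx, ]
test_set  <- df[test_idx, ]
valid_set <- df[valid_idx, ]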
Evaluation metrics (a sketch computing them follows this list):
- Sensitivity: True Positive / (True Positive + False Negative)
- Specificity: True Negative / (False Positive + True Negative)
- Positive Predictive Value: TP / (TP + FP)
- Negative Predictive Value: TN / (FN + TN)
- Accuracy: (TP + TN) / (TP + FP + TN + FN)
- MSE (continuous values): (1/N) Σᵢ (Predictionᵢ - Truthᵢ)²
- RMSE (Root Mean Square Error): √MSE
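The classification metrics above can be checked by hand from the four confusion-matrix counts, and MSE/RMSE from two numeric vectors; the counts and vectors below are made-up examples:
TP <- 50; FP <- 10; TN <- 80; FN <- 5             # made-up confusion-matrix counts
sensitivity <- TP / (TP + FN)                     # true positive rate
specificity <- TN / (FP + TN)                     # true negative rate
ppv         <- TP / (TP + FP)                     # positive predictive value
npv         <- TN / (FN + TN)                     # negative predictive value
accuracy    <- (TP + TN) / (TP + FP + TN + FN)
prediction <- c(2.1, 3.9, 6.2)                    # toy continuous predictions
truth      <- c(2.0, 4.0, 6.0)
mse  <- mean((prediction - truth)^2)              # mean squared error
rmse <- sqrt(mse)                                 # root mean squared error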
Caret package:
library(caret)
1. Pre-process (cleaning)
Check for NAs and for correlation between variables (if many variables are highly correlated, consider PCA), as sketched below.
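A rough sketch of these checks with caret, using the spam data set introduced in step 2 below (the correlation cutoff and PCA threshold are arbitrary example values):
library(caret); library(kernlab); data(spam)
colSums(is.na(spam))                          # count NAs per column
numeriche    <- sapply(spam, is.numeric)      # keep only numeric columns
correlazioni <- cor(spam[, numeriche])
findCorrelation(correlazioni, cutoff = 0.9)   # indices of highly correlated columns
# If many variables are correlated, PCA can compress them
preProc <- preProcess(spam[, numeriche], method = "pca", thresh = 0.95)
spamPC  <- predict(preProc, spam[, numeriche])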
2. Build Partitions
library(kernlab)
# Example: predict whether an email is spam
data(spam)
dati <- createDataPartition(y=spam$type, p=0.75, list=FALSE)
training <- spam[dati,] #train data
testing <- spam[-dati,] #test data
Alternatives to createDataPartition():
- createFolds() to create k folds for cross-validation
- createResample() for bootstrap resamples (each record can be drawn multiple times)
- createTimeSlices() for time series (train on past windows, test on the more recent data); see the sketch below
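A quick sketch of the three alternatives (the k, times, and window sizes are arbitrary example values):
# 10 folds, each returning the training indices for that fold
folds <- createFolds(y = spam$type, k = 10, list = TRUE, returnTrain = TRUE)
sapply(folds, length)
# 10 bootstrap resamples (records can appear more than once)
resamples <- createResample(y = spam$type, times = 10, list = TRUE)
# Sliding windows for time series: train on 20 points, test on the next 10
tme    <- 1:1000
slices <- createTimeSlices(y = tme, initialWindow = 20, horizon = 10)
names(slices)   # "train" and "test" index lists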
3. Build the model and make predictions
Generalized linear model (since type is a binary factor, this fits a logistic regression):
modello <- train(type~., data=training, method="glm")
See the coefficients:
modello$finalModel
Prediction on test data:
previsioni <- predict(modello, newdata=testing)
Some other models (sketched below):
- Trees, method="rpart"
- Random forests, method="rf"
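A sketch of the same spam example with these two methods (assumes the rpart and randomForest packages are installed; ntree = 100 is just an example value, and random forests can be slow on the full data set):
# Classification tree
modAlbero <- train(type ~ ., data = training, method = "rpart")
modAlbero$finalModel                           # text view of the splits
# Random forest
modForesta <- train(type ~ ., data = training, method = "rf", ntree = 100)
predict(modForesta, newdata = testing)[1:10]   # first few predicted classes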
4. Model comparison (confusion matrix)
See how many predictions are correct, plus the evaluation metrics listed above:
confusionMatrix(previsioni, testing$type)