Coursera Data Science Specialization: Regression Models Course Project
This is my Regression Models course project, part of the Johns Hopkins Data Science Specialization on Coursera.
You can find the Rmd and html format here, the compiled html result here.
Miles Per Gallon: Automatic vs Manual Transmission
Summary
This document use the data of the mtcars package for the analysis. The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).
The goal is answer at this question: "is an automatic or manual transmission better for MPG?" and quantify the MPG difference between the transmissions.
Based on the data, this document show that the manual transmission is better than the automatic for the mpg: all the models tested say this.
The best model says that the manual transmission allow to do an average of 2.396 miles plus then the automatic.
Intro
mtcars have 32 observations on 11 variables:
- mpg: Miles/(US) gallon
- cyl: Number of cylinders
- disp: Displacement (cu.in.)
- hp: Gross horsepower
- drat: Rear axle ratio
- wt: Weight (1000 lbs)
- qsec: 1/4 mile time
- vs: V/S
- am: Transmission (0 = automatic, 1 = manual)
- gear: Number of forward gears
- carb: Number of carburetors
The variables am and vs must be model as factor variables:
data(mtcars)
mtcars$vs <- as.factor(mtcars$vs)
mtcars$am <- as.factor(mtcars$am)
Exploratory analysis
Base statistics:
summary(mtcars)
## mpg cyl disp hp
## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0
## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5
## Median :19.20 Median :6.000 Median :196.3 Median :123.0
## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7
## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0
## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0
## drat wt qsec vs am
## Min. :2.760 Min. :1.513 Min. :14.50 0:18 0:19
## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1:14 1:13
## Median :3.695 Median :3.325 Median :17.71
## Mean :3.597 Mean :3.217 Mean :17.85
## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90
## Max. :4.930 Max. :5.424 Max. :22.90
## gear carb
## Min. :3.000 Min. :1.000
## 1st Qu.:3.000 1st Qu.:2.000
## Median :4.000 Median :2.000
## Mean :3.688 Mean :2.812
## 3rd Qu.:4.000 3rd Qu.:4.000
## Max. :5.000 Max. :8.000
The boxplot shown the miles per gallon by transmission type:
boxplot(mpg ~ am, data = mtcars,
col = c("orange", "blue"),
xlab = "Transmission",
ylab = "Miles per Gallon",
main = "MPG by Transmission Type",
names= c("automatic","manual"),
horizontal= T)
Manual transimission seems better seeing the boxplot.
Test
Have a test for the hypotesis that mean mpg for manual and automatic transmission is similar.
auto=subset(mtcars,select=mpg,am==0)
manual=subset(mtcars,select=mpg,am==1)
t.test(auto,manual)
##
## Welch Two Sample t-test
##
## data: auto and manual
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -11.280194 -3.209684
## sample estimates:
## mean of x mean of y
## 17.14737 24.39231
Null hypotesis (mean mpg for automatic and manual are similar) rejected.
Linear Regressions
First I do a simple regression with am dependent variable and mpg independent.
regSIM <- lm(mpg~am,mtcars)
summary(regSIM)
##
## Call:
## lm(formula = mpg ~ am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -9.3923 -3.0923 -0.2974 3.2439 9.5077
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 17.147 1.125 15.247 1.13e-15 ***
## am1 7.245 1.764 4.106 0.000285 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.902 on 30 degrees of freedom
## Multiple R-squared: 0.3598, Adjusted R-squared: 0.3385
## F-statistic: 16.86 on 1 and 30 DF, p-value: 0.000285
This regression show that manual is better: average 7.245 miles plus then automatic.
The R squared id 0.36, so the model explains 36% of the variance.
Now I do a multivariate regression with am dependent variable and all the other variables as independent.
regTOT <- lm(mpg~.,mtcars)
summary(regTOT)
##
## Call:
## lm(formula = mpg ~ ., data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4506 -1.6044 -0.1196 1.2193 4.6271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 12.30337 18.71788 0.657 0.5181
## cyl -0.11144 1.04502 -0.107 0.9161
## disp 0.01334 0.01786 0.747 0.4635
## hp -0.02148 0.02177 -0.987 0.3350
## drat 0.78711 1.63537 0.481 0.6353
## wt -3.71530 1.89441 -1.961 0.0633 .
## qsec 0.82104 0.73084 1.123 0.2739
## vs1 0.31776 2.10451 0.151 0.8814
## am1 2.52023 2.05665 1.225 0.2340
## gear 0.65541 1.49326 0.439 0.6652
## carb -0.19942 0.82875 -0.241 0.8122
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.65 on 21 degrees of freedom
## Multiple R-squared: 0.869, Adjusted R-squared: 0.8066
## F-statistic: 13.93 on 10 and 21 DF, p-value: 3.793e-07
Manual is better but not like before (only 2.52 miles plus then automatic).
R squared shown that model explains 86.9% of the variance.
The problem is that all the coefficients aren't significative.
I use the stepwise regression method for choice the best variables for explain the mpg values.
regSR=step(regTOT,trace=0)
summary(regSR)
##
## Call:
## lm(formula = mpg ~ wt + qsec + am, data = mtcars)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4811 -1.5555 -0.7257 1.4110 4.6610
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.6178 6.9596 1.382 0.177915
## wt -3.9165 0.7112 -5.507 6.95e-06 ***
## qsec 1.2259 0.2887 4.247 0.000216 ***
## am1 2.9358 1.4109 2.081 0.046716 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.459 on 28 degrees of freedom
## Multiple R-squared: 0.8497, Adjusted R-squared: 0.8336
## F-statistic: 52.75 on 3 and 28 DF, p-value: 1.21e-11
The variables that better explains the mpg values are wt, qsec and am.
This model explains 84.97% of the variance and have all the coefficients significative at 5%, so this model is better than the other two.
For this model the manual transmission allow to do 2.396 miles plus then the automatic.
anova(regSIM,regSR,regTOT)
## Analysis of Variance Table
##
## Model 1: mpg ~ am
## Model 2: mpg ~ wt + qsec + am
## Model 3: mpg ~ cyl + disp + hp + drat + wt + qsec + vs + am + gear + carb
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 30 720.90
## 2 28 169.29 2 551.61 39.2687 8.025e-08 ***
## 3 21 147.49 7 21.79 0.4432 0.8636
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Anova confirm that the model with three regressor (wt, qsec, am), is the best choice.
Residuals
This is the resuidual plot of the best model:
plot(regSR, which=c(1:1))