# Basic Exploratory Data Analysis with Python

Oct 7, 2018

### GOAL: PREDICT THE PRICE

``````import pandas as pd
import numpy as np

path='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'
#(path='C:/Users/figinim/Documents/Studies/IBM/Resources/automobileEDA.csv')

df = pd.read_csv(path)    #load the dataset into a DataFrame
``````

Show the first rows:

``````df.head()
``````
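
Install and import the plotting libraries:

``````# the leading "!" runs a shell command and only works in Jupyter/IPython
! pip install seaborn
import matplotlib.pyplot as plt
import seaborn as sns
``````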

List the data types for each column:

``````print(df.dtypes)
``````

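Correlation between the numeric variables:

``````df.select_dtypes(include='number').corr()    #numeric columns only; object columns make corr() fail in pandas 2.x
``````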

Correlation between four specific variables:

``````df[['bore','stroke' ,'compression-ratio','horsepower']].corr()
``````
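
To make clear what `corr()` reports, here is a minimal sketch that computes the Pearson coefficient by hand and compares it with the pandas result. The toy values and the helper name `pearson_r` are assumptions for illustration and are not taken from the automobile dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not from the automobile dataset.
toy = pd.DataFrame({
    "x": [1.0, 2.0, 3.0, 4.0, 5.0],
    "y": [2.1, 3.9, 6.2, 8.1, 9.8],
})

def pearson_r(a, b):
    """Pearson r: sum of products of deviations, scaled by both spreads."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_centered = a - a.mean()
    b_centered = b - b.mean()
    numerator = (a_centered * b_centered).sum()
    denominator = np.sqrt((a_centered ** 2).sum() * (b_centered ** 2).sum())
    return float(numerator / denominator)

manual = pearson_r(toy["x"], toy["y"])
from_pandas = toy["x"].corr(toy["y"])    # pandas uses Pearson by default
print(manual, from_pandas)
```

Values near 1 or -1 indicate a strong linear relationship, and values near 0 indicate a weak one.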

Scatterplot of "engine-size" and "price" with a fitted regression line:

``````sns.regplot(x="engine-size", y="price", data=df)
``````

Correlation between the two:

``````df[["engine-size", "price"]].corr()
``````

Scatterplot and correlation between "highway-mpg" and "price":

``````sns.regplot(x="highway-mpg", y="price", data=df)
df[['highway-mpg', 'price']].corr()
``````

Boxplots for price for each body-style

``````sns.boxplot(x="body-style", y="price", data=df)
``````

Boxplots for price for each engine-location

``````sns.boxplot(x="engine-location", y="price", data=df)
``````

Basic descriptive statistics:

``````df.describe()    #only numeric fields
df.describe(include=['object'])    #with object fields
``````

Value counts of a field

``````df['drive-wheels'].value_counts()
df['drive-wheels'].value_counts().to_frame()    #in table
``````

Make a frequency table of engine-location counts:

``````# engine-location as variable
engine_loc_counts = df['engine-location'].value_counts().to_frame(name='value_counts')    #name the count column explicitly
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts
``````
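
The table above counts a single variable. If a cross tabulation of two categorical variables is wanted, `pd.crosstab` is one option. The sketch below uses hypothetical toy rows that only mimic two columns of the dataset.

```python
import pandas as pd

# Hypothetical toy rows that mimic two categorical columns of the dataset.
toy = pd.DataFrame({
    "engine-location": ["front", "front", "rear", "front", "rear", "front"],
    "drive-wheels":    ["fwd",   "rwd",   "rwd",  "fwd",   "rwd",  "4wd"],
})

# Rows: engine-location, columns: drive-wheels, cells: number of cars.
cross = pd.crosstab(toy["engine-location"], toy["drive-wheels"])
print(cross)
```

Combinations that never occur in the data show up as 0 in the resulting table.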

### GROUPING

Unique drive-wheels values:

``````df['drive-wheels'].unique()
``````

Average price for each drive-wheels category:

``````df_group_one = df[['drive-wheels','price']].groupby(['drive-wheels'],as_index=False).mean()
df_group_one
``````

Average price for each combination of drive-wheels and body-style:

``````df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
grouped_test1
``````

Pivot the result so that drive-wheels become the rows and body-styles become the columns:

``````grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')
grouped_pivot
``````
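
A pivot like this leaves `NaN` wherever a combination of the two categories has no rows. The sketch below, on hypothetical toy data, counts those gaps and fills them with 0. Filling with 0 is an assumption made here for plotting convenience; it should not be read as a price of zero.

```python
import pandas as pd

# Hypothetical toy data: not every drive-wheels/body-style pair occurs.
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "4wd"],
    "body-style":   ["sedan", "hatchback", "sedan", "convertible", "sedan"],
    "price":        [9000.0, 8000.0, 21000.0, 24000.0, 12000.0],
})

toy_mean = toy.groupby(["drive-wheels", "body-style"], as_index=False).mean()
toy_pivot = toy_mean.pivot(index="drive-wheels", columns="body-style")

# Count the cells that have no data, then fill them for display.
n_missing = int(toy_pivot.isna().sum().sum())
toy_pivot_filled = toy_pivot.fillna(0)
print(n_missing)
print(toy_pivot_filled)
```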

### HEATMAP

``````plt.pcolor(grouped_pivot, cmap='RdBu')    # use the grouped results
plt.colorbar()
plt.show()
``````

Now with the correct axis values:

``````fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')
x_labels = grouped_pivot.columns.levels[1]    #body-style names (pivot columns, x axis)
y_labels = grouped_pivot.index    #drive-wheels names (pivot rows, y axis)
#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)
ax.set_xticklabels(x_labels, minor=False)     #insert labels
ax.set_yticklabels(y_labels, minor=False)     #insert labels
plt.xticks(rotation=90)    #rotate label if too long
fig.colorbar(im)
plt.show()
``````

### P-VALUE, PEARSON CORRELATION COEFFICIENT AND ANOVA

By convention, when the p-value is:

• < 0.001: there is strong evidence that the correlation is significant.
• < 0.05: there is moderate evidence that the correlation is significant.
• < 0.1: there is weak evidence that the correlation is significant.
• > 0.1: there is no evidence that the correlation is significant.
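
The convention above can be sketched as a small helper. The function name `evidence_label` is hypothetical, and treating a p-value of exactly 0.1 as "none" is an assumption, since the convention does not cover that boundary.

```python
def evidence_label(p_value):
    """Map a p-value to the evidence levels listed above."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("a p-value must lie between 0 and 1")
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"    # assumption: exactly 0.1 also counts as no evidence

for p in (6.4e-48, 0.03, 0.07, 0.5):
    print(p, evidence_label(p))
```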

``````from scipy import stats
``````

Horsepower vs price

``````pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)
``````

The Pearson Correlation Coefficient is 0.8095745670036559 with a P-value of P = 6.369057428260101e-48

Since the p-value is < 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.810, close to 1).
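
As far as I know, `stats.pearsonr` does not skip missing values on its own, so columns with gaps need cleaning first. A minimal sketch, assuming rows with a missing value in either column can simply be dropped (the toy numbers are hypothetical):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data with one gap in each column.
toy = pd.DataFrame({
    "horsepower": [111.0, 154.0, np.nan, 102.0, 115.0, 110.0],
    "price":      [13495.0, 16500.0, 13950.0, np.nan, 17450.0, 15250.0],
})

# Keep only the rows where both values are present.
clean = toy[["horsepower", "price"]].dropna()

r, p = stats.pearsonr(clean["horsepower"], clean["price"])
print("rows used:", len(clean), "r =", r, "p =", p)
```

With so few rows the p-value carries little weight. The point here is only the cleaning step before the call.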

ANOVA: Analysis of Variance
The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

• F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from that assumption, and reports it as the F-test score. A larger score means there is a larger difference between the means.
• P-value: the p-value tells us how statistically significant the calculated score is.

If our price variable is strongly correlated with the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.
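
To show where the F-test score comes from, the sketch below computes it by hand as the variance between group means divided by the variance within groups, then compares it with `stats.f_oneway`. The three toy groups and the helper name `f_statistic` are assumptions for illustration.

```python
import numpy as np
from scipy import stats

# Hypothetical toy groups, not taken from the automobile dataset.
groups = [
    np.array([7.0, 8.0, 9.0, 8.5]),
    np.array([15.0, 17.0, 16.0, 18.0]),
    np.array([10.0, 11.0, 9.5]),
]

def f_statistic(samples):
    """One-way ANOVA F: mean square between groups / mean square within groups."""
    all_values = np.concatenate(samples)
    grand_mean = all_values.mean()
    k = len(samples)          # number of groups
    n = all_values.size       # total number of observations
    ss_between = sum(s.size * (s.mean() - grand_mean) ** 2 for s in samples)
    ss_within = sum(((s - s.mean()) ** 2).sum() for s in samples)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

f_manual = f_statistic(groups)
f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_scipy)
```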

Price by drive-wheels category:

``````grouped_test2 = df_gptest[['drive-wheels', 'price']].groupby('drive-wheels')    #string key, so get_group takes plain labels
grouped_test2.get_group('4wd')['price']    #prices of one group, for inspection
f_val, p_val = stats.f_oneway(
    grouped_test2.get_group('fwd')['price'],
    grouped_test2.get_group('rwd')['price'],
    grouped_test2.get_group('4wd')['price'],
)
print( "ANOVA results: F=", f_val, ", P =", p_val)
``````

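ANOVA results: F= 67.95406500780399 , P = 3.3945443577151245e-23

The large F-test score indicates that the mean price differs considerably between the drive-wheels groups, and a p-value of almost 0 implies that the difference is statistically significant. ANOVA on its own does not say which of the groups differ from each other.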
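
One way to look at individual pairs of groups is to repeat the test two groups at a time. The sketch below does this on hypothetical toy prices. Note that repeated pairwise tests raise the chance of a false positive, and no correction for that is applied here.

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Hypothetical toy prices per drive-wheels category.
toy = pd.DataFrame({
    "drive-wheels": ["fwd"] * 4 + ["rwd"] * 4 + ["4wd"] * 3,
    "price": [7000.0, 8000.0, 9000.0, 8500.0,
              15000.0, 17000.0, 16000.0, 18000.0,
              10000.0, 11000.0, 9500.0],
})

# Collect the price series of every group under its label.
prices_by_group = {name: g["price"] for name, g in toy.groupby("drive-wheels")}

# Run a one-way ANOVA for every pair of groups.
pairwise = {}
for a, b in combinations(sorted(prices_by_group), 2):
    f_val, p_val = stats.f_oneway(prices_by_group[a], prices_by_group[b])
    pairwise[(a, b)] = (float(f_val), float(p_val))

for pair, (f_val, p_val) in pairwise.items():
    print(pair, "F =", f_val, "P =", p_val)
```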