Basic Exploratory Data Analysis with Python

GOAL: PREDICT THE PRICE

import pandas as pd
import numpy as np

path='https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/DA0101EN/automobileEDA.csv'
#(path='C:/Users/figinim/Documents/Studies/IBM/Resources/automobileEDA.csv')

df = pd.read_csv(path)

Print first rows and some info:

df.head()
! pip install seaborn
import matplotlib.pyplot as plt
import seaborn as sns

List the data types for each column:

print(df.dtypes)

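Correlation between the numeric variables

df.corr(numeric_only=True)    #skip text columns (needed in pandas 2.0+, where they are no longer dropped automatically)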
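
Depending on the pandas version, corr() may raise an error when the frame still contains text columns. One way around this, sketched here on made-up toy values rather than the real dataset, is to select the numeric columns first:

```python
import pandas as pd

# Toy data, made up for illustration (not taken from automobileEDA.csv)
toy = pd.DataFrame({
    "make": ["audi", "bmw", "honda", "volvo"],   # text column
    "engine-size": [109, 164, 92, 141],
    "price": [13950, 24565, 7295, 16845],
})

# Keep only the numeric columns, then compute pairwise Pearson correlations
numeric_corr = toy.select_dtypes(include="number").corr()
print(numeric_corr)
```

The text column "make" simply does not appear in the resulting matrix.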

Correlation between four specific variables:

df[['bore','stroke' ,'compression-ratio','horsepower']].corr()

Scatterplot of "engine-size" and "price":

sns.regplot(x="engine-size", y="price", data=df)

Correlation between the two:

df[["engine-size", "price"]].corr()

Scatterplot and correlation between highway-mpg and price

sns.regplot(x="highway-mpg", y="price", data=df)
df[['highway-mpg', 'price']].corr()

Boxplots for price for each body-style

sns.boxplot(x="body-style", y="price", data=df)

Boxplots for price for each engine-location

sns.boxplot(x="engine-location", y="price", data=df)
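
A boxplot draws the quartiles of each group, so the same information can also be read as numbers. The sketch below does this with groupby and quantile on a small made-up frame (the prices are illustrative, not from the dataset):

```python
import pandas as pd

# Toy data, made up for illustration
toy = pd.DataFrame({
    "engine-location": ["front", "front", "front", "rear", "rear", "rear"],
    "price": [10000, 12000, 20000, 8000, 9000, 13000],
})

# Quartiles of price within each engine-location group
quartiles = (
    toy.groupby("engine-location")["price"]
       .quantile([0.25, 0.5, 0.75])
       .unstack()
)
quartiles.columns = ["q1", "median", "q3"]

# Interquartile range: the height of the box in the boxplot
quartiles["iqr"] = quartiles["q3"] - quartiles["q1"]
print(quartiles)
```

On the real data, replacing toy with df should give the values behind each box.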

Basic summary statistics

df.describe()    #only numeric fields
df.describe(include=['object'])    #with object fields

Value counts of a field

df['drive-wheels'].value_counts()
df['drive-wheels'].value_counts().to_frame()    #in table

Make a count table for engine-location

# engine-location as variable
# name the count column directly: the column produced by value_counts() is named differently across pandas versions
engine_loc_counts = df['engine-location'].value_counts().to_frame(name='value_counts')
engine_loc_counts.index.name = 'engine-location'
engine_loc_counts

GROUPING

Different drive-wheels

df['drive-wheels'].unique()

Price for drive-wheels

df_group_one = df[['drive-wheels','price']].groupby(['drive-wheels'],as_index=False).mean()
df_group_one

Price for drive-wheels and body-style

df_gptest = df[['drive-wheels','body-style','price']]
grouped_test1 = df_gptest.groupby(['drive-wheels','body-style'],as_index=False).mean()
grouped_test1

Pivot for better reading

grouped_pivot = grouped_test1.pivot(index='drive-wheels',columns='body-style')
grouped_pivot
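
Not every drive-wheels and body-style combination has to exist in the data, and a missing combination shows up as NaN in the pivot. A minimal sketch on made-up toy values, assuming 0 is an acceptable placeholder for plotting (it is not a real price):

```python
import pandas as pd

# Toy data, made up for illustration: there is no 4wd wagon
toy = pd.DataFrame({
    "drive-wheels": ["fwd", "fwd", "rwd", "rwd", "4wd"],
    "body-style": ["sedan", "wagon", "sedan", "wagon", "sedan"],
    "price": [9000.0, 10000.0, 20000.0, 17000.0, 12000.0],
})

grouped = toy.groupby(["drive-wheels", "body-style"], as_index=False).mean()
pivot = grouped.pivot(index="drive-wheels", columns="body-style", values="price")
print(pivot)    # the (4wd, wagon) cell is NaN

# Assumption: fill missing cells with 0 so a heatmap has a value for every cell
pivot_filled = pivot.fillna(0)
print(pivot_filled)
```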

HEATMAP

plt.pcolor(grouped_pivot, cmap='RdBu')    # use the grouped results
plt.colorbar()
plt.show()

Now with the correct axis values:

fig, ax = plt.subplots()
im = ax.pcolor(grouped_pivot, cmap='RdBu')
x_labels = grouped_pivot.columns.levels[1]     #body-style names (pivot columns) for the x axis
y_labels = grouped_pivot.index    #drive-wheels names (pivot rows) for the y axis
#move ticks and labels to the center
ax.set_xticks(np.arange(grouped_pivot.shape[1]) + 0.5, minor=False)
ax.set_yticks(np.arange(grouped_pivot.shape[0]) + 0.5, minor=False)
ax.set_xticklabels(x_labels, minor=False)     #insert labels
ax.set_yticklabels(y_labels, minor=False)     #insert labels
plt.xticks(rotation=90)    #rotate label if too long
fig.colorbar(im)
plt.show()

P-VALUE, PEARSON CORRELATION COEFFICIENT AND ANOVA

By convention, when the
p-value is < 0.001: we say there is strong evidence that the correlation is significant.
p-value is < 0.05: there is moderate evidence that the correlation is significant.
p-value is < 0.1: there is weak evidence that the correlation is significant.
p-value is > 0.1: there is no evidence that the correlation is significant.
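
The thresholds above can be sketched as a small helper. The function name evidence_label is hypothetical and only illustrates the rule of thumb; a p-value that falls exactly on a boundary goes to the weaker category here:

```python
def evidence_label(p_value):
    """Map a p-value to the rule-of-thumb wording listed above."""
    if p_value < 0.001:
        return "strong"
    if p_value < 0.05:
        return "moderate"
    if p_value < 0.1:
        return "weak"
    return "none"


for p in (1e-10, 0.01, 0.07, 0.5):
    print(p, "->", evidence_label(p), "evidence")
```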

from scipy import stats

Horsepower vs price

pearson_coef, p_value = stats.pearsonr(df['horsepower'], df['price'])
print("The Pearson Correlation Coefficient is", pearson_coef, " with a P-value of P = ", p_value)

The Pearson Correlation Coefficient is 0.8095745670036559 with a P-value of P = 6.369057428260101e-48
Since the p-value is < 0.001, the correlation between horsepower and price is statistically significant, and the linear relationship is quite strong (~0.81, close to 1).
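
To see what the coefficient measures, it can be computed by hand as the covariance of the two variables divided by the product of their spreads, and compared with stats.pearsonr. A sketch on a few made-up horsepower and price pairs (not the real dataset):

```python
import numpy as np
from scipy import stats

# Toy values, made up for illustration
x = np.array([48.0, 70.0, 101.0, 111.0, 154.0])              # horsepower
y = np.array([5151.0, 7295.0, 16430.0, 13950.0, 16500.0])    # price

# Center both variables on their means
x_c = x - x.mean()
y_c = y - y.mean()

# Pearson r = sum(x_c * y_c) / sqrt(sum(x_c^2) * sum(y_c^2))
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

r_scipy, p_value = stats.pearsonr(x, y)
print(r_manual, r_scipy, p_value)
```

Both values of r should agree up to floating-point error.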

ANOVA: Analysis of Variance
The Analysis of Variance (ANOVA) is a statistical method used to test whether there are significant differences between the means of two or more groups. ANOVA returns two parameters:

  • F-test score: ANOVA assumes the means of all groups are the same, calculates how much the actual means deviate from that assumption, and reports the result as the F-test score. A larger score means a larger difference between the means.
  • P-value: the p-value tells how statistically significant the calculated F-test score is.

If price differs strongly between the groups of the variable we are analyzing, expect ANOVA to return a sizeable F-test score and a small p-value.
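
The F-test score described above is the ratio of the variation between the group means to the variation within the groups. A minimal sketch that computes it by hand on three made-up price groups and checks it against stats.f_oneway:

```python
import numpy as np
from scipy import stats

# Toy price groups, made up for illustration
groups = [
    np.array([9000.0, 9500.0, 10500.0]),
    np.array([19000.0, 21000.0, 20000.0]),
    np.array([11000.0, 12500.0, 12000.0]),
]

all_values = np.concatenate(groups)
grand_mean = all_values.mean()
k = len(groups)        # number of groups
n = all_values.size    # total number of observations

# Variation of the group means around the grand mean
ss_between = sum(g.size * (g.mean() - grand_mean) ** 2 for g in groups)
# Variation of the observations around their own group mean
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)

# F = mean square between / mean square within
f_manual = (ss_between / (k - 1)) / (ss_within / (n - k))

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f_manual, f_scipy, p_scipy)
```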

Price by drive-wheels

grouped_test2 = df_gptest[['drive-wheels', 'price']].groupby('drive-wheels')    #single key (not a list), so get_group('4wd') takes a plain string
grouped_test2.get_group('4wd')['price']
f_val, p_val = stats.f_oneway(grouped_test2.get_group('fwd')['price'], grouped_test2.get_group('rwd')['price'], grouped_test2.get_group('4wd')['price'])
print( "ANOVA results: F=", f_val, ", P =", p_val)

ANOVA results: F= 67.95406500780399 , P = 3.3945443577151245e-23
The large F-test score and a p-value of almost 0 indicate that the mean price differs significantly between the drive-wheels groups. ANOVA alone does not say which of the groups differ from each other.
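
One way to look further is to run the same test on each pair of groups. The sketch below does this on made-up prices, not the real dataset; a dedicated post-hoc test with a correction for multiple comparisons would be more rigorous:

```python
from itertools import combinations

import pandas as pd
from scipy import stats

# Toy data, made up for illustration
toy = pd.DataFrame({
    "drive-wheels": ["fwd"] * 4 + ["rwd"] * 4 + ["4wd"] * 4,
    "price": [9000, 9500, 10500, 9200,
              19000, 21000, 20000, 22500,
              9800, 10200, 9100, 11000],
})

# One price Series per drive-wheels value
prices = {name: grp["price"] for name, grp in toy.groupby("drive-wheels")}

# ANOVA on every pair of groups
pairwise = {}
for a, b in combinations(sorted(prices), 2):
    f_val, p_val = stats.f_oneway(prices[a], prices[b])
    pairwise[(a, b)] = (f_val, p_val)
    print(a, "vs", b, ": F =", f_val, ", P =", p_val)
```

In this toy data only the pairs that include rwd come out as clearly different.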