Data Science at GT

Module Two - Exploratory Data Analysis

Agenda

History
Methods
Tools
Examples
Group Activity
Announcements

In 1977, John W. Tukey published the book Exploratory Data Analysis. At the time, statisticians spent most of their effort on statistical hypothesis testing (confirmatory data analysis). Tukey advocated exploring a dataset first and using those initial insights to decide which hypotheses to test.

EDA is a set of approaches used to accomplish the following goals:

Maximize insight into data
Uncover data structures
Test underlying assumptions
Aid with preliminary selection of appropriate models

Categories of EDA:

Univariate non-graphical
Multivariate non-graphical
Univariate graphical
Multivariate graphical

Univariate vs. Multivariate

Univariate: looking at one variable at a time
Multivariate: looking at multiple variables at a time

Rules of Thumb:

Univariate EDA before Multivariate EDA
EDA should not be limited to techniques you have seen before. Sometimes, you need to develop new ways to best represent your data.

Univariate non-graphical

Categorical

Simple tabulation of category frequencies:
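
A frequency tabulation can be sketched with the Python standard library. The column name and its values below are made up for illustration and are not taken from the Auto-MPG data.

```python
from collections import Counter

# Hypothetical categorical column; the values are invented for illustration.
origin = ["usa", "japan", "usa", "europe", "usa", "japan"]

counts = Counter(origin)
total = sum(counts.values())

# One entry per category: (absolute frequency, relative frequency),
# ordered from most to least common.
freq_table = {cat: (n, n / total) for cat, n in counts.most_common()}

for cat, (n, share) in freq_table.items():
    print(f"{cat:<8}{n:>3}{share:>8.1%}")
```

With a pandas Series, value_counts() would give a similar table; the standard-library version is used here only to keep the sketch dependency-free.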

Quantitative

make preliminary assessments about the population distribution of independent variables
summary statistics

Central Tendency

Central tendency describes the typical or middle value of a variable. Common measures include:

mean
median
mode
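
As a minimal sketch, the three measures can be compared on a small made-up sample that contains one unusually large value. The data are illustrative only.

```python
import statistics

# Made-up sample; the single large value (40) shows how the measures differ.
values = [2, 3, 3, 4, 5, 5, 5, 40]

mean = statistics.mean(values)      # arithmetic average, pulled upward by the 40
median = statistics.median(values)  # middle of the sorted values, robust to the 40
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```

The mean lands well above most of the observations, while the median and mode stay near the bulk of the data, which is one reason to look at more than one measure.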

Measures of Spread

variance and standard deviation
interquartile range
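
The spread measures above could be computed as follows. This is a sketch on made-up data; quartile conventions differ between tools, and the "inclusive" method used here is an assumption, not necessarily the convention used elsewhere in this module.

```python
import statistics

# Made-up sample for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

variance = statistics.variance(values)  # sample variance (n - 1 denominator)
std_dev = statistics.stdev(values)      # square root of the sample variance

# quantiles(n=4) returns the three quartile cut points Q1, Q2, Q3.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1                           # interquartile range

print(variance, std_dev, iqr)
```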

Skewness - a measure of asymmetry

Kurtosis - a measure of peakedness relative to a Gaussian distribution

Positive Skew - values far above the mode are more common than values far below

Positive Kurtosis - relative to a Gaussian distribution with the same variance, values far from the mean are more likely (fat tails)
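
One way to make these definitions concrete is to compute both quantities from central moments. The helper name moments_shape is hypothetical, and the sketch uses the simple population-moment formulas; statistical packages often apply small-sample corrections, so their numbers can differ slightly.

```python
def moments_shape(values):
    """Return (skewness, excess kurtosis) from population central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # variance (population)
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3  # subtract 3 so a Gaussian scores 0
    return skewness, excess_kurtosis


# Made-up samples: one symmetric, one with a long right tail.
symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 2, 2, 3, 20]

skew_sym, kurt_sym = moments_shape(symmetric)
skew_right, kurt_right = moments_shape(right_tailed)

print(skew_sym, kurt_sym)
print(skew_right, kurt_right)
```

The symmetric sample has zero skewness, while the long right tail produces a positive value, matching the description of positive skew above.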

Univariate graphical

histogram
boxplot
qq-plot
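
A histogram is a count of observations per bin. The sketch below shows the binning step with equal-width bins and prints a text version of the chart. The function name histogram_counts is hypothetical, the data are made up, and the code assumes the values are not all identical.

```python
def histogram_counts(values, bins):
    """Count observations in equal-width bins spanning min..max."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes hi > lo
    counts = [0] * bins
    for x in values:
        # The maximum value is placed in the last bin rather than a new one.
        idx = min(int((x - lo) / width), bins - 1)
        counts[idx] += 1
    return counts


# Made-up sample with one value far to the right.
values = [1, 2, 2, 3, 3, 3, 4, 4, 5, 9]
counts = histogram_counts(values, bins=4)

# Text rendering: one row per bin, one '#' per observation.
for i, c in enumerate(counts):
    print(f"bin {i}: {'#' * c}")
```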

Boxplot

central tendency
outliers
symmetry and skew
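
The numbers a boxplot draws can be computed directly. This sketch assumes the common convention of flagging points more than 1.5 times the IQR beyond the quartiles as outliers; the helper name boxplot_summary and the data are illustrative.

```python
import statistics


def boxplot_summary(values):
    """Quartiles plus outliers under the 1.5 * IQR rule (an assumed convention)."""
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - 1.5 * iqr
    upper_fence = q3 + 1.5 * iqr
    outliers = [x for x in values if x < lower_fence or x > upper_fence]
    return {"q1": q1, "median": median, "q3": q3, "outliers": outliers}


# Made-up sample with one extreme value.
values = [2, 4, 4, 4, 5, 5, 7, 30]
summary = boxplot_summary(values)
print(summary)
```

Comparing the distance from the median to each quartile gives a rough read on symmetry: a much longer upper half suggests positive skew.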

qq plot

detection of non-normality
skewness and kurtosis
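
A normal qq plot pairs each sorted observation with the quantile a standard normal distribution would produce at the same rank. The sketch below computes those pairs; the function name normal_qq_pairs is hypothetical, and the plotting positions (i - 0.5) / n are one of several conventions in use.

```python
from statistics import NormalDist


def normal_qq_pairs(values):
    """Pair each sorted observation with the matching standard normal quantile."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()  # mean 0, standard deviation 1
    # Plotting positions (i - 0.5) / n stay strictly inside (0, 1).
    theoretical = [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]
    return list(zip(theoretical, ordered))


# Made-up sample with a stretched upper tail.
sample = [4.1, 5.0, 5.2, 5.9, 6.3, 7.0, 9.8]
pairs = normal_qq_pairs(sample)

for theo, obs in pairs:
    print(f"{theo:+.3f}  {obs}")
```

If the sample were normal, plotting the pairs would give points close to a straight line; systematic curvature at the ends points to skewness or heavy tails.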

Multivariate non-graphical

Categorical

Cross tabulation:

After cross-tabulation
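
A cross tabulation counts how often each combination of two categorical variables occurs. The sketch below uses made-up (cylinders, origin) pairs; the variable names are illustrative and the values are not taken from the Auto-MPG data.

```python
from collections import Counter

# Hypothetical records: (cylinders, origin) pairs with invented values.
records = [(4, "japan"), (8, "usa"), (4, "usa"),
           (4, "japan"), (6, "usa"), (8, "usa")]

cell_counts = Counter(records)
rows = sorted({r for r, _ in records})
cols = sorted({c for _, c in records})

# crosstab[row][col] is the joint frequency; pairs that never occur count as 0.
crosstab = {r: {c: cell_counts[(r, c)] for c in cols} for r in rows}

print("cyl", *cols)
for r in rows:
    print(r, *(crosstab[r][c] for c in cols))
```

For data already in a pandas DataFrame, pandas.crosstab produces the same kind of table.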

Quantitative

variance and covariance matrices
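
As a rough sketch, a sample covariance matrix for a handful of quantitative columns can be built as below. The function name covariance_matrix, the column names, and the values are all assumptions made for illustration.

```python
def covariance_matrix(columns):
    """Sample covariance matrix for a dict of equal-length numeric columns."""
    names = list(columns)
    n = len(columns[names[0]])
    means = {name: sum(vals) / n for name, vals in columns.items()}

    def pair_cov(a, b):
        # Sum of products of deviations, divided by n - 1 (sample estimate).
        return sum(
            (x - means[a]) * (y - means[b])
            for x, y in zip(columns[a], columns[b])
        ) / (n - 1)

    return {a: {b: pair_cov(a, b) for b in names} for a in names}


# Made-up columns: as "weight" rises, "mpg" falls.
data = {"weight": [1.0, 2.0, 3.0, 4.0], "mpg": [8.0, 6.0, 4.0, 2.0]}
cov_matrix = covariance_matrix(data)

for name, row in cov_matrix.items():
    print(name, row)
```

The diagonal holds each variable's variance, and the negative off-diagonal entry reflects that the two made-up columns move in opposite directions.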

Multivariate graphical

scatterplots

Tools

Python packages
Excel
Tableau

matplotlib

most widely used visualization library in Python
can be time consuming to develop publication quality charts

seaborn

built on matplotlib
enhances styling options

ggplot

ability to layer chart elements
can have steep learning curve

pygal

interactive and web ready
graphs rendered as SVG elements
slow for large visualizations

Examples

Auto-MPG data: 9 attributes, 398 instances
seaborn graphs

Summary Statistics

In [18]:

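%matplotlib inline
import seaborn as sns

# Plot a simple histogram with the bin size determined automatically.
# d is assumed to be a 1-D numeric array defined in an earlier cell.
ax = sns.histplot(x=d, color="b")  # histplot replaces the deprecated distplot
sns.despine(left=True)  # called after plotting so it affects the new axes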

In [21]:

%matplotlib inline
# Plot a kernel density estimate and rug plot
# (kdeplot and rugplot replace the deprecated distplot)
sns.kdeplot(x=d, color="r")
sns.rugplot(x=d, color="r")
sns.despine(left=True)

In [23]:

%matplotlib inline
# Plot a filled kernel density estimate
# (kdeplot with fill=True replaces the deprecated distplot call)
sns.kdeplot(x=d, fill=True, color="g")
sns.despine(left=True)

In [25]:

# Show each distribution with both violins and points
sns.violinplot(data=d, palette=pal, inner="points")

In [27]:

sns.heatmap(flights, annot=True, fmt="d", linewidths=.5)

Group Activity

In groups of 3 to 5, download the Chronic Kidney Disease dataset and apply some EDA techniques to it. Explain your process and findings. What did you learn about the dataset and how would this impact your modeling decisions?

Announcements

Project team updates
Blog posts