Data Science at GT

Module Two - Exploratory Data Analysis

Agenda

  • History
  • Methods
  • Tools
  • Examples
  • Group Activity
  • Announcements

In 1977, John W. Tukey published the book Exploratory Data Analysis. At the time, statisticians spent most of their effort on statistical hypothesis testing (confirmatory data analysis). Tukey advocated exploring a dataset first and using those initial insights to decide which hypotheses to test.

EDA is a set of approaches used to accomplish the following goals:

  • Maximize insight into data
  • Uncover data structures
  • Test underlying assumptions
  • Aid with preliminary selection of appropriate models

Categories of EDA:

  • Univariate non-graphical
  • Multivariate non-graphical
  • Univariate graphical
  • Multivariate graphical

Univariate vs. Multivariate

  • Univariate: looking at one variable at a time
  • Multivariate: looking at multiple variables at a time

Rules of Thumb:

  • Univariate EDA before Multivariate EDA
  • EDA should not be limited to techniques you have seen before. Sometimes, you need to develop new ways to best represent your data.

Univariate non-graphical

Categorical

Simple tabulation of category frequencies:
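
A minimal sketch of such a tabulation, using only the Python standard library. The `origins` list is made-up illustration data, not a column from the datasets used in this module. With pandas, `Series.value_counts()` should give the same counts.

```python
from collections import Counter

# Hypothetical categorical sample (illustration only)
origins = ["usa", "japan", "usa", "europe", "usa", "japan"]

# Count how often each category occurs
counts = Counter(origins)
n = len(origins)

# Print count and proportion per category, most frequent first
for category, count in counts.most_common():
    print(f"{category:<8}{count:>3}{count / n:>8.1%}")
```

Reporting proportions next to raw counts makes categories easier to compare across samples of different sizes.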

Quantitative

  • make preliminary assessments about the population distribution of independent variables
  • summary statistics

Central Tendency

Central tendency describes the typical or middle values of a distribution. Some measures of central tendency include:

  • mean
  • median
  • mode
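
As a rough illustration, the three measures can be computed with Python's built-in `statistics` module. The `values` list is a made-up sample chosen to include one large value.

```python
import statistics

# Hypothetical sample with one large value (illustration only)
values = [2, 3, 3, 4, 5, 7, 18]

mean = statistics.mean(values)      # arithmetic average, pulled upward by the 18
median = statistics.median(values)  # middle value of the sorted sample
mode = statistics.mode(values)      # most frequent value

print(mean, median, mode)
```

In this sample the mean is larger than the median, which is a common hint that the distribution is skewed to the right.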

Measures of Spread

  • variance and standard deviation
  • interquartile range
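
A short sketch of these spread measures with the standard library, again on a made-up `values` sample. Note that `statistics.variance` and `statistics.stdev` are the sample versions (dividing by n - 1), and that quartile conventions differ between tools, so the IQR can vary slightly with the method chosen.

```python
import statistics

# Hypothetical sample (illustration only)
values = [2, 3, 3, 4, 5, 7, 18]

variance = statistics.variance(values)  # sample variance (divides by n - 1)
std_dev = statistics.stdev(values)      # square root of the sample variance

# Quartiles; method="inclusive" interpolates between the observed min and max
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range: spread of the middle half of the data

print(variance, std_dev, iqr)
```

The IQR is far less sensitive to the single large value than the standard deviation is, which is why the two are often reported together.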

Skewness - measure of asymmetry

Kurtosis - measure of peakedness and tail weight relative to a Gaussian distribution

Positive Skew - values far above the mode are more common than values far below

Positive Kurtosis - relative to a Gaussian distribution with the same variance, values far from the mean are more likely (fat tails)
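
One way to make these definitions concrete is to compute both quantities from central moments. This is a sketch using the population (divide by n) form; the helper names `central_moment`, `skewness`, and `excess_kurtosis` are introduced here for illustration, and the samples are made up. Other estimators with small-sample corrections exist and give slightly different numbers.

```python
def central_moment(xs, k):
    """k-th central moment, dividing by n (population form)."""
    mu = sum(xs) / len(xs)
    return sum((x - mu) ** k for x in xs) / len(xs)

def skewness(xs):
    """Third central moment scaled by the 1.5 power of the variance."""
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    """Fourth central moment over squared variance, minus 3 so a Gaussian scores 0."""
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3.0

# Hypothetical samples (illustration only)
symmetric = [1, 2, 3, 4, 5]
right_skewed = [2, 3, 3, 4, 5, 7, 18]

print(skewness(symmetric), excess_kurtosis(symmetric))
print(skewness(right_skewed), excess_kurtosis(right_skewed))
```

The symmetric sample has zero skewness and negative excess kurtosis (thinner tails than a Gaussian), while the sample with one large value has positive skewness and positive excess kurtosis.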

Univariate graphical

  • histogram
  • boxplot
  • Q-Q plot

Boxplot

  • central tendency
  • outliers
  • symmetry and skew
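
The numbers a boxplot draws can be sketched without any plotting library. The example below assumes the common convention of flagging points more than 1.5 times the IQR beyond the quartiles as outliers; the `values` sample is made up, and plotting tools may use a slightly different quartile method.

```python
import statistics

# Hypothetical sample (illustration only)
values = [2, 3, 3, 4, 5, 7, 18]

# Box: first quartile, median, third quartile
q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1

# Fences: points beyond them are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < lower_fence or v > upper_fence]

# Whiskers extend to the most extreme points that lie inside the fences
whisker_low = min(v for v in values if v >= lower_fence)
whisker_high = max(v for v in values if v <= upper_fence)

print(whisker_low, q1, median, q3, whisker_high, outliers)
```

Here the upper whisker is longer than the lower one and the only outlier lies above the box, both of which point to right skew.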

Q-Q plot

  • detection of non-normality
  • skewness and kurtosis
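
A Q-Q plot pairs each sorted observation with the quantile a reference distribution would predict at the same rank. The sketch below computes those pairs against a normal distribution fitted to a made-up sample. The plotting positions (i - 0.5) / n are one common convention among several, chosen here as an assumption.

```python
from statistics import NormalDist, mean, stdev

# Hypothetical sample, sorted so ranks line up with quantiles
values = sorted([2, 3, 3, 4, 5, 7, 18])
n = len(values)

# Plotting positions: the probability assigned to each ordered observation
probs = [(i - 0.5) / n for i in range(1, n + 1)]

# Theoretical quantiles of a normal distribution fitted to the sample
fitted = NormalDist(mean(values), stdev(values))
theoretical = [fitted.inv_cdf(p) for p in probs]

# Each (theoretical, observed) pair is one point on the Q-Q plot
for t, s in zip(theoretical, values):
    print(f"{t:8.2f} {s:8.2f}")
```

If the data were normal, the pairs would fall close to the line y = x. In this sample the largest observation sits well above its theoretical quantile, the pattern expected from a heavy right tail.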

Multivariate non-graphical

Categorical


Cross tabulation:

After cross-tabulation
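
A minimal cross-tabulation sketch with the standard library: count each pair of categories, then lay the counts out as a table. The `records` pairs (an origin and a cylinder count) are made up for illustration. With pandas, `pandas.crosstab` builds the same kind of table from two columns.

```python
from collections import Counter

# Hypothetical paired categorical observations (illustration only)
records = [
    ("usa", 8), ("usa", 8), ("usa", 6), ("japan", 4),
    ("japan", 4), ("europe", 4), ("europe", 6), ("usa", 4),
]

# Count every (row category, column category) combination
pair_counts = Counter(records)
rows = sorted({origin for origin, _ in records})
cols = sorted({cyl for _, cyl in records})

# Print the table: one row per origin, one column per cylinder count
print(f"{'origin':<8}" + "".join(f"{c:>4}" for c in cols))
for r in rows:
    print(f"{r:<8}" + "".join(f"{pair_counts[(r, c)]:>4}" for c in cols))
```

Combinations that never occur show up as zeros, which is often as informative as the large counts.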

Quantitative


variance and covariance matrices
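
As a sketch, a covariance matrix can be built by computing the sample covariance for every pair of variables; the diagonal then holds each variable's variance. The `covariance` helper and the two columns in `data` are made up for illustration. In practice `numpy.cov` or pandas `DataFrame.cov` would normally be used instead.

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length sequences (divides by n - 1)."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

# Hypothetical quantitative columns (illustration only)
data = {
    "weight": [2.0, 2.5, 3.0, 3.5, 4.0],
    "mpg": [34.0, 30.0, 25.0, 21.0, 15.0],
}
names = list(data)

# Entry [i][j] is the covariance between variable i and variable j
cov_matrix = [[covariance(data[a], data[b]) for b in names] for a in names]

for name, row in zip(names, cov_matrix):
    print(f"{name:<8}" + "".join(f"{v:>10.3f}" for v in row))
```

The negative off-diagonal entry says the two made-up variables move in opposite directions. Dividing it by the product of the two standard deviations gives the correlation, which is easier to compare across variable pairs.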

Multivariate graphical

  • scatterplots

Tools

  • Python packages
  • Excel
  • Tableau

matplotlib

  • most widely used visualization library in Python
  • can be time-consuming to develop publication-quality charts

seaborn

  • built on matplotlib
  • enhances styling options

ggplot

  • ability to layer chart elements
  • can have a steep learning curve

pygal

  • interactive and web-ready
  • graphs rendered as SVG elements
  • slow for large visualizations

Examples

  • Auto-MPG data: 9 attributes, 398 instances
  • seaborn graphs

Summary Statistics

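In [18]:
%matplotlib inline
import seaborn as sns
# Assumption: d is a 1-D numeric sample defined in an earlier cell not shown here
# Plot a simple histogram with bin size determined automatically
# (histplot replaces distplot, which is deprecated since seaborn 0.11)
ax = sns.histplot(d, kde=False, color="b")
# Remove spines after plotting so the call acts on the axes just drawn
sns.despine(left=True)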
In [21]:
# Plot a kernel density estimate and rug plot
# (kdeplot and rugplot replace distplot, which is deprecated since seaborn 0.11)
ax = sns.kdeplot(d, color="r")
sns.rugplot(d, color="r", ax=ax)
sns.despine(left=True)
In [23]:
# Plot a filled kernel density estimate
# (kdeplot with fill=True replaces distplot, which is deprecated since seaborn 0.11)
sns.kdeplot(d, fill=True, color="g")
sns.despine(left=True)
In [25]:
# Show each distribution with both violins and points
# Assumption: pal is a color palette defined in an earlier cell not shown here
sns.violinplot(data=d, palette=pal, inner="points")
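In [27]:
# Draw an annotated heatmap; fmt="d" requires integer cell values
# Assumption: flights is a pivoted table defined in an earlier cell not shown here
sns.heatmap(flights, annot=True, fmt="d", linewidths=.5)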

Group Activity

In groups of 3 to 5, download the Chronic Kidney Disease dataset and apply some EDA techniques to it. Explain your process and findings. What did you learn about the dataset and how would this impact your modeling decisions?

Announcements

  • Project team updates
  • Blog posts