Data Science at GT

09/27 - Finding & Cleaning Data

Agenda

  • Gathering Data
  • Cleaning Data
  • Group Activity
  • Announcements

Gathering Data

Some channels to use when looking for data include:

  • Open Data Projects
  • APIs
  • Web Scraping
  • Creating Data

What is Open Data?

  • Collections of data sources and warehouses made available to the public
  • Developed by governments, corporations, and research groups with the hope that others can extract value from them
  • Generally well documented and formatted

Government

  • US
  • UK
  • Dubai
  • Africa

Location         URL
United States    http://www.data.gov
United Kingdom   http://www.data.gov.uk
Dubai            http://www.dm.gov.ae
Africa           http://opendataforafrica.org/

Corporate

  • Lending Club: Peer-to-peer lending marketplace
  • World Bank
  • Airbnb: User pathway challenge

Corporation      URL
Lending Club     https://www.lendingclub.com/info/download-data.action
World Bank       http://data.worldbank.org/
Airbnb           http://databits.io/challenges/airbnb-user-pathways-challenge

APIs

  • Application Programming Interface
  • A set of methods that an organization exposes so others can access and build on its services
  • Data flows directly into your program
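
As a rough illustration of the last point, the sketch below shows how an API response typically arrives as JSON and becomes ordinary Python objects. The endpoint URL, the parameter names, and the sample payload are all made up for illustration and do not belong to any real service; only the standard library is used.

```python
import json
from urllib.parse import urlencode

# Hypothetical endpoint, for illustration only.
BASE_URL = "https://api.example.com/v1/prices"


def build_url(base_url, **params):
    """Attach query parameters to an endpoint URL."""
    return f"{base_url}?{urlencode(params)}"


def parse_prices(payload):
    """Convert a JSON response body into a list of (date, close) tuples."""
    body = json.loads(payload)
    return [(row["date"], float(row["close"])) for row in body["rows"]]


url = build_url(BASE_URL, symbol="OIL", limit=2)

# Placeholder response body standing in for what a real call might return
# (for example via urllib.request.urlopen(url).read()).
sample_payload = (
    '{"rows": ['
    '{"date": "2009-09-30", "close": "1141.20"}, '
    '{"date": "2009-10-01", "close": "1166.35"}'
    "]}"
)

prices = parse_prices(sample_payload)
print(url)
print(prices)
```

Real APIs differ in authentication, rate limits, and response shape, so the provider's documentation is the place to check before adapting something like this.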

Which companies provide API access?

  • Facebook
  • Twitter
  • Quandl

Access Quandl API

In [5]:
import quandl

In [6]:
# Fetch the NSE/OIL dataset (returned as a pandas DataFrame)
data = quandl.get('NSE/OIL')

In [7]:
# Preview the first five rows
data.head()

Out[7]:
            Open    High    Low     Last     Close    Total Trade Quantity  Turnover (Lacs)
Date
2009-09-30  1096.0  1156.7  1090.0  1135.00  1141.20  19748012              223877.07
2009-10-01  1102.0  1173.7  1102.0  1167.00  1166.35  3074254               35463.78
2009-10-05  1152.0  1165.9  1136.6  1143.00  1140.55  919832                10581.13
2009-10-06  1149.8  1157.2  1132.1  1143.30  1144.90  627957                7185.90
2009-10-07  1153.8  1160.7  1140.0  1141.45  1141.60  698216                8032.98

Web Scraping

  • Not all data is readily accessible
  • Scraping may work when APIs don't exist
  • Tools to look into: Python, R, Bash, PHP
  • Take note of copyright and confidentiality
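
A minimal sketch of what scraping can look like in Python, one of the tools listed above. It pulls link text and URLs out of a page using only the standard library. The HTML snippet and the `LinkCollector` class are made up for illustration; a real scraper would first fetch the page and should respect the site's terms of use.

```python
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (text, href) pairs for every <a> tag that has an href."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a link.
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append(("".join(self._text).strip(), self._href))
            self._href = None


# Made-up page standing in for HTML fetched from a site.
sample_html = """
<ul>
  <li><a href="/datasets/air-quality.csv">Air quality</a></li>
  <li><a href="/datasets/transit.csv">Transit ridership</a></li>
</ul>
"""

collector = LinkCollector()
collector.feed(sample_html)
print(collector.links)
```

Third-party libraries are often used for this kind of parsing; the standard-library parser is chosen here only to keep the sketch self-contained.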

Creating Data

  • Interviews
    • Structured vs. Unstructured data
  • Surveys
    • Scalable with various tools
    • Watch for sampling bias and question wording
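
To make the sampling-bias point concrete, here is a small simulation, assuming an invented population of students in which one subgroup studies more than the rest. All numbers are hypothetical. It compares a simple random sample with a convenience sample drawn from that subgroup only.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical population: weekly study hours for 1,000 students,
# where 200 club members study more than the other 800.
club_members = [random.gauss(20, 3) for _ in range(200)]
other_students = [random.gauss(12, 3) for _ in range(800)]
population = club_members + other_students

# Simple random sample: every student is equally likely to be surveyed.
random_sample = random.sample(population, 100)

# Convenience sample: only club members are surveyed.
convenience_sample = random.sample(club_members, 100)

print(f"population mean:         {mean(population):.1f}")
print(f"random sample mean:      {mean(random_sample):.1f}")
print(f"convenience sample mean: {mean(convenience_sample):.1f}")
```

The convenience sample overstates the average because it only reaches the subgroup that studies more, which is the kind of distortion to look for when deciding who receives a survey.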

Data Cleaning

  • OpenRefine
    • Easy to use
    • Most basic operations can be completed with the interface
  • Google Sheets
    • Clean data through Google Apps Script
    • Slow, but intuitive if you are familiar with JavaScript
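
The same basic operations can also be scripted. The sketch below uses Python rather than OpenRefine or Google Sheets, and runs on a made-up messy CSV: it trims whitespace, normalizes capitalization, converts ages to integers, marks missing values explicitly, and drops duplicate rows. The column names and the `clean_rows` helper are hypothetical.

```python
import csv
import io

# Made-up messy data standing in for a downloaded file.
raw_csv = """name,city,age
 Alice ,atlanta,29
Bob,ATLANTA,
alice,Atlanta,29
Carol, Decatur ,41
"""


def clean_rows(rows):
    """Trim whitespace, normalize case, convert types, and drop duplicates."""
    seen = set()
    cleaned = []
    for row in rows:
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        age_text = row["age"].strip()
        age = int(age_text) if age_text else None  # keep missing values explicit
        key = (name, city, age)
        if key in seen:  # skip rows that are identical after normalization
            continue
        seen.add(key)
        cleaned.append({"name": name, "city": city, "age": age})
    return cleaned


cleaned = clean_rows(csv.DictReader(io.StringIO(raw_csv)))
for row in cleaned:
    print(row)
```

Treating two rows as duplicates once they match after normalization is a judgment call; on real data it is worth checking that such rows really describe the same record.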

Group Activity

In groups of 3 to 5, look for a dataset and identify some strategies to extract value from it. Discuss your reasoning with the group.

Announcements

Blog Posts

  • Groups of 1 to 3
  • Short summaries