44
Big Data Analytics Building Blocks; Simple Data Storage (SQLite) CSE6242 / CX4242 Jan 9, 2014 Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

  • Upload
    others

  • View
    11

  • Download
    1

Embed Size (px)

Citation preview

Page 1: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Big Data Analytics Building Blocks;Simple Data Storage (SQLite)

CSE6242 / CX4242Jan 9, 2014

Duen Horng (Polo) ChauGeorgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

Page 2: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

�2

What is Data & Visual Analytics?

Page 3: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

�2

What is Data & Visual Analytics?

No formal definition!

Page 4: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

�2

Polo’s definition: the interdisciplinary science of combining computation techniques and interactive visualization to transform and model data to aid discovery, decision making, etc.

What is Data & Visual Analytics?

No formal definition!

Page 5: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

�3

What are the “ingredients”?

Page 6: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

�3

What are the “ingredients”?

Need to worry (a lot) about storage, complex system design, scalability of algorithms, visualization techniques, etc.##

Used to be “simpler” before the big data era (why?)

Page 7: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

What is big data? Why care?• Many companies’ businesses are based on big data (Google, Facebook, Amazon,

Apple, Symantec, LinkedIn, and many more)#

• Web search"

• Rank webpages (PageRank algorithm)#

• Predict what you’re going to type#

• Advertisement (e.g., on Facebook)#

• Infer users’ interest; show relevant ads#

• Infer what you like, based on what your friends like#

• Recommendation systems (e.g., Netflix, Pandora, Amazon)#

• Online education#

• Health IT: patient records (EMR)#

• Bio and Chemical modeling: #

• Finance #

• Cybersecruity

Page 8: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Good news! Many big data jobs

• What jobs are hot? #

• “Data scientist”"

• Emphasize breadth of knowledge#

• This course helps you learn some of the skills

Page 9: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Big data analytics process and building blocks

Page 10: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Page 11: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Process, not “steps”• Can skip some"

• Can go back (two-way street)"

• Examples#

• Data types inform visualization design#

• Data informs choice of algorithms#

• Visualization informs data cleaning (dirty data)#

• Visualization informs algorithm design (user finds that results don’t make sense)

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Page 12: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

How big data affects the process?• The 4V of big data#

• Volume: “billions”, “petabytes” are common#

• Velocity: think Twitter, fraud detection, etc.#

• Variety: text (webpages), video (e.g., youtube), etc.#

• Veracity: uncertainty of data

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Disseminationhttp://www.ibmbigdatahub.com/infographic/four-vs-big-data

Page 13: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Schedule

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Page 14: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Two example analytics processes

Page 15: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

NetProbe: Fraud Detection in Online Auction#

WWW 2007 NetProbe

http://www.cs.cmu.edu/~dchau/papers/p201-pandit.pdf

Page 16: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Find bad sellers (fraudsters) on eBay who don’t deliver their items

NetProbe: The Problem

Buyer

$$$

Seller

!13

Auction fraud is #3 online crime in 2010source: www.ic3.gov

Page 17: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

!14

Page 18: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

NetProbe: Key Ideas§ Fraudsters fabricate their reputation by

“trading” with their accomplices#§ Fake transactions form near bipartite cores#§ How to detect them?

!15

Page 19: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

NetProbe: Key IdeasUse Belief Propagation

!16

F A HFraudster

AccompliceHonest

Darker means more likely

Page 20: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

NetProbe: Main Results

!17

Page 21: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

!18

Page 22: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

!18

Page 23: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

!18

“Belgian Police”

Page 24: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

!19

Page 25: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

What analytics process does NetProbe go through?

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Scraping

Design detection algorithm

Not released

Paper, talks, lectures

Page 26: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Discovr movie app

Page 27: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage
Page 28: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

What analytics process would you go through to build the app?Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

IMDB, Rotten tomatoes, youtube

May have duplicate trailers

Determine which movies are related

Mac app, iOS app

Page 29: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Homework 1 (out next Tue)

• Simple “End-to-end” analysis

• Collect data from Rotten Tomatoes (using API)

• Movies (Actors, directors, related movies, etc.)

• Store in SQLite database

• Transform data to movie-movie network

• Analyze, using SQL queries (e.g., create graph’s degree distribution)

• Visualize, using Gephi

• Describe your discoveries

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Page 30: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Data Collection, Simple Storage (SQLite) & Cleaning

Page 31: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Today: Data Collection, Simple Storage (SQLite) & Cleaning

•How to get data?#• Download (where?)#• API#• Scrape/Crawl,

or from equipment (e.g., sensors) High effort

Low effort

�26

Page 32: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Data you can just download•Yahoo Finance (csv)#

•StackOverflow (xml)#

•Yahoo Music (KDD cup)#

•Atlanta crime data (csv)#

•Soccer statistics

�27

Page 33: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Data via API•CrunchBase (database about companies) - JSON#

•Twitter#

•Last.fm (Pandora has API?)#

•Flickr#

•Facebook#

•Rotten Tomatoes#

•iTunes�28

Page 34: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Data that needs scraping•Amazon (reviews, product info)#

•ESPN#

•Google Scholar#

•(eBay?)

�29

Page 35: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

•Most popular embedded database in the world#• iPhone (iOS), Android, Chrome (browsers), Mac, etc.#

•Self-contained: one file contains data + schema#

•Serverless: database right on your computer#

•Zero-configuration: no need to set up!##

http://www.sqlite.org#http://www.sqlite.org/different.html �30

Page 36: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

How does it work?•>sqlite3 database.db!

#

•sqlite> create table student(ssn integer, name text);!

•sqlite> .schema!

•CREATE TABLE student(ssn integer, name text);

ssn name

�31

Page 37: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

How does it work?

•insert into student values(111, "Smith");!

•insert into student values(222, "Johnson");!

•insert into student values(333, "Obama");!

•select * from student;!

ssn name111 Smith222 Johnson333 Obama

�32

Page 38: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

How does it work?•create table takes (ssn integer, course_id integer, grade integer);

ssn course_id grade

�33

Page 39: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

How does it work?

•More than one tables - joins#

•E.g., create roster for this course

ssn course_id grade111 6242 100222 6242 90222 4000 80

ssn name111 Smith222 Johnson333 Obama

�34

Page 40: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

How does it work?•select name from student, takes where student.ssn = takes.ssn and takes.course_id = 6242;!

ssn course_id grade111 6242 100222 6242 90222 4000 80

ssn name111 Smith222 Johnson333 Obama

�35

Page 41: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

SQL General Form

•select a1, a2, ... an from t1, t2, ... tm where predicate [order by ....] [group by ...] [having ...]!

�36

Page 42: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Find ssn and GPA for each student

•select ssn, avg(grade) from takes group by ssn;

ssn course_id grade111 6242 100222 6242 90222 4000 80

ssn avg(grade)111 100222 85

�37

Page 43: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

What if slow?•Build an index to speed things up.SQLite implements B-tree.Speed improves from O(N) if to do a sequential scan to O(logN) .#

•create index student_ssn_index on student(ssn);

�38

Page 44: Big Data Analytics Building Blocks; Simple Data Storage ...poloclub.gatech.edu/cse6242/2014spring/lectures/CSE6242-2014010… · Big Data Analytics Building Blocks; Simple Data Storage

Homework 1Write simple scripts to import Rotten Tomatoes data into SQLite, and do some simple queries.

http://developer.rottentomatoes.com/docs/read/json/v10/Movie_Info�39