Data & Visual Analytics CSE6242 / CX4242 Jan 7, 2014 Duen Horng (Polo) Chau Georgia Tech




Who Am I?

www.cc.gatech.edu/~dchau/


Course Staff

Instructor: Duen Horng (Polo) Chau, Assistant Professor, CSE. Office hour: Thu 3-4pm, Klaus 1324

TA: Robert Pienta, PhD student, CSE

TA: Long Tran, PhD student, CS


I Work with Large Graphs


= Large Network Data


Internet

50 Billion Web Pages

www.worldwidewebsize.com www.opte.org


Facebook

Modified from Marc_Smith, flickr

800 Million Users


Citation Network

www.scirus.com/press/html/feb_2006.html#2 Modified from well-formed.eigenfactor.org

250 Million Articles


Many More

Twitter: Who-follows-whom (500 million users)

Who-buys-what (120 million users)

Cellphone network: Who-calls-whom (100 million users)

Protein-protein interactions: 200 million possible interactions in human genome

Sources: www.selectscience.net www.phonedog.com www.mediabistro.com www.practicalecommerce.com/


Large Graphs I Analyzed

DATA → INSIGHTS

Graph                        Nodes        Edges
YahooWeb                     1.4 Billion  6 Billion
Symantec Machine-File Graph  1 Billion    37 Billion
Twitter                      104 Million  3.7 Billion
Phone call network           30 Million   260 Million
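
One quick way to get a feel for the scale of the graphs listed above is the average number of edges per node. The sketch below is an illustration added for that purpose and is not from the original slides: the names `GRAPHS` and `edges_per_node` are made up, and the counts are the rounded figures from the table, so the ratios are only approximate.

```python
# Rounded (nodes, edges) counts copied from the table above.
# GRAPHS and edges_per_node are illustrative names, not from the slides.
GRAPHS = {
    "YahooWeb": (1.4e9, 6e9),
    "Symantec Machine-File Graph": (1e9, 37e9),
    "Twitter": (104e6, 3.7e9),
    "Phone call network": (30e6, 260e6),
}


def edges_per_node(nodes, edges):
    """Return the average number of edges per node."""
    return edges / nodes


if __name__ == "__main__":
    for name, (nodes, edges) in GRAPHS.items():
        ratio = edges_per_node(nodes, edges)
        print(f"{name}: about {ratio:.1f} edges per node")
```

Even with only a handful of edges per node on average, graphs of this size are far beyond what can be drawn or read directly.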


7 ± 2

Number of items an average human holds in working memory

George Miller, 1956


Data

Insights


How to do that?

COMPUTATION + HUMAN INTUITION


How to do that?

COMPUTATION                                INTERACTIVE VIS
Automatic                                  User-driven; iterative
Summarization, clustering, classification  Interaction, visualization
>Millions of nodes                         Thousands of nodes

Both develop methods for making sense of network data


“Computers are incredibly fast, accurate, and stupid.

Human beings are incredibly slow, inaccurate, and brilliant.

Together they are powerful beyond imagination.”


“Essentially, all models are wrong, but some are useful”

George Box


Logistics

Course homepage: poloclub.gatech.edu/cse6242/

Discussion, Q&A, find teammates: Piazza (link on homepage, soon)

Submission: T-Square


Course Goals

• Learn scalable visual and computation techniques and tools, for typical data types

• Learn how to combine both kinds of methods (how they complement each other)

• Gain practical know-how

• Gain breadth of knowledge


Course Expectation

• Overview of scalable visual and computation techniques and tools

• Gain knowledge & experience (useful for jobs, research)

• Experience with designing and developing an interactive analysis tool


Schedule

See course homepage: poloclub.gatech.edu/cse6242/


Grading

• 3-4 homework assignments (40%)

    • End-to-end analysis

    • Techniques (computation and vis)

    • Hadoop (+ other “big data” tools)

• Group project (50%) -- 3 to 4 people

• Participation (10%) -- in class, and on Piazza
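
As a rough illustration of how the weights above could combine, here is a minimal sketch. It assumes each component is scored on a 0-100 scale and that the final score is a plain weighted sum; the slides state only the percentages, so that rule, the names `WEIGHTS` and `final_score`, and the sample scores are all hypothetical.

```python
# Component weights taken from the grading list above.
# The weighted-sum rule and all names here are assumptions for illustration.
WEIGHTS = {
    "homework": 0.40,
    "project": 0.50,
    "participation": 0.10,
}


def final_score(scores):
    """Combine per-component scores (each 0-100) into one weighted score."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing components: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


if __name__ == "__main__":
    # Hypothetical student scores, not real data.
    example = {"homework": 90, "project": 80, "participation": 100}
    print(f"Final score: {final_score(example):.1f}")
```

Because the group project carries half the weight, a 10-point change in the project score moves the total by 5 points, more than the same change in any other component.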