Data Mining Concepts & Tasks - Polo Club of Data Sciencepoloclub.gatech.edu/...20140909-DataMiningConcepts.pdf · 09/09/2014 · Data Mining Concepts & Tasks Each data-driven (business,

Data Mining Concepts & Tasks

CSE6242 / CX4242Sept 9, 2014

Duen Horng (Polo) ChauGeorgia Tech

Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

Last TimeData Cleaning

• Google Refine, Data Wrangler

Data Integration• Many examples: Google

knowledge graph, Facebook Graph Search, Freebase, Feldspar, Kayak, Apple Siri, etc.

2

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

Continuing with Data Integration

Freebase(a graph of entities)"

“…a large collaborative knowledge base consisting of metadata composed mainly

by its community members…”

4

Wikipedia.

Crowd-sourcing Approaches: Freebase

5http://wiki.freebase.com/wiki/What_is_Freebase%3F

http://wiki.freebase.com/wiki/What_is_Freebase%3F

What do we need before we can even integrate datasets/tables/schemas?

6


7

You need an ID for every unique entity/item/object/thing… Easy?


8

person_id name state1 Smith GA2 Johnson NY3 Obama NY

person_id name state_id1 Smith 1112 Johnson 2223 Obama 222

state_id state_name111 GA222 NY333 CA

+

Entity Resolution (A hard problem in data integration)Polo ChauP. ChauDuen Horng ChauDuen ChauD. Chau

9

Why is Entity Resolution so Important?

D-Dupe Interactive Data Deduplication and IntegrationTVCG 2008University of MarylandBilgic, Licamele, Getoor, Kang, Shneiderman

12http://www.cs.umd.edu/projects/linqs/ddupe/ (skip to 0:55)http://linqs.cs.umd.edu/basilic/web/Publications/2008/kang:tvcg08/kang-tvcg08.pdf

http://www.cs.umd.edu/projects/linqs/ddupe/

http://linqs.cs.umd.edu/basilic/web/Publications/2008/kang:tvcg08/kang-tvcg08.pdf

Polo

Poalo

Numerous similarity functions

• Euclidean distanceEuclidean norm / L2 norm

• Manhattan distance

• Jaccard Similaritye.g., overlap of nodes’ #neighbors"

• String edit distancee.g., “Polo Chau” vs “Polo Chan”

• Many more…15

http://infolab.stanford.edu/~ullman/mmds/ch3a.pdfExcellent read:

Core components: Similarity functions

Determine how two entities are similar.D-Dupe’s approach: Attribute similarity + relational similarity

16

Similarity score for a pair of entities

17

Attribute similarity (a weighted sum)

Summary for data integrationOpportunities

• enable new services (Siri, padmapper)• enable new ways to discover info• improve existing services• reduce redundancy• new way to interactive with data• promote knowledge transfer (e.g., between

companies)18

Data Mining Concepts & TasksEach data-driven (business, decision-making) problem is unique, e.g., different goals, constraints."

Good news: many (sub)tasks that underlie these problems are common"

Here is an overview of the common tasks, based on Data Science for Business: What you need to know about data mining and data-analytic thinking"

19

Collection

Cleaning

Integration

Visualization

Analysis

Presentation

Dissemination

http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323

1. (soft) Classification, Probability Estimation (supervised learning)

Predict which of a (small) set of classes an entity belong to.Examples: Is this app malicious or benign? Will this customer click on this ad?More Examples? payment transaction -> fraudulent?news/emails -> spam?tumor -> benign?sentiment analysis -> +, -, neutralweather -> rain, storm, sunnymovies genres -> action, etc.friends -> close, acquaintance, etc.online dating -> will work out or not?surveillance system -> suspicious or not

21

2. Regression (“value estimation”)(supervised learning)

Predict the numerical value of some variable for an entity.

Example: how much minutes will this cellphone customer use?

Related to classification, but predict how much, instead of discrete decisions (e.g., yes, no)

More Examples?

stock prices

price of plane tickets

weather prediction

credit scores

time until machine fails (data center)

inventory management (supply chain)

population change (city, population planning)

sports stat (gambling)

22

3. Similarity MatchingFind similar entities (from a large dataset) based on what we know about them.Examples?Online datingrecommendation systems (similar songs, movies)image “classifier” (find all sunset images)suggestions for online shoppingmarket segmentationsuggestion of friends on facebookonline advertisement-> restaurant “classification” (italian, Chinese)search results (google “similar” results)search query matching

23

4. Clustering (unsupervised learning)Group entities together by their similarity. (User provides # of clusters)Examples?factors for diseasesmovie categories (genres; soft clustering)market segmentation for targeted advertisementsocial network analysis (whether people like the same thing)geographical data (identify “neighborhood”, popular landmarks)

24

5. Co-occurrence grouping

Find associations between entities based on transactions that involve them (e.g., bread and milk often bought together)

25

(Many names: frequent itemset mining, association rule discovery, market-basket analysis)

6. Profiling / Pattern Mining / Anomaly Detection (unsupervised)

Characterize typical behaviors of an entity (person, computer router, etc.) so you can find trends and outliers. Examples?computer instruction predictionremoving noise from experiment (data cleaning)detect anomalies in network trafficmoneyballweather anomalies (e.g., big storm)google sign-in (alert)smart security cameraembezzlementtrending articles

26

7. Link Prediction / RecommendationPredict if two entities should be connected, and how strongly that link should be.

Examples?two people on Facebookamazon (things bought together); asssociation-rule miningnetflix: recommend jim carey movierelated questions on quoratop apps on apple storecrime group detection (bad guys on social network)google search suggestions

27

8. Data reduction (“dimensionality reduction”)

Shrink a large dataset into smaller one, with as little loss of information as possibleWhen to do it? Examples? Why do it?Original data is too big -> too hard to process, or take too long2D -> 1D (many Ds -> few Ds): for visualization, for more efficient algorithmsGraph partitioning - split a large graph into smaller subgraphs

28

Start thinking about project

• What kind of datasets and problems do you want to solve?

• What techniques do you need?

• Will describe project requirements in next lecture

29

Documents

Data Mining Concepts & Tasks - Polo Club of Data Sciencepoloclub.gatech.edu/...20140909-DataMiningConcepts.pdf · 09/09/2014 · Data Mining Concepts & Tasks Each data-driven (business,