Upload
others
View
6
Download
1
Embed Size (px)
Citation preview
Data Mining Concepts & Tasks
CSE6242 / CX4242Sept 9, 2014
Duen Horng (Polo) ChauGeorgia Tech
Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos
Last TimeData Cleaning
• Google Refine, Data Wrangler
Data Integration• Many examples: Google
knowledge graph, Facebook Graph Search, Freebase, Feldspar, Kayak, Apple Siri, etc.
2
Collection
Cleaning
Integration
Visualization
Analysis
Presentation
Dissemination
Continuing with Data Integration
Freebase(a graph of entities)"
“…a large collaborative knowledge base consisting of metadata composed mainly
by its community members…”
4
Wikipedia.
Crowd-sourcing Approaches: Freebase
5http://wiki.freebase.com/wiki/What_is_Freebase%3F
What do we need before we can even integrate datasets/tables/schemas?
6
What do we need before we can even integrate datasets/tables/schemas?
7
You need an ID for every unique entity/item/object/thing… Easy?
What do we need before we can even integrate datasets/tables/schemas?
8
person_id name state1 Smith GA2 Johnson NY3 Obama NY
person_id name state_id1 Smith 1112 Johnson 2223 Obama 222
state_id state_name111 GA222 NY333 CA
+
Entity Resolution (A hard problem in data integration)Polo ChauP. ChauDuen Horng ChauDuen ChauD. Chau
9
Why is Entity Resolution so Important?
D-Dupe Interactive Data Deduplication and IntegrationTVCG 2008University of MarylandBilgic, Licamele, Getoor, Kang, Shneiderman
12http://www.cs.umd.edu/projects/linqs/ddupe/ (skip to 0:55)http://linqs.cs.umd.edu/basilic/web/Publications/2008/kang:tvcg08/kang-tvcg08.pdf
Polo
Poalo
Numerous similarity functions
• Euclidean distanceEuclidean norm / L2 norm
• Manhattan distance
• Jaccard Similaritye.g., overlap of nodes’ #neighbors"
• String edit distancee.g., “Polo Chau” vs “Polo Chan”
• Many more…15
http://infolab.stanford.edu/~ullman/mmds/ch3a.pdfExcellent read:
Core components: Similarity functions
Determine how two entities are similar.D-Dupe’s approach: Attribute similarity + relational similarity
16
Similarity score for a pair of entities
17
Attribute similarity (a weighted sum)
Summary for data integrationOpportunities
• enable new services (Siri, padmapper)• enable new ways to discover info• improve existing services• reduce redundancy• new way to interactive with data• promote knowledge transfer (e.g., between
companies)18
Data Mining Concepts & TasksEach data-driven (business, decision-making) problem is unique, e.g., different goals, constraints."
Good news: many (sub)tasks that underlie these problems are common"
Here is an overview of the common tasks, based on Data Science for Business: What you need to know about data mining and data-analytic thinking"
19
Collection
Cleaning
Integration
Visualization
Analysis
Presentation
Dissemination
http://www.amazon.com/Data-Science-Business-data-analytic-thinking/dp/1449361323
1. (soft) Classification, Probability Estimation (supervised learning)
Predict which of a (small) set of classes an entity belong to.Examples: Is this app malicious or benign? Will this customer click on this ad?More Examples? payment transaction -> fraudulent?news/emails -> spam?tumor -> benign?sentiment analysis -> +, -, neutralweather -> rain, storm, sunnymovies genres -> action, etc.friends -> close, acquaintance, etc.online dating -> will work out or not?surveillance system -> suspicious or not
21
2. Regression (“value estimation”)(supervised learning)
Predict the numerical value of some variable for an entity.
Example: how much minutes will this cellphone customer use?
Related to classification, but predict how much, instead of discrete decisions (e.g., yes, no)
More Examples?
stock prices
price of plane tickets
weather prediction
credit scores
time until machine fails (data center)
inventory management (supply chain)
population change (city, population planning)
sports stat (gambling)
22
3. Similarity MatchingFind similar entities (from a large dataset) based on what we know about them.Examples?Online datingrecommendation systems (similar songs, movies)image “classifier” (find all sunset images)suggestions for online shoppingmarket segmentationsuggestion of friends on facebookonline advertisement-> restaurant “classification” (italian, Chinese)search results (google “similar” results)search query matching
23
4. Clustering (unsupervised learning)Group entities together by their similarity. (User provides # of clusters)Examples?factors for diseasesmovie categories (genres; soft clustering)market segmentation for targeted advertisementsocial network analysis (whether people like the same thing)geographical data (identify “neighborhood”, popular landmarks)
24
5. Co-occurrence grouping
Find associations between entities based on transactions that involve them (e.g., bread and milk often bought together)
25
(Many names: frequent itemset mining, association rule discovery, market-basket analysis)
6. Profiling / Pattern Mining / Anomaly Detection (unsupervised)
Characterize typical behaviors of an entity (person, computer router, etc.) so you can find trends and outliers. Examples?computer instruction predictionremoving noise from experiment (data cleaning)detect anomalies in network trafficmoneyballweather anomalies (e.g., big storm)google sign-in (alert)smart security cameraembezzlementtrending articles
26
7. Link Prediction / RecommendationPredict if two entities should be connected, and how strongly that link should be.
Examples?two people on Facebookamazon (things bought together); asssociation-rule miningnetflix: recommend jim carey movierelated questions on quoratop apps on apple storecrime group detection (bad guys on social network)google search suggestions
27
8. Data reduction (“dimensionality reduction”)
Shrink a large dataset into smaller one, with as little loss of information as possibleWhen to do it? Examples? Why do it?Original data is too big -> too hard to process, or take too long2D -> 1D (many Ds -> few Ds): for visualization, for more efficient algorithmsGraph partitioning - split a large graph into smaller subgraphs
28
Start thinking about project
• What kind of datasets and problems do you want to solve?
• What techniques do you need?
• Will describe project requirements in next lecture
29