U Kang 1
Advanced Data Mining
Introduction
U Kang, Seoul National University
In This Lecture
  Motivation
  Overview of Topics
Outline
  Motivation
  Overview of Topics
  Conclusion
Motivation
There are many kinds of “big data”
  Graph
  Time series
  Text
  Image
  …
Main Questions
How can we find patterns and models from big data?
How can we do it in a scalable way?
What is this course about?
This course covers advanced theories, algorithms and systems for mining big data.
Topics
  Graph
  Spectral Analysis
  Large scale distributed system (e.g. MapReduce)
  Singular Value Decomposition, Tensor
  Time series, approximation, graph compression, community detection, anomaly detection
Outline
  Motivation
  Overview of Topics
    Graph
    Spectral Analysis
    MapReduce
    SVD, Tensor
    Recommendation
    Other tools
Conclusion
Graph Mining
  What does the Internet look like? What does Facebook look like?
  What is ‘normal’/‘abnormal’? Which patterns/laws hold?
  Large datasets reveal patterns and anomalies that may be invisible otherwise
MRFerocius, Social Network, 2011, https://stackoverflow.com/questions/4594962/social-network-directed-graph-library-for-net
Are real graphs random? No!
Power Law
Node (closeness) centrality
[Figure: graph with nodes A, B, and C marked]
Q: If you have to pick 1 person to advertise, who do you want to choose?
[Kang et al. SDM’10]
Spectral Graph Analysis
Solve graph problems using the theory of linear algebra
Adjacency matrix
Eigenvector
Apply the solution
Random walks on the graph (e.g., protein interaction)
Wikipedia, Schizophrenia PPI, 2016, https://en.wikipedia.org/wiki/Protein%E2%80%93protein_interaction
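As a small illustration of the adjacency matrix to eigenvector pipeline, the sketch below runs power iteration on the random-walk transition matrix of a toy graph. The graph, the function names, and the iteration count are assumptions chosen for illustration, not taken from the lecture. For a connected, non-bipartite undirected graph, the stationary distribution of the walk is the principal left eigenvector of the transition matrix, and it is proportional to node degree.

```python
def transition_matrix(n, edges):
    """Row-stochastic matrix P = D^-1 A of an undirected graph."""
    A = [[0.0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1.0
    return [[a / sum(row) for a in row] for row in A]


def stationary_distribution(P, iters=200):
    """Power iteration: repeatedly apply pi <- pi P, starting from uniform."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


# Hypothetical toy graph: a triangle (0, 1, 2) with a pendant node 3.
edges = [(0, 1), (1, 2), (0, 2), (2, 3)]
pi = stationary_distribution(transition_matrix(4, edges))

# Theory: the stationary probability of a node is degree / (2 * #edges).
degrees = [2, 2, 3, 1]
expected = [d / sum(degrees) for d in degrees]
print([round(p, 3) for p in pi])
print(expected)
```

Node 2 has the highest degree and therefore the highest stationary probability, which is the sense in which an eigenvector "ranks" nodes.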
Triangle Counting [Kang et al. PAKDD’11]
Real social networks have a lot of triangles
  Friends of friends are friends
But triangles are expensive to compute (3-way join; several approximation algorithms exist)
Q: Can we do that quickly? A: Yes!
  $\#\text{triangles} = \frac{1}{6} \sum_i \lambda_i^3$, where $\lambda_i$ are the eigenvalues of the adjacency matrix
  (and, because of skewness in eigenvalues, we only need the top few eigenvalues!)
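The eigenvalue formula can be sanity-checked on a toy graph without an eigen-solver, because the sum of cubed eigenvalues of the adjacency matrix A equals trace(A^3). The sketch below compares trace(A^3)/6 with a brute-force count. The toy graph and all function names are assumptions for illustration; this checks the exact identity and does not implement the top-few-eigenvalues approximation.

```python
from itertools import combinations


def adjacency_matrix(n, edges):
    """Dense 0/1 adjacency matrix of an undirected simple graph."""
    A = [[0] * n for _ in range(n)]
    for u, v in edges:
        A[u][v] = A[v][u] = 1
    return A


def matmul(X, Y):
    n = len(X)
    return [[sum(X[i][k] * Y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def triangles_by_trace(A):
    """trace(A^3) counts closed 3-walks; each triangle is counted 6 times."""
    A3 = matmul(matmul(A, A), A)
    return sum(A3[i][i] for i in range(len(A))) // 6


def triangles_brute_force(A):
    """Check every node triple (the expensive 3-way join)."""
    n = len(A)
    return sum(1 for i, j, k in combinations(range(n), 3)
               if A[i][j] and A[j][k] and A[i][k])


# Hypothetical toy graph: a 4-clique on nodes 0..3 plus a pendant edge 3-4.
edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3), (3, 4)]
A = adjacency_matrix(5, edges)
print(triangles_by_trace(A), triangles_brute_force(A))
```

The division by 6 accounts for the 3 possible starting nodes and 2 directions of each triangle.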
Motivation
Many big data sets
  Crawled documents
  Web request logs
  …
Many ‘large scale computations’
  Inverted index
  Graph operation
  Summaries of the number of pages crawled per host
  Most frequent queries in a given day
  …
Motivation
But developing the code is very complex:
  How to parallelize the computation?
  How to distribute the data?
  How to handle failures?
Motivation
Failures
  Assume a machine works for 3 years without failure
  What is the expected number of failed machines when operating 1 million machines?
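A back-of-the-envelope answer, assuming the question is about failures per day and that each machine fails on average once every 3 years (both are assumptions; the slide leaves them implicit):

```python
machines = 1_000_000
mean_lifetime_days = 3 * 365  # assumed: one failure per machine every 3 years

# Expected failures per day = number of machines / mean lifetime in days.
failures_per_day = machines / mean_lifetime_days
print(round(failures_per_day))  # 913
```

Under these assumptions roughly a thousand machines fail every day, so failure handling has to be part of the system rather than an exception.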
MapReduce Example: histogram of fruit names
[Figure: HDFS → Map 0, Map 1, Map 2 → Shuffle → Reduce 0, Reduce 1 → HDFS]
Map output: (apple, 1) (apple, 1) (strawberry, 1) (orange, 1)
Reduce output: (apple, 2) (orange, 1) (strawberry, 1)
map( fruit ) {
  output(fruit, 1);
}
reduce( fruit, v[1..n] ) {
  sum = 0;
  for (i = 1; i <= n; i++)
    sum = sum + v[i];
  output(fruit, sum);
}
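The pseudocode above can be simulated on a single machine to make the map, shuffle, and reduce phases concrete. This is a minimal sketch, not Hadoop's API: `run_mapreduce` and the assignment of fruits to the three input splits are assumptions made for illustration.

```python
from collections import defaultdict


def map_fn(fruit):
    """Emit (fruit, 1) for every input record."""
    yield (fruit, 1)


def reduce_fn(fruit, values):
    """Sum all counts that were shuffled to the same key."""
    yield (fruit, sum(values))


def run_mapreduce(splits, map_fn, reduce_fn):
    # Map phase: each split would be handled by one mapper.
    intermediate = [pair
                    for split in splits
                    for record in split
                    for pair in map_fn(record)]
    # Shuffle phase: group intermediate values by key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)
    # Reduce phase: one reduce call per distinct key.
    return dict(pair
                for key in sorted(groups)
                for pair in reduce_fn(key, groups[key]))


# Hypothetical input splits for Map 0, Map 1, and Map 2.
splits = [["apple"], ["apple", "strawberry"], ["orange"]]
histogram = run_mapreduce(splits, map_fn, reduce_fn)
print(histogram)  # {'apple': 2, 'orange': 1, 'strawberry': 1}
```

In a real deployment the framework, not the user code, takes care of parallelization, data distribution, and failure handling; the user writes only the two functions.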
Singular Value Decomposition (SVD)
Essential tool for
  Concept discovery
  Dimensionality reduction
  Finding fixed points
  Solving linear systems
  …
SVD - Example
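A = U Λ V^T
[Figure: document-term matrix A (rows: CS and MD documents; columns: data, info, retrieval, brain, lung) factored as U x Λ x V^T, with two concepts: CS and Medical]
  U: doc-concept similarity
  Λ: ‘strength’ of each concept (e.g., the CS-concept)
  V^T: term-concept similarity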
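To make the three factors concrete, the sketch below extracts the top two singular triplets of a small document-term matrix using power iteration with deflation in plain Python. The matrix entries are hypothetical counts chosen so that CS documents use only the CS terms and MD documents only the medical terms; the function names are likewise assumptions. A library SVD routine would normally be used instead.

```python
import math


def top_singular_triplet(A, iters=100):
    """Power iteration on A^T A; returns (u, sigma, v) for the largest sigma."""
    m, n = len(A), len(A[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]  # u = A v
        w = [sum(A[i][j] * u[i] for i in range(m)) for j in range(n)]  # w = A^T u
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    u = [sum(A[i][j] * v[j] for j in range(n)) for i in range(m)]
    sigma = math.sqrt(sum(x * x for x in u))
    return [x / sigma for x in u], sigma, v


def deflate(A, u, sigma, v):
    """Subtract the rank-1 component sigma * u * v^T from A."""
    return [[A[i][j] - sigma * u[i] * v[j] for j in range(len(v))]
            for i in range(len(u))]


terms = ["data", "info", "retrieval", "brain", "lung"]
A = [  # hypothetical term counts per document
    [1, 1, 1, 0, 0],  # CS documents
    [2, 2, 2, 0, 0],
    [1, 1, 1, 0, 0],
    [5, 5, 5, 0, 0],
    [0, 0, 0, 2, 2],  # MD documents
    [0, 0, 0, 3, 3],
    [0, 0, 0, 1, 1],
]

u1, s1, v1 = top_singular_triplet(A)                      # CS concept
u2, s2, v2 = top_singular_triplet(deflate(A, u1, s1, v1))  # medical concept
for name, s, v in [("concept 1", s1, v1), ("concept 2", s2, v2)]:
    print(name, round(s, 2), dict(zip(terms, (round(x, 2) for x in v))))
```

Here `u1` and `u2` play the role of columns of U (doc-concept similarity), `s1` and `s2` are the diagonal of Λ (concept strength), and `v1` and `v2` are rows of V^T (term-concept similarity).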
What is a Tensor?
N-D generalization of matrix:
          data  mining  classif.  tree  ...
  John     13     11      22       55   ...
  Peter     5      4       6        7   ...
  Mary    ...    ...     ...      ...   ...
  Nick    ...    ...     ...      ...   ...
  ...     ...    ...     ...      ...   ...
[Figure: one such matrix per year (KDD’20, KDD’21, KDD’22), stacked into a 3-D tensor]
Motivating Applications
Why are tensors useful?
  Multi-way semantic indexing
  Sensor data analysis
Multi-way Semantic Indexing
Data: author, keyword, year
[Figure: author-by-keyword slices of the tensor; axes: Authors, Keywords; groups labeled DB and DM]
Sun, Jimeng, Dacheng Tao, and Christos Faloutsos. "Beyond streams and graphs: dynamic tensor analysis." KDD. 2006.
Sensor Data Analysis
Data: location, type, time
[Figure labels: Tensor Streams, Core Tensor]
1st factor (main trend): (a1) daily pattern, (b1) main pattern, (c1) main correlation
2nd factor (major abnormal trend): (a2) abnormal residual, (b2) three abnormal sensors, (c2) voltage anomaly
Sun, Jimeng, Spiros Papadimitriou, and Philip S. Yu. "Window-based tensor analysis on high-dimensional and multi-aspect streams." ICDM. 2006.
Recommender System
Search vs. recommender system
  Search: a user actively looks for what the user wants (e.g., by entering a keyword in a search engine)
  Recommender system: the system automatically provides recommended items to users
Real World Applications
Amazon.com
35 percent of what consumers purchase on Amazon comes from recommendations
https://c1.staticflickr.com/5/4067/4551424756_3e176d6939_z.jpg
Real World Applications
Netflix
Personalization and recommendations save ≥ $1B per year
https://www.flickr.com/photos/wfryer/2661730729
Matrix Factorization for CF
Map each user and each item to a low-dimensional space
[Figure: users and items placed in a two-dimensional latent space; axes: Serious vs. Escapist, Geared toward males vs. Geared toward females]
Koren et al., Matrix Factorization Techniques for Recommender Systems, IEEE Computer, 2009
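One common way to learn such a low-dimensional mapping is stochastic gradient descent on the squared rating error. The sketch below is a minimal version under stated assumptions: the ratings, the latent dimension k=2, and all hyperparameters and function names are hypothetical, and bias terms used in practical systems are omitted.

```python
import math
import random


def predict(P, Q, u, i):
    """Predicted rating = inner product of user and item latent vectors."""
    return sum(pf * qf for pf, qf in zip(P[u], Q[i]))


def rmse(ratings, P, Q):
    return math.sqrt(sum((r - predict(P, Q, u, i)) ** 2
                         for u, i, r in ratings) / len(ratings))


def train_mf(ratings, num_users, num_items, k=2, lr=0.02, reg=0.02,
             epochs=1000, seed=0):
    rng = random.Random(seed)
    P = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(num_users)]
    Q = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(num_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - predict(P, Q, u, i)
            for f in range(k):
                pu, qi = P[u][f], Q[i][f]
                # Gradient step with L2 regularization on both factors.
                P[u][f] += lr * (err * qi - reg * pu)
                Q[i][f] += lr * (err * pu - reg * qi)
    return P, Q


# Hypothetical (user, item, rating) triples: users 0-1 prefer items 0-1,
# users 2-3 prefer items 2-3; four user-item pairs are unobserved.
ratings = [
    (0, 0, 5), (0, 1, 4), (0, 2, 1),
    (1, 0, 4), (1, 1, 5), (1, 3, 1),
    (2, 0, 1), (2, 2, 5), (2, 3, 4),
    (3, 1, 2), (3, 2, 4), (3, 3, 5),
]

untrained_P, untrained_Q = train_mf(ratings, 4, 4, epochs=0)
rmse_before = rmse(ratings, untrained_P, untrained_Q)
P, Q = train_mf(ratings, 4, 4)
rmse_after = rmse(ratings, P, Q)
print("training RMSE before:", round(rmse_before, 3))
print("training RMSE after:", round(rmse_after, 3))
print("predicted rating for unobserved (user 0, item 3):",
      round(predict(P, Q, 0, 3), 2))
```

Recommendations are then produced by ranking the unobserved items of each user by predicted rating.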
Tool 1: Time Series Analysis
Given: one or more sequences
  x1, x2, …, xt, …
  (y1, y2, …, yt, …)
Task
  Find similar sequences
  Forecast future values
  Classify sequences (e.g., fault or normal)
Matrix Profile
Repeated earthquakes
Yeh, Chin-Chia Michael, et al. "Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets." 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016. (https://www.cs.ucr.edu/~eamonn/matrix_profile_i.pptx)
Matrix Profile
Abnormal heartbeat detection from ECG
Maximum value in the matrix profile indicates a discord
Yeh, Chin-Chia Michael, et al. "Matrix profile I: all pairs similarity joins for time series: a unifying view that includes motifs, discords and shapelets." 2016 IEEE 16th International Conference on Data Mining (ICDM). IEEE, 2016. (https://www.cs.ucr.edu/~eamonn/matrix_profile_i.pptx)
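The idea that the maximum of the matrix profile marks a discord can be sketched with a brute-force computation. This is a naive quadratic-time version for illustration, not the fast algorithm from the cited paper; the synthetic series, the window length, and the function names are assumptions.

```python
import math


def znorm(window):
    """Z-normalize a window; a constant window maps to all zeros."""
    mean = sum(window) / len(window)
    std = math.sqrt(sum((x - mean) ** 2 for x in window) / len(window))
    if std == 0:
        return [0.0] * len(window)
    return [(x - mean) / std for x in window]


def matrix_profile(series, m):
    """For each length-m subsequence, the distance to its nearest
    non-overlapping neighbor (z-normalized Euclidean distance)."""
    windows = [znorm(series[i:i + m]) for i in range(len(series) - m + 1)]
    profile = []
    for i, wi in enumerate(windows):
        best = math.inf
        for j, wj in enumerate(windows):
            if abs(i - j) < m:  # exclusion zone: skip trivial matches
                continue
            dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(wi, wj)))
            best = min(best, dist)
        profile.append(best)
    return profile


# Hypothetical periodic signal with two corrupted samples.
series = [0, 1, 2, 1] * 10
series[21] = 5
series[22] = -3

m = 4
profile = matrix_profile(series, m)
discord = max(range(len(profile)), key=lambda i: profile[i])
print("discord starts at index", discord)
```

Every normal subsequence has an identical copy elsewhere in the series, so its profile value is zero; only the subsequences that cover the corrupted samples have no close match.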
Tool 2 : Approximation
Flajolet-Martin sketch:
  Suppose a set S contains n unique numbers, and we keep this information in memory
  Task: given a new number i, check whether i is already in S (e.g., to keep count of the unique numbers)
Question: how much memory do we need to answer this question?
Answer: O(n) bytes, usually. But the Flajolet-Martin sketch estimates the number of unique elements almost accurately using only O(log(n)) bits
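A minimal sketch of the idea, under stated assumptions: each item is hashed, the position of the lowest set bit of the hash is recorded in a small bitmap, and the position R of the lowest unset bit of the bitmap gives the estimate 2^R / 0.77351. Averaging R over several hash functions (32 here, an arbitrary choice) reduces the variance. The class and function names, the use of SHA-256 as the hash, and the test stream are all assumptions for illustration.

```python
import hashlib

PHI = 0.77351  # correction factor from Flajolet and Martin's analysis


def hash64(item, seed):
    """64-bit hash of an item; the seed selects one of several hash functions."""
    digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
    return int.from_bytes(digest[:8], "big")


def trailing_zeros(x, width=64):
    """Index of the lowest set bit of x (width if x is zero)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


class FMSketch:
    def __init__(self, num_hashes=32):
        # One bitmap per hash function; each needs only O(log n) bits.
        self.bitmaps = [0] * num_hashes

    def add(self, item):
        for seed in range(len(self.bitmaps)):
            self.bitmaps[seed] |= 1 << trailing_zeros(hash64(item, seed))

    def estimate(self):
        total = 0
        for bitmap in self.bitmaps:
            r = 0
            while (bitmap >> r) & 1:  # find the lowest unset bit
                r += 1
            total += r
        return 2 ** (total / len(self.bitmaps)) / PHI


# Hypothetical stream: 5000 distinct numbers, each seen twice.
stream = [i % 5000 for i in range(10000)]
sketch = FMSketch()
for x in stream:
    sketch.add(x)
estimate = sketch.estimate()
print("estimated number of unique items:", round(estimate))
```

Because adding an item only sets a bit, seeing the same item again never changes the sketch, which is why duplicates do not inflate the estimate.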
Tool 2 : Approximation
Application: speeding up graph computation
For 2 billion edges,
  - standard closeness takes 30,000 years
  - effective closeness takes ~1 day!
1,000,000 times faster!
Tool 3 : Graph Compression
[Figure: original graph vs. SlashBurn]
Tool 4 : Community Detection
How to find good communities in a graph?
http://en.wikipedia.org/wiki/Community_structure
Tool 5 : Anomaly Detection
How to find outliers, or anomalies?
OddBall at work (Posts)
Data: 223K posts, 217K citations
[Figure: scatter plot of posts; axes: # citations, # cross-citations; highlighted posts: http://instapundit.com/archives/025235.php, http://www.sizemore.co.uk/2005/08/i-feel-some-movies-coming-on.html]
Anomaly, Event, and Fraud Detection in Large Graph Datasets, L. Akoglu, C. Faloutsos, in ACM WSDM 2013 tutorial, Rome, Italy
Conclusion
Advanced theories, algorithms and systems for mining big data.
Topics
  Graph
  Spectral Analysis
  Large scale distributed system (e.g. MapReduce)
  Singular Value Decomposition, Tensor
  Time series, approximation, graph compression, community detection, anomaly detection
Questions?