Stats 170A: Project in Data Science
Exploratory Data Analysis: Clustering Algorithms
Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California, Irvine
Assignment 5
Refer to the Wiki page
Due by noon on Monday, February 12th, to the EEE dropbox
Note: this is before class (class starts at 2pm)
Questions?
What is Exploratory Data Analysis?
• EDA = {visualization, clustering, dimension reduction, …}
• For small numbers of variables, EDA = visualization
• For large numbers of variables, we need to be cleverer
  – Clustering, dimension reduction, and embedding algorithms
  – These are techniques that essentially reduce high-dimensional data to something we can look at
• Today's lecture:
  – Finish up visualization
  – Overview of clustering algorithms
Tufte’s Principles of Visualization
Graphical excellence…
– is the well-designed presentation of interesting data: a matter of substance, of statistics, and of design
– consists of complex ideas communicated with clarity, precision, and efficiency
– is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space
– requires telling the truth about the data
Different Ways of Presenting the Same Data
From Karl Broman, via www.cs.princeton.edu/
Principle of Proportional Ink (or How to Lie with Visualization)
Potentially Misleading Scales on the X-axis
Example: Visualization of Napoleon’s 1812 March
Illustrates the size of the army, its direction and location, the temperature, and the date, all on one chart
Data Journalism
From the New York Times, Feb 2, 2018
Exploratory Data Analysis: Clustering
Example: Clustering Vectors in a 2-Dimensional Space

[Figure: scatter plot of points in a 2-d space with axes x1 and x2]

Each point (or 2-d vector) represents a document
Example: Possible Clusters

[Figure: the same scatter plot, with the points grouped into Cluster 1 and Cluster 2]
Example: How many Clusters?

[Figure: the same scatter plot, with possible groupings into Cluster 1, Cluster 2, and Cluster 3]
Cluster Structure in Real-World Data

[Figure: scatter plot of signal C (vertical axis) versus signal T (horizontal axis)]

≈ 1500 subjects, two measurements per subject

Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine
[Figure: the same scatter plot of signal C versus signal T, with the points now separated into three clusters labeled CC, CT, and TT]

Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine
Issues in Clustering
• Representation
  – How do we represent our examples as data vectors?
• Distance
  – How do we want to define distance between vectors?
• Algorithm
  – What type of algorithm do we want to use to search for clusters?
  – What is the time and space complexity of the algorithm?
• Number of Clusters
  – How many clusters do we want?

There is no "right" answer to these questions in general; it depends on the application
Cluster Analysis vs Classification
Cluster analysis:
• Data are unlabeled
• The number of clusters is unknown
• "Unsupervised" learning
• Goal: find unknown structure

Classification:
• The labels for the training data are known
• The number of classes is known
• "Supervised" learning
• Goal: allocate new observations, whose labels are unknown, to one of the known classes
Clustering: The K-Means Algorithm
Notation
N documents; represent each document as a vector of T terms (e.g., counts or tf-idf)

The vector for the ith document is: xi = (xi1, xi2, …, xij, …, xiT), for i = 1, …, N

Document-Term matrix:
• xij is the entry in the ith row, jth column
• rows correspond to documents
• columns correspond to terms

We can think of our documents as points in a T-dimensional space, with clusters as "clouds of points"
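As a concrete illustration, here is a minimal sketch of building such a document-term matrix with scikit-learn; the toy corpus and variable names are made up for illustration:

```python
# A minimal sketch: build an N x T document-term matrix from a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "stocks fell as markets closed"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix: rows = documents, columns = terms

print(X.shape)                             # (3, T), where T = vocabulary size
print(vectorizer.get_feature_names_out())  # the T terms, one per column
```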
The K-Means Clustering Algorithm
Input:
– N vectors x1, …, xN, each of dimension D
– K = number of clusters (K > 1)

Output:
– K cluster centers c1, …, cK, each a vector of dimension D
– (Equivalently) a list of cluster assignments (values 1 to K), one for each of the N input vectors

Note: in K-means each input vector x is assigned to one and only one cluster k, i.e., to one cluster center ck. The K-means algorithm thus partitions the N data vectors into K disjoint groups.
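A sketch of this input/output contract using scikit-learn's KMeans on synthetic data (the data and parameter values here are made up; note scikit-learn labels clusters 0 to K-1 rather than 1 to K):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# N = 200 vectors of dimension D = 2, drawn around two synthetic centers
X = np.vstack([rng.normal(loc=[0, 0], scale=0.5, size=(100, 2)),
               rng.normal(loc=[3, 3], scale=0.5, size=(100, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(km.cluster_centers_)  # K cluster centers, each of dimension D
print(km.labels_[:10])      # cluster assignment (0..K-1) for each input vector
```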
Example of K-Means Output with 2 Clusters

[Figure: the same 2-d scatter plot, with cluster centers c1 and c2 marked inside Cluster 1 and Cluster 2]

Blue circles are examples of documents; red circles are examples of cluster centers
Squared Error Distance
Consider two vectors, each with T components (i.e., of dimension T):

x = (x1, x2, …, xT)    y = (y1, y2, …, yT)

A common distance metric is the squared error distance:

dE(x, y) = Σj (xj − yj)²,  where the sum runs from j = 1 to T

In two dimensions, the square root of this is the usual notion of spatial distance, i.e., Euclidean distance
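A minimal sketch of this definition in code (function name is illustrative):

```python
import numpy as np

def squared_error_distance(x, y):
    """d_E(x, y) = sum_j (x_j - y_j)^2 for two T-dimensional vectors."""
    x, y = np.asarray(x), np.asarray(y)
    return np.sum((x - y) ** 2)

# In 2-d, the square root recovers the usual Euclidean distance
d = squared_error_distance([0.0, 0.0], [3.0, 4.0])
print(d, np.sqrt(d))  # 25.0 5.0
```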
Squared Errors and Cluster Centers
• Squared error (distance) between a data point x and a cluster center c:

  dist[x, c] = Σj (xj − cj)²   (the sum is over the D components/dimensions of the vectors)

• Total squared error between a cluster center ck and all Nk points assigned to that cluster:

  Sk = Σi dist[xi, ck]   (the sum is over the Nk points assigned to cluster k)

• Total squared error summed across all K clusters:

  SSE = Σk Sk   (the sum is over the K clusters)

Here distance is defined as squared Euclidean distance
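A minimal sketch of these quantities in code, given data X, per-point cluster labels, and the centers (all names here are illustrative):

```python
import numpy as np

def cluster_sse(X, labels, centers):
    """Return the per-cluster errors (S_1, ..., S_K) and the total SSE."""
    S = []
    for k, c in enumerate(centers):
        Xk = X[labels == k]              # the N_k points assigned to cluster k
        S.append(np.sum((Xk - c) ** 2))  # S_k: total squared error for cluster k
    return S, sum(S)                     # SSE = sum over the K clusters of S_k
```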
K-means Objective Function
• K-means: minimize the total squared error, i.e., find the K cluster centers ck, and the assignments of points to clusters, that minimize

  SSE = Σk Sk = Σk ( Σi dist[xi, ck] )

• In other words, K-means seeks cluster centers such that the sum-squared error is as small as possible
  – it will place cluster centers strategically to "cover" the data
  – this is similar in spirit to data compression (K-means is in fact used in data compression algorithms)
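If you use scikit-learn, the fitted KMeans model exposes this objective as its inertia_ attribute; a quick self-contained check on synthetic data (illustrative names throughout):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))  # 200 synthetic 2-d vectors

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Recompute SSE by hand: sum of squared distances from each point
# to the center of the cluster it was assigned to
sse = sum(np.sum((X[km.labels_ == k] - c) ** 2)
          for k, c in enumerate(km.cluster_centers_))
print(km.inertia_, sse)  # the two values agree up to floating point
```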
K-Means Algorithm
• Random initialization
  – Select the initial K cluster centers at random (e.g., K randomly chosen data points)
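One common way to implement this initialization step, sketched in code (function name is illustrative):

```python
import numpy as np

def random_init(X, K, seed=0):
    """Pick K distinct data vectors at random as the initial cluster centers."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=K, replace=False)
    return X[idx].copy()
```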