# Stats 170A: Project in Data Science


Exploratory Data Analysis: Clustering Algorithms

Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California, Irvine

• Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 2

Assignment 5

Refer to the Wiki page

Due noon on Monday February 12th to EEE dropbox

Note: due before class (by 2pm)

Questions?


What is Exploratory Data Analysis?

• EDA = {visualization, clustering, dimension reduction, ….}

• For small numbers of variables, EDA = visualization

• For large numbers of variables, we need to be cleverer
– Clustering, dimension reduction, embedding algorithms
– These are techniques that essentially reduce high-dimensional data to something we can look at

• Today’s lecture:
– Finish up visualization
– Overview of clustering algorithms


Tufte’s Principles of Visualization

Graphical excellence…

– is the well-designed presentation of interesting data – a matter of substance, of statistics, and of design

– consists of complex ideas communicated with clarity, precision and efficiency

– is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space

– requires telling the truth about the data


Different Ways of Presenting the Same Data

From Karl Broman, via www.cs.princeton.edu/


Principle of Proportional Ink (or How to Lie with Visualization)


Potentially Misleading Scales on the X-axis


Example: Visualization of Napoleon’s 1812 March

Illustrates size of army, direction, location, temperature, date…all on one chart


Data Journalism

From New York Times, Feb 2 2018


Exploratory Data Analysis: Clustering


Example: Clustering Vectors in a 2-Dimensional Space

[Figure: points plotted on axes x1 and x2]

Each point (or 2d vector) represents a document


Example: Possible Clusters

[Figure: points on axes x1 and x2, grouped into Cluster 1 and Cluster 2]


Example: How many Clusters?

[Figure: points on axes x1 and x2, grouped into Cluster 1, Cluster 2, and Cluster 3]


Cluster Structure in Real-World Data

[Figure: signal C (vertical axis, 0.0 to 1.0) plotted against signal T (horizontal axis, 0.0 to 1.5)]

≈ 1500 subjects

Two measurements per subject

Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine



[Figure: signal C plotted against signal T, with points labeled CC, TT, and CT]

Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine


Issues in Clustering

• Representation – How do we represent our examples as data vectors?

• Distance – How do we want to define distance between vectors?

• Algorithm – What type of algorithm do we want to use to search for clusters? – What is the time and space complexity of the algorithm?

• Number of Clusters – How many clusters do we want?

No “right” answer to these questions in general…it depends on the application


Cluster Analysis vs Classification

Cluster analysis:

• Data are unlabeled

• The number of clusters is unknown

• “Unsupervised” learning

• Goal: find unknown structures

Classification:

• The labels for training data are known

• The number of classes is known

• “Supervised” learning

• Goal: allocate new observations, whose labels are unknown, to one of the known classes


Clustering: The K-Means Algorithm


Notation

N documents

Represent each document as a vector of T terms (e.g., counts or tf-idf)

The vector for the ith document is: $x_i = (x_{i1}, x_{i2}, \ldots, x_{ij}, \ldots, x_{iT})$, $i = 1, \ldots, N$

Document-Term matrix
• $x_{ij}$ is the entry in the ith row, jth column
• columns correspond to terms
• rows correspond to documents

We can think of our documents as being in a T-dimensional space, with clusters as “clouds of points”
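A minimal sketch of the notation above, assuming raw term counts (not tf-idf) and naive whitespace tokenization. The function name `document_term_matrix` and the three toy documents are illustrative only and do not come from the slides.

```python
from collections import Counter

def document_term_matrix(docs):
    """Return (vocab, matrix); matrix[i][j] is the count of term j in document i."""
    # Assumption: lowercase + whitespace split stands in for real tokenization.
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({term for tokens in tokenized for term in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        matrix.append([counts[term] for term in vocab])  # one row per document
    return vocab, matrix

docs = ["the cat sat", "the dog sat", "the cat chased the dog"]
vocab, X = document_term_matrix(docs)

print(vocab)          # the T terms (columns)
for row in X:         # the N document vectors (rows)
    print(row)
```

Here N = 3 and T = 5, so each document is a point in a 5-dimensional space.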


The K-Means Clustering Algorithm

Input:
– N vectors $x_1, \ldots, x_N$ of dimension D (D = T when the vectors are the document vectors above)
– K = number of clusters (K > 1)

Output:
– K cluster centers, $c_1, \ldots, c_K$; each center is a vector of dimension D
– (Equivalently) a list of cluster assignments (values 1 to K) for each of the N input vectors

Note: In K-means each input vector x is assigned to one and only one cluster k, or cluster center $c_k$

The K-means algorithm partitions the N data vectors into K disjoint groups
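A small sketch of why the two output forms can be treated as interchangeable, assuming the squared error distance defined in the slides that follow. Given centers, each vector goes to its nearest center; given assignments, each center is the mean of its vectors. The helper names and toy data are hypothetical, and clusters are indexed 0 to K-1 here rather than 1 to K.

```python
def squared_error(x, c):
    """Sum of squared differences between two vectors of equal dimension."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def assign_to_nearest(points, centers):
    """Centers -> assignments: index of the nearest center for each point."""
    return [
        min(range(len(centers)), key=lambda k: squared_error(x, centers[k]))
        for x in points
    ]

def centers_from_assignments(points, assignments, K):
    """Assignments -> centers: mean of the points in each cluster.

    Assumes every cluster has at least one point.
    """
    centers = []
    for k in range(K):
        members = [x for x, a in zip(points, assignments) if a == k]
        centers.append(tuple(sum(col) / len(members) for col in zip(*members)))
    return centers

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers = [(1.25, 1.5), (8.5, 8.75)]

assignments = assign_to_nearest(points, centers)
print(assignments)
print(centers_from_assignments(points, assignments, K=2))
```

Each point receives exactly one index, so the assignments define K disjoint groups.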


Example of K-Means Output with 2 Clusters

[Figure: points on axes x1 and x2 grouped into Cluster 1 and Cluster 2, with cluster centers c1 and c2]

Blue circles are examples of documents

Red circles are examples of cluster centers


Squared Error Distance

Consider two vectors each with T components (i.e., dimension T):

$$x = (x_1, x_2, \ldots, x_T), \qquad y = (y_1, y_2, \ldots, y_T)$$

A common distance metric is squared error distance:

$$d_E(x, y) = \sum_{j=1}^{T} (x_j - y_j)^2$$

In two dimensions the square root of this is the usual notion of spatial distance, i.e., Euclidean distance
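As a quick check of the definition, the following sketch computes the squared error distance for two 2-dimensional vectors and takes its square root to recover the ordinary Euclidean distance. The function name and the example vectors are illustrative, not from the slides.

```python
import math

def squared_error_distance(x, y):
    """Sum over components j of (x_j - y_j)^2."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    return sum((xj - yj) ** 2 for xj, yj in zip(x, y))

x = (0.0, 0.0)
y = (3.0, 4.0)

d = squared_error_distance(x, y)
print(d)             # squared error distance: 3^2 + 4^2 = 25.0
print(math.sqrt(d))  # Euclidean distance: 5.0
```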


Squared Errors and Cluster Centers

• Squared error (distance) between a data point x and a cluster center c:

$$d(x, c) = \sum_{j=1}^{D} (x_j - c_j)^2$$

The sum is over the D components/dimensions of the vectors

• Total squared error between a cluster center $c_k$ and all $N_k$ points assigned to that cluster:

$$S_k = \sum_{i \,:\, x_i \in \text{cluster } k} d(x_i, c_k)$$

The sum is over the $N_k$ points assigned to cluster k

• Total squared error summed across K clusters:

$$SSE = \sum_{k=1}^{K} S_k$$

The sum is over the K clusters

Distance is defined as squared Euclidean distance
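The three nested sums can be sketched directly in code. The toy clusters and centers below are made up for illustration, and the function names are hypothetical.

```python
def squared_error(x, c):
    """d(x, c): sum over the D components of (x_j - c_j)^2."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def cluster_sse(points, center):
    """S_k: sum of d(x_i, c_k) over the N_k points assigned to cluster k."""
    return sum(squared_error(x, center) for x in points)

def total_sse(clusters, centers):
    """SSE: sum of S_k over the K clusters."""
    return sum(cluster_sse(pts, c) for pts, c in zip(clusters, centers))

# Two clusters of two points each, with each center at its cluster mean.
clusters = [
    [(1.0, 1.0), (1.0, 3.0)],
    [(8.0, 8.0), (10.0, 8.0)],
]
centers = [(1.0, 2.0), (9.0, 8.0)]

S = [cluster_sse(pts, c) for pts, c in zip(clusters, centers)]
sse = total_sse(clusters, centers)
print(S)    # [2.0, 2.0]
print(sse)  # 4.0
```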


K-means Objective Function

• K-means: minimize the total squared error, i.e., find the K cluster centers $c_k$, and assignments, that minimize

$$SSE = \sum_{k=1}^{K} S_k = \sum_{k=1}^{K} \Big( \sum_{i \,:\, x_i \in \text{cluster } k} d(x_i, c_k) \Big)$$

• K-means seeks to minimize SSE, i.e., find the cluster centers such that the sum-squared-error is smallest
– will place cluster centers strategically to “cover” data
– similar to data compression (in fact used in data compression algorithms)
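One common way to search for a low-SSE solution is to alternate between assigning points to their nearest center and moving each center to the mean of its points. The sketch below follows that standard scheme and is not taken from the slides. Using the first K points as initial centers, a fixed iteration count, and the toy data are all assumptions made only to keep the example deterministic. It records SSE after every iteration so the non-increasing trend can be inspected.

```python
def squared_error(x, c):
    """Sum of squared differences between two vectors of equal dimension."""
    return sum((xj - cj) ** 2 for xj, cj in zip(x, c))

def kmeans(points, K, n_iter=10):
    """Alternate assignment and center updates.

    Returns (centers, assignments, sse_history), clusters indexed 0..K-1.
    """
    # Assumption: first K points serve as initial centers (illustrative only).
    centers = [tuple(p) for p in points[:K]]
    assignments = []
    sse_history = []
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest center.
        assignments = [
            min(range(K), key=lambda k: squared_error(x, centers[k]))
            for x in points
        ]
        # Update step: each center moves to the mean of its assigned points.
        new_centers = []
        for k in range(K):
            members = [x for x, a in zip(points, assignments) if a == k]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members))
                )
            else:
                new_centers.append(centers[k])  # empty cluster: keep old center
        centers = new_centers
        sse_history.append(
            sum(squared_error(x, centers[a]) for x, a in zip(points, assignments))
        )
    return centers, assignments, sse_history

points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
centers, assignments, sse_history = kmeans(points, K=2)
print(assignments)
print(centers)
print(sse_history)
```

Neither step can increase SSE, so the recorded values never go up. The result can still depend on the initial centers, so this procedure is only guaranteed to reach a local minimum of the objective.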


K-Means Algorithm

• Random initialization – Select the initial
