
  • Stats 170A: Project in Data Science

    Exploratory Data Analysis: Clustering Algorithms

    Padhraic Smyth Department of Computer Science Bren School of Information and Computer Sciences University of California, Irvine

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 2

    Assignment 5

    Refer to the Wiki page

Due at noon on Monday, February 12th, to the EEE dropbox

    Note: due before class (by 2pm)

    Questions?

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 3

    What is Exploratory Data Analysis?

    • EDA = {visualization, clustering, dimension reduction, ….}

    • For small numbers of variables, EDA = visualization

• For large numbers of variables, we need to be cleverer
– Clustering, dimension reduction, embedding algorithms
– These are techniques that essentially reduce high-dimensional data to something we can look at (a minimal sketch of this idea follows below)

• Today’s lecture:
– Finish up visualization
– Overview of clustering algorithms
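As a small illustration of "reducing high-dimensional data to something we can look at", here is a minimal Python sketch using scikit-learn's PCA. The data matrix X is a made-up placeholder; a real project would substitute its own data:

```python
# Minimal sketch: project high-dimensional data down to 2 dimensions so we can plot it.
# Assumes scikit-learn and matplotlib are installed; X is a placeholder (n_samples, n_features) array.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))               # placeholder high-dimensional data

X_2d = PCA(n_components=2).fit_transform(X)  # keep the top 2 principal components
plt.scatter(X_2d[:, 0], X_2d[:, 1], s=5)
plt.xlabel("PC 1")
plt.ylabel("PC 2")
plt.show()
```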

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 4

    Tufte’s Principles of Visualization

    Graphical excellence…

– is the well-designed presentation of interesting data

– a matter of substance, of statistics, and of design

    – consists of complex ideas communicated with clarity, precision and efficiency

    – is that which gives to the viewer the greatest number of ideas in the shortest time with the least ink in the smallest space

    – requires telling the truth about the data

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 5

    Different Ways of Presenting the Same Data

    From Karl Broman, via www.cs.princeton.edu/

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 6

    Principle of Proportional Ink (or How to Lie with Visualization)

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 7

    Principle of Proportional Ink (or How to Lie with Visualization)

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 8

    Potentially Misleading Scales on the X-axis

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 9

    Example: Visualization of Napoleon’s 1812 March

    Illustrates size of army, direction, location, temperature, date…all on one chart

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 10

Data Journalism

From the New York Times, Feb 2, 2018

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 11

    Exploratory Data Analysis: Clustering

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 12

Example: Clustering Vectors in a 2-Dimensional Space

[Scatter plot with axes x1 and x2]

Each point (or 2-d vector) represents a document

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 13

Example: Possible Clusters

[The same scatter plot, axes x1 and x2, with two groups circled: Cluster 1 and Cluster 2]

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 14

Example: How Many Clusters?

[The same scatter plot, axes x1 and x2, now with three candidate groups circled: Cluster 1, Cluster 2, and Cluster 3]

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 15

Cluster Structure in Real-World Data

[Scatter plot of signal C (vertical axis, 0.0 to 1.0) versus signal T (horizontal axis, 0.0 to 1.5)]

≈ 1500 subjects, two measurements per subject

Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine


• Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 17

[The same scatter plot of signal C versus signal T, with the three clusters labeled by genotype: CC, CT, and TT]

Figure from Prof Zhaoxia Yu, Statistics Department, UC Irvine

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 18

    Issues in Clustering

    • Representation – How do we represent our examples as data vectors?

    • Distance – How do we want to define distance between vectors?

• Algorithm
– What type of algorithm do we want to use to search for clusters?
– What is the time and space complexity of the algorithm?

• Number of Clusters
– How many clusters do we want? (one common heuristic is sketched below)

    No “right” answer to these questions in general…it depends on the application
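For the number-of-clusters question in particular, one widely used heuristic (not from the original slides) is the "elbow" plot: run K-means for several values of K and look for the point where the total squared error stops dropping sharply. A minimal sketch, assuming scikit-learn and a toy dataset from make_blobs:

```python
# Illustrative sketch of the elbow heuristic: plot total within-cluster
# squared error (SSE) against K and look for an "elbow" in the curve.
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)  # toy data

sse = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse.append(km.inertia_)            # inertia_ is the SSE for this K

plt.plot(range(1, 8), sse, marker="o")
plt.xlabel("K")
plt.ylabel("SSE")
plt.show()
```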

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 19

    Cluster Analysis vs Classification

Clustering ("unsupervised" learning):

– Data are unlabeled
– The number of clusters is unknown
– Goal: find unknown structure

Classification ("supervised" learning):

– The labels for the training data are known
– The number of classes is known
– Goal: allocate new observations, whose labels are unknown, to one of the known classes

The sketch below shows this contrast in code.
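A minimal illustration of the contrast using scikit-learn (the dataset and model choices here are mine, not from the original slides): a clustering algorithm consumes only X, while a classifier also needs labels y.

```python
# Sketch of the unsupervised / supervised contrast in scikit-learn.
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_blobs

X, y = make_blobs(n_samples=200, centers=2, random_state=0)

# Clustering: no labels used; the algorithm searches for structure in X alone.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Classification: labels y are required for training; new points are then
# allocated to one of the known classes.
classifier = LogisticRegression().fit(X, y)
predictions = classifier.predict(X)
```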

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 20

    Clustering: The K-Means Algorithm

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 21

    Notation

N documents. Represent each document as a vector of T terms (e.g., counts or tf-idf).

The vector for the ith document is: xi = ( xi1, xi2, …, xij, …, xiT ), for i = 1, …, N

Document-Term matrix
• xij is the entry in the ith row and jth column
• columns correspond to terms
• rows correspond to documents

We can think of our documents as being in a T-dimensional space, with clusters as "clouds of points" (a small sketch of building such a matrix follows)
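As a small illustration (the three-document corpus is made up), scikit-learn's TfidfVectorizer builds exactly this kind of N x T document-term matrix:

```python
# Sketch: building a document-term matrix with tf-idf weights
# (rows are documents, columns are terms, matching the slide's notation).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat on the mat",
        "the dog chased the cat",
        "data science with python"]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)   # N x T sparse matrix: x_ij = tf-idf of term j in document i
print(X.shape)                       # (N documents, T terms)
print(vectorizer.get_feature_names_out())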

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 22

    The K-Means Clustering Algorithm

Input: N vectors x1, …, xN, each of dimension D, and K = number of clusters (K > 1)

Output:
– K cluster centers c1, …, cK, each center a vector of dimension D
– (Equivalently) a list of cluster assignments (values 1 to K), one for each of the N input vectors

    Note: In K-means each input vector x is assigned to one and only one cluster k, or cluster center ck

    The K -means algorithm partitions the N data vectors into K disjoint groups
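This input/output contract can be seen directly in a library implementation. A minimal sketch with scikit-learn's KMeans on toy data (illustrative only, not the course's required implementation):

```python
# Sketch of the K-means input/output contract using scikit-learn.
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=100, centers=2, n_features=2, random_state=0)  # N vectors, dimension D = 2

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.cluster_centers_)   # K cluster centers, each of dimension D
print(km.labels_)            # one cluster assignment (0..K-1) per input vector
```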

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 23

Example of K-Means Output with 2 Clusters

[Scatter plot, axes x1 and x2, showing Cluster 1 with center c1 and Cluster 2 with center c2]

Blue circles are example documents; red circles are example cluster centers

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 24

    Squared Error Distance

Consider two vectors x = ( x1, x2, …, xT ) and y = ( y1, y2, …, yT ), each with T components (i.e., dimension T).

A common distance metric is squared error distance:

dE [ x , y ] = Σj ( xj - yj )², where the sum runs over j = 1, …, T

In two dimensions the square root of this is the usual notion of spatial distance, i.e., Euclidean distance
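A direct translation into Python (the function name is my own):

```python
# Squared error distance between two vectors of the same length T.
import numpy as np

def squared_error_distance(x, y):
    """d_E(x, y) = sum over j of (x_j - y_j)^2."""
    x, y = np.asarray(x), np.asarray(y)
    return np.sum((x - y) ** 2)

# In 2 dimensions, the square root of this is ordinary Euclidean distance:
print(squared_error_distance([0, 0], [3, 4]))           # 25
print(np.sqrt(squared_error_distance([0, 0], [3, 4])))  # 5.0
```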

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 25

    Squared Errors and Cluster Centers

    • Squared error (distance) between a data point x and a cluster center c:

dist [ x , c ] = Σj ( xj - cj )²

    Index j is over the D components/dimensions of the vectors

[Diagram: the points of Cluster 1 scattered around its center c1]

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 26

    Squared Errors and Cluster Centers

• Squared error (distance) between a data point x and a cluster center c:

dist [ x , c ] = Σj ( xj - cj )²

(sum over the D components/dimensions of the vectors)

• Total squared error between a cluster center ck and all Nk points assigned to that cluster:

Sk = Σi dist [ xi , ck ]

(this sum is over vectors, i.e., over the Nk points assigned to cluster k; the distance is the squared Euclidean distance defined above)

[Diagram: the points of Cluster 1 scattered around its center c1]

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 27

    Squared Errors and Cluster Centers

• Squared error (distance) between a data point x and a cluster center c:

dist [ x , c ] = Σj ( xj - cj )²

(sum over the D components/dimensions of the vectors)

• Total squared error between a cluster center ck and all Nk points assigned to that cluster:

Sk = Σi dist [ xi , ck ]

(sum over the Nk points assigned to cluster k)

• Total squared error summed across the K clusters:

SSE = Σk Sk

(sum over the K clusters; the distance is the squared Euclidean distance defined above)
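Putting the two sums into code (all names are illustrative; X, labels, and centers stand for the data vectors, cluster assignments, and cluster centers):

```python
# Sketch: per-cluster error S_k and total SSE, following the slide's definitions.
import numpy as np

def sse(X, labels, centers):
    total = 0.0
    for k in range(len(centers)):
        members = X[labels == k]                      # the N_k points assigned to cluster k
        total += np.sum((members - centers[k]) ** 2)  # S_k: squared errors to center c_k
    return total

X = np.array([[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]])
labels = np.array([0, 0, 1])
centers = np.array([[0.5, 0.0], [10.0, 10.0]])
print(sse(X, labels, centers))   # 0.25 + 0.25 + 0.0 = 0.5
```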

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 28

    K-means Objective Function

• K-means: minimize the total squared error, i.e., find the K cluster centers ck, and the assignments, that minimize

SSE = Σk Sk = Σk ( Σi dist [ xi , ck ] )

• K-means seeks to minimize SSE, i.e., to find the cluster centers for which the sum of squared errors is smallest
– it will place cluster centers strategically to "cover" the data
– this is similar to data compression (K-means is in fact used in data compression algorithms)
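For what it's worth, scikit-learn's KMeans reports exactly this objective as its inertia_ attribute, so the definition above can be checked numerically (a sketch, assuming toy data from make_blobs and the same per-cluster sum as before):

```python
# Sketch: verify that the SSE objective matches scikit-learn's inertia_.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

manual = sum(np.sum((X[km.labels_ == k] - c) ** 2)
             for k, c in enumerate(km.cluster_centers_))
print(np.isclose(manual, km.inertia_))   # True: both are the K-means SSE objective
```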

  • Padhraic Smyth, UC Irvine: Stats 170AB, Winter 2018: 29

    K-Means Algorithm

• Random initialization
– Select the initial K cluster centers (e.g., as K randomly chosen data vectors)
