
Hierarchical spectral clustering

Yves Brise

September 23, 2005

Professor: Emo Welzl

Advisors: Joachim Giesen, Dieter Mitsche


Contents

1 Introduction
2 Model & Problem
  2.1 Planted Partition Model
  2.2 Disturbed Planted Partition Model
  2.3 Spectral Properties For Partitioning
3 Algorithms
  3.1 SpectralReconstruct
  3.2 ClusterHierarchical
4 Implementation
  4.1 Random Number Generator
  4.2 Building Under Cygwin
  4.3 TRLAN, LAPACK and BLAS
  4.4 FASD
5 Experiments
  5.1 Threshold For ClusterHierarchical
  5.2 Increasing Number Of Bad Elements
  5.3 Varying p & q
6 Conclusion
7 Outlook
A Source Code


1 Introduction

The goal of this semester project was to implement and test two algorithms that deal with the clustering problem: the problem of grouping n possibly related items into k disjoint clusters. Related items should be grouped together, whereas unrelated items should be distributed into different clusters. This kind of problem can be motivated, for example, by the task of thematically grouping documents in a newsgroup. Given a collection of sports-related newsletters, which documents concern basketball? This is the kind of question one hopes to answer reliably and efficiently.

Such a partitioning can be modeled as a graph. The elements become the vertices of the graph, and two vertices are connected by an edge if and only if the associated elements belong to the same cluster. From this graph we can derive a similarity matrix, which is square and symmetric. In our case we use the n × n adjacency matrix, with ones between elements that belong to the same cluster and zeros between elements that do not. In the planted partition model that we will use, the algorithms do not have access to this adjacency matrix itself; otherwise the reconstruction would be trivial. Instead, the model assumes that the matrix representing the perfect clusters is disturbed by changes that occur with a certain probability, very much like one would expect the clustering information to look when it is derived from a loosely related set of elements such as a collection of articles.

Spectral clustering tries to exploit the spectral properties of the similarity matrix in order to reconstruct the clusters from that information alone. Important parameters are, of course, the probabilities with which the matrix is changed, but also the size and the number of clusters.

The idea behind hierarchical clustering is very simple: the reconstruction of the clusters is divided into steps that form a processing hierarchy. The two algorithms presented, ClusterHierarchical and SpectralReconstruct, are intended to be applied to the data in sequential order. The first step is done by ClusterHierarchical, which tries to sort out elements that are not classifiable, that is, which cannot be associated with any of the clusters. The algorithm divides the elements into two groups and clears only one of them for further processing by the following steps; the elements of the other group are discarded. In our model we will call this the planted partition detection problem, or


just detection for short (see 2.2). The second step is done by SpectralReconstruct, which attempts the actual grouping into clusters. We call this the planted partition reconstruction problem, or reconstruction for short.

Here is an outline of the sections of this thesis. Section 2 describes the mathematical models and the spectral properties of the adjacency matrix. It also leads to the estimation of the number of clusters, which is a crucial ingredient of both algorithms. The detection and the reconstruction problem rely on essentially the same model, but a few differences nevertheless have to be explained. Section 3 presents the algorithms in pseudocode and gives a short prose description of what they do. Section 4 is an experience report on the problems that occurred during the implementation, as well as a short user manual for the program FASD, which is one of the results of this semester project. Finally, Section 5 shows the results of the experiments: we generate random clustering data and measure the error of the detection algorithm ClusterHierarchical for different parameters. This is followed by a conclusion and an outlook on possible future work. In the appendix we include and comment on some of the source code, especially the parts concerned with calling the external libraries that we use.


2 Model & Problem

2.1 Planted Partition Model

In this subsection we introduce the planted partition reconstruction problem and define two quality measures that can be used to compare different partitioning algorithms. We first introduce the A(ϕ, p, q) distribution; see also McSherry [3].

A(ϕ, p, q) distribution. Given a surjective function ϕ : {1, . . . , n} → {1, . . . , k} and probabilities p, q ∈ (0, 1) with p > q. The A(ϕ, p, q) distribution is a distribution on the set of n × n symmetric 0-1 matrices with zero trace. Let Â = (â_ij) be a matrix drawn from this distribution. Then â_ij = 0 if i = j, and for i ≠ j,

P(â_ij = 1) = p       if ϕ(i) = ϕ(j)
P(â_ij = 0) = 1 − p   if ϕ(i) = ϕ(j)
P(â_ij = 1) = q       if ϕ(i) ≠ ϕ(j)
P(â_ij = 0) = 1 − q   if ϕ(i) ≠ ϕ(j),

independently. The matrix of expectations A = (a_ij) corresponding to the A(ϕ, p, q) distribution is given as

a_ij = 0   if i = j
a_ij = p   if ϕ(i) = ϕ(j) and i ≠ j
a_ij = q   if ϕ(i) ≠ ϕ(j).
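Sampling from the A(ϕ, p, q) distribution follows directly from this definition. The following is a minimal Python sketch; the function name `sample_A` and the encoding of ϕ as a list of cluster labels are our own illustrative choices, not part of the project's FASD implementation:

```python
import random

def sample_A(phi, p, q, seed=None):
    """Draw a symmetric 0-1 matrix with zero trace from the A(phi, p, q)
    distribution: for i != j, entry (i, j) is 1 with probability p if
    phi maps i and j to the same cluster, and with probability q otherwise."""
    rng = random.Random(seed)
    n = len(phi)
    A = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):      # only i < j; mirror for symmetry
            prob = p if phi[i] == phi[j] else q
            A[i][j] = A[j][i] = 1 if rng.random() < prob else 0
    return A

# Example: n = 6 elements in k = 2 planted clusters of size 3.
phi = [1, 1, 1, 2, 2, 2]
A_hat = sample_A(phi, p=0.9, q=0.1, seed=0)
```

Since only the entries above the diagonal are drawn and then mirrored, symmetry and the zero trace hold by construction.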

Lemma 1 (Füredi and Komlós [1], Krivelevich and Vu [2]). Let Â be a matrix drawn from the A(ϕ, p, q) distribution and A the corresponding matrix of expectations. Let c = min{p(1 − p), q(1 − q)} and assume that c² ≫ (log n)⁶/n. Then

|A − Â| ≤ √n

with probability at least 1 − 2e^(−c²n/8). Here | · | denotes the L2 matrix norm, i.e., |B| = max_{|x|=1} |Bx|.
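The norm bound can be illustrated numerically. The sketch below is a hypothetical check, not part of the project code: it samples one matrix Â, forms B = A − Â, and estimates |B| by power iteration. Since |Bx| ≤ |B| for any unit vector x, the estimate is a lower bound on the true norm. Note that at this toy size (n = 60) the asymptotic condition c² ≫ (log n)⁶/n of Lemma 1 is far from satisfied; the bound merely happens to hold for the sampled instance:

```python
import math
import random

def norm_lower_bound(B, iters=200, seed=1):
    """Estimate |B| = max_{|x|=1} |Bx| for a symmetric matrix B by power
    iteration.  Each iterate satisfies |Bx| <= |B|, so the returned value
    is a lower bound on the true L2 norm."""
    rng = random.Random(seed)
    n = len(B)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        y = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]
        length = math.sqrt(sum(v * v for v in y)) or 1.0
        x = [v / length for v in y]
    y = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]
    return math.sqrt(sum(v * v for v in y))

# One sample with n = 60, k = 2, p = 0.9, q = 0.1.
n, p, q = 60, 0.9, 0.1
phi = [1 if i < n // 2 else 2 for i in range(n)]
A = [[0.0 if i == j else (p if phi[i] == phi[j] else q) for j in range(n)]
     for i in range(n)]
rng = random.Random(0)
A_hat = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        A_hat[i][j] = A_hat[j][i] = 1 if rng.random() < A[i][j] else 0
B = [[A[i][j] - A_hat[i][j] for j in range(n)] for i in range(n)]
print(norm_lower_bound(B), "vs sqrt(n) =", math.sqrt(n))
```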

Planted partition reconstruction problem. Given a matrix Â drawn from the A(ϕ, p, q) distribution, assume that all clusters ϕ⁻¹(l), l ∈ {1, . . . , k}, have the same size n/k. Then the function ϕ is called a partition function. The planted partition reconstruction problem asks to reconstruct ϕ, up to a permutation of {1, . . . , k}, from Â only.


Quality of a reconstruction algorithm. A planted partition reconstruction algorithm takes a matrix Â drawn from the distribution A(ϕ, p, q) as input and outputs a function ψ : {1, . . . , n} → {1, . . . , k′}. There are two natural measures to assess the quality of a reconstruction algorithm.

(1) The probability of correct reconstruction, i.e.,

P [ϕ = ψ up to a permutation of {1, . . . , k}].

(2) The distribution of the number of elements in {1, . . . , n} misclassified by the algorithm. The number of misclassifications used here (see also Meila et al. [4]) is defined via a maximum weight matching on the weighted, complete bipartite graph whose vertices are the original clusters ϕ⁻¹(i), i ∈ {1, . . . , k}, and the clusters ψ⁻¹(j), j ∈ {1, . . . , k′}, produced by the algorithm. The weight of the edge {ϕ⁻¹(i), ψ⁻¹(j)} is |ϕ⁻¹(i) ∩ ψ⁻¹(j)|, i.e., the size of the intersection of the two clusters. The matching gives a pairing of the clusters defined by ϕ and ψ. Assume without loss of generality that ϕ⁻¹(i) and ψ⁻¹(i) are always paired. Then the number of misclassifications is given as

n − ∑_{i=1}^{min{k,k′}} |ϕ⁻¹(i) ∩ ψ⁻¹(i)|.
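For small numbers of clusters this measure can be computed directly from the definition by trying all pairings of original and produced clusters. The sketch below is an illustrative helper of our own (the name `misclassifications` is not from the project code); a real implementation would use a proper maximum weight matching algorithm instead of brute force:

```python
from itertools import permutations

def misclassifications(phi, psi):
    """n minus the weight of a maximum weight matching between the clusters
    of phi and psi.  The matching is found by brute force over all pairings,
    which is fine for small numbers of clusters."""
    n = len(phi)
    orig = sorted(set(phi))   # cluster labels used by phi
    prod = sorted(set(psi))   # cluster labels used by psi
    # weight[a][b] = size of the intersection of the two clusters
    weight = [[sum(1 for t in range(n) if phi[t] == a and psi[t] == b)
               for b in prod] for a in orig]
    m = min(len(orig), len(prod))
    best = 0
    for rows in permutations(range(len(orig)), m):
        for cols in permutations(range(len(prod)), m):
            best = max(best, sum(weight[r][c] for r, c in zip(rows, cols)))
    return n - best

phi = [1, 1, 1, 2, 2, 2]
psi = [2, 2, 2, 1, 1, 2]   # labels swapped and element 5 misplaced
print(misclassifications(phi, psi))  # -> 1
```

Note that relabeling the clusters of ψ does not change the result, exactly as required by the "up to a permutation" condition.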

2.2 Disturbed Planted Partition Model

The assessment of ClusterHierarchical requires a slightly different model. We introduce a special symbol χ to represent the set of elements that do not belong to any cluster, and we have to adapt the measure of error.

D(ϕ, p, q) distribution. Given a surjective function ϕ : {1, . . . , n} → {1, . . . , k} ∪ {χ} and probabilities p, q ∈ (0, 1) with p > q. The D(ϕ, p, q) distribution is a distribution on the set of n × n symmetric 0-1 matrices with zero trace. Let D̂ = (d̂_ij) be a matrix drawn from this distribution. Then d̂_ij = 0 if i = j


and for i ≠ j,

P(d̂_ij = 1) = p       if ϕ(i) = ϕ(j) ≠ χ
P(d̂_ij = 0) = 1 − p   if ϕ(i) = ϕ(j) ≠ χ
P(d̂_ij = 1) = q       if ϕ(i) = χ ∨ ϕ(j) = χ ∨ ϕ(i) ≠ ϕ(j)
P(d̂_ij = 0) = 1 − q   if ϕ(i) = χ ∨ ϕ(j) = χ ∨ ϕ(i) ≠ ϕ(j),

independently. The matrix of expectations D = (d_ij) corresponding to the D(ϕ, p, q) distribution is given as

d_ij = 0   if i = j
d_ij = p   if i ≠ j and ϕ(i) = ϕ(j) ≠ χ
d_ij = q   if i ≠ j and (ϕ(i) = χ ∨ ϕ(j) = χ ∨ ϕ(i) ≠ ϕ(j)).

This means that elements that do not belong to any cluster are connected to every other element with probability q. Elements that do belong to a cluster are connected to elements of the same cluster with probability p, and to every other element with probability q.
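A sampler for the D(ϕ, p, q) distribution differs from the A(ϕ, p, q) case only in how the edge probability is chosen. In the sketch below (again our own illustration, not project code) we encode the symbol χ as the reserved label `CHI = 0`:

```python
import random

CHI = 0   # our encoding of the symbol chi for elements in no cluster

def sample_D(phi, p, q, seed=None):
    """Draw a matrix from the D(phi, p, q) distribution: a pair inside the
    same real cluster connects with probability p; any pair involving a
    chi-element, or two elements of different clusters, connects with
    probability q."""
    rng = random.Random(seed)
    n = len(phi)
    D = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            same_real_cluster = (phi[i] == phi[j] != CHI)
            prob = p if same_real_cluster else q
            D[i][j] = D[j][i] = 1 if rng.random() < prob else 0
    return D

# Two clusters of size 3 plus two unclassifiable (chi) elements:
phi = [1, 1, 1, 2, 2, 2, CHI, CHI]
D_hat = sample_D(phi, p=0.9, q=0.1, seed=0)
```

The condition `phi[i] == phi[j] != CHI` covers all three q-cases of the definition at once: it is false whenever either element is a χ-element or the two elements lie in different clusters.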

Planted partition detection problem. Given a matrix D̂ drawn from the D(ϕ, p, q) dist