Bayesian Hierarchical Clustering Katherine Heller Zoubin Ghahramani Presented by Soumya Ghosh Slides courtesy: Katherine Heller

Bayesian Hierarchical Clustering - Brown University


Bayesian Hierarchical Clustering

Katherine Heller

Zoubin Ghahramani

Presented by

Soumya Ghosh Slides courtesy: Katherine Heller


Hierarchical Clustering

• Classic algorithm

• Agglomerative, bottom up clustering

• Initialize with each data instance as its own cluster.

• Progressively merge the most similar pairs creating a binary tree
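
The bottom-up procedure above can be sketched as follows. This is a minimal illustration on 1-D points, not the presenters' code; the function names `agglomerate` and `single_linkage` and the sample data are assumptions made for the example.

```python
from itertools import combinations

def single_linkage(a, b):
    """Smallest pairwise distance between two clusters of 1-D points."""
    return min(abs(x - y) for x in a for y in b)

def agglomerate(points, linkage=single_linkage):
    """Greedy bottom-up clustering; returns the merges in the order made."""
    clusters = [(p,) for p in points]  # each point starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # choose the pair of clusters with the smallest linkage distance
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# hypothetical data: two close points, one nearby point, one outlier
merges = agglomerate([0.0, 0.1, 1.0, 5.0])
for left, right in merges:
    print(left, "+", right)
```

The sequence of merges defines the binary tree: n points always produce n - 1 merges.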


Problems

• No probabilistic model of the data:

– Difficult to deal with new data instances

– Can’t be compared to or combined with other probabilistic models

– No notion of how good a particular clustering of the data is

• Correct distance metric?

• More importantly, need to specify distance between groups


Problems

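Merge which pair of clusters?

[Figure: three clusters C1, C2 and C3, followed by three panels showing how the distance between clusters C1 and C2 is measured under single linkage, complete linkage and centroid linkage.]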
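
To make the three linkage choices concrete, here is a small sketch on 1-D points. The function names and the two example clusters are assumptions for illustration only.

```python
def single_linkage(a, b):
    """Distance between the closest pair of points, one from each cluster."""
    return min(abs(x - y) for x in a for y in b)

def complete_linkage(a, b):
    """Distance between the farthest pair of points, one from each cluster."""
    return max(abs(x - y) for x in a for y in b)

def centroid_linkage(a, b):
    """Distance between the cluster means."""
    return abs(sum(a) / len(a) - sum(b) / len(b))

# hypothetical clusters
c1 = [0.0, 1.0]
c2 = [3.0, 7.0]
print(single_linkage(c1, c2), complete_linkage(c1, c2), centroid_linkage(c1, c2))
```

The same two clusters get three different distances, so the choice of linkage can change which pair is merged next.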


Bayesian Hierarchical Clustering

• Notation

– $D = \{x_1, \ldots, x_n\}$

– $D_i \subset D$, data at leaves of tree $T_i$

[Figure: trees $T_1$, $T_2$, $T_3$ with data $D_1$, $D_2$, $D_3$ at their leaves.]


Tree-Consistent Partitions

• Consider the above tree and all 15 possible partitions of {1,2,3,4}:

(1)(2)(3)(4), (1 2)(3)(4), (1 3)(2)(4), (1 4)(2)(3), (2 3)(1)(4),

(2 4)(1)(3), (3 4)(1)(2), (1 2)(3 4), (1 3)(2 4), (1 4)(2 3),

(1 2 3)(4), (1 2 4)(3), (1 3 4)(2), (2 3 4)(1), (1 2 3 4)

• (1 2) (3) (4) and (1 2 3) (4) are tree-consistent partitions

• (1)(2 3)(4) and (1 3)(2 4) are not tree-consistent partitions
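
The tree-consistent partitions can be enumerated recursively: at each node, either keep all of its leaves in one block or combine a partition of the left subtree with a partition of the right subtree. The sketch below assumes the tree is the cascading tree `(((1, 2), 3), 4)`, which matches the examples on this slide; the figure itself is not reproduced here, so that shape is an assumption, as are the function names.

```python
def leaves(tree):
    """Leaves of a nested-tuple binary tree, left to right."""
    if not isinstance(tree, tuple):
        return (tree,)
    return leaves(tree[0]) + leaves(tree[1])

def tree_consistent_partitions(tree):
    """List every partition of the leaves that is consistent with the tree."""
    parts = [[leaves(tree)]]          # all leaves of this subtree in one block
    if isinstance(tree, tuple):
        left, right = tree
        for pl in tree_consistent_partitions(left):
            for pr in tree_consistent_partitions(right):
                parts.append(pl + pr)  # any split must respect both subtrees
    return parts

tree = (((1, 2), 3), 4)  # assumed tree shape
parts = tree_consistent_partitions(tree)
for p in parts:
    print(p)
```

For this tree only 4 of the 15 partitions are tree-consistent. A balanced tree `((1, 2), (3, 4))` would give 5.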


Bayesian Hierarchical Clustering

• Data generated from a Dirichlet Process Mixture.

• Similarity is now measured through a statistical test.

• For each candidate merge compare two hypotheses:

– $H_1$: all data in $D_k$ generated from the same component

– $H_2$: data in $D_k$ came from some other clustering consistent with the subtrees $T_i$ and $T_j$.

[Figure: subtrees $T_i$ and $T_j$, with data $D_i$ and $D_j$, merged into a tree with data $D_k$.]


Computing the Marginal Likelihood for $H_1$

• Given that our model is a DPM we can compute

– $P(D_k \mid H_1^k)$: data at tree $T_k$ was generated from the same cluster.

– $P(D_k \mid H_1^k) = \int p(D_k \mid \theta)\, p(\theta \mid \beta)\, d\theta$

– Easy to compute if the model has conjugacy.
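
As one example of a conjugate model, the integral has a closed form for binary data with a Bernoulli likelihood and a Beta prior. The sketch below assumes that model; the slides do not say which likelihood is used here, and the function name and the hyperparameters `a` and `b` (playing the role of $\beta$) are illustrative.

```python
from math import lgamma, exp

def log_marginal_bernoulli(data, a=1.0, b=1.0):
    """log of the integral of p(D | theta) p(theta | a, b) over theta,
    for i.i.d. 0/1 data with a Beta(a, b) prior on theta."""
    n = len(data)
    ones = sum(data)
    return (lgamma(a + b) - lgamma(a) - lgamma(b)
            + lgamma(a + ones) + lgamma(b + n - ones)
            - lgamma(a + b + n))

# with a uniform prior, two identical observations are more likely
# under a single component than two different ones
print(exp(log_marginal_bernoulli([1, 1])), exp(log_marginal_bernoulli([1, 0])))
```

Working in log space avoids underflow when the data set is large.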


Marginal Likelihood for the alternative hypothesis

• $P(D_k \mid H_2^k)$: $D_k$ was generated from two or more components defining partitions consistent with trees $T_i$ and $T_j$

– $P(D_k \mid H_2^k) = P(D_i \mid T_i)\, P(D_j \mid T_j)$

– $P(D_k \mid T_k) = \pi_k\, p(D_k \mid H_1^k) + (1 - \pi_k)\, P(D_k \mid H_2^k)$

• $\pi_k = p(H_1^k)$
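
The mixture of the two hypotheses can be evaluated in log space as sketched below. The function also returns the posterior probability of the merged hypothesis, which follows from Bayes' rule applied to the mixture; the name `r_k` and the function signature are assumptions for this example.

```python
from math import exp, log

def merge_scores(log_ml_h1, log_p_left, log_p_right, pi_k):
    """Return (log P(D_k | T_k), r_k) for a candidate merge.

    log_ml_h1   : log p(D_k | H_1^k)
    log_p_left  : log P(D_i | T_i)
    log_p_right : log P(D_j | T_j)
    pi_k        : prior p(H_1^k), assumed to lie strictly between 0 and 1
    """
    log_h1 = log(pi_k) + log_ml_h1
    log_h2 = log(1.0 - pi_k) + log_p_left + log_p_right
    m = max(log_h1, log_h2)                        # log-sum-exp for stability
    log_p_tree = m + log(exp(log_h1 - m) + exp(log_h2 - m))
    r_k = exp(log_h1 - log_p_tree)                 # posterior of H_1^k
    return log_p_tree, r_k

# hypothetical numbers: both hypotheses explain the data equally well
log_p, r = merge_scores(log(0.2), log(0.5), log(0.4), pi_k=0.5)
print(exp(log_p), r)
```

A value of `r_k` above 0.5 means the single-cluster explanation is favoured for this merge.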


Algorithm Details

[Figure: subtrees $T_i$ and $T_j$, with data $D_i$ and $D_j$, merged into a tree with data $D_k$.]


Computing the Prior for $H_1^k$

• $\pi_k$ is the relative mass of the partition where all points are in one cluster vs. all other partitions consistent with the subtrees, in a Dirichlet process mixture model

• Can be computed bottom up
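
A sketch of the bottom-up computation, based on the recursion described in the original BHC paper: each node carries a quantity $d_k$ (the total unnormalised mass of its tree-consistent partitions), with $d = \alpha$ and $\pi = 1$ at the leaves. The function name and argument layout are assumptions for this example, and `math.gamma` is used for readability; a practical implementation would work in log space to avoid overflow.

```python
from math import gamma

def merge_prior(alpha, n_left, d_left, n_right, d_right):
    """Return (d_k, pi_k) for a node that merges two subtrees.

    n_* are the numbers of leaves in the subtrees, d_* their d values.
    """
    n_k = n_left + n_right
    one_cluster = alpha * gamma(n_k)        # mass of the all-in-one partition
    d_k = one_cluster + d_left * d_right    # plus every finer consistent split
    return d_k, one_cluster / d_k

alpha = 1.0
# merge two leaves (each leaf has n = 1, d = alpha)
d12, pi12 = merge_prior(alpha, 1, alpha, 1, alpha)
# merge that pair with a third leaf
d123, pi123 = merge_prior(alpha, 2, d12, 1, alpha)
print(pi12, pi123)
```

Each merge only needs the counts and `d` values of its two children, which is why the prior can be filled in as the tree is built.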


Marginal Likelihood of a Dirichlet Process Mixture

• Marginal Likelihood :

• $\nu = \{\nu_1, \ldots, \nu_N\}$

• From the CRP (distribution over partitions) we have
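
For reference, the standard CRP probability of a partition with block sizes $n_1, \ldots, n_m$ of $n$ points is $\alpha^m \prod_l \Gamma(n_l) \, \Gamma(\alpha) / \Gamma(n + \alpha)$. The sketch below evaluates it and checks that it sums to one over all 15 partitions of four points. The function names and the value of `alpha` are assumptions for the example; the slide's own equation is not reproduced here.

```python
from math import gamma

def crp_probability(partition, alpha):
    """CRP probability of a partition given as a list of blocks."""
    n = sum(len(block) for block in partition)
    weight = alpha ** len(partition)
    for block in partition:
        weight *= gamma(len(block))
    return weight * gamma(alpha) / gamma(n + alpha)

def set_partitions(items):
    """Yield every partition of a list as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        # add `first` to each existing block in turn, or give it a new block
        for i in range(len(part)):
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield [[first]] + part

parts = list(set_partitions([1, 2, 3, 4]))
total = sum(crp_probability(p, alpha=1.5) for p in parts)
print(len(parts), total)
```

The DPM marginal likelihood weights the per-block data likelihoods by these probabilities and sums over all partitions, which is what becomes intractable as n grows.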


Marginal Likelihood of a Dirichlet Process Mixture

Lemma 1:


Marginal Likelihood of Tree Consistent Partitions

• Lower bounds the true DPM marginal likelihood


Combinatorial Lower Bounds

• BHC forms a lower bound for the marginal likelihood of an infinite mixture model by efficiently summing over an exponentially large subset of all partitions.

• The idea is to deterministically sum over high-probability partitions, thereby accounting for most of the mass.


Experimental Results

• Toy Example

• UCI Datasets

• Newsgroup Clustering


Results: a Toy Example


Results: a Toy Example


Predicting New Data Points


Results: Purity Scores

Purity is a measure of how well the hierarchical tree structure is correlated with the labels of the known classes.
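
One common way to turn this idea into a number is dendrogram purity: for every pair of leaves with the same label, find the smallest subtree containing both and record the fraction of its leaves that share that label, then average. The sketch below implements this pair-averaged variant on nested-tuple trees. It is an assumption that this matches the score reported on the slide, and the function names and toy labels are illustrative.

```python
def leaves(tree):
    """Leaves of a nested-tuple binary tree, left to right."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])

def dendrogram_purity(tree, labels):
    """Average, over same-label leaf pairs, of the label purity of the
    smallest subtree containing the pair."""
    total, count = 0.0, 0

    def visit(node):
        nonlocal total, count
        if not isinstance(node, tuple):
            return
        left, right = leaves(node[0]), leaves(node[1])
        members = left + right
        # this node is the smallest subtree for pairs split across its children
        for a in left:
            for b in right:
                if labels[a] == labels[b]:
                    same = sum(labels[m] == labels[a] for m in members)
                    total += same / len(members)
                    count += 1
        visit(node[0])
        visit(node[1])

    visit(tree)
    return total / count if count else float("nan")

labels = {0: "a", 1: "a", 2: "b", 3: "b"}   # hypothetical class labels
print(dendrogram_purity(((0, 1), (2, 3)), labels))
print(dendrogram_purity(((0, 2), (1, 3)), labels))
```

A tree that groups each class under its own subtree scores 1, while a tree that mixes the classes at the lowest level scores lower.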


4 Newsgroups Results

800 examples, 50 attributes: rec.sport.baseball, rec.sport.hockey, rec.autos, sci.space


Newsgroups: Average Linkage HC


Newsgroups: Bayesian HC


Comparison with Mean Field Lower Bound


Issues and Opportunities

• Greedy algorithm:

– The algorithm may not find the globally optimal tree

• No tree uncertainty:

– The algorithm finds a single tree, rather than a distribution over plausible trees

• $O(n^2)$ complexity for building tree

• Extend inference algorithm to more sophisticated models.