Nonparametric Bayes Pachinko Allocation
by Li, Blei and McCallum (UAI 2007)
Presented by Lihan He
ECE, Duke University
March 3rd, 2008
Outline
Reviews on Topic Models (LDA, CTM)
Pachinko Allocation (PAM)
Nonparametric Pachinko Allocation
Experimental Results
Conclusions
Notation and terminology
• Word: the basic unit from a vocabulary of size V (containing V distinct words). The vth word is represented by a V-dim unit vector w with w^v = 1 and w^u = 0 for u ≠ v, e.g., w = [0 0 1 0 0].
• Document: a sequence of N words, W = [w_1, w_2, ..., w_N].
• Corpus: a collection of M documents, D = {W_1, W_2, ..., W_M}.
• Topic: a multinomial distribution over words.
Assumptions:
• The words in a document are exchangeable;
• Documents are also exchangeable.
Reviews on Topic Models – Notation
α, β: fixed unknown parameters
M, N, V, k: fixed known parameters
θ, z, w: random variables (w are observable)
Generative process for each document W in a corpus D:
1. Choose θ ~ Dirichlet(α); θ and α are k-dim
2. For each of the N words w_n in the document W
(a) Choose a topic z_n ~ Multinomial(θ)
(b) Choose a word w_n ~ Multinomial(β_{z_n}); β is a k×V matrix with β_{ij} = p(w^j = 1 | z^i = 1)
Reviews on Topic Models - Latent Dirichlet Allocation (LDA)
[Graphical model: θ → z → w, with plates N (words) and M (documents)]
θ is a document-level variable; z and w are word-level variables.
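As a minimal sketch, the LDA generative process above can be simulated directly; the sizes, priors, and topic-word matrix below are illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: k topics, vocabulary of V words, N words per document.
k, V, N = 3, 10, 20
alpha = np.full(k, 0.5)                    # Dirichlet prior over topic proportions
beta = rng.dirichlet(np.ones(V), size=k)   # k x V topic-word matrix

def generate_document(N):
    """Sample one document of N words via LDA's generative process."""
    theta = rng.dirichlet(alpha)           # 1. theta ~ Dirichlet(alpha)
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)         # 2a. z_n ~ Multinomial(theta)
        w = rng.choice(V, p=beta[z])       # 2b. w_n ~ Multinomial(beta_{z_n})
        words.append(int(w))
    return words

doc = generate_document(N)
```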
Limitations:
1. Because of the independence assumption implicit in the Dirichlet distribution, LDA is unable to capture correlations between topics.
2. The number of topics k must be selected manually.

For θ ~ Dirichlet(α):
Cov[θ_i, θ_j] = −α_i α_j / (α_0^2 (α_0 + 1)) < 0 for i ≠ j, where α_0 = Σ_{i=1}^k α_i
α_0 is usually very large for the posterior, so Cov[θ_i, θ_j] ≈ 0.
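The small negative covariance between Dirichlet components can be checked numerically; the parameter vector below is an illustrative choice:

```python
import numpy as np

rng = np.random.default_rng(0)

alpha = np.array([2.0, 3.0, 5.0])   # illustrative k=3 Dirichlet parameters
a0 = alpha.sum()

# Closed-form covariance for i != j: -alpha_i * alpha_j / (a0^2 * (a0 + 1))
i, j = 0, 1
cov_exact = -alpha[i] * alpha[j] / (a0**2 * (a0 + 1))

# Monte Carlo estimate from samples agrees with the closed form
samples = rng.dirichlet(alpha, size=200_000)
cov_mc = np.cov(samples[:, i], samples[:, j])[0, 1]
```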
Reviews on Topic Models - Latent Dirichlet Allocation (LDA)
Generative process for each document W in a corpus D:
1. Choose η ~ N(μ, Σ)
2. For each of the N words w_n in the document W
(a) Choose a topic z_n ~ Multinomial(f(η))
(b) Choose a word w_n ~ Multinomial(β_{z_n}); β is a k×V matrix

p(z | η) = exp{ η^T z − log Σ_{i=1}^k exp(η_i) }
Reviews on Topic Models - Correlated Topic Models (CTM)
f_i(η) = exp(η_i) / Σ_j exp(η_j)

Key point: the topic proportions are drawn from a logistic normal distribution rather than a Dirichlet distribution:
η ~ N_k(μ, Σ)

[Graphical model: (μ, Σ) → η → z → w, with plates N (words) and M (documents)]
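A minimal sketch of the logistic-normal draw, assuming illustrative values for k, μ, and Σ; the positive covariance between the first two topics is exactly what the Dirichlet cannot express:

```python
import numpy as np

rng = np.random.default_rng(0)
k = 4  # illustrative number of topics

# Correlated Gaussian: positive covariance between topics 0 and 1
mu = np.zeros(k)
Sigma = np.eye(k)
Sigma[0, 1] = Sigma[1, 0] = 0.8

eta = rng.multivariate_normal(mu, Sigma)   # eta ~ N_k(mu, Sigma)

# f maps eta onto the simplex: f_i(eta) = exp(eta_i) / sum_j exp(eta_j)
theta = np.exp(eta) / np.exp(eta).sum()
```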
Limitations:
1. Limited to pairwise correlations between topics; the number of parameters in the covariance matrix grows as the square of the number of topics.
2. The number of topics k must be selected manually.

Logistic normal density:
f(θ | μ, Σ) ∝ exp{ −(1/2) (log θ − μ)^T Σ^{−1} (log θ − μ) }
Reviews on Topic Models - Correlated Topic Models (CTM)
Pachinko Allocation Model (PAM)
In PAM, the concept of a topic is extended: topics are distributions not only over words (as in LDA and CTM) but also over other topics.
The structure of PAM is extremely flexible.
Pachinko: a Japanese game, in which metal balls bounce down around a complex collection of pins until they land in various bins at the bottom.
Four-level PAM

[Graphical model: root → super-topic z_r → sub-topic z_t → word w, with plates N and M]

α_r, α_t, β: fixed unknown parameters
M, N, V, k, S: fixed known parameters
θ_r, θ_t, z_r, z_t, w: random variables
Generative process for each document W in a corpus D:
1. Choose θ_r ~ Dirichlet(α_r); θ_r and α_r are S-dim (θ_r: mixing weights for super-topics)
2. For each of the S super-topics, choose θ_t ~ Dirichlet(α_t); θ_t and α_t are k-dim (θ_t: mixing weights for sub-topics)
3. For each of the N words w_n in the document W
(a) Choose a super-topic z_r ~ Multinomial(θ_r)
(b) Choose a sub-topic z_t ~ Multinomial(θ_t^{(z_r)})
(c) Choose a word w_n ~ Multinomial(β_{z_t}); β is a k×V matrix
Pachinko Allocation Model (PAM)
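The four-level PAM generative process can be sketched the same way; the sizes and priors below are illustrative, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: S super-topics, k sub-topics, vocabulary of V words.
S, k, V, N = 2, 4, 10, 30
alpha_r = np.full(S, 1.0)                  # prior over super-topic weights
alpha_t = np.full(k, 0.5)                  # prior over sub-topic weights
beta = rng.dirichlet(np.ones(V), size=k)   # k x V sub-topic/word matrix

def generate_document(N):
    """Sample one document of N words via the four-level PAM process."""
    theta_r = rng.dirichlet(alpha_r)           # 1. super-topic mixing weights
    theta_t = rng.dirichlet(alpha_t, size=S)   # 2. one sub-topic mixture per super-topic
    words = []
    for _ in range(N):
        z_r = rng.choice(S, p=theta_r)         # (a) super-topic
        z_t = rng.choice(k, p=theta_t[z_r])    # (b) sub-topic given super-topic
        words.append(int(rng.choice(V, p=beta[z_t])))  # (c) word
    return words

doc = generate_document(N)
```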
[Topic hierarchy: root → super-topic → sub-topic → word]
Advantage:
Captures correlations between topics via the super-topic layer.
Limitation:
The number of super-topics S and the number of sub-topics k must be selected manually.
Pachinko Allocation Model (PAM)
Nonparametric Pachinko Allocation
Assumes an HDP-based prior for PAM
Based on a 5-level hierarchical Chinese restaurant process
Automatically decides the super-topic number S and the sub-topic number k
Chinese restaurant process:
P(a new customer sits at an occupied table t) = C(t) / (Σ_{t'} C(t') + γ)
P(a new customer sits at an unoccupied table) = γ / (Σ_{t'} C(t') + γ)
where C(t) is the number of customers already at table t; denoted as CRP({C(t)}_t, γ).
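A minimal simulation of these seating probabilities; the number of customers and the concentration γ are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def crp(num_customers, gamma):
    """Seat customers one by one under CRP({C(t)}_t, gamma)."""
    counts = []        # C(t): number of customers at table t
    assignments = []
    for _ in range(num_customers):
        total = sum(counts)
        # occupied table t with prob C(t)/(total+gamma),
        # a new table with prob gamma/(total+gamma)
        probs = np.array(counts + [gamma]) / (total + gamma)
        t = int(rng.choice(len(probs), p=probs))
        if t == len(counts):
            counts.append(1)   # open a new table
        else:
            counts[t] += 1
        assignments.append(t)
    return assignments, counts

assignments, counts = crp(100, gamma=2.0)
```

A larger γ makes new tables more likely, so the number of tables (here, topics) grows with the data instead of being fixed in advance.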
Nonparametric Pachinko Allocation
root ↔ restaurant
super-topic ↔ category
sub-topic ↔ dish
word ↔ customer
Notation:
There are infinitely many super-topics and sub-topics.
Both super-topic (category) and sub-topic (dish) are globally shared among all documents.
Sampling for super-topics involves two-level CRP.
Sampling for sub-topics involves three-level CRP.
Nonparametric Pachinko Allocation
Generative process:
A customer x arrives at restaurant rj
1. He chooses the kth entryway e_jk in the restaurant from CRP({C(j, k)}_k, γ_0).
2. If e_jk is a new entryway, a category c_l is associated with it from CRP({Σ_{j'} C(j', l)}_l, γ_0).
3. After choosing the category, the customer decides which table he will sit at. He chooses table t_jln from CRP({C(j, l, n)}_n, γ_1).
4. If the customer sits at an existing table, he shares the menu and dish with the other customers at the same table. Otherwise, he chooses a menu m_lp for the new table from CRP({Σ_{j'} C(j', l, p)}_p, γ_1).
5. If the customer gets an existing menu, he eats the dish on the menu. Otherwise, he samples a dish d_m for the new menu from CRP({Σ_{l'} C(l', m)}_m, γ_1).
Nonparametric Pachinko Allocation
Graphical Model
Model parameters: scalar concentration parameters (γ_0, γ_1, ...) and the base measure H.
Two-level clustering of indicator variables: the first level clusters with a 2-layer CRP and the second level with a 3-layer CRP. Atoms are all drawn from the base H.
[Graphical model with plates N (words) and M (documents)]
Experimental Results
Datasets:
20 newsgroup comp5 dataset: 5 different newsgroups, 4,836 documents, including 468,252 words and 35,567 unique words.
Rexa dataset: digital library of computer science. Randomly choose 5,000 documents, including 350,760 words and 25,597 unique words.
NIPS dataset: 1,647 abstracts of NIPS papers from 1987-1999, including 114,142 words and 11,708 unique words.
Likelihood Comparison:
Experimental Results
Topic Examples
20 newsgroup comp5 dataset
Experimental Results
Topic Examples
NIPS dataset
Nonparametric Bayes PAM discovers the sparse structure.
Conclusions
A nonparametric Bayesian prior for pachinko allocation is presented based on a variant of the hierarchical Dirichlet process;
Nonparametric PAM automatically discovers topic correlations and determines the numbers of topics at different levels;
The topic structure discovered by nonparametric PAM is usually sparse.
Appendix: Hierarchical Latent Dirichlet Allocation (hLDA)
[Graphical model: z → w, with plates N (words) and M (documents)]
Key difference from LDA:
Topics are organized in an L-level tree structure, instead of as a k×V matrix β.
L is prespecified manually.
Generative process for each document W in a corpus D:
1. Choose a path from the root of the topic tree to a leaf. The path includes L topics.
2. Choose θ ~ Dirichlet(α); θ and α are L-dim
3. For each of the N words w_n in the document W
(a) Choose a topic z_n ~ Multinomial(θ)
(b) Choose a word w_n ~ Multinomial(β^{(z_n)}); β^{(z_n)} is a V-dim vector, the multinomial parameter for the z_n-th topic along the path from root to leaf chosen in step 1.
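Assuming a fixed path of L topics (in full hLDA the path itself is drawn from a nested CRP over the topic tree; it is fixed here only for illustration), the per-document sampling can be sketched as:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes: an L-level path, vocabulary of V words, N words.
L, V, N = 3, 10, 20
alpha = np.full(L, 1.0)

# A fixed path: one word distribution per level of the tree (L x V).
path = rng.dirichlet(np.ones(V), size=L)

theta = rng.dirichlet(alpha)                     # 2. theta ~ Dirichlet(alpha), L-dim
words = []
for _ in range(N):
    z = rng.choice(L, p=theta)                   # 3a. choose a level along the path
    words.append(int(rng.choice(V, p=path[z])))  # 3b. word from beta^{(z_n)}
```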
References:
W. Li, D. M. Blei, and A. McCallum. Nonparametric Bayes pachinko allocation. In Proceedings of Conference on Uncertainty in Artificial Intelligence (UAI), 2007.
W. Li and A. McCallum. Pachinko allocation: DAG-structured mixture models of topic correlations. In Proceedings of International Conference on Machine Learning (ICML), 2006.
D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. Journal of Machine Learning Research, 3: 993-1022, 2003.
D. M. Blei and J. D. Lafferty. Correlated topic model. In Advances in Neural Information Processing Systems (NIPS), 2006.
D. M. Blei, T. L. Griffiths, M. I. Jordan, and J. B. Tenenbaum. Hierarchical topic models and the nested Chinese restaurant process. In Advances in Neural Information Processing Systems (NIPS), 2004.
J. Aitchison and S. M. Shen. Logistic-normal distributions: some properties and uses. Biometrika, vol. 67, no. 2, pp. 261-272, 1980.