62
Modeling Science David M. Blei Department of Computer Science Princeton University April 17, 2008 Joint work with John Lafferty (CMU) D. Blei Modeling Science 1 / 53

Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

  • Upload
    others

  • View
    1

  • Download
    0

Embed Size (px)

Citation preview

Page 1: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Modeling Science

David M. Blei

Department of Computer SciencePrinceton University

April 17, 2008

Joint work with John Lafferty (CMU)

D. Blei Modeling Science 1 / 53

Page 2: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Modeling ScienceScience, August 13, 1886

water acid disease

milk water blood

food solution cholera

dry experiments bacteria

fed liquid found

cows chemical bacillus

houses action experiments

butter copper organisms

fat crystals bacilli

found carbon cases

made alcohol diseases

contained made germs

wells obtained animal

produced substances koch

poisonous nitrogen made

5

Science, June 24, 1994

evolution rna disease

evolutionary mrna host

species site bacteria

organisms splicing diseases

biology rnas new

phylogenetic nuclear bacterial

life sequence resistance

origin introns control

diversity messenger strains

groups cleavage infectious

molecular two malaria

animals splice parasites

two sequences parasite

new polymerase tuberculosis

living intron health

6• On-line archives of document collections require better

organization. Manual organization is not practical.• Our goal: To discover the hidden thematic structure with

hierarchical probabilistic models called topic models.• Use this structure for browsing, search, and similarity.

D. Blei Modeling Science 2 / 53

Page 3: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Modeling ScienceScience, August 13, 1886

water acid disease

milk water blood

food solution cholera

dry experiments bacteria

fed liquid found

cows chemical bacillus

houses action experiments

butter copper organisms

fat crystals bacilli

found carbon cases

made alcohol diseases

contained made germs

wells obtained animal

produced substances koch

poisonous nitrogen made

5

Science, June 24, 1994

evolution rna disease

evolutionary mrna host

species site bacteria

organisms splicing diseases

biology rnas new

phylogenetic nuclear bacterial

life sequence resistance

origin introns control

diversity messenger strains

groups cleavage infectious

molecular two malaria

animals splice parasites

two sequences parasite

new polymerase tuberculosis

living intron health

6• Our data are the pages Science from 1880-2002 (from JSTOR)• No reliable punctuation, meta-data, or references.• Note: this is just a subset of JSTOR’s archive.

D. Blei Modeling Science 2 / 53

Page 4: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Discover topics from a corpus

“Genetics” “Evolution” “Disease” “Computers”

human evolution disease computergenome evolutionary host models

dna species bacteria informationgenetic organisms diseases datagenes life resistance computers

sequence origin bacterial systemgene biology new network

molecular groups strains systemssequencing phylogenetic control model

map living infectious parallelinformation diversity malaria methods

genetics group parasite networksmapping new parasites softwareproject two united new

sequences common tuberculosis simulations

D. Blei Modeling Science 3 / 53

Page 5: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Model the evolution of topics over time

1880 1900 1920 1940 1960 1980 2000

o o o o o o o ooooooooo o o o o o o o o o

o o o o oo o o

o oo o

o oooo

o

oooo

oo o

o o oo o

oooo o

o

o

ooo

o

o

o

oo o o

o o o

1880 1900 1920 1940 1960 1980 2000

o o ooo o

oooo o o o

oo o o o o o

o o o o o

o o oooo

ooooo o

o

o

oooooo o

ooo o

o o o o o o o o o oo o

o ooo

o

o

oo o o o o o

RELATIVITY

LASERFORCE

NERVE

OXYGEN

NEURON

"Theoretical Physics" "Neuroscience"

D. Blei Modeling Science 4 / 53

Page 6: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Model connections between topics

wild typemutant

mutationsmutantsmutation

plantsplantgenegenes

arabidopsis

p53cell cycleactivitycyclin

regulation

amino acidscdna

sequenceisolatedprotein

genedisease

mutationsfamiliesmutation

rnadna

rna polymerasecleavage

site

cellscell

expressioncell lines

bone marrow

united stateswomen

universitiesstudents

education

sciencescientists

saysresearchpeople

researchfundingsupport

nihprogram

surfacetip

imagesampledevice

laseropticallight

electronsquantum

materialsorganicpolymerpolymersmolecules

volcanicdepositsmagmaeruption

volcanism

mantlecrust

upper mantlemeteorites

ratios

earthquakeearthquakes

faultimages

dataancientfoundimpact

million years agoafrica

climateocean

icechanges

climate change

cellsproteins

researchersproteinfound

patientsdisease

treatmentdrugsclinical

geneticpopulationpopulationsdifferencesvariation

fossil recordbirds

fossilsdinosaurs

fossil

sequencesequences

genomedna

sequencing

bacteriabacterial

hostresistanceparasite

developmentembryos

drosophilagenes

expression

speciesforestforests

populationsecosystems

synapsesltp

glutamatesynapticneurons

neuronsstimulusmotorvisual

cortical

ozoneatmospheric

measurementsstratosphere

concentrations

sunsolar wind

earthplanetsplanet

co2carbon

carbon dioxidemethane

water

receptorreceptors

ligandligands

apoptosis

proteinsproteinbindingdomaindomains

activatedtyrosine phosphorylation

activationphosphorylation

kinase

magneticmagnetic field

spinsuperconductivitysuperconducting

physicistsparticlesphysicsparticle

experimentsurfaceliquid

surfacesfluid

model reactionreactionsmoleculemolecules

transition state

enzymeenzymes

ironactive sitereduction

pressurehigh pressure

pressurescore

inner core

brainmemorysubjects

lefttask

computerproblem

informationcomputersproblems

starsastronomers

universegalaxiesgalaxy

virushiv

aidsinfectionviruses

miceantigent cells

antigensimmune response

D. Blei Modeling Science 5 / 53

Page 7: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Outline

1 Introduction

2 Latent Dirichlet allocation

3 Dynamic topic models

4 Correlated topic models

D. Blei Modeling Science 6 / 53

Page 8: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Outline

1 Introduction

2 Latent Dirichlet allocation

3 Dynamic topic models

4 Correlated topic models

D. Blei Modeling Science 7 / 53

Page 9: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Probabilistic modeling

1 Treat data as observations that arise from a generativeprobabilistic process that includes hidden variables

• For documents, the hidden variables reflect the thematicstructure of the collection.

2 Infer the hidden structure using posterior inference• What are the topics that describe this collection?

3 Situate new data into the estimated model.• How does this query or new document fit into the estimated

topic structure?

D. Blei Modeling Science 8 / 53

Page 10: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Intuition behind LDA

Simple intuition: Documents exhibit multiple topics.

D. Blei Modeling Science 9 / 53

Page 11: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Generative process

• Cast these intuitions into a generative probabilistic process• Each document is a random mixture of corpus-wide topics• Each word is drawn from one of those topics

D. Blei Modeling Science 10 / 53

Page 12: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Generative process

• In reality, we only observe the documents• Our goal is to infer the underlying topic structure

• What are the topics?• How are the documents divided according to those topics?

D. Blei Modeling Science 10 / 53

Page 13: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Graphical models (Aside)

· · ·

Y

X1 X2 XN

Xn

Y

N

• Nodes are random variables• Edges denote possible dependence• Observed variables are shaded• Plates denote replicated structure

D. Blei Modeling Science 11 / 53

Page 14: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Graphical models (Aside)

· · ·

Y

X1 X2 XN

Xn

Y

N

• Structure of the graph defines the pattern of conditionaldependence between the ensemble of random variables

• E.g., this graph corresponds to

p(y , x1, . . . , xN) = p(y)

N∏n=1

p(xn | y)

D. Blei Modeling Science 11 / 53

Page 15: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Latent Dirichlet allocation

θd Zd,n Wd,nN

D Kβk

α η

Dirichletparameter

Per-documenttopic proportions

Per-wordtopic assignment

Observedword Topics

Topichyperparameter

Each piece of the structure is a random variable.

D. Blei Modeling Science 12 / 53

Page 16: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Latent Dirichlet allocation

θd Zd,n Wd,nN

D Kβk

α η

1 Draw each topic βi ∼ Dir(η), for i ∈ {1, . . . , K }.2 For each document:

1 Draw topic proportions θd ∼ Dir(α).2 For each word:

1 Draw Zd,n ∼ Mult(θd ).2 Draw Wd,n ∼ Mult(βzd,n).

D. Blei Modeling Science 13 / 53

Page 17: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Latent Dirichlet allocation

θd Zd,n Wd,nN

D Kβk

α η

• From a collection of documents, infer• Per-word topic assignment zd,n• Per-document topic proportions θd• Per-corpus topic distributions βk

• Use posterior expectations to perform the task at hand, e.g.,information retrieval, document similarity, etc.

D. Blei Modeling Science 13 / 53

Page 18: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Latent Dirichlet allocation

θd Zd,n Wd,nN

D Kβk

α η

• Computing the posterior is intractable:

p(θ | α)∏N

n=1 p(zn | θ)p(wn | zn, β1:K )∫θ

p(θ | α)∏N

n=1∑K

z=1 p(zn | θ)p(wn | zn, β1:K )

• Several approximation techniques have been developed.

D. Blei Modeling Science 13 / 53

Page 19: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Latent Dirichlet allocation

θd Zd,n Wd,nN

D Kβk

α η

• Mean field variational methods (Blei et al., 2001, 2003)• Expectation propagation (Minka and Lafferty, 2002)• Collapsed Gibbs sampling (Griffiths and Steyvers, 2002)• Collapsed variational inference (Teh et al., 2006)

D. Blei Modeling Science 13 / 53

Page 20: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Example inference

• Data: The OCR’ed collection of Science from 1990–2000• 17K documents• 11M words• 20K unique terms (stop words and rare words removed)

• Model: 100-topic LDA model using variational inference.

D. Blei Modeling Science 14 / 53

Page 21: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Example inference

1 8 16 26 36 46 56 66 76 86 96

Topics

Probability

0.0

0.1

0.2

0.3

0.4

D. Blei Modeling Science 15 / 53

Page 22: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Example topics

“Genetics” “Evolution” “Disease” “Computers”

human evolution disease computergenome evolutionary host models

dna species bacteria informationgenetic organisms diseases datagenes life resistance computers

sequence origin bacterial systemgene biology new network

molecular groups strains systemssequencing phylogenetic control model

map living infectious parallelinformation diversity malaria methods

genetics group parasite networksmapping new parasites softwareproject two united new

sequences common tuberculosis simulations

D. Blei Modeling Science 16 / 53

Page 23: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

LDA summary

• LDA is a powerful model for• Visualizing the hidden thematic structure in large corpora• Generalizing new data to fit into that structure

• LDA is a mixed membership model (Erosheva, 2004) that buildson the work of Deerwester et al. (1990) and Hofmann (1999).

• For document collections and other grouped data, this might bemore appropriate than a simple finite mixture

D. Blei Modeling Science 17 / 53

Page 24: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

LDA summary

• Modular : It can be embedded in more complicated models.• E.g., syntax and semantics; authorship; word sense

• General : The data generating distribution can be changed.• E.g., images; social networks; population genetics data

• Variational inference is fast; lets us to analyze large data sets.

• See Blei et al., 2003 for details and a quantitative comparison.• Code to play with LDA is freely available on my web-site,

http://www.cs.princeton.edu/∼blei.

D. Blei Modeling Science 18 / 53

Page 25: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

LDA summary

• But, LDA makes certain assumptions about the data.• When are they appropriate?

D. Blei Modeling Science 19 / 53

Page 26: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Outline

1 Introduction

2 Latent Dirichlet allocation

3 Dynamic topic models

4 Correlated topic models

D. Blei Modeling Science 20 / 53

Page 27: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

LDA and exchangeability

θd Zd,n Wd,nN

D Kβk

α η

• LDA assumes that documents are exchangeable.• I.e., their joint probability is invariant to permutation.• This is too restrictive.

D. Blei Modeling Science 21 / 53

Page 28: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Documents are not exchangeable

"Infrared Reflectance in Leaf-Sitting Neotropical Frogs" (1977)"Instantaneous Photography" (1890)

• Documents about the same topic are not exchangeable.• Topics evolve over time.

D. Blei Modeling Science 22 / 53

Page 29: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Dynamic topic model

• Divide corpus into sequential slices (e.g., by year).• Assume each slice’s documents exchangeable.

• Drawn from an LDA model.• Allow topic distributions evolve from slice to slice.

D. Blei Modeling Science 23 / 53

Page 30: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Dynamic topic models

D

θd

Zd,n

Wd,n

N

K

α

D

θd

Zd,n

Wd,n

N

α

D

θd

Zd,n

Wd,n

N

α

βk,1 βk,2 βk,T

. . .

D. Blei Modeling Science 24 / 53

Page 31: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Modeling evolving topics

βk,1 βk,2 βk,T

. . .

• Use a logistic normal distribution to model evolving topics(Aitchison, 1980)

• A state-space model on the natural parameter of the topicmultinomial (West and Harrison, 1997)

βt,k | βt−1,k ∼ N (βt−1,k , Iσ 2)

p(w | βt,k ) = exp{βt,k − log(1 +

∑V−1v=1 exp{βt,k ,v })

}

D. Blei Modeling Science 25 / 53

Page 32: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Posterior inference

• Our goal is to compute the posterior distribution,

p(β1:T ,1:K , θ1:T ,1:D, z1:T ,1:D | w1:T ,1:D).

• Exact inference is impossible• Per-document mixed-membership model• Non-conjugacy between p(w | βt,k ) and p(βt,k )

• MCMC is not practical for the amount of data.• Solution: Variational inference

D. Blei Modeling Science 26 / 53

Page 33: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Science data

TECHVIEW: DNA S E Q U E N C I NG

Sequencing the Genome, Fast

James C. Mullikin and Amanda A. McMurray

Genome sequencing projects reveal the genetic makeup of an organism by reading off the sequence of theDNA bases, which encodes all of the infor-mation necessary for the life of the organ-ism. The base sequence contains four nu-cleotides-adenine, thymidine, guanosine,and cytosine-which are linked togetherinto long double-helical chains. Over thelast two decades, automated DNA se-quencers have made the process of obtain-ing the base-by-base sequence of DNA...

• Analyze JSTOR’s entire collection from Science (1880-2002)• Restrict to 30K terms that occur more than ten times• The data are 76M words in 130K documents

D. Blei Modeling Science 27 / 53

Page 34: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Analyzing a document

Original article Topic proportions

D. Blei Modeling Science 28 / 53

Page 35: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Analyzing a document

sequencegenomegenessequenceshumangenednasequencingchromosomeregionsanalysisdatagenomicnumber

devicesdevicematerialscurrenthighgatelightsiliconmaterialtechnologyelectricalfiberpowerbased

datainformationnetworkwebcomputerlanguagenetworkstimesoftwaresystemwordsalgorithmnumberinternet

Original article Most likely words from top topics

D. Blei Modeling Science 28 / 53

Page 36: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Analyzing a topic

1880electric

machinepowerenginesteam

twomachines

ironbattery

wire

1890electricpower

companysteam

electricalmachine

twosystemmotorengine

1900apparatus

steampowerengine

engineeringwater

constructionengineer

roomfeet

1910air

waterengineeringapparatus

roomlaboratoryengineer

madegastube

1920apparatus

tubeair

pressurewaterglassgas

madelaboratorymercury

1930tube

apparatusglass

airmercury

laboratorypressure

madegas

small

1940air

tubeapparatus

glasslaboratory

rubberpressure

smallmercury

gas

1950tube

apparatusglass

airchamber

instrumentsmall

laboratorypressurerubber

1960tube

systemtemperature

airheat

chamberpowerhigh

instrumentcontrol

1970air

heatpowersystem

temperaturechamber

highflowtube

design

1980high

powerdesignheat

systemsystemsdevices

instrumentscontrollarge

1990materials

highpowercurrent

applicationstechnology

devicesdesigndeviceheat

2000devicesdevice

materialscurrent

gatehighlight

siliconmaterial

technology

D. Blei Modeling Science 29 / 53

Page 37: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Visualizing trends within a topic

1880 1900 1920 1940 1960 1980 2000

o o o o o o o ooooooooo o o o o o o o o o

o o o o oo o o

o oo o

o oooo

o

oooo

oo o

o o oo o

oooo o

o

o

ooo

o

o

o

oo o o

o o o

1880 1900 1920 1940 1960 1980 2000

o o ooo o

oooo o o o

oo o o o o o

o o o o o

o o oooo

ooooo o

o

o

oooooo o

ooo o

o o o o o o o o o oo o

o ooo

o

o

oo o o o o o

RELATIVITY

LASERFORCE

NERVE

OXYGEN

NEURON

"Theoretical Physics" "Neuroscience"

D. Blei Modeling Science 30 / 53

Page 38: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Time-corrected document similarity

• Consider the expected Hellinger distance between the topicproportions of two documents,

dij = E

[K∑

k=1

(√

θi,k −√

θj,k )2| wi , wj

]

• Uses the latent structure to define similarity• Time has been factored out because the topics associated to the

components are different from year to year.• Similarity based only on topic proportions

D. Blei Modeling Science 31 / 53

Page 39: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Time-corrected document similarity

The Brain of the Orang (1880)

D. Blei Modeling Science 32 / 53

Page 40: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Time-corrected document similarity

Representation of the Visual Field on the Medial Wall ofOccipital-Parietal Cortex in the Owl Monkey (1976)

D. Blei Modeling Science 33 / 53

Page 41: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Browser of Science

D. Blei Modeling Science 34 / 53

Page 42: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Quantitative comparison

• Compute the probability of each year’s documents conditional onall the previous year’s documents,

p(wt | w1, . . . , wt−1)

• Compare exchangeable and dynamic topic models

D. Blei Modeling Science 35 / 53

Page 43: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Quantitative comparison

1920 1940 1960 1980 2000

1015

2025

Year

Per

−w

ord

nega

tive

log

likel

ihoo

d

● ●●

● ●●

● ●

● ● ●

● ●●

●●

● ●

●●

● ●

● ●● ●

LDADTM

D. Blei Modeling Science 36 / 53

Page 44: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Outline

1 Introduction

2 Latent Dirichlet allocation

3 Dynamic topic models

4 Correlated topic models

D. Blei Modeling Science 37 / 53

Page 45: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

The hidden assumptions of the Dirichlet distribution

• The Dirichlet is an exponential family distribution on the simplex,positive vectors that sum to one.

• However, the near independence of components makes it a poorchoice for modeling topic proportions.

• An article about fossil fuels is more likely to also be aboutgeology than about genetics.

D. Blei Modeling Science 38 / 53

Page 46: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

The logistic normal distribution

• The logistic normal is a distribution on the simplex that canmodel dependence between components.

• The natural parameters of the multinomial are drawn from amultivariate Gaussian distribution.

X ∼ NK−1(µ, 6)

θi = exp{xi − log(1 +∑K−1

j=1 exp{xj})}

D. Blei Modeling Science 39 / 53

Page 47: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Correlated topic model (CTM)

Zd,n Wd,nN

D

K

Σ

µ

ηd

βk

• Draw topic proportions from a logistic normal, where topicoccurrences can exhibit correlation.

• Use for:• Providing a “map” of topics and how they are related• Better prediction via correlated topics

D. Blei Modeling Science 40 / 53

Page 48: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

wild typemutant

mutationsmutantsmutation

plantsplantgenegenes

arabidopsis

p53cell cycleactivitycyclin

regulation

amino acidscdna

sequenceisolatedprotein

genedisease

mutationsfamiliesmutation

rnadna

rna polymerasecleavage

site

cellscell

expressioncell lines

bone marrow

united stateswomen

universitiesstudents

education

sciencescientists

saysresearchpeople

researchfundingsupport

nihprogram

surfacetip

imagesampledevice

laseropticallight

electronsquantum

materialsorganicpolymerpolymersmolecules

volcanicdepositsmagmaeruption

volcanism

mantlecrust

upper mantlemeteorites

ratios

earthquakeearthquakes

faultimages

dataancientfoundimpact

million years agoafrica

climateocean

icechanges

climate change

cellsproteins

researchersproteinfound

patientsdisease

treatmentdrugsclinical

geneticpopulationpopulationsdifferencesvariation

fossil recordbirds

fossilsdinosaurs

fossil

sequencesequences

genomedna

sequencing

bacteriabacterial

hostresistanceparasite

developmentembryos

drosophilagenes

expression

speciesforestforests

populationsecosystems

synapsesltp

glutamatesynapticneurons

neuronsstimulusmotorvisual

cortical

ozoneatmospheric

measurementsstratosphere

concentrations

sunsolar wind

earthplanetsplanet

co2carbon

carbon dioxidemethane

water

receptorreceptors

ligandligands

apoptosis

proteinsproteinbindingdomaindomains

activatedtyrosine phosphorylation

activationphosphorylation

kinase

magneticmagnetic field

spinsuperconductivitysuperconducting

physicistsparticlesphysicsparticle

experimentsurfaceliquid

surfacesfluid

model reactionreactionsmoleculemolecules

transition state

enzymeenzymes

ironactive sitereduction

pressurehigh pressure

pressurescore

inner core

brainmemorysubjects

lefttask

computerproblem

informationcomputersproblems

starsastronomers

universegalaxiesgalaxy

virushiv

aidsinfectionviruses

miceantigent cells

antigensimmune response

D. Blei Modeling Science 41 / 53

Page 49: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Summary

• Topic models provide useful descriptive statistics for analyzingand understanding the latent structure of large text collections.

• Probabilistic graphical models are a useful way to expressassumptions about the hidden structure of complicated data.

• Variational methods allow us to perform posterior inference toautomatically infer that structure from large data sets.

• Current research• Choosing the number of topics• Continuous time dynamic topic models• Topic models for prediction• Inferring the impact of a document

D. Blei Modeling Science 42 / 53

Page 50: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

“We should seek out unfamiliar summaries of observational material,and establish their useful properties... And still more novelty cancome from finding, and evading, still deeper lying constraints.”(John Tukey, The Future of Data Analysis, 1962)

D. Blei Modeling Science 43 / 53

Page 51: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Supervised topic models (with Jon McAuliffe)

• Most topic models are unsupervised. They are fit by maximizingthe likelihood of a collection of documents.

• Consider documents paired with response variables.For example:

• Movie reviews paired with a number of stars• Web pages paired with a number of “diggs”

• We develop supervised topic models, models of documents andresponses that are fit to find topics predictive of the response.

D. Blei Modeling Science 44 / 53

Page 52: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Supervised LDA

θd Zd,n Wd,nN

D

Kβk

α

Yd η, σ2

1 Draw topic proportions θ | α ∼ Dir(α).2 For each word

1 Draw topic assignment zn | θ ∼ Mult(θ).2 Draw word wn | zn, β1:K ∼ Mult(βzn).

3 Draw response variable y | z1:N, η, σ 2∼ N

(η>z̄, σ 2

), where

z̄ = (1/N)∑N

n=1 zn.

D. Blei Modeling Science 45 / 53

Page 53: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Comments

• SLDA is used as follows.• Fit coefficients and topics from a collection of

document-response pairs.• Use the fitted model to predict the responses of previously

unseen documents,

E[Y | w1:N, α, β1:K , η, σ 2] = η>E[Z̄ | w1:N, α, β1:K ].

• The process enforces that the document is generated first,followed by the response. The response is generated from theparticular topics that were realized in generating the document.

D. Blei Modeling Science 46 / 53

Page 54: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Example: Movie reviews

bothmotionsimpleperfectfascinatingpowercomplex

howevercinematographyscreenplayperformancespictureseffectivepicture

histheircharactermanywhileperformancebetween

−30 −20 −10 0 10 20

● ●● ● ●● ● ● ● ●

morehasthanfilmsdirectorwillcharacters

onefromtherewhichwhomuchwhat

awfulfeaturingroutinedryofferedcharlieparis

notaboutmovieallwouldtheyits

havelikeyouwasjustsomeout

badguyswatchableitsnotonemovie

leastproblemunfortunatelysupposedworseflatdull

• We fit a 10-topic sLDA model to movie review data (Pang andLee, 2005).

• The documents are the words of the reviews.• The responses are the number of stars associated with

each review (modeled as continuous).• Each component of coefficient vector η is associated with a topic.

D. Blei Modeling Science 47 / 53

Page 55: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Simulations

● ●

●●

●●

●●

2 4 10 20 30

0.00

0.02

0.04

0.06

0.08

0.10

0.12

Number of topics

Pred

ictive

R2

● ● ●●

●● ●

2 4 10 20 30

−8.6

−8.5

−8.4

−8.3

−8.2

−8.1

−8.0

Number of topics

Per−

word

hel

d ou

t log

likel

ihoo

d

●● ●

● ●

5 10 15 20 25 30 35 40 45 50

−6.4

2−6

.41

−6.4

0−6

.39

−6.3

8−6

.37

Number of topics

Per−

word

hel

d ou

t log

likel

ihoo

d

● ●● ● ● ●

●●

●● ●

●● ● ●

5 10 15 20 25 30 35 40 45 50

0.0

0.1

0.2

0.3

0.4

0.5

Number of topics

Pred

ictive

R2

sLDALDA

Movie corpus

Digg corpus

D. Blei Modeling Science 48 / 53

Page 56: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Diversion: Variational inference

• Let x1:N be observations and z1:M be latent variables• Our goal is to compute the posterior distribution

p(z1:M | x1:N) =p(z1:M , x1:N)∫

p(z1:M , x1:N)dz1:M

• For many interesting distributions, the marginal likelihood of theobservations is difficult to efficiently compute

D. Blei Modeling Science 49 / 53

Page 57: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Variational inference

• Use Jensen’s inequality to bound the log prob of theobservations:

log p(x1:N) ≥ Eqν [log p(z1:M , x1:N)] − Eqν [log qν(z1:M)].

• We have introduced a distribution of the latent variables with freevariational parameters ν.

• We optimize those parameters to tighten this bound.• This is the same as finding the member of the family qν that is

closest in KL divergence to p(z1:M | x1:N).

D. Blei Modeling Science 50 / 53

Page 58: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Mean-field variational inference

• Complexity of optimization is determined by factorization of qν

• In mean field variational inference qν is fully factored

qν(z1:M) =

M∏m=1

qνm(zm).

• The latent variables are independent.• Each is governed by its own variational parameter νm.

• In the true posterior they can exhibit dependence(often, this is what makes exact inference difficult).

D. Blei Modeling Science 51 / 53

Page 59: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

MFVI and conditional exponential families

• Suppose the distribution of each latent variable conditional onthe observations and other latent variables is in the exponentialfamily:

p(zm | z−m, x) = hm(zm) exp{gm(z−m, x)T zm − am(gi(z−m, x))}

• Assume qν is fully factorized and each factor is in the sameexponential family:

qνm(zm) = hm(zm) exp{νTmzm − am(νm)}

D. Blei Modeling Science 52 / 53

Page 60: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

MFVI and conditional exponential families

• Variational inference is the following coordinate ascent algorithm

νm = Eqν [gm(Z−m, x)]

• Notice the relationship to Gibbs sampling

D. Blei Modeling Science 52 / 53

Page 61: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Variational family for the DTM

βk,1 βk,2 βk,T

. . .

β̂k,1 β̂k,2 β̂k,T

• Distribution of θ and z is fully-factorized (Blei et al., 2003)• Distribution of {β1,k , . . . , βT ,k } is a variational Kalman filter• Gaussian state-space model with free observations β̂k ,t .• Fit observations such that the corresponding posterior over the

chain is close to the true posterior.

D. Blei Modeling Science 53 / 53

Page 62: Modeling Science - Columbia Universityblei/talks/Blei_Science_2008.pdfsequencing phylogenetic control model map living infectious parallel information diversity malaria methods genetics

Variational family for the DTM

βk,1 βk,2 βk,T

. . .

β̂k,1 β̂k,2 β̂k,T

• Given a document collection, use coordinate ascent on all thevariational parameters until the KL converges.

• Yields a distribution close to the true posterior of interest• Take expectations w/r/t the simpler variational distribution

D. Blei Modeling Science 53 / 53