Bayesian (Nonparametric) Approaches to Language Modeling
Frank Wood
unemployed (and homeless)
January, 2013
Message
Whole Distributions As Random Variables
Hierarchical Modeling
Embracing Uncertainty
Challenge Problem
Predict What Comes Next
01001001, 01101110, 00100000, 01110100, 01101000, 01110010, 01100101, 01100101, 00100000, 01100100, 01100001, 01111001, 01110011, 00100000, 01111001, 01101111, 01110101, 01110010, 00100000, 01101000, 01100001, 01110010, 01100100, 00100000, 01100100, 01110010, 01101001, 01110110, 01100101, 00100000, 01110111, 01101001, 01101100, 01101100, 00100000, 01100011, 01110010, 01100001, 01110011, 01101000 . . .
0.6 < Entropy¹ of English < 1.3 bits/byte [Shannon, 1951]
Figure : bits/byte versus stream length (log10), one curve per state budget L = 10^3, 10^4, 10^5, 10^6, 10^7.
Sequence memoizer with life-long learning computational characteristics [Bartlett, Pfau, and Wood, 2010]
Wikipedia : 26.8GB → 4GB (≈ 1.2 bits/byte)
gzip : 7.9GB
PAQ : 3.8GB
L is the number of states in the SM; data is the head of a 2010 Wikipedia dump, depth 16
¹ Entropy = − (1/T) ∑_{t=1}^{T} log2 P(xt | x1:t−1) (lower is better)
How Good Is ≈ 1.5 bits/byte?
the fastest way to make money is . . .
being player . bucketti wish i made it end in the reducers and assemblance smart practices to allow city is something to walk in most of the work of agriculture . i ’d be able to compete with an earlier goals : words the danger of conduct in southern california , has set of juice of the fights lack of audience that the eclasm beverly hills . ” she companies or running back down that the book , ” grass . and the coynes , exaggerate between 1972 . the pad , a debate on emissions on air political capital crashing that the new obviously program ” – irock price , ” coach began refugees , much and²
² constant-space byte-SM, NY Times corpus (50MB), log-loss 1.49 bits/byte, 2 million restaurants
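The sample above was generated by a trained byte-level SM. For intuition, here is a minimal sketch of the same generate-by-sampling loop with an ordinary fixed-order character model; this is plain maximum-likelihood counting rather than the SM, and the `nytimes.txt` path is purely illustrative.

```python
import random
from collections import Counter, defaultdict

def train_ngram(corpus, order=4):
    """Count next-character frequencies for every length-`order` context."""
    counts = defaultdict(Counter)
    for i in range(order, len(corpus)):
        counts[corpus[i - order:i]][corpus[i]] += 1
    return counts

def sample(counts, seed, length=200, order=4):
    """Generate text by repeatedly sampling the next character given its context."""
    out = list(seed)
    for _ in range(length):
        context = "".join(out[-order:])
        dist = counts.get(context)
        if not dist:                                   # unseen context: pick any seen one
            dist = random.choice(list(counts.values()))
        chars, freqs = zip(*dist.items())
        out.append(random.choices(chars, weights=freqs)[0])
    return "".join(out)

# corpus = open("nytimes.txt").read()                  # hypothetical training text
# counts = train_ngram(corpus)
# print(sample(counts, seed="the fastest way to make money is "))
```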
Notation and iid Conditional Distribution Estimation
Let Σ be a set of symbols, let x = x1, x2, . . . , xT be an observed sequence, xt, s, s′ ∈ Σ, and u ∈ W = {w1, w2, . . . , wK} with wk ∈ Σ+.
Gu(s) = N(us) / ∑s′∈Σ N(us′)

is the maximum likelihood estimator for the conditional distribution P(X = s | u) ≡ Gu(s) (a counting sketch follows the list below).
Estimating a finite set of conditional distributions corresponding to a finite set of “contexts” u generally yields a poor model. Why?
Long contexts occur infrequently
Maximum likelihood estimates given few observations are often poor
Finite set of contexts might be sparse
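A minimal counting sketch of this maximum-likelihood estimator on a toy string (illustrative only; not the SM software):

```python
from collections import Counter, defaultdict

def ml_conditional(x, context_len=2):
    """Maximum-likelihood estimates G_u(s) = N(us) / sum_s' N(us')."""
    counts = defaultdict(Counter)                      # counts[u][s] = N(us)
    for i in range(context_len, len(x)):
        counts[tuple(x[i - context_len:i])][x[i]] += 1
    return {u: {s: n / sum(c.values()) for s, n in c.items()}
            for u, c in counts.items()}

G = ml_conditional(list("abracadabra"))
print(G[("a", "b")])       # {'r': 1.0} -- "ab" was always followed by "r"
print(G.get(("q", "z")))   # None       -- unseen contexts get no estimate at all
```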
Bayesian Smoothing
Use independent Dirichlet priors, i.e.
Gu ∼ Dir(α/|Σ|, . . . , α/|Σ|)
xt | xt−k, . . . , xt−1 = u ∼ Gu
and embrace uncertainty about the Gu’s :

P(xT+1 = s | xT−k+1:T = u, x) = ∫ P(xT+1 = s | Gu) P(Gu | x) dGu = E[Gu(s)],

yielding, for instance,

E[Gu(s)] = (N(us) + α/|Σ|) / (∑s′∈Σ N(us′) + α)
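A minimal sketch of this posterior-mean (additive) smoothing for a single context, with an illustrative toy alphabet and α:

```python
from collections import Counter

def dirichlet_posterior_mean(counts_u, alphabet, alpha=1.0):
    """E[G_u(s)] = (N(us) + alpha/|Sigma|) / (sum_s' N(us') + alpha)."""
    total = sum(counts_u.values())
    return {s: (counts_u.get(s, 0) + alpha / len(alphabet)) / (total + alpha)
            for s in alphabet}

alphabet = list("abcdr")
counts_ab = Counter({"r": 2})      # counts of symbols following "ab" in "abracadabra"
print(dirichlet_posterior_mean(counts_ab, alphabet))
# every symbol now has non-zero probability, and the estimates still sum to 1
```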
Hierarchical Smoothing
Smooth Gu using a similar distribution Gσ(u), i.e.
Gu ∼ Dir(c Gσ(u))³ or Gu ∼ DP(c Gσ(u))⁴
with, for instance, σ(x1x2x3 . . . xt) = x2x3 . . . xt (suffix operator)
³ [MacKay and Peto, 1995]    ⁴ [Teh et al., 2006]
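To see what the hierarchy buys, here is a rough posterior-mean-style sketch in which every context backs off to its suffix and, ultimately, to a uniform base (a fixed-c approximation for illustration, not the actual inference of [MacKay and Peto, 1995] or [Teh et al., 2006]):

```python
from collections import Counter, defaultdict

def hierarchical_estimate(counts, u, s, alphabet, c=1.0):
    """E[G_u(s)] is approximately (N(us) + c * E[G_sigma(u)(s)]) / (N(u.) + c),
    recursing through the suffix sigma(u) down to a uniform base."""
    if len(u) == 0:
        base = 1.0 / len(alphabet)
    else:
        base = hierarchical_estimate(counts, u[1:], s, alphabet, c)
    n_us, n_u = counts[u][s], sum(counts[u].values())
    return (n_us + c * base) / (n_u + c)

x = list("abracadabra")
counts = defaultdict(Counter)                      # counts[u][s] = N(us)
for k in range(3):                                 # context lengths 0, 1, 2
    for i in range(k, len(x)):
        counts[tuple(x[i - k:i])][x[i]] += 1

# the sparsely observed context ("a", "b") borrows strength from ("b",) and the unigrams
print(hierarchical_estimate(counts, ("a", "b"), "r", alphabet=list("abcdr")))
```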
Natural Question
Is it possible to simultaneously estimate all distributions in such a model, even in the infinite-depth case?
Unexpected Answer
YES!
Sequence memoizer (SM), Wood et al. [2011]
The SM is a compact O(T), unbounded-depth, hierarchical, Bayesian-nonparametric distribution over discrete sequence data.

Graphical Pitman-Yor process (GPYP), Wood and Teh [2008]

The GPYP is a hierarchical sharing prior with a directed acyclic dependency structure.
SM = hierarchical PY process [Teh, 2006] of unbounded depth.
Figure : Binary Sequence Memoizer Graphical Model
Binary Sequence Memoizer
Notation
Gε | UΣ, d0 ∼ PY(d0, 0, UΣ)
Gu | Gσ(u), d|u| ∼ PY(d|u|, 0, Gσ(u))  ∀ u ∈ Σ+
xt | x1, . . . , xt−1 = u ∼ Gu

Here σ(x1x2x3 . . . xt) = x2x3 . . . xt is the suffix operator, Σ = {0, 1}, and UΣ = [0.5, 0.5].
Encodes natural assumptions :
Recency assumption encoded by the form of coupling between related conditional distributions (“back-off”, sketched below)
Power-law effects encoded by the choice of hierarchical modeling glue: the Pitman-Yor process (PY).
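The discounts yield a back-off recursion of the same shape as the Dirichlet sketch above but with power-law behaviour. A rough sketch follows, with concentrations fixed at 0 and one "table" per observed symbol type, i.e. an interpolated-Kneser-Ney-style approximation rather than full SM posterior inference:

```python
from collections import Counter, defaultdict

def build_counts(x, max_depth):
    """counts[u][s] = N(us) for every context u up to length max_depth."""
    counts = defaultdict(Counter)
    for k in range(max_depth + 1):
        for i in range(k, len(x)):
            counts[tuple(x[i - k:i])][x[i]] += 1
    return counts

def py_backoff(counts, u, s, alphabet, discounts):
    """P(s | u) = (max(N(us) - d, 0) + d * types(u) * P(s | sigma(u))) / N(u.),
    falling back to a uniform base once the context is empty."""
    shorter = (1.0 / len(alphabet) if len(u) == 0
               else py_backoff(counts, u[1:], s, alphabet, discounts))
    c = counts[u]
    total = sum(c.values())
    if total == 0:                                     # context never observed: pure back-off
        return shorter
    d = discounts[min(len(u), len(discounts) - 1)]
    return (max(c[s] - d, 0) + d * len(c) * shorter) / total

x = list("0110100110010110")                           # toy binary sequence
counts = build_counts(x, max_depth=3)
ds = [0.25, 0.5, 0.7, 0.8]                             # one discount per context depth
p = {s: py_backoff(counts, tuple("011"), s, list("01"), ds) for s in "01"}
print(p)                                               # a proper distribution over {0, 1}
```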
Pitman-Yor process
A Pitman-Yor process PY(c, d, H) is a distribution over distributions with three parameters:

A discount 0 ≤ d < 1 that controls power-law behavior
d = 0 recovers the Dirichlet process (DP)
A concentration c > −d, like that of the Dirichlet process (DP)
A base distribution H, also like that of the DP
Key properties :
If G ∼ PY(c, d, H) then a priori

E[G(s)] = H(s)
Var[G(s)] = (1 − d) H(s)(1 − H(s))  (for c = 0, the case used in the SM)
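A quick way to see the power-law claim is to simulate the two-parameter Chinese restaurant process underlying PY (a standalone illustration, not code from the SM):

```python
import random

def two_param_crp(n, d, c):
    """Seat n customers in a Pitman-Yor Chinese restaurant process PY(c, d, .).
    Customer i joins table k with prob. (size_k - d)/(i + c) and opens a new
    table with prob. (c + d * num_tables)/(i + c). Returns the table sizes."""
    tables = []
    for i in range(n):
        r = random.uniform(0, i + c)
        for k, size in enumerate(tables):
            r -= size - d
            if r <= 0:
                tables[k] += 1
                break
        else:
            tables.append(1)                           # open a new table
    return tables

random.seed(0)
py = sorted(two_param_crp(20000, d=0.8, c=1.0), reverse=True)
dp = sorted(two_param_crp(20000, d=0.0, c=1.0), reverse=True)
print("PY(1, 0.8): tables =", len(py), "largest =", py[:5])
print("DP(1):      tables =", len(dp), "largest =", dp[:5])
# with d > 0 there are far more tables (types) and their sizes are heavy-tailed,
# matching the rank-frequency behaviour of English words in the figure below
```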
Text and Pitman-Yor Process Power-Law Properties
Figure : Word frequency versus rank (both on log scales) for English text, a Pitman-Yor process sample, and a Dirichlet process sample.
Power-law scaling of word frequencies in English text. Relative word frequency is plotted against each word's rank when ordered according to frequency. There are a few very common words and a large number of relatively rare words. Figure from Wood et al. [2011].
Big Problem
Space. And time.
Coagulation and Fragmentation⁵

PY properties used to achieve a linear-space sequence memoizer.
Coagulation
If G2 | G1 ∼ PY(d1, 0, G1) and G3 | G2 ∼ PY(d2, 0, G2), then G3 | G1 ∼ PY(d1 d2, 0, G1) with G2 marginalized out.

Fragmentation
Suppose G3 | G1 ∼ PY(d1 d2, 0, G1). Operations exist to draw G2 | G1 ∼ PY(d1, 0, G1) and G3 | G2 ∼ PY(d2, 0, G2).
⁵ [Pitman, 1999, Ho et al., 2006, Wood et al., 2009]
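As a quick consistency check (not on the original slide), the coagulation rule agrees with the a priori variance quoted earlier: marginalizing out G2 gives

Var[G3(s) | G1] = E[Var[G3(s) | G2]] + Var[E[G3(s) | G2]]
               = (1 − d2) E[G2(s)(1 − G2(s))] + Var[G2(s)]
               = (1 − d2) G1(s)(1 − G1(s)) + d2 Var[G2(s)]
               = (1 − d1 d2) G1(s)(1 − G1(s)),

exactly the variance of a single draw G3 | G1 ∼ PY(d1 d2, 0, G1).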
O(T) Sequence Memoizer⁶ Graphical Model for x = 110100
⁶ [Wood et al., 2009]
O(T) Sequence Memoizer vs. Smoothed nth-order Markov Model
Figure : Test perplexity and number of nodes as functions of the context length n, for the SM and nth-order Markov models.
SM versus nth-order Markov models with hierarchical PYP priors as n varies. In red is computational complexity (number of nodes).
Problem
Nonparametric → O(T) space =⇒ doesn't work for life-long learning
Towards Lifelong Learning : Incremental Estimation
Sequential Monte Carlo inference [Gasthaus, Wood, and Teh, 2010]
Wikipedia 100MB → 20.8MB (1.66 bits/byte⁷)

≈ PAQ, the special-purpose, non-incremental Hutter-prize leader
Still O(T ) storage complexity.
⁷ Average log-loss − (1/T) ∑_{t=1}^{T} log2 P(xt | x1:t−1) is the average number of bits needed to transmit one byte under optimal entropy encoding (lower is better)
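A minimal sketch of this evaluation for any predictive model exposing next-symbol probabilities (`model.prob` is a hypothetical interface, not the SM API):

```python
import math

def average_log_loss(model, data):
    """Average log-loss in bits per symbol: -(1/T) * sum_t log2 P(x_t | x_{1:t-1})."""
    total_bits = 0.0
    for t in range(len(data)):
        p = model.prob(context=data[:t], symbol=data[t])   # P(x_t | x_{1:t-1})
        total_bits -= math.log2(p)
    return total_bits / len(data)

# Under an optimal entropy coder (e.g. arithmetic coding) this is also the expected
# compressed size, so 1.66 bits/byte corresponds to roughly 8 / 1.66 ≈ 4.8x compression.
```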
Towards Lifelong Learning : Constant Space Approximation
“Forgetting Counts” [Bartlett, Pfau, and Wood, 2010]
Figure : Average log-loss (bits) versus number of states on the Calgary corpus, for Naive, Finite Depth, Random Deletion, and Greedy Deletion constant-state-count SM approximations, the full Sequence Memoizer, gzip, and bzip2.
Constant-state-count SM approximation. Data: Calgary corpus; States = number of states used in 3-10 grams.
“Improvements to the SM” [Gasthaus and Teh, 2011]
Constant-size representation of predictive distributions
Life-Long Learning SM [Bartlett, Pfau, and Wood, 2010]
Figure : bits/byte versus stream length (log10), one curve per state budget L = 10^3, 10^4, 10^5, 10^6, 10^7.
SM with life-long learning computational characteristics.
Wikipedia : 26.8GB → 4GB (≈ 1.2 bits/byte)
L is the number of states in the SM; data is the head of a 2010 Wikipedia dump, depth 16.
Summary So Far
SM = powerful Bayesian nonparametric sequence model
Intuition : lim n→∞ smoothing n-gram
Byte predictive performance in human range
Bayesian-nonparametric =⇒ incompatible with lifelong learning
Non-trivial approximations preserve performance
Resulting model compatible with lifelong learning
Widely useful
Language modeling
Lossless compression
Plug-in replacement for the Markov model component of any complex graphical model
Direct plug-in replacement for an n-gram language model
Software : http://www.sequencememoizer.com/
Can We Do Better?
Problem :
No long-range coherence
Domain specificity
Solution :
Develop models that more efficiently encode long-range dependencies
Develop models that use domain-specific context
Graphical Pitman-Yor process (GPYP)
GPYP Application : Language Model Domain Adaptation
Doubly Hierarchical Pitman-Yor Process Language Model⁸
G^D_{} ∼ PY(d^D_0, α^D_0, λ^D_0 U + (1 − λ^D_0) G^H_{})
G^D_{wt−1} ∼ PY(d^D_1, α^D_1, λ^D_1 G^D_{} + (1 − λ^D_1) G^H_{wt−1})
...
G^D_{wt−j:wt−1} ∼ PY(d^D_j, α^D_j, λ^D_j G^D_{wt−j+1:wt−1} + (1 − λ^D_j) G^H_{wt−j:wt−1})
x | wt−n+1 : wt−1 ∼ G^D_{wt−n+1:wt−1}.
⁸ [Wood and Teh, 2009]
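The intent of the mixed base measures can be sketched with a posterior-mean-style recursion in which the in-domain context u backs off both to its in-domain suffix σ(u) and to the other (general-domain) hierarchy's estimate for u itself (fixed λ and c for illustration; not the inference of [Wood and Teh, 2009]):

```python
def general_estimate(counts, u, s, alphabet, c=1.0):
    """Ordinary hierarchical smoothing within the general-domain corpus alone."""
    base = 1.0 / len(alphabet) if len(u) == 0 else general_estimate(counts, u[1:], s, alphabet, c)
    cnt = counts.get(u, {})
    return (cnt.get(s, 0) + c * base) / (sum(cnt.values()) + c)

def adapted_estimate(dom_counts, gen_counts, u, s, alphabet, lam=0.5, c=1.0):
    """In-domain estimate whose base mixes the in-domain suffix sigma(u) with the
    general-domain estimate for the same context u (weights lam / 1 - lam)."""
    if len(u) == 0:
        base = lam / len(alphabet) + (1 - lam) * general_estimate(gen_counts, (), s, alphabet, c)
    else:
        base = (lam * adapted_estimate(dom_counts, gen_counts, u[1:], s, alphabet, lam, c)
                + (1 - lam) * general_estimate(gen_counts, u, s, alphabet, c))
    cnt = dom_counts.get(u, {})
    return (cnt.get(s, 0) + c * base) / (sum(cnt.values()) + c)

# dom_counts and gen_counts map context tuples to {symbol: count} dicts, built from
# the small in-domain corpus and the large general corpus respectively, e.g. with
# the counting loops from the earlier sketches.
```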
Domain Adaptation Schematic
Figure : Domain adaptation schematic showing per-domain context hierarchies (Domain 1, Domain 2), latent variables, and switch priors.
Domain Adaptation Effect (Copacetic)
Figure : Test perplexity versus Brown corpus subset size for the Union, Mixture, DHPYPLM, and Baseline models.
Mixture : [Kneser and Steinbiss, 1993]; Union : [Bellegarda, 2004]
Training corpora
Small : State of the Union (SOU), 1945-1963 and 1969-2006, ∼ 370k tokens, ∼ 13k types
Big : Brown, 1967, ∼ 1m tokens, ∼ 50k types
Test corpus
Johnson's SOU speeches, 1963-1969, ∼ 37k tokens
Domain Adaptation Effect (Not)
Figure : Test perplexity versus Brown corpus subset size for the Union, Mixture, DHPYPLM, and Baseline models.
Mixture : [Kneser and Steinbiss, 1993]; Union : [Bellegarda, 2004]
Training corpora
Small : Augmented Multi-Party Interaction (AMI), 2007, ∼ 800k tokens, ∼ 8k types
Big : Brown, 1967, ∼ 1m tokens, ∼ 50k types
Test corpus
AMI excerpt, 2007, ∼ 60k tokens
Domain Adaptation Cost-Benefit Analysis
Figure : Test perplexity as a function of SOU corpus size and Brown corpus size.
Realistic scenario :
Increasing Brown corpus size imposes computational cost ($)
Increasing SOU corpus size imposes data acquisition cost ($$$)
Summary
Whole Distributions As Random Variables
Hierarchical Modeling
Embracing Uncertainty
The End
Thank you.
References I
Nicholas Bartlett, David Pfau, and Frank Wood. Forgetting counts: Constant memory inference for a dependent hierarchical Pitman-Yor process. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 63–70, June 2010. URL http://www.icml2010.org/papers/549.pdf.

J. R. Bellegarda. Statistical language model adaptation: review and perspectives. Speech Communication, 42:93–108, 2004.

J. Gasthaus and Y. W. Teh. Improvements to the sequence memoizer. In Proceedings of Neural Information Processing Systems, pages 685–693, 2011.

J. Gasthaus, F. Wood, and Y. W. Teh. Lossless compression based on the Sequence Memoizer. In Data Compression Conference 2010, pages 337–345, 2010.
References II
M. W. Ho, L. F. James, and J. W. Lau. Coagulation fragmentation laws induced by general coagulations of two-parameter Poisson-Dirichlet processes. http://arxiv.org/abs/math.PR/0601608, 2006.

R. Kneser and V. Steinbiss. On the dynamic adaptation of stochastic language models. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 586–589, 1993.

D. J. C. MacKay and L. C. Bauman Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(2):289–307, 1995.

J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27:1870–1902, 1999.

C. E. Shannon. Prediction and entropy of printed English. The Bell System Technical Journal, pages 50–64, 1951.

Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Association for Computational Linguistics, pages 985–992, 2006.
References III
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

F. Wood and Y. W. Teh. A hierarchical, hierarchical Pitman Yor process language model. In ICML/UAI Nonparametric Bayes Workshop, 2008.

F. Wood and Y. W. Teh. A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In Artificial Intelligence and Statistics, 2009.

F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, pages 1129–1136, Montreal, Canada, 2009.

F. Wood, J. Gasthaus, C. Archambeau, L. James, and Y. W. Teh. The sequence memoizer. Communications of the ACM, 2011.