Bayesian (Nonparametric) Approaches to Language Modeling
Frank Wood
unemployed (and homeless)
January, 2013
Message
Whole Distributions As Random Variables
Hierarchical Modeling
Embracing Uncertainty
Challenge Problem
Predict What Comes Next
01001001, 01101110, 00100000, 01110100, 01101000, 01110010, 01100101, 01100101, 00100000, 01100100, 01100001, 01111001, 01110011, 00100000, 01111001, 01101111, 01110101, 01110010, 00100000, 01101000, 01100001, 01110010, 01100100, 00100000, 01100100, 01110010, 01101001, 01110110, 01100101, 00100000, 01110111, 01101001, 01101100, 01101100, 00100000, 01100011, 01110010, 01100001, 01110011, 01101000 . . .
0.6 < Entropy¹ of English < 1.3 bits/byte [Shannon, 1951]
Figure : bits/byte versus stream length (log10), one curve per state budget L = 10^3, 10^4, 10^5, 10^6, 10^7.
Sequence memoizer with life-long learning computational characteristics [Bartlett, Pfau, and Wood, 2010]
Wikipedia : 26.8GB → 4GB (≈ 1.2 bits/byte)
gzip : 7.9GB
PAQ : 3.8GB
L is the number of states in the SM; data is the head of a 2010 Wikipedia dump, depth 16
¹ Entropy = − (1/T) ∑_{t=1}^{T} log2 P(xt | x1:t−1) (lower is better)
How Good Is ≈ 1.5 bits/byte?
the fastest way to make money is . . .
being player . bucketti wish i made it end in the reducers and assemblance smart practices to allow city is something to walk in most of the work of agriculture . i ’d be able to compete with an earlier goals : words the danger of conduct in southern california , has set of juice of the fights lack of audience that the eclasm beverly hills . ” she companies or running back down that the book , ” grass . and the coynes , exaggerate between 1972 . the pad , a debate on emissions on air political capital crashing that the new obviously program ” – irock price , ” coach began refugees , much and²
² constant-space byte-SM, NY Times corpus (50MB), log-loss 1.49 bits/byte, 2 million restaurants
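The sample above was generated by a trained byte-level SM. For intuition, here is a minimal sketch of the same generate-by-sampling loop with an ordinary fixed-order character model; this is plain maximum-likelihood counting rather than the SM, and the `nytimes.txt` path is purely illustrative.

```python
import random
from collections import Counter, defaultdict

def train_ngram(corpus, order=4):
    """Count next-character frequencies for every length-`order` context."""
    counts = defaultdict(Counter)
    for i in range(order, len(corpus)):
        counts[corpus[i - order:i]][corpus[i]] += 1
    return counts

def sample(counts, seed, length=200, order=4):
    """Generate text by repeatedly sampling the next character given its context."""
    out = list(seed)
    for _ in range(length):
        context = "".join(out[-order:])
        dist = counts.get(context)
        if not dist:                                   # unseen context: pick any seen one
            dist = random.choice(list(counts.values()))
        chars, freqs = zip(*dist.items())
        out.append(random.choices(chars, weights=freqs)[0])
    return "".join(out)

# corpus = open("nytimes.txt").read()                  # hypothetical training text
# counts = train_ngram(corpus)
# print(sample(counts, seed="the fastest way to make money is "))
```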
Notation and iid Conditional Distribution Estimation
Let Σ be a set of symbols, let x = x1, x2, . . . , xT be an observed sequence, xt, s, s′ ∈ Σ, and u ∈ W = {w1, w2, . . . , wK} with wk ∈ Σ+.
Gu(s) = N(us) / ∑s′∈Σ N(us′)

is the maximum likelihood estimator for the conditional distribution P(X = s | u) ≡ Gu(s) (a counting sketch follows the list below).
Estimating a finite set of conditional distributions corresponding to a finite set of “contexts” u generally yields a poor model. Why?
Long contexts occur infrequently
Maximum likelihood estimates given few observations are often poor
Finite set of contexts might be sparse
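A minimal counting sketch of this maximum-likelihood estimator on a toy string (illustrative only; not the SM software):

```python
from collections import Counter, defaultdict

def ml_conditional(x, context_len=2):
    """Maximum-likelihood estimates G_u(s) = N(us) / sum_s' N(us')."""
    counts = defaultdict(Counter)                      # counts[u][s] = N(us)
    for i in range(context_len, len(x)):
        counts[tuple(x[i - context_len:i])][x[i]] += 1
    return {u: {s: n / sum(c.values()) for s, n in c.items()}
            for u, c in counts.items()}

G = ml_conditional(list("abracadabra"))
print(G[("a", "b")])       # {'r': 1.0} -- "ab" was always followed by "r"
print(G.get(("q", "z")))   # None       -- unseen contexts get no estimate at all
```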
Bayesian Smoothing
Use independent Dirichlet priors, i.e.
Gu ∼ Dir(α/|Σ|, . . . , α/|Σ|)
xt | xt−k, . . . , xt−1 = u ∼ Gu
and embrace uncertainty about the Gu’s :

P(xT+1 = s | xT−k+1:T = u, x) = ∫ P(xT+1 = s | Gu) P(Gu | x) dGu = E[Gu(s)],

yielding, for instance,

E[Gu(s)] = (N(us) + α/|Σ|) / (∑s′∈Σ N(us′) + α)
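A minimal sketch of this posterior-mean (additive) smoothing for a single context, with an illustrative toy alphabet and α:

```python
from collections import Counter

def dirichlet_posterior_mean(counts_u, alphabet, alpha=1.0):
    """E[G_u(s)] = (N(us) + alpha/|Sigma|) / (sum_s' N(us') + alpha)."""
    total = sum(counts_u.values())
    return {s: (counts_u.get(s, 0) + alpha / len(alphabet)) / (total + alpha)
            for s in alphabet}

alphabet = list("abcdr")
counts_ab = Counter({"r": 2})      # counts of symbols following "ab" in "abracadabra"
print(dirichlet_posterior_mean(counts_ab, alphabet))
# every symbol now has non-zero probability, and the estimates still sum to 1
```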
Hierarchical Smoothing
Smooth Gu using a similar distribution Gσ(u), i.e.
Gu ∼ Dir(c Gσ(u))³ or Gu ∼ DP(c Gσ(u))⁴
with, for instance, σ(x1x2x3 . . . xt) = x2x3 . . . xt (suffix operator)
³ [MacKay and Peto, 1995]    ⁴ [Teh et al., 2006]
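To see what the hierarchy buys, here is a rough posterior-mean-style sketch in which every context backs off to its suffix and, ultimately, to a uniform base (a fixed-c approximation for illustration, not the actual inference of [MacKay and Peto, 1995] or [Teh et al., 2006]):

```python
from collections import Counter, defaultdict

def hierarchical_estimate(counts, u, s, alphabet, c=1.0):
    """E[G_u(s)] is approximately (N(us) + c * E[G_sigma(u)(s)]) / (N(u.) + c),
    recursing through the suffix sigma(u) down to a uniform base."""
    if len(u) == 0:
        base = 1.0 / len(alphabet)
    else:
        base = hierarchical_estimate(counts, u[1:], s, alphabet, c)
    n_us, n_u = counts[u][s], sum(counts[u].values())
    return (n_us + c * base) / (n_u + c)

x = list("abracadabra")
counts = defaultdict(Counter)                      # counts[u][s] = N(us)
for k in range(3):                                 # context lengths 0, 1, 2
    for i in range(k, len(x)):
        counts[tuple(x[i - k:i])][x[i]] += 1

# the sparsely observed context ("a", "b") borrows strength from ("b",) and the unigrams
print(hierarchical_estimate(counts, ("a", "b"), "r", alphabet=list("abcdr")))
```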
Natural Question
Is it possible to simultaneously estimate all distributions in such a model, even in the infinite-depth case?
Unexpected Answer
YES!
Sequence memoizer (SM), Wood et al. [2011]
The SM is a compact O(T), unbounded-depth, hierarchical, Bayesian-nonparametric distribution over discrete sequence data.

Graphical Pitman-Yor process (GPYP), Wood and Teh [2008]

The GPYP is a hierarchical sharing prior with a directed acyclic dependency structure.
SM = hierarchical PY process [Teh, 2006] of unbounded depth.
Figure : Binary Sequence Memoizer Graphical Model
Binary Sequence Memoizer
Notation
Gε | UΣ, d0 ∼ PY(d0, 0, UΣ)
Gu | Gσ(u), d|u| ∼ PY(d|u|, 0, Gσ(u))  ∀ u ∈ Σ+
xt | x1, . . . , xt−1 = u ∼ Gu

Here σ(x1x2x3 . . . xt) = x2x3 . . . xt is the suffix operator, Σ = {0, 1}, and UΣ = [0.5, 0.5].
Encodes natural assumptions :
Recency assumption encoded by the form of coupling between related conditional distributions (“back-off”, sketched below)
Power-law effects encoded by the choice of hierarchical modeling glue: the Pitman-Yor process (PY).
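The discounts yield a back-off recursion of the same shape as the Dirichlet sketch above but with power-law behaviour. A rough sketch follows, with concentrations fixed at 0 and one "table" per observed symbol type, i.e. an interpolated-Kneser-Ney-style approximation rather than full SM posterior inference:

```python
from collections import Counter, defaultdict

def build_counts(x, max_depth):
    """counts[u][s] = N(us) for every context u up to length max_depth."""
    counts = defaultdict(Counter)
    for k in range(max_depth + 1):
        for i in range(k, len(x)):
            counts[tuple(x[i - k:i])][x[i]] += 1
    return counts

def py_backoff(counts, u, s, alphabet, discounts):
    """P(s | u) = (max(N(us) - d, 0) + d * types(u) * P(s | sigma(u))) / N(u.),
    falling back to a uniform base once the context is empty."""
    shorter = (1.0 / len(alphabet) if len(u) == 0
               else py_backoff(counts, u[1:], s, alphabet, discounts))
    c = counts[u]
    total = sum(c.values())
    if total == 0:                                     # context never observed: pure back-off
        return shorter
    d = discounts[min(len(u), len(discounts) - 1)]
    return (max(c[s] - d, 0) + d * len(c) * shorter) / total

x = list("0110100110010110")                           # toy binary sequence
counts = build_counts(x, max_depth=3)
ds = [0.25, 0.5, 0.7, 0.8]                             # one discount per context depth
p = {s: py_backoff(counts, tuple("011"), s, list("01"), ds) for s in "01"}
print(p)                                               # a proper distribution over {0, 1}
```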
Pitman-Yor process
A Pitman-Yor process PY(c, d, H) is a distribution over distributions with three parameters:

A discount 0 ≤ d < 1 that controls power-law behavior
d = 0 recovers the Dirichlet process (DP)
A concentration c > −d, like that of the Dirichlet process (DP)
A base distribution H, also like that of the DP
Key properties :
If G ∼ PY(c, d, H) then a priori

E[G(s)] = H(s)
Var[G(s)] = (1 − d) H(s)(1 − H(s))  (for c = 0, the case used in the SM)
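A quick way to see the power-law claim is to simulate the two-parameter Chinese restaurant process underlying PY (a standalone illustration, not code from the SM):

```python
import random

def two_param_crp(n, d, c):
    """Seat n customers in a Pitman-Yor Chinese restaurant process PY(c, d, .).
    Customer i joins table k with prob. (size_k - d)/(i + c) and opens a new
    table with prob. (c + d * num_tables)/(i + c). Returns the table sizes."""
    tables = []
    for i in range(n):
        r = random.uniform(0, i + c)
        for k, size in enumerate(tables):
            r -= size - d
            if r <= 0:
                tables[k] += 1
                break
        else:
            tables.append(1)                           # open a new table
    return tables

random.seed(0)
py = sorted(two_param_crp(20000, d=0.8, c=1.0), reverse=True)
dp = sorted(two_param_crp(20000, d=0.0, c=1.0), reverse=True)
print("PY(1, 0.8): tables =", len(py), "largest =", py[:5])
print("DP(1):      tables =", len(dp), "largest =", dp[:5])
# with d > 0 there are far more tables (types) and their sizes are heavy-tailed,
# matching the rank-frequency behaviour of English words in the figure below
```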
Text and Pitman-Yor Process Power-Law Properties
Figure : Word frequency versus rank (both on log scales) for English text, a Pitman-Yor process sample, and a Dirichlet process sample.
Power-law scaling of word frequencies in English text. Relative word frequency is plotted against each word's rank when ordered according to frequency. There are a few very common words and a large number of relatively rare words. Figure from Wood et al. [2011].
Big Problem
Space. And time.
Coagulation and Fragmentation⁵

PY properties used to achieve a linear-space sequence memoizer.
Coagulation
If G2 | G1 ∼ PY(d1, 0, G1) and G3 | G2 ∼ PY(d2, 0, G2), then G3 | G1 ∼ PY(d1 d2, 0, G1) with G2 marginalized out.

Fragmentation
Suppose G3 | G1 ∼ PY(d1 d2, 0, G1). Operations exist to draw G2 | G1 ∼ PY(d1, 0, G1) and G3 | G2 ∼ PY(d2, 0, G2).
⁵ [Pitman, 1999, Ho et al., 2006, Wood et al., 2009]
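As a quick consistency check (not on the original slide), the coagulation rule agrees with the a priori variance quoted earlier: marginalizing out G2 gives

Var[G3(s) | G1] = E[Var[G3(s) | G2]] + Var[E[G3(s) | G2]]
               = (1 − d2) E[G2(s)(1 − G2(s))] + Var[G2(s)]
               = (1 − d2) G1(s)(1 − G1(s)) + d2 Var[G2(s)]
               = (1 − d1 d2) G1(s)(1 − G1(s)),

exactly the variance of a single draw G3 | G1 ∼ PY(d1 d2, 0, G1).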
O(T) Sequence Memoizer⁶ Graphical Model for x = 110100
⁶ [Wood et al., 2009]
O(T) Sequence Memoizer vs. Smoothed nth-order Markov Model
Figure : Test perplexity and number of nodes as functions of the context length n, for the SM and nth-order Markov models.
SM versus nth-order Markov models with hierarchical PYP priors as n varies. In red is computational complexity (number of nodes).
Problem
Nonparametric → O(T) space =⇒ doesn't work for life-long learning
Towards Lifelong Learning : Incremental Estimation
Sequential Monte Carlo inference [Gasthaus, Wood, and Teh, 2010]
Wikipedia 100MB → 20.8MB (1.66 bits/byte⁷)

≈ PAQ, the special-purpose, non-incremental Hutter-prize leader
Still O(T ) storage complexity.
⁷ Average log-loss − (1/T) ∑_{t=1}^{T} log2 P(xt | x1:t−1) is the average number of bits needed to transmit one byte under optimal entropy encoding (lower is better)
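A minimal sketch of this evaluation for any predictive model exposing next-symbol probabilities (`model.prob` is a hypothetical interface, not the SM API):

```python
import math

def average_log_loss(model, data):
    """Average log-loss in bits per symbol: -(1/T) * sum_t log2 P(x_t | x_{1:t-1})."""
    total_bits = 0.0
    for t in range(len(data)):
        p = model.prob(context=data[:t], symbol=data[t])   # P(x_t | x_{1:t-1})
        total_bits -= math.log2(p)
    return total_bits / len(data)

# Under an optimal entropy coder (e.g. arithmetic coding) this is also the expected
# compressed size, so 1.66 bits/byte corresponds to roughly 8 / 1.66 ≈ 4.8x compression.
```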
Towards Lifelong Learning : Constant Space Approximation
“Forgetting Counts” [Bartlett, Pfau, and Wood, 2010]
Figure : Average log-loss (bits) versus number of states on the Calgary corpus, for Naive, Finite Depth, Random Deletion, and Greedy Deletion constant-state-count SM approximations, the full Sequence Memoizer, gzip, and bzip2.
Constant-state-count SM approximation. Data: Calgary corpus; States = number of states used in 3-10 grams.
“Improvements to the SM” [Gasthaus and Teh, 2011]
Constant-size representation of predictive distributions
Life-Long Learning SM [Bartlett, Pfau, and Wood, 2010]
Figure : bits/byte versus stream length (log10), one curve per state budget L = 10^3, 10^4, 10^5, 10^6, 10^7.
SM with life-long learning computational characteristics.
Wikipedia : 26.8GB → 4GB (≈ 1.2 bits/byte)
L is the number of states in the SM; data is the head of a 2010 Wikipedia dump, depth 16.
Summary So Far
SM = powerful Bayesian nonparametric sequence model
Intuition : lim n→∞ smoothing n-gram
Byte predictive performance in human range
Bayesian-nonparametric =⇒ incompatible with lifelong learning
Non-trivial approximations preserve performance
Resulting model compatible with lifelong learning
Widely useful
Language modeling
Lossless compression
Plug-in replacement for the Markov model component of any complex graphical model
Direct plug-in replacement for an n-gram language model
Software : http://www.sequencememoizer.com/
Can We Do Better?
Problem :
No long-range coherence
Domain specificity
Solution :
Develop models that more efficiently encode long-range dependencies
Develop models that use domain-specific context
Graphical Pitman-Yor process (GPYP)
GPYP Application : Language Model Domain Adaptation
Doubly Hierarchical Pitman-Yor Process Language Model⁸
G^D_{} ∼ PY(d^D_0, α^D_0, λ^D_0 U + (1 − λ^D_0) G^H_{})
G^D_{wt−1} ∼ PY(d^D_1, α^D_1, λ^D_1 G^D_{} + (1 − λ^D_1) G^H_{wt−1})
...
G^D_{wt−j:wt−1} ∼ PY(d^D_j, α^D_j, λ^D_j G^D_{wt−j+1:wt−1} + (1 − λ^D_j) G^H_{wt−j:wt−1})
x | wt−n+1 : wt−1 ∼ G^D_{wt−n+1:wt−1}.
⁸ [Wood and Teh, 2009]
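The intent of the mixed base measures can be sketched with a posterior-mean-style recursion in which the in-domain context u backs off both to its in-domain suffix σ(u) and to the other (general-domain) hierarchy's estimate for u itself (fixed λ and c for illustration; not the inference of [Wood and Teh, 2009]):

```python
def general_estimate(counts, u, s, alphabet, c=1.0):
    """Ordinary hierarchical smoothing within the general-domain corpus alone."""
    base = 1.0 / len(alphabet) if len(u) == 0 else general_estimate(counts, u[1:], s, alphabet, c)
    cnt = counts.get(u, {})
    return (cnt.get(s, 0) + c * base) / (sum(cnt.values()) + c)

def adapted_estimate(dom_counts, gen_counts, u, s, alphabet, lam=0.5, c=1.0):
    """In-domain estimate whose base mixes the in-domain suffix sigma(u) with the
    general-domain estimate for the same context u (weights lam / 1 - lam)."""
    if len(u) == 0:
        base = lam / len(alphabet) + (1 - lam) * general_estimate(gen_counts, (), s, alphabet, c)
    else:
        base = (lam * adapted_estimate(dom_counts, gen_counts, u[1:], s, alphabet, lam, c)
                + (1 - lam) * general_estimate(gen_counts, u, s, alphabet, c))
    cnt = dom_counts.get(u, {})
    return (cnt.get(s, 0) + c * base) / (sum(cnt.values()) + c)

# dom_counts and gen_counts map context tuples to {symbol: count} dicts, built from
# the small in-domain corpus and the large general corpus respectively, e.g. with
# the counting loops from the earlier sketches.
```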
Domain Adaptation Schematic
Figure : Domain adaptation schematic showing per-domain context hierarchies (Domain 1, Domain 2), latent variables, and switch priors.
Domain Adaptation Effect (Copacetic)
Figure : Test perplexity versus Brown corpus subset size for the Union, Mixture, DHPYPLM, and Baseline models.
Mixture : [Kneser and Steinbiss, 1993]; Union : [Bellegarda, 2004]
Training corpora
Small : State of the Union (SOU), 1945-1963 and 1969-2006, ∼ 370k tokens, ∼ 13k types
Big : Brown, 1967, ∼ 1m tokens, ∼ 50k types
Test corpus
Johnson's SOU speeches, 1963-1969, ∼ 37k tokens
Domain Adaptation Effect (Not)
Figure : Test perplexity versus Brown corpus subset size for the Union, Mixture, DHPYPLM, and Baseline models.
Mixture : [Kneser and Steinbiss, 1993]; Union : [Bellegarda, 2004]
Training corpora
Small : Augmented Multi-Party Interaction (AMI), 2007, ∼ 800k tokens, ∼ 8k types
Big : Brown, 1967, ∼ 1m tokens, ∼ 50k types
Test corpus
AMI excerpt, 2007, ∼ 60k tokens
Domain Adaptation Cost-Benefit Analysis
Figure : Test perplexity as a function of SOU corpus size and Brown corpus size.
Realistic scenario :
Increasing Brown corpus size imposes computational cost ($)
Increasing SOU corpus size imposes data acquisition cost ($$$)
Summary
Whole Distributions As Random Variables
Hierarchical Modeling
Embracing Uncertainty
The End
Thank you.
References I
Nicholas Bartlett, David Pfau, and Frank Wood. Forgetting counts: Constant memory inference for a dependent hierarchical Pitman-Yor process. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 63–70, June 2010. URL http://www.icml2010.org/papers/549.pdf.

J. R. Bellegarda. Statistical language model adaptation: review and perspectives. Speech Communication, 42:93–108, 2004.

J. Gasthaus and Y. W. Teh. Improvements to the sequence memoizer. In Proceedings of Neural Information Processing Systems, pages 685–693, 2011.

J. Gasthaus, F. Wood, and Y. W. Teh. Lossless compression based on the Sequence Memoizer. In Data Compression Conference 2010, pages 337–345, 2010.
References II
M. W. Ho, L. F. James, and J. W. Lau. Coagulation fragmentation laws induced by general coagulations of two-parameter Poisson-Dirichlet processes. http://arxiv.org/abs/math.PR/0601608, 2006.

R. Kneser and V. Steinbiss. On the dynamic adaptation of stochastic language models. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 586–589, 1993.

D. J. C. MacKay and L. C. Bauman Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(2):289–307, 1995.

J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27:1870–1902, 1999.

C. E. Shannon. Prediction and entropy of printed English. The Bell System Technical Journal, pages 50–64, 1951.

Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Association for Computational Linguistics, pages 985–992, 2006.
References III
Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

F. Wood and Y. W. Teh. A hierarchical, hierarchical Pitman Yor process language model. In ICML/UAI Nonparametric Bayes Workshop, 2008.

F. Wood and Y. W. Teh. A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In Artificial Intelligence and Statistics, 2009.

F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, pages 1129–1136, Montreal, Canada, 2009.

F. Wood, J. Gasthaus, C. Archambeau, L. James, and Y. W. Teh. The sequence memoizer. Communications of the ACM, 2011.