
Page 1:

Bayesian (Nonparametric) Approaches to Language Modeling

Frank Wood

unemployed (and homeless)

January, 2013

Page 2:

Message

Whole Distributions As Random Variables

Hierarchical Modeling

Embracing Uncertainty

Page 3:

Challenge Problem

Predict What Comes Next

01001001, 01101110, 00100000, 01110100, 01101000, 01110010, 01100101, 01100101, 00100000, 01100100, 01100001, 01111001, 01110011, 00100000, 01111001, 01101111, 01110101, 01110010, 00100000, 01101000, 01100001, 01110010, 01100100, 00100000, 01100100, 01110010, 01101001, 01110110, 01100101, 00100000, 01110111, 01101001, 01101100, 01101100, 00100000, 01100011, 01110010, 01100001, 01110011, 01101000 . . .
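As a hint at what the stream encodes (my own illustration, not part of the talk): each comma-separated group is one ASCII byte, so a couple of lines of Python decode it.

```python
# Decode a comma-separated stream of 8-bit binary strings into ASCII characters.
stream = "01001001, 01101110, 00100000, 01110100, 01101000, 01110010, 01100101, 01100101"
text = "".join(chr(int(byte, 2)) for byte in stream.split(","))
print(text)  # first 8 bytes of the hidden sentence
```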

Page 4:

0.6 < Entropy¹ of English < 1.3 bits/byte [Shannon, 1951]

[Figure: average log-loss in bits/byte (y axis, 2–4) vs. stream length (log₁₀, 3–9), one curve per L = 10³, 10⁴, 10⁵, 10⁶, 10⁷]

Sequence memoizer with life-long learning computational characteristics [Bartlett, Pfau, and Wood, 2010]

Wikipedia: 26.8GB → 4GB (≈ 1.2 bits/byte)

gzip: 7.9GB

PAQ: 3.8GB

L is the number of states in the SM; data is the head of the 2010 Wikipedia dump, depth 16

¹Entropy $= -\frac{1}{T}\sum_{t=1}^{T} \log_2 P(x_t \mid x_{1:t-1})$ (lower is better)
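To make the footnote concrete (an illustrative sketch, not code from the talk): the average log-loss of any predictive model is computed by accumulating $-\log_2 P(x_t \mid x_{1:t-1})$ over the stream. The `model.predict` interface below is a placeholder assumption.

```python
import math

def average_log_loss(model, data):
    """-(1/T) * sum_t log2 P(x_t | x_{1:t-1}), in bits per symbol (lower is better)."""
    total_bits = 0.0
    for t in range(len(data)):
        probs = model.predict(data[:t])   # assumed interface: dict of symbol -> probability
        total_bits -= math.log2(probs[data[t]])
    return total_bits / len(data)
```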

Page 5:

How Good Is ≈ 1.5 bits/byte?

the fastest way to make money is . . .

being player . bucketti wish i made it end in the reducers and assemblance smart practices to allow city is something to walk in most of the work of agriculture . i ’d be able to compete with an earlier goals : words the danger of conduct in southern california , has set of juice of the fights lack of audience that the eclasm beverly hills . ” she companies or running back down that the book , ” grass . and the coynes , exaggerate between 1972 . the pad , a debate on emissions on air political capital crashing that the new obviously program ” – irock price , ” coach began refugees , much and²

²Constant-space, byte-level SM, NY Times corpus (50MB), log-loss 1.49 bits/byte, 2 million restaurants

Pages 6-7:

Notation and iid Conditional Distribution Estimation

Let $\Sigma$ be a set of symbols, let $x = x_1, x_2, \ldots, x_T$ be an observed sequence, $x_t, s, s' \in \Sigma$, and $u \in W = \{w_1, w_2, \ldots, w_K\}$ with $w_k \in \Sigma^+$.

$$G_u(s) = \frac{N(us)}{\sum_{s' \in \Sigma} N(us')}$$

is the maximum likelihood estimator for the conditional distribution $P(X = s \mid u) \equiv G_u(s)$.
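A direct count-based version of this estimator (a sketch of mine, not the talk's code) simply tabulates how often each symbol follows each length-$k$ context:

```python
from collections import Counter, defaultdict

def ml_conditional_estimates(x, k):
    """G_u(s) = N(us) / sum_{s'} N(us') for every length-k context u seen in x."""
    counts = defaultdict(Counter)
    for t in range(k, len(x)):
        counts[tuple(x[t - k:t])][x[t]] += 1
    return {u: {s: n / sum(c.values()) for s, n in c.items()}
            for u, c in counts.items()}
```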

Estimating a finite set of conditional distributions corresponding to a finite set of “contexts” u generally yields a poor model. Why?

Long contexts occur infrequently

Maximum likelihood estimates given few observations are often poor

Finite set of contexts might be sparse

Pages 8-9:

Bayesian Smoothing

Use independent Dirichlet priors, i.e.

$$G_u \sim \operatorname{Dir}\!\big(\tfrac{\alpha}{|\Sigma|}\big)$$

$$x_t \mid x_{t-k}, \ldots, x_{t-1} = u \sim G_u$$

and embrace uncertainty about the $G_u$’s:

$$P(x_{T+1} = s \mid \mathbf{x}, u) = \int P(x_{T+1} = s \mid G_u)\, P(G_u \mid \mathbf{x})\, \mathrm{d}G_u = E[G_u(s)],$$

yielding, for instance,

$$E[G_u(s)] = \frac{N(us) + \frac{\alpha}{|\Sigma|}}{\sum_{s' \in \Sigma} N(us') + \alpha}$$
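In code the posterior mean is just additive smoothing: add $\alpha/|\Sigma|$ pseudo-counts to every symbol's count (a sketch under the slide's symmetric-Dirichlet assumption, not code from the talk).

```python
def dirichlet_smoothed(counts_u, alphabet, alpha):
    """E[G_u(s)] = (N(us) + alpha/|Sigma|) / (sum_{s'} N(us') + alpha)."""
    total = sum(counts_u.get(s, 0) for s in alphabet)
    return {s: (counts_u.get(s, 0) + alpha / len(alphabet)) / (total + alpha)
            for s in alphabet}
```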

Page 10:

Hierarchical Smoothing

Smooth $G_u$ using a similar distribution $G_{\sigma(u)}$, i.e.

$$G_u \sim \operatorname{Dir}(c\,G_{\sigma(u)})\,^3 \quad \text{or} \quad G_u \sim \operatorname{DP}(c\,G_{\sigma(u)})\,^4$$

with, for instance, $\sigma(x_1 x_2 x_3 \ldots x_t) = x_2 x_3 \ldots x_t$ (the suffix operator)

³[MacKay and Peto, 1995]
⁴[Teh et al., 2006]
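A minimal sketch of the hierarchical idea, in the spirit of [MacKay and Peto, 1995] but not their full estimation procedure: conditioned on an already-estimated parent $G_{\sigma(u)}$, the Dirichlet posterior mean shrinks the counts in context $u$ toward the parent's predictions.

```python
def hierarchically_smoothed(counts_u, parent, c):
    """E[G_u(s) | counts, G_parent] = (N(us) + c * G_parent(s)) / (N(u.) + c)."""
    total = sum(counts_u.values())
    return {s: (counts_u.get(s, 0) + c * p_s) / (total + c)
            for s, p_s in parent.items()}
```

Applying this from the empty context downward gives a back-off chain: each longer context is smoothed toward the next-shorter one.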

Page 11:

Natural Question

Is it possible to simultaneously estimate all distributions in such a model, even in the infinite-depth case?

Page 12:

Unexpected Answer

Is it possible to simultaneously estimate all distributions in such a model, even in the infinite-depth case?

YES!

Sequence memoizer (SM), Wood et al. [2011]

The SM is a compact $O(T)$, unbounded-depth, hierarchical, Bayesian-nonparametric distribution over discrete sequence data.

Graphical Pitman-Yor process (GPYP), Wood and Teh [2008]

The GPYP is a hierarchical sharing prior with a directed acyclic dependency structure.

Page 13:

SM = hierarchical PY process [Teh, 2006] of unbounded depth.

[Figure: Binary Sequence Memoizer Graphical Model]

Page 14:

Binary Sequence Memoizer

Notation

$$G_\varepsilon \mid U_\Sigma, d_0 \sim \operatorname{PY}(d_0, 0, U_\Sigma)$$

$$G_u \mid G_{\sigma(u)}, d_{|u|} \sim \operatorname{PY}(d_{|u|}, 0, G_{\sigma(u)}) \qquad \forall u \in \Sigma^+$$

$$x_t \mid x_1, \ldots, x_{t-1} = u \sim G_u$$

Here $\sigma(x_1 x_2 x_3 \ldots x_t) = x_2 x_3 \ldots x_t$ is the suffix operator, $\Sigma = \{0, 1\}$, and $U_\Sigma = [0.5, 0.5]$.

Encodes natural assumptions:

Recency assumption encoded by the form of coupling between related conditional distributions (“back-off”); a code sketch of this back-off recursion follows below.

Power-law effects encoded by the choice of hierarchical modeling glue: the Pitman-Yor process (PY).
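To make the back-off recursion these equations define concrete, here is a hedged sketch of a finite-depth hierarchical Pitman-Yor-style predictive with all concentrations set to 0, approximated by absolute discounting with interpolation: each context discounts its counts and pushes the freed mass to its suffix context. This is my own simplification, not the SM's full seating-arrangement inference over an unbounded-depth, O(T)-node suffix tree.

```python
from collections import Counter, defaultdict

class BackoffPYSketch:
    """Absolute-discounting approximation to a finite-depth hierarchical
    Pitman-Yor predictive (concentrations 0); a sketch, not the Sequence Memoizer."""

    def __init__(self, alphabet, discounts):
        self.alphabet = list(alphabet)
        self.discounts = discounts              # discounts[k] = d_k for contexts of length k
        self.max_depth = len(discounts) - 1
        self.counts = defaultdict(Counter)      # context tuple -> Counter over next symbols

    def observe(self, x):
        for t in range(len(x)):
            for k in range(min(t, self.max_depth) + 1):
                self.counts[tuple(x[t - k:t])][x[t]] += 1

    def prob(self, s, context):
        u = tuple(context)
        if len(u) > self.max_depth:
            u = u[len(u) - self.max_depth:]     # keep only the most recent symbols
        return self._prob(s, u)

    def _prob(self, s, u):
        # Parent is the next-shorter context; the empty context backs off to U_Sigma.
        parent = 1.0 / len(self.alphabet) if len(u) == 0 else self._prob(s, u[1:])
        c = self.counts.get(u, Counter())
        total = sum(c.values())
        if total == 0:
            return parent
        d = self.discounts[len(u)]
        # Discounted count for s, plus mass d per observed type pushed to the parent.
        return max(c[s] - d, 0.0) / total + d * len(c) / total * parent
```

With $\Sigma = \{0, 1\}$ and one discount per depth this reproduces the back-off structure of the model above; the real SM lets the depth be unbounded and marginalizes out non-branching contexts so that only $O(T)$ nodes are ever stored.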

Page 15:

Pitman-Yor process

A Pitman-Yor process $\operatorname{PY}(c, d, H)$ is a distribution over distributions with three parameters

A discount 0 ≤ d < 1 that controls power-law behavior

d = 0 is Dirichlet process (DP)

A concentration c > −d like that of the Dirichlet process (DP)

A base distribution H also like that of the DP

Key properties:

If $G \sim \operatorname{PY}(c, d, H)$ then a priori

$$E[G(s)] = H(s)$$

$$\operatorname{Var}[G(s)] = (1 - d)\,H(s)\,(1 - H(s)) \qquad (\text{for } c = 0)$$
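The power-law behavior plotted on the next slide can be reproduced by sampling draws from $G \sim \operatorname{PY}(c, d, H)$ via its generalized Chinese restaurant process; a sketch of mine, assuming only that we can sample from the base distribution $H$:

```python
import random

def pitman_yor_crp_draws(n, c, d, base_sample):
    """Return n values distributed as iid draws from G ~ PY(c, d, H),
    with G integrated out (generalized Chinese restaurant process)."""
    tables = []                                  # each table: [value drawn from H, customer count]
    draws = []
    for i in range(n):                           # i customers already seated
        r = random.random() * (i + c)
        acc = 0.0
        seated = False
        for table in tables:
            acc += table[1] - d                  # existing table: weight (count - d)
            if r < acc:
                table[1] += 1
                draws.append(table[0])
                seated = True
                break
        if not seated:                           # new table: weight c + d * (#tables)
            value = base_sample()
            tables.append([value, 1])
            draws.append(value)
    return draws
```

Ranking the frequencies of the sampled values gives the heavy-tailed “Pitman-Yor” curve on the next slide; setting d = 0 gives a Dirichlet process, whose frequencies fall off much faster.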

Page 16:

Text and Pitman-Yor Process Power-Law Properties

[Figure: word frequency vs. rank, both axes log scale (10⁰–10⁶), for English text, a Pitman-Yor process sample, and a Dirichlet process sample]

Power-law scaling of word frequencies in English text. Relative word frequency is plotted against each word's rank when ordered according to frequency. There are a few very common words and a large number of relatively rare words. Figure from Wood et al. [2011].

Page 17:

Big Problem

Space. And time.

Page 18:

Coagulation and Fragmentation⁵

PY properties used to achieve the linear-space sequence memoizer.

Coagulation

If $G_2 \mid G_1 \sim \operatorname{PY}(d_1, 0, G_1)$ and $G_3 \mid G_2 \sim \operatorname{PY}(d_2, 0, G_2)$, then $G_3 \mid G_1 \sim \operatorname{PY}(d_1 d_2, 0, G_1)$ with $G_2$ marginalized out.

Fragmentation

Suppose $G_3 \mid G_1 \sim \operatorname{PY}(d_1 d_2, 0, G_1)$. Operations exist to draw $G_2 \mid G_1 \sim \operatorname{PY}(d_1, 0, G_1)$ and $G_3 \mid G_2 \sim \operatorname{PY}(d_2, 0, G_2)$.

⁵[Pitman, 1999, Ho et al., 2006, Wood et al., 2009]

Page 19:

$O(T)$ Sequence Memoizer⁶ Graphical Model for x = 110100

⁶[Wood et al., 2009]

Page 20:

$O(T)$ Sequence Memoizer vs. Smoothed $n$th-order Markov Model

[Figure: test perplexity (≈100–400) and number of nodes (up to ≈1.5 × 10⁷) vs. context length n, for the SM and for smoothed nth-order Markov models]

SM versus $n$th-order Markov models with hierarchical PYP priors as $n$ varies. In red is the computational complexity (number of nodes).

Page 21:

Problem

Nonparametric → $O(T)$ space ⟹ doesn’t work for life-long learning

Page 22:

Towards Lifelong Learning: Incremental Estimation

Sequential Monte Carlo inference [Gasthaus, Wood, and Teh, 2010]

Wikipedia 100MB → 20.8MB (1.66 bits/byte⁷)

≈ PAQ, a special-purpose, non-incremental, Hutter-prize leader

Still $O(T)$ storage complexity.

⁷Average log-loss $-\frac{1}{T}\sum_{t=1}^{T} \log_2 P(x_t \mid x_{1:t-1})$ is the average # of bits to transmit one byte under optimal entropy encoding (lower is better).

Page 23:

Towards Lifelong Learning: Constant Space Approximation

“Forgetting Counts” [Bartlett, Pfau, and Wood, 2010]

[Figure: average log-loss in bits (≈1.8–2.6) vs. number of states (≈14k–780k) for Naive, Finite Depth, Random Deletion, and Greedy Deletion constant-state approximations, compared with the full Sequence Memoizer, gzip, and bzip2]

Constant-state-count SM approximation. Data: Calgary corpus; States = number of states used in 3–10 grams.

“Improvements to the SM” [Gasthaus and Teh, 2011]

Constant-size representation of predictive distributions

Page 24:

Life-Long Learning SM [Bartlett, Pfau, and Wood, 2010]

[Figure (repeated from Page 4): bits/byte vs. stream length (log₁₀) for L = 10³, 10⁴, 10⁵, 10⁶, 10⁷ states]

SM with life-long learning computational characteristics.

Wikipedia: 26.8GB → 4GB (≈ 1.2 bits/byte)

L is the number of states in the SM; data is the head of the 2010 Wikipedia dump, depth 16.

Page 25:

Summary So Far

SM = powerful Bayesian nonparametric sequence model

Intuition: $\lim_{n \to \infty}$ smoothing $n$-gram
Byte predictive performance in human range
Bayesian-nonparametric ⟹ incompatible with lifelong learning
Non-trivial approximations preserve performance
Resulting model compatible with lifelong learning

Widely useful

Language modeling
Lossless compression
Plug-in replacement for Markov model component of any complex graphical model
Direct plug-in replacement for n-gram language model

Software: http://www.sequencememoizer.com/

Page 26:

Can We Do Better?

Problem:

No long-range coherence

Domain specificity

Solution:

Develop models that more efficiently encode long-range dependencies

Develop models that use domain-specific context

Graphical Pitman-Yor process (GPYP)

Page 27:

GPYP Application: Language Model Domain Adaptation

Page 28:

Doubly Hierarchical Pitman-Yor Process Language Model⁸

$$G^D_{\{\}} \sim \operatorname{PY}\!\left(d^D_0, \alpha^D_0,\; \lambda^D_0 U + (1 - \lambda^D_0)\, G^H_{\{\}}\right)$$

$$G^D_{\{w_{t-1}\}} \sim \operatorname{PY}\!\left(d^D_1, \alpha^D_1,\; \lambda^D_1 G^D_{\{\}} + (1 - \lambda^D_1)\, G^H_{\{w_{t-1}\}}\right)$$

$$\vdots$$

$$G^D_{\{w_{t-j}:w_{t-1}\}} \sim \operatorname{PY}\!\left(d^D_j, \alpha^D_j,\; \lambda^D_j G^D_{\{w_{t-j+1}:w_{t-1}\}} + (1 - \lambda^D_j)\, G^H_{\{w_{t-j}:w_{t-1}\}}\right)$$

$$x \mid w_{t-n+1}:w_{t-1} \sim G^D_{\{w_{t-n+1}:w_{t-1}\}}.$$
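As a rough illustration (my own posterior-mean approximation, not the paper's graphical Pitman-Yor inference): at every depth the domain model backs off to a $\lambda$-weighted mixture of its own shorter-context distribution and a general-domain model's distribution for the same context, so sparse in-domain counts are shrunk toward whichever source is available.

```python
def dhpyp_predictive(s, u, domain_counts, general_model, lam, c, vocab_size):
    """Posterior-mean sketch of the doubly hierarchical back-off recursion.

    domain_counts: dict mapping context tuple -> {word: count} from in-domain data
    general_model: assumed function (s, u) -> P_H(s | u), trained on the big corpus
    lam:           list of mixture weights, lam[j] for contexts of length j
    c:             concentration controlling the strength of the shrinkage
    """
    if len(u) == 0:
        base = lam[0] / vocab_size + (1.0 - lam[0]) * general_model(s, ())
    else:
        shorter = dhpyp_predictive(s, u[1:], domain_counts, general_model, lam, c, vocab_size)
        base = lam[len(u)] * shorter + (1.0 - lam[len(u)]) * general_model(s, u)
    counts = domain_counts.get(u, {})
    total = sum(counts.values())
    return (counts.get(s, 0) + c * base) / (total + c)
```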

⁸[Wood and Teh, 2009]

Page 29:

Domain Adaptation Schematic

[Figure: domain adaptation schematic showing the hierarchies for Domain 1 and Domain 2 tied together through latent switch priors]

Page 30:

Domain Adaptation Effect (Copacetic)

[Figure: test perplexity (≈250–450) vs. Brown corpus subset size (0 to 10⁶) for the DHPYPLM, Mixture, Union, and Baseline models]

Mixture: [Kneser and Steinbiss, 1993]; Union: [Bellegarda, 2004]

Training corpora

Small: State of the Union (SOU), 1945–1963 and 1969–2006, ∼370k tokens, ∼13k types

Big: Brown, 1967, ∼1m tokens, ∼50k types

Test corpus

Johnson’s SOU speeches, 1963–1969, ∼37k tokens

Page 31:

Domain Adaptation Effect (Not)

[Figure: test perplexity (≈180–320) vs. Brown corpus subset size (0 to 10⁶) for the DHPYPLM, Mixture, Union, and Baseline models]

Mixture: [Kneser and Steinbiss, 1993]; Union: [Bellegarda, 2004]

Training corpora

Small: Augmented Multi-Party Interaction (AMI), 2007, ∼800k tokens, ∼8k types

Big: Brown, 1967, ∼1m tokens, ∼50k types

Test corpus

AMI excerpt, 2007, ∼60k tokens

Page 32:

Domain Adaptation Cost-Benefit Analysis

[Figure: test perplexity (≈200–500) as a joint function of SOU corpus size (up to ≈1.2 × 10⁵) and Brown corpus size (up to 10⁶)]

Realistic scenario

Increasing Brown corpus size imposes computational cost ($)

Increasing SOU corpus size imposes data acquisition cost ($$$)

Page 33:

Summary

Whole Distributions As Random Variables

Hierarchical Modeling

Embracing Uncertainty

Page 34:

The End

Thank you.

Page 35:

References I

Nicholas Bartlett, David Pfau, and Frank Wood. Forgetting counts: Constant memory inference for a dependent hierarchical Pitman-Yor process. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 63–70, June 2010. URL http://www.icml2010.org/papers/549.pdf.

J. R. Bellegarda. Statistical language model adaptation: review and perspectives. Speech Communication, 42:93–108, 2004.

J. Gasthaus and Y. W. Teh. Improvements to the sequence memoizer. In Proceedings of Neural Information Processing Systems, pages 685–693, 2011.

J. Gasthaus, F. Wood, and Y. W. Teh. Lossless compression based on the Sequence Memoizer. In Data Compression Conference 2010, pages 337–345, 2010.

Page 36:

References II

M. W. Ho, L. F. James, and J. W. Lau. Coagulation fragmentation laws induced by general coagulations of two-parameter Poisson-Dirichlet processes. http://arxiv.org/abs/math.PR/0601608, 2006.

R. Kneser and V. Steinbiss. On the dynamic adaptation of stochastic language models. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, pages 586–589, 1993.

D. J. C. MacKay and L. C. Bauman Peto. A hierarchical Dirichlet language model. Natural Language Engineering, 1(2):289–307, 1995.

J. Pitman. Coalescents with multiple collisions. Annals of Probability, 27:1870–1902, 1999.

C. E. Shannon. Prediction and entropy of printed English. The Bell System Technical Journal, pages 50–64, 1951.

Y. W. Teh. A hierarchical Bayesian language model based on Pitman-Yor processes. In Proceedings of the Association for Computational Linguistics, pages 985–992, 2006.

Page 37:

References III

Y. W. Teh, M. I. Jordan, M. J. Beal, and D. M. Blei. Hierarchical Dirichlet processes. Journal of the American Statistical Association, 101(476):1566–1581, 2006.

F. Wood and Y. W. Teh. A hierarchical, hierarchical Pitman Yor process language model. In ICML/UAI Nonparametric Bayes Workshop, 2008.

F. Wood and Y. W. Teh. A hierarchical nonparametric Bayesian approach to statistical language model domain adaptation. In Artificial Intelligence and Statistics, 2009.

F. Wood, C. Archambeau, J. Gasthaus, L. James, and Y. W. Teh. A stochastic memoizer for sequence data. In Proceedings of the 26th International Conference on Machine Learning, pages 1129–1136, Montreal, Canada, 2009.

F. Wood, J. Gasthaus, C. Archambeau, L. James, and Y. W. Teh. The sequence memoizer. Communications of the ACM, 2011.
