

4/28/2010


Statistical NLP, Spring 2010

Lecture 25: Diachronics

Dan Klein – UC Berkeley

Evolution: Main Phenomena

Mutations of sequences [figure; axis: Time]

Speciation [figure; axis: Time]


Tree of Languages

• Challenge: identify the phylogeny
• Much work in biology, e.g. work by Warnow, Felsenstein, Steele…
• Also in linguistics, e.g. Warnow et al., Gray and Atkinson…

http://andromeda.rutgers.edu/~jlynch/language.html

Statistical Inference Tasks

[Diagram with two columns, Inputs and Outputs. Labels shown:]

• Modern Text
• Phylogeny (FR, IT, PT, ES)
• Ancestral Word Forms (fuego, feu; focus)
• Cognate Groups / Translations (fuego, feu; huevo, oeuf)
• Grammatical Inference (les faits sont très clairs)


Outline

• Ancestral Word Forms (fuego, feu; focus)
• Cognate Groups / Translations (fuego, feu; huevo, oeuf)
• Grammatical Inference (les faits sont très clairs)

Language Evolution: Sound Change

Latin camera /kamera/ → French chambre /ʃambʁ/

• Deletion: /e/
• Change: /k/ .. /tʃ/ .. /ʃ/
• Insertion: /b/

Eng. camera from Latin, “camera obscura”

Eng. chamber from Old Fr. before the initial /t/ dropped


Diachronic Evidence

• “tonitru non tonotru” (Appendix Probi [ca 300])
• “tonight not tonite” (Yahoo! Answers [2009])

Synchronic (Comparative) Evidence


Simple Model: Single Characters

[Figure: phylogenetic tree whose nodes carry single characters (C, G)]
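As a rough illustration of a single-character model, the sketch below computes the likelihood of observed leaf characters on a small tree by recursing from the leaves upward. The tree shape, the two-letter alphabet, and the per-branch change probability `mu` are all assumptions made for the example; the slide does not specify them.

```python
import math

ALPHABET = "CG"  # assumed two-symbol alphabet, matching the characters on the slide


def transition(parent, child, mu):
    """P(child state | parent state) on one branch; mu is an assumed change probability."""
    return 1.0 - mu if parent == child else mu / (len(ALPHABET) - 1)


def subtree_likelihood(tree, mu):
    """Return {state: P(leaves under `tree` | this node is in `state`)}.

    A tree is either a leaf (an observed one-character string)
    or a tuple of subtrees.
    """
    if isinstance(tree, str):
        return {s: 1.0 if s == tree else 0.0 for s in ALPHABET}
    tables = [subtree_likelihood(child, mu) for child in tree]
    return {
        s: math.prod(
            sum(transition(s, t, mu) * table[t] for t in ALPHABET)
            for table in tables
        )
        for s in ALPHABET
    }


def root_posterior(tree, mu):
    """Posterior over the root character under a uniform root prior (an assumption)."""
    table = subtree_likelihood(tree, mu)
    total = sum(table.values())
    return {s: v / total for s, v in table.items()}


# Two leaves that both show "C": the root is most likely "C" as well.
posterior = root_posterior(("C", "C"), mu=0.1)
```

Because each node holds one character, the sum over hidden states is cheap here; the difficulty grows once nodes hold whole strings.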

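A Model for Word Forms

[Figure: phylogeny with a word form at each node and sound-change parameters θ on each branch]

• LA: focus /fokus/
• IB: /fogo/ (θ_IB)
• ES: fuego /fweɣo/ (θ_ES)
• PT: fogo /fogo/ (θ_PT)
• IT: fuoco /fwɔko/ (θ_IT)

[BGK 07]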


Contextual Changes

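/fokus/ → /fwɔko/ (branch parameters θ_IT)

[Figure: symbol-by-symbol derivation in context; surviving labels: “f # o” and “f # w ɔ”]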

Changes are Systematic

[Figure: the same tree shown for two cognate sets]

• /fokus/: /fweɣo/, /fogo/, /fogo/, /fwɔko/
• /kentrum/: /sentro/, /sentro/, /sentro/, /tʃɛntro/


Experimental Setup

• Data sets
  • Small: Romance
    • French, Italian, Portuguese, Spanish
    • 2344 words
    • Complete cognate sets
    • Target: (Vulgar) Latin
  • Large: Oceanic
    • 661 languages
    • 140K words
    • Incomplete cognate sets
    • Target: Proto-Oceanic [Blust, 1993]

[Figure: tree over FR, IT, PT, ES]

Data: Romance


Learning: Objective

$\max_{\theta} P(\theta \mid w_1, \ldots, w_L)$

[Figure: the /fokus/ cognate tree (/fokus/, /fweɣo/, /fogo/, /fogo/, /fwɔko/), with variables labeled z and w]
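The objective can be unpacked one step further. This is a sketch under an assumption the slide labels only hint at: that $w_1, \ldots, w_L$ are the observed word forms and $z$ collects the unobserved ones (ancestral forms and derivations).

```latex
\hat{\theta}
  = \arg\max_{\theta} P(\theta \mid w_1, \ldots, w_L)
  = \arg\max_{\theta} \; P(\theta) \sum_{z} P(w_1, \ldots, w_L, z \mid \theta)
```

The second equality is Bayes' rule with the normalizer dropped, since it does not depend on $\theta$. The sum over $z$ is what makes direct maximization awkward.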

Learning: EM

[Figure: the /fokus/ cognate tree (/fokus/, /fweɣo/, /fogo/, /fogo/, /fwɔko/)]

• M-Step
  • Find parameters which fit (expected) sound change counts
  • Easy: gradient ascent on theta
• E-Step
  • Find (expected) change counts given parameters
  • Hard: variables are string-valued
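A minimal sketch of the M-step idea, assuming a plain softmax over the possible outcomes of one sound in one context. The outcome names and the expected counts are invented for illustration, and the actual parameterization used in the lecture's model is not shown on the slide.

```python
import math


def softmax(theta):
    """Turn unnormalized scores into a probability distribution."""
    top = max(theta.values())
    exps = {k: math.exp(v - top) for k, v in theta.items()}
    norm = sum(exps.values())
    return {k: e / norm for k, e in exps.items()}


def m_step(expected_counts, steps=2000, lr=0.5):
    """Fit softmax scores to expected counts by gradient ascent.

    The objective is the count-weighted log-likelihood (divided by the
    total count); its gradient for outcome k is (c_k / total) - p_k.
    """
    total = sum(expected_counts.values())
    theta = {k: 0.0 for k in expected_counts}
    for _ in range(steps):
        probs = softmax(theta)
        for k in theta:
            theta[k] += lr * (expected_counts[k] / total - probs[k])
    return softmax(theta)


# Hypothetical expected counts, as an E-step might produce them.
counts = {"o->o": 6.5, "o->wɔ": 3.0, "o->(deleted)": 0.5}
fitted = m_step(counts)
```

With no features beyond outcome identity, the fitted distribution simply approaches the normalized counts; gradient ascent earns its keep when parameters are shared across contexts.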


Computing Expectations

[Figure: cognate set for ‘grass’]

Standard approach, e.g. [Holmes 2001]: Gibbs sampling each sequence

[Holmes 01, BGK 07]

A Gibbs Sampler

[Figure sequence: successive Gibbs sampling steps on the ‘grass’ cognate set]
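To make the resampling loop concrete, here is a toy Gibbs sampler in which every node holds a single character instead of a whole string. That simplification, the tree, the mutation probability `MU`, and the observed leaves are all assumptions for illustration; the samplers cited above operate on sequences.

```python
import random

ALPHABET = "CG"
MU = 0.1  # assumed per-branch change probability

# Hypothetical tree, given as child -> parent (None marks the root).
PARENT = {"root": None, "anc": "root", "l1": "anc", "l2": "anc", "l3": "root"}
OBSERVED = {"l1": "C", "l2": "C", "l3": "G"}  # made-up leaf observations


def trans(parent, child):
    """P(child state | parent state) on one branch."""
    return 1.0 - MU if parent == child else MU / (len(ALPHABET) - 1)


def gibbs(n_sweeps, seed=0):
    """Estimate posterior marginals of the hidden nodes by Gibbs sampling."""
    rng = random.Random(seed)
    state = dict(OBSERVED)
    hidden = [n for n in PARENT if n not in OBSERVED]
    for node in hidden:
        state[node] = rng.choice(ALPHABET)
    counts = {n: {s: 0 for s in ALPHABET} for n in hidden}
    for _ in range(n_sweeps):
        for node in hidden:
            weights = []
            for s in ALPHABET:
                # Prior term: uniform at the root, else transition from the parent.
                if PARENT[node] is None:
                    w = 1.0 / len(ALPHABET)
                else:
                    w = trans(state[PARENT[node]], s)
                # Likelihood terms: one transition per child of this node.
                for child, par in PARENT.items():
                    if par == node:
                        w *= trans(s, state[child])
                weights.append(w)
            state[node] = rng.choices(ALPHABET, weights=weights)[0]
            counts[node][state[node]] += 1
    return {n: {s: c / n_sweeps for s, c in counts[n].items()} for n in hidden}


marginals = gibbs(2000)
```

Each update conditions only on a node's parent and children, so one step changes one node at a time.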



Getting Stuck

How to jump to a state where the liquids /r/ and /l/ have a common ancestor?


Solution: Vertical Slices

[Figure: Single Sequence Resampling vs. Ancestry Resampling]

[BGK 08]

Details: Defining “Slices”

The sampling domains (kernels) are indexed by contiguous subsequences (anchors) of the observed leaf sequences.

Correct construction section(G) is non-trivial but very efficient.

[Figure label: anchor]


Results: Alignment Efficiency

Is ancestry resampling faster than basic Gibbs?

Hypothesis: Larger gains for deeper trees

Setup: Fixed wall time; synthetic data, same parameters

[Figure: Sum of Pairs (y-axis) vs. depth of the phylogenetic tree (x-axis)]

Results: Romance


Learned Rules / Mutations


Comparison to Other Methods

• Evaluation metric: edit distance from a reconstruction made by a linguist (lower is better)
• Comparison to system from [Oakes, 2000]
  • Uses exact inference and deterministic rules
• Reconstruction of Proto-Malayo-Javanic, cf. [Nothofer, 1975]

[Figure: results for “Us” vs. “Oakes”]
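The evaluation metric above can be sketched as a standard Levenshtein distance averaged over word pairs. Unit costs for insertion, deletion, and substitution are an assumption, as are the example pairs; the slides do not say how the metric was weighted or normalized.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,             # delete x
                cur[j - 1] + 1,          # insert y
                prev[j - 1] + (x != y),  # substitute, or match at no cost
            ))
        prev = cur
    return prev[-1]


def mean_edit_distance(pairs):
    """Average distance over (system reconstruction, reference) pairs."""
    return sum(edit_distance(a, b) for a, b in pairs) / len(pairs)


# Hypothetical system outputs paired with a reference reconstruction.
pairs = [("fokus", "fokus"), ("fogo", "fokus")]
score = mean_edit_distance(pairs)
```

Strings are compared symbol by symbol here, so multi-character phonemes such as /tʃ/ would need to be tokenized into lists first.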

Data: Oceanic

Proto-Oceanic


Data: Oceanic

http://language.psy.auckland.ac.nz/austronesian/research.php

Result: More Languages Help

[Figure: mean edit distance (y-axis) vs. number of modern languages used (x-axis); distance measured from [Blust, 1993] reconstructions]


Results: Large Scale

Visualization: Learned universals

*The model did not have features encoding natural classes


Regularity and Functional Load

In a language, some pairs of sounds are more contrastive than others (higher functional load).

Example: English “p”/“d” versus “t”/“th”

• “p”/“d”: pot/dot, pin/din, dress/press, pew/dew, ...
• “t”/“th”: thin/tin
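One crude way to make "more contrastive" concrete is to count minimal pairs: words that differ in exactly one sound. The sketch below does that over a tiny lexicon built from the word pairs listed above (pot/dot, pin/din, ...). The phoneme transcriptions are simplified assumptions, and a raw minimal-pair count is only a proxy, not the functional load measure of [King, 67].

```python
from itertools import combinations


def minimal_pairs(lexicon, x, y):
    """Count word pairs that differ in exactly one position, by x versus y."""
    count = 0
    for a, b in combinations(lexicon, 2):
        if len(a) != len(b):
            continue
        diffs = [(p, q) for p, q in zip(a, b) if p != q]
        if len(diffs) == 1 and set(diffs[0]) == {x, y}:
            count += 1
    return count


# Rough, simplified transcriptions of the slide's example words (assumed).
LEXICON = [
    ("p", "o", "t"), ("d", "o", "t"),            # pot / dot
    ("p", "i", "n"), ("d", "i", "n"),            # pin / din
    ("p", "r", "e", "s"), ("d", "r", "e", "s"),  # press / dress
    ("p", "j", "u"), ("d", "j", "u"),            # pew / dew
    ("θ", "i", "n"), ("t", "i", "n"),            # thin / tin
]

p_d = minimal_pairs(LEXICON, "p", "d")
t_th = minimal_pairs(LEXICON, "t", "θ")
```

On this toy lexicon the first contrast separates more word pairs than the second, which is the intuition the example is after.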

Functional Load: Timeline

Functional Load Hypothesis (FLH): sound changes are less frequent when they merge phonemes with high functional load [Martinet, 55]

Previous research within linguistics: “FLH does not seem to be supported by the data” [King, 67]

Caveat: only four languages were used in King’s study [Hockett 67; Surendran et al., 06]

Our work: we reexamined the question with two orders of magnitude more data [BGK, under review]


Regularity and Functional Load

Data: only 4 languages from the Austronesian data

[Figure: merger posterior probability (y-axis) vs. functional load as computed by [King, 67] (x-axis). Each dot is a sound change identified by the system.]

Regularity and Functional Load

Data: all 637 languages from the Austronesian data

[Figure: merger posterior probability (y-axis) vs. functional load as computed by [King, 67] (x-axis)]


Outline

• Ancestral Word Forms (fuego, feu; focus)
• Cognate Groups / Translations (fuego, feu; huevo, oeuf)
• Grammatical Inference (les faits sont très clairs)

Cognate Groups

• ‘fire’: /fweɣo/, /fogo/, /fwɔko/
• /berβo/, /vɛrbo/, /vɛrbo/
• /tʃɛntro/, /sentro/, /sɛntro/


Model: Cognate Survival

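[Figure: phylogeny LA → {IB, IT}, IB → {ES, PT}, shown twice. First with word forms: LA omnis, IT ogni, and no cognate (-) at IB, ES, PT. Then with survival indicators: + at LA and IT, - at IB, ES, PT.]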
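A survival model of this kind can be sketched as follows, assuming a cognate present at a node is kept on each outgoing branch with one shared probability, and that a lost cognate never reappears. The single `survive` parameter and its value are assumptions; the slide only shows the +/- pattern on the tree.

```python
# Topology read off the slide labels: LA at the root, IB above ES and PT.
TREE = {"LA": ["IB", "IT"], "IB": ["ES", "PT"]}


def pattern_prob(node, present, observed, survive):
    """P(observed leaf pattern below `node` | cognate present/absent at `node`)."""
    if node in observed:
        return 1.0 if observed[node] == present else 0.0
    prob = 1.0
    for child in TREE[node]:
        kept = pattern_prob(child, True, observed, survive)
        lost = pattern_prob(child, False, observed, survive)
        if present:
            # The cognate is either kept or lost on the branch to this child.
            prob *= survive * kept + (1.0 - survive) * lost
        else:
            # Once lost, it stays lost in this sketch.
            prob *= lost
    return prob


# The slide's example: Latin omnis survives in Italian (ogni) only.
OMNIS = {"IT": True, "ES": False, "PT": False}
p_omnis = pattern_prob("LA", True, OMNIS, survive=0.8)
```

The recursion sums over whether the cognate was still present at the unobserved IB node, which is why the all-absent Iberian pattern remains fairly likely.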

Results: Grouping Accuracy

[Figure: fraction of words correctly grouped (y-axis) by method (x-axis)]

[Hall and Klein, in submission]


Semantics: Matching Meanings

• EN day: occurs with “night”, “sun”, “week”
• DE tag: occurs with “nacht”, “sonne”, “woche”
• EN tag: occurs with “name”, “label”, “along”
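The contrast between the two candidate matches for German "tag" can be sketched as a context-overlap score. This assumes a small seed dictionary is available to translate context words; the dictionary, the Jaccard score, and the function name are illustrative choices, not the method from the lecture.

```python
def context_similarity(src_context, tgt_context, seed_dict):
    """Jaccard overlap between translated source contexts and target contexts."""
    translated = {seed_dict[w] for w in src_context if w in seed_dict}
    target = set(tgt_context)
    union = translated | target
    return len(translated & target) / len(union) if union else 0.0


# Hypothetical seed dictionary (German -> English).
SEED = {"nacht": "night", "sonne": "sun", "woche": "week"}

de_tag = {"nacht", "sonne", "woche"}
en_day = {"night", "sun", "week"}
en_tag = {"name", "label", "along"}

sim_day = context_similarity(de_tag, en_day, SEED)
sim_tag = context_similarity(de_tag, en_tag, SEED)
```

Spelling alone would favor English "tag"; the context score favors "day", so the two signals need to be combined.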

Outline

• Ancestral Word Forms (fuego, feu; focus)
• Cognate Groups / Translations (fuego, feu; huevo, oeuf)
• Grammatical Inference (les faits sont très clairs)


Grammar Induction

Example sentences: “congress narrowly passed the amended bill”; “les faits sont très clairs”

Task: Given sentences, infer grammar (and parse tree structures)

Shared Prior

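[Figure: the example sentences “congress narrowly passed the amended bill” and “les faits sont très clairs”]

$P(\theta_\ell \mid \theta) = \mathcal{N}(\theta, \sigma^2 I)$

$P(\theta) = \mathcal{N}(0, \sigma^2 I)$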
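The two Gaussian factors of the shared prior can be evaluated directly. The sketch below computes the joint log prior of a shared parameter vector and per-language vectors, assuming the same variance in both factors (as written on the slide) and made-up two-dimensional parameters.

```python
import math


def log_gaussian(x, mean, sigma2):
    """Log density of N(mean, sigma2 * I) evaluated at x."""
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * (len(x) * math.log(2 * math.pi * sigma2) + sq_dist / sigma2)


def log_shared_prior(shared, per_language, sigma2):
    """log P(theta) + sum over languages of log P(theta_l | theta)."""
    total = log_gaussian(shared, [0.0] * len(shared), sigma2)
    for theta_l in per_language.values():
        total += log_gaussian(theta_l, shared, sigma2)
    return total


# Hypothetical 2-dimensional parameters for two languages.
shared = [0.0, 0.0]
close = {"en": [0.1, 0.0], "fr": [0.0, -0.1]}
far = {"en": [3.0, 0.0], "fr": [0.0, -3.0]}

lp_close = log_shared_prior(shared, close, sigma2=1.0)
lp_far = log_shared_prior(shared, far, sigma2=1.0)
```

Language-specific parameters that stay near the shared vector score higher, which is how the prior ties the grammars together.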


Results: Phylogenetic Prior

[Figure: per-language results for Dutch, Danish, Swedish, Spanish, Portuguese, Slovene, Chinese, and English (axis from 0 to 70), with phylogenetic groupings labeled WG, NG, RM, G, IE, GL]

Avg rel gain: 29%

Conclusion

• Phylogeny-structured models can:
  • Accurately reconstruct ancestral words
  • Give evidence to open linguistic debates
  • Detect translations from form and context
  • Improve language learning algorithms
• Lots of questions still open:
  • Can we get better phylogenies using these high-res models?
  • What do these models have to say about the very earliest languages? Proto-world?


Thank you!

nlp.cs.berkeley.edu