ParaMor: Finding Paradigms Across Morphology

Christian Monson, Carnegie Mellon


Turkish Morphology – Beads on a String

götür  +ül       +m         +üyor                 +sun
take    passive   negative   present progressive   2nd person singular

You are not being taken

Applications of Computational Morphology

• Machine Translation
  – Turkish-English (Oflazer, 2007)
  – Czech-English (Goldwater and McClosky, 2005)
• Speech Recognition
  – Finnish (Creutz, 2006)
• Information Retrieval

Challenges of Computational Morphology

• Time Consuming for a New Language
  – Kemal Oflazer estimates
    • 3-4 months to build basic Turkish analyzer
    • Plus lexicon development and maintenance
• Expertise Needed
  – Greenlandic
    • Official language of Greenland
    • Agglutinative Inuit language
    • 50,000 speakers
    • Per Langaard

The Solution

[Diagram: Raw Text as input to Unsupervised Morphology Induction]

ParaMor – Paradigm Morphology

Talk outline: ParaMor · Identify (Search, Cluster, Filter) · Segment · Evaluation · Results

• ParaMor
  – Unsupervised morphology induction system
• Paradigm
  – The natural structure of morphology

Paradigms – The Structure of Morphology

Stem    Voice     Polarity   Tense & Mood          Person & Number
götür   ül        m          üyor                  sun
take    passive   negative   present progressive   2nd person singular

Strings that can fill the same slot:
  Tense & Mood:     üyor (present progressive), yecek (future)
  Person & Number:  sun (2nd person singular), um (1st person singular), Ø (3rd person singular), uz (1st person plural)

• Paradigm
  – Set of mutually replaceable strings
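The definition above can be illustrated with a short sketch. It treats the Person & Number slot of the Turkish example as a set of mutually replaceable strings and swaps each member into the same word frame. The names `PERSON_NUMBER` and `inflect` are illustrative assumptions, and the empty string stands in for Ø.

```python
# Illustrative sketch: a paradigm as a set of mutually replaceable strings.
# Slot contents come from the slides; the empty string stands in for Ø.
PERSON_NUMBER = {
    "sun": "2nd person singular",
    "um": "1st person singular",
    "": "3rd person singular",
    "uz": "1st person plural",
}


def inflect(stem, voice, polarity, tense, person):
    """Concatenate one string per slot, beads-on-a-string style."""
    return stem + voice + polarity + tense + person


# Hold every other slot fixed and replace only the Person & Number suffix.
forms = {
    label: inflect("götür", "ül", "m", "üyor", suffix)
    for suffix, label in PERSON_NUMBER.items()
}

for label, form in forms.items():
    print(f"{form:18} {label}")
```

Only the last slot changes between the four forms, which is the property the unsupervised search later exploits.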

The ParaMor Algorithm

• Identify suffix paradigms in 3 steps
  1. Search for candidate paradigms
  2. Cluster candidates modeling the same paradigm
  3. Filter
• Segment words
  – Using the discovered paradigms


Search for Candidate Paradigms

• All character boundaries are candidate morpheme boundaries

• Begin search with the most frequent word-final string
  – Spanish: s (stem count 10662), as in autorizaciones, buscabamos, costas, importadoras, vallas, …
• Identify the most frequent mutually replaceable string
  – Stems that occur with one suffix in a paradigm will likely occur with other suffixes in that paradigm
• Stop adding suffixes
  – When the most frequent mutually replaceable string severely decreases the stem count
• Move on to the next most frequent word-final string

Search paths in Spanish (each candidate suffix set is followed by its stem count):

s (10662) → Ø s (5501) → Ø r s (287)
a (8981) → a o (2304) → a o os (1410) → a as o os (892)
n (6051) → Ø n (1874) → Ø n r (509) → Ø do n r (354) → Ø da das do dos n ndo r ron (118)
es (2751) → Ø es (874)
an (1786) → a an (1049) → a an ar (413) → a an ar ó (353) → a ada adas ado ados an ar aron ó (149)
rado (167) → rada rado (89) → rada rado rados (67) → rada radas rado rados (53) → ra rada radas rado rados ran rar raron ró (23)
strado (15) → strada strado (12) → strada strado stró (9) → strada strado strar stró (8) → strada stradas strado strar stró (7)
...
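The search described above can be sketched in a few lines. This is a simplified illustration, not ParaMor's implementation: the function names, the toy vocabulary, and the `min_ratio` stopping threshold are all assumptions made for the example. Every character boundary is treated as a candidate morpheme boundary, and a suffix set grows greedily by adding the suffix that keeps the most stems.

```python
from collections import defaultdict


def index_splits(vocabulary):
    """Map each candidate suffix to the set of stems it follows.

    Every character boundary is a candidate morpheme boundary;
    the empty string plays the role of the null suffix Ø.
    """
    stems_by_suffix = defaultdict(set)
    for word in vocabulary:
        for cut in range(1, len(word) + 1):
            stems_by_suffix[word[cut:]].add(word[:cut])
    return stems_by_suffix


def grow_paradigm(seed, stems_by_suffix, min_ratio=0.25):
    """Greedily add the suffix that keeps the most stems.

    min_ratio is an assumed threshold: stop when the best extension
    keeps fewer than that fraction of the current stems.
    """
    suffixes = {seed}
    stems = set(stems_by_suffix[seed])
    while True:
        best, best_stems = None, set()
        for candidate, candidate_stems in stems_by_suffix.items():
            if candidate in suffixes:
                continue
            shared = stems & candidate_stems
            if len(shared) > len(best_stems):
                best, best_stems = candidate, shared
        if best is None or len(best_stems) < min_ratio * len(stems):
            return suffixes, stems
        suffixes.add(best)
        stems = best_stems


# Toy vocabulary, invented for illustration only.
vocabulary = ["costa", "costas", "costo", "costos",
              "valla", "vallas", "gata", "gatas", "gato", "gatos"]
index = index_splits(vocabulary)

print(grow_paradigm("s", index))
print(grow_paradigm("a", index))
```

On this toy data the seed `s` grows into the set Ø, s, and the seed `a` grows into a, as, o, os, losing the stem `vall` along the way because `vallo` and `vallos` never occur. A real corpus needs the stopping rule to keep a candidate from absorbing unrelated suffixes.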

Cluster Candidates per Paradigm

Candidates:

15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó
    25 Stems: anunci, aplic, apoy, celebr, consider, …
    375 Covered Types

15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó
    22 Stems: anunci, aplic, apoy, celebr, concentr, …
    330 Covered Types

15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó
    23 Stems: anunci, apoy, confirm, consider, declar, …
    345 Covered Types

Merged clusters:

16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría ó
    Cosine Similarity: 0.664
    451 Covered Types

17: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría ó
    Cosine Similarity: 0.715
    532 Covered Types
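The similarity scores above are consistent with a cosine computed over sets of covered types, where a covered type is one stem plus suffix combination. The sketch below assumes that reading, and also assumes a merged cluster covers the union of the two inputs; `cosine` and `cosine_from_counts` are illustrative names.

```python
import math


def cosine(covered_a, covered_b):
    """Cosine similarity of two sets viewed as binary vectors."""
    if not covered_a or not covered_b:
        return 0.0
    shared = len(covered_a & covered_b)
    return shared / math.sqrt(len(covered_a) * len(covered_b))


def cosine_from_counts(size_a, size_b, size_union):
    """Same measure from set sizes, using |A ∩ B| = |A| + |B| - |A ∪ B|."""
    shared = size_a + size_b - size_union
    return shared / math.sqrt(size_a * size_b)


# Counts from the slide: candidates covering 330 and 345 types merge into
# a 451-type cluster, which then merges with the 375-type candidate into 532.
first_merge = cosine_from_counts(330, 345, 451)
second_merge = cosine_from_counts(451, 375, 532)
print(round(first_merge, 3), round(second_merge, 3))  # prints 0.664 0.715
```

Under those assumptions the covered-type counts alone reproduce both similarity values shown on the slide.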

Filter Candidate Paradigms

• 2 types of filtering
  1. Remove small unclustered candidate paradigms
  2. Remove candidates modeling unlikely morpheme boundaries (Harris, 1955)
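The second filter cites Harris (1955), whose idea is that a morpheme boundary is likely where many different letters can follow a word-initial string. A minimal sketch of that letter successor count follows. It is not ParaMor's actual filter, whose details the slide does not give; `successor_variety` and the toy vocabulary are assumptions for illustration.

```python
def successor_variety(prefix, vocabulary):
    """Count the distinct letters that follow `prefix` across the vocabulary."""
    followers = {
        word[len(prefix)]
        for word in vocabulary
        if word.startswith(prefix) and len(word) > len(prefix)
    }
    return len(followers)


# Toy vocabulary, invented for illustration only.
vocabulary = ["administra", "administrada", "administre",
              "administro", "administrar"]

# Several letters follow "administr", so a boundary there is plausible;
# only one letter follows "administ", so a boundary there is unlikely.
print(successor_variety("administr", vocabulary))
print(successor_variety("administ", vocabulary))
```

A filter in this spirit would discard candidate paradigms whose proposed stem boundaries fall at positions with little variety.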

Segment Words Using Paradigms

Word to segment: administradas

Paradigm: a ada adas ado ados an ar aron ó ...
  Related word: administrada
  Boundary: administr +adas

Paradigm: a as o os
  Related word: administrada
  Boundary: administrad +as

Paradigm: Ø s
  Related word: administrada (administrada +Ø)
  Boundary: administrada +s

Old way: Separate alternative analyses
  administr +adas, administrad +as, administrada +s

New way: Augment the current segmentation
  administr +ad +a +s
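The "new way" above can be sketched as collecting the boundary each paradigm proposes and cutting the word at all of them. This is an illustrative reading of the slides, not ParaMor's code: the function name `segment`, the rule that a boundary needs a related word in the vocabulary, and the two-word vocabulary are assumptions. The first paradigm lists only the suffixes visible on the slide, and the empty string stands in for Ø.

```python
def segment(word, paradigms, vocabulary):
    """Segment `word` by combining the boundaries every paradigm proposes."""
    boundaries = set()
    for paradigm in paradigms:
        for suffix in paradigm:
            if suffix == "" or not word.endswith(suffix):
                continue
            stem = word[: len(word) - len(suffix)]
            if not stem:
                continue
            # Accept the boundary only if swapping in another suffix of the
            # same paradigm yields a word that was actually observed.
            if any(stem + other in vocabulary
                   for other in paradigm if other != suffix):
                boundaries.add(len(stem))
    cuts = [0] + sorted(boundaries) + [len(word)]
    pieces = [word[start:end] for start, end in zip(cuts, cuts[1:])]
    return " +".join(pieces)


paradigms = [
    {"a", "ada", "adas", "ado", "ados", "an", "ar", "aron", "ó"},
    {"a", "as", "o", "os"},
    {"", "s"},
]
vocabulary = {"administradas", "administrada"}

print(segment("administradas", paradigms, vocabulary))
print(segment("administrada", paradigms, vocabulary))
```

Keeping boundaries in one set is what turns three alternative analyses into a single augmented segmentation.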

Morpho Challenge 2007

• Peer operated competition
  – For unsupervised morphology induction algorithms
• 4 languages
  – English
  – German
  – Finnish
  – Turkish

ParaMor in Morpho Challenge 2007

• Developed on Spanish
  – ParaMor’s free parameters were frozen

2 Methods of Evaluation

1. Linguistic
   Segmentations compared to a morphologically analyzed lexicon

   Word            Analysis               Answer
   administradas   administr +ad +a +s    administrar +Adj +Fem +Pl
   administrada    administr +ad +a       administrar +Adj +Fem


2. Task based
   Information retrieval
   – Short two-sentence queries
   – About international news topics
   – Binary relevance assessments
   – About 50 queries and 20K relevance judgements for each language
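The IR results in this talk are reported as average precision. With binary relevance assessments, the standard per-query computation can be sketched as below. The scoring script actually used by the competition is not described in the slides, so the function name and the toy ranking are assumptions.

```python
def average_precision(ranked_doc_ids, relevant_doc_ids):
    """Average precision for one query under binary relevance."""
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_doc_ids, start=1):
        if doc_id in relevant_doc_ids:
            hits += 1
            # Precision measured at the rank of each relevant document.
            precision_sum += hits / rank
    if not relevant_doc_ids:
        return 0.0
    return precision_sum / len(relevant_doc_ids)


# Toy ranking, invented for illustration only.
ranking = ["d3", "d1", "d7", "d2"]
relevant = {"d3", "d2"}
score = average_precision(ranking, relevant)
print(score)  # prints 0.75
```

Averaging this score over all queries for a language gives a single figure per system.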

Linguistic Evaluation

[Bar chart: F1 (axis 20 to 60) for English, German, Finnish, and Turkish. Systems compared: Bernhard 2, Morfessor, ParaMor, and ParaMor & Morfessor. Values labeled as the chart builds: 47.2 (Morfessor), 50.6 (ParaMor), 50.7 (ParaMor & Morfessor), then 60.8, 56.3, 52.9, 53.4, 48.2, 48.5, 24.7, and 52.0, whose bar assignments are not recoverable.]

IR Evaluation (TF/IDF)

[Bar chart: Average Precision (axis 20 to 35) for English, German, and Finnish. Systems compared: Morfessor, Morfessor Baseline, ParaMor, ParaMor & Morfessor, and McNamee. "No Morphological Analysis" reference levels are labeled 27.0, 30.7, and 32.0. Other values labeled as the chart builds: 28.9, 26.4, 29.3, 38.3, 32.1, 38.2, 38.8, 41.2, and 37.2, whose bar assignments are not recoverable.]

ParaMor: State-of-the-Art Unsupervised Morphology Induction System

• Combined system among the best in Morpho Challenge 2007
• Consistent across languages
• Better than no morphology
  – Task based (IR) measure

Many Future Directions

• Improve Performance
  – F1 of 50-60% is state-of-the-art!
  – Inflection classes
  – Morphophonology
• Beyond beads-on-a-string
