Upload
joella-bennett
View
217
Download
0
Embed Size (px)
Citation preview
Carnegie Mellon
Christian Monson
ParaMorFinding Paradigms
Across Morphology
Christian Monson
2Carnegie Mellon
Christian Monson
Turkish Morphology – Beads on a String
take passive negativepresent
progressive2nd person singular
You are not being taken
3Carnegie Mellon
Christian Monson
götür ül m sunüyor
take passive negativepresent
progressive
You are not being taken
2nd person singular
Turkish Morphology – Beads on a String
4Carnegie Mellon
Christian Monson
Applications of Computational Morphology
• Machine Translation– Turkish-English (Oflazer, 2007)
– Czech-English (Goldwater and McClosky, 2005)
• Speech Recognition– Finnish (Creutz, 2006)
• Information Retrieval
5Carnegie Mellon
Christian Monson
Challenges of Computational Morphology
• Time Consuming for a New Language– Kemal Oflazer estimates
• 3-4 months to build basic Turkish analyzer• Plus lexicon development and maintenance
• Expertise Needed– Greenlandic
• Official language of Greenland• Agglutinative Inuit language• 50,000 speakers• Per Langaard
6Carnegie Mellon
Christian Monson
The SolutionRaw Text
Unsupervised Morphology
Induction
7Carnegie Mellon
Christian Monson
ParaMor – Paradigm MorphologyParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults• ParaMor
– Unsupervised morphology induction system
• Paradigm– The natural structure of morphology
8Carnegie Mellon
Christian Monson
Paradigms – The Structure of Morphology
ül m sunüyor
take passive negativepresent
progressive2nd person singular
Stem Voice PolarityTense &
MoodPerson & Number
götür
9Carnegie Mellon
Christian Monson
Paradigms – The Structure of Morphology
ül m umüyor
Stem Voice PolarityTense &
MoodPerson & Number
take passive negativepresent
progressive 1st person singular
umgötür
10Carnegie Mellon
Christian Monson
Paradigms – The Structure of Morphology
ül m umüyor
Stem Voice PolarityTense &
MoodPerson & Number
take passive negativepresent
progressive3rd person singular
umØ
götür
11Carnegie Mellon
Christian Monson
Paradigms – The Structure of Morphology
ül m umüyor
Stem Voice PolarityTense &
MoodPerson & Number
take passive negativepresent
progressive
1st person plural
umØuz
götür
12Carnegie Mellon
Christian Monson
Paradigms – The Structure of Morphology
ül m umüyor
Stem Voice PolarityTense &
MoodPerson & Number
take passive negativepresent
progressive
umØuz
götür
13Carnegie Mellon
Christian Monson
Paradigms – The Structure of Morphology
ül m umüyor
Stem Voice PolarityTense &
MoodPerson & Number
take passive negativefuture
umØuz
yecekgötür
14Carnegie Mellon
Christian Monson
Paradigms – The Structure of Morphology
ül m umüyor
Stem Voice PolarityTense &
MoodPerson & Number
take passive negative
umØuz
yecekgötür
15Carnegie Mellon
Christian Monson
Paradigms – The Structure of Morphology
ül m umüyor
Stem Voice PolarityTense &
MoodPerson & Number
umØuz
yecek
16Carnegie Mellon
Christian Monson
Paradigms – The Structure of Morphology
ül m umüyorumØuz
yecek
Paradigms
17Carnegie Mellon
Christian Monson
Paradigms – The Structure of Morphology
ül m umüyorumØuz
yecek
Paradigms
• Paradigm– Set of mutually replaceable strings
18Carnegie Mellon
Christian Monson
Paradigms – The Structure of Morphology
ül m umüyorumØuz
yecek
Paradigm
• Paradigm– Set of mutually replaceable strings
19Carnegie Mellon
Christian Monson
The ParaMor AlgorithmParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
• Identify suffix paradigms in 3 steps
20Carnegie Mellon
Christian Monson
The ParaMor AlgorithmParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
• Identify suffix paradigms in 3 steps1.Search for candidate paradigms
21Carnegie Mellon
Christian Monson
The ParaMor AlgorithmParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
• Identify suffix paradigms in 3 steps1.Search for candidate paradigms
2.Cluster candidates modeling the same paradigm
22Carnegie Mellon
Christian Monson
The ParaMor Algorithm
• Identify suffix paradigms in 3 steps1.Search for candidate paradigms
2.Cluster candidates modeling the same paradigm
3.Filter
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
23Carnegie Mellon
Christian Monson
The ParaMor Algorithm
• Identify suffix paradigms in 3 steps1.Search for candidate paradigms
2.Cluster candidates modeling the same paradigm
3.Filter
• Segment words – Using the discovered paradigms
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
24Carnegie Mellon
Christian Monson
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Search for Candidate Paradigms
• All character boundaries are candidate morpheme boundaries
25Carnegie Mellon
Christian Monson
s10662
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Search for Candidate Paradigms
autorizacionesbuscabamos
costasimportadoras
vallas…
• Begin search with the most frequent word-final string
Spanish
26Carnegie Mellon
Christian Monson
s10662
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Search for Candidate Paradigms
autorizacionesbuscabamos
costasimportadoras
vallas…
Ø s5501
• Identify the most frequent mutually replaceable string– Stems that occur with one
suffix in a paradigm will likely occur with other suffixes in that paradigm Spanish
27Carnegie Mellon
Christian Monson
s10662
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Search for Candidate Paradigms
• Stop adding suffixes – When the most frequent mutually
replaceable string severly decreases the stem count.
Ø s5501
Ø r s
287autorizaciones
buscabamoscostas
importadorasvallas
…
28Carnegie Mellon
Christian Monson
s10662
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Search for Candidate Paradigms
• Move on to the next most frequent word-final string
Ø s5501
Ø r s
287
a8981
29Carnegie Mellon
Christian Monson
a8981
s10662
a o2304
a o os
1410
a as o os892
Ø s5501
Ø r s
287
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Search for Candidate Paradigms
30Carnegie Mellon
Christian Monson
n6051
a8981
s10662
Ø n1874
Ø n r
509
Ø do n r354
Ø da das do dos n ndo r ron
118
a o2304
a o os
1410
a as o os892
Ø s5501
Ø r s
287
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Search for Candidate Paradigms
31Carnegie Mellon
Christian Monson
n6051
a8981
s10662
Ø n1874
Ø n r
509
Ø do n r354
Ø da das do dos n ndo r ron
118
a o2304
a o os
1410
a as o os892
Ø s5501
es2751
Ø es874
Ø r s
287
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Search for Candidate Paradigms
32Carnegie Mellon
Christian Monson
an1786
n6051
a8981
s10662
a an1049
a an ar
413
a an ar ó353
a ada adas ado ados an
ar aron ó149
Ø n1874
Ø n r
509
Ø do n r354
Ø da das do dos n ndo r ron
118
a o2304
a o os
1410
a as o os892
Ø s5501
es2751
Ø es874
Ø r s
287
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Search for Candidate Paradigms
33Carnegie Mellon
Christian Monson
...strado
15rado167
an1786
n6051
a8981
s10662
a an1049
a an ar
413
a an ar ó353
a ada adas ado ados an
ar aron ó149
rada radas rado rados
53
rada radorados
67
rada rado89
ra rada radasrado rados ran
rar raron ró23
Ø n1874
Ø n r
509
Ø do n r354
Ø da das do dos n ndo r ron
118
a o2304
a o os
1410
a as o os892
Ø s5501
strada strado12
strada strado stró
9
strada strado strar stró
8
strada stradas strado strar stró
7
es2751
Ø es874
Ø r s
287
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Search for Candidate Paradigms
...
34Carnegie Mellon
Christian Monson
Cluster Candidates per Paradigm
15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó22 Stems: anunci, aplic, apoy, celebr, concentr, …
330 Covered Types
15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó23 Stems: anunci, apoy, confirm, consider, declar, …
345 Covered Types
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
35Carnegie Mellon
Christian Monson
Cluster Candidates per Paradigm
15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó22 Stems: anunci, aplic, apoy, celebr, concentr, …
330 Covered Types
15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó23 Stems: anunci, apoy, confirm, consider, declar, …
345 Covered Types
16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría óCosine Similarity: 0.664
451 Covered Types
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
36Carnegie Mellon
Christian Monson
Cluster Candidates per Paradigm
15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó25 Stems: anunci, aplic, apoy, celebr, consider, …
375 Covered Types
15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó22 Stems: anunci, aplic, apoy, celebr, concentr, …
330 Covered Types
15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó23 Stems: anunci, apoy, confirm, consider, declar, …
345 Covered Types
16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría óCosine Similarity: 0.664
451 Covered Types
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
37Carnegie Mellon
Christian Monson
Cluster Candidates per Paradigm
15: a aba aban ada adas ado ados an ando ar aron arse ará arán ó25 Stems: anunci, aplic, apoy, celebr, consider, …
375 Covered Types
15: a aba ada adas ado ados an ando ar aron arse ará arán aría ó22 Stems: anunci, aplic, apoy, celebr, concentr, …
330 Covered Types
15: a aba ada adas ado ados an ando ar ara aron arse ará arán ó23 Stems: anunci, apoy, confirm, consider, declar, …
345 Covered Types
16: a aba ada adas ado ados an ando ar ara aron arse ará arán aría óCosine Similarity: 0.664
451 Covered Types
17: a aba aban ada adas ado ados an ando ar ara aron arse ará arán aría óCosine Similarity: 0.715
532 Covered Types
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
38Carnegie Mellon
Christian Monson
Filter Candidate ParadigmsParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
• 2 types of filtering1. Remove small unclustered
candidate paradigms
2. Remove candidates modeling unlikely morpheme boundaries (Harris, 1955)
39Carnegie Mellon
Christian Monson
Segment Words Using ParadigmsParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
administradas
40Carnegie Mellon
Christian Monson
Segment Words Using ParadigmsParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
administradas
a ada adas ado ados an ar aron ó ...
41Carnegie Mellon
Christian Monson
Segment Words Using ParadigmsParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
administradas
a ada adas ado ados an ar aron ó ...
administrada
42Carnegie Mellon
Christian Monson
Segment Words Using ParadigmsParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
administradas
administr +adas
administrada
a ada adas ado ados an ar aron ó ...
43Carnegie Mellon
Christian Monson
Segment Words Using ParadigmsParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
administradas
administr +adas
a as o os
administrada
44Carnegie Mellon
Christian Monson
Segment Words Using ParadigmsParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
administradas
administr +adas, administrad +as
a as o os
administrada
Old way: Separate alternative analysis
45Carnegie Mellon
Christian Monson
Segment Words Using ParadigmsParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
administradas
administr +adas, administrad +as
a as o os
administrada
administr +ad +as New way: Augment the current segmentation
46Carnegie Mellon
Christian Monson
Segment Words Using ParadigmsParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
administradas
administr +ad +a +s
Ø s
administradaØ
administr +adas, administrad +as, administrada +s
47Carnegie Mellon
Christian Monson
Morpho Challenge 2007ParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
• Peer operated competition – For unsupervised morphology
induction algorithms
• 4 languages– English– German– Finnish– Turkish
48Carnegie Mellon
Christian Monson
ParaMor in Morpho Challenge 2007ParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
• Developed on Spanish – ParaMor’s free parameters were
frozen
49Carnegie Mellon
Christian Monson
2 Methods of EvaluationParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
1. LinguisticSegmentations compared to a morphologically analyzed lexicon
Analysis Answer
administradas administr +ad +a +s administrar +Adj +Fem +Pl
administrada administr +ad +a administrar +Adj +Fem
50Carnegie Mellon
Christian Monson
2 Methods of EvaluationParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
1. LinguisticSegmentations compared to a morphologically analyzed lexicon
Analysis Answer
administradas administr +ad +a +s administrar +Adj +Fem +Pl
administrada administr +ad +a administrar +Adj +Fem
51Carnegie Mellon
Christian Monson
2 Methods of EvaluationParaMor
IdentifySearchClusterFilter
SegmentEvaluationResults
2. Task basedInformation retrieval– Short two-sentence queries– About international news topics – Binary relevance assessments – About 50 queries and 20K
relevance judgements for each language.
52Carnegie Mellon
Christian Monson
20
40
60
English German Finnish Turkish
Linguistic Evaluation
F1
Ber
nhar
d 2
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Mor
fess
or
47.2
53Carnegie Mellon
Christian Monson
20
40
60
English German Finnish Turkish
Linguistic Evaluation
F1
Ber
nhar
d 2
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
47.2
Mor
fess
or
Par
aMor
50.6
54Carnegie Mellon
Christian Monson
20
40
60
English German Finnish Turkish
Linguistic Evaluation
F1
Ber
nhar
d 2
Mor
fess
or
Par
aMor
Par
aMor
& M
orfe
ssor
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Ber
nhar
d 2
Mor
fess
or47.2
50.6 50.7
55Carnegie Mellon
Christian Monson
20
40
60
English German Finnish Turkish
Linguistic Evaluation
F1
Ber
nhar
d 2
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
50.7
Mor
fess
or
Par
aMor
Par
aMor
& M
orfe
ssor
60.8
56Carnegie Mellon
Christian Monson
20
40
60
English German Finnish Turkish
Linguistic Evaluation
F1
Ber
nhar
d 2
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Mor
fess
or
Par
aMor
Par
aMor
& M
orfe
ssor
60.8
56.3
57Carnegie Mellon
Christian Monson
20
40
60
English German Finnish Turkish
Linguistic Evaluation
F1
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Ber
nhar
d 2
Mor
fess
or
Par
aMor
Par
aMor
& M
orfe
ssor
Ber
nhar
d 2
Ber
nhar
d 2
Mor
fess
or
Par
aMor
Par
aMor
& M
orfe
ssor
60.8
56.352.9 53.4
58Carnegie Mellon
Christian Monson
20
40
60
English German Finnish Turkish
Linguistic Evaluation
F1
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Ber
nhar
d 2
Mor
fess
or
Par
aMor
Par
aMor
& M
orfe
ssor
Ber
nhar
d 2
Ber
nhar
d 2
Mor
fess
or
Par
aMor
Par
aMor
& M
orfe
ssor
60.8
56.352.9
53.4
59Carnegie Mellon
Christian Monson
20
40
60
English German Finnish Turkish
Linguistic Evaluation
F1
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Ber
nhar
d 2
Mor
fess
or
Par
aMor
Par
aMor
& M
orf.
Ber
nhar
d 2
Mor
fess
or
Par
aMor
Par
aMor
& M
orfe
ssor
Ber
nhar
d 2
Mor
fess
or
Par
aMor
Par
aMor
& M
orfe
ssor
60.8
56.352.9
53.4
48.2 48.5
60Carnegie Mellon
Christian Monson
20
40
60
English German Finnish Turkish
Linguistic Evaluation
F1
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Ber
nhar
d 2
Mor
fess
or
Par
aMor
Par
aMor
& M
orf.
Mor
fess
or
Par
aMor
Par
aMor
& M
orfe
ssor
Ber
nhar
d 2
Mor
fess
or
Par
aMor
Par
aMor
& M
orfe
ssor
Ber
nhar
d 2
Mor
fess
or
Par
aMor
Par
aMor
& M
orfe
ssor
60.8
56.352.9
53.4
48.2 48.5
24.7
52.0
61Carnegie Mellon
Christian Monson
20
35
English German Finnish
IR Evaluation (TF/IDF)
Average PrecisionM
orf.
P &
M
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
McN
amee
Par
.
27.0 – No Morphological Analysis
28.9
26.4
62Carnegie Mellon
Christian Monson
20
35
English German Finnish
IR Evaluation (TF/IDF)
Average PrecisionM
orf.
P &
M
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
McN
amee
Par
aMor 27.0 – No Morphological Analysis
28.9 29.3
63Carnegie Mellon
Christian Monson
20
35
English German Finnish
IR Evaluation (TF/IDF)
Average PrecisionM
orf.
P &
M
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Mor
fess
or
Par
aMor
McN
amee
Par
aMor
Mor
fess
or B
asel
ine
Par
aMor
& M
. 30.7 – No Morphological Analysis28.9 29.3
38.3
32.1
64Carnegie Mellon
Christian Monson
20
35
English German Finnish
IR Evaluation (TF/IDF)
Average PrecisionM
orf.
P &
M
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Mor
fess
or
Par
aMor
McN
amee
Par
aMor
Mor
fess
or B
asel
ine
Par
aMor
& M
. 30.7 – No Morphological Analysis28.9 29.3
38.3 38.2
65Carnegie Mellon
Christian Monson
20
35
English German Finnish
IR Evaluation (TF/IDF)
Average PrecisionM
orf.
P &
M
ParaMorIdentify
SearchClusterFilter
SegmentEvaluationResults
Mor
fess
or
Par
aMor
Mor
fess
or
Par
aMor
McN
amee
Par
aMor
Mor
fess
or B
asel
ine
Par
aMor
& M
orfe
ssor
Mor
fess
or B
asel
ine
Par
aMor
& M
orfe
ssor
32.0 – No Morphological Analysis
28.9 29.3
38.8 38.2
41.2
37.2
66Carnegie Mellon
Christian Monson
ParaMor: State-of-the-Art Unsupervised Morphology Induction System
• Combined system among the best in Morpho Challenge 2007
• Consistent across languages
• Better than no morphology– Task based (IR) measure
67Carnegie Mellon
Christian Monson
Many Future Directions
• Improve Performance– F1 of 50-60% is state-of-the-art!
– Inflection classes– Morphophonology
• Beyond beads-on-a-string
68Carnegie Mellon
Christian Monson