CGMIL 2008 - Hyderabad - India
An Italian-English dependency parser and its [possible] application to Hindi
Leonardo Lesmo
Natural Language Processing Group (Dip. Informatica – Univ. Torino)
(http://www.di.unito.it/gull)
OUTLINE
• The Turin University Parser
• Performances
• The Turin University Treebank (TUT)
• Mapping between TUT and AnnCorra
• Current activities and the future
THE PARSER
[Figure: architecture of the parser. Box labels: Dictionary, Morphology, Lexical access, Tagging rules, POS tagging, Chunking rules, Chunking, Analysis of Conjunctions, Segmentation, Verbal subcategories, Verbal frames, Verbal Attachment, Post-processing.]
AN EXAMPLE
When the man that you mentioned sent me that beautiful message, I fell in love with him
chunking:
When [the man] that you mentioned sent me [that beautiful message], I fell [in love] [with him]
segmentation:
{{When [the man] {that you mentioned} sent me [that beautiful message]}, I fell [in love] [with him] }
caseframing
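The bracket notation of the example can be read as a nested structure: square brackets mark chunks and curly braces mark clause segments. The following is a minimal sketch of a reader for that notation only; it is not the parser's chunking or segmentation algorithm, and the names `parse_brackets` and `chunks` are hypothetical.

```python
import re

OPENERS = {"[": ("chunk", "]"), "{": ("segment", "}")}

def parse_brackets(text):
    """Turn '[...]' and '{...}' markers into nested (kind, children) tuples."""
    root = ("sentence", [])
    stack = [(root, None)]  # (open node, closing bracket it is waiting for)
    for tok in re.findall(r"[\[\]{}]|[^\s\[\]{}]+", text):
        if tok in OPENERS:
            kind, closer = OPENERS[tok]
            node = (kind, [])
            stack[-1][0][1].append(node)
            stack.append((node, closer))
        elif tok in ("]", "}"):
            if len(stack) == 1 or stack[-1][1] != tok:
                raise ValueError(f"unbalanced {tok!r}")
            stack.pop()
        else:
            stack[-1][0][1].append(tok)  # plain word or punctuation mark
    if len(stack) != 1:
        raise ValueError("unclosed bracket")
    return root

def chunks(node):
    """Collect the text of every chunk, depth first."""
    kind, children = node
    found = []
    if kind == "chunk":
        found.append(" ".join(c for c in children if isinstance(c, str)))
    for child in children:
        if isinstance(child, tuple):
            found.extend(chunks(child))
    return found

segmented = ("{{When [the man] {that you mentioned} sent me "
             "[that beautiful message]}, I fell [in love] [with him] }")
tree = parse_brackets(segmented)
print(chunks(tree))
```

The outermost segment is the whole sentence; its first child is the subordinate "When ..." clause, which in turn contains the relative clause.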
THE FINAL RESULT
[Figure: dependency tree of the example sentence, rooted in "to fall". Arc labels shown: verb-subj, verb-obj, verb-indobj, verb+fin-rmod-time, verb-rmod+relcl, verb-indcomp*locut, conj-arg, prep-arg, det+def-arg, adjc+qualif-rmod, rmod. The textual form of the same analysis is given under THE ACTUAL FORMAT.]
THE ACTUAL FORMAT
1 When (WHEN CONJ SUBORD TIME) [7;VERB+FIN-RMOD-TIME]
2 the (THE ART DEF ALLVAL ALLVAL) [7;VERB-SUBJ]
3 man (MAN NOUN COMMON M SING) [2;DET+DEF-ARG]
4 that (THAT PRON RELAT ALLVAL ALLVAL LSUBJ+OBL) [6;VERB-OBJ]
5 you (YOU PRON PERS ALLVAL ALLVAL 2 LSUBJ+LOBJ+LIOBJ+OBL) [6;VERB-SUBJ]
6 mentioned (MENTION VERB MAIN IND PAST ALLVAL ALLVAL) [3;VERB-RMOD-RELCL]
7 sent (SEND VERB MAIN IND PAST ALLVAL ALLVAL) [1;CONJ-ARG]
8 me (I PRON PERS ALLVAL SING 1 LOBJ+LIOBJ+OBL) [7;VERB-INDCOMPL-THEME]
9 that (THAT ADJ DEMONS ALLVAL SING) [7;VERB-OBJ]
10 beautiful (BEAUTIFUL ADJ QUALIF ALLVAL ALLVAL) [11;ADJC+QUALIF-RMOD]
11 message (MESSAGE NOUN COMMON N SING) [9;DET+DEF-ARG]
12 , (#\, PUNCT) [14;SEPARATOR]
13 I (I PRON PERS ALLVAL SING 1 LSUBJ) [14;VERB-SUBJ]
14 fell (FALL VERB MAIN IND PAST ALLVAL ALLVAL) [0;TOP-VERB]
15 in (IN PREP MONO) [14;PREP-RMOD]
16 love (LOVE NOUN COMMON N SING) [15;PREP-ARG]
17 with (WITH PREP MONO) [14;PREP-RMOD]
18 him (HE PRON PERS M SING 3 LOBJ+LIOBJ+OBL) [17;PREP-ARG]
19 . (#\. PUNCT) [14;END]
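Each token in this format carries an index, the word form, a parenthesised list of lexical features and a bracketed pair "head index; arc label". As an illustration, a minimal reader for such lines might look as follows. The names `Token` and `parse_line` are hypothetical, and the regular expression is inferred only from the sample above (one token per line), not from a format specification.

```python
import re
from typing import List, NamedTuple

class Token(NamedTuple):
    index: int
    form: str
    features: List[str]  # lemma, category, then the remaining features
    head: int            # 0 marks the root of the sentence
    label: str

# index, form, "(features)", "[head;LABEL]"
LINE_RE = re.compile(r"^(\d+)\s+(\S+)\s+\((.*)\)\s+\[(\d+);([^\]]+)\]$")

def parse_line(line: str) -> Token:
    match = LINE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognised line: {line!r}")
    index, form, features, head, label = match.groups()
    return Token(int(index), form, features.split(), int(head), label)

sample = [
    r"12 , (#\, PUNCT) [14;SEPARATOR]",
    r"13 I (I PRON PERS ALLVAL SING 1 LSUBJ) [14;VERB-SUBJ]",
    r"14 fell (FALL VERB MAIN IND PAST ALLVAL ALLVAL) [0;TOP-VERB]",
    r"15 in (IN PREP MONO) [14;PREP-RMOD]",
    r"16 love (LOVE NOUN COMMON N SING) [15;PREP-ARG]",
]
tokens = [parse_line(line) for line in sample]
root = next(tok for tok in tokens if tok.head == 0)
print(root.form, root.label)
```

Raw strings are used for the sample because the punctuation entries contain a backslash.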
Results: Evalita 2007
LAS     UAS     LAS2    Participant
86.94   90.90   91.59   UniTo_Lesmo
77.88   88.43   83.00   UniPi_Attardi
75.12   85.81   82.05   IIIT_Mannem
74.85   85.88   81.59   UniStuttIMS_Schielen
*       85.46   *       UPenn_Champollion
47.62   62.11   54.90   UniRoma2_Zanzotto
LAS: Labeled Attachment Score (correct head and correct label)
UAS: Unlabeled Attachment Score (correct attachment)
LAS2: Label Accuracy Score (correct label)
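The three scores can be illustrated with a short computation. This is a sketch under the assumption that each score is the percentage of tokens for which the relevant part of the analysis (head, label, or both) matches the gold standard; whether punctuation tokens are counted is not stated here, so the sketch counts every token it is given. The function name `attachment_scores` is hypothetical.

```python
from typing import Dict, List, Tuple

Arc = Tuple[int, str]  # (head index, arc label) of one token

def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Dict[str, float]:
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and equally long")
    n = len(gold)
    pairs = list(zip(gold, predicted))
    head_ok = sum(g[0] == p[0] for g, p in pairs)   # correct attachment
    label_ok = sum(g[1] == p[1] for g, p in pairs)  # correct label
    both_ok = sum(g == p for g, p in pairs)         # correct head and label
    return {
        "LAS": 100.0 * both_ok / n,
        "UAS": 100.0 * head_ok / n,
        "LAS2": 100.0 * label_ok / n,
    }

# Toy data: token 3 has the wrong label, token 4 the wrong head.
gold = [(2, "VERB-SUBJ"), (0, "TOP-VERB"), (2, "PREP-RMOD"), (3, "PREP-ARG")]
pred = [(2, "VERB-SUBJ"), (0, "TOP-VERB"), (2, "PREP-ARG"), (2, "PREP-ARG")]
scores = attachment_scores(gold, pred)
print(scores)  # LAS 50.0, UAS 75.0, LAS2 75.0
```

By construction LAS can never exceed UAS or LAS2, which is consistent with every fully reported row of the table.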
Comparison with CoNLL
                        LAS             UAS
Participant             CoNLL   EPT     CoNLL   EPT
UniPi_Attardi           81.34   77.88   85.54   88.43
IIIT_Mannem             78.67   75.12   82.91   85.81
UniStuttIMS_Schielen    80.46   74.85   84.54   85.88
CoNLL: International contest for dependency parsers (multilanguage)
The Turin University Treebank (TUT)
• Current size:
  Italian: 2200 sentences, 62445 tokens (4635 traces; 6704 punctuation)
  English: 150 sentences, 4250 tokens (253 traces; 513 punctuation)
  English not yet online (under test)
Parts of Speech (and “subtypes”)
1. ADJ (adjectives)
- DEITT (deictic) next
- DEMONS (demonstrative) such, this, that
- EXCLAM (exclamative)
- INDEF (indefinite) numerous, certain, few
- INTERR (interrogative) what, which
- ORDIN (ordinal) first, twentieth, last
- ORDINSUFF (ordinal suffixes) nd, rd, th, st
- POSS (possessive) my, your, their
- QUALIF (qualificative) nice, big, English
2. ADV (adverbs)
- ADFIRM (affirmative)
- ADVERS (adversative) although, though
- COMPAR (comparative) less, more
- CONCESS (concessive) also
- DOUBT (doubt) perhaps
- EXPLIC (explicative) that_is
- INTERJ (interjections) at_any_rate
- INTERR (interrogative) how, where, when, why
- LIMIT (limit) just, only
- LOC (locative) there, within, below, here
- MANNER (manner) aloud, alright, well
- NEG (negation) not
- QUANT (quantification) little, rather, too
- REASON (motivation) in_fact
- STRENG (strengthening) even, moreover
- SUPERL (superlative) most
- TIME (time) sometime, afterward, already
Parts of Speech (and “subtypes”) 2
3. ART (articles)
- DEF (definite) the
- INDEF (indefinite) a, another
- GENITIVE (genitive) 's
4. CONJ (conjunctions)
- COORD (coordinative) and, but, or, neither, nor
- SUBORD (subordinative) since, that, to, unless
- COMPAR (comparative) than
5. DATE (dates) 08/06/2008
6. INTERJ (interjections) alas
7. MARKER (markers)
8. NOUN (nouns)
- COMMON house, boy, chair
- PROPER Mary, Italia, Italy, England
9. NUM (numbers) zero, twenty, 127, 3.14
10. PHRAS (phrasals) yes, no
11. PREDET (predeterminers) all, both
12. PREP (prepositions)
- MONO of, to, from, in
- POLI during, above, under, in front of
Parts of Speech (and “subtypes”) 3
13. PRON (pronouns)
- DEMONS (demonstrative) this, that
- EXCLAM (exclamative) what
- INDEF (indefinite) everything, nobody, something
- INTERR (interrogative) what, who
- LOC (locative) Italian: ne, ci, vi
- ORDIN (ordinals) first, second, fiftieth
- PERS (personal) I, you, we, her
- POSS (possessive) mine, yours
- REFL-IMPERS (reflexive-impersonal) ci, vi, si, se
- RELAT (relative) that, who, which, where
14. PUNCT (punctuation)
15. SPECIAL (special)
16. VERB (verbs)
- MAIN (all standard verbs) go, eat, give, be (in “to be intelligent”)
- AUX (auxiliaries) be (in “to be kissed”)
- MOD (modals) must, can, will
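The labelling scheme
[Figure: hierarchy of arc labels. Node labels: Top, Dependent, Function, Nofunction, Arg, Modifier; argument labels: adjc-arg, advb-arg, conj-arg, noun-arg, verb-arg; verbal argument labels: verb-subj, verb-obj, verb-indobj, verb-indcompl, verb-predcompl; modifier labels: Apposition, Rmod.]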
The NOFUNCTION labels
[Figure: the Nofunction part of the label hierarchy. Node labels: Aux (Aux+passive, Aux+tense, Aux+progressive), Contin (Contin+denom, Contin+locut, Contin+prep), Coordinator (Coordantec, Coord, Coord2nd), Emptycompl, Interjection, Separator, Visitor, Verb-expletive.]
Some examples
Auxiliaries
Aux+progressive: I am looking for …
Aux+tense: … the debate has - to quite some extent - suffered from …
Aux+passive: … whose historical experience is not marked by …
Continuations
Contin+locut: … convinced of the feasibility … in order to reinforce …
Contin+prep: … grown out of the millenniums …
Contin+denom: Samuel Alexander asserted …
Visitors (and traces)
The question of what we might consider to be an adequate …
[Figure: dependency tree of the fragment above, including several traces and a visitor arc. Arc labels shown: det+def-arg, prep-rmod, prep-arg, verb-rmod+relcl, verb+modal-indcompl, verb-subj, verb-obj, verb-predcompl+obj, visitor.]
Coordination
base: … is tautologous and without ontologic commitment …
  labels: coord+base, coord2nd+base
compar: … were more like mythical heroes than like the omnipotent God …
  labels: coordantec+compar, coord+compar, coord2nd+compar
correlat: … neither John nor his friends …
  labels: coordantec+correlat, coord+correlat, coord2nd+correlat
… and “word” traces
compar: … Samuel asserted that mentality emerged … and then t[asserted] t[Samuel] that …
  labels: coord+base, coord2nd+base
The AnnCorra scheme
• It is chunk-based (some elementary subtrees are left unanalysed)
• It involves 28 relations (arc labels) and 25 different POS (table below)
• There are some non-dependency labels (e.g. ccof for coordination)
• Some POS are merged (e.g. Demonstratives include both Adj and Pron)
Mapping category labels
AnnCorra category   Tag    TUT
Common Noun         NN     NOUN (common)
Proper Noun         NNP    NOUN (proper)
Location, Time      NST    ADV (time), ADV (loc)
Pronoun             PRP    PRON except the ones in Demonstrative and Question
Adjective           JJ     ADJ except the ones in Demonstrative and Question
Adverb              RB     ADV (with some exceptions)
Demonstrative       DEM    PRON (demons), ADJ (demons)
Question Words      WQ     ADJ (interr), ADV (interr), PRON RELAT????, PRON (interr)
Main verb           VM     VERB (main)
Verb Aux            VAUX   VERB (aux), VERB (mod)
Post position       PSP    PREP
Particles           RP     None
Conjuncts           CC     CONJ
Quantifiers         QF     DET, PREDET
Cardinal numb       QC     NUM
Ordinal numb        QO     ADJ (ordin), PRON (ordin)
Classifier          CL     None
Intensifier         INTF   ADV (quant)
Interjection        INJ    INTERJ
Negation            NEG    ADV (neg)
Quotative           UT     None
Sym                 SYM    SPECIAL or PUNCT
Compounds           *C     None
Reduplicative       RDP    None
Echo                ECH    None
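Read from the TUT side, the table amounts to a lookup on (category, subtype) with a category-wide default. The sketch below encodes only what the table states; the name `anncorra_tag` and the fallback policy are assumptions. The uncertain row (PRON RELAT, marked "????") is deliberately left unmapped, and the unspecified exceptions for ADV are not modelled.

```python
from typing import Optional

# (TUT category, TUT subtype) -> AnnCorra tag.
# A subtype of None is the default for the whole category.
TUT_TO_ANNCORRA = {
    ("NOUN", "COMMON"): "NN",
    ("NOUN", "PROPER"): "NNP",
    ("ADV", "TIME"): "NST",
    ("ADV", "LOC"): "NST",
    ("ADV", "INTERR"): "WQ",
    ("ADV", "QUANT"): "INTF",
    ("ADV", "NEG"): "NEG",
    ("ADV", None): "RB",      # "with some exceptions", which the table does not list
    ("PRON", "DEMONS"): "DEM",
    ("PRON", "INTERR"): "WQ",
    ("PRON", "ORDIN"): "QO",
    ("PRON", "RELAT"): None,  # marked "????" in the table: left open
    ("PRON", None): "PRP",
    ("ADJ", "DEMONS"): "DEM",
    ("ADJ", "INTERR"): "WQ",
    ("ADJ", "ORDIN"): "QO",
    ("ADJ", None): "JJ",
    ("VERB", "MAIN"): "VM",
    ("VERB", "AUX"): "VAUX",
    ("VERB", "MOD"): "VAUX",
    ("PREP", None): "PSP",
    ("CONJ", None): "CC",
    ("DET", None): "QF",
    ("PREDET", None): "QF",
    ("NUM", None): "QC",
    ("INTERJ", None): "INJ",
    ("SPECIAL", None): "SYM",
    ("PUNCT", None): "SYM",
}

def anncorra_tag(cat: str, subcat: Optional[str] = None) -> Optional[str]:
    """Return the AnnCorra tag for a TUT category/subtype, or None if unmapped."""
    key = (cat.upper(), subcat.upper() if subcat else None)
    if key in TUT_TO_ANNCORRA:
        return TUT_TO_ANNCORRA[key]
    return TUT_TO_ANNCORRA.get((key[0], None))

print(anncorra_tag("ADJ", "qualif"), anncorra_tag("ADJ", "demons"))
```

The lookup is many-to-one: several TUT subtypes collapse onto a single AnnCorra tag, so the reverse direction would need extra information.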
Mapping arc labels
The argument (karaka) labels
k1 (karta): the primary (or “most independent”) participant in the action (similar to agent) → VERB-SUBJ
k2 (karma): this is the secondary participant (often, the patient) → VERB-OBJ
k3 (karana): the instrument → VERB-INDCOMPL-MEANSMANNER
k4 (sampradana): recipient or the beneficiary of an action → VERB-INDOBJ
k5 (apadana): the stationary element in a separation → ????
k7 (adhikarana): the locus (spatial or temporal or abstract) of karta or karma. It is tagged as k7p, k7t or k7 depending on the type of location → VERB-INDCOMPL-LOC
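Mapping the structure
[Figure: TUT dependency tree over the words I, must, read, the, book and a trace (t), with arcs verb-subj (twice), verb+modal-indcompl, verb-obj and det+def-arg.]
Chunk-based structure of AnnCorra
[Figure: the chunk (must read) governs I with label k1 and the chunk (the book) with label k2.]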
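One way to picture the structural mapping is to collapse chunk-internal arcs of the word-level tree and relabel the remaining arcs with karaka labels. The sketch below is only an illustration under several assumptions: the word-level analysis of "I must read the book" is a guess at the figure (with the trace omitted), the set `CHUNK_INTERNAL` of arcs treated as chunk-internal is invented for this example, and `to_chunks` is a hypothetical name. The karaka table repeats the correspondences listed for the argument labels.

```python
from typing import Dict, List, Tuple

Word = Tuple[int, str, int, str]  # index, form, head index (0 = root), arc label

KARAKA = {
    "VERB-SUBJ": "k1",
    "VERB-OBJ": "k2",
    "VERB-INDCOMPL-MEANSMANNER": "k3",
    "VERB-INDOBJ": "k4",
    "VERB-INDCOMPL-LOC": "k7",
}

# ASSUMPTION: arcs that stay inside a chunk; chosen only to fit this example.
CHUNK_INTERNAL = {"VERB+MODAL-INDCOMPL", "DET+DEF-ARG"}

def chunk_root(words: Dict[int, Word], index: int) -> int:
    """Climb chunk-internal arcs until the word heading the chunk is reached."""
    while words[index][3] in CHUNK_INTERNAL:
        index = words[index][2]
    return index

def to_chunks(sentence: List[Word]) -> List[Tuple[str, str, str]]:
    """Return (dependent chunk, label, head chunk) triples."""
    words = {w[0]: w for w in sentence}
    members: Dict[int, List[str]] = {}
    for w in sentence:  # sentence order keeps the words of a chunk in order
        members.setdefault(chunk_root(words, w[0]), []).append(w[1])
    arcs = []
    for root, forms in members.items():
        head, label = words[root][2], words[root][3]
        if head != 0:
            head_chunk = " ".join(members[chunk_root(words, head)])
            arcs.append((" ".join(forms), KARAKA.get(label, label), head_chunk))
    return arcs

# ASSUMPTION: word-level analysis reconstructed from the figure, trace omitted.
sentence = [
    (1, "I", 2, "VERB-SUBJ"),
    (2, "must", 0, "TOP-VERB"),
    (3, "read", 2, "VERB+MODAL-INDCOMPL"),
    (4, "the", 3, "VERB-OBJ"),
    (5, "book", 4, "DET+DEF-ARG"),
]
print(to_chunks(sentence))
```

Labels without a karaka counterpart are passed through unchanged, so gaps such as the open k5 mapping stay visible in the output.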
Current activities and the future
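A word about semantics: DTS
[Figure: left, a syntactic dependency tree over the words two, students, heard, three, difficult, theorems, with arcs verb-subj, verb-obj, det+quantif-arg, det+indef-arg and adjc+qualif-rmod; right, the corresponding DTS graph with predicates student', theorem', difficult' and study', variables x and y, numbered arcs, and the functions quant(x), quant(y), restr(x), restr(y).]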
Disambiguation: Semdep arcs
[Figure: three copies of the DTS graph (predicates student', theorem', difficult', study'; variables x and y), each with different SemDep arcs towards the context node CTX.]
The first two graphs correspond to the readings:
2x [ student'(x) ∧ 3y [ theorem'(y) ∧ study'(x,y) ]]
3y [ theorem'(y) ∧ 2x [ student'(x) ∧ study'(x,y) ]]
Any more reading?
??? Branching Quantification (Independent Set)
Current activities and the future
Practical semantic interpretation based on ontological knowledge for DB access
Extension of the treebank with semantic annotation (in cooperation with Johan Bos)
Development of a graphical interface with an online server (Java implementation and socket-based connection with a Lisp server)
Automatic analysis of legal texts for extracting information about rule amendments (date, modified text, new text)
The future (last but not least)
Morphological analysis of Hindi (mid-way)
Development and testing of a Hindi parser and of mapping rules from Hindi to English and vice versa
In cooperation with IIIT Hyderabad
More on Parsing 1
Function:
[Figure: the word sequence w1 w2 … wi-1 wi wi+1 wi+2 … wn with HEAD = wi; question marks are drawn over the surrounding words.]
Structure:
(head-category head-subcategory (dependent-position (dependent-category (dependent-constraints))) ARC-LABEL)
More on Parsing 2
Examples:
(ART DEF (before (PREDET (agree))) PDETMOD)
  i (cat=ART, subcat=DEF, gender=m, number=pl) "the"
  tutti (cat=PREDET, gender=m, number=pl) "all"
  arc label: PDETMOD
(NOUN COMMON (chunk-follows (ADJ (agree) (subcat qualif))) ADJCMOD-QUALIF)
  bello (cat=ADJ, subcat=QUALIF, gender=m, number=sing) "nice"
  giardino (cat=NOUN, gender=m, number=sing) "garden"
  molto (cat=ADV) "very"
  arc label: ADJCMOD-QUALIF
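The two example rules can be read as a test on a (head, dependent) pair: category and subcategory of the head, category of the dependent, relative position, and a list of constraints. The sketch below illustrates that reading only; it is not the parser's rule interpreter. The class `Rule`, the function `applies`, the reduction of "before" and "chunk-follows" to simple word-order tests, and the treatment of "agree" as gender and number agreement are all assumptions. The subcategory COMMON for giardino is also added here, since the rule requires it.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Features = Dict[str, str]

@dataclass(frozen=True)
class Rule:
    head_cat: str
    head_subcat: str
    position: str                             # "before" or "chunk-follows"
    dep_cat: str
    constraints: Tuple[Tuple[str, ...], ...]  # e.g. ("agree",), ("subcat", "QUALIF")
    label: str

PDETMOD = Rule("ART", "DEF", "before", "PREDET", (("agree",),), "PDETMOD")
ADJCMOD = Rule("NOUN", "COMMON", "chunk-follows", "ADJ",
               (("agree",), ("subcat", "QUALIF")), "ADJCMOD-QUALIF")

def applies(rule: Rule, head: Features, head_pos: int,
            dep: Features, dep_pos: int) -> bool:
    if head.get("cat") != rule.head_cat or head.get("subcat") != rule.head_subcat:
        return False
    if dep.get("cat") != rule.dep_cat:
        return False
    # ASSUMPTION: positions are approximated by plain word order.
    if rule.position == "before":
        in_position = dep_pos < head_pos
    elif rule.position == "chunk-follows":
        in_position = dep_pos > head_pos
    else:
        raise ValueError(f"unknown position: {rule.position}")
    if not in_position:
        return False
    for constraint in rule.constraints:
        if constraint[0] == "agree":
            if any(head.get(f) != dep.get(f) for f in ("gender", "number")):
                return False
        elif constraint[0] == "subcat":
            if dep.get("subcat") != constraint[1]:
                return False
        else:
            raise ValueError(f"unknown constraint: {constraint}")
    return True

tutti = {"cat": "PREDET", "gender": "m", "number": "pl"}
i = {"cat": "ART", "subcat": "DEF", "gender": "m", "number": "pl"}
giardino = {"cat": "NOUN", "subcat": "COMMON", "gender": "m", "number": "sing"}
bello = {"cat": "ADJ", "subcat": "QUALIF", "gender": "m", "number": "sing"}

# "tutti i": the predeterminer precedes the article it depends on.
print(applies(PDETMOD, i, 1, tutti, 0))
# "giardino molto bello": the adjective follows the noun.
print(applies(ADJCMOD, giardino, 0, bello, 2))
```

When a rule applies, its `label` field gives the arc to create between the two words.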
More on Parsing 3
Verb subcategorization classes:
[Figure: hierarchy of subcategorization classes (verbs, nosubj-verbs, subj-verbs, obj-verbs, indobj-verbs, ssubj-inf-verbs, basic-trans, trans, trans-indobj, modal, empty-modal), linked to the dictionary entries bisognare (need), camminare (walk), dovere (must), potere (can).]
More on Parsing 4
Transformations:
basic class (e.g. trans) → transformed classes (e.g. trans, trans+passivization, trans+infinitivization, trans+prodrop, trans+passivization+infinitivization, …)
Example transformation:
(infinitivization replacing (subj-verbs) (is-inf-form tr-verb v-casefr) (cancel-case s-subj))
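The example transformation can be paraphrased as: for a class that has a subject, if the verb is in an infinitive form, cancel the subject case. A minimal sketch of that paraphrase follows. The representation of a case frame as a dictionary, the case name `s-obj`, the filler strings and the function names are hypothetical; only `s-subj` and the class names come from the slide, and the other transformations (passivization, prodrop) are not modelled because their definitions are not given.

```python
from typing import Dict, Optional, Tuple

CaseFrame = Dict[str, str]  # case name -> expected filler (illustrative values)

def infinitivization(frame: CaseFrame, is_infinitive: bool) -> Optional[CaseFrame]:
    """Cancel the s-subj case of an infinitive verb; None if not applicable."""
    if "s-subj" not in frame or not is_infinitive:
        return None
    return {case: filler for case, filler in frame.items() if case != "s-subj"}

def derive(class_name: str, frame: CaseFrame,
           is_infinitive: bool) -> Tuple[str, CaseFrame]:
    """Return the class name and frame to use for this verb occurrence."""
    transformed = infinitivization(frame, is_infinitive)
    if transformed is None:
        return class_name, frame
    # "replacing": the transformed frame is used instead of the basic one.
    return class_name + "+infinitivization", transformed

trans = {"s-subj": "noun phrase", "s-obj": "noun phrase"}  # hypothetical frame
print(derive("trans", trans, is_infinitive=True))
print(derive("trans", trans, is_infinitive=False))
```

The original frame is never modified, so the basic class stays available for finite occurrences of the same verb.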