Prague Dependency Treebank(s)
Workshop at LSA2011, Part I
Jan Hajič, Zdeňka Urešová
Institute of Formal and Applied Linguistics
School of Computer Science
Faculty of Mathematics and Physics
Charles University, Prague
Czech Republic
July 30, 2011 LSA2011 Prague Dependency Treebanks I 2
Part I - Text to Syntax
  The Prague Dependency Treebank projects
  The theory behind it
  The corpora
  Morphology
    Dictionaries, tools (incl. POS tagger)
  Dependency surface syntax
    Czech, Arabic, English
  Parallel annotated corpus
The Prague Dependency Treebank(s)
  The idea
    Apply the “old” Prague theory to real-world texts
    Provide enough data for ML experiments
  “Old” Prague theory
    Prague structuralism (1930s)
    Stratificational approach
    Centered on “deep syntax”
      Separated from “surface form”
    Dependency based (how else?)
PDT: The Methodology
  Manual annotation is PRIMARY
    Some help from existing tools possible
  “No information loss, no redundancy”
    Much formalization, but…
    … original form always retrievable
  Dictionaries
    In theory: “secondary”, side effect of annotation (generalization)
    In reality: help consistency
    Links: data → dictionary(-ies)
  Result: used for Machine Learning
  Ergonomics of annotation
    Graphical (“linguistic”) presentation & editing
The Prague Dependency Treebank Project: Czech Treebank
  1995 (Dublin)
  1996-2006-...
  1998: PDT v. 0.5 released (JHU workshop)
    400k words manually annotated, unchecked
  2001: PDT 1.0 released (LDC)
    1.3MW annotated, morphology & surface syntax
  2006: PDT 2.0 release
    0.8MW annotated (50k sentences) + PDT 1.0 corrected
    the “tectogrammatical layer”
      underlying (deep) syntax
Related Projects (Treebanks)
  Prague Czech-English Dependency Treebank
    WSJ portion of PTB, translated to Czech (1.2 mil. words)
      Annotated for basic set of attributes
    Penn Treebank / WSJ
      Pre-converted, manually annotated for basic sets of attributes
      Named entity annotation, co-reference (BBN) merged in
      Detailed breakdown of NP
  Prague Arabic Dependency Treebank
    Applies the same representation to the annotation of Arabic
    Surface syntax so far
  Both published (in first version) 2004 (by LDC)
    PCEDT/PEDT version 2.0 being prepared (2011?)
    Preliminary version available for browsing – see the workshop website
PDT (Czech) Data
  4 sources:
    Lidové noviny (daily newspaper, incl. extra sections)
    DNES (Mladá fronta Dnes) (daily newspaper)
    Vesmír (popular science magazine, monthly)
    Českomoravský Profit (economic journal, weekly)
  Full articles selected
    article ~ DOCUMENT (basic corpus unit)
  Time period: 1990-1995
  1.8 million tokens (~110,000 sentences)
PDT Annotation Layers
  L0 (w) Words (tokens)
    automatic segmentation and markup only
  L1 (m) Morphology
    Tag (full morphology, 13 categories), lemma
  L2 (a) Analytical layer (surface syntax)
    Dependency, analytical dependency function
  L3 (t) Tectogrammatical layer (“deep” syntax)
    Dependency, functor (detailed), grammatemes, ellipsis solution, coreference, topic/focus (deep word order), valency lexicon
  Release labels: PDT 1.0 (2001), PDT 2.0 (2006)
Morphological Attributes
  Ex.: nejnezajímavějším “(to) the most uninteresting”
  Tag: 13 categories
    Example: AAFP3----3N----
      Position  Value  Meaning
      1         A      Adjective
      2         A      Regular
      3         F      Feminine
      4         P      Plural
      5         3      Dative
      6         -      no poss. gender
      7         -      no poss. number
      8         -      no person
      9         -      no tense
      10        3      superlative
      11        N      negated
      12        -      no voice
      13        -      reserve1
      14        -      reserve2
      15        -      base var.
  Lemma: POS-unique identifier
    Books/verb -> book-1, went -> go, to/prep. -> to-1
Morphological Tagset
  13 categories, 4452 plausible tags (combinations):
    Category    # of values  Example(s)
    POS         10           N (noun), Z (punctuation)
    SUBPOS      75           P (personal pron.), U (possessive adj.)
    GENDER      8            I (masc. inanimate), X (any), - (N.A.)
    NUMBER      4            P (plural), D (dual)
    CASE        9            1 (nominative), 6 (locative)
    POSSGENDER  4            M (masc. animate), F (feminine)
    POSSNUMBER  3            S (singular), P (plural)
    PERSON      5            1 (first), ...
    TENSE       4            P (present), M (past)
    GRADE       5            3 (superlative)
    NEGATION    3            A (affirmative), N (negative)
    VOICE       3            A (active), P (passive)
    VAR         11           1 (1st variant), 6 (colloq. style), 8 (abbrev.)
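The positional tag can be read mechanically, one character per slot. The following is a minimal sketch, assuming the 15-slot order implied by the example tag (the 13 categories plus the two reserved slots before VAR); the helper name `decode_tag` is hypothetical and not part of any PDT tool.

```python
# Slot order assumed from the slides: 13 categories plus two reserved slots.
POSITIONS = [
    "POS", "SUBPOS", "GENDER", "NUMBER", "CASE",
    "POSSGENDER", "POSSNUMBER", "PERSON", "TENSE", "GRADE",
    "NEGATION", "VOICE", "RESERVE1", "RESERVE2", "VAR",
]


def decode_tag(tag):
    """Map a 15-character positional tag to {category: value}, skipping '-'."""
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(tag)}")
    return {name: value for name, value in zip(POSITIONS, tag) if value != "-"}


# The example tag for "nejnezajímavějším"
decoded = decode_tag("AAFP3----3N----")
print(decoded)
```

Only the slots that carry a value are returned, so the adjective example yields POS, SUBPOS, GENDER, NUMBER, CASE, GRADE and NEGATION.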
Morphological Analysis
  Formally: MA: A+ → Pow(L × T)
    MA(f) = { [l, t] }; f ∈ A+ (the token), l ∈ L (lemma), t ∈ T (tag)
    tokens taken in isolation
    no attempt to solve e.g. auxiliaries vs. full verbs
  Ex.: MA(“má”), lit. “has”, “my”
    = { [mít, VB-S---3P-AA---],    (mít: lit. “to have”)
        [můj, PSFS1-S1------1],    (můj: lit. “my”)
        [můj, PSFS5-S1------1],
        [můj, PSNP1-S1------1],
        [můj, PSNP4-S1------1],
        [můj, PSNP5-S1------1] }
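The mapping MA can be pictured as a lookup that returns every (lemma, tag) pair for a token, with no use of context. This is only a toy sketch: the lexicon holds just the slide's example, and the names `LEXICON` and `analyze` are hypothetical rather than the interface of the actual Czech analyzer.

```python
# Toy lexicon with the single example from the slide.
LEXICON = {
    "má": {
        ("mít", "VB-S---3P-AA---"),
        ("můj", "PSFS1-S1------1"),
        ("můj", "PSFS5-S1------1"),
        ("můj", "PSNP1-S1------1"),
        ("můj", "PSNP4-S1------1"),
        ("můj", "PSNP5-S1------1"),
    },
}


def analyze(token):
    """MA(f): the set of (lemma, tag) pairs for a token taken in isolation."""
    return set(LEXICON.get(token, set()))


analyses = analyze("má")
lemmas = {lemma for lemma, _ in analyses}
print(len(analyses), lemmas)
```

Disambiguation, the next step, then has to pick one pair from this set using the sentence context.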
Morphological Disambiguation
  Full morphological disambiguation
    more complex than (e.g. English) POS tagging
  Several full morphological taggers:
    (Pure) HMM
    Feature-based (MaxEnt, NB)
      used in the PDT distribution
    Voted Perceptron (M. Collins, EMNLP’02)
  All: ~ 94-96% accuracy (perceptron is best)
    rule & statistic combination: tiny improvement
      (Hajič et al., ACL 2001; Spoustová et al., 2007: > 96%)
The Segmentation Problem: Arabic
  Tokenization / segmentation not always trivial
    Arabic, German, Chinese, Japanese
The Segmentation Problem: Solution for Arabic
  Find max. no. of segments, concatenate the tags: up to max. 4 (x10 positions) for Arabic
  Resulting annotation:
    F---------VIIA-3MS--S----3MP4-----------  sa-+yu-hbir-u+-hum+0
  Tags (from the figure):
    F---------VIIA-3MS--S----3MP4-----------
    P---------SD----MS----------------------
    P---------------------------------------
    N-------2R------------------------------
    N-------2D------------------------------
    A-----FS2D------------------------------
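The example tags are consistent with a fixed-width layout: up to four segment tags of ten positions each, padded with dashes. The sketch below splits such a tag back into per-segment tags. The width and the maximum are inferred from the examples on the slide, and the function name `split_segment_tags` is hypothetical.

```python
SEGMENT_TAG_WIDTH = 10  # positions per segment tag (inferred from the examples)
MAX_SEGMENTS = 4        # maximum number of segments per token


def split_segment_tags(tag):
    """Cut a concatenated tag into per-segment tags; drop all-dash (empty) slots."""
    expected = SEGMENT_TAG_WIDTH * MAX_SEGMENTS
    if len(tag) != expected:
        raise ValueError(f"expected {expected} characters, got {len(tag)}")
    parts = [tag[i:i + SEGMENT_TAG_WIDTH]
             for i in range(0, expected, SEGMENT_TAG_WIDTH)]
    return [part for part in parts if set(part) != {"-"}]


# Tag of the token segmented on the slide as sa-+yu-hbir-u+-hum+0
segments = split_segment_tags("F---------VIIA-3MS--S----3MP4-----------")
print(segments)
```

Under these assumptions the example token carries three non-empty segment tags, and the fourth slot stays empty.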
Arabic Tagging Results
  Maximum entropy, features ~ categories
  Experiments on Penn Arabic Treebanks
    POS: 95.25-97.37%
    Full morph.: 88.17-89.31%
    Segmentation: 98.60-99.37%
  Prague Arabic Dependency Treebank
    POS: 96.02%
    Full morph.: 89.24%
    Segmentation: 99.25%
Layer 2 (a-layer): Analytical Syntax
  Dependency + Analytical Function
  [Figure: dependency tree of the sentence below; edge ends labeled “governor” and “dependent”]
  The influence of the Mexican crisis on Central and Eastern Europe has apparently been underestimated.
Analytical Syntax: Functions
  Main (for [main] semantic lexemes):
    Pred, Sb, Obj, Adv, Atr, Atv(V), AuxV, Pnom
    “Double” dependency: AtrAdv, AtrObj, AtrAtr
  Special (function words, punctuation, ...):
    Reflexives, particles: AuxT, AuxR, AuxO, AuxZ, AuxY
    Prepositions/Conjunctions: AuxP, AuxC
    Punctuation, Graphics: AuxX, AuxS, AuxG, AuxK
  Structural
    Ellipsis: ExD
    Coordination etc.: Coord, Apos
Surface Syntax Example
  Complete sentence: Sb, Pred, Obj
    The-baker bakes rolls.
    Pekař peče housky.
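An analytical tree can be stored as one record per word: the word form, the index of its governor, and its analytical function. The sketch below encodes the three-word example in that way. The `Node` class and the convention that head 0 marks the sentence root are assumptions for illustration, and the final punctuation and the technical root node are left out.

```python
from dataclasses import dataclass


@dataclass
class Node:
    form: str  # word form
    head: int  # 1-based index of the governor; 0 marks the root of the sentence
    afun: str  # analytical function


# "Pekař peče housky." with the functions named on the slide
sentence = [
    Node("Pekař", 2, "Sb"),
    Node("peče", 0, "Pred"),
    Node("housky", 2, "Obj"),
]


def dependents(nodes, governor):
    """Return the forms of all nodes governed by the node at index `governor`."""
    return [node.form for node in nodes if node.head == governor]


print(dependents(sentence, 2))
```

Both the subject and the object hang directly on the predicate, which is the only node without a governor inside the sentence.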
Surface Syntax Example
  Incomplete phrases
    Peter works well, but Paul badly
    Petr pracuje dobře, ale Pavel špatně
Surface Syntax Example
  Variants (equal meaning)
    (he) bought shoes for boy
    koupil boty pro kluka
PDT-style Arabic Surface Syntax
  Only a few differences
    (Sometimes) separate nodes for individual segments (cf. tagging/segmentation)
    Copula treatment (Czech: rare, treated as ellipsis; Arabic: systematic solution needed): Pred
    (Added) analytic functions:
      AuxM (did-not)
      Ante (what)
  Work by Arabic language students of the Faculty of Arts, Charles University
Arabic Surface Syntax Example
  In the section on literature, the magazine presented the issue of the Arabic language and the dangers that threaten it.
English Analytic Layer
  By conversion from PTB
    Extended analytic functions
  Head rules
    Jason Eisner’s, added more for full conversion
      Coordination, traces, etc.
  Coordination handling
    Same as in Czech/Arabic PDT
Penn Treebank
  University of Pennsylvania, 1993
    Linguistic Data Consortium
  Wall Street Journal texts, ca. 50,000 sentences
    1989-1991
    Financial (most), news, arts, sports
    2499 (2312) documents in 25 sections
  Annotation
    POS (Part-of-speech tags)
    Syntactic “bracketing” + bracket (syntactic) labels
    (Syntactic) Function tags, traces, co-indexing
    + Propbanking
Penn Treebank Example
  Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
  ( (S
      (NP-SBJ
        (NP (NNP Pierre) (NNP Vinken) )
        (, ,)
        (ADJP
          (NP (CD 61) (NNS years) )
          (JJ old) )
        (, ,) )
      (VP (MD will)
        (VP (VB join)
          (NP (DT the) (NN board) )
          (PP-CLR (IN as)
            (NP (DT a) (JJ nonexecutive) (NN director) ))
          (NP-TMP (NNP Nov.) (CD 29) )))
      (. .) ))
  Callouts:
    POS tag (NNS): noun, plural
    Phrase label (NP): Noun Phrase
    “Preterminal”
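The bracketed notation is easy to read by program. As a minimal sketch (not the LDC tooling, and with hypothetical function names), the following parses a bracketing into nested lists and lists the preterminals, that is, the (POS tag, word) pairs.

```python
import re

PTB = (
    "( (S (NP-SBJ (NP (NNP Pierre) (NNP Vinken) ) (, ,) "
    "(ADJP (NP (CD 61) (NNS years) ) (JJ old) ) (, ,) ) "
    "(VP (MD will) (VP (VB join) (NP (DT the) (NN board) ) "
    "(PP-CLR (IN as) (NP (DT a) (JJ nonexecutive) (NN director) )) "
    "(NP-TMP (NNP Nov.) (CD 29) ))) (. .) ))"
)


def parse_brackets(text):
    """Parse a balanced bracketing into nested lists: [label, child, ...]."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1                          # consume "("
        items = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())
            else:
                items.append(tokens[pos])
                pos += 1
        pos += 1                          # consume ")"
        return items

    return node()


def preterminals(tree):
    """Yield (POS tag, word) pairs in sentence order."""
    if len(tree) == 2 and all(isinstance(item, str) for item in tree):
        yield tree[0], tree[1]
    else:
        for child in tree:
            if isinstance(child, list):
                yield from preterminals(child)


pairs = list(preterminals(parse_brackets(PTB)))
print(pairs)
```

The parser assumes well-formed input; traces and empty elements, which occur elsewhere in the treebank, are not given any special treatment here.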
Penn Treebank Example: Sentence Tree
  Phrase-based tree representation: [figure not preserved]
Parallel Czech-English Annotation
  English text -> Czech text (human translation)
  Czech side (goal): all layers manual annotation
  English side (goal):
    Morphology and surface syntax: technical conversion
      Penn Treebank style -> PDT Analytic layer
    Tectogrammatical annotation: manual annotation
      (Slightly) different rules needed for English
  Alignment
    Natural, sentence level only (now)
Human Translation of WSJ Texts
  Hired translators / FCE level
  Specific rules for translation
    Sentence per sentence only
      …to get simple 1:1 alignment
    Fluent Czech at the target side
    If a choice, prefer “literal” translation
  The numbers:
    English tokens: 1,173,766
    Translated to Czech:
      Revised/PCEDT 1.0: 487,929
      Now finished (all 2312 documents)
English Annotation: POS and Syntax
  Automatic conversion from Penn Treebank
    PDT morphological layer
      From POS tags
    PDT analytic layer
      From: Penn Treebank Syntactic Structure
        Non-terminal labels
        Function tags (non-terminal “suffixes”)
  2-step process
    Head determination rules
    Conversion to dependency + analytic function
Head Determination Rules
  Exhaustive set of rules
    By J. Eisner + M. Cmejrek/J. Curin
    4000 rules (non-terminal based)
      Ex.: (S (NP-SBJ VP .)) → VP
  Additional rules
    Coordination, Apposition
    Punctuation (end-of-sentence, internal)
  Original idea (possibility of conversion)
    J. Robinson (1960s)
Example: Head Determination Rules (J.E.)
  Rules:
    (NP (DT NN)) → NN
    (VP (VB NP)) → VB
    (VP (MD VP)) → VP
    (S (… VP …)) → VP
  [Figure: Penn Treebank tree with the head word added at each node: (the), (board), (will), (join)]
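The two-step idea (pick a head child for every phrase, then attach the heads of the other children to it) can be sketched with the four rules above. This is an illustrative toy, not the actual rule set of about 4000 rules: the rule tables, the tuple tree format and the function name `to_dependencies` are all assumptions, and words serve as node identifiers, which only works while no word repeats.

```python
# Exact rules: (parent label, child labels) -> label of the head child.
EXACT_RULES = {
    ("NP", ("DT", "NN")): "NN",
    ("VP", ("VB", "NP")): "VB",
    ("VP", ("MD", "VP")): "VP",
}
# Wildcard rule (S (… VP …)) → VP: the VP child is the head wherever it stands.
WILDCARD_RULES = {"S": "VP"}


def to_dependencies(tree, edges):
    """Return the head word of `tree`; append (dependent, governor) pairs to `edges`."""
    label, *children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return children[0]                # preterminal: the word itself
    child_labels = tuple(child[0] for child in children)
    head_label = EXACT_RULES.get((label, child_labels)) or WILDCARD_RULES.get(label)
    if head_label is None or head_label not in child_labels:
        raise KeyError(f"no head rule for {label} -> {child_labels}")
    head_index = child_labels.index(head_label)
    heads = [to_dependencies(child, edges) for child in children]
    for i, word in enumerate(heads):
        if i != head_index:
            edges.append((word, heads[head_index]))
    return heads[head_index]


tree = ("S",
        ("VP", ("MD", "will"),
               ("VP", ("VB", "join"),
                      ("NP", ("DT", "the"), ("NN", "board")))))
edges = []
root = to_dependencies(tree, edges)
print(root, edges)
```

For this fragment, "join" ends up as the root, with "will" and "board" depending on it and "the" depending on "board", which matches the head words marked in the figure.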
Conversion: Analytic Structure, Functions
  Analytic Function assignment (conversion)
    Rules based on functional tags:
      -SBJ → Sb     -PRD → Pnom
      -BNF → Obj    -DTV → Obj
      -LGS → Obj    -ADV → Adv
      -DIR → Adv    -EXT → Adv
      -LOC → Adv    -MNR → Adv
      -PRP → Adv    -PUT → Adv
      -TMP → Adv
    Ad-hoc rules (if functional tags missing)
  Lemmatization (years → year)
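The tag-based part of the assignment amounts to a table lookup on the function tags attached to a non-terminal label. A minimal sketch follows; the function name `afun_from_label` is hypothetical, taking the first matching tag is an assumption (the slides do not say how several tags on one label are prioritized), and the ad-hoc rules for untagged nodes are not modeled.

```python
# Penn Treebank function tag -> PDT analytical function, as listed on the slide
FUNCTION_TAG_TO_AFUN = {
    "SBJ": "Sb", "PRD": "Pnom",
    "BNF": "Obj", "DTV": "Obj", "LGS": "Obj",
    "ADV": "Adv", "DIR": "Adv", "EXT": "Adv", "LOC": "Adv",
    "MNR": "Adv", "PRP": "Adv", "PUT": "Adv", "TMP": "Adv",
}


def afun_from_label(label, default=None):
    """Return the analytical function for a label such as 'NP-SBJ' or 'NP-TMP'."""
    for tag in label.split("-")[1:]:      # function tags follow the phrase label
        if tag in FUNCTION_TAG_TO_AFUN:
            return FUNCTION_TAG_TO_AFUN[tag]
    return default                        # ad-hoc rules would take over here


for label in ("NP-SBJ", "NP-TMP", "PP-CLR", "NP"):
    print(label, afun_from_label(label))
```

Labels without a listed tag, such as PP-CLR or a bare NP, fall through to the default, which is where the ad-hoc rules mentioned on the slide would apply.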
Example: Analytical Structure, Functions
  [Figure: Penn Treebank structure (with heads added) → PDT-like Analytic Representation; head words marked: (the), (board), (will), (join)]
Annotation tools
  TrEd
    Visualization, annotation, processing, search
    Perl, Perl-Tk
    Customizable
    Multiple data formats, native: Prague ML (XML)
  Demonstration I
    Surface-syntax annotation
Speech reconstruction
  Spoken input
    ASR does not do capitalization, punctuation
    Disfluencies
      Repetitions
      Corrections
      Fillers
      Deviations from syntax
      ...
  ungrammatical / dissimilar to text
Objective
  Create gold-standard data for
    (Statistical) training
    Testing
  Use in machine learning of
    automatic speech reconstruction
    (eventually) language understanding
  Go beyond state-of-the-art
    ASR Post-Correction / disfluency removal (cf. e.g. Fitzgerald, 2008, or López-Cózar & Callejas, 2008)
What is Speech Reconstruction
  original transcript → edited transcript
    resembling an interview edited for print
  ... apart from disfluency removal?
  [Figure: example with the labels “Disfluency removal”, “Speech reconstruction” and “????”]
    Transcript fragments: “I think it ‘s” / “my aunt Molly ‘s house” / “and” / “I think” / “we ‘re sitting on the step of” / “my aunt Molly ‘s house”
    Edited: We’re sitting on the step of my aunt Molly’s house, ..
Word-for-word transcription
  using Transcriber 1.5.1
    audio synchronization
    spoken text
    non-speech events (UH-HUH, laughter, etc.)
    rough segmentation (defined acoustically)
    speaker/turn identification
Annotation rules
  • sentence segmentation
  • orthography
  • capitalization
  • punctuation
  • morphosyntax
  • word order
  • partial ellipsis restoration
  • no discourse-irrelevant non-speech events
  • filler and fragment deletion
  • but: ...meaning preservation
  ~ written-text standards
Segment splitting
(reconstruction) segment = sentence
(transcript) segment ~ non-silence span
Word order, deletions, insertions, ...
punctuation, capitalization, ...
Function word insertion
Annotation layers
Multilayered standoff annotation
Current status
  English dialogues
    151,000 words (14.5 h)
    16,000 words double annotated
    Audio / manual transcript / reconstruction
    Auto tagging/parsing: syntax / semantics
  Annotation manual
  Baseline automatic SR systems
Conclusion
  Language resources
    Beyond post-ASR corrections: “Speech Reconstruction”
    Integrated annotation of
      Audio, ASR, manual transcription, edited speech reconstruction [morphology, syntax, semantics]
  The next step: machine learning
Czech Example
  Original transcription:
    ale taky důvod byl ten že škodováci byli hrozně rádi bejvali kdybych tam byla mohla nastoupit k ni - k nim jako do zaměstnání
  Recovering punctuation and capitalization:
    Ale taky důvod byl ten , že škodováci byli hrozně rádi bejvali , kdybych tam byla , mohla nastoupit k ni k nim jako do zaměstnání
  Translation:
    Ale taky důvod byl ten , že škodováci bývali byli hrozně rádi , kdybych tam byla , mohla nastoupit k ní , k nim jako do zaměstnání
  Reference reconstructions:
    (a) Ale důvod byl taky ten , že "škodováci" by bývali byli hrozně rádi , kdybych tam k nim mohla nastoupit do zaměstnání .
    (b) Důvod byl ale taky ten , že škodováci by bývali byli hrozně rádi , kdybych tam byla mohla nastoupit k nim jako do zaměstnání .
  Approximate English gloss of (a): “But the reason was also that the ‘Škoda people’ would have been awfully glad if I could have taken a job with them there.”