Prague Dependency Treebank 1.0

Functional Generative Description

theoretical framework based on the findings of European structural linguistics, esp. of the classical Prague School

methodological requirements of a formal description

levels:

tectogrammatical (underlying) representations (TRs) with dependency-based syntax

morphemics

phonemics and phonetics

TRs (see Sgall, Hajičová and Panevová 1986; formally specified by Petkevič, also in a declarative way)

The Language Layers

Phonemic, Morphonological, Morphemic, Analytical (surface syntax), Tectogrammatical (deep syntax).

Dependency tree

My younger brother arrived there yesterday.

Linearized form, one-to-one relation: ((I)Appurt (younger)Rstr brother)Act arrive.Pret.Indic (Dir there) (Temp yesterday)
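The one-to-one relation between the tree and its linearized form can be illustrated with a small sketch. The `Node` class and `linearize` function are hypothetical names introduced here, not part of any PDT tool; the sketch only assumes what the example above shows, namely that a dependent to the left of its head is written as `(subtree)Functor` and a dependent to the right as `(Functor subtree)`.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """Hypothetical tree node: a label, its functor, and ordered dependents."""
    label: str
    functor: str = ""
    left: List["Node"] = field(default_factory=list)   # dependents preceding the head
    right: List["Node"] = field(default_factory=list)  # dependents following the head


def linearize(node: Node) -> str:
    """Write a subtree in the bracketed notation used in the example above."""
    # Left dependents carry the functor after the closing parenthesis.
    parts = [f"({linearize(child)}){child.functor}" for child in node.left]
    parts.append(node.label)
    # Right dependents carry the functor after the opening parenthesis.
    parts += [f"({child.functor} {linearize(child)})" for child in node.right]
    return " ".join(parts)


brother = Node("brother", "Act", left=[Node("I", "Appurt"), Node("younger", "Rstr")])
tree = Node(
    "arrive.Pret.Indic",
    left=[brother],
    right=[Node("there", "Dir"), Node("yesterday", "Temp")],
)
LINEARIZED = linearize(tree)
print(LINEARIZED)
```

In both positions the functor sits at the parenthesis that faces the head, so the bracketing alone is enough to recover the tree.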

Dependency Tree

labels - lexical meanings (abstract symbols) with indices

functors - subscripts at parentheses, oriented towards the head

grammatemes - values of morphological categories: Tense, Modality, Number, Definiteness, etc.

projectivity

valency - arguments (inner participants) and adjuncts (circumstantials or 'free modifications')

obligatory and optional with a given head, deletable or not

Dependency Tree

Arguments/participants of verbs

Actor/Bearer (underlying subject)

Objective (Patient, underlying direct object)

Addressee (underlying indirect object)

Effect ('second' object: to choose sb. as sth.)

Origin (to make sth. out of sth.)

Adjuncts

Locative, several Directional and Temporal modifications

Condition, Means, Manner, etc.

Dependency Tree

Complementations dependent mainly on nouns

Arguments (inner participants)

Material (Partitive): two baskets of sth.

Identity: the river Danube; the notion of operator

Adjuncts (free modifications)

Possession (Appurtenance): my table; Jim's brother

Restrictive: rich man

Descriptive: the Swedes, who are a Scandinavian nation

Dependency Tree

syntactic grammatemes

Loc, Dir - in, on, under, between...

Regard - with, without

operational (testable) criteria for distinguishing arguments from adjuncts, and from each other

deletability (dialogue test)

Simplified valency frames

read V Act Addr Obj

change V Act Obj Orig Eff

give V Act Addr Obj

brother N Appurt

man N

glass N Material

full A Material

obligatory complementations in blue
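The simplified frames above can be held in a plain lookup table. The sketch below is only an illustration: `VALENCY_FRAMES` and `split_complementations` are hypothetical names, and the obligatory/optional distinction is not encoded because the listing above does not carry it in text form.

```python
# Simplified valency frames from the listing above, keyed by (lemma, part of speech).
VALENCY_FRAMES = {
    ("read", "V"): ["Act", "Addr", "Obj"],
    ("change", "V"): ["Act", "Obj", "Orig", "Eff"],
    ("give", "V"): ["Act", "Addr", "Obj"],
    ("brother", "N"): ["Appurt"],
    ("man", "N"): [],
    ("glass", "N"): ["Material"],
    ("full", "A"): ["Material"],
}


def split_complementations(lemma, pos, functors):
    """Split observed functors into those listed in the frame and the rest.

    The rest would be free modifications (adjuncts) or annotation errors;
    this sketch does not try to tell the two apart.
    """
    frame = VALENCY_FRAMES[(lemma, pos)]
    in_frame = [f for f in functors if f in frame]
    other = [f for f in functors if f not in frame]
    return in_frame, other


print(split_complementations("give", "V", ["Act", "Obj", "Temp"]))
```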

Topic-focus articulation

contextual boundness: main verb CB/NB (T/F), dependents to the left/right

communicative dynamism: left-right (mother, sisters, transitive) partial ordering

underlying word order: left-right linear ordering

left-to-right order of nodes together with the index T or (prototypically) F indicates the TFA of the sentence (of the TR)

Topic-focus articulation

TFA - one of the basic aspects of underlying structures

Complex sentence

a subordinated (dependent) clause (i.e. its main verb) depends on a word contained in its governing clause

My brother, whom you know, arrived there yesterday.

Complex sentence

function words (synsemantic) are viewed as function morphemes, syntactically fixed to certain lexical (autosemantic) words - prepositions and articles to nouns, conjunctions and auxiliaries to verbs

Martin came there late, since he had to accompany his sick mother.

Complex sentence

Martin arrived late to the session, since he had to accompany his sick mother.

schematically (morphemes):

Martin arrive.ed late to the session since he have.ed to accompany he.s sick mother.

dot - close connection of morphemes ('semes')

deleted items restored

order of items - difference between 'underlying' and surface (morphemic) word order

transductive components - Panevová, Oliva, Borota

coordination (multidimensional)

Jim and Mary, who have two children, went to Boston.

the linearized notation is adequate: ((Jim Mary)Conj ((who)Act have (Pat (two)Rstr children)))Act went (Dir Boston)

structures close to Boolean, i.e. no complex 'innate properties' specific for natural language are needed.

Prague Dependency Treebank - corpus annotation

an intermediate level - 'analytical' representations

dependency trees, not always projective

nodes for all word tokens, even for punctuation marks

tectogrammatical tree: coordinating conjunction as the head

Prague Dependency Treebank 1.0

Morphological Layer

ANNOTATED CORPORA

PDT version 1.0, 2000 (1996 - 2000); (currently) ver. 2

Penn Treebank, release 3, 1999 (1989 - 1999); PropBank (currently)

The Levels in PDT

Morphemic, Analytical, Tectogrammatical

TAG SETs

Czech - ambiguous inflective language: nový, nového, novému, novém, novým, nová, nové, novou, nových, novým, novými, … novější, novějšího, novějšímu, novějším, … nejnovější, nejnovějšího, nejnovějšímu, nejnovějším, … nejnovějších, nejnovějším, …

English - language with poor inflection: work, works, worked, working

TEXT SOURCES

PDT (...taken from Czech National Corpus):

Lidové noviny

Mladá Fronta Dnes

Vesmír

Českomoravský Profit

Penn Treebank:

'88, '89 WSJ articles

Air Travel Information System transcripts

Brown Corpus

Switchboard transcripts

ANNOTATION STRATEGY - Penn Treebank

TEXT

Ken Church's stochastic tagger,

Eric Brill's transformation tagger

corrections by annotator (GNU Emacs Lisp based package)

ANNOTATION STRATEGY - PDT

Automatic Morphological Analyzer (AMA)

two independent annotators; Linux, Win tools

differences resolved by third annotator

comparison with the current AMA; manual resolution; Win tools

INTERNAL FORMAT

PDT: SGML coding, csts dtd

<s id="ln95040:020-p1s1"><f>Pokus<l>pokus<t>NNIS1-----A----<f>o<l>o<t>RR--4----------<f>zázrak<l>zázrak<t>NNIS4-----A----<d>.<l>.<t>Z:-------------

Penn Treebank: word/tag(|tag)*

The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./.
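The two internal formats can be read with a few lines of code. The following is a minimal sketch that handles only the shapes visible in the two sample lines above (a form, one lemma and one tag per token in the SGML line; `word/tag(|tag)*` items in the Penn line). The function names are hypothetical, and real csts files may contain further elements that this sketch ignores.

```python
import re

# One token in the sample SGML line: <f> (word form) or <d> (punctuation),
# followed by <l> (lemma) and <t> (positional tag).
CSTS_TOKEN = re.compile(r"<([fd])>([^<]*)<l>([^<]*)<t>([^<]*)")

CSTS_SAMPLE = (
    '<s id="ln95040:020-p1s1"><f>Pokus<l>pokus<t>NNIS1-----A----'
    "<f>o<l>o<t>RR--4----------<f>zázrak<l>zázrak<t>NNIS4-----A----"
    "<d>.<l>.<t>Z:-------------"
)
PENN_SAMPLE = "The/DT envelope/NN arrives/VBZ in/IN the/DT mail/NN ./."


def parse_csts(line):
    """Return one dict per token found in a csts-style line."""
    return [
        {"kind": kind, "form": form, "lemma": lemma, "tag": tag}
        for kind, form, lemma, tag in CSTS_TOKEN.findall(line)
    ]


def parse_penn(line):
    """Return (word, [tags]) pairs from a word/tag(|tag)* line."""
    tokens = []
    for item in line.split():
        # Split on the last slash so that words containing "/" stay intact.
        word, _, tags = item.rpartition("/")
        tokens.append((word, tags.split("|")))
    return tokens


for token in parse_csts(CSTS_SAMPLE):
    print(token["form"], token["lemma"], token["tag"])
print(parse_penn(PENN_SAMPLE))
```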

SAMPLES

SGML coding

SGML coding

word/tag

word/lemma/tag

CONVERSION

pdt2wsj.pl

pdt2wsjFLT.pl

DATA SIZE

                          # word tokens   # sentences
PDT 1.0                   1 730K          112K
Penn Treebank, release 3  4 600K          350K

DATA SETs of MORPHOLOGICALLY ANNOTATED DATA (#tokens/sentences)

for tagging only

training data 1 470K/95K

development test data 130K/8K

evaluation test data 127K/8K

for parsing (preprocessing step)

training data 475K/29K

development test data 130K/8K

evaluation test data 127K/8K

TOOLS

Automatic Morphological Analyser/Generator of Czech

HMAnalyze.pl, HMGenerate.pl

Dictionary: CZE_a

Remote Access

Czech Taggers

HMM

Exponential

Prague Dependency Treebank 1.0

Analytical Layer in PDT

Introduction

Input: morphologically tagged sentences

Graph Editor: “user-friendly” software

Output: ATS structure

"surface" syntax tree structure

nodes labelled by the analytical functions

Analytical Functions

Pred - Predicate, if it depends on the tree root

Sb - Subject

Obj - Object

Adv - Adverbial

Atv - Complement

AtvV - Complement, if one governor is present

Atr - Attribute

Pnom - Nominal predicate's nominal part, depends on the copula "to be"

AuxV - Auxiliary verb "to be"

Coord - Coordination node

Apos - Apposition node

AuxR - Reflexive particle, which is neither Obj nor AuxT (passive)

AuxT - Reflexive particle, lexically bound to the verb

Analytical Functions

AuxP - Preposition or a part of a compound preposition

AuxC - Subordinate conjunction

AuxO - (Superfluously) referring particle or emotional particle

AuxZ - Rhematizer or another node acting on another constituent

AuxX - Comma, but not the main coordinating comma

AuxG - Other graphical symbols not classified as AuxK

AuxY - Other words, such as particles without a specific syntactic function, parts of lexical idioms, etc.

AuxS - Sentence holder (the only added root of the tree)

AuxK - Punctuation at the end of the sentence or direct speech or citation clause

ExD - Ellipsis handling: function for nodes which pseudo-depend on a node on which they would not depend if there were no ellipsis

AtrAtr, AtrAdv, AdvAtr, AtrObj, ObjAtr + *_Co, *_Pa, *_Ap

Two stages (chronologically)

(A) manual "analytic" annotation (ATS): training data for (B)(a)

(B) (a) semiautomatic procedure (Collins's parser)

(B) (b) manual correction of (B)(a)

Constraints and limitations

any string has a node of its own: word-form, punctuation mark, etc.

AuxV, AuxP, AuxC, AuxX, AuxG…

reflecting the coordination and apposition relations: the so-called third dimension of the graph in the plain tree

(X_Co, X_Ap, X_Pa, where X is one of the analytic functions, such as Sb, Obj, Adv, etc.)
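Because the "third dimension" is folded into the label itself, a tool that consumes ATS labels has to separate the base analytic function from its suffix. A minimal sketch follows; `split_afun` is a hypothetical helper, and the reading of `_Pa` as "parenthesis" is an assumption added here, since the text above names only coordination and apposition.

```python
# Suffixes that may be attached to an analytic function X (X_Co, X_Ap, X_Pa).
SUFFIX_MEANING = {
    "Co": "coordination",
    "Ap": "apposition",
    "Pa": "parenthesis",  # assumption: not spelled out in the text above
}


def split_afun(afun):
    """Split a label such as 'Sb_Co' into ('Sb', 'Co'); plain labels give (label, None)."""
    base, sep, suffix = afun.rpartition("_")
    if sep and suffix in SUFFIX_MEANING:
        return base, suffix
    return afun, None


for label in ["Sb_Co", "Obj_Ap", "Adv_Pa", "Pred", "ExD_Co"]:
    print(label, split_afun(label))
```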

Constraints and limitations

no missing nodes (on the surface) can be added; the analytic function ExD is used

relation between the semi-automatic and manual procedures: 80% of edges are established correctly automatically

Project organization

team consisting of 5-6 annotators

handbook for ATS structure annotation

100,000 sentences on ATS

tectogrammatical annotation follows

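Projectivity/Non-projectivity/Surface Order: A(B, C)

[Figure: tree diagrams of A(B, C) under different surface orders of A, B and C]

Projectivity/Non-projectivity/Surface Order: A(B(C))

[Figure: tree diagrams of A(B(C)) under different surface orders of A, B and C]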

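[Figure: analytical tree of the sentence below; edge labels shown include Adv and AuxT]

První restituční zákon českého parlamentu se do sněmovních lavic může vrátit jako bumerang.

(The first restitution law of the Czech parliament may return to the parliamentary benches like a boomerang.)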
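Whether a tree is projective for a given surface order can be tested mechanically. The sketch below uses the common definition that every node lying between a head and one of its dependents must itself be dominated by that head. The encoding (a list of head positions, `-1` for the root) and the function name are choices made for this illustration, and the word orders tested are illustrative rather than a transcription of the figures.

```python
def is_projective(heads):
    """heads[i] is the surface position of the governor of node i, or -1 for the root."""

    def dominated_by(node, ancestor):
        # Walk up from node towards the root and look for ancestor on the way.
        while node != -1:
            if node == ancestor:
                return True
            node = heads[node]
        return False

    for dep, head in enumerate(heads):
        if head == -1:
            continue
        lo, hi = sorted((dep, head))
        # Every node strictly between the two ends of the edge must hang under the head.
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True


# A(B, C): A governs B and C. Surface order "B C A".
print(is_projective([2, 2, -1]))
# A(B(C)): A governs B, B governs C. Surface order "C B A".
print(is_projective([1, 2, -1]))
# A(B(C)) in surface order "C A B": A stands between C and its head B.
print(is_projective([2, -1, 1]))
```

With a flat tree A(B, C) no order of three nodes can produce a crossing, whereas the chain A(B(C)) becomes non-projective as soon as A is placed between B and C.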

Prague Dependency Treebank 1.0

From the Analytical towards the Tectogrammatical layer

Introduction

ATS annotation

nodes: word forms, punctuation, graphical symbols

edges: surface relations

TGTS annotation

nodes: autosemantic words, deletions

edges: deep layer functions

Annotation process

Input: Czech sentence

Tokenization

Morphological tagging and lexical disambiguation

Syntactic parsing and analytic function assignment

ATS (PDT 1.0)

Tree structure pruning

Attribute assignments

TGTS

Transition procedure

deterministic procedure operating on trees

macro language for Graph Editor (Perl)

automatic changes & tools for annotators

Requirements

new attributes for tectogrammatical layer

ATS is recoverable from TGTS

automated to a maximally high degree

New attributes

trlemma - lemma of the original node or lemma composed of joined nodes

morphological grammatemes: gender, number, degree of comparison, tense, aspect, iterativeness, verbal modality, deontic modality, sentence modality

position of the node

functor, topic-focus articulation, syntactic grammateme, type of relation (dependency, coordination, apposition), phraseme, deletion, quoted word, direct speech, coreference, antecedent

Tree Structure Pruning

U toho, kdo začíná opravdu od nuly, není daňový výnos pro stát podstatný.

For those, who start actually at zero, the tax outcome for the state is not substantial.

Verbal Nodes

• … entrepreneurs should have (their) taxes …

• … podnikatelé by měli mít daně …

PRED

verbmod=CDN

deontmod=HRT

Attribute Assignments

prepositions stored as fw attribute

quoted words

clause in quotes -> DSP

one pair of quotes in the sentence -> DSPP

string in quotes -> QUOT

gender, number, tense, degcmp, aspect: default values

Macros for Annotators

keyboard shortcuts (in Graph Editor)

structure changes: hide/recover nodes, merge nodes, add new nodes

functor assignments

Manual annotation

structure checking

functors

deletions of obligatory modifications

feedback for formulating the handbook for annotators

Prague Dependency Treebank 1.0

Tectogrammatical Layer

Jirka se včera opil do němoty a Honza dneska.

George himself yesterday drank to silence and Honza today.

Attributes of coreferential relations (only in MC)

attribute   values
coref       the lemma of the antecedent
corsnt      NIL - in the same sentence; PREV1 ... PREVi - position of the sentence which includes the antecedent
antec       the functor of the antecedent (grammatical coreference)

Example

Honza slíbil přijít včas.

Honza promised to come in time.

coref: Honza

corsnt: NIL

cornum: 1

antec: ACT
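The four attribute values of the example can be read into a small record and checked against the value set given for `corsnt` (either `NIL` or `PREV` followed by a number). This is only a sketch: `parse_coref` is a hypothetical helper, and the `key: value` text form is taken from the example above rather than from the treebank's file format.

```python
import re

# corsnt is NIL (same sentence) or PREV1 ... PREVi.
CORSNT_PATTERN = re.compile(r"NIL|PREV[1-9][0-9]*")


def parse_coref(text):
    """Read 'key: value' pairs such as 'coref: Honza corsnt: NIL' into a dict."""
    record = dict(re.findall(r"(\w+):\s*(\S+)", text))
    if not CORSNT_PATTERN.fullmatch(record.get("corsnt", "")):
        raise ValueError("corsnt must be NIL or PREV<number>")
    return record


EXAMPLE = parse_coref("coref: Honza corsnt: NIL cornum: 1 antec: ACT")
print(EXAMPLE)
```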
