Representing dictionaries with the TEI

Proposal for basic guidelines

Laurent Romary - Max Planck Digital Library

With the help of Susanne Alt - CNRS

Background

• The P5 edition of the TEI guidelines
  – XML
  – ODD - Roma
    • Modules and classes
  – DTD, RelaxNG, W3C schemas
• The dictionary chapter
  – Very close to the P4 version
  – Work to be done
    • Enhancing the coherence with the class system
    • Providing more examples
    • …

Proposal for today

• Browse through the main features of the dictionary chapter
  – Identify questionable issues
  – Select best practices
• Work with Roma and implement (part of) the best practices
  – Minimal schema that dictionary projects can start with
    • Bottom-up approach to customization
• Discuss conformance

Dictionaries as TEI documents

• Same general document structure as any other TEI document
  – <teiHeader>, <text>
    • Define a common strategy concerning source identification with general text sources
    • Specific documentation of previous editions
    • Intuition that <teiCorpus> is not to be retained here
  – <front>, <body>, <back>
  – Divisions…
    • Strong case for unnumbered <div>s
    • Can we recommend/implement a basic dictionary-oriented typology?
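A minimal skeleton of such a document might look as follows. This is only a sketch: the type and n values on <div> are hypothetical, since no dictionary-oriented typology has been agreed on yet, and the header content is left as placeholder comments.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title><!-- title of the electronic dictionary --></title>
      </titleStmt>
      <publicationStmt>
        <p><!-- publisher, availability, etc. --></p>
      </publicationStmt>
      <sourceDesc>
        <p><!-- description of the print source, previous editions --></p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <front><!-- preface, list of abbreviations, ... --></front>
    <body>
      <!-- unnumbered <div>; type="letter" is a hypothetical typology value -->
      <div type="letter" n="A">
        <entry><!-- one lexical entry --></entry>
        <entry><!-- another lexical entry --></entry>
      </div>
    </body>
    <back><!-- appendices, indexes, ... --></back>
  </text>
</TEI>
```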

Issues

[see Wuerzburg.xml]

• Providing precise guidelines for
  – <publicationStmt>
    • Clarify the role and possible content of <publisher>
  – <sourceDesc>
    • Base the guidelines on <biblStruct> (<biblItem>?) and <listBibl>

Describing dictionary entries

• A variety of possible objects
  – <entry>, <entryFree>, <superEntry>, <dictScrap>
  – <hom>, <re>
• First issue: dealing with the editorial workflow
  – Keep <dictScrap> for ongoing tagging activity
    • Depends on the degree of structure of the dictionary
  – Stay consistent in the use of entry/entryFree/superEntry/hom
    • Strong feeling for limiting ourselves to <entry>
  – Point to the importance of <re>
    • Embedded entries
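To illustrate <re> as an embedded entry, a sketch along the following lines is possible. The compound "chat sauvage" is chosen purely for illustration and is not taken from any of the dictionaries discussed here; its definition is deliberately left as a placeholder.

```xml
<entry>
  <form>
    <orth>chat</orth>
  </form>
  <sense>
    <def>Petit animal familier</def>
  </sense>
  <!-- related entry: carries its own form and sense, like a small entry -->
  <re>
    <form>
      <orth>chat sauvage</orth>
    </form>
    <sense>
      <def><!-- definition of the related entry --></def>
    </sense>
  </re>
</entry>
```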

Finding the right granularity

• The core lexical unit: <entry>
  – Should be used coherently in a dictionary project to gather up homogeneous lexical objects
• Possible combination with:
  – <superEntry> to group sets of homographs
    • Should only be used to record such a feature when it exists in legacy data
    • Should be avoided for new editorial projects
  – <hom> to subdivide senses into groups of homonyms

Example

• Recording a series of homographs with <superEntry>

<body>
  <entry/>
  <entry/>
  <superEntry>
    <entry type="hom" n="1"/>
    <entry type="hom" n="2"/>
  </superEntry>
</body>

• Issues
  – Values of ‘n’ attribute according to the source
  – Values of type defined in ‘att.entryLike’

Example

• Recording a series of homographs with <hom>

<entry>
  <hom n="1">
    <sense n="1"/>
    <sense n="2"/>
  </hom>
  <hom n="2">
    <sense n="1"/>
    <sense n="2"/>
    <sense n="3"/>
  </hom>
</entry>

• Issues
  – Weak boundary between polysemes and homonyms
  – Why not just have separate entries?
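The "separate entries" alternative could be sketched as below. This is an assumption about how such an encoding might look, not an agreed practice: each homograph becomes a full <entry>, and whether n (or some other mechanism) should carry the homograph number from the source is left open.

```xml
<body>
  <!-- first homograph as a standalone entry -->
  <entry n="1">
    <sense n="1"/>
    <sense n="2"/>
  </entry>
  <!-- second homograph as a standalone entry -->
  <entry n="2">
    <sense n="1"/>
    <sense n="2"/>
    <sense n="3"/>
  </entry>
</body>
```

With this layout no <hom> or <superEntry> wrapper is needed, at the cost of losing an explicit grouping of the homographs.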

From word to senses…

• Background
  – Semasiological vs. onomasiological views on lexical data
    • Two complementary data organisations
    • Two sets of standards
      – In ISO: TMF (ISO 16642) vs. LMF
      – In the TEI: Terminology vs. Print dictionary chapters

The LMF Model

[Diagram: the LMF core model, with the classes Lexical DB, Global Info, Lexical Entry, Form and Sense linked by 1..1 and 0..n cardinalities]

Consequences for dictionaries

• Strong <form> to <sense> orientation
  – <form> qualifies the entry, with the identification of the headword and its morphological variations
  – <sense> is subordinated to the choice made for <form>
  – Role of grammatical information
    • Overall qualification of the entry
    • Qualification of morphological variants
• Issue
  – <re> does not necessarily fit into the theory

Example

• Basic structure of an <entry>

<entry>
  <form>
    <orth>chat</orth>
  </form>
  <sense>
    <def>Petit animal familier</def>
  </sense>
</entry>

Representing form and grammar

• General issues
  – Multiple forms
    • <orth>, <pron>, etc.
  – Compounds
    • May be represented using embedded forms
  – Role of grammar (<gramGrp>)
    • In isolation: qualifies the entry
    • Within a form: marks special features associated with the form
  – Inflexions
    • Can be represented by means of additional <form>s

Example

• A simple entry

<entry>
  <form>
    <orth>chat</orth>
    <pron>ʃa</pron>
  </form>
  <gramGrp>
    <pos>N</pos>
    <gen>m</gen>
  </gramGrp>
</entry>

Example

• Simple entry with inflected form

<entry>
  <form type="lemma">
    <orth>chat</orth>
  </form>
  <gramGrp>
    <pos>N</pos>
    <gen>m</gen>
  </gramGrp>
  <form type="inflected">
    <orth>chats</orth>
    <gramGrp>
      <number>p</number>
    </gramGrp>
  </form>
</entry>

<form>: the case of the Campe dictionary

• Step 1: dealing with the presence of determiners

<form type="lemma">
  <form type="determiner">
    <orth>Das</orth>
  </form>
  <form type="headword">
    <orth>Aak</orth>
  </form>
</form>

<form>: the case of the Campe dictionary

• Step 2: adding grammatical information

<form type="lemma">
  <form type="determiner">
    <orth>Das</orth>
    <gramGrp>
      <pos value="D"/>
      <gen>n</gen>
    </gramGrp>
  </form>
  <form type="headword">
    <orth>Aak</orth>
    <gramGrp>
      <pos>N</pos>
      <gen>n</gen>
    </gramGrp>
  </form>
</form>

<form>: the case of the Campe dictionary

• Step 3: dealing with inflected forms

<form type="inflected">
  <form type="determiner">
    <orth>des</orth>
    <gramGrp>…</gramGrp>
  </form>
  <form type="headword">
    <orth><oVar><oRef/>-es</oVar></orth>
    <gramGrp>
      <case value="G">G</case>
    </gramGrp>
  </form>
</form>

Main arguments for the proposed changes

• Coherent use of <form> and <orth>
  – Gives coherent access to orthographic information in form/orth
• Coherent use of grammatical features
  – Danger of tag abuse with
    • <gram type="art_n">Das</gram>
  – ‘type’ attribute should indicate a grammatical feature
  – <gram> content should be the value of that feature
  – No differentiation of features (art_n -> pos + gen)
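As a hedged illustration of the last three points, the conflated art_n value can be split into one <gram> per feature, with type naming the feature and the element content giving its value. The value codes "D" and "n" follow the Campe examples; the exact spelling of the type values "pos" and "gen" is an assumption.

```xml
<!-- tag abuse: type mixes two features, content is the word form itself -->
<gram type="art_n">Das</gram>

<!-- one feature per <gram>: type names the feature, content is its value -->
<form type="determiner">
  <orth>Das</orth>
  <gramGrp>
    <gram type="pos">D</gram>
    <gram type="gen">n</gram>
  </gramGrp>
</form>
```

The word form moves to <orth>, so orthographic information stays reachable through form/orth.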

<sense>: main components

• Core elements
  – <def>: to provide the definition
  – <dicteg>
    • Need to establish guidelines on the identification of sources
  – <etym>: a complex issue…

Documenting examples

<dicteg>
  <cit>
    <q>Ta gamine est assise trop <oRef/>, elle ne dépasse pas de la table.</q>
    <biblStruct>
      <author>BENOIT M, MICHEL C.</author>
      <title>Le Parler de Metz et du pays messin</title>
      <imprint>
        <pubPlace>Metz</pubPlace>
        <publisher>Serpenoise</publisher>
        <date>2001</date>
        <biblScope>p. 38</biblScope>
      </imprint>
    </biblStruct>
  </cit>
</dicteg>

<dicteg>
  <q>Ta gamine est assise trop <oRef/>, elle ne dépasse pas de la table.</q>
</dicteg>

<dicteg>
  <cit>
    <q>Ta gamine est assise trop <oRef/>, elle ne dépasse pas de la table.</q>
    <bibl>Benoit M., Michel C., Le Parler de Metz...</bibl>
  </cit>
</dicteg>

A quick glimpse into Roma

• A journey in three steps
  – Adding the PD module and generating a schema
  – Checking out elements
  – Expressing constraints on specific values
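The three steps could result in an ODD customization roughly like the sketch below. It is not the schema produced in the session: the schema name tei_dict_minimal is hypothetical, the choice of deleted elements simply reflects the "limit ourselves to <entry>" position, and the closed value list for type on <form> is assumed from the values used in the Campe examples.

```xml
<schemaSpec ident="tei_dict_minimal" start="TEI">
  <!-- step 1: core infrastructure plus the print dictionary module -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="dictionaries"/>

  <!-- step 2: check out (remove) elements the project does not want -->
  <elementSpec ident="superEntry" module="dictionaries" mode="delete"/>
  <elementSpec ident="entryFree" module="dictionaries" mode="delete"/>

  <!-- step 3: constrain the values allowed for type on <form> -->
  <elementSpec ident="form" module="dictionaries" mode="change">
    <attList>
      <attDef ident="type" mode="change">
        <valList type="closed" mode="replace">
          <valItem ident="lemma"/>
          <valItem ident="inflected"/>
          <valItem ident="determiner"/>
          <valItem ident="headword"/>
        </valList>
      </attDef>
    </attList>
  </elementSpec>
</schemaSpec>
```

Starting from a small set of modules and adding or removing elements one at a time keeps the schema minimal, in line with the bottom-up approach to customization.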

Final discussion

• What does it mean to be TEI conformant?